What to Log, How Long to Keep It, and How to Reconstruct: Article 12 for Engineers
Recall our earlier example of the customer in Munich disputing an automated insurance assessment. The regulator asks your team to reconstruct the decision: what data was used, which model version produced it, what post-processing was applied, and whether a human reviewed the output. Your incident response team opens the logs. The model version field says "production". The retrieved context was not captured. The intermediate reasoning was never persisted. You cannot reconstruct the decision. You are not compliant with Article 12.
Article 12 of the EU AI Act is, by a comfortable margin, the most technically demanding obligation the regulation imposes on providers of high-risk AI systems. It does not ask you to write a policy document or appoint a governance committee. It asks you to design your system so that every decision it makes can be traced, reconstructed, and audited after the fact. Automatically. Throughout the system's entire lifecycle.
This is not a documentation exercise. It is an engineering requirement that touches your inference pipeline, your storage architecture, your data retention policies, and your deployment infrastructure. It sits at the intersection of observability, compliance, and cost management, and getting it wrong is not something you discover gracefully.
Here is the diagnostic test. Pick a decision ID from six months ago: any decision your system made in production. Can you reconstruct the exact model version that produced it? The parameter configuration? The full input, including any retrieved context? Every intermediate tool call, every post-processing transformation, every human intervention that modified the output? If the answer to any of those is no, you are not compliant with Article 12.
This post is part of a broader series on engineering compliance with the EU AI Act. For the full regulatory context and how logging fits into the wider compliance architecture, see Compliance-as-Architecture: An Engineering Leader's Guide to the EU AI Act.
What Article 12 actually requires
The text of Article 12 is deceptively short. It mandates that high-risk AI systems shall be designed and developed with capabilities enabling the automatic recording of events (logging) while the system is operating. These logging capabilities must conform to recognised standards or common specifications, and must enable the monitoring of the system's operation, facilitate post-market monitoring, and support the traceability of the system's functioning throughout its lifecycle.
The critical phrase is that logs must be adequate to enable the tracing back of the AI system's activity. This is not a suggestion to record error logs or request counts. It is a requirement that you capture enough state at decision time to reconstruct the full chain of causation from input to output, including every transformation in between.
Retention periods are prescribed by the Act, with minimums that vary depending on the system's risk classification and sector-specific requirements. In practice, the legal minimum is often insufficient for your own operational needs. Post-market monitoring, incident investigation, and model drift analysis all benefit from retention windows that exceed the statutory floor. The question is not whether you can afford to keep logs for longer; it is whether you can afford not to.
The Act also requires that logging is automatic. You cannot rely on manual annotation, developer discipline, or opt-in instrumentation. The logging infrastructure must be a structural feature of the system, not an afterthought bolted on during a compliance review.
What constitutes a "decision" in a modern AI system
The scope of what you must log depends entirely on the complexity of your system. For a straightforward classifier (say, a credit scoring model that takes a feature vector and returns a probability), the decision boundary is clean. You log the input features, the model version, the raw output probability, any threshold logic that converted the probability into an accept/reject decision, and the final outcome. This is well-understood territory.
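For the simple-classifier case, the decision record can be sketched in a few lines. The following TypeScript is illustrative only: the field names, the threshold, and the function shape are hypothetical, not a normative schema.

```typescript
import { randomUUID } from "node:crypto";

// Illustrative sketch of a decision record for a simple classifier.
// Field names are hypothetical examples, not a prescribed schema.
interface ClassifierDecisionRecord {
  decisionId: string;                     // unique, immutable identifier
  timestampUtc: string;                   // ISO 8601, millisecond precision
  modelVersion: string;                   // exact artefact hash, not "production"
  inputFeatures: Record<string, number>;  // features as received by the model
  rawScore: number;                       // model output before thresholding
  threshold: number;                      // cut-off applied at decision time
  outcome: "accept" | "reject";
}

function recordClassifierDecision(
  features: Record<string, number>,
  rawScore: number,
  threshold: number,
  modelVersion: string,
): ClassifierDecisionRecord {
  return {
    decisionId: randomUUID(),
    timestampUtc: new Date().toISOString(),
    modelVersion,
    inputFeatures: features,
    rawScore,
    threshold,
    outcome: rawScore >= threshold ? "accept" : "reject",
  };
}
```

Note that the threshold itself is logged: the raw probability alone does not explain the accept/reject outcome if the cut-off changes over time.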
For an LLM-based system, the picture changes materially. A single decision now involves the user's raw input, the system prompt (which may be versioned independently of the model), any few-shot examples injected into the context window, the model's raw output, and whatever post-processing logic you apply before the response reaches the user. If you are using retrieval-augmented generation, you must also capture the retrieval query, the documents returned, and their relevance scores. The "input" to the model is not what the user typed; it is the fully assembled prompt, which may be thousands of tokens long and include content from multiple sources.
For agentic and composite systems, the scope expands further still. A single user request may trigger an orchestrator that selects from multiple tools, executes them in sequence or parallel, feeds intermediate results back into the model, makes additional model calls based on those results, and finally aggregates everything into a response. Each of these steps is a link in the decision chain. If you log only the initial input and the final output, you have lost the ability to explain why the system did what it did. You have a black box with recorded endpoints and no visibility into the reasoning path.
The practical consequence is that your logging schema must expand to match your system's architecture. A simple model needs a simple log. An agentic system needs a decision graph. And the logging infrastructure must be designed to accommodate whichever level of complexity your system actually operates at, not the level you wish it operated at.
The logging schema
A compliant decision record needs to capture several categories of information. Not all of these apply to every system, but for any system of meaningful complexity, most of them will.
Decision identity and timing. Every decision must have a unique, immutable identifier: a decision ID that can be referenced in audits, incident reports, and user complaints. Timestamps must be in UTC with millisecond precision at minimum. If your system spans multiple services or regions, clock synchronisation matters. Even a few seconds of clock drift between services can make causal reconstruction ambiguous.
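In Node.js, both requirements are cheap to satisfy at decision time, since `Date.prototype.toISOString()` always serialises in UTC with millisecond precision and a trailing "Z". A minimal sketch:

```typescript
import { randomUUID } from "node:crypto";

// Sketch: mint the decision identity at the moment the decision is made.
// toISOString() emits e.g. "2025-01-15T09:30:00.123Z" (UTC, millisecond
// precision), which satisfies the minimum described above.
function mintDecisionIdentity(): { decisionId: string; timestampUtc: string } {
  return {
    decisionId: randomUUID(),             // unique, immutable reference
    timestampUtc: new Date().toISOString(),
  };
}
```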
Model identification. "GPT-4" is not a model version. You need the exact checkpoint, fine-tune identifier, or deployment hash. If you are using a third-party API, you need whatever version identifier the provider exposes, and you need to record it at call time, not infer it later. If the provider does not expose versioning with sufficient granularity, that is a compliance risk you must document and mitigate.
Parameter configuration. Temperature, top_p, max_tokens, stop sequences, and any other inference parameters that affect output. If you are using a system prompt, log a content-addressable hash of it, and store the full text in a versioned prompt registry that the hash can resolve against. The same applies to few-shot examples and any other injected context.
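The content-addressable hash pattern can be sketched as follows. The in-memory `Map` is a stand-in for a real versioned prompt registry; the `sha256:` prefix convention is an illustrative choice, not a requirement of the Act.

```typescript
import { createHash } from "node:crypto";

// Sketch: log a content-addressable hash of the system prompt, with the
// full text stored in a versioned prompt registry that the hash can
// resolve against. A Map stands in for the real registry here.
const promptRegistry = new Map<string, string>();

function registerPrompt(text: string): string {
  const hash =
    "sha256:" + createHash("sha256").update(text, "utf8").digest("hex");
  promptRegistry.set(hash, text);  // same text always yields the same hash
  return hash;                     // this value goes into the decision record
}

function resolvePrompt(hash: string): string | undefined {
  return promptRegistry.get(hash);
}
```

Because the hash is derived from the content, a prompt edit produces a new hash automatically: there is no way to silently change the prompt behind an existing decision record.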
Input capture. This means the raw user input as received, the enriched or preprocessed input after any transformations, and the full assembled prompt or feature vector sent to the model. For RAG systems, this includes the retrieval query, the retrieved documents (or their content-addressable hashes with the full documents stored separately), and relevance scores.
Output capture. The raw model output before any post-processing, the output after post-processing (filtering, formatting, safety checks), and the final user-facing response. If these three are identical, log that fact explicitly rather than omitting the intermediate stages.
Tool calls. For agentic systems, every tool invocation must be logged: the tool name, the parameters passed, the response received, the latency, and any error states. This applies to database queries, API calls, function executions, and any other external interaction the system initiates as part of its reasoning. The relationship between tool calls must be preserved; you need to know the order and the causal dependencies.
Human intervention. If a human reviewer overrode, modified, or approved the system's output, that must be recorded with an override flag, the reviewer's identifier, the reasoning provided, and both the original and overridden outputs. This intersects directly with the human oversight requirements discussed in In the Loop, On the Loop, Over the Loop: Designing Human Oversight for Production AI.
System context. Which feature flags were active at the time of the decision? Which A/B test variant was the user in? Which deployment environment served the request? These contextual factors can materially affect system behaviour, and without them, reconstruction may produce different results from the original execution.
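The categories above can be pulled together into a single record shape. The following TypeScript interface is a sketch, not a normative schema: every field name is illustrative, and the shape should be adapted to your system topology. The completeness check at the end is the kind of gate you might run in a reconstruction drill.

```typescript
// Sketch of a decision record covering the categories described above.
interface ToolCall {
  tool: string;
  parameters: unknown;
  response: unknown;
  latencyMs: number;
  error?: string;
  parentCallId?: string;            // preserves causal ordering
}

interface HumanIntervention {
  overridden: boolean;
  reviewerId: string;
  reasoning: string;
  originalOutput: string;
  finalOutput: string;
}

interface DecisionRecord {
  // identity and timing
  decisionId: string;
  timestampUtc: string;
  // model identification
  modelVersion: string;             // checkpoint or deployment hash
  // parameter configuration
  parameters: { temperature: number; topP: number; maxTokens: number };
  systemPromptHash: string;         // resolves against a prompt registry
  // input capture
  rawUserInput: string;
  assembledPromptHash: string;
  retrievedDocumentHashes: string[];
  // output capture
  rawModelOutput: string;
  postProcessedOutput: string;
  userFacingResponse: string;
  // tool calls and human oversight
  toolCalls: ToolCall[];
  humanIntervention?: HumanIntervention;
  // system context
  featureFlags: string[];
  abTestVariant?: string;
  deploymentEnvironment: string;
}

// Minimal completeness gate: without these fields, the decision
// cannot be reconstructed at all.
function hasReconstructionEssentials(r: Partial<DecisionRecord>): boolean {
  return Boolean(
    r.decisionId &&
      r.timestampUtc &&
      r.modelVersion &&
      r.systemPromptHash &&
      r.assembledPromptHash &&
      r.rawModelOutput,
  );
}
```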
The transient state problem
The most insidious challenge in decision logging is not what your system produces; it is what your system consumes from external sources that may not be available later.
Consider a RAG pipeline that retrieves documents from a search index. The index is updated daily. By the time an auditor asks you to reconstruct a decision from three months ago, the documents that were retrieved at execution time may have been modified, re-ranked, or removed entirely. You cannot re-run the retrieval query and expect the same results.
The same applies to external API calls. An agentic system that queries a third-party data source, checks a regulatory database, or calls a pricing engine is consuming ephemeral state. That state existed at a specific moment in time and may never be reproducible.
Chain-of-thought reasoning in LLM-based systems presents a similar problem. If your system uses intermediate reasoning steps that are generated and consumed within a single request lifecycle but not persisted, you have lost a critical part of the decision chain. Working memory in agentic architectures (the scratchpad, the plan, the intermediate evaluations) is exactly the kind of state that Article 12 requires you to preserve.
The solution is straightforward in principle and expensive in practice: capture all external inputs and intermediate state at execution time, as part of the decision record. Do not rely on the ability to re-derive them later. Every document retrieved, every API response received, every intermediate reasoning step generated must be logged or referenced by a content-addressable hash that resolves to an immutable store.
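The capture-at-execution-time pattern can be sketched as a small content-addressed store. The in-memory `Map` stands in for real write-once storage; the point is the access pattern, with reconstruction resolving hashes rather than re-running retrieval or re-calling APIs, and failing loudly when state is missing.

```typescript
import { createHash } from "node:crypto";

// Sketch: capture retrieved documents and external API responses at
// execution time in an immutable, content-addressed store; only the
// hashes live in the decision record itself.
const immutableStore = new Map<string, string>();

function captureState(content: string): string {
  const hash =
    "sha256:" + createHash("sha256").update(content, "utf8").digest("hex");
  if (!immutableStore.has(hash)) immutableStore.set(hash, content);
  return hash;   // reference stored in the decision record
}

function resolveState(hash: string): string {
  const content = immutableStore.get(hash);
  if (content === undefined) {
    throw new Error(`Cannot reconstruct decision: missing state ${hash}`);
  }
  return content;
}
```

At decision time, something like `retrievedDocs.map(captureState)` produces the hashes that go into the record; at audit time, reconstruction resolves those hashes against the store instead of re-querying a search index that has long since changed.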
The storage cost implications are real. For a high-volume system making hundreds of thousands of decisions per day, each with retrieved context and multiple tool calls, you are looking at terabytes of decision logs per month. But the alternative is not cheaper logging; the alternative is non-compliance.
Retention architecture
Once you accept that decision logs are voluminous and must be retained, the question becomes how to store them without bankrupting the organisation or creating an operational burden that defeats the purpose.
The Act prescribes minimum retention periods, but these are floors, not ceilings. Your post-market monitoring obligations, discussed in detail in Your Model Is in Production. Now What? Post-Market Monitoring Under Article 72, will often require access to historical decision data well beyond the statutory minimum. Trend analysis, drift detection, and incident investigation all depend on being able to query logs from months or years in the past.
A tiered storage architecture is the practical answer. Hot storage holds recent decision logs: the last 30 to 90 days, depending on your query patterns. These are fully indexed, queryable at low latency, and used for operational monitoring and debugging. Warm storage holds older logs that are searchable but not optimised for fast retrieval; think months to a year. Cold storage holds archived logs that are retrievable but may take minutes or hours to access; this is where the bulk of your long-term retention sits.
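The tier-routing logic itself is trivial; the work is in the storage backends. A sketch, where the 90-day and 365-day boundaries are illustrative choices matching the ranges above, not values prescribed by the Act:

```typescript
// Sketch: route a decision log to a storage tier by age.
// The boundaries are illustrative, not prescribed by the Act.
type StorageTier = "hot" | "warm" | "cold";

const DAY_MS = 24 * 60 * 60 * 1000;

function tierFor(timestampUtc: string, now: Date = new Date()): StorageTier {
  const ageDays = (now.getTime() - Date.parse(timestampUtc)) / DAY_MS;
  if (ageDays <= 90) return "hot";    // fully indexed, low latency
  if (ageDays <= 365) return "warm";  // searchable, slower retrieval
  return "cold";                      // archived, minutes-to-hours access
}
```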
For tamper evidence, consider WORM (Write Once Read Many) storage for your compliance-critical logs. An auditor needs assurance that the logs they are reviewing have not been modified since the decision was made. Append-only storage, cryptographic hashing of log entries, or blockchain-anchored timestamps are all approaches that provide varying degrees of integrity assurance. The right choice depends on your risk profile and the expectations of your notified body or market surveillance authority.
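The hash-chain approach (the one the open-sourced library described later in this post uses) works by making each entry's hash cover the previous entry's hash, so any retroactive edit breaks verification from that point onward. A minimal sketch:

```typescript
import { createHash } from "node:crypto";

// Sketch of a SHA-256 hash chain over log entries: each hash covers
// the entry content plus the previous hash, so editing any historical
// entry invalidates every hash after it.
interface ChainedEntry { content: string; prevHash: string; hash: string }

function sha256(s: string): string {
  return createHash("sha256").update(s, "utf8").digest("hex");
}

function appendEntry(chain: ChainedEntry[], content: string): void {
  const prevHash = chain.length ? chain[chain.length - 1].hash : "GENESIS";
  chain.push({ content, prevHash, hash: sha256(prevHash + content) });
}

function verifyChain(chain: ChainedEntry[]): boolean {
  let prev = "GENESIS";
  for (const e of chain) {
    if (e.prevHash !== prev || e.hash !== sha256(prev + e.content)) return false;
    prev = e.hash;
  }
  return true;
}
```

A hash chain proves integrity of what is stored, not that nothing was deleted from the tail; anchoring the latest hash in an external system (or WORM storage) closes that gap.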
The GDPR tension is unavoidable. Article 12 of the AI Act requires you to retain logs sufficient for decision reconstruction. The GDPR requires you to delete personal data when it is no longer necessary for the purpose for which it was collected. These obligations coexist, and the resolution is not to pick one over the other. You must design your logging schema so that personal data can be pseudonymised or redacted within the decision record without destroying the record's value for compliance reconstruction. This means separating PII into a distinct, independently deletable layer, with the decision record referencing it by pseudonymous identifier. When a deletion request arrives, you remove the PII mapping while preserving the structural integrity of the decision log.
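The separable PII layer can be sketched as follows. The `Map` stands in for a real, independently deletable PII store; the field names are illustrative. The key property is that erasure removes the mapping while the decision record, which carries only the pseudonymous ID, survives untouched.

```typescript
import { randomUUID } from "node:crypto";

// Sketch: personal data lives in a separate, independently deletable
// layer; the decision record carries only a pseudonymous subject ID.
interface SubjectPii { name: string; email: string }

const piiStore = new Map<string, SubjectPii>();

function pseudonymise(pii: SubjectPii): string {
  const subjectId = randomUUID();
  piiStore.set(subjectId, pii);
  return subjectId;      // only this ID appears in the decision record
}

function resolveSubject(subjectId: string): SubjectPii | undefined {
  return piiStore.get(subjectId);
}

// GDPR erasure: remove the mapping; the decision record stays intact.
function eraseSubject(subjectId: string): boolean {
  return piiStore.delete(subjectId);
}
```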
Latency and cost trade-offs
Every millisecond of logging latency is a millisecond added to your inference pipeline. For interactive systems where response time matters (chatbots, real-time decision engines, customer-facing applications), this is not a trivial concern.
Synchronous logging, where the decision record is durably written before the response is returned, is the safest approach from a compliance perspective. You never lose a record. But it adds latency to every single decision, and if your logging infrastructure experiences a hiccup, your production system stalls or errors.
Asynchronous logging, where the decision record is buffered and flushed in the background, is faster but introduces a window of vulnerability. If the process crashes between the decision and the flush, you lose the record. For most high-risk systems, this is not an acceptable gap.
The practical middle ground is a synchronous write to a local durable buffer (an append-only file or a local message queue), followed by asynchronous replication to your central logging infrastructure. The local buffer survives process restarts and provides a recovery point, while the asynchronous replication avoids coupling your inference latency to your central storage system's performance.
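A minimal sketch of this pattern in Node.js, with an append-only NDJSON file as the local buffer. The `replicate` function is a stub standing in for a real shipper to central storage; note that `appendFileSync` flushes to the OS but does not fsync, so a production implementation would need to decide how much durability the buffer must guarantee.

```typescript
import { appendFileSync, readFileSync, mkdtempSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Sketch: synchronous append to a local durable buffer on the decision
// path; replication to central storage happens off the critical path.
const bufferPath = join(
  mkdtempSync(join(tmpdir(), "decision-log-")),
  "buffer.ndjson",
);

function logDecision(record: Record<string, unknown>): void {
  // The record is on local disk before the response is returned.
  appendFileSync(bufferPath, JSON.stringify(record) + "\n");
  setImmediate(replicate);  // deferred: adds no inference latency
}

function replicate(): void {
  // In production: ship buffered lines to central storage, then trim.
}

function readBuffer(): Array<Record<string, unknown>> {
  return readFileSync(bufferPath, "utf8")
    .split("\n")
    .filter(Boolean)
    .map((line) => JSON.parse(line));
}
```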
For extremely high-volume systems where logging every decision is genuinely prohibitive, statistical sampling may be defensible. But this is a risk decision, not a cost optimisation. You must document the sampling methodology, the statistical guarantees it provides, the acceptance criteria you are applying, and the rationale for why full logging is not feasible. An auditor will scrutinise this justification, and "it was expensive" is unlikely to be sufficient on its own.
Audit reconstruction: the acid test
Everything described above serves a single purpose: when someone hands you a decision ID and asks you to explain exactly what happened, you can.
Reconstruction means producing the full decision chain for any historical decision. The model version, loaded from your model registry. The parameter configuration, resolved from the snapshot stored at decision time. The complete input, including retrieved context and assembled prompt. Every tool call and its response. Every post-processing step. Every human intervention. The final output as delivered to the user.
This requires several things to be true simultaneously. Your model artefacts must be version-controlled and retrievable; you must be able to load the exact model version that served a decision six months ago. Your parameter configurations must be immutable snapshots, not references to mutable configuration that may have changed since. Your input and output records must be complete, including all transient state captured at execution time. And your reconstruction tooling must be able to assemble all of these components into a coherent narrative without relying on any state that no longer exists.
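The shape of the reconstruction tooling can be sketched as follows. The registry contents and field names are illustrative stand-ins for a real model registry and prompt store; the essential property is that reconstruction resolves every reference and fails loudly the moment any referenced state cannot be found, rather than silently producing a partial narrative.

```typescript
// Sketch of reconstruction tooling: assemble the decision narrative
// from immutable components, failing loudly on any missing state.
interface StoredDecision {
  decisionId: string;
  modelVersion: string;
  promptHash: string;
  finalOutput: string;
}

// Stand-ins for a real model registry and versioned prompt store.
const modelRegistry = new Set<string>(["sha256:model-v1"]);
const promptStore = new Map<string, string>([
  ["sha256:prompt-1", "You are a claims assessor."],
]);

function reconstruct(d: StoredDecision) {
  if (!modelRegistry.has(d.modelVersion)) {
    throw new Error(`Model artefact missing: ${d.modelVersion}`);
  }
  const prompt = promptStore.get(d.promptHash);
  if (prompt === undefined) {
    throw new Error(`Prompt not resolvable: ${d.promptHash}`);
  }
  return { ...d, prompt };
}
```

A reconstruction drill is then simply running this against a sample of historical decision IDs and treating any thrown error as a compliance gap to be fixed.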
Build this capability now. Test it regularly. Run reconstruction drills against historical decisions and verify that the results match the original records. Do not wait for an audit or an incident to discover that your logging infrastructure has gaps. The time to learn that your decision records are incomplete is during a routine internal review, not when a market surveillance authority is asking questions about a specific decision that caused harm.
Open-source logging infrastructure
We have open-sourced the logging layer we use as the foundation for Article 12 compliance work with our clients: @systima/aiact-audit-log. It provides the schema described in this post (with every field mapped to the specific Article 12 paragraph it relates to), SHA-256 hash chains for tamper evidence, S3-compatible storage with retention enforcement, automatic capture middleware for the Vercel AI SDK, context propagation via AsyncLocalStorage, and CLI tooling for querying, verifying, and exporting logs.
It is MIT-licensed and provides the technical logging capability required by Article 12. It is not, on its own, sufficient for compliance. The gap between logging infrastructure and full compliance (risk management, human oversight design, monitoring procedures, technical documentation) is the gap that requires organisational decisions, not library features. We wrote about this boundary in detail in Open-Source Article 12 Logging Infrastructure For The EU AI Act.
Where to start
Logging infrastructure is typically the single largest engineering effort in an EU AI Act compliance programme. It touches your inference pipeline, your storage layer, your data governance policies, and your operational tooling. It is also the obligation that most directly determines whether you can satisfy the Act's transparency and traceability requirements in practice, not just on paper.
If you are assessing your current logging capabilities against Article 12, or designing a compliant logging architecture for a new high-risk system, Systima's AI Governance and Compliance practice works with engineering teams to design and implement compliant logging infrastructure. That means a decision trace schema tailored to your system topology, a retention architecture that balances compliance with cost, reconstruction tooling that you can demonstrate to an auditor, and a migration path that does not require halting production.