Continuous Conformity: Engineering Evidence for Orchestrated AI Systems

Systima · 18 May 2026 · 12 min read

A recent piece in EU Law Live, "The Hidden Layer: The Structural Gap in the AI Act's Governance of Agentic Systems", identifies a structural mismatch between the EU AI Act's provider/deployer framework and the orchestration layer that increasingly determines how agentic AI systems behave. Its most operationally provocative line concerns conformity assessment: "A conformity assessment conducted on Tuesday may tell us little about what the system does on Wednesday." That observation closes one argument and opens another. If the static-conformity model fits orchestrated systems poorly, what does a workable alternative look like in engineering terms?

This applied research note develops one answer. It is not a claim that the legal question is settled. It is a proposal for how regulated organisations running orchestrated AI systems should redesign their evidence pipeline so that conformity becomes a continuous property of the system rather than a milestone in its history.

The static-conformity assumption

The AI Act presupposes a particular shape of regulated artefact. Article 16 obliges providers of high-risk AI systems to ensure compliance with Articles 8 to 15 before placing the system on the market. Article 17 obliges providers to operate a quality management system that produces documentation, version control, and post-market surveillance. Article 43 specifies that conformity assessment is conducted prior to placing the system on the market, and Article 43(4) requires a new assessment after a substantial modification, the same concept that Article 25 uses to reallocate provider obligations along the value chain.

The model behind these provisions is recognisable from product safety law. The artefact is designed, documented, assessed, and then released. Post-market monitoring runs in the background; conformity remains presumptively intact until a substantial modification triggers a fresh assessment.

This works for products that have a stable identity over time. A medical device approved on Tuesday is materially the same device on Wednesday. The assertion that yesterday's assessment applies to today's behaviour is, in most cases, defensible.

Orchestrated agentic systems break this assertion. Not necessarily through any single dramatic change, but through the accumulation of routine runtime decisions that change what the system does without changing what the documentation says it does.

What reconstitutes between Tuesday and Wednesday

It is worth being concrete about what changes. In an agentic system with even a modest orchestration layer, the following can shift between any two queries:

Component	What can change at runtime	Documented at design time?
Tool catalogue	New tools onboarded; old tools deprecated; tool versions bumped	Sometimes partially
Tool discovery	Tools encountered via capability negotiation (e.g. MCP) that did not exist at design time	Rarely
Retrieval indexes	New documents indexed; embeddings re-computed; sources added or revoked	Rarely versioned
Prompt and policy text	Iterated by product and safety teams without formal release process	Inconsistently
Model versions	Provider-side updates; fine-tuning rounds; deployment of new checkpoints	Tracked, but rarely treated as conformity-relevant
Memory state	User-, task-, and organisation-scoped state that influences future behaviour	Almost never
Routing graphs	Decision logic about which sub-agent handles which task	Documented as code, rarely as compliance artefact
Jurisdictional path	Which jurisdictions data traverses on the way to a result	Generally invisible

The article distinguishes two forms of runtime discretion that affect this list. Routing discretion selects components from a known catalogue; compositional discretion discovers and incorporates components that were unknown at design time. Both reshape system behaviour without crossing the line that triggers Article 25 in any obvious way.

The cumulative effect is that the system's behavioural identity drifts away from the artefact described in the conformity assessment, even when no individual change is large enough to constitute a "substantial modification" within the meaning of Article 25. The assessment is not falsified. It just stops describing the system that is actually running.
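One way to make this drift visible is to fingerprint the runtime configuration and compare it with the snapshot that was assessed. The following is a minimal Python sketch rather than a prescribed implementation; the component names, version strings, and the `fingerprint` and `changed_components` helpers are illustrative assumptions, not anything defined by the Act or the source article.

```python
import hashlib
import json


def fingerprint(config: dict) -> str:
    """Return a stable SHA-256 digest of a runtime configuration snapshot."""
    # Sorting keys makes the digest independent of dictionary ordering.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_components(assessed: dict, running: dict) -> list[str]:
    """List the top-level components that differ between two snapshots."""
    keys = sorted(set(assessed) | set(running))
    return [k for k in keys if assessed.get(k) != running.get(k)]


# Hypothetical snapshots: what was assessed versus what is running now.
assessed = {
    "tools": {"policy_lookup": "1.2.0"},
    "retrieval_index": "build-001",
    "policy_text": "v7",
    "model": "provider-model-a",
}
running = {
    "tools": {"policy_lookup": "1.2.0", "claims_history": "0.9.1"},
    "retrieval_index": "build-004",
    "policy_text": "v9",
    "model": "provider-model-a",
}

drifted = fingerprint(assessed) != fingerprint(running)
print(drifted, changed_components(assessed, running))
```

A fingerprint mismatch says only that something changed; whether the change matters is a separate question that the comparison cannot answer on its own.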

From milestone to stream

The proposal is straightforward to state and harder to implement: conformity should be modelled as a property the system continuously demonstrates rather than a status conferred on it at a fixed point in time.

This is not novel in regulated industries. The shift from milestone-based to continuous assurance has happened, or is happening, in several adjacent regimes. The Digital Operational Resilience Act requires financial entities to conduct threat-led penetration testing on critical systems, not as one-off audits but as a continuous testing programme. The Medical Device Regulation requires post-market clinical follow-up that updates the clinical evaluation throughout the device's life. PSD2 strong customer authentication monitoring is continuous by design, not periodic.

What changes for orchestrated AI systems is that the artefact under assurance is no longer stable enough for assurance-by-snapshot to be defensible. The natural unit of conformity stops being "the system as it was on Tuesday" and starts being "the system as it has been operating across this interval, evidenced by these records". The conformity assessment does not disappear; it changes shape, becoming the validation of an evidence-emitting capability rather than a one-time judgement about a fixed artefact.

A regulated organisation operating an orchestrated AI system should aim for the position where, at any moment, it could answer a regulator's question of the form "what was this system actually doing last quarter?" with a query against a structured evidence stream rather than a reconstruction effort. That is the standard the static model implicitly promised but cannot, for orchestrated systems, deliver.

The evidence stream

The shift to continuous conformity makes most sense when expressed as a specification of what the system emits, on what triggers, with what retention, and what each emission evidences.

Artefact	Trigger	Retention floor	Evidences
Tool registry snapshot	On every change to the registry	5 years	Article 11 documentation currency; Article 25 substantial-modification basis
Tool invocation log	Every tool call	Article 19 floor of 6 months; 5 years for high-risk	Article 12 record-keeping; Article 14 oversight effectiveness
Capability-verification record	First invocation of any discovered tool	5 years	Alignment with the Article 40 standards recommended by the source article
Routing decision log	Every consequential routing decision	6 months minimum	Article 12; rationale reconstruction
Retrieval provenance trace	Every retrieval that informs an output	Aligned with retention of the output	Article 13 transparency; Article 15 accuracy basis
Policy version manifest	Every policy change	Indefinite	Quality management evidence; Article 17
Geographic-routing record	When data crosses a jurisdictional boundary	5 years	Article 9 risk management; GDPR Articles 44 to 49; Agentic Tool Sovereignty
Human approval and override log	Every escalation event	5 years	Article 14 human oversight
Drift signal	Continuous monitoring threshold breach	1 year	Article 72 post-market monitoring; Article 25 trigger evidence
Re-evaluation record	Scheduled or signal-driven re-evaluation	5 years	Continued conformity demonstration

Two observations about this specification are worth dwelling on. First, the artefacts are not new in concept; most regulated AI deployments already produce some of them. What changes is the requirement that they form a coherent stream, queryable as a whole, rather than a set of disconnected logs scattered across systems. Second, the retention floors above are conservative; the actual figure depends on the risk class of the system and on whether the deployment falls under additional regimes (GDPR, sectoral retention rules, contractual obligations).
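To illustrate what treating the specification as data might look like, the sketch below encodes a few rows of the table and computes the earliest date a record could be deleted. It is a simplified example under stated assumptions: the artefact keys, the `ArtefactSpec` type, and the approximation of six months as 183 days and a year as 365 days are all illustrative choices.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class ArtefactSpec:
    trigger: str
    retention_days: Optional[int]  # None means "retain indefinitely"


# A subset of the table above; day counts are approximations (assumption).
SPEC = {
    "tool_registry_snapshot": ArtefactSpec("registry change", 5 * 365),
    "routing_decision_log": ArtefactSpec("consequential routing decision", 183),
    "policy_version_manifest": ArtefactSpec("policy change", None),
    "drift_signal": ArtefactSpec("monitoring threshold breach", 365),
}


def earliest_deletion(artefact: str, emitted_at: datetime) -> Optional[datetime]:
    """Earliest moment a record may be deleted under its floor, or None if never."""
    days = SPEC[artefact].retention_days
    return None if days is None else emitted_at + timedelta(days=days)


emitted = datetime(2026, 1, 1, tzinfo=timezone.utc)
for name in SPEC:
    print(name, earliest_deletion(name, emitted))
```

Keeping the floors in one structure means a change in retention policy is itself a versioned change rather than an edit scattered across logging configurations.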

The most consequential of these artefacts in regulatory terms are likely to be the capability-verification record and the geographic-routing record. The first responds directly to the article's call for capability-verification standards under Article 40. The second is the engineering counterpart of what the article terms Agentic Tool Sovereignty: the jurisdictional dimension of orchestration decisions that the existing framework does not capture but that regulated organisations cannot ignore.
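As a rough illustration of the capability-verification record, the sketch below emits one record the first time a discovered tool is invoked. What verification should actually involve is not yet standardised; comparing declared capabilities with an allow-list is an assumption made here for the example, as are the function name, record fields, and capability strings.

```python
from datetime import datetime, timezone

verified_tools: set[str] = set()   # tools that already have a verification record
evidence_stream: list[dict] = []    # stand-in for the real evidence store


def invoke_tool(name: str, declared: list[str], allowed: set[str]) -> bool:
    """Record a capability-verification result the first time a tool is seen.

    Returns True if the call may proceed, False if the tool declares a
    capability outside the allowed set.
    """
    outside_policy = sorted(set(declared) - allowed)
    if name not in verified_tools:
        evidence_stream.append({
            "artefact": "capability_verification",
            "tool": name,
            "declared": sorted(declared),
            "outside_policy": outside_policy,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        verified_tools.add(name)
    return not outside_policy


ALLOWED = {"read:claims", "read:policy"}  # hypothetical organisational allow-list
print(invoke_tool("claims_history", ["read:claims"], ALLOWED))
print(invoke_tool("payout", ["write:payments"], ALLOWED))
print(len(evidence_stream))
```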

Drift detection and the material-change question

The continuous-evidence stream is not, by itself, a complete account. It records what the system did. It does not, on its own, answer the question that Article 25 actually asks: has the system changed enough to require a new conformity assessment?

This is the hardest part of the proposal, and the part where legal interpretation remains genuinely open. The article notes the dilemma: a literal reading of "substantial modification" risks rendering the Act unworkable for systems that discover new capabilities daily; the alternative reading leaves compositional discretion outside the provision's reach entirely. The Commission has not yet issued Article 96 guidance on this point. Until it does, regulated organisations must take a position.

A continuous-conformity approach makes the position defensible rather than settled. It treats Article 25 less as a discrete trigger and more as a threshold question that the evidence stream is designed to surface. The engineering implementation has three parts.

First, drift signals. The system continuously computes deltas against the documented baseline: how many tools have been added since the last assessment; how much of the retrieval index has been replaced; how often runtime policies have been overridden; how much routing behaviour has shifted, measured against representative scenarios. Each signal has a threshold below which the system is presumed conformant and above which a re-evaluation is triggered. The thresholds are not derived from the legal text. They are organisational policy, justified to the regulator if asked.
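The four deltas described above can be sketched as follows. This is an illustrative example only: the signal names, the input shape, and every threshold value are hypothetical organisational choices, not figures taken from the Act.

```python
# Organisational policy, not derived from the legal text (hypothetical values).
THRESHOLDS = {
    "tools_added": 2,
    "index_replaced": 0.4,
    "override_rate": 0.05,
    "routing_shift": 0.3,
}


def drift_signals(baseline: dict, current: dict) -> dict:
    """Compute deltas between the assessed baseline and the running system."""
    base_docs, cur_docs = set(baseline["index_docs"]), set(current["index_docs"])
    replaced = 1 - len(base_docs & cur_docs) / len(base_docs) if base_docs else 0.0
    scenarios = baseline["routing"]  # representative scenario -> assessed route
    moved = sum(1 for s, route in scenarios.items() if current["routing"].get(s) != route)
    return {
        "tools_added": len(set(current["tools"]) - set(baseline["tools"])),
        "index_replaced": replaced,
        "override_rate": current["overrides"] / max(current["decisions"], 1),
        "routing_shift": moved / max(len(scenarios), 1),
    }


def breached(signals: dict) -> list[str]:
    """Names of signals above threshold, i.e. those that trigger re-evaluation."""
    return [name for name, value in signals.items() if value > THRESHOLDS[name]]


baseline = {
    "tools": ["policy_lookup", "summariser"],
    "index_docs": ["d1", "d2", "d3", "d4"],
    "routing": {"s1": "auto", "s2": "human", "s3": "auto", "s4": "human"},
}
current = {
    "tools": ["policy_lookup", "summariser", "claims_history"],
    "index_docs": ["d1", "d2", "d5", "d6"],
    "routing": {"s1": "auto", "s2": "human", "s3": "human", "s4": "human"},
    "overrides": 3,
    "decisions": 100,
}

signals = drift_signals(baseline, current)
print(signals, breached(signals))
```

In this toy data only the retrieval-index signal crosses its threshold; the other three shifts remain visible in the output without triggering anything.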

Second, evaluation harnesses. When a threshold trips, the system runs a structured evaluation: regression scenarios drawn from the original conformity assessment, fairness and accuracy tests against demographic baselines, behavioural fingerprints for newly onboarded tools, end-to-end scenarios that exercise routing and compositional decisions. The output is a re-evaluation record entered into the stream.
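A minimal sketch of such a harness is shown below, limited to regression scenarios. The toy `triage` function stands in for the real system, and the scenario set, record fields, and trigger label are hypothetical; fairness tests and tool fingerprints would need their own runners.

```python
from typing import Callable

Scenario = tuple[str, str, str]  # (scenario id, input text, expected outcome)


def run_harness(system: Callable[[str], str], scenarios: list[Scenario], trigger: str) -> dict:
    """Run regression scenarios and return a re-evaluation record for the stream."""
    failures = [sid for sid, text, expected in scenarios if system(text) != expected]
    total = len(scenarios)
    return {
        "artefact": "re_evaluation_record",
        "trigger": trigger,
        "scenarios_run": total,
        "failures": failures,
        "pass_rate": (total - len(failures)) / total if total else 1.0,
    }


def triage(text: str) -> str:
    """Toy stand-in for the deployed system (assumption for illustration)."""
    return "auto" if "windscreen" in text else "human"


SCENARIOS: list[Scenario] = [
    ("s1", "cracked windscreen", "auto"),
    ("s2", "house fire, total loss", "human"),
    ("s3", "windscreen damage with bodily injury", "human"),
]

record = run_harness(triage, SCENARIOS, trigger="index_replaced")
print(record)
```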

Third, classification. The re-evaluation record either confirms that the system remains within the envelope of its original conformity assessment, or it does not. If it does not, the organisation has either a documented material change (with a fresh Article 25 assessment to perform) or a documented decision that the change is not material (with the reasoning preserved). Either way, the regulator is not left to reconstruct the question from logs.
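The three possible outcomes can be made explicit in code. The sketch below is one possible shape, assuming a re-evaluation record that carries a `pass_rate` field and a minimum pass rate set by organisational policy; both are illustrative assumptions, and the materiality decision itself remains a human judgement that the code only records.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    WITHIN_ENVELOPE = "within_envelope"
    MATERIAL_CHANGE = "material_change"
    NOT_MATERIAL = "documented_not_material"


def classify(record: dict, min_pass_rate: float, rationale: Optional[str] = None) -> dict:
    """Classify a re-evaluation record and preserve the reasoning alongside it.

    `rationale` is the reviewer's written reasoning for treating an
    out-of-envelope result as not material; without it the change is
    treated as material.
    """
    if record["pass_rate"] >= min_pass_rate:
        outcome = Outcome.WITHIN_ENVELOPE
    elif rationale:
        outcome = Outcome.NOT_MATERIAL
    else:
        outcome = Outcome.MATERIAL_CHANGE
    return {"outcome": outcome.value, "rationale": rationale, "evidence": record}


print(classify({"pass_rate": 0.98}, min_pass_rate=0.95)["outcome"])
print(classify({"pass_rate": 0.80}, min_pass_rate=0.95)["outcome"])
print(classify({"pass_rate": 0.80}, 0.95, "Failures confined to a retired product line")["outcome"])
```

Defaulting to "material" when no rationale is supplied is a deliberate design choice in this sketch: the less burdensome classification has to be argued for in writing.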

This does not solve the interpretive problem the article identifies. It does provide an audit trail that allows the interpretive question to be addressed seriously when it arises, rather than dismissed for lack of evidence.

A worked illustration: agentic claims triage

Consider a regulated insurer operating an agentic system that triages incoming claims. The system retrieves the customer's policy documents, summarises the claim, classifies it against a set of internal categories, and either resolves low-complexity claims automatically or routes them to a human handler with a recommended outcome. The system uses an orchestration framework with a registered tool catalogue and a retrieval index over policy documents and claims history.

In the static-conformity model, the insurer conducts a conformity assessment before deployment. The assessment establishes that the system performs adequately across a representative scenario set, that human oversight is meaningful, that the routing logic is documented, that the retrieval sources are appropriate. After deployment, the system continues to operate against that assessment.

Within six months, the situation has drifted. The retrieval index has been re-built three times to incorporate updated policy wordings. A new claims-history tool has been onboarded by the platform team. The routing policy was adjusted twice to handle a class of edge cases. The model provider has issued a minor update to the underlying generative model. None of these changes is a substantial modification in the textbook sense. None individually warrants a fresh conformity assessment. Collectively, the system that is triaging claims in month seven is recognisably different from the system that was assessed.

In the continuous-conformity model, the same drift is observable in the evidence stream. The tool registry shows the new tool and its capability-verification record. The retrieval index version history shows three rebuilds, with corresponding evaluation outputs at each. The policy manifest records both adjustments and the rationale. The drift signal for routing behaviour shows a measurable shift, below the threshold that would trigger re-evaluation but visible. The model version manifest records the provider update.

At the six-month review point, the insurer's compliance function does not reconstruct any of this from raw logs. It queries the stream. The question "what has changed since the last assessment, and is any of it material" has a documented answer, with the underlying records available to evidence it. The insurer's regulator, if it asks, does not receive a narrative; it receives a query result.
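What "querying the stream" could look like for this insurer is sketched below, using an in-memory list in place of a real evidence store. The record fields, artefact names, and dates are invented for the illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical evidence records for the insurer described above.
stream = [
    {"artefact": "policy_version_manifest", "on": date(2025, 12, 1), "detail": "initial routing policy"},
    {"artefact": "retrieval_index_version", "on": date(2026, 2, 10), "detail": "rebuild 1"},
    {"artefact": "tool_registry_snapshot", "on": date(2026, 3, 4), "detail": "claims_history onboarded"},
    {"artefact": "retrieval_index_version", "on": date(2026, 4, 2), "detail": "rebuild 2"},
    {"artefact": "policy_version_manifest", "on": date(2026, 4, 20), "detail": "edge-case adjustment 1"},
    {"artefact": "model_version_manifest", "on": date(2026, 5, 5), "detail": "provider minor update"},
    {"artefact": "policy_version_manifest", "on": date(2026, 6, 1), "detail": "edge-case adjustment 2"},
    {"artefact": "retrieval_index_version", "on": date(2026, 6, 15), "detail": "rebuild 3"},
]


def changes_since(records: list[dict], since: date) -> dict[str, list[dict]]:
    """Group every record emitted after `since` by artefact type, oldest first."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for rec in sorted(records, key=lambda r: r["on"]):
        if rec["on"] > since:
            grouped[rec["artefact"]].append(rec)
    return dict(grouped)


last_assessment = date(2026, 1, 15)
for artefact, recs in changes_since(stream, last_assessment).items():
    print(artefact, [r["detail"] for r in recs])
```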

The position the insurer takes on materiality is not necessarily right in any final sense. The legal interpretation of Article 25 in this scenario is not settled. What the continuous-conformity model provides is a defensible position rather than a hopeful one.

Where this leaves the open legal questions

Continuous conformity is an engineering response to an interpretive problem. It does not resolve the interpretive problem. Several questions remain open and will, in time, need either Commission guidance under Article 96 or judicial clarification.

Whether the cumulative drift described above can constitute a substantial modification under Article 25, even when no individual change crosses the threshold, is unresolved. The natural reading of the provision focuses on discrete events. An accumulation-based reading would change the practical compliance burden materially.
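The practical difference between the two readings can be shown with a toy calculation. The sketch assumes each change has been given a numeric materiality score and that scores can simply be summed; both are strong simplifications adopted only to make the contrast concrete, and neither has any basis in the legal text.

```python
def discrete_reading(scores: list[int], threshold: int) -> bool:
    """Substantial modification only if some single change crosses the threshold."""
    return any(score > threshold for score in scores)


def cumulative_reading(scores: list[int], threshold: int) -> bool:
    """Substantial modification if the accumulated changes cross the threshold."""
    return sum(scores) > threshold


# Hypothetical scores for four routine changes since the last assessment.
scores = [10, 15, 10, 20]
THRESHOLD = 50

print(discrete_reading(scores, THRESHOLD), cumulative_reading(scores, THRESHOLD))
```

The same four changes produce opposite answers under the two readings, which is why the choice between them changes the compliance burden.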

Whether the capability-verification protocol contemplated by the article (and operationalised in the evidence stream above) maps onto any specific harmonised standard under Article 40 remains to be seen. The first AI Act harmonised standards have only recently entered public enquiry. Engineering decisions made now are made against an incomplete standards picture.

Whether the geographic-routing record constitutes evidence of compliance with GDPR Articles 44 to 49 in addition to its role as Agentic Tool Sovereignty evidence under the AI Act is a question that crosses regulatory regimes and that few organisations are currently positioned to answer cleanly.

These are not reasons to defer building the evidence infrastructure. They are reasons to build it in a way that allows the answers, when they arrive, to be applied without retrofitting.

Where Systima fits

The technical substrate for this work already exists in part. The @systima/aiact-audit-log package implements structured, tamper-evident logging for Vercel AI SDK applications with the retention, hashing, and reconstruction primitives that the evidence stream above presupposes. Its forthcoming companion, @systima/aiact-docs, generates Annex IV technical documentation that updates as the underlying codebase changes, which is the documentation side of the same problem.
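For readers unfamiliar with the term, tamper-evident logging is commonly built on a hash chain, in which each entry commits to the one before it. The Python sketch below shows that general technique only; it is not the API or internal design of @systima/aiact-audit-log, which this note does not describe, and the `append` and `verify` helpers are hypothetical names.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev: str, payload: dict) -> str:
    body = json.dumps({"prev": prev, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(chain: list[dict], payload: dict) -> dict:
    """Append an entry whose hash covers both its payload and the previous hash."""
    prev = chain[-1]["hash"] if chain else GENESIS
    entry = {"prev": prev, "payload": payload, "hash": _digest(prev, payload)}
    chain.append(entry)
    return entry


def verify(chain: list[dict]) -> bool:
    """Recompute every hash in order; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


log: list[dict] = []
append(log, {"event": "tool_call", "tool": "policy_lookup"})
append(log, {"event": "routing_decision", "route": "human"})
print(verify(log))

log[0]["payload"]["tool"] = "something_else"  # simulate after-the-fact tampering
print(verify(log))
```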

Systima's consulting practice operates around the continuous-conformity model rather than the static one. Organisations engage us when their orchestration layer has drifted ahead of their compliance documentation; the work is to rebuild the evidence pipeline so that the orchestration layer governs itself, with the compliance function querying the stream rather than reconstructing it.

How to cite this work

The citation block at the end of this article provides APA, BibTeX, and OSCOLA formats for both this applied research note and the source article. Citations of the underlying argument should reference the source article; citations of the continuous-conformity model proposed here should reference both.
