Your Model Is in Production. Now What? Post-Market Monitoring Under Article 72
Most engineering teams treat deployment as the finish line. The model is trained, evaluated, integrated, tested, shipped. The champagne is metaphorical but the relief is real. Article 72 of the EU AI Act reframes deployment as a starting point.
The Act requires providers of high-risk AI systems to establish and document a post-market monitoring system. This is not advisory language. It is a legal obligation with specific requirements for what you track, how you respond to what you find, and when you report to regulatory authorities. The monitoring system must be proportionate to the nature of the AI system and the severity of its risks, and it must be in place before the system enters the market.
If you have already worked through Compliance-as-Architecture: An Engineering Leader's Guide to the EU AI Act, you will recognise a pattern: the Act does not prescribe specific technologies or tools, but it does prescribe specific obligations. Monitoring is one of them, and it is non-negotiable.
What Article 72 actually requires
The requirements are more structured than most teams assume. Article 72 mandates a documented post-market monitoring system that actively and systematically collects and analyses data on the performance of the AI system throughout its entire operational lifetime. "Throughout its entire operational lifetime" is doing significant work in that sentence. This is not a quarterly review cadence. It is continuous.
The monitoring system must be established before the system is placed on the market or put into service. Retrospectively bolting monitoring onto a production system does not satisfy the requirement; you need to demonstrate that the monitoring architecture was designed as part of the system, not appended after the fact.
Critically, monitoring data must feed back into the risk management system required by Article 9. If your monitoring detects degraded performance in a specific use case, that finding must update your risk register. If your risk register changes, your conformity assessment may need to be revisited. This creates a continuous feedback loop between production behaviour and regulatory documentation.
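The feedback loop can be made concrete in code. A minimal sketch, with an invented risk-register shape and field names (nothing here is mandated by the Act): a monitoring finding is folded into the register, and the function reports whether the affected entry actually changed, which is the signal that the conformity assessment may need revisiting.

```python
def apply_finding(risk_register: dict, finding: dict) -> bool:
    """Fold a monitoring finding into the risk register (illustrative schema).

    Returns True if the affected entry's severity changed, i.e. the
    conformity assessment may need to be revisited.
    """
    entry = risk_register.setdefault(
        finding["risk_id"], {"severity": "low", "evidence": []}
    )
    entry["evidence"].append(finding["summary"])
    changed = finding["severity"] != entry["severity"]
    if changed:
        entry["severity"] = finding["severity"]
    return changed
```

The point of the boolean return is traceability: "monitoring ran and nothing moved" is a different, and equally documentable, outcome from "monitoring changed the risk picture".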
For high-risk systems, the post-market monitoring plan must form part of the technical documentation submitted as part of the conformity assessment. This means auditors and notified bodies will review your monitoring design. It is not an internal process you can define informally.
What you probably already have
If you are running ML systems in production with any degree of seriousness, you already have infrastructure that overlaps with Article 72's requirements. Model performance dashboards tracking accuracy, precision, recall, or domain-specific quality metrics. Alerting pipelines that fire when key metrics degrade beyond defined thresholds. Latency monitoring and error rate tracking integrated into your observability stack. Incident response processes, even if they are oriented toward system availability rather than model behaviour. Possibly A/B testing infrastructure that lets you compare model versions against baseline performance.
This is not nothing. It is, in many cases, the majority of the technical work.
The gap is rarely in the infrastructure itself. It is in the documentation chain. Most monitoring systems are oriented toward product quality and operational reliability, not regulatory compliance. The difference is not what you measure; it is whether the measurements are linked to a risk register, whether anomalies trigger a defined classification process, whether corrective actions are documented end-to-end, and whether the entire system is described in your technical documentation. The telemetry exists. The governance wrapper usually does not.
What you probably do not have
The gaps tend to cluster in five areas, and they are the areas that matter most for compliance.
Formal incident classification thresholds. At what point does a performance degradation become a serious incident under the Act? Most teams handle this ad hoc; a senior engineer makes a judgement call during an incident. The Act requires this to be defined in advance. You need documented thresholds that distinguish routine degradation from events that trigger regulatory reporting obligations. These thresholds must be justifiable and linked to the risk assessment, not arbitrary round numbers.
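What "defined in advance" looks like in practice is a threshold registry rather than a judgement call. A minimal sketch, with hypothetical metric names, limits, and risk-register IDs (the specific values would come from your own risk assessment, not from here):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IncidentThreshold:
    """A pre-defined, justified threshold linked to a risk register entry."""
    metric: str
    limit: float            # degradation at or beyond this value triggers the severity
    severity: str           # "routine" | "investigate" | "serious"
    risk_register_id: str   # traceability back to the Article 9 risk assessment
    justification: str

# Illustrative values only -- real thresholds must be derived from your risk assessment.
THRESHOLDS = [
    IncidentThreshold("accuracy_drop_pct", 2.0, "investigate", "RR-014",
                      "Exceeds tolerance band validated during conformity evaluation"),
    IncidentThreshold("accuracy_drop_pct", 5.0, "serious", "RR-014",
                      "Level at which misclassification risk to users becomes material"),
]

def classify(metric: str, observed: float) -> str:
    """Return the highest pre-defined severity the observed degradation reaches."""
    order = {"routine": 0, "investigate": 1, "serious": 2}
    severity = "routine"
    for t in THRESHOLDS:
        if t.metric == metric and observed >= t.limit and order[t.severity] > order[severity]:
            severity = t.severity
    return severity
```

The justification and risk-register fields are the compliance-relevant part: they make each threshold defensible to an auditor rather than an arbitrary round number.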
Corrective action documentation chains. When you detect an issue and fix it, is the full chain from detection to diagnosis to correction to verification recorded? Is it linked to the risk register entry it relates to? Most engineering teams fix problems quickly and thoroughly, but the documentation trail is often fragmented across Slack threads, pull request descriptions, and incident retrospectives. The Act requires a coherent, traceable chain.
Regulatory reporting triggers. Serious incidents must be reported to market surveillance authorities. Do you have a defined process for determining when an incident crosses the reporting threshold? Do you know which authority to report to, in what timeframe, and in what format? This is operational process design, not engineering, and it is frequently absent.
Feedback loop to risk management. Monitoring findings must update the risk register. This is explicitly required. If your monitoring detects a new failure mode, that failure mode must be assessed, documented, and addressed within your risk management system. Many teams maintain risk registers as static documents produced during initial compliance work. The Act treats them as living artefacts.
Bias drift monitoring. Aggregate performance metrics can remain stable while performance degrades for specific demographic groups. Without stratified monitoring that tracks outcomes across relevant subgroups, you will not detect distributional bias that emerges or worsens over time. This connects directly to the tension between bias testing and data protection explored in You Need Demographic Data to Prove You're Not Biased: The GDPR-AI Act Tension. The monitoring obligation makes ongoing bias detection a continuous requirement, not a one-time evaluation.
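Stratified monitoring is straightforward to sketch. The example below (invented function names, an illustrative 5-point gap threshold) computes per-subgroup accuracy and flags any group that trails the aggregate by more than a chosen margin, which is exactly the signal a stable aggregate metric can hide:

```python
from collections import defaultdict

def stratified_accuracy(records):
    """records: list of (subgroup, correct: bool). Returns accuracy per subgroup."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, correct in records:
        totals[group] += 1
        hits[group] += int(correct)
    return {g: hits[g] / totals[g] for g in totals}

def bias_drift_flags(records, max_gap=0.05):
    """Flag subgroups whose accuracy trails the aggregate by more than max_gap.

    The 0.05 default is purely illustrative; the acceptable gap must come
    from your own risk assessment.
    """
    per_group = stratified_accuracy(records)
    overall = sum(correct for _, correct in records) / len(records)
    return {g: acc for g, acc in per_group.items() if overall - acc > max_gap}
```

In the flagged output, the aggregate can look perfectly healthy while one subgroup sits well below it; the flag, not the aggregate, is what should feed the risk register.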
Designing the monitoring architecture
The monitoring architecture for Article 72 compliance extends beyond standard ML observability. It needs to cover six dimensions, each with defined thresholds and response procedures.
Accuracy drift. Statistical monitoring of output quality against the baseline evaluation benchmarks established during conformity assessment. This requires maintaining those benchmarks as versioned, immutable reference points. Define two tiers of drift thresholds: a lower threshold that triggers human investigation and review, and an upper threshold that triggers automated rollback or system suspension. The distance between these thresholds reflects your risk tolerance and must be justified in the monitoring plan.
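The two-tier structure is simple to express. A minimal sketch, with made-up baseline and threshold values standing in for the versioned benchmark and the justified limits from your monitoring plan:

```python
# Illustrative values only: the baseline is the frozen benchmark from the
# conformity assessment; both thresholds must be justified in the monitoring plan.
BASELINE_ACCURACY = 0.94
REVIEW_THRESHOLD = 0.02    # lower tier: triggers human investigation
ROLLBACK_THRESHOLD = 0.05  # upper tier: triggers automated rollback or suspension

def drift_action(current_accuracy: float) -> str:
    """Map an observed accuracy to the two-tier response defined above."""
    drop = BASELINE_ACCURACY - current_accuracy
    if drop >= ROLLBACK_THRESHOLD:
        return "rollback"
    if drop >= REVIEW_THRESHOLD:
        return "review"
    return "ok"
```

The gap between 0.02 and 0.05 here is the risk-tolerance band the monitoring plan has to defend: wide enough that humans can investigate before automation intervenes, narrow enough that serious degradation cannot persist.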
Input distribution shift. Monitor for changes in the statistical distribution of inputs compared to the training and evaluation data. Significant distributional shift means your evaluation results may no longer be representative of actual system behaviour. Techniques like population stability indices or Kolmogorov-Smirnov tests can quantify drift, but the key decision is what magnitude of shift invalidates your conformity assessment.
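The population stability index mentioned above fits in a few lines. A pure-Python sketch over pre-binned counts (the binning scheme and the decision thresholds are yours to justify; the 0.1/0.25 rule of thumb quoted in the comment is industry folklore, not a regulatory requirement):

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index over pre-binned counts.

    expected_counts: bin counts from the training/evaluation reference distribution.
    actual_counts:   bin counts from recent production inputs, using the same bins.
    eps guards against log(0) for empty bins.
    """
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    value = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)
        a_pct = max(a / a_total, eps)
        value += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return value

# Common rule of thumb (to be justified per system, not taken as given):
# PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant shift.
```

Whatever cutoff you adopt, the monitoring plan needs to state at what PSI (or KS statistic) the conformity-assessment evidence is considered no longer representative.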
Output distribution shift. Changes in the distribution of system outputs can indicate model degradation even when input distributions remain stable. This is particularly relevant for systems where outputs have direct consequences; a credit scoring model that gradually shifts toward more conservative scores may not trigger accuracy alerts but represents a meaningful change in system behaviour.
Latency degradation. Performance regression is not purely an operational concern. In time-sensitive applications, latency degradation may mean the system is no longer fit for its intended purpose. Monitor latency at the model inference level, not just the API level, to distinguish infrastructure issues from model-level problems.
Error rate monitoring by category. Track error rates by error type and by use case segment, not just in aggregate. A stable overall error rate of two percent can mask a situation where errors in one category have halved while errors in another have tripled. The disaggregated view is what matters for risk assessment.
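The masking effect is easy to demonstrate. A small sketch (invented function name, illustrative data) that returns both the aggregate rate and the disaggregated view:

```python
from collections import Counter

def error_rates_by_category(events):
    """events: iterable of (category, is_error: bool).

    Returns (aggregate_rate, {category: rate}) so the disaggregated view
    is always computed alongside the headline number.
    """
    totals, errors = Counter(), Counter()
    for category, is_error in events:
        totals[category] += 1
        errors[category] += int(is_error)
    per_category = {c: errors[c] / totals[c] for c in totals}
    aggregate = sum(errors.values()) / sum(totals.values())
    return aggregate, per_category
```

With 1,000 events per category, a 0.5% rate in one category and a 3.5% rate in another average out to a reassuring-looking 2% aggregate; only the per-category dictionary exposes the problem.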
Human override rate. If your system includes human oversight mechanisms, the rate at which human operators override or reject system outputs is a powerful signal. A rising override rate indicates that the model is becoming less reliable or that operating conditions have shifted beyond its effective range. This connects monitoring directly to the human oversight architecture described in In the Loop, On the Loop, Over the Loop: Designing Human Oversight for Production AI. Monitoring and oversight are not separate systems; they are interdependent.
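A rolling-window override monitor is one plausible shape for this signal. A sketch with invented class and parameter names; the window size and alert rate are placeholders that a real monitoring plan would derive from the oversight design:

```python
from collections import deque

class OverrideRateMonitor:
    """Rolling human-override rate over the last `window` decisions (illustrative)."""

    def __init__(self, window=500, alert_rate=0.15):
        self.decisions = deque(maxlen=window)  # True = operator overrode the output
        self.alert_rate = alert_rate

    def record(self, overridden: bool) -> None:
        self.decisions.append(overridden)

    @property
    def rate(self) -> float:
        return sum(self.decisions) / len(self.decisions) if self.decisions else 0.0

    def alerting(self) -> bool:
        # Only alert once the window is full, to avoid noise from small samples.
        return len(self.decisions) == self.decisions.maxlen and self.rate >= self.alert_rate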
Incident detection versus incident reporting
Not every anomaly is a serious incident under the Act. Conflating the two will either overwhelm your reporting process or, worse, cause you to under-report by treating everything as routine.
The Act defines serious incidents as those that may have led to or may lead to the death of a person or serious damage to health, serious and irreversible disruption of the management and operation of critical infrastructure, breach of obligations under Union law intended to protect fundamental rights, or serious damage to property or the environment. This is a high bar, but it is also context-dependent. A five percent accuracy drop in a content recommendation system is not a serious incident. The same drop in a medical triage system may well be.
Your monitoring system needs a structured escalation path: anomaly detection triggers investigation; investigation produces a root cause analysis; the root cause is classified against defined severity criteria; the classification determines the response, which may include regulatory notification. Define your classification thresholds clearly and document them in the monitoring plan. The thresholds should be derived from your risk assessment, not invented independently.
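The final classification step can be encoded against the Act's serious-incident categories. A deliberately simplified sketch: the criteria set mirrors the categories paraphrased above, while the `root_cause` fields and response labels are invented for illustration and a real implementation would be far richer:

```python
# Categories paraphrasing the Act's serious-incident definition.
SERIOUS_CRITERIA = {
    "health_harm",               # death or serious damage to health
    "critical_infrastructure",   # serious, irreversible disruption
    "fundamental_rights",        # breach of rights-protecting obligations
    "property_environment",      # serious damage to property or environment
}

def escalate(root_cause: dict) -> str:
    """Map a completed root-cause analysis to a predefined response.

    root_cause: {"impact": str, "reversible": bool} -- illustrative fields only.
    """
    if root_cause["impact"] in SERIOUS_CRITERIA:
        return "notify_market_surveillance_authority"
    if not root_cause["reversible"]:
        return "corrective_action_and_risk_register_update"
    return "log_and_monitor"
```

Note that every path returns a documented outcome: even "log_and_monitor" is a recorded classification decision, which matters for the borderline-case evidence discussed below.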
The classification decision itself must be documented, including cases where you determined an incident was not serious. If a market surveillance authority later reviews your monitoring records, they will want to see not only the incidents you reported but the borderline cases you evaluated and the reasoning behind your classification decisions.
The corrective action chain
When monitoring detects an issue and you respond, the entire sequence must be traceable. This is where the distinction between good engineering practice and regulatory compliance becomes most visible. Good teams fix problems quickly. Compliant teams fix problems quickly and document the chain.
The chain has five links. Detection: what triggered the investigation? This could be an automated alert, a human operator observation, a user complaint, or a finding from an internal audit. Diagnosis: what was the root cause? Was it data drift, a software defect, an infrastructure change, or an adversarial input? Correction: what change was made? This might be a model rollback, a parameter adjustment, a training data correction, a feature flag change, or a system suspension. Verification: how did you confirm the correction resolved the issue without introducing new problems? Documentation: the entire chain, from trigger to verification, must be recorded and linked to the relevant entry in the risk register.
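The five links map naturally onto a record schema. A minimal sketch, with hypothetical field names; the completeness check is the operationally useful part, since it catches chains that stall in a Slack thread before the verification or register link is filled in:

```python
from dataclasses import dataclass, asdict

@dataclass
class CorrectiveActionRecord:
    """One traceable chain from detection to verification (illustrative schema)."""
    detection: str         # what triggered the investigation (alert, complaint, audit)
    diagnosis: str         # root cause (data drift, defect, infra change, adversarial input)
    correction: str        # change made (rollback, retrain, feature flag, suspension)
    verification: str      # how resolution was confirmed without regressions
    risk_register_id: str  # link back to the Article 9 risk register entry

    def is_complete(self) -> bool:
        """A chain with any empty link is not yet auditable."""
        return all(asdict(self).values())
```

A gate like `is_complete()` can run in CI or in the incident-closure workflow, so an incident cannot be marked resolved while a link in the chain is missing.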
This is where logging infrastructure becomes critical. You cannot diagnose what you did not capture. The logging architecture described in What to Log, How Long to Keep It, and How to Reconstruct: Article 12 for Engineers is not a separate compliance concern; it is the foundation that makes post-market monitoring actionable. If your logs do not capture sufficient detail to reconstruct the system's state and behaviour at the time of an incident, your corrective action chain will have gaps that are visible to any competent auditor.
Making monitoring operational
Post-market monitoring is where compliance stops being a documentation exercise and becomes an operational discipline. It runs continuously, it touches production infrastructure, and it requires coordination between engineering, product, legal, and risk functions. The monitoring plan is not a document you produce and file; it is a system you operate.
Systima helps engineering teams design and implement post-market monitoring systems that satisfy Article 72 while integrating with existing MLOps infrastructure. The goal is not to replace what works but to close the gaps between operational monitoring and regulatory compliance, building the documentation chains, classification frameworks, and feedback loops that the Act requires. If you are preparing a high-risk AI system for the EU market and need to get monitoring right, talk to us.