Compliance-as-Architecture: An Engineering Leader's Guide to the EU AI Act
Your AI system rejects a loan application in Munich in October 2026. The applicant challenges the decision. The regulator asks you to reconstruct the decision trace: the exact input, the model version, the parameter configuration, the intermediate reasoning, the post-processing logic, and any human intervention. You have 15 days to respond. Can you?
If the answer is uncertain, this guide is for you.
The EU AI Act is not a policy document you hand to Legal and forget. It is a set of engineering constraints that will determine whether your AI systems can operate in Europe after August 2026. This guide translates the Act's core obligations into architecture decisions, pipeline requirements, and production controls. It is written from the perspective of building and leading production AI systems, and studying the Act in detail, so that engineering leaders can understand precisely what must change, where, and why.
The enforcement reality
The EU AI Act entered into force on 1 August 2024. Enforcement is phased:
- February 2025: Prohibited AI practices became enforceable. Systems that manipulate behaviour, exploit vulnerabilities, or perform untargeted facial recognition scraping are now unlawful.
- August 2025: Obligations for general-purpose AI models take effect. This covers foundation model providers and their downstream transparency requirements. For a detailed breakdown of these obligations and what they mean for engineering teams consuming foundation models, see GPAI Obligations for Engineering Teams That Use Foundation Models.
- August 2026: The bulk of the Act lands. High-risk AI system obligations under Articles 9 through 15 become enforceable. This is the date that matters for most engineering teams.
- August 2027: Full enforcement across all remaining provisions, including certain high-risk systems embedded in regulated products.
The penalty regime is calibrated to make non-compliance a material business risk. Violations of prohibited practices carry fines of up to 35 million EUR or 7% of global annual turnover, whichever is higher. Violations of high-risk obligations carry fines of up to 15 million EUR or 3% of turnover. Supplying incorrect information to regulators carries fines of up to 7.5 million EUR or 1% of turnover.
The Act has extraterritorial reach. If your system's output affects people in the EU, or if your system is placed on the EU market, you are in scope regardless of where you are incorporated. UK companies serving EU customers, processing EU user data, or deploying AI-powered features accessible from EU member states are caught. For a detailed analysis of territorial scope, see Why Many UK Firms Will Need to Comply with the EU AI Act.
This matters for engineering leaders because regulatory exposure now lives inside system architecture. Risk is not an abstract concept tracked in a spreadsheet; it is architectural state. Architectural state is observable. This is not a metaphor; it is the operating reality under this regulation.
Consider the structural implications. Most of the Act's obligations map directly to engineering artefacts, not legal documents. A risk management system is a pipeline with gates, not a PDF. An audit trail is an immutable log store, not a quarterly report. When legal documentation is not derived from actual system state, it drifts. Drifted documentation is false documentation. False documentation under a regulatory regime with active market surveillance and mandatory incident reporting is not a minor administrative problem; it is the exposure itself.
The most dangerous response to the AI Act is therefore the most common one: treating it as a documentation exercise. Organisations that produce compliance documents disconnected from their running systems create precisely the non-compliance those documents were meant to prevent. The engineering function must own the compliance architecture, because only the engineering function can ensure that what is documented reflects what is deployed.
Where do you fit?
The AI Act assigns obligations based on your role in the AI value chain, not your job title or company description. Three scenarios cover the majority of SaaS and technology companies:
Scenario one: You consume a hosted LLM via API and present outputs to EU users. You call OpenAI, Anthropic, Google, or a similar provider's API. You pass user input, receive a response, and display it. You are likely classified as a deployer. Your obligations centre on transparency (telling users they are interacting with AI), human oversight (ensuring appropriate human control over consequential outputs), logging (maintaining records of system use), and monitoring (detecting and reporting issues in production).
Scenario two: You fine-tune an open model and serve it to EU customers. You take a base model, such as Llama, Mistral, or a similar openly available model, apply your own training data, evaluate it, and serve it through your own infrastructure. You are likely classified as a provider. You bear the full weight of Articles 9 through 15: risk management, data governance, documentation, logging, transparency, human oversight, robustness, and cybersecurity. You are responsible for the system's compliance before it reaches the market.
Scenario three: You orchestrate multiple models into a composite or agentic system. You chain models together, route between them, use tool-calling or retrieval-augmented generation, apply scoring or ranking logic, and present a unified output. This is increasingly common in production AI. You are often treated as the provider of the overall system, regardless of whether individual model components were built by others. If the composite system operates in a high-risk domain (credit scoring, employment, insurance, law enforcement, critical infrastructure), the full provider obligations apply.
Many SaaS companies believe they are "just deployers" because they call an external model API. This belief is frequently incorrect. Once you route queries between models, filter or rank outputs, apply post-processing logic, chain tool invocations, or attach decision consequences to model outputs, you may trigger what the Act calls substantial modification or exercise functional control. Functional control is what regulators will examine. The question is not who trained the base model. The question is who determines how the system behaves in production and what decisions it influences.
If your system scores loan applications, prioritises CVs, triages insurance claims, or makes recommendations that a downstream human is likely to follow without independent evaluation, your classification risk is elevated.
When Annex III applies, and when it does not
Most SaaS AI systems are not high-risk. But if your system falls within an Annex III category, everything changes. Understanding the boundary matters commercially: it determines whether you face the full weight of Articles 9 through 15 or a lighter set of transparency and monitoring obligations.
Annex III lists specific use-case categories that trigger high-risk classification. The categories most relevant to technology companies include:
- Credit scoring and creditworthiness assessment in financial services. If your system evaluates whether a person should receive credit, a loan, or insurance, it is almost certainly high-risk.
- Employment and worker management. Systems that filter CVs, rank candidates, evaluate job applications, make promotion or termination decisions, or allocate tasks based on individual attributes fall within Annex III.
- Access to essential private and public services. Systems that determine eligibility for benefits, evaluate credit scores, triage emergency calls, or assess insurance risk and pricing.
- Law enforcement and justice. Risk assessment tools, evidence evaluation, and recidivism prediction.
- Migration and border control. Systems that assess applications for asylum, visas, or residence permits.
The boundary question that catches most companies is scoring and ranking. If your system produces a score or ranking that influences a decision about a natural person, and that decision falls within an Annex III domain, the system is high-risk. This includes recommendation systems that are positioned as "decision support" but that downstream humans follow without meaningful independent evaluation. The legal question is not whether a human technically makes the final decision. It is whether the human has sufficient information and incentive to deviate from the system's recommendation. If the answer is no, the system is effectively making the decision.
Substantial modification is the other trap. You may start as a deployer of someone else's system, which carries lighter obligations. But if you fine-tune, retrain, or significantly alter the system's behaviour, you may become a provider under the Act. The trigger is not cosmetic changes; it is modification that affects the system's compliance with the requirements or changes its intended purpose. In practice, fine-tuning on domain-specific data, adding retrieval-augmented generation to a base model, or wrapping a model in an agentic orchestration layer can all constitute substantial modification. If it changes what the system does or how it performs, you should assume the provider classification until you have documented evidence otherwise.
For a detailed analysis of what these classifications mean for SaaS engineering and QA, see Why UK SaaS Companies Must Redesign Engineering and QA for the EU AI Act.
The seven engineering obligations
Articles 9 through 15 of the AI Act define the core technical obligations for providers and deployers of high-risk AI systems. Each article translates into specific engineering requirements. What follows is a precise mapping from regulatory text to architecture, artefacts, and production controls.
Risk management as a release gate
Article 9 requires a risk management system that operates throughout the AI system's lifecycle. This is not a one-time risk assessment. It is a continuous process: identification, analysis, evaluation, and treatment of risks, with iterative review as the system changes.
In engineering terms, this means a risk register stored in version control, linked to specific model versions and system configurations. Each identified risk must have a corresponding mitigation, and that mitigation must be testable. Failure mode analysis should cover known failure patterns for your model type: hallucination rates for generative models, distributional shift for classifiers, tool misuse for agentic systems, cascading errors in multi-model pipelines.
The risk management system must include evaluation gates in your CI/CD pipeline. When a model is retrained, fine-tuned, or when system configuration changes, risk evaluation must execute before deployment proceeds. This means defining quantitative or categorical thresholds for each identified risk. If hallucination rate exceeds a defined threshold on your evaluation suite, deployment is blocked. If bias metrics on protected characteristics exceed tolerance, deployment is blocked. If latency or reliability regressions breach SLA-derived limits, deployment is blocked.
The artefacts required are: a versioned risk register (YAML, JSON, or structured format in your repository), risk evaluation pipelines that execute as part of CI/CD, threshold definitions for each risk category, and an audit trail of gate pass/fail decisions linked to deployment records.
The diagnostic test is straightforward: if you cannot state the numeric or categorical threshold at which deployment is blocked, you do not have a risk management system. You have a discussion. Discussions do not satisfy Article 9. Gates do.
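The gate logic described above can be sketched in a few lines. This is an illustrative example, not a prescription: the threshold names and values are assumptions you would replace with metrics from your own evaluation suite and risk register.

```python
# Illustrative CI/CD risk gate. Threshold names and values are
# assumptions for this sketch, not figures from the Act.
RISK_THRESHOLDS = {
    "hallucination_rate": 0.02,      # max tolerated on the eval suite
    "demographic_parity_gap": 0.05,  # max gap across protected groups
    "p95_latency_ms": 800,           # SLA-derived limit
}

def risk_gate(eval_metrics: dict) -> tuple[bool, list[str]]:
    """Return (passed, failures); a single breach blocks deployment."""
    failures = [
        f"{name}={eval_metrics[name]} exceeds threshold {limit}"
        for name, limit in RISK_THRESHOLDS.items()
        if eval_metrics.get(name, float("inf")) > limit
    ]
    return (not failures, failures)

# A retrained model with a hallucination regression fails the gate,
# even though bias and latency are within tolerance.
passed, failures = risk_gate(
    {"hallucination_rate": 0.034,
     "demographic_parity_gap": 0.02,
     "p95_latency_ms": 640}
)
```

The point of the sketch is the shape, not the numbers: every risk in the register maps to a named metric, every metric to a threshold, and the pass/fail result is a loggable artefact linked to the deployment record.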
Risk management does not end at deployment. Article 9 requires continuous, iterative risk management throughout the system's lifetime. Findings from post-market monitoring must feed back into the risk register, triggering re-evaluation and, where necessary, updated mitigation measures. The risk register is a living document; if it does not change as your system changes, it is already wrong.
Training data provenance and the GDPR tension
Article 10 imposes data governance requirements for training, validation, and testing datasets. It requires that datasets are relevant, sufficiently representative, and as free of errors as possible. It requires examination for possible biases, particularly where outputs may affect natural persons. It requires documentation of data preparation processes, assumptions, and known gaps.
In engineering terms, this means dataset versioning with full lineage tracking. Every training run must be reproducible from a specific, immutable dataset version. Data preparation pipelines must be deterministic and logged. Bias evaluation must run against defined demographic dimensions and produce quantitative metrics that can be compared across dataset and model versions.
The artefacts required are: versioned datasets (or dataset manifests with checksums and storage references), data lineage records linking raw sources to processed training sets, data quality evaluation reports, bias evaluation reports with per-demographic-group metrics, and documentation of any data filtering, augmentation, or weighting decisions.
Here, however, you encounter a genuine regulatory tension. Article 10 requires you to test for bias across protected characteristics: age, gender, ethnicity, disability. But the GDPR's Article 9 restricts processing of special category data, which includes precisely those characteristics. You cannot test for racial bias without processing racial data. You cannot test for gender bias without processing gender data.
This is not a hypothetical conflict. It is a daily operational reality for any team building evaluation pipelines. And the risk is bidirectional: if you collect demographic data improperly to satisfy your AI Act bias testing obligations, you may solve your Article 10 problem and create a GDPR Article 9 enforcement problem. Two regulatory regimes, two enforcement bodies, one engineering decision. Get it wrong in either direction and you have exposure.
The resolution involves careful legal basis analysis, data minimisation techniques (statistical sampling, differential privacy, synthetic proxy datasets), and documentation of why the processing is necessary and proportionate. There is no clean answer. There is only a defensible engineering and legal position, constructed deliberately and documented thoroughly. This is not a question you can delegate to counsel. The choice of bias testing methodology is an engineering decision with legal consequences, and the engineering team must own the approach.
For a full treatment of this conflict and practical approaches, see You Need Demographic Data to Prove You're Not Biased: The GDPR-AI Act Tension.
Documentation-as-code
Articles 11 and 12, supplemented by Annex IV, require technical documentation that is comprehensive, up to date, and sufficient to demonstrate conformity. Annex IV specifies what must be documented: system description, design specifications, development methodology, data governance measures, risk management measures, performance metrics, monitoring plans, and more.
The critical architectural decision here is where that documentation lives and how it is produced. If your technical documentation is a set of Word documents maintained by a compliance officer who periodically interviews engineers, it will be wrong within weeks of being written. Systems change. Models are retrained. Configurations shift. Pipeline logic evolves. Documentation that is not generated from system state is documentation that drifts, and drifted documentation under the AI Act is a liability, not an asset.
The engineering approach is documentation-as-code. Model cards are generated from evaluation pipeline outputs. System architecture descriptions are generated from infrastructure-as-code definitions and service manifests. Data governance documentation is generated from dataset versioning metadata and pipeline logs. Risk management documentation is generated from risk register files and CI/CD gate results. Performance metrics are pulled from evaluation stores.
The artefacts required are: documentation generation pipelines that run as part of CI/CD or on a scheduled basis, templates that pull from authoritative system state (model registries, experiment trackers, configuration stores, pipeline metadata), a versioned documentation output that can be diffed against previous versions to show what changed and when.
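As a minimal sketch of the generation step: a model card rendered from registry and evaluation state rather than written by hand. The field names (`name`, `version`, `dataset_version`) are assumptions about your registry schema, not a standard.

```python
from datetime import datetime, timezone

def render_model_card(registry_entry: dict, eval_report: dict) -> str:
    """Generate a model card from authoritative system state so the
    document cannot drift from the deployed system. Field names are
    illustrative assumptions about the registry schema."""
    lines = [
        f"# Model card: {registry_entry['name']} {registry_entry['version']}",
        f"Generated: {datetime.now(timezone.utc).isoformat()}",
        f"Training dataset: {registry_entry['dataset_version']}",
        "## Evaluation",
    ]
    # Metrics come straight from the evaluation pipeline output.
    lines += [f"- {metric}: {value}" for metric, value in eval_report.items()]
    return "\n".join(lines)
```

Run as a CI/CD stage, the output is versioned with the release, so a diff between two model cards shows exactly what changed between deployments.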
This connects directly to the core thesis of this guide. If documentation is generated from code, configuration, and pipeline output, it cannot drift. If it cannot drift, it cannot be false. If it cannot be false, it satisfies the Act's requirement for accuracy and currency. The alternative, manually maintained compliance documents, creates the very exposure it was meant to prevent.
Decision audit trails
Article 12 requires that high-risk AI systems are designed and developed with logging capabilities that enable the recording of events relevant to identifying risks, facilitating post-market monitoring, and enabling investigation of incidents. Logs must be proportionate to the system's intended purpose and kept for an appropriate period.
For production AI systems, this translates into a demanding logging architecture. Every decision that the system produces must be reconstructable. This means capturing: the exact input received, any pre-processing applied, the model version invoked, the parameter configuration at inference time, the raw model output, any post-processing, filtering, or ranking logic applied, the final output delivered, and any human intervention or override that occurred.
For agentic or multi-step systems, the requirements multiply. Each tool call, each intermediate reasoning step, each routing decision must be logged with sufficient fidelity to reconstruct the chain of events. If your system queries a database, calls an external API, and synthesises a response, the audit trail must capture what was queried, what was returned, how the response was constructed, and what was presented to the user.
Retention architecture is equally important. The Act does not specify a universal retention period, but it requires that logs are kept for a period appropriate to the system's intended purpose and applicable legal obligations. For systems in financial services, healthcare, or employment, sector-specific regulations may mandate retention of five to ten years. Your logging infrastructure must support immutable storage, tamper detection, and efficient retrieval for regulatory queries.
The artefacts required are: a structured logging schema that captures the full decision trace, immutable log storage with integrity verification, retention policies aligned with sectoral requirements, retrieval tooling that can reconstruct a complete decision from a single identifier, and human override logs that record who intervened, when, and what they changed.
The diagnostic test: given a decision ID from six months ago, can you reconstruct the exact model version, parameter configuration, input data, intermediate tool calls, post-processing logic, and any human intervention? If the answer is no, your logging architecture does not satisfy Article 12.
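A decision trace schema that passes this test might look like the following sketch. The field names are an illustrative schema, not regulatory text, and the in-memory store stands in for what would be immutable object storage in production.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class DecisionTrace:
    """One Article 12-style decision record. Field names are an
    illustrative schema, not regulatory text."""
    decision_id: str
    model_version: str
    parameters: dict       # inference-time configuration
    input_payload: dict    # exact input received
    tool_calls: list       # each: {"tool", "args", "result"}
    raw_output: str
    post_processing: list  # ordered transformations applied
    final_output: str
    human_override: dict = None  # who intervened, when, what changed

# Append-only store keyed by decision_id; in production this would
# be immutable storage with integrity verification, not a dict.
STORE = {}

def record(trace: DecisionTrace) -> None:
    STORE[trace.decision_id] = json.dumps(asdict(trace))

def reconstruct(decision_id: str) -> DecisionTrace:
    """The diagnostic test, as code: one identifier in, the full
    decision trace back out."""
    return DecisionTrace(**json.loads(STORE[decision_id]))
```

The design choice that matters is that `reconstruct` takes nothing but a decision ID. If reconstruction also requires tribal knowledge, ad hoc log greps, or a conversation with the engineer who shipped the release, the architecture fails the test.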
This logging architecture is also the backbone of post-market monitoring. Without structured decision traces, you cannot detect drift, investigate incidents, or demonstrate corrective action. Logging and monitoring are not separate workstreams; they are two views of the same infrastructure.
For detailed implementation guidance on logging schema design, retention architecture, and reconstruction tooling, see What to Log, How Long to Keep It, and How to Reconstruct: Article 12 for Engineers.
Transparency artefacts
Article 13 requires that high-risk AI systems are designed and developed in such a manner that their operation is sufficiently transparent to enable deployers to interpret a system's output and use it appropriately. Instructions for use must accompany the system and include information about the provider, system characteristics, capabilities, limitations, performance metrics, foreseeable misuse scenarios, and human oversight measures.
In engineering terms, transparency is a set of artefacts that must be produced, maintained, and delivered alongside the system.
Model cards are the primary transparency artefact. A model card documents what the model was trained on, what it was evaluated against, what its known limitations are, how it performs across different demographic groups, and what it should and should not be used for. Model cards should be generated from evaluation pipeline outputs, not written retrospectively.
Confidence score exposure is required where the system produces probabilistic outputs. If your system scores, ranks, or classifies, the end user or deployer must have access to confidence information sufficient to calibrate their reliance on the output. This does not necessarily mean exposing raw probability values; it means providing meaningful indicators of certainty and uncertainty.
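One common pattern is to map raw probabilities to categorical bands. The band boundaries below are illustrative assumptions; in practice they should be calibrated against your own evaluation data, not picked by intuition.

```python
def confidence_band(probability: float) -> str:
    """Map a raw model probability to a categorical indicator a
    deployer can act on. Boundaries are illustrative assumptions;
    calibrate them against your evaluation data."""
    if probability >= 0.9:
        return "high confidence"
    if probability >= 0.7:
        return "moderate confidence"
    return "low confidence - human review recommended"
```

Note that the lowest band carries an instruction, not just a label: the transparency artefact tells the deployer what to do with uncertainty, which is what "use the output appropriately" requires.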
System limitation disclosures must be specific and actionable. Stating that "the system may produce errors" is not a limitation disclosure. Stating that "the system's accuracy degrades on inputs longer than 4,000 tokens, on queries in languages other than English, and on topics not represented in the training data" is a limitation disclosure. Engineering teams must identify, quantify, and communicate specific failure modes.
User-facing instructions must be sufficient for a deployer to implement the system in compliance with its intended purpose. This includes deployment requirements, integration constraints, monitoring obligations, and escalation procedures.
The artefacts required are: auto-generated model cards, confidence calibration documentation, specific limitation disclosures derived from evaluation data, and deployer-facing integration guides.
Override architecture
Article 14 requires that high-risk AI systems are designed and developed so that they can be effectively overseen by natural persons during the period in which they are in use. Human oversight measures must be appropriate to the risks, level of autonomy, and context of use.
The Act does not mandate that every output is manually reviewed. It mandates meaningful human control where risk justifies it. The distinction matters enormously for system design and operational cost.
Three models of human oversight apply, and your architecture must support the appropriate model for each decision context:
Human-in-the-loop: A human reviews and approves every output before it reaches the end user or triggers a downstream action. This is appropriate for high-consequence, low-volume decisions: credit application rejections, medical diagnostic suggestions, employment screening recommendations. The engineering requirement is a review queue with full context presentation, approval/rejection/modification controls, and audit logging of every human decision.
Human-on-the-loop: The system operates autonomously, but a human monitors outputs in near-real-time and can intervene. This is appropriate for moderate-consequence, moderate-volume decisions. The engineering requirement is a monitoring dashboard with statistical oversight (distribution of outputs, anomaly detection, trend analysis), alerting thresholds, and intervention tooling that allows a human to halt, modify, or override system behaviour.
Human-over-the-loop: The system operates autonomously within defined parameters. A human sets the parameters, reviews aggregate performance, and adjusts constraints periodically. This is appropriate for lower-consequence, high-volume decisions. The engineering requirement is a governance interface where authorised personnel can adjust system parameters, review performance reports, and implement policy changes.
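The mapping from decision context to oversight model should itself be an explicit, reviewable artefact. A minimal sketch, assuming consequence levels are classified upstream (the policy table here is an assumption your risk owner would set):

```python
from enum import Enum

class Oversight(Enum):
    IN_THE_LOOP = "human approves every output"
    ON_THE_LOOP = "human monitors and can intervene"
    OVER_THE_LOOP = "human sets parameters, reviews aggregates"

def oversight_model(consequence: str) -> Oversight:
    """Route a decision type to its oversight model. The policy
    (and the upstream consequence classification) are illustrative
    assumptions, not regulatory text."""
    if consequence == "high":       # e.g. credit rejection, CV screening
        return Oversight.IN_THE_LOOP
    if consequence == "moderate":   # e.g. claim triage suggestions
        return Oversight.ON_THE_LOOP
    return Oversight.OVER_THE_LOOP  # e.g. low-stakes content ranking
```

Encoding the policy as code means the mapping is versioned, diffable, and auditable, which is exactly what a regulator will ask to see when they ask why a given decision type was not human-reviewed.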
The critical design decision is matching oversight model to risk level. Over-investing in human-in-the-loop oversight for low-risk, high-volume decisions creates operational bottlenecks and reviewer fatigue, which paradoxically reduces oversight quality. Under-investing in oversight for high-risk decisions creates regulatory exposure and genuine harm risk.
Note that human oversight is not only a provider obligation. Deployers of high-risk AI systems are also required to implement the oversight measures described in the system's instructions for use. If you are consuming a third-party AI system, you must operate it with the oversight architecture the provider specifies. If the provider's instructions are vague or absent, you have a supply chain problem, not a compliance exemption.
The artefacts required are: an oversight architecture document mapping decision types to oversight models, review queue tooling for in-the-loop decisions, monitoring dashboards for on-the-loop decisions, governance interfaces for over-the-loop decisions, and audit trails for all human interventions across all models.
For a complete architecture guide covering implementation patterns and trade-offs, see In the Loop, On the Loop, Over the Loop: Designing Human Oversight for Production AI.
Robustness and security hardening
Article 15 requires that high-risk AI systems achieve an appropriate level of accuracy, robustness, and cybersecurity. They must be resilient to errors, faults, and inconsistencies, and designed to address risks from adversarial attacks.
Robustness in production AI means the system performs predictably under real-world conditions, not just on curated evaluation datasets. Engineering requirements include: comprehensive evaluation suites that cover edge cases and adversarial inputs; drift detection pipelines that monitor input distribution and output distribution over time; degradation alerting that triggers when performance metrics breach defined thresholds; and rollback capability that can revert to a known-good model version within minutes, not hours.
For generative AI systems, robustness also covers output consistency, instruction adherence, and refusal behaviour. Evaluation suites must test whether the system produces harmful outputs under adversarial prompting, whether it hallucinates under specific input patterns, and whether it maintains its intended behaviour across diverse input distributions.
Cybersecurity for AI systems extends beyond traditional application security. Specific threats include: prompt injection (direct and indirect), where adversarial inputs manipulate system behaviour through the model's input channel; tool invocation exploitation, where agentic systems are manipulated into calling tools with unintended parameters; data poisoning, where training data is compromised to embed backdoors or bias; model weight exfiltration, where proprietary model parameters are extracted through inference API abuse; and supply chain attacks, where dependencies (model weights, embedding models, evaluation datasets) are compromised upstream.
Engineering requirements include: input validation and sanitisation layers that operate independently of the model; tool invocation validation that enforces parameter schemas and permission boundaries; model weight integrity verification (checksums, signed artefacts); dependency pinning and verification for model supply chains; rate limiting and anomaly detection on inference endpoints; and regular adversarial testing (red-teaming) against known attack taxonomies.
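Tool invocation validation, in particular, deserves a sketch because the design principle is subtle: the check runs outside the model, so a prompt-injected request cannot talk its way past it. The tool names, schemas, and role table below are assumptions for illustration, not a real framework's API.

```python
# Illustrative tool-call validation layer. Tool names, parameter
# schemas, and roles are assumptions for this sketch.
TOOL_SCHEMAS = {
    "query_accounts": {
        "params": {"customer_id": str, "limit": int},
        "allowed_roles": {"support_agent", "risk_analyst"},
    },
}

def validate_tool_call(tool: str, args: dict, role: str) -> None:
    """Raise before the call executes if schema or permission checks
    fail. Runs independently of the model, so adversarial prompts
    cannot bypass it."""
    schema = TOOL_SCHEMAS.get(tool)
    if schema is None:
        raise PermissionError(f"unknown tool: {tool}")
    if role not in schema["allowed_roles"]:
        raise PermissionError(f"role {role!r} may not call {tool!r}")
    for name, expected_type in schema["params"].items():
        if not isinstance(args.get(name), expected_type):
            raise TypeError(f"parameter {name!r} must be {expected_type.__name__}")
    if set(args) - set(schema["params"]):
        raise TypeError("unexpected parameters rejected")
```

The same principle (deterministic enforcement outside the model's input channel) applies to the other controls in the list: output filters, rate limits, and integrity checks must not depend on the model behaving as intended.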
The artefacts required are: evaluation suites with coverage metrics, drift detection pipeline configurations, rollback runbooks, adversarial testing reports, security threat models specific to AI components, and incident response procedures for AI-specific attack vectors.
Post-deployment monitoring
Article 72 establishes a post-market monitoring obligation that operates for the entire lifetime of the AI system. This is not a suggestion. It is a legal requirement with specific reporting obligations.
Post-market monitoring means continuous, systematic collection and analysis of data about the system's performance in production. It means detecting when the system behaves unexpectedly, when performance degrades, when new risks emerge, or when incidents occur. It means having defined thresholds that trigger investigation and, where necessary, corrective action.
The engineering requirements are: production monitoring pipelines that track model performance metrics (accuracy, latency, error rates, output distribution), input monitoring that detects distributional shift, user feedback collection and analysis, incident detection with defined severity classification, corrective action logging that records what was detected, what was investigated, and what was changed.
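Distributional shift detection can be implemented with simple, interpretable statistics. The sketch below uses the population stability index (PSI) over pre-binned distributions; the common 0.2 rule-of-thumb threshold, like the binning itself, is an assumption that must be set per system.

```python
import math

def population_stability_index(expected, observed) -> float:
    """PSI between two binned distributions (fractions summing to 1).
    A common rule of thumb treats PSI > 0.2 as significant drift;
    that threshold, like the binning, is an assumption to tune."""
    eps = 1e-6  # guard against empty bins
    return sum(
        (o - e) * math.log((o + eps) / (e + eps))
        for e, o in zip(expected, observed)
    )

# Baseline from the training-time input distribution; current from
# a production window. Illustrative numbers.
baseline = [0.25, 0.25, 0.25, 0.25]
current = [0.10, 0.20, 0.30, 0.40]
drift_detected = population_stability_index(baseline, current) > 0.2
```

Wire the boolean into your alerting, log the PSI value with each monitoring run, and the drift check becomes an auditable record of continuous monitoring rather than a dashboard someone glances at.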
The Act also imposes specific reporting obligations. Serious incidents, defined as incidents that result in death, serious damage to health, serious and irreversible disruption of critical infrastructure, or breach of fundamental rights obligations, must be reported to the relevant market surveillance authority. The reporting timeline is tight: immediately after the provider or deployer establishes a causal link between the AI system and the incident, and no later than 15 days after becoming aware of the incident.
This means your monitoring system must not only detect anomalies but classify them against regulatory severity thresholds and trigger the appropriate internal escalation. Your incident response procedures must include a regulatory reporting pathway, not just an engineering remediation pathway.
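A sketch of that classification step, assuming incidents are categorised upstream. The category labels paraphrase the Act's serious-incident definition, and the escalation paths are illustrative; only the 15-day outer deadline comes from the regulation itself.

```python
from datetime import date, timedelta

# Categories paraphrasing the Act's serious-incident definition.
SERIOUS_CATEGORIES = {
    "death",
    "serious_health_damage",
    "critical_infrastructure_disruption",
    "fundamental_rights_breach",
}

def classify_incident(category: str, detected_on: date) -> dict:
    """Attach the escalation path and the 15-day outer reporting
    deadline to a detected incident. Escalation labels are
    illustrative assumptions."""
    serious = category in SERIOUS_CATEGORIES
    return {
        "serious": serious,
        "escalation": "regulatory_reporting" if serious else "engineering_remediation",
        "report_by": (detected_on + timedelta(days=15)).isoformat() if serious else None,
    }
```

The deadline field matters: the moment an incident is classified as serious, the clock is running, and the regulatory reporting pathway must be triggered automatically rather than left to whoever happens to be on call.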
It is worth noting the broader regulatory machinery. Market surveillance authorities in each EU member state will oversee compliance. Conformity assessment procedures, either self-assessment or third-party audit depending on the domain, will be required before high-risk systems can be placed on the market. Harmonised standards that translate the Act's requirements into specific technical specifications are currently being developed by CEN and CENELEC, with publication expected to align with the August 2026 enforcement date.
For a complete post-market monitoring architecture guide, see Your Model Is in Production. Now What? Post-Market Monitoring Under Article 72.
The effort reality
Engineering leaders need honest effort estimates. The work required to achieve compliance is substantial, but it is bounded and predictable if approached systematically.
The three largest effort areas are:
- Logging retrofit. Most production AI systems were not designed with Article 12 logging in mind. Retrofitting comprehensive decision trace logging into an existing system (capturing inputs, model versions, intermediate steps, post-processing, and outputs in an immutable, retrievable store) is consistently the most time-consuming compliance workstream. It touches inference pipelines, data infrastructure, storage architecture, and retention policy. For agentic or multi-model systems, the complexity multiplies with each integration point.
- Documentation generation pipelines. Moving from manually authored compliance documents to generated-from-system-state documentation requires building new CI/CD stages, integrating with model registries and experiment trackers, and establishing templates that pull from authoritative sources. The first iteration is the hardest; subsequent updates become incremental.
- Human oversight dashboards. If your system currently operates without structured human oversight, designing and building review queues, monitoring dashboards, and governance interfaces is a significant product and engineering effort. It requires UX design for reviewer workflows, backend infrastructure for queue management and audit logging, and operational processes for staffing and training reviewers.
Effort by organisational maturity:
- Mature MLOps team with existing model registries, experiment tracking, evaluation pipelines, and structured logging: 8 to 12 weeks of focused engineering effort to close gaps and formalise compliance artefacts.
- Team with ad hoc logging, manual documentation, and no structured evaluation: 4 to 6 months of engineering and infrastructure work, assuming dedicated resourcing.
The commercial reality is accelerating this timeline. EU enterprise customers, particularly in financial services, healthcare, and the public sector, are beginning to include AI Act compliance evidence as a procurement condition. RFPs are asking for technical documentation, risk management evidence, and logging architecture descriptions. Companies that cannot provide these artefacts will lose deals before regulators ever become involved.
Organisational implications
The AI Act's obligations cannot be satisfied by the engineering team alone, but they cannot be satisfied without the engineering team at the centre. Several organisational shifts are necessary.
A named AI risk owner. The Act requires that someone be accountable for the risk management system and its continuous operation. In most technology companies, this should be the Head of AI, VP of Engineering, or CTO, not General Counsel. If AI oversight lives in Legal without engineering ownership, the system will fail under audit. Legal expertise is essential for interpreting obligations and managing regulatory relationships, but the risk management system itself is an engineering artefact. It is a pipeline with gates, a monitoring dashboard, a logging architecture. The person accountable for it must understand the system architecture, the evaluation methodology, and the deployment pipeline. A risk owner who cannot read the CI/CD configuration or interpret evaluation metrics cannot meaningfully own the risk.
QA role expansion. Traditional QA focused on functional correctness: does the feature work as specified? AI evaluation adds dimensions that most QA teams have not historically covered: statistical performance assessment, bias evaluation, adversarial testing, drift detection. QA teams need new skills, new tooling, and new authority to block releases based on evaluation outcomes.
Platform and DevOps responsibility for logging integrity. The logging architecture required by Article 12 is an infrastructure concern. Immutable storage, tamper detection, retention management, and efficient retrieval at regulatory query time are platform team responsibilities. This cannot be an afterthought bolted onto application logging.
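One common way to make the log store tamper-evident, sketched here as a design option rather than something Article 12 prescribes, is a hash chain: each entry commits to the hash of its predecessor, so any after-the-fact modification breaks verification of every subsequent entry.

```python
import hashlib
import json

class HashChainedLog:
    """Append-only log where each entry commits to its predecessor.
    Modifying any stored entry invalidates the chain from that point on."""

    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []
        self._prev = self.GENESIS

    def append(self, record: dict) -> str:
        payload = json.dumps(record, sort_keys=True)
        entry_hash = hashlib.sha256((self._prev + payload).encode()).hexdigest()
        self.entries.append(
            {"record": record, "prev": self._prev, "hash": entry_hash}
        )
        self._prev = entry_hash
        return entry_hash

    def verify(self) -> bool:
        """Recompute every hash; returns False on any tampering or reordering."""
        prev = self.GENESIS
        for e in self.entries:
            payload = json.dumps(e["record"], sort_keys=True)
            expected = hashlib.sha256((prev + payload).encode()).hexdigest()
            if e["prev"] != prev or e["hash"] != expected:
                return False
            prev = e["hash"]
        return True

log = HashChainedLog()
log.append({"decision_id": "dec-001", "output": "reject"})
log.append({"decision_id": "dec-002", "output": "approve"})
assert log.verify()
log.entries[0]["record"]["output"] = "approve"  # simulate tampering
assert not log.verify()
```

In production the same idea is usually delegated to infrastructure, such as object storage with write-once retention locks plus periodic chain verification, but the verification logic is this simple at its core.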
Compliance engineering for larger organisations. Organisations operating multiple AI systems across different risk categories will benefit from a dedicated compliance engineering function: engineers who specialise in translating regulatory requirements into technical specifications, building compliance tooling, and maintaining the documentation generation pipeline.
For a detailed examination of how these organisational changes apply to SaaS companies specifically, see Why UK SaaS Companies Must Redesign Engineering and QA for the EU AI Act.
Six questions to ask your engineering team tomorrow
These questions are diagnostic. If your team can answer all six with specifics, you are ahead of the majority of the industry. If any question produces silence or vagueness, that is your starting point.
- Can we reconstruct any production decision end-to-end? Given a decision ID, can we retrieve the input, model version, configuration, intermediate steps, post-processing, output, and any human intervention? If the answer involves manual effort or uncertainty, your logging architecture is insufficient.
- Where is our risk register stored and versioned? If the answer is "a spreadsheet" or "a Confluence page", it is not linked to system state and it will drift. A risk register must be in version control, linked to model versions, and referenced by CI/CD gates.
- What blocks deployment if evaluation fails? If nothing blocks deployment, you do not have a risk management system. If the answer is "we review the metrics and make a judgement call", you have a discussion, not a gate.
- Who owns human oversight design? If no one owns it, oversight is ad hoc. If Legal owns it, the implementation will be disconnected from system architecture. An engineering leader must own the oversight architecture, with Legal input on risk calibration.
- What triggers a serious incident report? If the answer is uncertain, your monitoring system cannot satisfy Article 72's reporting obligations. Incident severity thresholds must be defined, implemented in monitoring, and linked to an escalation pathway that includes regulatory reporting.
- Can we demonstrate that documentation is generated from system state? If documentation is authored manually in a separate tool, it is already drifting. If it is generated from model registries, evaluation pipelines, and configuration stores, it is accurate by construction.
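Questions 2 and 3 can be made concrete with a small sketch: a CI gate that checks evaluation metrics against release criteria kept under version control and fails the build on any violation. The metric names and thresholds below are illustrative assumptions, not prescribed values.

```python
# Version-controlled release criteria; in practice these would be loaded
# from a file in the repo (e.g. a gates file in the risk register directory)
# so that each threshold is linked to the risk it mitigates.
GATES = {
    "accuracy": {"min": 0.92},
    "false_positive_rate": {"max": 0.05},
    "demographic_parity_gap": {"max": 0.08},
}

def evaluate_gates(metrics: dict, gates: dict) -> list[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    for name, bound in gates.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: metric missing from evaluation report")
        elif "min" in bound and value < bound["min"]:
            failures.append(f"{name}: {value} below minimum {bound['min']}")
        elif "max" in bound and value > bound["max"]:
            failures.append(f"{name}: {value} above maximum {bound['max']}")
    return failures

# In a CI stage, a non-empty failure list would exit non-zero and block
# the deploy; a missing metric blocks just as hard as a failing one.
report = {"accuracy": 0.95, "false_positive_rate": 0.03,
          "demographic_parity_gap": 0.02}
assert evaluate_gates(report, GATES) == []
```

Note that the gate fails closed: a metric absent from the evaluation report is treated as a failure, so a broken evaluation pipeline cannot silently wave a release through.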
What to do next
The EU AI Act's obligations are precise, technically demanding, and enforceable. They are also achievable. The gap between where most engineering organisations are today and where they need to be is an architecture problem, not a legal problem.
Systima works with engineering teams to close that gap. We do not produce compliance documents that sit in a shared drive. We produce engineering artefacts that satisfy regulatory requirements because they are derived from system state.
If you are assessing your organisation's readiness, we offer an engineering-led compliance architecture review. The output is concrete:
- Gap report mapped to Articles 9 through 15: where your current system satisfies obligations, where it does not, and what the exposure is in each area.
- Logging architecture blueprint: decision trace schema, retention architecture, and reconstruction capability design tailored to your system topology.
- Risk gate CI/CD design: evaluation thresholds, deployment gates, and release criteria integrated into your existing pipeline.
- Documentation-as-code pipeline specification: what to generate, from which sources, and how to keep it current as your system evolves.
- Prioritised remediation roadmap: sequenced by regulatory risk and engineering effort, so you address the highest-exposure gaps first.
The work is bounded. The timeline is known. The question is whether you begin now, with the advantage of time, or later, under the pressure of enforcement and procurement demands.