Pre-Mortem: KPMG’s AI-Powered Audit

The audit opinion is the most consequential document most public companies produce. Not the annual report. Not the investor deck. The audit opinion, because it carries a named partner’s signature, and because that signature means something in law. On 9 June 2026, KPMG and Microsoft announced the deployment of Microsoft Agent 365 and Copilot across 276,000 KPMG professionals in 138 countries, including inside KPMG Clara, the firm’s global smart audit platform. Scott Flynn, KPMG’s Global Head of Audit, called it “a pivotal milestone in our AI-powered, human-assured audit transformation.” The word “assured” is doing a great deal of work in that sentence.

A pre-mortem asks the same five questions, every time, applied before failure is possible rather than after. This is the fifth in the series. The first looked at vendor accountability in regulated finance. The second at clinical safety in healthcare. The third at execution accountability in defence procurement. The fourth at clinical AI infrastructure. This one looks at professional services, the sector that has built its entire business model on the premise that human expertise is the product.

 

The Bet

KPMG is betting that efficiency and accountability can coexist at this scale. That 276,000 professionals deploying AI agents, with a governance layer running underneath, will not dilute the professional accountability the audit opinion rests on. It is a reasonable bet. It is also an untested one. The commercial logic is clear: 276,000 professionals, 138 countries, and an AI-powered workflow running through KPMG Clara together create the kind of structural productivity gain that redefines the firm’s cost base, and potentially its fee model. Analysis of recent audit fee movements suggests clients are already pressing the case that AI efficiency should flow through to lower fees. The deeper bet, the one sitting beneath the headline deployment, is that “AI-powered, human-assured” constitutes a defensible operating model before any regulatory body has defined what “human-assured” actually requires in practice.

 

The Assumption

The single assumption carrying all the weight: that governing agents is the same thing as being accountable for them. Microsoft Agent 365 provides what its own documentation describes as a control plane, a centralised registry of agents with lifecycle rules, identity controls, and audit logging. That is a meaningful capability. It answers the question: how many agents do you have, and what can they touch? It does not, on its own, answer the question a claims lawyer or a regulator will eventually ask: who is accountable when the agent was visible, governed, and still wrong? KPMG’s Trusted AI framework lists ten ethical pillars, including one labelled Accountability, which calls for human oversight and responsibility to be embedded across the AI lifecycle. That is a principle-level commitment. None of the publicly available documentation specifies what happens to the partner’s signature when an AI-assisted conclusion is signed off and later found to be materially incorrect.

 

The Sequence

KPMG has deployed agents at scale before any authoritative regulatory framework specifies what AI-assisted audit evidence must look like, or how human review of AI-generated conclusions must be documented to meet existing standards. The IAASB approved a project proposal in March 2026 to revise ISA 500, Audit Evidence, to address technology use in audit, but the project is still in early research and information gathering, with no exposure draft issued and no effective date. The PCAOB has stated publicly that it is considering developing risk management guidance for audit firms using AI. Considering, not publishing. The capability is deployed. The standard that surrounds it is still being drafted.

 

The Pager

Lisa Heneghan, KPMG’s Global Chief Digital Officer, was specific about what this deployment requires: “strong foundations in governance, visibility and accountability.” That framing is responsible, and Agent 365 provides the visibility that most enterprises currently lack. The harder question is structural and specific. The audit opinion is signed by a named partner. Professional indemnity is priced around that signature. When an agent embedded in KPMG Clara surfaces a conclusion, the partner reviews it and signs the opinion, and the work is later found to contain a material error, the liability has historically sat with the partner and the firm. What KPMG, Microsoft, and the client have not yet published is a clear allocation of responsibility for the agent’s contribution to that error. Is it a tool failure, an oversight failure, or something existing frameworks do not yet classify? The governance layer provides the audit trail. It does not specify who reads it, or what reading it is worth, when a claim is filed.

 

The Proof

The announcement commits 276,000 professionals and earns KPMG the designation of Microsoft “Frontier Firm.” Neither is a performance measure. No published metric connects this deployment to audit accuracy improvement, reduction in deficiencies, or quality outcomes. What the deployment actually demonstrates is that KPMG can deploy Agent 365 at scale and maintain visibility over its agent estate. That is a meaningful operational achievement. It is not the same as demonstrating that AI-assisted audit conclusions are more reliable than human-only ones, which is what regulators, courts, and insurers will eventually need to see. KPMG Clara’s existing framing covers adoption and workflow integration, not audit opinion accuracy or deficiency rates. The proof that matters most is still outstanding.

 

Verdict

If KPMG publishes a clear framework specifying how AI-assisted audit evidence is reviewed, validated, and documented, paired with a liability position that survives regulatory scrutiny, this becomes the reference model for professional services AI at scale. The governance commitment is genuine. The scale of deployment is unmatched in the sector. Scott Flynn’s “AI-powered, human-assured” is the right aspiration. The question is whether “human-assured” describes a documented, auditable review process that a regulator will accept and an insurer will cover, or whether it is a positioning statement waiting for a definition. At 276,000 professionals across 138 countries, the audit opinion at the centre of this deployment is too consequential to leave that question open. The answer should come before the first material claim, not after.

Already Building: Epic Agent Factory and the Governance Gap

The pre-mortem on Epic Agent Factory asked who would answer when a health-system-built agent made a clinically significant error. It was published on 9 June. I have since learned of a Becker’s Hospital Review report from 30 March confirming that one of America’s largest health systems had already been building those agents for weeks before the question was published.

It confirms the pre-mortem’s central argument. Neither the research nor the article surfaced how quickly the sequence had already begun.

 

The Deployment That Was Already In Motion

Advocate Health had already tapped Epic’s Agent Factory, becoming one of the first health systems to build and deploy agents through the platform. Andy Crowder, Advocate Health’s SVP and Chief Digital and AI Officer, described the direction in a LinkedIn post on 26 March: “By combining Epic’s Agent Factory Platform capabilities with Advocate Health’s scale, clinical insight, and commitment to innovation, we’re translating AI from promise into practice.” He pointed to a three-day Epic immersion at The Pearl innovation district in Charlotte, focused on speeding up pharmacy verification for complex medications and cutting infusion chart preparation time for pharmacists and nurses. Four working prototypes emerged, scheduled to go live in July 2026.

Crowder added: “Together, we’re advancing responsible, practical AI that fits naturally into clinical workflows, reduces friction, and gives clinicians back time to focus on what matters most.” It is a considered statement, and the commitment is genuine. But it is not a governance document. And Advocate Health is not unusual here. They are representative. They moved first because the platform enabled it, the commercial pressure to reduce administrative burden was real, and nothing in the regulatory landscape said stop.

This is the sequence the pre-mortem described. Capability arrived. Deployment followed. The governance architecture to surround it had not been ratified.

 

The Workflows That Come Next

Pharmacy verification and infusion chart preparation are not, in themselves, clinical decision-making. They reduce documentation burden and carry genuine operational value. But they are the entry point, not the ceiling.

Epic’s own Penny agent already handles prior authorisation for thousands of health systems. Agent Factory is the platform through which health systems build their own versions of exactly those capabilities. Prior authorisation sits at the intersection of clinical judgment and payer approval. An AI-generated argument that misrepresents a contraindication, omits a relevant diagnosis, or positions a clinical case in a way that leads a payer to deny appropriate care causes harm that is downstream and deniable. The agent did not make the clinical decision. But the agent shaped the argument that influenced it.

The pre-mortem’s central question, who owns the error, was always pointed at this trajectory. The agent is built by the health system, on Epic’s platform, using Curiosity’s foundation models, in a regulatory environment where no one has yet specified how liability is allocated between vendor and deployer. Advocate Health’s prototypes are the first step of a sequence that leads directly to that question.

 

Colorado Tried to Build the Rails

While health systems were building, legislators in Colorado were attempting to create the governance scaffolding that is still missing at federal level. Three separate AI-related healthcare laws had been passed by June 2026, each addressing a different dimension of the problem, and each confirming the same underlying gap.

Colorado’s original AI Act, SB 24-205, was scrapped before it ever took effect. A legal challenge from X.AI in April 2026, supported by federal intervention from the DOJ, led to enforcement being suspended and the legislature repealing the law entirely. Its replacement, SB 26-189, was signed on 14 May. It is a narrower law, retaining consumer notice requirements and the right to meaningful human review following adverse outcomes, but dropping the duty-of-care standard and mandatory impact assessments that had made the original controversial. It takes effect on 1 January 2027.

HB 26-1139, signed on 2 June, constrains how payers use AI in coverage determinations. It requires that AI-driven decisions be based on the patient’s individual medical and clinical history rather than group data, and that any denial or delay of coverage based on medical necessity receive review by a licensed clinician. It too takes effect on 1 January 2027.

Together, SB 26-189 and HB 26-1139 create obligations on both sides of the prior authorisation workflow. Neither specifies who bears the cost when an agent-generated output leads to the wrong clinical outcome. Three laws confirming the gap exists is not the same as closing it.

 

The Sequence Is Not a Prediction. It Is a Pattern.

On 1 June 2026, eight days before the pre-mortem was published, the Joint Commission launched its first voluntary AI certification programme for healthcare organisations. Built on the initial guidance published with the Coalition for Health AI in September 2025, the certification covers governance, data management, risk and bias reduction, and monitoring. It is a meaningful step forward. But the certification recognises organisations, not individual tools: it does not validate or certify individual AI products. It contains no discussion of liability allocation. It is a framework for responsible intent, not a mechanism for accountability when something goes wrong.

Epic has not published a liability framework specifying what a health system owns when a self-built Agent Factory agent produces a clinical error. No Epic contract language or public terms of service document does so. No federal regulatory body has published guidance specifically addressing liability allocation for agentic AI operating within EHR environments. The FDA has authorised more than 1,400 AI-enabled devices but has issued no specific enforcement guidance for agentic AI in EHR environments.

The pre-mortem’s conclusion was that if Epic published a clear liability framework and paired it with a safety review mechanism, Agent Factory could become the defining infrastructure layer of hospital AI over the next decade. That conclusion stands. What the evidence now confirms is that the clock is not running from some future launch date.

It was already running.

Pre-Mortem: Epic Agent Factory

Update, 14 June 2026: One of America’s largest health systems was already building Agent Factory agents in late March, weeks before this piece was published. A follow-up piece, “Already Building: Epic Agent Factory and the Governance Gap”, confirms the central argument.


 

Epic unveiled Agent Factory at HIMSS 2026 (March 2026), positioning it as a no-code, drag-and-drop visual builder that lets health systems design, deploy, and monitor their own autonomous AI agents inside the Epic environment. Alongside it came Curiosity, a family of generative medical foundation models trained on deidentified records from 300 million patients across 310 health systems, backed by a research preprint on arXiv first published in August 2025. Together, the announcements represent Epic’s move from AI vendor to AI infrastructure provider, handing health systems the tools to build clinical automation at their own pace and on their own terms.

A pre-mortem is a discipline borrowed from project risk management. Before a programme succeeds or fails, you ask: if this does not go as planned, what was the mechanism? This series applies that lens to major AI-in-industry announcements, not to predict failure but to surface the questions that deserve answers before deployment, not after.

 

The Bet

Epic is betting that health systems want to own their AI destiny. Phil Lindemann, VP of Data and Research, framed Agent Factory as enabling customers to implement AI solutions without needing to call a vendor or write a line of code. That is a significant commercial and philosophical shift. Epic’s existing suite, Art, Penny, and Emmie, has posted credible numbers: 42 per cent reduction in prior authorisation submission time at Summit Health, 58 per cent sustained reduction in billing-related service messages at Rush University, 69 per cent early lung cancer detection at The Christ Hospital against a 46 per cent national average. The bet is that health systems, given those results as proof of concept, will want to build the next generation themselves.

 

The Assumption

The assumption underneath Agent Factory is that health system capability is ready to meet platform capability. Canvas Medical CEO Adam Farren noted in HIMSS 2026 commentary that most hospitals are not yet positioned to take advantage of the platform. Agent Factory is in an early phase, with first availability in 2026 and continued rollout in 2027. Epic’s own roadmap, and the organisational readiness required for clinical agent deployment, put realistic momentum at leading health systems two to three years out. The platform may well be sound. The question is whether the organisations it serves have the clinical informatics depth, the governance infrastructure, and the project bandwidth to build and validate autonomous agents safely, particularly in clinical rather than administrative workflows.

 

The Sequence

Epic shipped the capability before any ratified standard governs what happens when a health-system-built agent makes a clinically significant error. The Joint Commission and Coalition for Health AI published voluntary joint guidance in September 2025, covering governance structures and vendor management. The FDA has authorised over 1,400 AI-enabled devices but has published no specific enforcement guidance for agentic AI in EHR environments. No federal regulatory framework yet specifies how liability for agent-generated clinical errors should be allocated between vendor and deploying health system. The capability is real and available. The governance architecture to surround it is not yet ratified.

 

The Pager

When an Agent Factory-built agent makes a clinically significant error, who owns it? Epic’s public framing places health systems “in the driver’s seat.” That is a positioning statement, not a governance document. No published contract language, terms of service excerpt, or named executive statement specifies who bears liability for agent-generated errors. No Epic accountability framework for self-built agents has been published. KPMG’s Q4 AI Pulse Survey (2025) found that 75 per cent of large-enterprise leaders name security, compliance, and auditability as their top requirements for agent deployment. At present, the answer to the pager question is that nobody has publicly claimed the call.

 

The Proof

Curiosity carries published research behind it: a preprint on arXiv first submitted in August 2025, covering 118 million patients and 151 billion tokens via the CoMET architecture. That is a meaningful evidential bar. Agent Factory has no equivalent published validation. Epic’s self-reported statistic that more than 85 per cent of customers are actively using Epic AI is plausible given market penetration of 43.7 per cent of US hospitals by count and 56.9 per cent by beds, but it refers to the existing suite, not to Agent Factory specifically. No performance benchmarks, error rate thresholds, or clinical outcome commitments for health-system-built agents on Agent Factory appear in any public source.

 

Verdict

If Epic publishes a clear liability framework that specifies what health systems own when they deploy self-built agents, and pairs that with a safety review mechanism before clinical agents go live, Agent Factory could become the defining infrastructure layer of hospital AI over the next decade. The foundation is genuinely strong: real outcome data from deployed agents, a clinically substantiated foundation model, and a market position that no competitor can easily replicate. The Curiosity publication demonstrates that Epic is capable of meeting an external evidential standard. The question is whether it applies that same rigour to the governance scaffolding around Agent Factory before health systems start building in earnest, rather than after the first serious incident forces the issue.

Pre-Mortem: The Pentagon’s Autonomous Drones Reset

 

The Pentagon’s Replicator programme promised thousands of cheap autonomous drones in two years and delivered hundreds. The response has not been to wind it down. It has been to dissolve it, rebuild it as a new command inside Special Operations Command, and ask Congress for roughly 240 times the money. A programme that under-delivered on a lean, fast model is being re-attempted on a vast one, and the case for why the second structure succeeds where the first did not has not yet been made in public.

A pre-mortem asks the same five questions, every time, applied to a current programme before failure is possible rather than after. This is the third in the series. The first looked at vendor accountability in regulated finance. The second looked at clinical safety accountability in regulated healthcare. This one looks at execution accountability in defence procurement, the hardest delivery environment of them all. Different sector, similar structural shape: commitment moving faster than the architecture meant to hold it to account.

 

The Bet

The bet is that scale fixes what speed could not. Replicator was announced in August 2023 with a target of multiple thousands of all-domain attritable autonomous systems inside roughly two years, run by the Defense Innovation Unit on about a billion dollars across two fiscal years. It was deliberately lean, built to route around the traditional acquisition machine. By the deadline it had fielded hundreds. The reset, the Defense Autonomous Warfare Group, carries a 2027 budget request of about $54 billion, against roughly $226 million the year before. The technical bet is sound on its face: mass autonomy is where warfare is going, and the United States cannot afford to be slow to it. The harder bet, the one sitting under the headline number, is that money and a command structure fix what was an execution problem. Those are different things, and the launch treats them as one.

 

The Assumption

One belief is doing all the work: that Replicator’s shortfall was a problem of resourcing and structure, solvable with more of both. The documented failures point elsewhere. Systems were selected that proved unreliable, too expensive, or too slow to manufacture at the quantities needed. Some existed only as a concept when they were chosen. And the programme could not procure software able to orchestrate and command large, mixed swarms of different drones, which is the actual technical heart of autonomy at scale. None of those is a budget problem. A bigger budget buys more of the same systems and more of the same integration gap. If the diagnosis is wrong, the cure scales the disease.

 

The Sequence

Commitment came before the architecture, again. Replicator launched in August 2023. A second line of effort, focused on countering small drones, was added by a Secretary of Defense memo in September 2024. The original thousands-by-2025 deadline arrived with hundreds delivered. The programme was then consolidated into a joint interagency task force, dissolved, and rebuilt as the new autonomous-warfare group inside Special Operations Command, with the first acquisition under the new structure landing in January 2026, two counter-drone systems. Only in April 2026 did the Secretary tell the House Armed Services Committee that a sub-unified command for autonomous warfare was coming. The command meant to own this is still being stood up around a commitment already made. The funding tells the same story. Of that $54 billion, only about $1 billion is appropriated base money. The other $53 billion is a request, parked in a flexible five-year reconciliation pot that Congress has not yet passed. The headline number signals overwhelming commitment. In hard terms it is roughly a billion dollars in hand and fifty-three billion in hope. The intention is real. The money, for now, is one dollar in every fifty-four.
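The arithmetic behind those funding figures can be checked in a few lines. This is a minimal sketch that uses only the approximate numbers quoted in this piece; the variable names are illustrative and are not drawn from any budget document.

```python
# Approximate figures as quoted in the text, in US dollars.
request_2027 = 54e9       # about $54 billion requested for 2027
prior_year = 226e6        # roughly $226 million the year before
appropriated_base = 1e9   # about $1 billion of appropriated base money

# How many times larger the request is than the prior year's figure.
multiple = request_2027 / prior_year

# The portion that still depends on a reconciliation bill passing.
pending = request_2027 - appropriated_base

# The share of the headline number that is actually in hand.
in_hand_share = appropriated_base / request_2027

print(f"Request vs prior year: {multiple:.0f}x")
print(f"Pending in reconciliation: ${pending / 1e9:.0f} billion")
print(f"Share in hand: {in_hand_share:.1%}")
```

On these rounded inputs the multiple comes out at about 239, which is the “roughly 240 times” of the opening paragraph, and the share in hand is just under 2 per cent, or one dollar in every fifty-four.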

 

The Pager

Start with the credit, because it is real. The new group has a named director, Lt. Gen. Francis L. Donovan (USMC), with a clear command line and an appointment made by the Secretary himself. That is more named, senior accountability than most large defence programmes ever put on the public record, and it counts for something. The harder question is operational and specific. Standing policy requires appropriate levels of human judgement over the use of force. At swarm scale, with attritable systems acting at machine speed, who is the named individual accountable when one of them engages wrongly? The command line is clear. The accountability for the autonomous decision itself, at the scale this programme is built to reach, has not been framed in public. A command answers for a programme. It is a harder thing to say who answers for a single autonomous engagement when there are thousands of them in the air.

 

The Proof

The committed measures are input measures. Dollars requested, units contracted, the first systems bought. There is no public outcome measure for capability actually delivered, no cost per effective intercept, no fielded-and-working-at-scale figure with a date attached. This matters because the proof problem already bit once. Leadership called Replicator on track in 2024 and said it had made enormous strides in 2025, while the independent accounting found hundreds, not thousands. When the people who own the programme also own the definition of progress, optimism outruns delivery. Second-attempt scepticism is earned, not unfair. In eighteen months, the question of whether this worked will be answered by whoever holds the platform to define what delivered at scale means, and right now that platform is a budget request.

 

Verdict

This is a serious programme with serious people behind it. The strategic logic is correct: mass autonomy matters and slowness is its own risk. The accountability has a name and a rank, which is rare. The first systems have been bought and are heading to the field. None of that is in doubt.

What is unproven is whether a command and a budget can fix a problem that was about manufacturing maturity, software orchestration, and realistic system selection. A reorganisation addresses none of those by itself.

The action is concrete. Publish the outcome measure, not the input: a fielded-and-working-at-scale metric with a date, committed before the reconciliation money is spent, not after. Name the human accountable for autonomous engagement decisions at scale, not only the command that owns the programme. And diagnose the first shortfall in public before scaling, so the much larger second bet rests on a corrected understanding rather than a hope.

If the department publishes a delivered-at-scale outcome measure tied to a named owner, and solves the swarm-orchestration software problem it could not solve the first time, this becomes the programme that proves autonomous capability can be fielded at speed. Without both, it becomes the most expensive way yet found to relearn that money and reorganisation do not fix an execution problem.

Pre-Mortem: NHS Frontline Productivity Programme

 

On 1 April 2026, NHS England formally launched the Frontline Productivity Programme. It succeeds the £2 billion Frontline Digitisation Programme and is anchored to the NHS 10-Year Health Plan. The headline target is a 2% year-on-year productivity gain over three years. The lead use case is Ambient Voice Technology (AVT), AI-powered ambient scribing for clinicians, with £200 million committed in year one. The Department of Health and Social Care (DHSC) and NHS England have appointed Rob Thompson as joint Chief Digital, Data and Technology Officer.

A pre-mortem asks the same five questions, every time, applied to a current programme. This is the second in the series. The first looked at vendor accountability in regulated finance. This one looks at clinical safety accountability in regulated healthcare. Different sector, similar structural shape.

 

The Bet

The NHS is betting that AVT can deliver enough of the 2% year-on-year productivity gain to justify scaling deployment to tens of thousands of clinicians faster than the clinical safety framework for AI-enabled ambient scribing can be ratified. The technical bet rides on multi-site evidence led by Great Ormond Street Hospital (GOSH) across nine London NHS sites and 17,000 patient encounters: a 23.5% increase in patient interaction time, an 8.2% reduction in appointment length, and a 13.4% increase in A&E patients per shift. The strategic bet is that 19 self-certified suppliers competing for trust contracts will produce price discipline without producing safety variance. Reasoned bets, made under genuine pressure, backed by measurable evidence. But they are bets, and the framing reads as inevitability.

 

The Assumption

One belief is doing all the work: that clinicians using AVT will verify AI-generated notes against the patient context every time, at scale, rather than develop the same review-as-rubber-stamp pattern automation has produced in every regulated environment it has reached. The mechanism that produces the productivity gain is the same mechanism that erodes clinical attention to the note. If review thins because AVT proves “good enough” most of the time, the productivity number stays positive while clinical safety quietly degrades. Patient Safety Learning argued earlier this year that Copilot has arrived in the NHS without the operational guidance clinicians need to use it safely.

 

The Sequence

Capability shipped before the operational governance for AI-enabled ambient scribing was ratified. South West London is rolling out AVT to 20,000 clinicians across four trusts. University Hospitals of Leicester and Northamptonshire have deployed to over 10,000. Hertfordshire Community NHS Trust has moved past pilot to full rollout. NHS England published a 19-supplier self-certified AVT registry in January. Underneath, the clinical safety standards DCB0129 and DCB0160 are under active review, and the Explainability-Enabled Clinical Safety Framework for AI is still being developed. Commitment came first. The assurance framework is catching up.

 

The Pager

The accountability layer on this programme is more developed than most national digital programmes ever achieve. Rob Thompson holds a joint DHSC/NHSE Chief Digital, Data and Technology Officer post: senior, named, public, accountable. Chief Clinical Information Officers (CCIOs) at every deploying trust carry statutory DCB0160 deployment accountability. That deserves credit. The harder question is operational. When an AVT-generated note contains a clinically significant error that affects patient care, who is the named individual who carries the pager that night? The trust CCIO? The supplier on the registry? The clinician who signed off the note? The accountability is statutory; the operational reporting line for AI-specific clinical safety failure has not yet been publicly framed for AVT.

 

The Proof

Three outcome measures sit in the public record: the 2% year-on-year productivity gain, the GOSH-led multi-site evaluation, and the Oxford University Hospitals pilot in which 90% of clinicians reported reduced documentation time. All three measure clinician time and patient throughput. None measure clinical safety. A 2025 national cross-sectional study in the Journal of Medical Internet Research (JMIR), covering 178 NHS organisations and 14,747 digital health technology deployments, found that only 17.3% were fully assured against both DCB0129 and DCB0160. At a typical NHS trust, only 24.5% of deployed technologies held both assurances. The standards exist. Compliance with them is patchy. There is no committed measure for AVT-attributable adverse event rate by supplier, the rate at which clinicians materially amend AI-generated notes versus accept them, or DCB0160 compliance inside the AVT registry specifically. In 18 months, “did this work?” will be answered by whoever owns the platform to define what safe enough means.

 

Verdict

The Frontline Productivity Programme is more carefully constructed than most NHS technology programmes of the past two decades. Named senior accountability, real pilot evidence, multiple trusts in genuine production deployment, a clear use case the workforce wants. None of that is in dispute.

What is in dispute is whether the underlying clinical safety assurance layer holds at scale. DCB0129 and DCB0160 exist. Compliance with them currently runs at a quarter of what it should be. The deployments are racing toward 20,000-clinician scale while the AI-specific framework is still being written.

The action is concrete. Name the human at each deploying trust who carries the pager when an AVT-generated note causes patient harm. Demand per-supplier clinical safety performance reports from each of the 19 registry vendors, not self-certifications. Publish a clinical safety outcome measure alongside the productivity target before the year is out: adverse event rate change attributable to AVT, broken out by trust and by supplier.

If NHS England publishes a clinical safety outcome measure tied to a named owner in six months, and the AVT registry shifts from self-certification to audited compliance, the Frontline Productivity Programme becomes a model for AI deployment in regulated public services. Without both, the productivity number stays positive while the question of whether it was worth the clinical safety risk remains structurally unanswerable.

Pre-Mortem: Anthropic’s Wall Street Agentic AI Suite

 

Thirteen of the world’s largest financial institutions just deployed ten autonomous AI agents into the most regulated workflows in finance. None of them has publicly named who is accountable when the agents are wrong. Not the banks. Not the vendor. Not the regulators. The launch on 5 May reads like a milestone. Read closer and it reads like a stress test of every governance assumption the financial services industry operates on.

A post-mortem tells you why something failed once it already has. A pre-mortem asks the same questions before failure is possible. Same five questions, every time, applied to a current programme, announcement, or initiative. This is the first in the series, and the subject is not chosen by accident. The Anthropic Wall Street launch is the clearest example I have seen this year of capability racing ahead of the architecture meant to hold it to account. If you are a CIO, a CRO, or a transformation lead in a regulated industry, the lessons here apply to you whether you are deploying Claude or not.

 

The Bet

Anthropic and the deploying banks are betting that ten autonomous agents can land in the most regulated workflows in finance (underwriting, KYC, credit memos, statement audits) faster than the regulatory architecture can constrain them. The technical bet rides on Claude Opus 4.7’s 64.37% on the Vals AI Finance Agent benchmark and AIG’s quoted 88% accuracy on insurance claims out of the box. The strategic bet is that being first at this footprint, including JPMorgan Chase, Goldman Sachs, Citi, AIG, BNY, Carlyle, Mizuho, and Visa, outweighs whatever comes back from regulators in the next twelve months. Reasoned bets, made by an extraordinarily capable vendor and the most sophisticated buyers in the world. But they are bets, not certainties, and the launch reads as certainty. The CIO of any one of those banks is taking on operational, regulatory, and reputational risk for which the vendor has accepted no published share. That is the bet they should be examining most carefully.

 

The Assumption

One belief is doing all the work: that bank operating models can absorb ten simultaneously deployed agents without the human-in-the-loop quietly thinning where the agents prove reliable. Anthropic’s own commitment depends on it, from the primary announcement: “Users stay firmly in the loop, reviewing, iterating on, and approving Claude’s work before it goes to a client, gets filed, or is acted on.” The history of automation in regulated environments tells a different story. Algorithmic trading kill switches went untriggered because the system appeared to be performing. Automated underwriting reviews became rubber stamps once approval rates looked normal. Every automation failure in regulated finance follows the same arc: human oversight erodes invisibly as the system proves itself, and the erosion becomes visible only after the failure. JPMorgan CIO Lori Beer said it directly at the launch: “The technology can do so much. It’s the actual organization’s ability to digest and absorb it.” That ability is the load-bearing assumption. If it holds, the launch is a milestone. If it does not, the launch is a slow-moving incident.

 

The Sequence

Capability shipped. Ten named agents, Microsoft 365 generally available, Moody’s embedded, more than a dozen banks in production. What was committed before the operational governance for vendor-supplied agentic decisioning was published: all of it. Three weeks earlier, the Fed and the OCC revised Model Risk Management guidance and explicitly excluded agentic AI as “novel and rapidly evolving.” A Request for Information is planned, with no committed timeline. The EU AI Act’s high-risk financial-sector requirements take effect 2 August, twelve weeks after launch. The FCA and PRA decided against creating a dedicated AI Senior Management Function and instead mapped accountability onto existing SMFs that were never designed with autonomous agents in mind. Three jurisdictions. Three different gaps. One vendor launch landing in all of them at once. This is not a regulator being slow. This is a regulator explicitly stating that the rules do not yet apply, while the systems the rules are meant to govern are already in production.

 

The Pager

The banks have named regulatory accountability at the firm level. SMF24 (Chief Operations), SMF4 (Chief Risk Officer), SMF16 (Compliance Oversight) at FCA and PRA-regulated firms hold statutory responsibility for technology, risk, and compliance. Model risk owners at US firm level cover the same ground. Real, senior, public. That deserves credit. However, none of them has been publicly named for the deployment of these specific agents. Inheriting accountability through a job description is not the same as being named as the accountable owner of a programme. The first is the regulatory default. The second is what serious AI governance actually requires. Anthropic has no published vendor accountability commitment for autonomous regulated decisioning. The asymmetry is the entire story. When a Claude-built agent denies a loan that should have been approved, or approves a KYC file that should have been escalated, the pager rings at the bank, with consequences for the bank, while the vendor’s exposure is contractual and capped. The clearest demonstration came six days before the launch itself. On 29 April, Goldman Sachs removed Claude access for its Hong Kong bankers over contractual, regulatory, and geopolitical factors. The bank pulled the product. The vendor did not pull itself out. Whoever absorbs the cost when regulatory fit fails absorbs it alone. Until vendor accountability is publicly framed, every bank deploying these agents is underwriting risk the vendor will not.

 

The Proof

Two outcome measures have been published. 64.37% on Vals AI. 88% on AIG insurance claims out of the box. Both are useful. Neither measures regulated-decision accuracy at scale. There is no committed measure for customer-detriment rate, near-miss frequency, incident reporting cadence to regulators, or the rate at which human reviewers actually amend agent outputs versus rubber-stamp them. The banks deploying these agents do not yet have public outcome commitments either, and that absence is its own answer. Former CFO Alyona Mysko captured what is at stake: “In finance, 99% correct is still wrong.” In eighteen months, the question “did this work?” will be answered by whoever owns the platform to define what “work” means. Right now, that platform is the vendor’s marketing. The banks need to claim that platform back, in their own outcome language, before the metric is set by a third party with no skin in their game.

 

Verdict

The launch is genuinely significant. More than a dozen named banks in production, industry-leading benchmark performance, audit logs in the Claude Console, the deepest Microsoft and Moody’s integrations any AI vendor has shipped. None of that is in dispute.

What is in dispute is whether the deploying banks have done the work to fill the accountability gap that the vendor has not closed and the regulators have not yet defined. The lesson generalises beyond Anthropic and beyond banking. Any CIO buying agentic AI in a regulated industry (healthcare, insurance, energy, the public sector) is operating in the same gap, and most have not yet noticed.

The action is concrete. Name the human in your organisation who carries the pager when the agent is wrong. Demand a vendor accountability schedule before you sign, not after. Define your own regulated-decision outcome measure and publish it, so the standard your performance is judged against is one you helped set.

If Anthropic publishes a vendor accountability commitment in the next six months, and a major bank commits to a public regulated-decision outcome measure tied to a named owner, this becomes a case study other industries will return to for years. Without both, it becomes the most expensive procurement lesson the industry buys this decade.