The Bottleneck Was Never the Code: Enterprise Design Thinking for the AI-Augmented IT Organization
A Working Example of Enterprise Design Thinking with AI Coding Assistants – The MeridianGuard Claims Modernization Story
A standalone companion case study
Companion to: “Enterprise Design Thinking in the Age of AI Coding Assistants”
*Disclaimer: This is a fictitious case study, crafted to illustrate the impact Enterprise Design Thinking can have on projects that make extensive use of new AI coding assistants. The names, numbers, and moments are composites; the lessons are real.*
Why This Story
AI coding assistants compress the time it takes to build software. They do not, on their own, compress the time it takes to understand what to build. The Working Example in the original article – an enterprise IT team modernizing a legacy claims processing system at an insurance company – is short on purpose, because the framework is the point. This companion piece fills in the story behind the outcome. It walks through six months in the life of one project: how the team observed, what they framed, what they made, who they listened to, what they killed, and what changed for the people the system was supposed to serve.
The names of the company and the people are composites, drawn from real patterns observed across multiple enterprise modernization programs. The behaviours are not invented. The story is intended to be useful for any team that has been handed AI coding assistants and a six-month mandate and is wondering, honestly, how to make the time savings translate into outcomes that matter.
The Project at a Glance
MeridianGuard Insurance is a mid-sized US property and casualty insurer. Its claims processing platform, CADRE (Claims Administration, Decisioning, Routing and Evaluation), was originally built in 2003 on a mainframe foundation, with a Java middleware layer stitched on top in 2012. Claims adjusters use a desktop client to triage and resolve incoming claims. At the start of this project, the average time to resolve a routine claim was eleven minutes, the cost per claim was around thirty-two dollars, and adjuster Net Promoter Score for the system was negative fourteen.
The CIO had authorized a six-month modernization effort with a clear mandate: make claims adjusters faster on routine claims without compromising fraud detection, regulatory compliance, or audit trails. The team had access to a curated stack of AI coding assistants – IBM Bob for codebase grounding and refactoring, Claude Code for prototyping and exploratory work, GitHub Copilot inside the IDE. The budget was substantial but not unlimited, and success would be measured by adjuster outcomes, not by lines of code shipped or features delivered.
The team had ten people. Four engineers, one of them the engineering lead. Two product managers, splitting business-facing and engineering-facing responsibilities. Two designers and one senior designer embedded full time. An operations lead representing site reliability and platform engineering. Together, they had real authority to decide what was in and what was out.
Five Sponsor Users committed to the project: three claims adjusters working different lines (personal, commercial auto, property), one claims supervisor, and one fraud analyst. Each agreed to a two-hour weekly engagement and attendance at biweekly playbacks. They were given decision-making authority on user-facing trade-offs, not just advisory input.
Phase 1 – Observe (Weeks 1-3)
The team resisted the easiest starting point. Nobody fed the legacy CADRE codebase to an AI and asked for a modern equivalent. Instead, they invested the first three weeks in observation, on the theory that a six-month project will be saved or sunk by the first three weeks of listening.
Elena Vasquez, the designer, led a contextual inquiry program. Each adjuster, supervisor, and fraud analyst was shadowed for a full workday by an engineer and a designer paired together. The pairing was deliberate: engineers needed to see the work with their own eyes, and designers needed to hear the system constraints as the engineers heard them. Twelve sessions were recorded with consent. The transcripts ran to nearly nine hundred pages.
This is where AI earned its first dollar on the project. Claude was used to cluster the transcripts into themes – context-switching cost, redundant data entry, fraud-indicator overload, slowness of legacy API responses – and to draft an empathy map for each persona. The team did not accept these summaries as truth. Elena cross-checked twenty randomly selected themes against the raw transcripts and corrected three meaningful mischaracterizations before any synthesis was shared. The discipline was simple: AI proposes, humans dispose.
Ravi Subramanian, the operations lead, ran a parallel observation track in the telemetry. He pulled twelve months of CADRE logs through Splunk and asked Claude to help correlate adjuster session duration with backend response times. Two findings emerged that no interview had surfaced. First, the claims search API was returning a P95 latency of 4.2 seconds during the ten-to-noon peak. Adjusters had stopped noticing because they had built habits around the wait – coffee runs, email triage, conversations with colleagues. Second, a single misconfigured policy lookup was responsible for thirty-eight percent of “manual override” actions, which adjusters had been quietly absorbing as “just how the system works.” Neither of these would have appeared in a feature backlog. Both became central to the design.
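The peak-window finding is the kind of question a few lines of analysis can answer once the logs are exported. What follows is a minimal sketch, assuming each log record has already been reduced to an (hour of day, latency in seconds) pair. The record shape, the nearest-rank percentile method, and the sample values are illustrative assumptions, not the team's actual Splunk query.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty sample")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def p95_by_hour(records):
    """Group (hour, latency_seconds) pairs by hour and return {hour: P95 latency}."""
    buckets = defaultdict(list)
    for hour, latency in records:
        buckets[hour].append(latency)
    return {hour: percentile(samples, 95) for hour, samples in sorted(buckets.items())}


# Hypothetical sample: a quiet hour (08:00) and a peak hour (10:00).
sample = [(8, 0.4), (8, 0.5), (8, 0.6), (8, 0.7)] + [(10, 1.0)] * 18 + [(10, 4.2)] * 2
result = p95_by_hour(sample)
print(result)
```

Bucketing by hour is what makes a peak visible: a single all-day percentile would blend the quiet hours into the busy ones and understate the wait adjusters actually experienced.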
Aisha Okafor, the business-facing product manager, walked the regulatory and reinsurance reporting chain. Her question was less about user pain and more about constraints: what had to remain true about the new system regardless of what the team built? Audit trails, state-by-state regulatory reporting, NAIC compliance, OFAC screening hooks. Her observation was that roughly sixty percent of what stakeholders attributed to user interface complexity was in fact compliance complexity inherited from immutable obligations. This framing later protected the team from a tempting but unworkable simplification.
Marcus Chen, the engineering lead, performed a third kind of observation: of the codebase. IBM Bob was indexed against the entire CADRE repository. Marcus did not ask Bob to “explain the code,” which is a question that produces eloquent but shallow answers. He asked Bob to identify the seams between the legacy core and the Java middleware – places where contracts were thin enough to allow a strangler-fig modernization. Bob surfaced fourteen candidate seams. Marcus and two senior engineers manually validated each and rejected six as either too coupled or too operationally critical to touch in a six-month window. AI accelerated the inventory; humans made the architecture decisions.
By the end of week three, the team had a synthesis: a journey map of the adjuster, an inventory of system seams, a compliance map, and a reliability baseline. They also had specific moments that hurt – a forty-seven-second wait after clicking “Open Claim,” a fraud-flag panel that displayed twenty-three indicators with no prioritization, a “Recent Claims” widget that had not been refreshed in three years. Real moments, attached to real people, with names.
Phase 2 – Reflect (Weeks 4-5)
The team gathered for a two-day Hill-writing session. The five Sponsor Users were invited for the morning of day two.
The first draft, generated with AI assistance, read: “A claims adjuster processes claims faster with AI-powered automation.” It was, predictably, vague. It centred automation rather than the adjuster. It said nothing about what the adjuster could newly do.
Aisha pushed back. So did Priya Natarajan, the personal-lines adjuster who would later become the project’s most influential Sponsor User. “Faster isn’t the win,” Priya said. “Faster and confident is the win. Right now, I’m fast on the easy ones and burned out on the hard ones. The system gives me the same energy whether I’m closing a windshield claim or untangling a hit-and-run with three witnesses.” That comment became the seed of the Hill.
After several iterations – with the AI assistant used to stress-test phrasing and generate alternatives, and the Sponsor Users used as the ground truth for whether the wording felt right – the team settled on a single sentence.
A claims adjuster resolves a routine claim in under three minutes – with the right context surfaced automatically – so that the time saved flows into the complex cases where human judgment matters most.
This Hill did three important things. It named a specific user. It defined a concrete, measurable capability. And it identified the differentiated wow factor: the redistribution of attention from routine to complex, not the elimination of either. Two supporting Hills were also written – one for supervisors (workload visibility across their team in under thirty seconds) and one for fraud analysts (suspicious patterns escalated with full context, not raw flags). The team intentionally did not write a Hill for “the system” or “the business.” Hills name humans.
Phase 3 – Make (Weeks 6-18)
The Make phase is where AI coding assistants showed their most visible impact, and where the team’s discipline was most tested.
In the first two weeks of Make, three prototypes were built in parallel. The first, called Context Surface, was a redesigned claim-opening screen that pre-fetched and arranged policy details, prior claims, vehicle history, and fraud indicators into a single readable panel. Two engineers and the designer built it using Claude Code for the React frontend and IBM Bob for the policy-lookup service refactor. Working demo: six days. The second, called Decision Suggest, was an LLM-backed assistant that read the claim, the policy, and the fraud signals and proposed a recommended next action with confidence levels and citations. Two engineers built it with Claude Code, with prompt engineering by Elena and a senior engineer. Working demo: eight days. The third, called Workflow Reweave, was a complete reordering of the adjuster’s screens – fewer clicks, fewer tabs, a single “next action” affordance. One engineer and the designer built it in low-code, using AI to scaffold the UI and accelerate test refactoring. Working demo: five days.
Three prototypes in nineteen person-days. Pre-AI, the team’s honest internal estimate for any one of these had been six to eight weeks.
The Sponsor Users reviewed all three in week eight. The feedback shaped the project decisively. Context Surface was loved by every adjuster. “This is the first time the system feels like it knows me,” Priya said. Approved for inclusion. Decision Suggest, by contrast, scared the fraud analyst and made the supervisor uncomfortable. Sofia Romano, the fraud analyst, said the line that decided it: “I don’t want adjusters to stop thinking. I want them to think better.” The team made the consequential decision to kill the recommendation feature entirely and retain only the citation feature – surfacing relevant prior decisions and policy clauses but never making the call. This was the most important pivot of the project, and it was driven by Sponsor User feedback rather than by metrics. Workflow Reweave had the right instinct but the wrong execution; the single “next action” affordance erased the adjuster’s optionality, and the reordered screens, while universally preferred, were better integrated into the Context Surface design. Merged.
Over weeks eight through eighteen, the team built the production version of the combined design, plus the citation-only descendant of Decision Suggest. The patterns of work were consistent. AI was used aggressively for boilerplate, test scaffolding, refactoring, and code translation between legacy Java services and a new Node and TypeScript edge layer. AI was not used to generate business logic without a human in the loop – every state transition, every regulatory check, every audit-trail entry was hand-reviewed. AI was used to draft documentation, and humans corrected it. Marcus instituted a rule that became something of a team motto: “Bob writes the first draft of the runbook; the person who owns the service writes the second draft.” Operations were embedded from day one, and Ravi’s observability scaffold – structured logging, trace IDs, SLO definitions, runbook stubs – was inherited automatically by every new service. Of the fourteen services touched, all fourteen were observable from first deployment.
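To make the idea of an inherited observability scaffold concrete, here is a minimal sketch of structured logging with a trace ID attached to every line. It uses only the Python standard library for illustration; the service name, the field names, and the helper functions `build_service_logger` and `for_request` are hypothetical, and the team's real scaffold is not described at this level of detail.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        })


def build_service_logger(service, stream):
    """Create a logger for one service that writes JSON lines to the given stream."""
    logger = logging.getLogger(f"scaffold.{service}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


def for_request(logger, service, trace_id=None):
    """Bind a trace ID (generated if absent) so every line for one request can be correlated."""
    trace_id = trace_id or uuid.uuid4().hex
    return logging.LoggerAdapter(logger, {"service": service, "trace_id": trace_id})


# Usage with an in-memory stream so the output can be inspected.
buffer = io.StringIO()
base = build_service_logger("policy-lookup", buffer)
log = for_request(base, "policy-lookup", trace_id="abc123")
log.info("policy fetched")
log.warning("fallback to legacy")
lines = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

Because the service name and trace ID are bound once per request, individual log calls stay short and cannot forget the correlation fields.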
The Cadence: Playbacks Every Two Weeks
Six Playbacks ran across the thirteen weeks of active Make (weeks six through eighteen). Each was structured the same way: ten minutes of “what we learned about the user,” fifteen minutes of “what we built,” fifteen minutes of Sponsor User reaction live in the room, fifteen minutes of stakeholder discussion, and five minutes of “what we’ll do next.”
Two Playbacks stand out. The third, in week eight, was the prototype review in which Decision Suggest was killed. The team had built something impressive. The Sponsor Users rejected it. The temptation to negotiate – could we add an opt-out toggle, could we make it advisory-only – was strong. The team did not negotiate. They listened, agreed, and pivoted in front of the executive sponsors. The CIO later said this was the moment she trusted the project.
The fifth, in week sixteen, was the one where the team admitted that the citation feature was not working as intended in eighteen percent of complex commercial-auto claims because the underlying retrieval system was hallucinating clause references. Rather than hide this until it was fixed, the team led with it. Sofia helped re-frame the retrieval rules; the fix shipped two weeks later. Transparency about failure, in a structured forum, turned out to be a faster path to organizational trust than confidence about success would have been.
Through Their Eyes
This project was not the work of a hero. It was the work of a small cross-functional team that knew when to lean on AI and when to lean on each other. The contributions of each role are best understood through the small decisions they made along the way.
Marcus Chen – Engineering Lead
Marcus had been at MeridianGuard for nine years and had watched two previous modernization attempts collapse under their own ambition. His contribution to this project was, in his own words, knowing what not to ask the AI.
He set the technical guardrails on day one. AI assistants would be used for prototyping, scaffolding, refactoring, test generation, documentation drafting, and code translation. They would not be the primary author of any code path touching money movement, regulatory reporting, or audit trails. Every AI-generated function would be reviewed by a human owner before merge. Every pull request description would distinguish AI-authored from human-authored lines, so that reviewers could calibrate their attention accordingly.
He invested early in observability and tooling so the team could move quickly without fear. He paired junior engineers with senior engineers on AI sessions, on the theory that watching how an experienced engineer prompts and critiques AI is the fastest way to develop judgment about it. He held a weekly fifteen-minute “AI postmortem” where the team shared one thing the AI got right and one thing it got wrong. By month three, the team’s prompts had measurably improved, and the AI’s first-draft quality had improved alongside them.
His biggest contribution may have been a moment in Playback Three, when he stood in front of the CIO and said: “We’re going to kill the feature that demos best, because our users told us it would make their work worse.” That sentence was only possible because Marcus had spent the prior week building the team’s authority to make exactly that decision.
Aisha Okafor and David Park – Product Managers
Aisha owned the business-facing side: regulators, reinsurance, finance, and customer experience. David owned the engineering-facing side: roadmap, prioritization, and scope.
Aisha’s contribution was protective. Her early compliance and regulatory mapping prevented the team from over-promising. When the AI generated a polished workflow that quietly elided OFAC screening, she caught it. When a stakeholder pushed for a customer-facing claim status feature that was out of scope, she negotiated it into a fast-follow rather than letting it bloat the Hill. Her role was the role of someone willing to say no on behalf of constraints the user could not see.
David’s contribution was directional. He owned the Hill as a living artifact, defending it from gentle drift and revising it when evidence demanded. Mid-project, when adjusters reported that twelve percent of the claims they processed were not strictly “routine” by the team’s original definition, David led a Hill refinement: the under-three-minute target now covered a more carefully scoped seventy percent of claims, and a parallel Hill was added for the next-tier “semi-routine” cases. The team did not lose alignment; they tightened it.
Both PMs used AI extensively. Aisha used it to generate compliance scenario tests and to translate regulatory text into requirements. David used it to draft Playback narratives, JIRA tickets, and stakeholder summaries. Both were rigorous about not letting AI-generated artifacts become the team’s official position without human authorship. A draft was always a draft until a human had signed it.
Elena Vasquez – Designer
Elena’s contribution was the work of holding the user in the room when the user was not there. Every Hill draft, every prototype, every code review meeting had Priya’s voice in it because Elena made sure of it.
She built the empathy infrastructure: persona documents, journey maps, and an interactive Figma prototype that mirrored adjusters’ real screen layouts down to the screen resolution and font rendering of their actual workstations. She used AI to accelerate the production of these artifacts – drafting personas, generating UI variations, transcribing interviews – but treated AI output as a starting point and never a destination.
Her hardest contribution was a series of small refusals. She had warned against the Decision Suggest recommendation engine before the Sponsor Users even saw it; she had observed Sofia’s discomfort in an earlier session and predicted the rejection. She was overruled, and the prototype was built anyway. After Sofia’s feedback validated her instinct, the team began to take Elena’s early warnings more seriously. By month four, an unspoken rule had emerged: when Elena said “the users won’t want this,” the team would, at minimum, prototype the alternative.
Elena also designed the Playback itself – its rhythm, its artifacts, its room arrangement. She insisted on Sponsor Users sitting at the front of the room, not in the back. She insisted on showing real prototypes, not slides. She insisted on the team standing while presenting and sitting while listening. Small choices, with cumulative impact on whose voice felt central in the room.
Ravi Subramanian – Operations Lead
Ravi’s contribution was making sure that the system the team was proud of in the demo was the same system that did not wake anyone up at three in the morning.
He embedded observability from day zero. Every new service shipped with structured logging, trace context propagation, SLO definitions, and a runbook stub. He used AI to accelerate this work – Bob generated the first draft of every runbook from the code, site reliability engineers refined them, and the result was usable documentation on day one of every service’s life rather than month six. He owned the migration strategy from CADRE-legacy to the new edge layer, working with two engineers on a strangler-fig pattern where the new system intercepted traffic, served what it could, and fell back to legacy for the rest. AI helped here too – generating the request-mirroring infrastructure, the comparison harness that flagged divergence between old and new responses, and the rollback automation. None of this code was business-critical, which is precisely why it was the right place to lean hard on AI.
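The strangler-fig routing and the comparison harness can be sketched in a few lines. This is an illustration under stated assumptions, written in Python for brevity even though the project's edge layer is described as Node and TypeScript. The handler signatures, the request and response dictionaries, and the names `strangler_route` and `mirror_and_compare` are all hypothetical.

```python
def strangler_route(request, new_handler, legacy_handler, migrated_paths):
    """Serve from the new system when the path is migrated and the call succeeds; otherwise use legacy."""
    if request["path"] in migrated_paths:
        try:
            return "new", new_handler(request)
        except Exception:
            pass  # any failure in the new path falls through to legacy
    return "legacy", legacy_handler(request)


def mirror_and_compare(request, new_handler, legacy_handler, ignore_fields=()):
    """Call both systems; return the legacy response (still the source of truth) and the diverging fields."""
    legacy = legacy_handler(request)
    try:
        candidate = new_handler(request)
    except Exception as exc:
        return legacy, [f"new handler raised {type(exc).__name__}"]
    keys = (set(legacy) | set(candidate)) - set(ignore_fields)
    diverging = sorted(k for k in keys if legacy.get(k) != candidate.get(k))
    return legacy, diverging


# Hypothetical handlers standing in for the legacy platform and the new edge layer.
def legacy_handler(request):
    return {"claim_id": request["claim_id"], "status": "OPEN", "source": "legacy"}


def new_handler(request):
    if request["claim_id"] == "C-3":
        raise RuntimeError("claim type not yet supported")
    return {"claim_id": request["claim_id"], "status": "open", "source": "edge"}


migrated = {"/claims/search"}
served_ok = strangler_route({"path": "/claims/search", "claim_id": "C-1"}, new_handler, legacy_handler, migrated)
served_fallback = strangler_route({"path": "/claims/search", "claim_id": "C-3"}, new_handler, legacy_handler, migrated)
served_unmigrated = strangler_route({"path": "/payments", "claim_id": "C-1"}, new_handler, legacy_handler, migrated)
_, divergence = mirror_and_compare({"path": "/claims/search", "claim_id": "C-1"}, new_handler, legacy_handler, ignore_fields=("source",))
```

In the sample, the harness reports `status` as diverging because the two systems disagree on letter case, which is the kind of small contract drift a mirroring setup is meant to catch before traffic is cut over.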
Ravi’s quiet superpower was that he made operational excellence cheap. Because every service was observable on day one, the team had honest data on every prototype’s real-world behaviour. Because rollbacks were a one-command operation, the team felt safe shipping smaller, more frequent changes. Because runbooks existed before incidents, on-call engineers had fewer three a.m. surprises. His most consequential contribution was a single sentence in Playback Four: “We’ve shipped four times this sprint. None of them paged anybody.” That sentence, more than any feature, convinced the executive sponsors that AI-accelerated delivery was not a stability risk.
A Day in the Life – Priya, claims adjuster, Before
Priya Natarajan is thirty-eight years old and has been at MeridianGuard for eleven years. She works in personal lines and ranks in the top quartile of her unit on quality and the top half on speed. She processes about twenty-two claims a day. What follows is a workday in the world of CADRE as it existed before the project began.
At six fifty-five in the morning, Priya logs into CADRE from her home office. The desktop client takes ninety seconds to spin up. She uses the time to get coffee. At seven, her queue shows thirty-one claims. Three are flagged urgent. She opens the first.
It is a windshield claim. It should take three minutes. It takes nine. The opening screen loads in forty-seven seconds. She tabs through six panels – policy summary, vehicle, claim narrative, prior claims, payments, notes – because the system arranges them in alphabetical order rather than in the order a human would read them. She copies the VIN into a separate vehicle-history tool that opens in a browser. She copies the policy number into a separate policy-lookup tool because the embedded policy summary in CADRE is two months out of date. She returns to CADRE. The fraud-indicator panel shows fourteen yellow flags. Twelve are false positives she has learned to ignore over the years; she filters them mentally. The remaining two are routine for a windshield claim. She approves the claim, enters her notes (the notes field accepts only two hundred and fifty-six characters and does not autosave), and clicks submit. The system thinks for eleven seconds, then asks her to confirm. She confirms. One claim done. Thirty to go.
By nine-thirty she is on her seventh claim. It is a sideswipe with disputed liability. She has been on it for twenty-two minutes. There are two prior claims on this policyholder. The system shows her the claim numbers but not the outcomes. She opens two new tabs. The legacy report on each takes about thirty-five seconds to render. She reads them, closes them, then realizes she needs to compare them and reopens both. She decides and writes it up. She submits. By eleven forty-five the system is slow. P95 latencies are bad during the peak. Priya has stopped noticing.
At two-ten in the afternoon Priya has been at it for seven hours and is on her seventeenth claim. It is a routine fender-bender, but she is tired. She mis-keys a date. The system accepts it. She catches the error herself thirty seconds later, finds the field, edits it. The audit log now shows an edit, which she will need to explain in a free-text field that will be reviewed later. She submits.
By four-thirty she has finished twenty-two claims. Two are flagged for QA review, both because of late-afternoon mistakes on routine cases – exactly the cases she should have had energy for, but did not. The complex case she handled at nine-thirty will be re-reviewed by her supervisor because Priya does not feel confident about her own decision. She is right not to feel confident: the system gave her almost no support on the hardest case of her day. At four forty-five she logs off. Her shoulders hurt. She has had nine point four hours of screen time, roughly fourteen hundred keystrokes, and eight hundred seventy mouse clicks. She did not, today, do her best work.
A Day in the Life – Priya, claims adjuster, After
Six months later. The same Priya. The same queue. The same coffee.
At seven in the morning Priya logs into the new claims experience. It loads in four seconds. Her queue still shows thirty-one claims, but the queue is now ordered by what has changed since she last looked at each item: claims with new information at the top, claims awaiting customer response below, urgent flags pinned. She opens the first.
The windshield claim opens in two seconds. The screen is a single panel – policy summary, vehicle, narrative, recent activity – arranged in the order Priya reads. The policy data is live, not two months stale. The vehicle history is embedded. The fraud-indicator panel shows two flags, not fourteen; the system has learned which indicators are false-positive-noisy on which claim types and de-prioritizes them by default while leaving the full list one click away. The two flags that remain are the ones Priya would have flagged manually. She reads. She approves. She types notes that are autosaved with no character limit. She submits. The system does not ask her to confirm. It has learned that on simple claims the second click is friction without value, and the audit trail captures the same information either way. Two minutes flat. One claim done.
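One simple way to picture the flag de-prioritization is a per-claim-type noise table with a cutoff. The sketch below is an assumption about how such a rule could look, not a description of the system's actual logic: the flag names, the false-positive rates, and the 0.8 threshold are all hypothetical. Nothing is discarded; noisy flags are only moved to a collapsed list that stays one click away.

```python
def split_flags(raised_flags, claim_type, noise_rates, threshold=0.8):
    """Split raised flags into (prominent, collapsed) using historical false-positive rates.

    noise_rates maps (claim_type, flag) to the observed false-positive rate.
    Flags with no history for this claim type stay prominent.
    """
    prominent, collapsed = [], []
    for flag in raised_flags:
        rate = noise_rates.get((claim_type, flag), 0.0)
        if rate >= threshold:
            collapsed.append(flag)
        else:
            prominent.append(flag)
    return prominent, collapsed


# Hypothetical history: the same flag can be noisy on one claim type and useful on another.
noise_rates = {
    ("windshield", "recent_policy_change"): 0.93,
    ("windshield", "address_mismatch"): 0.88,
    ("windshield", "repeat_glass_claim"): 0.12,
    ("commercial_auto", "address_mismatch"): 0.35,
}
raised = ["recent_policy_change", "repeat_glass_claim", "address_mismatch", "unlisted_driver"]
prominent, collapsed = split_flags(raised, "windshield", noise_rates)
```

Defaulting unknown flags to the prominent list is the cautious choice: a flag the system has never seen on this claim type is shown rather than hidden.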
At nine she is on her tenth claim. The disputed-liability sideswipe that took her twenty-two minutes in the old world has just opened. The new system has surfaced the two prior claims with their outcomes and the policy clauses that were applied. It has not made a recommendation – that is Priya’s job – but it has done the reading. She reviews the citations. She forms a view. She wants to verify one clause. She clicks. The clause is shown in full, including the historical version that was in force at the time of the prior claim, because the system knows that policies change and that the right clause is the one that was active when the loss occurred. Priya makes her decision in nine minutes. She is confident. She submits.
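Showing the clause that was in force on the date of loss is, at its core, a lookup by effective date. A minimal sketch, assuming each clause keeps a list of versions sorted by the date they took effect; the data layout, the function name `clause_in_force`, and the sample clause text are hypothetical.

```python
from bisect import bisect_right
from datetime import date


def clause_in_force(versions, loss_date):
    """Return the clause text in force on loss_date.

    versions is a list of (effective_date, text) tuples sorted ascending by effective_date.
    The version in force is the latest one whose effective date is on or before the loss date.
    """
    effective_dates = [effective for effective, _ in versions]
    index = bisect_right(effective_dates, loss_date) - 1
    if index < 0:
        raise LookupError("no clause version was in force on that date")
    return versions[index][1]


# Hypothetical version history for a single clause.
glass_clause = [
    (date(2019, 1, 1), "Glass repair covered with a $100 deductible."),
    (date(2022, 7, 1), "Glass repair covered with no deductible."),
]
at_prior_claim = clause_in_force(glass_clause, date(2021, 3, 15))
at_current_claim = clause_in_force(glass_clause, date(2022, 7, 1))
```

Using `bisect_right` means a loss on the exact day a new version takes effect resolves to the new version; whether that boundary rule is correct for a given policy is a business decision, not a technical one.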
By eleven forty-five the system is still fast. By two-ten in the afternoon she is on her twentieth claim of the day. She is not tired in the same way she used to be tired. The routine claims have taken about two minutes each. She has saved the energy for the harder ones. She is on a complex commercial-auto claim now, and she is present in a way she did not have the bandwidth to be present six months ago.
At four in the afternoon Priya has finished twenty-eight claims, six more than her old average. None are flagged for QA review. The complex case from this morning will not be re-reviewed by her supervisor because her notes and reasoning are complete enough to stand on their own. She closes her laptop. Her shoulders are fine. She did, today, her best work – and she has the strange experience, increasingly common in her unit, of feeling that the day went quickly.
The number the executives will cite – average time to resolve a routine claim dropped from eleven minutes to two minutes and forty seconds – does not, on its own, capture this. What changed is not just the time. It is the redistribution of Priya’s attention from the routine to the complex. That is what the Hill said. That is what the team built.
Outcomes That Mattered
Six months after the new system went into full production, the team published its outcomes report. The headline number was the one the Hill had named, but the surrounding numbers told a richer story – about the adjusters, about the business, about the system, and about the way the team itself learned to work with AI in the loop.
User outcomes (the Hill, in numbers)
The average time to resolve a routine claim fell from eleven minutes to two minutes and forty seconds, a seventy-six percent reduction. Adjuster Net Promoter Score for the system moved from negative fourteen to positive forty-one, the largest single-system NPS swing the company had ever measured. Quality flags raised on routine claims fell by thirty-eight percent – the late-afternoon mistakes that Priya used to make on easy work largely disappeared. Time spent on complex claims rose by twenty-four percent, which the team had hypothesized and which the data confirmed. This was treated as a success, not a regression. Supervisor escalation rates on complex claims did not move, but supervisors reported that the quality of escalation packages improved markedly. Fraud analysts received twenty-nine percent fewer noisy alerts and forty-one percent more well-contextualized ones, a redistribution they described as “finally being asked the right questions.”
Business outcomes
Cost per claim fell from approximately thirty-two dollars to approximately eighteen dollars, a forty-four percent reduction driven primarily by adjuster time and partly by reduced rework. Throughput per adjuster rose from twenty-two claims a day to twenty-eight on average, with the top quartile reaching thirty-four. Customer-reported reopened-claim rate fell by twenty-two percent, suggesting that first-pass decisions were better, not just faster. Regulatory and audit findings during the post-launch reviews were zero, which Aisha’s compliance mapping had been designed to protect.
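The headline percentages can be checked directly from the before and after figures reported here. A small sketch of the arithmetic; the helper name `percent_change` is an illustrative choice, and the inputs are the rounded figures quoted in this report.

```python
def percent_change(before, after):
    """Signed percentage change from before to after (negative means a reduction)."""
    return (after - before) / before * 100


# Routine claim time: eleven minutes down to two minutes forty seconds, in seconds.
time_change = percent_change(11 * 60, 2 * 60 + 40)

# Cost per claim: approximately $32 down to approximately $18.
cost_change = percent_change(32, 18)

# Throughput per adjuster: 22 claims a day up to 28.
throughput_change = percent_change(22, 28)

print(round(time_change), round(cost_change), round(throughput_change, 1))
```

Rounded, these come to a 76 percent reduction in routine claim time, a 44 percent reduction in cost per claim, and a throughput gain of about 27 percent.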
System and reliability outcomes
Across the fourteen services modernized, every single one shipped with observability from day one – structured logging, trace context, SLO definitions, and an initial runbook. SLO compliance held above 99.9 percent across all new services through the first three months in production. Mean time to recovery for incidents fell from forty-seven minutes to eleven minutes, driven by the runbook-on-day-one discipline and one-command rollback automation. Change failure rate (the share of deploys that triggered a rollback or hotfix) was four percent, well inside the team’s eight percent guardrail. The team deployed an average of nine times per week into production during steady-state Make, up from roughly one deploy per fortnight on the legacy CADRE platform.
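Change failure rate, as defined above, is straightforward to compute from a deploy history. A minimal sketch, assuming each deploy is recorded with two boolean fields; the record shape and the sample history are hypothetical, and only the eight percent guardrail comes from the report.

```python
def change_failure_rate(deploys):
    """Fraction of deploys that triggered a rollback or a hotfix."""
    if not deploys:
        raise ValueError("no deploys recorded")
    failed = sum(1 for d in deploys if d["rolled_back"] or d["hotfixed"])
    return failed / len(deploys)


def within_guardrail(rate, guardrail=0.08):
    """True when the failure rate is at or below the guardrail (eight percent by default)."""
    return rate <= guardrail


# Hypothetical history: 25 deploys, one of which needed a rollback.
history = [{"rolled_back": False, "hotfixed": False} for _ in range(24)]
history.append({"rolled_back": True, "hotfixed": False})
rate = change_failure_rate(history)
ok = within_guardrail(rate)
```

A deploy that was both rolled back and hotfixed counts once, since the measure is the share of deploys that failed, not the number of remediation actions.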
Engineering and coding outcomes
This is where the AI coding assistants told their honest story. Roughly sixty-two percent of lines committed to the repository during the project were AI-authored in the first draft. Of those, the team estimates that around seventy-three percent were merged with no human edits beyond formatting (mostly tests, scaffolding, refactoring, infrastructure-as-code, and runbook drafts), and the remaining twenty-seven percent were materially rewritten or rejected by the human reviewer. Crucially, only about fourteen percent of business-logic changes were AI-authored; the rest were human-written, often with AI assistance for refactoring or tests but with humans firmly in the driver’s seat. This split – AI on the boilerplate, humans on the decisions – is, in retrospect, the right shape.
Prototype velocity was where AI showed its starkest impact. Three competing prototypes were built in nineteen person-days, against an honest pre-AI internal estimate of six to eight weeks for any one of them. Across the project, average pull-request cycle time (from open to merge) fell to nine hours, down from a legacy-team baseline of just over three days. The team’s first-time test pass rate on AI-generated code climbed from forty-one percent in week one to seventy-eight percent by month three, almost entirely on the back of Marcus’s weekly fifteen-minute “AI postmortem” – a small ritual that compounded into measurably better prompts and better first drafts.
Quality did not regress with speed. Escaped defects (bugs discovered in production within thirty days of release) ran at 0.6 per ten thousand lines of merged code, compared to a company-wide legacy baseline of 1.4. Critical-severity defects in AI-authored code were not measurably higher than in human-authored code on a per-line basis – a finding the team attributes less to the AI and more to the rule that every AI-generated line went through a human owner before merge.
Documentation, historically the first thing to slip on modernization projects, was complete on day one of every service’s life. Every runbook had a first draft generated by IBM Bob and a second draft authored by the service owner; the result was usable on-call material before any service ever paged anyone. Internal developer NPS on the AI-assisted workflow was positive fifty-two by the end of the project, with the strongest endorsements coming from junior engineers who said the AI helped them produce work closer to senior quality and from senior engineers who said the AI freed them to spend more time on architecture and review.
Outcomes at a glance
| Outcome | Before | After |
|---|---|---|
| Time to resolve a routine claim | 11 min | 2 min 40 sec (−76%) |
| Adjuster NPS | −14 | +41 |
| Quality flags on routine claims | Baseline | −38% |
| Time spent on complex claims | Baseline | +24% (intended) |
| Throughput per adjuster (claims/day) | 22 | 28 (top quartile: 34) |
| Cost per claim | ~$32 | ~$18 (−44%) |
| Reopened-claim rate (customer-reported) | Baseline | −22% |
| Fraud alert noise vs. signal | High noise | −29% noisy / +41% contextualized |
| SLO compliance (new services) | n/a | >99.9% |
| Mean time to recovery (incidents) | 47 min | 11 min |
| Change failure rate | ~12% (legacy) | 4% |
| Deploy frequency | ~1 per fortnight | ~9 per week |
| Pull-request cycle time | ~3 days | 9 hours |
| Prototype build effort | 6–8 weeks for one (pre-AI estimate) | 3 in 19 person-days |
| AI-authored share of committed lines | n/a | ~62% |
| AI-authored share of business-logic changes | n/a | ~14% |
| AI first drafts merged without rewrite | n/a | ~73% |
| First-time test pass rate on AI-generated code | 41% (week 1) | 78% (month 3) |
| Escaped defects (per 10k lines, first 30 days) | 1.4 (company baseline) | 0.6 |
| Day-one runbook coverage on new services | Rare | 100% |
| Internal developer NPS on AI-assisted workflow | n/a | +52 |
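
The percentage changes in the table can be rechecked from the before and after values. The short sketch below does that for the rows that state both numbers; the `pct_change` helper and the `rows` dictionary are illustrative names, and the cost figures are the approximate values given in the table.

```python
def pct_change(before, after):
    """Signed percentage change from before to after."""
    return (after - before) / before * 100


# (before, after) pairs taken from the table; claim time is converted to seconds.
rows = {
    "Time to resolve a routine claim (s)": (11 * 60, 2 * 60 + 40),
    "Cost per claim ($, approximate)": (32, 18),
    "Throughput per adjuster (claims/day)": (22, 28),
    "Mean time to recovery (min)": (47, 11),
}

changes = {name: round(pct_change(b, a)) for name, (b, a) in rows.items()}

for name, change in changes.items():
    print(f"{name}: {change:+d}%")
```

Run as written, this reproduces the −76% and −44% shown in the table and adds two figures the table leaves implicit: throughput per adjuster rose about 27 percent and mean time to recovery fell about 77 percent.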
What the numbers do not say
These outcomes are not the whole story, and the team is careful about that. The hardest-to-quantify outcome was the redistribution of Priya’s attention from the routine to the complex – visible in the trend lines but lived in her shoulders at four-thirty in the afternoon. The second hardest was the executive team’s growing willingness to absorb honest progress reports, including failures, because the Playbacks made honesty the norm. Neither of these shows up cleanly on a dashboard. Both were the point.
The headline number – eleven minutes to two minutes forty – is what will be cited in the next leadership meeting. The story behind it is that AI took the boilerplate, the team took the decisions, and the framework kept the two pointed at the same Hill.
What Made It Work
The team’s retrospective surfaced five lessons, each mapping directly to the Enterprise Design Thinking framework.
The Observe phase paid for itself many times over. The thirty-eight percent manual-override pattern would not have been found in interviews alone, nor in telemetry alone. It took both, and it took an operations lead and a designer in the same room. Three weeks of pure observation felt like a luxury at the start; by month three, it felt like the bargain of the project.
The Hill prevented at least one major mistake. When a stakeholder pushed in month four for a customer-facing claim portal – a perfectly defensible feature in another universe – the team was able to point to the Hill and say: this is not what we promised. It went onto the fast-follow list. The Hill was a shield as well as a compass.
Sponsor Users were the most important variable. They killed the wrong feature. They redirected the right ones. They built executive trust by showing up in Playbacks and saying, on the record, what they thought. The team’s investment in their time – two hours weekly, real prototypes, real authority – was the single highest-return investment in the project.
Playbacks turned transparency into trust. The team’s willingness to admit in Playback Five that the citation feature was failing in eighteen percent of complex commercial-auto claims was the moment the executive sponsors stopped asking whether AI-assisted delivery was risky and started asking how to scale it. Transparency about failure, in a structured forum, is a faster path to organizational trust than confidence about success.
AI accelerated everything except judgment. Building three prototypes in nineteen person-days was unthinkable without AI. So was the runbook-on-day-one observability discipline. So was the daily refactoring pace. What AI did not accelerate was the team’s ability to know which features should ship, which prototypes should die, which compliance constraints were inviolable, and when to listen harder to a Sponsor User who was quietly uncomfortable. Those were human, and they stayed human, and the team is honest about that.
Closing Reflection
This project did not succeed because the team had AI coding assistants. Many teams have AI coding assistants and ship the wrong product faster than ever. It succeeded because a small, diverse, empowered team used a disciplined framework to point those AI assistants at the problem that mattered, validated their work against real users at a steady cadence, and was honest about both the wins and the failures in front of stakeholders who could absorb honesty.
The headline outcome – eleven minutes to two minutes forty – is the number that will be cited. The redistribution of Priya’s attention is the actual achievement. The Hill named both, and the team built toward both on purpose.
When building gets cheap, clarity wins. That is the lesson. And it is the discipline that AI coding assistants make essential rather than optional.
