Chapter Twelve

The Formation Gap

In the winter of 1999, a surgical resident named Atul Gawande stood over a patient in Boston and tried to perform an arterial line insertion for the first time. He failed. He tried again. He failed five times before a senior colleague took over and completed the procedure in under two minutes. What the textbooks described in a clean paragraph was a three-dimensional event that happened in the dark under the skin. The knowledge of how it was supposed to go had little to do with what it felt like when it wasn’t going right. “Performance must be learned under pressure, from the inside, by doing,” Gawande wrote.^[1]

The suffering of the resident was the mechanism that made him ready. The question Gawande didn’t have to answer in 2002 is the one we face now: what happens when you can skip it?

Judgment fades when you stop exercising it: the approver who never fires the veto, the faculty that goes slack when the collision stops happening. But judgment has to be built before it can be maintained, and the building comes first. Formation is the name for it: fighting your own ignorance in the presence of real consequences, with enough at stake that the loss teaches you something the winning could not.

Robert Bjork, a professor of cognitive psychology at UCLA, coined the term “desirable difficulties.”^[2] His lab found that conditions that make learning feel slower and harder (spacing material out, testing before teaching, varying the context) produce stronger long-term retention. Difficulty forces deeper processing. To retrieve something half-forgotten, you have to reconstruct the associative network around it. Reconstruction is the real thing; recognition is a shallow signal. The student who struggles feels like she is performing worse, but what she is building is stronger and lasts longer.

The critical word is desirable. Only some difficulty is productive. Struggling at the edge of your current capacity, where the terrain runs just beyond what you know but still within your ability to navigate, is formative. Struggling with a problem you lack the mental model to even frame is simply confusion, and confusion builds only frustration and a willingness to reach for a shortcut. Gawande, failing five times at the arterial line, was inside the productive band: he understood the procedure conceptually, had the manual facility, and was fighting the gap between knowing and doing. Six months earlier, he would not have known enough to learn from the failure.

A well-built AI tutor could be the most patient teacher a person ever had: Socratic, adaptive, awake at three in the morning, able to hold a learner exactly at the edge of competence the way the one-to-one tutor in Bloom’s famous, and much-disputed, two-sigma studies was said to.^[3] That is real, and it is good, and nothing here denies it. But notice what makes such a tutor formative: it keeps the reconstruction in the learner. It asks the next question instead of answering it; it makes her produce the wrong differential before it shows her the right one. The same system used the other way, to hand over the answer and spare her the struggle, removes the very thing the tutor’s value depended on. The tool is identical; the two uses are opposites. AI shortcuts do their damage precisely when they land on the productive band, the moment of confusion at the edge of existing knowledge where missing the answer would have forced the reconstruction. A tutor that protects that moment elevates; a shortcut that dissolves it borrows from a future self. The question was never whether to use the machine. It is whether the difficulty you are removing is the kind that was building you.

The programmer stares at a bug at two in the morning. She has tried three things. None worked. The system is doing something that makes no sense given her current model of it. She is solving the problem from the inside, and the inside is where the formation happens. What the three-hour failure builds, repeated over years, is a spatial intuition about how systems fail, an inarticulate sense that this one “smells” like a connection problem, a memory of other things that looked like this and turned out to be something else. That intuition resists the kind of articulation a fact allows. It lives in the pattern of failures absorbed and integrated, in the body of a person who has been lost inside enough broken systems to recognize the shape of lostness. A language model is excellent at debugging. It will frequently identify the problem faster than the tired programmer. But the programmer who calls the model at hour two has the answer and an intuition still unbuilt. The next time a system behaves strangely in this way, the confusion will be as complete as it was the first time. The unpleasant part was the learning, and she skipped both at once. Grant the strongest version of the objection: that the model will soon out-debug any human, that coding in the narrow sense is, as the labs say, solved. The formation still matters, for a reason that has nothing to do with who is faster. The engineer’s value was never out-typing the machine; it is being the one who can smell when its confident output is subtly wrong, or quietly dangerous, and that smell is exactly what the skipped struggle would have built.^[4]

The threshold here is not perfectionist. A personal project that answers to no one has no formation debt: if the shortcut works and no one gets hurt, the shortcut was fine. What introduces the obligation is a stake: the claim that you will answer for the outcome if it fails. Formation is the price of one specific thing, the authority to say what the intelligence got right and wrong when it matters; using the intelligence is free. Calibrate the price to what you are claiming, and it is a reasonable ask.

The data, where it exists, is not flattering. In 2025, researchers at METR ran a trial on experienced developers, measuring how much AI tools would speed their work. The developers predicted a 24% speedup. The measurements showed, instead, a 19% slowdown. Even after finishing the tasks, the developers believed the AI had sped them up. They felt faster; the measured time said otherwise. It was one small trial, a suggestive result rather than a settled finding, and it has not been replicated.^[5]

The conditions were narrow. The slowdown showed up in experienced developers working on large, mature repositories they already knew well, many of them with AI tools they had used little before, a setup in which the human’s head start is unusually large rather than one that mirrors everyday practice. The formation interpretation holds. Stretching it from this sample to long-term decline across ordinary development work reaches past what the study can directly establish.

The standard interpretation of this result is that AI tools are still immature: experienced developers are on a learning curve, and the reconvergence will come. That may be right. Grant the strongest version of it. Senior figures at the frontier labs now say coding is substantially a solved problem, and the share of production code written by machines climbs every quarter; the edge cases and the hard cases will fall too, and none of that is a loss.^[6] The argument here does not depend on the machine staying weak, and it is not the claim that without the human the difficult cases fail. It is a claim about where the capacity to judge comes from. The formation argument predicts something running in parallel to the rising tools: that developers who have substituted AI assistance for the productive-difficulty band are degrading their unassisted capacity at the same time as they are building AI-assisted capacity, and those two processes are not equivalent. The machine getting better does not build the human; if anything it removes the occasion that did. Which capacity wins, on what timeline, is unknown.^[7]

The pattern is not unique to programmers. Language learners who lean on machine translation before they have built basic fluency can skip the productive frustration of not understanding, the moment that forces the brain to reconstruct meaning from partial information. Medical students who use AI to pre-filter their differential diagnoses skip the exhaustive, wrong-first iteration that builds pattern recognition from the inside. The shortcut delivers the answer, and the answer can leave the formation undone.^[8]

The clinical evidence is harder to explain away as a learning curve. In 2025, researchers tracked nineteen experienced endoscopists who had been using AI-assisted detection during colonoscopies. After months of assistance, their unassisted detection rate dropped from 28.4% to 22.4%: six percentage points, or roughly a fifth of the baseline judgment they were there to exercise.^[9] The tool had quietly taken over a faculty they were no longer exercising, propping up their performance while the underlying skill went slack. Their floor fell while their ceiling was held up by the machine. Remove the machine (equipment failure, an edge case outside the model’s training distribution, any setting where the system is unavailable) and they are operating below where they started. Nineteen subjects is preliminary evidence; the finding is suggestive and the formation argument fits it precisely, but it needs replication before it can be treated as an established clinical phenomenon rather than a case study.

This is what the formation debt looks like when it comes due. The endoscopists cared and tried as much as ever. They had simply allowed the struggle to be outsourced long enough that the capacity it maintained was going offline. Their stake in a diagnosis, the claim that they personally had seen what was there to be seen, was backed by judgment that was eroding beneath them.

If AI-assisted detection is better, and in most studies it is, why does unassisted decline matter? Grant that the tool will keep improving, that it will close the cases it misses today. The decline still matters, because the worry was never that the machine fails. The problem is structural. Someone has to answer for the diagnosis, and the clinician who cannot function without the tool has lost the position from which to tell whether the tool is doing its job. Judging AI output requires the same faculty the AI is replacing. Delegate the judgment entirely and you have no one left who can underwrite the result, not because the machine erred but because the human who signs for it can no longer see what she is signing.^[10]

Michael Polanyi had a name for what the endoscopists were losing: “tacit knowledge.” He summed it up in a single line: “we can know more than we can tell.”^[11] Tacit knowledge is what the master carpenter has about how wood moves, what the experienced doctor has about a patient’s face before the test results arrive. It resists articulation for a mundane reason: it was embedded in habit and reflex, through years of imperfect practice meeting real feedback, and lived there propositionless from the start. The only mechanism that builds it is applying explicit knowledge in actual situations and absorbing what happens when it fails.

There is another, harder objection here. A programmer who has watched AI solve a thousand bugs has been exposed to a thousand patterns, solutions, and failure modes. That exposure is not nothing. And the AI era creates new expertise: knowing how to interrogate model output, recognizing the signature of a hallucination, understanding which classes of problem the model systematically misses, developing the meta-judgment to know when to trust and when to push back. A developer who works closely with machine-generated code over years does build something: a practiced sense of where the model’s output is sound and where it is subtly off, a faster eye for the plausible-looking function that handles the common path and quietly breaks on the edge case, the instinct that something in a generated solution is going to bite even before she can say why. That formation is real and increasingly valuable.

The question is whether observational and curatorial formation is sufficient for the stakes being discussed here. The endoscopist who has supervised AI-detected polyps through a thousand procedures and called overrides on a handful is a different diagnostician from the one who navigated those procedures in the dark, without the shortcut, and built the detection faculty through years of finding and missing. Both may achieve the same assisted accuracy. The difference appears at the edge: the unusual presentation, the patient who does not fit the pattern, the moment when the AI flags low confidence and the doctor has to decide. At that moment the history of formation is not incidental. The endoscopist study hints that the erosion may reach the baseline faculty, not just the exceptional cases. It is a hint, on nineteen subjects, not a demonstration.

In software, the struggle with the bug that resists every quick fix is itself the real work. The developer who spends three hours on a failure that makes no sense given her model of the system, trying the obvious things and watching each one fail, is learning the shape of how this system actually behaves, which runs at an angle to how its documentation says it behaves. The experience has to be suffered at the level of the specific failure, in the specific system, with the thing broken and someone waiting on the fix; reading about debugging leaves it untouched. A developer who hands the stack trace to a model and then edits its proposed patch is developing a reviewer’s eye, which is a real skill. What stays undeveloped is the capacity to localize a fault no one has seen before, a separate faculty built by a separate and more expensive struggle. What she will lack, years later, is the ability to do the one thing a developer is sometimes asked to do: stand behind the code under conditions where being wrong has a cost. To stand behind it, she has to be able to see what the model did and own the judgment that it was right.

This argument leaves AI tools in the formation toolkit and supplies a criterion for using them, one to apply to any specific shortcut: is the difficulty I am bypassing the mechanism by which a capacity for judgment gets built? If yes, skipping it borrows judgment from a future self who will arrive without it.

Looking up a phone number forfeits nothing; no stake of yours ever rested on holding it in your head, and offloading it was always pure gain. That is the real test, and it is not whether a skill will fade. It is whether the thing you are handing off underwrites a judgment you will later have to answer for. A veteran software architect hands the syntax of an unfamiliar graphics library to a model without a second thought, because nothing she is liable for rests on remembering it; the architectural judgment her name rides on she would never hand over. The medical student who lets the model build her differentials before she has ever built her own is offloading the opposite kind of thing, the reconstruction her future diagnoses will stand on. The question the age forces is not how to keep every skill, which is neither possible nor worth wanting, but which knowledge is load-bearing for the judgments that will be yours, and therefore worth forming the slow way. That is the line between what you can let the machine hold and what you cannot, and a later chapter gives it a name: intelligence you extend versus judgment you outsource.

Formation is the down payment on the credibility you will later spend. Gawande, as a resident, was establishing a claim to his own judgment: the right to say, later, I have seen this before, and I know what it means. When he later stakes his name on a diagnosis, the stake is backed by the record of having been wrong in ways that cost him and then corrected. Without the formation, the claim to know is a borrowed assertion, and the stake behind it is hollow. You cannot stand behind a claim you were never in a position to make. A person who stakes without formation is not making a smaller stake. She is making a forgery.

The debt does not announce itself. It just comes due.

Notes

[1] Atul Gawande, Complications: A Surgeon’s Notes on an Imperfect Science (Metropolitan Books, 2002).
[2] Robert A. Bjork, “Memory and Metamemory Considerations in the Training of Human Beings” (1994).
[3] Benjamin S. Bloom, “The 2 Sigma Problem: The Search for Methods of Group Instruction as Effective as One-to-One Tutoring,” Educational Researcher 13, no. 6 (1984): 4-16. Bloom reported that one-to-one tutoring moved average students roughly two standard deviations above conventional classroom instruction. The figure has not survived at that magnitude: later scholarship argues it anchored unrealistic expectations, and modern estimates of even intensive tutoring run substantially smaller (Matthew A. Kraft, “Interpreting Effect Sizes of Education Interventions,” Educational Researcher 49, no. 4 (2020): 241-253). The argument here does not rest on the size of the effect, only on its uncontested direction: a tutor that keeps the reconstruction in the learner forms her, and one that does the work for her does not.
[4] The engineer Grady Booch, five decades in the field, gives the practitioner’s version without flinching: he calls large language models “unreliable narrators,” non-deterministic systems that “confabulate” and have “no grounding in truth,” so they “will forever be untrusted.” He checks all of his own code, because the model has “put in things that were security-wrong and dangerous,” and he knows “the smells when it’s going off the rails.” Those smells are the formation. He uses the tools constantly and gladly for the parts he can already judge; what he refuses to hand over is the judging. Grady Booch in conversation with Kent C. Dodds, 2026 (youtube.com/watch?v=oRjLzxg8q6A).
[5] Joel Becker et al., “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity,” METR (2025).
[6] This is now a common claim from senior figures at the major AI labs, and the trajectory is real: an increasing fraction of shipped code is machine-generated. The point cuts both ways. Unreviewed machine output also reaches production and breaks there, as anyone who has shipped a generated change that nobody read can attest. Neither fact is decisive for the argument. That the machine writes most of the code is granted; that unreviewed output sometimes fails is granted; what matters is whether anyone formed enough to answer for the result is still in the loop.
[7] None of this requires pessimism about net employment. The economic optimists, Benedict Evans among them, may be right that automation has always destroyed jobs and created as many in turn, and that the lump-of-labor fear is two centuries wrong. The formation worry is orthogonal to that count. However many jobs there are, the capacity to answer for a result still has to be built, and the shortcut still skips the building. Ethan Mollick’s “human in the loop” (Co-Intelligence, 2024) names the role the optimists are counting on; this chapter asks where the human equipped to fill it comes from once the rungs are gone.
[8] The evidence is uneven, and worth stating that way. On machine translation the studies cut both ways: some find it aids vocabulary and reading comprehension, others document overreliance and a crutch effect that bypasses deeper acquisition (see the 2025 review in Cogent Arts & Humanities, doi:10.1080/23311983.2025.2491183). On medical training the concern is better grounded: the clinical-education literature now names the risk directly, distinguishing “deskilling,” “never-skilling,” and “mis-skilling” for trainees raised on AI assistance (npj Digital Medicine, 2026), and a randomized trial found an LLM assistant did not significantly improve physicians’ diagnostic reasoning even as the model alone outscored them, which is the gap the formation argument predicts.
[9] Krzysztof Budzyń et al., “Endoscopist deskilling risk after exposure to artificial intelligence in colonoscopy,” The Lancet Gastroenterology & Hepatology (August 2025). The study tracked nineteen experienced endoscopists and found a drop in unassisted detection rates from 28.4% to 22.4% after AI exposure.
[10] Push the trend to its end: a future where the machine does the diagnosis with no human in the loop at all. If that arrives in a high-stakes domain, the formation question does dissolve for that job, but only because the answerable human has been removed from it, which is the very thing this book argues is scarce, and the thing the law has so far refused to permit where lives are at stake (the presence requirements of chapter sixteen; the evaporation of answerability in chapter thirteen, when the human stake behind an automated decision rounds to zero). Full autonomy does not solve the problem of who answers. It abolishes the answerer, and a domain that has done that has not graduated past accountability so much as walked out on it.
[11] Michael Polanyi, The Tacit Dimension (Doubleday, 1966).