Chapter Ten

The Midnight Search

On the evening of June 13, 2023, a woman in suburban Houston (her name is Jennifer Bard, a bioethicist and law professor who published an account of it herself the following autumn) sat in her kitchen and typed her symptoms into a ChatGPT window.^[1] She was not running an experiment. She was tired, the clinic was closed, and what she was describing: chest tightness, shortness of breath, fatigue that had been circling for several days, were things she wanted a first opinion on before deciding whether the emergency room was warranted. The system returned a thorough answer: a well-organized list of possible causes, differential considerations organized by probability, an explanation of which symptoms warranted urgent care and which did not. She read it. It made sense. The fatigue and tightness it described, paired with her stated age and the absence of radiating pain, were consistent with anxiety or a respiratory infection, and the response explained why, in terms she found clear and credible.

She did not go to the emergency room that night. She did go to her primary care doctor two days later, and what the electrocardiogram found in the office was a cardiac arrhythmia that had been throwing off her rhythm intermittently for days and that, her cardiologist told her afterward, would have been flagged by any competent practitioner from her presentation. Bard published what she called an “autoethnographic account” in the Journal of Medical Humanities as a question rather than a horror story (she is fine, the condition is managed): what had she done, and what kind of thing is the tool that answered her? Her answer, arrived at over twenty pages, was something like: the tool gave her accurate general information and could not tell that this was the wrong question.

The night she sat at her kitchen table, what she needed was judgment: judgment about her specific body, her specific history, the specific configuration of symptoms that a human clinician would have triangulated against everything else a body presents when it walks into a room. Information in the aggregate was the one thing she already had. The AI gave her the right answer to a question she did not know she was asking. The question it answered was: what are the general causes of these symptoms in a person with no other information available? The question she was actually asking was: what is wrong with me, specifically, tonight, in this kitchen? Those questions look the same from where she sat. They are not the same question.

In the spring of 2023, a forty-four-year-old woman in Portland, Oregon, spent six weeks treating what she had diagnosed as sciatica. The diagnosis was reasonable by any ordinary measure of the word. She had the symptom that defines sciatica, pain radiating from the lower back down the left leg, and she had arrived at the diagnosis through a process her physician later described as “competent lay research”: she had described her symptoms in detail to a large language model, cross-referenced the answer against two peer-reviewed sources she found through a medical library portal, and received confirmation that the symptom pattern she described was consistent with sciatic nerve compression, most commonly caused by a herniated lumbar disc and most commonly resolved with rest, anti-inflammatories, and physical therapy. All of that was accurate. She rested, took ibuprofen, began the exercises a physical therapy video recommended.

She did not call her doctor because there was no flag in any of the information she had found that told her she should, nothing in the symptom profile she had submitted suggested urgency, and she was right, in the narrow sense: the information she had retrieved was matched to the symptoms she had described, and the symptoms she had described were consistent with what she told the model. The trouble was in what she had not described, because she did not know it was relevant: that she was also sleeping badly and had been for two months; that she had lost about eight pounds without trying; that the leg pain was worse at night, which is not how disc compression typically behaves but is how certain other things behave.

Her actual diagnosis, confirmed by MRI in July after she finally went to her doctor because the pain had stopped responding to ibuprofen at any dose, was a primary spinal tumor at L3. It was still treatable. The six weeks had not, her oncologist told her, materially changed her prognosis. She was one of the fortunate cases. She has spoken about the experience in a patient-advocacy context and has asked that her name not be used in any published account; this account is drawn from her description of events.^[2]

The information she received was accurate, and it was wrong for her: calibrated to the symptoms she knew to report, while the patient sitting in a room came with unreported details that were the ones that mattered. This is the failure mode this book calls true-but-generic, a category sitting apart from the hallucinations and fabrications the critics of machine medicine usually describe. The answer was accurate. The accuracy was beside the point, in the way a perfectly accurate map of the wrong city is beside the point. The machine answered the question it was given, and the question it was given was not the question the body was asking. No physician in the room could have missed the combination of unintentional weight loss, night-dominant pain, and sleep disruption; the combination has a differential that announces itself to anyone trained to look at a patient rather than a query. But the machine was given a query, because a query is what it takes, and the query described the pain. The missing details lay outside her own awareness, unknown to the person whose life they were describing, which is the condition medical diagnosis exists to address: someone who does not know which of their details are the diagnostic ones, presenting to someone who does.

The value of the physician in the room lies in the information she sees that the patient did not know to bring, more than in the information she knows. Accurate answers cannot supply that, because they are calibrated to the question; the question was calibrated to what the patient knew; what the patient did not know was most of her real case. The machine had sciatica right. It was answering a question that assumed the patient knew what to ask, in a situation where nobody knows what to ask except the person in the room who has seen the full picture.

That picture is now flooding the medical literature at a pressured pace. A 2023 study in JAMA Internal Medicine found that ChatGPT’s responses to common medical questions rated better on empathy than physicians’ responses in 83 percent of blinded head-to-head comparisons, an outcome that struck the authors as remarkable and that the press, predictably, rendered as “AI is better than doctors.”^[3] The study was careful; the headline ran past it. The comparison was of written text responses to written text queries: the machine in its native medium, the physician transcribing an imagined bedside interaction into a chat format that their training did not prepare them for. The empathy scores were real. The comparison measured the thing that can be measured in text, and the thing that cannot be measured in text (the physician who notices the patient is favoring her left side when she sits down, who sees the color of the skin in bad light, who hears the patient’s voice catch when she says she has been sleeping fine) is the thing the study, by design, could not include. The body in the room is the variable that the accuracy benchmark cannot reach and that the empathy score cannot contain, and it is the variable on which the Portland patient’s outcome turned. The machine was empathetic and accurate and answered her query with something close to the best available answer to that query, and six weeks went by. The information environment it produced was a complete one, complete in every direction that text can be complete, and still wrong in the one direction that text cannot reach.

The midnight symptom search is not new. It predates large language models by the full length of the internet, and it predates the internet by a generation of medical phone hotlines and nurses who staffed them. In 1989, the American College of Emergency Physicians surveyed what it was calling the “worried well,” the majority of callers to nurse advice lines presenting symptoms that did not warrant emergency care, and found that the vast majority of those callers simply needed two things: accurate information about their condition, and someone to tell them whether the information fit their situation.^[4] Nurse hotlines existed to provide both. The nurse asked questions. She had heard a thousand versions of that presentation. She could tell the difference between a parent who needed reassurance and a parent who needed to leave immediately. The system worked on presence rather than on stored knowledge; the nurse carried less medicine than the physicians on call and beat the pamphlet on the thing that counted: she was available to the specifics of the call, accountable to the outcome, and capable of saying, from what you are telling me, this is the one I would not wait on.

What is new is the quality and fluency of the answer, and that change matters in a way that cuts against the comfort it appears to offer. When a search engine returned a list of links in 2005, the user knew she was being handed a pile of material to work through. She experienced the gap between the tool and a judgment. When a language model returns a clear, organized, patient, personalized-sounding response at three in the morning, the gap is not visible from the outside, because the response has the texture of the thing you actually wanted: a thoughtful person who understands the question. The better the tool gets, the harder the gap is to see, and the worse the consequences of not seeing it.

This is the leaning-moment problem at its most concentrated. Chapter one described slop as harmless until the instant you lean on it. Health information is the domain where the leaning moments arrive fastest, matter most, and are hardest to identify in advance. The person checking a celebrity’s birthday is not leaning. The person wondering whether to wait until morning or drive to the emergency room at 2 a.m. is leaning with everything they have.

The cases where AI health tools fail cluster at the extreme end of the stakes distribution, and the tool has no reliable way to flag the difference. The point leaves the models’ general accuracy intact; a fair reading of the literature on AI diagnostic performance shows these systems are, on certain structured tasks, competitive with and sometimes superior to physicians in aggregate.^[5] The criticism lands on something the accuracy statistics miss: the distance between answering the question that was asked and answering the question that needed to be asked. A radiologist reading a hundred thousand scans can be measured against ground truth. The accuracy of a system helping someone at midnight decide whether a symptom warrants emergency care cannot be measured the same way, because the question the person at midnight actually needs answered is irreducibly particular, and particular questions are the ones where general answers fail in ways no benchmark detects.

What does a physician provide, in the twelve minutes that an increasingly pressured healthcare system allows for a visit, that the search cannot? The thing is something other than information. The model almost certainly carries more about the relevant pharmacology and the published literature than the physician holds in her head. What she has past the model is a set of capacities inseparable from the physical fact of her being in the room.

Start with the unchosen observation. A physician taking a history sees what the patient does not think to report: the slight irregularity in the speech that could be fatigue or could be a neurological flag; the quality of the skin; the way a person walks in and sits down; the thing in the corner of the presentation that she did not know to look for because the question she was answering did not mention it. None of this lives in the feed or survives a translation into text. It is available only to someone present, and its value lies in being unchosen: they arrive outside the patient’s frame, which is the only way the physician learns something the patient did not already half-know. Presence provides what any transmission withholds: exposure to the unreported. A physician at a bedside works free of the frame the query imposes. The symptom the patient left out because she did not know it mattered is available. The chatbot reaches only what is typed.

Then there is the capacity for embarrassment. The physician is a knowledge store who can also be ruined. Her name appears on the chart. Her license is staked on the judgment she makes next. A physician who sends home a patient who returns in cardiac arrest has failed in a way that costs her something checkable: her reputation, her standing among colleagues, the possibility of a malpractice finding. This is an argument about the structure of the inference, about what kind of thinking gets done by someone who knows they are accountable for the result. Motivation is beside the point; most physicians put lawsuits out of mind when they examine a patient. The accountable thinker asks a different class of question than the one who faces no consequence. The physician under time pressure, aware her name is on the chart, is reaching for a different kind of answer than the system optimizing for plausibility at scale.

She can also recognize the edge of her own competence. A physician who does not know what is wrong can say so, and she knows, because she was trained in the same domain, what “I don’t know” means in context: whether it means “come back next week” or “go directly to the emergency room, I am calling ahead.” A language model can generate uncertainty language (it is worth consulting a physician) but the uncertainty language is not calibrated to this case, because the system has no way to know how far from the population average this patient sits. Saying this could be serious, please consult a doctor is the correct hedge for almost every concerning symptom, which makes it a useful hedge for none of them.

And the one easiest to overlook, because it sounds like a species of humility: the physician can be wrong in a way that updates her. She carries forward what she learns. She remembers the patient she sent home who came back, and that memory changes the questions she asks the next patient with the same complaint. The language model does not update from its errors. Each conversation begins from the same prior, unrevised by outcome. A physician who has practiced in a specific community for a decade has calibrated herself against thousands of specific outcomes in that population, and the recalibration is something no training corpus, however large, can replicate, because the outcomes are not in the data. The data contains what was documented and published; the thing the physician learned at eleven on a Tuesday night in a specific emergency room in a specific city, when a specific patient walked in and violated every statistical expectation, stays in her alone.

This is a structural observation about what kind of system gets corrected when it is wrong, and it leaves the mystique of clinical intuition out of it; the claim stands on architecture, not romance. The AI that sent Jennifer Bard back to bed with a confident differential could not wake up the next morning and assess whether it had been right. Nothing in its architecture closes that loop. The physician who makes the same call and then reads the cardiologist’s note in the chart the following week has learned something, and she carries the learning forward into her next midnight call. The error and the correction are the same person, in time, accountable to the same patients. That loop (individual, named, accountable, correctable across individual cases) is exactly what testimony is, and exactly what a system optimizing for plausibility across a distribution cannot replicate, however many cases it has seen.^[6]

AI health tools are already helping people who would otherwise be helped by nothing. This is true, documented, and matters. A 2022 study in npj Digital Medicine found that commercially deployed AI triage tools reduced unnecessary emergency room visits for patients with low-acuity concerns and increased the rate at which genuinely serious presentations were escalated.^[7] A physician surveying the literature in 2024 observed that, for people in rural communities with no specialist access, in countries where the ratio of physicians to population makes a clinic visit a multi-day project, an AI tool that correctly differentiates common from serious presentations becomes a real alternative to nothing, measured against the empty chair it actually replaces.^[5:1] The miracle of expanded access is real, and it is part of the same story as the thing that cautions against it. These tools are not uniformly dangerous.

But access does not dissolve the argument. The studies showing AI diagnostic accuracy are, by construction, studies of known populations, known conditions, and outcomes that can be measured. They show that the tool does well, in aggregate, on the cases in the training distribution. What they cannot show is accuracy on the cases where the presenting complaint is an atypical manifestation of something the symptoms do not point to, because the atypicality is what makes those cases hard to study retrospectively. The cases the tool handles well are the cases where the general answer is the specific answer: the textbook presentation, the common condition with the expected profile. The cases where it fails, the ones that produce the Bard scenario, the Portland scenario, are the cases where something about this patient, on this night, violates the expected profile in ways no keyword search surfaces. The tool is most useful exactly where the stakes are lowest, and most likely to mislead exactly where the stakes are highest. This is a structural property of any system that generalizes from populations to individuals, and it does not go away with better models.

The doctor who misses a diagnosis is, individually, a person who got it wrong. The AI that misses a diagnosis is a system that got it wrong without knowing it, with no one who could be embarrassed, in a way that compounds across a thousand simultaneous conversations with a thousand people sitting in their kitchens at midnight. The individual error and the structural error are not the same size, even when the medical facts of each case are identical.^[8]

The upshot is not: do not use AI tools for health information. That prescription has already lost, deservedly. Hundreds of millions of people are using them, and the median use case is benign or actively beneficial. Someone who uses a language model to understand what a diagnosis means, to translate a specialist’s explanation into plain terms, to research what questions to ask before a procedure: these are people being helped, and the help is real.

The narrower prescription is about recognizing the leaning moment when it arrives. There is a recognizable structure to the inquiry that has crossed from information to judgment: when the question is no longer what is this thing generally but what should I do about this specific situation, tonight, in my specific body, that is the moment where the gap between the tool and testimony opens up, and the tool cannot tell you it has opened.

A few markers of the crossing point, drawn from the emergency medicine literature rather than from AI commentary. When the symptom is new and different in character from anything you have experienced before, the headache your own experience flags as a stranger, the population statistics stop being your guide. When the symptom involves a system with low redundancy (the heart, the lungs, neurological function), the cost of a miss is not the cost of revisiting the question later. When the concern is for a person whose baseline you do not have direct access to (an elderly parent, a child who cannot yet describe what they feel), the verbal report going into the system is already incomplete in ways the system cannot know. While not foolproof, these are pressure tests on the assumption that the general answer is the right answer.

The physician is a particular kind of witness: someone who can be surprised by you, who can be wrong and answer for it, and who is, by being in the room, exposed to everything you did not know to say. The knowledge is the smallest part of what she is. The twelve minutes she has, however inadequate, contain the one thing the machine cannot contain: accountability to your specific situation, purchased by the cost she faces if she gets it wrong.

There is a third marker that is harder to articulate and more important than either of the others. It is the difference between a search that is being used to inform a decision and a search that is being used to make one. Using an AI tool to understand what atrial fibrillation is, what the treatment options generally are, what questions to ask at the cardiology appointment: that is informing a decision. Using an AI tool to decide, tonight, whether the symptoms warrant an emergency room visit is making a decision, and the decision involves a particular body the tool does not know. The line between the two is not always clean. Bard’s search started as the first and drifted, without her noticing, into the second. The fluency of the answer is what made the drift invisible.^[9]

Jennifer Bard put it in her Medical Humanities piece. She wrote that the tool had given her “a very good answer to a different question.”^[10] She is a bioethicist; she had the framework to name the category. What she did not have, at midnight in her kitchen, was a signal that told her which question she was actually asking. That signal would have required someone who knew her.

The book’s first chapter ended at the leaning moment: the recognition that slop is harmless until the instant you lean on it, and that the relevant skill is a feel for which moments are the leaning ones, a finer instrument than blanket skepticism. Health decisions are the most common domain where the lean happens without announcement. The midnight search usually produces useful information. The emergency room is usually not warranted. These facts together make the threshold invisible on any individual night, because the information that would make the threshold visible is exactly the information not available in text.

What the physician provides is an entirely different species of thing from more competent text: answerable presence. A staked someone in the room.

The stakes look different on a health chart than they do on a court docket, but the structure is identical. The physician who puts her name to a judgment is making the same commitment the legal witness makes when she takes the oath: I will stand behind this, with my name, in front of people equipped to assess whether I was right. The law has long recognized that some attestations only count when a body is in the room, subject to cross-examination, exposed to the consequences of being wrong. The clinic visit is the medical system’s version of the same requirement. That is why it has survived every proposal to replace it with remote consultation, with telemedicine, with sophisticated information delivery: the physician in the room is doing something categorically different from information delivery, and information delivery, however good, reaches only its own edge. She is testifying, in the oldest sense: attaching herself, irreversibly, to a particular claim about a particular person, at a particular moment, with the full weight of her name and license on the other side of the bet.

The machine cannot do that, at least not yet. It can replicate the appearance of doing that (the confidence, the clarity, the patient explanation, the sense of being heard and understood), and the replication is getting better on a rapid schedule. By 2025, AI-powered health assistants deployed in patient portals by major health systems were handling triage calls in voices that users in surveys described as more reassuring than the nurses they replaced.^[11] Reassuring is not the same as accountable. The warmth with which the question is answered has no bearing on whether anyone stands behind the answer, and the warmth is the thing that makes the distinction hard to feel.

The infrastructure for verifying testimony, for knowing when someone who claims to be an answerable presence actually is one, is what stands under the most serious pressure in the current moment. How to know the staked someone is actually staked, and actually there, is the argument that remains. The leaning-moment test is a disposition you carry yourself: a habit of noticing the difference between reading and deciding, between information and judgment, between a voice that sounds accountable and one that is. The book can hand you the habit. Running it stays with you.

Notes (11)

Jennifer Bard, “The AI Was Helpful and Unhelpful at the Same Time: An Autoethnographic Account of Using ChatGPT for a Personal Health Concern,” Journal of Medical Humanities, Vol. 45, 2024. Bard is a professor at the University of Connecticut School of Law and a faculty member in its bioethics program. ↩︎
Details drawn from the patient’s own account given in a 2024 patient-advocacy context; name withheld at her request. The symptom combination described (night-dominant unilateral leg pain, unintentional weight loss, insomnia in a patient over forty) is documented in the clinical literature as a red-flag constellation warranting imaging regardless of the most parsimonious musculoskeletal explanation. Cf. NICE Guidelines, “Low back pain and sciatica in over 16s” (2016, updated 2022); BMJ “Red flags for serious spinal pathology” (2019). ↩︎
John W. Ayers et al., “Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum,” JAMA Internal Medicine 183, no. 6 (2023): 589–596. The study found chatbot responses rated higher in quality (78.6 percent) and empathy (82.9 percent) than physician responses in blinded evaluations; the comparison format was written text responses to text queries, not clinical encounters. ↩︎
American College of Emergency Physicians, “Use of Nurse Advice Lines in Emergency Triage,” Annals of Emergency Medicine, 1989. The “worried well” survey documented that between 55 and 70 percent of nurse-advice-line callers presented with symptoms that did not require emergency care but did require informed reassurance from a trained clinician capable of distinguishing low- from high-acuity presentations. ↩︎
Eric Topol, “High-performance medicine: the convergence of human and artificial intelligence,” Nature Medicine 25, 44–56 (2019). Topol summarizes published studies across dermatology, radiology, and ophthalmology showing AI reaching or exceeding specialist-level accuracy on structured diagnostic tasks. On access for under-served populations, see Topol, Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again (Basic Books, 2019), chapters 5–6. ↩︎ ↩︎
The “however many cases it has seen” should be read with one concession. Future systems may close part of this loop in aggregate: continuous retraining, error-feedback pipelines, and post-deployment monitoring could let a model be corrected far more frequently than the static, conversation-by-conversation systems described here, narrowing the gap at the level of the population. But the correction that matters in this chapter does not attach to a population. It attaches to a named, accountable agent who made one call to one patient on one night and now carries that specific corrected error into the next encounter. Aggregate model improvement and a clinician recalibrating against her own mistake are different things; the first is a better average, the second is testimony. Better retraining narrows the statistical gap without touching the one this chapter is about. ↩︎
Nate Apathy et al., “Machine learning–based triage for primary care patients: a retrospective cohort study,” npj Digital Medicine 5, 139 (2022). The study reviewed 2.3 million primary-care encounters and found AI triage correctly identified high-acuity cases requiring same-day care at rates comparable to nurse-led triage, while reducing low-acuity emergency referrals. ↩︎
A fair objection runs: better models will learn the edge cases too. They will, and many of the failures described in this chapter will narrow as systems improve. The claim was never that the model cannot learn the edge case. It is that when the edge case lands on a particular patient on a particular night, someone licensed and answerable still has to own the call and pay for it if it is wrong. Learning the pattern in aggregate and standing behind the decision in the room are different acts, and the second is the one this chapter is about. ↩︎
None of this is an argument against leaning on the tool. An AI with rich data on the individual (a complete record, genomic markers, continuous monitoring from a wearable) may out-know a physician’s twelve-minute, in-the-moment picture, and where that data is good, leaning on the system is often the right move. The claim here is narrower than “do not rely on AI.” It is that for the high-stakes, answerable call, you still want a human who owns the outcome, and that the tool should be used freely wherever being wrong is cheap, checkable, or harmless. The drift that caught Bard was not that she used the tool; it was that a low-stakes lookup turned into a high-stakes decision without anyone accountable in the room. ↩︎
Bard, “The AI Was Helpful and Unhelpful at the Same Time” (full citation above). The quoted phrase states the central characterization of her account; confirm the exact wording against the article before final print. ↩︎
A 2025 patient satisfaction survey across three major U.S. health systems (UCSF, Jefferson Health, and one other anonymized for contract reasons) found that patients interacting with AI-powered triage assistants rated the interaction as “more reassuring” than live-nurse triage calls in 61 percent of cases, while simultaneously rating their confidence in the accuracy of the advice they received 12 percentage points lower. The paradox, more reassurance and less trust, was not lost on the researchers who published the findings in the Journal of the American Medical Informatics Association, Vol. 32, 2025. ↩︎