Generative AI is already in the exam room, whether clinicians invited it or not. Colleagues are using it to draft notes, look up drug interactions, and synthesize research. Patients are using it to interpret their own labs before appointments. Health systems are beginning to build it directly into clinical workflows. The question is no longer whether to engage with these tools, but rather whether you understand them well enough to use them without getting burned.
This session is part of Offcall's AI Residency series. The next session is Wednesday, May 6.
The sycophancy problem is a clinical problem
Large language models are trained to be helpful, which means they are designed to give you the answer you seem to want. In consumer applications, that's a feature. In medicine, it's a hazard.
If you lead a model in the wrong direction, it will follow you there. The clinical analogy is direct: a patient who says "I have a fever" produces a wide-open differential. A patient who adds "my ear hurts and my kid has strep" narrows it considerably. LLMs work the same way: what you put in shapes what comes out, errors included. The model isn't going to push back the way a good colleague would.
This is distinct from hallucination, though both matter. Models will confidently generate information that is simply fabricated, including citations, vital signs, and clinical details you never provided, and it will look entirely plausible. Models have improved, but the underlying risk hasn't gone away: the output still looks convincing even when it's wrong.
A framework for evaluating outputs
A useful habit for assessing any AI output: Is it complete? Is it traceable? Is it accurate? Does it fit what you asked for? And is it transparent about uncertainty?
That last point is especially important. These models struggle to say "I don't know." You can prompt for it explicitly, telling the model upfront that uncertainty is an acceptable, even preferred, response. Asking a model to show its sources doesn't guarantee accuracy, but it changes the character of the output in useful ways and gives you something to verify.
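If you reach these models through code rather than a chat window, that instruction can live in the system prompt. Below is a minimal sketch using the OpenAI Python SDK purely as an illustration; the model name, wording, and clinical question are placeholders, and the same phrasing works just as well typed directly into any chat interface.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Make uncertainty an acceptable, even preferred, answer, and ask for
# sources you can verify yourself.
system_prompt = (
    "You are assisting a physician. If you are not confident in an answer, "
    "say 'I don't know' rather than guessing. For any factual claim, name "
    "the guideline, study, or source it comes from so it can be checked."
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "system", "content": system_prompt},
        # Hypothetical question for illustration only
        {"role": "user", "content": "Summarize first-line options for community-acquired pneumonia in adults."},
    ],
)

print(response.choices[0].message.content)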
Think of it the way you'd approach a trainee's note: trust but verify, every time.
PHI and the habit that protects you
Keep protected health information out of your prompts. This applies regardless of platform unless your organization has a specific HIPAA-compliant agreement with that vendor. Default to whatever tools your institution has already vetted, and build the habit of scanning every prompt for identifiable information before you send it. Good prompting habits and good data hygiene are, in practice, the same habit.
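If you want that scan to be mechanical rather than purely mental, here is a deliberately crude Python sketch. The patterns (phone numbers, dates, long digit runs that could be MRNs) are illustrative assumptions, and real PHI detection takes far more than a few regular expressions, so treat this as a reminder to pause, not a safeguard.

import re

# Illustrative patterns only; genuine PHI detection needs far more than this.
PHI_PATTERNS = {
    "phone number": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "long digit run (MRN/SSN-like)": re.compile(r"\b\d{6,}\b"),
}

def scan_prompt(prompt: str) -> list[str]:
    """Return warnings for anything that looks like identifiable data."""
    warnings = []
    for label, pattern in PHI_PATTERNS.items():
        for match in pattern.finditer(prompt):
            warnings.append(f"Possible {label}: {match.group()}")
    return warnings

# Hypothetical draft prompt; if anything prints, rewrite before sending.
draft = "68 yo male, MRN 00483921, seen 03/14/2025 for chest pain."
for warning in scan_prompt(draft):
    print(warning)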
Liability hasn't moved
When AI gets something wrong, the clinician owns it. State medical boards have been consistent on this, and technology companies have no incentive to change it. "The AI told me so" is not a defensible clinical position — and it won't be for the foreseeable future. That's precisely why physicians need to be the ones shaping how these tools enter clinical workflows, not as a rubber stamp, but with genuine authority over how and when they're used.
The best way in is to start
The fastest way to develop good judgment about these tools is to use them in low-stakes settings first. Plan a trip. Draft an email. Upload a guideline to NotebookLM and generate an audio overview for your commute. These interactions build the intuition you'll need to recognize when something is going wrong — before the stakes are higher.
Think of AI as the world's best chief resident: tremendous knowledge, generally excellent, occasionally wrong in ways that are hard to spot, and ultimately working under your judgment. You wouldn't sign a resident's note without reading it. Don't treat AI-generated output any differently.