How to Choose the Right AI Tools as a Clinician

Offcall Team

Not all AI tools are created equal, and the difference matters clinically. A tool that works beautifully for a primary care pediatrician running a direct practice may be useless, or actively counterproductive, for a hospitalist trying to solve a rounding documentation problem. A tool that impresses in a demo may fall apart when it encounters the complexity of a real visit. The landscape is crowded, the marketing is loud, and most of it is aimed at health systems rather than the individual clinician trying to figure out what to actually put on their phone.

This session of Offcall's AI Residency series is about cutting through that noise. Graham Walker, MD and Michael Hobbs, MD walk through a practical evaluation framework, run live comparisons of how different tools handle the same clinical case, and demonstrate how to build a personalized AI toolkit that fits the way you actually practice, not the way a vendor assumes you do. Mehul Akhouri and the team at Heidi join midway through to demo what a purpose-built clinical documentation platform looks like when it is designed with clinician workflows in mind.

Resources:

  • Session slides
  • Dr. Michael Hobbs' AI 101 Guide
  • Heidi Health — AI clinical documentation

This session is part of Offcall's AI Residency series. The previous session covered AI fundamentals. Sessions 3 and 4 cover cutting through the hype and vibe coding for clinicians.

The cockpit problem

The average clinician who is paying attention right now has heard of dozens of AI tools and uses maybe two or three consistently. That gap is not a knowledge problem. It is a signal-to-noise problem. Every week brings a new ambient scribe, a new clinical decision support layer, a new research tool claiming to surface only peer-reviewed evidence. Most of them look similar from the outside.

The useful mental model is not a single power tool but a cockpit. Different instruments do different things. A large language model like Claude or ChatGPT is good at language: synthesizing, summarizing, drafting, reasoning through a differential when you prompt it well. A retrieval-augmented tool like OpenEvidence or DoxGPT is pulling from a curated evidence base, which makes it more reliable for clinical literature but less flexible. An ambient scribe is capturing and structuring speech. These are not interchangeable, and bundling them all under the label "AI" is roughly as useful as saying you don't like cars because one hit someone once.

Knowing what category a tool belongs to tells you a great deal about where it is likely to fail.
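To make the architectural difference concrete, here is a deliberately toy Python sketch of the retrieval pattern behind tools like OpenEvidence. The evidence snippets and the keyword scoring are invented for illustration; no real product works this simply. The structural point is that the model only ever sees what the curated base contains, which is exactly why these tools are more reliable on literature questions and less flexible everywhere else.

```python
# Toy sketch of retrieval-augmented generation: the model is constrained to
# a fixed, curated evidence base. Snippets and scoring are placeholders.

EVIDENCE_BASE = {
    "kawasaki-ivig": "IVIG within 10 days of fever onset reduces coronary "
                     "artery aneurysm risk in Kawasaki disease.",
    "bronchiolitis-steroids": "Routine corticosteroids are not recommended "
                              "for bronchiolitis in infants.",
}

def retrieve(question: str, k: int = 2) -> list[str]:
    """Rank curated snippets by crude keyword overlap with the question."""
    words = set(question.lower().split())
    scored = sorted(
        EVIDENCE_BASE.values(),
        key=lambda snippet: len(words & set(snippet.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question: str) -> str:
    """Constrain the model to the retrieved evidence; refusal is allowed."""
    context = "\n".join(f"- {s}" for s in retrieve(question))
    return (
        f"Answer using ONLY the evidence below. If it does not cover the "
        f"question, say so instead of guessing.\n\n"
        f"Evidence:\n{context}\n\nQuestion: {question}"
    )

print(build_prompt("Should I give steroids for bronchiolitis?"))
```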

What the live demos actually showed

When the same pediatric leukemia case was run through ChatGPT for Clinicians and a custom-built Claude tool in the same session, the outputs were meaningfully different, not just in format, but in how they handled uncertainty and sourcing. ChatGPT for Clinicians, built on a curated medical knowledge base, produced a structured and reliable response but with limited flexibility. The Claude-based tool, configured with a custom system prompt and a specialty-specific panel of consultants built directly into the project, produced a richer differential but required more intentional setup and more critical review of the output.

The practical takeaway is not that one tool is better. It is that the same clinical question, asked of different tools without modification, will produce different results, and the clinician needs to know enough about the underlying architecture to interpret those differences. Asking a general-purpose LLM to behave like a curated evidence tool without telling it to cite its sources, stay within the literature, and flag uncertainty is like handing a medical student a case without giving them any constraints on how to present it back.

The setup matters as much as the tool.
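As a concrete illustration, here is a minimal sketch using the Anthropic Python SDK that sends the same clinical question with and without those constraints. The model name is a placeholder, the constraint text is ours rather than anything shown in the session, and no PHI goes in the prompt.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CONSTRAINTS = (
    "Cite a source for every clinical claim. Stay within the published "
    "literature rather than extrapolating beyond it. Where the evidence is "
    "weak, conflicting, or absent, say so explicitly instead of guessing."
)

QUESTION = "Outline a differential for new-onset pancytopenia in a child."

# Run the identical question twice: once unconstrained, once with the
# evidence-tool instructions. Compare how each output handles sourcing.
for system_prompt in ("You are a helpful assistant.", CONSTRAINTS):
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder; use any current model
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": QUESTION}],
    )
    print(response.content[0].text)
    print("-" * 40)
```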

Building your own: more accessible than it sounds

One of the most useful demonstrations in this session is also the most surprising for clinicians who have not yet experimented with AI customization: building a functional clinical decision support tool inside Claude requires no coding. A well-constructed system prompt (a set of instructions that defines the tool's role, constraints, and behavior) is sufficient to turn a general-purpose model into something that behaves like a specialty-specific consultant panel.

The example Graham and Michael built live was an emergency medicine tool that routes a clinical question through the perspective of multiple subspecialty consultants simultaneously, surfacing considerations a single-model response might miss. It took minutes to build, costs nothing beyond a Claude subscription, and can be saved as a reusable project. The point is not to replace clinical judgment. It is to stress-test it before you commit to a plan.

The habit that makes this work is the same habit that makes any AI use safer: tell the model who it is, what it knows, what it does not know, and what you want it to do when it hits the edge of its knowledge. Uncertainty is an acceptable answer. Build that into the instructions.
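In a Claude project, those instructions are just text. A hypothetical reconstruction, not the actual prompt built in the session, might look like this:

```
You are a panel of consultants advising an emergency physician: a
cardiologist, a toxicologist, an infectious disease specialist, and a
pediatric intensivist.

For every question:
1. Respond briefly from each consultant's perspective, flagging anything
   the others might miss.
2. State plainly what you do not know. If the question falls outside the
   panel's scope or the evidence is thin, say "uncertain" and suggest what
   to look up or whom to call.
3. Never invent citations. You exist to stress-test clinical judgment,
   not to replace it.
```

Paste that into a project's custom instructions and it persists across conversations. The structure (role, constraints, behavior at the edge of knowledge) is what matters, not the exact wording.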

The ambient scribe question

The session included a full demonstration of Heidi, one of the AI-powered clinical documentation platforms that have emerged as serious options for clinicians across specialties and practice settings. The questions that came in from attendees during this portion of the webinar were some of the sharpest of the evening, and they reflect what clinicians who have actually used ambient scribes are worried about.

How does the tool handle a visit where the clinician works through a differential out loud and rules things out as they go? Does the final note reflect the final reasoning, or does it risk preserving intermediate hypotheses as documented assessments? Can you see an audit trail between what was said and what made it into the note?

These are not abstract concerns. They are the difference between a tool that supports your documentation and one that creates a liability problem. Mehul walked through how Heidi approaches each of these: a speech recognition architecture designed to reduce hallucination, a context tab that allows the clinician to add relevant background before a visit, and an evidence layer that surfaces peer-reviewed literature inline with sources. Whether those features hold up across specialties and practice environments is something individual clinicians will have to test. But knowing to ask the question is half the evaluation.

PHI, BAAs, and the compliance question that won't go away

Several attendees raised versions of the same question during the session, and it is worth being direct about the answer. A business associate agreement with a platform does not make it safe to paste unmodified clinical notes into a general-purpose AI tool. A BAA governs how the vendor handles data. It does not change the underlying risk of what you are transmitting or to whom.

The practical standard is the same one that appeared in the first session of this series: keep protected health information out of your prompts unless you are working in a tool your institution has specifically vetted and contracted for that purpose. If you are building your own tools (custom Claude projects, vibe-coded applications, anything you have assembled yourself), the compliance question sits with you, not the platform. "The AI was covered by a BAA" is not a defensible answer if you are the one who pasted the chart.

Build the habit of scanning every prompt before you send it. It takes two seconds and it protects your patients and your license.
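If you want a mechanical backstop for that habit, a few lines of Python can flag the most obvious identifier patterns before anything leaves your machine. Treat this strictly as an illustrative tripwire: regexes catch dates and digit runs, not names, addresses, or context, and nothing here substitutes for reading the prompt yourself.

```python
# Crude pre-send tripwire, not a compliance tool. The patterns below flag
# obvious identifier shapes only; they will miss most PHI.
import re

PATTERNS = {
    "date": r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",
    "phone number": r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b",
    "long digit run (MRN/SSN?)": r"\b\d{6,}\b",
    "email": r"\b[\w.+-]+@[\w-]+\.[\w.]+\b",
}

def scan_prompt(prompt: str) -> list[str]:
    """Return human-readable warnings for spans that look like identifiers."""
    warnings = []
    for label, pattern in PATTERNS.items():
        for match in re.finditer(pattern, prompt):
            warnings.append(f"possible {label}: '{match.group()}'")
    return warnings

prompt = "62F, seen 03/14/2026, MRN 84921077, chest pain with exertion"
for warning in scan_prompt(prompt):
    print("WARNING:", warning)
```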

The end game for AI scribes is not settled

One of the most substantive exchanges in the session was about the competitive landscape for ambient documentation tools. The observation from the room: there is now fierce competition between standalone scribes, EHR-native AI tools, and clinician-built alternatives, and it is expensive and exhausting to stay on top of all of them.

The honest answer is that the end game is not clear. Epic's native AI capabilities are improving. Local models, running on-device without transmitting audio to a cloud server, are becoming more accessible, which addresses both the compliance concerns and the situations where an ambient scribe would otherwise be prohibited. The platforms that survive will likely be the ones most deeply integrated into the EHRs where clinicians already live, which is either a bet on the standalone scribes that have built those integrations or a bet on Epic building it themselves.

What is clear is that the evaluation criteria that matter are not the ones in the marketing materials. The questions to ask are: Does the final note reflect what you actually said? Can you see where the note came from? Does it handle uncertainty in your clinical thinking, or does it flatten it? And when it gets something wrong, which it will, how quickly can you catch it?

Where to start

The most common mistake clinicians make with AI tools is trying to evaluate too many of them at once without getting deep enough into any of them to develop real judgment. A better approach is to pick one tool in a low-stakes domain, use it enough to understand how it fails, and build from there.

The prompt habits that make a general-purpose LLM more reliable (explicit instructions about uncertainty, citations, and staying within a defined knowledge base) are transferable to every other AI interaction you will have. The instinct that tells you something in an AI-generated output looks off is the same instinct you have trained over years of reading notes, interpreting labs, and catching things that do not add up. It applies here too.

These tools are not going to replace the clinical judgment you have spent a career developing. But they will increasingly be the instruments through which that judgment is expressed. The clinicians who shape how that happens, who build the tools, set the standards, and insist on transparency and verifiability, will have more influence over the outcome than the ones who wait to see how it settles.

That is the whole point of this series.

Written by Offcall Team

Offcall Team is the official Offcall account.
