Articles

The Clinician's Guide to Local AI: Why On-Device Models Are Quietly Winning the Privacy Argument

Offcall Team

Most of the AI conversation in healthcare is happening at 30,000 feet: about platforms, contracts, and enterprise rollouts. Meanwhile, a quieter shift is happening on individual clinicians' laptops. Models small enough to run entirely on-device, without ever sending a byte of audio or text to a cloud server, are becoming genuinely useful. For physicians worried about HIPAA, two-party consent laws, and the general unease of typing patient details into a browser tab owned by someone else, this matters more than the marketing cycle suggests.

In Offcall's second AI Residency webinar, Dr. Graham Walker and Dr. Michael Hobbs spent a meaningful chunk of time on local models, not as a theoretical concept but as something both of them use every day. The technical detail in that segment got compressed in the overview article. It deserves its own walkthrough.

Resources:

  • Session slides
  • Dr. Michael Hobbs' AI 101 Guide
  • Heidi Health — AI clinical documentation

This session is part of Offcall's AI Residency series. The previous session covered AI fundamentals. Sessions 3 and 4 cover cutting through the hype and vibe coding for clinicians.

What "running locally" actually means

The mental model most clinicians have for AI is a query that travels from their keyboard to a data center and back. That is how ChatGPT, Claude, and most ambient scribes work. As Walker explained during the session, the speed feels instant precisely because it is being processed across enormous infrastructure: "It's being all outsourced to, you know, millions of computers."

A local model inverts that. The model itself, a compressed version of the same family of large language models, sits on your hard drive. Your microphone audio, your typed prompts, your dictation, all of it stays on your machine.
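That inversion is concrete in how a tool like Ollama works: it serves a small HTTP API on your own machine (by default at `http://localhost:11434`), so the prompt never crosses the network boundary. Here is a minimal sketch, assuming Ollama is installed and a model such as `gemma3` has already been pulled; the helper names and the model tag are illustrative, not from the session:

```python
import json
import urllib.request

# Ollama's local, non-streaming generation endpoint.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str):
    """Build the (url, payload) pair for a local generation call."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return OLLAMA_URL, payload

def generate_locally(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the locally running model; nothing leaves localhost."""
    url, payload = build_request(model, prompt)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    # Requires a running Ollama server; otherwise this raises URLError.
    print(generate_locally("gemma3", "Summarize this HPI in one sentence: ..."))
```

The only address in that code is `localhost`; that single detail is the entire privacy argument.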

The trade-offs are real:

  • Smaller models, narrower capabilities. A locally runnable model is not going to do everything Claude Opus does.
  • Hardware-dependent. Older laptops will struggle with bigger models.
  • Setup friction. You need to install something like Ollama or LM Studio to run them.

The trade-offs in your favor are also real, and for clinicians, they are substantial: privacy, no internet dependency, no per-token costs, and no third-party logging.
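The "hardware-dependent" trade-off has a usable back-of-envelope rule: a model's weights occupy roughly parameter count times bits per weight divided by eight, so a 7B model quantized to 4 bits is about 3.5 GB before runtime overhead. The sketch below applies that arithmetic; the 20% overhead figure for caches and buffers is an assumption, not a benchmark:

```python
def model_size_gb(params_billions: float, quant_bits: int, overhead: float = 0.2) -> float:
    """Rough memory footprint: weights (params * bits / 8) plus a fudge
    factor for the KV cache and runtime buffers (the 20% is a guess)."""
    weight_bytes = params_billions * 1e9 * quant_bits / 8
    return weight_bytes * (1 + overhead) / 1e9

# A 7B model at 4-bit quantization: 3.5 GB of weights alone.
print(round(model_size_gb(7, 4, overhead=0), 2))  # 3.5
# With the assumed runtime overhead, call it ~4.2 GB.
print(round(model_size_gb(7, 4), 2))              # 4.2
```

If that number comfortably fits inside your machine's free RAM, the model is worth trying; if it doesn't, drop to a smaller variant or a lower-bit quantization.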

The dictation use case: where local models already win

Walker's daily-use example was Spokenly, a free Mac dictation app he uses "easily 50 times a day." It runs Nvidia's Parakeet model locally and, in his side-by-side experience, beats Apple's built-in dictation on both speed and accuracy. Hobbs uses Whisper Flow for the same purpose.

Here is the part that matters clinically. Because the audio never leaves the machine, you can dictate things you would never feel comfortable speaking into a cloud-connected tool. Walker put it bluntly: "You can talk about your finances or say your bank password or whatever, because it's not going to the internet."

Translate that to clinical work and the implication is straightforward. Drafting a note, working through a differential out loud, dictating a letter to a referring physician: none of that requires a cloud round-trip if your dictation engine is local. The transcript becomes text on your screen, and what you do with it next is a decision you control.

Three tools to know: Ollama, LM Studio, and Hugging Face

The intimidating part of local models is the ecosystem around them. Hobbs and Walker named three resources clinicians should know:

  • Ollama (ollama.com). The simplest entry point. Install it, pick a model, and it runs.
  • LM Studio. A more visual interface. As Walker noted, modern versions "will read your computer memory and your computer CPU and speed and everything like that. And it'll actually tell you, hey, these are the best models we think for you."
  • Hugging Face. The library where the models actually live. Looks intimidating. Is searchable. Walker's tip: if it overwhelms you, "just ask Claude or ChatGPT to explain it to you."

For models specifically, Hobbs recommended starting with Google's Gemma family or Alibaba's Qwen family. Both are free, both come in multiple size variants, and both are recent enough to be genuinely capable.

A real workflow: voice cloning and the eight-second demo

To make the point that local models are not toys, Walker described an experiment from a few weeks earlier when he lost his voice before recording a podcast episode. The team used a free local text-to-speech model (Qwen 3.5 TTS) and eight seconds of his recorded voice to generate something "freakishly lifelike."

That is the level of capability now sitting on consumer hardware. Whether you find it exciting or unsettling depends on the application, but the underlying point is that "local" no longer means "underpowered."

Where this matters for ambient scribing

One audience question during the webinar pointed at exactly the right issue. In California, a two-party consent state, Sutter Health is reportedly facing a lawsuit over alleged use of an ambient AI scribe without patient consent. The question raised in the chat: would a local model solve that?

Walker was careful not to give legal advice, and rightly so. But the architectural answer is interesting. A fully local ambient scribe, one where audio is captured, transcribed, and structured entirely on the device, with no cloud transmission at any point, sidesteps a meaningful chunk of the data-handling concerns that plague cloud-based scribes. It does not eliminate the consent question (recording is recording). But it shrinks the surface area of who has the data and where it lives.

Hobbs is already tinkering on this himself: "I've been working on building an AI scribe just to tinker with on my own and kind of refining that."

How to start without breaking anything

For clinicians who want to experiment, the lowest-risk path looks like this:

  • Install Ollama on your work or personal laptop.
  • Pull a small Gemma or Qwen model. The tool will warn you if you're trying to download something too large for your hardware.
  • Pair it with a local dictation tool like Spokenly or Whisper Flow.
  • Use it for non-clinical tasks first: drafting emails, summarizing articles, dictating notes to yourself. Stay there until you understand how it fails.
  • Only then consider clinical applications, and only within whatever your institution's policies allow.
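The first step above can be smoke-tested before you script anything against it: check whether a local Ollama server is actually listening. This is a small stdlib-only sketch; the default address is Ollama's standard port, and the function simply returns False when nothing answers:

```python
import urllib.error
import urllib.request

def ollama_available(base_url: str = "http://localhost:11434", timeout: float = 1.0) -> bool:
    """Return True if an Ollama server responds at base_url, False otherwise."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            # Ollama's root endpoint replies with a plain "Ollama is running".
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    if ollama_available():
        print("Ollama is up; safe to experiment.")
    else:
        print("No local server found; install Ollama from ollama.com first.")
```

A check like this is also a sensible guard at the top of any personal workflow script, so a missing server fails loudly instead of hanging.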

The clinicians who will get the most out of the next two years of AI development are not the ones chasing every new SaaS launch. They are the ones who understand the difference between a model that runs in someone else's data center and a model that runs on their own machine, and who know when each is the right call.

For the full discussion, including the live tool demos, watch the complete webinar here:

