When General AI Tools Fall Short in Women’s Health and Hormone Health — And What Clinicians Can Do About It

Paulina Cecula, MD

Most clinicians I talk to are already using AI in some form. It could be a quick question to OpenEvidence between patients. Or a prompt to help draft a patient letter. Or an AI scribe to help with documentation during patient encounters.

If you're managing women's health (and realistically, almost every clinician is, whether or not it's your primary specialty), you've probably turned to one of these AI tools for a hormone-related question at some point.

We're regularly told that AI now exceeds physicians on medical exams and that clinical accuracy is impressive across the board. But does that performance hold across all medical domains, and in female hormone health in particular? The answer, based on a growing body of research, is: not as well as we might assume.

The evidence gap that AI inherits

General large language models learn from what exists. And in women's health, particularly in hormonal and menopausal care, what exists is not always comprehensive, current, or balanced.

Women were systematically underrepresented in clinical trials for decades. A significant portion of early biomedical research was conducted on male subjects. Conditions like endometriosis, PCOS, and perimenopause remain under-researched relative to their prevalence and clinical complexity. And perhaps most consequentially for AI, the hormone therapy literature is dominated by the legacy of the Women's Health Initiative (WHI) study, a landmark trial with significant methodological limitations that, decades later, still casts a long shadow over how menopausal hormone therapy (MHT) is discussed, prescribed, and, it turns out, represented in training data.

AI models can’t correct for these gaps on their own. The risk is that they amplify whatever signal exists in the data they were trained on. In domains where the evidence base is contested, evolving, or thin, models can appear confident while being incomplete or outdated. Hormone therapy is one of those domains.

What the research actually shows

Thousands of studies have now evaluated LLMs in medicine - a systematic review published this year identified over 4,600. The headline results are often impressive: models passing USMLE exams or outperforming clinicians on certain diagnostic benchmarks. But women's health is rarely evaluated explicitly within these frameworks, and menopause and hormone therapy are almost never tested as a discrete subdomain.

When researchers have looked specifically at menopause and MHT questions, the results are sobering. One study evaluated several leading LLMs using 35 questions – 20 patient-level and 15 clinician-level. Accuracy for clinician-level questions peaked at 67% with GPT-4 and dropped to 47% with Gemini. For context, those are not the kinds of numbers we’d expect from these models across other fields of medicine or that we would accept in a trusted clinical decision support tool. The researchers concluded that menopause care "remains an under-tested domain where models frequently produce incomplete or inaccurate responses."

A separate benchmark focused specifically on women's health, the Women's Health Benchmark, was published in late 2025. It found failure rates of approximately 60% on its most challenging prompts, which reflected the kinds of nuanced questions that come up in real consultations.

Anecdotally, clinicians who are up to date with hormone care and hormone therapy prescribing report that they can't rely on general AI tools in this field, even though they use them for other clinical questions and tasks. The nuance that experienced clinicians carry around formulation differences, optimal levels, route of administration, timing of initiation, and individual risk stratification doesn't always come through.

Why domain-specific tools perform better

Can we do something about it? The most promising direction from the research is retrieval-augmented generation, or RAG: giving a model access to a curated, authoritative knowledge base rather than relying solely on its general training. A study evaluating this approach for women's health specifically found that a reasoning-capable model augmented with guideline-based knowledge significantly outperformed general models on the same questions.

This makes clinical sense. The quality of an AI tool's output in any specialist domain depends heavily on the quality and currency of the knowledge it draws from. In hormone health, that means access to current clinical guidelines and society consensus, as well as the ability to synthesize emerging evidence and real-world data in a way that reflects genuine modern clinical practice and nuance - not just outdated statements.
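For readers curious about what retrieval-augmented generation looks like mechanically, here is a minimal sketch in Python. It is an illustration only, not how Dama Assist or any specific product is implemented. The knowledge base entries, the source labels, and the function names (`retrieve`, `build_prompt`) are all hypothetical, the snippet text is placeholder wording rather than clinical guidance, and the word-overlap scoring is a toy stand-in for the embedding-based search a production system would more likely use.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Snippet:
    source: str  # hypothetical label for a guideline or consensus document
    text: str    # placeholder text, NOT real clinical guidance


# Hypothetical curated knowledge base; every entry is a placeholder.
KNOWLEDGE_BASE = [
    Snippet("guideline-A", "placeholder guidance on route of administration for hormone therapy"),
    Snippet("guideline-B", "placeholder guidance on timing of initiation and risk stratification"),
    Snippet("guideline-C", "placeholder guidance on documentation for patient letters"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def score(question, snippet):
    """Toy relevance score: number of words shared by question and snippet."""
    shared = Counter(tokenize(question)) & Counter(tokenize(snippet.text))
    return sum(shared.values())


def retrieve(question, k=2):
    """Return up to k snippets with the highest non-zero overlap score."""
    ranked = sorted(KNOWLEDGE_BASE, key=lambda sn: score(question, sn), reverse=True)
    return [sn for sn in ranked[:k] if score(question, sn) > 0]


def build_prompt(question, k=2):
    """Assemble the text that would be sent to a language model."""
    context = "\n".join(f"[{sn.source}] {sn.text}" for sn in retrieve(question, k))
    return (
        "Answer using only the sources below and cite them by label.\n"
        f"Sources:\n{context}\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    print(build_prompt("What route of administration is preferred for hormone therapy?"))
```

The point of the design is that the model is asked to answer from the retrieved, curated text rather than from whatever its general training happened to contain, which is why the quality and currency of the knowledge base matter so much.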

What we built at Dama Health, and why

This was the starting point for Dama Assist, a new AI-powered clinical tool for hormone health. We built it because we kept hearing from clinicians that general AI tools were useful in other areas of medicine but not reliable for hormone health consultations. They wanted something they could actually trust at the point of care, and our aim is for Dama Assist to become the most trusted AI tool for hormone health and MHT consultations.

It’s trained on all available evidence, guidelines, and literature in the space and, on top of that, on a curated clinical knowledge database with expert hormone health guides, clinical consensus, and encoded expertise from modern practice. It is designed for the kinds of questions that come up in real consultations: titration decisions, complex patient scenarios, communicating risk and benefit, and creating patient-specific resources.

Feedback thus far from clinicians has been incredibly promising; having a reliable and trustworthy reference changes the texture of a consultation and helps to improve workflows, confidence, and patient outcomes.

The "confidence" theme comes up again and again in the feedback Dama Assist receives from users - not because clinicians lack knowledge, but because hormone health is genuinely complex and the evidence is moving fast.

Practical points for anyone using AI in hormone health now

For clinicians currently using an AI tool for women’s health, menopause, or hormone therapy questions, a few things are worth bearing in mind:

Treat responses as a starting point, not a final answer.
Ask explicitly for the most up-to-date evidence and guideline recommendations.
Pay attention to the sources and what's missing as much as what's stated; incomplete answers can be even more risky than inaccurate ones.
Consider whether a domain-specific tool might be a better fit for this part of your practice.

We're at an early stage of seeing how AI fits into clinical workflows, and the potential for impact is huge. The models are getting better quickly. But better general performance doesn't automatically close the gap in specialist domains - that requires deliberate investment in the knowledge, evaluation, and clinical expertise that make a tool actually trustworthy. If you want to try Dama Assist for yourself, you can access a free trial here: https://www.damaassist.com/.

Written by Paulina Cecula, MD

Paulina Cecula is an MD and co-founder of Dama Health, a precision hormone health company building clinical tech tools for women's health. She writes about AI in medicine, personalized hormone care, and the future of women’s health.
