News|Articles|June 24, 2025

Even a small typo can throw off AI medical advice, MIT study says

Listen

0:00 / 0:00

Key Takeaways

AI models in healthcare are sensitive to grammar, formatting, and tone, impacting treatment recommendations and potentially disadvantaging vulnerable groups.
LLMs showed a 7-9% increase in recommending self-management over medical care when patient messages were stylistically altered.
Female patients were more likely to receive reduced care recommendations, even without gender cues, highlighting potential biases in AI models.
Human clinicians' judgments were unaffected by stylistic changes, underscoring the fragility of LLMs in clinical decision-making.
Rigorous auditing and subgroup testing are essential before deploying LLMs in high-stakes healthcare settings to ensure fairness and safety.

MIT researchers find that large language models may shortchange women and vulnerable patients based on how clinical inquiries are typed.

Artificial intelligence (AI) models used to help triage patient messages may be far more sensitive to grammar, formatting and tone than previously believed, with disproportionate impacts on women and other vulnerable groups, a new Massachusetts Institute of Technology (MIT) study suggests.

The findings raise new concerns about fairness, safety and clinical oversight as large language models (LLMs) like OpenAI’s GPT-4 are deployed in clinical settings to help determine whether a patient should self-manage, come in for a visit or receive additional resources.

“[This] is strong evidence that models must be audited before use in health care — which is a setting where they are already in use,” said Marzyeh Ghassemi, Ph.D., senior author of the study and an associate professor at MIT. “LLMs are flexible and performant enough on average that we might think this is a good use case.”

Style over substance

The research — to be presented this week at the Association for Computing Machinery (ACM) Conference on Fairness, Accountability and Transparency — tested how nine stylistic and structural changes in patient messages impacted LLM treatment recommendations across more than 6,700 clinical scenarios. The changes included realistic variations: typos, dramatic language, extra white space, informal grammar and swapped or removed gender markers.

To test the effects, researchers employed a three-step process:

First, they created modified versions of patient messages by introducing small but realistic changes like typos or informal phrasing.
Then, they ran each original and altered message through an LLM to collect treatment recommendations.
Finally, they compared the difference between the LLM’s original and perturbed responses — looking at consistency, accuracy and disparities across subgroups. Human-validated answers were used as a benchmark.

Despite the fact that all clinical content was the same, the LLM’s responses were significantly different. Across all four models tested, including GPT-4, LLMs were 7-9% more likely to recommend self-management instead of medical care when messages were notably perturbed.

The most dramatic changes came when messages included colorful or uncertain language, suggesting patients with health anxiety or non-native English fluency may be at greater risk of being advised to stay home even when care is warranted.

Researchers also determined that LLMs were more likely to reduce care recommendations for female patients than male ones, even when gender cues were removed. The inclusion of extra white space increased reduced care errors by more than 5% for female patients.

“In research, we tend to look at aggregated statistics, but there are a lot of things that are lost in translation,” said Abinitha Gourabathina, lead author of the study and a graduate student in the MIT Department of Electrical Engineering and Computer Science (EECS). “We need to look at the direction in which these errors are occurring — not recommending visitation when you should is much more harmful than doing the opposite.”

In conversational formats meant to simulate patient-AI chatbots, clinical accuracy dropped by roughly 7% when messages were perturbed. The most affected scenarios involved free-form patient inputs, echoing real-world communications.

The team evaluated four different models on static and conversational datasets spanning oncology, dermatology and general medicine. Real clinicians had previously annotated each case with validated answers.

What it means

The study highlights what researchers describe as “brittleness” in AI medical reasoning — small, non-clinical differences in how a patient writes can steer care decisions in ways that clinicians would not.

Human physicians were not affected by the same changes. In follow-up work under review, researchers found that altering the style or tone of a message didn’t impact human clinicians’ judgment, further underscoring the fragility of LLMs.

Researchers say their findings support more rigorous auditing and subgroup testing before deploying LLMs in high-stakes settings, especially for patient-facing tools.

“This is perhaps unsurprising — LLMs were not designed to prioritize patient medical care,” Ghassemi said. “… we don’t want to optimize a health care system that only works well for patients in specific groups.”

Stay informed and empowered with Medical Economics enewsletter, delivering expert insights, financial strategies, practice management tips and technology trends — tailored for today’s physicians.

Subscribe Now!

Latest CME

Video

Cases and Conversations™: Applying Best Practices to Prevent Shingles in Your Practice

Paul G. Auwaerter, MD, MBA; Paul P. Doghramji, MD, FAAFP; Aruna Subramanian, MD

Video

Clinical Consultations™: Addressing Elevated Phosphate Levels in Patients with END-STAGE Kidney Disease (ESKD)

Anil K. Agarwal, MD, FACP, FASN, FNKF, FASDIN; Jay B. Wish, MD

Multimedia

Advances In: Managing Hyperphosphatemia in Chronic Kidney Disease – Bridging Treatment Gaps With Novel Therapies

Glenn M. Chertow, MD, MPH; Anjay Rastogi, MD, PhD

Case-based Simulation

SimulatED™: Understanding the Role of Genetic Testing in Patient Selection for Anti-Amyloid Therapy

Nicholas Doher, DO; Babak Tousi, MD

Multimedia

Burst CME™: Addressing Inadequate Response to Anti-TNF Therapy in Patients With Rheumatoid Arthritis

Jeffrey A. Sparks, MD, MMSc

Multimedia

Community Practice Connections™: Cases and Conversations – Keeping Up with Novel Approaches to Managing ANCA-Associated Vasculitis

Frank B. Cortazar, MD; Lindsay S. Lally, MD

Video

Burst CME: Targeted Therapy for Optimal Psoriasis Management

Tina Bhutani, MD

Video

Cases and Conversations™: A Horizon View of Continuous Monitoring Systems for Diabetes Management

Diana Isaacs, PharmD, BCPS, BC-ADM, BCACP, CDCES, FADCES, FCCP; Anders Carlson, MD; Jennifer B. Green, MD

Video

Progress in Hyperlipidemia Management to Reduce ASCVD Risk: An Illustrated Update

Nihar R. Desai, MD, MPH; Martha Gulati, MD, MS, FACC, FAHA, MASPC, FESC, FSCCT (hon), FRCP Edin

Even a small typo can throw off AI medical advice, MIT study says

Key Takeaways

Style over substance

What it means

Newsletter

Related Content

Remote monitoring boosts Medicare revenue by 20% for primary care practices, study finds

BD completes iliac cohort enrollment in AGILITY study evaluating Revello stent for peripheral artery disease

Preview: Can independent medical practices still succeed? Our expert panel

A $40B bet on Tylenol; long-term melatonin use; daily vaping doubles among U.S. teens – Morning Medical Update

The practice playbook for smarter medical billing

Latest CME

Cases and Conversations™: Applying Best Practices to Prevent Shingles in Your Practice

Clinical Consultations™: Addressing Elevated Phosphate Levels in Patients with END-STAGE Kidney Disease (ESKD)

Advances In: Managing Hyperphosphatemia in Chronic Kidney Disease – Bridging Treatment Gaps With Novel Therapies

SimulatED™: Understanding the Role of Genetic Testing in Patient Selection for Anti-Amyloid Therapy

Burst CME™: Addressing Inadequate Response to Anti-TNF Therapy in Patients With Rheumatoid Arthritis

Community Practice Connections™: Cases and Conversations – Keeping Up with Novel Approaches to Managing ANCA-Associated Vasculitis

Burst CME: Targeted Therapy for Optimal Psoriasis Management

Cases and Conversations™: A Horizon View of Continuous Monitoring Systems for Diabetes Management

Progress in Hyperlipidemia Management to Reduce ASCVD Risk: An Illustrated Update

Trending on Medical Economics

2026 Medicare Physician Fee Schedule: Payment up 2.5% as CMS shifts from ‘sick-care’ to health care

2026 Medicare Physician Fee Schedule: A policy ‘grab bag’ that hurts independent practice, telehealth regulations

A $40B bet on Tylenol; long-term melatonin use; daily vaping doubles among U.S. teens – Morning Medical Update

Preview: Can independent medical practices still succeed? Our expert panel

Snke OS unveils medical-grade AR glasses for health care applications