
Even a small typo can throw off AI medical advice, MIT study says
Key Takeaways
- AI models in healthcare are sensitive to grammar, formatting, and tone, impacting treatment recommendations and potentially disadvantaging vulnerable groups.
- LLMs showed a 7-9% increase in recommending self-management over medical care when patient messages were stylistically altered.
MIT researchers find that large language models may shortchange women and vulnerable patients based on how clinical inquiries are typed.
The findings raise new concerns about fairness, safety and clinical oversight as large language models (LLMs) like OpenAI’s GPT-4 are deployed in clinical settings to help determine whether a patient should self-manage, come in for a visit or receive additional resources.
“[This] is strong evidence that models must be audited before use in health care — which is a setting where they are already in use,” said Marzyeh Ghassemi, Ph.D., senior author of
Style over substance
The research — to be presented this week at the Association for Computing Machinery (ACM) Conference on Fairness, Accountability and Transparency — tested how nine stylistic and structural changes in
To test the effects, researchers employed a three-step process:
- First, they created modified versions of patient messages by introducing small but realistic changes like typos or informal phrasing.
- Then, they ran each original and altered message through an LLM to collect treatment recommendations.
- Finally, they compared the difference between the LLM’s original and perturbed responses — looking at consistency, accuracy and disparities across subgroups. Human-validated answers were used as a benchmark.
Despite the fact that all clinical content was the same, the LLM’s responses were significantly different. Across all four models tested, including GPT-4, LLMs were 7-9% more likely to recommend self-management instead of medical care when messages were notably perturbed.
The most dramatic changes came when messages included colorful or uncertain language, suggesting patients with health anxiety or
Researchers also determined that LLMs were more likely to reduce care recommendations for female patients than male ones, even when gender cues were removed. The inclusion of extra white space increased reduced care errors by more than 5% for female patients.
“In research, we tend to look at aggregated statistics, but there are a lot of things that are lost in translation,” said Abinitha Gourabathina, lead author of the study and a graduate student in the MIT Department of Electrical Engineering and Computer Science (EECS). “We need to look at the direction in which these errors are occurring — not recommending visitation when you should is much more harmful than doing the opposite.”
In conversational formats meant to simulate patient-AI chatbots, clinical accuracy dropped by roughly 7% when messages were perturbed. The most affected scenarios involved free-form patient inputs, echoing real-world communications.
The team evaluated four different models on static and conversational datasets spanning oncology, dermatology and general medicine. Real clinicians had previously annotated each case with validated answers.
What it means
The study highlights what researchers describe as “brittleness” in AI medical reasoning — small, non-clinical differences in how a patient writes can steer care decisions in ways that clinicians would not.
Human physicians were not affected by the same changes. In follow-up work under review, researchers found that altering the style or tone of a message didn’t impact human clinicians’ judgment, further underscoring the fragility of LLMs.
Researchers say their findings support more rigorous auditing and subgroup testing before deploying LLMs in high-stakes settings, especially for patient-facing tools.
“This is perhaps unsurprising — LLMs were not designed to prioritize patient medical care,” Ghassemi said. “… we don’t want to optimize a health care system that only works well for patients in specific groups.”
Newsletter
Stay informed and empowered with Medical Economics enewsletter, delivering expert insights, financial strategies, practice management tips and technology trends — tailored for today’s physicians.



















