Study: AI struggles as medical coder

AI may be good at some tasks in the health care industry, but coding isn’t currently one of them

AI doesn't code well: ©Lalaka - stock.adobe.com

In a study published in the April 19 online issue of NEJM AI, researchers at the Icahn School of Medicine at Mount Sinai found significant limitations in the capability of state-of-the-art artificial intelligence systems, specifically large language models (LLMs), to accurately perform medical coding tasks.

The research team, led by Dr. Ali Soroush, extracted more than 27,000 unique diagnosis and procedure codes from a year's worth of routine care data within the Mount Sinai Health System. Using the description associated with each code, the researchers tasked prominent LLMs from OpenAI, Google, and Meta with generating the most precise medical code. Across all models, however, the results revealed a disconcerting trend.
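For illustration, a minimal sketch of this kind of prompting setup appears below. The `query_llm` helper, the prompt wording, and the data format are assumptions made for the sketch, not the study's actual protocol.

```python
# Minimal sketch of the task described above: give a model a code description
# and ask it for the single most precise medical code.
# query_llm is a hypothetical wrapper around whichever LLM API is under test;
# the prompt wording is illustrative, not the study's protocol.

def query_llm(prompt: str) -> str:
    """Placeholder: send the prompt to a model (GPT-4, Gemini, Llama, etc.)
    and return its raw text response."""
    raise NotImplementedError("wire this to the model API being evaluated")


def generate_code(description: str, code_system: str) -> str:
    """Ask the model for the single most precise code for a description."""
    prompt = (
        f"You are a medical coding assistant. Given the following {code_system} "
        f"code description, reply with the single most precise {code_system} "
        f"code and nothing else.\n\nDescription: {description}"
    )
    return query_llm(prompt).strip()
```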

According to the findings, which evaluated GPT-4, GPT-3.5, Gemini-pro, and Llama-2-70b, none of the models achieved satisfactory accuracy: all fell below 50% in reproducing the original medical codes. GPT-4 emerged as the top performer, with the highest exact-match rates for ICD-9-CM, ICD-10-CM, and CPT codes, yet it still produced errors at an unacceptable rate.
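Exact-match rates of the kind cited above can be computed by comparing each generated code with the original, grouped by code system. The sketch below continues the same hypothetical setup; the record field names are assumptions.

```python
from collections import defaultdict

def exact_match_rates(records: list[dict]) -> dict[str, float]:
    """Fraction of generated codes that exactly reproduce the original code,
    reported separately for each code system (ICD-9-CM, ICD-10-CM, CPT)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for rec in records:
        system = rec["code_system"]
        totals[system] += 1
        if rec["predicted"].strip().upper() == rec["original"].strip().upper():
            hits[system] += 1
    return {system: hits[system] / totals[system] for system in totals}

# Illustrative only: one exact match out of two ICD-10-CM codes gives 0.5.
print(exact_match_rates([
    {"code_system": "ICD-10-CM", "original": "E11.9", "predicted": "E11.9"},
    {"code_system": "ICD-10-CM", "original": "I10", "predicted": "I15.0"},
]))
```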

"GPT-4 demonstrated the best performance among the models we examined, yet it still fell short of achieving reliable accuracy," stated Dr. Soroush, the study's corresponding author and assistant professor at Icahn Mount Sinai. "Our research underscores the imperative for thorough evaluation and refinement before integrating AI technologies into critical health care operations such as medical coding."

The study also highlighted nuanced differences in the models' error patterns. GPT-4 showed a relatively sophisticated grasp of medical terminology, often producing technically correct codes that conveyed the intended meaning. GPT-3.5, by contrast, tended toward vagueness, generating codes that, while broadly accurate, were more general than the original descriptions.

Dr. Eyal Klang, co-senior author of the study and director of the Generative AI Research Program in Mount Sinai's Division of Data-Driven and Digital Medicine (D3M), emphasized the importance of assessing LLMs' capabilities on numerical tasks, particularly in medical coding, where precision is paramount. He suggested that integrating expert knowledge could improve the accuracy of AI-driven medical code extraction, potentially streamlining billing processes and reducing administrative burdens in health care.

While the study provides valuable insights into the current limitations and challenges of LLMs in health care, the researchers caution that artificial tasks may not fully represent real-world scenarios, where LLM performance could be even more compromised.

As the health care industry continues to explore the potential of AI-driven solutions, researchers say that the study serves as a critical reminder of the importance of rigorous evaluation and ongoing development to ensure the reliability and efficacy of these technologies in clinical practice.
