A Mass General Brigham study suggests that LLMs could be used to aid clinicians during physical exams.
During any patient visit, the physical examination is critical for evaluating a patient’s health, identifying possible conditions, and guiding further clinical management. But when a clinician is early in their career, generally inexperienced, or lacking specialized training in a specific area, performing the right exam can be challenging. Researchers from Mass General Brigham suggest that large language models (LLMs) could serve as aids during physical exams in these situations.
The study, published in the Journal of Medical Artificial Intelligence, put the LLM “GPT-4” to the test, prompting the program to recommend physical exam instructions based on patients’ primary symptoms.
“Medical professionals early in their career may face challenges in performing the appropriate patient-tailored physical exam because of their limited experience or other context-dependent factors, such as lower resourced settings,” Marc D. Succi, MD, senior author of the study, said in a Mass General Brigham release. “LLMs have the potential to serve as a bridge and parallel support physicians and other medical professionals with physical exam techniques and enhance their diagnostic abilities at the point of care.”
Researchers prompted GPT-4 to recommend physical exam instructions for each of the 19 chief complaints detailed in the Hypothesis Driven Physical Exam Student Handbook, created by the Association of American Medical Colleges (AAMC). Three attending physicians evaluated GPT-4’s responses, rating them on a Likert scale for accuracy, comprehensiveness, readability, and overall quality.
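The paper’s exact prompts are not reproduced here; as a rough, hypothetical illustration of the kind of setup the study describes, the sketch below assumes the OpenAI Python SDK and invents its own prompt wording, using two complaint names mentioned later in this article in place of the full AAMC list.

```python
# Minimal sketch (not the study's actual code or prompt wording) of prompting
# GPT-4 for physical exam instructions given a chief complaint, using the
# OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative complaints only; the study used the 19 chief complaints from
# the AAMC Hypothesis Driven Physical Exam Student Handbook.
chief_complaints = ["leg pain upon exertion", "lower abdominal pain"]

for complaint in chief_complaints:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "user",
                "content": (
                    f"A patient presents with the chief complaint: '{complaint}'. "
                    "Recommend the physical exam maneuvers the clinician should "
                    "perform, with brief instructions for each."
                ),
            }
        ],
    )
    # In the study, each response was then rated by attending physicians on a
    # Likert scale for accuracy, comprehensiveness, readability, and quality.
    print(complaint, "->", response.choices[0].message.content)
```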
The attending physicians determined that GPT-4 provided solid instructions, with the model scoring at least 80% of the possible points. GPT-4 provided the highest-quality instructions for examining a patient with “leg pain upon exertion” and the lowest-quality instructions for “lower abdominal pain.”
Even so, the reviewers found that the program occasionally omitted key instructions or was overly vague in its explanations, underscoring the need for human oversight and evaluation. The researchers concluded that GPT-4 could serve as a solid aid for filling knowledge gaps and assisting physicians in their physical examinations of patients.
In their qualitative analysis of the program’s responses, reviewers were impressed by the detail of the special tests but criticized the program for its lack of specificity, redundant information, omission of informative exams, vague language, irrelevant information, general inconsistencies, and failure to call for all vital signs.
“GPT-4 performed well in many respects, yet its occasional vagueness or omissions in critical areas, like diagnostic specificity, remind us of the necessity of physician judgement to ensure comprehensive patient care,” said Arya Rao, a student researcher and lead author of the study.
Going forward, the study’s authors call for investigations that directly compare the diagnostic capabilities of unassisted physicians with those of physicians who have access to LLMs. They also recommend training LLMs on real-world cases to address the gaps in GPT-4’s diagnostic capacity demonstrated in the study.
“We anticipate an increasing role for LLMs in clinical decision support, helping to fill knowledge gaps and serving as an academic tool for emerging medical professionals, thereby enhancing physicians’ diagnostic capacity,” the authors concluded.