As generative AI technologies advance, establishing standardized evaluation metrics is necessary to ensure safety and efficacy in health care.
The field of generative artificial intelligence (AI) in health care is advancing at an unprecedented pace, driven by the need for models to not only generate clinical summaries but also interpret complex multimodal data such as videos, text and images. As these technologies evolve, evaluating their accuracy and effectiveness has become increasingly challenging. The stakes are high, as patient safety and the complexity of medical data demand precise validation. Expert validation by skilled surgeons is currently the gold standard, but it is not a sustainable long-term solution.
With the global generative AI market projected to soar from $1 billion in 2022 to $22 billion by 2032, the health care industry must address the urgent need for reliable metrics to keep pace with this rapid growth. Methods like adapting the CLIP score to measure the alignment between text and images highlight the potential for new evaluation techniques. By exploring these advancements and examining the challenges of assessing AI-generated outputs, particularly in the context of multimodal data, we can work toward establishing standardized evaluation frameworks that ensure the safe and effective deployment of AI in clinical settings.
Evaluating generative AI in health care is particularly challenging due to the complexity of integrating diverse data types — such as textual descriptions, medical imaging and extensive surgical videos. The critical nature of these technologies and the intricate relationships between different data types make it difficult to assess their accuracy and reliability.
For instance, one of the most critical needs for generative AI in health care is the ability to accurately describe surgical procedures from video, yet summarizing lengthy surgical recordings presents a unique challenge. Oversimplified summaries can omit critical details, affecting the accuracy of postoperative reports or educational materials. Surgical procedures involve real-time adjustments and various tools over extended periods, requiring evaluation mechanisms that capture both visual and procedural aspects and accurately interpret transitions between different phases. In robot-assisted surgeries, for example, AI must describe how the surgeon maneuvers instruments and how these movements correlate with changes in patient anatomy.
Inaccurate descriptions can lead to significant information gaps, and current evaluation tools are not equipped to measure how well AI models integrate complex, dynamic data over long durations. Robust evaluation frameworks are needed to confirm that these systems can reliably meet this demand.
Several metrics have been developed to evaluate generative AI models, but many fall short when addressing the complexity of multimodal health care data. Existing metrics such as SPICE (Semantic Propositional Image Caption Evaluation) and BERTScore offer value in assessing text- or image-based outputs but may not suffice for multimodal data integration or compositional reasoning.
Harmonic mean of metrics: Combines multiple evaluation metrics into a single balanced score, offering a more comprehensive view of AI performance (a sketch of this combination follows this list). However, the harmonic mean is best known for balancing precision and recall, as in the F1 score, and may not fully capture the nuances of surgical procedures.
SPICE: Evaluates the compositional semantics of generated descriptions. It works well for static visual data but struggles with video and with complex, multi-step processes such as surgeries.
BERTScore: Measures semantic similarity between generated and reference texts using BERT embeddings, providing nuanced language evaluation. Its application to medical procedures remains limited.
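To make these metrics concrete, here is a minimal sketch that computes BERTScore for a generated summary against a reference text and folds it into a harmonic mean with a second score. It assumes the open-source bert-score package; the example sentences and the placeholder SPICE value are illustrative, not outputs of a real system.

```python
# pip install bert-score
from bert_score import score

# Hypothetical example texts: an AI-generated surgical summary and
# a surgeon-written reference (illustrative strings, not real reports).
candidates = ["The surgeon dissected the gallbladder from the liver bed using electrocautery."]
references = ["Electrocautery was used to dissect the gallbladder away from the liver bed."]

# BERTScore returns precision, recall and F1 tensors, one value per text pair.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
bertscore_f1 = F1.item()

# A harmonic mean rewards balanced performance: one weak metric pulls
# the combined score down more than an arithmetic mean would.
def harmonic_mean(values):
    return len(values) / sum(1.0 / v for v in values)

# Suppose a SPICE score were also available for the same output (placeholder value).
spice = 0.62  # hypothetical
combined = harmonic_mean([bertscore_f1, spice])
print(f"BERTScore F1: {bertscore_f1:.3f}, combined score: {combined:.3f}")
```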
One promising method for evaluating AI-generated medical descriptions involves the CLIP score, which measures the alignment between text and images. In AI models generating medical descriptions from images or videos, this approach calculates the cosine similarity between vectors representing the input (image/video) and the output (text). A score closer to 1 indicates better alignment.
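As a rough illustration, the following sketch computes that cosine similarity for a single frame using the publicly available openai/clip-vit-base-patch32 checkpoint via Hugging Face's transformers library. The frame path and caption are hypothetical stand-ins, and a production metric for surgical video would need a domain-tuned model plus a strategy for long sequences, such as sampling frames and averaging their scores.

```python
# pip install torch transformers pillow
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a general-purpose CLIP model (not tuned for medical imagery).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical inputs: one video frame and an AI-generated description.
image = Image.open("frame.png")
caption = "Grasper retracts the gallbladder while the hook dissects the cystic duct."

inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_embeds = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_embeds = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# Cosine similarity of the L2-normalized embeddings; closer to 1 means better alignment.
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
clip_score = (image_embeds @ text_embeds.T).item()
print(f"CLIP score: {clip_score:.3f}")
```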
By applying this technique to medical AI models, we can quantify how accurately the generated descriptions match the visual data. This method offers a standardized, objective way to assess the reliability of AI-generated content, which is needed to ensure trust in health care applications.
As generative AI becomes more embedded in health care workflows, standardized evaluation frameworks are crucial. The risks associated with inaccurate AI-generated outputs — whether in medical summaries or surgical video interpretations — are too high to rely on inconsistent evaluation methods. Standardized frameworks would provide consistent criteria to ensure AI outputs are precise, reliable and transparent. Additionally, integrating AI explainability mechanisms into these frameworks would allow users to understand and trust the decision-making processes behind AI outputs, further ensuring safe and effective technology use.
Organizations like HealthAI and the Coalition for Health AI are working to develop validation mechanisms and assurance labs for health care AI. These initiatives aim to create standardized frameworks that can be widely adopted, meeting regulatory requirements such as the EU AI Act and the U.S. Executive Order on AI.
As generative AI models grow more sophisticated and integrate more deeply into clinical workflows, the need for reliable, objective and standardized metrics becomes increasingly urgent. Emerging techniques, such as adapting the CLIP score for multimodal data evaluation, illustrate the importance of developing advanced metrics that accurately capture the complexity of medical procedures, particularly in demanding domains such as surgery. Establishing robust, standardized evaluation frameworks will ensure that models are safe, effective and trustworthy in health care settings.
Neeraj Mainkar, PhD, is vice president of software engineering and advanced technology at Proprio and has more than 25 years of experience in the regulated software industry. A computational physicist by training, he is passionate about medical technology software and strongly believes in applying advanced digital technologies in the operating room to enhance surgeon performance and improve patient outcomes.