Article Impact Level: HIGH
Data Quality: STRONG
Summary of: npj Digital Medicine, 8(1), 481. https://doi.org/10.1038/s41746-025-01830-9
Authors: Dr. John Tayu Lee et al.
Points
- Researchers evaluated three top AI models on their ability to provide safe and accurate guidance across the four distinct stages of stroke patient care.
- Responses generated under three different prompt engineering strategies were scored by four blinded stroke specialists on accuracy, empathy, and other key clinical performance domains.
- The AI models’ average scores ranged from 48 to 56, falling below the 60 out of 100 benchmark required for passing Taiwan’s medical qualification exam.
- While certain prompting techniques improved specific aspects, such as empathy or hallucination rates, no method enabled any model to perform reliably, especially in acute stroke treatment scenarios.
- The findings emphasize that generative AI currently requires robust clinical validation and human oversight to ensure patient safety, especially when dealing with high-risk medical conditions.
Summary
A recent study published in npj Digital Medicine evaluated the clinical reliability of three generative large language models (LLMs)—GPT-4o, Claude 3 Sonnet, and Gemini 1.0 Ultra—across the stroke care continuum. Researchers from National Taiwan University and the Harvard T.H. Chan School of Public Health tested the models on realistic patient inquiries spanning four stages: prevention, diagnosis, treatment, and rehabilitation. Three prompt engineering techniques were used: Zero-Shot Learning (ZSL), Chain of Thought (COT), and Tree of Thoughts (TOT). Because the study relied on qualitative expert scoring rather than clinical trial outcomes, it reported no confidence intervals or hazard ratios.
Outputs were assessed by four blinded senior stroke specialists across five domains: accuracy, hallucinations, specificity, empathy, and actionability. A clinical competency benchmark was set at 60/100, aligned with the passing standard of Taiwan's medical qualification exam. Overall performance was suboptimal, with average scores ranging from 48 to 56, and no model–prompt combination consistently passed the competency threshold. The models struggled most with acute treatment inquiries, where scores were lowest.
Although the LLMs performed inconsistently overall, specific prompt engineering techniques showed distinct advantages. TOT prompts improved empathy and actionability scores, occasionally allowing models to meet or exceed the 60/100 benchmark in prevention and rehabilitation scenarios. ZSL was most effective at reducing hallucinations and producing concise, accurate responses, particularly in the treatment stage. The authors conclude that, despite their potential, current general-purpose LLMs are unreliable for independent use in high-risk medical situations such as stroke, and that safe deployment requires robust human oversight and AI–clinician collaboration.
Link to the article: https://www.nature.com/articles/s41746-025-01830-9
References
Lee, J. T., Li, V. C.-S., Wu, J.-J., Chen, H.-H., Su, S. S.-Y., Chang, B. P.-H., Lai, R. L., Liu, C.-H., Chen, C.-T., Tanapima, V., Shen, T. K.-B., & Atun, R. (2025). Evaluation of performance of generative large language models for stroke care. npj Digital Medicine, 8(1), 481. https://doi.org/10.1038/s41746-025-01830-9
