Article NL V.22 (2025) Internal Medicine Research

Optimizing Systematic Review Screening: The Role of LLMs in Abstract and Full-Text Evaluation

Article Impact Level: HIGH
Data Quality: STRONG
Summary of Annals of Internal Medicine, 178(3), 389–401. https://doi.org/10.7326/ANNALS-24-02189
Dr. Christian Cao et al.

Points

  • Large language models (LLMs) were tested for abstract and full-text screening in systematic reviews (SRs), using 48,425 citations for abstract screening and 12,690 full-text articles, with eligibility determined by SR authors.
  • Optimized LLM prompts achieved 97.7% sensitivity and 85.2% specificity for abstract screening and 96.5% sensitivity and 91.2% specificity for full-text screening, significantly outperforming zero-shot prompting methods.
  • The LLM-based approach completed abstract screening in under one day for $157.02 USD, compared to 83 hours and $1,666.67 USD for manual screening, demonstrating substantial savings in time and cost.
  • The study highlights LLMs as a scalable tool for streamlining systematic reviews, reducing the workload for researchers while maintaining high accuracy in article selection.
  • The study acknowledges its retrospective design and the use of freely available articles, suggesting that further optimizations are needed to improve model performance and applicability in broader SR settings.

Summary

This study investigates the use of large language models (LLMs) to streamline article screening in systematic reviews (SRs), evaluating whether LLM-driven abstract and full-text screening can improve efficiency. The authors developed generic prompt templates using the GPT-4-0125-preview model and tested them across 48,425 citations for abstract screening and 12,690 full-text articles. The model was tasked with including or excluding articles based on eligibility criteria defined by the SR authors, and its decisions were compared against those made by the original SR authors after full-text screening.
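The screening setup described above can be sketched in broad strokes. The prompt wording, helper names (`build_prompt`, `parse_decision`, `screen`), and the one-word INCLUDE/EXCLUDE answer format below are illustrative assumptions for this sketch, not the authors' published prompt templates; `ask_llm` stands in for any model call.

```python
# Illustrative sketch of LLM-driven abstract screening. The prompt wording,
# helper names, and INCLUDE/EXCLUDE protocol are assumptions for this example,
# not the study's actual prompt templates.

def build_prompt(criteria: str, abstract: str) -> str:
    """Assemble a screening prompt from SR eligibility criteria and an abstract."""
    return (
        "You are screening citations for a systematic review.\n"
        f"Eligibility criteria:\n{criteria}\n\n"
        f"Abstract:\n{abstract}\n\n"
        "Answer with a single word: INCLUDE or EXCLUDE."
    )

def parse_decision(response: str) -> bool:
    """Map the model's one-word answer to include (True) / exclude (False)."""
    return response.strip().upper().startswith("INCLUDE")

def screen(citations: dict[str, str], criteria: str, ask_llm) -> dict[str, bool]:
    """Screen every citation; ask_llm is any callable mapping a prompt to a reply."""
    return {
        cid: parse_decision(ask_llm(build_prompt(criteria, abstract)))
        for cid, abstract in citations.items()
    }

# Tiny offline demo with a stubbed model that "includes" randomized trials.
if __name__ == "__main__":
    stub = lambda prompt: "INCLUDE" if "randomized" in prompt else "EXCLUDE"
    demo = {"a1": "A randomized trial of drug X.", "a2": "A narrative case report."}
    print(screen(demo, "Randomized controlled trials only.", stub))
    # → {'a1': True, 'a2': False}
```

Decoupling the screening loop from the model call (any `prompt -> reply` callable) is what makes this kind of pipeline cheap to rerun as prompts are iterated, which is the optimization the study performed.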

The optimized LLM prompts achieved strong performance, with a weighted sensitivity of 97.7% (range: 86.7% to 100%) and specificity of 85.2% (range: 68.3% to 95.9%) for abstract screening. Full-text screening yielded a weighted sensitivity of 96.5% (range: 89.7% to 100%) and specificity of 91.2% (range: 80.7% to 100%). In contrast, zero-shot prompts performed poorly, with sensitivities of only 49.0% for abstract screening and 49.1% for full-text screening. The study also showed that LLM-based screening drastically reduced the cost and time required for SRs: a single human reviewer's abstract screening was estimated to take 83 hours and cost $1,666.67 USD, whereas the LLM-based approach completed the task in under one day for just $157.02 USD.
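For readers less familiar with these metrics, sensitivity and specificity follow directly from the confusion counts, and per-review values can be combined into a weighted average. The counts and weights below are invented for illustration, not data from the study:

```python
# Standard definitions used to report screening accuracy; the counts in the
# demo are invented for illustration, not taken from the study.

def sensitivity(tp: int, fn: int) -> float:
    """Fraction of truly eligible articles the screener includes: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """Fraction of ineligible articles the screener excludes: TN / (TN + FP)."""
    return tn / (tn + fp)

def weighted_mean(values: list[float], weights: list[float]) -> float:
    """Combine per-review metrics, weighting each review by its citation count."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Example: two hypothetical reviews with 1,000 and 3,000 screened citations.
sens = [sensitivity(90, 10), sensitivity(100, 0)]   # 0.9 and 1.0
print(weighted_mean(sens, [1000, 3000]))            # → 0.975
```

High sensitivity is the priority in SR screening, since a missed eligible article (a false negative) cannot be recovered later, whereas false positives are filtered out at the next screening stage.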

The study demonstrates that LLMs, when optimized with appropriate prompting, can significantly enhance the speed and accuracy of article screening for SRs, making it a valuable tool for researchers. The results suggest that these models may reduce the time and financial burden associated with systematic reviews, offering a scalable solution for future medical research. However, further optimizations are necessary, and the study’s limitations include its retrospective nature and the exclusive use of freely available articles for full-text screening.

Link to the article: https://www.acpjournals.org/doi/10.7326/ANNALS-24-02189


References

Cao, C., Sang, J., Arora, R., Chen, D., Kloosterman, R., Cecere, M., Gorla, J., Saleh, R., Drennan, I., Teja, B., Fehlings, M., Ronksley, P., Leung, A. A., Weisz, D. E., Ware, H., Whelan, M., Emerson, D. B., Arora, R. K., & Bobrovitz, N. (2025). Development of prompt templates for large language model–driven screening in systematic reviews. Annals of Internal Medicine, 178(3), 389–401. https://doi.org/10.7326/ANNALS-24-02189

About the author

Hippocrates Briefs Team
