Article Impact Level: HIGH
Data Quality: STRONG
Summary of: Dr. Christian Cao et al., Annals of Internal Medicine, 178(3), 389–401. https://doi.org/10.7326/ANNALS-24-02189
Points
- Large language models (LLMs) were tested for abstract and full-text screening in systematic reviews (SRs) on 48,425 citations for abstract screening and 12,690 full-text articles, with the original SR authors' eligibility decisions serving as the reference standard.
- Optimized LLM prompts achieved 97.7% sensitivity and 85.2% specificity for abstract screening and 96.5% sensitivity and 91.2% specificity for full-text screening, substantially outperforming zero-shot prompting (roughly 49% sensitivity at both stages).
- The LLM-based approach completed abstract screening in under one day for $157.02 USD, compared to 83 hours and $1,666.67 USD for manual screening, demonstrating substantial savings in time and cost.
- The study highlights LLMs as a scalable tool for streamlining systematic reviews, reducing the workload for researchers while maintaining high accuracy in article selection.
- The study acknowledges its retrospective design and the use of freely available articles, suggesting that further optimizations are needed to improve model performance and applicability in broader SR settings.
Summary
This study investigates the use of large language models (LLMs) to streamline article screening in systematic reviews (SRs), focusing on LLM-driven abstract and full-text screening as a way to improve efficiency. The authors developed generic prompt templates for the GPT-4-0125-preview model and tested them on 48,425 citations for abstract screening and 12,690 articles for full-text screening, drawn from previously published SRs. The model was tasked with including or excluding articles based on the eligibility criteria defined by the SR authors, and its decisions were compared against the original SR authors' final decisions after full-text screening.
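To make the screening setup concrete, here is a minimal sketch of what such a screening call can look like. The prompt wording, the `screen_abstract` helper, and the response format are illustrative assumptions, not the authors' published templates (those appear in the article's supplement); the sketch assumes only an OpenAI-style chat-completion call to the GPT-4-0125-preview model used in the study.

```python
# Minimal sketch of LLM-driven abstract screening (illustrative only; not the
# authors' actual prompt template). Assumes the `openai` Python package and an
# OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = """You are screening citations for a systematic review.

Eligibility criteria (defined by the review authors):
{criteria}

Title: {title}
Abstract: {abstract}

Answer with exactly one word: INCLUDE or EXCLUDE."""

def screen_abstract(title: str, abstract: str, criteria: str) -> bool:
    """Return True if the model votes to include the citation."""
    response = client.chat.completions.create(
        model="gpt-4-0125-preview",  # model used in the study
        temperature=0,               # keep screening decisions as stable as possible
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(
                criteria=criteria, title=title, abstract=abstract
            ),
        }],
    )
    decision = response.choices[0].message.content.strip().upper()
    return decision.startswith("INCLUDE")
```

A simple loop over `screen_abstract` also explains the cost and turnaround figures reported below: at the study's reported $157.02 total for 48,425 citations, each call costs roughly a third of a cent and completes in seconds, so a full abstract-screening pass can finish within a day.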
The optimized LLM prompts achieved strong performance, with a weighted sensitivity of 97.7% (range: 86.7% to 100%) and specificity of 85.2% (range: 68.3% to 95.9%) for abstract screening. Full-text screening yielded a weighted sensitivity of 96.5% (range: 89.7% to 100%) and specificity of 91.2% (range: 80.7% to 100%). In contrast, zero-shot prompts performed poorly, with sensitivities of only 49.0% for abstract screening and 49.1% for full-text screening. The study also showed that LLM-based screening drastically reduced cost and turnaround time: abstract screening by a single human reviewer was estimated to take 83 hours and cost $1,666.67 USD, while the LLM-based approach completed the task in under one day for $157.02 USD.
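For readers less familiar with these metrics: sensitivity is the share of truly eligible articles the model correctly included, and specificity the share of ineligible articles it correctly excluded, both judged against the SR authors' decisions. The sketch below uses made-up confusion counts and pools them across reviews as one plausible way to form a weighted estimate; the study's exact weighting scheme may differ.

```python
# Sensitivity/specificity against the SR authors' decisions (the reference
# standard). All counts below are made up for illustration only.
def sensitivity(tp: int, fn: int) -> float:
    # Proportion of truly eligible articles the LLM correctly included.
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    # Proportion of truly ineligible articles the LLM correctly excluded.
    return tn / (tn + fp)

# Hypothetical per-review confusion counts: (tp, fn, tn, fp).
reviews = [
    (58, 2, 820, 120),
    (31, 1, 460, 90),
]

# Pooling counts across reviews weights each review by its size, one common
# way to produce a single weighted estimate.
total_tp = sum(r[0] for r in reviews)
total_fn = sum(r[1] for r in reviews)
total_tn = sum(r[2] for r in reviews)
total_fp = sum(r[3] for r in reviews)

print(f"weighted sensitivity: {sensitivity(total_tp, total_fn):.1%}")
print(f"weighted specificity: {specificity(total_tn, total_fp):.1%}")
```

High sensitivity is the priority in this setting, since an article wrongly excluded at screening is lost to the review, whereas a false inclusion only costs extra reviewer time downstream.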
The study demonstrates that LLMs, when optimized with appropriate prompting, can significantly enhance the speed and accuracy of article screening for SRs, making them a valuable tool for researchers. The results suggest that these models may reduce the time and financial burden associated with systematic reviews, offering a scalable solution for future medical research. However, further optimization is needed, and the study's limitations include its retrospective design and the exclusive use of freely available articles for full-text screening.
Link to the article: https://www.acpjournals.org/doi/10.7326/ANNALS-24-02189
References
Cao, C., Sang, J., Arora, R., Chen, D., Kloosterman, R., Cecere, M., Gorla, J., Saleh, R., Drennan, I., Teja, B., Fehlings, M., Ronksley, P., Leung, A. A., Weisz, D. E., Ware, H., Whelan, M., Emerson, D. B., Arora, R. K., & Bobrovitz, N. (2025). Development of prompt templates for large language model–driven screening in systematic reviews. Annals of Internal Medicine, 178(3), 389–401. https://doi.org/10.7326/ANNALS-24-02189