Article Impact Level: HIGH
Data Quality: STRONG
Summary of an Annals of Internal Medicine article by Dr. Ashok Reddy et al.
https://doi.org/10.7326/ANNALS-25-02772
Points
- Researchers compared 11 artificial intelligence (AI) scribe tools with 18 human note-takers to evaluate the quality of documentation produced for 5 standardized primary care cases.
- Human-generated notes received significantly higher quality scores across all scenarios, with the widest gap in an acute low back pain case: 43.8 for human notes versus 20.3 for AI.
- Pooled domain analysis showed that AI scribes' largest quality deficits relative to human note-takers were in thoroughness, organization, and clinical usefulness.
- Thirty blinded raters used a standardized documentation quality instrument, a modified PDQI-9, to provide a vendor-neutral assessment across 10 domains, including the overall usefulness of the notes.
- The findings indicate that although these tools may reduce administrative burden, they require more rigorous evaluation before large-scale deployment to avoid compromising the quality of clinical care.
Summary
This study evaluated the comparative quality of clinical documentation produced by ambient artificial intelligence (AI) scribes and human note-takers. Given the increasing adoption of AI to mitigate clinician administrative burden, the research sought to determine whether automated tools maintain the high standards of thoroughness and utility required in primary care. Investigators used 5 standardized primary care cases from the Veterans Health Administration, with 11 AI scribe tools, 18 human note-takers, and 30 blinded human raters, to ensure a vendor-neutral assessment.
The analysis revealed that human-generated notes received significantly higher overall quality scores than AI-generated notes across all clinical scenarios. Using the modified Physician Documentation Quality Instrument (PDQI-9), which scores 10 domains for a maximum total of 50 points, the study found the largest disparity in an acute low back pain case, where human notes scored 43.8 compared with 20.3 for AI. Pooled domain analysis identified the greatest AI deficits in thoroughness (-1.23), organization (-1.06), and clinical usefulness (-1.03) relative to human performance.
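To make the scoring arithmetic concrete, below is a minimal sketch of how a 50-point PDQI-style total and per-domain differences might be computed, assuming each of the 10 domains is rated 1 to 5 (10 domains x 5 points = 50-point maximum). The domain names, the helper functions (total_score, domain_differences), and the example ratings are illustrative assumptions, not the study's actual instrument or data; only the per-domain deficits echoed in the example comment come from the summary above.

```python
# Illustrative PDQI-9-style scoring sketch. Assumes each of the 10 domains
# is rated 1-5 by a blinded rater (10 x 5 = 50-point maximum). Domain names
# follow the published PDQI-9 attributes plus an assumed tenth "overall
# usefulness" item; they are not taken from the study.

DOMAINS = [
    "up_to_date", "accurate", "thorough", "useful", "organized",
    "comprehensible", "succinct", "synthesized",
    "internally_consistent", "overall_useful",
]

def total_score(ratings: dict[str, int]) -> int:
    """Sum one 1-5 rating per domain into a 10-50 total."""
    assert set(ratings) == set(DOMAINS), "need exactly one rating per domain"
    assert all(1 <= r <= 5 for r in ratings.values()), "ratings must be 1-5"
    return sum(ratings.values())

def domain_differences(ai: dict[str, float], human: dict[str, float]) -> dict[str, float]:
    """Per-domain mean difference (AI minus human); negative favors humans."""
    return {d: round(ai[d] - human[d], 2) for d in DOMAINS}

if __name__ == "__main__":
    # Hypothetical per-domain mean ratings (absolute levels invented here);
    # the AI ratings are set so the differences match the reported deficits.
    human = {d: 4.5 for d in DOMAINS}
    ai = dict(human, thorough=3.27, organized=3.44, overall_useful=3.47)
    print(domain_differences(ai, human))
    # thorough: -1.23, organized: -1.06, overall_useful: -1.03, others 0.0
```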
The findings suggest that while ambient AI scribes hold promise for reducing documentation time, their current output may compromise the quality of clinical records. The researchers emphasize that rigorous, ongoing evaluation is essential before large-scale deployment to ensure these tools enhance rather than detract from patient care. These results highlight a critical need for software refinement, as current AI models struggle to match human clinicians in distilling complex audio recordings into organized, useful medical documentation.
Link to the article: https://www.acpjournals.org/doi/10.7326/ANNALS-25-02772
References
Reddy, A., Gunnink, E., Wheat, C. L., Pawlikowski, S., Payne, C. M., Wiltz, S., Hubert, T. L., Kirsh, S., Carey, E., Hill, D., & Nelson, K. M. (2026). Rapid evaluation of artificial intelligence technology used for ambient dictation in primary care: Comparing the quality of documentation of artificial intelligence–generated and human-produced clinical notes. Annals of Internal Medicine, ANNALS-25-02772. https://doi.org/10.7326/ANNALS-25-02772
