Artificial intelligence (AI), particularly large language models (LLMs) like GPT-4o, is emerging as a potential tool for enhancing diagnostic accuracy in healthcare. This study evaluated the diagnostic performance of GPT-4o compared to human ophthalmologists in glaucoma cases, exploring its strengths and limitations.
Study objectives and design
The primary aim was to assess GPT-4o’s performance in primary and differential diagnoses of glaucoma relative to human ophthalmologists. While GPT-4o demonstrated potential in generating comprehensive differential diagnoses, it fell short in primary diagnosis accuracy and completeness, highlighting the current limitations of AI as a standalone diagnostic tool.1
The prospective, observational study was conducted by a group of Chinese researchers at a tertiary care ophthalmology center. Twenty-six glaucoma cases, encompassing both primary and secondary types, were selected from publicly available databases and institutional records. Each case was analyzed by GPT-4o and by three ophthalmologists with varying levels of experience.
Performance assessment and results
Diagnostic accuracy and completeness were evaluated using 10-point and 6-point Likert scales, respectively. Statistical analyses, including Kruskal–Wallis and Mann–Whitney U tests, revealed significant differences in performance:
- Primary diagnosis: GPT-4o achieved a mean accuracy score of 5.500 (p < 0.001), significantly lower than the highest-performing ophthalmologist, Doctor C, who scored 8.038 (p < 0.001). Completeness scores for GPT-4o were also inferior (3.077, p < 0.001) compared to the lowest-scoring ophthalmologist, Doctor B (3.615, p < 0.001).
- Differential diagnosis: In contrast, GPT-4o demonstrated comparable accuracy to the ophthalmologists for differential diagnoses, scoring 7.577 versus Doctor A (7.615) and Doctor C (7.673) (p < 0.0001). Notably, GPT-4o achieved the highest completeness score (4.096), outperforming Doctor C (3.846), Doctor A (2.923), and Doctor B (2.808) (p < 0.0001).
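For readers unfamiliar with the rank-based tests named above, the following is a minimal sketch of how a Mann–Whitney U statistic (two raters) and a Kruskal–Wallis H statistic (three or more raters) can be computed from Likert scores. It is not the researchers' analysis code. The function names and the sample scores are hypothetical and invented purely for illustration, the H statistic omits the tie correction, and no p-values are computed.

```python
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """U statistic for x: count of (x, y) pairs where x is larger (ties add 0.5)."""
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:len(x)])
    return rank_sum_x - len(x) * (len(x) + 1) / 2


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic (no tie correction) across several groups."""
    pooled = list(chain.from_iterable(groups))
    ranks = average_ranks(pooled)
    n = len(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


if __name__ == "__main__":
    # Hypothetical 10-point accuracy scores, NOT data from the study.
    model_scores = [5, 6, 4, 7, 5, 6]
    doctor_a = [8, 7, 9, 8, 7, 8]
    doctor_b = [7, 8, 6, 9, 8, 7]
    print("U (model vs Doctor A):", mann_whitney_u(model_scores, doctor_a))
    print("H (all three raters):", kruskal_wallis_h(model_scores, doctor_a, doctor_b))
```

Both statistics depend only on the rank order of the scores, not on their spacing, which is why such tests are a common choice for ordinal Likert ratings. In practice, a statistics package would also supply the tie correction and the p-values.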
Insights and context
Primary diagnosis involves cognitive reasoning and prioritization of clinical information, areas where GPT-4o underperformed compared to human ophthalmologists. These findings align with prior research showing that while AI models like ChatGPT have improved in general medicine and certain specialties, they continue to struggle in highly specialized fields such as neuro-ophthalmology and ocular pathology.1
Conversely, GPT-4o’s superior performance in differential diagnosis reflects its capacity to reference extensive databases and generate exhaustive lists of potential diagnoses. However, this comprehensive approach may overwhelm clinicians, increasing the risk of cognitive bias and misdiagnosis.
Limitations and implications
The study’s limitations include its small sample size (n = 26) and case selection, which emphasized a broad range of glaucoma subtypes over real-world prevalence patterns. Future studies should address these constraints by utilizing larger, more representative datasets.
“Recognizing this limitation, future research with a larger, more representative sample that aligns with real-world prevalence rates would improve the applicability of these findings across diverse clinical settings,” the researchers noted.
According to the researchers, the findings underscore the importance of improving AI models for primary diagnosis by incorporating clinician feedback, enhancing training datasets, and exploring more sophisticated reasoning algorithms. Evaluating AI’s utility in primary care or non-specialist settings, where it might serve as an adjunct to optometrists or general practitioners, is another avenue for future research.
Conclusions
GPT-4o demonstrated promise as a complementary tool for differential diagnosis in glaucoma cases but remains inadequate for primary diagnosis. The study also highlighted concerning gaps in diagnostic accuracy among human ophthalmologists, emphasizing the need for continuous self-evaluation and transparent communication with patients.
“Future advancements in AI may eventually enhance diagnostic accuracy, but until then, it should be viewed as a complementary tool, not a replacement for human expertise,” the researchers concluded.