Click here to view the original post by International Journal of Retina & Vitreous.
In addition to our own original articles, the Retina Consultant team curates relevant news and research from leading publications to provide our readers with a comprehensive view of the retina and ophthalmology landscape. Full credit belongs to the original authors and publication.
Selection and characteristics of eligible studies
A flowchart of the literature search and study selection process is presented in Fig. 1. The systematic search identified 2065 records from databases and 26 from registers. Before screening, 910 duplicate records were removed, 722 records were marked as ineligible by automation tools, and 333 records were removed for other reasons, leaving 126 records to be screened. Of these, 56 were excluded. Subsequently, 67 reports were assessed for eligibility; 40 of these were retrospective studies, and 18 studies were ultimately included in the review.
The 18 included studies comprised randomized controlled trials, cross-sectional studies, and prospective studies. They were conducted in various settings, including the general population, hospitals, and community clinics, across multiple countries such as the United States of America and China. Two studies used OCT for imaging, while the other 16 captured the retina with a fundus camera only. The included studies screened for DR and ME; 9 of them assessed DR only. For each study, we recorded the diagnostic AI used to detect DR and ME, the quality of the images, the geographical area where the study was conducted, and the sample size (Table 1).
Quality assessment
Figures 2 and 3 show the summary chart and bar chart for the quality assessment of the included studies. Four studies (Baget-Bernaldiz et al., 2021; Li et al., 2021; Nunez do Rio et al., 2021; and Scheetz et al., 2021) had a high risk of bias for patient selection. One study (Wongchaisuwat et al., 2020) had an unclear risk of bias for index testing, and one (Scheetz et al., 2021) reported unclear information on the establishment and blinding of reference grading. Thirteen of the 18 studies performed poorly in the flow and timing evaluation. All studies showed low concern regarding applicability for patient selection, index testing, and reference grading.
Threshold analysis and heterogeneity test
To investigate the presence of a threshold effect, we calculated Spearman’s correlation coefficient between sensitivity and specificity. The coefficient was −0.45, suggesting a moderate threshold effect across the included studies. The summary ROC curve was also asymmetric, which indicates that the studies may have applied different diagnostic thresholds, shifting the balance between sensitivity and specificity. We assessed heterogeneity across the included studies using the I² statistic and Cochran’s Q test. Cochran’s Q test was significant for both sensitivity and specificity (p < 0.05). The SE, SP, LR+, LR−, and DOR are shown in Table 2. Differences in study design, patient populations, and diagnostic thresholds likely contribute to the heterogeneity observed in the sensitivity and specificity estimates (see Fig. 4).
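As an illustration of the threshold check described above, the following minimal sketch computes Spearman’s rho as the Pearson correlation of rank vectors. The function names and the per-study values are hypothetical and are not taken from the included studies; the original analysis used its own software and data.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values that starts at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-study values, NOT the data from the review.
sensitivity = [0.95, 0.90, 0.85, 0.80, 0.92]
specificity = [0.80, 0.88, 0.93, 0.90, 0.85]
rho = spearman_rho(sensitivity, specificity)
print(f"Spearman rho = {rho:.2f}")
```

A negative rho between sensitivity and specificity means that studies with higher sensitivity tend to report lower specificity, which is the pattern expected when studies use different positivity thresholds.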
Synthesis of results
The included data on AI systems and doctors in screening for diabetic retinopathy were analyzed using Meta-DiSc software (version 1.4). The AI-based screening systems demonstrated high diagnostic accuracy, with pooled sensitivity and specificity of 0.877 (95% CI: 0.870–0.884) and 0.906 (95% CI: 0.904–0.908), respectively.
For doctors, the pooled sensitivity and specificity were 0.751 (95% CI: 0.736–0.766) and 0.941 (95% CI: 0.936–0.946), respectively. Additional diagnostic performance metrics for both groups, including the diagnostic odds ratio (DOR) and likelihood ratios (LR+ and LR−), are summarized in Table 2.
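For readers unfamiliar with these metrics, the sketch below shows how likelihood ratios and the DOR follow from a single sensitivity/specificity pair. The function name is hypothetical. Applying it to the pooled point estimates quoted above is only an approximation: the values in Table 2 come from the software’s own pooling and need not match this arithmetic.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-, DOR) for one sensitivity/specificity pair."""
    lr_pos = sensitivity / (1.0 - specificity)   # how much a positive test raises the odds
    lr_neg = (1.0 - sensitivity) / specificity   # how much a negative test lowers the odds
    dor = lr_pos / lr_neg                        # diagnostic odds ratio
    return lr_pos, lr_neg, dor


# Pooled point estimates quoted in the text (AI systems, then doctors).
for label, se, sp in [("AI", 0.877, 0.906), ("Doctors", 0.751, 0.941)]:
    lr_pos, lr_neg, dor = likelihood_ratios(se, sp)
    print(f"{label}: LR+ = {lr_pos:.2f}, LR- = {lr_neg:.3f}, DOR = {dor:.1f}")
```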
The Fagan nomogram analysis demonstrated the clinical utility of AI-based DR screening. If a patient tests positive for DR using AI, the post-test probability of truly having the disease increases to 84.92%, indicating that AI effectively enhances disease detection. Conversely, a negative AI result reduces the probability of disease to just 3.56%, highlighting AI’s potential to rule out DR with confidence. These findings reinforce the role of AI in triaging patients, allowing ophthalmologists to focus on high-risk cases requiring further evaluation (Fig. 5).
Subgroup analyses by imaging modality and doctor expertise revealed further insights into diagnostic performance. These results are shown in Table 3.
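A Fagan nomogram is a graphical form of Bayes’ theorem on the odds scale: post-test odds equal pre-test odds multiplied by the likelihood ratio. The sketch below shows that calculation. The pre-test probability and likelihood ratios used here are assumed values chosen for illustration; the text above does not state the inputs behind the 84.92% and 3.56% figures.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert a pre-test probability to a post-test probability via odds."""
    pre_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# Assumed inputs for illustration only, NOT values reported in the review.
assumed_pre_test = 0.25
assumed_lr_pos = 9.0
assumed_lr_neg = 0.12

after_positive = post_test_probability(assumed_pre_test, assumed_lr_pos)
after_negative = post_test_probability(assumed_pre_test, assumed_lr_neg)
print(f"After a positive result: {after_positive:.1%}")
print(f"After a negative result: {after_negative:.1%}")
```

Because the result depends on the pre-test probability, the same AI system yields different post-test probabilities in a low-prevalence community clinic than in a high-prevalence hospital setting.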
Meta regression and sensitivity analysis
To explore the sources of heterogeneity, we performed a meta-regression analysis using Meta-DiSc software (version 1.4), evaluating the potential influence of covariates such as imaging modality (fundus vs. OCT) and doctor expertise (retina specialist vs. ophthalmologist). Both the doctor and AI datasets exhibited high levels of heterogeneity, with I² values exceeding 98% for doctors and 99% for AI systems. This heterogeneity suggests substantial variability in study methodologies, including differences in AI model architecture, image quality, patient populations, and the diagnostic thresholds used to classify diabetic retinopathy (see Fig. 6).
High heterogeneity complicates the interpretation of pooled results, which may overstate or understate AI’s diagnostic performance in a given setting. One key factor contributing to heterogeneity is the variation in AI training datasets: models trained on diverse populations may generalize better than those trained on homogeneous datasets. Additionally, differences in study inclusion criteria (e.g., DR severity grading) and reference standards for diagnosis could influence pooled estimates.
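To make the I² values above concrete, the sketch below computes Cochran’s Q and I² in their generic inverse-variance form. This is an assumption about the form of the calculation, not a reproduction of how Meta-DiSc computes heterogeneity for sensitivity and specificity, and the function names and inputs are hypothetical.

```python
def cochran_q(effects, variances):
    """Cochran's Q: weighted squared deviations from the fixed-effect pooled estimate."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    return sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))


def i_squared(q, n_studies):
    """I² (percent): share of total variation attributed to between-study heterogeneity."""
    df = n_studies - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Hypothetical effect estimates and variances, NOT data from the review.
effects = [0.0, 10.0]
variances = [1.0, 1.0]
q = cochran_q(effects, variances)
print(f"Q = {q:.1f}, I² = {i_squared(q, len(effects)):.1f}%")
```

I² approaches 100% when Q is much larger than its degrees of freedom, which is common in large diagnostic accuracy meta-analyses where each study has a very small within-study variance.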
Despite this heterogeneity, sensitivity analyses confirmed the robustness of the results. The overall pooled estimates remained significant across different subgroup exclusions, indicating that AI consistently demonstrated strong diagnostic performance across multiple study conditions.
Publication bias
Publication bias was assessed through visual inspection of a funnel plot and statistical tests, including Egger’s test. The funnel plot (Fig. 7) was asymmetric, suggesting potential publication bias. Specifically, there was a noticeable lack of studies on the left side of the funnel, indicating that smaller studies with non-significant or smaller effect sizes may be underrepresented in the analysis. This was further supported by Egger’s test, which produced a statistically significant result (t = 2.1400, p = 0.0472), consistent with asymmetry and likely publication bias. A trim-and-fill analysis (Fig. 8) was then performed to adjust for publication bias.
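Egger’s test can be sketched as an ordinary least-squares regression of each study’s standardized effect (effect divided by its standard error) on its precision (one over the standard error); an intercept far from zero indicates funnel-plot asymmetry. The version below is a minimal illustration that returns the intercept, its t statistic, and the degrees of freedom. It assumes the effect is something like a log diagnostic odds ratio, the function name and input values are hypothetical, and it is not the exact procedure or data behind the reported t = 2.1400.

```python
from math import sqrt


def egger_intercept_t(effects, standard_errors):
    """Return (intercept, t statistic, degrees of freedom) for Egger's regression."""
    y = [e / s for e, s in zip(effects, standard_errors)]  # standardized effects
    x = [1.0 / s for s in standard_errors]                 # precisions
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance with n - 2 degrees of freedom (two fitted parameters).
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / (n - 2)
    se_intercept = sqrt(sigma2 * (1.0 / n + mean_x ** 2 / sxx))
    return intercept, intercept / se_intercept, n - 2


# Hypothetical effects and standard errors, NOT data from the review.
effects = [3.0, 2.0, 1.75, 1.6]
standard_errors = [1.0, 0.5, 0.25, 0.2]
intercept, t_stat, df = egger_intercept_t(effects, standard_errors)
print(f"intercept = {intercept:.2f}, t = {t_stat:.2f}, df = {df}")
```

The p-value is obtained by comparing the t statistic with a t distribution on the returned degrees of freedom; that step is omitted here because it needs a statistics library beyond the Python standard library.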









