Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of Its Successes and Shortcomings

244 Citations of this article
386 Readers (Mendeley users who have this article in their library)

This article is free to access.

Abstract

Purpose: Foundation models are a novel type of artificial intelligence algorithm in which models are pretrained at scale on unannotated data and fine-tuned for a myriad of downstream tasks, such as generating text. This study assessed the accuracy of ChatGPT, a large language model (LLM), in the ophthalmology question-answering space.

Design: Evaluation of diagnostic test or technology.

Participants: ChatGPT is a publicly available LLM.

Methods: We tested 2 versions of ChatGPT (January 9 “legacy” and ChatGPT Plus) on 2 popular multiple-choice question banks commonly used to prepare for the high-stakes Ophthalmic Knowledge Assessment Program (OKAP) examination. We generated two 260-question simulated exams from the Basic and Clinical Science Course (BCSC) Self-Assessment Program and the OphthoQuestions online question bank. We carried out logistic regression to determine the effect of the examination section, cognitive level, and difficulty index on answer accuracy. We also performed a post hoc analysis using Tukey's test to determine whether there were meaningful differences between the tested subspecialties.

Main Outcome Measures: We reported the accuracy of ChatGPT for each examination section as the percentage of correct answers, comparing ChatGPT's outputs with the answer key provided by the question banks. We presented logistic regression results with a likelihood ratio (LR) chi-square. We considered differences between examination sections statistically significant at a P value of < 0.05.

Results: The legacy model achieved 55.8% accuracy on the BCSC set and 42.7% on the OphthoQuestions set. With ChatGPT Plus, accuracy increased to 59.4% ± 0.6% and 49.2% ± 1.0%, respectively. Accuracy improved with easier questions when controlling for the examination section and cognitive level. Logistic regression analysis of the legacy model showed that the examination section (LR, 27.57; P = 0.006) followed by question difficulty (LR, 24.05; P < 0.001) were most predictive of ChatGPT's answer accuracy. Although the legacy model performed best in general medicine and worst in neuro-ophthalmology (P < 0.001) and ocular pathology (P = 0.029), similar post hoc findings were not seen with ChatGPT Plus, suggesting more consistent results across examination sections.

Conclusion: ChatGPT showed encouraging performance on a simulated OKAP examination. Specializing LLMs through domain-specific pretraining may be necessary to improve their performance in ophthalmic subspecialties.

Financial Disclosure(s): Proprietary or commercial disclosure may be found after the references.
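To illustrate the analysis described in the abstract, the sketch below computes per-section accuracy against an answer key and a likelihood-ratio (LR) chi-square from a logistic regression of answer correctness on examination section, cognitive level, and question difficulty. This is a minimal Python sketch using synthetic data and the statsmodels and SciPy libraries; the variable names, data, and model specification are illustrative assumptions, not the authors' code or results.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 260  # one simulated 260-question exam

# Synthetic question-level data: examination section, cognitive level,
# difficulty index, and whether the model's answer matched the answer key.
section = rng.choice(
    ["general_medicine", "neuro_ophthalmology", "ocular_pathology", "retina"], n
)
cognitive = rng.choice(["recall", "interpretation"], n)
difficulty = rng.uniform(0.2, 0.9, n)              # higher index = easier question
correct = rng.binomial(1, 0.3 + 0.4 * difficulty)  # easier questions are answered correctly more often

df = pd.DataFrame({"section": section, "cognitive": cognitive,
                   "difficulty": difficulty, "correct": correct})

# Accuracy per examination section, as a percentage of correct answers.
print(df.groupby("section")["correct"].mean().mul(100).round(1))

# Logistic regression of answer accuracy on section, cognitive level, and difficulty.
full = smf.logit("correct ~ C(section) + C(cognitive) + difficulty", df).fit(disp=0)

# LR chi-square for the difficulty term: compare the full model against a
# reduced model that omits it; the statistic is 2 * (llf_full - llf_reduced).
reduced = smf.logit("correct ~ C(section) + C(cognitive)", df).fit(disp=0)
lr_stat = 2 * (full.llf - reduced.llf)
p_value = chi2.sf(lr_stat, full.df_model - reduced.df_model)
print(f"LR = {lr_stat:.2f}, P = {p_value:.4f}")

A post hoc comparison between subspecialties could then be run on the per-question correctness grouped by section (for example, with statsmodels' pairwise_tukeyhsd), mirroring the Tukey analysis mentioned above.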

Citation (APA)

Antaki, F., Touma, S., Milad, D., El-Khoury, J., & Duval, R. (2023). Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of Its Successes and Shortcomings. Ophthalmology Science, 3(4). https://doi.org/10.1016/j.xops.2023.100324

Readers over time

[Chart: reader counts per year, 2023–2025; vertical axis 0–240.]

Readers' Seniority

PhD / Post grad / Masters / Doc: 53 (45%)
Professor / Associate Prof.: 30 (25%)
Lecturer / Post doc: 21 (18%)
Researcher: 15 (13%)

Readers' Discipline

Medicine and Dentistry: 33 (37%)
Business, Management and Accounting: 23 (26%)
Engineering: 18 (20%)
Computer Science: 15 (17%)

Article Metrics

News Mentions: 3
