PNAS, volume 122, issue 2, pages
Transforming literature screening: The emerging role of large language models in systematic reviews.
Authors:
Fernando M. Delgado-Chaves 1, Matthew J. Jennings 2, Antonio Atalaia 3, Justus Wolff 4, Rita Horvath 5, Zeinab M. Mamdouh 6, Jan Baumbach 1, and Linda Baumbach 1
1- Institute for Computational Systems Biology, Faculty of Mathematics, Informatics and Natural Sciences, University of Hamburg, Hamburg 22761, Germany;
2- Center for Motor Neuron Biology and Diseases, Department of Neurology, Columbia University, New York, NY 10032;
3- Inserm Center of Research in Myology, Neuro-Myology Service G.H. Pitié-Salpêtrière, Sorbonne Université, Paris 75013, France;
4- Syte – Strategy Institute for Digital Health, Hamburg 20354, Germany;
5- Department of Clinical Neurosciences, University of Cambridge, Cambridge CB2 0QQ, United Kingdom;
6- Department of Pharmacology and Personalised Medicine, Maastricht University, Maastricht 6229 ER, The Netherlands;
g- Department of Pharmacology and Toxicology, Faculty of Pharmacy, Zagazig University, Zagazig 44519, Egypt;
h- Department of Mathematics and Computer Science, Institute for Mathematics and Computer Science, University of Southern Denmark, Odense 5230, Denmark;
i- Department of Health Economics and Health Services Research, University Medical Center Hamburg-Eppendorf, Hamburg 20246, Germany; and
j- Center for Bioinformatics Hamburg, Faculty of Mathematics, Informatics and Natural Sciences, University of Hamburg, Hamburg 22761, Germany
Systematic reviews (SRs) synthesize evidence-based medical literature, but they involve
labor-intensive manual article screening. Large language models (LLMs) can select relevant
literature, but their quality and efficacy compared to humans are still being determined.
We evaluated the overlap between the articles that 18 different LLMs selected on the basis of
title and abstract and the human-selected articles for three SRs. In the three SRs, two
independent reviewers selected 185/4,662, 122/1,741, and 45/66 articles for full-text
screening. Due to technical variations and the inability of the LLMs
to classify all records, the sample sizes considered for the LLMs were smaller. However, on
average, the 18 LLMs classified 4,294 (min 4,130; max 4,329), 1,539 (min 1,449; max
1,574), and 27 (min 22; max 37) of the titles and abstracts correctly as either included
or excluded for the three SRs, respectively. Additional analysis revealed that the definitions
of the inclusion criteria and the conceptual designs significantly influenced LLM
performance. In conclusion, LLMs can reduce one reviewer's workload by between 33%
and 93% during title and abstract screening. However, the exact formulation of the inclusion
and exclusion criteria should be refined beforehand to get the best support from the LLMs.
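
The counts reported above (records classified correctly as included or excluded, out of a considered sample that shrinks when a model fails to classify a record) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `screening_agreement`, the use of `None` for an unclassified record, and the toy labels are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def screening_agreement(
    human: Sequence[bool],
    llm: Sequence[Optional[bool]],
) -> dict:
    """Compare LLM include/exclude decisions with the human reference.

    True means "include for full-text screening", False means "exclude".
    None in `llm` marks a record the model did not classify (an assumed
    convention); such records are dropped, so the considered sample can
    be smaller than the full record set.
    """
    if len(human) != len(llm):
        raise ValueError("label lists must have the same length")

    # Keep only the records for which the model returned a decision.
    pairs = [(h, m) for h, m in zip(human, llm) if m is not None]

    return {
        "considered": len(pairs),
        "correct": sum(h == m for h, m in pairs),
        # Included by the reviewers but excluded by the model.
        "missed_includes": sum(h and not m for h, m in pairs),
        # Excluded by the reviewers but included by the model.
        "extra_includes": sum(m and not h for h, m in pairs),
    }


# Toy labels, not taken from the study.
human_labels = [True, False, False, True, False, False]
llm_labels = [True, False, True, None, False, False]
result = screening_agreement(human_labels, llm_labels)
print(result)
```

Reporting the considered sample size alongside the correct count keeps unclassified records visible instead of silently counting them as errors or as agreements.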