Assessing the accuracy of machine-assisted abstract screening with DistillerAI: a user study


Bibliographic Details
Main Author: Gartlehner, Gerald
Corporate Authors: United States Agency for Healthcare Research and Quality, RTI International-University of North Carolina Evidence-based Practice Center
Format: eBook
Language:English
Published: Rockville, MD Agency for Healthcare Research and Quality November 2019, 2019
Series:Methods research report
Online Access:
Collection: National Center for Biotechnology Information - Collection details see MPG.ReNa
LEADER 04800nam a2200277 u 4500
001 EB002000871
003 EBX01000000000000001163772
005 00000000000000.0
007 tu|||||||||||||||||||||
008 210907 r ||| eng
100 1 |a Gartlehner, Gerald 
245 0 0 |a Assessing the accuracy of machine-assisted abstract screening with DistillerAI  |h Elektronische Ressource  |b a user study  |c prepared for Agency for Healthcare Research and Quality, U.S. Department of Health and Human Services ; prepared by RTI International-University of North Carolina Evidence-based Practice Center ; investigators, Gerald Gartlehner [and 6 others] 
260 |a Rockville, MD  |b Agency for Healthcare Research and Quality  |c November 2019, 2019 
300 |a 1 PDF file (viii, 20 pages)  |b illustrations 
505 0 |a Includes bibliographical references 
710 2 |a United States  |b Agency for Healthcare Research and Quality 
710 2 |a RTI International-University of North Carolina Evidence-based Practice Center 
041 0 7 |a eng  |2 ISO 639-2 
989 |b NCBI  |a National Center for Biotechnology Information 
490 0 |a Methods research report 
856 4 0 |u https://www.ncbi.nlm.nih.gov/books/NBK550282  |3 Volltext 
082 0 |a 700 
520 |a BACKGROUND: Web applications that employ natural language processing technologies such as text mining and text classification to support systematic reviewers during abstract screening have become more user friendly and more common. Such semi-automated screening tools can increase efficiency by reducing the number of abstracts that need to be screened or by replacing one screener after adequately training the machine's algorithm. Savings in workload between 30 percent and 70 percent might be possible with the use of such tools. The goal of our project was to conduct a case study to explore a screening approach that temporarily replaces a human screener with a semi-automated screening tool. METHODS: To address our objective, we evaluated the accuracy of a machine-assisted screening approach using an Agency for Healthcare Research and Quality comparative effectiveness review as the reference standard.  
520 |a We chose DistillerAI as a semi-automated screening tool for our project, applying its naïve Bayesian machine-learning option. Five teams screened the same 2,472 abstracts in parallel, using the machine-assisted approach. Each team trained DistillerAI with 300 randomly selected abstracts that the team screened dually. For the remaining 2,172 abstracts, DistillerAI replaced one human screener in each team and provided predictions about the relevance of records. We used a prediction score of 0.5 (i.e., inconclusive) or greater to classify a record as an inclusion. A single reviewer also screened all remaining abstracts. A second human screener resolved conflicts between the single reviewer and DistillerAI.
520 |a We compared the decisions of the machine-assisted approach, single-reviewer screening (i.e., no machine assistance), and screening with DistillerAI alone (i.e., no human involvement after training) against the reference standard and calculated sensitivities, specificities, and the area under the receiver operating characteristic curve. In addition, we determined the interrater agreement, the proportion of included abstracts, and the number of conflicts between human screeners and DistillerAI. RESULTS: The mean sensitivity of the machine-assisted screening approach across the five screening teams was 78 percent (95% confidence interval [CI], 66% to 90%), and the mean specificity was 95 percent (95% CI, 92% to 97%). By comparison, the sensitivity of single-reviewer screening was also 78 percent (95% CI, 66% to 89%); the sensitivity of DistillerAI alone was 14 percent (95% CI, 0% to 31%).  
520 |a Specificities for single-reviewer screening and DistillerAI alone were 94 percent (95% CI, 91% to 97%) and 98 percent (95% CI, 97% to 100%), respectively. Machine-assisted screening and single-reviewer screening had similar areas under the curve (0.87 and 0.86, respectively); by contrast, the area under the curve for DistillerAI alone was only slightly better than chance (0.56). The interrater agreement between human screeners and DistillerAI with a prevalence-adjusted kappa was 0.85 (95% CI, 0.84 to 0.86). DISCUSSION: Findings of our study indicate that the accuracy of DistillerAI is not yet adequate to replace a human screener temporarily during abstract screening. The approach that we tested missed too many relevant studies and created too many conflicts between human screeners and DistillerAI. Rapid reviews, which do not require detecting the totality of the relevant evidence, may find greater utility in semi-automation tools than traditional systematic reviews do.
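The accuracy metrics named in the abstract (sensitivity, specificity, and prevalence-adjusted kappa) all derive from a 2×2 confusion matrix of screening decisions against the reference standard. A minimal illustrative sketch, using toy counts rather than the study's data:

```python
# Illustrative sketch of the abstract's accuracy metrics, computed from a
# 2x2 confusion matrix. The counts passed below are toy values, NOT the
# study's data.

def screening_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, and prevalence-adjusted kappa (PABAK).

    tp: relevant abstracts correctly included
    fn: relevant abstracts missed
    tn: irrelevant abstracts correctly excluded
    fp: irrelevant abstracts wrongly included
    """
    sensitivity = tp / (tp + fn)          # share of relevant abstracts found
    specificity = tn / (tn + fp)          # share of irrelevant abstracts excluded
    total = tp + fn + tn + fp
    observed_agreement = (tp + tn) / total
    # Prevalence-adjusted, bias-adjusted kappa simplifies to 2*Po - 1.
    pabak = 2 * observed_agreement - 1
    return sensitivity, specificity, pabak

sens, spec, kappa = screening_metrics(tp=78, fn=22, tn=95, fp=5)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} PABAK={kappa:.2f}")
```

With the toy counts above this prints sensitivity 0.78 and specificity 0.95, mirroring the shape (not the source) of the figures reported for the machine-assisted approach.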