Creating efficiencies in the extraction of data from randomized trials: a prospective evaluation of a machine learning and text mining tool

BACKGROUND: Machine learning tools that semi-automate data extraction may create efficiencies in systematic review production. We prospectively evaluated an online machine learning and text mining tool's ability to (a) automatically extract data elements from randomized trials, and (b) save tim...

Bibliographic Details
Main Author: Gates, Allison
Corporate Authors: United States Agency for Healthcare Research and Quality, University of Alberta Evidence-based Practice Center
Format: eBook
Language: English
Published: Rockville, MD: Agency for Healthcare Research and Quality, August 2021
Series: Methods research report
Online Access: https://www.ncbi.nlm.nih.gov/books/NBK572890
Collection: National Center for Biotechnology Information - Collection details see MPG.ReNa
LEADER 03965nam a2200265 u 4500
001 EB002010956
003 EBX01000000000000001173855
005 00000000000000.0
007 tu|||||||||||||||||||||
008 220201 r ||| eng
100 1 |a Gates, Allison 
245 0 0 |a Creating efficiencies in the extraction of data from randomized trials  |h Electronic resource  |b a prospective evaluation of a machine learning and text mining tool  |c prepared for Agency for Healthcare Research and Quality, U.S. Department of Health and Human Services ; prepared by University of Alberta Evidence-based Practice Center ; investigators, Allison Gates [and 5 others] 
260 |a Rockville, MD  |b Agency for Healthcare Research and Quality  |c August 2021, 2021 
300 |a 1 PDF file (various pagings)  |b illustrations 
505 0 |a Includes bibliographical references 
710 2 |a United States  |b Agency for Healthcare Research and Quality 
710 2 |a University of Alberta Evidence-based Practice Center 
041 0 7 |a eng  |2 ISO 639-2 
989 |b NCBI  |a National Center for Biotechnology Information 
490 0 |a Methods research report 
856 4 0 |u https://www.ncbi.nlm.nih.gov/books/NBK572890  |3 Full text 
082 0 |a 000 
520 |a BACKGROUND: Machine learning tools that semi-automate data extraction may create efficiencies in systematic review production. We prospectively evaluated an online machine learning and text mining tool's ability to (a) automatically extract data elements from randomized trials, and (b) save time compared with manual extraction and verification. METHODS: For 75 randomized trials published in 2017, we manually extracted and verified data for 21 unique data elements. We uploaded the randomized trials to ExaCT, an online machine learning and text mining tool, and quantified performance by evaluating the tool's ability to identify the reporting of data elements (reported or not reported), and the relevance of the extracted sentences, fragments, and overall solutions. For each randomized trial, we measured the time to complete manual extraction and verification, and to review and amend the data extracted by ExaCT (simulating semi-automated data extraction).  
520 |a We summarized the relevance of the extractions for each data element using counts and proportions, and calculated the median and interquartile range (IQR) across data elements. We calculated the median (IQR) time for manual and semi-automated data extraction, and overall time savings. RESULTS: The tool identified the reporting (reported or not reported) of data elements with median (IQR) 91 percent (75% to 99%) accuracy. Performance was perfect for four data elements: eligibility criteria, enrolment end date, control arm, and primary outcome(s). Among the top five sentences for each data element, at least one sentence was relevant in a median (IQR) 88 percent (83% to 99%) of cases. Performance was perfect for four data elements: funding number, registration number, enrolment start date, and route of administration.
520 |a Among a median (IQR) 90 percent (86% to 96%) of relevant sentences, pertinent fragments had been highlighted by the system; exact matches were unreliable (median (IQR) 52 percent [32% to 73%]). A median 48 percent of solutions were fully correct, but performance varied greatly across data elements (IQR 21% to 71%). Using ExaCT to assist the first reviewer resulted in modest time savings compared with manual extraction by a single reviewer (17.9 vs. 21.6 hours total extraction time across 75 randomized trials). CONCLUSIONS: Using ExaCT to assist with data extraction resulted in modest gains in efficiency compared with manual extraction. The tool was reliable for identifying the reporting of most data elements. The tool's ability to identify at least one relevant sentence and highlight pertinent fragments was generally good, but changes to sentence selection and/or highlighting were often required.
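
The summary statistics named in the abstract (a median and IQR across data elements, and the overall time savings) can be illustrated with a short sketch. This is not the investigators' analysis code. The function names are hypothetical, the quartile method ("inclusive") is an assumption because the abstract does not state one, and the per-element accuracies in the example are invented purely for illustration. Only the 17.9 and 21.6 hour totals and the 75 trials come from the abstract.

```python
from statistics import median, quantiles


def median_iqr(values):
    """Return (median, Q1, Q3) for a list of per-element percentages.

    Assumption: quartiles use the 'inclusive' method; the abstract does
    not say which quartile definition the investigators used.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q1, q3


def time_savings(manual_hours, assisted_hours, n_trials):
    """Summarize the difference between manual and tool-assisted extraction."""
    saved = manual_hours - assisted_hours
    return {
        "hours_saved": round(saved, 1),
        "percent_saved": round(100 * saved / manual_hours, 1),
        "minutes_saved_per_trial": round(60 * saved / n_trials, 1),
    }


if __name__ == "__main__":
    # Totals reported in the abstract: 21.6 h manual vs. 17.9 h with ExaCT,
    # across 75 randomized trials.
    print(time_savings(21.6, 17.9, 75))

    # Hypothetical per-element accuracies (percent), NOT taken from the report.
    example_accuracies = [60, 75, 88, 91, 95, 99, 100]
    print(median_iqr(example_accuracies))
```

With the reported totals, the difference is 3.7 hours, roughly 17 percent of the manual extraction time, or about 3 minutes per trial.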