
Original research
Evaluating local open-source large language models for data extraction from unstructured reports on mechanical thrombectomy in patients with ischemic stroke
  1. Aymen Meddeb1,2,
  2. Philipe Ebert1,
  3. Keno Kyrill Bressem3,
  4. Dmitriy Desser1,
  5. Andrea Dell'Orco1,
  6. Georg Bohner1,
  7. Justus F Kleine1,
  8. Eberhard Siebert1,
  9. Nils Grauhan4,
  10. Marc A Brockmann4,
  11. Ahmed Othman4,
  12. Michael Scheel1,
  13. Jawed Nawabi1
    1. Department of Neuroradiology, Charité Universitätsmedizin Berlin, Berlin, Germany
    2. Department of Neuroradiology, CHU Reims Imagerie Médicale, Reims, Champagne-Ardenne, France
    3. German Heart Center Munich, Technical University of Munich, Munich, Germany
    4. Department of Neuroradiology, University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
    Correspondence to Dr Aymen Meddeb; aymen.meddeb@charite.de

    Abstract

    Background A study was undertaken to assess the effectiveness of open-source large language models (LLMs) in extracting clinical data from unstructured mechanical thrombectomy reports in patients with ischemic stroke caused by a vessel occlusion.

    Methods We deployed local open-source LLMs to extract data points from free-text procedural reports of patients who underwent mechanical thrombectomy between September 2020 and June 2023 at our institution. The external dataset was obtained from a second university hospital and comprised consecutive cases treated between September 2023 and March 2024. Ground truth labeling was facilitated by a human-in-the-loop (HITL) approach, with time metrics recorded for both automated and manual data extraction. We tested three models (Mixtral, Qwen, and BioMistral), assessing their performance on precision, recall, and F1 score across 15 clinical categories such as National Institutes of Health Stroke Scale (NIHSS) scores, occluded vessels, and medication details.

    Results The study included 1000 consecutive reports from our primary institution and 50 reports from a secondary institution. Mixtral showed the highest precision, achieving 0.99 for first series time extraction and 0.69 for occluded vessel identification within the internal dataset. In the external dataset, precision ranged from 1.00 for NIHSS scores to 0.70 for occluded vessels. Qwen showed moderate precision with a high of 0.85 for NIHSS scores and a low of 0.28 for occluded vessels. BioMistral had the broadest range of precision, from 0.81 for first series times to 0.14 for medication details. The HITL approach yielded an average time savings of 65.6% per case, with variations from 45.95% to 79.56%.

    Conclusion This study highlights the potential of LLMs for automated clinical data extraction from medical reports. Incorporating HITL annotations enhances precision and ensures the reliability of the extracted data. This methodology presents a scalable, privacy-preserving option that can significantly support clinical documentation and research endeavors.

    • Thrombectomy
    • Angiography
    • Technology


    This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.


    WHAT IS ALREADY KNOWN ON THIS TOPIC

    • Large language models (LLMs) have shown promise in various natural language processing tasks, including data extraction from unstructured text.

    • Clinical data extraction from medical reports is a critical task that can benefit from automation due to the labor-intensive and time-consuming nature of manual extraction.

    WHAT THIS STUDY ADDS

    • This study shows the specific application of open-source LLMs in extracting clinical data from unstructured mechanical thrombectomy reports.

    • The integration of a human-in-the-loop (HITL) approach significantly enhances the precision and reliability of the extracted data.

    HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY

    • Open-source LLMs, when combined with HITL annotations, offer a scalable and privacy-preserving solution for clinical data extraction, enhancing the efficiency and accuracy of clinical documentation and research.

    Introduction

    Large language models (LLMs) are artificial intelligence (AI) systems that understand and generate human-like natural language responses to text prompts.1–3 These models, trained on vast datasets, have shown remarkable clinical reasoning capabilities4–6 in passing medical licensing examinations7 8 and generating prevention and treatment recommendations for various conditions including cardiovascular disease9 10 and breast cancer.11 They can produce clinical notes,3 generate radiology reports,12 13 and even assist in writing research articles.14–16

    Although GPT-4 has been effectively used for text mining from unstructured medical data in radiology17 and neurology,18 19 the application of commercial LLMs in medicine raises significant privacy issues, given the need to uphold the strict standards of data privacy and security inherent to the clinical context. In addition, reliability and usability are important issues that must be addressed when using LLMs. Users, especially in the medical field, must be able to rely on the accuracy and consistency of the model, which requires ongoing refinement, testing, and evaluation to ensure that the system delivers accurate outputs. However, commercial models accessed through an API can be updated by the provider and drastically change their behavior, which poses a substantial risk for clinical workflows.

    In neuroradiology, the production of accurate and detailed notes after interventional procedures is crucial.20 This documentation must meticulously describe the intervention, reflecting the highly individualized nature of each procedure. It should detail the indication for the procedure, enumerate the technical steps undertaken, list the materials and medications used, address potential complications, and report the outcome of the intervention.21 Such documentation is essential not only for upholding high standards of care but also for supporting efficient patient discharge, enabling quality assessments, and enhancing clinical research.

    However, the highly individualized nature of procedural notes hampers downstream use of the generated data. The diversity in documentation practices and writing styles poses a significant challenge for structuring the data for research purposes or integrating it into national registries. Moreover, the variability in detail and terminology complicates the task of standardizing data for comparative studies or broader analyses.

    This is where open-source LLMs present a compelling solution.22–24 By operating fully locally, these models ensure that all data processing is confined to the hospital’s internal devices and designated servers without the need for external internet connectivity. This mode of operation mitigates the risk of data breaches and aligns with the principles of patient data privacy.

    This work explores the potential of local open-source LLMs to extract accurate information from procedural reports of mechanical thrombectomy in patients with ischemic stroke and to accelerate annotation for medical information extraction, so that the rich data contained in procedural notes can be fully leveraged for quality assurance, research, and regulatory purposes.

    Methods

    Patient population

    Our internal dataset encompasses consecutive reports from patients who underwent mechanical thrombectomy for acute ischemic stroke between September 2020 and June 2023 in a university hospital with a comprehensive stroke center. The external dataset encompasses consecutive reports from patients treated in a second university hospital between September 2023 and March 2024. All collected data adhered to the principles outlined in the Declaration of Helsinki. Patient data were anonymized to ensure privacy and confidentiality, in line with the stringent data protection requirements of clinical research.

    Extraction pipeline and prompt structure

    To adapt a generalized LLM to our specific task of information extraction, we employed an in-context learning strategy.25 This method involves crafting precise prompts that provide the model with clear instructions and context, enhancing its ability to perform complex tasks. We developed an automated pipeline to process the reports, beginning with the creation of a JavaScript Object Notation (JSON) template. This template defined the 15 data points we aimed to extract from the thrombectomy reports: National Institutes of Health Stroke Scale (NIHSS) score, symptom onset, occluded vessel, occlusion side, used materials, medication, complications, outcome, Thrombolysis in Cerebral Infarction (TICI) score, area dose product, fluoroscopy time, arrival time, puncture time, first series time, and artery opening time. For each data point, a detailed list of instructions was provided to clarify definitions and specifics. Finally, a prompt was crafted to analyze the reports, extract the necessary data, and populate the JSON template. The system was also configured to process texts primarily in German. All notebooks and prompts are publicly available on GitHub (https://github.com/Meddebma/AI_4_Medicine/blob/main/Thrombectomy_LLM_Extraction.ipynb).
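
    For illustration, a minimal sketch of such a JSON template and prompt assembly is given below. The field names and instruction texts are hypothetical stand-ins; the exact template, instructions, and wording used in this study are available in the linked notebook.

```python
import json

# Illustrative JSON template for the 15 data points; the authors' exact field
# names and instruction texts may differ (see the linked notebook).
TEMPLATE = {
    "nihss_score": "not available",
    "symptom_onset": "not available",
    "occluded_vessel": "not available",
    "occlusion_side": "not available",
    "used_materials": "not available",
    "medication": "not available",
    "complications": "not available",
    "outcome": "not available",
    "tici_score": "not available",
    "area_dose_product": "not available",
    "fluoroscopy_time": "not available",
    "arrival_time": "not available",
    "puncture_time": "not available",
    "first_series_time": "not available",
    "artery_opening_time": "not available",
}

# Per-field instructions clarifying definitions and specifics (illustrative).
INSTRUCTIONS = {
    "nihss_score": "Integer NIHSS score at admission; 'not available' if not stated.",
    "tici_score": "Final TICI grade (0, 1, 2a, 2b, 2c or 3).",
    # ... one instruction per data point
}

def build_prompt(report_text: str) -> str:
    """Assemble an in-context learning prompt for one (German) procedural report."""
    return (
        "You are an assistant that extracts structured data from German "
        "neurointerventional procedure reports.\n"
        f"Fill every field of this JSON template:\n{json.dumps(TEMPLATE, indent=2)}\n"
        f"Field definitions:\n{json.dumps(INSTRUCTIONS, indent=2, ensure_ascii=False)}\n"
        "Use 'not available' for missing values. Return only valid JSON.\n\n"
        f"Report:\n{report_text}"
    )
```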

    Models

    For information extraction we evaluated four state-of-the-art open-source LLMs of varying size. Three were retained for the full analysis: Qwen with 72 billion parameters; Mixtral, a mixture-of-experts model comprising 8 experts of 7 billion parameters each (8×7B); and BioMistral with 7 billion parameters. Each model was deployed locally to comply with stringent data security measures and our commitment to patient privacy. The fourth model, Phi-2, with 2 billion parameters, did not produce reasonable outputs and was excluded from further analysis. Table 1 provides an overview of the models used.

    Table 1

    Overview of the open-source models used
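
    As an illustration of the local deployment described above, the following sketch queries a locally hosted instruction-tuned model through the Hugging Face transformers interface. The model identifier and generation settings are assumptions made for the example; the actual serving setup, quantization, and parameters used in the study may differ and are documented in the linked notebook.

```python
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; the exact model versions and quantization used in
# the study may differ (see the linked notebook).
MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def extract(prompt: str) -> dict:
    """Send one assembled extraction prompt to the local model and parse the JSON reply."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    reply = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)
    return json.loads(reply)  # raises if the reply is not valid JSON
```

    Greedy decoding (do_sample=False) keeps the output deterministic, which simplifies downstream JSON parsing; all processing stays on local hardware, with no data leaving the hospital network.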

    Human-in-the-loop (HITL) annotation

    Due to the extensive volume of cases, we adopted a human-in-the-loop (HITL) annotation strategy to establish ground truth for our dataset. Rather than creating annotations from scratch, clinical experts refined the outputs of our strongest model, Mixtral. This refinement involved detailed review and adjustment by four clinical experts (one student, one radiology resident, and two board-certified radiologists specializing in neurointervention) to ensure the outputs were accurate and clinically relevant. The combined effort amounted to approximately 80 hours of work. The resulting annotated dataset then served as a benchmark to evaluate the performance of subsequent models. This workflow is shown in figure 1.

    Figure 1

    Workflow design for information extraction.

    To determine the most effective model for our HITL annotation process, we initially conducted a series of preliminary evaluations. These evaluations assessed the precision of each model on consistently reported data elements within a small subset of our data (n=20). Based on these initial results, Mixtral emerged as the superior model, particularly excelling in the extraction of critical data points such as the materials used. Using Mixtral as the basis for our HITL approach allowed us to begin the annotation process from the most reliable automated baseline, ensuring the high quality of our ground truth data.

    To assess the time reduction afforded by using an LLM for data extraction, we recorded the time required for the manual extraction of 30 reports. Additionally, we measured the time taken by the LLM to extract the data and the subsequent time needed to manually correct the LLM-extracted information. To evaluate the statistical significance of the observed differences in time between the manual and assisted methods, we conducted a paired t-test with a significance level of α=0.05. Furthermore, we quantified the efficiency gains from LLM assistance by calculating the percentage of time saved.
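
    This timing comparison can be reproduced with a few lines of SciPy. The sketch below uses placeholder values rather than the study's measurements and assumes that the per-report times are available as paired arrays.

```python
import numpy as np
from scipy import stats

# Placeholder per-report times in seconds (illustrative, not study data):
manual = np.array([210.0, 180.0, 150.0, 320.0])    # fully manual extraction
assisted = np.array([70.0, 55.0, 48.0, 95.0])      # LLM extraction + manual correction

# Paired t-test on the same reports (alpha = 0.05 in the study)
t_stat, p_value = stats.ttest_rel(manual, assisted)

# Percentage of time saved per report, then averaged across reports
savings = 100 * (manual - assisted) / manual
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, mean time saved = {savings.mean():.1f}%")
```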

    Handling missing data and extraction failures

    The LLMs were prompted to explicitly identify missing data points in the thrombectomy reports and label them as ‘not available’ or ‘not applicable’. This approach was employed to ensure clarity and prevent the generation of incorrect or fabricated data. Furthermore, we incorporated a feedback loop to reinitiate the extraction process after initial failures. This mechanism was crucial for identifying reports in which data extraction was not feasible, such as those involving venous sinus thrombectomy or spontaneous recanalization, ensuring accurate data handling.
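
    A minimal sketch of such a feedback loop is shown below. The callable passed as llm_call is a hypothetical placeholder for the model query (prompt assembly plus generation); the study's actual retry logic is part of the linked notebook.

```python
import json

def extract_with_retry(report_text: str, llm_call, max_attempts: int = 3) -> dict | None:
    """Re-prompt the model when its reply is not valid JSON.

    `llm_call` is a hypothetical callable that takes the report text and returns
    the raw model reply. Reports that never yield usable output (e.g. venous
    sinus thrombectomy or spontaneous recanalization) return None and are
    flagged for manual review.
    """
    for _ in range(max_attempts):
        raw = llm_call(report_text)
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # feedback loop: reinitiate the extraction after a failure
        # Keep explicit 'not available' labels rather than empty or guessed values
        return {key: (value if str(value).strip() else "not available")
                for key, value in record.items()}
    return None
```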

    Evaluation metrics

    To comprehensively evaluate the performance of the LLMs we used a range of metrics designed to assess various aspects of the output quality of the models—namely, precision, recall, and F1 score. We performed all statistical analyses using the Pandas and SciPy libraries in Python (Version 3.12.1) and plots were created using RStudio (R Version 4.3.2).

    Because the ‘used materials’ field of the output JSON file contains a comma-separated list of items, we opted for token-based metrics that score each extracted item individually, which is better suited to this field than judging the correctness of the whole list as a single value.
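
    The sketch below shows one way to score such comma-separated list fields with token-based precision, recall, and F1. The normalization and matching rules are simplified assumptions and may not match the study's exact implementation.

```python
def token_scores(predicted: str, reference: str) -> tuple[float, float, float]:
    """Token-based precision, recall, and F1 for comma-separated list fields
    such as 'used materials'."""
    pred = {t.strip().lower() for t in predicted.split(",") if t.strip()}
    ref = {t.strip().lower() for t in reference.split(",") if t.strip()}
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    tp = len(pred & ref)                       # items found in both lists
    precision = tp / len(pred)
    recall = tp / len(ref)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example with illustrative device names: two of three predicted items match
print(token_scores("Sofia 5F, Solitaire 6x40, Trevo",
                   "SOFIA 5F, Solitaire 6x40"))
```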

    Results

    Study population

    Initially, 1026 reports were retrieved using our radiology information system. Eighteen reports were excluded due to spontaneous recanalization or absence of intracranial occlusion and eight further reports were excluded due to venous sinus thrombectomy. Included reports were written by seven different neurointerventionalists. The external dataset comprised 50 reports on mechanical thrombectomy performed at a second university hospital.

    Extracted information

    Our evaluation of LLMs for extracting information from unstructured thrombectomy reports showed variable effectiveness across diverse metrics. Among the models tested, Mixtral achieved the highest performance with precision values ranging from 0.99 for first series time to 0.69 for occluded vessel data. The Qwen model showed moderate performance with precision scores from 0.85 for the NIHSS score to 0.28 for occluded vessels. Despite its specialization in medical tasks, the BioMistral model had the lowest precision, with scores peaking at 0.81 for first series time and dipping to 0.14 for medication data.

    Notably, all models performed well in extracting explicit data facilitated by the use of an integrated template within the reports. For instance, precision for NIHSS score was high across models (Mixtral: 0.98, Qwen: 0.85, BioMistral: 0.79). Similarly, scores for puncture time (Mixtral: 0.98, Qwen: 0.82, BioMistral: 0.82), first series time (Mixtral: 0.99, Qwen: 0.81, BioMistral: 0.81), and artery opening time (Mixtral: 0.98, Qwen: 0.79, BioMistral: 0.78) indicated strong performance. However, precision was notably lower for occluded vessel extraction in the Mixtral model (0.69) and for medication details in both the BioMistral (0.14) and Qwen models (0.28).

    For the external dataset we used only the Mixtral model. This model showed high precision across various data points ranging from a perfect 1.00 for the NIHSS score to 0.70 for occluded vessels.

    The detailed performance metrics for the internal and external datasets are shown in table 2 and table 3, respectively. Figure 2 visually illustrates the precision values for each model, including error bars that highlight the variability in performance across the tested data points.

    Table 2

    Performance metrics for the internal dataset

    Table 3

    Performance metrics of the Mixtral model for the external dataset

    Figure 2

    Precision for the prediction of data points by models of different parameter sizes.

    Human-in-the-loop annotation

    The analysis showed that the mean time required for manual data extraction was 186.95 s (range 37–401 s), the mean time for initial data extraction by the LLM was 4.33 s (range 2–6 s), and the mean time needed for manual corrections of the LLM-extracted information was 59.63 s (range 20–103 s). These efficiencies resulted in an average time savings of 65.6% per case (range 45.95–79.56%). The time difference was statistically significant (p<0.05).

    Discussion

    Extracting meaningful data from unstructured medical text is both challenging and essential for data analysis and research. Our study highlights the feasibility of automated data extraction from thrombectomy reports in patients with stroke using open-source LLMs within a secure local environment that respects patient data privacy. The Mixtral 8×7B instruct model demonstrated high performance, with precision values ranging from 0.99 for first series time to 0.69 for the occluded vessel category. Our results indicate that LLMs can effectively contribute to medical research by streamlining data processing while safeguarding sensitive information.

    Recent studies underscore the high efficacy of LLMs in extracting both implicit and explicit data from unstructured text. Dagdelen et al explored the use of LLMs for materials science data extraction using models with a JSON output schema. Their findings show a clear superiority of LLMs over traditional natural language processing methods, highlighting significant time savings achieved through the HITL annotation technique.26 Similarly, a study by Goel et al showed that LLMs could significantly accelerate medical data extraction. Their approach, which also used HITL annotations, reduced time costs by an average of 42% compared with traditional annotation from scratch.27 In our study, HITL reduced annotation time by an average of 65.6%.

    For mechanical thrombectomy procedures in patients with stroke, various studies have demonstrated the effectiveness of extracting procedural details from free-text reports. Yu et al used a traditional natural language processing approach to detect large vessel occlusion in radiologic reports, achieving an accuracy of 97.3%.28 Gunter et al also reported accuracy greater than 90% in identifying different stroke characteristics from radiology reports.29 More recently, Lehmen et al used GPT-3.5 and GPT-4 to extract data from 100 mechanical thrombectomy reports, with a correctness rate of 94% across data points and performance varying between 61% and 100% per category.30 However, a significant limitation of this approach is the potential compromise of data privacy. In contrast, our automated local LLM pipeline achieved comparable results in 1000 reports while ensuring a completely secure environment for data processing, thus respecting patient data privacy. This approach maintains high performance in data extraction and also upholds strict data protection standards.

    The accuracy and precision of the different models varied across the extracted data points. For instance, the TICI score and NIHSS score showed high precision across all models, indicating robust performance in extracting explicit data. In contrast, ‘used materials’, ‘medication’, and ‘complications’ showed lower precision, suggesting the higher complexity of these implicit data. For example, the Mixtral model correctly identified a peri-interventional distal embolus with peripheral small vessel occlusion, or a failure of the closure device with a subsequent groin hematoma, as complications, whereas these were not detected by the BioMistral and Qwen models.

    We observed moderate performance in extracting data for the ‘occluded vessel’ category, attributable to two main factors. First, the lack of standardized nomenclature for vessels poses a significant challenge; terms like ‘distal ICA’, ‘carotid terminus’, and ‘supraophthalmic segment’ refer to the same location but are labeled differently by neurointerventionalists. Second, there are often discrepancies between the occlusion noted in the clinical history and requested procedure sections, which typically describe the occlusion identified on CT or MRI, and the results section, which reflects the occlusion detected during angiography. These inconsistencies contribute to the difficulty of accurately extracting and interpreting data on occluded vessels from procedural reports.

    Our study has some limitations. First, its retrospective nature may limit the generalizability of the results. Second, the results are based on the performance of the models in extracting data from German reports and may therefore not be directly applicable to reports in other languages; for English text in particular, performance can be expected to improve, as the training data of these models consists mainly of English text. Third, variability in the quality and consistency of the input data, such as differences in terminology, formatting, or level of detail in the reports, can affect the performance of the models; however, our model showed stable performance in extracting data from the external dataset.

    Conclusion

    Our findings show that an automated pipeline for data extraction from procedural reports using local open-source LLMs is both feasible and effective, achieving high performance levels. Furthermore, integrating an HITL annotation process can significantly reduce the time cost while ensuring reliable results.

    Data availability statement

    Data are available upon reasonable request. All notebooks and prompts used in this study are publicly available at GitHub (https://github.com/Meddebma/AI_4_Medicine/blob/main/Thrombectomy_LLM_Extraction.ipynb).

    Ethics statements

    Patient consent for publication

    Ethics approval

    This retrospective study was approved by the ethics committee of the Charité University Hospital in Berlin (No. EA4/062/20). The requirement for informed consent was waived due to the retrospective design of the study.

    Acknowledgments

    AM is a fellow of the BIH Charité Digital Clinician Scientist Program funded by the Charité – Universitätsmedizin Berlin and the Berlin Institute of Health at Charité (BIH).

    References

    Footnotes

    • Contributors AM and JN designed the study. PE, KKB, DD, AD, GB, JFK, ES and MS gathered and critically reviewed the data. NG, MAB and AO gathered and critically reviewed the external dataset. AM performed the statistical analyses and drafted the manuscript. PE, KKB, DD, AD, GB, JFK, ES, MS, NG, MAB, AO and JN critically reviewed the manuscript. All authors approved the final version of the manuscript. AM is responsible for the overall content as guarantor.

    • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

    • Competing interests None declared.

    • Provenance and peer review Not commissioned; externally peer reviewed.