
Original research
Assessing the clinical reasoning of ChatGPT for mechanical thrombectomy in patients with stroke
  1. Tse Chiang Chen1,
  2. Mitchell W Couldwell1,
  3. Jorie Singer1,
  4. Alyssa Singer1,
  5. Laila Koduri1,
  6. Emily Kaminski1,
  7. Khoa Nguyen1,
  8. Evan Multala1,
  9. Aaron S Dumont2,
  10. Arthur Wang2
  1. Tulane University School of Medicine, New Orleans, Louisiana, USA
  2. Department of Neurological Surgery, Tulane University School of Medicine, New Orleans, Louisiana, USA
  1. Correspondence to Dr Arthur Wang, Neurological Surgery, Tulane University School of Medicine, New Orleans, LA 70112, USA; awang15@tulane.edu

Abstract

Background Artificial intelligence (AI) has become a promising tool in medicine. ChatGPT, a large language model AI Chatbot, shows promise in supporting clinical practice. We assess the potential of ChatGPT as a clinical reasoning tool for mechanical thrombectomy in patients with stroke.

Methods An internal validation of the abilities of ChatGPT was first performed using artificially created patient scenarios before assessment of real patient scenarios from the medical center’s stroke database. All patients with large vessel occlusions who underwent mechanical thrombectomy at Tulane Medical Center between January 1, 2022 and December 31, 2022 were included in the study. The performance of ChatGPT in evaluating which patients should undergo mechanical thrombectomy was compared with the decisions made by board-certified stroke neurologists and neurointerventionalists. The interpretation skills, clinical reasoning, and accuracy of ChatGPT were analyzed.

Results 102 patients with large vessel occlusions met the inclusion criteria, of whom 57 underwent mechanical thrombectomy. ChatGPT agreed with the physicians’ decision on whether or not to pursue thrombectomy in 54.3% of the cases. ChatGPT made errors in 8.8% of the cases, comprising mathematics, logic, and misinterpretation errors. In the internal validation phase, ChatGPT provided nuanced clinical reasoning and performed multi-step thinking, although with an increased rate of mistakes.

Conclusion ChatGPT shows promise in clinical reasoning, including the ability to factor in a patient’s underlying comorbidities when considering mechanical thrombectomy. However, ChatGPT is also prone to errors and, in its present form, should not be relied on as a sole decision-making tool; it nonetheless has the potential to assist clinicians with a more efficient work flow.

  • Thrombectomy
  • Stroke


WHAT IS ALREADY KNOWN ON THIS TOPIC

  • Artificial intelligence has become a promising tool in medicine and has been shown to improve work flow and efficiency.

WHAT THIS STUDY ADDS

  • This study demonstrates the potential capabilities of artificial intelligence in making clinical decisions regarding mechanical thrombectomy for stroke.

HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY

  • The implications of this study show that ChatGPT can have a potential role in assisting clinicians by improving work flow and serving as an adjunct to daily clinical practice.

Background

Artificial intelligence (AI) has emerged as a promising tool in various medical specialties. With its ability to mimic human intelligence through large-scale analytics, AI has the potential to make daily work flow more efficient. In stroke neurology, the use of AI software such as RapidAI and VizAI in the treatment of stroke has improved diagnostic accuracy, treatment, and decision making.1 The development of large language models (LLMs) has also garnered attention. ChatGPT, an LLM chatbot introduced in November 2022, uses Reinforcement Learning from Human Feedback (RLHF) and has undergone iterative improvements, with GPT-4 being the current model.2 ChatGPT shows promise in supporting clinical practice by aiding in diagnoses, recommending diagnostic tests, and helping in the management of patients with chronic conditions.3 It has also shown proficiency in passing medical licensing examinations, including the neurosurgery written board examination.4 Notably, a study by Microsoft suggests that ChatGPT exhibits signs of artificial general intelligence and reasoning capabilities.5

Studies evaluating LLMs in the medical field have shown that, when given a clinical scenario, ChatGPT can provide meaningful treatment suggestions. At the same time, the program struggles to establish causal relationships and lacks medical expertise and experience.6 ChatGPT is sensitive to the phrasing of the prompt, which can lead to inconsistent answers depending on how the question is asked, and it occasionally prioritizes an articulate answer over an accurate one.7

We assessed the utility of ChatGPT as a clinical reasoning tool for mechanical thrombectomy (MT) in patients presenting with large vessel occlusions (LVO). Although the criteria for MT are well established, the decision to proceed with MT still requires clinical reasoning to fully understand the indications and contraindications. To our knowledge, there is currently no existing literature on using ChatGPT for this purpose.

Methods

GPT-4 (OpenAI, San Francisco, California, USA), the current iteration of ChatGPT, was used for this study. ChatGPT was investigated in two parts.

Study set-up

Part 1

An internal validation of ChatGPT was performed using artificially created patient scenarios. These cases were designed to cover a range of indications and contraindications for MT, with variations in perfusion mismatch ratio, time since onset of symptoms (TSO), baseline modified Rankin Score (mRS), and National Institutes of Health Stroke Scale (NIHSS) score. The purpose was to assess ChatGPT’s ability to (a) perform arithmetic to calculate time elapsed (time is a key factor in deciding whether or not to pursue MT); and (b) perform more nuanced clinical reasoning when given borderline cases (see below).

Borderline cases are those that do not meet current American Heart Association thrombectomy guidelines but have other favorable clinical or radiologic findings that would potentially impact the decision for thrombectomy.

Additionally, we evaluated the performance of ChatGPT in scenarios involving the presence or absence of intracranial hemorrhage (ICH), incomplete prompts with missing clinical information (ie, absent penumbra volume), and cases with medical comorbidities.

Part 2

We compared the performance of ChatGPT in clinical decision-making with the decisions made by board-certified stroke neurologists and neurointerventionalists using real-life cases from the Tulane Medical Center stroke database from January 1, 2022 to December 31, 2022.

The inclusion criteria were individuals aged ≥18 years with an LVO (internal carotid artery (ICA), ICA bifurcation, M1 or M2 segments of the middle cerebral artery, basilar artery) and available CT perfusion (CTP) data. The exclusion criteria were individuals aged <18 years, patients presenting >24 hours since last known normal (LKN), and patients without CTP data (including those who underwent thrombectomy). All cases were anonymized. We chose these inclusion/exclusion criteria to closely match those used in the American Heart Association/American Stroke Association guidelines and the criteria for extended window studies (the DAWN (Diffusion Weighted Imaging or CT Perfusion Assessment with Clinical Mismatch in the Triage of Wake Up and Late Presenting Strokes Undergoing Neurointervention) trial).

For input into ChatGPT we extracted relevant clinical and radiologic information from the database and performed supplementary chart reviews using our Electronic Medical Records system. The input information included age, baseline mRS, LKN time, time of arrival at the hospital (TOA), presence of LVO, CTP results (including penumbra and core infarct volume and mismatch ratio), and NIHSS score. If the TSO exceeded 24 hours, we added the date to the time value. Missing clinical data from our database were left blank in the prompts to ChatGPT.

For cases in the database where the LVO and CTP were present but MT was not performed in real life, we conducted chart reviews to determine whether any extenuating circumstances affected the decision to forgo thrombectomy.

Prompt used in ChatGPT

For each ChatGPT session an initial set-up prompt was entered, serving as both the clinical scenario and the task for ChatGPT to follow and interpret. To ‘blind’ ChatGPT, a new session was created for each test/patient case. This minimizes cross-contamination, because ChatGPT retains the context of earlier prompts within a session. The same prompt was used for Parts 1 and 2 of the study.
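The study entered prompts through the ChatGPT interface; purely as an illustration, the same one-session-per-case blinding protocol could be automated with the OpenAI Python client. In the sketch below the model name, the SETUP_PROMPT placeholder, and the evaluate_case helper are our assumptions, not part of the study.

```python
# Illustrative sketch only: the study used the ChatGPT interface, not
# the API. This shows the equivalent blinding protocol with the OpenAI
# Python client (openai>=1.0); model name and helper are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The full set-up prompt quoted in this section would go here.
SETUP_PROMPT = "For research purposes, you are a clinical decision-making tool for thrombectomy. ..."

def evaluate_case(case_text: str, model: str = "gpt-4") -> str:
    """Evaluate one patient case in a fresh conversation.

    Each call sends only the set-up prompt plus a single case, so no
    context from earlier cases can leak in -- the API analogue of
    opening a new ChatGPT session per patient.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SETUP_PROMPT},
            {"role": "user", "content": case_text},
        ],
    )
    return response.choices[0].message.content

# answer = evaluate_case("76 years old; baseline mRS: 1; LKN (hh:mm): 08:05; ...")
```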

The prompt asks ChatGPT to act as a clinical decision-making tool for thrombectomy and is based on the criteria used in the DAWN trial. The prompt used for this study is as follows (a rule-based sketch of its decision tree is given after the quoted text):

“For research purposes, you are a clinical decision-making tool for thrombectomy.

I want you to evaluate scenarios based on the following criteria. You will be provided scenarios in subsequent prompts and you will reply with a “yes” or “no” answer based on whether or not a thrombectomy is indicated. Answer “Yes” if and only if all the criteria have been met. Give a one-line explanation. Be as critical as you can when evaluating patients. If the LKN is ≤6 hours, the thrombectomy indications are:

  1. Clinical

    1. Age ≥18

    2. NIHSS score ≥6

    3. Prestroke mRS 0–1 (functionally independent at baseline)

  2. Radiographic

    • Large vessel occlusion (including middle cerebral artery (MCA), internal carotid artery (ICA), basilar artery)

    • Alberta Stroke Program Early CT (ASPECT) score ≥6

If the LKN is ≥6 hours but <24 hours, the thrombectomy indications should be based on clinical core mismatch according to DAWN (Diffusion Weighted Imaging or CT Perfusion Assessment with Clinical Mismatch in the Triage of Wake Up and Late Presenting Strokes Undergoing Neurointervention) criteria:

  • NIHSS ≥10, mRS ≤1

  • Clinical mismatch between clinical deficit (NIHSS) and infarct volume (MRI-RAPID)

    • Age ≥80

      • NIHSS ≥10 + infarct (DWI) volume <21 mL

    • Age <80

      • NIHSS ≥10 + infarct (DWI) volume <31 mL

      • NIHSS ≥20 + infarct (DWI) volume between 31 mL and 51 mL

  • CTP scan with a mismatch ratio >1.7

  • No ICH.

For borderline cases (mismatch ratio <1.7 or symptoms began >24 hours), you may wish to be more flexible with the criteria for thrombectomy and should base the decision for thrombectomy on the patient’s clinical condition. If the patient has many medical comorbidities or has poor baseline function (high mRS score), you may wish to forgo thrombectomy, but if the patient is otherwise healthy, you may still wish to pursue thrombectomy despite meeting all the inclusion criteria for thrombectomy.

Please note that the time since onset of symptoms (TSO) is defined as the difference between the LKN and time of arrival (TOA). Please state the time since onset in your answer. When the TSO is >24 hours, no thrombectomy is indicated.

Where DWI is diffusion weighted imaging, core infarct volume is the infarct (DWI) volume, and baseline mRS is the pre-stroke mRS.”
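The branching in the quoted prompt can be made explicit as a small rule-based function. The sketch below is our own paraphrase of the prompt’s decision tree, not code from the study; it uses float('inf') for an infinite mismatch ratio, and it simply declines borderline cases that the prompt instead defers to the patient’s overall clinical condition.

```python
def thrombectomy_indicated(age, nihss, baseline_mrs, tso_hours, lvo,
                           aspects=None, infarct_volume=None,
                           mismatch_ratio=None, ich=False):
    """Rule-based paraphrase of the set-up prompt (our sketch).

    Returns (decision, reason). Borderline cases (e.g. mismatch ratio
    <=1.7) are declined here, whereas the prompt defers them to
    clinical judgment.
    """
    if ich:
        return "no", "intracranial hemorrhage is a contraindication"
    if tso_hours > 24:
        return "no", "time since onset >24 hours"
    if not lvo:
        return "no", "no large vessel occlusion"
    if tso_hours <= 6:  # early window
        if (age >= 18 and nihss >= 6 and baseline_mrs <= 1
                and aspects is not None and aspects >= 6):
            return "yes", "meets early-window clinical and radiographic criteria"
        return "no", "early-window criteria not met"
    # Extended window (6-24 h): DAWN clinical core mismatch.
    if nihss < 10 or baseline_mrs > 1:
        return "no", "extended window requires NIHSS >=10 and mRS <=1"
    if mismatch_ratio is None or mismatch_ratio <= 1.7:
        return "no", "CTP mismatch ratio must exceed 1.7"
    if age >= 80:
        mismatch = infarct_volume is not None and infarct_volume < 21
    else:
        mismatch = infarct_volume is not None and (
            infarct_volume < 31
            or (nihss >= 20 and 31 <= infarct_volume <= 51))
    if mismatch:
        return "yes", "meets DAWN clinical core mismatch criteria"
    return "no", "no clinical core mismatch per DAWN thresholds"

# The article's example case, assuming (for illustration) a TSO of 8 h:
print(thrombectomy_indicated(age=76, nihss=18, baseline_mrs=1,
                             tso_hours=8, lvo=True, infarct_volume=0,
                             mismatch_ratio=float("inf")))
# -> ('yes', 'meets DAWN clinical core mismatch criteria')
```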

Data provided to ChatGPT are standardized in the following format, which is automated from information collected from the Tulane Stroke Database:

  • Age

  • Baseline mRS score

  • LKN and TOA at the hospital

  • NIHSS score

  • LVO presence and/or location

  • Perfusion imaging result: CTP results including penumbra volume (mL), core stroke volume (mL), mismatch ratio.

An example of data inputted into ChatGPT is as follows: 76 years old; baseline mRS: 1; LKN (hh:mm): 08:05; LVO: yes; penumbra: 50; core: 0; mismatch ratio: infinite; NIHSS: 18.
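As a minimal sketch of how such a standardized string could be assembled from database fields (the dict keys here are our own hypothetical names, not the Tulane database schema), with missing values left blank as described above:

```python
def format_case(record: dict) -> str:
    """Render one case record into the article's standardized input
    format. Missing fields are left blank, mirroring how missing
    clinical data were handled in the prompts.
    """
    def show(v):
        return "" if v is None else str(v)

    mr = record.get("mismatch_ratio")
    mr_text = "infinite" if mr == float("inf") else show(mr)
    return (
        f"{show(record.get('age'))} years old; "
        f"baseline mRS: {show(record.get('baseline_mrs'))}; "
        f"LKN (hh:mm): {show(record.get('lkn'))}; "
        f"LVO: {show(record.get('lvo'))}; "
        f"penumbra: {show(record.get('penumbra_ml'))}; "
        f"core: {show(record.get('core_ml'))}; "
        f"mismatch ratio: {mr_text}; "
        f"NIHSS: {show(record.get('nihss'))}"
    )

case = {"age": 76, "baseline_mrs": 1, "lkn": "08:05", "lvo": "yes",
        "penumbra_ml": 50, "core_ml": 0,
        "mismatch_ratio": float("inf"), "nihss": 18}
print(format_case(case))
# 76 years old; baseline mRS: 1; LKN (hh:mm): 08:05; LVO: yes;
# penumbra: 50; core: 0; mismatch ratio: infinite; NIHSS: 18
```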

Assessing the ability of ChatGPT to perform multi-step evaluations involving arithmetic

A key component in decision making for MT is the TSO. The LKN time and the TOA at the hospital are universally ascertained by the physician. The clinician must then calculate the actual TSO, defined as the difference between the LKN and the TOA, to determine whether the patient is beyond the window for thrombectomy.

To assess higher order ‘thinking’, two versions of the same patient scenario were provided to ChatGPT, the key difference being whether the TSO (in hours) was given explicitly. In version 1 we assessed the ability of ChatGPT to calculate the TSO when provided only with the LKN and the TOA at the hospital (figure 1). In version 2, ChatGPT was explicitly provided with the exact TSO. By comparing performance across these two versions, we can examine how ChatGPT handles the increased complexity of the task.
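The calculation being tested in V1 is simple but easy to get wrong across midnight or across dates; a minimal sketch (our illustration, not study code) of the TSO arithmetic, including the article’s convention of attaching a date when onset exceeds 24 hours:

```python
from datetime import datetime

def tso_hours(lkn: str, toa: str, fmt: str = "%H:%M") -> float:
    """Time since onset of symptoms: TOA minus LKN, in hours.

    With bare clock times, a TOA earlier than the LKN is taken to mean
    the clock wrapped past midnight. For TSO >24 h the article attaches
    a date to the time value, so pass full timestamps with
    fmt='%Y-%m-%d %H:%M' instead.
    """
    delta = datetime.strptime(toa, fmt) - datetime.strptime(lkn, fmt)
    hours = delta.total_seconds() / 3600
    if fmt == "%H:%M" and hours < 0:
        hours += 24  # crossed midnight
    return hours

print(tso_hours("08:05", "14:35"))                      # 6.5
print(tso_hours("2022-03-01 22:00", "2022-03-03 01:00",
                fmt="%Y-%m-%d %H:%M"))                  # 27.0
```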

Figure 1

Screenshot showing the response of ChatGPT when asked to calculate the time elapsed.

Variations presented to ChatGPT

Version 1 (V1): data provided to ChatGPT:

  • Age

  • Baseline mRS score

  • Last known normal (LKN)

  • Time of arrival at hospital (TOA)

  • NIHSS score

  • LVO location

  • Perfusion imaging result: CTP results including penumbra volume, core stroke volume, mismatch ratio

Version 2 (V2): data provided to ChatGPT

  • Age

  • Baseline mRS score

  • Time since onset of symptoms (TSO)

  • NIHSS score

  • LVO location

  • Perfusion imaging result: CTP results including penumbra volume, core stroke volume, mismatch ratio.

We add the following to the prompt to ChatGPT:

“Please note that the time since onset is defined as the difference between the LKN and time of arrival. Please state the time since onset in your answer. When the time since onset is greater than 24 hours, no thrombectomy is indicated.”

ChatGPT analytical metrics

The clinical reasoning of ChatGPT was analyzed for each patient in the Tulane Stroke Database:

  • number of cases where ChatGPT did not recommend thrombectomy (‘declined’ cases)

  • number of cases where ChatGPT recommended thrombectomy (‘accepted’ cases)

  • number of cases agreed on by both physicians and ChatGPT (‘agreed cases by ChatGPT vs MDs’)

  • number of errors

We also analyzed the performance of ChatGPT after patients within the stroke database were stratified by time into patient subgroups (a brief computation sketch follows this list):

  • patients presenting within 6 hours of symptom onset

  • patients presenting between 6 and 24 hours of symptom onset (extended window strokes)

  • patients presenting beyond 24 hours of symptom onset.
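These metrics reduce to simple aggregation; as a brief sketch (the field names are hypothetical, not the study’s data model), the counts and per-subgroup agreement can be computed in a single pass over the case list:

```python
from collections import Counter

def summarize(cases):
    """Tally the analytical metrics for a list of case dicts with
    hypothetical fields: 'gpt' and 'md' ('yes'/'no' thrombectomy
    decisions), 'errors' (error count), and 'tso_hours'.
    """
    tally = Counter()
    for c in cases:
        tally["accepted" if c["gpt"] == "yes" else "declined"] += 1
        tally["agreed"] += c["gpt"] == c["md"]
        tally["errors"] += c["errors"]
        if c["tso_hours"] < 6:
            bucket = "<6 h"
        elif c["tso_hours"] <= 24:
            bucket = "6-24 h"
        else:
            bucket = ">24 h"
        tally[f"n {bucket}"] += 1
        tally[f"agreed {bucket}"] += c["gpt"] == c["md"]
    return dict(tally)

# Example: one extended-window case where ChatGPT and physicians agreed.
print(summarize([{"gpt": "yes", "md": "yes", "errors": 0, "tso_hours": 8}]))
```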

Error assessment

Throughout the assessment we manually checked and compiled the presence of errors, error rate, and characterization/type of errors. It is important to note that, when evaluating errors, we assess ChatGPT’s clinical reasoning based on the provided prompt.

For each patient case we examined the response of ChatGPT for errors and categorized the types of errors based on the following:

  • Logic (ie, ChatGPT acknowledges that the criteria for thrombectomy are met, but then states no thrombectomy is indicated).

  • Mathematics (ie, ChatGPT states that a mismatch ratio of 3 is <1.7 or infinite is <1.7).

  • Misinterpreted data (ie, we input NIHSS 10 but ChatGPT states that NIHSS 5 or no NIHSS is inputted).

  • Hallucination (ie, ChatGPT creates a criterion not stated in the original prompt).

  • Clinical reasoning (ie, an LVO case meets all thrombectomy criteria but the patient has an unknown baseline mRS. Some interventionalists may still pursue the case, but ChatGPT may decline given strict adherence to guidelines. If borderline/equivocal, the case will not be marked as having an error but will be commented on).

Results

Part 1: Internal validation of ChatGPT using test cases

A total of 22 distinct artificial patient cases were evaluated, with the first 15 assessed over two versions (V1 and V2).

A summary of the artificially created cases in Part 1 of the study (table 1):

  • Artificial patients 1–15: Two versions (V1 and V2) of the data for each clinical patient are inputted to ChatGPT with the key difference being an explicit definition of the TSO. Patients are further stratified by TSO: patients 1–5 have TSO <6 hours, patients 6–10 have TSO 6–24 hours, and patients 11–15 have TSO >24 hours. ChatGPT was assessed for clinical decision making along with arithmetic abilities to calculate TSO from LKN and TOA at the hospital.

  • Artificial patients 16–17: ChatGPT is assessed for the ability to recognize contraindications to MT.

  • Artificial patients 18–19: ChatGPT is assessed for more nuanced decision making by considering the patients’ medical comorbidities.

  • Artificial patients 20–21: ChatGPT is assessed for the ability to recognize cases with ICH and to decide against MT.

  • Artificial patient 22: ChatGPT is assessed for the ability to recognize when additional clinical information is needed to decide on MT.

Table 1

List of artificially created cases used for internal validation of ChatGPT’s clinical reasoning abilities

ChatGPT also evaluated cases involving the presence or absence of ICH, incomplete prompts with missing values, and cases with different levels of comorbidities (see artificial patients 18 and 19). ChatGPT was able to make decisions despite missing information and exhibited the capability to discern between cases with high comorbidity and low comorbidity (figure 2).

Figure 2

Screenshots showing ChatGPT responses to scenarios. (A) Example of input with missing information. (B, C) Contrasting ChatGPT responses to cases with different levels of comorbidities.

A total of eight errors were made within the arithmetic cases (patients 1–15), a 27% error rate across the 30 evaluations (15 cases × 2 versions). Four were mathematics errors and four were logic errors (table 1). There were no errors for patients 16–22.

Part 2: Comparison of ChatGPT performance using the Tulane Stroke Database

In 2022, a total of 572 ischemic strokes were recorded in the Tulane Stroke Database, 102 of which met the inclusion criteria for the study with both a documented LVO and CTP imaging. Of these 102 cases, 57 (55.9%) underwent MT. A summary of the patient characteristics is shown in table 2.

Table 2

Characteristics of stroke database and patients used for study in 2022

A review of the Tulane database showed that 32 of the 102 cases of LVO had extenuating circumstances surrounding their clinical presentation which complicated the decision-making process. In these 32 cases, no MT was performed by the physicians although some cases were favorable candidates for MT. The reasons why MT was not performed included: family declined the procedure, discrepancies found in repeat imaging, instances of distal filling from collaterals or non-occlusive filling defects with clear distal vessels, chronic LVOs or recurrence of stroke, low NIHSS score or improved clinical examinations such as after thrombolytics, new information obtained after the initial history of the present illness but before intervention, cardiac arrest during transport, stent placement instead of thrombectomy, the presence of ICH, dissection, high risk due to frailty/comorbidities/age, difficult approach, large core volume or completed infarct, and other physician judgments.

Of the 102 eligible cases, ChatGPT recommended no thrombectomy for 69 cases and recommended thrombectomy for 33 cases (table 3). In 61 of the 102 eligible cases (59.8%), ChatGPT reached a decision that was in agreement with the physicians’ decisions. Analysis of ChatGPT decisions showed nine errors (8.8%) committed by ChatGPT (table 4). Of these nine errors, five affected the decision making for MT. For example, in one case with an unknown LKN, ChatGPT advised against MT but in real life a decision would be made based on CTP data.

Table 3

ChatGPT results using the Stroke Database 2022

Table 4

ChatGPT errors from the Stroke Database

Excluding the 32 cases with extenuating circumstances, a total of 70 LVO cases were used for further analysis (table 3). Of the 70 cases, 33 (47.1%) presented with TSO <6 hours, 30 cases (42.9%) presented with TSO between 6 and 24 hours, and seven (10.0%) presented with TSO >24 hours. Further analysis after exclusion of the 32 cases showed a rate of agreement of 54.3% between ChatGPT and the physician’s decision. After stratifying by TSO, the percentage agreement between ChatGPT and physician decisions increased as TSO increased. Within the subgroups of TSO <6 hours, 6–24 hours, and >24 hours, ChatGPT showed agreement with physicians in 39.5%, 56.7%, and 85.7%, respectively (table 3).

Analysis of the Tulane database showed that there were 20 cases in which MT was performed despite the patients not meeting MT criteria. Of these, in 11 cases the patient had a baseline mRS >2 or NIHSS score below the threshold for MT (table 3).

Discussion

This study evaluated the clinical decision making of ChatGPT when presented with patients with ischemic stroke with LVO. From the results, a few points are worth discussing.

First, our internal validation cases show that ChatGPT is able to follow the inclusion and exclusion criteria established by the DAWN study to correctly classify patients as eligible or ineligible for MT.8 For example, in the unique test cases (patients 16–22) ChatGPT made no errors, which suggests that the application can recognize and reach correct clinical decisions when variables such as contraindications to MT are introduced, and that it follows the stroke guidelines closely.

Second, ChatGPT was able to perform basic arithmetic by calculating time elapsed and was able to critically analyze nuanced patient cases with medical comorbidities and their implications for outcomes. Third, on further analysis of our patients in the Tulane Stroke Database, ChatGPT showed 54.3% agreement with the physicians’ decisions after exclusion of cases with extenuating circumstances that would act as confounding variables for ChatGPT decision making. One explanation for this agreement percentage is that our institution has a culture of pursuing MT more aggressively than the DAWN criteria alone would dictate, even when a patient does not strictly meet them. Real-life decision making for thrombectomy is influenced by various indications and contraindications as well as the preferences of the stroke neurologists/interventionalists: the clinical acumen and judgment of the physician, the time sensitivity of each case, uncertainty about aspects of a patient’s characteristics such as the baseline mRS, and the emotional, humanistic considerations raised by each patient’s family. Consequently, when judged strictly against the DAWN criteria, the decision for thrombectomy in a portion of our patients would not match the DAWN indications, producing a greater than expected disagreement between ChatGPT and physicians. The agreement percentage can therefore also be interpreted as ChatGPT being more conservative, strictly adhering to the DAWN criteria where our physicians applied them more flexibly.

Analysis of errors committed by ChatGPT

ChatGPT is not without errors. The AI application exhibited several types of errors, with math errors being the predominant kind, followed by logic errors and misinterpretation of data. The math errors included misunderstanding an infinite mismatch ratio and misinterpreting NIHSS scores. Another common error was declining thrombectomy when information integral to the DAWN criteria was unknown in real life and thus not provided to ChatGPT. In effect, this can be interpreted as ChatGPT evaluating cases with strict adherence to the DAWN criteria.

Misinterpretation of entered data was another error type observed, where ChatGPT mistook the penumbra for the ASPECT score. Logic errors were also present, such as when ChatGPT used the wrong criteria (ie, using DAWN criteria for LKN between 6 and 24 hours on a case where LKN was <6 hours) or when ChatGPT did not use the correct ‘if’ statements concerning age and NIHSS as specified in the initial set-up prompt during case evaluation.

In the internal validation test cases in which ChatGPT was asked to perform arithmetic, ChatGPT had a higher error rate than in its evaluation of the stroke patient database. This may be due to the increased complexity of the task, which may consume more ‘tokens’ for computation. Tokens are, roughly, the units of text over which ChatGPT computes, and only a limited number are available per prompt evaluation.

One of the criticisms of ChatGPT is the application’s tendency to ‘hallucinate’: the answer ChatGPT gives is factually incorrect, but it is presented so reasonably and convincingly that it almost always ‘looks’ correct. Surprisingly, there were no hallucination errors in the formal experiment. The lack of hallucinations may be due to the standardized nature of the input cases and the specific question being asked of ChatGPT.

Study limitations

The database yield for eligible patients was suboptimal, as not all LVO cases (including some that underwent MT) had available CTP studies, reducing the number of patients for comparison. The accuracy of the database is also not foolproof, as the input relies on the accuracy of the underlying documentation. CTP data can be prone to inaccuracies, particularly in the presence of motion artifacts. Accurate assessment requires live visual evaluation by radiologists or neurologists, which the base model of ChatGPT lacks as it is text-based and has no visual input.

At its current stage, ChatGPT should not be relied on as a sole decision-making tool. To reach a level of confidence sufficient for (semi-)autonomous use, improvements are needed to eliminate misinterpretation of inputted data, to enhance mathematical evaluation (such as by concurrent use of plugins or other AI models), and to implement guardrails against illogical statements, potentially including enforced double-checking of ChatGPT’s own statements. Specifically, ChatGPT plugins adept at mathematics, such as WolframAlpha, could assist with mathematical calculations in future versions of ChatGPT experiments.9 Further exploration is warranted in areas where nuanced clinical reasoning is required, and close collaboration with industry is essential to further develop this field.

Conclusion

AI has made significant advances in the field of medicine in recent years. This study shows the ability of ChatGPT to interpret clinical scenarios and provide reasoning for MT. However, in its current form ChatGPT is still prone to errors, making it not yet reliable for independent clinical use. These errors range from misinterpretation of inputs to faulty calculations and questionable clinical decisions. Despite these limitations, ChatGPT performed well in the majority of cases. Further development, including the implementation of safeguards to ensure accuracy, holds the potential to make this technology useful in neurointervention.

Data availability statement

All data relevant to the study are included in the article or uploaded as supplementary information.

Ethics statements

Patient consent for publication

Ethics approval

Not applicable.


Footnotes

  • Contributors TCC: Writing – review and editing, investigation. MWC, JS, AS, LK, EK, KN, EM: Writing – original draft, writing – review and editing, investigation. ASD: Writing – review and editing. AW: Conceptualization, supervision, writing – review and editing, methodology, investigation, project administration. AW accepts full responsibility for the work and/or the conduct of the study, had access to the data, and controlled the decision to publish.

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.