Purpose The DAWN trial (Diffusion weighted imaging or CT perfusion Assessment with clinical mismatch in the triage of Wake-up and late presenting strokes undergoing Neurointervention with Trevo) has demonstrated the benefits of thrombectomy in patients with unknown or late onset strokes, using automated software (RAPID) for measurement of infarct volume. Because RAPID is not available in all centers, we aimed to assess the accuracy and repeatability of visual infarct volume estimation by clinicians and the consequences for thrombectomy decisions based on the DAWN criteria.
Materials and methods 18 physicians, who routinely depend on MRI for acute stroke imaging, assessed 32 MR scans selected from a prospective database over two independent sessions. Raters were asked to visually estimate the diffusion weighted imaging (DWI) infarct volume for each case. Sensitivity, specificity, and accuracy of the estimated volumes were compared with the available RAPID measurements for various volume cut-off points. Thrombectomy decisions based on DAWN criteria with RAPID measurements and raters’ visual estimates were compared. Inter-rater and intra-rater agreement was measured using kappa statistics.
Results The mean accuracy of raters was <90% for all volume cut-off points. Inter-rater agreement was below substantial for each DWI infarct volume cut-off point. Intra-rater agreement was substantial for 55–83% of raters, depending on the selected cut-off point. Applying DAWN criteria with visual estimates instead of RAPID measurements led to 19% erroneous thrombectomy decisions, and showed a lack of reproducibility.
Conclusion The visual assessment of DWI infarct volume lacks accuracy and repeatability, and could lead to a significant number of erroneous decisions when applying the DAWN criteria.
Results of recent trials1–4 allowed the extension of the time window for mechanical thrombectomy in patients with acute ischemic stroke (AIS) with emergent large vessel occlusion (ELVO)5 beyond 6 hours from the time of last known well (LKW), based on clinical imaging mismatch criteria. While the Endovascular Therapy Following Imaging Evaluation for Ischemic Stroke 3 (DEFUSE 3) trial used both core infarct and hypoperfusion volumes as imaging criteria,3 4 the Diffusion weighted imaging or CT perfusion Assessment with clinical mismatch in the triage of Wake-up and late presenting strokes undergoing Neurointervention with Trevo (DAWN) trial relied solely on infarct volume measurement for patient selection.1 2 This core infarct volume was automatically generated by the RAPID software (iSchemiaView, Redwood City, California, USA) on either CT perfusion6 or MRI diffusion weighted imaging (DWI).7 The recent American Heart Association guidelines recommend using the criteria from these trials for patient selection for thrombectomy beyond 6 hours since LKW.8
In centers using brain MRI for acute screening, physicians are required to visually estimate DWI infarct volume for the thrombectomy decision in the extended time window if RAPID or other approved third party software is unavailable to them. We aimed to assess whether clinician conducted visual assessment of DWI infarct volume is accurate compared with RAPID, and repeatable among physicians. Using the DAWN selection criteria, we also assessed whether RAPID measurements and raters’ estimates of DWI infarct volume would lead to similar treatment decisions.
The study protocol was approved by the local ethics committee. In line with French regulations, where the study was conducted, the institutional review board waived the need for patient signed consent. The study methods followed the Standards for Reporting Diagnostic Accuracy Study (STARD)9 and the Guidelines for Reporting Reliability and Agreement Studies (GRRAS).10 The study consisted of a set of MRI scans processed with RAPID and then examined by several raters who visually estimated the infarct volume and who were tested for accuracy of measurements and inter-rater/intra-rater agreement.
All raters were clinicians involved in the management of patients with AIS, routinely using brain MRI for acute stroke screening. Eighteen raters, including six vascular neurologists, six interventional neuroradiologists, and six diagnostic neuroradiologists (three attending physicians (seniors) and three resident physicians (juniors) in each category) from three comprehensive stroke centers (14 raters) and two referring stroke centers (four raters) participated in this study. All resident physician raters had completed at least 2 years of residency. To promote participation, raters were asked if they would like to collaborate in the design and reporting of the present work.
Patients and reference standard
Cases were selected from a prospectively collected database of all patients with AIS with a time of LKW >6 hours and an anterior circulation ELVO who were emergently evaluated using brain MRI between November 2011 and September 2017 (n=102 patients). Detailed characteristics of this cohort have been previously published.11 All patients had good quality interpretable scans with no major motion artifact. DWI acquisitions were performed with a b value of 1000, and infarct volume was automatically measured by RAPID using an ADC threshold of 620, which served as a reference standard for the present study. A total of 27 MRI scans failed to be processed with RAPID (fail rate of 26.5%), leaving 75 patients with a successful automatic measurement of infarct volume with RAPID.
Two authors (CD and RF) selected 32 examinations, with the aim of including approximately one-third of cases with a small infarct volume (defined by a DWI volume <21 mL1 2 measured with RAPID), one-third of cases with a large infarct volume (defined by a DWI volume ≥71 mL),3 4 12 and one-third of cases with an intermediate infarct volume (≥21 mL and <71 mL). The selected proportions were designed to study the reliability of dichotomized DWI volumes with various cut-offs while minimizing the paradoxes of kappa statistics. If the selected cases were imbalanced, with too many (or too few) large (or small) infarct volumes, a kappa paradox might erroneously lead to low kappa values even if high agreement is truly present.13 14 The number of cases in the present study (n=32) was minimized to limit rater fatigue; however, it remained above the recommendations of Donner et al,15 according to which 24 cases are sufficient to keep the lower boundary of the CI within a predetermined limit (0.4), assuming substantial agreement (κ >0.6) between five raters (the number of cases being inversely correlated with the number of raters).
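The kappa paradox mentioned above can be illustrated with a minimal sketch. The ratings below are hypothetical (they are not study data) and the helper `cohen_kappa` is written here for illustration only. Both scenarios share the same observed agreement (28/32 cases), yet kappa drops sharply when nearly all cases fall on one side of the cut-off.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of binary (0/1) ratings."""
    n = len(a)
    # Observed proportion of cases on which the two ratings agree
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    # Marginal proportion of "1" answers for each rating series
    p_a1 = sum(a) / n
    p_b1 = sum(b) / n
    # Agreement expected by chance given those marginals
    p_exp = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical ratings for 32 cases; 1 = "volume below the cut-off".
# Balanced portfolio: half of the cases on each side of the cut-off.
balanced_a = [1] * 16 + [0] * 16
balanced_b = [1] * 14 + [0] * 2 + [1] * 2 + [0] * 14

# Imbalanced portfolio: almost every case is below the cut-off.
skewed_a = [1] * 30 + [0] * 2
skewed_b = [1] * 28 + [0] * 2 + [1] * 2

kappa_balanced = cohen_kappa(balanced_a, balanced_b)  # 0.75
kappa_skewed = cohen_kappa(skewed_a, skewed_b)        # about -0.07
print(round(kappa_balanced, 2), round(kappa_skewed, 2))
```

In the imbalanced scenario the chance-expected agreement is already about 88%, so an observed agreement of 87.5% yields a kappa near zero. This is the situation a balanced selection of cases is meant to avoid.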
Reading sessions and index test
The MRI studies, originating from one tertiary center and several referral centers, were anonymized and uploaded to the Picture Archiving and Communication System (PACS) (with the following sequences: DWI, apparent diffusion coefficient, and intracranial time of flight angiography). Raters were asked to perform two reading sessions. The following clinical information was provided for each case prior to displaying brain imaging, in order to mimic clinical conditions: gender, age, symptoms (left or right motor deficit, aphasia, etc), initial National Institutes of Health Stroke Scale (NIHSS) score, stroke type (wake-up stroke with time of symptom discovery or unknown onset stroke with time of LKW), and time of brain MRI. Each reading session was independently performed on a dedicated workstation. Raters had no time constraints and were permitted to modify the image window width and level. Two authors (AG, RF) monitored the two independent reading sessions of each rater, who evaluated all 32 cases under the same conditions, in a different order, and at least 1 month apart. For each case, raters had to give a visual estimation of the total infarct volume (integer answer in mL). Answers served as the index test for the present accuracy study.
Raters were also provided with four ‘reference’ MRI studies displaying DWI lesions with total infarct volumes of 21 mL, 31 mL, 51 mL, and 71 mL, according to RAPID measurements. Raters were allowed to consult the reference MRI studies at any time. The infarct volumes of the reference MRI studies represent the various cut-offs used for patient selection in recent trials1–4 and were thus used to define the four target conditions for the present accuracy study: infarct volume <21 mL, infarct volume <31 mL, infarct volume <51 mL, and infarct volume <71 mL.
The following DAWN criteria2 were used for thrombectomy decisions:
DAWN group A criteria include patients ≥80 years old with an NIHSS score of ≥10, who should undergo thrombectomy if their DWI infarct volume is <21 mL.
DAWN group B criteria include patients <80 years old with an NIHSS score of ≥10, who should undergo thrombectomy if their DWI infarct volume is <31 mL.
DAWN group C criteria include patients <80 years old with an NIHSS score of ≥20, who should undergo thrombectomy if their DWI infarct volume is <51 mL.
These criteria were applied to the study population with RAPID measurements and raters’ visual estimates in order to compare the resulting treatment decisions.
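The decision rule listed above, and its comparison between RAPID measurements and visual estimates, can be sketched as follows. This is an illustration under stated assumptions, not the study's analysis code: the function names and the example cases are hypothetical, patients with an NIHSS score <10 are assumed not to be eligible, and patients <80 years old with an NIHSS score ≥20 are assumed to be assigned to group C rather than group B (the groups as listed overlap).

```python
def dawn_volume_cutoff(age, nihss):
    """Return the DWI volume cut-off (mL) for a patient, or None if no group applies."""
    if nihss < 10:
        return None  # assumption: outside groups A, B, and C
    if age >= 80:
        return 21    # group A
    if nihss >= 20:
        return 51    # group C (assumed to take precedence over group B)
    return 31        # group B


def thrombectomy_indicated(age, nihss, volume_ml):
    """True if the infarct volume is below the cut-off of the patient's group."""
    cutoff = dawn_volume_cutoff(age, nihss)
    return cutoff is not None and volume_ml < cutoff


def erroneous_decision_rate(cases, estimates_ml):
    """Share of cases where the visual estimate changes the RAPID-based decision.

    cases: list of (age, nihss, rapid_volume_ml); estimates_ml: visual estimates.
    """
    errors = sum(
        thrombectomy_indicated(age, nihss, rapid)
        != thrombectomy_indicated(age, nihss, estimate)
        for (age, nihss, rapid), estimate in zip(cases, estimates_ml)
    )
    return errors / len(cases)


# Hypothetical cases: (age, NIHSS, RAPID volume in mL) and one rater's estimates.
example_cases = [(85, 12, 15), (70, 14, 28), (60, 22, 60), (72, 11, 40)]
example_estimates = [25, 20, 45, 35]
print(erroneous_decision_rate(example_cases, example_estimates))  # 0.5
```

In this toy example the first case is wrongly denied thrombectomy (overestimated volume) and the third is wrongly offered it (underestimated volume), which are the two error types considered in the study.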
The study cohort’s characteristics and proportions of answers (for all raters/cases and specific subgroups) were first described. Mean±SD values were calculated for continuous variables and frequency was reported for categorical variables. For the accuracy study, 2×2 contingency tables were used to calculate sensitivity, specificity, and accuracy rates for each rater. Group and subgroup validation parameters were calculated using mean sensitivity, specificity, and accuracy. Interpretation of the accuracy values was made according to Fischer et al, where an accuracy value >90% indicates high accuracy, 70–90% moderate accuracy, 50–70% low accuracy, and 50% a toss-up (chance result).16 17 Inter-rater agreement was measured using Fleiss’ kappa statistics and intra-rater agreement was measured using Cohen’s kappa statistics with 95% bias corrected CIs obtained by 1000 bootstrap resampling. Slight (k=0–0.20), fair (k=0.21–0.40), moderate (k=0.41–0.60), substantial (k=0.61–0.80), and excellent (k>0.80) agreement categories were reported according to Landis and Koch,18 who defined agreement as insufficient if it fails to reach the substantial level (k >0.6). All analyses were done with R software V.3.3.2 and a significance level of 5%.
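As a rough illustration of the accuracy analysis described above, the sketch below builds the 2×2 table for one rater and one target condition (infarct volume below a cut-off) and derives sensitivity, specificity, and accuracy. The volumes are hypothetical, the function names are illustrative, and the handling of values falling exactly on a category boundary in `accuracy_category` is an assumption, since the text does not specify it.

```python
def diagnostic_metrics(reference_ml, estimate_ml, cutoff_ml):
    """Sensitivity, specificity, and accuracy for the condition 'volume < cutoff'."""
    tp = fp = tn = fn = 0
    for ref, est in zip(reference_ml, estimate_ml):
        ref_positive = ref < cutoff_ml  # condition present per reference standard
        est_positive = est < cutoff_ml  # condition present per visual estimate
        if ref_positive and est_positive:
            tp += 1
        elif ref_positive:
            fn += 1
        elif est_positive:
            fp += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else None,
        "specificity": tn / (tn + fp) if tn + fp else None,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


def accuracy_category(accuracy):
    """Map an accuracy rate (0-1) to the interpretation categories cited above."""
    if accuracy > 0.90:
        return "high"
    if accuracy >= 0.70:
        return "moderate"
    if accuracy > 0.50:
        return "low"
    return "toss-up"  # assumption: values at or below 50% are grouped here


# Hypothetical reference (RAPID) volumes and one rater's visual estimates, in mL.
reference = [10, 18, 25, 40, 60, 80]
estimate = [15, 30, 20, 35, 45, 90]
metrics = diagnostic_metrics(reference, estimate, cutoff_ml=21)
print(metrics, accuracy_category(metrics["accuracy"]))
```

With these values, one of the two truly small infarcts is overestimated (sensitivity 50%) and one of the four larger infarcts is underestimated (specificity 75%), for an accuracy of about 67%.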
Eight of 32 patients were men (25%, mean age 69.9±12.8 years), and 17 patients (53%) had wake-up stroke (see online supplemental table I). Median (IQR) NIHSS score at admission was 15.5 (11–25). Median (IQR) DWI infarct volume was 49.5 mL (17.5–85.5) (measured with RAPID). Eight patients (25%) were ≥80 years old with an NIHSS score of ≥10 (DAWN group A), 18 cases (56%) were <80 years old with an NIHSS score of ≥10 (DAWN group B), and six cases (19%) were <80 years old with an NIHSS score of ≥20 (DAWN group C).
Accuracy study results
The mean proportions (minimum–maximum) of agreement with RAPID for DWI infarct volume ±10%/±20%/±30% among the 18 raters were 12% (6–22%)/29% (16–44%)/39% (25–56%), respectively. Mean proportions (minimum–maximum) of cases judged to have a DWI infarct volume <21 mL/<31 mL/<51 mL/<71 mL were 24% (3–56%)/37% (13–66%)/59% (38–88%)/78% (53–100%), respectively. Raters' answers for each patient are graphically displayed in figure 1A.
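One way to compute a proportion of agreement within a tolerance band is sketched below. It assumes that "±10%" means the visual estimate lies within 10% of the RAPID value, taken relative to the RAPID value; the exact definition used in the study is not detailed in the text, and the volumes shown are hypothetical.

```python
def proportion_within(reference_ml, estimate_ml, tolerance):
    """Share of estimates lying within +/- tolerance (a fraction) of the reference."""
    hits = sum(
        abs(est - ref) <= tolerance * ref
        for ref, est in zip(reference_ml, estimate_ml)
    )
    return hits / len(reference_ml)


# Hypothetical RAPID volumes and one rater's visual estimates, in mL.
rapid = [20, 50, 100, 10]
visual = [21, 62, 125, 20]

for tol in (0.10, 0.20, 0.30):
    print(f"within ±{tol:.0%}: {proportion_within(rapid, visual, tol):.0%}")
```

Here the relative errors are 5%, 24%, 25%, and 100%, so one of four estimates falls within ±10% or ±20% and three of four fall within ±30%.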
Raters achieved the lowest mean sensitivity in their volume estimation for patients with a RAPID reference infarct volume <21 mL (61%, minimum–maximum value=10–100%, see online supplemental table III). In this configuration, only 3/18 raters (16.7%) had 100% sensitivity, correctly estimating a DWI infarct volume <21 mL in all patients with an infarct volume <21 mL according to the reference standard RAPID. The highest mean sensitivity was reached for the <71 mL cut-off point (96%, minimum–maximum value=82–100%) where eight raters (44.4%) had a sensitivity of 100% (figure 1B). Increasing the dichotomization cut-off point from 21 mL to 71 mL also significantly decreased specificity (from 93% (75–98%) to 47% (24–70%)), without any significant difference between specialties, levels of experience, or stroke center level (table 1).
Agreement for DWI infarct volume estimation
Values for inter-rater agreement for dichotomized DWI infarct volume with various cut-off points are displayed in table 1 and figure 2A. Inter-rater agreement was below substantial for all raters for each DWI infarct volume cut-off point, without significant differences between specialties, levels of experience, or stroke center level (table 1). The proportions of cases with perfect agreement among all raters for dichotomized DWI infarct volume with cut-off points of 21 mL/31 mL/51 mL/71 mL were 44%/37%/44%/50%, respectively. Disagreement involving at least 6/18 raters for dichotomized DWI infarct volume with cut-off points of 21 mL/31 mL/51 mL/71 mL occurred in 13%/16%/13%/16% of cases, respectively.
The results of the intra-rater agreement study are displayed in figure 2B. Detailed results are available in the online supplemental table II. Intra-rater agreement for the assessment of dichotomized DWI infarct volume was at least substantial for 10/18 (55%) to 15/18 (83%) raters, depending on the selected cut-off point. Thirteen raters (72%) had below substantial intra-rater agreement for at least one of the dichotomized DWI infarct volume configurations. Raters changed their dichotomized judgment between two readings in 10.5–12.8% of cases, depending on the selected cut-off point.
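For readers unfamiliar with the inter-rater statistic reported here, a minimal sketch of Fleiss' kappa for dichotomized answers is given below, together with the agreement categories used in this study. The ratings are hypothetical, the function names are illustrative, and grouping negative kappa values under "slight" is an assumption. The bootstrap confidence intervals used in the study are not reproduced.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings is a list of cases, each a list with one label per rater."""
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for case in ratings for label in case})
    # counts[i][j] = number of raters who assigned case i to category j
    counts = [[case.count(c) for c in categories] for case in ratings]
    # Per-case agreement: proportion of agreeing rater pairs
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_cases
    # Overall proportion of answers falling in each category
    p_j = [
        sum(row[j] for row in counts) / (n_cases * n_raters)
        for j in range(len(categories))
    ]
    p_exp = sum(p * p for p in p_j)  # agreement expected by chance
    return (p_bar - p_exp) / (1 - p_exp)


def agreement_category(kappa):
    """Agreement categories as reported in the statistical methods of this study."""
    if kappa > 0.80:
        return "excellent"
    if kappa > 0.60:
        return "substantial"
    if kappa > 0.40:
        return "moderate"
    if kappa > 0.20:
        return "fair"
    return "slight"  # assumption: negative values are grouped here as well


# Hypothetical answers of three raters for four cases; 1 = "volume below cut-off".
example_ratings = [[1, 1, 0], [0, 0, 1], [1, 1, 1], [0, 0, 0]]
kappa = fleiss_kappa(example_ratings)
print(round(kappa, 2), agreement_category(kappa))  # 0.33 fair
```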
Thrombectomy decisions based on DAWN criteria
Based on the DAWN criteria and RAPID measures, thrombectomy was indicated in 12/32 patients (37%), including 4/8 patients (50%) fulfilling DAWN group A criteria, 8/18 patients (44%) fulfilling DAWN group B criteria, and 0/6 patients (0%) fulfilling DAWN group C criteria.
When using raters’ visual estimates and the DAWN criteria, thrombectomy indications varied between 2/32 patients (6%) and 21/32 patients (66%). Compared with RAPID measures, the use of visual estimates led to erroneous decisions (ie, denying thrombectomy to an eligible patient or offering thrombectomy off protocol) in 3/32 (9%) to 10/32 (31%) cases. The overall mean of erroneous decisions was 19% (figure 3).
Inter-rater agreement for thrombectomy decisions based on DAWN criteria and visual estimates was below substantial for all raters, as displayed in the online supplemental figure IA (detailed results available in the online supplemental table III).
Intra-rater agreement was below substantial for 9/18 raters (50%), as displayed in the online supplemental figure IB (individual results available in the online supplemental table IV). The variation in visual estimates between both readings resulted in a change in thrombectomy decision in a mean of 21% of cases (ranging from 9% to 44%).
This is, to our knowledge, the first study addressing visual assessment of DWI infarct volumes in the extended time window. Our study demonstrated low accuracy of the visual assessment of DWI infarct volume compared with the reference standard RAPID. Sensitivity (the proportion of true positives among patients with an infarct volume below the cut-off point according to RAPID) showed a tendency to increase with increasing infarct volume cut-off point, at the expense of decreased specificity (figure 1). The visual estimation of DWI infarct volume also had insufficient reproducibility for meaningful use in clinical practice. More importantly, application of the DAWN criteria with visual estimates of DWI infarct volume instead of RAPID measurements led to a high proportion of erroneous decisions, and also showed insufficient inter-rater and intra-rater reproducibility.
The accuracy of visual assessment of DWI infarct volume proved to be insufficient, failing to reach 90% in all categories,16 17 and it varied with the rater; in other words, it proved to be operator dependent and therefore lacking strong objectivity for its use as a pivotal clinical decision making tool for an invasive therapy. The sensitivity of the visual estimates (or their ability to identify the estimated volumes within the correct category) was limited, especially for low volumes (<21 mL and <31 mL) where higher estimates were generated by raters (sensitivity 61% and 82%, respectively). When applying the DAWN criteria for decision making, the high proportion of small infarct volume overestimations translated into a high proportion of erroneous decisions to deny thrombectomy to patients fulfilling DAWN group A and B criteria, in whom the benefit of endovascular treatment has been demonstrated.2
Although we saw an increase in sensitivity with the moderate (<51 mL) and especially with the large volume cut-off points (<71 mL) at 89% and 96%, respectively, the loss of specificity makes the clinical test being evaluated (visual assessment here) of no clinical benefit in decision making (specificity 74% and 47%, respectively). Low specificity in our study reflected the inability to correctly identify patients with a DWI infarct volume greater than or equal to the predefined cut-off point. When applying the DAWN criteria for decision making, the low specificity translated into a high proportion of overtreatment of patients in whom the benefit of thrombectomy is suggested11 19 but not yet demonstrated.
Inter-rater agreement was below substantial for all raters in any of the dichotomizations analyzed, without any significant difference between specialties. Reproducibility of the assessment by the same rater (intra-rater agreement) also proved to be insufficient for use in clinical practice. Previous studies have demonstrated the lack of reproducibility of ischemic infarction visual assessment and grading with an ordinal scale on both CT20 and MRI.21 22 This lack of reproducibility also impacted on the agreement for thrombectomy decisions using the DAWN criteria, which proved to be insufficient between and within raters.
The accuracy and reliability of pure visual estimation of DWI infarct volume was expected to be insufficient for reliable clinical decision making. Our study confirms the limited performance of this method, despite the help of visual reference examinations. In the absence of RAPID, other methods are available to the clinician. The semi-visual ABC/2 technique, previously shown to have good reliability23 and which requires manual measurements of the three main perpendicular axes of the infarct, could serve as an alternative. Nevertheless, this technique showed limited performance for small infarct volumes23 (which represent the majority of patients included in DAWN2 and DEFUSE 34) and is also more difficult to apply in the case of multifocal infarcts. Further studies are necessary to compare this technique with RAPID and other software packages in the settings of AIS with ELVO.
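For illustration, the ABC/2 technique mentioned above estimates a lesion volume as half the product of three perpendicular diameters. The sketch below assumes diameters measured in centimeters, so that the result is in cm³ (equivalent to mL); the function names and example measurements are hypothetical. The second function shows the exact ellipsoid volume that ABC/2 approximates.

```python
import math


def abc_over_2(a_cm, b_cm, c_cm):
    """ABC/2 volume estimate in mL from three perpendicular diameters in cm."""
    return a_cm * b_cm * c_cm / 2


def ellipsoid_volume(a_cm, b_cm, c_cm):
    """Exact volume in mL of an ellipsoid with diameters a, b, and c in cm."""
    return math.pi / 6 * a_cm * b_cm * c_cm


# Hypothetical lesion measuring 4.0 x 3.0 x 3.5 cm.
print(abc_over_2(4.0, 3.0, 3.5))                  # 21.0
print(round(ellipsoid_volume(4.0, 3.0, 3.5), 1))  # 22.0
```

Because a single set of three axes describes one roughly ellipsoidal lesion, a multifocal infarct would require repeating the measurement for each focus, which is consistent with the difficulty noted above.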
We recognize certain limitations in our study, including the artificially constructed population and the fact that a different set of cases could generate different results. The artificial construction of our portfolio, which included patients with a wide range of DWI infarct volumes, was required to design and perform an agreement study while avoiding the paradoxes of kappa statistics. If we had included mostly patients with small infarct volumes (in order to mimic the recent trials' cohorts2 4), the calculation of kappa indexes would have been flawed and we would have found low kappa values even in cases of high agreement.13 14 Our study cohort might thus not be a representative sample of commonly treated patients.2 4 However, this does not affect the accuracy of the validation parameters (sensitivity and specificity), which are independent of disease prevalence. A preliminary training session or the use of semi-manual techniques (such as the ABC/2 method23) could have improved the raters' performance for estimation of DWI infarct volume. On the other hand, reference MRI examinations were provided to the raters, which might have artificially improved their performance. Asking our raters to answer dichotomous questions (as opposed to requiring them to provide an integer estimate of the volume) could also have altered our results in either direction. We also did not impose time constraints and did not measure the time spent for each rating session, which could have allowed us to study whether rating speed was related to accuracy. We could not study the consequences of visual estimation of DWI infarct volume on thrombectomy decisions based on DEFUSE 3 criteria, because it would require perfusion maps that were not available for the present study. Thrombectomy decisions were extrapolated from the visual estimates and DAWN criteria, and thus do not necessarily reflect the real decisions that would have been made by the clinicians.
RAPID was used here as a reference standard because of its exclusive use in the DAWN and DEFUSE 3 studies,2 4 but there are other available tools for the automated or semi-automated measurement of DWI infarct volume, which have not yet been validated in a thrombectomy trial.
Our study suggests that even with the use of visual reference aids, the visual assessment of DWI infarct volume lacks accuracy and reproducibility, and could result in a high proportion of erroneous thrombectomy decisions.
Contributors Study design: NK, CD, AG, and RF. Acquisition, analysis, or interpretation of the data: all authors. Drafting of the manuscript: NK, CD, AG, and RF. Statistical analysis: KZ. Supervision: RF.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial, or not-for-profit sectors.
Competing interests None declared.
Patient consent Not required.
Ethics approval The study protocol was approved by the local ethics committee.
Provenance and peer review Not commissioned; externally peer reviewed.
Data sharing statement Data, analytic methods, and study materials will be made available to any researcher for purposes of reproducing the results or replicating the procedure. Requests to receive these materials should be sent to the corresponding author, who will maintain their availability.