Validation of deep-learning-based triage and acuity score using a large national dataset

  • Joon-myoung Kwon,

    Contributed equally to this work with: Joon-myoung Kwon, Youngnam Lee

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Writing – original draft, Writing – review & editing

    Affiliation Department of Emergency Medicine, Mediplex Sejong Hospital, Incheon, Korea

  • Youngnam Lee,

    Contributed equally to this work with: Joon-myoung Kwon, Youngnam Lee

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    youngnam.lee@vuno.co

    Affiliation VUNO, Seoul, Korea

  • Yeha Lee,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization

    Affiliation VUNO, Seoul, Korea

  • Seungwoo Lee,

    Roles Data curation, Formal analysis, Software, Validation, Visualization

    Affiliation VUNO, Seoul, Korea

  • Hyunho Park,

    Roles Data curation, Formal analysis, Writing – review & editing

    Affiliation VUNO, Seoul, Korea

  • Jinsik Park

    Roles Conceptualization, Methodology, Project administration, Resources, Supervision

    Affiliation Department of Cardiology, Mediplex Sejong Hospital, Incheon, Korea

Abstract

Aim

Triage is essential for identifying high-risk patients among the many less urgent patients, particularly as emergency department (ED) overcrowding has recently become a national crisis. This study aimed to validate that a Deep-learning-based Triage and Acuity Score (DTAS) identifies high-risk patients more accurately than existing triage and acuity scores, using a large national dataset.

Methods

We conducted a retrospective observational cohort study using data from the Korean National Emergency Department Information System (NEDIS), which collected data on visits in real time from 151 EDs. The NEDIS data were split into derivation data (January 2014-June 2016) and validation data (July-December 2016). We also used data from the Sejong General Hospital (SGH) for external validation (January-December 2017). We predicted in-hospital mortality, critical care, and hospitalization using the initial information of ED patients as predictor variables (age, sex, chief complaint, time from symptom onset to ED visit, arrival mode, trauma, initial vital signs, and mental status).

Results

A total of 11,656,559 patients were included in this study. The primary outcome was in-hospital mortality. The Area Under the Receiver Operating Characteristic curve (AUROC) and Area Under the Precision and Recall Curve (AUPRC) of DTAS were 0.935 and 0.264, respectively. It significantly outperformed the Korean triage and acuity score (AUROC: 0.785, AUPRC: 0.192), modified early warning score (AUROC: 0.810, AUPRC: 0.116), logistic regression (AUROC: 0.903, AUPRC: 0.209), and random forest (AUROC: 0.910, AUPRC: 0.179).

Conclusion

The Deep-learning-based Triage and Acuity Score predicted in-hospital mortality, critical care, and hospitalization more accurately than existing triage and acuity scores, and it was validated using a large, multicenter dataset.

Introduction

Overcrowding in an emergency department (ED) has been identified as a healthcare crisis in many nations.[1,2] Triage is important in identifying vulnerable and high-risk patients among a large number of less urgent patients as ED overcrowding and delay in care are associated with increased mortality in many conditions.[3] The rapid assessment of the patient’s risk and urgency is necessary to identify high-risk patients and determine treatment priority on arrival at the ED.

The Canadian Triage and Acuity Scale (CTAS) was developed in 1999 after studying the successful National Triage Scale (NTS) from Australia.[4] The Korean Triage and Acuity System (KTAS) was developed in 2012 based on CTAS and has been used nationwide as the triage tool in Korea since 2016.[5] Although these Triage and Acuity Scores (TASs) help identify patients at high risk of death, they have two limitations. First, they rely on the provider’s subjective judgement of critical care needs and pain intensity.[6,7] Because the decision can differ between providers, outcomes predicted by these TASs have high variation and low reliability.[8] Second, they can become a bottleneck in ED patient flow because subjective information is often ambiguous and cannot be judged instantly. In addition, because subjective judgement depends on clinical expertise, triage can take longer for less experienced providers.[9] This delay is a risk to patient safety.

The Modified Early Warning Score (MEWS) is a widely used tool that overcomes these two limitations by using physiological parameters (systolic blood pressure, pulse rate, respiratory rate, temperature, and level of consciousness [Alert, Voice, Pain, Unresponsive]).[10–14] However, it is limited in capturing the relationships between parameters. MEWS is the sum of the scores for each parameter, and the score for each parameter is calculated independently. For example, systolic blood pressure is not considered when calculating the score for temperature, even though temperature is interpreted differently depending on systolic blood pressure.
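
The additive structure described above can be sketched as follows. This is a minimal illustration, not the scoring table used in this study: the cut-points are assumptions that loosely follow a commonly cited MEWS table, and the function names are hypothetical. The point is only that each parameter is scored on its own and the results are summed.

```python
# Illustrative MEWS-style score. The cut-points below are ASSUMPTIONS for
# demonstration (not taken from this article) and are not for clinical use.

def sbp_score(sbp):
    """Score systolic blood pressure (mmHg) on its own."""
    if sbp <= 70:
        return 3
    if sbp <= 80:
        return 2
    if sbp <= 100:
        return 1
    if sbp < 200:
        return 0
    return 2

def pulse_score(hr):
    """Score pulse rate (beats/min) on its own."""
    if hr < 40:
        return 2
    if hr <= 50:
        return 1
    if hr <= 100:
        return 0
    if hr <= 110:
        return 1
    if hr < 130:
        return 2
    return 3

def resp_score(rr):
    """Score respiratory rate (breaths/min) on its own."""
    if rr < 9:
        return 2
    if rr <= 14:
        return 0
    if rr <= 20:
        return 1
    if rr < 30:
        return 2
    return 3

def temp_score(temp_c):
    """Score temperature (deg C) on its own; blood pressure is never consulted."""
    return 0 if 35.0 <= temp_c < 38.5 else 2

AVPU_SCORE = {"A": 0, "V": 1, "P": 2, "U": 3}

def mews(sbp, hr, rr, temp_c, avpu):
    """Total score is a plain sum of independently computed sub-scores."""
    return (sbp_score(sbp) + pulse_score(hr) + resp_score(rr)
            + temp_score(temp_c) + AVPU_SCORE[avpu])
```

In this sketch a temperature of 38.6 adds the same two points whether the systolic blood pressure is 85 or 120, which is the independence limitation the paragraph describes.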

Machine learning (ML)-based scoring overcomes this limitation of MEWS and shows higher performance than MEWS.[15] ML refers to algorithms that allow a computer to learn from given data (i.e., to improve its performance on a specific task) without being explicitly programmed. Until the last few years, several domains, including TAS, used ML methods such as logistic regression (LR) and random forest (RF).[16–18] LR finds the relationship between the parameters and the outcome and expresses it as a linear combination of the parameters. RF builds several decision trees using an ensemble technique and combines their results. A decision tree is a tree-like graph (i.e., model) that predicts the outcome by learning discrete cut-points (i.e., rules). Recently, deep learning (DL) has achieved state-of-the-art performance in several domains through deep hierarchical feature construction.[19–21] One of the most important advantages of DL over conventional ML is feature learning: from a large amount of data, DL automatically learns the features or representations needed for tasks such as classification and detection, using several non-linear modules. In this study, we developed a Deep-learning-based TAS (DTAS) and validated that DTAS significantly outperforms existing TASs using a large, multicenter dataset.

Methods

We conducted a retrospective observational cohort study using data from the Korean National Emergency Department Information System (NEDIS), which collected data on all visits in real time from 151 EDs in Korea. The NEDIS data were split into derivation data (January 2014-June 2016) and validation data (July-December 2016). Furthermore, we used data from the Sejong General Hospital (SGH) for external validation (January-December 2017). The hospital is a specialist cardiovascular teaching hospital, with approximately 14,000 patients visiting the ED each year. As shown in Table 1, the internal and external validation data had different characteristics. By validating on both datasets, we verified that DTAS was not biased towards specific characteristics. The Sejong General Hospital Institutional Review Board approved this study and granted waivers of informed consent based on general impracticability and minimal harm. Patient information was anonymized and de-identified before the analysis.
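
The time-based split of the NEDIS data can be sketched as below. This is a minimal sketch under stated assumptions: the function name and the use of the ED visit date as the splitting key are hypothetical, since the text gives only the month ranges.

```python
from datetime import date

# Period boundaries taken from the text; using the visit date as the
# splitting key is an assumption.
STUDY_START = date(2014, 1, 1)
DERIVATION_END = date(2016, 6, 30)
VALIDATION_END = date(2016, 12, 31)

def assign_split(visit_date):
    """Return 'derivation', 'validation', or None for out-of-range dates."""
    if STUDY_START <= visit_date <= DERIVATION_END:
        return "derivation"
    if DERIVATION_END < visit_date <= VALIDATION_END:
        return "validation"
    return None
```

Splitting by calendar period rather than at random means the validation visits all occur after the derivation visits.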

The NEDIS data included age, sex, arrival time, chief complaint, arrival mode, initial vital signs, trauma, ED treatment result, place of hospitalization, admission result, KTAS, discharge diagnosis, etc. The study subjects were adult patients (≥18 years); patients who were dead on arrival or had missing values were excluded.

The primary outcome was in-hospital mortality. The secondary outcome was critical care, and the tertiary outcome was hospitalization. The critical care outcome comprised direct admission to the intensive care unit (ICU), transfer to another hospital for ICU admission, and in-hospital mortality. The hospitalization outcome comprised direct admission to hospital, transfer to another hospital for admission, and in-hospital mortality. Admitted patients who eventually died were included in both the critical care outcome and the hospitalization outcome. However, no outcome was double counted because we predicted independently, for each outcome, whether it would occur: "hospitalization or non-hospitalization," "critical care or non-critical care," and "mortality or non-mortality." We used age, sex, chief complaint, time from symptom onset to ED visit, arrival mode, trauma, initial vital signs, and mental status as predictor variables (Table 1).
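
The three nested binary outcomes can be expressed as independent labels per visit. The sketch below is illustrative only: the `Visit` fields are hypothetical names rather than NEDIS field names, and treating an ICU admission as a hospital admission is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Visit:
    # Hypothetical disposition flags; real NEDIS field names are not given.
    admitted_ward: bool = False
    admitted_icu: bool = False
    transferred_for_admission: bool = False
    transferred_for_icu: bool = False
    died_in_hospital: bool = False

def outcome_labels(v):
    """Derive one independent 0/1 label per outcome, as defined in the text."""
    mortality = v.died_in_hospital
    critical_care = v.admitted_icu or v.transferred_for_icu or v.died_in_hospital
    # Assumption: ICU admission or ICU transfer also counts as hospitalization.
    hospitalization = (v.admitted_ward or v.admitted_icu
                       or v.transferred_for_admission
                       or v.transferred_for_icu or v.died_in_hospital)
    return {
        "mortality": int(mortality),
        "critical_care": int(critical_care),
        "hospitalization": int(hospitalization),
    }
```

A patient who dies in hospital is positive for all three labels, but each label feeds a separate binary prediction task, so no event is counted twice within a task.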

We developed DTAS using a multilayer perceptron, a deep learning method, with 5 hidden layers. Because adding more than 5 layers produced no gain in accuracy, we used 5 layers to minimize the number of parameters to be learned. The first to fourth layers consisted of 32, 32, 16, and 8 nodes, respectively, and used rectified linear activation. The last layer consisted of 1 node, which represented the risk of each outcome, and used a sigmoid function. We trained DTAS with the Adam optimizer and used binary cross-entropy as the loss function.[22] For validation, we used the hyperparameters of the model that performed best on 10% of the derivation data during the training process.
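
The layer layout described above can be sketched as a forward pass in plain Python. This is not the authors' implementation: the input width `N_FEATURES`, the weight initialization, and all function names are assumptions, and training with Adam is omitted. The sketch follows the description of four rectified linear layers (32, 32, 16, 8 nodes) followed by a one-node sigmoid layer.

```python
import math
import random

LAYER_SIZES = [32, 32, 16, 8, 1]  # node counts stated in the text
N_FEATURES = 12                   # ASSUMPTION: encoded input width is not given

def init_params(n_in, sizes, seed=0):
    """Random weights and zero biases per layer (initialization is an assumption)."""
    rng = random.Random(seed)
    params = []
    for n_out in sizes:
        scale = math.sqrt(2.0 / n_in)
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        params.append((weights, [0.0] * n_out))
        n_in = n_out
    return params

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def forward(params, x):
    """ReLU on every layer except the last, which applies a sigmoid."""
    h = x
    last = len(params) - 1
    for i, (weights, biases) in enumerate(params):
        z = [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(weights, biases)]
        h = [sigmoid(val) for val in z] if i == last else [max(0.0, val) for val in z]
    return h[0]  # predicted risk in (0, 1)

def binary_cross_entropy(p, y, eps=1e-12):
    """Loss for one example with predicted risk p and label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

params = init_params(N_FEATURES, LAYER_SIZES)
risk = forward(params, [0.5] * N_FEATURES)
```

Because each outcome is predicted independently, one such network (or one output) would be fitted per outcome; how the three outcomes shared parameters, if at all, is not stated in the text.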

We compared the performance of DTAS with that of KTAS, MEWS, LR, and RF. KTAS has been used nationwide as the triage tool in Korea since 2016. MEWS is widely used as a tool to identify patients at risk of deterioration, and several studies have shown good results with MEWS in predicting poor outcomes of ED patients.[13,23,24] In previous studies, LR and RF were the most commonly used machine learning algorithms and showed better performance than MEWS.[25–27]

We evaluated performance separately for each outcome. We used the area under the receiver operating characteristic curve (AUROC) and the area under the precision and recall curve (AUPRC) as the comparative measures. AUROC is one of the most widely used metrics for evaluating binary classifiers and plots sensitivity against 1-specificity. Compared with AUROC, AUPRC is more useful with imbalanced data such as ours; it plots precision (i.e., the fraction of positive predictions that are true positives) against recall (i.e., sensitivity).[28] With imbalanced data, in which the number of negatives far outweighs the number of positives, AUROC is limited as a performance measure because the false positive rate (false positives/total actual negatives) does not decrease dramatically when the total number of negatives is large.
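
Both measures can be computed from labels and predicted risks as in the sketch below. It is illustrative only: the article does not state how the curves were integrated, so the pairwise-ranking form of AUROC and the step-wise (average precision) form of AUPRC are assumptions, and the quadratic AUROC loop is suitable only for small examples.

```python
def auroc(y_true, scores):
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (ties broken arbitrarily)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    n_pos = sum(y_true)
    true_pos = 0
    total = 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            true_pos += 1
            total += true_pos / k  # precision at each recalled positive
    return total / n_pos

labels = [0, 0, 1, 1]
risks = [0.1, 0.4, 0.35, 0.8]
toy_auroc = auroc(labels, risks)              # 0.75
toy_auprc = average_precision(labels, risks)  # 5/6, about 0.83
```

In the toy example, three of the four positive-negative pairs are ranked correctly, giving an AUROC of 0.75, while precision at the two recalled positives is 1 and 2/3, giving an average precision of 5/6.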

Results

A total of 11,656,559 ED visits to 151 hospitals were recorded in the NEDIS. We excluded 689,041 visits (114,368 patients dead on arrival and 574,673 visits with missing values). The study subjects comprised 10,967,518 ED visits, and the outcomes were 153,217 in-hospital deaths (1.4%), 625,117 critical care admissions (5.7%), and 2,964,367 hospitalizations (27.0%) (Table 1). DTAS was developed using derivation data of 8,981,184 patients, and the validation study was performed using data of 1,986,334 patients from the NEDIS. External validation was performed using 13,989 visits to the SGH ED, where the outcomes were 150 in-hospital deaths (1.1%), 987 critical care admissions (7.1%), and 4,337 hospitalizations (31.0%).
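
The cohort counts above are internally consistent, which can be checked with a few lines of arithmetic. All numbers are taken from the paragraph; the variable and function names are illustrative.

```python
total_visits = 11_656_559
dead_on_arrival = 114_368
missing_values = 574_673

excluded = dead_on_arrival + missing_values  # 689,041
included = total_visits - excluded           # 10,967,518

derivation, validation = 8_981_184, 1_986_334

def pct(count, denominator):
    """Percentage rounded to one decimal place, as reported in the text."""
    return round(100.0 * count / denominator, 1)

nedis_rates = [pct(n, included) for n in (153_217, 625_117, 2_964_367)]
sgh_rates = [pct(n, 13_989) for n in (150, 987, 4_337)]
```

The derivation and validation counts sum to the included cohort, and the recomputed event rates match the reported percentages.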

As shown in Fig 1 and Table 2, DTAS (AUROC: 0.935, AUPRC: 0.264) significantly outperformed KTAS (AUROC: 0.785, AUPRC: 0.192), MEWS (AUROC: 0.810, AUPRC: 0.116), LR (AUROC: 0.903, AUPRC: 0.209), and RF (AUROC: 0.910, AUPRC: 0.179) with respect to in-hospital mortality. DTAS also outperformed KTAS, MEWS, LR, and RF with respect to critical care and hospitalization (Table 2). With respect to external validation, DTAS consistently showed better performance than other TASs.

Fig 1. Accuracy for predicting in-hospital mortality.

Fig 1 shows the receiver operating characteristic (ROC) curves and precision-recall (PR) curves for predicting in-hospital mortality. The ROC curve (A) and PR curve (B) of internal validation show that the Deep-learning-based Triage and Acuity Score (DTAS) predicted in-hospital mortality more accurately than the Korean Triage and Acuity System (KTAS), Modified Early Warning Score (MEWS), Random Forest (RF), and Logistic Regression (LR) using the National Emergency Department Information System (NEDIS) data (Table 1). The ROC curve (C) and PR curve (D) of external validation show that DTAS predicted in-hospital mortality more accurately than the other methods using the Sejong General Hospital (SGH) dataset. With respect to external validation, DTAS (AUROC: 0.92, AUPRC: 0.23) significantly outperformed KTAS (AUROC: 0.80, AUPRC: 0.13), MEWS (AUROC: 0.74, AUPRC: 0.06), RF (AUROC: 0.89, AUPRC: 0.14), and LR (AUROC: 0.89, AUPRC: 0.16).

https://doi.org/10.1371/journal.pone.0205836.g001

Table 2. Accuracy for predicting in-hospital mortality, critical care, and hospitalization.

https://doi.org/10.1371/journal.pone.0205836.t002

As shown in Fig 1, the sensitivity of KTAS level 3 was 0.49 for predicting in-hospital mortality. At this point, the precisions of DTAS, KTAS, MEWS, RF, and LR were 0.24, 0.08, 0.09, 0.16, and 0.18, respectively.

Discussion

We found that DTAS showed the best performance for predicting in-hospital mortality, critical care, and hospitalization based on a large, multicenter dataset. DTAS can reduce false positives by 67% compared to KTAS. This reduction in false positives increases the practical applicability of DTAS.
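
The text does not show how the 67% figure was derived. One reading that reproduces it, offered here as an assumption, uses the precisions reported at the sensitivity of KTAS level 3 (0.24 for DTAS and 0.08 for KTAS) and compares how many positive predictions each score needs per correctly identified death. The function names are illustrative.

```python
def alerts_per_true_positive(precision):
    """Positive predictions issued for each true positive caught."""
    return 1.0 / precision

def false_alerts_per_true_positive(precision):
    """False positives incurred for each true positive caught."""
    return (1.0 - precision) / precision

P_DTAS, P_KTAS = 0.24, 0.08  # precisions reported at sensitivity 0.49

# Reading 1 (ASSUMED to be the basis of the 67% figure): fewer total alerts.
alert_reduction = 1.0 - alerts_per_true_positive(P_DTAS) / alerts_per_true_positive(P_KTAS)

# Reading 2: reduction counted on false alerts only, at the same sensitivity.
false_alert_reduction = 1.0 - (false_alerts_per_true_positive(P_DTAS)
                               / false_alerts_per_true_positive(P_KTAS))
```

Under the first reading the reduction is 1 - 0.08/0.24, or about 67%; counting false alerts only gives about 72%. Either way, fewer patients would be flagged unnecessarily at the same sensitivity.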

Several previous studies attempted to predict the outcomes of ED patients. Taylor et al. reported a new random forest method for predicting in-hospital mortality of emergency department patients with sepsis.[29] Ong et al. reported a conventional machine learning model for predicting cardiac arrest in critically ill patients presenting to the ED.[30] However, both studies used small populations and did not perform multicenter validation. The performance of algorithms that are based on given data rather than medical knowledge, such as machine learning and deep learning, is not guaranteed in other environments. Because such algorithms learn the relationship between the predictor variables and the outcome only from the given data, they may memorize characteristics specific to the derivation data. Wolpert's No Free Lunch theorem makes a related point: a model optimized for one situation is not guaranteed to produce good results in other situations.[31]

We used the national NEDIS dataset to develop and validate DTAS, and the subjects of this study were patients who visited EDs across the whole country. Therefore, DTAS learned the characteristics of patients nationwide rather than those of any particular area. However, DTAS could be biased towards the average characteristics of NEDIS (i.e., overfitting). We therefore verified DTAS using SGH data (external validation), which had different characteristics from NEDIS. Through multicenter validation, we showed that the performance of DTAS was not biased towards specific characteristics and was maintained in another environment.

Most patients do not experience rare events such as in-hospital mortality and critical care (i.e., the data are imbalanced). In this environment, AUPRC is a more important metric than AUROC. With imbalanced data, in which the number of negatives far outweighs the number of positives, AUROC is limited as a performance measure because the false positive rate (false positives/total actual negatives) does not decrease dramatically when the total number of negatives is large. AUPRC, on the other hand, is suitable for imbalanced data, as it considers the fraction of true positives among positive predictions.[32] Although DTAS can reduce false positives by 67% compared to KTAS, the AUROC of DTAS is only 19% higher than the AUROC of KTAS for predicting in-hospital mortality. On the other hand, the AUPRC of DTAS is 38% higher than the AUPRC of KTAS.
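
The relative gains quoted above follow directly from the reported internal-validation values, and a small hypothetical example shows why the false positive rate hides differences that precision exposes. The confusion-matrix counts in the second half are invented for illustration and are not study data.

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`."""
    return (new - baseline) / baseline

auroc_gain = relative_gain(0.935, 0.785)  # about 0.19
auprc_gain = relative_gain(0.264, 0.192)  # about 0.375, reported as 38%

# HYPOTHETICAL counts: 14,000 positives among 1,000,000 negatives, with both
# scores catching the same 7,000 positives but raising different false alarms.
negatives = 1_000_000
true_pos = 7_000
false_pos_a, false_pos_b = 20_000, 60_000

fpr_a = false_pos_a / negatives                   # 0.02
fpr_b = false_pos_b / negatives                   # 0.06
precision_a = true_pos / (true_pos + false_pos_a) # about 0.26
precision_b = true_pos / (true_pos + false_pos_b) # about 0.10
```

In the hypothetical case, tripling the false alarms moves the false positive rate by only 0.04 on a 0-to-1 axis, while precision drops from roughly 0.26 to 0.10.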

Unfortunately, traditional triage tools are complex scoring methods that require detailed history taking and physical examination (e.g., pain score, evidence of dehydration, pitting edema, and blood sugar test result), as well as judgment based on clinical experience (e.g., expected emergency department resources).[4,7] These tools require considerable time for triage and are of limited use in resource-constrained settings or in circumstances where junior triage providers with limited training and experience practice.[9,33] Numerous studies concluded that dedicating a senior doctor to triage reduced the waiting time for patients to see a doctor, decreased the length of stay (LOS), and lowered the proportion of patients who left without being seen.[33,34] However, this solution is very costly.[35]

On the other hand, DTAS requires only age, sex, chief complaint, time from symptom onset to ED visit, arrival mode, presence of trauma, initial vital signs, and mental status as input parameters. This gives DTAS three strengths. First, outcomes predicted by DTAS have low variation and high reliability because the input parameters are basic information with low inter-physician variation. Second, the input parameters do not require expert judgment and can be collected very quickly, which would be of great value in a resource-constrained ED setting. Third, the DTAS parameters can be checked in a pre-hospital setting, so the DTAS score can be calculated during pre-hospital transport and in out-of-hospital situations. Therefore, DTAS has the potential to make pre-hospital emergency medical service (EMS) and ED processes more efficient. Our next area of research is a prospective study of EMS and ED triage to verify the performance and efficiency of DTAS.

Conclusion

The Deep-learning-based Triage and Acuity Score predicted in-hospital mortality, critical care, and hospitalization more accurately than existing triage and acuity scores, and it was validated using a large, multicenter dataset.

References

  1. Pines JM, Hilton JA, Weber EJ, Alkemade AJ, Al Shabanah H, Anderson PD, et al. International perspectives on emergency department crowding. Acad Emerg Med. 2011;18: 1358–1370. pmid:22168200
  2. Hoot NR, Aronsky D. Systematic Review of Emergency Department Crowding: Causes, Effects, and Solutions. Ann Emerg Med. 2008;52.
  3. Bernstein SL, Aronsky D, Duseja R, Epstein S, Handel D, Hwang U, et al. The effect of emergency department crowding on clinically oriented outcomes. Acad Emerg Med. 2009;16: 1–10. pmid:19007346
  4. Bullard MJ, Musgrave E, Warren D, Unger B, Skeldon T, Grierson R, et al. Revisions to the Canadian Emergency Department Triage and Acuity Scale (CTAS) Guidelines 2016. Can J Emerg Med. 2017;19: S18–S27.
  5. Lee B, Kim DK, Park JD, Kwak YH. Clinical considerations when applying vital signs in pediatric korean triage and acuity scale. J Korean Med Sci. 2017;32: 1702–1707. pmid:28875617
  6. Tanabe P, Gimbel R, Yarnold PR, Kyriacou DN, Adams JG. Reliability and Validity of Scores on the Emergency Severity Index Version 3. Acad Emerg Med. 2004;11: 59–65. pmid:14709429
  7. Christ M, Grossmann F, Winter D, Bingisser R, Platz E. Modern triage in the emergency department. Dtsch Arztebl Int. 2010;107: 892–8. pmid:21246025
  8. Farrohknia N, Castrén M, Ehrenberg A, Lind L, Oredsson S, Jonsson H, et al. Emergency Department Triage Scales and Their Components: A Systematic Review of the Scientific Evidence. Scand J Trauma Resusc Emerg Med. BioMed Central Ltd; 2011;19: 42. pmid:21718476
  9. Welch SJ, Davidson SJ. The performance limits of traditional triage. Ann Emerg Med. Elsevier Inc.; 2011;58: 143–144. pmid:21601312
  10. Burch VC, Tarr G, Morroni C. Modified early warning score predicts the need for hospital admission and inhospital mortality. Emerg Med J. 2008;25: 674–678. pmid:18843068
  11. Subbe CP, Davies RG, Williams E, Rutherford P, Gemmell L. Effect of introducing the Modified Early Warning score on clinical outcomes, cardio-pulmonary arrests and intensive care utilisation in acute medical admissions*. Anaesthesia. 2003;58: 797–802. pmid:12859475
  12. Armagan E, Yilmaz Y, Olmez OF, Simsek G, Gul CB. Predictive value of the modified early warning score in a turkish emergency department. Eur J Emerg Med. 2008;15: 338–340. pmid:19078837
  13. Gottschalk SB, Wood D, Devries S, Wallis LA, Bruijns S. The cape triage score: A new triage system South Africa. Proposal from the cape triage group. Emerg Med J. 2006;23: 149–153. pmid:16439753
  14. Mullan PC, Torrey SB, Chandra A, Caruso N, Kestler A. Reduced overtriage and undertriage with a new triage system in an urban accident and emergency department in Botswana: A cohort study. Emerg Med J. 2014;31: 356–360. pmid:23407375
  15. Levin S, Toerper M, Hamrock E, Hinson JS, Barnes S, Gardner H, et al. Machine-Learning-Based Electronic Triage More Accurately Differentiates Patients With Respect to Clinical Outcomes Compared With the Emergency Severity Index. Ann Emerg Med. American College of Emergency Physicians; 2017; 18–20.
  16. Zhai H, Brady P, Li Q, Lingren T, Ni Y, Wheeler DS, et al. Developing and evaluating a machine learning based algorithm to predict the need of pediatric intensive care unit transfer for newly hospitalized children. Resuscitation. 2014;85: 1065–1071. pmid:24813568
  17. Blomberg SN, Folke F, Lippert F. Machine Learning–A novel approach to increase recognition of out-of-hospital cardiac arrest. Resuscitation. Elsevier Ireland Ltd; 2017;118: e19.
  18. Green M, Lander H, Snyder A, Hudson P, Churpek M, Edelson D. Comparison of the Between the Flags calling criteria to the MEWS, NEWS and the electronic Cardiac Arrest Risk Triage (eCART) score for the identification of deteriorating ward patients. Resuscitation. European Resuscitation Council, American Heart Association, Inc., and International Liaison Committee on Resuscitation. Published by Elsevier Ireland Ltd; 2018;123: 86–91. pmid:29169912
  19. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521: 436–444. pmid:26017442
  20. Son J, Park SJ, Jung K-H. Retinal Vessel Segmentation in Fundoscopic Images with Generative Adversarial Networks. 2017; Available: http://arxiv.org/abs/1706.09318
  21. Kwon J-M, Lee Y, Lee Y, Lee S, Park J. An Algorithm Based on Deep Learning for Predicting In-Hospital Cardiac Arrest. J Am Heart Assoc. 2018;7: e008678. pmid:29945914
  22. Kingma DP, Ba J. Adam: A Method for Stochastic Optimization. 2017 IEEE Int Conf Consum Electron ICCE 2017. 2014; 434–435.
  23. Liu N, Koh ZX, Goh J, Lin Z, Haaland B, Ting BP, et al. Prediction of adverse cardiac events in emergency department patients with chest pain using machine learning for variable selection. BMC Med Inform Decis Mak. 2014;14: 75. pmid:25150702
  24. Subbe CP, Slater A, Menon D, Gemmell L. Validation of physiological scoring systems in the accident and emergency department. Emerg Med J. 2006;23: 841–845. pmid:17057134
  25. Churpek MM, Yuen TC, Winslow C, Meltzer DO, Kattan MW, Edelson DP. Multicenter Comparison of Machine Learning Methods and Conventional Regression for Predicting Clinical Deterioration on the Wards. Crit Care Med. 2016;44: 368–74. pmid:26771782
  26. Mortazavi BJ, Downing NS, Bucholz EM, Dharmarajan K, Manhapra A, Li SX, et al. Analysis of Machine Learning Techniques for Heart Failure Readmissions. Circ Cardiovasc Qual Outcomes. 2016;9: 629–640. pmid:28263938
  27. Shouval R, Hadanny A, Shlomo N, Iakobishvili Z, Unger R, Zahger D, et al. Machine learning for prediction of 30-day mortality after ST elevation myocardial infraction: An Acute Coronary Syndrome Israeli Survey data mining study. Int J Cardiol. Elsevier Ireland Ltd; 2017;246: 7–13. pmid:28867023
  28. Davis J, Goadrich M. The relationship between Precision-Recall and ROC curves. Proc 23rd Int Conf Mach Learn (ICML '06). 2006; 233–240.
  29. Taylor RA, Pare JR, Venkatesh AK, Mowafi H, Melnick ER, Fleischman W, et al. Prediction of In-hospital Mortality in Emergency Department Patients with Sepsis: A Local Big Data-Driven, Machine Learning Approach. Acad Emerg Med. 2016;23: 269–278. pmid:26679719
  30. Ong MEH, Lee Ng CH, Goh K, Liu N, Koh Z, Shahidah N, et al. Prediction of cardiac arrest in critically ill patients presenting to the emergency department using a machine learning score incorporating heart rate variability compared with the modified early warning score. Crit Care. BioMed Central Ltd; 2012;16: R108. pmid:22715923
  31. Wolpert DH. The Supervised Learning No-Free-Lunch Theorems. Proc 6th Online World Conf Soft Comput Ind Appl. 2001; 10–24.
  32. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10: 1–21.
  33. Wiler JL, Gentle C, Halfpenny JM, Heins A, Mehrotra A, Mikhail MG, et al. Optimizing Emergency Department Front-End Operations. Ann Emerg Med. Elsevier Inc.; 2010;55: 142–160.e1. pmid:19556030
  34. Abdulwahid MA, Booth A, Kuczawski M, Mason SM. The impact of senior doctor assessment at triage on emergency department performance measures: Systematic review and meta-analysis of comparative studies. Emerg Med J. 2016;33:504–13. pmid:26183598
  35. Partovi SN, Nelson BK, Bryan ED, Walsh MJ. Faculty triage shortens emergency department length of stay. Acad Emerg Med. 2001;8: 990–5. pmid:11581086