
Challenges in evaluating the accuracy of AI-containing digital triage systems: A systematic review

Abstract

Introduction

Patient-operated digital triage systems with AI components are becoming increasingly common. However, previous reviews have found a limited amount of research on such systems’ accuracy. This systematic review of the literature aimed to identify the main challenges in determining the accuracy of patient-operated digital AI-based triage systems.

Methods

A systematic review was designed and conducted in accordance with PRISMA guidelines in October 2021 using PubMed, Scopus and Web of Science. Articles were included if they assessed the accuracy of a patient-operated digital triage system that had an AI-component and could triage a general primary care population. Limitations and other pertinent data were extracted, synthesized and analysed. Risk of bias was not analysed as this review studied the included articles’ limitations (rather than results). Results were synthesized qualitatively using a thematic analysis.

Results

The search generated 76 articles and, following exclusion, 8 articles (6 primary articles and 2 reviews) were included in the analysis. Articles’ limitations were synthesized into three groups: epistemological, ontological and methodological limitations. Limitations varied with regard to their tractability and the degree to which they can be addressed through methodological choices. Certain methodological limitations related to testing triage systems using vignettes can be addressed through methodological adjustments, whereas epistemological and ontological limitations require that readers of such studies appraise them with these limitations in mind.

Discussion

The reviewed literature highlights recurring limitations and challenges in studying the accuracy of patient-operated digital triage systems with AI components. Some of these challenges can be addressed methodologically, whereas others are intrinsic to the area of inquiry and involve unavoidable trade-offs. Future studies should take these limitations into consideration in order to better address the current knowledge gaps in the literature.

Introduction

In recent years, digital online symptom checkers and patient-facing digital triage tools have become increasingly common. These tools allow patients to enter their symptoms and answer questions, and to receive either possible diagnoses or advice on what level of care may be appropriate [1]. Digital triage solutions often focus on primary care conditions [2], as such conditions are often less urgent, can be triaged to various levels of urgency to optimize queues and resource allocation, and, in contrast to the conditions handled by emergency medicine triage systems, often do not require a physical examination. Artificial intelligence (AI) or machine learning is often described as a potential way to significantly improve various triage systems [3–5].

However, evaluating triage solutions is complex. It is difficult to capture the many important aspects of a triage system (e.g. condition coverage, diagnostic accuracy, patient safety and consequent resource utilization) with one primary outcome [6]. This complexity could explain why there are relatively few comprehensive validations of the predecessors of digital triage solutions, the traditional primary care telephone triage systems [7, 8]. Moreover, triage systems are commonly validated using patient vignettes: short descriptions of clinical cases with a predetermined correct diagnosis and/or level of care. Vignettes are a practical method, but may have limitations when assessing something as complex as triage.
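
A vignette in such a validation study is essentially a structured test case pairing a case description with a predetermined answer. A minimal sketch of how one might be represented in an evaluation harness follows; the fields and the urgency scale are illustrative assumptions, not taken from any of the studies discussed here.

```python
from dataclasses import dataclass
from enum import Enum

class Urgency(Enum):
    """Illustrative urgency levels; real triage scales vary by system and country."""
    EMERGENCY = 1   # e.g. call an ambulance
    URGENT = 2      # e.g. same-day primary care
    ROUTINE = 3     # e.g. non-urgent appointment
    SELF_CARE = 4   # e.g. no clinician contact needed

@dataclass
class Vignette:
    """A clinical case description with a predetermined 'gold standard' answer."""
    case_id: str
    presentation: str       # free-text symptoms and history shown to the system
    gold_diagnosis: str     # predetermined correct diagnosis
    gold_urgency: Urgency   # predetermined appropriate level of care

# Example vignette of the kind used in triage accuracy studies:
flu_case = Vignette(
    case_id="V001",
    presentation="34-year-old with 2 days of fever, sore throat and myalgia.",
    gold_diagnosis="influenza",
    gold_urgency=Urgency.SELF_CARE,
)
```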

Recent studies have attempted to compare different digital triage systems’ accuracy [9, 10]. In general, reviews conclude that studies and data on triage system accuracy remain limited despite increased usage [11]. Moreover, there is limited published research on the specific methodological challenges in studying these types of rapidly developing systems. As digital triage systems are already being implemented in healthcare [12], it is valuable to gain a better understanding of how they work.

Accuracy is a necessary, but not sufficient, criterion for a triage system to be useful. Understanding the potential limitations of assessing triage accuracy with vignettes is important, given the potential mismatch between standardised vignettes and the complex intervention they are meant to assess. A better understanding of the specific challenges in studying digital AI-based triage systems’ accuracy could be useful when designing future studies. This systematic review therefore aims to summarize the current knowledge regarding obstacles to studying digital patient-operated AI-based triage systems’ accuracy in a primary care setting.

Materials and methods

This systematic review was carried out in accordance with PRISMA guidelines (PRISMA checklist available as S1 Checklist) [13]. No predefined or preregistered protocol was used. The research question was: What limitations exist for studying the accuracy of digital patient-operated AI-containing triage systems for primary care? This question was deconstructed using the PICO structure [14] in order to design the search strategy. As studies typically mention methodological limitations in their discussion sections, the PICO was constructed to identify primary studies on the topic, with the goal of later synthesizing the limitations identified across those studies. The deconstructed form of the question was: what are the limitations found when studying [Outcome:] accuracy with regard to appropriate urgency/level of care for [Population:] primary care patients with all types of conditions when [Intervention:] triage assessment is performed by a digital patient-operated triage system in comparison to [Comparison:] regular triage systems utilized by healthcare staff?

Search strategy

A literature search was performed on the 10th of September 2021 in the following databases: PubMed (NCBI), Scopus (Elsevier) and Web of Science (Clarivate). The search string was iteratively designed due to an initial paucity of results, and ultimately incorporated only the population and intervention components of the PICO in order to minimize the risk of missing relevant studies. The following search phrase was used; the MeSH terms used for searching PubMed were adapted for the searches in Scopus and Web of Science.

((triage OR "symptom checker") AND ("artificial intelligence"[MeSH] OR "machine learning"[MeSH] OR "AI" OR "neural network" OR "supervised learning" OR "NLP")) AND ("Primary Health Care"[MeSH] OR "General Practice" OR "GP clinic" OR "primary care")

No language restrictions were applied during the search and retrieval of articles. Databases were searched from inception.
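
For illustration, the PubMed leg of such a search can be reproduced programmatically. The following is a minimal sketch using Biopython's Entrez module, not the procedure actually used in the review; the e-mail address is a placeholder that NCBI requires.

```python
from Bio import Entrez

Entrez.email = "researcher@example.org"  # placeholder contact address for NCBI

search_term = (
    '((triage OR "symptom checker") AND '
    '("artificial intelligence"[MeSH] OR "machine learning"[MeSH] OR "AI" OR '
    '"neural network" OR "supervised learning" OR "NLP")) AND '
    '("Primary Health Care"[MeSH] OR "General Practice" OR "GP clinic" OR "primary care")'
)

# Run the search and parse the result into a dictionary with Count and IdList.
handle = Entrez.esearch(db="pubmed", term=search_term, retmax=200)
record = Entrez.read(handle)
handle.close()

print(f"{record['Count']} PubMed records matched")
print(record["IdList"][:10])  # first PMIDs, e.g. for export to a screening tool
```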

Selection criteria.

Following removal of duplicates, the remaining titles and abstracts were screened by the author. Full articles, abstracts, pre-prints and posters were included, with no restriction on date of publication or study type. Articles were included for full-text review if their abstract described a digital triage or symptom-checking system. During full-text review, articles were excluded from data extraction if they did not include a system able to be used in a primary care setting (i.e. to handle a general population), were limited to triaging only a specific condition or group of conditions, were not patient-operated, did not report accuracy, or if the studied system did not have an AI component. All types of AI were included, regardless of whether the component was patient-facing or not.

To widen the search, the references of retrieved articles were screened for potentially relevant studies (i.e. citation chaining), and these were retrieved from PubMed and added to the results. No automation tools were used. The search and selection process is depicted in the PRISMA flow chart in Fig 1 [13].

Data extraction and synthesis

Data from the included studies were extracted as per Table 1 below:

If data were missing for a variable, this was denoted as “not available”. Retrieved articles were too heterogeneous to be synthesized quantitatively and were instead synthesized qualitatively. Quality assessment tools for assessing potential bias were deemed less useful, as this review assessed the limitations reported in articles rather than the studies’ actual results. Each article was assessed by the author and no automation tool was used. Main limitations were defined as explicit mentions of limitations or challenges relating to a study’s ability to address the question of accuracy. Other limitations (e.g. relating to statistical features or patient recruitment) were not included.

The limitations described in the included studies were synthesized using a qualitative thematic analysis, entailing an iterative grouping of the findings in order to identify subthemes and overarching main themes. All included studies contributed to the thematic synthesis.

Results

The literature search yielded 76 articles, from which 25 duplicates were removed. The abstracts of the remaining 51 articles were screened, and 20 were excluded as they did not meet the inclusion criteria. The remaining 31 articles went through full-text review. Of these, 27 were excluded per the abovementioned criteria, leaving 4 articles. The 27 articles were excluded because they

  • did not address a primary care setting (n = 9) [15–23],
  • did not report accuracy (n = 8) [24–31],
  • studied a system not triaging a broad range of conditions (n = 5) [32–36],
  • did not study a patient-operated system (n = 3) [37–39],
  • were not triage related (n = 2) [40, 41].

The references of all full-text reviewed articles were searched for relevant articles, and through this citation chaining an additional 4 articles were identified. Thus, a total of 8 articles were found to be relevant to the research question and were grouped into 2 categories:

  1. Primary studies on one or several digital triage systems’ accuracy (n = 6) [1, 10, 42–45]
  2. Reviews which include assessments of digital triage systems’ accuracy (n = 2) [11, 12]
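
Purely as a consistency check of the selection flow, the counts reported above can be tallied as follows (all numbers are copied from the text; nothing new is introduced).

```python
# Tallying the selection flow reported above.
retrieved = 76
after_duplicates = retrieved - 25        # 51 abstracts screened
after_screening = after_duplicates - 20  # 31 full-text reviews

exclusion_reasons = {                    # per the bullet list above
    "no primary care setting": 9,
    "accuracy not reported": 8,
    "narrow range of conditions": 5,
    "not patient-operated": 3,
    "not triage related": 2,
}
assert sum(exclusion_reasons.values()) == 27

from_search = after_screening - sum(exclusion_reasons.values())  # 4 articles
total_included = from_search + 4   # plus 4 found via citation chaining
print(total_included)              # 8 (6 primary studies + 2 reviews)
```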

The included primary studies are summarized in Table 2 below.

Table 2. Primary studies reporting studies and limitations on determining digital triage systems’ accuracy.

https://doi.org/10.1371/journal.pone.0279636.t002

The two reviews retrieved in the search process were both systematic reviews, assessing several aspects of digital and online symptom checkers and similar services [12] and of intelligent online triage tools [11]. Both reviews were searched for additional primary studies that fulfilled this study’s inclusion criteria, but none were found. The reviews are summarized briefly in Table 3.

The limitations described in all included studies were then synthesized using a qualitative thematic analysis, which is described in Table 4.

Table 4. Themes of the limitations described by authors in retrieved articles.

https://doi.org/10.1371/journal.pone.0279636.t004

Discussion

The synthesis of the articles identified in this systematic review revealed several themes which are relevant to studying the accuracy of digital triage systems.

Ontological limitations in studying rapidly developing and highly contextual novel technology

Several studies highlighted that there is an intrinsic challenge in studying a rapidly developing field [1, 10, 12]. It is difficult to assess a heterogeneous group of AI-powered digital triage systems: systems differ from each other, new systems arise, and the performance of existing systems can change over time as their software is continuously updated. Assuming that an inductively studied phenomenon will not change over time is a well-known problem of induction, as extensively discussed by e.g. Karl Popper [56]. In this specific case, this entails a limitation in the external validity one can expect when studying rapidly developing technological fields.

The identified studies also highlight that studies assessing the accuracy of triage systems treat primary care contexts homogeneously [11]. This can be an issue as systems and/or vignettes can have a geographical bias [10], e.g. in what conditions are common or in how urgent certain conditions are deemed to be. Different countries have different healthcare systems and often use different triage solutions, as exemplified by the many countries using a domestically developed triage system. The results of a study on a digital triage system in a specific context might therefore not be representative of the system in a different context. Both of these limitations relate to the ontology of a rapidly developing and highly contextual intervention such as digital triage software. They are difficult to mitigate through study design, and should be kept in mind when assessing studies on such interventions.

Epistemological limitations in studying triage accuracy

Some studies discuss the limitations in defining a gold standard for what constitutes appropriate or safe triage [12, 44, 45]. All retrieved studies used a selected group of clinicians’ assessments to define a gold standard. The validity of this can be questioned due to high interrater variability [44, 45], lack of consensus [45], varying methods across studies [45], and the fact that this definition inherently biases the assessment in favor of clinicians [12]. However, it is not clear what alternative method one could use instead. Furthermore, there is a further challenge in that some outcomes might not be predictable in advance [45], and such cases will often be excluded from vignette testing.

First, this highlights that there is no universal consensus on what constitutes appropriate triage, nor on how triage systems’ accuracy should be tested. Triage entails assessing a patient’s medical needs with less information than would be obtained during a consultation. Triage will therefore always involve some level of tradeoff between decreasing resource utilization and increasing the risk of missing pertinent clinical information that might affect the assessment of the patient. Reaching a consensus on appropriate triage outcomes is not possible without an underlying consensus on what level of risk one is willing to accept and what level of resource utilization is optimal. Moreover, as long as there are various opinions on what tradeoff is optimal, it will be difficult to compare studies on different systems in different contexts.
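
To make the two directions of this tradeoff concrete, the following sketch (with invented urgency ratings, where 1 is the most urgent level) separates undertriage, the safety risk, from overtriage, the resource cost.

```python
# A sketch separating the two failure directions of triage: undertriage
# (system assigns lower urgency than the gold standard: a safety risk) and
# overtriage (higher urgency than needed: extra resource use).
def triage_error_rates(system, gold):
    under = sum(s > g for s, g in zip(system, gold))  # less urgent than warranted
    over = sum(s < g for s, g in zip(system, gold))   # more urgent than warranted
    return under / len(gold), over / len(gold)

gold_urgency = [1, 2, 2, 3, 4, 3]     # invented gold-standard levels
system_urgency = [1, 3, 2, 2, 4, 2]   # invented system recommendations
under_rate, over_rate = triage_error_rates(system_urgency, gold_urgency)
print(f"undertriage: {under_rate:.0%}, overtriage: {over_rate:.0%}")
# A stricter (safer) system lowers undertriage at the cost of more overtriage,
# and vice versa; no threshold choice removes the tradeoff itself.
```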

Second, high interrater variability has been observed in other triage studies. Studies on emergency triage have demonstrated that interrater variability can be high when triage scales are applied by clinicians [57, 58], and that triage scales seldom show consistently high reliability [59]. This illustrates a methodological tradeoff when studying triage accuracy: between using simple patient vignettes with one clear diagnosis, for which interrater variability will most likely be lower, and more real-life cases, for which interrater variability will most likely be higher. Both alternatives have limitations, limiting either the external validity or the reliability of the comparator (internal validity).
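
Interrater variability of this kind is conventionally quantified with chance-corrected agreement statistics such as Cohen’s kappa. A minimal sketch using scikit-learn, with invented ratings purely for illustration:

```python
from sklearn.metrics import cohen_kappa_score

# Urgency levels (1-4) assigned by two clinicians to the same ten vignettes.
rater_a = [1, 2, 2, 3, 4, 2, 3, 3, 1, 4]
rater_b = [1, 2, 3, 3, 4, 2, 2, 3, 2, 4]

# Chance-corrected agreement between the two raters.
print(f"Cohen's kappa: {cohen_kappa_score(rater_a, rater_b):.2f}")

# Urgency levels are ordered, so disagreeing by one level is less severe than
# by three; a weighted kappa reflects this.
print(f"Weighted kappa: {cohen_kappa_score(rater_a, rater_b, weights='quadratic'):.2f}")
```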

Methodological limitations in using patient vignettes

All primary studies used vignettes to assess triage accuracy, and several discuss associated methodological limitations, primarily related to external validity. Some of these limitations are addressable through adjusting vignette design, whereas others are more difficult to mitigate. These limitations and potential mitigations are described in Table 5 below.

The possible mitigations described in Table 5 could potentially improve the external validity of future vignette studies on the accuracy of digital triage systems. However, some of the limitations are more difficult to address.

First, vignettes with clear symptoms and diagnoses can differ from real-life cases, which can have more complex and ambiguous presentations. However, vignettes with more complex and ambiguous cases (i.e. with higher external validity) will more likely suffer from higher interrater variability when defining a gold standard. This tradeoff between external and internal validity becomes more complex considering that the vignette case mix should be adjusted for specific healthcare contexts. By adjusting the case mix to better reflect a certain geography or practice, one unavoidably makes it less representative of other geographies or practices. Moreover, if cases are weighted so that e.g. common or dangerous cases are overrepresented, the researchers’ choice of weights can greatly affect the results [44].
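
A small worked example, with invented numbers, of how the choice of case-mix weights alone can move a headline accuracy figure without any change in the system’s per-case performance:

```python
# Invented per-stratum accuracy of one and the same triage system.
per_stratum_accuracy = {
    "common_mild": 0.90,      # share of these vignettes triaged correctly
    "common_urgent": 0.70,
    "rare_dangerous": 0.40,
}

def weighted_accuracy(weights):
    assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights must sum to 1
    return sum(per_stratum_accuracy[k] * w for k, w in weights.items())

# Two defensible weighting schemes, same system, different headline results:
prevalence_weights = {"common_mild": 0.70, "common_urgent": 0.25, "rare_dangerous": 0.05}
safety_weights = {"common_mild": 0.30, "common_urgent": 0.30, "rare_dangerous": 0.40}

print(weighted_accuracy(prevalence_weights))  # 0.825
print(weighted_accuracy(safety_weights))      # 0.64
```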

Second, presenting clinical vignettes to a digital system that recommends a certain triage outcome is a different phenomenon from a human advising a real patient in natural language [11]. Even if a triage system had perfect accuracy, certain things would be lost (e.g. the social interaction between a patient and a clinician) and some things would be gained (e.g. removing the risk that practitioner gender affects the triage assessment [61]).

Finally, as highlighted by some studies, using real patient data instead of constructed vignettes can be challenging [10, 44]. Either one must select cases that do not need a physical examination (limiting which cases one can test the system with), or one includes cases in which a diagnosis was obtained using a physical examination (information which the triage system will not be given). Previous research has analyzed various aspects of vignettes, comparing construct, internal and external validity, as well as the strengths and weaknesses of using clinical vignettes [62]. Unfortunately, the recommendations given regarding vignette content do not address the challenges discussed above.

Limitations due to conflicts of interests

Five of the 6 identified primary studies had authors who declared a conflict of interest [10, 42–45]. Industry-sponsored device studies carry a risk of bias [63], and tend to report more positive efficacy results and more favorable conclusions than non-industry-sponsored studies. Two of the primary studies are preprints that have not since been peer-reviewed and published [42, 44]. Certain methodological weaknesses in one of the preprints have since been raised [64, 65].

Implications for practice and future research

This review identified certain themes of limitations which impact the ability to assess the accuracy of digital AI-containing patient-facing triage systems. The identified limitations, as well as possible mitigations, are summarized in Table 6 below.

Table 6. Possible mitigations based on synthesis of identified limitations.

https://doi.org/10.1371/journal.pone.0279636.t006

The ontological limitations are not addressed by the recommendations above, but the mitigations align with clearer reporting of important health informatics principles, which can be achieved by e.g. adhering to the STARE-HI guidelines [66].

Fraser et al have published a suggested guideline with five sequential steps for evaluating symptom checkers [65]. Some of their recommendations overlap with those above, e.g. that vignettes can have higher external validity by including common and dangerous conditions. Several of the methodologies they recommend, including observational trials and RCTs, would overcome the limitations of using vignettes. However, other researchers who also emphasize the need for RCTs argue that they should only be undertaken once the software is stable (so that future changes will only be minor) [67], which is challenging for software that is continuously being developed. Other recommendations by Fraser et al, such as routine and random auditing of cases once a system is in use, may partially address this, but also face challenges given the heterogeneity of contexts and systems.

In summary, this systematic review of studies on the accuracy of digital triage systems uncovers several methodological improvements which future researchers could consider, as well as epistemological and ontological limitations which challenge what knowledge can be obtained regarding such systems using such methodologies. This does not mean that studies on triage systems should not be performed, but rather that more studies are needed, and that decision-makers and clinicians should be aware of non-methodological limitations when assessing this literature.

Limitations and strengths

This review, somewhat self-referentially, has numerous limitations which are important to keep in mind when interpreting the results. First, despite a broad search strategy with few restrictions and citation chaining, a limited number of articles were found. Pertinent studies may have been missed in databases not searched, in grey literature or in other languages. Similarly, this is a rapidly developing field, and it is probable that new studies will have emerged by the time this is read, and that these could change this review’s results. Second, the review was not preregistered and only one author, with a conflict of interest, assessed the articles. Third, the study’s research question was limited to digital systems containing AI components. The review therefore excluded studies of systems not containing AI, which may indirectly have been relevant to the research question. Fourth, this systematic review can describe the limitations mentioned in retrieved studies, but does not address limitations identified by end-users and clinicians who implement such systems. Including that perspective would be useful for creating recommendations for how to better design such studies. Fifth, the limitations identified in this review, and potential methods of mitigating them, were compared to existing literature on e.g. the reporting of health informatics studies. As no systematic review was performed on methods to address the epistemological, ontological or methodological challenges, there are most likely other studies and frameworks, not identified in this study, which could be useful in mitigating the limitations.

This review also has several strengths. First, the other recent reviews did not include any primary study not identified in this search, which implies that the search strategy is unlikely to have missed significant amounts of relevant literature. Second, to the best of the author’s knowledge, this is the first comprehensive study of the challenges in studying the nascent field of digital AI-containing triage systems, and identifying these challenges may assist future researchers in decisions regarding study design. Finally, certain limitations which are difficult for researchers to address are highlighted, so that clinicians critically appraising the literature can better understand and assess such studies.

Supporting information

References

  1. Semigran HL, Linder JA, Gidengil C, Mehrotra A. Evaluation of symptom checkers for self diagnosis and triage: audit study. BMJ. 2015;351:h3480. pmid:26157077
  2. Verzantvoort NCM, Teunis T, Verheij TJM, van der Velden AW. Self-triage for acute primary care via a smartphone application: Practical, safe and efficient? PLoS One. 2018;13(6):e0199284. pmid:29944708
  3. Berlyand Y, Raja AS, Dorner SC, Prabhakar AM, Sonis JD, Gottumukkala RV, et al. How artificial intelligence could transform emergency department operations. Am J Emerg Med. 2018;36(8):1515–7. pmid:29321109
  4. Weisberg EM, Chu LC, Fishman EK. The first use of artificial intelligence (AI) in the ER: triage not diagnosis. Emerg Radiol. 2020;27(4):361–6. pmid:32643069
  5. Levin S, Toerper M, Hamrock E, Hinson JS, Barnes S, Gardner H, et al. Machine-Learning-Based Electronic Triage More Accurately Differentiates Patients With Respect to Clinical Outcomes Compared With the Emergency Severity Index. Ann Emerg Med. 2018;71(5):565–74 e2. https://doi.org/10.1016/j.annemergmed.2017.08.005
  6. van Ierland Y, van Veen M, Huibers L, Giesen P, Moll HA. Validity of telephone and physical triage in emergency care: the Netherlands Triage System. Fam Pract. 2011;28(3):334–41. pmid:21106645
  7. Campbell JL, Fletcher E, Britten N, Green C, Holt TA, Lattimer V, et al. Telephone triage for management of same-day consultation requests in general practice (the ESTEEM trial): a cluster-randomised controlled trial and cost-consequence analysis. Lancet. 2014;384(9957):1859–68. pmid:25098487
  8. Lake R, Georgiou A, Li J, Li L, Byrne M, Robinson M, et al. The quality, safety and governance of telephone triage and advice services—an overview of evidence from systematic reviews. BMC Health Serv Res. 2017;17(1):614. pmid:28854916
  9. Ceney A, Tolond S, Glowinski A, Marks B, Swift S, Palser T. Accuracy of online symptom checkers and the potential impact on service utilisation. PLoS One. 2021;16(7):e0254088. pmid:34265845
  10. Gilbert S, Mehl A, Baluch A, Cawley C, Challiner J, Fraser H, et al. How accurate are digital symptom assessment apps for suggesting conditions and urgency advice? A clinical vignettes comparison to GPs. BMJ Open. 2020;10(12):e040269. pmid:33328258
  11. Gottliebsen K, Petersson G. Limited evidence of benefits of patient operated intelligent primary care triage tools: findings of a literature review. BMJ Health Care Inform. 2020;27(1). pmid:32385041
  12. Chambers D, Cantrell AJ, Johnson M, Preston L, Baxter SK, Booth A, et al. Digital and online symptom checkers and health assessment/triage services for urgent health problems: systematic review. BMJ Open. 2019;9(8):e027743. pmid:31375610
  13. Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71. pmid:33782057
  14. Schardt C, Adams MB, Owens T, Keitz S, Fontelo P. Utilization of the PICO framework to improve searching PubMed for clinical questions. BMC Med Inform Decis Mak. 2007;7:16. pmid:17573961
  15. Soenksen LR, Kassis T, Conover ST, Marti-Fuster B, Birkenfeld JS, Tucker-Schwartz J, et al. Using deep learning for dermatologist-level detection of suspicious pigmented skin lesions from wide-field images. Sci Transl Med. 2021;13(581). pmid:33597262
  16. Hendrix N, Hauber B, Lee CI, Bansal A, Veenstra DL. Artificial intelligence in breast cancer screening: primary care provider preferences. J Am Med Inform Assoc. 2021;28(6):1117–24. pmid:33367670
  17. Chen CH, Hsieh JG, Cheng SL, Lin YL, Lin PH, Jeng JH. Emergency department disposition prediction using a deep neural network with integrated clinical narratives and structured data. Int J Med Inform. 2020;139:104146. pmid:32387818
  18. Grant K, McParland A, Mehta S, Ackery AD. Artificial Intelligence in Emergency Medicine: Surmountable Barriers With Revolutionary Potential. Ann Emerg Med. 2020;75(6):721–6. pmid:32093974
  19. Lee SY, Chinnam RB, Dalkiran E, Krupp S, Nauss M. Prediction of emergency department patient disposition decision for proactive resource allocation for admission. Health Care Manag Sci. 2020;23(3):339–59. pmid:31444660
  20. Hong WS, Haimovich AD, Taylor RA. Predicting hospital admission at emergency department triage using machine learning. PLoS One. 2018;13(7):e0201016. pmid:30028888
  21. Arnold DH, Gebretsadik T, Moons KG, Harrell FE, Hartert TV. Development and internal validation of a pediatric acute asthma prediction rule for hospitalization. J Allergy Clin Immunol Pract. 2015;3(2):228–35. pmid:25609324
  22. Ferri P, Saez C, Felix-De Castro A, Juan-Albarracin J, Blanes-Selva V, Sanchez-Cuesta P, et al. Deep ensemble multitask classification of emergency medical call incidents combining multimodal data improves emergency medical dispatch. Artif Intell Med. 2021;117:102088. pmid:34127234
  23. Hastings SN, Schmader KE, Sloane RJ, Weinberger M, Goldberg KC, Oddone EZ. Adverse health outcomes after discharge from the emergency department—incidence and risk factors in a veteran population. J Gen Intern Med. 2007;22(11):1527–31. pmid:17828432
  24. Lin SY, Mahoney MR, Sinsky CA. Ten Ways Artificial Intelligence Will Transform Primary Care. J Gen Intern Med. 2019;34(8):1626–30. pmid:31090027
  25. Uohara MY, Weinstein JN, Rhew DC. The Essential Role of Technology in the Public Health Battle Against COVID-19. Popul Health Manag. 2020;23(5):361–7. pmid:32857014
  26. Kong L. A study on the AI-based online triage model for hospitals in sustainable smart city. Future Generation Computer Systems. 2021;125:59–70. https://doi.org/10.1016/j.future.2021.06.023
  27. Anmella G, Prime-Tous M, Segu X, Solanes A, Ruiz V, Martin-Villalba I, et al. PRimary carE digital Support ToOl in mental health (PRESTO): Design, development and study protocols. Rev Psiquiatr Salud Ment. 2021. pmid:33933665
  28. Miller S, Gilbert S, Virani V, Wicks P. Patients’ Utilization and Perception of an Artificial Intelligence-Based Symptom Assessment and Advice Technology in a British Primary Care Waiting Room: Exploratory Pilot Study. JMIR Hum Factors. 2020;7(3):e19713. pmid:32540836
  29. D’Hollosy WON, Van Velsen L, Soer R, Hermens H. Design of a web-based clinical decision support system for guiding patients with low back pain to the best next step in primary healthcare. 9th International Conference on Health Informatics, HEALTHINF 2016—Part of 9th International Joint Conference on Biomedical Engineering Systems and Technologies, BIOSTEC 2016; 21–23 February 2016; Rome; 2016. p. 229–39.
  30. Tsai CH, You Y, Gui X, Kou Y, Carroll JM. Exploring and promoting diagnostic transparency and explainability in online symptom checkers. Conference on Human Factors in Computing Systems: Making Waves, Combining Strengths, CHI 2021; 8–13 May 2021; Virtual, Online; 2021.
  31. Milne-Ives M, Swancutt D, Burns L, Pinkney J, Tarrant M, Calitri R, et al. The Effectiveness and Usability of Online, Group-Based Interventions for People With Severe Obesity: Protocol for a Systematic Review. JMIR Res Protoc. 2021;10(6):e26619. pmid:34255710
  32. Gupta R, Krishnam SP, Schaefer PW, Lev MH, Gilberto Gonzalez R. An East Coast Perspective on Artificial Intelligence and Machine Learning: Part 1: Hemorrhagic Stroke Imaging and Triage. Neuroimaging Clin N Am. 2020;30(4):459–66. pmid:33038996
  33. Cronin RM, Fabbri D, Denny JC, Rosenbloom ST, Jackson GP. A comparison of rule-based and machine learning approaches for classifying patient portal messages. Int J Med Inform. 2017;105:110–20. pmid:28750904
  34. Papachristou I, Bosanquet N. Improving the prevention and diagnosis of melanoma on a national scale: A comparative study of performance in the United Kingdom and Australia. J Public Health Policy. 2020;41(1):28–38. pmid:31477796
  35. Ferrante di Ruffano L, Takwoingi Y, Dinnes J, Chuchu N, Bayliss SE, Davenport C, et al. Computer-assisted diagnosis techniques (dermoscopy and spectroscopy-based) for diagnosing skin cancer in adults. Cochrane Database Syst Rev. 2018;12:CD013186. pmid:30521691
  36. Jones OT, Ranmuthu CKI, Hall PN, Funston G, Walter FM. Recognising Skin Cancer in Primary Care. Adv Ther. 2020;37(1):603–16. pmid:31734824
  37. Spasic I, Button K. Patient Triage by Topic Modeling of Referral Letters: Feasibility Study. JMIR Med Inform. 2020;8(11):e21252. pmid:33155985
  38. Ayling RM, Wong A, Cotter F. Use of ColonFlag score for prioritisation of endoscopy in colorectal cancer. BMJ Open Gastroenterol. 2021;8(1). pmid:34083226
  39. North F, Varkey P, Bartel GA, Cox DL, Jensen PL, Stroebel RJ. Can an office practice telephonic response meet the needs of a pandemic? Telemed J E Health. 2010;16(10):1012–6. pmid:21058892
  40. Livingstone D, Chau J. Otoscopic diagnosis using computer vision: An automated machine learning approach. Laryngoscope. 2020;130(6):1408–13. pmid:31532858
  41. Coiera E. The Price of Artificial Intelligence. Yearb Med Inform. 2019;28(1):14–5. pmid:31022746
  42. Middleton K, Butt M, Hammerla N, Hamblin S, Mehta K, Parsa A. Sorting out symptoms: design and evaluation of the ’babylon check’ automated triage system. 2016.
  43. Ghosh S, Bhatia S, Abhi B. Quro: facilitating user symptom check using a personalised Chatbot-Oriented dialogue system. Stud Health Technol Inform. 2018;252:51–6. pmid:30040682
  44. Razzaki S, Baker A, Perov Y, Middleton K, Baxter J, Mullarkey D, et al. A comparative study of artificial intelligence and human doctors for the purpose of triage and diagnosis. London: Babylon; 2018.
  45. Entezarjou A, Bonamy AE, Benjaminsson S, Herman P, Midlov P. Human- Versus Machine Learning-Based Triage Using Digitalized Patient Histories in Primary Care: Comparative Study. JMIR Med Inform. 2020;8(9):e18930. pmid:32880578
  46. Services EI. Isabel Symptom Checker | EBSCO: EBSCO; 2021 [cited 2021-11-10]. Available from: https://www.ebsco.com/health-care/products/isabel-symptom-checker.
  47. Symptify. How it works 2021 [cited 2021-11-10]. Available from: https://symptify.com/how.
  48. Infermedica. Infermedica API 2021 [cited 2021-11-10]. Available from: https://infermedica.com/product/infermedica-api.
  49. Zagorecki A, Orzechowski P, Holownia K. A system for automated general medical diagnosis using Bayesian networks. Stud Health Technol Inform. 2013;192:461–5. pmid:23920597
  50. Rehm G, Bourgonje P, Hegele S, Kintzel F, Moreno Schneider J, Ostendorff M, et al. QURATOR: Innovative Technologies for Content and Data Curation. 2020.
  51. Zimmer V. ada/inside: Digital Health Connect; 2018 [cited 2021-11-10]. Available from: https://www.digitalhealthconnect.ch/wp-content/uploads/2018/06/AdaHealth-Vincent-Zimmer_DHC18.pdf.
  52. Cirkovic A. Evaluation of Four Artificial Intelligence-Assisted Self-Diagnosis Apps on Three Diagnoses: Two-Year Follow-Up Study. J Med Internet Res. 2020;22(12):e18097. pmid:33275113
  53. Koren G, Souroujon D, Shaul R, Bloch A, Leventhal A, Lockett J, et al. "A patient like me"—An algorithm-based program to inform patients on the likely conditions people with symptoms like theirs have. Medicine (Baltimore). 2019;98(42):e17596. pmid:31626135
  54. Moreno Barriga E, Pueyo Ferrer I, Sanchez Sanchez M, Martin Baranera M, Masip Utset J. [A new artificial intelligence tool for assessing symptoms in patients seeking emergency department care: the Mediktor application]. Emergencias. 2017;29(6):391–6.
  55. Healthily. Healthily Explainability Statement 2021 [cited 2021-11-10]. Available from: https://assets.ctfassets.net/iqo3fk8od6t9/4Sy7OZIAdH65Kl2OmkAG9a/7e7e18ef63e464936b08f5c6cfc3fda7/FINAL_Short_Form_Explainability_Statement__-_17_Sep_2021.pdf.
  56. Duignan B. Problem of induction. Encyclopedia Britannica; 2013 [cited 2022-03-06].
  57. Mistry B, Stewart De Ramirez S, Kelen G, Schmitz PSK, Balhara KS, Levin S, et al. Accuracy and Reliability of Emergency Department Triage Using the Emergency Severity Index: An International Multicenter Assessment. Ann Emerg Med. 2018;71(5):581–7 e3. https://doi.org/10.1016/j.annemergmed.2017.09.036
  58. Creaton A, Liew D, Knott J, Wright M. Interrater reliability of the Australasian Triage Scale for mental health patients. Emerg Med Australas. 2008;20(6):468–74. pmid:19125824
  59. Hinson JS, Martinez DA, Cabral S, George K, Whalen M, Hansoti B, et al. Triage Performance in Emergency Medicine: A Systematic Review. Ann Emerg Med. 2019;74(1):140–52. pmid:30470513
  60. Jungmann SM, Klan T, Kuhn S, Jungmann F. Accuracy of a Chatbot (Ada) in the Diagnosis of Mental Disorders: Comparative Case Study With Lay and Expert Users. JMIR Form Res. 2019;3(4):e13863. pmid:31663858
  61. Vigil JM, Coulombe P, Alcock J, Stith SS, Kruger E, Cichowski S. How nurse gender influences patient priority assignments in US emergency departments. Pain. 2017;158(3):377–82. pmid:28187101
  62. Evans SC, Roberts MC, Keeley JW, Blossom JB, Amaro CM, Garcia AM, et al. Vignette methodologies for studying clinicians’ decision-making: Validity, utility, and application in ICD-11 field studies. Int J Clin Health Psychol. 2015;15(2):160–70. pmid:30487833
  63. Lundh A, Lexchin J, Mintzes B, Schroll JB, Bero L. Industry sponsorship and research outcome. Cochrane Database Syst Rev. 2017;2:MR000033. pmid:28207928
  64. Coiera E. Paper Review: the Babylon Chatbot [Web page]. The Guide to Health Informatics 3rd Edition; 2018 [cited 2021-10-12]. Available from: https://coiera.com/2018/06/29/paper-review-the-babylon-chatbot/.
  65. Fraser H, Coiera E, Wong D. Safety of patient-facing digital symptom checkers. Lancet. 2018;392(10161):2263–4. pmid:30413281
  66. Talmon J, Ammenwerth E, Brender J, de Keizer N, Nykanen P, Rigby M. STARE-HI—Statement on reporting of evaluation studies in Health Informatics. Int J Med Inform. 2009;78(1):1–9. pmid:18930696
  67. Murray E, Hekler EB, Andersson G, Collins LM, Doherty A, Hollis C, et al. Evaluating Digital Health Interventions: Key Questions and Approaches. Am J Prev Med. 2016;51(5):843–51. pmid:27745684