
Analyzing inter-reader variability affecting deep ensemble learning for COVID-19 detection in chest radiographs

  • Sivaramakrishnan Rajaraman,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    sivaramakrishnan.rajaraman@nih.gov

    Affiliation Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, Maryland, United States of America

  • Sudhir Sornapudi,

    Roles Investigation, Methodology, Software, Visualization, Writing – original draft

    Affiliation Department of Electrical and Computer Engineering, Missouri University of Science and Technology, Rolla, Missouri, United States of America

  • Philip O. Alderson,

    Roles Data curation, Supervision, Writing – review & editing

    Affiliation School of Medicine, Saint Louis University, St. Louis, Missouri, United States of America

  • Les R. Folio,

    Roles Data curation, Supervision, Writing – review & editing

    Affiliation Radiology and Imaging Sciences, Clinical Center, National Institutes of Health, Bethesda, Maryland, United States of America

  • Sameer K. Antani

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Writing – review & editing

    Affiliation Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, Maryland, United States of America

Abstract

Data-driven deep learning (DL) methods using convolutional neural networks (CNNs) demonstrate promising performance in natural image computer vision tasks. However, their use in medical computer vision tasks faces several limitations, viz., (i) adapting to visual characteristics that are unlike natural images; (ii) modeling random noise during training due to stochastic optimization and backpropagation-based learning strategy; (iii) challenges in explaining DL black-box behavior to support clinical decision-making; and (iv) inter-reader variability in the ground truth (GT) annotations affecting learning and evaluation. This study proposes a systematic approach to address these limitations by applying it to the urgent, pandemic-driven need for Coronavirus disease 2019 (COVID-19) detection using chest X-rays (CXRs). Specifically, our contribution highlights significant benefits obtained through (i) pretraining specific to CXRs in transferring and fine-tuning the learned knowledge toward improving COVID-19 detection performance; (ii) using ensembles of the fine-tuned models to further improve performance over individual constituent models; (iii) performing statistical analyses at various learning stages for validating results; (iv) interpreting learned individual and ensemble model behavior through class-selective relevance mapping (CRM)-based region of interest (ROI) localization; and, (v) analyzing inter-reader variability and ensemble localization performance using Simultaneous Truth and Performance Level Estimation (STAPLE) methods. We find that ensemble approaches markedly improved classification and localization performance, and that inter-reader variability and performance level assessment help guide algorithm design and parameter optimization. To the best of our knowledge, this is the first study to construct ensembles, perform ensemble-based disease ROI localization, and analyze inter-reader variability and algorithm performance for COVID-19 detection in CXRs.

Introduction

Coronavirus disease 2019 (COVID-19) is caused by the new Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) that originated in Wuhan, China. The World Health Organization (WHO) has declared the spread of this disease an ongoing pandemic [1]. As of July 2020, the pandemic has resulted in over 14 million cases and more than 600,000 deaths worldwide. The disease commonly infects the lungs and results in pneumonia-like symptoms [2]. Reverse transcription-polymerase chain reaction (RT-PCR) analysis is the gold standard to confirm infections. However, these tests are reported to exhibit varying sensitivity and are not widely available [2]. Radiological imaging using chest X-rays (CXRs) and computed tomography (CT) scans, though not currently recommended in the United States, is commonly used as a diagnostic support aid to manage COVID-19 disease progression [2]. While CT scans are more sensitive in detecting pulmonary disease manifestations than CXRs, their use is limited due to issues such as non-portability, repeated sanitation requirements for CT examination rooms and equipment, and exposure of patients, hospital staff, and technical personnel to the infection. According to the American College of Radiology (ACR) recommendations [3], CXRs are considered a viable alternative to CT scans in addressing some of these limitations. However, the pandemic nature of the disease has compounded the existing shortage of expert radiologists, particularly in third-world countries [4]. Under these circumstances, artificial intelligence (AI)-driven computer-aided diagnostic (CADx) tools have been considered as potentially viable alternatives for facilitating swift patient referrals or aiding appropriate medical care [5]. Several studies using data-driven deep learning (DL) algorithms with convolutional neural network (CNN) models in various strategies have been published for detecting, localizing, or measuring progression of COVID-19 using CXRs and CTs [4, 6, 7]. While there are scores of medical imaging CADx solutions that use DL approaches for disease detection, including COVID-19, there are significant limitations in existing approaches related to data set type, size, scope, model architecture, and evaluation. We address these concerns and propose novel analyses to meet the urgent demand for COVID-19 detection using CXRs.

Modality-specific transfer learning and ensemble learning

Existing solutions tend to be disease-specific and require retraining on a large collection of expert-annotated data to ensure use in real-world applications. Generalization of these approaches is challenged by the availability of expert annotations, their strength (i.e., weak image-level labels versus strong region-of-interest (ROI) annotations localizing the pathology), and the necessary computation resources. Under these circumstances, transfer learning strategies are commonly adopted [8] where the models are trained on a large-scale selection of stock photographic images like ImageNet [9] and then fine-tuned for the specific task. A problem with this approach is that the architecture and hyperparameters of these pre-trained models are optimized for natural image computer vision applications. In contrast, medical image collections bearing the desired pathology are significantly smaller in number. Therefore, using these models for medical visual analyses often results in a covariate shift and generalization issues due to the difference in source and target image modalities. Medical images are distinct in their characteristics, such as highly localized disease ROIs and varying appearances for the same disease label and severity [10]. Under these circumstances, the transferred knowledge from the natural image processing domain may not be optimal for disease localization. We propose training DL models with suitable depth on a large-scale selection of medical images of the same modality to learn relevant feature representations that can be transferred and fine-tuned for related medical visual recognition tasks. Such medical modality-specific transfer learning could improve DL performance and generalization by learning the common characteristics of the source and target modalities. This could lead to a better initialization of model parameters and faster convergence, thereby reducing computational demand, improving efficiency, and increasing the opportunity for successful deployment.

Data-driven DL models use non-linear methods and learn through stochastic error backpropagation to perform automated feature extraction and classification. These models scale up in performance by increasing the amount of training data and computational resources. However, their sensitivity to the specifics of the training data limits their generalization, since a different set of weights is learned at each instance of training. This stochastic learning nature results in different predictions, referred to as the variance error. There are also bias errors, where an oversimplified model produces predictions that deviate from the ground truth (GT), placing a higher demand on appropriate threshold selection to obtain the desired performance. Ensemble learning methods, including majority voting, averaging, weighted averaging, stacking, and blending, seek to address these issues by combining the predictions of multiple models, resulting in better performance than any individual constituent model [11].

ROI localization, observer variability, and statistical analysis

Data-driven medical DL models have often been maligned for their “black box” behavior, i.e., inability to make clear their decision-making process. This is often due to their massive architectural depth resulting in a large number of model parameters and lack of decomposability into individual explainable components. Further, multiple non-linear processing units perform complex data transformations that can result in unpredictable behavior. This results in an apparent opaque relationship between input and predictions which is a serious bottleneck in their use in deriving understandable clinical interpretations.

Supervised learning requires a consistent label associated with the appearance of the pathology in the image. However, in medical images, these labels can vary not only with disease stage and shared appearance with other diseases but also with observer expertise and sensitivity to assessment demands. A new pandemic, for example, may bias experts toward higher sensitivity, i.e., they will tend to associate non-specific features with the new disorder because they lack experience with the relevant disease manifestation in the image [13]. Therefore, an assessment of observer variability, including analyzing (i) inter-reader and (ii) intra-reader variability, constitutes an essential part of AI-based classification and localization studies. It is reported that inter-reader variability tends to be higher than intra-reader variability because multiple observers may have different opinions on outlining the disease-specific ROI depending on their expertise or personal leanings toward recommending necessary clinical care [12]. Thus, inter-reader variability is a major obstacle that may lead to misinterpretation through “inexact” ROI annotations and also affects supervised learning. Not only can this lead to a false diagnosis or an inability to evaluate the true benefit of accurately supplementing clinical decision-making, but it also places a greater burden on the number of training images needed to overcome these implicit biases. Thus, it is imperative to conduct inter-reader variability analysis as part of evaluating AI performance. An obvious approach to overcome this challenge might be to compare a collection of annotations by several radiologists using relevant clinical data. However, quantifying expert performance in annotating disease-specific ROIs is difficult. This persistent challenge exists because of the difficulty in obtaining or estimating a known true ROI for the task under study. While there exist automated tools to manage inter- and intra-reader variability, these algorithms need to be assessed to warrant their suitability for the task under study. Additionally, it is imperative to determine an appropriate measure for comparing individual expert annotations with each other and with the AI [13].

Results and methods in a study need to be transparently reported to accurately communicate scientific discovery. Statistical analyses are critical for measuring inherent data variability and their impact on AI performance. They help in evaluating claims and differentiating reasonable and uncertain conclusions. Statistical reporting helps to alleviate issues resulting from incorrect data mining, biased samples, overgeneralization, causality, and violating the assumptions concerning analysis. However, a study of the literature reveals that scientific publications are often limited in presenting statistical analyses of their results [14].

In this study, we address the aforementioned limitations through a stage-wise systematic approach, as follows: (i) we explore the benefits of CXR modality-specific pretraining that results in learning CXR modality-specific knowledge, which can be transferred and fine-tuned to improve performance toward COVID-19 detection in CXRs; (ii) we compare the utility of several ImageNet-pretrained CNN models truncated at their empirically determined intermediate layers to that of out-of-the-box ImageNet-pretrained CNNs toward the current task; (iii) we use ensembles of fine-tuned models for COVID-19 detection that are created through various strategies to improve performance compared to any individual constituent model; (iv) we explain learned behavior of individual CNNs and their ensembles using class-selective relevance mapping (CRM)-based localization [15] tools that identify discriminative ROIs involved in detecting COVID-19 viral disease manifestations; (v) we perform ensemble localization to improve localization behavior and compensate for the error due to neglected ROIs by individual CNNs; (vi) we perform exploratory studies to analyze variability in model localization using annotations of two expert radiologists; (vii) we measure statistical significance in performance metrics including Intersection over Union (IoU) and mean average precision (mAP); and, (viii) we perform inter-reader variability analysis using Simultaneous Truth and Performance Level Estimation (STAPLE) [13] that generates a reference consensus annotation from the set of radiologists’ annotations. This is compared with individual radiologist annotations and the predicted disease ROI by model ensembles to provide a measure of inter-reader variability and algorithm performance. To the best of our knowledge, this is the first study to construct ensembles, perform ensemble-based disease ROI localization, and evaluate inter-reader variability and algorithm performance toward COVID-19 detection in CXRs.

Related works

CXR modality-specific transfer learning and ensemble learning

Yadav et al. [16] demonstrated the benefits of transferring knowledge learned from training on a large-scale selection of CXR images and repurposing them toward tuberculosis (TB) detection. They constructed model ensembles and compared their performance with individual models toward classifying CXRs as showing normal lungs or TB-like manifestations. Rajaraman & Antani [17] proposed CXR modality-specific knowledge transfer by retraining the ImageNet-pretrained CNN models on a large-scale selection of CXRs collected from various institutions. This helped improve the generalization of the learned knowledge, which was transferred and fine-tuned to detect TB disease-like manifestations in CXRs. The authors performed ensemble learning using the best-performing CNNs to demonstrate better performance in classifying CXRs as belonging to normal or TB-infected classes. At present, the literature on CXR analysis benefiting from modality-specific knowledge transfer, particularly as applied to detecting COVID-19 viral disease manifestations, is limited. This leaves room for progress toward evaluating the efficacy of these methods in improving the performance toward COVID-19 detection. Lakhani & Sundaram [18] used model ensembles to classify CXRs as showing normal lungs or TB-like radiological manifestations. It was observed that an ensemble of a custom CNN and ImageNet-pretrained models delivered superior classification performance with an AUC of 0.99. Rajaraman et al. [19] evaluated the efficacy of a stacked model ensemble constructed from hand-crafted features/classifiers and DL models toward TB detection in CXRs. CXRs collected from various institutions were used to improve the generalization of the proposed approach. It was observed that the model ensembles delivered better performance than individual constituent models in all performance metrics. Ensemble learning has been applied to detect cardiomegaly in CXRs [20]. The authors observed that DL model ensembles were 92% accurate as compared to 76.5% accuracy obtained with hand-crafted features/classifiers. These results demonstrate the superiority of ensemble learning over the traditional approach of evaluating the performance with stand-alone models. Applied to COVID-19 detection in CXRs, Rajaraman et al. [5] iteratively pruned the DL models and constructed ensembles to improve performance compared to individual constituent models. To this end, the authors observed that the weighted average of iteratively pruned models demonstrated superior classification performance with a 99.01% accuracy and AUC of 0.9972. Otherwise, the literature available on applying ensemble learning toward COVID-19 detection in chest radiographs is limited.

ROI localization, observer variability, and statistical analysis

Exploratory studies in developing explainable and transparent AI toward clinical decision-making are crucial to building robust solutions for clinical use. Literature studies reveal several works interpreting the learned behavior of DL models by highlighting, with varying intensities, the pixels that impact prediction scores. Zeiler & Fergus [21] used deconvolution methods to modify the gradients that resulted in qualitatively improving ROI localization. Dosovitskiy & Brox [22] inverted image representations using up-convolutional networks to provide insights into learned feature representations. Zhou et al. [23] generated class-activation maps (CAM) by mapping the prediction class scores back to the deepest convolutional layer. Selvaraju et al. [24] generalized the use of CAM tools and proposed gradient-weighted CAM (Grad-CAM) methods that can be applied to CNNs with varying architecture. Kim et al. [15] proposed a class-selective relevance mapping (CRM) algorithm to visualize discriminative ROIs in classifying medical image modalities. The authors measured both positive and negative contributions of the feature map spatial elements in the deepest convolutional layer of the trained models toward making class-specific predictions. It was observed that CRM methods delivered superior localization toward classifying medical imaging modalities compared to CAM-based methods. Applied to the task of localizing COVID-19 viral disease manifestations in CXRs and CT scans, Li et al. [7] proposed a DL model called COVNet that learned the underlying feature representations from volumetric CT scans. It was observed that the model showed better performance with an AUC of 0.96 in detecting COVID-19 viral disease patterns and differentiating them from other non-COVID-19 pneumonia-related opacities. They used CAM-based visualization tools to localize the suspicious ROIs toward detecting COVID-19 viral disease manifestations. Karim et al. [25] proposed a custom DL model and used Grad-CAM tools to explain their predictions toward COVID-19 detection. The model achieved a sensitivity of 83% in detecting COVID-19 disease patterns in CXRs. Rajaraman & Antani [6] proposed a weakly-labeled data augmentation approach to increase training data size for recognizing COVID-19 viral-related pneumonia opacities in CXRs. They used a strategic approach to train various DL models with non-augmented and weakly labeled augmented training data and evaluated their performance. It was observed that the simple addition of CXRs showing COVID-19 viral disease manifestations to weakly labeled augmented training data improved performance. This study revealed that COVID-19 viral disease patterns have a uniquely different presentation compared to non-COVID-19 viral pneumonia-related opacities. The authors used Grad-CAM tools to study the behavior of models trained with non-augmented and augmented data toward localizing COVID-19 viral disease manifestations in CXRs. Otherwise, the literature is limited concerning the use of visualization tools toward COVID-19 detection in CXRs. Applied to CXR analysis, Balabanova et al. [26] performed an observational study among Russian clinicians analyzing the variability in interpreting abnormalities in CXRs. Agreement was analyzed on different scales using the Kappa statistic for a set of 50 CXRs. It was observed that there existed only fair agreement in detecting and localizing abnormalities, with Kappa values of 0.380 and 0.448, respectively. This demonstrated that limited agreement on interpreting abnormalities resulted in sub-optimal population screening. Applied to CT scans, Al-Khawari et al. [27] analyzed inter- and intra-radiologist variability in detecting abnormal parenchymal lung manifestations on high-resolution CT scans. They used the Kappa statistic to measure the degree of agreement in these analyses. A clinically acceptable agreement was observed between the radiologists, but the agreement rate declined when the radiologists were not involved in the regular analysis of thoracic CT scans. Another study [28] analyzed COVID-19 disease manifestations in high-resolution CT scans obtained from patients at the North Sichuan Medical College, Nanchong, China. They assessed inter-observer variability by having CT readers repeat the data analysis at intervals of three days. A comparison of a set of measurements by the same scan reader was used to assess intra-observer variability. They observed significant variability in the inter- and intra-observer analyses concerning the extent and density of disease spread. At present, there is no available literature on the analysis of inter- and/or intra-reader variability applied to COVID-19 detection in CXRs.

Diong et al. [14] conducted a cross-sectional study analyzing the quality of statistical reporting in a random selection of publications in the Journal of Physiology and the British Journal of Pharmacology. The study used samples before and after the publication of an editorial suggesting measures to adopt in reporting data and statistical analyses. The authors observed no evidence of change in reporting these measures after the editorial publication. They observed that 90–96% of papers did not report statistical significance measures, including p-values, to identify the specific groups exhibiting statistically significant differences in performance. Appropriate statistical analyses are included in the current study.

Materials and methods

Data collection

This retrospective study uses the following publicly available datasets:

  1. Pediatric CXR dataset: Kermany et al. [29] made available a collection of 5,856 pediatric CXRs showing normal lungs (n = 1,583) or bacterial (n = 2,780) or viral pneumonia (n = 1,493) disease manifestations. The data were collected from children aged 1 to 5 years at the Guangzhou Children’s Medical Center, China. The radiological examinations were performed as a part of routine clinical care. The CXR images are made available in JPEG format at approximately 2000 × 2000 pixel resolution with 8-bit depth.
  2. RSNA CXR dataset: Shih et al. [30] made available a collection of 26,684 frontal CXRs for a Kaggle challenge. The CXRs are grouped into normal (n = 8,851) and abnormal (n = 17,833) classes; the abnormalities include pneumonia or non-pneumonia related opacities. The CXR images are made available at 1024 × 1024 pixel resolution with 8-bit depth, in DICOM format.
  3. CheXpert CXR dataset: Irvin et al. [31] made available a collection of 191,219 frontal CXRs showing normal lungs (n = 17,000) or other pulmonary abnormalities (n = 174,219). The CXR images are collected from patients at Stanford University Hospital, California, and are labeled for various thoracic disease manifestations by an automated natural language processing (NLP)-based labeler. The labels are extracted from radiological texts and conform to the Fleischner Society glossary of terms for thoracic imaging.
  4. NIH CXR-14 dataset: Wang et al. [8] released a collection of 112,120 frontal CXRs, collected from 30,805 patients at the NIH Clinical Center, Maryland. The collection includes CXRs, labeled as showing pulmonary abnormalities (n = 51,708) or normal lungs (n = 60,412). The CXRs were screened to remove personally identifiable information and ensure patient privacy. The CXRs belonging to the abnormal category are labeled for multiple thoracic disease manifestations using the information extracted from radiological reports using an automated NLP-based labeling algorithm.
  5. Twitter-COVID-19 CXR dataset: A radiologist from a hospital in Spain made available a collection of 134 CXRs exhibiting COVID-19 viral pneumonia manifestations, on Twitter (https://twitter.com/ChestImaging). The data were collected from SARS-CoV-2 PCR+ subjects and are made available at approximately 2000 × 2000 pixel resolution.
  6. Montreal-COVID-19 CXR dataset: Cohen et al. [32] manage a GitHub repository that hosts a collection of CXRs and CT scans of SARS-CoV-2 + and/or suspected patients. The images are pooled from publications and hospitals through collaboration with physicians and other public resources. As of May 2020, the collection includes 226 CXRs showing COVID-19 viral pneumonia manifestations. The authors did not provide complete metadata; however, the collection includes CXRs of 131 male patients and 64 female patients. The demographic information provided by the data providers for the various datasets used in this study is given in Table 1.

Lung ROI cropping and preprocessing

Input data characteristics directly impact DL model learning, which is significant in applications that involve disease detection. For example, clinical decision-making could be adversely impacted by learning irrelevant features. In the case of COVID-19 and other pulmonary diseases, it is vital to limit the analysis to the lung ROI and train the models to learn relevant feature representations from within these pulmonary zones. Literature studies reveal that U-Net-based semantic segmentation delivers commendable performance in segmentation tasks using natural and medical imagery [33]. For this study, we use a custom U-Net with dropout [34] layers to segment the lung ROI from the background; this is the first step in training. Gaussian dropouts are used in the encoder to reduce overfitting and provide restrictive regularization. A dropout ratio of 0.5 is used after empirical pilot evaluations. Fig 1 shows the architecture of the custom U-Net segmentation model and its corresponding performance curves. The model is trained and validated on patient-specific splits (80/20 train/validation split) of CXRs and their associated lung masks made available by Candemir & Antani [35]. Sigmoidal activation is used at the deepest convolutional layer to restrict the mask pixels to the range (0–1).

The model is optimized to minimize a weighted combination of the binary cross-entropy and Dice losses, given by

Loss = w1 * L_BCE + w2 * L_Dice    (1)

where L_BCE is the binary cross-entropy loss and L_Dice is the Dice loss. The losses are computed for each mini-batch, and the final loss for the entire batch is determined by the mean of the loss across all the mini-batches. The expressions for L_BCE and L_Dice are given by:

L_BCE = -(1/N) * Σ_i [ t_i log(y_i) + (1 - t_i) log(1 - y_i) ]    (2)

L_Dice = 1 - (2 * Σ_i t_i y_i) / (Σ_i t_i + Σ_i y_i)    (3)

where t is the target mask, y is the output from the final layer, i indexes the pixels, and N is the number of pixels in the mini-batch. Here, we choose w1 = w2 = 0.5.

Callbacks are used to store the model weights after each epoch only when there is a reduction in the validation loss. This helps us select the “best model” at the end of the training phase. The default value of 0.5 is used as the discrimination threshold to convert the predicted probabilities into class labels. The best model weights are used for lung mask generation. The model is trained to generate lung masks at 256 × 256 pixel resolution for the various datasets used in this study. The lung boundaries are delineated using the generated masks and cropped to a bounding box containing the lung pixels. The lung bounding boxes are resized to 256 × 256 pixels and used for further analysis. The cropped lung bounding boxes are further preprocessed as follows: (i) images are normalized so that the pixel values are restricted to the range (0–1); (ii) images are passed through a median filter for noise removal and edge preservation; (iii) image pixels are centered through mean subtraction and standardized to reduce computational complexity. The segmentation workflow is shown in Fig 2.
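As an illustration, the combined loss in Eqs (1)–(3) could be implemented as a custom Keras loss roughly as follows. This is a minimal sketch assuming a TensorFlow backend and binary masks; the small smoothing constant added to the Dice term for numerical stability is our assumption and is not stated above.

```python
from tensorflow.keras import backend as K

def bce_dice_loss(w1=0.5, w2=0.5, smooth=1e-6):
    """Weighted sum of binary cross-entropy and Dice losses (Eq. 1)."""
    def loss(y_true, y_pred):
        y_true_f = K.flatten(y_true)
        y_pred_f = K.flatten(y_pred)
        # Binary cross-entropy averaged over pixels (Eq. 2)
        bce = K.mean(K.binary_crossentropy(y_true_f, y_pred_f))
        # Dice loss (Eq. 3); 'smooth' avoids division by zero on empty masks
        intersection = K.sum(y_true_f * y_pred_f)
        dice = 1.0 - (2.0 * intersection + smooth) / (
            K.sum(y_true_f) + K.sum(y_pred_f) + smooth)
        return w1 * bce + w2 * dice
    return loss

# Hypothetical usage (the optimizer choice here is illustrative):
# model.compile(optimizer='adam', loss=bce_dice_loss(), metrics=['accuracy'])
```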

Fig 1. The architecture of the custom U-Net with dropout and its performance curves.

https://doi.org/10.1371/journal.pone.0242301.g001

Fig 2. Segmentation workflow showing U-Net-based mask generation and lung ROI cropping.

https://doi.org/10.1371/journal.pone.0242301.g002

Repeated CXR pretraining and fine-tuning

The steps in training that follow segmentation are shown in Fig 3. First (1), the images are preprocessed by cropping the lung ROI to remove irrelevant regions. The cropped images are used for model training and evaluation. We perform repeated CXR-specific pretraining to transfer modality-specific knowledge, which is then fine-tuned toward detecting COVID-19 viral disease manifestations in CXRs. To do this, in the next training step (2), the CNNs are trained on a large collection of CXRs to separate normals from those showing abnormalities of any type. Next, (3) we retrain the models from the previous step, focusing on separating CXRs showing bacterial pneumonia or non-COVID-19 viral pneumonia from normals. Next, (4) we fine-tune the models from the previous step toward the specific separation of CXRs showing COVID-19 pneumonia from normals. Finally (5), the learned features from this phase of training become parts of the ensembles developed to optimize the detection of COVID-19 pneumonitis from CXRs.

Fig 3. The workflow of the proposed repeated CXR-specific pretraining and fine-tuning.

https://doi.org/10.1371/journal.pone.0242301.g003

Details of this step-wise training approach are as follows. In the first stage of pretraining, a custom CNN and selected ImageNet-pretrained CNN models are retrained on a large selection of CXRs, with sufficient diversity due to sourcing from different collections, to coarsely learn the characteristics of normal and abnormal lungs. This CXR-specific pretraining helps make the weight layers specific to the CXR modality for use in subsequent steps. The motivation behind this approach is to perform a knowledge transfer from the natural image domain to the CXR domain and learn the characteristics of normal lungs and a wide selection of CXR-specific pulmonary disease manifestations. During this training step, the datasets are split at the patient level into 90% for training and 10% for testing. We randomly allocated 10% of the training data for validation.

During the second stage of repeated CXR-specific pretraining, the learned knowledge from the first-stage pretrained models is transferred and repurposed to classify CXRs as exhibiting normal lungs, bacterial pneumonia, or non-COVID-19 viral pneumonia manifestations. This pretraining is motivated by the biological similarity between non-COVID-19 viral and COVID-19 viral pneumonia. However, their radiological manifestations are distinct from each other as well as from non-viral pneumonia-related opacities [6, 29]. The motivation is to transfer the learned knowledge and fine-tune it for COVID-19 detection. For the normal class, we pooled CXRs from various collections to introduce generalization and improve model performance. During this pretraining stage, again, the datasets are split at the patient level into 90% for training and 10% for testing. For validation, we randomly allocated 10% of the training data.

The learned knowledge from the second stage of pretraining is transferred and fine-tuned to improve performance in classifying CXRs as showing normal lungs or COVID-19 viral pneumonia disease manifestations. Table 2 shows the datasets and their distribution used in the various stages of learning proposed in this study. We compare this performance to that obtained without repeated CXR-specific pretraining, referred to as the Baseline. For the Baseline, the ImageNet-pretrained CNNs are retrained out-of-the-box to categorize the CXRs as showing normal lungs or COVID-19 viral disease manifestations. For the normal class, we pooled CXRs in a patient-specific manner from various collections to introduce generalization and improve model performance. During this training step, we performed a patient-level split of the train and test data as follows: the CXRs from the Montreal-COVID-19 and Twitter-COVID-19 collections are combined (n = 360), where n is the total number of images in the collection. The data are split at the patient level into 80% for training and 20% for testing. We randomly allocated 10% of the training data for validation. The test set includes 72 CXRs, containing 36 CXRs each from the Montreal-COVID-19 and Twitter-COVID-19 collections. The GT disease annotations for this test data were established by two expert radiologists, referred to as Rad-1 and Rad-2 hereafter, with a combined experience of 60 years, who verified the publicly identified cases. The radiologists used the web-based VGG Image Annotator tool [36] to independently annotate the test collection by manually setting bounding boxes for what they believed to be COVID-19-related abnormalities. This was done in independent sessions in which each radiologist was shown the chest radiographs in Portable Network Graphics format with a spatial resolution of 1024 × 1024 pixels and was asked to annotate the COVID-19 viral disease-specific ROI in the given test set.

Table 2. Datasets and their distribution used in various stages of learning.

https://doi.org/10.1371/journal.pone.0242301.t002
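As an illustration of the patient-level splitting described above, a minimal sketch using scikit-learn's GroupShuffleSplit is shown below; the variable names and the particular splitting utility are our assumptions, not taken from the study.

```python
from sklearn.model_selection import GroupShuffleSplit

# image_paths, labels, and patient_ids are hypothetical parallel arrays with one
# entry per CXR; grouping by patient_id keeps all images of a patient on one side.
def patient_level_split(image_paths, labels, patient_ids, test_size=0.2, seed=0):
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(image_paths, labels, groups=patient_ids))
    return train_idx, test_idx

# A further 10% of the training indices could then be held out for validation
# with a second GroupShuffleSplit applied to the training subset.
```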

It is well known that large amounts of high-quality data are imperative for DL model training and achieving superior performance. A challenge in medical image-based DL is the lack of sufficient data. Many studies limit their work to data sourced from a single site. Using limited, single-site data for model training may result in loss of generalizability and degrade model performance when evaluated on unseen data from other institutions or diverse imaging practices. Under these circumstances, generalizability and performance could be improved by increasing the variability of the training data. In this study, we use a diversified data distribution from multiple CXR collections to enhance model generalization and performance in the repeated CXR-specific pretraining and fine-tuning stages. Class weights are used to weight the loss in favor of the minority classes to prevent bias error and reduce overfitting. During model training, data are augmented with random horizontal and vertical pixel shifts in the range (-5 to 5) and rotations in the range (-9 to 9) degrees.
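A minimal sketch of how the augmentation and class weighting described above might be configured in Keras is shown below; the variable names (train_labels, x_train, y_train) and the batch size are hypothetical placeholders.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation roughly matching the reported ranges: +/- 5-pixel horizontal and
# vertical shifts and +/- 9-degree rotations.
train_datagen = ImageDataGenerator(
    width_shift_range=5,     # integer value -> shift measured in pixels
    height_shift_range=5,
    rotation_range=9)        # degrees

# Class weights that up-weight the minority class in the loss
# (train_labels is a hypothetical 1-D array of integer class labels).
train_labels = np.array([0, 0, 0, 1, 1])  # placeholder values
weights = compute_class_weight(class_weight='balanced',
                               classes=np.unique(train_labels), y=train_labels)
class_weights = dict(enumerate(weights))

# Typical usage (x_train, y_train are hypothetical preprocessed lung crops/labels):
# model.fit(train_datagen.flow(x_train, y_train, batch_size=16),
#           class_weight=class_weights, epochs=...)
```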

The following CNN-based DL models were trained and evaluated at the various stages of learning performed in this study: (i) a custom wide residual network (WRN) [37] with dropout, (ii) ResNet-18 [38], (iii) VGG-16 [39], (iv) VGG-19 [39], (v) Xception [40], (vi) Inception-V3 [41], (vii) DenseNet-121 [42], (viii) MobileNet-V2 [43], and (ix) NasNet-Mobile [44]. The models are selected with the idea of increasing architectural diversity, and thereby representation power, when used in ensemble learning. All computation is done on a Windows® system with an Intel Xeon CPU E3-1275 v6 3.80 GHz processor and an NVIDIA GeForce® GTX 1050 Ti GPU. We used the Keras DL framework with a TensorFlow backend and the CUDA and CUDNN libraries to accelerate GPU performance.

Residual CNNs having depths of hundreds of layers suffer from diminishing feature reuse [37]. This occurs due to issues with gradient flow, which results in only a few residual blocks learning useful feature representations. A WRN combats diminishing feature reuse issues by reducing the number of layers and increasing model width [37]. The resultant networks are found to exhibit shorter training times with similar or improved accuracy. In this study, we use a custom WRN with dropout regularization. Dropouts provide restrictive regularization, address overfitting issues, and enhance generalization. After empirical observations, we used 5 × 5 kernels for the convolutional layers, assigned a dropout ratio of 0.3, a depth of 16, and a width of 4, for the custom WRN used in this study. Fig 4 shows a WRN block with the dropout used in this study. The output from the deepest residual block is average pooled, flattened, and appended to a final dense layer with Softmax activation to predict class probabilities.
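The following is a minimal sketch of one wide-residual block with dropout and a small WRN-16-4-style classifier built from it, using the reported 5 × 5 kernels, 0.3 dropout ratio, and width factor of 4. The pre-activation block ordering and the exact stage layout are our assumptions; the custom WRN in Fig 4 may differ in detail.

```python
import tensorflow as tf
from tensorflow.keras import layers

def wrn_block(x, filters, stride=1, dropout=0.3, kernel=5):
    """One wide-residual block with dropout between the two convolutions."""
    shortcut = x
    y = layers.BatchNormalization()(x)
    y = layers.Activation('relu')(y)
    y = layers.Conv2D(filters, kernel, strides=stride, padding='same')(y)
    y = layers.Dropout(dropout)(y)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    y = layers.Conv2D(filters, kernel, strides=1, padding='same')(y)
    # Project the shortcut when the spatial size or channel count changes
    if stride != 1 or shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride, padding='same')(x)
    return layers.Add()([y, shortcut])

def build_wrn(input_shape=(256, 256, 3), width=4, num_classes=2):
    """WRN-16-4-style network: 3 stages of 2 blocks, then GAP and a softmax dense layer."""
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(16, 3, padding='same')(inputs)
    for filters, stride in [(16 * width, 1), (32 * width, 2), (64 * width, 2)]:
        x = wrn_block(x, filters, stride=stride)
        x = wrn_block(x, filters, stride=1)
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    return tf.keras.Model(inputs, outputs)
```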

Fig 4. A custom wide residual network (WRN) with dropout regularization.

https://doi.org/10.1371/journal.pone.0242301.g004

As mentioned before, ImageNet-pretrained CNNs have been developed for computer vision tasks using natural images. These models have varying depth and learn diversified feature representations. For medical images that are often available in limited quantities, deeper models may not be optimal and can lead to overfitting and loss of generalization. During the first stage of pretraining, the CNNs are instantiated with their ImageNet-pretrained weights and are truncated at empirically determined intermediate layers to effectively learn the underlying feature representations for CXR images and improve classification performance. The truncated models are appended with (i) zero-padding, (ii) a 3 × 3 convolutional layer with 1024 feature maps, (iii) a global average pooling (GAP) layer, (iv) a dropout layer with an empirically determined dropout ratio of 0.5, and (v) a final dense layer with Softmax activation to output prediction probabilities. These customized models learn CXR-specific feature representations to classify CXR images as showing normal or abnormal lungs. The custom WRN is initialized with random weights. Fig 5 shows the architecture of the pretrained CNNs used during the first stage of repeated CXR-specific pretraining.
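A sketch of how an ImageNet-pretrained backbone might be truncated at an intermediate layer and given the task-specific head described above is shown below, using VGG-19 as an example. The truncation layer name and the convolutional activation are only illustrative assumptions; the truncation point is determined empirically for each model.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG19

def build_truncated_cxr_model(truncation_layer='block4_pool', num_classes=2,
                              input_shape=(256, 256, 3), dropout=0.5):
    """Truncate an ImageNet-pretrained backbone at an intermediate layer and append
    the custom head: zero-padding, 3x3 conv with 1024 maps, GAP, dropout, softmax."""
    base = VGG19(weights='imagenet', include_top=False, input_shape=input_shape)
    truncated = models.Model(base.input, base.get_layer(truncation_layer).output)
    x = layers.ZeroPadding2D()(truncated.output)
    x = layers.Conv2D(1024, 3, activation='relu')(x)   # activation is our assumption
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(dropout)(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    return models.Model(truncated.input, outputs)
```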

Fig 5. The architecture of the CNNs used in the first stage of repeated CXR-specific pretraining.

I/P = Input, I-PCNN = truncated ImageNet-pretrained CNNs, ZP = Zero-padding, CONV = Extra convolution layer, GAP = Global Average Pooling, DO = Dropout, D = Final dense layer with Softmax activation.

https://doi.org/10.1371/journal.pone.0242301.g005

In the second stage, pretrained models from the first stage are truncated at their deepest convolutional layer and appended with (i) GAP layer, (ii) dropout layer (ratio = 0.5), and (iii) dense layer with Softmax activation to output class probabilities for CXRs showing normal lungs, bacterial pneumonia, or non-COVID-19 viral pneumonia. Fig 6 shows the architecture of the customized models used during the second stage of pretraining.

Fig 6. The architecture of the CNNs used in the second stage of pretraining.

I/P = Input, CXR-Pre-CNN = CXR-specific CNNs from the first stage of pretraining, truncated at their deepest convolutional layer, GAP = Global Average Pooling, DO = Dropout, D = Final dense layer with Softmax activation.

https://doi.org/10.1371/journal.pone.0242301.g006

Next, the second-stage pretrained models are truncated at their deepest convolutional layer and appended with (i) GAP layer, (ii) dropout layer (ratio = 0.5), and (iii) dense layer with Softmax activation. The resultant models are fine-tuned to classify the CXRs as belonging to COVID-19+ or normal classes where ‘+’ symbolizes COVID-19-positive cases. Fig 7 shows the architecture of the models used toward COVID-19 detection.

Fig 7. The architecture of the CNNs fine-tuned toward COVID-19 detection.

I/P = Input, CXR-Pre-CNN = CXR-pretrained CNNs from the second stage of pretraining, truncated at their deepest convolutional layer, GAP = Global Average Pooling, DO = Dropout, D = Final dense layer with Softmax activation.

https://doi.org/10.1371/journal.pone.0242301.g007

The models in the various learning stages are trained and evaluated using stochastic gradient descent (SGD) optimization to estimate learning error and classification performance. We used callbacks to check the internal states of the models and store model checkpoints. The model weights delivering superior performance with the test data are used for further analysis. The performance of the models at the various learning stages is evaluated using the following metrics: (i) Accuracy; (ii) Area under curve (AUC); (iii) Sensitivity; (iv) Specificity; (v) Precision; (vi) F1 score; (vii) Matthews correlation coefficient (MCC); (viii) Kappa statistic; and (ix) Diagnostic Odds Ratio (DOR). The following ensemble strategies are applied to the fine-tuned models for COVID-19 detection to improve performance: (i) Majority voting; (ii) Simple averaging; and (iii) Weighted averaging. In majority voting, the class receiving the maximum votes is taken as the final prediction. In simple averaging, the average of the individual model predictions is taken as the final prediction. For the weighted ensemble, we optimized the weights for the model predictions to minimize the total logarithmic loss. This loss decreases as the prediction probabilities converge to the GT labels. We used the Sequential Least Squares Programming (SLSQP) algorithm [45] to perform several iterations of constrained logarithmic loss minimization and converge to the optimal weights for the model predictions.
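The weight optimization for the weighted-averaging ensemble can be sketched with SciPy's SLSQP optimizer as follows. The function and variable names are illustrative, and the constraint set (non-negative weights summing to one) is our assumption about how the weights were constrained.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import log_loss

def optimize_ensemble_weights(member_probs, y_true):
    """Find weights that minimize the log loss of the weighted-average ensemble.
    `member_probs` is a list of arrays of shape (n_samples, n_classes), one per
    constituent model; `y_true` holds the integer GT labels."""
    n_models = len(member_probs)

    def ensemble_log_loss(weights):
        blended = np.tensordot(weights, np.stack(member_probs), axes=1)
        return log_loss(y_true, blended)

    constraints = ({'type': 'eq', 'fun': lambda w: 1.0 - np.sum(w)},)
    bounds = [(0.0, 1.0)] * n_models
    init = np.full(n_models, 1.0 / n_models)
    result = minimize(ensemble_log_loss, init, method='SLSQP',
                      bounds=bounds, constraints=constraints)
    return result.x

# Simple averaging and majority voting for comparison:
# avg_pred = np.mean(np.stack(member_probs), axis=0)
# votes = np.argmax(np.stack(member_probs), axis=2)   # shape (n_models, n_samples)
# maj_pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```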

Inter-reader variability analysis

Fig 8 shows examples of COVID-19 viral disease-specific ROI annotations on CXRs made by Rad-1 and Rad-2. In this study, we used the well-known Simultaneous Truth and Performance Level Estimation (STAPLE) algorithm [13] to arrive at a consensus reference ROI annotation and use it to evaluate the performance of the top-N ensembles and to simultaneously assess the performance against each radiologist.

Fig 8. Examples showing inter-reader variability in annotating COVID-19 disease ROI.

(A) and (B) show the annotations (bounding boxes in blue) of Rad-1 and Rad-2, respectively, for a given COVID-19 disease labeled image; (C) and (D) show the GT annotations of Rad-1 and Rad-2, respectively, for another COVID-19 disease labeled image.

https://doi.org/10.1371/journal.pone.0242301.g008

STAPLE methods are widely used in validating image segmentation algorithms and comparing the performance of experts. Segmentation solutions are treated as a response to a pixel-wise classification problem. The algorithm uses an expectation-maximization (EM) approach that computes a probabilistic estimate of a reference segmented image from a collection of expert annotations, weighting each annotation by an estimated level of performance for that expert. It incorporates this knowledge to spatially distribute the segmented structures while satisfying homogeneity constraints. The details pertaining to the algorithm and the performance measures, including the Kappa statistic, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), used to analyze inter-reader variability and assess program performance are summarized in Section A of the S1 File.
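For illustration, a consensus annotation could be generated from rasterized binary masks of the two radiologists' bounding boxes using the STAPLE implementation in SimpleITK. This is a sketch under the assumption that SimpleITK is used (no particular implementation is stated above), and the file names are placeholders.

```python
import SimpleITK as sitk

# Binary ROI masks (one per reader) rasterized from the bounding-box annotations;
# the file names are hypothetical placeholders.
reader_masks = [sitk.ReadImage('rad1_mask.png', sitk.sitkUInt8),
                sitk.ReadImage('rad2_mask.png', sitk.sitkUInt8)]

staple = sitk.STAPLEImageFilter()
staple.SetForegroundValue(1.0)
probability_map = staple.Execute(reader_masks)   # per-pixel foreground probability

# Threshold the probability map to obtain the consensus reference annotation
consensus = sitk.BinaryThreshold(probability_map, lowerThreshold=0.5,
                                 upperThreshold=1.0, insideValue=1, outsideValue=0)

# Estimated per-reader performance levels relative to the consensus
print('sensitivity per reader:', staple.GetSensitivity())
print('specificity per reader:', staple.GetSpecificity())
```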

Disease ROI localization

In this study, we use CRM [15] visualization to interpret model predictions and evaluate the effectiveness of CRM-based ensemble localization. Details of the CRM algorithm are provided in Section B of the S1 File. First, we use CRM-based ROI localization to interpret the predictions of individual CNNs and compare them against the GT annotations provided by each expert. Next, we select the top-3, top-5, and top-7 performing models, construct ensemble CRMs through an averaging process, and compare them against each radiologist’s independent annotations and the STAPLE-generated consensus annotation. Finally, we quantitatively compare the localization performance of the ensembles with each other and against individual CRMs in terms of the IoU and mean average precision (mAP) metrics. The mAP score is calculated by taking the mean of the average precision (AP) over various IoU thresholds [46].
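A simplified sketch of how IoU and mAP over a range of IoU thresholds might be computed for predicted bounding boxes is given below; the greedy matching and the step-wise AP integration shown here are common choices and our assumptions, not necessarily the exact procedure used in the study.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def average_precision(predictions, gt_boxes, iou_threshold):
    """AP at one IoU threshold. `predictions` is a list of (confidence, box, image_id);
    `gt_boxes` maps image_id -> list of ground-truth boxes."""
    preds = sorted(predictions, key=lambda p: -p[0])          # highest confidence first
    matched = {img: [False] * len(boxes) for img, boxes in gt_boxes.items()}
    tp, fp = [], []
    for conf, box, img in preds:
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gt_boxes.get(img, [])):
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j >= 0 and best_iou >= iou_threshold and not matched[img][best_j]:
            matched[img][best_j] = True
            tp.append(1); fp.append(0)
        else:
            tp.append(0); fp.append(1)
    if not tp:
        return 0.0
    tp, fp = np.cumsum(tp), np.cumsum(fp)
    n_gt = max(sum(len(b) for b in gt_boxes.values()), 1)
    recall = tp / n_gt
    precision = tp / np.maximum(tp + fp, 1)
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):   # area under the PR curve
        ap += p * (r - prev_r)
        prev_r = r
    return ap

def mean_average_precision(predictions, gt_boxes, thresholds=np.arange(0.1, 0.71, 0.1)):
    """mAP as the mean of AP over a range of IoU thresholds, as described in the text."""
    return float(np.mean([average_precision(predictions, gt_boxes, t) for t in thresholds]))
```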

Statistical analysis

Statistical tests were conducted to determine the significance of performance differences between the models. We used confidence intervals (CI) to measure model discrimination capability and estimate its precision through the error margin. We measured the 95% CI as the exact Clopper–Pearson interval for the AUC values obtained by the models in the various learning stages. Statistical packages including StatsModels and SciPy were used in these analyses. We performed a one-way analysis of variance (ANOVA) [47] on the mAP values obtained with the top-N (N = 3, 5, 7) model ensembles to study their localization performance and determine statistical significance among them, against the annotations of each of the radiologists, and against the STAPLE-generated consensus ROI annotation. One-way ANOVA tests are performed only if the assumptions of data normality and homogeneity of variances are satisfied, for which we performed Shapiro-Wilk and Levene’s analyses [47]. These statistical analyses are performed using R statistical software (Version 3.6.1).
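The assumption checks and tests described above can be sketched in Python as follows; the study performs the ANOVA in R 3.6.1, so this SciPy/StatsModels version is only an equivalent sketch, and the numbers below are placeholders rather than measured values.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportion_confint

# Placeholder per-image mAP samples for the top-3, top-5, and top-7 ensembles.
rng = np.random.default_rng(0)
groups = {'top-3': rng.normal(0.30, 0.05, 36),
          'top-5': rng.normal(0.28, 0.05, 36),
          'top-7': rng.normal(0.27, 0.05, 36)}

# Assumption checks before one-way ANOVA: normality (Shapiro-Wilk) per group
# and homogeneity of variances (Levene's test) across groups.
for name, g in groups.items():
    w, p = stats.shapiro(g)
    print(f'Shapiro-Wilk {name}: W = {w:.3f}, p = {p:.3f}')
lev_stat, lev_p = stats.levene(*groups.values())
print(f'Levene: statistic = {lev_stat:.3f}, p = {lev_p:.3f}')

# One-way ANOVA, performed only if both assumptions hold (p > 0.05).
f_stat, anova_p = stats.f_oneway(*groups.values())
print(f'One-way ANOVA: F = {f_stat:.3f}, p = {anova_p:.3f}')

# Exact Clopper-Pearson 95% CI, shown here for a proportion (e.g., accuracy on a
# 72-image test set with 68 correct predictions; the counts are illustrative).
low, high = proportion_confint(count=68, nobs=72, alpha=0.05, method='beta')
print(f'95% Clopper-Pearson CI: [{low:.3f}, {high:.3f}]')
```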

Results

Recall that in the first stage of CXR-specific pretraining, we truncated the ImageNet-pretrained CNNs at their intermediate layers to empirically determine the layers that demonstrated superior performance. These empirically determined layers for the various models are listed in Section C of the S1 File. The performance achieved through truncating the models at the selected intermediate layers and appending task-specific heads toward classifying the CXRs is shown in Table 3.

Table 3. Performance metrics achieved during the first stage of CXR-specific pretraining.

https://doi.org/10.1371/journal.pone.0242301.t003

From Table 3, we observe that the AUC values are not statistically significantly different across the models (p > 0.05). The DOR provides a measure of diagnostic accuracy and an estimation of discriminative power. A high DOR is obtained by a model that exhibits high sensitivity and specificity with low FPs and FNs. A model with a higher AUC is more capable of distinguishing between the positive and negative classes. Considering the DOR and AUC values, VGG-19 demonstrates somewhat better performance, followed by NasNet-Mobile, in classifying CXRs into normal or abnormal categories. Also considering the MCC and Kappa metrics, VGG-19 outperformed the other models. The confusion matrix, ROC curves, and normalized Sankey flow diagram obtained using the VGG-19 model for this classification task are shown in Fig 9. We used a normalized Sankey diagram [48] to visualize model performance. Here, weights are assigned to the classes on the truth (left) and prediction (right) sides of the diagram to provide an equal visual representation for the classes on either side. The strip widths change across the plot so that, on the prediction side, the width of each strip represents the fraction of the objects predicted as a given category that truly belongs to each of the categories.

Fig 9. Performance achieved using the VGG-19 model during the first-stage of CXR-specific pretraining.

(A) Confusion matrix; (B) ROC curves; (C) Normalized Sankey flow diagram.

https://doi.org/10.1371/journal.pone.0242301.g009
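For illustration, a simplified normalized Sankey diagram of this kind can be drawn with Plotly by normalizing the confusion-matrix rows before using them as link values. The confusion-matrix counts below are placeholders, and normalizing only the truth side is a simplification of the two-sided weighting described above.

```python
import numpy as np
import plotly.graph_objects as go

cm = np.array([[850, 45],      # placeholder confusion matrix: rows = truth,
               [30, 975]])     # columns = prediction
row_norm = cm / cm.sum(axis=1, keepdims=True)   # equal visual weight per truth class

labels = ['True: Normal', 'True: Abnormal', 'Pred: Normal', 'Pred: Abnormal']
source, target, value = [], [], []
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        source.append(i)          # truth node index
        target.append(2 + j)      # prediction node index
        value.append(row_norm[i, j])

fig = go.Figure(go.Sankey(node=dict(label=labels, pad=20),
                          link=dict(source=source, target=target, value=value)))
fig.show()
```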

Recall that during the second stage of CXR-specific pretraining, the learned representations from the first-stage pretrained models are transferred and fine-tuned to classify CXRs as showing normal lungs, bacterial pneumonia, or non-COVID-19 viral pneumonia. The performance achieved by the second-stage pretrained models is shown in Table 4.

Table 4. Performance metrics achieved by the models during the second stage of CXR-specific pretraining.

https://doi.org/10.1371/journal.pone.0242301.t004

We observed no statistically significant difference in the AUC values achieved with the models during this pretraining stage (p > 0.05). Considering the DOR, DenseNet-121 demonstrated better performance (220.68), followed by MobileNet-V2 (205.81), in categorizing the CXRs as showing normal lungs, bacterial pneumonia, or non-COVID-19 viral pneumonia. Considering the MCC and F1 score metrics, which account for both sensitivity and precision in determining model generalization, DenseNet-121 outperformed the other models. The confusion matrix, ROC curves, and normalized Sankey flow diagram obtained using the DenseNet-121 model for this classification task are shown in Fig 10.

Fig 10. Performance achieved using the DenseNet-121 model during the second stage of CXR-specific pretraining.

(A) Confusion matrix; (B) ROC curves; (C) Normalized Sankey flow diagram.

https://doi.org/10.1371/journal.pone.0242301.g010

The second stage pretrained models are truncated at their deepest convolutional layer, appended with task-specific heads, and fine-tuned to classify the CXRs as belonging to COVID-19+ or normal categories. Table 5 shows the performance metrics achieved by the models toward this task.

Table 5. Performance metrics achieved with fine-tuning the second-stage pretrained models for COVID-19 detection.

https://doi.org/10.1371/journal.pone.0242301.t005

We observed no statistically significant difference in AUC values (p > 0.05) achieved by the fine-tuned models. Considering DOR, ResNet-18 demonstrated better performance (83.2) followed by DenseNet-121 (51.54) in categorizing the CXRs as showing normal lungs or manifesting COVID-19 viral disease. The custom WRN, Inception-V3, and DenseNet-121 are found to be equally sensitive (0.9028) toward this classification task. However, the ResNet-18 fine-tuned model demonstrated better performance with other performance metrics including accuracy, AUC, specificity, precision, F1 score, MCC, and Kappa. The confusion matrix, ROC curves, and normalized Sankey flow diagram obtained using the ResNet-18 model toward this classification task are shown in Fig 11.

Fig 11. Performance achieved using the ResNet-18 model during fine-tuning for COVID-19 detection.

(A) Confusion matrix; (B) ROC curves; (C) Normalized Sankey flow diagram.

https://doi.org/10.1371/journal.pone.0242301.g011

We visualized the deepest convolutional layer feature embedding for the ResNet-18 fine-tuned model, using the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm [49], which is shown in Section D of the S1 File. The performance obtained with the fine-tuned models is compared to the Baseline, as shown in Table 6. The Baseline refers to out-of-the-box ImageNet-pretrained CNNs that are retrained toward this classification task. The custom WRN is initialized with randomized weights for the Baseline task.

Table 6. Performance metrics achieved by fine-tuning the second-stage pretrained models for COVID-19 detection, compared with the baseline.

https://doi.org/10.1371/journal.pone.0242301.t006

As observed in Table 6, the fine-tuned models achieved better performance compared to their baseline counterparts. The AUC metrics achieved with the fine-tuned custom WRN, VGG-16, VGG-19, and NasNet-Mobile models are shown in bold type and are observed to be statistically significantly better (p < 0.05) than those of their baseline counterparts. We also observed a marked reduction in the number of trainable parameters for the fine-tuned models. The fine-tuned DenseNet-121 model showed a 54.51% reduction in the number of trainable parameters while delivering better performance as compared to its baseline counterpart. The same holds true for ResNet-18 (46.05%), Inception-V3 (42.36%), Xception (37.57%), MobileNet-V2 (37.38%), and NasNet-Mobile (11.85%), with the added benefit of improved performance compared to their baseline models.

We performed visualization studies to compare how the fine-tuned models and their baseline counterparts localize the ROIs in a CXR manifesting COVID-19 viral patterns. Fig 12 shows the following: (i) a CXR with COVID-19 disease consensus ROI obtained with STAPLE using Rad-1 and Rad-2 annotations, and (ii) the ROI localization achieved with various fine-tuned models and their baseline counterparts.

Fig 12. COVID-19 viral disease ROI CRM-based localization achieved using the fine-tuned models and their baseline counterparts.

(A) Original CXR with STAPLE-generated consensus ROI (shown as blue box ROI); (B) Baseline VGG-16; (C) Baseline VGG-19; (D) Baseline MobileNet-V2; (E) Baseline ResNet-18; (F) Baseline Inception-V3; (G) Fine-tuned VGG-16; (H) Fine-tuned VGG-19; (I) Fine-tuned MobileNet-V2; (J) Fine-tuned ResNet-18; (K) Fine-tuned Inception-V3.

https://doi.org/10.1371/journal.pone.0242301.g012

We extracted the features from the deepest convolutional layer of the fine-tuned models and their baseline counterparts. We used CRM tools to localize the pixels involved in predicting the CXR images as showing COVID-19 viral disease patterns. As observed in Fig 12, the baseline models demonstrated poor disease ROI localization compared to the fine-tuned models. We observed that the fine-tuned models learned salient ROI feature representations matching the experts’ knowledge about the disease ROI. The superior localization of the fine-tuned models can be attributed to (i) CXR-specific knowledge transfer, which helped to learn modality-specific characteristics; the learned feature representations are transferred and repurposed for the COVID-19 detection task, and (ii) optimal architectural depth to learn the salient ROI feature representations needed to classify CXRs into their respective categories. These deductions are supported by the poor localization performance of the deeper, out-of-the-box ImageNet-pretrained baseline CNNs like ResNet-18, Inception-V3, and MobileNet-V2, which possibly suffered from overfitting that resulted in poor learning and generalization.

We constructed ensembles of the top-3, top-5, and top-7 performing fine-tuned CNNs to evaluate whether they improve performance in predicting the CXRs as showing normal lungs or COVID-19 viral disease patterns. We used majority voting, simple averaging, and weighted averaging strategies toward this task. In weighted averaging, we optimized the weights for the model predictions to minimize the total logarithmic loss. We used the SLSQP algorithm to iterate through this minimization process and converge to the optimal weights for the model predictions. The results achieved with the various ensemble methods are shown in Table 7. We observed no statistically significant difference in the AUC values achieved by the various ensemble methods (p > 0.05). We observed that the performance of the top-3 ensembles is better than that of the top-5 and top-7 ensembles. The weighted averaging of the top-3 fine-tuned CNNs, viz. ResNet-18, MobileNet-V2, and DenseNet-121, demonstrated better performance when their predictions are optimally weighted at 0.6357, 0.1428, and 0.2216, respectively. This weighted averaging ensemble delivered better performance in terms of accuracy, AUC, DOR, Kappa, F1 score, MCC, and other metrics, as compared to the other ensembles. The confusion matrix, ROC curves, and normalized Sankey flow diagram obtained with the weighted averaging of the top-3 fine-tuned CNNs are shown in Fig 13.

Fig 13. Performance achieved through weighted averaging of the top-3 fine-tuned CNNs toward COVID-19 detection.

(A) Confusion matrix; (B) ROC curves; (C) Normalized Sankey flow diagram.

https://doi.org/10.1371/journal.pone.0242301.g013

Table 7. Performance achieved with an ensemble of top-3, top-5, and top-7 fine-tuned models toward COVID-19 detection.

https://doi.org/10.1371/journal.pone.0242301.t007

Table 8 shows the performance achieved in terms of CRM-based IoU and mAP scores by the individual fine-tuned CNNs using the annotations of Rad-1, Rad-2, and STAPLE-generated consensus ROI. For Rad-1, the fine-tuned Inception-V3 model demonstrated higher values for the average IoU and mAP metrics. For Rad-2, we observed that the fine-tuned NasNet-Mobile outperformed other models. With STAPLE-generated consensus ROI, the Inception-V3 model outperformed other models in localizing COVID-19 viral disease-specific ROI.

Table 8. Performance achieved in terms of CRM-based IoU and mAP values by the individual fine-tuned CNNs using the radiologists’ annotations and STAPLE-generated ROI consensus annotation.

https://doi.org/10.1371/journal.pone.0242301.t008

The precision-recall (PR) curves of the best-performing models using the Rad-1, Rad-2, and STAPLE-generated consensus ROI annotations are shown in Section E of the S1 File. These curves are generated for IoU thresholds varying over the range (0.1–0.7). This range is determined empirically from the PR curves to avoid operating points with extreme sensitivity or precision and to ensure that the mAP scores appropriately reflect the models' localization ability. Each curve is generated by varying the confidence score threshold. For a given fine-tuned model, we define the confidence score as the highest heat map value in the predicted ROI weighted by the classification score at the output nodes. We considered an ROI prediction as a TP when both its IoU and confidence score exceed their corresponding thresholds. For a given PR curve, we computed the AP score as the average of the precision across all recall values.
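The following is a minimal sketch of this bookkeeping, assuming boxes are given as (x_min, y_min, x_max, y_max) tuples; the function and variable names are illustrative, not taken from the study's code.

```python
# Sketch of the TP/FP decision and AP computation described above.
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def is_true_positive(pred_box, confidence, ref_box, iou_thr, conf_thr):
    """A predicted ROI counts as a TP only if both the IoU with the reference ROI
    and the prediction confidence exceed their respective thresholds."""
    return confidence >= conf_thr and iou(pred_box, ref_box) >= iou_thr

def average_precision(precisions):
    """AP as the average of precision across the sampled recall values of one PR curve."""
    return float(np.mean(precisions))
```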

The following are the important observations from this localization study. The classification accuracy of a model does not guarantee accurate disease ROI localization. From Table 6, we observed that the fine-tuned ResNet-18 model is the most accurate, followed by DenseNet-121 and MobileNet-V2, in classifying the CXRs as belonging to the COVID-19 viral category. However, the fine-tuned Inception-V3, VGG-16, and NasNet-Mobile models delivered superior disease-specific ROI localization compared to the other models. This underscores the fact that classification accuracy alone is not an adequate measure of a model's learned behavior. Localization studies are indispensable for understanding the learned features and comparing them to expert knowledge for the problem under study; they provide comprehensive qualitative and quantitative measures of the model's learning capacity and generalization ability.

Next, we constructed ensemble CRMs by averaging the ROI localizations of the top-3, top-5, and top-7 fine-tuned models, ranked by their IoU and mAP scores. The localization performance achieved with the various ensemble CRMs is shown in Table 9. The ensemble CRMs delivered superior ROI localization compared to the individual models; however, the number of models in the top-performing ensemble varied. Using the annotations of Rad-1, the ensemble of the top-3 models demonstrated higher IoU and mAP values than the other ensembles, whereas for Rad-2, the ensemble of the top-5 models demonstrated superior localization, with IoU and mAP values of 0.2955 and 0.2352, respectively. With the STAPLE-generated consensus ROI annotation, the ensemble of the top-3 fine-tuned models demonstrated the highest IoU and mAP scores. We observed that averaging the CRMs of more than the top-5 fine-tuned models did not improve performance; instead, ROI localization saturated. The PR curves resulting from this observation are shown in Section F of the S1 File.
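As a sketch of the ensemble localization step, assuming each member CRM is a normalized 2-D map of identical size, the averaged map can be thresholded and the bounding box of the activated pixels taken as the ensemble ROI. The threshold value below is an arbitrary illustration, not the setting used in the study.

```python
# Sketch: average member CRMs, threshold, and extract a bounding-box ROI.
import numpy as np

def ensemble_roi(heatmaps, threshold=0.5):
    """heatmaps: list of equally sized 2-D maps normalized to [0, 1]."""
    mean_map = np.mean(np.stack(heatmaps), axis=0)    # average the member CRMs
    ys, xs = np.where(mean_map >= threshold)          # pixels above the cut-off
    if xs.size == 0:
        return None                                   # no ROI localized
    return (xs.min(), ys.min(), xs.max(), ys.max())   # (x_min, y_min, x_max, y_max)
```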

Table 9. IoU and mAP values obtained with the top-3, top-5, and top-7 ensembles using the annotations of Rad-1, Rad-2, and the STAPLE-generated consensus ROI.

https://doi.org/10.1371/journal.pone.0242301.t009

Instances of CXRs showing the ROI annotations of Rad-1, Rad-2, the top-3 ensemble using the STAPLE-generated consensus ROI (referred to as the program hereafter), and the STAPLE-generated consensus ROI annotation are shown in Fig 14.

Fig 14. Sample CXRs from two different patients (rows A-D and E-H, respectively) showing the generated ROI annotations.

(A) and (E) Rad-1 (in blue); (B) and (F) Rad-2 (in green); (C) and (G) Top-3 ensemble using STAPLE-generated consensus ROI (program) (in yellow); (D) and (H) STAPLE-generated consensus ROI annotation (in red).

https://doi.org/10.1371/journal.pone.0242301.g014

Fig 15 shows the following: (A) an ensemble CRM generated with the top-3 fine-tuned models that delivered superior localization performance using STAPLE-generated ROI consensus annotation, and (B) an ensemble CRM generated with the top-5 fine-tuned models that delivered superior localization performance using the annotations of Rad-2.

Fig 15. Instances of ensemble CRMs combining top-N ensemble ROI predictions.

(A) top-3 CNNs using STAPLE-generated consensus ROI annotation; (B) top-5 CNNs using Rad-2 annotations. The green box denotes reference ROI annotation and the blue box denotes ensemble CRM localization.

https://doi.org/10.1371/journal.pone.0242301.g015

We observe that the CRMs obtained from the individual models in a top-N ensemble highlight the ROI to varying extents. The ensemble CRM averages the ROIs localized by the individual CRMs to highlight the disease-specific ROI involved in class prediction. The ensemble CRMs have superior IoU values compared to the individual CRMs; that is, the ensemble CRM improved localization performance over individual ROI localization. This underscores the fact that ensemble localization improves performance and generalization, conforming to the experts' knowledge of COVID-19 viral disease manifestations.

Before performing a one-way ANOVA, we investigated whether the assumptions of data normality and homogeneous variances are satisfied. Using the mAP scores obtained with the top-N ensembles, we applied the Shapiro–Wilk test to assess normality and Levene's test to assess homogeneity of variances. We also plotted the residuals to verify the assumption of normally distributed residuals. Fig 16 shows (A) the mean plot for the mAP scores obtained by the top-N ensembles using the Rad-1, Rad-2, and STAPLE-generated consensus ROI annotations, and (B) a plot of the quantiles of the residuals against those of the normal distribution.

Fig 16. Statistical analyses.

(A) Mean plot for the mAP scores obtained by the top-N ensembles using Rad-1, Rad-2, and STAPLE-generated consensus ROI annotations; error bars represent standard errors; the differences are not statistically significant. (B) Residual plot showing that the data follow the normal distribution.

https://doi.org/10.1371/journal.pone.0242301.g016

The residual plot in Fig 16 shows that all points fall approximately along the 45-degree reference line, indicating that the assumption of normally distributed residuals is satisfied. Table 10 shows the consolidated results of the Shapiro–Wilk, Levene, and one-way ANOVA analyses.

Table 10. Consolidated results of Shapiro–Wilk, Levene, and one-way ANOVA analyses.

https://doi.org/10.1371/journal.pone.0242301.t010

To compute a one-way ANOVA, we measure the variance between group means, the variance within groups, and the group sizes. This information is combined into the test statistic F to assess statistical significance. In our study, we have three groups (Rad-1, Rad-2, and STAPLE) of 10 observations each; hence, the distribution is reported as F(2, 27). As observed from Table 10, the p-values obtained with the Shapiro–Wilk test are not significant (p > 0.05), indicating that the normality assumption is satisfied. The result of Levene's test is also not statistically significant (p > 0.05), demonstrating that the variances of the mAP values obtained with the annotations of Rad-1, Rad-2, and the STAPLE-generated consensus ROI do not differ significantly. Since the conditions of data normality and homogeneity of variances are satisfied, we performed a one-way ANOVA to test for a statistically significant difference in the mAP scores. We observed no statistically significant difference in the mAP scores obtained with the Rad-1, Rad-2, and STAPLE-generated consensus ROI annotations (F(2, 27) = 1.678, p = 0.2060). This small F-value indicates that the null hypothesis (H0), i.e., that all groups have equal mean mAP scores, cannot be rejected.
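For reference, a minimal SciPy-based sketch of these checks is shown below; the score arrays are placeholders for the ten mAP observations per group, and the printed summary is illustrative only.

```python
# Sketch: normality (Shapiro-Wilk), equal variances (Levene), then one-way ANOVA
# on the three groups of mAP scores (placeholder arrays).
from scipy import stats

def compare_map_scores(map_rad1, map_rad2, map_staple, alpha=0.05):
    for name, scores in [("Rad-1", map_rad1), ("Rad-2", map_rad2), ("STAPLE", map_staple)]:
        w, p = stats.shapiro(scores)                                    # normality per group
        print(f"Shapiro-Wilk {name}: W={w:.3f}, p={p:.3f}")
    lev_stat, lev_p = stats.levene(map_rad1, map_rad2, map_staple)      # homogeneity of variances
    print(f"Levene: W={lev_stat:.3f}, p={lev_p:.3f}")
    f_stat, anova_p = stats.f_oneway(map_rad1, map_rad2, map_staple)    # one-way ANOVA
    print(f"ANOVA: F={f_stat:.3f}, p={anova_p:.3f}")
    return anova_p < alpha    # True only if the group means differ significantly
```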

We used the STAPLE-generated consensus ROI as the standard reference and measured its agreement with the ROIs generated by the program and by the radiologists. The consensus ROI is estimated from the set of ROI annotations provided by Rad-1 and Rad-2; STAPLE assumes that Rad-1 and Rad-2 annotated the ROIs independently so that the quality of each reader's annotations can be captured. We determined the sets of TPs, FPs, TNs, and FNs for 10 different IoU thresholds in the range (0.1–0.7) and measured inter-reader variability and program performance using the following metrics: (i) Kappa statistic; (ii) sensitivity; (iii) specificity; (iv) PPV; and (v) NPV. These metrics depend on the relative proportion of disease-specific ROIs. An ROI provided by a radiologist or predicted by the program is considered a TP if its IoU with the consensus ROI is greater than or equal to a given IoU threshold. A radiologist or program ROI whose IoU falls below the threshold, or which lies outside the consensus ROIs, is counted as an FP. An FN is recorded when a radiologist or the program completely misses a consensus ROI. An image with no ROIs in either of the annotation sets under comparison is counted as a TN. Fig 17 shows the variability in the Kappa, sensitivity, specificity, and PPV values observed for Rad-1, Rad-2, and the program.
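The metrics listed above can be computed directly from the TP/FP/TN/FN counts accumulated at each IoU threshold; the following is a minimal sketch using the standard two-by-two formulations (the kappa expression is the usual Cohen's kappa for a 2x2 contingency table).

```python
# Sketch: agreement metrics from counts accumulated at a given IoU threshold.
def agreement_metrics(tp, fp, tn, fn):
    n = tp + fp + tn + fn
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    npv = tn / (tn + fn) if (tn + fn) else 0.0
    observed = (tp + tn) / n                                            # observed agreement
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)  # chance agreement
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 0.0
    return {"kappa": kappa, "sensitivity": sensitivity, "specificity": specificity,
            "ppv": ppv, "npv": npv}
```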

Fig 17. Assessing inter-reader variability and program performance.

The following performance metrics are measured and plotted for 10 different IoU thresholds in the range (0.1–0.7): (A) Kappa statistic; (B) Sensitivity; (C) Specificity; (D) PPV.

https://doi.org/10.1371/journal.pone.0242301.g017

The estimated Kappa, sensitivity, specificity, PPV, and NPV values, averaged over 10 different IoU thresholds in the range (0.1–0.7), are shown in Table 11.

Table 11. Performance level assessment and inter-reader variability analysis using STAPLE-generated consensus ROI.

https://doi.org/10.1371/journal.pone.0242301.t011

The performance assessment in Table 11 indicates that Rad-1 is more specific than Rad-2; the same holds for the Kappa and PPV metrics. The NPV is 1 for both Rad-1 and Rad-2 because the number of FNs is 0, signifying that neither radiologist completely missed an ROI present in the STAPLE-generated consensus annotation. However, the NPV achieved by the program is 0.6, which indicates that the predicted ROIs missed a marked proportion of the ROIs in the STAPLE-generated consensus. This assessment indicates that Rad-1 generated annotations closer to the STAPLE-generated consensus, demonstrating higher Kappa, sensitivity, and PPV values than Rad-2. We also observed that the program performs with higher specificity but lower sensitivity than Rad-1 and Rad-2. These assessments provided feedback indicating the need for program modifications, parameter tuning, and other measures to improve its localization performance.

Discussion

There are several salient observations to be made from the analyses reported above. These include (i) the kind of data used in training, (ii) the size and variety of the data collections, (iii) the learning ability of various DL architectures, which informs their selection, (iv) the need to customize models for improved performance, (v) the benefits of ensemble learning, and (vi) the imperative need for localization studies to measure conformity with expert knowledge of the problem.

We observed that repeated CXR-specific pretraining and fine-tuning resulted in improved COVID-19 detection performance compared to the baseline, out-of-the-box, ImageNet-pretrained CNNs. This highlights the value of task-specific modality training, which improved model adaptation and convergence and reduced bias and overfitting. This approach may have helped the DL models differentiate the distinct radiological manifestations of COVID-19 viral pneumonia from other non-viral pneumonia-related opacities. An added benefit is that this approach reduced both the computations and the number of trainable parameters.

It is well known that neural networks learn implicit rules to convert input data into features for decision-making. These learned rules are opaque to the user, and the decisions are difficult to interpret. Moreover, high model accuracy does not guarantee that accurate predictions are made for the right reasons; interpretation methods are needed to examine those reasons. Localization studies help verify whether the model has learned salient ROI feature representations that agree with expert annotations. In our study, CRM visualization showed that the fine-tuned models localize COVID-19 viral disease-specific ROIs markedly better than the ImageNet-pretrained CNNs.

Model ensembles further improved qualitative and quantitative COVID-19 detection performance. Ensemble learning compensated for misclassifications by individual models by combining their predictions and reduced the variance of the predictions with respect to the training data. We observed that the weighted-averaging ensemble of the top-3 performing fine-tuned models delivered better performance than any individual constituent model. The results demonstrate that the detection task benefits from an ensemble of repeatedly CXR-specific pretrained and fine-tuned models. Ensemble learning also compensates for localization errors and missed ROIs in individual CRMs by combining and averaging them. Empirical evaluations show that ensemble localization achieved superior IoU and mAP scores and significantly outperformed ROI localization by the individual CNN models.

It is difficult to quantify an individual radiologist's performance in annotating ROIs in medical images. Not only are radiologists the truth standard, but this "truth" is affected by inherent biases related to a pandemic event like COVID-19 and by their clinical exposure and experience. The complexity is compounded because CXRs offer lower diagnostic sensitivity than CT, for example; a conservative assessment of the CXR is therefore likely to result in smaller and more specific truth annotation ROIs. We used STAPLE to compute a probabilistic estimate of the ROI annotations of the two expert radiologists who contributed to this study. STAPLE assumes these annotations are conditionally independent. The algorithm discovers and quantifies the bias among the experts when they differ in their opinion of the disease-specific ROI annotation. We used the STAPLE-generated annotations as the GT to assess the variation in each expert's annotations, where the DL model is also treated as an expert. We observed that the Kappa values obtained using the STAPLE-generated consensus ROI are in a low range (0–0.2), probably because of the small number of experts and their inherent biases in assessing COVID-19 cases. In particular, Rad-1 was very specific in marking the ROIs, whereas Rad-2 annotated larger regions that sometimes merged multiple smaller regions into a single ROI. This led to lower IoU values, which in turn affected the Kappa values. The pandemic is an evolving situation, and CXR manifestations of COVID-19 often resemble those of non-COVID-19 viral pneumonia. The CXR is not a definitive diagnostic tool, and expert views may differ on whether to refer a patient for further review. It would be helpful to conduct a similar analysis with a larger number of experts on a larger patient population, and we remain hopeful that health agencies and medical societies will make such image collections available for future research. As more reliable and widely available COVID-19 testing becomes available, its results could be used alongside CXRs as an additional important indicator of GT.

Regarding the limitations of our study: (i) the publicly available COVID-19 data collections used are fairly small and may not encompass a wide range of disease pattern variability; an appropriately annotated, large-scale collection of CXRs with COVID-19 viral disease manifestations is necessary to build confidence in the models and improve their robustness and generalization. (ii) The study is evaluated with ROI annotations obtained from two expert radiologists; having more radiologists contribute independently to the annotation process and then arrive at a consensus could reduce annotation errors. (iii) We used conventional convolutional kernels in this study; future research could propose novel convolutional kernels that reduce feature dimensionality and redundancy, resulting in improved performance with reduced memory and computational requirements. (iv) Ensemble models require markedly more training time, memory, and computational resources for successful deployment and use; however, recent advancements in storage, computing, and cloud technology could lead to improvements in this regard.

Conclusions

In this study, we have demonstrated that a combination of repeated CXR-specific pretraining, fine-tuning, and ensemble learning helped in (a) transferring CXR-specific learned knowledge that can subsequently be fine-tuned to improve COVID-19 detection in CXRs, and (b) improving classification generalization and localization performance by reducing prediction variance. Ensemble-based ROI localization improved localization performance by compensating for the errors of the individual constituent models. We also performed inter-reader variability analysis and program performance assessment by comparing both against a STAPLE-based estimated reference. This assessment highlighted opportunities for improving performance through ensemble modifications, requisite parameter optimization, increased task-specific dataset size, and "truth" estimates from a larger number of expert collaborators. We believe that the methods and results presented here are useful for developing robust models for medical image classification and disease-specific ROI localization tasks.

Supporting information
