
Increasing prediction accuracy of pathogenic staging by sample augmentation with a GAN

  • ChangHyuk Kwon ,

    Contributed equally to this work with: ChangHyuk Kwon, Sangjin Park

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Software, Supervision, Visualization, Writing – original draft, Writing – review & editing

    Affiliations Center for Bioinformatics, EONE Laboratories, Incheon, The Republic of Korea, Department of Computer Science and Engineering, Incheon National University, Incheon, The Republic of Korea

  • Sangjin Park ,

    Contributed equally to this work with: ChangHyuk Kwon, Sangjin Park

    Roles Resources, Software, Validation, Visualization

    Affiliation Department of Computer Science and Engineering, Incheon National University, Incheon, The Republic of Korea

  • Soohyun Ko,

    Roles Software, Visualization

    Affiliation Department of Computer Science and Engineering, Incheon National University, Incheon, The Republic of Korea

  • Jaegyoon Ahn

    Roles Conceptualization, Data curation, Writing – original draft, Writing – review & editing

    jgahn@inu.ac.kr

    Affiliation Department of Computer Science and Engineering, Incheon National University, Incheon, The Republic of Korea

Abstract

Accurate prediction of cancer stage is important in that it enables more appropriate treatment for patients with cancer. Many measures or methods have been proposed for more accurate prediction of cancer stage, but recently machine learning, and especially deep learning-based methods, have been receiving increasing attention, mostly owing to their good prediction accuracy in many applications. Machine learning methods can be applied to high-throughput DNA mutation or RNA expression data to predict cancer stage. However, because the number of genes or markers generally exceeds 10,000, a considerable number of data samples is required to guarantee high prediction accuracy. To solve this problem of a small number of clinical samples, we used Generative Adversarial Networks (GANs) to augment the samples. Because GANs are not effective on the whole set of genes, we first selected significant genes using DNA mutation data and random forest feature ranking. Next, RNA expression data for the selected genes were expanded using GANs. We compared the classification accuracies using the original dataset and the expanded datasets generated by the proposed and existing methods, using random forest, Deep Neural Networks (DNNs), and 1-Dimensional Convolutional Neural Networks (1DCNN). When using the 1DCNN, the F1 score of GAN5 (a 5-fold increase in data) was improved by 39% relative to the original data. Moreover, the results using only 30% of the data were better than those using all of the data. Our attempt is the first to use a GAN for augmentation of numeric data for both DNA and RNA. The augmented datasets obtained using the proposed method demonstrated significantly increased classification accuracy in most cases. By using a GAN and 1DCNN in the prediction of cancer stage, we confirmed that good results can be obtained even with small numbers of samples, and it is expected that much of the cost and time required to obtain clinical samples can be reduced.
The proposed sample augmentation method could also be applied for other purposes, such as prognostic prediction or cancer classification.

Introduction

Correct prediction of cancer stage is beneficial because it can help medical doctors determine more appropriate treatment for patients with cancer. For example, doctors can use staging information to determine type of surgery to perform, or whether chemotherapy or radiation therapy is required.

Numerous measures or methods have been proposed for accurate prediction of cancer stage, and one of the most widely used is the Tumor, Node, and Metastasis (TNM) staging system developed by the American Joint Committee on Cancer (AJCC). TNM is a clinically useful staging system for cancers of almost every anatomic site and histology. From the 7th edition of the AJCC Cancer Staging Manual to the most recent 8th edition, few changes may be observed with respect to some cancers [1, 2], but in other cancer types, such as lung, gastric, and breast cancer [3–6], numerous changes are present in the criteria for prediction of cancer stage. These changes in the criteria may cause confusion in patient treatment.

Recently, alternative methods to predict cancer stage with additional clinical information or genomic information have been proposed. These methods, for the most part, adopt machine learning techniques to increase prediction accuracy. The machine learning methods used include Random Forest (RF) [7, 8], Support Vector Machine (SVM) [9], Naïve Bayes (NB) [9, 10], J48 Decision Tree [11], Logistic Regression [10, 11], Neural Network (NN) [12], and Neuro-Fuzzy Model [13]. In many cases, these methods showed better performance than the TNM staging system. For example, the Neuro-Fuzzy computational intelligence model [13] classified the pathological stage of patients with prostate cancer using data from The Cancer Genome Atlas (TCGA) [14], and compared these results with results using the AJCC pTNM (Pathological Tumor-Node-Metastasis) Staging Nomogram, as well as other machine learning methods such as Artificial Neural Network (ANN) or SVM, and found fewer false positives than the number achieved with AJCC or other machine learning models.

However, most of these studies applied machine learning methods to a relatively small number of samples. Machine learning methods generally require a substantial number of samples to ensure high predictive power. To overcome this limitation of a small sample size, many sample augmentation methods have been developed. The Synthetic Minority Oversampling Technique (SMOTE) [15, 16] was primarily developed to oversample a small number of samples, and has additionally shown its ability to convert highly imbalanced data into balanced data. Since 2012, deep learning techniques have been applied in many fields, and a Denoising Autoencoder (DA) [17] has been applied to solve the problem of insufficient training samples by expanding small gene expression datasets. Generative Adversarial Networks (GANs) [18] can be used to generate synthetic samples. GANs and their variations are widely used to synthesize images, but they can also be used to generate tabular numerical data, such as medical or educational records. TableGAN [19] showed, on four real-world datasets from four different domains, that GANs can synthesize fake tables that are statistically similar to the original table, addressing the security problems that arise when sharing or delivering data to the public or partners. Tabular GAN (TGAN) [20] applies Long Short-Term Memory (LSTM) with attention to generate data column by column on tabular datasets of three mixed variable types.
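As an illustration of how SMOTE generates synthetic samples, its core interpolation step can be sketched as follows. This is an illustrative reimplementation on toy data, not the smote-variants package [16] used later in this paper; all variable names are our own.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, n_new, k=5, seed=None):
    """Generate n_new synthetic minority samples by interpolating between
    a randomly chosen minority sample and one of its k nearest neighbours."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)            # idx[:, 0] is the point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))          # a random minority sample
        j = idx[i, rng.integers(1, k + 1)]    # one of its k neighbours
        gap = rng.random()                    # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

X_min = np.random.default_rng(0).normal(size=(12, 4))   # tiny minority class
X_new = smote_oversample(X_min, n_new=30, seed=1)
print(X_new.shape)  # (30, 4)
```

Because each synthetic point is a convex combination of two real minority samples, every generated value stays within the range of the original minority class, feature by feature.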

In this study, we also used GANs to oversample a small number of mRNA expression samples. GANs are difficult to use for data with a small sample size, especially when the number of features (genes) exceeds 10,000. To solve this problem, we first selected 300–800 genes, depending on the cancer type, using DNA mutation data and RF. We synthesized the expression profiles of the selected genes by applying GANs to gene expression of twelve cancer types, including STAD (Stomach adenocarcinoma), BRCA (Breast invasive carcinoma), HNSC (Head and Neck squamous cell carcinoma), KIRC (Kidney renal clear cell carcinoma), KIRP (Kidney renal papillary cell carcinoma), LUAD (Lung Adenocarcinoma), THCA (Thyroid carcinoma), READ (Rectal adenocarcinoma), ESCA (Esophageal carcinoma), KICH (Kidney Chromophobe), LIHC (Liver hepatocellular carcinoma), and LUSC (Lung squamous cell carcinoma), from the TCGA database [14]. We then classified the cancer stage of the augmented data using three classification methods. Comparison against the original data and against augmented data obtained using existing sample augmentation methods allowed us to confirm that the prediction accuracy of cancer stage was significantly improved.

This paper is organized as follows. In the Materials and Methods Section, we first describe data used for the experiment, selected features, and normalization algorithm. Then, the sample augmentation method using GAN and three classification algorithms are described. In the Results Section, we describe the characteristics of the augmented sample, and compare the effects of the five known algorithms and four GAN series that we implemented. We also verify whether our method is effective for small samples, and evaluate the importance of the selected genes. In the Discussion Section, we compare the selection criteria of our experiment with the results of other groups, and mention various fields in which our method could be applied.

  • We use feature selection based on DNA mutation data and a GAN for augmentation of mRNA expression data to increase the accuracy of our cancer-stage classification.
  • The augmented datasets obtained using the proposed method demonstrate a significant increase in classification accuracy.
  • By using a GAN and 1DCNN in the prediction of cancer stage, good results are obtained even with a small number of samples.

Materials and methods

Data preparation and feature selection

We downloaded mRNA and DNA mutation data from the TCGA database [14] for twelve cancer types, STAD, BRCA, HNSC, KIRC, KIRP, LUAD, THCA, READ, ESCA, KICH, LIHC, and LUSC, which have at least twelve samples for each of the four stages. From the downloaded data, we selected only samples whose DNA and RNA IDs matched and for which stage information existed. Specific information regarding the data is provided in Table 1.

As the feature space is too large relative to the number of samples for training the proposed model, we selected the most important features (= genes) for each dataset. An RF classifier [7, 21], which showed the best performance, was used to rank genes using DNA mutation data. Through iterative experiments, we set the p-value threshold to 0.004. The number of features selected for each cancer is shown in Table 1, and the list of genes is provided in S1 Table.
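The gene-ranking step can be sketched as follows. This is a simplified stand-in on toy data that ranks genes by RF impurity importance and keeps a top slice, rather than the exact p-value-thresholded procedure described above; the matrix shapes and cutoff of 300 are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_mut = rng.integers(0, 2, size=(120, 500))   # toy binary mutation matrix (samples x genes)
y_stage = rng.integers(1, 5, size=120)        # toy stage labels 1-4

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_mut, y_stage)
order = np.argsort(rf.feature_importances_)[::-1]   # gene indices, most important first
top_genes = order[:300]                             # keep a top slice (the paper kept 300-800 genes)
print(top_genes.shape)  # (300,)
```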

Finally, matched mRNA data with selected genes were normalized using ComBat [22] to correct batch effects.

Sample augmentation and classification algorithm

A Generative Adversarial Network (GAN) is composed of a generator and a discriminator, which are trained in parallel. Typically, the generative network learns to map from a latent space to a data distribution of interest, while the discriminative network distinguishes candidates produced by the generator from the true data distribution.

In this study, we used GANs to augment mRNA samples. When images are generated using GANs, random values are input to the generator. In our case, random values from a normal distribution with the mean and standard deviation of the training mRNA data are fed into the generator. The training data are 70% of the entire data, selected at random. Following a previous study [23], we used one hidden layer with 256 neurons for both the generator and the discriminator; the discriminator repeatedly learns to judge the randomly synthesized data and the real data as fake or real. The number of epochs used varies from 900 to 1,100 depending on the cancer type.
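The latent-space construction described above, with noise drawn per gene from a normal distribution parameterized by the training mRNA data, can be sketched as follows (a minimal illustration on toy data; the function and variable names are our own):

```python
import numpy as np

def make_latent(X_train, n_samples, seed=None):
    """Draw latent vectors from a normal distribution whose per-gene mean
    and standard deviation are taken from the training mRNA matrix."""
    rng = np.random.default_rng(seed)
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
    return rng.normal(loc=mu, scale=sigma, size=(n_samples, X_train.shape[1]))

# toy training matrix: 100 samples x 300 selected genes
X_train = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(100, 300))
Z = make_latent(X_train, n_samples=32, seed=1)
print(Z.shape)  # (32, 300)
```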

After the generator is trained, we generate n (= number of training samples) samples (GAN1), n * 20 samples (GAN20), and n * 100 samples (GAN100) using the trained generator, with the latent space generated from the mean and standard deviation values that were used to train the generator. The mean and standard deviation used to build the latent space in the Training Step are stored in a global variable and selected randomly as arguments of the Generating Step. The stage ratio of the training data is preserved in the augmented samples, which are then used as training data for classification of cancer stage.
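The per-stage bookkeeping that preserves the stage ratio during augmentation can be sketched as follows, using the BRCA-like training stage counts (110/383/152/12) cited in the Results; the helper function is illustrative only.

```python
import numpy as np

def augmented_counts(y_train, fold):
    """Number of synthetic samples to generate per stage so that the stage
    ratio of the training set is preserved (fold=1 for GAN1, 20 for GAN20, ...)."""
    stages, counts = np.unique(y_train, return_counts=True)
    return dict(zip(stages.tolist(), (counts * fold).tolist()))

# BRCA-like training labels: 110/383/152/12 samples for stages 1-4
y = np.repeat([1, 2, 3, 4], [110, 383, 152, 12])
print(augmented_counts(y, fold=5))  # {1: 550, 2: 1915, 3: 760, 4: 60}
```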

We used three types of classifiers: 1DCNN [24], DNNs, and RF [7]. 1DCNN was proposed to process 1-dimensional spectral channels. The 1DCNN we used for this study consists of two convolution layers, with 20 and 40 filters of kernel size 5 for the first and second convolution layers, respectively. For both layers, the pool size is two and ReLU is used as the activation function. After the convolution step, the flattening process is performed, and the flattened values are fed into a hidden layer of size 64. The activation function is ReLU, the optimizer is Adam, the batch size is 32, and the number of epochs is 1,000. For DNNs, we used three hidden layers of sizes 64, 32, and 4. The activation functions used are ReLU for the hidden layers and Softmax for the final layer; Adam is used as the optimizer. For RF, we used the RandomForestClassifier module of scikit-learn (version 0.23.2) in Python (version 3.5.2). The number of trees in the forest (n_estimators) is 100, oob_score (whether to use out-of-bag samples to estimate the generalization accuracy) is true, and random_state (the random seed) is 123456. We tried varying n_estimators (70, 100, and 130), and adopted 100 according to S3 Table.
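The RF configuration stated above corresponds to the following scikit-learn call, shown here on synthetic toy data rather than the actual expression matrices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# toy stand-in for an augmented expression matrix with four stage labels
X, y = make_classification(n_samples=300, n_features=50, n_classes=4,
                           n_informative=10, random_state=0)

# hyperparameters as stated in the text
rf = RandomForestClassifier(n_estimators=100, oob_score=True,
                            random_state=123456).fit(X, y)
print(round(rf.oob_score_, 3))  # out-of-bag estimate of generalization accuracy
```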

Finally, these classifiers were evaluated using the remaining 30% of the entire sample. The steps described above form one cycle, and are illustrated in (Fig 1).

Fig 1. Overview of sample augmentation using GANs and classification of stages using augmented samples.

https://doi.org/10.1371/journal.pone.0250458.g001

Results

Characteristics of augmented samples

As described in detail in the Methods, we augmented samples by constructing GANs composed of a Training Step and a Generating Step (as shown in Fig 1). These augmented samples were used to train three classifiers, and the remaining 30% of the original data were classified using those classifiers. To characterize the augmented samples and to confirm that they can be effectively used for cancer stage classification, we performed principal component analysis (PCA) on the original and augmented datasets.
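The PCA comparison can be sketched as follows. Toy matrices stand in for the original and GAN1 expression data; fitting a shared set of axes across both sets is our illustrative choice, made so that the two projections are directly comparable.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_orig = rng.normal(size=(60, 300))   # stand-in for the original expression matrix
X_aug = rng.normal(size=(60, 300))    # stand-in for GAN1 samples

# fit shared principal axes, then project both datasets onto them
pca = PCA(n_components=2).fit(np.vstack([X_orig, X_aug]))
P_orig, P_aug = pca.transform(X_orig), pca.transform(X_aug)
print(P_orig.shape, P_aug.shape)  # (60, 2) (60, 2)
```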

The first column of (Fig 2) shows PCA plots for the original dataset for eight cancer types, and we can see that the stages are not distinguished. However, we can see that the stages are clearly distinguished for GAN1 data. These results imply that augmented samples have different characteristics for each stage. The differences in the augmented samples are not the result of changes in gene expression patterns, however, as we can see that the distribution of gene expression is not very different between the original and augmented data, as shown in the third column in (Fig 2).

Fig 2. PCA plots for original and generated data for each cancer type.

First and second columns are PCA plots of original and generated data samples (GAN1), respectively (stage1: red, stage2: blue, stage3: green, stage4: yellow). Third column is PCA plot for genes of original (cyan) and generated data (GAN1, yellow).

https://doi.org/10.1371/journal.pone.0250458.g002

The effect of sample augmentation

To evaluate the effect of sample augmentation, we created three classification models (using RF, 1DCNN, and DNNs) for each of nine datasets: 1) the original dataset (Ori), 2) the original dataset with selected features only (FS), 3) data synthesized with mean and standard deviation (MS), 4) data synthesized using SMOTE [16] (SMOTE), 5) data synthesized using DA [17] (DA), 6) GAN1, 7) GAN5, 8) GAN20, and 9) GAN100. All experiments, over the twelve cancer types, nine datasets, and three classifiers, were repeated 10 times.

Features of the FS data are selected from DNA mutation data using an RF classifier, and are the same as those used to create GAN1, GAN5, GAN20, and GAN100. MS consists of samples randomly generated using the mean and half the standard deviation of the training samples of each stage. SMOTE data are generated using the basic algorithm in SMOTE [16].

SMOTE was proposed to handle imbalanced data. For example, if SMOTE is run on the 657 (110/383/152/12) training samples of BRCA, it generates 1,532 (383/383/383/383) samples. DA data are generated using a Denoising Autoencoder [17], which uses the denoising method to extract features that capture useful structure in the input distribution and eventually generates gene expression data. Given n samples and m features, DA generates n * floor(m / 5) + n samples, where floor(x) returns the largest integer not greater than x. For example, breast cancer has 659 training samples and 19,738 features, so 2,601,732 samples are generated. In (Fig 3), we can see that GAN1, GAN5, GAN20, and GAN100 show increased accuracy over the comparison datasets. S2 Table shows that most of the p-values from t-tests between the GAN results and the comparison results are < 0.05. In particular, all GAN5 datasets showed significantly increased accuracy, and most GAN20 datasets showed good accuracy.
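The DA sample-count formula above can be checked directly against the breast cancer example:

```python
from math import floor

def da_sample_count(n, m):
    """Samples generated by the DA expansion: n * floor(m / 5) + n."""
    return n * floor(m / 5) + n

# breast cancer: 659 training samples, 19,738 features
print(da_sample_count(659, 19738))  # 2601732
```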

Fig 3. Comparing classification accuracy using 1DCNN with different datasets.

Ori: Original data, FS: data with selected features, MS: randomly generated data using the mean and standard deviation of the FS data, DA: Denoising Autoencoder. P-values of t-tests between the four GAN results (GAN1, GAN5, GAN20 and GAN100) and the five comparison results (Ori, FS, MS, SMOTE and DA) are given in S2 Table.

https://doi.org/10.1371/journal.pone.0250458.g003

We can also see that the accuracies of FS increased by up to 9% compared to Ori, and the error bars are narrower except in the case of KIRP. In particular, the accuracy was 0.48 with the 19,738 gene features in BRCA, but increased to 0.57 using the 359 selected features. These results show the effect of gene selection using DNA mutation data.

Next, we compared the three classifiers, 1DCNN, DNNs, and RF. Tables 2–13 show the accuracy and F1 score for each dataset and for each cancer type, and also show that GAN1, GAN5, GAN20, and GAN100 demonstrate better predictive performance, regardless of classifier. Overall, 1DCNN and DNNs showed good results, and RF showed a poor F1 score.
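For reference, accuracy and F1 score can be computed with scikit-learn as follows. The labels are toy four-stage values, and the averaging mode for the multi-class F1 score is not stated in this paper, so macro averaging is shown here as an assumption.

```python
from sklearn.metrics import accuracy_score, f1_score

# toy four-stage labels and predictions
y_true = [1, 2, 2, 3, 4, 4]
y_pred = [1, 2, 3, 3, 4, 2]

print(accuracy_score(y_true, y_pred))             # ~0.667 (4 of 6 correct)
print(f1_score(y_true, y_pred, average="macro"))  # mean of per-stage F1 scores
```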

Table 2. Comparison result of three classifiers for LUAD (mean±sdv).

https://doi.org/10.1371/journal.pone.0250458.t002

Table 3. Comparison result of three classifiers for KIRC (mean±sdv).

https://doi.org/10.1371/journal.pone.0250458.t003

Table 4. Comparison result of three classifiers for STAD (mean±sdv).

https://doi.org/10.1371/journal.pone.0250458.t004

Table 5. Comparison result of three classifiers for READ (mean±sdv).

https://doi.org/10.1371/journal.pone.0250458.t005

Table 6. Comparison result of three classifiers for KIRP (mean±sdv).

https://doi.org/10.1371/journal.pone.0250458.t006

Table 7. Comparison result of three classifiers for HNSC (mean±sdv).

https://doi.org/10.1371/journal.pone.0250458.t007

Table 8. Comparison result of three classifiers for BRCA (mean±sdv).

https://doi.org/10.1371/journal.pone.0250458.t008

Table 9. Comparison result of three classifiers for THCA (mean±sdv).

https://doi.org/10.1371/journal.pone.0250458.t009

Table 10. Comparison result of three classifiers for ESCA (mean±sdv).

https://doi.org/10.1371/journal.pone.0250458.t010

Table 11. Comparison result of three classifiers for KICH (mean±sdv).

https://doi.org/10.1371/journal.pone.0250458.t011

Table 12. Comparison result of three classifiers for LIHC (mean±sdv).

https://doi.org/10.1371/journal.pone.0250458.t012

Table 13. Comparison result of three classifiers for LUSC (mean±sdv).

https://doi.org/10.1371/journal.pone.0250458.t013

Next, we examined whether the proposed sample augmentation method is effective for datasets with small samples. We used the whole samples, as well as randomly selected 50%, 30%, and 10% of the samples, from the BRCA, LUAD, and KIRC datasets, and applied 1DCNN. The results are shown as 100O, 50O, 30O, and 10O in (Fig 4). We next expanded the sampled datasets 5 times (GAN5) and applied 1DCNN. The results are shown as 100G, 50G, 30G, and 10G in (Fig 4). We can see that reducing the number of samples lowers the classification accuracy; however, accuracies are much higher when samples are augmented. More importantly, we can see that the decrease in accuracy is generally smaller when samples are augmented. These results imply that the proposed method is effective for small datasets. Lastly, we performed experiments to determine the optimal fold for sample augmentation. We compared classification accuracies from samples augmented by 1, 5, 10, 20, 30, 50, 70, and 100 fold. The results are shown in (Fig 5). In general, we can conclude that the optimal fold differs for different cancer types; however, 5 fold (GAN5) demonstrates generally good results.
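The subsampling used here, reducing the sample count while keeping the stage ratio, can be sketched with a stratified split (a minimal illustration on toy data; the 500-sample matrix and the 200/150/100/50 stage counts are invented for this example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))                     # toy expression matrix
y = np.repeat([1, 2, 3, 4], [200, 150, 100, 50])    # toy stage labels

# keep 30% of the samples while preserving the stage ratio
X_30, _, y_30, _ = train_test_split(X, y, train_size=0.3,
                                    stratify=y, random_state=0)
print(len(X_30))  # 150
```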

Fig 4. Classification accuracy for randomly sampled data.

100G, 50G, 30G and 10G indicate classification accuracies using 5 times augmented data from 100%, 50%, 30%, and 10% randomly selected samples, respectively. 100O, 50O, 30O, and 10O indicate classification accuracies using 100%, 50%, 30%, and 10% randomly selected samples (same as those for 100G/50G/30G/10G), respectively. Classification algorithm used is 1DCNN, and each random sampling was performed 10 times.

https://doi.org/10.1371/journal.pone.0250458.g004

Fig 5. Optimal augmentation fold.

Each fold was repeated 10 times.

https://doi.org/10.1371/journal.pone.0250458.g005

Analysis of selected genes

In this section, we additionally verified that the genes were selected properly. Eight published TCGA studies [25–32] included hyper-mutated genes and non-hyper-mutated genes. Among those genes, we selected significantly mutated genes for each cancer using MutSig [33, 34] and MuSiC [35], and summarized them in Table 14.

The genes selected in the current study (S1 Table) frequently harbor those mutations. For example, 17% (KIRP), 17% (LUAD), 24% (STAD), 26% (BRCA), 32% (HNSC), 33% (READ), 38% (THCA), and 67% (KIRC) of the selected genes overlapped with the genes in Table 14, as shown in (Fig 6).

Fig 6. Association of pathways with diseases of selected genes.

https://doi.org/10.1371/journal.pone.0250458.g006

KIRC matches six out of nine genes. VHL and PBRM1 are major genes, mutated in more than 40% of clear cell renal cell carcinomas, and SETD2 and PTEN, which are also quite frequent, cause both copy number loss and mutation. In THCA, BRAF is the most important gene, with 60% missense mutations and more than 2% fusions, and the selected list includes most major oncogenes such as NRAS, TP53, PTEN, and RB1.

The selected genes also overlap with genes in the Online Mendelian Inheritance in Man (OMIM) database (23–27% [36]), the Clusters of Orthologous Groups of proteins (COG) database (about 15% [37]), and the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (33–43% [38]). We can see that the overlapping percentages with KEGG are the largest in general, which means that a significant number of the selected genes are important genes involved in pathways. The PI(3)K/AKT/MTOR pathway (altered in 28% of tumors) has been shown to be important in KIRC by papers published by TCGA, and the genes in S1 Table match the PI3K-AKT pathway with p-value 0.026. The list contains most of the upstream genes of the AKT pathway, for example PIK3CA, PTEN, Receptor Tyrosine Kinase (RTK)-related genes (EPHB, PDGFR) and Integrin Subunit (ITG)-related genes (ITGA7, ITGA9, ITGA11, ITGB1BP, LABA, LAMB, THBS).

A Warburg effect-like state achieved through downregulation of AMP-activated kinase (AMPK) and upregulation of acetyl-CoA carboxylase (ACC) has also been shown to be important in cancers. Among the genes in S1 Table, ATP binding transporter pathway genes (ABCA, ABCB, ABCC, CFTR), extracellular matrix receptor (ECM) genes (COL1, COL4, COL5, COL6, COL11, ITGA, LAMA, LAMB, LAMC, THBS, TNC, TNN, TNXB, AGRN), and Krebs cycle genes (ACAT, ACOX, ACSBG, ADH1, CAMK1, CAMK2G, ECI, FBP, PFKFB, PDHA, SIRT3, SLC2A) were found.

Discussion

We noted in (Fig 3) that both GAN5 and GAN20 show good results, in that their error bars are generally narrower than those of GAN1 in most of the carcinomas. This observation indirectly demonstrates that increasing the number of samples leads to increased classification accuracy. In addition, Tables 2–13 confirm that the 1DCNN classification method was excellent in both accuracy and F1 score. In Liu et al.'s paper [17], Sample Expansion-Based 1DCNN (SE1DCNN), a method of obtaining a large number of samples through multiple, partially corrupted inputs, improved accuracy by 1–9% compared to the method using only 1DCNN. In addition, the Sample Expansion-Based SAE (SESAE) method improved accuracy by 2–17% compared to using only the Stacked Autoencoder (SAE). This confirms that combining a good sample augmentation method with a good classification model yields better performance, and the development of such combined models is always required.

The optimal number of samples differs for different cancer types, as observed in (Fig 5). Our model used one hidden layer with 256 neurons, which is the most suitable size for an imbalanced data set according to a previous study [23]. However, further study of the remaining five options (256/512/102, 256/512, 128/256/512, 128/256, and 128) is needed. In addition, optimization of the hyperparameters used in our GAN model (such as the learning rate, number of epochs, cost function, and hidden layer units) needs additional work.

In addition to the DNA mutation data used for feature selection in this study, various combinations of more omics data such as mRNA, DNA methylation, and miRNA data can be used to further increase the classification accuracy. Application of those data combinations will be the focus of our follow-up work. Moreover, various recently developed deep generative models such as DCGAN, cycleGAN, and Variational Autoencoder, could be explored for more accurate classification, which could be our future study.

Conclusions

In this paper, we proposed a sample augmentation method using GANs, and showed that augmented samples significantly increased the classification accuracy of cancer stages. In particular, we confirmed that the proposed method is efficient for datasets with a small number of samples. Therefore, the proposed sample augmentation method can also be applied for other purposes, such as prognostic prediction or cancer classification.

  • Advantages
    • The proposed method can generate additional data samples more accurately, which can increase the accuracy of cancer-stage prediction.
    • The proposed method can generally be applied to other types of mRNA expression data whose aim differs from cancer-stage prediction.
  • Disadvantages
    • If the number of features is large, the learning time is significantly longer than with other machine learning approaches such as random forest or gradient boosting.

Supporting information

S1 Table. List of all genes selected by feature selection.

https://doi.org/10.1371/journal.pone.0250458.s001

(XLSX)

S2 Table. Results of t-test and Friedman test.

https://doi.org/10.1371/journal.pone.0250458.s002

(XLSX)

S3 Table. The result of newly added cancers and random forest.

https://doi.org/10.1371/journal.pone.0250458.s003

(XLSX)

References

  1. 1. Kamarajah SK, Burns WR, Frankel TL, Cho CS, Nathan H. Validation of the American Joint Commission on Cancer (AJCC) staging system for patients with pancreatic adenocarcinoma: a Surveillance, Epidemiology and End Results (SEER) analysis. Annals of surgical oncology. 2017;24(7):2023–30. pmid:28213792
  2. 2. Cates JM. The AJCC 8th edition staging system for soft tissue sarcoma of the extremities or trunk: a cohort study of the SEER database. Journal of the National Comprehensive Cancer Network. 2018;16(2):144–52. pmid:29439175
  3. 3. Wang M, Chen H, Wu K, Ding A, Zhang M, Zhang P. Evaluation of the prognostic stage in the 8th edition of the American Joint Committee on Cancer in locally advanced breast cancer: an analysis based on SEER 18 database. The Breast. 2018;37:56–63. pmid:29100045
  4. 4. Shao N, Xie C, Shi Y, Ye R, Long J, Shi H, et al. Comparison of the 7th and 8th edition of American Joint Committee on Cancer (AJCC) staging systems for breast cancer patients: a Surveillance, Epidemiology and End Results (SEER) analysis. Cancer management and research. 2019;11:1433. pmid:30863154
  5. 5. Shi S, Xie H, Yin W, Zhang Y, Peng X, Yu F, et al. The prognostic significance of the 8th edition AJCC TNM staging system for non–small‐cell lung cancer is not applicable to lung cancer as a second primary malignancy. Journal of Surgical Oncology. 2020.
  6. 6. Qiu M-Z, Wang Z-X, Zhou Y-X, Yang D-J, Wang F-H, Xu R-H. Proposal for a new TNM stage based on the 7th and 8th American Joint Committee on Cancer pTNM staging classification for gastric cancer. Journal of Cancer. 2018;9(19):3570. pmid:30310514
  7. 7. Cutler A, Cutler DR, Stevens JR. Random forests. Ensemble machine learning: Springer; 2012. p. 157–75.
  8. 8. Gupta P, Chiang S-F, Sahoo PK, Mohapatra SK, You J-F, Onthoni DD, et al. Prediction of Colon Cancer Stages and Survival Period with Machine Learning Approach. Cancers. 2019;11(12):2007. pmid:31842486
  9. 9. Kaur H, Bhalla S, Raghava GP. Classification of early and late stage liver hepatocellular carcinoma patients from their genomics and epigenomics profiles. PloS one. 2019;14(9).
  10. 10. Roy S, Kumar R, Mittal V, Gupta D. Classification models for Invasive Ductal Carcinoma Progression, based on gene expression data-trained supervised machine learning. Scientific Reports. 2020;10(1):1–15. pmid:31913322
  11. 11. De Bari B, Vallati M, Gatta R, Lestrade L, Manfrida S, Carrie C, et al. Development and validation of a machine learning-based predictive model to improve the prediction of inguinal status of anal cancer patients: A preliminary report. Oncotarget. 2017;8(65):108509. pmid:29312547
  12. 12. Garapati SS, Hadjiiski L, Cha KH, Chan HP, Caoili EM, Cohan RH, et al. Urinary bladder cancer staging in CT urography using machine learning. Medical physics. 2017;44(11):5814–23. pmid:28786480
  13. 13. Cosma G, Acampora G, Brown D, Rees RC, Khan M, Pockley AG. Prediction of pathological stage in patients with prostate cancer: a neuro-fuzzy model. PLoS One. 2016;11(6). pmid:27258119
  14. 14. Tomczak K, Czerwińska P, Wiznerowicz M. The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemporary oncology. 2015;19(1A):A68. pmid:25691825
  15. 15. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research. 2002;16:321–57.
  16. 16. Kovács G. Smote-variants: A python implementation of 85 minority oversampling techniques. Neurocomputing. 2019;366:352–4.
  17. 17. Liu J, Wang X, Cheng Y, Zhang L. Tumor gene expression data classification via sample expansion-based deep learning. Oncotarget. 2017;8(65):109646. pmid:29312636
  18. 18. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al., editors. Generative adversarial nets. Advances in neural information processing systems; 2014.
  19. 19. Park N, Mohammadi M, Gorde K, Jajodia S, Park H, Kim Y. Data synthesis based on generative adversarial networks. Proceedings of the VLDB Endowment. 2018;11(10):1071–83.
  20. 20. Xu L, Veeramachaneni K. Synthesizing tabular data using generative adversarial networks. arXiv preprint arXiv:181111264. 2018.
  21. 21. Breiman L. Bias, variance, and arcing classifiers. Tech. Rep. 460, Statistics Department, University of California, Berkeley …, 1996.
  22. 22. Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8(1):118–27. pmid:16632515
  23. 23. Tanaka FHKdS, Aranha C. Data augmentation using GANs. arXiv preprint arXiv:190409135. 2019.
  24. 24. Hu W, Huang Y, Wei L, Zhang F, Li H. Deep convolutional neural networks for hyperspectral image classification. Journal of Sensors. 2015;2015.
  25. 25. Network CGA. Comprehensive molecular characterization of human colon and rectal cancer. Nature. 2012;487(7407):330. pmid:22810696
  26. 26. Network CGA. Comprehensive molecular portraits of human breast tumours. Nature. 2012;490(7418):61. pmid:23000897
  27. 27. Network CGAR. Comprehensive molecular characterization of gastric adenocarcinoma. Nature. 2014;513(7517):202–9. pmid:25079317
  28. 28. Agrawal N, Akbani R, Aksoy BA, Ally A, Arachchi H, Asa SL, et al. Integrated genomic characterization of papillary thyroid carcinoma. Cell. 2014;159(3):676–90. pmid:25417114
  29. 29. Network CGA. Comprehensive genomic characterization of head and neck squamous cell carcinomas. Nature. 2015;517(7536):576–82. pmid:25631445
  30. 30. Network CGAR. Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature. 2013;499(7456):43. pmid:23792563
  31. 31. Network CGAR. Comprehensive molecular profiling of lung adenocarcinoma. Nature. 2014;511(7511):543–50. pmid:25079552
  32. 32. Network CGAR. Comprehensive molecular characterization of papillary renal-cell carcinoma. New England Journal of Medicine. 2016;374(2):135–45.
  33. 33. Lawrence MS, Stojanov P, Polak P, Kryukov GV, Cibulskis K, Sivachenko A, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature. 2013;499(7457):214–8. pmid:23770567
  34. 34. Lawrence MS, Stojanov P, Mermel CH, Robinson JT, Garraway LA, Golub TR, et al. Discovery and saturation analysis of cancer genes across 21 tumour types. Nature. 2014;505(7484):495–501. pmid:24390350
  35. 35. Dees ND, Zhang Q, Kandoth C, Wendl MC, Schierding W, Koboldt DC, et al. MuSiC: identifying mutational significance in cancer genomes. Genome research. 2012;22(8):1589–98. pmid:22759861
  36. 36. Amberger JS, Bocchini CA, Scott AF, Hamosh A. Omim. org: leveraging knowledge across phenotype–gene relationships. Nucleic acids research. 2019;47(D1):D1038–D43. pmid:30445645
  37. 37. Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, et al. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic acids research. 2001;29(1):22–8. pmid:11125040
  38. 38. Kanehisa M, Furumichi M, Tanabe M, Sato Y, Morishima K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic acids research. 2017;45(D1):D353–D61. pmid:27899662