Prediction of hepatitis E using machine learning models

Yanhui Guo; Yi Feng; Fuli Qu; Li Zhang; Bingyu Yan; Jingjing Lv

doi:10.1371/journal.pone.0237750

Abstract

Background

Accurate and reliable predictions of infectious disease can be valuable to public health organizations that plan interventions to decrease or prevent disease transmission. A great variety of models have been developed for this task. However, for different data series, the performance of these models varies. Hepatitis E, as an acute liver disease, has been a major public health problem. Which model is more appropriate for predicting the incidence of hepatitis E? In this paper, three different methods are used and the performance of the three methods is compared.

Methods

Autoregressive integrated moving average (ARIMA), support vector machine (SVM) and long short-term memory (LSTM) recurrent neural network models were adopted and compared. ARIMA was implemented in Python with the statsmodels package. SVM was implemented in MATLAB with the libSVM library. The LSTM model was built with Keras, a deep learning library. To tackle the overfitting caused by the limited number of training samples, we adopted dropout and regularization strategies in our LSTM model. Experimental data were the monthly incidence and cases number of hepatitis E from January 2005 to December 2017 in Shandong province, China. We selected data from July 2015 to December 2017 to validate the models, and the rest was used as the training set. Three metrics were applied to compare the performance of the models: root mean square error (RMSE), mean absolute percentage error (MAPE) and mean absolute error (MAE).

Results

Based on analysis of the data, we selected ARIMA(1, 1, 1) and ARIMA(3, 1, 2) as the monthly incidence and cases number prediction models, respectively. Cross-validation and grid search were used to optimize the SVM parameters. The penalty coefficient C and kernel function parameter g were set to 8 and 0.125 for incidence prediction, and to 22 and 0.01 for cases number prediction. The LSTM hidden layer has 4 nodes. The dropout and L2 regularization parameters were set to 0.15 and 0.001, respectively. For RMSE, we obtained 0.022, 0.0204 and 0.01 for incidence prediction using ARIMA, SVM and LSTM, respectively, and 22.25, 20.0368 and 11.75 for cases number prediction. For MAPE, the results were 23.5%, 21.7% and 15.08% for incidence prediction, and 23.6%, 21.44% and 13.6% for cases number prediction. For MAE, the results were 0.018, 0.0167 and 0.011 for incidence prediction, and 18.003, 16.5815 and 9.984 for cases number prediction.

Conclusions

Comparing ARIMA, SVM and LSTM, we found that the nonlinear models (SVM, LSTM) outperform the linear model (ARIMA). LSTM obtained the best performance in all three metrics: RMSE, MAPE and MAE. Hence, LSTM is the most suitable of the three for predicting hepatitis E monthly incidence and cases number.

Citation: Guo Y, Feng Y, Qu F, Zhang L, Yan B, Lv J (2020) Prediction of hepatitis E using machine learning models. PLoS ONE 15(9): e0237750. https://doi.org/10.1371/journal.pone.0237750

Editor: Tao Song, Polytechnical Universidad de Madrid, SPAIN

Received: May 25, 2020; Accepted: August 1, 2020; Published: September 17, 2020

Copyright: © 2020 Guo et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: According to Chinese laws and regulations, only government departments at or above the provincial level have the right to publish epidemic data. At present, due to the low conditions, Shandong Health Committee has only published parts of the data. The url of published data is http://wsjkw.shandong.gov.cn/. If readers want to reproduce our experiment, they can access our github (https://github.com/guoyanhui03/dataset.git). All other relevant data are within the manuscript and Supporting Information files.

Funding: This work was supported by ZhiFei Disease Prevention and Control Technology Research Fund Project (No. LYH2017-08) to YF, Open Research Fund of Shandong Provincial Key Laboratory Of Infectious Disease Control and Prevention, Shandong Center for Disease Control and Prevention (No. 2017KEYLAB01) to YG, Shandong Medical Health Science and Technology Development Programs (No. 2018WS309) to YF, Science and Technology Project for the Universities of Shandong Province (No. J18KB171) to YG, Discipline Talent Team Cultivation Program of Shandong Women’s University (No. 1904) to YG, and Shandong Women’s University High level scientific research project Cultivation Fund (No. 2019GSPGJ07) to YG.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Viral hepatitis is recognized as one of the most frequently reported diseases, and hepatitis E, an acute liver disease, has been a major public health problem [1]. Every year, there are an estimated 20 million hepatitis E infections worldwide, leading to over 3 million symptomatic cases and 55,000 hepatitis E-related deaths. The prevalence is highest in East and South Asia [2]. Sporadic hepatitis E has caused over 50% of acute viral hepatitis cases in recent years [3], imposing a huge social, economic, and health burden. However, incidence data, which rely on hospital-based reporting, often lag behind. To better mitigate future outbreaks, it is necessary to accurately predict the incidence of hepatitis E. The US Centers for Disease Control and Prevention (CDC) has openly endorsed adopting models to inform decision making [4].

Hepatitis E is transmitted by the fecal-oral route through contaminated water. Outbreaks have mainly occurred in developing countries in Asia, Africa and Central America [5]. In recent years, the incidence of and deaths from hepatitis E have been higher than those of hepatitis A, and the incidence is on the rise. With the development of information technology, the CDC has accumulated a large amount of historical data on hepatitis E. Effective use of these data to predict the incidence can reduce the risk of hepatitis E. However, research on the prediction and early warning of infectious diseases mainly focuses on dengue [6, 7], influenza [8], AIDS [9], and hepatitis B [10, 11]. There are few studies on hepatitis E incidence. Hence, this paper focuses on the prediction of hepatitis E incidence.

For the prediction of infectious diseases, researchers mainly adopt time series methods. Originally, linear estimation methods were applied to the prediction of infectious diseases, including Autoregressive (AR) [12], Moving Average (MA) [13] and Autoregressive Moving Average (ARMA) [14] models. However, these methods are only applicable to stationary data. Subsequently, the ARIMA model was proposed, which is better suited to data with a trend. The authors of [11] adopted ARIMA to predict the incidence of hepatitis B and showed that the ARIMA model outperformed the grey model GM(1,1). An improved ARIMA model, called SARIMA, which takes into account recent and seasonal patterns, has been shown to produce useful disease estimates. The authors of [15] adopted the SARIMA model to capture a substantial amount of dengue variability, and obtained better results.

Another mainstream approach to time series analysis uses artificial intelligence methods, such as the Markov model [16], artificial neural network [17] and support vector machine (SVM) [18]. Because of its generalization ability and its ability to handle high-dimensional nonlinear regression estimation, SVM has been successfully used in many fields of time series prediction, including financial prediction [19] and disease prediction [20]. The authors of [20] applied SVM to predict dengue incidence and obtained better results.

At present, deep learning, represented by the Convolutional Neural Network (CNN) [21] and the Recurrent Neural Network (RNN) [22], has revolutionized various fields, owing to its powerful feature extraction and representation capabilities. Among them, CNN is widely used in image recognition, classification and other visual problems. RNN is a powerful approach to analyzing temporal data, and is widely used in natural language processing [22], speech recognition [23] and so on. However, when there are many recursions, RNN fails in practice due to problems with vanishing gradients [24]. Moreover, RNN cannot satisfy the multimodal case because of parameter sharing. Fortunately, Long Short-Term Memory (LSTM), a variant of the RNN, compensates for these shortcomings. LSTM has been used in various fields, including traffic flow prediction [25], financial prediction [26] and infectious disease prediction [27]. The authors of [27] applied LSTM to model seasonality and trends in hand-foot-mouth disease incidence, and obtained good results.

In this paper, ARIMA, SVM and LSTM were used to predict the monthly incidence of hepatitis E in Shandong Province. We use RMSE, MAPE and MAE to evaluate the three methods. In particular, the LSTM model obtained the best performance of the three. The model building and comparison offer guidance on model selection, and the predicted results may serve as a reference for hepatitis E prevention. Meanwhile, these methods are general and could also be suitable for predicting other diseases.

Materials and methods

Materials source

We obtained publicly available data about hepatitis E in Shandong Province, China between 2005 and 2017 from the Shandong Center for Disease Control and Prevention (SCDC). The data mainly include the monthly incidence and monthly cases number of hepatitis E in Shandong. Monthly incidence means the number of cases per 100,000 people in Shandong Province, as shown in Fig 1. Monthly cases number is the aggregated number of confirmed cases in a month, as shown in Fig 2.


Fig 1. Monthly incidence of hepatitis E from January 2005 to December 2017.

https://doi.org/10.1371/journal.pone.0237750.g001


Fig 2. Monthly cases number of hepatitis E from January 2005 to December 2017.

https://doi.org/10.1371/journal.pone.0237750.g002

ARIMA model

The ARIMA model combines an autoregressive (AR) model and a moving average (MA) model. The model is expressed as ARIMA(p, d, q), in which p is the order of autoregression, d is the degree of differencing used to remove trend, and q is the order of the moving average. First, we determine the parameter d by evaluating the stationarity of the data. Then, we determine p and q by analyzing the autocorrelation and partial autocorrelation. Finally, training and prediction are carried out.

Data stationarity analysis.

In the analysis of time series, the basic assumption is the stationarity and ergodicity of the series. An important tool for testing the stationarity of a time series is the Augmented Dickey-Fuller (ADF) unit root test. If the data series is not stationary, a transformation is needed to make it stationary, such as a logarithmic transform, smoothing, differencing or decomposition. In this paper, differencing is used to meet the requirement of the ARIMA model.
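The differencing step can be illustrated with a minimal sketch. The function name `difference` and the toy series are assumptions made for illustration only; they are not the hepatitis E data, and the ADF test itself is not reproduced here. One round of differencing corresponds to d = 1 in ARIMA(p, d, q).

```python
def difference(series, order=1):
    """Apply first differencing `order` times: y_t = x_t - x_{t-1}."""
    result = list(series)
    for _ in range(order):
        # Each pass shortens the series by one element.
        result = [later - earlier for earlier, later in zip(result, result[1:])]
    return result


# Toy series with a linear trend (illustrative values only).
trend = [2.0, 4.0, 6.0, 8.0, 10.0]
first_diff = difference(trend)        # the linear trend becomes a constant
second_diff = difference(trend, 2)    # differencing again leaves zeros
print(first_diff)
print(second_diff)
```

A series whose first difference looks stationary would typically be modeled with d = 1, which matches the two ARIMA orders reported in this paper.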

Parameter estimation.

After the stationarity analysis, we can determine the parameter d. The parameters p and q of the ARIMA model are estimated from the autocorrelation function (ACF) and partial autocorrelation function (PACF). To obtain a more efficient ARIMA model, we optimize p and q by grid search according to the Bayesian information criterion (BIC). Finally, we choose ARIMA(1, 1, 1) and ARIMA(3, 1, 2) for incidence and cases number prediction of hepatitis E, respectively.
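For reference, the BIC is commonly defined as follows. The symbols are assumptions introduced here and do not appear in the original text: k is the number of estimated model parameters, n is the number of observations, and \hat{L} is the maximized value of the likelihood. Among the candidate (p, q) pairs, the one with the smallest BIC is preferred.

```latex
\mathrm{BIC} = k \ln(n) - 2 \ln(\hat{L})
```

Because the penalty term k ln(n) grows with the number of parameters, the criterion favors lower-order models when the gain in likelihood is small.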

SVM model

SVM was proposed by Vapnik [28] and is widely applied to classification and regression problems. Through kernel functions, SVM can handle nonlinear problems, and it improves the generalization ability of the model by minimizing structural risk. SVM applied to regression is also called support vector regression (SVR). In this work, we use libSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm), designed by Lin [29], to implement hepatitis E incidence prediction.

Data preprocessing and modeling.

Firstly, we normalize the raw data to [0, 1] by min-max normalization. The normalization formula is x_norm = (x − x_min)/(x_max − x_min), where x denotes the raw data, and x_min and x_max are the minimum and maximum values, respectively. Based on the autocorrelation of the hepatitis data, we use the previous three observations to predict the next one, x_t = f(x_{t−3}, x_{t−2}, x_{t−1}). We choose 80% of the data as the training set and the rest as the test set.
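The preprocessing described here can be sketched in a few lines. The function names, the toy series and the chronological order of the split are assumptions made for illustration; the original implementation was written in MATLAB and is not shown in the text.

```python
def min_max_normalize(values):
    """Scale values to [0, 1] with x_norm = (x - x_min) / (x_max - x_min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def make_windows(series, lag=3):
    """Build (x_{t-3}, x_{t-2}, x_{t-1}) -> x_t samples from a series."""
    n_samples = len(series) - lag
    inputs = [series[i:i + lag] for i in range(n_samples)]
    targets = [series[i + lag] for i in range(n_samples)]
    return inputs, targets


def chronological_split(inputs, targets, train_fraction=0.8):
    """Keep the earliest samples for training and the latest for testing."""
    cut = int(len(inputs) * train_fraction)
    return (inputs[:cut], targets[:cut]), (inputs[cut:], targets[cut:])


# Toy monthly values (illustrative only, not the Shandong data).
raw = [3.0, 5.0, 4.0, 7.0, 6.0, 8.0, 5.0, 9.0, 7.0, 11.0, 10.0, 13.0, 12.0]
scaled = min_max_normalize(raw)
X, y = make_windows(scaled, lag=3)
(train_X, train_y), (test_X, test_y) = chronological_split(X, y)
print(len(train_X), len(test_X))
```

Splitting in time order, rather than at random, keeps future observations out of the training set, which is consistent with validating on the last months of the series.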

Parameters setting of SVM.

The kernel function, which serves as the similarity metric between samples, is a key factor that affects the SVM model. We adopted the radial basis function (RBF) as the kernel function. In addition, the penalty coefficient C and the kernel parameter g are important parameters that affect the performance of SVM. We use grid search to find the optimal combination of parameters. C and g each range from 2⁻¹⁰ to 2⁵, with the exponent varied in steps. In our experiment on hepatitis E incidence, C and g were set to 8 and 0.125, respectively. For the prediction of hepatitis E cases number, C and g were set to 22 and 0.01, respectively.
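As a rough illustration of the search space, the sketch below builds a grid of (C, g) candidates and evaluates the RBF kernel in the form exp(−g‖u − v‖²) used by libSVM. The exponent step of 1 and all names are assumptions, since the step size is not stated in the text; the cross-validated SVM training that would score each candidate is omitted.

```python
import math


def rbf_kernel(u, v, g):
    """RBF kernel exp(-g * ||u - v||^2) between two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-g * squared_distance)


# Assumed grid: exponents -10..5 in steps of 1 for both C and g.
exponents = range(-10, 6)
grid = [(2.0 ** c_exp, 2.0 ** g_exp) for c_exp in exponents for g_exp in exponents]

print(len(grid))                                     # number of (C, g) candidates
print(rbf_kernel([0.2, 0.4, 0.6], [0.2, 0.4, 0.6], g=0.125))  # identical inputs
```

A larger g makes the kernel decay faster with distance, so the model fits more locally; a larger C penalizes training errors more heavily.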

LSTM model

LSTM is a variant of the RNN that can handle long-term dependencies in sequential data, for which the gradients of a standard RNN tend to vanish. This ability is mainly due to the memory unit, usually referred to as the cell state. The cell determines whether information is useful and saves the useful information. There are three gates in a cell, called the input gate, forget gate and output gate. In this paper, we implemented LSTM with the help of Keras (https://pypi.org/project/Keras/).
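The role of the three gates can be made concrete with the standard LSTM update equations. The notation is an assumption introduced here, not taken from the original text: x_t is the input at time t, h_t the hidden state, c_t the cell state, σ the logistic sigmoid, ⊙ the element-wise product, and W, U, b the learned weights and biases of each gate.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)} \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```

The additive update of c_t is what lets gradients flow across many time steps: when f_t stays close to 1, the cell state is carried forward almost unchanged.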

Data preprocessing and modeling.

LSTM can be used for sequence prediction, sequence classification, sequence generation and sequence to sequence prediction. In this work, we model hepatitis E prediction as sequence to sequence prediction. Two key factors of LSTM modeling are the feature and the time step. We take the monthly incidence or monthly cases number as the feature. The time step is set to 3, which means that we predict the next monthly incidence of hepatitis E from the previous three monthly values. The form of input and output is shown in Fig 3. Then, we choose as the value of prediction. In this model, the numbers of nodes in the input, hidden and output layers are set to 1, 4 and 1, respectively.


Fig 3. Structure of LSTM model.

https://doi.org/10.1371/journal.pone.0237750.g003

For data preprocessing, we also normalized the raw data to [0, 1], as in the SVM method.

Parameters setting of LSTM.

The data set of hepatitis E incidence has only 156 elements, so the model is prone to over-fitting. To overcome this problem, we adopt dropout and regularization strategies. The dropout parameter between the hidden layer and the output layer is set to 0.15. The regularization parameter is set to 0.001. The number of training iterations is set to 220. In addition, the optimization algorithm affects both the convergence speed and the quality of the final model. We found that Adam is faster and better than the stochastic gradient descent (SGD) optimization method.
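To make the two anti-overfitting settings concrete, the sketch below shows what a dropout rate of 0.15 and an L2 coefficient of 0.001 do numerically. It is a plain-Python illustration and not the Keras model used in this paper; the function names, the activation values and the weights are hypothetical. It follows the inverted-dropout convention, in which surviving activations are scaled by 1/(1 − rate) during training, and the usual L2 penalty of the coefficient times the sum of squared weights.

```python
import random


def inverted_dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; scale survivors by 1/(1 - rate)."""
    keep_probability = 1.0 - rate
    return [a / keep_probability if rng.random() >= rate else 0.0 for a in activations]


def l2_penalty(weights, coefficient):
    """L2 regularization term added to the training loss."""
    return coefficient * sum(w * w for w in weights)


rng = random.Random(0)                 # fixed seed so the sketch is repeatable
hidden = [0.5, -0.2, 0.8, 0.1]         # hypothetical outputs of the 4 hidden nodes
dropped = inverted_dropout(hidden, rate=0.15, rng=rng)
penalty = l2_penalty([0.3, -0.4], coefficient=0.001)
print(dropped)
print(penalty)
```

Dropout is applied only during training; at prediction time all hidden outputs are used, and the scaling keeps their expected magnitude the same in both phases.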

Comparison metrics

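To compare the performance of the three models fairly, we apply three commonly used quality indexes: Root Mean Square Error (RMSE), Mean Absolute Percent Error (MAPE) and Mean Absolute Error (MAE). RMSE measures the dispersion of the prediction errors, as shown in formula 1, and tends to be dominated by large errors. MAPE is widely used to measure the quality of a prediction model, as shown in formula 2; the smaller the MAPE value, the more accurate the prediction model. MAE shows the average magnitude of the actual prediction error, as shown in formula 3. Here y_i denotes the observed value, ŷ_i the predicted value, and n the number of predictions.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{1}$$

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \tag{2}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{3}$$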
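The three metrics can be computed directly from paired observed and predicted values. The sketch below uses their standard definitions; the function names and the toy numbers are assumptions for illustration and are not results from this study. MAPE is returned in percent and assumes that no observed value is zero.

```python
import math


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy values (illustrative only).
actual = [10.0, 20.0, 40.0]
predicted = [12.0, 18.0, 40.0]
print(rmse(actual, predicted), mape(actual, predicted), mae(actual, predicted))
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same data, and the gap widens when a few predictions are far off.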