An Adaptive Moment estimation method for online AUC maximization

Xin Liu; Zhisong Pan; Haimin Yang; Xingyu Zhou; Wei Bai; Xianghua Niu

doi:10.1371/journal.pone.0215426

Abstract

Area Under the ROC Curve (AUC) is a widely used metric for measuring classification performance. Developing AUC maximization algorithms is of both theoretical and academic value. Traditional methods often apply batch learning algorithms to maximize AUC, which is inefficient and does not scale to large applications. Recently, some online learning algorithms have been introduced to maximize AUC by going through the data only once. However, these methods sometimes fail to converge to an optimal solution because their learning rates are fixed or decay too rapidly. To tackle this problem, we propose AdmOAM, an Adaptive Moment estimation method for Online AUC Maximization. It uses estimates of the moments of the gradients to accelerate convergence and to mitigate the rapid decay of the learning rates. We establish the regret bound of the proposed algorithm and conduct extensive experiments to demonstrate its effectiveness and efficiency.

Citation: Liu X, Pan Z, Yang H, Zhou X, Bai W, Niu X (2019) An Adaptive Moment estimation method for online AUC maximization. PLoS ONE 14(4): e0215426. https://doi.org/10.1371/journal.pone.0215426

Editor: Yossiri Adulyasak, HEC Montréal, CANADA

Received: February 24, 2018; Accepted: April 3, 2019; Published: April 23, 2019

Copyright: © 2019 Liu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are available from the LIBSVM (www.csie.ntu.edu.tw/~cjlin/libsvmtools/) and the UCI websites (www.ics.uci.edu/~mlearn/MLRepository.html).

Funding: Our work is supported by the National Key Research Development Program of China (No. 2017YFB0802800). The funders participated in the design of algorithms and commented on the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

AUC [1] plays an important role in measuring classification performance: it quantifies the ability of a classifier to assign a higher score to a randomly chosen positive instance than to a randomly drawn negative instance [2]. Compared with accuracy and cross-entropy loss, AUC is independent of the prior class probability distribution and of misclassification costs, which makes it more suitable for imbalanced classification tasks [3–6]. Moreover, AUC is widely applied in real-world scenarios such as cancer diagnosis and anomaly detection [7, 8].

In recent decades, many batch learning algorithms [9–12] have been introduced to optimize AUC directly. Despite their success, these batch AUC optimization algorithms all require the whole set of training instances to be available before training. Besides, they update the model in every epoch with all training instances. Therefore, the batch learning setting is neither efficient nor scalable for large-scale applications. To address this challenge, online learning techniques have been introduced to maximize AUC, and they have been shown to be capable of handling large-scale scenarios [13–15]. Online learning methods update the model with only one instance per iteration. As a result, online learning algorithms are desirable for handling large-scale streaming data that arrive sequentially.

However, AUC optimization requires minimizing the sum of the losses over pairs of instances from different classes. Therefore, it is difficult to maximize AUC by directly applying online learning, because computing the sum of pairwise losses at the current iteration requires access to all previous training instances. Several recent works [16–19] adopt different approximations of the sum of pairwise losses to avoid storing all received training data, which makes them more feasible for large-scale tasks. In general, there are two kinds of online AUC maximization frameworks. The first framework uses the reservoir sampling method, which keeps fixed-size buffers to store some historical instances for calculating pairwise losses [18, 19]. The other framework employs a one-pass technique to maximize AUC by processing each instance only once [17]. The work [16] proposed an adaptive one-pass online AUC maximization algorithm called AdaOAM. This method adapts the learning rates of different dimensions to the geometry of the data by applying an adaptive gradient method (Adagrad) [20].

Although AdaOAM has achieved good performance, its learning rate may shrink too fast due to the rapid increase of its denominators. As a result, the model may fail to fully converge [21]. To tackle this problem, we propose AdmOAM, an Adaptive Moment estimation method for Online AUC Maximization. It uses estimates of the moments of the gradients to adaptively calculate per-dimension learning rates, based on the framework of [22]. Our method mitigates the rapid decay of learning rates by using exponential moving averages of past squared gradients as the denominators. Furthermore, AdmOAM is efficient and only needs to store the first and second moments of the gradients. Our theoretical analysis shows that the regret bound of AdmOAM is much lower than that of existing non-adaptive methods and is comparable with that of AdaOAM. We also demonstrate the effectiveness of AdmOAM in experiments on several benchmark datasets in comparison with 4 state-of-the-art online AUC maximization algorithms.

The rest of the paper is organized as follows. We first give an overview of some related works. Then we describe the problem setting and the framework of AdmOAM. We then give the theoretical analysis and provide the experimental results. Finally, we present a summary and some directions for future work.

Related work

In this section, we briefly review prior work on three related topics: online learning, adaptive gradient methods and AUC maximization.

Online learning

Online learning is a family of efficient and scalable machine learning algorithms that update the model from data sequentially. Compared to traditional batch or offline learning algorithms, online learning algorithms avoid the time-consuming training process and are capable of retraining efficiently when new data arrive [13]. The first online learning algorithm is the Perceptron [23], which updates the model with first-order gradient information. Passive-Aggressive is another type of first-order online algorithm, which applies a margin-based technique [24]. In recent years, several second-order online learning algorithms have been proposed to accelerate the convergence of optimization [25, 26]. Besides, some regularization terms have been introduced to stabilize online learning models [11, 27]. However, these traditional online approaches aim at optimizing classification accuracy or error rate, which is inappropriate for imbalanced classification tasks. In contrast, we develop a novel first-order online algorithm that maximizes a metric suited to imbalanced data with an adaptive gradient method.

Adaptive gradient methods

Online Gradient Descent (OGD) [28] is the dominant method for solving online convex optimization problems. It updates a model by moving the parameters along the direction opposite to the gradient of the loss function with a global learning rate. However, infrequently occurring features are often highly informative and require relatively larger learning rates than frequently occurring features. Therefore, OGD cannot fully incorporate knowledge of the geometry of the data with a global learning rate. To tackle this challenge, researchers have proposed several variants of OGD that perform adaptive gradient optimization by iteratively adjusting the learning rates on a per-feature basis [20, 22, 29]. The most famous adaptive gradient algorithm is Adagrad [20], which can achieve better performance than non-adaptive algorithms both theoretically and experimentally. However, Adagrad has been observed to fail to converge due to the rapid decay of the learning rates, since the denominators of the learning rates are based on the accumulation of the squares of the past gradients. To address this problem, some variants of Adagrad have been proposed, such as RMSprop [29], Adam [22] and AMSgrad [30]. These methods use exponential moving averages to estimate the moments of the gradients, which can mitigate the rapid decay of the learning rates. Among these variants of Adagrad, Adam is the most widely used method due to its fast convergence and ease of implementation. Furthermore, Adam has been successfully used in many real-world applications such as computational biology [31], automated driving [32], text categorization [33] and machine translation [34].
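The difference between the two kinds of denominator can be illustrated with a small sketch for a single coordinate. It is not taken from the paper: the function names, the constant gradient stream and the values of `eta` and `beta2` are illustrative assumptions, and bias correction is omitted for brevity.

```python
import math

def adagrad_step_sizes(grads, eta=0.1, eps=1e-8):
    """Effective step size eta / (sqrt(sum of g^2) + eps) after each gradient."""
    acc, sizes = 0.0, []
    for g in grads:
        acc += g * g                      # the accumulated sum never decreases
        sizes.append(eta / (math.sqrt(acc) + eps))
    return sizes

def ema_step_sizes(grads, eta=0.1, beta2=0.9, eps=1e-8):
    """Same quantity with an exponential moving average of g^2 (RMSprop/Adam style)."""
    v, sizes = 0.0, []
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g * g   # old gradients are gradually forgotten
        sizes.append(eta / (math.sqrt(v) + eps))
    return sizes

grads = [1.0] * 100                       # hypothetical stream of unit gradients
ada = adagrad_step_sizes(grads)
ema = ema_step_sizes(grads)
print(ada[-1], ema[-1])
```

With unit gradients the accumulated-sum step size decays like 1/√t, while the moving-average step size settles near `eta`.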

AUC maximization

AUC has been widely used to evaluate classification performance. Therefore, several algorithms have been proposed to maximize AUC with different convex surrogate losses [10, 16, 17, 19, 35]. Initially, many efforts were devoted to optimizing AUC in the batch learning setting [10, 35]. However, those batch algorithms fail to meet the demands of efficient and scalable learning for large-scale tasks. Therefore, some online AUC maximization algorithms have been proposed [16, 17, 19]. Generally, there are two main online learning frameworks for AUC maximization. The first framework stores several fixed-size buffers and adopts the reservoir sampling technique to update the buffers so that they represent the received instances. The sizes of these buffers are related to the number of training instances, which makes this type of algorithm impractical for large-scale applications. Besides, this framework uses the hinge loss as the surrogate loss, which has been proven to be inconsistent with AUC [36]. To overcome these limitations, the work [17] proposed a new framework called OPAUC, which applies the square loss as the surrogate loss function in the one-pass learning setting. OPAUC utilizes the consistency between the square loss and the AUC score, and only maintains the mean vector and covariance matrix of the received instances. Compared to the first framework, the storage requirement of OPAUC is independent of the number of training instances, and each instance needs to be processed only once. However, both frameworks are based on OGD for optimization, which prevents them from exploiting the geometrical information of the data [20]. To perform more informative gradient-based learning, the work [16] proposed the AdaOAM algorithm by combining the one-pass framework with Adagrad [20].
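The fixed-size buffers of the first framework can be maintained with reservoir sampling. The following is a minimal sketch of the textbook scheme, not the exact buffer update of [19]; `reservoir_update` and its parameters are hypothetical names chosen for illustration.

```python
import random

def reservoir_update(buffer, item, seen, capacity, rng=random):
    """Offer the `seen`-th stream item (1-based) to a buffer of at most `capacity` items.

    After the call, every item seen so far is in the buffer with
    probability min(1, capacity / seen).
    """
    if len(buffer) < capacity:
        buffer.append(item)               # buffer not full yet: always keep the item
    else:
        j = rng.randrange(seen)           # uniform integer in [0, seen)
        if j < capacity:
            buffer[j] = item              # replace a uniformly chosen slot
    return buffer

rng = random.Random(0)
buf = []
for t, x in enumerate(range(1000), start=1):
    reservoir_update(buf, x, t, capacity=100, rng=rng)
print(len(buf))
```

In an AUC setting one such buffer would typically be kept per class, so that each incoming instance can be paired with stored instances of the opposite class.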

Although AdaOAM has achieved fairly good performance, it may not fully converge because the denominators of its learning rates accumulate all previous gradients. The learning rates of AdaOAM shrink quickly as the denominators grow, which degrades its learning performance. To solve this problem, we develop a novel adaptive online AUC maximization algorithm called AdmOAM, which uses the square loss function in the one-pass framework and applies Adam [22] to mitigate the rapid decay of the learning rates.

Method

In this section, we present the framework of AdmOAM. We first introduce the problem setting of the online AUC maximization tasks. Then we present the details of AdmOAM.

Problem setting

We concentrate on learning a linear model in the binary classification setting. We denote $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} = \{-1, +1\}$ as the feature space and label space, respectively. Let $\mathcal{D}$ denote an unknown distribution over $\mathcal{X} \times \mathcal{Y}$. Let $S = \{(x_i, y_i)\}_{i=1}^{n}$ denote a sample that is drawn i.i.d. from $\mathcal{D}$. Let $\mathcal{H}$ denote the hypothesis class. At the t-th iteration, we denote the received training instance as $(x_t, y_t)$, and $w_t$ is the linear model learned so far. Let $S_+ = \{x_i^+\}_{i=1}^{n_+}$ be the set of positive instances and $S_- = \{x_j^-\}_{j=1}^{n_-}$ be the set of negative instances in sample $S$, where n₊ and n₋ refer to the numbers of positive and negative instances, respectively. Then the AUC score of the linear function f on sample $S$ can be calculated as:

$$\mathrm{AUC}(f) = \frac{1}{n_+ n_-} \sum_{i=1}^{n_+} \sum_{j=1}^{n_-} \mathbb{I}\left(w^\top x_i^+ > w^\top x_j^-\right) \qquad (1)$$

where w is the weight vector of function f and $\mathbb{I}(\cdot)$ is the indicator function, which outputs 1 if the condition is satisfied and 0 otherwise. In practice, we use the least square loss as a surrogate of the indicator function. The square loss function is convex and consistent with AUC [17]. We can then minimize the following objective function to find the optimal linear classifier:

$$L(w) = \frac{\lambda}{2} \|w\|_2^2 + \frac{1}{2 n_+ n_-} \sum_{i=1}^{n_+} \sum_{j=1}^{n_-} \left(1 - w^\top (x_i^+ - x_j^-)\right)^2 \qquad (2)$$

where $\frac{\lambda}{2}\|w\|_2^2$ is introduced as the regularizer to reduce the complexity of the linear classifier. Next we present the details of AdmOAM.
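To make the pairwise definitions concrete, the following sketch computes the AUC of a linear scorer and a square-loss surrogate objective by brute force over all positive-negative pairs. It assumes the regularizer is (λ/2)‖w‖² and that the pairwise loss is averaged with a factor 1/2, as in the one-pass formulation of [17]; the function names and the toy data are illustrative only.

```python
def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))

def auc_score(w, positives, negatives):
    """Fraction of (positive, negative) pairs that the scorer w.x ranks correctly."""
    correct = sum(1 for xp in positives for xn in negatives
                  if dot(w, xp) > dot(w, xn))
    return correct / (len(positives) * len(negatives))

def square_loss_objective(w, positives, negatives, lam):
    """Assumed form: (lam/2)||w||^2 + sum over pairs (1 - w.(x+ - x-))^2 / (2 n+ n-)."""
    pair_loss = sum((1.0 - (dot(w, xp) - dot(w, xn))) ** 2
                    for xp in positives for xn in negatives)
    return 0.5 * lam * dot(w, w) + pair_loss / (2.0 * len(positives) * len(negatives))

positives = [[1.0, 0.0], [0.8, 0.3]]      # toy data, not from the paper
negatives = [[0.0, 1.0], [0.9, 0.9]]
w = [1.0, -1.0]
print(auc_score(w, positives, negatives))
print(square_loss_objective(w, positives, negatives, lam=0.1))
```

Both functions cost O(n₊n₋) per evaluation and need all instances in memory, which is what the online formulations are designed to avoid.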

Adaptive moment online AUC maximization

In online learning framework, we focus on minimizing the regret of a sequence of algorithms with regard to a competing hypothesis, where the model of the competing hypothesis is the optimal decision in hindsight. The optimal decision w* is defined as: (3) where χ is the decision set.

The regret of hypothesis at iteration is defined as: (4) where . According to the approach in [17], the overall loss can be transformed to a sum of losses on each training instance in online setting (5) where denotes the i.i.d training sample on the t-th iteration, and is an unbiased estimation of . Then the gradient of can be calculated as: (6) where and denote the numbers of positive and negative instances in , respectively. For calculating without storing all received instances, we use c^± and Γ^± as the mean vectors and covariance matrices of positive and negative instances, respectively. (7)

The mean vectors and covariance matrices can be updated as follows: (8)
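One way to maintain such statistics in a single pass is sketched below. It tracks the running mean of x and of x xᵀ and derives the (population) covariance from them, which is an equivalent bookkeeping choice made here for simplicity and not necessarily the exact recursion of the paper. `RunningStats` is a hypothetical helper; one instance would be kept per class.

```python
class RunningStats:
    """One-pass mean vector and population covariance matrix of a stream."""

    def __init__(self, dim):
        self.n = 0
        self.mean = [0.0] * dim
        self.second = [[0.0] * dim for _ in range(dim)]   # running mean of x x^T

    def update(self, x):
        """Fold one instance into the statistics in O(d^2) time."""
        self.n += 1
        d = len(x)
        for i in range(d):
            self.mean[i] += (x[i] - self.mean[i]) / self.n
            for j in range(d):
                self.second[i][j] += (x[i] * x[j] - self.second[i][j]) / self.n

    def cov(self):
        """Covariance as E[x x^T] - mean mean^T."""
        d = len(self.mean)
        return [[self.second[i][j] - self.mean[i] * self.mean[j]
                 for j in range(d)] for i in range(d)]

stats = RunningStats(2)
for x in [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]:
    stats.update(x)
print(stats.mean, stats.cov())
```

The storage is O(d²) regardless of how many instances have been seen.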

Therefore, the gradient can be reformulated as: (9)
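The equivalence between the pairwise gradient and its reformulation in terms of class statistics can be checked numerically. The sketch below assumes the per-iteration loss of the one-pass approach in [17], namely (λ/2)‖w‖² plus the average over stored opposite-class instances xᵢ of (1 − yₜ(xₜ − xᵢ)ᵀw)²/2; the function names and toy values are illustrative assumptions.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def pairwise_gradient(w, x, y, opposite, lam):
    """Gradient computed from explicitly stored opposite-class instances."""
    g = [lam * wi for wi in w]
    for xi in opposite:
        diff = [a - b for a, b in zip(x, xi)]
        resid = 1.0 - y * dot(diff, w)
        for k in range(len(w)):
            g[k] -= y * resid * diff[k] / len(opposite)
    return g

def one_pass_gradient(w, x, y, mean, cov, lam):
    """Same gradient from the opposite class's mean and population covariance only:
    lam*w - y*(x - c) + ((x - c)(x - c)^T + Gamma) w."""
    d = [a - c for a, c in zip(x, mean)]
    dw = dot(d, w)
    return [lam * w[k] - y * d[k] + d[k] * dw + dot(cov[k], w)
            for k in range(len(w))]

w = [0.5, -0.2]
x_t = [1.0, 0.0]
negatives = [[0.0, 1.0], [0.5, 0.5]]                  # toy stored instances
neg_mean = [0.25, 0.75]                               # their mean vector
neg_cov = [[0.0625, -0.0625], [-0.0625, 0.0625]]      # their population covariance
g_pairs = pairwise_gradient(w, x_t, +1, negatives, lam=0.1)
g_stats = one_pass_gradient(w, x_t, +1, neg_mean, neg_cov, lam=0.1)
print(g_pairs, g_stats)
```

The two results agree because the average of (xₜ − xᵢ)(xₜ − xᵢ)ᵀ over the stored instances equals (xₜ − c)(xₜ − c)ᵀ + Γ.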

Then we can update the linear classifier by using online gradient descent w_t+1 = w_t − η_t g_t, where η_t is the learning rate at iteration t and . According to the properties of strong convexity in [16], the optimal w_* satisfies . As a result, it is reasonable to restrict w_t with by applying the projected gradient method [28]. (10)
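A projected online gradient descent step can be sketched as a plain gradient step followed by a projection onto an L2 ball. In this sketch `radius` is a free parameter standing in for the norm bound on the optimal weight vector, and the function names are hypothetical.

```python
import math

def project_l2_ball(w, radius):
    """Return the closest point to w inside the L2 ball of the given radius."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm <= radius:
        return list(w)
    return [wi * radius / norm for wi in w]

def projected_ogd_step(w, g, eta, radius):
    """One update w <- Proj(w - eta * g)."""
    stepped = [wi - eta * gi for wi, gi in zip(w, g)]
    return project_l2_ball(stepped, radius)

w_inside = projected_ogd_step([0.3, 0.4], [1.0, 0.0], eta=0.1, radius=1.0)
w_clipped = projected_ogd_step([3.0, 4.0], [0.0, 0.0], eta=0.1, radius=1.0)
print(w_inside, w_clipped)
```

Points already inside the ball are left unchanged; points outside are rescaled onto its surface.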

However, it has been shown that the model can not fully exploit the geometrical information of data with a global learning rate [20]. For solving this problem, [16] proposed AdaOAM by updating the learning rates of different features as follows: (11) where i ∈ [d] is the i-th dimensional feature and t is the number of iterations.

Algorithm 1 The AdmOAM Algorithm

Input: The regularization parameter λ > 0, the step size , 1st moment (the mean) and 2nd raw moment (the uncentered variance), the exponential decay rates β₁, β₂ ∈ [0, 1) and a smooth parameter ϵ > 0.

Output: Updated classifier w_T.

Variables: .

Initialize .

for t = 1, 2, …, T do

Observe instance (x_t, y_t);

if y_t = +1 then

, ;

and ;

Update and ;

else

, ;

and ;

Update and ;

end if

Calculate gradient ;

Update m_t, v_t;

;

end for

When the accumulated sum of previous gradients increases too fast, the learning rate of AdaOAM shrinks to a very small value, which can result in slow convergence. Therefore, we propose AdmOAM to alleviate the rapid decay of learning rates; our work is inspired by [22]. AdmOAM adaptively updates the learning rates of different features with estimates of the moments of the gradients. By using exponential moving averages of previous gradients as the denominators, AdmOAM mitigates the rapid decay of learning rates.

Specifically, we denote m and v as the exponential moving averages of the gradients and the squared gradients, respectively. These two vectors serve as estimates of the first and second moments of the gradients. They can be updated as follows:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \qquad (12)$$

where β₁, β₂ ∈ [0, 1) are the exponential decay rates of m and v, and $g_t^2$ denotes the element-wise square of $g_t$. Owing to the exponential moving average, the denominator does not grow too fast, so the learning rate does not shrink too fast. Besides, compared with the efficient OPAUC, AdmOAM only requires extra O(d) space for storing m and v. Note that if m and v are initialized as zero vectors, then bias correction is needed according to [22]. Therefore, the classifier w with the initialization bias correction can be updated as: (13) where ϵ > 0 is a smoothing parameter that prevents the denominator from becoming zero. The framework of AdmOAM is shown in Algorithm 1.
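The moment updates and the bias correction can be sketched as a single per-iteration function in the standard form of Adam [22]. This is an illustrative sketch only: it uses a constant step size `eta` and omits any projection step, both of which are assumptions that may differ from the update used by AdmOAM, and the default hyperparameter values are those commonly used for Adam.

```python
import math

def adam_step(w, g, m, v, t, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One bias-corrected Adam update at iteration t (1-based). Returns (w, m, v)."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
    # Correct the bias caused by initializing m and v at zero.
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
    # Per-coordinate step; eps keeps the denominator away from zero.
    w = [wi - eta * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v

w, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
w, m, v = adam_step(w, [0.5, -2.0], m, v, t=1)
print(w, m, v)
```

At the first iteration the bias-corrected moments equal the gradient and its square, so each coordinate moves by roughly `eta` in the direction opposite to the sign of its gradient, independent of the gradient's scale.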

Theoretical analysis

Next we present our main theoretical results of AdmOAM.

Lemma 1. Let w_t and g_t (t ∈ [T]) be the weight vector and gradient defined in the Algorithm 1, and we have (14) where we denote the i-th dimension of the gradient at iteration t as g_t,i and r_t,i = max_j<t|x_j,i − x_t,i|.

Proof. Firstly, we define as the optimal weight vector of the linear model in hindsight. Objective function uses as the regularizer. According to the strongly convex property, we have [16]. As a result, we restrict w_t with by applying the projected gradient update rule. Besides, according to the definition of the gradient of , we have . If y_t = 1, we have (15) where denotes the number of the received negative instances at iteration t. By applying inequality 〈w, v〉 ≤ ‖w‖₂‖v‖₂ and (a + b)² ≤ 2a² + 2b², we have (16)

This upper bound also holds for y_t = −1.

Lemma 2. Assume the gradient of the objective function f_t is bounded, sup_w∈χ‖g_t(w)‖₂ ≤ G, sup_w∈χ‖g_t(w)‖_∞ ≤ G_∞ and the distance between any elements of the hypothesis class is bounded, sup_w,u∈χ‖w − u‖₂ ≤ D, sup_w,u∈χ‖w − u‖_∞ ≤ D_∞ and β₁, β₂ ∈ [0, 1) satisfy . Let and β_1,t = β₁ ρ^t−1, ρ ∈ (0, 1). For any T > 1, the following regret bound holds (17) where g_1:T,i = [g_1,i, g_2,i, ⋯, g_T,i].

Lemma 2 is Theorem 4.1 of [22]. Next we derive the regret bound of the proposed AdmOAM algorithm.

Theorem 1. Assume ‖x‖_∞ ≤ 1, β₁, β₂ ∈ [0, 1) satisfy . Let and β_1,t = β₁ ρ^t−1, ρ ∈ (0, 1). For any T > 1, AdmOAM can achieve following regret bound (18) where ,, and .

Proof. Since the inequality holds for any t ∈ [T], we have and according to the triangle inequality and ‖w‖_∞ ≤ ‖w‖₂.

Using Lemma 1 and the assumption ‖x‖_∞ ≤ 1, we have r_t,i = max_j<t|x_j,i − x_t,i| ≤ 2 and . Then it is easy to obtain .

According to the definitions of and g_1:T,i, we have (19) (20)

Plugging (19), (20), the bound of the gradient and the bound of the distance between any weight vectors into (17), we have which completes the proof.

If the features are rather dense, then we have (21) according to the inequality and r_j,i ≤ 2 for any i ∈ [d]

If we denote the constant , it is easy to obtain: (22) Also we have: (23) Therefore, AdmOAM’s regret bound is for the dense feature space and its convergence rate is as in the general case of the non-adaptive algorithms.

When the features are sparse, the term B_j,i should be much smaller than C. Therefore, the regret bound of AdmOAM should be much smaller than , which results in a faster convergence. Besides, AdmOAM achieves comparable convergence rate with respect to AdaOAM according to [22].

From the above analysis, we can conclude that AdmOAM converges faster than the non-adaptive algorithms and achieves a convergence rate comparable to that of AdaOAM.

Experimental results

In this section, we evaluate the performance of AdmOAM on several standard benchmark datasets.

Compared algorithms

Since we only concentrate on online scenarios, we do not take existing batch learning methods into consideration. We compare AdmOAM with 4 competing online AUC maximization algorithms:

OAM_seq: The OAM algorithm using sequential updating [19];
OAM_gra: The OAM algorithm using online gradient updating [19];
OPAUC: One-pass AUC optimization algorithm [17];
AdaOAM: The adaptive subgradient online AUC optimization algorithm [16].

Experimental testbed and setup

We conduct the experiments on 13 benchmark datasets, which can be downloaded from the LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/) and the UCI websites (http://www.ics.uci.edu/~mlearn/MLRepository.html). Since glass, vehicle, dna and acoustic are multi-class datasets, we transform them into binary-class datasets by randomly setting one class as the positive class and the others as negative. The sparsity of a dataset is defined as the number of zero elements divided by the total number of elements in its feature matrix. Besides, the features have been rescaled to [−1, 1]. The details of the datasets are summarized in Table 1.
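The preprocessing steps described above can be sketched as follows. The sparsity function follows the stated definition directly. The rescaling scheme is not specified in the text, so dividing each column by its maximum absolute value is only one possible choice, picked here because it maps features into [−1, 1] while keeping zero entries at zero. All function names and the toy matrix are hypothetical.

```python
def sparsity(matrix):
    """Number of zero entries divided by the total number of entries."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total

def to_binary_labels(labels, positive_class):
    """Map one chosen class to +1 and every other class to -1."""
    return [1 if label == positive_class else -1 for label in labels]

def max_abs_rescale(matrix):
    """Divide each column by its largest absolute value (all-zero columns stay zero)."""
    scales = [max(abs(v) for v in col) for col in zip(*matrix)]
    return [[v / s if s > 0 else 0.0 for v, s in zip(row, scales)]
            for row in matrix]

X = [[0.0, 2.0, 0.0],
     [4.0, 0.0, 0.0],
     [0.0, -1.0, 0.0]]
y = to_binary_labels([0, 1, 2, 1], positive_class=1)
X_scaled = max_abs_rescale(X)
print(sparsity(X), y, X_scaled)
```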

Table 1. Information about datasets.

https://doi.org/10.1371/journal.pone.0215426.t001

We conduct nested cross-validation for hyperparameter searching and model evaluation. In the outer loop, we conduct 5-fold cross-validation on each benchmark dataset, where 4 folds are used for training and the remaining fold is treated as the test set. In the inner loop, we apply 5-fold cross-validation on the training set for hyperparameter searching. After the inner cross-validation, we train the model on the whole training set with the tuned hyperparameters. Finally, we calculate the AUC score of the model on the test set for model evaluation. To further reduce the variance of the results, we run 4 independent rounds of 5-fold nested cross-validation on each dataset. Therefore, the AUC performance of each algorithm on each dataset is the average over 20 independent runs. For the hyperparameter searching, we tune the learning rate η ∈ 2^[−10:1:3] and the regularization parameter λ ∈ 2^[−10:1:3] for AdmOAM, AdaOAM and OPAUC. For the exponential decay rates of AdmOAM, we search β₁ ∈ [0.1: 0.1: 0.9] and β₂ ∈ [0.099: 0.1: 0.999]. For buffer sampling algorithms like OAM_seq and OAM_gra, we tune the penalty parameter C ∈ 2^[−10:1:10], and the size of the buffer is set to 100 as recommended in [19].
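The search grids and the nested splitting scheme can be written out explicitly. The grids below follow the ranges stated above; the splitting helpers are a generic sketch with hypothetical names, using shuffled index folds without stratification, which may differ from the authors' implementation.

```python
import random

# Hyperparameter grids as stated in the text.
eta_grid = [2.0 ** k for k in range(-10, 4)]                   # 2^[-10:1:3]
lambda_grid = [2.0 ** k for k in range(-10, 4)]                # 2^[-10:1:3]
beta1_grid = [round(0.1 * k, 1) for k in range(1, 10)]         # 0.1, ..., 0.9
beta2_grid = [round(0.099 + 0.1 * k, 3) for k in range(10)]    # 0.099, ..., 0.999
c_grid = [2.0 ** k for k in range(-10, 11)]                    # 2^[-10:1:10]

def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nested_cv_splits(n, outer_k=5, inner_k=5, seed=0):
    """Yield (train, test, inner) where inner is a list of (fit, val) index pairs."""
    rng = random.Random(seed)
    outer = k_fold_indices(n, outer_k, rng)
    for o in range(outer_k):
        test = outer[o]
        train = [i for f, fold in enumerate(outer) if f != o for i in fold]
        inner_folds = k_fold_indices(len(train), inner_k, rng)
        inner = []
        for v in range(inner_k):
            val = [train[i] for i in inner_folds[v]]
            fit = [train[i] for f, fold in enumerate(inner_folds)
                   if f != v for i in fold]
            inner.append((fit, val))
        yield train, test, inner

splits = list(nested_cv_splits(50))
print(len(eta_grid), len(beta1_grid), len(beta2_grid), len(c_grid), len(splits))
```

Repeating `nested_cv_splits` with 4 different seeds would give the 20 train/test runs per dataset described above.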

Evaluation on benchmark datasets

In this subsection, we analyse the average AUC values, convergence rate and running time of AdmOAM in comparison with the other methods. Table 2 shows the average AUC values over 20 independent runs on 13 benchmark datasets.

Table 2. Evaluation on benchmark datasets for comparing AUC performance (mean ± std).

https://doi.org/10.1371/journal.pone.0215426.t002

Based on the results in Table 2, we make several observations. Firstly, the two adaptive methods, AdmOAM and AdaOAM, achieve higher AUC scores than the other three non-adaptive methods in most cases. Therefore, the adaptive learning strategy can effectively improve the performance of existing online AUC maximization algorithms. Secondly, AdmOAM obtains performance better than or comparable to that of AdaOAM on most datasets. In particular, on the svmguide4 and vehicle datasets, AdmOAM achieves much higher AUC scores than AdaOAM. This indicates the effectiveness of AdmOAM over AdaOAM.

Next we analyse the convergence speed of AdmOAM. Each online AUC maximization algorithm updates the model from a sequence of training data, one instance at a time. To compare convergence speed, we evaluate the AUC score of the different online learning algorithms on the test set. Compared with reservoir sampling methods like OAM_seq and OAM_gra, the algorithms based on the one-pass learning mode obtain better performance according to the results in Table 2. Therefore, we compare the convergence rate of AdmOAM with those of AdaOAM and OPAUC. Fig 1(a)–1(d) depict the convergence curves on 4 benchmark datasets with error bars. Specifically, we report the average AUC score across 20 independent runs on the test sets at different iterations. From Fig 1, we can observe that AdmOAM converges faster than the other two algorithms. As the number of iterations increases, AdmOAM achieves a higher AUC score than AdaOAM and OPAUC. This validates our theoretical analysis and demonstrates the effectiveness of AdmOAM.

Fig 1. Evaluation of convergence rate on benchmark datasets.

https://doi.org/10.1371/journal.pone.0215426.g001

We also present the running time on 13 datasets in Fig 2. On most datasets, AdmOAM is more efficient than OAM_seq and OAM_gra, and stays competitive with AdaOAM in the computational complexity. Compared to OPAUC, AdmOAM needs to spend a little more time for updating two extra vectors of the first and second moments.

Fig 2. Comparison of the running time (in milliseconds).

The y-axis is set as log-scale.

https://doi.org/10.1371/journal.pone.0215426.g002

Evaluation of parameter sensitivity

Since AdmOAM adaptively updates the learning rates, we mainly focus on the sensitivity to the learning rate, with the other parameters fixed at their tuned values. We report the average test AUC score across 5 independent runs (1 trial of 5-fold cross-validation) over the range of learning rates η ∈ 2^[−10:1:3] in Fig 3. Based on the results in Fig 3, we can observe that AdmOAM is less sensitive to the learning rate than OPAUC, especially when the learning rate is over 2⁻². In [16], the authors claimed that AdaOAM is insensitive to the parameter settings. From Fig 3, we can observe that AdmOAM obtains an average AUC score comparable to or better than that of AdaOAM. These results indicate that AdmOAM can effectively adjust its per-coordinate learning rate and is less sensitive to the parameter settings.

Fig 3. Evaluation of parameter sensitivity.

https://doi.org/10.1371/journal.pone.0215426.g003

Conclusion and future work

In this paper we proposed AdmOAM, an Adaptive Moment estimation method for Online AUC Maximization. It uses estimates of the moments of the gradients to accelerate convergence and mitigate the rapid decay of the learning rates. Theoretically, we analysed the regret bound of the proposed algorithm. It achieves a lower regret bound than non-adaptive online AUC maximization algorithms and stays competitive with AdaOAM. Moreover, we compared its performance with several competing algorithms on benchmark datasets. The experimental results validate the theoretical analysis and indicate the effectiveness of the proposed algorithm.

For future work, there are several research directions. Firstly, AdmOAM uses all features for AUC maximization, which is neither efficient nor scalable for high-dimensional sparse datasets. It would be interesting to combine AdmOAM with feature selection techniques for learning a sparse model. Secondly, because it learns a linear model, AdmOAM is not suitable for data that are not linearly separable. It would be interesting to combine AdmOAM with online kernel learning methods for handling the nonlinearity of the data.

Acknowledgments

Our work is supported by the National Key Research Development Program of China (No. 2017YFB0802800). The authors would like to thank the Editor-in-Chief and anonymous reviewers for their insightful and constructive comments that have led to an improved version of this paper.

References

1. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143(1):29–36.
2. Brefeld U, Scheffer T. AUC maximizing support vector learning. In: Proceedings of the ICML 2005 Workshop on ROC Analysis in Machine Learning; 2005.
3. Maloof MA. Learning when data sets are imbalanced and when costs are unequal and unknown. In: Proceedings of the ICML Workshop on Learning from Imbalanced Data Sets II. vol. 2; 2003. p. 2–1.
4. Ling CX, Huang J, Zhang H. AUC: a better measure than accuracy in comparing learning algorithms. In: Conference of the Canadian Society for Computational Studies of Intelligence. Springer; 2003. p. 329–341.
5. Yan L, Dodier RH, Mozer M, Wolniewicz RH. Optimizing classifier performance via an approximation to the Wilcoxon-Mann-Whitney statistic. In: ICML; 2003. p. 848–855.
6. Huang J, Ling CX. Using AUC and accuracy in evaluating learning algorithms. IEEE Transactions on Knowledge and Data Engineering. 2005;17(3):299–310.
7. Abe N, Zadrozny B, Langford J. Outlier detection by active learning. In: KDD. ACM; 2006. p. 504–509.
8. Statnikov A, Wang L, Aliferis CF. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics. 2008;9(1):319. pmid:18647401
9. Cortes C, Mohri M. AUC optimization vs. error rate minimization. In: NIPS; 2004. p. 313–320.
10. Joachims T. A support vector method for multivariate performance measures. In: ICML. ACM; 2005. p. 377–384.
11. Kotlowski W, Dembczynski KJ, Huellermeier E. Bipartite ranking through minimization of univariate loss. In: ICML; 2011. p. 1113–1120.
12. Rakotomamonjy A. Optimizing area under ROC curve with SVMs. In: ROCAI; 2004. p. 71–80.
13. Bottou L, LeCun Y. Large scale online learning. In: NIPS; 2004. p. 217–224.
14. Cesa-Bianchi N, Conconi A, Gentile C. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory. 2004;50(9):2050–2057.
15. Rakhlin A, Shamir O, Sridharan K. Making gradient descent optimal for strongly convex stochastic optimization. In: ICML; 2012. p. 449–456.
16. Ding Y, Zhao P, Hoi SC, Ong YS. An adaptive gradient method for online AUC maximization. In: AAAI; 2015. p. 2568–2574.
17. Gao W, Jin R, Zhu S, Zhou ZH. One-pass AUC optimization. In: ICML; 2013. p. 906–914.
18. Kar P, Sriperumbudur B, Jain P, Karnick H. On the generalization ability of online learning algorithms for pairwise loss functions. In: ICML; 2013. p. 441–449.
19. Zhao P, Hoi SC, Jin R, Yang T. Online AUC maximization. In: ICML; 2011. p. 233–240.
20. Duchi J, Hazan E, Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research. 2011;12(Jul):2121–2159.
21. Lu Y, Lund J, Boyd-Graber J. Why ADAGRAD fails for online topic modeling. In: EMNLP; 2017. p. 446–451.
22. Kingma D, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014.
23. Rosenblatt F. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review. 1958;65(6):386. pmid:13602029
24. Crammer K, Dekel O, Keshet J, Shalev-Shwartz S, Singer Y. Online passive-aggressive algorithms. Journal of Machine Learning Research. 2006;7(Mar):551–585.
25. Dredze M, Crammer K, Pereira F. Confidence-weighted linear classification. In: ICML. ACM; 2008. p. 264–271.
26. Crammer K, Kulesza A, Dredze M. Adaptive regularization of weight vectors. In: NIPS; 2009. p. 414–422.
27. Duchi JC, Shalev-Shwartz S, Singer Y, Tewari A. Composite objective mirror descent. In: COLT; 2010. p. 14–26.
28. Zinkevich M. Online convex programming and generalized infinitesimal gradient ascent. In: ICML; 2003. p. 928–936.
29. Tieleman T, Hinton G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning. 2012;4(2):26–31.
30. Reddi SJ, Kale S, Kumar S. On the convergence of Adam and beyond. ICLR. 2018.
31. Angermueller C, Parnamaa T, Parts L, Stegle O. Deep learning for computational biology. Molecular Systems Biology. 2016;12(7):878–878. pmid:27474269
32. Sallab A, Gamal M. End-To-End multi-modal sensors fusion system for urban automated driving. In: Proceedings of the NIPS Workshop on Machine Learning for Intelligent Transportation Systems; 2018.
33. Chen G, Ye D, Xing Z, Chen J, Cambria E. Ensemble application of convolutional and recurrent neural networks for multi-label text categorization. International Symposium on Neural Networks. 2017; p. 2377–2383.
34. Lample G, Ott M, Conneau A, Denoyer L, Ranzato M. Phrase-Based neural unsupervised machine translation. EMNLP. 2018; p. 5039–5049.
35. Herschtal A, Raskutti B. Optimising area under the ROC curve using gradient descent. In: ICML. ACM; 2004. p. 49–57.
36. Gao W, Zhou ZH. On the consistency of AUC pairwise optimization. In: IJCAI; 2015. p. 939–945.

