A new accuracy measure based on bounded relative error for time series forecasting

Abstract

Many accuracy measures have been proposed in the past for time series forecasting comparisons. However, many of these measures suffer from one or more issues such as poor resistance to outliers and scale dependence. In this paper, while summarising commonly used accuracy measures, a special review is made of the symmetric mean absolute percentage error. Moreover, a new accuracy measure called the Unscaled Mean Bounded Relative Absolute Error (UMBRAE), which combines the best features of various alternative measures, is proposed to address the common issues of existing measures. A comparative evaluation of the proposed and related measures has been made with both synthetic and real-world data. The results indicate that the proposed measure, with a user-selectable benchmark, performs as well as or better than other measures on the selected criteria. Though it has been commonly accepted that there is no single best accuracy measure, we suggest that UMBRAE could be a good choice for evaluating forecasting methods, especially for cases where measures based on the geometric mean of relative errors, such as the geometric mean relative absolute error, are preferred.

Introduction

Forecasting has always been an attractive research area since it plays an important role in daily life. As one of the most popular research domains, time series forecasting has received particular attention from researchers [1–5]. Many comparative studies have been conducted with the aim of identifying the most accurate methods for time series forecasting [6]. However, research findings indicate that the performance of forecasting methods varies according to the accuracy measure being used [7]. Various accuracy measures have been proposed as the best to use over the past decades. However, many of these measures are not generally applicable due to issues such as being infinite or undefined under certain circumstances, which may produce misleading results. The criteria required for accuracy measures have been explicitly addressed by Armstrong and Collopy [6] and further discussed by Fildes [8] and Clements and Hendry [9]. As discussed, a good accuracy measure should provide an informative and clear summary of the error distribution. The criteria should also include reliability, construct validity, computational complexity, outlier protection, scale-independence, sensitivity to changes and interpretability. It has been suggested by many researchers that no single measure is superior to all others on these criteria [6, 10, 11].

The evolution of accuracy measures can be seen through the measures used in the major comparative studies of forecasting methods. Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE) can be considered the earliest and most popular accuracy measures. They were the primary measures used in the original M-Competition [12]. Despite well-known issues such as their high sensitivity to outliers, they are still widely used [13–15]. When using these accuracy measures, errors which are small and appear to be good, such as 0.1 by RMSE and 1% by MAPE, can often be obtained. Wei et al. [16] employed RMSE as the performance indicator in their research on stock price forecasting. The average error obtained was 84 and was claimed to be superior to some previous models. However, without comparison, the error 84 as a number is not easy to interpret. In fact, the average fluctuation of the stock indices used was 83, which is smaller than the error of their proposed model. A similar case can be found for MAPE. Esfahanipour and Aghamiri [17] proposed a model with an error of 1.3%, which appears to be good. Yet this error was larger than the average daily fluctuation of the stock price, which was approximately 1.2%. The poor interpretation here is mainly due to the lack of a comparable benchmark used by the accuracy measure.

Armstrong and Collopy [6] recommended the use of relative absolute errors as a potential solution to the above issue. Accuracy measures based on relative errors, such as the Mean Relative Absolute Error (MRAE), can provide a better interpretation of how well the evaluated forecasting method performs compared to the benchmark method. However, when the benchmark error is small or equal to zero, the relative error can become extremely large or infinite. This may lead to an undefined mean or at least a distortion of the result. Thus, Armstrong and Collopy suggested a method named ‘winsorizing’ to overcome this problem by trimming extreme values. However, this process adds some complexity to the calculation and an appropriate trimming level has to be specified [18].

Similarly, MAPE also has the issue of being infinite or undefined due to zeros in the denominator [19]. The symmetric mean absolute percentage error (sMAPE) was first proposed by Armstrong [20] as a modified MAPE which could be a simple way to fix this issue. It was then used in the M3-Competition as an alternative primary measure to MAPE [7]. However, Goodwin and Lawton [21] pointed out that sMAPE is not as symmetric as its name suggests. In fact, it penalises under-estimates more heavily than over-estimates. Thus, the use of sMAPE in the M3-Competition was later widely criticized by researchers [22]. In an unpublished working paper, Chen and Yang [23] defined a modified sMAPE, called msMAPE, by adding an additional component to the denominator of sMAPE. The added component can efficiently avoid the inflation of sMAPE caused by zero-valued observations. However, it does not address the asymmetry of sMAPE.

Hyndman and Koehler [18] proposed Mean Absolute Scaled Error (MASE) as a generally applicable measurement of forecasting accuracy without the problems seen in the other accuracy measures. However, this measure can still be dominated by a single large error, though infinite and undefined values have been well avoided for most cases [24]. Davydenko and Fildes [24] proposed an altered version of MASE, the average relative MAE (AvgRelMAE), which uses the geometric mean to average the relative efficiencies of adjustments across time series. Although the geometric mean is appropriate for averaging benchmark ratios [25], the appropriateness of AvgRelMAE still depends on its component measure RelMAE for each time series.

In this paper, a new accuracy measure is proposed to address the issues mentioned above. Specifically, by introducing a newly defined bounded relative absolute error, the new measure addresses the asymmetry issue of sMAPE while maintaining its other properties, such as scale-independence and outlier resistance. Further, we believe that the new measure, being based on relative errors with a selectable benchmark, offers better interpretability than sMAPE, which uses percentage errors based on the observation values. Given that Armstrong and Collopy [6] found measures based on relative errors to be the most reliable, we believe our measure is also reliable in this sense.

Review of accuracy measures

Many accuracy measures have been proposed to evaluate the performance of forecasting methods during the past couple of decades. A table of the most commonly used measures was given in the review of 25 years of time series forecasting [1]. There was also a thorough review of accuracy measures by Hyndman and Koehler [18]. In this section, we mainly focus on new insights or new measures that have been introduced since 2006.

For a time series with n observations, let Yt denote the observation at time t and Ft denote the forecast of Yt. Then the forecasting error et can be defined as (Yt − Ft). Let e*t denote the forecasting error at time t obtained by some benchmark method. That means e*t = Yt − F*t, where F*t is the forecast at time t by the benchmark method.

Scale-dependent measures

The measures based on absolute or squared errors are also known as scale-dependent measures since their scale depends on the scale of the data. They are useful in comparing forecasting methods on the same set of data. However, they should not be used across data sets that are on different scales. The most commonly used scale-dependent measures are Mean Absolute Error (MAE), Mean Squared Error (MSE) and RMSE:

$$\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n}|e_t| \tag{1}$$

$$\mathrm{MSE} = \frac{1}{n}\sum_{t=1}^{n}e_t^{2} \tag{2}$$

$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} \tag{3}$$

MAE was cited in the very early forecasting literature as a primary measure of performance for forecasting models [26]. As shown in Eq 1, MAE directly calculates the arithmetic mean of absolute errors. Hence, it is very easy to compute and to understand. However, it may produce biased results when extremely large outliers exist in the data, since even a single large error can sometimes dominate the result of MAE.

MSE, which calculates the arithmetic mean of squared errors, was used in the first M-Competition [12]. However, its use was later widely criticized as inappropriate [6, 27]. MSE is more vulnerable to outliers since it gives extra weight to large errors. Also, the squared errors are on a different scale from the original data. Thus, RMSE, which is the square root of MSE, is often preferred to MSE as it is on the same scale as the data. However, RMSE is also sensitive to forecasting outliers [28].
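For illustration, the following Python sketch computes the three scale-dependent measures of Eqs 1–3; the function name and the example values are ours and purely hypothetical.

```python
import numpy as np

def scale_dependent_measures(y, f):
    """Compute MAE, MSE and RMSE (Eqs 1-3) for observations y and forecasts f."""
    e = np.asarray(y, dtype=float) - np.asarray(f, dtype=float)  # e_t = Y_t - F_t
    mae = np.mean(np.abs(e))
    mse = np.mean(e ** 2)
    rmse = np.sqrt(mse)
    return mae, mse, rmse

# Hypothetical observations and forecasts
y = [100.0, 110.0, 95.0, 120.0]
f = [102.0, 108.0, 99.0, 115.0]
print(scale_dependent_measures(y, f))
```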

Percentage-based measures

To be scale-independent, a common approach is to use percentage errors based on observation values. Two example measures based on percentage errors are MAPE and sMAPE, defined as:

$$\mathrm{MAPE} = \frac{100}{n}\sum_{t=1}^{n}\left|\frac{e_t}{Y_t}\right| \tag{4}$$

$$\mathrm{sMAPE} = \frac{100}{n}\sum_{t=1}^{n}\frac{2|e_t|}{|Y_t|+|F_t|} \tag{5}$$

It should be noted that absolute values are used in the denominator of sMAPE defined in this paper. This definition is different but equivalent to the definition in Makridakis [10] and Makridakis and Hibon [7] when forecasts and actual values are all non-negative. The absolute values in the denominator can avoid negative sMAPE as pointed out by Hyndman and Koehler [18].

MAPE was used as one of the major accuracy measures in the original M-Competition [12]. However, the percentage errors can be excessively large or undefined when the target time series has values close to or equal to zero [19]. Moreover, Armstrong [20] pointed out that MAPE has a bias favouring estimates that are below the actual values. This was illustrated by extremes: “a forecast of 0 can never be off by more than 100%, but there is no limit to errors on the high side”. Makridakis [10] discussed the asymmetry of MAPE with another example involving two forecasts on different actual values. However, we believe that the example by Makridakis goes beyond the point made by Armstrong in 1985. To our understanding, the assumptions behind the asymmetry of MAPE described by Armstrong [20] are: i) the estimates are non-negative while the actual value is positive; ii) the forecasting range is asymmetric, in that 0 is the lower bound for lower estimates while there is no upper bound for upper estimates; iii) errors for lower estimates and upper estimates should be symmetric (an extreme case: 0, as the worst lower estimate, should have the same absolute error as the worst upper estimate, which is infinite).

sMAPE can produce symmetric errors in the asymmetric forecasting range described by the above assumptions. However, it is more natural to consider the symmetry property over a symmetric forecasting range for lower and upper estimates. Thus, sMAPE has been widely criticized as an asymmetric measure [21, 22]. Regardless of the asymmetry issue, an advantage of sMAPE is that it does not share MAPE’s problem of being excessively large or infinite. Also, due to its bounded errors, sMAPE is more resistant to outliers since it gives less significance to them than measures whose errors are unbounded.
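As a minimal sketch of Eqs 4 and 5 (using this paper's definition of sMAPE with absolute values in the denominator; the function names are ours):

```python
import numpy as np

def mape(y, f):
    """Mean Absolute Percentage Error (Eq 4), in percent."""
    y, f = np.asarray(y, dtype=float), np.asarray(f, dtype=float)
    return 100.0 * np.mean(np.abs((y - f) / y))

def smape(y, f):
    """Symmetric MAPE (Eq 5), with absolute values in the denominator; range (0, 200)."""
    y, f = np.asarray(y, dtype=float), np.asarray(f, dtype=float)
    return 100.0 * np.mean(2.0 * np.abs(y - f) / (np.abs(y) + np.abs(f)))
```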

Relative-based measures

Another approach for accuracy measures to be scale-independent is to use relative errors based on the errors produced by a benchmark method (e.g. the naïve method). The most commonly used such measures are MRAE and the geometric mean relative absolute error (GMRAE):

$$\mathrm{MRAE} = \frac{1}{n}\sum_{t=1}^{n}\left|\frac{e_t}{e_t^{*}}\right| \tag{6}$$

$$\mathrm{GMRAE} = \left(\prod_{t=1}^{n}\left|\frac{e_t}{e_t^{*}}\right|\right)^{1/n} = \exp\left(\frac{1}{n}\sum_{t=1}^{n}\ln\left|\frac{e_t}{e_t^{*}}\right|\right) \tag{7}$$

MRAE can provide a clearer intuition of the performance improvement compared to the benchmark method. However, MRAE has a similar limitation to MAPE, in that it can also be excessively large or undefined when e*t is close to or equal to zero.

GMRAE is favoured since it is generally acknowledged that the geometric mean is more appropriate for averaging relative quantities than the arithmetic mean [6, 8]. According to the alternative representation of GMRAE shown in Eq 7, a key step in calculating GMRAE is to take the arithmetic mean of log-scaled error ratios. This makes GMRAE more resistant to outliers than MRAE, which uses the arithmetic mean of the original error ratios. However, GMRAE is still sensitive to outliers. More specifically, GMRAE can be dominated not only by a single large outlier, but also by an extremely small error close to zero. This is because there is neither an upper bound nor a lower bound for the log-scaled error ratios used by GMRAE. It should also be noticed that zero errors, both in et and e*t, have to be excluded from the analysis. Thus, GMRAE may not be sufficiently informative.
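A brief Python sketch of Eqs 6 and 7 (the function names are ours; zero errors are excluded before taking logs, as required for GMRAE):

```python
import numpy as np

def mrae(e, e_star):
    """Mean Relative Absolute Error (Eq 6) from the error series of the method and the benchmark."""
    e, e_star = np.asarray(e, dtype=float), np.asarray(e_star, dtype=float)
    return np.mean(np.abs(e / e_star))

def gmrae(e, e_star):
    """Geometric Mean Relative Absolute Error (Eq 7); zero errors must be excluded."""
    e, e_star = np.asarray(e, dtype=float), np.asarray(e_star, dtype=float)
    mask = (e != 0) & (e_star != 0)
    return np.exp(np.mean(np.log(np.abs(e[mask] / e_star[mask]))))
```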

Rather than averaging relative errors, one can also take the ratio of average errors obtained by a base measure. For example, when the base measure is RMSE, the relative RMSE (RelRMSE) is defined as:

$$\mathrm{RelRMSE} = \frac{\mathrm{RMSE}}{\mathrm{RMSE}^{*}} \tag{8}$$

RelRMSE is a commonly used measure proposed by Armstrong and Collopy [6], where RMSE* denotes the RMSE produced by a benchmark method. Similar measures, such as RelMAE and RelMAPE, can easily be defined. They are also called relative measures. An advantage of relative measures is their interpretability [18]. However, the performance of a relative measure is restricted by its component measure. For example, RelMAPE is undefined whenever MAPE is undefined. Further, RelMAPE can easily be dominated by extremely large outliers since MAPE is not resistant to outliers. Thus, it makes no sense to compute RelMAPE if MAPE, as the component, is skewed.
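A minimal sketch of Eq 8, assuming forecasts from both the evaluated method and the benchmark are available for the same series (the function name is ours):

```python
import numpy as np

def rel_rmse(y, f, f_benchmark):
    """Relative RMSE (Eq 8): RMSE of the method divided by RMSE of the benchmark."""
    y = np.asarray(y, dtype=float)
    rmse = np.sqrt(np.mean((y - np.asarray(f, dtype=float)) ** 2))
    rmse_star = np.sqrt(np.mean((y - np.asarray(f_benchmark, dtype=float)) ** 2))
    return rmse / rmse_star
```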

Another disadvantage of relative measures is that they are only available when there are several forecasts on the same series [18]. As a related idea to relative measures, MASE does not have the above issue. It is defined as:

$$\mathrm{MASE} = \frac{1}{n}\sum_{t=1}^{n}\frac{|e_t|}{\mathrm{MAE}^{**}} \tag{9}$$

In MASE, the absolute error |et| for each observation is scaled by the average in-sample error MAE** produced by a benchmark method (e.g. the one-step naïve method, or the seasonal naïve method for seasonal data). Thus, MASE will not produce infinite or undefined values except in the irrelevant case where all historical data are equal. However, MASE is still vulnerable to outliers [24]. Moreover, it has to be assumed that the period-to-period difference of the time series is stationary, so that the scaling factor is a consistent estimator of the scale of the series.
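A minimal sketch of Eq 9, assuming the non-seasonal one-step naïve method as the benchmark on the in-sample (historical) data (the function name is ours):

```python
import numpy as np

def mase(y_insample, y, f):
    """Mean Absolute Scaled Error (Eq 9), scaled by the in-sample MAE of the
    one-step naive benchmark (non-seasonal case)."""
    y_insample = np.asarray(y_insample, dtype=float)
    y, f = np.asarray(y, dtype=float), np.asarray(f, dtype=float)
    mae_benchmark = np.mean(np.abs(np.diff(y_insample)))  # MAE** of the naive method
    return np.mean(np.abs(y - f)) / mae_benchmark
```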

For comparisons of forecasting methods on multiple time series, MASE is equivalent to a weighted arithmetic mean of relative MAEs [24]:

$$\mathrm{MASE} = \frac{\sum_{i=1}^{m} n_i\, r_i}{\sum_{i=1}^{m} n_i} \tag{10}$$

where m denotes the number of time series, ni denotes the number of observations for the ith time series and ri = MAEi/MAE**i is the MAE ratio for the ith time series. As pointed out by Davydenko and Fildes [24], using the arithmetic mean of MAE ratios introduces a bias towards overrating the accuracy of a benchmark method. They proposed the measure AvgRelMAE as an alternative to MASE, based on the geometric mean to average the scaled quantities.

$$\mathrm{AvgRelMAE} = \left(\prod_{i=1}^{m} r_i^{\,n_i}\right)^{1/\sum_{i=1}^{m} n_i} \tag{11}$$

It should be noticed that AvgRelMAE uses the out-of-sample benchmark MAE*i as the scaling factor in ri, while MASE uses the in-sample MAE**i. Though AvgRelMAE was shown to have many advantages such as interpretability and robustness [24], it still has the same issue as MASE since both are based on RelMAE. As mentioned above, the accuracy of RelMAE is constrained by the accuracy of MAE. Since MAE can be dominated by extreme outliers, the MAE ratio ri does not necessarily represent an advisable comparison of forecasting methods based on the errors of the majority of forecasts for the ith time series.
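A sketch of Eq 11 as a weighted geometric mean of out-of-sample MAE ratios, computed per series and then combined (the function name and input layout are ours):

```python
import numpy as np

def avg_rel_mae(series_errors):
    """AvgRelMAE (Eq 11): weighted geometric mean of MAE ratios r_i = MAE_i / MAE*_i.
    `series_errors` is a list of (e_i, e_star_i) pairs, one per time series, holding
    the out-of-sample errors of the evaluated method and of the benchmark."""
    weighted_logs, weights = 0.0, 0
    for e, e_star in series_errors:
        e, e_star = np.asarray(e, dtype=float), np.asarray(e_star, dtype=float)
        r = np.mean(np.abs(e)) / np.mean(np.abs(e_star))  # r_i
        weighted_logs += len(e) * np.log(r)               # n_i * ln(r_i)
        weights += len(e)                                  # n_i
    return np.exp(weighted_logs / weights)
```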

A new accuracy measure

The criteria for a useful accuracy measure have been explicitly addressed in the literature [6, 8, 9, 11]. As reviewed in the previous Section, many measures have been proposed with various advantages and disadvantages. However, most of these measures suffer from one or more issues. In this section, we propose a new accuracy measure which adopts the advantages of other measures such as sMAPE and MRAE without having their common issues. Specifically, the proposed measure is expected to have the following properties: (i) Informative: it can provide an informative result without the need to trim errors; (ii) Resistant to outliers: it can hardly be dominated by a single forecasting outlier; (iii) Symmetric: over-estimates and under-estimates are treated fairly; (iv) Scale-independent: it can be applied to data sets on different scales; (v) Interpretable: it is easy to understand and can provide intuitive results.

It has been mentioned in the review above that sMAPE is resistant to outliers due to its bounded errors. We would like to propose a new measure in a similar fashion to sMAPE but without its issues. Since relative errors are more general than percentage errors in providing intuitive results, we use the Relative Absolute Error (RAE) as the base from which to derive our new measure.

$$\mathrm{RAE}_t = \frac{|e_t|}{|e_t^{*}|} \tag{12}$$

Since RAE has no upper bound, it can be excessively large or undefined when e*t is small or equal to zero. This issue can easily be addressed by adding |et| to the denominator of RAE, which introduces a bounded RAE (BRAE):

$$\mathrm{BRAE}_t = \frac{|e_t|}{|e_t| + |e_t^{*}|} \tag{13}$$

In BRAE, the added |et| ensures that the denominator is never less than the numerator. This means BRAE has a maximum value of 1, while the minimum is 0 when |et| is equal to zero. Due to this upper bound, an accuracy measure based on BRAE is more resistant to forecasting outliers. It can also be noticed that the asymmetry issue of sMAPE has been addressed in BRAE by adding |et| rather than |Ft| to the denominator. Also, a measure based on BRAE is more appropriate than sMAPE for intermittent demand data which have many zero-valued observations. To avoid the issue of being undefined, BRAE is defined to be 0.5 for the special case when |et| and |e*t| are both equal to zero.
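A minimal sketch of the element-wise BRAE of Eq 13, including the 0.5 convention for the special case (the function name is ours):

```python
import numpy as np

def brae(e, e_star):
    """Bounded Relative Absolute Error (Eq 13) for each forecast.
    Defined as 0.5 when both the forecast error and the benchmark error are zero."""
    e, e_star = np.abs(np.asarray(e, dtype=float)), np.abs(np.asarray(e_star, dtype=float))
    denom = e + e_star
    out = np.full_like(denom, 0.5)                  # special case: |e_t| = |e*_t| = 0
    np.divide(e, denom, out=out, where=denom > 0)   # |e_t| / (|e_t| + |e*_t|) elsewhere
    return out
```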

In practice, the one-step naïve method is a commonly used benchmark, where F*t = Yt−1 and hence e*t = Yt − Yt−1. However, it should be noticed that the naïve method is not necessarily an effective benchmark. For example, when most forecasting methods can generally produce much smaller errors than the naïve method, BRAE will have the same issue as the percentage-error-based measures discussed above. Thus, it is preferable to use a suitably competitive method as the benchmark, such that BRAE values of around 0.5 are obtained.

Based on BRAE, a measure called the Mean Bounded Relative Absolute Error (MBRAE) can be defined as:

$$\mathrm{MBRAE} = \frac{1}{n}\sum_{t=1}^{n}\mathrm{BRAE}_t = \frac{1}{n}\sum_{t=1}^{n}\frac{|e_t|}{|e_t| + |e_t^{*}|} \tag{14}$$

Though MBRAE is adequate for comparing forecasting methods, it is a scaled error that cannot be directly interpreted as a normal error ratio reflecting the error size. In fact, the process of calculating GMRAE also involves a mean of log-scaled error ratios which is not easily interpretable, but that issue is addressed by converting the log-scaled error back to a normal ratio with the exponential function. Similarly, a transformation can be applied to MBRAE to obtain a more interpretable measure, which is termed the unscaled MBRAE (UMBRAE):

$$\mathrm{UMBRAE} = \frac{\mathrm{MBRAE}}{1 - \mathrm{MBRAE}} \tag{15}$$

With UMBRAE, the performance of a proposed forecasting method can be easily interpreted, in terms of the average relative absolute error based on BRAE, as follows: when UMBRAE is equal to 1, the proposed method performs roughly the same as the benchmark method; when UMBRAE < 1, the proposed method performs roughly (1−UMBRAE)*100% better than the benchmark method; when UMBRAE > 1, the proposed method is roughly (UMBRAE−1)*100% worse than the benchmark method.
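Putting Eqs 13–15 together, the following Python sketch computes UMBRAE from observations, forecasts and benchmark forecasts; the function name and the example values are ours, and the usage shown assumes the one-step naïve method as the benchmark.

```python
import numpy as np

def umbrae(y, f, f_benchmark):
    """UMBRAE (Eqs 13-15) of forecasts f against benchmark forecasts f_benchmark."""
    y = np.asarray(y, dtype=float)
    e = np.abs(y - np.asarray(f, dtype=float))
    e_star = np.abs(y - np.asarray(f_benchmark, dtype=float))
    denom = e + e_star
    brae = np.full_like(denom, 0.5)                   # 0.5 when both errors are zero
    np.divide(e, denom, out=brae, where=denom > 0)    # bounded relative absolute errors
    mbrae = brae.mean()                               # Eq 14
    return mbrae / (1.0 - mbrae)                      # Eq 15

# Hypothetical usage with the one-step naive benchmark: F*_t = Y_{t-1}
y = np.array([100.0, 103.0, 101.0, 106.0, 104.0])
f = np.array([101.0, 102.0, 103.0, 105.0, 104.5])
score = umbrae(y[1:], f[1:], y[:-1])
print(score)  # < 1: better than the naive benchmark; > 1: worse
```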

In general, UMBRAE is informative without the need to trim extreme errors. At the same time, based on the bounded errors, UMBRAE is resistant to outliers. It is also symmetric and obviously scale-independent. The benchmark used by UMBRAE is selectable, and the naïve method can easily be applied. A competitive benchmark is preferable for obtaining more intuitive results. To the best of our knowledge, UMBRAE has not been proposed before. We suggest it as a generally applicable accuracy measure for time series forecasting. UMBRAE would be particularly useful for cases where the performance of forecasting methods is not expected to be dominated by forecasting outliers.

Evaluation and results

In this section, the performance of UMBRAE is evaluated. The naïve method is used as the benchmark for UMBRAE. Properties such as reliability and sensitivity have been well investigated in the study by Armstrong and Collopy [6]. In their study, MAPE and MRAE have been assessed to be acceptable in terms of reliability and good in terms of sensitivity. In fact, these properties, especially reliability, cannot be easily examined. For example, in the reliability tests, if forecasting methods are expected to have the same rankings when they are evaluated by a reliable accuracy measure, these forecasting methods themselves have to perform stably on different time series. It is difficult to find such forecasting methods in the real world. Thus, these properties are not examined in our study. Instead, it is assumed that UMBRAE, based on relative errors, will also be reliable and sensitive to error changes. Consequently, our evaluation will be mainly focused on the expected properties mentioned in the previous Section. To make comparisons, other common measures mentioned in the review Section are also examined in our evaluation. Comparisons are firstly made with synthetic time series to specifically examine the required properties. Then the M3-Competition data with 3003 time series [7] are used to demonstrate how these measures perform with real-world data.

Evaluation with synthetic data

Three groups of synthetic time series data are used in the comparative study. These synthetic data are not designed to be representative of real-world data. Rather, they are selected to clearly show the drawbacks of accuracy measures in terms of the required properties. In the synthetic evaluations, the average one-step naïve error is used to scale errors for MASE.

One of the most desirable properties of an accuracy measure is the ability to resist outliers. Thus, the first group of synthetic data is designed to examine whether an accuracy measure is resistant to a single forecasting outlier. As shown in Fig 1, Yt is the target time series with 10 observations, which are randomly generated from a normal distribution (mean = 300, sd = 100). F1 to F4 are forecast series of Yt. Specifically, F1 has no obvious forecasting outlier and its forecasting errors measured by MAPE are approximately 10%. The other three forecasts are the same as F1 except that they each have a forecasting outlier at the eighth observation. Though occasionally occurring large errors should also be considered in evaluating the performance of a forecasting method, it is assumed that a single large outlier should not significantly affect the overall performance. However, the results in Fig 1 show that the errors reported by some accuracy measures are significantly dominated by the single forecasting outlier. The worst is RMSE, whose error for the forecast with the largest outlier is approximately 36 times larger than its error for F1. Though MASE is scaled from MAE, it in fact performs the same as MAE in dealing with the forecasting outlier. The errors given by MAE and MASE for that forecast are both distorted to be about 15 times larger than for F1. In contrast, sMAPE, GMRAE and UMBRAE are less sensitive to this single forecasting outlier. UMBRAE reports the smallest differences across the four forecast series.
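For reference, a sketch of how such a test set could be constructed; the forecast perturbations and outlier magnitudes below are hypothetical and chosen only for illustration, not taken from Fig 1.

```python
import numpy as np

rng = np.random.default_rng(1)

# Target series: 10 observations from N(mean=300, sd=100)
y = rng.normal(loc=300, scale=100, size=10)

# F1: roughly 10% absolute percentage error at every observation, no outlier
f1 = y * (1 + rng.choice([-0.1, 0.1], size=10))

# F2-F4: identical to F1 except for an outlier at the eighth observation
f2, f3, f4 = f1.copy(), f1.copy(), f1.copy()
f2[7], f3[7], f4[7] = y[7] * 2, y[7] * 5, y[7] * 10   # hypothetical outlier sizes
```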

Fig 1. Evaluation on the resistance of accuracy measures to a single forecasting outlier.

A: Synthetic time series data where Yt is the target series and F1 to F4 are forecasts. The only difference between the four forecasts is their forecast for the observation Y8. B: Results of the single-forecasting-outlier evaluation, which show that UMBRAE is less sensitive than the other measures to a single forecasting outlier.

https://doi.org/10.1371/journal.pone.0174202.g001

The second group of time series data is created to evaluate whether over-estimates and under-estimates are treated ‘fairly’ by the accuracy measures. As presented in Fig 2, Yt is the same time series as that used in the single-forecasting-outlier evaluation. In this scenario, F1 makes a 10% over-estimate of every observation in Yt while F2 makes a 10% under-estimate. The results in Fig 2 show that all the accuracy measures except sMAPE give the same error for F1 and F2. sMAPE produces a larger error for the under-estimating forecast F2, which indicates that it puts a heavier penalty on under-estimates than on over-estimates.

Fig 2. Evaluation on the symmetry of accuracy measures to over-estimates and under-estimates.

A: Synthetic time series data where Yt is the target series and F1 and F2 are forecasts. F1 makes a 10% over-estimate of every observation of Yt, while F2 makes a 10% under-estimate. B: Results of the symmetry evaluation, which show that UMBRAE and all the other accuracy measures except sMAPE are symmetric.

https://doi.org/10.1371/journal.pone.0174202.g002

Davydenko and Fildes [24] suggested another scenario for examining the symmetry of measures. In this scenario, the reward given for improving on the benchmark by some quantity is expected to balance the penalty given for doing worse than the benchmark by the same quantity. We also use this to examine our measure UMBRAE. Suppose that a time series has only two observations and there is one forecasting method to be compared against a benchmark method. The benchmark method makes forecasts with errors (y − f) of 1 and 2 respectively, while the forecasting method produces errors of 2 and 1 respectively. As expected, the forecasting method has an error of exactly 1 measured by UMBRAE against the benchmark method. Thus, UMBRAE is also symmetric in this case.
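For completeness, the arithmetic behind this check, following Eqs 13–15:

$$\mathrm{BRAE}_1 = \frac{2}{2+1} = \frac{2}{3}, \quad \mathrm{BRAE}_2 = \frac{1}{1+2} = \frac{1}{3}, \quad \mathrm{MBRAE} = \frac{1}{2}\left(\frac{2}{3}+\frac{1}{3}\right) = \frac{1}{2}, \quad \mathrm{UMBRAE} = \frac{1/2}{1-1/2} = 1.$$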

Normally, the scale-dependence issue of accuracy measures concerns their ability to evaluate forecasting performance across data series on different scales. Accuracy measures based on percentages or relative ratios are clearly suited to such evaluations and no synthetic data are made for this. However, the scale-dependence issue also exists within a data series. Thus, the third group of synthetic data, shown in Fig 3, is made to evaluate how accuracy measures deal with data on different scales within a single time series. In this data set, Yt is a time series generated from the Fibonacci sequence from 2 to 144. As forecasts of Yt, all forecast values of F1 are set to have a 20% over-estimate error relative to the corresponding observation of Yt. In contrast, F2 has the same mean absolute error as F1 but its errors are on different percentage scales, ranging from 1440% to 0.2%. Specifically, the absolute errors of F2 are those of F1 taken in reverse order. For instance, the first forecast of F2 has the same absolute error as the last forecast of F1, which is 28.8. As presented in Fig 3, MAE, RMSE, MASE and even GMRAE show no difference between the two forecasts. MRAE and MAPE, however, produce substantially different results for the two cases: the errors they report for F2 are approximately ten times larger than for F1. In contrast, UMBRAE and sMAPE give a moderate difference between the two forecasts.
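A sketch of how such a pair of forecasts could be constructed under the reversed-error reading above; the exact values used in Fig 3 may differ, so this is illustrative only.

```python
import numpy as np

# Fibonacci observations from 2 to 144
y = np.array([2, 3, 5, 8, 13, 21, 34, 55, 89, 144], dtype=float)

f1 = 1.2 * y                  # F1: constant 20% over-estimate at every observation
f2 = y + (0.2 * y)[::-1]      # F2: the same absolute errors as F1, in reverse order

# Both forecasts share the same mean absolute error
print(np.mean(np.abs(y - f1)), np.mean(np.abs(y - f2)))
```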

Fig 3. Evaluation on the scale dependency of accuracy measures.

A: Synthetic time series data where Yt is the target series and F1 and F2 are forecasts. F1 and F2 have the same mean absolute error, but their errors are on different percentage scales relative to the corresponding values of Yt. B: Results of the scale-dependency evaluation, where MAE, RMSE, MASE and even GMRAE show no difference between F1 and F2. MRAE and MAPE produce substantially different errors for the two cases. sMAPE and UMBRAE can reasonably distinguish the two forecasts.

https://doi.org/10.1371/journal.pone.0174202.g003

Evaluation with the M3-Competition data

The M-Competitions are well-known empirical studies which employ various real-world time series in comparing the performance of forecasting methods. In this study, we use the M3-Competition data [7], which contain 3003 time series, to evaluate our proposed measure. The forecasting data are available in the R package ‘Mcomp’ maintained by Hyndman, available from his website: http://robjhyndman.com/software/mcomp/. Among the 24 forecasting methods in the M3-Competition, 22 are used in our evaluation since their forecasts are available for all 3003 time series. Since the one-step naïve method is used by many accuracy measures as the benchmark, it is also listed in the results as a forecasting method. As an alternative version of MASE, AvgRelMAE, which uses the geometric mean to average errors across time series, is also included in this evaluation. To simplify the results, errors are only measured at the first six forecasting horizons across the 3003 time series, which are available for all of the 22 forecasting methods.

The results are listed in Table 1. It can be noticed that the errors given by MAE and RMSE are relatively large numbers which are difficult to interpret without comparison. UMBRAE is able to give interpretable results, where a forecasting method with an error < 1 can be considered better than the benchmark method in terms of the average relative absolute error based on BRAE. As shown in the results, the naïve method, which is the benchmark used by UMBRAE, has an error of 1. Errors of the other forecasting methods measured by UMBRAE are all less than 1, which indicates that these forecasting methods are better than the naïve method. However, MRAE gives the opposite result, ranking the naïve method as the best. It should also be noticed that all the errors measured by AvgRelMAE, excluding that for the naïve method, are smaller than 1, whereas all the errors measured by MASE are much larger than 1. The rank correlation coefficients of the different measures are shown in Table 2. The correlation between RMSE, or MRAE, and the other measures is extremely low. In contrast, UMBRAE shows substantially high agreement with most other measures, with an average Spearman rank correlation of 0.516. In particular, UMBRAE has remarkably high correlations with GMRAE and AvgRelMAE, which are 0.995 and 0.990 respectively.

Table 1. Results on M3-Competition data at first six forecasting horizons.

https://doi.org/10.1371/journal.pone.0174202.t001

Table 2. Spearman’s rank correlation coefficient of the rankings in Table 1.

https://doi.org/10.1371/journal.pone.0174202.t002

To eliminate the influence of outliers and extreme errors, we also use trimmed means to evaluate the accuracy measures. A 3% trimming level is used in our study. As shown in Table 3, most errors measured by MAE, RMSE, MASE, MRAE and MAPE differ significantly from those without trimming shown in Table 1. The rankings of forecasting methods made by these measures also change significantly. In contrast, errors and rankings measured by the other measures change less. In particular, the value of UMBRAE is quite invariant to trimming, with differences appearing only after the third decimal place for most of the forecasting methods. It can also be noticed that the rankings made by UMBRAE in Table 3 remain the same as those in Table 1. In general, all the measures except MRAE give similar rankings. As shown in Table 4, the rank correlations between UMBRAE and the other measures are much higher on average after trimming.

Table 3. Results with a 3% trimming level on M3-Competition data at first six forecasting horizons.

https://doi.org/10.1371/journal.pone.0174202.t003

Table 4. Spearman’s rank correlation coefficient of the rankings in Table 3.

https://doi.org/10.1371/journal.pone.0174202.t004

To show the error distributions in a similar manner to that in [24], we use the errors produced by the forecasting method ForecastPro as an example. Figs 4 to 11 show the distributions of the eight underlying error measurements used in the nine accuracy measures mentioned in this paper. In each figure, the top plot shows the kernel density estimate of the errors, illustrating their distribution, while the bottom shows a box-and-whisker plot which highlights the outliers more clearly. From these figures, it can be seen that the error measurements used in UMBRAE are more evenly distributed, with fewer outliers than those of the other measures.

Fig 4. Box-and-whisker plot and kernel density estimates for the absolute errors used by MAE.

https://doi.org/10.1371/journal.pone.0174202.g004

Fig 5. Box-and-whisker plot and kernel density estimates for the squared errors used by RMSE.

https://doi.org/10.1371/journal.pone.0174202.g005

Fig 6. Box-and-whisker plot and kernel density estimates for the absolute scaled errors used by MASE.

https://doi.org/10.1371/journal.pone.0174202.g006

Fig 7. Box-and-whisker plot and kernel density estimates for the absolute scaled errors used by AvgRelMAE (log-scale).

https://doi.org/10.1371/journal.pone.0174202.g007

Fig 8. Box-and-whisker plot and kernel density estimates for the relative absolute errors used by MRAE and GMRAE (log-scale, forecasts with zero or undefined error excluded).

https://doi.org/10.1371/journal.pone.0174202.g008

Fig 9. Box-and-whisker plot and kernel density estimates for the absolute percentage errors used by MAPE.

https://doi.org/10.1371/journal.pone.0174202.g009

Fig 10. Box-and-whisker plot and kernel density estimates for the scaled percentage errors used by sMAPE.

https://doi.org/10.1371/journal.pone.0174202.g010

Fig 11. Box-and-whisker plot and kernel density estimates for the bounded relative absolute errors used by UMBRAE (using the naïve errors as the benchmark).

https://doi.org/10.1371/journal.pone.0174202.g011

Discussion

Fig 1 shows that MRAE and MAPE can easily be dominated by a single forecasting outlier. This is because they are based on the arithmetic mean and there is no upper bound on individual errors. In practice, this poor resistance to forecasting outliers may produce misleading results, as illustrated by our evaluation on the M3-Competition data. As shown in Table 1, MRAE gives significantly different rankings from the other measures: it suggests the naïve method performs best, while almost all the other accuracy measures indicate that the naïve method is the worst. By examining the forecasting data, we find that the results measured by MRAE are seriously distorted by the extremely large relative absolute errors that occur where the naïve errors are small. With the geometric mean, GMRAE shows remarkable resistance to forecasting outliers. However, one disadvantage of measures based on the geometric mean is that zero-error forecasts have to be excluded; thus, these measures may not be sufficiently informative. In contrast, due to its bounded errors, UMBRAE can perform as well as GMRAE in resisting forecasting outliers. In fact, the errors and rankings given by UMBRAE are remarkably correlated with those measured by GMRAE, especially in Tables 3 and 4 where extreme errors are trimmed. Thus, for cases where measures such as GMRAE are preferred, UMBRAE could be an alternative measure since it is much easier to use, without the need to trim errors.

It can also be noticed in Figs 4 to 11 that all the accuracy measures except AvgRelMAE (see Fig 7), GMRAE (see Fig 8) and UMBRAE (see Fig 11) have highly skewed distributions with long tails including extremely large forecasting outliers. Although undefined and zero errors (0.5%) have been trimmed, GMRAE still contains about 10.2% forecasting outliers, including some large log-transformed errors such as -10.76 and 8.08. Although the bounded errors used by sMAPE (see Fig 10) and UMBRAE also contain some outliers, there are no extremely large errors. Specifically, UMBRAE follows a symmetric distribution and only about 3% of its errors are outliers, which do not affect the result significantly.

It has to be noted that UMBRAE does not necessarily provide the same information as GMRAE. For example, given a time series with a million observations, if the forecasting method and the benchmark method produce errors (y − f) of e and e* respectively, each following the standard normal distribution, UMBRAE and GMRAE will both be approximately 1. However, if the forecasting method instead produces errors of 2e, the value of GMRAE will be approximately 2, as one may expect, but UMBRAE will give an error of approximately 1.67, which is less than 2. This is because the bounded error used by UMBRAE does not increase much when the error e is doubled in cases where |e| is already much larger than |e*|. In other words, a forecast that is twice as bad will not be assigned twice the error by UMBRAE when the forecast is much worse than most other forecasts. In fact, this is the key strategy of UMBRAE for resisting outliers. Also, the above expectation of an error of 2 is based on estimation by the ‘relative average error’. However, it is arguable that the ‘average relative error’ is not necessarily the same as the ‘relative average error’. This is reflected, to some extent, by the synthetic test shown in Fig 3. More discussion of this is given later in this section in terms of scale-independence. We believe that this issue does not invalidate the use of UMBRAE in practice.
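This behaviour can be checked with a short Monte Carlo sketch (ours, not from the paper), which computes both measures directly from simulated error series:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

e_star = rng.standard_normal(n)   # benchmark errors e*
e = rng.standard_normal(n)        # forecasting method errors e

def umbrae_from_errors(e, e_star):
    """UMBRAE computed directly from error series (Eqs 13-15)."""
    brae = np.abs(e) / (np.abs(e) + np.abs(e_star))
    mbrae = brae.mean()
    return mbrae / (1.0 - mbrae)

def gmrae_from_errors(e, e_star):
    """GMRAE computed directly from error series (Eq 7)."""
    return np.exp(np.mean(np.log(np.abs(e) / np.abs(e_star))))

print(umbrae_from_errors(e, e_star), gmrae_from_errors(e, e_star))          # both close to 1
print(umbrae_from_errors(2 * e, e_star), gmrae_from_errors(2 * e, e_star))  # about 1.67 and 2
```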

One of the common concerns about an accuracy measure is whether it is symmetric. Two different cases were used to evaluate the symmetry of accuracy measures. In our view, the first case concerns symmetry in the absolute quantity, i.e. whether equal over-estimates and under-estimates are treated fairly by a measure. As shown in Fig 2, only sMAPE is not symmetric in the absolute quantity (due to the asymmetric bounded errors it uses). This issue has been addressed by UMBRAE through its symmetric bounded errors. The second case is, in fact, about symmetry in the relative quantity, where measures are expected to give a result of 1 when averaging two relative errors N and 1/N. Normally, a measure which uses the arithmetic mean would not be symmetric in this relative sense. However, UMBRAE, which uses the arithmetic mean for part of its calculation, gives a symmetric result. This is because UMBRAE does not work directly on the original error ratios: the original relative errors are converted to bounded relative errors before the arithmetic mean is taken. In fact, this is quite similar to the process of calculating GMRAE, which is based on the geometric mean. As a result, the use of the arithmetic mean is not an issue for UMBRAE. Figs 8 and 11 show that the errors used by both GMRAE and UMBRAE follow symmetric distributions.

It is necessary (or at least highly desirable) for an accuracy measure to be scale-independent when assessing forecasting methods across data on different scales. Normally, measures based on percentages or ratios in the same range are considered to be scale-independent. However, we argue that it is not enough for these percentages or ratios to be in the same range. To be truly scale-independent, the error percentages or ratios should also be closely related to the scale of the data at the specific observations; otherwise, they may lead to misleading results. For example, in Table 1, the MASE error for the naïve method is 2.134. This is a somewhat confusing result which may be intuitively interpreted as indicating that the naïve method performs worse than the naïve method itself. In fact, it means the naïve method gives larger errors on average for the out-of-sample forecasting data than for the in-sample data. In contrast, AvgRelMAE does not have this issue since it uses the average out-of-sample error of the benchmark as the scaling factor. Fig 3 shows that MASE fails to distinguish between the two forecasts, which are clearly different considering the error percentages at different observations. This is because every single error used by MASE at different observations is scaled by the same scaling factor. GMRAE also fails in this evaluation. We note that this is because GMRAE, in fact, has the same issue as MASE: every single error of GMRAE can also be considered a scaled error based on a consistent scaling factor GMAE*, which is the geometric mean of the benchmark errors e*. Accordingly, we conclude that MASE, AvgRelMAE and GMRAE are only relatively scale-independent because they assume that the scaling factor is a consistent estimator. In contrast, UMBRAE is scale-independent and is closely related to the error ratios at individual observations. Thus, it can reasonably show the difference between the two forecasts with respect to error percentages.

Another important property of an accuracy measure is its interpretability. As Table 1 shows, the numerical errors measured by MAE and RMSE have little intuitive meaning without comparisons, and have therefore been scored as ‘fair’. Comparatively, measures which produce errors in percentages or ratios based on a benchmark are more interpretable. The benchmark used by an accuracy measure is also important for its interpretability. In Table 1, errors measured by MAPE are all small errors around 10%. However, these small errors are less meaningful without comparisons. This is because these small percentages are based on the original values of observations. Thus, they do not necessarily indicate a good performance. In contrast, errors measured by UMBRAE are more interpretable. An error of 0.77 indicates that the forecasting method performs approximately 23% better than the benchmark method.

As shown in Table 5, the accuracy measures are rated against the key criteria considered in this paper. Measures are considered less informative if undefined or zero errors have to be excluded. The property of symmetry is rated in both absolute quantity and relative quantity, as discussed above. Measures are rated as only relatively scale-independent if they assume that the scaling factor is a consistent estimator. Relative-based accuracy measures are considered more interpretable than other measures since they can provide more intuitive results in terms of performance without extra comparisons. sMAPE is rated as poor in interpretability since its error, which has a range of (0, 200), is not as easy to understand as MAPE.

In summary, we show that UMBRAE (i) is informative and uses all available errors; (ii) can perform as well as GMRAE in resisting forecasting outliers without the need to trim zero-error forecasts; (iii) is symmetric in both absolute quantity and relative quantity; (iv) is scale-independent; (v) is interpretable and can provide intuitive results. As such, UMBRAE combines the best features of various alternative measures into a single new measure. Thus, we believe UMBRAE is an interesting new measure because it is simple, flexible, easy to use and understand, and resistant to outliers. Also, the forecasting benchmark for calculating UMBRAE is selectable, and the ideal choice is a competitive forecasting method that the evaluated methods aim to outperform. As a well-known benchmark, the naïve method can easily be applied as a default to show whether a forecasting method is generally good or not.

Conclusion

We have proposed a new accuracy measure, UMBRAE, based on bounded relative errors. As discussed in the review of sMAPE, one advantage of the bounded error is that it gives less significance to outliers and does not have the issue of being excessively large or infinite. The proposed measure, along with related measures, has been evaluated on both synthetic and real-world data. We have shown that UMBRAE combines the best features of various alternative measures without having their common drawbacks. UMBRAE, with a selectable benchmark, can provide an informative and interpretable result based on bounded relative errors. It is less sensitive to forecasting outliers than other measures. It is also symmetric and scale-independent. Though it has been commonly accepted that there cannot be any single best accuracy measure, we suggest that UMBRAE is a good choice for general use when evaluating the performance of forecasting methods. Since UMBRAE, in our study, performs similarly to GMRAE without the need to trim zero-error forecasts, we particularly recommend UMBRAE as an alternative measure for cases where GMRAE is preferred.

Although we have shown that UMBRAE has many advantages as described above, its statistical properties have not been fully studied. For example, how UMBRAE reflects the properties of the error distribution is unclear. Moreover, one possible underlying drawback of UMBRAE is that its bounded error reaches the maximum value of 1.0 when the benchmark error (e*t) is equal to zero, even if the forecast is good. This may produce a biased estimate, especially when the benchmark method produces a large number of zero errors. Although this drawback may not be relevant for the majority of real-world data, we would like to address this issue in future work.

Author Contributions

  1. Conceptualization: CC.
  2. Data curation: CC.
  3. Formal analysis: CC JT JMG.
  4. Funding acquisition: JT JMG.
  5. Investigation: CC JT JMG.
  6. Methodology: CC JT JMG.
  7. Project administration: JT JMG.
  8. Software: CC.
  9. Supervision: JT JMG.
  10. Validation: CC JT JMG.
  11. Visualization: CC JT JMG.
  12. Writing – original draft: CC JT JMG.
  13. Writing – review & editing: CC JT JMG.

References

  1. De Gooijer JG, Hyndman RJ. 25 Years of Time Series Forecasting. International Journal of Forecasting. 2006;22(3):443–473.
  2. Gao ZK, Jin ND. A directed weighted complex network for characterizing chaotic dynamics from time series. Nonlinear Analysis: Real World Applications. 2012;13(2):947–952.
  3. Gao ZK, Yang YX, Fang PC, Zou Y, Xia CY, Du M. Multiscale complex network for analyzing experimental multivariate time series. Europhysics Letters. 2015;109(3):30005.
  4. Gao ZK, Small M, Kurths J. Complex network analysis of time series. Europhysics Letters. 2016;116(5):50001.
  5. Gao ZK, Cai Q, Yang YX, Dang WD, Zhang SS. Multiscale limited penetrable horizontal visibility graph for analyzing nonlinear time series. Scientific Reports. 2016;6:35622. pmid:27759088
  6. Armstrong JS, Collopy F. Error measures for generalizing about forecasting methods: Empirical comparisons. International Journal of Forecasting. 1992;8(1):69–80.
  7. Makridakis S, Hibon M. The M3-Competition: results, conclusions and implications. International Journal of Forecasting. 2000;16(4):451–476.
  8. Fildes R. The evaluation of extrapolative forecasting methods. International Journal of Forecasting. 1992;8(1):81–98.
  9. Clements MP, Hendry DF. On the limitations of comparing mean square forecast errors. Journal of Forecasting. 1993;12(8):617–637.
  10. Makridakis S. Accuracy measures: theoretical and practical concerns. International Journal of Forecasting. 1993;9(4):527–529.
  11. Armstrong JS, Fildes R. Correspondence on the selection of error measures for comparisons among forecasting methods. Journal of Forecasting. 1995;14(1):67–71.
  12. Makridakis S, Andersen A, Carbone R, Fildes R, Hibon M, Lewandowski R, et al. The accuracy of extrapolation (time series) methods: Results of a forecasting competition. Journal of Forecasting. 1982;1(2):111–153.
  13. Olaofe ZO. A 5-day wind speed & power forecasts using a layer recurrent neural network (LRNN). Sustainable Energy Technologies and Assessments. 2014;6:1–24.
  14. Svalina I, Galzina V, Lujić R, Šimunović G. An adaptive network-based fuzzy inference system (ANFIS) for the forecasting: The case of close price indices. Expert Systems with Applications. 2013;40(15):6055–6063.
  15. Boyacioglu MA, Avci D. An Adaptive Network-Based Fuzzy Inference System (ANFIS) for the prediction of stock market return: The case of the Istanbul Stock Exchange. Expert Systems with Applications. 2010;37(12):7908–7912.
  16. Wei LY, Chen TL, Ho TH. A hybrid model based on adaptive-network-based fuzzy inference system to forecast Taiwan stock market. Expert Systems with Applications. 2011;38(11):13625–13631.
  17. Esfahanipour A, Aghamiri W. Adapted neuro-fuzzy inference system on indirect approach TSK fuzzy rule base for stock market analysis. Expert Systems with Applications. 2010;37(7):4742–4748.
  18. Hyndman RJ, Koehler AB. Another look at measures of forecast accuracy. International Journal of Forecasting. 2006;22(4):679–688.
  19. Makridakis SG, Wheelwright SC, Hyndman RJ. Forecasting: Methods and Applications. Wiley series in management. Wiley; 1998.
  20. Armstrong JS. Measures of Accuracy. In: Long-Range Forecasting: From Crystal Ball to Computer. A Wiley-Interscience Publication. Wiley; 1985. p. 346–354.
  21. Goodwin P, Lawton R. On the asymmetry of the symmetric MAPE. International Journal of Forecasting. 1999;15(4):405–408.
  22. Ord K. Commentaries on the M3-Competition: An introduction, some comments and a scorecard. International Journal of Forecasting. 2001;17(4):537–541.
  23. Chen Z, Yang Y. Assessing forecast accuracy measures; 2004. Available from: https://www.researchgate.net/publication/228774888_Assessing_forecast_accuracy_measures.
  24. Davydenko A, Fildes R. Measuring forecasting accuracy: The case of judgmental adjustments to SKU-level demand forecasts. International Journal of Forecasting. 2013;29(3):510–522.
  25. Fleming PJ, Wallace JJ. How not to lie with statistics: the correct way to summarize benchmark results. Communications of the ACM. 1986;29(3):218–221.
  26. Wright DJ, Capon G, Pagé R, Quiroga J, Taseen AA, Tomasini F. Evaluation of forecasting methods for decision support. International Journal of Forecasting. 1986;2(2):139–152.
  27. Chatfield C. Apples, oranges and mean square error. International Journal of Forecasting. 1988;4(4):515–518.
  28. Armstrong JS. Evaluating forecasting methods. In: Principles of Forecasting: A Handbook for Researchers and Practitioners. vol. 30. Springer US; 2001. p. 443–472.