Asymptotic Properties of Spearman’s Rank Correlation for Variables with Finite Support

Petra Ornstein; Johan Lyhagen

doi:10.1371/journal.pone.0145595

Abstract

The asymptotic variance and distribution of Spearman’s rank correlation have previously been known only under independence. For variables with finite support, the population version of Spearman’s rank correlation has been derived. Using this result, we show convergence to a normal distribution irrespective of dependence, and derive the asymptotic variance. A small simulation study indicates that the asymptotic properties are of practical importance.

Citation: Ornstein P, Lyhagen J (2016) Asymptotic Properties of Spearman’s Rank Correlation for Variables with Finite Support. PLoS ONE 11(1): e0145595. https://doi.org/10.1371/journal.pone.0145595

Editor: Shyamal D. Peddada, National Institute of Environmental and Health Sciences, UNITED STATES

Received: June 13, 2013; Accepted: December 7, 2015; Published: January 5, 2016

Copyright: © 2016 Ornstein, Lyhagen. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Funding: The authors have no support or funding to report.

Competing interests: The authors have declared that no competing interests exist.

Introduction

A common question when looking at new data is “Does Y tend to increase when X increases?” When X and Y are ordinal, the nonparametric Spearman’s sample rank correlation, ρ̂_s, is frequently used to measure the association.

Spearman originally thought of the situation where a small group of individuals are rated on two separate tasks [1]. His question was whether there existed an association between an individual’s two ratings. As ρ̂_s is defined as the sample correlation of the ranks of two variables, this question translates to whether ρ̂_s is significantly different from zero. In cases when there are no ties, ρ̂_s asymptotically follows a normal distribution under independence [2]. In practice, ρ̂_s is often used not for ratings, but for Likert type survey variables that take only a few values. When both variables are discrete with only a few categories, bias from not taking ties into account can become considerable with increasing sample size. In addition, the question of interest often concerns not only whether there exists an association but also the size of that association. For example, the association between smoking and lung function has been heavily researched during the last half century. Both smoking and lung function are typically measured in categories, and the question of interest has over time shifted from whether smoking decreases lung function to the extent of the impact. In such cases, when ties cannot be disregarded or the research question is not posed against independence, an asymptotic distribution is lacking ([3], p. 7904).

The focus of this paper is on the properties of ρ̂_s when used as a measure of association between variables with finite support. Nešlehová [4] has constructed a population version of Spearman’s rho for discrete variables, ρ_s. In this article, we apply Nešlehová’s results to the sample version of Spearman’s rank correlation, deriving its asymptotic properties and showing the importance of Nešlehová’s work to statistics.

In the next section we introduce ρ_s and ρ̂_s for discrete variables with finite support. In Section three we derive the asymptotic properties of ρ̂_s. Section four presents simulation results and some empirical examples. A conclusion ends the paper.

Definitions

We are interested in the case when X and Y are discrete random variables with probability mass functions p_i = P(X = i) and q_j = P(Y = j) with finite support i ∈ {1, …, I}, and j ∈ {1, …, J}, I, J ∈ [2, ∞). Spearman’s sample rank correlation is typically seen in the following form

$$\hat{\rho}_s = \frac{\sum_{i=1}^{n} (R_i - \bar{R})(S_i - \bar{S})}{\sqrt{\sum_{i=1}^{n} (R_i - \bar{R})^2 \, \sum_{i=1}^{n} (S_i - \bar{S})^2}} \tag{1}$$

where n denotes the sample size, R_i = rank(X_i), S_i = rank(Y_i), and R̄ = S̄ = (n + 1)/2.
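As a small illustration of the sample statistic with tied observations, the sketch below computes the correlation of midranks (tied values share the average of their positions). It assumes the correlation-of-ranks form of Spearman’s sample correlation; the function names `midranks` and `spearman_sample` are hypothetical and chosen only for this example.

```python
from math import sqrt

def midranks(values):
    """Return 1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in order[start:end + 1]:
            ranks[k] = avg
        start = end + 1
    return ranks

def spearman_sample(x, y):
    """Sample correlation of the midranks of x and y."""
    r, s = midranks(x), midranks(y)
    mean = (len(x) + 1) / 2  # midranks always average to (n + 1) / 2
    num = sum((ri - mean) * (si - mean) for ri, si in zip(r, s))
    den = sqrt(sum((ri - mean) ** 2 for ri in r) * sum((si - mean) ** 2 for si in s))
    return num / den
```

For example, `spearman_sample([1, 1, 2, 2], [1, 2, 1, 2])` is zero, since the two variables are balanced across each other's categories.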

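Prior to Nešlehová’s work, Spearman’s sample correlation did not have a population version for non-continuous variables. In this section we present Nešlehová’s population version of Spearman’s rank correlation for variables that take a finite number of values [4]. In such cases, the relation between X and Y can be represented in a contingency table, and ρ_s can be written as a function of the cell probabilities. We denote the joint probability mass function h_ij = P(X = i ∩ Y = j). Then, p_i = Σ_j h_ij and q_j = Σ_i h_ij. The cumulative marginal distribution functions are then F(i) = Σ_{k ≤ i} p_k and G(j) = Σ_{l ≤ j} q_l respectively, with F(0) = G(0) = 0. Nešlehová ([5], p. 94–95) defines Spearman’s rank correlation as

$$\rho_s = \frac{3 \sum_{i=1}^{I} \sum_{j=1}^{J} h_{ij}\,[F(i) + F(i-1) - 1]\,[G(j) + G(j-1) - 1]}{\sqrt{\left(1 - \sum_{i=1}^{I} p_i^3\right)\left(1 - \sum_{j=1}^{J} q_j^3\right)}} \tag{2}$$

ρ_s is defined for cases with at least some variation in both X and Y, so that Σ_i p_i³ < 1 and Σ_j q_j³ < 1. We denote the empirical marginal distribution functions by F̂ and Ĝ, the estimated cell proportion in cell i, j by ĥ_ij, and let p̂_i = Σ_j ĥ_ij, q̂_j = Σ_i ĥ_ij. It turns out that the sample version of ρ_s equals the standard Spearman’s sample correlation. We thus have a second available expression of ρ̂_s ([4], p. 564)

$$\hat{\rho}_s = \frac{3 \sum_{i=1}^{I} \sum_{j=1}^{J} \hat{h}_{ij}\,[\hat{F}(i) + \hat{F}(i-1) - 1]\,[\hat{G}(j) + \hat{G}(j-1) - 1]}{\sqrt{\left(1 - \sum_{i=1}^{I} \hat{p}_i^3\right)\left(1 - \sum_{j=1}^{J} \hat{q}_j^3\right)}} \tag{3}$$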
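To make the contingency-table form concrete, the following sketch evaluates the rank correlation directly from a table of cell probabilities (or sample proportions). It assumes the midrank form ρ_s = 3 Σ_i Σ_j h_ij [F(i) + F(i − 1) − 1][G(j) + G(j − 1) − 1] / √((1 − Σ_i p_i³)(1 − Σ_j q_j³)); the function name `spearman_from_table` is hypothetical.

```python
from math import sqrt

def spearman_from_table(h):
    """Rank correlation from an I x J table of cell probabilities summing to 1."""
    p = [sum(row) for row in h]          # row margins p_i
    q = [sum(col) for col in zip(*h)]    # column margins q_j

    def scores(margin):
        # F(i) + F(i-1) - 1, built from the running cumulative sum.
        out, cum = [], 0.0
        for m in margin:
            out.append(2 * cum + m - 1)
            cum += m
        return out

    a, b = scores(p), scores(q)
    num = 3 * sum(h_ij * a[i] * b[j]
                  for i, row in enumerate(h) for j, h_ij in enumerate(row))
    den = sqrt((1 - sum(m ** 3 for m in p)) * (1 - sum(m ** 3 for m in q)))
    return num / den

# Five observations (1,1), (1,2), (2,2), (2,3), (3,3) as a table of proportions.
proportions = [[0.2, 0.2, 0.0],
               [0.0, 0.2, 0.2],
               [0.0, 0.0, 0.2]]
rho_hat = spearman_from_table(proportions)
```

Applied to sample proportions, this gives the same value as the correlation of midranks computed from the raw observations, which for the five observations above is 7.25/9.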

Asymptotic properties of ρ̂_s

In this section we use the definitions presented above and apply the delta theorem to derive consistency, asymptotic unbiasedness, and asymptotic normality of ρ̂_s between variables with finite support.

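As Σ_i Σ_j h_ij = 1, there are only IJ − 1 unique probabilities and we can write the last cell probability as 1 − Σ_{(i,j) ≠ (I,J)} h_ij. Denote the vector of all cell probabilities h_IJ = [h₁₁, …, h_IJ]^T, and to avoid linear dependence, define the vector h = [h₁₁, …, h_{I − 1,J}]^T as the first IJ − 1 entries of h_IJ.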

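Theorem. If X and Y are discrete random variables with finite support, ρ_s is as defined in Eq (2), the gradient of ρ_s with respect to h is denoted by ∇ρ_s, and the covariance matrix of h is denoted by Σ, then

$$\sqrt{n}\,(\hat{\rho}_s - \rho_s) \xrightarrow{d} N\!\left(0,\ \nabla\rho_s^{T}\,\Sigma\,\nabla\rho_s\right) \tag{4}$$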

Proof. Let ĥ_IJ and ĥ denote the vectors of sample cell proportions corresponding to h_IJ and h. As shown by ([6], p. 419), √n(ĥ_IJ − h_IJ) converges in distribution to a singular multivariate normal distribution with mean zero, covariance matrix diag(h_IJ) − h_IJ h_IJ^T and rank IJ − 1. It follows that ĥ_IJ converges in probability to h_IJ. This implies that √n(ĥ − h) converges in distribution to a nondegenerate multivariate normal distribution with mean zero, and covariance matrix Σ = diag(h) − h h^T. As all terms in Eq (2) are functions of h, ρ_s can be consistently estimated from the cell proportions.

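Next, we show that ρ_s is continuous with continuous first partial derivatives. Denote the separate terms of ρ_s as follows:

$$A = 3 \sum_{i=1}^{I} \sum_{j=1}^{J} h_{ij}\,[F(i) + F(i-1) - 1]\,[G(j) + G(j-1) - 1] \tag{5}$$

$$B = \left(1 - \sum_{i=1}^{I} p_i^3\right)\left(1 - \sum_{j=1}^{J} q_j^3\right) \tag{6}$$

Then

$$\rho_s = A\,B^{-1/2} \tag{7}$$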

Since Σ_i p_i³ < 1 and Σ_j q_j³ < 1 we have that 0 < B^k < ∞, ∀k. A and B are simple functions of h, involving no division. Therefore, ρ_s is smooth with respect to h, implying that application of the delta theorem to ρ̂_s is straightforward. We thus conclude that √n(ρ̂_s − ρ_s) converges to the distribution given in Eq (4).

For construction of the asymptotic covariance matrix, the gradient ∇ρ_s is given below. (8) where , , and for all (r, s) ≠ (I, J), (9) (10)
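As a rough numerical cross-check of the delta-method variance, the gradient can also be approximated by central finite differences instead of a closed form. The sketch below is not the authors’ implementation: it assumes the midrank form of ρ_s as a function of the cell probabilities, treats the last cell as one minus the sum of the others, and uses Σ = diag(h) − h h^T. The names `rho_s`, `asymptotic_variance` and `confidence_interval` are hypothetical.

```python
from math import sqrt

def rho_s(table):
    """Rank correlation evaluated on an I x J table of cell probabilities."""
    p = [sum(row) for row in table]
    q = [sum(col) for col in zip(*table)]

    def scores(margin):
        out, cum = [], 0.0
        for m in margin:
            out.append(2 * cum + m - 1)   # F(i) + F(i-1) - 1
            cum += m
        return out

    a, b = scores(p), scores(q)
    num = 3 * sum(h * a[i] * b[j]
                  for i, row in enumerate(table) for j, h in enumerate(row))
    den = sqrt((1 - sum(m ** 3 for m in p)) * (1 - sum(m ** 3 for m in q)))
    return num / den

def asymptotic_variance(table, eps=1e-6):
    """Delta-method variance of sqrt(n) * rho_hat with a finite-difference gradient."""
    n_cols = len(table[0])
    free = [h for row in table for h in row][:-1]   # drop the last cell

    def rho_of(free_vec):
        full = list(free_vec) + [1 - sum(free_vec)]  # last cell absorbs the constraint
        rows = [full[k:k + n_cols] for k in range(0, len(full), n_cols)]
        return rho_s(rows)

    grad = []
    for k in range(len(free)):
        up, down = list(free), list(free)
        up[k] += eps
        down[k] -= eps
        grad.append((rho_of(up) - rho_of(down)) / (2 * eps))

    # g' (diag(h) - h h') g  =  sum(h g^2) - (sum(h g))^2
    s1 = sum(h * g * g for h, g in zip(free, grad))
    s2 = sum(h * g for h, g in zip(free, grad))
    return s1 - s2 ** 2

def confidence_interval(table, n, z=1.96):
    """Approximate 95% interval for rho_s from sample proportions and sample size n."""
    est = rho_s(table)
    half = z * sqrt(asymptotic_variance(table) / n)
    return est - half, est + half
```

For a 2 × 2 table with all cells equal to 0.25 the variance evaluates to 1, which agrees with the known limit under independence. In practice the table would hold the sample proportions ĥ_ij.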

A Monte Carlo experiment and empirical examples

In this section we first exemplify our results by a small Monte Carlo simulation and then by empirical examples. The Monte Carlo simulation is based on 20000 replicates for the sample sizes n = [50, 100, 200, 400, 800] and carried out in MATLAB version R2012b. For each replicate the data are generated as follows. First n observations are generated from a bivariate normal distribution with correlation 0.5. The variables are then discretized into five categories each such that the first variable has equal proportions, i.e. p_i = [0.2, 0.2, 0.2, 0.2, 0.2] and the second is skewed, q_j = [0.5, 0.25, 0.125, 0.0625, 0.0625]. This yields a population rank correlation ρ_s of 0.4249. In Table 1 the results from the Monte Carlo simulation are shown. In addition we run the simulation generating data from a bivariate normal distribution with correlation 0.95. The results from this simulation are consistent with those presented. In rows one and two, the bias and mean square error (MSE) of ρ̂_s are presented. From a practical perspective the bias is very close to zero. As the bias is close to zero the MSE is basically the variance, and as could be expected the MSE is halved when the sample size is doubled.
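The data-generating step can be sketched as follows. The original study was run in MATLAB, so this Python version is only an illustration. It assumes that discretization is done by cutting each standard normal margin at the quantiles matching the cumulative category proportions, which the text does not state explicitly. The helper names `thresholds`, `categorize` and `simulate_table` are hypothetical.

```python
import random
from statistics import NormalDist

def thresholds(probs):
    """Cut points on the standard normal scale giving the requested category proportions."""
    cuts, cum = [], 0.0
    for p in probs[:-1]:
        cum += p
        cuts.append(NormalDist().inv_cdf(cum))
    return cuts

def categorize(z, cuts):
    """Return the 1-based category of z given increasing cut points."""
    return 1 + sum(z > c for c in cuts)

def simulate_table(n, corr, p, q, rng):
    """Draw n correlated normal pairs, discretize them, and return the I x J count table."""
    cx, cy = thresholds(p), thresholds(q)
    counts = [[0] * len(q) for _ in p]
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        # Second coordinate has correlation `corr` with the first.
        z2 = corr * z1 + (1 - corr ** 2) ** 0.5 * rng.gauss(0, 1)
        counts[categorize(z1, cx) - 1][categorize(z2, cy) - 1] += 1
    return counts

p = [0.2, 0.2, 0.2, 0.2, 0.2]
q = [0.5, 0.25, 0.125, 0.0625, 0.0625]
counts = simulate_table(800, 0.5, p, q, random.Random(2016))
```

Dividing each count by n gives the sample proportions from which the rank correlation and its variance are computed in each replicate.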

Table 1. Bias, MSE and rejection rates for the Spearman rank correlation.

Rejection rates should be compared to the nominal 5%.

https://doi.org/10.1371/journal.pone.0145595.t001

One way to analyze the normality of a statistic is to make a simple z-test at e.g. the 5% level. If the normality assumption is true then we would expect the rejection rate to be 5%. A 95% confidence interval for a proportion of 0.05 is 0.047–0.053 for 20000 replicates. This means that observed proportions outside this interval would indicate that normality is not the case. In this part of the simulation we compare the asymptotic estimator with two other estimation strategies: the large sample approximation suggested by [7], available through e.g. MATLAB’s function corr, and the empirical bootstrap. As the corr function does not give the variance but the p-value, the variance is solved from the formula of the z-statistic. The comparison with MATLAB’s built-in function is chosen because it is easily available and therefore commonly used. However, this approximation disregards ties and is valid only under independence. We also analyze other approximations from the literature. They all rely on the independence assumption as well as the assumption of a continuous distribution, and they perform similarly to each other. Therefore, only the results from MATLAB’s built-in function are shown. The bootstrap comparison is chosen because it tends to perform well and, although somewhat more complicated as well as computationally demanding to use, is typically a good choice in situations when a closed form for the variance is lacking.
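The acceptance band for the rejection rate quoted above can be reproduced with a normal approximation to the binomial proportion. The sketch below also shows a generic two-sided z-test of the kind used in each replicate. It assumes that `variance` is the asymptotic variance of √n times the estimator; both function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def rejection_rate_interval(nominal=0.05, replicates=20000, level=0.95):
    """Normal-approximation interval for an observed rejection rate around its nominal value."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half = z * sqrt(nominal * (1 - nominal) / replicates)
    return nominal - half, nominal + half

def z_test_rejects(estimate, null_value, variance, n, alpha=0.05):
    """Two-sided z-test of H0: rho_s = null_value, with variance / n as squared standard error."""
    z = (estimate - null_value) / sqrt(variance / n)
    return abs(z) > NormalDist().inv_cdf(1 - alpha / 2)

low, high = rejection_rate_interval()
```

With 20000 replicates the half-width is about 0.003, which gives the interval 0.047–0.053 stated in the text.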

From row three in Table 1 we see that the rejection rate based on the asymptotic variance is within the interval for sample sizes larger than 400 with good margin, indicating that normality, while an asymptotic property, is a good approximation for ρ̂_s from moderate sample sizes. The variance estimators used for comparison relate to the identical point estimate. From row four we see that violating the assumptions of independent and continuous observations has a severe impact on the results: MATLAB’s built-in function performs poorly and does not improve with increasing sample size. The results from the bootstrap estimator (row five) are within the desired range by sample size 100, indicating that for small sample sizes, the bootstrap seems to be the best choice of variance estimator.

A kernel density estimate of the small sample distribution for the sample size 50 is shown in Fig 1. A standard normal distribution is also shown as reference. The statistic standardized with the asymptotic variance seems to be fairly well approximated by the normal distribution, although the empirical distribution has a slight negative skew. This deviation from normality is much smaller for n = 100, and larger samples are very well approximated by the normal distribution. Due to space limitations, only n = 50 is displayed.

Fig 1. Kernel density of discrete version of Spearman rank correlation when sample size is 50 compared to a standard normal distribution.

https://doi.org/10.1371/journal.pone.0145595.g001

In the next step of the simulation study, we compare the power of the estimators. Variables are generated with the same characteristics as previously, but the correlation of the underlying continuous variables is now set to 0.55 and 0.65, yielding population rank correlations ρ_s of 0.4695 and 0.5608 respectively. The results are shown in Table 2. When the true rank correlation is 0.4695, no estimator exceeds a power of 0.36 even with a sample size of 800. When the true rank correlation is 0.5608, a larger difference from the null, the asymptotic estimator has a power of about 0.5 with a sample size of 100 and 0.95 with a sample size of 400. The asymptotic estimator consistently outperforms the bootstrap, but the difference is small and at least partly due to the bootstrap estimator’s somewhat lower rejection rates. Turning to MATLAB’s built-in function, the results from Table 2 underscore those from Table 1 in showing that this type of estimator should not be used for purposes other than testing against ρ_s = 0.

Table 2. Rejection rates when testing against the null H₀: ρ = 0.4249.

https://doi.org/10.1371/journal.pone.0145595.t002

We illustrate the performance of the three different types of estimators with empirical examples taken from [8]. The results are shown in Table 3. The purpose is to give examples of the practical implications of the above derived asymptotic variance (V_A), the bootstrap (V_B), and MATLAB’s built-in approximation (V_M). I and J represent the number of values that X and Y can take respectively, and n gives the sample size. The sizes of the contingency tables and sample sizes are what is commonly encountered in empirical applications and the examples are from various fields: 2.4) income and job satisfaction, 2.11) inheritance of political views, 3.2) primary and secondary pneumonia infection in calves, 8.10) smoking and lung function. The most striking result is that the asymptotic variance and the bootstrap estimates perform similarly, while V_M differs considerably. Returning to the correlation between smoking and decreased lung function (8.10), in the chosen example we have a point estimate of 0.24. Using our derived variance, we obtain a 95 percent confidence interval of (0.18; 0.30). The bootstrap estimate would similarly return a confidence interval of (0.18; 0.30), while the approximation assuming independence and no ties returns the wider interval (0.16; 0.32). One could think of a policy ascribing regulations to substances depending on their established correlation with lung disease. For this, a hypothesis test with a null hypothesis corresponding to the relevant threshold would be needed. In this case the use of a biased variance estimator would lead to an overestimation of uncertainty, with a delay in health regulation as a potential consequence.

Table 3. Variance estimates and some other information for a few examples from [8].

https://doi.org/10.1371/journal.pone.0145595.t003

Conclusion

Using Nešlehová’s population version of Spearman’s rho we have been able to show that Spearman’s sample correlation has desirable asymptotic properties when applied to discrete variables. In particular, we have shown that ρ̂_s is consistent and asymptotically normal, and derived the asymptotic variance. Simulation results on both rejection rates and power indicate that the asymptotic variance performs as well as the bootstrap for sample sizes from 400, allowing for easy construction of confidence intervals when Spearman’s correlation is used. For moderate to large sample sizes, the derived asymptotic variance combines the easy use of a closed form statistic with a performance on par with the bootstrap. In addition, the existence of an asymptotic variance in closed form, suitable for practical applications, means that the potential uses of Spearman’s rank correlation in the construction of other estimators have increased.

Acknowledgments

We would like to thank the referees for valuable comments.

Author Contributions

Wrote the paper: PO. Developed the study concept: JL. Proved the theorem: PO.

References

1. Spearman C. The proof and measurement of association between two things. The American Journal of Psychology. 1904;15:72–101.
2. Kendall M. Rank Correlation Methods. London: Griffin; 1948.
3. Kotz S, Balakrishnan N, Read CB, Vidakovic B. Encyclopedia of Statistical Sciences. 2nd ed. John Wiley; 2006.
4. Nešlehová J. On rank correlation measures for non-continuous random variables. Journal of Multivariate Analysis. 2007;98:544–567.
5. Nešlehová J. Dependence of Non-Continuous Random Variables. Carl von Ossietzky Universität. Oldenburg; 2004.
6. Cramér H. Mathematical methods of statistics. Eleventh ed. Princeton, NJ: Princeton Press; 1946.
7. Best DJ, Roberts DE. Algorithm AS 89: The Upper Tail Probabilities of Spearman’s rho. Applied Statistics. 1975;24:377–379.
8. Agresti A. Categorical data analysis. John Wiley; 1990.

