
Bias caused by sampling error in meta-analysis with small sample sizes

Abstract

Background

Meta-analyses frequently include studies with small sample sizes. Researchers usually fail to account for sampling error in the reported within-study variances; they model the observed study-specific effect sizes using the reported within-study variances and treat these sample variances as if they were the true variances. However, this sampling error may be influential when sample sizes are small. This article illustrates that the sampling error may lead to substantial bias in meta-analysis results.

Methods

We conducted extensive simulation studies to assess the bias caused by sampling error. Meta-analyses with continuous and binary outcomes were simulated with various ranges of sample size and extents of heterogeneity. We evaluated the bias and the confidence interval coverage for five commonly-used effect sizes (i.e., the mean difference, standardized mean difference, odds ratio, risk ratio, and risk difference).

Results

Sampling error did not cause noticeable bias when the effect size was the mean difference, but the standardized mean difference, odds ratio, risk ratio, and risk difference suffered from this bias to different extents. The bias in the estimated overall odds ratio and risk ratio was noticeable even when each individual study had more than 50 samples under some settings. Also, Hedges’ g, which is a bias-corrected estimate of the standardized mean difference within studies, might lead to larger bias than Cohen’s d in meta-analysis results.

Conclusions

Caution is needed when performing meta-analyses with small sample sizes. The reported within-study variances should not simply be treated as the true variances, and their sampling error should be fully considered in such meta-analyses.

Introduction

Systematic reviews and meta-analyses have become important tools to synthesize results from various studies in a wide range of areas, especially in clinical and epidemiological research [1–3]. Sampling error is a critical issue in meta-analyses. On the one hand, it affects the evaluation of heterogeneity between studies. For example, the popular heterogeneity measure, the I² statistic, is supposed to quantify the proportion of variation due to heterogeneity rather than sampling error [4–6]; if sampling error increases, I² tends to decrease, leading to a conclusion of more homogeneous studies. More troublesome, within-study sampling error may affect the derivation underlying I² to such an extent that the interpretation of I² is challenged [7]. On the other hand, sampling error may threaten the validity of a meta-analysis. The most popular meta-analysis method models the observed effect size in each study as a normally distributed random variable and treats the observed sample variance as if it were the true variance [8, 9]. It accounts for sampling error in the point estimate of the treatment effect within each study, but it ignores sampling error in the observed variance. This method is generally valid when the number of samples within each collected study is large: large-sample statistical properties, such as the central limit theorem and the delta method, guarantee that the distribution approximation performs well. However, ignoring sampling error in within-study variances has caused some misunderstandings about basic quantities in meta-analyses, especially when some studies have few samples. For example, the well-known Q test for homogeneity does not exactly follow the chi-squared distribution because of such sampling error [10], and this problem may subvert I² [7].

One important purpose of performing meta-analyses is to increase precision as well as to reduce bias in the conclusions of systematic reviews [11]. For this reason, the PRISMA statement [12] recommends that researchers report the risks of bias both within individual studies and between studies. The bias within individual studies often relates to the studies’ quality [13]. Also, certain measures have been designed to reduce bias in study-level estimates. For example, Hedges’ g is considered less biased than Cohen’s d within studies when the effect size is the standardized mean difference (page 81 in Hedges and Olkin [14]). The bias between studies is usually introduced by publication bias or selective reporting [15–20]. Besides the bias in point estimates of treatment effects, sampling error also produces bias in the variance of the overall weighted mean estimate under the fixed-effect setting [21, 22]. Under the random-effects setting, the well-known DerSimonian–Laird estimator of the between-study variance may also have considerable bias, especially when sample sizes are small [23, 24]. Moreover, between-study bias in the treatment effect estimates, such as publication bias, may have implications for other parameters in a meta-analysis, including the between-study variance [25]. The bias in variance estimates can seriously impact the precision of the meta-analysis results.

This article focuses on the performance of meta-analyses with small sample sizes, where the sampling error in the observed within-study variances should not be ignored. Throughout this article, we use “sample size” to refer to the number of participants in an individual study, not the number of studies in a meta-analysis. Studies with small or moderate sample sizes are fairly common in meta-analyses [26], especially when the treatments are expensive and the enrollment of participants is limited by studies’ budgets. We demonstrate a type of bias in meta-analysis results that is entirely due to sampling error; it has received relatively little attention in the existing literature compared with other types of bias [27–30]. Such bias is mainly caused by the association between the observed study-specific effect sizes yi and their estimated within-study variances. This association may exist even in the absence of publication bias or selective reporting [31, 32]. Even when the true variances are used instead of the estimated variances, the association may still be present for certain effect sizes, e.g., the (log) odds ratio.

If each study’s result is unbiased and its marginal expectation equals some overall treatment effect θ, then a naïve argument for the unbiasedness of the overall effect estimate in a meta-analysis, $\hat{\theta} = \sum_i w_i y_i / \sum_i w_i$, is that $E(\hat{\theta}) = \sum_i w_i E(y_i) / \sum_i w_i = \theta$, where wi is the weight of study i. The weight is usually the inverse of the within-study variance or of the marginal variance incorporating heterogeneity between studies. However, this equation treats the weights wi as fixed values, while in practice they are estimates subject to sampling error. The association between the observed effect sizes and their estimated within-study variances may be strong when the sample sizes are small, so the expectation of the overall estimate in the meta-analysis may not be directly derived without information about this association, and its unbiasedness is largely unclear [33]. In addition, when the sample sizes are small, the sampling error in the observed within-study variances and the estimated between-study variance may be large, so the confidence interval (CI) of the overall estimate may have a coverage probability much lower than the nominal level.

In the following sections, we will review five common effect sizes for continuous and binary outcomes, explain how small sample sizes may introduce bias in meta-analyses, and evaluate such bias using extensive simulation studies.

Methods

Meta-analyses with continuous outcomes

Suppose that a meta-analysis contains N studies, and each study compares a treatment group with a control group. Denote ni0 and ni1 as the sample sizes in the control and treatment groups in study i. The continuous outcome measures of the participants in each group are assumed to follow normal distributions. The population means of the two groups in study i are μi0 and μi1, and the corresponding sample means are denoted as $\bar{x}_{i0}$ and $\bar{x}_{i1}$. The variances of the outcomes in the two groups are frequently assumed to be equal, denoted as $\sigma_i^2$; see, e.g., page 76 in Hedges and Olkin [14] and page 224 in Cooper et al. [34]. The $\sigma_i^2$ is estimated by the pooled sample variance $\hat{\sigma}_i^2 = [(n_{i0}-1)s_{i0}^2 + (n_{i1}-1)s_{i1}^2]/(n_{i0}+n_{i1}-2)$, where $s_{i0}^2$ and $s_{i1}^2$ are the sample variances in the control and treatment groups, respectively.

If the outcome measures have a meaningful scale and all studies in the meta-analysis are reported on the same scale, the mean difference (MD) between the two groups, i.e., $\Delta_i = \mu_{i1} - \mu_{i0}$, is often used as the effect size to measure the treatment effect (page 224 in Cooper et al. [34]). We can obtain an estimate of the MD from each study, $y_i = \bar{x}_{i1} - \bar{x}_{i0}$, and its estimated within-study variance is $s_i^2 = \hat{\sigma}_i^2 (1/n_{i0} + 1/n_{i1})$. Traditional meta-analysis methods usually account for sampling error in the sample means yi but ignore such error in the sample variances $s_i^2$; the within-study variances have customarily been treated as the true variances, which should be $\sigma_i^2 (1/n_{i0} + 1/n_{i1})$ [10]. However, accurate estimates of variances may require very large sample sizes; the sample variances may be far from their true values when sample sizes are small. In what follows, we treat the sample variances as random variables like the sample means, rather than as the true variances.
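
To make these computations concrete, the following minimal R sketch (not the paper's own code, which is provided in S2 File) computes the observed MD and its estimated within-study variance from group-level summaries; the function and argument names are illustrative.

```r
# Minimal sketch (illustrative interface): observed mean difference and its
# estimated within-study variance from group-level summaries.
md_effect <- function(m1, m0, sd1, sd0, n1, n0) {
  # pooled sample variance under the equal-variance assumption
  s2_pool <- ((n0 - 1) * sd0^2 + (n1 - 1) * sd1^2) / (n0 + n1 - 2)
  yi <- m1 - m0                        # observed MD
  vi <- s2_pool * (1 / n0 + 1 / n1)    # estimated within-study variance
  c(yi = yi, vi = vi)
}

# hypothetical summary data from one small study
md_effect(m1 = 2.1, m0 = 1.4, sd1 = 1.2, sd0 = 1.0, n1 = 8, n0 = 7)
```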

Because the outcome measures are assumed to be normal, the sample means $\bar{x}_{i0}$ and $\bar{x}_{i1}$ are independent of the sample variances $s_{i0}^2$ and $s_{i1}^2$ (see page 218 in Casella and Berger [35]). Thus, yi and $s_i^2$ are independent in each study. Given that the observed MDs yi are unbiased, this independence guarantees that the overall effect size estimate is unbiased in a fixed-effect meta-analysis (which assumes that the underlying true effect sizes Δi in all studies equal a common value Δ), because, by conditioning on the $s_i^2$, $E\!\left[\frac{\sum_i y_i/s_i^2}{\sum_i 1/s_i^2}\right] = E\!\left\{E\!\left[\frac{\sum_i y_i/s_i^2}{\sum_i 1/s_i^2} \,\Big|\, s_1^2,\ldots,s_N^2\right]\right\} = E\!\left[\frac{\sum_i \Delta/s_i^2}{\sum_i 1/s_i^2}\right] = \Delta$. However, in a random-effects meta-analysis, each study’s weight is updated to $w_i = 1/(s_i^2 + \hat{\tau}^2)$ by incorporating an estimate $\hat{\tau}^2$ of the between-study variance. The between-study variance τ2 can be estimated using many different methods [36], and its estimate depends on both yi and $s_i^2$; therefore, yi and the updated weight may be correlated to some extent. The expectation of the weighted average cannot be split in the foregoing way, so the unbiasedness of the overall MD estimate is not guaranteed in a random-effects meta-analysis.
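
The R sketch below illustrates the two weighting schemes just described: a fixed-effect inverse-variance average and a random-effects average whose weights incorporate a moment estimate of τ2 (here the DerSimonian–Laird estimator, which is the one used later in this article). It is a simplified illustration, not the implementation used for the simulations (S2 File).

```r
# Minimal sketch: fixed-effect inverse-variance pooling versus random-effects
# pooling with the DerSimonian-Laird (DL) moment estimate of the between-study
# variance. yi = study effect estimates, vi = estimated within-study variances.
pool_estimates <- function(yi, vi) {
  w_fe   <- 1 / vi
  est_fe <- sum(w_fe * yi) / sum(w_fe)              # fixed-effect estimate
  Q      <- sum(w_fe * (yi - est_fe)^2)             # Cochran's Q
  tau2   <- max(0, (Q - (length(yi) - 1)) /
                   (sum(w_fe) - sum(w_fe^2) / sum(w_fe)))  # DL estimate, truncated at 0
  w_re   <- 1 / (vi + tau2)                         # updated random-effects weights
  est_re <- sum(w_re * yi) / sum(w_re)
  c(fixed = est_fe, random = est_re, tau2 = tau2)
}
```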

A more commonly used effect size for continuous outcomes is the standardized mean difference (SMD), because this unit-free measure permits different scales in the collected studies and is deemed more comparable across studies (see Normand [8] and Chapter 3 in Grissom and Kim [37]). The true SMD in study i is $\theta_i = \Delta_i/\sigma_i$. It is usually estimated as $y_i = (\bar{x}_{i1} - \bar{x}_{i0})/\hat{\sigma}_i$ by plugging in the sample means and the pooled variance, and this estimate is often referred to as Cohen’s d (see page 66 in Cohen [38]). If we define a constant $q_i = n_{i0} n_{i1}/(n_{i0} + n_{i1})$, multiply Cohen’s d by $\sqrt{q_i}$, and express it as $\sqrt{q_i}\, y_i = \frac{\sqrt{q_i}(\bar{x}_{i1} - \bar{x}_{i0})/\sigma_i}{\hat{\sigma}_i/\sigma_i}$, then the numerator follows a normal distribution with variance 1, and the denominator is the square root of a chi-squared random variable divided by its degrees of freedom $n_{i0} + n_{i1} - 2$ [35]. Also, the numerator and denominator are independent. Therefore, strictly speaking, Cohen’s d (multiplied by the constant $\sqrt{q_i}$) follows a t-distribution, although it is approximated as a normal distribution in nearly all applications. If the true effect size is non-zero, the t-distribution is noncentral. The exact within-study variance of Cohen’s d can be derived as a complicated expression involving gamma functions [39], but researchers usually use simpler forms to approximate it. Different approximations for the within-study variance of Cohen’s d are given in several books on meta-analysis; see, e.g., page 80 in Hedges and Olkin [14], page 226 in Cooper et al. [34], and page 290 in Egger et al. [40]. This article approximates it as $s_i^2 = \frac{1}{q_i} + \frac{y_i^2}{2(n_{i0} + n_{i1})}$. As $s_i^2$ depends on yi, they are correlated. The correlation may increase as the sample size decreases, because the coefficient of $y_i^2$ in the formula of $s_i^2$, namely $1/[2(n_{i0} + n_{i1})]$, increases.

Furthermore, it is well known that Cohen’s d is a biased estimate of the SMD. The bias is around $3\theta_i/[4(n_{i0}+n_{i1})-9]$ (page 80 in Hedges and Olkin [14]), and it shrinks toward zero as the sample sizes increase. When the sample sizes are small, a bias-corrected estimate, called Hedges’ g, is usually adopted [41]. It is calculated by multiplying Cohen’s d by the correction factor $J_i = 1 - \frac{3}{4(n_{i0}+n_{i1})-9}$, i.e., $y_i = J_i d_i$, with an estimated variance $s_i^2 = \frac{1}{q_i} + \frac{y_i^2}{2(n_{i0}+n_{i1}-3.94)}$ (page 86 in Hedges and Olkin [14]). As with Cohen’s d, the observed data yi and $s_i^2$ are also correlated when Hedges’ g is used as the effect size. Therefore, even if Hedges’ g is (nearly) unbiased within each individual study, the overall SMD estimate in the meta-analysis may still be biased due to the correlation between yi and $s_i^2$.
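
As a rough illustration of the formulas above, the R sketch below computes Cohen's d, Hedges' g, and their approximate within-study variances from group-level summaries. The variance approximations follow the formulas discussed in the text; the paper's own code (S2 File) may differ in detail, so treat this as a sketch under those assumptions.

```r
# Minimal sketch: Cohen's d, Hedges' g, and approximate within-study variances
# from group-level summaries (approximations as discussed in the text).
smd_effect <- function(m1, m0, sd1, sd0, n1, n0) {
  s2_pool <- ((n0 - 1) * sd0^2 + (n1 - 1) * sd1^2) / (n0 + n1 - 2)
  q  <- n0 * n1 / (n0 + n1)
  d  <- (m1 - m0) / sqrt(s2_pool)                 # Cohen's d
  vd <- 1 / q + d^2 / (2 * (n0 + n1))             # approximate variance of d
  J  <- 1 - 3 / (4 * (n0 + n1) - 9)               # small-sample correction factor
  g  <- J * d                                     # Hedges' g
  vg <- 1 / q + g^2 / (2 * (n0 + n1 - 3.94))      # one common approximation for var(g)
  c(d = d, vd = vd, g = g, vg = vg)
}
```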

Meta-analyses with binary outcomes

Suppose a 2 × 2 table is available from each collected study in a meta-analysis with a binary outcome. Denote ni00 and ni01 as the numbers of participants without and with an event in the control group, respectively; ni10 and ni11 are the data cells in the treatment group. The sample sizes in the control and treatment groups are ni0 = ni00 + ni01 and ni1 = ni10 + ni11. Also, denote pi0 and pi1 as the population event rates in the two groups.

The odds ratio (OR) is frequently used to measure the treatment effect for a binary outcome [42]; its true value in study i is $\mathrm{OR}_i = \frac{p_{i1}/(1-p_{i1})}{p_{i0}/(1-p_{i0})}$. Using the four data cells in the 2 × 2 table, the OR is estimated as $\widehat{\mathrm{OR}}_i = \frac{n_{i11} n_{i00}}{n_{i10} n_{i01}}$. The ORs are usually combined on a logarithmic scale in meta-analyses, because the distribution of the estimated log OR, $y_i = \log \widehat{\mathrm{OR}}_i$, is better approximated by a normal distribution. The within-study variance of yi is estimated as $s_i^2 = \frac{1}{n_{i00}} + \frac{1}{n_{i01}} + \frac{1}{n_{i10}} + \frac{1}{n_{i11}}$. Besides the OR, the risk ratio (RR) and risk difference (RD) are also popular effect sizes. The underlying true RR in study i is RRi = pi1/pi0, and it is also combined on the log scale in meta-analyses like the OR. The log RR is estimated as $y_i = \log \frac{n_{i11}/n_{i1}}{n_{i01}/n_{i0}}$, and its within-study variance is estimated as $s_i^2 = \frac{1}{n_{i11}} - \frac{1}{n_{i1}} + \frac{1}{n_{i01}} - \frac{1}{n_{i0}}$. Moreover, the underlying true RD in study i is $\mathrm{RD}_i = p_{i1} - p_{i0}$, estimated as $y_i = n_{i11}/n_{i1} - n_{i01}/n_{i0}$ with an estimated within-study variance $s_i^2 = \frac{(n_{i11}/n_{i1})(1 - n_{i11}/n_{i1})}{n_{i1}} + \frac{(n_{i01}/n_{i0})(1 - n_{i01}/n_{i0})}{n_{i0}}$. When the sample sizes are small, some data cells may be zero even if the event is not rare. If a 2 × 2 table contains zero cells, a fixed value of 0.5 is often added to each data cell to reduce bias and avoid computational error (see page 521 in the Cochrane Handbook for Systematic Reviews of Interventions [43] and many other papers [44–46]), although this continuity correction may not be optimal in some cases [47–50].
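
A small R sketch of these computations is given below; it applies the 0.5 continuity correction to every cell of a table that contains at least one zero cell, as described above. The function name and interface are illustrative, not taken from the paper's code.

```r
# Minimal sketch (illustrative interface): log OR, log RR, and RD with their
# usual variance estimates from a 2 x 2 table (e = events, n = group size).
binary_effects <- function(e1, n1, e0, n0) {
  ne1 <- n1 - e1; ne0 <- n0 - e0                 # non-event counts
  if (any(c(e1, ne1, e0, ne0) == 0)) {           # continuity correction
    e1 <- e1 + 0.5; ne1 <- ne1 + 0.5; e0 <- e0 + 0.5; ne0 <- ne0 + 0.5
    n1 <- n1 + 1;   n0 <- n0 + 1
  }
  p1 <- e1 / n1; p0 <- e0 / n0
  list(logOR = c(y = log(e1 * ne0 / (e0 * ne1)),
                 v = 1 / e1 + 1 / ne1 + 1 / e0 + 1 / ne0),
       logRR = c(y = log(p1 / p0),
                 v = 1 / e1 - 1 / n1 + 1 / e0 - 1 / n0),
       RD    = c(y = p1 - p0,
                 v = p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0))
}
```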

Like the SMD for continuous outcomes, the distributions of the sample log OR, log RR, and RD are approximated as normal distributions in conventional meta-analysis methods. Also, because both yi and $s_i^2$ depend on the four cells of the 2 × 2 tables for all three effect sizes, they are intrinsically correlated.
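
As a quick, informal illustration of this intrinsic correlation (not part of the paper's simulations), the following R snippet repeatedly simulates a small two-arm study with a non-null odds ratio and reports the correlation between the estimated log OR and its estimated variance; the 0.5 correction is applied to every cell here purely for simplicity.

```r
# Informal illustration: with small groups and a non-null OR, the estimated
# log OR and its estimated variance are correlated across replications.
set.seed(1)
n <- 10; p0 <- 0.4; theta <- 1.5                           # true log OR
p1 <- p0 * exp(theta) / (1 - p0 + p0 * exp(theta))
reps <- replicate(5000, {
  e0 <- rbinom(1, n, p0); e1 <- rbinom(1, n, p1)
  # 2 x 2 cells with 0.5 added to each (always, for simplicity)
  a <- e1 + 0.5; b <- n - e1 + 0.5; cc <- e0 + 0.5; dd <- n - e0 + 0.5
  c(y = log(a * dd / (b * cc)), v = 1 / a + 1 / b + 1 / cc + 1 / dd)
})
cor(reps["y", ], reps["v", ])                              # typically clearly non-zero
```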

Simulation studies

We conducted simulation studies to investigate the impact of sampling error on meta-analyses with small sample sizes. The number of studies in a simulated meta-analysis was set to N = 5, 10, 20, and 50. We first generated the sample size of each study, ni, from a uniform distribution U(5, 10); then we gradually increased it by sampling from U(10, 20), U(20, 30), U(30, 50), U(50, 100), U(100, 500), and U(500, 1000). These sample sizes ni were generated anew for each simulated meta-analysis. The control/treatment allocation ratio was set to 1:1 in all studies, which is common in real-world applications. Specifically, $\lceil n_i/2 \rceil$ participants were assigned to the control group and $\lceil n_i/2 \rceil$ participants were assigned to the treatment group, where ⌈x⌉ denotes the smallest integer greater than or equal to x.

When the outcome was continuous, we simulated meta-analyses based on the MD and the SMD. For the MD, each participant’s outcome measure was sampled from $N(\mu_{i0}, \sigma_i^2)$ in the control group or from $N(\mu_{i0} + \Delta_i, \sigma_i^2)$ in the treatment group. Without loss of generality, the baseline effect μi0 of study i was generated from N(0,1). The study-specific standard deviation σi was sampled from U(1,5), and it was generated anew for each simulated meta-analysis. The mean difference Δi was sampled from N(Δ,τ2). The overall MD Δ was set to 0, 0.5, 1, 2, and 5, and the between-study standard deviation τ was set to 0, 0.5, and 1. For the SMD, each participant’s outcome measure was also generated using the foregoing setting within each study, but the SMD θi = Δi/σi, not the mean difference Δi, was sampled from the normal distribution: θi ~ N(θ,τ2). The overall SMD θ was set to 0, 0.2, 0.5, 0.8, and 1 to represent different magnitudes of effect size. The between-study standard deviation τ was 0, 0.2, and 0.5. Both Cohen’s d and Hedges’ g were used to estimate the SMD.
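
A minimal R sketch of one simulated SMD meta-analysis under this design is shown below (N studies, per-study sample sizes from U(5, 10), θi ~ N(θ, τ2)); the handling of the group split follows the description above, and other implementation details may differ from the paper's own code in S2 File.

```r
# Sketch of one simulated SMD meta-analysis under the design described above.
set.seed(2)
N <- 10; theta <- 0.5; tau <- 0.2
ni  <- runif(N, 5, 10)
ni0 <- ni1 <- ceiling(ni / 2)              # 1:1 allocation
mu0    <- rnorm(N, 0, 1)                   # baseline effects
sigma  <- runif(N, 1, 5)                   # study-specific SDs
thetai <- rnorm(N, theta, tau)             # study-specific true SMDs
yi <- vi <- numeric(N)
for (i in 1:N) {
  x0 <- rnorm(ni0[i], mu0[i], sigma[i])
  x1 <- rnorm(ni1[i], mu0[i] + thetai[i] * sigma[i], sigma[i])
  s2 <- ((ni0[i] - 1) * var(x0) + (ni1[i] - 1) * var(x1)) / (ni0[i] + ni1[i] - 2)
  yi[i] <- (mean(x1) - mean(x0)) / sqrt(s2)                           # Cohen's d
  vi[i] <- (ni0[i] + ni1[i]) / (ni0[i] * ni1[i]) +
           yi[i]^2 / (2 * (ni0[i] + ni1[i]))                          # approximate variance
}
```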

When the outcome was binary, we first simulated meta-analyses based on the OR. The event numbers ni01 and ni11 in the control and treatment groups were sampled from Binomial(ni0,pi0) and Binomial(ni1,pi1), respectively. The event rate in the control group, pi0, was sampled from U(0.3, 0.7), representing a fairly common event [32], and it was generated anew for each meta-analysis. The event rate in the treatment group, pi1, was calculated from pi0 and the study-specific log OR θi; specifically, $p_{i1} = \frac{p_{i0} e^{\theta_i}}{1 - p_{i0} + p_{i0} e^{\theta_i}}$. The study-specific log OR θi was sampled from N(θ,τ2), where the overall log OR θ was set to 0, 0.2, 0.4, 1, and 1.5, and the between-study standard deviation τ was 0, 0.2, and 0.5. In addition to the OR, we also generated meta-analyses based on the RR and RD. The event numbers were similarly sampled from binomial distributions, and pi0 was drawn from U(0.3,0.7). However, for the log RR and the RD, we considered only the fixed-effect setting with all study-specific effect sizes θi equal to a common value θ. Specifically, if the effect size was the log RR, the event rate in the treatment group was pi1 = eθpi0, where the true log RR θ was set to 0 and 0.3 to guarantee that pi1 was between 0 and 1. If the effect size was the RD, pi1 = pi0 + θ, where the true RD θ was set to 0 and 0.2 to guarantee that pi1 was between 0 and 1. The random-effects setting was not considered for the log RR and RD because it may lead to improper pi1’s beyond the [0, 1] range. We could have generated such meta-analyses by truncating the improper pi1’s to lie between 0 and 1; however, this constraint would produce bias that could not be distinguished from the bias caused by sampling error, which is of primary interest in this article.
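
Similarly, the following R sketch generates one simulated meta-analysis based on the log OR under this design; it is an illustrative reconstruction rather than the paper's own code (S2 File).

```r
# Sketch of one simulated log-OR meta-analysis under the design described above.
set.seed(3)
N <- 10; theta <- 1; tau <- 0.5
ni  <- runif(N, 5, 10)
ni0 <- ni1 <- ceiling(ni / 2)                              # 1:1 allocation
pi0 <- runif(N, 0.3, 0.7)                                  # control event rates
thetai <- rnorm(N, theta, tau)                             # study-specific log ORs
pi1 <- pi0 * exp(thetai) / (1 - pi0 + pi0 * exp(thetai))   # treatment event rates
e0  <- rbinom(N, ni0, pi0)                                 # control events
e1  <- rbinom(N, ni1, pi1)                                 # treatment events
```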

For each simulation setting above, 10,000 meta-analyses were generated. The random-effects model was applied to each simulated meta-analysis [51], even if some meta-analyses were generated under the fixed-effect setting with τ = 0; thus, the resulting CIs might be conservative. The between-study variance was estimated using the popular method of moments proposed by DerSimonian and Laird [24]. The restricted maximum likelihood method may be a better choice [8, 52, 53], but it is more computationally demanding, and its solution did not converge in a noticeable number of our simulated meta-analyses. There are also many other alternatives for estimating the between-study variance, such as the Paule–Mandel estimator, which may be recommended in certain situations [54, 55], but they have been used less frequently than the DerSimonian–Laird estimator so far. Therefore, we considered only the DerSimonian–Laird estimator for the between-study variance, which was sufficient for this article’s purpose.
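
Continuing from the binary-outcome sketch above, the snippet below shows how such a simulated meta-analysis could be analyzed with the DerSimonian–Laird random-effects model using the metafor package [51]; by default, escalc() adds 0.5 only to tables that contain zero cells, in line with the continuity correction described earlier. This is an illustrative analysis, not the paper's own script.

```r
# Sketch: analyze the simulated 2 x 2 tables from the previous snippet with the
# DerSimonian-Laird random-effects model in the metafor package.
library(metafor)
es  <- escalc(measure = "OR", ai = e1, bi = ni1 - e1, ci = e0, di = ni0 - e0)
fit <- rma(yi, vi, data = es, method = "DL")   # DL estimator of the between-study variance
summary(fit)                                   # overall log OR, its 95% CI, and tau^2
```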

S2–S7 Files present the R code and results for the simulation studies.

Results

Figs 1–5 present boxplots of the estimated overall effect sizes in the 10,000 simulated meta-analyses for the MD, SMD (estimated by both Cohen’s d and Hedges’ g), log OR, log RR, and RD, respectively. In addition, Table 1 shows the bias of the estimates, and Table 2 shows the coverage probabilities of their 95% CIs. When the number of studies in a meta-analysis increased from 5 to 50, the range of the estimated overall effect sizes shrank because their variances decreased. When the between-study heterogeneity increased in Figs 1–3, the boxes of the estimated overall effect sizes in the middle and lower panels expanded vertically, reflecting the greater heterogeneity in the meta-analyses.

Fig 1. Boxplots of the estimated mean differences in 10,000 simulated meta-analyses.

The true between-study standard deviation τ increased from 0 (panels a and b) to 1 (panel c). The number of studies in each meta-analysis N increased from 5 (panel a) to 50 (panels b and c). The true mean difference Δ (horizontal dotted line) was 0.

https://doi.org/10.1371/journal.pone.0204056.g001

Fig 2. Boxplots of the estimated standardized mean differences in 10,000 simulated meta-analyses.

For each sample size range on the horizontal axis, the left gray box was obtained using Cohen’s d, and the right black box was obtained using Hedges’ g. The true between-study standard deviation τ increased from 0 (upper and middle panels) to 0.5 (lower panels). The number of studies in each meta-analysis N increased from 5 (upper panels) to 50 (middle and lower panels). The true standardized mean difference θ (horizontal dotted line) increased from 0 (left panels) to 1 (right panels).

https://doi.org/10.1371/journal.pone.0204056.g002

Fig 3. Boxplots of the estimated log odds ratios in 10,000 simulated meta-analyses.

The true between-study standard deviation τ increased from 0 (upper and middle panels) to 0.5 (lower panels). The number of studies in each meta-analysis N increased from 5 (upper panels) to 50 (middle and lower panels). The true log odds ratio θ (horizontal dotted line) increased from 0 (left panels) to 1.5 (right panels).

https://doi.org/10.1371/journal.pone.0204056.g003

Fig 4. Boxplots of the estimated log risk ratios in 10,000 simulated meta-analyses.

The true between-study standard deviation τ was 0 (i.e., the simulated studies were homogeneous). The number of studies in each meta-analysis N increased from 5 (upper panels) to 50 (lower panels). The true log risk ratio θ (horizontal dotted line) increased from 0 (left panels) to 0.3 (right panels).

https://doi.org/10.1371/journal.pone.0204056.g004

Fig 5. Boxplots of the estimated risk differences in 10,000 simulated meta-analyses.

The true between-study standard deviation τ was 0 (i.e., the simulated studies were homogeneous). The number of studies in each meta-analysis N increased from 5 (upper panels) to 50 (lower panels). The true risk difference θ (horizontal dotted line) increased from 0 (left panels) to 0.2 (right panels).

https://doi.org/10.1371/journal.pone.0204056.g005

Table 1. Bias of the estimated overall effect size in the simulation studies.

https://doi.org/10.1371/journal.pone.0204056.t001

Table 2. Coverage probability (in percentage, %) of the estimated overall effect size’s 95% confidence interval in the simulation studies.

https://doi.org/10.1371/journal.pone.0204056.t002

Fig 1 and Table 1 indicate that the estimated MD was almost unbiased in all situations with different numbers of studies and different extents of heterogeneity, even when the studies had very small sample sizes. Because the trends in the plots for Δ = 0.5, 1, 2, and 5 were fairly similar to those for Δ = 0, they are not displayed in Fig 1 to save space. Table 2 shows that the CI coverage probability for the MD was fairly close to the nominal confidence level of 95% in most cases. The coverage was slightly below 95% when the number of studies was small (N = 5) and the sample sizes within studies were also very small (between 5 and 10).

When the true SMD was zero in the left panels of Fig 2, both Cohen’s d and Hedges’ g were almost unbiased. The box of Cohen’s d was slightly taller than that of Hedges’ g when the sample sizes within studies were small, so the point estimates of Hedges’ g were more concentrated around the true SMD. The CI coverage was also close to the nominal 95% level. However, as the true SMD increased from 0 to 0.5 and to 1, both Cohen’s d and Hedges’ g began to show bias, and the bias increased as the sample sizes within studies decreased. Cohen’s d generally produced less bias in the estimated overall SMD than Hedges’ g, as shown in Table 1. The CI coverage of Cohen’s d remained close to 95% as the true SMD increased, but that of Hedges’ g dropped below 80% when the sample sizes were fairly small (between 5 and 10), the true SMD was fairly large (θ = 1), and the number of studies was large (N = 50).

The patterns for the ORs in Fig 3 were similar to those in Fig 2. The estimated overall log ORs were almost unbiased when the true log OR was zero. As the true log OR increased to 1 and to 1.5 and the sample sizes within studies decreased, the bias in the estimated overall log OR tended to become larger in the negative direction. Also, Table 2 shows that the CI coverage dropped dramatically when the number of studies and the between-study variance were large. For example, when τ = 0, N = 5, θ = 1.5, and the sample size of each study was between 5 and 10, the bias of the estimated overall log OR was −0.30 and the CI coverage was 97.8%; that is, the log OR underestimated the true value θ. Among the simulated meta-analyses whose CIs did not cover θ, 2.2% had CIs entirely below θ, while only one meta-analysis (0.01%) had a CI entirely above θ. As the number of studies increased to N = 50 with the other parameters unchanged, the bias was still −0.30, but the CI coverage decreased to 82.5%, and the CIs of the meta-analyses not covering θ were all below θ. Therefore, the low CI coverage was likely because the CIs became shorter as the number of studies N increased while the bias remained.

Compared with the log OR, the log RR in Fig 4 was more sensitive to the sample sizes within studies. The estimated overall log RR had tiny bias and its CI coverage was close to 95% when the sample sizes within studies were large (more than 500). However, the bias was substantial and the CI coverage was fairly low even when the sample sizes were moderate (between 50 and 100). As in the case of the log OR, the poor CI coverage for the log RR was related to the bias. For example, when τ = 0, N = 5, θ = 0.3, and the sample size of each study was between 5 and 10, the bias of the estimated overall log RR was 0.26 and the CI coverage was 86.2%; the log RR overestimated the true value θ, and the CIs of the simulated meta-analyses not covering θ were all above θ. When N increased to 50 with the other parameters unchanged, the bias was 0.35 and the CI coverage dropped dramatically to 9.7%; the CIs of the simulated meta-analyses not covering θ were again all above θ.

Fig 5 shows that the estimated overall RD was almost unbiased when the true RD was zero and had small bias when the true RD was 0.2. The bias was relatively large when the sample sizes within studies were fairly small. The CI coverages were between 92% and 96% in all situations.

In addition, Figures A–F in S1 File present scatter plots of the sample effect sizes against their precisions (i.e., the inverses of their sample variances) in ten selected simulated meta-analyses with small sample sizes for the MD, SMD (including both Cohen’s d and Hedges’ g), log OR, log RR, and RD. They are plotted using the same idea as the funnel plot for assessing publication bias [56], and they roughly illustrate the association between the sample effect sizes yi and their within-study variances $s_i^2$. Figure A in S1 File indicates that this association seemed tiny for the MD, consistent with our conclusion that the MD yi and its variance $s_i^2$ are independent in theory. The other figures show different extents of association for the SMD, log OR, log RR, and RD. For example, the estimated SMDs that were closer to zero tended to have larger precisions (i.e., smaller variances) in Figures B and C in S1 File.

Discussion

This article has shown that the bias in the overall estimates of the SMD, log OR, log RR, and RD may be substantial in meta-analyses with small sample sizes. The estimated overall MD was almost unbiased in nearly all simulation settings, mainly because its point estimate and within-study variance are independent. However, for the other four effect sizes, the intrinsic association between their point estimates and estimated variances within studies may be strong, so the meta-analysis results were biased in many simulation settings. Therefore, when the collected studies have small sample sizes, researchers need to choose a proper effect size and perform the meta-analysis with great caution.

Surprisingly, using Cohen’s d to estimate the overall SMD led to noticeably less bias than using Hedges’ g in our simulation studies, although Hedges’ g was designed as a bias-corrected estimate of the SMD within individual studies. For example, in one of our simulated fixed-effect meta-analyses with 50 studies and 5 to 10 samples in each study (with the true SMD being 1), the average of Cohen’s d across the 50 studies was around 1.29, while the average of Hedges’ g, 1.07, was closer to the true value 1. This was consistent with the fact that Hedges’ g is generally less biased within individual studies. However, the meta-analytic overall Cohen’s d was 0.98, much closer to 1 than the meta-analytic overall Hedges’ g of 0.89, because the sampling error in these effect sizes’ variances induced the association between the effect sizes and the variances. Note that, rather than advocating that Cohen’s d is always preferable to Hedges’ g in meta-analyses, this article only reminds researchers that Cohen’s d may be less biased in at least some meta-analytic results, and that the argument for using Hedges’ g in the presence of small sample sizes needs to be carefully examined.

In addition, there are alternative methods to estimate the within-study variance of Hedges’ g besides the one used in our article. Specifically, our simulation studies used $s_i^2 = \frac{1}{q_i} + \frac{g_i^2}{2(n_{i0}+n_{i1}-3.94)}$, where $g_i$ is the point estimate of Hedges’ g in study i; this calculation was introduced on page 86 in Hedges and Olkin [14]. Recall that Hedges’ g is calculated by multiplying Cohen’s d by a bias-correction coefficient; that is, $g_i = J_i d_i$, where $J_i = 1 - \frac{3}{4(n_{i0}+n_{i1})-9}$ and $d_i$ is the point estimate of Cohen’s d in study i. Therefore, the variance of Hedges’ g can alternatively be estimated as $J_i^2 s_{d,i}^2$, where $s_{d,i}^2$ is the within-study variance of Cohen’s d; see, e.g., page 226 in Cooper et al. [34]. Using this alternative calculation for the within-study variances of Hedges’ g, the combined SMD may remain biased. For example, consider the special case in which all N studies in a meta-analysis have the same sample size n, so the bias-correction coefficients in all studies are equal: $J_i = J$. Using the fixed-effect model, the expectation of the combined Cohen’s d is $\mu_d = E\!\left[\frac{\sum_i d_i/s_{d,i}^2}{\sum_i 1/s_{d,i}^2}\right]$, and the expectation of the combined Hedges’ g is $\mu_g = E\!\left[\frac{\sum_i g_i/(J^2 s_{d,i}^2)}{\sum_i 1/(J^2 s_{d,i}^2)}\right] = E\!\left[\frac{\sum_i J d_i/s_{d,i}^2}{\sum_i 1/s_{d,i}^2}\right] = J\mu_d$. Because J is always less than 1, we have $\mu_g < \mu_d$ if $\mu_d$ is positive. If the true overall SMD θ is also positive and the combined Cohen’s d underestimates it (as in our simulation studies), then $\mu_g < \mu_d < \theta$, indicating that the combined Hedges’ g is more biased. However, if the combined Cohen’s d overestimates the overall SMD (i.e., $\mu_d > \theta$), then the combined Hedges’ g might be less biased.

This article helps explain the phenomenon of the inflated type I error rates for testing for publication bias. To detect potential publication bias in meta-analyses, it has been popular to check for the association between the study-specific effect sizes and their standard errors using the funnel plot or Egger’s regression test [15]. However, it is well known that such association may be intrinsic for binary outcomes even if no publication bias appears, so Egger’s test may have an inflated type I error rate [31, 32]. In addition to the intrinsic association for binary outcomes, this article indicates that such a problem also exists when using the SMD for continuous outcomes. Although the meta-analyses with false positive results do not truly have publication bias, they may still suffer from bias due to sampling error.

Moreover, our findings imply that the magnitude of sample size may not be viewed as an absolute concept in meta-analyses; we may not determine whether a sample size is small or large without taking other parameters into account. For example, using the log OR as the effect size, Fig 3(A), 3(D) and 3(G) show that a sample size of 10 to 20 may be large enough to produce desirable meta-analysis results when the true log OR is zero. However, when the heterogeneity, the number of studies, and the true log OR are large, Fig 3(I) shows that a sample size of 50 to 100 may not be adequate.

The bias of the estimated overall log RR was particularly substantial in Fig 4; this may be related to the weighting bias for binary outcomes [57]. However, unlike Tang [57], this article focused on the bias caused entirely by sampling error, which exists for both continuous and binary outcomes.

This article performed the simulated meta-analyses using the popular inverse-variance method in a frequentist framework. Alternatively, several exact models have been proposed for binary outcomes; they do not require the normal approximation to estimate the study-specific effect sizes and their within-study variances [58–62]. The event numbers in the compared groups can be directly modeled with binomial distributions, thus accounting for sampling error in both the point estimates of the effect sizes and their variances. Similar exact models are also needed for continuous outcomes to avoid treating the within-study variances as if they were the true variances; we leave them as future work.

Supporting information

S1 File. Scatter plots of the sample effect sizes against their precisions (i.e., the inverse of their sample variances) in some simulated meta-analyses with small sample sizes.

https://doi.org/10.1371/journal.pone.0204056.s001

(PDF)

S2 File. R code for the simulation studies.

https://doi.org/10.1371/journal.pone.0204056.s002

(ZIP)

S3 File. Simulation results for the mean difference.

https://doi.org/10.1371/journal.pone.0204056.s003

(ZIP)

S4 File. Simulation results for the standardized mean difference.

https://doi.org/10.1371/journal.pone.0204056.s004

(ZIP)

S5 File. Simulation results for the log odds ratio.

https://doi.org/10.1371/journal.pone.0204056.s005

(ZIP)

S6 File. Simulation results for the log risk ratio.

https://doi.org/10.1371/journal.pone.0204056.s006

(ZIP)

S7 File. Simulation results for the risk difference.

https://doi.org/10.1371/journal.pone.0204056.s007

(ZIP)

References

  1. Sutton AJ, Higgins JPT. Recent developments in meta-analysis. Statistics in Medicine. 2008;27(5):625–50. pmid:17590884
  2. Berlin JA, Golub RM. Meta-analysis as evidence: building a better pyramid. JAMA. 2014;312(6):603–6. pmid:25117128
  3. Gurevitch J, Koricheva J, Nakagawa S, Stewart G. Meta-analysis and the science of research synthesis. Nature. 2018;555:175–82. pmid:29517004
  4. Higgins JPT, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. BMJ. 2003;327(7414):557–60. pmid:12958120
  5. Higgins JPT, Thompson SG. Quantifying heterogeneity in a meta-analysis. Statistics in Medicine. 2002;21(11):1539–58. pmid:12111919
  6. Lin L, Chu H, Hodges JS. Alternative measures of between-study heterogeneity in meta-analysis: reducing the impact of outlying studies. Biometrics. 2017;73(1):156–66. pmid:27167143
  7. Hoaglin DC. Practical challenges of I² as a measure of heterogeneity. Research Synthesis Methods. 2017;8(3):254. pmid:28631294
  8. Normand S-LT. Meta-analysis: formulating, evaluating, combining, and reporting. Statistics in Medicine. 1999;18(3):321–59. pmid:10070677
  9. Borenstein M, Hedges LV, Higgins JPT, Rothstein HR. A basic introduction to fixed-effect and random-effects models for meta-analysis. Research Synthesis Methods. 2010;1(2):97–111. pmid:26061376
  10. Hoaglin DC. Misunderstandings about Q and 'Cochran's Q test' in meta-analysis. Statistics in Medicine. 2016;35(4):485–95. pmid:26303773
  11. Egger M, Davey Smith G. Meta-analysis: potentials and promise. BMJ. 1997;315(7119):1371–4. pmid:9432250
  12. Liberati A, Altman DG, Tetzlaff J, Mulrow C, Gøtzsche PC, Ioannidis JPA, et al. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration. PLOS Medicine. 2009;6(7):e1000100. pmid:19621070
  13. Ioannidis JPA, Patsopoulos NA, Rothstein HR. Reasons or excuses for avoiding meta-analysis in forest plots. BMJ. 2008;336(7658):1413–5. pmid:18566080
  14. Hedges LV, Olkin I. Statistical Methods for Meta-Analysis. Orlando, FL: Academic Press; 1985.
  15. Egger M, Davey Smith G, Schneider M, Minder C. Bias in meta-analysis detected by a simple, graphical test. BMJ. 1997;315(7109):629–34. pmid:9310563
  16. Sutton AJ, Song F, Gilbody SM, Abrams KR. Modelling publication bias in meta-analysis: a review. Statistical Methods in Medical Research. 2000;9(5):421–45. pmid:11191259
  17. Kirkham JJ, Dwan KM, Altman DG, Gamble C, Dodd S, Smyth R, et al. The impact of outcome reporting bias in randomised controlled trials on a cohort of systematic reviews. BMJ. 2010;340:c365. pmid:20156912
  18. Lin L, Chu H. Quantifying publication bias in meta-analysis. Biometrics. 2017:In press. pmid:29141096
  19. Lin L, Chu H, Murad MH, Hong C, Qu Z, Cole SR, et al. Empirical comparison of publication bias tests in meta-analysis. Journal of General Internal Medicine. 2018;33(8):1260–7. pmid:29663281
  20. Murad MH, Chu H, Lin L, Wang Z. The effect of publication bias magnitude and direction on the certainty in evidence. BMJ Evidence-Based Medicine. 2018;23(3):84–6. pmid:29650725
  21. Cochran WG, Carroll SP. A sampling investigation of the efficiency of weighting inversely as the estimated variance. Biometrics. 1953;9(4):447–59.
  22. Hedges LV. An unbiased correction for sampling error in validity generalization studies. Journal of Applied Psychology. 1989;74(3):469–77.
  23. Böhning D, Malzahn U, Dietz E, Schlattmann P, Viwatwongkasem C, Biggeri A. Some general points in estimating heterogeneity variance with the DerSimonian–Laird estimator. Biostatistics. 2002;3(4):445–57. pmid:12933591
  24. DerSimonian R, Laird N. Meta-analysis in clinical trials. Controlled Clinical Trials. 1986;7(3):177–88. pmid:3802833
  25. Jackson D. The implications of publication bias for meta-analysis' other parameter. Statistics in Medicine. 2006;25(17):2911–21. pmid:16345059
  26. Davey J, Turner RM, Clarke MJ, Higgins JPT. Characteristics of meta-analyses and their component studies in the Cochrane Database of Systematic Reviews: a cross-sectional, descriptive analysis. BMC Medical Research Methodology. 2011;11:160. pmid:22114982
  27. Easterbrook PJ, Gopalan R, Berlin JA, Matthews DR. Publication bias in clinical research. The Lancet. 1991;337(8746):867–72.
  28. Gøtzsche PC. Reference bias in reports of drug trials. BMJ. 1987;295(6599):654–6. pmid:3117277
  29. Gilbert JR, Williams ES, Lundberg GD. Is there gender bias in JAMA's peer review process? JAMA. 1994;272(2):139–42. pmid:8015126
  30. Egger M, Zellweger-Zähner T, Schneider M, Junker C, Lengeler C, Antes G. Language bias in randomised controlled trials published in English and German. The Lancet. 1997;350(9074):326–9.
  31. Macaskill P, Walter SD, Irwig L. A comparison of methods to detect publication bias in meta-analysis. Statistics in Medicine. 2001;20(4):641–54. pmid:11223905
  32. Peters JL, Sutton AJ, Jones DR, Abrams KR, Rushton L. Comparison of two methods to detect publication bias in meta-analysis. JAMA. 2006;295(6):676–80. pmid:16467236
  33. Shuster JJ. Empirical vs natural weighting in random effects meta-analysis. Statistics in Medicine. 2010;29(12):1259–65. pmid:19475538
  34. Cooper H, Hedges LV, Valentine JC. The Handbook of Research Synthesis and Meta-Analysis. 2nd ed. New York, NY: Russell Sage Foundation; 2009.
  35. Casella G, Berger RL. Statistical Inference. 2nd ed. Belmont, CA: Duxbury Press; 2001.
  36. Veroniki AA, Jackson D, Viechtbauer W, Bender R, Bowden J, Knapp G, et al. Methods to estimate the between-study variance and its uncertainty in meta-analysis. Research Synthesis Methods. 2016;7(1):55–79. pmid:26332144
  37. Grissom RJ, Kim JJ. Effect Sizes for Research: A Broad Practical Approach. Mahwah, NJ: Lawrence Erlbaum Associates; 2005.
  38. Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates; 1988.
  39. Malzahn U, Böhning D, Holling H. Nonparametric estimation of heterogeneity variance for the standardised difference used in meta-analysis. Biometrika. 2000;87(3):619–32.
  40. Egger M, Davey Smith G, Altman DG. Systematic Reviews in Health Care: Meta-Analysis in Context. 2nd ed. London, UK: BMJ Publishing Group; 2001.
  41. Hedges LV. Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics. 1981;6(2):107–28.
  42. Bland JM, Altman DG. The odds ratio. BMJ. 2000;320(7247):1468. pmid:10827061
  43. Higgins JPT, Green S. Cochrane Handbook for Systematic Reviews of Interventions. Chichester, UK: John Wiley & Sons; 2008.
  44. Haldane JBS. The estimation and significance of the logarithm of a ratio of frequencies. Annals of Human Genetics. 1956;20(4):309–11. pmid:13314400
  45. Gart JJ, Pettigrew HM, Thomas DG. The effect of bias, variance estimation, skewness and kurtosis of the empirical logit on weighted least squares analyses. Biometrika. 1985;72(1):179–90.
  46. Pettigrew HM, Gart JJ, Thomas DG. The bias and higher cumulants of the logarithm of a binomial variate. Biometrika. 1986;73(2):425–35.
  47. Sweeting MJ, Sutton AJ, Paul LC. What to add to nothing? Use and avoidance of continuity corrections in meta-analysis of sparse data. Statistics in Medicine. 2004;23(9):1351–75. pmid:15116347
  48. Bradburn MJ, Deeks JJ, Berlin JA, Localio AR. Much ado about nothing: a comparison of the performance of meta-analytical methods with rare events. Statistics in Medicine. 2007;26(1):53–77. pmid:16596572
  49. Cai T, Parast L, Ryan L. Meta-analysis for rare events. Statistics in Medicine. 2010;29(20):2078–89. pmid:20623822
  50. Rücker G, Schwarzer G, Carpenter J, Olkin I. Why add anything to nothing? The arcsine difference as a measure of treatment effect in meta-analysis with zero cells. Statistics in Medicine. 2009;28(5):721–38. pmid:19072749
  51. Viechtbauer W. Conducting meta-analyses in R with the metafor package. Journal of Statistical Software. 2010;36:3.
  52. Jackson D, Bowden J, Baker R. How does the DerSimonian and Laird procedure for random effects meta-analysis compare with its more efficient but harder to compute counterparts? Journal of Statistical Planning and Inference. 2010;140(4):961–70.
  53. Sidik K, Jonkman JN. A comparison of heterogeneity variance estimators in combining results of studies. Statistics in Medicine. 2007;26(9):1964–81. pmid:16955539
  54. Paule RC, Mandel J. Consensus values and weighting factors. Journal of Research of the National Bureau of Standards. 1982;87(5):377–85.
  55. van Aert RCM, Jackson D. Multistep estimators of the between-study variance: the relationship with the Paule–Mandel estimator. Statistics in Medicine. 2018;37(17):2616–29. pmid:29700839
  56. Sterne JAC, Egger M. Funnel plots for detecting bias in meta-analysis: guidelines on choice of axis. Journal of Clinical Epidemiology. 2001;54(10):1046–55. pmid:11576817
  57. Tang J-L. Weighting bias in meta-analysis of binary outcomes. Journal of Clinical Epidemiology. 2000;53(11):1130–6. pmid:11106886
  58. Smith TC, Spiegelhalter DJ, Thomas A. Bayesian approaches to random-effects meta-analysis: a comparative study. Statistics in Medicine. 1995;14(24):2685–99. pmid:8619108
  59. Warn DE, Thompson SG, Spiegelhalter DJ. Bayesian random effects meta-analysis of trials with binary outcomes: methods for the absolute risk difference and relative risk scales. Statistics in Medicine. 2002;21(11):1601–23. pmid:12111922
  60. Chu H, Nie L, Chen Y, Huang Y, Sun W. Bivariate random effects models for meta-analysis of comparative studies with binary outcomes: methods for the absolute risk difference and relative risk. Statistical Methods in Medical Research. 2012;21(6):621–33. pmid:21177306
  61. Stijnen T, Hamza TH, Özdemir P. Random effects meta-analysis of event outcome in the framework of the generalized linear mixed model with applications in sparse data. Statistics in Medicine. 2010;29(29):3046–67. pmid:20827667
  62. Jackson D, Law M, Stijnen T, Viechtbauer W, White IR. A comparison of seven random-effects models for meta-analyses that estimate the summary odds ratio. Statistics in Medicine. 2018;37:1059–85. pmid:29315733