Heterogeneity in meta-analyses: an unavoidable challenge worth exploring
Abstract
Heterogeneity is a critical but unavoidable aspect of meta-analyses that reflects differences in study outcomes beyond what is expected by chance. These variations arise from differences in the study populations, interventions, methodologies, and measurement tools and can influence key meta-analytical outputs, including pooled effect sizes, confidence intervals, and overall conclusions. Systematic reviews and meta-analyses combine evidence from diverse studies; thus, a clear understanding of heterogeneity is necessary for reliable and meaningful interpretations of the results. This review examines the concepts, sources, measurement techniques, and implications of this heterogeneity. Statistical tools (e.g., Cochran’s Q, I², and τ²) quantify heterogeneity, whereas τ and prediction intervals, which are expressed in the same units as the original data, aid in the intuitive understanding of heterogeneity. The choice between fixed- and random-effects models can also significantly affect the handling and interpretation of heterogeneity in meta-analyses. Effective management strategies include subgroup analyses, sensitivity analyses, and meta-regressions, which identify sources of variability and strengthen the robustness of the findings. Although heterogeneity complicates the synthesis of a single effect size, it offers valuable insights into patterns and differences among studies. Recognizing and understanding heterogeneity is vital for accurately synthesizing the evidence, which can indicate whether an intervention has consistent effects, benefits, or harms. Rather than viewing heterogeneity as inherently good or bad, researchers and clinicians should consider it a key component of systematic reviews and meta-analyses, allowing for a deeper understanding and more nuanced application of pooled findings. Addressing heterogeneity ultimately enhances the reliability, applicability, and overall impact of the conclusions of meta-analyses.
Introduction
A systematic review is a structured research approach that is used to compile and assess multiple studies to address a specific research question [1,2]. Meta-analysis is a statistical technique that integrates the findings of these studies to provide a broader understanding of the evidence [2,3]. One key objective of meta-analyses is to combine study results to produce an overall estimate of the treatment effects. This enhances the precision of effect estimates and improves statistical power, enabling a more reliable detection of treatment effects than individual studies alone [4,5].
In many meta-analyses, the included studies yield varying results. Therefore, recognizing the patterns in these findings is essential. Even if the results are consistent, the implications must be interpreted carefully. If inconsistencies are observed, investigating both their meaning and their underlying causes is essential for accurate interpretation [5,6].
Outcome variations often emerge when studies are combined in a meta-analysis. These discrepancies may stem from differences in the study populations, intervention types, measurement tools, study designs, or analytical approaches. Such variations are collectively referred to as heterogeneity, a concept that lies at the heart of interpreting the findings of meta-analyses [7]. Heterogeneity refers to variability in study outcomes that exceeds what would be expected by chance alone [6].
Heterogeneity encompasses all forms of variation among studies included in meta-analyses. As meta-analyses integrate studies conducted in diverse settings, heterogeneity is inevitable. This can complicate the interpretation of meta-analysis results, necessitating a careful examination of the fundamental differences in study methodologies or contextual factors. However, the presence of heterogeneity does not render the meta-analysis results unimportant or invalid. Researchers conducting meta-analyses must recognize potential factors that may contribute to heterogeneity and develop strategies to address and account for it.
Addressing heterogeneity is not only a statistical challenge but also an opportunity to gain deeper insights into systematic reviews and meta-analyses. This article offers a comprehensive exploration of heterogeneity in meta-analyses and discusses the concepts, sources, detection methods, and strategies to achieve an effective resolution in the interpretation of results.
Understanding heterogeneity
Heterogeneity in a meta-analysis refers to differences in the results of the individual studies included. This is similar to the heterogeneity that manifests in primary studies. Therefore, heterogeneity at the primary study level will be addressed in this section first. Moreover, this section will provide information on differentiating between confidence and prediction intervals, as misunderstanding these intervals can lead to incorrect conclusions regarding the consistency and relevance of the results of meta-analyses.
Heterogeneity in a primary study
To illustrate heterogeneity at the primary study level, imagine three classes of students taking anatomy tests. Figs. 1A, B, and C show the grade distributions for classes A, B, and C, respectively. The grade distributions are assumed to be normally distributed with a mean of 60 for all classes. The standard deviations for classes A, B, and C are 5, 10, and 20, respectively. If the grade distribution is normal and the variance is known, the prediction interval can be calculated using the following formula:
Normal distribution curves illustrating how prediction intervals (PIs) widen with increasing heterogeneity at the primary study level. All curves have the same mean, but increasing standard deviations (SDs) from A to C reflect greater variability in individual outcomes within a single study. (A) Mean score = 60, SD = 5 (narrow prediction interval). (B) Mean score = 60, SD = 10 (moderate prediction interval). (C) Mean score = 60, SD = 20 (wide prediction interval).
Prediction interval = mean ± z(1−α/2) × SD
where α is the probability of a type 1 error, z(1−α/2) is the corresponding standard normal quantile (1.96 for α = 0.05), and SD is the standard deviation. Using this formula, the 95% prediction interval for each class can be calculated as follows:
Class A: (60 − 1.96 × 5, 60 + 1.96 × 5) ≈ (60 − 2 × 5, 60 + 2 × 5) = (50, 70)
Class B: (60 − 1.96 × 10, 60 + 1.96 × 10) ≈ (60 − 2 × 10, 60 + 2 × 10) = (40, 80)
Class C: (60 − 1.96 × 20, 60 + 1.96 × 20) ≈ (60 − 2 × 20, 60 + 2 × 20) = (20, 100)
The prediction interval estimates the range within which future observations are likely to fall, given the current data. For example, for the students in Class A, 95% of their scores are expected to fall between 50 and 70. Similarly, for Class B, 95% of the scores are expected to fall between 40 and 80, and for Class C, between 20 and 100.
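The class-level calculation above can be sketched in Python (an illustrative sketch; `prediction_interval` is a hypothetical helper, and the exact z-value 1.96 is used rather than the rounded 2):

```python
from statistics import NormalDist

def prediction_interval(mean, sd, alpha=0.05):
    """Normal-theory prediction interval: mean +/- z(1 - alpha/2) * SD."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    return (mean - z * sd, mean + z * sd)

# Grade distributions for classes A, B, and C (mean 60; SDs 5, 10, 20)
for label, sd in [("Class A", 5), ("Class B", 10), ("Class C", 20)]:
    lo, hi = prediction_interval(60, sd)
    print(f"{label}: ({lo:.1f}, {hi:.1f})")
```

With the rounded z = 2 used in the text, these intervals reduce to (50, 70), (40, 80), and (20, 100).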
Suppose an anatomy class score below 40 is considered low, a score above 80 is considered high, and a score between 40 and 80 is considered acceptable. In this scenario, almost all students in Class A performed acceptably, while fewer students in Class B met this acceptable range and even fewer in Class C. Indeed, significant variation in performance is found in Class C, with some students performing exceptionally well and others performing poorly. As a result, professors or policymakers in medical education administration may consider changing the teaching methods used in Class C. These findings highlight the importance of understanding the statistical dispersion of scores within a population when attempting to make informed decisions.
Prediction interval vs. confidence interval
The prediction interval is calculated using the mean and standard deviation. Unlike measures of dispersion such as the variance, the prediction interval expresses dispersion in the same units as the original data, making the information intuitive and easy to understand.
Dispersion refers to the distribution of the individual scores in a primary study. In the context of a secondary study, such as a systematic review or meta-analysis, this corresponds to heterogeneity or inconsistency, which is the degree to which the results of different studies are widely distributed. Therefore, the prediction interval in a secondary study provides information on the heterogeneity of the results of the primary studies [6].
A prediction interval can be confused with a confidence interval because both represent uncertainty in statistical analysis; however, they serve different purposes. A prediction interval forecasts an individual value within a population, focuses on future events, and indicates dispersion. By contrast, a confidence interval estimates the mean value of a sample, focuses on past or current events, and indicates precision. The formula used to calculate the prediction interval is as follows:
Prediction interval = mean ± z(1−α/2) × SD
where α is the probability of a type 1 error and SD is the standard deviation. The formula used to calculate the confidence interval is as follows:
Confidence interval = mean ± z(1−α/2) × SE
where α is the probability of a type 1 error and SE is the standard error. The formula used to calculate the standard error (SE) is as follows:
SE = SD/√n
where SD is the standard deviation and n is the sample size. Because the standard error is always smaller than the standard deviation (for n > 1), the confidence interval is narrower and more precise than the prediction interval.
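The contrast between the two intervals can be made concrete with a short sketch (the example values are assumptions; `intervals` is a hypothetical helper):

```python
import math
from statistics import NormalDist

def intervals(mean, sd, n, alpha=0.05):
    """Return the (prediction interval, confidence interval) pair under normality."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = sd / math.sqrt(n)               # standard error of the mean: SD / sqrt(n)
    pi = (mean - z * sd, mean + z * sd)  # forecasts an individual value (dispersion)
    ci = (mean - z * se, mean + z * se)  # estimates the mean (precision)
    return pi, ci

pi, ci = intervals(mean=60, sd=10, n=25)
# The CI is sqrt(n) = 5 times narrower than the PI here.
```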
How heterogeneity manifests in a meta-analysis
A systematic review is a study that involves collecting and evaluating multiple studies; the unit of analysis is not individuals but the studies themselves. Inevitably, the studies included in a systematic review or meta-analysis will differ from one another.
Consider as an example a systematic review and meta-analysis investigating the effect of a local intraperitoneal anesthetic on pain in patients undergoing laparoscopic cholecystectomy [8]. The results of the meta-analysis are reported using the standardized mean difference (SMD) because pain was measured using different instruments (e.g., a 101-point visual analog scale and an 11-point numerical rating scale). In this study, the pooled SMD for resting abdominal pain was −0.741 (95% CI [−1.001 to −0.481]), indicating a significant reduction in pain.
Consider a similar example. Figs. 2A, B, and C show the distribution of the analgesic effects of three different local anesthetics, A, B, and C, respectively. In patients undergoing laparoscopic cholecystectomy, the analgesic effects on pain were assumed to be normally distributed with a mean SMD of −2 for all three local anesthetics. The SMD is a normalized difference between two means, calculated by dividing the mean difference by an estimate of the within-group standard deviation. In this case, an SMD of −2 indicates that the anesthetic reduced pain by twice the standard deviation of the measured data. The standard deviations were 0.1 for anesthetic A, 0.4 for anesthetic B, and 0.8 for anesthetic C. The prediction interval for the pooled effect size in the meta-analysis is calculated as follows:
Normal distribution curves illustrating how prediction intervals (PIs) widen with increasing heterogeneity at the meta-analysis level. All curves share the same mean standardized mean difference (SMD = −2), but increasing standard deviations (SDs) from A to C represent greater variability in effect sizes across studies. (A) Mean SMD = −2, SD = 0.1 (narrow prediction interval). (B) Mean SMD = −2, SD = 0.4 (moderate prediction interval). (C) Mean SMD = −2, SD = 0.8 (wide prediction interval).
Prediction interval = pooled effect size ± z(1−α/2) × τ
where α is the probability of a type 1 error and τ is the standard deviation of the between-study variability. Using this formula, the 95% prediction intervals for the pooled SMD of the analgesic effects were as follows: local anesthetic A: (−2.2 to −1.8), local anesthetic B: (−2.8 to −1.2), and local anesthetic C: (−3.6 to −0.4). This means that when predicting the analgesic effect of these local anesthetics under the same conditions, the effect would be expected to fall within −2.2 to −1.8 for anesthetic A, −2.8 to −1.2 for anesthetic B, and −3.6 to −0.4 for anesthetic C approximately 95% of the time.
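These meta-analysis-level intervals follow the same arithmetic; below is a sketch using the simplified formula above (pooled SMD ± z × τ; `meta_prediction_interval` is a hypothetical helper):

```python
from statistics import NormalDist

def meta_prediction_interval(pooled, tau, alpha=0.05):
    """Simplified PI for the effect in a new setting: pooled +/- z(1 - alpha/2) * tau.
    More rigorous formulations also add the standard error of the pooled estimate
    and use a t quantile; the article's simplified form is kept here."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return (pooled - z * tau, pooled + z * tau)

# Local anesthetics A, B, and C: pooled SMD -2 with tau = 0.1, 0.4, 0.8
for label, tau in [("A", 0.1), ("B", 0.4), ("C", 0.8)]:
    lo, hi = meta_prediction_interval(-2, tau)
    print(f"Anesthetic {label}: ({lo:.1f}, {hi:.1f})")
```

Rounded to one decimal place, these reproduce the intervals given in the text.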
Suppose that in a given study design, a pooled SMD of less than −2.5 is considered a large analgesic effect, −2.5 to −1.5 a moderate effect, −1.5 to 0 a trivial effect, and 0 or more an increase in pain. In this context, we can expect that almost all studies will show a moderate analgesic effect of local anesthetic A, many studies will show a moderate effect and some will show large or trivial effects for local anesthetic B, and the analgesic effect will vary widely between studies for local anesthetic C, with many showing large or trivial effects and some even showing an increase in pain.
This variability has significant implications for clinical decision making. Clinicians may consider using local anesthetic C in a clinical setting, as many studies have shown some analgesic effects despite this variation. However, if the risk of adverse effects of local anesthetic C is high, its use may be discouraged, particularly if many studies indicate trivial effects or increased pain.
This example underscores the importance of understanding and quantifying statistical heterogeneity and variability in results across studies, even though systematic reviews and meta-analyses are secondary studies. Statistical heterogeneity provides crucial insights into the consistency and applicability of pooled results, ensuring informed decision making in both research and clinical settings [9,10].
Effects of heterogeneity on meta-analysis results
Heterogeneity refers to the variation in an intervention’s effects on different individuals, and whether consistent effects, benefits, or harms are found. If the heterogeneity is low (Fig. 2A), similar effects are expected for all patients. However, if the heterogeneity is large (Fig. 2C), the effects are expected to differ significantly from one patient to another, with some patients experiencing significant benefits, some moderate benefits, and some minimal benefits, while others may even be harmed. Therefore, quantifiable variations in outcomes, such as dispersion, heterogeneity, and inconsistency, must be understood across primary studies and meta-analyses, as this enables researchers and clinicians to better evaluate how heterogeneity influences the results of meta-analyses.
Effects of heterogeneity on study weight
Generating summary estimates and confidence intervals in meta-analyses requires the use of either fixed- or random-effects models [6]. Fixed-effects models are used when the study populations and interventions are considered homogeneous and when the number of studies is very small. However, they only account for within-study variations. For a fixed-effects model, the weight of each study (Wi) is calculated as follows:
Wi = 1/Vi
where Vi is the within-study variance. Random-effects models are used when study populations and interventions are considered heterogeneous. They account for both within- and between-study variations and are applied when the meta-analysis attempts to provide generalizations across a broad population, such as when significant diversity exists in the study populations or interventions, making it difficult to assume a single treatment effect size. For a random-effects model, the weight of each study (Wi) is calculated as follows:
Wi = 1/(Vi + τ²)
where Vi is the within-study variance of each study, and τ² is the between-study variance. The between-study variance (τ²) is a measure of the heterogeneity and significantly impacts the weight assigned to each study in a random-effects model.
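The two weighting schemes can be written directly from these formulas (a minimal sketch; the example variances are assumptions):

```python
def fixed_weight(v_i):
    """Fixed-effects weight: Wi = 1 / Vi (within-study variance only)."""
    return 1.0 / v_i

def random_weight(v_i, tau2):
    """Random-effects weight: Wi = 1 / (Vi + tau^2)."""
    return 1.0 / (v_i + tau2)

variances = [0.1, 0.4]  # a precise study and a less precise one
print([fixed_weight(v) for v in variances])        # weights 10.0 and 2.5
print([random_weight(v, 0.5) for v in variances])  # ~1.67 and ~1.11
# As tau^2 grows, the weights become more similar across studies,
# so large (precise) studies dominate the pooled estimate less.
```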
Effects of heterogeneity on standard errors and confidence intervals
Heterogeneity affects standard errors and confidence intervals. Fig. 3 illustrates five studies with identical 95% CI widths but different means. All the studies have the same standard deviation (1) and sample size (10), resulting in an identical 95% CI width (1.24). As shown in Figs. 3A and D, all five studies reported the same mean values. In Figs. 3B and E, the means increased or decreased by 0.5, and in Figs. 3C and F, the means increased or decreased by 1 in the order of studies A, B, and C or D, E, and F, respectively. The means and confidence intervals of the individual studies in Figs. 3B and E, and in Figs. 3C and F, were the same.
Effects of heterogeneity on standard errors and confidence intervals. Forest plots of five studies with the same standard deviation (SD = 1) and sample size (n = 10), resulting in equal 95% CI widths. Fig. 3A–C use fixed-effects models, and Fig. 3D–F use random-effects models. The pooled mean is constant (1.000), but the 95% CI becomes wider as heterogeneity increases under the random-effects model. (A) Fixed-effects; no heterogeneity (all means = 1.0). (B) Fixed-effects; moderate heterogeneity (means vary by 0.5). (C) Fixed-effects; high heterogeneity (means vary by 1.0). (D) Random-effects; no heterogeneity (same as A). (E) Random-effects; moderate heterogeneity (same data as B). (F) Random-effects; high heterogeneity (same data as C).
A fixed-effects model was used in Figs. 3A, B, and C to synthesize the pooled effect size, whereas a random-effects model was used in Figs. 3D, E, and F. From Fig. 3A to B and then to C, and from Figs. 3D to E and then to F, less overlap in the confidence intervals between the studies is evident, indicating increasing heterogeneity.
In Figs. 3A and D, in which all studies had the same mean and confidence interval, the pooled mean was 1.000 (95% CI [0.723−1.277]), regardless of the analysis method. In Figs. 3B and C, heterogeneity increased, but a fixed-effects model was applied. Accordingly, the pooled mean and 95% CI remained 1.000 (95% CI [0.723−1.277]), identical to those in Fig. 3A.
However, as shown in Figs. 3E and F, when a random-effects model was applied, the pooled means and confidence intervals became 1.000 (95% CI [0.307−1.693]) and 1.000 (95% CI [−0.386 to 2.386]), respectively. This shows that, although the pooled mean remained constant across Figs. 3D−F, the confidence interval widened with increasing heterogeneity.
This indicates that heterogeneity among studies affects the confidence interval of the pooled effect size when synthesized using a random-effects model in a meta-analysis.
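The Fig. 3 behavior can be reproduced with a compact DerSimonian–Laird sketch (assuming, per the text, five studies with SD = 1 and n = 10, so each within-study variance is Vi = 1/10; `dl_random_effects` is a hypothetical helper):

```python
import math
from statistics import NormalDist

def dl_random_effects(means, variances, alpha=0.05):
    """DerSimonian-Laird random-effects pooling (method of moments)."""
    w = [1.0 / v for v in variances]                      # fixed-effects weights
    m_fixed = sum(wi * yi for wi, yi in zip(w, means)) / sum(w)
    q = sum(wi * (yi - m_fixed) ** 2 for wi, yi in zip(w, means))
    df = len(means) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                         # truncated at zero
    w_star = [1.0 / (v + tau2) for v in variances]        # random-effects weights
    m_random = sum(wi * yi for wi, yi in zip(w_star, means)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return m_random, (m_random - z * se, m_random + z * se), tau2

# Five studies as in Fig. 3B/E: means 0.0-2.0 in steps of 0.5, Vi = 1/10 each
mean_re, ci, tau2 = dl_random_effects([0.0, 0.5, 1.0, 1.5, 2.0], [0.1] * 5)
print(round(mean_re, 3), [round(x, 3) for x in ci], round(tau2, 3))
```

For these data the sketch returns a pooled mean of 1.000 with a 95% CI of roughly (0.307, 1.693), matching the widening shown for the random-effects model in Fig. 3E.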
Effects of heterogeneity on the mean effect size
Heterogeneity also affects the mean effect size. Five studies with different sample sizes and increasing means are shown in Fig. 4. The means increased by 1 and the sample sizes by 10 in the order of studies A, B, C, D, and E. Fig. 4A shows a fixed-effects model, and Fig. 4B shows a random-effects model. The pooled mean for Fig. 4A was 1.667 (95% CI [1.507−1.827]) using the fixed-effects model, whereas for Fig. 4B it was 1.018 (95% CI [−0.243 to 2.278]) using the random-effects model. This illustrates that the pooled mean of the fixed-effects model was strongly influenced by studies with larger sample sizes, whereas the random-effects model was less influenced by these larger studies. This indicates that heterogeneity can cause the pooled mean effect sizes of fixed- and random-effects models to differ.
Forest plots showing how heterogeneity affects the pooled mean effect size in meta-analyses. Five studies with increasing means (−1 to 3) and increasing sample sizes (10 to 50) are analyzed. (A) Fixed-effects model. The pooled mean (1.667) is strongly influenced by larger studies. (B) Random-effects model. The pooled mean (1.018) reflects more balanced weighting due to between-study heterogeneity.
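The split between the two pooled means in Fig. 4 can likewise be recovered numerically; here each study's within-study variance is assumed to be Vi = SD²/n with SD = 1, which is consistent with the reported intervals:

```python
def pooled_means(means, variances):
    """Fixed- vs. DerSimonian-Laird random-effects pooled means."""
    w = [1.0 / v for v in variances]
    m_fixed = sum(wi * yi for wi, yi in zip(w, means)) / sum(w)
    q = sum(wi * (yi - m_fixed) ** 2 for wi, yi in zip(w, means))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(means) - 1)) / c)
    w_star = [1.0 / (v + tau2) for v in variances]
    m_random = sum(wi * yi for wi, yi in zip(w_star, means)) / sum(w_star)
    return m_fixed, m_random

means = [-1, 0, 1, 2, 3]                 # study means, as in Fig. 4
ns = [10, 20, 30, 40, 50]                # sample sizes
variances = [1.0 / n for n in ns]        # Vi = SD^2 / n with SD = 1 (assumed)
fixed, random_ = pooled_means(means, variances)
print(round(fixed, 3), round(random_, 3))
```

This reproduces the fixed-effects mean of 1.667 and the random-effects mean of 1.018 reported above.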
As we have demonstrated, heterogeneity influences the results of meta-analyses. Therefore, accurately measuring heterogeneity is crucial for understanding the variability in study outcomes, assessing the consistency of intervention effects, and properly calculating weights and confidence intervals in meta-analyses [9].
Measuring heterogeneity
Heterogeneity can be measured using statistical tests and visual inspection [6,11,12]. In this section, we will discuss these approaches in further detail.
Statistical tests
As studying an entire population is often not feasible, researchers must rely on studying a sample. This leads to differences between the true effect in the population and the observed effect in the sample. In primary studies, in which subjects are the unit of analysis, the focus is on the distribution of the true effect size. However, in a meta-analysis in which individual studies are the units of analysis, both true and observed effect size distributions must be considered.
Owing to sampling errors, the true effect variance in the population is smaller than the observed effect variance in the sample. The true effect variance is calculated by subtracting the sampling error from the observed effect variance. Common statistical tests for heterogeneity are based on this principle. Below are the formulas for I² and τ², which are the most common statistical measures of heterogeneity. I² is calculated as follows:
I² = [(Q − df)/Q] × 100%
where Q is Cochran’s Q statistic and df is the degrees of freedom. τ² is calculated as follows:
τ² = (Q − df)/C, with C = ΣWi − (ΣWi²/ΣWi)
where Q is Cochran’s Q statistic, df is the degrees of freedom, and Wi is the weight of the i-th study. For both formulas, Q represents the observed effect variance and df represents the sampling error; therefore, the numerator of both statistics represents the true effect variance.
Cochran’s Q-test
Cochran’s Q-test is a traditional and widely used method for assessing heterogeneity in meta-analyses. This test is used to determine whether the effect sizes of individual studies deviate from the overall effect size. The formula for Cochran’s Q-test is as follows:
Q = Σ (Yi − M)²/Vi
where k is the number of studies, M is the pooled effect size from the meta-analysis, Yi is the effect size of the i-th study, and Vi is the variance of the i-th study, with the sum taken over all k studies. Cochran’s Q statistic follows a chi-squared (χ²) distribution with k − 1 degrees of freedom. To determine the significance of the Q statistic, we calculate the probability of exceeding its value, denoted as Pχ². The χ² test for Cochran’s Q provides a P value for objective judgment.
If the Q statistic is large, indicating substantial variability in the Yi, Pχ² will be small. A small Pχ² (typically < 0.05) leads to a rejection of the null hypothesis Y1 = Y2 = ⋯ = Yk and a conclusion that the study results are not homogeneous. If Pχ² > 0.05, we fail to reject the null hypothesis and conclude that there is no evidence of heterogeneity. However, the Q statistic is affected by the sample size and number of studies. Its power to detect heterogeneity increases with larger sample sizes and more studies, and decreases with smaller sample sizes and fewer studies. Here, “power” refers to the probability of correctly identifying the presence of heterogeneity when it exists.
Because Cochran’s Q-test often lacks power, its P values can exceed 0.05 even when heterogeneity is present; a threshold of 0.1 is therefore often used to improve the sensitivity in detecting heterogeneity [13].
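A minimal sketch of the Q-test follows (the closed-form χ² survival function below is valid only for even degrees of freedom, which suffices for this five-study example; the data are the Fig. 3C-style means, assumed here):

```python
import math

def cochran_q(means, variances):
    """Cochran's Q statistic: sum of Wi * (Yi - M)^2 with Wi = 1/Vi."""
    w = [1.0 / v for v in variances]
    m = sum(wi * yi for wi, yi in zip(w, means)) / sum(w)
    return sum(wi * (yi - m) ** 2 for wi, yi in zip(w, means))

def chi2_sf_even_df(x, df):
    """Survival function of the chi-squared distribution (closed form, even df only)."""
    assert df % 2 == 0
    return math.exp(-x / 2) * sum((x / 2) ** i / math.factorial(i)
                                  for i in range(df // 2))

# Five studies with means varying by 1.0 and Vi = 1/10 each (as in Fig. 3C)
q = cochran_q([-1.0, 0.0, 1.0, 2.0, 3.0], [0.1] * 5)
p = chi2_sf_even_df(q, df=4)
# Q = 100 with df = 4; P falls far below the 0.1 threshold, so heterogeneity is detected.
```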
Higgins’ I²
Higgins’ I² statistic quantifies the degree of heterogeneity. The formula is as follows:
I² = [(Q − df)/Q] × 100%
where Q is Cochran’s Q statistic and df is the degrees of freedom [14]. In this formula, Q represents the total observed effect variance, and the degrees of freedom (df) represent the expected variance owing to sampling errors. The difference between Q and df indicates the true effect variance. Therefore, I² is the proportion of the true effect variance to the observed effect variance.
However, I² only indicates the proportion of the true effect variance to the observed variance; it does not provide the actual magnitude of heterogeneity. For example, in Figs. 5A and B, the observed effect variances are the same. I² can help researchers infer the true effect variance in this case. By contrast, in Figs. 5C and D, the observed effect variances differ. In this case, I² does not indicate the magnitude of the true effect variance.
Schematic diagrams illustrating how I² represents the proportion of true effect variance to observed variance. In Figs. 5A and B, the observed variances are the same, so differences in I² reflect differences in true effect variance. In Figs. 5C and D, the observed variances differ, so I² does not indicate the actual extent of heterogeneity. (A) I² = 75%; large true effect variance with the same observed variance as in B. (B) I² = 25%; small true effect variance with the same observed variance as in A. (C) I² = 40%; moderate true effect variance with more observed variance than in D. (D) I² = 80%; small true effect variance with less observed variance than in C.
For instance, if I² is 84% and the observed effect variance is 100, the true effect variance will be 84; however, if the observed effect variance is 50, the true effect variance will be 42. Thus, I² alone cannot reveal the true effect variance if the observed variance is not known. Commonly, arbitrary thresholds are used to interpret I², as in the following example: 0%–25%, no heterogeneity; 25%–50%, low heterogeneity; 50%–75%, moderate heterogeneity; and 75%–100%, high heterogeneity. However, the use of these thresholds alone is not recommended to diagnose heterogeneity. I² is a proportion and not an absolute measure of the true variance. In addition, uncertainty regarding I² is considerable, particularly when the number of studies is small. Considering I² along with the observed variance is more informative for accurately understanding the extent of heterogeneity. However, this method requires the reader or other researchers to calculate the true variance, which is not intuitive.
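The arithmetic in this example is easy to make concrete (a sketch; `i_squared` is a hypothetical helper, and the Q and df values are assumed):

```python
def i_squared(q, df):
    """Higgins' I2 (%): share of observed variance that is true heterogeneity."""
    return max(0.0, (q - df) / q) * 100

i2 = i_squared(q=25, df=4)       # 84% for this assumed Q and df
# The same I2 maps to different absolute heterogeneity:
true_var_large = 100 * i2 / 100  # observed variance 100 -> true variance 84
true_var_small = 50 * i2 / 100   # observed variance 50  -> true variance 42
```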
The τ² statistic, τ, and prediction intervals
Various methods exist to estimate τ². Using the DerSimonian and Laird method, based on the method of moments, τ² can be calculated as follows:
τ² = (Q − df)/C, with C = ΣWi − (ΣWi²/ΣWi)
where Q is Cochran’s Q statistic, df is the degrees of freedom, and Wi is the weight of the i-th study; negative estimates are truncated to zero.
The τ² statistic represents the between-study variance in a random-effects meta-analysis. The square root of this value (τ) is the estimated standard deviation of the between-study variation. Since the standard deviation has the same unit as the original data, τ shares the same unit as the original data. Therefore, prediction intervals, calculated using means and standard deviations, also have the same unit as the original data, making them intuitive and easy to interpret in terms of heterogeneity. Prediction intervals can indicate heterogeneity only if τ² > 0. When Q < df and τ² = 0, confidence intervals are typically used instead. The known challenges to using prediction intervals are as follows: (1) they assume a normal distribution; (2) they can be wide because they use standard deviations rather than standard errors; (3) a sufficient number of studies is required for reliability, with the required number varying based on the degree of heterogeneity; (4) alone, they lack specific cut-off values for heterogeneity; and (5) most statistical programs do not provide prediction intervals.
In summary, with statistical tests for heterogeneity, the following apply: (1) Pχ² determines the presence of heterogeneity; (2) I² indicates the proportion of the true effect variance to the observed effect variance; (3) τ², τ, and prediction intervals indicate the magnitude of heterogeneity (true effect variance); and (4) although τ² uses a different unit from the original data, τ and prediction intervals use the same unit, making them more intuitive; therefore, using τ and prediction intervals to report the degree of heterogeneity is preferred.
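Point (3) can be sketched end-to-end: estimate τ² by the DerSimonian–Laird method, take its square root, and form the simplified prediction interval around the pooled mean. The five-study data from Fig. 3E are reused as an assumed example (pooled mean 1.0, within-study variances 1/10):

```python
import math
from statistics import NormalDist

def tau2_dl(means, variances):
    """DerSimonian-Laird tau^2: (Q - df) / C, truncated at zero."""
    w = [1.0 / v for v in variances]
    m = sum(wi * yi for wi, yi in zip(w, means)) / sum(w)
    q = sum(wi * (yi - m) ** 2 for wi, yi in zip(w, means))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    return max(0.0, (q - (len(means) - 1)) / c)

tau2 = tau2_dl([0.0, 0.5, 1.0, 1.5, 2.0], [0.1] * 5)  # ~0.525
tau = math.sqrt(tau2)  # same units as the effect size, hence interpretable
z = NormalDist().inv_cdf(0.975)
pi = (1.0 - z * tau, 1.0 + z * tau)  # simplified PI: pooled mean +/- z * tau
```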
Visual inspection methods using graphs
Visual inspection methods provide an intuitive approach to assessing heterogeneity in meta-analyses. By plotting effect sizes and confidence intervals, these graphical methods help researchers identify patterns, trends, and potential inconsistencies across studies, and they serve as useful complementary tools for detecting variations in effect sizes. The following sections describe three commonly used graphical methods: forest plots, L’Abbé plots, and Galbraith plots.
Forest plots
Forest plots display the effect sizes and confidence intervals of individual studies. Overlapping CIs indicate low heterogeneity, whereas non-overlapping CIs indicate high heterogeneity.
L’Abbé plots
The L’Abbé plot displays the results of the two groups as weighted circles on an X–Y plane [7]. For example, in a study comparing the incidence of postoperative nausea and vomiting between palonosetron and ramosetron [15], a diagonal line through the origin indicates no difference in effectiveness between the two groups, circles falling toward the bottom right indicate that palonosetron was more effective, and circles toward the top left indicate that ramosetron was more effective. Larger circles represent higher precision. Homogeneity is indicated when the studies cluster closely along a straight line through the origin rather than forming a scattered cloud (Fig. 6).
L’Abbé plot comparing the odds of postoperative nausea and vomiting between the palonosetron and ramosetron groups. Each circle represents an individual study, with size reflecting its precision. The solid diagonal line (X = Y) indicates equal odds between groups. The dashed trend line lies above the diagonal, indicating a tendency toward lower odds in the palonosetron group.
Galbraith plots
The Galbraith plot represents the reciprocal of the standard error of the effect size (X-axis) against the standardized effect size (Y-axis). An arc corresponding to the observed effect size or outcome is plotted on the right-hand side of the graph. The point at which a straight line through the origin intersects this arc represents the observed effect size. Homogeneity is satisfied if the points fall within ±2 of this straight line (Fig. 7) [16].
Galbraith plot for assessing heterogeneity. Each point represents a study, with the standardized effect size on the Y-axis and the reciprocal of its standard error on the X-axis. The shaded area indicates the ±2 range around the regression line. The right-hand axis shows the corresponding unstandardized effect sizes.
The disadvantage of using these graphs to assess heterogeneity is their reliance on the subjective judgment of the researcher, which can vary among individuals.
Interpreting heterogeneity: beyond a single effect size
Heterogeneity can be challenging if the goal of the meta-analysis is to report a single common effect size. Low heterogeneity between studies allows researchers to confidently present a common effect size. However, with significant heterogeneity, assuming a common effect size is problematic.
For example, in Fig. 2A, the analgesic effect of local anesthetic A shows an SMD of −2.2 to −1.8 in most studies. This suggests that applying anesthetic A will likely produce a moderate effect (around SMD −2) in most patients; thus, reporting the common effect size of SMD −2 is meaningful. Conversely, Fig. 2C demonstrates a substantial variation in the effectiveness of local anesthetic C across studies. Some studies reported large effects, while others reported trivial effects, and some even reported increased pain. This wide variation indicates that patient outcomes vary considerably; thus, reporting the common effect size is less meaningful.
However, the goal of a meta-analysis extends beyond reporting common effect sizes to identifying patterns in the distribution of effect sizes. Recognizing significant variations in effect sizes is important because it can lead to further investigations of the factors that cause these differences. Additionally, clinicians should be aware that the treatment effects can vary.
Therefore, heterogeneity is neither inherently good nor bad and its presence does not invalidate the findings of a meta-analysis. Researchers must acknowledge this heterogeneity and understand that reporting a common effect size may not be appropriate in some cases.
Approaches to clinical heterogeneity
In meta-analyses, the types of heterogeneity include clinical, methodological, chance, and statistical heterogeneity. Clinical heterogeneity arises from differences in the inclusion and exclusion criteria, comorbidities, demographics, diagnostic criteria, interventions, co-interventions, and outcome variables among studies [17,18]. Methodological heterogeneity results from differences in the study designs, quality, outcome measures, and analytical methods. Heterogeneity may also result from chance. Statistical heterogeneity is identified using statistical analyses and may result from clinical, methodological, or chance heterogeneity or a combination of these factors [18].
The diversity of studies included in systematic reviews depends on the scope of the research question. Systematic reviews examining multiple interventions for the same condition or responses to an intervention among different populations naturally include more heterogeneous studies.
In a meta-analysis, reporting a common effect size is appropriate only if the included studies have sufficient homogeneity in terms of participants, interventions, comparisons, outcomes, and methods. High heterogeneity can result in misleading common effect sizes. For example, if a meta-analysis examines the effects of Drugs A and B separately, combining them into a single effect size would be inappropriate. However, if both drugs belong to the same class, and the average effect is of interest, a combined effect size may be reasonable. Thus, clinical heterogeneity does not necessarily preclude conducting a meta-analysis. Clinical heterogeneity can also result from differences in eligibility criteria such as symptom severity or outcome cutoffs, which affect treatment effectiveness, measurement sensitivity, and specificity.
Judgments of clinical heterogeneity are qualitative and are based on rational arguments regarding trial similarities or differences. Clinical heterogeneity is closely related to applicability, reflecting the differences in the PICO-TS (participants, interventions, comparators, outcomes, timing, and setting) of the included studies. Applicability is based on how well the findings of a systematic review apply to clinical practice and considers clinical heterogeneity, external validity, and generalizability [18]. Addressing clinical heterogeneity in a systematic review enhances its relevance and applicability to clinicians and policymakers.
Clinical heterogeneity vs. statistical heterogeneity
Clinical and statistical heterogeneity are distinct concepts that do not always coincide. Although clinical heterogeneity refers to variations in study populations, interventions, comparisons, outcomes, and study methods, statistical heterogeneity is observed through statistical testing and indicates variability in effect sizes across studies [19,20]. A study can exhibit clinical heterogeneity without showing statistical heterogeneity and vice versa. Therefore, a clear understanding of both types of heterogeneity is crucial for adequately assessing their impact. Identifying the causes of statistical heterogeneity and knowing when clinical heterogeneity is not problematic is essential.
If the treatments yield statistically similar outcomes despite clinical heterogeneity, the findings may be generalizable to a broader population. Conversely, the presence of statistical heterogeneity without clinical heterogeneity suggests that underlying factors may be affecting treatment effect sizes. Therefore, systematic reviews and meta-analyses should account for these variations. Addressing both types of heterogeneity ensures that the meta-analysis findings are clinically meaningful.
Considerations for investigating clinical heterogeneity
When investigating clinical heterogeneity, the following should be considered: (1) pre-planning: develop a plan for investigating clinical heterogeneity, including the types of clinical covariates and statistical methods, and describe them in the systematic review protocol prepared a priori; (2) scientific rationale: choose clinical covariates based on a clear scientific rationale, such as pathophysiological arguments or evidence from previous studies; (3) relevance to research: select clinical covariates based on their potential to influence the treatment effect of the PICO-TS or other relevant factors; (4) sufficient data: select clinical covariates only if sufficient studies exist (generally at least 10) to support the analysis; (5) limiting covariates: limit the number of clinical covariates to ensure manageability and relevance, thus reducing the risk of inflation of type I errors; and (6) observational nature: results involving covariates should be descriptively interpreted as hypothesis-generating rather than causal because systematic reviews and meta-analyses are observational.
Although preplanning for clinical covariates is ideal, identifying all potential covariates in advance may be challenging. In such cases, post hoc analyses may be necessary. However, these require careful interpretation and the results should be reported as observational findings rather than as definitive conclusions.
Actions to take when heterogeneity occurs in your data
This section provides information on what should be done if heterogeneity is evident in your data.
Verify the data
Ensure that the data have been extracted and entered correctly. Mistakes such as entering standard errors instead of standard deviations can narrow the confidence intervals. Double-checking the original sources and data entries is thus essential.
Ignore heterogeneity and analyze the data using a fixed-effects model
A fixed-effects model assumes a single true effect size across studies, ignoring heterogeneity. This approach may be appropriate if heterogeneity is low and studies are sufficiently similar. However, it can produce overly narrow confidence intervals, leading to overconfidence in results.
Use a random-effects model
There are two ways to synthesize a pooled effect size in a meta-analysis: using a fixed- or a random-effects model. The fixed-effects model is used when the number of studies is very small and the study population and interventions are considered homogeneous. The random-effects model, on the other hand, is used when the study population and interventions are not homogeneous and the purpose of the meta-analysis is to provide generalizations across a broad population (e.g., when assuming a single treatment effect size is difficult due to diversity in study subjects or interventions). Therefore, when heterogeneity exists, using a fixed-effects model is considered inappropriate.
Fixed-effects models generally provide the most precise estimates of the effects of an intervention. Therefore, fixed-effects models can still be used even in the presence of some heterogeneity as long as the findings are appropriately and carefully interpreted. For example, when a fixed-effects model is applied to a meta-analysis, it assigns more weight to studies with larger sample sizes. A large study showing a statistically insignificant result will therefore pull the pooled estimate towards statistical insignificance more strongly than it would under a random-effects model. If the pooled estimate using the fixed-effects model remains statistically significant, this strengthens the conclusion that the treatment effect is genuinely significant.
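The contrast between the two models can be sketched in a few lines of code. The block below pools a set of hypothetical effect sizes with the inverse-variance (fixed-effects) method and with the DerSimonian-Laird random-effects method, computing Cochran's Q, τ², and I² along the way. The data are invented for illustration.

```python
import math

# Hypothetical study data: observed effect sizes (e.g., SMDs) and their variances.
effects = [-2.1, -1.9, -2.0, -0.5, -2.2]
variances = [0.04, 0.05, 0.03, 0.06, 0.05]

def fixed_effect(effects, variances):
    """Inverse-variance (fixed-effects) pooled estimate and standard error."""
    w = [1.0 / v for v in variances]
    est = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    se = math.sqrt(1.0 / sum(w))
    return est, se

def dersimonian_laird(effects, variances):
    """Random-effects pooled estimate with DerSimonian-Laird tau^2."""
    w = [1.0 / v for v in variances]
    fe, _ = fixed_effect(effects, variances)
    q = sum(wi * (yi - fe) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                       # between-study variance
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0  # I^2 in percent
    w_star = [1.0 / (v + tau2) for v in variances]       # tau^2 widens the weights
    est = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return est, se, tau2, i2

fe_est, fe_se = fixed_effect(effects, variances)
re_est, re_se, tau2, i2 = dersimonian_laird(effects, variances)
print(f"fixed: {fe_est:.2f} (SE {fe_se:.3f}); random: {re_est:.2f} (SE {re_se:.3f}); "
      f"tau^2 = {tau2:.2f}, I^2 = {i2:.0f}%")
```

Because τ² is added to every study's variance, the random-effects standard error is wider than the fixed-effects one whenever heterogeneity is present, which is exactly the "overly narrow confidence interval" problem described above.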
Conduct a systematic review without a meta-analysis
If the heterogeneity is too large, it may be more appropriate to present the results separately rather than synthesize them into a single pooled effect size. With this approach, the risk of providing misleading conclusions based on highly heterogeneous data is avoided [21].
Investigate the sources of heterogeneity in the meta-analysis
Understanding the sources of heterogeneity is essential for interpreting the results of a meta-analysis. Identifying the sources of heterogeneity clarifies the variations in treatment effects and can guide future research planning [19]. Key strategies for exploring heterogeneity include subgroup, sensitivity, and meta-regression analyses.
Subgroup analyses
Subgroup analyses involve dividing studies into homogeneous groups based on specific clinical or methodological characteristics and then separately analyzing these groups [22–25]. This approach helps to determine whether certain factors contributed to the observed heterogeneity. For example, differences in patient characteristics, treatment protocols, or study designs may explain variations in effect sizes.
Although subgroup analyses are typically planned in advance, they may be conducted post hoc because predicting all relevant variables beforehand is not always possible. While post hoc subgroup analyses can provide useful insights, their results should be interpreted with caution, as this type of analysis can introduce bias or overestimate the effects. The findings of a subgroup analysis should be treated as exploratory and hypothesis-generating rather than as definitive conclusions [23,24].
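As a minimal illustration of the mechanics, the sketch below pools two hypothetical subgroups separately and computes the between-subgroup Q statistic, a chi-square test (df = number of subgroups − 1) of whether the subgroup estimates differ more than chance would allow. The grouping covariate and all values are invented.

```python
# Subgroup analysis sketch: pool each group separately, then test for differences.

def pool_fixed(effects, variances):
    """Inverse-variance pooled estimate and the total weight of the group."""
    w = [1.0 / v for v in variances]
    est = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    return est, sum(w)

# Hypothetical studies split by a clinical covariate (e.g., type of surgery).
group_a = ([-2.1, -1.9, -2.0], [0.04, 0.05, 0.03])
group_b = ([-0.5, -0.6], [0.06, 0.05])

subgroups = [pool_fixed(*g) for g in (group_a, group_b)]
overall, _ = pool_fixed(group_a[0] + group_b[0], group_a[1] + group_b[1])

# Q_between: large values (vs. the chi-square critical value, 3.84 for df = 1)
# suggest the covariate explains part of the heterogeneity.
q_between = sum(wg * (est - overall) ** 2 for est, wg in subgroups)
print(f"subgroup estimates: {[round(e, 2) for e, _ in subgroups]}, "
      f"Q_between = {q_between:.1f}")
```

In this invented example the two subgroup estimates differ sharply, so Q_between far exceeds the critical value; in real data such a result would still only be hypothesis-generating, as the surrounding text stresses.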
Sensitivity analyses
Sensitivity analyses evaluate the robustness of meta-analysis results by testing how changes in key assumptions or decisions affect the findings [26]. This may involve excluding studies with low-quality or missing data or modifying the criteria used for study inclusion. By assessing the consistency of the results under these conditions, researchers can evaluate the reliability of the conclusions of a meta-analysis.
Consistency across different sensitivity analyses provides additional support for the findings, as this suggests that the findings are not excessively influenced by a particular study or methodological choice. Sensitivity analysis is a valuable tool for assessing the reliability of pooled effect estimates and the potential impact of data quality on results.
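One common sensitivity analysis, leave-one-out, can be sketched as follows: the pooled estimate is recomputed with each study omitted in turn, and a large shift flags a study that drives the result. The data are hypothetical.

```python
# Leave-one-out sensitivity analysis sketch (hypothetical data).
effects = [-2.1, -1.9, -2.0, -0.5, -2.2]
variances = [0.04, 0.05, 0.03, 0.06, 0.05]

def pool(effects, variances):
    """Inverse-variance (fixed-effects) pooled estimate."""
    w = [1.0 / v for v in variances]
    return sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)

full = pool(effects, variances)

# Recompute the pooled estimate omitting each study in turn.
loo = []
for i in range(len(effects)):
    est = pool(effects[:i] + effects[i + 1:], variances[:i] + variances[i + 1:])
    loo.append(est)
    print(f"omit study {i + 1}: pooled = {est:.2f} (shift {est - full:+.2f})")
```

Here the fourth study, the outlier in this invented data set, produces the largest shift when removed; a finding that survives every omission is the "consistency across different sensitivity analyses" the text describes.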
Meta-regressions
Meta-regression is a statistical technique used to examine how study characteristics (e.g., sample size, treatment duration, and baseline risk) influence effect sizes. Through analyzing these characteristics as covariates, meta-regressions can help researchers identify factors that contribute to heterogeneity [27,28]. This is particularly useful for exploring the relationship between continuous or categorical variables and effect sizes, such as the influence of patient age or treatment dosage.
Importantly, meta-regressions only estimate correlations and not causal relationships. Therefore, the results should be interpreted with caution, particularly considering the low statistical power often associated with meta-regression analyses, which may limit the ability to detect true associations or significant differences.
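In its simplest fixed-effects form, a meta-regression is weighted least squares with inverse-variance weights. The sketch below regresses hypothetical effect sizes on an invented study-level covariate (mean patient age) using the closed-form WLS solution for a single covariate.

```python
# Minimal fixed-effects meta-regression: WLS with inverse-variance weights.
# All data are hypothetical, invented for illustration.
effects = [-2.1, -1.9, -1.5, -1.0, -0.6]     # study effect sizes (e.g., SMDs)
variances = [0.04, 0.05, 0.03, 0.06, 0.05]   # their variances
ages = [40, 45, 55, 65, 70]                  # study-level covariate (mean age)

w = [1.0 / v for v in variances]
xbar = sum(wi * x for wi, x in zip(w, ages)) / sum(w)      # weighted mean of x
ybar = sum(wi * y for wi, y in zip(w, effects)) / sum(w)   # weighted mean of y

# Closed-form WLS slope and intercept for one covariate.
slope = (sum(wi * (x - xbar) * (y - ybar) for wi, x, y in zip(w, ages, effects))
         / sum(wi * (x - xbar) ** 2 for wi, x in zip(w, ages)))
intercept = ybar - slope * xbar
print(f"effect = {intercept:.2f} + {slope:.3f} * age")
```

In this invented data set the slope is positive (the analgesic effect weakens with increasing age), but, as the text emphasizes, such a slope describes an association across studies, not a causal relationship, and a random-effects meta-regression would additionally incorporate τ² into the weights.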
For an accurate interpretation of pooled estimates in meta-analyses, researchers must identify and investigate the causes of heterogeneity [29,30]. Subgroup analyses, sensitivity analyses, and meta-regressions are effective tools for exploring heterogeneity. However, each of these methods requires careful planning and interpretation. Although these strategies offer valuable insights into the factors influencing treatment effects, the results should be considered observational rather than definitive. A thorough understanding of these methods ensures more accurate and meaningful conclusions from meta-analyses.
Conclusion
Heterogeneity is an inherent challenge of meta-analyses that arises when the results of multiple studies are combined. Researchers must therefore develop a thorough understanding of how heterogeneity manifests and be able to identify the factors that may affect the results. Although confidence intervals and I2 values are commonly used to report heterogeneity, a deeper exploration of their sources and characteristics is necessary for accurate interpretation. Key values such as tau statistics and prediction intervals provide valuable insights into the extent of heterogeneity. When heterogeneity is substantial, using subgroup, sensitivity, or meta-regression analyses to investigate the source is essential. However, the results should not be interpreted as evidence of causal relationships. Heterogeneity is an unavoidable issue in meta-analyses, and lacking a clear understanding of its impact can result in misleading interpretations of findings and inappropriate conclusions. Therefore, a comprehensive understanding of heterogeneity is essential to correctly conduct and interpret meta-analyses and ensure the reliability of the evidence reported.
Acknowledgements
The authors would like to express their gratitude to Dr. Michael Borenstein for his insightful online lectures, which provided valuable inspiration and guidance for the conceptual framework and structure of this study. Although the content of this manuscript was developed independently, the lectures significantly influenced the authors’ approaches to the subject matter.
Notes
Funding
None.
Conflicts of Interest
Geun Joo Choi has been an editor for the Korean Journal of Anesthesiology since 2020, and Hyun Kang has been a member of the statistical rounds board of the KJA since 2013. However, they were not involved in any process of review for this article, including peer reviewer selection, evaluation, or decision-making. There were no other potential conflicts of interest relevant to this article.
Data Availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
Author Contributions
Geun Joo Choi (Methodology; Validation; Writing – original draft; Writing – review & editing)
Hyun Kang (Conceptualization; Data curation; Formal analysis; Funding acquisition; Investigation; Methodology; Project administration; Resources; Software; Supervision; Validation; Visualization; Writing – original draft; Writing – review & editing)
