Gain Score Analysis
March 26, 2019
Manuscript and proposal reviewers, as well as research methodologists, have from time to time questioned the use of gain scores, also called difference scores or change scores, in the analysis of pretest-posttest designs. Typically they have stated that an analysis of covariance (ANCOVA), that is, an analysis of residual scores, represents a more appropriate analytical strategy and that the analysis of gain scores is not appropriate.
The debate between the use of gain scores and ANCOVA has a long history and, at times, generated heated discussion. Cohen, Cohen, West, and Aiken (2003) note that change scores can often overcorrect the posttest by the pretest, but that the interpretation "depends on our theoretical model of change, and that difference scores may be exactly what we need to match our model" (p. 571, footnote). Fitzmaurice, Laird, and Ware (2004) argue that the choice between analysis of gain scores versus ANCOVA depends on the research question. ANCOVA tests the following question: given that participants start with the same score, how do they differ at posttest? Tests of gain scores answer a different question: how do groups, on average, differ in gains? Fitzmaurice and colleagues argue that the latter question is most often the question investigators intend to ask. They and Oakes and Feldman (2001) recommend the use of ANCOVA only in the analysis of randomized controlled trials, although comparisons of groups within conditions, such as with moderators, may require gain scores (Cribbie & Jamieson, 2000). Maxwell and Delaney (1990) also support gain-score analysis, even as the preferred method of analysis for some randomized controlled trials.
My colleagues and I have been questioned several times by journal reviewers who misunderstand the literature on change scores, so below I first provide some typical comments one might include in a letter to a journal editor. I then provide a little more detail about the analysis of gain scores compared to an ANCOVA. Finally, I summarize the relevant discussions in a few books and manuscripts, such as Fitzmaurice, Laird, and Ware (2004), Lin and Hughes (1997), Maris (1998), Maxwell and Delaney (1990), Oakes and Feldman (2001), and Rogosa (1988). In general, the conclusion seems to be that analysts should use ANCOVA sparingly, only for randomized trials, only for tests of main effects, and even then only if groups are equivalent at baseline. The analysis of gain scores, in contrast, provides unbiased results in a much wider array of research designs.
Reviewers raised a question about the use of gain scores, also called change scores or difference scores, as the primary outcome in our randomized controlled trial. We have reviewed the sources suggested by reviewers and those that originally guided our analysis (e.g., Fitzmaurice, 2001; Fitzmaurice, Laird, & Ware, 2004; Fleiss, 1986; Judd & Kenny, 1981; Maris, 1998; Maxwell & Delaney, 1990; Oakes & Feldman, 2001; Rogosa, 1988) and concluded that the use of gain scores is defensible. In particular, Maxwell and Delaney conclude, "with randomized studies, the two methods are in general, if not exact, agreement" (Maxwell & Delaney, 1990, p. 393). Gain-score analysis may offer the preferred method of analysis for some randomized controlled trials (Maris, 1998).
One way to understand these results is to consider a repeated measures analysis of variance. The results from an analysis of gain scores are indistinguishable from those of a repeated measures ANOVA for studies with only two assessments per individual; the two methods are mathematically equivalent (Anderson et al., 1980, p. 238). Repeated measures ANOVA, however, has a long history in the behavioral sciences and has attracted little of the criticism leveled at gain scores. Indeed, some texts criticize or caution against gain scores and then recommend repeated measures analysis of variance without a similar qualification (e.g., Cohen, Cohen, West, & Aiken, 2003; Gall, Borg, & Gall, 1996). The analysis of gain scores provides appropriate, unbiased tests for most research designs.
An analysis of gain scores and a repeated measures ANOVA are equivalent tests both for a single group and for comparisons of two groups. When testing gains in only one group, an analysis of gain scores and repeated measures ANOVA are also equivalent to a paired t-test.
Single-Group Tests of Gains
For single-group tests that compare pretest to posttest, analysts have been left with three options: a t-test of gain scores, paired t-tests, and repeated measures ANOVA. At times, however, reviewers will question one or more of these methods. Rest assured, these three tests are equivalent. That may not satisfy reviewers, as ignorance sometimes prevails. For example, after explaining that the three tests were "mathematically equivalent," my colleagues and I received this response from a manuscript reviewer: "I agree - these statistical tests provide numerically equivalent results...so, please use a t-test of gain or repeated measures ANOVA....not individual paired t-tests." One wonders if this reviewer understands the word equivalent.
The result from an analysis of gain scores is not just indistinguishable from that of a paired t-test; they are the same test. Kanji (1999), in 100 Statistical Tests, defines the paired t-test in terms of the difference between "each pair of observations" (p. 30), which are gain scores. Similarly, the SAS STAT User’s Guide (2nd ed., version 9.2), under the computation section for t-tests, states that "the analysis [for a paired design] is the same as the analysis for the one-sample design . . . based on the differences di = y1i − y2i" (p. 7416), for all i participants. Rosenthal and Rosnow (2008) agreed: "When computing t tests for correlated data (or matched pairs, or repeated measurements), we perform our calculations not on the original n1 + n2 scores, but on differences between the n1 and n2 scores" (p. 396). Winer (1971) also shows that these two tests are identical (see pp. 47-48).
If one defines di = Xia − Xib for each individual i, then M(d) = M(Xa) − M(Xb), where M(·) denotes the mean of a variable over all i. The tests of the following hypotheses are then mathematically equivalent: (a) M(Xa) = M(Xb), (b) M(Xa) − M(Xb) = 0, and (c) M(d) = 0. These correspond to the paired t-test, repeated measures ANOVA, and a t-test or ANOVA of the difference scores, respectively. Each of these tests also accounts for the correlation between Xia and Xib to determine the variance. Some tests calculate the test variance directly from V(Xia), V(Xib), and their correlation; the t-test of difference scores, di, uses the variance of the di, denoted V(d). Winer (1971), however, showed that V(d) = V(Xa) + V(Xb) − 2·rab·√V(Xa)·√V(Xb). This equality, together with M(d) = M(Xa) − M(Xb), shows that a t-test of difference scores is identical to a paired t-test.
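These identities are easy to verify numerically. The sketch below uses made-up scores for eight hypothetical participants; it checks that M(d) = M(Xa) − M(Xb), confirms Winer's variance identity, and computes the t-test of the difference scores, which is exactly the paired t statistic.

```python
import math
from statistics import mean, stdev

# Made-up pretest/posttest scores for eight hypothetical participants.
pre = [12.0, 15.0, 11.0, 14.0, 13.0, 16.0, 10.0, 15.0]
post = [14.0, 18.0, 12.0, 17.0, 13.0, 19.0, 13.0, 16.0]
d = [b - a for a, b in zip(pre, post)]
n = len(d)

# M(d) = M(Xa) - M(Xb): the mean gain equals the difference in means.
assert math.isclose(mean(d), mean(post) - mean(pre))

# Pearson correlation between pretest and posttest, computed by hand.
cov = sum((a - mean(pre)) * (b - mean(post)) for a, b in zip(pre, post)) / (n - 1)
r = cov / (stdev(pre) * stdev(post))

# Winer's identity: V(d) = V(Xa) + V(Xb) - 2 * r * SD(Xa) * SD(Xb).
v_d = stdev(d) ** 2
assert math.isclose(v_d, stdev(pre) ** 2 + stdev(post) ** 2
                    - 2 * r * stdev(pre) * stdev(post))

# The t-test of the difference scores IS the paired t-test:
# same statistic, same degrees of freedom, same p-value.
t_diff = mean(d) / (stdev(d) / math.sqrt(n))
print(round(t_diff, 3))  # -> 4.733
```

Computing the "paired" version from the pretest and posttest variances and their correlation, via Winer's identity, necessarily reproduces the same t value.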
An analyst can also conduct a repeated measures ANOVA for a study with only two assessments per individual, such as a pretest and posttest. This is also mathematically equivalent to the test of gain scores and paired t-test (Anderson et al., 1980, p. 238). A repeated measures ANOVA, however, will produce an F statistic with one numerator degree of freedom (df) when the study has just two assessments across time. For this single-df F-test, the F-value will equal the squared t-value from either t-test approach, producing an identical p-value.
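For a two-wave design, the equivalence with repeated measures ANOVA can be checked the same way. This sketch, again with invented data, computes the repeated measures sums of squares by hand and confirms that the single-df F equals the squared paired t.

```python
import math
from statistics import mean, stdev

# Invented two-wave data (pretest, posttest) for six participants.
pre = [20.0, 24.0, 19.0, 23.0, 21.0, 25.0]
post = [23.0, 26.0, 20.0, 27.0, 22.0, 28.0]
n = len(pre)
grand = mean(pre + post)

# Sums of squares for a one-group repeated measures ANOVA, two time points.
ss_subjects = 2 * sum((mean([a, b]) - grand) ** 2 for a, b in zip(pre, post))
ss_time = n * ((mean(pre) - grand) ** 2 + (mean(post) - grand) ** 2)
ss_total = sum((x - grand) ** 2 for x in pre + post)
ss_error = ss_total - ss_subjects - ss_time
f_stat = (ss_time / 1) / (ss_error / (n - 1))  # df = 1 and n - 1

# Paired t-test on the same data.
d = [b - a for a, b in zip(pre, post)]
t_stat = mean(d) / (stdev(d) / math.sqrt(n))

# The single-df F equals the squared t, so the p-values are identical.
assert math.isclose(f_stat, t_stat ** 2)
```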
Comparison of Gains between Two Independent Groups
Questions frequently arise about whether the gains differ between two independent groups, such as whether individuals in a treatment sample make greater gains than those in a comparison sample. Repeated measures ANOVA with a fixed-effects group factor and a t-test of the difference between gain scores from two independent groups are also equivalent. "We note that with only two data points, the repeated measures ANOVA (see Winer, 1971) is mathematically equivalent to the simple gain score" (Anderson et al., 1980, p. 238). As with single-group tests, the F-value from an ANOVA will equal the squared t-value from the t-test, and both will produce an identical p-value. A paired t-test, however, is designed only for single-group comparisons of correlated data, so it is not applicable to the comparison of gains between two independent samples.
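A minimal sketch of the two-group case, using invented scores: an independent-samples (pooled-variance) t-test on the gain scores. Per the Anderson et al. quotation, squaring this t yields the group-by-time interaction F from the corresponding repeated measures ANOVA.

```python
import math
from statistics import mean, stdev

# Invented pre/post scores for a treatment and a comparison group.
treat_pre  = [10.0, 12.0, 11.0, 13.0, 9.0,  12.0]
treat_post = [15.0, 16.0, 14.0, 18.0, 13.0, 17.0]
ctrl_pre   = [11.0, 10.0, 12.0, 13.0, 10.0, 12.0]
ctrl_post  = [12.0, 11.0, 14.0, 14.0, 11.0, 13.0]

gain_t = [b - a for a, b in zip(treat_pre, treat_post)]
gain_c = [b - a for a, b in zip(ctrl_pre, ctrl_post)]
n1, n2 = len(gain_t), len(gain_c)

# Independent-samples t-test on the gain scores with a pooled variance.
sp2 = ((n1 - 1) * stdev(gain_t) ** 2 + (n2 - 1) * stdev(gain_c) ** 2) / (n1 + n2 - 2)
t_stat = (mean(gain_t) - mean(gain_c)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
print(round(t_stat, 3))  # squaring this t gives the interaction F
```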
Quite a number of factors can influence the choice between an analysis of gain scores, also called change scores or difference scores, and an analysis of covariance (ANCOVA). First, the analysis of gain scores answers a different research question than ANCOVA. The analysis of gains, for example, focuses on the improvements from pretest to posttest for whole groups. Second, covariate adjustment for pretest in observational or quasi-experimental studies can bias results. Third, in contrast to some claims, "the difference score is an unbiased estimate of true change" (Rogosa, 1988, p. 180). Fourth, gain score analyses can offer interpretation advantages. Finally, in randomized controlled trials, the ANCOVA has been considered to provide a more efficient, powerful analysis, although Oakes and Feldman (2001) disagree, showing that this "rests on the untenable assumption that pretests are measured without error" (p. 18). Fitzmaurice, Laird, and Ware (2004) conclude that "the analysis of longitudinal data from a randomized trial is the only setting where we recommend adjustment for baseline through analysis of covariance" (p. 124). Willett (1994) notes that "the researcher is strongly advised to avoid these [residual-change] scores as measures of within-person change" (p. 674; see also Willett & Singer, 1989).
A gain-score analysis and an analysis with a covariate adjustment for pretest scores answer different research questions. The analysis of gain scores compares those improvements between groups, such as treatment and control groups. Specifically, the analysis tests whether we can reject the hypothesis (H0) that the groups improved at the same rates. Fitzmaurice, Laird, and Ware (2004) state that a gain score answers the "question of whether the two groups differ in terms of their mean change over time" (p. 124). The gain-score analysis concerns changes in group means. In contrast, an analysis of covariance (ANCOVA) tests differences in covariate-adjusted scores and tests whether we can reject the hypothesis (H0) that individuals, when sharing the same pretest score, improved at the same rates. That is, ANCOVA "addresses the question of whether an individual belonging to one group is expected to change more (or less) than an individual belonging to the other group, given that they have the same baseline response" (Fitzmaurice, Laird, & Ware, 2004, p. 124, emphasis in original).
Gain scores have been condemned for problems of bias and regression effects, and it has been claimed that these problems typically occur with nonequivalent groups designs (Anderson et al., 1980; Cook & Campbell, 1979; Fleiss, 1986; Maxwell & Delaney, 1990). When pretests are not equivalent, so the argument goes, the interpretation of a gain score analysis may be problematic. In a randomized trial, we may assume equivalent groups; the randomization, then, protects against regression toward the mean and biased estimation (Fleiss, 1986; Lin & Hughes, 1997; Maris, 1998; Rogosa, 1988). With the assumption of an equivalent comparison group, an analysis of gain scores should give similar results to an ANCOVA. Indeed, "with randomized studies, the two methods are in general, if not exact, agreement" (Maxwell & Delaney, 1990, p. 393). As just discussed, however, the "problems" that occur with non-equivalent groups designs result from a misunderstanding of the research question answered by the analysis. Thus, Fitzmaurice, Laird, and Ware (2004) argue that the analysis of gain scores is always appropriate, whereas ANCOVA will bias results in non-equivalent groups or observational designs.
Bias Introduced by the Covariate
In nonequivalent groups or observational designs, the covariate adjustment in an ANCOVA can lead to biased or misinterpreted results. Covariates, for example, can introduce spurious relationships between group assignment and an outcome (Fitzmaurice, Laird, & Ware, 2004). This could lead to a conclusion of "no difference" between groups when they truly differ, as the covariate may have explained away the meaningful differences between groups, or vice versa. Similarly, Allison (1990) concluded that "the most compelling argument against [the regressor variable approach] is that it leads to the conclusion that there is a treatment effect when a straightforward examination of means indicates that nothing has happened" (p. 110). Oakes and Feldman (2001) show that "in the absence of randomization, when baseline differences between groups exist, we . . . show that change-score models yield less biased estimates (if biased at all)" (p. 18; see also Rogosa, 1988). One critical source of bias, discussed above, is that an analysis of covariance with nonequivalent groups frequently answers the wrong research question (Fitzmaurice, 2001; Rogosa, 1988).
Maris (1998) shows that only if the assignment is made on the basis of the pretest is the gain score estimator biased, and in those cases, the covariance adjustment estimator should be used. Maris refers to assignment on the basis of pretest as a situation where participants with lower (or higher) pretest scores would be more likely to get assigned to treatment. This situation might occur, for example, in a regression discontinuity design, which is quite different from the randomized, non-equivalent groups, or observational designs discussed here.
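A small simulation illustrates the bias; all numbers and variable names here are invented for illustration. Two stable groups differ at true baseline, there is no treatment effect at all, and the pretest is measured with error. The gain-score estimate stays near zero (correct), while the ANCOVA group coefficient is pulled away from zero because the unreliable pretest under-adjusts for the baseline difference.

```python
import random
from statistics import mean

random.seed(1)

def cs(u, v):
    """Centered cross-product sum: sum of (u - mean(u)) * (v - mean(v))."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

n = 2000
# Two stable groups that differ at true baseline; no treatment effect.
true_base = [random.gauss(50, 5) for _ in range(n)] + \
            [random.gauss(60, 5) for _ in range(n)]
group = [0] * n + [1] * n
pre = [t + random.gauss(0, 5) for t in true_base]   # fallible pretest
post = [t + random.gauss(0, 5) for t in true_base]  # no change, new error

# Gain-score estimate of the group effect: difference in mean change.
gains = [b - a for a, b in zip(pre, post)]
gain_effect = mean(gains[n:]) - mean(gains[:n])

# ANCOVA estimate: group coefficient from OLS of post on pre + group,
# computed via the closed-form two-predictor normal equations.
b_group = (cs(group, post) * cs(pre, pre) - cs(group, pre) * cs(pre, post)) / (
    cs(group, group) * cs(pre, pre) - cs(group, pre) ** 2)

print(round(gain_effect, 2), round(b_group, 2))
# gain_effect is near 0; b_group is spuriously far from 0.
```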
Regression Toward the Mean
Another criticism of the gain score is that regression toward the mean introduces bias into the estimate of change. Lin and Hughes (1997) show that "the effect of regression toward the mean can confound the evaluation of treatment effects if the study has no randomized control group" (p. 129). While true, the issue is a little more complicated, and even in comparisons between nonequivalent control groups, the gain score is often more likely to answer the research questions of interest than covariate-adjusted change (Fitzmaurice, 2001).
Judd and Kenny (1981) show that the critical assumption for a valid test of gain scores is stationarity, that "the effect of the assignment variable on the pretest is equal to the effect of the assignment variable on the posttest" (p. 117). The assignment variable is not the treatment variable, and if an investigator assigns study participants randomly, the assignment variable will be unrelated to both pretest and posttest and will satisfy stationarity. Judd and Kenny develop this more formally in their book and note that many situations would satisfy the stationarity assumption. In particular, "regression toward the mean does not pose a problem when evaluators are comparing two stable groups" (Oakes & Feldman, 2001).
Regression toward the mean occurs when the correlation between initial status and the change score is negative (Rogosa, 1988). This correlation, however, can be negative, positive, or zero (Zimmerman & Williams, 1982), "with the sign of the coefficient mainly depending on the ratio of the standard deviations of the two tests" (Yin & Brennan, 2002). The premise that this correlation is always negative (e.g., Cronbach & Furby, 1970; Linn & Slinde, 1977) was based on the erroneous assumption that both the pretest and posttest had equal variances (Rogosa, 1988; Yin & Brennan, 2002). Regression toward the mean only occurs in certain situations that depend on the measurement time, and even then, only if time 1 and time 2 variances remain stable. When the variance of a measure increases over time, for example, "regression toward the mean does not hold" (Rogosa, 1988, p. 187). Thus, Rogosa (1988) and Allison (1990) argue that regression toward the mean is rare, at best, and that a residual-change approach (e.g., ANCOVA) does not necessarily solve the potential problems associated with the regression to the mean.
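The sign claim is easy to demonstrate. Since cov(X, Y − X) = cov(X, Y) − V(X), the correlation between initial status and change is negative, zero, or positive depending on how the posttest standard deviation compares to the pretest's. A simulation sketch with invented parameters:

```python
import math
import random
from statistics import mean

def corr(u, v):
    """Pearson correlation, computed from centered cross-products."""
    mu, mv = mean(u), mean(v)
    su = sum((a - mu) ** 2 for a in u)
    sv = sum((b - mv) ** 2 for b in v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / math.sqrt(su * sv)

random.seed(7)
n = 5000
pre = [random.gauss(100, 10) for _ in range(n)]

# Equal pretest and posttest SDs (both 10): corr(pretest, gain) comes out
# negative, the classic "regression toward the mean" pattern.
post_eq = [100 + 0.6 * (x - 100) + random.gauss(0, 8) for x in pre]
# Posttest variance grows over time: the correlation turns positive.
post_grow = [100 + 1.5 * (x - 100) + random.gauss(0, 8) for x in pre]

r_eq = corr(pre, [b - a for a, b in zip(pre, post_eq)])
r_grow = corr(pre, [b - a for a, b in zip(pre, post_grow)])
assert r_eq < 0 < r_grow
```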
Finally, Maris (1998) also shows that "regression toward the mean is not a reason for not using the gain score estimator" (p. 325). "Regression toward the mean and a biased [gain score estimator] are simply two aspects of the same data pattern, and there is no logical relation between the two phenomena. In particular (a) regression toward the mean does not imply that [the gain score estimator] is biased, and (b) the absence of regression toward the mean does not imply unbiasedness of [the gain score estimator]" (pp. 322-323). Therefore, Maris concludes that "if the assignment is not on the basis of the pretest, there is no basis for preferring the covariance adjustment estimator over the gain score estimator" (p. 309).
Gain Score Reliability
Another common criticism of gain scores is that they are unreliable (Gupta, Srivastava, & Sharma, 1988; Linn & Slinde, 1977; Lord, 1956). Allison (1990) has shown that "the low reliability of change scores is irrelevant for the purpose of causal inference" (p. 104); "the low reliability results from the fact that in calculating the change score we difference out all the stable between-subject variation, except for that due to the treatment effect" (p. 105). Rogosa (1988) also demonstrated that the analysis of gain scores can provide both a reliable and unbiased estimate of true change. Claims of unreliability rely on the unrealistic situation of limited individual differences: if all individuals grow at nearly the same rate, gain scores correctly show that there are no individual differences in change to detect. Like Allison, Rogosa makes it clear that "the difference score is an unbiased estimate of true change" (p. 180).
At this point, the reliability issue is nothing but myth, and unfortunately, it is a rather persistent myth. The past claims that difference scores are unreliable (e.g., Cronbach & Furby, 1970) have long been debunked (e.g., Rogosa & Willett, 1983; Willett, 1988):
Although once highly favoured, [the gain score] was lambasted through the 1950s, 60s and 70s because of its purported unreliability and (usually negative) correlation with initial status (Bereiter, 1963; Linn & Slinde, 1977). But these criticisms were based on flawed assumptions, and the difference score, and some modifications of it, are now seen as the best you can do with only two waves of data (Rogosa, Brandt, & Zimowski, 1982; Rogosa & Willett, 1983; 1985; Willett, 1988). (Willett & Singer, 1989, p. 429)
This myth nonetheless persists despite considerable evidence to the contrary (see also Allison, 1990; Collins, 1996; Gollwitzer, Christ, & Lemmer, 2014; Mellenbergh, 1999; Rogosa, 1988; Thomas & Zumbo, 2012; Williams & Zimmerman, 1996; Yin & Brennan, 2002; Zimmerman, Williams, & Zumbo, 1993).
A critical and related issue is that the alternative ANCOVA model assumes perfect reliability—no measurement error—for the covariate, including the pretest measure. Few social science investigations rely on measures that meet this assumption, which introduces interpretation problems for ANCOVA. Kahneman (1965) demonstrated that unreliable covariates lead to systematic bias (see also Zinbarg, Suzuki, Uliaszek, & Lewis, 2010). Gain score models have no such perfect-reliability assumption. In each scenario where someone argues against gain scores due to measurement error, it is the ANCOVA which is actually problematic. As Oakes and Feldman (2001) discuss, many criticisms of gain scores essentially compare apples to oranges; they compare gain scores of measures that contain error with theoretical ANCOVA models with measures that have no error.
Gain scores, for purposes such as the analysis of outcomes in randomized controlled trials, can offer a better interpretation. An ANCOVA, the alternative to gain scores, focuses on differences between the treatment groups at posttest while holding constant pretest differences. The ANCOVA, then, does not tell us how the groups changed over time; it estimates a difference in residualized scores that no longer have a sensible scale or unit. The gain score, however, tells us precisely how scores changed from pretest to posttest: whether each group improved, deteriorated, or stayed constant, and by precisely how much. In a reading intervention, for example, we may find a difference of 18.5 words per minute between groups in oral reading. This is immediately meaningful to teachers, school administrators, and other reading researchers.
The research question answered by the analysis of gain scores also provides an easier interpretation. As discussed above, the analysis of gain scores addresses group differences. In comparisons of treatment groups from randomized trials, we assume equivalence at pretest, hence both the analysis of gain scores and ANCOVA have a similar interpretation. If groups differ at pretest, however, either in theory or in practice, the interpretation of an analysis of covariance becomes more challenging. Consider the following statement: "given baseline equivalence in body weight, males and females made similar gains in muscle mass." The challenge is that males and females rarely begin with the same body weight, thus obscuring the interpretation. The same holds for cardiovascular risk, risk of cancer, obesity rates, levels of depression and anxiety, and aggressive behavior. Similar distinctions can be made for many groups: smokers vs. nonsmokers, unemployed vs. employed, and so on. The analysis of gain scores, however, simply provides differences between groups on the average change for those groups.
Fitzmaurice, Laird, and Ware (2004) argue that ANCOVA will almost always have advantages, in terms of power, in randomized controlled trials. Oakes and Feldman (2001), however, show that "the common assumption that ANCOVA models are more powerful rests on the untenable assumption that pretests are measured without error. In the presence of measurement error, change-score models may be equally or even more powerful" (p. 18). And with very small samples, a t-test of gain scores should offer greater power because it estimates one fewer parameter than an ANCOVA (Maxwell & Delaney, 1990; Oakes & Feldman, 2001). This latter case is somewhat rare in typical randomized trials, but more common in group-randomized trials where the unit of analysis is a whole community or school (Feng, Diehr, Peterson, & McLerran, 2001; Murray, 1998).
Within the literature on research methods, quite a number of papers and books raise questions and offer recommendations about how to analyze pretest-posttest data. The following summaries briefly touch on a few of these contributions.
Willett (1988)
Willett (1988) consists of "a lengthy (and tediously complete) overview of issues, problems, and misconceptions in the measurement of change" (John Willett's website). Willett pointed out that "among methodologists, the residual change score has now been largely discredited as a measure of individual change" (p. 380; e.g., Rogosa, Brandt, & Zimowski, 1982). He then compared longitudinal growth models with the analysis of gain scores and covariate adjustment on two-wave data in perhaps the most complete comparison of the two approaches.
"The considerable logical, substantive, and technical flaws of the residual-change score were documented and the reader was advised to avoid this measure of individual growth. On the other hand, despite serious drawbacks, the observed difference score was shown to be a moderately successful measure of individual growth" (Willett, 1988, p. 413). Willett demonstrated that (a) gain scores could be highly reliable in the presence of interindividual change over time, (b) the gain score validly estimates individual growth, and (c) "under the straight-line growth model, the OLS estimate of the individual growth rate was shown to be a natural extension of the observed difference score to multiwave data" (p. 414). Willett, however, recommended longitudinal growth models over the analysis of gains, as do many statisticians and methodologists.
Fitzmaurice, Laird, & Ware (2004)
Fitzmaurice, Laird, and Ware (2004) state that "the analysis of longitudinal data from a randomized trial is the only setting where we recommend adjustment for baseline through analysis of covariance" (p. 124). They show that Lord's paradox, the seemingly conflicting results from the analysis of gain scores and ANCOVA, occurs because the two methods answer "qualitatively different scientific questions." A gain score analysis answers questions about differences in gains for groups, such as treatment or control, males or females, and so on. ANCOVA, in contrast, answers questions about two individuals who begin with the same pretest score. That is, ANCOVA addresses a conditional hypothesis (Jamieson, 1999). Thus, the question that ANCOVA answers may begin with an untenable assumption, that two participants have equivalent scores at baseline. In comparisons of males and females, smokers and nonsmokers, students at high and low risk of failure, and so on, the invalid assumption of equivalence at pretest may lead to spurious results.
ANCOVA raises other challenges in observational studies. Consider a comparison between smokers and nonsmokers. An adjustment for a pretest score of, say, peer antisocial behavior, can confound the analysis of outcomes of interest if the pretest is associated with those outcomes.
Fitzmaurice and colleagues provide additional compelling examples. They also discuss when an ANCOVA will be more powerful than an analysis of change scores. In general, ANCOVA will almost always provide a more powerful analysis for randomized trials, although power should not be the only criterion for choosing an analysis; the specific research question, the interpretation of the results, and so on, should also be considered.
Oakes & Feldman (2001)
This paper compares change-score and ANCOVA models. Oakes and Feldman show that, assuming perfect measurement in a randomized trial, the ANCOVA model "is about 30% more precise than change-score analysis" (p. 16). As the relationship between pretest and group assignment approaches .50, the two models perform more similarly. Large differences between groups at baseline can bias ANCOVA results, and imperfect measurement can reduce its power.
Oakes and Feldman conclude with "the principal finding . . . that for a randomized experiment, ANCOVA yields unbiased treatment estimates and typically has superior power to change-score methods, all else equal. However, in the absence of randomization, when baseline differences between groups exist, we follow Allison (1990) and show that change-score models yield less biased estimates (if biased at all). Then, bias aside, we went on to show that the common assumption that ANCOVA models are more powerful rests on the untenable assumption that pretests are measured without error. In the presence of measurement error, change-score models may be equally or even more powerful" (p. 18).
Cribbie & Jamieson (2000)
This paper examines a directional bias when measuring correlates of change with analysis of covariance, regression, gain scores, and structural equation models. Regression and ANCOVA favor finding differences between groups when they begin at different levels and either remain parallel or diverge (Jamieson, 1999). The analyses are less likely to find differences when groups start at a different point and then converge. These scenarios often occur in quasi-experiments or when comparing nonequivalent groups, like males and females. The latter situation also occurs during the analysis of moderators, such as when testing an intervention effect that attempts to improve scores for at-risk students faster than scores for students not at risk.
Cribbie and Jamieson (2000) discuss the literature and problems with some of the previous papers, such as certain omissions by Maris (1998). Their computer simulation study then showed "that properly identified structural equation models are not susceptible to [the directional] bias. Neither gain scores (posttest minus pretest) nor structural equation models exhibit the 'regression bias'" (p. 893, abstract). ANCOVA and regression models, however, still do.
Maris (1998)
To test for treatment effects, we could simply test differences between groups at posttest, Y, ignoring pretest, X (equation 1). Maris (1998) defines μYtt as the mean value of participants in the treatment condition that end up in the treatment group. He differentiates this from the hypothetical value, μYtc, which is the mean value of participants in the treatment condition that end up in the control group (unobserved). Maris (1998) then defines the covariance adjustment estimator in equation (2) below. The gain score estimator (3) is the treatment group's pretest-to-posttest change minus the control group's pretest-to-posttest change, and it is equivalent to (2) when β equals 1.
(1) tpost = μYtt - μYcc
(2) tcov.adj = μYtt - μYcc - β(μXt - μXc)
(3) tgain = (μYtt - μXt) - (μYcc - μXc)
These equations differ, then, in that the covariance adjustment estimator and the gain score estimator both adjust for a function of the difference in pretest measures. Provided we can assume those pretest differences adequately describe the expected differences at posttest, equations (2) and (3) should provide unbiased results. That is, if the differences between conditions remain the same across time, then they should describe the differences at posttest.
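A quick arithmetic check of equations (2) and (3), using invented group means: the gain score estimator equals the covariance adjustment estimator whenever β = 1.

```python
import math

# Invented group means for illustration: the treatment group starts higher
# than the control group, and both groups improve over time.
mu_x_t, mu_y_t = 55.0, 63.0   # treatment pretest / posttest means
mu_x_c, mu_y_c = 50.0, 54.0   # control pretest / posttest means

def t_cov_adj(beta):
    # Equation (2): posttest difference minus beta times the pretest difference.
    return (mu_y_t - mu_y_c) - beta * (mu_x_t - mu_x_c)

# Equation (3): treatment group's gain minus control group's gain.
t_gain = (mu_y_t - mu_x_t) - (mu_y_c - mu_x_c)

# The gain score estimator is the covariance adjustment estimator at beta = 1.
assert math.isclose(t_gain, t_cov_adj(1.0))
print(t_gain, t_cov_adj(0.5))  # -> 4.0 6.5
```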
Problems arise when the groups differ at pretest such that we cannot expect change from pretest to adequately describe change at posttest. Such a case may occur when the treatment group, for example, contains more extreme values. Because scores are likely to regress toward the mean, these extreme treatment values may be likely to make greater changes toward the overall mean. This, in turn, may imply a treatment effect when none truly exists. Notice, however, that this effect applies to both tgain and tcov.adj. Therefore, "regression toward the mean is not a reason for not using the gain score estimator" (Maris, 1998, p. 325).
If researchers assign participants to treatment and control groups on the basis of the pretest measure, then they should use the covariance adjustment estimator, tcov.adj. Here, only non-linear regressions within the treatment and control groups would cause problems. Note that for the covariance adjustment estimator, we assume equal regressions in the treatment and control groups. Interestingly, Maris (1998) shows that in some cases, such as self-selection into treatment, the covariance adjustment estimator is only unbiased when the estimator of β is 1. That is, under circumstances such as self-selection, the gain score estimator, tgain, is unbiased.
Finally, measurement error will not bias the gain score estimator, tgain, if assignment is not affected by the pretest. Measurement error does not bias the covariance adjustment estimator, tcov.adj, either. In particular, the gain score estimator, tgain, "does not depend on measurement error, unless the assignment is affected by the pretest." The important message is that "if the assignment is not on the basis of the pretest, there is no basis for preferring the covariance adjustment estimator over the gain score estimator" (Maris, 1998, p. 309, abstract). Maris refers to assignment on the basis of pretest as a situation where participants with lower pretest scores would be more likely to get assigned to treatment, such as in a regression discontinuity design. "Assignment on the basis of pretest scores means that any two units with the same pretest score are (a) always assigned to the same group, or (b) randomly assigned to one of the groups (with any probability)" (Maris, 1998, p. 316). This appears to include matched and randomized designs, where participants were equally likely to fall into each treatment condition.
Lin & Hughes (1997)
Abstract: Many studies use the same variable, for example blood pressure in studies of antihypertensive treatments, to identify subjects to be included in the study and to evaluate the effects of a treatment. As a consequence, if not properly accounted for, the effect of regression toward the mean can confound the evaluation of treatment effects if the study has no randomized control group. In this paper we review the methods that have been proposed for adjusting for the effect of regression toward the mean when the variable of interest is assumed to be normally distributed. Maximum likelihood estimation and moment-based estimation are considered, including more recent methods which can be applied when repeated measurements on each subject are available.
Maxwell & Delaney (1990)
Maxwell and Delaney give a nice account of an analysis of gain scores with ANOVA versus an ANCOVA with the pretest as the covariate. The authors point out that while gain scores are problematic for unrandomized studies, "when subjects are randomly assigned to conditions, the expected magnitude of the treatment effects will be the same in the two analyses" (p. 393). In fact, Maxwell and Delaney conclude, "with randomized studies, the two methods are in general, if not exact, agreement" (p. 393).
Their first point is that gain scores essentially restrict the covariate adjustment (the regression slope on the pretest) to 1. This, they argue, can give the test less power: because an ANCOVA estimates a slope that may differ from 1.0, it can achieve smaller error variance and detect an effect that an ANOVA of gain scores would miss. Of course, they then go on to say that an ANOVA of gain scores can be more powerful in small studies, if the pretest and posttest measures correlate highly, because a gain-score ANOVA estimates one fewer parameter than an ANCOVA.
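The relationship between the two models can be made concrete. In the sketch below (my own illustrative, simulated data), moving the pretest to the left-hand side of the regression is exactly what fixes its slope at 1, while the ANCOVA estimates that slope freely.

```python
import numpy as np

# Illustrative sketch: a gain-score analysis is the same regression as
# an ANCOVA except that the pretest slope is fixed at 1 rather than
# estimated from the data. Simulated values only.
rng = np.random.default_rng(1)
n = 500
group = rng.integers(0, 2, n)
pre = rng.normal(0, 1, n)
post = 0.6 * pre + 1.5 * group + rng.normal(0, 1, n)

# ANCOVA: the slope on the pretest is freely estimated
X = np.column_stack([np.ones(n), group, pre])
b_ancova, *_ = np.linalg.lstsq(X, post, rcond=None)

# Gain-score model: moving the pretest to the left-hand side
# constrains its slope to exactly 1
Xg = np.column_stack([np.ones(n), group])
b_gain, *_ = np.linalg.lstsq(Xg, post - pre, rcond=None)

print(b_ancova[2])             # freely estimated pretest slope, near 0.6
print(b_ancova[1], b_gain[1])  # the two treatment-effect estimates
```

When the true slope is far from 1, as here, the ANCOVA's residual variance is smaller, which is Maxwell and Delaney's power argument; when the pretest and posttest correlate highly (slope near 1), the gain-score model loses nothing and saves a parameter.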
Maxwell and Delaney show that when intact (nonrandomized) groups are compared, we may run into Lord's paradox, the situation where groups as wholes do not change but individuals within those groups do. In this scenario, the ANCOVA answers the question, "do members who begin at a certain level change over the year?" But this is not the question of interest; it is a question conditional on students' initial level. The gain-score ANOVA tells us whether groups change as a whole, which matches our hypothesis, and so provides the correct analysis. Maris (1998) implies that Lord's paradox may not actually cause problems except in situations where we have no control group.
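Lord's paradox is easy to reproduce in a simulation of my own construction (illustrative values only): two intact groups differ at baseline and no individual systematically changes. The gain-score analysis correctly reports no change, while the ANCOVA, whose within-group slope is pulled below 1 by measurement error, reports a group "effect."

```python
import numpy as np

# Illustrative sketch of Lord's paradox: two intact groups differ at
# baseline, and no one systematically changes from pretest to posttest.
rng = np.random.default_rng(2)
n = 2000
group = np.repeat([0, 1], n)             # two intact groups
mu = np.where(group == 1, 10.0, 0.0)     # baseline group difference
stable = mu + rng.normal(0, 2, 2 * n)    # each person's stable level
pre = stable + rng.normal(0, 2, 2 * n)
post = stable + rng.normal(0, 2, 2 * n)  # no systematic change

# Gain-score analysis: the groups' mean gains are essentially equal
gain = post - pre
gain_diff = gain[group == 1].mean() - gain[group == 0].mean()

# ANCOVA: the within-group slope is well below 1, so the baseline
# difference is under-adjusted and reappears as a group "effect"
X = np.column_stack([np.ones(2 * n), group, pre])
b, *_ = np.linalg.lstsq(X, post, rcond=None)

print(abs(gain_diff) < 0.5)  # True: gains show no group difference
print(b[1] > 1.0)            # True: ANCOVA shows a spurious "effect"
```

The two analyses disagree because they answer different questions; for intact groups, the gain-score question is usually the one the investigator intends.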
David Rogosa has written an excellent article called "Myths about Longitudinal Research." In it, he tackles a number of claims about the inadequacy of gain scores. In particular, he shows that "the difference score is an unbiased estimate of true change" (Rogosa, p. 180), so it cannot be intrinsically unreliable or unfair. He also challenges the ideas (a) that one cannot avoid regression toward the mean whenever change correlates, negatively or positively, with initial status, and (b) that residual change scores (e.g., ANCOVA) solve the problems with gain scores. Anyone who questions the use of gain scores should carefully read Rogosa's article (also see Rogosa & Willett, 1983, 1985).
The details of Rogosa's article can be somewhat technical. The foundation of many of his arguments, however, revolves around two points of confusion: between the observed correlation, which contains measurement error, and the "true," error-free correlation; and around the assumption that the variance of a measure remains stable over time. The correct interpretation rests on the true correlations and allows that the variance of a measure may change over time; often the variance increases. These faulty assumptions have led previous authors to incorrect conclusions about gain scores. Let me summarize Rogosa's conclusions.
A common criticism of gain scores is that they are unreliable (Gupta, Srivastava, & Sharma, 1988; Linn & Slinde, 1977; Lord, 1956). Rogosa (1988) has shown that the analysis of gain scores can provide both a reliable and an unbiased estimate of true change. Claims of unreliability rely on the unrealistic situation of limited individual differences: if all individuals grow at nearly the same rate, low reliability merely reflects that gain scores cannot detect individual differences that do not exist (Rogosa). "The difference score is an unbiased estimate of true change" (Rogosa, p. 180).
Many of the critics claim that the correlation between initial status and change is negative and hence implies regression toward the mean. Rogosa argues that this correlation depends heavily on the time of measurement, so it can be negative, positive, or even zero. Thus, regression toward the mean occurs only in certain situations, depending on the measurement time, and even then only if the time 1 and time 2 variances remain stable. When the variance of a measure increases over time, "regression toward the mean does not hold" (Rogosa, p. 187).
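Rogosa's point can be sketched with a simple growth simulation (my own illustrative example, not from his paper): when individuals who start high also grow faster, the variance increases over time and initial status correlates positively with change, so regression toward the mean does not hold.

```python
import numpy as np

# Illustrative sketch: individual growth where high starters grow
# faster. Variance increases from time 1 to time 2, and change
# correlates positively with initial status.
rng = np.random.default_rng(3)
n = 5000
intercept = rng.normal(0, 1, n)                # status at time 1
slope = 0.5 * intercept + rng.normal(0, 1, n)  # growth rate, tied to status

t1 = intercept
t2 = intercept + slope
change = t2 - t1                               # equals the growth rate

print(np.var(t2) > np.var(t1))            # True: the variance increases
print(np.corrcoef(t1, change)[0, 1] > 0)  # True: change correlates
                                          # positively with initial status
```

Reversing the sign on the growth-rate term would produce shrinking variance and a negative correlation, illustrating Rogosa's claim that the sign depends on the growth process and the time of measurement rather than being a fixed law.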
"After demonstrating the inadequacy of simple change scores, we will show how the use of the postscore as the dependent variable and the prescore as a covariate makes the analysis of change a special case of general [analysis of partial variance]" (Cohen & Cohen, 1983, pp. 413-414). Cohen and Cohen, like many standard texts, argue for the use of analysis of partial variance or the analysis of covariance. Building on the previously debunked myths, Rogosa points out that the properties of the usual sample estimates of residual change are "not pretty" (Rogosa, p. 188). The estimate may be badly biased, its precision is low because of measurement error and the finite sample size used in the regression adjustment, and its reliability is at best only marginally better than that of gain scores.
Because the adjustment for initial status depends on the time of measurement, the attempt to "purge initial status" from the adjusted measure falls short. "The fatal flaw of the residual change procedure is the attempt to assess correlates of change by ignoring individual growth. Questions about systematic individual differences in growth cannot be answered without reference to individual growth. Yet these time 1-time 2 correlation procedures valiantly attempt to do so" (Rogosa, pp. 190-191).
Rogosa concludes his paper with three additional myths. All of these apply mostly to covariance structure analyses, such as traditional structural equation models that include stability estimates across time. He argues that the analysis of covariance matrices does not inform us about change, that stability coefficients provide ambiguous information, and finally that a "casual analysis," such as cross-lagged correlations, cannot support causal inferences (Rogosa, p. 201).
Allison, P. D. (1990). Change scores as dependent variables in regression analysis. In C. C. Clogg (Ed.), Sociological Methodology (Vol 20, pp. 93-114). Oxford, UK: Blackwell.
Anderson, S., Auquier, A., Hauck, W., Oakes, D., Vandaele, W., & Weisberg, H. (1980). Statistical methods for comparative studies. New York: John Wiley & Sons.
Bereiter, C. (1963). Some persisting dilemmas in the measurement of change. In C. W. Harris (Ed.), Problems in the measurement of change (pp. 3-20). Madison, WI: University of Wisconsin Press.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Collins, L. M. (1996). Is reliability obsolete? A commentary on 'Are simple gain scores obsolete?' Applied Psychological Measurement, 20, 289-292.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design & analysis issues for field settings. Chicago: Rand McNally College Publishing.
Cribbie, R. A., & Jamieson, J. (2000). Structural equation models and the regression bias for measuring correlates of change. Educational and Psychological Measurement, 60(6), 893-907.
Cronbach, L. J., & Furby, L. (1970). How should we measure 'change' - or should we? Psychological Bulletin, 74(1), 68-80.
Feng, Z., Diehr, P., Peterson, A., & McLerran, D. (2001). Selected statistical issues in group randomized trials. Annual Review of Public Health, 22, 167-187.
Fitzmaurice, G. (2001). A conundrum in the analysis of change. Nutrition, 17(4), 360-361.
Fitzmaurice, G. M., Laird, N. M., & Ware, J. H. (2004). Applied longitudinal analysis. Hoboken, NJ: Wiley.
Fleiss, J. L. (1986). The design and analysis of clinical experiments. New York: John Wiley & Sons.
Gall, M. D., Borg, W. R., & Gall, J. P. (1996). Educational research: An introduction (6th Ed). White Plains, NY: Longman.
Gupta, J. K., Srivastava, A. B. L., & Sharma, K. K. (1988). On the optimum predictive potential of change measures. Journal of Experimental Education, 56, 124-128.
Jamieson, J. (1999). Dealing with baseline differences: Two principles and two dilemmas. International Journal of Psychophysiology, 31(2), 155-161.
Judd, C. M., & Kenny, D. A. (1981). Estimating the effect of social interventions. Cambridge, England: Cambridge University Press.
Kahneman, D. (1965). Control of spurious association and the reliability of the controlled variable. Psychological Bulletin, 64(5), 326-329.
Kanji, G. K. (1999). 100 statistical tests (new ed.). Thousand Oaks, CA: Sage.
Lin, H. M., & Hughes, M. D. (1997). Adjusting for regression toward the mean when variables are normally distributed. Statistical Methods in Medical Research, 6(2), 129-146.
Linn, R. L., & Slinde, J. A. (1977). The determination of the significance of change between pre- and post-testing periods. Review of Educational Research, 47, 121-150.
Lord, F. M. (1956). The measurement of growth. Educational and Psychological Measurement, 16, 421-437.
Maris, E. (1998). Covariance adjustment versus gain scores--revisited. Psychological Methods, 3, 309-327.
Maxwell, S. E., & Delaney, H. D. (1990). Designing experiments and analyzing data: A model comparison perspective. Pacific Grove, CA: Brooks/Cole.
Mellenbergh, G. J. (1999). A note on simple gain score precision. Applied Psychological Measurement, 23(1), 87-89.
Murray, D. M. (1998). Design and analysis of group-randomized trials. New York: Oxford University Press.
Oakes, J. M., & Feldman, H. A. (2001). Statistical power for nonequivalent pretest-posttest designs: The impact of change-score versus ANCOVA models. Evaluation Review, 25(1), 3-28.
Rogosa, D. (1988). Myths about longitudinal research. In K. W. Schaie, R. T. Campbell, W. M. Meredith, & S. C. Rawlings (Eds.), Methodological issues in aging research (pp. 171-209). New York, NY: Springer.
Rogosa, D. R., Brandt, D., & Zimowski, M. (1982). A growth curve approach to the measurement of change. Psychological Bulletin, 90, 726-748.
Rogosa, D. R., & Willett, J. B. (1983). Demonstrating the reliability of the difference score in the measurement of change. Journal of Educational Measurement, 20, 335-343.
Rogosa, D. R., & Willett, J. B. (1985). Understanding correlates of change by modeling individual differences in growth. Psychometrika, 50, 203-228.
Rosenthal, R., & Rosnow, R. L. (2008). Essentials of behavioral research: Methods and data analysis (3rd ed.). San Francisco: McGraw-Hill.
Thomas, D. R., & Zumbo, B. D. (2012). Difference scores from the point of view of reliability and repeated-measures ANOVA: In defense of difference scores for data analysis. Educational and Psychological Measurement, 72(1), 37-43.
Willett, J. B. (1988). Questions and answers in the measurement of change. Review of Research in Education, 15, 345-422.
Willett, J. B. (1994). Measurement of change. In T. Husen & T. N. Postlethwait (Eds.), The international encyclopedia of education (2nd ed., pp. 671-678). Oxford, UK: Pergamon.
Willett, J. B., & Singer, J. D. (1989). Two types of question about time: Methodological issues in the analysis of teacher career path data. International Journal of Educational Research, 13(4), 421-437.
Williams, R. H., & Zimmerman, D. W. (1996). Are simple gain scores obsolete? Applied Psychological Measurement, 20, 59-69.
Winer, B. J. (1971). Statistical principles in experimental design (2nd ed.). New York: McGraw-Hill.
Yin, P., & Brennan, R. L. (2002). An investigation of difference scores for a grade-level testing program. International Journal of Testing, 2(2), 83-105.
Zimmerman, D. W., & Williams, R. H. (1982). Gain scores in research can be highly reliable. Journal of Educational Measurement, 19(2), 149-154.
Zinbarg, R. E., Suzuki, S., Uliaszek, A. A., & Lewis, A. R. (2010). Biased parameter estimates and inflated Type I error rates in analysis of covariance (and analysis of partial variance) arising from unreliability: Alternatives and remedial strategies. Journal of Abnormal Psychology, 119(2), 307-319. doi: 10.1037/a0017552