Keith Smolkowski

Covariates, Regression, & Pre-Post Gains

Oregon Research Institute

Δ Papers that support the analysis of gain scores, which may argue against the analysis of covariates or residual change.

© Sources that argue for only prespecified covariates identified through a theoretically based and empirically supported argument before data collection begins.

® Papers that argue for analysis of covariance or residual-change models with covariates identified post hoc, such as when groups are found to differ at baseline.

The summary at the bottom of this page reviews common assumptions and challenges with the use of covariates.

See also Mediation Analysis, Causal Inference, & RCTs & QEDs


Alemayehu, D. (2011). Current issues with covariate adjustment in the analysis of data from randomized controlled trials. American Journal of Therapeutics, 18(2), 153-157. https://doi.org/​10.1097/​MJT.0b013e3181b7d228

© See abstract at PubMed.

Allison, P. D. (1990). Change scores as dependent variables in regression analysis. Sociological Methodology, 20, 93-114.

Δ (©) Allison refutes standard criticisms of the change score: unreliability and bias due to regression toward the mean. He then shows that the residual-change model can suggest treatment effects when inspection of means indicates no such difference. The covariate-adjusted approach assumes that "regression to the mean within groups implies regression to the mean between groups, a conclusion that seems quite implausible for many applications" (p. 110). The analysis of gains does not require the same assumption. Allison suggests that the investigator carefully select the appropriate model for each empirical application, but "in ambiguous cases, there may be no recourse but to do the analysis both ways and to trust only those conclusions that are consistent across methods" (p. 110).

Also referenced as Allison, P. D. (1990). Change scores as dependent variables in regression analysis. In C. C. Clogg (Ed.), Sociological Methodology (Vol 20, pp. 93-114). Oxford, UK: Blackwell.
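
Allison's "do the analysis both ways" advice is easy to follow in practice. The sketch below is my illustration, not Allison's: Python with numpy, pandas, and statsmodels, simulated data, and arbitrary parameter values of my choosing. It fits the change-score model and the residual-change (ANCOVA) model to the same nonequivalent-groups data and compares the treatment estimates.

```python
# Minimal sketch, not Allison's code: simulated nonequivalent groups,
# a fallible pretest, and a true treatment effect of 2 points.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
group = rng.integers(0, 2, n)                    # 0 = control, 1 = treatment
true = rng.normal(50 + 5 * group, 10, n)         # groups differ at baseline
pre = true + rng.normal(0, 5, n)                 # pretest measured with error
post = true + 2 * group + rng.normal(0, 5, n)    # treatment adds 2 points

d = pd.DataFrame({"group": group, "pre": pre, "post": post,
                  "gain": post - pre})
gain_fit = smf.ols("gain ~ group", data=d).fit()          # change-score model
ancova_fit = smf.ols("post ~ pre + group", data=d).fit()  # residual-change model
# The estimates diverge here; Allison would trust only conclusions
# that hold under both models.
print(gain_fit.params["group"], ancova_fit.params["group"])
```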

Altman, D. G. (1985). Comparability of randomised groups. The Statistician, 34, 125-126.

Altman (1985) discourages significance tests of baseline differences and breaks down the illogic as follows. The significance test—the p value—assesses the probability that an observed difference occurred by chance. It summarizes the likelihood of incompatibility of the data from the assumption of no difference. In a randomized controlled trial, however, baseline differences necessarily occur by chance. Hence, “performing a significance test to compare baseline variables is to assess the probability of something having occurred by chance when we know that it did occur by chance. Such a procedure is clearly absurd” (p. 126).
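
A few lines of simulation make Altman's point concrete. In the sketch below (my illustration, not Altman's), both arms are drawn from the same population, as randomization guarantees at baseline, so the baseline t test rejects at exactly its nominal rate.

```python
# Minimal sketch, my illustration: under randomization, baseline
# differences are chance by construction, so a baseline test "finds"
# imbalance at exactly its nominal rate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reps, n = 5000, 50
pvals = np.empty(reps)
for i in range(reps):
    a = rng.normal(0, 1, n)            # baseline scores, arm A
    b = rng.normal(0, 1, n)            # baseline scores, arm B (same population)
    pvals[i] = stats.ttest_ind(a, b).pvalue
print((pvals < 0.05).mean())           # ~0.05: chance "imbalance" at the nominal rate
```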

Altman, D. G. (1998). Adjustment for covariate imbalance. In P. Armitage & T. Colton (Eds.), Encyclopedia of biostatistics (pp. 1000-1005). New York: Wiley.

© Moher and colleagues (2010), in their elaboration of the CONSORT statement, referred to Altman (1998) when they clarified that "adjustment for variables because they differ significantly at baseline is likely to bias the estimated treatment effect" (p. 19).

Altman, D. G. (2005). Adjustment for covariate imbalance. In P. Armitage & T. Colton (Eds.), Encyclopedia of biostatistics (2nd Ed., pp. 1273-1278). New York: Wiley. https://doi.org/​10.1002/​0470011815.b2a01015

Altman (2005) writes, "While randomization eliminates bias, it does not guarantee comparable baseline characteristics of the patients in the different treatment groups in a particular trial. Simple randomization is quite likely to yield some differences, especially in small trials. . . . Baseline balance is not a requirement. Because of the use of randomization, standard methods of analysis (estimation and hypothesis testing) will yield valid results regardless of the distribution of baseline variables."

Altman, D. G., Schulz, K. F., Moher, D., Egger, M., Davidoff, F., Elbourne, D., Gotzsche, P., & Lang, T., for the CONSORT Group (2001). The revised CONSORT statement for reporting randomized trials: Explanation and elaboration. Annals of Internal Medicine, 134(8), 663-694. https://doi.org/​10.7326/​0003-4819-134-8-200104170-00012

Anderson, S., Auquier, A., Hauck, W., Oakes, D., Vandaele, W., & Weisberg, H. (1980). Statistical methods for comparative studies. New York: John Wiley & Sons. ◊

"We note that with only two data points, the repeated measures ANOVA (see Winer, 1971) is mathematically equivalent to the simple gain score" (Anderson et al., 1980, p. 238).

Arndt, S., Cohen, G., Alliger, R. J., Swayze, V. W., II, & Andreasen, N. C. (1991). Problems with ratio and proportion measures of imaged cerebral structures. Psychiatry Research, 40(1), 79-89.

Assmann, S. F., Pocock, S. J., Enos, L. E., & Kasten, L. E. (2000). Subgroup analysis and other (mis)uses of baseline data in clinical trials. Lancet, 355, 1064-1069.

© Assmann, Pocock, Enos, and Kasten (2000) discuss the use of baseline covariates in RCTs and the need to prespecify the analysis plan.

Atchley, W., Gaskins, C., & Anderson, D. (1976). Statistical properties of ratios. I. Empirical results. Systematic Zoology, 25(2), 137-148.

Atinc, G., Simmering, M. J., & Kroll, M. J. (2012). Control variable use and reporting in macro and micro management research. Organizational Research Methods, 15(1), 57-74.

Austin, P. C. (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46(3), 399-424. https://doi.org/​10.1080/​00273171.2011.568786

Becker, T. E. (2005). Potential problems in the statistical control of variables in organizational research: A qualitative analysis with recommendations. Organizational Research Methods, 8, 274-289.

Blance, A., Tu, Y., Baelum, V., & Gilthorpe, M. S. (2007). Statistical issues on the analysis of change in follow-up studies in dental research. Community Dentistry and Oral Epidemiology, 35(6), 412-420.

See abstract at PubMed.

Bloom, H. S., Richburg-Hayes, L., & Black, A. R. (2007). Using covariates to improve precision for studies that randomize schools to evaluate educational interventions. Educational Evaluation and Policy Analysis, 29(1), 30-59. https://doi.org/10.3102/0162373707299550

Bollmer, J., Bethel, J., Garrison-Mogren, R., & Brauen, M. (2007). Using the risk ratio to assess racial/ethnic disproportionality in special education at the school-district level. Journal of Special Education, 41, 186-198. https://doi.org/​10.1177/​00224669070410030401

Bollmer, J., Bethel, J., Munk, T., & Bitterman, A. (2014). Methods for assessing racial/ethnic disproportionality in special education: A technical assistance guide (Revised). Westat, IDEA Data Center: https://ideadata.org/resource-library/knowledge-lab/

Breaugh, J. A. (2006). Rethinking the control of nuisance variables in theory testing. Journal of Business and Psychology, 20, 429-443.

Breaugh, J. A. (2008). Important considerations in using statistical procedures to control for nuisance variables in non-experimental studies. Human Resource Management Review, 18, 282-293.

Breaugh, J. A., & Arnold, J. (2007). Controlling nuisance variables by using a matched-groups design. Organizational Research Methods, 10, 523-541.

Burks, B. S. (1926a). On the inadequacy of the partial and multiple correlation technique (Part I). Journal of Educational Psychology, 17(8), 532-540.

Burks, B. S. (1926b). On the inadequacy of the partial and multiple correlation technique (Part II). Journal of Educational Psychology, 17(9), 625-630.

Burks, B. S., & Kelley, T. L. (1928). Statistical hazards in nature-nurture investigations. In G. M. Whipple (Ed.), The twenty-seventh yearbook of the National Society for the Study of Education: Nature and nurture, Part I, Their influence upon intelligence (pp. 9-38). Bloomington, IL: Public School Publishing Company.

Burr, J. A., & Nesselroade, J. R. (1990). Change measurement. In Von Eye, A. (Ed.), Statistical methods in longitudinal research, Vol 1: Principles and structuring change (pp. 3-35). Boston: Academic Press.

Offers a selection of methods to study change. The authors note that while the 1950s, 60s, and 70s brought about an "enormous amount of discussion and negative critical comment on change scores" (p. 13), at the time of this book's printing, "the psychometric and sociometric literature is gaining momentum toward a reaffirmation of the value of change measurement (see, for example, Liker, Augustyniak, and Duncan, 1985; Maxwell and Howard, 1981; Nesselroade and Bartsch, 1977; Nesselroade and Cable, 1974; Nesselroade, Stigler, and Baltes, 1980; Rogosa, Brandt, and Zimowski, 1982; Rogosa and Willett, 1985)" (p. 13). The authors then review the literature and discuss many approaches to the measurement of change.

Cain, K. C., Kronmal, R. A., & Kosinski, A. S. (1992). Analysing the relationship between change in a risk factor and risk of disease. Statistics in Medicine, 11, 783-797.

Δ

Ceyhan, E., & Goad, C. (2009). A comparison of analysis of covariate-adjusted residuals and analysis of covariance. Communications in Statistics-Simulation and Computation, 38(10), 2019-2038.

See the Ceyhan and Goad (2009) technical report for full text.

Ceyhan, E., & Goad, C. (2009, July 27). A comparison of analysis of covariate-adjusted residuals and analysis of covariance (Technical Report #KU-EC-09-4). Retrieved from the Cornell University Library: https://arxiv.org/abs/0903.4331

"It is demonstrated that the methods on covariate-adjusted residuals are only appropriate in removing the covariate influence when the treatment-specific lines are parallel and treatment-specific covariate means are equal" (abstract).

Chambless, L. E., & Roeback, J. R. (1993). Methods for assessing difference between groups in change when initial measurement is subject to intra-individual variation. Statistics in Medicine, 12, 1213-1237.

The authors discuss the bias that can result from covariate-adjusted between-group differences that rely on observed data.

Chapman, L. J., & Chapman, J. P. (1973). Disordered thought in schizophrenia. East Norwalk, CT: Appleton-Century-Crofts.

Δ Demonstrated that there is no statistical means of accomplishing "control" for covariates with ANCOVA when groups differ on a covariate (see Miller & Chapman, 2001). "The only legitimate use of analysis of covariance is for reducing variability of scores in groups that vary randomly. Its use is invalid for preexisting disparate groups that differ on the variable to be covaried out" (p. 82).

Christenfeld, N. S., Sloan, R. P., Carroll, D., & Greenland, S. (2004). Risk factors, confounding, and the illusion of statistical control. Psychosomatic Medicine, 66(6), 868-875. https://doi.org/​10.1097/​01.psy.0000140008.70959.41

Clason, D. L., & Mundfrom, D. J. (2012). Adjusted means in analysis of covariance: Are they meaningful? Multiple Linear Regression Viewpoints, 38(1), 8-15.

Clason and Mundfrom (2012) demonstrated that when using a pretest score as a covariate in analyzing posttest results, "then the adjustment is comparing entities that not only do not exist, but (probably) cannot exist" (p. 15).

Cleophas, T. J., & Zwinderman, A. H. (2012). Statistics applied to clinical studies (5th ed.). New York: Springer.

Cochran, W. G. (1957). Analysis of covariance: Its nature and uses. Biometrics, 13, 261-281. https://doi.org/​10.2307/​2527916

Δ Cochran (1957) introduces a special issue on analysis of covariance and offers his own treatment of the topic. He argues, for example, that "when the groups differ widely in x, [the inability to remove bias and imprecision of estimates] imply that the interpretation of an adjusted analysis is speculative rather than soundly based" (p. 266). Cochran also points out that "it is important to verify that the treatments have had no effect on [the covariate] x" (p. 264). If treatments can affect the covariates, then "they distort the nature of the treatment effect that is being measured" (p. 264).

Cochran, W., & Rubin, D. (1973). Controlling bias in observational studies: A review. Sankhya: The Indian Journal of Statistics, Series A, 35(4), 417-446.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates. ◊

Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum Associates. ◊

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates. ◊

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design & analysis issues for field settings. Chicago: Rand McNally College Publishing. ◊

Classic text on the design of nonrandomized experiments. See Shadish, Cook, and Campbell (2002) for a new and expanded edition.

Collins, L. M. (1996). Is reliability obsolete? A commentary on 'Are simple gain scores obsolete?' Applied Psychological Measurement, 20, 289-292.

Δ In this paper, Linda Collins comments on Williams and Zimmerman (1996) and questions the relevance of the traditional idea of reliability to the measurement of change.

Cox, D. R., & McCullagh, P. (1982). Some aspects of analysis of covariance. Biometrics, 38(3), 541-561.

Cox and McCullagh (1982) discuss, among many other things, Lord's (1967, 1969) paradox. "The paradox is resolved by noting that it is inappropriate to compare males and females at fixed initial weights because this amounts to comparing overweight females with underweight males" (p. 551).

Cribbie, R. A., & Jamieson, J. (2000). Structural equation models and the regression bias for measuring correlates of change. Educational and Psychological Measurement, 60(6), 893-907.

Δ This paper examines a directional bias when measuring correlates of change with ANCOVA and regression. Cribbie and Jamieson's (2000) computer simulation study "shows that properly identified structural equation models are not susceptible to this bias. Neither gain scores (posttest minus pretest) nor structural equation models exhibit the 'regression bias'" (abstract, p. 893). Westfall and Yarkoni (2016) and Zinbarg, Suzuki, Uliaszek, and Lewis (2010) offer reviews of problems caused by correlations between predictors and measurement error. Miller and Chapman (2001) and Fitzmaurice (2001) also raise conceptual challenges with the so-called "control" variables.

Cribbie, R. A., & Jamieson, J. (2004). Decreases in posttest variance and the measurement of change. Methods of Psychological Research, 9(1), 37-55.

Cronbach, L. J., & Furby, L. (1970). How we should measure 'change'-or should we? Psychological Bulletin, 74(1), 68-80.

® The authors discourage the use of gain scores.

Culpepper, S. A. (2014). The reliability of linear gain scores as measures of student growth at the classroom level in the presence of measurement bias and student tracking. Applied Psychological Measurement, 38(7), 503-517. https://doi.org/​10.1177/​0146621614534763

Culpepper, S. A., & Aguinis, H. (2011). Using analysis of covariance (ANCOVA) with fallible covariates. Psychological Methods, 16(2), 166-178. https://doi.org/​10.1037/​a0023355

Dimitrov, D. M., & Rumrill Jr., P. D. (2003). Pretest-posttest designs and measurement of change. Work: A Journal of Prevention, Assessment and Rehabilitation, 20(2), 159-165.

® Dimitrov and Rumrill (2003) present an interesting but potentially flawed review of analyses for pre-post designs. The authors do not seem to have surveyed the range of issues on ANCOVA and gain scores already presented in the literature. They recommend ANCOVA for pretest-posttest data but then suggest that "ANOVA on gain scores is also useful, whereas … repeated measures ANOVA with pretest-posttest data should be avoided" (p. 164). ANOVA on gain scores and repeated measures ANOVA, however, are numerically identical approaches (Anderson et al., 1980, p. 238), so the two recommendations contradict each other.

Egbewale, B. E., Lewis, M., & Sim, J. (2014). Bias, precision and statistical power of analysis of covariance in the analysis of randomized trials with baseline imbalance: A simulation study. BMC Medical Research Methodology, 14(49), 1-12.

® Egbewale et al. (2014) present results "based on analyses whose assumptions were optimally satisfied through the simulation process" (p. 10). The authors did not appear to consider (or even mention) measurement error in covariates. Given that ANCOVA assumes perfect measurement of covariates, this may be a key oversight and may ultimately imply that the results have little relevance to applied research. For example, Kisbu-Sakarya et al. (2013) report that, "in case of baseline imbalance, ANCOVA and residual change score methods produced large Type I error rates when reliability was less than perfect" (p. 58; see also Allison, 1990; Oakes & Feldman, 2001; Trochim, 2000; Westfall & Yarkoni, 2016; Zinbarg et al., 2010).

Edwards, J. R. (1994). Regression analysis as an alternative to difference scores. Journal of Management, 20(3), 683-689.

® Argues against gain scores based on reliability and validity issues, which are no longer considered problems, and discusses the challenges with Tisak and Smith (no year given).

Edwards, J. R. (2001). Ten difference score myths. Organizational Research Methods, 4(3), 265-287.

® Although Edwards (2001) attempts to address myths of difference scores, he perpetuates many of them. For example, although many authors have declared that reliability is not a reason to avoid the difference score (e.g., Rogosa, 1988; Willett, 1988; Zimmerman & Williams, 1982), Edwards argues that they are nonetheless unreliable. He also argues that gain scores increase Type I and Type II errors, which is not entirely accurate (Oakes & Feldman, 2001).

Elashoff, J. D. (1969). Analysis of covariance: A delicate instrument. American Educational Research Journal, 6(3), 383-401.

Elashoff (1969) presents the assumptions of analysis of covariance. The assumptions include random assignment, covariates measured without error, no covariate-slope/treatment interactions, and the same variance of the dependent variable for every covariate score and treatment group. She also notes that "A basic postulate underlying the use of analysis of covariance to adjust treatment means for the effects of the covariate x is that the x variable is statistically independent of the treatment effect" (p. 388).

European Medicines Agency. (2015). Guideline on adjustment for baseline covariates in clinical trials. Retrieved from the European Medicines Agency website: http://www.ema.europa.eu/ema/index.jsp?curl=pages/regulation/general/general_content_001217.jsp&mid=

© "This document provides advice on how to address important baseline covariates in designing, analysing and reporting clinical trials. It mainly focuses on confirmatory randomised trials" (EMA website). "Baseline imbalance observed post hoc should not be considered an appropriate reason for including a variable as a covariate in the primary analysis" (p. 3).

Feng, Z., Diehr, P., Peterson, A., & McLerran, D. (2001). Selected statistical issues in group randomized trials. Annual Review of Public Health, 22, 167-187.

Feng and colleagues provide a readable review of the statistical issues in group-randomized trials. In particular, they show that the design and analysis of group-assigned experiments must account for both the estimate of the intraclass correlation, ρ, and its lack of precision. Feng et al. discuss quite a number of important details, such as differences between the generalized linear mixed model (GLMM), generalized estimating equations (GEE), and permutation tests. They also discuss a few issues related to the analysis of change from baseline.

Finch, W. H., & Shim, S. S. (2016). A comparison of methods for estimating relationships in the change between two time points for latent variables. Educational and Psychological Measurement. Advance online publication.

Fitzmaurice, G. (2001). A conundrum in the analysis of change. Nutrition, 17(4), 360-361.

Δ Although statistics instructors frequently recommend ANCOVA for comparisons between nonequivalent groups, whether in quasi-experimental or observational studies, Fitzmaurice (2001) argues that the choice between ANCOVA and the analysis of gain scores depends on the research question. Gain scores answer the most straightforward question about a difference in gains, while ANCOVA answers a conditional question about differences at posttest assuming the same initial values. An analysis of gain scores, Fitzmaurice argues, usually answers the question that interests investigators, whereas ANCOVA would provide an incorrect answer. See Rogosa (1988, Myth 6, p. 189) and Knapp and Schafer (2009) for more discussion about the questions answered by the two approaches. Miller and Chapman (2001) raise additional conceptual challenges with the so-called "control" variables.

Fitzmaurice, G. M., Laird, N. M., & Ware, J. H. (2004). Applied longitudinal analysis. Hoboken, NJ: Wiley. ◊

Δ Fitzmaurice, Laird, and Ware (2004) provide an excellent introduction to longitudinal analyses with extensive detail about each model. The section on adjustment for baseline response (p. 122) describes a number of different approaches, although they state that "the analysis of longitudinal data from a randomized trial is the only setting where we recommend adjustment for baseline through analysis of covariance" (p. 124).

Fitzmaurice, G. M., Laird, N. M., & Ware, J. H. (2011). Applied longitudinal analysis (2nd ed.). Wiley.

Δ Fitzmaurice, Laird, and Ware (2011) update their excellent book. This e-book has no page numbers (see hard copy for quotes) but the comments on their 2004 first-edition book apply. They discuss the adjustment for baseline response on page 124 of the PDF.

See the errata for typos, errors, and omissions and the website for the 2nd edition for sample SAS programs and other content.

Fleiss, J. L. (1986). The design and analysis of clinical experiments. New York: John Wiley & Sons. ◊

Fleiss, J. L., & Tanur, J. M. (1973). The analysis of covariance in psychopathology. In M. Hammer, K. Salzinger, & S. Sutton (Eds.), Psychopathology: Contributions from the social, behavioral, and biological sciences (pp. 509-527). New York: Wiley.

Demonstrated that there is no statistical means of accomplishing "control" for covariates with ANCOVA when groups differ on a covariate (see Miller & Chapman, 2001).

Freedman, D. A. (1997). From association to causation via regression. Advances in Applied Mathematics, 18, 59-110.

Freedman, D. A. (2008). On regression adjustments to experimental data. Advances in Applied Mathematics, 40(2), 180-193. https://doi.org/​10.1016/​j.aam.2006.12.003

Freedman argues against regression adjustments to experimental data, especially in small samples. "Regression estimates are generally biased, but the bias is small with large samples. Adjustment may improve precision, or make precision worse; standard errors computed according to usual procedures may overstate the precision, or understate, by quite large factors" (abstract, p. 180). See Liu (2012) for a brief summary of Freedman (2008); see Lin (2013) for comments on and extension of Freedman's argument.

Galton, F. (1886). Regression towards mediocrity in hereditary stature. The Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246-263.

Ganju, J. (2004). Some unexamined aspects of analysis of covariance in pretest-posttest studies. Biometrics, 60(3), 829-833. https://doi.org/10.1111/j.0006-341X.2004.00235.x

Glass, G. V., Peckham, P. D., & Sanders, J. R. (1972). Consequences of failure to meet assumptions underlying the fixed effects analyses of variance and covariance. Review of Educational Research, 42(3), 237.

Glynn, R. J., Schneeweiss, S., & Stürmer, T. (2006). Indications for propensity scores and review of their use in pharmacoepidemiology. Basic & Clinical Pharmacology & Toxicology, 98(3), 253-259. https://doi.org/​10.1111/​j.1742-7843.2006.pto_293.x

Gochyyev, P. (2015). Essays in psychometrics and behavioral statistics (Doctoral dissertation, University of California, Berkeley). Retrieved from https://escholarship.org/

Gochyyev, P. (2017, October 17). Evaluating the treatment effect in the ADM study and Lord's paradox. Paper presented at the University of California, Berkeley, at the Berkeley Evaluation and Assessment Research (BEAR) Center, Berkeley, CA.

Gochyyev, P. (2019, October 31). Lord's paradox and consequences for effects of interventions on outcomes. Paper presented at the University of Oregon College of Education, Eugene, OR.

See https://​media.uoregon.edu/channel/​archives/​13586 for a recording of Dr. Perman Gochyyev's presentation.

Gollwitzer, M., Christ, O., & Lemmer, G. (2014). Individual differences make a difference: On the use and the psychometric properties of difference scores in social psychology. European Journal of Social Psychology, 44(7), 673-682. https://doi.org/​10.1002/​ejsp.2042

Δ Gollwitzer, Christ, and Lemmer (2014) argue that difference score models can be useful and should be used more often. The authors walk through the assumptions associated with the reliability of the gain score and describe a latent difference score model.

Gordon, R. (1968). Issues in multiple regression. American Journal of Sociology, 73(5), 592-616.

Gottman, J. M., & Rushe, R. H. (1993). The analysis of change: Issues, fallacies, and new ideas. Journal of Consulting and Clinical Psychology, 61(6), 907-910.

Δ The authors discuss Rogosa's (1988) chapter on myths about longitudinal research and review each myth. Most of the paper, however, focuses on growth models rather than the change-score/ANCOVA dilemma.

Greenland, S., & Robins, J. M. (2009). Identifiability, exchangeability and confounding revisited. Epidemiologic Perspectives and Innovations, 6(4). https://doi.org/10.1186/1742-5573-6-4 [Retrieved from https://www.biomedcentral.com/]

Greenland and Robins (2009) review a paper they wrote more than 20 years earlier, "Identifiability, Exchangeability and Epidemiological Confounding," and discuss challenges in the literature and subsequent advances. For example, "many researchers" have treated causal intermediates (causes of disease affected by exposure) as confounders, a practice that "adjusts away part of the very effect under study and can induce selection bias" (p. 2; see Dishion, Kavanagh, Schneiger, Nelson, & Kaufman, 2002, for an example). Greenland and Robins cover additional topics associated with confounding and causal inference, such as assumptions of ignorability and how randomization does not guarantee ignorability or the absence of confounding.

Griliches, Z. (1972). Cost allocation in railroad regulation. The Bell Journal of Economics and Management Science, 3(1), 26-41.

Griliches (1972) presents "a careful analysis of the consequences of using ratios by researchers" (Lev & Sunder, 1979).

Gupta, J. K., Srivastava, A. B. L., & Sharma, K. K. (1988). On the optimum predictive potential of change measures. Journal of Experimental Education, 56, 124-128.

The authors discuss the conditions under which the validity of gain scores is optimum.

Hancock, G. R., Stapleton, L. M., & Mueller, R. O. (Eds.). (2019). The reviewer’s guide to quantitative methods in the social sciences (2nd ed.). Routledge.

He, H., Hu, J., & He, J. (2016). Overview of propensity score methods. In H. He, P. Wu, & D.-G. Chen (Eds.), Statistical causal inferences and their applications in public health research (pp. 29-48). Cham: Springer.

Hendrix, L. J., Carter, M. W., & Hintze, J. L. (1979). A comparison of five statistical methods for analyzing pretest-posttest designs. Journal of Experimental Education, 47(2), 96-102.

® The authors examine several methods of analysis of change and recommend the analysis of gain scores with pretests as a covariate, even though they show that it is equivalent to the analysis of posttest data with pretests as a covariate.

Holland, P. W., & Rubin, D. B. (1982). On Lord's Paradox (Technical Report No. 82-34). Princeton, NJ: Educational Testing Service. Retrieved from the Wiley Online Library: http://onlinelibrary.wiley.com/journal/10.1002/(ISSN)2330-8516

Δ Holland and Rubin (1982) conclude that "the blind use of complicated statistical procedures, like analysis of covariance, is doomed to lead to absurd conclusions" (p. 30). That said, Holland and Rubin argue that analysis of covariance can provide valuable answers in certain situations but that causal statements must be made explicit, ideally through the use of mathematics, rather than in natural language, which can be "vague and potentially misleading" (p. 30).

Holland, P. W., & Rubin, D. B. (1983). On Lord's Paradox. In H. Wainer & S. Messick (Eds.), Principles of modern psychological measurement (pp. 3-35). Hillsdale, NJ: Lawrence Erlbaum.

Hosp, J. L., Therrien, W. J., Fien, H., Smolkowski, K., & Chaparro, E. (2011, February). Universal screening: Intervention effects on classification accuracy. Symposium conducted at the National Association of School Psychologists 2011 Annual Convention, San Francisco, CA.

Hosp and colleagues (2011) demonstrated that the pre-post correlation was .60 (r² = .36) in the control condition but just .31 (r² = .10) in the intervention condition. The pretest variable thus explained more than three times as much variation in the posttest in the control condition as in the treatment condition. This violation of the assumption of parallel regression slopes across treatment conditions (Ceyhan & Goad, 2009; Huitema, 2011) may produce biased results from ANCOVA.
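
The parallel-slopes assumption is straightforward to check by adding a pretest-by-condition interaction to the ANCOVA model. The sketch below is my illustration, with simulated data patterned loosely on the correlations Hosp et al. report; all parameter values are invented.

```python
# Minimal sketch, my illustration: simulate a weaker pre-post slope under
# intervention and test ANCOVA's parallel-slopes assumption.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 300
cond = rng.integers(0, 2, n)                   # 0 = control, 1 = intervention
pre = rng.normal(0, 1, n)
slope = np.where(cond == 1, 0.31, 0.60)        # condition-specific slopes
post = slope * pre + 0.3 * cond + rng.normal(0, 1, n)

d = pd.DataFrame({"pre": pre, "post": post, "cond": cond})
check = smf.ols("post ~ pre * cond", data=d).fit()
# A small p value for the interaction flags nonparallel slopes,
# putting the usual ANCOVA adjustment in doubt.
print(check.pvalues["pre:cond"])
```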

Hu, J., Leite, W. L., & Gao, M. (2017). An evaluation of the use of covariates to assist in class enumeration in linear growth mixture modeling. Behavior Research Methods, 49(3), 1179-1190.

Huck, S. W., & McLean, R. A. (1975). Using a repeated measures ANOVA to analyze the data from a pretest-posttest design: A potentially confusing task. Psychological Bulletin, 82(4), 511-518.

® Huck and McLean show that repeated measures ANOVA is equivalent to the analysis of gain scores but then conclude by recommending covariance analysis.

Huitema, B. E. (1980). The analysis of covariance and alternatives. New York: Wiley.

Huitema, B. E. (2011). The analysis of covariance and alternatives: Statistical methods for experiments, quasi-experiments, and single-case studies (2nd ed.). New York: Wiley.

See chapters 6 and 7 about the incorrect use of ANCOVA with data from quasi-experimental designs, such as those with nonrandomized, nonequivalent, or naturally occurring groups. This may extend to include the comparison of groups within conditions.

Jamieson, J. (1993). The law of initial values: Five factors or two? International Journal of Psychophysiology, 14, 233-239.

Jamieson, J. (1994). Correlates of reactivity: Problems with regression based methods. International Journal of Psychophysiology, 17, 73-78.

Δ Jamieson (1994) examined the use of gain scores and regression methods with continuous outcomes for identifying relationships with a third variable. Among other results, Jamieson found that "when there was a correlation between the 3rd variable and baseline, regression measures of change yielded a high rate of Type one errors" (abstract). Notably, in situations that increased the power of the regression approach, the rate of Type I errors also increased. Zinbarg, Suzuki, Uliaszek, and Lewis (2010) and Van Breukelen (2006) present similar cautions about associations between a covariate and baseline. Cribbie and Jamieson (2000) and Westfall and Yarkoni (2016) discuss problems caused by measurement error. Miller and Chapman (2001) and Fitzmaurice (2001) also raise conceptual challenges with the so-called "control" variables.

Jamieson, J. (1995). Measurement of change and the law of initial values: A computer simulation study. Educational and Psychological Measurement, 55, 38-46.

Jamieson, J. (1999). Dealing with baseline differences: Two principles and two dilemmas. International Journal of Psychophysiology, 31(2), 155-161.

Δ This is an excellent little paper. Jamieson reviews the biases associated with skewed data for gain-score analyses and nonequivalent groups for ANCOVA. He then considers the ethical dilemma of the researcher. "In each case there is a guideline for the correct action: transform skewed data, avoid ANCOVA with real baseline differences. However, in each case there is also an apparently acceptable alternative action (not transforming; using ANCOVA) that might be chosen in order to maximize power by capitalizing on the directional bias and Type I error. The purpose of this note is to draw attention to this directional bias and the importance of clear guidelines for dealing with baseline differences" (p. 160).

Jamieson, J. (2004). Analysis of covariance (ANCOVA) with difference scores. International Journal of Psychophysiology, 52(3), 277-283. https://doi.org/​10.1016/​j.ijpsycho.2003.12.009

Jamieson (2004) shows that ANCOVA models that regress posttest on pretest are identical to those that regress gain scores on pretest. See also Fitzmaurice, Laird, and Ware (2004, p. 122) for a discussion of this and related issues.
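
Jamieson's identity takes only a few lines to verify: regressing the gain on pretest and condition returns exactly the same treatment estimate as regressing the posttest on pretest and condition, with the two pretest slopes differing by exactly 1. A minimal sketch of my own (simulated data, invented parameter values):

```python
# Minimal sketch, my illustration: identical treatment estimates whether
# the outcome is the posttest or the gain, once the pretest is covaried.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 120
group = rng.integers(0, 2, n)
pre = rng.normal(0, 1, n)
post = 0.6 * pre + 0.4 * group + rng.normal(0, 1, n)

d = pd.DataFrame({"group": group, "pre": pre, "post": post,
                  "gain": post - pre})
m_post = smf.ols("post ~ pre + group", data=d).fit()
m_gain = smf.ols("gain ~ pre + group", data=d).fit()
print(m_post.params["group"], m_gain.params["group"])  # identical
print(m_post.params["pre"] - m_gain.params["pre"])     # exactly 1.0
```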

Jamieson, J., & Howk, S. (1992). The law of initial values: A four factor theory. International Journal of Psychophysiology, 12, 53-61.

Judd, C. M., & Kenny, D. A. (1981). Estimating the effect of social interventions. New York: Cambridge University Press. ◊

Judd and Kenny (1981) offer a readable introduction to research design and analysis with chapters on randomized trials, regression discontinuity designs, nonequivalent group designs, and other quasi-experimental studies. See Chapters 4, 6, and 8 for discussions of ANCOVA and pre-post change scores for randomized experiments, nonequivalent group designs, and design issues. See, in particular, the four potential problems with using ANCOVA to improve power on page 59.

Kahan, B., Jairath, V., Doré, C., & Morris, T. (2014). The risks and rewards of covariate adjustment in randomized trials: An assessment of 12 outcomes from 8 studies. Trials, 15(1), 139.

©

Kahneman, D. (1965). Control of spurious association and the reliability of the controlled variable. Psychological Bulletin, 64(5), 326-329.

Kahneman (1965) writes (abstract), "The techniques of matched groups, analysis of covariance, and partial correlation represent various approaches to the prevention of a spurious association between X1 and X2 due to a confounding variable, X3. In all these techniques the use of an unreliable measure for X3 leads to a systematic bias. . . . Groups should be matched on true scores rather than observed scores, but no correction is possible for the factorial design in which groups are formed on the basis of unreliable correlated measures." See also Meehl's (1970) excellent paper, in which he comments on Kahneman (1965).

Karpen, S. (2017). Misuses of regression and ANCOVA in educational research. American Journal of Pharmaceutical Education, 81(8), 84-85 (Article 6501).

Kempthorne, O. (1957). An introduction to genetic statistics. New York: Wiley.

From Meehl (1970): "In discussing 'adjustments' (for nuisance variables) Kempthorne warns (p. 284) that 'the adjustment of data should be based on knowledge of how the factor which is being adjusted for actually produces its effect. An arbitrarily chosen adjustment formula may produce bias rather than remove the systematic difference. It is this fact which tends to vitiate the uses of the analysis of covariance recommended in most books on the analysis of experiments'" (p. 6).

Kendall, M. G. (1946). The advanced theory of statistics, Volume II. London: Charles Griffin.

Kendall (1946) stated that "we would emphasize that the analysis of variance [and covariance], like other statistical techniques, is not a mill which will grind out results automatically without care or forethought on the part of the operator. It is a rather delicate instrument which can be called into play when precision is needed, but requires skill as well as enthusiasm to apply to the best advantage. The reader who roves among the literature of the subject will sometimes find elaborate analyses applied to data in order to prove something which was almost obvious from careful inspection right from the start; or he will find results stated without qualification as 'significant' without any attempt at critical appreciation" (quote taken from Elashoff, 1969).

Kendall, P. L., & Lazarsfeld, P. F. (1950). Problems of survey analysis. In R. K. Merton & P. F. Lazarsfeld (Eds.), Continuities in Social Research: Studies in the scope and method of "The American Soldier" (pp. 133-196). The Free Press.

® Kendall and Lazarsfeld (1950) formalized the logic of survey analysis. They have, however, been credited with beginning the widespread use of control variables as "a hallmark of sophisticated research" (Gordon, 1968, p. 592).

Kenny, D. A. (1975). A quasi-experimental approach to assessing treatment effects in the nonequivalent control group design. Psychological Bulletin, 82, 345-362.

Δ Selection based on pretest (e.g., regression discontinuity) requires an ANCOVA model, while selection based on group differences or selection midway between pretest and posttest would require an analysis of gain scores.

Kenny, D. A., & Zautra, A. (2001). Trait-state models for longitudinal data. In A. Sayer & L. M. Collins (Eds.), New methods for the analysis of change (pp. 243-263). Washington, DC: American Psychological Association.

Kenward, M., White, I., & Carpenter, J. (2010). Should baseline be a covariate or dependent variable in analyses of change from baseline in clinical trials? by G. F. Liu, K. Lu, R. Mogg, M. Mallick and D. V. Mehrotra, Statistics in Medicine 2009; 28:2509-2530. Statistics in Medicine, 29(13), 1455-1456.

Kenward, White, and Carpenter (2010) comment on Liu, Lu, Mogg, Mallick, and Mehrotra (2009); the authors reply in Liu et al. (2010). Kenward et al. believe the recommendation by Liu et al. (2009) to use covariates as a dependent variable was flawed for at least two key reasons.

Keppel, G., & Zedeck, S. (1989). Data analysis for research designs: Analysis of variance and multiple regression/correlation approaches. New York: W. H. Freeman and Company.

Δ ANCOVA "depends on the assumption that individuals have been randomly assigned to conditions. If this assumption is not met, any adjustment in treatment means cannot be defended or justified statistically" (Keppel & Zedeck, 1989, p. 481).

Kim, D., Pieper, C., Ahmed, A., & Colón-Emeric, C. (2016). Use and interpretation of propensity scores in aging research: A guide for clinical researchers. Journal of the American Geriatrics Society, 64(10), 2065-2073. https://doi.org/​10.1111/​jgs.14253

Kim, Pieper, Ahmed, and Colón-Emeric (2016) review four common methods that use propensity scores: matching, weighting, stratification, and covariate adjustment. For each, they explain the procedure and review best practices and caveats.

King, B. M. (2010). Analysis of variance. In P. Peterson, E. Baker, & B. McGaw (Eds.), International Encyclopedia of Education (3rd Ed., pp. 32-36). Oxford: Elsevier.

King, G. (2010). A hard unsolved problem? Post-treatment bias in big social science questions. Paper presented at the Hard Problems in Social Science Symposium, Harvard University, Cambridge, MA. Retrieved from http://www.slideshare.net/HardProblemsSS/

Δ As an example, King (2010) suggests that in an analysis of the causal effects of race on salary, covariates associated with qualifications would be valuable, but individuals’ position in the firm would confound the analysis due to its potential association with both race and salary. If position were included in the analysis, it would make the race-salary effect statistically independent of the race-position and position-salary effects, which would remove (some of) the effect that the original analysis intended to estimate.

King, L. A., King, D. W., McArdle, J. J., Saxe, G. N., Doron-LaMarca, S., & Orazem, R. J. (2006). Latent difference score approach to longitudinal trauma research. Journal of Traumatic Stress, 6, 771-785. https://doi.org/​10.1002/​jts.20188

Kisbu-Sakarya, Y., MacKinnon, D., & Aiken, L. (2013). A Monte Carlo comparison study of the power of the analysis of covariance, simple difference, and residual change scores in testing two-wave data. Educational and Psychological Measurement, 73(1), 47-62.

Kisbu-Sakarya, MacKinnon, and Aiken (2013) provide a nice illustration of the differences in Type I error rates and power between an analysis of gain scores and ANCOVA in the presence of measurement error. One key finding was that, "in case of baseline imbalance, ANCOVA and residual change score methods produced large Type I error rates when reliability was less than perfect. . . . On the other hand, Type I error rates for the difference score method were not influenced by baseline imbalance or reliability" (p. 58). Baseline imbalance with a fallible pretest occurs quite frequently; indeed, it is usually the scenario where the WWC (inappropriately) requires ANCOVA! Kisbu-Sakarya et al. also point to a few challenges in Petscher and Schatschneider (2011).
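
A small simulation in the spirit of their design (a loose sketch of my own, not the authors' code, with parameter values I chose) reproduces the pattern: under a true null with baseline imbalance and a fallible pretest, ANCOVA rejects far more often than its nominal 5% rate while the difference-score test stays near nominal.

```python
# Minimal sketch, loosely patterned on the study's setup: true null,
# baseline imbalance, and pretest reliability of about .60.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
reps, n, rel = 500, 200, 0.6
anc_err = diff_err = 0
for _ in range(reps):
    group = np.repeat([0, 1], n // 2)
    true = rng.normal(0.8 * group, 1, n)       # baseline imbalance, no treatment effect
    pre = true + rng.normal(0, np.sqrt((1 - rel) / rel), n)  # fallible pretest
    post = true + rng.normal(0, 1, n)          # posttest tracks the true score only
    d = pd.DataFrame({"group": group, "pre": pre, "post": post,
                      "gain": post - pre})
    anc_err += smf.ols("post ~ pre + group", data=d).fit().pvalues["group"] < .05
    diff_err += smf.ols("gain ~ group", data=d).fit().pvalues["group"] < .05
print(anc_err / reps, diff_err / reps)  # ANCOVA well above .05; gains near .05
```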

Knapp, T. R., & Schafer, W. D. (2009). From gain score t to ANCOVA F (and vice versa). Practical Assessment, Research & Evaluation, 14, 6. Retrieved from http://pareonline.net/

Knapp and Schafer (2009) develop a method to convert gain-score estimates to ANCOVA estimates and vice versa. In their review of the two, they note that the approaches answer different questions: "There is an important difference between the research question that is implied by the use of the t test and the research question that underlies the use of ANCOVA. For the former, the question is: 'What is the effect of the treatment on the change from pretest to posttest?' For the latter the question is: 'What is the effect of the treatment on the posttest that is not predictable from the pretest (i.e., conditional on the pretest)?'" (p. 2). See Fitzmaurice (2001) and Rogosa (1988, Myth 6, p. 189) for other discussions of the questions answered by the two approaches.

Lancaster, T. (2000). The incidental parameter problem since 1948. Journal of Econometrics, 95, 391-413.

Langbein, L. (2012). Public program evaluation: A statistical guide. New York: Taylor & Francis.

Langbein (2012) writes "Nonexperiments (NEs) include any design that uses statistical controls. Any randomized field experiment (RFE) that uses statistical controls for pretest scores (to improve internal or statistical validity) is partly a nonexperiment [emphasis added], though at its core it remains a randomized experiment. Most of the quasi experiments (QEs) discussed in Chapter 5 also use statistical controls, although, at their core, they use elements of matching comparable groups at one point in time or matching the same group or groups before and after a program change. In these two cases, the line between the NE and the RFE and the line between the NE and the QE is not clear. One would describe these as mixed designs: part RFE and part NE or part QE and part NE" (p. 143).

Larzelere, R. E., Kuhn, B. R., & Johnson, B. (2004). The intervention selection bias: An underrecognized confound in intervention research. Psychological Bulletin, 130(2), 289-303.

Lesaffre, E., & Senn, S. (2003). A note on non-parametric ANCOVA for covariate adjustment in randomized clinical trials. Statistics in Medicine, 22, 3583-3596.

"There is considerable debate over covariate adjusted analysis in the biostatistics literature (Lesaffre and Senn 2003)" (Liu, 2012, p. 629).

Lev, B., & Sunder, S. (1979). Methodological issues in the use of financial ratios. Journal of Accounting and Economics, 1(3), 187-210.

Liker, J. K., Augustyniak, S., & Duncan, G. J. (1985). Panel data and models of change: A comparison of first differences and conventional two-wave models. Social Science Research, 14, 80-101.

Linden, A., Trochim, W. M. K., & Adams, J. L. (2006). Evaluating program effectiveness using the regression point displacement design. Evaluation and The Health Professions, 29(4), 407-423.

Lin, H. M., & Hughes, M. D. (1997). Adjusting for regression toward the mean when variables are normally distributed. Statistical Methods in Medical Research, 6(2), 129-146.

Lin, W. (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman's critique. The Annals of Applied Statistics, 7(1), 295-318.

Lin (2013) comments on Freedman's (2008) argument.

Linn, R. L., & Slinde, J. A. (1977). The determination of the significance of change between pre- and post-testing periods. Review of Educational Research, 47, 121-150.

® Abstract: "Reviews and discusses the major measures of change (i.e., difference scores, residual scores, and estimated true change), correlates of change, and the issues and problems associated with each measure. The concerns involved in inferring treatment effects from group differences are discussed, as well as accountability systems based on student achievement. It is suggested that regression analyses be used where appropriate. These treat the pretest no differently from other independent variables (or predictors), and the posttest as the dependent variable, thus avoiding many of the difficulties that are introduced by other methods." See Rogosa (1988, p. 179).

Liu, G. F., Lu, K., Mogg, R., Mallick, M., & Mehrotra, D. V. (2009). Should baseline be a covariate or dependent variable in analyses of change from baseline in clinical trials? Statistics in Medicine, 28(20), 2509-2530.

See comment by Kenward, White, and Carpenter (2010) and reply by Liu, Lu, Mogg, Mallick, and Mehrotra (2010).

Liu, G. F., Lu, K., Mogg, R., Mallick, M., & Mehrotra, D. V. (2010). Should baseline be a covariate or dependent variable in analyses of change from baseline in clinical trials? Reply. Statistics In Medicine, 29(13), 1457.

See comment by Kenward, White, and Carpenter (2010) on the original paper by Liu, Lu, Mogg, Mallick, and Mehrotra (2009).

Liu, X. S. (2012). Covariate imbalance and precision in measuring treatment effects. Evaluation Review, 35(6), 627-641.

Liu (2012) reviews the assumptions of covariate adjustment, which typically reduces unexplained variance from the error term and improves the precision of estimates. Liu demonstrates that "chance covariate imbalance may actually hurt precision in small pilot studies with limited sample sizes" (p. 639). "The chance covariate imbalance between the contrasted treatment conditions must not exceed a certain upper bound before covariate adjustment can lead to any gain in precision" (p. 640).

Llabre, M. M., Spitzer, S. S., Saab, P. G., Ironson, G. H., & Schneiderman, N. (1991). The reliability and specificity of delta versus residualized change as measures of cardiovascular reactivity to behavioral challenges. Psychophysiology, 28, 701-711.

Llewelyn, H. (2019). Replacing P-values with frequentist posterior probabilities of replication—When possible parameter values must have uniform marginal prior probabilities. PloS one, 14(2), e0212302. https://doi.org/​10.1371/​journal.pone.0212302

Loevinger, J. (1943). On the proportional contributions of differences in nature and in nurture to differences in intelligence. Psychological Bulletin, 40(10), 725-756.

Longford, N. T. (2010). Analysis of covariance. In P. Peterson, E. Baker, & B. McGaw (Eds.), International Encyclopedia of Education (3rd Ed., pp. 18-24). Oxford: Elsevier.

Lord, F. M. (1956). The measurement of growth. Educational and Psychological Measurement, 16, 421-437.

Lord, F. M. (1967). A paradox in the interpretation of group comparisons. Psychological Bulletin, 68(5), 304-305.

Lord (1967) presents a hypothetical, illustrative example where two statisticians analyzed pretest and posttest data, one with an analysis of gain scores and the other with an analysis of covariance, and came to strikingly different conclusions. In Lord's example, the statisticians compared males and females on the weight they gained from a college cafeteria diet. The analysis of gain scores showed no average difference between males and females in weight gained. The analysis of covariance showed that males gained considerably more weight than females. "The explanation is that with the data usually available for such studies, there simply is no logical or statistical procedure that can be counted on to make proper allowances for uncontrolled preexisting differences between groups" (p. 305).
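
Lord's setup is easy to reproduce. In the sketch below (my reconstruction with invented numbers, not Lord's data), each group's June weight regresses toward its own group mean, so mean gains are zero in both groups, yet ANCOVA reports a sizable "effect" of sex.

```python
# Minimal sketch, my reconstruction of Lord's hypothetical: each group's
# June weight regresses halfway toward its own group mean, so neither
# group gains on average.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 500
male = np.repeat([0, 1], n // 2)
group_mean = np.where(male == 1, 160.0, 130.0)   # males heavier at baseline
pre = rng.normal(group_mean, 15, n)              # September weight
post = group_mean + 0.5 * (pre - group_mean) + rng.normal(0, 10, n)  # June weight

d = pd.DataFrame({"male": male, "pre": pre, "post": post,
                  "gain": post - pre})
print(smf.ols("gain ~ male", data=d).fit().params["male"])        # ~0: equal gains
print(smf.ols("post ~ pre + male", data=d).fit().params["male"])  # ~15: ANCOVA "effect"
```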

MacCallum, R. C., Kim, C., Malarkey, W. B., & Kiecolt-Glaser, J. K. (1997). Studying multivariate change using multilevel models and latent curve models. Multivariate Behavioral Research, 32, 215-253.

MacKinnon, D. P., Krull, J. L., & Lockwood, C. M. (2000). Equivalence of the mediation, confounding, and suppression effect. Prevention Science, 1(4), 173-181.

MacKinnon, Krull, and Lockwood (2000) demonstrated "that mediation, confounding, and suppression models are statistically equivalent" (p. 180), which emphasized the point that "for any given set of data, there may be a number of different possible models that fit the data equally well" (p. 180).

Maris, E. (1998). Covariance adjustment versus gain scores—revisited. Psychological Methods, 3, 309-327.

Δ Maris provides a nice description of the differences between the gain-score estimator and the covariance-adjusted estimator. Maris supports the use of the covariance-adjusted estimator in situations where participants are assigned (a) randomly or (b) based on a pretest score, such as in a regression discontinuity design. Gain scores are unbiased under random assignment and, so long as assignment does not rest on the pretest, may or may not be biased in other situations.

Maxwell, S. E., & Delaney, H. D. (1990). Designing experiments and analyzing data: A model comparison perspective. Pacific Grove, CA: Brooks/Cole.

Maxwell, S. E., Delaney, H. D., & Dill, C. A. (1984). Another look at ANCOVA versus blocking. Psychological Bulletin, 95(1), 136-147.

Maxwell, S. E., Delaney, H. D., & Kelley, K. (2018). Designing experiments and analyzing data: A model comparison perspective (3rd ed.). Routledge. https://doi.org/​10.4324/​9781315642956

Maxwell, S. E., Delaney, H. D., & Manheimer, J. M. (1985). ANOVA of residuals and ANCOVA: Correcting an illusion by using model comparisons and graphs. Journal of Educational Statistics, 10(3), 197-209. https://doi.org/​10.2307/​1164792

Maxwell, Delaney, and Manheimer (1985) discuss the assumption that ANCOVA is equivalent to an analysis of residuals and show that the two are quite different. "In sum, although the concept of a residual score can be a useful pedagogical tool for explaining the logic of ANCOVA, it has typically not been utilized accurately" (p. 208).
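
The difference is easy to demonstrate: residualizing the posttest on the pretest while ignoring group uses the pooled slope rather than the within-group slope, so the two-step group estimate is attenuated whenever groups differ on the covariate. A minimal sketch of my own (not from Maxwell et al.; simulated data, invented parameter values):

```python
# Minimal sketch, my illustration: "ANOVA on residuals" uses the pooled
# slope and is not the same analysis as ANCOVA.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 120
group = rng.integers(0, 2, n)
pre = rng.normal(0.4 * group, 1, n)              # covariate related to group
post = 0.5 * pre + 0.4 * group + rng.normal(0, 1, n)
d = pd.DataFrame({"group": group, "pre": pre, "post": post})

d["resid"] = smf.ols("post ~ pre", data=d).fit().resid  # slope ignores group
two_step = smf.ols("resid ~ group", data=d).fit()       # ANOVA on residuals
ancova = smf.ols("post ~ pre + group", data=d).fit()    # proper ANCOVA
print(two_step.params["group"], ancova.params["group"])  # estimates differ
```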

Maxwell, S. E., & Howard, G. S. (1981). Change scores—necessarily anathema? Education and Psychological Measurement, 41, 747-756.

Δ

May, K., & Hittner, J. B. (2010). Reliability and validity of gain scores considered graphically. Perceptual and Motor Skills, 111(2), 399-406. https://doi.org/10.2466/03.PMS.111.5.399-406

See abstract at PubMed.

McArdle, J. J. (2009). Latent variable modeling of differences and changes with longitudinal data. Annual Review of Psychology, 60, 577-605. https://doi.org/​10.1146/​annurev.psych.60.110707.163612

Meehl, P. E. (1970). Nuisance variables and the ex post facto design. In M. Radner & S. Winokur (Eds.), Minnesota studies in the philosophy of science. Vol. IV. Analyses of theories and methods of physics and psychology (pp. 372-402). Minneapolis: University of Minnesota Press.

Meehl (1970) briefly discusses the challenge with unreliability in ANCOVA presented by Kahneman (1965) and then shows how, even assuming perfectly reliable measures, matched groups, ANCOVA, and partial correlations have three core difficulties: systematic unmatching (on other variables), the unrepresentative subpopulation problem, and causal-arrow ambiguity. Each contributes to the ambiguity of results and their interpretation from nonrandomized designs.

Meehl, P. E. (1971a). High school yearbooks: A reply to Schwarz. Journal of Abnormal Psychology, 77(2), 143-148.

Meehl (1971) writes, "the nearly universal tendency among social scientists to view correlations uncorrected for social class as 'spurious' is condemned" (p. 143).

Meehl, P. (1971b). Law and the fireside inductions: Some reflections of a clinical psychologist. Journal of Social Issues, 27(4), 65-100.

See Meehl (1989).

Meehl, P. (1972). Specific genetic etiology, psychodynamics, and therapeutic nihilism. International Journal of Mental Health, 1(1/2), 10-27.

Meehl, P. (1989). Law and the fireside inductions (with postscript): Some reflections of a clinical psychologist. Behavioral Sciences & the Law, 7(4), 521-550.

Mellenbergh, G. J. (1999). A note on simple gain score precision. Applied Psychological Measurement, 23(1), 87-89.

From the abstract: "A distinction is necessary between two concepts of measurement precision. Reliability is population dependent and information is examinee dependent. Both concepts also apply to the simple gain score. Low reliability implies that it is not appropriate to correlate the gain score with variables in a population, and low information implies that no precise statements can be made about an examinee's true gain. However, low reliability does not imply that statements about within-examinee change are necessarily imprecise."

Miller, G. A., & Chapman, J. P. (2001). Misunderstanding analysis of covariance. Journal of Abnormal Psychology, 110(1), 40-48. https://doi.org/​10.1037/​0021-843X.110.1.40

Miller and Chapman (2001) discuss a number of challenges with ANCOVA, especially the significant problems that emerge when comparing groups that differ on a covariate. The problems include the misplaced assumption that the covariate somehow "controls" for differences, a view that the "relevant literature roundly condemns" (p. 42). Another challenge is that when the covariate and grouping variable (e.g., condition) are related, "the grouping variable, its essence, has been altered in some substantive way that is frequently not specifiable in a conceptually meaningful way" (p. 43). This means that the grouping variable can no longer be interpreted as intended. See Westfall and Yarkoni (2016) for a detailed review of problems caused by measurement error.

Moher, D., Hopewell, S., Schulz, K. F., Montori, V., Götzsche, P. C., Devereaux, P. J., Elbourne, E., Egger, M., & Altman, D. G. (2010). CONSORT 2010 explanation and elaboration: Updated guidelines for reporting parallel group randomised trials. British Medical Journal, 340, c869. https://doi.org/​10.1136/​bmj.c869

Moher et al. (2010) wrote, "In RCTs, the decision to adjust should not be determined by whether baseline differences are statistically significant" (p. 14). They go on to say, "adjustment for variables because they differ significantly at baseline is likely to bias the estimated treatment effect" (p. 19).

Mongiardino Koch, N., Soto, I. M., & Ramírez, M. J. (2015). Overcoming problems with the use of ratios as continuous characters for phylogenetic analyses. Zoologica Scripta, 44(5), 463-474. https://doi.org/​10.1111/​zsc.12120

Morgan, S. L., & Winship, C. (2007). Counterfactuals and causal inference: Methods and principles for social research. New York: Cambridge University Press. ◊

When using covariates, Morgan and Winship (2007) advise that "to offer a precise and defendable causal effect estimate, a well-specified theory is needed to justify assumptions about underlying causal relationships" (p. 30).

Morris, S. B. (2008). Estimating effect sizes from pretest-posttest-control group designs. Organizational Research Methods, 11, 364-386.

Murray, D. M. (1998). Design and analysis of group-randomized trials. New York: Oxford University Press. ◊

This book covers nested cross-sectional and nested cohort designs and discusses a range of models, including pre-post designs with analysis of covariance and time x condition (gain-score) analyses.

Murray, D. M. (2001). Statistical models appropriate for designs often used in group-randomized trials. Statistics in Medicine, 20, 1373-1385.

Murray, D. M., Van Horn, L., Hawkins, J. D., & Arthur, M. W. (2006). Analysis strategies for a community trial to reduce adolescent ATOD use: A comparison of random coefficient and ANOVA/ANCOVA models. Contemporary Clinical Trials, 27(2), 188-206.

From the abstract: "We use data from an earlier study that included the Community Youth Development Study (CYDS) communities to estimate power for CYDS intervention effects given several analytic models that might be applied to the multiple baseline and follow-up surveys that define the CYDS cross-sectional design. We compare pre-post mixed-model ANCOVA models against random coefficients models, both in one- and two-stage versions. The two-stage pre-post mixed-model ANCOVA offers the best power for the primary outcomes and will provide adequate power for detection of modest but important intervention effects." Note that the pre-post mixed-model ANCOVAs in this paper have no covariates but instead measure net gains.

Muthén, B., & Jöreskog, K. G. (1983). Selectivity problems in quasi-experimental studies. Evaluation Review, 7(2), 139-174. https://doi.org/​10.1177/​0193841X8300700201

Nesselroade, J. R., & Bartsch, T. W. (1977). Multivariate perspectives on the construct validity of the trait-state distinction. In R. B. Cattell & R. M. Dreger (Eds.), Handbook of modern personality theory. Washington, DC: Hemisphere.

Δ

Nesselroade, J. R., & Cable, D. G. (1974). 'Sometimes, it's okay to factor difference scores'—the separation of trait and state anxiety. Multivariate Behavioral Research, 9, 273-282.

Δ

Nesselroade, J. R., Stigler, S. M., & Baltes, P. B. (1980). Regression toward the mean and the study of change. Psychological Bulletin, 88, 622-637.

Neuroskeptic. (2016, April 2). Statistics: When confounding variables are out of control [Blog post]. Retrieved from Neuroskeptic blog at Discover Magazine online: http://blogs.discovermagazine.com/neuroskeptic/

Neuroskeptic (2016) outlines the problem with control variables, but see Westfall and Yarkoni (2016) for details. See also Miller and Chapman (2001) and Meehl (1970) for additional issues.

Oakes, J. M., & Feldman, H. A. (2001). Statistical power for nonequivalent pretest-posttest designs: The impact of change-score versus ANCOVA models. Evaluation Review, 25(1), 3-28.

Δ Oakes and Feldman (2001) examined the assumptions and statistical power of measures of pre-post change, comparing change-score and ANCOVA models in the process. They conclude with "the principal finding . . . that for a randomized experiment, ANCOVA yields unbiased treatment estimates and typically has superior power to change-score methods, all else equal. However, in the absence of randomization, when baseline differences between groups exist, we follow Allison (1990) and show that change-score models yield less biased estimates (if biased at all). Then, bias aside, we went on to show that the common assumption that ANCOVA models are more powerful rests on the untenable assumption that pretests are measured without error. In the presence of measurement error, change-score models may be equally or even more powerful" (p. 18).

Owen, S. V., & Froman, R. D. (1998). Uses and abuses of the analysis of covariance. Research In Nursing and Health, 21(6), 557-562.

Δ

Pearl, J. (2016, July). Lord's paradox revisited–(Oh Lord! Kumbaya!) (Technical Report R-436). Retrieved from the UCLA Computer Science Department website: http://ftp.cs.ucla.edu/pub/stat_ser/r436.pdf

Pearl, J., Glymour, M., & Jewell, N. P. (2016). Causal inference in statistics: A primer. New York: John Wiley & Sons. ♦¹

Pedhazur, E. J. (1982). Multiple regression in behavioral research (2nd ed.). San Francisco: Holt, Rinehart, & Winston.

Pedhazur "describes a number of actual instances in which ANCOVA was incorrectly applied to nonexperimental designs" (Keppel & Zedeck, 1989, p. 482).

Pedhazur, E. J. (1997). Multiple regression in behavioral research: Explanation and prediction (3rd ed.). Fort Worth, TX: Harcourt Brace.

Pedhazur (1997) presents on pages 170-172 a review of causal assumptions and cautions for control variables.

Petscher, Y. (2010). A simulation study on the performance of the simple difference and covariance adjusted scores in randomized experimental designs. Dissertation Abstracts International: Section B. Sciences and Engineering, 70(8-B), 5209.

Petscher presents an analysis of power for covariance-adjusted scores and simple difference scores in RCTs and concludes that the former is generally more powerful. The dissertation does not discuss the point made by Oakes and Feldman (2001) that for ANCOVA models "the pretest score, X, is assumed to be measured without error (Greene, 1997)" (p. 9), whereas "the change-score model incorporates the reliability of the pretest" (p. 16).

Petscher, Y., & Schatschneider, C. (2011). A simulation study on the performance of the simple difference and covariance-adjusted scores in randomized experimental designs. Journal of Educational Measurement, 48(1), 31-43.

Petscher and Schatschneider (2011) determine, through simulation, that the gain score can be as powerful as the covariate-adjusted score in certain situations. The authors tested the difference between an analysis of gain scores and ANCOVA, setting the reliability of the measures at .80. Imperfect reliability violates one of the key assumptions of ANCOVA (perfect measurement) and may reduce the power of ANCOVA relative to gain scores (Oakes & Feldman, 2001). Kisbu-Sakarya et al. (2013) point to a few challenges in Petscher and Schatschneider (2011). Also, "covariance-adjusted scores," noted in the title, are not equivalent to ANCOVA, the analysis approach used in the paper (Maxwell et al., 1985).

Pocock, S. J., Assmann, S. E., Enos, L. E., & Kasten, L. E. (2002). Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: Current practice and problems. Statistics in Medicine, 21(19), 2917-2930.

© Pocock, Assmann, Enos, and Kasten (2002) discuss reporting challenges. "Key issues include: the overuse and overinterpretation of subgroup analyses; the underuse of appropriate statistical tests for interaction; inconsistencies in the use of covariate-adjustment; the lack of clear guidelines on covariate selection; the overuse of baseline comparisons in some studies; the misuses of significance tests for baseline comparability, and the need for trials to have a predefined statistical analysis plan for all these uses of baseline data" (p. 2917, abstract).

Porter, A. C., & Raudenbush, S. W. (1987). Analysis of covariance: Its model and use in psychological research. Journal of Counseling Psychology, 34, 383-392. https://doi.org/​10.1037/​0022-0167.34.4.383

Rachor, R. E., & Cizek, G. J. (1996). Reliability of raw gain, residual gain, and estimated true gain scores: A simulation study. Paper presented at the 1996 Annual Meeting of the American Educational Research Association, New York, NY.

Reichardt, C. S. (1979). The statistical analysis of data from nonequivalent group designs. In T. D. Cook & D. T. Campbell (Eds.), Quasi-experimentation. Chicago: Rand McNally.

Robins, J. M., & Greenland, S. (1986). The role of model selection in causal inference from nonexperimental data. American Journal of Epidemiology, 123, 392-402.

"Robins and Greenland [1986] noted that confounding can be induced by control of baseline covariates" (Greenland & Robins, 2009, p. 4).

Rogosa, D. (1988). Myths about longitudinal research. In K. W. Schaie, R. T. Campbell, W. Meredith, & S. C. Rawlings (Eds.), Methodological issues in aging research (pp. 171-209). New York: Springer.

Δ Chapter is concerned with methods for the analysis of longitudinal data. It seeks to convey "right thinking" about longitudinal research. Heroes of this chapter are statistical models for collections of individual growth (learning) curves. Myths indicate some of the beliefs that have impeded doing good longitudinal research. In particular, Rogosa dispels the myth that gain score methods are unreliable.

Rogosa, D. R., Brandt, D., & Zimowski, M. (1982). A growth curve approach to the measurement of change. Psychological Bulletin, 92, 726-748.

Δ

Rogosa, D. R., Floden, R., & Willett, J. B. (1984). Assessing the stability of teacher behavior. Journal of Educational Psychology, 76(6), 1000-1027.

Rogosa, D. R., & Willett, J. B. (1983). Demonstrating the reliability of the difference score in the measurement of change. Journal of Educational Measurement, 20, 335-343.

Δ

Rogosa, D. R., & Willett, J. B. (1985). Understanding correlates of change by modeling individual differences in growth. Psychometrika, 50, 203-228.

Δ

Rohrer, J. M. (2018). Thinking clearly about correlations and causation: Graphical causal models for observational data. Advances in Methods and Practices in Psychological Science, 1(1), 27-42. https://doi.org/10.1177/2515245917745629

Rosenbaum, P. R. (2002). Covariance adjustment in randomized experiments and observational studies. Statistical Science, 17(3), 286-327.

Includes comments by Angrist and Imbens, Hill, and Robins, with a rejoinder by Rosenbaum.

Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41-55.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5), 688-701.

Rubin (1974) defines and defends the randomized experiment. He provides a clear explanation of the importance of experimental control, which can be created by randomization for most social science experiments. Rubin also compares the relative value of observational studies to experiments.

With respect to the addition of covariates, Rubin makes clear the need to include only carefully considered variables that lie on the causal pathway to the outcome of interest. "When trying to estimate the typical causal effect in the 2N trial experiment, handling additional variables may not be trivial without a well-developed causal model that will properly adjust for those prior variables that causally affect Y and ignore other variables that do not causally affect Y even if they are highly correlated with the observed values of Y. Without such a model, the investigator must be prepared to ignore some variables he feels cannot causally affect Y and use a somewhat arbitrary model to adjust for those variables he feels are important" (Rubin, 1974, p. 697).

To summarize, "We can never know that all causally relevant prior variables that systematically differ in the E [experimental] and C [control] trials have been controlled" (Rubin, 1974, p. 699). Rubin then cautions, "We must be prepared to ignore irrelevant prior variables even if they systematically differ in E and C trials, or else we can obtain any estimate desired by eventually finding the 'right' irrelevant prior variables." (p. 699).

Rubin, D. B. (1977). Assignment to treatment group on the basis of a covariate. Journal of Educational Statistics, 2(1), 1-26.

Rubin, D. B. (1990). Formal modes of statistical inference for causal effects. Journal of Statistical Planning and Inference, 25, 279-292.

The first six sections provide an interesting overview of causal effects and their defining characteristics. The following eight sections describe several modes of inference.

Rubin, D. B. (1991). Practical implications of models of statistical inference for causal effects and the critical role of random assignment. Biometrics, 47, 1213-1234.

Sadooghi-Alvandi, S. M., & Jafari, A. A. (2013). A parametric bootstrap approach for one-way ANCOVA with unequal variances. Communications in Statistics—Theory and Methods, 42(14), 2473-2498. https://doi.org/10.1080/03610926.2011.625486

Schjoedt, L., & Bird, B. (2014). Control variables: use, misuse and recommended use. In A. Carsrud & M. Brännback (Eds.), Handbook of research methods and applications in entrepreneurship and small business (pp. 136-155). Edward Elgar Publishing. https://doi.org/​10.4337/​9780857935052.00013

Seddon, G. M. (1988). The validity of reliability measures. British Educational Research Journal, 14(1), 89-97.

Senn, S. J. (1989). Covariate imbalance and random allocation in clinical trials. Statistics in Medicine, 8, 467-475.

© Senn (1989) makes the following recommendations for clinical trials: "1. Before the data are collected, relevant prognostic variables should be identified using a priori information on correlation with treatment response and taking into account requirements regarding conditional size and precision (consideration of Figure 1 may be of help here); 2. Other covariates collected 'for the record' should be ignored in the analysis; 3. Do not carry out tests of homogeneity on the covariates; 4. Perform an analysis of covariance using the identified prognostic factors (step 1 above)" (p. 474). See abstract at PubMed.

Senn, S. J. (2005). Baseline adjustment in longitudinal studies. In P. Armitage & T. Colton (Eds). Encyclopedia of Biostatistics, Vol. 1 (2nd ed; pp. 294-297). New York: John Wiley & Sons. https://doi.org/​10.1002/​0470011815.b2a12007

Senn, S. J. (2005). Baseline balance and valid statistical analyses: Common misunderstandings. Applied Clinical Trials, 14, 24-27.

Senn, S., & Schemper, M. (2006). Change from baseline and analysis of covariance revisited. Statistics in Medicine, 25(24), 4334-4344.

Senn and Schemper (2006) appear to reach conclusions opposite those of others on this page: that ANCOVA is perfectly valid when groups differ at baseline. This conclusion is at odds with those of Allison (1990), Judd and Kenny (1981), Oakes and Feldman (2001), and others.

Senn, S. (2013). Seven myths of randomisation in clinical trials. Statistics in Medicine, 32(9), 1439-1450. https://doi.org/​10.1002/​sim.5713

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin. ◊

A new edition of Cook and Campbell (1979), revised to include a broader range of designs, such as randomized experiments and regression discontinuity designs. An excellent text overall. The authors, however, equate "experimental" with "randomized," and use these terms to discriminate between experimental and quasi-experimental. A more useful distinction, and one that conserves terminology, would specify that true experiments are those that can demonstrate a functional relationship between a dependent variable and the independent variable (e.g., an intervention). This definition could include regression discontinuity designs and some single subject designs as true experiments.

Shadish, W. R., & Ragsdale, K. (1996). Random versus nonrandom assignment in controlled experiments: Do you get the same answer? Journal of Consulting and Clinical Psychology, 64(6), 1290-1305.

"It is concluded that studies using nonrandom assignment may produce acceptable approximations to results from randomized experiments under some circumstances but that reliance on results from randomized experiments as the gold standard is still well founded" (abstract). Nonetheless, "a slightly degraded randomized experiment may still produce better effect estimates than many quasi-experiments (Shadish & Ragsdale, 1996)" (Shadish, Cook, & Campbell, 2002, p. 229).

Silberzahn, R., Uhlmann, E. L., Martin, D. P., Anselmi, P., Aust, F., . . . Nosek, B. A. (2018). Many analysts, one data set: Making transparent how variations in analytic choices affect results. Advances in Methods and Practices in Psychological Science, 1(3), 337-356. https://doi.org/10.1177/2515245917747646

Silberzahn et al. (2018): "Twenty-nine teams involving 61 analysts used the same data set to address the same research question" (abstract, p. 338).

Simon, H. A., & Rescher, N. (1966). Cause and counterfactual. Philosophy of Science, 33(4), 323-340.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366. https://doi.org/10.1177/0956797611417632

Simmons, Nelson, and Simonsohn (2011) discuss the inflated Type I error rates—or equivalently confidence intervals that erroneously exclude zero—that occur when investigators (a) choose among a large number of DVs, (b) add more cases, (c) explore tests with various covariates or moderators, (d) ignore a condition in a multi-condition study, or (e) combine scenarios. Simply controlling for gender and examining its interaction with condition can increase the false-positive rate from 5% to 12%. Adding a DV can inflate the Type I errors from 5% to 9.5%. These types of (frequently atheoretical) manipulations are very common.
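
A concrete illustration may help. The minimal simulation below is only a sketch; the variable names and parameter values are illustrative assumptions, not Simmons et al.'s code. It tests a null condition effect three ways (unadjusted, adjusted for gender, and via the gender interaction) and counts a "finding" whenever any test succeeds:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n, reps, hits = 40, 2000, 0

    for _ in range(reps):
        df = pd.DataFrame({
            "cond": np.repeat([0, 1], n // 2),   # condition with no true effect
            "male": rng.integers(0, 2, n),       # gender covariate
            "y": rng.normal(size=n),             # null outcome
        })
        p1 = smf.ols("y ~ cond", df).fit().pvalues["cond"]
        m2 = smf.ols("y ~ cond * male", df).fit()
        p2 = m2.pvalues["cond"]                  # condition, adjusted for gender
        p3 = m2.pvalues["cond:male"]             # condition x gender interaction
        hits += min(p1, p2, p3) < .05            # report whichever test "worked"

    print(f"false-positive rate: {hits / reps:.3f}")  # well above the nominal .05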

Smith, R. (2005). Relative size versus controlling for size. Interpretation of ratios in research on sexual dimorphism in the human corpus callosum. Current Anthropology, 46(2), 249-273.

Spector, P. E., & Brannick, M. T. (2011). Methodological urban legends: The misuse of statistical control variables. Organizational Research Methods, 14(2), 287-305. https://doi.org/​10.1177/​1094428110369842

Spector and Brannick (2011) give very clear explanations of the problems with covariate adjustment and the use of so-called "control" variables. They recommend investigators treat control variables like any other in a study, with a fully supported argument describing the rationale for their inclusion. Such an argument, however, requires examining multiple competing hypotheses about the role that covariates play in a model. One key concern is that in certain models, "the 'test' for spuriousness is identical to the 'test' for mediation (MacKinnon et al., 2000)" (p. 294).

Spector, P. E., Zapf, D., Chen, P. Y., & Frese, M. (2000). Why negative affectivity should not be controlled in job stress research: Don’t throw out the baby with the bath water. Journal of Organizational Behavior, 21, 79-95.

Steiner, P. M., Cook, T. D., Li, W., & Clark, M. H. (2015). Bias reduction in quasi-experiments with little selection theory but many covariates. Journal of Research on Educational Effectiveness, 8(4), 552. https://doi.org/​10.1080/​19345747.2014.978058

Streiner, D. L. (2016). Control or overcontrol for covariates? Evidence Based Mental Health, 19(1), 4-5. https://doi.org/​10.1136/​eb-2015-102294

See abstract at PubMed.

Stuart, E. A. (2007). Estimating causal effects using school-level data sets. Educational Researcher, 36(4), 187-198.

Δ

Suckling, J. (2011). Correlated covariates in ANCOVA cannot adjust for pre-existing differences between groups. Schizophrenia Research, 126(1-3), 310-311. https://doi.org/​10.1016/​j.schres.2010.08.034

Δ Suckling (2011) presents a letter explaining why covariate adjustment did not "control" for group differences in a recently published paper on cannabis use disorders in schizophrenia.

Thomas, D. R., & Zumbo, B. D. (2012). Difference scores from the point of view of reliability and repeated-measures ANOVA: In defense of difference scores for data analysis. Educational and Psychological Measurement, 72(1), 37-43.

Δ

Thompson, B., Diamond, K. E., McWilliam, R., Snyder, P., & Snyder, S. W. (2005). Evaluating the quality of evidence from correlational research for evidence-based practice. Exceptional Children, 71(2), 181-194.

Tian, L., Cai, T., Zhao, L., & Wei, L. (2012). On the covariate-adjusted estimation for an overall treatment difference with data from a randomized comparative clinical trial. Biostatistics, 13(2), 256-273. https://doi.org/​10.1093/​biostatistics/kxr050

Tian, Cai, Zhao, and Wei (2012) build upon a novel augmentation procedure examined by Zhang, Tsiatis, and Davidian (2008) and Tsiatis, Davidian, Zhang, and Lu (2008).

Tipton, E., Hallberg, K., & Hedges, L. (2017). Implications of small samples for generalization: Adjustments and rules of thumb. Evaluation Review, 41(5), 472-505.

Tisak, J., & Smith, C. S. (1994). Defending and extending difference score methods. Journal of Management, 20(3), 675-682.

Tisak and Smith (1994) discuss difference scores, "the difference between distinct but conceptually linked constructs" (p. 675), not to be "confused with change scores, or the difference between a single construct measured at two or more points in time" (p. 675).

Tisak, J., & Smith, C. S. (1994). Rejoinder to Edwards's comments. Journal of Management, 20(3), 691-694.

Trochim, W. M. K. (2020). Nonequivalent groups analysis. Research Methods Knowledge Base website: http://www.socialresearchmethods.net/kb/statnegd.php

Δ Trochim (2020) offers an excellent and accessible description of the bias introduced by ANCOVA when groups differ and covariates are measured with error. The combination of these conditions leads to biased results from residual-change models. Cribbie and Jamieson (2000), Westfall and Yarkoni (2016), and Zinbarg, Suzuki, Uliaszek, and Lewis (2010) offer reviews of problems caused by correlations between predictors and measurement error. Miller and Chapman (2001) and Fitzmaurice (2001) also raise conceptual challenges with so-called "control" variables.

Tsiatis, A. A., Davidian, M., Zhang, M., & Lu, X. (2008). Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: A principled yet flexible approach. Statistics in Medicine, 27(23), 4658-4677. https://doi.org/​10.1002/​sim

See abstract at PubMed.

Tu, Y.-K., Gunnell, D., & Gilthorpe, M. (2008). Simpson's Paradox, Lord's Paradox, and Suppression Effects are the same phenomenon—the reversal paradox. Emerging Themes in Epidemiology, 5(1), e2. https://doi.org/​10.1186/​1742-7622-5-2

van Belle, G. (2002). Statistical rules of thumb. New York: John Wiley & Sons. ◊

van Belle, G. (2008). Statistical rules of thumb (2nd ed.). New York: John Wiley & Sons. ◊

Van Breukelen, G. J. P. (2006). ANCOVA versus change from baseline: More power in randomized studies, more bias in nonrandomized studies. Journal of Clinical Epidemiology, 59(9), 920-925. https://doi.org/​10.1016/​j.jclinepi.2006.02.007

Δ Van Breukelen (2006) compares ANCOVA with an analysis of change scores. From the abstract: "In randomized studies both methods are unbiased, but ANCOVA has more power [cf. Oakes & Feldman, 2001]. If treatment assignment is based on the baseline, only ANCOVA is unbiased. In nonrandomized studies with preexisting groups differing at baseline, the two methods cannot both be unbiased, and may contradict each other. In the study of depression, ANCOVA suggests absence, but ANOVA of change suggests presence, of a treatment effect. The methods differ because ANCOVA assumes absence of a baseline difference." See Allison (1990), who recommends that "in ambiguous cases, there may be no recourse but to do the analysis both ways and to trust only those conclusions that are consistent across methods" (p. 110).

The title cited above is correct; the article was originally published with an incorrect title, as described in the erratum attached to the end of van Breukelen (2006) above. See also Van Breukelen, G. J. P. (2006). Erratum to "ANCOVA versus change from baseline had more power in randomized studies and more bias in nonrandomized studies" [J Clin Epidemiol 59 (2006) 920-925]. Journal of Clinical Epidemiology, 59(12), 1334. https://doi.org/10.1016/j.jclinepi.2006.02.007
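
Van Breukelen's contrast is easy to reproduce in a small simulation, offered here as a sketch under assumed parameter values rather than the paper's own analysis: two preexisting groups differ at baseline, no one truly changes, and the pretest is fallible. ANCOVA then suggests a treatment effect while the analysis of change does not:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(7)
    n = 5000
    grp = np.repeat([0, 1], n // 2)
    true_score = rng.normal(grp * 1.0, 1.0)    # preexisting baseline difference
    pre = true_score + rng.normal(0, 0.7, n)   # fallible pretest
    post = true_score + rng.normal(0, 0.7, n)  # no change, no treatment effect

    df = pd.DataFrame({"grp": grp, "pre": pre, "post": post, "gain": post - pre})
    print(smf.ols("post ~ pre + grp", df).fit().params["grp"])  # ANCOVA: ~0.33, spurious
    print(smf.ols("gain ~ grp", df).fit().params["grp"])        # change: ~0, correct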

van Breukelen, G. P. (2013). ANCOVA versus CHANGE from baseline in nonrandomized studies: The difference. Multivariate Behavioral Research, 48(6), 895-922. https://doi.org/​10.1080/​00273171.2013.831743

van Eersel, G. G., Bouwmeester, S., Polak, M. G., & Verkoeijen, P. P. J. L. (2017). Commuting mice and men: The misuse of ANCOVA in neuroimaging studies. PsyArXiv. https://doi.org/10.31234/osf.io/qcsbz

Venter, A., Maxwell, S. E., & Bolig, E. (2002). Power in randomized group comparisons: The value of adding a single intermediate time point to a traditional pretest-posttest design. Psychological Methods, 7(2), 194-209.

From the abstract: "Adding a pretest as a covariate to a randomized posttest-only design increases statistical power, as does the addition of intermediate time points to a randomized pretest-posttest design. . . . If straight-line growth is assumed, the pretest-posttest slope must assume fairly extreme values for the intermediate time point to increase power beyond the standard analysis of covariance on the posttest with the pretest as covariate, ignoring the intermediate time point."

Wainer, H. (1991). Adjusting for differential base rates: Lord's paradox again. Psychological Bulletin, 109, 147-151.

Δ Wainer (1991) discusses three methods used to adjust for baseline values of heart rates: "(a) subtract the base rate, (b) divide by the base rate, and (c) covary out the base rate" (p. 147). Using Rubin's (e.g., 1974, 1977) potential outcomes framework, Wainer describes the assumptions for each strategy and concludes that "the answer for heart rate data is almost surely Methodology (a)" (p. 147).
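
The three adjustments are simple to express in code; the heart-rate values below are hypothetical and serve only to show the computations:

    import numpy as np
    import statsmodels.api as sm

    base = np.array([60., 70, 65, 80, 75])   # baseline heart rates (hypothetical)
    out = np.array([66., 75, 72, 84, 83])    # follow-up heart rates (hypothetical)

    change = out - base                                      # (a) subtract the base rate
    ratio = out / base                                       # (b) divide by the base rate
    resid = sm.OLS(out, sm.add_constant(base)).fit().resid   # (c) covary out the base rate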

Wainer, H., & Brown, L. M. (2004). Two statistical paradoxes in the interpretation of group differences: Illustrated with medical school admission and licensing data. American Statistician, 58(2), 117-123.

Wainer, H., & Brown, L. M. (2006). Three statistical paradoxes in the interpretation of group differences: Illustrated with medical school admission and licensing data. Handbook of Statistics, 26, 893-918.

Weisberg, H. I. (1979). Statistical adjustments and uncontrolled studies. Psychological Bulletin, 86(5), 1149-1164. https://doi.org/​10.1037/​0033-2909.86.5.1149

Weisberg (1979) covers some of the same ground as Reichardt (1979).

Westfall, J., & Yarkoni, T. (2016). Statistically controlling for confounding constructs is harder than you think. PLOS ONE, 11(3), 1-22. https://doi.org/​10.1371/​journal.pone.0152719

Westfall and Yarkoni (2016) have shown that predictors, including covariates, measured with error can increase the Type I error rate to 100%, indicating that "a potentially large proportion of incremental validity claims made in the literature are spurious" (p. 1). See Westfall and Yarkoni's Ivy: "Incremental Validity" Error Rate Calculator. See Neuroskeptic (2016) for a brief overview of the issue. Note also that this paper does not address all the issues raised by Kahneman (1965), Meehl (1970), Miller and Chapman (2001), Spector and Brannick (2011), or others.
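
A rough sketch of the phenomenon follows; the sample size, reliabilities, and effect sizes are assumptions chosen for illustration, not taken from the paper. A construct with no direct effect on the outcome, tested for "incremental validity" over a fallible measure of the true cause, is declared significant far more often than 5% of the time:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n, reps, hits = 300, 1000, 0
    for _ in range(reps):
        x = rng.normal(size=n)                   # true cause of y
        z = 0.7 * x + rng.normal(0, 0.71, n)     # correlated construct, no direct effect
        y = x + rng.normal(size=n)
        x_obs = x + rng.normal(0, 0.65, n)       # reliability of about .70
        z_obs = z + rng.normal(0, 0.65, n)
        X = sm.add_constant(np.column_stack([x_obs, z_obs]))
        hits += sm.OLS(y, X).fit().pvalues[2] < .05   # "incremental" test on z_obs
    print(hits / reps)                           # far above the nominal .05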

What Works Clearinghouse. (2015). WWC standards brief: Baseline equivalence (Version 2.0). Washington DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation, WWC.

White, A. T., & Spector, P. E. (1987). An investigation of age-related factors in the age-job-satisfaction relationship. Psychology and Aging, 2(3), 261-265.

White and Spector (1987) anticipate some of the points later made by Spector and Brannick (2011) about theoretical relationships among potential control variables.

Willett, J. B. (1988). Questions and answers in the measurement of change. Review of Research in Education, 15(1), 345-422. https://doi.org/​10.3102/​0091732X015001345

Δ "A lengthy (and tediously complete) overview of issues, problems, and misconceptions in the measurement of change" (John Willett's website).

Willett (1988) pointed out that "among methodologists, the residual change score has now been largely discredited as a measure of individual change" (p. 380; e.g., Rogosa, Brandt, & Zimowski, 1982). He then compared longitudinal growth models with the analysis of gain scores and covariate-adjustment on two-wave data. "The considerable logical, substantive, and technical flaws of the residual-change score were documented and the reader was advised to avoid this measure of individual growth. On the other hand, despite serious drawbacks, the observed difference score was shown to be a moderately successful measure of individual growth" (Willett, 1988, p. 413).

Willett demonstrated that (a) gain scores could be highly reliable in the presence of interindividual change over time, (b) the gain score validly estimates individual growth, and (c) "under the straight-line growth model, the OLS estimate of the individual growth rate was shown to be a natural extension of the observed difference score to multiwave data" (p. 414). Willett, however, recommended longitudinal growth models over the analysis of gain scores. Growth models can, for example, correct for measurement error and allow for curvilinear or other forms of growth.

Willett also discussed the propensity of some investigators to standardize measures across time for the analysis of growth, which he calls "misguided" in most cases (p. 378). "Such standardization constitutes a completely artificial and unrealistic restructuring of interindividual heterogeneity in growth" (p. 378) due to the almost certain increases in the variance over time: "it is impossible for the population variance of true (and observed) status to be constant over time (Rogosa & Willett, 1985; Willett, 1985)" (p. 378, emphasis in original).

This paper has also been cited with the year 1989 (a reprint?) as, "In E. Z. Rothkopf (Ed.), Review of research in education (Vol 15, pp. 345-422). Washington, DC: American Educational Research Association."

Willett, J. B. (1994). Measurement of change. In T. Husen & T. N. Postlethwait (Eds.), The international encyclopedia of education (2nd ed., pp. 671-678). Oxford, UK: Pergamon.

Δ Willett shows that the difference score has been "falsely condemned" for at least four reasons, which all originate in "critics' misunderstanding of the association between it and pretest status" (p. 672). The reasons include the supposed unreliability of the gain score estimator and three issues with the correlations between change and initial status. Willett then offers interesting insights into the creation of the residual-change score, which was "motivated by an unnecessary desire to create measures of change that were uncorrelated with pretest score" (p. 673). He notes that "the researcher is strongly advised to avoid these [residual-change] scores as measures of within-person change" (p. 674).

Willett, J. B. (1997). Measuring change: What individual growth modeling buys you. In E. Amsel and K. A. Renninger (Eds.), Change and development: Issues of theory, method, and application (pp. 213-243). Mahwah, NJ: Lawrence Erlbaum Associates.

Willett, J. B., & Singer, J. D. (1989). Two types of question about time: Methodological issues in the analysis of teacher career path data. International Journal of Educational Research, 13(4), 421-437.

Δ Willett and Singer (1989) note that the difference score is "unbiased, easy to compute and intuitively appealing. Although once highly favored, it was lambasted through the 1950s, 60s and 70s because of its purported unreliability and (usually negative) correlation with initial status (Bereiter, 1963; Linn & Slinde, 1977). But these criticisms were based on flawed assumptions, and the difference score, and some modifications of it, are now seen as the best you can do with only two waves of data (Rogosa, Brandt, & Zimowski, 1982; Rogosa & Willett, 1983; 1985; Willett, 1988)" (p. 429). The authors then show, as have many others, that the problems depend on the fallible observed scores while the questions usually concern the underlying true scores.

Williams, R. H., & Zimmerman, D. W. (1996). Are simple gain scores obsolete? Applied Psychological Measurement, 20, 59-69.

Δ Williams and Zimmerman (1996) focus mostly on the reliability of simple change and the misleading assumptions used to demonstrate its unreliability.

Winship, C., & Mare, R. D. (1992). Models for sample selection bias. Annual Review of Sociology, 18, 327-350.

Wright, D. B. (2006). Comparing groups in a before-after design: When t test and ANCOVA produce different results. British Journal of Educational Psychology, 76(3), 663-675.

Wright (2006) discusses the two principal approaches to the analysis of group designs when conditions differ on some measure. "There are two main approaches: t test on the gain scores and ANCOVA partialling out the initial scores. . . . Recommendations . . . stress careful examination of the research questions, sampling and allocation of participants and graphing the data. ANCOVA is appropriate when allocation is based on the initial scores, t test can be appropriate if allocation is associated non-causally with the initial scores, but often neither approach provides adequate results" (abstract).

Wright, D. (2017). Using graphical models to examine value-added models. Statistics and Public Policy, 4(1), 1-7. https://doi.org/​10.1080/​2330443X.2017.1294037

Wu, C. F. J., & Hamada, M. (2000). Experiments: Planning, analysis, and parameter design optimization. New York: John Wiley & Sons. ◊

Yanez, N. D., Kronmal, R. D., & Shemanski, L. R. (1998). The effects of measurement error in response variables and tests of association of explanatory variables in change models. Statistics in Medicine, 17, 2597-2606.

Yin, P., & Brennan, R. L. (2002). An investigation of difference scores for a grade-level testing program. International Journal of Testing, 2(2), 83-105.

Δ From the abstract: "Descriptive statistics and empirical norms are reported, and reliability estimates for difference scores for both students and districts are examined. This article begins with a review of traditional perspectives on change scores and a summary of more recent perspectives."

Yu, L.-M., Chan, A., Hopewell, S., Deeks, J. J., & Altman, D. G. (2010). Reporting on covariate adjustment in randomised controlled trials before and after revision of the 2001 CONSORT statement: A literature review. Trials, 11, 59. https://doi.org/​10.1186/​1745-6215-11-59

Zhang, M., Tsiatis, A., & Davidian, M. (2008). Improving efficiency of inferences in randomized clinical trials using auxiliary covariates. Biometrics, 64, 707-715. https://doi.org/​10.1111/j.1541-0420.2007.00976.x

Zimmerman, D. W. (1997). A geometric interpretation of the validity and reliability of difference scores. British Journal of Mathematical and Statistical Psychology, 50, 73-80.

Zimmerman, D. W., & Williams, R. H. (1982). A note on the correlation of gains and initial status. Journal of General Psychology, 107, 203-207.

Zimmerman, D. W., & Williams, R. H. (1982). Gain scores in research can be highly reliable. Journal of Educational Measurement, 19(2), 149-154.

Δ Zimmerman and Williams (1982) review the literature that declares gain scores unreliable. "These conclusions are based on certain assumptions which at first glance appear reasonable about the values of parameters in a well known formula for the reliability of differences" (p. 149). The authors make alternative assumptions that they argue may be more realistic. They conclude "that gain scores can be reliable and that it would be premature to discard such measures in research" and that "it is very likely that there are many situations in which simple pretest-posttest differences are quite useful" (p. 153).
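
The "well known formula" can be examined directly. The function below transcribes the standard expression for the reliability of a difference score D = Y - X; the two calls contrast the classic pessimistic assumptions with the arguably more realistic ones:

    def diff_score_reliability(sx, sy, rxx, ryy, rxy):
        """sx, sy: SDs of X and Y; rxx, ryy: their reliabilities; rxy: their correlation."""
        num = sx**2 * rxx + sy**2 * ryy - 2 * rxy * sx * sy
        den = sx**2 + sy**2 - 2 * rxy * sx * sy
        return num / den

    # Textbook "unreliable gain" case: equal SDs, reliabilities of .80, and a
    # pre-post correlation of .70 yield a difference-score reliability near .33.
    print(diff_score_reliability(1, 1, .8, .8, .7))
    # With a lower pre-post correlation of .40, arguably more realistic in many
    # applications, the reliability of the gain rises to about .67.
    print(diff_score_reliability(1, 1, .8, .8, .4))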

Zimmerman, D. W., & Williams, R. H. (1986). Note on the reliability of experimental measures and the power of significance tests. Psychological Bulletin, 100, 123-124.

Zimmerman, D. W., Williams, R. H., & Zumbo, B. D. (1993). Reliability of measurement and power of significance tests based on differences. Applied Psychological Measurement, 17, 1-9.

Zimmerman, Williams, and Zumbo (1993) wrote that "explicit power calculations reaffirm the paradox put forward by Overall & Woodward (1975, 1976)—that significance tests of differences can be powerful even if the reliability of the difference scores is 0. This anomaly arises because power is a function of observed score variance but is not a function of reliability unless either true score variance or error score variance is constant. Provided that sample size, significance level, directionality, and the alternative hypothesis associated with a significance test remain the same, power always increases when population variance decreases, independently of reliability" (abstract). See also Oakes and Feldman (2001).

Zinbarg, R. E., Suzuki, S., Uliaszek, A. A., & Lewis, A. R. (2010). Biased parameter estimates and inflated Type I error rates in analysis of covariance (and analysis of partial variance) arising from unreliability: Alternatives and remedial strategies. Journal of Abnormal Psychology, 119(2), 307-319. https://doi.org/​10.1037/​a0017552

Zinbarg, Suzuki, Uliaszek, and Lewis (2010) showed that when an unreliable covariate is correlated with another independent variable (IV), Type I errors increase for the unique association of the IV with the dependent variable. They then suggest steps to minimize the impact of this bias. See also Jamieson (1994) and Van Breukelen (2006) for further discussion of this issue.

Zumbo, B. D. (1999). The simple difference score as an inherently poor measure of change: Some reality, much mythology. In Bruce Thompson (Ed.) Advances in social science methodology (pp. 269-304). Greenwich, CT: JAI Press.



Challenges with the Use of Covariates & Regression

        Notwithstanding substantial flaws, covariate adjustment has been common for many years. Kahneman (1965), for example, referenced its common use and Elashoff (1969) cited it as popular. Gordon (1968) credited Kendall and Lazarsfeld (1950) with the logic of statistical control and the growing availability of computers in the 1960s for its widespread use. Applied researchers have long considered covariate adjustment a panacea for problems related to power, bias, imbalance, and other challenges that may occur in a study. Some authorities, such as the What Works Clearinghouse (WWC), have even required covariate adjustment, at least since 2008 (WWC, Version 2.0), for studies with moderate baseline differences of 0.05 to 0.25 standard deviations. The popularity of covariate adjustment is striking given its significant limitations and interpretation problems.

        Covariate adjustment and ANCOVA have been condemned for nearly a century. Regression and covariate adjustment have been described as "fraught with dangers" (Burks, 1926a, p. 532); speculative and not soundly based (Cochran, 1957); methodologically unsound (Meehl, 1970); doomed to failure in quasi-experimental designs (QEDs; Rogosa, 1988); logically, substantively, and technically flawed (Willett, 1988); implausible in many cases (Allison, 1990); roundly condemned (Miller & Chapman, 2001); and myth or urban legend (Spector & Brannick, 2011). The use of covariates without prespecification can lead to technical and conceptual flaws (Alemayehu, 2011; Assmann et al., 2000; Pocock et al., 2002). These are only a sample of authors who have raised concerns about covariate adjustment (see also Atinc, Simmering, & Kroll, 2012; Burks, 1926b; Burks & Kelly, 1928; Cribbie & Jamieson, 1994; Freedman, 2008; Liu, 2011; Maris, 1998; Meehl, 1971a; Pedhazur, 1997; Rogosa & Willett, 1985; Wainer, 1991).

        An example. Covariate adjustment may lead to the misinterpretation of results, a common problem demonstrated by P. Morgan and colleagues' (2017, Exceptional Children) investigation into the potential disproportionate representation of Black children in special education. The authors argue that "rigorously evaluating this hypothesized causal relation necessitates covariate adjustment for confounds" (p. 182). They further justify the use of covariates in the following manner:

The standard method for identifying a causal relation between two variables is random assignment. . . . Yet it is not possible to assign children randomly . . . [by] their race or ethnicity. An alternative . . . is a statistical method such as regression analysis to account for potential confounders (Holmes, 2013; S. L. Morgan & Winship, 2007). Statistically controlling for individual academic achievement should provide more rigorous estimates of the extent to which the disproportionality is attributable to race or ethnicity and not to other factors (National Research Council, 2004). (p. 183)

        P. Morgan and colleagues (2017) provide little more than their citation of Holmes (2013) and S. Morgan and Winship (2007) to justify their use of covariates, yet neither source assumes, as did P. Morgan et al. (2017), that all covariates represent confounders. S. Morgan and Winship, in contrast, argue that "to offer a precise and defendable causal effect estimate, a well-specified theory is needed to justify assumptions about underlying causal relationships" (p. 30). P. Morgan et al. (2017) offer no theoretical or empirical support for their choice of covariates, like most papers that employ covariates to control for confounders. This makes their model and interpretation questionable at best.

        In their "best-evidence" synthesis, P. Morgan et al. (2017) cite Morgan et al. (2015) as an example, which "extensively adjusts for potentially confounding factors, including family-level SES and individual child-level academic achievement and behavioral functioning" (Morgan et al., 2015, p. 283). The confounding factors include low birth weight, age of mother, families' socioeconomic status, access to health insurance, and teachers ratings of child behavior among others. Children from Black and White families differ on many of these characteristics, due in part to systemic racial bias. Because systemic bias could also produce disparities in special education assignment, these covariates are not likely just "confounds." Rather, systemic bias may cause both differential reports on covariates as well as special education assignment.

        Many children from Black families may have experienced lower socioeconomic status, limited access to adequate health care or health insurance, poor ratings of behavior from teachers, and so on due to systemic, even if implicit, racial bias. The same biases may influence the assignment of children to special education. Perhaps these families live in formerly redlined neighborhoods with a history of bias in employment, housing, health care, and also education. In such a scenario, the covariates included by P. Morgan et al. (2015) do not remove variance associated with confounds but rather variance associated with well-documented causes of disparate racial outcomes in the United States. In that case, the models in P. Morgan et al. incorrectly "controlled" for causal variables and removed from their models the variance associated with the very bias they had attempted to investigate.

        The myth of confounding. Holmes (2013), Spector and Brannick (2011), Morgan and Winship (2007), Meehl (1971a), and others argue for an a priori theory, well-specified and comprehensive, that guides the specification of all predictors within the model along with external, empirical support for assumptions. Referring to statistical control via partial correlations, Barbara Burks (1926b) "points out another kind of misuse to which the partial correlation coefficient has often been subjected" (p. 625):

The discussion concerns the type of pitfall that besets the user of partial technique who would interpret his findings by such a conclusion as: "This partial correlation coefficient between traits 1 and 2 represents the degree of association between 1 and 2 after all they hold in common with variable 3 has been eliminated." Such an interpretation is usually only one out of an indefinite number of interpretations all consistent with the data at hand; and there is often no reason for selecting it in preference to any of the others.

        Paul Meehl (1971a) similarly discussed how researchers take for granted, "as almost everyone does" (p. 144), that associations should be corrected for background factors, such as socioeconomic status, as if control variables somehow purify an analysis of all spuriousness. This is the approach Morgan et al. (2015) and many other authors have taken. Meehl, however, outlined eight alternative hypotheses for the role of socioeconomic status as a covariate in a previously published manuscript and showed how the interpretations differed. An assumption of spuriousness led to incorrect inferences in each case.

        MacKinnon, Krull, and Lockwood (2000) demonstrated "that mediation, confounding, and suppression models are statistically equivalent" (p. 180), so the assumption that a covariate corrects for confounding cannot be confirmed statistically. The interpretation of covariates must come from a theoretical argument with empirical evidence that explains "how the factor which is being adjusted for actually produces its effect" (Kempthorne, 1957, p. 6) before those covariates are entered into a statistical model.
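
        A small simulation makes the equivalence vivid (a sketch; the coefficient values are arbitrary assumptions). Data generated under mediation (X → M → Y) and under confounding (M → X, M → Y) share the same joint distribution, so the regressions used to "adjust" for M behave identically in both cases:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(11)
    n = 100_000

    # Mediation: x causes m, m causes y
    x1 = rng.normal(size=n)
    m1 = 0.6 * x1 + rng.normal(0, 0.8, n)
    y1 = 0.5 * m1 + rng.normal(0, 1.0, n)

    # Confounding: m causes both x and y
    m2 = rng.normal(size=n)
    x2 = 0.6 * m2 + rng.normal(0, 0.8, n)
    y2 = 0.5 * m2 + rng.normal(0, 1.0, n)

    for x, m, y in [(x1, m1, y1), (x2, m2, y2)]:
        df = pd.DataFrame({"x": x, "m": m, "y": y})
        simple = smf.ols("y ~ x", df).fit().params["x"]
        adjusted = smf.ols("y ~ x + m", df).fit().params["x"]
        print(f"unadjusted {simple:.3f} -> adjusted {adjusted:.3f}")  # same pattern twice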

        Controls may increase bias. Including statistical controls may increase bias rather than reduce it. "One cannot be sure that confounding is being reduced as one adjusts for additional confounders and sees the estimate change: Even if the confounders satisfy the usual conditions for being a confounder and there is no other bias, it is theoretically possible that the change represents an increase in confounding" (Greenland & Robins, 2009, p. 4; see also Tu et al., 2008). "The use of covariate adjustment in cohort studies is even more fraught and may result in paradoxical situations, in which there can be opposite interpretations of the results" (Streiner, 2016).

        Statistical controls can present an illusion of control, an illusion intensified by missing confounders, the use of proxies, and measurement error, and their use involves considerable subjective judgment (Christenfeld, Sloan, Carroll, & Greenland, 2004). Any analysis with covariates requires not only strong theoretical and empirical support but also consideration of each assumption. Violations of assumptions, common in psychological and educational research, invalidate the interpretation and can render results suspect. In many circumstances, the addition of covariates can also increase the Type I error rate (Westfall & Yarkoni, 2016), a problem made worse by "the 'everything but the kitchen sink' approach" (Becker, 2005).

        Many statisticians discourage covariate adjustment without preselection, especially in studies that compare treatments. Moher and colleagues (2010), in their elaboration of the CONSORT statement, echo Altman (1998) in their caution that "adjustment for variables because they differ significantly at baseline is likely to bias the estimated treatment effect" (p. 19). Despite the flaws, however, some proponents of covariate adjustment offer no warnings about the potential for spurious effects and interpretations. The WWC, for example, requires the addition of covariates when conditions differ moderately at baseline yet offers no warnings about the value of theory, empirical support, alternative interpretations, or even basic assumptions (e.g., WWC Standards, 2008, 2011, 2014, 2017, 2020).

Assumption of Independence between a Group Assignment Variable and Covariates

        In group-comparison studies, investigators often include covariates to reduce the standard error of the estimate of group differences or to reduce confounding. Individual characteristics may differ between groups, for example, especially if an investigator examines many baseline variables, and analysts often include these variables in the statistical models to control for confounding. This approach to analysis, however, has logical flaws. In all models that include covariates, investigators must carefully select the appropriate model for each application based on a robust causal theory and empirical support (e.g., Allison, 1990; Meehl, 1971a; Morgan & Winship, 2007; Spector & Brannick, 2011). Alternatively, researchers can use a different model, such as the analysis of gains. The choice between ANCOVA and an analysis of gains depends on assumptions and assignment procedures.

        Baseline differences. The argument to include covariates to account for baseline differences goes like this: If groups assigned to different treatment conditions begin with different mean values, any analysis that fails to account for those differences will lead to a biased estimate of condition differences. This argument rests on additional assumptions. First, if groups of subjects differ at baseline, then they will differ at their posttest assessment. This implies that the groups of subjects will regress to different means. Second, without some approach to correct for those baseline differences, the analysis will result in bias. Third, ANCOVA is the best approach to correct for baseline differences.

        Many researchers assume that ANCOVA equates groups when the model includes covariates that differed at baseline. This is an example of a fallacy called affirming the consequent. Fitzmaurice (2001) describes ANCOVA as a conditional analysis. It answers a conditional question about differences at posttest assuming baseline equivalence. Allison (1990) describes the ANCOVA approach in terms of regression to the mean. The covariate-adjusted approach assumes that "regression to the mean within groups implies regression to the mean between groups, a conclusion that seems quite implausible for many applications" (Allison, 1990, p. 110).

        Specifically, if (A) groups are balanced at baseline, then (B) ANCOVA produces unbiased estimates. This statement relies on the auxiliary assumption that when groups are equivalent at baseline they regress to the same mean (Allison, 1990). The use of ANCOVA to correct for baseline differences reverses the statement about ANCOVA: If (B) an ANCOVA includes covariates that differ at baseline, then (A) it produces estimates as if groups were initially equivalent. Affirming the consequent, or the fallacy of the converse, takes the true statement A ⇒ B, which means that if A is true then B is true, and reverses it. The opposite deduction, however, is invalid. If B is true, A is not necessarily true.

        As described by Elashoff (1969), Allison (1990), Fitzmaurice (2001), and others, when groups differ at baseline, ANCOVA may produce biased estimates. ANCOVA "depends on the assumption that individuals have been randomly assigned to conditions. If this assumption is not met, any adjustment in treatment means cannot be defended or justified statistically" (Keppel & Zedeck, 1989, p. 481). Moreover, "covariate imbalance can hurt the precision of estimates even if random assignment is used in the studies" (Liu, 2012, p. 630). The analysis of net change or gains offers a solution. It does not require the same restrictive assumptions of baseline equivalence or regression to the same mean across groups. The analysis of gains has its own assumptions (Jamieson, 1999), but the approach answers the most straightforward question that interests investigators (Fitzmaurice, 2001): does the change in scores across time differ between groups? This is an unconditional question and assumes that the two groups regress to different means (Allison, 1990). Baseline differences can lead to bias with either model, but an analysis of gains may be more defensible in many cases.
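
        The two questions translate directly into two models. In the minimal sketch below, with simulated randomized data (all values illustrative), both estimators recover the same true effect; as discussed above, they diverge when the baseline-equivalence assumption fails:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(5)
    n = 200
    cond = np.repeat([0, 1], n // 2)                  # randomized condition
    pre = rng.normal(size=n)
    post = pre + 0.4 * cond + rng.normal(0, 0.5, n)   # true gain of 0.4 in treatment
    df = pd.DataFrame({"cond": cond, "pre": pre, "post": post, "gain": post - pre})

    # Conditional question (ANCOVA): do groups differ at posttest, holding baseline fixed?
    print(smf.ols("post ~ pre + cond", df).fit().params["cond"])
    # Unconditional question (gains): does pre-post change differ between groups?
    print(smf.ols("gain ~ cond", df).fit().params["cond"])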

        Van Breukelen (2006) demonstrates that in many studies with nonrandomly assigned conditions, an analysis of change may produce less biased results than ANCOVA. "Based on literature, we saw that (1) the difference between ANCOVA and ANOVA of change is that between assuming absence or presence of a baseline group difference, and (2) the choice between both methods depends on the treatment assignment procedure" (p. 924). Van Breukelen (2013) suggests that the analysis of gains "allows for a pretest group difference, but it makes the strong assumption that the groups will show equal change if neither group is treated. ANCOVA assumes absence of a true pretest group difference and implies regression of posttest means toward a common mean if neither group is treated" (p. 914). If members are either assigned to groups at random or, in an observational or quasi-experimental study, random samples of their respective populations, the equal change assumption is more plausible than the regression to a common mean.

        Measurement error. The presence of measurement error can exacerbate problems with ANCOVA. Kisbu-Sakarya et al. (2013) provide a nice illustration of the differences in Type I error rates and power between an analysis of gain scores and ANCOVA in the presence of measurement error. One key finding was that, "in case of baseline imbalance, ANCOVA and residual change score methods produced large Type I error rates when reliability was less than perfect. . . . On the other hand, Type I error rates for the difference score method were not influenced by baseline imbalance or reliability." (p. 58). Hence, baseline differences appear to produce both bias and inflated Type I error rates for ANCOVA.

        Similarly, Oakes and Feldman (2001) concluded "that for a randomized experiment, ANCOVA yields unbiased treatment estimates and typically has superior power to change-score methods, all else equal. However, in the absence of randomization, when baseline differences between groups exist, we follow Allison (1990) and show that change-score models yield less biased estimates (if biased at all). Then, bias aside, we went on to show that the common assumption that ANCOVA models are more powerful rests on the untenable assumption that pretests are measured without error. In the presence of measurement error, change-score models may be equally or even more powerful" (p. 18). The choice between ANCOVA and the analysis of gain scores therefore rests on the research question, the assumption of baseline equivalence, and the presence of measurement error.

        Covariate adjustment distorts the IV. Baseline differences in covariates imply an association between the independent variable (IV) and the covariate. Miller and Chapman (2001) cautioned that covariate adjustment with a variable that differs at baseline forces the treatment indicator to become independent of the covariate. This sounds like the purpose of ANCOVA (a common misinterpretation noted as early as Burks, 1926b), but Miller and Chapman draw attention to a flaw in the reasoning: the treatment indicator can no longer be interpreted in the same way as it was before entering the covariate in the model. "The grouping variable, its essence, has been altered in some substantive way that is frequently not specifiable in a conceptually meaningful way" (p. 43).

        If the covariate is associated with the grouping variable (e.g., condition), adding the covariate to the model then removes common variability between the grouping variable and the covariate, leaving only the residual portion of the grouping variable. This occurs regardless of the reason for the association between the group indicator and covariate, whether the covariate was affected by the group variable or happened to differ due to random variation. The "[residualized group] is not a good measure of the construct that [group was] intended to measure" (Miller & Chapman, p. 43). Hence, a randomly assigned condition indicator would no longer represent random assignment (Meehl, 1970). Incidentally, the common variance between the covariate and DV is also partialled out, so the covariate also changes the DV.

        Elashoff (1969) declared that "a basic postulate underlying the use of analysis of covariance to adjust treatment means for the effects of the covariate x is that the x variable is statistically independent of the treatment effect" (emphasis in original, p. 388). If a covariate is related to the assignment, it will remove common variance between an assignment variable and the DV. The covariate-adjusted assignment variable in a randomized controlled trial would, therefore, no longer represent random assignment.

        Meehl (1970) referred to this phenomenon as systematic unmatching: "The result of holding constant an identified nuisance variable Z will, in general, be to systematically unmatch pair members with respect to some fourth (unidentified) nuisance variable W" (p. 3). Cochran (1957) warned that when covariates are related to the independent variable, "they no longer merely remove a component of experimental error. . . . They distort the nature of the treatment effect" (emphasis added, p. 264). For this reason, Jamieson (2004) argued that ANCOVA is appropriate only for variables that correlate with the dependent variable but have no association with the independent variable. For slightly different reasons, Clason and Mundfrom (2012) also demonstrated, with both a theoretical argument and simulation results, that when using a covariate to adjust posttest data, "the adjustment is comparing entities that not only do not exist, but (probably) cannot exist" (p. 15).

        Further challenges. Many other authors have written about baseline equivalence, such as Cribbie and Jamieson (2000), Jamieson (1994, 1999), Fitzmaurice (2001), Fitzmaurice, Laird, and Ware (2004), Judd and Kenny (1981, see p. 59), Kisbu-Sakarya, MacKinnon, and Aiken (2013), Westfall and Yarkoni (2016), and Zinbarg, Suzuki, Uliaszek, and Lewis (2010). For example, ANCOVA does not allow for easy estimation in the face of missing data. The gain score more easily accommodates this situation, as all cases can be analyzed with procedures that rely on maximum likelihood methods (e.g., PROC MIXED in SAS). ANCOVA typically requires multiple imputation, which brings its own challenges (e.g., Allison, 2012; Drechsler, 2015; Graham, Olchowski, & Gilreath, 2007), although specialized software (e.g., Mplus) offers alternatives.
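
        As a sketch of that last point (the data, names, and 20% missingness rate are illustrative assumptions), a time x condition mixed model fit by maximum likelihood retains participants who are missing the posttest, whereas a listwise ANCOVA would drop them:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(9)
    n = 120
    wide = pd.DataFrame({
        "id": np.arange(n),
        "cond": np.repeat([0, 1], n // 2),
        "pre": rng.normal(size=n),
    })
    wide["post"] = wide["pre"] + 0.5 * wide["cond"] + rng.normal(0, 0.6, n)
    wide.loc[rng.random(n) < 0.2, "post"] = np.nan    # ~20% missing posttests

    long = wide.melt(id_vars=["id", "cond"], value_vars=["pre", "post"],
                     var_name="time", value_name="y").dropna()
    long["time"] = (long["time"] == "post").astype(int)

    # Random intercept per person; the time:cond coefficient is the net-gain estimate.
    m = smf.mixedlm("y ~ time * cond", long, groups=long["id"]).fit(reml=False)
    print(m.params["time:cond"])                      # ~0.5, using all 120 cases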

Assumption of Perfect Measurement

        Violation of the assumption that "all [predictors] are measured with perfect reliability" (Cohen, Cohen, West, & Aiken, 2003, p. 351) also leads to troublesome consequences. "It is well known that when a covariate has measurement error, all the estimates for the regression coefficients are biased" (Alemayehu, 2011, p. 155). Kahneman (1965) pointed this out many years ago, and Zinbarg, Suzuki, Uliaszek, and Lewis (2010) showed that covariates measured with error can bias parameter estimates and inflate Type I error rates. In the more general correlational case, Westfall and Yarkoni (2016) found that with measures of moderate reliability, Type I error rates can approach 100%. William Trochim (2020) provided a direct demonstration of how measurement error, coupled with pretest differences, can lead to erroneous posttest differences, described next.

Covariate Adjustment, Pretest Differences, & Measurement Error

        Trochim (2020) begins by reminding readers that regression accounts for measurement error in the dependent variable (A) but not in the predictors (B). An unreliable covariate in a regression procedure flattens (deflates) the regression line about the mean. Baseline differences on a true confounding variable measured without error will not influence the difference between conditions (C). But when two groups differ on an unreliable covariate, the unreliability of the covariate flattens the regression line at a different place for each group (D). This interaction between mean differences on the covariate and its unreliability spuriously changes the apparent difference between groups.
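        Trochim's picture reduces to a few lines of simulation. In this hedged sketch (NumPy; the reliability, baseline difference, and slope values are arbitrary choices, not Trochim's), the outcome depends only on the true covariate, yet adjusting for its error-laden version leaves a spurious condition difference:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
g = rng.integers(0, 2, n).astype(float)
true_x = 1.0 * g + rng.normal(size=n)            # groups differ on the true covariate
obs_x = true_x + rng.normal(scale=0.8, size=n)   # the covariate as measured, with error
y = 1.0 * true_x + rng.normal(size=n)            # no true condition effect on y

def adjusted_effect(cov):
    # Condition coefficient from the ANCOVA-style model y ~ g + covariate.
    X = np.column_stack([np.ones(n), g, cov])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

print(adjusted_effect(true_x))   # ~0.00: an error-free covariate adjusts fully (C)
print(adjusted_effect(obs_x))    # ~0.39: error flattens the within-group slope (B, D),
                                 #        leaving a spurious condition difference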

        Some researchers assume that gain scores, an alternative to ANCOVA, are unreliable. This occasional claim (e.g., Cronbach & Furby, 1970) has long been debunked:

Although once highly favoured, [the gain score] was lambasted through the 1950s, 60s and 70s because of its purported unreliability and (usually negative) correlation with initial status (Bereiter, 1963; Linn & Slinde, 1977). But these criticisms were based on flawed assumptions, and the difference score, and some modifications of it, are now seen as the best you can do with only two waves of data (Rogosa, Brandt, & Zimowski, 1982; Rogosa & Willett, 1983; 1985; Willett, 1988). (Willett & Singer, 1989, p. 429)

This myth nonetheless persists despite considerable evidence to the contrary (see also Allison, 1990; Collins, 1996; Gollwitzer, Christ, & Lemmer, 2014; Mellenbergh, 1999; Rogosa, 1988; Thomas & Zumbo, 2012; Williams & Zimmerman, 1996; Yin & Brennan, 2002; Zimmerman, Williams, & Zumbo, 1993).
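        The algebra behind the debate is worth a brief sketch. Under classical test theory, the reliability of the difference D = Y - X is (sd_X^2 rel_X + sd_Y^2 rel_Y - 2 r_XY sd_X sd_Y) / (sd_X^2 + sd_Y^2 - 2 r_XY sd_X sd_Y). The illustrative values below (invented, not drawn from any study cited here) show that gain reliability is low only when the pre-post correlation approaches the reliabilities of the measures themselves, which occurs precisely when individuals differ little in true change:

def gain_reliability(sd_x, sd_y, rel_x, rel_y, r_xy):
    # Classical test-theory reliability of the difference score D = Y - X.
    num = sd_x**2 * rel_x + sd_y**2 * rel_y - 2 * r_xy * sd_x * sd_y
    den = sd_x**2 + sd_y**2 - 2 * r_xy * sd_x * sd_y
    return num / den

print(gain_reliability(1, 1, .9, .9, .50))   # 0.80: a perfectly usable gain score
print(gain_reliability(1, 1, .9, .9, .85))   # 0.33: the textbook "unreliable" case,
                                             #       where true change hardly varies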

Assumption of Regression Slope Equivalence

        "If different population slopes are associated with different treatments, a common slope parameter does not exist; the pooled estimate [of the within-group regression slope of the covariate] (bw) cannot be an appropriate estimate of the different population slopes in this case" (Huitema, 2011, p. 184). The potential violation of the assumption of parallel regression slopes across treatment conditions (Ceyhan & Goad, 2009; Huitema, 2011) can produce biased results from ANCOVA. Allison (1990) offers a thorough discussion of this issue along with a comparison between ANCOVA and an analysis of gain scores.

        Because interventions frequently change the association between a pretest measure and a posttest outcome, any successful intervention can create a discrepancy in the slopes across conditions. In a literacy study described by Hosp and colleagues (2011), the pre-post correlation for the Woodcock Reading Mastery Test (Woodcock, 1998) Word Attack measure was .60 (r² = .36) in the control condition but just .31 (r² = .10) in the intervention condition. The difference was statistically significant, and the pretest explained more than three times as much posttest variation in the control condition as in the intervention condition.
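        A brief simulation sketch shows why such a discrepancy matters (NumPy; the slopes of .6 and .3 loosely echo the correlations above, and the condition effect of 0.25 is invented). With unequal slopes, the pooled-slope ANCOVA coefficient describes the condition difference at only one pretest value; everywhere else it misstates the effect:

import numpy as np

rng = np.random.default_rng(1)
n = 50_000
g = rng.integers(0, 2, n).astype(float)
pre = rng.normal(size=n)
slope = np.where(g == 1, 0.3, 0.6)                 # the condition changes the slope
post = 0.25 * g + slope * pre + rng.normal(scale=0.8, size=n)

# Pooled-slope ANCOVA reports a single "adjusted" effect...
X = np.column_stack([np.ones(n), g, pre])
print(np.linalg.lstsq(X, post, rcond=None)[0][1])  # ~0.25, valid only at pre = 0

# ...but the actual condition difference is 0.25 + (0.3 - 0.6) * pre:
for p in (-1.0, 0.0, 1.0):
    print(p, 0.25 - 0.3 * p)                       # 0.55 at -1, 0.25 at 0, -0.05 at +1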

        A similar bias can appear with continuous predictors, rather than a dichotomous independent variable, when the slope between the predictor and the dependent variable depends on the covariate. Regression models, therefore, also appear to assume that any given regression slope under consideration does not depend on the other predictors.

Ratios

        Ratios represent an altogether different form of "control." Sometimes authors normalize values by dividing a dependent variable by a second variable to create a rate or ratio. Consider, for example, a dependent variable that captures the number of office discipline referrals (ODRs) during the last month of school. Both the number of school days in the month and the size of the school can affect the number of ODRs, yet both are likely irrelevant to research questions about schools' use of ODRs. One might divide the number of ODRs in a month by the number of school days in the month and enrollment (ODR rate = ODRs / days / students).

        Such a procedure is often said to control for extraneous variables. Dividing a variable by a measure of size (e.g., ODRs / school days), however, does not control for size in the statistical sense (Atchley, Gaskins, & Anderson, 1976; Lev & Sunder, 1979; Smith, 2005). Nonetheless, in many scenarios, dividing the dependent variable by another relevant variable to create a rate, ratio, or proportion is a worthwhile exercise. That is, many proportional measures are proportional in meaning (Smith, 2005), such as in the principle of justice in law. Like all rates or ratios, however, when they describe small numbers, such as with small samples or events in short time segments, ratios, rates, and proportions can vary widely with only incremental changes in the underlying data (Bollmer et al., 2007; see also Bollmer et al., 2014, chapter 5). These problems can be exacerbated, and reliability reduced, when comparing metrics that rely on quotients.
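        The small-numbers point reduces to simple arithmetic, sketched here with invented figures: in a small school, a single additional referral doubles the rate, whereas the same one-referral change barely moves the rate of a larger school.

def odr_rate(odrs, days, students):
    # ODR rate = ODRs / days / students, per the example above.
    return odrs / days / students

small = [odr_rate(k, 20, 50) * 1000 for k in (1, 2)]     # 1.0 -> 2.0 per 1,000 student-days
large = [odr_rate(k, 20, 500) * 1000 for k in (10, 11)]  # 1.0 -> 1.1 per 1,000 student-days
print(small, large)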

        Thus, although rate metrics, which include ratios, risks, and other proportions, do not control for size in the statistical sense (Atchley, Gaskins, & Anderson, 1976; Lev & Sunder, 1979; Smith, 2005), they provide theoretically meaningful proportionate metrics useful for impact evaluations and other purposes (Smith, 2005). Note that a ratio technically requires the same metric or units in the numerator and denominator, whereas rate is the more general term for such quotients (Smith, 2005).


