Year : 2019  |  Volume : 2  |  Issue : 2  |  Page : 51-54

Analyzing educational interventions without random assignment

Office of Academic Affairs, University of Georgia College of Veterinary Medicine, Athens, Georgia, USA

Date of Web Publication: 5-Nov-2019

Correspondence Address:
Dr. Samuel C Karpen
College of Veterinary Medicine, 501 D.W. Brooks Drive, Athens, GA 30602

Source of Support: None, Conflict of Interest: None

DOI: 10.4103/EHP.EHP_21_19


Since educational researchers rarely have the luxury of random assignment, confounding variables are a common concern. This manuscript introduces readers to methods for statistically controlling confounding variables, namely propensity score matching, propensity score weighting, and doubly robust estimation. These techniques allow researchers to estimate the effect of an intervention more accurately (e.g., a new teaching method's effect on course grades) even when the groups being compared differ on other relevant variables (e.g., one group has a higher pre-DVM GPA than the other). Example analyses are included to aid researchers hoping to conduct their own analyses.

Keywords: Confounding, doubly robust estimation, propensity scores, statistics

How to cite this article:
Karpen SC. Analyzing educational interventions without random assignment. Educ Health Prof 2019;2:51-4

How to cite this URL:
Karpen SC. Analyzing educational interventions without random assignment. Educ Health Prof [serial online] 2019 [cited 2022 Oct 3];2:51-4. Available from: https://www.ehpjournal.com/text.asp?2019/2/2/51/270289

Introduction

Fluctuations in admission criteria, curricular content, and curricular delivery can make it difficult to attribute changes in students' performance to specific interventions. If an instructor in a graduate course finds that students who experienced a new teaching technique earned higher final grades than previous students who did not experience the technique, is she/he to conclude that the intervention worked? Perhaps current students earned higher undergraduate grade point averages (GPAs) or higher Graduate Record Examination (GRE) scores than previous students, and this is driving the difference in performance. Similarly, in 2016, the National Association of Boards of Pharmacy witnessed an overall drop in North American Pharmacist Licensure Examination (NAPLEX) pass rates after changing the examination format from adaptive delivery to fixed form, increasing the number of questions from 185 to 250, and allowing 6 h instead of 4 h and 15 min.[1] Since 2012, however, applications to pharmacy schools have been steadily decreasing, encouraging colleges/schools of pharmacy to admit previously unacceptable students.[2] Was the drop in NAPLEX pass rates due to relaxed admissions standards starting in 2012 or to the changes to the examination in 2016? In both examples, variables outside of the researchers' control, known as confounding variables, made it difficult to determine the source of a change.

Propensity score techniques can mitigate both the NAPLEX and new teaching technique dilemmas by minimizing the effect of confounding variables. Karpen and Welch (2018) found that the changes to the 2016 NAPLEX did not significantly affect pass rates after accounting for the fact that 2016 examinees had significantly lower GPAs and Pharmacy College Admission Test scores than previous years' examinees. While the teaching technique scenario was hypothetical, it is reasonable to assume that no two classes are academically or demographically identical before entering a graduate program. These differences must be accounted for when comparing classes within the graduate program.

Propensity score matching (PSM) attempts to account for confounding variables by comparing treatment cases to control cases that are as similar as possible in terms of the confounding variables. [Table 1] presents a hypothetical scenario in which some students received an intervention intended to boost final examination scores and others did not. In addition, students who received the intervention had lower undergraduate GPAs than students who did not (M = 3.34 for the intervention group vs. M = 3.52 for the nonintervention group). If researchers compared the groups' mean final examination scores without considering GPA, they would conclude that the intervention harmed examination performance by 2.6 points; however, the nonintervention group should be expected to outperform the intervention group because it is composed of students with higher preadmission GPAs. By comparing intervention cases only to nonintervention cases with similar preadmission GPAs, the researchers can estimate how the nonintervention group would have performed without its GPA advantage. In this example, the new control group's mean final examination score was 77.6, indicating that when preadmission GPA was accounted for, the intervention increased final examination performance by 3.8 points.
Table 1: Hypothetical propensity score matching scenario in which intervention and nonintervention groups differ in terms of multiple variables


Manual matching is feasible with the [Table 1] data; however, most studies include hundreds of cases and multiple confounders. Rather than painstakingly matching each case on multiple confounding variables, researchers summarize participants' scores on the confounders using logistic regression or generalized boosted models.[3],[4] The summary is called a propensity score, and it represents the probability of treatment group membership, given a case's confounder scores. Nonintervention cases that are most similar to the intervention cases in terms of the confounders will have the highest propensity scores. A new control group is then created by matching individuals in the nonintervention group to the individuals in the intervention group who are most similar in terms of propensity score and therefore most similar in terms of the confounders.
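As a minimal sketch of the logistic regression route, the base R code below fits a propensity model to made-up data. The data frame `students`, its columns, and the assignment mechanism are assumptions introduced purely for illustration; they are not the data behind [Table 1].

```r
# Hypothetical data: 200 students; lower-GPA students are more likely to be treated.
set.seed(1)
n <- 200
gpa <- round(runif(n, 2.5, 4.0), 2)
gre <- round(rnorm(n, mean = 310, sd = 8))
# Assumed (illustrative) assignment mechanism: treatment probability falls as GPA rises.
treated <- rbinom(n, size = 1, prob = plogis(8 - 2.5 * gpa))
students <- data.frame(treated, gpa, gre)

# Logistic regression of treatment status on the confounders.
ps_model <- glm(treated ~ gpa + gre, family = binomial(), data = students)

# The fitted probabilities are the propensity scores.
students$pscore <- fitted(ps_model)
head(students)
```

Each value of `pscore` is that student's estimated probability of being in the treatment group, given GPA and GRE.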

The most common propensity score methods are nearest-neighbor matching and inverse probability weighting. In nearest-neighbor matching, each treatment case is matched to a control case with an acceptably close propensity score. The range of acceptability is called the caliper. It is generally advised to start with a very narrow caliper and increase its size until every treatment case has a match. In inverse probability weighting, each treatment case is weighted by 1/(its propensity score) and each control case is weighted by 1/(1 minus its propensity score). Control observations that are similar to treatment observations in terms of the confounders will be weighted more heavily than those that are dissimilar, and treatment observations that are similar to control observations will be weighted more heavily than those that are dissimilar. One advantage of inverse probability weighting is that cases without adequate matches are retained, thereby limiting data loss.[5]
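Both ideas can be sketched in a few lines of base R. The function names `match_nearest` and `ipw_weights` and the toy scores are hypothetical, and greedy 1:1 matching without replacement is only one of several possible matching variants, chosen here for brevity.

```r
# Greedy 1:1 nearest-neighbor matching without replacement on the propensity score.
# Treated cases with no control inside the caliper are left unmatched.
match_nearest <- function(pscore, treated, caliper) {
  treated_idx <- which(treated == 1)
  control_idx <- which(treated == 0)
  matched_t <- integer(0)
  matched_c <- integer(0)
  for (i in treated_idx) {
    if (length(control_idx) == 0) break
    gaps <- abs(pscore[control_idx] - pscore[i])
    best <- which.min(gaps)
    if (gaps[best] <= caliper) {
      matched_t <- c(matched_t, i)
      matched_c <- c(matched_c, control_idx[best])
      control_idx <- control_idx[-best]  # each control is used at most once
    }
  }
  data.frame(treated = matched_t, control = matched_c)
}

# Inverse probability weights: 1/p for treated cases, 1/(1 - p) for controls.
ipw_weights <- function(pscore, treated) {
  ifelse(treated == 1, 1 / pscore, 1 / (1 - pscore))
}

# Toy propensity scores (made up for illustration).
pscore  <- c(0.80, 0.60, 0.30, 0.78, 0.58, 0.20, 0.10)
treated <- c(1, 1, 1, 0, 0, 0, 0)

match_nearest(pscore, treated, caliper = 0.05)
ipw_weights(pscore, treated)
```

With a caliper of 0.05, treated cases 1 and 2 are paired with controls 4 and 5, while treated case 3 has no control within range and stays unmatched. Widening the caliper would eventually give it a match, at the cost of a poorer one.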

After matching or weighting, researchers should ensure that the newly matched groups are similar in terms of the confounders. If matching/weighting is effective, the treatment and new control group will have similar scores on the confounders and similar propensity scores. Researchers can examine the effect sizes for the confounders by group before and after matching/weighting. Effect sizes should be smaller (ideally <0.20) after matching/weighting than before. Researchers can also plot the propensity scores by group and examine the extent to which the groups' propensity score distributions overlap. While there is no standard for the amount of overlap in propensity scores after matching/weighting, more is better.
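One way to compute such an effect size is a standardized mean difference. The helper `std_mean_diff` below is a hypothetical sketch on toy numbers; dividing by the treatment group's unweighted standard deviation is one common convention and is assumed here rather than taken from the article.

```r
# Standardized mean difference for one confounder, optionally weighted.
# The difference in (weighted) group means is divided by the unweighted
# standard deviation of the treatment group (an assumed convention).
std_mean_diff <- function(x, treated, w = rep(1, length(x))) {
  mean_t <- weighted.mean(x[treated == 1], w[treated == 1])
  mean_c <- weighted.mean(x[treated == 0], w[treated == 0])
  (mean_t - mean_c) / sd(x[treated == 1])
}

# Toy data: three treated students followed by four controls.
gpa     <- c(3.2, 3.4, 3.6, 3.4, 3.6, 3.8, 4.0)
treated <- c(1, 1, 1, 0, 0, 0, 0)
# Illustrative weights that favor controls whose GPAs resemble the treated group.
w       <- c(1, 1, 1, 3, 2, 1, 0)

before <- std_mean_diff(gpa, treated)     # unweighted comparison
after  <- std_mean_diff(gpa, treated, w)  # weighted comparison
c(before = before, after = after)
```

In this toy case the weights shrink the GPA effect size in absolute value, which is the pattern a balance check looks for.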

Once the groups are matched/weighted by propensity score, the researcher can run the final analyses to test for a treatment effect. To account for any residual confounding not addressed by the matching/weighting, the researcher can include the confounders in the final model in a technique called doubly robust estimation.[6]
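The following base R sketch illustrates the idea on simulated data in which the true treatment effect is set to 2 points. Every name and parameter value is an assumption made for illustration. The sketch uses `lm()` with weights only to obtain the point estimate; the standard errors that `lm()` reports for weighted data of this kind are not appropriate, which is one reason to prefer a survey-weighted model as in Appendix 3.

```r
set.seed(42)
n <- 2000
gpa <- rnorm(n, mean = 3.5, sd = 0.3)  # simulated GPA (not truncated at 4.0)
# Lower-GPA students are more likely to receive the treatment.
treated <- rbinom(n, size = 1, prob = plogis(-4 * (gpa - 3.5)))
true_effect <- 2
exam <- 75 + 15 * (gpa - 3.5) + true_effect * treated + rnorm(n, sd = 3)
sim <- data.frame(exam, treated, gpa)

# Naive comparison that ignores GPA.
naive <- unname(coef(lm(exam ~ treated, data = sim))["treated"])

# Step 1: propensity scores and inverse probability weights.
ps_fit <- glm(treated ~ gpa, family = binomial(), data = sim)
p <- fitted(ps_fit)
sim$w <- ifelse(sim$treated == 1, 1 / p, 1 / (1 - p))

# Step 2: weighted outcome model that also includes the confounder.
dr_fit <- lm(exam ~ treated + gpa, data = sim, weights = w)
doubly_robust <- unname(coef(dr_fit)["treated"])

c(naive = naive, doubly_robust = doubly_robust)
```

Because treated students start with lower GPAs, the naive estimate is pulled well below the true effect, whereas the weighted model with GPA as a covariate should land close to 2.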

Example

To illustrate propensity score methods, the authors created a dataset resembling one that educators may encounter. It included GPA, GRE, end-of-year comprehensive examination score, and an indicator of whether a student completed a study skills course during orientation. One group completed the course and the other did not. The dataset was designed so that GRE, GPA, and comprehensive examination score were correlated with one another such that students with lower GPAs had lower GRE and lower comprehensive examination scores. In addition, a higher proportion of students with low comprehensive examination scores were placed in the study skills group. Descriptive statistics and correlations are shown in [Table 2] and [Table 3], respectively.
Table 2: Descriptive statistics for simulation data

Table 3: Correlations for simulation data


Unweighted analysis

An unadjusted comparison indicated that the study skills course harmed comprehensive examination performance by 1.1 points (95% confidence interval [CI] = 0.79–1.56). This conclusion, however, ignored the fact that the study skills group also had lower GRE scores and GPAs than the no study skills group and that both of these variables correlated with comprehensive examination performance.


Propensity scores for each participant were generated with a generalized boosted model. Each score represents a student's estimated probability of being in the study skills group given GRE and GPA, and weights derived from these scores adjust for the effects of GRE and GPA. Each treatment case was weighted by 1/(its propensity score) and each control case was weighted by 1/(1 minus its propensity score), thereby making the groups more similar in terms of GPA and GRE score [see Appendix 1 for R code].
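For reference, the weights can be written out explicitly. The symbols are introduced here for illustration: $e_i$ is student $i$'s propensity score and $T_i$ is 1 for the study skills group and 0 otherwise. The first line restates the weights described above. The second line gives the weights that, as far as the twang documentation describes, are used when `estimand = "ATT"` is requested, as it is in Appendix 1; which of the two was applied to the reported results is not stated beyond that code.

```latex
\begin{align*}
  w_i^{\text{ATE}} &= \frac{T_i}{e_i} + \frac{1 - T_i}{1 - e_i} \\
  w_i^{\text{ATT}} &= T_i + (1 - T_i)\,\frac{e_i}{1 - e_i}
\end{align*}
```

Under the second scheme treated cases keep a weight of 1, so only the control group is reweighted to resemble the treatment group.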

Balance check

Prior to weighting, both GPA and GRE differed markedly between the study skills and no study skills groups. Specifically, the effect size for GRE was 0.328, and the effect size for GPA was 0.863. After weighting, the effect sizes for GRE and GPA were reduced to 0.062 and 0.302, respectively. While the GPA effect size of 0.302 was still larger than ideal, it was less than half of the 0.863 observed in the unweighted data. In addition, the mean GPA and GRE for the weighted control group were 3.67 and 310, respectively, which were much closer to the means of the treatment group (3.62 and 310) than the unweighted control group means were (3.79 and 313) [see Appendix 2 for R code].

Doubly robust estimation

The final step was to build a model using the weighted data in which treatment group membership was used to predict comprehensive examination scores, with GRE and GPA included as covariates to account for any confounding missed by the inverse probability weighting. As indicated by the positive coefficient, the study skills class improved comprehensive examination scores by 0.39 points after accounting for GPA and GRE scores (95% CI = 0.18–1.03) [see [Table 4] and Appendix 3 for R code].
Table 4: Results of doubly robust estimation


Summary

Since random assignment is sometimes impossible, and since curricula and admissions metrics often fluctuate, it is important for researchers to have tools to establish statistical control. This article introduces educators to one such tool. We hope that this review will enable and encourage educators to analyze previously problematic data and to produce more accurate estimates of the effects of their interventions, as both have implications for program quality improvement.

Financial support and sponsorship


Conflicts of interest

There are no conflicts of interest.

Appendixes

Appendix 1

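# Computing inverse probability weights
library(twang)    # ps(), bal.table(), get.weights()
library(ggplot2)  # density plot of the propensity scores
library(dplyr)
library(survey)   # svydesign(), svyglm()
library(knitr)    # kable()

# Data is assumed to be a data frame with a 0/1 Treatment indicator plus GRE, GPA, and Exam.
# Note: with estimand = "ATT", twang leaves treated cases at weight 1 and weights
# control cases by p/(1 - p); the 1/p and 1/(1 - p) weights correspond to estimand = "ATE".
ps.Data <- ps(Treatment ~ GRE + GPA, data = Data, n.trees = 15000, interaction.depth = 3,
              shrinkage = 0.001, perm.test.iters = 0, stop.method = c("es.mean", "ks.max"),
              estimand = "ATT", verbose = FALSE)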

Appendix 2

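# Assessing balance after weighting
# Propensity scores chosen by the es.mean stopping rule
Data$pscores <- ps.Data$ps$es.mean.ATT

# Overlap of the propensity score distributions by group
ggplot(data = Data, aes(pscores, fill = factor(Treatment))) +
  geom_density(alpha = 0.5) +
  theme_classic()

# Balance tables: unweighted, then weighted
Data.balance <- bal.table(ps.Data)
kable(Data.balance$unw)
kable(Data.balance$es.mean.ATT)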

Appendix 3

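# Doubly robust estimation
# Extract the weights chosen by the es.mean stopping rule
Data$weights <- get.weights(ps.Data, stop.method = "es.mean")

# Weighted outcome model that also includes the confounders as covariates
design.ps <- svydesign(ids = ~1, weights = ~weights, data = Data)
summary(svyglm(Exam ~ Treatment + GRE + GPA, design = design.ps))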

References

1. National Association of Boards of Pharmacy. New NAPLEX to Launch in November 2016. NABP Newsletter. Mount Prospect, IL: National Association of Boards of Pharmacy; 2016.
2. American Association of Colleges of Pharmacy. AACP Student Trend Data. First Professional Application Trends 1998-2015. Mount Prospect, IL: American Association of Colleges of Pharmacy; 2016.
3. Westreich D, Lessler J, Funk MJ. Propensity score estimation: Machine learning and classification methods as alternatives to logistic regression. J Clin Epidemiol 2010;63:826-33.
4. McCaffrey DF, Griffin BA, Almirall D, Slaughter ME, Ramchand R, Burgette LF. A tutorial on propensity score estimation for multiple treatments using generalized boosted models. Stat Med 2013;32:3388-414.
5. Austin PC. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behav Res 2011;46:399-424.
6. Funk MJ, Westreich D, Wiesen C, Stürmer T, Brookhart MA, Davidian M. Doubly robust estimation of causal effects. Am J Epidemiol 2011;173:761-7.

