RESEARCH METHODOLOGY
Year : 2022  |  Volume : 5  |  Issue : 3  |  Page : 120-124

Reduce, reuse, and recycle: Saving resources by repurposing data to address evaluation questions


1 Department of Educational Policy Studies and Evaluation, University of Kentucky College of Education, Lexington, USA
2 Department of Family and Community Medicine, University of Kentucky, Lexington, KY, USA

Date of Submission: 21-Jun-2022
Date of Acceptance: 02-Sep-2022
Date of Web Publication: 11-Nov-2022

Correspondence Address:
Dr. Shannon Oldham Sampson
Department of Educational Policy Studies and Evaluation, 131 Taylor Education Building, University of Kentucky, Lexington, KY 40503
USA

Source of Support: None, Conflict of Interest: None


DOI: 10.4103/EHP.EHP_14_22

Abstract

Introduction: In program evaluation, the time and energy of program leaders and participants are often at a premium, and survey fatigue is a methodological challenge. A key task for evaluators is to reduce survey fatigue and participant burden while still drawing on multiple sources of evidence to answer evaluation questions. Materials and Methods: Rasch modeling was used to build scales from Accreditation Council for Graduate Medical Education (ACGME) Milestone data. Results: Existing data for related constructs were repurposed to address evaluation questions with minimal threat to construct validity. Discussion: This work illustrates a novel way to reuse and recycle existing data to build scales using a psychometrically sound survey development and measurement approach. Conclusion: This work illustrates an innovative methodology to combat participant survey fatigue while including multiple sources of evidence to answer evaluation questions.

Keywords: Evaluation, Rasch, scaling


How to cite this article:
Sampson SO, Xia Y, Parsons JM, Cardarelli R. Reduce, reuse, and recycle: Saving resources by repurposing data to address evaluation questions. Educ Health Prof 2022;5:120-4

How to cite this URL:
Sampson SO, Xia Y, Parsons JM, Cardarelli R. Reduce, reuse, and recycle: Saving resources by repurposing data to address evaluation questions. Educ Health Prof [serial online] 2022 [cited 2023 Feb 4];5:120-4. Available from: https://www.ehpjournal.com/text.asp?2022/5/3/120/360984



The College of Medicine at a Southeast Conference University received a Primary Care Training and Enhancement (PCTE) initiative grant (Health Resources and Services Administration Grant Number: T0BHP30024) to enhance the training of residents (postgraduate trainees). The program sought to improve healthcare access, quality, and cost-effectiveness while also bettering provider work life. To pursue these aims, the College of Medicine made several additions to its resident training program, including incorporating new content on health disparities, team-based care, and population health analytics methodologies into its existing Quality Health Care Curriculum, and involving residents in existing quality improvement processes for marginalized populations. With the implementation of these objectives, the medical group hoped to (1) increase knowledge and confidence in working with health disparities, (2) increase confidence and understanding of how to work in interprofessional teams, and (3) increase residents’ ability to engage in quality improvement activities.

Following the departure of an internal evaluator at the end of the third year of the project, the principal investigator (PI) reached out to the authors of this piece to conduct an audit of the initial evaluation and to provide evaluation services for the remaining years of the grant. In conducting the audit, the evaluators learned that participants had been required to complete a large number of surveys as part of the original evaluation, many of which had been added over time, and the project leaders perceived that the collected data were not clearly linked to grant objectives and outcomes. Project leaders expressed concern about survey fatigue.[1],[2],[3] To address this concern, the evaluators conducted a crosswalk of all instruments and all evaluation questions and developed an inventory of additional existing data sources that were not designed for the purposes of the grant but might address the evaluation questions. One such data source was the Accreditation Council for Graduate Medical Education Family Medicine Milestones (ACGME Milestones), which residencies use to gauge the growth and development of trainees as they prepare to become independent physicians.

Although we recognize that modern validity theory in educational and psychological research stipulates that results should be interpreted in light of their intended purpose,[4] when the goal of repurposing data is to provide additional evidence on a construct conceptually similar to the one the existing data were collected to measure, the benefit of combating survey fatigue arguably outweighs the risks to validity. Using this process, particularly as part of an array of other indicators, evaluators can build a rich evidence story while decreasing the burden on participants.


Background


The evaluation team was charged with selecting measures that aligned well with these outcomes; thus, it was critical to have a solid understanding of how the program leaders conceptualized each area. For the purposes of this piece, we feature the process for one of the three areas: knowledge of working with health disparities.

Knowledge of working with health disparities

Through analysis of curricular gaps and population needs, the grant team determined that enhancements in training around health disparities should address mental disorders, social determinants of health, and healthcare in rural populations. All medical specialties within the program adapted their curriculum to create specialty-specific milestones for training health professionals. The curriculum was designed so the residents would learn to (1) apply principles of public health to improve the health of patients and populations; (2) apply principles of community engagement to improve the health of populations; (3) use critical thinking to improve the health of populations; and (4) use team and leadership skills to improve the health of populations. Several learning outcomes were also integrated across the curriculum. For example, residents would learn to review public health data and trends and integrate that knowledge into patient care, including attention to the special needs and assets of vulnerable populations; define a population for health assessment and improvement, and list the benefits of a practice registry that includes data on social determinants of health; access practice-specific data regarding a patient population; identify benchmarks for assessment of care or health status; interpret results of data analysis, and identify appropriate plan(s) for improvements; and show basic team skills in carrying out practice-based population health management.

It was with this understanding of population health, health disparities, and the approach to teaching these, that the evaluation team framed the measurement of increased knowledge of health disparities.

Rasch measurement

The Rasch model[5] is a framework for examining the properties of an instrument designed to measure one latent trait. The model assumes that the probability of a given person/item interaction is determined only by the "ability" of the person and the difficulty of the item on the latent variable; in other words, a person's performance on an item is independent of their responses to any other item. Rasch analysis builds measures, which are estimates of item difficulty and person ability, constructed at equal intervals along a linear structure. In Rasch models, items and persons conjointly define a common interval (logit) scale.[6] With a shared logit scale, the probability that any person will select a given rating category on any item can be calculated from the person's ability and how difficult the item is to endorse. The Rasch rating scale model[7] is a version of the Rasch model used when instrument items are polytomous:

$$P_{njm} = \frac{\exp \sum_{k=0}^{m}\left[\beta_n - (\delta_j + \tau_k)\right]}{\sum_{x=0}^{M} \exp \sum_{k=0}^{x}\left[\beta_n - (\delta_j + \tau_k)\right]},$$

where $P_{njm}$ is the probability that a person located at $\beta_n$, passing $m$ of the $M$ thresholds on an item ($j$) located at $\delta_j$, will respond at threshold $\tau_m$ (with $\tau_0 \equiv 0$). Different from other approaches to scaling, item difficulty measures and the resulting scales are independent of the sample on which they were calibrated.[8]
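To make the model concrete, the short sketch below (a minimal illustration in Python, not the Winsteps implementation used in this study; all values are hypothetical) computes rating scale model category probabilities for one person-item encounter.

```python
import numpy as np

def rating_scale_probs(beta, delta, thresholds):
    """Category probabilities under the Andrich rating scale model.

    beta       : person ability (logits)
    delta      : item difficulty (logits)
    thresholds : step thresholds tau_1..tau_M (logits)

    Returns probabilities for categories 0..M (they sum to 1).
    """
    steps = beta - delta - np.asarray(thresholds, dtype=float)
    # Exponent for category m is the cumulative sum of the first m step terms;
    # the category-0 exponent is 0 by convention (tau_0 = 0).
    exponents = np.concatenate(([0.0], np.cumsum(steps)))
    numerators = np.exp(exponents - exponents.max())  # subtract max for numerical stability
    return numerators / numerators.sum()

# A person 1 logit above an item of average difficulty, rated on a
# hypothetical 5-category (0-4) scale with four thresholds.
probs = rating_scale_probs(beta=1.0, delta=0.0, thresholds=[-2.0, -0.5, 0.5, 2.0])
print(np.round(probs, 3))
```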

Unidimensionality

One of the challenges in building a scale from an existing dataset is that the set of items must work together to build a unidimensional scale. Toland et al.[9] explain that for a scale to be considered unidimensional: (1) the variance explained by the measures should reach 50%, (2) the eigenvalue of the first contrast of the standardized residuals should be less than 2.0, and (3) the ratio of the variance explained by the Rasch dimension to the variance explained by the first contrast of the residuals should be high. Rasch analysis using Winsteps software[10] imposes a unidimensional structure, but also provides indicators of the extent to which the items indeed work together to form a scale.

When any of the dimensionality expectations do not hold, researchers can look to the principal components analysis of residuals (PCAR) to detect items clustering at high or low loadings. This analysis allows the researcher to examine whether groups of items share the same patterns of unexpectedness, particularly in contrast to the items most opposite to them in the analysis. When items show patterns that contrast meaningfully with their identified counterparts, they may share a common attribute. When there is no clustering in the first contrast, unidimensionality can be considered tenable. When a substantive secondary dimension is detected, the researcher must determine whether it is large enough to distort the measurement. Unlike classical test theory approaches to dimensionality, Rasch dimensionality is constructed by intention.[11] If the items have a theoretical framing, it is justifiable to leave them on the scale.
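For readers who want to see the mechanics behind these dimensionality checks, here is a minimal sketch; it assumes a dichotomous Rasch calibration with person measures and item difficulties already estimated (the study itself used Winsteps on polytomous milestone ratings), and the variance-explained figure is a simple approximation rather than Winsteps' exact accounting.

```python
import numpy as np

def pcar_diagnostics(X, theta, b):
    """Dimensionality diagnostics from a PCA of standardized Rasch residuals.

    X     : persons x items matrix of 0/1 responses
    theta : estimated person measures (logits)
    b     : estimated item difficulties (logits)
    """
    # Model expectations and variances under the dichotomous Rasch model
    E = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    W = E * (1.0 - E)
    Z = (X - E) / np.sqrt(W)                     # standardized residuals

    # Approximate percentage of observed variance explained by the measures
    explained = np.var(E)
    pct_explained = 100.0 * explained / (explained + W.mean())

    # First contrast: largest eigenvalue of the inter-item residual correlations,
    # in eigenvalue (item) units; values below 2.0 suggest unidimensionality is tenable
    first_contrast = np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False)).max()

    return pct_explained, first_contrast

# Hypothetical check against the criteria described above:
# pct, ev = pcar_diagnostics(X, theta, b)
# unidimensional = (pct >= 50.0) and (ev < 2.0)
```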

Reliability and separation

Reliability is the degree to which measures are free from measurement error. Rasch analyses provide a direct estimate of the modeled error variance for each estimate of a person's ability and an item's difficulty.[11],[12],[13] The size of the standard error is most strongly influenced by the number of observations used to make the estimate.[14] The Rasch model subtracts the average person measurement error variance from the observed person variance, yielding a person variance that is adjusted for measurement error and represents the "true" variance in the person measures. Person separation reliability is calculated as the ratio of true variance to observed variance and represents the proportion of variance that is not due to error. The person separation index compares the "true" spread of the person measures with their measurement error; the higher the person reliability, the more the persons are spread out on the variable, and higher values are generally better. Item separation reliability is also calculated by Winsteps software,[10] but has no equivalent in classical measures of instrument reliability; it indicates how well the persons separate the items. Typical targets are person reliability above 0.8 and item reliability above 0.9.
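A minimal sketch of these calculations, assuming person (or item) measures and their standard errors have already been estimated; the variable names and values are illustrative, not study data:

```python
import numpy as np

def separation_stats(measures, std_errors):
    """Rasch separation reliability and separation index for a set of measures.

    measures   : estimated person (or item) measures in logits
    std_errors : model standard error of each measure
    """
    observed_var = np.var(measures, ddof=1)        # observed variance of the measures
    error_var = np.mean(std_errors ** 2)           # average measurement error variance
    true_var = max(observed_var - error_var, 0.0)  # "true" variance, adjusted for error

    reliability = true_var / observed_var                  # proportion of variance not due to error
    separation = np.sqrt(true_var) / np.sqrt(error_var)    # "true" spread relative to error
    return reliability, separation

# Hypothetical person measures and standard errors
rel, sep = separation_stats(np.array([-1.2, -0.4, 0.1, 0.8, 1.5, 2.3]),
                            np.array([0.35, 0.30, 0.28, 0.30, 0.33, 0.40]))
print(f"reliability = {rel:.2f}, separation = {sep:.2f}")
```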

Fit

Rasch modeling not only produces equal-interval measures for persons and items; it also produces diagnostic indicators that can be used to judge scale quality. The degree to which useful measures can be calculated depends on how closely the data fit the model. Once the parameters of a Rasch model are estimated, response patterns are predicted, and statistics are computed that indicate the extent to which item and person response patterns fit the model. Items or persons whose fit statistics fall outside a given range are not productive for measuring the construct as defined by the model. Acceptable infit and outfit item indices range from 0.5 to 1.5.[15]
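As a sketch of how these fit statistics are typically computed, the code below assumes observed responses, model-expected values, and model variances are available as same-shaped arrays; none of this is study data.

```python
import numpy as np

def item_fit(X, E, W):
    """Infit and outfit mean-square statistics for each item.

    X : persons x items observed responses
    E : persons x items model-expected responses
    W : persons x items model variance of each response
    """
    sq_resid = (X - E) ** 2
    z_sq = sq_resid / W                            # squared standardized residuals

    outfit = z_sq.mean(axis=0)                     # unweighted: sensitive to outlying responses
    infit = sq_resid.sum(axis=0) / W.sum(axis=0)   # information-weighted: sensitive near item difficulty
    return infit, outfit

# Usage with hypothetical arrays X, E, W of identical shape:
# infit, outfit = item_fit(X, E, W)
# flagged = np.where((infit < 0.5) | (infit > 1.5) | (outfit < 0.5) | (outfit > 1.5))[0]
```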


Materials and Methods


Participants

The data used in this study were the ACGME Milestones[16] data for family medicine residents at the southeast university medical center. Residents were rated by faculty on the 2015 ACGME Milestones from 2015 to 2020. Residents included in the analyses were from 8 cohorts: (1) starting in spring 2015, (2) starting spring 2015, (3) starting spring 2015, (4) starting fall 2015, (5) starting fall 2016, (6) starting fall 2017, (7) starting fall 2018, and (8) starting fall 2019. Each cohort had six to seven residents, for a total of 51 residents. As stated by Linacre[17] and confirmed by Azizan et al.,[18] one needs as many items for a stable person measure as persons for a stable item measure. Consequently, 30 items administered to 30 persons (with reasonable targeting and fit) should produce statistically stable measures (±1.0 logits, 95% confidence); the numbers for these analyses were sufficient.

Data for this study were provided by the Graduate Medical Education Program Coordinator. Because this paper serves as a proof of concept, the evaluation team requested de-identified ACGME data from before the point when the current evaluation team began work with the grant and received a nonhuman subjects research waiver from the Institutional Review Board.

Instrumentation

Accreditation Council for Graduate Medical Education Family Medicine Milestones

ACGME Milestones[16] are a set of competencies developed to evaluate medical residents. These milestones were developed in October 2015 and were revised significantly in 2019 by ACGME. Residents in the program are evaluated biannually on the milestones by faculty, with each resident receiving a collective rating on each Milestone.

Every item is assigned a rating on five levels. If a resident has not achieved level 1 (the lowest level), the raters select the option "has not achieved level 1," which is scored as 0 on the rating scale. Otherwise, the rater selects either a whole-number rating as described in a rubric or a half-number rating. A whole-number rating indicates a resident is competent on all indicators at that level, as well as on the indicators at all lower levels. If the rater feels the resident is not competent on all indicators within the whole-number level, they can choose the response box between levels, which indicates partial competence at the higher level and full competence on the indicators at the lower level. The original ratings ranged from 0 to 5 in 0.5-unit increments. For analysis, the ratings were transformed to a 10-level rating scale.
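As an illustration only, one plausible recoding of the half-point ratings into ordered integer categories for Rasch analysis is to double them, as sketched below; the exact transformation used in the study is not detailed here, so treat this mapping as an assumption.

```python
import numpy as np

# Hypothetical milestone ratings on the original 0-5 scale in 0.5-point steps
raw_ratings = np.array([0.0, 1.0, 1.5, 2.0, 2.5, 3.5, 4.0, 5.0])

# Doubling gives each half-point step its own integer category,
# producing ordered integer codes (0 and 2 through 10) for the rating scale analysis.
recoded = (raw_ratings * 2).astype(int)
print(recoded)  # [ 0  2  3  4  5  7  8 10]
```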

Although the milestones are not designed around the outcomes of the grant, they do address general competencies that are closely related to the definition of addressing health disparities: medical knowledge (MK), patient care (PC), communication (C), practice-based learning and improvement (PBLI), professionalism (PROF), and systems-based practice (SBP).

Design and procedure

The evaluators first reviewed the ACGME Milestones and cross-walked specific competency areas that aligned with grant objectives related to addressing health disparities. This was done independently and then in conversation within the evaluation team. Next, the data were run through the Rasch measurement model to ensure the sets of items held together as unidimensional scales with acceptable fit and functioning. Winsteps software, version 4.45,[10] was used for this analysis. The following indicators were examined for diagnostics:

Dimensionality

In Winsteps,[10] Table 23 displays diagnostics for dimensionality. The dimensionality analysis stratifies the items into three clusters for each contrast (principal component); for each contrast, each person is measured on each cluster of items, and these measures are correlated for each pair of clusters. The dimensionality check includes examining the variance explained by the measures (which should approach 50%) and the unexplained variance in the first contrast (which should be less than 2 in eigenvalue units).

Fit

Winsteps[10] Table 13.1 displays the items, listed from the most difficult to the least difficult, along with their entry order. For each item, the table reports the total number of times the item was used and the total score, the sum of responses to that item across all respondents rated; the item measure, the estimate of item difficulty reported in logits; infit, an information-weighted statistic that is more sensitive to unexpected responses from persons near the item's difficulty level; and outfit, an outlier-sensitive (unweighted) statistic that is more sensitive to unexpected responses from persons far from the item's measure.

Infit and Outfit are listed with mean-square statistics. Mean square is a chi-square statistic divided by its degrees of freedom, with an expected value of 1. Values substantially less than one indicate dependency in the data and values substantially greater than one indicate unmodeled noise.

Finally, the point-measure correlation (PtMeasure-al in Winsteps output) is used to check whether the response scoring makes sense; negative point-measure correlations often indicate a miskeyed item or a reverse-scored response.
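Conceptually, the check is simply the correlation between responses to each item and the person measures; a brief sketch with illustrative (not study) data:

```python
import numpy as np

def point_measure_correlations(X, theta):
    """Correlation between responses to each item and the person measures.

    X     : persons x items matrix of observed ratings
    theta : person measures (logits)
    """
    return np.array([np.corrcoef(X[:, i], theta)[0, 1] for i in range(X.shape[1])])

# Hypothetical ratings for six persons on three items, the third item reverse-scored
X = np.array([[1, 2, 5],
              [2, 2, 4],
              [3, 4, 3],
              [4, 3, 2],
              [4, 5, 1],
              [5, 5, 1]])
theta = np.array([-1.5, -0.8, 0.0, 0.6, 1.2, 1.9])

corrs = point_measure_correlations(X, theta)
print(np.round(corrs, 2))   # a negative value flags the reverse-scored third item
```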


Results


From the crosswalk, the evaluation team identified six items from the ACGME Milestones that related to health disparities [Table 1]. These items were tested for whether they function as a unidimensional scale using the Rasch principal components analysis of residuals. The unexplained variance in the first contrast was 1.92, and the variance explained by the measures was 86.2%. The health disparities (HD) item reliability was 0.98 with a separation of 6.54; person reliability was 0.97 with a separation of 5.73. Infit MNSQ ranged from 0.64 to 1.49, and outfit MNSQ ranged from 0.66 to 1.63. Point-measure correlations were all at least 0.84 [Table 2].
Table 1: Addressing health disparities scale

Table 2: Addressing health disparities item quality indicators



Discussion


For all the scales, the eigenvalue of the first contrast of the residuals was less than 2 and the variance explained by the measures was greater than 50%, so the statistics supported the theoretical assignment of items to each of the scales. Most item mean-square infit and outfit values were between 0.5 and 1.5, indicating sufficient evidence of model fit. Only one item showed greater misfit than predicted by the model: SBP-3 (Advocates for individual and community health). Double-barreled items, which ask respondents to address more than one concept simultaneously (usually signaled by the word "and"), will at times show misfit. In this example, "individual health" and "community health" may be two competing concepts that could yield two different responses if they were split apart. From a conceptual standpoint, however, this item aligns closely with the definition of addressing health disparities, and its fit is not grossly outside the acceptable range. Thus, we argued this item should stay on the HD scale.

Limitations

First, this dataset included 45 residents who had been in the program for more than two semesters, which means those residents were entered more than once in the dataset. This may introduce dependency in the dataset.[19],[20] Residents were treated as independent cases because they were measured at different points in time and after an educational intervention had occurred. This allowed for a greater spread of persons across items for building the scale.

From a theoretical perspective, using existing data to measure new outcomes should also be viewed with caution. As is the case with the scales developed in this scenario, an underlying assumption must hold that the new unidimensional construct or constructs being measured are fully subsumed within the construct or constructs measured by the original instrument. Using preexisting instrumentation could threaten certain aspects of validity if parts of the construct are missing because of the focus of the prior data set. We recognize this limitation and, for this reason, contextualized these findings with other evidence in the evaluation study.


Conclusion


Although we feature only our work on the scale for addressing health disparities here, this process was used to develop other scales from these data for other grant objectives, including interprofessional teamwork and skills related to implementing quality improvement projects. We offer this exercise not to feature a new scale but as an example of how to create a new measure from existing data with a Rasch-based validity backing. The derived scale showed strong evidence of construct validity, giving the evaluation team confidence that these measures were well suited for gauging grant-related outcomes, particularly because they were reported alongside multiple sources of evidence rather than in isolation.

Although collecting survey data to assess self-perceptions of change in skills or abilities can provide useful information to those implementing initiatives or interventions, survey fatigue and sole reliance on self-reported change lead to serious methodological challenges when attempting to build robust evidence of an intervention's impact on participants. Furthermore, in accordance with Campbell's law,[21] using measures designed for the purpose of a study can introduce bias. This study provides an example of how to use existing data by aligning items or data conceptually with program objectives and assessing the quality of the newly constructed scale using a Rasch measurement approach.

Financial support and sponsorship

Not applicable.

Conflicts of interest

There are no conflicts of interest.



 
References

1. Ben-Nun P. Respondent fatigue. In: Lavrakas P, editor. Encyclopedia of Survey Research Methods. Thousand Oaks: Sage; 2008. p. 743-4.
2. Choi BC. Twelve essentials of science-based policy. Prev Chronic Dis 2005;2:A16.
3. Porter SR, Whitcomb ME, Weitzer WH. Multiple surveys of students and survey fatigue. New Dir Inst Res 2004;121:63-73.
4. Kane MT. Validating the interpretations and uses of test scores. J Educ Meas 2013;50:1-73.
5. Rasch G. Studies in Mathematical Psychology: I. Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Danmarks Pædagogiske Institut; 1960/1980.
6. Smith EV, Smith RM. Introduction to Rasch Measurement. Maple Grove: JAM Press; 2004.
7. Andrich D. A rating formulation for ordered response categories. Psychometrika 1978;43:561-73.
8. Bezruczko N. Rasch Measurement in the Health Sciences. Maple Grove: JAM Press; 2005.
9. Toland MD, Grisham J, Waddell M, Crawford R, Dueber DM. Scale evaluation and eligibility determination of a field-test version of the assessment, evaluation, and programming system – third edition. Top Early Child Sp Educ 2021;42:152.
10. Linacre JM. Winsteps Rasch Measurement Computer Program (Version 4.45). Portland, OR: Winsteps.com; 2020.
11. Wright BD. Rasch measurement models. In: Masters GN, Keeves JP, editors. Advances in Measurement in Educational Research and Assessment. New York: Pergamon; 1999.
12. Wright BD, Masters GN. Rating Scale Analysis. Chicago, IL: MESA Press; 1982.
13. Wright BD, Stone MH. Best Test Design. Chicago, IL: MESA Press; 1979.
14. Linacre M. Standard Errors: Model and Real. Winsteps [Internet]. Available from: https://www.winsteps.com/winman/standarderrors.htm. [Last accessed on 24 Sep 2022].
15. Bond TG, Fox CM, Lacey H. Applying the Rasch Model: Fundamental Measurement in the Social Sciences. 2nd ed. Mahwah, NJ: Erlbaum; 2007.
16. Andolsek K, Padmore J, Hauer KE, Holmboe E. A Guidebook for Programs. Chicago, IL: The Accreditation Council for Graduate Medical Education; 2015.
17. Linacre JM. Sample size and item calibration stability. Rasch Meas Trans 1994;7:328. Available from: https://www.rasch.org/rmt/rmt74m.htm. [Last accessed on 20 Apr 2022].
18. Azizan NH, Mahmud Z, Rambli A. Accuracy and bias of the Rasch rating scale person estimates using maximum likelihood approach: A comparative study of various sample sizes. J Phys Conf Ser 2021;2084:12006.
19. Marais I. Local dependence. In: Christensen KB, Kreiner S, Mesbah M, editors. Rasch Models in Health. Wiley-ISTE; 2013. p. 111-30.
20. Ip E. Testing for local dependency in dichotomous and polytomous item response models. Psychometrika 2001;66:109-32.
21. Campbell D. Assessing the impact of planned social change. Eval Program Plann 1979;2:67-90.



 
 