Beyond the noise: Accurate measurement for equity-focused policy


The following article is authored by Brian Gill and Jennifer Lerner.

 


 

  • Reliable measures for historically disadvantaged groups are essential for gaining insights into equity outcomes and assessing the performance of providers or interventions.
  • Random variation can lead to the misidentification of the lowest performers, and standard policy solutions sacrifice equity insights for measurement accuracy.
  • Bayesian stabilisation is a statistical technique that enhances measurement reliability for small groups, promoting both accuracy and equity.

 


Promoting equity through policy requires good measurement of outcomes for historically disadvantaged groups. Indeed, measurement of subgroup outcomes is sometimes a central element of an equitable improvement strategy. In primary and secondary education policy in the United States, for example, student outcomes for disadvantaged subgroups are used to measure the performance of every public school; and schools where a subgroup is falling short are identified for improvement. Similarly, as achieving equity in health outcomes has risen on policy agendas, health-care systems have increasingly sought to measure subgroup outcomes for hospitals and other health-care providers. These examples from education and health care share an approach that aims to promote equitable outcomes by measuring inequities and using those measurements to inform interventions and funding decisions. When measurement is used for management or policy, getting the measurement right is critically important. But policy makers may not know that random variation in measured outcomes can lead them astray – and random error is particularly problematic when examining individual providers or subgroups, which by definition involve smaller numbers of participants.

 

Random error – also known as statistical “noise” – is a serious threat to outcomes-focused management and policy, especially when attention and consequences are focused on the lowest-performing providers. For any particular measurement, the providers at the bottom of a ranked list may have been simply unlucky rather than truly low-performing. A couple of decades ago, Kane & Staiger (2002) showed that a substantial fraction of the variation in school performance measures is random, particularly when small numbers of students are involved and particularly for schools at the top or the bottom of the performance distribution. And yet humans have a natural tendency to see patterns where none exist, reading imaginary meaning into data that are actually random flukes (for a review of human judgment and decision making, see Fischhoff & Broomell, 2020). Moreover, because measured scores on high-stakes outcomes often trigger strong emotions, happiness over seemingly positive outcomes or anger over negative outcomes may further diminish the capacity for distinguishing real patterns from random variation (for a review of emotion effects on judgment and decision making, see Lerner, Li, Valdesolo & Kassam, 2015).
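To illustrate, here is a minimal simulation sketch in Python (our own illustration, using assumed group sizes and a single assumed true proficiency rate, not data from any of the cited studies): even when every school performs identically, the schools at the bottom of a ranked list are disproportionately the small ones, purely because of sampling noise.

```python
# Sketch of the Kane & Staiger point: give every school the SAME true
# proficiency rate, add only sampling noise, and the "worst" schools in a
# ranked list still turn out to be disproportionately the small ones.
import random

random.seed(1)
TRUE_RATE = 0.6  # assumed identical true proficiency for every school

schools = []
for i in range(500):
    n = random.choice([15, 30, 60, 120, 240])  # assumed size of the measured group
    passes = sum(random.random() < TRUE_RATE for _ in range(n))
    schools.append({"id": i, "n": n, "measured_rate": passes / n})

ranked = sorted(schools, key=lambda s: s["measured_rate"])
bottom = ranked[:20]  # the 20 lowest measured rates

avg_n_all = sum(s["n"] for s in schools) / len(schools)
avg_n_bottom = sum(s["n"] for s in bottom) / len(bottom)
print(f"Average group size, all schools:        {avg_n_all:.0f}")
print(f"Average group size, 20 'worst' schools: {avg_n_bottom:.0f}")
# No school is truly better or worse than any other, yet the bottom of the
# list is filled mostly with small schools that were simply unlucky.
```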

 

As mentioned, focusing on subgroups exacerbates the problem of noise, because the numbers (of students, patients or clients) are smaller. If there are multiple subgroups of interest (such as racial/ethnic minorities, clients with disabilities or low-income clients), the risk of erroneously identifying a service provider, programme or intervention as low-performing (or high-performing) based on random chance increases further.

 

A solution that creates tension between accuracy and equity

 

Some readers may wonder if an easy solution is already in place. Recognising that small groups are especially susceptible to random error, some policy makers set minimum group sizes, below which outcomes are not measured. Under United States education law, for example, every state sets a minimum number of students in a group for inclusion in school performance measures. Setting a threshold for minimum group size, however, requires policy makers to trade accuracy against equity: a larger minimum reduces the effect of random error and the corresponding risk of mistakenly identifying a group as low performing, promoting accuracy. But increased accuracy comes at the cost of ignoring the performance of groups that happen to be small in any particular school, undermining the commitment to equity. A smaller minimum, conversely, ensures that outcomes are measured for more students, at the cost of increasing the likelihood of misidentifying the schools and subgroups that need help, drawing unneeded attention to providers or groups that are fleetingly unlucky rather than truly low performing. This accuracy-equity trade-off may be the reason that states have not reached a consensus on the minimum group size used for measuring school performance, which varies from 10 students at the low end to 30 at the high end.
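The arithmetic behind the trade-off can be sketched in a few lines of Python (a back-of-the-envelope illustration with an assumed true proficiency rate, not a claim about any particular state's rules): the standard error of a measured proficiency rate shrinks only with the square root of the group size, so a threshold of 10 students tolerates far more random fluctuation than a threshold of 30.

```python
# Rough illustration of why minimum group size matters: the standard error
# of a proportion measured on n students is sqrt(p * (1 - p) / n).
import math

def standard_error(p: float, n: int) -> float:
    """Standard error of a proficiency rate p measured on a group of n students."""
    return math.sqrt(p * (1 - p) / n)

TRUE_RATE = 0.5  # assumed true proficiency rate, chosen only for illustration
for n in (10, 30, 100):
    se = standard_error(TRUE_RATE, n)
    # Roughly 95% of purely random fluctuations fall within about +/- 2 standard errors.
    print(f"n = {n:>3}: standard error = {se:.3f} (about +/- {2 * se:.0%} by chance alone)")
```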

 

A solution that simultaneously promotes accuracy and equity

 

Fortunately, Bayesian stabilisation is a statistical technique that offers policy makers a better solution, simultaneously serving accuracy and equity rather than trading them off against each other (see, for example, Forrow, Starling & Gill, 2023). Bayesian statistical methods can substantially reduce random error and increase accuracy by borrowing information from the full distribution of performance, from the provider’s own historical performance or from other measures of the provider’s performance. In essence, Bayesian methods recognise that outliers – measurements that are far from typical for the provider or for providers in general – are likely to be driven in part by random error, and that small groups are especially susceptible to random error. Bayesian stabilisation pulls in the outliers, with the amount of adjustment related to the precision of the estimate, which is in turn affected by the size of the group.
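The core idea can be sketched in a few lines of Python (a simplified, precision-weighted illustration with assumed inputs, not the exact model used in the studies cited below): each provider's observed rate is blended with the overall mean, and the smaller the group, the more the estimate is pulled towards that mean.

```python
# Simplified sketch of Bayesian (empirical-Bayes-style) stabilisation:
# blend a provider's observed rate with the overall mean, weighting by the
# precision of the observed rate. Small groups get pulled further in.
def stabilise(observed_rate: float, n: int, overall_mean: float,
              between_provider_var: float, within_var: float) -> float:
    """Precision-weighted blend of a provider's observed rate and the overall mean."""
    sampling_var = within_var / n  # sampling noise shrinks as the group grows
    weight = between_provider_var / (between_provider_var + sampling_var)
    return weight * observed_rate + (1 - weight) * overall_mean

# Illustrative (assumed) numbers: overall mean of 0.60, modest true spread
# across providers, and one provider measured at 0.35.
for n in (10, 40, 200):
    est = stabilise(observed_rate=0.35, n=n, overall_mean=0.60,
                    between_provider_var=0.005, within_var=0.24)
    print(f"group of {n:>3}: observed 0.35 -> stabilised {est:.2f}")
# The estimate for the group of 10 moves most of the way towards the overall
# mean; the estimate for the group of 200 stays close to what was observed.
```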

 

Many studies have demonstrated that stabilised results are more accurate (on average) than unstabilised results (Efron & Morris, 1977). Researchers have shown how Bayesian stabilisation can improve measures of local performance in schools (Forrow, Starling & Gill, 2023). The same principles apply in health-care settings (Vollmer, Finucane & Brown, 2018). Indeed, the same methods could be applied to any policy domain where decisions about providers or programmes depend on measures of their performance – improving the accuracy of those performance measures and reducing the risk of making decisions based on unreliable data.

 

When policy makers seek to reduce historical inequities in outcomes, Bayesian stabilisation is especially valuable in ensuring that inequities are accurately measured, informing the efficient allocation of resources and attention. Education administrators need to know which schools truly need the most help in improving outcomes for disadvantaged students; health administrators need to know which medical practices and hospitals have the most room to improve in serving historically disadvantaged patient populations; policy makers, more broadly, need to know which interventions and programmes work for whom. Bayesian stabilisation can provide that information. And, if widely used and effectively communicated, Bayesian stabilisation might serve an educational function as well, by helping decision makers understand how, in the absence of stabilisation, it is all too easy to be fooled by randomness.

 


 

References

 

Efron, B., & Morris, C. (1977). Stein’s paradox in statistics. Scientific American, 236(5), 119–127. https://www.scientificamerican.com/article/steins-paradox-in-statistics/

 

Fischhoff, B., & Broomell, S. B. (2020). Judgment and decision making. Annual Review of Psychology, 71, 331-355. https://doi.org/10.1146/annurev-psych-010419-050747

 

Forrow, L., Starling, J., & Gill, B. (2023). Stabilizing subgroup proficiency results to improve the identification of low-performing schools (REL 2023-001). U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance. https://ies.ed.gov/ncee/rel/Products/Publication/106926

 

Kane, T. J., & Staiger, D. O. (2002). The promise and pitfalls of using imprecise school accountability measures. Journal of Economic Perspectives, 16(4), 91-114. https://www.aeaweb.org/articles?id=10.1257/089533002320950993

 

Lerner, J. S., Li, Y., Valdesolo, P., & Kassam, K. S. (2015). Emotion and decision making. Annual Review of Psychology, 66, 799-823. https://doi.org/10.1146/annurev-psych-010213-115043

 

Vollmer, L., Finucane, M., & Brown, R. (2018). Revolutionizing estimation and inference for program evaluation. Evaluation Review, 44(4). https://doi.org/10.1177/0193841X18815817

 


 

Brian Gill is a Senior Fellow at Mathematica and Director of the Mid-Atlantic Regional Educational Laboratory.

 

Jennifer Lerner is a Professor of Public Policy and Decision Science at the Harvard Kennedy School. Drawing insights from psychology, economics, and neuroscience, her research examines human judgment and decision making.

 

The facts, ideas and opinions expressed in this piece are those of the authors; they are not necessarily those of UNESCO or any of its partners and stakeholders and do not commit nor imply any responsibility thereof. The designations employed and the presentation of material throughout this piece do not imply the expression of any opinion whatsoever on the part of UNESCO concerning the legal status of any country, territory, city or area or of its authorities, or concerning the delimitation of its frontiers or boundaries.
