How can we use bad measures in decisionmaking?


Reposted from my personal blog:

I had about 20 minutes of between-events time Thursday morning and used it to catch up on two interesting papers on value-added assessment and teacher evaluation--the Jesse Rothstein piece using North Carolina data and the Koedel-Betts replication-and-more with San Diego data.

Speaking very roughly, Rothstein used a clever falsification test: if the assignment of students to fifth grade is random, then you shouldn't be able to use fifth-grade teachers to predict test-score gains in fourth grade. At least with the set of data he used in North Carolina, you could predict a good chunk of the variation in fourth-grade test gains knowing who the fifth-grade teachers were, which means that a central assumption of many value-added models is problematic.
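To make the logic concrete, here is a minimal sketch of the falsification idea in Python. It is my illustration, not Rothstein's actual specification, and the data file and column names (gain_grade4, grade5_teacher_id) are hypothetical: regress fourth-grade gains on indicators for the student's future fifth-grade teacher and see whether those indicators explain anything.

```python
# Minimal sketch of a Rothstein-style falsification test (hypothetical data
# and column names, not the paper's actual model). Under close-to-random
# assignment, dummies for the *future* fifth-grade teacher should explain
# essentially none of the variation in the *prior* fourth-grade gain.
import pandas as pd
import statsmodels.formula.api as smf

# Assumed layout: one row per student.
#   gain_grade4        -- test-score gain in fourth grade
#   grade5_teacher_id  -- the teacher the student is assigned to in fifth grade
df = pd.read_csv("student_panel.csv")  # hypothetical file

falsification = smf.ols("gain_grade4 ~ C(grade5_teacher_id)", data=df).fit()

# A near-zero R-squared and an insignificant joint F-test are consistent with
# close-to-random assignment; a sizable R-squared signals sorting of students
# to teachers, which undermines a central value-added assumption.
print("R-squared:", falsification.rsquared)
print("Joint F-test p-value:", falsification.f_pvalue)
```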

Cory Koedel and Julian Betts's paper replicated and extended the analysis using data from San Diego. They were able to confirm with different data that using a single year's worth of data led to severe problems with the assumption of close-to-random assignment. They also claimed that using more than one year's worth of data smoothed out the problems.
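One way to see why multiple years can help, at least against noise that is unrelated from year to year, is that averaging independent single-year estimates shrinks their spread roughly in proportion to the square root of the number of years. The toy simulation below is my own illustration of that point (not Koedel and Betts's method), and it addresses only noise, not bias from systematic sorting.

```python
# Toy illustration (not the Koedel-Betts analysis): averaging several noisy
# single-year estimates of a teacher effect tightens their spread, provided
# the year-to-year errors are unrelated. It does nothing about bias from
# nonrandom assignment of students to teachers.
import numpy as np

rng = np.random.default_rng(0)
true_effect = 0.10   # hypothetical "true" teacher effect, in test-score SD units
noise_sd = 0.15      # hypothetical single-year estimation noise

for n_years in (1, 3, 5):
    # 10,000 simulated teachers, each observed for n_years.
    single_year = true_effect + rng.normal(0, noise_sd, size=(10_000, n_years))
    multi_year = single_year.mean(axis=1)
    print(f"{n_years} year(s): spread of estimates = {multi_year.std():.3f}")
# The spread falls roughly with the square root of the number of years.
```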

Apart from the specifics of this new aspect of the value-added measure debate, it pushed my nose once again into the fact that any accountability system has to deal with messy data.

Let's face it: we will never have data that are so accurate that we can worry about whether the basis for a measure is cesium or ytterbium. Generally, the rhetoric around accountability systems has been either "well, they're good enough and better than not acting" or "toss out anything with flaws." But we're getting some new approaches, or rather older approaches newly introduced into the national debate, as with the June Broader, Bolder Approach paper and this week's paper on accountability from the Education Equality Project.

Now that we have the response by the Education Equality Project to the Broader, Bolder Approach on accountability more specifically, we can see the nature of the debate taking shape. Broader, Bolder is pushing testing-and-inspections, while Education Equality is pushing value-added measures. Incidentally, or perhaps not, the EEP report mentioned Diane Ravitch in four paragraphs (the same number of paragraphs I spotted with references to President Obama) while including this backhanded, unfootnoted reference to the Broader, Bolder Approach:

While many of these same advocates criticize both the quality and utility of current math and reading assessments in state accountability systems, they are curiously blithe about the ability of states and districts to create a multi-billion dollar system of trained inspectors--who would be responsible for equitably assessing the nation's 95,000 schools on a regular basis on nearly every dimension of school performance imaginable, no matter how ill-defined.

I find it telling that the Education Equality Project folks couldn't bring themselves to acknowledge the Broader, Bolder Approach openly or the work of others on inspection systems (such as Thomas Wilson). Listen up, EEP folks: Acknowledging the work of others is essentially a requirement for debate these days. Ignoring the work of your intellectual opponents is not the best way to maintain your own credibility. I understand the politics: the references to Ravitch indicate that EEP (and Klein) see her as a much bigger threat than Broader, Bolder. This is a perfect setup for Ravitch's new book, whose title is modeled after Jane Jacobs's fight with Robert Moses. So I don't think in the end that the EEP gang is doing themselves much of a favor by ignoring BBA.

Let's return to the substance: is there a way to think coherently about using mediocre data that exist while acknowledging we need better systems and working towards them? I think the answer is yes, especially if you divide the messiness of test data into separate problems (which are not exhaustive categories but are my first stab at this): problems when data cover a too-small part of what's important in schooling, and problems when the data are of questionable trustworthiness.

Data that cover too little

As Daniel Koretz explains, no test currently in existence can measure everything in the curriculum. The circumscribed nature of any assessment may be tied to the format of a test (a paper-and-pencil test cannot assess the ability to look through a microscope and identify what's on a slide), to test specifications (which limit what a test measures within a subject), or to the subjects covered by a testing system. Some of the options:

  • Don't worry. Don't worry about or dismiss the possibility of a narrowed curriculum. Advantage: simple. Easy to spin in a political context. Disadvantage: does not comport with the concerns of millions of parents about a narrowed curriculum.
  • Toss. Decide that the negative consequences of accountability outweigh any use of limited-purpose testing. Advantage: simple. Easy to spin in a political context. Disadvantage: does not comport with the thoughts of millions of parents concerned about the quality of their children's schooling.
  • Supplement. Add more information, either by expanding the testing or by expanding the sources of information. Advantage: easy to justify in the abstract. Disadvantages: requires more spending for assessment purposes, either for testing or for the type of inspection system Wilson and BBA advocate (though inspections are not nearly as expensive as the EEP report claims without a shred of evidence). If the supplementation proposal is for more testing, this will concern some proportion of parents who do not like the extent of testing as it currently exists.

Data that are of questionable trustworthiness

I'm using the term trustworthiness instead of reliability because the latter is a term of art in measurement, and I mean the category to address how accurately a particular measure tells us something about student outcomes or any plausible causal connection to programs or personnel. There are a number of reasons why we would not trust a particular measure to be an accurate picture of what happens in a school, ranging from test conditions or technical problems to test-specification predictability (i.e., teaching to the test over several years) and the global questions of causality.

The debate about value-added measures is part of a longer discussion about the trustworthiness of test scores as an indication of teacher quality and a response to arguments that status indicators are neither a fair nor accurate way to judge teachers who may have very different types of students. What we're learning is a confirmation of what I wrote almost 4 years ago: as Harvey Goldstein would say, growth models are not the Holy Grail of assessment. Since there is no Holy Grail of measurement, how do we use data that we know are of limited trustworthiness (even if we don't know in advance exactly what those limits are)?

  • Don't worry. Don't worry about or dismiss the possibility of making the wrong decision from untrustworthy data. Advantage: simple. Easy to spin in a political context. Disadvantage: does not comport with the credibility problems of historical error in testing and the considerable research on the limits of test scores.
  • Toss. Decide that the flaws of testing outweigh any use of messy data. Advantage: simple in concept. Easy to spin in a political context. Easy to argue if it's a partial toss justified for technical reasons (e.g., small numbers of students tested). Disadvantage: does not comport with the thoughts of millions of parents concerned about the quality of their children's schooling. More difficult in practice if it's a partial toss (i.e., if you toss some data because a student is an English language learner, because of small numbers tested, or for other reasons).
  • Make a new model. Growth (value-added) models are the prime example of changing a formula in response to concerns about trustworthiness (in this case, global issues about achievement status measures). Advantage: makes sense in the abstract. Disadvantage: more complicated models can undermine both transparency and understanding, and claims about superiority of different models become more difficult to evaluate as the models become more complex. There ain't no such thing* as a perfect model specification.
  • Retest, recalculate, or continue to accumulate data until you have trustworthy data. Treat testing as the equivalent of a blood-pressure measurement: if you suspect that a measurement is not to be trusted, take it again, a few minutes later for blood pressure, a few months or another year later for a student's testing. Advantage: can wave hands broadly and talk about "multiple years of data" and refer to some research on multiple years of data. Disadvantage: Retesting/reassessment works best with a certain density of data points, and the critical density will depend on context. This works with some versions of formative assessment, where one questionable datum can be balanced out by longer trends. It's more problematic with annual testing, for a variety of reasons, though accumulating several years of data can reduce uncertainties.
  • Model the trustworthiness as a formal uncertainty. Decide that information is usable if there is a way to accommodate the mess. Advantage: makes sense in the abstract. Disadvantage: The choices are not easy, and there are consequences to whichever way of modeling uncertainty you choose: adjusting cut scores/data presentation by measurement/standard errors, using fuzzy-set algorithms, Bayesian reasoning, or political mechanisms to reduce the influence of a specific measure when trustworthiness decreases. One minimal version is sketched just after this list.
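To give a concrete feel for that last option, here is a toy decision rule of my own construction (not a proposal from any of the reports discussed here): attach an error band to each estimate and withhold a high-stakes label whenever the band straddles the cut score.

```python
# Toy illustration of treating untrustworthiness as formal uncertainty (my own
# construction): classify against a cut score only when the estimate's error
# band clears it; otherwise declare the result indeterminate.
from dataclasses import dataclass

@dataclass
class Measure:
    estimate: float   # e.g., a school's or teacher's value-added estimate
    std_error: float  # standard error or other assessed uncertainty

def classify(measure: Measure, cut_score: float, z: float = 1.96) -> str:
    """Label only when the roughly 95% band clears the cut score."""
    lower = measure.estimate - z * measure.std_error
    upper = measure.estimate + z * measure.std_error
    if lower > cut_score:
        return "above cut score"
    if upper < cut_score:
        return "below cut score"
    return "indeterminate: gather more data or reduce this measure's weight"

# Same point estimate, very different trustworthiness, different labels.
print(classify(Measure(estimate=0.05, std_error=0.01), cut_score=0.0))  # above
print(classify(Measure(estimate=0.05, std_error=0.10), cut_score=0.0))  # indeterminate
```

The same basic move generalizes to the other devices mentioned in that last item: fuzzy memberships or Bayesian posteriors in place of the simple error band, or a rule that shrinks a measure's weight in the overall decision as its trustworthiness declines.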

Even if you haven't read Accountability Frankenstein, you have probably already sussed out my view that both "don't worry" and "toss" are poor choices in addressing messy data. All other options should be on the table, usable for different circumstances and in different ways. Least explored? The last idea, modeling trustworthiness problems as formal uncertainty. I'm going to part from measurement researchers and say that the modeling should go beyond standard errors and measurement errors, or rather head in a different direction. There is no way to use standard errors or measurement errors to address issues of trustworthiness that go beyond sampling and reliability issues, or to structure a process to balance the inherently value-laden and political issues involved here.

The difficulty in looking coldly at messy and mediocre data generally revolves around the human tendency to prefer impressions of confidence and certainty over uncertainty, even when a rational examination and background knowledge should lead one to recognize the problems in trusting a set of data. One side of that coin is an emphasis on point estimates and firmly-drawn classification lines. The other side is to decide that one should entirely ignore messy and mediocre data because of the flaws. Neither is an appropriate response to the problem.

* A literary reference (to Heinlein), not an illiteracism.