

Building a better evaluation system: Statistical caveats

To minimize the chance that value-added models misclassify teachers, policymakers need to understand the models’ statistical caveats. Value-added models are imprecise for several reasons:

Based on standardized tests. Even the best standardized tests do not measure a student’s actual achievement perfectly. No single test can measure every skill, and students’ performance can also be affected by how much sleep they got the night before or whether they are under stress. These errors are minimized when scores are averaged over a large number of students.
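A small simulation makes this concrete. The sketch below is illustrative only; the score scale, noise level, and class size are assumptions, not figures from the research cited here. It models each observed score as true achievement plus random day-to-day error and shows that a class average is far less noisy than any single score.

```python
import random

random.seed(1)

def observed_score(true_score, noise_sd=5.0):
    # One sitting of a test: true achievement plus random error from
    # factors like sleep, stress, and lucky guessing. noise_sd is illustrative.
    return true_score + random.gauss(0, noise_sd)

# A hypothetical class of 25 students with known "true" achievement.
true_scores = [random.gauss(250, 20) for _ in range(25)]

# The error in any one student's observed score can be sizable...
single_errors = [abs(observed_score(t) - t) for t in true_scores]
print("typical single-student error:", sum(single_errors) / len(single_errors))

# ...but individual errors largely cancel out in the class average.
avg_observed = sum(observed_score(t) for t in true_scores) / len(true_scores)
avg_true = sum(true_scores) / len(true_scores)
print("error of the class average:", abs(avg_observed - avg_true))
```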

Based on a small number of students. In statistics, a sample of 30 is typically considered the minimum needed to produce highly reliable results. Teachers, especially elementary school teachers, typically have classes of fewer than 30 students, and most teachers do not teach enough students in any one year for average test scores to be highly reliable (Economic Policy Institute 2010). A teacher’s value-added score is therefore often based on a small sample, which increases the chance of error because just one or two students can greatly affect the estimate of the teacher’s performance (Economic Policy Institute 2010; Goldhaber 2010). To minimize this impact, models should average a teacher’s scores over multiple years and incorporate multiple years of each student’s past test scores (McCaffrey, Sass and Lockwood 2008; Schochet and Chiang 2010).
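To see how much leverage a single student has, consider the sketch below. All of its numbers (class size, the distribution of gains, the size of the outlier) are made up for illustration; the point is only that one extreme score moves a one-year class average far more than it moves a multi-year average.

```python
import random

random.seed(2)

def class_gains(n=24):
    # Simulated test-score gains for one class; the parameters are illustrative.
    return [random.gauss(5.0, 8.0) for _ in range(n)]

gains = class_gains()
print("class average gain:", sum(gains) / len(gains))

# One student with a very bad year shifts the one-year estimate noticeably.
gains[0] -= 40
print("after one outlier:", sum(gains) / len(gains))

# Pooling three years of classes (about 72 students) dilutes the same outlier.
pooled = class_gains() + class_gains() + gains
print("three-year average with the same outlier:", sum(pooled) / len(pooled))
```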

Students and teachers are not randomly distributed. Teachers are not randomly assigned to the schools in which they teach, or even to the students they teach within a school, and students are not randomly assigned to teachers either. This lack of random assignment, the gold standard in research, makes it difficult to separate a teacher’s impact on students from unobserved factors such as a student’s motivation or the help the student receives at home (Economic Policy Institute 2010). If teachers and students were randomly assigned, such differences would likely even out with large enough sample sizes. Comparing teachers only within a single school, including more years of data, and including more data on each student are all ways to reduce this noise in value-added scores.
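The confounding problem can be sketched in a few lines of code. In the toy simulation below, every number is an assumption: two teachers have exactly the same true effect, but the unmeasured motivation of the students assigned to them differs, and raw average gains wrongly favor one teacher.

```python
import random

random.seed(3)

def simulate_class(avg_motivation, teacher_effect, n=30):
    # A student's gain = the teacher's true effect + the student's
    # unobserved motivation + random noise. All values are illustrative.
    return [teacher_effect + random.gauss(avg_motivation, 4.0) for _ in range(n)]

# Both teachers have the same true effect (+5 points)...
a = simulate_class(avg_motivation=3.0, teacher_effect=5.0)   # motivated students
b = simulate_class(avg_motivation=-3.0, teacher_effect=5.0)  # less-supported students

# ...but non-random assignment makes teacher A look better than teacher B.
print("teacher A average gain:", sum(a) / len(a))
print("teacher B average gain:", sum(b) / len(b))
```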

Isolating a single teacher’s impact. In today’s classrooms, many students are taught by several teachers in a single day. Many students also receive instruction from reading, speech, and other specialists throughout the school year (Economic Policy Institute 2010). Furthermore, more instruction is taking place across subject areas; reading instruction, for instance, may be incorporated into the social studies curriculum (Economic Policy Institute 2010). In addition, some students receive extra help from tutors (Economic Policy Institute 2010). Statisticians are still working on how to accurately isolate a single teacher’s impact while taking all these factors into account; in the meantime, crediting individual teachers with the success of the school as a whole is one way to lessen the problem.
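One approach discussed in the value-added literature, though not in this report, is "dosage" weighting: splitting a student's gain across teachers in proportion to instructional time. A minimal sketch, with entirely hypothetical weights:

```python
# Hypothetical example: split one student's 12-point gain across the
# adults who taught her, weighted by each one's share of instructional time.
gain = 12.0
dosage = {"classroom teacher": 0.6, "reading specialist": 0.25, "tutor": 0.15}

credit = {name: gain * share for name, share in dosage.items()}
print(credit)  # {'classroom teacher': 7.2, 'reading specialist': 3.0, 'tutor': 1.8}
```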

Missing data. In most data systems, some students’ past test scores are missing, either for technical reasons or because highly mobile students arrive without previous test scores. Such missing data can affect a teacher’s value-added score, since students with missing data tend to be lower performing. To minimize the impact, data systems need to get better at following students from year to year and from school to school. For students who arrive without previous test data, some value-added models substitute the average test score of the class or the grade for the missing scores.
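The substitution described above amounts to simple mean imputation. A minimal sketch, assuming None marks a missing prior-year score:

```python
def impute_with_class_mean(prior_scores):
    """Replace missing (None) prior-year scores with the class average,
    as some value-added models do. Illustrative only: mean imputation can
    itself bias results if scores are not missing at random."""
    observed = [s for s in prior_scores if s is not None]
    class_mean = sum(observed) / len(observed)
    return [class_mean if s is None else s for s in prior_scores]

print(impute_with_class_mean([240, 255, None, 262, None]))
# [240, 255, 252.33..., 262, 252.33...]
```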

Past teachers’ impact on current student performance. A teacher’s impact on students’ performance persists well after students leave that teacher’s classroom (Economic Policy Institute 2010). However, researchers are not certain how far into the future a past teacher’s influence reaches, and value-added models differ in how they account for it. Some assume the impact is constant over a student’s academic career; others assume it diminishes over time; still others do not account for it explicitly. The topic is hotly debated, and policymakers should make the decision they feel best represents what happens in their schools.
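The competing assumptions can be written down directly. In the sketch below, the 6-point prior-teacher effect and the 50 percent annual decay rate are illustrative stand-ins for parameters that actual models fix or estimate.

```python
def persistent(effect, years_later):
    # Assumption 1: a past teacher's effect persists undiminished.
    return effect

def geometric_decay(effect, years_later, rate=0.5):
    # Assumption 2: the effect fades by a fixed fraction each year.
    # (A third option, ignoring past teachers, sets the carryover to zero.)
    return effect * (rate ** years_later)

# How much of a 6-point effect each assumption carries into later years.
for t in range(4):
    print(t, persistent(6.0, t), geometric_decay(6.0, t))
```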

Summer loss. Because most value-added models use data from state assessments, which are administered just once a year rather than in both fall and spring, student scores also reflect how much knowledge students gained or lost over the summer. Failing to account for summer break can affect a teacher’s value-added score, particularly in low-income communities (Economic Policy Institute 2010). To minimize summer loss, testing could move to both fall and spring, and value-added scores could be based on the gain over the school year itself.
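The difference between the two testing calendars is simple arithmetic. The scores below are hypothetical:

```python
# Hypothetical scores for one student.
spring_grade3 = 250
fall_grade4 = 244    # 6 points lost over the summer
spring_grade4 = 262

# Spring-to-spring gain folds summer loss into the grade 4 teacher's year.
print("annual gain:", spring_grade4 - spring_grade3)     # 12

# Fall-to-spring gain isolates growth during the school year itself.
print("school-year gain:", spring_grade4 - fall_grade4)  # 18
```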


This report was written by Jim Hull, Center for Public Education Senior Policy Analyst.

Posted: March 31, 2011

