JOHN LAWRENCE: Student evaluations of teaching are not valid

by John W. Lawrence
June 28, 2020

In a review of the literature on student evaluations of teaching (SET), Philip B. Stark and Richard Freishtat—of the University of California, Berkeley, statistics department and the Center for Teaching and Learning, respectively—concluded, “The common practice of relying on averages of student teaching evaluation scores as the primary measure of teaching effectiveness for promotion and tenure decisions should be abandoned for substantive and statistical reasons: There is strong evidence that student responses to questions of ‘effectiveness’ do not measure teaching effectiveness.” This is a startling conclusion, given that SET scores are the primary measure that many colleges and universities use to evaluate professors’ teaching. But a preponderance of evidence suggests that average SET scores are not valid measures of teaching effectiveness.

There are many statistical problems with SET scores. The response rate for student evaluations is often low, and there is no reason to assume that the response pattern of students who skip the surveys would resemble that of students who complete them. Some colleges assume that a low response rate is the professor's fault; however, no basis exists for this assumption. Average SET scores in small classes are also more heavily influenced by outliers, luck, and error. Moreover, SET scores are ordinal categorical variables: students rate teaching on a scale from one (poor) to seven (great). Stark and Freishtat point out that these numbers are labels, not values. We cannot assume the difference between one and two is the same as the difference between five and six, so it does not make statistical sense to average categorical variables.
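A small sketch (with hypothetical ratings, not data from the article) shows why averaging ordinal labels is problematic: the ranking of two professors by mean score can flip under a relabeling that preserves the order of the categories but changes their arbitrary numeric spacing.

```python
# Illustration: means of ordinal labels are not invariant under monotone
# relabelings, so ranking professors by mean SET score depends on an
# arbitrary assumption about the spacing between categories.
def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical ratings on the 1 (poor) to 7 (great) scale.
prof_a = [3, 3, 6, 6]   # mean 4.50
prof_b = [5, 5, 5, 5]   # mean 5.00 -> B "beats" A

# A monotone relabeling: category order is preserved, only the
# (arbitrary) numeric spacing changes.
relabel = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 10, 7: 11}

a2 = [relabel[r] for r in prof_a]  # mean 6.50
b2 = [relabel[r] for r in prof_b]  # mean 5.00 -> A now "beats" B

print(mean(prof_a), mean(prof_b))  # 4.5 5.0
print(mean(a2), mean(b2))          # 6.5 5.0
```

Order-based summaries such as the median do not suffer this reversal, which is one reason statisticians prefer them for ordinal data.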

Even if SET score averages were statistically meaningful, they cannot be compared with other scores, such as the departmental average, without knowing the distribution of scores. In baseball, for example, if you don't know the distribution of batting averages, you can't know whether the difference between a .270 and a .300 batting average is meaningful. It likewise makes no sense to compare SET scores across very different classes, such as a small physics course and a large lecture class on Shakespeare and hip-hop.
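The point about distributions can be made concrete with a sketch (all numbers hypothetical): the same 0.3-point gap between a professor's mean score and the departmental mean of 5.0 is enormous in a department where scores cluster tightly, and negligible in one where scores are widely spread.

```python
# Illustration: a fixed gap above the departmental mean means very
# different things depending on the spread of departmental scores.
import statistics

dept_tight = [4.9, 5.0, 5.0, 5.1, 5.0, 4.95, 5.05]  # little spread, mean 5.0
dept_wide  = [3.0, 4.0, 5.0, 6.0, 7.0, 4.5, 5.5]    # wide spread,  mean 5.0

prof_score = 5.3  # 0.3 above the departmental mean in both cases

z_scores = []
for dept in (dept_tight, dept_wide):
    mu = statistics.mean(dept)
    sd = statistics.pstdev(dept)      # population standard deviation
    z = (prof_score - mu) / sd        # gap measured in standard deviations
    z_scores.append(z)
    print(f"mean={mu:.2f} sd={sd:.2f} z={z:+.2f}")
```

In the tight department the professor is several standard deviations above the mean; in the wide one, well under half a standard deviation. A raw "0.3 above average" conveys neither fact.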
