Don’t blame AI for the A-Levels scandal

The AI Delusion


By Gary Smith
August 21st 2020

Many years ago, when I was a young assistant professor of economics, I had to endure a minor hazing ritual—serving for one year on the admissions committee for the PhD program. As a newbie, I was particularly impressed by a glowing letter of recommendation that began, “This is the best student I have had in 30 years.” The applicant’s test scores were not off-the-charts, but the letter was number 1.

The dean who chaired the admissions committee year after year advised me to calm down, because this professor wrote a recommendation every year celebrating “the best student I have had in 30 years.” The committee had a chuckle at my expense.

I’ve now been teaching for nearly 50 years and I know firsthand the temptation teachers have to praise students generously. We want our students to succeed and we are happy to help.

Admissions committees inevitably take this puffery into account. For the professor who, year after year, identified one of his students as the best student in 30 years, we discounted the claim because of the recommender’s reputation and the strong-but-not-exceptional standardized test scores, but we did take into account the professor’s judgment that this student was the best applicant from his school that year. We deflated the level of the praise, but we paid attention to the rank ordering.

These memories came back recently with all of the hullabaloo over British A-Level test grades, which are used in the United Kingdom for university admissions. Because of the COVID-19 pandemic, the tests scheduled for summer 2020 were canceled and the government’s Office of Qualifications and Examinations Regulation (Ofqual) was given the thankless task of estimating what the grades (A, B, C, …) would have been on more than 700,000 subject tests that 275,000 students signed up for but did not take.

Ofqual collected two kinds of data from the students’ teachers:

A prediction of “the grade that student would have been most likely to achieve if teaching and learning had continued and students had taken their exams as planned.”
A rank-ordering of the students who were predicted to receive the same grade on a particular subject test.

The Ofqual team relied on plenty of research that supported my personal experience; for example, teachers are typically twice as likely to be too generous as to be too stingy. Specifically, teacher grade expectations are accurate about half the time, too optimistic one-third of the time, and too pessimistic one-sixth of the time.

If Ofqual had simply assigned each student the grades that the teachers had reported as their expectations, the percentage of tests receiving the highest possible grade (A*) would have increased from 7.7 percent in 2019 to 13.9 percent in 2020, the percentage receiving A or A* grades would have increased from 25.2 percent to 37.7 percent, and the percentage receiving grades of B or higher would have increased from 51.1 percent to 65 percent. Interviews with teachers also revealed that almost all had submitted predictions of how their students would have done on a “good day.”

The Ofqual team could have let it go at that, perhaps attaching disclaimers warning that the grades had been inflated by teacher generosity.

Instead, they made the politically dangerous decision to reduce many grades below teacher expectations in order to achieve a grade distribution comparable to previous years. The most obvious way to do this was by relying on the teachers’ rank orderings. An extreme case would be where generous teachers predicted a grade one level too high for every student, but made a perfect assessment of the rank order. Reducing every grade by one level would be a perfect solution.
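The extreme case can be sketched in a few lines of Python. This is only an illustration of the thought experiment, not Ofqual's procedure. The student names are made up, and the grade scale is the standard A-Level one (A* down to U).

```python
# A-Level grade scale, best to worst.
GRADES = ["A*", "A", "B", "C", "D", "E", "U"]

def shift_down_one(grade):
    """Lower a grade by one level; the bottom grade stays where it is."""
    i = GRADES.index(grade)
    return GRADES[min(i + 1, len(GRADES) - 1)]

# Hypothetical teacher predictions, each assumed to be one level too generous.
predicted = {"Ann": "A*", "Bob": "A", "Cat": "B"}
adjusted = {name: shift_down_one(g) for name, g in predicted.items()}
print(adjusted)
```

Every grade drops by one level, and the order of the students is unchanged.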

In practice, various studies have concluded that the correlation between predicted and actual grades is not perfect, but more like 0.8, which still suggests that the rank order assessments provide useful information for adjusting grades.

The Ofqual team did a detailed statistical analysis of a dozen different adjustment methods and eventually settled on a system of adjusting the scores on each subject test at each school up or down (usually down) so that the average score on the subject test would be comparable to previous years and also reflect the rank-ordering teachers had sent them. This meant, for example, that if a student was ranked in the 50th percentile among students taking a particular subject test at a school, and 50th percentile students in the past had received B grades, this student would be given a B grade, even if the teacher had reported an expectation of an A grade or a C grade. With teachers typically more generous than stingy, grades were more likely to be adjusted downward than upward.
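A minimal sketch of the rank-anchoring idea, assuming a simplified rule: each student's position in the teacher's ranking is converted to a fraction of the class, and the student receives whatever grade students at that position earned historically. The function name, the midpoint convention, and the sample distribution are illustrative assumptions, not Ofqual's actual model.

```python
def grades_from_ranks(ranked_students, historical_shares):
    """Assign grades from rank order alone.

    ranked_students: names ordered best-first (the teacher's ranking).
    historical_shares: (grade, share) pairs ordered best-first, summing to 1.
    """
    n = len(ranked_students)
    result = {}
    for position, student in enumerate(ranked_students):
        # Midpoint of this student's slot; 0.0 is the top of the class.
        rank_fraction = (position + 0.5) / n
        cumulative = 0.0
        assigned = historical_shares[-1][0]  # fallback: lowest grade listed
        for grade, share in historical_shares:
            cumulative += share
            if rank_fraction < cumulative:
                assigned = grade
                break
        result[student] = assigned
    return result

# Hypothetical school: in past years 25% got A, 50% got B, 25% got C.
history = [("A", 0.25), ("B", 0.50), ("C", 0.25)]
students = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]
print(grades_from_ranks(students, history))
```

The teachers' predicted grades never enter the calculation. Only the ranking and the school's history do, which is why a student in the middle of the ranking gets a B here whatever the teacher predicted.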

Ofqual also made a few tweaks to account for situations where average scores from previous years might be misleading. If the sample was small, then tying current scores to previous scores would be perilous. So, with 5 or fewer students, no adjustment was made and the teacher prediction was used as the final grade. With more than 15 students, the teachers’ grade predictions were ignored and only their rank orderings were used. In between, with 6 to 15 students, the final scores were a combination of historical scores and teacher predictions.
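The three size bands can be sketched as a weight on the teacher prediction. The thresholds (5 and 15) come from the description above. The linear taper between them is an assumption made for illustration, since the exact blending formula is not given here.

```python
def teacher_weight(n_students, small=5, large=15):
    """Weight given to teacher predictions for a cohort of n_students.

    1.0 at or below `small`, 0.0 above `large`, and an ASSUMED linear
    taper in between (the real blending rule may differ).
    """
    if n_students <= small:
        return 1.0
    if n_students > large:
        return 0.0
    return (large + 1 - n_students) / (large + 1 - small)

def blend(teacher_value, historical_value, n_students):
    """Combine a teacher-based figure with a history-based figure."""
    w = teacher_weight(n_students)
    return w * teacher_value + (1 - w) * historical_value

# Hypothetical share of A grades: teachers say 40%, history says 20%.
for n in (3, 6, 10, 15, 30):
    print(n, round(teacher_weight(n), 2), round(blend(0.40, 0.20, n), 3))
```

Under this assumption a class of 6 is graded almost entirely on teacher predictions and a class of 15 almost entirely on history, so small classes keep most of the benefit of teacher generosity.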

Despite Ofqual’s best efforts, there were problems.

First, the anchoring of the current grade distribution to the historical grade distribution made it very hard for high-achievers at low-scoring schools to get good grades. An extreme example would be a school where no one had previously received an A* grade on a particular subject test. It would not be possible for a current student to get an A* grade, no matter how talented the student was.
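A small sketch of that ceiling, assuming grades are handed out strictly according to the school's historical distribution. The function name and the numbers are hypothetical.

```python
def best_attainable_grade(historical_shares):
    """Return the best grade any student can receive when grades are
    anchored to history: the highest grade with a nonzero past share.

    historical_shares: (grade, share) pairs ordered best-first.
    """
    for grade, share in historical_shares:
        if share > 0:
            return grade
    return None  # no historical grades at all

# Hypothetical school where nobody has ever received an A* in this subject.
history = [("A*", 0.0), ("A", 0.10), ("B", 0.40), ("C", 0.50)]
print(best_attainable_grade(history))
```

Even the top-ranked student at this school is capped at an A, however strong the teacher's prediction.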

Second, the heavier weighting of teacher predictions for smaller sample sizes meant that students in small samples got the full benefit of teacher generosity, and this disproportionately benefitted students at elite schools who took tests in elite subjects such as classical Greek and the history of art. For ordinary students at ordinary schools who took tests in ordinary subjects, the teacher predictions were ignored.

Overall, scores went up slightly (propelled in part by the reliance on teacher predictions for the small samples), but what got the headlines was that 39 percent of the final grades were lower than teacher predictions. The outcry was not a surprise, nor was the argument that teachers know their students better than artificial intelligence (AI) algorithms.

I have written three books warning of the dangers of AI algorithms, but these grade reductions were not an example of AI run amok. It was Ofqual’s intention to adjust grades downward to account for teacher generosity and make 2020 grades comparable to 2019 grades, and they tried very hard to find a fair and reasonable way to do so.

AI algorithms are quite different. They typically search for patterns that will enable them to achieve specific goals, like matching photographs or playing board games. When the rules and objectives are clear and the task can be repeated a very large number of times, the successes may be astonishing, including the conquering of human experts at backgammon, checkers, chess, and Go.

When the rules and goals are ambiguous or in flux, however, AI algorithms can flop disastrously.

Change a few pixels in a photograph or change the dimensions of a Go board, and AI algorithms calibrated on different data do poorly. AI algorithms that use Facebook posts, Twitter likes, website visits, smartphone usage, and the like to screen job applicants, price car insurance, approve loan applications, and determine prison sentences should not be trusted.

I don’t know if Ofqual could have designed a better way to adjust for teacher generosity, but I do know that an AI data-mining algorithm would almost surely have done worse, possibly much worse.

Image by Stanley Morales via Pexels.

Gary Smith is the Fletcher Jones Professor of Economics at Pomona College. He received his Ph.D. He has won two teaching awards and written (or co-authored) more than eighty academic papers and twelve books including The Phantom Pattern Problem: The Mirage of Big Data, 9 Pitfalls of Data Science, Standard Deviations: Flawed Assumptions, Tortured Data, and Other Ways to Lie With Statistics, and What the Luck? The Surprising Role of Chance in Our Everyday Lives. His research has been featured by Bloomberg Radio Network, CNBC, The Brian Lehrer Show, Forbes, The New York Times, Wall Street Journal, Motley Fool, Newsweek, and BusinessWeek.
