
That very clean glass wall won’t hold itself up. Photo by Dogboy82 – Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=44203685
Strictly Come Dancing, one of the BBC’s most popular shows, involves celebrities moving in specific ways with experts at moving in specific ways while other experts check if they’re moving specifically enough, and it contains certainties and uncertainties. We’re not sure who will be voted out in any particular week. We don’t know what the audience are going to complain about. An injured woman not dancing! I was furious with rage! We do know that Craig Revel Horwood will use the things he knows to make a decision about whether he likes a dance or not while saying something mean. We can be pretty sure what Len Goodman’s favourite river in Worcestershire, film starring Brad Pitt and Morgan Freeman, and Star Trek: Voyager character is. But can we be sure that the scores awarded by the judges to the dancers are accurate and fair?
In science, a good scoring system has at least three qualities: validity (it measures what it’s supposed to measure), usability (it’s practical) and reliability (it’s consistent). It’s difficult to assess the extent to which the scoring system in Strictly Come Dancing possesses these qualities. We don’t really know the criteria (if any) that the judges use to assign their scores, other than that they occasionally involve knees not quite being at the right angle, shoulders not quite being at the right height, and shirts not quite being able to be done up. As such, deciding whether the scores are valid or not is tricky. The scoring system appears to be superficially usable, in that people use it regularly in the time it takes for a person to walk up some stairs and talk to Claudia Winkleman about whether they enjoyed or really enjoyed the kinetic energy they just transferred. In some ways, checking reliability is easier, especially if we have a way to access every score the judges have ever awarded. And we do. Thanks, Ultimate Strictly!
For a test to be reliable, we need it to give the same score when it’s measuring the same thing under the same circumstances. If the same judge saw the same dance twice under consistent conditions, we’d expect that dance to get the same score. This sort of test-retest reliability is difficult to achieve with something like Strictly Come Dancing. The judges aren’t really expected to provide scores for EXACTLY the same dance more than once; otherwise you’d end up getting the same comments all the time, which would be as difficult to watch as the rumba is for men to dance. Ahem. However, you can look at how consistently (reliably) different judges score the same dance. If all judges consistently award similar scores to the same dances, then we can be more confident that the system for scoring dancing is reliable between raters. If judges consistently award wildly different scores for the same dances, we might be more convinced that they’re just making it up as they go along, or “Greenfielding it” as they say in neuroscience.
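If you fancy checking that sort of between-rater agreement yourself, a minimal sketch is below. It uses pandas and a handful of made-up scores (definitely not the real Ultimate Strictly data) purely to show the shape of the calculation; Spearman correlation is one reasonable choice for ordinal scores, though not the only one.

```python
# A minimal sketch of checking agreement between judges, using a few
# made-up scores purely for illustration (these are NOT the real data).
import pandas as pd

# One row per dance, one column per judge.
scores = pd.DataFrame({
    "Craig":  [6, 7, 5, 8, 9, 4],
    "Len":    [7, 8, 7, 9, 9, 6],
    "Bruno":  [7, 8, 6, 9, 10, 6],
    "Arlene": [6, 8, 6, 8, 9, 5],
})

# Spearman correlation suits ordinal scores: high values mean the judges
# rank the same dances similarly, even if one judge is consistently stingier.
print(scores.corr(method="spearman").round(2))
```

Worth noting: a high correlation only tells us the judges rank dances in a similar order; a judge who sits a point below everyone else across the board would still correlate perfectly, which is exactly the sort of difference that turns up later.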
To test this, all scores from across all series (except the current series, Christmas specials and anything involving Donny Osmond as a guest judge) were collated and compared. Below, we can see that by and large the judges have fairly similar median scores (Arlene Phillips and Craig = 7; Len, Bruno Tonioli, Alesha Dixon and Darcey Bussell = 8). The main differences appear to be in the range of scores, with Craig and Arlene appearing to use a more complete range of the possible scores.

Box plot (shows median scores, inter-quartile ranges, maximum and minimum scores for each judge)
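For anyone wanting to reproduce that sort of plot, a sketch follows. It assumes the collated scores are in a long format (one row per judge per score) and, again, uses invented numbers rather than the real archive.

```python
# A sketch of drawing a per-judge box plot; the scores here are invented.
import pandas as pd
import matplotlib.pyplot as plt

long_scores = pd.DataFrame({
    "judge": ["Craig"] * 4 + ["Len"] * 4 + ["Arlene"] * 4,
    "score": [4, 6, 7, 9, 7, 8, 8, 9, 5, 7, 7, 9],
})

# A box plot shows the median, inter-quartile range and extremes per judge.
long_scores.boxplot(column="score", by="judge")
plt.ylabel("Score")
plt.show()
```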
A similar picture is seen if we use the mean score as an average, with Craig (mean score = 6.60) awarding lower scores than the other judges, whose mean scores range from 7.05 (Arlene) to 7.65 (Len and Darcey). Strictly speaking (ironically), we shouldn’t be using the mean as an average for the dance scores. The dance scores can be classified as ordinal data (scores can be ordered, but there is no evidence that the difference between consecutive scores is equal), so many would argue that any mean value calculated is utter nonsense, meaningless, or, to put it more charitably, not an optimum method for observing central tendency. However, I think in this situation there are enough scores (9) for the mean to be useful; like the complete and utter measurement transgressor that I am. At first glance, these scores don’t look too different and we might consider getting out the glitter-themed cocktails and celebrating the reliability of our judges.

Bar chart showing mean scores and variance for each judge.
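For completeness, the same long-format table makes it easy to put the mean, median and variance side by side, which is a handy way of seeing how much (or how little) the choice of average actually matters. The numbers below are, once more, invented for illustration.

```python
# Mean, median and variance per judge, on the same invented long-format data.
import pandas as pd

long_scores = pd.DataFrame({
    "judge": ["Craig"] * 4 + ["Len"] * 4,
    "score": [4, 6, 7, 9, 7, 8, 8, 9],
})

stats = long_scores.groupby("judge")["score"].agg(["mean", "median", "var"])
print(stats)
# Note: the mean treats the gap between a 4 and a 5 as identical to the gap
# between a 9 and a 10, which is exactly the assumption ordinal data doesn't support.
```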
In order to test the hypothesis that there was no real effect of “judge” on dance scores, I did a statistics at the data. In this case a Kruskal-Wallis test, because of the type of measures in use (one independent variable of ‘judge’, divided into different levels of ‘different judges’, and one dependent variable of ordinal data, the score). And yes, it would be simpler if Kruskal-Wallis was what it sounded like, a MasterChef judge with a fungal infection. Perhaps surprisingly, the results from the test could be interpreted as showing that the probability of seeing score differences at least this large, if the judge really had no effect, was less than 1 in 10,000 (p < 0.0001). The table below shows between which judges the differences are likely to exist (p < 0.0001 for all comparisons shown in red).

Table showing potential differences between judges in terms of scores they give to dancers
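For the curious, here is roughly what that analysis looks like in code. The Kruskal-Wallis test is the one named above; the pairwise follow-up shown here (Mann-Whitney U tests with a Bonferroni correction) is my assumption about one way such a table could be produced, not necessarily the method actually used, and the scores are invented.

```python
# A sketch of a Kruskal-Wallis test plus one possible pairwise follow-up.
# Scores are invented; the Bonferroni-corrected Mann-Whitney follow-up is an
# assumption, not necessarily how the table above was produced.
from itertools import combinations

from scipy.stats import kruskal, mannwhitneyu

scores_by_judge = {
    "Craig":  [4, 5, 6, 7, 7, 8],
    "Len":    [6, 7, 8, 8, 9, 9],
    "Bruno":  [6, 7, 7, 8, 9, 10],
    "Arlene": [5, 6, 7, 7, 8, 9],
}

# Overall test: do the score distributions differ between judges?
h_stat, p_value = kruskal(*scores_by_judge.values())
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.4f}")

# Pairwise comparisons, corrected for the number of judge pairs tested.
pairs = list(combinations(scores_by_judge, 2))
for a, b in pairs:
    _, p = mannwhitneyu(scores_by_judge[a], scores_by_judge[b])
    print(f"{a} vs {b}: corrected p = {min(p * len(pairs), 1.0):.4f}")
```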
Thus it would seem that the probability of seeing these differences if Craig weren’t having an effect on the scores is relatively small. In this instance, Craig appears to be awarding slightly lower scores compared to the other judges. The same could be said for Arlene, except when she is being compared to Craig, where she seems to award slightly higher scores.
So it transpires that the scores on Strictly Come Dancing are indeed unreliable. Arlene was, and Craig is, throwing the whole system out of alignment like a couple of Paso Dobles doing a Jive at a Waltz. Tango!
Possibly not though, for a number of reasons. 4.) I am clearly not an expert in statistics, so I may have just performed the analysis incorrectly. 2.) If differences do exist, they are relatively subtle and are likely to be meaningless within individual shows, only coming to light (and bouncing off a glitter ball) when we look across large numbers of scores. That is to say, a statistical difference may exist, but this difference likely makes no practical difference. A.) At least it’s not The X Factor.
Keep dancing. And doing maths.