Thursday, 1 March 2012

Finals preview (series 4): Rating the finalists

Now that the eight finalists are known -- although it came down to the very last game! -- I thought that I would review their performance and make some predictions about likely results in the finals.  Very little should be read into this, since the game has high variability and the contestants are likely to have changed in ability in the meantime; I know that when I was told I might make it to the finals I put in some more practice, for instance.  That was around the time that I started this blog, I think, and my consistency has certainly changed since then.

Note: Shaun Ellis's first two games were played last series, and I do not have a record of them.  I could probably work something out, but it seemed simpler just to omit them from consideration.  Also, Sam Gaffney's fourth game needed a second conundrum round during actual play.  However, I am treating that game as stopping after the first conundrum in order to make the comparisons match up more sensibly.

To start with, here are the solo totals for each contestant, ordered by their average score per game.  The solo total is what they would have scored for all their rounds if there were no opponent.

Contestant          G1  G2  G3  G4  G5  G6  Total  Average
Sam Gaffney         51  81  67  64  64  55    382    63.67
Kerin White         68  63  59  63  71  51    375    62.50
Alan Nash           60  73  59  54  55  69    370    61.67
Toby Baldwin        65  48  60  48  54  50    325    54.17
Daniel Chua         51  59  56  53  53  52    324    54.00
Roman Turkiewicz    68  55  57  52  34   -      -        -
Sebastian Ham       49  55  65  40  56   -      -        -
Shaun Ellis          -   -   -   -   -   -      -        -


This shows a quite clear stratification, with the top three very close to each other, then the next four clustered further down, and then another gap to Shaun.  I also find it interesting how consistent Daniel was, with only an eight point gap between his lowest and highest scores.
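The totals and averages in the table are straightforward to recompute from the per-game scores; here is a minimal sketch (the score lists are transcribed from the table above, restricted to the contestants with six recorded games):

```python
# Recompute each contestant's solo total and per-game average
# from their per-game solo scores (transcribed from the table above).
solo_scores = {
    "Sam Gaffney":  [51, 81, 67, 64, 64, 55],
    "Kerin White":  [68, 63, 59, 63, 71, 51],
    "Alan Nash":    [60, 73, 59, 54, 55, 69],
    "Toby Baldwin": [65, 48, 60, 48, 54, 50],
    "Daniel Chua":  [51, 59, 56, 53, 53, 52],
}

for name, scores in solo_scores.items():
    total = sum(scores)
    average = total / len(scores)
    print(f"{name}: total {total}, average {average:.2f}")
```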

Of course, these figures don't tell us that much; the rounds in one game may have been much easier than the rounds in another.  Below is the same table with a comparison against my own solo scores; this time it is sorted by the contestant's total score as a percentage of my total score, i.e., how close they were to my performance on the same games.

(It might be more objective to compare against the combined performance of David and Lily, but that makes judging the conundrums tricky; more importantly, they are too good.  Suppose the contestant matches David with a seven-letter word.  If the only one was GUANACO, then that's a fantastic result; if there were a couple including MAGPIES then it's a good result; and if there were many including TEARING then it is an average result.  Loosely speaking, David is equally likely to find any of those, while I will be more towards the good end of the spectrum.  Or so I would like to believe.)

Contestant          G1  G2  G3  G4  G5  G6  Total  Average       %
Sam Gaffney         51  81  67  64  64  55    382    63.67  97.45%
Kerin White         68  63  59  63  71  51    375    62.50  91.91%
Alan Nash           60  73  59  54  55  69    370    61.67  89.16%
Shaun Ellis          -   -   -   -   -   -      -        -       -
Roman Turkiewicz    68  55  57  52  34   -      -        -       -
Toby Baldwin        65  48  60  48  54  50    325    54.17  74.88%
Daniel Chua         51  59  56  53  53  52    324    54.00  72.81%
Sebastian Ham       49  55  65  40  56   -      -        -       -

On this basis there is further separation.  If we make the unrealistic assumption that my performance is a suitable baseline of comparison, then the top three are the same but Sam has a much clearer lead over Kerin and Alan than the solo scores alone would suggest; Shaun's standing has improved greatly -- my average score was the lowest in his games -- but is still well behind the top three; and Roman has moved up a little also.

(As a curiosity, I note that my own solo scores during Daniel's run were almost smoother than his: if his opponent in the last game had not solved the conundrum so quickly, there might have been just a three point gap between my lowest and highest scores -- as I erroneously thought was the case at first, due to not checking that game carefully enough.)

Of course, solo scores do not reflect the scoring of the game, and in particular the cost of finding a weak answer when a better one was relatively easy to find.  Finding RATING instead of TEARING might only show up as a single point loss, instead of the seven point loss that it should be in practice.  So in an attempt to take this into consideration, here are the head-to-head results (as recorded in this blog) between each finalist and myself.  (Note: There are some slight differences between numbers here and those posted, due to ignoring the other contestant in those games.)

This table is sorted by the contestant's total score as a percentage of my total score; I also show the average per-game difference between their scores and mine.  Positive values would reflect that the difference favours them; negative values indicate a corresponding advantage to me.

Contestant          G1  G2  G3  G4  G5  G6  Total   Avg Δ       %
Sam Gaffney         38  81  57  52  64  43    335   -1.00  98.24%
Alan Nash           46  59  45  34  43  57    284  -18.17  72.26%
Kerin White         38  37  35  49  35  39    233  -27.00  58.99%
Shaun Ellis          -   -   -   -   -   -      -       -       -
Roman Turkiewicz    30  21  40  32  11   -      -       -       -
Sebastian Ham       14  43  23  27  38   -      -       -       -
Toby Baldwin        23  21  33  31  36  14    158  -43.00  37.98%
Daniel Chua         10  28  45  42   7  32    164  -44.67  37.96%

On this metric the differences are massive.  (Of course, it is of very dubious validity, but we'll see where it takes us anyway.)  It's no surprise that Sam stays way on top, but the gap between him and Alan has stretched out greatly, as has that between Alan and Kerin.  Shaun ends up pretty well separated from the remaining four, who are all very close to each other.
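The two derived columns are easy to sanity-check from the totals alone.  A minimal sketch, using Sam's head-to-head row (my total of 341 over those six games is implied by his total of 335 and his -1.00 average difference):

```python
# Recompute the derived columns of the head-to-head table:
# average per-game score difference, and their total as a
# percentage of my total over the same games.
def head_to_head_stats(their_total, my_total, games):
    avg_diff = (their_total - my_total) / games
    percentage = their_total / my_total * 100
    return avg_diff, percentage

# Sam Gaffney's head-to-head totals against me (six games);
# my total of 341 is inferred from the table, not recorded directly.
avg_diff, pct = head_to_head_stats(335, 341, 6)
print(f"Avg diff {avg_diff:+.2f}, {pct:.2f}%")  # -1.00 and 98.24%
```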

Based on this data, the first three quarter-finals should go with the higher-ranked seeds, but the fourth one has Shaun facing Toby.  Toby is the higher seed (his total of 296 beating Shaun's total of 280), but Shaun's head-to-head percentage against me was much larger.  Will this reflect what actually occurs?  I guess we'll have to see!

Update: Commenter Victor suggested that the contestants be rated by their percentage of "maximums" -- times that they achieved the best possible results from the round.  I have some doubts about this as a useful measure, as does commenter Mark, but here's a table anyway:



Victor said...

Hi Geoff,

Another metric you could try for comparison is percentage of maximums achieved, ie. in what percentage of rounds did the contestant find the best answer.

This may give a way of comparing contestants across shows of varying difficulty. In the long run (ie. over many shows) those who achieve a higher percentage of maximums should tend to win over those who achieve fewer.

Geoff Bailey said...

Victor: That's an interesting idea, and one I've seen applied to Countdown. I don't think it works that well for Letters and Numbers, though, in essence because we simply don't get enough maximums for statistical significance.

(Most of these remarks are restricted to letters rounds, as the numbers would be much more amenable to that kind of approach.)

One reason is that there's much less data to work with. A retiring champion on Letters and Numbers has played 30 letters rounds, while a Countdown octochamp has played 88.

Another is that the maximums just aren't reached that often, or at least that has been my impression; I'd have to check that. (Obviously nine-letter words are maximums, but there have been three of those from contestants all series.) Certainly programmatic searching has turned up enough obscure words that David has missed along the way.

I'd say much of this is attributable to the much lesser prominence that the show has here as opposed to the UK. (Very natural, given the differences in how long it has been running!) Combined with the population difference, we have many fewer people who are inclined to put in the practice until they can spot those maximums. (Unlike, say, Kirk Bevins, and the other rising set of Apterous players.)

Mark said...

I don't like percentage of maximums. If a letters maximum is 9, then a player getting an 8 will be treated the same as a player getting a 5.

Also, I don't think it would necessarily be true that someone who achieves a higher percentage of maximums will tend to win over someone with a lower percentage, although often it will be true. Again using letters as an example, I think that a consistent player who usually gets 8 or 7 letter words and never gets full monties should usually beat a player who gets an occasional full monty (and therefore has a higher percentage of maximums) but also gets lots of 5 and 6 letter words.

Geoff Bailey said...

Mark: Your statement seems to be assuming that a full monty is always available, which is decidedly not true! Or perhaps you have misinterpreted the use of "maximums" in this context: It is a best-possible-result based on the letters (or numbers), not a nine-letter word (or exactly on target).

I still think it is a flawed metric, for the reasons you mention. However, I think it works somewhat for Countdown because of the much longer games and also that the top end players are much better -- in a finals series it will be fairly common for at least one contestant to get a maximum in each round, so the percentage of them matches well with the winner.

Mark said...

Geoff, yes you're right. I knew that "maximum" meant the best available, but I somehow got mixed up and wrote the second paragraph above with "maximum" meaning full monty. Silly me.

Victor said...

Ahh, on closer examination, and actually looking over some of the blog posts here again, it does seem the analysis I proposed is quite unsuitable for the data at hand! As you noted, there is simply not enough data to work with.

I'll see if I can devise some robust method by the finals of next series :P

Geoff Bailey said...

No worries, Mark. It's a complicated thing to try.

I'll be interested in what you come up with, Victor. It's all a lost cause anyway as the targets shift greatly, mind you. With the exception of Roman, all of the finalists have had significant time to practice further by the time the finals came around.

Allan S said...

Why not define "maximum" as what David & Lily get each time...