Just How Predictable are Competitive Smash (and FGC) Games?
statsmash Elo rankings also double as predictive models (with a minor modification). Not only can you use that for cool things like tournament projections, but it also gives you a way to evaluate just how good your ranking system is.
After all, if you and your friends rank the top 50 players, how do you evaluate how good that ranking is?
In this case, we can see exactly how accurate our Elo models are at predicting who wins and who loses in the games we track.
Table of Contents #
- 538 BSS Scores
- Smashcon - Skilled Participants
- All Time - Skilled Participants
- All Time - Highly Skilled
- Methodology
538 BSS Scores #
Just so you have some perspective for the rest of the article, here are the BSS scores that 538 had for their sports forecasts.
Curious as to what BSS scores are? Check out the methodology section first. Otherwise, just know that the higher, the better.
Smashcon - Skilled Participants #
We’re going to look at Smashcon first, because it acts like a test set, as I did not modify the parameters of the model prior to or after the tournament.
We can see two things off the bat: first, Smash games are in general much easier to predict than traditional sports. We see much higher BSS scores than 538’s forecasts, and that makes sense. MkLeo vs random pool opponent #2, who’s going to win? Well, you don’t need much effort to predict that one.
Ultimate had a BSS score similar to Men’s March Madness, and Melee had an even higher BSS score of 0.447.
Secondly, Melee seems to be easier to predict than Ultimate, with significantly higher accuracy and BSS. To some extent this also makes sense. Melee is somewhat top-heavy, the top players typically crush lower-ranked opponents, and with fewer matchups, there are fewer opportunities for cheesy wins.
All Time - Skilled Participants #
Over all time, the earlier trend reverses. This could simply be a difference in data size: I have over 16,000 recorded sets for Ultimate, but only around 8,000 for Melee. We can also see that Ultimate's all-time scores are very similar to its Smashcon scores, which suggests less variance overall and that the model has stabilized.
On the other hand, Samurai Shodown has the lowest scores of all the games I track. This isn't surprising, though: the game had few large tournaments before Evo, and Evo holds a much more important place in the FGC generally. Infiltration, for instance, was not even on the Elo list, because he didn't play any recorded sets before Evo.
By contrast, predictions for UNIST were surprisingly accurate. The game does have a much longer history, dating back to a release in 2013, but I would have thought the upsets at Evo would make a bigger difference.
So, compared to "normal" sports, both Smash games and UNIST are about as predictable as March Madness (remember, UVA vs. UMBC was the ONLY time a 1st seed has lost to a 16th seed), while Samurai Shodown (as is; I expect it to normalize if the scene stays healthy and more data arrives) is more in the territory of the NBA… which is still one of the more predictable sports.
But, to some extent, that's expected with giant double-elimination brackets. Most of the matches a top player plays are, well, foregone conclusions.
All Time - Highly Skilled #
Now we only consider matches where both players have top 10 levels of Elo.
Once again, the model for Ultimate performs well, even when only considering matches at the top. Honestly, the accuracy is pretty shocking. Melee falls to 67%, and indeed within the top 10 it seems anyone can take a game off of anyone in modern times.
The Samsho and UNIST models fall off dramatically, though. Samurai Shodown is understandable; half of the top 10 Elo players didn’t exist prior to Evo. UNIST is somewhat shocking, however. It seems that much of the accuracy before was top players beating up on their pool opponents. Still, within the realm of prediction this is still a BSS similar to that of MLB. I am surprised that the score is even lower than Samurai Shodown’s, though.
Methodology #
To lay the framework for how we're going to evaluate accuracy, we're going to use a few metrics. First, pure binary accuracy. That is, in some respects, what people want to see. However, our Elo models are probabilistic, and there is a clear flaw in accuracy alone. Consider the case where we give a player a 51% chance to win a match, and they then proceed to lose. Is that evidence that the model is wrong? Well, not really. After all, we only gave them a 1% higher chance to win; a loss is completely expected.
So the next metric we're going to use is the Brier Skill Score, or BSS. This is essentially a comparison between the Brier score of an unskilled predictor (in this case, a model that simply gives both parties a 50/50 chance to win) and the Brier score of our model. The higher, the better we're doing. The Brier score, unlike accuracy, considers how far off your prediction was. If you predicted a 50% chance for one party to win and they do not, you are not penalized as much as if you had predicted a 90% chance.
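To make the metric concrete, here's a minimal sketch of the Brier score and BSS calculation as described above. The forecasts and outcomes below are made-up illustrative numbers, not data from any of the games tracked here.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between predicted win probabilities and
    actual outcomes (1 = the predicted side won, 0 = they lost)."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

def brier_skill_score(forecasts, outcomes):
    """BSS compares our model's Brier score against an unskilled 50/50
    reference predictor. 1.0 is perfect, 0.0 is no better than a
    coinflip, and negative means worse than a coinflip."""
    reference = brier_score([0.5] * len(outcomes), outcomes)  # always 0.25
    return 1 - brier_score(forecasts, outcomes) / reference

# Hypothetical forecasts: probability that the favorite wins each set,
# followed by whether the favorite actually won.
forecasts = [0.9, 0.7, 0.55, 0.8]
outcomes = [1, 1, 0, 1]

print(brier_score(forecasts, outcomes))       # 0.110625
print(brier_skill_score(forecasts, outcomes)) # 0.5575
```

Note how the one miss (the 0.55 forecast that lost) is penalized only lightly, while a confident miss like a losing 0.9 forecast would have cost far more. That asymmetry is exactly why BSS is a fairer yardstick than raw accuracy for a probabilistic model.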
What does "skilled participants" mean? Matches where one unknown player faces another, both sitting at the default 1300 Elo score, are basically coinflips no matter how you slice it, since we have zero data on either player. So I filtered the matches with the (very lenient) qualification that at least one of the players in the match must have a 1450 Elo score.
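For reference, here's how a rating gap like 1450 vs. the 1300 default translates into a win probability under the standard logistic Elo formula. This is a sketch of the textbook formula, not necessarily the exact "minor modification" the statsmash models use.

```python
def elo_win_probability(rating_a, rating_b):
    """Standard logistic Elo expected score: the probability that
    player A beats player B, where a 400-point gap corresponds to
    10:1 odds."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

# Two unknown players at the 1300 default: a pure coinflip.
print(elo_win_probability(1300, 1300))  # 0.5

# A 1450-rated player vs. the 1300 default.
print(round(elo_win_probability(1450, 1300), 3))  # ~0.703
```

Under this formula, the 1450 cutoff only guarantees that one side is roughly a 70/30 favorite over a fresh account, which is why I'd call the filter very lenient.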
What does "Highly Skilled" mean? In this case, both players must have Elo scores above roughly the cutoff for top 10 in their game.

Contact me at email@example.com or @stu2b50 on Twitter