Today, I'll examine online tournament data to see if Balance Patch 8.0.0 produced any statistically significant differences. Specifically, I'll be looking at data from two date ranges: January 28th to June 29th ("pre-patch"), and June 29th to October 13th ("post-patch"). I picked 8.0.0 because it was somewhat recent and had a patch before it and a patch after it (that's where January 28th and October 13th came from - those were the release dates of patches 7.0.0 and 9.0.0 respectively).
Also, I'll limit the scope to a handful of predetermined characters, mostly because of time. But, if you're curious about other characters, you'll have a way to look for yourself below.
The characters will be: Marth, Falco, Ike, King K. Rool, and Mario. Mario is here as a sanity check - he only received a final smash update - ideally, we should see no significant change.
Marth and Falco were both buffed and are fairly "mainstream" competitive characters. K. Rool was also buffed, but he is, well, not considered a very good character. Ike's changes were kinda mixed.
Methodology
To do this, I'm modeling each character's wins and losses as a binomial distribution. This has one major assumption that is not true: independence. These characters are piloted by players, and players vary in skill and, beyond skill, in style. So the games are not truly independent - but hopefully the relatively large sample sizes mitigate the bias this assumption introduces.
Additionally, of course, some characters have better average players - but we're comparing between what should be roughly the same population of players.
With that out of the way: given that basically every character has n > 2000, I can approximate the binomials with a Normal distribution and then just do a standard two-proportion Z-test. Additionally, I realized halfway through that computers are fast now and you can run Fisher's exact test at these scales without a problem. The two should produce roughly similar numbers with these sample sizes, but I'll provide both.
Marth, Falco, and K. Rool will get a one-sided test (their win rates should be higher post-patch); Mario and Ike will get a two-sided test.
I'll stick with the conventional, if arbitrary, threshold of p < 0.05.
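To make the Normal approximation concrete, here is a minimal sketch of a pooled two-proportion Z-test using only the Python standard library. The function name and the `alternative` parameter are my own labels for illustration, and the pooled standard error is an assumption about how the test is set up, so the digits may differ slightly from the figures reported in this post.

```python
import math

def two_proportion_z_test(wins_pre, losses_pre, wins_post, losses_post,
                          alternative="two-sided"):
    """Pooled two-proportion Z-test on post-patch vs pre-patch win rate.

    alternative: "greater" (post > pre), "less" (post < pre), or "two-sided".
    Returns (z, p_value).
    """
    n_pre = wins_pre + losses_pre
    n_post = wins_post + losses_post
    rate_pre = wins_pre / n_pre
    rate_post = wins_post / n_post

    # Under the null hypothesis both periods share one win rate, so pool them.
    pooled = (wins_pre + wins_post) / (n_pre + n_post)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_pre + 1 / n_post))
    z = (rate_post - rate_pre) / se

    upper_tail = 0.5 * math.erfc(z / math.sqrt(2))   # P(Z >= z)
    lower_tail = 0.5 * math.erfc(-z / math.sqrt(2))  # P(Z <= z)
    if alternative == "greater":
        p_value = upper_tail
    elif alternative == "less":
        p_value = lower_tail
    else:
        p_value = 2 * min(upper_tail, lower_tail)
    return z, p_value

# Falco's counts, taken from his table further down in this post.
z, p = two_proportion_z_test(1995, 2926, 1472, 1789, alternative="greater")
print(f"z = {z:.2f}, one-sided p = {p:.6f}")
```

The one-sided version only looks at the tail in the direction of a buff, which is why the choice between one-sided and two-sided has to be made before looking at the data.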
Marth
RxC Table:
| | Pre Patch | Post Patch |
|---|---|---|
| Wins | 1325 | 942 |
| Losses | 1938 | 1386 |
Ouch. If you do the division, you may notice that Marth, in fact, loses win rate (roughly 40.6% to 40.5%). So that axes my original idea. Well, sometimes it happens. Now, instead of doing a one-sided test, I'll do a two-sided test - is this just noise, or did Marth somehow get worse?
Normal: 0.91457
Fisher: 0.93394
So, no, thankfully I am not going crazy, and it seems that it was just noise. I will note that Marth has a very large and very, well, unskilled playerbase, so if you want an explanation other than "the buffs were useless", it could simply be that Marth players are bad, on average, at making use of the buffs.
Well, it could be noise. Just because it's above the threshold doesn't mean it can't be truly different - just that through this method, I can't be confident that anything significant happened.
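For the Fisher figure, here is a rough standard-library sketch of a two-sided Fisher's exact test applied to Marth's table. The function names are hypothetical, and the two-sided rule used here (sum the probability of every table with the same margins that is no more likely than the observed one) is one common convention, not necessarily the exact one behind the number above.

```python
import math

def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1 = a + b          # total wins
    col1 = a + c          # total pre-patch games
    n = a + b + c + d     # all games

    def log_prob(x):
        # Hypergeometric probability of x pre-patch wins given fixed margins.
        return (log_comb(col1, x) + log_comb(n - col1, row1 - x)
                - log_comb(n, row1))

    lowest = max(0, row1 + col1 - n)
    highest = min(row1, col1)
    observed = log_prob(a)

    total = 0.0
    for x in range(lowest, highest + 1):
        lp = log_prob(x)
        # Small tolerance so ties with the observed table are included.
        if lp <= observed + 1e-7:
            total += math.exp(lp)
    return min(total, 1.0)

# Marth: wins (pre, post) on the first row, losses (pre, post) on the second.
p = fisher_exact_two_sided(1325, 942, 1938, 1386)
print(f"Fisher two-sided p = {p:.5f}")
```

Working in log space with `math.lgamma` avoids overflowing on factorials of numbers in the thousands, and the loop over a couple thousand candidate tables finishes almost instantly.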
Falco
RxC Table:
| | Pre Patch | Post Patch |
|---|---|---|
| Wins | 1995 | 1472 |
| Losses | 2926 | 1789 |
Normal: 0.000018
Fisher: 0.000021
Wow! That's not only below our target p-value, it's so far below that it seems clear Falco benefited from the patch: his win rate rose from about 40.5% to about 45.1%.
King K. Rool
| | Pre Patch | Post Patch |
|---|---|---|
| Wins | 4897 | 3081 |
| Losses | 5088 | 2999 |
Normal: 0.02248
Fisher: 0.02336
Still comfortably below the 0.05 threshold. I would also note that K. Rool by far has the highest sample size so far.
Ike (two-sided)
| | Pre Patch | Post Patch |
|---|---|---|
| Wins | 6183 | 3051 |
| Losses | 5067 | 2617 |
Normal: 0.16296
Fisher: 0.16443
First, I would note that Ike also regressed slightly in win rate, though not significantly. It's somewhat close, though: a one-sided test for a nerf would give about half the two-sided p-value, roughly 0.08, which is still above 0.05.
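To see how the choice of tail changes Ike's result, the sketch below computes both versions of the Z-test from his table. It assumes a pooled standard error, and the variable names are just illustrative.

```python
import math

# Ike's counts from the table above.
wins_pre, losses_pre = 6183, 5067
wins_post, losses_post = 3051, 2617

n_pre = wins_pre + losses_pre
n_post = wins_post + losses_post

# Pooled win rate under the null hypothesis that the patch changed nothing.
pooled = (wins_pre + wins_post) / (n_pre + n_post)
se = math.sqrt(pooled * (1 - pooled) * (1 / n_pre + 1 / n_post))
z = (wins_post / n_post - wins_pre / n_pre) / se

# One-sided (nerf): chance of a drop at least this large if nothing changed.
p_nerf = 0.5 * math.erfc(-z / math.sqrt(2))
# Two-sided: chance of a change at least this large in either direction.
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print(f"z = {z:.3f}")
print(f"two-sided p = {p_two_sided:.4f}")
print(f"one-sided (nerf) p = {p_nerf:.4f}")
```

Because the Normal distribution is symmetric, the one-sided p-value in the observed direction is exactly half the two-sided one. Switching to a one-sided test after seeing which way the data moved would overstate the evidence, which is a good reason to keep Ike two-sided.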
Mario
| | Pre Patch | Post Patch |
|---|---|---|
| Wins | 8360 | 2674 |
| Losses | 9364 | 3000 |
Normal: 0.95765
Fisher: 0.96344
This is my sanity check - and it seems pretty sane. Indeed, very likely nothing changed.
Heck, I'll do a couple more (both one-sided)
Mewtwo
| | Pre Patch | Post Patch |
|---|---|---|
| Wins | 1429 | 803 |
| Losses | 1867 | 938 |
Normal: 0.03003
Fisher: 0.03218
Pit
| | Pre Patch | Post Patch |
|---|---|---|
| Wins | 825 | 504 |
| Losses | 1076 | 641 |
Normal: 0.36927
Fisher: 0.38345
So it seems that Falco, King K. Rool, and Mewtwo pass the frequentist test.
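As a recap, here is a small sketch that reruns the Normal-approximation test over every table in this post and lists who clears p < 0.05. The helper name and the pooled standard error are assumptions on my part. Treating Mewtwo and Pit as one-sided is also an assumption, inferred from their reported p-values rather than stated outright.

```python
import math

def z_test_p(wins_pre, losses_pre, wins_post, losses_post, one_sided):
    """Pooled two-proportion Z-test; one_sided tests for a higher post-patch rate."""
    n_pre = wins_pre + losses_pre
    n_post = wins_post + losses_post
    pooled = (wins_pre + wins_post) / (n_pre + n_post)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_pre + 1 / n_post))
    z = (wins_post / n_post - wins_pre / n_pre) / se
    if one_sided:
        return 0.5 * math.erfc(z / math.sqrt(2))   # P(Z >= z)
    return math.erfc(abs(z) / math.sqrt(2))        # both tails

# (wins_pre, losses_pre, wins_post, losses_post, one_sided)
# Mewtwo and Pit are marked one-sided as an assumption (see above).
tables = {
    "Marth": (1325, 1938, 942, 1386, False),
    "Falco": (1995, 2926, 1472, 1789, True),
    "King K. Rool": (4897, 5088, 3081, 2999, True),
    "Ike": (6183, 5067, 3051, 2617, False),
    "Mario": (8360, 9364, 2674, 3000, False),
    "Mewtwo": (1429, 1867, 803, 938, True),
    "Pit": (825, 1076, 504, 641, True),
}

threshold = 0.05
significant = []
for name, row in tables.items():
    p = z_test_p(*row)
    print(f"{name}: p = {p:.5f}")
    if p < threshold:
        significant.append(name)

print("Below threshold:", significant)
```

Keep in mind that running seven tests at a 0.05 threshold makes at least one false positive more likely than any single test suggests, so borderline results deserve some extra skepticism.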
Curious about other characters? Want to know what a Bayesian approach would show? Want to use a more complicated model?
Now you can. Happy hunting.
As usual, you can contact me at stu2b@statsmash.io or @stu2b50 on Twitter.