Nagora's Rating System

nagora · May 28, 2022

I've been working on a rating system for the last couple of years and I thought I'd wheel it out here.

Starting from a Bayesian point of view that when we are asked to pre-judge an event that we know nothing about, the rational prior position is to shrug and toss a coin - we say it's 50-50 who will win (or, more precisely, it's 1/n where n is the number of possible outcomes/winners). After the event is over we update our view for the contestants. If competitor A wins, we move our expectation that s/he will win to 2/3, and the loser's down to 1/3. For my purposes, I multiply these numbers by 10000, so ratings theoretically run from 0 to 10000; in reality 10000 is never reached, and ratings are four-digit numbers (occasionally three digits, but that is very rare).
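The update described above appears to match Laplace's rule of succession: after W wins in N bouts the estimate is (W + 1)/(N + 2), which gives 1/2, 2/3 and 3/4 for 0, 1 and 2 straight wins, and 91/92 for six bashos of 15-0. A minimal Python sketch of that reading follows; the function name is hypothetical, and rounding to the nearest integer is an assumption (the text above writes 2/3 as 6666).

```python
def nagora_rating(wins: int, bouts: int) -> int:
    """Rating on the 0-10000 scale from a win-loss record.

    (wins + 1) / (bouts + 2) is Laplace's rule of succession: one notional
    win and one notional loss act as the 50-50 prior. Rounding to the
    nearest integer is an assumption.
    """
    if not 0 <= wins <= bouts:
        raise ValueError("wins must lie between 0 and bouts")
    return round(10000 * (wins + 1) / (bouts + 2))


print(nagora_rating(0, 0))    # no bouts yet: 5000
print(nagora_rating(1, 1))    # one win: 6667 (2/3)
print(nagora_rating(0, 1))    # one loss: 3333
print(nagora_rating(90, 90))  # six bashos of 15-0: 9891
```

The last line reproduces the 9891 ceiling for six bashos of 15-0 that comes up in the discussion of rating history.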

If these were the only two competitors we ever saw then we would eventually get quite a good estimation about who will win each encounter. We would still have to ask ourselves some questions about how long historical information is allowed to influence our estimate, of course, but I'll come back to that.

So, what happens when we add more competitors? We take more or less the same approach: new guys are given a rating of 1/2, or 5000, and then updated based on their record. If they win a match they are bumped to 2/3, then 3/4, etc. (6666, 7500). Note that, unlike in the Elo system, the opponent's rating plays no part in this update.

To deal with the history question, I normally drop a basho's results once the rikishi has taken part in six more recent bashos (which may take a year or more if events are skipped). The exception is that if basho -7 (i.e., the one about to drop off the end of the rating history) was a zensho, it is kept for the calculation. A string of such results is treated the same way, so, for example, Hakuho's four 15-0 results in a row meant that at one point his calculation was based on 10 bashos. This allows ratings to exceed 9891, which would otherwise be the limit for six bashos of 15-0. However, not even Hakuho has managed to go above that score. An analogous system is used for the lower divisions, where the perfect score is 7-0.
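The rolling window with zensho retention can be sketched as below. This assumes a hypothetical data layout: a list of only the bashos the rikishi actually took part in, most recent first, each stored as (wins, bouts). The `perfect` parameter is 15 for the top two divisions and 7 for the lower ones; all names are illustrative.

```python
from typing import NamedTuple


class Basho(NamedTuple):
    wins: int
    bouts: int


def rating_window(history: list[Basho], perfect: int = 15, keep: int = 6) -> list[Basho]:
    """Select the bashos that feed into the current rating.

    The latest `keep` bashos are always used; beyond them, an unbroken
    run of perfect records (zensho) is retained as well.
    """
    window = history[:keep]
    for basho in history[keep:]:
        if basho.wins == basho.bouts == perfect:
            window.append(basho)
        else:
            break  # the first non-zensho ends the retained run
    return window


def windowed_rating(history: list[Basho], perfect: int = 15) -> int:
    """Rule-of-succession rating over the selected window, rounded (assumed)."""
    window = rating_window(history, perfect)
    wins = sum(b.wins for b in window)
    bouts = sum(b.bouts for b in window)
    return round(10000 * (wins + 1) / (bouts + 2))


hakuho_like = [Basho(15, 15)] * 10          # ten straight zensho
print(len(rating_window(hakuho_like)))      # 10
print(windowed_rating(hakuho_like))         # 9934, above the six-basho limit of 9891
```

Because the window always holds the latest six bashos, a new result that equals the one dropping out leaves the totals, and hence the rating, unchanged.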

Another way of looking at the rating is that a value of, say, 8478 means that the rikishi has an 84.78% chance of beating the average opponent he has been facing over the previous year.

This is true of all rating systems for sumo, boxing, tennis, etc., where there is no objective scoring system. We can only really rate players based on how they dominate (or not) their peers. Comparing Mike Tyson to Muhammad Ali just isn't really possible based on statistics alone. Similarly, Hakuho and Nishinoumi are far apart in rating, as we will see, but to what extent this is because competition in 1910 was stiffer than in 2010 is impossible to divine from results alone. You can only compete against the people in front of you.

So, let's look at some numbers. Here are the standings I calculated at the start of Natsu 2022:

| Rank | Rating | Rikishi      |
|------+--------+--------------|
|    1 |   8313 | Terunofuji   |
|    2 |   7021 | Abi          |
|    3 |   6848 | Mitakeumi    |
|    4 |   6471 | Oho          |
|    5 |   6232 | Takakeisho   |
|    6 |   5938 | Wakamotoharu |
|   7= |   5882 | Nishikigi    |
|   7= |   5882 | Kotoshoho    |
|    9 |   5862 | Kotonowaka   |
|   10 |   5761 | Wakatakakage |
|   11 |   5747 | Takayasu     |
|   12 |   5732 | Endo         |
|   13 |   5584 | Midorifuji   |
|   14 |   5444 | Hoshoryu     |
|  15= |   5435 | Shodai       |
|  15= |   5435 | Ichinojo     |
|   17 |   5326 | Tamawashi    |
|   18 |   5244 | Hokutofuji   |
|   19 |   5211 | Ishiura      |
|   20 |   5195 | Ura          |
|   21 |   5109 | Kiribayama   |
|  22= |   5000 | Onosho       |
|  22= |   5000 | Azumaryu     |
|   24 |   4894 | Sadanoumi    |
|   25 |   4891 | Daieisho     |
|   26 |   4783 | Takarafuji   |
|   27 |   4773 | Shimanoumi   |
|   28 |   4719 | Tochinoshin  |
|  29= |   4706 | Kotokuzan    |
|  29= |   4706 | Kagayaki     |
|   31 |   4677 | Yutakayama   |
|  32= |   4674 | Terutsuyoshi |
|  32= |   4674 | Takanosho    |
|   34 |   4565 | Chiyotairyu  |
|   35 |   4524 | Aoiyama      |
|   36 |   4457 | Okinoumi     |
|   37 |   4375 | Ichiyamamoto |
|   38 |   4348 | Kotoeko      |
|  39= |   4239 | Tobizaru     |
|  39= |   4239 | Chiyoshoma   |
|   41 |   4130 | Meisei       |
|   42 |   4111 | Myogiryu     |

And here are the standings for the same group post-Natsu 2022 (with the pre- and post-Natsu ratings for easier comparison; an asterisk in the Prelim column marks a preliminary rating):

| Rank |  Pre | Post | Change | Rikishi      | Prelim |
|------+------+------+--------+--------------+--------|
|    1 | 8313 | 8313 |      0 | Terunofuji   |        |
|    2 | 7021 | 6452 |   -569 | Abi          | *      |
|    3 | 6848 | 6413 |   -435 | Mitakeumi    |        |
|    4 | 5862 | 6092 |    230 | Kotonowaka   |        |
|    5 | 5938 | 5957 |     19 | Wakamotoharu | *      |
|    6 | 5584 | 5882 |    298 | Midorifuji   | *      |
|    7 | 5761 | 5761 |      0 | Wakatakakage |        |
|    8 | 6232 | 5652 |   -580 | Takakeisho   |        |
|    9 | 5882 | 5625 |   -257 | Nishikigi    | *      |
|   10 | 5444 | 5556 |    112 | Hoshoryu     |        |
|  11= | 5326 | 5543 |    217 | Tamawashi    |        |
|  11= | 5109 | 5543 |    434 | Kiribayama   |        |
|   13 | 4894 | 5484 |    590 | Sadanoumi    | *      |
|  14= | 5435 | 5435 |      0 | Ichinojo     |        |
|  14= | 4891 | 5435 |    544 | Daieisho     |        |
|   16 | 5195 | 5385 |    190 | Ura          | *      |
|   17 | 4674 | 5326 |    652 | Takanosho    |        |
|   18 | 5747 | 5287 |   -460 | Takayasu     |        |
|   19 | 5732 | 5244 |   -488 | Endo         |        |
|   20 | 5211 | 5211 |      0 | Ishiura      | *      |
|   21 | 5244 | 5122 |   -122 | Hokutofuji   |        |
|   22 | 4719 | 5056 |    337 | Tochinoshin  |        |
|  23= | 5882 | 5000 |   -882 | Kotoshoho    | *      |
|  23= | 5435 | 5000 |   -435 | Shodai       |        |
|   25 | 5000 | 4940 |    -60 | Onosho       |        |
|   26 | 4524 | 4783 |    259 | Aoiyama      |        |
|   27 | 4773 | 4773 |      0 | Shimanoumi   |        |
|   28 | 4375 | 4681 |    306 | Ichiyamamoto | *      |
|   29 | 4677 | 4545 |   -132 | Yutakayama   | *      |
|  30= | 4783 | 4457 |   -326 | Takarafuji   |        |
|  30= | 4674 | 4457 |   -217 | Terutsuyoshi |        |
|  30= | 4457 | 4457 |      0 | Okinoumi     |        |
|  30= | 4239 | 4457 |    218 | Tobizaru     |        |
|   34 | 4706 | 4375 |   -331 | Kagayaki     | *      |
|   35 | 4565 | 4348 |   -217 | Chiyotairyu  |        |
|   36 | 4130 | 4130 |      0 | Meisei       |        |
|   37 | 6471 | 4118 |  -2353 | Oho          | *      |
|   38 | 4111 | 4111 |      0 | Myogiryu     |        |
|  39= | 4348 | 4022 |   -326 | Kotoeko      |        |
|  39= | 4239 | 4022 |   -217 | Chiyoshoma   |        |
|   41 | 5000 | 3529 |  -1471 | Azumaryu     | *      |
|   42 | 4706 | 3125 |  -1581 | Kotokuzan    | *      |

Note that Terunofuji's rating is unchanged: his 12-3 this basho matched the 12-3 from Haru 2021, which has just dropped out of his rating history (since it wasn't a 15-0 result).

When two rikishi meet I calculate a percentage chance for East to win based on this conceptual algorithm:

Each bout is the result of a sequence of attempts to end it. Each wrestler has a chance of being successful equal to their rating/10000, so Shodai has a chance of 50%; Terunofuji 83.13%.

If one wrestler is successful and the other is not, then the bout is over and the successful rikishi is the winner.

Otherwise (i.e., neither or both succeed), the bout continues with another pair of attempts until there is a conclusive outcome.
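The conceptual algorithm can be simulated directly. Here is a Monte Carlo sketch in Python; the function names, the 100,000-trial count and the fixed seed are arbitrary illustrative choices.

```python
import random


def simulate_bout(east: int, west: int, rng: random.Random) -> bool:
    """Play one bout of the repeated-attempts model; True means East wins.

    Each round, both rikishi try to end the bout and succeed with
    probability rating/10000. Only a round where exactly one succeeds is
    decisive. If both ratings are 0, or both are 10000, no round is ever
    decisive and the loop would not terminate.
    """
    p_east, p_west = east / 10000, west / 10000
    while True:
        east_ok = rng.random() < p_east
        west_ok = rng.random() < p_west
        if east_ok != west_ok:
            return east_ok


def monte_carlo_chance(east: int, west: int, trials: int = 100_000, seed: int = 1) -> float:
    """Estimate East's win probability by simulating many bouts."""
    rng = random.Random(seed)
    east_wins = sum(simulate_bout(east, west, rng) for _ in range(trials))
    return east_wins / trials


print(monte_carlo_chance(8313, 5000))  # should land close to 0.8313
```

The estimate carries sampling noise of roughly a tenth of a percentage point at this trial count, which is why an exact formula is preferable.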

That's the idea. You could Monte-Carlo it to get an estimate, but the algorithm can also be converted into an exact formula:

$$E_c = \frac{E_w W_l}{1 - (E_l W_l + E_w W_w)}$$

Where Ew is East's rating/10000, El is 1-Ew, and similarly Ww and Wl for West. Ec is the chance that East will win as a probability; multiply by 100 to get a percentage. Yeah, in El and Wl that's an 'el' but the font isn't good at distinguishing it from a capital i.
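One way to see where the exact formula comes from: in each round East alone succeeds with probability $E_w W_l$, West alone with $E_l W_w$, and the round is inconclusive with probability $q = E_l W_l + E_w W_w$. Summing over the number of inconclusive rounds before East's decisive one gives a geometric series, and since $(E_w + E_l)(W_w + W_l) = 1$, the denominator simplifies:

$$
E_c = \sum_{k=0}^{\infty} q^k \, E_w W_l = \frac{E_w W_l}{1 - q} = \frac{E_w W_l}{1 - (E_l W_l + E_w W_w)} = \frac{E_w W_l}{E_w W_l + E_l W_w}
$$

Setting $W_w = W_l = \tfrac12$ (a 5000-rated opponent) gives

$$
E_c = \frac{E_w / 2}{(E_w + E_l)/2} = E_w ,
$$

which is why a 5000-rated opponent leaves the other rikishi's chance equal to his own rating divided by 10000.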

So for Terunofuji vs Shodai we get East's chance as 83.13%. Notice that because Shodai's current rating is 5000, he neither increases nor decreases his opponent's chance to win: according to this system, any rikishi's chance against him is simply that rikishi's own rating divided by 100 (as a percentage).

For Terunofuji vs Daieisho (8313 vs 5435), Terunofuji's chance drops to 80.5%, and against Abi's 6452 it is 73%.
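The exact formula is short enough to write as a one-line function. This Python sketch (function name hypothetical) reproduces the three Terunofuji figures quoted above:

```python
def east_win_chance(east: int, west: int) -> float:
    """Probability that East beats West, from ratings on the 0-10000 scale.

    Closed form of the repeated-attempts model:
    Ec = Ew*Wl / (1 - (El*Wl + Ew*Ww)), with El = 1 - Ew and Wl = 1 - Ww.
    Undefined (division by zero) if both ratings are 0 or both are 10000.
    """
    ew, ww = east / 10000, west / 10000
    el, wl = 1 - ew, 1 - ww
    return (ew * wl) / (1 - (el * wl + ew * ww))


for opponent, rating in [("Shodai", 5000), ("Daieisho", 5435), ("Abi", 6452)]:
    print(f"Terunofuji vs {opponent}: {east_win_chance(8313, rating):.1%}")
# prints 83.1%, 80.5% and 73.0%
```

The formula is symmetric in the sense that `east_win_chance(a, b) + east_win_chance(b, a)` is 1, so it does not matter which side is labelled East.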

So that's the system and now, in defiance of what I said about not being able to compare across the decades, here's the Yokozuna rated by the highest scores they reached in their careers:

| Rank | Rating | Rikishi      |
|------+--------+--------------|
|    1 |   9674 | Hakuho       |
|    2 |   9595 | Futabayama   |
|    3 |   9508 | Tachiyama    |
|    4 |   9375 | Taiho        |
|   5= |   9348 | Kitanoumi    |
|   5= |   9348 | Chiyonofuji  |
|    7 |   9273 | Tanikaze     |
|   8= |   9239 | Tamanoumi    |
|   8= |   9239 | Asashoryu    |
|   10 |   9180 | Tochigiyama  |
|   11 |   9130 | Takanohana   |
|   12 |   9091 | Hitachiyama  |
|   13 |   9070 | Umegatani    |
|  14= |   8971 | Tsunenohana  |
|  14= |   8971 | Tamanishiki  |
|   16 |   8913 | Tochinishiki |
|   17 |   8889 | Inazuma      |
|   18 |   8871 | Haguroyama   |
|   19 |   8804 | Takanosato   |
|   20 |   8696 | Wakanohana   |
|   21 |   8681 | Wajima       |
|   22 |   8600 | Kimenzan     |
|   23 |   8596 | Onogawa      |
|  24= |   8587 | Wakanohana   |
|  24= |   8587 | Kitanofuji   |
|   26 |   8548 | Onishiki     |
|  27= |   8500 | Konishiki    |
|  27= |   8500 | Kashiwado    |
|  29= |   8478 | Terunofuji   |
|  29= |   8478 | Sadanoyama   |
|  29= |   8478 | Mienoumi     |
|  29= |   8478 | Akebono      |
|  33= |   8370 | Terukuni     |
|  33= |   8370 | Musashimaru  |
|  33= |   8370 | Hokutoumi    |
|   36 |   8333 | Umegatani    |
|   37 |   8261 | Asahifuji    |
|  38= |   8152 | Kisenosato   |
|  38= |   8152 | Harumafuji   |
|   40 |   8143 | Yoshibayama  |
|   41 |   8043 | Akinoumi     |
|  42= |   8000 | Otori        |
|  42= |   8000 | Jinmaku      |
|  44= |   7935 | Tochinoumi   |
|  44= |   7935 | Kotozakura   |
|   46 |   7925 | Onomatsu     |
|   47 |   7917 | Unryu        |
|   48 |   7872 | Hidenoyama   |
|   49 |   7826 | Kakuryu      |
|   50 |   7813 | Musashiyama  |
|   51 |   7750 | Nishinoumi   |
|   52 |   7727 | Azumafuji    |
|   53 |   7722 | Futahaguro   |
|  54= |   7717 | Wakanohana   |
|  54= |   7717 | Kagamisato   |
|  54= |   7717 | Asashio      |
|   57 |   7701 | Onokuni      |
|   58 |   7667 | Nishinoumi   |
|   59 |   7609 | Chiyonoyama  |
|   60 |   7586 | Maedayama    |
|   61 |   7561 | Shiranui     |
|   62 |   7556 | Shiranui     |
|   63 |   7455 | Ozutsu       |
|   64 |   7358 | Sakaigawa    |
|   65 |   7206 | Minanogawa   |
|   66 |   7045 | Nishinoumi   |
|   67 |   6078 | Miyagiyama   |
|  68= |      0 | Wakashima    |
|  68= |      0 | Onishiki     |
|  68= |      0 | Okido        |
|  68= |      0 | Maruyama     |
|  68= |      0 | Ayagawa      |
|  68= |      0 | Akashi       |

It's interesting to me how little overlap there is between this list and the current ratings: leaving aside the six with a rating of 0, Miyagiyama (intai 1931) is the only one who would not be above all the current non-Yokozuna in the top flight.

Rayden's peak rating was 9444, which would slot him into 4th place on the above list.

There are a few problems. Firstly, I can't easily get playoff results out of the database, so the ratings do not consider scores of 16-0, 15-1, etc.

The second major issue is what to do when people jump from one division to another. It's certainly possible to come out of Juryo with a high rating and immediately get slapped down a couple of thousand points. At the moment, the programs I use for my own purposes mark any rikishi who has not spent an entire six-basho stint in their current division as having a "preliminary" rating.

Finally, it is clear that the top division, at least, has a boundary between its top and bottom halves, with very little mixing during a tournament. This tends to make the top wrestlers' scores lower than they should be, as they face only the top opponents rather than a cross-section of the division. I've not paid enough attention to the other divisions to know whether this is an issue there too.
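The "preliminary" flag from the second issue can be expressed as a small check. This sketch assumes a hypothetical data layout (the division of each basho the rikishi took part in, most recent first) and treats a rating as settled only when all of the last six entries match the current division.

```python
def is_preliminary(divisions: list[str], current: str, stint: int = 6) -> bool:
    """True if a rating should be flagged as preliminary.

    `divisions` gives the division of each basho the rikishi took part in,
    most recent first (an assumed data layout). The rating counts as
    settled only once the last `stint` bashos were all in `current`.
    """
    recent = divisions[:stint]
    return len(recent) < stint or any(division != current for division in recent)


print(is_preliminary(["Makuuchi"] * 6, "Makuuchi"))              # False
print(is_preliminary(["Makuuchi"] * 5 + ["Juryo"], "Makuuchi"))  # True
```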

On the plus side, it avoids the inflation problem that plagues some systems, and newbies and retirees do not overly trouble it.

So, there you have it. That's what I do in my spare time :)

Edited May 29, 2022 by nagora
Clarification

Seiyashi · May 28, 2022

While I think it's a nice fresh perspective, I have difficulty taking it seriously when Midorifuji and Nishikigi are ranked higher than a lot of joi stalwarts. Given the explanation of the methodology it's because they've been on form, but it's not going to help in the next basho when we know they are in all probability going to be creamed based on the general assessment of their sumo ability.

nagora · May 28, 2022

2 hours ago, Seiyashi said:

While I think it's a nice fresh perspective, I have difficulty taking it seriously when Midorifuji and Nishikigi are ranked higher than a lot of joi stalwarts. Given the explanation of the methodology it's because they've been on form, but it's not going to help in the next basho when we know they are in all probability going to be creamed based on the general assessment of their sumo ability.

In both cases, this is at least partly the problem of carrying ratings up with a promotion from a lower division. Neither of them has had much time in the top division (in a row, at least). Having said that, Midorifuji did slightly better than his rating predicted in this basho; performance is all one can really go on, and a rating system should recognise when someone has hit form, shouldn't it? We shall see in July.

Sumo is unusual compared to most Western sports in that the matches are neither fully round-robin nor knock-out. The committee decide who will face whom (more so as the fortnight proceeds), and that poses challenges to any system that attempts to rate contestants based on their results, as the sample of opponents is neither random nor "fair".

I've used this as the basis of (fantasy) betting for the last year, and what I've found is that it generally gives very good results against the bookies at the start of the basho, but the accuracy tails off as fatigue and injuries set in from about day 10. Dynamically updating the ratings during the basho gave much worse results, however.

Reonito · May 28, 2022

6 hours ago, nagora said:

Starting from a Bayesian point of view that when we are asked to pre-judge an event that we know nothing about, the rational prior position is to shrug and toss a coin - we say it's 50-50 who will win (or, more precisely, it's 1/n where n is the number of possible outcomes/winners). After the event is over we update our view for the contestants. If competitor A wins, we move our expectation that s/he will win to 2/3, and the loser's down to 1/3.

This seems like extremely aggressive updating—when things like this are done e.g. for batting averages in baseball, the number of average at-bats included as part of regressing an early-season estimate to the mean is in the hundreds. That might not be appropriate here, but using (say) 20 bouts as the 50:50 baseline, so that a single win/loss moves rikishi to 0.55 and 0.45, respectively, would probably make this more stable.
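The suggestion amounts to changing how many notional bouts the 50-50 prior is worth. A sketch with a single hypothetical `prior_bouts` parameter: 2 reproduces the original (W + 1)/(N + 2) scheme. Taken literally as 10 notional wins and 10 notional losses, a prior of 20 bouts moves a single winner to 11/21, about 0.524; a move to exactly 0.55 corresponds to a prior worth 9 bouts.

```python
def shrunk_rating(wins: int, bouts: int, prior_bouts: float = 2) -> float:
    """Estimated win probability with a 50-50 prior worth `prior_bouts` bouts.

    prior_bouts=2 (one notional win, one notional loss) reproduces the
    original (W + 1) / (N + 2) scheme; larger values pull estimates
    towards 0.5 more strongly.
    """
    return (wins + prior_bouts / 2) / (bouts + prior_bouts)


print(shrunk_rating(1, 1))                   # 2/3, the original scheme
print(shrunk_rating(1, 1, prior_bouts=20))   # 11/21, about 0.524
print(shrunk_rating(1, 1, prior_bouts=9))    # 0.55
```

The prior's weight matters less as real bouts accumulate, so over a six-basho window of about 90 bouts the two choices diverge mainly for newcomers and for recently promoted rikishi.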

SDM · May 28, 2022

It doesn't appear to take into account the strength of opposition.

Hence you see people on the list far above/below their merit.

Seiyashi · May 29, 2022

What this is excellent for though is measuring the performance of all holders of a particular rank against each other, especially if you introduce the added refinement of confining the computations to their tenure at the rank. Leaving aside the win probability computations, the yokozuna table looks particularly good as it gives a basis for computing dominance.

nagora · May 29, 2022

15 hours ago, SDM said:

It doesn't appear to take into account the strength of opposition.

Hence you see people on the list far above/below their merit.

I deliberately avoided any attempt to allow for strength of individual opposition, as that way leads to circular reasoning IMO. The approach was always to assume that a rikishi can only be rated based on how they perform relative to their peer group, and to trust that the banzuke generates a reasonable peer group by its nature.

I realised it was a confusing mistake not to mark the ratings that are preliminary because the rikishi has not been in the division long enough for their score to "settle". I've updated the second table above to show the 14 wrestlers whose current rating should be taken with a variable amount of salt; everyone else has a rating based on at least 6 basho at the current level. Five of the marked rikishi actually outperformed their preliminary rating; make of that what you will.

Here's the top division with the preliminary ratings filtered out; perhaps you'll find these more reasonable:

| Rank | Rating | Rikishi      |
|------+--------+--------------|
|    1 |   8313 | Terunofuji   |
|    2 |   6413 | Mitakeumi    |
|    3 |   6092 | Kotonowaka   |
|    4 |   5761 | Wakatakakage |
|    5 |   5652 | Takakeisho   |
|    6 |   5556 | Hoshoryu     |
|   7= |   5543 | Tamawashi    |
|   7= |   5543 | Kiribayama   |
|   9= |   5435 | Ichinojo     |
|   9= |   5435 | Daieisho     |
|   11 |   5326 | Takanosho    |
|   12 |   5287 | Takayasu     |
|   13 |   5244 | Endo         |
|   14 |   5122 | Hokutofuji   |
|   15 |   5056 | Tochinoshin  |
|   16 |   5000 | Shodai       |
|   17 |   4940 | Onosho       |
|   18 |   4783 | Aoiyama      |
|   19 |   4773 | Shimanoumi   |
|  20= |   4457 | Tobizaru     |
|  20= |   4457 | Terutsuyoshi |
|  20= |   4457 | Takarafuji   |
|  20= |   4457 | Okinoumi     |
|   24 |   4348 | Chiyotairyu  |
|   25 |   4130 | Meisei       |
|   26 |   4111 | Myogiryu     |
|  27= |   4022 | Kotoeko      |
|  27= |   4022 | Chiyoshoma   |
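The filtered table re-ranks the remaining rikishi using the same tie notation as the other tables (shared places marked with "=", the next rank skipping the shared places). A Python sketch of that step, assuming rows of (name, rating, is_preliminary); the helper names and the short sample are illustrative:

```python
def competition_ranks(ratings: list[int]) -> list[str]:
    """Rank labels for ratings sorted from highest to lowest.

    Equal ratings share the place of the first of them, marked with '=',
    and the next distinct rating skips the shared places (1, 2, 3=, 3=, 5).
    """
    labels = []
    for rating in ratings:
        place = ratings.index(rating) + 1  # first occurrence; ties are adjacent
        labels.append(f"{place}=" if ratings.count(rating) > 1 else str(place))
    return labels


def settled_table(rows: list[tuple[str, int, bool]]) -> list[tuple[str, int, str]]:
    """Drop preliminary rows and re-rank: (name, rating, prelim) -> (rank, rating, name)."""
    settled = sorted((row for row in rows if not row[2]), key=lambda row: -row[1])
    ranks = competition_ranks([rating for _, rating, _ in settled])
    return [(rank, rating, name) for rank, (name, rating, _) in zip(ranks, settled)]


sample = [
    ("Terunofuji", 8313, False), ("Abi", 6452, True), ("Mitakeumi", 6413, False),
    ("Tamawashi", 5543, False), ("Kiribayama", 5543, False),
    ("Sadanoumi", 5484, True), ("Ichinojo", 5435, False),
]
for rank, rating, name in settled_table(sample):
    print(f"| {rank:>4} | {rating:>6} | {name:<12} |")
```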

11 hours ago, Seiyashi said:

What this is excellent for though is measuring the performance of all holders of a particular rank against each other, especially if you introduce the added refinement of confining the computations to their tenure at the rank. Leaving aside the win probability computations, the yokozuna table looks particularly good as it gives a basis for computing dominance.

The thing about win probability calculations is that they are the only real test of any ranking system. In overly simplistic terms, any higher-ranked competitor should beat a lower one. Adding a bit more meat onto that, the ranking should give some idea as to how often that rule can be expected to be broken.

Seiyashi · May 29, 2022

29 minutes ago, nagora said:

The thing about win probability calculations is that they are the only real test of any ranking system. In overly simplistic terms, any higher-ranked competitor should beat a lower one. Adding a bit more meat onto that, the ranking should give some idea as to how often that rule can be expected to be broken.

While I agree with the general idea that win probability calculations are the proof of a ranking system, it doesn't help in the case of comparing performances of e.g. yokozuna across different eras because by definition they could never have fought one another, and therefore the computation as a predictor is meaningless in not being testable or comparable against real-world results. It might have given us a number as to who was more likely to win at their peak, but since you've pointed out that the rating system cannot account for absolute strength of performance, that computation really ultimately holds no meaning.

I would argue that the ranking system at least provides a consistent "digest", a single benchmark by which to compare rikishi relative to the rest of the field, rather than trying to compare "this one had 10 9-6s, that one had 4", etc. As Reonito pointed out, the Bayesian assumption is perhaps a bit too aggressive, and I'm not sure I agree with the specific sentiment that this particular win probability necessarily means anything in the context of sumo.

Edited May 29, 2022 by Seiyashi

Suwihuto · May 30, 2022

Looks like an interesting way to approach the problem.

I do echo what's been said above: I'd be tempted to have a longer series feeding into each rating, and I'd also look to do something more about the divisional differences (which would be more relevant with a longer series too). Would it make sense to downrate (or uprate) results in different divisions? For example, a 1 / 0.8 / 0.6 / 0.4 / 0.2 / 0.1 relative scaling factor may help, meaning that a Juryo 15-0 would be treated as a 12-3 in Makuuchi. It should be easy enough to work out what factors make sense here, although the problem is never going to be entirely eliminated, of course.

Of course, a ~0.8 factor between each division, rather than the arbitrary sliding scale above, is likely more sensible to use too.

Edited May 30, 2022 by Suwihuto
Further thought.
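Both scaling ideas can be sketched by discounting wins before applying the (W + 1)/(N + 2) rating, with the discounted part counting as losses. The division list is the standard six; the weights are the illustrative ones from the post, not fitted values, and all names are hypothetical.

```python
DIVISIONS = ["Makuuchi", "Juryo", "Makushita", "Sandanme", "Jonidan", "Jonokuchi"]

# The first scale suggested in the post; illustrative, not fitted to data.
LINEAR_SCALE = dict(zip(DIVISIONS, [1.0, 0.8, 0.6, 0.4, 0.2, 0.1]))


def geometric_scale(factor: float = 0.8) -> dict[str, float]:
    """A constant factor per division step: 1, 0.8, 0.64, ... for factor=0.8."""
    return {division: factor ** step for step, division in enumerate(DIVISIONS)}


def scaled_rating(wins: int, bouts: int, division: str, scale: dict[str, float]) -> int:
    """Rating with wins discounted by division; the discounted part counts as losses."""
    effective_wins = wins * scale[division]
    return round(10000 * (effective_wins + 1) / (bouts + 2))


print(scaled_rating(15, 15, "Juryo", LINEAR_SCALE))     # Juryo 15-0, treated as 12-3
print(scaled_rating(12, 15, "Makuuchi", LINEAR_SCALE))  # Makuuchi 12-3: same rating
```

The two scales agree for Juryo and differ from Makushita down (0.6 vs 0.64, and so on), so a choice between them would only affect rikishi coming up through the lower divisions.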

Churaumi · May 30, 2022

Seems to me this system is better used to gauge the quality of overall competition than the quality of individual rikishi. For example, if a rikishi reaches a certain rank with a middling rating from this system, it would indicate different things than reaching that rank at that time with a high or low score. I would need to sit and congronkulate some numbers to figure out trends and what those would mean. I assume scores skewing to the middle would indicate more even competition in general, since that would indicate a prevalence of close kachikoshi among evenly matched rikishi. It wouldn't reflect the skill of the competition; it could be exceptionally inept sumo, but it would be inept all around.

This system would also invert for outliers like Hakuho or Shonanzakura, since they weren't really competitive at their ends of the spectrum of skill: it shows how far above (or below) their peers they were. That still wouldn't help rank the outliers beyond how dominant or not they were in their own time. Maybe periods with more or fewer outliers would indicate something. I assume.
