
Comments/Ratings for a Single Item

Game Courier Ratings. Calculates ratings for players from Game Courier logs. Experimental.
🕸📝Fergus Duniho wrote on Sat, Apr 11, 2015 09:20 PM UTC:
I'm thinking of tweaking the way the GCR is calculated. As it is right now, the value that is going to grow the quickest is a player's past games. This affects the stability value, which is already designed to near the limit of one more quickly than reliability ever will. Even if games with the current opponent and one's past games remained equal in number, stability would grow more quickly than reliability. But after the first opponent, one's past games will usually outnumber one's games with the current opponent. Besides this, gravity is based on stability scores, and as stability scores for both opponents quickly near the limit of one, gravity becomes fairly insignificant. Given that past games will usually outnumber games played against the current opponent, it makes sense for reliability to grow more quickly than stability.

🕸📝Fergus Duniho wrote on Sat, Apr 11, 2015 11:50 PM UTC:
I'm rethinking this even more. I was reading about Elo, and I realized its main feature is a self-correcting mechanism, sort of like evolution. Having written about evolution fairly extensively in recent years, I'm aware of how it's a simple self-correcting process that gets results. So I want a ratings system that is more modeled after evolution, using self-correction to get closer to accurate results.

So let's start with a comparison between expectations and results. The ratings for two players serve as a basis for predicting the percentage of games each should win against the other. Calculate this and compare it to the actual results. The GCR currently does it backward from this. Given the results, it estimates new ratings, then figures out how much to adjust present ratings to the new ratings. The problem with this is that different pairs of ratings can predict the same results, whereas any pair of ratings predicts only one outcome. It is better to go with known factors predicting a single outcome. Going the other way requires some arbitrary decision making.

If there is no difference between predicted outcome and actual outcome, adjustments should be minimal, perhaps even zero. If there is a difference, ratings should be adjusted more. The maximum difference is if one player is predicted to win every time, and the other player wins every time. Let's call this 100% difference. This would be the case if one rating was 400 points or more higher than another. The maximum change to their scores should be 400 points, raising the lower by 400 points and decreasing the higher by 400. So the actual change may be expressed as a limit that approaches 400. Furthermore, the change should never be greater than the discrepancy between predictions and outcomes. The discrepancy can always be measured as a percentage between 0% and 100%. The maximum change should be that percentage of 400.
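
As a minimal sketch of that comparison in Python (my own illustration, not the script's actual code), assuming the paragraph above means a rating gap maps linearly onto a predicted margin, with a gap of 400 or more predicting that the stronger player wins every game:

    def predicted_margin(rating_a, rating_b):
        # Predicted score margin for player A, from -1 (loses every game)
        # to +1 (wins every game); the 400-point cap is my reading of the
        # paragraph above, not a value taken from the script.
        gap = rating_a - rating_b
        return max(-1.0, min(1.0, gap / 400.0))

    def actual_margin(wins_a, wins_b, draws=0):
        # Actual score margin for player A over the games this pair played.
        games = wins_a + wins_b + draws
        return (wins_a - wins_b) / games

    def max_change(rating_a, rating_b, wins_a, wins_b, draws=0):
        # Discrepancy between prediction and outcome, expressed as a fraction
        # of the largest possible swing (from +1 to -1), times the 400-point cap.
        discrepancy = abs(actual_margin(wins_a, wins_b, draws)
                          - predicted_margin(rating_a, rating_b)) / 2.0
        return 400.0 * discrepancy

    # Two equally rated players, one wins the only game: a 50% discrepancy,
    # so the cap on the change is 200 points, as worked out below.
    print(max_change(1500, 1500, 1, 0))  # 200.0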

But it wouldn't be fair to give the maximum change for only a single game. The actual change should be a function of the games played together. This function may be described as a limit that reaches the maximum change as they play more games together. This is a measure of the reliability of the results. At this point, the decision concerning where to set different levels of reliability seems arbitrary. Let's say that at 10 games, it is 50% reliable, and at 100 games near 100% reliable. So, Games/(Games + 10) works for this. At 10, 10/20 is .5 and at 100, 100/110 is .90909090909. This would give 1 game a reliability of .090909090909, which is almost 10%. So, for one game with 100% difference between predictions and results, the change would be 36.363636363636. This is a bit over half of what the change currently is for two players with ratings of 1500 when one wins and the other loses. Currently, the winner's rating rises to 1564, while the loser's goes down to 1435. With both players at 1500, the predicted outcome would be that both win equally as many games or draw a game. Any outcome where someone won all games would differ from the predicted outcome by 50%, making the maximum change only 200, and for a single game, that change would be 18.1818181818. This seems like a more reasonable change for a single game between 1500 rated players.
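
Continuing the same sketch, the reliability limit can be written as Games/(Games + 10) and applied to the cap computed above; the single-game numbers match the ones worked out in the paragraph:

    def reliability(games_together):
        # Approaches 1 as the two players play more games together:
        # 1 game -> ~0.0909, 10 games -> 0.5, 100 games -> ~0.909.
        return games_together / (games_together + 10.0)

    def change(rating_a, rating_b, wins_a, wins_b, draws=0):
        games = wins_a + wins_b + draws
        return max_change(rating_a, rating_b, wins_a, wins_b, draws) * reliability(games)

    print(change(1900, 1500, 0, 1))  # 36.36..., the 100%-difference single game
    print(change(1500, 1500, 1, 0))  # 18.18..., two 1500 players, one win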

Now the question comes in whether anything like stability or gravity should factor into how the scores change. Apparently the USCF uses something called a K-factor, which is a measure of how many games one's current rating is based on. This corresponds to what I have called stability. Let's start with maximums. What should be the maximum amount that stability should minimize the change to a score? Again, this seems like an arbitrary call. Perhaps 50% would be a good maximum. And at what point should a player's rating receive that much protection? Or, since this may be a limit, at what point should change to a player's rating be minimized by half as much, which is 25%? Let's say 200 games. So, Games/(Games + 600) works for this. At 200, it gives 200/800. At 400, it gives 400/1000.

And what about gravity? Since gravity is a function of stability, maybe it adds nothing significant. If one player has high stability and the other doesn't, the one whose rating is less stable will already change more. So, gravity can probably be left out of the calculation.

🕸📝Fergus Duniho wrote on Sun, Apr 12, 2015 02:43 AM UTC:
So far, the current method is still getting higher accuracy scores than the new method I described. Maybe gravity does matter. This is the idea that if one player's rating is based on several games, and the other player's rating isn't, the rating of the player with fewer games should change even more than it would if their past number of games were equal. This allows the system to get a better fix on a player's ability by adjusting his rating more when he plays against opponents with better established ratings.

🕸📝Fergus Duniho wrote on Mon, Apr 13, 2015 01:48 AM UTC:
I've been more closely comparing different approaches to the ratings. One is the new approach I described at length earlier, and one is tweaking the stability value. In tweaking the stability value, I could increase the accuracy measurement by raising the number of past games required for a high stability score. But this came at a cost. I noticed that some players who had played only a few games quickly got high ratings. Perhaps they had played a few games against high rated players and won them all. Still, this seemed to be unfair. Maybe the rating really was reflective of their playing abilities, but it's hard to be sure about this, and their high ratings for only a few games seemed unearned. In contrast to this, the new rating method put a stop to this. It made high ratings something to be earned through playing many games. Its highest rated players were all people who had played several games. Its highest rating for someone who played games in the single digits was 1621 for someone who had won 8.5 out of 9 games. In contrast, the tweaked system gave 1824 to someone who won 4 out of 4 games, placing him 5th in the overall rankings. The current system, which has been in place for years, gave 1696 and 1679 to people who won 8.5/9 and 4/4 respectively.

In the ratings for all games, the new system gets a lower accuracy score by less than 2%. That's not much of a difference. In Chess, it gets the higher accuracy score. In some other games, it gets a lower score by a few percentage points. Generally, it's close enough but has the advantage of reducing unearned high ratings, which gives it a greater appearance of fairness. So I may switch over to it soon.

🕸📝Fergus Duniho wrote on Mon, Apr 13, 2015 11:43 PM UTC:
I have switched the ratings system to the new method, because it is fairer. Details on the new system can be found on the page. I have included a link to the old ratings system, which will let you compare them.

Kevin Pacey wrote on Fri, Apr 15, 2016 02:08 AM UTC:
Hi Fergus

I lost a game of Sac Chess to Carlos quite some time ago. I thought that it was to be rated, but as far as I can tell my rating is based on only 1 game (a win at Symmetric Glinski's Hexagonal Chess vs. Carlos). I don't know if the ratings have been updated to take into account my Sac Chess loss, but I thought I'd let you know, even though I don't plan to play on Game Courier again, at least not anytime soon.

🕸📝Fergus Duniho wrote on Fri, Apr 15, 2016 02:21 AM UTC:
Your game is marked as rated, but for some reason it didn't make it into the FinishedGames database table. I will have to look into whether this problem is isolated or more systemic. Just as a quick check, the last two games I finished are in the database. I will give this more attention soon.

🕸📝Fergus Duniho wrote on Fri, Jun 3, 2016 06:03 PM UTC:

Kevin,

I just recreated the FinishedGames table, and your Sac Chess game against Carlos is now listed there. I'm not sure why it didn't get in before, but I have been fixing up the code for entering finished games into this table, and hopefully something like this won't happen again. But if it does, let me know.


🕸📝Fergus Duniho wrote on Fri, Jun 3, 2016 08:58 PM UTC:

Things are getting weird. When I looked at Kevin Pacey's rating, I noticed it was still based on one game, not two. For some reason, the game he won was not getting added to the database. At this time, I was using the REPLACE command to populate the database, and it was failing silently. So, I truncated the table, changed REPLACE to INSERT, and recreated the table. This time, the game he won got in, but the game he lost did not. Maybe this game didn't make it in originally because of some mysterious problem with how INSERT works. It is frustrating that the MySQL commands are not performing reliably, and they are failing silently. So if it weren't for noticing these specific logs, I would be unaware of the problem.


🕸📝Fergus Duniho wrote on Fri, Jun 3, 2016 09:12 PM UTC:

I changed INSERT back to REPLACE and ran the script for creating the FinishedGames table again. This time, the log for the game Kevin lost got in, and the log for the game he won vanished, even though I did not truncate the table prior to doing this. Also, the total number of rows in the table did not change.


🕸📝Fergus Duniho wrote on Fri, Jun 3, 2016 10:06 PM UTC:

I finally figured out the problem and got the logs for both of Kevin's games into the FinishedGames table. The problem was that both logs had the same name, and the table was set up to require each log to have a unique name. So I ALTERed the table to remove all keys, then I made the primary key the combination of Log + Game. Different logs had been recorded with INSERT and REPLACE, because INSERT would go with the first log it found with the same name, and REPLACE would replace any previous entries for the same log name with the last one. This change increased the size of the table from 4773 rows to 4883 rows.
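
To illustrate the difference the composite key makes (using SQLite in place of MySQL and made-up values, not the actual table definition):

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Two different games that happen to share one log name.
    rows = [("kevin-vs-carlos", "Sac Chess"),
            ("kevin-vs-carlos", "Glinski's Hexagonal Chess")]

    # With the log name alone as the key, REPLACE-style inserts keep only one row.
    conn.execute("CREATE TABLE ByLog (Log TEXT PRIMARY KEY, Game TEXT)")
    conn.executemany("INSERT OR REPLACE INTO ByLog VALUES (?, ?)", rows)
    print(conn.execute("SELECT COUNT(*) FROM ByLog").fetchone()[0])  # 1

    # With the primary key on the combination of Log and Game, both rows survive.
    conn.execute("CREATE TABLE ByLogGame (Log TEXT, Game TEXT, PRIMARY KEY (Log, Game))")
    conn.executemany("INSERT OR REPLACE INTO ByLogGame VALUES (?, ?)", rows)
    print(conn.execute("SELECT COUNT(*) FROM ByLogGame").fetchone()[0])  # 2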


Aurelian Florea wrote on Mon, Dec 11, 2017 10:58 AM UTC:

The rating system could be off. I'm not sure if ratings should change instantly, meaning that once any game is finished, the rating is recalculated for the two players in question :)! Anyway, yesterday a few games of mine (I think 3) finished and the ratings have not changed. Mine should probably have ended up a bit below 1530.


🕸📝Fergus Duniho wrote on Mon, Dec 11, 2017 02:59 PM UTC:

Ratings are calculated holistically, and they are designed to become more stable the more games you play. You can read the details on the ratings page for more on how they work differently than Elo ratings.


Aurelian Florea wrote on Mon, Dec 11, 2017 03:48 PM UTC:

I did read the rules, but I have not understood them. It seemed to me that they do not look like Elo ratings, though. Anyway Fergus, are you saying that they work fine?


🕸📝Fergus Duniho wrote on Mon, Dec 11, 2017 04:56 PM UTC:

As far as I'm aware, they do.


Aurelian Florea wrote on Tue, Dec 12, 2017 03:29 AM UTC:

@Fergus,

I think I know what is going on with my ratings. By now I already have quite a few games played, and losing to a very high rated opponent or winning against a very low rated opponent does not mean much for the algorithm, in terms of correcting my rating. I think this is how it is supposed to work.

So, are you using a system of equations where the unknowns are the ratings, and the coefficients are based on the results :)?!...


🕸📝Fergus Duniho wrote on Tue, Dec 12, 2017 05:04 PM UTC:

Aurelian, I have moved this discussion to the relevant page.

By now I already have quite a few games played, and losing to a very high rated opponent or winning against a very low rated opponent does not mean much for the algorithm, in terms of correcting my rating. I think this is how it is supposed to work.

Yes, it is supposed to work that way.

So, are you using a system of equations where the unknowns are the ratings, and the coefficients are based on the results :)?!...

I'm using an algorithm, which is a series of instructions, not a system of equations, and the ratings are never treated as unknowns that have to be solved for. Everyone starts out with a rating of 1500, and the algorithm fine-tunes each player's rating as it processes the outcomes of the games. Instead of processing every game chronologically, as Elo does, it processes all games between the same two players at once.
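
A toy illustration of that last point (hypothetical data, not the script's variables): the finished games can be grouped by the pair of players involved, so each pair is handled as a batch rather than game by game in date order:

    from collections import defaultdict

    # Hypothetical finished-game records: (first player, second player, outcome).
    finished_games = [
        ("alice", "bob", "1-0"),
        ("alice", "carol", "1/2-1/2"),
        ("bob", "alice", "0-1"),
    ]

    # Group every game between the same two players, regardless of when it was played.
    by_pair = defaultdict(list)
    for white, black, result in finished_games:
        by_pair[frozenset((white, black))].append((white, black, result))

    for pair, games in by_pair.items():
        print(sorted(pair), "-", len(games), "game(s)")
    # ['alice', 'bob'] - 2 game(s)
    # ['alice', 'carol'] - 1 game(s)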


Aurelian Florea wrote on Wed, Dec 13, 2017 06:59 AM UTC:

Ok, I'm starting to understand it, thanks for the clarifications :)!


Kevin Pacey wrote on Mon, Apr 23, 2018 04:17 PM UTC:

Perhaps the Game Courier rating system could someday be altered to somehow take into account the number of times a particular chess variant has been played by a particular player, and/or between him and a particular opponent, when calculating the overall public (or rated) games played rating for a particular player.


Aurelian Florea wrote on Mon, Apr 23, 2018 04:58 PM UTC:

I used to think that players who play a more diverse assortment of variants are disadvantaged by the current system, but it is probably not a big deal.

Also maybe larger games with more pieces should matter more as they are definitely more demanding.

But both these things are quite difficult to do without a wide variety of statistics, which we cannot have at this time :)!


Joe Joyce wrote on Mon, Apr 23, 2018 05:42 PM UTC:

Actually, I found that when I played competitively a few years ago, the more different games I played, the better I played in all of them, in general. This did not extend to games like Ultima or Latrunculi, but it did apply to all the chesslike variants, as far as I can tell.


Aurelian Florea wrote on Tue, Apr 24, 2018 12:13 AM UTC:

There probably is something akin to general understanding :)!


🕸📝Fergus Duniho wrote on Tue, Apr 24, 2018 04:21 PM UTC:

This script has just been converted from mysql to PDO. One little tweak in the conversion is that if you mix wildcards with names, the SQL will use LIKE or = for each item where appropriate. So, if you enter "%Chess,Shogi", it will use "AND (Game LIKE '%Chess' OR Game = 'Shogi' )" instead of "AND (Game LIKE '%Chess' OR Game LIKE 'Shogi' )".
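
The rule as described, sketched in Python rather than the page's PHP, using query placeholders instead of the quoted literals shown above:

    def game_condition(games_field):
        # games_field is what the user typed into the Games field, e.g. "%Chess,Shogi".
        clauses, params = [], []
        for name in games_field.split(","):
            name = name.strip()
            op = "LIKE" if "%" in name else "="  # wildcard entries get LIKE, exact names get =
            clauses.append(f"Game {op} ?")
            params.append(name)
        return "AND (" + " OR ".join(clauses) + ")", params

    print(game_condition("%Chess,Shogi"))
    # ('AND (Game LIKE ? OR Game = ?)', ['%Chess', 'Shogi'])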


🕸📝Fergus Duniho wrote on Wed, Apr 25, 2018 04:44 PM UTC:

Perhaps the Game Courier rating system could someday be altered to somehow take into account the number of times a particular chess variant has been played by a particular player, and/or between him and a particular opponent, when calculating the overall public (or rated) games played rating for a particular player.

First of all, the ratings script can be used for a specific game. When used this way, all its calculations will pertain to that particular game. But when it is used with a wildcard or with a list of games, it will base calculations on all included games without distinguishing between them.

Assuming it is being used for a specific game, the number of times two players have played that game together will be factored into the calculation. The general idea is that the more times two players play together, the larger the effect their results will have on the calculation. After a maximum amount of change to their ratings is calculated, it is scaled by multiplying it by n/(n+10), where n is the number of games they have played together. As n increases, n/(n+10) will increase too, ever getting nearer to the limit of 1. For example, n=1 gives us 1/11, n=2 gives us 2/12 or 1/6, n=3 gives us 3/13, ... n=90 gives us 90/100 or 9/10, and so on.

During the calculation, pairs of players are gone through sequentially. At any point in this calculation, it remembers how many games it has gone through for each player. The more games a player has played so far, the more stable his rating becomes. After the maximum change to each player's rating is calculated as described above, it is further modified by the number of games each player has played. Using p for the number of games a player has played, the maximum change gets multiplied by 1-(p/(p+800)). As p increases, so does p/(p+800), getting gradually closer to 1. But since this gets subtracted from 1, that means that 1-(p/(p+800)) keeps getting smaller as p increases, ever approaching but never reaching the limit of zero. So, the more games someone has played, the less his rating gets changed by results between himself and another player.
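
Putting the two factors from the last two paragraphs together in a sketch (with max_change standing in for whatever the script has computed before these adjustments):

    def pair_factor(n):
        # n = games the two players have played together; approaches 1.
        return n / (n + 10.0)

    def stability_factor(p):
        # p = games this player has been through so far in the pass;
        # approaches 0, so well-established ratings move less.
        return 1.0 - p / (p + 800.0)

    def rating_change(max_change, games_together, player_games_so_far):
        return max_change * pair_factor(games_together) * stability_factor(player_games_so_far)

    # For the same results, a newcomer's rating moves more than a veteran's:
    print(rating_change(200.0, 5, 0))    # ~66.7 for a brand-new player
    print(rating_change(200.0, 5, 400))  # ~44.4 for a player with 400 past games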

Since p is a value that keeps increasing as the ratings are calculated, and its maximum value is not known until the calculations are finished, the entire set of calculations is done again in reverse order, and the two sets of results are averaged. This irons out the advantages any player gains from the order of calculations, and it ensures that every player's rating is based on every game he played.
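
The two-pass averaging can be sketched like this, with process_pairs() as a placeholder for the per-pair adjustment loop described above:

    def run_ratings(pairs, process_pairs):
        # process_pairs(pairs) -> dict mapping each player to a rating, with
        # everyone starting at 1500 and p (games seen so far) growing as it goes.
        forward = process_pairs(pairs)
        backward = process_pairs(list(reversed(pairs)))
        # Averaging the two passes evens out whatever advantage a player would
        # get from being processed early (small p) or late (large p).
        return {player: (forward[player] + backward[player]) / 2.0
                for player in forward}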

As I examine my description of the algorithm, the one thing that seems to be missing is that a player who has played more games should have a destabilizing effect on the other player's rating, not just a stabilizing effect on his own rating. So, if P1 has played 100 games, and P2 has played 10, this should cause P2's rating to change even more than it would if P1 had also played only 10 games. At present, it looks like the number of games one's opponent has played has no effect on one's own rating. I'll have to examine the code and check whether it really matches the text description I was referring to while writing this response.


Kevin Pacey wrote on Wed, Apr 25, 2018 08:14 PM UTC:

I was wondering whether we want a Game Courier rating system that rewards players for trying out a greater number of chess variants with presets. There are many presets that have barely been tried, if at all. Conversely, this could well 'punish' players who choose to specialize in playing only a small number of chess variants, perhaps for their whole Game Courier 'playing career'. [edit: in any case it seems, if I'm understanding right, that the current GC rating system may 'punish' the winner (a player, or a userid at least) of a given game between two particular players who have already played each other many times, by not awarding what might otherwise be a lot of rating points for winning the game in question.]

