
Comments/Ratings for a Single Item

Game Courier Ratings. Calculates ratings for players from Game Courier logs. Experimental.
🕸📝Fergus Duniho wrote on Wed, Jan 11, 2006 04:20 PM UTC:
Your rating could also rise without playing any more games. Your rating in GCR is a holistic measure of your relative performance against everyone else in your playing network. It depends on how everyone is doing relative to each other. There is no such thing as a true rating, because all ratings are relative. The method is not trying to measure your playing strength in some absolute terms that can be given a specific number with a specific meaning. The one meaningful constant here is the difference between two players' scores. Each time it recalculates ratings, this page will be using all the data at hand to give the best estimates of relative playing strength that it can. It would not do a better job of this by keeping ratings static when people aren't playing.

Gary Gifford wrote on Wed, Jan 11, 2006 05:28 PM UTC:
Originally I wrote, in part: 'For a player's rating to rise or fall while sitting on his (or her) laurels seems terrible to me.' However, in the 4 hours that have since passed, I have reversed my hasty opinion (obviously biased by years under the USCF rating system). Anyway, since we are talking about a player's playing strength in relation to other players' playing strengths [an ever-changing field of relative values], a static [or frozen] rating just isn't realistic... and is not a valid number for comparison. So, upon further thought, I have crossed the fence to Fergus's side of the ratings camp. I still don't like the idea of fun games and experimental games getting thrown into the equation, but I guess we have to start somewhere. In closing, thank you, Fergus, for all the effort you are putting into this. I am sure it will turn out well and be valued over time.

Gary Gifford wrote on Wed, Jan 11, 2006 11:37 PM UTC:
Good questions. I'll defer them to Fergus, as he understands what is going on here far better than I do, and I could end up giving a wrong answer. But I do know a player who was rated about 2000. Unfortunately, he has a mental condition; he is now about 1400 and getting weaker in all cognitive areas. It is now a strain for him just to walk. Understandably, he could have quit playing chess while at 2000... but he still plays. Anyway, if he had quit at 2000, his frozen 2000 rating would certainly be false. Of course, if he quit and his rating climbed, that too would be false. It would need to drop over time to reflect reality. Would this happen with the equations Fergus is using? I don't know. We can shoot all kinds of rating situations around and argue one way or the other, but what is the point? Does it really matter?

Why should we get so wrapped up in these values?  They are just a means of
comparison.  Before we had nothing.  Now we will have something.  If we do
not like that 'something' then we can choose the 'unrated game' option
once implemented.  We can also play in USCF tournaments where our ratings
will freeze once we quit playing.

Christine Bagley-Jones wrote on Thu, Jan 12, 2006 01:55 AM UTC:
Yeah, no need to get wrapped up in it, but it would be good to get the best rating system in place. I am sure it would also save Fergus a lot of hassle in the future if people complain and say 'other sites have a better system', etc. It will be kinda fun, too, to see people have ratings; then you can see who is the 'favorite' and the 'underdog' in games, etc. High drama :)

Roberto Lavieri wrote on Thu, Jan 12, 2006 01:02 PM UTC:
I don't dislike Fergus's method, but it needs some tuning and, perhaps, one or a couple of new modifiers added, although that could make the method unnecessarily complicated. I have not had enough time to go deep into it, but I'll try at some point.

🕸📝Fergus Duniho wrote on Thu, Jan 12, 2006 04:57 PM UTC:
I have changed the method again. The changes are along the lines of what I was describing yesterday. Stability is now a function of how many of each player's games have already been factored into the calculations for his rating. This will cause the ratings of players who have played more games to stabilize more than the ratings of other players. I have also added a gravity factor. This is a function of stability. It goes to the player with the higher stability, and it diminishes with distance. It increases the reliability of the scores in the direction of the player with greater stability. Thus, the players who have played the most games become gravitational centers around which the ratings of less experienced players gravitate. The specific details are given in the description of the method. I expect it will still need some tweaking, and I plan to add a form field for entering test data.
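
For illustration only, here is a minimal Python sketch of how factors like these could behave. The actual formulas are in the method description on this page; both function forms below are invented for the example, not taken from the real code:

    def stability(games_counted, scale=20.0):
        # Hypothetical form: grows toward 1.0 as more of a player's games
        # have entered the calculations; 0.0 for a player with no games.
        return games_counted / (games_counted + scale)

    def gravity(stability_a, stability_b, rating_gap):
        # Hypothetical form: positive when player A has the higher
        # stability, and diminishing with rating distance, so the less
        # experienced player's rating is pulled toward the more
        # experienced player's.
        return (stability_a - stability_b) / (1.0 + abs(rating_gap) / 400.0)

With forms like these, a veteran of 100 games would have stability near 0.83 while a newcomer with 5 games sits near 0.2, making the veteran the gravitational center of any pairing between them.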

🕸📝Fergus Duniho wrote on Thu, Jan 12, 2006 05:16 PM UTC:

Michael Howe asks:

what, therefore, is the refutation to my concern that a player's rating be retroactively affected by the performance of a past opponent whose real playing strength has increased or decreased since that player last played him?

A system that offers estimates instead of measurements is always going to be prone to making one kind of error or the other. This is as true of Elo as it is of GCR. Keeping ratings static may avoid some errors, but it will introduce others. The question to ask then is not how to avoid this kind of error or that kind of error. The more important question to ask is which sort of system will be the most accurate overall. Part of the answer to that question is a holistic system. Given that the system estimates relative differences in playing strength, the best way to estimate these differences is a method that bases every rating on all available data. Because of its monodirectional chronological nature, Elo does not do this. But the GCR method does do this. This allows it to provide a much more globally consistent set of ratings than Elo can with its piecemeal approach of calculating ratings. Since ratings have no meanings in themselves and mean something only in relation to each other, a high level of global consistency is the most important way in which a set of ratings can be described as accurate. Since a holistic method is the most important means, if not actually necessary, for achieving this, a holistic method is the way to go, regardless of whatever conceivable errors might still be allowed.
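
To make the contrast concrete, here is the standard Elo update after a single game. It adjusts only the two players involved, in chronological order, and never revisits earlier results; a holistic method instead recomputes every rating from all available game data each time. This snippet is the textbook Elo formula, not GCR code:

    def elo_update(r_a, r_b, score_a, k=32):
        # score_a: 1 if player A won, 0.5 for a draw, 0 if A lost.
        expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
        r_a_new = r_a + k * (score_a - expected_a)
        r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
        return r_a_new, r_b_new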


🕸📝Fergus Duniho wrote on Thu, Jan 12, 2006 07:03 PM UTC:

The testdata field takes data in a line-by-line grid form like this:

1500 0 1 0
1500 0 0 1
1500 1 0 0

It automatically names players with letters of the alphabet. Each line begins with a rating, followed by the number of wins against each player in column order. The above form means that A beat B once, B beat C once, and C beat A once.
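
As a sketch of how such input could be read (this parser is illustrative, not the page's actual code):

    import string

    def parse_testdata(text):
        # Each line: a starting rating followed by win counts against
        # players A, B, C, ... in column order.
        players = {}
        for i, line in enumerate(text.strip().splitlines()):
            fields = line.split()
            players[string.ascii_uppercase[i]] = {
                'rating': float(fields[0]),
                'wins': [int(n) for n in fields[1:]],
            }
        return players

    data = parse_testdata('1500 0 1 0\n1500 0 0 1\n1500 1 0 0')
    # data['A']['wins'][1] == 1, i.e. A beat B once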


Roberto Lavieri wrote on Thu, Jan 12, 2006 07:11 PM UTC:
I agree with the Gravity modifier, but you need to tune all the modifiers, although I am not sure what is going to be best; we need good arguments for the decisions, and these are not entirely clear. I think the method, as it stands, has some weaknesses, basically due to the weight of the modifiers. If you have some trouble with my English, please tell me. I'm trying to write English as orthodox as possible.

Roberto Lavieri wrote on Thu, Jan 12, 2006 09:48 PM UTC:
There is a small error in my all-time performance. The result in the log rlavieri2003-cvgameroom2004-318-638 was not counted (it says 'has won', but no one is mentioned). I believe there is another error, but I can't find it; my own record gives 38.0/75.

🕸📝Fergus Duniho wrote on Thu, Jan 12, 2006 10:30 PM UTC:
Because Game Courier hasn't kept track of winners and losers by userid, I have had to resort to checking the status string to find the winner. It reads the name from there, compares it with the name on record for each player, and determines from that who has won. If there is a problem with a particular status string, it has to be fixed in the log. You or your opponent may be able to do this by updating who has won in the log in question.
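
A hypothetical reconstruction of that check (the exact status-string format isn't documented in this thread; this assumes a string such as 'Fergus has won', which matches the 'has won' wording quoted above):

    def find_winner(status, white_name, black_name):
        # Read the name from the status string and compare it with the
        # name on record for each player.
        if ' has won' not in status:
            return None  # unfinished, drawn, or malformed status
        name = status.split(' has won')[0].strip()
        if name == white_name:
            return 'white'
        if name == black_name:
            return 'black'
        return None  # no match: the log itself needs to be fixed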

🕸📝Fergus Duniho wrote on Thu, Jan 12, 2006 11:34 PM UTC:
After testing some different values, I increased the value of stability, and reduced the values of reliability and gravity. One thing I was trying to do was bring all the ratings in a perfect cycle to 1500. This is where A beats B, B beats C, C beats D, D beats E, and E beats A, and they all start at 1500. I didn't succeed at this, but I did manage to bring them closer. The overall effect of the changes I made was to bring all ratings closer together. Among all the games, the greatest difference is now little more than 400. This may be a fair estimate, given that not a lot of games have been played yet, and many people have still played very few games. Also, I removed the testdata field for the time being.
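
In the testdata format described earlier, that perfect five-player cycle looks like this; by symmetry, every player has exactly one win and one loss, so ideally all five ratings should remain at 1500:

    1500 0 1 0 0 0
    1500 0 0 1 0 0
    1500 0 0 0 1 0
    1500 0 0 0 0 1
    1500 1 0 0 0 0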

🕸📝Fergus Duniho wrote on Fri, Jan 13, 2006 03:37 AM UTC:
I've just done some more tweaking of values. After calculating the ratings, I used them to predict the scores, and I measured their success rate. I used this measure to help me tweak values. I got it up to about 75% accuracy, but it was at the expense of too rapidly raising the ratings of players who had played few games. I let it drop to about 72% to prevent the ratings of less experienced players from changing too quickly. In general, the final ratings have been slightly more accurate than the two separately calculated sets of ratings that were averaged to get them. The main change I made was to increase reliability. I left stability and gravity at what I tweaked them to earlier today.
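
One plausible reading of that accuracy measure, as a sketch (the helper and its inputs are illustrative, not the page's actual code):

    def prediction_accuracy(ratings, results):
        # Predict that the higher-rated player wins each game, then count
        # how often the prediction matches the recorded result.
        # 'results' is a list of (winner, loser) name pairs.
        correct = sum(1 for winner, loser in results
                      if ratings[winner] > ratings[loser])
        return correct / len(results)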

🕸📝Fergus Duniho wrote on Fri, Jan 13, 2006 04:19 AM UTC:
I brought the accuracy on the current data up to almost 75% by changing the order in which the players are sorted. They are now ordered by how many games they have played. The most accurate sequence of calculations is the one that begins with the players who have played the most games. Although including the reverse sequence of calculations has reduced the overall accuracy score, it has also corrected one error I spotted in the first series of calculations. Although the accuracy rate is an important measure, it is not the only way to assess accuracy, and the data I have used it on is not enough to draw any firm conclusions. I remain convinced that using two sequences of calculations in opposite orders is overall better than using only one.
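
A sketch of the two-sequence idea described above (calculate_pass stands in for an actual GCR rating pass, which is not shown here):

    def rate_two_sequences(players, games_played, calculate_pass):
        # Order players by experience, run one pass starting from the
        # most experienced players and one pass in the reverse order,
        # then average the two resulting sets of ratings.
        order = sorted(players, key=lambda p: games_played[p], reverse=True)
        first = calculate_pass(order)
        second = calculate_pass(list(reversed(order)))
        return {p: (first[p] + second[p]) / 2 for p in players}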

Roberto Lavieri wrote on Fri, Jan 13, 2006 01:46 PM UTC:
I think that a 'weighted history' makes sense in every rating system. Recently played games should carry more weight in the rating calculations than old ones. This may help to reflect drastic changes in a player's real strength. Illness, temporary disinterest, and other factors can make a player's skills decline, while experience, progressive knowledge of a game, high interest, and other factors can help a rating increase quickly in some cases.

🕸📝Fergus Duniho wrote on Fri, Jan 13, 2006 05:39 PM UTC:
A weighted history would work with the assumption that anyone who isn't
actively playing games is deteriorating in ability. I'm not sure this is
an accurate assumption to make. Also, the factors you list as causing
performance to drop are going to have less effect on games played on Game
Courier, because these are correspondence games played over a long period
of time, and a person may take time off for illness or disinterest without
it affecting his game. When it does affect someone's games, it will
generally affect fewer games than it would for someone active in Chess
tournaments. Also, the large timeframe over which games are played is going
to make it even harder to determine what the weights should be. For these
reasons, I am not convinced that a weighted history should be used here.

Anyway, if you do want ratings that better reflect the current level of
play among active players, you already have the option of using the Age
Filter to filter out older games. I think that should suffice for this
purpose.

Roberto Lavieri wrote on Fri, Jan 13, 2006 07:24 PM UTC:
You are right about the use of the Age Filter to reflect 'current' ratings (this is not entirely true, but it can be a better approximation), although I still disagree with you about the weighted history. I think it could be good for our purposes, but I recognize it is not easy to give the weights in every case. This site contains many games for which people are still learning and building, through experience, a basis for better play, and this is a step-by-step process, perhaps a long one; all of us must be considered real novices in many games, and this is a reason to consider a weighted history, precisely because of the nature of this site. The case is different if we are talking about old, popular games widely played for a long time, but TCVP contains many new games, and the list is expected to grow in the future. I insist on another claim: not all games should be rated, or the rating system can become a tool that mainly reflects how good someone is at playing in an inedit scenario. The list of 'rated' games can grow, but with games that become 'relatively popular' over time.

Roberto Lavieri wrote on Fri, Jan 13, 2006 07:37 PM UTC:
The Age Filter and some other filters don't work yet.

Roberto Lavieri wrote on Fri, Jan 13, 2006 08:11 PM UTC:
I used 'inedit' in a past comment; this is not an English word. Use 'new' instead.

🕸📝Fergus Duniho wrote on Fri, Jan 13, 2006 08:11 PM UTC:
I have fixed the Age Filter, so it now works. I have tested the other filters, and they all work. If you see a bunch of warnings when you try to view only rated games, that's because there are none, and the program is reporting problems with an empty array.

🕸📝Fergus Duniho wrote on Fri, Jan 13, 2006 08:28 PM UTC:
I want to draw attention to the main change I made today. You may notice
that the list of ratings now uses various background colors. Each
background color identifies a different network of players. The
predominant network is yellow, and the rest are other colors. Everyone in
a network is connected by a chain of opponents, all of whom are also in
the network.
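
In graph terms, each network is a connected component of the graph whose vertices are players and whose edges are games. A minimal sketch of how the groups could be found (illustrative only; 'opponents' maps each player to the set of players he has faced):

    def player_networks(opponents):
        # Flood-fill from each unvisited player; everyone reachable
        # through a chain of opponents lands in the same network.
        seen, networks = set(), []
        for start in opponents:
            if start in seen:
                continue
            group, queue = set(), [start]
            while queue:
                p = queue.pop()
                if p not in group:
                    group.add(p)
                    queue.extend(opponents[p] - group)
            seen |= group
            networks.append(group)
        return networks

The largest component returned would be the predominant (yellow) network.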

Regarding weighted histories, they probably work better for the world of competitive Chess, in which active players normally play several rated games at regular intervals. That frequency and regularity provide a basis for weighting games. But Game Courier lacks anything like this. Games here are played at the players' leisure.

🕸📝Fergus Duniho wrote on Fri, Jan 13, 2006 11:57 PM UTC:
I have updated the 'GCR vs Elo' text and rearranged how sections of this page are displayed.

Mark Thompson wrote on Sat, Jan 14, 2006 02:33 AM UTC:
I've always thought the best implementation of ratings would be an 'open-source' approach: make public the raw data that go into calculating the ratings, and allow many people to set up their own algorithms for processing the data into ratings. So users would have a 'Duniho rating' and a 'FIDE rating' and 'McGillicuddy rating' and so on. Then users could choose to pay attention to whichever rating they think is most significant. Over time, studies would become available as to which ratings most accurately predict the outcomes of games, and certain ratings would outcompete others: a free market of ideas.

(zzo38) A. Black wrote on Sat, Jan 14, 2006 04:33 AM UTC:
Quote: I've always thought the best implementation of ratings would be an 'open-source' approach ... a free market of ideas.
I also like the open-source approach (maybe make the raw data XML, plain text, or both), but there should also be one built into this site as well, so that if you don't have your own implementation, you can still view your rating.

Mark Thompson wrote on Sat, Jan 14, 2006 05:13 AM UTC:
'I also like the open-source approach ... but there should also be one built into this site as well.'

Sure, the site should have its own 'brand' of ratings. But I mean, it
would be good to make ratings from many user-defined systems available
here also. Just as the system allows users to design their own webpages
(subject to editorial review) and their own game implementations, there
could be a system whereby users could design their own ratings systems,
and any or all of these systems could be available here at CVP to anyone who
wants to view them, study their predictive value, or use them for tournament
pairings, etc.

Of course, it's much easier to suggest a system of multiple user-defined
rating schemes (hey, we could call it MUDRATS) than to do the work of
implementing it. But if enough people consider the idea and feel it has
merit, eventually someone will set it up someplace and it will catch on.
