Comments/Ratings for a Single Item
How would you measure the diversity of games played between two players? Suppose X1 and Y1 play 5 games of Chess, 2 of Shogi, and 1 each of Xiang Qi, Smess, and Grand Chess. Then we have X2 and Y2, who play 3 games of Chess, 3 of Shogi, 2 of Xiang Qi, and 1 each of Smess and Grand Chess. Finally, X3 and Y3 have played 2 games each of the five games the other pairs have played. So each pair of players has played ten games drawn from the same five games. For each pair, I want to calculate a trust value between a lower limit of 0 and an upper limit of 1, which I would then multiply by the maximum adjustment value to get a reduced adjustment value.
Presently, the formula n/(n+10) is used, where n is the number of games played between them. In this case, n is 10, and the value of n/(n+10) is 10/20 or 1/2. One thought is to add up separate fractions, one for each different game, each based on the number of times that game was played.
X1 and Y1
5/(5+10)+2/(2+10)+1/(1+10)+1/(1+10)+1/(1+10) = 5/15+2/12+1/11+1/11+1/11 = 17/22 = 0.772727272
X2 and Y2
3/13 + 3/13 + 2/12 + 1/11 + 1/11 = 6/13 + 2/12 + 2/11 = 695/858 = 0.81002331
X3 and Y3
2/12 * 5 = 10/12 = 0.833333333
The result of this is to put greater trust in a diverse set of games than in a less diverse set, yet this is the opposite of what I was going for.
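For anyone who wants to reproduce these figures, here is a minimal sketch in Python; the helper name per_game_trust is just something made up for illustration, and the per-game counts are the ones from the scenarios above.

```python
# Minimal sketch: sum n/(n+10) over the number of times each game was
# played between a pair of players. Counts come from the scenarios above.

def per_game_trust(counts, constant=10):
    """Sum n/(n + constant) over the per-game counts for one pair."""
    return sum(n / (n + constant) for n in counts)

scenarios = {
    "X1 and Y1": [5, 2, 1, 1, 1],  # 5 Chess, 2 Shogi, 1 each of Xiang Qi, Smess, Grand Chess
    "X2 and Y2": [3, 3, 2, 1, 1],
    "X3 and Y3": [2, 2, 2, 2, 2],
}

for pair, counts in scenarios.items():
    print(pair, round(per_game_trust(counts), 9))
# X1 and Y1 0.772727273
# X2 and Y2 0.81002331
# X3 and Y3 0.833333333
```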
How would this change if I changed the constant 10 to a different value? Let's try 5.
X1 and Y1
5/(5+5)+2/(2+5)+1/(1+5)+1/(1+5)+1/(1+5) = 5/10+2/7+3/6 = 9/7 = 1.285714286
Since this raises the value above 1, it's not acceptable. Let's try changing 10 to 20.
X1 and Y1
5/(5+20)+2/(2+20)+1/(1+20)+1/(1+20)+1/(1+20) = 5/25+2/22+1/21+1/21+1/21 = 167/385 = 0.433766233
X2 and Y2
3/23 + 3/23 + 2/22 + 1/21 + 1/21 = 6/23 + 2/22 + 2/21 = 2375/5313 = 0.447016751
X3 and Y3
2/22 * 5 = 10/22 = 0.454545454
This follows the same pattern, though the trust values are lower. To see the difference clearly, look at X2 and Y2, and compare 2/22, which is for two games of Xiang Qi, with 2/21, which is for one game each of Smess and Grand Chess. 2/22 is the smaller number, which shows that this approach gives less trust to the same game played twice than to two different games played once each.
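A self-contained version of the same sketch checks the other two constants: 5 pushes the X1/Y1 sum above 1, while 20 preserves the ordering at lower values.

```python
# Self-contained check of constants 5 and 20 with the same per-game counts.

def per_game_trust(counts, constant):
    return sum(n / (n + constant) for n in counts)

scenarios = [("X1 and Y1", [5, 2, 1, 1, 1]),
             ("X2 and Y2", [3, 3, 2, 1, 1]),
             ("X3 and Y3", [2, 2, 2, 2, 2])]

print(round(per_game_trust(scenarios[0][1], 5), 9))  # 1.285714286, above the upper limit of 1
for pair, counts in scenarios:
    print(pair, round(per_game_trust(counts, 20), 9))
# X1 and Y1 0.433766234
# X2 and Y2 0.447016751
# X3 and Y3 0.454545455
```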
Since it is late, I'll think about this more later. In the meantime, maybe somebody else has another suggestion.
Presently, the more games two players play together, the greater the amount of trust that is given to the outcome of their games, but each additional game they play together adds a smaller amount of trust. This is why players playing the same game together would produce a smaller increase in trust than players playing different games together in the calculations I was trying out in my previous comment. Since this is how the calculations naturally fall, is there a rationale for doing it this way instead of what I earlier proposed? If one player does well against another in multiple games, this could be more indicative of general Chess variant playing ability, whereas if one does well against another mainly in one particular game but plays that game a lot, this may merely indicate mastery in that one game instead of general skill in playing Chess variants, and that may be due to specialized knowledge rather than general intelligence. The result of doing it this way is that players who played lots of different games could see a faster rise in their ratings than a player who specialized in only a few games. However, people who specialized in only a few games would also see slower drops in their ratings if they do poorly. For each side, there would be some give and take. But if we want to give higher ratings to people who do well in many variants, then this might be the way to do it.
Hi Fergus
Note I did put a late, second edit to my previous post, mentioning the small distinction that we're talking about specific userids rather than specific players. I made this distinction since it's possible (and evident in some cases already on GC's ratings list) that people can have more than one userid, hence more than one Game Courier rating. While presumably it would be tough to prevent this if desired, starting a new rating from scratch at least does not guarantee a player that he will get a higher one after many games (a long time ago, it may be worth noting, the Chess Federation of Canada allowed a given player to at least once effectively destroy his existing rating and begin again from scratch, perhaps for a fee).
It goes without saying that I am talking about userids. The script is unable to distinguish players by anything other than userid, and it has no confirmed data on which players are using multiple userids. All I can do about this is discourage the use of multiple userids so that this doesn't become much of a factor. But if someone wants to play games with multiple userids, he presumably has a reason for wanting to keep separate ratings for different games.
One concern I had was that adding up fractions for the number of times two players played each separate game could eventually add up to a value greater than 1. For example, if two players played 12 different games together, the total would be 12 * (1/11) or 12/11, which is greater than 1. One way to get around this is to divide the total by the number of different games played. Let's see how this affects my original scenarios:
X1 and Y1
5/(5+10)+2/(2+10)+1/(1+10)+1/(1+10)+1/(1+10) = 5/15+2/12+1/11+1/11+1/11 = 17/22 = 0.772727272
17/22 * 1/5 = 17/110 = 0.154545454
X2 and Y2
3/13 + 3/13 + 2/12 + 1/12 + 1/11 = 6/13 + 2/12 + 2/11 = 695/858 = 0.81002331
695/858 * 1/5 = 695/4290 = 0.162004662
X3 and Y3
2/12 * 5 = 10/12 = 0.833333333
10/12 * 1/5 = 10/60 = 0.1666666666
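A quick way to reproduce these divided figures, again as a sketch with a made-up helper name:

```python
# Sketch: sum n/(n+10) over the games played, then divide by the number
# of different games played between the pair.

def normalized_trust(counts, constant=10):
    return sum(n / (n + constant) for n in counts) / len(counts)

for pair, counts in [("X1 and Y1", [5, 2, 1, 1, 1]),
                     ("X2 and Y2", [3, 3, 2, 1, 1]),
                     ("X3 and Y3", [2, 2, 2, 2, 2])]:
    print(pair, round(normalized_trust(counts), 9))
# X1 and Y1 0.154545455
# X2 and Y2 0.162004662
# X3 and Y3 0.166666667
```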
As before, these values are greater where the diversity is more evenly spread out, which is to say more homogeneous.
However, the number of different games played was fixed at 5 in these examples, and the number of total games played was fixed at 10. Other examples need to be tested.
Consider two players who play 20 different games once each and two others who play 10 different games twice each. Each pair has played 20 games in total.
Scenario 1: 20 different games
(20 * 1/11) / 20 = 20/11 * 1/20 = 1/11
Scenario 2: 10 different games twice
(10 * 2/12)/10 = 20/12 * 1/10 = 2/12 = 1/6
When the same formula is applied to these two scenarios, the 20 different games have no more influence than a single game played once, which is very bad. This would severely limit the ratings of people who play a wide variety of games. So, if diversity of games played is to be factored in, something else will have to be done.
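Running the same normalized sketch on these two pairs makes the problem plain; recall that a single game played once gets 1/(1+10) = 1/11 under the plain n/(n+10) formula.

```python
# The two 20-game pairs above, run through the divide-by-different-games formula.

def normalized_trust(counts, constant=10):
    return sum(n / (n + constant) for n in counts) / len(counts)

print(round(normalized_trust([1] * 20), 9))  # 0.090909091 = 1/11, no better than one game played once
print(round(normalized_trust([2] * 10), 9))  # 0.166666667 = 1/6
```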
The problem is that the importance of diversity is not as clear as the importance of quantity. It is clear that the more games two players have played together, the more likely it is that the outcome of their games is representative of their relative playing abilities. But whether those games are mixed or the same does not bear so clearly on how likely it is that the outcome of the games played reflects their relative playing abilities. With quantity as a single factor, it is easy enough to use a formula that returns a value that gets closer to 1 as the quantity increases. But with two factors, quantity and diversity, it becomes much less clear how they should interact. Furthermore, diversity is not simply about how many different games are played but also about how evenly the diversity is distributed, what I call the homogeneity of diversity. When I think about it, homogeneity of diversity sounds like a paradoxical concept. The X3 and Y3 example has a greater homogeneity of diversity than the other two, but an example where X4 and Y4 play Chess 10 times has an even greater homogeneity of diversity but much less diversity. Because of these complications in measuring diversity, I'm feeling inclined to not factor it in.
The most important part of the GCR method is the use of trial-and-error. Thanks to the self-correcting nature of trial-and-error, the difference that factoring in diversity could make is not going to have a large effect on the final outcome. So, unless someone can think of a better way to include a measure of diversity, it may be best to leave it out.
If left as is, is the current rating system at least somewhat kind to a player who suddenly improves a lot (e.g. through study or practice), but who has already played a lot of games on Game Courier? I'm not so sure, even if said player from then on plays much more often vs. players he hasn't much played against before on GC.
I was thinking that older results (older both in time and in the number of games played since) should maybe fade away. Is that very difficult to implement, Fergus? It seems fairer, but the trouble is that you need many games at the "same time" to make the ratings meaningful, and with the current population that cannot easily be done :)!
The script includes an age filter. If you don't want to include old ratings, you can filter them out.
I was also thinking the results from old games should be less trusted than the results from new games. A recent game is a better indication of a player's ability than a 10-year-old game.
Off-topic, there is much a chess variants player might do to improve in a relatively short period of time (aside from suddenly improved real-life conditions partly or wholly beyond his control, such as recovering from poor health or personal problems, or acquiring more free time than before). Besides any sort of intuition/experience acquired through sustained practice, there's general or specific study he might do on his own, as I alluded to previously. As Joe alluded to, there are many variants that are rather like standard chess, and study and practice of chess probably can only help playing chess variants generally.
Unlike for many chess variants, there is an abundance of chess literature that can help improvement, even at many variants, and hiring a chess teacher, coach, or trainer will probably help those who play chess variants too. A chess trainer can help with any physical fitness regime, which also can help those who play chess variants. Similar avenues for improvement may be available to those who play other major chess variants that have literature of their own, such as Shogi and Chinese Chess, though these two are perhaps less generally applicable for overall improvement at chess variants than chess would be (not sure).
@Fergus,
I think it is not only about "old" in the calendar sense, but also "old" in the many games ago sense.
Also, I think fading away is a nicer way of putting things than cut offs :)!
Since different people play games at different rates, the last n games of each player would not return a single set of games that would work for everyone. A chronological cut-off point works better, because it can be used to select a uniform set of games.
I was envisioning a system where older games are given lesser weight by some formula, down to some minimum. A game should have, say, at least half the weight of a recent game, no matter how old.
I can play around with formulas if there's interest. Beyond the question of age, however, I think the system is good as-is.
I have two problems with discounting the results of older games. One is that the decision concerning what age to start reducing game results is arbitrary. The other is that the result of a game is zero points for a loss and one point for a win. While one point for a win could be reduced to a smaller value, zero points for a loss could not be without introducing negative values. The alternative would then be to reduce the 1 point for a win and increase the zero points for a loss, making the results of the older game more drawish. This would distort the results of a game and possibly result in less accurate ratings. I prefer using the age filter to set a clean cut-off point at which older results just aren't included in the calculation.
I don't see the first one being a problem as they would decrease in significance with age on a smooth curve. The other concern is a problem though. I agree increasing the value of losses up from zero is a bad idea. This would only work if a win was +1 and a loss was -1.
Stupid question: Could you rate wins as +2 and losses as -1, and would that help?
No, I will be sticking to the traditional values of 1 for a win and 0 for a loss.
I don't think there is any problem weighting games with a {0, 1} result. You just add {0, w} to the score, and w to the number of games played. That 0 stays 0 is not a problem; the effect of the weighting comes in through the number of games. E.g. if you won a recent game (with weight 1) and lost an old one (with weight w), your average will be 1/(1+w), which is greater than 0.5 if w < 1, because the old loss is weighted less.
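A minimal sketch of this weighting; the 0.5 used for the old game's weight is only an example value.

```python
# Each game contributes its weight to the game count; a win contributes
# its weight to the score, a loss contributes 0.

def weighted_average(results, weights):
    """results: 1 for a win, 0 for a loss; weights: one weight per game."""
    score = sum(r * w for r, w in zip(results, weights))
    return score / sum(weights)

w = 0.5  # assumed weight of the old game
print(weighted_average([1, 0], [1, w]))  # 1/(1+w) = 0.666..., above 0.5 because the old loss counts less
```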
I think HG's idea could work. It is along the lines of what I was thinking but somewhat better put. There should be a clearer rule, though, but fading away of the results (never to 100%) is a reasonable concept :)!
The scores are used to determine the number of games played between two players, because that is what they add up to.
Well, the weighted scores would add up to the weighted number of games. A recent 1-0 plus an old 0-w adds up to 1+w games, so the player averages will be 1/(1+w) and w/(1+w). When your rating calculation requires the average opponent rating, you will have to weight that average in the same way, of course.
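And the weighted average opponent rating would look like this; the ratings 1600 and 1500 are just example numbers.

```python
# Weighted average of opponent ratings, using the same per-game weights.

def weighted_opponent_average(opponent_ratings, weights):
    return sum(r * w for r, w in zip(opponent_ratings, weights)) / sum(weights)

print(weighted_opponent_average([1600, 1500], [1, 0.5]))  # ~1566.7; the recent opponent counts more
```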
If you were going to just veto the notion by dictatorial decree, you could have said so at the beginning and spared us the pretence of discussion.
Such a ratings system would be more complicated and work differently than the current one. The present system can work for a single game or for a set of games, but when it does work with a set of games, it treats them all as though they were the same game.
Yes, that would be the result. Presently, someone who specializes in a small set of games, such as Francis Fahys, can gain a high general GCR by doing well in those games.
Game Courier ratings are not calculated on a game-by-game basis. For each pair of players, all the games played between them factor into the calculation simultaneously. Also, it is not designed to "award" points to players. It works very differently than Elo, and if you begin with Elo as your model for how a ratings system works, you could get some wrong ideas about how GCR works. GCR works through a trial-and-error method of adjusting ratings between two players to better match the ratings that would accurately predict the outcome of the games played between them. The number of games played between two players affects the size of this adjustment. Given the same outcome, a smaller adjustment is made when they have played few games together, and a larger adjustment is made when they have played several games together.
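As a rough sketch of the role trust plays here: the trust value between 0 and 1 scales a maximum adjustment. The maximum adjustment of 32 below is only an assumed illustrative number, not what the script actually uses.

```python
# Trust scales a maximum adjustment: few games between a pair means a
# small correction, many games means a larger one.
MAX_ADJUSTMENT = 32  # hypothetical value, for illustration only

def trust(games_played, constant=10):
    return games_played / (games_played + constant)

def adjustment(games_played):
    return trust(games_played) * MAX_ADJUSTMENT

print(round(adjustment(1), 2))   # 2.91  (1/11 of the maximum)
print(round(adjustment(10), 2))  # 16.0  (half of the maximum)
```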
Getting back to your suggestion, one thought I'm having is to put greater trust in results that come from playing the same game and to put less trust in results that come from playing different games together. More trust would result in a greater adjustment, while less trust would result in a smaller adjustment. The rationale behind this is that results for the same game are more predictive of relative playing ability, whereas results from different games are more independent of each other. But it is not clear that this would reward playing many variants. If someone played only a few games, the greater adjustments would lead to more extreme scores. This would reward people who do well in the few variants they play, though it would punish people who do poorly in those games. However, if someone played a wide variety of variants, smaller adjustments would keep his rating from rising as fast if he is doing well, and they would keep it from sinking as fast if he is not doing well. So, while this change would not unilaterally reward players of many variants over players of fewer variants, it would decrease the cost of losing in multiple variants.