Day One has come and gone, and your online tournament is up and running. Hopefully, more and more people will join the ongoing online battle. When assigning an initial PQ to each latecomer, you will have to keep in mind that our algorithm is designed to be a "zero sum game": one player's gain is always compensated by another player's loss. If the initial PQ value of every player is set to the average of the distribution, player migration into the system will not change the zero mean value.
The chess player association did not account for this. Consequently, as chess player numbers increased, the new participants dragged down the ratings of even the top level players, a steady drain that makes comparisons over the years unsatisfying guesswork. You could cheat a bit by assigning initial PQ values at random, as long as the average over all initial values is preserved. This might come in handy for setting up Day One - to get things moving, it might be a smart move to scatter ratings widely at first, randomly assigned in a pre-registration lottery event.
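A minimal sketch of that lottery idea, assuming initial values are drawn uniformly around the current mean and then re-centered so the population average (and thus the zero sum) is preserved; mean, spread, num_players and initial_pq are placeholder names, not from the scheme above:
#include <stdlib.h>
// Hypothetical Day One lottery: random initial PQs, re-centered afterwards.
float sum = 0.0f;
for (int i = 0; i < num_players; i++) {
    initial_pq[i] = mean + spread*(2.0f*rand()/(float)RAND_MAX - 1.0f);
    sum += initial_pq[i];
}
float correction = mean - sum/num_players;
for (int i = 0; i < num_players; i++)
    initial_pq[i] += correction;   // average is now exactly 'mean'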
There is also the inverse problem: players that do not play anymore, effectively removing themselves (and their share of points) from the system. In a zero sum, zero average rating system, any attempt to introduce a "rating decay" over time has to redistribute the rating points to all other players. Paradoxically, the decay does not decrease the PQ for all players - only for half of them:
decay = 0.1; // per time period
pq -= decay*(pq-mean);
For an above average player that is no longer competing, PQ will steadily approach the average rating, but never drop below it. For a player with a below average rating, however, not competing will increase the PQ until it approaches the average.
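To see the claim in numbers, here is a small illustration (values picked arbitrarily) iterating the decay step for two inactive players, one above and one below the average:
// Illustration only: the decay pulls both PQs towards the mean,
// but neither ever crosses it.
float mean  = 0.0f;
float decay = 0.1f;      // per time period
float above =  200.0f;   // inactive, above average
float below = -200.0f;   // inactive, below average
for (int period = 0; period < 20; period++) {
    above -= decay*(above - mean);
    below -= decay*(below - mean);
}
// After 20 periods: above is roughly +24, below roughly -24.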
If this seems unintuitive and unacceptable to you, remember that PQ is our estimate of player ability. If we do not have any data, be it because a player has just joined, or because a player is simply not playing, then our best estimate is that the player has average strength - it is the most likely result, after all. Some rating algorithms attempt to avoid this, but there is no "patch" to combine two unrelated properties (the player strength estimate, and our confidence in it based on the amount of data we have accumulated) into a single number. Mixing data from different domains is difficult at best, and usually a bad idea.
Many gaming associations have adopted the Elo rating system, and often, parameters and formulas are tweaked to account for various, often mutually exclusive, purposes. At its core, all rating starts with the mechanism outlined in this article. The more complications and patches you add on top of the pure solution, the less comprehensible the results.
To express confidence in player ratings, it is better to have a second number that describes how frequently a player has participated in matches, and how many encounters are listed in our database. If necessary, rank displays can be filtered by such a "confidence level". Such a mechanism offers an incentive for participating often, and for participating over a long time, without imposing a penalty (e.g. a decay) for temporary absence.
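A minimal sketch of such a filter, assuming the player database can report how many rated matches a player has on record; getMatchCount, MIN_RATED_MATCHES and the two show calls are placeholders, not part of the scheme above:
// A second number next to PQ: how much data backs the estimate.
int matches = PlayerDB.getMatchCount( player );  // assumed helper
if (matches >= MIN_RATED_MATCHES)                // e.g. 20 games
    showRanked( player, pq );
else
    showProvisional( player );                   // listed, but not ranked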
It is inevitable that a rating changes its meaning over time. Our PQ is only defined with respect to the player population at a given time. If the community changes, so does the meaning of our PQ. Comparing today's ratings with those two years from now only makes sense if you take for granted that the players are, on average, equally capable at both times - unless you decide to measure against a bot, an artificial opponent. Needless to say, you can't change the game or the rules either.
Of course, the next problem we are facing is one of user interface: What value do we want to display, when, and how?
Signing Off
Given an ideal Gaussian distribution, you can expect to find 68.3% of the players within one width of the average, i.e. in the interval from mean-width to mean+width - only 15.8% are supposed to be better, another 15.8% not as good as the players in this interval. This gives you a practical limit on how much dispersal you can allow for while keeping the overwhelming majority of players within a certain interval.
Why would you want to do so? Here we leave the realm of mathematics and venture into the area of psychology: the impact of negative numbers. Let's say the average PQ is 1500 points, and you have a width of about 500 points. In this case roughly 2.3% of the players will end up with ratings below 500 points, and some 0.15% might rate below zero - and they will not like it. Mind you, in a pure zero sum game with that bell-shaped distribution, about half of your customers would face negative ratings.
What to do then? Remember that condoms are sold as large, extra, and gigantic, but never tiny? You should not give up the zero sum, zero average PQ used internally, but you can offset the PQ for displayed rating values by some sufficiently large number. You will probably wind up multiplying the internal values by a large scaling factor as well.
Due to the shape of the normal distribution, any offset, no matter how large, will still cut into the low end "tail". There might always be some poor souls ending up below zero. Your game could simply map any rating below a certain positive threshold to a minimal "display" value (effectively hiding the real values), but there is an elegant alternative: fixed range rating.
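A sketch of the offset-and-floor mapping just described; the constants are arbitrary examples, not recommendations:
// The internal PQ stays zero-average; only the display is shifted.
float display_offset = 1500.0f;   // example: the average shows as 1500
float display_scale  = 1.0f;      // or any convenient multiplier
float display_floor  = 100.0f;    // hides the far low tail
float shown = display_offset + display_scale*pq;
if (shown < display_floor)
    shown = display_floor;        // never show zero or below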
Fixed Range Rating
The majority of your players will neither be very bad nor extremely good. Those who do not do well do not really want to know how far behind they are, and those who are outstanding play by different rules anyway. Both groups will have difficulties finding exact equals no matter what we do, and loss of precision will not change this. Thus, we can easily waive accuracy on the fringe (both low and high ratings), as long as our results are accurate and nearly linear where it counts - near the average, for the majority.
Remember that our rating is an estimate of the ability to succeed against another player. Our reference is a fictional "average" player. If we view the outcome of a match between a player with a given rating PQ and the player with average rating, we get the Fermi function again - a result in a limited interval between zero and one!
Voila, this is your fixed range rating. Multiply Fermi(PQ,T) by 10,000, and all your players will have ratings between zero and ten thousand. The average player will have a rating of 5,000. Inexperienced players will find themselves rated close to zero, outstanding players will approach 10,000. To apply the above formula, those players would have to know many decimals of their ratings to obtain meaningful results.
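For reference, the Fermi function used here is consistent with the 1/(1+exp(-x/T)) form noted in the comments of the listing further down; a minimal implementation could look like this:
#include <math.h>
// Fermi(x,T) = 1/(1+exp(-x/T)): maps any x into the interval (0,1).
float Fermi( float x, float T )
{
    return 1.0f/(1.0f + (float)exp( -x/T ));
}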
One benefit of this representation is that the outcome of a game can be calculated without using exponential functions. Given two players with ratings R1, R2 out of 10,000:
r1 = R1/10000.0f;
r2 = R2/10000.0f;
// likelihood that player 1 wins
p = 1.0f/(1.0f + (r2*(1-r1))/(r1*(1-r2)));
There is one problem with this approach, however. If the width of the distribution exceeds the interval in which the Fermi function changes significantly, then a large number of players will have visible ratings close to 10,000, and an equally large number will have near zero ratings. To fix this, you have to use a different value of T for the display - the greater the width of the distribution, the larger it has to be. If you want about 95% of your players mapped to approximately 95% of the entire range (i.e. between 250 and 9,750 points), solving 1/(1+exp(2*width/T)) = 0.025 for T gives an appropriate display scale of about 0.55 times the width, or approximately half the width.
// Width has to be calculated from the values in the database.
width = PlayerDB.getWidth();
// We shoot for
max_rating = 10000.0f;
// Symmetry, guesstimate using Fermi(-2*width)==0.025
display_scale = 0.5*width;
// This is 1/(1+exp(-2*x/width))
display_rating = max_rating*Fermi( internal_rating, display_scale );
Using this approach, the ratings you show to the players will always be between zero and 10,000, with the average rating being 5,000, and the vast majority of your players spread out over an interval of 9,500 different values. About 68% of your players will find themselves somewhere between 1,200 and 8,800, with about 38% between 2,700 and 7,300 - you get the idea. For the majority, the rating will depend nearly linearly on skill - no matter how much dispersal your internally used PQ values exhibit. Just keep in mind that the visible rating, and the scaling used for it, are completely separate from the internally used PQ, and from the max_gain and T parameters.
Fig 7: Internet Bolo League Rating Adjustment
Credit Assignment
To my knowledge, the first multiplayer game that adapted a similar mechanism to calculate post-game ratings was the Internet Bolo League, using a mechanism based on the Elo system. Figure 7 shows the effective gain/loss calculated by the Bolo rating algorithm. The Elo system, in turn, was devised by Arpad E. Elo for the United States Chess Federation in 1960, and is still in use. Details have been published in The Rating of Chessplayers, Past & Present by Arpad E. Elo (ISBN 0-668-04721-6, 1978). Recently, Netgames USA adopted a similar rating scheme.
The approach described above will give you running estimates of relative player skill, with an accuracy that depends on your game, and on the number of samples you obtain. Most of your work still lies ahead, and it touches upon design issues for both user interface and game mechanics. Remember that scores define objectives - your game's rules have just been changed.
If you are heading for a persistent, authoritative player database, you will obviously have a larger community to sample from, and you can obtain data from many servers over a long time period. The problem then is verification of player identity, and cheat prevention. Persistent player IDs are required for accurate ratings, just as accurate ratings are required to guide server access based on skill. The same problems affect databases that are maintained locally on a per-server basis. Per-session rating avoids all issues of persistence, but the estimates might not be good, as we start with a blank slate in every session. Inevitably, cheating is always an issue.
Then there is the user interface. Ideally, a game would indicate both the risk and the possible gain of a potential target early enough to aid the player in a decision. Much like in old arcade games, you could provide the likely score (gain*risk) along with the target player's name, as soon as somebody sets their sights on them. Post-frag feedback will likely not suffice, yet pre-frag indicators (crosshair color and size, warning sounds) are of little use in dense, close-quarters free-for-all games.
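As a sketch of such an indicator, assuming the max_gain parameter and the Fermi-based win probability described earlier in the article; pq_self, pq_target, target_name and HUD_ShowTargetInfo are placeholder names:
// Chance that we beat the player currently under the crosshair.
float p_win = Fermi( pq_self - pq_target, T );
// Elo-style stakes: beating a stronger player pays more.
float gain  = max_gain*(1.0f - p_win);   // if we get the frag
float risk  = max_gain*p_win;            // if we get fragged instead
HUD_ShowTargetInfo( target_name, gain, risk );  // placeholder HUD call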
Incidentally, such high pressure, high speed games exhibit a distinct randomness. Any kind of performance statistics, be it rating or simple frag counting, faces a Credit Assignment Problem here: determining whose skill was instrumental in accomplishing a certain score. You might have to review your game design in an attempt to find more "atomic" events - e.g. counting applied damage instead of completed frags. Obviously, this will affect player tactics in yet another way. So would direction indicators that help players find a match.
Tournaments consisting of a sequence of one-on-one encounters (of which several could be executed in parallel on a single server, as long as the map areas are separate from each other) solve the Credit Assignment Problem. Using the current estimates when selecting the next pairings should offer quick convergence, which means that near-equal players (if any) will meet quickly. Finally, you can attempt to add handicaps based on skill estimates. Instead of trying to balance handicaps during playtesting, the game can try to determine proper handicaps by itself - guided by skill estimates, you might find ways to cut down on trial-and-error.
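One possible pairing rule, sketched under the assumption that players waiting for the next round sit in a plain array: sort by current PQ estimate and pair neighbors, so near-equal estimates meet each other. Player, waiting, num_waiting and scheduleMatch are placeholders:
#include <stdlib.h>
// Sort waiting players by current PQ estimate, then pair neighbors.
int byPQ( const void *a, const void *b )
{
    float d = ((const Player*)a)->pq - ((const Player*)b)->pq;
    return (d > 0) - (d < 0);
}
qsort( waiting, num_waiting, sizeof(Player), byPQ );
for (int i = 0; i + 1 < num_waiting; i += 2)
    scheduleMatch( &waiting[i], &waiting[i+1] );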
Rating based skill estimates are not only useful for multiplayer. Single player games traditionally put the burden of estimating their own ability on the player, but we are moving towards adaptive game logic that responds to observed player performance during the game, and adjusts challenges and supplies accordingly. At its core, any such system will have to maintain a rating of player ability.
It is usually not a good idea to combine ratings from different game modes. Teamplay has different objectives, and mixing scores can only lead to confusion. Performance statistics based on simple criteria (hits/shots, damage done vs. damage received, average lifetime) work best, and largely avoid credit assignment issues. Rating groups of players as a team is closer to the information needed to select player team pairings.
At first glance, Elo-like rating might seem much too elaborate for online shooters. Yet this is one aspect of a game easily done right, and the benefit might be huge. Multiplayer deathmatch games have advanced far into an era of progressive refinement, with the gameplay pretty much boiled down to its essence. The shelf life of future competitive multiplayer games might well depend on their ability to aid the players - both the die-hard and the casual gamer.
Bernd Kreimeier is a physicist, writer, and coder, working as senior programmer and project lead at Loki Entertainment. Previous work for Gamasutra and Game Developer Magazine includes Killing Games: A Look At German Videogame Legislation as well as the "Dirty Java" features on Using the Java Native Interface and Optimizing Pure Java. He can be reached at bk@lokigames.com.