The overall goal of this document is to provide a short manual on how to calculate Elo-ratings for attractiveness ratings of pairwise presented stimuli. Below, we provide a worked example based on an artificially generated data set. We suggest you go through this example before using the method on your own data set. The next section explains in a bit more detail the suggested reliability index as a means to evaluate stability in ratings. The last section presents an example with empirical data.
In general, the underlying idea of the Elo-rating procedure is to update individual scores after pair-wise contests based on the expected outcome of the contest before it actually takes place. The more expected an outcome was, the smaller the change in the scores of the two contestants. Conversely, the more unexpected an outcome was, the larger the score changes. The expectation of a contest outcome is expressed as the difference in Elo-ratings before the contest. Inherent in this general philosophy is that scores of contestants change over time - think of a chess or tennis player or an animal that at the start of her career will have a relatively low score (and/or rank), which may increase over time and eventually will drop again.
Now, when we think of a pair of visual stimuli as ‘contestants’, for example in the context of attractiveness ratings, the aspect of dynamics over time is much less important. Typically, an attractiveness experiment is conducted over a very brief period of time (at least relatively, compared to a chess player’s career or a monkey’s life), and as such we expect changes in attractiveness over time to play only a negligible role. Elo-rating, as implemented in this package and tutorial, will still result in a reliable ranking of stimulus attractiveness. The crucial aspect of this package is that Elo-ratings are based on multiple randomized sequences of ratings, which we refer to as mElo in the accompanying paper. Though not strictly necessary, we think this is a prudent approach, because the order in which rating trials occur in the sequence may affect the final Elo-ratings of the stimuli, at least in small data sets (small in the number of stimuli or in the number of rating trials).
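To make the updating rule concrete, the following is a minimal sketch of a single Elo update in R. The k value of 100 matches the default shown in the summaries later in this tutorial, but the function name elo_update and the classical logistic expectation formula are used here purely for illustration and are not necessarily the package's internal implementation.
# minimal sketch of a single Elo update (illustrative only)
elo_update <- function(winner_rating, loser_rating, k = 100) {
  # expected probability that the eventual winner would win, given the prior ratings
  p_win <- 1 / (1 + 10^((loser_rating - winner_rating) / 400))
  # the less expected the outcome, the larger the rating change
  change <- k * (1 - p_win)
  c(winner = winner_rating + change, loser = loser_rating - change)
}
elo_update(0, 0)       # equal ratings: maximal uncertainty, change of k/2 = 50
elo_update(200, -200)  # strongly expected outcome: small change (about 9 points)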
The first thing to be done is to install and load the package.1 Eventually, a simple install.packages("EloChoice") will suffice (once the package is on the official R server). Also note that we need the packages Rcpp and RcppArmadillo installed, which should be downloaded and installed automatically if you use the install.packages("EloChoice") command.2
# install package (to be done once)
install.packages("EloChoice")
# load package (every time you want to use the package)
library(EloChoice)
If you already know how to read your raw data into R, you can skip this section. We assume that you have your ratings organized in a table, in which each line corresponds to a rating event/trial. There should be at least one column for the stimulus that was preferred by the rater and one for the stimulus that was not. Additional columns are likely to be present, and the table below provides an example. Most likely you will have organized this table in spreadsheet software like Excel or OpenOffice. Perhaps the simplest way of reading a data set into R is to first save your data table from the spreadsheet software as a tab-delimited text file (File > Save as… and then choose ‘Tab Delimited Text (.txt)’).
Having such a text file, it is then easy to read that data set into R and save it as an object named xdata:3
# Windows
xdata <- read.table(file = "c:\\datafiles\\myfile.txt", sep = "\t", header = TRUE)
# Mac
xdata <- read.table(file = "/Volumes/mydrive/myfile.txt", sep = "\t", header = TRUE)
str(xdata)
Note that you have to specify the full path to the place where the data file is saved.4 The sep = "\t" argument tells R that you used a tab to separate entries within a line. The header = TRUE argument tells R that the first line in your table contains the column headers. Finally, the str(xdata) command gives a brief overview of the data set, which you can use to check whether the import went smoothly (for example, as indicated by the number of lines (obs. in the output of str()), which should correspond to the number of trials in your data set).
preferred stimulus | losing stimulus | rater | date | time |
---|---|---|---|---|
ab | cf | A | 2010-01-01 | 14:34:01 |
cf | xs | A | 2010-01-01 | 14:34:08 |
ab | xs | A | 2010-01-01 | 14:34:11 |
dd | cf | A | 2010-01-01 | 14:34:15 |
ab | cf | B | 2010-01-04 | 09:17:20 |
If the import was successful, the only thing you need to know is that for the functions of the package to work we need to access the columns of the data table individually. This can be achieved by using the dollar character. For example, xdata$preferred.stimulus returns the column with the preferred stimuli. If you are unfamiliar with this, hopefully it will become clear while going through the examples in the next section.
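As a quick, hypothetical illustration of the dollar notation (the column names and values below are made up to mirror the example table above):
# a tiny mock data table to illustrate column access (made-up values)
xdata <- data.frame(preferred.stimulus = c("ab", "cf", "ab", "dd"),
                    losing.stimulus    = c("cf", "xs", "xs", "cf"),
                    rater              = c("A", "A", "A", "A"))
# the dollar sign extracts a single column as a vector
xdata$preferred.stimulus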
Throughout this first part of the tutorial, we will use randomly generated data sets. We begin by creating such a random data set with the function randompairs(), which we name xdata.
set.seed(123)
xdata <- randompairs(nstim = 7, nint = 700, reverse = 0.1)
head(xdata)
#> index winner loser
#> 1 1 o s
#> 2 2 n r
#> 3 3 j r
#> 4 4 s r
#> 5 5 j r
#> 6 6 k o
The command set.seed(123) simply ensures that each time you run this, it will create the same data set as shown in this tutorial. If you want to create a truly random data set, just leave out this line.
With this we created a data set with 7 different stimuli (nstim = 7), which were presented in 700 presentations/trials (nint = 700). The head() command displays the first six lines of the newly created data set.5 You can see the stimulus IDs: the winner-column refers to the preferred stimulus (the winner of the trial) and the loser-column to the second/unpreferred stimulus (the loser). The index-column simply displays the original order in which the stimuli were presented. Note also that the data set is generated in such a way that there is always a preference for the stimulus that comes first in alphanumeric order, which is reversed in 10% of trials (reverse = 0.1). This ensures that a hierarchy of stimuli actually arises.
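As an optional sanity check (not required for the analysis), you can count how often each stimulus appears across all trials; these counts should match the range reported by summary() further below.
# optional check: how often does each stimulus occur across all trials?
table(c(as.character(xdata$winner), as.character(xdata$loser)))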
We can then go on to calculate the actual ratings from this sequence with the core function of this package, elochoice(). Again, to enable you to obtain the exact results as presented in this tutorial, I use the set.seed() function.
set.seed(123)
res <- elochoice(winner = xdata$winner, loser = xdata$loser, runs = 1000)
summary(res)
#> Elo ratings from 7 stimuli
#> total (mean/median) number of rating events: 700 (200/201)
#> range of rating events per stimulus: 187 - 214
#> startvalue: 0
#> k: 100
#> randomizations: 1000
The two code pieces winner = xdata$winner and loser = xdata$loser specify the two columns in our data table that represent the winning (preferred) and losing (not preferred) stimuli, respectively. The runs = 1000 bit indicates how many random sequences of presentation order we want to generate.
We saved the results in an object named res, from which we can get a brief summary with the summary(res) command. From this, we can see that there were 7 stimuli in the data set, each appearing between 187 and 214 times. The summary also notes that we randomized the original sequence 1000 times. In case you have larger data sets, you may want to reduce the number of randomizations to reduce the computation time and to explore whether everything runs as it should.6
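For example, a quick first pass with fewer randomizations (the value of 100 below is an arbitrary choice for illustration) can be used to verify that everything runs before committing to a longer computation.
# quick first pass with fewer randomizations (arbitrary illustrative value)
res_quick <- elochoice(winner = xdata$winner, loser = xdata$loser, runs = 100)
summary(res_quick)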
Next, we obviously want to see what the actual Elo-ratings are for the stimuli used in the data set. For this, we use the function ratings():
ratings(res, show = "original", drawplot = FALSE)
#> c j k n o r s
#> 432 224 82 5 -169 -202 -372
The show = "original" argument specifies that we wish to see the ratings as obtained from the initial (original) data sequence that is present in our data set. If we want to have the ratings averaged across all randomizations (mElo), we change the argument to show = "mean":
ratings(res, show = "mean", drawplot = FALSE)
#> c j k n o r s
#> 459.205 247.165 141.170 16.752 -159.900 -235.153 -469.239
Similarly, you can also request all ratings from all randomizations, using show = "all", or return the ranges across all randomizations with show = "range".
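For example (the object name allratings below is just illustrative), the range output contains the minimum and maximum rating per stimulus, while show = "all" returns one row of ratings per randomization.
# minimum and maximum rating per stimulus across the randomizations
ratings(res, show = "range", drawplot = FALSE)
# all ratings: one row per randomization, one column per stimulus
allratings <- ratings(res, show = "all", drawplot = FALSE)
nrow(allratings)   # should equal the number of randomizations (here 1000)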
If you wish to export ratings, you can use the write.table() function, which saves your results in a text file that can easily be opened in a spreadsheet program. First, we save the rating results into a new object, myratings, which we then export. Note that the resulting text file will be in a ‘long’ format, i.e. each stimulus along with its original or mean rating will appear as one row.
myratings <- ratings(res, show = "mean", drawplot = FALSE)
# Windows
write.table(myratings, "c:\\datafiles\\myratings.txt", sep = "\t")
# Mac
write.table(myratings, "/Volumes/mydrive/myratings.txt", sep = "\t")
If you want to export the ratings from each single randomization (i.e. show = "all") or the range of ratings across all randomizations (show = "range"), the layout of the text file will be ‘wide’, i.e. each stimulus appears as its own column, with each row representing ratings after one randomization, or two rows representing the minimum and maximum rating values. By default, R appends row names to the text output, which is not convenient in this case, so we turn this option off with row.names = FALSE.
myratings <- ratings(res, show = "all", drawplot = FALSE)
# Windows
write.table(myratings, "c:\\datafiles\\myratings.txt", sep = "\t", row.names = FALSE)
# Mac
write.table(myratings, "/Volumes/mydrive/myratings.txt", sep = "\t", row.names = FALSE)
Finally, the ratings() function also allows you to take a first graphical glance at how the randomizations affected the ratings.
ratings(res, show = NULL, drawplot = TRUE)
This figure shows the mean ratings across the 1000 randomizations as black circles, while the ratings from the original/initial sequence are indicated by the smaller grey circles. The vertical bars represent the ranges of Elo-ratings across the 1000 randomizations for each stimulus. The stimulus with the highest rating is c and the stimulus with the lowest rating is s. Recall that the data generation with randompairs() uses an underlying ranking of stimuli that is based on alphabetical order: hence c should have a higher rating than j, which should be higher rated than k, and so on. The rating process recovers this nicely here because there were not too many exceptions to this rule during the data generation (reverse = 0.1).
We define a reliability-index as \(R = 1- \frac{\sum{u}}{N}\), where \(N\) is the total number of rating events/trials for which a binary expectation for the outcome of the trial existed7 and \(u\) is a vector containing 0’s and 1’s, in which a \(0\) indicates that the preference in this trial was according to the expectation (i.e. the stimulus with the higher Elo-rating before the trial was preferred), and a \(1\) indicates a trial in which the expectation was violated, i.e. the stimulus with the lower Elo-rating before the trial was preferred (an ‘upset’). In other words, \(R\) is the proportion of trials that went in accordance with the expectation. Note that trials without any expectation, i.e. those for which the ratings of both stimuli are identical, are excluded from the calculation.8 Subtracting the proportion from \(1\) is done to ensure that if there are no upsets (\(\sum {u}=0\)), the index is \(1\), thereby indicating complete agreement between expectation and observed rating events. For example, in the table below there are ten rating events/trials, four of which go against the expectation (i.e. they are upsets), yielding a reliability-index of \(1-4/10=0.6\).
This approach can be extended to calculate a weighted reliability index, where the weight is given by the absolute Elo-rating difference, an index we denote \(R'\) and which is defined as \(R' = 1- \sum_{i=1}^{N} {\frac{u_i * w_i}{\sum{w}}}\), where \(u_i\) is the same vector of 0’s and 1’s as described above and \(w_i\) is the absolute Elo-rating difference, i.e. the weight. Following this logic, stronger violations of the expectation contribute more strongly to the reliability index than smaller violations. For example, column 3 in the table below contains fictional rating differences (absolute values), which for illustrative purposes are assigned in such a way that the four largest rating differences (200, 200, 280, 300) correspond to the four upsets, which should lead to a smaller reliability index as compared to the simpler version described earlier. Applying this leads to \(R'=1-0.57=0.43\). In contrast, if we assign the smallest rating differences (90, 100, 120, 140, as per column 4 in the table) to the upsets, this should lead to a larger reliability-index, which it does: \(R'=1-0.26=0.74\).
In sum, the reliability index reflects the proportion of rating events that went against the expectation: the higher the index, the fewer the upsets. The weighted version refines this measure by integrating the strength of deviations from the expectation, such that many large deviations lead to smaller index values, while many small deviations lead to larger index values.
higher rated is not preferred | upset | rating difference (example 1) | rating difference (example 2) |
---|---|---|---|
1 | yes | 200 | 90 |
1 | yes | 300 | 100 |
0 | no | 100 | 300 |
0 | no | 150 | 280 |
1 | yes | 200 | 120 |
0 | no | 140 | 200 |
1 | yes | 280 | 140 |
0 | no | 90 | 150 |
0 | no | 150 | 200 |
0 | no | 120 | 150 |
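To make the arithmetic explicit, the two indices from the table can be reproduced with a few lines of base R (a hand calculation for the fictional values above, not a call to the package).
# the upset vector u (1 = upset) and the two sets of rating differences from the table
u  <- c(1, 1, 0, 0, 1, 0, 1, 0, 0, 0)
w1 <- c(200, 300, 100, 150, 200, 140, 280, 90, 150, 120)   # example 1
w2 <- c(90, 100, 300, 280, 120, 200, 140, 150, 200, 150)   # example 2
1 - sum(u) / length(u)       # unweighted R: 0.6
1 - sum(u * w1) / sum(w1)    # weighted R', example 1: approximately 0.43
1 - sum(u * w2) / sum(w2)    # weighted R', example 2: approximately 0.74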
To calculate the reliability-index, we use the function reliability(). Note that we calculated our initial Elo-ratings based on 1000 randomizations, so to save space, I’ll display only the first six lines of the results (using the head(...) function).
upsets <- reliability(res)
head(upsets)
#> upset upset.wgt totIA
#> 1 0.8581662 0.8847880 698
#> 2 0.8610315 0.8885191 698
#> 3 0.8553009 0.8876968 698
#> 4 0.8569385 0.8913181 699
#> 5 0.8640916 0.8858861 699
#> 6 0.8647482 0.8842455 695
Each line in this table represents one randomization.9 The first column represents the unweighted and the second the weighted reliability index (\(R\) and \(R'\)), which is followed by the total number of trials that contributed to the calculation of the index. Note that this number cannot reach 700 (our total number of trials in the data set) because at least for the very first trial we did not have an expectation for the outcome of that trial.
We then calculate the average values for both the unweighted and weighted upset indices.
mean(upsets$upset)
#> [1] 0.8548718
mean(upsets$upset.wgt)
#> [1] 0.886231
Remember that our data set contained a fairly low number of ‘reversals’, i.e. only 10% of trials went against the predefined preference (the preference for the stimulus that comes earlier in alphanumeric order, e.g. ‘A’ over ‘K’). As such, the \(R\) and \(R'\) values are fairly high (\(\sim 0.8 - 0.9\)). If we create another data set, in which reversals are more common, we can see that the values go down.
set.seed(123)
xdata <- randompairs(nstim = 7, nint = 700, reverse = 0.3)
res <- elochoice(winner = xdata$winner, loser = xdata$loser, runs = 1000)
upsets <- reliability(res)
mean(upsets$upset)
#> [1] 0.6084496
mean(upsets$upset.wgt)
#> [1] 0.6462743
A note of caution. The functions described in the following section have not been very thoroughly tested (yet). If they fail or produce confusing results, please let us know!
We may also wonder how many raters are needed to achieve stability in ratings. The function raterprog() allows assessing this. The idea here is to start with one rater, calculate reliability, add the second rater, re-calculate reliability, add the third rater, re-calculate reliability, and so on. As more raters are included, reliability should converge on some ‘final’ value. Please note that for now the function only returns the weighted reliability index \(R'\).
In order to calculate this progressive reliability, you obviously need a column in your data set that reflects rater IDs. For this demonstration, we use a subset11 of a data set that contains this information and is described further below.
data(physical)
# limit to 10 raters
physical <- subset(physical, raterID %in% c(1, 2, 8, 10, 11, 12, 23, 27, 31, 47))
set.seed(123)
res <- raterprog(physical$Winner, physical$Loser, physical$raterID, progbar = FALSE)
raterprogplot(res)
When we look at these results graphically, we may conclude that, with this data set, using the data from the first five raters may be sufficient to achieve high reliability (e.g. larger than 0.8).
It may be advisable, however, not to rely only on the original sequence of raters. In the general spirit of the package, we can randomize the order in which raters are included. We can do this by increasing the ratershuffle= argument from its default value.12 Note that doing so will increase computation time even further, so start with small values if applying this to your own data to test whether the functions work as intended.
set.seed(123)
res <- raterprog(physical$Winner, physical$Loser, physical$raterID, progbar = FALSE, ratershuffle = 10)
raterprogplot(res)
The results of this figure would probably be interpreted in quite a similar way to those of the figure above: from four or five raters onwards, we achieve reliability larger than 0.8. One thing to note here is that the original rater sequence is somewhat atypical: the grey circles fall outside the quartiles of the reshuffled orders for low rater numbers. This is not a problem per se, but it illustrates that the original rater order might not be very representative. Another thing to note is that the variation in reliability becomes smaller with larger rater numbers.
In the following, we go through an empirical example data set. Here, 56 participants were asked to choose the one out of two presented bodies which depicted the stronger looking male. Each of the 82 stimuli appeared 112 times, resulting in a total of 4,592 rating trials.
We start by loading the data set. Then we calculate the Elo-ratings, show the average ratings and plot them.
data(physical)
set.seed(123)
res <- elochoice(winner = physical$Winner, loser = physical$Loser, runs = 500)
summary(res)
#> Elo ratings from 82 stimuli
#> total (mean/median) number of rating events: 4592 (112/112)
#> range of rating events per stimulus: 112 - 112
#> startvalue: 0
#> k: 100
#> randomizations: 500
ratings(res, show = "mean", drawplot = FALSE)
#> P012 P070 P076 P142 P168 P013 P169 P150
#> 634.890 594.636 564.144 564.084 489.200 454.752 444.490 438.770
#> P060 P143 P161 P017 P016 P061 P003 P027
#> 420.564 411.304 380.524 349.902 346.874 330.388 320.228 286.242
#> P002 P046 P148 P006 P103 P007 P038 P075
#> 235.422 235.164 226.182 218.364 210.590 175.764 160.536 139.222
#> P089 P119 P053 P151 P147 P050 P126 P129
#> 137.570 121.326 109.632 108.052 105.076 96.488 66.296 59.824
#> P044 P020 P155 P008 P047 P117 P014 P004
#> 59.592 59.228 55.650 55.564 26.126 17.458 14.410 4.172
#> P078 P166 P092 P112 P083 P040 P034 P121
#> -31.650 -34.438 -57.890 -68.312 -71.590 -74.642 -80.666 -82.202
#> P021 P110 P163 P132 P015 P090 P165 P036
#> -87.140 -91.170 -96.086 -99.912 -116.514 -144.022 -157.626 -157.668
#> P171 P124 P018 P057 P077 P079 P125 P058
#> -159.726 -167.988 -177.456 -178.644 -195.106 -208.544 -220.562 -221.670
#> P031 P127 P128 P052 P098 P123 P172 P029
#> -255.682 -258.202 -261.508 -287.894 -290.642 -294.466 -295.756 -300.986
#> P134 P086 P019 P170 P140 P080 P042 P159
#> -307.024 -345.188 -374.590 -402.974 -422.570 -435.016 -462.394 -534.958
#> P085 P144
#> -544.016 -673.610
ratings(res, show = NULL, drawplot = TRUE)
Depending on how the stimulus presentation is prepared, there may be cases (trials) in the final data set in which a stimulus is paired with itself (a ‘self-contest’). As long as the presentation is indeed of pairs of stimuli, this is not a problem, because such self-contests are uninformative for refining the rating of a stimulus (as opposed to pairs of two different stimuli). The situation becomes different if three or more stimuli are presented in one trial, for example if two ‘A’s are presented with one ‘B’. However, this is an issue to be dealt with later, when the presentation of triplets or more stimuli is properly implemented in the package (currently under development).
For now, self-contests are excluded from the analysis of stimulus pairs for the reason mentioned above. However, a message will be displayed if such self-contests occur in the data.13
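If you prefer to remove such trials from your own raw data before calling elochoice(), a simple filtering step along the following lines will do (this sketch assumes a data frame xdata with columns winner and loser; adjust the names to your data). The example below, in contrast, leaves the self-contests in and shows the resulting message.
# optional: drop self-contests yourself before running elochoice()
xdata_clean <- xdata[as.character(xdata$winner) != as.character(xdata$loser), ]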
# total of seven trials with two 'self-trials' (trials 6 and 7)
w <- c(letters[1:5], "a", "b"); l <- c(letters[2:6], "a", "b")
res <- elochoice(w, l)
#> data contained 2 'self-contests' (identical winner and loser)
#> these cases were excluded from the sequence!
ratings(res, drawplot=FALSE)
#> a b c d e f
#> 50 7 1 0 0 -58
summary(res)
#> Elo ratings from 6 stimuli
#> total (mean/median) number of rating events: 5 (1.67/2)
#> range of rating events per stimulus: 1 - 2
#> startvalue: 0
#> k: 100
#> randomizations: 1
#> original data contained 2 'self-contests' (identical winner and loser)
#> these cases were excluded from the sequence!
# total of five trials without 'self-trials'
w <- c(letters[1:5]); l <- c(letters[2:6])
res <- elochoice(w, l)
ratings(res, drawplot=FALSE)
#> a b c d e f
#> 50 7 1 0 0 -58
summary(res)
#> Elo ratings from 6 stimuli
#> total (mean/median) number of rating events: 5 (1.67/2)
#> range of rating events per stimulus: 1 - 2
#> startvalue: 0
#> k: 100
#> randomizations: 1
1. Installing has to be done once, while loading the package has to be done each time you restart R.
2. install.packages("Rcpp") and install.packages("RcppArmadillo") will install the two packages by hand.
3. You can name this object any way you like; xdata is just a personal convention.
4. You can also use setwd() to define a working directory and then use relative paths, or you can use (the freely available) RStudio, which offers nice options to work in projects that facilitate reading of files (and much more).
5. If you want to see the entire data set, simply type xdata.
6. On my laptop PC, the calculations of this example take less than a second, but this can increase drastically if you have larger data sets and/or want to use larger numbers of randomizations.
7. Consequently, the maximum value \(N\) can take is the number of trials minus one, because for at least the very first trial in a sequence, no expectation can be expressed.
8. Technically, there is still an expectation here: if both stimuli have the same rating before the forced choice, the expectation that one is preferred is 0.5 (for both stimuli), but we are interested in binary expectations (win or lose), i.e. expectations that are different from 0.5.
9. The first line corresponds to the actual original sequence.
10. Thanks to TF for the suggestion.
11. A subset because the calculations can take very long, depending on the size of your data set.
12. Which is 1, and in fact reflects the original rater sequence as it appears in your data set.
13. In this document the actual message does not show, but if you run the code yourself you should be able to see it.