OkCupid has a feature called MyBestFace which lets you upload a photo and get a rating of that photo’s hotness as determined by your peers. For every picture you upload, you are asked to perform 20 rating trials of people of your preferred sex and sexual orientation (for instance, straight ladies if you’re a straight dude). In each trial, you are shown two photos of two different people and asked which of the two you would be more likely to date.
From a psychophysical perspective, the goal of MyBestFace is to determine the value of “hotness” represented in a picture through the use of a 2-alternative forced choice (2AFC) paradigm. I was discussing this with a friend, who is taking this way too seriously; he put several dozen of his photos up on the site and retested the ones that worked best a second time to get more accurate ratings. I told him not to read too much into the ratings, because, hey, how much can you really infer from 20 trials per pic?
Indeed, if you count the number of “hotter” ratings versus the total number of ratings, you can get an estimate of a photo’s average choice probability; that is, the probability that your photo will be preferred in a random draw of two photos. But this is not a very efficient method of estimating hotness, both at the level of analysis and design.
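To put a number on how little 20 trials tell you, here is a minimal sketch of the direct estimate together with its binomial standard error. The function name and the example count of 12 wins are made up for illustration; nothing here is taken from OkCupid.

```python
import math

def direct_choice_probability(wins, trials):
    """Fraction of trials in which the photo was preferred, plus its
    binomial standard error (hypothetical helper, for illustration)."""
    p_hat = wins / trials
    se = math.sqrt(p_hat * (1.0 - p_hat) / trials)
    return p_hat, se

# Example: a photo preferred in 12 of its 20 trials (made-up numbers).
p_hat, se = direct_choice_probability(wins=12, trials=20)
print(f"estimated choice probability: {p_hat:.2f} +/- {se:.2f}")
```

With only 20 trials the standard error is around 0.11, so a photo scoring 0.60 is hard to tell apart from one scoring 0.50.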
At the analysis level, you’re wasting a lot of information. Your face is being compared to other photos, and you have some idea of how hot these photos are based on people’s ratings. This information is certainly helpful and should be taken into account to estimate your own hotness. At the design level, intuitively you’d want to be compared with people of similar hotness. You might not be as good looking as George Clooney, but you already knew that; you’re better off being compared to people in your own league, so to speak. This raises interesting questions about optimal Bayesian design. Let’s look at these two levels in turn.
Let’s start with the assumption that a given photo has an associated hotness parameter h, and that the goal here is to estimate h. In a 2AFC trial, a psychophysical observer compares two photos, and responds 0 if the one on the left is the best or 1 if the one on the right is best. Then a minimal assumption is that:
p(y=1 | h_left, h_right) = f(h_right - h_left)
where f(x) is an increasing function of x. In keeping with the psychophysics literature, we could assume that f is the logistic or cumulative Gaussian function. Another possibility is that f includes the phenomenon of lapses; assuming people respond without looking at the pictures on some fraction of trials a, then we could assume that f(x) = a/2 + (1 - a) g(x), where g is the logistic function. From here on out I’ll assume that f is the logistic function for convenience.
The fact that f is smooth and not, for instance, a step function, represents the fact that the decision is noisy. For a given observer, two equivalent presentations could yield different results if the two pictures are similarly hot. In addition, two different observers might have a different definition of hotness, which constitutes an additional source of noise.
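The observer model can be sketched in a few lines. This is only an illustration under stated assumptions: the response probability depends on the difference in hotness between the right and left photos, f is the logistic function, and a lapse rate adds pure guessing on a fraction of trials (the standard lapse form from psychophysics, assumed here). All function names are hypothetical.

```python
import math
import random

def logistic(x):
    """Numerically safe logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def p_right_preferred(h_left, h_right, lapse=0.0):
    """p(y=1): probability that the photo on the right is chosen.
    On a fraction `lapse` of trials the observer guesses at random."""
    return lapse / 2.0 + (1.0 - lapse) * logistic(h_right - h_left)

def simulate_trial(h_left, h_right, rng, lapse=0.0):
    """Draw one noisy 2AFC response: 0 = left preferred, 1 = right preferred."""
    return int(rng.random() < p_right_preferred(h_left, h_right, lapse))

rng = random.Random(0)
# Twenty presentations of the same pair can give different answers.
responses = [simulate_trial(0.0, 1.0, rng) for _ in range(20)]
print(responses)
```

Because f is smooth, repeating the same pair yields a mix of 0s and 1s instead of a single fixed answer, which is the noise described above.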
Offline inference is straightforward; what we have is a logistic regression problem. It would be impractical to include everyone that has ever used MyBestFace in the regression, since the design matrix could end up being arbitrarily large. For offline inference to be practical, each photo could be part of a “block” of photos which are only compared against each other. If a block is 1000 photos large and each photo is shown 20 times, that makes about 10,000 trials (two photos per trial), hence a regression based on a design matrix which is about 10,000 x 1000.
A slight difficulty here is that absolute hotness is impossible to determine based on this; only relative hotness. Imagine, for instance, that a block includes only 2 photos. In that case, you can determine the difference between the hotness of the two photos, but not their absolute level. With this parametrization, the solution to the regression is non-unique, which poses a problem. The solution is simply to add a weak prior that assumes that hotness ratings hover around 0.
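Here is a rough, standard-library-only sketch of this regression on a small simulated block. Several things are assumptions made for illustration: each trial is stored as (left, right, y), which is the sparse equivalent of a design matrix row holding -1 for the left photo and +1 for the right one; the weak prior is a zero-mean Gaussian with a hand-picked precision; and the fit uses plain gradient ascent where a real analysis would likely use a proper solver.

```python
import math
import random

def logistic(x):
    """Numerically safe logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def simulate_block(true_h, n_presentations, rng):
    """Random pairings within a block; returns a list of (left, right, y)."""
    slots = [i for i in range(len(true_h)) for _ in range(n_presentations)]
    rng.shuffle(slots)
    trials = []
    for left, right in zip(slots[0::2], slots[1::2]):
        if left == right:
            continue  # a photo is never compared against itself
        y = int(rng.random() < logistic(true_h[right] - true_h[left]))
        trials.append((left, right, y))
    return trials

def fit_hotness(trials, n_photos, prior_precision=0.1, step=0.05, n_iter=500):
    """MAP estimate of hotness by gradient ascent on the log posterior."""
    h = [0.0] * n_photos
    for _ in range(n_iter):
        # Weak zero-mean Gaussian prior: pulls every estimate toward 0.
        grad = [-prior_precision * v for v in h]
        for left, right, y in trials:
            err = y - logistic(h[right] - h[left])
            grad[right] += err
            grad[left] -= err
        h = [v + step * g for v, g in zip(h, grad)]
    return h

rng = random.Random(0)
true_h = [rng.gauss(0.0, 1.0) for _ in range(50)]  # made-up block of 50 photos
trials = simulate_block(true_h, n_presentations=20, rng=rng)
h_hat = fit_hotness(trials, n_photos=len(true_h))
print(f"sum of estimates: {sum(h_hat):.2e}")
```

Each trial pushes one photo up and the other down by the same amount, so the likelihood alone never moves the overall level; the prior is what pins the estimates around 0 and makes the solution unique.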
We can compare the direct estimate of the choice probability p(r=m|photo at position m) with the indirect estimate given by logistic regression. Here I chose blocks of 500 photos, 20 presentations for each photo, and hotness ratings picked at random from a normal distribution with standard deviation of 3. I did 100 replications of the experiment to get the standard deviation of the estimates. The results are shown below:
It’s clear here that both the direct and indirect estimates of choice probability (CP) are approximately unbiased, which is a good sign. The indirect estimates, obtained via logistic regression, are clearly less noisy overall, however. The median decrease in standard deviation is 22%. This sounds modest, but recall that the standard deviation of an estimate goes as 1 over the square root of the number of trials, so the number of trials needed for a given precision goes as 1 over the square of the standard deviation. Thus, 20 trials analyzed with the indirect method are as good as about 32 trials analyzed with the direct method, a pretty sizeable improvement in efficiency.
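The conversion from a drop in standard deviation to an equivalent number of trials is a one-liner. A small sketch, assuming only that the standard deviation scales as 1 over the square root of the number of trials (the function name is made up):

```python
def equivalent_trials(n_trials, sd_ratio):
    """Trials the noisier method needs to match n_trials of the better one.

    sd_ratio = SD(better method) / SD(noisier method), assuming the
    standard deviation scales as 1 / sqrt(number of trials).
    """
    return n_trials / sd_ratio ** 2

# A 22% lower standard deviation means sd_ratio = 0.78.
print(round(equivalent_trials(20, 1 - 0.22), 1))  # 32.9
```

Taking the rounded 22% at face value gives a shade under 33 trials, in line with the figure quoted above.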
Single trial strategies
So far I’ve been assuming that people are selected at random in a trial. If we have some prior information about the beauty of the participants then better strategies are available. OkCupid does have that information; it uses star ratings to get an estimate of a person’s physical beauty. That means the system can make a prior estimate about the hotness of a photo p(h).
There’s a nice paper on arXiv by coauthor and all-around nice guy Simon Barthelmé about optimal design strategies in the context of psychophysics. In theory one should compute the expected outcomes for all possible multi-trial strategies to derive the optimal strategy. This is usually intractable, so we resort to single trial strategies.
One strategy is to minimize the expected entropy after a single trial. When dealing with Gaussians, this is equivalent to minimizing the expected variance after a single trial. Assuming that p(h) is a Gaussian (without loss of generality, centered at 0 and with unit variance), and the probe stimulus (the second photo) has known hotness h’, it’s possible to approximate p(h|y), the distribution of hotness after a single 2AFC trial, as a Gaussian using the Laplace approximation. You can work it out on pen and paper yourself. This Gaussian has variance equal to:
1 / (1 + p(y=1|h’) (1 - p(y=1|h’)))
This function has a minimum at p(y=1|h’) = 0.5. In other words, as expected, the variance of the estimate of the hotness diminishes fastest when photos of similar hotness are compared. Note that, interestingly, the variance of the estimate doesn’t change at all when p(y=1|h’)=1 or p(y=1|h’)=0. That means there is literally no information to be gained by comparing yourself to George Clooney.
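A quick numerical check of these properties, taking the posterior variance to be 1 / (1 + p (1 - p)) with p = p(y=1|h’). That form is an assumption here: it is what the Laplace approximation gives for a unit-variance Gaussian prior and a logistic f, and it reproduces the behaviour just described.

```python
def posterior_variance(p):
    """Approximate variance of h after one trial, starting from a
    unit-variance prior, where p = p(y=1|h') (assumed Laplace form)."""
    return 1.0 / (1.0 + p * (1.0 - p))

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p = {p:.1f}  variance = {posterior_variance(p):.3f}")
```

Under this form the variance drops from 1 to 0.8 after a single well-matched trial, and stays at exactly 1 when the outcome is a foregone conclusion.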
Multi trial strategies
Let’s assume, as above, that inference is done offline. That means that user responses are not taken into account to alter the design as we go along. So we only have an initial estimate of the hotness of every photo. We’d like to come up with a strategy that will maximize the efficiency of our design, while also working under the constraint that each photo should be used in N total trials.
One way of doing this is making it so that the photos being compared are as close to each other as possible in terms of hotness. We can arrange it so that the photo with prior hotness rank 50 is only compared against pictures going from rank 40 to rank 60. If we do this, however, we find that the mean squared error in the estimated p is actually higher than with a random design. An example is shown below:
As you can see, the random design leads to estimates with approximately the same errors across the range of probabilities. The ordered non-random design gives more veridical estimates at the highest and lowest probabilities but often messes up somewhere in the middle range. Basically one photo where the initial estimate was way off has the possibility of being compared only against much hotter or much less hot photos. This messes up the estimate for that one photo and has repercussions on all the other photos. You can see a couple of these outliers in the lower left quadrant.
An alternative is the following: pick a picture at random. Evaluate its prior hotness plus a random Gaussian with standard deviation sigma, giving the target hotness. Find a picture with hotness most similar to the target hotness and use that as the competitor to the initial picture. One can repeat the process and keep a tally to prevent pictures from being over- or under-used.
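The procedure above can be sketched as follows. This is one possible reading, not a definitive implementation: the function name is made up, the tally is a per-photo count of remaining presentations, exhausted photos are simply dropped from the pool, and left/right positions are randomized (a detail not specified above).

```python
import random

def jittered_design(prior_hotness, n_presentations, sigma, rng):
    """Build a list of (left, right) photo index pairs for one block."""
    n = len(prior_hotness)
    remaining = [n_presentations] * n  # tally of presentations left per photo
    pairs = []
    while True:
        available = [i for i in range(n) if remaining[i] > 0]
        if len(available) < 2:
            break
        # Pick a photo at random, then jitter its prior hotness.
        first = rng.choice(available)
        target = prior_hotness[first] + rng.gauss(0.0, sigma)
        # The competitor is the available photo closest to the target.
        second = min((i for i in available if i != first),
                     key=lambda i: abs(prior_hotness[i] - target))
        pairs.append((first, second) if rng.random() < 0.5 else (second, first))
        remaining[first] -= 1
        remaining[second] -= 1
    return pairs

rng = random.Random(0)
prior = [rng.gauss(0.0, 1.0) for _ in range(100)]  # made-up prior estimates
design = jittered_design(prior, n_presentations=20, sigma=1.0, rng=rng)
print(len(design), "trials")
```

The jitter is what separates this from the strictly ordered design: a photo whose prior estimate is way off still gets compared against a spread of competitors, not just its immediate neighbours in rank.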
As for sigma, it’s logical to set it to the same sigma as the prior hotness estimates. An example run is shown below. As you can see, it helps at all levels of p, but it’s especially useful at high and low p. In that case, the mean squared error is about 40% less than with the random design. Thus 20 optimally designed trials are as good as about 32 non-optimally designed trials. This gain in efficiency comes on top of the one from the better analysis method discussed in the previous section, so in total we’ve saved a good 2/3 of trials.
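As a back-of-the-envelope check on the total savings, here is the arithmetic, assuming the two gains simply multiply and that the mean squared error scales as 1 over the number of trials. Both are simplifying assumptions, and the inputs are the rounded percentages quoted in the text.

```python
# Analysis gain: a 22% lower standard deviation, i.e. a variance ratio of 0.78**2.
analysis_gain = 1.0 / (1.0 - 0.22) ** 2
# Design gain: a 40% lower mean squared error.
design_gain = 1.0 / (1.0 - 0.40)

combined_gain = analysis_gain * design_gain   # assumed multiplicative
fraction_saved = 1.0 - 1.0 / combined_gain

print(f"combined gain: {combined_gain:.2f}x")
print(f"fraction of trials saved: {fraction_saved:.2f}")
```

With these rounded inputs the saving comes out at a bit under two thirds, consistent with the estimate above.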
The final thing we can do is incorporate the prior in the analysis procedure in addition to the design procedure. I haven’t tried it yet, but I think that although it might give more veridical estimates of absolute hotness, it probably wouldn’t give better hotness estimates within a set of pictures for a given person, since these pictures share the same prior; the net effect should be to shift all the estimates towards the prior mean. I’d love to be proven wrong though.