Machine learning in online dating

Love and happiness
It can make you do right,
It can make you do wrong
It can make you come home early
It can make you stay out all night long

I went to an LA machine learning meetup last week featuring Jon Morra from eHarmony, where he highlighted some of the uses of machine learning in their online dating platform. I came away impressed by the extent and breadth of machine learning techniques deployed to solve this most human of problems: finding love.

Here is a recording of the presentation:

The central problem

À chaque guenille, son torchon – Québécois proverb*

The central problem of online dating is that there is simply too much choice. To prevent overwhelming users, we want to propose matches intelligently. Abstractly, you want to estimate some “dating compatibility” matrix between different people and serve up matches that maximize the summed probability of getting it on.

If the love-distance matrix were small, easy to compute, and you only wanted to serve the single best match to each person, you could solve this assignment problem with the Hungarian algorithm, for instance (a toy sketch follows the list below). But of course, when we’re dealing with millions of users, computing love distances isn’t trivial, and because our matches are imperfect, we’ll want to serve up more than one match. The three-pronged approach outlined by Jon solves these issues:

  1. Reducing the pool of potential matches with compatibility ratings derived from self-reported psychological profile surveys, plus factors such as sexual preference, age, location, etc.
  2. Computing affinity between potential matches based on demographics, text features, visual features, etc.
  3. Optimally serving matches to users based on affinity, i.e. via a daily email
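
As a toy illustration of that small-scale assignment version: made-up numbers, and scipy's solver standing in for the Hungarian algorithm (scipy uses a variant of it for the same problem).

import numpy as np
from scipy.optimize import linear_sum_assignment

# Made-up 3x3 "love-distance" matrix: affinity[i, j] says how well person i
# matches person j. linear_sum_assignment minimizes cost, so negate the
# matrix to maximize total affinity instead.
affinity = np.array([[0.9, 0.1, 0.3],
                     [0.2, 0.8, 0.4],
                     [0.5, 0.3, 0.7]])
rows, cols = linear_sum_assignment(-affinity)
print(list(zip(rows, cols)), affinity[rows, cols].sum())

None of this scales to millions of users, of course, which is what the three steps above are for.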

Compatibility

The first part is the most straightforward: based on secret-sauce internal surveys and insights from psychology, people are rated as more or less compatible with each other. The compatibility rating encompasses both single-person personality traits and dyadic – i.e. similarity – traits.

Results are also filtered by sexual preference, age brackets, location, etc. This first pass eliminates a lot of non-compatible matches based on a hard threshold, and thus sparsifies the love-distance matrix to a much more manageable number of non-zero elements. I would also venture that it probably results in the creation of cliques, e.g. by location, which allows parallelization of subsequent steps.
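
A rough sketch of what such a hard-filter pass might look like (the user records, thresholds, and scoring below are all made up for illustration, not eHarmony's actual criteria):

from scipy.sparse import dok_matrix

# Toy user records; real profiles would have many more fields.
users = [
    {"gender": "F", "seeking": "M", "age": 29, "min_age": 27, "max_age": 40, "region": "LA", "score": 0.8},
    {"gender": "M", "seeking": "F", "age": 33, "min_age": 25, "max_age": 35, "region": "LA", "score": 0.7},
    {"gender": "M", "seeking": "F", "age": 45, "min_age": 30, "max_age": 50, "region": "NYC", "score": 0.9},
]

def compatible(a, b):
    # Mutual preference, age-bracket, and location filters (stand-ins for the real ones).
    return (b["gender"] == a["seeking"] and a["gender"] == b["seeking"]
            and a["min_age"] <= b["age"] <= a["max_age"]
            and b["min_age"] <= a["age"] <= b["max_age"]
            and a["region"] == b["region"])

n = len(users)
love = dok_matrix((n, n))
for i, a in enumerate(users):
    for j, b in enumerate(users):
        if i != j and compatible(a, b):
            love[i, j] = a["score"] * b["score"]  # stand-in compatibility rating

print(love.nnz, "of", n * n, "pairs survive the filters")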

Affinity

The affinity score is a computed probability that two users will communicate, based on a trained logistic regression model. The training data consists of logs of whether two users communicated, given their profiles. Training is done using Vowpal Wabbit, a horribly named but potent machine learning package that can do online training of linear and logistic regression models in the terabyte regime.
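
I have no idea what eHarmony's actual pipeline looks like, but for a flavor of what feeding pair data to Vowpal Wabbit involves, here is a sketch with made-up feature names; labels are +1/-1 for "communicated or not", and the shell command in the comment is the standard way to fit a logistic model with VW.

# Write made-up pair examples in VW's text format: "label |namespace feat:value ...".
pairs = [
    (1,  {"sessions_a": 12, "sessions_b": 4, "photos_a": 3, "photos_b": 5, "compat": 0.82}),
    (-1, {"sessions_a": 2,  "sessions_b": 30, "photos_a": 1, "photos_b": 2, "compat": 0.45}),
]
with open("train.vw", "w") as f:
    for label, feats in pairs:
        f.write(f"{label} |pair " + " ".join(f"{k}:{v}" for k, v in feats.items()) + "\n")

# Then, roughly: vw -d train.vw --loss_function logistic -f affinity.model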

You live and die by your features; eHarmony uses classic features like site usage statistics, text features (bag-of-words, I presume), number of photos, etc. extracted from pairs of users. I imagine that the training matrix also includes dyad features like the compatibility rating. Interestingly, eHarmony has also ventured into photo analysis lately.

Jon first showed examples of using Viola-Jones detectors to extract basic image features like the ratio of face area to photo area. The ubiquitous Viola-Jones detector, implemented in OpenCV, uses a cascade of boosted classifiers built from decision stumps to decide whether an image location contains a face. The classifier uses Haar-like features, which can be computed very efficiently using integral images, and is trained using AdaBoost.
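
For a concrete feel, here is roughly what extracting that face-area feature looks like with OpenCV's stock frontal-face cascade; the file name is hypothetical, and the exact features eHarmony computes are not public.

import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
img = cv2.imread("profile_photo.jpg")        # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# One of the features mentioned in the talk: face area relative to photo area.
photo_area = img.shape[0] * img.shape[1]
face_area_ratio = sum(w * h for (x, y, w, h) in faces) / photo_area
print(len(faces), face_area_ratio)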

Face parts

Jon then showed more recent results using the Face Parts detector, which I didn’t know about, but which is pretty amazing. The idea behind Face Parts is that a face can be deconstructed into parts arranged in a tree structure. Part match – e.g. a score which says how eyebrow-like an image patch is – is determined by the dot product of a template with a histogram of oriented gradients (HOG) feature set.

FaceParts

The parts are joined by “springs”, and the total spring deformation determines how energetic a configuration of parts is – low energy configurations are better. A weighted sum of appearance and structural scores determines the “goodness” of a particular configuration.
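
A back-of-the-envelope version of that scoring, using scikit-image's HOG implementation, with a random patch standing in for an image and random numbers standing in for a learned part template and spring parameters (so this shows the bookkeeping, not a trained model):

import numpy as np
from skimage.feature import hog

rng = np.random.default_rng(0)
patch = rng.random((32, 32))                 # fake grayscale patch at a candidate part location
feats = hog(patch, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
template = rng.random(feats.size)            # stands in for a learned "how eyebrow-like" template

def spring_cost(part_xy, parent_xy, rest_offset, stiffness=1.0):
    # Quadratic penalty for stretching the spring away from its rest offset.
    dx, dy = np.subtract(part_xy, parent_xy) - np.asarray(rest_offset)
    return stiffness * (dx ** 2 + dy ** 2)

# Score of one candidate placement of this part relative to its parent:
# appearance (template . HOG features) minus structural (spring) deformation.
score = np.dot(template, feats) - spring_cost((40, 60), (32, 60), rest_offset=(8, 0))
print(score)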

The goodness of all configurations can be estimated and maximized efficiently using a message-passing algorithm because of the special tree structure of the springs model. Several potential tree structures are allowed – for instance, one for front-facing faces, another for profiles – so that pose estimation, detection, and landmark detection are all done in a single step. Pretty slick.

Training is done in a maximum-margin setting using structured SVM learning methods. Once the model is trained, it’s evaluated on faces in the eHarmony dataset, and various features are extracted from the image: things like the ratio of face width to face height and whether you’re showing cleavage or not. Jon implemented an efficient version, which is open source and available on GitHub.

My understanding is that these features are not encoded dyadically in the affinity model: e.g. it doesn’t try to match guys with mustaches with women showing cleavage. Rather, these are monadic features that determine how likely you are to be communicated with, i.e. how attractive you are. And how likely you are to receive communication is important in the next step, matching, which tries to make everybody happy: à chaque guenille son torchon.

Matching

Finally, we have to match users optimally. The system sets a goal of 6 to 10 matches per person, and runs a flow solver – the CS2 algorithm – on a directed acyclic graph to maximize the total flow, i.e. the sum of affinity scores of the matched people.
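
I don't know eHarmony's exact graph construction, but the general shape of the problem can be sketched with networkx's min-cost-flow machinery standing in for CS2 (toy affinities, and a capacity of 1 match per user instead of 6 to 10):

import networkx as nx

# Toy affinity scores between "senders" and "receivers" (made-up numbers).
affinity = {("u1", "v1"): 0.9, ("u1", "v2"): 0.4,
            ("u2", "v1"): 0.7, ("u2", "v2"): 0.8}
cap = 1  # max matches per user in this toy; the real system aims for 6 to 10

G = nx.DiGraph()
for (u, v), a in affinity.items():
    # One unit of flow on this edge means "serve v as a match to u";
    # negative integer weights turn affinity maximization into cost minimization.
    G.add_edge("s_" + u, "r_" + v, capacity=1, weight=-int(1000 * a))
for u in {u for u, _ in affinity}:
    G.add_edge("source", "s_" + u, capacity=cap, weight=0)
for v in {v for _, v in affinity}:
    G.add_edge("r_" + v, "sink", capacity=cap, weight=0)

flow = nx.max_flow_min_cost(G, "source", "sink")
matches = [(u, v) for u, targets in flow.items() for v, f in targets.items()
           if f > 0 and u.startswith("s_") and v.startswith("r_")]
print(matches)  # picks u1-v1 and u2-v2, the highest-affinity pairing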

A very interesting cutting-edge development – not in production right now – is the idea of serving more or fewer matches to certain people based on their profile. Some people like more choice, some less – introverts, for example (maybe).

Without knowing a priori if a certain person is a maximizer or satisficer, how do you find their optimal number of matches? One approach would be, for a month, to select a number of matches at random from day to day, and then from then on pick whichever number yielded the most communications for that person. But aren’t we just wasting a lot of days with this strategy?

In fact this problem is the classic multi-armed bandit problem in disguise. You have a series of one-armed bandits – a mathematical idealization of slot machines – which give rewards with certain, unknown probabilities. On each trial, you pick a bandit and get its reward. The problem is then to maximize the total reward over a given period of time; that is, to minimize the total regret. This requires balancing exploration and exploitation.

One strategy which is not quite optimal but nevertheless very fast and effective is the UCB policy, which says that you should pick whichever arm has the highest upper confidence bound. So the UCB policy could be deployed in this scenario to rapidly find a user’s optimal number of matches.
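
Here is a minimal sketch of the UCB1 flavor of that policy applied to this scenario; the candidate match counts and the simulated user response are made-up assumptions.

import math
import random

arms = [6, 7, 8, 9, 10]      # candidate numbers of matches to send per day
counts = [0] * len(arms)     # how many times each arm was tried
values = [0.0] * len(arms)   # running mean reward, e.g. "did the user communicate today?"

def pick_arm(t):
    # Try every arm once, then pick the arm with the highest upper confidence bound.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    return max(range(len(arms)),
               key=lambda i: values[i] + math.sqrt(2 * math.log(t) / counts[i]))

def update(i, reward):
    counts[i] += 1
    values[i] += (reward - values[i]) / counts[i]

# Simulated user who secretly responds best to 8 matches per day.
random.seed(0)
for t in range(1, 1001):
    i = pick_arm(t)
    update(i, 1.0 if random.random() < 0.3 - 0.05 * abs(arms[i] - 8) else 0.0)

print("most-played arm:", arms[max(range(len(arms)), key=lambda i: counts[i])])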

Here, we have more data that we can exploit – we know users’ profiles. This problem can be treated within the framework of the contextual bandit – basically, classic bandit + regression on features. There’s a very slick paper from Yahoo! labs that shows how to generalize the UCB strategy to the contextual bandit problem, which I highly recommend you check out.
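
If I have the right paper, that's LinUCB; a sketch of the disjoint-arms version looks roughly like this (the arm set, feature dimension, and alpha are arbitrary here, and the context vector is random).

import numpy as np

class LinUCB:
    """Disjoint LinUCB sketch: one ridge-regression model per arm, plus a UCB bonus."""

    def __init__(self, n_arms, n_features, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(n_features) for _ in range(n_arms)]    # per-arm X^T X + I
        self.b = [np.zeros(n_features) for _ in range(n_arms)]  # per-arm X^T rewards

    def choose(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Arms = candidate match counts, context x = (made-up) user profile features.
policy = LinUCB(n_arms=5, n_features=8)
x = np.random.rand(8)
arm = policy.choose(x)
policy.update(arm, x, reward=1.0)   # reward: did this user communicate today?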

Conclusion

At the end of the day, is it all worth it? Jon highlighted a paper published in PNAS showing that people who met their spouse through online dating report higher marital satisfaction than those who met offline, and that among dating sites, eHarmony has the best marital satisfaction rates.

Despite the fact that the survey underlying the paper was commissioned by eHarmony itself, the stats look legit, and PNAS is a pretty damn good journal. Of course, one can’t eliminate self-selection biases, i.e. people who want to be in committed relationships select this particular site, as Aziz Ansari points out:

Read up more on applying data science to online dating at the OkCupid blog and in the book Dataclysm.

* for each (female) washcloth, a (male) washcloth, i.e. to each his own. Quebecois French has a lot of terms for washcloth for some reason – see also débarbouillette, which literally translates to “a small item that removes scribbles (from one’s face)”.


The best words in Scrabble

qi

Well, I’m back in the old country for the holidays, and boy do I feel like the smartest pickle in the jar for moving to California when it’s below zero here in Montreal (either scale).

There is one thing I miss about the old country: French Scrabble. I’ve spent an inordinate amount of time memorizing word lists and playing the game, and now it’s a pain to find a partner in the good ol’ US of A.

So I decided to start studying English Scrabble, and I figured I could use some stats to prioritize my learning*. Quackle is a Scrabble solver with a (well-hidden) command-line interface that can play games against itself. The AI agent, called ‘speedy player’, uses heuristics rather than Monte Carlo simulations to decide which move to play; it’s very fast, yet it has access to the full dictionary (TWL06, the North American tournament dictionary**) and plays a kick-ass game.

ZOEAE and UMIAQ are legit words, apparently

I let it run for 100,000 games and, with some Python glue (pandas mostly), compiled a list of the best words by various criteria: points per play, plays per game, and points per game (= points per play * plays per game). I grouped words by form (i.e. axe, axes, axed, etc. count as one root word) using a dictionary I found in Zyzzyva. The most useful word in Scrabble is…

Qi. Indeed, Qi is the only two-letter word that contains a Q, and one of only a few dozen words that contain a Q but not a U. Therefore, it is played very frequently, in 7 out of every 10 games. The next most useful words are:

  1. BE
  2. RE – as in do, ré, mi…
  3. ZA – short for pizza
  4. ER – an expression of hesitation

The top 50 is in fact dominated by such two-letter words, which, while often not valuable in themselves, can be used to attach other, more useful words to the board.

It is surprising how skewed the distribution of word values is. While knowing Qi leads to a whopping +22 points-per-game advantage, the 101st most common word, joe, gives an expected improvement of around a single point. Thus, the most useful words are by far the short two-letter words, followed by three-letter words containing high-paying letters, followed by a very long tail of infrequently used words.
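
For what it's worth, the aggregation behind these numbers is only a few lines of pandas; the column names and the tiny play log below are made up, not the actual Quackle output format.

import pandas as pd

plays = pd.DataFrame({
    "game":   [1, 1, 2, 2, 3],
    "word":   ["QI", "JOE", "QI", "ZA", "QI"],
    "points": [22, 10, 31, 24, 22],
})

n_games = plays["game"].nunique()
stats = plays.groupby("word")["points"].agg(points_per_play="mean", n_plays="count")
stats["plays_per_game"] = stats["n_plays"] / n_games
stats["points_per_game"] = stats["points_per_play"] * stats["plays_per_game"]
print(stats.sort_values("points_per_game", ascending=False))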

Verbs are an interesting subset of words, because they have a lot of alternate forms, and thus by learning a verb’s root you add several words to your vocabulary at once. Here’s the top 20, which includes quite a few surprises:

  1. BE – including rare forms ART, WAST, WERT
  2. EAT – including rare form ET
  3. IN – to harvest: I IN, he INS, we INNED, you’re INNING
  4. EX – to cross out; I EX, he EXES, we EXED, you’re EXING
  5. AX – to work with an ax – like ex
  6. DO – including rare forms – DIDST, DOEST, DOST, DOTH, DOETH
  7. PI – to spill into disorder – PIED, PIING, PIEING, PIES
  8. AH and AAH – to express surprise
  9. OH – similar to AH
  10. UP – to raise
  11. OWE
  12. AWE
  13. AXE
  14. ZAG – to turn sharply – ZAGGED, ZAGGING, etc.
  15. VEX
  16. OPE – to open – OPED, OPING, OPES
  17. EYE – including EYING, EYEABLE
  18. JOW – to ring a bell
  19. ZAP
  20. JOT

OUZO >> RAKI > ARAK >> PASTIS in Scrabble. Exactly the opposite of real life.

What about high-paying words (per play) that are used rather frequently (more than 1 out of every 1000 games)? We have:

  1. ANTIQUE
  2. DISRATE – to lower in rank
  3. STEARIN – the solid portion of a fat
  4. RETSINA – a resin-flavored Greek wine
  5. TERTIAN – a recurrent fever

This list shows a consistent pattern – bingos (plays that use all seven tiles, for a 50-point bonus) built from low-paying letters (modulo the Q in antique) – letters which are highly likely to co-occur if you manage your leaves well.

This brings us to the important subject of the optimal leave, that is, which combination of letters on the rack leads to the best plays. We can repeat the same exercise and compute the average number of points on a turn given a certain rack, etc. The top 10 in terms of total points per game are:

  1. AEINRST
  2. EINORST
  3. AEEINRT
  4. AEGINRT
  5. EEINRST
  6. AEILNRT
  7. AEEIRST
  8. AEILRST
  9. AEINORT
  10. AENORST

The best leaves in terms of total points are dominated by the letters EITRNASL, forming LATRINES. If we look at the highest-paying leaves per play, however, we find leaves dominated by the blank tile ? and high-paying letters – these leaves are infrequent, but when they can be placed for a bingo, with high-paying letters and assorted doubles and triples, they’re worth a ton.

Here is the full list of plays, with annotations for letter count, whether they contain high paying letters, etc. It’s a Google Fusion Table, which is a pretty nice tool for sharing large databases.

Here is another list for leaves.

* IOW I double-nerd-sniped myself
** A new version, the TWL2014, has been announced, but it is not available in electronic form yet.


Fixing broken .m4v movies made in Matlab

I made a few ten-minute movies in Matlab, composed of 1-second clips from BBC Earth and encoded as H.264, to use as visual stimuli. They played fine on Windows but would stop at the 170-second mark in QuickTime on Mac.

We’re using Processing on Mac for visual display, which uses QuickTime in the background, so they would lock up in Processing as well.

A solution is to use a passthrough encoder to remux the file; somehow there was missing metadata, and remuxing fixed it for QuickTime. ffmpeg does the trick:

ffmpeg -i bad_movie.m4v -c:v copy -c:a copy good_movie.m4v

Note that it doesn’t actually re-encode the data; it just adds some metadata – about 3k worth of metadata, in fact.
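
If you have a whole folder of these, a quick loop over the same command saves some typing; the directory and naming scheme below are hypothetical.

import pathlib
import subprocess

# Remux every .m4v in the folder with the same passthrough flags as above.
for bad in pathlib.Path("stimuli").glob("*.m4v"):
    good = bad.with_name(bad.stem + "_fixed.m4v")
    subprocess.run(["ffmpeg", "-i", str(bad), "-c:v", "copy", "-c:a", "copy", str(good)],
                   check=True)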
