# Turing machines, the number game, and inference

The number game, which stems from Josh Tenenbaum’s PhD thesis, illustrates some important ideas about Bayesian concept learning. Here is a discussion of this in lecture notes from Kevin Murphy for reference. The basic setup is as follows: I give you a set of numbers from 1 to 100, and you try to guess another number in the set. For instance, I give you the numbers {4,8,64,2}. Intuitively, you might say another number in the set is 16, because that’s consistent with the concept that the set contains powers of 2.

Now, it’s also the case that this example set contains only even numbers. Hence predicting 24 or 92 would be also pretty reasonable. Or we could say that the next number if 43, because the full underlying set is {4,8,64,2,43,57} – 6 uniform random draws from the range 1 to 100. How can we say with any authority what the next number in the set is, then?

We can pick the next element in the sequence as follows: evaluate the likelihood that a draw was from one of a finite set of concepts. Then, pick a number from this concept proportional to the probability that the concept was the one generating the original set. In other words, evaluate the probability that {4,8,64,2} was generated by one of a number of concepts – even numbers, odd numbers, powers of two, etc. – then pick a concept proportional to its posterior probability, and pick an element allowed by the concept uniformly at random. This is our guess of the next element in the sequence.

Here, we need to evaluate the posterior probability that a draw came from a given concept. Bayes’ theorem shows us that this probability is proportional to the likelihood of the draw under the concept times the prior probability of the concept.

Now, the likelihood of the draw under the concept is 0 if the concept is inconsistent with the draw – {4,8,64,2} and odd numbers, for example. If the concept is consistent with the draw, then probability will be higher, all else being equal, if the hypothesis set is smaller. This is what’s called the size principle: all else being equal, a draw {4,8,64,2} is more likely to have come from the set of powers of 2 rather than the set of all even numbers, simply because the set of powers of 2 is smaller than the set of even numbers.

This is one manifestation of Occam’s razor: a concept that can explain everything doesn’t explain anything.

What about the prior probability of a given set? Intuitively, we might say that simple concepts – the set of powers of 2, even numbers, etc. – are more likely than complex concepts – the set {4,8,64,2,43,57}. We can certainly estimate this prior probability empirically from the pattern of responses of humans in the empirical Bayes framework – as done in Josh Tenenbaum’s thesis. Is there a way to define an absolute concept of complexity of a concept, however?

The answer to this question is both illuminating and completely impractical. Algorithmic information theory defines Kolmogorov complexity, which is the number of tokens needed to describe a concept using a universal description language – a Turing machine, Python, assembly, lambda calculus, etc.

Of course, a concept might be much easier to describe in a given language than in another. If a language has a primitive for powers of two, then reasonably, it might only take one token to describe this concept in that language, but not in another. However, the theory of universal Turing machines tells us that the Kolmogorov complexity of a concept in one language is at most some constant plus the Kolmogorov complexity of this same concept in another language.

Why is this? Well, as long as both languages are Turing complete, it’s always possible to translate a program in the first language to a program in the second language by concatenating a universal emulator for the first language in the second language and a string corresponding to the program in the first language.

With the concept of Kolmogorov complexity in hand, it’s possible to define an absolute concept of prior probability for a given concept: the unnormalized prior probability of a concept is simply 2 to the power of the length of a program which produces the concept. So, for example, the concepts “even numbers”, “powers of 2”, and the set {4,8,64,2,43,57} can be expressed in Python via:

```rg = xrange(100)
evens         = [x if x % 2 == 0 for x in rg]
powers_of_two = [x if abs(math.log2(x) - round(math.log2(x))) < 1e-6 for x in rg]
weird_set<div class="mceTemp"><dl id="attachment_3609" class="wp-caption aligncenter" style="width: 300px"><dt class="wp-caption-dt"><img src="https://xcorr.files.wordpress.com/2015/11/screen-shot-2015-11-19-at-11-54-20-pm.png?w=300" alt="Humans happily predict next elements in a sequence in the number game" width="300" height="179" class="size-medium wp-image-3609" /></dt><dd class="wp-caption-dd">Humans happily predict next elements in a sequence in the number game</dd></dl></div>= [x if x == 4 or x == 8 or x == 64 or x == 2 or x == 43 or x == 57 for x in rg]
```

You can see here that intuitively more complicated concepts take more tokens to express, which translates into lower prior probability. Neat!

Of course, this is highly impractical as a means of specifying prior probability. The biggest issue is that Kolmogorov complexity refers to the length of the smallest program which can produce the given concept. This length, however, is uncomputable. The hand-wavy version of this argument goes as follows: suppose I show you a program which can spit out a given concept. Does there exist an alternative program which is one token smaller which also produces the same concept?

To answer this question, I would have to evaluate all the valid programs which are one token smaller than the one I currently have, and check if any produces the required concept. One of the programs will, almost assuredly, get stuck in what looks like an infinite loop. Perhaps it will produce the required concept after long enough.

If only we had a function which can take an input program and decide whether this program will halt… you can see where I’m going with this – this function, famously, does not exist, and in fact producing a function which can compute the Kolmogorov complexity of a concept universally is equivalent to solving the halting problem.

So in terms of practical applications, setting the prior to a power of the Kolmogorov complexity is a dud. From a theoretical perspective, and from the standpoint of understanding Bayesian machinery, it’s quite a lovely theory, however. More on this subject in this lovely paper by Marcus Hutter on arXiv.

I haven’t updated the blog in a while, as I have been very busy interviewing, going back home for a bit, and transitioning into a new role. I’m proud to say that I’m now working in Mountain View, at the Googleplex! It’s an amazing opportunity to work with some of the top machine learning and statistics researchers and practitioners in the world.

About my transition to industry: as scientists, we’re attached to the idea of the inevitable march of progress, the notion that scientific discovery will conquer all the scourges of humanity – disease, famine, global warming – and usher us into an era of pure bliss. As a vision scientist, I might say that studying neural coding will help us cure blindness in the not-so-distant future.

But the reality is that a lot of these scourges have low-tech solutions. To take the example of blindness, one of its leading causes in industrialized countries is type II diabetes, for which the causes – lack of physical activity, poor diet – are well-understood. So while I’m working on understanding hierarchical visual processing, and thinking about how that might eventually translate into a neural implant that will feed visual input directly into the brain – perhaps using nanobots – I’m missing the more obvious, immediate solution, which is to influence people’s diet and physical activity so that they don’t get diabetes in the first place.

If we truly think that one’s highest purpose is to increase the well-being of the most people – and it’s certainly a viewpoint that’s been admirably championed from the Stoics, to the Gates foundation to the recent framework of Effective Altruism – then we have to start thinking about how to measure people’s behaviour, the causal effects of different interventions, and the optimal selection of these interventions given the current evidence. This is the domain of behavioural and experimental economics, causal inference, and reinforcement learning.

So I’m very happy to be working in a group that’s been strongly influenced by the work of Hal Varian and Donald Rubin. It’s a data science group that’s particularly focused on modeling user behaviour and causal inference.

To take one example, Kay Broderson has a post over at the unofficial Google data science blog on inferring the causal impact of an intervention in the context of time-series analysis; he has some R code to get this running as well. One can view as a generalization of a difference-in-differences estimator to the scenario of more than two time points and one than one potential control.

You could use this same framework, to, say, determine the effect of the change in economic policies in post-Maoist China on sex-selective child survival, or measure how building schools in Indonesia in the late 70’s helped lift a massive number of people out of poverty. Very cool stuff. On that subject, I’m collaborating with a friend of mine on the analysis of a development project in schools in Nepal, tying all this stuff together to try to help people.

TL;DR: the statement that “academia is noble because science is intrinsically good” is dubious, and I’m glad I’m getting a chance to work in the real world.

Not to fear, I will still be leveraging my deep net/vision/machine learning wizardry for some hush-hush projects in my new work.

I’ll be writing here about some interesting papers I’m reading, along with various musings, starting later this week. The future of the blog will probably be less code-heavy – I am free from Matlab! – and will probably be more focused on machine learning than before. Until next time!

# Calling R from Matlab – flat file communication

In the last post I showed how to fit a Bayesian hierarchical model in R to estimate experimental parameters. Our main data analysis pipeline uses Matlab, and we’d like to integrate these two environments for the purpose of analysis.

The example involved JAGS, which does have a direct Matlab interface in addition to the more common R one, but more generally you might need to use some unique package  for an analysis that only runs in R. So how can you pipe your data from Matlab to R and back?

# Flat file communication

There are a variety of unofficial interfaces between Matlab and R – you can call R from Matlab on Windows via a COM extension, or you can spawn Matlab engines and manipulate .mat files in R – but the very simplest communication involves writing your data in flat files, and loading them on the other end.

The `csvwrite` and `csvread` functions in Matlab, and the corresponding `read.csv` and `write.csv` functions in R are really all you need. Be aware, however, that Matlab’s built-in CSV reading function doesn’t support headers or string variables. You can pick up this more fancy version from the Matlab File Exchange that does.

# Matlab side

Matlab allows you to run terminal commands using the `!` operator, or the OS-specific `dos` and `unix` commands. Thus, to read some data, run an R script, then read back the data written by R is as simple as:

```csvwrite('in.csv',[originaldat]);
```

Matlab will block until R returns, so you don’t need to worry about asynchronous callbacks.

# R side

On the R side, you need a batch script that reads the data, transforms it, and then writes it back. For our previous example, this would look like:

```rawdat <- read.csv('in.csv',header=FALSE)

#Extract the data we need
ms <- rawdat[,1]
stds <- rawdat[,3]

#Run the MCMC model
library("rjags")
library("coda")