### Decoding fMRI activity evoked by natural movies

The Gallant lab has just published a new paper in Current Biology on decoding fMRI activity evoked by natural movies. TryNerdy has a very high-level overview of the paper. Here I’m more interested in the nitty-gritty computational/statistical details.

The idea is to train an encoding model using fMRI responses during natural movies. Then the goal is to decode what natural movies were presented to an observer based on a second set of fMRI responses (see movie above). Importantly, the second set of fMRI responses was not used during training. Furthermore, there is no overlap between the presented stimuli in the training and decoding sets. This is a pretty tough test of the ability of the encoding/decoding models to extrapolate.

The goal of decoding is to infer the stimulus s that drove a response r. From Bayes’ theorem, $p(s|r) \propto p(r|s)p(s)$. Let’s look at these two elements in turn.

The likelihood $p(r|s) = N(r; \hat r, \Sigma)$ captures the encoding model: how the stimulus is transformed into a distributed pattern of activity visible through the BOLD signal. Pretty standard stuff here: the (Gaussian) density is centered on a mean response $\hat r = Xw$ given by a linear encoding model, with covariance $\Sigma$ estimated from the data. Here X is a matrix representation of features of the stimulus, derived nonlinearly from the pixel representation of the movies. There’s a lot of freedom in choosing X, and I think the authors chose a pretty interesting representation. Rather than focusing on the static content of the movies, it encodes the localized motion energy in the movies, extracted with a standard energy model (above, B). There are extra channels at 0 temporal frequency so that static patterns are represented in addition to motion. If only static patterns are represented, however, the quality of the reconstruction is quite a bit poorer. So it’s really the motion energy that’s driving a lot of the reconstruction ability, rather than the static content, and I think there are some interesting implications for the role of motion in object recognition.
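To make "motion energy" concrete, here is a minimal sketch of a single energy-model channel in one spatial dimension plus time. It is not the paper's filter bank: the function names (`gabor_pair`, `motion_energy`, `drifting_grating`) and all parameter values are assumptions chosen for illustration. The idea is the standard one: filter the clip with a quadrature pair of spatiotemporal Gabors, then sum the squared outputs, which gives a response that is selective for direction but insensitive to phase.

```python
import numpy as np

def _grid(size, frames):
    # Centered space (X) and time (T) coordinates, shape (frames, size).
    x = np.arange(size) - (size - 1) / 2
    t = np.arange(frames) - (frames - 1) / 2
    T, X = np.meshgrid(t, x, indexing="ij")
    return X, T

def gabor_pair(size=16, frames=16, fx=0.125, ft=0.125, sigma=4.0):
    # Quadrature pair (even/odd phase) of space-time Gabor filters.
    # fx, ft are in cycles per pixel / per frame (illustrative values).
    X, T = _grid(size, frames)
    envelope = np.exp(-(X**2 + T**2) / (2 * sigma**2))
    phase = 2 * np.pi * (fx * X - ft * T)
    return envelope * np.cos(phase), envelope * np.sin(phase)

def motion_energy(clip, even, odd):
    # Sum of squared quadrature responses: phase-invariant energy.
    return np.sum(clip * even) ** 2 + np.sum(clip * odd) ** 2

def drifting_grating(size=16, frames=16, fx=0.125, ft=0.125, phase0=0.0):
    # Toy stimulus: a grating drifting at temporal frequency ft.
    X, T = _grid(size, frames)
    return np.cos(2 * np.pi * (fx * X - ft * T) + phase0)

even, odd = gabor_pair()
e_match = motion_energy(drifting_grating(), even, odd)
e_shifted = motion_energy(drifting_grating(phase0=1.0), even, odd)
e_opposite = motion_energy(drifting_grating(ft=-0.125), even, odd)
print(e_match, e_shifted, e_opposite)
```

The channel responds strongly to the grating drifting in its preferred direction, about equally to a phase-shifted copy, and very weakly to the same grating drifting the other way. Setting `ft=0` in `gabor_pair` gives the static (0 temporal frequency) kind of channel mentioned above.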

w represents the weights of the linear encoding model, which are estimated from the training data (above, A). Each voxel is assigned a separate encoding model, representing its sensitivity to position, speed, and orientation, as well as its hemodynamic response. There are a lot of weights, so the regression is regularized with an L1 penalty.
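As a rough illustration of what an L1-regularized fit for one voxel looks like, here is a minimal sketch using iterative soft-thresholding (ISTA) on synthetic data. The solver choice, the function name `fit_l1_voxel`, the penalty value, and the data are all assumptions for illustration; the paper's actual fitting procedure may differ, and this sketch ignores the hemodynamic delays entirely.

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of the L1 norm: shrink toward zero by t.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fit_l1_voxel(X, r, lam=0.1, n_iter=500):
    # Minimize (1 / 2n) * ||r - X w||^2 + lam * ||w||_1 with ISTA.
    # X: (n_timepoints, n_features) stimulus features
    # r: (n_timepoints,) response of a single voxel
    n, p = X.shape
    # Step size 1/L, where L is the Lipschitz constant of the gradient.
    step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()
    w = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - r) / n
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Synthetic voxel that only cares about 3 of 20 features (made-up data).
rng = np.random.default_rng(0)
n_timepoints, n_features = 200, 20
X = rng.standard_normal((n_timepoints, n_features))
w_true = np.zeros(n_features)
w_true[[2, 7, 11]] = [2.0, -1.5, 1.0]
r = X @ w_true + 0.1 * rng.standard_normal(n_timepoints)

w_hat = fit_l1_voxel(X, r)
print(np.round(w_hat, 2))
```

The point of the L1 penalty is visible in the output: weights on irrelevant features are driven exactly to zero, at the cost of slightly shrinking the weights that matter. In the real problem one such fit would be run per voxel.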

The prior $p(s)$ represents the probability distribution of stimuli. Since natural movies have a lot of structure, representing this prior is a challenge. If you’ve been at some of the Gallant lab’s recent SFN posters, you will be familiar with the trick they used here. Rather than representing the prior through parameters, they represented it through samples. In practice, that means they collected hours of YouTube videos as a representative sample of “the space of natural movies”.

In this case the prior is an empirical distribution over the sampled clips, $p(s) = \frac{1}{N}\sum_i \delta (s-s_i)$, so the posterior is discrete as well: it puts mass only on the $N$ sampled clips $s_i$. For every sample video clip, they computed its contribution to the posterior. Then decoding is simply a matter of computing a summary statistic for this posterior. For such a sparsely sampled prior, MAP decoding is ineffective. It would be like doing Gibbs sampling and then using the one sample with the highest posterior as the estimate of the model parameters; clearly this would require a huge number of samples to yield a good estimate.
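Scoring every sampled clip against an observed response can be sketched in a few lines. This is a toy version under stated assumptions: each clip has already been turned into a feature vector, all voxels share one weight matrix `W`, the sample prior is uniform over clips, and the function name `log_posterior_over_clips` and the synthetic data are hypothetical.

```python
import numpy as np

def log_posterior_over_clips(r, features, W, Sigma):
    # r:        (n_voxels,) observed response
    # features: (n_clips, n_features) features of each sampled clip
    # W:        (n_features, n_voxels) encoding weights, one column per voxel
    # Sigma:    (n_voxels, n_voxels) noise covariance
    resid = r - features @ W                       # (n_clips, n_voxels)
    solved = np.linalg.solve(Sigma, resid.T).T     # Sigma^{-1} applied to each residual
    log_lik = -0.5 * np.sum(resid * solved, axis=1)
    # Uniform prior over clips adds a constant, so only normalization is left.
    m = log_lik.max()
    return log_lik - (m + np.log(np.sum(np.exp(log_lik - m))))

# Synthetic check: the response is generated from clip number 7 (made-up data).
rng = np.random.default_rng(0)
n_clips, n_features, n_voxels = 50, 10, 30
features = rng.standard_normal((n_clips, n_features))
W = rng.standard_normal((n_features, n_voxels))
Sigma = 0.5 * np.eye(n_voxels)
true_idx = 7
noise = np.sqrt(0.5) * rng.standard_normal(n_voxels)
r = features[true_idx] @ W + noise

log_post = log_posterior_over_clips(r, features, W, Sigma)
print(np.argmax(log_post), np.exp(log_post).sum())
```

In this toy setup the true clip is in the sample set, so the posterior collapses onto it. In the real problem the presented movie is never in the sample, which is exactly why picking the single best clip is a poor estimator.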

Taking the expected value of the posterior didn’t work too well either; the authors tell us that this typically averaged over only two or three video clips. So instead they averaged over the top 100 stimuli. Not too sure what this corresponds to in terms of implicit loss, but usually rank-based measures are quite robust. So you can probably think of it as a robust measure of the bulk of the posterior.
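The top-100 average is simple to write down. A minimal sketch, assuming an unweighted mean over the highest-posterior clips (the function name `average_top_k` and the toy arrays are hypothetical, and the paper may differ in details such as weighting):

```python
import numpy as np

def average_top_k(clips, log_post, k=100):
    # clips:    (n_clips, ...) array of candidate clips (e.g. frames x height x width)
    # log_post: (n_clips,) log posterior of each clip
    # Returns the unweighted mean of the k highest-posterior clips and their indices.
    k = min(k, len(log_post))
    top = np.argsort(log_post)[-k:]
    return clips[top].mean(axis=0), top

# Tiny example: five constant-valued 2x2 "clips" with values 0..4.
clips = np.arange(5.0)[:, None, None] * np.ones((5, 2, 2))
log_post = np.array([-3.0, -1.0, -5.0, -0.5, -2.0])
recon, top = average_top_k(clips, log_post, k=2)
print(recon, sorted(top.tolist()))
```

Because only the rank of each clip matters, the result does not change if the posterior is sharpened or flattened, which is one way to see why this behaves more robustly than the posterior mean.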

The results are pretty convincing; have a look at Jack’s page on the subject for more example videos.

Shinji Nishimoto, An T. Vu, Thomas Naselaris, Yuval Benjamini, Bin Yu, & Jack L. Gallant (2011). Reconstructing visual experiences from brain activity evoked by natural movies. Current Biology. doi:10.1016/j.cub.2011.08.031