Heavy day today. These notes might be slightly more rambling than usual, apologies.
Eero Simoncelli
Eero delivered a lecture focusing on encoding, and specifically on efficient coding. From Barlow (1961):
Sensory relays recode sensory messages so that their redundancy is reduced but comparatively little information is lost.
He pointed out that this view, in which conservation of information is of primary importance, is at odds with his previous lecture, which focused on null spaces and information loss. This led to some heated debate on the usefulness of information theory for understanding neural codes. He mentioned that Michael Shadlen believes that what should be studied is not the potential information but rather the decisions, that is, the subset of the information that is actually used to drive behaviour, and implied that while he recognized the limitations of information theory, he wasn’t anywhere near this extreme.
Within the framework of information theory, it is not possible to distinguish between information per se and information that is meaningful and important to the organism. One example that was given is color vision: while information theory can tell you how to create photoreceptors matched to the spectrum of natural scenes, it cannot tell you why primates aren’t sensitive to infrared, for instance. This sort of objection can be addressed by stacking a series of optimality principles: maximal information capacity, minimal wiring length, minimal energy use, etc.
He showed an interesting way of illustrating the mutual information I(r,s) between the stimulus and response: I(r,s) = H(r) – H(r|s). The mutual information is given by the entropy of the response minus the conditional entropy of the response given the stimulus, that is, “the entropy of the noise”. This quantity is related by a log to the ratio of the size of the total response volume to the size of the neighborhood in response space corresponding to a given stimulus.
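As a quick toy illustration (mine, not from the lecture), the decomposition I(r,s) = H(r) – H(r|s) can be computed directly from a small made-up joint distribution over stimuli and responses:

```python
import numpy as np

# Made-up joint distribution p(s, r): 2 stimuli (rows) x 3 responses (columns)
p_sr = np.array([[0.30, 0.15, 0.05],
                 [0.05, 0.15, 0.30]])

p_r = p_sr.sum(axis=0)   # marginal over responses
p_s = p_sr.sum(axis=1)   # marginal over stimuli

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_r = entropy(p_r)
# H(r|s): average entropy of the response given each stimulus ("noise entropy")
H_r_given_s = sum(p_s[i] * entropy(p_sr[i] / p_s[i]) for i in range(len(p_s)))

print(f"I(r,s) = H(r) - H(r|s) = {H_r - H_r_given_s:.3f} bits")
```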
To optimally encode a scalar variable, a neuron’s response pattern is determined by the following factors:
- The distribution of the stimulus p(s)
- The distribution of the input noise p(n|s)
- The distribution of the output noise p(n|r)
- The cost of outputting each response C(r)
With zero input noise, fixed output noise, and equal cost within a range [0..1] and infinite cost everywhere else, the best mapping from s to r is f(s) = CDF(s), the cumulative distribution function of the stimulus, i.e. histogram equalization. The more general case can be handled semi-analytically (McDonnell & Stocks 2008).
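Here’s a minimal numerical sketch of that special case (my own, with a made-up gamma prior standing in for p(s)): passing stimuli through their own empirical CDF produces responses that are approximately uniform on [0, 1], the maximum entropy distribution on that interval:

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.gamma(shape=2.0, scale=1.0, size=100_000)  # stand-in stimulus prior p(s)

# Noiseless optimal encoder: r = f(s), the empirical CDF of the stimulus
s_sorted = np.sort(s)
r = np.searchsorted(s_sorted, s) / len(s)

# The response histogram comes out flat: all response levels used equally often
counts, _ = np.histogram(r, bins=10, range=(0.0, 1.0))
print(counts / len(r))   # each bin is ~0.1
```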
Laughlin (1981) is an early example of an application of this principle: he showed a good match between the distribution of contrast in the environment of a fly and the contrast selectivity of a neuron within the visual system of the fly.
The maximum entropy distribution subject to the constraint E(f(r)) = c is given by p(r) ∝ exp(-λ f(r)). Thus, assuming a linear cost for spikes, f(r) = r, the corresponding maximum entropy distribution is the exponential distribution, which provides an alternate route towards justifying sparse coding: not as an end in itself, but as a way of maximizing information transmission in the face of a unit metabolic cost for each spike. Baddeley et al. (1997) showed that the distribution of responses in IT and V1 seems to be comparable to the optimum for a distribution of naturalistic stimuli.
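To check the claim numerically (a sketch of mine, with arbitrary numbers): among distributions on [0, ∞) with the same mean rate, the exponential should have the largest differential entropy:

```python
import numpy as np
from scipy import stats

mean_rate = 5.0  # common mean response, arbitrary units

# Candidate response distributions on [0, inf), all with mean 5.0
candidates = {
    "exponential": stats.expon(scale=mean_rate),
    "gamma (k=2)": stats.gamma(a=2.0, scale=mean_rate / 2.0),
    "half-normal": stats.halfnorm(scale=mean_rate * np.sqrt(np.pi / 2)),
}

for name, dist in candidates.items():
    # differential entropy in nats; the exponential comes out on top
    print(f"{name:12s} mean = {dist.mean():.2f}, entropy = {dist.entropy():.3f} nats")
```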
In the case of multiple neurons and/or multiple inputs, there are three tractable cases:
- Statistical independence in low noise
- Gaussian linear systems
- Response to one attribute (tuning curves)
Case 1 corresponds to sparse coding/ICA type approaches (Olshausen, Sejnowski, Lewicki, etc). Case 2 includes Atick & Redlich ’92 (efficient coding and retina) and some of Simoncelli’s more recent work.
Eero spent quite a bit of time on a project focusing on case 3, an early version of which is presented in a NIPS paper (Ganguli & Simoncelli 2010). The setup is the following:
- We have a population of N neurons
- The average firing rate of the population is fixed at R
- The population encodes a bounded scalar variable with a prior probability p(s)
- Each neuron in the population has a tuning curve and follows Poisson statistics
Given this, what is the optimal way of allocating tuning curves for each neuron? This problem, in the version above, is not tractable. However, it can be made tractable by assuming that the tuning curves are defined by a density d(s) and a total gain g(s). This constrains the bandwidth of the tuning curves as well as their shape. Given this additional set of assumptions, the information conveyed by the population can be maximized by using a bound based on the Fisher information matrix. With this simplification, it turns out that the optimal density d(s) is Np(s) and the optimal gain is g(s) = R. This is a very straightforward result.
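A small sketch of what the solution looks like (my construction, not the paper’s code; the Gaussian prior and the way the gain is normalized are placeholder choices): placing tuning curve centers at the quantiles of the prior implements d(s) = Np(s), with bandwidths shrinking where the density is high:

```python
import numpy as np
from scipy import stats

N, R = 20, 50.0                          # neurons, fixed total rate (made up)
prior = stats.norm(loc=0.0, scale=1.0)   # placeholder stimulus prior p(s)

# d(s) = N p(s): centers at prior quantiles, each neuron owning 1/N of the mass
centers = prior.ppf((np.arange(N) + 0.5) / N)
# Bandwidths shrink where the density is high: width ~ 1 / d(s)
widths = 1.0 / (N * prior.pdf(centers))

def population_rates(s):
    """Gaussian tuning curves, crudely rescaled so the summed rate is R for
    every s (a stand-in for the paper's constant gain g(s) = R)."""
    f = np.exp(-0.5 * ((s - centers) / widths) ** 2)
    return R * f / f.sum()

print(population_rates(0.0).round(2))    # neurons tuned near s = 0 fire most
```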
Eero then went on to illustrate many examples where this theoretical framework accounts well for the properties of images, neurons and psychophysical observers. He showed examples based on orientation tuning, speed tuning, spatial frequency tuning, and auditory frequency tuning, and it seemed quite convincing. Interestingly, it’s possible to start from the psychophysical discrimination performance of observers and from that infer the probability distribution of the corresponding property. Powerful stuff. This is currently being submitted to (what I presume to be) a high-impact journal.
David Heeger
David talked about V1 computation, and normalization in particular. One issue I have always had with the normalization model is that it’s very difficult to constrain the properties of the normalization pool (is it tuned or untuned, how large is it, etc.). He mentioned that in Schwartz and Simoncelli (2001), optimal normalization properties for a simulated population of V1 neurons were derived from first principles via redundancy reduction (un-butterflying the butterfly graph). I must have missed that the first time around; I’ll have to go back and reread.
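For reference, the standard form of the normalization model divides each unit’s driven response by a pooled signal. A minimal sketch, where the exponent, semi-saturation constant, and pool weights are exactly the under-constrained ingredients mentioned above:

```python
import numpy as np

def normalize(drive, n=2.0, sigma=1.0, pool=None):
    """Divisive normalization: R_i = drive_i^n / (sigma^n + sum_j w_ij * drive_j^n).

    `pool` (the matrix w) decides who contributes to each neuron's
    normalization signal: tuned vs. untuned, local vs. global -- precisely
    the properties that are hard to constrain experimentally.
    """
    d = np.asarray(drive, dtype=float) ** n
    if pool is None:
        pool = np.ones((d.size, d.size))   # default: untuned, global pool
    return d / (sigma ** n + pool @ d)

print(normalize([4.0, 2.0, 1.0]))   # strong inputs suppress weak ones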
He went on to talk about some of his recent fMRI results. Having been a subject in an fMRI experiment, I was aware of the various pie-slice stimuli that they use to map the retinotopy in visual areas. I was not aware, however, of the computational aspect of this: basically the stimuli used are periodic, and one can derive the retinotopy by looking at the phase of the response in each voxel at the frequency of the stimulation. Clever.
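The trick is easy to sketch (toy parameters, mine): project a voxel’s time series onto a complex exponential at the stimulus frequency and read off the phase:

```python
import numpy as np

rng = np.random.default_rng(0)
TR, n_vols, n_cycles = 2.0, 120, 8        # made-up scan: 2 s TR, 8 stimulus cycles
t = np.arange(n_vols) * TR
stim_freq = n_cycles / (n_vols * TR)      # stimulation frequency in Hz

# Fake voxel whose retinotopic position corresponds to a phase of 1.3 rad
true_phase = 1.3
bold = np.cos(2 * np.pi * stim_freq * t - true_phase) + 0.5 * rng.standard_normal(n_vols)

# Fourier component at the stimulation frequency; its angle is the retinotopic phase
component = np.sum(bold * np.exp(-2j * np.pi * stim_freq * t))
phase = (-np.angle(component)) % (2 * np.pi)
print(f"recovered phase: {phase:.2f} rad (true: {true_phase})")
```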
It’s somewhat surprising that it is possible to decode orientation from fMRI, since orientation columns are so much smaller than the size of a voxel. One early explanation was that the columns are inhomogeneous, so the orientation selectivity doesn’t quite cancel out over the scale of a voxel. This turns out to be wrong; work by Jeremy Freeman (that guy again) and David showed that there’s a bias in orientation selectivity as a function of retinotopy. For example, voxels corresponding to regions near the horizontal meridian tend, on average, to be slightly more selective for horizontal orientations.
Using a forward modeling approach, he showed that it is possible to read out information about normalization properties in V1 using fMRI (Brouwer & Heeger 2011). I’m a bit skeptical, I must say; the results depend on the linearity of the mapping from spikes to BOLD signal, and AFAIK that remains to be proven.
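For flavor, here is my sketch of the general forward-modeling recipe (not Brouwer & Heeger’s actual code; the channel shapes and sizes are placeholders): model each voxel as a linear combination of hypothetical orientation channels, fit the weights by regression, then invert the model to recover channel responses from new data:

```python
import numpy as np

rng = np.random.default_rng(1)
n_voxels, n_channels, n_trials = 50, 6, 200

# Hypothetical orientation channels: rectified cosines raised to a power
centers = np.linspace(0, np.pi, n_channels, endpoint=False)
def channels(theta):
    return np.clip(np.cos(2 * (theta[:, None] - centers[None, :])), 0, None) ** 5

# Simulated training data: BOLD = channels x (unknown) weights + noise
theta_train = rng.uniform(0, np.pi, n_trials)
C = channels(theta_train)                         # trials x channels
W = rng.standard_normal((n_channels, n_voxels))   # true channel-to-voxel weights
B = C @ W + 0.1 * rng.standard_normal((n_trials, n_voxels))

# Step 1: estimate the weights by least squares (the linearity assumption)
W_hat = np.linalg.lstsq(C, B, rcond=None)[0]

# Step 2: invert the model on a new trial to recover its channel responses
theta_test = np.array([np.pi / 3])
b_test = channels(theta_test) @ W + 0.1 * rng.standard_normal((1, n_voxels))
c_hat = np.linalg.lstsq(W_hat.T, b_test.T, rcond=None)[0].T
print(c_hat.round(2))   # peaks at the channel nearest 60 degrees
```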
He showed some other results related to normalization, in particular fruit-fly olfaction (Olsen, Bhandawat & Wilson 2010; Luo, Axel & Abbott 2010) and the normalization model of attention (Reynolds & Heeger 2009). He also pointed out the potential link between normalization and hierarchical Bayesian inference (Lee & Mumford 2003).
He topped off with a slide on the various reasons why one might normalize:
- To maximize the use of limited dynamic range
- To simplify the readout of population codes
- To get invariance with respect to uninteresting stimulus dimensions
- To create various read-out rules, for instance winner-take-all vs. averaging
- For decorrelation (Schwartz & Simoncelli 2001)
On an unrelated note, while I was researching this post, I stumbled onto this Scholarpedia article on models of visual cortex. Interesting.