### Optimizing GLM hyperparameters through the evidence

I wrote earlier about a recent paper by the Pillow lab that uses priors optimized through the evidence (aka marginal likelihood) to estimate receptive fields localized in both space and frequency. It seems that evidence optimization might be seeing something of a revival as a technique for estimating model hyperparameters. I just posted an update to my GLM with quadratic penalties package that adds support for this technique.

Specifically, we can assume that the prior on the weights of a GLM is given by $N(0,Q^{-1})$, where the precision matrix $Q$ is given by:

$Q = Q_0 + \sum_i \lambda_i Q_i$
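To make the construction concrete, here is a minimal NumPy sketch of assembling $Q$ from its parts. It is not taken from the package; the function name `build_precision`, the small ridge used for $Q_0$, and the two example penalties (a first-difference smoothness penalty and a plain ridge) are all assumptions chosen for illustration.

```python
import numpy as np

def build_precision(Q0, Qs, lambdas):
    """Return Q = Q0 + sum_i lambdas[i] * Qs[i] (hypothetical helper)."""
    Q = np.array(Q0, dtype=float, copy=True)
    for lam, Qi in zip(lambdas, Qs):
        Q += lam * Qi
    return Q

n = 4
Q0 = 1e-3 * np.eye(n)              # assumed: small fixed ridge keeps Q invertible
D = np.diff(np.eye(n), axis=0)     # (n-1) x n first-difference operator
Q_smooth = D.T @ D                 # penalizes differences between neighbouring weights
Q_ridge = np.eye(n)                # penalizes overall weight magnitude

Q = build_precision(Q0, [Q_smooth, Q_ridge], lambdas=[2.0, 0.5])
print(Q)
```

Each $Q_i$ only needs to be symmetric positive semi-definite; as long as the sum is positive definite, $N(0,Q^{-1})$ is a proper prior.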

The hyperparameters $\lambda_i$ are then selected by maximizing the evidence of the model with a trust-region Newton method, where the evidence is computed from a Laplace approximation of the posterior (Bishop, Chapters 3-5). One application of this is in models where the parameters are organized into two or more dimensions (say, x and y or space and time). Then, it’s natural to add one penalty for the smoothness of the parameters along the first dimension and a second penalty for the other dimension. What you get is a form of Automatic Smoothness Determination (ASD). Here’s an example of applying this in a logistic regression model where the filter is a 2D Gabor:
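For readers who want to see the moving parts, the following is a rough NumPy sketch of the idea, not the package's code. Under the Laplace approximation, the log evidence is approximately $\log p(y|w^*) - \frac{1}{2} w^{*T} Q w^* + \frac{1}{2}\log|Q| - \frac{1}{2}\log|H|$, where $w^*$ is the MAP estimate and $H$ is the Hessian of the negative log-posterior at $w^*$. Several things here are assumptions made for illustration: the toy Gabor-like filter and simulated data, first-difference smoothness penalties, a damped Newton solver for $w^*$, and a coarse grid search over $(\lambda_y, \lambda_x)$ standing in for the trust-region Newton optimization of the evidence.

```python
import numpy as np

def diff_penalty(n):
    """Quadratic penalty D'D on first differences of n consecutive weights."""
    D = np.diff(np.eye(n), axis=0)
    return D.T @ D

def penalties_2d(ny, nx):
    """Smoothness penalties along y and x for a (ny, nx) filter flattened row-major."""
    Qy = np.kron(diff_penalty(ny), np.eye(nx))
    Qx = np.kron(np.eye(ny), diff_penalty(nx))
    return Qy, Qx

def sigmoid(eta):
    return 0.5 * (1.0 + np.tanh(0.5 * eta))  # numerically stable logistic function

def log_posterior(w, X, y, Q):
    """Bernoulli log-likelihood plus the part of the log-prior that depends on w."""
    eta = X @ w
    return np.sum(y * eta - np.logaddexp(0.0, eta)) - 0.5 * w @ Q @ w

def hessian(w, X, Q):
    """Hessian of the negative log-posterior."""
    p = sigmoid(X @ w)
    return X.T @ (X * (p * (1.0 - p))[:, None]) + Q

def map_estimate(X, y, Q, n_iter=100, tol=1e-10):
    """Damped Newton ascent on the log-posterior (assumed solver, for illustration)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (y - sigmoid(X @ w)) - Q @ w
        step = np.linalg.solve(hessian(w, X, Q), grad)
        t, f0 = 1.0, log_posterior(w, X, y, Q)
        while t > 1e-8 and log_posterior(w + t * step, X, y, Q) < f0:
            t *= 0.5  # halve the step until the objective no longer decreases
        w = w + t * step
        if np.linalg.norm(t * step) < tol:
            break
    return w

def laplace_log_evidence(X, y, Q):
    """Laplace approximation to log p(y | lambda); Q must be positive definite."""
    w = map_estimate(X, y, Q)
    logdet_Q = np.linalg.slogdet(Q)[1]
    logdet_H = np.linalg.slogdet(hessian(w, X, Q))[1]
    return log_posterior(w, X, y, Q) + 0.5 * logdet_Q - 0.5 * logdet_H, w

# Toy data: a small Gabor-like filter driving a logistic regression model.
rng = np.random.default_rng(0)
ny, nx, n_obs = 6, 6, 500
yy, xx = np.mgrid[-1:1:ny * 1j, -1:1:nx * 1j]
w_true = 3.0 * np.exp(-(xx**2 + yy**2) / 0.3) * np.cos(4.0 * xx)
X = rng.normal(size=(n_obs, ny * nx))
y = (rng.random(n_obs) < sigmoid(X @ w_true.ravel())).astype(float)

Q0 = 1e-2 * np.eye(ny * nx)   # assumed small ridge so that Q is invertible
Qy, Qx = penalties_2d(ny, nx)
grid = [0.1, 1.0, 10.0]       # coarse grid search instead of trust-region Newton
scores = {(ly, lx): laplace_log_evidence(X, y, Q0 + ly * Qy + lx * Qx)[0]
          for ly in grid for lx in grid}
best = max(scores, key=scores.get)
print("best (lambda_y, lambda_x):", best)
```

A grid search scales poorly with the number of hyperparameters, which is presumably why a gradient-based optimizer of the evidence is the better choice once there are more than a handful of $\lambda_i$.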

The precision is sufficiently flexible to accommodate interesting model structure. For example, in the poster that Theo and I presented at SFN, I used it to track functional connectivity as a function of time. Functional connectivity can be estimated by using the firing patterns of other neurons as inputs to a GLM whose output is the target neuron. Each functional connection is defined by 3 parameters, corresponding to Laguerre basis functions. Only a handful of possible connections will turn out to be significant; therefore, it’s natural to impose a penalty on the magnitude of each group of 3 parameters. If, in addition, we let these parameters change over time, we can also add one smoothness penalty per group of 3 parameters. In total, we had 28 cells, hence 27 inputs for a given cell; thus, $Q_1$ through $Q_{27}$ were used to impose group sparseness, and $Q_{28}$ through $Q_{54}$ were used for temporal smoothness.
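One way such a set of 54 penalty matrices could be laid out is sketched below. This is a guess at the structure rather than the code behind the poster: the weight layout `w[t, i, b]` (time bin, input cell, basis function), the number of time bins, the first-difference form of the smoothness penalty, and the helper name `connectivity_penalties` are all assumptions.

```python
import numpy as np

def connectivity_penalties(n_inputs=27, n_basis=3, n_time=4):
    """Build per-input group and temporal-smoothness penalties (hypothetical layout).

    Weights are assumed stored as w[t, i, b] and flattened in C order, so the
    Kronecker factors below are ordered (time, input, basis).
    """
    D = np.diff(np.eye(n_time), axis=0)   # first differences across time bins
    DtD = D.T @ D
    group, smooth = [], []
    for i in range(n_inputs):
        sel = np.zeros(n_inputs)
        sel[i] = 1.0
        S = np.kron(np.diag(sel), np.eye(n_basis))  # selects the 3 weights of input i
        group.append(np.kron(np.eye(n_time), S))    # magnitude of input i at all times
        smooth.append(np.kron(DtD, S))              # changes of input i across time
    return group, smooth

group, smooth = connectivity_penalties()
Qs = group + smooth   # Q_1..Q_27 (group sparseness), Q_28..Q_54 (temporal smoothness)
print(len(Qs), Qs[0].shape)
```

With this layout, a large $\lambda_i$ on a group penalty shrinks all of that input's weights toward zero, while a large $\lambda_{27+i}$ forces that input's weights to vary slowly over time.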