Gibbs sampling is great but convergence is slow when parameters are correlated. If the covariance structure is known, you can reparametrize to get better mixing. Alternatively you can keep the same parametrization but switch to Metropolis-Hastings with a Gaussian proposal distribution whose covariance is similar to the model parameters. But what if you don’t know the covariance structure or it’s heavily dependent on the data?
I ran into this very issue with a sampler for estimating the parameters of a fitted 2d surface. I thought about the problem for a while, and came up with this method for getting an efficient sampler:
1. Run MH with a Gaussian proposal distribution with identity covariance structure
2. Get the empirical covariance of the MH run
3. Run MH with Gaussian proposal with covariance distribution from (2); adapt the precision of the proposal to reach an acceptance ratio 0.2-0.3
4. Repeat 2-3 five times
5. Run MH for real with the final proposal distribution resulting from 1-4
The idea is pretty simple. Sampling will be efficient if the proposal distribution is similar to the sampling distribution. So we can start with a guessed covariance structure, sample for a bit, then adjust the covariance structure of the proposal distribution to match the previous samples, and so forth until the proposal distribution is similar to the sampling distribution. Once the proposal is adapted, you can run standard MCMC. You might call call this strategy static adaptive Metropolis-Hastings.
As it turns out, you can do quite a bit better than this. Rather than doing adaptation in a first phase and “real sampling” in a second phase, it’s possible to mix blocks of adaptation and sampling. In that case, the resulting sampler is non-Markovian (since the samples depend on several previous samples rather than just one) and non-reversible. Yet, if you do it right, it’s nevertheless possible to get samples from the correct distribution. The advantage is that if the sampling distribution has a weird shape, the proposal distribution will adapt to it locally. This dynamic adaptive Metropolis-Hastings algorithm is described in Haario et al. (2001). You can find a high-level description of it here.
It’s implemented in this Matlab package (‘am’ option).
It’s possible to combine adaptive Metropolis and delayed rejection (DR). The basic idea of delayed rejection is to combine several MH transition kernels, starting from kernels with large variance down to kernels will small variance. Large variance moves have low chance of being accepted, but if they are a large move in parameter space results. Low-variance moves have high chance of being accepted, but since they are small the resulting chain will tend to show random walk behaviour. Thus, DR starts with high-variance proposals, falling back to lower variance proposals on rejection.
DR and AM go especially well together, because in the initial stages of AM the proposal is maladapted and very few samples are accepted. Yet, to start the adaptation process, you need samples. DR gives you those samples.
Combining DR and AM gives the dram algorithm, which is a potent plug-and-play alternative to Gibbs sampling. Don’t be fooled by the talk of sum-of-squares on the web page of the Matlab package: it works with any likelihood function.