The secret ingredient in stochastic gradient descent

I had dinner with Geoffrey Hinton and Yoshua Bengio a few weeks back, and I left full of ideas – and wine, too. Now I’m fitting a massive model for early and intermediate visual areas, which involves major spiffiness and about 100 hours of data (!).

Stochastic gradient descent (SGD) is an optimization algorithm that scales to giant datasets. Here’s a slide deck on the subject by Léon Bottou. The key slide is #37, on how to choose the learning rate:

[Image: the magic slide – choosing the learning rate]

Turns out the learning rate does not change with the sample size, so at any epoch it’s possible to choose a learning rate based on a small subsample of the data. And just like that, you’ve eliminated a hyperparameter from the optimization. Very nice!
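A minimal sketch of the idea, with hypothetical helper names and synthetic data standing in for the real thing: run plain SGD on a least-squares problem, pick the learning rate by trying a few candidates on a small subsample, then reuse the winner on the full dataset.

```python
import numpy as np

def sgd(w, X, y, lr, epochs=1, seed=0):
    """Plain SGD on squared loss for a linear model."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            grad = (X[i] @ w - y[i]) * X[i]  # gradient of 0.5 * (x·w - y)^2
            w = w - lr * grad
    return w

def loss(w, X, y):
    return np.mean((X @ w - y) ** 2)

# Synthetic regression problem (stand-in for the real data).
rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Calibrate the learning rate on a small subsample...
sub = rng.choice(n, size=500, replace=False)
candidates = [0.1, 0.03, 0.01, 0.003]
best_lr = min(candidates,
              key=lambda lr: loss(sgd(np.zeros(d), X[sub], y[sub], lr),
                                  X[sub], y[sub]))

# ...then run on the full dataset with the same learning rate.
w = sgd(np.zeros(d), X, y, best_lr, epochs=2)
print(best_lr, loss(w, X, y))
```

The point is that the inner calibration loop touches only 500 examples, so it costs almost nothing compared with the full fit, yet the learning rate it finds carries over to the whole dataset.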

Also of note: this old-school paper by Yann LeCun on neural net training tips and tricks.

3 responses to “The secret ingredient in stochastic gradient descent”

  1. You can use an adaptive learning rate, as in: Cai, Xun, Kanishka Tyagi, and Michael T. Manry, “Training multilayer perceptron by using optimal input normalization,” 2011 IEEE International Conference on Fuzzy Systems (FUZZ), IEEE, 2011.
    It works way better than the heuristic.
