I had dinner with Geoffrey Hinton and Yoshua Bengio a few weeks back, and I left full of ideas – and wine, too. Now I’m fitting a massive model of early and intermediate visual areas, which involves major spiffiness and about 100 hours of data (!).
Stochastic gradient descent is an optimization algorithm that scales to giant datasets. Here’s a slideshow on the subject by Léon Bottou. The key slide is #37: how to choose the learning rate:
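For concreteness, here’s a minimal sketch of SGD for least-squares linear regression – the toy data, learning rate, and epoch count are my own illustrative choices, not anything from the slides:

```python
import numpy as np

def sgd(X, y, lr=0.01, epochs=10, seed=0):
    """Plain SGD: update the weights one example at a time."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            # Gradient of 0.5 * (x.w - y)^2 for a single example.
            grad = (X[i] @ w - y[i]) * X[i]
            w -= lr * grad
    return w

# Toy problem: y = 2*x0 - 1*x1 plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = X @ np.array([2.0, -1.0]) + 0.01 * rng.normal(size=1000)
w = sgd(X, y)
```

The point of the per-example updates is that each step touches only one row of the data, so the cost of an update doesn’t grow with the dataset.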
It turns out the optimal learning rate does not change with the sample size, so at any epoch you can choose a learning rate based on a small subsample of the data. And just like that, you’ve eliminated a hyperparameter from the optimization. Very nice!
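Here’s a sketch of how that trick might look in practice – pick the candidate rate that does best after one pass over a small subsample, then reuse it on the full data. The candidate grid and subsample size are illustrative assumptions on my part:

```python
import numpy as np

def sgd_epoch(X, y, w, lr, rng):
    """One SGD pass over (X, y); returns the updated weights."""
    for i in rng.permutation(len(X)):
        w = w - lr * (X[i] @ w - y[i]) * X[i]
    return w

def pick_learning_rate(X, y, candidates, n_sub=100, seed=0):
    """Run one epoch on a random subsample for each candidate rate
    and keep the rate with the lowest training loss."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=n_sub, replace=False)
    Xs, ys = X[idx], y[idx]
    losses = []
    for lr in candidates:
        w = sgd_epoch(Xs, ys, np.zeros(X.shape[1]), lr, rng)
        losses.append(np.mean((Xs @ w - ys) ** 2))
    return candidates[int(np.argmin(losses))]

# Toy problem: noiseless linear data.
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 3))
y = X @ np.array([1.0, -2.0, 0.5])

# Tune on 100 points, then train on all 5000 with the winner.
best_lr = pick_learning_rate(X, y, candidates=[0.1, 0.03, 0.01, 0.003])
w = sgd_epoch(X, y, np.zeros(X.shape[1]), best_lr, np.random.default_rng(2))
```

The tuning pass costs one epoch over 100 points per candidate, which is negligible next to training on the full dataset.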
Also of note: this old-school paper by Yann LeCun on neural net training tips and tricks.