The secret ingredient in stochastic gradient descent

I had dinner with Geoffrey Hinton and Yoshua Bengio a few weeks back, and I left full of ideas – and wine, too. Now I’m fitting a massive model for early and intermediate visual areas, which involves major spiffiness and about 100 hours of data (!).

Stochastic gradient descent (SGD) is an optimization algorithm that scales to giant datasets. Here’s a slide deck on the subject by Léon Bottou. The key slide is #37, on how to choose the learning rate:

[Image: the magic slide – choosing the learning rate]

Turns out the learning rate does not change with the sample size, so at any epoch it’s possible to choose a learning rate based on a small subsample of the data. And just like that, you’ve eliminated a hyperparameter from the optimization. Very nice!
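A minimal sketch of the idea, with hypothetical helper names and synthetic data standing in for the real thing: run plain SGD on a least-squares problem, pick the learning rate by trying a few candidates on a small subsample, then reuse the winner on the full dataset.

```python
import numpy as np

def sgd(w, X, y, lr, epochs=1, seed=0):
    """Plain SGD on squared loss for a linear model."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            grad = (X[i] @ w - y[i]) * X[i]  # gradient of 0.5 * (x·w - y)^2
            w = w - lr * grad
    return w

def loss(w, X, y):
    return np.mean((X @ w - y) ** 2)

# Synthetic regression problem (stand-in for the real data).
rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Calibrate the learning rate on a small subsample...
sub = rng.choice(n, size=500, replace=False)
candidates = [0.1, 0.03, 0.01, 0.003]
best_lr = min(candidates,
              key=lambda lr: loss(sgd(np.zeros(d), X[sub], y[sub], lr),
                                  X[sub], y[sub]))

# ...then run on the full dataset with the same learning rate.
w = sgd(np.zeros(d), X, y, best_lr, epochs=2)
print(best_lr, loss(w, X, y))
```

The point is that the inner calibration loop touches only 500 examples, so it costs almost nothing compared with the full fit, yet the learning rate it finds carries over to the whole dataset.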

Also of note: this old-school paper by Yann LeCun on neural net training tips and tricks.

3 responses to “The secret ingredient in stochastic gradient descent”

  1. You can use an adaptive learning rate, as in: Cai, Xun, Kanishka Tyagi, and Michael T. Manry, “Training multilayer perceptron by using optimal input normalization,” 2011 IEEE International Conference on Fuzzy Systems (FUZZ), IEEE, 2011.
    It works way better than the heuristic.
