The secret ingredient in stochastic gradient descent

I had dinner with Geoffrey Hinton and Yoshua Bengio a few weeks back, and I left full of ideas (and wine, too). Now I'm fitting a massive model of early and intermediate visual areas, which involves major spiffiness and about 100 hours of data (!).

Stochastic gradient descent is one optimization algorithm that can scale to giant datasets. Here's a slideshow on the subject by Leon Bottou. The key slide is #37, on how to choose the learning rate:

Magic slide
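
For context, here's the flavor of update the slides are concerned with (my notation, a generic sketch rather than a transcription of the deck): at each step $t$ a single example $(x_t, y_t)$ is drawn and the weights are updated with a decaying gain,

$$
w_{t+1} = w_t - \gamma_t \, \nabla_w \ell(x_t, y_t; w_t),
\qquad
\gamma_t = \frac{\gamma_0}{1 + \gamma_0 \lambda t},
$$

where the initial gain $\gamma_0$ is the hyperparameter in question and $\lambda$ is a decay constant (this schedule is one Bottou has recommended in his SGD writing, but take the exact form as an assumption on my part).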

Turns out the learning rate does not change with the sample size, so at any epoch you can choose the learning rate based on a small subsample of the data. And just like that, you've eliminated a hyperparameter from the optimization. Very nice!
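
To make that concrete, here's a rough sketch in Python of how one might exploit this (illustrative only; the model, sizes, and candidate rates are made up, and this is not Bottou's code): try a few learning rates on a small subsample, keep the best one, then run SGD on the full dataset with it.

```python
import numpy as np

rng = np.random.default_rng(0)


def sgd(X, y, lr, epochs=5):
    """Plain SGD on squared loss; returns the final weight vector."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            grad = (X[i] @ w - y[i]) * X[i]  # gradient of 0.5 * (x.w - y)^2
            w -= lr * grad
    return w


def mean_loss(X, y, w):
    return 0.5 * np.mean((X @ w - y) ** 2)


# A fake "giant" dataset: 100,000 samples, 20 features.
X = rng.normal(size=(100_000, 20))
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=100_000)

# Step 1: pick the learning rate on a 1% subsample.
sub = rng.choice(len(y), size=1_000, replace=False)
candidates = [1e-4, 1e-3, 1e-2, 3e-2]
losses = [mean_loss(X[sub], y[sub], sgd(X[sub], y[sub], lr)) for lr in candidates]
best_lr = candidates[int(np.argmin(losses))]

# Step 2: reuse that rate on the full dataset -- per the argument above,
# a good rate does not depend on the sample size.
w = sgd(X, y, best_lr, epochs=2)
print(f"chosen lr = {best_lr}, full-data loss = {mean_loss(X, y, w):.4f}")
```

The point is that step 1 only touches a small fraction of the data, so the search for the learning rate costs almost nothing compared to the full fit.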

Also of note: this old-school paper by Yann LeCun on neural net training tips and tricks.


3 thoughts on "The secret ingredient in stochastic gradient descent"

  1. You can use an adaptive learning rate, as in Cai, Xun, Kanishka Tyagi, and Michael T. Manry, "Training multilayer perceptron by using optimal input normalization," Fuzzy Systems (FUZZ), 2011 IEEE International Conference on, IEEE, 2011.
    It works way better than the heuristic.
