Adagrad – eliminating learning rates in stochastic gradient descent

Earlier, I discussed how I had no luck using second-order optimization methods on a convolutional neural net fitting problem, and some of the reasons why stochastic gradient descent works well on this class of problems. Stochastic gradient descent is not a plug-and-play optimization algorithm; it requires messing around with the step size hyperparameter, forcing you to tune it by hand for each problem.
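
For reference, the core idea behind Adagrad is to rescale each parameter's gradient by the square root of its accumulated squared gradients, so a single base step size works across parameters and problems. Here is a minimal NumPy sketch of one update step; the function name `adagrad_update`, the `hist` accumulator, and the default `lr` and `eps` values are my own illustration, not code from the post:

```python
import numpy as np

def adagrad_update(theta, grad, hist, lr=0.01, eps=1e-8):
    """One Adagrad step (illustrative sketch).

    theta : parameter vector
    grad  : gradient of the loss at theta
    hist  : running sum of squared gradients, same shape as theta
    lr    : base step size (Adagrad is much less sensitive to this than plain SGD)
    eps   : small constant to avoid division by zero
    """
    hist = hist + grad ** 2                        # accumulate squared gradients per parameter
    theta = theta - lr * grad / (np.sqrt(hist) + eps)  # effective step size shrinks per parameter over time
    return theta, hist

# Usage: carry hist across iterations, starting from zeros.
theta = np.zeros(5)
hist = np.zeros(5)
grad = np.random.randn(5)          # stand-in for a stochastic gradient
theta, hist = adagrad_update(theta, grad, hist)
```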