# Adam Optimizer in Deep Learning

Hey folks, let’s look at another interesting topic: the Adam optimizer in deep learning. Before we dive in, here’s a quick intro.

### Optimizer Algorithms in Deep Learning

Why do we need optimizer algorithms? Can’t we just train any model on any dataset with plain gradient descent, update the weights at each step and get the output? It turns out that this doesn’t always work well. A deep network has a very large number of parameters, and plain gradient descent can need many iterations and careful learning-rate tuning to reach good weights. An optimizer’s role is to cut down that work: reach good weights in fewer steps, reduce the computational time and, in some cases, settle on weights that generalize better.

Now obviously, a business person wouldn’t want to wait 5 or 6 days for their data to be analyzed and the predictions to come back. One such optimizer is Adam.

The Adam optimizer (short for adaptive moment estimation) is an extension of stochastic gradient descent (SGD). It is used to update the weights of a network iteratively during training. It was proposed by Diederik Kingma and Jimmy Ba, and it is widely used for deep neural networks such as CNNs and RNNs. Adam doesn’t always outperform stochastic gradient descent, though it does well in some cases, for example on the MNIST dataset.

Adam, as we know, combines ideas from SGD with momentum and from RMSprop to train a neural network. Let’s imagine a neural network with input, hidden and output layers. When the inputs are passed to a hidden layer through some weights, the Adam optimizer is what updates those weights during training.
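To make the two ingredients concrete, here is a minimal sketch of one update step for each of them on a single scalar weight. The two ingredients usually named when Adam is described are SGD with momentum and RMSprop. The function names `momentum_step` and `rmsprop_step`, and the default values of `lr` and `beta`, are my own choices for illustration and are not taken from any library.

```python
import math


def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    """One step of SGD with momentum (exponential-average form).

    velocity is a running average of past gradients, so the update
    keeps moving in the direction the gradients have agreed on.
    """
    velocity = beta * velocity + (1 - beta) * grad
    return w - lr * velocity, velocity


def rmsprop_step(w, grad, sq_avg, lr=0.01, beta=0.9, eps=1e-8):
    """One step of RMSprop.

    sq_avg is a running average of squared gradients; dividing by its
    square root gives each weight its own effective step size.
    """
    sq_avg = beta * sq_avg + (1 - beta) * grad * grad
    return w - lr * grad / (math.sqrt(sq_avg) + eps), sq_avg


# Hypothetical usage: one step of each from w = 1.0 with gradient 2.0.
w_m, vel = momentum_step(1.0, 2.0, velocity=0.0)
w_r, sq = rmsprop_step(1.0, 2.0, sq_avg=0.0)
print(w_m, vel)
print(w_r, sq)
```

Adam keeps both running averages at once: the first tells it which way to move, the second tells it how large a step to take.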

Now, let’s say neuron1 has a weight w1 and neuron2 has a weight w2, and the present layer has a weight w3. For each of these weights, the optimizer looks at the current gradient together with a running average of the previous gradients, and uses both to decide how far to move that weight.

This all involves some fairly complex mathematics. I will try to explain it so that even a beginner can follow.

1. Every neuron in an NN has inputs, weights, a bias and an activation (threshold) function. During training we obtain a gradient for each weight and bias. So, for each parameter, keep a running average of its previous gradients and of its previous squared gradients.
2. Then, estimate the gradient at the current step, update both running averages, and move the parameter along the averaged gradient, scaled down by the square root of the averaged squared gradient.
3. Repeat until the loss stops improving.
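The steps above can be written out as the update rule usually given for Adam. Here $g_t$ is the gradient at step $t$, $w_t$ is the weight, $m_t$ and $v_t$ are the running averages of the gradient and the squared gradient (both starting at 0), and $\alpha$, $\beta_1$, $\beta_2$ and $\epsilon$ are the parameters described in the list further down. The symbol names are my notation for this post.

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t} \\
\hat{v}_t &= \frac{v_t}{1 - \beta_2^t} \\
w_t &= w_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
$$

The two "hat" lines are the bias correction: because $m_t$ and $v_t$ start at zero, they are too small in the first few steps, and dividing by $1 - \beta^t$ compensates for that.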

Because every weight effectively gets its own step size, each layer has an opportunity to learn faster and the network can converge quickly. As I said, this doesn’t happen all the time. Next, let us look at the parameters.

• alpha: the learning rate or step size, i.e. the proportion by which the weights are updated. Larger values give faster initial learning before the rate is adapted; smaller values slow learning right down during training.
• beta1: the exponential decay rate for the first-moment estimates (the running average of the gradient).
• beta2: the exponential decay rate for the second-moment estimates (the running average of the squared gradient). This value should be set close to 1.0 on problems with sparse gradients (for example, natural language processing and computer vision problems).
• epsilon: a very small number that prevents division by zero in the update.
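To see where each of these four parameters enters the update, here is a minimal sketch of Adam in plain Python for a single scalar weight. The names `adam_minimize` and `grad_fn` and the toy loss `(w - 3)^2` are hypothetical and chosen only for illustration. Real libraries apply the same idea to whole tensors of weights.

```python
import math


def adam_minimize(grad_fn, w0, steps=1000,
                  alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Minimize a scalar function with Adam, given its gradient grad_fn."""
    w = w0
    m = 0.0  # first moment: running average of the gradient
    v = 0.0  # second moment: running average of the squared gradient
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction: m and v start at 0, so early values are too small.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # epsilon keeps the division safe when v_hat is (close to) zero.
        w -= alpha * m_hat / (math.sqrt(v_hat) + epsilon)
    return w


def toy_grad(w):
    """Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


# alpha=0.1 is picked for this toy problem only; it is not a recommended default.
w_one_step = adam_minimize(toy_grad, w0=0.0, steps=1, alpha=0.1)
w_final = adam_minimize(toy_grad, w0=0.0, steps=500, alpha=0.1)
print(w_one_step, w_final)
```

One thing worth noticing: after bias correction, the very first step has a size of roughly alpha no matter how large the gradient is, because the gradient is divided by its own magnitude. That is why alpha behaves like a step size rather than a plain multiplier on the gradient.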

The default parameters in popular deep learning libraries (these can change between versions, so check the documentation for the release you use):

• TensorFlow: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08
• Keras: lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0
• Blocks: learning_rate=0.002, beta1=0.9, beta2=0.999, epsilon=1e-08, decay_factor=1
• Lasagne: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08
• Caffe: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08
• MxNet: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8
• Torch: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8

Thanks a lot!