An Optimization Strategy Based on Hybrid Algorithm of Adam and SGD

Abstract: Despite superior training outcomes, adaptive optimization methods such as Adam, Adagrad and RMSprop have been found to generalize poorly compared with stochastic gradient descent (SGD). Scholars (Nitish Shirish Keskar et al., 2017) therefore proposed a hybrid strategy that starts training with Adam and switches to SGD at the right time. Moreover, in learning tasks with a large output space, Adam has been observed to fail to converge to an optimal solution (or, in non-convex settings, to a stationary point) [1]. This paper therefore also presents a new variant of the Adam algorithm, AMSGRAD, which not only resolves the convergence problem but also improves empirical performance.


Introduction
Stochastic gradient descent (SGD) [2] has emerged as one of the most widely used training algorithms for deep neural networks. Despite its simplicity, SGD not only performs well empirically across a variety of applications but also has strong theoretical foundations. One disadvantage of SGD is that it scales the gradient uniformly in all directions; this can be particularly detrimental for ill-scaled problems, and it also makes tuning the learning rate α laborious. To correct for these shortcomings, several adaptive methods have been proposed which diagonally scale the gradient via estimates of the function's curvature. Examples of such methods include Adam [3], Adagrad [4] and RMSprop [5]. These methods can be interpreted as using a vector of learning rates, one for each parameter, that is adapted as the training algorithm progresses. Interestingly, however, in these and other instances, Adam outperforms SGD in both training and generalization metrics in the initial portion of training, but then its performance stagnates. To investigate this further, [6] propose SWATS, a simple strategy that combines the best of both worlds by Switching from Adam To SGD. This paper therefore presents an optimization strategy based on a hybrid of Adam and SGD, together with a modification that guarantees the convergence of Adam.

Introduction to basic SGD and Adam algorithms
Training neural networks is equivalent to solving the following non-convex optimization problem:
$$\min_{w \in \mathbb{R}^n} f(w), \qquad (1)$$
where $f$ is a loss function. The iterations of SGD can be described as
$$w_{k+1} = w_k - \alpha_k \hat{\nabla} f(w_k), \qquad (2)$$
where $w_k$ denotes the $k$-th iterate, $\alpha_k$ is a (tuned) step size sequence, also called the learning rate, and $\hat{\nabla} f(w_k)$ denotes the stochastic gradient computed at $w_k$. Mathematically, the Adam update equation can be represented as
$$m_k = \beta_1 m_{k-1} + (1-\beta_1)\hat{\nabla} f(w_k), \quad v_k = \beta_2 v_{k-1} + (1-\beta_2)\big(\hat{\nabla} f(w_k)\big)^2, \quad w_{k+1} = w_k - \alpha_k \frac{\sqrt{1-\beta_2^k}}{1-\beta_1^k}\,\frac{m_k}{\sqrt{v_k}+\epsilon}, \qquad (3)$$
where $\beta_1, \beta_2 \in [0,1)$ are two hyperparameters. The former controls the first-order momentum and the latter controls the second-order momentum.
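To make the updates above concrete, here is a minimal NumPy sketch of one SGD step and one Adam step following (2) and (3); the function names, the gradient argument, and the default hyperparameter values are illustrative assumptions rather than part of any reference implementation.

```python
import numpy as np

def sgd_step(w, grad, lr=0.1):
    """One SGD iteration: w_{k+1} = w_k - alpha_k * grad (eq. 2)."""
    return w - lr * grad

def adam_step(w, grad, m, v, k, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam iteration with bias-corrected first and second moments (eq. 3)."""
    m = beta1 * m + (1 - beta1) * grad        # first-order momentum
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-order momentum
    m_hat = m / (1 - beta1 ** k)              # bias corrections for the zero initialization
    v_hat = v / (1 - beta2 ** k)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```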

SWATS strategy
Given the insights of [7], which suggest that the lack of generalization performance of adaptive methods stems from the non-uniform scaling of the gradient, a natural hybrid strategy would begin the training process with Adam and switch to SGD when appropriate. To investigate this further, Nitish Shirish Keskar et al. propose SWATS, a simple strategy that combines the best of both worlds by Switching from Adam To SGD.

Learning rate for SGD after the switch
Adam's descent step at iteration $k$, i.e., the update taken in (3), is
$$p_k = -\alpha_k \frac{\sqrt{1-\beta_2^k}}{1-\beta_1^k}\,\frac{m_k}{\sqrt{v_k}+\epsilon},$$
while the step of SGD with learning rate $\gamma_k$ is $p_k^{\mathrm{SGD}} = -\gamma_k g_k$, where $g_k = \hat{\nabla} f(w_k)$. The SGD step can be decomposed into the sum of its component along $p_k$ and its component orthogonal to $p_k$. Its projection onto $p_k$ is the distance SGD advances along the descent direction chosen by Adam, while the orthogonal component is the distance SGD advances along its own correction direction. As illustrated in Figure 1, we require the orthogonal projection of the SGD step onto Adam's descent direction to be exactly equal to Adam's step (including its step size):
$$\mathrm{proj}_{p_k}\!\left(-\gamma_k g_k\right) = p_k.$$
Solving this gives the converted SGD learning rate
$$\gamma_k = \frac{p_k^{\top} p_k}{-\,p_k^{\top} g_k}.$$
Since $\gamma_k$ is a noisy estimate of the scaling needed, we maintain an exponential average initialized at 0, denoted by $\lambda_k$, such that
$$\lambda_k = \beta_2 \lambda_{k-1} + (1-\beta_2)\gamma_k.$$
We use $\beta_2$ of Adam, see (3), as the averaging coefficient, since this reuse avoids another hyperparameter and because the performance is relatively invariant to the fine-grained specification of this parameter.
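The projection-based learning-rate estimate and its exponential average can be sketched as follows, assuming `p` is the Adam step just taken and `g` the current stochastic gradient; this is an illustrative fragment, not the authors' reference implementation.

```python
import numpy as np

def swats_gamma(p, g):
    """Estimate the SGD learning rate: gamma_k = (p^T p) / (-p^T g)."""
    denom = -np.dot(p, g)
    if abs(denom) < 1e-12:      # projection is undefined if p is (nearly) orthogonal to g
        return None
    return np.dot(p, p) / denom

def update_lambda(lam, gamma, beta2=0.999):
    """Exponential average of the noisy estimates gamma_k, initialized at lam = 0."""
    return beta2 * lam + (1 - beta2) * gamma
```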

Switchover point
Having answered the question of what learning rate to choose for SGD after the switch, we now discuss when to switch to SGD. We propose checking a simple, yet powerful, criterion at every iteration with $k > 1$:
$$\left|\frac{\lambda_k}{1-\beta_2^k} - \gamma_k\right| < \epsilon. \qquad (4)$$
The condition compares the bias-corrected exponential averaged value $\lambda_k/(1-\beta_2^k)$ and the current value $\gamma_k$. The bias correction is necessary to prevent the influence of the zero initialization during the initial portion of training. Once this condition is true, we switch over to SGD with learning rate $\Lambda := \lambda_k/(1-\beta_2^k)$. Nitish Shirish Keskar et al. also experimented with more complex criteria, including ones that monitor gradient norms; however, this simple un-normalized criterion was found to work well across a variety of different applications. [8] discuss a fundamental flaw in current exponential moving average methods such as ADAM. They show that ADAM can fail to converge to an optimal solution even in simple one-dimensional convex settings. These examples of non-convergence contradict the claim of convergence in (Kingma & Ba, 2015), and the main issue lies in the following quantity of interest:
$$\Gamma_{t+1} = \frac{\sqrt{v_{t+1}}}{\alpha_{t+1}} - \frac{\sqrt{v_t}}{\alpha_t}. \qquad (5)$$
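Returning to the switching rule in (4), here is a minimal sketch of the test, with the bias correction applied to the exponential average as described above; `eps` is the switching tolerance and the return convention is an assumption made for illustration.

```python
def should_switch(lam, gamma, beta2, k, eps=1e-9):
    """Check |lambda_k/(1 - beta2^k) - gamma_k| < eps (eq. 4).

    Returns (switch, Lambda): if the criterion holds, Lambda is the
    bias-corrected estimate used as the SGD learning rate after the switch.
    """
    lam_corrected = lam / (1 - beta2 ** k)
    if abs(lam_corrected - gamma) < eps:
        return True, lam_corrected
    return False, None
```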

The non-convergence of Adam
The quantity $\Gamma_{t+1}$ in (5) essentially measures the change in the inverse of the learning rate of the adaptive method with respect to time. One key observation is that for SGD and ADAGRAD, $\Gamma_t \succeq 0$ for all $t \in [T]$; in particular, the update rules of these algorithms lead to "non-increasing" learning rates. However, this is not necessarily the case for exponential moving average variants like ADAM: $\Gamma_t$ can potentially be indefinite for $t \in [T]$. Sashank Reddi et al. show that this violation of positive semi-definiteness can lead to undesirable convergence behavior for ADAM. Consider the following simple sequence of linear functions on $\mathcal{F} = [-1, 1]$:
$$f_t(x) = \begin{cases} Cx, & \text{for } t \bmod 3 = 1,\\ -x, & \text{otherwise,} \end{cases}$$
where $C > 2$. For this function sequence, it is easy to see that the point $x = -1$ provides the minimum regret. Suppose $\beta_1 = 0$ and $\beta_2 = 1/(1+C^2)$. ADAM then converges to the highly suboptimal solution $x = +1$ for this setting. Intuitively, the reasoning is as follows.
The algorithm obtains the large gradient C once every 3 steps, while during the other 2 steps it observes the gradient −1, which moves the algorithm in the wrong direction.
The large gradient C is unable to counteract this effect since it is scaled down by a factor of almost C for the given value of $\beta_2$, and hence the algorithm converges to 1 rather than −1.
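This argument is easy to check numerically. The sketch below runs Adam with $\beta_1 = 0$ and $\beta_2 = 1/(1+C^2)$ on the periodic gradient sequence C, −1, −1 and projects the iterate back onto $\mathcal{F} = [-1, 1]$; the step-size schedule $\alpha_k = \alpha/\sqrt{k}$ and the constants are assumptions chosen for illustration, and the iterate drifts to +1 rather than −1.

```python
import numpy as np

C, alpha = 4.0, 0.5
beta1, beta2, eps = 0.0, 1.0 / (1 + C ** 2), 1e-8
x, m, v = 0.0, 0.0, 0.0

for k in range(1, 3001):
    g = C if k % 3 == 1 else -1.0                       # gradient of f_k in the sequence above
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    x -= (alpha / np.sqrt(k)) * m / (np.sqrt(v) + eps)  # Adam step
    x = min(max(x, -1.0), 1.0)                          # project back onto F = [-1, 1]

print(x)  # ends up at (or very near) +1, the point with the largest regret
```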

A new type of exponential moving average method: AMSGRAD
In this section, we develop a new principled exponential moving average variant and provide its convergence analysis. Our aim is to devise a new strategy with guaranteed convergence while preserving the practical benefits of ADAM. To understand the design of the algorithm, let us revisit the quantity $\Gamma_t$ in (5). For ADAM, this quantity can potentially be negative. The proof in the original ADAM paper erroneously assumes that $\Gamma_t$ is positive semi-definite and is hence incorrect. As a first step, we modify the algorithm to satisfy this additional constraint. Later on, we also explore an alternative approach in which $\Gamma_t$ can be made positive semi-definite by using values of $\beta_1$ and $\beta_2$ that change with $t$.
AMSGRAD uses a smaller learning rate in comparison to ADAM and yet incorporates the intuition of slowly decaying the effect of past gradients on the learning rate, as long as $\Gamma_t$ is positive semi-definite. Algorithm 1 presents the pseudocode for the algorithm.
The key difference between AMSGRAD and ADAM is that AMSGRAD maintains the maximum of all $v_t$ until the present time step and uses this maximum value, rather than $v_t$ itself as in ADAM, to normalize the running average of the gradient. By doing this, AMSGRAD results in a non-increasing step size and avoids the pitfall of ADAM: $\Gamma_t \succeq 0$ for all $t \in [T]$, even with a constant $\beta_2$. Also, in Algorithm 1, one typically uses a constant $\beta_1$ in practice (although the proof requires a decreasing schedule to establish convergence of the algorithm).
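A minimal sketch of the AMSGRAD update in Algorithm 1: it is identical to the Adam step except that a running maximum of the second-moment estimates normalizes the update, which keeps the effective learning rate non-increasing. Bias correction is omitted and the default hyperparameter values are illustrative.

```python
import numpy as np

def amsgrad_step(w, grad, m, v, v_hat, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGRAD iteration (v_hat is the running maximum, initialized at 0)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_hat = np.maximum(v_hat, v)               # key difference: keep the maximum of all v_t
    w = w - lr * m / (np.sqrt(v_hat) + eps)    # normalize with the maximum, not v_t itself
    return w, m, v, v_hat
```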
To gain more intuition for the updates of AMSGRAD, it is instructive to compare its update with those of ADAM and ADAGRAD. Suppose at a particular time step $t$ and coordinate $i \in [d]$ we have $v_{t-1,i} > g_{t,i}^2 > 0$; then ADAM aggressively increases the learning rate, which, as we have seen in the previous section, can be detrimental to the overall performance of the algorithm. ADAGRAD, on the other hand, slightly decreases the learning rate, which often leads to poor performance in practice, since such an accumulation of gradients over a long time period can decrease the learning rate significantly. In contrast, AMSGRAD neither increases nor decreases the learning rate and, furthermore, decreases $v_t$, which can potentially lead to a non-decreasing learning rate even if the gradient is large in future iterations.

To compare the algorithms empirically, consider the one-dimensional online setting inspired by the non-convergence examples above: $f_t(x) = 1010x$ for $t \bmod 101 = 1$ and $f_t(x) = -10x$ otherwise, with $\mathcal{F} = [-1, 1]$. We first observe that, similar to the examples of non-convergence we have considered, the optimal solution is $x = -1$; thus, for convergence, we expect the algorithms to converge to $x = -1$. For this sequence of functions, we investigate the regret and the value of the iterate $x_t$ for ADAM and AMSGRAD. To enable a fair comparison, we set $\beta_1 = 0.9$ and $\beta_2 = 0.99$ for both ADAM and AMSGRAD, which are the parameter settings typically used for ADAM in practice. Figure 2 shows the average regret ($R_t/t$) and the value of the iterate ($x_t$) for this problem. We first note that the average regret of ADAM does not converge to 0 with increasing $t$. Furthermore, its iterates $x_t$ converge to $x = 1$, which unfortunately has the largest regret among all points in the domain. On the other hand, the average regret of AMSGRAD converges to 0 and its iterate converges to the optimal solution. Figure 2 also shows the corresponding stochastic optimization setting, $f_t(x) = 1010x$ with probability 0.01 and $f_t(x) = -10x$ otherwise. Similar to the aforementioned online setting, the optimal solution for this problem is $x = -1$; again, we see that the iterate $x_t$ of ADAM converges to the highly suboptimal solution $x = 1$.
Finally, we consider the multiclass classification problem on the standard CIFAR-10 dataset, which consists of 60,000 labeled examples of 32 × 32 images. We use CIFARNET, a convolutional neural network (CNN) with several layers of convolution, pooling and non-linear units, to train a multiclass classifier for this problem. In particular, this architecture has 2 convolutional layers with 64 channels and a kernel size of 6 × 6, followed by 2 fully connected layers of size 384 and 192. The network uses 2 × 2 max pooling and local response normalization between the convolutional layers. A dropout layer with keep probability 0.5 is applied between the fully connected layers [9]. The minibatch size is set to 128, as in the previous experiments. The results for this problem are reported in Figure 3. The parameters for ADAM and AMSGRAD are selected in a manner similar to the previous experiments. We can see that AMSGRAD performs considerably better than ADAM on training loss and accuracy. Furthermore, this performance gain also translates into good performance on the test loss.
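As a rough illustration of the CIFARNET architecture described above, the following PyTorch sketch follows the stated layer sizes (two 6 × 6 convolutions with 64 channels, 2 × 2 max pooling and response normalization, fully connected layers of 384 and 192 units, dropout 0.5); the padding, activations, and normalization placement are assumptions, not details taken from the paper.

```python
import torch.nn as nn

class CifarNetSketch(nn.Module):
    """Rough CIFARNET-like CNN for 32x32 CIFAR-10 images."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=6), nn.ReLU(),
            nn.LocalResponseNorm(size=4), nn.MaxPool2d(2),   # 32x32 -> 27x27 -> 13x13
            nn.Conv2d(64, 64, kernel_size=6), nn.ReLU(),
            nn.LocalResponseNorm(size=4), nn.MaxPool2d(2),   # 13x13 -> 8x8 -> 4x4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, 384), nn.ReLU(),
            nn.Dropout(p=0.5),                               # dropout between the FC layers
            nn.Linear(384, 192), nn.ReLU(),
            nn.Linear(192, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```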

Conclusion
This article first explained the SWATS strategy proposed by Nitish Shirish Keskar, and then, in view of the non-convergence of Adam, presented the new optimization method AMSGRAD. This new method essentially endows the algorithm with a long-term memory of past gradients. These fixes retain the good practical performance of the original algorithms, and in some cases actually show improvements.