Deep Learning Notes on Optimization Algorithms (6): A Brief Introduction to the RMSProp Algorithm

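- Introduction
- Review: the AdaGrad algorithm
  - How AdaGrad differs from momentum methods
  - The drawback of AdaGrad
- The RMSProp algorithm
  - How RMSProp addresses AdaGrad's problem
  - The RMSProp procedure
- RMSProp example code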

Introduction

The previous section gave a brief introduction to the $\text{AdaGrad}$ algorithm; this section introduces the $\text{RMSProp}$ method.

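Review: the AdaGrad algorithm

How AdaGrad differs from momentum methods

Momentum and $\text{Nesterov}$ momentum adjust the gradient direction during iteration, whereas $\text{AdaGrad}$ adjusts the gradient magnitude (the learning rate); the two approaches are fundamentally different. Their update rules compare as follows:

When momentum computes the update vector $m_t$ for the current step, it takes a weighted sum (vector addition) of $m_{t-1}$ and $\nabla_{\theta;t-1} \mathcal J(\theta_{t-1})$ to adjust the direction of $m_t$; once the direction is fixed, the step taken along $m_t$ uses only the fixed learning rate $\eta$.
$\text{AdaGrad}$, by contrast, applies no momentum-style direction correction to the current gradient $\mathcal G_t$; for the step size it replaces $\eta$ with $\frac{\eta}{\sqrt{\mathcal R_t} + \epsilon}$ in the update, which is essentially a rescaling of the gradient (applied per coordinate, hence the $\odot$ in the formulas).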
$\begin{aligned} & \text{Momentum : } \begin{cases} m_t = \beta \cdot m_{t-1} + (1 - \beta) \cdot \nabla_{\theta;t-1} \mathcal J(\theta_{t-1}) \\ \theta_t = \theta_{t-1} - \eta \cdot m_t \end{cases} \\ & \text{AdaGrad : } \quad \begin{cases} \mathcal G_t = \nabla_{\theta;t-1} \mathcal J(\theta_{t-1}) \\ \mathcal R_t = \mathcal R_{t-1} + \mathcal G_t \odot \mathcal G_t \\ \begin{aligned} \theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\mathcal R_t} + \epsilon} \odot \mathcal G_t \end{aligned} \end{cases} \end{aligned}$
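
To make the contrast concrete, here is a minimal NumPy sketch of a single update under each rule as written above. The function names `momentum_step` and `adagrad_step`, the hyperparameter values, and the starting point are illustrative assumptions rather than part of the original post; the gradient is that of the quadratic $0.5x^2 + 20y^2$ used in the examples of this series.

```python
import numpy as np

def momentum_step(theta, m, grad, eta=0.1, beta=0.9):
    """One momentum update: the direction m is a weighted sum of past gradients."""
    m = beta * m + (1.0 - beta) * grad
    theta = theta - eta * m              # the step size eta stays fixed
    return theta, m

def adagrad_step(theta, r, grad, eta=0.1, eps=1e-7):
    """One AdaGrad update: the raw gradient is used, only its scale is adapted."""
    r = r + grad * grad                  # accumulate element-wise squared gradients
    theta = theta - eta / (np.sqrt(r) + eps) * grad
    return theta, r

theta0 = np.array([8.0, 1.0])            # assumed starting point
grad0 = np.array([8.0, 40.0])            # gradient of 0.5*x^2 + 20*y^2 at (8, 1)

theta_m, m = momentum_step(theta0, np.zeros(2), grad0)
theta_a, r = adagrad_step(theta0, np.zeros(2), grad0)
print("momentum:", theta_m)
print("adagrad: ", theta_a)
```

With these assumed values, the momentum step moves each coordinate in proportion to its gradient, while the first AdaGrad step moves every coordinate by roughly $\eta$ because each gradient component is divided by its own accumulated magnitude.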

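The drawback of AdaGrad

Recall the iteration from the previous section, where $\text{AdaGrad}$ was applied to the objective $x^T \mathcal Q x;x = (x_1,x_2)^T;\mathcal Q = \begin{pmatrix}0.5 & 0 \\ 0 & 20\end{pmatrix}$:
[Figure: AdaGrad iteration example]
The algorithm advances steadily along the gently sloped direction where the gradient is small, but in the middle and late stages it spends a large number of iterations. The reason is clear: by then the effective learning rate $\frac{\eta}{\sqrt{\mathcal R_t} + \epsilon}$ has become too small, and it only keeps shrinking.

The objective in the example above is strongly convex and has a global optimum, so the iterates can only approach that optimum. But what if the objective is a more complex function, like this one?
The drawing is rough, please bear with it.
[Figure: example of a complex non-convex function]
In the figure above, the yellow points mark the positions of the weights at successive iterations of $\text{AdaGrad}$. If the objective were a simple convex function, the iterates might eventually converge to some point, such as the red point. But if the function is more complex and the gradient grows again after this stretch of iterations (the leftmost yellow point in the figure), how fast will convergence be then?

As noted in the previous section, the learning rate of $\text{AdaGrad}$ only decreases and never increases. Even if later gradients become large again, the learning rate does not recover, and the updates only get slower.
This corresponds to the passage in Section 8.5.1 (P188) of the book Deep Learning (the "flower book"): the accumulation of squared gradients from the beginning of training can result in a premature and excessive decrease in the effective learning rate.

The learning rate of $\text{AdaGrad}$ only decreases because, while squared gradients are accumulated, every $\mathcal G_t \odot \mathcal G_t$ is kept in full in the accumulator $\mathcal R$. A similar point was made in the discussion of $\text{Nesterov}$ momentum: once the iteration count is large, the squared gradients from the earliest iterations are no longer informative for the current step.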
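
The monotone decay of the effective learning rate can be checked numerically. The sketch below is only an illustration: the one-dimensional gradient sequence is made up (large, then small, then large again) and the values of `eta` and `eps` are assumptions.

```python
import numpy as np

eta, eps = 0.2, 1e-7
# Hypothetical 1-D gradient magnitudes: large, then small, then large again.
grads = np.array([4.0, 2.0, 0.5, 0.1, 0.1, 3.0, 4.0])

r = 0.0
effective_rates = []
for g in grads:
    r += g * g                                    # AdaGrad never forgets a squared gradient
    effective_rates.append(eta / (np.sqrt(r) + eps))

effective_rates = np.array(effective_rates)
print(np.round(effective_rates, 4))
```

Even when the gradient grows again near the end of the sequence, the printed rates keep decreasing, because the accumulator can only grow.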

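The RMSProp algorithm

How RMSProp addresses AdaGrad's problem

To address this problem, we can borrow the idea behind momentum: use an exponentially weighted moving average to gradually discard squared gradients from the distant past. The modified update is given below.
The description in the video (linked below the article, at $\text{33:14}$) and the formula in the book Deep Learning differ slightly in where $\epsilon$ is placed; the variants compare as follows: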
$\begin{aligned} \text{AdaGrad : } & \begin{cases} \mathcal G_t = \nabla_{\theta;t-1} \mathcal J(\theta_{t-1}) \\ \mathcal R_t = \mathcal R_{t-1} + \mathcal G_t \odot \mathcal G_t \\ \begin{aligned} \theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\mathcal R_t} + \epsilon} \odot \mathcal G_t \end{aligned} \end{cases} \\ \text{Video(RMSProp) : } & \begin{cases} \mathcal G_t = \nabla_{\theta;t-1} \mathcal J(\theta_{t-1}) \\ \mathcal R_t = \beta \cdot \mathcal R_{t-1} + (1 - \beta) \cdot \mathcal G_t \odot \mathcal G_t \\ \begin{aligned} \theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\mathcal R_t} + \epsilon} \odot \mathcal G_t \end{aligned} \end{cases} \\ \text{DeepLearning(RMSProp) : } & \begin{cases} \mathcal G_t = \nabla_{\theta;t-1} \mathcal J(\theta_{t-1}) \\ \mathcal R_t = \beta \cdot \mathcal R_{t-1} + (1 - \beta) \cdot \mathcal G_t \odot \mathcal G_t \\ \begin{aligned} \theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\mathcal R_t + \epsilon}} \odot \mathcal G_t \end{aligned} \end{cases} \end{aligned}$
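The purpose of this change is that, at each iteration, only the most recent iterations have a meaningful influence on the current step: the weight of an older squared gradient decays geometrically with $\beta$.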
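
Unrolling the recursion with $\mathcal R_0 = 0$ shows where the forgetting comes from: the squared gradient from $k$ steps ago carries weight $(1 - \beta)\beta^k$. The following sketch checks this numerically. The random values standing in for $\mathcal G_t \odot \mathcal G_t$, the choice $\beta = 0.8$, and the variable names are assumptions made for illustration.

```python
import numpy as np

beta = 0.8
rng = np.random.default_rng(0)
sq_grads = rng.uniform(0.0, 1.0, size=30)    # hypothetical stand-in for G_t * G_t

# Recursive form used by RMSProp, starting from R_0 = 0.
r = 0.0
for s in sq_grads:
    r = beta * r + (1.0 - beta) * s

# Explicit form: the squared gradient from k steps ago has weight (1 - beta) * beta**k.
k = np.arange(len(sq_grads))
weights = (1.0 - beta) * beta ** k
r_explicit = np.sum(weights * sq_grads[::-1])   # reversed so that index 0 is the newest

# Share of the total weight carried by the 10 most recent squared gradients.
recent_share = weights[:10].sum() / weights.sum()
print(r, r_explicit, recent_share)
```

With $\beta = 0.8$, the ten most recent squared gradients carry about 89% of the total weight, so older history has little influence on the current step size.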

The RMSProp procedure

The steps of $\text{RMSProp}$ are as follows:
Initialization:

Learning rate $\eta$; decay factor $\beta$;
Initial parameters $\theta$; gradient accumulator $\mathcal R = 0$; hyperparameter $\epsilon = 10^{-7}$

Procedure:

$\text{While}$ the stopping criterion is not met $\text{do}$
Sample a minibatch of $k$ examples from the training set $\mathcal D$: $\{(x^{(i)},y^{(i)})\}_{i=1}^k$;
Compute the gradient $\mathcal G$ with respect to the current parameters $\theta$:
$\mathcal G \Leftarrow \frac{1}{k} \sum_{i=1}^k \nabla_{\theta} \mathcal L[f(x^{(i)};\theta),y^{(i)}]$
Accumulate the element-wise squared gradient $\mathcal G \odot \mathcal G$ into $\mathcal R$ using an exponentially weighted moving average:
$\mathcal R \Leftarrow \beta \cdot \mathcal R + (1 - \beta) \cdot \mathcal G \odot \mathcal G$
Compute the parameter update $\Delta \theta$:
For now, this follows the formulation in the book Deep Learning.
$\Delta \theta = - \frac{\eta}{\sqrt{\mathcal R + \epsilon}} \odot \mathcal G$
Apply the update:
$\theta \Leftarrow \theta + \Delta \theta$
$\text{End While}$
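
The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the names `rmsprop_step` and `grad_fn`, the flag `eps_inside_sqrt`, the hyperparameter values, and the iteration count are all hypothetical, and the minibatch gradient is replaced by the exact gradient of the quadratic $0.5x^2 + 20y^2$.

```python
import numpy as np

def rmsprop_step(theta, r, grad, eta=0.01, beta=0.9, eps=1e-7, eps_inside_sqrt=True):
    """One element-wise RMSProp update; returns the new (theta, r)."""
    r = beta * r + (1.0 - beta) * grad * grad     # moving average of squared gradients
    if eps_inside_sqrt:
        scale = eta / np.sqrt(r + eps)            # epsilon inside the square root
    else:
        scale = eta / (np.sqrt(r) + eps)          # epsilon outside the square root
    return theta - scale * grad, r

def grad_fn(theta):
    """Exact gradient of 0.5*x^2 + 20*y^2 (assumed here in place of a minibatch gradient)."""
    return np.array([1.0, 40.0]) * theta

theta = np.array([8.0, 1.0])                      # assumed starting point
r = np.zeros_like(theta)
for _ in range(2000):                             # fixed budget instead of a stopping criterion
    theta, r = rmsprop_step(theta, r, grad_fn(theta))

print(theta)
```

Because the accumulator is kept per coordinate, both coordinates move at a comparable pace despite the 40x difference in curvature. With a constant learning rate the iterates end up hovering in a small neighborhood of the minimum rather than settling exactly on it.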

RMSProp example code

The code below compares $\text{RMSProp}$ with $\text{AdaGrad}$:

import numpy as np
import math
import matplotlib.pyplot as plt
from tqdm import tqdm


def f(x, y):
    return 0.5 * (x ** 2) + 20 * (y ** 2)


def ConTourFunction(x, Contour):
    # Upper half of the contour f(x, y) = Contour, solved for y.
    return math.sqrt(0.05 * (Contour - (0.5 * (x ** 2))))


def Derfx(x):
    return x


def Derfy(y):
    return 40 * y


def DrawBackGround(ax, Idx):
    ContourList = [0.2, 1.0, 4.0, 8.0, 16.0, 32.0]
    LimitParameter = 0.0001
    for Contour in ContourList:
        # Keep x strictly inside the domain on which the contour is defined.
        x = np.linspace(-1 * math.sqrt(2 * Contour) + LimitParameter, math.sqrt(2 * Contour) - LimitParameter, 200)
        y1 = [ConTourFunction(i, Contour) for i in x]
        y2 = [-1 * j for j in y1]
        ax[Idx].plot(x, y1, '--', c="tab:blue")
        ax[Idx].plot(x, y2, '--', c="tab:blue")


def Process(mode):
    assert mode in ["AdaGrad", "RMSProp"]
    Start = (8.0, 1.0)
    LocList = list()
    LocList.append(Start)
    Eta = 0.2
    Beta = 0.8
    Epsilon = 0.0000001
    R = 0.0
    Delta = 0.1
    while True:
        DerStart = (Derfx(Start[0]), Derfy(Start[1]))
        # Note: R is a scalar here. It accumulates the squared gradient norm
        # rather than the element-wise square of the gradient.
        InnerProduct = (DerStart[0] ** 2) + (DerStart[1] ** 2)
        if mode == "AdaGrad":
            R += InnerProduct
        else:
            DecayR = R * Beta
            R = DecayR + ((1.0 - Beta) * InnerProduct)
        UpdateEta = -1 * (Eta / (Epsilon + math.sqrt(R)))
        UpdateMessage = (UpdateEta * DerStart[0], UpdateEta * DerStart[1])
        Next = (Start[0] + UpdateMessage[0], Start[1] + UpdateMessage[1])
        DerNext = (Derfx(Next[0]), Derfy(Next[1]))
        # Stop once the gradient norm falls below Delta, a small positive value.
        if math.sqrt((DerNext[0] ** 2) + (DerNext[1] ** 2)) < Delta:
            break
        else:
            LocList.append(Next)
            Start = Next
    return LocList


def DrawPicture():
    AdaGradLocList = Process(mode="AdaGrad")
    RMSPropLocList = Process(mode="RMSProp")
    fig, ax = plt.subplots(2, 1, figsize=(8, 6))

    # Top panel: AdaGrad trajectory, drawn point by point with connecting segments.
    AdaGradplotList = list()
    ax[0].set_title("AdaGrad")
    DrawBackGround(ax, Idx=0)
    for (x, y) in tqdm(AdaGradLocList):
        AdaGradplotList.append((x, y))
        ax[0].scatter(x, y, s=30, facecolor="none", edgecolors="tab:orange", marker='o')
        if len(AdaGradplotList) < 2:
            continue
        else:
            ax[0].plot([AdaGradplotList[0][0], AdaGradplotList[1][0]], [AdaGradplotList[0][1], AdaGradplotList[1][1]], c="tab:orange")
            AdaGradplotList.pop(0)

    # Bottom panel: RMSProp trajectory.
    RMSPropplotList = list()
    ax[1].set_title("RMSProp")
    DrawBackGround(ax, Idx=1)
    for (x, y) in tqdm(RMSPropLocList):
        RMSPropplotList.append((x, y))
        ax[1].scatter(x, y, s=30, facecolor="none", edgecolors="tab:red", marker='o')
        if len(RMSPropplotList) < 2:
            continue
        else:
            ax[1].plot([RMSPropplotList[0][0], RMSPropplotList[1][0]], [RMSPropplotList[0][1], RMSPropplotList[1][1]], c="tab:red")
            RMSPropplotList.pop(0)
    plt.show()


if __name__ == '__main__':
    DrawPicture()

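The resulting plot is shown below:
[Figure: AdaGrad vs. RMSProp]
Comparing the two panels, $\text{RMSProp}$ needs noticeably fewer iterations than $\text{AdaGrad}$.
Looking back at the $\text{RMSProp}$ update rule, we can see that although $\text{RMSProp}$ improves on $\text{AdaGrad}$, it still only adjusts the gradient magnitude (the learning rate). The next section extends $\text{RMSProp}$ so that the update is adjusted from both angles at once, gradient direction and gradient magnitude (learning rate),
that is, $\text{RMSProp}$ with $\text{Nesterov}$ momentum.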