Reinforcement Learning with Code 【Chapter 10. Actor Critic】

Reinforcement Learning with Code 【Chapter 10. Actor Critic】

This note records how the author begin to learn RL. Both theoretical understanding and code practice are presented. Many material are referenced such as ZhaoShiyu’s Mathematical Foundation of Reinforcement Learning.
This code refers to Mofan’s reinforcement learning course.

文章目录

  • Reinforcement Learning with Code 【Chapter 10. Actor Critic】
      • 10.1 The simplest actor-critic algorithm (QAC)
      • 10.2 Advantage Actor-Critic (A2C)
      • 10.3 Off-policy Actor-Critic
    • Reference

10.1 The simplest actor-critic algorithm (QAC)

​ Recall the idea of policy gradient method is to search for an optimal policy by maximizing a scalar metric J ( θ ) J(\theta) J(θ). The metric has three options, average state value E [ v π ( S ) ] \mathbb{E}[v_\pi(S)] E[vπ(S)], average one step reward E [ r π ( S ) ] \mathbb{E}[r_\pi(S)] E[rπ(S)] or average state value from a specific state s 0 s_0 s0

​ According to policy gradient theorem in chapter 9, we are informed that

θ t + 1 = θ t + α ∇ θ J ( θ t ) = θ t + α E S ∼ η , A ∼ π [ ∇ θ ln ⁡ π ( A ∣ S , θ t ) q π ( S , A ) ] \begin{aligned} \theta_{t+1} & = \theta_t + \alpha \nabla_\theta J(\theta_t) \\ & = \theta_t + \alpha \mathbb{E}_{S\sim\eta, A\sim\pi} [\nabla_\theta \ln \pi(A|S,\theta_t)q_\pi(S,A)] \end{aligned} θt+1=θt+αθJ(θt)=θt+αESη,Aπ[θlnπ(AS,θt)qπ(S,A)]

where η \eta η is a distribution of the states. Since the true gradient is unknwon, we can use a stochastic gradient ot approximated it, hence we have

θ t + 1 = θ t + α ∇ θ ln ⁡ π ( a t ∣ s t , θ t ) q t ( s t , a t ) \theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln\pi(a_t|s_t,\theta_t)q_t(s_t,a_t) θt+1=θt+αθlnπ(atst,θt)qt(st,at)

  • In policy gradient method such as REINFORCE, we use the idea of Monte-Carlo to approximate the true value q t ( s t , a t ) q_t(s_t,a_t) qt(st,at). Where q t ( s t , a t ) q_t(s_t,a_t) qt(st,at) is approximated by an episode return that is q t ( s t , a t ) = ∑ k = t + 1 T γ k − t − 1 r q_t(s_t,a_t)=\sum_{k=t+1}^T \gamma^{k-t-1}r qt(st,at)=k=t+1Tγkt1r.
  • If q t ( s t , a t ) q_t(s_t,a_t) qt(st,at) is estimated by value function approximationg, and the value function is updated using the idea of TD learning. The corresponding algorithms are usually called actor-critic. Therefore, actor-critic methods can be seen as one of the policy gradient method.

​ When we use the parameterized value function q ( s , a ; w ) q(s,a;w) q(s,a;w) to approximate the q t ( s t , a t ) q_t(s_t,a_t) qt(st,at), and the value function is updated by the idea of Sarsa of TD learning, the algorithm is called Q actor-critic (QAC). The core idea of QAC is that

QAC: { Actor : θ t + 1 = θ t + α θ ∇ θ ln ⁡ π ( a t ∣ s t ; θ ) q ( s t , a t ; w t ) Critic : w t + 1 = w t + α w [ r t + 1 + γ q ( s t + 1 , a t + 1 ; w t ) − q ( s t , a t ; w t ) ] ∇ w q ( s t , a t ; w t ) \text{QAC:} \left\{ \begin{aligned} \text{Actor}: \theta_{t+1} & = \theta_t + \alpha_\theta \nabla_\theta \ln\pi(a_t|s_t;\theta) q(s_t,a_t;w_t) \\ \text{Critic}: w_{t+1} & = w_t + \alpha_w [r_{t+1}+\gamma q(s_{t+1},a_{t+1};w_t) - q(s_t,a_t;w_t)]\nabla_w q(s_t,a_t;w_t) \end{aligned} \right. QAC:{Actor:θt+1Critic:wt+1=θt+αθθlnπ(atst;θ)q(st,at;wt)=wt+αw[rt+1+γq(st+1,at+1;wt)q(st,at;wt)]wq(st,at;wt)

We use value function approximation to approximate true q value q t ( s t , a t ) q_t(s_t,a_t) qt(st,at). Meanwhile we use the idea of Sarsa to update our value function.

​ We can write the objective function of the update rule

QAC: { Actor : max ⁡ θ J ( θ ) = E S ∼ η , A ∼ π [ ln ⁡ π ( A ∣ S ; θ t ) q ( S , A ; w t ) ] Critic : min ⁡ w J ( w ) = E S ∼ η , A ∼ π [ ( R + γ q ( S ′ , A ; w t ) − q ( S t , A ; w t ) ) 2 ] \text{QAC:} \left \{ \begin{aligned} \textcolor{red}{\text{Actor}: \max_\theta J(\theta)} & \textcolor{red}{= \mathbb{E}_{S\sim_\eta, A\sim\pi}[\ln\pi(A|S;\theta_t)q(S,A;w_t)]} \\ \textcolor{red}{\text{Critic}: \min_w J(w)} & \textcolor{red}{= \mathbb{E}_{S\sim_\eta, A\sim\pi}[(R + \gamma q(S^\prime,A;w_t) - q(S_t,A;w_t))^2]} \end{aligned} \right. QAC: Actor:θmaxJ(θ)Critic:wminJ(w)=ESη,Aπ[lnπ(AS;θt)q(S,A;wt)]=ESη,Aπ[(R+γq(S,A;wt)q(St,A;wt))2]

Pesudocode

Image

10.2 Advantage Actor-Critic (A2C)

​ The core idea of A2C is to introduce a baseline to reduce estimation variance. That is
E S ∼ η , A ∼ π [ ∇ θ ln ⁡ π ( A ∣ S ; θ t ) q π ( S , A ) ] = E S ∼ η , A ∼ π [ ∇ θ ln ⁡ π ( A ∣ S ; θ t ) ( q π ( S , A ) − b ( S ) ) ] \mathbb{E}_{S\sim\eta,A\sim\pi}[\nabla_\theta \ln \pi(A|S;\theta_t)q_\pi(S,A)] = \mathbb{E}_{S\sim\eta,A\sim\pi}[\nabla_\theta \ln \pi(A|S;\theta_t)(q_\pi(S,A)-b(S))] ESη,Aπ[θlnπ(AS;θt)qπ(S,A)]=ESη,Aπ[θlnπ(AS;θt)(qπ(S,A)b(S))]
where the additional baseline b ( S ) b(S) b(S) is a scalar function of S S S. Add a baseline doesn’t affect the expectation of the above equation that is
E S ∼ η , A ∼ π [ ∇ θ ln ⁡ π ( A ∣ S ; θ t ) b ( S ) ] = 0 = ∑ s ∈ S η ( s ) ∑ a ∈ A π ( a ∣ s ; θ t ) ∇ θ ln ⁡ π ( a ∣ s ; θ t ) b ( s ) = ∑ s ∈ S η ( s ) ∑ a ∈ A ∇ θ π ( a ∣ s ; θ t ) b ( s ) = ∑ s ∈ S η ( s ) b ( s ) ∑ a ∈ A ∇ θ π ( a ∣ s ; θ t ) = ∑ s ∈ S η ( s ) b ( s ) ∇ θ π ∑ a ∈ A ( a ∣ s ; θ t ) = ∑ s ∈ S η ( s ) b ( s ) ∇ θ 1 = 0 \begin{aligned} \mathbb{E}_{S\sim\eta,A\sim\pi}[\nabla_\theta \ln \pi(A|S;\theta_t)b(S)] & = 0 \\ & = \sum_{s\in\mathcal{S}}\eta(s)\sum_{a\in\mathcal{A}}\pi(a|s;\theta_t) \nabla_\theta \ln \pi(a|s;\theta_t)b(s) \\ & = \sum_{s\in\mathcal{S}}\eta(s)\sum_{a\in\mathcal{A}}\nabla_\theta\pi(a|s;\theta_t)b(s) \\ & = \sum_{s\in\mathcal{S}}\eta(s)b(s)\sum_{a\in\mathcal{A}}\nabla_\theta\pi(a|s;\theta_t) \\ & = \sum_{s\in\mathcal{S}}\eta(s)b(s)\nabla_\theta\pi\sum_{a\in\mathcal{A}}(a|s;\theta_t) \\ & = \sum_{s\in\mathcal{S}}\eta(s)b(s)\nabla_\theta1 = 0 \end{aligned} ESη,Aπ[θlnπ(AS;θt)b(S)]=0=sSη(s)aAπ(as;θt)θlnπ(as;θt)b(s)=sSη(s)aAθπ(as;θt)b(s)=sSη(s)b(s)aAθπ(as;θt)=sSη(s)b(s)θπaA(as;θt)=sSη(s)b(s)θ1=0

How to find the optimal baseline? The derivation is omitted. The optimal baseline is

b ∗ ( s ) = E A ∼ π [ ∣ ∣ ∇ θ ln ⁡ π ( A ∣ s ; θ t ) ∣ ∣ 2 q π ( s , A ) ] E A ∼ π [ ∣ ∣ ∇ θ ln ⁡ π ( A ∣ s ; θ t ) ∣ ∣ 2 ] b^*(s) = \frac{\mathbb{E}_{A\sim\pi}[||\nabla_\theta \ln\pi(A|s;\theta_t)||^2 q_\pi(s,A)]}{\mathbb{E}_{A\sim\pi}[||\nabla_\theta \ln\pi(A|s;\theta_t)||^2]} b(s)=EAπ[∣∣θlnπ(As;θt)2]EAπ[∣∣θlnπ(As;θt)2qπ(s,A)]

But its too complex to use in practice. If the weight ∣ ∣ ∇ θ ln ⁡ π ( A ∣ s ; θ t ) ∣ ∣ 2 ||\nabla_\theta \ln\pi(A|s;\theta_t)||^2 ∣∣θlnπ(As;θt)2 is removed, we can obtain a suboptimal baseline that has a concise expression:

b † ( s ) = E [ q π ( s , A ) ] = v π ( s ) \textcolor{red}{b^\dagger (s) = \mathbb{E}[q_\pi(s,A)] = v_\pi(s)} b(s)=E[qπ(s,A)]=vπ(s)

The suboptimal baseline is the state value of state s s s.

When b ( s ) = v π ( s ) b(s)=v_\pi(s) b(s)=vπ(s), the gradient-ascent becomes
θ t + 1 = θ t + α E [ ∇ θ ln ⁡ π ( A ∣ S ; θ t ) [ q π ( S , A ) − v π ( S ) ] ] = θ t + α E [ ∇ θ ln ⁡ π ( A ∣ S ; θ t ) δ π ( S , A ) ] \begin{aligned} \theta_{t+1} & = \theta_t + \alpha\mathbb{E}[\nabla_\theta \ln\pi(A|S;\theta_t)[q_\pi(S,A)-v_\pi(S)]] \\ & = \theta_t + \alpha\mathbb{E}[\nabla_\theta\ln\pi(A|S;\theta_t) \delta_\pi(S,A)] \end{aligned} θt+1=θt+αE[θlnπ(AS;θt)[qπ(S,A)vπ(S)]]=θt+αE[θlnπ(AS;θt)δπ(S,A)]
Here,
δ π ( S , A ) = q π ( S , A ) − v π ( S ) \textcolor{red}{\delta_\pi(S,A) = q_\pi(S,A) - v_\pi(S)} δπ(S,A)=qπ(S,A)vπ(S)
is called advantage function, which reflects the advantage of one action over the others. More specifically, note that v π ( s ) = ∑ a ∈ A π ( a ∣ s ) q π ( s , a ) v_\pi(s)=\sum_{a\in\mathcal{A}}\pi(a|s)q_\pi(s,a) vπ(s)=aAπ(as)qπ(s,a) is the mean of the action value. If δ π ( s , a ) > 0 \delta_\pi(s,a)>0 δπ(s,a)>0, it means that the corresponding action has a greater value than the mean value.

The stochastic version is
θ t + 1 = θ t + α ∇ θ ln ⁡ π ( a t ∣ s t ; θ t ) [ q t ( s t , a t ) − v t ( s t ) ] = θ t + α ∇ θ ln ⁡ π ( a t ∣ s t ; θ t ) δ t ( s t , a t ) \begin{aligned} \theta_{t+1} & = \theta_t + \alpha\nabla_\theta \ln\pi(a_t|s_t;\theta_t)[q_t(s_t,a_t)-v_t(s_t)] \\ & = \theta_t + \alpha\nabla_\theta\ln\pi(a_t|s_t;\theta_t) \delta_t(s_t,a_t) \end{aligned} θt+1=θt+αθlnπ(atst;θt)[qt(st,at)vt(st)]=θt+αθlnπ(atst;θt)δt(st,at)
we need to estimate the true q-value q t ( s t , a t ) q_t(s_t,a_t) qt(st,at). There are many ways:

  • If q t ( s t , a t ) q_t(s_t,a_t) qt(st,at) and v t ( s t ) v_t(s_t) vt(st) are estimated by Monte-Carlo learning, the algorithm is called REINFORCE with a baseline.
  • If q t ( s t , a t ) q_t(s_t,a_t) qt(st,at) and v t ( s t ) v_t(s_t) vt(st) are estimated by TD learning, the algorithm is usually called advantage actor-critic (A2C).

q t ( s t , a t ) − v t ( s t ) = r t + 1 + γ q t ( s t + 1 , a t + 1 ) − v t ( s t ) ≈ r t + 1 + γ v t ( s t + 1 ) − v t ( s t ) \begin{aligned} q_t(s_t,a_t) - v_t(s_t) & = r_{t+1} +\gamma q_t(s_{t+1},a_{t+1}) - v_t(s_t) \\ & \textcolor{red}{\approx r_{t+1} +\gamma v_t(s_{t+1}) - v_t(s_t)} \end{aligned} qt(st,at)vt(st)=rt+1+γqt(st+1,at+1)vt(st)rt+1+γvt(st+1)vt(st)

Hence, we don’t need to maintain two networks to represent v π ( s ) v_\pi(s) vπ(s) and q π ( s , a ) q_\pi(s,a) qπ(s,a). We just need one network to represent v π ( s ) v_\pi(s) vπ(s).

In A2C we use one policy network π ( a ∣ s ; θ ) \pi(a|s;\theta) π(as;θ) and one state value network v ( s ; w ) v(s;w) v(s;w). The core idea of A2C is that

A2C : { Advantage : δ t = r t + 1 + γ v ( s t + 1 ; w t ) − v ( s t ; w t ) Actor : θ t + 1 = θ t + α θ δ t ∇ θ ln ⁡ π ( a t ∣ s t ; θ ) Critic : w t + 1 = w t + α w δ t ∇ w v ( s t , ; w t ) \text{A2C}: \left \{ \begin{aligned} \text{Advantage}: \delta_t & = r_{t+1} + \gamma v(s_{t+1};w_t) - v(s_t;w_t) \\ \text{Actor}: \theta_{t+1} & = \theta_t + \alpha_\theta \textcolor{blue}{\delta_t}\nabla_\theta \ln\pi(a_t|s_t;\theta) \\ \text{Critic}: w_{t+1} & = w_t + \alpha_w \textcolor{blue}{\delta_t} \nabla_w v(s_t,;w_t) \end{aligned} \right. A2C: Advantage:δtActor:θt+1Critic:wt+1=rt+1+γv(st+1;wt)v(st;wt)=θt+αθδtθlnπ(atst;θ)=wt+αwδtwv(st,;wt)

We can write the objective function of the update rule

A2C : { Advantage: Δ ( S ) = R + γ v ( S ′ ; w t ) − v ( S ; w t ) Actor : max ⁡ θ J ( θ ) = E S ∼ η , A ∼ π [ ln ⁡ π ( A ∣ S ; θ ) Δ ( S ) ] Critic : min ⁡ w J ( w ) = E S ∼ η [ ( R + γ v ( S ′ ; w ) − v ( S ; w ) ) 2 ] = E S ∼ η [ Δ ( S ) ] \text{A2C}: \left\{ \begin{aligned} \text{Advantage:} \Delta(S) & = R+\gamma v(S^\prime;w_t) - v(S;w_t) \\ \textcolor{red}{\text{Actor}: \max_\theta J(\theta)} & \textcolor{red}{= \mathbb{E}_{S\sim_\eta, A\sim\pi}[\ln\pi(A|S;\theta)\Delta(S)]} \\ \textcolor{red}{\text{Critic}: \min_w J(w)} & \textcolor{red}{= \mathbb{E}_{S\sim_\eta}[(R + \gamma v(S^\prime;w) - v(S;w))^2]} = \mathbb{E}_{S\sim_\eta}[\Delta(S)] \end{aligned} \right. A2C: Advantage:Δ(S)Actor:θmaxJ(θ)Critic:wminJ(w)=R+γv(S;wt)v(S;wt)=ESη,Aπ[lnπ(AS;θ)Δ(S)]=ESη[(R+γv(S;w)v(S;w))2]=ESη[Δ(S)]

Pesudocode

Image

10.3 Off-policy Actor-Critic

Importance Sampling

​ The key technique to convert the AC algorithm to off-policy is importance sampling. Consider a random variable X ∈ X X\in\mathcal{X} XX. Suppose that p 0 ( X ) p_0(X) p0(X) is a probability distribution. Our goal is to estimate E X ∼ p 0 [ X ] \mathbb{E}_{X\sim p_0}[X] EXp0[X]. We also known the p 1 ( X ) p_1(X) p1(X) is a probability distribution of X X X. How can we use the probability p 1 ( X ) p_1(X) p1(X) to sample data to estimate E X ∼ p 0 [ X ] \mathbb{E}_{X\sim p_0}[X] EXp0[X]. The technique is importance sampling. Suppose we have some i.i.d. samples { x i } i = 1 n \{x_i\}^n_{i=1} {xi}i=1n generated by distribution p 1 ( X ) p_1(X) p1(X).
E X ∼ p 0 [ X ] = ∑ x ∈ X p 0 ( x ) x = ∑ x ∈ X p 1 ( x ) p 0 ( x ) p 1 ( x ) x ⏟ f ( x ) = E X ∼ p 1 [ f ( X ) ] E X ∼ p 0 [ X ] = E X ∼ p 1 [ f ( X ) ] ≈ f ˉ = 1 n ∑ i = 1 n f ( x i ) = 1 n ∑ i = 1 n p 0 ( x i ) p 1 ( x i ) ⏟ importance weight x i \mathbb{E}_{X\sim p_0}[X] = \sum_{x\in\mathcal{X}}p_0(x)x = \sum_{x\in\mathcal{X}}p_1(x)\underbrace{\frac{p_0(x)}{p_1(x)}x}_{f(x)} = \mathbb{E}_{X\sim p_1}[f(X)] \\ \mathbb{E}_{X\sim p_0}[X] = \mathbb{E}_{X\sim p_1}[f(X)] \approx \bar{f} = \frac{1}{n} \sum^n_{i=1}f(x_i) = \frac{1}{n} \sum^n_{i=1} \underbrace{\frac{p_0(x_i)}{p_1(x_i)}}_{\text{importance weight}}x_i EXp0[X]=xXp0(x)x=xXp1(x)f(x) p1(x)p0(x)x=EXp1[f(X)]EXp0[X]=EXp1[f(X)]fˉ=n1i=1nf(xi)=n1i=1nimportance weight p1(xi)p0(xi)xi
An Example

​ Consider X ∈ X = + 1 , − 1 X\in\mathcal{X}={+1,-1} XX=+1,1 Suppose the p 0 p_0 p0 is a probability distribution satisfying
p 0 ( X = + 1 ) = 0.5 , p 0 ( X = − 1 ) = 0.5 p_0(X=+1)=0.5, p_0(X=-1)=0.5 p0(X=+1)=0.5,p0(X=1)=0.5
The expectaton of X X X over p 0 p_0 p0 is
E X ∼ p 0 [ X ] = ( + 1 ) × 0.5 + ( − 1 ) × 0.5 = 0 \mathbb{E}_{X\sim p_0}[X] = (+1)\times 0.5 + (-1) \times 0.5 = 0 EXp0[X]=(+1)×0.5+(1)×0.5=0
Suppose p 1 p_1 p1 is a probability distribution satisfying
p 0 ( X = + 1 ) = 0.8 , p 0 ( X = − 1 ) = 0.2 p_0(X=+1)=0.8, p_0(X=-1)=0.2 p0(X=+1)=0.8,p0(X=1)=0.2
The expectaton of X X X over p 0 p_0 p0 is
E X ∼ p 1 [ X ] = ( + 1 ) × 0.8 + ( − 1 ) × 0.2 = 0.6 \mathbb{E}_{X\sim p_1}[X] = (+1)\times 0.8 + (-1) \times 0.2 = 0.6 EXp1[X]=(+1)×0.8+(1)×0.2=0.6
We can use the importance samping techique to sample data under distribution p 1 p_1 p1 to compute E X ∼ p 0 [ X ] \mathbb{E}_{X\sim p_0}[X] EXp0[X]
E X ∼ p 0 [ X ] = 1 n ∑ i = 1 n p 0 ( x i ) p 1 ( x i ) x i \mathbb{E}_{X\sim p_0}[X] = \frac{1}{n}\sum_{i=1}^n \frac{p_0(x_i)}{p_1(x_i)}x_i EXp0[X]=n1i=1np1(xi)p0(xi)xi

import numpy as np
import matplotlib.pyplot as plt
# reproducible
np.random.seed(0)# 定义元素和对应的概率
elements = [1, -1]
probs1 = [0.5, 0.5]
probs2 = [0.8, 0.2]# 重要性采样 importance sample
sample_times = 300
sample_list = []
i_sample_list = []
average_list = []
importance_list = []
for i in range(sample_times):sample = np.random.choice(elements, p=probs2)sample_list.append(sample)average_list.append(np.mean(sample_list))if sample == elements[0]:i_sample_list.append(probs1[0] / probs2[0] * sample)elif sample == elements[1]:i_sample_list.append(probs1[1] / probs2[1] * sample)importance_list.append(np.mean(i_sample_list))plt.plot(range(len(sample_list)), sample_list, 'o', markerfacecolor='none', label='sample data')
plt.plot(range(len(average_list)), average_list, 'b--', label='average')
plt.plot(range(len(importance_list)), importance_list, 'g--', label='importance sampling')
plt.axhline(y=0.6, color='r', linestyle='--')
plt.axhline(y=0, color='r', linestyle='--')
plt.ylim(-1.5, 2.5) # 限制y轴显示范围
plt.xlim(0,sample_times) # 限制x轴显示范围
plt.legend(loc='upper right')
plt.show()
Image

Off-policy policy gradient theorem

​ With importance sampling, we are ready to present the off-policy gradient theorem. Suppose that the β \beta β is a behavior policy. Our goal is to use the samples generated by behavoir policy β \beta β to learn a target policy π \pi π that can maximize the following metric
max ⁡ θ J ( θ ) = E S ∼ d β [ v π ( S ) ] \max_\theta J(\theta) = \mathbb{E}_{S\sim d_\beta}[v_\pi(S)] θmaxJ(θ)=ESdβ[vπ(S)]
Theorem 10.1 (Stochastic off-policy policy gradient theorem). In the discounted case where γ ∈ ( 0 , 1 ) \gamma\in(0,1) γ(0,1), the gradient of J ( θ ) J(\theta) J(θ) is
∇ θ J ( θ ) = E S ∼ ρ , A ∼ β [ π ( A ∣ S ; θ ) β ( A ∣ S ) ⏟ importance weight ∇ θ ln ⁡ π ( A ∣ S ; θ ) q π ( S , A ) ] \textcolor{red}{\nabla_\theta J(\theta) = \mathbb{E}_{S\sim\rho, A\sim\beta}\Big[\underbrace{\frac{\pi(A|S;\theta)}{\beta(A|S)}}_{\text{importance weight}} \nabla_\theta \ln \pi(A|S;\theta) q_\pi(S,A) \Big]} θJ(θ)=ESρ,Aβ[importance weight β(AS)π(AS;θ)θlnπ(AS;θ)qπ(S,A)]
where the state distribution ρ \rho ρ is
ρ ( s ) ≜ ∑ s ′ ∈ S d β ( s ′ ) Pr ⁡ π ( s ∣ s ′ ) \rho(s) \triangleq \sum_{s^\prime\in\mathcal{S}} d_\beta(s^\prime) \Pr_\pi(s|s^\prime) ρ(s)sSdβ(s)πPr(ss)
where Pr ⁡ π ( s ∣ s ′ ) = ∑ k = 0 ∞ γ k [ P π k ] s ′ , s = [ ( I − γ P π ) − 1 ] s ′ , s \Pr_\pi(s|s^\prime)=\sum_{k=0}^\infty \gamma^k[P^k_\pi]_{s^\prime,s}=[(I-\gamma P_\pi)^{-1}]_{s^\prime,s} Prπ(ss)=k=0γk[Pπk]s,s=[(IγPπ)1]s,s is the discounted total probability of transitioning from s ′ s^\prime s to s s s under policy π \pi π.

​ The off-policy policy gradient is invariant to any additional baseline b ( s ) b(s) b(s). In particular, we have
∇ θ J ( θ ) = E S ∼ ρ , A ∼ β [ π ( A ∣ S ; θ ) β ( A ∣ S ) ∇ θ ln ⁡ π ( A ∣ S ; θ ) ( q π ( S , A ) − b ( S ) ) ] \nabla_\theta J(\theta) = \mathbb{E}_{S\sim\rho, A\sim\beta}\Big[\frac{\pi(A|S;\theta)}{\beta(A|S)} \nabla_\theta \ln \pi(A|S;\theta) \big( q_\pi(S,A) - b(S) \big) \Big] θJ(θ)=ESρ,Aβ[β(AS)π(AS;θ)θlnπ(AS;θ)(qπ(S,A)b(S))]
when we take the state value as the baseline v π ( S ) = b ( S ) v_\pi(S)=b(S) vπ(S)=b(S), there comes the advantage function.
δ π ( S , A ) = q π ( S , A ) − v π ( S ) \delta_\pi(S,A) = q_\pi(S,A) - v_\pi(S) δπ(S,A)=qπ(S,A)vπ(S)
The corresponding stochastic gradient-ascent algorithm is
θ t + 1 = θ t + α θ π ( a t ∣ s t ; θ ) β ( a t ∣ s t ) ∇ θ ln ⁡ π ( a t ∣ s t ; θ t ) ( q t ( s t , a t ) − v t ( s t ) ) \theta_{t+1} = \theta_t + \alpha_\theta \frac{\pi(a_t|s_t;\theta)}{\beta(a_t|s_t)} \nabla_\theta \ln\pi(a_t|s_t;\theta_t)(q_t(s_t,a_t)-v_t(s_t)) θt+1=θt+αθβ(atst)π(atst;θ)θlnπ(atst;θt)(qt(st,at)vt(st))
The advantage function can be replaced by the TD error. That is
q t ( s t , a t ) − v t ( s t ) ≈ r t + 1 + γ v t ( s t + 1 ) − v t ( s t ) ≜ δ t ( s t , a t ) q_t(s_t,a_t)-v_t(s_t) \approx r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t) \triangleq \delta_t(s_t,a_t) qt(st,at)vt(st)rt+1+γvt(st+1)vt(st)δt(st,at)
In off-policy A2C we use the behavior policy β \beta β to obtain samples, to learn a policy network π ( a ∣ s ; θ ) \pi(a|s;\theta) π(as;θ), and one value network v ( s ; w ) v(s;w) v(s;w). The core idea of off-policy A2C is

off-policy A2C : { Behavior policy: S ∼ β Advantage : δ t = r t + 1 + γ v ( s t + 1 ; w t ) − v ( s t ; w t ) Actor : θ t + 1 = θ t + α θ π ( a t ∣ s t ; θ ) β ( a t ∣ s t ) δ t ∇ θ ln ⁡ π ( a t ∣ s t ; θ ) Critic : w t + 1 = w t + α w π ( a t ∣ s t ; θ ) β ( a t ∣ s t ) δ t ∇ w v ( s t , ; w t ) \text{off-policy A2C}: \left \{ \begin{aligned} \text{Behavior policy:} S & \sim\beta \\ \text{Advantage}: \delta_t & = r_{t+1} + \gamma v(s_{t+1};w_t) - v(s_t;w_t) \\ \text{Actor}: \theta_{t+1} & = \theta_t + \alpha_\theta \frac{\pi(a_t|s_t;\theta)}{\beta(a_t|s_t)} \delta_t\nabla_\theta \ln\pi(a_t|s_t;\theta) \\ \text{Critic}: w_{t+1} & = w_t + \alpha_w \frac{\pi(a_t|s_t;\theta)}{\beta(a_t|s_t)}\delta_t \nabla_w v(s_t,;w_t) \end{aligned} \right. off-policy A2C: Behavior policy:SAdvantage:δtActor:θt+1Critic:wt+1β=rt+1+γv(st+1;wt)v(st;wt)=θt+αθβ(atst)π(atst;θ)δtθlnπ(atst;θ)=wt+αwβ(atst)π(atst;θ)δtwv(st,;wt)

We rewrite the objective function of the update rule

off-policy A2C : { Behavior policy: S ∼ β Advantage : Δ ( S ) = R + γ v ( S ′ ; w ) − v ( S ; w ) Actor : max ⁡ θ J ( θ ) = E S ∼ ρ , A ∼ β [ π ( A ∣ S ; θ ) β ( A ∣ S ) Δ ( S ) ln ⁡ π ( A ∣ S ; θ ) ] Critic : min ⁡ w J ( w ) = E S ∼ ρ [ ( R + γ v ( S ′ ; w t ) − v ( S t ; w t ) ) 2 ] = E S ∼ η [ Δ ( S ) ] \text{off-policy A2C}: \left \{ \begin{aligned} \text{Behavior policy:} S & \sim\beta \\ \text{Advantage}: \Delta(S) & = R + \gamma v(S^\prime;w) - v(S;w) \\ \text{Actor}: \max_\theta J(\theta) & = \mathbb{E}_{S\sim\rho,A\sim\beta}[\frac{\pi(A|S;\theta)}{\beta(A|S)}\Delta(S) \ln\pi(A|S;\theta)] \\ \text{Critic}: \min_w J(w) & = \mathbb{E}_{S\sim\rho}[(R + \gamma v(S^\prime;w_t) - v(S_t;w_t))^2] = \mathbb{E}_{S\sim_\eta}[\Delta(S)] \end{aligned} \right. off-policy A2C: Behavior policy:SAdvantage:Δ(S)Actor:θmaxJ(θ)Critic:wminJ(w)β=R+γv(S;w)v(S;w)=ESρ,Aβ[β(AS)π(AS;θ)Δ(S)lnπ(AS;θ)]=ESρ[(R+γv(S;wt)v(St;wt))2]=ESη[Δ(S)]

Pesudocode

Image

Reference

赵世钰老师的课程

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.hqwc.cn/news/63697.html

如若内容造成侵权/违法违规/事实不符,请联系编程知识网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

pc端网页用vue并且实现响应式 vue+bootstrap-vue

1、hbuiler内新建vue项目 在项目文件夹下用npm加载依赖(或者用hbuilder内打开命令) 2、配置路由 src内新建router文件夹,router内新建index.js index.js内配置重定向到首页 main.js内配置路由 import router from /router/index.js new…

java获取到heapdump文件后,如何快速分析?

简介 在之前的OOM问题复盘之后,本周,又一Java服务出现了内存问题,这次问题不严重,只会触发堆内存占用高报警,没有触发OOM,但好在之前的复盘中总结了dump脚本,会在堆占用高时自动执行jstack与jm…

9.3.2.1网络原理(UDP)

1.UDP的基本特点:无连接,不可靠传输,面向数据报,全双工. 2.1~1024的端口号有特定的含义,不建议使用.比如21:ftp,22:ssh,80:http,443:https. 3.CRC校验算法:循环冗余校验和,把UDP报中的每个字节都依次进行累加,把累加的结果,放到两个字节的变量中,溢出也无所谓,因为都加了一遍.…

基于python+MobileNetV2算法模型实现一个图像识别分类系统

一、目录 算法模型介绍模型使用训练模型评估项目扩展 二、算法模型介绍 图像识别是计算机视觉领域的重要研究方向,它在人脸识别、物体检测、图像分类等领域有着广泛的应用。随着移动设备的普及和计算资源的限制,设计高效的图像识别算法变得尤为重要。…

oracle的管道函数

Oracle管道函数(Pipelined Table Function)oracle管道函数 1、管道函数即是可以返回行集合(可以使嵌套表nested table 或数组 varray)的函数,我们可以像查询物理表一样查询它或者将其赋值给集合变量。 2、管道函数为并行执行,在…

Python 中的机器学习简介:多项式回归

一、说明 多项式回归可以识别自变量和因变量之间的非线性关系。本文是关于回归、梯度下降和 MSE 系列文章的第三篇。前面的文章介绍了简单线性回归、回归的正态方程和多元线性回归。 二、多项式回归 多项式回归用于最适合曲线拟合的复杂数据。它可以被视为多元线性回归的子集。…

【C语言】小游戏-三字棋

大家好,我是深鱼~ 目录 一、游戏介绍 二、文件分装 三、代码实现步骤 1.制作简易游戏菜单 2.初始化棋盘 3.打印棋盘 4.玩家下棋 5.电脑随机下棋 6.判断输赢 7.判断棋盘是否满了 四、完整代码 game.h(相关函数的声明,整个代码要引用的头文件以及宏…

CSS 中的优先级规则是怎样的?

聚沙成塔每天进步一点点 ⭐ 专栏简介⭐内联样式(Inline Styles)⭐ID 选择器(ID Selectors)⭐类选择器、属性选择器和伪类选择器(Class, Attribute, and Pseudo-class Selectors)⭐元素选择器和伪元素选择器…

如何让ES低成本、高性能?滴滴落地ZSTD压缩算法的实践分享

前文分别介绍了滴滴自研的ES强一致性多活是如何实现的、以及如何提升ES的性能潜力。由于滴滴ES日志场景每天写入量在5PB-10PB量级,写入压力和业务成本压力大,为了提升ES的写入性能,我们让ES支持ZSTD压缩算法,本篇文章详细展开滴滴…

hive 字段注释乱码

hive 字段注释乱码: 在mysql中运行: alter table COLUMNS_V2 modify column COMMENT varchar(256) character set utf8;OK

Signal Desktop for Mac(专业加密通讯软件)中文版安装教程

想让您的聊天信息更安全和隐藏吗? Mac版本的Signal Desktop是MACOS上的专业加密通信工具,非常安全。使用信号协议,该协议结合了固定前密钥,双重RATCHES算法和3-DH握手信号,该信号可以确保第三方实体将不会传达您的消息…

LeetCode150道面试经典题--同构字符串(简单)

1.题目 给定两个字符串 s 和 t ,判断它们是否是同构的。如果 s 中的字符可以按某种映射关系替换得到 t ,那么这两个字符串是同构的。每个出现的字符都应当映射到另一个字符,同时不改变字符的顺序。不同字符不能映射到同一个字符上&#xff0c…