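🥦 Python algorithm model for high-dimensional humanoid robot motions: jumping, hurdling, weightlifting, and basketball shooting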

Python | Robots | Humanoid | Locomotion | Static manipulation | Dynamic manipulation | Dynamics | Kinematics | Algorithms | Machine learning

Python algorithm model for high-dimensional humanoid robot motions: jumping, hurdling, weightlifting, and basketball shooting | 亚图跨际viadean on Notion

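🎯 Key points

📜 Collaborative robots: Python rigid-body mechanics decoupling model for collaborative motion robots

📜 Robot kinematics and dynamics use cases: Python | C++ | MATLAB forward and inverse kinematics and dynamics solvers and algorithms for robots

📜 Robot kinematics and dynamics use cases: Python | C# | MATLAB KUKA robot differential kinematics | Euler-Lagrange dynamics | hybrid dynamics control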

🍇 Python algorithm for continuous action spaces

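This algorithm uses four neural networks: a Q network, a deterministic policy network, a target Q network, and a target policy network. The Q network and the policy network closely resemble those of a simple advantage actor-critic (A2C), but here the actor maps a state directly to an action (the network's output is the action itself) instead of outputting a probability distribution over a discrete action space.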

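The target networks are time-delayed copies of the original networks that slowly track the learned networks. Using these target networks greatly improves the stability of learning. The reason is that, in methods without target networks, the update equation of a network depends on values computed by that same network, which makes it prone to divergence. For example, the Q-learning update builds its target from the current estimate of Q itself: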

Q(s, a) \leftarrow Q(s, a)+\alpha\left[R(s, a)+\gamma \max_{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)-Q(s, a)\right]
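To make this self-dependence concrete, here is a minimal tabular sketch of the update. The table `Q`, the function name `q_learning_update`, and all numeric values are illustrative assumptions. The target `r + gamma * max(Q[s_next])` is read from the same table that is being updated; when a neural network replaces the table, every update also shifts the target.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # The bootstrap target is computed from the same table being updated.
    target = r + gamma * max(Q[s_next])
    td_error = target - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error


# Hypothetical 2-state, 2-action table indexed as Q[state][action]
Q = [[1.0, 0.0], [0.5, 2.0]]
td = q_learning_update(Q, s=0, a=0, r=1.0, s_next=1)
print(Q[0][0], td)
```

With these numbers the TD error is 1 + 0.9 * 2 - 1 = 1.8, so the entry moves from 1.0 to roughly 1.18.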

The deterministic policy network and the Q network follow the standard actor-critic code structure:

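 import torch
 import torch.nn as nn
 import torch.nn.functional as F
 
 class Critic(nn.Module):
     """Q network: maps a (state, action) pair to a Q-value estimate."""
     def __init__(self, input_size, hidden_size, output_size):
         super().__init__()
         self.linear1 = nn.Linear(input_size, hidden_size)
         self.linear2 = nn.Linear(hidden_size, hidden_size)
         self.linear3 = nn.Linear(hidden_size, output_size)
 
     def forward(self, state, action):
         """
         Params state and action are torch tensors of shape (batch, dim)
         """
         # Concatenate along the feature dimension so the critic sees both inputs
         x = torch.cat([state, action], 1)
         x = F.relu(self.linear1(x))
         x = F.relu(self.linear2(x))
         # No activation on the last layer: Q values are unbounded
         x = self.linear3(x)
 
         return x
 
 class Actor(nn.Module):
     """Deterministic policy network: maps a state directly to an action."""
     def __init__(self, input_size, hidden_size, output_size, learning_rate = 3e-4):
         # learning_rate is accepted but not used inside this class
         super().__init__()
         self.linear1 = nn.Linear(input_size, hidden_size)
         self.linear2 = nn.Linear(hidden_size, hidden_size)
         self.linear3 = nn.Linear(hidden_size, output_size)
         
     def forward(self, state):
         """
         Param state is a torch tensor of shape (batch, dim)
         """
         x = F.relu(self.linear1(state))
         x = F.relu(self.linear2(x))
         # tanh bounds every action component to [-1, 1]
         x = torch.tanh(self.linear3(x))
 
         return x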
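Because the actor ends in `tanh`, its raw output lies in [-1, 1]. Many continuous-control environments use other action bounds, so the output usually has to be rescaled before it is sent to the environment. The following is a minimal sketch of one way to do that; `scale_action`, `low`, and `high` are hypothetical names introduced here, and the bounds are made-up values rather than those of any specific environment.

```python
import torch


def scale_action(raw_action, low, high):
    """Linearly map a tanh output in [-1, 1] to the interval [low, high]."""
    low = torch.as_tensor(low, dtype=raw_action.dtype)
    high = torch.as_tensor(high, dtype=raw_action.dtype)
    return low + 0.5 * (raw_action + 1.0) * (high - low)


# Stand-in for an actor output: tanh keeps every component in [-1, 1].
raw = torch.tanh(torch.tensor([[-10.0, 0.0, 10.0]]))
scaled = scale_action(raw, low=-2.0, high=2.0)
print(scaled)
```

A raw output of 0 lands at the midpoint of the interval, and saturated outputs land at the bounds.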

We initialize the networks and their target networks as follows:

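 # num_states, num_actions and hidden_size are assumed to be defined already
 # (for example from the environment's observation and action dimensions)
 actor = Actor(num_states, hidden_size, num_actions)
 actor_target = Actor(num_states, hidden_size, num_actions)
 # The critic outputs a single scalar Q value per (state, action) pair
 critic = Critic(num_states + num_actions, hidden_size, 1)
 critic_target = Critic(num_states + num_actions, hidden_size, 1)
 
 # Hard copy: the targets start out identical to the learned networks
 for target_param, param in zip(actor_target.parameters(), actor.parameters()):
     target_param.data.copy_(param.data)
 for target_param, param in zip(critic_target.parameters(), critic.parameters()):
     target_param.data.copy_(param.data)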
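The hard copy above only makes the targets identical to the learned networks at initialization. For the targets to track the learned networks slowly afterwards, a common choice is a soft (Polyak) update applied after each learning step. The sketch below assumes this scheme; the helper name `soft_update`, the rate `tau`, and the stand-in `nn.Linear` networks are illustrative assumptions.

```python
import torch
import torch.nn as nn


def soft_update(target_net, net, tau):
    """Move each target parameter a fraction tau toward the learned parameter."""
    with torch.no_grad():
        for target_param, param in zip(target_net.parameters(), net.parameters()):
            # target <- tau * param + (1 - tau) * target
            target_param.mul_(1.0 - tau).add_(tau * param)


# Stand-in networks; in the text these would be actor/actor_target
# or critic/critic_target.
net = nn.Linear(3, 2)
target_net = nn.Linear(3, 2)
target_net.load_state_dict(net.state_dict())  # hard copy at initialization

# Pretend a gradient step shifted every learned parameter by 1.
with torch.no_grad():
    for p in net.parameters():
        p.add_(1.0)

soft_update(target_net, net, tau=0.01)
```

With `tau=0.01` the target closes only 1% of the gap per call, which is what keeps the bootstrap targets nearly stationary between updates.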

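Like deep Q-learning (and many other RL algorithms), this algorithm uses a replay buffer from which experiences are sampled to update the network parameters. During each trajectory rollout, we save every experience tuple (state, action, reward, next state) in a finite-size cache, the "replay buffer". When we update the value and policy networks, we draw random minibatches of experience from that buffer.

The replay buffer looks like this: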

 import random
 from collections import deque
 
 import numpy as np
 
 class Memory:
     def __init__(self, max_size):
         # deque with maxlen drops the oldest experience once the buffer is full
         self.buffer = deque(maxlen=max_size)
     
     def push(self, state, action, reward, next_state, done):
         # Wrap the scalar reward so every sampled reward has shape (1,)
         experience = (state, action, np.array([reward]), next_state, done)
         self.buffer.append(experience)
 
     def sample(self, batch_size):
         state_batch = []
         action_batch = []
         reward_batch = []
         next_state_batch = []
         done_batch = []
 
         # Uniform sampling without replacement; needs len(self) >= batch_size
         batch = random.sample(self.buffer, batch_size)
 
         # Regroup the sampled tuples into one list per field
         for experience in batch:
             state, action, reward, next_state, done = experience
             state_batch.append(state)
             action_batch.append(action)
             reward_batch.append(reward)
             next_state_batch.append(next_state)
             done_batch.append(done)
         
         return state_batch, action_batch, reward_batch, next_state_batch, done_batch
 
     def __len__(self):
         return len(self.buffer)
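`sample` returns plain Python lists, while the networks expect batched tensors. A minimal sketch of the conversion follows; the helper `to_tensors` and the dummy transition shapes (3-dimensional states, 1-dimensional actions) are assumptions made for illustration.

```python
import numpy as np
import torch


def to_tensors(state_batch, action_batch, reward_batch, next_state_batch):
    """Stack lists of per-step arrays into float tensors of shape (batch, dim)."""
    states = torch.as_tensor(np.stack(state_batch), dtype=torch.float32)
    actions = torch.as_tensor(np.stack(action_batch), dtype=torch.float32)
    rewards = torch.as_tensor(np.stack(reward_batch), dtype=torch.float32)
    next_states = torch.as_tensor(np.stack(next_state_batch), dtype=torch.float32)
    return states, actions, rewards, next_states


# Dummy batch shaped like the lists returned by Memory.sample
batch_size = 4
state_batch = [np.zeros(3) for _ in range(batch_size)]
action_batch = [np.zeros(1) for _ in range(batch_size)]
reward_batch = [np.array([1.0]) for _ in range(batch_size)]  # push stores np.array([reward])
next_state_batch = [np.ones(3) for _ in range(batch_size)]

states, actions, rewards, next_states = to_tensors(
    state_batch, action_batch, reward_batch, next_state_batch
)
print(states.shape, actions.shape, rewards.shape, next_states.shape)
```

Keeping the rewards at shape (batch, 1) lines them up with a critic that outputs one value per sample, so the Bellman target can be formed without reshaping.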

The value network is updated much as in Q-learning. The updated Q value is obtained from the Bellman equation:

y_i=r_i+\gamma Q^{\prime}\left(s_{i+1}, \mu^{\prime}\left(s_{i+1} \mid \theta^{\mu^{\prime}}\right) \mid \theta^{Q^{\prime}}\right)

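In this algorithm, however, the next-state Q value is computed with the target value network and the target policy network. We then minimize the mean squared error between the updated Q value and the original Q value: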

\text { Loss }=\frac{1}{N} \sum_i\left(y_i-Q\left(s_i, a_i \mid \theta^Q\right)\right)^2

The code is as follows:

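 # Critic update: regress Q(s, a) toward the Bellman target y
 criterion = nn.MSELoss()
 Qvals = critic(states, actions)
 next_actions = actor_target(next_states)
 next_Q = critic_target(next_states, next_actions.detach())
 Qprime = rewards + gamma * next_Q
 
 # nn.MSELoss is a module: instantiate it, then call it on (prediction, target).
 # The target is detached so no gradient flows into the target networks.
 critic_loss = criterion(Qvals, Qprime.detach())
 critic_optimizer.zero_grad()
 critic_loss.backward()
 critic_optimizer.step()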

For the policy function, our goal is to maximize the expected return:

J(\theta)=\mathbb{E}\left[\left.Q(s, a)\right|_{s=s_t, a=\mu\left(s_t\right)}\right]

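To compute the policy loss, we take the derivative of the objective with respect to the policy parameters. Because the action fed to the Q function is itself the output of the differentiable actor (policy) function, we must apply the chain rule.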
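Written out, the chain rule gives the sampled policy gradient in its usual form for this family of algorithms:

\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \left.\nabla_a Q\left(s, a \mid \theta^Q\right)\right|_{s=s_i, a=\mu\left(s_i\right)} \left.\nabla_{\theta^\mu} \mu\left(s \mid \theta^\mu\right)\right|_{s=s_i}

In code, this is commonly realized by using the negated mean critic value of the actor's own actions as the policy loss and letting autograd carry out the chain rule. The block below is a minimal sketch under that assumption; the tiny stand-in networks, the dimensions, and the learning rate are illustrative choices, not values from the original text.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_states, num_actions, batch_size = 3, 1, 8  # illustrative sizes

# Stand-ins for the Actor and Critic classes defined earlier.
actor = nn.Sequential(nn.Linear(num_states, num_actions), nn.Tanh())
critic = nn.Linear(num_states + num_actions, 1)  # scores concatenated [state, action]
actor_optimizer = torch.optim.Adam(actor.parameters(), lr=1e-3)

states = torch.randn(batch_size, num_states)
actor_before = [p.detach().clone() for p in actor.parameters()]
critic_before = [p.detach().clone() for p in critic.parameters()]

# Maximizing Q(s, mu(s)) is the same as minimizing its negative mean.
policy_loss = -critic(torch.cat([states, actor(states)], dim=1)).mean()

actor_optimizer.zero_grad()
policy_loss.backward()   # chain rule: dQ/da * da/dtheta
actor_optimizer.step()   # only the actor's parameters are updated
```

The backward pass also fills the critic's `.grad` fields, which is harmless as long as the critic optimizer calls `zero_grad()` before its own update.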
