How to Train your Bee


Assignment 2 Help Guide
(Title slide image: © DreamWorks, "Bee Movie")

Recap: Bellman Equation

  • The Bellman equation is used to calculate the optimal value of a state
  • The equation looks complicated, but it is just the highest expected reward from the best action:

    V*(s) = max_a Σ_{s'} T(s' | s, a) [ R(s, a, s') + γ V*(s') ]

  • T(s' | s, a) is the probability of entering the next state s' given we perform action a in state s
  • R(s, a, s') is the reward received for performing action a in state s and entering the next state s'
  • γ is the discount factor (makes immediate rewards worth more than future rewards)
  • V*(s') is the value of the next state s'
  • The optimal policy, π, gives the best available action that can be performed in each state
  • The value of a state is given by the highest expected reward when following the optimal policy
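To make the backup concrete, here is a minimal sketch of one Bellman backup for a made-up two-state problem; the transition probabilities, rewards, and value estimates are all invented for illustration:

```python
# Minimal sketch: one Bellman backup for a toy MDP (illustrative numbers only).
GAMMA = 0.9

# transitions[action] = list of (next_state, probability, reward)
transitions = {
    "left":  [("s1", 0.8, 1.0), ("s2", 0.2, -1.0)],
    "right": [("s2", 1.0, 5.0)],
}
value = {"s1": 0.0, "s2": 10.0}  # current estimates of V*(s')

# V*(s) = max_a sum_{s'} T(s'|s,a) * (R(s,a,s') + gamma * V*(s'))
best = max(
    sum(p * (r + GAMMA * value[ns]) for ns, p, r in outcomes)
    for outcomes in transitions.values()
)
print(best)  # 14.0: action "right" gives 1.0 * (5.0 + 0.9 * 10.0)
```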
Recap: Value Iteration

  1. Arbitrarily assign a value to each state (e.g. set each state to 0)
  2. Until convergence:
  • Calculate the Q(s, a) values for every state for that iteration, and determine the action that maximises Q(s, a) for each state
  • Calculate the value for every state using the optimal action for that iteration
  • The Q-values are given by: Q(s, a) = R(s, a) + γ Σ_{s'} T(s' | s, a) V_prev(s')
  • Convergence occurs if the difference in state values between two iterations is less than some value ε
  • Tutorial 6 involves implementing Value Iteration in the simple Gridworld environment (a minimal sketch is shown below)
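Below is a minimal batch Value Iteration sketch under stated assumptions: get_transition_outcomes(s, a) returns (next_state, probability, reward) tuples (as described later in this guide), all actions are valid in all states, and terminal-state handling (covered in the Terminal States section) is omitted:

```python
import math

EPSILON = 1e-4  # convergence threshold (the epsilon above)
GAMMA = 0.99    # discount factor

def value_iteration(states, actions, get_transition_outcomes):
    """Batch Value Iteration. get_transition_outcomes(s, a) is assumed to
    return a list of (next_state, probability, reward) tuples."""
    values = {s: 0.0 for s in states}   # 1. arbitrarily initialise values
    policy = {s: None for s in states}
    while True:                         # 2. until convergence
        new_values = {}
        for s in states:
            best_q, best_a = -math.inf, None
            for a in actions:
                q = sum(p * (r + GAMMA * values[ns])
                        for ns, p, r in get_transition_outcomes(s, a))
                if q > best_q:
                    best_q, best_a = q, a
            new_values[s], policy[s] = best_q, best_a
        # Converged when no state value changed by more than epsilon.
        if max(abs(new_values[s] - values[s]) for s in states) < EPSILON:
            return new_values, policy
        values = new_values
```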
Recap: Policy Iteration

  1. Arbitrarily assign a policy to each state (e.g. the action to be performed in every state is LEFT)
  2. Until convergence:
  • Policy Evaluation: Determine the value of every state based on the current policy
  • Policy Improvement: Determine the best action to be performed in every state based on the values of the current policy, then update the policy based on the new best action
  • Convergence occurs if the policy does not change between two iterations
  • Tutorial 7 involves implementing Policy Iteration in the simple Gridworld environment (a skeleton is sketched below)
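A skeleton of this loop might look as follows; it assumes the same get_transition_outcomes interface as above and uses a fixed number of evaluation sweeps for simplicity (the linear-algebra evaluation shown later in this guide is an exact alternative):

```python
GAMMA = 0.99

def policy_iteration(states, actions, get_transition_outcomes):
    """Skeleton of Policy Iteration; get_transition_outcomes(s, a) is assumed
    to return (next_state, probability, reward) tuples."""
    policy = {s: actions[0] for s in states}  # 1. arbitrary initial policy
    while True:                               # 2. until the policy stops changing
        # Policy Evaluation: iterative sweeps under the fixed current policy.
        values = {s: 0.0 for s in states}
        for _ in range(100):  # fixed number of sweeps, for simplicity
            values = {s: sum(p * (r + GAMMA * values[ns])
                             for ns, p, r in get_transition_outcomes(s, policy[s]))
                      for s in states}
        # Policy Improvement: pick the best action under the new values.
        new_policy = {}
        for s in states:
            new_policy[s] = max(actions, key=lambda a: sum(
                p * (r + GAMMA * values[ns])
                for ns, p, r in get_transition_outcomes(s, a)))
        if new_policy == policy:  # converged: policy unchanged between iterations
            return values, policy
        policy = new_policy
```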

Computing State Space

  • Both Value Iteration and Policy Iteration require us to loop through every state
  • Value Iteration: to determine the current best policy and the value of a state
  • Policy Iteration: to determine the new best policy based on the current value of a state
  • We need a way to compute every state so we can loop through them all
  • One way is to get every combination of states possible
  • In BeeBot, for a given level, each state is given by the position and orientation of the bee, and the positions and orientations of the widgets
  • We can determine the list of all states by computing every combination of bee position and orientation, widget position, and widget orientation
  • However, this might include some invalid combinations (e.g. the widget or bee is inside a wall)
  • Is there a better way we can search for possible states? One option is sketched below.
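One common alternative is to search forward from the initial state, e.g. with a breadth-first search, so that only reachable (and therefore valid) states are ever enumerated. A sketch, assuming hashable states and the get_transition_outcomes interface described in the next section:

```python
from collections import deque

def enumerate_reachable_states(initial_state, actions, get_transition_outcomes):
    """BFS over the state graph: only states actually reachable from the
    initial state are kept, so invalid combinations never appear."""
    frontier = deque([initial_state])
    visited = {initial_state}   # states must be hashable for set membership
    while frontier:
        s = frontier.popleft()
        for a in actions:
            for next_state, _, _ in get_transition_outcomes(s, a):
                if next_state not in visited:
                    visited.add(next_state)
                    frontier.append(next_state)
    return visited
```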
Transition Outcomes

  • The probabilistic aspect of this assignment means that by performing certain actions in certain states, there might be multiple next states that can be reached, each with a different probability and reward
  • The assignment involves creating a get_transition_outcomes(state, action) function
  • Takes a (state, action) pair and, for every possible next state, returns the probability of ending up in that state and the reward
  • Should return a list or other data structure with all the next states, probabilities, and rewards
  • This will be useful when utilising the Bellman equation to calculate the value of a state
  • When creating the transition function, there are a few things to consider (a gridworld-style sketch follows this list):
  • What are the random aspects of the BeeBot environment?
  • What are the possible next states to consider from a given state?
  • What are the edge cases that need to be considered (e.g. moving near a wall or thorn)?
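As a point of reference, here is what such a function might look like for a simple gridworld (not BeeBot; the grid layout, slip probability, and rewards are all invented for illustration):

```python
# Illustrative sketch for a simple gridworld: the intended move succeeds with
# probability 0.9, otherwise the agent slips and stays put. All cells,
# probabilities, and rewards below are assumed for illustration.
MOVES = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}
WALLS = {(1, 1)}                 # cells the agent can never occupy
GOAL, GOAL_REWARD = (2, 2), 10.0
N_ROWS, N_COLS = 3, 3
SUCCESS_PROB = 0.9

def get_transition_outcomes(state, action):
    """Return a list of (next_state, probability, reward) tuples."""
    row, col = state
    dr, dc = MOVES[action]
    target = (row + dr, col + dc)
    # Edge cases: moving off the grid or into a wall leaves the state unchanged.
    if not (0 <= target[0] < N_ROWS and 0 <= target[1] < N_COLS) or target in WALLS:
        target = state
    def reward(s):
        return GOAL_REWARD if s == GOAL else 0.0
    return [(target, SUCCESS_PROB, reward(target)),
            (state, 1 - SUCCESS_PROB, reward(state))]
```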

Transition Outcomes

  • The transition function will usually assume a given action is valid, so we need to feed it only actions that are valid for given states, to avoid any odd behaviour
  • We can cache whether actions are valid to improve runtime
  • perform_action(state, action)
  • Might help you understand how to determine the possible next states for certain states and actions
  • However, note that it only returns one possible next state for a given action
  • We can cache the results of the transition function to improve runtime (a memoisation sketch follows)
  • Tutorial 6 involves creating a transition function for the simple Gridworld environment
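A minimal memoisation sketch, assuming states and actions are hashable (e.g. tuples rather than lists), wrapping a transition function like the one sketched above:

```python
# Memoisation sketch: cache transition outcomes per (state, action) pair,
# so each pair is only ever computed once.
_transition_cache = {}

def cached_transition_outcomes(state, action):
    key = (state, action)
    if key not in _transition_cache:
        _transition_cache[key] = get_transition_outcomes(state, action)
    return _transition_cache[key]
```

functools.lru_cache achieves the same effect for functions whose arguments are all hashable.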

Terminal States

  • We need a way to handle terminal states when calculating the values and optimal policies of states; otherwise the agent might think it can leave the terminal states
  • There are two ways we can model the terminal states to do this
  • Terminal states
  • Set the value of a terminal state to 0
  • Skip over it without updating its value if it is encountered in a loop
  • Absorbing states
  • Create a new state outside of the regular state space to send the agent to once it reaches a terminal state
  • If the player performs any action in the absorbing state, it remains in the absorbing state
  • The reward is always 0 for the absorbing state, no matter the action performed (see the sketch below)
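A sketch of the absorbing-state approach, where ABSORBING is an assumed sentinel value outside the regular state space and is_terminal is an assumed predicate:

```python
# Sketch: wrap the regular dynamics so terminal states funnel into a single
# self-looping absorbing state with zero reward.
ABSORBING = "ABSORBING"   # assumed sentinel, not part of the regular state space

def transition_outcomes_with_absorbing(state, action, is_terminal):
    # Any action taken in a terminal or absorbing state leads back to the
    # absorbing state with probability 1 and reward 0.
    if state == ABSORBING or is_terminal(state):
        return [(ABSORBING, 1.0, 0.0)]
    return get_transition_outcomes(state, action)   # the regular dynamics
```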

Reward Function

  • Rewards are not dependent on the resulting state, but on the state the agent is in and the action it performs
  • Reward functions considered up until this point have been R(s), based solely on the state the agent is in
  • For BeeBot, the expected reward function is R(s, a): actions also give rewards that need to be considered, as well as any possible penalties
  • We can use get_transition_outcomes(state, action) to get the rewards:
  • We can start by initialising a matrix of all zeroes of size |S| x |A|
  • Then, loop over each (state, action) pair and initialise the total expected reward to 0
  • Loop over the outcomes from get_transition_outcomes(state, action) and add up the (probability × reward) terms to compute the expected reward over all outcomes
  • i.e. R(s, a) = expected reward = Σ_{s'} T(s' | s, a) · R(s, a, s') (a numpy sketch follows)
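A numpy sketch of this construction, again assuming the get_transition_outcomes interface from earlier:

```python
import numpy as np

def build_reward_matrix(states, actions, get_transition_outcomes):
    """R[s, a] = expected reward = sum over s' of T(s'|s,a) * R(s,a,s')."""
    R = np.zeros((len(states), len(actions)))   # |S| x |A| matrix of zeroes
    for si, s in enumerate(states):
        for ai, a in enumerate(actions):
            R[si, ai] = sum(prob * reward
                            for _, prob, reward in get_transition_outcomes(s, a))
    return R
```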
Value Iteration: Updating State Values

  • How we choose to update states can affect the performance of our Value Iteration
  • Batch updates use the value of the next state from the previous iteration to update the value of the current state in the current iteration
  • In-place updates use the value of the next state from the current iteration, if it has already been calculated, to update the value of the current state in the current iteration
  • If the next state has not yet been calculated in the current iteration, the value from the previous iteration is used
  • In-place updates typically converge in fewer iterations
  • The order in which states are calculated also has an effect for in-place updates (i.e. starting near the goal and working backwards may enable faster convergence); both update styles are sketched below
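A sketch of the two update styles, assuming a bellman_backup helper (defined here for illustration) and the same transition interface as before:

```python
def bellman_backup(s, values, actions, get_transition_outcomes, gamma=0.99):
    """Return max over actions of Q(s, a) under the current value estimates."""
    return max(sum(p * (r + gamma * values[ns])
                   for ns, p, r in get_transition_outcomes(s, a))
               for a in actions)

def batch_sweep(states, values, actions, gto):
    # Batch: every state reads values from the previous iteration only.
    return {s: bellman_backup(s, values, actions, gto) for s in states}

def in_place_sweep(states, values, actions, gto):
    # In-place: later states immediately see values updated earlier this sweep.
    for s in states:    # ordering states goal-first can speed convergence
        values[s] = bellman_backup(s, values, actions, gto)
    return values
```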

Policy Iteration: Linear Algebra

  • The numpy library is allowed for this assignment, as well as built-in Python libraries
  • We can use linear algebra to compute the value of states for the policy evaluation step of Policy Iteration
  • Policy evaluation solves the linear system v_π = r + γ P_π v_π, which rearranges to (I − γ P_π) v_π = r
  • I is the identity matrix of size |S| x |S|
  • P_π is a matrix containing the transition probabilities based on the current policy
  • r is a vector containing the rewards for every state based on the current policy
  • numpy can be used to perform the linear algebra:
  • v_π = np.linalg.solve(I − γ P_π, r) (a fuller sketch follows)
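A sketch of exact policy evaluation with numpy, assuming the get_transition_outcomes interface from earlier; P_π and r are assembled from the current policy and the linear system is then solved directly:

```python
import numpy as np

def evaluate_policy(states, policy, get_transition_outcomes, gamma=0.99):
    """Solve v = r + gamma * P_pi v exactly, i.e. (I - gamma P_pi) v = r."""
    n = len(states)
    index = {s: i for i, s in enumerate(states)}
    P = np.zeros((n, n))   # P_pi: transition probabilities under current policy
    r = np.zeros(n)        # expected immediate reward under current policy
    for s in states:
        i = index[s]
        for ns, prob, reward in get_transition_outcomes(s, policy[s]):
            P[i, index[ns]] += prob
            r[i] += prob * reward
    I = np.eye(n)
    return np.linalg.solve(I - gamma * P, r)   # v_pi = (I - gamma P_pi)^(-1) r
```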
Improving Runtime

  • We need to compute the state space to calculate the value of every state
  • Calculating the value of every state can be time consuming, especially for large levels
  • We can remove unreachable states from the state space and only consider states the agent can reach
  • This can improve runtime, as we are reducing the number of states whose values must be calculated
  • Remember to use caching where possible
  • If you are repeatedly computing something, caching can drastically improve runtime
  • Remember to limit the use of inefficient data structures
  • If you are checking whether an element is in a list (e.g. whether the next state is in the states list), either modify your code to remove the need for this, or use a data structure with more efficient lookup (e.g. a set or dictionary)

"Flying is exhausting. Why don't you humans just run everywhere, isn't that faster?" - Barry B. Benson
