Main sources: papers and tutorials (Thomas Simonini's Deep Reinforcement Learning Course with Tensorflow, Arthur Juliani's Simple Reinforcement Learning with Tensorflow series), and OpenAI Spinning Up in Deep RL.
Concepts
Policy Gradient methods
These methods attempt to learn a function that maps an observation directly to an action.
observation -> action
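As a rough illustration of the observation-to-action mapping, the sketch below uses a hypothetical linear-softmax policy. The weights, the observation size, and the names `softmax`, `policy`, and `sample_action` are assumptions made for this example and are not taken from the cited tutorials.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy(observation, weights):
    """Map an observation directly to action probabilities.

    weights[a] holds one (hypothetical) weight per observation feature
    for action a.
    """
    logits = [sum(w * o for w, o in zip(row, observation)) for row in weights]
    return softmax(logits)

def sample_action(probs, rng=random):
    """Sample an action index according to the policy's probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical 2-feature observation and 2 actions.
weights = [[1.0, 0.0], [0.0, 1.0]]
probs = policy([2.0, 0.0], weights)
action = sample_action(probs)
```

A policy gradient method would adjust `weights` so that actions leading to higher return become more probable; that training step is omitted here.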
Q-Learning
Q-Learning attempts to learn the value of being in a given state and taking a specific action there.
state, action -> value
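A minimal sketch of the (state, action) -> value mapping, assuming a small hypothetical table. The state names, action names, and numbers are made up for illustration.

```python
# Hypothetical Q-table: q_table[state][action] is the estimated value
# of taking `action` in `state`.
q_table = {
    "s0": {"left": 0.1, "right": 0.7},
    "s1": {"left": 0.4, "right": 0.2},
}

def q_value(state, action):
    """Look up the estimated value of a (state, action) pair."""
    return q_table[state][action]

def greedy_action(state):
    """Pick the action with the highest estimated value in this state."""
    actions = q_table[state]
    return max(actions, key=actions.get)
```

Unlike a policy, the table does not name an action directly; the action is derived by comparing values, as `greedy_action` does.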
Bellman Equation
The Bellman equation states that the expected long-term reward for a given action equals the immediate reward from the current action plus the discounted expected reward from the best future action taken at the following state, where $\gamma$ is the discount factor.
$$ Q(s, a) = r + \gamma \max_{a'} Q(s', a') $$
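As a small worked example of the right-hand side of the equation above, the sketch below computes the target for a single transition. The function name `bellman_target`, the `done` flag for terminal states, and the numbers are assumptions for illustration.

```python
def bellman_target(reward, next_q_values, gamma, done=False):
    """Compute r + gamma * max_a' Q(s', a').

    If the transition ends the episode (assumed `done` flag), there is
    no future term and the target is just the immediate reward.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)

# Hypothetical transition: reward 1.0, three actions available in s'.
# The best next value is 2.0, so the target is 1.0 + 0.9 * 2.0.
target = bellman_target(reward=1.0, next_q_values=[0.5, 2.0, 1.5], gamma=0.9)
```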
The Bellman equation can be used to implement the Q-Table algorithm:
import gym
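As a minimal sketch of what a Q-Table implementation can look like, the block below runs tabular Q-learning on a hypothetical five-state chain instead of a gym environment, so that it stays self-contained. The environment, the `step` and `train` helpers, and all hyperparameters are assumptions chosen for this example.

```python
import random

N_STATES = 5      # states 0..4; state 4 is the goal (hypothetical toy chain)
ACTIONS = (0, 1)  # 0 = move left, 1 = move right

def step(state, action):
    """Toy deterministic environment standing in for a gym env's step()."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on steps per episode
            # Epsilon-greedy: mostly exploit, sometimes explore;
            # ties between equal values are broken at random.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[state])
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Bellman target: r + gamma * max_a' Q(s', a'), no future term at the goal
            target = reward if done else reward + gamma * max(q[next_state])
            # Move the table entry part of the way toward the target
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q

q_table = train()
# Greedy action in each non-terminal state
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
```

The table holds one entry per (state, action) pair, and each update touches only the entry for the pair that was just visited.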
However, this approach does not scale, since a table's capacity is limited.
Q-Learning with Neural Networks
import gym
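A minimal sketch of the idea, assuming a single linear layer over one-hot encoded states, trained by gradient steps on the squared Bellman error. It is written in plain Python rather than TensorFlow and uses a hypothetical toy chain instead of a gym environment; the helper names, the zero weight initialisation, and the hyperparameters are all assumptions for this example.

```python
import random

N_STATES, N_ACTIONS = 5, 2  # hypothetical toy chain: action 0 = left, 1 = right

def step(state, action):
    """Toy deterministic environment; reaching state 4 gives reward 1 and ends."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def one_hot(state):
    return [1.0 if i == state else 0.0 for i in range(N_STATES)]

def forward(weights, x):
    """Single linear layer: Q(s, .) = x @ W, with W of shape [N_STATES][N_ACTIONS]."""
    return [sum(x[i] * weights[i][a] for i in range(N_STATES)) for a in range(N_ACTIONS)]

def train(episodes=500, lr=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Zero initialisation is an assumption that works for this one-layer case.
    weights = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on steps per episode
            x = one_hot(state)
            q = forward(weights, x)
            if rng.random() < epsilon:
                action = rng.randrange(N_ACTIONS)
            else:
                best = max(q)
                action = rng.choice([a for a in range(N_ACTIONS) if q[a] == best])
            next_state, reward, done = step(state, action)
            # Bellman target, treated as a constant (no gradient flows through it)
            next_q = forward(weights, one_hot(next_state))
            target = reward if done else reward + gamma * max(next_q)
            # Gradient step on 0.5 * (target - Q(s, a))^2 with respect to W
            error = target - q[action]
            for i in range(N_STATES):
                weights[i][action] += lr * error * x[i]
            state = next_state
            if done:
                break
    return weights

weights = train()
greedy_policy = [
    max(range(N_ACTIONS), key=lambda a: forward(weights, one_hot(s))[a])
    for s in range(N_STATES - 1)
]
```

With one-hot inputs a single linear layer behaves like a table, so this sketch mainly shows the change in mechanics: values come from a forward pass and are updated by a gradient step on a loss. The scaling benefit appears when the one-hot input is replaced by real features and the layer by a deeper network.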