RL problems have a few defining aspects:
- Different actions lead to different rewards
- Rewards are delayed in time
- An action's reward depends on the state of the environment
Concepts
Learning a Policy
Learning which rewards we get for each of the possible actions, and ensuring we choose the optimal ones.
Policy Gradients
A simple neural network learns a policy for picking actions by adjusting its weights through gradient descent, using feedback from the environment.
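A minimal sketch of this idea in plain numpy, with the "network" stripped down to one preference per action: the policy is a softmax over the preferences, and each preference is nudged along the gradient of log π weighted by the reward. The reward means, learning rate, and number of actions here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, -0.1, 0.9, 0.4])  # made-up expected reward per action
prefs = np.zeros(4)   # one trainable "weight" (preference) per action
lr = 0.1

for _ in range(2000):
    probs = np.exp(prefs) / np.exp(prefs).sum()   # softmax policy pi
    a = rng.choice(4, p=probs)                    # sample an action from pi
    r = rng.normal(true_means[a], 1.0)            # feedback from the environment
    grad_log_pi = -probs                          # d log pi(a) / d prefs ...
    grad_log_pi[a] += 1.0                         # ... is e_a - probs for a softmax
    prefs += lr * r * grad_log_pi                 # gradient ascent on expected reward

print(np.argmax(prefs))  # settles on the best action (index 2)
```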
Value functions
Instead of learning the optimal action in a given state, the agent learns to predict how good it is to be in a given state, or to take a given action.
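For a bandit this reduces to estimating the expected reward of each arm. A small sketch (the arm means are made up): each estimate is just a running average of the rewards observed for that action:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, -0.1, 0.9, 0.4])  # made-up expected reward per arm
Q = np.zeros(4)        # predicted value of each action
counts = np.zeros(4)

for _ in range(5000):
    a = rng.integers(4)                 # explore uniformly, just to gather data
    r = rng.normal(true_means[a], 1.0)
    counts[a] += 1
    Q[a] += (r - Q[a]) / counts[a]      # incremental running average

print(Q.round(2))  # the estimates approach the true means
```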
e-greedy policy
This means that most of the time our agent will choose the action that corresponds to the largest expected value, but occasionally, with probability e, it will choose randomly.
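As a sketch, assuming we already have a table Q of expected values (the numbers are illustrative):

```python
import numpy as np

def e_greedy(Q, e, rng):
    # With probability e pick a random action; otherwise exploit Q.
    if rng.random() < e:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))

rng = np.random.default_rng(0)
Q = np.array([0.2, -0.1, 0.9, 0.4])  # hypothetical value estimates
print([e_greedy(Q, 0.1, rng) for _ in range(15)])  # mostly 2, occasionally random
```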
policy loss equation
Loss = -log(π) * A
A is the advantage, and is an essential aspect of all reinforcement learning algorithms. Intuitively, it corresponds to how much better an action was than some baseline.
π is the policy. In this case, it corresponds to the chosen action’s weight.
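Expressed in TensorFlow, a sketch of this loss, assuming the TF 1.x graph API (under TF 2.x the same calls live under tf.compat.v1); with a simple bandit there is no baseline, so the raw reward stands in for A:

```python
import tensorflow as tf  # TF 1.x graph API; under TF 2.x, tf.compat.v1

weights = tf.Variable(tf.ones([4]))   # the policy: one weight per action
reward_holder = tf.placeholder(shape=[1], dtype=tf.float32)
action_holder = tf.placeholder(shape=[1], dtype=tf.int32)

# pi: the weight of the action that was actually taken.
responsible_weight = tf.slice(weights, action_holder, [1])
# Loss = -log(pi) * A, with the raw reward standing in for the advantage A.
loss = -(tf.log(responsible_weight) * reward_holder)
update = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)
```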
The Multi-armed bandit
```python
import tensorflow as tf
```
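A sketch of where this heads: a four-armed bandit plus an agent assembled from the pieces above. The arm thresholds, e, learning rate, and episode count are illustrative choices, not necessarily the exact values used later:

```python
import numpy as np
import tensorflow as tf  # TF 1.x graph API; under TF 2.x, tf.compat.v1

# Lower threshold -> more likely to pay out, so arm 3 is the best here.
bandits = [0.2, 0.0, -0.2, -5.0]

def pull_bandit(bandit):
    # Return +1 if a random draw beats the arm's threshold, else -1.
    return 1 if np.random.randn() > bandit else -1

# The agent: the policy-loss graph from above, plus greedy action selection.
weights = tf.Variable(tf.ones([len(bandits)]))
chosen_action = tf.argmax(weights, 0)
reward_holder = tf.placeholder(shape=[1], dtype=tf.float32)
action_holder = tf.placeholder(shape=[1], dtype=tf.int32)
responsible_weight = tf.slice(weights, action_holder, [1])
loss = -(tf.log(responsible_weight) * reward_holder)
update = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)

e = 0.1  # chance of picking a random arm
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        # e-greedy action selection over the policy weights.
        if np.random.rand() < e:
            action = np.random.randint(len(bandits))
        else:
            action = sess.run(chosen_action)
        reward = pull_bandit(bandits[action])
        sess.run(update, feed_dict={reward_holder: [reward],
                                    action_holder: [action]})
    print(sess.run(weights))  # the largest weight should sit on arm 3
```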