32. Methods of solution of MDPs: DP and LP#
32.1. Brute force: policy search#
32.2. Dynamic programming#
Introduction to DP
sequential or temporal approach to optimization
solving complex problems by breaking them down into sub-problems:
solve sub-problems
combine solutions of sub-problems
for problems with 2 properties:
optimal substructure: the optimal solution can be decomposed into sub-problems
overlapping sub-problems: sub-problems recur, so their solutions can be cached and reused
DP assumes full knowledge of the MDP
used for planning in an MDP, in two settings:
prediction: given MDP and policy, evaluate value function
control: given MDP, evaluate optimal value function and policy
32.2.1. Policy iteration#
Policy iteration alternates policy evaluation and policy improvement steps. When the value function converges, the iteration stops at the Bellman optimality condition, as shown in the algorithm below.
32.2.1.1. Policy Evaluation#
Policy evaluation solves the Bellman equation (31.1) for a given policy \(\pi\). Two approaches:
direct solution of the linear system. Complexity: …
iterative, full policy-evaluation back-up
\[V_{k+1}(s) \leftarrow \sum_{a \in A} \pi(a|s) \left[ R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) V^{\pi}_k(s') \right]\]
The back-up can be applied with synchronous or asynchronous (in-place) updates; in both cases the iteration converges to \(V^{\pi}\), since the back-up operator is a \(\gamma\)-contraction.
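As an illustration, here is a minimal Python sketch of both approaches, assuming a tabular MDP stored as numpy arrays `P[s, a, s']` (transition probabilities) and `R[s, a]` (expected rewards), and a stochastic policy `pi[s, a]`; these names and shapes are assumptions of the example, not notation from the notes.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Iterative policy evaluation with the full synchronous back-up
    V_{k+1}(s) = sum_a pi(a|s) [ R(s,a) + gamma * sum_s' P(s'|s,a) V_k(s') ].

    P  : (S, A, S) transition probabilities P(s'|s,a)
    R  : (S, A)    expected rewards R(s,a)
    pi : (S, A)    stochastic policy pi(a|s)
    """
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(max_iter):
        Q = R + gamma * P @ V            # Q_k(s,a), shape (S, A)
        V_new = np.sum(pi * Q, axis=1)   # expectation over actions under pi
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

def policy_evaluation_direct(P, R, pi, gamma=0.9):
    """Direct solution of the Bellman equation for pi as a linear system,
    V = (I - gamma * P_pi)^{-1} R_pi."""
    S = R.shape[0]
    P_pi = np.einsum("sa,sap->sp", pi, P)   # P_pi(s'|s) = sum_a pi(a|s) P(s'|s,a)
    R_pi = np.sum(pi * R, axis=1)           # R_pi(s)    = sum_a pi(a|s) R(s,a)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
```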
32.2.1.2. Policy Improvement#
Greedy update. Let \(\pi\) be a deterministic policy. The greedy update acts by maximizing the action-value function over all actions, for every state,
\[\pi'(s) = \arg\max_{a \in A} Q^{\pi}(s,a) \ ,\]
and it improves the value function \(V^{\pi}\) of every state, i.e. \(V^{\pi'}(s) \ge V^{\pi}(s)\) for all \(s \in S\), as guaranteed by the policy improvement theorem below.
In stochastic optimization, an \(\varepsilon\)-greedy improvement is usually introduced to enhance exploration and avoid converging to a local optimum.
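For concreteness, a sketch of the greedy (and \(\varepsilon\)-greedy) improvement step, under the same assumed tabular representation `P[s, a, s']`, `R[s, a]` used above:

```python
import numpy as np

def greedy_improvement(P, R, V, gamma=0.9, eps=0.0):
    """Greedy (eps=0) or eps-greedy policy improvement w.r.t. a value function V.

    Returns a stochastic policy pi'(a|s) as an (S, A) array: the argmax of
    Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s') gets probability 1-eps,
    and the remaining eps is spread uniformly over all actions (exploration).
    """
    S, A = R.shape
    Q = R + gamma * P @ V                  # action-value estimate, shape (S, A)
    greedy_a = np.argmax(Q, axis=1)        # deterministic greedy action per state
    pi_new = np.full((S, A), eps / A)      # uniform exploration mass
    pi_new[np.arange(S), greedy_a] += 1.0 - eps
    return pi_new
```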
Theorem 32.1 (Policy improvement theorem)
Let \(\pi\) and \(\pi'\) be a pair of deterministic policies s.t.
\[Q^{\pi}(s, \pi'(s)) \ge V^{\pi}(s) \ , \quad \forall s \in S \ .\]
Then \(\pi' \ge \pi\), i.e. \(V^{\pi'}(s) \ge V^{\pi}(s)\), \(\forall s \in S\).
Proof.
where in \((1)\) the first action \(a_t\) is drawn from \(\pi'(s)\), \(s'\) results from the transition,
and the following actions are drawn from \(\pi\). The same notation is used in \((2)\) and in all the following manipulations.
32.2.1.3. Policy Iteration algorithm#
Policy iteration algorithm
Convergence is reached when \(\pi_{n+1}(s) = \pi_{n}(s)\). The greedy update (32.1) then gives
\[V^{\pi_n}(s) = \max_{a \in A} Q^{\pi_n}(s,a) \ ,\]
namely, the Bellman optimality condition (31.3) is satisfied.
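A compact sketch of the full loop, again assuming the tabular arrays `P[s, a, s']` and `R[s, a]` introduced above; here policy evaluation is done exactly via a linear solve, which is one of several possible choices:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Policy iteration: alternate policy evaluation and greedy improvement
    until the policy is stable, i.e. pi_{n+1}(s) = pi_n(s) for all s."""
    S, A = R.shape
    pi = np.zeros(S, dtype=int)            # deterministic policy: one action per state
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V exactly.
        P_pi = P[np.arange(S), pi]         # (S, S) transition matrix under pi
        R_pi = R[np.arange(S), pi]         # (S,)   reward vector under pi
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Greedy policy improvement.
        Q = R + gamma * P @ V              # (S, A)
        pi_new = np.argmax(Q, axis=1)
        if np.array_equal(pi_new, pi):     # pi_{n+1} = pi_n: Bellman optimality reached
            return pi, V
        pi = pi_new
```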
32.2.2. Value Iteration#
Unlike policy iteration, the value iteration algorithm doesn't maintain any explicit policy, but only the optimal value function; a greedy policy can be extracted from it once the iteration has converged.
An optimal policy for an MDP can be built by taking an optimal action \(a^*\) first, followed by actions drawn from an optimal policy thereafter. Thus, it must satisfy
with \(s_{n} = s\) and \(s_{n+1} = s'\) resulting from the transition \(s'|s,a\). Subdividing the problem into sub-problems, deterministic value iteration proceeds backwards from \(s'\) to \(s\),
or
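A minimal sketch of value iteration with the Bellman optimality back-up, under the same assumed arrays `P[s, a, s']`, `R[s, a]`; the greedy policy is extracted only at the end, from the converged value function:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Value iteration: V_{k+1}(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V_k(s') ].
    No explicit policy is maintained during the sweeps."""
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(max_iter):
        V_new = np.max(R + gamma * P @ V, axis=1)   # Bellman optimality back-up
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < tol:
            break
    pi = np.argmax(R + gamma * P @ V, axis=1)       # greedy policy from the final V
    return V, pi
```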
32.2.3. Dynamic programming: extensions#
asynchronous, or in-place, updates
prioritized sweeping/update (see the sketch after this list)
…
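As an illustration of these ideas, a sketch of asynchronous (in-place) value iteration with a prioritized sweep, where states are backed up in order of their current Bellman error; the priority measure and the re-prioritization of predecessors are common choices assumed here, not prescriptions from the notes:

```python
import heapq
import numpy as np

def prioritized_value_iteration(P, R, gamma=0.9, tol=1e-8):
    """In-place value iteration: back up one state at a time, always picking
    the state with the largest Bellman error, instead of full synchronous sweeps."""
    S, A = R.shape
    V = np.zeros(S)

    def bellman_error(s):
        return abs(np.max(R[s] + gamma * P[s] @ V) - V[s])

    # Max-heap over states (heapq is a min-heap, so priorities are negated).
    heap = [(-bellman_error(s), s) for s in range(S)]
    heapq.heapify(heap)

    while heap:
        neg_err, s = heapq.heappop(heap)
        if -neg_err < tol:                 # largest remaining error is small: stop
            break
        V[s] = np.max(R[s] + gamma * P[s] @ V)      # in-place backup of state s
        # Re-prioritize the predecessors of s, whose back-up target just changed.
        preds = np.nonzero(P[:, :, s].sum(axis=1) > 0)[0]
        for p in preds:
            err = bellman_error(p)
            if err > tol:
                heapq.heappush(heap, (-err, p))
    return V
```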
32.3. Linear programming#
…