34. Large or Continuous MDPs
For low-dimensional problems, value functions can be efficiently represented by arrays or look-up tables,
\[\begin{split}\begin{aligned}
V(s) \quad & \rightarrow \quad \mathbf{V} \in \mathbb{R}^{|S|} \\
Q(s,a) \quad & \rightarrow \quad \mathbf{Q} \in \mathbb{R}^{|S| \times |A|} \\
\end{aligned}\end{split}\]
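As a minimal sketch of the tabular case (the sizes below are arbitrary and purely illustrative), the look-up table is just an array indexed by state and action:

```python
import numpy as np

# Hypothetical sizes for illustration: a small MDP with 16 states and 4 actions.
n_states, n_actions = 16, 4

# Tabular value functions: one entry per state (V) or per state-action pair (Q).
V = np.zeros(n_states)               # V in R^{|S|}
Q = np.zeros((n_states, n_actions))  # Q in R^{|S| x |A|}

# A look-up is a plain array access, e.g. the greedy action in state s:
s = 3
greedy_a = int(np.argmax(Q[s]))
```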
For large problems with a large number of discrete states, \(\sim 10^{N}\) (\(N \approx 20\) for backgammon, \(N \approx 170\) for Go), or with a continuous state space, RL usually relies on function approximation of the value function,
\[\begin{split}\begin{aligned}
V^{\pi}(s) & \sim V(s ; \boldsymbol{\theta}) \\
Q^{\pi}(s,a) & \sim Q(s,a; \boldsymbol{\theta}) \\
\end{aligned}\end{split}\]
Parameters \(\boldsymbol{\theta}\) are updated during learning. Many function approximators can be used (ANNs, Fourier/wavelet bases, polynomials, piecewise polynomials, coarse coding, …); the most suitable choice depends on the problem itself.
The parameters are found by minimizing the mean-squared error between the approximation \(V(s; \boldsymbol{\theta})\) and the true value function \(V^{\pi}(s)\),
\[L(\boldsymbol{\theta}) := \mathbb{E}_{\pi} \left[ \left( V^{\pi}(s) - V(s; \boldsymbol{\theta}) \right)^2 | s_t = s \right]\]
Gradient descent,
\[\begin{split}\begin{aligned}
\Delta \boldsymbol{\theta}
& = - \frac{1}{2} \alpha \nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}) \\
& = - \frac{1}{2} \alpha \nabla_{\boldsymbol{\theta}} \, \mathbb{E}_{\pi} \left[ \left( V^{\pi}(s) - V(s; \boldsymbol{\theta}) \right)^2 | s_t = s \right] \\
& = \alpha \, \mathbb{E}_{\pi} \big[ \left( V^{\pi}(s) - V(s; \boldsymbol{\theta}) \right) \nabla_{\boldsymbol{\theta}} V(s; \boldsymbol{\theta}) \ | \ s_t = s \big] \ ,
\end{aligned}\end{split}\]
and the stochastic gradient descent update reads
\[
\Delta \boldsymbol{\theta}
= \alpha \ \left( V^{\pi}(s) - V(s; \boldsymbol{\theta}) \right) \, \nabla_{\boldsymbol{\theta}} V(s; \boldsymbol{\theta}) \ .
\]
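A minimal sketch of this update, assuming a hand-picked quadratic approximator over a scalar state; the model, the step size, and the use of a sampled target in place of the unknown \(V^{\pi}(s)\) (e.g. the Monte Carlo return \(G_t\)) are illustrative assumptions:

```python
import numpy as np

# Illustrative approximator: V(s; theta) = theta[0] + theta[1]*s + theta[2]*s**2.
def v(s, theta):
    return theta[0] + theta[1] * s + theta[2] * s ** 2

def grad_v(s, theta):
    # Gradient of V(s; theta) with respect to theta.
    return np.array([1.0, s, s ** 2])

def sgd_update(theta, s, target, alpha=0.01):
    # One stochastic gradient step:
    # theta <- theta + alpha * (target - V(s; theta)) * grad_theta V(s; theta),
    # where `target` stands in for the unknown V^pi(s).
    return theta + alpha * (target - v(s, theta)) * grad_v(s, theta)

theta = np.zeros(3)
theta = sgd_update(theta, s=0.5, target=1.2)
```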
State representation with features \(\boldsymbol{\phi}(s)\),
\[\begin{split}\boldsymbol{\phi}(s) = \begin{bmatrix} \phi_1(s) \\ \vdots \\ \phi_n(s) \end{bmatrix}\end{split}\]
State-value function as a linear combination of features
\[V(s; \boldsymbol{\theta}) = \boldsymbol{\phi}^T(s) \, \boldsymbol{\theta} \ ,\]
and the gradient of the value function, which enters the update rule, is simply
\[\nabla_{\boldsymbol{\theta}} V(s; \boldsymbol{\theta}) = \boldsymbol{\phi}(s) \ .\]
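A minimal sketch of the linear case, assuming radial-basis-function features over a scalar state and a semi-gradient TD(0) target standing in for \(V^{\pi}(s)\); both choices are illustrative, not prescribed above:

```python
import numpy as np

# Illustrative feature map: 10 Gaussian radial basis functions over a scalar state in [0, 1].
centers = np.linspace(0.0, 1.0, 10)

def phi(s, width=0.1):
    return np.exp(-0.5 * ((s - centers) / width) ** 2)

def linear_v(s, theta):
    return phi(s) @ theta            # V(s; theta) = phi(s)^T theta

def td0_update(theta, s, r, s_next, alpha=0.05, gamma=0.99):
    # Semi-gradient TD(0): the gradient of the linear V is simply phi(s).
    td_target = r + gamma * linear_v(s_next, theta)
    return theta + alpha * (td_target - linear_v(s, theta)) * phi(s)

theta = np.zeros(len(centers))
theta = td0_update(theta, s=0.2, r=1.0, s_next=0.3)
```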
…
Action-value function as a linear combination of features
\[Q(s, a; \boldsymbol{\theta}) = \boldsymbol{\phi}^T(s,a) \, \boldsymbol{\theta} \ ,\]
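A minimal sketch of the linear action-value case, assuming the common (but not unique) construction where the state features are copied into the block of the chosen action, plus \(\varepsilon\)-greedy action selection:

```python
import numpy as np

n_actions, n_state_features = 3, 10

def phi_sa(phi_s, a):
    # Illustrative phi(s, a): place the state features in the block of action a.
    out = np.zeros(n_state_features * n_actions)
    out[a * n_state_features:(a + 1) * n_state_features] = phi_s
    return out

def q(phi_s, a, theta):
    return phi_sa(phi_s, a) @ theta  # Q(s, a; theta) = phi(s, a)^T theta

def epsilon_greedy(phi_s, theta, eps=0.1):
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax([q(phi_s, a, theta) for a in range(n_actions)]))

theta = np.zeros(n_state_features * n_actions)
phi_s = np.random.rand(n_state_features)   # placeholder state features
a = epsilon_greedy(phi_s, theta)
```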
…
34.1. Value and Policy-Based RL
Value-based: learn a value function (typically \(Q(s,a)\)) and act (near-)greedily with respect to it; the policy is implicit.
Policy-based: parameterize the policy directly and optimize its parameters, without necessarily learning a value function.
Actor-critic: combine both, the critic learns a value function that guides the updates of the actor's policy (see the sketch below).
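As a hedged illustration of how the three approaches fit together, a one-step actor-critic sketch with a linear critic and a softmax-linear actor; the feature map, learning rates, and the TD(0) critic target are all illustrative assumptions:

```python
import numpy as np

n_actions, n_features = 3, 10

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def policy(phi_s, w_actor):
    # Softmax policy over linear preferences: pi(a | s) ~ exp(w_actor[a] . phi(s)).
    return softmax(w_actor @ phi_s)

def actor_critic_step(phi_s, a, r, phi_s_next, w_actor, w_critic,
                      alpha_actor=0.01, alpha_critic=0.05, gamma=0.99):
    # Critic: semi-gradient TD(0) on a linear state-value function.
    delta = r + gamma * (w_critic @ phi_s_next) - (w_critic @ phi_s)
    w_critic = w_critic + alpha_critic * delta * phi_s
    # Actor: policy-gradient step; for a softmax-linear policy,
    # grad log pi(a|s) has rows (1[b == a] - pi(b|s)) * phi(s).
    pi = policy(phi_s, w_actor)
    grad_log_pi = -np.outer(pi, phi_s)
    grad_log_pi[a] += phi_s
    w_actor = w_actor + alpha_actor * delta * grad_log_pi
    return w_actor, w_critic

w_actor = np.zeros((n_actions, n_features))
w_critic = np.zeros(n_features)
phi_s, phi_s_next = np.random.rand(n_features), np.random.rand(n_features)
w_actor, w_critic = actor_critic_step(phi_s, a=1, r=1.0, phi_s_next=phi_s_next,
                                      w_actor=w_actor, w_critic=w_critic)
```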
…