31. Markov Processes#
Definition 31.1 (Stochastic process)
A stochastic process is a collection of random variables \(X(t)\) defined on a common probability space \((\Omega, \mathcal{F}, P)\), indexed by some set \(T\), with outcomes in a measurable space \((S, \Sigma)\).
A Markov process is a memory-less stochastic process (see Definition 31.1): for discrete-time Markov processes, the state at time \(n\) only depends on the state at time \(n-1\), \(P(X_n | X_{n-1}, \dots, X_0) = P(X_n | X_{n-1})\). A finite, discrete-time Markov process is described by a tuple \((S, \mathbf{P}, \mu)\) (a minimal sampling sketch follows the list below), with
\(S\) (finite) set of states
\(\mathbf{P}\) state transition probability matrix, \(P_{ss'} = P(s'|s)\)
\(\mu\) the initial probability distribution, \(\mu_i = P(X_0 = i)\), \(\forall i\)
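As a concrete illustration of the tuple \((S, \mathbf{P}, \mu)\), here is a minimal Python sketch (the state labels, the matrix entries, and the helper name `sample_chain` are illustrative choices, not part of the definition) that encodes a toy 3-state chain and samples a trajectory using only the Markov property:

```python
import numpy as np

# Toy Markov process (S, P, mu); the numbers are arbitrary example values.
states = ["sun", "rain", "fog"]              # S, the finite set of states
P = np.array([[0.8, 0.1, 0.1],               # P[s, s'] = P(s'|s); each row sums to 1
              [0.3, 0.6, 0.1],
              [0.4, 0.4, 0.2]])
mu = np.array([1.0, 0.0, 0.0])               # mu_i = P(X_0 = i)

def sample_chain(P, mu, n_steps, seed=0):
    """Sample a trajectory X_0, ..., X_{n_steps}: X_n depends only on X_{n-1}."""
    rng = np.random.default_rng(seed)
    x = rng.choice(len(mu), p=mu)            # X_0 ~ mu
    trajectory = [x]
    for _ in range(n_steps):
        x = rng.choice(P.shape[0], p=P[x])   # X_n ~ P(. | X_{n-1})
        trajectory.append(x)
    return trajectory

print([states[i] for i in sample_chain(P, mu, 10)])
```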
A stationary MP is a process whose state transition probability matrix is constant in time.
1-step transition.
Notation: probability vectors as row vectors
When treating stochastic processes, probability vectors are usually written as row vectors. The probability distribution over states \(s'\) at time \(n\) is the sum of the probabilities of reaching \(s'\) from each state \(s\) at time \(n-1\),
\[p_n(s') = \sum_{s} p_{n-1}(s) P_{s s'} \ , \]
or, using the matrix formalism,
\[\mathbf{p}_n = \mathbf{p}_{n-1} \mathbf{P} \ . \]
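Numerically, the 1-step transition is just a row-vector/matrix product; a quick check with the toy matrix from the sketch above (reproduced here so the snippet is standalone):

```python
import numpy as np

P = np.array([[0.8, 0.1, 0.1],
              [0.3, 0.6, 0.1],
              [0.4, 0.4, 0.2]])    # P[s, s'] = P(s'|s)
p = np.array([1.0, 0.0, 0.0])      # p_{n-1}, distribution at time n-1 (row vector)

p_next = p @ P                     # p_n = p_{n-1} P
print(p_next, p_next.sum())        # still a probability vector: entries >= 0, sum = 1
```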
Properties of probability matrix.
Since \(P_{s s'} = P(s'|s)\) represents the conditional probability of arriving in \(s'\) starting from \(s\), the entries of a matrix representing a discrete-time process are bounded, \(0 \le P_{s s'} \le 1\).
The sum over \(s'\) is \(1\) (sum over all the possibilities), \(\sum_{s'} P_{s s'} = 1\).
Thus, a probability matrix \(\mathbf{P}\) has an eigenvalue \(\lambda = 1\), with the uniform (all-ones) vector \(\mathbf{1}\) as a (right) eigenvector.
Is this unique? No, in general.
Existence of a stationary distribution
The spectral radius of a probability matrix is \(\rho(\mathbf{P}) = 1\).
Proof
Write the (right) eigenvalue problem \(\mathbf{P} \mathbf{x} = \lambda \mathbf{x}\) component-wise, \(\lambda x_i = \sum_j P_{ij} x_j\), for the index \(i\) s.t. the components of the eigenvector \(\mathbf{x}\) satisfy \(|x_i| \ge |x_j|\), \(\forall j\). As \(P_{ij} \ge 0\) and \(\sum_j P_{ij} = P_{ii} + \sum_{j \ne i} P_{ij} = 1\), it follows
\[|\lambda - P_{ii}| \, |x_i| = \Big| \sum_{j \ne i} P_{ij} x_j \Big| \le \sum_{j \ne i} P_{ij} |x_j| \le (1 - P_{ii}) \, |x_i| \quad \Rightarrow \quad |\lambda - P_{ii}| \le 1 - P_{ii} \ , \]
meaning that any eigenvalue \(\lambda\) is contained in a circle centered at \(P_{ii}\) with radius \(1 - P_{ii}\). It’s not hard to prove that such a circle is contained in the circle centered at \(0\) with radius \(1\) (see footnote 1 below). Thus, \(|\lambda| \le 1\) holds for all the eigenvalues, and at least one eigenvalue with \(|\lambda| = 1\), namely the \(\lambda = 1\) eigenvalue shown above, exists. Thus, the spectral radius of a probability matrix is \(\rho(\mathbf{P}) = 1\).
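A quick numerical sanity check of the statement, using NumPy's eigenvalue routine on the same toy matrix (any row-stochastic matrix would do):

```python
import numpy as np

P = np.array([[0.8, 0.1, 0.1],
              [0.3, 0.6, 0.1],
              [0.4, 0.4, 0.2]])

lam = np.linalg.eigvals(P)
print(np.abs(lam))                               # all moduli are <= 1
print(np.isclose(np.max(np.abs(lam)), 1.0))      # spectral radius rho(P) = 1
print(np.allclose(P @ np.ones(3), np.ones(3)))   # all-ones vector is a right eigenvector for lambda = 1
```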
n-step transition. The \(n\)-step transition matrix of a stationary MP reads \(\mathbf{P}^{(n)} = \mathbf{P}^n\), the \(n\)-th power of the 1-step transition matrix, so that \(\mathbf{p}_n = \mathbf{p}_0 \mathbf{P}^n\).
First passage. …; recurrent states
Classification of states: accessible, communicating (and dividing states into partitions, disjoint classes); positive recurrent, periodic, ergodic; absorbing or transient
Classification of MP: absorbing, ergodic, regular
Stationary distribution: a probability vector \(\mathbf{p}\) such that the distribution does not change in the next timestep, \(\mathbf{p} \mathbf{P} = \mathbf{p}\).
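A stationary distribution can be computed either by power iteration (repeatedly applying the 1-step transition) or as a left eigenvector of \(\mathbf{P}\) for \(\lambda = 1\), normalized to sum to 1; a sketch with the toy matrix (convergence of the power iteration assumes a well-behaved, e.g. irreducible and aperiodic, chain):

```python
import numpy as np

P = np.array([[0.8, 0.1, 0.1],
              [0.3, 0.6, 0.1],
              [0.4, 0.4, 0.2]])

# Power iteration: p <- p P until the distribution stops changing.
p = np.array([1.0, 0.0, 0.0])
for _ in range(1000):
    p = p @ P
print(p)                                   # approximately stationary: p P = p

# Equivalently: left eigenvector of P for lambda = 1 (i.e. right eigenvector of P^T).
w, V = np.linalg.eig(P.T)
v = np.real(V[:, np.argmin(np.abs(w - 1.0))])
print(v / v.sum())                         # normalized to a probability vector
```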
Fundamental matrix
Other definitions: mixing rate, spectral gap
31.1. Markov Reward Processes#
with
\(R\) reward function; it could be a random variable, depending on the state with probability \(p(r|s)\) (or on the transition, \(p(r|s,s')\)?)
\(\gamma\) discount factor, \(\gamma \in [0,1]\); it’s used to represent the present value of future rewards (\(\gamma \sim 0\) short-sighted, \(\gamma \sim 1\) far-sighted)
Return
Time horizon: finite, indefinite (until a stopping criterion is met, or an absorbing state is reached), infinite
Possible choices of cumulative reward:
total reward, \(V = \sum_i r_i\)
average reward, \(V = \frac{1}{n} \sum_{i=1}^{n} r_i\)
discounted reward \(V = \sum_{i} \gamma^{i-1} r_i\)
Return \(v_t\) is defined as the total discounted reward from timestep \(t\)
Value function. The expected value of the return \(v_t\) from state \(s_t = s\), \(V(s) := \mathbb{E}\left[ v_t \,|\, s_t = s \right]\).
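A small numerical sketch of the cumulative-reward choices above and of the return \(v_t\) (the reward sequence and \(\gamma\) are arbitrary example values):

```python
import numpy as np

r = np.array([1.0, 0.0, 2.0, 1.0, 0.5])   # rewards r_1, ..., r_n (toy values)
gamma = 0.9

total = r.sum()                                        # total reward
average = r.mean()                                     # average reward
discounted = np.sum(gamma ** np.arange(len(r)) * r)    # sum_i gamma^{i-1} r_i

def ret(r, t, gamma):
    """Discounted return from timestep t: discounted sum of the remaining rewards."""
    tail = r[t:]
    return np.sum(gamma ** np.arange(len(tail)) * tail)

print(total, average, discounted, ret(r, 2, gamma))
```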
31.1.1. Bellman Equation for \(V(s)\)#
or in matrix form, \(\mathbf{V} = \mathbf{R} + \gamma \mathbf{P} \mathbf{V}\).
Solution of the Bellman equation. The Bellman equation for the value function is linear, so it can be solved directly as the linear problem \((\mathbf{I} - \gamma \mathbf{P}) \mathbf{V} = \mathbf{R}\), i.e. \(\mathbf{V} = (\mathbf{I} - \gamma \mathbf{P})^{-1} \mathbf{R}\).
Solutions:
direct solution of the linear system (feasible for small MRPs only, since the cost scales as \(O(|S|^3)\); see the sketch after this list)
iterative methods for large MRPs: DP (dynamic programming), MC (Monte-Carlo evaluation), TD (Temporal-Difference learning)
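For a small MRP the direct solution is a single call to a linear solver; a sketch with toy values for \(\mathbf{P}\), \(\mathbf{R}\), \(\gamma\):

```python
import numpy as np

P = np.array([[0.8, 0.1, 0.1],
              [0.3, 0.6, 0.1],
              [0.4, 0.4, 0.2]])     # transition matrix of the MRP
R = np.array([1.0, 0.0, -0.5])      # expected reward in each state (toy values)
gamma = 0.9

# Bellman equation V = R + gamma P V  <=>  (I - gamma P) V = R
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)
print(np.allclose(V, R + gamma * P @ V))   # check: V satisfies the Bellman equation
```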
31.2. Markov Decision Processes#
with
\(A\) (finite) set of actions
\(\mathbf{P}\) state transition probability matrix, \(P(s'|s,a)\)
\(R\) reward function, \(R(s,a) = \mathbb{E}\left[ r | s, a \right]\)
31.2.1. Policies#
A policy is a distribution over actions, given the current state, \(\pi(a|s) = P(a_t = a \,|\, s_t = s)\).
Given a policy \(\pi(a|s)\), an MDP becomes an MRP (see the sketch after the list below) with
transition probability
\[P^{\pi}(s'|s) = \sum_{a \in A} \pi(a|s) P(s'|s,a)\]
reward (todo Prove it!)
\[R^{\pi}(s) = \sum_{a \in A} \pi(a|s) R(s,a)\]
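The reduction is a policy-weighted average over actions; a minimal sketch, assuming the (illustrative) array layout `P[s, a, s'] = P(s'|s,a)`, `R[s, a] = R(s,a)`, `pi[s, a] = π(a|s)`:

```python
import numpy as np

nS, nA = 3, 2
rng = np.random.default_rng(0)

# Random toy MDP and policy, normalized so that probabilities sum to 1.
P = rng.random((nS, nA, nS)); P /= P.sum(axis=-1, keepdims=True)    # P[s, a, s'] = P(s'|s,a)
R = rng.random((nS, nA))                                            # R[s, a] = E[r | s, a]
pi = rng.random((nS, nA)); pi /= pi.sum(axis=-1, keepdims=True)     # pi[s, a] = pi(a|s)

P_pi = np.einsum("sa,sat->st", pi, P)   # P^pi(s'|s) = sum_a pi(a|s) P(s'|s,a)
R_pi = np.einsum("sa,sa->s", pi, R)     # R^pi(s)    = sum_a pi(a|s) R(s,a)

print(np.allclose(P_pi.sum(axis=1), 1.0))   # the induced MRP matrix is still row-stochastic
```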
Value functions.
State value function, \(V^{\pi}(s)\), expected return starting from \(s\) and following policy \(\pi\)
\[V^{\pi}(s) := \mathbb{E}_{\pi} \left[ v_t | s_t = s \right]\]
Action-value function \(Q^{\pi}(s,a)\), expected return starting from \(s\), taking action \(a\) first and then following policy \(\pi\)
\[Q^{\pi}(s,a) := \mathbb{E}_{\pi} \left[ v_t | s_t = s, a_t = a \right]\]
31.2.2. Bellman Equations#
Bellman expectation equations. The Bellman expectation equation for the state-value function \(V^{\pi}(s)\) reads
\[V^{\pi}(s) = \sum_{a \in A} \pi(a|s) \left[ R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) V^{\pi}(s') \right] \ . \]
The Bellman expectation equation for the action-value function \(Q^{\pi}(s,a)\) reads
\[Q^{\pi}(s,a) = R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) \sum_{a' \in A} \pi(a'|s') Q^{\pi}(s',a') \ . \]
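The expectation equation can be solved iteratively as a fixed point (policy evaluation); a sketch with the same array conventions as the MDP-to-MRP snippet above (`P[s, a, s']`, `R[s, a]`, `pi[s, a]`):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterate the Bellman expectation backup for V^pi until convergence."""
    V = np.zeros(P.shape[0])
    while True:
        # V(s) <- sum_a pi(a|s) [ R(s,a) + gamma sum_s' P(s'|s,a) V(s') ]
        V_new = np.einsum("sa,sa->s", pi, R + gamma * P @ V)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

With the toy `P`, `R`, `pi` arrays of the previous sketch, `policy_evaluation(P, R, pi)` agrees with the direct linear solve of the induced MRP, \((\mathbf{I} - \gamma \mathbf{P}^{\pi})^{-1} \mathbf{R}^{\pi}\).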
Bellman expectation operators todo
Optimal value functions, and optimal policies
Definition 31.2 (Optimal value functions)
Optimal state-value function \(V^*(s)\) is defined as
\[V^*(s) := \max_{\pi} V^{\pi}(s) \ . \]
Optimal action-value function \(Q^*(s,a)\) is defined as
\[Q^*(s,a) := \max_{\pi} Q^{\pi}(s,a) \ . \]
Usually, the goal of a problem involving an MDP is the maximization of the value functions, which measure the performance of the MDP.
Value functions define a partial ordering of policies, \(\pi \ge \pi'\) if \(V^{\pi}(s) \ge V^{\pi'}(s)\), \(\forall s \in S\).
Theorem 31.1 (Optimal policies)
For an MDP,
an optimal policy \(\pi^*\) exists, s.t. \(\pi^* \ge \pi\), \(\forall \pi\)
all optimal policies achieve the same value of optimal functions, namely
\[V^{\pi^*}(s) = V^*(s) \quad , \quad Q^{\pi^*}(s,a) = Q^*(s,a) \ . \]
a deterministic optimal policy exists, and it’s found by optimizing the action-value function over the actions \(a\),
\[\begin{split}\pi^*(a|s) = \left\{ \begin{aligned} & 1 && , \quad \text{if } a = \text{argmax}_{a \in A} Q^*(s,a) \\ & 0 && , \quad \text{otherwise} \end{aligned} \right. \end{split}\]
Proof
todo
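The third item of the theorem translates directly into code: once \(Q^*\) is known, a deterministic optimal policy is recovered by a greedy argmax over actions. A minimal sketch (the array layout `Q[s, a]` is an assumption):

```python
import numpy as np

def greedy_policy(Q):
    """pi*(a|s): probability 1 on argmax_a Q(s, a), 0 on every other action."""
    nS, nA = Q.shape
    pi = np.zeros((nS, nA))
    pi[np.arange(nS), Q.argmax(axis=1)] = 1.0
    return pi
```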
Property 31.1 (Optimal state-value and action-value functions)
todo Prove it!
Bellman optimality equations. Using Property 31.1 and the expression of the action-value function in (31.2)
Bellman optimality operators
Properties of Bellman operators
Solving Bellman optimality equations
non-linear
no closed form solution
iterative solutions:
Dynamic Programming (DP): Value Iteration (VI), Policy Iteration (PI)
Linear Programming (LP)
Reinforcement Learning (RL): Q-learning (off-policy), SARSA (on-policy),…
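As an example of the Dynamic Programming route, a minimal value-iteration sketch (array conventions as in the earlier MDP snippets; the backup used is the standard Bellman optimality update \(V(s) \leftarrow \max_a \big[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s') \big]\)):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate the Bellman optimality backup; returns V* and a greedy deterministic policy.

    Assumed layout: P[s, a, s'] = P(s'|s,a), R[s, a] = E[r | s, a].
    """
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V        # Q[s, a] = R(s,a) + gamma sum_s' P(s'|s,a) V(s')
        V_new = Q.max(axis=1)        # V(s) <- max_a Q(s, a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```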
- 1
\(1 - P_{ii} \ge |\lambda - P_{ii}| \ge \big| |\lambda| - P_{ii} \big|\) by the reverse triangle inequality; if \(|\lambda| \ge P_{ii}\), it follows \(1 - P_{ii} \ge |\lambda| - P_{ii}\), i.e. \(|\lambda| \le 1\); if \(|\lambda| < P_{ii}\), then \(|\lambda| < P_{ii} \le 1\) trivially.