
Markov Decision Process (MDP)


Posted by xuepro on May 31, 2018

A stochastic process is an indexed collection of random variables \(\{X_t\}\), e.g., a time series of weekly demands for a product.

A stochastic process \(X_t\) is said to be Markovian if and only if

\[P(X_{t+1}=j \vert X_t=i, X_{t-1}=k_{t-1}, X_{t-2}=k_{t-2}, \dots, X_1=k_1, X_0=k_0) = P(X_{t+1}=j \vert X_t=i)\]

“The future is independent of the past given the present”

A Markov process (or Markov chain) is a memoryless stochastic process, i.e., a sequence of random states \(s_1, s_2, \dots\) with the Markov property.

A Markov process is a tuple \((S, P, \mu)\)

  • S is a (finite) set of states
  • P is a state transition probability matrix, \(P_{ss'} = P(s'\vert s)\)

  • \(\mu\) is a set of initial probabilities \(\mu_i^0 = P(X_0=i)\) for all \(i\)
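As a concrete illustration, here is a minimal sketch, assuming NumPy, that samples a trajectory from a made-up two-state Markov process defined by \(P\) and \(\mu\):

```python
import numpy as np

# A made-up 2-state Markov process: 0 = "low demand", 1 = "high demand"
P = np.array([[0.9, 0.1],    # row s: transition probabilities P(s' | s)
              [0.5, 0.5]])
mu = np.array([0.8, 0.2])    # initial distribution mu^0

def sample_chain(P, mu, T, seed=0):
    """Sample a state sequence s_0, s_1, ..., s_{T-1}."""
    rng = np.random.default_rng(seed)
    states = [rng.choice(len(mu), p=mu)]
    for _ in range(T - 1):
        states.append(rng.choice(len(mu), p=P[states[-1]]))
    return states

print(sample_chain(P, mu, T=10))
```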

A Markov reward process is a Markov process with values. A Markov Reward Process is a tuple \((S, P, R, \gamma, \mu)\)

  • S is a (finite) set of states
  • P is a state transition probability matrix, \(P_{ss'} = P(s'\vert s)\)

  • R is a reward function, \(R_s = E[r \vert s]\)
  • \(\gamma\) is a discount factor, \(\gamma \in [0,1]\)
  • \(\mu\) is a set of initial probabilities \(\mu_i^0 = P(X_0=i)\) for all \(i\)

The return \(v_t\) is the total discounted reward from time–step t: \(v_t = r_{t+1}+\gamma r_{t+2}+\gamma^2 r_{t+3}+\dots = \sum_{k=0}^{\infty}\gamma^k r_{t+k+1}\)
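As a quick numerical illustration of this formula, here is a minimal sketch (pure Python; the reward sequence and \(\gamma\) are made up) that computes a truncated return:

```python
def discounted_return(rewards, gamma):
    """Compute r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ... for a finite reward list."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0, 1.0]                 # made-up rewards r_{t+1}, r_{t+2}, ...
print(discounted_return(rewards, gamma=0.9))   # 1 + 0 + 0.81*2 + 0.729*1 ≈ 3.35
```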

The state value function V(s) of an MRP is the expected return starting from state s \(V(s) = E[v_{t}\vert s_{t}=s]\)

Bellman Equation

\[\begin{aligned} V(s) &= E[v_t \vert s_t = s] \\ &= E[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \vert s_t = s] \\ &= E[r_{t+1} + \gamma(r_{t+2} + \gamma r_{t+3} + \dots) \vert s_t = s] \\ &= E[r_{t+1} + \gamma v_{t+1} \vert s_t = s] \\ &= E[r_{t+1} + \gamma V(s_{t+1}) \vert s_t = s] \end{aligned}\]

So we get the Bellman equation:

\[V(s) = E[r + \gamma V(s') \vert s] = R_s + \gamma\sum_{s' \in S}P_{ss'}V(s')\]

The Bellman equation can be expressed concisely using matrices.

\[V = R + \gamma PV\]

\[\begin{bmatrix} V(1) \\ \vdots \\ V(n) \end{bmatrix} = \begin{bmatrix} R(1) \\ \vdots \\ R(n) \end{bmatrix} + \gamma \begin{bmatrix} P_{11} & \cdots & P_{1n} \\ \vdots & \ddots & \vdots \\ P_{n1} & \cdots & P_{nn} \end{bmatrix} \begin{bmatrix} V(1) \\ \vdots \\ V(n) \end{bmatrix}\]

Solving the Bellman Equation

The Bellman equation is a linear equation, so it can be solved directly: \(V = R + \gamma PV\)

\[(I - \gamma P)V = R\]
\[V = (I - \gamma P)^{-1}R\]
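For a small MRP the direct solution is easy to check numerically; here is a minimal sketch, assuming NumPy and a hypothetical 3-state MRP:

```python
import numpy as np

# Hypothetical 3-state MRP
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.2, 0.8]])   # P[s, s'] = transition probability
R = np.array([1.0, 0.0, -1.0])    # R[s] = expected immediate reward in state s
gamma = 0.9

# Solve (I - gamma P) V = R instead of forming the inverse explicitly
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)
```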

Computational complexity is \(O(n^3)\) for \(n\) states, so the direct solution is only possible for small MRPs. There are many iterative methods for large MRPs, e.g.,

  • Dynamic programming
  • Monte–Carlo evaluation
  • Temporal–Difference learning

Discrete–time Finite Markov Decision Processes (MDP)

A Markov decision process (MDP) is a Markov reward process with decisions. It models an environment in which all states are Markov and time is divided into stages.

A Markov Decision Process is a tuple \((S, A, P, R, \gamma, \mu)\)

  • S is a (finite) set of states
  • A is a (finite) set of actions
  • P is a state transition probability matrix, \(P(s' \vert s, a)\)
  • R is a reward function, \(R(s, a) = E[r \vert s, a]\)
  • \(\gamma\) is a discount factor, \(\gamma \in [0,1]\)
  • \(\mu\) is a set of initial probabilities \(\mu_i^0 = P(X_0 = i)\) for all \(i\)

The goal is specified by a scalar reward: goals and purposes can be well thought of as the maximization of the cumulative sum of a received scalar signal (the reward).

Policies: A policy, at any given point in time, decides which action the agent selects. A policy fully defines the behavior of an agent. Policies can be:

  • Markovian \(\subset\) History–dependent
  • Deterministic \(\subset\) Stochastic
  • Stationary \(\subset\) Non–stationary

Stationary Stochastic Markovian Policies

A policy \(\pi\) is a distribution over actions given the state:

\(\pi (s,a) = P(a\vert s)\)

  • MDP policies depend on the current state (not the history)
  • i.e., Policies are stationary (time–independent)
  • Given an MDP \(M = (S, A, P, R, \gamma, \mu)\) and a policy \(\pi\)
    • The state sequence \(s_1, s_2, \dots\) is a Markov process \((S, P^{\pi}, \mu)\)
    • The state and reward sequence \(s_1, r_2, s_2, \dots\) is a Markov reward process \((S, P^{\pi}, R^{\pi}, \gamma, \mu)\), where (a short code sketch follows this list)
    \[P^{\pi}_{ss'} = \sum_{a \in A}\pi(s,a)P(s' \vert s,a)\]
    \[R^{\pi}_{s} = \sum_{a \in A}\pi(s,a)R(s,a)\]
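To make the induced MRP concrete, here is a minimal sketch, assuming NumPy with arrays shaped P[s, a, s'], R[s, a], and pi[s, a] (all values are illustrative), that builds \(P^{\pi}\) and \(R^{\pi}\):

```python
import numpy as np

# Hypothetical MDP with |S| = 3 states and |A| = 2 actions
n_states, n_actions = 3, 2
P = np.random.default_rng(0).dirichlet(np.ones(n_states), size=(n_states, n_actions))
# P[s, a, s'] = P(s' | s, a); each P[s, a] sums to 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [0.5, 0.5]])                  # R[s, a] = expected reward
pi = np.full((n_states, n_actions), 0.5)    # uniform random policy pi(a | s)

P_pi = np.einsum('sa,sap->sp', pi, P)  # P^pi[s, s'] = sum_a pi(a|s) P(s'|s,a)
R_pi = np.einsum('sa,sa->s', pi, R)    # R^pi[s]     = sum_a pi(a|s) R(s,a)
```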

Value Functions

Given a policy \(\pi\), it is possible to define the utility of each state; this is called policy evaluation.

  • The state–value function \(V^{\pi}(s)\) of an MDP is the expected return starting from state s, and then following policy \(\pi\) \(V^{\pi}(s) = E_{\pi}[v_t\vert s_t=s]\)

For control purposes, rather than the value of each state, it is easier to consider the value of each action in each state.

The action–value function \(Q^{\pi}(s, a)\) is the expected return starting from state s, taking action a, and then following policy \(\pi\): \(Q^{\pi}(s, a) = E_{\pi}[v_t\vert s_t=s, a_t=a]\)

Bellman Expectation Equation

\[V^{\pi}(s) = E_{\pi}[r_{t+1} + \gamma V^{\pi}(s_{t+1}) \vert s_t = s] = \sum_{a \in A}\pi(a \vert s)\Big(R(s,a) + \gamma\sum_{s' \in S}P(s' \vert s,a)V^{\pi}(s')\Big)\]
\[Q^{\pi}(s,a) = E_{\pi}[r_{t+1} + \gamma Q^{\pi}(s_{t+1}, a_{t+1}) \vert s_t = s, a_t = a] = R(s,a) + \gamma\sum_{s' \in S}P(s' \vert s,a)V^{\pi}(s') = R(s,a) + \gamma\sum_{s' \in S}P(s' \vert s,a)\sum_{a' \in A}\pi(a' \vert s')Q^{\pi}(s',a')\]
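One of the iterative methods mentioned earlier, dynamic programming, applies this equation repeatedly as an update rule (iterative policy evaluation). A minimal sketch, assuming NumPy and arrays shaped P[s, a, s'], R[s, a], pi[s, a] (names and tolerance are illustrative):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Iteratively apply the Bellman expectation backup until V stops changing.
    P[s, a, s'] = P(s'|s,a), R[s, a] = R(s,a), pi[s, a] = pi(a|s)."""
    V = np.zeros(P.shape[0])
    while True:
        # V_new(s) = sum_a pi(a|s) ( R(s,a) + gamma * sum_s' P(s'|s,a) V(s') )
        Q = R + gamma * np.einsum('sap,p->sa', P, V)
        V_new = np.einsum('sa,sa->s', pi, Q)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```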

The Bellman expectation equation can be expressed concisely using the induced MRP

\[V^{\pi} = R^{\pi} + \gamma P^{\pi}V^{\pi}\]

with direct solution

\[V^{\pi} = (I - \gamma P^{\pi})^{-1}R^{\pi}\]

Bellman operators for \(V^{\pi}\)

The Bellman operator for \(V^{\pi}\) is defined as \(T^{\pi}\) : \(R^{S}\rightarrow R^{S}\) (maps value functions to value functions):

\[(T^{\pi}V^{\pi})(s) = \sum_{a \in A}\pi(a \vert s)\Big(R(s,a) + \gamma\sum_{s' \in S}P(s' \vert s,a)V^{\pi}(s')\Big)\]

Using the Bellman operator, the Bellman expectation equation can be written compactly as:

\[T^{\pi}V^{\pi} = V^{\pi}\]

  • \(V^{\pi}\) is a fixed point of the Bellman operator \(T^{\pi}\)
  • This is a linear equation in \(V^{\pi}\) and \(T^{\pi}\)
  • If \(0 < \gamma < 1\) then \(T^{\pi}\) is a contraction w.r.t. the maximum norm (sketched below)
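The contraction property can be sketched in one line (a standard argument, not spelled out in the original post): for any two value functions \(V_1\) and \(V_2\),

\[\Vert T^{\pi}V_1 - T^{\pi}V_2 \Vert_{\infty} = \gamma \max_{s}\Big\vert \sum_{a \in A}\pi(a \vert s)\sum_{s' \in S}P(s' \vert s,a)\big(V_1(s') - V_2(s')\big)\Big\vert \leq \gamma \Vert V_1 - V_2 \Vert_{\infty}\]

so by the Banach fixed-point theorem, repeatedly applying \(T^{\pi}\) converges to the unique fixed point \(V^{\pi}\).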

The Bellman operator for \(Q^{\pi}\) is defined as \(T^{\pi}: R^{S\times A} \rightarrow R^{S\times A}\) (it maps action–value functions to action–value functions):

\[(T^{\pi}Q^{\pi})(s,a) = R(s,a) + \gamma\sum_{s' \in S}P(s' \vert s,a)\sum_{a' \in A}\pi(a' \vert s')Q^{\pi}(s',a')\]

Using the Bellman operator, the Bellman expectation equation can be written compactly as:

\[T^{\pi}Q^{\pi} = Q^{\pi}\]

Optimal Value Function and Optimal Policy

Optimal Value Function

The optimal state–value function \(V^*(s)\) is the maximum value function over all policies

\[V^{*}(s) = \max_{\pi}V^{\pi}(s)\]

The optimal action–value function \(Q^*(s, a)\) is the maximum action–value function over all policies

\[Q^{*}(s,a) = \max_{\pi}Q^{\pi}(s,a)\]
  • The optimal value function specifies the best possible performance in the MDP
  • An MDP is “solved” when we know the optimal value function

Optimal Policy

Value functions define a partial ordering over policies

\(\pi\geq\pi^\prime\) if \(V^{\pi}(s) \geq V^{\pi^\prime}(s)\), \(\forall s\in S\)

Theorem

For any Markov Decision Process

  • There exists an optimal policy \(\pi^*\) that is better than or equal to all other policies, \(\pi^*\geq\pi, \forall \pi\)
  • All optimal policies achieve the optimal value function, \(V^{\pi^*}(s) = V^*(s)\)
  • All optimal policies achieve the optimal action–value function, \(Q^{\pi^*}(s,a) = Q^*(s,a)\)
  • There is always a deterministic optimal policy for any MDP

A deterministic optimal policy can be found by maximizing over \(Q^*(s,a)\)

\[\pi^{*}(a \vert s) = \begin{cases} 1 & \text{if } a = \arg\max_{a \in A} Q^{*}(s,a) \\ 0 & \text{otherwise} \end{cases}\]
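A minimal sketch of this greedy extraction, assuming NumPy and a hypothetical \(Q^*\) table of shape \(|S| \times |A|\) (the values are made up; ties go to the lowest-indexed action):

```python
import numpy as np

Q_star = np.array([[1.0, 2.0],
                   [0.3, 0.1],
                   [0.0, 0.0]])        # hypothetical Q*(s, a) values

greedy_actions = np.argmax(Q_star, axis=1)            # argmax_a Q*(s, a) for each s
pi_star = np.zeros_like(Q_star)
pi_star[np.arange(len(Q_star)), greedy_actions] = 1   # deterministic pi*(a | s)
print(pi_star)
```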

Bellman Optimality Equation

Bellman Optimality Equation for \(V^*\)

\(V^*(s) = \max_{a}Q^*(s,a) = \max_{a}\left\{ R(s,a)+ \gamma \sum_{s^{\prime} \in S} P(s^{\prime} \vert s,a) V^*(s^{\prime}) \right\}\)

Bellman Optimality Equation for \(Q^*\)

\[Q^{*}(s,a) = R(s,a) + \gamma\sum_{s' \in S}P(s' \vert s,a)V^{*}(s') = R(s,a) + \gamma\sum_{s' \in S}P(s' \vert s,a)\max_{a'}Q^{*}(s',a')\]
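Because of the max, the Bellman optimality equation is nonlinear and has no general closed-form solution, so it is typically solved iteratively. A minimal value-iteration sketch, assuming NumPy and the same P[s, a, s'], R[s, a] array shapes as above (names and tolerance are illustrative):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Repeatedly apply the Bellman optimality backup.
    P[s, a, s'] = P(s'|s,a), R[s, a] = R(s,a)."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * np.einsum('sap,p->sa', P, V)  # Q(s,a) backup
        V_new = Q.max(axis=1)                         # V(s) = max_a Q(s,a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)            # optimal values and greedy policy
        V = V_new
```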

Principle of optimality:

the tail of an optimal policy is optimal for the “tail” problem. (In other words, regardless of the initial state and whatever has already happened, the optimal policy's strategy for the remaining subproblem, the tail policy, is still optimal for that subproblem.)

Math notation references:

The vertical bar in math formulas sometimes needs to be written with \vert.

The prime symbol ' should be written with \prime.

List of Greek letters and math symbols


