
Markov Decision Process (MDP)


Posted by xuepro on May 31, 2018

A stochastic process is an indexed collection of random variables \(\{X_t\}\), e.g., a time series of weekly demands for a product.

A stochastic process \(X_t\) is said to be Markovian if and only if

\[P(X_{t+1}=j \vert X_t=i, X_{t-1}=k_{t-1}, X_{t-2}=k_{t-2}, \dots, X_1=k_1, X_0=k_0) = P(X_{t+1}=j \vert X_t=i)\]

“The future is independent of the past given the present”

A Markov process (or Markov chain) is a memoryless stochastic process, i.e., a sequence of random states \(s_1, s_2, \dots\) with the Markov property.

A Markov process is a tuple \((S, P, \mu)\)

  • S is a (finite) set of states
  • P is a state transition probability matrix, \(P_{ss'} = P(s'\vert s)\)

  • \(\mu\) is a set of initial probabilities \(\mu_i^0 = P(X_0=i)\) for all \(i\)
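As a concrete illustration, here is a minimal sketch, assuming NumPy, that samples a trajectory from a made-up two-state Markov process defined by \(P\) and \(\mu\):

```python
import numpy as np

# A made-up 2-state Markov process: 0 = "low demand", 1 = "high demand"
P = np.array([[0.9, 0.1],    # row s: transition probabilities P(s' | s)
              [0.5, 0.5]])
mu = np.array([0.8, 0.2])    # initial distribution mu^0

def sample_chain(P, mu, T, seed=0):
    """Sample a state sequence s_0, s_1, ..., s_{T-1}."""
    rng = np.random.default_rng(seed)
    states = [rng.choice(len(mu), p=mu)]
    for _ in range(T - 1):
        states.append(rng.choice(len(mu), p=P[states[-1]]))
    return states

print(sample_chain(P, mu, T=10))
```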

A Markov reward process is a Markov process with values. A Markov Reward Process is a tuple \((S, P, R, \gamma, \mu)\)

  • S is a (finite) set of states
  • P is a state transition probability matrix, \(P_{ss'} = P(s'\vert s)\)

  • R is a reward function, \(R_s = E[r \vert s]\)
  • \(\gamma\) is a discount factor, \(\gamma \in [0,1]\)
  • \(\mu\) is a set of initial probabilities \(\mu_i^0 = P(X_0=i)\) for all \(i\)

The return \(v_t\) is the total discounted reward from time–step t: \(v_t = r_{t+1}+\gamma r_{t+2}+\gamma^2 r_{t+3}+\dots = \sum_{k=0}^{\infty}\gamma^k r_{t+k+1}\)
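As a quick numerical illustration of this formula, here is a minimal sketch (pure Python; the reward sequence and \(\gamma\) are made up) that computes a truncated return:

```python
def discounted_return(rewards, gamma):
    """Compute r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ... for a finite reward list."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0, 1.0]                 # made-up rewards r_{t+1}, r_{t+2}, ...
print(discounted_return(rewards, gamma=0.9))   # 1 + 0 + 0.81*2 + 0.729*1 ≈ 3.35
```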

The state value function V(s) of an MRP is the expected return starting from state s \(V(s) = E[v_{t}\vert s_{t}=s]\)

Bellman Equation

\[\begin{aligned} V(s) &= E[v_t \vert s_t = s] \\ &= E[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \vert s_t = s] \\ &= E[r_{t+1} + \gamma(r_{t+2} + \gamma r_{t+3} + \dots) \vert s_t = s] \\ &= E[r_{t+1} + \gamma v_{t+1} \vert s_t = s] \\ &= E[r_{t+1} + \gamma V(s_{t+1}) \vert s_t = s] \end{aligned}\]

So we get the Bellman equation:

\[V(s) = E[r + \gamma V(s') \vert s] = R_s + \gamma\sum_{s' \in S}P_{ss'}V(s')\]

The Bellman equation can be expressed concisely using matrices.

\[V = R + \gamma PV\]

\[\begin{bmatrix} V(1) \\ \vdots \\ V(n) \end{bmatrix} = \begin{bmatrix} R(1) \\ \vdots \\ R(n) \end{bmatrix} + \gamma \begin{bmatrix} P_{11} & \cdots & P_{1n} \\ \vdots & \ddots & \vdots \\ P_{n1} & \cdots & P_{nn} \end{bmatrix} \begin{bmatrix} V(1) \\ \vdots \\ V(n) \end{bmatrix}\]

Solving the Bellman Equation

The Bellman equation is a linear equation, so it can be solved directly: \(V = R + \gamma PV\)

\[(I - \gamma P)V = R\]
\[V = (I - \gamma P)^{-1}R\]
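For a small MRP the direct solution is easy to check numerically; here is a minimal sketch, assuming NumPy and a hypothetical 3-state MRP:

```python
import numpy as np

# Hypothetical 3-state MRP
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.2, 0.8]])   # P[s, s'] = transition probability
R = np.array([1.0, 0.0, -1.0])    # R[s] = expected immediate reward in state s
gamma = 0.9

# Solve (I - gamma P) V = R instead of forming the inverse explicitly
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)
```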

Computational complexity is \(O(n^3)\) for \(n\) states, so the direct solution is only possible for small MRPs. There are many iterative methods for large MRPs, e.g.,

  • Dynamic programming
  • Monte–Carlo evaluation
  • Temporal–Difference learning

Discrete–time Finite Markov Decision Processes (MDP)

A Markov decision process (MDP) is a Markov reward process with decisions. It models an environment in which all states are Markov and time is divided into stages.

A Markov Decision Process is a tuple \((S, A, P, R, \gamma, \mu)\)

  • S is a (finite) set of states
  • A is a (finite) set of actions
  • P is a state transition probability matrix, \(P(s' \vert s, a)\)
  • R is a reward function, \(R(s, a) = E[r \vert s, a]\)
  • \(\gamma\) is a discount factor, \(\gamma \in [0,1]\)
  • \(\mu\) is a set of initial probabilities \(\mu_i^0 = P(X_0 = i)\) for all \(i\)

The goal is specified by a scalar reward: goals and purposes can be well thought of as the maximization of the cumulative sum of a received scalar signal (the reward).

Policies: A policy, at any given point in time, decides which action the agent selects. A policy fully defines the behavior of an agent. Policies can be:

  • Markovian \(\subset\) History–dependent
  • Deterministic \(\subset\) Stochastic
  • Stationary \(\subset\) Non–stationary

Stationary Stochastic Markovian Policies

A policy \(\pi\) is a distribution over actions given the state:

\(\pi (s,a) = P(a\vert s)\)

  • MDP policies depend on the current state (not the history)
  • i.e., Policies are stationary (time–independent)
  • Given an MDP \(M = (S, A, P, R, \gamma, \mu)\) and a policy \(\pi\)
    • The state sequence \(s_1, s_2, \dots\) is a Markov process \((S, P^{\pi}, \mu)\)
    • The state and reward sequence \(s_1, r_2, s_2, \dots\) is a Markov reward process \((S, P^{\pi}, R^{\pi}, \gamma, \mu)\), where (a short code sketch follows this list)
    \[P^{\pi}_{ss'} = \sum_{a \in A}\pi(s,a)P(s' \vert s,a)\]
    \[R^{\pi}_{s} = \sum_{a \in A}\pi(s,a)R(s,a)\]
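To make the induced MRP concrete, here is a minimal sketch, assuming NumPy with arrays shaped P[s, a, s'], R[s, a], and pi[s, a] (all values are illustrative), that builds \(P^{\pi}\) and \(R^{\pi}\):

```python
import numpy as np

# Hypothetical MDP with |S| = 3 states and |A| = 2 actions
n_states, n_actions = 3, 2
P = np.random.default_rng(0).dirichlet(np.ones(n_states), size=(n_states, n_actions))
# P[s, a, s'] = P(s' | s, a); each P[s, a] sums to 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [0.5, 0.5]])                  # R[s, a] = expected reward
pi = np.full((n_states, n_actions), 0.5)    # uniform random policy pi(a | s)

P_pi = np.einsum('sa,sap->sp', pi, P)  # P^pi[s, s'] = sum_a pi(a|s) P(s'|s,a)
R_pi = np.einsum('sa,sa->s', pi, R)    # R^pi[s]     = sum_a pi(a|s) R(s,a)
```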

Value Functions

Given a policy \(\pi\), it is possible to define the utility of each state; this is called policy evaluation.

  • The state–value function \(V^{\pi}(s)\) of an MDP is the expected return starting from state s, and then following policy \(\pi\) \(V^{\pi}(s) = E_{\pi}[v_t\vert s_t=s]\)

For control purposes, rather than the value of each state, it is easier to consider the value of each action in each state.

The action–value function \(Q^{\pi}(s, a)\) is the expected return starting from state s, taking action a, and then following policy \(\pi\): \(Q^{\pi}(s, a) = E_{\pi}[v_t\vert s_t=s, a_t=a]\)

Bellman Expectation Equation

\[V^{\pi}(s) = E_{\pi}[r_{t+1} + \gamma V^{\pi}(s_{t+1}) \vert s_t = s] = \sum_{a \in A}\pi(a \vert s)\Big(R(s,a) + \gamma\sum_{s' \in S}P(s' \vert s,a)V^{\pi}(s')\Big)\]
\[Q^{\pi}(s,a) = E_{\pi}[r_{t+1} + \gamma Q^{\pi}(s_{t+1}, a_{t+1}) \vert s_t = s, a_t = a] = R(s,a) + \gamma\sum_{s' \in S}P(s' \vert s,a)V^{\pi}(s') = R(s,a) + \gamma\sum_{s' \in S}P(s' \vert s,a)\sum_{a' \in A}\pi(a' \vert s')Q^{\pi}(s',a')\]
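One of the iterative methods mentioned earlier, dynamic programming, applies this equation repeatedly as an update rule (iterative policy evaluation). A minimal sketch, assuming NumPy and arrays shaped P[s, a, s'], R[s, a], pi[s, a] (names and tolerance are illustrative):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Iteratively apply the Bellman expectation backup until V stops changing.
    P[s, a, s'] = P(s'|s,a), R[s, a] = R(s,a), pi[s, a] = pi(a|s)."""
    V = np.zeros(P.shape[0])
    while True:
        # V_new(s) = sum_a pi(a|s) ( R(s,a) + gamma * sum_s' P(s'|s,a) V(s') )
        Q = R + gamma * np.einsum('sap,p->sa', P, V)
        V_new = np.einsum('sa,sa->s', pi, Q)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```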

The Bellman expectation equation can be expressed concisely using the induced MRP

\[V^{\pi} = R^{\pi} + \gamma P^{\pi}V^{\pi}\]

with direct solution

\[V^{\pi} = (I - \gamma P^{\pi})^{-1}R^{\pi}\]

Bellman operators for \(V^{\pi}\)

The Bellman operator for \(V^{\pi}\) is defined as \(T^{\pi}\) : \(R^{S}\rightarrow R^{S}\) (maps value functions to value functions):

\[(T^{\pi}V^{\pi})(s) = \sum_{a \in A}\pi(a \vert s)\Big(R(s,a) + \gamma\sum_{s' \in S}P(s' \vert s,a)V^{\pi}(s')\Big)\]

Using the Bellman operator, the Bellman expectation equation can be written compactly as:

\[T^{\pi}V^{\pi} = V^{\pi}\]

  • \(V^{\pi}\) is a fixed point of the Bellman operator \(T^{\pi}\)
  • This is a linear equation in \(V^{\pi}\) and \(T^{\pi}\)
  • If \(0 < \gamma < 1\) then \(T^{\pi}\) is a contraction w.r.t. the maximum norm (sketched below)
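The contraction property can be sketched in one line (a standard argument, not spelled out in the original post): for any two value functions \(V_1\) and \(V_2\),

\[\Vert T^{\pi}V_1 - T^{\pi}V_2 \Vert_{\infty} = \gamma \max_{s}\Big\vert \sum_{a \in A}\pi(a \vert s)\sum_{s' \in S}P(s' \vert s,a)\big(V_1(s') - V_2(s')\big)\Big\vert \leq \gamma \Vert V_1 - V_2 \Vert_{\infty}\]

so by the Banach fixed-point theorem, repeatedly applying \(T^{\pi}\) converges to the unique fixed point \(V^{\pi}\).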

The Bellman operator for \(Q^{\pi}\) is defined as \(T^{\pi}: R^{S\times A} \rightarrow R^{S\times A}\) (it maps action–value functions to action–value functions):

\[(T^{\pi}Q^{\pi})(s,a) = R(s,a) + \gamma\sum_{s' \in S}P(s' \vert s,a)\sum_{a' \in A}\pi(a' \vert s')Q^{\pi}(s',a')\]

Using the Bellman operator, the Bellman expectation equation can be written compactly as:

\[T^{\pi}Q^{\pi} = Q^{\pi}\]

Optimal Value Function and Optimal Policy

Optimal Value Function

The optimal state–value function \(V^*(s)\) is the maximum value function over all policies

\[V^{*}(s) = \max_{\pi}V^{\pi}(s)\]

The optimal action–value function \(Q^*(s, a)\) is the maximum action–value function over all policies

\[Q^{*}(s,a) = \max_{\pi}Q^{\pi}(s,a)\]
  • The optimal value function specifies the best possible performance in the MDP
  • An MDP is “solved” when we know the optimal value function

Optimal Policy

Value functions define a partial ordering over policies

\(\pi\geq\pi^\prime\) if \(V^{\pi}(s) \geq V^{\pi^\prime}(s)\), \(\forall s\in S\)

Theorem

For any Markov Decision Process

  • There exists an optimal policy \(\pi^*\) that is better than or equal to all other policies, \(\pi^*\geq\pi, \forall \pi\)
  • All optimal policies achieve the optimal value function, \(V^{\pi^*}(s) = V^*(s)\)
  • All optimal policies achieve the optimal action–value function, \(Q^{\pi^*}(s,a) = Q^*(s,a)\)
  • There is always a deterministic optimal policy for any MDP

A deterministic optimal policy can be found by maximizing over \(Q^*(s,a)\)

\[\pi^{*}(a \vert s) = \begin{cases} 1 & \text{if } a = \arg\max_{a \in A} Q^{*}(s,a) \\ 0 & \text{otherwise} \end{cases}\]
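A minimal sketch of this greedy extraction, assuming NumPy and a hypothetical \(Q^*\) table of shape \(|S| \times |A|\) (the values are made up; ties go to the lowest-indexed action):

```python
import numpy as np

Q_star = np.array([[1.0, 2.0],
                   [0.3, 0.1],
                   [0.0, 0.0]])        # hypothetical Q*(s, a) values

greedy_actions = np.argmax(Q_star, axis=1)            # argmax_a Q*(s, a) for each s
pi_star = np.zeros_like(Q_star)
pi_star[np.arange(len(Q_star)), greedy_actions] = 1   # deterministic pi*(a | s)
print(pi_star)
```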

Bellman Optimality Equation

Bellman Optimality Equation for \(V^*\)

\(V^*(s) = \max_{a}Q^*(s,a) = \max_{a}\left\{ R(s,a)+ \gamma \sum_{s^{\prime} \in S} P(s^{\prime} \vert s,a) V^*(s^{\prime}) \right\}\)

Bellman Optimality Equation for \(Q^*\)

\[Q^{*}(s,a) = R(s,a) + \gamma\sum_{s' \in S}P(s' \vert s,a)V^{*}(s') = R(s,a) + \gamma\sum_{s' \in S}P(s' \vert s,a)\max_{a'}Q^{*}(s',a')\]
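Because of the max, the Bellman optimality equation is nonlinear and has no general closed-form solution, so it is typically solved iteratively. A minimal value-iteration sketch, assuming NumPy and the same P[s, a, s'], R[s, a] array shapes as above (names and tolerance are illustrative):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Repeatedly apply the Bellman optimality backup.
    P[s, a, s'] = P(s'|s,a), R[s, a] = R(s,a)."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * np.einsum('sap,p->sa', P, V)  # Q(s,a) backup
        V_new = Q.max(axis=1)                         # V(s) = max_a Q(s,a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)            # optimal values and greedy policy
        V = V_new
```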

Principle of optimality:

the tail of an optimal policy is optimal for the “tail” problem. (In other words, regardless of the initial state and whatever has already happened, the optimal policy's strategy for the remaining subproblem, the tail policy, is still optimal for that subproblem.)

Math notation references:

The vertical bar in math formulas sometimes needs to be written with \vert.

The prime symbol ' should be written with \prime.

List of Greek letters and math symbols


