A stochastic process is an indexed collection of random variables \(\{X_t\}\), e.g., the time series of weekly demands for a product.
A stochastic process \(X_t\) is said to be Markovian if and only if
\[P(X_{t+1}=j \vert X_t=i, X_{t-1}=k_{t-1}, X_{t-2}=k_{t-2}, \dots, X_1=k_1, X_0=k_0) = P(X_{t+1}=j \vert X_t=i)\]
"The future is independent of the past given the present."
A Markov process (or Markov chain) is a memoryless stochastic process, i.e., a sequence of random states \(s_1, s_2, \dots\) with the Markov property.
A Markov process is a tuple \((S, P, \mu)\)
- S is a (finite) set of states
- P is a state transition probability matrix, \(P_{ss'} = P(s'\vert s)\)
- \(\mu\) is a set of initial probabilities, \(\mu_i^0 = P(X_0=i)\) for all i
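As a concrete illustration, a minimal sketch of sampling a trajectory from such a tuple; the two-state transition matrix `P` and initial distribution `mu` below are made-up numbers, not from the text:

```python
import numpy as np

# Hypothetical 2-state Markov process (S = {0, 1}); P and mu are assumed values.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])   # P[s, s'] = P(s' | s): each row sums to 1
mu = np.array([1.0, 0.0])    # initial probabilities mu_i^0 = P(X_0 = i)

def sample_trajectory(P, mu, steps, rng):
    """Sample states s_0, s_1, ..., s_steps from the chain."""
    s = rng.choice(len(mu), p=mu)          # draw s_0 ~ mu
    traj = [s]
    for _ in range(steps):
        s = rng.choice(P.shape[1], p=P[s]) # draw s_{t+1} ~ P(. | s_t)
        traj.append(s)
    return traj

print(sample_trajectory(P, mu, 10, np.random.default_rng(0)))
```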
A Markov reward process is a Markov process with values. A Markov Reward Process is a tuple \((S, P, R, \gamma, \mu)\)
- S is a (finite) set of states
- P is a state transition probability matrix, \(P_{ss'} = P(s'\vert s)\)
- R is a reward function, \(R_s = E[r \vert s]\)
- \(\gamma\) is a discount factor, \(\gamma \in [0,1]\)
- \(\mu\) is a set of initial probabilities, \(\mu_i^0 = P(X_0=i)\) for all i
The return \(v_t\) is the total discounted reward from time-step t: \(v_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty}\gamma^k r_{t+k+1}\)
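A short sketch of this sum for a finite reward sequence (the reward values are arbitrary examples):

```python
# Discounted return v_t = sum_k gamma^k r_{t+k+1}, truncated to a finite
# reward list [r_{t+1}, r_{t+2}, ...].
def discounted_return(rewards, gamma):
    v = 0.0
    for k, r in enumerate(rewards):
        v += (gamma ** k) * r
    return v

# Example: rewards [1, 1, 1] with gamma = 0.5 -> 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
```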
The state value function V(s) of an MRP is the expected return starting from state s \(V(s) = E[v_{t}\vert s_{t}=s]\)
Bellman Equation
\[
\begin{aligned}
V(s) &= E[v_t \vert s_t = s]\\
&= E[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \vert s_t = s]\\
&= E[r_{t+1} + \gamma (r_{t+2} + \gamma r_{t+3} + \dots) \vert s_t = s]\\
&= E[r_{t+1} + \gamma v_{t+1} \vert s_t = s]\\
&= E[r_{t+1} + \gamma V(s_{t+1}) \vert s_t = s]
\end{aligned}
\]
So we get the Bellman equation:
\[V(s) = E[r + \gamma V(s') \vert s] = R_s + \gamma \sum_{s' \in S} P_{ss'} V(s')\]
The Bellman equation can be expressed concisely using matrices:
\[V = R + \gamma P V\]
Solving the Bellman Equation
The Bellman equation is a linear equation, so it can be solved directly:
\[(I - \gamma P)V = R \quad\Rightarrow\quad V = (I - \gamma P)^{-1}R\]
Computational complexity is \(O(n^3)\) for n states, so the direct solution is only possible for small MRPs. There are many iterative methods for large MRPs, e.g.,
- Dynamic programming
- Monte–Carlo evaluation
- Temporal–Difference learning
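For a small MRP the direct solution above is a one-line linear solve; here is a sketch on an assumed 2-state MRP (the numbers in `P` and `R` are illustrative only):

```python
import numpy as np

# Direct solution V = (I - gamma P)^{-1} R for an assumed 2-state MRP.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])   # transition matrix
R = np.array([1.0, 0.0])     # expected reward per state
gamma = 0.9

# O(n^3) linear solve of (I - gamma P) V = R
V = np.linalg.solve(np.eye(2) - gamma * P, R)

# Sanity check: V satisfies the Bellman equation V = R + gamma P V
assert np.allclose(V, R + gamma * P @ V)
print(V)
```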
Discrete–time Finite Markov Decision Processes (MDP)
A Markov decision process (MDP) is a Markov reward process with decisions. It models an environment in which all states are Markov and time is divided into stages.
A Markov Decision Process is a tuple \((S, A, P, R, \gamma, \mu)\)
- S is a (finite) set of states
- A is a (finite) set of actions
- P is a state transition probability matrix, \(P(s' \vert s, a)\)
- R is a reward function, \(R(s, a) = E[r \vert s, a]\)
- \(\gamma\) is a discount factor, \(\gamma \in [0,1]\)
- \(\mu\) is a set of initial probabilities, \(\mu_i^0 = P(X_0 = i)\) for all i
Goal as a scalar reward: goals and purposes can be well thought of as the maximization of the cumulative sum of a received scalar signal (the reward).
Policies: A policy, at any given point in time, decides which action the agent selects. A policy fully defines the behavior of an agent. Policies can be:
- Markovian \(\subset\) History–dependent
- Deterministic \(\subset\) Stochastic
- Stationary \(\subset\) Non–stationary
Stationary Stochastic Markovian Policies
A policy \(\pi\) is a distribution over actions given the state:
\(\pi (s,a) = P(a\vert s)\)
- MDP policies depend on the current state (not the history)
- i.e., Policies are stationary (time–independent)
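A stationary stochastic Markovian policy can be stored as a state-by-action matrix and sampled per state; the probabilities below are made-up for illustration:

```python
import numpy as np

# Assumed stochastic policy pi(a|s) for 2 states and 2 actions,
# stored as pi[s, a]; each row is a distribution over actions.
pi = np.array([[0.7, 0.3],
               [0.1, 0.9]])

rng = np.random.default_rng(1)
s = 0
a = rng.choice(pi.shape[1], p=pi[s])   # draw a ~ pi(. | s)
print(a)
```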
- Given an MDP \(M = (S, A, P, R, \gamma, \mu)\) and a policy \(\pi\)
- The state sequence \(s_1, s_2, \dots\) is a Markov process \((S, P^{\pi}, \mu)\)
- The state and reward sequence \(s_1, r_2, s_2, \dots\) is a Markov reward process \((S, P^{\pi}, R^{\pi}, \gamma, \mu)\), where
\[P^{\pi}(s' \vert s) = \sum_{a \in A}\pi(s,a)P(s' \vert s, a), \qquad R^{\pi}(s) = \sum_{a \in A}\pi(s,a)R(s,a)\]
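Computing the induced MRP from an MDP and a policy is a pair of weighted averages over actions; a sketch on an assumed 2-state, 2-action MDP (all numbers illustrative):

```python
import numpy as np

# Assumed MDP: P[s, a, s'] = P(s'|s,a), R[s, a] = E[r|s,a], pi[s, a] = pi(a|s).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi = np.array([[0.5, 0.5],
               [1.0, 0.0]])

# P^pi(s'|s) = sum_a pi(a|s) P(s'|s,a)   ('t' indexes s')
P_pi = np.einsum('sa,sat->st', pi, P)
# R^pi(s)   = sum_a pi(a|s) R(s,a)
R_pi = np.sum(pi * R, axis=1)

print(P_pi)
print(R_pi)
```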
Value Functions
Given a policy \(\pi\), it is possible to define the utility of each state (policy evaluation):
- The state–value function \(V^{\pi}(s)\) of an MDP is the expected return starting from state s, and then following policy \(\pi\) \(V^{\pi}(s) = E_{\pi}[v_t\vert s_t=s]\)
For control purposes, rather than the value of each state, it is easier to consider the value of each action in each state.
The action–value function \(Q^{\pi}(s, a)\) is the expected return starting from state s, taking action a, and then following policy \(\pi\): \(Q^{\pi}(s, a) = E_{\pi}[v_t\vert s_t=s, a_t=a]\)
Bellman Expectation Equation
\[V^{\pi}(s) = E_{\pi}[r_{t+1} + \gamma V^{\pi}(s_{t+1}) \vert s_t = s] = \sum_{a \in A}\pi(a \vert s)\Big(R(s,a) + \gamma \sum_{s' \in S}P(s' \vert s, a)V^{\pi}(s')\Big)\]
The Bellman expectation equation can be expressed concisely using the induced MRP
\[V^{\pi} = R^{\pi} + \gamma P^{\pi}V^{\pi}\]
with direct solution
\[V^{\pi} = (I - \gamma P^{\pi})^{-1}R^{\pi}\]
Bellman operators for \(V^{\pi}\)
The Bellman operator for \(V^{\pi}\) is defined as \(T^{\pi}\) : \(R^{S}\rightarrow R^{S}\) (maps value functions to value functions):
\[(T^{\pi}V^{\pi})(s) = \sum_{a \in A}\pi(a \vert s)\Big(R(s,a) + \gamma \sum_{s' \in S}P(s' \vert s, a)V^{\pi}(s')\Big)\]
Using the Bellman operator, the Bellman expectation equation can be compactly written as:
\[T^{\pi}V^{\pi} = V^{\pi}\]
- \(V^{\pi}\) is a fixed point of the Bellman operator \(T^{\pi}\)
- This is a linear equation in \(V^{\pi}\) and \(T^{\pi}\)
- If \(0 < \gamma < 1\), then \(T^{\pi}\) is a contraction w.r.t. the maximum norm
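The contraction property is what makes iterative policy evaluation work: repeatedly applying \(T^{\pi}\) converges to the fixed point \(V^{\pi}\). A sketch on an assumed 2-state induced MRP (numbers illustrative), checked against the direct solution:

```python
import numpy as np

# Assumed induced MRP (P^pi, R^pi) for a fixed policy.
P_pi = np.array([[0.9, 0.1],
                 [0.5, 0.5]])
R_pi = np.array([1.0, 0.0])
gamma = 0.9

# Fixed-point iteration V <- T^pi V; converges since T^pi is a
# gamma-contraction in the max norm for gamma < 1.
V = np.zeros(2)
for _ in range(1000):
    V_new = R_pi + gamma * P_pi @ V      # (T^pi V)(s)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

# Compare with the direct solution V^pi = (I - gamma P^pi)^{-1} R^pi
V_direct = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
assert np.allclose(V, V_direct, atol=1e-8)
print(V)
```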
The Bellman operator for \(Q^{\pi}\) is defined as
\(T^{\pi}\) : \(R^{S\times A} \rightarrow R^{S\times A}\) (maps action–value functions to action–value functions):
\[(T^{\pi}Q^{\pi})(s,a) = R(s,a) + \gamma \sum_{s' \in S}P(s' \vert s, a)\sum_{a' \in A}\pi(a' \vert s')Q^{\pi}(s',a')\]
Using the Bellman operator, the Bellman expectation equation can be compactly written as:
\[T^{\pi}Q^{\pi} = Q^{\pi}\]
Optimal Value Function and Optimal Policy
Optimal Value Function
The optimal state–value function \(V^*(s)\) is the maximum value function over all policies
\[V^{*}(s) = \max_{\pi}V^{\pi}(s)\]
The optimal action–value function \(Q^*(s, a)\) is the maximum action–value function over all policies
\[Q^{*}(s,a) = \max_{\pi}Q^{\pi}(s,a)\]
- The optimal value function specifies the best possible performance in the MDP
- An MDP is “solved” when we know the optimal value function
Optimal Policy
Value functions define a partial ordering over policies
\(\pi\geq\pi^\prime\) if \(V^{\pi}(s) \geq V^{\pi^\prime}(s)\), \(\forall s\in S\)
Theorem
For any Markov Decision Process
- There exists an optimal policy \(\pi^*\) that is better than or equal to all other policies, \(\pi^*\geq\pi, \forall \pi\)
- All optimal policies achieve the optimal value function, \(V^{\pi^*}(s) = V^*(s)\)
- All optimal policies achieve the optimal action–value function, \(Q^{\pi^*}(s,a) = Q^*(s,a)\)
- There is always a deterministic optimal policy for any MDP
A deterministic optimal policy can be found by maximizing over \(Q^*(s,a)\)
\[\pi^{*}(a \vert s) = \begin{cases}1 & \text{if } a = \arg\max_{a \in A}Q^{*}(s,a)\\ 0 & \text{otherwise}\end{cases}\]
Bellman Optimality Equation
Bellman Optimality Equation for \(V^*\)
\(V^*(s) = \max_{a}Q^*(s,a) = \max_{a}\{ R(s,a)+ \gamma \sum_{s' \in S} P(s' \vert s,a) V^*(s') \}\)
Bellman Optimality Equation for \(Q^*\)
\[Q^{*}(s,a) = R(s,a) + \gamma \sum_{s' \in S}P(s' \vert s, a)V^{*}(s') = R(s,a) + \gamma \sum_{s' \in S}P(s' \vert s, a)\max_{a'}Q^{*}(s',a')\]
Principle of optimality:
the tail of an optimal policy is optimal for the "tail" problem. (That is, regardless of the initial state and of what has already happened, the remaining portion of an optimal policy is still optimal for the remaining subproblem.)
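The Bellman optimality backup \(V(s) \leftarrow \max_a \{R(s,a) + \gamma \sum_{s'} P(s' \vert s,a)V(s')\}\) can be iterated to compute \(V^*\) (value iteration), after which a deterministic optimal policy is read off greedily from \(Q^*\). A sketch on an assumed 2-state, 2-action MDP (all numbers illustrative):

```python
import numpy as np

# Assumed MDP: P[s, a, s'] = P(s'|s,a), R[s, a] = E[r|s,a].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

# Value iteration: V <- max_a [ R(s,a) + gamma sum_s' P(s'|s,a) V(s') ]
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * np.einsum('sat,t->sa', P, V)   # Q(s,a) backup
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

# Deterministic greedy policy: pi*(s) = argmax_a Q*(s,a)
pi_star = Q.argmax(axis=1)
print(V, pi_star)
```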