Lecture 2: Markov Decision Processes | Reinforcement Learning | David Silver

1. Markov Process / Markov chain

1.1. Markov process

A Markov process (or Markov chain) is a tuple \langle S,P \rangle such that

  • S is a finite set of states, and
  • P is a transition probability matrix.

In a Markov process, the initial state is assumed to be given; how the initial state is chosen is not part of the Markov process itself.

1.2. State transition probability

The state transition probability from state s to state s' is

    \[ P_{ss'}=\mathbb{P}[S_{t+1}=s' | S_{t}=s] \]

Episode

An episode of a Markov process is a trajectory of states sampled according to the transition probabilities.
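
As an illustration, here is a minimal Python sketch that samples an episode from a small, made-up 3-state Markov chain (the state names and transition probabilities are assumptions, not taken from the lecture):

    import numpy as np

    # A made-up 3-state Markov chain; P[i, j] = P(S_{t+1}=j | S_t=i).
    states = ["Class", "Pub", "Sleep"]
    P = np.array([[0.5, 0.2, 0.3],
                  [0.4, 0.4, 0.2],
                  [0.0, 0.0, 1.0]])   # "Sleep" is absorbing (terminal)

    def sample_episode(start=0, max_steps=20, rng=np.random.default_rng(0)):
        """Sample a trajectory (episode) by repeatedly drawing the next state from P."""
        s, episode = start, [start]
        for _ in range(max_steps):
            s = rng.choice(len(states), p=P[s])
            episode.append(s)
            if states[s] == "Sleep":   # stop at the terminal state
                break
        return [states[i] for i in episode]

    print(sample_episode())            # e.g. ['Class', 'Pub', ..., 'Sleep']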


2. Markov reward process

2.1. Markov reward process (MRP)

A Markov reward process is a tuple \langle S,P, R, \gamma \rangle such that

  • S is a finite set of states,
  • P is a transition probability matrix,
  • R is a reward function, and
    • The reward function for state s is R_{s}=\mathbb{E}[r_{t+1}|S_{t}=s].
    • r_{t} is a reward at time t.
  • \gamma is a discount factor, \gamma \in [0,1].
The steps of an MRP [sungjae]
  1. Be in a state.
  2. Move to a new state.
  3. Gain a reward on arriving at the new state.
  4. Go to step 1.
2.2. Return

The return at time-step t, denoted G_{t}, is the total discounted reward from time-step t onward.

    \[ G_{t}=r_{t+1}+\gamma r_{t+2}+\cdots = \sum_{k=0}^{\infty}{\gamma^{k}r_{t+1+k}} \]
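
A minimal sketch of computing G_{t} for a finite reward sequence (the rewards and \gamma below are made-up values); it uses the backward recursion G_{t} = r_{t+1} + \gamma G_{t+1}:

    # rewards is the list [r_{t+1}, r_{t+2}, ...] of a finite (episodic) trajectory.
    def discounted_return(rewards, gamma):
        """G_t = r_{t+1} + gamma * r_{t+2} + ... , computed backwards."""
        g = 0.0
        for r in reversed(rewards):   # G_t = r_{t+1} + gamma * G_{t+1}
            g = r + gamma * g
        return g

    print(discounted_return([-2, -2, -2, 10], gamma=0.9))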

2.3. Value function in MRP
State value function

The state value function v(s) of a Markov reward process is the expected return starting from state s.

    \[ v(s)=\mathbb{E}[G_{t}|S_{t}=s] \]

2.4. Bellman equation for MRP
Bellman equation for Markov reward processes

The Bellman equation for MRPs expresses v(s) in terms of the values v(s') of the successor states.

    \[ v(s)=R_{s}+\gamma\sum_{s'\in S}{P_{ss'}v(s')} \]

Bellman equation in matrix form

    \[ \begin{bmatrix} v(1) \\ \vdots \\ v(n) \end{bmatrix} = \begin{bmatrix} R_{1} \\ \vdots \\ R_{n} \end{bmatrix} + \gamma \begin{bmatrix} P_{11} & \cdots & P_{1n} \\ \vdots & \ddots & \vdots \\ P_{n1} & \cdots & P_{nn} \end{bmatrix} \begin{bmatrix} v(1) \\ \vdots \\ v(n) \end{bmatrix} \]

    \[ v=R+\gamma P v \]

Solving the Bellman equation

    \[ v=R+\gamma P v \]

    \[ (I-\gamma P)v=R \]

    \[ v=(I-\gamma P)^{-1}R \]

  • The computational complexity of solving for v directly is O(n^3) for n states.
  • This analytic method is only practical when n is small (a small numerical sketch follows this list).
  • If n is large, the following iterative methods are used to solve MRPs.
    1. Dynamic programming
    2. Monte-Carlo evaluation
    3. Temporal-Difference learning
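
A small numerical sketch of the closed-form solution v = (I - \gamma P)^{-1} R for a made-up 3-state MRP (the transition matrix, rewards, and \gamma are assumptions); np.linalg.solve is used instead of explicitly forming the inverse:

    import numpy as np

    # Made-up 3-state MRP: transition matrix P, expected immediate rewards R_s, discount gamma.
    P = np.array([[0.5, 0.2, 0.3],
                  [0.4, 0.4, 0.2],
                  [0.0, 0.0, 1.0]])
    R = np.array([-2.0, 1.0, 0.0])
    gamma = 0.9

    # Solve (I - gamma P) v = R for v.
    v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
    print(v)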

3. Markov decision process

3.1. Markov decision process (MDP)

A Markov decision process (MDP) is a Markov reward process with decisions (actions).

A Markov decision process is a tuple \langle S,A, P, R, \gamma \rangle such that

  • S is a finite set of states,
  • A is a finite set of actions,
  • P is a transition probability matrix,
    • P_{ss'}^{a}=\mathbb{P}[S_{t+1}=s' | S_{t}=s, A_{t}=a]
  • R is a reward function, and
    • The reward function for taking action a in state s is R_{s}^{a}=\mathbb{E}[r_{t+1} | S_{t}=s, A_{t}=a].
    • r_{t} is a reward at time t.
  • \gamma is a discount factor, \gamma \in [0,1].
The steps of an MDP [sungjae] (a minimal interaction loop is sketched after this list)
  1. Be in a state.
  2. Take an action.
  3. Gain a reward for the action.
  4. Arrive at a new state.
  5. Go to step 1.
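
A minimal sketch of this interaction loop for a made-up 2-state, 2-action MDP with actions chosen at random (all numbers are assumptions; choosing actions well is the role of a policy, introduced next):

    import numpy as np

    rng = np.random.default_rng(0)
    # Made-up dynamics and rewards: P[s, a, s'] = P(S_{t+1}=s' | S_t=s, A_t=a), R[s, a] = R_s^a.
    P = np.array([[[0.7, 0.3], [0.2, 0.8]],
                  [[0.9, 0.1], [0.4, 0.6]]])
    R = np.array([[1.0, 0.0],
                  [0.0, 2.0]])

    s = 0                                    # 1. be in a state
    for t in range(5):
        a = rng.integers(2)                  # 2. take an (here: random) action
        r = R[s, a]                          # 3. gain a reward for the action
        s = rng.choice(2, p=P[s, a])         # 4. arrive at a new state
        print(f"t={t}: action={a}, reward={r}, next state={s}")   # 5. repeat
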
3.2. Policy

A policy \pi is a distribution over actions a given state s.

    \[ \pi (a|s) = \mathbb{P}[A_{t}=a | S_{t}=s] \]

  • A policy fully defines the behavior of an agent.
  • MDP policies depend only on the current state (not on the history).
  • MDP policies are parameterized by states.
  • Policies are stationary (time-independent).
    • A_{t}\sim \pi(\cdot | S_{t}), \forall{t} > 0 (a minimal sampling sketch follows this list)
    • Stationary means the policy does not change over time; it is the opposite of dynamic (time-dependent).
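
A minimal sketch of representing a stochastic policy as a table pi[s, a] = \pi(a|s) and sampling A_{t} \sim \pi(\cdot | S_{t}) (the probabilities are made up):

    import numpy as np

    # Made-up policy table: each row is a distribution over actions for one state.
    pi = np.array([[0.8, 0.2],
                   [0.3, 0.7]])
    rng = np.random.default_rng(0)

    def sample_action(s):
        """Draw A_t ~ pi(. | S_t = s); the policy depends only on the current state."""
        return int(rng.choice(pi.shape[1], p=pi[s]))

    print([sample_action(0) for _ in range(5)])
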
3.3. Value functions in Markov decision processes

There are two value functions in MDPs: [1] the state-value function and [2] the action-value function.

State-value function

The state-value function v_{\pi}(s) of an MDP is the expected return [1] starting from state s and [2] following policy \pi.

    \[ v_{\pi}(s)=\mathbb{E}_{\pi}[G_{t}|S_{t}=s] \]

Action-value function

The action-value function q_{\pi}(s,a) is the expected return [1] starting from state s, [2] taking action a, and then [3] following policy \pi.

    \[ q_{\pi}(s,a)=\mathbb{E}_{\pi}[G_{t} | S_{t} = s , A_{t} = a] \]

3.4. Bellman expectation equation in MDP
Bellman expectation equation for v^{\pi}

    \[ v_{\pi}(s)=\sum_{a\in A}{\pi(a|s)q_{\pi}(s,a)} \]

  • s is the current state.
  • A is the set of actions that can be taken in state s.
  • a\in A is an action available in s.
  • \pi is a particular policy.
  • Conclusion: v_{\pi}(s) can be solved by q_{\pi}(s,a).
Bellman expectation equation for q^{\pi}

    \[ q_{\pi}(s,a) = R_{s}^{a} + \gamma \sum_{s' \in S}{P_{ss'}^{a} v_{\pi}(s')} \]

  • R_{s}^{a} is the expected immediate reward for taking action a in state s.
  • Conclusion: q_{\pi}(s,a) can be solved by v_{\pi}(s').
Bellman expectation equation for v^{\pi} (2)

    \[ v_{\pi}(s)=\sum_{a\in A}{\pi(a|s) \left[ R_{s}^{a} + \gamma \sum_{s' \in S}{P_{ss'}^{a} v_{\pi}(s')} \right]} \]

  • Conclusion: v_{\pi}(s) can be solved by v_{\pi}(s').
Bellman expectation equation for q^{\pi} (2)

    \[ q_{\pi}(s,a) = R_{s}^{a} + \gamma \sum_{s' \in S}{P_{ss'}^{a} \sum_{a'\in A}{\pi(a'|s')q_{\pi}(s',a')}} \]

  • Conclusion: q_{\pi}(s,a) can be solved by q_{\pi}(s',a').
Bellman expectation equation (matrix form)

    \[ v_{\pi} = R^{\pi} + \gamma P^{\pi} v_{\pi} \]

    \[ v_{\pi} = (I - \gamma P^{\pi})^{-1} R^{\pi} \]
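
As a sketch of the matrix form, the dynamics and rewards can be averaged over the policy to obtain P^{\pi} and R^{\pi}, and the resulting linear system solved directly; the MDP and policy below reuse the made-up numbers from the earlier sketches:

    import numpy as np

    P = np.array([[[0.7, 0.3], [0.2, 0.8]],    # P[s, a, s']
                  [[0.9, 0.1], [0.4, 0.6]]])
    R = np.array([[1.0, 0.0],                  # R[s, a]
                  [0.0, 2.0]])
    pi = np.array([[0.8, 0.2],                 # pi[s, a]
                   [0.3, 0.7]])
    gamma = 0.9

    # P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'},  R^pi_s = sum_a pi(a|s) R^a_s
    P_pi = np.einsum("sa,sax->sx", pi, P)
    R_pi = np.sum(pi * R, axis=1)

    # Solve v_pi = (I - gamma P^pi)^{-1} R^pi without forming the inverse.
    v_pi = np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)
    print(v_pi)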

3.5. Optimal value function
  • The optimal value function specifies the best possible performance in the MDP.
  • An MDP is 'solved' when we know the optimal value function.
Optimal state-value function

The optimal state-value function v_{*}(s) is the maximum state-value function over all policies.

    \[ v_{*}(s) = \max_{\pi}{v_{\pi}(s)} \]

Optimal action-value function

The optimal action-value function q_{*}(s,a) is the maximum action-value function over all policies.

    \[ q_{*}(s,a) = \max_{\pi} q_{\pi}(s,a) \]

Partial ordering over policies

    \[ \pi \geq \pi^{'} \textup{ if } v_{\pi}(s) \geq v_{\pi^{'}}(s), \forall{s} \]

Theorems

For any Markov decision process

  • [★] There exists an optimal policy \pi_{*} that is better than or equal to all other policies, i.e., \pi_{*} \geq \pi, \forall{\pi}.
  • All optimal policies achieve the optimal value function.
  • All optimal policies achieve the optimal action-value function.
Finding an optimal policy

An optimal policy can be constructed by maximizing over q_{*}(s,a), as in the following formulation.

    \[ \pi_{*}(a|s)= \begin{cases} 1 & \text{ if } a = \underset{a \in A}{\arg\max} \ {q_{*}(s,a)} \\ 0 & \text{ otherwise } \end{cases} \]
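
A minimal sketch of this greedy construction, assuming q_{*} is given as a table of shape (number of states, number of actions) with made-up values:

    import numpy as np

    # Made-up optimal action values: q_star[s, a] = q_*(s, a).
    q_star = np.array([[1.0, 3.0],
                       [2.5, 0.5]])

    # pi_*(a|s) puts probability 1 on an action maximizing q_*(s, a).
    greedy_actions = np.argmax(q_star, axis=1)          # best action in each state
    pi_star = np.eye(q_star.shape[1])[greedy_actions]   # one-hot rows: pi_*(a|s)
    print(greedy_actions)
    print(pi_star)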

3.6. Bellman optimality equation
Bellman optimality equation for v_{*}

    \[ v_{*}(s)=\underset{a}\max q_{*}(s,a) \]

Bellman optimality equation for q_{*}

    \[ q_{*}(s,a) = R_{s}^{a} + \gamma \sum_{s' \in S}{P_{ss'}^{a} v_{*}(s')} \]

Bellman optimality equation for v_{*} (2)

    \[ v_{*}(s)=\underset{a}\max \left[ R_{s}^{a} + \gamma \sum_{s' \in S}{P_{ss'}^{a} v_{*}(s')} \right] \]

Bellman optimality equation for q_{*} (2)

    \[ q_{*}(s,a) = R_{s}^{a} + \gamma \sum_{s' \in S}{P_{ss'}^{a} \underset{a'}\max q_{*}(s',a')} \]
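
The Bellman optimality equation is non-linear and has no closed-form solution in general, so iterative methods are used. A minimal sketch of fixed-point iteration on this equation (value iteration), using the made-up 2-state, 2-action MDP from the earlier sketches:

    import numpy as np

    P = np.array([[[0.7, 0.3], [0.2, 0.8]],    # P[s, a, s']
                  [[0.9, 0.1], [0.4, 0.6]]])
    R = np.array([[1.0, 0.0],                  # R[s, a]
                  [0.0, 2.0]])
    gamma = 0.9

    v = np.zeros(2)
    for _ in range(1000):
        # q(s, a) = R_s^a + gamma * sum_s' P_{ss'}^a v(s'), then v(s) = max_a q(s, a)
        q = R + gamma * np.einsum("sax,x->sa", P, v)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < 1e-8:
            break
        v = v_new
    print(v)                    # approximate v_*
    print(q.argmax(axis=1))     # a greedy (optimal) action per state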


4. Extensions to MDPs

[Our formulation] → [Extension]

  • finite states → infinite states
  • discrete states → continuous states
  • fully observable states → partially observable states
  • discounted future reward → undiscounted future reward
  • maximum future reward → averaged future reward
