#### 1. Markov Process / Markov chain

##### 1.1. Markov process

A **Markov process** or **Markov chain** is a tuple $(\mathcal{S}, \mathcal{P})$ such that

- $\mathcal{S}$ is a finite set of states, and
- $\mathcal{P}$ is a transition probability matrix.

In a Markov process, the initial state must be given; how the initial state is chosen is not specified by the Markov process itself.

##### 1.2. State transition probability

A **state transition probability** from $s$ to $s'$ is $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$.

###### Episode

An **episode** of a Markov process is a trajectory of states $S_1, S_2, S_3, \dots$ sampled from the transition probabilities.
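
As a minimal sketch (the states and transition probabilities below are made-up toy numbers), sampling an episode from a transition matrix could look like:

```python
import random

# A toy 3-state Markov chain; states and probabilities are illustrative.
STATES = ["Class", "Pub", "Sleep"]          # "Sleep" is absorbing
P = {
    "Class": {"Class": 0.5, "Pub": 0.3, "Sleep": 0.2},
    "Pub":   {"Class": 0.6, "Pub": 0.3, "Sleep": 0.1},
    "Sleep": {"Sleep": 1.0},                # absorbing state
}

def sample_episode(start, max_steps=50, seed=0):
    """Sample one trajectory S1, S2, ... until the absorbing state."""
    rng = random.Random(seed)
    s, episode = start, [start]
    for _ in range(max_steps):
        if s == "Sleep":
            break
        next_states = list(P[s])
        s = rng.choices(next_states, weights=[P[s][s2] for s2 in next_states])[0]
        episode.append(s)
    return episode

print(sample_episode("Class"))
```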

#### 2. Markov reward process

##### 2.1. Markov reward process (MRP)

A **Markov reward process** is a tuple $(\mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma)$ such that

- $\mathcal{S}$ is a finite set of states,
- $\mathcal{P}$ is a transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$,
- $\mathcal{R}$ is a reward function, $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$, and
  - $R_t$ is a reward at time $t$.
- $\gamma$ is a discount factor, $\gamma \in [0, 1]$.

###### The steps of an MRP [sungjae]

1. Be on a state.
2. Move to a new state.
3. Gain a reward on arriving at the new state.
4. Go to step 1.

##### 2.2. Return

The **return** at time-step $t$ is denoted by $G_t$:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$$

$G_t$ is the total discounted reward from time-step $t$.
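
As a sanity check, the return of a finite reward sequence can be computed with a small helper (the function name and the numbers are illustrative):

```python
def discounted_return(rewards, gamma):
    """G_t for a list [R_{t+1}, R_{t+2}, ...], accumulated back-to-front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g    # Horner-style evaluation of the discounted sum
    return g

# G_t = 1 + 0.5 * 2 + 0.5**2 * 3 = 2.75
print(discounted_return([1.0, 2.0, 3.0], gamma=0.5))
```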

##### 2.3. Value function in MRP

###### State value function

The state value function $v(s)$ of a Markov reward process is the expected return starting from state $s$: $v(s) = \mathbb{E}[G_t \mid S_t = s]$.
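
Because $v(s)$ is an expectation, it can be estimated by averaging sampled returns. A sketch on a made-up 2-state MRP (from "A" you earn reward 1 and move to "A" or the absorbing state "B", reward 0, with equal probability; $\gamma = 0.9$):

```python
import random

# Monte-Carlo estimate of v(s) = E[G_t | S_t = s]; all numbers are toy values.
P = {"A": {"A": 0.5, "B": 0.5}, "B": {"B": 1.0}}
REWARD = {"A": 1.0, "B": 0.0}   # R_{t+1} as a function of S_t
gamma = 0.9

def sample_return(start, rng, max_steps=200):
    """Sample one episode and accumulate its discounted return."""
    s, g, discount = start, 0.0, 1.0
    for _ in range(max_steps):
        if s == "B":                 # absorbing state: no further reward
            break
        g += discount * REWARD[s]
        discount *= gamma
        s = rng.choices(list(P[s]), weights=list(P[s].values()))[0]
    return g

rng = random.Random(0)
estimate = sum(sample_return("A", rng) for _ in range(20000)) / 20000
print(estimate)   # approaches the analytic value 1 / 0.55 = 1.818...
```

For this chain the exact value solves $v(A) = 1 + 0.9 \cdot 0.5 \, v(A)$, i.e. $v(A) = 1/0.55$, so the estimate can be checked against it.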

##### 2.4. Bellman equation for MRP

###### Bellman equation for Markov reward processes

The Bellman equation for MRPs implies that $v(s)$ can be represented by the values of the successor states $s'$:

$$v(s) = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}\, v(s')$$

###### Bellman equation in matrix form

$$v = \mathcal{R} + \gamma \mathcal{P} v$$

###### Solving the Bellman equation

- The analytic solution is $v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$.
- The computational complexity to solve it is $O(n^3)$ for $n$ states.
- This analytic method is only feasible if $n$ is small.
- If $n$ is large, the following iterative methods are used to solve MRPs.
  - Dynamic programming
  - Monte-Carlo evaluation
  - Temporal-difference learning
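
For small $n$ the analytic solution can be computed directly. A sketch on a made-up 2-state MRP, using the closed-form $2 \times 2$ inverse instead of a linear-algebra library:

```python
# Analytic solution of the Bellman equation, v = (I - gamma * P)^{-1} R,
# for a made-up 2-state MRP.
gamma = 0.9
P = [[0.5, 0.5],    # transition probability matrix
     [0.2, 0.8]]
R = [1.0, -1.0]     # expected reward per state

# A = I - gamma * P
A = [[1 - gamma * P[0][0],    -gamma * P[0][1]],
     [   -gamma * P[1][0], 1 - gamma * P[1][1]]]

# Invert the 2x2 matrix A = [[a, b], [c, d]] directly: A^{-1} = adj(A) / det(A).
a, b, c, d = A[0][0], A[0][1], A[1][0], A[1][1]
det = a * d - b * c
v = [( d * R[0] - b * R[1]) / det,
     (-c * R[0] + a * R[1]) / det]

# v must satisfy v(s) = R(s) + gamma * sum_s' P[s][s'] * v(s') in every state.
for s in range(2):
    assert abs(v[s] - (R[s] + gamma * sum(P[s][s2] * v[s2] for s2 in range(2)))) < 1e-9
print(v)
```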

#### 3. Markov decision process

##### 3.1. Markov decision process (MDP)

A **Markov decision process** (**MDP**) is a Markov reward process with decisions, i.e., actions.

A **Markov decision process** is a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$ such that

- $\mathcal{S}$ is a finite set of states,
- $\mathcal{A}$ is a finite set of actions,
- $\mathcal{P}$ is a transition probability matrix, $\mathcal{P}^{a}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$,
- $\mathcal{R}$ is a reward function, $\mathcal{R}^{a}_{s} = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$, and
  - $R_t$ is a reward at time $t$.
- $\gamma$ is a discount factor, $\gamma \in [0, 1]$.

###### The steps of an MDP [sungjae]

1. Be on a state.
2. Take an action.
3. Gain a reward by the action.
4. Arrive at a new state.
5. Go to step 1.

##### 3.2. Policy

A **policy** $\pi$ is a distribution over actions given states: $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$.

- A policy defines the action [behavior] of an agent.
- MDP policies depend only on the current state.
- MDP policies are parameterized by states.
- Policies are **stationary** [time-independent]: $A_t \sim \pi(\cdot \mid S_t), \forall t > 0$.
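
A sketch of sampling actions from such a distribution (state names, action names, and probabilities are all made-up):

```python
import random

# A policy as a distribution over actions given the current state.
pi = {
    "s0": {"left": 0.7, "right": 0.3},
    "s1": {"left": 0.1, "right": 0.9},
}

def sample_action(pi, s, rng):
    """A_t ~ pi(. | S_t): the choice depends only on the current state."""
    actions = list(pi[s])
    return rng.choices(actions, weights=[pi[s][a] for a in actions])[0]

rng = random.Random(0)
print([sample_action(pi, "s0", rng) for _ in range(5)])
```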

##### 3.3. Value functions in Markov decision processes

There are two value functions in MDPs: [1] the state-value function and [2] the action-value function.

###### State-value function

The **state-value function** $v_{\pi}(s)$ of an MDP is the expected return [1] starting from state $s$ and [2] following policy $\pi$: $v_{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]$.

###### Action-value function

The **action-value function** $q_{\pi}(s, a)$ is the expected return [1] starting from state $s$, [2] taking action $a$, and then [3] following policy $\pi$: $q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a]$.

##### 3.4. Bellman expectation equation in MDP

###### Bellman expectation equation for $v_{\pi}$

$$v_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_{\pi}(s, a)$$

- $s$ is the current state.
- $\mathcal{A}$ is the set of the actions able to be taken on $s$.
- $a$ is an action on $s$.
- $\pi$ is a particular policy.
- Conclusion: $v_{\pi}$ can be solved by $q_{\pi}$.

###### Bellman expectation equation for $q_{\pi}$

$$q_{\pi}(s, a) = \mathcal{R}^{a}_{s} + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^{a}_{ss'}\, v_{\pi}(s')$$

- $\mathcal{R}^{a}_{s}$ is the reward function that returns a reward if action $a$ is taken on state $s$.
- Conclusion: $q_{\pi}$ can be solved by $v_{\pi}$.

###### Bellman expectation equation for $v_{\pi}$ (2)

$$v_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}^{a}_{s} + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^{a}_{ss'}\, v_{\pi}(s') \right)$$

- Conclusion: $v_{\pi}$ can be solved by $v_{\pi}$ itself.

###### Bellman expectation equation for $q_{\pi}$ (2)

$$q_{\pi}(s, a) = \mathcal{R}^{a}_{s} + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^{a}_{ss'} \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_{\pi}(s', a')$$

- Conclusion: $q_{\pi}$ can be solved by $q_{\pi}$ itself.

###### Bellman expectation equation (matrix form)

$$v_{\pi} = \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} v_{\pi}, \qquad v_{\pi} = (I - \gamma \mathcal{P}^{\pi})^{-1} \mathcal{R}^{\pi}$$
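
One practical use of the expectation equation is iterative policy evaluation: apply the equation as an update rule until it converges to its fixed point $v_{\pi}$. A sketch on a made-up 2-state, 2-action MDP under a uniform random policy:

```python
# Iterative policy evaluation via the Bellman expectation equation.
# All transition probabilities and rewards below are made-up toy numbers.
gamma = 0.9
S, A = [0, 1], [0, 1]
# P[s][a] is the next-state distribution; here, action a moves to state a.
P = {0: {0: [1.0, 0.0], 1: [0.0, 1.0]},
     1: {0: [1.0, 0.0], 1: [0.0, 1.0]}}
R = {0: {0: 0.0, 1: 1.0},   # R[s][a]
     1: {0: 2.0, 1: 0.0}}
pi = {s: {a: 0.5 for a in A} for s in S}   # uniform random policy

v = {s: 0.0 for s in S}
for _ in range(1000):
    # v(s) <- sum_a pi(a|s) * (R(s,a) + gamma * sum_s' P(s'|s,a) * v(s'))
    v = {s: sum(pi[s][a] * (R[s][a] + gamma * sum(P[s][a][s2] * v[s2] for s2 in S))
                for a in A)
         for s in S}
print(v)   # converges to v_pi = {0: 7.25, 1: 7.75}
```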

##### 3.5. Optimal value function

- The **optimal value function** specifies the best possible performance in the MDP.
- '**An MDP is solved**' when we know the optimal value function.

###### Optimal state-value function

The **optimal state-value function** $v_{*}(s)$ is the maximum state-value function over all policies: $v_{*}(s) = \max_{\pi} v_{\pi}(s)$.

###### Optimal action-value function

The **optimal action-value function** $q_{*}(s, a)$ is the maximum action-value function over all policies: $q_{*}(s, a) = \max_{\pi} q_{\pi}(s, a)$.

###### Partial ordering over policies

$$\pi \geq \pi' \iff v_{\pi}(s) \geq v_{\pi'}(s), \forall s$$

###### Theorems

For any Markov decision process,

- [★] There exists an optimal policy $\pi_{*}$ that is better than or equal to all other policies, i.e., $\pi_{*} \geq \pi, \forall \pi$.
- All optimal policies achieve the optimal value function: $v_{\pi_{*}}(s) = v_{*}(s)$.
- All optimal policies achieve the optimal action-value function: $q_{\pi_{*}}(s, a) = q_{*}(s, a)$.

###### Finding an optimal policy

An optimal policy can be constructed by acting greedily with respect to $q_{*}(s, a)$:

$$\pi_{*}(a \mid s) = \begin{cases} 1 & \text{if } a = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\; q_{*}(s, a) \\ 0 & \text{otherwise} \end{cases}$$
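
A sketch of this greedy construction over a table of q-values (the state name, action names, and values are made-up):

```python
# Greedy policy extraction: pi_*(a|s) puts probability 1 on argmax_a q_*(s, a).
def greedy_policy(q, s):
    best = max(q[s], key=q[s].get)          # argmax_a q(s, a)
    return {a: 1.0 if a == best else 0.0 for a in q[s]}

q_star = {"s0": {"left": 1.2, "right": 3.4}}   # illustrative q-values
print(greedy_policy(q_star, "s0"))             # {'left': 0.0, 'right': 1.0}
```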

##### 3.6. Bellman optimality equation

###### Bellman optimality equation for $v_{*}$

$$v_{*}(s) = \max_{a} q_{*}(s, a)$$

###### Bellman optimality equation for $q_{*}$

$$q_{*}(s, a) = \mathcal{R}^{a}_{s} + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^{a}_{ss'}\, v_{*}(s')$$

###### Bellman optimality equation for $v_{*}$ (2)

$$v_{*}(s) = \max_{a} \left( \mathcal{R}^{a}_{s} + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^{a}_{ss'}\, v_{*}(s') \right)$$

###### Bellman optimality equation for $q_{*}$ (2)

$$q_{*}(s, a) = \mathcal{R}^{a}_{s} + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^{a}_{ss'} \max_{a'} q_{*}(s', a')$$
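
Turning the Bellman optimality equation for $v_{*}$ into an update rule gives value iteration. A sketch on a made-up 2-state, 2-action toy MDP:

```python
# Value iteration: iterate v(s) <- max_a (R(s,a) + gamma * sum_s' P * v(s')).
# All transition probabilities and rewards below are made-up toy numbers.
gamma = 0.9
S, A = [0, 1], [0, 1]
P = {0: {0: [1.0, 0.0], 1: [0.0, 1.0]},   # P[s][a]: next-state distribution
     1: {0: [1.0, 0.0], 1: [0.0, 1.0]}}
R = {0: {0: 0.0, 1: 1.0},                 # R[s][a]
     1: {0: 2.0, 1: 0.0}}

v = {s: 0.0 for s in S}
for _ in range(2000):
    v = {s: max(R[s][a] + gamma * sum(P[s][a][s2] * v[s2] for s2 in S) for a in A)
         for s in S}

# Recover q_* from v_*, and a greedy optimal policy from q_*.
q = {s: {a: R[s][a] + gamma * sum(P[s][a][s2] * v[s2] for s2 in S) for a in A}
     for s in S}
pi_star = {s: max(q[s], key=q[s].get) for s in S}
print(v, pi_star)
```

For this toy MDP the fixed point can be checked by hand: the optimal policy cycles between the two states, giving $v_{*}(0) = 1 + 0.9\,v_{*}(1)$ and $v_{*}(1) = 2 + 0.9\,v_{*}(0)$.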

#### 4. Extensions to MDPs

[Our formulation] → [Extension]

- finite states → infinite states
- discrete states → continuous states
- fully observable states → partially observable states
- discounted future reward → undiscounted future reward
- maximum future reward → averaged future reward