Fermer X

A Numerical Toy Stochastic Control Problem
Solved by Dynamic Programming

P. Carpentier, J.-P. Chancelier, M. De Lara and V. Leclère
(last modification date: October 25, 2017)
Version pdf de ce document
Version sans bandeaux



In this practical work, you will solve a toy stochastic control problem using the dynamic programming algorithm.

1 Formulating a stochastic control problem

1.1 Problem statement

We consider a control system given by the dynamics

 x(t + 1) =  x(t)+  u(t) +     w(t + 1)    ,
 ◟--◝◜--◞    ◟◝◜◞   ◟◝◜◞       ◟--◝◜--◞
future state  state  control  random perturbation


  • time t is discrete and belongs to 𝕋 ∪ {tf} , where
    𝕋 =  {ti,ti + 1,...,tf − 1},

    where we call ti the initial time and tf the horizon;

  • x(t) is the value of the state at the beginning of period [t,t + 1[, belonging to a finite set
𝕏 =  {1,2,...,x },

    where x 5 is an integer;

  • u(t) is the control during [t,t + 1[, decided at the beginning of the time period [t,t + 1[, and belonging to the finite set
    𝕌 = {1, 0,− 1 };

  • w(t + 1) is a random perturbation occuring during the time period [t,t + 1[, belonging to the finite set
    𝕎  =  {1,− 1},

    and we assume that

    • the random variables w(ti + 1),,w(tf) are independent;
    • each random variable w(t) follows a uniform probability distribution on the set 𝕎
ℙw (t){− 1} = ℙw (t){1 } = 2.

The manager is supposed to make a decision u(t) at each t 𝕋, at the beginning of the period [t,t + 1[, before knowing the random perturbation w(t + 1): such a setup is called decision-hazard setting. As the random perturbation w(t + 1) comes after the decision u(t), it might be that the future state x(t + 1) in (1) goes outside the state set 𝕏. Therefore, at the boundaries 1 and x 5 of the state set 𝕏, we have to limit the set of possible controls to ensure that the future state will remain in the prescribed bounds.

Question 1

[3] Exhibit, for each possible state x 𝕏, the largest subset 𝔹(x) of 𝕌 which ensures that the next state — as computed by the system dynamics (1) with x(t) = x — will remain in the set 𝕏, whatever the random perturbation w(t + 1) and the decision u(t) 𝔹(x).
[2] Write a Scicoslab function which returns the subset 𝔹(x) of 𝕌 (as a row vector), given the state x as entry.

Let xref 𝕏 be a given reference state to be achieved at horizon time tf. The aim of the manager is to minimize the final cost

K (x(tf))   where   K (x ) = (x − xref) .

For this purpose, the manager can select among admissible (state) policies. An admissible policy is a mapping γ : 𝕋 × 𝕏 𝕌 that assigns an admissible control to any time and state, that is, γ(t,x) 𝔹(x), for all (t,x) 𝕋 × 𝕏.

Let xi 𝕏 be a given initial state. The stochastic control problem to solve is now

  min   𝔼((x(t ) − x  )2),
γ:𝕋×𝕏→ 𝕌       f     ref

subject to the dynamics (1) and constraints that

x(t) = x , u(t) = γ(t,x(t)) ∈ 𝔹 (x(t)), ∀t ∈ 𝕋.
   i    i

1.2 Numerical data

We take

  • ti = 1 and tf = 10;
  • x = 9 and x ref = 5.

2 Writing the stochastic dynamic programming equation

2.1 Stochastic dynamic programming equation

Since the perturbations w(ti + 1),,w(tf) are supposed to be independent random variables, Dynamic Programming applies and the equation associated with the problem of minimizing the expected cost (14) writes


for time t varying from tf 1 to ti. The expectation 𝔼 is taken with respect to the marginal probability distribution w(t+1) given by (6).

initialization V (tf,x) = K(x); for  t = tf,tf 1,,ti + 1,ti do for  x 𝕏 do V (t,x) = + ;
for  u 𝔹(xdo

vu = 0 ;
for  w 𝕎 do

vu = vu + 1
2(V (t + 1,x + u + w)) ;

if vu < V (t,xthen

V (t,x) = vu ;
γ(t,x) = u ; Algorithm 1:  Dynamic programming algorithm corresponding to (10)

2.2 Pseudo code for the stochastic dynamic programming equation

Writing the code to compute the Bellman function V in (10) (and the optimal control) follows a standard scheme, sketched in Algorithm 1. In the simple example of §1.1, we directly store the Bellman values in a 2-indices matrix V = (V (t,x))t[ti,tf],x𝕏.

  // Initialization 
  // Initialization of the matrix used to store Bellman values
  // Compute Bellman function at final time 
  // make a backward loop on time
  for t=Tf-1:-1:Ti do
    // make a loop on possible states 
    for x=ZZZZ do
      // make a loop on admissible controls (for state x) to obtain the best possible
      // expected cost starting from state x in cost_x
      for u=ZZZZ do
        // make a loop on the random perturbation to compute the expected
        // cost cost_x_u 
        for w=ZZZZ do
          // for a given perturbation compute the cost associated to
          // control u and state x using the instantaneous cost and the
          // Bellman function at time t+1  
          //cost_x_u_w = (cost_t(x,u) + V(t+1,ZZZZ))*p(w)
        // update cost_x_u with cost_x_u_w 
        // compare the current cost_x_u to cost_x and 
        // update cost_x if cost_x_u is better than the current cost_x
        if ZZZZ then ZZZZ,else ZZZZ,end;
      // store cost_x in V(t,x)

Question 2

[3] Use the previous pseudo code pattern to program the computation of the Bellman equation (10) associated with the minimization problem described in §1.1.
[3] Draw and comment on the obtained Bellman values. Compute V (t, 5) and comment on the value obtained. Explain why V (t, 1) decreases as time t goes on.

Question 3 [3] Modify the previous code in order to also keep track of the optimal control function. As for the Bellman value function, you will use a 2-indices matrix U = (U(t,x))t[ti,tf1],x𝕏 to store the optimal control at time t when the current state is x. Do you find the optimal policy intuitive?

3 Computing and evaluating an optimal policy

3.1 Computing an optimal policy

Note that, if we know the Bellman function V , we can use Equation (10) to compute the optimal control  ♯
u (t,x )  to be used for a given state x at time t by

u ♯(t,x) ∈ arg min  𝔼 (V(t + 1,x + u + w (t + 1))))

This way of doing is used when Bellman functions are computed on grids, but that the system dynamics produce states outside the grid. In that case, we use interpolation to compute Bellman values and we obtain better controls by recomputing them from Equation (10), than by using stored controls associated with states limited to the grid.

Question 4 [2] Write a function which returns the optimal control knowing current time t and state x. This function will use the previouly computed matrix V filled with Bellman optimal values.

3.2 Evaluation of a given policy

As defined in §1.1, an admissible policy γ : 𝕋 × 𝕏 𝕌 is such that γ(t,x) 𝔹(x). The optimal control obtained by dynamic programming is an admissible policy.

Given an admissible policy γ and a (perturbation) scenario w() = (w(ti + 1),,w(tf)), we are able to build a state trajectory x() = (x(ti),,x(tf 1),x(tf)) and a control trajectory u() = (u(ti),,u(tf 1)) produced by the “closed-loop” dynamics initialized at the initial time ti by


We also obtain the payoff associated with the scenario w()

J  (ti,xi,w (⋅)) = K (x(tf)),

where x() and u() are given by (12). The expected payoff associated with the policy γ is

𝔼(J  (ti,xi,w(⋅))),

where the expectation 𝔼 is taken with respect to the product probability , whose marginals are given by (6). The true expected value (14) is difficult to compute,1 and we evaluate it by the Monte Carlo method using Ns scenarios (w1(),,wNs()):

1--    Jγ(ti,xi,ws (⋅)).
Ns s=1

By the law of large numbers, the mean payoff (15) is a “good” approximation of the expected payoff(14) if the number of scenarios is “large enough”.

We propose three different functions in order to evaluate the mean payoff associated with a policy given by a macro named policy.

The first version uses a Monte Carlo approach. The third argument of the function N is the number of requested simulation for Monte Carlo.

  function cost=simulation_mc(x0,policy,N)
    for i=1:N do
      for t=1:TF-1 do

The second version uses the law of the random perturbations to compute the expectation.

  function cost=simulation_ex(x0,policy)
    // Exact computation with the law of W
    for i=1:size(Wa,'r') do
      for t=1:TF-1 do
  function W=all_w(n)
    // generated all the possible (W_1,...,W_(TF-1))
    if n==1 then

The third version uses dynamic programming.

  function costs=simulation_dp(policy)
    // evaluation by dynamic programming with fixed policy
    // Bellman function at time TF
    Vs(TF,:)=(X-xref) .^2;
    // Compute final value functions
    // Loop backward over time:
    for t=(TF-1):-1:1 do
      for x=1:10 do
        // loop on noises 
        for iw=1:size(W,"*") do

Question 5 [4] Use the proposed simulation macros to evaluate the mean payoff associated with the optimal policy in (11). Compare the different proposed solutions.

L'École des Ponts ParisTech est membre fondateur de

L'École des Ponts ParisTech est certifiée 

ParisTech ParisEst ParisTech


Copyright 2014
École des Ponts ParisTech
All rights reserved