A Numerical Toy Stochastic Control Problem
Solved by Dynamic Programming

P. Carpentier, J.-P. Chancelier, M. De Lara and V. Leclère
(last modiﬁcation date: March 7, 2018)

Version pdf de ce document
Version sans bandeaux

1 Formulating a stochastic control problem
2 Writing the stochastic dynamic programming equation
3 Computing and evaluating an optimal policy

1 Formulating a stochastic control problem
1.1 Problem statement
1.2 Numerical data
2 Writing the stochastic dynamic programming equation
2.1 Stochastic dynamic programming equation
2.2 Pseudo code for the stochastic dynamic programming equation
3 Computing and evaluating an optimal policy
3.1 Computing an optimal policy
3.2 Evaluation of a given policy

Abstract

In this practical work, you will solve a toy stochastic control problem using the dynamic programming algorithm.

1 Formulating a stochastic control problem

1.1 Problem statement

We consider a control system given by the dynamics

x(t + 1) = x(t)+ u(t) + w(t + 1) , ◟--◝◜--◞ ◟◝◜◞ ◟◝◜◞ ◟--◝◜--◞ future state state control random perturbation

(1)

where

time t is discrete and belongs to $𝕋 ∪ {tf}$ , where
$𝕋 = {ti,ti + 1,...,tf − 1},$ (2)

where we call t_i the initial time and t_f the horizon;
x(t) is the value of the state at the beginning of period [t,t + 1[, belonging to a ﬁnite set
$♯ 𝕏 = {1,2,...,x },$ (3)

where x^♯ ≥ 5 is an integer;
u(t) is the control during [t,t + 1[, decided at the beginning of the time period [t,t + 1[, and belonging to the ﬁnite set
$𝕌 = {1, 0,− 1 };$ (4)
w(t + 1) is a random perturbation occuring during the time period [t,t + 1[, belonging to the ﬁnite set
$𝕎 = {1,− 1},$ (5)

and we assume that
- the random variables w(t_i + 1),…,w(t_f) are independent;
- each random variable w(t) follows a uniform probability distribution on the set 𝕎
  $1- ℙw (t){− 1} = ℙw (t){1 } = 2.$ (6)

The manager is supposed to make a decision u(t) at each t ∈ 𝕋, at the beginning of the period [t,t + 1[, before knowing the random perturbation w(t + 1): such a setup is called decision-hazard setting. As the random perturbation w(t + 1) comes after the decision u(t), it might be that the future state x(t + 1) in (1) goes outside the state set 𝕏. Therefore, at the boundaries 1 and x^♯ ≥ 5 of the state set 𝕏, we have to limit the set of possible controls to ensure that the future state will remain in the prescribed bounds.

Question 1

(a): [3] Exhibit, for each possible state x ∈ 𝕏, the largest subset 𝔹(x) of 𝕌 which ensures that the next state — as computed by the system dynamics (1) with x(t) = x — will remain in the set 𝕏, whatever the random perturbation w(t + 1) and the decision u(t) ∈ 𝔹(x). Explain how you determine the set 𝔹(x).
(b): [2] Write a Scicoslab function which returns the subset 𝔹(x) of 𝕌 (as a row vector), given the state x as entry.

Let x_ref ∈ 𝕏 be a given reference state to be achieved at horizon time t_f. The aim of the manager is to minimize the ﬁnal cost

2 K (x(tf)) where K (x ) = (x − xref) .

(7)

For this purpose, the manager can select among admissible (state) policies. An admissible policy is a mapping γ : 𝕋 × 𝕏 → 𝕌 that assigns an admissible control to any time and state, that is, γ(t,x) ∈ 𝔹(x), for all (t,x) ∈ 𝕋 × 𝕏.

Let x_i ∈ 𝕏 be a given initial state. The stochastic control problem to solve is now

min 𝔼((x(t ) − x )2), γ:𝕋×𝕏→ 𝕌 f ref

(8)

subject to the dynamics (1) and constraints that

x(t) = x , u(t) = γ(t,x(t)) ∈ 𝔹 (x(t)), ∀t ∈ 𝕋. i i

(9)

1.2 Numerical data

We take

t_i = 1 and t_f = 10;
x^♯ = 9 and x_ref = 5.

2 Writing the stochastic dynamic programming equation

2.1 Stochastic dynamic programming equation

Since the perturbations w(t_i + 1),…,w(t_f) are supposed to be independent random variables, Dynamic Programming applies and the equation associated with the problem of minimizing the expected cost (14) writes

for time t varying from t_f − 1 to t_i. The expectation 𝔼 is taken with respect to the marginal probability distribution ℙ_w(t+1) given by (6).

initialization V (t_f,x) = K(x); for t = t_f,t_f − 1,…,t_i + 1,t_i do for x ∈ 𝕏 do V (t,x) = +∞ ;
for u ∈ 𝔹(x) do

v_u = 0 ;
for w ∈ 𝕎 do

v_u = v_u + 1
2 V (t + 1,x + u + w) ;

if v_u < V (t,x) then

V (t,x) = v_u ;
γ(t,x) = u ; Algorithm 1: Dynamic programming algorithm corresponding to (10)

2.2 Pseudo code for the stochastic dynamic programming equation

Writing the code to compute the Bellman function V in (10) (and the optimal control) follows a standard scheme, sketched in Algorithm 1. In the simple example of §1.1, we directly store the Bellman values in a 2-indices matrix V = (V (t,x))_{t∈[t_i,t_f],x∈𝕏}.

  // Initialization
  Tf=ZZZZ;
  Nx=ZZZZ;
  state=[1:Nx];
  // Initialization of the matrix used to store Bellman values
  V=ones(Tf,Nx)*%inf;
  // Compute Bellman function at final time
  V(Tf,:)=ZZZZ
  // make a backward loop on time
  for t=Tf-1:-1:Ti do
    // make a loop on possible states
    for x=ZZZZ do
      // make a loop on admissible controls (for state x) to obtain the best possible
      // expected cost starting from state x in cost_x
      cost_x=%inf
      for u=ZZZZ do
        // make a loop on the random perturbation to compute the expected
        // cost cost_x_u
        for w=ZZZZ do
          // for a given perturbation compute the cost associated to
          // control u and state x using the instantaneous cost and the
          // Bellman function at time t+1
          cost_x_u_w=0.5*V(t+1,ZZZZ)
          //cost_x_u_w = (cost_t(x,u) + V(t+1,ZZZZ))*p(w)
        end
        // update cost_x_u with cost_x_u_w
        cost_x_u=ZZZZ
        // compare the current cost_x_u to cost_x and
        // update cost_x if cost_x_u is better than the current cost_x
        if ZZZZ then ZZZZ,else ZZZZ,end;
      end
      // store cost_x in V(t,x)
      ZZZZ
    end
  end

Question 2

(a): [3] Use the previous pseudo code pattern to program the computation of the Bellman equation (10) associated with the minimization problem described in §1.1.
(b): [4] Draw the obtained Bellman values as functions of the state (one curve for each time t). Comment on their shapes and the values they take. Compute V (t, 5) and comment on the value obtained. Explain why V (t, 1) increases as time t goes from t_i to the horizon t_f.

Question 3 [3] Modify the previous code in order to also keep track of the optimal control function. As for the Bellman value function, you will use a 2-indices matrix U = (U(t,x))_{t∈[t_i,t_f−1],x∈𝕏} to store the optimal control at time t when the current state is x. Display the matrix U. Do you ﬁnd the optimal policy intuitive?

3 Computing and evaluating an optimal policy

3.1 Computing an optimal policy

Note that, if we know the Bellman function V , we can use Equation (10) to compute the optimal control $♯ u (t,x )$ to be used for a given state x at time t by

u ♯(t,x) ∈ arg min 𝔼 (V(t + 1,x + u + w (t + 1)))) u∈𝔹(x)

(11)

This way of doing is used when Bellman functions are computed on grids, but that the system dynamics produce states outside the grid. In that case, we use interpolation to compute Bellman values and we obtain better controls by recomputing them from Equation (10), than by using stored controls associated with states limited to the grid.

Question 4 [2] Write a function which returns the optimal control knowing current time t and state x. This function will use the previouly computed matrix V ﬁlled with Bellman optimal values.

3.2 Evaluation of a given policy

As deﬁned in §1.1, an admissible policy γ : 𝕋 × 𝕏 → 𝕌 is such that γ(t,x) ∈ 𝔹(x). The optimal control obtained by dynamic programming is an admissible policy.

Given an admissible policy γ and a (perturbation) scenario w(⋅) = w(t_i + 1),…,w(t_f), we are able to build a state trajectory x(⋅) = x(t_i),…,x(t_f − 1),x(t_f) and a control trajectory u(⋅) = u(t_i),…,u(t_f − 1) produced by the “closed-loop” dynamics initialized at the initial time t_i by

We also obtain the payoﬀ associated with the scenario w(⋅)

γ J (ti,xi,w (⋅)) = K (x(tf)),

(13)

where x(⋅) and u(⋅) are given by (12). The expected payoﬀ associated with the policy γ is

γ 𝔼(J (ti,xi,w(⋅))),

(14)

where the expectation 𝔼 is taken with respect to the product probability ℙ, whose marginals are given by (6). The true expected value (14) is diﬃcult to compute,¹ and we evaluate it by the Monte Carlo method using N_s scenarios w¹(⋅),…,w^N_s(⋅):

∑Ns 1-- Jγ(ti,xi,ws (⋅)). Ns s=1

(15)

By the law of large numbers, the mean payoﬀ (15) is a “good” approximation of the expected payoﬀ(14) if the number of scenarios is “large enough”.

We propose three diﬀerent functions in order to evaluate the mean payoﬀ associated with a policy given by a macro named policy.

The ﬁrst version uses a Monte Carlo approach. The third argument of the function N is the number of requested simulation for Monte Carlo.

  function cost=simulation_mc(x0,policy,N)
    cost=0;
    for i=1:N do
      x=x0;
      for t=1:TF-1 do
        w=W(grand(1,1,'uin',1,2));
        x=x+policy(t,x)+w;
      end
      cost=cost+(x-xref)^2
    end
    cost=cost/N;
  endfunction

The second version uses the law of the random perturbations to compute the expectation.

  function cost=simulation_ex(x0,policy)
    // Exact computation with the law of W
    Wa=all_w(TF-1);
    cost=0;
    for i=1:size(Wa,'r') do
      x=x0;
      for t=1:TF-1 do
        x=x+policy(t,x)+Wa(i,t);
      end
      cost=cost+(x-xref)^2
    end
    cost=cost/size(Wa,'r');
  endfunction

  function W=all_w(n)
    // generated all the possible (W_1,...,W_(TF-1))
    if n==1 then
      W=[-1;1]
    else
      Wn=all_w(n-1);
      W=[-1*ones(size(Wn,'r'),1),Wn;1*ones(size(Wn,'r'),1),Wn];
    end
  endfunction;

The third version uses dynamic programming.

  function costs=simulation_dp(policy)
    // evaluation by dynamic programming with fixed policy
    Vs=ones(TF,length(X))*%inf;
    // Bellman function at time TF
    Vs(TF,:)=(X-xref) .^2;
    // Compute final value functions
    // Loop backward over time:
    for t=(TF-1):-1:1 do
      for x=1:10 do
        // loop on noises
        EV=0;
        for iw=1:size(W,"*") do
          next_state=x+policy(t,x)+W(iw);
          EV=EV+P(iw)*Vs(t+1,next_state);
        end
        Vs(t,x)=EV;
      end
    end
    costs=Vs(1,:);
  endfunction

Question 5 [4] Use the proposed simulation macros to evaluate the mean payoﬀ associated with the optimal policy in (11). Compare the diﬀerent proposed solutions.

Formations

Départements

Pratique

Recherche

Management of Energy

Management of Renewable Resources and Biodiversity

Management of Exhaustible Resources

Models in Life Science

Miscellaneous

A Numerical Toy Stochastic Control Problem
Solved by Dynamic Programming

Contents