# A Numerical Toy Stochastic Control Problem Solved by Dynamic Programming

P. Carpentier, J.-P. Chancelier, M. De Lara and V. Leclère
(last modiﬁcation date: March 7, 2018)
Version pdf de ce document
Version sans bandeaux

### Contents

Abstract

In this practical work, you will solve a toy stochastic control problem using the dynamic programming algorithm.

### 1 Formulating a stochastic control problem

#### 1.1 Problem statement

We consider a control system given by the dynamics

 (1)

where

• time t is discrete and belongs to , where  (2)

where we call ti the initial time and tf the horizon;

• x(t) is the value of the state at the beginning of period [t,t + 1[, belonging to a ﬁnite set  (3)

where x 5 is an integer;

• u(t) is the control during [t,t + 1[, decided at the beginning of the time period [t,t + 1[, and belonging to the ﬁnite set  (4)

• w(t + 1) is a random perturbation occuring during the time period [t,t + 1[, belonging to the ﬁnite set  (5)

and we assume that

• the random variables w(ti + 1),,w(tf) are independent;
• each random variable w(t) follows a uniform probability distribution on the set 𝕎  (6)

The manager is supposed to make a decision u(t) at each t 𝕋, at the beginning of the period [t,t + 1[, before knowing the random perturbation w(t + 1): such a setup is called decision-hazard setting. As the random perturbation w(t + 1) comes after the decision u(t), it might be that the future state x(t + 1) in (1) goes outside the state set 𝕏. Therefore, at the boundaries 1 and x 5 of the state set 𝕏, we have to limit the set of possible controls to ensure that the future state will remain in the prescribed bounds.

Question 1

(a)
[3] Exhibit, for each possible state x 𝕏, the largest subset 𝔹(x) of 𝕌 which ensures that the next state — as computed by the system dynamics (1) with x(t) = x — will remain in the set 𝕏, whatever the random perturbation w(t + 1) and the decision u(t) 𝔹(x). Explain how you determine the set 𝔹(x).
(b)
[2] Write a Scicoslab function which returns the subset 𝔹(x) of 𝕌 (as a row vector), given the state x as entry.

Let xref 𝕏 be a given reference state to be achieved at horizon time tf. The aim of the manager is to minimize the ﬁnal cost

 (7)

For this purpose, the manager can select among admissible (state) policies. An admissible policy is a mapping γ : 𝕋 × 𝕏 𝕌 that assigns an admissible control to any time and state, that is, γ(t,x) 𝔹(x), for all (t,x) 𝕋 × 𝕏.

Let xi 𝕏 be a given initial state. The stochastic control problem to solve is now

 (8)

subject to the dynamics (1) and constraints that

 (9)

#### 1.2 Numerical data

We take

• ti = 1 and tf = 10;
• x = 9 and x ref = 5.

### 2 Writing the stochastic dynamic programming equation

#### 2.1 Stochastic dynamic programming equation

Since the perturbations w(ti + 1),,w(tf) are supposed to be independent random variables, Dynamic Programming applies and the equation associated with the problem of minimizing the expected cost (14) writes

for time t varying from tf 1 to ti. The expectation 𝔼 is taken with respect to the marginal probability distribution w(t+1) given by (6).

initialization V (tf,x) = K(x); for  t = tf,tf 1,,ti + 1,ti do for  x 𝕏 do V (t,x) = + ;
for  u 𝔹(xdo

vu = 0 ;
for  w 𝕎 do

vu = vu + V (t + 1,x + u + w) ;

if vu < V (t,xthen

V (t,x) = vu ;
γ(t,x) = u ; Algorithm 1:  Dynamic programming algorithm corresponding to (10)

#### 2.2 Pseudo code for the stochastic dynamic programming equation

Writing the code to compute the Bellman function V in (10) (and the optimal control) follows a standard scheme, sketched in Algorithm 1. In the simple example of §1.1, we directly store the Bellman values in a 2-indices matrix V = (V (t,x))t[ti,tf],x𝕏.

// Initialization
Tf=ZZZZ;
Nx=ZZZZ;
state=[1:Nx];
// Initialization of the matrix used to store Bellman values
V=ones(Tf,Nx)*%inf;
// Compute Bellman function at final time
V(Tf,:)=ZZZZ
// make a backward loop on time
for t=Tf-1:-1:Ti do
// make a loop on possible states
for x=ZZZZ do
// make a loop on admissible controls (for state x) to obtain the best possible
// expected cost starting from state x in cost_x
cost_x=%inf
for u=ZZZZ do
// make a loop on the random perturbation to compute the expected
// cost cost_x_u
for w=ZZZZ do
// for a given perturbation compute the cost associated to
// control u and state x using the instantaneous cost and the
// Bellman function at time t+1
cost_x_u_w=0.5*V(t+1,ZZZZ)
//cost_x_u_w = (cost_t(x,u) + V(t+1,ZZZZ))*p(w)
end
// update cost_x_u with cost_x_u_w
cost_x_u=ZZZZ
// compare the current cost_x_u to cost_x and
// update cost_x if cost_x_u is better than the current cost_x
if ZZZZ then ZZZZ,else ZZZZ,end;
end
// store cost_x in V(t,x)
ZZZZ
end
end

Question 2

(a)
[3] Use the previous pseudo code pattern to program the computation of the Bellman equation (10) associated with the minimization problem described in §1.1.
(b)
[4] Draw the obtained Bellman values as functions of the state (one curve for each time t). Comment on their shapes and the values they take. Compute V (t, 5) and comment on the value obtained. Explain why V (t, 1) increases as time t goes from ti to the horizon tf.

Question 3 [3] Modify the previous code in order to also keep track of the optimal control function. As for the Bellman value function, you will use a 2-indices matrix U = (U(t,x))t[ti,tf1],x𝕏 to store the optimal control at time t when the current state is x. Display the matrix U. Do you ﬁnd the optimal policy intuitive?

### 3 Computing and evaluating an optimal policy

#### 3.1 Computing an optimal policy

Note that, if we know the Bellman function V , we can use Equation (10) to compute the optimal control to be used for a given state x at time t by

 (11)

This way of doing is used when Bellman functions are computed on grids, but that the system dynamics produce states outside the grid. In that case, we use interpolation to compute Bellman values and we obtain better controls by recomputing them from Equation (10), than by using stored controls associated with states limited to the grid.

Question 4 [2] Write a function which returns the optimal control knowing current time t and state x. This function will use the previouly computed matrix V ﬁlled with Bellman optimal values.

#### 3.2 Evaluation of a given policy

As deﬁned in §1.1, an admissible policy γ : 𝕋 × 𝕏 𝕌 is such that γ(t,x) 𝔹(x). The optimal control obtained by dynamic programming is an admissible policy.

Given an admissible policy γ and a (perturbation) scenario w() = w(ti + 1),,w(tf), we are able to build a state trajectory x() = x(ti),,x(tf 1),x(tf) and a control trajectory u() = u(ti),,u(tf 1) produced by the “closed-loop” dynamics initialized at the initial time ti by

We also obtain the payoﬀ associated with the scenario w()

 (13)

where x() and u() are given by (12). The expected payoﬀ associated with the policy γ is

 (14)

where the expectation 𝔼 is taken with respect to the product probability , whose marginals are given by (6). The true expected value (14) is diﬃcult to compute,1 and we evaluate it by the Monte Carlo method using Ns scenarios w1(),,wNs():

 (15)

By the law of large numbers, the mean payoﬀ (15) is a “good” approximation of the expected payoﬀ(14) if the number of scenarios is “large enough”.

We propose three diﬀerent functions in order to evaluate the mean payoﬀ associated with a policy given by a macro named policy.

The ﬁrst version uses a Monte Carlo approach. The third argument of the function N is the number of requested simulation for Monte Carlo.

function cost=simulation_mc(x0,policy,N)
cost=0;
for i=1:N do
x=x0;
for t=1:TF-1 do
w=W(grand(1,1,'uin',1,2));
x=x+policy(t,x)+w;
end
cost=cost+(x-xref)^2
end
cost=cost/N;
endfunction

The second version uses the law of the random perturbations to compute the expectation.

function cost=simulation_ex(x0,policy)
// Exact computation with the law of W
Wa=all_w(TF-1);
cost=0;
for i=1:size(Wa,'r') do
x=x0;
for t=1:TF-1 do
x=x+policy(t,x)+Wa(i,t);
end
cost=cost+(x-xref)^2
end
cost=cost/size(Wa,'r');
endfunction

function W=all_w(n)
// generated all the possible (W_1,...,W_(TF-1))
if n==1 then
W=[-1;1]
else
Wn=all_w(n-1);
W=[-1*ones(size(Wn,'r'),1),Wn;1*ones(size(Wn,'r'),1),Wn];
end
endfunction;

The third version uses dynamic programming.

function costs=simulation_dp(policy)
// evaluation by dynamic programming with fixed policy
Vs=ones(TF,length(X))*%inf;
// Bellman function at time TF
Vs(TF,:)=(X-xref) .^2;
// Compute final value functions
// Loop backward over time:
for t=(TF-1):-1:1 do
for x=1:10 do
// loop on noises
EV=0;
for iw=1:size(W,"*") do
next_state=x+policy(t,x)+W(iw);
EV=EV+P(iw)*Vs(t+1,next_state);
end
Vs(t,x)=EV;
end
end
costs=Vs(1,:);
endfunction

Question 5 [4] Use the proposed simulation macros to evaluate the mean payoﬀ associated with the optimal policy in (11). Compare the diﬀerent proposed solutions.