In this practical work, you will solve a toy stochastic control problem using the dynamic programming algorithm.
1 Formulating a stochastic control problem
1.1 Problem statement
We consider a control system given by the dynamics
x(t + 1) = x(t) + u(t) + w(t + 1),    (1)
where
time t is discrete and belongs to the set
𝕋 = {ti, ti + 1, …, tf − 1},    (2)
where we call ti the initial time and tf the horizon;
x(t) is the value of the state at the beginning of the period [t, t + 1[, belonging to the finite set
𝕏 = {1, 2, …, x♯},    (3)
where x♯ ≥ 5 is an integer;
u(t) is the control during [t, t + 1[, decided at the beginning of the time period [t, t + 1[ and belonging to the finite set
𝕌 = {−1, 0, +1};    (4)
w(t + 1) is a random perturbation occurring during the time period [t, t + 1[, belonging to the finite set
𝕎 = {−1, +1};    (5)
and we assume that
the random variables w(ti + 1), …, w(tf) are independent;
each random variable w(t) follows the uniform probability distribution on the set 𝕎, that is,
ℙ(w(t) = −1) = ℙ(w(t) = +1) = 1/2.    (6)
The manager is supposed to make a decision u(t) at each t ∈ 𝕋, at the beginning of
the period [t, t + 1[, before knowing the random perturbation w(t + 1): such a setup is
called a decision-hazard setting. As the random perturbation w(t + 1) comes after the
decision u(t), the future state x(t + 1) in (1) may go outside the state set 𝕏.
Therefore, at the boundaries 1 and x♯ of the state set 𝕏, we have to limit the
set of possible controls to ensure that the future state will remain in the prescribed
bounds.
Question 1
(a)
[3] Exhibit, for each possible state x ∈ 𝕏, the largest subset 𝔹(x) of 𝕌 which ensures that the next state — as computed by the system dynamics (1) with x(t) = x — will remain in the set 𝕏, whatever the random perturbation w(t + 1) and the decision u(t) ∈ 𝔹(x). Explain how you determine the set 𝔹(x).
(b)
[2] Write a Scicoslab function which returns the subset 𝔹(x) of 𝕌 (as a row vector), given the state x as input.
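As an illustration, a minimal Scicoslab sketch of such a function could look like the following (the name admissible_controls is arbitrary, and the bound x♯ is assumed to be stored in a variable Nx available in the calling environment, as in the pseudo code of §2.2):

function B=admissible_controls(x)
  // sketch: return, as a row vector, the controls u in U = {-1,0,+1} such that
  // x + u + w stays in X = {1,...,Nx} for every perturbation w in W = {-1,+1};
  // Nx is assumed to be defined in the calling environment
  B=[];
  for u=[-1,0,1] do
    if (x+u-1>=1) & (x+u+1<=Nx) then
      B=[B,u];
    end
  end
endfunction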
Let xref ∈ 𝕏 be a given reference state to be achieved at the horizon time tf. The aim of the manager is to minimize the final cost
K(x(tf)) = (x(tf) − xref)².    (7)
For this purpose, the manager can select among admissible (state) policies. An admissible policy is
a mapping γ : 𝕋 × 𝕏 → 𝕌 that assigns an admissible control to any time and state, that is,
γ(t,x) ∈ 𝔹(x), for all (t,x) ∈ 𝕋 × 𝕏.
Let xi ∈ 𝕏 be a given initial state. The stochastic control problem to solve is now
min_γ 𝔼[ K(x(tf)) ],    (8)
where the minimum is taken over admissible policies γ and where the state trajectory follows the closed-loop dynamics
x(ti) = xi,  x(t + 1) = x(t) + γ(t, x(t)) + w(t + 1),  t ∈ 𝕋.    (9)
2 Writing the stochastic dynamic programming equation
2.1 Stochastic dynamic programming equation
Since the perturbations w(ti + 1),…,w(tf) are supposed to be independent random variables,
Dynamic Programming applies and the equation associated with the problem of minimizing the expected cost (14) writes
V(tf, x) = K(x),  x ∈ 𝕏,
V(t, x) = min_{u ∈ 𝔹(x)} 𝔼[ V(t + 1, x + u + w(t + 1)) ],  x ∈ 𝕏,    (10)
for time t varying from tf− 1 to ti. The expectation 𝔼 is taken with respect to the marginal
probability distribution ℙw(t+1) given by (6).
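For instance, under the uniform law (6), the expectation in (10) reduces to a two-term average,
𝔼[ V(t + 1, x + u + w(t + 1)) ] = 0.5 V(t + 1, x + u − 1) + 0.5 V(t + 1, x + u + 1),
which is what the factor 0.5 implements in the pseudo code of §2.2.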
initialization: V(tf, x) = K(x) for all x ∈ 𝕏;
for t = tf − 1, tf − 2, …, ti + 1, ti do
  for x ∈ 𝕏 do
    V(t, x) = +∞;
    for u ∈ 𝔹(x) do
      vu = 0;
      for w ∈ 𝕎 do
        vu = vu + ℙ(w(t + 1) = w) V(t + 1, x + u + w);
      end
      if vu < V(t, x) then
        V(t, x) = vu; γ(t, x) = u;
      end
    end
  end
end
Algorithm 1: Dynamic programming algorithm corresponding to (10)
2.2 Pseudo code for the stochastic dynamic programming equation
Writing the code to compute the Bellman function V in (10) (and the optimal control) follows a
standard scheme, sketched in Algorithm 1. In the simple example of §1.1, we directly store the
Bellman values in a two-index matrix V = (V(t, x)), with t ∈ {ti, …, tf} and x ∈ 𝕏.
// Initialization
Ti=ZZZZ; Tf=ZZZZ; Nx=ZZZZ;
state=[1:Nx];
// Initialization of the matrix used to store Bellman values
V=ones(Tf,Nx)*%inf;
// Compute Bellman function at final time
V(Tf,:)=ZZZZ
// make a backward loop on time
for t=Tf-1:-1:Ti do
  // make a loop on possible states
  for x=ZZZZ do
    // make a loop on admissible controls (for state x) to obtain the best possible
    // expected cost starting from state x in cost_x
    cost_x=%inf
    for u=ZZZZ do
      // make a loop on the random perturbation to compute the expected
      // cost cost_x_u
      cost_x_u=0
      for w=ZZZZ do
        // for a given perturbation compute the cost associated to
        // control u and state x using the instantaneous cost and the
        // Bellman function at time t+1
        cost_x_u_w=0.5*V(t+1,ZZZZ)
        //cost_x_u_w = (cost_t(x,u) + V(t+1,ZZZZ))*p(w)
        // update cost_x_u with cost_x_u_w
        cost_x_u=ZZZZ
      end
      // compare the current cost_x_u to cost_x and
      // update cost_x if cost_x_u is better than the current cost_x
      if ZZZZ then ZZZZ, else ZZZZ, end;
    end
    // store cost_x in V(t,x)
    ZZZZ
  end
end
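For reference, here is one possible way (among others) to fill the pattern, assuming ti = 1, using the illustrative values Tf = 10, Nx = 10 and xref = 5, and relying on the hypothetical admissible_controls sketch given after Question 1; your own completion may of course differ:

// sketch of a possible completion of the pattern above
Ti=1; Tf=10; Nx=10;            // Tf and Nx are illustrative values (Nx = x# >= 5)
xref=5;                        // illustrative reference state
W=[-1,1];                      // perturbation set (5)
state=[1:Nx];
V=ones(Tf,Nx)*%inf;            // Bellman values
U=zeros(Tf-1,Nx);              // optimal controls (useful for Question 3)
V(Tf,:)=(state-xref).^2;       // final cost (7)
for t=Tf-1:-1:Ti do
  for x=state do
    cost_x=%inf;
    for u=admissible_controls(x) do
      cost_x_u=0;
      for w=W do
        cost_x_u=cost_x_u+0.5*V(t+1,x+u+w);   // uniform law (6)
      end
      if cost_x_u<cost_x then
        cost_x=cost_x_u; U(t,x)=u;
      end
    end
    V(t,x)=cost_x;
  end
end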
Question 2
(a)
[3] Use the previous pseudo code pattern to program the computation of the Bellman equation (10) associated with the minimization problem described in §1.1.
(b)
[4] Draw the obtained Bellman values as functions of the state (one curve for each time t). Comment on their shapes and the values they take. Compute V(t, 5) and comment on the value obtained. Explain why V(t, 1) increases as time t goes from ti to the horizon tf.
Question 3 [3] Modify the previous code in order to also keep track of the optimal control function. As for the Bellman value function, you will use a two-index matrix U = (U(t, x)), with t ∈ {ti, …, tf − 1} and x ∈ 𝕏, to store the optimal control at time t when the current state is x. Display the matrix U. Do you find the optimal policy intuitive?
3 Computing and evaluating an optimal policy
3.1 Computing an optimal policy
Note that, if we know the Bellman function V, we can use Equation (10) to compute the optimal control to be used for a given state x at time t by
γ(t, x) ∈ argmin_{u ∈ 𝔹(x)} 𝔼[ V(t + 1, x + u + w(t + 1)) ].    (11)
This approach is used when Bellman functions are computed on grids, but the system dynamics may produce states outside the grid. In that case, we use interpolation to compute the Bellman values, and we obtain better controls by recomputing them from Equation (10) than by using stored controls associated with the states limited to the grid.
Question 4 [2] Write a function which returns the optimal control knowing the current time t and state x. This function will use the previously computed matrix V filled with the optimal Bellman values.
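A minimal sketch of such a function, assuming that the Bellman matrix V, the perturbation set W and the hypothetical admissible_controls function introduced above are available in the calling environment:

function u=optimal_control(t,x)
  // sketch: recompute the best admissible control at (t,x) from the stored
  // Bellman matrix V, as in Equation (11); V, W and admissible_controls are
  // assumed to be defined in the calling environment
  best=%inf;
  for v=admissible_controls(x) do
    ev=0;
    for w=W do
      ev=ev+0.5*V(t+1,x+v+w);   // expectation under the uniform law (6)
    end
    if ev<best then
      best=ev; u=v;
    end
  end
endfunction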
3.2 Evaluation of a given policy
As defined in §1.1, an admissible policy γ : 𝕋 × 𝕏 → 𝕌 is such that γ(t,x) ∈ 𝔹(x). The optimal policy obtained by dynamic programming is admissible.
Given an admissible policy γ and a (perturbation) scenario w(⋅) = (w(ti + 1), …, w(tf)), we are able to build a state trajectory x(⋅) = (x(ti), …, x(tf − 1), x(tf)) and a control trajectory u(⋅) = (u(ti), …, u(tf − 1)) produced by the “closed-loop” dynamics initialized at the initial time ti by
x(ti) = xi,  u(t) = γ(t, x(t)),  x(t + 1) = x(t) + u(t) + w(t + 1),  t = ti, …, tf − 1.    (12)
We also obtain the payoff associated with the scenario w(⋅),
K(x(tf)) = (x(tf) − xref)²,    (13)
where x(⋅) and u(⋅) are given by (12). The expected payoff associated with the policy γ is
𝔼[ K(x(tf)) ] = 𝔼[ (x(tf) − xref)² ],    (14)
where the expectation 𝔼 is taken with respect to the product probability ℙ,
whose marginals are given by (6). The true expected value (14) is difficult to
compute, and we evaluate it by the Monte Carlo method using Ns scenarios w1(⋅), …, wNs(⋅):
(1/Ns) ∑_{s=1,…,Ns} (xs(tf) − xref)²,    (15)
where xs(⋅) denotes the state trajectory produced by (12) under the scenario ws(⋅).
By the law of large numbers, the mean payoff (15) is a “good” approximation of the expected
payoff (14) if the number of scenarios is “large enough”.
We propose three different functions in order to evaluate the mean payoff associated with a
policy given by a macro named policy.
The first version uses a Monte Carlo approach. The third argument N of the function is the number of simulations requested for the Monte Carlo estimation.
function cost=simulation_mc(x0,policy,N)
  cost=0;
  for i=1:N do
    x=x0;
    for t=1:TF-1 do
      w=W(grand(1,1,'uin',1,2));
      x=x+policy(t,x)+w;
    end
    cost=cost+(x-xref)^2
  end
  cost=cost/N;
endfunction
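For instance, assuming the optimal controls were stored in a matrix U as in Question 3 (or are recomputed by the optimal_control sketch above), a hypothetical policy macro and a Monte Carlo evaluation could read:

function u=my_policy(t,x)
  u=U(t,x);   // U is assumed to be computed beforehand by dynamic programming
endfunction

cost=simulation_mc(x0,my_policy,10000);   // x0: chosen initial state; 10000 scenarios (arbitrary)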
The second version uses the law of the random perturbations to compute the expectation.
function cost=simulation_ex(x0,policy)
  // Exact computation with the law of W
  Wa=all_w(TF-1);
  cost=0;
  for i=1:size(Wa,'r') do
    x=x0;
    for t=1:TF-1 do
      x=x+policy(t,x)+Wa(i,t);
    end
    cost=cost+(x-xref)^2
  end
  cost=cost/size(Wa,'r');
endfunction

function W=all_w(n)
  // generate all the possible scenarios (W_1,...,W_(TF-1))
  if n==1 then
    W=[-1;1]
  else
    Wn=all_w(n-1);
    W=[-1*ones(size(Wn,'r'),1),Wn;1*ones(size(Wn,'r'),1),Wn];
  end
endfunction
The third version uses dynamic programming.
function costs=simulation_dp(policy)
  // evaluation by dynamic programming with fixed policy
  Vs=ones(TF,length(X))*%inf;
  // Bellman function at time TF
  Vs(TF,:)=(X-xref).^2;   // Compute final value functions
  // Loop backward over time:
  for t=(TF-1):-1:1 do
    for x=1:length(X) do
      // loop on noises
      EV=0;
      for iw=1:size(W,"*") do
        next_state=x+policy(t,x)+W(iw);
        EV=EV+P(iw)*Vs(t+1,next_state);
      end
      Vs(t,x)=EV;
    end
  end
  costs=Vs(1,:);
endfunction
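The three macros can then be compared on the same policy. For example, with x0 a chosen initial state and my_policy the hypothetical macro sketched above (simulation_dp additionally assumes that the state vector X and the probability vector P, here P = [0.5, 0.5], are defined in the calling environment):

cost_mc=simulation_mc(x0,my_policy,10000);  // Monte Carlo estimate of (14)
cost_ex=simulation_ex(x0,my_policy);        // exact expectation by enumerating all scenarios
costs_dp=simulation_dp(my_policy);          // Bellman evaluation of the fixed policy
cost_dp=costs_dp(x0);                       // value associated with the initial state x0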
Question 5 [4] Use the proposed simulation macros to evaluate the mean payoff associated with the optimal policy in (11). Compare the different proposed solutions.