Author: Bertsekas D.P.

Tags: programming

ISBN: 1-886529-13-2

Year: 1995

Text
                    

Dynamic Programming and Optimal Control
Volume II

Dimitri P. Bertsekas
Massachusetts Institute of Technology

Athena Scientific, Belmont, Massachusetts
Athena Scientific
Post Office Box 391
Belmont, Mass. 02178-9098, U.S.A.
Email: athenasc@world.std.com

Cover Design: Ann Gallager

© 1995 Dimitri P. Bertsekas
All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

Portions of this volume are adapted and reprinted from the author's Dynamic Programming: Deterministic and Stochastic Models, Prentice-Hall, 1987, by permission of Prentice-Hall, Inc.

Publisher's Cataloging-in-Publication Data
Bertsekas, Dimitri P.
Dynamic Programming and Optimal Control
Includes Bibliography and Index
1. Mathematical Optimization. 2. Dynamic Programming. I. Title.
QA402.5 .B465 1995   519.703   95-075941

ISBN 1-886529-12-1 (Vol. I)
ISBN 1-886529-13-2 (Vol. II)
ISBN 1-886529-11-6 (Vol. I and II)
Contents

1. Infinite Horizon - Discounted Problems
  1.1. Minimization of Total Cost - Introduction . . . p. 2
  1.2. Discounted Problems with Bounded Cost per Stage . . . p. 9
  1.3. Finite-State Systems - Computational Methods . . . p. 16
    1.3.1. Value Iteration and Error Bounds . . . p. 19
    1.3.2. Policy Iteration . . . p. 35
    1.3.3. Adaptive Aggregation . . . p. 44
    1.3.4. Linear Programming . . . p. 49
  1.4. The Role of Contraction Mappings . . . p. 52
  1.5. Scheduling and Multiarmed Bandit Problems . . . p. 54
  1.6. Notes, Sources, and Exercises . . . p. 64

2. Stochastic Shortest Path Problems
  2.1. Main Results . . . p. 78
  2.2. Computational Methods . . . p. 87
    2.2.1. Value Iteration . . . p. 88
    2.2.2. Policy Iteration . . . p. 91
  2.3. Simulation-Based Methods . . . p. 94
    2.3.1. Policy Evaluation by Monte-Carlo Simulation . . . p. 95
    2.3.2. Q-Learning . . . p. 99
    2.3.3. Approximations . . . p. 101
    2.3.4. Extensions to Discounted Problems . . . p. 118
    2.3.5. The Role of Parallel Computation . . . p. 120
  2.4. Notes, Sources, and Exercises . . . p. 121

3. Undiscounted Problems
  3.1. Unbounded Costs per Stage . . . p. 134
  3.2. Linear Systems and Quadratic Cost . . . p. 150
  3.3. Inventory Control . . . p. 153
  3.4. Optimal Stopping . . . p. 155
  3.5. Optimal Gambling Strategies . . . p. 160
  3.6. Nonstationary and Periodic Problems . . . p. 167
  3.7. Notes, Sources, and Exercises . . . p. 172

4. Average Cost per Stage Problems
  4.1. Preliminary Analysis . . . p. 181
  4.2. Optimality Conditions . . . p. 191
  4.3. Computational Methods . . . p. 202
    4.3.1. Value Iteration . . . p. 202
    4.3.2. Policy Iteration . . . p. 213
    4.3.3. Linear Programming . . . p. 221
    4.3.4. Simulation-Based Methods . . . p. 222
  4.4. Infinite State Space . . . p. 226
  4.5. Notes, Sources, and Exercises . . . p. 229

5. Continuous-Time Problems
  5.1. Uniformization . . . p. 242
  5.2. Queueing Applications . . . p. 250
  5.3. Semi-Markov Problems . . . p. 261
  5.4. Notes, Sources, and Exercises . . . p. 273
CONTENTS OF VOLUME I

1. The Dynamic Programming Algorithm
  1.1. Introduction
  1.2. The Basic Problem
  1.3. The Dynamic Programming Algorithm
  1.4. State Augmentation
  1.5. Some Mathematical Issues
  1.6. Notes, Sources, and Exercises

2. Deterministic Systems and the Shortest Path Problem
  2.1. Finite-State Systems and Shortest Paths
  2.2. Some Shortest Path Applications
    2.2.1. Critical Path Analysis
    2.2.2. Hidden Markov Models and the Viterbi Algorithm
  2.3. Shortest Path Algorithms
    2.3.1. Label Correcting Methods
    2.3.2. Auction Algorithms
  2.4. Notes, Sources, and Exercises

3. Deterministic Continuous-Time Optimal Control
  3.1. Continuous-Time Optimal Control
  3.2. The Hamilton-Jacobi-Bellman Equation
  3.3. The Pontryagin Minimum Principle
    3.3.1. An Informal Derivation Using the HJB Equation
    3.3.2. A Derivation Based on Variational Ideas
    3.3.3. The Minimum Principle for Discrete-Time Problems
  3.4. Extensions of the Minimum Principle
    3.4.1. Fixed Terminal State
    3.4.2. Free Initial State
    3.4.3. Free Terminal Time
    3.4.4. Time-Varying System and Cost
    3.4.5. Singular Problems
  3.5. Notes, Sources, and Exercises

4. Problems with Perfect State Information
  4.1. Linear Systems and Quadratic Cost
  4.2. Inventory Control
  4.3. Dynamic Portfolio Analysis
  4.4. Optimal Stopping Problems
  4.5. Scheduling and the Interchange Argument
  4.6. Notes, Sources, and Exercises
5. Problems with Imperfect State Information
  5.1. Reduction to the Perfect Information Case
  5.2. Linear Systems and Quadratic Cost
  5.3. Minimum Variance Control of Linear Systems
  5.4. Sufficient Statistics and Finite-State Markov Chains
  5.5. Sequential Hypothesis Testing
  5.6. Notes, Sources, and Exercises

6. Suboptimal and Adaptive Control
  6.1. Certainty Equivalent and Adaptive Control
    6.1.1. Caution, Probing, and Dual Control
    6.1.2. Two-Phase Control and Identifiability
    6.1.3. Certainty Equivalent Control and Identifiability
    6.1.4. Self-Tuning Regulators
  6.2. Open-Loop Feedback Control
  6.3. Limited Lookahead Policies and Applications
    6.3.1. Flexible Manufacturing
    6.3.2. Computer Chess
  6.4. Approximations in Dynamic Programming
    6.4.1. Discretization of Optimal Control Problems
    6.4.2. Cost-to-Go Approximation
    6.4.3. Other Approximations
  6.5. Notes, Sources, and Exercises

7. Introduction to Infinite Horizon Problems
  7.1. An Overview
  7.2. Stochastic Shortest Path Problems
  7.3. Discounted Problems
  7.4. Average Cost Problems
  7.5. Notes, Sources, and Exercises

Appendix A: Mathematical Review
Appendix B: On Optimization Theory
Appendix C: On Probability Theory
Appendix D: On Finite-State Markov Chains
Appendix E: Least-Squares Estimation and Kalman Filtering
Appendix F: Modeling of Stochastic Linear Systems
ABOUT THE AUTHOR

Dimitri Bertsekas studied Mechanical and Electrical Engineering at the National Technical University of Athens, Greece, and obtained his Ph.D. in system science from the Massachusetts Institute of Technology. He has held faculty positions with the Engineering-Economic Systems Dept., Stanford University, and the Electrical Engineering Dept. of the University of Illinois, Urbana. He is currently Professor of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology. He consults regularly with private industry and has held editorial positions in several journals. He has been elected Fellow of the IEEE.

Professor Bertsekas has done research in a broad variety of subjects from control theory, optimization theory, parallel and distributed computation, data communication networks, and systems analysis. He has written numerous papers in each of these areas. This book is his fourth on dynamic programming and optimal control.

Other books by the author:

1) Dynamic Programming and Stochastic Control, Academic Press, 1976.

2) Stochastic Optimal Control: The Discrete-Time Case, Academic Press, 1978 (with S. E. Shreve; translated in Russian).

3) Constrained Optimization and Lagrange Multiplier Methods, Academic Press, 1982 (translated in Russian).

4) Dynamic Programming: Deterministic and Stochastic Models, Prentice-Hall, 1987.

5) Data Networks, Prentice-Hall, 1987 (with R. G. Gallager; translated in Russian and Japanese); 2nd Edition, 1992.

6) Parallel and Distributed Computation: Numerical Methods, Prentice-Hall, 1989 (with J. N. Tsitsiklis).

7) Linear Network Optimization: Algorithms and Codes, M.I.T. Press, 1991.

PREFACE

This two-volume book is based on a first-year graduate course on dynamic programming and optimal control that I have taught for over twenty years at Stanford University, the University of Illinois, and the Massachusetts Institute of Technology. The course has been typically attended by students from engineering, operations research, economics, and applied mathematics. Accordingly, a principal objective of the book has been to provide a unified treatment of the subject, suitable for a broad audience. In particular, problems with a continuous character, such as stochastic control problems, popular in modern control theory, are simultaneously treated with problems with a discrete character, such as Markovian decision problems, popular in operations research. Furthermore, many applications and examples, drawn from a broad variety of fields, are discussed.

The book may be viewed as a greatly expanded and pedagogically improved version of my 1987 book "Dynamic Programming: Deterministic and Stochastic Models," published by Prentice-Hall. I have included much new material on deterministic and stochastic shortest path problems, as well as a new chapter on continuous-time optimal control problems and the Pontryagin Maximum Principle, developed from a dynamic programming viewpoint. I have also added a fairly extensive exposition of simulation-based approximation techniques for dynamic programming. These techniques, which are often referred to as "neuro-dynamic programming" or "reinforcement learning," represent a breakthrough in the practical application of dynamic programming to complex problems that involve the dual curse of large dimension and lack of an accurate mathematical model. Other material was also augmented, substantially modified, and updated.

With the new material, however, the book grew so much in size that it became necessary to divide it into two volumes: one on finite horizon, and the other on infinite horizon problems. This division was not only natural in terms of size, but also in terms of style and orientation. The first volume is more oriented towards modeling, and the second is more oriented towards mathematical analysis and computation. To make the first volume self-contained for instructors who wish to cover a modest amount of infinite horizon material in a course that is primarily oriented towards modeling,
conceptualization, and finite horizon problems, I have added a final chapter that provides an introductory treatment of infinite horizon problems.

Many topics in the book are relatively independent of the others. For example, Chapter 2 of Vol. I on shortest path problems can be skipped without loss of continuity, and the same is true for Chapter 3 of Vol. I, which deals with continuous-time optimal control. As a result, the book can be used to teach several different types of courses.

(a) A two-semester course that covers both volumes.

(b) A one-semester course primarily focused on finite horizon problems that covers most of the first volume.

(c) A one-semester course focused on stochastic optimal control that covers Chapters 1, 4, 5, and 6 of Vol. I, and Chapters 1, 2, and 4 of Vol. II.

(d) A one-semester course that covers Chapter 1, about 50% of Chapters 2 through 6 of Vol. I, and about 70% of Chapters 1, 2, and 4 of Vol. II. This is the course I usually teach at MIT.

(e) A one-quarter engineering course that covers the first three chapters and parts of Chapters 4 through 6 of Vol. I.

(f) A one-quarter mathematically oriented course focused on infinite horizon problems that covers Vol. II.

The mathematical prerequisite for the text is knowledge of advanced calculus, introductory probability theory, and matrix-vector algebra. A summary of this material is provided in the appendixes. Naturally, prior exposure to dynamic system theory, control, optimization, or operations research will be helpful to the reader, but, based on my experience, the material given here is reasonably self-contained.

The book contains a large number of exercises, and the serious reader will benefit greatly by going through them. Solutions to all exercises are compiled in a manual that is available to instructors from Athena Scientific or from the author. Many thanks are due to the several people who spent long hours contributing to this manual, particularly Steven Shreve, Eric Loiederman, Lakis Polymenakos, and Cynara Wu.

Dynamic programming is a conceptually simple technique that can be adequately explained using elementary analysis. Yet a mathematically rigorous treatment of general dynamic programming requires the complicated machinery of measure-theoretic probability. My choice has been to bypass the complicated mathematics by developing the subject in generality, while claiming rigor only when the underlying probability spaces are countable. A mathematically rigorous treatment of the subject is carried out in my monograph "Stochastic Optimal Control: The Discrete Time Case," Academic Press, 1978, coauthored by Steven Shreve. This monograph complements the present text and provides a solid foundation for the
subjects developed somewhat informally here.

Finally, I am thankful to a number of individuals and institutions for their contributions to the book. My understanding of the subject was sharpened while I worked with Steven Shreve on our 1978 monograph. My interaction and collaboration with John Tsitsiklis on stochastic shortest paths and approximate dynamic programming have been most valuable. Michael Caramanis, Emmanuel Fernandez-Gaucherand, Pierre Humblet, Lennart Ljung, and John Tsitsiklis taught from versions of the book, and contributed several substantive comments and homework problems. A number of colleagues offered valuable insights and information, particularly David Castanon, Eugene Feinberg, and Krishna Pattipati. NSF provided research support. Prentice-Hall graciously allowed the use of material from my 1987 book. Teaching and interacting with the students at MIT have kept up my interest and excitement for the subject.

Dimitri P. Bertsekas
bertsekas@lids.mit.edu

1

Infinite Horizon - Discounted Problems

Contents
  1.1. Minimization of Total Cost - Introduction . . . p. 2
  1.2. Discounted Problems with Bounded Cost per Stage . . . p. 9
  1.3. Finite-State Systems - Computational Methods . . . p. 16
    1.3.1. Value Iteration and Error Bounds . . . p. 19
    1.3.2. Policy Iteration . . . p. 35
    1.3.3. Adaptive Aggregation . . . p. 44
    1.3.4. Linear Programming . . . p. 49
  1.4. The Role of Contraction Mappings . . . p. 52
  1.5. Scheduling and Multiarmed Bandit Problems . . . p. 54
  1.6. Notes, Sources, and Exercises . . . p. 64
This volume focuses on stochastic optimal control problems with an infinite number of decision stages (an infinite horizon). An introduction to these problems was presented in Chapter 7 of Vol. I. Here, we provide a more comprehensive analysis. In particular, we do not assume a finite number of states and we also discuss the associated analytical and computational issues in much greater depth.

We recall from Chapter 7 of Vol. I that there are four classes of infinite horizon problems of major interest:

(a) Discounted problems with bounded cost per stage.

(b) Stochastic shortest path problems.

(c) Discounted and undiscounted problems with unbounded cost per stage.

(d) Average cost per stage problems.

Each one of the first four chapters of the present volume considers one of the above problem classes, while the final chapter extends the analysis to continuous-time problems with a countable number of states. Throughout this volume we concentrate on the perfect information case, where each decision is made with exact knowledge of the current system state. Imperfect state information problems can be treated, as in Chapter 5 of Vol. I, by reformulation into perfect information problems involving a sufficient statistic.

1.1 MINIMIZATION OF TOTAL COST - INTRODUCTION

We now formulate the total cost minimization problem, which is the subject of this chapter and the next two. This is an infinite horizon, stationary version of the basic problem of Chapter 1 of Vol. I.

Total Cost Infinite Horizon Problem

Consider the stationary discrete-time dynamic system

    x_{k+1} = f(x_k, u_k, w_k),    k = 0, 1, ...,    (1.1)

where for all k the state x_k is an element of a space S, the control u_k is an element of a space C, and the random disturbance w_k is an element of a space D. We assume that D is a countable set. The control u_k is constrained to take values in a given nonempty subset U(x_k) of C, which depends on the current state x_k [u_k ∈ U(x_k), for all x_k ∈ S]. The random disturbances w_k, k = 0, 1, ..., have identical statistics and are characterized by probabilities P(· | x_k, u_k) defined on D, where P(w_k | x_k, u_k) is the
probability of occurrence of w_k, when the current state and control are x_k and u_k, respectively. The probability of w_k may depend explicitly on x_k and u_k, but not on values of prior disturbances w_{k-1}, ..., w_0.

Given an initial state x_0, we want to find a policy π = {μ_0, μ_1, ...}, where μ_k : S ↦ C, μ_k(x_k) ∈ U(x_k), for all x_k ∈ S, k = 0, 1, ..., that minimizes the cost function†

    J_π(x_0) = lim_{N→∞} E_{w_k, k=0,1,...} { Σ_{k=0}^{N-1} α^k g(x_k, μ_k(x_k), w_k) },    (1.2)

subject to the system equation constraint (1.1). The cost per stage g : S × C × D ↦ ℜ is given, and α is a positive scalar referred to as the discount factor.

We denote by Π the set of all admissible policies π, that is, the set of all sequences of functions π = {μ_0, μ_1, ...} with μ_k : S ↦ C, μ_k(x) ∈ U(x) for all x ∈ S, k = 0, 1, .... The optimal cost function J* is defined by

    J*(x) = min_{π∈Π} J_π(x),    x ∈ S.

A stationary policy is an admissible policy of the form π = {μ, μ, ...}, and its corresponding cost function is denoted by J_μ. For brevity, we refer to {μ, μ, ...} as the stationary policy μ. We say that μ is optimal if J_μ(x) = J*(x) for all states x.

Note that, while we allow arbitrary state and control spaces, we require that the disturbance space be countable. This is necessary to avoid the mathematical complications discussed in Section 1.5 of Vol. I. The countability assumption, however, is satisfied in many problems of interest, notably for deterministic optimal control problems and problems with a finite or a countable number of states. For other problems, our main results can typically be proved (under additional technical conditions) by following the same line of argument as the one given here, but also by dealing with the mathematical complications of various measure-theoretic frameworks; see [BeS78].

The cost J_π(x_0) given by Eq. (1.2) represents the limit of expected finite horizon costs. These costs are well defined as discussed in Section 1.5 of Vol. I.

† In what follows we will generally impose appropriate assumptions on the cost per stage g and the discount factor α that guarantee that the limit defining the total cost J_π(x_0) exists. If this limit is not known to exist, we use instead the definition

    J_π(x_0) = limsup_{N→∞} E_{w_k, k=0,1,...} { Σ_{k=0}^{N-1} α^k g(x_k, μ_k(x_k), w_k) }.
Another possibility would be to minimize over π the expected infinite horizon cost

    E_{w_k, k=0,1,...} { Σ_{k=0}^{∞} α^k g(x_k, μ_k(x_k), w_k) }.

Such a cost would require a far more complex mathematical formulation (a probability measure on the space of all disturbance sequences; see [BeS78]). However, we mention that, under the assumptions that we will be using, the preceding expression is equal to the cost given by Eq. (1.2). This may be proved by using the monotone convergence theorem (see Section 3.1) and other stochastic convergence theorems, which allow interchange of limit and expectation under appropriate conditions.

The DP Algorithm for the Finite-Horizon Version of the Problem

Consider any admissible policy π = {μ_0, μ_1, ...}, any positive integer N, and any function J : S ↦ ℜ. Suppose that we accumulate the costs of the first N stages, and to them we add the terminal cost α^N J(x_N), for a total expected cost

    E_{w_k, k=0,1,...} { α^N J(x_N) + Σ_{k=0}^{N-1} α^k g(x_k, μ_k(x_k), w_k) }.

The minimum of this cost over π can be calculated by starting with α^N J(x) and by carrying out N iterations of the corresponding DP algorithm of Section 1.3 of Vol. I. This algorithm expresses the optimal (N - k)-stage cost starting from state x, denoted by J_k(x), as the minimum of the expected sum of the cost of stage N - k and the optimal (N - k - 1)-stage cost starting from the next state. It is given by

    J_k(x) = min_{u∈U(x)} E_w { α^k g(x, u, w) + J_{k+1}(f(x, u, w)) },    k = 0, 1, ..., N - 1,    (1.3)

with the initial condition

    J_N(x) = α^N J(x).

For all initial states x, the optimal N-stage cost is the function J_0(x) obtained from the last step of the DP algorithm.

Let us consider, for all k and x, the functions given by

    V_k(x) = J_{N-k}(x) / α^{N-k}.
Then V_N(x) is the optimal N-stage cost J_0(x), while the DP recursion (1.3) can equivalently be written in terms of the functions V_k as

    V_{k+1}(x) = min_{u∈U(x)} E_w { g(x, u, w) + α V_k(f(x, u, w)) },    k = 0, 1, ..., N - 1,

with the initial condition

    V_0(x) = J(x).

The above algorithm can be used to calculate all the optimal finite horizon cost functions with a single DP recursion. In particular, suppose that we have computed the optimal (N - 1)-stage cost function V_{N-1}. Then, to calculate the optimal N-stage cost function V_N, we do not need to execute the N-stage DP algorithm. Instead, we can calculate V_N using the one-stage iteration

    V_N(x) = min_{u∈U(x)} E_w { g(x, u, w) + α V_{N-1}(f(x, u, w)) }.

More generally, starting with some terminal cost function, we can consider applying repeatedly the DP iteration as above. With each application, we will be obtaining the optimal cost function of some finite horizon problem. The horizon of this problem will be longer by one stage over the horizon of the preceding problem. Note that this convenience is possible only because we are dealing with a stationary system and a common cost function g for all stages.

Some Shorthand Notation

The preceding method of calculating finite horizon optimal costs motivates the introduction of two mappings that play an important theoretical role and provide a convenient shorthand notation in expressions that would be too complicated to write otherwise.

For any function J : S ↦ ℜ, we consider the function obtained by applying the DP mapping to J, and we denote it by TJ:

    (TJ)(x) = min_{u∈U(x)} E_w { g(x, u, w) + α J(f(x, u, w)) },    x ∈ S.    (1.4)

Since (TJ)(·) is itself a function defined on the state space S, we view T as a mapping that transforms the function J on S into the function TJ on S. Note that TJ is the optimal cost function for the one-stage problem that has stage cost g and terminal cost αJ.†

† Whenever we use the mapping T, we will impose sufficient assumptions to guarantee that the expected value involved in Eq. (1.4) is well defined.
Similarly, for any function J : S ↦ ℜ and any control function μ : S ↦ C, we denote

    (T_μ J)(x) = E_w { g(x, μ(x), w) + α J(f(x, μ(x), w)) },    x ∈ S.    (1.5)

Again, T_μ J may be viewed as the cost function associated with μ for the one-stage problem that has stage cost g and terminal cost αJ.

We will denote by T^k the composition of the mapping T with itself k times; that is, for all k we write

    (T^k J)(x) = (T(T^{k-1} J))(x),    x ∈ S.

Thus T^k J is the function obtained by applying the mapping T to the function T^{k-1} J. For convenience, we also write

    (T^0 J)(x) = J(x),    x ∈ S.

Similarly, T_μ^k J is defined by

    (T_μ^k J)(x) = (T_μ(T_μ^{k-1} J))(x),    x ∈ S,

and

    (T_μ^0 J)(x) = J(x),    x ∈ S.

It can be verified by induction that (T^k J)(x) is the optimal cost for the k-stage, α-discounted problem with initial state x, cost per stage g, and terminal cost function α^k J. Similarly, (T_μ^k J)(x) is the cost of μ for the same problem. To illustrate the case where k = 2, note that

    (T^2 J)(x) = min_{u∈U(x)} E_w { g(x, u, w) + α (TJ)(f(x, u, w)) }
               = min_{u_0∈U(x)} E_{w_0} { g(x, u_0, w_0)
                   + α min_{u_1∈U(f(x,u_0,w_0))} E_{w_1} { g(f(x, u_0, w_0), u_1, w_1) + α J(f(f(x, u_0, w_0), u_1, w_1)) } }
               = min_{u_0∈U(x)} E_{w_0} { g(x, u_0, w_0)
                   + min_{u_1∈U(f(x,u_0,w_0))} E_{w_1} { α g(f(x, u_0, w_0), u_1, w_1) + α^2 J(f(f(x, u_0, w_0), u_1, w_1)) } }.

The last expression can be recognized as the DP algorithm for the 2-stage, α-discounted problem with initial state x, cost per stage g, and terminal cost function α^2 J.

Finally, consider a k-stage policy π = {μ_0, μ_1, ..., μ_{k-1}}. Then the expression (T_{μ_0} T_{μ_1} ⋯ T_{μ_{k-1}} J)(x) is defined recursively for i = 0, ..., k - 2 by

    (T_{μ_i} T_{μ_{i+1}} ⋯ T_{μ_{k-1}} J)(x) = (T_{μ_i} (T_{μ_{i+1}} ⋯ T_{μ_{k-1}} J))(x),

and represents the cost of the policy π for the k-stage, α-discounted problem with initial state x, cost per stage g, and terminal cost function α^k J.
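To make the shorthand concrete, the following is a minimal numerical sketch of the mappings T and T_μ for the finite-state case taken up later in Section 1.3, where the data are transition probabilities p_ij(u) and expected stage costs g(i, u). The array layout (a list P of n x n transition matrices, one per control, and an n x m cost array g) is an illustrative assumption, not notation from the text.

    import numpy as np

    def T(J, P, g, alpha):
        # DP mapping (1.4) in finite-state form: (TJ)(i) = min_u [ g(i,u) + alpha * sum_j p_ij(u) J(j) ]
        Q = np.stack([g[:, u] + alpha * P[u] @ J for u in range(g.shape[1])], axis=1)
        return Q.min(axis=1)

    def T_mu(J, P, g, alpha, mu):
        # Mapping (1.5) for a stationary policy mu, where mu[i] is the control used at state i.
        return np.array([g[i, mu[i]] + alpha * P[mu[i]][i] @ J for i in range(len(J))])

Composing T(T(...)) k times then gives T^k J, in agreement with its k-stage interpretation above.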
Some Basic Properties

The following monotonicity property plays a fundamental role in the developments of this volume.

Lemma 1.1: (Monotonicity Lemma) For any functions J : S ↦ ℜ and J' : S ↦ ℜ, such that J(x) ≤ J'(x) for all x ∈ S, and for any function μ : S ↦ C with μ(x) ∈ U(x) for all x ∈ S, we have

    (T^k J)(x) ≤ (T^k J')(x),    for all x ∈ S, k = 1, 2, ...,
    (T_μ^k J)(x) ≤ (T_μ^k J')(x),    for all x ∈ S, k = 1, 2, ....

Proof: The result follows by viewing (T^k J)(x) and (T_μ^k J)(x) as k-stage problem costs, since as the terminal cost function increases uniformly so will the k-stage costs. (One can also prove the lemma by using a straightforward induction argument.) Q.E.D.

For any two functions J : S ↦ ℜ and J' : S ↦ ℜ, we write J ≤ J' if J(x) ≤ J'(x) for all x ∈ S. With this notation, Lemma 1.1 is stated as

    J ≤ J'  implies  T^k J ≤ T^k J',    k = 1, 2, ...,
    J ≤ J'  implies  T_μ^k J ≤ T_μ^k J',    k = 1, 2, ....

Let us also denote by e : S ↦ ℜ the unit function that takes the value 1 identically on S:

    e(x) = 1,    for all x ∈ S.    (1.6)

We have from the definitions (1.4) and (1.5) of T and T_μ, for any function J : S ↦ ℜ and scalar r,

    (T(J + re))(x) = (TJ)(x) + αr,    x ∈ S,
    (T_μ(J + re))(x) = (T_μ J)(x) + αr,    x ∈ S.

More generally, the following lemma can be verified by induction using the preceding two relations.
Lemma 1.2: For every k, function J : S ↦ ℜ, stationary policy μ, and scalar r, we have

    (T^k(J + re))(x) = (T^k J)(x) + α^k r,    for all x ∈ S,    (1.7)
    (T_μ^k(J + re))(x) = (T_μ^k J)(x) + α^k r,    for all x ∈ S.    (1.8)

A Preview of Infinite Horizon Results

It is worth at this point to speculate on the type of results that we will be aiming for.

(a) Convergence of the DP Algorithm. Let J_0 denote the zero function [J_0(x) = 0 for all x ∈ S]. Since the infinite horizon cost of a policy is, by definition, the limit of its k-stage costs as k → ∞, it is reasonable to speculate that the optimal infinite horizon cost is equal to the limit of the optimal k-stage costs; that is,

    J*(x) = lim_{k→∞} (T^k J_0)(x),    x ∈ S.    (1.9)

This means that if we start with the zero function J_0 and iterate with the DP algorithm indefinitely, we will get in the limit the optimal cost function J*. Also, for α < 1 and a bounded function J, a terminal cost α^k J diminishes with k, so it is reasonable to speculate that, if α < 1,

    J*(x) = lim_{k→∞} (T^k J)(x),    for all x ∈ S and bounded functions J.    (1.10)

(b) Bellman's Equation. Since by definition we have for all x ∈ S

    (T^{k+1} J_0)(x) = min_{u∈U(x)} E_w { g(x, u, w) + α (T^k J_0)(f(x, u, w)) },    (1.11)

it is reasonable to speculate that if lim_{k→∞} T^k J_0 = J* as in (a) above, then we must have, by taking the limit as k → ∞,

    J*(x) = min_{u∈U(x)} E_w { g(x, u, w) + α J*(f(x, u, w)) },    x ∈ S,    (1.12)

or, equivalently,

    J* = TJ*.    (1.13)

This is known as Bellman's equation and asserts that the optimal cost function J* is a fixed point of the mapping T. We will see that
Bellman's equation holds for all the total cost minimization problems that we will consider, although depending on our assumptions, its proof can be quite complex.

(c) Characterization of Optimal Stationary Policies. If we view Bellman's equation as the DP algorithm taken to its limit as k → ∞, it is reasonable to speculate that if μ(x) attains the minimum in the right-hand side of Bellman's equation for all x, then the stationary policy μ is optimal.

Most of the analysis of total cost infinite horizon problems revolves around the above three issues and also around the issue of efficient computation of J* and an optimal stationary policy. For the discounted cost problems with bounded cost per stage considered in this chapter, and for stochastic shortest path problems under our assumptions of Chapter 2, the preceding conjectures are correct. For problems with unbounded costs per stage and for stochastic shortest path problems where our assumptions of Chapter 2 are violated, there may be counterintuitive mathematical phenomena that invalidate some of the preceding conjectures. This illustrates that infinite horizon problems should be approached carefully and with mathematical precision.

1.2 DISCOUNTED PROBLEMS WITH BOUNDED COST PER STAGE

We now discuss the simplest type of infinite horizon problem. We assume the following:

Assumption D (Discounted Cost - Bounded Cost per Stage): The cost per stage g satisfies

    |g(x, u, w)| ≤ M,    for all (x, u, w) ∈ S × C × D,    (2.1)

where M is some scalar. Furthermore, 0 < α < 1.

Boundedness of the cost per stage is not as restrictive as might appear. It holds for problems where the spaces S, C, and D are finite sets. Even if these spaces are not finite, during the computational solution of the problem they will ordinarily be approximated by finite sets. Also, it is often possible to reformulate the problem so that it is defined over bounded regions of the state and control spaces over which the cost is bounded.
The following proposition shows that the DP algorithm converges to the optimal cost function J* for an arbitrary bounded starting function J. This will follow as a consequence of Assumption D, which implies that the "tail" of the cost after stage N, that is,

    E { Σ_{k=N}^{∞} α^k g(x_k, μ_k(x_k), w_k) },

diminishes to zero as N → ∞. Furthermore, when a terminal cost α^N J(x_N) is added to the N-stage cost, its effect diminishes to zero as N → ∞ if J is bounded.

Proposition 2.1: (Convergence of the DP Algorithm) For any bounded function J : S ↦ ℜ, the optimal cost function satisfies

    J*(x) = lim_{N→∞} (T^N J)(x),    for all x ∈ S.    (2.2)

Proof: For every positive integer K, initial state x_0 ∈ S, and policy π = {μ_0, μ_1, ...}, we break down the cost J_π(x_0) into the portions incurred over the first K stages and over the remaining stages:

    J_π(x_0) = lim_{N→∞} E { Σ_{k=0}^{N-1} α^k g(x_k, μ_k(x_k), w_k) }
             = E { Σ_{k=0}^{K-1} α^k g(x_k, μ_k(x_k), w_k) } + lim_{N→∞} E { Σ_{k=K}^{N-1} α^k g(x_k, μ_k(x_k), w_k) }.

Since by Assumption D we have |g(x_k, μ_k(x_k), w_k)| ≤ M, we also obtain

    | lim_{N→∞} E { Σ_{k=K}^{N-1} α^k g(x_k, μ_k(x_k), w_k) } | ≤ M Σ_{k=K}^{∞} α^k = α^K M / (1 - α).

Using these relations, it follows that

    J_π(x_0) - α^K M / (1 - α) - α^K max_{x∈S} |J(x)|
      ≤ E { α^K J(x_K) + Σ_{k=0}^{K-1} α^k g(x_k, μ_k(x_k), w_k) }
      ≤ J_π(x_0) + α^K M / (1 - α) + α^K max_{x∈S} |J(x)|.
Sec. 1.2 I )i,s< < unit < ’< I /’/<»/>/(/jj.-. with lit и 11 и /<•< I ( *<».«./ /»<•/ S/.U4- I I By taking the minimum over 7Г, we obtain for all .z.’o and A', <>M7 1 — a — aK max|.7(z)| nKif < 4- ------к (\K' max|./(.r) and by taking the limit as A' —> oc, the result- follows. Q.E.D. Note that based on the preceding proposition, t he DP algorithm may be used to compute at least an approximation to ,7*. This computational method together with some additional methods will be examined in the next section. Given any stationary policy /i, we can consider a uiodilied discounted problem, which is the same as t.he original (except, that, the control constraint set contains only one element, for each state the control //.(j:); that is, the control constraint set is U(x) = {//.(r)} instead of U(x). Proposition 2.1 applies t.o this modified problem and yields the following corollary: Corollary 2.1.1: For every stationary policy p, the associated cost function satisfies = Jim (T;f J)(.r), for all x e S. (2.4) The next proposition shows that .7* is the unique solution of Bellman's equation. Proposition 2.2: (Bellman’s Equation) The optimal cost func- tion J* satisfies J*(.r) = min E{g(x, u, w) + aJ* (j(x, u, «')) }, for all x e S. (2.5) or, equivalently, J*=TJ*. (2.6) Furthermore, J* is the unique solution of this equation within the class of bounded functions.
Proof: From Eq. (2.3), we have for all x ∈ S and N,

    J*(x) - α^N M / (1 - α) ≤ (T^N J_0)(x) ≤ J*(x) + α^N M / (1 - α),

where J_0 is the zero function [J_0(x) = 0 for all x ∈ S]. Applying the mapping T to this relation and using the Monotonicity Lemma 1.1 as well as Lemma 1.2, we obtain for all x ∈ S and N

    (TJ*)(x) - α^{N+1} M / (1 - α) ≤ (T^{N+1} J_0)(x) ≤ (TJ*)(x) + α^{N+1} M / (1 - α).

Since (T^{N+1} J_0)(x) converges to J*(x) (cf. Prop. 2.1), by taking the limit as N → ∞ in the preceding relation, we obtain J* = TJ*. To show uniqueness, observe that if J is bounded and satisfies J = TJ, then J = lim_{N→∞} T^N J, so by Prop. 2.1, we have J = J*. Q.E.D.

Based on the same reasoning we used to obtain Cor. 2.1.1 from Prop. 2.1, we have:

Corollary 2.2.1: For every stationary policy μ, the associated cost function satisfies

    J_μ(x) = E_w { g(x, μ(x), w) + α J_μ(f(x, μ(x), w)) },    for all x ∈ S,    (2.7)

or, equivalently,

    J_μ = T_μ J_μ.

Furthermore, J_μ is the unique solution of this equation within the class of bounded functions.

The next proposition characterizes stationary optimal policies.

Proposition 2.3: (Necessary and Sufficient Condition for Optimality) A stationary policy μ is optimal if and only if μ(x) attains the minimum in Bellman's equation (2.5) for each x ∈ S; that is,

    TJ* = T_μ J*.    (2.8)

Proof: If TJ* = T_μ J*, then using Bellman's equation (J* = TJ*), we have J* = T_μ J*, so by the uniqueness part of Cor. 2.2.1, we obtain J* = J_μ;
that is, μ is optimal. Conversely, if the stationary policy μ is optimal, we have J* = J_μ, which by Cor. 2.2.1, yields J* = T_μ J*. Combining this with Bellman's equation (J* = TJ*), we obtain TJ* = T_μ J*. Q.E.D.

Note that Prop. 2.3 implies the existence of an optimal stationary policy when the minimum in the right-hand side of Bellman's equation is attained for all x ∈ S. In particular, when U(x) is finite for each x ∈ S, an optimal stationary policy is guaranteed to exist.

We finally show the following convergence rate estimate for any bounded function J:

    max_{x∈S} |(T^k J)(x) - J*(x)| ≤ α^k max_{x∈S} |J(x) - J*(x)|,    k = 0, 1, ....

This relation is obtained by combining Bellman's equation and the following result:

Proposition 2.4: For any two bounded functions J : S ↦ ℜ and J' : S ↦ ℜ, and for all k = 0, 1, ..., there holds

    max_{x∈S} |(T^k J)(x) - (T^k J')(x)| ≤ α^k max_{x∈S} |J(x) - J'(x)|.    (2.9)

Proof: Denote

    c = max_{x∈S} |J(x) - J'(x)|.

Then we have

    J(x) - c ≤ J'(x) ≤ J(x) + c,    x ∈ S.

Applying T^k in this relation and using the Monotonicity Lemma 1.1 as well as Lemma 1.2, we obtain

    (T^k J)(x) - α^k c ≤ (T^k J')(x) ≤ (T^k J)(x) + α^k c,    x ∈ S.

It follows that

    |(T^k J)(x) - (T^k J')(x)| ≤ α^k c,    x ∈ S,

which proves the result. Q.E.D.

As earlier, we have:
Corollary 2.4.1: For any two bounded functions J : S ↦ ℜ, J' : S ↦ ℜ, and any stationary policy μ, we have

    max_{x∈S} |(T_μ^k J)(x) - (T_μ^k J')(x)| ≤ α^k max_{x∈S} |J(x) - J'(x)|,    k = 0, 1, ....

Example 2.1 (Machine Replacement)

Consider an infinite horizon discounted version of a problem we formulated in Section 1.1 of Vol. I. Here, we want to operate efficiently a machine that can be in any one of n states, denoted 1, 2, ..., n. State 1 corresponds to a machine in perfect condition. The transition probabilities p_ij are given. There is a cost g(i) for operating for one time period the machine when it is in state i. The options at the start of each period are to (a) let the machine operate one more period in the state it currently is, or (b) replace the machine with a new machine (state 1) at a cost R. Once replaced, the machine is guaranteed to stay in state 1 for one period; in subsequent periods, it may deteriorate to states j > 1 according to the transition probabilities p_1j.

We assume an infinite horizon and a discount factor α ∈ (0, 1), so the theory of this section applies. Bellman's equation (cf. Prop. 2.2) takes the form

    J*(i) = min [ R + g(1) + α J*(1),  g(i) + α Σ_{j=1}^{n} p_ij J*(j) ].

By Prop. 2.3, a stationary policy is optimal if it replaces at states i where

    R + g(1) + α J*(1) ≤ g(i) + α Σ_{j=1}^{n} p_ij J*(j),

and it does not replace at states i where

    R + g(1) + α J*(1) ≥ g(i) + α Σ_{j=1}^{n} p_ij J*(j).

We can use the convergence of the DP algorithm (cf. Prop. 2.1) to characterize the optimal cost function using properties of the finite horizon cost functions. In particular, the DP algorithm starting from the zero function takes the form

    J_0(i) = 0,
    (TJ_0)(i) = min [ R + g(1),  g(i) ],
    (T^k J_0)(i) = min [ R + g(1) + α (T^{k-1} J_0)(1),  g(i) + α Σ_{j=1}^{n} p_ij (T^{k-1} J_0)(j) ].

Assume that g(i) is nondecreasing in i, and that the transition probabilities satisfy

    Σ_{j=1}^{n} p_ij J(j) ≤ Σ_{j=1}^{n} p_{(i+1)j} J(j),    i = 1, ..., n - 1,    (2.10)

for all functions J(j) that are monotonically nondecreasing in j. It can be shown that this assumption is satisfied if and only if, for every k,

    Σ_{j=k}^{n} p_ij

is monotonically nondecreasing in i (see [Ros83b], p. 252). The assumption (2.10) is satisfied in particular if

    p_ij = 0,    if j < i,

i.e., the machine cannot go to a better state with usage, and

    p_ij ≤ p_{(i+1)j},    if i < j,

i.e., there is greater chance of ending at a bad state j if we start at a worse state i.

Since g(i) is nondecreasing in i, we have that (TJ_0)(i) is nondecreasing in i, and in view of the assumption (2.10) on the transition probabilities, the same is true for (T^2 J_0)(i). Similarly, it is seen that for all k, (T^k J_0)(i) is nondecreasing in i and so is its limit

    J*(i) = lim_{k→∞} (T^k J_0)(i).

This is intuitively clear: the optimal cost should not decrease as the machine starts at a worse initial state. It follows that the function

    g(i) + α Σ_{j=1}^{n} p_ij J*(j)

is nondecreasing in i. Consider the set of states

    S_R = { i | R + g(1) + α J*(1) ≤ g(i) + α Σ_{j=1}^{n} p_ij J*(j) },

and let

    i* = smallest state in S_R, if S_R is nonempty;  i* = n + 1 otherwise.

Then an optimal policy takes the form

    replace if and only if i ≥ i*,

as shown in Fig. 1.2.1.
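As a numerical illustration of the threshold structure, the sketch below runs value iteration for a small instance of the machine replacement problem and recovers the critical state i*. The specific operating costs, replacement cost R, and transition matrix are hypothetical data chosen to satisfy the monotonicity assumption (2.10); they are not taken from the text.

    import numpy as np

    n, alpha, R = 4, 0.9, 5.0
    g = np.array([1.0, 2.0, 4.0, 8.0])           # operating cost g(i), nondecreasing in i
    P = np.array([[0.9, 0.1, 0.0, 0.0],          # p_ij = 0 for j < i: the machine only deteriorates
                  [0.0, 0.8, 0.2, 0.0],
                  [0.0, 0.0, 0.7, 0.3],
                  [0.0, 0.0, 0.0, 1.0]])

    J = np.zeros(n)
    for _ in range(500):                         # value iteration (cf. Prop. 2.1)
        J = np.minimum(R + g[0] + alpha * J[0], g + alpha * P @ J)

    replace = R + g[0] + alpha * J[0] <= g + alpha * P @ J   # membership in S_R
    i_star = int(np.argmax(replace)) + 1 if replace.any() else n + 1
    print("replace if and only if i >=", i_star)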
Figure 1.2.1 Determining the optimal policy in the machine replacement example.

1.3 FINITE-STATE SYSTEMS - COMPUTATIONAL METHODS

In this section we discuss several alternative approaches for numerically solving the discounted problem with bounded cost per stage. The first approach, value iteration, is essentially the DP algorithm and yields in the limit the optimal cost function and an optimal policy, as discussed in the preceding section. We will describe some variations aimed at accelerating convergence. Two other approaches, policy iteration and linear programming, terminate in a finite number of iterations (assuming the numbers of states and controls are finite). However, when the number of states is large, these approaches are impractical because of large overhead per iteration. Another approach, adaptive aggregation, bridges the gap between value iteration and policy iteration, and in a sense combines the best features of both methods.

In Section 2.3 we will consider some additional methods, which are well-suited for dynamic systems that are hard to model but relatively easy to simulate. In particular, we will assume in Section 2.3 that the transition probabilities of the problem are unknown, but the system's dynamics and cost structure can be observed through simulation. We will then discuss the methods of temporal differences and Q-learning, which also provide conceptual vehicles for approximate forms of value iteration and policy
Sec. I.:’, 1чilil.i'-St:it <• Sv.sf enis ( \>ni/>nl:il ioinil Met In и/s 17 iteration using, for example, neural networks. Throughout this section we assume a discounted problem (Assump- tion D holds). We further assume that t lit' stale, control, and disturbance spaces underlying the problem are finite sets, so that we are dealing in effect with the control of a finite-stale Markov chain. We first translate some of our earlier analysis in a notation that is more convenient for Markov chains. Let. the state space .S' consist, of // states denoted by 1.2..... n: •S' { I.-.!..a}. Wre denote by /),,(</) the transition probabilities P,j(«) = P(m+i = J | = i, ttk = «), ', J e .S’, Il G //(/). These transition probabilities may be given a priori or may be calculated from flat system equation •'Mi = "/. "'a) and t.he known probability distribution P(- | .r, u) of t he input disturbance ii’k- Indeed, we have /"./(") = /’("'•'.A") I '• ")• where ll’4(u) is the (finite) set li',j(u) = {m G D | /(/, u., ic) = j}. To simplify notation, we assume that the cost per stage does not. depend on tv. This amounts to using expected cost per stage in all calcula- tions, which makes no essential difference in the definit ions of the mappings T and 7), of Eqs. (1.1) and (1.5). and in the subsequent, analysis. Thus, if is the cost of using и at. state i and moving t.o .state j, we use as cost per stage the expected cost. </(/, u) given by n .'/('") = ^2/м(").')(ь "•./) M The mappings T and 7), of Eqs. (1.4) and (1.5) can be written as = min n .'/('") + '> 52/M ("PC/) i = 1,2......n.
    (T_μ J)(i) = g(i, μ(i)) + α Σ_{j=1}^{n} p_ij(μ(i)) J(j),    i = 1, 2, ..., n.

Any function J on S, as well as the functions TJ and T_μ J, may be represented by the n-dimensional vectors

    J = ( J(1), ..., J(n) )',    TJ = ( (TJ)(1), ..., (TJ)(n) )'.

For a stationary policy μ, we denote by P_μ the transition probability matrix

    P_μ = [ p_11(μ(1))  ...  p_1n(μ(1)) ]
          [     ...     ...      ...    ]
          [ p_n1(μ(n))  ...  p_nn(μ(n)) ]

and by g_μ the cost vector

    g_μ = ( g(1, μ(1)), ..., g(n, μ(n)) )'.

We can then write in vector notation

    T_μ J = g_μ + α P_μ J.

The cost function J_μ corresponding to a stationary policy μ is, by Cor. 2.2.1, the unique solution of the equation

    J_μ = T_μ J_μ = g_μ + α P_μ J_μ.

This equation should be viewed as a system of n linear equations with n unknowns, the components J_μ(i) of the n-dimensional vector J_μ. The equation can also be written as

    (I - α P_μ) J_μ = g_μ,

or, equivalently,

    J_μ = (I - α P_μ)^{-1} g_μ,    (3.1)

where I denotes the n × n identity matrix. The invertibility of the matrix I - α P_μ is assured since we have proved that the system of equations representing J_μ = T_μ J_μ has a unique solution for any vector g_μ (cf. Cor. 2.2.1). For another way to see that I - α P_μ is an invertible matrix, note that the eigenvalues of any transition probability matrix lie within the unit circle of the complex plane. Thus no eigenvalue of α P_μ can be equal to 1, which is the necessary and sufficient condition for I - α P_μ to be invertible.
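For a fixed stationary policy, Eq. (3.1) thus reduces policy evaluation to a single linear solve, as in the following sketch. The two-state transition matrix and cost vector are illustrative assumptions (they happen to match the optimal policy of the example in the next subsection), not data prescribed here.

    import numpy as np

    alpha = 0.9
    P_mu = np.array([[0.25, 0.75],      # assumed transition probability matrix P_mu
                     [0.75, 0.25]])
    g_mu = np.array([0.5, 1.0])         # assumed cost vector g_mu

    # J_mu = (I - alpha P_mu)^{-1} g_mu, cf. Eq. (3.1); solve() avoids forming the inverse explicitly.
    J_mu = np.linalg.solve(np.eye(2) - alpha * P_mu, g_mu)
    print(J_mu)                          # approximately [7.33, 7.67]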
1.3.1 Value Iteration and Error Bounds

Here we start with any n-dimensional vector J and successively compute TJ, T^2 J, .... By Prop. 2.1, we have for all i,

    lim_{k→∞} (T^k J)(i) = J*(i).

Furthermore, by Prop. 2.4 [using J' = J* in Eq. (2.9)], the error sequence |(T^k J)(i) - J*(i)| is bounded by a constant multiple of α^k, for all i ∈ S. This method is called value iteration or successive approximation.

The method can be substantially improved thanks to certain monotonic error bounds, which are easily obtained as a byproduct of the computation. The following argument is helpful in understanding the nature of these bounds. Let us first break down the cost of a stationary policy μ into the first stage cost and the remainder:

    J_μ(i) = g(i, μ(i)) + E { Σ_{k=1}^{∞} α^k g(x_k, μ(x_k), w_k) | x_0 = i }.

It follows that

    (β / (1 - α)) e ≤ J_μ ≤ (β̄ / (1 - α)) e,    (3.2)

where e is the unit vector, e = (1, 1, ..., 1)', and β and β̄ are the minimum and maximum cost per stage:

    β = min_i g(i, μ(i)),    β̄ = max_i g(i, μ(i)).

Using the definition of β and β̄, we can strengthen the bounds (3.2) as follows:

    g_μ + (α / (1 - α)) β e ≤ J_μ ≤ g_μ + (α / (1 - α)) β̄ e.    (3.3)

These bounds will now be applied in the context of the value iteration method. Suppose that we have a vector J and we compute

    T_μ J = g_μ + α P_μ J.

By subtracting this equation from the relation

    J_μ = T_μ J_μ = g_μ + α P_μ J_μ,
we obtain

    J_μ - J = T_μ J - J + α P_μ (J_μ - J).

This equation can be viewed as a variational form of the equation J_μ = T_μ J_μ, and implies that J_μ - J is the cost vector associated with the stationary policy μ and a cost per stage vector equal to T_μ J - J. Therefore, the bounds (3.3) apply with J_μ replaced by J_μ - J and g_μ replaced by T_μ J - J. It follows that

    T_μ J + (αγ / (1 - α)) e ≤ J_μ ≤ T_μ J + (αγ̄ / (1 - α)) e,

where

    γ = min_i [ (T_μ J)(i) - J(i) ],    γ̄ = max_i [ (T_μ J)(i) - J(i) ].

Equivalently, for every vector J, we have

    J + (γ / (1 - α)) e ≤ T_μ J + c e ≤ J_μ ≤ T_μ J + c̄ e ≤ J + (γ̄ / (1 - α)) e,

where

    c = αγ / (1 - α),    c̄ = αγ̄ / (1 - α).

The following proposition is obtained by a more sophisticated application of the preceding argument.

Proposition 3.1: For every vector J, state i, and k, we have

    (T^k J)(i) + c_k ≤ (T^{k+1} J)(i) + c_{k+1} ≤ J*(i) ≤ (T^{k+1} J)(i) + c̄_{k+1} ≤ (T^k J)(i) + c̄_k,    (3.4)

where

    c_k = (α / (1 - α)) min_{i=1,...,n} [ (T^k J)(i) - (T^{k-1} J)(i) ],    (3.5)

    c̄_k = (α / (1 - α)) max_{i=1,...,n} [ (T^k J)(i) - (T^{k-1} J)(i) ].    (3.6)
Proof: Denote

    γ = min_{i=1,...,n} [ (TJ)(i) - J(i) ].

We have

    J + γe ≤ TJ.    (3.7)

Applying T to both sides and using the monotonicity of T, we have

    TJ + αγe ≤ T^2 J,

and, combining this relation with Eq. (3.7), we obtain

    J + (1 + α)γe ≤ TJ + αγe ≤ T^2 J.    (3.8)

This process can be repeated, first applying T to obtain

    TJ + (α + α^2)γe ≤ T^2 J + α^2 γe ≤ T^3 J,

and then using Eq. (3.7) to write

    J + (1 + α + α^2)γe ≤ TJ + (α + α^2)γe ≤ T^2 J + α^2 γe ≤ T^3 J.

After k steps, this results in the inequalities

    J + (1 + α + ⋯ + α^k)γe ≤ TJ + (α + ⋯ + α^k)γe ≤ ⋯ ≤ T^k J + α^k γe ≤ T^{k+1} J.

Taking the limit as k → ∞ and using the equality c_1 = αγ/(1 - α), we obtain

    J + (γ / (1 - α)) e ≤ TJ + c_1 e ≤ T^2 J + αc_1 e ≤ ⋯ ≤ J*,    (3.9)

where c_1 is defined by Eq. (3.5). Replacing J by T^k J in this inequality, we have

    T^{k+1} J + c_{k+1} e ≤ J*,

which is the second inequality in Eq. (3.4). From Eq. (3.8), we have

    αγ ≤ min_{i=1,...,n} [ (T^2 J)(i) - (TJ)(i) ],
and consequently

    αc_1 ≤ c_2.

Using this relation in Eq. (3.9) yields

    TJ + c_1 e ≤ T^2 J + c_2 e,

and replacing J by T^{k-1} J, we have the first inequality in Eq. (3.4). An analogous argument shows the last two inequalities in Eq. (3.4). Q.E.D.

We note that the preceding proof does not rely on the finiteness of the state space, and indeed Prop. 3.1 can be proved for an infinite state space (see also Exercise 1.5). The following example demonstrates the nature of the error bounds.

Example 3.1 (Illustration of the Error Bounds)

Consider a problem where there are two states and two controls:

    S = {1, 2},    C = {u^1, u^2}.

The transition probabilities corresponding to the controls u^1 and u^2 are as shown in Fig. 1.3.1; that is, the transition probability matrices are

    P(u^1) = [ 3/4  1/4 ]        P(u^2) = [ 1/4  3/4 ]
             [ 3/4  1/4 ],                [ 1/4  3/4 ].

The transition costs are

    g(1, u^1) = 2,    g(1, u^2) = 0.5,    g(2, u^1) = 1,    g(2, u^2) = 3,

and the discount factor is α = 0.9. The mapping T is given for i = 1, 2 by

    (TJ)(i) = min [ g(i, u^1) + α Σ_{j=1}^{2} p_ij(u^1) J(j),  g(i, u^2) + α Σ_{j=1}^{2} p_ij(u^2) J(j) ].

The scalars c_k and c̄_k of Eqs. (3.5) and (3.6) are given by

    c_k = (α / (1 - α)) min { (T^k J)(1) - (T^{k-1} J)(1), (T^k J)(2) - (T^{k-1} J)(2) },
    c̄_k = (α / (1 - α)) max { (T^k J)(1) - (T^{k-1} J)(1), (T^k J)(2) - (T^{k-1} J)(2) }.

The results of the value iteration method starting with the zero function J_0 [J_0(1) = J_0(2) = 0] are shown in Fig. 1.3.2 and illustrate the power of the error bounds.
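The computations reported in Fig. 1.3.2 can be reproduced with the short sketch below, using the cost and transition data of Example 3.1. It is an illustrative implementation of value iteration with the error bounds of Prop. 3.1, not code from the text, and it assumes the transition matrices as reconstructed above.

    import numpy as np

    alpha = 0.9
    P = [np.array([[0.75, 0.25], [0.75, 0.25]]),   # control u1
         np.array([[0.25, 0.75], [0.25, 0.75]])]   # control u2
    g = np.array([[2.0, 0.5],                      # g(1,u1), g(1,u2)
                  [1.0, 3.0]])                     # g(2,u1), g(2,u2)

    def T(J):
        return np.minimum(g[:, 0] + alpha * P[0] @ J, g[:, 1] + alpha * P[1] @ J)

    J = np.zeros(2)
    for k in range(1, 16):
        TJ = T(J)
        c_lo = alpha / (1 - alpha) * np.min(TJ - J)    # c_k of Eq. (3.5)
        c_hi = alpha / (1 - alpha) * np.max(TJ - J)    # c-bar_k of Eq. (3.6)
        print(k, np.round(TJ, 3), np.round(TJ + c_lo, 3), np.round(TJ + c_hi, 3))
        J = TJ
    # The bracketing values converge to J* = (7.328, 7.672), as in Fig. 1.3.2.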
Figure 1.3.1 State transition diagram for Example 3.1: (a) u = u^1; (b) u = u^2.

    k    (T^kJ0)(1)   (T^kJ0)(2)   (T^kJ0)(1)+c_k   (T^kJ0)(1)+c̄_k   (T^kJ0)(2)+c_k   (T^kJ0)(2)+c̄_k
    0      0            0
    1      0.500        1.000        5.000            9.500            5.500            10.000
    2      1.287        1.562        6.350            8.375            6.625             8.650
    3      1.844        2.220        6.856            7.767            7.232             8.144
    4      2.414        2.745        7.129            7.540            7.460             7.870
    5      2.896        3.247        7.232            7.417            7.583             7.768
    6      3.343        3.686        7.287            7.371            7.629             7.712
    7      3.740        4.086        7.308            7.345            7.654             7.692
    8      4.099        4.441        7.319            7.336            7.663             7.680
    9      4.422        4.767        7.324            7.331            7.669             7.676
    10     4.713        5.057        7.326            7.329            7.671             7.674
    11     4.974        5.319        7.327            7.328            7.672             7.673
    12     5.209        5.554        7.327            7.328            7.672             7.673
    13     5.421        5.766        7.327            7.328            7.672             7.673
    14     5.612        5.957        7.328            7.328            7.672             7.672
    15     5.783        6.128        7.328            7.328            7.672             7.672

Figure 1.3.2 Performance of the value iteration method with and without the error bounds of Prop. 3.1 for the problem of Example 3.1.

Termination Issues - Optimality of the Obtained Policy

Let us now discuss how to use the error bounds to obtain an optimal
or near-optimal policy in a finite number of value iterations. We first note that given any J, if we compute TJ and a policy μ attaining the minimum in the calculation of TJ, i.e., T_μ J = TJ, then we can obtain the following bound on the suboptimality of μ:

    max_i [ J_μ(i) - J*(i) ] ≤ (α / (1 - α)) ( max_i [ (TJ)(i) - J(i) ] - min_i [ (TJ)(i) - J(i) ] ).    (3.10)

To see this, apply Eq. (3.4) with k = 1 to obtain for all i

    c_1 ≤ J*(i) - (TJ)(i) ≤ c̄_1,

and also apply Eq. (3.4) with k = 1 and with T_μ replacing T to obtain

    c_1 ≤ J_μ(i) - (T_μ J)(i) = J_μ(i) - (TJ)(i) ≤ c̄_1.

Subtracting the above two relations, we obtain the estimate (3.10).

In practice, one terminates the value iteration method when the difference (c̄_k - c_k) of the error bounds becomes sufficiently small. One can then take as final estimate of J* the "median"

    J̃_k = T^k J + ((c_k + c̄_k) / 2) e,    (3.11)

or the "average"

    Ĵ_k = T^k J + (α / (n(1 - α))) Σ_{i=1}^{n} ( (T^k J)(i) - (T^{k-1} J)(i) ) e.    (3.12)

Both of these vectors lie in the region delineated by the error bounds. Then, the estimate (3.10) provides a bound on the suboptimality of the policy μ attaining the minimum in the calculation of T^k J.

The bound (3.10) can also be used to show that after a sufficiently large number of value iterations, the stationary policy μ^k that attains the minimum in the kth value iteration [i.e., T_{μ^k}(T^{k-1} J) = T^k J] is optimal. Indeed, since the number of stationary policies is finite, there exists an ε > 0 such that if a stationary policy μ satisfies

    max_i [ J_μ(i) - J*(i) ] ≤ ε,

then μ is optimal. Now let k̄ be such that for all k ≥ k̄ we have

    (α / (1 - α)) ( max_i [ (T^k J)(i) - (T^{k-1} J)(i) ] - min_i [ (T^k J)(i) - (T^{k-1} J)(i) ] ) ≤ ε.

Then from Eq. (3.10) we see that for all k ≥ k̄, the stationary policy that attains the minimum in the kth value iteration is optimal.
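In implementation terms, the bound (3.10) is a one-line test on the current value iterate. The helper below is a hedged sketch; it assumes that TJ has already been computed, for example with an operator like the one sketched earlier in this chapter.

    import numpy as np

    def greedy_policy_suboptimality_bound(J, TJ, alpha):
        # Bound (3.10): max_i [J_mu(i) - J*(i)] <= alpha/(1-alpha) * (max_i d(i) - min_i d(i)),
        # where d = TJ - J and mu is any policy attaining the minimum in the computation of TJ.
        d = np.asarray(TJ) - np.asarray(J)
        return alpha / (1 - alpha) * (d.max() - d.min())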
Rate of Convergence

To analyze the rate of convergence of value iteration with error bounds, assume that there is a stationary policy μ* that attains the minimum over μ in the relation

    min_μ T_μ (T^{k-1} J) = T^k J

for all k sufficiently large, so that eventually the method reduces to the linear iteration

    J := g_{μ*} + α P_{μ*} J.

In view of our preceding discussion, this is true for example if μ* is a unique optimal stationary policy. Generally the rate of convergence of linear iterations is governed by the maximum eigenvalue modulus of the matrix of the iteration [which is α in our case, since any transition probability matrix has a unit eigenvalue with corresponding eigenvector e = (1, 1, ..., 1)', while all other eigenvalues lie within the unit circle of the complex plane]. It turns out, however, that when error bounds are used, the rate at which the iterates J̃_k and Ĵ_k of Eqs. (3.11) and (3.12) approach the optimal cost vector J* is governed by the modulus of the subdominant eigenvalue of the transition probability matrix P_{μ*}, that is, the eigenvalue with second largest modulus. The proof of this is outlined in Exercise 1.18.

For a sketch of the ideas involved, let λ_1, ..., λ_n be the eigenvalues of P_{μ*} ordered according to decreasing modulus; that is,

    |λ_1| ≥ |λ_2| ≥ ⋯ ≥ |λ_n|,

with λ_1 equal to 1 and λ_2 being the subdominant eigenvalue. Assume that there is a set of linearly independent eigenvectors e_1, e_2, ..., e_n corresponding to λ_1, λ_2, ..., λ_n, with e_1 = e = (1, 1, ..., 1)'. Then the initial error J - J_{μ*} can be expressed as a linear combination of the eigenvectors

    J - J_{μ*} = ξ_1 e + Σ_{j=2}^{n} ξ_j e_j

for some scalars ξ_1, ξ_2, ..., ξ_n. Since T_{μ*} J = g_{μ*} + α P_{μ*} J and J_{μ*} = g_{μ*} + α P_{μ*} J_{μ*}, successive errors are related by

    T_{μ*} J - J_{μ*} = α P_{μ*} (J - J_{μ*}),    for all J.

Thus the error after k iterations can be written as

    T_{μ*}^k J - J_{μ*} = α^k ξ_1 e + α^k Σ_{j=2}^{n} λ_j^k ξ_j e_j.

Using the error bounds of Prop. 3.1 amounts to a translation of T^k J along the vector e. Thus, at best, the error bounds are tight enough to eliminate the component α^k ξ_1 e of the error, but cannot affect the remaining term α^k Σ_{j=2}^{n} λ_j^k ξ_j e_j, which diminishes like α^k |λ_2|^k, with λ_2 being the subdominant eigenvalue.
Problems where Convergence is Slow

In Example 3.1, the convergence of value iteration with the error bounds is very fast. For this example, it can be verified that μ*(1) = u^2, μ*(2) = u^1, and that

    P_{μ*} = [ 1/4  3/4 ]
             [ 3/4  1/4 ].

The eigenvalues of P_{μ*} can be calculated to be λ_1 = 1 and λ_2 = -1/2, which explains the fast convergence, since the modulus 1/2 of the subdominant eigenvalue λ_2 is considerably smaller than one.

On the other hand, there are situations where convergence of the method even with the use of error bounds is very slow. For example, suppose that P_{μ*} is block diagonal with two or more blocks, or more generally, that P_{μ*} corresponds to a system with more than one recurrent class of states (see Appendix D of Vol. I). Then it can be shown that the subdominant eigenvalue λ_2 is equal to 1, and convergence is typically slow when α is close to 1.

As an example, consider the following three simple deterministic problems, each having a single policy and more than one recurrent class of states:

Problem 1: n = 3, P_μ = three-dimensional identity, g(i, μ(i)) = i.

Problem 2: n = 5, P_μ = five-dimensional identity, g(i, μ(i)) = i.

Problem 3: n = 6, g(i, μ(i)) = i, and

    P_μ = [ 1  0  0  0  0  0 ]
          [ 0  0  1  0  0  0 ]
          [ 0  1  0  0  0  0 ]
          [ 0  0  0  0  1  0 ]
          [ 0  0  0  0  0  1 ]
          [ 0  0  0  1  0  0 ]

Figure 1.3.3 shows the number of iterations needed by the value iteration method with and without the error bounds of Prop. 3.1 to find J_μ within an error per coordinate of less than or equal to 10^-6 max_i J_μ(i). The starting function in all cases was taken to be zero. The performance is rather unsatisfactory but, nonetheless, is typical of situations where the subdominant eigenvalue modulus of the optimal transition probability matrix is close to 1. One possible approach to improve the performance of value iteration for such problems is based on the adaptive aggregation method to be discussed in Section 1.3.3.
Sec. 1.3 ldni(e-Sl<i(e Systems ( *<>liipu(;H ioiifd Mel hods Pr. 1 о .9 Pr. 1 о = .99 Pr. 2 о .9 Pr. 2 о .99 Pr. 3 о - .9 Pr. 3 о .99 W/out bounds 131 1371 131 1371 132 1392 With bounds 127 1333 129 1352 131 1371 Figure 1.3.3 Nunibrr of it era t ions for t he value it erat ion met he к I with mid with- out error bounds. The problems arc deterministic. Because' the siibdoiiiinaiit eigenvalue of the I ransit ion probability matrix is equal to 1, the error bounds are inclfect ive. Elimination of Nouopt,iinal Actions in Value Iteration We know from Prop. 2.3 that,, il ft £ I7(i) is such that <7(t. ft) + о ^2p,;(f/)J*0) > 7*(1). J-1 then ft cannot be optimal at state /; that is, for every optimal stationary policy /(, we have //.(/) yf ft. Therefore, if we are sure that the above inequality holds, we can safely eliminate ft from the admissible set U(i). While we cannot check this inequality, since we do not know the optimal cost function ,7*. we can guarantee that it holds if </(i. ii) + о ^21>ij(.ii)d.(.j) > J(i). (3.13) j = i where 7 and 7 are upper and lower bounds satisfying 7(i)<7*(i)<7(0. / = 1....» The preceding observation is the basis for a useful application of the error bounds given earlier in Prop. 3.1. As these bounds arc computed in the course of the value iteration method, the inequality (3.13) can be simultaneously checked and nonopt,iinal actions can be eliminated from the admissible set with attendant savings in subsequent computations. Since’ the upper and lower bound functions .7 ami 7 converge to J*, it can be seen [taking into account the finiteness of the const raint set U(/)] t hat event ually all nonopt,iinal fi £ U(i) will be eliminated, I,hereby reducing the set, U(i) after a finite number of iterations to the set of controls that are optimal at i. In this manner the computational requirements of value iteration can lie substantially reduced. However, t he amount, of computer memory required to maintain the set of controls not as yet eliminated at each i E S may be increased.
Gauss-Seidel Version of Value Iteration

In the value iteration method described earlier, the estimate of the cost function is iterated for all states simultaneously. An alternative is to iterate one state at a time, while incorporating into the computation the interim results. This corresponds to using what is known as the Gauss-Seidel method for solving the nonlinear system of equations $J = TJ$ (see [BeT89a] or [OrR70]). For $n$-dimensional vectors $J$, define the mapping $F$ by
\[
(FJ)(1) = \min_{u \in U(1)} \Big[ g(1,u) + \alpha \sum_{j=1}^{n} p_{1j}(u) J(j) \Big], \qquad (3.14)
\]
and, for $i = 2,\ldots,n$,
\[
(FJ)(i) = \min_{u \in U(i)} \Big[ g(i,u) + \alpha \sum_{j<i} p_{ij}(u)(FJ)(j) + \alpha \sum_{j \ge i} p_{ij}(u) J(j) \Big]. \qquad (3.15)
\]
In words, $(FJ)(i)$ is computed by the same equation as $(TJ)(i)$ except that the previously calculated values $(FJ)(1),\ldots,(FJ)(i-1)$ are used in place of $J(1),\ldots,J(i-1)$. Note that the computation of $FJ$ is as easy as the computation of $TJ$ (unless a parallel computer is used, in which case the computation of $TJ$ may potentially be obtained much faster than $FJ$; see [Tsi89], [BeT91a] for a comparative analysis).

Consider now the value iteration method whereby we compute $J$, $FJ$, $F^2J,\ldots$. The following propositions show that the method is valid and provide an indication of better performance over the earlier value iteration method.

Proposition 3.2: Let $J$, $J'$ be two $n$-dimensional vectors. Then for any $k = 0,1,\ldots,$
\[
\max_i |(F^k J)(i) - (F^k J')(i)| \le \alpha^k \max_i |J(i) - J'(i)|. \qquad (3.16)
\]
Furthermore, we have
\[
(FJ^*)(i) = J^*(i), \qquad i \in S, \qquad (3.17)
\]
\[
\lim_{k \to \infty} (F^k J)(i) = J^*(i), \qquad i \in S. \qquad (3.18)
\]
Proof: It is sufficient to prove Eq. (3.16) for $k = 1$. We have, by the definition of $F$ and Prop. 2.4,
\[
|(FJ)(1) - (FJ')(1)| \le \alpha \max_i |J(i) - J'(i)|.
\]
Also, using this inequality,
\[
|(FJ)(2) - (FJ')(2)| \le \alpha \max\big\{ |(FJ)(1) - (FJ')(1)|,\ |J(2) - J'(2)|,\ \ldots,\ |J(n) - J'(n)| \big\} \le \alpha \max_i |J(i) - J'(i)|.
\]
Proceeding similarly, we have, for every $i$,
\[
|(FJ)(i) - (FJ')(i)| \le \alpha \max_i |J(i) - J'(i)|,
\]
so Eq. (3.16) is proved for $k = 1$. The equation $FJ^* = J^*$ follows from the definitions (3.14) and (3.15) of $F$, and Bellman's equation $J^* = TJ^*$. The convergence property (3.18) follows from Eqs. (3.16) and (3.17). Q.E.D.

Proposition 3.3: If an $n$-dimensional vector $J$ satisfies
\[
J(i) \le (TJ)(i), \qquad i = 1,\ldots,n,
\]
then
\[
(T^k J)(i) \le (F^k J)(i) \le J^*(i), \qquad i = 1,\ldots,n, \quad k = 1,2,\ldots. \qquad (3.19)
\]

Proof: The proof follows by using the definitions (3.14) and (3.15) of $F$, and the monotonicity property of $T$ (Lemma 1.1). Q.E.D.

The preceding proposition provides the main motivation for employing the mapping $F$ in place of $T$ in the value iteration method. The result indicates that the Gauss-Seidel version converges faster than the ordinary value iteration method. The faster convergence property can be substantiated by further analysis (see, e.g., [BeT89a]) and has been confirmed in practice through extensive experimentation. This comparison is somewhat misleading, however, because the ordinary method will normally be used in conjunction with the error bounds of Prop. 3.1. One may also employ error bounds in the Gauss-Seidel version (see Exercise 1.9). However, there is no clear superiority of one method over the other when bounds are introduced. Furthermore, the ordinary method is better suited for parallel computation than the Gauss-Seidel version.
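A minimal sketch of the Gauss-Seidel sweep (3.14)-(3.15) follows; it is our own illustration with assumed inputs, not code from the text. Updating the cost vector in place while sweeping over the states is exactly what makes the iteration Gauss-Seidel rather than ordinary value iteration.

```python
import numpy as np

def gauss_seidel_vi(g, P, alpha, sweeps=100):
    # g[i][u]: cost of control u at state i;  P[i][u]: array of transition probabilities.
    n = len(g)
    J = np.zeros(n)
    for _ in range(sweeps):
        for i in range(n):
            # In-place update: states j < i already hold their new values (FJ)(j).
            J[i] = min(g[i][u] + alpha * P[i][u] @ J for u in range(len(g[i])))
    return J
```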
We note that there is a more flexible form of the Gauss-Seidel method, which selects states in arbitrary order to update their costs. This method maintains an approximation $J$ to the optimal vector $J^*$, and at each iteration it selects a state $i$ and replaces $J(i)$ by $(TJ)(i)$. The remaining values $J(j)$, $j \ne i$, are left unchanged. The choice of the state $i$ at each iteration is arbitrary, except for the restriction that all states are selected infinitely often. This method is an example of an asynchronous fixed point iteration, and can be shown to converge to $J^*$ starting from any initial $J$. Analyses of this type of method are given in [Ber82a], and in Chapter 6 of [BeT89a]; see also Exercise 1.15.

Generic Rank-One Corrections

We may view value iteration coupled with the error bounds of Prop. 3.1 as a method that makes a correction to the result of value iteration along the unit vector $e$. It is possible to generalize the idea of correction along a fixed vector so that it works for any type of convergent linear iteration.

Let us consider the case of a single stationary policy $\mu$ and an iteration of the form $J := FJ$, where
\[
FJ = h_\mu + Q_\mu J.
\]
Here, $Q_\mu$ is a matrix with eigenvalues strictly within the unit circle, and $h_\mu$ is a vector such that
\[
J_\mu = h_\mu + Q_\mu J_\mu.
\]
An example is the Gauss-Seidel iteration of Section 1.3.1, and some other examples are given in Exercises 1.4, 1.5, and 1.7, and in Section 5.3. Also, the value iteration method for stochastic shortest path problems and a single stationary policy, to be discussed in Section 2.2, is of the above form.

Consider in place of $J := FJ$ an iteration of the form $J := F\bar J$, where $\bar J$ is related to $J$ by
\[
\bar J = J + \gamma d,
\]
with $d$ a fixed vector and $\gamma$ a scalar to be selected in some optimal manner. In particular, consider choosing $\gamma$ by minimizing over $\gamma$
\[
\| J + \gamma d - F(J + \gamma d) \|^2,
\]
which, by denoting $z = Q_\mu d$,
can be written as
\[
\| J - FJ + \gamma (d - z) \|^2.
\]
By setting to zero the derivative of this expression with respect to $\gamma$, it is straightforward to verify that the optimal solution is
\[
\bar\gamma = \frac{(d - z)'(FJ - J)}{\|d - z\|^2}.
\]
Thus the iteration $J := F\bar J$ can be written as $J := MJ$, where
\[
MJ = FJ + \bar\gamma z.
\]
We note that this iteration requires only slightly more computation than the iteration $J := FJ$, since the vector $z$ is computed once and the computation of $\bar\gamma$ is simple.

A key question of course is under what circumstances the iteration $J := MJ$ converges faster than the iteration $J := FJ$, and whether indeed it converges at all to $J_\mu$. It is straightforward to verify that in the case where $Q_\mu = \alpha P_\mu$ and $d = e$, the iteration $J := MJ$ can be written as
\[
J := T_\mu J + \frac{\alpha}{n(1-\alpha)} \sum_{i=1}^{n} \big[ (T_\mu J)(i) - J(i) \big] e
\]
[compare with Eq. (3.12)]. Thus in this case the iteration $J := MJ$ shifts the result $T_\mu J$ of value iteration to a vector that lies somewhere in the middle of the error bound range given by Prop. 3.1. By the result of this proposition it follows that the iteration converges to $J_\mu$.

Generally, however, the iteration $J := MJ$ need not converge in the case where the direction vector $d$ is chosen arbitrarily. If on the other hand $d$ is chosen to be an eigenvector of $Q_\mu$, convergence can be proved. This is shown in Exercise 1.8, where it is also proved that if $d$ is the eigenvector corresponding to the dominant eigenvalue of $Q_\mu$ (the one with largest modulus), the convergence rate of the iteration $J := MJ$ is governed by the subdominant eigenvalue of $Q_\mu$ (the one with second largest modulus).

One possibility for finding approximately such an eigenvector is to apply $F$ a sufficiently large number of times to a vector $J$. In particular, suppose that the initial error $J - J_\mu$ can be decomposed as
\[
J - J_\mu = \sum_{j=1}^{n} \xi_j e_j
\]
for some scalars $\xi_1,\ldots,\xi_n$, where $e_1,\ldots,e_n$ are eigenvectors of $Q_\mu$, and $\lambda_1,\ldots,\lambda_n$ are corresponding eigenvalues. Suppose also that $\lambda_1$ is the
unique dominant eigenvalue, that is, $|\lambda_j| < |\lambda_1|$ for $j = 2,\ldots,n$. Then the difference $F^{k+1}J - F^k J$ is nearly equal to $\xi_1(\lambda_1^{k+1} - \lambda_1^k) e_1$ for large $k$ and can be used to estimate the dominant eigenvector $e_1$. In order to decide whether $k$ has been chosen large enough, one can test to see if the angle between the successive differences $F^{k+1}J - F^k J$ and $F^k J - F^{k-1}J$ is very small; if this is so, the components of $F^{k+1}J - F^k J$ along the eigenvectors $e_2,\ldots,e_n$ must also be very small. (For a more sophisticated version of this argument, see [Ber93], where the generic rank-one correction method is developed in more general form.)

We can thus consider a two-phase approach: in the first phase, we apply several times the regular iteration $J := FJ$, both to improve our estimate of $J_\mu$ and also to obtain an estimate $d$ of an eigenvector corresponding to a dominant eigenvalue; in the second phase we use the modified iteration $J := MJ$ that involves extrapolation along $d$. It can be shown that the two-phase method converges to $J_\mu$ provided the error in the estimation of $d$ is small enough, that is, the cosine of the angle between $d$ and $Q_\mu d$, as measured by the ratio
\[
\frac{(F^k J - F^{k-1}J)'(F^{k-1}J - F^{k-2}J)}{\|F^k J - F^{k-1}J\| \, \|F^{k-1}J - F^{k-2}J\|},
\]
is sufficiently close to one. Note that the computation of the first phase is not wasted, since it uses the iteration $J := FJ$ that we are trying to accelerate. Furthermore, since the second phase involves the calculation of $FJ$ at the current iterate $J$, any error bounds or termination criteria based on $FJ$ can be used to terminate the algorithm. As a result, the same finite termination mechanism can be used for both iterations $J := FJ$ and $J := MJ$.

One difficulty of the correction method outlined above is that the appropriate vector $d$ depends on $Q_\mu$ and therefore also on $\mu$. In the case of optimization over several policies, the mapping $F$ is defined by
\[
(FJ)(i) = \min_{u \in U(i)} \big[ h_u(i) + q_u(i)' J \big], \qquad i = 1,\ldots,n. \qquad (3.20)
\]
One can then use the rank-one correction approach in two different ways:

(1) Iteratively compute the cost vectors of the policies generated by a policy iteration scheme of the type discussed in the next subsection.

(2) Guess at an optimal policy within the first phase, switch to the second phase, and then return to the first phase if the policy changes "substantially" during the second phase. In particular, in the first phase, the iteration $J := FJ$ is used, where $F$ is the nonlinear mapping of Eq. (3.20). Upon switching to the second phase, the vector $z$ is taken
to be equal to $Q_{\mu^*} d$, where $\mu^*$ is the policy that attains the minimum in Eq. (3.20) at the time of the switch. The second phase consists of the iteration
\[
J := MJ = FJ + \bar\gamma z,
\]
where $F$ is the nonlinear mapping of Eq. (3.20), and $\bar\gamma$ is again given by
\[
\bar\gamma = \frac{(d - z)'(FJ - J)}{\|d - z\|^2}.
\]
To guard against subsequent changes in policy, which induce corresponding changes in the matrix $Q_{\mu^*}$, one should ensure that the method is working properly, for example, by recomputing $d$ if the policy changes and/or the error $\|FJ - J\|$ is not reduced at a satisfactory rate. This method is generally effective because the value iteration method typically finds an optimal policy much before it finds the optimal cost vector.

It should be mentioned, however, that the rank-one correction method is ineffective if there is little or no separation between the dominant and the subdominant eigenvalue moduli, both because the convergence rate of the method for obtaining $d$ is slow, and also because the convergence rate of the modified iteration $J := MJ$ is not much faster than the one of the regular iteration $J := FJ$. For such problems, one should try corrections over subspaces of dimension larger than one (see [Ber93], and the adaptive aggregation and multiple-rank correction methods given in Section 1.3.3).

Infinite State Space — Approximate Value Iteration

The value iteration method is valid under the assumptions of Prop. 2.1, so it is guaranteed to converge to $J^*$ for problems with infinite state and control spaces. However, for such problems, the method may be implementable only through approximations. In particular, given a function $J$, one may only be able to calculate a function $\tilde J$ such that
\[
\max_{x \in S} |\tilde J(x) - (TJ)(x)| \le \epsilon, \qquad (3.21)
\]
where $\epsilon$ is a given positive scalar. A similar situation may occur even when the state space is finite but the number of states is very large. Then, instead of calculating $(TJ)(x)$ for all states $x$, one may do so only for some states and estimate $(TJ)(x)$ for the remaining states $x$ by some form of interpolation, or by a least-squares fit of $(TJ)(x)$ with a function from a suitable parametric class (compare with the discussion of Section 2.3). Then the function $\tilde J$ thus obtained will satisfy a relation such as (3.21).
We are thus led to consider the approximate value iteration method that generates a sequence $\{J_k\}$ satisfying
\[
\max_{x \in S} |J_{k+1}(x) - (TJ_k)(x)| \le \epsilon, \qquad k = 0,1,\ldots, \qquad (3.22)
\]
starting from an arbitrary bounded function $J_0$. Generally, such a sequence "converges" to $J^*$ to within an error of $\epsilon/(1-\alpha)$. To see this, note that Eq. (3.22) yields
\[
TJ_0 - \epsilon e \le J_1 \le TJ_0 + \epsilon e.
\]
By applying $T$ to this relation, we obtain
\[
T^2 J_0 - \alpha\epsilon e \le TJ_1 \le T^2 J_0 + \alpha\epsilon e,
\]
so by using Eq. (3.22) to write
\[
TJ_1 - \epsilon e \le J_2 \le TJ_1 + \epsilon e,
\]
we have
\[
T^2 J_0 - \epsilon(1+\alpha)e \le J_2 \le T^2 J_0 + \epsilon(1+\alpha)e.
\]
Proceeding similarly, we obtain for all $k \ge 1$,
\[
T^k J_0 - \epsilon(1 + \alpha + \cdots + \alpha^{k-1})e \le J_k \le T^k J_0 + \epsilon(1 + \alpha + \cdots + \alpha^{k-1})e.
\]
By taking the limit superior and the limit inferior as $k \to \infty$, and by using the fact $\lim_{k\to\infty} T^k J_0 = J^*$, we see that
\[
J^* - \frac{\epsilon}{1-\alpha} e \le \liminf_{k\to\infty} J_k \le \limsup_{k\to\infty} J_k \le J^* + \frac{\epsilon}{1-\alpha} e.
\]

It is also possible to obtain versions of the error bounds of Prop. 3.1 for the approximate value iteration method. We have from that proposition
\[
TJ_k + \frac{\alpha}{1-\alpha}\min_{x\in S}\big[(TJ_k)(x) - J_k(x)\big]e \le J^* \le TJ_k + \frac{\alpha}{1-\alpha}\max_{x\in S}\big[(TJ_k)(x) - J_k(x)\big]e.
\]
By using Eq. (3.22) in the above relation, we obtain
\[
J_{k+1} - \epsilon e + \frac{\alpha}{1-\alpha}\min_{x\in S}\big[J_{k+1}(x) - \epsilon - J_k(x)\big]e \le J^* \le J_{k+1} + \epsilon e + \frac{\alpha}{1-\alpha}\max_{x\in S}\big[J_{k+1}(x) + \epsilon - J_k(x)\big]e,
\]
or
\[
J_{k+1} + \frac{-\epsilon + \alpha\min_{x\in S}\big[J_{k+1}(x) - J_k(x)\big]}{1-\alpha}\,e \le J^* \le J_{k+1} + \frac{\epsilon + \alpha\max_{x\in S}\big[J_{k+1}(x) - J_k(x)\big]}{1-\alpha}\,e.
\]
These bounds hold even when the state space is infinite, because the bounds of Prop. 3.1 can be shown for an infinite state space as well. However, for these bounds to be useful, one should know $\epsilon$.
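The following small experiment (ours, with made-up random data) checks the $\epsilon/(1-\alpha)$ bound numerically: each application of $T$ is corrupted by noise of size $\epsilon$, as in Eq. (3.22), and the final error is compared with the bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, alpha, eps = 20, 3, 0.9, 1e-2
g = rng.random((n, m))
P = rng.random((n, m, n)); P /= P.sum(axis=2, keepdims=True)

def T(J):
    return np.min(g + alpha * np.einsum('iuj,j->iu', P, J), axis=1)

J_star = np.zeros(n)
for _ in range(2000):          # exact value iteration, for reference
    J_star = T(J_star)

J = np.zeros(n)
for _ in range(2000):          # approximate value iteration, Eq. (3.22)
    J = T(J) + rng.uniform(-eps, eps, size=n)

print("final error     :", np.max(np.abs(J - J_star)))
print("bound eps/(1-a) :", eps / (1 - alpha))
```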
1.3.2 Policy Iteration

The policy iteration algorithm generates a sequence of stationary policies, each with improved cost over the preceding one. Given the stationary policy $\mu$ and the corresponding cost function $J_\mu$, an improved policy $\{\bar\mu,\bar\mu,\ldots\}$ is computed by minimization in the DP equation corresponding to $J_\mu$, that is, $T_{\bar\mu}J_\mu = TJ_\mu$, and the process is repeated. The algorithm is based on the following proposition.

Proposition 3.4: Let $\mu$ and $\bar\mu$ be stationary policies such that $T_{\bar\mu}J_\mu = TJ_\mu$, or equivalently, for $i = 1,\ldots,n$,
\[
g\big(i,\bar\mu(i)\big) + \alpha\sum_{j=1}^{n} p_{ij}\big(\bar\mu(i)\big) J_\mu(j) = \min_{u\in U(i)}\Big[ g(i,u) + \alpha\sum_{j=1}^{n} p_{ij}(u) J_\mu(j) \Big].
\]
Then we have
\[
J_{\bar\mu}(i) \le J_\mu(i), \qquad i = 1,\ldots,n. \qquad (3.23)
\]
Furthermore, if $\mu$ is not optimal, strict inequality holds in the above equation for at least one state $i$.

Proof: Since $J_\mu = T_\mu J_\mu$ (Cor. 2.2.1) and, by hypothesis, $T_{\bar\mu}J_\mu = TJ_\mu$, we have for every $i$,
\[
J_\mu(i) = g\big(i,\mu(i)\big) + \alpha\sum_{j=1}^{n} p_{ij}\big(\mu(i)\big) J_\mu(j) \ge (TJ_\mu)(i) = (T_{\bar\mu}J_\mu)(i).
\]
Applying $T_{\bar\mu}$ repeatedly on both sides of this inequality and using the monotonicity of $T_{\bar\mu}$ (Lemma 1.1) and Cor. 2.1.1, we obtain
\[
J_\mu \ge T_{\bar\mu}J_\mu \ge T_{\bar\mu}^2 J_\mu \ge \cdots \ge \lim_{k\to\infty} T_{\bar\mu}^k J_\mu = J_{\bar\mu},
\]
proving Eq. (3.23). If $J_\mu = J_{\bar\mu}$, then from the preceding relation it follows that $J_\mu = T_{\bar\mu}J_\mu$, and since by hypothesis we have $T_{\bar\mu}J_\mu = TJ_\mu$, we obtain $J_\mu = TJ_\mu$, implying that $J_\mu = J^*$ by Prop. 2.2. Thus $\mu$ must be optimal. It follows that if $\mu$ is not optimal, then $J_{\bar\mu}(i) < J_\mu(i)$ for some state $i$. Q.E.D.
Policy Iteration Algorithm

Step 1: (Initialization) Guess an initial stationary policy $\mu^0$.

Step 2: (Policy Evaluation) Given the stationary policy $\mu^k$, compute the corresponding cost function $J_{\mu^k}$ from the linear system of equations
\[
(I - \alpha P_{\mu^k}) J_{\mu^k} = g_{\mu^k}.
\]

Step 3: (Policy Improvement) Obtain a new stationary policy $\mu^{k+1}$ satisfying
\[
T_{\mu^{k+1}} J_{\mu^k} = T J_{\mu^k}.
\]
If $J_{\mu^{k+1}} = J_{\mu^k}$, stop; else return to Step 2 and repeat the process.

Since the collection of all stationary policies is finite (by the finiteness of $S$ and $U$) and an improved policy is generated at every iteration, it follows that the algorithm will find an optimal stationary policy in a finite number of iterations. This property is the main advantage of policy iteration over value iteration, which in general converges in an infinite number of iterations. On the other hand, finding the exact value of $J_{\mu^k}$ in Step 2 of the algorithm requires solving the system of linear equations $(I - \alpha P_{\mu^k})J_{\mu^k} = g_{\mu^k}$. The dimension of this system is equal to the number of states, and thus when this number is very large, the method is not attractive. Figure 1.3.4 provides a geometric interpretation of policy iteration and compares it with value iteration.

We note that in some cases, one can exploit the special structure of the problem at hand to accelerate policy iteration. For example, sometimes we can show that if $\mu$ belongs to some restricted subset $M$ of admissible control functions, then $J_\mu$ has a form guaranteeing that $\bar\mu$ will also belong to the subset $M$. In this case, policy iteration will be confined within the subset $M$, if the initial policy belongs to $M$. Furthermore, the policy evaluation step may be facilitated. For an example, see Exercise 1.14.

We now demonstrate policy iteration by means of the example considered earlier in this section.

Example 3.1 (continued)

Let us go through the calculations of the policy iteration method:

Initialization: We select the initial stationary policy
\[
\mu^0(1) = u^1, \qquad \mu^0(2) = u^2.
\]
Figure 1.3.4  Geometric interpretation of policy iteration and value iteration. Each stationary policy $\mu$ defines the linear function $g_\mu + \alpha P_\mu J$ of the vector $J$, and $TJ$ is the piecewise linear function $\min_\mu[g_\mu + \alpha P_\mu J]$. The optimal cost $J^*$ satisfies $J^* = TJ^*$, so it is obtained from the intersection of the graph of $TJ$ and the 45-degree line shown. The value iteration sequence is indicated in the top figure by the staircase construction, which asymptotically leads to $J^*$. The policy iteration sequence terminates when the correct linear segment of the graph of $TJ$ (i.e., the optimal stationary policy) is identified, as shown in the bottom figure.
Policy Evaluation: We obtain $J_{\mu^0}$ through the equation $J_{\mu^0} = T_{\mu^0}J_{\mu^0}$ or, equivalently, the linear system of equations
\[
J_{\mu^0}(1) = g(1,u^1) + \alpha p_{11}(u^1)J_{\mu^0}(1) + \alpha p_{12}(u^1)J_{\mu^0}(2),
\]
\[
J_{\mu^0}(2) = g(2,u^2) + \alpha p_{21}(u^2)J_{\mu^0}(1) + \alpha p_{22}(u^2)J_{\mu^0}(2).
\]
Substituting the data of the problem, we have
\[
J_{\mu^0}(1) = 2 + 0.9\big(\tfrac{3}{4}J_{\mu^0}(1) + \tfrac{1}{4}J_{\mu^0}(2)\big),
\]
\[
J_{\mu^0}(2) = 3 + 0.9\big(\tfrac{1}{4}J_{\mu^0}(1) + \tfrac{3}{4}J_{\mu^0}(2)\big).
\]
Solving this system of linear equations for $J_{\mu^0}(1)$ and $J_{\mu^0}(2)$, we obtain
\[
J_{\mu^0}(1) \approx 24.12, \qquad J_{\mu^0}(2) \approx 25.96.
\]

Policy Improvement: We now find $\mu^1(1)$ and $\mu^1(2)$ satisfying $T_{\mu^1}J_{\mu^0} = TJ_{\mu^0}$. We have
\[
(TJ_{\mu^0})(1) = \min\Big\{ 2 + 0.9\big(\tfrac{3}{4}\cdot 24.12 + \tfrac{1}{4}\cdot 25.96\big),\ 0.5 + 0.9\big(\tfrac{1}{4}\cdot 24.12 + \tfrac{3}{4}\cdot 25.96\big) \Big\} = \min\{24.12,\,23.45\} = 23.45,
\]
\[
(TJ_{\mu^0})(2) = \min\Big\{ 1 + 0.9\big(\tfrac{3}{4}\cdot 24.12 + \tfrac{1}{4}\cdot 25.96\big),\ 3 + 0.9\big(\tfrac{1}{4}\cdot 24.12 + \tfrac{3}{4}\cdot 25.96\big) \Big\} = \min\{23.12,\,25.95\} = 23.12.
\]
The minimizing controls are
\[
\mu^1(1) = u^2, \qquad \mu^1(2) = u^1.
\]

Policy Evaluation: We obtain $J_{\mu^1}$ through the equation $J_{\mu^1} = T_{\mu^1}J_{\mu^1}$, or
\[
J_{\mu^1}(1) = g(1,u^2) + \alpha p_{11}(u^2)J_{\mu^1}(1) + \alpha p_{12}(u^2)J_{\mu^1}(2),
\]
\[
J_{\mu^1}(2) = g(2,u^1) + \alpha p_{21}(u^1)J_{\mu^1}(1) + \alpha p_{22}(u^1)J_{\mu^1}(2).
\]
Substitution of the data of the problem and solution of the system of equations yields
\[
J_{\mu^1}(1) \approx 7.33, \qquad J_{\mu^1}(2) \approx 7.67.
\]

Policy Improvement: We perform the minimization required to find $\mu^2$ satisfying $T_{\mu^2}J_{\mu^1} = TJ_{\mu^1}$:
\[
(TJ_{\mu^1})(1) = \min\Big\{ 2 + 0.9\big(\tfrac{3}{4}\cdot 7.33 + \tfrac{1}{4}\cdot 7.67\big),\ 0.5 + 0.9\big(\tfrac{1}{4}\cdot 7.33 + \tfrac{3}{4}\cdot 7.67\big) \Big\} = \min\{8.67,\,7.33\} = 7.33,
\]
\[
(TJ_{\mu^1})(2) = \min\Big\{ 1 + 0.9\big(\tfrac{3}{4}\cdot 7.33 + \tfrac{1}{4}\cdot 7.67\big),\ 3 + 0.9\big(\tfrac{1}{4}\cdot 7.33 + \tfrac{3}{4}\cdot 7.67\big) \Big\} = \min\{7.67,\,9.83\} = 7.67.
\]
Hence we have $T_{\mu^1}J_{\mu^1} = TJ_{\mu^1}$, which implies that $\mu^1$ is optimal and $J_{\mu^1} = J^*$:
\[
\mu^*(1) = u^2, \qquad \mu^*(2) = u^1, \qquad J^*(1) \approx 7.33, \qquad J^*(2) \approx 7.67.
\]

Modified Policy Iteration

When the number of states is large, solving the linear system $(I - \alpha P_{\mu^k})J_{\mu^k} = g_{\mu^k}$ in the policy evaluation step by direct methods such as Gaussian elimination can be prohibitively time-consuming. One way to get around this difficulty is to solve the linear system iteratively by using value iteration. In fact, we may consider solving the system only approximately by executing a limited number of value iterations. This is called the modified policy iteration algorithm. To formalize this method, let $J_0$ be an arbitrary $n$-dimensional vector. Let $m_0, m_1, \ldots$ be positive integers, and let the vectors $J_1, J_2, \ldots$ and the stationary policies $\mu^0, \mu^1, \ldots$ be defined by
\[
T_{\mu^k} J_k = T J_k, \qquad J_{k+1} = T_{\mu^k}^{m_k} J_k, \qquad k = 0,1,\ldots.
\]
Thus, a stationary policy $\mu^k$ is defined from $J_k$ according to $T_{\mu^k}J_k = TJ_k$, and the cost $J_{\mu^k}$ is approximately evaluated by $m_k - 1$ additional value iterations, yielding the vector $J_{k+1}$, which is used in turn to define $\mu^{k+1}$. We have the following:
Proposition 3.5: Let $\{J_k\}$ and $\{\mu^k\}$ be the sequences generated by the modified policy iteration algorithm. Then $\{J_k\}$ converges to $J^*$. Furthermore, there exists an integer $\bar k$ such that for all $k \ge \bar k$, $\mu^k$ is optimal.

Proof: Let $r$ be a scalar such that the vector $\bar J_0$, defined by
\[
\bar J_0 = J_0 + re,
\]
satisfies $T\bar J_0 \le \bar J_0$. [Any scalar $r$ such that $\max_i\big[(TJ_0)(i) - J_0(i)\big] \le (1-\alpha)r$ has this property.] Define for all $k$,
\[
\bar J_{k+1} = T_{\mu^k}^{m_k} \bar J_k.
\]
Then it can be seen by induction that for all $k$ and $m = 0,1,\ldots,m_k$, the vectors $T_{\mu^k}^m J_k$ and $T_{\mu^k}^m \bar J_k$ differ by the multiple $r\alpha^{m_0 + m_1 + \cdots + m_{k-1} + m}$ of the unit vector $e$. It follows that if $J_0$ is replaced by $\bar J_0$ as the starting vector in the algorithm, the same sequence of policies $\{\mu^k\}$ will be obtained; that is, we have for all $k$,
\[
T_{\mu^k}\bar J_k = T\bar J_k.
\]
Now we will show that for all $k$ we have $\bar J_{k+1} \le T\bar J_k \le \bar J_k$. Indeed, we have $T_{\mu^0}\bar J_0 = T\bar J_0 \le \bar J_0$, from which we obtain
\[
T_{\mu^0}^{m+1}\bar J_0 \le T_{\mu^0}^m \bar J_0 \le T_{\mu^0}\bar J_0, \qquad m = 1,2,\ldots,
\]
so that
\[
T_{\mu^1}\bar J_1 = T\bar J_1 \le T_{\mu^0}\bar J_1 = T_{\mu^0}^{m_0+1}\bar J_0 \le T_{\mu^0}^{m_0}\bar J_0 = \bar J_1 \le T_{\mu^0}\bar J_0 = T\bar J_0 \le \bar J_0.
\]
This argument can be continued to show that for all $k$, we have $\bar J_{k+1} \le T\bar J_k \le \bar J_k$, so that
\[
\bar J_k \le T^k \bar J_0, \qquad k = 0,1,\ldots.
\]
On the other hand, since $T\bar J_0 \le \bar J_0$, we have $J^* \le \bar J_0$, and it follows that application of any number of mappings of the form $T_\mu$ to $\bar J_0$ produces functions that are bounded from below by $J^*$. Thus,
\[
J^* \le \bar J_k \le T^k \bar J_0, \qquad k = 0,1,\ldots.
\]
By taking the limit as $k \to \infty$, we obtain $\lim_{k\to\infty}\bar J_k(i) = J^*(i)$ for all $i$, and since $\lim_{k\to\infty} r\alpha^{m_0 + \cdots + m_{k-1}} = 0$, we obtain
\[
\lim_{k\to\infty} J_k(i) = J^*(i), \qquad i = 1,\ldots,n.
\]
Since the number of stationary policies is finite, there exists an $\bar\epsilon > 0$ such that if a stationary policy $\mu$ satisfies
\[
\max_i\big[J_\mu(i) - J^*(i)\big] < \bar\epsilon,
\]
then $\mu$ is optimal. Now let $\bar k$ be such that for all $k \ge \bar k$ we have
\[
\frac{\alpha}{1-\alpha}\Big( \max_i\big[(TJ_k)(i) - J_k(i)\big] - \min_i\big[(TJ_k)(i) - J_k(i)\big] \Big) < \bar\epsilon.
\]
Then from Eq. (3.10) we see that for all $k \ge \bar k$, the stationary policy $\mu^k$ that satisfies $T_{\mu^k}J_k = TJ_k$ is optimal. Q.E.D.

Note that if $m_k = 1$ for all $k$ in the modified policy iteration algorithm, we obtain the value iteration method, while if $m_k = \infty$ we obtain the policy iteration method, where the policy evaluation step is performed iteratively by means of value iteration. Analysis and computational experience suggest that it is usually best to take $m_k$ larger than 1 according to some heuristic scheme. A key idea here is that a value iteration involving a single policy (evaluating $T_\mu J$ for some $\mu$ and $J$) is much less expensive than an iteration involving all policies (evaluating $TJ$ for some $J$), when the number of controls available at each state is large. Note that error bounds such as the ones of Prop. 3.1 can be used to improve the approximation process. Furthermore, Gauss-Seidel iterations can be used in place of the usual value iterations.

Infinite State Space — Approximate Policy Iteration

The policy iteration method can be defined for problems with infinite state and control spaces by means of the relation
\[
T_{\mu^{k+1}} J_{\mu^k} = T J_{\mu^k}, \qquad k = 0,1,\ldots.
\]
The proof of Prop. 3.4 can then be used to show that the generated sequence of policies $\{\mu^k\}$ is improving in the sense that $J_{\mu^{k+1}} \le J_{\mu^k}$ for all $k$. However, for infinite state space problems, the policy evaluation step and/or the policy improvement step of the method may be implementable only through approximations. A similar situation may occur even when the state space is finite but the number of states is very large. We are thus led to consider an approximate policy iteration method that generates a sequence of stationary policies $\{\mu^k\}$ and a corresponding sequence of approximate cost functions $\{J_k\}$ satisfying
\[
\max_{x\in S} |J_k(x) - J_{\mu^k}(x)| \le \delta, \qquad k = 0,1,\ldots, \qquad (3.24)
\]
and
\[
\max_{x\in S} |(T_{\mu^{k+1}}J_k)(x) - (TJ_k)(x)| \le \epsilon, \qquad k = 0,1,\ldots, \qquad (3.25)
\]
where $\delta$ and $\epsilon$ are some positive scalars, and $\mu^0$ is an arbitrary stationary policy. We call this the approximate policy iteration algorithm. The following proposition provides error bounds for this algorithm.
Proposition 3.6: The sequence $\{\mu^k\}$ generated by the approximate policy iteration algorithm satisfies
\[
\limsup_{k\to\infty}\ \max_{x\in S}\big( J_{\mu^k}(x) - J^*(x) \big) \le \frac{\epsilon + 2\alpha\delta}{(1-\alpha)^2}. \qquad (3.26)
\]

Proof: From Eqs. (3.24) and (3.25), we have for all $k$,
\[
T_{\mu^{k+1}}J_{\mu^k} - \alpha\delta e \le T_{\mu^{k+1}}J_k \le TJ_k + \epsilon e,
\]
where $e = (1,1,\ldots,1)'$ is the unit vector, while from Eq. (3.24), we have for all $k$,
\[
TJ_k \le TJ_{\mu^k} + \alpha\delta e.
\]
By combining these two relations, we obtain for all $k$,
\[
T_{\mu^{k+1}}J_{\mu^k} \le TJ_{\mu^k} + (\epsilon + 2\alpha\delta)e \le T_{\mu^k}J_{\mu^k} + (\epsilon + 2\alpha\delta)e. \qquad (3.27)
\]
From Eq. (3.27) and the equation $T_{\mu^k}J_{\mu^k} = J_{\mu^k}$, we have
\[
T_{\mu^{k+1}}J_{\mu^k} \le J_{\mu^k} + (\epsilon + 2\alpha\delta)e.
\]
By subtracting from this relation the equation $T_{\mu^{k+1}}J_{\mu^{k+1}} = J_{\mu^{k+1}}$, we obtain
\[
T_{\mu^{k+1}}J_{\mu^k} - T_{\mu^{k+1}}J_{\mu^{k+1}} \le J_{\mu^k} - J_{\mu^{k+1}} + (\epsilon + 2\alpha\delta)e,
\]
which can be written as
\[
J_{\mu^{k+1}} - J_{\mu^k} \le \alpha F_k + (\epsilon + 2\alpha\delta)e, \qquad (3.28)
\]
where $F_k$ is the function given by
\[
F_k(x) = \alpha^{-1}\big[(T_{\mu^{k+1}}J_{\mu^{k+1}})(x) - (T_{\mu^{k+1}}J_{\mu^k})(x)\big] = E\Big\{ J_{\mu^{k+1}}\big(f(x,\mu^{k+1}(x),w)\big) - J_{\mu^k}\big(f(x,\mu^{k+1}(x),w)\big) \Big\}.
\]
Let
\[
\zeta_k = \max_{x\in S}\big( J_{\mu^{k+1}}(x) - J_{\mu^k}(x) \big).
\]
Then we have $F_k(x) \le \zeta_k$ for all $x \in S$, and Eq. (3.28) yields
\[
\zeta_k \le \alpha\zeta_k + \epsilon + 2\alpha\delta,
\]
so that
\[
\zeta_k \le \frac{\epsilon + 2\alpha\delta}{1-\alpha}. \qquad (3.29)
\]
Let
\[
\theta_k = \max_{x\in S}\big( J_{\mu^k}(x) - J^*(x) \big).
\]
From Eq. (3.27) and the relation $TJ_{\mu^k} \le J^* + \alpha\theta_k e$, which follows from Prop. 2.1, we have
\[
T_{\mu^{k+1}}J_{\mu^k} \le TJ_{\mu^k} + (\epsilon + 2\alpha\delta)e \le J^* + \alpha\theta_k e + (\epsilon + 2\alpha\delta)e.
\]
We also have
\[
J_{\mu^{k+1}} = T_{\mu^{k+1}}J_{\mu^{k+1}} = T_{\mu^{k+1}}J_{\mu^k} + \big(T_{\mu^{k+1}}J_{\mu^{k+1}} - T_{\mu^{k+1}}J_{\mu^k}\big),
\]
and by subtracting the last two relations, we obtain
\[
J_{\mu^{k+1}} - J^* \le \alpha\theta_k e + \alpha F_k + (\epsilon + 2\alpha\delta)e.
\]
From this relation we see that
\[
\theta_{k+1} \le \alpha\theta_k + \alpha\zeta_k + \epsilon + 2\alpha\delta.
\]
By taking the limit superior as $k \to \infty$ and by using Eq. (3.29), we obtain
\[
(1-\alpha)\limsup_{k\to\infty}\theta_k \le \alpha\,\frac{\epsilon + 2\alpha\delta}{1-\alpha} + \epsilon + 2\alpha\delta.
\]
This relation simplifies to
\[
\limsup_{k\to\infty}\theta_k \le \frac{\epsilon + 2\alpha\delta}{(1-\alpha)^2},
\]
which was to be proved. Q.E.D.

Proposition 3.6 suggests that the approximate policy iteration method makes steady progress up to a point and then the iterates $J_{\mu^k}$ oscillate within a neighborhood of the optimum $J^*$. This behavior appears to be typical in practice. Note that for $\delta = 0$ and $\epsilon = 0$, Prop. 3.6 shows that the cost sequence $\{J_{\mu^k}\}$ generated by the (exact) policy iteration algorithm converges to $J^*$, even when the state space is infinite.
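Before turning to aggregation, the following compact sketch ties together the policy iteration variants of this subsection. It is our own illustration with assumed array layouts, not code from the text: the greedy policy is recomputed from $J_k$, and then $T_{\mu^k}$ is applied $m$ times, so $m = 1$ gives value iteration and a large $m$ approximates exact policy iteration.

```python
import numpy as np

def modified_policy_iteration(g, P, alpha, m=10, iters=100):
    # g: (n, nu) array of stage costs; P: (n, nu, n) array of transition probabilities.
    n, nu = g.shape
    J = np.zeros(n)
    for _ in range(iters):
        Q = g + alpha * np.einsum('iuj,j->iu', P, J)
        mu = Q.argmin(axis=1)                  # policy improvement: T_mu J = T J
        J = Q[np.arange(n), mu]                # first application of T_mu
        g_mu = g[np.arange(n), mu]
        P_mu = P[np.arange(n), mu]
        for _ in range(m - 1):                 # m - 1 additional evaluation sweeps
            J = g_mu + alpha * P_mu @ J
    return J, mu
```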
1.3.3 Adaptive Aggregation

Let us now consider an alternative to value iteration for performing approximate evaluation of a stationary policy $\mu$, that is, for solving approximately the system
\[
J_\mu = T_\mu J_\mu = g_\mu + \alpha P_\mu J_\mu.
\]
This alternative is recommended for problems where convergence of value iteration, even with error bounds, is very slow. The idea here is to solve, instead of the system $J_\mu = T_\mu J_\mu$, another system of smaller dimension, which is obtained by lumping together the states of the original system into subsets $S_1, S_2, \ldots, S_m$ that can be viewed as aggregate states. These subsets are disjoint and cover the entire state space, that is,
\[
S = S_1 \cup S_2 \cup \cdots \cup S_m.
\]
Consider the $n \times m$ matrix $W$ whose $\ell$th column has unit entries at coordinates corresponding to states in $S_\ell$ and all other entries equal to zero. Consider also an $m \times n$ matrix $Q$ such that the $i$th row of $Q$ is a probability distribution $(q_{i1},\ldots,q_{in})$ with $q_{is} = 0$ if $s \notin S_i$. The structure of $Q$ implies two useful properties:

(a) $QW = I$.

(b) The matrix $R = QP_\mu W$ is an $m \times m$ transition probability matrix. In particular, the $ij$th component of $R$ is equal to
\[
R_{ij} = \sum_{s \in S_i} q_{is} \sum_{t \in S_j} p_{st},
\]
and gives the probability that the next state will belong to aggregate state $S_j$, given that the current state is drawn from the aggregate state $S_i$ according to the probability distribution $\{q_{is} \mid s \in S_i\}$.

The transition probability matrix $R$ defines a Markov chain, called the aggregate Markov chain, whose states are the $m$ aggregate states. Figure 1.3.5 illustrates the aggregate Markov chain.

Aggregate Markov chains are most useful when their transition behavior captures the broad attributes of the behavior of the original chain. This is generally true if the states of each aggregate state are "similar" in some sense. Let us describe one such situation. In particular, suppose that we have an estimate $J$ of $J_\mu$, and that we postulate that over the states $s$ of every aggregate state $S_i$ the variation $J_\mu(s) - J(s)$ is constant. This amounts to hypothesizing that for some $m$-dimensional vector $y$ we have
\[
J_\mu - J = Wy.
\]
Figure 1.3.5  Illustration of the aggregate Markov chain. In this example, the aggregate states are $S_1 = \{1,2,3\}$, $S_2 = \{4,5\}$, and $S_3 = \{6\}$. The matrix $W$ has columns $(1,1,1,0,0,0)'$, $(0,0,0,1,1,0)'$, and $(0,0,0,0,0,1)'$. The matrix $Q$ is chosen so that each of its rows defines a uniform probability distribution over the states of the corresponding aggregate state; thus the rows of $Q$ are $(1/3,1/3,1/3,0,0,0)$, $(0,0,0,1/2,1/2,0)$, and $(0,0,0,0,0,1)$. The resulting transition probabilities $r_{ij}$ of the aggregate Markov chain, obtained from $R = QP_\mu W$ by averaging the corresponding entries of $P_\mu$, are shown in the figure.

By combining the equations $T_\mu J = g_\mu + \alpha P_\mu J$ and $g_\mu = (I - \alpha P_\mu)J_\mu$, we have
\[
T_\mu J - J = (I - \alpha P_\mu)(J_\mu - J).
\]
This is the variational form of the equation $J_\mu = T_\mu J_\mu$ discussed earlier in connection with error bounds in Section 1.3.1, and can be used equally well for evaluating $J_\mu$. Let us multiply both sides with $Q$ and use the equation $J_\mu - J = Wy$. We obtain
\[
Q(T_\mu J - J) = Q(I - \alpha P_\mu)Wy,
\]
which, by using the equations $QW = I$ and $R = QP_\mu W$, is written as
\[
(I - \alpha R)y = Q(T_\mu J - J).
\]
This equation can be solved for $y$, since $R$ is a transition probability matrix and therefore the matrix $I - \alpha R$ is invertible. Also, by applying $T_\mu$ to both sides of the equation $J_\mu = J + Wy$, we obtain
\[
J_\mu = T_\mu J_\mu = T_\mu J + \alpha P_\mu W y.
\]
We thus conclude that if the variation $J_\mu(s) - J(s)$ is roughly constant over the states $s$ of each aggregate state, then the vector $T_\mu J + \alpha P_\mu Wy$ is a good approximation for $J_\mu$. Starting with $J$, this approximation is obtained as follows.
Aggregation Iteration

Step 1: Compute $T_\mu J$.

Step 2: Delineate the aggregate states (i.e., define $W$) and specify the matrix $Q$.

Step 3: Solve for $y$ the system
\[
(I - \alpha R)y = Q(T_\mu J - J), \qquad (3.30)
\]
where $R = QP_\mu W$, and approximate $J_\mu$ using
\[
J := T_\mu J + \alpha P_\mu W y. \qquad (3.31)
\]

Note that the aggregation iteration (3.31) can be equivalently written as
\[
J := T_\mu(J + Wy),
\]
so it differs from a value iteration in that it operates with $T_\mu$ on $J + Wy$ rather than on $J$.

Solving the system (3.30) in the aggregation iteration has an interesting interpretation. It can be seen that $y$ is the $\alpha$-discounted cost vector corresponding to the transition probability matrix $R$ and the cost-per-stage vector $Q(T_\mu J - J)$. Thus, calculating $y$ can be viewed as a policy evaluation step for the aggregate Markov chain when the cost per stage for each aggregate state $S_i$ is equal to
\[
\sum_{s \in S_i} q_{is}\big( (T_\mu J)(s) - J(s) \big),
\]
which is the average of $T_\mu J - J$ over the aggregate state $S_i$ according to the distribution $\{q_{is} \mid s \in S_i\}$. A key attractive aspect of the aggregation iteration is that the dimension of the system (3.30) is $m$ (the number of aggregate states), which can be much smaller than $n$ (the dimension of the system $J_\mu = T_\mu J_\mu$ arising in the policy evaluation step of policy iteration).
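A minimal sketch of one aggregation iteration (3.30)-(3.31) is given below; it is our own illustration, and it assumes the aggregate states (the list `groups`) are already chosen, with $Q$ averaging uniformly within each group.

```python
import numpy as np

def aggregation_iteration(J, g_mu, P_mu, alpha, groups):
    # J: current estimate; g_mu, P_mu: cost vector and transition matrix of policy mu;
    # groups: list of lists of state indices, one list per aggregate state.
    n, m = len(J), len(groups)
    W = np.zeros((n, m))
    Q = np.zeros((m, n))
    for l, grp in enumerate(groups):
        W[grp, l] = 1.0
        Q[l, grp] = 1.0 / len(grp)
    TJ = g_mu + alpha * P_mu @ J
    R = Q @ P_mu @ W                                           # aggregate transition matrix
    y = np.linalg.solve(np.eye(m) - alpha * R, Q @ (TJ - J))   # Eq. (3.30)
    return TJ + alpha * P_mu @ (W @ y)                         # Eq. (3.31)
```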
Delineating the Aggregate States

A key issue is how to identify the aggregate states $S_1,\ldots,S_m$ in a way that the error $J_\mu - J$ is of similar magnitude in each one. One way to do this is to view $T_\mu J$ as an approximation to $J_\mu$, and to group together states $i$ with comparable magnitudes of $(T_\mu J)(i) - J(i)$. Thus the interval $[\underline c, \bar c]$, where
\[
\underline c = \min_i\big[(T_\mu J)(i) - J(i)\big], \qquad \bar c = \max_i\big[(T_\mu J)(i) - J(i)\big],
\]
is divided into $m$ segments, and membership of a state $i$ in an aggregate state is determined by the segment within which $(T_\mu J)(i) - J(i)$ lies. By this we mean that for each state $i$, we set $i \in S_1$ if
\[
(T_\mu J)(i) - J(i) = \underline c,
\]
and we set $i \in S_k$ if
\[
(T_\mu J)(i) - J(i) - \underline c - (k-1)\delta \in (0, \delta],
\]
where
\[
\delta = \frac{\bar c - \underline c}{m}.
\]
This choice is based on the conjecture that, at least near convergence, $(T_\mu J)(i) - J(i)$ will be of comparable magnitude for states $i$ for which $J_\mu(i) - J(i)$ is of comparable magnitude. Analysis and experimentation given in [BeC89] have shown that the preceding scheme often works well with a small number of aggregate states $m$ (say 3 to 6), although the properties of the method are not yet fully understood.

Note that the aggregate states can change from one iteration to the next, so the aggregation scheme "adapts" to the progress of the computation. The criterion used to delineate the aggregate states does not exploit any special problem structure. In some cases, however, it is possible to take advantage of existing special structure and modify accordingly the method used to form the aggregate states.

Adaptive Aggregation Methods

It is possible to construct a number of methods that calculate $J_\mu$ by using aggregation iterations. One possibility is simply to perform a sequence of aggregation iterations, using the preceding method to partition the state space into a few, say 3 to 10, aggregate states. This method can be greatly improved by interleaving each aggregation iteration with multiple value iterations (applications of the mapping $T_\mu$ on the current iterate). This is recommended based on experimentation and analysis given in [BeC89], to which we refer for further discussion. An interesting empirically observed phenomenon is that the error between the iterate and $J_\mu$ is often increased by an aggregation iteration, but then unusually large improvements are made during the next few value iterations. This suggests that the number of value iterations following an aggregation iteration should be based on algorithmic progress; that is, an aggregation iteration should be performed when the progress of the value iterations becomes relatively small. Some experimentation may be needed with a given problem to determine an appropriate criterion for switching from the value iterations to an aggregation iteration.

There is no proof of convergence of the scheme just described. On the basis of computational experimentation, it appears reliable in practice. Its convergence nonetheless can be guaranteed by introducing a feature that
enforces some irreversible progress via the value iteration method following an aggregation iteration. In particular, one may calculate the error bounds of Prop. 3.1 at the value iteration Step 1, and impose the requirement that the subsequent aggregation iteration is skipped if these error bounds do not improve by a certain factor over the bounds computed prior to the preceding aggregation iteration.

To illustrate the effectiveness of the adaptive aggregation method, consider the three deterministic problems described earlier (cf. Fig. 1.3.3), and the performance of the method with two, three, and four aggregate states, starting from the zero function. The results, given in Fig. 1.3.6, should be compared with those of Fig. 1.3.3. It is intuitively clear that the performance of the aggregation method should improve as the number of aggregate states increases, and indeed the computational results bear this out.

The two extreme cases where $m = n$ and $m = 1$ are of interest. When $m = n$, each aggregate state consists of a single state and we obtain the policy iteration algorithm. When $m = 1$, there is only one aggregate state, $W$ is equal to the unit vector $e = (1,\ldots,1)'$, and a straightforward calculation shows that, for the choice $Q = (1/n,\ldots,1/n)$, the solution of the aggregate system (3.30) is
\[
y = \frac{1}{n(1-\alpha)}\sum_{i=1}^{n}\big[(T_\mu J)(i) - J(i)\big].
\]
From this equation (using also the fact $P_\mu e = e$), we obtain the iteration
\[
J := T_\mu J + \frac{\alpha}{n(1-\alpha)}\sum_{i=1}^{n}\big[(T_\mu J)(i) - J(i)\big]e, \qquad (3.32)
\]
which is the same as the rank-one correction formula (3.12) obtained in Section 1.3.1 and amounts to shifting the result $T_\mu J$ of value iteration within the error bound range given by Prop. 3.1. Thus we may view the aggregation scheme as a continuum of algorithms with policy iteration and value iteration (coupled with the error bounds of Prop. 3.1) included as the two extreme special cases.

Adaptive Multiple-Rank Corrections

One may observe that the aggregation iteration
\[
J := T_\mu(J + Wy)
\]
amounts to applying $T_\mu$ to a correction of $J$ along the subspace spanned by the columns of $W$. Once the matrix $W$ is computed based on the adaptive procedure discussed above, we may consider choosing the vector $y$ in alternative ways. An interesting possibility, which leads to a generalization
No. of aggregate states    Pr. 1    Pr. 1     Pr. 2    Pr. 2     Pr. 3    Pr. 3
                           α = .9   α = .99   α = .9   α = .99   α = .9   α = .99
        2                   14       13        9        9         83      505
        3                    1        1        3        3         61      367
        4                    –        –        3        3         26      351

Figure 1.3.6  Number of iterations of adaptive aggregation methods with two, three, and four aggregate states to solve the problems of Fig. 1.3.3. Each row of $Q$ was chosen to define a uniform probability distribution over the states of the corresponding aggregate state. (Two entries of the last row are illegible in the source.)

of the rank-one correction method of the preceding subsection, is to select $y$ so that
\[
\| J + Wy - T_\mu(J + Wy) \|^2 \qquad (3.33)
\]
is minimized. By setting to zero the gradient with respect to $y$ of the above expression, we can verify that the optimal vector is given by
\[
\bar y = (Z'Z)^{-1}Z'(T_\mu J - J),
\]
where
\[
Z = (I - \alpha P_\mu)W.
\]
The corresponding iteration then becomes
\[
J := T_\mu(J + W\bar y) = T_\mu J + \alpha P_\mu W\bar y.
\]
Much of our discussion regarding the rank-one correction method also applies to this generalized version. In particular, we can use a two-phase implementation, which allows a return from phase two to phase one whenever the progress of phase two is unsatisfactory. Furthermore, a version of the method that works in the case of multiple policies is possible.

1.3.4 Linear Programming

Since $\lim_{N\to\infty}T^N J = J^*$ for all $J$ (cf. Prop. 2.1), we have
\[
J \le TJ \ \Rightarrow\ J \le J^* = TJ^*.
\]
Thus $J^*$ is the "largest" $J$ that satisfies the constraint $J \le TJ$. This constraint can be written as a finite system of linear inequalities
\[
J(i) \le g(i,u) + \alpha\sum_{j=1}^{n}p_{ij}(u)J(j), \qquad i = 1,\ldots,n, \quad u \in U(i),
\]
and delineates a polyhedron in $\Re^n$. The optimal cost vector $J^*$ is the "northeast" corner of this polyhedron, as illustrated in Fig. 1.3.7. In particular, $J^*(1),\ldots,J^*(n)$ solve the following problem (in $\lambda_1,\ldots,\lambda_n$):
\[
\text{maximize} \quad \sum_{i\in\tilde S}\lambda_i
\]
\[
\text{subject to} \quad \lambda_i \le g(i,u) + \alpha\sum_{j=1}^{n}p_{ij}(u)\lambda_j, \qquad i = 1,\ldots,n, \quad u \in U(i),
\]
where $\tilde S$ is any nonempty subset of the state space $S = \{1,\ldots,n\}$. This is a linear program with $n$ variables and as many as $n \times q$ constraints, where $q$ is the maximum number of elements of the sets $U(i)$. As $n$ increases, its solution becomes more complex. For very large $n$ and $q$, the linear programming approach can be practical only with the use of special large-scale linear programming methods.

Figure 1.3.7  Linear programming problem associated with the discounted infinite horizon problem. The constraint set is shaded and the objective to maximize is $J(1) + J(2)$.

Example 3.1 (continued)

For the example considered earlier in this section, the linear programming problem takes the form
\[
\text{maximize} \quad \lambda_1 + \lambda_2
\]
subject to
\[
\lambda_1 \le 2 + 0.9\big(\tfrac{3}{4}\lambda_1 + \tfrac{1}{4}\lambda_2\big),
\]
\[
\lambda_2 \le 1 + 0.9\big(\tfrac{3}{4}\lambda_1 + \tfrac{1}{4}\lambda_2\big),
\]
\[
\lambda_1 \le 0.5 + 0.9\big(\tfrac{1}{4}\lambda_1 + \tfrac{3}{4}\lambda_2\big),
\]
\[
\lambda_2 \le 3 + 0.9\big(\tfrac{1}{4}\lambda_1 + \tfrac{3}{4}\lambda_2\big).
\]

Cost Approximation Based on Linear Programming

When the number of states is very large or infinite, we may consider finding an approximation to the optimal cost function, which can be used in turn to obtain a (suboptimal) policy by minimization in Bellman's equation. One possibility is to approximate $J^*(x)$ with the linear form
\[
\tilde J(x,r) = \sum_{k=1}^{m} r_k w_k(x), \qquad (3.34)
\]
where $r = (r_1,\ldots,r_m)$ is a vector of parameters and, for each state $x$, $w_k(x)$ are some fixed and known scalars. This amounts to approximating the cost function $J^*(x)$ by a linear combination of $m$ given functions $w_k(x)$, $k = 1,\ldots,m$. These functions play the role of a basis for the space of cost function approximations $\tilde J(x,r)$ that can be generated with different choices of $r$ (see also the discussion of approximations in Section 2.3.3). It is then possible to determine $r$ by using $\tilde J(x,r)$ in place of $J^*$ in the preceding linear programming approach. In particular, we compute $\bar r$ as the solution of the program
\[
\text{maximize} \quad \sum_{x\in\tilde S}\tilde J(x,r)
\]
\[
\text{subject to} \quad \tilde J(x,r) \le g(x,u) + \alpha\sum_{y\in S}p_{xy}(u)\tilde J(y,r), \qquad x \in \tilde S, \quad u \in \tilde U(x),
\]
where $\tilde S$ is either the state space $S$ or a suitably chosen finite subset of $S$, and $\tilde U(x)$ is either $U(x)$ or a suitably chosen finite subset of $U(x)$. Because $\tilde J(x,r)$ is linear in the parameter vector $r$, the above program is linear in the parameters $r_1,\ldots,r_m$. Thus if $m$ is small, the number of variables of the linear program is small. The number of constraints is as large as $s \cdot q$, where $s$ is the number of elements of $\tilde S$ and $q$ is the maximum number of elements of the sets $U(x)$. However, linear programs with a small number of variables and a large number of constraints can often be solved relatively quickly with the use of special large-scale linear programming methods known as cutting plane or column generation methods (see, e.g., [Dan63], [Ber95a]). Thus, the preceding linear programming approach may be practical even for problems with a very large number of states.
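A small sketch of the exact linear program of this subsection is given below, using scipy.optimize.linprog on the two-state example. It is our own illustration; the transition probabilities and stage costs are our reading of the example's data and should be checked against its original statement.

```python
import numpy as np
from scipy.optimize import linprog

alpha = 0.9
g = np.array([[2.0, 0.5], [1.0, 3.0]])          # g[i][u]
P = np.array([[[0.75, 0.25], [0.25, 0.75]],     # P[i][u][j] (assumed data)
              [[0.75, 0.25], [0.25, 0.75]]])
n, nu = g.shape

# One inequality per state-control pair:  lambda_i - alpha * sum_j p_ij(u) lambda_j <= g(i,u).
A_ub, b_ub = [], []
for i in range(n):
    for u in range(nu):
        row = -alpha * P[i, u]
        row[i] += 1.0
        A_ub.append(row)
        b_ub.append(g[i, u])

# Maximize lambda_1 + ... + lambda_n, i.e., minimize the negative sum.
res = linprog(c=-np.ones(n), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n)
print("J* =", res.x)          # should be close to the values found by policy iteration
```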
Approximate Policy Evaluation Using Linear Programming

In the case of a very large or infinite state space, it is also possible to use linear programming to evaluate approximately the cost function of a stationary policy $\mu$ in the context of the approximate policy iteration scheme discussed in Section 1.3.2. Suppose that we wish to approximate $J_\mu$ by a function $\tilde J(\cdot,r)$ of a given form, which is parameterized by the vector $r = (r_1,\ldots,r_m)$. The bound of Prop. 3.6 suggests that we should try to determine the parameter vector $r$ so as to minimize
\[
\max_{x\in S}|\tilde J(x,r) - J_\mu(x)|.
\]
From the error bounds given just prior to Prop. 3.1, it can also be seen that we have
\[
\max_{x\in S}|\tilde J(x,r) - J_\mu(x)| \le \frac{1}{1-\alpha}\max_{x\in S}\big|\tilde J(x,r) - (T_\mu\tilde J)(x,r)\big|.
\]
This motivates choosing $r$ by solving the problem
\[
\min_r\ \max_{x\in\tilde S}\big|\tilde J(x,r) - (T_\mu\tilde J)(x,r)\big|,
\]
where $\tilde S$ is either the state space $S$ or a suitably chosen finite subset of $S$. The preceding problem is equivalent to
\[
\text{minimize} \quad z
\]
\[
\text{subject to} \quad -z \le \tilde J(x,r) - g\big(x,\mu(x)\big) - \alpha\sum_{y\in S}p_{xy}\big(\mu(x)\big)\tilde J(y,r) \le z, \qquad x \in \tilde S.
\]
When $\tilde J(x,r)$ has the linear form (3.34), this is a linear program in the variables $z$ and $r_1,\ldots,r_m$.

1.4 THE ROLE OF CONTRACTION MAPPINGS

Two key structural properties in DP models are responsible for most of the mathematical results one can prove about them. The first is the monotonicity property of the mappings $T$ and $T_\mu$ (cf. Lemma 1.1 in Section 1.1). This property is fundamental for total cost, infinite horizon problems. For example, it forms the basis for the results on positive and negative DP models to be shown in Chapter 3.
When the cost per stage is bounded and there is discounting, however, we have another property that strengthens the effects of monotonicity: the mappings $T$ and $T_\mu$ are contraction mappings. In this section, we explain the meaning and implications of this property. The material in this section is conceptually very important, since contraction mappings are present in several additional DP models. However, the main result of this section (Prop. 4.1) will not be used explicitly in any of the proofs given later in this book.

Let $B(S)$ denote the set of all bounded real-valued functions on $S$. With every function $J : S \mapsto \Re$ that belongs to $B(S)$, we associate the scalar
\[
\|J\| = \max_{x\in S}|J(x)|. \qquad (4.1)
\]
[As an aid for the advanced reader, we mention that the function $\|\cdot\|$ may be shown to be a norm on the linear space $B(S)$, and with this norm $B(S)$ becomes a complete normed linear space [Lue69].] The following definition and proposition are specializations to $B(S)$ of a more general notion and result that apply to such a space (see, e.g., references [LiS61] and [Lue69]).

Definition 4.1: A mapping $H : B(S) \mapsto B(S)$ is said to be a contraction mapping if there exists a scalar $\rho < 1$ such that
\[
\|HJ - HJ'\| \le \rho\|J - J'\|, \qquad \text{for all } J, J' \in B(S),
\]
where $\|\cdot\|$ is the norm of Eq. (4.1). It is said to be an $m$-stage contraction mapping if there exists a positive integer $m$ and some $\rho < 1$ such that
\[
\|H^m J - H^m J'\| \le \rho\|J - J'\|, \qquad \text{for all } J, J' \in B(S), \qquad (4.2)
\]
where $H^m$ denotes the composition of $H$ with itself $m$ times.

The main result concerning contraction mappings is the following. For a proof, see references [LiS61] and [Lue69].

Proposition 4.1: (Contraction Mapping Fixed-Point Theorem) If $H : B(S) \mapsto B(S)$ is a contraction mapping or an $m$-stage contraction mapping, then there exists a unique fixed point of $H$; that is, there exists a unique function $J^* \in B(S)$ such that
\[
HJ^* = J^*.
\]
Furthermore, if $J$ is any function in $B(S)$ and $H^k$ is the composition of $H$ with itself $k$ times, then
\[
\lim_{k\to\infty}\|H^k J - J^*\| = 0.
\]

Now consider the mappings $T$ and $T_\mu$ defined by Eqs. (1.1) and (1.5). Proposition 2.4 and Cor. 2.4.1 show that $T$ and $T_\mu$ are contraction mappings ($\rho = \alpha$). As a result, the convergence of the value iteration method to the unique fixed point of $T$ follows directly from the contraction mapping theorem. Note also that, by Prop. 3.2, the mapping $F$ corresponding to the Gauss-Seidel variant of the value iteration method is also a contraction mapping with $\rho = \alpha$, and the convergence result of Prop. 3.2 is again a special case of the contraction mapping theorem.

1.5 STOCHASTIC SCHEDULING AND THE MULTIARMED BANDIT

In the problem of this section there are $n$ projects (or activities) of which only one can be worked on at any time period. Each project $i$ is characterized at time $k$ by its state $x_k^i$. If project $i$ is worked on at time $k$, one receives an expected reward $\alpha^k R^i(x_k^i)$, where $\alpha \in (0,1)$ is a discount factor; the state $x_k^i$ then evolves according to the equation
\[
x_{k+1}^i = f^i(x_k^i, w_k^i), \qquad \text{if $i$ is worked on at time $k$}, \qquad (5.1)
\]
where $w_k^i$ is a random disturbance with probability distribution depending on $x_k^i$ but not on prior disturbances. The states of all idle projects are unaffected; that is,
\[
x_{k+1}^j = x_k^j, \qquad \text{if $j$ is idle at time $k$}. \qquad (5.2)
\]
We assume perfect state information and that the reward functions $R^i(\cdot)$ are uniformly bounded above and below, so the problem comes under the discounted cost framework of Section 1.2. We assume also that at any time $k$ there is the option of permanently retiring from all projects, in which case a reward of $\alpha^k M$ is received and no additional rewards are obtained in the future. The retirement reward $M$ is given and provides a parameterization of the problem, which will prove very useful. Note that for $M$ sufficiently small it is never optimal to retire, thereby allowing the possibility of modeling problems where retirement is not a real option.
The key characteristic of the problem is the independence of the projects, manifested in our three basic assumptions:

1. States of idle projects remain fixed.

2. Rewards received depend only on the state of the project currently engaged.

3. Only one project can be worked on at a time.

The rich structure implied by these assumptions makes possible a powerful methodology. It turns out that optimal policies have the form of an index rule; that is, for each project $i$, there is a function $m^i(x^i)$ such that an optimal policy at time $k$ is to
\[
\text{retire} \qquad \text{if } M \ge \max_i\{m^i(x_k^i)\}, \qquad (5.3\text{a})
\]
\[
\text{work on project } i \qquad \text{if } m^i(x_k^i) = \max_j\{m^j(x_k^j)\} \ge M. \qquad (5.3\text{b})
\]
Thus $m^i(x_k^i)$ may be viewed as an index of profitability of operating the $i$th project, while $M$ represents the profitability of retirement at time $k$. The optimal policy is to exercise the option of maximum profitability.

The problem of this section is known as a multiarmed bandit problem. An analogy here is drawn between project scheduling and selecting a sequence of plays on a slot machine that has several arms corresponding to different but unknown probability distributions of payoff. With each play the distribution of the selected arm is better identified, so the tradeoff here is between playing arms with high expected payoff and exploring the winning potential of other arms.

Index of a Project

Let $J(x,M)$ denote the optimal reward attainable when the initial state is $x = (x^1,\ldots,x^n)$ and the retirement reward is $M$. From Section 1.2 we know that, for each $M$, $J(\cdot,M)$ is the unique bounded solution of Bellman's equation
\[
J(x,M) = \max\Big[ M,\ \max_i L^i(x,M,J) \Big], \qquad \text{for all } x, \qquad (5.4)
\]
where $L^i$ is defined by
\[
L^i(x,M,J) = R^i(x^i) + \alpha E\big\{ J\big(x^1,\ldots,x^{i-1},f^i(x^i,w^i),x^{i+1},\ldots,x^n,M\big) \big\}. \qquad (5.5)
\]
The next proposition gives some useful properties of $J$.
Proposition 5.1: Let $B = \max_i\max_{x^i}|R^i(x^i)|$. For fixed $x$, the optimal reward function $J(x,M)$ has the following properties as a function of $M$:

(a) $J(x,M)$ is convex and monotonically nondecreasing.

(b) $J(x,M)$ is constant for $M \le -B/(1-\alpha)$.

(c) $J(x,M) = M$ for all $M \ge B/(1-\alpha)$.

Proof: Consider the value iteration method starting with the function
\[
J_0(x,M) = \max[0, M].
\]
Successive iterates are generated by
\[
J_{k+1}(x,M) = \max\Big[ M,\ \max_i L^i(x,M,J_k) \Big], \qquad k = 0,1,\ldots, \qquad (5.6)
\]
and we know from Prop. 2.1 of Section 1.2 that
\[
\lim_{k\to\infty}J_k(x,M) = J(x,M), \qquad \text{for all } x, M. \qquad (5.7)
\]
We show inductively that $J_k(x,M)$ has the properties (a) to (c) stated in the proposition and, by taking the limit as $k \to \infty$, we establish the same properties for $J$. Clearly, $J_0(x,M)$ satisfies properties (a) to (c). Assume that $J_k(x,M)$ satisfies (a) to (c). Then from Eqs. (5.5) and (5.6) it follows that $J_{k+1}(x,M)$ is convex and monotonically nondecreasing in $M$, since the expectation and maximization operations preserve these properties. Verification of (b) and (c) is straightforward and is left for the reader. Q.E.D.

Consider now a problem where there is only one project that can be worked on, say project $i$. The optimal reward function for this problem is denoted $J^i(x^i,M)$ and has the properties indicated in Prop. 5.1. A typical form for $J^i(x^i,M)$, viewed as a function of $M$ for fixed $x^i$, is shown in Fig. 1.5.1. Clearly, there is a minimal value $m^i(x^i)$ of $M$ for which $J^i(x^i,M) = M$; that is,
\[
m^i(x^i) = \min\{ M \mid J^i(x^i,M) = M \}, \qquad \text{for all } x^i. \qquad (5.8)
\]
The function $m^i(x^i)$ is called the index function (or simply index) of project $i$. It provides an indifference threshold at each state; that is, $m^i(x^i)$ is the retirement reward for which we are indifferent between retiring and operating the project when at state $x^i$. Our objective is to show the optimality of the index rule (5.3) for the index function defined by Eq. (5.8).
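For a project with finitely many states, the index (5.8) can be computed numerically. The sketch below is our own illustration (not from the text): for a given retirement reward $M$ it solves the single-project retirement problem by value iteration and then locates by bisection the smallest $M$ for which retiring is optimal; the bracketing interval comes from Prop. 5.1(b),(c).

```python
import numpy as np

def single_project_value(R, P, alpha, M, iters=2000):
    # J(x, M) = max(M, R(x) + alpha * E[J(x', M)]) solved by value iteration.
    J = np.full(len(R), max(0.0, M))
    for _ in range(iters):
        J = np.maximum(M, R + alpha * P @ J)
    return J

def index(R, P, alpha, x, tol=1e-6):
    B = np.abs(R).max()
    lo, hi = -B / (1 - alpha), B / (1 - alpha)     # the index lies in this interval
    while hi - lo > tol:
        M = 0.5 * (lo + hi)
        if single_project_value(R, P, alpha, M)[x] <= M + tol:
            hi = M                                  # retiring is optimal at x, so index <= M
        else:
            lo = M
    return 0.5 * (lo + hi)
```

Bisection is justified because $J^i(x^i,M) - M$ is nonincreasing in $M$ (its derivative is $E\{\alpha^{K_M}\} - 1 \le 0$, cf. Lemma 5.1 below), so the set of $M$ with $J^i(x^i,M) = M$ is an interval extending to $+\infty$.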
Figure 1.5.1  Form of the $i$th project reward function $J^i(x^i,M)$ for fixed $x^i$, and definition of the index $m^i(x^i)$.

Project-by-Project Retirement Policies

Consider first a problem with a single project, say project $i$, and a fixed retirement reward $M$. Then by the definition (5.8) of the index, an optimal policy is to
\[
\text{retire project } i \qquad \text{if } m^i(x^i) \le M, \qquad (5.9\text{a})
\]
\[
\text{work on project } i \qquad \text{if } m^i(x^i) > M. \qquad (5.9\text{b})
\]
In other words, the project is operated continuously up to the time that its state falls into the retirement set
\[
S^i = \{ x^i \mid m^i(x^i) \le M \}. \qquad (5.10)
\]
At that time the project is permanently retired.

Consider now the multiproject problem for fixed retirement reward $M$. Suppose at some time we are at state $x = (x^1,\ldots,x^n)$. Let us ask two questions:

1. Does it make sense to retire (from all projects) when there is still a project $i$ with state $x^i$ such that $m^i(x^i) > M$? The answer is negative. Retiring when $m^i(x^i) > M$ cannot be optimal, since if we operate project $i$ exclusively up to the time that its state $x^i$ falls within the retirement set $S^i$ of Eq. (5.10) and then retire, we will gain
a higher expected reward. [This follows from the definition (5.8) of the index and the nature of the optimal policy (5.9) for the single-project problem.]

2. Does it ever make sense to work on a project $i$ with state $x^i$ in the retirement set $S^i$ of Eq. (5.10)? Intuitively, the answer is negative: it seems unlikely that a project unattractive enough to be retired if it were the only choice would become attractive merely because of the availability of other projects that are independent in the sense assumed here.

We are led therefore to the conjecture that there is an optimal project-by-project retirement (PPR) policy that permanently retires projects in the same way as if they were the only project available. Thus at each time a PPR policy, when at state $x = (x^1,\ldots,x^n)$,
\[
\text{permanently retires project } i \qquad \text{if } x^i \in S^i, \qquad (5.11\text{a})
\]
\[
\text{works on some project} \qquad \text{if } x^j \notin S^j \text{ for some } j, \qquad (5.11\text{b})
\]
where $S^i$ is the $i$th project retirement set of Eq. (5.10). Note that a PPR policy decides about retirement of projects but does not specify the project to be worked on out of those not yet retired. The following proposition substantiates our conjecture. The proof is lengthy but quite simple.

Proposition 5.2: There exists an optimal PPR policy.

Proof: In view of Eqs. (5.4), (5.5), and (5.11), existence of an optimal PPR policy is equivalent to having, for all $i$,
\[
M \ge L^i(x,M,J), \qquad \text{for all } x \text{ with } x^i \in S^i, \qquad (5.12\text{a})
\]
\[
M \le L^i(x,M,J), \qquad \text{for all } x \text{ with } x^i \notin S^i, \qquad (5.12\text{b})
\]
where $L^i$ is given by
\[
L^i(x,M,J) = R^i(x^i) + \alpha E\big\{ J\big(x^1,\ldots,x^{i-1},f^i(x^i,w^i),x^{i+1},\ldots,x^n,M\big) \big\}, \qquad (5.13)
\]
and $J(x,M)$ is the optimal reward function corresponding to $x$ and $M$. The $i$th single-project optimal reward function $J^i$ clearly satisfies, for all $x^i$,
\[
J^i(x^i,M) \le J\big(x^1,\ldots,x^{i-1},x^i,x^{i+1},\ldots,x^n,M\big), \qquad (5.14)
\]
since having the option of working on projects other than $i$ cannot decrease the optimal reward. Furthermore, from the definition of the retirement set $S^i$ [cf. Eq. (5.10)],
\[
x^i \notin S^i, \qquad \text{if } M < R^i(x^i) + \alpha E\big\{ J^i\big(f^i(x^i,w^i),M\big) \big\}. \qquad (5.15)
\]
Using Eqs. (5.13) to (5.15), we obtain Eq. (5.12b). It will suffice to show Eq. (5.12a) for $i = 1$. Denote:

$\bar x = (x^2,\ldots,x^n)$: the state of all projects other than project 1.

$\bar J(\bar x,M)$: the optimal reward function for the problem resulting after project 1 is permanently retired.

$J(x^1,\bar x,M)$: the optimal reward function for the problem involving all projects and corresponding to state $x = (x^1,\bar x)$.

We will show the following inequality for all $x = (x^1,\bar x)$:
\[
\bar J(\bar x,M) \le J(x^1,\bar x,M) \le \bar J(\bar x,M) + \big( J^1(x^1,M) - M \big). \qquad (5.16)
\]
In words, this expresses the intuitively clear fact that at state $(x^1,\bar x)$ one would be happy to retire project 1 permanently if one gets in return the maximum reward that can be obtained from project 1 in excess of the retirement reward $M$. We claim that to show Eq. (5.12a) for $i = 1$, it will suffice to show Eq. (5.16). Indeed, when $x^1 \in S^1$, then $J^1(x^1,M) = M$, so from Eq. (5.16) we obtain
\[
J(x^1,\bar x,M) = \bar J(\bar x,M),
\]
which is in turn equivalent to Eq. (5.12a) for $i = 1$.

We now turn to the proof of Eq. (5.16). Its left side is evident. To show the right side, we proceed by induction on the value iteration recursions
\[
J_{k+1}(x^1,\bar x) = \max\Big[ M,\ R^1(x^1) + \alpha E\big\{ J_k\big(f^1(x^1,w^1),\bar x\big) \big\},\ \max_{i\ne 1}\big\{ R^i(x^i) + \alpha E\big\{ J_k\big(x^1,F^i(\bar x,w^i)\big) \big\} \big\} \Big], \qquad (5.17\text{a})
\]
\[
\bar J_{k+1}(\bar x) = \max\Big[ M,\ \max_{i\ne 1}\big[ R^i(x^i) + \alpha E\big\{ \bar J_k\big(F^i(\bar x,w^i)\big) \big\} \big] \Big], \qquad (5.17\text{b})
\]
\[
J^1_{k+1}(x^1) = \max\Big[ M,\ R^1(x^1) + \alpha E\big\{ J^1_k\big(f^1(x^1,w^1)\big) \big\} \Big], \qquad (5.17\text{c})
\]
where, for all $i \ne 1$ and $\bar x = (x^2,\ldots,x^n)$,
\[
F^i(\bar x,w^i) = \big(x^2,\ldots,x^{i-1},f^i(x^i,w^i),x^{i+1},\ldots,x^n\big). \qquad (5.18)
\]
The initial conditions for the recursions (5.17) are
\[
J_0(x^1,\bar x) = M, \qquad \text{for all } (x^1,\bar x), \qquad (5.19\text{a})
\]
\[
\bar J_0(\bar x) = M, \qquad \text{for all } \bar x, \qquad (5.19\text{b})
\]
\[
J^1_0(x^1) = M, \qquad \text{for all } x^1. \qquad (5.19\text{c})
\]
We know that $J_k(x^1,\bar x) \to J(x^1,\bar x,M)$, $\bar J_k(\bar x) \to \bar J(\bar x,M)$, and $J^1_k(x^1) \to J^1(x^1,M)$, so to show Eq. (5.16) it will suffice to show that, for all $k$ and $(x^1,\bar x)$,
\[
J_k(x^1,\bar x) \le \bar J_k(\bar x) + J^1_k(x^1) - M. \qquad (5.20)
\]
In view of the definitions (5.19), we see that Eq. (5.20) holds for $k = 0$. Assume that it holds for some $k$. We will show that it holds for $k+1$. From Eq. (5.17) and the induction hypothesis (5.20), we have
\[
J_{k+1}(x^1,\bar x) \le \max\Big[ M,\ R^1(x^1) + \alpha E\big\{ \bar J_k(\bar x) + J^1_k\big(f^1(x^1,w^1)\big) - M \big\},\ \max_{i\ne 1}\big[ R^i(x^i) + \alpha E\big\{ \bar J_k\big(F^i(\bar x,w^i)\big) + J^1_k(x^1) - M \big\} \big] \Big].
\]
Using the facts
\[
\bar J_k(\bar x) \ge M \quad \text{and} \quad J^1_k(x^1) \ge M \qquad \text{[cf. Eqs. (5.17)]},
\]
and the preceding equation, we see that
\[
J_{k+1}(x^1,\bar x) \le \max[A_1,\ A_2],
\]
where
\[
A_1 = \max\big[ M,\ R^1(x^1) + \alpha E\big\{ J^1_k\big(f^1(x^1,w^1)\big) \big\} \big] + \alpha\big(\bar J_k(\bar x) - M\big),
\]
\[
A_2 = \max\Big[ M,\ \max_{i\ne 1}\big[ R^i(x^i) + \alpha E\big\{ \bar J_k\big(F^i(\bar x,w^i)\big) \big\} \big] \Big] + \alpha\big(J^1_k(x^1) - M\big).
\]
Using Eqs. (5.17b), (5.17c), and the preceding equations, we see that
\[
J_{k+1}(x^1,\bar x) \le \max\big[ J^1_{k+1}(x^1) + \bar J_k(\bar x) - M,\ \bar J_{k+1}(\bar x) + J^1_k(x^1) - M \big]. \qquad (5.21)
\]
It can be seen from Eqs. (5.17) and (5.19) that
\[
J^1_k(x^1) \le J^1_{k+1}(x^1) \quad \text{and} \quad \bar J_k(\bar x) \le \bar J_{k+1}(\bar x)
\]
for all $k$, $x^1$, and $\bar x$, so from Eq. (5.21) we obtain that Eq. (5.20) holds for $k+1$. The induction is complete. Q.E.D.

As a first step towards showing optimality of the index rule, we use the preceding proposition to derive an expression for the partial derivative of $J(x,M)$ with respect to $M$.

Lemma 5.1: For fixed $x$, let $K_M$ denote the retirement time under an optimal policy when the retirement reward is $M$. Then for all $M$ for which $\partial J(x,M)/\partial M$ exists, we have
\[
\frac{\partial J(x,M)}{\partial M} = E\big\{\alpha^{K_M}\big\}.
\]

Proof: Fix $x$ and $M$. Let $\pi^*$ be an optimal policy and let $K_M$ be the retirement time under $\pi^*$. If $\pi^*$ is used for a problem with retirement reward $M + \epsilon$, we receive
\[
E\{\text{reward prior to retirement}\} + (M+\epsilon)E\big\{\alpha^{K_M}\big\} = J(x,M) + \epsilon E\big\{\alpha^{K_M}\big\}.
\]
Sec. I..‘> Slochust ic SclnxluliHg ;ui<l the Nlull ini med linii'lil 6 I The optimal reward ./(.r, Al + c) when the retirement, reward is Al + <. is no less than the preceding expression, so + <) > ./(./•, л/) + </;{<>'<«/ }. Similarly, we obtain ,/(.r. Л/ - <-) > ,/(.r. Al) - < £’{<i}. For e > 0, these two relations yield ./(.r.A/)-,/(.,:. Л/ -r) < /;{о/<д,} < .7(.rjW+<)- ,/(.r,A/) < < The result follows by taking e —► 0. Q.E.D. Note that the convexity of .7(.r, ) with respect, to Л/ (Prop. 1.1) im- plies that the derivative Al)/<)A1 exists almost everywhere with re- spect to Lebesgue measure [Roc70]. Furthermore, it. can be shown I,hat <Э7(х, A/)/cW exists for all Al for which the optimal policy is unique. For a given Al. initial state ,r, and optimal PPR. policy, let T, be the retirement time of project i. if it were t.he only project available, let 7’ be the retirement time for the multiproject problem. Both T, and T take values that are either nonnegative or oo. The existence of an optimal PPR. policy implies that we must have T = 7\ + • • • + Tn and in addition 1), I = 1,..., are independent random variables. There- fore, } = E {<Ci+ +7'>} = }. 7=1 Using Lemma. 5.1, we obtain i)J(EM) = A o.Ht',Ai) r 0Л1 -I J- dAI ' i=l Optimality of the Index Rule We are now ready to show our main result.. Proposition 5.3: The index rule. (5.3) is an optimal stationary policy.
.. ..ж. hiliiulf I h >ri/,i m I )is(<>iiiil t'd I'robk'inx ( I Proof: Fix .r — (.r1,. . . denote zzz(.r) = inax{7zzJ (./')} . and let / be such that. zzz'(.r') = inax{zzzJ ('')}• If zzz,(:/-) < Л/ t he optimality of the index rule (5.3;i) at state x follows from the existence of an optimal PPR policy. If ui(x) > Al, we note that Л/) = IV (.:') + al-:{.l'(p(x', zz>'), A/)} and I h(4i nsr this relation together with E<|. (5.22) t.o write <)./(x, A/) _ i).J'(x', A/) . . (l./px-i, Al) oTi ~ ~i)Ki 11 м/ — E l) ЛI X' -1, /' (./', Z(Z'). .z:' + l,..., ,-r’1, A/) }, and finally а/(.г,л/) UTi where Л'(.г.Л/,./) = IV(x') 1 <J-:{.px',...,x' 1, n"), X'"1,..., X", A/) }. (The interchange of differentiation and expectation can be justified for almost all A/; see [Ber73a].) By the existence of an optimal PPR policy, we also have ,/(.r, zzz(.z:)) = L1 (x, zzi(.z'), .J). Therefore, the convex functions </(.r, Л/) and L'(x, Л/,./) viewed as func- tions of Л/ for fixed x are equal for А/ = zzz(.r) and have equal derivative for almost, all Al < in(.r). 11 follows that for all A! < m.(x) we have J(x, Al) -- L'(x, Al,.I). This implies that, the index rule (.5.3b) is optimal for all x with zzz(.r) > Al. Q.E.D.
Deteriorating and Improving Cases

It is evident that great simplification results from the optimality of the index rule (5.3), since optimization of a multiproject problem has been reduced to n separate single-project optimization problems. Nonetheless, solution of each of these single-project problems can be complicated. Under certain circumstances, however, the situation simplifies.

Suppose that for all i, x^i, and w^i that can occur with positive probability, we have either

    m^i(x^i) <= m^i(f^i(x^i, w^i))     (5.23)

or

    m^i(x^i) >= m^i(f^i(x^i, w^i)).     (5.24)

Under Eq. (5.23) [or Eq. (5.24)] projects become more (less) profitable as they are worked on. We call these cases improving and deteriorating, respectively.

In the improving case the nature of the optimal policy is evident: either retire at the first period, or else select a project with maximal index at the first period and continue engaging that project for all subsequent periods.

In the deteriorating case, note that Eq. (5.24) implies that if retirement is optimal when at state x^i, then it is also optimal at each state f^i(x^i, w^i). Therefore, for all x^i such that M = m^i(x^i) we have, for all w^i,

    J^i(x^i, M) = M,     J^i(f^i(x^i, w^i), M) = M.

From Bellman's equation

    J^i(x^i, M) = max[ M, R^i(x^i) + a E{ J^i(f^i(x^i, w^i), M) } ]

we obtain m^i(x^i) = R^i(x^i) + a m^i(x^i), or

    m^i(x^i) = R^i(x^i) / (1 - a).     (5.25)

Thus the optimal policy in the deteriorating case is: retire if M >= max_i [ R^i(x^i)/(1 - a) ], and otherwise engage the project i with maximal one-step reward R^i(x^i).
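The deteriorating-case rule (5.25) lends itself to a very short implementation. The sketch below is illustrative only: the discount factor, retirement reward, reward functions, and project states are hypothetical placeholders, not data from the text.

```python
# A minimal sketch of the deteriorating-case index rule (5.25).
# All numbers and reward functions below are made-up placeholders.

ALPHA = 0.9   # discount factor (assumed)
M = 5.0       # retirement reward (assumed)

def deteriorating_index(R_i, x_i, alpha=ALPHA):
    """Index m^i(x^i) = R^i(x^i) / (1 - alpha), valid in the deteriorating case."""
    return R_i(x_i) / (1.0 - alpha)

def choose_action(rewards, states, M=M, alpha=ALPHA):
    """Return 'retire' or the index of the project to engage.

    rewards: list of one-step reward functions R^i
    states:  list of current project states x^i
    """
    indices = [deteriorating_index(R, x, alpha) for R, x in zip(rewards, states)]
    if M >= max(indices):
        return "retire"
    # Otherwise engage the project with maximal one-step reward R^i(x^i).
    one_step = [R(x) for R, x in zip(rewards, states)]
    return max(range(len(one_step)), key=lambda i: one_step[i])

# Two hypothetical projects whose rewards decay as the state grows.
rewards = [lambda x: 1.0 / (1 + x), lambda x: 0.8 / (1 + 0.5 * x)]
states = [0.0, 2.0]
print(choose_action(rewards, states))
```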
Example 5.1 (Treasure Hunting)

Consider a search problem involving N sites. Each site i may contain a treasure with expected value v_i. A search at site i costs c_i and reveals the treasure with probability b_i (assuming a treasure is there). Let P_i be the probability that there is a treasure at site i. We take P_i as the state of the project corresponding to searching site i. Then the corresponding one-step reward is

    R^i(P_i) = b_i v_i P_i - c_i.     (5.26)

If a search at site i does not reveal the treasure, the probability P_i drops to

    P_i := P_i (1 - b_i) / ( P_i (1 - b_i) + 1 - P_i ),

as can be verified using Bayes' rule. If the search finds the treasure, the probability drops to zero, since the treasure is removed from the site. Based on this and the fact that R^i(P_i) is increasing in P_i [cf. Eq. (5.26)], it is seen that the deteriorating condition (5.24) holds. Therefore, it is optimal to search the site i for which the expression R^i(P_i) of Eq. (5.26) is maximal, provided max_i R^i(P_i) > 0, and to retire if R^i(P_i) <= 0 for all i.

1.6 NOTES, SOURCES, AND EXERCISES

Many authors have contributed to the analysis of the discounted problem with bounded cost per stage, most notably Shapley [Sha53], Bellman [Bel57], and Blackwell [Bla65]. For variations and extensions of the problem involving multiple criteria, weighted criteria, and constraints, see [FeS94], [Gho90], [Kos89], and [WhK80]. The mathematical issues relating to measurability concerns are analyzed extensively in [BeS78], [DyY79], and [Her89].

The error bounds given in Section 1.3 and Exercise 1.9 are improvements on results of [McQ66] (see [Por71], [Por75], [Ber76], and [PoT78]). The corresponding convergence rate was discussed in [Mor71] and [MoW77]. The Gauss-Seidel method for discounted problems was proposed in [Kus71] (see also [Has68]). An extensive discussion of the convergence aspects of the method and related background is given in Section 2.6 of [BeT89a]. The material on the generic rank-one correction, including the convergence analysis of Exercise 1.8, is new; see [Ber93], which also describes a multiple-rank correction method where the effect of several eigenvalues is nullified. Value iteration is particularly well suited for parallel computation; see e.g., [AMT93], [BeT89a].

Policy iteration for discounted problems was proposed in [Bel57]. The modified policy iteration algorithm was suggested and analyzed in [PuS78]
and [PuS82]. The approximate policy iteration analysis and the conver- gence proof of policy iteration for an infinite state space (Prop. 3.(1) are new and were developed in collaboration with .1. Tsitsiklis. 'The relation between policy iteration and Newton’s method (Exercise 1.10) was pointed out in [PoA69] and was further discussed in [PuB78]. The material on adaptive aggregation is due to [BeC89]. In an al- ternative aggregation approach [SPK89], the aggregate states arc fixed. Changing adaptively the aggregate states from one iteration to the next depending on tlie progress of the computation has a. potentially significant effect on the efficiency of the computation for difficult problems where the ordinary value iteration method is very slow. The linear programming approach of Section 1.3.4 was proposed in [D’EpGO]. There is a relation between policy iteration and the simplex method applied to solving the linear program associated with the dis- counted problem. In particular, it. can be shown that t.h<; simplex method for linear programming with a block pivot ing rule is mathematically equiv- alent to the policy iteration algorithm. Then' are also duality connections that relate the linear programming approach with randomized policies, constraints, and multiple criteria; see e.g.. [Kal83], [Put.94], Approxima- tion methods using basis functions and linear programming were proposed in [ScS85]. A complexity analysis of finite-state infinite horizon problems is given in [PaT87], Discretization methods that approximate inlinite state space systems with finite-state Markov chains, are discussed in [Ber75], [Fox71], [HaL8G], [Whi78], [\Vhi79], and [\Vhi80a]. For related multigrid approxima- tion methods and associated complexity analysis, see [C11T89] and [ChT91 ]. A different approach to deal with infinite state spaces, which is based on randomization, has been introduced in [Rus91]; see also [Hns95], Further material on computational methods may be found in [Put 78]. The role, of contraction mappings in discounted problems was first, recognized and exploited in [Slia53], which considers two-player dynamic games. Abstract DP models and the implications of monotonicity and con- traction have been explored in detail in [Den67], [Ber77], [BeS7S], [VePS 1], and [VeP87]. The index rule solution of the multiarnied bandit problem is due to [Git79] and [GiJ74]. Subsequent contributions include [Whi80b]. [Kel81], [Whi81], and [Whi<82]. The proof given here is due to [Tsi8G]. Alternative proofs and analysis are given in [VWB85], [NTW89], [Tso91], [Web92], [BeN93], [Tsi93b], [BPT94a], and [BPT94b], Much additional work on the subject, is described in [Kum85] and [KuVSG]. Finally, we note that even though our analysis in this chapter requires a countable disturbance space, it may still serve as (he starling point of analysis of problems with uncountable disturbance space. This can be done by reducing such problems to deterministic problems with state space a set of probability measures. The basic idea of this reduction is demonstrated
.11 GG Infinite Horizon Discounted Problems Clnij). 1 in Exercise 1.13. The advanced reader may consult. [13eS78] (Section 9.2), and see how such a, reduct ion can be ellected for a very broad class of finite and infinite horizon problems. EXERCISES 1.1 Write a computer problem and compute iteratively the vector satisfying /l\ /3/4 1/4 ./„ = 2 + ci 1/1 3/4 - c W \ 0 0 \ c 1-е/ Do your computations lor all combinations of о = 0.9 and cr = 0.999, and ( — 0.5 and c = 0.001. Try value! iteration with and without, error bounds, and also adaptive aggregation with two aggregate classes of states. Discuss your result.s. 1.2 The purpose of this problem is to show that shortest path problems with a discount factor make! little sense. Suppose that we have a graph with a nomicgntive length c/.;, for each arc: (/,/). The cost of a path (io, ii ,...,/,„) is У/(l' o1' <c,A ,A., ।, where <i is a discount factor from (0,1). Consider the problem of finding a path of minimum cost that connects two given nodes. Show 11,nt this problem need not have a solution. 1.3 Consider a problem similar to that of Scct.ion 1.1 except that when we are at state .r*, there is a probability /1 where 0</f< 1, that the next state will be? determined according to ,rA r i = ilk, ng) and a probability (1—/1) that the system will move to a termination state, where it stays permanently thereafter at no cost. Show that even if ci = 1, the problem! can be put into the' discounted cost framework.
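A minimal sketch in the spirit of Exercise 1.1 follows: value iteration on a fixed-point equation of the form J_a = g + a P J_a. The cost vector g, transition matrix P, tolerance, and stopping rule below are illustrative placeholders, not the data of the exercise.

```python
import numpy as np

# Plain value iteration for J = g + alpha * P * J, as a sketch for Exercise 1.1.
# The vector g and matrix P are made-up placeholders.

def value_iteration(g, P, alpha, tol=1e-10, max_iters=100000):
    J = np.zeros(len(g))
    for k in range(max_iters):
        J_new = g + alpha * P.dot(J)
        # Stop when the sup-norm change guarantees an error below tol.
        if np.max(np.abs(J_new - J)) < tol * (1 - alpha) / (2 * alpha):
            return J_new, k + 1
        J = J_new
    return J, max_iters

g = np.array([1.0, 2.0, 3.0])
P = np.array([[0.70, 0.20, 0.10],
              [0.10, 0.80, 0.10],
              [0.05, 0.15, 0.80]])

for alpha in (0.9, 0.999):
    J, iters = value_iteration(g, P, alpha)
    print(f"alpha={alpha}: J={J}, iterations={iters}")
```

As the exercise suggests, the number of iterations grows sharply as the discount factor approaches 1, which is what motivates the error bounds and aggregation methods of Section 1.3.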
Sec. 1.6 Notes. Sources. mid Exercises 67 1.4 C’onsider a problem similar to that of Section 1.2 except that the discount (actor о (kipends on the current state .га . the cont rol u-к, and the disturbance that is. the cost function has t hr form where О U’ о (./-(I. //(,{./()). »’())o (./’; . Il] (.Г] ), U’l ) О (.Q.. ). U’A). with o(.r. и. tv) a given function satisfying I) < min{n(.r, u. tv) | .г € S. и G C, iv G 11} < inax{o(.r, a, iv) | j; G -S’, и G C. w G I)} < I. Argue that the results and algorithms of Sections 1.2 and 1.3 have direct counterparts for such problems. 1.5 (Column Reduction [Por75]) The purpose of this problem is to provide a t ransformation of a certain ty pe of discounted problem into another discounted problem with smaller discount factor. Consider the //-state discounted problem under the assumptions of Section 1.3. The cost per stage* is </(/,?/,), the discount factor is o, and the transit ion prol >abilil ies are p, ,(<i). For each j — I,.. . , 1(4. m. = min min pt, (a). ...n • For all /, j. and //, let 1 - L/..,. < assuming that t < 1- (a) Show that //,,(//) an* transition probabilities. (b) Consider the discounted problem with cost per st age <y(Z, ?/.), discount factor o(l - £2"_] n>j)' transition probabilities Show that this problem has the same optimal policies as the original, and that its optimal cost vector J' satisfies о У'" . in ,J'{j) r f + ' e, 1 — n where ./* is the optimal cost vector of the original problem and c is the unit vector.
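The column reduction of Exercise 1.5 is easy to carry out numerically. The sketch below assumes the transition probabilities are stored as an array p[u, i, j] and that every control is admissible at every state; all data are illustrative placeholders.

```python
import numpy as np

# A sketch of the column reduction transformation of Exercise 1.5.

def column_reduction(p, alpha):
    m = p.min(axis=(0, 1))              # m_j = min over i and u of p_ij(u)
    total = m.sum()
    assert total < 1.0, "transformation requires sum_j m_j < 1"
    p_new = (p - m) / (1.0 - total)     # new transition probabilities
    alpha_new = alpha * (1.0 - total)   # smaller effective discount factor
    return p_new, alpha_new             # costs per stage are left unchanged

p = np.array([[[0.5, 0.3, 0.2],
               [0.2, 0.5, 0.3],
               [0.3, 0.3, 0.4]],
              [[0.6, 0.2, 0.2],
               [0.1, 0.6, 0.3],
               [0.2, 0.4, 0.4]]])        # p[u, i, j], placeholder data
p_new, alpha_new = column_reduction(p, alpha=0.9)
print(alpha_new)                         # reduced discount factor
print(p_new.sum(axis=2))                 # each row still sums to one
```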
G8 Infinite Horizon Discounted Problems Chap. I l.G Let, J : S t—► 3i be any bounded funct ion on S and consider the value iteration method ol Section 1.3 with a starting function ./ : S >—> li of the form ./(.r) = ./(') + r, .res, where r is some scalar. Show that the bounds (7*J)(.r) + Ci,. and (7*J)(x)+cfe of Prop. 3.1 arc independent of the scalar r for all £ 5. Show also that if S consists of a single state .r (i.e., S = {./'}), then (7’./)(.r) + c, = (7’./)(.r)+m = 1.7 (Jacobi Version of Value Iteration) Consider the problem ol’ Section 1.3 and the version of the value iteration method that starts with an arbitrary function ,/ : S SJ£ and generates recursively F./, F~./,. .., where F is the mapping given by (FJ)(i) = min 1 - «/><.(») Show that. (/-’*’./)(/) —♦ J’(i) as k oo and provide a rate of convergence estimate that is at least as favorable as the one for the ordinary method (cf. Prop. 2.3). 1.8 (Convergence Properties of Rank-One Correction [Ber93]) Consider the solution of the system .7 = where F : St" и St" is the mapping FJ = h + QJ. h is a given vector in St", and Q is an n x 11. matrix. Consider the generic rank-one correction iteration J : = MJ. where M : St" St" is the mapping MJ = FJ S-уг. z = Qd, (d-z)'(F.I -J) ||d-< (a) Show that any solution ./* ol the system J — /'./ satislies J* = .MJ*. (li) Verify that the value iteration method that uses the error bounds in the manner of lt<|. (3.12) is a. special case of the iteration •/ := MJ with d equal to the unit vector.
See. 1.6 Notes, Sources, and ICxereises 69 (<:) Assumet hat d is an eigen vector ol (^, let z\ be (he corresponding eigen- value', and let Л|.........A,( i be t he remaining eigenvalues. Show (hat MJ can be written as MJ - h i /Г./. where h. is some vector in 9t" and Show also that lid = 0 and that for all к and 7, Hk = HQ1' 1, A/1./ A/(Fi' furthermore, the eigenvalues of li, are 0, A ।,.. ., A„ ।. ('This last state- ment requires a somewhat complicated proof; see [Ber93].) (d) Let d be as in part (c), and suppose that, cj..c'n — i are eigen vectors corresponding to Ai,...,A„_|. Suppose that a vector J can be writ fen as n - I ./ = .r + ^ + £c<,„ 7-1 where ./* is a solution of the system. Show that, for all к > 1, Mk.i = ./* »<-. so that if’ A is a dominant eigenvalue and Ai,...,A„-i lie within the unit circle, A/*J converges to ./* at a. rate governed by the subdominant eigenvalue. Л’о/.е; This result can b< generalized for the case where Q does not have a full set of linearly independent, eigenvectors, and for the case where F is modified through multiple-rank corrections [Ber9.3], 1.9 (Generalized Error Bounds [Ber76]) Lot 6’ be a set and B(5) be the set of all bounded real-valued funct ions on S. Let T : U(S) 1 > B(.S') be a mapping with the following two properties: (1) TJ < TJ' for all 7, J' E B(5) with ./ < ./'. (2) For every scalar r yf 0 and all .1: € 5. (T(./+ rc)) (.r)— r) Ul <--------------------------< O2 where* <11, ri2 an* two scalars with 0 < <i| < n--j < 1.
70 Infinite Horizon Discounted Problems Chap. 1 (a) Show that 7’ is a contraction mapping on B(.S'), and hence for every J C B(S) we have lim (7‘*./)(.r) = ./*(./;), .r G S, where ,/* is the unique fixed point of T in 73(5). (b) Show that for all J G £J(S'), J' E and k = 1,2,. .., (TkJ)(x) + ck < (TA + 1 ,/)(.r) + e, + 1 < J*(.r) < (7’fc+1./)(;r) +cr-+1 < (TkJ)(x)+ck, where for all k Lk - "‘in | — min [('У1'’ ./)(.r) - (7,A './)(.()]. :n"£'s . «-i) T-^- mm[(7</)(.r) - (7’''-}• = max / ~r'— max[(7'‘j)(.r) - (7’t ^ ./)(.<)], U o, .cs (C2) max[(Tk J)(.,.) - (Tk~1 J)(.r)] . 1 — Q-j ,rCS L J J Л geometric interpretation of these relations for L1kj case where .S' con- sists of a single element is provided in Fig. 1.6.1. (c) Consider tin* following algorithm: Mr) (TJk- I )(.r) + 7a-. .r e S’, where ./» is any function in /7(5’), 7. is any scalar in the range [<7.,Cfc], and <7, anil ck are given by Eqs. (G.l) and (li.2) with (T1'— (T1, ~1 ./)(.;:) replaced by (7'./у_ i)(.r) — .A-i(j'). Show that for all k, тах|.Л(.г) - ,/*(.r)| < 02 max|./o(.r) - ./*(.r)|. (d) Let .7 G It" and consider the equation J = TJ, where TJ = h + MJ and the vector h G 3?" and the matrix M are given. Let s, be the ith row sum of M. that, is, .s. = ±»b, and let о । = min, .s,, o2 = max, s,. Show that if the elements of M are all nonnegat ivc and n2 < 1. then the conclusions of parts (a) and (b) hold. (e) [Por7.r>] Consider the Clauss-Seidel method for solving the system J = <1 + aPJ. where 0 < <1 < 1 and /’ is a transition probability matrix. Use part (d) to obtain suitable error bounds.
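For the ordinary discounted case a1 = a2 = a, the bounds of part (b) reduce to the familiar form computed in the sketch below. The two-state, two-control data are made-up placeholders.

```python
import numpy as np

# Error bounds for value iteration in the special case alpha_1 = alpha_2 = alpha
# of Exercise 1.9(b). The problem data g[u, i] and P[u, i, j] are placeholders.

alpha = 0.9
g = np.array([[2.0, 1.0], [0.5, 3.0]])
P = np.array([[[0.7, 0.3], [0.4, 0.6]],
              [[0.9, 0.1], [0.2, 0.8]]])

def T(J):
    return np.min(g + alpha * P.dot(J), axis=0)

J = np.zeros(2)
for k in range(1, 20):
    TJ = T(J)
    diff = TJ - J
    c_lower = alpha / (1 - alpha) * diff.min()   # lower correction
    c_upper = alpha / (1 - alpha) * diff.max()   # upper correction
    # The optimal cost vector satisfies  TJ + c_lower <= J* <= TJ + c_upper.
    print(k, TJ + c_lower, TJ + c_upper)
    J = TJ
```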
See. 1.6 Notes, Sources, and T.xereises 71 Figure 1.6.1 Graphical inierprrlat ion of the ciroi bounds of Exercise 1.!). 1.10 (Policy Iteration and Newton’s Method) The purpose of this problem is to demonstrate a relation between policy iteration and Newton’s method for solving nonlinear equations. Consider an equation ol the form F(J) = 0. where /*’ : 3?" Ji'". Civen a vector ./(. C -h'”, Newton’s method determines ,/A । , by solving the' linear system of equations Р(Л) + ---у- (-Л-+1 - -h ) ~ 0, where }/(),! is the Jacobian matrix of /*' evaluated at . (a) Consider the discounted finite-state problem of Scct.ion 1.3 and define F(./) - TJ - J. Show that if there is a unique // such that T„ J - TJ. then the Jacobian matrix of F at, ./ is <^V) P / ~7л7~ = аР'-~л
72 lu/iiiih' Horizon Discounted Problems (’iiup. I .,1 where I is the п x n identity. (b) Show that, the policy iteration algorithm can be identifier! with Newton’s met hod lor solving /'’(•/) — 0 (assuming it gives a unique policy at each step). 1.11 (Minimax Problems) 1 ’rovide* analogs of I he results and a Igor it hms of Sect ions 1.2 and 1.3 lor (he minimax problem where the cost is i A' 0,1... </ is bounded, .q. is generated by .q. ( i = ///. (.rranel lV‘(.r, it.) is a given nonempty subset of D for each (.r. it.) G .S' x (\ (Compare* wit h Exerehsc 1 ,.rj in Chapter 1 of Vol. 1.) 1.12 (Data Transformations [Scli72]) A linite-stat.e problem where* the discount factor at each stage depends on the state can be* transformed into a problem with stale indepenelent discount factors. To sex* this, consider the following set. of e'epiations in the* variables ./(/) — in i 11 +52'"-./("Ob) , ' = i, • ", (6.3) where* we assume* that lor all i. и G H(i). and j. тч(и) > 0 anil and drliia*. for all i and
Sec. 1.6 i\oh‘s. Sources. and 73 Show that, for all i and j. and that a solut ion {./(/) | > — I} <->1 h'< 1- ((>.3) is also a solut ion ol t he equations ./(/) = min utl'ld </{> ") + (")/(./) 1.13 (Stochastic to Deterministic Problem Transformation) Under the assumptions and notation of Section 1.3. consider the controlled system in 1-1 = p/.T/4, A- — 0, 1.... , where />a is a probability distribution over 8 viewed as a row vector, and PPA is the transition probability matrix eoi responding to t he control funct ion ///, . The state is щ and the control is /ц. Consider also the cost function Show that the optimal cost and an optimal policy for the deterministic prob- lem involving the above' system and cost function yield the optimal cost and an optimal policy for the discounted cost problem of Section 1.3. 1.14 (Threshold Policies and Policy Iteration) (a) Consider the machine replacement example* of Section 1.2, and assume that, the condition (2.10) holds. Let us define a threshold policy to be a stat ionary policy that replaces if and only if the state is greater than or equal to some fixed state /. Suppose that we start the policy iterat ion algorithm using a threshold policy. Show that all the subsequently generated policies will be threshold policies, so that the algorithm will terminate after at, most n iterations. (b) Prove t he result ol part (a) for the asset, selling example of Section 1.2. assuming that there is a finite number of values that the oiler can take. Here, a threshold policy is a stationary policy that sells the asset if the oiler is higher than a certain lixed nuiubcr.
1.15 (Distributed Asynchronous DP [Ber82a], [BeT89a])

The value iteration method is well suited for distributed (or parallel) computation, since the iteration

    J(i) := (TJ)(i)

can be executed in parallel for all states i. Consider the finite-state discounted problem of Section 1.3, and assume that the above iteration is executed asynchronously at a different processor i for each state i. By this we mean that the ith processor holds a vector J^i and updates the ith component of that vector at arbitrary times with an iteration of the form

    J^i(i) := (T J^i)(i),

and at arbitrary times transmits the results of the latest computation to the other processors j, who then update J^j(i) accordingly. Assume that all processors never stop computing and transmitting the results of their computation to the other processors. Show that the estimates J^i_t of the optimal cost function available at each processor i at time t converge to the optimal cost function J* as t -> infinity. (A single-machine sketch of this kind of asynchronous updating is given after Exercise 1.16 below.) Hint: Let J' and J'' be two functions such that J' <= TJ' and TJ'' <= J'', and suppose that the initial estimates J^i_0 of the processors satisfy J' <= J^i_0 <= J''. Show that the estimates J^i_t of the processors at time t satisfy

    J' <= J^i_t <= J'',   for all t >= 0,

and

    TJ' <= J^i_t <= TJ'',   for t sufficiently large.

1.16

Assume that we have two gold mines, Anaconda and Bonanza, and a gold-mining machine. Let x_A and x_B be the current amounts of gold in Anaconda and Bonanza, respectively. When the machine is used in Anaconda (or Bonanza), there is a probability p_A (or p_B, respectively) that r_A x_A (or r_B x_B, respectively) of the gold will be mined without damaging the machine, and a probability 1 - p_A (or 1 - p_B, respectively) that the machine will be damaged beyond repair and no gold will be mined. We assume that 0 < r_A < 1 and 0 < r_B < 1.

(a) Assume that p_A = p_B = p, where 0 < p < 1. Find the mine selection policy that maximizes the expected amount of gold mined before the machine breaks down. Hint: This problem can be viewed as a discounted multiarmed bandit problem with a discount factor p.

(b) Assume that p_A < 1 and p_B = 1. Argue that the optimal expected amount of gold mined has the form J*(x_A, x_B) = J_A(x_A) + x_B, where J_A(x_A) is the optimal expected amount of gold mined if mining is restricted just to Anaconda. Show that there is no policy that attains the optimal amount J*(x_A, x_B).
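The following sketch illustrates the asynchronous updating of Exercise 1.15 on a single machine: components of J are updated one at a time in an arbitrary (here random) order, rather than all at once. The two-state problem data are placeholders.

```python
import numpy as np
import random

# Asynchronous value iteration sketch: pick a component at random and apply
# the Bellman update to that component only. All problem data are placeholders.

alpha = 0.9
g = np.array([[2.0, 1.0], [0.5, 3.0]])          # g[u, i]
P = np.array([[[0.7, 0.3], [0.4, 0.6]],
              [[0.9, 0.1], [0.2, 0.8]]])        # P[u, i, j]
n = 2

def bellman_component(J, i):
    # (TJ)(i) = min_u [ g(i,u) + alpha * sum_j p_ij(u) J(j) ]
    return min(g[u, i] + alpha * P[u, i].dot(J) for u in range(g.shape[0]))

J = np.zeros(n)
for t in range(20000):
    i = random.randrange(n)                     # arbitrary component order
    J[i] = bellman_component(J, i)              # update only that component
print(J)
```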
1.17 (The Tax Problem [VWB85])

This problem is similar to the multiarmed bandit problem. The only difference is that, if we engage project i at period k, we pay a tax a^k C^j(x^j) for every other project j [for a total of a^k sum_{j != i} C^j(x^j)], instead of earning a reward a^k R^i(x^i). The objective is to find a project selection policy that minimizes the total tax paid. Show that the problem can be converted into a bandit problem with reward function for project i equal to

    R^i(x^i) = C^i(x^i) - a E{ C^i( f^i(x^i, w^i) ) }.

1.18 (The Restart Problem [KaV87])

The purpose of this problem is to show that the index of a project in the multiarmed bandit context can be calculated by solving an associated infinite horizon discounted cost problem. In what follows we consider a single project with reward function R(x), a fixed initial state x_0, and the calculation of the value of the index for that state. Consider the problem where at state x_k and time k there are two options: (1) continue, which brings reward a^k R(x_k) and moves the project to state x_{k+1} = f(x_k, w_k), or (2) restart the project, which moves the state to x_0, brings reward a^k R(x_0), and moves the project to state x_{k+1} = f(x_0, w_k). Show that the optimal reward functions of this problem and of the bandit problem with M = m(x_0) are identical, and therefore the optimal reward for both problems when starting at x_0 equals m(x_0). Hint: Show that Bellman's equation for both problems takes the form

    J(x) = max[ R(x) + a E{ J( f(x, w) ) },  R(x_0) + a E{ J( f(x_0, w) ) } ].
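A small numerical sketch of the restart formulation of Exercise 1.18 is given below: value iteration on the two-option Bellman equation yields the index of the initial state as the optimal value at that state. The finite-state project data (R, P) are made-up placeholders, and the normalization of the index is the one used in the exercise.

```python
import numpy as np

# Computing the index of a single finite-state project via the restart problem.
# R[x] is the one-step reward at state x and P[x, y] the transition probability.

def restart_index(R, P, x0, alpha, iters=2000):
    n = len(R)
    J = np.zeros(n)
    for _ in range(iters):
        continue_val = R + alpha * P.dot(J)          # continue from each state x
        restart_val = R[x0] + alpha * P[x0].dot(J)   # restart the project at x0
        J = np.maximum(continue_val, restart_val)
    return J[x0]                                     # value at x0 = index m(x0)

R = np.array([1.0, 0.5, 0.1])
P = np.array([[0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6],
              [0.0, 0.1, 0.9]])
print(restart_index(R, P, x0=0, alpha=0.9))
```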

Stochastic Shortest Path Problems Contents 2.1. Main Results........................................p. 78 2.2. Computational Methods...............................p. 87 2.2.1. Value Iteration...............................p. 88 2.2.2. Policy Iteration..............................p. 91 2.3. Simulation-Based Methods............................p. 94 2.3.1. Policy Evaluation by Monte-Carlo Simulation . . . p. 95 2.3.2. Q-Learning....................................p. 99 2.3.3. Approximations................................p. 101 2.3.4. Extensions to Discounted Problems.............p. 118 2.3.5. The Role of Parallel Computation..............p. 120 2.4. Notes, Sources, and Exercises........................p. 121
In this chapter we consider a stochastic version of the shortest path problem discussed in Chapter 2 of Vol. I. An introductory analysis of this problem was given in Section 7.2 of Vol. I. The analysis of this chapter is more sophisticated and uses weaker assumptions. In particular, we make assumptions that generalize those made for deterministic shortest path problems in Chapter 2 of Vol. I.

In this chapter we also discuss another major topic of this book. In particular, in Section 2.3 we develop simulation-based methods, possibly involving approximations, which are suitable for complex problems that involve a large number of states and/or lack an explicit mathematical model. These methods are most economically developed in the context of stochastic shortest path problems. They can then be extended to discounted problems, and this is done in Section 2.3.4. Further extensions to average cost per stage problems are discussed in Section 4.3.4.

2.1 MAIN RESULTS

Suppose that we have a graph with nodes 1, 2, ..., n, t, where t is a special state called the destination or the termination state. We can view the deterministic shortest path problem of Chapter 2 of Vol. I as follows: we want to choose for each node i != t a successor node mu(i) so that (i, mu(i)) is an arc, and the path formed by a sequence of successor nodes starting at any node j terminates at t and has the minimum sum of arc lengths over all paths that start at j and terminate at t.

The stochastic shortest path problem is a generalization whereby at each node i, we must select a probability distribution over all possible successor nodes j out of a given set of probability distributions p_ij(u) parameterized by a control u in U(i). For a given selection of distributions and for a given origin node, the path traversed as well as its length are now random, but we wish that the path leads to the destination t with probability one and has minimum expected length. Note that if every feasible probability distribution assigns a probability of 1 to a single successor node, we obtain the deterministic shortest path problem.

We formulate this problem as the special case of the total cost infinite horizon problem where:

(a) There is no discounting (a = 1).

(b) The state space is S = {1, 2, ..., n, t} with transition probabilities denoted by

    p_ij(u) = P( x_{k+1} = j | x_k = i, u_k = u ),   i, j in S,  u in U(i).

Furthermore, the destination t is absorbing, that is, for all u in U(t), p_tt(u) = 1.
(c) The control constraint set U(i) is a finite set for all i.

(d) A cost g(i, u) is incurred when control u in U(i) is selected. Furthermore, the destination is cost-free; that is, g(t, u) = 0 for all u in U(t).

Note that, as in Section 1.3, we assume that the cost per stage does not depend on w. This amounts to using expected cost per stage in all calculations. In particular, if the cost of using u at state i and moving to state j is g~(i, u, j), we use as cost per stage the expected cost

    g(i, u) = sum_j p_ij(u) g~(i, u, j).

We are interested in problems where either reaching the destination is inevitable or else there is an incentive to reach the destination in a finite expected number of stages, so that the essence of the problem is to reach the destination with minimum expected cost. We will be more specific about this shortly.

Note that since the destination is a cost-free and absorbing state, the cost starting from t is zero for every policy. Accordingly, for all cost functions, we ignore the component that corresponds to t and define the mappings T and T_mu on functions J with components J(1), ..., J(n). We will also view the functions J as n-dimensional vectors. Thus

    (TJ)(i) = min_{u in U(i)} [ g(i, u) + sum_{j=1}^{n} p_ij(u) J(j) ],   i = 1, ..., n,

    (T_mu J)(i) = g(i, mu(i)) + sum_{j=1}^{n} p_ij(mu(i)) J(j),   i = 1, ..., n.

As in Section 1.3, for any stationary policy mu, we use the compact notation

    P_mu = [ p_11(mu(1))  ...  p_1n(mu(1)) ]
           [     .                 .       ]
           [ p_n1(mu(n))  ...  p_nn(mu(n)) ]

and

    g_mu = ( g(1, mu(1)), ..., g(n, mu(n)) )'.

We can then write, in vector notation,

    T_mu J = g_mu + P_mu J.
In terms of this notation, the cost function of a policy pi = {mu_0, mu_1, ...} can be written as

    J_pi = limsup_{N -> infinity} T_{mu_0} T_{mu_1} ... T_{mu_{N-1}} J_0,

where J_0 denotes the zero vector. The cost function of a stationary policy mu can be written as

    J_mu = limsup_{N -> infinity} T_mu^N J_0 = limsup_{N -> infinity} sum_{k=0}^{N-1} P_mu^k g_mu.

The stochastic shortest path problem was discussed in Section 7.2 of Vol. I, under the assumption that all policies lead to the destination with probability 1, regardless of the initial state. In order to analyze the problem under weaker conditions, we introduce the notion of a proper policy.

Definition 1.1: A stationary policy mu is said to be proper if, when using this policy, there is positive probability that the destination will be reached after at most n stages, regardless of the initial state; that is, if

    rho_mu = max_{i=1,...,n} P{ x_n != t | x_0 = i, mu } < 1.    (1.1)

A stationary policy that is not proper is said to be improper.

With a little thought, it can be seen that mu is proper if and only if in the Markov chain corresponding to mu, each state i is connected to the destination with a path of positive probability transitions. Note from the definition (1.1) that

    P{ x_{2n} != t | x_0 = i, mu } = P{ x_{2n} != t | x_n != t, x_0 = i, mu } P{ x_n != t | x_0 = i, mu } <= rho_mu^2.

More generally, for a proper policy mu, the probability of not reaching the destination after k stages diminishes as rho_mu^{floor(k/n)}, regardless of the initial state; that is,

    P{ x_k != t | x_0 = i, mu } <= rho_mu^{floor(k/n)},   i = 1, ..., n.    (1.2)

Thus the destination will eventually be reached with probability one under a proper policy. Furthermore, the limit defining the associated total cost
vector J_mu will exist and be finite, since the expected cost incurred in the kth period is bounded in absolute value by

    rho_mu^{floor(k/n)} max_{i=1,...,n} | g(i, mu(i)) |.    (1.3)

Note that, under a proper policy, the cost structure is similar to the one for discounted problems, the main difference being that the discount factor is not constant but depends on the current state and stage; over every n stages it amounts to a factor of at most rho_mu.

Throughout this section, we assume the following:

Assumption 1.1: There exists at least one proper policy.

Assumption 1.2: For every improper policy mu, the corresponding cost J_mu(i) is infinite for at least one state i; that is, some component of the sum sum_{k=0}^{N-1} P_mu^k g_mu diverges to infinity as N -> infinity.

In the case of a deterministic shortest path problem, Assumption 1.1 is satisfied if and only if every node is connected to the destination with a path, while Assumption 1.2 is satisfied if and only if each cycle that does not contain the destination has positive length. A simple condition that implies Assumption 1.2 is that the cost g(i, u) is strictly positive for all i != t and u in U(i).

Another important case where Assumptions 1.1 and 1.2 are satisfied is when all policies are proper, that is, when termination is inevitable under all stationary policies (this was assumed in Section 7.2 of Vol. I). Actually, for this case, it is possible to show that the mappings T and T_mu are contraction mappings with respect to some norm [not necessarily the maximum norm of Eq. (1.1) in Chapter 1]; see Section 1.3 of [BeT89a], or [Tse90]. As a result of this contraction property, the results shown for discounted problems can also be shown for stochastic shortest path problems where termination is inevitable under all stationary policies. It turns out, however, that similar results can be shown even when some improper policies exist; the results that we prove under Assumptions 1.1 and 1.2 are almost as strong as those for discounted problems with bounded cost per stage. In particular, we show that:

(a) The optimal cost vector J* is the unique solution of Bellman's equation J* = TJ*.

(b) The value iteration method converges to the optimal cost vector J* for an arbitrary starting vector.
(c) A stationary policy mu is optimal if and only if T_mu J* = TJ*.

(d) The policy iteration algorithm yields an optimal proper policy starting from an arbitrary proper policy.

The following proposition provides some basic preliminary results:

Proposition 1.1:

(a) For a proper policy mu, the associated cost vector J_mu satisfies

    lim_{k -> infinity} (T_mu^k J)(i) = J_mu(i),   i = 1, ..., n,    (1.4)

for every vector J. Furthermore,

    J_mu = T_mu J_mu,

and J_mu is the unique solution of this equation.

(b) A stationary policy mu satisfying, for some vector J,

    J(i) >= (T_mu J)(i),   i = 1, ..., n,

is proper.

Proof: (a) Using an induction argument, we have for all J and k >= 1,

    T_mu^k J = P_mu^k J + sum_{m=0}^{k-1} P_mu^m g_mu.    (1.5)

Equation (1.2) implies that for all J we have

    lim_{k -> infinity} P_mu^k J = 0,

so that

    lim_{k -> infinity} T_mu^k J = lim_{k -> infinity} sum_{m=0}^{k-1} P_mu^m g_mu = J_mu,

where the limit above can be shown to exist using Eq. (1.2). Also we have by definition

    T_mu^{k+1} J = g_mu + P_mu T_mu^k J,

and by taking the limit as k -> infinity, we obtain

    J_mu = g_mu + P_mu J_mu,
which is equivalent to J_mu = T_mu J_mu. Finally, to show uniqueness, note that if J = T_mu J, then we have J = T_mu^k J for all k, so that

    J = lim_{k -> infinity} T_mu^k J = J_mu.

(b) The hypothesis J >= T_mu J, the monotonicity of T_mu, and Eq. (1.5) imply that

    J >= T_mu^k J = P_mu^k J + sum_{m=0}^{k-1} P_mu^m g_mu,   k = 1, 2, ....

If mu were not proper, by Assumption 1.2, some component of the sum in the right-hand side of the above relation would diverge to infinity as k -> infinity, which is a contradiction. Q.E.D.

The following proposition is the main result of this section, and provides analogs to the main results for discounted cost problems (Props. 2.1-2.3 in Section 1.2).

Proposition 1.2:

(a) The optimal cost vector J* satisfies Bellman's equation J* = TJ*. Furthermore, J* is the unique solution of this equation.

(b) We have

    lim_{k -> infinity} (T^k J)(i) = J*(i),   i = 1, ..., n,

for every vector J.

(c) A stationary policy mu is optimal if and only if

    T_mu J* = TJ*.

Proof: (a), (b) We first show that T has at most one fixed point. Indeed, if J and J' are two fixed points, then we select mu and mu' such that J = TJ = T_mu J and J' = TJ' = T_mu' J'; this is possible because the control constraint set is finite. By Prop. 1.1(b), we have that mu and mu' are proper, and Prop. 1.1(a) implies that J = J_mu and J' = J_mu'. We have J = T^k J <= T_mu'^k J for all k >= 1, and by Prop. 1.1(a), we obtain

    J <= lim_{k -> infinity} T_mu'^k J = J_mu' = J'.

Similarly, J' <= J, showing that J = J' and that T has at most one fixed point.
We next show that T has at least one fixed point. Let mu be a proper policy (there exists one by Assumption 1.1). Choose mu' such that

    T_mu' J_mu = T J_mu.

Then we have J_mu = T_mu J_mu >= T J_mu = T_mu' J_mu. By Prop. 1.1(b), mu' is proper, and using the monotonicity of T_mu' and Prop. 1.1(a), we obtain

    J_mu >= T J_mu >= lim_{k -> infinity} T_mu'^k J_mu = J_mu'.    (1.6)

Continuing in the same manner, we construct a sequence {mu^k} such that each mu^k is proper and

    J_{mu^k} >= T J_{mu^k} >= J_{mu^{k+1}},   k = 0, 1, ....    (1.7)

Since the set of proper policies is finite, some policy mu must be repeated within the sequence {mu^k}, and by Eq. (1.7), we have J_mu = T J_mu. Thus J_mu is a fixed point of T, and in view of the uniqueness property shown earlier, J_mu is the unique fixed point of T.

Next we show that the unique fixed point of T is equal to the optimal cost vector J*, and that T^k J converges to it for all J. The construction of the preceding paragraph provides a proper mu such that T J_mu = J_mu. We will show that T^k J -> J_mu for all J and that J_mu = J*. Let e = (1, 1, ..., 1)', let delta > 0 be some scalar, and let J~ be the vector satisfying

    T_mu J~ = J~ - delta e.

There is a unique such vector, because the equation J~ = T_mu J~ + delta e can be written as

    J~ = g_mu + delta e + P_mu J~,

so J~ is the cost vector corresponding to mu with g_mu replaced by g_mu + delta e. Since mu is proper, by Prop. 1.1(a), J~ is unique. Furthermore, we have J_mu <= J~, which implies that

    J_mu = T J_mu <= T J~ <= T_mu J~ = J~ - delta e <= J~.

Using the monotonicity of T and the preceding relation, we obtain

    J_mu = T^k J_mu <= T^k J~ <= T^{k-1} J~ <= J~,   k >= 1.

Hence, T^k J~ converges to some vector J^, and we have

    T J^ = T( lim_{k -> infinity} T^k J~ ).
See. 2.1 M.iiu Hesults S5 The mapping 7’ cun be seen to be continuous, so we can interchange 7' with the limit in the preceding relation, thereby obtaining 7 7'7. By the uniqueness ol' (lie fixed point of 7’ shown earlier, we must have 7 It is also seen that - he = 7’7,, - he < 7'(7,, - he) < 7’7,, - 7,,. Thus, 7’A'(7,, — h<) is monotonically increasing and bounded above. As earlier, it follows that lini/,^,^ 7’*'(7,, — he) = 7,,. For any 7. we can find h > 0 such t hat 7,, - he < J < J. By the monotonicity of 7’. we then have 21Л(-Л, -- he) < ТЧ < TkJ, /.>!, and since liin/,. >x, 7’A(./,, ~ he) = liui/.. 7’fcJ = 7,,, it. follows t hat, lint 7’A'7 A — ' To show that 7,, = 7*. take any policy ?r = {/i(),/ri, . . We have Л'о ''' 7,,,,.. i 'h> — 7 A 7o, where 7ц is the zero vector. Taking the liinsup of both sides as /, oc in the preceding inequality, we obtain •4 > 7,„ so /t is an optimal stationary policy and 7,, = 7*. (c) If /I is optimal, then J/t = 7* and, by Assumptions 1.1 and 1.2, /< is proper, so by Prop. 1.1(a), T^J* = 7),7,, = 7,, = J* = TJ*. Conversely, if J* = 7’7* — 7’,,7*, it follows from Prop. 1.1(b) that // is proper, and by using Prop. 1.1(a), we obtain 7* = 7,,. Therefore. //. is optimal. Q.E.D. The results of Prop. 1.2 can also be proved (with minor changes) assuming, in place of Assumption 1.2, that и.) > 0 for all i and u. t U(i). and that there exists an optimal proper policy; six- Exercise 2.12. Compact Control Constraint Sets It turns out. that. th(> licitcness assumption on tin’ control constraint U(i) can be weakened. It. is sufficient that, for each /, U(i) be a compact subset of a Euclidean space, and that, and i)(i, u.) be continuous in и over U(i). for all / and j. Under these compactness and continuity assumptions, and also Assumptions l.l and 1.2, Prop. 1.2 holds as stated. 'Fhe proof is similar to the one given above, but is technically much more complex. It. can be found in [BeTf) 11>].
Underlying Contractions

We mentioned in Section 1.4 that the strong results we derived for discounted problems in Chapter 1 owe their validity to the contraction property of the mapping T. Despite the similarity of Prop. 1.2 with the corresponding discounted cost results of Section 1.2, under Assumptions 1.1 and 1.2, the mapping T of this section need not be a contraction mapping with respect to any norm; see Exercise 2.13 for a counterexample. On the other hand, there is an important special case where T is a contraction mapping with respect to a weighted sup norm. In particular, it can be shown that if all stationary policies are proper, then there exist positive constants v_1, ..., v_n and some scalar beta with 0 < beta < 1, such that we have, for all vectors J_1 and J_2,

    max_{i=1,...,n} | (T J_1)(i) - (T J_2)(i) | / v_i <= beta max_{i=1,...,n} | J_1(i) - J_2(i) | / v_i.

A proof of this fact is outlined in Exercise 2.14.

Pathologies of Stochastic Shortest Path Problems

We now give two examples that illustrate the sensitivity of our results to seemingly minor changes in our assumptions.

Example 1.1 (The Blackmailer's Dilemma [Whi82])

This example shows that the assumption of a finite or compact control constraint set cannot be easily relaxed. Here, there are two states, state 1 and the destination state t. At state 1, we can choose a control u with 0 < u <= 1; we then move to state t at no cost with probability u^2, and stay in state 1 at a cost -u with probability 1 - u^2. Note that every stationary policy is proper in the sense that it leads to the destination with probability one. We may regard u as a demand made by a blackmailer, and state 1 as the situation where the victim complies. State t is the situation where the victim refuses to yield to the blackmailer's demand. The problem then can be seen as one whereby the blackmailer tries to maximize his total gain by balancing his desire for increased demands with keeping his victim compliant.

If controls were chosen from a finite subset of the interval (0, 1], the problem would come under the framework of this section. The optimal cost would then be finite, and there would exist an optimal stationary policy. It turns out, however, that without the finiteness restriction the optimal cost starting at state 1 is minus infinity and there exists no optimal stationary policy. Indeed, for any stationary policy mu with mu(1) = u, we have

    J_mu(1) = -u + (1 - u^2) J_mu(1),

from which

    J_mu(1) = -1/u.
Therefore, min_u J_mu(1) = -infinity and J*(1) = -infinity, but there is no stationary policy that achieves the optimal cost. Note also that this situation would not change if the constraint set were u in [0, 1] (i.e., if u = 0 were an allowable control), although in this case the stationary policy that applies mu(1) = 0 is improper and its corresponding cost vector is zero, thus violating Assumption 1.2.

An interesting fact about this problem is that there is an optimal nonstationary policy pi for which J_pi(1) = -infinity. This is the policy pi = {mu_0, mu_1, ...} that applies mu_k(1) = gamma/(k + 1) at time k and state 1, where gamma is a scalar in the interval (0, 1/2). We leave the verification of this fact to the reader. What happens with the policy pi is that the blackmailer requests diminishing amounts over time, which nonetheless add up to infinity. However, the probability of the victim's refusal diminishes at a much faster rate over time, and as a result, the probability of the victim remaining compliant forever is strictly positive, leading to an infinite total expected payoff to the blackmailer.

Example 1.2 (Pure Stopping Problems)

This example illustrates why we need to assume that all improper policies have infinite cost for at least some initial state (Assumption 1.2). Consider an optimal stopping problem where a state-dependent cost is incurred only when invoking a stopping action that drives the system to the destination; all costs are zero prior to stopping. Eventual stopping is a requirement here, so to properly formulate such a stopping problem as a total cost infinite horizon problem, it is essential to make the stopping costs negative (by adding a negative constant to all stopping costs if necessary), providing an incentive to stop. We then come under the framework of this section, but with Assumption 1.2 violated, because the improper policy that never stops does not yield infinite cost for any starting state. Unfortunately, this seemingly small relaxation of our assumptions invalidates our results, as shown by the example of Fig. 2.1.1. This example is in effect a deterministic shortest path problem involving a cycle with zero length. In particular, in the example there is a (nonoptimal) improper policy that yields finite cost for all initial states (rather than infinite cost for some initial state), and T has multiple fixed points.

2.2 COMPUTATIONAL METHODS

All the methods developed in connection with the discounted cost problem in Section 1.3 have stochastic shortest path analogs. For example, value iteration works as shown by Prop. 1.2(b). Furthermore, the (exact and approximate) linear programming approach also has a straightforward extension (cf. Section 1.3.4), since J* is the largest solution of the system of inequalities J <= TJ. In this section, we will discuss in more detail some
of the major methods, and we will also focus on some stochastic shortest path problems with special structure. It turns out that by exploiting this special structure, we can improve the convergence properties of some of the methods. For example, in deterministic shortest path problems, value iteration terminates finitely (Section 2.1 of Vol. I), whereas this does not happen for any significant class of discounted cost problems.

[Figure 2.1.1 appears here: two transition diagrams showing the states 1, 2, and the destination t, with the transition costs under the proper policy mu and under the improper policy mu'.]

Figure 2.1.1 Example where Prop. 1.2 fails to hold when Assumption 1.2 is violated. There are two stationary policies, mu and mu', with transition probabilities and costs as shown. The equation J = TJ is given by

    J(1) = min[ -1, J(2) ],    J(2) = J(1),

and is satisfied by any J of the form J(1) = J(2) = lambda with lambda <= -1. Here the proper policy mu is optimal and the corresponding optimal cost vector is J(1) = -1, J(2) = -1. The difficulty is that the improper policy mu' has finite (zero) cost for all initial states.

2.2.1 Value Iteration

As shown by Prop. 1.2(b), value iteration works for stochastic shortest path problems. Furthermore, several of the enhancements and variations of value iteration for discounted problems have stochastic shortest path analogs. In particular, there are error bounds similar to the ones of Prop. 3.1 in Section 1.3 (although not quite as powerful; see Section 7.2 of Vol. I). It can also be shown that the Gauss-Seidel version of the method works and that its rate of convergence is typically faster than that of the ordinary method (Exercise 2.6). Furthermore, the rank-one correction method described in Section 1.3.1 is straightforward and effective, as long as there is some separation between the dominant and the subdominant eigenvalue moduli.
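The Gauss-Seidel variant mentioned above differs from the ordinary method only in that, within a sweep over the states, each component update immediately uses the most recently computed values of the other components. A minimal sketch on hypothetical data follows (the arrays are placeholders, with the unshown probability mass at each state going to the destination).

```python
import numpy as np

# Gauss-Seidel value iteration sketch for a stochastic shortest path problem:
# states are swept one at a time and updated in place. Data are placeholders.

n = 2
p = np.array([[[0.4, 0.3], [0.3, 0.3]],
              [[0.2, 0.1], [0.5, 0.2]]])   # p[u][i][j], remaining mass goes to t
g = np.array([[2.0, 1.0], [1.0, 3.0]])     # g[u][i]

J = np.zeros(n)
for sweep in range(100):
    for i in range(n):
        # Update component i in place, using the latest values of J(j) for all j.
        J[i] = min(g[u, i] + p[u, i].dot(J) for u in range(g.shape[0]))
print(J)
```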
Finite Termination of Value Iteration

Generally, the value iteration method requires an infinite number of iterations in stochastic shortest path problems. However, under special circumstances, the method can terminate finitely. A prominent example is the case of a deterministic shortest path problem, but there are other more general circumstances where termination occurs. In particular, let us assume that the transition probability graph corresponding to some optimal stationary policy mu* is acyclic. By this we mean that there are no cycles in the graph that has as nodes the states 1, ..., n, t, and has an arc (i, j) for each pair of states i and j such that p_ij(mu*(i)) > 0. We assume in particular that there are no positive self-transition probabilities p_ii(mu*(i)) for i != t, but it turns out that under Assumptions 1.1 and 1.2, a stochastic shortest path problem with such self-transitions can be converted into another stochastic shortest path problem where p_ii(u) = 0 for all i != t and u in U(i). In particular, it can be shown (Exercise 2.8) that the modified stochastic shortest path problem that has costs

    g(i, u) / ( 1 - p_ii(u) ),   i = 1, ..., n,

in place of g(i, u), and transition probabilities

    p_ij(u) / ( 1 - p_ii(u) ),   j != i,        0,   j = i,

instead of p_ij(u), is equivalent to the original in the sense that it has the same optimal costs and policies.

We claim that, under the preceding acyclicity assumption, the value iteration method will yield J* after at most n iterations when started from the vector J' given by

    J'(i) = infinity,   i = 1, ..., n.    (2.1)

To show this, consider the sets of states S_0, S_1, ..., defined by

    S_0 = {t},    (2.2)

    S_{k+1} = { i | p_ij(mu*(i)) = 0 for all j not in union_{m=0}^{k} S_m },   k = 0, 1, ...,    (2.3)

and let S_K be the last of these sets that is nonempty. Then, in view of our acyclicity assumption, we have

    union_{k=0}^{K} S_k = {1, ..., n, t}.    (2.4)

Let us show by induction that, starting from the vector J' of Eq. (2.1), the value iteration method will yield, for k = 0, 1, ..., K,

    (T^k J')(i) = J*(i),   for all i in union_{m=0}^{k} S_m, i != t.
Indeed, this is so for k = 0. Assume that (T^k J')(i) = J*(i) if i is in union_{m=0}^{k} S_m, i != t. Then, by the monotonicity of T, we have for all i,

    J*(i) <= (T^{k+1} J')(i),

while we have, by the induction hypothesis, the definition of the sets S_m, and the optimality of mu*,

    (T^{k+1} J')(i) <= g(i, mu*(i)) + sum_{j=1}^{n} p_ij(mu*(i)) (T^k J')(j) = J*(i),   for all i in union_{m=0}^{k+1} S_m, i != t.

The last two relations complete the induction. Thus, we have shown that, under the acyclicity assumption, at the kth iteration the value iteration method will set to the optimal values the costs of the states in the set union_{m=0}^{k} S_m. In particular, all optimal costs will be found after K iterations.

Consistently Improving Policies

The properties of value iteration can be further improved if there is an optimal policy mu* under which, from a given state, we can only go to a state of lower cost; that is, for all i, we have

    p_ij(mu*(i)) > 0   implies   J*(i) > J*(j).

We call such a policy consistently improving.

A case where a consistently improving policy exists arises in deterministic shortest path problems when all the arc lengths are positive. Another important case arises in continuous-space shortest path problems; see [Tsi93a] and Exercise 2.10.

The transition probability graph corresponding to a consistently improving policy is seen to be acyclic, so when such a policy exists, by the preceding discussion, the value iteration method terminates finitely. However, a stronger property can be proved. As discussed in Chapter 2 of Vol. I, for shortest path problems with positive arc lengths, one can use Dijkstra's algorithm. This is the label correcting method which removes from the OPEN list a node with minimum label at each iteration and requires just one iteration per node. A similar property holds for stochastic shortest path problems if there is a consistently improving policy: if one removes from the OPEN list a state j with minimum cost estimate J(j), the Gauss-Seidel version of the value iteration method requires just one iteration per state; see Exercise 2.11.

For problems where a consistently improving policy exists, it is also appropriate to use straightforward adaptations of the label correcting shortest path methods discussed in Section 2.3.1 of Vol. I. In particular, one may approximate the policy of removing from the OPEN list a minimum cost state by using the SLF and LLL strategies (see [PBT95]).
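The layered sets (2.2)-(2.3) behind the finite-termination argument above are easy to compute from the transition graph of a given stationary policy. The sketch below is illustrative only; succ[i] is a hypothetical successor list, that is, the states j with p_ij(mu*(i)) > 0.

```python
# Compute the sets S_0, S_1, ... of Eqs. (2.2)-(2.3) for the transition graph of
# a given stationary policy, and check the acyclicity condition (2.4).
# The graph below is a made-up placeholder.

succ = {1: [2, 't'], 2: [3, 't'], 3: ['t']}
states = set(succ) | {'t'}

layers = [{'t'}]                       # S_0 = {t}
covered = {'t'}
while True:
    nxt = {i for i in succ if set(succ[i]) <= covered and i not in covered}
    if not nxt:
        break
    layers.append(nxt)                 # next layer: successors all covered already
    covered |= nxt

acyclic = covered == states            # Eq. (2.4): every state is eventually covered
print(layers, acyclic)
# When acyclic is True, value iteration started from J(i) = +infinity fixes the
# costs of the states layer by layer, and terminates after at most len(layers)-1
# iterations.
```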
2.2.2 Policy Iteration

The policy iteration algorithm is based on the construction used in the proof of Prop. 1.2 to show that T has a fixed point. In the typical iteration, given a proper policy mu and the corresponding cost vector J_mu, one obtains a new proper policy mu' satisfying T_mu' J_mu = T J_mu. It was shown in Eq. (1.6) that J_mu' <= J_mu. It can be seen also that the strict inequality J_mu'(i) < J_mu(i) holds for at least one state i if mu is nonoptimal; otherwise we would have J_mu = T J_mu and, by Prop. 1.2(c), mu would be optimal. Therefore, the new policy is strictly better if the current policy is nonoptimal. Since the number of proper policies is finite, the policy iteration algorithm terminates after a finite number of iterations with an optimal proper policy.

It is possible to execute approximately the policy evaluation step of policy iteration, using a finite number of value iterations, as in the discounted case. Here we start with some vector J_0. For all k, a stationary policy mu^k is defined from J_k according to T_{mu^k} J_k = T J_k; the cost J_{mu^k} is approximately evaluated by m_k - 1 additional value iterations, yielding the vector J_{k+1}, which is used in turn to define mu^{k+1}. The proof of Prop. 3.5 in Section 1.3 can be essentially repeated to show that J_k converges to J*, assuming that the initial vector J_0 satisfies T J_0 <= J_0. Unfortunately, the requirement T J_0 <= J_0 is essential for the convergence proof, unless all stationary policies are proper, in which case T is a contraction mapping (cf. Exercise 2.14).

As in Section 1.3.3, it is possible to use adaptive aggregation in conjunction with approximate policy evaluation. However, it is important that the destination t forms by itself an aggregate state, which will play the role of the destination in the aggregate Markov chain.

Approximate Policy Iteration

Let us consider an approximate policy iteration algorithm that generates a sequence of stationary policies {mu^k} and a corresponding sequence of approximate cost vectors {J_k} satisfying

    max_{i=1,...,n} | J_k(i) - J_{mu^k}(i) | <= delta,   k = 0, 1, ...,    (2.5)

and

    max_{i=1,...,n} | (T_{mu^{k+1}} J_k)(i) - (T J_k)(i) | <= epsilon,   k = 0, 1, ...,    (2.6)

where delta and epsilon are some positive scalars, and mu^0 is some proper policy. One difficulty with such an algorithm is that, even if the current policy mu^k is proper, the next policy mu^{k+1} may not be proper. In this case, we have J_{mu^{k+1}}(i) = infinity for some i, and the method breaks down. Note, however, that for a sufficiently small epsilon, Eq. (2.6) implies that T_{mu^{k+1}} J_k = T J_k, so by Prop. 1.1(b), mu^{k+1} will be proper. In any case, we will analyze the method
92 Stochastic Shortest Path Problems (’hap. 2 under the assumption that all generated policies are proper. The following proposition parallels Prop. 3.6 in Section 1.3. It. provides an estimate of the difference .7 p - J* in terms of t he scalar P = max P{.rn t. | .го = I — I... ,Il p; proper Note that for every proper policy p. and state i, we have P{.i'a /= I | .ip = i, //} < 1 by the definition of a proper policy, and since the number of proper policies is finite, we have p < 1. Proposition 2.1: Assume that the stationary policies pk generated by the approximate policy iteration algorithm arc all proper. Then lim sup max (.7 a (i) - J*(i)) < (2.7) A-^oc ' (1 - pY Proof: 'rhe proof is similar t.o the one of Prop. 3.6 in Sect ion 1.3. We modify the arguments in order to use the relations + re) < T),./ T re ami /'//c < pc, which hold for all proper policies p and positive scalars r. We use l£<|s. (2.5) and (2.6) to obtain for all k । 1 -'p < 7’./,,* T (c -I- 2h)e < 7'p ./;it 4- (<- + 2<5)c. (2.8) Prom Eq. (2.8) and t he eqilal ion 7’p.7;p - ./a. we have 7'p. ( i /,,* < + (r + 2h)c. By subtracting from this relation the equation 7’;p +।-J^kt i = ./^a -i ।, we obtain 7,p f i -7;p - T^k । । .7/za \ i < J^k ~ 7f,k n + (с + 2Л)е. This relation ran lie written as < 7’p+1 (J^k i i ~ J;p.) + (c + 2<*>)c, (2.9) where /'p i i is t.he transition probability matrix corresponding t.o /Л 1 1. Let. sA = niax (.7 /. + I(0 - .7,p(')). I.11 ' Then E<j. (2.9) yields sA-c < 4A'/’,a i-ic + (< + 2Л)с.
Er g- See. 2.2 (’ouiputatioual Мг11ю<1ь 93 у By multiplying this relation with F а+i and by adding (< + 2b)e. we obtain Й 8‘ f < £a ^/(a i к' I (< I ча-/’7* 11 <• I -(< I b (f. By repeating this process for a total of n. — 1 times, we have f SA ' i. sU’"a I l<’ I "(' I L /<A' I /'(' I 2b>)< . Thus. «,.<4^. (-.in) 1 - (> Let. p' be an optimal stationary policy. From Eq. (2.8). we have Tt,k+iJflk < + (c + 2b)c = 4-V ~ W + /* + (f + “iby = 4* (4A “ + (' + ‘^)c- We also have TJjAi+l Jl. = J 14 I + T A A 1 Jf, A — 4,A-4 1 ./ АЧ 1 = J A-t 1 + /’*АЧ I (Jflk ~ ./,,A I I ) By subtract ing the last, two relations, and by using t.he definit ion of 4 and Eq. (2.10). we obtain 4m ~ /* < /)<• (// _ G*) T + l — 44 + 4 + — e,i‘ (7,A' — -/‘J 1- 4T'pt i' I- (< + ft'’)'’ <4’(4*- -/*) +^А-еТ(г4-2<<))с С2-Ш Let Ck -- max (-/ a (/) - I- I.....It ‘ Then Eq. (2.11) yields, for till e 11 c о,- 4‘r ~ь (1 -p + /?)(f+ 2Л) --------------------( 1 By innltii>lyiiig this relation with and by adding (1—p+n)(<+2<5>)a /( 1— p), we obtain .. (1 - P E 7i)(c + 2Л) , 2(1 - p 4- n)(( + 2b) Ck+P- < 8A-i -------------------1-------------'' L M ' ?<’-l ------------:------------c — /) 1 — n
94 Stochnstie Short-vst Path Problems Chap. 2 By repeating this process for a total of n — 1 times, we have , pn «0 - /> + n)(f + 2/>) ?t(l -р + п)(^ + 2й) U !-,/• < C-l’’<< -I-------i------------< < pCkf 4----------:------------e. I /' I - /' By taking the limit superior as k -c oo, we obtain ,, . "U + 2Л) (1 -p)hm.siip<*; < -----------------------, к 1 - P which was to be proved. Q.E.D. The error bound (2.7) uses I he worst-case estimate of the number of stages required to reach I. with positive probability, which is n. We can strengthen tin; error bound if we have a better estimate. In particular, for all in > 1, let. P„, = max P{:i:„, yt t | j:0 = /,//.}, i-— 1.......II fl: pl opei and let Til be the minimal in for which p,„ < 1. Them the proof of Prop. 2.1 can be adapted t.o show that Inn sup max (J a (z) - ./*(/.)) < -------------------------. (1-РшГ 2.3 SIMULATION-BASED METHODS The computational methods described so far apply when there is a mat hematical model of the cost structure and the transition probabilities of tin; system. In many problems, however, such a model is not available, but instead, th<; system and cost structure can be simulated. By this we mean that the state space and the control space are known, and there is a computer program that, simulates, for a given control n, the probabilistic transitions from any given state z t.o a. successor state j according to the transition probabilities p,,(u), and also generat.es a corresponding transi- tion cost z/(z, n.j). It is then of course possible to use repeated simulation to calculate (at least approximately) the transition probabilities of the system and I he expected st age cost s by averaging, and then t.o apply the met hods discussed earlier. The methodology discussed in this section, however, is geared towards an alternative’ possibility, which is much more attractive when one is faced with a large and complex system, and one contemplates approximations: rather than estimate explicitly the transition probabilities and costs, we
estimate the cost function of a given policy by generating a number of simulated system trajectories and associated costs, and by using some form of "least-squares fit." Within this context, there are a number of possible approximation techniques, which for the most part are patterned after the value and the policy iteration methods. We focus first on exact methods where estimates of various cost functions are maintained in a "look-up table" that contains one entry per state. We later develop approximate methods where cost functions are maintained in a "compact" form; that is, they are represented by a function chosen from a parametric class, perhaps involving a feature extraction mapping or a neural network. We first consider these methods for the stochastic shortest path problem, and we later adapt them for the discounted cost problem in Section 2.3.4.

To make the notation better suited for the simulation context, we make a slight change in the problem definition. In particular, instead of considering the expected cost $g(i, u)$ at state $i$ under control $u$, we allow the cost to depend on the next state $j$. Thus our notation for the cost per stage is now $g(i, u, j)$. All the results and the entire analysis of the preceding sections can be rewritten in terms of the new notation by replacing $g(i, u)$ with $\sum_{j} p_{ij}(u)\, g(i, u, j)$.

2.3.1 Policy Evaluation by Monte-Carlo Simulation

Consider the stochastic shortest path problem of Section 2.1. Suppose that we are given a fixed stationary policy $\mu$ and we want to calculate by simulation the corresponding cost vector $J_\mu$. One possibility is of course to generate, starting from each $i$, many sample state trajectories and average the corresponding costs to obtain an approximation to $J_\mu(i)$. We can do this separately for each possible initial state, but a more efficient method is to use each trajectory to obtain a cost sample for many states by considering the costs of the trajectory portions that start at these states. If a state is encountered multiple times within the same trajectory, the corresponding cost samples can be treated as multiple independent samples for that state.†

To simplify notation, in what follows we do not show the dependence of various quantities on the given policy. In particular, the transition probability from $i$ to $j$ and the corresponding stage cost are denoted by $p_{ij}$ and $g(i, j)$, in place of $p_{ij}\bigl(\mu(i)\bigr)$ and $g\bigl(i, \mu(i), j\bigr)$, respectively.

To formalize the process, suppose that we perform an infinite number of simulation runs, each ending at the termination state $t$. Assume also that within the total number of runs, each state is encountered an infinite number of times. Consider the $m$th time a given state $i$ is encountered, and let $(i, i_1, i_2, \ldots, i_N, t)$ be the remainder of the corresponding trajectory. Let $c(i, m)$ be the corresponding cost of reaching state $t$,
$$c(i, m) = g(i, i_1) + g(i_1, i_2) + \cdots + g(i_N, t).$$
We assume that the simulations correctly average the desired quantities; that is, for all states $i$, we have
$$J_\mu(i) = \lim_{M\to\infty} \frac{1}{M} \sum_{m=1}^{M} c(i, m).$$
We can iteratively calculate the averages appearing in the above equation by using the update formulas
$$J_\mu(i) := J_\mu(i) + \gamma_m \bigl(c(i, m) - J_\mu(i)\bigr), \qquad m = 1, 2, \ldots,$$
where
$$\gamma_m = \frac{1}{m}, \qquad m = 1, 2, \ldots,$$
and the initial conditions are, for all $i$, $J_\mu(i) = 0$.

The normal way to implement the preceding algorithm is to update the costs $J_\mu(i)$ at the end of each simulation run that generates the state trajectory $(i_1, i_2, \ldots, i_N, t)$, by using for each $k = 1, \ldots, N$, the formula
$$J_\mu(i_k) := J_\mu(i_k) + \gamma_{m_k}\bigl(g(i_k, i_{k+1}) + g(i_{k+1}, i_{k+2}) + \cdots + g(i_N, t) - J_\mu(i_k)\bigr), \tag{3.2}$$
where $m_k$ is the number of visits thus far to state $i_k$ and $\gamma_{m_k} = 1/m_k$. There are also forms of the law of large numbers that allow the use of a different stepsize $\gamma_{m_k}$ in the above equation. It can be shown that, for convergence of iteration (3.2) to the correct cost value $J_\mu(i_k)$, it is sufficient that $\gamma_{m_k}$ be diminishing at the rate of one over the number of visits to state $i_k$.

Monte-Carlo Simulation Using Temporal Differences

An alternative (and essentially equivalent) method to implement the Monte-Carlo simulation update (3.2) is to update $J_\mu(i_1)$ immediately after $g(i_1, i_2)$ and $i_2$ are generated, then update $J_\mu(i_1)$ and $J_\mu(i_2)$ immediately after $g(i_2, i_3)$ and $i_3$ are generated, and so on. The method uses the quantities
$$d_k = g(i_k, i_{k+1}) + J_\mu(i_{k+1}) - J_\mu(i_k), \qquad k = 1, \ldots, N, \tag{3.3}$$
with $i_{N+1} = t$, which are called temporal differences. They represent the difference between the current estimate $J_\mu(i_k)$ of the expected cost to go to the termination state and the predicted cost to go to the termination state, $g(i_k, i_{k+1}) + J_\mu(i_{k+1})$, based on the simulated outcome of the current stage. Given a sample state trajectory $(i_1, i_2, \ldots, i_N, t)$, the cost update formula (3.2) can be rewritten in terms of the temporal differences $d_k$ as follows [to see this, just add the formulas below and use the fact $J_\mu(i_{N+1}) = J_\mu(t) = 0$]:

Following the state transition $(i_1, i_2)$, set
$$J_\mu(i_1) := J_\mu(i_1) + \gamma_{m_1} d_1 = J_\mu(i_1) + \gamma_{m_1}\bigl(g(i_1, i_2) + J_\mu(i_2) - J_\mu(i_1)\bigr).$$

Following the state transition $(i_2, i_3)$, set
$$J_\mu(i_1) := J_\mu(i_1) + \gamma_{m_1} d_2 = J_\mu(i_1) + \gamma_{m_1}\bigl(g(i_2, i_3) + J_\mu(i_3) - J_\mu(i_2)\bigr),$$
$$J_\mu(i_2) := J_\mu(i_2) + \gamma_{m_2} d_2 = J_\mu(i_2) + \gamma_{m_2}\bigl(g(i_2, i_3) + J_\mu(i_3) - J_\mu(i_2)\bigr).$$

Following the state transition $(i_N, t)$, set
$$J_\mu(i_1) := J_\mu(i_1) + \gamma_{m_1} d_N = J_\mu(i_1) + \gamma_{m_1}\bigl(g(i_N, t) - J_\mu(i_N)\bigr),$$
$$J_\mu(i_2) := J_\mu(i_2) + \gamma_{m_2} d_N = J_\mu(i_2) + \gamma_{m_2}\bigl(g(i_N, t) - J_\mu(i_N)\bigr),$$
$$\vdots$$
$$J_\mu(i_N) := J_\mu(i_N) + \gamma_{m_N} d_N = J_\mu(i_N) + \gamma_{m_N}\bigl(g(i_N, t) - J_\mu(i_N)\bigr).$$

The stepsizes $\gamma_{m_k}$, $k = 1, \ldots, N$, are given by $\gamma_{m_k} = 1/m_k$, where $m_k$ is the number of visits already made to state $i_k$. In the case where the sample trajectory involves at most one visit to each state, the preceding updates are equivalent to the update (3.2). If there are multiple visits to some state during a sample trajectory, there is a difference between the preceding updates and the update (3.2), because the updates corresponding to each visit to the given state affect the updates corresponding to subsequent visits to the same state. However, this is an effect which is of second order in the stepsize $\gamma$, so once $\gamma$ becomes small, the difference is negligible.

† The validity of doing so is not quite obvious because in the case of multiple visits to the same state within the same trajectory, the corresponding multiple cost samples are correlated, since portions of the corresponding cost sequence are shared by these cost samples. For a justifying analysis, see Exercise 2.15.
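The trajectory-based update (3.2) is easy to express in code. The following is a minimal sketch, assuming states $0, \ldots, n-1$ with the termination state indexed by $t$; the array layout, the simulator interface, and the function names are illustrative assumptions, not part of the text.

```python
import numpy as np

def mc_evaluate(P, g, t, n_runs, start_states, seed=0):
    """Every-visit Monte-Carlo evaluation of a fixed proper policy.

    P : (n x (n+1)) matrix; row i is the distribution over successor states
        under the policy, with column t corresponding to termination.
    g : (n x (n+1)) matrix of stage costs g(i, j).
    Implements update (3.2): after each simulated trajectory, every visited
    state is moved toward the observed tail cost, with stepsize equal to
    one over the number of visits to that state so far.
    """
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    J = np.zeros(n)                    # cost estimates J_mu(i); J_mu(t) = 0 implicitly
    visits = np.zeros(n, dtype=int)
    for _ in range(n_runs):
        i = int(rng.choice(start_states))
        traj = []                      # list of (state, transition cost)
        while i != t:
            j = int(rng.choice(P.shape[1], p=P[i]))
            traj.append((i, g[i, j]))
            i = j
        tail = 0.0                     # accumulate costs backward from termination
        for state, cost in reversed(traj):
            tail += cost               # cost-to-go sample c(i, m) for this visit
            visits[state] += 1
            J[state] += (tail - J[state]) / visits[state]   # cf. Eq. (3.2)
    return J
```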
TD($\lambda$)

The preceding implementation of the Monte-Carlo simulation method for evaluating the cost of a policy $\mu$ is known as TD(1) (here TD stands for Temporal Differences). A generalization of TD(1) is TD($\lambda$), where $\lambda$ is a parameter with $0 \le \lambda \le 1$. Given a sample trajectory $(i_1, \ldots, i_N, t)$ with a corresponding cost sequence $g(i_1, i_2), \ldots, g(i_N, t)$, TD($\lambda$) updates the cost estimates $J_\mu(i_1), \ldots, J_\mu(i_N)$ using the temporal differences
$$d_k = g(i_k, i_{k+1}) + J_\mu(i_{k+1}) - J_\mu(i_k), \qquad k = 1, \ldots, N,$$
and the equations
$$J_\mu(i_1) := J_\mu(i_1) + \gamma_{m_1} d_1, \qquad \text{following the transition } (i_1, i_2),$$
$$J_\mu(i_1) := J_\mu(i_1) + \gamma_{m_1} \lambda d_2, \qquad J_\mu(i_2) := J_\mu(i_2) + \gamma_{m_2} d_2, \qquad \text{following the transition } (i_2, i_3),$$
and more generally, for $k = 1, \ldots, N$,
$$J_\mu(i_1) := J_\mu(i_1) + \gamma_{m_1} \lambda^{k-1} d_k,$$
$$J_\mu(i_2) := J_\mu(i_2) + \gamma_{m_2} \lambda^{k-2} d_k,$$
$$\vdots$$
$$J_\mu(i_k) := J_\mu(i_k) + \gamma_{m_k} d_k, \qquad \text{following the transition } (i_k, i_{k+1}).$$

The use of a value of $\lambda$ less than 1 tends to discount the effect of the temporal differences of state transitions far into the future on the cost estimate of the current state. In the case where $\lambda = 0$, we obtain the TD(0) algorithm, which following the transition $(i_k, i_{k+1})$ updates $J_\mu(i_k)$ by
$$J_\mu(i_k) := J_\mu(i_k) + \gamma_{m_k}\bigl(g(i_k, i_{k+1}) + J_\mu(i_{k+1}) - J_\mu(i_k)\bigr). \tag{3.4}$$
This algorithm is a special case of an important stochastic iterative algorithm known as the stochastic approximation (or Robbins-Monro) method (see e.g., [BMP90], [BeT89a], [LjS83]) for solving Bellman's equation
$$E\bigl\{g(i_k, i_{k+1}) + J_\mu(i_{k+1})\bigr\} - J_\mu(i_k) = 0.$$
In this algorithm, the expected value above is approximated by using a single sample at each iteration [cf. Eq. (3.4)].

The stepsizes $\gamma_{m_k}$ need not be equal to $1/m_k$, where $m_k$ is the number of visits thus far to state $i_k$, but they should diminish to zero with the number of visits to each state. For example, one may use the same stepsize $\gamma_m = 1/m$ for all states within the $m$th simulation trajectory. With such a stepsize and under some technical conditions, chief of which is that each state $i = 1, \ldots, n$ is visited infinitely often in the course of the simulation, it can be shown that, for all $\lambda \in [0, 1]$, the cost estimates $J_\mu(i)$ generated by TD($\lambda$) converge to the correct values with probability 1.
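As an illustration, here is a minimal sketch of a single TD($\lambda$) pass over one simulated trajectory with a look-up table representation; the argument names and the use of a single per-trajectory stepsize are assumptions made for the example. An equivalent and more economical implementation would maintain an eligibility coefficient per state instead of looping over all earlier states.

```python
def td_lambda_trajectory(J, traj, costs, lam, gamma):
    """One TD(lambda) pass over a trajectory (i_1, ..., i_N, t).

    J      : dict or array of current cost estimates, with J[t] == 0
    traj   : list of visited states [i_1, ..., i_N, t]
    costs  : list of stage costs [g(i_1, i_2), ..., g(i_N, t)]
    lam    : the parameter lambda in [0, 1]
    gamma  : stepsize used for every update within this trajectory
    Following the transition (i_k, i_{k+1}), state i_l (l <= k) receives the
    increment gamma * lam**(k - l) * d_k, where d_k is the temporal difference.
    """
    N = len(costs)
    for k in range(N):
        i_k, i_next = traj[k], traj[k + 1]
        d_k = costs[k] + J[i_next] - J[i_k]      # temporal difference, cf. Eq. (3.3)
        for l in range(k + 1):                   # update all earlier states on the trajectory
            J[traj[l]] += gamma * (lam ** (k - l)) * d_k
    return J
```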
While TD($\lambda$) yields the correct values of $J_\mu(i)$ in the limit regardless of the value of $\lambda$, the choice of $\lambda$ may have a substantial effect on the rate of convergence. Some experience suggests that using $\lambda < 1$ (rather than $\lambda = 1$) often reduces the number of sample trajectories needed to attain the same variance of error between $J_\mu(i)$ and its estimate. However, at present there is no analysis relating to this phenomenon.

Simulation-Based Policy Iteration

The policy evaluation procedures discussed above can be embedded within a simulation-based policy iteration approach. Let us introduce the notion of the Q-factor of a state-control pair $(i, u)$ and a stationary policy $\mu$, defined as
$$Q_\mu(i, u) = \sum_{j=1}^{n} p_{ij}(u)\bigl(g(i, u, j) + J_\mu(j)\bigr). \tag{3.5}$$
It is the expected cost corresponding to starting at state $i$, using control $u$ at the first stage, and using the stationary policy $\mu$ at the second and subsequent stages. The Q-factors can be evaluated by first evaluating $J_\mu$ as above, and then using further simulation and averaging (if necessary) to compute the right-hand side of Eq. (3.5) for all pairs $(i, u)$. Once this is done, one can execute a policy improvement step using the equation
$$\overline{\mu}(i) = \arg\min_{u \in U(i)} Q_\mu(i, u), \qquad i = 1, \ldots, n. \tag{3.6}$$
We thus obtain a version of the policy iteration algorithm that combines policy evaluation using simulation, and policy improvement using Eq. (3.6) and further simulation, if necessary. In particular, given a policy $\mu$ and its associated cost vector $J_\mu$, the cost $J_{\overline{\mu}}$ of the improved policy is computed by simulation, with $\overline{\mu}(i)$ determined using Eq. (3.6) on-line.

2.3.2 Q-Learning

We now introduce an alternative method for cases where there is no explicit model of the system and the cost structure. This method is analogous to value iteration and has the advantage that it can be used directly in the case of multiple policies. Instead of approximating the cost function of a particular policy, it updates directly the Q-factors associated with an optimal policy, thereby avoiding the multiple policy evaluation steps of the policy iteration method. These Q-factors are defined, for all pairs $(i, u)$, by
$$Q(i, u) = \sum_{j=1}^{n} p_{ij}(u)\bigl(g(i, u, j) + J^*(j)\bigr).$$
From this definition and Bellman's equation, we see that the Q-factors satisfy, for all pairs $(i, u)$,
$$Q(i, u) = \sum_{j=1}^{n} p_{ij}(u)\Bigl(g(i, u, j) + \min_{u' \in U(j)} Q(j, u')\Bigr), \tag{3.7}$$
and it can be shown that the Q-factors are the unique solution of the above system of equations. The proof is essentially the same as the proof of existence and uniqueness of solution of Bellman's equation; see Prop. 1.2 of Section 2.1. In fact, by introducing a system whose states are the original states $1, \ldots, n$ together with all the pairs $(i, u)$, the above system of equations can be seen to be a special case of Bellman's equation (see Exercise 2.17). Furthermore, the Q-factors can be obtained by the iteration
$$Q(i, u) := \sum_{j=1}^{n} p_{ij}(u)\Bigl(g(i, u, j) + \min_{u' \in U(j)} Q(j, u')\Bigr), \qquad \text{for all } (i, u),$$
which is analogous to value iteration. A more general version of this is
$$Q(i, u) := (1 - \gamma)\, Q(i, u) + \gamma \sum_{j=1}^{n} p_{ij}(u)\Bigl(g(i, u, j) + \min_{u' \in U(j)} Q(j, u')\Bigr),$$
where $\gamma$ is a stepsize parameter with $\gamma \in (0, 1]$ that may change from one iteration to the next. The Q-learning method is an approximate version of this iteration, whereby the expected value is replaced by a single sample, i.e.,
$$Q(i, u) := Q(i, u) + \gamma\Bigl(g(i, u, j) + \min_{u' \in U(j)} Q(j, u') - Q(i, u)\Bigr).$$
Here $j$ and $g(i, u, j)$ are generated from the pair $(i, u)$ by simulation, that is, according to the transition probabilities $p_{ij}(u)$. Thus Q-learning can be viewed as a combination of value iteration and simulation.

Because Q-learning works using a single sample per iteration, it is well suited for a simulation context. By contrast, there is no single-sample version of the value iteration method, except in special cases [see Exercise 2.9(d)]. The reason is that, while it is possible to use a single-sample approximation of a term of the form $E\{\min[\cdot]\}$, such as the one appearing in the Q-factor equation (3.7), namely
$$g(i, u, j) + \min_{u' \in U(j)} Q(j, u'), \tag{3.8}$$
it is not possible to do so for a term of the form $\min\bigl[E\{\cdot\}\bigr]$, such as the one appearing in Bellman's equation.
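A minimal sketch of the Q-learning iteration for the stochastic shortest path problem follows; the simulator interface, the crude exploration rule, and the stepsize $c/m$ per pair are illustrative assumptions, not specifications from the text.

```python
import random
from collections import defaultdict

def q_learning(states, controls, simulate, t, n_iters, c=1.0, seed=0):
    """Tabular Q-learning for a stochastic shortest path problem.

    simulate(i, u) -> (j, cost) : one simulated transition from state i under
                                  control u (j may be the terminal state t)
    controls[i]                 : list of admissible controls U(i)
    Uses the single-sample update
        Q(i,u) := Q(i,u) + gamma * ( g(i,u,j) + min_{u'} Q(j,u') - Q(i,u) ),
    with stepsize gamma = c / (number of visits to the pair (i,u)).
    """
    rng = random.Random(seed)
    Q = defaultdict(float)            # Q[(i, u)], zero-initialized
    visits = defaultdict(int)
    for _ in range(n_iters):
        i = rng.choice(states)        # crude exploration: sample pairs uniformly
        u = rng.choice(controls[i])
        j, cost = simulate(i, u)
        # terminal state has cost-to-go zero; otherwise bootstrap from min Q
        next_min = 0.0 if j == t else min(Q[(j, v)] for v in controls[j])
        visits[(i, u)] += 1
        gamma = c / visits[(i, u)]
        Q[(i, u)] += gamma * (cost + next_min - Q[(i, u)])
    return Q
```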
To guarantee the convergence of the Q-learning algorithm to the optimal Q-factors, all state-control pairs $(i, u)$ must be visited infinitely often, and the stepsize $\gamma$ should be chosen in some special way. In particular, if the iteration corresponds to the $m$th visit of the pair $(i, u)$, one may use in the Q-learning iteration the stepsize $\gamma_m = c/m$, where $c$ is a positive constant. We refer to [Tsi94] for a proof of convergence of Q-learning under very general conditions.

2.3.3 Approximations

We now consider approximation/suboptimal control schemes that are suitable for problems with a large number of states. The discounted versions of these schemes, which are discussed in Section 2.3.4, can be adapted for the case of an infinite state space. Generally there are two types of approximations to consider:

(a) Approximation of the optimal cost function $J^*$. This is done by using a function that, given a state $i$, produces an approximation $\tilde{J}(i, r)$ of $J^*(i)$, where $r$ is a parameter/weight vector that is typically determined by some form of optimization; for example, by using some type of least squares framework. Once $\tilde{J}(i, r)$ is known, it can be used in real time to generate a suboptimal control at any state $i$ according to
$$\overline{\mu}(i) = \arg\min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u)\bigl(g(i, u, j) + \tilde{J}(j, r)\bigr).$$
An alternative possibility, which does not require the real-time calculation of the expected value in the above formula, is to obtain approximations $\tilde{Q}(i, u, r)$ of the Q-factors $Q(i, u)$, and then to generate a suboptimal control at any state $i$ according to
$$\overline{\mu}(i) = \arg\min_{u \in U(i)} \tilde{Q}(i, u, r).$$
It is also possible to use approximations $\tilde{J}_\mu(i, r)$ of the cost functions $J_\mu$ of policies $\mu$ in an approximate policy iteration scheme.
Note that the cost approximation approach can be enhanced if we have additional information on the true functions $J^*(i)$, $Q(i, u)$, or $J_\mu(i)$. For example, if we know that $J^*(i) \ge 0$ for all $i$, we may first compute the approximation $\tilde{J}(i, r)$ by using some method, and then replace this approximation by $\max\bigl\{0, \tilde{J}(i, r)\bigr\}$. This idea applies to all the approximation procedures of this section.

(b) Approximation of a policy $\mu$, or the optimal policy $\mu^*$. Again this approximation will be done by some function parameterized by a parameter/weight vector $r$, which given a state $i$, produces an approximation $\tilde{\mu}(i, r)$ of $\mu(i)$ or of $\mu^*(i)$. The parameter/weight vector $r$ can be determined by some type of least squares optimization framework.

In this section we discuss several possibilities, emphasizing primarily the case of cost approximation. The choice of the structure of the approximating functions is very significant for the success of the approximation approach. One possibility is to use the linear form
$$\tilde{J}(i, r) = \sum_{k=1}^{m} r_k w_k(i), \tag{3.9}$$
where $r = (r_1, \ldots, r_m)$ is the parameter vector and $w_k(i)$ are some fixed and known scalars. This amounts to approximating the cost function $J^*$ by a linear combination of $m$ given basis functions $\bigl(w_k(1), \ldots, w_k(n)\bigr)$, $k = 1, \ldots, m$.

Example 3.1: (Polynomial Approximations)

An important example of linear cost approximation is based on polynomial basis functions. Suppose that the state consists of $q$ integer components $x_1, \ldots, x_q$, each taking values within some limited range of the nonnegative integers. For example, in a queueing system, $x_k$ may represent the number of customers in the $k$th queue, where $k = 1, \ldots, q$. Suppose that we want to use an approximating function that is quadratic in the components $x_k$. Then we can define a total of $1 + q + q^2$ basis functions that depend on the state $x = (x_1, \ldots, x_q)$ via
$$w_0(x) = 1, \qquad w_k(x) = x_k, \qquad w_{ks}(x) = x_k x_s, \qquad k, s = 1, \ldots, q.$$
An approximating function that is a linear combination of these functions is given by
$$\tilde{J}(x, r) = r_0 + \sum_{k=1}^{q} r_k x_k + \sum_{k=1}^{q} \sum_{s=1}^{q} r_{ks}\, x_k x_s,$$
where the parameter vector $r$ has components $r_0$, $r_k$, and $r_{ks}$, with $k, s = 1, \ldots, q$. In fact, any kind of approximating function that is polynomial in the components $x_1, \ldots, x_q$ can be constructed in this way.

Example 3.2: (Feature Extraction)

Suppose that through intuition or analysis we can identify a number of characteristics of the state that affect the optimal cost function in a substantial way. We assume that these characteristics can be numerically quantified, and that they form a $q$-dimensional vector $f(i) = \bigl(f_1(i), \ldots, f_q(i)\bigr)$, called the feature vector of state $i$. For example, in computer chess (Section 6.3.2 of Vol. I) where the state is the current board position, appropriate features are material balance, piece mobility, king safety, and other positional factors.

Features, when well chosen, can capture the dominant nonlinearities of the optimal cost function $J^*$, and can be used to approximate $J^*$ through the linear combination
$$\tilde{J}(i, r) = r_0 + \sum_{k=1}^{q} r_k f_k(i),$$
where $r_0, r_1, \ldots, r_q$ are appropriately chosen weights.

It is also possible to combine feature extraction with more general polynomial approximations of the type discussed in Example 3.1. For example, a feature extraction mapping $f$ followed by a quadratic polynomial mapping yields an approximating function of the form
$$\tilde{J}(i, r) = r_0 + \sum_{k=1}^{q} r_k f_k(i) + \sum_{k=1}^{q} \sum_{s=1}^{q} r_{ks}\, f_k(i) f_s(i),$$
where the parameter vector $r$ has components $r_0$, $r_k$, and $r_{ks}$, with $k, s = 1, \ldots, q$. This function can be viewed as a linear cost approximation that uses the basis functions
$$w_0(i) = 1, \qquad w_k(i) = f_k(i), \qquad w_{ks}(i) = f_k(i) f_s(i), \qquad k, s = 1, \ldots, q.$$
Note that more than one state may map into the same feature vector, so that each distinct value of feature vector corresponds to a subset of states. This subset may be viewed as an "aggregate state." The optimal cost function $J^*$ is approximated by a function that is constant over each aggregate state. We will discuss this viewpoint shortly.

It can be seen from the preceding examples that linear approximating functions of the form (3.9) are well suited for a broad variety of situations. There are also interesting nonlinear approximating functions $\tilde{J}(i, r)$, including those defined by neural networks, perhaps in combination with feature extraction mappings. In our discussion, we will not address the choice of the structure of $\tilde{J}(i, r)$, but rather focus on various methods for obtaining a suitable parameter vector $r$. We will primarily discuss three approaches:

(a) Feature-based aggregation, where $r$ is determined as the cost vector of an "aggregate stochastic shortest path problem."

(b) Minimizing the Bellman equation error, where $r$ is determined so that the approximate cost function $\tilde{J}(i, r)$ nearly satisfies Bellman's equation.

(c) Approximate policy iteration, where the cost functions of the generated policies $\mu$ are approximated by $\tilde{J}_\mu(i, r)$, with $r$ chosen according to a least-squares error criterion.

We note, however, that the methods described in this subsection are not fully understood. We have chosen to present them because of their potential to deal with problems that are too complex to be handled in any other way.
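To make the linear architecture concrete, the following sketch evaluates $\tilde{J}(i, r)$ of the form (3.9) for a given feature mapping and uses it for the one-step lookahead control of case (a) above; the feature function, the model arrays, and all names are assumptions made for the example, and a terminal state, if present among the columns, is assumed to have features giving $\tilde{J} = 0$.

```python
import numpy as np

def J_tilde(i, r, features):
    """Linear architecture J~(i, r) = sum_k r_k w_k(i), cf. Eq. (3.9)."""
    return float(np.dot(r, features(i)))

def one_step_lookahead(i, r, features, P, g, controls):
    """Suboptimal control at state i based on the approximation J~(., r).

    P[u], g[u] : arrays of transition probabilities p_ij(u) and stage costs
                 g(i, u, j), one row per state i.
    Returns arg min_u  sum_j p_ij(u) * ( g(i, u, j) + J~(j, r) ).
    """
    n = P[controls[0]].shape[1]
    J_next = np.array([J_tilde(j, r, features) for j in range(n)])
    return min(controls,
               key=lambda u: float(P[u][i] @ (g[u][i] + J_next)))
```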
Feature-Based Aggregation

We mentioned earlier in Example 3.2 that a feature extraction mapping divides the state space into subsets. The states of each subset are mapped into the same feature vector, and are "similar" in that they "share the same features." With this context in mind, let the set of states $\{1, \ldots, n\}$ of the given stochastic shortest path problem be partitioned into $m$ disjoint subsets $S_k$, $k = 1, \ldots, m$. We approximate the optimal cost $J^*(i)$ by a function that is constant over each set $S_k$, that is,
$$\tilde{J}(i, r) = \sum_{k=1}^{m} r_k w_k(i),$$
where $r = (r_1, \ldots, r_m)'$ is a vector of parameters and
$$w_k(i) = \begin{cases} 1 & \text{if } i \in S_k, \\ 0 & \text{if } i \notin S_k. \end{cases}$$
Equivalently, the approximate cost function $\bigl(\tilde{J}(1, r), \ldots, \tilde{J}(n, r)\bigr)$ is represented as $Wr$, where $W$ is the $n \times m$ matrix whose entry in the $i$th row and $k$th column is $w_k(i)$. The $i$th row of $W$ may be viewed as the feature vector corresponding to state $i$ (cf. Example 3.2).

In the aggregation approach, the parameters $r_k$ are obtained as the optimal costs of an "aggregate stochastic shortest path problem" whose states are the subsets $S_k$. Thus $r_k$ is chosen to be the optimal cost of the aggregate state $S_k$ in an aggregate problem, which is formulated similar to the aggregation method of Section 1.3.3. In particular, let $Q$ be an $m \times n$ matrix such that the $k$th row of $Q$ is a probability distribution $(q_{k1}, \ldots, q_{kn})$ with $q_{ki} = 0$ if $i \notin S_k$. As in Section 1.3.3, the structure of $Q$ implies that for each stationary policy $\mu$, the matrix $R_\mu = Q P_\mu W$ is an $m \times m$ transition probability matrix. The states of the aggregate stochastic shortest path problem are the sets $S_1, \ldots, S_m$ together with the termination state $t$; the stationary policies select at aggregate state $S_k$ a control $u \in U(i)$ for each $i \in S_k$ and thus can be identified with stationary policies of the original stochastic shortest path problem; finally, the transition probability matrix corresponding to $\mu$ in the aggregate stochastic shortest path problem is $R_\mu$. Given a stationary policy $\mu$, the state transition mechanism in the aggregate stochastic shortest path problem can be described as follows: at aggregate state $S_k$, we move to state $i$ with probability $q_{ki}$, then we move to state $j$ with probability $p_{ij}\bigl(\mu(i)\bigr)$, and finally, if $j$ is not the termination state $t$, we move to the aggregate state $S_l$ corresponding to $j$ (i.e., $j \in S_l$).
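The matrices $W$, $Q$, and $R_\mu = Q P_\mu W$ are straightforward to form numerically. The sketch below builds them from a given partition, with $Q$ chosen uniform over each subset; the uniform choice and the argument names are assumptions made here for illustration.

```python
import numpy as np

def aggregation_matrices(partition, P_mu):
    """Build W (n x m), Q (m x n), and R_mu = Q P_mu W for hard aggregation.

    partition : list of m lists; partition[k] holds the states in S_k
    P_mu      : n x n (sub)stochastic transition matrix of policy mu over the
                non-terminal states (missing row mass goes to the terminal t)
    Q is chosen uniform over each subset S_k; any distribution supported on
    S_k would do.
    """
    n = P_mu.shape[0]
    m = len(partition)
    W = np.zeros((n, m))
    Q = np.zeros((m, n))
    for k, S_k in enumerate(partition):
        W[S_k, k] = 1.0                   # w_k(i) = 1 iff i in S_k
        Q[k, S_k] = 1.0 / len(S_k)        # uniform sampling distribution q_k
    R_mu = Q @ P_mu @ W                   # m x m aggregate transition matrix
    return W, Q, R_mu
```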
Suppose now that $r = (r_1, \ldots, r_m)'$ is the optimal cost function of the aggregate stochastic shortest path problem. Then $r$ solves the corresponding Bellman equation, which has the form
$$r_k = \sum_{i=1}^{n} q_{ki} \min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u)\Bigl(g(i, u, j) + \sum_{s=1}^{m} w_s(j)\, r_s\Bigr), \qquad k = 1, \ldots, m.$$
One way to obtain $r$ is policy iteration based on Monte-Carlo simulation, as described in Section 2.3.1. An alternative, due to [TsV94], is to use a simulation-based form of value iteration for the aggregate problem. Here, at each iteration we choose a subset $S_k$, we randomly select a state $i \in S_k$ according to the probabilities $q_{ki}$, and we update $r_k$ according to
$$r_k := (1 - \gamma)\, r_k + \gamma \min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u)\Bigl(g(i, u, j) + \sum_{s=1}^{m} w_s(j)\, r_s\Bigr), \tag{3.10}$$
where $\gamma$ is a positive stepsize that diminishes to zero as the algorithm progresses. The following example illustrates the method. We refer to [TsV94] for experimental results relating to this example as well as for a convergence analysis of the method.

Example 3.3: (Tetris [TsV94])

Tetris is a popular video game played on a two-dimensional grid. Each square in the grid can be full or empty, making up a "wall of bricks" with "holes" and a "jagged top". The squares fill up as blocks of different shapes fall at a constant rate from the top of the grid and are added to the top of the wall. As a given block falls, the player can move horizontally and rotate the block in all possible ways, subject to the constraints imposed by the sides of the grid and the top of the wall. There is a finite set of standard shapes for the falling blocks. The game starts with an empty grid and ends when a square in the top row becomes full and the top of the wall reaches the top of the grid. However, when a row of full squares is created, this row is removed, the bricks lying above this row move one row downward, and the player scores a point. The player's objective is to maximize the score attained (total number of rows removed) up to termination of the game.

Assuming that, for every policy, the game terminates with probability one (something that is not really known at present), we can model the problem of finding an optimal tetris playing strategy as a stochastic shortest path problem. The control, denoted by $u$, is the horizontal positioning and rotation applied to the falling block. The state consists of two components:

(1) The board position, that is, a binary description of the full/empty status of each square, denoted by $x$.

(2) The shape of the current falling block, denoted by $y$.

The component $y$ is generated according to a probability distribution independently of the control. Exercise 2.9 shows that under these circumstances, it is possible to derive a reduced form of Bellman's equation involving a cost function $\tilde{J}$ that depends only on the component $x$ of the state (see also Exercise 1.22 of Vol. I). This equation has the intuitive form
$$\tilde{J}(x) = E_y\Bigl\{\max_u\bigl[g(x, y, u) + \tilde{J}\bigl(f(x, y, u)\bigr)\bigr]\Bigr\}, \qquad \text{for all } x,$$
where $g(x, y, u)$ and $f(x, y, u)$ are the number of points scored (rows removed), and the board position when the state is $(x, y)$ and control $u$ is applied, respectively.

Unfortunately, the number of states is extremely large. It is equal to $m\, 2^{hw}$, where $m$ is the number of different shapes of falling blocks, and $h$ and $w$ are the height and width of the grid, respectively. In particular, for the reasonable numbers $m = 7$, $h = 20$, and $w = 7$ we have over $10^{42}$ states. Thus it is essential to use approximations.

An approximating function that involves feature extraction is particularly attractive here, since the quality of a given position can be described quite well by a few features that are easily recognizable by experienced players. These features include the current height of the wall, and the presence of "holes" and "glitches" (severe irregularities) in the first few rows. Suppose that, based on human experience and trial and error, we obtain a method to map each board position $x$ into a vector of features. Suppose that there is a finite number of possible feature vectors, say $m$, and define
$$w_k(x) = \begin{cases} 1 & \text{if board position } x \text{ maps into the } k\text{th feature vector}, \\ 0 & \text{otherwise}. \end{cases}$$
The approximating function $\tilde{J}(x, r)$ is given by
$$\tilde{J}(x, r) = \sum_{k=1}^{m} r_k w_k(x),$$
where $r = (r_1, \ldots, r_m)$ is the parameter vector.

The simulation-based value iteration (3.10) takes the form
$$r_k := (1 - \gamma)\, r_k + \gamma \max_u\bigl[g(x, y, u) + \tilde{J}\bigl(f(x, y, u), r\bigr)\bigr],$$
where the positive stepsize $\gamma$ diminishes with the number of visits to position $x$. One way to implement the method is as follows: The game is simulated many times from start to finish, starting from a variety of "representative" board positions. At each iteration, we have the current board position $x$ and we determine the feature vector $k$ to which $x$ maps. Then we randomly generate a falling block $y$ according to a known and fixed probabilistic mechanism, and we update $r_k$ using the above iteration. Let $u^*$ be the choice of $u$ that attains the maximum in the iteration. Then the board position subsequent to $x$ in the simulation is $f(x, y, u^*)$, and this position is used as the current state for the next iteration.

In the aggregate stochastic shortest path problem formulated above, policies consist of a different control choice for each state. A somewhat different aggregate stochastic shortest path problem is obtained by requiring that, for each $k$, the same control is used at all states of $S_k$. This control must be chosen from a suitable set $U(k)$ of admissible controls for the states in $S_k$. The optimal cost function $r = (r_1, \ldots, r_m)$ corresponding to this aggregate stochastic shortest path problem solves the following Bellman equation
$$r_k = \min_{u \in U(k)} \sum_{i=1}^{n} q_{ki} \sum_{j=1}^{n} p_{ij}(u)\Bigl(g(i, u, j) + \sum_{s=1}^{m} w_s(j)\, r_s\Bigr), \qquad k = 1, \ldots, m.$$
This equation can be solved by Q-learning, particularly when $m$ is relatively small and the number of controls in the sets $U(k)$ is also small.

Note also that the aggregate problem need not be solved exactly, but can itself be solved approximately by any of the simulation-based methods to be discussed subsequently in this section. In this context, aggregation is used as a feature extraction mapping that maps each state $i$ to the corresponding feature vector $w(i) = \bigl(w_1(i), \ldots, w_m(i)\bigr)$. This feature vector becomes the input to some other approximating function (see Fig. 2.3.1).

Figure 2.3.1 View of a cost function approximation scheme that consists of a feature extraction mapping followed by an approximator. The scheme conceptually separates into an aggregate system and a cost approximator for the aggregate system.
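As an illustration of the hard-aggregation iteration (3.10), here is a minimal sketch that repeatedly samples a representative state of a subset and updates the corresponding aggregate cost; the model interface and the $1/\mathrm{iteration}$ stepsize schedule are assumptions made for the example.

```python
import numpy as np

def aggregate_value_iteration(Q, W, P, g, controls, n_iters, seed=0):
    """Simulation-based value iteration (3.10) for hard aggregation.

    Q          : m x n matrix; row k is the sampling distribution q_k over S_k
    W          : n x m 0/1 membership matrix, W[i, k] = w_k(i)
    P[u], g[u] : n x (n+1) transition/cost arrays; the last column is the
                 terminal state t, whose approximate cost is zero
    """
    rng = np.random.default_rng(seed)
    m, n = Q.shape
    r = np.zeros(m)
    for it in range(1, n_iters + 1):
        k = int(rng.integers(m))                     # choose a subset S_k
        i = int(rng.choice(n, p=Q[k]))               # sample i in S_k via q_k
        J_next = np.append(W @ r, 0.0)               # J~(j, r) = sum_s w_s(j) r_s, and 0 at t
        best = min(float(P[u][i] @ (g[u][i] + J_next)) for u in controls)
        gamma = 1.0 / it                             # diminishing stepsize
        r[k] = (1 - gamma) * r[k] + gamma * best     # cf. Eq. (3.10)
    return r
```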
We finally mention an extension of the aggregation approach whereby we represent the approximate cost function $\bigl(\tilde{J}(1, r), \ldots, \tilde{J}(n, r)\bigr)$ as $Wr$, where each row of the $n \times m$ matrix $W$ is a probability distribution. Thus
$$\tilde{J}(i, r) = \sum_{k=1}^{m} w_k(i)\, r_k,$$
where $w_k(i)$ is the $(i, k)$th entry of the matrix $W$, and we have
$$\sum_{k=1}^{m} w_k(i) = 1, \qquad w_k(i) \ge 0, \qquad i = 1, \ldots, n, \quad k = 1, \ldots, m.$$
The transition probability matrix of the aggregate stochastic shortest path problem corresponding to $\mu$ is still $R_\mu = Q P_\mu W$, and we may use as parameter vector $r$ the optimal cost vector of this aggregate problem.

Approximation Based on Bellman's Equation

Another possibility for approximation of the optimal cost by a function $\tilde{J}(i, r)$, where $r$ is a vector of unknown parameters, is based on minimizing the error in Bellman's equation; for example, by solving the problem
$$\min_r \sum_{i \in S}\Bigl|\min_{u \in U(i)} \sum_j p_{ij}(u)\bigl(g(i, u, j) + \tilde{J}(j, r)\bigr) - \tilde{J}(i, r)\Bigr|^2, \tag{3.11}$$
where $S$ is a suitably chosen subset of "representative" states. This minimization may be attempted by using some type of gradient or Gauss-Newton method. A gradient-like method that can be used to solve this problem is obtained by making a correction to $r$ that is proportional to the gradient of the squared error term in Eq. (3.11). This method is given by
$$r := r - \gamma D(i, r)\, \nabla D(i, r) = r - \gamma D(i, r)\Bigl(\sum_j p_{ij}(\overline{u})\, \nabla\tilde{J}(j, r) - \nabla\tilde{J}(i, r)\Bigr), \tag{3.12}$$
where $\nabla$ denotes the gradient with respect to $r$, $D(i, r)$ is the error in Bellman's equation, given by
$$D(i, r) = \min_{u \in U(i)} \sum_j p_{ij}(u)\bigl(g(i, u, j) + \tilde{J}(j, r)\bigr) - \tilde{J}(i, r),$$
$\overline{u}$ is given by
$$\overline{u} = \arg\min_{u \in U(i)} \sum_j p_{ij}(u)\bigl(g(i, u, j) + \tilde{J}(j, r)\bigr),$$
and $\gamma$ is a stepsize, which may change from one iteration to the next. The method should perform many such iterations at each of the representative states. Typically one should cycle through the set of representative states $S$ in some order, which may change (perhaps randomly) from one cycle to the next.
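A sketch of one such cycle of the gradient-like iteration (3.12) with a linear architecture $\tilde{J}(i, r) = \phi(i)^{\top} r$ (so that $\nabla\tilde{J}(i, r) = \phi(i)$) might look as follows; the feature map and model arrays are assumed inputs, not prescribed by the text.

```python
import numpy as np

def bellman_error_sweep(r, rep_states, phi, P, g, controls, gamma):
    """One cycle of the gradient iteration (3.12) over representative states.

    phi(i)     : feature vector of state i, so that J~(i, r) = phi(i) . r
    P[u], g[u] : n x n arrays of transition probabilities and stage costs
    Each step moves r opposite to D(i, r) * grad D(i, r), where D(i, r) is
    the Bellman equation error at state i.
    """
    n = P[controls[0]].shape[0]
    for i in rep_states:
        J = np.array([phi(j) @ r for j in range(n)])       # current J~(., r)
        vals = {u: float(P[u][i] @ (g[u][i] + J)) for u in controls}
        u_bar = min(vals, key=vals.get)                     # minimizing control
        D = vals[u_bar] - phi(i) @ r                        # Bellman error D(i, r)
        grad_D = sum(P[u_bar][i][j] * phi(j) for j in range(n)) - phi(i)
        r = r - gamma * D * grad_D                          # cf. Eq. (3.12)
    return r
```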
Note that in iteration (3.12) we approximate the gradient of the term
$$\min_{u \in U(i)} \sum_j p_{ij}(u)\bigl(g(i, u, j) + \tilde{J}(j, r)\bigr) \tag{3.13}$$
by
$$\sum_j p_{ij}(\overline{u})\, \nabla\tilde{J}(j, r),$$
which can be shown to be correct only when the above minimum is attained at a unique $\overline{u} \in U(i)$ [otherwise the function (3.13) is nondifferentiable with respect to $r$]. Thus the convergence of iteration (3.12) should be analyzed using the theory of nondifferentiable optimization. One possibility to avoid this complication is to replace the nondifferentiable term (3.13) by a smooth approximation, which can be arbitrarily accurate (see [Ber82b], Ch. 3).

An interesting special case arises when we want to approximate the cost function $J_\mu$ of a given policy $\mu$ by a function $\tilde{J}_\mu(i, r)$, where $r$ is a parameter vector. The iteration (3.12) then takes the form
$$r := r - \gamma E_j\{d_\mu(i, j, r) \mid i, \mu\}\, E_j\{\nabla d_\mu(i, j, r) \mid i, \mu\} = r - \gamma E_j\{d_\mu(i, j, r) \mid i, \mu\}\bigl(E_j\{\nabla\tilde{J}_\mu(j, r) \mid i, \mu\} - \nabla\tilde{J}_\mu(i, r)\bigr), \tag{3.14}$$
where
$$d_\mu(i, j, r) = g\bigl(i, \mu(i), j\bigr) + \tilde{J}_\mu(j, r) - \tilde{J}_\mu(i, r),$$
and $E_j\{\cdot \mid i, \mu\}$ denotes expected value over $j$ using the transition probabilities $p_{ij}\bigl(\mu(i)\bigr)$.

There is a simpler version of iteration (3.14) that does not require averaging over the successor states $j$. In this version, the two expected values in iteration (3.14) are replaced by two independent single sample values. In particular, $r$ is updated by
$$r := r - \gamma\, d_\mu(i, j, r)\bigl(\nabla\tilde{J}_\mu(\overline{j}, r) - \nabla\tilde{J}_\mu(i, r)\bigr), \tag{3.15}$$
where $j$ and $\overline{j}$ correspond to two independent transitions starting from $i$. It is necessary to use two independently generated states $j$ and $\overline{j}$ in order that the expected value (over $j$ and $\overline{j}$) of the product
$$d_\mu(i, j, r)\bigl(\nabla\tilde{J}_\mu(\overline{j}, r) - \nabla\tilde{J}_\mu(i, r)\bigr),$$
given $i$, is equal to the term
$$E_j\{d_\mu(i, j, r) \mid i, \mu\}\bigl(E_j\{\nabla\tilde{J}_\mu(j, r) \mid i, \mu\} - \nabla\tilde{J}_\mu(i, r)\bigr)$$
appearing in the right-hand side of Eq. (3.14).

There are also versions of the above iterations that update Q-factor approximations rather than cost approximations. In particular, let us introduce an approximation $\tilde{Q}(i, u, r)$ to the Q-factor $Q(i, u)$, where $r$ is an unknown parameter vector. Bellman's equation for the Q-factors is given by [cf. Eq. (3.7)]
$$Q(i, u) = \sum_j p_{ij}(u)\Bigl(g(i, u, j) + \min_{u' \in U(j)} Q(j, u')\Bigr),$$
so in analogy with problem (3.11), we determine the parameter vector $r$ by solving the least squares problem
$$\min_r \sum_{(i,u) \in V}\Bigl|\tilde{Q}(i, u, r) - \sum_j p_{ij}(u)\Bigl[g(i, u, j) + \min_{u' \in U(j)} \tilde{Q}(j, u', r)\Bigr]\Bigr|^2, \tag{3.16}$$
where $V$ is a suitably chosen subset of "representative" state-control pairs. The analog of the gradient-like methods (3.12) and (3.14) is given by
$$r := r - \gamma E\{d_u(i, j, r) \mid i, u\}\, E\{\nabla d_u(i, j, r) \mid i, u\} = r - \gamma E\{d_u(i, j, r) \mid i, u\}\bigl(E\{\nabla\tilde{Q}(j, \overline{u}, r) \mid i, u\} - \nabla\tilde{Q}(i, u, r)\bigr),$$
where $d_u(i, j, r)$ is given by
$$d_u(i, j, r) = g(i, u, j) + \min_{u' \in U(j)} \tilde{Q}(j, u', r) - \tilde{Q}(i, u, r),$$
$\overline{u}$ is obtained by
$$\overline{u} = \arg\min_{u' \in U(j)} \tilde{Q}(j, u', r),$$
and $\gamma$ is a stepsize parameter. In analogy with Eq. (3.15), the two-sample version of this iteration is given by
$$r := r - \gamma\, d_u(i, j, r)\bigl(\nabla\tilde{Q}(\overline{j}, \overline{u}, r) - \nabla\tilde{Q}(i, u, r)\bigr),$$
where $j$ and $\overline{j}$ are two states independently generated from $i$ according to the transition probabilities corresponding to $u$, and
$$\overline{u} = \arg\min_{u' \in U(\overline{j})} \tilde{Q}(\overline{j}, u', r).$$
Note that there is no two-sample version of iteration (3.12), which is based on optimal cost approximation. This is the advantage of using Q-factor approximations rather than optimal cost approximations. The point is that it is possible to use single-sample or two-sample approximations in gradient-like methods for terms of the form $E\{\min[\cdot]\}$, such as the one appearing in Eq. (3.16), but not for terms of the form $\min\bigl[E\{\cdot\}\bigr]$, such as the one appearing in Eq. (3.11). The following example illustrates the use of the two-sample approximation idea.

Example 3.4: (Tetris Continued)

Consider the game of tetris described in Example 3.3, and suppose that an approximation $\tilde{J}(x, r)$ of a given form is desired, where the parameter vector $r$ is obtained by solving the problem
$$\min_r \sum_{x \in S}\Bigl|\tilde{J}(x, r) - E_y\Bigl\{\max_u\bigl[g(x, y, u) + \tilde{J}\bigl(f(x, y, u), r\bigr)\bigr]\Bigr\}\Bigr|^2,$$
where $S$ is a suitably chosen set of "representative" states. Because this problem involves a term of the form $E\{\max[\cdot]\}$, a two-sample gradient-like method is possible. It has the form
$$r := r - \gamma\, d(x, y, r)\bigl(\nabla\tilde{J}\bigl(f(x, \overline{y}, \overline{u}), r\bigr) - \nabla\tilde{J}(x, r)\bigr),$$
where $y$ and $\overline{y}$ are two falling blocks that are randomly and independently generated,
$$d(x, y, r) = \max_u\bigl[g(x, y, u) + \tilde{J}\bigl(f(x, y, u), r\bigr)\bigr] - \tilde{J}(x, r),$$
and
$$\overline{u} = \arg\max_{u'}\bigl[g(x, \overline{y}, u') + \tilde{J}\bigl(f(x, \overline{y}, u'), r\bigr)\bigr].$$

Similar to Example 3.3, consider a feature-based approximating function given by
$$\tilde{J}(x, r) = \sum_{k=1}^{m} r_k w_k(x),$$
where $r = (r_1, \ldots, r_m)$ is the parameter vector and
$$w_k(x) = \begin{cases} 1 & \text{if board position } x \text{ maps into the } k\text{th feature vector}, \\ 0 & \text{otherwise}. \end{cases}$$
For this approximating function, the preceding two-sample gradient iteration takes the relatively simple form
$$r_k := r_k - \gamma\, d(x, y, r)\bigl(w_k\bigl(f(x, \overline{y}, \overline{u})\bigr) - w_k(x)\bigr), \qquad k = 1, \ldots, m.$$
Note that this iteration updates at most two parameters [the ones corresponding to the feature vectors to which the board positions $x$ and $f(x, \overline{y}, \overline{u})$ map, assuming that these feature vectors are different]. To implement the method, a set $S$ containing a large number of states $x$ is selected, and at each $x \in S$, two falling blocks $y$ and $\overline{y}$ are independently generated. The controls $u$ and $\overline{u}$ that are optimal for $(x, y)$ and $(x, \overline{y})$, based on the current parameter vector $r$, are calculated, and the parameters of the feature vectors associated with $x$ and $f(x, \overline{y}, \overline{u})$ are adjusted according to the preceding formula.
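The feature-based two-sample iteration of Example 3.4 reduces to a few lines of code. The sketch below assumes hypothetical helper functions for the tetris dynamics (`sample_block`, `legal_moves`, `score`, `next_board`, `feature_index`); none of these names come from the text.

```python
def two_sample_update(r, x, gamma, sample_block, legal_moves,
                      score, next_board, feature_index):
    """One two-sample gradient step of Example 3.4 at board position x.

    r[k] approximates the value of boards whose feature index is k.
    """
    y, y_bar = sample_block(), sample_block()          # two independent falling blocks

    def value(block, u):                               # g(x, y, u) + J~(f(x, y, u), r)
        return score(x, block, u) + r[feature_index(next_board(x, block, u))]

    # error d(x, y, r), computed with the first sample y
    d = max(value(y, u) for u in legal_moves(x, y)) - r[feature_index(x)]
    # maximizing control for the second, independent sample y_bar
    u_bar = max(legal_moves(x, y_bar), key=lambda u: value(y_bar, u))
    k_next = feature_index(next_board(x, y_bar, u_bar))
    k_curr = feature_index(x)
    r[k_next] -= gamma * d          # parameter of the successor board's feature
    r[k_curr] += gamma * d          # parameter of the current board's feature
    return r
```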
Figure 2.3.2 Block diagram of approximate policy iteration (approximate policy evaluation followed by policy improvement).

Approximate Policy Iteration Using Monte-Carlo Simulation

We now discuss an approximate form of the policy iteration method, where we use approximations $\tilde{J}_\mu(i, r)$ to the cost $J_\mu$ of stationary policies $\mu$, and/or approximations $\tilde{Q}_\mu(i, u, r)$ to the corresponding Q-factors. The theoretical basis of the method was discussed in Section 2.2.2 (cf. Prop. 2.1).

Similar to our earlier discussion on simulation, suppose that for a fixed stationary policy $\mu$, we have a subset of "representative" states $S$ (perhaps chosen in the course of the simulation), and that for each $i \in S$, we have $M(i)$ samples of the cost $J_\mu(i)$. The $m$th such sample is denoted by $c(i, m)$. Then, we can introduce approximate costs $\tilde{J}_\mu(i, r)$, where $r$ is a parameter/weight vector obtained by solving the following least-squares optimization problem
$$\min_r \sum_{i \in S} \sum_{m=1}^{M(i)} \bigl|\tilde{J}_\mu(i, r) - c(i, m)\bigr|^2.$$
Once the optimal value of $r$ has been determined, we can approximate the costs of the policy $\mu$ by $\tilde{J}_\mu(i, r)$. Then, we can evaluate approximate Q-factors using the formula
$$\tilde{Q}_\mu(i, u, r) = \sum_j p_{ij}(u)\bigl(g(i, u, j) + \tilde{J}_\mu(j, r)\bigr), \tag{3.17}$$
Figure 2.3.3 Structure of the approximate policy iteration algorithm.

and we can obtain an improved policy $\overline{\mu}$ using the formula
$$\overline{\mu}(i) = \arg\min_{u \in U(i)} \tilde{Q}_\mu(i, u, r) = \arg\min_{u \in U(i)} \sum_j p_{ij}(u)\bigl(g(i, u, j) + \tilde{J}_\mu(j, r)\bigr), \qquad \text{for all } i. \tag{3.18}$$
We thus obtain an algorithm that alternates between approximate policy evaluation steps and policy improvement steps, as illustrated in Fig. 2.3.2. The algorithm requires a single approximation per policy iteration, namely the approximation $\tilde{J}_\mu(i, r)$ associated with the current policy $\mu$. The parameter vector $r$ determines the Q-factors via Eq. (3.17) and the next policy $\overline{\mu}$ via Eq. (3.18).

For another view of the approximate policy iteration algorithm, note that it consists of four modules (see Fig. 2.3.3):

(a) The simulator, which given a state-decision pair $(i, u)$, generates the next state $j$ according to the correct transition probabilities.

(b) The decision generator, which generates the decision $\overline{\mu}(i)$ of the improved policy at the current state $i$ [cf. Eq. (3.18)] for use in the simulator.

(c) The cost-to-go approximator, which is the function that is consulted by the decision generator for approximate cost-to-go values to use in the minimization of Eq. (3.18).

(d) The optimizer, which accepts as input the sample trajectories produced by the simulator and solves the problem
$$\min_r \sum_{i \in S} \sum_{m=1}^{M(i)} \bigl|\tilde{J}_{\overline{\mu}}(i, r) - c(i, m)\bigr|^2 \tag{3.19}$$
to obtain the approximation $\tilde{J}_{\overline{\mu}}(i, r)$ of the cost of $\overline{\mu}$.

Note that in very large problems, the policy $\overline{\mu}$ cannot be evaluated and stored in explicit form, and thus the minimization in Eq. (3.18) must be carried out "on the fly" during the simulation. When this is the case, the parameter vector $r$ associated with $\mu$ remains unchanged as we evaluate the cost of the improved policy $\overline{\mu}$ by generating the simulation data and by solving the least squares problem (3.19). One way to solve this latter problem is to use gradient-like methods. Given a sample state trajectory $(i_1, i_2, \ldots, i_N, t)$ generated using the policy $\overline{\mu}$, which is defined by Eq. (3.18), the parameter vector $r$ associated with $\overline{\mu}$ is updated by
$$r := r - \gamma \sum_{k=1}^{N} \nabla\tilde{J}_{\overline{\mu}}(i_k, r)\Bigl(\tilde{J}_{\overline{\mu}}(i_k, r) - \sum_{m=k}^{N} g\bigl(i_m, \overline{\mu}(i_m), i_{m+1}\bigr)\Bigr), \tag{3.20}$$
where $\gamma$ is a stepsize and $i_{N+1} = t$. The summation in the right-hand side above is a sample gradient corresponding to a term in the least squares summation of problem (3.19).

We finally mention two variants of the approximate policy iteration algorithm, both of which require additional approximations per policy iteration. In the first variant, instead of calculating the approximate Q-factors via Eq. (3.17), we form an approximation $\tilde{Q}_\mu(i, u, v)$, where the parameter vector $v$ is determined by solving the least squares problem
$$\min_v \sum_{(i,u) \in V}\bigl|\tilde{Q}_\mu(i, u, v) - \tilde{Q}_\mu(i, u, r)\bigr|^2, \tag{3.21}$$
where $V$ is a "representative" set of state-control pairs $(i, u)$, and $\tilde{Q}_\mu(i, u, r)$ is evaluated using Eq. (3.17) and either exact calculation or simulation. This variant is useful if it speeds up the calculations of the policy improvement step [cf. Eq. (3.18)].

In the second variant of the algorithm, we first perform the approximate policy evaluation step to obtain $\tilde{J}_\mu(i, r)$. Then we compute the improved policy $\overline{\mu}(i)$ by the formula (3.18) only for states $i$ in a "representative" subset $S$. We then obtain an "improved" policy $\tilde{\mu}(i, v)$, which is defined over all states, by introducing a parameter vector $v$ and by solving the least squares problem
$$\min_v \sum_{i \in S}\bigl\|\tilde{\mu}(i, v) - \overline{\mu}(i)\bigr\|^2. \tag{3.22}$$
Here, we assume that the controls are elements of some Euclidean space and $\|\cdot\|$ denotes the norm on that space. This approach accelerates the policy improvement step [cf. Eq. (3.18)] at the expense of solving an additional least squares problem per policy iteration.
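Putting the pieces together, an approximate policy iteration loop alternates simulation-based evaluation with the improvement step (3.18). The following sketch assumes a linear architecture and a direct least-squares fit of the sampled costs-to-go; the data layout and all names are assumptions made for the example, not a prescription from the text.

```python
import numpy as np

def approximate_policy_iteration(phi, P, g, controls, start_states,
                                 n_policy_iters, n_trajs, seed=0):
    """Approximate policy iteration with J~(i, r) = phi(i) . r.

    P[u], g[u] : (n x (n+1)) transition and cost arrays under control u;
                 the last column is the terminal state t, with J~(t) = 0.
    Each outer iteration simulates trajectories under the greedy policy
    defined by the current r [cf. Eq. (3.18)] and refits r by least squares
    to the sampled costs-to-go [cf. problem (3.19)].
    """
    rng = np.random.default_rng(seed)
    n_states = next(iter(P.values())).shape[0]
    r = np.zeros(len(phi(0)))

    def greedy(i):                                     # policy improvement, Eq. (3.18)
        J_next = np.append([phi(j) @ r for j in range(n_states)], 0.0)
        return min(controls, key=lambda u: float(P[u][i] @ (g[u][i] + J_next)))

    for _ in range(n_policy_iters):
        rows, targets = [], []
        for _ in range(n_trajs):
            i = int(rng.choice(start_states))
            traj, costs = [], []
            while i != n_states:                       # n_states is the index of t
                u = greedy(i)
                j = int(rng.choice(n_states + 1, p=P[u][i]))
                traj.append(i); costs.append(g[u][i, j]); i = j
            tail = 0.0
            for state, cost in zip(reversed(traj), reversed(costs)):
                tail += cost                           # sampled cost-to-go c(i, m)
                rows.append(phi(state)); targets.append(tail)
        r, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return r, greedy
```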
Approximate Policy Iteration Using TD(1)

Just as there is a temporal differences implementation of Monte-Carlo simulation, there is also a temporal differences implementation of the gradient iteration (3.20). The temporal differences $d_k$ are given by
$$d_k = g\bigl(i_k, \overline{\mu}(i_k), i_{k+1}\bigr) + \tilde{J}_{\overline{\mu}}(i_{k+1}, r) - \tilde{J}_{\overline{\mu}}(i_k, r), \qquad k = 1, \ldots, N, \tag{3.23}$$
and the iteration (3.20) can be alternatively written as follows [just add the equations below using the temporal difference expression (3.23) to obtain the iteration (3.20)]:

Following the state transition $(i_1, i_2)$, set
$$r := r + \gamma d_1\, \nabla\tilde{J}_{\overline{\mu}}(i_1, r). \tag{3.24}$$

Following the state transition $(i_2, i_3)$, set
$$r := r + \gamma d_2\bigl(\nabla\tilde{J}_{\overline{\mu}}(i_1, r) + \nabla\tilde{J}_{\overline{\mu}}(i_2, r)\bigr). \tag{3.25}$$

Following the state transition $(i_N, t)$, set
$$r := r + \gamma d_N\bigl(\nabla\tilde{J}_{\overline{\mu}}(i_1, r) + \nabla\tilde{J}_{\overline{\mu}}(i_2, r) + \cdots + \nabla\tilde{J}_{\overline{\mu}}(i_N, r)\bigr). \tag{3.26}$$

The vector $r$ may be updated at each transition, although the gradients $\nabla\tilde{J}_{\overline{\mu}}(i_k, r)$ are evaluated for the value of $r$ that prevails at the time $d_k$ is generated. Also, for convergence, the stepsize $\gamma$ should diminish over time. A popular choice is to use during the $m$th trajectory $\gamma = c/m$, where $c$ is a constant.

A variant of this method that has been proposed under the name TD($\lambda$) uses a parameter $\lambda \in [0, 1]$ in the formulas (3.23)-(3.26). It has the following form: For $k = 1, \ldots, N$, following the state transition $(i_k, i_{k+1})$, set
$$r := r + \gamma d_k \sum_{l=1}^{k} \lambda^{k-l}\, \nabla\tilde{J}_{\overline{\mu}}(i_l, r).$$
While this method has received wide attention, its validity has been questioned. Examples have been constructed [Ber95b] where the approximating function obtained in the limit by TD($\lambda$) is an increasingly poor approximation to $J_\mu(i)$ as $\lambda$ decreases towards 0, and the approximation obtained by TD(0) is very poor. It is possible, however, to use the two-sample gradient iteration (3.15) for a simulation-based, approximate evaluation of the cost functions of various policies in an approximate policy iteration scheme. This iteration resembles the TD(0) formula but aims at minimizing the error in Bellman's equation.
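For a linear architecture $\tilde{J}_{\overline{\mu}}(i, r) = \phi(i)^{\top} r$, the gradient $\nabla\tilde{J}_{\overline{\mu}}(i, r)$ is simply $\phi(i)$, and the updates (3.24)-(3.26) together with their TD($\lambda$) variant can be written with an accumulating "eligibility" vector. The following is a sketch under those assumptions; with `lam = 1` it reproduces the Monte-Carlo gradient iteration (3.20).

```python
import numpy as np

def td_lambda_parametric(r, traj, costs, phi, t, lam, gamma):
    """TD(lambda)-style updates (3.23)-(3.26) for J~(i, r) = phi(i) . r.

    traj  : [i_1, ..., i_N, t], simulated under the improved policy
    costs : [g(i_1, mu(i_1), i_2), ..., g(i_N, mu(i_N), t)]
    t     : index of the terminal state, for which J~ is taken to be zero
    """
    z = np.zeros_like(r)                     # eligibility: sum of lam**(k-l) * phi(i_l)
    for k in range(len(costs)):
        i_k, i_next = traj[k], traj[k + 1]
        J_k = phi(i_k) @ r
        J_next = 0.0 if i_next == t else phi(i_next) @ r
        d_k = costs[k] + J_next - J_k        # temporal difference, cf. Eq. (3.23)
        z = lam * z + phi(i_k)
        r = r + gamma * d_k * z              # cf. Eqs. (3.24)-(3.26)
    return r
```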
Optimistic Policy Iteration

In the approximate policy iteration approach discussed so far, the least squares problem that evaluates the cost of the improved policy $\overline{\mu}$ must be solved completely for the vector $r$. An alternative is to solve this problem approximately and replace the policy $\mu$ with the policy $\overline{\mu}$ after a single or a few simulation runs. An extreme possibility is to replace $\mu$ with $\overline{\mu}$ at the end of each state transition, as in the next algorithm:

Following the state transition $(i_k, i_{k+1})$, set
$$r := r + \gamma d_k \sum_{l=1}^{k} \nabla\tilde{J}(i_l, r),$$
and generate the next transition $(i_{k+1}, i_{k+2})$ by simulation using the control
$$\overline{\mu}(i_{k+1}) = \arg\min_{u \in U(i_{k+1})} \sum_j p_{i_{k+1}j}(u)\bigl(g(i_{k+1}, u, j) + \tilde{J}(j, r)\bigr).$$
The theoretical convergence properties of this method have not been investigated so far, although its TD($\lambda$) version has been used with success in solving some challenging problems [Tes92].

Variations Involving Multistage Lookahead

To reduce the effect of the approximation error
$$\tilde{J}_\mu(i, r) - J_\mu(i)$$
between the true and approximate costs of a policy $\mu$, one can consider a lookahead of several stages in computing the improved policy $\overline{\mu}$. The method adopted earlier for generating the decisions $\overline{\mu}(i)$ of the improved policy,
$$\overline{\mu}(i) = \arg\min_{u \in U(i)} \sum_j p_{ij}(u)\bigl(g(i, u, j) + \tilde{J}_\mu(j, r)\bigr), \qquad \text{for all } i,$$
corresponds to a single stage lookahead. At a given state $i$, it finds the optimal decision for a one-stage problem with stage cost $g(i, u, j)$ and terminal cost (after the first stage) $\tilde{J}_\mu(j, r)$. An $m$-stage lookahead version finds the optimal policy for an $m$-stage problem, whereby we start at the current state $i$, make the $m$ subsequent decisions with perfect state information, incur the corresponding costs of the $m$ stages, and pay a terminal cost $\tilde{J}_\mu(j, r)$, where $j$ is the state after $m$ stages. This is a finite horizon stochastic optimal control problem that may be tractable, depending on the horizon $m$ and the number of possible successor states from each state. If $\overline{u}_1(i)$ is the first decision of the $m$-stage lookahead optimal policy starting at state $i$, the improved policy is defined by $\overline{\mu}(i) = \overline{u}_1(i)$. Note that if $\tilde{J}_\mu(\cdot, r)$ is equal to the exact cost $J_\mu(\cdot)$ for all states, that is, there is no approximation, the multistage version of policy iteration can be shown to terminate with an optimal policy under the same conditions as ordinary policy iteration (see Exercise 2.16).

Multistage lookahead can also be used in the real-time calculation of a suboptimal control policy, once an approximation $\tilde{J}(i, r)$ of the optimal cost has been obtained by any one of the methods of this subsection. An example is the computer chess programs discussed in Section 6.3 of Vol. I. In that case, the approximation of the cost-to-go function (the position evaluator discussed in Section 6.3 of Vol. I) is relatively primitive. It is derived from the features of the position (material balance, piece mobility, king safety, etc.), appropriately weighted with factors that are either heuristically determined by trial and error, or (in the case of a champion program, IBM's Deep Thought) by training on examples from grandmaster play. It is well known that the quality of play of computer chess programs crucially depends on the size of the lookahead. This indicates that in many types of problems, the multistage lookahead versions of the methods of this subsection should be much more effective than their single stage lookahead counterparts. This improvement in performance must of course be weighed against the considerable increase in computation required to optimally solve the associated multistage problems.

Approximation in Policy Space

We finally mention a conceptually different approximation possibility that aims at direct optimization over policies of a given type. Here we hypothesize a stationary policy of a certain structural form, say $\tilde{\mu}(i, r)$, where $r$ is a vector of unknown parameters/weights that is subject to optimization. We also assume that for a fixed $r$, the cost of starting at $i$ and using the stationary policy $\tilde{\mu}(\cdot, r)$, call it $\tilde{J}(i, r)$, can be evaluated by simulation. We may then minimize over $r$
$$E\bigl\{\tilde{J}(i, r)\bigr\}, \tag{3.27}$$
where the expectation is taken with respect to some probability distribution over the set of initial states. This minimization will typically be carried out by some method that does not require the use of gradients if the gradient of $\tilde{J}(i, r)$ with respect to $r$ cannot be easily calculated. If the simulation can produce the value of the gradient $\nabla\tilde{J}(i, r)$ together with $\tilde{J}(i, r)$, then a gradient-based method can be used. Generally, the minimization of the cost function (3.27) tends to be quite difficult if the dimension of the parameter vector $r$ is large (say over 10). As a result, the method is most likely effective only when adequate optimal policy approximation is possible with very few parameters.

2.3.4 Extension to Discounted Problems

We now discuss adaptations of the simulation-based methods for the case of a discounted problem. Consider first the evaluation of policies by simulation. One difficulty here is that trajectories do not terminate, so we cannot obtain sample costs corresponding to different states. One way to get around this difficulty is to approximate a discounted cost by a finite horizon cost of sufficiently large horizon. Another possibility is to convert the $\alpha$-discounted problem to an equivalent stochastic shortest path problem by introducing an artificial termination state $t$ and a transition probability $1 - \alpha$ from each state $i \ne t$ to the termination state $t$. The remaining transition probabilities are scaled by multiplication with $\alpha$ (see Vol. I, Section 7.3). Bellman's equation for this stochastic shortest path problem is identical with Bellman's equation for the original $\alpha$-discounted problem, so the optimal cost functions and the optimal policies of the two problems are identical.

The preceding approaches may lead to long simulation runs, involving many transitions. An alternative possibility that is useful in some cases is based on identifying a special state, called the reference state, that is assumed to be reachable from all other states under the given policy. Suppose that such a state can be identified and for concreteness assume that it is state 1. Thus, we assume that the Markov chain corresponding to the given policy has a single recurrent class and state 1 belongs to that class (see Appendix D of Vol. I). If there are multiple recurrent classes, the procedure described in what follows can be modified so that there is a reference state for each class.

To simplify notation, we do not show the dependence of various quantities on the given policy. In particular, the transition probability from $i$ to $j$ and the corresponding stage cost are denoted by $p_{ij}$ and $g(i, j)$, in place of $p_{ij}\bigl(\mu(i)\bigr)$ and $g\bigl(i, \mu(i), j\bigr)$, respectively. For each initial state $i$, let $C(i)$ denote the average discounted cost incurred up to reaching the reference state 1. Let also $m_i$ denote the first passage time from state $i$ to state 1, that is, the number of transitions required to reach state 1 starting from state $i$. Note that $m_i$ is a random variable. We denote
$$D(i) = E\{\alpha^{m_i}\}.$$
By dividing the cost $J_\mu(i)$ into the portion up to reaching state 1 and the remaining portion starting from state 1, we have
$$J_\mu(i) = C(i) + D(i)\, J_\mu(1). \tag{3.28}$$
Applying this equation for $i = 1$, we have
$$J_\mu(1) = C(1) + D(1)\, J_\mu(1),$$
so that
$$J_\mu(1) = \frac{C(1)}{1 - D(1)}. \tag{3.29}$$
Combining Eqs. (3.28) and (3.29), we obtain
$$J_\mu(i) = C(i) + \frac{D(i)\, C(1)}{1 - D(1)}, \qquad i = 1, \ldots, n.$$
Therefore, to calculate the cost vector $J_\mu$, it is sufficient to calculate the costs $C(i)$ and in addition to calculate the expected discount terms $D(i)$. Both of these can be computed, similar to the stochastic shortest path problem, by generating many sample system trajectories, and averaging the corresponding sample costs and discount terms up to reaching the reference state 1. Note here that because $C(1)$ and $D(1)$ crucially affect the calculated values $J_\mu(i)$, it may be worth doing some extra simulations starting from the reference state 1 to ensure that $C(1)$ and $D(1)$ are accurately calculated.

Once a simulation method is available to evaluate (perhaps approximately) the cost of various policies, it can be embedded within a (perhaps approximate) policy iteration algorithm along the lines discussed for the stochastic shortest path problem.
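The reference-state decomposition is easy to estimate by simulation: run trajectories until the reference state is reached, average the discounted costs as samples of $C(i)$ and the terms $\alpha^{m_i}$ as samples of $D(i)$, and combine them via Eqs. (3.28)-(3.29). A minimal sketch follows, with state index 0 playing the role of the reference state 1 of the text; the data layout and names are assumptions made for the example.

```python
import numpy as np

def discounted_eval_reference_state(P, g, alpha, n_runs_per_state, seed=0):
    """Estimate J_mu via J_mu(i) = C(i) + D(i) * C(1) / (1 - D(1)).

    P : n x n matrix; row i is the successor distribution under the fixed policy
    g : n x n matrix of stage costs g(i, j); state 0 is the reference state
    """
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    C = np.zeros(n)     # average discounted cost up to reaching the reference state
    D = np.zeros(n)     # average of alpha ** (first passage time to the reference state)
    for i in range(n):
        c_samples, d_samples = [], []
        for _ in range(n_runs_per_state):
            state, disc, cost = i, 1.0, 0.0
            while True:
                nxt = int(rng.choice(n, p=P[state]))
                cost += disc * g[state, nxt]
                disc *= alpha
                state = nxt
                if state == 0:              # reached the reference state
                    break
            c_samples.append(cost); d_samples.append(disc)
        C[i], D[i] = np.mean(c_samples), np.mean(d_samples)
    J1 = C[0] / (1.0 - D[0])                # cf. Eq. (3.29)
    return C + D * J1                       # cf. Eq. (3.28)
```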
120 SI < »< /ь'1-s/ i< .S'll. II I Z'.st I'llth I ' IX il >/</1IS ( 7м/,. 2 equation. With similar observal ions, il. follows that, the vector of (/Tai lors can lie obtained by the value iteration II / « I j) I " min <?(.>-"') I - \ "'^.n J The C/lezirnitig method is an approximate version of this iteration, whereby the expected value is replaced by a single sample, i.e., Q(i, u) := Q(i, u) + 7 ( </(/, и, j) + о min Q(j. u') ~ Q(l, u) ) . Here j and </(?', u.j) are generated from the pair (z, u) by simulation, that, is, according to the transition probabilities p,j(zz). We finally note that approximation based on minimization of the error in Bellman’s equation can also be used in the case of a discounted cost. One simply needs to introduce the discount factor al. the appropriate places in the various iterations given above. For example, the variant of iteration (3.15) for evaluating the discounted cost of a. policy /z, is г r - -)<//,(/. ), r)(nV.r) - V.7,,(z, /)), where n is the discount, factor and 'M'OW) = +<'•//,(./, r) - ./,,(/,/). 2.3.5 The Role of Parallel Computation It is well-known that. Monte-Carlo simulation is very well-suited for parallelization; one can simply carry out multiple simulation runs in par- allel and occasionally merge the results. Also several DP-related methods are well-suited for parallelization; for example, each value iteration can be parallelized by ext,cut.ing the cost, updates of different states in different parallel processors (see e.g., [ЛМТ93]). In fact, the parallel updates can be asynchronous. By this we mean (hat dilferent. processors may execut,e cost updates as fast as they can, without waiting to acquire the most recent updates from other processors; these latter updates may be late in com- ing because some of the other processors may be slow or because some of the communication channels connecting the processors may be slow. Asyn- chronous parallel value iteration can be shown Io have t In1 same convergence properties as its synchronous counterpart, and is often substantially faster. We refer to [BertS'da] and [BeT<S!)a] lor an extensive discussion. There are similar parallelization possibilities in approximate DP. In- deed, approximate policy iteration may be viewed as a combination of two opera! ions:
St'c. 2.1 Sources. ;in<l i<,\ui rises 121 (a) SimnJalion. which produces many pairs (/,c(/)) of states i and sample costs c(/) associated with the improved policy p. (b) Tiainiiiy. which obtains the st at e-sample cost pairs produced by the simulator and uses them in the least-squares optimization ol the pa- rameter vector? of the approximate cost function ./,,(•• r). The simulation operation can be parallelized in the usual way by executing multiple independent simulations in multiple processors. The training operation can also be parallelized to a great extent. 1'or example, one may parallelize the gradient iteration .'V A-zz-l that is used for training [cf. Eq. (3.20)]. There are two possibilities here: (1) To assign different components of T to different processors and to execute the component updates in parallel. (2) To parallelize the computation of the sample gradient ;V к - i in the gradient iteration, by assigning different blocks of state-sample cost pairs to different processors. There are several straightforward versions of these parallelization methods, and it is also valid to use asynchronous versions of them ([BcTSIJa], Ch. 7). There is still another parallelizal ion approach for the training pro- cess. It. is possible to divide the state space S into several subsets 5,„. m = 1.......ЛЛ and to calculate a different approximation lor each subset. S,„. In other words, the parameter vector c,„ that is used to calculate the approximate cost. ./g-(/.?,„) depends on (he subset. 5,„ Io which state i belongs. The parameters r,„ can be obtained by a parallel training process using the applicable simulation data, that is, the state- sample cost pairs (/.<(/’)) wit h / C .S’„,. Note t hat, (.lie ext reme case where each set Sm corresponds to a single state, corresponds to the case where then' is no approximation. 2.4 NOTES, SOURCES, AND EXERCISES The analysis of the stochastic shortest pat h problems of Sect ion 2.1 is taken from [BeTSDa] and [BeTDlb], The latter reference proves the results
122 Stiiehiistic Shortest Path Problems Chnp. 2 shown here under a. more genera,! compart ness assumption on U(i) and continuity assumption on y(i, u) and Pijtji-)- Stochastic shortest path prob- lems were first formulated a,nd analyzed in [EaZ(>2] under the assumpt ion > 0 for all /. — 1,..., n and и & U(i). Finitely terminating value iteration algorithms have been developed for several types of stochastic shortest, pal,h problems (see [NgPSS], [I’oT92], .[PsT93], [Tsi93a]). The use of a Dijkstra-Iike algorithm for continuous space shortest path problems involving a, consistently improving policy was proposed in [Tsi93a] (see Ex- ercise 2.10). A Dijkstra-like algorithm was also proposed for another class of problems involving a consistently improving policy in [Ngl’8.8]. The algorithm of Exercise 2.11 is new in the general form given here. The er- ror bound on the performance of approximate policy iteration (Prop. 2.1), which was developed in collaboration witli ,1. Tsitsiklis, is also new. Two- player dynamic game versions of the stochastic shortest path problem have been discussed in [PoA69] (see also the survey [Ra.F91]). Several approximation methods that are not based on simulation were given in [ScS85]. The interest in simulation-based methods is relatively re- cent . In t he art ilicial intelligence community, t hese met hods are collect ivel v referred Io as /< и ijo i< < i и c nl Icariiiiiy. hi t he engineering community, these methods are also referred to as neuio-dipiumic proyranvminy. The method of temporal differences was proposed in an influential paper by Sutton [Sut,88]. Q-learning was proposed by Watkins [Wat,89]. A convergence proof of Q-learning under fairly weak assumptions was given in [Tsi94]; see also [J.JS93], which discusses the convergence, of TD(A). For a nice survey of related methods, which also includes historical references, see [BBS93]. A variant of C^-learningc is the method of advantage updating developed in [Bai93], [Bai94], [Bai95], and [IIBK94] (see Exercise 2.18). The ma- terial on feature-based aggregation has been adapted from [TsV94], The two-sample simulation-based gradient method for minimizing the error in Bellman's equation was proposed in [Ber95b[; see also [11ВК94]. The opti- mistic policy iteration method was used in an application to backgammon described in [Tes92], EXERCISES 2.1 Suppose (hat you want to travel from a start point S' to a, destination point L) in niininnini average time. There are two options: (I) Use a direct route that requires a. lime units.
(2) Take a potential shortcut that requires b time units to go to an intermediate point I. From I you can either go to the destination D in c time units or return to the start (this will take an additional b time units). You will find out the value of c once you reach the intermediate point I. What you know a priori is that c has one of the m values c_1, ..., c_m, with corresponding probabilities p_1, ..., p_m.

Consider two cases: (i) the value of c is constant over time, and (ii) the value of c changes each time you return to the start, independently of its value at the previous time periods.

(a) Formulate the problem as a stochastic shortest path problem. Write Bellman's equation and characterize the optimal stationary policies as best as you can in terms of the given problem data. Solve the problem for the case a = 2, b = 1, c_1 = 0, c_2 = 5, p_1 = 0.5, p_2 = 0.5.

(b) Formulate as a stochastic shortest path problem the variation where once you reach the intermediate point I, you can wait there. Each d time units the value of c changes to one of the values c_1, ..., c_m with probabilities p_1, ..., p_m, independently of its earlier values. Each time the value of c changes, you have the option of waiting for an extra d units, returning to the start, or going to the destination. Characterize the optimal stationary policies as best as you can.

2.2

A gambler engages in a game of successive coin flipping over an infinite horizon. He wins one dollar each time heads comes up, and loses m > 0 dollars each time two successive tails come up (so the sequence TTTT loses 3m dollars). The gambler at each time period either flips a fair coin or else cheats by flipping a two-headed coin. In the latter case, however, he gets caught with probability p > 0 before he flips the coin, the game terminates, and the gambler keeps his earnings thus far. The gambler wishes to maximize his expected earnings.

(a) View this as a stochastic shortest path problem and identify all proper and all improper policies.

(b) Identify a critical value m̄ such that if m > m̄, then all improper policies give an infinite cost for some initial state.

(c) Assume that m > m̄, and show that it is then optimal to try to cheat if the last flip was tails and to play fair otherwise.

(d) Show that if m < m̄ it is optimal to always play fair.

2.3

Consider a stochastic shortest path problem where all stationary policies are proper. Show that for every policy π there exists an m > 0 such that
P(x_m = t | x_0 = i, π) > 0,

for all i = 1, ..., n.

Abbreviated Proof: Assume the contrary; that is, there exists a nonstationary π = {μ_0, μ_1, ...} and an initial state i such that P(x_m = t | x_0 = i, π) = 0 for all m. For each state j, let m(j) be the minimum integer m such that state j is reachable from i with positive probability under policy π; that is,

m(j) = min{ m | P(x_m = j | x_0 = i, π) > 0 },

where we adopt the convention that m(j) = ∞ if j is not reachable from i under π, i.e., P(x_m = j | x_0 = i, π) = 0 for all m. In particular, we have m(i) = 0 and m(t) = ∞. Consider any stationary policy μ such that

μ(j) = μ_{m(j)}(j), for all j with m(j) < ∞.

Argue that for any two states j and j' with m(j) < ∞ and m(j') = ∞, we have p_{jj'}(μ(j)) = 0. Thus, states j' with m(j') = ∞ (including t) are not reachable under the stationary policy μ from states j with m(j) < ∞ (including i), thereby contradicting the hypothesis.

2.4

Consider the stochastic shortest path problem, and assume that g(i, u) ≤ 0 for all i and u ∈ U(i). Show that either the optimal cost is −∞ for some initial state, or else, under every policy, the system eventually enters with probability one a set of cost-free states and never leaves that set thereafter.

2.5

Consider the stochastic shortest path problem, and assume that there exists at least one proper policy. Proposition 1.2 implies that if, for each improper policy μ, we have J_μ(i) = ∞ for at least one state i, then there is no improper policy μ̄ such that J_μ̄(j) = −∞ for at least one state j. Give an alternative proof of this fact that does not use Prop. 1.2. Hint: Suppose that there exists an improper policy μ' such that J_μ'(j) = −∞ for at least one state j. Combine this policy with a proper policy to produce another improper policy μ'' for which J_μ''(i) < ∞ for all i.

2.6 (Gauss-Seidel Method for Stochastic Shortest Paths)

Show that the Gauss-Seidel version of the value iteration method for stochastic shortest paths converges under the same assumptions as the ordinary method (Assumptions 1.1 and 1.2). Hint: Consider two functions J and J' that differ by a constant from J* at all states except the destination, and are such that J ≤ TJ and TJ' ≤ J'.
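A minimal sketch of the Gauss-Seidel value iteration of Exercise 2.6 for a finite stochastic shortest path problem is given below; it updates one state at a time, always using the most recently computed values. The two-state problem data at the end are hypothetical.

```python
import numpy as np

def gauss_seidel_vi(g, P, num_sweeps=1000, tol=1e-10):
    """g[i][u]: expected cost per stage; P[i][u]: transition probabilities over
    states 0..n-1 plus the termination state (index n, cost-free and absorbing).
    Performs in-place (Gauss-Seidel) sweeps of the mapping T."""
    n = len(g)
    J = np.zeros(n + 1)               # J[n] = 0 is the termination state
    for _ in range(num_sweeps):
        max_change = 0.0
        for i in range(n):            # update state i with the latest J values
            new_val = min(g[i][u] + P[i][u] @ J for u in range(len(g[i])))
            max_change = max(max_change, abs(new_val - J[i]))
            J[i] = new_val
        if max_change < tol:
            break
    return J[:n]

# Hypothetical 2-state example: control 0 may stay, control 1 terminates with certainty.
g = [[1.0, 2.0], [1.0, 3.0]]
P = [[np.array([0.5, 0.0, 0.5]), np.array([0.0, 0.0, 1.0])],
     [np.array([0.0, 0.6, 0.4]), np.array([0.0, 0.0, 1.0])]]
print(gauss_seidel_vi(g, P))          # expect approximately [2.0, 2.5]
```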
Sec. 2. I Noles. Sources, ;tu<l Hxercisex 125 2.7 (Sequential Space Decomposition) Consider the stochastic shortest path problem, and suppose that there is a finite sequence of subsets of stales Si, S2,. • •, S/ii such that each of tin? states i — belongs t.o one and only one of these' subsets, and the following property holds: For all iu = 1.....jW and states i € Su,,, t.he successor state j is either the termination state t or else belongs to one of the subsets 5,,,, Sltl 1.. .., .S’i for all choices of the control ?/ C I ’(/). (a) Show that the solution of this problem decomposes into the solution of iM stochastic shortest path problems, each involving the states in a subset Sm plus a termination state. (b) Show also that a Unite horizon problem with N stages can be viewed as a stochast ic shortest path problem with the property given above. 2.8 Consider a stochastic shortest path problem under Assumptions 1.1 and 1.2. Assuming < 1 for all / / and и C ('(/)„ consider another stochastic shortest path problem that has transition probabilities р,Д»)/(1 for all i 7^ / and j 7^ i, and costs = + ________ (a) Show that the two problems are equivalent in that they have the same optimal costs and policies. How would you deal with the case where pM(u) — 1 for some / t and 11. C U(i\! (b) Interpret y(f,u) <us an average cost, incurred between arrival to st,ate i and transition to a state j 7^ /. 2.9 (Simplifications for Uncontrollable State Components) Consider a stochastic short (’.st path problem under Assumptions 1.1 and 1.2. where t he state is a composite? (z, y) of two components i and y, and the (‘vo- lution of I ho main component, i can be direct ly alFccted by the cont rol u, but, the evolution of the other component у cannot (cf. Section 1.4 and Exercise 1.22 of Vol. I). In particular, we assume that, given the state (/,]/) and the control u, the next state (j, z) is determined a.s follows: first j is generated ac- cording to transition probabilities y), and t hen z is generated according to conditional probabilities p(z | j) that depend on the main component j of the new state. We also assume that the cost, per stage* is y(i,y, u,j) and does not depend on the second component z of the next state' (./, z). For functions J(i), z — 1. • •., m consider the mapping ti min /J/'oO'-'/K'/b’-//- “'j} +./(./)) (f./)(/) = />(.</1 n
_ д 126 St < S/iorte.st l\ilh I */< >1 >len is (.'hup. 2 aiul (he, corresponding mapping of a stationary policy p,, (7;(J)(0 = 52p(.V I + V(j)) у j = i (a) Show that J -- 7’7 is a form of Bellman’s equation and can be used to characterize tin* optimal stationary policies, Hint. Given define Л') = E,,/'(//10-Л', .'/)• (b) Show the validity of a modified value iteration algorithm that starts with an arbitrary function ./ and sequentially produces T.1, T2J, ... (c) Show the validity of a modified policy iteration algorithm whose typical iteration, given the current policy //(«, y), consists of two steps: (1) The policy evaluation step, which computes the unique function .7 у that solves the linear system of equations .7= 7’р:.7(р-. (2) The policy improvement step, which compute's the improved policy /E 1 (/, y) from t he equation 7^t , i / * - - (d) Suppose that у is the only source of randomness in the problem; that is, for each (t,y, a), there is a state j such that p,j(u,y) = 1. Justify the use of the following single sample version of value iteration (cf. the Q-lcarmng algorithm of Section 7.G.2) .7(i) :=./(/)+7 ( min [y(i, y, u,./) + ,7(y)l - J(y) Here, given /, we generate у according to the probability distribution p(y | /), and j is the unique state corresponding to (i,y. a). 2.10 (Discretized Shortest Path Problems [Tsi93a]) Suppose that the states are the grid points of a grid on the plane. The set of neighbors of each grid point ,r is denoted f/(.r) and includes between two and lour grid points. At. each grid point ,r, we have two options: (1) Choose two neighbors j:4 , .r £ U(./) and a probability p 6 [0, I], pay a cost. y(.r) у/p~ + (I — p)2, and move to .r+ or to .r with probability p or 1 — p, respectively. Here у is a function such that g(x) > 0 for all x. (2) Slop and pay a cost /(.r). Show that there exists a consistently improving optimal policy for this prob- han. Null'-. This problem can be used to model discretized versions of deter- ministic continuous space 2-dimensional shortest path problems. (Compare also with Exercise (i.l I in Chapter (i of Vol. 1.)
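The single-sample value iteration of Exercise 2.9(d) above can be illustrated with a small sketch. The data below (two main components, an uncontrollable component y, deterministic successors, constant stage costs) are hypothetical and serve only to show the update J(i) := J(i) + γ(min_u [g(i, y, u, j) + J(j)] − J(i)).

```python
import random

# Hypothetical data: main components i in {0, 1}, uncontrollable component y in {0, 1},
# termination state 't'. succ[(i, y, u)] is the deterministic next main component.
p_y  = {0: [0.5, 0.5], 1: [0.3, 0.7]}                    # p(y | i)
succ = {(0, 0, 0): 1, (0, 0, 1): 't', (0, 1, 0): 't', (0, 1, 1): 1,
        (1, 0, 0): 0, (1, 0, 1): 't', (1, 1, 0): 't', (1, 1, 1): 0}
cost = {k: 1.0 for k in succ}                            # g(i, y, u, j), constant here
J    = {0: 0.0, 1: 0.0, 't': 0.0}                        # J(t) is kept at zero
gamma = 0.1                                              # stepsize

for _ in range(20000):
    i = random.choice([0, 1])
    y = random.choices([0, 1], weights=p_y[i])[0]        # sample y according to p(. | i)
    # single-sample value iteration step
    best = min(cost[(i, y, u)] + J[succ[(i, y, u)]] for u in (0, 1))
    J[i] += gamma * (best - J[i])

print(J)
```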
2.11 (Dijkstra's Algorithm and Consistently Improving Policies)

Consider the stochastic shortest path problem under Assumptions 1.1 and 1.2, and assume that there exists a consistently improving optimal stationary policy.

(a) Show that the transition probability graph of this policy is acyclic.

(b) Consider the following algorithm, which maintains two subsets of states P and L, and a function J defined on the state space. (To relate the algorithm with Dijkstra's method of Section 2.3.1 of Vol. I, associate J with the node labels, L with the OPEN list, and P with the subset of nodes that have already exited the OPEN list.) Initially, P = ∅, L = {t}, and

J(i) = ∞ if i ≠ t, J(t) = 0.

At the typical iteration, select a state j* from L such that

j* = arg min_{j∈L} J(j).

(If L is empty the algorithm terminates.) Remove j* from L and place it in P. In addition, for all i ∉ P such that there exists a u ∈ U(i) with p_{ij*}(u) > 0 and p_{ij}(u) = 0 for all j ∉ P, define

Û(i) = { u ∈ U(i) | p_{ij*}(u) > 0 and p_{ij}(u) = 0 for all j ∉ P },

set

J(i) := min [ J(i), min_{u∈Û(i)} ( g(i, u) + Σ_{j=1}^n p_{ij}(u) J(j) ) ],

and place i in L if it is not already there. Show that the algorithm is well defined in the sense that Û(i) is nonempty and the set L does not become empty until all states are in P. Furthermore, each state i is removed from L once, and at the time it is removed, we have J(i) = J*(i).
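A minimal sketch of the iteration in part (b) above is given below for a finite problem. The array-based interface for g(i, u) and p_ij(u) is a hypothetical encoding, and the sketch assumes the data admit a consistently improving optimal policy (otherwise some labels may remain infinite).

```python
import math

def dijkstra_like(g, P, n):
    """g[i][u]: expected cost of control u at state i; P[i][u][j]: transition probability,
    with j ranging over states 0..n-1 and the termination state t = n.
    P_set holds finished states, L the candidate list, J the labels."""
    t = n
    J = [math.inf] * n + [0.0]
    P_set, L = set(), {t}
    while L:
        j_star = min(L, key=lambda j: J[j])       # state with smallest label
        L.remove(j_star)
        P_set.add(j_star)
        for i in range(n):
            if i in P_set:
                continue
            # U_hat(i): controls that reach j* with positive probability and whose
            # successors all lie in P_set
            U_hat = [u for u in range(len(g[i]))
                     if P[i][u][j_star] > 0
                     and all(P[i][u][j] == 0 for j in range(n + 1) if j not in P_set)]
            if U_hat:
                best = min(g[i][u] + sum(P[i][u][j] * J[j] for j in P_set) for u in U_hat)
                J[i] = min(J[i], best)
                L.add(i)
    return J[:n]
```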
2.12 (Alternative Assumptions for Prop. 1.2)

Consider a variation of Assumption 1.2, whereby we assume that g(i, u) ≥ 0 for all i and u ∈ U(i), and that there exists an optimal proper policy. Prove the assertions of Prop. 1.2, except that in part (a), uniqueness of the solution of Bellman's equation should be shown within the set { J | J ≥ 0 } (rather than within ℜ^n), and the vector J in part (b) must belong to { J | J ≥ 0 }.

Hint: Proposition 1.1 is not valid, so a somewhat different proof is needed. Complete the details of the following argument. The assumptions guarantee that J* is finite and J* ≥ 0. [We have J* ≥ 0 because g(i, u) ≥ 0, and J*(i) < ∞ because a proper policy exists.] The idea now is to show that J* ≥ TJ*, and then to choose μ such that T_μ J* = TJ* and show that μ is optimal and proper. Let π = {μ_0, μ_1, ...} be a policy. We have

J_π(i) = g(i, μ_0(i)) + Σ_j p_{ij}(μ_0(i)) J_{π_1}(j), for all i,

where π_1 is the policy {μ_1, μ_2, ...}. Since J_{π_1} ≥ J*, we obtain

J_π(i) ≥ g(i, μ_0(i)) + Σ_j p_{ij}(μ_0(i)) J*(j) = (T_{μ_0} J*)(i) ≥ (TJ*)(i).

Taking the infimum over π in the preceding equation, we obtain

J* ≥ TJ*.    (4.1)

Let μ be such that T_μ J* = TJ*. From Eq. (4.1), we have J* ≥ T_μ J*, and using the monotonicity of T_μ, we obtain

J* ≥ T_μ^N J* = P_μ^N J* + Σ_{k=0}^{N−1} P_μ^k g_μ ≥ Σ_{k=0}^{N−1} P_μ^k g_μ.

By taking limit superior as N → ∞, we obtain J* ≥ J_μ. Therefore, μ is an optimal proper policy, and J* = J_μ. Since μ was selected so that T_μ J* = TJ*, we obtain, using J* = J_μ and J_μ = T_μ J_μ, that J* = TJ*. For the rest of the proof, use the same vector construction as in the proof of Prop. 1.2.

2.13 (A Contraction Counterexample)

Consider a stochastic shortest path problem with a single state 1, in addition to the termination state t. At state 1 there are two controls u and u'. Under u the cost is 1 and the system remains in state 1 for one more stage; under u' the cost is 2 and the system moves to t. Show that Assumptions 1.1 and 1.2 are satisfied, but T is not a contraction mapping with respect to any norm.

2.14 (Contraction Property - All Stationary Policies are Proper)

Assume that all stationary policies are proper. Show that the mappings T and T_μ are contraction mappings with respect to some weighted sup norm

‖J‖ = max_{i=1,...,n} |J(i)| / r_i,
See. 2.1 /Votes. Sourees, ;iii<l l‘,\'<Tcis<‘,s I 29 where r is a vector whose components ri,...,r„ art' positive. Abbra 14 a I ctl proof (Jioni [lleTSlhif p. 225: scr also iTseDO])-. Partition the state space as follows. Let 5| = {1 } and for к -- 2.3 deline sequent tally 5k — \ i i / 5i U • • • U 5/,-. । and niin max > 0 I 1 u C ( ' (/1 / ё S', L । -. U >’д. j Let 5„, be t he last of these sets that is nonempty. We claim t hat I lie sets 5\ cover the ent ire state space, that is, Uj," 15\- — 5. To see this, suppose that the set 5V = | i (д U^L|5a- } is nonempty. Then for each t C 5X, there exists some u, € L/(i) such that р,7(п,) - 0 for ah j S\. Take any p such that //(/) — u, for all / € 5>_. The stationary policy p satisfies [/’; ],7 = () for all i У .S\. j 5\. and A', and t hrrefore cannot be proper. 'This cont radict s the hypothesis. We will choose a vector r > 0 so that T is a contraction mapping with respect to || • ||P. We will lake the /th component r, to be the sauir for states i in the sana' set 5д. In particular, we will choose the components r, of t he vector r by r, иr if i G 5 a , where ip.....//„, are appropriately chosen scalars satisfying 1 = J/i < !)-2 << y,„. (1-3) Let and notc t hat 0 < < < 1. W’e will show t hat it is snlliciciit to choose .......//,„ so that for some у < 1, we have - r) + ——< < 7 < 1, -......Ill, (1.5) !/k !lk and then show that such a choice of .............ytll exists. Indeed, for vectors ./ and J' in 3?", let // be* such that T^J = T.l. Then we have* for all /. (7V')(,) - (Т./Ш) -= (Г.7')(,) - (7’„./)(0 < (7’„./')(,)- (7),./)(,) (1Я Let A'(j) be such that / belongs to t he s<‘l 5д.(;. Then we have lor any constant
130 Stochastic Shortest Pnth Problems Ch;tp. 2 and Eq. (d.(>) iniplie.s that for all /./ )(/. -- //)(/) I x -------------------------- < -------- ) Pu !/kt, i ) Рч(р(1)) .'Л-(0 I Pi;i (/'•(')) + ---- IMO №0) I ,'ЛО) ?A-(.) / .'/Ao) where I he second inequality follows from Eq. (4.3), the third inequality uses Eq. (1.1) and the fact /;*о) -1 ~ Vm < 0, and the last inequality follows from (TJ'W)- (T.IW) , II. and we obtain lor all .J,.!' £ 'R" with ||./ - ./'|| We now show how to choose' the scalars ,yi, i/-j, . . . , i/,„ so that Eqs. (4.3) and (1.5) hold. Let y/i> = (I, i/i — I, and suppose that y,, i/>,..., /д have been chosen. If ( — 1, we choose /д । । = ///, + 1. If c < 1, we choose ijk-t-1 fo be .'A I i = ^(.'A + Л/А.), Л/а- = min Using (.he fact Mi.- । i = min !P- ЛЛ.Н- A/,„ = min which implies Eq. (1.5).
See. 2.1 Notes. Sources, and hs, rciscs 131 2.15 (Multiple State Visits in Monte Carlo Simulation) Argue that the Monte-Carlo simulation formula M') = M lim —г V c(i, in) m -k M [cf. Eq. (3.1)] is valid even if a state may be revisited within the same sample trajectory. Hint.: Suppose the Л/ cost samples are generated from A’ trajec- tories. and that the A:th trajectory involves n*. visits to state i and generates 7Ц. corresponding cost samples. Denote nik = ii\ + • + Hi.-. Write л; lim 1 37 } and argue that (or see [Ros83b], Cor. 7.2.3 for a closely related result). 2.16 (Multistage Lookahead Policy Iteration) (a) Consider the stochastic shortest path problem under Assumptions 1.1 and 1.2. Let /i be a stationary policy, let J be a function such that TJ < .1 < (J = Jh is one possibility), and let {/7(l, рч,. .., /7Л_ ।} be an optimal policy for the rV-stage problem with terminal cost function i.e. TT,kT,'4^k' 'J = TK~I'.J, k -- 0,1,.. . ,N - 1. (a) Show that, .7^ < Л- for all /; -0,1.........N - 1. Hint: hirst show that 7,A 1 1./ < 7,A./ < J lor all /<:, and then show that, the hypothesis T~kTN~k-1 J = T'V”A J implies that .1^. < TN~k~'.I. (b) Use part (a) to show the validity of the multistage policy iteration algorithm discussed in Section 2.3.3.
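The every-visit estimate of Exercise 2.15 above can be illustrated with a small simulation sketch; the single-policy Markov chain and stage costs below are hypothetical, and every visit to a state within a trajectory contributes one cost sample.

```python
import random

# Hypothetical single-policy chain: states 0, 1 and termination state 't'.
next_state = {0: ([1, 't'], [0.7, 0.3]), 1: ([0, 't'], [0.4, 0.6])}
stage_cost = {0: 1.0, 1: 2.0}

def sample_trajectory(start):
    """Simulate until termination; return the list of visited states."""
    path, i = [], start
    while i != 't':
        path.append(i)
        states, probs = next_state[i]
        i = random.choices(states, weights=probs)[0]
    return path

samples = {0: [], 1: []}              # cost samples c(i, m), one per visit to i
for _ in range(5000):
    path = sample_trajectory(random.choice([0, 1]))
    costs = [stage_cost[i] for i in path]
    for k, i in enumerate(path):
        # every visit to state i yields the cost-to-go accumulated from that visit on
        samples[i].append(sum(costs[k:]))

J_hat = {i: sum(v) / len(v) for i, v in samples.items()}
print(J_hat)
```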
2.17 (Viewing Q-Factors as Optimal Costs)

Consider the stochastic shortest path problem under Assumptions 1.1 and 1.2. Show that the Q-factors Q(i, u) can be viewed as state costs associated with a modified stochastic shortest path problem. Use this fact to show that the Q-factors Q(i, u) are the unique solution of the system of equations

Q(i, u) = Σ_j p_{ij}(u) ( g(i, u, j) + min_{u'∈U(j)} Q(j, u') ).

Hint: Introduce a new state for each pair (i, u), with transition probabilities p_{ij}(u) to the states j = 1, ..., n, t.

2.18 (Advantage Updating)

Consider the optimal Q-factors Q*(i, u) of the stochastic shortest path problem under Assumptions 1.1 and 1.2. Define the advantage function by

A*(i, u) = min_{u'∈U(i)} Q*(i, u') − Q*(i, u).

(a) Show that A*(i, u) together with the optimal costs J*(i) solve uniquely the system of equations

J*(i) = A*(i, u) + Σ_j p_{ij}(u) ( g(i, u, j) + J*(j) ),    i = 1, ..., n, u ∈ U(i),

max_{u∈U(i)} A*(i, u) = 0,    i = 1, ..., n.

(b) Introduce approximating functions Ã(i, u, r) and J̃(i, r), and derive a gradient method aimed at minimizing the sum of the squared errors of the Bellman-like equations of part (a) (cf. Section 2.3.3).
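A small sketch of computing the Q-factors by iterating the system of equations in Exercise 2.17 above is given below. The one-state example is hypothetical, and Q at the termination state is kept at zero by convention.

```python
def q_factor_iteration(g, P, num_iters=500):
    """g[i][u][j]: transition cost; P[i][u][j]: transition probability, with j ranging
    over states 0..n-1 and the termination state (index n). Iterates
    Q(i,u) := sum_j p_ij(u) * (g(i,u,j) + min_u' Q(j,u'))."""
    n = len(P)
    Q = {(i, u): 0.0 for i in range(n) for u in range(len(P[i]))}
    min_Q = lambda j: 0.0 if j == n else min(Q[(j, u)] for u in range(len(P[j])))
    for _ in range(num_iters):
        for i in range(n):
            for u in range(len(P[i])):
                Q[(i, u)] = sum(P[i][u][j] * (g[i][u][j] + min_Q(j))
                                for j in range(n + 1) if P[i][u][j] > 0)
    return Q

# Hypothetical 1-state example: control 0 stays w.p. 0.5 (cost 1), control 1 terminates (cost 2).
g = [[[1.0, 1.0], [2.0, 2.0]]]
P = [[[0.5, 0.5], [0.0, 1.0]]]
print(q_factor_iteration(g, P))    # expect Q(0,0) = 2.0 and Q(0,1) = 2.0
```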
Undiscounted Problems

Contents

3.1. Unbounded Costs per Stage . . . . . . . . . . . . p. 134
3.2. Linear Systems and Quadratic Cost . . . . . . . . . p. 150
3.3. Inventory Control . . . . . . . . . . . . . . . . p. 153
3.4. Optimal Stopping . . . . . . . . . . . . . . . . p. 155
3.5. Optimal Gambling Strategies . . . . . . . . . . . . p. 160
3.6. Nonstationary and Periodic Problems . . . . . . . . p. 167
3.7. Notes, Sources, and Exercises . . . . . . . . . . . p. 172
In this chapter we consider total cost infinite horizon problems where we allow costs per stage that are unbounded above or below. Also, the discount factor α does not have to be less than one. The resulting complications are substantial, and the analysis required is considerably more sophisticated than the one given thus far. We also consider applications of the theory to important classes of problems. The problem section touches on several related topics.

3.1 UNBOUNDED COSTS PER STAGE

In this section we consider the total cost infinite horizon problem of Section 1.1 under one of the following two assumptions.

Assumption P: (Positivity) The cost per stage g satisfies

0 ≤ g(x, u, w), for all (x, u, w) ∈ S × C × D.    (1.1)

Assumption N: (Negativity) The cost per stage g satisfies

g(x, u, w) ≤ 0, for all (x, u, w) ∈ S × C × D.    (1.2)

Problems corresponding to Assumption P are sometimes referred to in the research literature as negative DP problems. This name was used in the original reference [Str66], where the problem of maximizing the infinite sum of negative rewards per stage was considered. Similarly, problems corresponding to Assumption N are sometimes referred to as positive DP problems [Bla65], [Str66]. Assumption N arises in problems where there is a nonnegative reward per stage and the total expected reward is to be maximized.

Note that when α < 1 and g is either bounded above or below, we may add a suitable scalar to g in order to satisfy Eq. (1.1) or Eq. (1.2), respectively. An optimal policy will not be affected by this change since, in view of the presence of the discount factor, the addition of a constant r to g merely adds (1 − α)^{-1} r to the cost associated with every policy.

One complication arising from unbounded costs per stage is that, for some initial states x_0 and some genuinely interesting admissible policies
π = {μ_0, μ_1, ...}, the cost J_π(x_0) may be ∞ (in the case of Assumption P) or −∞ (in the case of Assumption N). Here is an example:

Example 1.1

Consider the scalar system

x_{k+1} = β x_k + u_k,    k = 0, 1, ...,

where x_k ∈ ℜ and u_k ∈ ℜ for all k, and β is a positive scalar. The control constraint is |u_k| ≤ 1, and the cost is

J_π(x_0) = lim_{N→∞} E { Σ_{k=0}^{N−1} α^k |x_k| }.

Consider the policy π = {μ, μ, ...}, where μ(x) = 0 for all x ∈ ℜ. Then

J_π(x_0) = lim_{N→∞} Σ_{k=0}^{N−1} α^k β^k |x_0|,

and hence

J_π(x_0) = 0 if x_0 = 0,    J_π(x_0) = ∞ if x_0 ≠ 0 and αβ ≥ 1,

while

J_π(x_0) = |x_0| / (1 − αβ) if αβ < 1.

Note a peculiarity here: if β > 1 the state x_k diverges to ∞ or to −∞, but if the discount factor is sufficiently small (α < 1/β), the cost J_π(x_0) is finite. It is also possible to verify that when β > 1 and αβ > 1 the optimal cost J*(x_0) is equal to ∞ for |x_0| > 1/(β − 1) and is finite for |x_0| < 1/(β − 1). The problem here is that when β > 1 the system is unstable, and in view of the restriction |u_k| ≤ 1 on the control, it may not be possible to force the state near zero once it has reached sufficiently large magnitude.

The preceding example shows that there is not much that can be done about the possibility of the cost function being infinite for some policies. To cope with this situation, we conduct our analysis with the notational understanding that the costs J_π(x_0) and J*(x_0) may be ∞ (or −∞) under Assumption P (or N, respectively) for some initial states x_0 and policies π. In other words, we consider J_π(·) and J*(·) to be extended real-valued functions. In fact, the entire subsequent analysis is valid even if the cost g(x, u, w) is ∞ or −∞ for some (x, u, w), as long as Assumption P or Assumption N holds.

The line of analysis of this section is fundamentally different from the one of the discounted problem of Section 1.2. For the latter problem, the analysis was based on ignoring the "tails" of the cost sequences. In
136 I ’iidisctnint<<I Pioblfins Ch:ip. 3 this section, the tails of the cost, .sequences may not be small, and for this reason, the control is much more focused on affecting the long-term behavior of the slate. For example, let. n = I, and assume that, the stage lost at all states is nonzero except, for a cost-free and absorbing termination state. Then, a primal у task ol coni rol under Assam pl ion I ’ (or Assumption N) is roughly to bring the state of the system to the termination stale or to a region when' the cost, per stage is nearly -zero as quickly as possible (as lal.e as possible, respectively). Note the dilference in control objective between Assumptions 1’ and N. It accounts for sonic strikingly different results under the two assumptions. Main Results -- Bellman’s Equation We now present, results that, characterize the optimal cost [unction ./*, as well as optimal stationary policies. We also give conditions under which value iteration converges to the optimal cost function J*. In the proofs we will often need to interchange expectation and limit in various relations. This interchange is valid under t he assumptions of the following theorem. Monotone Convergence Theorem: Let P = (pi,p?,...) be a prob- ability distribution over S = {1,2,...}. Let {Iin} be a sequence of extended real-valued functions on S such that for all i 6 S and N = 1,2,..., 0 < /iyv(z) < /hv+i(i). Let. h : S e-> [0, oo] be. the limit function h(i) = Jim Then TO ОС oo Jim '^PiliN(i') = 57 lim /tw(f) = y^pjh(i). 4—1 4=1 7 = 1 Proof: We have 57< 57 1^1 7 = 1 By taking the limit, we obtain lim Y' p,hN(i) < Y^pih(i), 7=1 7=1
Sec. 3.1 Unbounded Costs per ,Sf;lto 137 so there remains to prove the reverse inequality, For every integer M I, we have ,w ,w Jh" 12 p,h\(i) > lim _ />,/>Л-(/) ,i ‘ i i <i and by taking the limit as M oc the reverse inequality follows. Q.E.D. Similar to all the infinite horizon problems considered so far. the optimal cost, function satisfies Bellman's equation. Proposition 1.1: (Bellman’s Equation) Under either Assumpt ion P or N the optimal cost function ,J* satisfies J*(.r) = min е{д(.т, iqtc) + aJ*(/(;r, (1, w))}, :r 6 S or, equivalently, J*=TJ*. Proof: For any admissible policy rr = consider the cost ./T(.r) corresponding to тг when the initial stale is .r. We have •U(.r) = E{.<7(.r,/U)(.r),te) + l4(/(.r./m(-r).tr))}, (1.3) U’ where, for all .ri G 5, pv-i К(-г-1)= lim E < У'<'1А:.</(-''а-,/'а (-га ). "’a ) N— % "!, L------< /.-12.. I A —I Thus, Ur (.ci) is the cost from stage 1 to infinity using тг when the initial state is .rj. We have clearly Р?г(.'Г1) > a J* (.rj), for all .iq G S. Hence, from Eq. (1.3), •M-r) > 7i{.A/(.r,/A(l(.r), tn) + nJ* (/(.r, //o(:r), «;))} ll’ > min /,’{.</(.r. u. it’) + aJ* (/,(/,’))}. Taking the minimum over all admissible policies, we have min ./л-(.г) = ./*(') 7Г > mill E{f/(.r,'iiji’) + nJ*(f{:r.,ii,n’))} (l.d) = (TJ*Kr).
.1 138 lhidiseoimt,od Problem» C7ia.p. 3 Thus there remains to prove that the reverse inequality also holds. We prove this separately for Assumption N and for Assumption P. Assume P. The following proof of J* < TJ* under this assumption would be considerably simplilied if we knew that there exists a /i such that T,,J* = TJ*. Since in general such a /i need not exist, we introduce a positive sequence {q. }, and we choose an admissible policy rr = {/io, //1,.. such that Uk./*)(.'•) + Q., .res', /. = (),i,... Such a choice is possible because we know that, under P, we have -oo < J*(T) for all .r. By using the inequality TJ* < J* shown earlier, we obtain (TllkJ*)(.r.) <•/*(.'.) +re.S, /. = 0,1,... Applying 7},a 1 to both sides of this relation, we have + f'<T < (7’7*)(.i:) + a,-! + < J*(:r) + + orfc. Continuing this process, we obtain (7;.(,7;,| ••-v;,j../*)(./) < (tj*)(.i;) + уЛю,. /=() By taking the limit as к —> and noting that J’(z) < J^.r) = Jhn^(7';,(|7’/,| ••• T„A..7(l)(.r) < lyT^, Tl4, ./*)(.,;), where ./(l is t.he zero function, it follows that ./*(') < AW < (TJ*)(.r) + y\pr,, ,e G s. i=0 Since the sequence {q. } is arbitrary, we can take o ‘c, as small as desired, and we obtain ./*(./;) < (TJ*)(.r) for all ,r G S. Combining this with the inequality J*(.r) > (7’J*)(.r) shown earlier, the result follows (under Assumption P). Assume N and let. Jbe I he opt imal cost, fund ion for the correspond- ing N-stage problem (N-l '| ./л (./у) = niinE CW. ч'к) г • (C5) I A -() J
Sec. i lJnboinnh'tl Costs per Stsit.o 139 We first show that ./*(.?;) = ^lim ,r e 5. (1.6) Indeed, in view of Assumption N, we have J* < for all N, so .l*(.r) < lim ./л-(.г), .r £ S. (1.7) N Also, for all 7Г = {/io,/ii,...}, we have pV-1 'l E < akg[.vk^iJk{xk), wk) > > J/v(.ro), IA-0 J and by taking the limit, as N oc, ./n-(.r) > lim ./w(x), .r 6 S. ,V .-x. Taking the minimum over тг, we obtain J*(j ) > lmi,v_>o<J J;v(.r), and com- bining this relation with Eq. (1.7), we obtain Eq. (1.6). For every admissible p., we have T^.In > -Лу-н, and by taking the limit, as N —► oo, and using the monotone convergence theorem and Eq. (1.6), we obtain W > J*. Taking the minimum over //, we obtain TJ* > J*, which combined with the inequality J* > TJ* shown earlier, proves the result under Assumption N. Q.E.D. Similar to Cor. 2.2.1 in Section 1.2, we have: Corollary 1.1.1: Let /i be a stationary policy. Then under Assump- tion P or N, we have J,i(x) = Е{ц(х,ц{х), w) + a.7M(/(.z,//.(.r), w))}, x e S w or, equivalently, J>, = T,^. (1.8) ( Contrary to discounted problems with bounded cost per stage, the i optimal cost function .7* under Assumption P or N need not be the unique I solution of Bellman’s equation. Consider the following example. i I:
-I 140 Ihuliscoiinled I’rulilem.s ('li;i/>. 3 Example 1.2 Let, .S' = [0,oo) (or S = (—oo.t)]) and </(x, II, ll’) 0, f(.r. II. II’) -- - . Then for ('very fl, the function ./ given by ./(.r) = /f.r, .< G .S'. is a solution of Bell man's equation, so 7' has an inlinite number of lixed points. Note-, however, that there is a unique lixed point within the class of bounded functions, the zero function ./(,(./) = 0, which is the optimal cost function for this problem. More generally, it can be shown by using the following Prop. 1.2 that if a < 1 and there exists a bounded function that is a fixed point of T, thi'ii that function must, be equal to the optimal cost, funct ion .Г (sett Exercise 3.5). When о = I, Bellman’s equation may have an infinity of solutions even within the class of bounded functions. This is because if о = 1 and ./(•) is any solution, then for any scalar ./() + r is also a solution. The optimal cost function ./*, however, lias the property that it is the smallest (under Assumption I’) or largest (under Assumption N) lixed point of T in the sense described in the following proposition. Proposition 1.2: (a) Under Assumption P, if J : S e-> (—oo,oc] satisfies J > TJ and either J is bounded below and a < 1, or ,J > 0, then J > J*. (b) Under Assumption N, if J : S's-» [~oo, oo) satisfies .7 < TJ and either J is bounded above and o <" 1, or J < 0, then J < J*. Proof: (a) Under Assumption P, let r be a. scalar such that J(.r) + г > 0 for all .v G S and if о > 1 let r = 0. For any sequence {qJ with ry > 0, let 7Г = {/л,, /71, . . .} be an admissible policy such that, for every x G S and k, /'/. С'')- »’) + <>/(/(•''./'A'(-''). »’)) } < + <T- (1-9) IP Such a policy exists since (7’./)(.r) > — oo for all x G We have for any initial state •i\) c s, pv-l 'I J + (.ro) = min liin /;< "’к) ? IA--() J f л 1 1 < niiiilinniif E < + r) + У2 wk) > ‘ I A-—() J r jV--J a < lini inf < <i'v (J(.r;v) + r) + У (Jj)(xK., рл(.гд), U'k) > . x I A=o J
Sec. ,‘f. I I fnl Нинк l<‘< I C'obts per Si ale 14 1 Using Eq. (1.9) and the assunipt ion J > T.J, we obtain Combining these inequalities, we obtain / ,v I .Л(.П>) < ./(.Гц) + lim Ол'/-+ У (Ьц. \ /.--a Since the' sequence {q.} is arbitrary (except for q. > 0), we may select {c*} so that liin.v-,^ fiA(A >s arbitrarily close to zero, and the result follows. (b) Under Assumption N, let г be a scalar such that ./(.r) + /<() lor all x G S', and if <i > 1, let r = 0. We have for every initial stale .i t) G S’. {.v. I 5 ''A‘7/'А-(-'Ч)."'a ) Z—/ ' A--() f N -1 > minlimsupE < <vv(./(.r.v) + r) + 57 aia'a/(.';a ./'а (.га). h’i.-) > Л’“"°с । A-—0 J N- 1 V o'V(.7(.r,v) + r) + 57 '1A'.'/(-|'a ./'а (-Га ). </>a) > . A ll J (1.1(1) where the last inequality follows from the fact that for any sequence {h.v(£)} of functions of a parameter we have > lim sup min E N-.-b * min lim sup h ,v (£) > lim sup min/i.y (£). s N —'5C N —»oo ч
.1 142 Undiscoimted Problems This inequality follows by writing An(O > min/ijv(£) and by subsequently taking the limsup of both sides and the minimum over £ of the left-hand side;. Now we have, by using the assumption J < TJ, r N -I niin E < <*N J(xn) + akg(xk,p,k(xk), Wk) I k' = 0 oA' 1 min /<; {</(.r/v-i, u/v-i, W- i) ///V — 1 — 1 ) "’/V-l + O-j(/(x-/V-l, »N-1, W/V-l))} N--2 A;=0 N-2 > niin/; < aN~l.j(x,N^i) + У2 nkfjbk-,/kk(xk),wk') I A=(> Using this relation in Eq. (1.10), we obtain J*(xo) > J(.tu) + lim aNr = J(.r0). Q.E.D. As before, we have the following corollary: Corollary 1.2.1: Let ц be an admissible stationary policy. (a) Under Assumption P, if J : S >—> (—oo,oc] satisfies J > T^J and either J is bounded below and a < 1, or J > 0, then J > J (b) Under Assumption N, if J : S [—oo,oc) satisfies J < THJ and either J is bounded above and a < 1, or ,7 < 0, then J < JlL. Conditions for Optimality of a Stationary Policy Under Assumption P, we have the same optimality condition as for discounted problems with bounded cost per stage.
Sec. 3.1 Unbounded Costs per Stille 113 Proposition 1.3: (Necessary and Sufficient Condition for Op- timality under P) Let Assumption P hold. A stationary policy p is optimal if and only if TJ* = TflJ*- Proof: If TJ* = TltJ*, Bellman’s equation (./’ — 7'./*) implies that .7* -- TpJ*. From Cor. 1.2.1(a) we then obtain .7* > ,7/t, showing that p is optimal. Converse!}', if J* = Jt,, we have using Cor. 1.1.1, 7’./* = .7* = = T„./Z, =7), .7*.' Q.E.D. Unfortunately, the sufficiency part of the above proposition need not be true under Assumption N; that, is, we may have TJ* = Тц.1* while p is not optimal. This is illustrated in the following example. Example 1. 3 Let S = C = ( — oo, 0], U(x) = C for all x 6 S, and g(r, ii,w) = и. in') = u, for all (.r, u,w) g S x C x D. Then .7*(.r) = —oo for all .r g S, and every stationary policy p satisfies the condition of the preceding proposition. On the other hand, when //.(.r) = 0 for all .r g S, we have' J,,(.r) = 0 for all .r g S, and hence p is not optimal. It is worth noting that. Prop. 1.3 implies the existence of an optimal stationary policy under Assumption P when U(.r) is a finite set for every x g S. This need not be true under Assumption N (see Example 4.1 in Section 3.4). Under Assumption N, we have a different characterization of an op- timal stationary policy. Proposition 1.4: (Necessary and Sufficient Condition for Op- timality under N) Let Assumption N hold. A stationary policy p is optimal if and only if T.Jtl = T^. (1.11) Proof: If TJ/, = TftJfi, thi'ii from Cor. 1.1.1 we have ,/y, = TjiJ,,, so that <7;l is a fixed point, of T. Then by Prop. 1.2, we have- .7,, < .7*, which implies that p is optimal. Conversely, if .7(, = .7*, then T^J^ = ,7,, = J* = TJ* == Q.E.D.
.1 144 Uiidisioimled I’robleins (,'Iuip. 3 'The interpretation of the preceding optimality condition is that per- sistently using /I is optimal if and only if tins perforins al, least as well as using any Ji. a.t the first stage and using //. thereafter. Under Assumption I’ this condition is not sullicieut to guarantee optimality ol the stationary policy //, as the following example shows. Example 1. 4 Let .S' = (—00,00), U(.r) = (0,1] for all .1: g .S’, .</(./, », «>) = |.r|, f(.r. 11. in) = о -1 их, lor all (,c, 11, 4:) £ S' X C x D. Let, p(.r) = 1 for all x g S. Then Jt, (./) = 00 if x 0 and J(,(0) = 0. Eiirtherniore. we have ./,, = 7), JM = T.J,, as the reader can easily verify. It can also he verified that .7*(,r) — |.r|, and hence t he stationary policy // is not, optimal. The Value Iteration Method We now turn to the question whether the DP algorithm converges to the optimal cost function J*. Let Jo be the zero function on S, Jo (.r) = 0, x g S. Then under Assumption P, we have Jo < 7’Jo < 7V/(I < • < T*Jo < , while under Assumpt ion N, we have Jo > 7’./,, > 7’2 J() > • • • > TU/„ > • • • hi either case t.he limit function J<x (>r) = JiinJ7’U7o)(.r), .reS, (1.12) is well defined, provided we allow I,he possibility that JIX, can take the value Oil (under Assumption P) or —00 (under Assumption N). The question is whether the value iteration method is valid in the sense J^=J*. (1.13) This question is, of course, of computational interest, but it is also of analytical interest since, if one knows that J* = lim/._..,x, TkJo, one can infer properties of the unknown function J* from properties of the Ai-stage
Sec. 3.1 I hilx>iiii<l<‘<l C'ost.x per St;ilc 145 optimal cost functions which are defined in a concrete algorithmic maimer. We will show that /> — J’ under Assumption N. It turns out. however, that under Assumption P, we may have Jx / J* (see Exer- ci.se 3.1). We will later pres ide easily verifiable conditions ( hat. guarantee that ./oc = ./* under Assumption P. We have the following proposition. Proposition 1.5: (a) Let Assumption P hold and assume that Joo(x) = (TJoo)(.T), X&S. Then if J : S' Ji is any bounded function and a < 1, or otherwise if Ju < J < J*, we have lim (TkJ)(x) = j: G S. (1.14) (b) Let Assumption N hold. Then if J : S' н-» Э{ is any bounded function and a < 1, or otherwise if J* < .J < Jo, we have lim (TkJ)(x) = J*(x), xeS. (1.15) Proof: (a) Since under Assumption P, we have Jo < TJU << TkJe <</*. it follows that, lii. TkJp = .1-^ < J*. Since J-x is also a fixed point, of T by assumption, we obtain from Prop. 1.2(a) that. J* < .1^. It follows that Joo = J‘, (1.16) and hence Eq. (1.14) is proved for the case J = Jo. For the case' where о < 1 and J is bounded, let, r be a scalar such that Jo - rc < J < Jo + re. (1.17) Applying Tk to this relation, we obtain T'-.Ju - <Jr<’ < Tk.J < Tk.!p I- <iArc. (US) Since Tk Jo converges to J*, as shown earlier, this relation implies that Tk J converges also to J*.
- .1 146 P/idj.scoiin/.ed Problems C7iap. .7 III the case where Jo < J < J*, we have by applying Tk Tk.Ju <TkJ < J*, A: = 0,1,... (1.19) Since TkJo converges to ./*, so does TkJ. (b) It was shown earlier [cf. Eq. (1.6)] that under Assumption N, we have J^(.r) = JinJTM0)(.r) = J*(.r). (1.20) The proof from this point is identical to that for part- (a). Q.E.D. We now derive conditions guaranteeing that .7^ = T.J^ holds under Assumption P, which by Prop. 1.5 implies that JIX, = J*. We prove two propositions. The first admits an easy proof but requires a finiteness as- sumption on the control constraint sot. The second is harder to prove but requires a weaker compactness assumption. Proposition 1.6: Let Assumption P hold and assume that the con- trol constraint set is finite for every x G S. Then 7oo = T.JX = J*. (1.21) Proof: As shown in the proof of Prop. 1.5(a), we have for all A:, Tk.Jn < < J’. Applying T to this relation, we obtain (7’A 1 1 J())(.r) = Ulin h{</(x, ", ii') + o(7’A'J0) (/(•';. u, n>)) } »£(.'(.,) (1.22) <(7’./J(.r), and by taking the limit, as A: —> oo, it follows that. Joo < 7’Joo. Suppose that there existed a state x G S such that MJ) < (TJoo)C4 (1.23) Let /ц. minimize in Eq. (1.22) when x = x. Since U(.r) is finite, there must exist some ii i. f <(.»•) such that ii/,. =- i'i for all A' in some infinite subset К of the positive integers. By Eq. (1.22) we have for all A: G К t1 J(l)(.r) = L’{y(.r, ii, + <i(TA Jo)(/(.r, «, in))} «.• < (TJ^.)(.l:).
S'<:c. :l. I Unbounded Costs per SLide 147 Taking the limit as A: —» oo, к G K, we obtain Jx(x) = E{g(x, «, w) + aJoo (/(.<:, a, id)) } w > (TJ,x)(i) = min E{g(x, u, w) + aJoo u, w))}. uE.U(i) w This contradicts Eq. (1.23), so we have Q.E.D. The following proposition strengthens Prop. l.G in that it requires a compactness rather than a finiteness assumption. We recall (see Appendix A of Vol. I) that a subset X of the n-dimeiisional Euclidean space JI” is said to be compact if every sequence {.r;.} with G X contains a subsequence that converges to a point a: G X Equivalently, X is compact if and only if it is closed and bounded. The empty set is (trivially) consid- ered compact. Given any collection of compact sets, their intersection is a compact set (possibly empty). Given a sequence of nonempty compact sets Xi, Xi ..., Xk, .. such that Ah D X2 Э ••• Э XA. Э ХА.И D (1.2-1) their intersection CljJT, AT is both nonempty and compact. In view of this fact, it follows that if f : J?" >—> [—00,00] is a function such that the set Fa = {a; G I/(.r) < A} (1.25) is compact for every A G 77, then there exists a vector ,r* minimizing /; that is, there exists an j:* G 51" such that /(;<;*)= min/(.<;). (1.26) гЕ?)Сп To see this, take a sequence {Ay} such that Ay —> min.,.tK’- ,/'(.r) and A*. > A/;+1 for all k. If пппл.езр> /(.r) < 00, such a sequence exists and the sets Fq. = {.r G 51" I /(.r) < A;,} (1.27) are nonempty and compact. Furthermore, F\k D Faa.+ i for all k. and hence the intersection is also nonempty and compact. Let. ./•♦ be any vector in Then /(-'-*) < Ay, A-=1,2,..., (1.28) and taking the limit as A' -» -oo. we obtain f(x*) < 11611.,,=),'-, J(.r), proving that x* minimizes /(.r). The most common case where we can guarantee
.1 148 (Iiidiscountcd Problems Chap. 3 that the set l'\ <il Eq. (1.25) is compact, lor all /\ is when f is continuous and J(.r) —> тс- as ||.r|| —» oc. Proposition 1.7: Let Assumption P hold, and assume that the sets C4(j:, A) = | u e (J(.r)| E{ff(-r, u, w) + o(TaJo)(/(j;, и, w))} < a| (i.29) are compact subsets of a Euclidean space for every ,c £ S', A £ 3J, and for all k greater than some integer k. Then = TJ^ = J\ (1.30) Furthermore, there exists a stationary optimal policy. Proof: As in Prop. 1.0, we have Jx < T.J^. Suppose that there existed a state J: £ S such that ./.(?•)< (7'7.ю)(.!) (1.31) Clearly, we must have ./^(.r) < oo. For every к > к, consider the sets = {(/.£ «(.;)! /;{.</(.г. w. in) + <i(TA./ll)(/(.r, и, Hi))} < Joo(.c)}. V I II’ J Let, also и/,, be a point at taining the minimum in (7'A 4 ’.Ju)(.r) = min Н{у(.г, u, ir) + о(7’A .7»)(/(.?:, u, «:))}; iiCl'(j') to that is, hi,, is such that (7’A H,/(l)(.f) ilk- u>) + <i(7’A./o)(/(.c, iik, a’)) }- ir Such minimizing points щ. exist, by our compactness assumption. For every к > к, consider the sequence {a, Since TA .Ju < 7’A + 1./u < < Joo, il follows I hat "') + '>(7'A'./())(/(.r, n;, «>)) } <r < /;{.</(.г, и.,, in) + о(Г'Jo)(/(:r, u,, u.'))} U.’ < .J.^(.r), i > к.
Sec. 3.1 I ’iilCosts per State 149 Therefore (", c tT (.r. .7^(1,')). alKl since /7а-(.г. ./^(.г)) is compact, all the limit points of {",}(TA. belong to (4 (,r, and at least, one such limit point exists. Hence the same is true of the limit points of the whole sequence {",} ... It. follows that, if /7 is a. limit, point, of { ", }4, A then "e nAx=^MT7'U-'-))- This implies by Eq. (1.29) that, for all A: > A- ••M-r) > E{</(7 "."’) + o(7’A' ./o)(/(.r, г7, in)) } > (71Ач 1 ./o )(.!). 11’ Taking the limit as A' —+ oc, we obtain K(j-) = /i{y(.E "• "’) + aJ.^ (f(.i-, it. "’))}• u> Since the right-hand side is greater than or equal to (7'.7ix.)(.r), Eq. (1.31) is contradicted. Hence .7^ = 7'./^. and Eq. (1.30) is proved in view of Prop. 1.5(a). To show that there exists an optimal stationary policy, observe that Eq. (1.30) and the last relation imply that, it attains the minimum in •7*(.r) = min E{ </(.?,", (c) + <i.7* (/(.r, ", »))} ll£l '(.<•) « for a state ,i e S with .7*(.r) < oc. For states .? e S such that. ./*(.?) = oo. every " 6 (7(.r) attains Hie preceding minimum. Hence by Prop. 1..3(a) an optimal stationary policy exists. Q.E.D. The reader may verify by inspection of the preceding proof that if /ref"')- к — th1- • • attains the minimum in the relation (Tk+ './o)(Ji) = min E{y(J-, ".,"’) +o(7,'''.7l))(/(.r, ". w))}, u^U(.r) ip then if //*(./) is a limit point of {//a-(-J’)}, for every J- G S, the stationary policy //* is optimal. Furthermore, {//a-(J’)} has at least one limit point, for every J: £ S for which .7*(7) < oc. Thus the. value iteration, method under the assumptions of either Prop. 1.6 or Prop. 1.7 yields in. the limit not only the optimal cost function. J* hut also an optimal stationary policy. Other Computational Methods Unfortunately, policy iteration is not, a valid procedure under either P or N in the absence of further conditions. If// and fl. are stationary policies such that Tpdi' = then Ciln ',c nhown that under Assumption P we have (1-32)
To see this, note that

T_μ̄ J_μ = T J_μ ≤ T_μ J_μ = J_μ,

from which we obtain lim_{N→∞} T_μ̄^N J_μ ≤ J_μ. Since J_μ̄ = lim_{N→∞} T_μ̄^N J_0 and J_0 ≤ J_μ, we obtain J_μ̄ ≤ J_μ. However, J_μ̄ ≤ J_μ by itself is not sufficient to guarantee the validity of policy iteration. For example, it is not clear that strict inequality holds in Eq. (1.32) for at least one state x ∈ S when μ is not optimal. The difficulty here is that the equality J_μ = T J_μ does not imply that μ is optimal, and additional conditions are needed to guarantee the validity of policy iteration. However, for special cases such conditions can be verified (see, for example, Section 3.2 and Exercise 3.16).

It is possible to devise a computational method based on mathematical programming when S, C, and D are finite sets by making use of Prop. 1.2. Under N and α = 1, the corresponding (linear) program is (compare with Section 1.3.4)

maximize Σ_{i=1}^n λ_i

subject to λ_i ≤ g(i, u) + Σ_{j=1}^n p_{ij}(u) λ_j,    i = 1, ..., n, u ∈ U(i).

When α = 1 and Assumption P holds, the corresponding program takes the form

minimize Σ_{i=1}^n λ_i

subject to λ_i ≥ min_{u∈U(i)} [ g(i, u) + Σ_{j=1}^n p_{ij}(u) λ_j ],    i = 1, ..., n,

but unfortunately this program is not linear or even convex.
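For the first program above (Assumption N, α = 1) and finite data, a minimal sketch using scipy is given below. The problem instance is hypothetical, and the bounds λ_i ≤ 0 are added here in the spirit of Prop. 1.2(b) to keep the program bounded; they are an assumption of the sketch, not part of the program as stated in the text.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical data: n = 2 states, two controls each, g(i, u) <= 0 (Assumption N).
# State 1 is cost-free and absorbing, so the optimal costs are finite.
g = np.array([[-1.0, -2.0],
              [ 0.0,  0.0]])
P = np.array([[[0.5, 0.5], [0.0, 1.0]],       # P[i, u, j] = p_ij(u)
              [[0.0, 1.0], [0.0, 1.0]]])
n, m = g.shape

# maximize sum_i lambda_i  <=>  minimize -sum_i lambda_i
# subject to lambda_i - sum_j p_ij(u) lambda_j <= g(i, u) for every i, u
A_ub, b_ub = [], []
for i in range(n):
    for u in range(m):
        row = -P[i, u, :].copy()
        row[i] += 1.0
        A_ub.append(row)
        b_ub.append(g[i, u])

res = linprog(c=-np.ones(n), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, 0.0)] * n)       # lambda_i <= 0, cf. Prop. 1.2(b)
print(res.x)                                  # expect approximately [-2, 0]
```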
3.2 LINEAR SYSTEMS AND QUADRATIC COST

Consider the case of the linear system

x_{k+1} = A x_k + B u_k + w_k,    k = 0, 1, ...,

where x_k ∈ ℜ^n, u_k ∈ ℜ^m for all k, and the matrices A, B are known. As in Sections 4.1 and 5.2 of Vol. I, we assume that the random disturbances w_k are independent with zero mean and finite second moments. The cost function is quadratic and has the form

J_π(x_0) = lim_{N→∞} E { Σ_{k=0}^{N−1} α^k ( x_k'Q x_k + μ_k(x_k)'R μ_k(x_k) ) },

where Q is a positive semidefinite symmetric n × n matrix and R is a positive definite symmetric m × m matrix. Clearly, Assumption P of Section 3.1 holds.

Our approach will be to use the DP algorithm to obtain the functions T J_0, T^2 J_0, ..., as well as the pointwise limit function J_∞ = lim_{k→∞} T^k J_0. Subsequently, we show that J_∞ satisfies J_∞ = T J_∞, and hence, by Prop. 1.5(a) of Section 3.1, J_∞ = J*. The optimal policy is then obtained from the optimal cost function J* by minimizing in Bellman's equation (cf. Prop. 1.3 of Section 3.1).

As in Section 4.1 of Vol. I, we have

J_0(x) = 0,    x ∈ ℜ^n,

(T J_0)(x) = min_u [ x'Qx + u'Ru ] = x'Qx,    x ∈ ℜ^n,

(T^2 J_0)(x) = min_u E{ x'Qx + u'Ru + α(Ax + Bu + w)'Q(Ax + Bu + w) } = x'K_1 x + αE{w'Qw},    x ∈ ℜ^n,

and, more generally,

(T^{k+1} J_0)(x) = x'K_k x + Σ_{m=0}^{k−1} α^{k−m} E{w'K_m w},    x ∈ ℜ^n, k = 1, 2, ...,

where the matrices K_0, K_1, K_2, ... are given recursively by

K_0 = Q,

K_{k+1} = A'( αK_k − α²K_k B(αB'K_k B + R)^{−1} B'K_k )A + Q,    k = 0, 1, ...

By defining R̃ = R/α and Ã = √α A, the preceding equation may be written as

K_{k+1} = Ã'( K_k − K_k B(B'K_k B + R̃)^{−1} B'K_k )Ã + Q,

and is of the form considered in Section 4.1 of Vol. I. By using the result shown there, we have that the generated matrix sequence {K_k} converges to a positive definite symmetric matrix K,

K_k → K,

provided the pairs (A, B) and (A, C), where Q = C'C, are controllable and observable, respectively. Since Ã = √α A, controllability and observability
of (Ã, B) or (Ã, C) are clearly equivalent to controllability and observability of (A, B) or (A, C), respectively. The matrix K is the unique solution of the equation

K = A'( αK − α²KB(αB'KB + R)^{−1} B'K )A + Q.    (2.1)

Because K_k → K, it can also be seen that the limit

c = lim_{k→∞} Σ_{m=0}^{k−1} α^{k−m} E{w'K_m w}

is well defined, and in fact

c = (α / (1 − α)) E{w'Kw}.    (2.2)

Thus, in conclusion, if the pairs (A, B) and (A, C) are controllable and observable, respectively, the limit of the functions T^k J_0 is given by

J_∞(x) = lim_{k→∞} (T^k J_0)(x) = x'Kx + c.    (2.3)

Using Eqs. (2.1) to (2.3), it can be verified by straightforward calculation that, for all x ∈ ℜ^n,

J_∞(x) = (T J_∞)(x) = min_u [ x'Qx + u'Ru + αE{ J_∞(Ax + Bu + w) } ],    (2.4)

and hence, by Prop. 1.5(a) of Section 3.1, J_∞ = J*.

Another method for proving that J_∞ = T J_∞ is to show that the assumption of Prop. 1.7 of Section 3.1 is satisfied; that is, the sets

U_k(x, λ) = { u | E{ x'Qx + u'Ru + α(T^k J_0)(Ax + Bu + w) } ≤ λ }

are compact for all k and scalars λ. This can be verified using the fact that T^k J_0 is a positive semidefinite quadratic function and R is positive definite.

The optimal stationary policy μ*, obtained by minimization in Eq. (2.4), has the form

μ*(x) = −α(αB'KB + R)^{−1} B'KA x,    x ∈ ℜ^n.

This policy is attractive for practical implementation since it is linear and stationary. A number of generalized versions of the problem of this section, including the case of imperfect state information, are treated in the exercises. Interestingly, the problem can be solved by policy iteration (see Exercise 3.16), even though, as discussed in Section 3.1, policy iteration is not valid in general under Assumption P.
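The iteration K_{k+1} = A'(αK_k − α²K_k B(αB'K_k B + R)^{-1}B'K_k)A + Q can be carried out numerically; the sketch below iterates it to convergence and forms the gain of μ*(x) = −α(αB'KB + R)^{-1}B'KAx. The matrices and the discount factor in the example are hypothetical.

```python
import numpy as np

def solve_discounted_riccati(A, B, Q, R, alpha, num_iters=1000, tol=1e-12):
    """Iterate K_{k+1} = A'(aK - a^2 K B (a B'K B + R)^{-1} B'K) A + Q from K_0 = Q,
    then return K and the gain L of the stationary policy mu*(x) = L x."""
    K = Q.copy()
    for _ in range(num_iters):
        M = np.linalg.inv(alpha * B.T @ K @ B + R)
        K_next = A.T @ (alpha * K - alpha**2 * K @ B @ M @ B.T @ K) @ A + Q
        if np.max(np.abs(K_next - K)) < tol:
            K = K_next
            break
        K = K_next
    L = -alpha * np.linalg.inv(alpha * B.T @ K @ B + R) @ B.T @ K @ A
    return K, L

# Hypothetical scalar example: x_{k+1} = 1.2 x_k + u_k + w_k, Q = R = 1, alpha = 0.9.
A = np.array([[1.2]]); B = np.array([[1.0]])
Q = np.array([[1.0]]); R = np.array([[1.0]])
K, L = solve_discounted_riccati(A, B, Q, R, alpha=0.9)
print(K, L)                       # mu*(x) = L x
```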
See. :i.<i hieenti>r\' ( 'out i< >1 I 53 3.3 INVENTORY CONTROL Let us consider a discounted. inlinite horizon version of the inventory control problem of Section 1.2 in Vol. 1. Inventory stock evolves according to the equation '7ч-1 = .'4- + nt.- - I- = ().l.... (3.1) We assume that the successive demands </'/, are independent, and bounded, and have identical probability distributions. We also assume for simplicity that then1 is no fixed cost. Tin' cast' of a nonzero fixed cost, can be t reated similarly. rl’he cost function is f Л - -1 t Л(.Го) = bin E < OA'(c//A-(.eA.) + JI (.r*. + /a(.I'a) - H’A-)) > • A—0.1 V _ I I A — 0 ) where //(//) = pmax(0. —//) 4“ h niiixjO. //). The DI’ algorithm is given by ./o(.r) ' 0, (7V + I ,/())(,;) = min Elen + II(.r + и - «•) I- <\(Tk'./o)(.r + и — <<•)}. (3.2) 0<U We first show that the optimal cost is finite for all initial states, that is, ,/*(.ro) = iiiiiiJj(7u) < oe, for all .ro 6 5. (3.3) 7Г Indeed, consider the policy тг = {/7,/7,...}, where /7 is defined by Since ((’a is nonnegative ami bounded, it follows that the inventory stock jg. when the policy тг is used satisfies -«7-1 < -i’a < niax(t),.го), k: = 1,2,..., ami is bounded. Hence /7(.га ) is also bounded. It. follows that, the cost, po- stage incurred when тг is used is bounded, and in view of the presence of the discount factor we have .Ar(.ru) < oo, :i:o & S. Since .J* < Jn, the finiteness of the optimal cost follows.
Next we observe that, under the assumption c < p, the functions T^k J_0 are real-valued and convex. Indeed, we have

J_0 ≤ T J_0 ≤ T^2 J_0 ≤ ⋯ ≤ J*,

which implies that T^k J_0 is real-valued. Convexity follows by induction as shown in Section 4.2 of Vol. I.

Consider now the sets

U_k(x, λ) = { u ≥ 0 | E{ cu + H(x + u − w) + α(T^k J_0)(x + u − w) } ≤ λ }.    (3.4)

These sets are bounded since the expected value within the braces above tends to ∞ as u → ∞. Also, the sets U_k(x, λ) are closed since the expected value in Eq. (3.4) is a continuous function of u [recall that T^k J_0 is a real-valued convex and hence continuous function]. Thus we may invoke Prop. 1.7 of Section 3.1 and assert that

lim_{k→∞} (T^k J_0)(x) = J*(x),    x ∈ S.

It follows from the convexity of the functions T^k J_0 that the limit function J* is a real-valued convex function. Furthermore, an optimal stationary policy μ* can be obtained by minimizing in the right-hand side of Bellman's equation

J*(x) = min_{u≥0} E{ cu + H(x + u − w) + αJ*(x + u − w) }.

We have

μ*(x) = S* − x if x < S*,    μ*(x) = 0 otherwise,

where S* is a minimizing point of

G*(y) = cy + L(y) + αE{ J*(y − w) },

with

L(y) = E{ H(y − w) }.

It can be seen that if p > c, we have lim_{|y|→∞} G*(y) = ∞, so that such a minimizing point exists. Furthermore, by using the observation made near the end of Section 3.1, it follows that a minimizing point S* of G*(y) may be obtained as a limit point of a sequence {S_k}, where for each k the scalar S_k minimizes

G_k(y) = cy + L(y) + αE{ (T^k J_0)(y − w) }

and is obtained by means of the value iteration method.
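As a rough numerical companion to the value iteration just described, the sketch below discretizes the stock level, iterates (T^{k+1}J_0)(x) = min_{u≥0} E{cu + H(x+u−w) + α(T^k J_0)(x+u−w)}, and reads off a minimizer of G_k(y). The grid, demand distribution, and cost parameters are hypothetical, and stock levels leaving the grid are clipped, so the computed base-stock level is only indicative.

```python
import numpy as np

# Hypothetical data: integer stock levels on a grid, bounded integer demand.
c, p, h, alpha = 1.0, 3.0, 0.5, 0.9
demand_vals  = np.array([0, 1, 2, 3])
demand_probs = np.array([0.2, 0.4, 0.3, 0.1])
grid = np.arange(-10, 21)                        # admissible stock levels (clipped)
H = lambda y: p * np.maximum(0, -y) + h * np.maximum(0, y)

def expected(J, y):
    """E{ H(y - w) + alpha * J(y - w) }, with y - w clipped to the grid."""
    idx = np.clip(y - demand_vals, grid[0], grid[-1]) - grid[0]
    return np.sum(demand_probs * (H(y - demand_vals) + alpha * J[idx]))

J = np.zeros(len(grid))                          # J_0 = 0
for k in range(100):                             # value iteration: compute T^k J_0
    new_J = np.empty_like(J)
    for s, x in enumerate(grid):
        orders = np.arange(0, grid[-1] - x + 1)  # keep x + u on the grid
        new_J[s] = np.min([c * u + expected(J, x + u) for u in orders])
    J = new_J

# Base-stock level: S_k minimizes G_k(y) = c*y + L(y) + alpha*E{(T^k J_0)(y - w)}
G = np.array([c * y + expected(J, y) for y in grid])
print("base-stock level S ~", grid[np.argmin(G)])
```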
t- ' See. Optiniiil Slopping 155 It turns out that the critical level S* lia-s a, simple characterization. 1 It can be shown that S' minimizes over у the expression (1 — o)c// + L(y), i and it can be essentially obtained in closed form (see Exercise 3.18, and j [HcS84], Ch. 2). In the case where there is a positive fixed cost (7V > 0), the same . line of argument may be used. Similarly, we prove that ./* is a real-valued j 71-convex function. A separate argument is necessary to prove that J* is e also continuous (this is intuitively clear and is left for the reader). Once , 71-convexity and continuity of J* are established, the optimality of a sta- tionary (s*,S*) policy follows from the equation j J'(x) = nun E{C(u) + H (.r + и — w) + a./*(x + и — w)}, t t where C(u) = К + cu if и > 0 and C(0) = 0. ? 3.4 OPTIMAL STOPPING i Consider an infinite horizon version of Lire stopping problems of Sec- tion 4.-1 of Vol. I. At each state .r, we must choose bet ween t.wo actions: pay a stopping cost s(.r) and stop, or pay a cost c(x) and continue the process according to the system equation •П + 1 =/<(<>, «’*), /.: = (),1,... (1.1) The objective is to find the optimal stopping policy that minimize’s the total expected cost over an infinite number of stages. It is assumed that the input disturbances ii\ have the same probability distribution for all A-, which depends only on the current, state Xk- This problem may be viewed as a special case' of the stochastic short- est path problem of Section 2.1, but here we will not assume that the state space, is finite and that only proper policies can be optimal, as we did in Section 2.1. Instead we will rely on the general theory of unbounded cost problems developed in Section 3.1. To put. the problem within the framework of the total cost infinite horizon problem, we introduce an additional state t (termination state) and we complete the system equation (4.1) as in Section 4.4 of Vol. I by letting ,rA.+ 1 = I, if иk = stop or ./> = t. Once the system reaches the termination state, it remains there perma- nently at no cost. We first assume that •s(.i') > 0, c(.r) > 0, for all j- £ S, (4.2)
(7'./)(.r) = / 1 56 I'in/i.si oiinb'J I ’n>1 >li41is ( 7iap. .7 thus coining under the framework of Assumption I' of Section 3.1. The case corresponding to Assumption N, where s(.c) < 0 and c(.r) < 0 for all ,r G S will lie considered later. Act ually, whenever t here exists an < > 0 such that c(.r) > ( for all .i: 6 5, the results to be obtained under the assumption (4.2) apply also to the case where s(z) is bounded below by some scalar rather I.han bounded by zero. The reason is that, if c(.r) is assumed t.o be greater than < > 0 lor all ,г c .S’, any policy that will not. stop within a finite' expected number of stages results in infinite cost, and can be excluded from consideration. As a result, if we reformulate the problem and add a constant, г to s(.r) so that. s(.r) 1- r > 0 for all .r <~ .S', the optimal cost, ./*(.i:) will merely be increased by r, while optimal policies will remain unaffected. The mapping 7’ that defines the DP algorithm takes t.he form min[.s(.c), c(.c) + E{./(/r(.r, w))}] if .c /., , t (I if .t: = /, V ' where .s(.r) is the cost of the st,opping action, and c(.r) + /?{./(/, (.r, »’))} is the cost of the continuation action. Since the control space has only two elements, by Prop. 1.6 of Section 3.1, we have Jiin(7’C/,>)(.r) = ./*(.,). .rG.S, (4.4) where ./() is the zero function [./n(.r) = 0, lor all .r G 5]. By Prop. 1.3 of Section 3.1. then’ exists a stationary optimal policy given by stop if .s(.r) < c(.r) P E{J‘(/,.(.r, w))}. continue if.s(.r) > c(.r) + A{ ./* (/,.(.r. «>))}. Let us denote by .S’* the optimal stopping set (which may be empty) ,s* = {./ c .s’| s(.r) <<•(.<•) p/•:{./*(/..(.<,»))}}. Consider also t he sets .S', = {.r e .S' | ,s(.r) < c(.r) + E{[T4)(/,.(.r, w))}} that determine the optimal policy for finite horizon versions of the stopping problem. Since we have •A, <T.7(I < ••• <74./() <•••<•/*, it follows t hat 51 C .S'2 C • • • C 5, C • • • C 5* and therefore ,.$’*• C 5*. Also, if .r $ U^Sfr. then we have ,s(.?:) > c(.r) + i7’ or. «•))}, к = (I. 1.... ,
and by taking the limit and using the monotone convergence theorem and the fact T^k J_0 → J*, we obtain s(x) ≥ c(x) + E{ J*(f_x(x, w)) }, from which x ∉ S*. Hence

S* = ∪_{k=1}^∞ S_k.    (4.5)

In other words, the optimal stopping set S* for the infinite horizon problem is equal to the union of all the finite horizon stopping sets S_k.

Consider now, as in Section 4.4 of Vol. I, the one-step-to-go stopping set

S_1 = { x ∈ S | s(x) ≤ c(x) + E{ s(f_x(x, w)) } }    (4.6)

and assume that S_1 is absorbing in the sense

f_x(x, w) ∈ S_1,    for all x ∈ S_1, w ∈ D.    (4.7)

Then, as in Section 4.4 of Vol. I, it follows that the one-step lookahead policy

stop if and only if x ∈ S_1

is optimal. We now provide some examples.

Example 4.1 (Asset Selling)

Consider the version of the asset selling example of Sections 4.4 and 7.3 of Vol. I, where the rate of interest r is zero and there is instead a maintenance cost c > 0 per period for which the house remains unsold. Furthermore, past offers can be accepted at any future time. We have the following optimality equation:

J*(x) = max [ x, −c + E{ J*(max(x, w)) } ].

In this case we consider maximization of total expected reward, the continuation cost is strictly negative, and the stopping reward x is positive. Hence the assumption (4.2) is not satisfied. If, however, we assume that x takes values in a bounded interval [0, M], where M is an upper bound on the possible values of offers, our analysis is still applicable [cf. the discussion following Eq. (4.2)].

Consider the one-step-to-go stopping set given by

S_1 = { x | x ≥ −c + E{ max(x, w) } }.

After a calculation similar to the one given in Section 4.4 of Vol. I, we see that S_1 = { x | x ≥ ᾱ }, where ᾱ is the scalar satisfying

ᾱ = P(ᾱ)ᾱ + ∫_ᾱ^∞ w dP(w) − c.

Clearly, S_1 is absorbing in the sense of Eq. (4.7) and therefore the one-step lookahead policy that accepts the first offer greater than or equal to ᾱ is optimal.
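Since the equation for ᾱ above can be rewritten as ᾱ = E{max(ᾱ, w)} − c, the threshold can be computed by a simple fixed-point iteration when the offer distribution is known; the sketch below uses hypothetical discrete offer data, and a bisection on ᾱ would serve equally well.

```python
import numpy as np

# Hypothetical offers w and their probabilities; maintenance cost per period.
offers = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
probs  = np.array([0.1, 0.2, 0.3, 0.25, 0.15])
c = 0.5

a = 0.0                                       # initial guess for the threshold
for _ in range(1000):
    a_new = np.sum(probs * np.maximum(a, offers)) - c    # E{max(a, w)} - c
    if abs(a_new - a) < 1e-12:
        a = a_new
        break
    a = a_new

print("accept the first offer >=", a)
```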
Example 4.2 (Sequential Hypothesis Testing)

Consider the hypothesis testing problem of Section 5.5 of Vol. I for the case where the number of possible observations is unlimited. Here the states are x^0 and x^1 (true distribution of the observations is f_0 and f_1, respectively). The set S is the interval [0, 1] and corresponds to the sufficient statistic

    p_k = P(x_k = x^0 | z_0, z_1, ..., z_k).

To each p ∈ [0, 1] we may assign the stopping cost

    s(p) = min[ (1 − p) L_0, p L_1 ],

that is, the cost associated with optimal choice between the distributions f_0 and f_1. The mapping T of Eq. (4.3) takes the form

    (TJ)(p) = min[ (1 − p) L_0,  p L_1,  c + E_z{ J( p f_0(z) / (p f_0(z) + (1 − p) f_1(z)) ) } ]

for all p ∈ [0, 1], where the expectation over z is taken with respect to the probability distribution

    P(z) = p f_0(z) + (1 − p) f_1(z),   z ∈ Z.

The optimal cost function J* satisfies Bellman's equation

    J*(p) = min[ (1 − p) L_0,  p L_1,  c + E_z{ J*( p f_0(z) / (p f_0(z) + (1 − p) f_1(z)) ) } ]

and is obtained in the limit through the equation

    J*(p) = lim_{k→∞} (T^k J_0)(p),   p ∈ [0, 1],

where J_0 is the zero function on [0, 1].

Now consider the functions T^k J_0, k = 0, 1, ... It is clear that

    J_0 ≤ T J_0 ≤ ··· ≤ T^k J_0 ≤ ··· ≤ min[ (1 − p) L_0, p L_1 ].

Furthermore, in view of the analysis of Section 5.5 of Vol. I, we have that the function T^k J_0 is concave on [0, 1] for all k. Hence the pointwise limit function J* is also concave on [0, 1]. In addition, Bellman's equation implies that

    J*(0) = J*(1) = 0,   J*(p) ≤ min[ (1 − p) L_0, p L_1 ].

Using the reasoning illustrated in Fig. 3.4.1, it follows that [provided c < L_0 L_1 / (L_0 + L_1)] there exist two scalars ᾱ, β̄ with 0 < β̄ < ᾱ < 1 that determine an optimal stationary policy of the form

    accept f_0                   if ᾱ ≤ p,
    accept f_1                   if p ≤ β̄,
    continue the observations    if β̄ < p < ᾱ.

In view of the optimality of the preceding stationary policy, the sequential probability ratio test described in Section 5.5 of Vol. I is justified when the number of possible observations is infinite.
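The thresholds ᾱ and β̄ can be approximated numerically by carrying out the value iteration J* = lim_k T^k J_0 on a grid for the sufficient statistic p. The sketch below does this for made-up observation distributions and costs (all numbers are illustrative assumptions); the continuation region that comes out is an interval (β̄, ᾱ), in agreement with the discussion above.

```python
import numpy as np

# Assumed data: binary observations, two candidate distributions, error costs L0, L1,
# and a cost c per extra observation.
f0 = np.array([0.7, 0.3])          # P(z | f0 is the true distribution), z in {0, 1}
f1 = np.array([0.3, 0.7])          # P(z | f1 is the true distribution)
L0, L1, c = 10.0, 10.0, 0.5

grid = np.linspace(0.0, 1.0, 2001)             # grid for the sufficient statistic p
stop_cost = np.minimum((1 - grid) * L0, grid * L1)
J = np.zeros_like(grid)                        # J_0, the zero function

def continuation(J):
    cont = np.full_like(grid, c)
    for z in range(len(f0)):
        pz = grid * f0[z] + (1 - grid) * f1[z]             # P(z) under belief p
        nxt = grid * f0[z] / np.maximum(pz, 1e-12)         # updated belief after observing z
        cont += pz * np.interp(nxt, grid, J)               # E_z{ J(next belief) }
    return cont

for _ in range(300):                           # value iteration: T^k J_0 increases to J*
    J = np.minimum(stop_cost, continuation(J))

region = grid[continuation(J) < stop_cost]     # where continuing is strictly cheaper
if region.size:
    print("continue sampling for p in about (%.3f, %.3f); accept f1 below, f0 above"
          % (region.min(), region.max()))
else:
    print("c too large relative to L0, L1: it is optimal to stop immediately for every p")
```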
Figure 3.4.1 Derivation of the sequential probability ratio test.

The Case of Negative Transition Costs

We now consider the stopping problem under Assumption N, that is,

    s(x) ≤ 0,   c(x) ≤ 0,   for all x ∈ S.

Under these circumstances there is no penalty for continuing operation of the system (although by not stopping at a given state, a favorable opportunity may be missed). The mapping T is given by

    (TJ)(x) = min[ s(x), c(x) + E{ J(f(x, w)) } ].

The optimal cost function J* satisfies J*(x) ≤ s(x) for all x ∈ S, and by using Props. 1.1 and 1.5(b) of Section 3.1, we have

    J* = T J*,   J* = lim_{k→∞} T^k J_0 = lim_{k→∞} T^k s,

where J_0 is the zero function. It can also be seen that if the one-step-to-go stopping set S_1 is absorbing [cf. Eq. (4.7)], a one-step lookahead policy is optimal.
Example 4.3 (The Rational Burglar)

This example was considered at the end of Section 4.4 of Vol. I, where it was shown that a one-step lookahead policy is optimal for any finite horizon length. The optimality equation is

    J*(x) = max[ x, (1 − p) E{ J*(x + w) } ].

The problem is equivalent to a minimization problem where

    s(x) = −x,   c(x) = 0,

so Assumption N holds. From the preceding analysis, we have that T^k s → J* and that a one-step lookahead policy is optimal if the one-step stopping set is absorbing [cf. Eqs. (4.6) and (4.7)]. It can be shown (see the analysis of Section 4.4 of Vol. I) that this condition holds, so the finite horizon optimal policy whereby the burglar retires when his accumulated earnings reach or exceed (1 − p)w̄/p, where w̄ denotes the expected value of w, is optimal for an infinite horizon as well.

Example 4.4 (A Problem with no Optimal Policy)

This is a deterministic stopping problem where Assumption N holds, and an optimal policy does not exist, even though only two controls are available at each state (stop and continue). The states are the positive integers, and continuation from state i leads to state i + 1 with certainty and no cost, that is,

    S = {1, 2, ...},   c(i) = 0,   f(i, w) = i + 1   for all i ∈ S and w ∈ D.

The stopping cost is

    s(i) = −1 + (1/i)   for all i ∈ S,

so that there is an incentive to delay stopping at every state. We have J*(i) = −1 for all i, and the optimal cost −1 can be approached arbitrarily closely by postponing the stopping action for a sufficiently long time. However, there does not exist an optimal policy that attains the optimal cost.

3.5 OPTIMAL GAMBLING STRATEGIES

A gambler enters a certain game played as follows. The gambler may stake at any time k any amount u_k ≥ 0 that does not exceed his current fortune x_k (defined to be his initial capital plus his gain or minus his loss thus far). He wins his stake back and as much more with probability p, and he loses his stake with probability (1 − p). Thus the gambler's fortune evolves according to the equation

    x_{k+1} = x_k + w_k u_k,   k = 0, 1, ...,                       (5.1)
where w_k = 1 with probability p and w_k = −1 with probability (1 − p). Several games, such as playing red and black in roulette, fit this description. The gambler enters the game with an initial capital x_0, and his goal is to increase his fortune up to a level X. He continues gambling until he either reaches his goal or loses his entire initial capital, at which point he leaves the game. The problem is to determine the optimal gambling strategy for maximizing the probability of reaching his goal. By a gambling strategy, we mean a rule that specifies what the stake should be at time k when the gambler's fortune is x_k, for every x_k with 0 < x_k < X.

The problem may be cast within the total cost, infinite horizon framework, where we consider maximization in place of minimization. Let us assume for convenience that fortunes are normalized so that X = 1. The state space is the set [0, 1] ∪ {t}, where t is a termination state to which the system moves with certainty from both states 0 and 1 with corresponding rewards 0 and 1. When x_k ≠ 0, x_k ≠ 1, the system evolves according to Eq. (5.1). The control constraint set is specified by

    0 ≤ u_k ≤ x_k,   0 ≤ u_k ≤ 1 − x_k.

The reward per stage when x_k ≠ 0 and x_k ≠ 1 is zero. Under these circumstances the probability of reaching the goal is equal to the total expected reward. Assumption N holds, since our problem is equivalent to a problem of minimizing expected total cost with nonpositive costs per stage.

The mapping T defining the DP algorithm takes the form

    (TJ)(x) = max_{0 ≤ u ≤ x, 0 ≤ u ≤ 1−x} [ p J(x + u) + (1 − p) J(x − u) ]   if x ∈ (0, 1),
    (TJ)(0) = 0,   (TJ)(1) = 1,

for any function J : [0, 1] → [0, ∞).

Consider now the case where

    0 < p < 1/2;

that is, the game is unfair to the gambler. A discretized version of the case where 1/2 ≤ p < 1 is considered in Exercise 3.21. When 0 < p < 1/2, it is intuitively clear that if the gambler follows a very conservative strategy and stakes a very small amount at each time, he is all but certain to lose his capital. For example, if the gambler adopts a strategy of betting 1/n at each time, then it may be shown (see Exercise 3.21 or [Ash70], p. 182) that his probability of attaining the target fortune of 1 starting with an initial capital i/n, 0 ≤ i ≤ n, is given by
    ( 1 − ((1 − p)/p)^i ) / ( 1 − ((1 − p)/p)^n ).

If 0 < p < 1/2, n tends to infinity, and i/n tends to a constant, the above probability tends to zero, thus indicating that placing consistently small bets is a bad strategy.

We are thus led to a policy that places large bets and, in particular, the bold strategy whereby the gambler stakes at each time his entire fortune x_k or just enough to reach his goal, whichever is least. In other words, the bold strategy is the stationary policy μ* given by

    μ*(x) = x        if 0 ≤ x ≤ 1/2,
    μ*(x) = 1 − x    if 1/2 < x ≤ 1.                                 (5.2)

We will prove that the bold strategy is indeed an optimal policy. To this end it is sufficient to show that for every initial fortune x ∈ [0, 1] the value of the reward function J_μ*(x) corresponding to the bold strategy μ* satisfies the sufficiency condition (cf. Prop. 1.4, Section 3.1) T J_μ* = J_μ*, or equivalently

    J_μ*(0) = 0,   J_μ*(1) = 1,
    J_μ*(x) ≥ p J_μ*(x + u) + (1 − p) J_μ*(x − u),                   (5.3)

for all x ∈ (0, 1) and u ∈ [0, x] ∩ [0, 1 − x].

By using the definition of the bold strategy, Bellman's equation J_μ* = T_μ* J_μ* is written as

    J_μ*(0) = 0,   J_μ*(1) = 1,                                      (5.4)

    J_μ*(x) = p J_μ*(2x)                  if 0 ≤ x ≤ 1/2,
    J_μ*(x) = p + (1 − p) J_μ*(2x − 1)    if 1/2 < x ≤ 1.            (5.5)

The following lemma shows that J_μ* is uniquely defined from these relations.

Lemma 5.1: For every p with 0 < p < 1/2, there is only one bounded function on [0, 1] satisfying Eqs. (5.4) and (5.5), the function J_μ*. Furthermore, J_μ* is continuous and strictly increasing on [0, 1].
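Before turning to the proof, here is a small numerical illustration of Eqs. (5.4) and (5.5): iterating T_μ* starting from the zero function on a dyadic grid produces a function that is increasing in x and takes the value p at x = 1/2, as the lemma asserts. The value of p and the grid resolution are arbitrary choices made only for the illustration.

```python
import numpy as np

# Illustrative values only: win probability p < 1/2 and a dyadic grid of resolution 2^-n.
p, n = 0.4, 10
x = np.arange(2**n + 1) / 2**n
J = np.zeros_like(x)                           # J_0, the zero function

def T_bold(J):
    # One application of T_mu* for the bold strategy, cf. Eqs. (5.4) and (5.5).
    out = np.empty_like(J)
    for i, xi in enumerate(x):
        if xi == 0.0:
            out[i] = 0.0
        elif xi == 1.0:
            out[i] = 1.0
        elif xi <= 0.5:
            out[i] = p * np.interp(2 * xi, x, J)                 # stake x: win -> 2x, lose -> 0
        else:
            out[i] = p + (1 - p) * np.interp(2 * xi - 1, x, J)   # stake 1-x: win -> 1, lose -> 2x-1
    return out

for _ in range(50):                            # T_mu*^k J_0 increases to J_mu* on the grid
    J = T_bold(J)

print("J_mu*(1/2) =", J[2**(n - 1)], "(should equal p)")
print("monotone increasing:", bool(np.all(np.diff(J) >= 0)))
```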
Sec. 3. Optimal Gambling Strut chit's 163 Then we have Ji(2a:) - J2(2j-) = if() < (. < G) Ji (2.r - 1) J2(2a: - 1) = -7| (r)t ~ , if 1/2 < ,r < 1. (5.7) Let z be any real number with 0 < z < 1. Define _ ( 2z if 0 < 2 < 1 /2, ''' “ } 2z - f if 1/2 < z < 1, . = f 22A._, if 0 < 2A:_, < 1/2, if 1/2 < 2A._| < 1, for k = 1,2.... Then from Eqs. (5.6) and (5.7) it follows (using p < 1/2) that |.ШЬШ)| > ~ 1.2,... (1 - РГ Since Ji(sA.) — J2(cA) is bounded, it follows that Ji(z) — J^t?) = 0, for otherwise the right side of the inequality would tend to oc. Since z £ [0,1] is arbitrary, we obtain J\ = J2. Hence J,,* is the unique bounded function on [0,1] satisfying Eqs. (5.4) and (5.5). To show that, Jis strictly increasing and continuous, we consider the mapping T/«, which operates on functions J : [0,1] i-» [0.1] and is defined by (T + ifo<.r<i/2, 1 '** Л' ’ }pJ(l) + (l-p)J(2a--l) if 1/2 < Л-< 1, (T„.J)(0) = 0, (7>J)(1) = 1. (5.8) Consider the functions Jo. T/Jo, • • • .., where Jo is the zero func- tion [Jo(j') = 0 for all a; £ [0, 1] ]. We have J,,.)-/-) = Jitn (T'y J0)(;r), ,r £ [0,1]. (5.9) Furthermore, the functions T^.Jo can be shown to be monotonically nonde- creasing in the interval [0, 1]. Hence, by Eq. (5.9), J,,» is also monotonically uondecreasing. Consider now for n = 0,1,... the sets S„ = {a: £ [0,1] | .r = k2~",k = nonnegative integer}.
.1 164 Iiii(iisc(nint('(l Problems Chap. 3 It is straightforward to verify that (7™./„)(.r)-(7’;;.-/o)(.r), tgs,, i, ш>7/>1. As a result of this cxjunlity and Eq. (5.9), - Up-.rc5),.i, n>L. (5.10) A further fact that may be verified by using induction and Eqs. (5.8) and (5.10) is that for any noniiegative integers k, n for which 0 < k‘2~n < (k + 1)2 " < 1, we have j," < ,/„.((A-+ 1)2-») (k2~")< (1 -p)». (5.11) Since any number in [0, 1] can be approximated arbitrarily closely from above and below by numbers of the form k'2 ", and since /,,• has been shown to be monotonically nondecreasiug, it follows from Eq. (5.11) that is continuous and strictly increasing. Q.E.D. We are now in a position to prove the following proposition. Proposition 5.1: The bold strategy is an optimal stationary gam- bling policy. Proof: We will prove the sufficiency condition •//<*(') >/Ц/Ф:+»)+(1--/»)./,,•('-'/), .re [0,1]. ue [o.i]n[o, i-.r], (5-12) In view of the continuity of established in the previous lemma., it it sufficient to est ablish Eq. (5.12) lor all .i: e [0, 1] and и e [(),./] Cl [0. 1 — ж] that belong to the union of the sets S'„ defined by S„ — i s C [0, !]].'.= k'2 ", к =- noniiegative integer}. We will use induction. By using the fact, that ./,,-(()) = 0. ./;,-(l/2) = p, and •/;,•(!) = 1, we can show that Eq. (5.12) holds for all .r and u in So and S|. Assume that Eq. (5.12) holds for all J-. м £ SH. We will show that it, holds for all .r, u. £ S„+i. Eor any .г, и E S„ । i with и E [0,.r] Cl [(), 1 - .r], there are four possi- bilities: 1. ./ 4- и < 1 /2, 2. ./ - и > 1 /2. 3. .i: - и < -i: < 1 /2 < ,r 4- io
Sec. З.Г) Opt iiii.il Chiiill>1 iiifi St i;itct;ics 1 65 4. j: ~ a < 1 /2 < .it < 4 и, We will prove Eq. (5.12) for each of t hese cases. Case I. 11 ,r. и c .S',, । |. then 2,r c S,,. and 2n i 5„. and l>y the induction hypothesis hi' I'-’-'’) (2.r I 2u) (I рЦ,'(2.г 2u) (I. ph.13) If x + и < 1/2. then by Eq. (5.5) Лг* ('') “ + ») - (! - /'MH-'' - ») = /'(•<• (2-r) - p-hi’ C2-1' 2 2") - 0 - р)Л'* (2'f -2»)) and using ICq. (5.1il), the desired relation Eq. (5.12) is proved for the case under consideration. Case 2. И .г. и Г S',, । [. then (2.r - I) СЕ S'„ and 2u C S,,. and by the induction hypot basis ./р» (2.r - I ) — />./,. (2.r I 2u - l)-(l /')•/,. >(2.r 2u - I) '-(I. If x — и > 1/2. t hen by Eq. (5.5) (.r) - p-/p* (.r +?/) — (!— /))./,. (.r - u) = p+ (1 -рЦ,.(2.г - 1) - p(p+ (1 - p)./,.(2.c + 2u. - 1)) -(l-p)(p + (l — p)./,-. (2.r — 2u — 1)) = (1 -?)(/„• (2.1-- I )—/>./,• (2.r + 2,/— 1)-(1 -p)./,,. (2.Г-2,i - I)) > 0, and Eq. (5.12) follows from the preceding relations. Case 2. Using Eq. (5.5). we have Jp. (.r) - p./z,* (.r +„)-(!- p) ./,* (.c - </) = p./,.(2.r) - p(p-I (1 -/>)./,. (2.r -I 2n - I)) - /,(1 /;),/p.(2,r 2u) = - P - 0 ~ P'l-lli‘('2j' + 2" ~ 0 “ (1 -р)Л>’(2-г ' 2"))- Now we must have > 2. for otherwise и < 2 and .r + a < 1/2. Hence 2.r >1/2 and the sequence of equalities can be continued as follows: •/p* ('-) - M-4'- + u) - (1 - p). / p-(.r - a) = p(p + (1 - p)'hc(’i-1' - 1) - P - (1 ~ /')•//<* (2-'' + 2u - 1) - (I - p)-/p* (2.r - 2ii)) = P(l ~ - 1) - ./p> (2,г I- 2u - 1) - ./,• (2.r - 2i,)) = (1 ~ р)(Л<’ C2'1' - l/2) “ P-hi* (2-r + 2» - J) - p./р’ (2.Г - 2,/)).
1G6 Undiscountrd Problems Chap. 3 Since p < (1 — p), the last expression is greatin* than or equal to both (1 ~P) (•//«* (2ж - 1/2) - l^i'* (2:t + 2<i - 1) - (1 - p) J/t. (2.Г - 2u)) and (1 l>) (Л'* (2-r 'Z2) ' И " (2r + 2" ~ И “ рЛ«*(2-': - 2"))- Now for x, и £ S„+i, and n. > 1, we have (2x — 1/2) £ SH and (2u — 1/2) £ б/ if (2u — 1/2) £ [0, 1], and (1/2 — 2a) £ 5„ if (1/2 — 2u) £ [0, 1]. By the induction hypothesis, the first or the second of the preceding expressions is nonnegative, depending on whether 2.z: + 2u — 1 > 2x — 1/2 or 2.r — 2n > 2.i: - 1/2 (i.e., « > j or и < /). Hence Eq. (5.12) is proved for case 3. Силе. Ji. The proof resembles the one for case 3. Using Erg (5.5), we have У,,- (.r) - p.7/(. (.i: + u) - (1 - p)./„* (-r - ".) = P + (1 - 1'Ц.’ (2-f - 1) - P(P + (1 - p)V (2x + 2ч - 1)) - (1 -p)p./,,.(2x - 2u) = p( 1 - p) + (1 - p) (./,,♦ (2x - 1) - pj/l»(2x + 2u - 1) - pjyl*(2.r - 2it)). We must have for otherwise и < i and x — и > Hence 0 < 2.i: — 1 < 1 /2 < 2.i: — 1 /2 < 1, and using Eq. (5.5) we have (I -p)./„.(2x- 1) = (1 -р)р./„*(Ь--2) =р(./,л(2.г- 1/2) -p). Using (lie preceding relations, we obtain (x) - p./,y (x + »)-(!- p)./;,* (x - и.) p( 1 - p) + p(./„- (2.,; - 1/2) - p) - p( 1 ~ p)./„. (2./; + 2u - 1) -P(l - p)-//<’(2.'-- - 2») = p((l ~2p) 1-(2.r — 1/2) — (1 —/;)./,• (2.r + 2u — 1) -(1 -p).//(*(2.r -2»)). These relations are equal to both p((l - 2p)(l -7/(.(2.r + 2«-1)) + ./,,.(.r - 1/2) - р./,,*(2.г + 2ii - 1) - (1 - p)./,,*(2.c - 2u.)) and p((l -2p)(l - J„.(2.r-2u)) + ./„.(2.Z.- - 1/2) - (1 - p)J„*(2x + 2u - 1) - pJ„.(2a- - 2u)).
St'c.:(.() Nonslrit ionary and Periodic Problems 167 Since 0 < ./,,»(2.r + ‘2u — 1) < ! and 0 <•/,,» (2./: - 2n') I, these expressions arc greater than or equal to both p(.//z.(2.r - 1/2) —/>./,• (2.r -I- 2u - 1) - (1 -p)J„.(2.r -2«)) and />(./„. (2.r - 1/2) - (1 - />)./„, (2.r + 2a. - 1) - />./,,• (2.r - 2»)) and the result follows as in case 3. Q.E.D. We note that the bold strategy is not the unique optimal stationary gambling strategy. For a characterization of all optimal strategies, sei' [DuSG5], p. 90. Several other gambling problems where strategies of the bold type are optimal are described in [DuStih], Chapters 5 and (i. 3.6 NONSTATIONARY AND PERIODIC PROBLEMS The standing assumption so far in this chapter has been t hat, the prob- lem involves a. .stationary system and a. stationary cost per stage (except for the presence of the discount, factor). Problems with uonstationary system or cost per stage arise occasionally in practice or in theoretical studies and are thus of some interest . It t urns out that such problems can be converted to stationary ones by a simple reformulation. We can then obtain results analogous to those' obtained earlier for stationary problems. Consider a uonstat ionary system of the form I’k+l = fk(.-l’k, "k- H’l). /.' = 0,1.......... and a. cost function of the form (6.1) In these equations, lor each Ir, .ij- belongs t.o a space >S\.. belongs l.o a space Ck and satisfies Uk G Uk(-i'k) for all ./g- G Sk. and «>*. belongs to a countable space /?д.. The sets Sk, Ck, I /•(''/. ). D/, may dilfer from one stage to the next.. The random disturbances n>k are characterized by probabilities Pk(- I -i’k, "k)- which depend on ,гд. and m- as well as the time index A’. T’lic set of admissible policies 11 is t he set. ol all sequences тг — {/*о- P i, • } with pk : Sk >-> Ck and iik(:rk) G for all G Sk and A: = 0, 1.... The functions (jk : Sk x Ck x Dk Si are given and are assumed to satisfy one of the following three assumptions:
168 ( ill<lisi4)linlc<l I’loblt'lllS Chnp. 3 Assumption D': We have a < 1, and the functions yA. satisfy, for all k = 0,1,..., I'/* "A, < Л/, lor all (.rA., ii/., iiy,-) & S\- x C'A. x /?A., where j\f is sonic scalar. Assumption P': The functions r/A. satisfy, for all k = 0, 1...., 0 < 4k, U'k), for all (.rA, lik, ч’к) C 5/.- x Ск X Dk- Assumption N': The functions <ji,- satisfy, for all k = 0. 1,..., , iik, "’a ) < 0, for all (.i>, <ik, n'k) € -S’A- x CA. x T?A:. We will refer to the problem formulated as the iionst.idioiiary problem (NSP for short). We can get an idea on how the NSP can be converted to a stationary problem by considering the special cnse where the state space is the same for each st.age (i.e., S'A — S for all /,•). We consider an augmented state = (.r, /,). where ./• 6 5, and is t he I'line index. The new st,ate space is 5 = 5 x 7v, where К denotes the set, of nonnegative integers. The augmented system evolves according to (.r./.:)^ (,/A(.r. UA. H’k)J, + 1), (.r,/.')G 5. Similarly, we can deline a cost per stage as </((.r, /,). MA-. H'k) = 'Д 4k, Ii'k), (.r. fc) e 5. It is evident that, the problem corresponding t-o the augmented system is stationary. If we restrict attention to initial states ,?o G S x {()}, it can be semi that this stationary problem is equivalent to the NSP. Let us now consider the more general case. To simplify notation, we will assume that the state spacers S',, i = 0. 1...... t.he control spaces Ci,
Sec. :i.(i A'o/istatioimi'v iiinl I’<4i<xlic 1 'iоЫгтх 169 i = 0.1.....and tlie disturbance spaces D,. i. = 0,1...., arc all mutually disjoint. This assumption docs not involve a loss of generality since, if necessary, we may relabel Ihc elements of .S',. (",. and I), without a I lent i ng the struct me of t he problem. Doline now a new stale space ,S'. a new conf rol space C, and a new (countable) disturbance space I) by .S'ЦДД',. с-ид„с,, 1) --- U,x Introduce a new (stationary) system П- u I.-Ii’r), a-0. I............. (6.2) where .! /. 6 5. Гр. € C, <7y, € D. and I he system finict ion / : S x C x 1) .S' is defined by u. ir) - /,(». 7/. «). if.г C.S',. и ii’ L !>,, i- I). I, . . . For triplets (.?-. i~i. ir). where for some 7—0.1,..., we have' .7: G .S',, but it C, or ii’ I),, the definition of J is immaterial; any definition is adequate for our purposes in view of the control constraints to be introduced. Flic control constraint is taken to be ii £ U(.r) for all ,r € 5, where (7(-) is defined by Щ.г) - (Ц.г). if.rG.S',. / — Tlie disturbance (c is characterized by probabilities /’(<? | .]. ii) such that P(ii’ G D, ) ,f 6 .S',, й G C,) = L, i = 0. 1.... I’(ir <1 I), | .r C .S',. ii G C,) --- (I. / (I. I... . Furthermore, for any ir, P 1),. .r, G S,. n, G Ci, i = 0, 1....we have l’(ir, I .r,. (/,) - /’,(</>, I .г,. II,). We also introduce a new cost function where the (stationary) cost per stage1 у : S x С x I) w Ii is delined for all i = 0, 1,.. . by y(.i.'. u, ic) ~ n-ii’). it .г G 5,. и G C',. in G I),. For triplets (.r, ii, ii’), where for some 7 = 0.1,..., we have .7: £ but h or ir /9,. any definition of у is adequate provided |</(.r. h. «)| < .1/ for all (.K.fi.ir) when Assumption D' holds, 0 < ii, ir) when 1’' holds, and
170 I h к I inion i it <’< / / ’inblcnis ( 'll.!/». 3 ii. tr) < (I when N' holds, '[’lie set of admissible policies 11 for the new problem consists of all sequences тг = {po./'i. . . .}, where ///. : S >--> C and p/ (.r) 6 (/(.r) for all .r G S and A: = (), 1.. rhe construction given defines a problem that, (dearly (its the frame- work of I lie inlinite horizon total cost problem. We will refer t.o t.his problem as the stationary problem (SP for short). It is important t.o understand the nature of the intimate connection between the NSP and the SP formulated here. Let тг = {po.pi be an admissible policy for the NSP. Also, let тг — {po,pi.. ..} be an admissible policy lor the SP such t hat if.rGS',. /=0,1,... (6.4) Let ./'о G ,S'i) be the initial state for the NSP and consider the same initial state for the SP (i.e.. .r() = jp G So). Then the sequence of states {./;} generated in the SP will satisfy G S,, / = 0,1...., with probability 1 (i.e., the system will move from the set So t.o the set Sj, then t.o Sj, etc., just as in tin' NSP). Furthermore, the probabilistic law of generation of states and costs is identical in the NSP and the SP. As a. result, it is easy t.o see that, for any admissible policies тг and тг satisfying Eq. (6.4) and initial states .ip, .ip satisfying ,/p = j'ii G So, the sequence of generated states in the NSP and the SP is the same (,r, = for all i) provided the generated disturbances tty and ib, are also the same for all i (io, = ib,, lor all /). Furthermore, if тг and тг satisfy Eq. (6.4), we have ./„(.ro) = ./„(.zp) if .Го -- -i'll G S\). Let. us also consider the optimal cost functions for the NSP and the SP: ./'(•'о) = min ,/я(.Гц). .r() G So. rrCll ./*(.f’o) •- min ./„-(.ip). .f(i G Si,. 7ГС 1 1 Then it follows from the const ruction of the SP that where, for all / = О, 1...... г л -1 ) (.ro./) --min lim /•,’ a1-' '///, (.г/, //g • (./>). ng) > , (6.6) л <_ 11 N- -nt “'r у—' !. -II. i.,V - I < fc-z ) if.ro = .r, G S,. Note that in this equation, the right-hand side is defined in terms of the data of the NSP. /Vs a special case of this equat ion, we obtain ./‘(.ip) = ./’'(.ro.O) = ./*(./'o), il -i'o = -t'o G S'o. (6.7)
Sec. 3.6 Noiistiitiouinv noil Periodic Problems J71 Thus the optimal cost. function. J* of the NSP can. be obtained from the optimal cost function J* of the SP. Furthermore, il тг* — {/'и-/'i • } is an optimal policy for the SP, then the policy тг* — {/q,, //.],...} defined by /т*(х,) = p*(.r,), for all .r, £ S’,, i = (), I,..., (6.8) is an optimal policy for the NSP. Thus optimal policies for the SP yield optimal policies for the. NSP via Eq. (6.8). Another point to be noted is that if Assumption D' )P',N') is satisfied for the. NSP, then Assumption D (P, N) introduced earlier in this chapter is satisfied, for the SP. These observations show that one may analyze the NSP by means of the SP. Every result given in the preceding sections when applied to the SP yields a corresponding result for the NSP. We will just provide the form of the optimality equation for the NSP in the following proposition. Proposition 6.1: Under Assumption D' (P', N'), there holds J*(x0) = J*(xo,0), xo € So, where for all i — 0,1,..., the functions J*(-, i) map St into JR ([0, oo], [—00,(1]), are given by Eq. (6.G), and satisfy for all xi £ S; and i = J*(.r,,i)= min /i'{</,(.Ti,Ui,Wi) + a.7*(/,.(j:,,tq,wi),T + l)}. (6.9) Under Assumption D' the functions J*(-, г), i = 0,1,..., are the unique bounded solutions of the set of equations Eq. (G.9). Furthermore, under Assumption D' or P', if//.*(.t;) £ Ui(r,i) attains the minimum in Eq. (6.9) for all £ Si and i, then the policy тг* = {//,(),/г.р...} is optimal for the NSP. Periodic Problems Assume within the framework of the NSP that there exists an integer p > 2 (called the period) such that for all integers i and j with |i— y| = nip, m = 1.2.....we have S,=Sj, C,=Ch 1); = Dj, U,.(-) = Uj(-), fi = fj. <h=!J.i- I •'../) = Pj(- I ('•'") e S, x C,. We assume that the spaces S>, Ci, D>, i = 0,1,...,p - 1. are mutually disjoint. We define new state, control, and disturbance spaces by S = U'’=;[S,. С = и‘(А'}С,. D = C'^Di.
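Computationally, the equivalent stationary problem can be treated with the algorithms already available for stationary problems: under Assumption D', for example, ordinary value iteration applied to the augmented state (x, i), where i is the phase within the period, converges to the functions appearing in the optimality equations written out below. A minimal sketch with made-up finite-state data (all transition matrices, costs, and sizes are illustrative assumptions):

```python
import numpy as np

# Assumed problem data: n states, m controls, period p, discount factor alpha < 1.
rng = np.random.default_rng(0)
n, m, p, alpha = 4, 2, 3, 0.9
P = [np.array([r / r.sum(axis=1, keepdims=True) for r in rng.random((m, n, n))])
     for _ in range(p)]                     # P[i][u] : n x n transition matrix at phase i
g = [rng.random((n, m)) for _ in range(p)]  # g[i][x, u] : one-stage cost at phase i

J = np.zeros((p, n))                        # J[i, x] ~ optimal cost from (x, phase i)
for _ in range(500):                        # value iteration on the augmented state (x, i)
    J_new = np.empty_like(J)
    for i in range(p):
        # Q[x, u] = g_i(x, u) + alpha * sum_j p_ij(u) * J(j, (i+1) mod p)
        Q = g[i] + alpha * np.stack([P[i][u] @ J[(i + 1) % p] for u in range(m)], axis=1)
        J_new[i] = Q.min(axis=1)
    J = J_new

policy = [np.argmin(g[i] + alpha * np.stack([P[i][u] @ J[(i + 1) % p] for u in range(m)],
                                            axis=1), axis=1) for i in range(p)]
print("optimal periodic policy (one control index per state, per phase):", policy)
```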
-1 172 I' tidix'oimt <<! РтЬкчпх (’h;ii>. .7 The optimality equation for Uh* <‘<juivalent stationary problem reduces to 11ie system ol p equal ions .P (.Co. 0) niin А’{.'/1|(-П). «о. »’(|) I-t\.l' Hu, H’li). I)}, "№,’l|(.rll) ir,, •7*(.ri. 1) = mill E {.'/(' i. “I, "’i) + 0.7* (J) (.1'1 • «1. (Г|). 2)}. " I C/q (.r । ) ii'i I-/'- 0 = min I'- {fh> । !•"/> i i) "p 11 <'(. । ।) । Ь о./ * (/), _ । (.rp _ ।. //p . ।, ti'p- i). (1)} • These equations may lie used to obtain (under Assumption D' or P') a periodic policy of the forni (//„,..., //.*..,, //(j. . . . , //.*„,,.. .} whenever the minimum of the right-hand side is attained for all .1:,. i = — I. When all spaces involved are finite, an optimal policy may be found by means of the algorithms of Section 1.3. appropriately adapted to the corresponding SP. 3.7 NOTES, SOURCES, AND EXERCISES Uiidisconnted problems and discounted problems with unbounded cost per stage were first analyzed systematical]}' in [DuSG5], [BlaG5], and [Strtiti], An extensive treatment., which also resolves the associated mea- surability questions, is [BeS7S]. Sullicient conditions (or convergence of the value iteration method under Assumption P (ci. Props. l.G and 1.7) were derived independently in [Ber77] and [Scli75]. The former reference also de- rives necessary conditions for convergence. Problems involving convexity assumptions are analyzed in [Ber73b]. We have bypassed a number of complex theoretical issues relating to stationary policies that, historically have played an important role in tlie development, ol the subject of this chapter. The main question is to what extent is it possible I,о restrict attention to stationary policies. Much theoretical work has been done on this question [BeS79], [BlaGG], [Bla70], [DuS(i.r>]. [h'ei78]. [l?eS83], [Fei92a], [l?ei92b], [OrnGO], and some aspects are still open. Suppose, for example, that we are given an c > 0. One issue is whether there exists an (-optimal stationary policy, that is, a stationary policy /1, such that ,/y,(.r) < ./*(.r) + <. for all ,r e 5 with ./*(<') > —cc,
Sec. 3. / Notes, Sources, ;iu<] I'lxercisos 173 for nil ./ 6 .S' with J*(.r) — — The answer is positive under any one of tin1 following conditions: I. Assumption J* holds and о < 1 (see Exercise 3.<S). 2. Assumption N holds, .S' is a Unite set. о I. and ,/'(.r) > x lor all .r 6 .S' (see Exercise 3.11 or [Bla65], [Bla.70], and [Orn(>9]). 3. Assumption N holds. .S’ is a count.able set. о I. and the problem is detcrininistic (sec [BeS79]). The answer can be negative under any one of the following conditions: 1. Assumption 1’ holds anil о — 1 (see Exercise 3.<S). 2. Assnmption N holds and о < 1 (see Exercise 3.11 or [BeS79]). The existence of an f-opt.imal stationary policy for stochastic shortest path problems with a finite state space, but. under somewhat dilferent assump- tions than the ones of Section 2.1 is established in [IA’i92b]. Another issue is whether there exists an optimal stationary policy whenever there exists an optimal policy for each initial state. This is true under Assumption P (see Exercise 3.9). It is also true (but very hard to prove) under Assumption N if Jf(./') > — oo for all .r G S', л = I. and the disturbance space D is countable [Bla70]. [DuS'tif)]. [Orii(>9]. Simple two- state examples can be constructed showing that the result, fails to hold if о = I and ./*(.r) = for some state .r (see Exercise 3.10). However, these examples rely on the presence of a stochastic element in the problem. If the problem is deterministic, stronger results are available; one can find an optimal stationary policy if there exists an optimal policy at each initial state and either о = 1 or о < 1 and ,/*(.r) > — эс for all .r G S. These results also re<iiiire a difficult proof [BeS79j. The gambling problem and its solution are taken from [I)uS(>5]. In [Bil83]. a surprising property of the optimal reward function ./* for this problem is shown: ./* is almost everywhere differentiable with derivative zero, yet it is strictly increasing, taking values that, range from 0 to 1. EXERCISES 3.1 Let i> = [0. oc) and C = L’(.r) = (0, oo) be the state and control spaces,
174 Undiscuuntcd Problems Cl nip. 3 respectively, let the system equation b<5 X'fcH = 'J'k + Uk, k = 0,l,..., where <a is the discount factor, and let gt-i-kiUk) = n, + uk l>c the co.st per .stage. Show that lor this deterministic problem, Assumption P holds and that = oo for all j; G S', but (7^’Jo)(0) = 0 for all к [Jo is the zero function, Jo(ji) = 0, for all .1: G 5]. 3.2 Let Assumption I’ hold and consider the finite-state case S' = D = {1,2,..., n}, o=l, .г*,.। = »4:. The mapping 7’ is represented as (7’J)(i) = min nee’(') ?t //(л") + 57 Pr/(»)•/(.» ,Z=I where pz/(u) denotes the transition probability that the next state will be j when the current state is i and control u is applied. Assume that the sets U(i) are compact, subsets of Rn for all z, and that p7j(u) and <7(7, u) are continuous on (7(i) for all i and j. Show that liniA:-.^ (Tk ./())(z) = ./*(/’), where J«(z) — 0 for all / — 1,..., 11. Show also that there exists an optimal stationary policy. 3.3 Consider a deterministic problem involving a linear system = A.14. + iluk, k = 0, 1,..., when' the pair (.1, J3) is controllable and .i‘k 6 R", щ. 6 .R,n. Assume no constraint.* on the control and a cost per stage (j satisfying 0 < ll)- (-Л n) R” x R"'- Assiiinc furthermore' that (j is continuous in .1: and 11, and that u„) —► 00 if {.r„} is bounded and ||u.,,|| —» 00. (a) Show that for a discount factor о < 1, the optimal cost satisfies () < ./*(./) < эс, for all .r G R". Furthermore, there exists an optimal stationary policy and ^lim (7'A'J,,)(.!) = ./*(.!), x G IR”. (b) Show thnl the same is true, except perhaps for J*(x) < oc, when the system is of the form .rryi = /(хл.щ), with f : R" x Rm i—» R" being a rout inuous 1111 li t ion. (c) Prove the same results assuming that the control is constrained to lie in a compact set U G i')?’” [С(х) = U for all x] in place of the assumption !l(.r„, 11,,) oc if {.r,,} is bounded and [|u„|| —> oc. Hint: Show that 7'l ./o is real valued and continuous for every k, and use Prop. 1.7.
Sec. 3.7 Л'о/es, Sources. mid l-'xerc'iscs 175 3.4 Under Assumption P. let // be such that for al! ,r e S, /i(.r) C l'(.r) and (Т„./')И < (т.пи + (. where < is some positive scalar. Show that,, if о < I. Hint: Show that (7’/'./*)(.г) < J n'r. Alternatively, let = .)* + (< / (1 — о)}show that 7),!' < , and use Cor. 7.1.1. 3.5 Under Assumption P or N, show that if о < 1 and : S >-♦ )7 is a bounded function satisfying then = -Г. Hint'. Under P, let. г be a scalar such that. .7* + re > J'. Argue that ./* > and use Prop. 1.2(a). 3.6 We want to find a scalar sequence {zz<>, u.i,...} that, satisfies I/./,. < e, ill,. > 0, for all and maximizes УУ u .(/(m.-). where <• > 0 and .</(») > 0 for all и > 0, <;(()) = 0. Assume that 7 is monotonically uondecreasing 011 [0, oc). Show that the optimal value of the problem is J'(c), where J* is a monotonically uondecreasing function on [(),<x>) satisfying ./*(0) = 0 and J*(.r) = max {.</(») + .)*{.r ~ 11) }, .rt[0,<x). ()<- M< t 1 J 3.7 Let Assumption P hold and assume that, 7Г* — } 6 11 satisfies ./* = for all k. Show that, zr* is optimal, i.e.. ,/„* = J*. 3.8 Under Assumption I \ show t hat,, given < > (), I.here exists a policy я, ( II such that ./л< (.г) < .7* (,c) + (. for all .г E 5, and that, for a < 1 the policy zr, can be taken stationary. Give an example where о = 1 and for each stationary policy “ we have oo, while J*(.r) = 0 lor all .r. Hrn.l: Sec t he proof of Prop. 1.1.
-. .1 176 Uudiscountvd 7('hup. 3 3.9 Under Absuiiiptiou 1\ show that if then' exists an optimal policy (a policy 6 II such that ./я+ — ,/*), then there exists an optimal stationary policy. ЗЛО Use t he following counterexample to show that t he result of Exercise 3.9 may fail Io hold under Assumption N if •/*(./•) -- — "x. for some .r G S. Let .s' / > {(h I }. J (.г. и. ir) /г. ij(.r. и. ir) ii. I '(0) ( x Л)]. Г ( 1 ) {0}, l>( ir 0 | j- — 0. 11) . and //( // I | ./ I. и) -= I. Show that J ’ (0) --- — x. ./‘(1) -- 0 and that the admissible noiist a I iouarv polios1 {///». Pi...•} with //£(()) — - (2/м)д is optimal. Show that every stationary policy // satisfies - (2/(2 - <>))//(()). -h> ( I) “ 9 (see (Bla“()]. [PuS(j5], and [Oni(i9] for related analysis). ЗЛ1 Show t hat I he result of Exercise 3.S holds under Assumpt ion N if 5 is a finite set, n — 1, and ./k(-r) > for all .r C- .S’. Construct a counterexample to show that. the* result can lail to hold il .S’ is count.able and n < I [even if > -эо for all ./• t .S']. Hint: Consider au integer Ar such that tho Л'-stage optimal cost satisfies ./,\(.r) < ./*(./) -1- ( for all .r. Eor a counterexample, see [1 >eS79j. ЗЛ2 (Deterministic Linear-Quadratic Problems) Consider t he determiiiistic linear-quadratic problem involving the system П- ( j = A.i-f. + Hui. and t he cost Л(.Ги) ~ -E /0, (./‘A )'////A-(./‘A )) - /. (I We assume that /»’ is positive definite symmetric. Q is of the form C'C, and the pairs (. I. />). (. I. (’) are cout rollable and observable, respect ively. Use the I henry of Sect ions 1.1 о I \ ol. I ami S. I to show that the stationary policy /С with /<'(.<) =• -(//'A'/7 + /I) '///I’.l.r is optimal. wlKTe /V is tin’ unique positive scniidelinite symmetric solution of tlie algebraic liiccati equation (if. Section -1.1 of Vol. 1): A' = ,!'(/< - К Hi 11'К H i НГ'и'К)Л l-Q. Provide a similar result under au appropriate controllability assumption for the case of a periodic deterministic linear system and a periodic quadratic cost. (cf. Section 3.6).
Sec. 3.7 Xotes. Sources. ;iu</ l-'xrreisex 177 3.13 Consider 11 к» linear-quadrat ic problem of Seel ion 3.2 with I he only ditlerencc that the disturbances have* zero mean, but their covariance ma (rices are nonsl at iouarv mid uniformly bounded over A'. Show that t he optimal control law remains unchanged. 3.14 (Periodic Linear-Quadratic Problems) Consider the linrat system .ri, ) ) --- .L.•»/, 4 />/. nt. + и').-. A- - and t he quadratic cost where the mat rices have appropriate' dimensions. Qa and A’a are positive svmidelinilv and positive definite symmetric, respectively. for all 1, and 0 s. о < I. Assume that. the system and cost arc periodic with period p (cf. Section 3.6). that the controls are unconstrained, ami that the disturbance's are independent, and have zero mean and finite covariance. Assume further that the following (controllability) condition is in ellecl. I'or any st ate 7(i, t here exist s a finite sequence of coni rols {77o. 771.711 } such that 7,-ч i = 0. where 7,- •. । is general cd by 7;. । । = /1а.га- 4- Ba-7a. /, = ().!.г. Show that there is an optimal periodic policy "* of the form -/'p l - where p*,....//* i aregiveuby О (о /У, | | /7 ( /?, ) 1 /j' А’, ; I . I,.r, i 0.p -- 2. Mp -)(') = - о ((\в;,..]1 -I- /q,- i) ’ Bp. i A’nA/1 1 .Г, anti the matrices A’d.A’i....A‘p । satisfy the coupled set of p algebraic Itic- cati equations given for i 0, I...../> - I by A', = 1 -<12/\',.и/Л(о/Л'Лн 1/Л + Г.) 'li',K,nA,) +Q,. wit h /\’p — 1\(
3.15 (Linear-Quadratic Problems Imperfect State Information) Consider the linear-quadratic problem of Section 3.2 with the difference that t he controller. instead of having pol led, st at e informal ion, has access to mea- surements of t he form •-a (' i i. I I'k, к --- d, i,... As in Section 5.2 of Vol. I, (he disturbances од- an1 independent and have, identical statistics, zero mean, and finite covariance matrix. Assume that for every admissible policy 7Г the matrices | M)(.o,- /;{.a | 4})' I d are miilbrinlv bounded over A-. where /д is the information vector delined in Section 5.2 of Vol. 1. Show that the stationary policy //* given by + I!) 11)'К.-\Е(.Г1- I Л}. lor all /a', is optimal. Show also that the same is true if U'k and ед art' nonstationary with zero mean and covariance matrices that are uniformly bounded over k. Ihnl: Combine t he theory of Sections 5.2 of Vol. I and 3.2. 3.1G (Policy Iteration for Linear-Quadratic Problems [Kle68]) Consider the problem of Sect ion 3.2 and let. Ao be an т x n mat rix such that, the matrix (/I 4- BA()) has eigenvalues strictly within the unit circle. (a) Show that tin* cost corresponding to the stationary policy po, where = L{t.r is of the form (./•) = ./•' 4- constant. where A’o is a positive seimdefinite symmetric matrix satisfying the (linear) equal ion A’() - <i(A I- ill^'K^A 4- JJLo) 4- Q 4- (b) Let //[ (./•) al,tain tin* minimum lor each .r in the expression mm{tZ/?u4 a (Ar 4- ВяУКо(.Л.г 4- bu) j. и Show that for all .r we have Jfl! (j1) - .r A’| .r 4- constant < Jllit (./), whore Ki is some positive semidefinite symmetric matrix. (c) Show that the- policy iteration process described in parts (a) and (b) yields a sequence {Л /, } such t hat К к К. where A' is the optimal cost matrix of the problem.
Sec. 3.7 Notes. Sources, and I Exercises 179 3.17 (Periodic Inventory Control Problems) In the inventory control problem of Section 3.3. consider the ca.se where (he stat istics of the demands шд., the' prices q , and the holding and t he shortage costs are periodic with period p. Show that there exists an optimal periodic policy of the form тг* = {//<*. , /G*>, i•••}. /t,’(.c) = { 1prt^'S*.’ / = 0, - 1, I 0 11 ot herwise, where S’,,.. .., ,S’_ । are appropriate scalars. 3.18 [HcS84] Show that the critical level S' for the inventory problem with zero fixed cost, of Section 3.3 minimizes (1 -- <\)cy + L(,y) over y. Hint'. Show that the cost can be expressed as Л(.го) (U “ f‘)<7A + №a)) + । ^{“’} - <‘-'o ; . I A -0 J where /д = .17, + /ц.(.'О- 3.19 Consiiler a inacliine Hint may break clown and can be repaired. When it operates over a time unit, it. costs —1 (that is, it produces a benefit of 1 unit ), and it may break down with probability 0.1. When it. is in the breakdown mode, it may be. repaired with an effort «.. The probability of making it operative over one time unit is then u, and the cost is Си2. Determine the optimal repair effort over an infinite time horizon with discount factor о < 1. 3.20 Let г», г,,... be a sequence of independent and identically distributed ran- dom variables taking values on a finite set Z. We know that, the probability distribution of the д ’s is one out of n distributions and we are trying to decide which distribution is the correct one. At each time k after observing 2|,...,za, we may either slop the observations and accept one of the n distributions as correct, or take another observation at a. cost c ~> 0. The cost for accepting f, given that f, is correct is L,,, i,j = l,...,n. We assume L,j > 0 for i j, L,, — 0, i - 1.......n. The a priori distribution of fl..... f„ is denot ed ll /’<1 = {li.l-i.../'<: > о. У7/>;, -- I. Show that the optimal cost J' (/b) is a concave hmetion of /h- Characterize the optimal acceptance regions and show how they can be obtained in the limit by means of a value iteration method.
180 I '• IHii.SO >11 111 <</ 1 ’r< >/ik'l U.S ( .'/nip. .1 3.21 (Gambling Strategies for Favorable Gaines) A ga nil tier plays a game such as the one ol Section 3.5, but where the prob- ability ol winning p sat islies 1/2 p •' I. I lis object ive is to reach a final fortune //, where n is an integer with n -£ 2. His initial foil,line is an integer i with 0 < i. < ?/, and his stake at time A' can take only integer values //д. satisfying (I \ it д- < ла , 0 < ад- £ ti - .г/,, where .гд- is his fortune at, time A:. Show that, t he strategy that always stakes one unit is optimal [i.e., //*(./) = 1 for all integers with 0 < .r n is opt imal]. lliiil: Show that if p £ (1 /2, I), and if p -- I /2, ./ * (/)—-. 0 < i < ti, ii (or see [Ash70|. p. 1x2, lor a proof). Then use t he sufficiency condition of Prop. 1.1 in Sect ion 3.1. 3.22 [Sch81] Consider a network of n queues whereby a customer at queue i upon comple- tion of service is rout cd to queue j with prol >a 1 >i I i I у p, t, and ex i t s the net work with probability I -- У/ j Pu • 1’or each queue ? denote: rt: the external customer arrival rate, : the average customer service time, I11 A,: I he customer departure rate, (i7-. the total customer arrival rate (sum of external rate and departure rates from upstream queues weighted by the corresponding probabili- ties). We have and we assume that, any portion ol the arrival rate a, in excess of the service rate p, is lost,; so the departure rate at queue / sat islies A, -- minfp,. nJ - min ii^r, + X,/),, . Assume that r, > 0 for at. least one /, and that for every queue i\ with r,, > 0, t here is a queue i with I , I ull|l a sequence i ।. i-i,.. . , i д • , i. such that />/1 .....^'f-' ^,(>w that the departure rail's A, satisfying the preceding (’({nations are unique and can bo found by value iteration or policy iteration. I! iul: This problem does not. quite lit our framework because we may have । for So,ll(' '• However, it is possible to carry out an analysis based on ///-stage contraction mappings.
Sec. 3.7 Aolcs. SolllCes. JJI</ E\(TC/>c.- ЕЧ1 3.23 (Infinite Time Reachability [Ber71], [Ber72]) when1 t he dist urbance space 1) is an arbit гагу (not necessarily count able) set. The disturbances (t'i. can take values in a subset И(.гд. и/,) оГ 1) that may depend on ./a and ///,. 'This problem deals with the fol lowing quest ion: (liven a nonempty subset A’ of the state space 5', under what condit ions does t here exist an admissible policy that keeps the state of the (closed-loop) svslem in the set A lor all k and all possible values ng G IT (.I'AIII.' (.17)) . t lint ib. 1'1. С .V. lol Illi U'l, С II (.1-/. . /II, (.Г/.)). /, - (I. I. . . . (7.2) The set A is said Io be iiijiii/lclt/ reachable if t here exists an admissible policy {//<>. j 11. . . .} and 5O///C initial st ate .rlt e A for which the above relat ions are satisfied. It is said to be sfrant/li/ reachable if I here exists an admissible policy {//(i. ц ।. .. .} such that for all initial st al es .ru 6 A I he above relat ions are sat isfied. Consider the function /? mapping any subset Z of t he slate space .S’ into a subset /?(Z) of .S’ defined by /?(Z) |.r | for some a €_ I;(.r). ./ (J>- 11 - li') ( l°r all ir И' (.г, a) j П Z. (a) Show that the s<4 A is strongly reachable if and only if /f(A’) — A. (b) Given Ad consider the set A’* defined as follows: .r() C A* if and only if ./•(, 6 A and t here exists an admissible policy {//o. / /1... ,} such t bat thal Eqs. (7.1) and (7.2) art* satisfied when .r(l is taken as t.he initial state of the system. Show that a set A is infinitely reachable if and only if it contains a nonempty strongly reachable set. Furthermore, the largest such set is A"* in the sense that A* is strongly reachable whenever nonempty, and if .V G A* is another strongly reachable set, then A C A A (<•) Show that if A is infinitely reachable, there exists an admissible sta- tionary policy fi such that, if the initial state belongs to A*. then all subsequent- st ates of the closed-loop syst em .14 ( 1 — f //(.14 ), ng ) are guaranteed to belong to A’*. (d) Given A. consider the sets (A"), — 1,2...., where /Л(А') denotes the set obtained after k applications of Ihc mapping /? on A”. Show I hal (e) Given A*, consider for each .r G_ A and /; 1.2, . . . the set (d(.r) -- { a I f(.r. u, ir) G II1' (A') lor all tr (2 И '(.г. a) }. Show that, if there exisls an index I; such that for all ,r c A ami k > (he sei 7;/.(r) is a compact subset, of a Euclidean space, then A-iTjn.V).
.1 182 I Indiscount cd Problems Chap. 3 3.24 (Infinite Time Reachability for Linear Systems) Consider tin; linear stationary system .ГАЧ I ~ A.Ca + 13 Uk + GlUk, where Xk E -R", Щ- G )R"', and w* G ЗГ, and the matrices A, 13. and G are known and have appropriate dimensions. The. matrix A is assumed invertible. The controls щ- and the disturbances wa: arc restricted to take values in the ellipsoids U — {a | u! Ku < 1} and IV = {w | w'Qw < 1}, respectively, where К and Q are positive definite symmetric matrices of appropriate dimensions. Show that in order for the ellipsoid A” = {x | x'K.r, < 1}, where К is a positive definite symmetric matrix, to be strongly reachable (in the terminology of Exercise 3.23), it is sufficient that for some positive definite* symmetric matrix M and for some scalar /3 E (0, 1) we have /< 1 —~,GQ lG' : positive deficite. il Show also that if the above* relations an* satisfied, the linear stationary policy //,*, where /z*(.r) — h.r and Л = -(!{. + U'PHj-'B'FA. г= (l-ЛА-1 ? achieves reachability of the ellipsoid A' — {./• | x'K.r < 1}. Furthermore, the matrix (Л H- HL) lues all its eigenvalues strictly within the unit circle. (For a proof together with a computational procedure for finding matrices К satisfying the above, see [Bcr71] and [Ber72b].) 3.25 (The Blackmailer’s Dilemma) Consider Example 1.1 of Section 2.J. Here, there are two states, state 1 and a termination slate /. At state 1, we can choose a control и with 0 < и < 1; we then move to state* f at no cost with probability p(u), and stay in state 1 at a cost -u with probability 1 — p(u). (a) Let == u~. For th is case it was shown in Example 1.1 of Sect ion 2.1, that the optimal costs arc ./*(1) = — x. and ./*(/) — 0. Furthermore, it was shown that there is no optimal stationary policy', although there is an opt imal nonstat ionary policy, f ind the set of solut ions to Bellman's equat ion and verify I he result of Prop. 1.2(b). (b) Let — и. Find the set of solutions to Bellman's equation and use Prop. 1.2(b) to show (hat the optimal costs arc ,/*(!) — —1 and J*(l) — 0. Show that there is no optimal policy (stationary or not).
Average Cost per Stage Problems Contents 4.1. Preliminary Analysis .................................p. 181 4.2. Optimality Conditions.................................p. 191 4.3. Computational Methods.................................p. 2(12 4.3.1. Value Iteration................................p. 202 4.3.2. Policy Iteration...............................p. 213 4.3.3. Linear Programming.............................p. 221 4.3.4. Simulation-Based Methods.......................p. 222 4.4. Infinite State Space..................................p. 226 4.5. Notes, Sources, and Exercises.........................p. 229
The results of the preceding chapters apply mainly to problems where the optimal total expected cost is finite, either because of discounting or because of a cost-free absorbing state that the system eventually enters. In many situations, however, discounting is inappropriate and there is no natural cost-free absorbing state. In such situations it is often meaningful to optimize the average cost per stage, to be defined shortly. In this chapter, we discuss this type of optimization, with an emphasis on the case of a finite-state Markov chain.

An introductory analysis of the problem of this chapter was given in Section 7.4 of Vol. I. That analysis was based on a connection between the average cost per stage and the stochastic shortest path problem. While this connection can be further extended to obtain more powerful results (see Exercises 4.13-4.16), we develop here an alternative line of analysis that is based on a relation with the discounted cost problem. This relation allows us to use discounted cost results, derived in Sections 1.2 and 1.3, in order to conjecture and prove results for the average cost problem.

4.1 PRELIMINARY ANALYSIS

Let us formulate the problem of this chapter for the case of finite state and control spaces. We adopt the Markov chain notation used in Section 1.3. In particular, we denote the states by 1, ..., n. To each state i and control u there corresponds a set of transition probabilities p_ij(u), j = 1, ..., n. Each time the system is in state i and control u is applied, we incur an expected cost g(i, u), and the system moves to state j with probability p_ij(u). The objective is to minimize over all policies π = {μ_0, μ_1, ...}, with μ_k(i) ∈ U(i) for all i and k, the average cost per stage

    J_π(x_0) = lim_{N→∞} (1/N) E{ Σ_{k=0}^{N−1} g(x_k, μ_k(x_k)) },

for any given initial state x_0.†

† When the limit defining the average cost is not known to exist, we use instead the definition

    J_π(x_0) = lim sup_{N→∞} (1/N) E{ Σ_{k=0}^{N−1} g(x_k, μ_k(x_k)) }.

We will show, however, as part of our subsequent analysis, that the limit exists at least for those policies π that are of interest.
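As a quick illustration of the quantity being minimized, the following sketch estimates the average cost per stage of a fixed stationary policy μ on a small chain by simulating a long trajectory and averaging the incurred stage costs; the transition probabilities and costs are made-up illustrative values. For this (irreducible) example the estimate comes out nearly the same for every initial state, in line with the results discussed below.

```python
import numpy as np

# Illustrative data only: a fixed stationary policy mu on a 3-state chain.
rng = np.random.default_rng(1)
P_mu = np.array([[0.5, 0.5, 0.0],      # P_mu[i, j] = p_ij(mu(i))
                 [0.1, 0.6, 0.3],
                 [0.2, 0.3, 0.5]])
g_mu = np.array([2.0, 1.0, 3.0])       # g_mu[i] = g(i, mu(i))
cum = P_mu.cumsum(axis=1)

def average_cost_estimate(x0, N=100_000):
    # Simulate N stages under mu starting at x0 and average the incurred costs.
    x, total, u = x0, 0.0, rng.random(N)
    for k in range(N):
        total += g_mu[x]
        x = int(np.searchsorted(cum[x], u[k]))
    return total / N

print([round(average_cost_estimate(i), 3) for i in range(3)])   # nearly equal for each start
```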
As in Section 1.3, we use the following shorthand notation for a stationary policy μ:

    P_μ = [ p_ij(μ(i)) ],  i, j = 1, ..., n,      g_μ = ( g(1, μ(1)), ..., g(n, μ(n)) )'.

Since the (i, j)th element of the matrix P_μ^k (P_μ to the kth power) is the k-step transition probability P(x_k = j | x_0 = i) corresponding to μ, it can be seen that

    J_μ = lim_{N→∞} (1/N) ( Σ_{k=0}^{N−1} P_μ^k ) g_μ.

An important result regarding transition probability matrices is that the limit in the preceding equation exists. We show this fact shortly in the context of a more general result, which establishes the connection between the average cost per stage problem and the discounted cost problem.

An Overview of Results

While the material of this chapter does not rely on the analysis of the average cost problem of Section 7.4 in Vol. I, it is worth summarizing some of the salient features of that analysis (see also Exercises 4.13-4.16). We assumed there that there is a special state, by convention state n, which is recurrent in the Markov chain corresponding to each stationary policy. If we consider a sequence of generated states, and divide it into cycles marked by successive visits to the special state n, we see that each of the cycles can be viewed as a state trajectory of a corresponding stochastic shortest path problem with the termination state being essentially n. More precisely, this stochastic shortest path problem has states 1, 2, ..., n, plus an artificial termination state t to which we move from state i with transition probability p_in(u). The transition probabilities from a state i to a state j ≠ n are the same as those of the original problem, while p_in(u) is zero. For any scalar λ, we considered the stochastic shortest path problem with expected stage cost g(i, u) − λ for each state i = 1, ..., n. We then argued that if we fix the expected stage cost incurred at state i to be g(i, u) − λ*, where λ* is the optimal average cost per stage starting from the special state n, then the associated stochastic shortest path problem becomes essentially equivalent to the original average cost per stage problem. Furthermore, Bellman's equation for the associated stochastic shortest path
problem can be viewed as Bellman's equation for the original average cost per stage problem.

Based on this line of analysis, we showed a number of results, which will be strengthened in the present chapter by using different methods. In summary, these results are the following:

(a) The optimal average cost per stage is independent of the initial state. This property is a generic feature for almost all average cost problems of practical interest.

(b) Bellman's equation takes the form

    λ* + h*(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^n p_ij(u) h*(j) ],   i = 1, ..., n,

where h*(n) = 0, λ* is the optimal average cost per stage, and h*(i) has the interpretation of a relative or differential cost for each state i (it is the minimum of the difference between the expected cost to reach n from i for the first time and the cost that would be incurred if the cost per stage was the average λ*).

(c) There are versions of the value iteration, policy iteration, adaptive aggregation, and linear programming methods that can be used for computational solution under reasonable conditions.

We will now provide the foundation for the analysis of this chapter by developing the connection between average cost and discounted problems.

Relation with the Discounted Cost Problem

Let us consider the cost of a stationary policy μ for the corresponding α-discounted problem. It is given by

    J_{α,μ} = Σ_{k=0}^∞ α^k P_μ^k g_μ = (I − α P_μ)^{−1} g_μ,   α ∈ (0, 1).   (1.1)

To get a sense of the relation with the average cost of μ, we note that this latter cost is written as
    J_μ = lim_{N→∞} (1/N) Σ_{k=0}^{N−1} P_μ^k g_μ.

Assuming that the order of the limit over N and the limit as α → 1 can be interchanged, we obtain

    J_μ = lim_{α→1} (1 − α) J_{α,μ}.

The formal proof of the above relation will follow as a corollary to the next proposition.

Proposition 1.1: For any stochastic matrix P and α ∈ (0, 1), there holds

    (I − αP)^{−1} = (1 − α)^{−1} P* + H + O(|1 − α|),   (1.2)

where O(|1 − α|) is an α-dependent matrix such that

    lim_{α→1} O(|1 − α|) = 0,   (1.3)

and the matrices P* and H are given by

    P* = lim_{N→∞} (1/N) Σ_{k=0}^{N−1} P^k,   (1.4)

    H = (I − P + P*)^{−1} − P*.   (1.5)

[It will be shown as part of the proof that the limit in Eq. (1.4) and the inverse in Eq. (1.5) exist.] Furthermore, P* and H satisfy the following equations:

    P* = P P* = P* P = P* P*,   (1.6)

    P* H = 0,   (1.7)

    P* + H = I + P H.   (1.8)

Proof: From the matrix inversion formula that expresses each entry of the inverse as a ratio of two determinants, it is seen that the matrix

    M(α) = (1 − α)(I − αP)^{−1}
can be expressed as a matrix with elements that are either zero or fractions whose numerator and denominator are polynomials in α with no common divisor. The denominator polynomials of the nonzero elements of M(α) cannot have 1 as a root, since otherwise some elements of M(α) would tend to infinity as α → 1; this is not possible, because from Eq. (1.1), for any μ we have

    (1 − α)^{−1} M(α) g_μ = (I − αP)^{−1} g_μ = J_{α,μ},

and |J_{α,μ}(i)| ≤ (1 − α)^{−1} max_i |g_μ(i)|, implying that the absolute values of the coordinates of M(α) g_μ are bounded by max_i |g_μ(i)| for all α < 1. Therefore, each element m_ij(α) of the matrix M(α) is a rational function of α whose denominator does not vanish at α = 1, and is consequently differentiable in a neighborhood of α = 1. Define

    P* = lim_{α→1} M(α),   (1.9)

and let H be the matrix having as (i, j)th element the first derivative of m_ij with respect to (1 − α), evaluated at α = 1. By the first order Taylor expansion of the elements m_ij(α) of M(α), we have, for all α in a neighborhood of α = 1,

    M(α) = P* + (1 − α) H + O((1 − α)²),   (1.10)

where O((1 − α)²) is an α-dependent matrix such that

    lim_{α→1} O((1 − α)²) / (1 − α) = 0.

Multiplying Eq. (1.10) with (1 − α)^{−1}, we obtain the desired relation (1.2) [although we have yet to show that P* and H are also given by Eqs. (1.4) and (1.5), respectively].

We will now show that P*, as defined by Eq. (1.9), satisfies Eqs. (1.6), (1.5), (1.7), (1.8), and (1.4), in that order. We have

    (I − αP)(I − αP)^{−1} = I   (1.11)

and

    α(I − αP)(I − αP)^{−1} = αI.   (1.12)

Subtracting these two equations and rearranging terms, we obtain

    αP(1 − α)(I − αP)^{−1} = (1 − α)(I − αP)^{−1} + (α − 1)I.

By taking the limit as α → 1 and using the definition (1.9), it follows that

    P P* = P*.
Also, by reversing the order of (I − αP) and (I − αP)^{−1} in Eqs. (1.11) and (1.12), it follows similarly that

    P* P = P*.

From P P* = P*, we also obtain

    (I − αP) P* = (1 − α) P*   or   P* = (1 − α)(I − αP)^{−1} P*,

and by taking the limit as α → 1 and by using Eq. (1.9), we have P* = P* P*. Thus Eq. (1.6) has been proved.

We have, using Eq. (1.6),

    (P − P*)² = P² − P*,

and similarly

    (P − P*)^k = P^k − P*,   k ≥ 1.

Therefore,

    (I − αP)^{−1} − (1 − α)^{−1} P* = Σ_{k=0}^∞ α^k (P^k − P*) = I − P* + Σ_{k=1}^∞ α^k (P − P*)^k = (I − α(P − P*))^{−1} − P*.

On the other hand, from Eq. (1.10), we have

    H = lim_{α→1} ( (1 − α)^{−1} M(α) − (1 − α)^{−1} P* ) = lim_{α→1} ( (I − αP)^{−1} − (1 − α)^{−1} P* ).

By combining the last two equations, we obtain

    H = lim_{α→1} (I − α(P − P*))^{−1} − P* = (I − P + P*)^{−1} − P*,

which is Eq. (1.5). From Eq. (1.5), we obtain

    (I − P + P*) H = I − (I − P + P*) P*,

or, using Eq. (1.6),

    H − P H + P* H = I − P*.   (1.13)

Multiplying this relation by P* and using Eq. (1.6), we obtain P* H = 0, which is Eq. (1.7). Equation (1.8) then follows from Eq. (1.13).

Multiplying Eq. (1.8) with P^k and using Eq. (1.6), we obtain

    P* + P^k H = P^k + P^{k+1} H,   k = 0, 1, ....

Adding this relation over k = 0, ..., N − 1, we have

    N P* + H = Σ_{k=0}^{N−1} P^k + P^N H.
Dividing by N and taking the limit as N → ∞, we obtain Eq. (1.4). Q.E.D.

Note that the matrix P* of Eq. (1.4) can be used to express concisely the average cost vector J of any Markov chain with transition probability matrix P and cost vector g as

    J = lim_{N→∞} (1/N) Σ_{k=0}^{N−1} P^k g = ( lim_{N→∞} (1/N) Σ_{k=0}^{N−1} P^k ) g = P* g.

To interpret this equation, note that we may view the ith row of P* as a vector of steady-state occupancy probabilities corresponding to starting at state i; that is, the (i, j)th element p*_ij of P* represents the long-term fraction of time that the Markov chain spends at state j, given that it starts at state i. Thus the above equation gives the average cost per stage J(i), starting from state i, as the sum of all the single-stage costs g(j) weighted by the corresponding occupancy probabilities p*_ij.

From Eq. (1.1) and Prop. 1.1, we obtain the following relation between the α-discounted and average cost corresponding to a stationary policy.

Proposition 1.2: For any stationary policy μ and α ∈ (0, 1), we have

    J_{α,μ} = (1 − α)^{−1} J_μ + h_μ + O(|1 − α|),   (1.14)

where J_μ is the average cost vector corresponding to μ, and h_μ is a vector satisfying

    J_μ + h_μ = g_μ + P_μ h_μ.   (1.15)

Proof: Equation (1.14) follows from Eqs. (1.1) and (1.2) with the identifications P = P_μ, P* = P*_μ, and h_μ = H g_μ. Equation (1.15) follows by multiplying Eq. (1.8) with g_μ and by using the same identifications. Q.E.D.
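The matrices P* and H, and the relations just established, are easy to check numerically. The sketch below uses made-up data for a small irreducible and aperiodic chain (so that P* can be obtained simply as a high power of P), forms H from Eq. (1.5), and verifies Eqs. (1.2), (1.6)-(1.8), and (1.15).

```python
import numpy as np

# Illustrative data: an irreducible, aperiodic 3-state chain and a cost vector.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
g = np.array([2.0, 1.0, 3.0])          # cost vector g_mu for some fixed policy mu
I = np.eye(3)

P_star = np.linalg.matrix_power(P, 500)            # = lim P^k = Cesaro limit, Eq. (1.4)
H = np.linalg.inv(I - P + P_star) - P_star         # Eq. (1.5)

alpha = 0.999
lhs = np.linalg.inv(I - alpha * P)                 # (I - alpha P)^{-1}
rhs = P_star / (1 - alpha) + H                     # leading terms of Eq. (1.2)
print("deviation in Eq. (1.2):", np.abs(lhs - rhs).max())          # of order 1 - alpha

J = P_star @ g                                     # average cost vector J_mu = P*_mu g_mu
h = H @ g                                          # differential cost vector h_mu = H g_mu
print("Eq. (1.15) residual:", np.abs(J + h - (g + P @ h)).max())   # ~ machine precision
print("identities (1.6)-(1.8):",
      np.abs(P @ P_star - P_star).max(),
      np.abs(P_star @ H).max(),
      np.abs(P_star + H - I - P @ H).max())
```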
Sec. 4.2 Optimality Conditions 191

(T_μ J)(i) = g(i, μ(i)) + Σ_{j=1}^n p_{ij}(μ(i)) J(j),   i = 1, ..., n.    (1.17)

4.2 OPTIMALITY CONDITIONS

Our first result introduces the analog of Bellman's equation for the case of equal optimal cost for each initial state. This is the case that normally appears in practice, as discussed in Section 7.1 of Vol. I. The proposition shows that all solutions of this equation can be identified with the optimal average cost and an associated differential cost. However, it provides no assurance that the equation has a solution. For this we need further assumptions, which will be given in the sequel (see Prop. 2.6).

Proposition 2.1: If a scalar λ and an n-dimensional vector h satisfy

λ + h(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^n p_{ij}(u) h(j) ],   i = 1, ..., n,    (2.1)

or equivalently

λe + h = Th,    (2.2)

then λ is the optimal average cost per stage J*(i) for all i,

λ = min_π J_π(i) = J*(i),   i = 1, ..., n.    (2.3)

Furthermore, if μ*(i) attains the minimum in Eq. (2.1) for each i, the stationary policy μ* is optimal, that is, J_{μ*}(i) = λ for all i.

Proof: Let π = {μ_0, μ_1, ...} be any admissible policy and let N be a positive integer. We have, from Eq. (2.2),

T_{μ_{N-1}} h ≥ λe + h.

By applying T_{μ_{N-2}} to both sides of this relation, and by using the monotonicity of T_{μ_{N-2}} and Eq. (2.2), we see that

T_{μ_{N-2}} T_{μ_{N-1}} h ≥ 2λe + h.

Continuing in the same manner, we finally obtain

T_{μ_0} T_{μ_1} ⋯ T_{μ_{N-1}} h ≥ Nλe + h,    (2.4)
_ -1 192 .hc/.чце (ох/ per .S7 аде I ’ri >1 lit ‘ins ('lltiji. 4 with equality il each ///,., k ~ (). 1...A' — 1. attains tlie minimum in Eq. (2.1). As discussed in Section 1.1. (7),u7)(| • 7), x ,/;)(/) >s equal t.o t.he A'-stage cost, corresponding to init ial st ate i. policy {//<>, //1./<v-i and termiiia 1 cost function h: that, is, { i 'i HUv) + .'/(.'7,.//а (.Га )) I J'O = /. 7Г > . /.--II J Using this relation in Eq. (2. 1) and dividing by A', we obtain for all i I I PP 1 A’{/l(.r,v) i JO - i. 7Г ; I- — /•? "p o(j7. .///, (J7. )) | JO -- /. rr > I A 0 J > A +p<(/). By taking the limit as A' -> oc, we see that ./,T(/)>A. / -= 1......II. with equality if ii/Ai}, b =- 0.1........ attains the ini11iiinnii in Eq. (2.1). Q.E.D. Note t hat. t he proof of Prop. 2.1 carries t hrough even if t he state space and control space are infinite as long as the function h is bounded and t.he minimum in the optimality equation (2.1) is attained for each /. In order Io interpret t he vector // in Bellman's equation Ac h - Th, note that, by iterating this equation A’ times (see also the proof of the preceding proposition), we obtain A'Ac + h = Txh. 'Phus for any t.wo st at.es i and j we have A+ //(/) . (ГЛ7/)(/). A A /,(.;) - (7’л7/)(;;), which by subtraction yields //(/) //(./) (7'л7/)(/) (7’л7/)(./). lor all Eor any /. (7’л //)(/) is the optimal A'-stage expected cost starting at i when the terminal cost, function is h. Thus, according to the preceding equation, //(/) - h(j) represents, for every A', the difference in optimal N- stage expected cost, due to starting al state i rather that, starting at state j. Based on this interpretation, we refer to h as the or relative cost vector. (An alternative but. similar interpretation is given in Section 7.1 of Vol. I.) Now given a stationary policy //, we may consider, as in Section 1.2, a problem where the constraint set (/(/) is replaced by I he set U(i) -- {//(/)};
See. 1.2 ()j>l iiteililу (\)iidil i<»n,s 193 that is, ('(/) contains a single element. the control /i(i). Since then we would have only one admissible policy, the policy //. application of I’rop. 2.1 yields the following corollary. Corollary 2.1.1: Let // be a .stationary policy. II a scalar A;, and an n-dimcnsional vector /q, satisfy, for all A,, + /'/,(>) = !/(i. /'(')) + £/>,,(;,.(/))М.Д (2.B) j -i or equivalent ]y A;,C + h fl — 11, h.Ц. them A,, =//<(«), / = 1-----и. Blackwell Optimal Policies It turns out that the converse of Prop. 2.1 also holds; that is. if for some scalar A we have — A for all / — 1.......n. then A together with a vector It sat islies Bellman's equation (2.1). W’e show t his by ini reducing the notion of a Blackwell optimal policy, which was lirst formulated in [BlaG2j. together with the line of analysis of the present, section. Definition 1.1: A stationary policy p, is said to be Blackwell optimal if it. is simultaneously optimal for all t he «-discountcd problems with a in an interval (о. 1), when: о is some scalar with 0 < <v < I. The following proposition provides a useful characterization of Black- well optimal policies, and essentially shows the converse of Prop. 2.1. Proposition 2.2: The following hold true: (a) A Blackwell optimal policy is opt imal for the average cost prob- lem within the class of all stationary policies. (b) There exists a Blackwell optimal policy. Proof: (a) If p* is Blackwell optimal, then for all stationary policies //.
194 Дгггадг ('<») /><•/ St;ецс Problems ('hup. I and о in an interval (<T, 1) we have Equivalently, using Eq. (I.II), ( 1 -o )~' ./,,* + /z,,* +O(| I - <i|) < (I -~'|) 1+ /q,+ 0(1 1 - <1|), n G (<+ 1) or -//,♦ < •//, + (1 - <>)(/',< - /'//•) + (1 - 'I)O(| 1 -o|), <> G (о, 1). By taking the limit as о -» I. we obtain (I,) From Eq. (1.1), we know that, lor each /z and state i, (z) is a rational function of (i, that is, a. ratio of two polynomials in o. Therefore, for any two policies /z and /z' the graphs of and ,/,,_,,/(/) either coincide or cross only a finite number of times in the interval (0, 1). Since there are only a finite number of policies, we conclude that lor each state i there is a policy /z' and a scalar Ft, G (0. 1) such that p' is optimal for the ci- discomiled problem for о G (о,. I) when the initial state is i. Consider the stationary policy defined for each z by /z’(z) — d’hen //*(/) attains the minimum in Bellman's equation for the o-discount.ed problem n P(i) inin 4 a) p,,(zz)./„(;) i< c z; p) , j --1 for all z and for all zi in the interval (max, o,. 1). Tin irefore, //* is a station- ary optimal policy for the o-disconntcd problem for ;dl о in (imix,<T,, 1), implying that //,* is Blackwell optimal. Q.E.D. We note that, (.lie converse of Prop. 2.2(a) is not, true; it is possible that a stationary average cost optimal policy is not Blackwell optimal (see Exercise 4.6). We mention also that one can show a stronger result than Prop. 2.2(b). namely that a Bhi.ckwell optimal policy is average cost opti- mal within the class of all policies (not just those that are stationary; see Exercise 4.7). ddie next proposition provides a useful characterization of Blackwell opt imal policies. Proposition 2.3: If /i* is Blackwell optimal, then for all stationary policies // we have I,- = P,>*J>.* < (2-6) Furthermore, for all /z such that we have .//,* + h/G = fhi* + Pii*hii* < <hi + PiJifi*, (2-7) where /z;,» is a vector corresponding to /I* as in Prop. 1.2.
See. 1.2 O}>t in ь-i lit у Cinulil iotis 1 95 Proof: Since [i* is optimal for the a-disc.ounted problem lor all a in an interval (о, I), we must have, for every // and о f (o', 1), f/ii * + rt I p+ .p * t/p “I* <i Pp 'J<> ,p * - (2 .8) From Prop. 1.2, we have, for all о e (о, 1), = (1 - о)-!./,,. + V +«(11 -<>!) Substituting this expression in Eq. (2.8), we obtain 0 < Ур ~ Ур* + a{Pt, - />)((! - ci)~L/p* + //„• + O(|l - <i|)), (2.!)) or equivalently 0 < (1 - o)(</p - ур») + o(Pp - Pp.)(./p* + (1 - <r)/qt* + 0((l - o)2)). Bv taking the. limit as a —+ 1, we obtain the desired relation Pp*./,,* < 11 ft is such that, fp•./;,* = Pp./р., then from Eq. (2.9) we obtain 0 + Un - .'//,• + o(Pp - Pp’)(//p* + O(| I - o|)). By taking the limit as о —» 1 and by using also the relation Jp» + /qy = 9ii* + P/i* h ii* [cf. Eq. (l.l.r>)]. we obtain the desired relation (2.7). Q.E.D. As a consequence of the preceding proposition, we obtain a converse of Pro]>. 2.1. Proposition 2.4: If the optimal average cost over the class of sta- tionary policies is equal to A for all initial states, then there exists a vector It such that Л H-/i(z) = min Ii <j(i.u) + ^2po(w)/r(y) j=i , i = l,...,u. (2.10) or equivalently Ac + /i = Th.. Proof: Let. //* be a Blackwell optimal policy. We then have Jp-(/) ~= A for all i. For every //. each element of the vector P/r7p» is equal to A. so that P/eJ/i» = PiiJ/i*. From Eq. (2.7), we then obtain the desired relation (2.10) with k — /q,«. Q.E.D.
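To see these discounted-problem relations concretely, the following Python sketch (an illustration added here, not part of the original text; the two-state data are hypothetical) solves the α-discounted problem by value iteration for several α close to 1. The quantities (1 − α)J_α(i) and J_α(i) − J_α(1) then approximate the optimal average cost and a differential cost difference, in the spirit of Prop. 1.2, while the minimizing policy stops changing for α near 1, as a Blackwell optimal policy should.

import numpy as np

# Hypothetical two-state, two-control data (for illustration only).
P = {0: np.array([[0.9, 0.1], [0.6, 0.4]]),
     1: np.array([[0.2, 0.8], [0.3, 0.7]])}
g = {0: np.array([1.0, 3.0]),
     1: np.array([2.0, 0.5])}

def discounted_value_iteration(alpha, iters=20000):
    J = np.zeros(2)
    for _ in range(iters):
        Q = np.array([g[u] + alpha * P[u] @ J for u in (0, 1)])
        J = Q.min(axis=0)
    return J, Q.argmin(axis=0)            # discounted costs and a minimizing policy

for alpha in (0.9, 0.99, 0.999):
    J, mu = discounted_value_iteration(alpha)
    # (1 - alpha) * J_alpha(i) approaches the optimal average cost, and
    # J_alpha(i) - J_alpha(1) approaches a differential cost difference,
    # while the minimizing policy mu is the same for all alpha near 1.
    print(alpha, np.round((1 - alpha) * J, 4), np.round(J - J[0], 4), mu)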
196 Average Cost per Stage Problems Chap. 4

Bellman's Equation for a Unichain Policy

We recall from Appendix D of Vol. I that in a finite-state Markov chain, a recurrent class is a set of states that communicate, in the sense that from every state of the set there is a probability of 1 to eventually go to all other states of the set, and a probability of 0 to ever go to any state outside the set. There are two kinds of states: those that belong to some recurrent class (these are the states that, after they are visited once, will be visited an infinite number of times with probability 1), and those that are transient (these are the states that, with probability 1, will be visited only a finite number of times regardless of the initial state).

Stationary policies whose associated Markov chains have a single recurrent class and a possibly empty set of transient states will play an important role in our development. Such policies are called unichain. The state trajectory of the Markov chain corresponding to a unichain policy is eventually (with probability 1) confined to the recurrent class of states, so the average cost per stage corresponding to all initial states, as well as the differential costs of the recurrent states, are independent of the stage costs of the transient states. The next proposition shows that for a unichain policy μ, the average cost per stage is the same for all initial states, and that Bellman's equation λ_μ e + h_μ = T_μ h_μ holds. Furthermore, we show that Bellman's equation has a unique solution, provided we fix the differential cost of some state at some arbitrary value (0, for example). This is necessary, since if λ_μ and h_μ satisfy Bellman's equation (2.5), the same is true for λ_μ and h_μ + γe, where γ is any scalar.

Proposition 2.5: Let μ be a unichain policy. Then:

(a) There exists a constant λ_μ and a vector h_μ such that

J_μ(i) = λ_μ,   i = 1, ..., n,    (2.11)

and

λ_μ + h_μ(i) = g(i, μ(i)) + Σ_{j=1}^n p_{ij}(μ(i)) h_μ(j),   i = 1, ..., n.    (2.12)

(b) Let t be a fixed state. The system of the n + 1 linear equations

λ + h(i) = g(i, μ(i)) + Σ_{j=1}^n p_{ij}(μ(i)) h(j),   i = 1, ..., n,    (2.13)
See. 1.2 Opt’miiility ('iniilil ions 197 h(t.) = 0, (2.14) iu the v + 1 unknowns Л. /1(1).......h(n) has a unique solution. Proof: (a) Let t be a recurrent state under //. Lor each state 1 -- I let C, and .V, be the expected cost and the expected number of stages, re- spectively. Io reach t for the first time starting from i under policy p. Let also Ci and A) be the expected cost and expected number of stages, re- spectively. to return to I for the first time starting from / under policy //. From Prop. 1.1 in Section 2.1, we have that €', and N, solve uniquely the systems of equations C =.'/('•/'(')) + У" /с(/Ф'Ж.о (2.15) Po (/'(')) Aj. Let (2.17) Multiplying Eq. (2.16) by A;, and subtracting it. from Eq. (2.15). we obtain C.-A^A’, = </(/,//(/))-A(J + У2 /'0 (/'(')) (G-~A(< A’, )• z = l.................и. By defining /q,(/)=.-С, - A„Л',. /--I......il. (2.IS) and by noting that, from Eq. (2.17). we have ЛД/) =- 0. we obtain A,, + /t,,(i) = L У2р'Уа('))Ь/А.1^ ' 1....." which is Eq. (2.12). Equat ion (2.11J follows from Eq. (2.12) and Cor. 2.1.1. (b) By part (a), for any solution (A./i) of the system of equations (2.13) and (2.11). we have A — A,,. as well as //(/) ~ (). Suppose that I belongs to
198 .A vc/адс Cost per Problems Chop. 4 t lie recurrent, class of states of ( lie Markov chain corresponding to p. Then, in view of Eq. (2.1-1). the system of equations (2.13) can be written as n /'(') = /'-('))---V + У. Po (/'('))/'(./) ' - * 1 2.H.i/I, J--1. j-C and is the same as Bellman’s equat ion for a corresponding stochastic short- est, path problem where / is the termination state, .(/(o/'(')) — >K tbe expected stage cost. at. state /, and /*(/) is the average cost, starting from i up to reaching I. By Prop. 1.2 in Section 2.1, this system has a. unique solution, so //(/) is uniquely defined by Eq. (2.13) for all i. ф t.. Suppose now that t. is a t ransient state of the Markov chain corre- sponding to p. Then we choose another state I that, belongs to the recurrent class and make the tra.nsfonnat.ion of variables h(i) — h(i) — h,(i). The sys- tem of equations (2.13) and (2.11) can be written in terms of the variables Л and /?(/) as n 7i(i) = </(i,/«(«)) -A + Ри(/<(»))ЛО’). i = /i/Z, h(l) = 0, so by the stochastic shortest path argument given earlier, it has a unique solution, implying that, the solution of the system of equations (2.13) and (2.1-1) is also unique. Q.E.D. Conditions for Equal Optimal Cost for All Initial States We now t urn to the case of multiple policies, and we provide condi- tions under which Bellman’s equation Ac -+ It = Th has a solution, and by Prop. 2.1, the optimal cost, is independent of the initial state. Proposition 2.6: Assume any one of the following three conditions: (1) Every policy that is optimal within the class of stationary policies is unichain. (2) For every t wo states i and j, there exists a stationary policy rr (depending on i and j) such that, for sonic k, = j | a:0 = i, tt) > 0.
See. 4.2 Optimality Com Hl ions 199 > (3) There exists a state t, and constants L > 0 and о G (0,1) such that. | Ja(i) — J(t(t)| < L, for all i = 1,..., n, and a 6 (a, 1), (2.19) where J(, is the a-discounted optimal cost vector. Then the optimal average cost per stage has the same value Л for all initial states i. Furthermore, A satisfies A = lim (1 — a) z = l,...,n, (2.20) a —* 1 and for any state t, the vector h given by /г(г) = lim (j;>(i) - J„(t)), i = 1........и,, (2-21) satisfies Bellman's equation Ac + h = Th.. (2.22) together with A. Proof: Assume condition (1). Proposition 2.2 asserts that, a Blackwell optimal policy exists and is optimal within the class of stationary policies. Therefore, by condition (1). this policy is uuichain, and by Prop. 2.5, the corresponding average cost is independent of the initial st,ate. The result follows from Prop. 2.4. Assume condition (2). Consider a Blackwell optimal policy //*. If it yields average cost t hat, is independent of the initial state, we are done, as earlier. Assume the contrary; that is, both the set. M ---- | •/,,•(/) = max,/,,.(;) and its complement А/ are nonempty. The idea now is to use the hypoth- esis that every pair of states communicates under some stationary policy, in order to show that the average cost of states in ,\1 can be reduced by opening communication to the states in 1\[, t hereby creating a contradic- tion. Take any states i G Л/ and j G Al, and a. stationary policy /I such that, for some />', P(.i’i, ~ j | .r(; = /.//) > (). Then then' must exist states m G Л/ and Th G Al such that there is a positive transition probability from m to Th under p; that, is, = I’(./g.+ i = Th. | Xk = m.p) > 0. It can thus be seen that the mth component of is strictly less than
200 Cos/. per Singe Problems ('hup. 4 max, (z), which is equal U> the mth component of. 'This contradicts the necessary condition (2.6). Einally, assume condition (3). bet /Т 1м/ a Blackwell optimal policy. By Eq. (1.11), we have for all states i and <v in some interval (cv,1) ./„(/) = (1 -o) +O(|1 -<>|). (2.23) Writing this equation lor state i and for state t, and subtracting, we obtain for all i t, |.М')-Л-(/)| < (1-<0|Л«(0“-Л»(0|b(f-n)|//7,.(/)-V(t)|ю((1--о)2). Taking the limit as <> 1 and using the hypothesis that |,/„ (/) -./„(/.) | < L for all о ь (0, 1), we obtain that (z) = ./;1*(Z) for all i. Thus the average cost of the Blackwell optimal policy is independent, of the initial state, and we are done. To show Eqs. (2.20)-(2.22), we note that the relation lim,,_. i(l — o)./,,(z) ;= A for all z follows from Eq. (2.23) and the fact ./,,*(?) = A for all z. Also, from Eq. (2.23), we have ./„(/) -./„(/.)--//z,.(z) -//„.(/)+ O(|1 -o|), so that lim (./„(/) - ./„(/)) = //„.(') ~ /'/,'(') (1 -• 1 Setting //(/') — //,,»(/') -~h.fi> (/) for all i, and using the fact = A for all i and Eq. (2.7). we see that the condition Xc + h — Th. is satisfied. Q.E.D. The condit ions of t he preceding proposition are among I he weakest guaranteeing t hat, the optimal average cost per stage is independent of the initial state. In particular, it is clear that some sort of accessibility condi- tion must, be satisfied by t.he transition probability matrices corresponding to stationary policies or at, least to optimal stationary policies. Tor if there existed two stales neither of which could be reached from the other no matter which policy we use, then it. can be only by accident that the same optimal cost per stage will correspond to each one. An extreme example is a problem where the state is forced Io stay the same regardless of the control applied (i.e.. each state is absorbing). Then the optimal average cost per stage for each state z is ininueu(i) !7(d и), and this cost may be different, for different, stall's. Example 2.1: (Machine Replacement) Consider a machine that can be in any one of n states, 1,2, ...,». There is a cost, z/(i) for operating for one time period the machine when it. is in state /. The options at t he start of each period are to (a) lot. the machine
Sec. 4.2 Optimnlil у (oiulit.iou.s 20 J operate one more period in tlie state it currently is. or (b) repair the machine at, a positive cost /? and bring it to state. 1 (corresponding to a machine in perfect condition). The transitions between dillerent states over each time period arc governed by given probabilities pr/. Once repaired, the machine is guaranteed to stay in state 1 lor one period, and in subsequent periods, it. may deteriorate to states j > 1 according to the transition probabilities p\t. The problem is to find a policy that minimizes the average cost per stage. Nob! that we have analyzed t he discounted cost version of (.his problem in Example 2.1 of Section 1.2. As in that example, we will assume that p(i) i> nondecreasmg in л and that the transition probabilities satisfy 1 =j...." -1. for all functions J(i). which arc monotonically nondocreasing in 7. Noto, that not all policies are unichaiu here. Eor example, consider the stationary policy that replaces at every state except the worst state n (a poor but legitimate choice). The corresponding Markov chain lias two recurrent classes, { 1,2,. . ., n, — 1 } and {n} (assuming that pin — 0). It can also be seen that condition (2) of Prop. 2.6 is not guaranteed in the absence of further assumptions. [This condition is satisfied if we assume in addition that, for all i, we have > 0, because, by replacing, we can bring the system to state' 1, from where, by not replacing, we can reach every other slate.] We can show, however, that condition (3) of Prop. 2.6 is satisfied. In- deed. consider the corresponding discounted problem with a discount factor a < 1. W’e have for all 7 Jn (/) = min /( H-</(l) + o,7(1(l). </(/) + and in particular. < /.'+ -/(!) -f- <>./.,(1), (1) = min П Л’ + y(l) + oT,(l). <7(M) + j -- 1 From the last two equations, by subtraction we obtain where the last inequality follows from the hu t 0<.7fl(,)-Jo(l), which holds since ./„(/) - (1) is nondecrcasing in i. as shown in Example 2.1 nf Section 1.2. The last two relations imply that, condition (3) of Prop.
202 . \ \'cr:ii^t‘ per .S'/agi’ I *г< ihli'iiis Cli.ip. 4 2.6 is satisfied, and it follows that there exists a scalar A and a vector Л-, such that for all i, while I he policy llial chooses 1.11c minimizing action above is average cost optimal. By Pro],. 2.6, we can take /i(z) = lim„—। (Л«(/) — ./,,(!)), and since — ./,,(1) is nondecreasing in i, it follows that /i(i) is also nondecreasing in i. Similar to Example 2.1 of Section 1.2, this implies that an optimal policy takes the form replace if and only if i > /*, where _ J smallest state in Sk if Sn is nonempty, I ii + 1 otherwise, and 4.3 COMPUTATIONAL METHODS All the computational methods developed for discounted and stochas- tic shortest path problems (cf. Sections 1.3 and 2.2) have average cost per stage counterparts, which we discuss in this section. However, the deriva- tions of these methods are often intricate, and have no direct analogs in the discounted and stochastic shortest path context. In fact, the validity of these? incthods may depend on assumptions that relate to the structure of the underlying Markov chains, something that, we have not encountered so far. 4.3.1 Value Iteration The natural version of the value iteration method lor the average cost problem is simply to generate successively the finite horizon optimal costs TkJu, k: = 1,2,..., starting with the zero function Jo- It is then natural to speculate that the /.--stage average costs TkJo/k: converge to the optimal average cost vector as k: —> oo (this is in fact proved under natural conditions in Section 7.4 of Vol. 1). This method has two drawbacks. First, some of the components of TkJo typically diverge to oo or --oo, so
Sec. 4.3 Computational Methods 203

direct calculation of lim_{k→∞} T^k J_0 / k is numerically impractical. Second, this method will not provide us with a corresponding differential cost vector h.

We can bypass both difficulties by subtracting a multiple of the unit vector e from T^k J_0, so that the difference, call it h^k, remains bounded. In particular, we consider methods of the form

h^k = T^k J_0 − δ^k e,    (3.1)

where δ^k is some scalar satisfying

min_{i=1,...,n} (T^k J_0)(i) ≤ δ^k ≤ max_{i=1,...,n} (T^k J_0)(i),

such as for example the average of the values (T^k J_0)(i),

δ^k = (1/n) Σ_{i=1}^n (T^k J_0)(i),

or

δ^k = (T^k J_0)(t),

where t is some fixed state. Then if the differences max_i (T^k J_0)(i) − min_i (T^k J_0)(i) remain bounded as k → ∞ (this can be guaranteed under the assumptions of the subsequent Prop. 3.1), the vectors h^k also remain bounded, and we will see that with a proper choice of the scalar δ^k, the vectors h^k converge to a differential cost vector.

Let us now restate the algorithm h^k = T^k J_0 − δ^k e in a form that is suitable for iterative calculation. We have

h^{k+1} = T^{k+1} J_0 − δ^{k+1} e,

and since T^{k+1} J_0 = T(T^k J_0) = T(h^k + δ^k e) = Th^k + δ^k e, we obtain

h^{k+1} = Th^k + (δ^k − δ^{k+1}) e.    (3.2)

In the case where δ^k is given by the average of the values (T^k J_0)(i), we have

δ^{k+1} = (1/n) Σ_{i=1}^n (T^{k+1} J_0)(i) = (1/n) Σ_{i=1}^n (T(h^k + δ^k e))(i) = δ^k + (1/n) Σ_{i=1}^n (Th^k)(i),

so that the iteration (3.2) is written as

h^{k+1} = Th^k − (1/n) Σ_{i=1}^n (Th^k)(i) e.    (3.3)
204 Average (\>st per Sl<w Prablcins ('hup. 4 Similarly, in the ease when* we fix a state / and we choose bk = (TkJu)(/.), we have = (7’A i-1./(,)(/) (T(hk -g Ус))(/) -- (77Л)(/) -t and the iteration (3,2) is written as /;A + 1 = Tllk _ (3.4) We will henceforth restrict, attent ion t.o the case where = (7’’A ./o)(/,), and we will (all the corresponding algorithm (3.1) re.la.live value Uemlion, since t,he iterate /)/' is equal t.o 7’*'./(i — (T1-' Jt>)(l)e and may be viewed as a /.’-stage optimal cost, vector relative In stale I. The following results also apply t.o other versions of the algorithm (see Exercises <1.1 and 1.5). Note that, relative value iteration, which generates //*’, is not really dilferent than ordinary value iteration, which generates 7 A.7i>. The vectors generated by I,he two methods merely dilfer by a multiple of the unit vector, and the minimizat ion problems involved in the corresponding iI,erat ions of the two methods are mathematically equivalent. It can be seen that, if the relative value iteration (3.1) converges to some vector h, then (Th)(/)<’ + 1, =- Th, which by Prop. 2.1, implies t hat. (Th)(t~) is t he optimal average cost po- stage for all initial states, and h i.s an associated dilferent.ial cost vector. 'I’lms convergence can only be expected when the optimal average cost, per stage is independent, of t he initial st at e, iudicat ing that, at, least, oik- of I,he conditions of Prop. 2.(i is required. However, it turns out. that a stronger hypothesis is needed for convergence. The following example illustrates the reason. Example 3.1: Consider the iteration // " = 7),//' - (7),//')(/.)<, which is I hr relative value iteration (3.1) for the case of a fixed //,. Using the expressions 7),// — -h and ~ </(</// +/р/Л), where cj is the row vector having all coordinates equal to 0 except for coordinate t, which is equal t.o I. this iteration can he written as // 1 1 = (/,, + 1 ’llhl‘ - ce’i(tJi, + Equivalent Iy. we have (3.5)
See. 1.3 ( 'ош/mtat biu.il Met luxls 205 where (/- ei'i)l'i,. (3.G) Convergence of deration (3.5) depends on whether all the eigenvalues of /\, lie strictly within the unit circle. Wo have lor any eigenvalue 7 of /’/( with corresponding eigenvector u, p„e — (A — e<;)/?„e = ?(e - and in particular, for the eigenvalue 7 — 1 and the corresponding eigenvector v — (' we obtain using the fact e'.c = 1. Therefore, we have 7',,(r - <<;<;) y(e - rcjc), and it follows that each eigenvalue у of wil.li corresponding eigenvect.or which is not a scalar multiple of e. is also an eigenvalue ol with corre- sponding eigenvector (r — ceje). Thus, if has an eigenvalue у 1 th.il. is on the unit circle, the iteration (3.5) is not convergent. This occurs when /), has a periodic structure and some of its uonunity eigenvalues arc 011 the unit circle. I'or example, suppose that. / 0 I \ \ 1 ° / which has eigenvalues 1 mid —1. Then taking I — I, the matrix of I'lq. (3.G) is given by and has eigenvalues (1 and - 1. As a result, iteration (3.5) does not. converge even though // is a unichahi policy. The following proposition shows convergence of the relative vidtie it- eration (3.1) under a technical condition that excludes situations such as the one of the preceding example. When there is only one control available per state, t hat is. t here is only one stat iotiary policy /1. the condil ion of t lit' following propos'd ion requires t hat. for some posit ive iiilcgci m. I he dial rix /p" has at least one cohtnin all the componeiits of which are positive. As can be seen from the preceding example, this condition need not hold if// is unichnin. 11 owe ver. we will later provide a variant, of t he relative value it- eration (3.1) that converges under the weaker condition I lint till stationary policies are nnidiaiti (see Prop. 3.3).
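The eigenvalue argument of Example 3.1 is easy to check numerically. The following sketch (added here for illustration; the aperiodic matrix is a hypothetical variant, not from the text) forms the matrix (I − e e_t') P_μ of Eq. (3.6) for the periodic two-state chain of the example and for an aperiodic perturbation of it, and prints the eigenvalue moduli; convergence of iteration (3.5) requires all of them to lie strictly within the unit circle.

import numpy as np

def projected_matrix(P, t=0):
    """Return (I - e e_t') P, the iteration matrix of Eqs. (3.5)-(3.6)."""
    n = P.shape[0]
    e = np.ones((n, 1))
    e_t = np.zeros((n, 1)); e_t[t] = 1.0
    return (np.eye(n) - e @ e_t.T) @ P

P_periodic  = np.array([[0.0, 1.0], [1.0, 0.0]])   # eigenvalues 1 and -1
P_aperiodic = np.array([[0.1, 0.9], [0.9, 0.1]])   # eigenvalues 1 and -0.8 (hypothetical variant)

for P in (P_periodic, P_aperiodic):
    moduli = np.abs(np.linalg.eigvals(projected_matrix(P)))
    print(np.round(np.sort(moduli), 3))
# The periodic chain gives moduli 0 and 1, so iteration (3.5) does not converge;
# the aperiodic variant gives 0 and 0.8, and the iteration converges geometrically.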
206 Average CosZ /кт Prolih'ins (:hni>. J Proposition 3.1: Assume that there exists a positive integer rn. such that for every admissible policy тг = {/zo, pi,...}, there exists an e > 0 and a state .s such that [P/l Ki P/lm- 1 • • • ^/'1 ]i’ — 61 z = l,...,zz, (3-7) -1 • • • ^eo]’-4 — 6> z = l,...,n, (3.8) where [фа denotes the element of the zth row and .sth column of the corresponding matrix. Fix a state t and consider the relative value iteration algorithm hk+l(i) = (77z7)(z) — (Thk)(t), z = l,...,n, (3.9) where /t°(z) are arbitrary scalars. Then the sequence {hk} converges to a vector /z satisfying (Th)(t)e + h = Th, so that by Prop. 2.1, (77z)(f) is equal to the optimal average cost per stage for all initial states and h is an associated differential cost vector. Proof: Denote </A(z) = hkkl («) — hk(J,), i = 1,2...., n. We will show that for all i, and k: > m we have inax</*(i) - min</*(/) < (1 — r) ^max<7A“'"(/) - miu(/fc_'"(z)^ , (3.10) where m and < arc as stated in the hypothesis. From this relation we then obtain, for some 1J > 0 and all k-, w<ix<ik(i) — min (/ (i) < B(1 — ()kk"’. t I Since qk(I.) = 0, it follows that, for all z, |/A' H (/') — /zA(z) [ = |'/ (')l maxQA (j) — miiiz/Q') < 71(1 — ()k/'’" Therefore, for every г > 1 and i we have г - 1 |/лАт4 r(z) - /лА (/)| < \hk D + 1 (z) - /zA'+z(z)| z=o Г — 1 </1(1 - ^2(1 - ()'/'" (3.11) (=:() 77(1 — z )fc/m (1 - (1 - e)’7’») ~ I - (1 - e)‘/''l
S<-c. 1.3 ( ‘oll>J>lllal ioilill Met I Illi Is 207 so that. {hk'(i)} is a Cauchy sequence and converges to a limit. h,(i). brom Eq. (3.!)) we sec then that, the equation (77z)(/) t //(/) -•= (77/)(/) holds for all i. It. will thus be sullicient. to prove Eq. (3.10). To prove Eq. (3.10), we denote by /ik(i') the control that attains the minimum in the relation ~ min net'tO (3.12) for every A' and i,. Denote A,.. = (77z/)(Z). Then we have /'A + l = !//ц. + l,llkhk - AlT < , I T(,A - Xkc, h1' = fW -i + kA- Х" 1 Afc-ie < <h,k -I- Pi,khk~l - Aa._.z . where c = (1.......1)' is the unit vector. Krom these relations, using the definition </A — hkt 1 — h1-. we obtain Piik<ll'~' + (Aa. -1 ~ 4k < kA. i‘/' ' + (AA । ~ -M^- Since this relation holds for every > 1. by iterating we obtain Pi't. I'l'!.-+ (Aa. - ЛА.)е < z/A < kA... kA-T (Aa.-,„-Aa.)c. First, let us assume that the special st ate .s corresponding t.o ///,_. ... ,/ц. as in Eqs. (3.7) and (3.S) is the fixed slate I. used in iteration (-3.9); that is. Pllk_m J,, >G 1 l.....m (3.L1) Ik-. • • • kJ" > C Z=1.......e. (3.15) The right-hand side of Eq. (3.13) yields 7A (’) < - i l’i'k-т I'j'l1' (/) + Aa.~„, — Aa., Г---1 so using Eq. (3.15) and the fact — 0, we obtain </A'(0 < (1 - f) шах/ -"'(./) + Aa_„, - Aa., i 1..........ii,
- -1 208 Average Cost, per Stage Problems Chap. 4 implying that, maxz/(y) < (1 - c)inax</:-'"(J) +- A*. _m - At. (3.16) j ,i Similarly, from the left-hand side of Eq. (3.13) we obtain min </ (;) > (1 - r) rnin</A;~m(j) + Afc..,,,. - Aa., (3-17) j j and by subtracting the last two relations, we obtain the desired Eq. (3.10). When the special state s corresponding to Pk as in Eqs. (3.7) and (3.8) is not equal to t, we define a related iterative process />ti(i) = (T/>)(/) (77;Aj(.s)5 (3.18) //"(<)= //.'’(/), Z=l,...,Z/„ Then, as earlier, we have max</(i) - min</A'(z) <(!—<) (luiixi)1' '"(1) — min<;A-'"(z.)^ , (3.19) where zyA = />н _ />_ II, is straight-forward to verify, using Eqs. (3.9) and (3.18), that, for all i and к we have hA'(z) - hA:(z) + (T/> ')(.s) - (Т/Л- ')(')• Therefore, the coordinates of both h1-' and z/A’ differ from the coordinates of hk and </*’, respectively, by a constant. It, follows that ma,xz/A'(z) — min z/A(z) = maxz/A:(z) — miuz/A'(z), and from Eq. (3.19) we obtain the desired Eq. (3.10). Q.E.D. As a by-product, of the preceding proof, we obtain a, rate of conver- gence estimate. By taking the limit in Eq. (3.11) as r oo, we obtain fit I - ffA'/ai ma.x|/zA'(z)--/z(z)| < -1L (1~A: = 0,1,..., so the bound on the error is reduced by (1 — c)1/'" at each iteration. A sharper rate of convergence result can be obtained if we assume that there exists a unique optimal stationary policy /z‘. Then, it is possible to show that, the minimimi in Eq. (3.12) is attained by /<*(') l(,r all i, and all к alt,er a certain index, so for such At, the relative value iteration takes the form /zAt 1 = Tlph1’' — (7’(l«/zA')(/)z’, and is governed by the largest eigenvalue modulus of t he matrix /(,* given by Eq. (3.6). Note that, contrary to the case of a discounted or a. stochastic short- est path problem, the Gauss-Seidel version ol the relative value iteration method need not, converge. Indeed, the reader can construct, examples of such behavior involving two-state systems and a single policy.
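As a numerical illustration, the following Python sketch (added here; not part of the original text) implements the relative value iteration (3.9) for a problem specified by its transition matrices and stage costs, together with the error bounds c̲_k, c̄_k of Prop. 3.2 in the next subsection. The data used are those of Example 3.2 in the sequel; the printed bounds bracket the optimal average cost 3/4, as in Fig. 4.3.1.

import numpy as np

# Data of Example 3.2 (two states, two controls).
P = {1: np.array([[0.75, 0.25], [0.75, 0.25]]),    # p_ij(u1)
     2: np.array([[0.25, 0.75], [0.25, 0.75]])}    # p_ij(u2)
g = {1: np.array([2.0, 1.0]),                      # g(1,u1), g(2,u1)
     2: np.array([0.5, 3.0])}                      # g(1,u2), g(2,u2)

def T(h):
    """(Th)(i) = min over u of [ g(i,u) + sum_j p_ij(u) h(j) ]."""
    return np.minimum.reduce([g[u] + P[u] @ h for u in P])

t = 0                                   # reference state t = 1 (index 0)
h = np.zeros(2)
for k in range(1, 11):
    Th = T(h)
    h = Th - Th[t]                      # relative value iteration (3.9)
    Th = T(h)
    c_low, c_high = (Th - h).min(), (Th - h).max()   # error bounds of Prop. 3.2
    print(k, np.round(h, 3), round(c_low, 3), round(c_high, 3))
# Both bounds approach the optimal average cost 3/4, and h(2) approaches 1/3.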
Sec. 4.3 Computational Methoth 209 Error Bounds Similar to discounted problems, the relative value iteration method can be strengthened by the calculation of monotonic error bounds. Proposition 3.2: Under the assumption of Prop. 3.1, the iterates h,k of the relative value iteration method (3.9) satisfy Qk — £fc+i — Cfc+i — <‘k, (3.20) where Л is the optimal average cost per stage for all initial states, and = imn[(77i.fc)(i) - ЛгЛ(г)], ck = max[(77tfc)(i) - /tfc(i)]. Proof: Let /u(i) attain the minimum in (T / i,k') (j.) = min v t U (i) + ^P,.Au)hk(.j') for each k and i. We have, using Eq. (3.9), (Tlik)(i) = .'/(/./p. (z)) d- ^2po(//a ('))/i*'(j) j'-t It. = - (w-’Xf), J=1 and 11 hA:(') < .r--i Subtracting the last two relations, we obtain ll (tipa,) - /A(,) ')(./) -i'1’ '(./)), j=i and it follows that niin[(77d')(i) — /^'(г)] S nuii^'UhW1)^) — IP~1 (г)],
210 Average Cost per Stage Problems Chap. 4

or equivalently

c̲_{k-1} ≤ c̲_k.

A similar argument shows that c̄_k ≤ c̄_{k-1}. By Prop. 3.1 we have h^k(i) → h(i) and (Th)(i) − h(i) = λ for all i, so that c̲_k → λ and c̄_k → λ. Since {c̲_k} is also nondecreasing, we must have c̲_k ≤ λ for all k. Similarly, c̄_k ≥ λ for all k. Q.E.D.

We now demonstrate the relative value iteration method and the error bounds (3.20) by means of an example.

Example 3.2:

Consider an undiscounted version of the example of Section 1.3. We have S = {1, 2}, U = {u^1, u^2},

( p_11(u^1)  p_12(u^1) )   ( 3/4  1/4 )
( p_21(u^1)  p_22(u^1) ) = ( 3/4  1/4 ),

( p_11(u^2)  p_12(u^2) )   ( 1/4  3/4 )
( p_21(u^2)  p_22(u^2) ) = ( 1/4  3/4 ),

and

g(1, u^1) = 2,   g(1, u^2) = 0.5,   g(2, u^1) = 1,   g(2, u^2) = 3.

The mapping T has the form

(Th)(i) = min [ g(i, u^1) + Σ_{j=1}^2 p_ij(u^1) h(j),  g(i, u^2) + Σ_{j=1}^2 p_ij(u^2) h(j) ],   i = 1, 2.

Letting t = 1 be the reference state, the relative value iteration (3.9) takes the form

h^{k+1}(1) = 0,
h^{k+1}(2) = (Th^k)(2) − (Th^k)(1).

The results of the computation starting with h^0(1) = h^0(2) = 0 are shown in the table of Fig. 4.3.1.
Sec. 4.3 Computational Methods 211

  k    h^k(1)   h^k(2)    c̲_k      c̄_k
  0      0       0
  1      0      0.500     0.625    0.875
  2      0      0.250     0.687    0.812
  3      0      0.375     0.719    0.781
  4      0      0.312     0.734    0.765
  5      0      0.344     0.742    0.758
  6      0      0.328     0.746    0.754
  7      0      0.336     0.748    0.752
  8      0      0.332     0.749    0.751
  9      0      0.334     0.749    0.750
 10      0      0.333     0.750    0.750

Figure 4.3.1 Iterates and error bounds generated by the relative value iteration method for the problem of Example 3.2.

We note an interesting application of the error bounds of Prop. 3.2. Suppose that for some vector h, we calculate a μ such that

T_μ h = Th.

Then by applying Prop. 3.2 to the original problem and also to the modified problem where the only stationary policy is μ, we obtain

c̲ ≤ J*(i) ≤ J_μ(i) ≤ c̄,

where

c̲ = min_i [ (Th)(i) − h(i) ],   c̄ = max_i [ (Th)(i) − h(i) ].

We thus obtain a bound on the degree of suboptimality of μ. This bound can be proved in a more general setting, where J*(i) is not necessarily independent of the initial state i (see Exercise 4.10).

Other Versions of the Relative Value Iteration Method

As mentioned earlier, the condition for convergence of the relative value iteration method given in Prop. 3.1 is stronger than the conditions of Prop. 2.6 for the optimal average cost per stage to be independent of the initial state. We now show that we can bypass this difficulty by modifying
212 Average Cost per Stage Problems Chtip. 4 the problem without, allecting either I,hi: optimal cost, or the optimal poli- cies and by applying the relative value iteration method to the modified problem. Let г be any scalar with 0 < т < 1, and consider the problem that, results when each transition matrix P,, cor- responding to a stationary policy //. is replaced by !>, = rP„ + (1 - r)I, (3.21) where / is the identity matrix. Note that, is a. transition probability matrix with the properly that., at. every state, a sell-transition occurs with probability at, least (1 -- t). This destroys any periodic character that Pfl may have, For another view of the same point,, note that each eigenvalue of is of the form ry+ (1 — t), where 7 is an eigenvalue of Pf,. Therefore, all eigenvalues 7 7^ 1 of Pf, that, lie on the unit, circle are mapped into eigenvalues of Pl: strictly inside the unit, circle. Bellman’s equation for the modified problem is A,,e + h,, ~ fl,, 4- Pi,hi, = <h, + (rf), + (1 - r)[)lii,. which can be writ ten as /\,с + Th 1, а,, T P„(Thi,). We observe that this equation is the same as Bellman's equation for the original problem. A;|C 4" ll j, -- (Ju -p Pfjl j,, with the identification h;, = Th,,. ll. follows Liotu Cor. 2.L.L that, if the average cost per stage for the original problem is independent, of i. for every /1, then the same is true for the modified problem. Furthermore, the costs of all stationary policies, as well as the optimal cost,, are equal for both the original and the modified problem. Consider now the relative value iteration met,hod (3.!)) for the modi- fied problem. A straight,forward calculat ion shows that it takes the form /d'l'(/) = (1 ” r)/d(i) t- min !j(i. u) F Ty2l'i.All-')hh(.i') (3.22) - min uCl’(l) 4-7^27>(j(u)/tA'(j) j-i 7
Sec. i.S Compelatioiial Methods 213 where /, is seine fixed state with /z11)/) = 0. Note that. tins iteration is a.s easy to execute as the original version. 11. is convergent, however, under weaker conditions t han t hose required in Prop. .3.1. Proposition 3.3: Assume that each stationary policy is unichain. Then, for 0 < г < 1, the sequences {/zA'(i)} generated by the modified relative value iteration (3.22) satisfy lim min A’—uEU(i) (3.23) where A is the optimal average cost per stage and /». is a differential cost vector. Proof: 'L’he proof consists of showing that the conditions of Prop. 3.1 are satisfied for the modified problem involving the transition probability matrices of Eq. (3.21). Indeed, let in > ниц, where n is the number ofstat.es and zzq is the number of distinct stationary policies. Consider a set. of control functions pojii,... . p,n. Then at least one p. is repeated n times within the subset pii. Lot .4 be a state belonging to the recurrent class of the Markov chain corresponding to //. Then the conditions ['Gon ’ ' ’ — ' • zl = I.. . . , zz, [/<, ' i......". are satisfied for some < because, in view of Eq. (3.21), when there is a positive probability of reaching ,ч from z at some stage, there is also a positive probability of reaching it. at any subsequent, stage. Q.E.D. Note that, since t he modified value iteration met hod is not hing but the ordinary method applied t.o a modified problem, the error bounds of Prop. 3.2 apply in appropriately modified form. 4.3.2 Policy Iteration The policy iteration algorithm for the average cost, problem is similar to those described in Sections 1.3 and 2.2. Given a stationary policy, one obtains an improved policy by means of a minimization process until no further improvement is possible. We will assume throuyhout this section
214 Average Cost per Stage Problems Chap. 4

that every stationary policy encountered in the course of the algorithm is unichain.

At the kth step of the policy iteration algorithm, we have a stationary policy μ^k. We then perform a policy evaluation step; that is, we obtain corresponding average and differential costs λ^k and h^k(i) satisfying

λ^k + h^k(i) = g(i, μ^k(i)) + Σ_{j=1}^n p_{ij}(μ^k(i)) h^k(j),   i = 1, ..., n,    (3.24)

or equivalently

λ^k e + h^k = T_{μ^k} h^k = g_{μ^k} + P_{μ^k} h^k.

Note that λ^k and h^k can be computed as the unique solution of the linear system of equations (3.24) together with the normalizing equation h^k(t) = 0, where t is any state (cf. Prop. 2.5). This system can be solved either directly, or iteratively using the relative value iteration method, or by an adaptive aggregation method, as discussed later.

We subsequently perform a policy improvement step; that is, we find a stationary policy μ^{k+1}, where for all i, μ^{k+1}(i) is such that

g(i, μ^{k+1}(i)) + Σ_{j=1}^n p_{ij}(μ^{k+1}(i)) h^k(j) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^n p_{ij}(u) h^k(j) ],    (3.25)

or equivalently

T_{μ^{k+1}} h^k = Th^k.

If μ^{k+1} = μ^k, the algorithm terminates; otherwise, the process is repeated with μ^{k+1} replacing μ^k.

There is an easy proof, given in Exercise 4.11, that the policy iteration algorithm terminates finitely if we assume that the Markov chain corresponding to each μ^k is irreducible (is unichain and has no transient states). To prove the result without this assumption, we impose the following restriction in the way the algorithm is operated: if μ^k(i) attains the minimum in Eq. (3.25), we choose μ^{k+1}(i) = μ^k(i), even if there are other controls attaining the minimum in addition to μ^k(i). We then have:

Proposition 3.4: If all the generated policies are unichain, the policy iteration algorithm terminates finitely with an optimal stationary policy.

It is convenient to state the main argument needed for the proof of Prop. 3.4 as a lemma:
Sec. 1.3 nl h>n;il Methods 2 15 Lemma 3.1: Let /i be a. unichain stationary policy, and let A and h be corresponding average and differential costs satisfying Xe + li = TtJi, (3.26) as well as the normalization condition j 'v -1 * * * lim — V P‘7t = 0. (3.27) fc=0 [The above limit and the limit in the following Eq. (3.29) are shown to exist in Prop. 1.1.] Let {p, p,...} be the policy obtained from /i via the policy iteration step described previously, and let A and h be corresponding average and differential cost satisfying Ac T h = P^(h) (3.28) and 1 'V1 lim V P±h = 0. (3.29) ЛГ — ОО Д' 2—, a ' fc=o Then if p p, we must have either (1) A < A, or (2) A = A and h(i) < h(i.) for all i = 1,..., n, with strict inequality for at least one state i. We note that, once Lemma 3.1 is established, it can be shown that the policy iteration algorithm will terminate finitely. The reason is that the vector h corresponding to p via Eq. (3.26) and (3.27) is unique by Prop. 2.5(b), and therefore the conclusion of Lemina 3.1 guarantees that, no policy will be encountered more than once in the course of the algorithm. Since the number of stationary policies is finite, the algorithm must terminate finitely. If the algorithm stops at the Alli step with /i.k+l = pA, we see from Eqs. (3.24) and (3.25) that, Al'c + h,k = TIP, which by Prop. 2.1 implies that /Р is an optimal stationary policy. So to prove Prop. 3.1 there reniains t.o prove Lemma 3.1. Proof of Lemma 3.1: For notational convenience, denote I 1 - P = P,,. P = P7T. P* = lim - Vpk. P* 1 iV—»oo N УО lim N — >oo N- 1 V У. k-A) 1 N
216 A vcragc Cost, per Stage Problems Chap. 4 II =- !)/,, !) = !ljl- Define the vector <*) by <5 = Ac + h - 7/ - Ph. (3.30) We have, by assumption, T^li = Th < TtJi. = Ac 3- //, or equivalently 7/ 4- Ph < (j + Ph = Ac 4- h, (3.31) from which we obtain Л(/) > 0, / !.....//. (3.32) Define also A - It h. By combining Eq. (3.30) with the equation Ar; 4- h = 7/ h Ph, we obtain b = (A - A)c 4- A — 7’A. Multiplying t his relation with Pk and adding from 0 t.o Ar — 1, we obtain N I - Л'(А - А)< -+Д--Л,Л А. (3.33) i=i) Dividing by N and taking the limit as N -о сю, we obtain 1 Л'"' n = i)vS^ = (A“’A)''- (X34) In view of tin1 fact h > 0 |cf. Eq. (3.3'2)j, we see that A > A. if A > A, we are done, so assume that, A = A. A state ? is called /’- rccurrcnl (P-irnnsicnt) if 7 belongs (does not belong, respectively) to the single recurrent class of the Maikov chain corresponding t.o P . From Eq. (3.3-1). P h = 0 and since b > 0 and the elements of P that, are positive are t hose columns corresponding to /’-recurrent states, we obtain b{i.) = 0, for all i, that are /’-recurrent,. (3.35) It. follows by const ruct ion ol the algorithm that, if i is /’-recurrent., then the /th rows of /’ and /’ are identical [since p(i) •--- //(/) for all i with <*>(/) = 0]. Since P and P have a single recurrent class, it follows that this
Sec. 4.3 Computational Methods 217

class is identical for both P and P̄. From the normalization conditions (3.27) and (3.29), we then obtain h(i) = h̄(i) for all i that are P-recurrent. Equivalently,

Δ(i) = 0,   for all i that are P-recurrent.    (3.36)

From Eq. (3.33) we obtain

lim_{N→∞} P̄^N Δ = Δ − lim_{N→∞} Σ_{k=0}^{N-1} P̄^k δ ≤ Δ − δ.

In view of Eq. (3.36), and since the elements of P̄^N corresponding to the P̄-transient states tend to zero, the coordinates of P̄^N Δ tend to zero. Therefore, we have

δ(i) ≤ Δ(i),   for all i that are P̄-transient.    (3.37)

From Eqs. (3.32) and (3.35) to (3.37), we see that either Δ = 0, in which case μ̄ = μ, or else Δ ≥ 0 with strict inequality Δ(i) > 0 for at least one P̄-transient state i. Q.E.D.

We now demonstrate the policy iteration algorithm by means of the example of the previous section.

Example 3.2: (continued)

Let μ^0(1) = u^1, μ^0(2) = u^2. We take t = 1 as a reference state and obtain λ_{μ0}, h_{μ0}(1), and h_{μ0}(2) from the system of equations

λ_{μ0} + h_{μ0}(1) = g(1, u^1) + p_11(u^1) h_{μ0}(1) + p_12(u^1) h_{μ0}(2),
λ_{μ0} + h_{μ0}(2) = g(2, u^2) + p_21(u^2) h_{μ0}(1) + p_22(u^2) h_{μ0}(2),
h_{μ0}(1) = 0.

Substituting the data of the problem,

λ_{μ0} = 2 + (1/4) h_{μ0}(2),
λ_{μ0} + h_{μ0}(2) = 3 + (3/4) h_{μ0}(2),

from which

h_{μ0}(1) = 0,   h_{μ0}(2) = 2.

218 Average Cost per Stage Problems Chap. 4

We now find μ^1(1) and μ^1(2) by the minimization indicated in Eq. (3.25). We determine

min [ g(1, u^1) + p_11(u^1) h_{μ0}(1) + p_12(u^1) h_{μ0}(2),  g(1, u^2) + p_11(u^2) h_{μ0}(1) + p_12(u^2) h_{μ0}(2) ]
= min [ 2 + (1/4)(2),  0.5 + (3/4)(2) ] = min [2.5, 2],

and

min [ g(2, u^1) + p_21(u^1) h_{μ0}(1) + p_22(u^1) h_{μ0}(2),  g(2, u^2) + p_21(u^2) h_{μ0}(1) + p_22(u^2) h_{μ0}(2) ]
= min [ 1 + (1/4)(2),  3 + (3/4)(2) ] = min [1.5, 4.5].

The minimization yields μ^1(1) = u^2, μ^1(2) = u^1. We obtain λ_{μ1}, h_{μ1}(1), and h_{μ1}(2) from the system of equations

λ_{μ1} + h_{μ1}(1) = g(1, u^2) + p_11(u^2) h_{μ1}(1) + p_12(u^2) h_{μ1}(2),
λ_{μ1} + h_{μ1}(2) = g(2, u^1) + p_21(u^1) h_{μ1}(1) + p_22(u^1) h_{μ1}(2),
h_{μ1}(1) = 0.

By substituting the data of the problem, we obtain

h_{μ1}(1) = 0,   h_{μ1}(2) = 1/3.

We find μ^2(1) and μ^2(2) by determining the minimum in

min [ g(1, u^1) + p_11(u^1) h_{μ1}(1) + p_12(u^1) h_{μ1}(2),  g(1, u^2) + p_11(u^2) h_{μ1}(1) + p_12(u^2) h_{μ1}(2) ]
= min [ 2 + (1/4)(1/3),  0.5 + (3/4)(1/3) ] = min [2.08, 0.75],

and

min [ g(2, u^1) + p_21(u^1) h_{μ1}(1) + p_22(u^1) h_{μ1}(2),  g(2, u^2) + p_21(u^2) h_{μ1}(1) + p_22(u^2) h_{μ1}(2) ]
= min [ 1 + (1/4)(1/3),  3 + (3/4)(1/3) ] = min [1.08, 3.25].
Sec. ( '(>inj>nt;Hittunl Method* 219 The minimization yields //(I) = //'(1) = (Г, //"(2) - //‘(2) - I/', and hence the preceding policy is optimal and the optimal average' cost is A„, =3/1. The algoriLinn of this section shares sonic of the features of other type's of policy iteration algorithms. In particular, it is possible to carry out. policy evaluation approximately by using a few relal ive value iterations; see [Put 9-1] for an analysis. Note also that, in specially structured problems one may be able to confine policy iteration within a convenient, subset of policies, for which policy evaluation is facilitated. Adaptive Aggregation Consider now an extension to t he average cost, problem of t he aggre- gation method described in Section 1.3.3. For a given unichain station- ary policy //, we want to calculate an approximation to the pair (A;,. /i/z) satisfying Bellman's equation A/(c + //.,, = T/Zq, and = 0. where the state n is viewed as the reference state. By expressing Xt, as A,, = (7],Z/z,)(//). we can eliminate it. from this system of equations, and obtain Zg, = Tf.li(l, — (7/Zq, )(;/.)<:. This equation is written compactly as Z/.p = 1 ll j/ , where the mapping T is defined by th, = ijr + p,h, and ijr = (7 - ее],)</>,, Pr = (I - се],)//,. with e„ = (0.0.... ,0,1)'. We partition the set of states into disjoint, subsets 5;, 5'2,.... 5',,, that are viewed as aggregate states. We assume that, one of the subsets, say Sm, consists of just, the reference state n; that is, S„, = {n}. As in Section 1.3.3, consider the n x m matrix IF whose /th column has unit, entries at. coordinates corresponding to states in S, and all other entries equal t.o zero. Consider also an in x 11 matrix Q such that the /th row of Q is a probal>ility distribution (/7,1...//;„) with = 0 if s 1/ S,. Note that QIC — J, and that the in x m matrix H -- QT^W is the transition probability matrix of the aggregate Markov chain, whose states are the m aggregate stat.es. Suppose now that we have an estimate //. of Zq, and that we postulate that, over the states s of every aggregate si,ate S, the variation Zi7l(.s) — h(s)
220 Avc/vigc Cost /></' Stag/’ /’rob/e/п.ч Chap. -7 /.s coii.slu/il.. 'Vliis amounts to hypothesizing that for sonic ///-dinicnsional vector у we have h„ - h = IP.//. By combining the equations Th - //,- t P,h and //,, - ij, -I we have (1 - I’.-jth,, -h) = Th -//. By multiplying both sides of this equation with Q- and by using tlie rela- tions h,, — h — И'// and QII’ = 1, we obtain (7-QPrIP)// = Q(77,-//). Assuming that, the matrix I — QP,M' is invertible, this equation can be solved lor I/. Also, by applying 7’ to hot h sides of t he equation h,, -- li- + \Vij, we obiain h,, = Th,, = Th +- P,Wy. 'Г1 ins I he aggregat ion it.erat ion lor average cost, problems is as follows: Aggregation Iteration Step 1: Compute Th, = yr + P,-h, where <// — (7 Wii У!/,, > I г — (7 iii’h)T,i- Step 2: Delineate the aggregate states (i.e.. define 11') and specify the matrix Q. Step 3: Solve for // the system (7 - QP.-W)» = QTh, and approximate h,, using h := Th + P^Wy. Lor the iteration to be valid, the matrix 1 — QP, \V must be invertible. We will show that this ь guaranteed under an aperiodicity assumption such as t he one used to prove convergence of I he relal ive value iteration method (< f. Prop. 3.1). Ill particular. we assume that all (he eigenvalues ol the transition matrix ll - QP^W of the aggregate Markov chain, except for a single unity eigenvalue, lie strictly within the unit circle. Let us denote by < the ///-dimensional vector <>f all I's. and by c,„ the ///-dimensional vector
Sec. 1.3 (’l>llll>lltllt ill! in I A let I к >( Is 221 with Lust coordinate 1, and all other coordinates (). Then using the easily verified relations Qh - ' and <'lnQ - c'n. we see that. QPrW = (/-cc',„)/?. Front the analysis of Example 3.1 in Section 1.3. we have that QP,W has m — 1 eigenvalues that are equal to the in - 1 nonunity eigenvalues of И and has 0 as its /nth eigenvalue. Thus QP,]V must have all its eigenvalues strictly within the unit circle, and it follows that, the matrix I QP, II is invert ible. In an adaptive aggregation method, a key issue is how to identify the aggregate states Si..... S,,, in a way that the error /q, — h is of similar magnitude in each aggregate state. Similar to Section 1.3.3, one way to do this is to view 77/, as an approximation to //,, and to group together states i with coinparable magnitudes of — h(i). As dismissed in Section 1.3.3, this type of aggregation method can be greatly improved by interleaving each aggregation iteration with multiple relative value itera- tions (applications of the mapping 7’ on the current iterate). We refer t.o [13eC’89] for further experimentation, analysis, and discussion. 4.3.3 Linear Programming Let us now develop a linear programming-based solution method, assuming that any one of the conditions of Prop. 3.3 holds, so that, the optimal average cost A* is independent of the initial state, and together with an associated differential cost vector h*. satisfies X*< + h* = Th*. Consider the following optimization problem in the variables A and i = 1, . . . . //. maximize A subject to A+ //(/)< (77/)(/), / = 1,...,/?.. which is equivalent to the linear program maximize A 11 subject t.o A + //(/’) < //(/’. //) + / = !,....//. //G tf(/). ./ --1 (3.3b) Л nearly verbatim repetition of the proof of Prop. 2.1 shows I hat if (A,//) is a feasible solution, that is, Ac +//. < 77/, then A у A*, which implies that (A*.//*) is an optimal solution of the linear program (3.3JS). Furthermore, in any optimal solution (A,//) of the lineal' program (3.38), we have A — A*. There is a linear program, which is dual to the above and which admits an interesting interpretation. In particular, the duality theory of
222 Average Cost per Stage Problems Chap. 4

linear programming (see, e.g., [Dan63]) asserts that the following (dual) linear program

minimize Σ_{i=1}^n Σ_{u ∈ U(i)} q(i, u) g(i, u)

subject to

Σ_{u ∈ U(j)} q(j, u) = Σ_{i=1}^n Σ_{u ∈ U(i)} q(i, u) p_ij(u),   j = 1, ..., n,    (3.39)

Σ_{i=1}^n Σ_{u ∈ U(i)} q(i, u) = 1,

q(i, u) ≥ 0,   i = 1, ..., n,  u ∈ U(i),

has the same optimal value as the (primal) program (3.38). The variables q(i, u), i = 1, ..., n, u ∈ U(i), of the dual program can be interpreted as the steady-state probabilities that state i will be visited at the typical transition and that control u will then be applied. The constraints of the dual program are the constraints that q(i, u) must satisfy in order to be feasible steady-state probabilities under some randomized stationary policy, that is, a policy that chooses at state i the control u probabilistically, by sampling the constraint set U(i) according to the probabilities q(i, u), u ∈ U(i). The cost function

Σ_{i=1}^n Σ_{u ∈ U(i)} q(i, u) g(i, u)

of the dual problem is the steady-state average cost per transition. Duality theory asserts that the minimal value of this cost is λ*, thus implying that the optimal average cost per stage that can be obtained using randomized stationary policies is no better than what can be achieved with ordinary (deterministic) stationary policies. Indeed, it can be verified that if μ* is an optimal (deterministic) stationary policy that is unichain, and p*_i is the steady-state probability of state i in the corresponding Markov chain, then

q*(i, u) = p*_i  if u = μ*(i),   q*(i, u) = 0  otherwise,

is an optimal solution of the dual problem (3.39).
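As a concrete illustration of the primal program (3.38), the following sketch (not from the text; it reuses the data of Example 3.2 and assumes SciPy's linprog is available) maximizes λ over the variables (λ, h(1), ..., h(n)), with one inequality per state-control pair; h is determined only up to an additive constant, so h(1) is pinned to 0.

import numpy as np
from scipy.optimize import linprog

# Data of Example 3.2; decision vector x = (lambda, h(1), ..., h(n)).
P = {1: np.array([[0.75, 0.25], [0.75, 0.25]]),
     2: np.array([[0.25, 0.75], [0.25, 0.75]])}
g = {1: np.array([2.0, 1.0]), 2: np.array([0.5, 3.0])}
n = 2

A_ub, b_ub = [], []
for u, Pu in P.items():
    for i in range(n):
        # lambda + h(i) - sum_j p_ij(u) h(j) <= g(i,u)
        row = np.zeros(1 + n)
        row[0] = 1.0
        row[1 + i] += 1.0
        row[1:] -= Pu[i]
        A_ub.append(row)
        b_ub.append(g[u][i])

bounds = [(None, None), (0.0, 0.0)] + [(None, None)] * (n - 1)   # pin h(1) = 0
res = linprog(c=[-1.0] + [0.0] * n,                              # maximize lambda
              A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=bounds, method="highs")
lam, h = res.x[0], res.x[1:]
print(np.round(lam, 4), np.round(h, 4))   # 0.75 and (0, 1/3) for these data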
Sec. 4.3 Computational Methods 223

4.3.4 Simulation-Based Methods

We now describe briefly how the simulation-based methods of Section 2.3 can be adapted to work for average cost problems. We make a slight change in the problem definition to make the notation better suited for the simulation context. In particular, instead of considering the expected cost g(i, u) at state i under control u, we allow the cost g to depend on the next state j. Thus our notation for the cost per stage is now g(i, u, j), as in the simulation-related material for stochastic shortest path and discounted problems (cf. Section 2.3). All the results and the entire analysis of the preceding sections can be rewritten in terms of the new notation by replacing g(i, u) with

Σ_{j=1}^n p_ij(u) g(i, u, j).

Policy Iteration

In order to implement a simulation-based policy iteration algorithm like the one of Section 2.3.1, we need to be able to carry out the policy evaluation step for a given unichain policy μ. This can be done by using the connection with the stochastic shortest path formulation described in Section 4.1. We fix a state t, and we evaluate the cost of the given policy μ for two stochastic shortest path problems whose termination state is (essentially) t. In particular, we evaluate by Monte-Carlo simulation or TD(λ) the expected cost C_i from each state i up to reaching t [cf. Eq. (2.15)]. This requires the generation of many trajectories terminating at state t and the corresponding sample costs. Simultaneously with the evaluation of the costs C_i, we evaluate the expected number of transitions N_i from each state i up to reaching t [cf. Eq. (2.16)]. Then the average cost λ_μ of the policy is obtained as

λ_μ = C_t / N_t    (3.40)

[cf. Eq. (2.17)], and the associated differential costs are obtained as

h_μ(i) = C_i − λ_μ N_i,   i = 1, ..., n,    (3.41)

[cf. Eq. (2.18)].

To implement a simulation-based approximate policy iteration algorithm, a similar procedure can be used. In particular, one can obtain by Monte-Carlo simulation or TD(λ) functions C̃_i(r) and Ñ_i(r) that depend on a parameter vector r and approximate the costs C_i and N_i of the corresponding stochastic shortest path problems, as described in Section 2.3.3. Then, one can use

λ̃_μ(r) = C̃_t(r) / Ñ_t(r)

as an approximation to the average cost per stage of the policy μ, and also use

h̃_μ(i) = C̃_i(r) − λ̃_μ(r) Ñ_i(r),   i = 1, ..., n,

as approximations to the corresponding differential costs [cf. Eqs. (3.40) and (3.41)]. Note here that, because the approximations C̃_i(r) and Ñ_i(r) play an important role in the calculations, it may be worth doing some extra simulations starting from the reference state t to ensure that these approximations are nearly exact.
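The following sketch (illustrative only; the transition matrix, costs, and sample counts are hypothetical) carries out this policy evaluation step by plain Monte-Carlo simulation: for each start state it estimates the cost C_i and number of transitions N_i up to reaching the reference state t, and then forms λ_μ and h_μ via Eqs. (3.40) and (3.41). In an actual implementation, TD(λ) estimates of C_i and N_i could be used instead, as described above.

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical unichain policy: transition matrix P_mu and costs g_mu(i, j).
P_mu = np.array([[0.5, 0.5, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.4, 0.0, 0.6]])
g_mu = np.array([[1.0, 2.0, 0.0],
                 [0.0, 1.0, 4.0],
                 [2.0, 0.0, 1.0]])
n, t = 3, 0                       # t is the reference state

def sample_to_t(i):
    """Simulate from state i until t is reached; return (cost, number of transitions)."""
    cost, steps = 0.0, 0
    while True:
        j = rng.choice(n, p=P_mu[i])
        cost += g_mu[i, j]
        steps += 1
        if j == t:
            return cost, steps
        i = j

K = 5000                          # trajectories per start state
C, N = np.zeros(n), np.zeros(n)
for i in range(n):
    C[i], N[i] = np.mean([sample_to_t(i) for _ in range(K)], axis=0)

lam = C[t] / N[t]                 # Eq. (3.40): average cost of the policy
h = C - lam * N                   # Eq. (3.41): differential costs, h(t) = 0
print(np.round(lam, 3), np.round(h, 3))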
224 Average Cost per Stage Problems Chap. 4

Value Iteration and Q-Learning

To derive the appropriate form of the Q-learning algorithm of Section 2.3.2, we form an auxiliary average cost problem by augmenting the original system with one additional state for each possible pair (i, u) with u ∈ U(i). The probabilistic transition mechanism from the original states is the same as for the original problem, while the probabilistic transition mechanism from an auxiliary state (i, u) is that we move only to states j of the original problem with corresponding probabilities p_ij(u) and costs g(i, u, j). It can be seen that the auxiliary problem has the same optimal average cost per stage as the original, and that the corresponding Bellman's equation is

λ + h(i) = min_{u ∈ U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + h(j) ),   i = 1, ..., n,    (3.42)

λ + Q(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + h(j) ),   i = 1, ..., n,  u ∈ U(i),    (3.43)

where Q(i, u) is the differential cost corresponding to (i, u). Taking the minimum over u in Eq. (3.43) and substituting in Eq. (3.42), we obtain

h(i) = min_{u ∈ U(i)} Q(i, u),   i = 1, ..., n.    (3.44)

Substituting the above form of h(i) in Eq. (3.43), we obtain Bellman's equation in a form that exclusively involves the Q-factors:

λ + Q(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + min_{u' ∈ U(j)} Q(j, u') ),   i = 1, ..., n,  u ∈ U(i).    (3.45)

Let us now apply to the auxiliary problem the following variant of the relative value iteration (see Exercise 4.5). We then obtain the iteration

h^{k+1}(i) = min_{u ∈ U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + h^k(j) ) − h^k(t),   i = 1, ..., n,    (3.46)
Sec. 4.3 Computational Methods 225

Q^{k+1}(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + h^k(j) ) − h^k(t),   i = 1, ..., n,  u ∈ U(i).    (3.47)

From these equations, we have that

h^k(i) = min_{u ∈ U(i)} Q^k(i, u),   i = 1, ..., n,    (3.48)

and by substituting the above form of h^k in Eq. (3.47), we obtain the following relative value iteration for the Q-factors:

Q^{k+1}(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + min_{u' ∈ U(j)} Q^k(j, u') ) − min_{u' ∈ U(t)} Q^k(t, u').    (3.49)

This iteration is analogous to the value iteration for Q-factors in the stochastic shortest path context. The sequence of values min_{u ∈ U(t)} Q^k(t, u) is expected to converge to the optimal average cost per stage, and the sequences of values min_{u ∈ U(i)} Q^k(i, u) are expected to converge to differential costs h(i).

An incremental version of the preceding iteration that involves a positive stepsize γ is given by

Q(i, u) := (1 − γ) Q(i, u) + γ ( Σ_{j=1}^n p_ij(u) ( g(i, u, j) + min_{u' ∈ U(j)} Q(j, u') ) − min_{u' ∈ U(t)} Q(t, u') )    (3.50)

[compare with Eq. (3.8) in Section 2.3]. The natural form of the Q-learning method for the average cost problem is an approximate version of this iteration, whereby the expected value is replaced by a single sample, i.e.,

Q(i, u) := Q(i, u) + γ ( g(i, u, j) + min_{u' ∈ U(j)} Q(j, u') − min_{u' ∈ U(t)} Q(t, u') − Q(i, u) ),    (3.51)

where j and g(i, u, j) are generated from the pair (i, u) by simulation.
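The following sketch (illustrative; it reuses the two-state data of Example 3.2 with g(i, u, j) = g(i, u), and the stepsize schedule and iteration count are arbitrary choices) implements the Q-learning iteration (3.51): each update samples a transition from a state-control pair and subtracts the current estimate min_{u'} Q(t, u') of the average cost.

import numpy as np

rng = np.random.default_rng(0)
# Example 3.2 data; controls indexed 0 (u1) and 1 (u2), cost independent of j.
P = {0: np.array([[0.75, 0.25], [0.75, 0.25]]),
     1: np.array([[0.25, 0.75], [0.25, 0.75]])}
g = {0: np.array([2.0, 1.0]), 1: np.array([0.5, 3.0])}
n, t = 2, 0

Q = np.zeros((n, 2))
for k in range(1, 200001):
    step = 10.0 / (100.0 + k)                      # diminishing stepsize gamma
    i, u = int(rng.integers(n)), int(rng.integers(2))
    j = int(rng.choice(n, p=P[u][i]))              # simulated transition
    sample = g[u][i] + Q[j].min() - Q[t].min()     # g(i,u,j) + min_u' Q(j,u') - min_u' Q(t,u')
    Q[i, u] += step * (sample - Q[i, u])           # iteration (3.51)

print(np.round(Q[t].min(), 3))     # estimate of the optimal average cost (approaches 3/4 here)
print(np.round(Q.min(axis=1), 3))  # estimates of differential costs h(i)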
Minimization of the Bellman Equation Error

There is a straightforward extension of the method of Section 2.3.3 for obtaining an approximate representation of the average cost λ and the associated differential costs h(i), based on minimizing the squared error in Bellman's equation. Here we approximate λ by λ(r) and h(i) by h(i, r), where r is a vector of unknown parameters/weights. We impose a normalization constraint such as h(t, r) = 0, where t is a fixed state, and we minimize the error in Bellman's equation by solving the problem

    min_r  Σ_{i ∈ S̃} | λ(r) + h(i, r) − min_{u ∈ U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + h(j, r) ) |²,

where S̃ is a suitably chosen subset of "representative" states. This minimization may be attempted by using some gradient method of the type discussed in Section 2.3.3.

4.4 INFINITE STATE SPACE

The standing assumption in the preceding sections has been that the state space is finite. Without finiteness of the state space, many of the results presented in the past three sections no longer hold. For example, whereas one could restrict attention to stationary policies for finite state systems, this is no longer true when the state space is infinite. The following example from [Ros83a] shows that if the state space is countable, there may not exist an optimal policy.

Example 4.1

Let the state space be {1, 1′, 2, 2′, 3, 3′, ...}, and let there be two controls, u¹ and u². The transition probabilities and costs per stage are

    p_{i,i+1}(u¹) = 1,    g(i, u¹) = 0,        p_{i,i′}(u²) = 1,    g(i, u²) = 0,
    p_{i′,i′}(u¹) = p_{i′,i′}(u²) = 1,    g(i′, u¹) = g(i′, u²) = −1 + 1/i,    i = 1, 2, ....

In words, at state i we may, at a cost 0, either move to state (i + 1) or move to state i′, where we stay thereafter at a cost −1 + 1/i per stage. It can be seen that for every policy π and state i = 1, 2, ..., we have J_π(i) > −1. However, for every state i, we can obtain an average cost per stage −1 + 1/j, where j ≥ i, by moving to state j′ once we get to state j. Hence, for every initial state i = 1, 2, ..., an average cost per stage of −1 can be approached arbitrarily closely, but cannot be attained by any policy.

Here is another example, from [Ros70], which shows that for a countable state space there may exist an optimal nonstationary policy, but not an optimal stationary policy.
Example 4.2

Let the state space be {1, 2, 3, ...}, and let there be two controls, u¹ and u². The transition probabilities and costs per stage are

    p_{i,i+1}(u¹) = 1,    g(i, u¹) = 1,        p_{ii}(u²) = 1,    g(i, u²) = 1/i,    i = 1, 2, ....

In words, at state i we may either move to state (i + 1) at a cost 1 or stay at i at a cost 1/i. For any stationary policy μ other than the policy for which μ(i) = u¹ for all i, let n(μ) be the smallest integer for which μ(n(μ)) = u². Then the corresponding average cost per stage satisfies

    J_μ(i) = 1/n(μ) > 0,    for all i with i ≤ n(μ).

For the policy where μ(i) = u¹ for all i, we have J_μ(i) = 1 for all i. Since the cost per stage cannot be less than zero, while stationary policies achieve average cost 1/n(μ) with n(μ) arbitrarily large, it is clear that

    inf_π J_π(i) = 0,    i = 1, 2, ....

However, the optimal cost is not attained by any stationary policy, so no stationary policy is optimal. On the other hand, consider the nonstationary policy π* that, on entering state i, chooses u² for i consecutive times and then chooses u¹. If the starting state is i, the sequence of costs incurred is

    1/i, ..., 1/i (i times),  1,  1/(i+1), ..., 1/(i+1) (i+1 times),  1,  1/(i+2), ....

The average cost corresponding to this policy is

    J_{π*}(i) = lim_{m→∞}  2m / Σ_{l=1}^m (i + l)  = 0,    i = 1, 2, 3, ....

Hence the nonstationary policy is optimal while, as shown previously, no stationary policy is optimal.

Generally, the analysis of average cost problems with an infinite state space is difficult, although there has been considerable progress (see the references). An important tool is Prop. 2.1, which admits a straightforward extension to the case where the state and control spaces are infinite. In particular, if we can find a scalar λ and a bounded function h such that Bellman's equation (2.1) holds, then by repeating the proof of Prop. 2.1, we can show that λ must be the optimal average cost per stage for all initial states. Among other situations, this result is useful when we can guess the right λ and h, and verify that they satisfy Bellman's equation. Some important special cases can be satisfactorily analyzed in this way (see the references). We describe one such case, the average cost version of the linear-quadratic problem examined in Chapters 4 and 5 of Vol. I.
Linear Systems with Quadratic Cost

Consider the linear-quadratic problem involving the system

    x_{k+1} = A x_k + B u_k + w_k,    k = 0, 1, ...,                                        (4.1)

and the cost function

    J_π(x_0) = lim_{N→∞} (1/N) E{ Σ_{k=0}^{N−1} ( x_k′ Q x_k + u_k′ R u_k ) }.               (4.2)

We make the same assumptions as in Section 4.1 of Vol. I, that is, Q is positive semidefinite symmetric, R is positive definite symmetric, and the w_k are independent and have zero mean and finite second moments. We also assume that the pair (A, B) is controllable and that the pair (A, C), where Q = C′C, is observable. Under these assumptions, it was shown in Section 4.1 of Vol. I that the Riccati equation

    K_0 = 0,                                                                                 (4.3)

    K_{k+1} = A′ ( K_k − K_k B (B′ K_k B + R)^{−1} B′ K_k ) A + Q,                           (4.4)

yields in the limit a matrix K,

    K = lim_{k→∞} K_k,                                                                       (4.5)

which is the unique solution of the equation

    K = A′ ( K − K B (B′ K B + R)^{−1} B′ K ) A + Q                                          (4.6)

within the class of positive semidefinite symmetric matrices. The optimal value of the N-stage cost, divided by N, is

    (1/N) ( x_0′ K_N x_0 + Σ_{k=0}^{N−1} E{ w′ K_k w } ).                                    (4.7)

Since K_k → K [cf. Eq. (4.5)],
we see that the optimal N-stage costs tend to

    λ = E{ w′ K w }                                                                          (4.8)

as N → ∞. In addition, the N-stage optimal policy in its initial stages tends to the stationary policy

    μ*(x) = −(B′ K B + R)^{−1} B′ K A x.                                                     (4.9)

Furthermore, a simple calculation shows that, by the definition of λ, K, and μ*(·), we have

    λ + x′ K x = min_u [ x′ Q x + u′ R u + E{ (A x + B u + w)′ K (A x + B u + w) } ],

while the minimum in the right-hand side of this equation is attained at u* = μ*(x) as given by Eq. (4.9). By repeating the proof of Prop. 2.1, we obtain

    λ ≤ (1/N) E{ x_N′ K x_N − x_0′ K x_0 | x_0, π } + (1/N) E{ Σ_{k=0}^{N−1} ( x_k′ Q x_k + u_k′ R u_k ) | x_0, π },

with equality if π = {μ*, μ*, ...}. Hence, if π is such that E{ x_N′ K x_N | x_0, π } is uniformly bounded over N, we have, by taking the limit as N → ∞ in the preceding relation,

    λ ≤ J_π(x),    for all x ∈ ℜ^n,

with equality if π = {μ*, μ*, ...}. Thus the linear stationary policy given by Eq. (4.9) is optimal over all policies π with E{ x_N′ K x_N | x_0, π } bounded uniformly over N.
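As a numerical illustration of Eqs. (4.3)-(4.4), (4.8), and (4.9), here is a minimal sketch that iterates the Riccati equation and then forms the optimal average cost and the gain of the stationary policy (4.9). It uses NumPy and assumes the noise covariance W = E{ww′} is given, so that E{w′Kw} = trace(KW); the function name and iteration count are illustrative.

```python
import numpy as np

def lq_average_cost(A, B, Q, R, W, num_iters=500):
    """Average cost linear-quadratic problem via the Riccati iteration (4.3)-(4.4).

    Returns K (the limit of the Riccati iteration), the optimal average cost
    lambda = E{w'Kw} = trace(K W)  [cf. Eq. (4.8)], and the gain L of the
    optimal stationary policy mu*(x) = L x  [cf. Eq. (4.9)].
    """
    n = A.shape[0]
    K = np.zeros((n, n))                                         # K_0 = 0, Eq. (4.3)
    for _ in range(num_iters):
        BKB = B.T @ K @ B + R
        K = A.T @ (K - K @ B @ np.linalg.solve(BKB, B.T @ K)) @ A + Q   # Eq. (4.4)
    lam = np.trace(K @ W)                                        # E{w'Kw}, Eq. (4.8)
    L = -np.linalg.solve(B.T @ K @ B + R, B.T @ K @ A)           # mu*(x) = L x, Eq. (4.9)
    return K, lam, L
```

As a check, in the scalar case A = B = Q = R = 1 the iteration converges to K = (1 + √5)/2, and λ equals K times the noise variance.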
4.5 NOTES, SOURCES, AND EXERCISES

Several authors have contributed to the average cost problem ([How60], [Bro65], [Ros70], [Sch68], [Vei66], [Vei69]), most notably Blackwell ([Bla62]). An alternative detailed treatment to ours is given in [Put94]. An extensive survey containing many references is given in [ABF93].

The result of Prop. 2.6 under conditions (2) and (3) was shown in [Bat73] and [Ros70], respectively. The relative value iteration method of Section 4.3 is due to [Whi63], and its modified version of Eq. (3.22) is due to [Sch71]. The error bounds of Prop. 3.2 are due to Odoni ([Odo69]). The value iteration method has been analyzed exhaustively in [Sch71], [Sch77], and [Sch78]. Convergence under slightly weaker conditions than those given here is shown in [Pla77]. The error bounds of Exercise 4.10 are due to Varaiya ([Var78]), who used them to construct a differential form of the value iteration method. Discrete-time versions of Varaiya's method are given in [PBW79]. The value iteration method based on stochastic shortest paths of Exercise 4.15 is new (see [Ber95c]). The policy iteration algorithm can be generalized for problems where the optimal average cost per stage is not the same for every initial state (see [Bla62], [Put94], [Vei66], and [Der70]). The adaptive aggregation method is due to [BeC89]. The approximation procedures of Section 4.3.4, and the Q-learning algorithms of Section 4.3.4 and Exercise 4.16, are new. Alternative algorithms of the Q-learning type are given in [Sch93] and [Sin94].

For analysis of infinite horizon versions of inventory control problems, such as the ones of Section 4.2 of Vol. I, see [Igl63a], [Igl63b], and [Hor71]. Infinite state space models are discussed in [Kus78], [Sen86], [Las88], [Bor89], [Cav89a], [Cav89b], [Her89], [Sen89a], [Sen89b], [PAM90], [PAM91], [Cav91], [HHL91], [Sen91], [CaS92], [RiS92], [ABF93], [Sen93a], [Sen93b], and [Put94].

EXERCISES

4.1

Solve the average cost version (α = 1) of the computer manufacturer's problem (Exercise 7.3, Vol. I).

4.2

Consider a stationary inventory control problem of the type considered in Section 4.2 of Vol. I, but with the difference that the stock x_k can only take integer values from 0 to some integer M. The order u_k can take integer values with 0 ≤ u_k ≤ M − x_k, and the random demand w_k can only take nonnegative integer values with P(w_k = 0) > 0 and P(w_k = 1) > 0. Unsatisfied demand is lost, so the stock evolves according to the equation

    x_{k+1} = max(0, x_k + u_k − w_k).

The problem is to find an inventory policy that minimizes the average cost per stage. Show that there exists an optimal stationary policy and that the optimal cost is independent of the initial stock x_0.
Sec. 4.5 Notes, Sonices, mid I'Aerdsrs 4.3 [LiR71] Consider a person providing a certain type* of servin' to customers. The person receives at, the beginning of each time period with probability /ъ a propose I by a oust ol nor о I typo i. wlit'ie / - 1,2, . . . , n, who oilers an an ion nt of money .W,. We ass tu no t hat < I. The person max’ reject tin* oiler, in which case the customer leaves and the person remains idle' during that period, or the person may accept the oiler in which cast' I he person spends some time with (hat customer determined according Io a Markov process with transition probabilities 3a < when*, for /; — 1.2,. . /La- = probability that the type i customer will leave4 after A’ periods, given that the customer has already stayed with the person for k — 1 periods. The problem is to determine an acceptance-rejection policy that maximizes lim -т{1ахрес1(’(1 payment over /V periods}. Consider I wo cast's: I. 3,A. - il e (0, 1) for all 2. Гог each i there exists A-, such that 3,;: == 1. (a) formulate the person’s problem as an average cost per stage* problem, and show that the optimal cost is independent, of the initial state. (b) Show that there exists a scalar A* and an optimal policy that accepts the offer of a type / customer if and only if X*Tt < ЛЛ. where* Tt is the expected time spent wit.h the type / customer given by 7', ~ i 4 A'c/j, . [ (1 — j) • • (I - 3<n). 4.4 Let A ° be an arbitrary vector in li'", and define lor all i and k > 1 //J - 7|/'/i“ - (7'A//')(/)(-. /Л = т'ч" - 1 V(t1 /I >-_l /7 = Tl'li'1 - mil! Let also h'i = /?’ = h" = h".
-1 -5 232 ?\ veragc (\)st, per I 'rohlvms (‘Imp. I (a) Show that the sequences {h^}, {AA}t ami {// } are generated by the algorithms hp1 = Th1; - (Tit)(i)r. A"'1 =Thk - ^(77У)(,)с, 7—1 hP' = Thk - mill ('r/7')(i)r. I I,.... II (l>) Show that the convergence result of Prop. 3.1 holds for the algorithms of part (a), lliiil,-. Proposition 3.1 applies to the algorithms that generate {h1;}. Express h1, anil // as continuous funct ions of {/)* }, i = I,.. ., n. 4.5 (Variants of Relative Value Iteration) Consider the following two variants of the relative value iteration algorithm: /?* '(г) = (77/')(i) - A*'. 1 = where A* = c+ Ц). or Afc = c + y'jijlf-'ti). Il(4-c c is an arbitrary scalar and (pi,. .. ,p„) is an arbitrary probability dis- tribution over the states of the system. Under the assumptions of Prop. 3.1, show that the sequence {//} converges to a vector h and the sequence {Aa} converges to a scalar z\ satisfying Ac | h --- Th, so that by Prop. 2.1, A is equal to the optimal average cost, per stage for all initial states and h is an associated differential cost vector, HiiT. Modify the problem by introducing an artificial state C from which the system moves at. a cost c to state* j with probability for all a. Apply Prop. 3.1. 4.6 ( ’onsider a de term mist ic system with t\м» stales 0 aud I. Upon entering st ate 0, t he system stays (hero permanently al. no cost. In state 1 there is a choice of staying there at no cost or moving to state 0 at cost 1. Show that every police is average cost opt imal, but t he only st at ionarv policy t hat is Blackwell optimal is the one that keeps the system in the state it currently is.
Sec. -1.Г) Suiirvt'x, and IC.xwciscx 233 4.7 Show t hat a Blackwell opt i ma I policy is opt imal over all policies (not. just t hose that, are stationary). Hint: Use the following result: If {<•„} is a nonnegative bounded sequence, then A proof of t his result, can be found in [PuttM], p. 417. 4.8 (Reduction to the Discounted Case) bor the finite-state average cost problem suppose, (here is a stale i such that for sonic /4 > 0 we have />,/(//) U for all states i and controls u. Consider the (I - ,J)-discouiited problem with the same state space, control space, and transit ion probabilities a - r>y ‘p,A11) (1 -nr' (v,M ~ if,7 if ./• = /. Show that p'.7(4) and J(i) are optimal average and dilfe.rential costs, respec- tive!}', where j is the optimal cost funct ion of the (I - /^-discounted problem. 4.9 (Deterministic Finite-State Systems) Consider a deterministic finite-state system. Suppose that the system is con- trollable in the sense that given any two states i and /, there exists a sequence of admissible controls that drives the state of the system from /, to j. Con- sider the problem of finding an admissible control sequence {?г0, щ,. ..} that minimizes N - I Л (/) - ^lhu~ , и,i.-). k^o Show that the optimal cost is independent of the initial state, and t hat there exist optimal control sequences, which after a certain t ime index are periodic.
.1 231 А\('г;цу‘ [мт .S'/age /’ro/den/s ( 7/а.р. 4 4.10 (Generalized Error Bounds) Let h, be any h,-dimensional vector and /t be such that 7> 7 7/,. Show that, for all /, iuiii[(77i)(J) - /'(.>)] < -/*(') < 'AA') < nuix[(77<)(j) - regardless of whether •/*(/') is independent of the initial state i. Hint: Com- plete the* details of the following argument. Let <4') - (77/)(/) --//(/), / - 1.....//, and let be the vector with coordinates We have TILh =- d + h, T*h = 7'fji -I- PtLb — b 4- l^b 4“ h and, continuing in the same manner, N - 1 7/vh --- 57 /РЛ + h’ A' =--1,2,... Hence J,, = lil.l ^T*h = /p, л- - N 1 where = Ли!?- A---0 proving the right-hand side of the desired relation. Also, let тг = {/zq, /и,...} 1>е any admissible policy. Wo have 7\vh > b b h trom which we obtain •'"N -lTl'\'h 7’c,v-i's + P‘'\ + h,>‘2 min<Xj)<i + h. Thus, for ;il) i. ^-тСжГ-'^.уОС'') > inin <*>(.;) + --L-- 1\ 4' I ,i - V 4* 1 and, taking the limit as N —» oo, we obtain C(') > min J Since тг is arbitrary, we obtain the ief't-hand side of the desired relation.
Sac. 4.5 .Voles. Sources. and Exercises 4.11 Use Prop. 1.1 to show that in I Ik1 policy Herat ion algorithm wc have lor all A-. a‘+i(. = a*;-+ ,(7# - // aS), where X - I - lim -У ' л • x A ' in Use this (act to show that il the Markov chain corresponding to // 1 1 has no transient states and /А1 is not, optimal, then ААч 1 < Afr. 4.12 (Policy Iteration for Linear-Quadratic Problems) The purpose of this problem is to show that policy iteration works for linear- quadratic problems (even though neither t he slate space nor t he control space are (inite). Consider the problem of Section 4.1 under the usual controllability, observability, and positive (semi)definitencss assumpt ions. Let Lu be an in x n matrix such that the matrix (.4 + RLu) is stable. (a) Show that the average cost per stage corresponding to the stationary policy /i", where /t°(.r) = Ltur, is of the form ./ о = /-;(w7<mr}. where K.. i" a positive -emidemme -vuilnotir matiix '.it i.-ivmg 'b (linear) equation Ku = (A + RLu)'Ku(A -t- IJLu) -I Q I- L'uKLu. (b) Let p,'(.r) = Li.r — (/< + R'h'uLi) 1ll'KuA.r in' the control function attaining the minimum for each .r in I he exp, ession luinpi llu I (. I.r I Un) /\,i( Lr I I ’ и) } Show that J I = K{in'KiH>} < where' A'i is sonic posit,ive semidelinite symmetric matrix. (c) Consider repeating the (policy iteration) process described in parts (a) and (b), thereby obtaining a sequence, of positive seiuidelmite symmetric matrices { Ki-}. Show that Ki- K, where /< is the optimal cost matrix of the problem.
236 Average Cost per Stage Problems Chu}). 1 4.13 (Alternative Analysis for the Unichain Case) The purpose of this exercise is to show how to extend the average cost problem analysis based on the conned ion with I he stochastic shortest path problem, which is given in Section 7.4 of Vol. 1. In particular, here this connection is used to show that there exists a solution (A, h) to Bellman’s equation Ac + /i = Th if every policy that is optimal within the class of stationary policies is unichain, without resorting to the use of Blackwell optimal policies (cf. Prop. 2.6). for this we will use the stochastic shortest path theory of Section 2.1, and from the present chapter, Prop. 2.1 and Prop. 2.5 (which is proved using a stochastic shortest path argument). Complete the details of the following proof: lor any stationary policy /z, let A/( be tin; average cost per stage as defined by Eq. (2.17), let z\ — ininMA/t, and let 1\1 = {/z | AM = A}. Suppose that there is a state .s that is simultaneously recurrent in the Markov chains corresponding to all //. G Л/. Similar to Section 7.4 in Vol. 1, consider an associated stochastic short,est path problem with states 1,2,...,/? and an artificial termination state / to which we move from state i, with transition probability pls(u). The stage4 costs in t his problem arc <7(7,?/,) — A for i = 1,...,//, and the transition probabilities from a state i to a state j я are the satiir as those of the original problem, while p,.«(u) is zero. Show that in this stochastic shortest path problem, every improper policy has infinite cost lor some initial state, and use this fact to conclude that if h(i) is the optimal cost starting at state i. — 1,... , n, then A and h satisfy Ac4-/i — Th. If there is no state s that is simultaneously recurrent for all /1, G Л/, select a Ji. G M such that there is no /z G Л/ whose recurrent class is a strict subset of the recurrent class of /7 (it is sufficient that. /7 has minimal number of recurrent states over all //, G A/), change the stage cost of all stales /' that, are not recurrent under /7 to //(/, u) 4- <, where t > 0, use as st ate s in the preceding argument any state that is recurrent under /7, and take ( —> (). 4.14 (Stochastic Shortest Path Solution Method) The purpose* of this exercise is to show how the average cost problem can be solved by solving a finite sequence of stochastic shortest path problems. As in Section 7.4 of Vol. 1, we. assume that a special state, by convention state //, is recurrent in the Markov chain corresponding to each stationary policy. Eor a stationary policy /z, let ('/,(/) : exported cost starting from i up to the first visit to zz, : expected number of stages starting from i up to the first visit to n. The proof of Prop. 2.5 shows that A/( — Ctl(n.)/Nfl(:i,). Let A* = minM Ад be the corresponding optimal cost. Consider the stochastic shortest path problem obtained by leaving un- changed all transition probabilities рч(ч) for j 7^ /?, by setting all transition probabilities p;n(«) U> 0, and by introducing an artificial termination state t to which we move from each state i with probability />jn(u). The expected stage
See. I.Г, Notes, Sources, tint! I'Nereises 237 cost at state i is <;(<, u) — A, where A is a scalar parameter. Let fc,, x(i) be. the cost of stationary policy p, for this stochastic shortest path problem, starting (rom state ?. and let h>,(/) = mm,, be the corresponding optimal cost.. (a) Show that lor all scalars z\ and A', we have Vx(i) = /Ь,.л'(') + (A' - А)Л',,(i), i = 1,.... ii. (b) Doline /|.д(г) = min i = 1,.... n. i‘ Show that Лд(/) is concave, monotonically decreasing, and piecewise linear as a function of Л, and that //a(??) = 0 if and only if X — A*. Figure 4.5.1 illustrates these relations. (c) Consider a one-dimensional search procedure that finds a zero of the function Лд(п) of A. This procedure brackets A* from above and below, mid is illustrated in Fig. 1.5.2. Show that (his procedure solves the average cost problem by solving a finite number of stochastic shortest path problems. Figure 4.5.1 delation of t lie costs of siatiomny policies in the aveiage cost problem and the associated st.ochasl iv shortest path problem.
-1 238 A ver.nge ( '<>sl per Stage Problems I Figure 4.5.2 One dinien.Hional itrral ive .search procedure to hud A such that /ix(n) — 0 [cf. Exercise -l.bJ(c)]. 1 Caeli value h.\(n) is obtaintsl by solving the associated stochastic shortest path problcii) with stage cost /;(/,//) — A. At the start of the typical iteration, we have scalars о and ,J such that n < A* < J, together with the corresponding nonzero values and ЛДа). We hud o' such t hat o' — n h,» (a) o' -/i hj(n)’ and wo calculate fo/(n). Lot // be such that pz — fo/W) - o” “ fo, W) ' Wo then replace о by o', and if /3' < f3, we also calculate /ty'(u) and wc replace /3 by d'. We then perform another iteration. The algorithm stops if either h.\ (n) = 0 or hj(n) o. 4.15 (Stochastic Shortest Path-Based Value Iteration [Ber95c]) Tile purpose of this exercise is to provide a value iteration method for average cost problems, which is based on the connection with tire stochastic shortest pat.li problem. Let the assumptions of Exercise 1.14 hold. Consider the algorithm (?) = iniii н £>.>(»X(./) -a\ i — 1,. .., n, A*" = Xk + bkhk+l(lt), where n is the special state that is rcciineiit for all imlcham policies and &k is a positive1, stepsize.
Sec. I..i S<hhces, and Exercises 239 (a ) 1 ttl crj )i'cl I he ;ilg<>riI hm a.s a value il eta I i< ai a Ip/ >ri I li 111 l< >r a sl< >wl v va t v ing stochastic shortest path problem of the type considered in Exercise 1.1 1. Given that, lor small <\ the iteration of // is last,er that the it- eration ol X, speculate on (he convergence properties of the algorithm. [It can he proved that there exists a positive constant A such that we have h (//) —* 0 and Aa —‘ A* if £ < y. <\ where' 3 is hoinc positive' constant. Another interesting possibility for which convergence’ can be proved is to select /)A as a constant divided by I phis the number of times that //A(n) has changed sign.) (b) I'Sc the error bounds ol Prop. 3.2 to justify the iteration 4.1G (^“beaming Based on Stochastic Shortest Paths) The purpose of this exercise is to provide a Q"^“‘lri|ing method for average cost problems, which is based on the value iteration method of Exercise 4.13. Let the assumptions of Exercise -1.14 hold. Speculate on t he convergence properties of the following (^-learning algorithm Q(i. u) Q(i, и) + ~i ii,j)+ iiiin Q(j, «') - Q(/, - A. / = I, . . . . H t: b'(/). A := A -1 h min Q(n, u'). where С(.лР = {^'"'’ t0 otherwise, ami j and </(/,>’•.?) are generin.ed from file pair (/,</) by simulation. Here the stepsizes 7 and A should be diminishing, but b should diminish ’‘faster" than 7 [for example 7 = ci/к and b = c-i/k log k, where <-| and C2 are positive constants and к is the number of iterations performed on the corresponding pair (i 11) or A],

5 Continuous-Time Problems I Contents 5.1. Uniforinization......................................p. 212 5.2. Queueing Applications................................p. 250 5.3. Semi-Markov Problems ................................p. 2G1 5.4. Notes, Sources, and Exercises........................p. 273
We have considered so far problems where the cost per stage does not depend on the time required for transition from one state to the next. Such problems have a natural discrete-time representation. On the other hand, there are situations where controls are applied at discrete times but cost is continuously accumulated. Furthermore, the time between successive control choices is variable; it may be random or it may depend on the current state and the choice of control. For example, in queueing systems state transitions correspond to arrivals or departures of customers, and the corresponding times of transition are random. This chapter primarily discusses problems of this type. We restrict attention to continuous-time systems with a finite or countable number of states. Many of the practical systems of this type involve the Poisson process, so for many of the examples discussed, we assume that the reader is familiar with this process at the level of textbooks such as [Ros83b] and [Gal95].

In Section 5.1, we concentrate on an important class of continuous-time optimization models of the discounted type, where the times between successive transitions have an exponential probability distribution. We show that by using a conversion process called uniformization, discounted versions of these models can be analyzed within the discrete-time framework discussed up to now.

In Section 5.2, we discuss applications of uniformization. We concentrate on queueing models arising in various communications and scheduling contexts.

In Section 5.3, we discuss more general continuous-time models, called semi-Markov problems, where the times between successive transitions need not have an exponential distribution.

5.1 UNIFORMIZATION

In this chapter, we restrict ourselves to continuous-time systems with a finite or a countable number of states. Here state transitions and control selections take place at discrete times, but the time from one transition to the next is random. In this section, we assume that:

1. If the system is in state i and control u is applied, the next state will be j with probability p_ij(u).

2. The time interval τ between the transition to state i and the transition to the next state is exponentially distributed with parameter ν_i(u); that is,

    P{ transition time interval ≤ τ | i, u } = 1 − e^{−ν_i(u) τ},

or equivalently, the probability density function of τ is

    p(τ) = ν_i(u) e^{−ν_i(u) τ},    τ ≥ 0.
Furthermore, τ is independent of earlier transition times, states, and controls. The parameters ν_i(u) are uniformly bounded in the sense that for some ν we have

    ν_i(u) ≤ ν,    for all i, u ∈ U(i).

The parameter ν_i(u) is referred to as the rate of transition associated with state i and control u. It can be verified that the corresponding average transition time is

    E{ τ | i, u } = ∫_0^∞ τ ν_i(u) e^{−ν_i(u) τ} dτ = 1/ν_i(u),

so ν_i(u) can be interpreted as the average number of transitions per unit time. The state and control at any time t are denoted by x(t) and u(t), respectively, and stay constant between transitions. We use the following notation:

    t_k: the time of occurrence of the kth transition. By convention, we denote t_0 = 0.
    τ_k = t_k − t_{k−1}: the kth transition time interval.
    x_k = x(t_k): we have x(t) = x_k for t_k ≤ t < t_{k+1}.
    u_k = u(t_k): we have u(t) = u_k for t_k ≤ t < t_{k+1}.

We consider a cost function of the form

    lim_{N→∞} E{ ∫_0^{t_N} e^{−βt} g( x(t), u(t) ) dt },                                     (1.1)

where g is a given function and β is a given positive discount parameter. Similar to discrete-time problems, an admissible policy is a sequence π = {μ_0, μ_1, ...}, where each μ_k is a function mapping states to controls with μ_k(i) ∈ U(i) for all states i. Under π, the control applied in the interval [t_k, t_{k+1}) is μ_k(x_k). Because states stay constant between transitions, the cost function of π is given by

    J_π(x_0) = Σ_{k=0}^∞ E{ ∫_{t_k}^{t_{k+1}} e^{−βt} g( x_k, μ_k(x_k) ) dt | x_0 }.

We first consider the case where the rate of transition is the same for all states and controls; that is,

    ν_i(u) = ν,    for all i, u.
A little thought shows that the problem is then essentially the same as the one where transition times are fixed, because the control cannot influence the cost of a stage by affecting the length of the next transition time interval. Indeed, the cost (1.1) corresponding to a sequence {(x_k, u_k)} can be expressed as

    Σ_{k=0}^∞ E{ ∫_{t_k}^{t_{k+1}} e^{−βt} g( x(t), u(t) ) dt } = Σ_{k=0}^∞ E{ ∫_{t_k}^{t_{k+1}} e^{−βt} dt } E{ g(x_k, u_k) }.       (1.2)

We have (using the independence of the transition time intervals)

    E{ e^{−β t_k} } = E{ e^{−β(τ_1 + ⋯ + τ_k)} } = ( E{ e^{−β τ_1} } )^k = ( ∫_0^∞ e^{−βτ} ν e^{−ντ} dτ )^k = ( ν/(β + ν) )^k = α^k.    (1.3)

The above expression for α yields (1 − α)/β = 1/(β + ν), so that from Eq. (1.3), we have

    E{ ∫_{t_k}^{t_{k+1}} e^{−βt} dt } = (1/β) ( E{ e^{−β t_k} } − E{ e^{−β t_{k+1}} } ) = (α^k − α^{k+1})/β = α^k/(β + ν).

From this equation together with Eq. (1.2), it follows that the cost of the problem can be expressed as

    Σ_{k=0}^∞ α^k E{ g(x_k, u_k)/(β + ν) }.

Thus we are faced in effect with an ordinary discrete-time problem where the expected total cost is to be minimized. The effect of the randomness of the transition times has been simply to appropriately scale the cost per stage.

To summarize, a continuous-time Markov chain problem with cost

    lim_{N→∞} E{ ∫_0^{t_N} e^{−βt} g( x(t), u(t) ) dt }

and rate of transition ν that is independent of the state and control is equivalent to a discrete-time Markov chain problem with discount factor

    α = ν/(β + ν),                                                                            (1.4)
and cost per stage given by

    g(i, u)/(β + ν).                                                                          (1.5)

In particular, Bellman's equation takes the form

    J(i) = min_{u ∈ U(i)} [ g(i, u)/(β + ν) + α Σ_j p_ij(u) J(j) ].                            (1.6)

In some problems, in addition to the cost (1.1), there is an extra expected stage cost ĝ(i, u) that is incurred at the time the control u is chosen at state i, and is independent of the length of the transition interval. In that case the expected stage cost (1.5) should be changed to

    ĝ(i, u) + g(i, u)/(β + ν),

and Bellman's equation (1.6) becomes

    J(i) = min_{u ∈ U(i)} [ ĝ(i, u) + g(i, u)/(β + ν) + α Σ_j p_ij(u) J(j) ].                  (1.9)

Example 1.1

A manufacturer of a specialty item processes orders in batches. Orders arrive according to a Poisson process with rate ν per unit time; that is, the successive interarrival intervals are independent and exponentially distributed with parameter ν. For each order there is a positive cost c per unit time that the order is unfilled. Costs are discounted with a discount parameter β > 0. The setup cost for processing the orders is K. Upon arrival of a new order, the manufacturer must decide whether to process the current batch or to wait for the next order. Here the state is the number i of unfilled orders. If the decision to fill the orders at state i is made, the cost is K and the next transition will be to state 1. Otherwise, there will be an average cost ci/(β + ν) up to the transition to the next state i + 1 [cf. Eq. (1.5)], as shown in Fig. 5.1.1. [Note that the setup cost K is incurred immediately after a decision to process the orders is made, so K is not discounted over the time interval up to the next
transition; cf. Eq. (1.9).] We are in effect faced with a discounted discrete-time problem with positive but unbounded cost per stage. (We could also consider an alternative model where an upper bound is placed on the number of unfilled orders. We would then have a discounted discrete-time problem with bounded cost per stage.) Since Assumption P is satisfied (cf. Section 3.1), Bellman's equation holds and takes the form

    J(i) = min[ K + α J(1),  ci/(β + ν) + α J(i + 1) ],    i = 1, 2, ...,                      (1.10)

where α = ν/(β + ν) is the effective discount factor [cf. Eq. (1.4)]. Reasoning from first principles, we see that J(i) is a monotonically nondecreasing function of i, so from Bellman's equation it follows that there exists a threshold i* such that it is optimal to process the orders if and only if their number exceeds i*.

Figure 5.1.1 Transition diagram for the continuous-time Markov chain of Example 1.1. The transitions associated with the first control (do not fill the orders) are shown with solid lines, and the transitions associated with the second control (fill the orders) are shown with broken lines.

Nonuniform Transition Rates

We now argue that the more general case where the transition rate ν_i(u) depends on the state and the control can be converted to the previous case of a uniform transition rate by using the trick of allowing fictitious transitions from a state to itself. Roughly, transitions that are slow on the average are speeded up, with the understanding that sometimes after a transition the state may stay unchanged. To see how this works, let ν be a new uniform transition rate with ν_i(u) ≤ ν for all i and u, and define new
transition probabilities by

    p̃_ij(u) = (ν_i(u)/ν) p_ij(u),                         if i ≠ j,
    p̃_ii(u) = (ν_i(u)/ν) p_ii(u) + 1 − ν_i(u)/ν,          if i = j.

We refer to this process as the uniform version of the original (see Fig. 5.1.2). We argue now that leaving state i at a rate ν_i(u) in the original process is statistically identical to leaving state i at the faster rate ν, but returning back to i with probability 1 − ν_i(u)/ν in the new process. Equivalently, transitions are real (lead to a different state) with probability ν_i(u)/ν ≤ 1. By statistical equivalence, we mean that for any given policy π, initial state x_0, time t, and state i, the probability P{ x(t) = i | x_0, π } is identical for the original process and its uniform version. We give a proof of this fact in Exercise 5.1 for the case of a finite number of states (see also [Lip75], [Ser79], and [Ros83b] for further discussion).

To summarize, we can convert a continuous-time Markov chain problem with transition rates ν_i(u), transition probabilities p_ij(u), and cost

    lim_{N→∞} E{ ∫_0^{t_N} e^{−βt} g( x(t), u(t) ) dt },

into a discrete-time Markov chain problem with discount factor

    α = ν/(β + ν),                                                                            (1.11)

where ν is a uniform transition rate chosen so that

    ν_i(u) ≤ ν,    for all i, u.

The transition probabilities are

    p̃_ij(u) = (ν_i(u)/ν) p_ij(u),                         if i ≠ j,
    p̃_ii(u) = (ν_i(u)/ν) p_ii(u) + 1 − ν_i(u)/ν,          if i = j,

and the cost per stage is

    g(i, u)/(β + ν),    for all i, j.

In particular, Bellman's equation takes the form

    J(i) = min_{u ∈ U(i)} [ g(i, u)/(β + ν) + α Σ_j p̃_ij(u) J(j) ],
which, after some calculation using the preceding definitions, can be written as

    J(i) = (1/(β + ν)) min_{u ∈ U(i)} [ g(i, u) + (ν − ν_i(u)) J(i) + ν_i(u) Σ_j p_ij(u) J(j) ].      (1.14)

Figure 5.1.2 Transforming a continuous-time Markov chain into its uniform version through the use of fictitious self-transitions. The figure shows the transition rates and probabilities of the original continuous-time chain together with the transition probabilities of its uniform version. The uniform version has a uniform transition rate ν, which is an upper bound for all transition rates ν_i(u) of the original, and transition probabilities p̃_ij(u) = (ν_i(u)/ν) p_ij(u) for i ≠ j, and p̃_jj(u) = (ν_j(u)/ν) p_jj(u) + 1 − ν_j(u)/ν for j = i. In the example of the figure we have p_ii(u) = 0 for all i and u.

In the case where there is an extra expected stage cost ĝ(i, u) that is incurred at the time the control u is chosen at state i, Bellman's equation
becomes [cf. Eq. (1.9)]

    J(i) = (1/(β + ν)) min_{u ∈ U(i)} [ (β + ν) ĝ(i, u) + g(i, u) + (ν − ν_i(u)) J(i) + ν_i(u) Σ_j p_ij(u) J(j) ].      (1.15)

Undiscounted and Average Cost Problems

When the discount parameter β is zero in the preceding problem formulation, we obtain a continuous-time version of the undiscounted cost problem considered in Chapter 3. If in addition the number of states is finite and there is a cost-free and absorbing state, we obtain a continuous-time analog of the stochastic shortest path problem considered in Chapter 2. However, when β = 0, it is unnecessary to resort to uniformization. It can be seen that the problem is essentially the same as the discrete-time problem with the same transition probabilities, but where the average transition cost at state i under u is the average cost per unit time g(i, u) multiplied with the expected length 1/ν_i(u) of the transition interval. Thus Bellman's equation has the form

    J(i) = min_{u ∈ U(i)} [ g(i, u)/ν_i(u) + Σ_j p_ij(u) J(j) ].                               (1.16)

After some calculation, it can be seen that the above equation can also be obtained from Eq. (1.14) by setting β = 0. In fact, for undiscounted problems, the preceding argument does not depend on the character of the probability distributions of the transition times. Regardless of whether these distributions are exponential or not, one simply needs to multiply g(i, u) with the average transition time corresponding to (i, u) and then treat the problem as if it were a discrete-time problem.

There is also a continuous-time version of the average cost per stage problem of Chapter 4. The cost function has the form

    lim_{T→∞} (1/T) E{ ∫_0^T g( x(t), u(t) ) dt }.

We will consider this problem in Section 5.3 in a more general context where the probability distributions of the transition times need not be exponential.
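To make the preceding machinery concrete, here is a minimal sketch (in Python) that solves Bellman's equation (1.10) of Example 1.1 by ordinary value iteration. The truncation of the countable state space at max_orders and the particular parameter values are assumptions made only for the computation.

```python
def batch_threshold(nu, beta, c, K, max_orders=200, num_iters=2000):
    """Value iteration for Bellman's equation (1.10) of Example 1.1.

    Solves J(i) = min[ K + alpha*J(1),  c*i/(beta + nu) + alpha*J(i+1) ] on
    the truncated state space {1, ..., max_orders} and returns the smallest
    number of unfilled orders at which processing the batch is optimal.
    """
    alpha = nu / (beta + nu)                 # effective discount factor, Eq. (1.4)
    J = [0.0] * (max_orders + 1)             # J[i] for i = 1, ..., max_orders; J[0] unused
    for _ in range(num_iters):
        J_new = J[:]
        for i in range(1, max_orders + 1):
            fill = K + alpha * J[1]
            wait = c * i / (beta + nu) + alpha * J[min(i + 1, max_orders)]
            J_new[i] = min(fill, wait)
        J = J_new
    for i in range(1, max_orders + 1):
        fill = K + alpha * J[1]
        wait = c * i / (beta + nu) + alpha * J[min(i + 1, max_orders)]
        if fill <= wait:
            return i
    return max_orders

# e.g. batch_threshold(nu=1.0, beta=0.1, c=1.0, K=5.0) returns the threshold level
```

Consistent with the monotonicity argument of Example 1.1, the "fill" branch becomes the minimizing one for all states beyond the returned threshold.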
250 C 'out inuous-Tiuic Problems Chap, .2 QUEUEING APPLICATIONS We now illustrate tin’ theory of the preece ling section through some' applications involving the control of queues. Example 2.1 (M/M/.1 Queue with Controlled Service Rate) Consider a single-server queueing system where customers arrive according loa I \»iss<>n process wit li i at e A. The service lime of a custoniei is exponen- tially distributed with parameter p (called the service rate). Service times of customers are independent and are also independent of customer interarrival times. 'The service rate p can be selected from a closed subset Л/ of an inter- val [(), /7] and can he changed at the times when a customer arrives or when a customer departs from the system. There is a cost. v(//) per unit time for using rate // and a waiting cost. r(/) per unit time when there are / customers in the system (waiting in queue or undergoing service). The idea is that one should be able to cut down on the customer waiting costs by choosing a faster service' rate, which presumably costs more. The problem, roughly, is Io se- lect t he service rate so t hat, t he service* cost is opt imally traded off wit h the customer waiting cost. We assume the following: 1. for some p € А/ wo have p > A. (In words, there is available a ser- vice rate that is fast, enough Io keep up with tlx? arrival rate, thereby maintaining the queue length bounded.) 2. The waiting cost function c is nomiegative, monotonically nondecrcas- ing, and “convex’’ in the sense с(/ -I- 2) - c(/ + 1) ’> <(/ | 1) - <(/), / - 0. I.... 3. The service rate cost fund ion 7 is nomiegative, and emit inuous on [0, //], with </(()) 0. The problem fits the* framework of this section. The state is the' num- ber of customers in the system, and the control is the choice of service rate following a customer arrival or departure. 'The transition rate at state I is , \ f A if / -0, f Д 4-// iO>l. The transition probabilities of the Markov chain and its uniform version for I he choice // - A 4 p are shown in big. 5.2.1. 4’he effective discount factor is 17 rv =------ /0^
Sec. ~).2 QueiK'ing ЛррИсп! ions 251 Transition probabilities for continuous-time chain (/I -//)/(Л 4 Ц) (P-//J/M 1-/0 (//- /0/(Л I /I) Transition probabilities forthe uniform version Figure 5.2.1 Cent inuous-tiuic Markov chain and uniform verxion lor Exam- ple 2.1 when the service rate is equal to p. The transition rates <>f tire original Markov chain are = A 4- // for states t > 1, and Oj(/0 — A for state 0. T he t ransit ion rate for t he uniform version is // — A 4- /7. and the cost per stage is The form of Bellman's equation is [cf. Eq. (1.1-1)] 7(0) = 7-L-(<•(()) + (;/- A)7(0) TA7(D) and for i -= 1,2,..., 7(f) = -2— mil, [Ф) -1) -|- (//- A - /t)7(i) + A7(i + 1)1. (2.1) О -f- I-' )tQ ,XI u J An optimal policy is to use at state i the service rate that miiiiinizes the expression on the right. Thus it is optimal t.o use at slate i the service rate //’(/) = arg min {</(//) - /гД(>)}. (2.2) where A(z) i.s the (inferential of the optimal cost Д(j) —- 7(f) — 7(; — 1), / = 1,2.... [When the minimum in Eq. (2.2) is attained l>y more than one service rate ii we choose by convention the smallest.] We will demonstrate shortly that Д(») is monotonically noudecrcasiny. It will then follow from Ecj. (2.2) (see
_ -1 252 C\)iit inuoiis-'l'inio Problems C’hep. 5 Fig. 5.2.2) that the. optimal service, rate p* (?) is mcmolcniieally no ndec leasing. 'riius, as the queue length increases, it. is optimal to use a fast.er service rate. To show that. A(?) is monotonically nondecreasing, we use the value iteration method to generate a sequence of functions Jk from t h<* starting function Jo(0-0, ? = (),!,... For k = 0, 1,..[cf. Eq. (2.1)], we have Лн(()) = -^(e(0) + (^-A)./A.(0)H AA(1)), anti for i — 1,2,..., Л I I («) = 7?~— mill [c(.) + </(//) + /*Л(' - 1) + (/' - A - li)-h(i) + A./fc(?'+ 1)]. (2.3) For A: = 0. I,... and i ~ 1,2,..., let Aa(,) = A(z)-.A(z-1). For completeness of notation, clclinc? also Да-(0) -- 0. From the theory of Section 3.1 (set? Prop. 1.7 of (hat section), we have .Д(?) —* ./(?) as A: —* oc. It, follows that we have lini Д/..(/) = Д('/), i = 1,2,... Therefore, it will siillice to show t hat A;, (?) is monotonically nondecreasing for every k. For this we use induction. The assertion is trivially true for A: = 0. Assuming that Дд.(г) is monotonically nondeercasing. we show that the same is true for Да । ](?'). l>et //(<)) - 0, 4(0 = arg mm ['/(/') - /'Дг(')], i = 1 ,‘2,... From Eq. (2.3) we have, for all ? — (),!,..., А/. i i (? 1- I )-•//. ( i (/ 4 1) - ./a । । (?) - ,4— (л' । и > 4/'4' i a) -0'4' । fi 4- /' + (" - А -//(/--i- 1))./,.(/ + 1) - I A./i. (/ I- 2) - <•(/) - '/(//'''(' i 1)) - /(' I- - О (2.4) - (m- A -/(' + 1))Л(')-А.Л.(/+ 1)) — -~jj-— (<•(/’ 1- 1) - c(i) + А Ад ('+ 2) 4- (zz — А) Ад. (i 4- 1) - /,'('+ I )(Aa 0 + I) — Ал(/))).
Sec. 5.2 C^iuuieing /\ />plic;i( ions 253 Similarly, we obtain, for i — 1,2,.... Aa-m(') < ('(') - '•(' - П -bU.(' -Hl) I (" - Л)Лл (/) -/(' - - I)))- Subtracting the last two inequalities, we obtain, for » — 1,2,..., + > (с(Н1Ьф)) - (Ф)-<•(,- 1)) + A ( А (i 4- 2) — A a: (/' + 1)) T (// — Л - // (/ -|- 1)) (Ад (/’ + 1) -- Aa (/)) + /U- 1)(Aa(0 - Aa.u- o). Using our convexity assumption on r(i), the fact n — A — //A (/ -4- 1) = Ji - pk(i + 1) > 0, and the induction hypothesis, we see that every term on the right-hand side of the preceding inequality is noimegative. T'herefore, Aa i- । (i 4- 1) > Aa- j. । (/) lor i = 1,2,... From Eq. (2.4) we can also show that А/гм(1) > 0 = Aa + i(0), and the induction proof is complete. Figure 5.2.2 Determining the optimal service rate at states i and (/'4- I) in Example 2.1. The optimal service rate p*(/) lends to increase as the system becomes more crowded (/ increases). To summarize, the optimal service rate /z* (z) is given by Eq. (2.2) and tends to become faster as the system becomes more crowded (/ increases).
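The preceding argument can also be checked numerically. The sketch below runs the value iteration (2.3) on a truncated state space and reports the service rates μ*(i) of Eq. (2.2); the truncation level, the finite menu of service rates standing in for the set M, and the specific cost functions are assumptions made only for the computation.

```python
def mm1_optimal_rates(lam, beta, rates, q, c, max_queue=100, num_iters=5000):
    """Value iteration (2.3) for the M/M/1 queue with controlled service rate.

    rates is a finite menu of available service rates, q(mu) is the service
    cost rate, and c(i) is the waiting cost rate.  Returns the optimal rate
    mu*(i) of Eq. (2.2) for i = 1, ..., max_queue.
    """
    nu = lam + max(rates)                    # uniform transition rate nu = lambda + mu_bar
    J = [0.0] * (max_queue + 1)
    for _ in range(num_iters):
        J_new = J[:]
        J_new[0] = (c(0) + (nu - lam) * J[0] + lam * J[1]) / (beta + nu)
        for i in range(1, max_queue + 1):
            nxt = min(i + 1, max_queue)      # crude truncation of the infinite queue
            J_new[i] = min(
                (c(i) + q(mu) + mu * J[i - 1] + (nu - lam - mu) * J[i] + lam * J[nxt])
                / (beta + nu)
                for mu in rates)
        J = J_new
    # mu*(i) = argmin over mu of  q(mu) - mu * Delta(i),  Delta(i) = J(i) - J(i-1)   [Eq. (2.2)]
    return [min(rates, key=lambda mu: q(mu) - mu * (J[i] - J[i - 1]))
            for i in range(1, max_queue + 1)]
```

With, for instance, c(i) = i**2, q(mu) = mu**2, and a rate menu containing a rate larger than λ, the returned rates are nondecreasing in i, in agreement with the monotonicity of Δ(i) established above.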
254 G'oiitintioiis-THiie I ’го/dems Example 2.2 (M/M/1 Queue with Controlled Arrival Rate) ('ou.sidcr t lie same queueing sysl cm as in the previous example with the dif- ference that, the service rate // is fixed, but the arrival rate A can bo contiollcd. We assume that, A is chosen from a closed subset A of an interval (0, A], and there is a cost </(A) per unit time. All other assumptions of Example 2.1 are also in cllocl. W hat \ve have* here is a problem of flow cont rol, whereby we want to trade oil opt anally t he cost for I hrol t ling the arrival process wit h I ho customer waiting cost. This problem is very similar to the one of Example 2.1. We choose* as uniform transition rate /' = A 4- //. and construct the uniform version of tin* Markov chain. Bellman's equation takes the form ./(()) = min[c(0) + </(A) -I ~ A)./(0) + >./( J)], /7 + /7 A( Л L J ./(/) = —miu[<:(/) + </(A) + iiJ(i - 1) + (n - A - //.).7(г) -I- A./(z + I)]. i /? E У X л L J An optimal policy is to use at stall' i the arrival rate A*(i) = arg min [./(A) + AA(z + 1)], (2.5) where, as before, A(i) is the diflerential of the optimal cost A(z) = ./(/)-./(z-1), г =1,2,... 1 As in Example 2.1, we can show that A(/‘) is monotonically noin Increasing; so from Eq. (2.5) we see that the optimal arrival rate ten,ds to decrease, as the system becomes iiidic crowded (z increases). I Example 2.3 (Priority Assignment and the //<• Rule) | Consider и queues that share a single server. There is a positive cost c, per unit time and per customer in each queue i. The service time of a customer 3 of queue z is exponentially distributed with parameter in, and all customer Я service I imes are independent. Assuming that we start with a given number Я of customers in each queue and no further arrivals occur, what is the optimal ja order for serving the customers? The cost here is Я when’ ,r,(/) is the number of customers in the ith queue at. time t, and [3 is a positive discount parameter.
.Sec. ~>.2 Qiieu< ‘i ng Applicnt ions 255 We lirst, construct t he uniform version of the problem. The const ruction is shown in big. 5.2.3. The discount factor is I //’ (2T) where uiax{/(, I. and the corresponding cost is where .r’k is tJi<* number of customers in the zt.li queue after the A’th transition (real or lid it ions). Transition probabilities for the / th queue when service is provided Transition probabilities for uniform version Figure 5.2.3 Continuous-time Markov chain and uniform version for t he /th queue of Example 2.3 when service is provided. The transition rate for the uniform version is p — max, {//,,}. We now rewrite the cost, in a way that is more convenient for analysis. The idea is to transform the problem from one of mmhuizmg wait ing costs to one of maximizing savings in waiting costs through customer service’. For к ~ 0.1,..., define У i if the A; th transition corresponds to a departure from queue g to if the A*th transition is fictitious.
256 ( \ ml iitiu >us-'I'itiK' Piulilt'iiis ('li.ij). 5 1 h’liot(' also = 0, J o : the init ial number of customers in queue i. Then tin' cost (2.7) can also he writ.ten as Therefore, instead of minimizing the cost (2.7), we can equivalently maximize oA/t’{c,A.}, k~u (2.S) where c,A. can be viewed as the sanim/s in wailim/ cast, rate obtained from the /.I Ii transition. IV’c now rrcoipiize, problem as a mu.ltiarm.ed bandit problem. The n queues can be viewed as separate projects. At each time, a nonempty queue, say z, is selected and served. Since a customer departure occurs with probability /i,/p, and a fictitious transition that leaves the state unchanged occurs with probability 1 — //.,///,. the corresponding expected reward is It is evident that the problem falls in the deteriorating case examined at the end of Section 1.5. Therefore, after each customer departure, it is optimal to serve the queue with maximum expected reward per stage (i.e., engage the project, with maximal index; cl. the end ol Section 1.5). Equivalent!y [cf. Eq. (2.!))], it i.s optimal to serve l.li.e nonempty queue i. for winch, pr, is marimtim. This policy is known as the pc rule. It. plays an important role in several other formulations of t.hrr priority assignment problem (see [BDM83], [Ilai75a], and [Ha.r75b]). We can view p,c, as the ratio of the waiting cost rate c, by the average time l//z, needed to serve a. customer. Therefore, the p.e rule amounts to serving the queue for which the savings in waiting cost rate per unit average service time are maximized.
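As a small sanity check of the μc rule, the following sketch computes the expected discounted waiting cost of serving a fixed batch of customers in a given order (no arrivals), using the fact that the expected discount accumulated by an exponential service at rate μ_i is μ_i/(β + μ_i), and compares all serving orders by brute force. The specific parameter values are assumptions.

```python
import itertools

def discounted_waiting_cost(order, mu, c, beta):
    """Expected discounted waiting cost when the customers are served one at a
    time in the given order; customer i has service rate mu[i] (exponential
    service time) and waiting cost rate c[i], and there are no arrivals."""
    cost, discount = 0.0, 1.0
    for i in order:
        discount *= mu[i] / (beta + mu[i])       # E{ e^{-beta * (departure time of i)} }
        cost += c[i] * (1.0 - discount) / beta   # c_i * E{ discounted time i spends in the system }
    return cost

# illustrative data (assumed): three customers with distinct mu*c products
mu, c, beta = [2.0, 1.0, 4.0], [3.0, 5.0, 1.0], 0.1
best_order = min(itertools.permutations(range(3)),
                 key=lambda order: discounted_waiting_cost(order, mu, c, beta))
mu_c_order = tuple(sorted(range(3), key=lambda i: -mu[i] * c[i]))
assert best_order == mu_c_order                  # brute force agrees with the mu-c rule
```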
.See, 5.2 Queueing .4 />р/и а t к ms 257 Example 2.4 (Routing Policies for a Two-Station System) Consider the system consisting of two queues shown in I’ig. 5.2.1. Customers arrive according t.o a Poisson process with rate A and arc routed upon arrival t.o one of the two queues. Service times are independent and exponentially distributed with parameter //1 in the first queue and //j in the second queue. The cost is whore ,J, ci. and c-j are given positive scalars, and .ri (/,) and denote the number of customers at time I in queues I and 2, respectively. .As earlier, we construct the uniform version of this problem with uni- form late = A4/n 4//2 (2.10) and the transition probabilities shown in big. 5.2.5. We take as state space* the set of pairs (/. j) of customers in queues 1 and 2. Bellman’s equation takes t he form /('../) = +//>./((г- l)+.j) 1)4) ' " (2.11) 4- minf./(z + 1, / 4- 1)], p 41/ 1 J where for any .j- we denote (4 = max(0, From this equation we see that, an optimal policy is to route an arriving customer to queue I if and only if the state (/, J) at the time of arrival belongs to the set ‘b’l = I -/(г 4 l.J) < .7(1. J + 1)}. 44 Figure 5.2.4 Queueing system of Example 2.4. The problem is to route each arriving customer to queue 1 or 2 so as to minimize the total average' discounted waiting cost.
258 Cont iuuon^-'i'iine 1 hobh'm.s Components of the transition rates when customers are routed to queue 1 Transition probabilities for uniform version Figure 5.2.5 (’outiiuioiis-t imr Markov chain and uniform version for Exam- ple 2. I when customers are rout cd to the first quenr. The slates arc t ho pairs of eictomer numbers in the two queues. "This optimal policy can be characterized better by some further analy- sis. Intuitively, one expects that optimal routing can be achieved by sending a customer to the queue that, is "less crowded" in some sense. It is therefore natural to conjecture t hat. if it. is optimal to route to t he first queue when the slate is (?.y). it must be optimal to do t he same when the first queue is even less crowded; that, is, the state is (i — /n, j) with in > 1. This is equivalent
Sec. 5.2 Queueing Applicnt ions 259 to saying that the set of stales Sj Uh which it. is optimal to route to the lirst queue is characterized by a monotonically nondecrcasing threshold, function F by means of •s = {(/,») |7= ЯП} (2.13) (see I?ig. 5.2.6). Accordingly, we call (he corresponding optimal policy a thicshold policy. Figure 5.2.6 Typical threshold policy characterized )>y a threshold function We will demonstrate (.he existence of a threshold optimal policy by showing that the functions A Cm) = J (г + 1, j) - 0, A2(m) = ./(?', j + 1) - J(i T 1, j) arc monotonically nondecrcasing in i for each (ixed j, and in j lor each fixed i, respectively. We will show t his property lor Д ।; tin1 proof for is analogous. Lt will be sullicient to show that, for all A- I), 1.(he functions A|(m) = .A(« + Cj) - Л(м + J) (2.Ы) are monotonically iiondecreasing in i for each fixed j, where Jk is gener- ated by Hie value iteration method starting from the zero (million; that is, Jк j. i (»,;/) = (7’./r)(/, j), where 7’ is the 1)1* mapping defining lie41 man’s equa- tion (2.11) and Jo = 0. This is true because —» J(i.j) for all i.j as к —» DC (Prop. 1.6 in Section 3.1). To prove that, Ai(i,j) has the desired
_ .л 260 Com muons-Time Problems Chap. -5 property, it is useful l<> first verify that Jk(i.j) is monotonically nondecreas- ing in i (or j) for lixed j (or z). This is simple t.o show by induction or by arguing from first principles using the fact that has a Axstagc optimal cost interpretation. Next we use Eqs. (2.11) and (2.14) to write (/O- ,.)Ap = n -e2 + -Л((?- I)',J 4 1)) 4 //2(aG I- 1, С/ - 1)') - Л GO)) C-O'r>) 4-A (min 4 2, j),.//,.(/4 l,j'4 I)] — min [A-(f + 1, j + 1), Л (E j + 2)]}. We now argue by induction. We have Д\'(г, j) — 0 for all (z, j). We assume that A*'(i, j) is monotonically nondecreasing in i for fixed j, and show that the same is true for A{ + '(z, j). This can be verified by showing that each of the tonus in the right-hand side! of Eq. (2.15) is monotonically nondecreasing in i for lixed j. Indeed, tin* lirsl. term is constant, and the. second and third terms art' seen to be monotonically nondecreasing in i using the induction hypothesis for the ease where. i,j > 0 and the earlier .shown fact that Jk(i.j) is monotonically nondecreasing in i for the case where i — 0 or j -- I). The last term on the right-hand side of Eq. (2.15) can be written as АT 1, j + 1) T min ^.A;(/’ 4- 2, j) — -A (/ + 1, j + 1), O] - Л(| 4 1,J 4- 1) - nuii[0, 4 2) - Jkij. 4 I,J +1)]} = A(inin [(), 4 1. j) ~ -IlCi 4 1, j 4 1)] 4 niiixjo, Л(7 4 1,./ 4 1) — 4 2)]) = A (min [(), A; (/ 4 1 ,.j)] 4- mux [(), A^ (i, j 4 1)]). .Since A{ (/4 I, j) mid Af(ij4 I) nre moilotonicu]l.y iioiidecn;n.siiig in i by the induction hypothesis, the same is true for the preceding expression. There- fore, each of the terms on the right-hand side of Eq. (2.15) is monotonically nondecrcasing in ?, and the induction proof is complete. Thus the existence ol an optimal threshold policy is established. There are a number of generalizations of the routing problem of this example that admit a similar analysis and for which there exist optimal poli- cies of the threshold type. For example, suppose that there are additional Poisson arrival processes with rates Xi and X-2 at queues I and 2, respectively. The existence of an optimal threshold policy can be shown by a nearly ver- batim repetition of our analysis. A luore substantive extension is obtained when I here is addit ional service capacity p t hat can be switched at the times of transition due to an arrival or service completion to serve a customer in queue 1 or 2. Then we can similarly prove that it is optimal to route to queue I if and only if (л./j G and Io switch the additional service capacity to queue 2 if and only if (/ + 1. j T 1) G -A . where 5i is given by Eq. (2.12) and is characterized by a threshold function as in F.<|. (2.13). For a. proof of this and further extensions, we refer to [1 laj8-lj. which generalizes and unifies several earlier results on the subject.
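The threshold structure can also be observed numerically. The following sketch runs value iteration on Bellman's equation (2.11) over a truncated grid of queue-length pairs and reads off, for each length of the second queue, the largest length of the first queue at which routing an arrival to queue 1 is still optimal; the grid size and parameter values are assumptions made only for the computation.

```python
def routing_thresholds(lam, mu1, mu2, c1, c2, beta, N=30, num_iters=2000):
    """Value iteration for Bellman's equation (2.11) of the two-queue routing problem.

    Computes J(i, j) on the truncated grid {0, ..., N}^2 and returns, for each
    length j of queue 2, the number F[j] such that routing an arrival to
    queue 1 is optimal exactly when i < F[j] (one way of representing the
    threshold function of Eq. (2.13))."""
    nu = lam + mu1 + mu2                         # uniform transition rate
    J = [[0.0] * (N + 1) for _ in range(N + 1)]
    for _ in range(num_iters):
        Jn = [row[:] for row in J]
        for i in range(N + 1):
            for j in range(N + 1):
                route1 = J[min(i + 1, N)][j]     # arrival sent to queue 1
                route2 = J[i][min(j + 1, N)]     # arrival sent to queue 2
                Jn[i][j] = (c1 * i + c2 * j
                            + mu1 * J[max(i - 1, 0)][j]
                            + mu2 * J[i][max(j - 1, 0)]
                            + lam * min(route1, route2)) / (beta + nu)
        J = Jn
    F = []
    for j in range(N):
        i = 0
        while i < N and J[i + 1][j] <= J[i][j + 1]:   # routing to queue 1 still optimal
            i += 1
        F.append(i)
    return F
```

For parameter choices with c_1/μ_1 and c_2/μ_2 of comparable size, the computed thresholds are monotone in the length of the other queue, in agreement with the analysis above.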
See. 5.3 Sciui-Mai kov /’r<>Z>/<4u.s 5.3 SEMI-MARKOV PROBLEMS 261 We now consider a more general version of the coni miioiis-time prob- lem ol Section 5.1. We still have a finite or a countable inn uber of st ales, but we replace transit ion probabilities with truii.si.l.ioi> dwlribnlions ii) that, for a given pair (/.a), specify t.he joint. distribution of the transition interval and the next state: Qlj(' , u) = n - tk < T, ,r/,.+ l = J I -r*. = i, Up — «}. We assume that, for all states i and J, and controls и t U(i), ii) is known and that the average transition time is finite: [ u) < oc- ./o Note that the transition distributions specify the ordinary transition prob- abilities via The difference from the model of Section 5.1 is that a) need not be an exponential distribution. Continuous-time problems with general transition distributions as de- scribed above are called semi-Markov pmblcitis because, for any given pol- icy, while at a transition time L the future of the system statistically de- pends only on the current state, at. other times it may depend in addition on the time elapsed since t he preceding transition. By contrast,, when the transition distributions are exponential, the future of the system depends only on its current state at all times. This is a consequence of the. so called тетотцк ss propcrt.ii of the exponential distribution. In our context, this property implies that, for any lime t. between (lie transition times /* and tjv-j-i, the additional time /,,4] — /. needed to effect the next transition is independent of the time f — t/, that, the system has been in the' current state [to see this, use the following generic calculation where n = / - m = /дч । - /, and и is the transition rate]. Thus, when the transition distributions are exponential, the state evolves in continuous time as a Markov process, but this need not be true for a more general distribution. ft
. .1 262 Con! inuous-Tinie Problems Chap. 5 Discounted Problems Let, us first consider a cost, function of the form Jim £u(l))dt|, (3.1) where I к is the completion time of the .Vl.li transition, and the function (/ and the positive discount, parameter d are given. 'L'he cost function of an admissible TV-stage policy rr = {po,/»i,. • • is given by = V E | f с''^</(.гА.,/ц,(х;,))Ь/.| ,ro - il . We .see that for all states i we have ./^(Z) - C(o/tl)(/)) 4-^2 (3.2) where .1^ '(./) is the (iV-1 )-stage cost of the policy /и = {/q,//.2 I<n-i } that is used after the first stage, and G(L<i) is the expected single stage cost corresponding to (i, n). This latter cost, is given by U.), or equivalent ly, since /J c = ()—<• G'(z,'/) - !/(/, «) £ a). (3.3) If we denote = f (~l'lrQlj(dT.,u), (3.4) J\y we see that Eq. (3.2) can be written in the form ^'(O = g'(/,//()(/)) + У2 "‘у (/'"(')) ^r'u), c3-5) which is similar t.o the corresponding equation for discounted discrete-time problems [we have in,Да) in place ol opl((n)]. The expression (З.Г>) motivates t he use of mappings 7’ and 7), that are similar t.o those used in Chapter t lor discounted problems. Let us define for a, function ./ and a stationary policy /t, ('/;,./)(!) = c;(i. + ^>//l/(p(>))JU), (3.6)
Sec. .5..'! S'euii-A linkin' Problems 263 (7’.7)(1) - niin »d'(d б'(/, и) 4- Г in,l(u)./(j) (3.7) Then by using Eq. (3.5), it, can be seen that the cost, function ,/„ ol an infinite horizon policy тг = {//о, /ц ,...} can be expressed as 4(i) = Jiin^ J? (/) = Jim^(7),ll7;,l • • 7),л. , ./u)(/), where Jo is tlie zero function [Jo(ii) = 0 for all /]. The cost of a stationary policy /I can be expressed as J„(z) hin (У1//Jo))/). N —»-x> The discounted cost analysis of Section 1.2 carries t hrough in its en- tirety, provided we assume that: (a) i/(i. 11) [and hence also <7(1, a)] is a. bounded funct ion of i and и. (b) The maxiinum over (i. u) of the sum £2 lll/j(n) is less than one; that is, p = max w,/(w) < 1. (3.8) j Under these circumstances, the mappings T and Tt, can be shown t о be contraction mappings with modulus of contraction p [compare also with Prop. 2.1 in Section 1.2]. Using this fact, analogs of Props. 2.1-2.3 of Section 1.2 can be readily shown. In particular, the optimal cost, function J* is the unique bounded solution of Bellman’s equation ./ = T.J or J(i) - min In addition, there are analogs of several of the computational methods of Section 1.3, including policy iterat ion and linear programming. What is happening here is that essentially we have the equivalent of a discrcte-tinie discounted problem where the discount fact,or depends on i and ii. In fact, in Exercise 1.12 of Chapter 1, a data transformation is given, which converts such a. problem to au ordinary discrete-time discounted problem where the discount, factor is the same for all i and u. With a little thought it can be seen that this data transformation i.s very similar to the uniformization process we discussed in Section 5.1. We note that for the contraction property p < 1 [cf. Eq. (3.S)] to hold, it i.s sullicient that, there exist, т > () and < > 0 such that, the transition time is greater than r with probability greater than < > 0; that is, we have for all i. and 11, G U(i). 1 “ zL = 527>{r - f I ' 3} ~ (3T) J J
In the case where the state space is countably infinite and the function g is not bounded, the mappings T and T_\mu are not contraction mappings, and a discounted cost analysis that parallels the one of Section 1.2 is not possible. Even in this case, however, analogs of the results of Section 3.1 can often be shown under appropriate conditions that parallel Assumptions P and N of that section.

We finally note that in some problems, in addition to the cost (3.1), there is an extra expected stage cost \hat g(i, u) that is incurred at the time the control u is chosen at state i, and is independent of the length of the transition interval. In that case the mappings T and T_\mu should be changed to

    (T_\mu J)(i) = \hat g\big(i, \mu(i)\big) + G\big(i, \mu(i)\big) + \sum_j m_{ij}\big(\mu(i)\big) J(j),

    (T J)(i) = \min_{u \in U(i)} \left[ \hat g(i, u) + G(i, u) + \sum_j m_{ij}(u) J(j) \right].    (3.10)

Another problem variation arises when the cost per unit time g depends on the next state j. In this problem formulation, once the system goes into state i, a control u \in U(i) is selected, the next state is determined to be j with probability p_{ij}(u), and the cost of the next transition is g(i, u, j)\tau_{ij}(u), where \tau_{ij}(u) is random with distribution Q_{ij}(\tau, u)/p_{ij}(u). In this case, G(i, u) should be defined by

    G(i, u) = \sum_j g(i, u, j) \int_0^\infty \frac{1 - e^{-\beta\tau}}{\beta}\, Q_{ij}(d\tau, u)

[cf. Eq. (3.3)] and the preceding development goes through without modification.

Example 3.1

Consider the manufacturer's problem of Example 1.1, with the only difference that the times between the arrivals of successive orders are uniformly distributed in a given interval [0, \tau_{max}] instead of being exponentially distributed. Let F and NF denote the choices of filling and not filling the orders, respectively. The transition distributions are

    Q_{ij}(\tau, F) = \begin{cases} \min(1, \tau/\tau_{max}) & \text{if } j = 1, \\ 0 & \text{otherwise}, \end{cases}

and

    Q_{ij}(\tau, NF) = \begin{cases} \min(1, \tau/\tau_{max}) & \text{if } j = i + 1, \\ 0 & \text{otherwise}. \end{cases}

The effective cost per stage G of Eq. (3.3) is given by

    G(i, F) = 0,\qquad G(i, NF) = \gamma i,
where \gamma is the scalar obtained from Eq. (3.3) for the uniform transition distribution on [0, \tau_{max}]. The scalars m_{ij}(u) of Eq. (3.4) that are nonzero are

    m_{i1}(F) = m_{i, i+1}(NF) = \alpha,\qquad \text{where}\qquad \alpha = \int_0^{\tau_{max}} \frac{e^{-\beta\tau}}{\tau_{max}}\, d\tau = \frac{1 - e^{-\beta\tau_{max}}}{\beta\tau_{max}}.

Bellman's equation has the form

    J(i) = \min\big[ K + \alpha J(1),\ \gamma i + \alpha J(i+1) \big],\qquad i = 1, 2, \ldots,

where K is the cost of filling the orders (cf. Example 1.1). As in Example 1.1, we can conclude that there exists a threshold i^* such that it is optimal to fill the orders if and only if their number i exceeds i^*.

Example 3.2 (Control of an M/D/1 Queue)

Consider a single server queue where customers arrive according to a Poisson process with rate \lambda. The service time of a customer is deterministic and is equal to 1/\mu, where \mu is the service rate provided. The arrival and service rates \lambda and \mu can be selected from given subsets \Lambda and M, and can be changed only when a customer departs from the system. There are costs q(\lambda) and r(\mu) per unit time for using rates \lambda and \mu, respectively, and there is a waiting cost c(i) per unit time when there are i customers in the system (waiting in queue or undergoing service). We wish to find a rate-setting policy that minimizes the total cost, when there is a positive discount parameter \beta.

This problem bears similarity with Examples 2.1 and 2.2 of Section 5.2. Note, however, that while in those examples the rates can be changed both when a customer arrives and when a customer departs, here the rates can be changed only when a customer departs. Because the service time distribution is not exponential, it is necessary to make this restriction in order to be able to use as state the number of customers in the system; if we allowed the arrival rate to also change when a customer arrives, the time already spent in service by the customer found in service by the arriving customer would have to be part of the state.

The transition distributions are given by

    Q_{0j}(\tau, \lambda, \mu) = \begin{cases} 1 - e^{-\lambda\tau} & \text{if } j = 1, \\ 0 & \text{otherwise}, \end{cases}
    \qquad
    Q_{ij}(\tau, \lambda, \mu) = \begin{cases} p_{ij}(\lambda, \mu) & \text{if } i \ge 1 \text{ and } \tau \ge 1/\mu, \\ 0 & \text{otherwise}, \end{cases}

where p_{ij}(\lambda, \mu) are the state transition probabilities. It can be seen that for i \ge 1 and j \ge i - 1, p_{ij}(\lambda, \mu) can be calculated as the probability that j - i + 1 arrivals will occur in an interval of length 1/\mu. In particular, we have

    p_{ij}(\lambda, \mu) = \begin{cases} \dfrac{e^{-\lambda/\mu} (\lambda/\mu)^{\,j-i+1}}{(j-i+1)!} & \text{if } j \ge i - 1,\ i \ge 1, \\ 0 & \text{otherwise}. \end{cases}
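The Poisson formula for p_{ij}(\lambda, \mu) above is easy to tabulate. The Python sketch below does this on a finite buffer {0, 1, ..., n_max}; the truncation, which lumps the overflow probability into the last state, and the particular rates used are assumptions made for the illustration, not part of Example 3.2.

import math

def mdone_transition_probs(lam, mu, n_max):
    # Embedded transition probabilities of Example 3.2 on a truncated state space.
    p = [[0.0] * (n_max + 1) for _ in range(n_max + 1)]
    p[0][1] = 1.0                              # from an empty system the next event is an arrival
    for i in range(1, n_max + 1):
        for j in range(i - 1, n_max):
            k = j - i + 1                      # number of Poisson(lam / mu) arrivals during one service
            p[i][j] = math.exp(-lam / mu) * (lam / mu) ** k / math.factorial(k)
        p[i][n_max] = 1.0 - sum(p[i][:n_max])  # lump the remaining probability into state n_max
    return p

rows = mdone_transition_probs(lam=0.8, mu=1.0, n_max=10)   # illustrative rates only
# Each row sums to one; row i > 0 starts at column i - 1, reflecting the single departure.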
Using the above formulas and Eqs. (3.3)-(3.4) and (3.6)-(3.7), one can write Bellman's equation and solve the problem as if it were essentially a discrete-time discounted problem.

Average Cost Problems

A natural cost function for the continuous-time average cost problem would be

    \lim_{T \to \infty} \frac{1}{T}\, E\left\{ \int_0^T g\big(x(t), u(t)\big)\, dt \right\}.    (3.11)

However, we will use instead the cost function

    \lim_{N \to \infty} \frac{1}{E\{t_N\}}\, E\left\{ \int_0^{t_N} g\big(x(t), u(t)\big)\, dt \right\},    (3.12)

where t_N is the completion time of the Nth transition. This cost function is also reasonable and turns out to be analytically convenient. We note, however, that the cost functions (3.11) and (3.12) are equivalent under the conditions of the subsequent analysis, although a rigorous justification of this is beyond our scope (see [Ros70], p. 52 and p. 160 for related analysis).

We assume that there are n states, denoted 1, \ldots, n, and that the control constraint set U(i) is finite for each state i. For each pair (i, u), we denote by G(i, u) the single stage expected cost corresponding to state i and control u. We have

    G(i, u) = g(i, u)\,\tau_i(u),    (3.13)

where \tau_i(u) is the expected value of the transition time corresponding to (i, u):

    \tau_i(u) = \sum_j \int_0^\infty \tau\, Q_{ij}(d\tau, u).    (3.14)

[If the cost per unit time g depends on the next state j, the expected transition cost G(i, u) should be defined by

    G(i, u) = \sum_j g(i, u, j) \int_0^\infty \tau\, Q_{ij}(d\tau, u),

and the following analysis and results go through without modification.] We assume throughout the remainder of this section that

    \tau_i(u) > 0,\qquad i = 1, \ldots, n,\ u \in U(i).    (3.15)
The cost function of an admissible policy \pi = \{\mu_0, \mu_1, \ldots\} is given by

    J_\pi(i) = \lim_{N \to \infty} \frac{1}{E\{t_N \mid x_0 = i, \pi\}}\, E\left\{ \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} g\big(x_k, \mu_k(x_k)\big)\, dt \;\Big|\; x_0 = i \right\}.

Our earlier analysis of the discrete-time average cost problem in Chapter 4 suggests that under assumptions similar to those of Section 4.2, the cost J_\mu(i) of a stationary policy \mu, as well as the optimal average cost per stage J^*(i), are independent of the initial state i. Indeed, we will see that the character of the solution of the problem is determined by the structure of the embedded Markov chain, which is the controlled discrete-time Markov chain whose transition probabilities are

    p_{ij}(u) = \lim_{\tau \to \infty} Q_{ij}(\tau, u).

In particular, we will show that J_\mu(i) and J^*(i) are independent of i if and only if the same is true for the embedded Markov chain problem. For example, we will show that J_\mu(i) and J^*(i) are independent of i if all stationary policies \mu are unichain; that is, the Markov chain with transition probabilities p_{ij}(\mu(i)) has a single recurrent class.

We will also show that Bellman's equation for average cost semi-Markov problems resembles the corresponding discrete-time equation, and takes the form

    h(i) = \min_{u \in U(i)} \left[ G(i, u) - \lambda\,\tau_i(u) + \sum_{j=1}^n p_{ij}(u)\, h(j) \right].    (3.16)

As a special case, when \tau_i(u) = 1 for all (i, u), we obtain the corresponding discrete-time equation of Chapter 4.

We illustrate Bellman's equation (3.16) for the case of a single unichain policy with the stochastic shortest path argument that we used to prove Prop. 2.5 in Section 4.2. Consider a unichain policy \mu, and without loss of generality assume that state n is a recurrent state in the Markov chain corresponding to \mu. For each state i \ne n, let C_i and T_i be the expected cost and the expected time, respectively, up to reaching state n for the first time starting from i. Let also C_n and T_n be the expected cost and the expected time, respectively, up to returning to n for the first time starting from n. We can view the C_i as the costs corresponding to \mu in a stochastic shortest path problem where n is a termination state and the stage costs are G\big(i, \mu(i)\big). Since \mu is a proper policy for this problem, from Prop. 1.1 in Section 2.1, we have that the scalars C_i solve uniquely the system of equations

    C_i = G\big(i, \mu(i)\big) + \sum_{j=1,\, j \ne n}^{n} p_{ij}\big(\mu(i)\big)\, C_j,\qquad i = 1, \ldots, n.    (3.17)
Similarly, we can view the T_i as the costs corresponding to \mu in a stochastic shortest path problem where n is a termination state and the stage costs are \tau_i\big(\mu(i)\big), so that the T_i solve uniquely the system of equations

    T_i = \tau_i\big(\mu(i)\big) + \sum_{j=1,\, j \ne n}^{n} p_{ij}\big(\mu(i)\big)\, T_j,\qquad i = 1, \ldots, n.    (3.18)

Denote

    \lambda_\mu = \frac{C_n}{T_n}.    (3.19)

Multiplying Eq. (3.18) by \lambda_\mu and subtracting it from Eq. (3.17), we obtain for all i = 1, \ldots, n,

    C_i - \lambda_\mu T_i = G\big(i, \mu(i)\big) - \lambda_\mu\,\tau_i\big(\mu(i)\big) + \sum_{j=1,\, j \ne n}^{n} p_{ij}\big(\mu(i)\big)\,(C_j - \lambda_\mu T_j).

By defining

    h_\mu(i) = C_i - \lambda_\mu T_i,\qquad i = 1, \ldots, n,    (3.20)

and by noting that from Eq. (3.19) we have h_\mu(n) = 0, we obtain for all i = 1, \ldots, n,

    h_\mu(i) = G\big(i, \mu(i)\big) - \lambda_\mu\,\tau_i\big(\mu(i)\big) + \sum_{j=1}^{n} p_{ij}\big(\mu(i)\big)\, h_\mu(j),    (3.21)

which is Bellman's equation (3.16) for the case of a single stationary policy \mu.

We have not yet proved that the scalar \lambda_\mu of Eq. (3.19) is the average cost per stage corresponding to \mu. This fact will follow from the following proposition, which parallels Prop. 2.1 in Section 4.2 and shows that if Bellman's equation (3.16) has a solution (\lambda, h), then the optimal average cost is equal to \lambda and is independent of the initial state.

Proposition 3.1: If a scalar \lambda and an n-dimensional vector h satisfy

    h(i) = \min_{u \in U(i)} \left[ G(i, u) - \lambda\,\tau_i(u) + \sum_{j=1}^n p_{ij}(u)\, h(j) \right],\qquad i = 1, \ldots, n,    (3.22)

then \lambda is the optimal average cost per stage J^*(i) for all i,

    \lambda = \min_\pi J_\pi(i) = J^*(i),\qquad i = 1, \ldots, n.    (3.23)

Furthermore, if \mu^*(i) attains the minimum in Eq. (3.22) for each i, the stationary policy \mu^* is optimal; that is, J_{\mu^*}(i) = \lambda for all i.
Proof: For any \mu, consider the mapping T_\mu : \Re^n \mapsto \Re^n given by

    (T_\mu h)(i) = G\big(i, \mu(i)\big) + \sum_{j=1}^n p_{ij}\big(\mu(i)\big)\, h(j),\qquad i = 1, \ldots, n,

and the vector \tau(\mu) and matrix P_\mu given by

    \tau(\mu) = \begin{pmatrix} \tau_1\big(\mu(1)\big) \\ \vdots \\ \tau_n\big(\mu(n)\big) \end{pmatrix},
    \qquad
    P_\mu = \begin{pmatrix} p_{11}\big(\mu(1)\big) & \cdots & p_{1n}\big(\mu(1)\big) \\ \vdots & & \vdots \\ p_{n1}\big(\mu(n)\big) & \cdots & p_{nn}\big(\mu(n)\big) \end{pmatrix}.

Let \pi = \{\mu_0, \mu_1, \ldots\} be any admissible policy and N be a positive integer. We have, from Eq. (3.22),

    T_{\mu_{N-1}} h \ge \lambda\,\tau(\mu_{N-1}) + h.

By applying T_{\mu_{N-2}} to both sides of this relation, and by using the monotonicity of T_{\mu_{N-2}} and Eq. (3.22), we see that

    T_{\mu_{N-2}} T_{\mu_{N-1}} h \ge T_{\mu_{N-2}}\big(\lambda\,\tau(\mu_{N-1}) + h\big) = \lambda P_{\mu_{N-2}}\tau(\mu_{N-1}) + T_{\mu_{N-2}} h \ge \lambda P_{\mu_{N-2}}\tau(\mu_{N-1}) + \lambda\,\tau(\mu_{N-2}) + h.

Continuing in the same manner, we finally obtain

    T_{\mu_0} T_{\mu_1} \cdots T_{\mu_{N-1}} h \ge \lambda\, t_N(\pi) + h,    (3.24)

where t_N(\pi) is given by

    t_N(\pi) = P_{\mu_0} \cdots P_{\mu_{N-2}}\tau(\mu_{N-1}) + P_{\mu_0} \cdots P_{\mu_{N-3}}\tau(\mu_{N-2}) + \cdots + \tau(\mu_0).

Note that the ith component of the vector t_N(\pi) is E\{t_N \mid x_0 = i, \pi\}, the expected value of the completion time of the Nth transition when the initial state is i and \pi is used. Note also that equality holds in Eq. (3.24) if \mu_k(i) attains the minimum in Eq. (3.22) for all k and i.

It can be seen that

    (T_{\mu_0} \cdots T_{\mu_{N-1}} h)(i) = E\left\{ h(x_N) + \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} g\big(x_k, \mu_k(x_k)\big)\, dt \;\Big|\; x_0 = i, \pi \right\}.

Using this relation in Eq. (3.24) and dividing by E\{t_N \mid x_0 = i, \pi\}, we obtain for all i

    \frac{E\{h(x_N) \mid x_0 = i, \pi\}}{E\{t_N \mid x_0 = i, \pi\}} + \frac{E\left\{ \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} g\big(x_k, \mu_k(x_k)\big)\, dt \;\big|\; x_0 = i, \pi \right\}}{E\{t_N \mid x_0 = i, \pi\}} \ge \lambda + \frac{h(i)}{E\{t_N \mid x_0 = i, \pi\}}.
Taking the limit as N \to \infty and using the fact \lim_{N \to \infty} E\{t_N \mid x_0 = i, \pi\} = \infty [cf. Eq. (3.15)], we see that

    J_\pi(i) = \lim_{N \to \infty} \frac{E\left\{ \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} g\big(x_k, \mu_k(x_k)\big)\, dt \;\big|\; x_0 = i, \pi \right\}}{E\{t_N \mid x_0 = i, \pi\}} \ge \lambda,\qquad i = 1, \ldots, n,

with equality if \mu_k(i) attains the minimum in Eq. (3.22) for all k and i. Q.E.D.

By combining Prop. 3.1 with Eq. (3.21), we obtain the following:

Proposition 3.2: Let \mu be a unichain policy. Then:

(a) There exist a scalar \lambda_\mu and a vector h_\mu such that

    J_\mu(i) = \lambda_\mu,\qquad i = 1, \ldots, n,    (3.25)

and

    h_\mu(i) = G\big(i, \mu(i)\big) - \lambda_\mu\,\tau_i\big(\mu(i)\big) + \sum_{j=1}^n p_{ij}\big(\mu(i)\big)\, h_\mu(j),\qquad i = 1, \ldots, n.    (3.26)

(b) Let t be a fixed state. The system of the n + 1 linear equations

    h(i) = G\big(i, \mu(i)\big) - \lambda\,\tau_i\big(\mu(i)\big) + \sum_{j=1}^n p_{ij}\big(\mu(i)\big)\, h(j),\qquad i = 1, \ldots, n,    (3.27)

    h(t) = 0,    (3.28)

in the n + 1 unknowns \lambda, h(1), \ldots, h(n) has a unique solution.

Proof: Part (a) follows from Prop. 3.1 and Eq. (3.21). The proof of part (b) is identical to the proof of Prop. 2.5(b) in Section 4.2. Q.E.D.

To establish conditions under which there exists a solution (\lambda, h) to Bellman's equation (3.22), we formulate a corresponding discrete-time average cost problem. Let \gamma be any scalar such that for all i and u \in U(i)

    0 < \frac{\gamma}{\tau_i(u)} < 1.

Define also for all i and u \in U(i)

    \bar p_{ij}(u) = \begin{cases} \dfrac{\gamma}{\tau_i(u)}\, p_{ij}(u) & \text{if } j \ne i, \\[1ex] \dfrac{\gamma}{\tau_i(u)}\, p_{ij}(u) + 1 - \dfrac{\gamma}{\tau_i(u)} & \text{if } j = i. \end{cases}    (3.29)
It can be seen that we have for all i and j \ne i,

    \bar p_{ij}(u) > 0 \ \text{if and only if}\ p_{ij}(u) > 0.    (3.30)

We view the \bar p_{ij}(u) as the transition probabilities of the discrete-time average cost problem whose expected stage cost corresponding to (i, u) is

    \bar G(i, u) = \frac{G(i, u)}{\tau_i(u)}.    (3.31)

We call this the auxiliary discrete-time average cost problem. The following proposition shows that this problem is essentially equivalent with our original semi-Markov average cost problem.

Proposition 3.3: If the scalar \lambda and the vector \bar h satisfy

    \bar h(i) = \min_{u \in U(i)} \left[ \frac{G(i, u)}{\tau_i(u)} - \lambda + \sum_{j=1}^n \bar p_{ij}(u)\, \bar h(j) \right],\qquad i = 1, \ldots, n,    (3.32)

then \lambda and the vector h with components

    h(i) = \gamma\, \bar h(i),\qquad i = 1, \ldots, n,    (3.33)

satisfy Bellman's equation

    h(i) = \min_{u \in U(i)} \left[ G(i, u) - \lambda\,\tau_i(u) + \sum_{j=1}^n p_{ij}(u)\, h(j) \right],\qquad i = 1, \ldots, n,    (3.34)

for the semi-Markov average cost problem.

Proof: By substituting Eqs. (3.29), (3.31), and (3.33) in Eq. (3.32), we obtain after a straightforward calculation

    \min_{u \in U(i)} \frac{1}{\tau_i(u)} \left[ G(i, u) - \lambda\,\tau_i(u) + \sum_{j=1}^n p_{ij}(u)\, h(j) - h(i) \right] = 0,\qquad i = 1, \ldots, n.

This implies that the minimum of the expression within brackets in the right-hand side above is zero, which is equivalent to Bellman's equation (3.34). Q.E.D.
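The constructions in Eqs. (3.17)-(3.20) and (3.29)-(3.33) translate directly into small numerical routines. The Python sketch below is illustrative only: it assumes the model data are given as NumPy arrays, that every state has the same number of controls, that the unichain conditions discussed in this section hold, and it picks \gamma = 0.5 \min_{i,u} \tau_i(u) so that 0 < \gamma/\tau_i(u) < 1; the function names are made up for the example. The first routine evaluates a fixed policy \mu through the stochastic shortest path systems (3.17)-(3.18); the second solves Bellman's equation (3.34) by applying relative value iteration (one of the Chapter 4 methods) to the auxiliary problem of Prop. 3.3.

import numpy as np

def evaluate_unichain_policy(P, G, tau):
    # P[i, j] : embedded transition probabilities p_ij(mu(i)) of the policy mu
    # G[i]    : expected stage costs G(i, mu(i));  tau[i] : expected times tau_i(mu(i))
    # Assumes the last state n is recurrent under mu, as in the derivation above.
    n = len(G)
    P_stop = P.copy()
    P_stop[:, -1] = 0.0                       # state n acts as the termination state
    A = np.eye(n) - P_stop
    C = np.linalg.solve(A, G)                 # Eq. (3.17): expected cost up to reaching n
    T = np.linalg.solve(A, tau)               # Eq. (3.18): expected time up to reaching n
    lam = C[-1] / T[-1]                       # Eq. (3.19): average cost per unit time
    return lam, C - lam * T                   # Eq. (3.20): differential costs, h_mu(n) = 0

def solve_semi_markov_average_cost(G, tau, p, iters=5000):
    # G[i, u], tau[i, u], p[i, u, j] : semi-Markov data with tau > 0 everywhere.
    n, _, _ = p.shape
    gamma = 0.5 * tau.min()                   # guarantees 0 < gamma / tau_i(u) < 1
    ratio = gamma / tau
    p_bar = ratio[:, :, None] * p             # Eq. (3.29), the j != i part
    idx = np.arange(n)
    p_bar[idx, :, idx] += 1.0 - ratio         # Eq. (3.29), extra self-transition mass
    G_bar = G / tau                           # Eq. (3.31)
    h_bar = np.zeros(n)
    for _ in range(iters):
        Th = (G_bar + p_bar @ h_bar).min(axis=1)
        h_bar = Th - Th[-1]                   # relative value iteration on the auxiliary problem
    return Th[-1], gamma * h_bar              # lambda and h of Eqs. (3.33)-(3.34)

Choosing \gamma strictly smaller than every \tau_i(u) gives each state of the auxiliary problem a positive self-transition probability, which makes the auxiliary chains aperiodic and helps the relative value iteration settle.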
Note that in view of Eq. (3.30), the auxiliary discrete-time average cost problem and the semi-Markov average cost problem have the same probabilistic structure. In particular, if all stationary policies are unichain for one problem, the same is true for the other. Thus, the results and algorithms of Sections 4.2 and 4.3, when applied to the auxiliary discrete-time problem, yield results and algorithms for the semi-Markov problem. For example, value iteration, policy iteration, and linear programming can be applied to the auxiliary problem in order to solve the semi-Markov problem.

We state a partial analog of Prop. 2.6 from Section 4.2.

Proposition 3.4: Consider the semi-Markov average cost problem, and assume either one of the following two conditions:

(1) Every policy that is optimal within the class of stationary policies is unichain.

(2) For every two states i and j, there exists a stationary policy \pi (depending on i and j) such that, for some k,

    P\{x_k = j \mid x_0 = i, \pi\} > 0.

Then the optimal average cost per stage has the same value \lambda for all initial states i. Furthermore, \lambda together with a vector h satisfies Bellman's equation (3.34) for the semi-Markov average cost problem.

Proof: By Prop. 2.6 in Section 4.2, under either one of the conditions stated, Bellman's equation (3.32) for the auxiliary discrete-time average cost problem has a solution (\lambda, \bar h), from which a solution to Bellman's equation (3.34) can be extracted according to Prop. 3.3. Q.E.D.

Example 3.3

Consider the average cost version of the manufacturer's problem of Example 3.1. Here we have

    \tau_i(F) = \tau_i(NF) = \frac{\tau_{max}}{2},\qquad G(i, F) = K,\qquad G(i, NF) = \frac{c\, i\, \tau_{max}}{2},

where F and NF denote the decisions to fill and not fill the orders, respectively [K and c are the fill cost and the waiting cost per unfilled order and per unit time of Example 1.1]. Bellman's equation takes the form

    h(i) = \min\left[ K - \lambda\frac{\tau_{max}}{2} + h(1),\ \frac{c\, i\, \tau_{max}}{2} - \lambda\frac{\tau_{max}}{2} + h(i+1) \right],\qquad i = 1, 2, \ldots.
We leave it as an exercise for the reader to show that there exists a threshold i^* such that it is optimal to fill the orders if and only if i exceeds i^*.

Example 3.4 [LiR71]

Consider a person providing a certain type of service to customers. Potential customers arrive according to a Poisson process with rate r; that is, the customers' interarrival times are independent and exponentially distributed with parameter r. Each customer offers one of n pairs (m_i, T_i), i = 1, \ldots, n, where m_i is the amount of money offered for the service and T_i is the average amount of time that will be required to perform the service. Successive offers are independent, and offer (m_i, T_i) occurs with probability p_i, where \sum_i p_i = 1. An offer may be rejected, in which case the customer leaves, or may be accepted, in which case all offers that arrive while the customer is being served are lost. The problem is to determine the acceptance-rejection policy that maximizes the service provider's average income per unit time.

Let us denote by i the state corresponding to the offer (m_i, T_i), and let A and R denote the accept and reject decision, respectively. We have

    \tau_i(A) = T_i + \frac{1}{r},\qquad \tau_i(R) = \frac{1}{r},\qquad G(i, A) = -m_i,\qquad G(i, R) = 0,\qquad p_{ij}(A) = p_{ij}(R) = p_j.

Bellman's equation is given by

    h(i) = \min\left[ -m_i - \lambda\left(T_i + \frac{1}{r}\right) + \sum_{j=1}^n p_j h(j),\ -\frac{\lambda}{r} + \sum_{j=1}^n p_j h(j) \right].

It follows that an optimal policy is to accept offer (m_i, T_i) if

    \frac{m_i}{T_i} \ge -\lambda,

where -\lambda is the optimal average income per unit time.

5.4 NOTES, SOURCES, AND EXERCISES

The idea of using uniformization to convert continuous-time stochastic control problems involving Markov chains into discrete-time problems gained wide attention following [Lip75]; see also [BeR87].
Control of queueing systems has been researched extensively. For additional material on the problem of control of arrival rate or service rate (cf. Examples 2.1 and 2.2 in Section 5.2), see [BWN92], [CoR87], [CoV84], [RVW82], [Sob82], [StP74], and [Sti85]. For more on priority assignment and routing (cf. Examples 2.3, 2.4 in Section 5.2), see [BDM83], [BaD81], [BeT89b], [BhE91], [CoV84], [Har75a], [Har75b], [PaK81], [SuC91], and [AyR91], [CrC91], [EVW80], [EpV89], [Haj84], [LiK84], [TSC92], [ViE88], respectively.

Semi-Markov decision models were introduced in [Jew63] and are also discussed in [Ros70].

EXERCISES

5.1 (Proof of Validity of Uniformization)

Complete the details of the following argument, showing the validity of the uniformization procedure for the case of a finite number of states i = 1, \ldots, n. We fix a policy, and for notational simplicity we do not show the dependence of transition rates on the control. Let p(t) be the row vector with coordinates

    p_i(t) = P\{x(t) = i \mid x_0\},\qquad i = 1, \ldots, n.

We have

    \frac{dp(t)}{dt} = p(t) A,

where p(0) is the row vector with ith coordinate equal to one if x_0 = i and zero otherwise, and the matrix A has elements

    a_{ij} = \begin{cases} \nu_i p_{ij} & \text{if } j \ne i, \\ -\nu_i(1 - p_{ii}) & \text{if } j = i, \end{cases}

with \nu_i and p_{ij} the transition rates and embedded transition probabilities of Section 5.1. From this we obtain

    p(t) = p(0)\, e^{At}.

Consider the transition probability matrix B of the uniform version,
    B = I + \frac{1}{c} A,

where c \ge \nu_i, i = 1, \ldots, n. Consider also the following identity:

    e^{At} = e^{c(B - I)t} = e^{-ct} \sum_{k=0}^{\infty} \frac{(ct)^k B^k}{k!}.

Use these relations to write

    p(t) = p(0) \sum_{k=0}^{\infty} B^k P(k, t),

where

    P(k, t) = \frac{e^{-ct}(ct)^k}{k!} = \text{Prob}\{k \text{ transitions occur between } 0 \text{ and } t \text{ in the uniform Markov chain}\}.

Verify that for i = 1, \ldots, n, we have

    p_i(t) = \text{Prob}\{x(t) = i \text{ in the uniform Markov chain}\}.

(A small numerical check of these relations is sketched after Exercise 5.3 below.)

5.2 Consider the M/M/1 queueing problem with variable service rate (Example 2.1 in Section 5.2). Assume that no arrivals are allowed (\lambda = 0), and one can either serve a customer at rate \mu or refuse service (M = \{0, \mu\}). Let the cost rates for customer waiting and service be c(i) = ci and q(\mu), respectively, with q(0) = 0.

(a) Show that an optimal policy is to always serve an available customer if q(\mu) \le c\mu/\beta, where \beta is the discount parameter, and to always refuse service otherwise.

(b) Analyze the problem when the cost rate for waiting is instead c(i) = ci^2.

5.3 A person has an asset to sell for which she receives offers that can take one of n values. The times between successive offers are random, independent, and identically distributed with a given distribution. Find the offer acceptance policy that maximizes E\{\alpha^T s\}, where T is the time of sale, s is the sale price, and \alpha \in (0, 1) is a discount factor.
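The following Python fragment carries out the check asked for in Exercise 5.1 on a small example. The three-state rates, the embedded probabilities, the uniform rate c = 2.5, and the truncation of the Poisson sum at k = 80 are all illustrative choices, and scipy.linalg.expm is used only as an independent way of computing e^{At}.

import numpy as np
from math import factorial
from scipy.linalg import expm     # matrix exponential, used as the reference answer

# Hypothetical 3-state chain: transition rates nu_i and embedded probabilities p_ij.
nu = np.array([1.0, 2.0, 0.5])
P = np.array([[0.0, 0.7, 0.3],
              [0.5, 0.0, 0.5],
              [1.0, 0.0, 0.0]])
A = nu[:, None] * (P - np.eye(3))  # generator: a_ij = nu_i p_ij (j != i), a_ii = -nu_i (1 - p_ii)
c = 2.5                            # uniform rate, c >= nu_i for all i
B = np.eye(3) + A / c              # transition matrix of the uniform version

t = 1.7
direct = expm(A * t)               # e^{At}, so that p(t) = p(0) e^{At}
poisson_sum = sum(np.linalg.matrix_power(B, k) * np.exp(-c * t) * (c * t) ** k / factorial(k)
                  for k in range(80))   # truncated sum of Exercise 5.1
print(np.max(np.abs(direct - poisson_sum)))   # should be at the level of roundoff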
5.4 Analyze the priority assignment problem of Example 2.3 in Section 5.2 within the semi-Markov context of Section 5.3, assuming that the customer service times are independent but not exponentially distributed. Consider both the discounted and the average cost cases.

5.5 An unemployed worker receives job offers according to a Poisson process with rate r, which she may accept or reject. The offered salary (per unit time) takes one of n possible values w_1, \ldots, w_n with given probabilities, independently of preceding offers. If she accepts an offer at salary w_i, she keeps the job for a random amount of time that has expected value \bar t_i. If she rejects the offer, she receives unemployment compensation c (per unit time) and is eligible to accept future offers. Solve the problem of maximizing the worker's average income per unit time.

5.6 Consider a single server queueing system where the server may be either on or off. Customers arrive according to a Poisson process with rate \lambda, and their service times are independent, identically distributed with a given distribution. Each time a customer departs, the server may switch from on to off at a fixed cost C_0 or from off to on at a fixed cost C_1. There is a cost c per unit time per customer residing in the system. Analyze this problem as a semi-Markov problem for the discounted and the average cost cases. In the latter case, assume that the queue has limited storage, and that customers arriving when the queue is full are lost.

5.7 Consider a semi-Markov version of the machine replacement problem of Example 2.1 in Section 1.2. Here, the transition times are random, independent, and have given distributions. Also, g(i) is the cost per unit time of operating the machine at state i. Assume that p_{i, i+1} > 0 for all i < n. Derive Bellman's equation and analyze the problem.
References [ABF93] Arapost'ithis. A., Borkar, V., Feruandez-Gaucheraud, E., Ghosh, M., and Marcus, S., 1993. ‘‘Discrete-Time Controlled Markov Processes with Average Cost. Criterion: A Survey." SIAM .). on Control and Opti- mization, Vol. 31, pp. 282-341. [AMT93] Archibald, T. W., McKinnon, К. I. M.. and Thomas. L. C., 1993. "Serial and Parallel Value Iteration Algorithms for Discounted Markov De- cision Processes," Eur. .1. Operations Research, Vol. 67, pp. 188-203. [Ash70] Ash, R. 13., 1970. Basic Probability Theory, Wiley, N. Y. [AyR91] Ayonn, S.. and Rosberg. Z., 1991. "Optimal Routing to Two Parallel Heterogeneous Servers with Rcsequencing,” IEEE Trans, on Au- tomatic Control, Vol. 36. pp. 1436-1 149. [BBS93] Barto, A. Ck, Bradtke, S. 3., and Singh. S. P., 1993. "Real- Time Learning and Control Using Asynchronous Dynamic Programming." Comp. Science Dept.. Tech. Report, 91-57, Univ, of Massachusetts, Artificial Intelligence, Vol. 72, 1995, pp. 81-138. [BDM83] Baras, .1. S., Dorsey, A. .1., and Makowski. A. M., 1983. "Two Competing Queues wit h Linear Cost.s: The /гс-R.ule is Often Optimal," Re- port SRR 83-1, Department of Electrical Engineering, University of Mary- land. [BMP90J Be.nvemst.e, A., Metivier, M., and 1’rourier, P., 1999. Adaptive Algorithms and Stochastic Approximations, Springer-Verlag, N. Y. [BI’T'Jla] Bertsinias, D., Paschalidis, I. C., and Tsitsiklis, .1. N.. 1991. “Optimization of Multiclass Queueing Networks: Polyhedral and Nonlinear Characterizations of Achievable Performance," Annals of Applied Proba- bility. Vol. 4. pp. 43-75. [BPT9 lb] Bertsinias, D., Paschalidis, I. C., and Tsitsiklis, .1. N., 1991. “Branching Bandits and Klimov's Problem: Achievable Region and Side Constraints,” Proc, of the 1994 IEEE Conference on Decision and Control, pp. 174 180, IEEE Trans, on Automatic Control, to appear.
_ -1 278 licfeiwices [BWN92] Blanc, .1. 1’. C., de Waal, 1’. R., Nain, P., and Towsley, D., 1992. “Optimal Control of Admission to a Multiserver Queue with Two Arrival Streams,” IEEE Trans, on Automatic Control, Vol. .37, pp. 785-797. [BaD81] Baras, J. S., and Dorsey, A. J., 1981. “Stochastic Control of Two Partially Observed Competing Queues,” IEEE Trans. Automatic Control, Vol. AC-26, pp. 1106-1117. [Bai93] Baird, L. C., 1993. “Advantage Updating.” Report WL-TR-93- 1146, Wright Patterson ЛЕВ, OH. [Bai94] Baird, L. C., 1994. “Reinforcement Learning in Continuous Time: Advantage Updating,” International Conf, on Neural Networks, Orlando, Fla. [Bai95] Baird, L. C., 1995. “Residual Algorithms: Reinforcement Learning with Function Approximation,’’ Dept, of Computer Science Report, U.S. Air Force Academy, CO. [Bat.73] Bather, J., 1973. “Optimal Decision Procedures for Finite Markov Chains,” Advances in Appl. Probability, Vol. 5, pp. 328-339, pp. 521-540, 541-553. [BeC89] Bertsekas, D. P., and Castanon, D. A., 1989. “Adaptive Aggrega- tion Methods for Inlinite. Horizon Dynamic. Programming," IEEE Trans, on Automatic. Control, Vol. AC-34, pp. 589-598. [BeN93] Bcrtshnas, D., and Nino-Mora, .J., 1993. “Conservation Laws, Ex- tended Polymatroids, and the Multiariued Bandit Problem: A Unified Polyhedral Approach,” Mathematics of Operations Research, to appear. [BeR87] Bcutlcr, F. .1., and Ross, K. W., 1987. “Uniformization for Semi- Markov Decision Processes Under Stat ionary Policies,” .1. Appl. Prob., Vol. 24, pp. 399-420. |BeS78] Bertsekas, D. P., and Shreve, S. E., 1978. Stochastic Optimal Con- trol: The Discrete Time Case, Academic Press, N. 57 [BeS79] Bertsekas, D. P., and Shreve, S. E., 1979. “Existence of Optimal Stationary Policies in Deterministic Optimal Control,” .1. Math. Anal, and Appl., Vol. 69, pp. 607-620. [BeT89a] Bertsekas. D. P., and Tsitsiklis, .1. N., 1989. Parallel and Dis- tributed Computation: Numerical Methods, Prentice-Hall, Englewood Cliff N. .1. [Be4’89l>] Rentier. F. .1., and Tencketzis, 1).. 198!). "lioul.ing in Queue- ing Networks Under Imperfect. Information: Stochastic Dominance and Thresholds,” Stochastics and Stochastics Reports, Vol. 26, pp. 81-100. [BcT91a] Bertsekas, D. P., and Tsitsiklis, .1. N., 1991. “A Survey ol Some Aspects of Parallel and Distributed Iterative Algorithms,” Automation,
References 279 Vol. 27, pp. 3-21. [BeT9lb] Bertsekas. D. I’., and Tsitsiklis, .1. N., 1991. "An Analysis of Stochastic Shortest Path Problems,” Math. Operations Research. Vol. Hi. pp. 580-595. [13ei57] Bellman, R.., 1957. Applied Dynamic Programming, Princeton Uni- versity Press, Princeton, N. J. |Ber71] Bertsekas, D. P., 1971. “Control of Uncertain Systems With a Set-Membership Description of the Uncertainty," Ph.D. Dissertation. Mas- sachusetts Institute of Technology. [Ber72] Bertsekas, D. P., 1972. ‘‘Infinite 'Time Reachability of State Space Regions by Using Feedback Control,” 1 PED Trans. Automatic Control. Vol. AC-17, pp. 601-613. [I3er73a] Bertsekas, D. P., 1973. “Stochastic Optimization Problems with Nondifferentiable Cost Functionals," .1. Optimization Theory Appl., Vol. 12. pp. 218-231. [Ber73b] Bertsekas. D. P., 1973. “Linear Convex Stochastic Control Prob- lems Over an Inlinite Time Horizon." IEEE 'Trans. Automatic Control, Vol. AC-18, pp. 311-315. [Ber75] Bertsekas, D. P.. 1975. "Convergence of Discretization Procedures in Dynamic. Programming.” IEEE Trans. Automatic Control, Vol. AC-20, pp. 115-119. [Bcr76] Bertsekas, D. P.. 1976. “On Error Bounds for Successive Approx- imation Methods," IEEE Trans. Automatic Control. Vol. AC-21, pp. 394- 396. [Ber77] Bertsekas, D. P.. 1977. “Monotone; Mappings with Application in Dynamic Programming," SIAM J. on Control and Optimization. Vol. 15, pp. 438-46-1. [Ber82n] Bertsekas. D. P., 1982. "Distributed Dynamic Programming.” IEEE Trans. Automatic Control, Vol. AC-27, pp. 610-616. [Ber82bj Bertsekas, D. P., 1982. Constrained Optimization and Lagrange Multiplier Methods. Academic Press. N. Y. [Bei'93] Bertsekas, D. P., 1993. “A Generic Rank One Correction Algorithm for Markovian Decision Problems,” Lab. for Info, and Decision Systems Report L1DS-P-2212, Massachusetts Institute of Technology, Operations Research Leiters, Io appear. [Ber95a] Bertsekas, D. P., 1995. Nonlinear Programming, Athena Scientific, Belmont, MA. to appear. [I3er95b] Bertsekas, D. P., 1995. “A Counterexample to Temporal Differ- ences Learning," Neural Computation, Vol. 7, pp. 270-279.
280 Uelervlltes - -J [I5erf)5c] Bertsekas, I). P., 1995. “A New Value Iteration Method for the Average Cost Dynamic Programming Problem," Lab. lor Info, and Decision Systems Report, Massachusetts Institute of Technology. [BhE91] Bhattacharya. P. P., and Ephremidcs, A., 1991. “Optimal Alloca- tions of a. Server Between Two Queues with Due Tinies,” IEEE Trans, on Automatic. Control, Vol. 30, pp. 1417-11'23. [В1183] Billingsley, P., 1983. “The Singular Enaction of Bold Play," Ameri- can Scientist., Vol. 71, pp. 39'2-397. |BlaO2| Blackwell, D., 190'2. "Discrete Dynamic Programming." Ann. Mat h. Statist., Vol. 33, pp. 719-720. [Bla65] Blackwell, D., 1905. “Discounted Dynamic Programming," Ann. Math. Statist., Vol. 30, pp. 226-235. [Bla70] Blackwell, D., 1970. “On Stationary Policies," .1. Hoy. Statist. Socr Sei\ A, Vol. 133, pp. 33-38. [Bor89] Borkar, V. S., 1989. “Control of Markov Chains with Long-Run Average* Cost Criterion: The Dynamic Programming Equations," SIAM ,1. on Control and Optimization, Vol. 27, pp. 642-657. [Bro65] Brown, B. W., 1965. “On the Iterative Method of Dynamic Pro- gram ining on a Pinite Space Discrete Markov Process.” Ann. Math. Stat ist., Vol. 36, pp. 1'279-1286. [CaS92] Cavazos-Cadena, R, and Sennott, L. 1., 1992.“Comparing Recent Assumpt ions for t he Existence of Optimal Stationary Policies,” Operations Research Letters, Vol. 11, pp. 33-37. [Cav89a] Cavazos-Cadena, R., 1989. “Necessary Conditions for the Op- timality Equations in Average-Reward Markov Decision Processes,” Sys. Control Letters, Vol. 11, pp. 65-71. [Cav89b] Cavazos-Cadena, IL, 1989. “Weak Conditions for the Existence of Optimal Stationary Policies in Average Markov Decisions Chains with Unbounded Costs,” Kybcrnctika, Vol. 25, pp. 145-156. [Cav91] Cavazos-Cadena, R., 1991. “Recent Results on Conditions lor the Existence of Average Optimal Stationary Policies,” Annals of Operations Research, Vol. 28. pp. 3-28. [C11T89] Chow, С.-S., and Tsitsiklis, .1. N., 1989. “The Complexity of Dynamic Programming,” Journal of Complexity, Vol. 5, pp. 466-488. [C11T9J] Chow, С.-S., and Tsitsiklis, J. N., 1991. “An Optimal One Way Multigrid Algorithm for Discrete Time Stochastic Control,” IEEE Trans, on Automatic Control, Vol. AC-36, pp, 898-911. [G'oR.87] Courcoubetis, C. A., and Reiman, M. L, 1987. “Optimal Control of a Queueing System with Simultaneous Service Requirements,” IEEE
1 ifTt'ri'i ич'И 281 Trans, on Automatic Control. Vol. AC-32, pp. 717-727. [Co\'8 1] Courcoubetis. C.. ami Varaiya. I’. I’.. 1981. "The Service Process with Least Thinking Time Maximizes Resource Utilization,'' IEEE Trans. Automatic. Control. Vol. AC-29, pp. 1005-1008. [CTC91] Cruz. IL I,., and Chiiah. M. C.. 1991. "A Minimax Approach to a Simple Routing Problem," IEEE Trans, on Automatic Control. Vol. 3(>. pp. I 124-1-135. [D'EpGO] D’Epenuiix. F., 1960. "Sur un Probit me do Production et de Stockage Bans 1'Aleatihre." Bev. Francaise Ant,. Infor. Recherche Opera- tiomielle, Vol. 11, (English Transl.: Management Sei., Vol. 10. 1963. pp. 98-108). [Dan63j Dantzig. G. B., 1963. Linear Programming and Extensions. Prince- ton Univ. Press, Princeton. N. J. [Den67] Denardo. E. V., 1967. “Contraction Mappings in the Theory Un- derlying Dynamic Programming.” SIAM Review, Vol. 9, pp. 165-177. [Der70] Derman, C., 1970. Finite State Markovian Decision Processes. Aca- demic Press. N. Y. [DuS65] Dubins. L., and Savage. L. M., 1965. How to Gamble If A'on Must., McGraw-Hill, N. Y. [DyY79] Dyukin. E- B., and Yuskevich, A. A., 1979. Controlled Markov Processes, Springer-Verlag. N. Y. [EVW80] Ephremidcs, A., Varaiya, P. P., and Walrand, .1. (1., 1980. “A Simple; Dynamic Routing Problem,” IEEE Trans. Automatic Control, Vol. AC-25, pp. 69(1-693. [EaZ62] Eaton, .1. IL, and Zadeh, L. A., 1962. ‘•Optimal Pursuit Strategies in Discrete State Probabilistic Systems.” Trans. ASME Ser. D. J. Basic Eng.. Vol. 84, pp.23-29. [EpV89] Ephremidcs, A., and Verdu, S.. 1989. "Control and Optimization Methods in Commitment ion Network Problems," IEEE Trans. Automatic Control, Vol. AC-34, pp. 936-942. [FAM90] Fernandez-Gaucherand. E., Arapostathis, A., and Marcus, S. L, 1990. “Remarks on the Existence of Solutions to I he Average Cost. Op- timality Equation in Markov Decision Processes.” Systems and Control Letters, Vol. 15, pp. 1'25-43'2. [FAM91] Fernandez-Gaucherand, E., Arapostathis, A., and Marcus, S. I., 1991. “On the Average Cost Optimality Equation and the Structure of Op- timal Policies for Partially Observable Markov Decision Processes." Annals of Operations Research, Vol. 29, pp. 439-470. [FeS94] Feinberg. E. A., and Shwartz, A., 199-1. "Markov Decision Models
. л 28'2 References with Weight eel Discounted Criteria,” Mathematics of Operations Research, Vol. 1!). pp. 1-17. [Fei78] Feinberg, E. A., 1978. “The Existence of a Stationary c-Optiinal Policy lor a Finite-State Markov Chain,” Theor. Prob. Appl., Vol. 23, pp. 297-313. [Fei92a] Feinberg, E. A., 1992. “Stationary Strategies in Borel Dynamic Programming,” Mathematics of Operations Research, Vol. 125, pp. 87-96. [Fci92l>] Feinberg, E. A., 1992. “A Markov Decision Model of a Search Process,” Comtemporary Mathematics, Vol. 125, pp. 87-96. [Fox71] Fox, B. L., 1971. “Finite State Approximations to Denumerable State Dynamic Progams,” .1. Math. Anal. Appl., Vol. 34, pp. 665-670. [Gal95] Gallager, R. G., 1995. Discrete Stochastic Processes, Kluwer, N. Y. [Gho90] Ghosh, M. K., 1990. “Markov Decision Processes with Multiple Costs,” Operations Research Letters, Vol. 9, pp. 257-260. [Gi.J74] Gittins, .1. C., and .Jones, D. M., 1974. “A Dynamic Allocation Index for the Sequential Design of Experiments,” Progress in Statistics (J. Gani, cd.), North-Holland, Amsterdam, pp. 241-266. [Git79] Gittins, .1. C., 1979. “Bandit, Processes and Dynamic Allocation Indices," .1. Roy. Statist,. Soc., Vol. B, No. 41, pp. 148-164. [I1BK91] Harmon, M. E., Baird, L. C., and Klopf, Л. II., 1994. “Advan- tage Updating Applied to a Differential Game,” Presented at. NIPS Conf., Denver, Colo. [HI1L9I] Hernandez-Lerma, ()., Hennet, .1. C., and Lasserre, .1. 13., 1991. “Average Cost Markov Decision Processes: Optimality Conditions,” J. Math. Anal. Appl., Vol. 158, pp. 396-106. [1 laL8(>] Haiiric, A., and L’Ec.uyer, I’., 1986. “Approximation and Bounds in Discrete Event, Dynamic Programming.'' IEEE Trans. Automatic Control, Vol. AC-31, pp. 227-235. [Haj84] Hajek. B., 1981. “Optimal Control of Two Interacting Service Sta- tions,” IEEE Trans. Automatic Control, Vol. AC-29, pp. 491-499. [IIar75a] Harrison, J. M., 1975. “A Priority Queue with Discounted Linear Costs,” Operations Research, Vol. 23, pp. 260-269. [IIar75b] Harrison, .1. M., 1975. “Dynamic Scheduling of a Multiclass Queue: Discount, Optimality,” Operations Research, Vol. 23, pp. 270-282. [Has68] Hastings, N. A. J., 1968. “Some Noles on Dynamic. Programming and Replacement.,” Operational Research Quart., Vol. 19, pp. 453-464. [HeS84] Heyman, D. P., and Sobel, M. .1., 1984. Stochastic Models in Op- erations Research, Vol. 11, McGraw-Hill, N. Y.
References 283 [Her89] Hernandez-Lerma, О., 1989. Adaptive Maikov Control Processes, Springer-Verlag, N. Y. [HoT74] Hordijk, A., and Tijins, II., 1974. “Convergence Results and Ap- proximations for Optimal (.s,S) Policies," Management Sei., Vol. 20, pp. 1431-1438. [How60] Howard, R., 1960. Dynamic Programming and Markov Processes. MIT Press, Cambridge, MA. [Igl63a] Iglehart, D. L., 1963. “Optimality of (5, .s) Policies in the Infinite Horizon Dynamic Inventory Problem,” Management Sci., Vol. 9, pp. 259- 267. [Igl63b] Iglehart., D. L., 1963. “Dynamic Programming and Stationary Analysis of Inventory Problems,” in Scarf, 11., Gilford, D., and Shelly, M., (eds.), Multistage Inventory Models and Techniques, Stanford University Press, Stanford. CA, 1963. [JJS94] Jaakkola, T., .Jordan, M. 1., and Singh, S. P., 1994. “On the Convergence of Stochastic Iterative Dynamic Programming Algorithms,” Neural Computation, Vol. 6, pp. 1185-1201. [Jcw63] Jewell, W., 1963. “Markov Renewal Programming I and II,” Op- erations Research, Vol. 2, pp. 938-971. [KaV87] Katehakis. M., and Veinott, A. F., 1987. “The Multi-Armed Ban- dit Problem: Decomposition and Computation,” Math, of Operations Re- search, Vol. 12, pp. 262-268. [Kal83] Kallenberg, L. С. M., 1983. Linear Programming and Finite Markov Control Problems, Mathematical Centre Report., Amsterdam. [Kel81] Kelly, F. P., "Mult i-Armed Bandits with Discount Factor Near One: The Bernoulli Case,” The Annals of Statistics, Vol. 9, pp. 987-1001. [Kle68] Kleinman, D. L., 1968. "On an Iterative Technique for Riccati Equation Computations,” IEEE Trans. Automatic Control, Vol. AC-13, pp. 114-115. [KuV86] Kumar, P. R.., and Varaiya, P. P., 1986. Stochastic Systems: Es- timation, Identification, and Adaptive Control, Prentice-Ilall, Englewood Cliffs, N.J. [Kum85] Kumar, P. R., 1985. “A Survey of Some Results in Stochastic Adaptive Control," SIAM .1. on Control and Optimization, Vol. 23. pp. 329-380. [Kus71] Kushner. II. J., 1971. Introduction t.o Stochastic Control, Holt, Rinehart, and Winston, N. Y. [Kus78] Kushner, H. J., 1978. "Optimality Conditions for the Average Cost per Unit Time Problem with a Diffusion Model.” SIAM J. Control Opti-
- .1 284 Kclciences inization, Vol. 16, pp. 330-316. [Las88] Lasserre, .1. B., 1988. Conditions lor Existence of Average and Blackwell Optimal Stationary Policies in Denumerable Markov Decision Processes,” ,1. Math. Anal. Appl., Vol. 136, pp. 179-190. [LiK84] Lin, W., and Kumar, P. R.., 1981. “Optimal Control of a Queueing System with Two Heterogeneous Servers,” IEEE Ti ans. Automatic Control, Vol. AC-29. pp. 696-703. [LiH.71] Lippman, S. A., and Boss, S. M.. 1971. "The Streetwalker's Dilemma: A Job-Shop Model,” SIAM J. of Appl. Math., Vol. 20, pp. 336-342. [LiStil] Liust ernik, L., and Sobolev, V., 1961. Elements of Functional Anal- ysis, Ungar, N. Y. [Lip75] Lippman, S. A., 1975. “Applying a New Device in the Optimization of Exponential Queuing Systems,” Operations Research, Vol. 23, pp. 687- 710. [LjS83] Ljung, L., and Soderstrom, T., 1983. Theory and Practice of Re- cursive Identification, MIT Press, Cambridge, MA. [Lue69] Luenbcrger, D. G., 1969. Optimization by Vector Space Methods, Wiley. N. Y. |McQ66] MacQueen, J., 1966. "A Modified Dynamic Programming Method for Markovian Decision Problems,” J. Math. Anal. Appl., Vol. 14, pp. 38- 43. [MoW77] Morton, T. E., and Weeker, W., 1977. “Discounting. Ergodicity and Convergence for Markov Decision Processes,” Management, Sci.. Vol. 23, pp. 890-900. [Mor71] Morton, T. E., 1971. “Ou the Asymptotic Convergence Rate of Cost Differences for Markovian Decision Processes,” Operations Research, Vol. 19, pp. 244-248. [NTW89] Nain, P., Tsoueas, P., and Walrand, J., 1989. “Interchange Ar- gument,s in Stochastic Scheduling,” .1. of Appl. Prob., Vol. 27, pp. 815-826. [NgP86] Nguyen, S., and Pallol.tino, S., 1986. “1 Ivperpal.hs and Shortest Ilyperpaths,’’ in Combinatorial Optimization by B. Simeone' (cd.), Springer- Verlag. N. Y. pp. 258-271. [Odo69] Odoni, A. IL, 1969. "On Finding the Maximal Gain for Markov Decision Processes,” Operations Research, Vol. 17, pp. 857-860. [OrR70j Ortega, J. M., and Rheinboldt, W. G'., 1970. Iterative Solution of Nonlinear Equations in Several Variables, Academic Press, N. Y. [OriiGO] Ornstehi, D., 19G9. “On the Existence of Stationary Optimal Strate- gics,” Proc. Amer. Malli. Soc., Vol. 20, pp. 563-5G9.
/57(‘Г<Ч1( ('.S 285 [РВТ95] Polymenakos, L. C., Bertsekas, D. P., and Tsitsiklis, J. N.. 1995. "Ellieicnt Algorithms for Continuous-Space Shortest Pat h Problems.” Lab. for info, and Decision Systems Report L1DS-P-2292, Massachuset ts Insti- tute of Technology. [PBW79J Popyack, J. L._ Brown, R. L.. and White. С. C.. Ill, 1969. "Dis- crete Versions of an Algorithm due to Varaiya,” IEEE Trans. Ant. Control, Vol. 2-1, pp. 503-501. [PaKSlJ Pattipati, K. R., and Kleinman, D. L., 1981. "Priority Assignment Using Dynamic Programming for a Class of Queueing Systems,” IEEE Trans, on Automatic Control, Vol. AC-26, pp. 1095-1106. [PaT87] Papadimitriou. C. 11., and Tsitsiklis, .1. N.. 1987. "The Complexity of Maikov Decision Processes,” Math. Operations Research, Vol. 12, pp. 441-450. [Pla77] Platzman, L., 1977. “Improved Conditions for Convergence in Undis- counted Markov Renewal Programming,” Operations Research, Vol. 25. pp. 529-53.3. [PoA69] Pollatschck, M., and Avi-ltzhak, B., 1969. “Algorithms for Stochas- tic Gaines with Geometrical Interpretation,” Man. Sci., Vol. 1.5, pp. 399- 413. [PoT78] Pollens, E., and Totten, J., 1978. "Accelerated Computation of the Expected Discounted Return in a Markov Chain,” Operations Research, Vol. 26. pp. 350-358. [PoT92] Polychronopoulos. G. 11., and Tsitsiklis. .1. N., 1992. "Stochas- tic Shortest Path Problems with Recourse,” Lab. for Info, and Decision Systems Report L1DS-P-P-2183, Massachusetts Institute of Technology. [Por71] Porteus. E., 1971. "Some Bounds for Discounted Sequential Deci- sion Processes,” Mau. Sci., Vol. 18, pp. 7-11. [Por75j Porteus. E., 1975. "Bounds and Transformations for Finite Markov Decision Chains," Operations Research, Vol. 23, pp. 761-784. [Por81] Porteus, E., 1981. “Improved Conditions for Convergence in Undis- comited Markov Renewal Programming,” Operal ions Research. Vol. 25. pp. 529-533. [Psl'93] Psaraflis. II. N.. and Tsitsiklis, .1. N., 1993. “Dynamic Shortest Paths in Acyclic Networks with Markovian Arc Costs,” Operations Re- search. Vol. 41, pp. 91-1 ()1. [PuB78] Puterman, M. L., and Brumelie, S. L., 1978. “The Analytic Theory of Policy Iteration,” in Dynamic Programming and its Applications, M. L. Puterman (ed.), Academic Press, N. Y. [PuS78] Puterman, M. L.. and Shin, M. C.. 1978. “Modilicd Policy Iteration
286 Ruforcuccs Algorithms for Discounted Markov Decision Problems,” Management Sei., Vol. 24, pp. 1127-1137. [PuS82] Putennan, M. L., and Shin, M. C., 1982. "Action Elimination Pro- cedures for Modified Policy Iteration Algorithms,” Operations Research, Vol. 30, pp. 301-318. [Put.78] Putennan, M. L. (ed.), 1978. Dynamic Programming and its Ap- plications, Academic. Press, N. Y. [Put.94] Putennan, M. L., 1994. “Markovian Decision Problems,” J. Wiley, N. Y. [RVW82] Rosberg, Z., Varaiya, P. P., and Walrand, .1. C., 1982. "Optimal Control of Service in Tandem Queues,” IEEE Trans. Automatic Control, Vol. AC-27, pp. 600-609. [Raid) 1] R.aghavan, T. E. S., and Filar. .1. A., 1991. “Algorithms for Stochas- tic Gaines A Survey,” ZOR Methods and Models of Operations Re- search, Vol. 35, pp. 437-472. |HiS92] Hitt, R. 1\., and Seiinot, L. I.. 1992. "Optimal Stationary Policies in General State Markov Decision Chains with Finite Action Set,” Math. Operations Research, Vol. 17, pp. 901-909. [Roc70] Rockafellar, R. T., 1970. Convex Analysis, Princeton University Press, Princeton, N. .J. [Ros70] Ross, S. M., 1970. Applied Probability Models with Optimization Applications, Holden-Day, San Francisco, СЛ. [Ros83a] Ross, S. M., 1983. Introduction to Stochastic Dynamic Program- ming, Academic Press, N. Y. [R.os83b] Ross, S. M., 1983. Stochastic Processes, Wiley. N. Y. [Ros89] Ross, K. W.. 1989. "Randomized and Past-Dependent Policies lor Markov Decision Processes with Multiple Constraints,” Operations Re- search, Vol. 37, pp. 474-477. [Rus94] Rust, .1., 1994. "Using Randomization to Break the Curse, of Di- mensionality," Unpublished Report, Dept , of Economics, University of Wis- consin. [Rns95] Rust., .]., 1995. “Numerical Dynamic Programming in Economics,” in Handbook of Computational Economics, 11. Amman, D. Kendrick, and .1. Rust (eds.). [SPK<85] Schweitzer. P. .1., Putennan, M. L., and Kindle, K. W., 1985. “Iterative Aggregation-Disaggregation Procedures for Solving Discounted Scnii-Markovian Reward Processes," Operations Research, Vol. 33, pp. 589- 605.
Kvfvreneex 287 [ScF77] Schweitzer, P. .)., and Federgruen, A,, 1977. “The Asymptotic Be- havior of Value Iteration in Markov Decision Problems,” Math. Operations Research, Vol. 2, pp. 360-381. [ScF78] Schweitzer, P. J., and Federgruen, A., 1978. “The Functional Equa- tions of Undiscounted Markov Renewal Programming." Math. Operations Research, Vol. 3, pp. 308-321. [SeS85] Schweitzer, P. J., and Seidman, A.. 1985. “Generalized Polyno- mial Approximations in Markovian Decision Problems,” .1. Math. Anal, and Appl., Vol. 110, pp. 568-582. [Sch68] Schweitzer. P. J., 1968. “Perturbation Theory and Finite Markov Chains,” .1. Appl. Prob., Vol. 5, pp. 401-113. [Sch71] Schweitzer, P. J., 1971. “Iterative Solution of the Functional Equa- tions of Undiscounted Markov Renewal Progranuuiug,” ,1, Math. Anal. Appl., Vol. 34, pp. 495-501. [Sch72] Schweitzer, P. J., 1972. “Data Transformations for Markov Renewal Prograiinning,'' talk al National ORSA Meeting, Atlantic ('it.v, N. .1. [Sei 175] Schal, M., 1975. "Conditions for Optimality in Dynamic Program- ming and for the Limit of n-Stage Optimal Policies to be Optimal,” Z. Wahrscheinlichkeitstheorie mid Verw. Gcbiete, Vol. 32, pp. 179-196. [Seh81] Schweitzer, P. J., 1981. “Bottleneck Determination in a Network of Queues,” Graduate School of Management Working Paper No. 8107, University of Rochester, Rochester, N. Y. [Sch93] Schwartz. A., 1993. “A Reinforcement. Learning Method for Maxi- mizing Undiscounted Rewards,” Proc, of the Tenth Machine Learning Con- ference. [Sen86j Sennott, L. 1.. 1986. “A New Condition for the Existence of Op- timum Stationary Policies in Average Cost Markov Decision Processes,” Operations Research Let t., Vol. 5, pp. 17-23. [Scn89a] Sennott, L. 1., 1989. “Average Cost Optimal Stationary Policies in Infinite State Markov Decision Processes with Unbounded Costs,” Op- erations Research. Vol. 37, pp. 626-633. [Sen89b] Sennott, L. I., 1989. “Average Cost Semi-Markov Decision Pro- cesses' and the Control of Queueing Systems,” Prob. Eng. Info. Sci.. Vol. 3. pp. 247-272. [Sen91] Sennott. L. 1., 1991. “Value Iteration in Count,able State Average Cost Markov Decision Processes with Unbounded Cost,” Annals of Oper- ations Research, Vol. 28, pp. 261-272. [Sen93a] Sennott, L. 1., 1993. “The Average Cost Optimality Equation and Critical Number Policies.” Prob. Eng. Info. Sci., Vol. 7. pp. 47-67.
- .1 288 Reference» [Sen93b] Sennott, L. 1., 1993. “Constrained Average Cost Maikov Decision Chains,'’ Prob. Eng. Info. Sci., Vol. 7, pp. 69-83. [Scr79] Scrfozo, 1!.. 1979. "An Equivalence Between Discrete and Contin- uous Time Maikov Decision Processes/’ Operations Research, Vol. 27. pp. 616-620. [Sha53] Shapley, L. S., 1953. “Stochastic Games,” Proc. Nat. Acad. Sci. U.S.A., Vol. 39. [Sin94] Singh, S. P., 199*1. “Reinforcement Learning Algorithms for Averagc- Payolf Markovian Decision Processes." Proc, of 12th National Conference on Art iliciaI Intelligence, pp. 202-207. [Sob82] Sobel, M. .1., 1982. The Optimality of Full-Service Policies.’' Op- erations Research, Vol. 30, pp. 636-649. [StP7 Ij Stidham, S., and Prabhti, N. U., 197*1. "Optimal Control of Queue- ing Systems,’’ in Mathematical Mel hods in Queueing Theory (Lecture Notes in Economics and Math. Syst., Vol. 98), A. B. Clarke (Ed.), Springcr- Verlag, N. Y., jjp. 263-294. [Sti85] St idham, S. S., 1985. “Optimal Control of Admission to a Queueing System," IEEE Trans. Automatic Control, Vol. AC-30, pp. 705-713. [Str66] Strauch, R., 1966. “Negative Dynamic Programming,’' Aim. Math. Statist., Vol. 37, pp. 871-890. [SttC91] Suk, .L-В., ami Cassandras, С. G., 1991. “Optimal Scheduling of Two Competing Queues with Blocking,” IEEE Trans, on Automatic Control, Vol. 36, pp. 1086-1091. [Sut.88] Sutton, R. S., 1988. “Learning to Predict by the Methods of Tem- poral DiHereiices,” Machine Learning, Vol. 3, pp. 9-44. [TSC92] Towsley, D., Sparaggis, P. D., and Cassandras, C. G., 1992. “Opti- mal Routing ami Buller Allocation for a Class of Pinite Capacity Queueing Systems,” IEEE Trans, on Automatic Control, Vol. 37, pp. 1446-1451. [Tes92] Tesauro, G., 1992. “Practical Issues in Temporal Difference Learn- ing," Machine Learning, Vol. 8, pp. 257-277. [TsV94] Tsitsiklis, J. N., ami Van Roy, B., 1994. “Feature-Based Methods for Large-Scale Dynamic Programming," Lab. for Info, and Decision Sys- tems Report L1DS-P-2277, Massachuset ts Inst itute of Technology, Machine Learning, to appear. [Tsebtlj Tseng, P., 1991). "Solving Л-l lorizoii, Stationary Maikov Decision Problems in Time Proportional to log(//).” Operations Research Letters, Vol. 9. pp. 287-297. [Tsi8(i] Tsitsiklis, .1. N., 1986. “A Lemma. on the Multiarmed Bandit Prob- lem," IEEE Trans. Automatic Control, Vol. AC-31, pp. 576-577.
Helvri'iices 289 [4’si89] Tsitsiklis, J. N.. J 989. "A Com parison of Jacobi and Gauss-Seidel Parallel Iterations." Applied Math. Lett.. Vol. 2. pp. 167-170. [TsiD.'Ja] T'sitsikJis, .1. N., 1993. "Elliciciit Algorithms for Globally Optimal Trajectories.” Lab. for Info, and Decision Systems Report L1DS-P-22IO, Massachusetts Institute of Technology. IEEE dhans, on Automatic Control, to appear. [Tsi93b] Tsitsiklis, J. N.. 1993. "A Short Proof of the Gittins Index The- orem.” Lab. for Info, and Decision Systems Report, L1DS-P-2171, Mas- sachusetts Inst it.ut e of Technology; also Annals of Applied Probability, Vol. 4. 1994. pp. 191-199. [Tsi9 I] Tsitsiklis. .1. N.. 199-1. "Asynchronous Stochastic Approximation and Q-Learmng." Machine Learning. Vol. 16, pp. 185-202. [Tso91] Tsonkas. P., 1991. “The Region of Achievable Performance in a Model of Kliniov.” Research Report, l.B.M. [VWB85] Varaiya, P. P., Walrand. .1. C., and Buyukkoc. C.. 1985. "Ex- tensions of the Multia.rnied Bandit Problem: The Discounted Case," IEEE Trans. Automatic Control, Vol. AC-30, pp. 12(1-439. [Var78] Varaiya, P. P., 1978. "Optimal and Suboptimal Stationary Controls of Markov Chains,” IEEE Trans. Automatic Control. Vol. AC-23, pp. 388- 391. [VeP84] Verd'u, S., and Poor, 11. V., 1984. “Backward, Forward, and Back- ward-Forward Dynamic Programming Models under Coinniutativit.y Con- ditions," Proc. 1984 IEEE Decision and Control Conference. Las Vegas. NE, pp. 1081-1086. [VeP87] Verd’u. S.. and Poor. 11. V., 1987. “Abstract, Dynamic Program- ming Models under Commutativity Conditions,” SIAM .1. on Control and Optimization. Vol. 25, pp. 990-1006. [Vei66] Veinott, A. F.. Jr., 1966. “On Finding Optimal Policies in Discrete Dynamic Programming with no Discounting.” Ann. Math. Statist., Vol. 37. pp. 1284-1294. [Vci69] Veinott, A. F.. Jr., 1969. “Discrete Dynamic Programming with Sensitive Discount Optimality Criteria,” Ann. Math. Statist.. Vol. -10. pp. 1635-1660. [ViE8b] Viniotis, 1.. and Ephreiuides. A., 1988. "Extension of the Op- timality of the Threshold Polic y in Heterogeneous Miilti-wi wr Q>i<-u< ing Systems, 11st.E '1 lans. on Automatic Control, Vol. 33, pp. 101-109. [WatS’J] Watkins. C. ,1. С. H.. "Learning from Delayed Rewards." Ph.D. Thesis. Cambridge Univ.. England.
[Web92] Weber, R., 1992. "On the Gittins Index for Multiarmed Bandits," preprint; Annals of Applied Probability, Vol. 3, 1993.

[WhK80] White, C. C., and Kim, K., 1980. "Solution Procedures for Partially Observed Markov Decision Processes," J. Large Scale Systems, Vol. 1, pp. 129-140.

[Whi63] White, D. J., 1963. "Dynamic Programming, Markov Chains, and the Method of Successive Approximations," J. Math. Anal. and Appl., Vol. 6, pp. 373-376.

[Whi78] Whitt, W., 1978. "Approximations of Dynamic Programs I," Math. Operations Research, Vol. 3, pp. 231-243.

[Whi79] Whitt, W., 1979. "Approximations of Dynamic Programs II," Math. Operations Research, Vol. 4, pp. 179-185.

[Whi80a] White, D. J., 1980. "Finite State Approximations for Denumerable State Infinite Horizon Discounted Markov Decision Processes: The Method of Successive Approximations," in Recent Developments in Markov Decision Processes, Hartley, R., Thomas, L. C., and White, D. J. (eds.), Academic Press, N. Y., pp. 57-72.

[Whi80b] Whittle, P., 1980. "Multi-Armed Bandits and the Gittins Index," J. Roy. Statist. Soc. Ser. B, Vol. 42, pp. 143-149.

[Whi81] Whittle, P., 1981. "Arm-Acquiring Bandits," The Annals of Probability, Vol. 9, pp. 284-292.

[Whi82] Whittle, P., 1982. Optimization Over Time, Wiley, N. Y., Vol. 1, 1982, Vol. 2, 1983.
INDEX A Admissible policy, 3 Advantage updating, 122, 132 Aggregation, 44, 104, 219 Approximation in policy space, 117 Asset, selling, 1.57, 275 Asynchronous algorithms, 30, 74, 120 Average cost problem, 184, 249, 26G В Basis functions, 51, 65, 103 Bellman's equation, 8. 11, 83, 108, 137, 186. 191. 196, 225. 247, 268 Blackwell optimal policy. 193. 233 Bold strategy. 162 c Chess. 102. 117 Column reduct ion. 67 Contraction mappings. 52, 65, 86, 128 Consistent ly improving policies, 90. 122. 127 Controllability, 151, 228 Cost approximation, 51, 101. 225 D Data transformations, 72. 263, 271 Differential cost, 186, 192 Dijkstra’s algorithm. 90, 122 Discounted cost, 9. 186, 243, 262 Discretization. 65 Distributed computation, 74, 120 Duality, 65, 222 E «--optimal policy, 172 Error bounds, 19, 69, 209, 213, 234, 239 F Feature-based aggregation, 104 Feature ext raction, 103 Feature vectors, 103 G Gambling, 160, 173, 180 Gauss-Seidel method, 28. 88, 208 I Improper policy, 80 Index function, 56 Index of a project.. 55 Index rille, 55, 65 Inveiilorv coni rol. I 53. I 79 IIIe< 1 ueiblc Mai kov < limn. 2 I 1 J Jacobi method, 68 L LLL strategy, 90 babel correcting method, 90 Linear programming, 49, 150, 221 Linear quadratic problems. 150. 176- 178. 228. 235 M Measurability issues. 64, 172 Minimax problems. 72 Monotone convergence theorem, 136 Monte-Carlo simulation, 96. 112, 120, 131. 223 Multiarmed bandit problem. 54. 256 Multiple-rank corrections, 48. 61 N Negative DP model, 134 Neuro-dynamic programming. 122 Newton's method, 71
.. .1 292 Index Nonstationary problems, 167 О Observability, 151, 228 One-stcp-lookahead rule, 157, 159, 160 Optimistic policy iteration, 1 16, 122 P Parallel computation, 64, 71, 120 Periodic problems, 167, 171, 177, 17!) Policy, 3 Policy evaluation, 36, 214 Policy existence, 160, 172, 182,226 Policy improvement, 36, 211 Policy iteration, 35, 71, 73, 91, 149, 186, 213, 223 Policy iteration, approximate, 41, 91,1 12, 115 Policy iteration, modified, 39, 91 Polynomial approximations, 102 Posit ive DP model, 134 Priority assignment , 254 Proper policy, 80 Q-factor, 99, 132 Q-leariiiiig, 16, 99, 122, 224, 230, 239 Quadratic cost, 150, 176-178,228, 235 Queueing cont rol, 250, 265 R. Randomized policy. 222 Bank-one correction, 30. 68 Reachability, 181, 182 Reinforcement learning, 122 Relative cost., 186, 192 Replacement problems, 14, 200, 276 Riccat.i equation, 151, 228 Robbins-Monro method, 98 Routing, 257 S SLR strategy, 90 Scheduling problems, 54 Semi-Markov problems, 261 Sequential hypothesis testing, 158 Sequential probability ratio, 158 Sequential space decomposition, 125 Shortest, path problem, 78, 90, 126 Simulation-based methods, 16, 78, 94, 222 Stochastic, approximation method, 98 Stationary policy, 3 Stationary policy, existence, 13, 83, 143, 160, 172, 182, 227 Stochastic shortest paths, 78, 485, 236-239 Stopping problems, 87, 155 Successive approximation, 19 T Temporal differences. 16, 97, 115, 122, 223 Tetris. 105, 1 11 Threshold policies, 73 U Unbounded costs per stage', 134 Uncontrollable state components, 105, 125 Undiscounted problems, 134, 249 Uniforinization, 212, 274 Unichain policy, 196 V Value iteration. 19, 88, 144, 186, 202, 211. 221, 238 Value iteration, approximate, 33 Value iteration, relative, 201, 211, 229. 232 Value iteration, termination, 23, 89 W Weighted sup norm, 86, 128