Convergence of Stochastic Processes - Pollard D.

Автор: Pollard D.
Теги: mathematical analysis probability theory functional analysis stochastic processes
ISBN: 0-387-90990-7
Год: 1984
Похожие
Foundations of Modern Probability
Markov Processes. Characterization and Convergence
Probability and Measure
Counterexamples in probability (TOC)
Текст
                    David Pollard
Convergence of
Stochastic Processes
With 36 Illustrations
i
Springer-Verlag
New York Berlin Heidelberg Tokyo

David Pollard
Department of Statistics
Yale University
New Haven, CT 06520
U.S.A.
AMS Subject Classifications: 60F99, 60G07, 60H99, 62M99
Library of Congress Cataloging in Publication Data
Pollard, David
Convergence of stochastic processes.
(Springer series in statistics)
Bibliography: p.
Includes index.
1. Stochastic processes. 2. Convergence. I. Title.
II. Series.
QA274.P64 1984 519.2 84-1401
© 1984 by Springer-Verlag New York Inc.
All rights reserved. No part of this book may be translated or reproduced in any
form without written permission from Springer-Verlag, 175 Fifth Avenue, New York,
New York 10010, U.S.A.
Typeset by Composition House Ltd., Salisbury, England.
Printed and bound by R. R. Donnelley & Sons, Harrisonburg, Virginia.
Printed in the United States of America.
987654321
ISBN 0-387-90990-7 Springer-Verlag New York Berlin Heidelberg Tokyo
ISBN 3-540-90990-7 Springer-Verlag Berlin Heidelberg New York Tokyo

To Barbara Amato

Preface
A more accurate title for this book might be: An Exposition of Selected
Parts of Empirical Process Theory, With Related Interesting Facts About
Weak Convergence, and Applications to Mathematical Statistics. The high
points are Chapters II and VII, which describe some of the developments
inspired by Richard Dudley's 1978 paper. There I explain the combinatorial
ideas and approximation methods that are needed to prove maximal
inequalities for empirical processes indexed by classes of sets or classes of
functions. The material is somewhat arbitrarily divided into results used to
prove consistency theorems and results used to prove central limit theorems.
This has allowed me to put the easier material in Chapter II, with the hope of
enticing the casual reader to delve deeper.
Chapters III through VI deal with more classical material, as seen from a
different perspective. The novelties are: convergence for measures that don't
live on borel cr-fields; the joys of working with the uniform metric on
D[0, 1]; and finite-dimensional approximation as the unifying idea behind
weak convergence. Uniform tightness reappears in disguise as a condition
that justifies the finite-dimensional approximation. Only later is it exploited
as a method for proving the existence of limit distributions.
The last chapter has a heuristic flavor. I didn't want to confuse the
martingale issues with the martingale facts.
My introduction to empirical processes came during my 1977-8 stay with
Peter Gaenssler and Winfried Stute at the Ruhr University in Bochum,
while I was supported by an Alexander von Humboldt Fellowship. Peter
and I both spent part of 1982 at the University of Washington in Seattle,
where we both gave lectures and absorbed the empirical process wisdom of
Ron Руке and Galen Shorack. The published lecture notes (Gaenssler 1984)
show how closely our ideas have evolved in parallel since Bochum. I also

viii Preface
had the privilege of seeing a draft manuscript of a book on empirical processes
by Galen Shorack and Jon Wellner.
At Yale I have been helped by a number of friends. Dan Barry read and
criticized early drafts of the manuscript. Deb Nolan did the same for the
later drafts, and then helped with the proofreading. First Jeanne Boyce, and
then Barbara Amato, fed innumerable versions of the manuscript into the
DEC-20. John Hartigan inspired me to think.
The National Science Foundation has supported my research and writing
over several summers.
I am most grateful to everyone who has encouraged and aided me to get
this thing finished.

Contents
Notation xiii
CHAPTER I
Functionals on Stochastic Processes 1
1. Stochastic Processes as Random Functions 1
Notes 4
Problems 4
CHAPTER II
Uniform Convergence of Empirical Measures 6
1. Uniformity and Consistency 6
2. Direct Approximation 8
3. The Combinatorial Method 13
4. Classes of Sets with Polynomial Discrimination 16
5. Classes of Functions 24
6. Rates of Convergence 30
Notes 36
Problems 38
CHAPTER III
Convergence in Distribution in Euclidean Spaces 43
1. The Definition 43
2. The Continuous Mapping Theorem 44
3. Expectations of Smooth Functions 48
4. The Central Limit Theorem 50
5. Characteristic Functions 54
6. Quantile Transformations and Almost Sure Representations 57
Notes 61
Problems 62

Contents
CHAPTER IV
Convergence in Distribution in Metric Spaces 64
1. Measurability 64
2. The Continuous Mapping Theorem 66
3. Representation by Almost Surely Convergent Sequences 71
4. Coupling 76
5. Weakly Convergent Subsequences 81
Notes 85
Problems 86
CHAPTER V
The Uniform Metric on Spaces of Cadlag Functions 89
1. Approximation of Stochastic Processes 89
2. Empirical Processes 95
3. Existence of Brownian Bridge and Brownian Motion 100
4. Processes with Independent Increments 103
5. Infinite Time Scales 107
6. Functionals of Brownian Motion and Brownian Bridge 110
Notes 117
Problems 118
CHAPTER VI
The Skorohod Metric on ?>[0, oo) 122
1. Properties of the Metric 122
2. Convergence in Distribution 130
Notes 136
Problems 137
CHAPTER VII
Central Limit Theorems 138
1. Stochastic Equicontinuity ' 138
2. Chaining 142
3. Gaussian Processes 146
4. Random Covering Numbers 149
5. Empirical Central Limit Theorems 155
6. Restricted Chaining 160
Notes 165
Problems 167
CHAPTER VIII
Martingales 170
1. A Central Limit Theorem for Martingale-Difference Arrays 170
2. Continuous Time Martingales 176
3. Estimation from Censored Data 182
Notes 185
Problems 186

Contents xi
APPENDIX A
Stochastic-Order Symbols 189
APPENDIX В
Exponential Inequalities 191
Notes 193
Problems 193
APPENDIX С
Measurability 195
Notes 200
Problems 200
References 201
Author Index 209
Subject Index 211

Notation
Integrals and expectations are written in linear functional notation; sets are
identified with their indicator functions. Thus, instead of \A f{x)JP{dx) write
JP(fA). When the variable of integration needs to be identified, as in iterated
integrals, I return to the traditional notation. And orthodoxy constrains
me to write J f(x) dx for the lebesgue integral, in whatever dimension is
appropriate. If unspecified, the domain of integration is the whole space.
Abbreviations can stand for a probability measure or a random variable
distributed according to that probability measure:
Bin(n, p) = binomial distribution for n trials with success
probability p.
N(/j,, a2) = normal distribution with mean \i and variance a2.
N(fi, V) = multivariate normal distribution with mean vector
p. and variance matrix V.
Uniform(a, b) — uniform distribution on the open interval (a, b);
square brackets, as in Uniform[0, 1], indicate
closed intervals.
Poisson(A) = poisson distribution with mean A.
The symbol ? denotes end of proof, end of definition, and so on—some-
on—something to indicate resumption of the main text. Product measures, product
spaces, and product c-fields share the product symbol (x). Maxima and
minima are v and л. Set-theoretic difference is \; symmetric difference
is A. If ajbn -> oo, for sequences {а„} and {bn}, then write an^bn.
Invariably IR denotes the real line, and IR' denotes /c-dimensional euclidean
space. The borel cr-field on a metric space 3C is always 38(SE). The symbol IP

xiv Notation
denotes a probability measure on a (sometimes unspecified) measurable
space (Q, &); miscellaneous random variables live on this space.
An ~>, a cross between ~ (the sign for "is distributed according to") and
an ordinary arrow -> (for convergence), is used for convergence in distribu-
distribution and weak convergence.
A result stated and proved in the text is always referred to with initial
letters capitalized. Thus the Multivariate Central Limit Theorem is numbered
111.30, but Taylor's theorem and dominated convergence are not reproved.
The letters B, U, BP, EP usually denote the gaussian processes: brownian
motion, brownian bridge, P-motion, and P-bridge. The letters Un and
En denote empirical processes, with Un generated by observations on
Uniform@, 1). Usually Р„ is the empirical measure.
The set of all square-integrable functions with respect to a measure p. is
written y2(fi); the corresponding space of equivalence classes is Ь2(ц). А
similar distinction holds for Jz? 1(fi) and Ll{fi). Often pP denotes the Jz?2(.P)
seminorm; || ¦ || is the supremum norm on a space of functions. The symbols
nt,ns, and so on, are usually projection maps on function spaces.
Expressions like N(e), Nt(d, d, T), and N2(d, P, #") represent various
covering numbers; J(S), Ji{5, d, T), and J2(S, P, !F) are the corresponding
covering integrals.

CHAPTER I
Functionals on Stochastic Processes
... which introduces the idea of studying random variables determined by the
whole sample path of a stochastic process.
I.I. Stochastic Processes as Random Functions
Functions analyzed as points of abstract spaces of functions appear in
many branches of mathematics. Geometric intuitions about distance (or
approximation, or convergence, or orthogonality, or any other ideas learned
from the study of euclidean space) carry over to those abstract spaces, lending
familiarity to operations carried out on the functions. We enjoy similar
benefits in the study of stochastic processes if we analyze them as random
elements of spaces of functions.
Remember that a stochastic process is a collection {X, :teT} of real
random variables, all defined on a common probability space (Q, S, IP).
Often T will be an interval of the real line (which makes the temptation to
think of the process as evolving in time almost irresistible), but we shall also
meet up with fancier index sets—subsets of higher-dimensional euclidean
spaces, and collections of functions. The random variable X, depends on
both t and the point со in Q at which it is evaluated. To emphasize its role as
a function of two variables, write it as X(co, t). For fixed t, the function
X(-, t) is, by assumption, a measurable map from Q into IR. For fixed со, the
function X(co, •) is called a sample path of the stochastic process. If all
the sample paths lie in some fixed collection 3C of real-valued functions
on T, the process X can be thought of as a map from Q into Ж, a random
element of 3C. For example, if a process indexed by [0, 1] has continuous
sample paths it defines a random element of the space C[0, 1] of all
continuous, real-valued functions on [0, 1]. (In Chapter IV we shall
formalize the definition by adding a measurability requirement.)
Each sample path of X is a single point in 3C. Each random variable Z for
which Z(co) depends only on the sample path X(co, •), such as the maximum

2 I. Functionals on Stochastic Processes
of X(co, t) over all f, can be expressed by means of a functional on 3C. That is,
the value Z(co) is found by applying to X(co, ¦) a map H that sends functions
in 9? onto points of the real line. The name functional serves to distinguish
functions on spaces of functions from functions on other abstract sets, an
outmoded distinction, but one that can help us to remember where H lives.
By breaking Z into the composition of a functional with a map from Q
into 9C we also break any analysis of Z into two parts: calculations involving
only the random element X; and calculations involving only the functional
H. This allows us to study many different Z's simultaneously, just by varying
the choice of H. Of course we only gain by this if most of the hard work can
be disposed of once and for all in the analysis of X.
The idea can be taken further. Suppose that a second stochastic process
{Yt: t e T} puts all its sample paths in the same function space 9C. Suppose
we want to study the same functional H of both processes; we want to show
that ИХ and HY have distributions that are close, perhaps. Break the
problem into its two parts: show that the distributions of X and Y (the
probability measures they induce on ЗГ) are close; then show that H has a
continuity property ensuring that closeness of the distributions of X and Y
implies closeness of the distributions of HX and HY. Such an approach
would make the analysis easier for other functionals with the same sort of
continuity property; for a different H only the second part of the analysis
would need repeating.
1 Example. Goodness-of-fit test statistics can often be expressed as func-
functionals on a suitably standardized empirical distribution function. Consider
the basic case of an independent sample ?v, ...,?„ from the Uniform@,1)
distribution. Define the uniform empirical process Un by
Un(co, t) = и'1'2 t Ш<») <t}~t) for 0 < t < 1.
1=1
This has the standardized binomial distribution, n~1/2(Bin(n, t) — nt), for
fixed t. Each sample path is piecewise linear with jump discontinuities at
the n points ^(ш),..., ?„(со).
The process U„ defines a random element of any function space 9C that
contains all sample paths of this form. We could take 3E as the smallest such
set of functions, but there is some advantage to choosing a space that can
accommodate the sample paths of other important stochastic processes.

I.I. Stochastic Processes as Random Functions 3
That way the analysis for this particular problem will carry over to many
other limit and approximation problems. The usual choice is the space
D[0, 1] of all real-valued functions on [0, 1] that are right continuous at
each point of [0, 1) with left limits existing at each point of @, 1]. The catchy
French acronym cadlag (continue a droite, limites a gauche) offers a most
convenient way to avoid repeating this mouthful of a description. From
now on it's cadlag. The processes U„ define random elements of the space
D[0, 1] of all (real-valued) cadlag functions on [0, 1].
Numerous statistics have been suggested as candidates for testing whether
the observations really do come from the uniform distribution, amongst
them:
-i
Un(oj, tf dt.
How do these statistics behave as n tends to infinity? We can try to answer
the question by first determining how {Un}, as a sequence of stochastic
processes indexed by [0, 1], behaves as n tends to infinity.
Suppress the argument со. At each fixed t the sequence of real random
variables {Un(t)} converges in distribution to a iV@, t — t2) distribution, by
virtue of the Central Limit Theorem; at each fixed pair (s, t) the sequence of
bivariate random vectors {(Un(s), Un(t))} converges to a bivariate normal
distribution with covariance the same as the covariance between Un(s) and
Un(t), by virtue of the bivariate form of the Central Limit Theorem; and
similarly for the higher finite-dimensional distributions. That is, the joint
distributions of the random variables obtained by sampling Un at any fixed,
finite set of index points converge to the corresponding distributions of the
so-called brownian bridge (also known as the tied-down brownian motion),
the gaussian process U(t) characterized by:
(i) U has continuous sample paths (it is a random element of C[0, 1])
between the two fixed points U@) = U(l) = 0;
(ii) For fixed tb ... ,tk the random vector (t/faj),..., U(tk)) has a multi-
variate normal distribution with zero means and cov(C/(t;), U(tj)) =
The process U need not live on the same probability space as the empirical
processes. It must, however, have sample paths in C[0, 1]. The reason we
need this apparently irrelevant property will emerge in Chapter V, when we
make the argument rigorous.
When n is large, the process Un behaves something like U. On this basis,
Doob A949) made the bold heuristic assumption that the distributions of
statistics calculated from I)'„ should be close to the distributions of the

I. Functionals on Stochastic Processes
corresponding statistics for U; he conjectured that functionals of Un should
converge in distribution to the analogous functionals of U. In his words (he
wrote х„ and x instead of Un and U):
We shall assume, until a contradiction frustrates our devotion to heuristic
reasoning, that in calculating asymptotic xn(t) process distributions when n —> oo
we may simply replace the xn(t) processes by the x(t) process. It is clear that this
cannot be done in all possible situations, but let the reader who has never used
this sort of reasoning exhibit the first counter example.
Happily, the heuristic does give the right answer for each of the three func-
functionals mentioned above. ?
This example will guide us in our definition of convergence in distribution
for stochastic processes treated as random elements of function spaces. We
shall explore one possible meaning for closeness of two processes in a
distributional sense. The empirical process application will point us towards
a theory slightly different from the one normally espoused in the weak
convergence literature.
Our progress towards the theory will begin in Chapter III with another
look at the classical results for convergence of real random variables and
random vectors, partly as a refresher course in the basic techniques, and
partly as a way to get-a running start on the theory for random elements of
general metric spaces in Chapter IV.
The momentum of the development will carry us through to Chapter V,
where Doob's heuristic argument will receive its rigorous justification, as a
weak convergence result for D[0,1]. Chapter VII will extend the theory to
empirical processes with index sets more complicated than [0, 1], building
upon methods that will be introduced and developed in Chapter II.
Notes
Stochastic processes have frequently been treated as random functions in
the probability literature. Doob A953) pioneered. Gihman and Skorohod
A974, Chapter III) have collected together criteria for existence of stochastic
processes with sample paths in C[0, 1] and D[0, 1]. Breiman A968, Chapters
12 to 14) is a good place to start.
Problems
[1] Verify that U has the same covariance function as Un, by writing (Un{s), Un(t)) as
a normalized sum of the independent random vectors ({?,- < s}, {?,• < t}).
[2] Show that the random variables H3 Un and H3 U both have expectation 5. [Remem-
[Remember that Un and U have the same first and second moment structures.]

Problems
[3] Show that H3 Un can be written in terms of the order statistics ?(i) as
[Complicated functions of the observations can sometimes be represented by
simple functionals of a suitably constructed random process.]

CHAPTER II
Uniform Convergence of
Empirical Measures
... in which uniform analogues of the strong law of large numbers are proved
by two methods. These generalize the classical Glivenko-Cantelli theorem,
which concerns only empirical measures indexed by intervals on the real line,
to uniform convergence over classes of sets and uniform convergence over classes
of functions. The results are applied to prove consistency for statistics expressible
as continuous functionals of the empirical measure. A refinement of the second
method gives rates of convergence.
ILL Uniformity and Consistency
For independent sampling from a distribution function F, the strong law of
large numbers tells us that the proportion of points in an interval (-co, f]
converges almost surely to F(t). The classical Glivenko-Cantelli theorem
strengthens the result by adding that the convergence holds uniformly over
all t. The strong law also tells us that the proportion of points in any fixed
set converges almost surely to the probability of that set. The strengthening
of this result, to give uniform convergence over classes of sets more interesting
than intervals on the real line, and its further generalization to classes of
functions, will be the main concern of this chapter.
For the most part we shall consider only independent sampling from a
fixed distribution P on a set S. The probability measure Pn that puts equal
mass at each of the n observations ?,u ..., ?„ will be called the empirical
measure. It captures everything we might need to know about the observa-
observations, except for the order in which they were taken. Averages over the
observations can be written as expectations with respect to Р„:
If P\f\ < oo, the average converges almost surely to its expected value, Pf.
We shall be finding conditions under which the convergence is uniform over
a class $F of functions.
Of course we should not expect uniform convergence over all classes of
functions, except in trivial cases. Unless P is a discrete distribution, the
difference PnD — PD cannot even converge to zero uniformly over all sets;

II. 1. Uniformity and Consistency 7
there always exists a countable set with Р„ measure one. But there are non-
trivial classes over which the convergence is uniform. When we have such a
class !F we can deduce consistency results for statistics that depend on the
observations only through the values Pnf, for / in ?F.
1 Example. The median of a distribution P on the real line can be defined
as the smallest value of m for which P(— со, m] > \. If P( — со, t] > \ for
each t > m then the median is a continuous functional, in the sense that
| median(g) — median(P) [ < s
whenever the distribution Q is close enough to P. Close means
sup|g(-co, t] - P(-oo, t]\ <5,
t
where the tiny б is chosen so that
P( — со, m — s] < j — S,
P(-co, m + s] > ^+ 8.
The argument goes: if Q has median m then
P(-oo, m'] > S(-co,m'] - S > i - d,
so certainly m' > m — s. Similarly, for every m" < m',
P(-со, m"] < Q(-со, m"] + ё < \ + 6,
which implies m" < m + s, and hence m' < m + s.
Next comes the probability theory. If the empirical measure Р„ is con-
constructed from a sample of independent observations on P, the Glivenko-
Cantelli theorem tells us that
sup \Р„(— со, i] — P(— со, t]| ->¦ 0 almost surely.
From this we deduce that, almost surely,
|median(Pn) — median(P)| < s eventually.
The sample median is strongly consistent as an estimator of the population
median. ?
For this example we didn't have to prove the uniformity result; the
Glivenko-Cantelli theorem is the oldest and best-known uniform strong law
of large numbers in the literature. But as we encounter new functions
(usually called functionals) of the empirical measure, new uniform con-
convergence theorems will be demanded. We shall be exploring two methods
for proving these theorems.
The first method is simpler in concept, but harder in execution. It
involves direct approximation of functions in an infinite class $F by a

8 II. Uniform Convergence of Empirical Measures
finite collection of functions. Classical convergence results, such as the strong
law of large numbers or the ergodic theorem, ensure uniform convergence
for the finite collection; the form of approximation carries the uniformity
over to 3F. Section 2 deals with direct approximation.
The second method depends heavily upon symmetry properties implied by
independence. It uses simple combinatorial arguments to identify classes
satisfying uniform strong laws of large numbers under independent sampling.
Sections 3 to 5 assemble the ideas behind this method.
II.2. Direct Approximation
Throughout the section !F will be a class of (measurable) functions on a set
S with a ff-field that carries a probability measure P. The empirical measure
Р„ is constructed by sampling from P. Assume P\f\ < oo for each / in 3F.
If 3F were finite, the convergence of Р„ f to Pf assured by the strong law of
large numbers would, for trivial reasons, be uniform in /. If J^ can be ap-
approximated by a finite class (not necessarily a subclass of 3F) in such a way
that the errors of approximation are uniformly small, the uniformity carries
over to !F. The direct method achieves this by requiring that each member
of 3F be sandwiched between a pair of approximating functions taken from
the finite class.
2 Theorem. Suppose that for each s > 0 there exists a finite class ?FS containing
lower and upper approximations to each f in S', for which
ft,L<f<fe,V and P(fs,V ~ fs,L)< S.
Then sup^r \Pnf — Pf\ -* 0 almost surely.
Proof. Break the asserted convergence into a pair of one-sided results:
liminfinf(Pn/-P/)> 0
and
limsup sup(Pn/ - Pf) < 0
or, equivalently,
liminf inf(Р„(-/) - P(-f)) > 0.
Then two applications of the next theorem will complete the proof. ?
3 Theorem. Suppose that for each s > 0 there exists a finite class J^ of
functions for which: to each fin #" there exists anft in !Ft such thatfE <f and
Pfe ^ Pf- e. Then
liminf inf(Pn/ - Pf) > 0 almost surely.

II.2. Direct Approximation 9
Proof. For each e > 0,
liminf inf (Р„/ - Pf) > liminf inf (Р„/е - Pf) because fe < f
> liminf inf (Р„/? - Pf) + inf(P/? - Pf)
> 0 H s almost surely, as #"? is finite.
Throw away an aberrant null set for each positive rational s to arrive at the
asserted result. ?
You might have noticed that independence enters only as a way of
guaranteeing the almost sure convergence of Pnfe to Pf for each approximat-
approximating /e. Weaker assumptions, such as stationarity and ergodicity, could
substitute for independence.
4 Example. The method of fc-means belongs to the host of ad hoc procedures
that have been suggested as ways of partitioning multivariate data into
groups somehow indicative of clusters in the underlying population. We can
prove a consistency theorem for the procedure by application of the one-
onesided uniformity result of Theorem 3.
For purposes of illustration, consider only the simple case where observa-
observations ?b ...,?„ from a distribution P on the real line are to be partitioned
into two groups. The method prescribes that the two groups be chosen to
minimize the within-groups sum of squares. Equivalently, we may choose
optimal centers an and bn to minimize
t\Zt ~ a\2 л \Zt - b\\
i=l
then allocate each ?; to its nearest center. The optimal centers must lie at the
mean of those observations drawn into their clusters, hence the name
fe-means (or 2-means, in the present case). In terms of the empirical measure
Pn, the method seeks to minimize
where
fa,b(x) = \x - a\2 л \x - b\2.
As the sample size increases, W(a, b, Pn) converges almost surely to
for each fixed (a, b). This suggests that (an, bn), which minimizes W{-, ¦, Р„),
might converge to the (a*,b*) that minimizes W{-, -,P). Given a few
obvious conditions, that is indeed what happens.
To ensure finiteness of W{-, -, P), assume that P|x|2 < oo. Assume also
that there exists a unique (a*, b*) minimizing W. Adopt the convention that

10 II. Uniform Convergence of Empirical Measures
a < b in order that (b*, a*) be ruled out as a distinct minimizing pair.
Without uniqueness the consistency statement needs a slight reinterpretation
(Problem 1).
The continuity argument lurking behind the consistency theorem does
depend on one-sided uniform convergence of W(-, -, Pn) to W(-, -, P), but
not uniformly over all possible choices for the centers/We must first force
(an,bn) into a region
C = [-M, M]<g)IR и IR ®[-M,M]
for some suitably large M, then prove
liminf inf(PK/ab - Pfa,b) ^ 0 almost surely,
с
We need at least one of the centers within a bounded region [ — M, M] to get
the uniformity. Determine how large M needs to be by invoking optimality
oi(an,bn).
W(an,bn,Pn)<W@,0,Pn)
-»• W@, 0, P) almost surely
= P|x|2.
If both а„ and bn lay outside [ — M, M] then
Ща„, К, Р„) > (iMJPn[-iM, iM]
-»• (iMJP[ -^M, iM] almost surely.
If we choose M so that P\x\2 <{\МJР[-%МЛЩ then there must
eventually be at least one of the optimal centers within [ —M, M], almost
surely. We shall later also need M so large that (a*, b*) belongs to C.
Explicit construction of the finite approximating class demanded by
Theorem 3 is straightforward, but a trifle messy. That is one of the dis-
disadvantages of brute-force methods. First note that
fa,b(x) <(x - Mf + (x + MJ for (a, b) in С
Write F(x) for the upper bound. Because PF < со, there exists a constant D,
larger than M, for which PF[-D, DJ < s. We have only to worry about
the approximation to/ab on [ — D, D~\.
We may assume that both a and b lie in the interval [-3D, 3D]. For if,
say, | b | > 3D then
/,,»(*){I*I <D} = \x- a\2 = fUx){\x\ < D}
because \a\ < M; the lower approximation for/a a on [ — D, D~\ will also
serve forfa>b.

II.2. Direct Approximation 11
Let C? be a finite subset of [ — 3D, 3D]2 such that each (a, b) in that
square has an (a', b') with \a — a'\ < s/D and \b — b'\ < s/D. Then for
each x in [—D, D~\,
\fa,t(x) -/e-,b'(x)| < l(x - flJ - (x - a'J1 + |(x-bJ -(x - fe'Jl
< 2|a - a'l |x - Ka + a')l + 2|fc - fe'l I* - Kb + fe')l
< 2(s/D)(D + 3D) + 2(s/D)(D + 3D)
= 16e.
The class J^33? consists of all functions (/a<;;,<(X) — 16е){|х| < D} for (a', &')
ranging over C!:.
From Theorem 3,
limmfinf(Pn/aib-P/a;b)>0.
с
Eventually the optimal centers (а„, bn) lie in С Thus
liminf(FF(an, Ь„, Р„) - W(an, bn, PJ) > 0 almost surely.
Since
W(an, bn, Pn) < W(a*, b*, Pn) because (an, bn) is optimal for Р„
-» W(a*, b*, P) almost surely
< W(an, bn, P) because (a*, b*) is optimal for P,
we then deduce that
W{an, bn,P)^ W(a*, b*, P) almost surely.
Notice what happened. The uniformity allowed us to transfer optimality of
(а„,Ь„) for Р„ to a sort of asymptotic optimality for P; the processes
W(-, •, Pn) have disappeared, leaving everything in terms of the fixed, non-
random function W( ¦, ¦, P).
We have assumed that W(-, ¦, P) achieves its unique minimum at (a*, b*).
Complete the argument by strengthening this to: for each neighborhood U
of(a*,b*\
inf W(a, b, P) > W(a*, b*, P).
c\u
Continuity of W(-, ¦, P) takes care of the infimum over bounded regions of
C\U. If there were an unbounded sequence (a;, /?;) in С with
we could extract a subsequence along which, say, a; -> — oo and /?; -»• Д with
P | < M. Dominated convergence would give
W(a*,b*,P) = P\x-p\\
which would contradict uniqueness of (a*, b*): for every a, the pair (a, /?)
would minimize W(-, -,P). The pair (an,bn), by seeking out the unique
minimum of W(-, •, P) over the region C, must converge to (a*, b*). ?

12 II. Uniform Convergence of Empirical Measures
The /c-means example typifies consistency proofs for estimators denned
by optimization of a random criterion function. By ad hoc arguments one
forces the optimal solution into a restricted, often compact, region. That is
usually the hardest part of the proof. (Problem 2 describes one particularly
nice ad hoc argument.) Then one appeals to a uniform strong law over the
restricted region, to replace the random criterion function by a deterministic
limit function. Global properties of the limit function force the optimal
solution into desired neighborhoods. If one wants consistency results that
apply not just to independent sequences but also, for example, to stationary
ergodic sequences, one is stuck with cumbersome direct approximation
arguments; but for independent sampling, slicker methods are available for
proving the uniform strong laws. We shall return to the /c-means problem
in Section 5 (Example 29 to be precise) after we have developed these methods.
5 Example. Let в be the parameter of a stationary autoregressive process
for independent, identically distributed innovations {и„}. Stationarity
requires |0| < 1. A generalized M-estimator for в is any value в„ for which
the random function
Hn(9) = (n- I)'1 ? д(у,ЖУ1+1 ~ 0Уд
takes the value zero. We would hope that в„ converges to the в* at which the
deterministic function
H(ff) = ШУ1ЖУ2 ~ Оуг)
takes the value zero. If | g \ < 1 and | ф | < 1 and ф is continuous, we can go
part of the way towards proving this by means of a uniform strong law for a
bivariate empirical measure.
Write <2„ for the probability measure that puts equal mass (n — I) on
each of the pairs (yu y2),..., (у„_и у„). For fixed (integrable) /(•, •),
Qnf -* Of almost surely,
where Q denotes the joint distribution of (yb y2). This follows from the
ergodic theorem for the stationary bivariate process {(у„, у„+1)}.
Check the approximation conditions of Theorem 2, with Q in place of P,
for the class of functions
/(*!, х2)в) = д(х1)ф(х2 - BXl) for -1 < в < 1.
First, choose an integer К so large that
Щ\у1\<к,\у2\<к}>1-?.

II.3. The Combinatorial Method 13
Then appeal to uniform continuity of ф on the compact interval [—2K, 2K~\
to find a S > 0 such that | ф(а) — ф(Ь) \ < г whenever | a - b \ < 3 and
\a\ < 2K and \b\ < 2K. For в in the interval [kS/K, (k + l)S/K],
\f(xl,x2,e)-f(xux2,k3/K)\ < г + 2{\Xl\ >K} + 2{\x2\ > K}.
With the integer к running over the finite range needed for these intervals to
cover [ — 1, 1], the functions
f(x1,x2,k3/K)±e±2{\x1\ > К} ±2{\х2\ Ж}
provide the upper and lower approximations required by Theorem 2.
As noted following Theorem 3, the uniform strong laws also apply to
empirical measures constructed from stationary ergodic sequences. Ac-
Accordingly,
F) sup \Qnf(-, ¦, в) - Qf(-, -, 0)| -> 0 almost surely,
that is,
sup \Hn(9) - H@)\ -»• 0 almost surely.
l«l<i
Provided 6n lies in the range [— 1, 1], we can deduce from F) that Н(в„) -> 0
almost surely. It would be a sore embarrassment if the estimate of the auto-
regressive parameter were not in this range. Usually one avoids the em-
embarrassment by insisting only that Н„(в„) -> 0, with в„ in [— 1, 1]. Such а в„
always exists because Н„@*) -> О almost surely.
Convergence results for в„ depend upon the form of H(). We know в„
gets forced eventually into the set {| H | < e} for each e > 0. If this set
shrinks to в* as e | 0 then в„ must converge to в*, which necessarily would
have to be the unique zero of H(-). If we assume that H does have these
properties we get the consistency result for the generalized M-estimator. ?
II.3. The Combinatorial Method
Since understanding of general methods grows from insights into simple
special cases, let us begin with the best-known example of a uniform strong
law of large numbers, the classical Glivenko-Cantelli theorem. This asserts
that, for every distribution P on the real line,
G) sup | Pn(- oo, t] - P( — oo, t] | -» 0 almost surely,
when the empirical measure Р„ comes from independent sampling on P.
The ideas that will emerge from the treatment of this special case will later
be expanded into methods applicable to other classes of functions. To
facilitate back reference, break the proof into five steps.

14 II. Uniform Convergence of Empirical Measures
Keep the notation tidy by writing || • || to denote the supremum over the
class J of intervals ( —oo, t], for — oo < t < oo. We could restrict the
supremum to rational t to ensure measurability.
First Symmetrization.
Instead of matching Pn against its parent distribution P, look at the difference
between Pn and an independent copy, P'n say, of itself. The difference Pn — Р'„
is determined by a set of In points (albeit random) on the real line; it can be
attacked by combinatorial methods, which lead to a bound on deviation
probabilities for \\Р„ — P'n\\. A symmetrization inequality converts this into
a bound on || Pn — P\\ deviations.
8 Symmetrization Lemma. Let {Z(t): t e T} and {Z'{t)\ t eT} be indepen-
independent stochastic processes sharing an index set T. Suppose there exist constants
P>0anda>0 such that F{ | Z'(t) | < a} > fi for every t in T. Then
(9) P^sup |Z(t)| >s\< jT^sup \Z(t) - Z\t)\ > e - a
Proof. Select a random x for which |Z(t)| > e on the set {sup \Z(t)\ > s}.
Since т is determined by Z, it is independent of Z'. It behaves like a fixed
index value when we condition on Z:
| < a|Z} > p.
Integrate out.
L Z(t)\ >s\< 1P{|Z'(t)| < a, |Z(t)| > s}
< ЕР jsup \Z(t) - Z'(t)\ > s - <x\. П
Close inspection of the proof would reveal a disregard for a number of
measure-theoretic niceties. A more careful treatment may be found in
Appendix С For our present purpose it would suffice if we assumed T
countable; the proof is impeccable for stochastic processes sharing a count-
countable index set. We could replace suprema over all intervals (—00, t] by
suprema over intervals with a rational endpoint.
For fixed t, Р„(— oo, t] is an average of the n independent random variables
{?; < t}> eacn having expected value P( — 00, t] and variance P( — 00, t] —
(P( — 00, t]J, which is less than one. By Tchebychev's inequality,

II.3. The Combinatorial Method
15
Apply the Symmetrization Lemma with Z = Р„ — P and Z' = P'n — P, the
class J as index set, a = js, and /? = j.
A0) P{||Pn - P\\
Second Symmetrization.
2ТР{\\Р„ - Р'„\\
if n > 8е
The difference Р„ — Р'„ depends on 2n observations. The double sample
size creates a minor nuisance, at least notationally. It can be avoided by a
second symmetrization trick, at the cost of a further diminution of the e.
Independently of the observations ?,u ...,?,„, ?,\, ...,?,'„ from which the
empirical measures are constructed, generate independent sign random
variables au...,an for which P{o-; = +1} = W{a{ = -1} = \. The sym-
symmetric random variables {^ < t} — {?¦ < t}, for i = 1,..., n and
— oo < t < oo, have the same joint distribution as the random variables
<7г[{?г < t} — {?'i < t}]. (Consider the conditional distribution given
Thus
Р{||Р„ -Р'я\\> Ы =
< IP<sup n Y, a№i ^ f) > h
+
Write P° for the signed measure that places mass n 1ai at ?,-. The two sym-
metrizations give, for n > 8e~2,
A1) IP{||-Pn - P\\ > e} < 4F{||P°|| > ie}.
To bound the right-hand side, work conditionally on the vector of observa-
observations \, leaving only the randomness contributed by the sign variables.
Maximal Inequality.
Once the locations of the % observations are fixed, the supremum ||P°||
reduces to a maximum taken over a strategically chosen set of intervals
Ij = ( — oo, tj2, for j = 0, 1,..., n. Of course the choice of these intervals
depends on %; we need one tj between each pair of adjacent observations.
(The t0 and tn are not really necessary.) With the number of intervals reduced
so drastically, we can afford a crude bound for the supremum.
A2)
j=o

16 II. Uniform Convergence of Empirical Measures
This bound will be adequate for the present because the conditional proba-
probabilities decrease exponentially fast with n, thanks to an inequality of
Hoeffding for sums of independent, bounded random variables.
Exponential Bounds.
Let Yu..., Ynbs independent random variables, each with zero mean and
bounded range: at < Yt < bt. For each ц > 0, Hoeffding's Inequality
(Appendix B) asserts
F{| Y, + ¦ ¦ ¦ + Yn\ > n) < 2 ехрГ-2,?2 / J>; - fl;Jl.
Apply the inequality with Yt = а{{^ < t}. Given \, the random variable Yt
takes only two values, ± {?,- < t}, each with probability \.
F{|PB°(-oo,t]
/
<2exp(-ne2/32),
because the indicator functions sum to at most n. Use this for each Ij in
inequality A2).
F{||PB°|| > Щ® < 2(и + 1) exp(-ne2/32).
Notice that the right-hand side now does not depend on %.
Integration.
Take expectations over %
JP{\\Pn - P\\ > e} < 8(n + 1) exp(-ne2/32).
This gives very fast convergence in probability, so fast that
? JP{\\Pn - P\\ > ?} < 00
n=l
for each e > 0. The Borel-Cantelli lemma turns this into the full almost sure
convergence asserted by the Glivenko-Cantelli theorem.
II.4. Classes of Sets with Polynomial Discrimination
We made use of very few distinguishing properties of intervals for the proof
of the Glivenko-Cantelli theorem in Section 3. The main requirement was
that they should pick out at most n + 1 subsets from any set of n points.
Other classes have a similar property. For example, quadrants of the form
( —oo, t] in IR2 can pick out fewer than (n + IJ different subsets from a

II.4. Classes of Sets with Polynomial Discrimination 17
set of n points in the plane—there are at most n + 1 places to set the horizontal
boundary and at most n + 1 places to set the vertical boundary. (Problem 8
gives the precise upper bound.) With (n + IJ replacing the n + 1 factor,
we could repeat the arguments from Section 3 to get the bivariate analogue
of the Glivenko-Cantelli theorem. The exponential bound would swallow
up (n + IJ, just as it did the n + 1. Indeed, it would swallow up any poly-
polynomial. The argument works for intervals, quadrants, and any other class
of sets that picks out a polynomial number of subsets.
13 Definition. Let 3i be a class of subsets of some space S. It is said to have
polynomial discrimination (of degree v) if there exists a polynomial p(-)
(of degree v) such that, from every set of N points in S, the class picks out at
most p(N) distinct subsets. Formally, if So consists of N points, then there
are at most p(N) distinct sets of the form So n D with D in 3i. Call p(-) the
discriminating polynomial for Q). ?
When the risk of confusion with the algebraic sort of polynomial is slight,
let us shorten the name "class having polynomial discrimination" to
"polynomial class," and adopt the usual terminology for polynomials of low
degree. For example, the intervals on the real line have linear discrimination
(they form a linear class) and the quadrants in the plane have quadratic
discrimination (they form a quadratic class). Of course there are classes
that don't have polynomial discrimination. For example, from every col-
collection of N points lying on the circumference of a circle in IR2 the class of
closed, convex sets can pick out all 2N subsets, and 2N increases much faster
than any polynomial.
The method of proof set out in Section 3 applies to any polynomial class
of sets, provided measurability complications can be taken care of. Appendix
С describes a general method for guarding against these complications.
Classes satisfying the conditions described there are called permissible.
Every specific class we shall encounter will be permissible. As the precise
details of the method are rather delicate—they depend upon properties of
analytic sets—let us adopt a naive approach. Ignore measurability problems
from now on, but keep the term permissible as a reminder that some regularity
conditions are needed if pathological examples (Problem 10) are to be ex-
excluded. Problems 3 through 7 describe a simpler approach, based on the
more familiar idea of existence of countable, dense subclasses.

18 II. Uniform Convergence of Empirical Measures
14 Theorem. Let Pbea probability measure on a space S. For every permissible
class 3} of subsets of S with polynomial discrimination,
sup | PnD — PD | -» 0 almost surely.
Proof. Go back to Section 3, change J to 2, replace the n + 1 multiplier
by the polynomial appropriate to 2), and strike out the odd reference to
interval and real line. ?
Which classes have only polynomial discrimination? We already know
about intervals and quadrants; their higher-dimensional analogues have
the property too. Other classes can be built up from these.
15 Lemma. If *€ and 2 have polynomial discrimination, then so do each of:
(i) {Dc:De2};
(ii) {С и D: С e Vand D e 2};
(iii) {C n D: С e #and D e 2}.
Proof. Write c(-) and d(-) for the discriminating polynomials. We may
assume them both to be increasing functions of JV. From a set So consisting
of N points, suppose <€ picks out subsets Su...,Sk with к < c(N). Suppose
St consists of JV; points. The class 2 picks out at most d(Nt) distinct subsets
from St. This gives the bound d(JVi) + • • • + d(Nk) for the size of the class
in (iii). The sum is less than c(N) d(N). That proves the assertion for (iii).
The other two are just as easy. ?
The lemma can be applied repeatedly to generate larger and larger
polynomial classes. We must place a fixed limit on the number of operations
allowed, though. For instance, the class of all singletons has only linear
discrimination, but with arbitrary finite unions of singletons we can pick
out any finite set.
Very quickly we run out of interesting new classes to manufacture by
means of Lemma 15 from quadrants and the like. Fortunately, there are
other systematic methods for finding polynomial classes.
Polynomials increase much more slowly than exponentials. For JV large
enough, a polynomial class must fail to pick out at least one of the 2N subsets
from each collection of JV points. Surprisingly, this characterizes poly-
polynomial discrimination. Some picturesque terminology to describe the
situation has become accepted in the literature. A class 3) is said to shatter
a set of points F if it can pick out every possible subset (the empty subset
and the whole of F included); that is, 3i shatters F if each of the subsets
off has the form D n F for some D in 2). This conveys a slightly inappropriate
image, in which F gets broken into tiny fragments, rather than an image of
a diligent 2) trying to pick out all the different subsets of F; but at least it is
vivid.

II.4. Classes of Sets with Polynomial Discrimination 19
For example, the class of all closed discs in IR2 can shatter each three-point
set, provided the points are not collinear. But from no set of four points, no
matter what its configuration, can the discs pick out more than 15 of the 16
possible subsets. The discs shatter some sets of three points; they shatter
no set of four points.
16 Theorem. Let So be a set of N points in S. Suppose there is an integer
V < N such that 3) shatters no set of V points in So. Then 3) picks out no
more than ($) + A) + ••• + (v-i) subsets from So.
Proof. Write Fu...,Fk for the collection of all subsets of V elements
from So. Of course к = (у). By assumption, each F( has a "hidden" subset
Hi that 3) overlooks: D nFt^ Ht for every D in 2. That is, all the sets of
the form D n So, with D in 3i, belong to
<ё0 = {С с So: С n Ft Ф H, for each i}.
It will suffice to find an upper bound for the size of #0.
In one special case it is possible to count the number of sets in ^0 directly.
If Ht = Ft for every i then no С in ^0 can contain an Ft; no С can contain a
set of V points. In other words, members of ^0 consist of either 0, 1,..., or
V — 1 points. The sum of the binomial coefficients gives the number of sets
of this form.
By playing around with the hidden sets we can reduce the general case to
the special case just treated. Label the points of So as 1,..., N. For each i
define Щ = (H; и {1}) n F;; that is, augment Ht by the point 1, provided
it can be done without violating the constraint that the hidden set be con-
contained in F{. Define the corresponding class
?! = {Сс50:Сп]?;54Я; for each i}.
The class (el has nothing much to do with ^0. The only connection is that all
its hidden sets, the sets it overlooks, are bigger. Let us show that this implies
^i has a greater cardinality than ^0. (Notice: the assertion is not that
(fi r— (fi \
'©O — » W
Check that the map Ci-» C\{1} is one-to-one from <ё0\(ё1 into ^"Д^о.
Start with any С in ^oX^i- By definition, С n F^ Ht for every i, but
С n Fj = H'j for at least one j. Deduce that Я,- ф H'j, so 1 belongs to С
and Fj and H'j, but not to H}. The stripping of the point 1 does define a
one-to-one map. Why should C\{1} belong to "^Д^о? Observe that
(C\{1}) П Fj = H'j\{l) = Hj,
which bars C\{1} from belonging to ^0. Also, if Ft contains 1 then so must
Щ, but C\{1} certainly cannot; and iff; doesn't contain 1 then
(C\{1}) nFi = CnFi*Hi = Щ.
In either case (C\{1}) n F{ ф Н\, so C\{1} belongs to ^

20 II. Uniform Convergence of Empirical Measures
Repeat the procedure, starting from the new hidden sets and with 2 taking
over the role played by 1. Define H- = (Щ и {2}) n Ft and
<?2 = {C ? So: С n F, Ф Щ for each i}.
The cardinality of #2 is greater than the cardinality of ^v Another N — 2
repetitions would generate classes (ёъ, #4,..., ^N with increasing cardinali-
cardinalities. The hidden sets for ^N would fill out the whole of each Ft: the special
case already treated. ?
17 Corollary. If a class shatters no set of V points, then it must have polynomial
discrimination of degree no greater than V — 1. ?
All we lack now is a good method for identifying classes that have trouble
picking out subsets from large enough sets of points.
18 Lemma. Let <§ be a finite-dimensional vector space of real functions on S.
The class of sets of the form {g > 0},for g in <&, has polynomial discrimination
of degree no greater than the dimension of IS.
Proof. Write V — 1 for the dimension of <S. Choose any collection {sb ..., sv}
of distinct points from S. (Everything reduces to triviality if S contains fewer
than V points.) Define a linear map L from <& into IR7 by
Ug) = (fif(si), • • •, g(sv)).
Since L'S is a linear subspace of IR*' of dimension at most V — 1, there exists
in ]RF a non-zero vector у orthogonal to VS. That is,
E Mfo) = ° f°r each g in <&,
i
or
A9) Z 7ie(Si) = E (-УдФд for each g.
{+} {->
Here { +} stands for the set of those i for which yt > 0, and { — } for those
with уi < 0. Replacing у by — у if necessary, we may assume that { —} is
non-empty.
Suppose there were a g for which {g > 0} picked out precisely those
points S; with i in { + }. For this g, the left-hand side of A9) would be > 0,
but the right-hand side would be < 0. We have found a set that cannot be
picked out. ?
Many familiar classes of geometric objects fall within the scope of the
lemma. For example, the class of subsets of the plane generated by the linear
space of quadratic forms ax2 + bxy + cy2 + dx + ey + f includes all
closed discs, ellipsoids, and (as a degenerate case) half-spaces. More com-
complicated regions, such as intersections of 257 closed or open half-spaces, can
be built up from these by means of Lemma 15. You can feed them into

II.4. Classes of Sets with Polynomial Discrimination 21
Theorem 14 to churn out a whole host of generalizations of the classical
Glivenko-Cantelli theorem.
Uniform limit theorems for polynomial classes of sets have one thing in
common: they hold regardless of the sampling distribution. This happens
because the number An(i;) of subsets picked out by the class from the sample
{^l5 ...,?„} can be bounded above by a polynomial in n, independently of
the configuration of that sample. Without the uniform bound the inequality
A2) would be replaced by
B0) F{ ||РИ° || > ie 1%} < 2Д„© exp( - ns2/32).
Write Wn for the minimum of 1 and the right-hand side of B0). Then the
argument from the Integration step gives the sharper bound
Р{||Р„ - P|| > e} < 4JPWn
for all n large enough. Thus a sufficient condition for ||PB — P|| to converge
in probability to zero is: JPWn -» 0 for each e > 0. Equivalently, because
0 < Wn < 1, we could check that log А„(^) = op(ri). Theorem 16 helps here.
where V = K(?i> ¦•¦,й is the smallest integer such that Sj shatters no
collection of V points from {?u ..., ?„}. Set к = V - 1. If к < \n, all the
terms in the sum for Bn(k) are less than (?):
'1 log B(k) < n'1
log Bn(k) < n'1 log[(/c + l)n!/(n - /с)! /с!].
Three applications of Stirling's approximation and some tidying up reduce
the right-hand side to
-A - k/n) log(l - k/n) - (k/n) log(tyn) + o(l),
which tends to zero as k/n -» 0. It follows that both n log А„ -» О and
Il-P« — P|| -» 0 in probability, if V/n -» 0 in probability.
If we don't know how fast V/n converges to zero, we can't use the Borel-
Cantelli lemma to deduce from these inequalities that ||Р„ — Р|| converges
almost surely to zero. But there is another reason why the convergence in
probability implies the stronger result.
Symmetry properties would force ||Р„ — Р|| to converge almost surely to
some constant, no matter how V/n behaved. Given Р„, the unordered set
{^!,..., ?„} is uniquely determined, but there's no way of deciding the order
in which the observations were generated. Given Pn+ u we know slightly less
about {^lr..., ?,„}; it could be any of the (n + 1) possible subsets of size n
obtained by deleting one of the support points of Pn+1. (Count coincident
observations as distinct support points.) The conditional distribution of Р„
given Pn+1 must be uniform on one of these (n + 1) subsets, each subset
being chosen with probability (n + I). The conditional expectation
of Р„ given Pn+1 (in the intuitive sense of the average over the n + 1
possible choices for Pn) must be Pn+1. The extra information carried

22
II. Uniform Convergence of Empirical Measures
by Pn+2> Рп+з>--- adds nothing more to our knowledge about Pn; the
conditional expectation of Pn given the u-field generated by Pn+1, Pn+2, ¦ ¦ ¦
still equals Pn+1. That is, the sequence {Pn} is a reversed martingale, in some
wonderful measure-valued sense. Apply Jensen's inequality to the convex
function that takes Pn onto \\Pn - P\\ to deduce that {\\Р„ - Р\\} is a bounded,
reversed submartingale. (Problem 11 arrives at the same conclusion in a
slightly more rigorous manner.) Such a sequence must converge almost
surely (Neveu 1975, Proposition V-3-13) to a limit random variable, W. Since
W is unchanged by finite permutations of {?;}, the zero-one law of Hewitt
and Savage (Breiman 1968, Section 3.9) forces it to take on a constant value
almost surely. The only question remaining for the proof of a uniform strong
law of large numbers is whether the constant equals zero or not: convergence
in probability to zero plus convergence almost surely to a constant gives
convergence almost surely to zero.
21 Theorem. Let Si be a permissible class of subsets of S. A necessary and
sufficient condition for
sup \PnD — PD | -* 0 almost surely
is the convergence ofn~lVn to zero in probability, where Vn = Vn{?,u ..., ?„)
is the smallest integer such that 3) shatters no collection of Vn points from
{Si. • • •. U-
Proof. You can formalize the sufficiency argument outlined above; necessity
is taken care of in Problem 12. Ill
Because 0 < n 1Vn < 1, convergence in probability of n 1K, to zero is
equivalent to n~ ^WVn -* 0. This has an appealing interpretation. The uniform
strong law of large numbers holds if and only if, on the average, the class of
sets behaves as if it has polynomial discrimination with degree but a tiny
fraction of the sample size.
22 Example. Let's see how easy it is to check the necessary and sufficient
condition stated in Theorem 21. Consider the class <? of all closed, convex
subsets of the unit square [0,1]2. We know that there exist arbitrarily large
collections of points shattered by (€. Were we sampling from a non-atomic

II.4. Classes of Sets with Polynomial Discrimination
23
distribution concentrated around the rim of a disc inside [0, I]2, the class
<€ could always pick out too many subsets from the sample. Indeed, there
would always exist a convex С with PnC = 1 and PC = 0. But such con-
configurations of sample points should be thoroughly atypical for sampling
from the uniform distribution on [0, I]2. Theorem 21 should say something
useful in that case.
How large a subcollection of sample points can <€ shatter? Suppose it is
larger than the size requested by Theorem 21. That is, for some e > 0,
JP{n~1Vn > s}>s infinitely often.
This will lead us to a contradiction.
A set of к points is shattered by ^ if and only if none of the points can be
written as a convex combination of the others; each must be an extreme
point of their convex hull. So there exists a convex set whose boundary has
empirical measure at least k/n, which seems highly unlikely because P puts
zero measure around the boundary of every convex set. Be careful of this
plausibility argument; it contains a hidden appeal to the very uniformity
result we are trying to establish. An approximation argument will help us to
avoid the trap.
Divide [0, I]2 into a patchwork of m2 equal subsquares, for some fixed
m that will be specified shortly. Because the class si of all possible unions of
these subsquares is finite,
Ш sup | Pn A - PA | > \г \ < \z for all n large enough.
The js here is chosen to ensure that, for some n,
IPIn^ > e and sup \Р„А - РА\<\г}> \г.
si
Since a set with positive probability can't be empty, there must exist a sample
configuration for which <€ shatters some collection of at least ns sample
points and for which \Р„А - PA\ < \г for every A in stf. Write H for the
convex hull of the shattered set, and AH for the union of those subsquares
that intersect the boundary of H. The set AH contains all the extreme points
of H, so Р„АН > г; it belongs to si, so \PnAH - PAH\ < \г. Consequently
PAH > \г, which will give the desired contradiction if we make m large
enough.
\шкш
«в :
И
и я
ш
•
•

24
II. Uniform Convergence of Empirical Measures
Experiment with values of m equal to a power of 3. No convex set can
have boundary points in all nine of the subsquares; the middle subsquare
would lie inside the convex hull of four points occupying each of the four
corner squares. For every convex С the P measure of the union of those
—-— \—
subsquares intersecting its boundary must be less than f. Subdivide each
of the nine subsquares into nine parts, then repeat the same argument
eight times. This brings the measure of squares on the boundary down to
(fJ. Keep repeating the argument until the power of | falls below js. That
destroys the claim made for AH. ?
II.5. Classes of Functions
The direct approximation methods of Section 2 gave us sufficient conditions
for the empirical measure Pn to converge to the underlying P uniformly over
a class of functions,
sup | Pnf-Pf
3?
0 almost surely.
The conditions, though straightforward, can prove burdensome to check. In
this section a transfusion of ideas from Sections 3 and 4 will lead to a more
tractable condition for the uniform convergence. The method will depend
heavily on the independence of the observations {?;}, but the assumption of
identical distribution could be relaxed (Problem 23).
Throughout the section write |[ • || to denote sup»» | ¦ |.
Let us again adopt a naive approach towards possible measurability
difficulties, with only the word permissible (explained in Appendix C) to
remind us that some regularity conditions are needed to exclude patho-
pathological examples.
A domination condition will guard against any complications that could
be caused by #" containing unbounded functions. Call each measurable F
such that |/| < F, for every /in J% an envelope for J*\ Often F will be taken
as the pointwise supremum of | /1 over J5", the natural envelope, but it will

II.5. Classes of Functions 25
be convenient not to force this. We shall assume PF < oo. With the proper
centering, the natural envelope must satisfy this condition (Problem 14) if
the uniform strong law holds.
The key to the uniform convergence will again be an approximation
condition, but this time with distances calculated using the j§? * seminorm
for the empirical measures themselves. This allows us to drop the require-
requirement that the approximating functions sandwich each member of J^
23 Definition. Let Q be a probability measure on S and 2F be a class of
functions in jS? 1(Q). For each e > 0 define the covering number iV\(e, Q, 3F)
as the smallest value of m for which there exist functions gu ..., gm (not
necessarily in J*) such that miny Q | / - g} \ < e for each / in #". For definite-
ness set iV\(e, <2, #¦) = oo if no such m exists. ?
If #" has envelope F we can require that the approximating functions
satisfy the inequality \gj\ < F without increasing iV^e, Q, ?F): replace gs by
max{-F, min[F, gj]}.
We could also require gj to belong to #, at the cost of a doubling of e: replace
gj by an fj in J^ for which Q\fj — g}I < e.
24 Theorem. Let 3F be a permissible class of functions with envelope F.
Suppose PF < oo. If Р„ is obtained by independent sampling from the prob-
probability measure P and if log N±(e, Р„, ?F) = op(n) for each fixed e > 0, then
\Pnf — Pf\ -» 0 almost surely.
Proof. Problem 11 (or the slightly less formal symmetry argument leading
up to Theorem 21 in Section 4) shows that {\\Р„ — Р\\} is a reversed sub-
martingale ; it converges almost surely to a constant. It will suffice if we
deduce from the approximation condition that {\\Р„ — Р\\} converges in
probability to zero.
Exploit integrability of the envelope to truncate the functions back to a
finite range. Given e > 0, choose a constant К so large that PF{F > K} < e.
Then
sup \PJ -Pf\< sup \PJ{F <K}- Pf{F < K}\
+ supPn\f\{F > K} + supP\f\{F >K}.
Because | /1 < F for each / in !F, the last two terms sum to less than
PnF{F > K} + PF{F > K}.
This converges almost surely to 2PF{F > K}, which is less than 2e. It
remains for us to show that the supremum over the functions f{F < K}
converges in probability to zero. As truncation can only decrease the ?>1(Р„)
distance between two functions, the condition on log covering numbers also

26 II. Uniform Convergence of Empirical Measures
holds if each/is replaced by its truncation; without loss of generality we may
assume that | /1 < К for each / in J^.
In the two Symmetrization steps of the proof of the Glivenko-Cantelli
theorem (Section 3) we showed that
JP{\\Pn - P\\ > e} < 4P{||P°|| > ie} for n > 8e
-2
where ||-|| denoted a supremum over intervals ( — oo, t] of the real line. The
signed measure P° put mass ±n~1 on each observation ?b ...,?„, the
random ± signs being decided independently of the {^}. The argument
works just as well if || • || denotes a supremum over #, the interpretation
adopted in the current section. The only property of the indicator function
(— oo, /] needed in the Symmetrization steps was the boundedness, which
implied var(Pn( — oo, tj) < n~1. This time an extra factor of K2 would
appear in the lower bound for n.
With intervals we were able to reduce ||P°|| to a maximum over a finite
collection; for functions the reduction will not be quite so startling. Given
%, choose functions gu ..., gM, where M = N^e, Р„, JF), such that
min Pn | / - 0j | < h for each / in <F.
j
Write /* for the gj at which the minimum is achieved.
Now we reap the benefits of approximation in the j§? 1(PH) sense. For any
function g,
= Ря\д\-
Choose g = / — /* for each / in turn.
P^sup|P°/| > Щ& < P^sup[|P°/*| + PJ/-/*|] >ie
< Jp\max\P°gj\ > УЩ because Р„|/ - /*| < |e
(. J J
< iVi^e, Р„, J*) max
j
Once again Hoeffding's Inequality (Appendix B) gives an excellent bound
on the conditional probabilities for each g-}.
n
f Bgj(QJ]
2 ехр(-и?2Д28К2) because \g}\ < K.

II.5. Classes of Functions
27
When the logarithm of the covering number is less than ns2/256K2, the in-
inequality
P{||P°|| > ie|^} < 2 exp[log N&e, Pn, P) - пе2/тК21
will serve us well; otherwise use the trivial upper bound of 1. Integrate out.
IP{ira > Ы < 2exp(-ne2/256K2) + P{log iV^e, Pn, &) > ne2/256K2}.
Both terms on the right-hand side of the inequality converge to zero. ?
For some classes of functions the conditions of the theorem are easily
met because iV^e, Р„, 3P) remains bounded for each fixed e > 0. This
happens if the graphs of the functions in #" form a polynomial class of sets.
The graph of a real-valued function / on a set S is defined as the subset
Gf = {(s,t):0<t<f(s) or f(s)<t<0}
of S ® IR. We learn something about the covering numbers of a class J^ by
observing how its graphs pick out sets of points in S ® IR.
25 Approximation Lemma. Let 2F be a class of functions on a set S with
envelope F, and let Qbe a probability measure on S with 0 < QF < oo. If the
graphs of functions in S' form a polynomial class of sets then
N&QF, Q, &) <As~w for 0 < e < 1,
where the constants A and W depend upon only the discriminating polynomial
of the class of graphs.
Proof. Let/j,..., fm be a maximal collection of functions in i* for which
Q\f-fj\>eQF if i*j.
Maximality means that no larger collection has the same property; each /
must lie within eQF of at least one f-r Thus m > N^eQF, Q, !F).
Choose independent points (s1? ГД ..., (sk, tk) in S ® IR by a two-step
procedure. First sample s( from the distribution Q(-F)/Q(F) on S. Given
st, sample tt from the conditional distribution Uniform[ — F(s{), F(s;)].
The value of k, which depends on m and e, will be specified soon.
-F

28 II. Uniform Convergence of Empirical Measures
The graphs Gt and G2, corresponding to fx and/2, pick out the same
subset from this sample if and only if every one of the к points lands outside
the region G( AG2. This occurs with probability equal to
П El - IWPtfSi, tt) e G, Л G2|s;}] = [1 - JP(\ frist) - f2(s1)\/2F(s1))f
/i -/2
< (i - W
< exp(-jks).
Apply the same reasoning to each of the B) possible pairs of functions ft
and fj. The probability that at least one pair of graphs picks out the same
set of points from the к sample is less than
1 exp(- jks) < \ expB log m - \кг).
Choose к to be the smallest value that makes the upper bound strictly less
than 1. Certainly к < A + 4 log m)/s. With positive probability the graphs
all pick different subsets from the к sample; there exists a set of к points in
S ® IR from which the polynomial class of graphs can pick out m distinct
subsets. From the defining property of polynomial classes, there exist
constants В and V such that m < Bkv for all к > 1. Find n0 so that
A+4 log n)v < n1'2 for all n > n0. Then either m < n0 or m < Bm1/2s~v.
Set W = 2Fand A = max(B2, n0). ?
To show that a class of graphs has only polynomial discrimination we can
call upon the results of Section 4. We build up the graphs as finite unions and
intersections (Lemma 15) of simpler classes of sets. We establish their dis-
discrimination properties by direct geometric argument (as for intervals and
quadrants) or by exploitation of finite dimensionality (as in Lemma 18) of a
generating class of functions.
26 Example. Define a center of location for a distribution P on IRm as any
value в minimizing the criterion function.
Щ6, P) = P<K\x - 6\),
where ф(-) is a continuous, non-decreasing function on [0, 00) and | • | denotes
the usual euclidean distance. If Рф(\х\) < оо and ф(-) does not increase too
rapidly, in the sense that there exists a constant С for which <K2t) <
for all t, then the function Щ-, P) is well defined:
HF,P) < Р1фB\9\){\х\ < \9\} + Сф(\х\){\х\ > |0|}] < 00.
If trivial cases are ruled out by the requirement
B7) Р{х:ф(\х\)<ф(ю-)}>0,

II.5. Classes of Functions 29
the minimizing value will be achieved (Problem 21); extra regularity condi-
conditions on P, which are satisfied by distributions such as the multivariate
normal, ensure uniqueness (Problem 22). For this example, let us not get
bogged down by the exact conditions needed; just assume that H{-,P) has a
unique minimum at some 90.
Estimate 90 by any value 9„ that minimizes the sample criterion function
H( •, Р„). То show that 9n converges to 90 almost surely, it will suffice to prove
that Н(в„, P) -» HF0, P) almost surely, because H(9, P) is bounded away
from H(90, P) outside each neighborhood of 90.
The argument follows the same pattern as for k-means (Example 4). First
show that 9„ eventually stays within a large compact ball {| x | < K}. Choose
the К greater than 1601 and large enough to ensure that
ф&К)Р{\х\<^К}>Рф(\х\),
which is possible by B7): as К tends to infinity the left-hand side converges
to ф(оо-). Such а К will suffice because Я@, Р„) = Р„ф(\х\) and
Щ9, Pn) > ф№)Рп{М < Ж
for every 9 with \в\> К.
If we prove uniform almost sure convergence of Р„ to P over the class
then we can deduce almost surely that Щ9„, P) -» H(90, P) from
H(9n, Pn) - H(9n, P) -> 0,
H(9n, Pn) < H(90, Pn) -> H(90, P) < H(9n, P).
Here's our chance to apply Theorem 24.
The class !F has envelope фBК) + Сф(\х\), which satisfies the first
requirement of the theorem. Bound the covering numbers by showing that
the graphs of functions in J* have only polynomial discrimination. We may
assume that 0@) = 0. The graph of ф(\- — 9\) contains a point (y, i), with
t > 0, if and only if \y - 9\ > a(t), where a(t) denotes the smallest value of
a for which ф(а) > t. From a collection of points {(y(, t;)} the graph picks
out those points satisfying |y;|2 - 2yr 9 + \9\2 - a(ttJ > 0. Construct
from (yt, Q a point zt = (yt, \yt\2 - a(t;J) in IRm+1. On IRm+1 define а
vector space ^ of functions
9p,y,s(x,s) = P-x + ys + S
with parameters fi in IRm and y, S in IR. By Lemma 18, the sets {g > 0}, for
g in <§, pick out only a polynomial number of subsets from {z;}; those sets
corresponding to functions in ^ with fi = -29, у = 1, and д = \9\2 pick
out even fewer subsets from {z;}. The graphs of functions ф(\- — 9\) have
only polynomial discrimination. ?

30 II. Uniform Convergence of Empirical Measures
Buried within the argument of the last example lies a mini lemma relating
finite dimensionality of a class of functions to discrimination properties of
the graphs. It is perhaps worth noting.
28 Lemma. Let #" be a finite-dimensional vector space of real functions on S.
The class of graphs of functions in 2F has polynomial discrimination.
Proof. Define on S ® IR the vector space & of real functions
Define
Gi = {<7/,i >0} = {(s,t):f(s)>t}
G2 = {<?-/,-! >0} = {(M):/(s)<t}
Observe that
Gf = Gx{t > 0} и G2{t < 0}.
Invoke Lemmas 18 and 15. ?
29 Example. Now let's have another try at the /c-means problem introduced
in Example 4. There we met the class of functions of the form
fa,b(x) = \x - a\2 л \x - b\2
with (a, b) ranging over the subset С of IR2. We know that supc/a ъ < F for
an F with PF < oo, provided P|x|2 < oo.
The graphs of functions | x — a\2 form a class with polynomial discrimina-
discrimination, by Lemma 28. Intersect pairs of such graphs in all possible ways to get
the graphs of all functions/e> b. Apply Lemma 15 (to handle the intersections),
then the Approximation Lemma (to bound covering numbers), then Theorem
24:
sup |Р„/а>b- Pfa,b\-*0 almost surely,
с
Compare this with the direct approximation argument of Example 4. ?
II.6. Rates of Convergence
Theorem 24 imposed the condition log N^s, Р„, 2F) = op(ri) on the rate of
growth of the covering numbers. Many classes meet the condition easily.
For example, if the graphs of functions from J* have only polynomial
discrimination, the covering numbers stay smaller than a fixed polynomial
in e'1. The method of proof will deliver a finer result for such a class; we can
get good bounds not just for a fixed e deviation but also for an г„ that de-
decreases to zero as n increases. That is, we get a rate of convergence for the

II.6. Rates of Convergence 31
uniform strong law of large numbers. The method will also allow the class
of functions to change with n, provided the covering numbers do not grow
too rapidly. If the classes are uniformly bounded, and if the supremum of
Pf2 over the nth class tends to zero as n increases, this will speed the rate of
convergence.
Consider the effect upon the two key steps of the argument for Theorem
24 if we let both e and $F depend on n. As before, replace Р„ - P by the
signed measure P° that places mass + n~x at each of ?u ..., ?„. The sym-
metrization inequality
C0) IPJsup \PJ-Pf\> 8Л < 4ip|sup \P°J\ > 2en
still holds provided var(Pn /)/Dе„J < \ for each/in Jv The approximation
argument and Hoeffding's Inequality still lead to
C1) ipjsup |P°/| > 2б„| J < 2ади, Р„, #„) ехрГ-^?п2/(тах Png2
where the maximum runs over all W1(?n,Pn,#n) functions {g^ in the
approximating class.
If the supremum over J*n of P/2 tends to zero, one might expect that the
maximum over the {Р„д2} should converge to zero at about the same rate.
The next lemma will help us make the idea precise if the approximating
{g^ are chosen from #"„. As squares of functions are involved, covering
numbers need to be calculated using if2 seminorms rather than the if1
seminorms of Definition 23.
32 Definition. Let Q be a probability measure on S and #" be a class of
functions in if 2(B)- For each г > 0 define the covering number N2(e, Q, 3?)
as the smallest value of m for which there exist functions gu ..., gm (not
necessarily in J*) such that ттДB(/— fifjJI'2 < ? for each /in J*. For
defmiteness set N2(e, Q, #") = oo if no such m exists. ?
As before, if #" has envelope F we can require that \g}\ < F; and we could
require g} to belong to J^ at the cost of a doubling of г, by substituting for
gj an fj in & such that (QfJ} - «?/I/2 < е.
33 Lemma. Let 3F be a permissible class of functions with |/| < 1 and
(Pf2I12 < 6 for each fin &. Then
H>|sup(Pn/2I/2 > 8<4 < 4H>rjV2(<5, Р„, &) ехр(-и<52) л 1].
Proof. Let P'n be an independent copy of Р„. Write Z(/) for (Pn/2I/2 and
Z'(f) for (P'nf2I12. From the Symmetrization Lemma of Section 3,
C4) IP jsup | Z(/) | > 84 < f IP {sup | Z(f) - Z'(f) | >

32 II. Uniform Convergence of Empirical Measures
because, for each / in 3F,
F{|Z'(/)| < 23} > 1 - IP(Z'(/J)/4<52 = 1 - (P/2)/4<52.
The intrusion of the square-root into the definition of Z and Z' would
complicate reduction to the P° process. Instead, construct Р„ and Р'„ by a
method that guarantees equal numbers of observations for both empirical
measures. Sample 2n observations Xu ..., X2n from P. Independently of
the vector X of these observations, generate independent selection variables
т1;..., т„ with 1Р{т(г) = 1} = 1Р{т@ = 0} = \. Use these to choose one
observation from each of the pairs (Х2;_1; X2i), for ? = 1,..., n. Construct
Pn from these observations, and Р'„ from the remaining observations. For-
Formally, set ?,- = Z2i_i+t(i) and ^ = I2j_l(i), then put mass n on each
point Ь for Pn, and put mass n~x on each ?• for P;. Set S2n = i(PB + P^). It
has the same distribution as P2n.
Temporarily write p(-) for the ?>2(S2n) seminorm: p(f) = (S2nf2I/2-
Given X, find functions gu..., gM, where M = N2(^/2<5, S2n, #"), for which
min p(/ — gj) < y/28 for every / in
j
We may assume that \gj\ < 1 for every;'. The awkward y/2 will disappear at
the end when we convert to if 2(Р„) covering numbers.
From the triangle inequality for the i?2(Pn) seminorm, and the bound
2S2n for Р„, deduce for each / and g that
- Z(g)\ < Z(f ~g)< BS2n(f - gJI'2 = Jlp(f - g)
and similarly for Z'. For/in #" set g equal to the g} that minimizes p(/ - gfj).
Then
lj)-Z'(gj)\ + Z'(gj-f)
?46 + \Z{gj)-Z'(gj)\,
whence
IP\sup \Z(f) - Z'(/)| > 6<5|xi < P-Wx|Z@;) - Z'(gj)\ > 2d\X
< M max JP{\Z(gj) - Z'(9j)[ > 2<5|X}.
Fix a g with |gf| < 1. Bound \Z(g) - Z'(g)\ by
\Z{gf ~ Z'{gf\l[Z{g)
which is less than
\Png2 - P'ng2\/BS2ng2y2

II.6. Rates of Convergence 33
thanks to the inequality a112 + b1'2 > (a + bI/2, for a, b > 0. Apply
Hoeffding's Inequality (Appendix B).
<P
i=l
>2n5BS2ng2I'2\X
\-
< 2Qxp\-l6n2S2S2ng
< 2ехр(-2и<52)
because the inequality | g \ < 1 implies
*2<-i) - 92(Х2дУ < Ь^м) + g\X2d ='2nS2ng2.
i = 1 i = 1
Notice how the <S2ngf2 factor cancelled. That happened because we sym-
symmetrized Z instead of Pn.
Setting g equal to each gj in turn, we end up with
pjsup |Z(/) - Z'(/)| > 651 xl < 2W2(y2<5, 52„, #-) exp(-2n<52)
I ^ J
Decrease the right-hand side to the trivial upper bound of 1, if necessary,
then average out over X:
C5)
ipjsup \Z(f) - Z'(f)\ > 6s\ < JP[_2N2(^25, S2n, F) ехр(-2п<52) л 1]
The presence of the S2n is aesthetically unpleasing, especially since both
S and J5" will always depend on n in applications. Problem 24 allows us to
replace it by Pn, at a small cost:
^, S2n, ^) exp(-2nS2) л 1]
< IP[2N2(<5, Pn, ^)N2(S, P'n, &) ехр(-2и<52) л
< IP[2N2(<5, Р„, #") ехр(-и<52) л 1]
by virtue of the inequality xy л 1 < (x л 1) + (у л 1) for x, у > 0. The
empirical measures Р„ and Р'„ have the same distribution; the sum of expec-
expectations is less than
т52)л 1].
Combine the last bound with C4) and C5) to complete the proof. ?
The bounds we have for j§?i covering numbers can be converted into
bounds for if2 covering numbers. For the sake of future reference, consider

34 II. Uniform Convergence of Empirical Measures
a class 3F with envelope F. Set F equal to a constant to recover the inequalities
for a uniformly bounded 3F.
36 Lemma. Let 3Fbe a class of functions with strictly positive envelope F, and
Qbe a probability measure with QFZ < со. Define P(-) = Q(F2)/Q(F2) and
g= {f/F:feF}. Then:
(i) NMQF2I'2, Q, &) < N2(e, P, <S) < N.&2, P, 9);
(ii) if the class of graphs of functions in ?F has only polynomial discrimination
then there exist constants A and W, not depending on Q and F, such that
21'2, Q, &) <AS-wfor0<S< 1.
Proof. For every pair of functions /b /2 with \fi\ < F and |/21 < F,
(QF'r'QIA - /2|2 = P\fJF - f2/F\2 < 2P\fJF - fJF|.
Assertion (i) follows from these connections between the seminorms used
in the definitions of the covering numbers.
The graph of/ covers a point (x, t) if and only if the graph off/F covers
(x, t/F(x)); the graphs of functions in 'S also have polynomial discrimination.
Invoke the Approximation Lemma A1.25) for classes with envelope 1.
Rechristen A2W as Amd2W as W. ?
It is now an easy matter to prove rate of convergence theorems for uni-
uniformly bounded classes of functions. As an example, here is a result tailored
to classes whose graphs have polynomial discrimination. (Remember that
the notation х„ g> yn means xjyn -> со.)
37 Theorem. For each n, let 2Fn be a permissible class of functions whose
covering numbers satisfy
sup N^e, Q, jg < Ae~w for 0 < s < 1
Q
with constants A and W not depending on n. Let {an} be a non-increasing
sequence of positive numbers for which nS2a2 g> log n. If \f\ < 1 and (P/2I/2
< <5„ for each fin 3Fn, then
sup \Pnf — Pf\ -4 52an almost surely,
Proof. Fix e > 0. Set г„ = se2an. Because
var(Pn/)/DenJ < (l6ne2S2nz2yl < (logn)
the symmetrization inequality C0)'holds for all n large enough:
pjsup \PJ - Pf\ > sj < 4ipjsup \P°f\ > 2e\

II.6. Rates of Convergence 35
Condition on \. Find approximating functions {gj} as in C1). We may
assume that each gf belongs to $Fn. (More formally, we could replace gj by
an fj in <Fn for which Pn\f} - g}\ < e, then replace e by 2e throughout the
argument.) From C1),
ipjsup \P°J\ > 2e\ < 2As;w ехр(-^е„2/645и2) + pjsup PJ2 > 64<5„2
The first term on the right-hand side equals
2Ae~w exp[PF log(l/<52cO - ne2<5*
which decreases faster than every power of n because log(l/<52an) increases
more slowly than log n, while n<52a2 increases faster than log n. Lemma 33
bounds the second term by
nyw exp(-n<52)
which converges to zero even faster than the first term. An application of the
Borel-Cantelli lemma completes the proof. ?
Specialized to the case of constant {an}, the constraint placed on {5„} by
Theorem 37 becomes <52 §> n~l log n. This particular rate pops up in many
limit theorems involving smoothing of the empirical measure because
(Problem 25) it corresponds to the size of the largest ball containing no
sample points.
38 Example. Let P be a probability measure on Md having a bounded
density p( ¦) with respect to d-dimensional lebesgue measure. One theoretically
attractive method for estimating p(-) is kernel smoothing: convolution of
the empirical measure Р„ with a convenient density function to smear out
the point masses into a continuous distribution. The estimate is
Pn(x) = PnG-dKx,a,
where
for some density function К on lRd and a scaling factor a that depends on n.
Note that the a~d is not part of Kx_ff.
The traditional method for analyzing р„ compares it with the correspond-
corresponding smoothed form of p,
p(x) = FpB(x) = Pa'dK^a.
The difference pn — p breaks into a sum of a random component, pn — p,
and a bias component p — p. The smaller the value of a, the smaller the
bias (Problem 26); the slower a tends to zero, the faster р„ — p converges to
zero. These two conflicting demands determine the rate at which р„ — р can
tend to zero.

36 II. Uniform Convergence of Empirical Measures
If pn — p is to converge uniformly to zero, we must not allow a to decrease
too fast. Otherwise we might somewhere be smoothing Pn over a region
where it puts too little mass, making \pn — p\ too large. Theorem 37 will let
<7d decrease almost as fast as и log n, the best rate possible.
For concreteness take К to be the standard normal density, which enjoys
the uniform bound 0 < Kx a < 1 for all x and a (the constant {2%)~il2 is
too awkward for repeated use.) The class of graphs of all Kx „ functions has
polynomial discrimination (Problem 28 proves this for a whole class of
kernel functions). Assume also that the density p(-) is bounded, say 0 <
p(-) < 1. In that case
because
x)/a)p(y) dy
= ad ЫA)р(х + at) dt.
Everything is set up for Theorem 37. Put <х„ = 1 and b2n = od. Provided
a* 5> n'1 logn,
sup \PnKxa - PKxa\ < ad almost surely,
X
that is,
sup | pn(x) — p(x) | -> 0 almost surely.
X
Smoothness properties of p(-) determine the rate at which the bias term
converges to zero (Problem 26). For example, one bounded derivative would
give maximum bias of order 0{a) in one dimension. We would then want
something like a3 g> n~l logn to get a comparable rate of convergence
from Theorem 37 for р„ — p. ?
Notes
Uniform strong laws of large numbers have a long history, which is described
in the first section of the survey paper by Gaenssler and Stute A979). Theorem
2 comes from DeHardt A971), but the idea behind it is much older. Billingsley
and Topsoe A967) and Topsoe A970, Sections 12 to 15) developed much
deeper results for the closely related area of uniformity in weak convergence.
Hartigan A975) is a good source for information about clustering.
Hartigan A978) and Pollard A981b, 1982b, 1982c) have proved limit

Notes 37
theorems for fe-means. Engineers know the method of /c-means by the
name quantization. The March 1982 IEEE Transactions on Information
Theory was devoted to the topic. Denby and Martin A979) proposed the
generalized M-estimator of Example 5.
It has long been appreciated that comparison of two independent empirical
distribution functions transforms readily into a combinatorial problem.
Gnedenko A968, Section 68), for example, reduced the analysis of two-
sample Smirnov statistics to a random walk problem. The method in
the text has evolved from the ideas introduced by Vapnik and Cervonenkis
A971). Their method of conditioning turned calculations for a single fixed
set into an application of an exponential bound for hypergeometric tail
probabilities. Classes with polynomial discrimination are often called VC
classes in the literature. The symmetrization described in Section 3 is a well-
known method for proving central limit theorems in Banach spaces (Araujo
and Gine 1980, Section 2; Gine and Zinn 1984, Section 1). Steele A975,1978)
discovered subadditivity properties of empirical measures that strengthen
the Vapnik-Cervonenkis convergence in probability results to necessary and
sufficient conditions for almost sure convergence. Pollard A981b) introduced
the martingale tools and the randomization method described in Problem
12 to rederive Steele's conditions. Theorem 21 and Example 22 are based on
Steele A978); Flemming Tops0e explained to me the f-trick for convex sets.
See Gaenssler and Stute A979) for more about the history of this example.
The proof of Theorem 16, which is often called the Vapnik-Cervonenkis
lemma, is adapted from Steele A975). Sauer A970) was the first to publish
the inequality in precisely this form, although he suggested that Shelah had
also established the result. (I am unable to follow the two papers of Shelah
that Sauer cited.) Vapnik and Cervonenkis A971) proved an insignificantly
weaker version of the inequality. Dudley A978, Section 7) has dug further
into the history. Lemma 18 is due to Steele A975) and Dudley A978).
The sum of binomial coefficients in Theorem 16 and the randomization
method of Problem 12 suggest that a direct probabilistic path might lead
to the necessary and sufficient conditions of Theorem 21. Does there exist
a set of n independent experiments that can be performed to decide whether
a particular subset of a collection of n points can be picked out by a particular
class of sets? Or maybe the experiments could be linked in some way. For a
class that picks out only subsets with fewer than V points the solution is
easy—it lies buried within the proof of Theorem 16.
The notes to Chapter VII will give more background to the concept of
covering number.
Vapnik and Cervonenkis A981) have found necessary and sufficient
conditions for uniform almost sure convergence over bounded classes of
functions. They worked with if00 and if1 distances between functions. Gine
and Zinn A984) applied chaining inequalities and gaussian-process methods
(see Chapter VII) to deduce if2 necessary and sufficient conditions. The
square-root trick in Lemma 33 comes from Le Cam A983) via Gine and

38 II. Uniform Convergence of Empirical Measures
Zinn. Kolchinsky A982) and Pollard A982c) independently introduced the
symmetrization method used in Lemma 33. The Approximation Lemma is
due to Dudley A978), who proved it for classes of sets. The extension to
classes of functions was proved ad hoc by Pollard A982d), using an idea of
Le Cam A983).
The density estimation literature is enormous. Silverman A978) and Stute
A982b) have found sharp results involving the и log и critical rate.
Bertrand-Retali A974, 1978) proved that ad g> n~l log и is both necessary
and sufficient for uniform consistency over the class of all uniformly con-
continuous densities on IRA
Universal separability was mentioned in passing by Dudley A978) as a
way of avoiding measurability difficulties.
Most of the results in the chapter can be extended to independent, non-
identically distributed observations (Alexander 1984a).
Problems
[1] In Example 4 relax the assumption that W{-, ¦, P) has a unique minimum; assume
the function achieves its minimum for each (a, b) in a region D. Prove that the
distance from the optimal (а„, Ь„) to D converges to zero almost surely, provided
P does not concentrate at a single point. [The condition rules out the possibility
of a minimizing pair for W(-, -, P) with one center off at infinity.]
[2] Here is an example of an ad hoc method to force an optimal solution into a
restricted region. Suppose an estimator corresponds to the/, that minimizes Pnf
over a class J5", and that we want to force/„ into a region Ж. Write y0 for the infi-
mum, assumed finite, of Pf over ,?. Suppose there exists a positive function b(-)
on J^ such that, for some г > 0,
b(f) > 7o + e for / in
>[inf//&(/)]
Show that liminf inf^jr Р„/> y0 almost surely. [Trivial if y0 < 0.] Deduce
that/,, belongs to Ж eventually (almost surely). Now read the case A consistency
proof of Huber A967). Compare the last part of his argument with our Theorem 3.
[3] Call a class J5" of functions universally separable if there exists a countable subclass
J^ such that each/in jF can be written as a point wise limit of a sequence in J^.
If J* has an envelope F for which PF < со, prove that universal separability
implies measurability of \\Pn — P\\.
[4] For any finite-dimensional vector space'S of real functions on S, the class S> of sets
of the form {g > 0}, for g in '$, is universally separable. [Express each g in 'S as
a linear combination of some fixed finite collection of non-negative functions. Let
^o be the countable subclass generated by taking rational coefficients. For each
g in 40 there exists a sequence {д„} in <SU for which д„ [ д. Show that {д„ > 0} J.
{g > 0} pointwise.]
[5] The operations in Lemma 15 preserve universal separability.

Problems 39
[6] For a universally separable class 3), the quantity Vn defined in Theorem 21 is
unchanged if® is replaced by its countable subclass @>0. Prove that Vn is measur-
measurable.
[7] Prove that the class of indicator functions of closed, convex subsets of TRd is
universally separable. [Consider convex hulls of finite sets of points with rational
coordinates.]
[8] Theorem 16 informs us that the class of quadrants in Ж2 picks out at most
1 + jN + jN2 subsets from any collection of N points. Find a configuration for
which this bound is achieved.
[9] For N > 2 the sum of binomial coefficients singled out by Theorem 16 is bounded
by Nv. [Count subsets of {1, ..., N} containing fewer than V elements by arrang-
arranging each subset into increasing order then padding it out with enough copies of
the largest element to bring it up to a F-tuple. Don't forget the empty set.]
[10] Let M be a subset of [0, 1] that has inner lebesgue measure zero and outer lebesgue
measure one (Halmos 1969, Section 16). Define the probability measure ц as the
trace of lebesgue measure on M (the measure defined in Theorem A of Halmos
A969), Section 17). Assuming the validity of the continuum hypothesis, put M
into a one-to-one correspondence with the space [0, U) of all ordinals less than
the first uncountable ordinal U (Kelley 1955, Chapter 0). Define <2> as the class
of subsets of [0, 1] corresponding to the initial segments [0, x] in [0, U).
(a) Show that 2 has linear discrimination. [It shatters no two-point set.]
(b) Equip Mm with its product cr-field and product measure цт. Generate ob-
observations ?,, ?2,... on P = Uniform@, 1) by taking them as the coordinate
projection maps on M°°. Construct empirical measures {Р„} from these
observations. Show that sups \PnD — PD\ is identically one.
(c) Repeat the construction with the same <2>, but replace (M°°, /t°°) by a countable
product of copies of Mc equipped with the product measure Я™, where Я
equals the trace of lebesgue measure on Mc. This time sups \PnD — PD\ is
identically zero.
[Funny things can happen when 2 has measurability problems. Argument
adapted from Pollard A981a) and Durst and Dudley A981).]
[11] For independent and identically distributed random elements {?(}, write <f" for
the cr-field generated by all symmetric functions of ?u ..., ?N as N ranges over
n, n + 1, For a fixed function /, apply the usual reversed martingale argument
(Ash 1972, page 310) to show that 1Р(Р„/|<Г+1) = PB+1/. If P(sup^ |/|) < 00,
deduce
p \PJ - Pf\ \Г^\ > sup \Pn+J - Pf\
I &
for every class of functions 3" that makes both suprema measurable.
[12] Here is one way to prove necessity in Theorem 21. Suppose \\Р„ — P\\ -> 0 almost
surely. Construct ^„+ by placing mass n~l at each ?; for which the sign variable
tr,- equals +1; construct ц~ similarly from the remaining ?;'s. Notice that P° =
Hn — Hn- Let N be the number of sign variables cu...,on equal to +1.
(a) Prove that (n/N)n+ has the same distributions as PN. [What if N = 0?]
(b) Deduce that both ||/in+ - \P\\ -> 0 and \\ц~ - jP\\ -> 0, in probability.

40 II. Uniform Convergence of Empirical Measures
(c) Deduce that ||/*„+ - /л„ \\ -* 0 in probability.
(d) Suppose 3> shatters a set F consisting of at least щ of the points ?b ...,?„.
Without loss of generality, at least \m\ of the points in F are allocated to /*„+.
Choose a D to pick out just those points from F. Use independence properties
of the {tr;} to show that, with high probability, /?(D\F) and n~(D\F) are
nearly equal. [Argue conditionally on Pn and the at for those ^ in F.]
(e) Show that /Jn+(D) — ftT(D) ~ hi with high conditional probability. This
contradicts (c).
[13] Rederive the uniform strong law for convex sets (Example 22) by the direct
approximation method of Theorem 2.
[14] Let #" be a permissible class with natural envelope F = sup^ | /1. If \\Р„ - P\\ -* 0
almost surely and if sup^ \Pf\ < oo then PF < oo. [The condition on supj; \Pf\
excludes trivial cases such as SF consisting of all constant functions. From
||Р„ - P\\<& and ||Fn_! -P\\<e deduce n'x\f(Q -Pf\< 2s; almost sure
convergence implies
KVsup \f{Q -Pf\>n infinitely often \ = 0.
if )
Invoke the non-trivial half of the Borel-Cantelli lemma, then replace each t,a by
?i to get
oo > IPjsup |/(?i) - Pf\ I > IPFC^) - constant.
Noted by Gine and Zinn A984).]
[15] Here is an example of how Theorem 24 can go wrong if the envelope F has PF = oo.
Let P be the Uniform@, 1) distribution and let J^ be the countable class consisting
of the sequence {/J, where ft(x) = x~2{(i + I) < x < Г1}. Show that the
graphs have polynomial discrimination and that Pft = 1 for every ;'. But
sup,- Pnfi^> oo almost surely. [Find an а„ with naj -> 0, such that [0, а„] contains
at least one observation, for n large enough.]
[16] Let #^ be the class of all monotone increasing functions on IR taking values in the
range [0,1]. The class of graphs does not have polynomial discrimination, but it
does satisfy the conditions of Theorem 24 for every P. [If {x;} and {t;} are strictly
increasing sequences, the graphs can shatter the set of points (xlt tx\ ..., (xN, tN).]
[17] For the J^ of the previous problem, rewrite Pnf as JJ Pn{f > t} dt. Deduce
uniform almost sure convergence from the classical result for intervals. [Suggested
by Peter Gaenssler.]
[18] Let J^ and ^ be classes of functions on S with envelopes F and G. Write У for
the class of all sums f + g with / in J5" and g in <&. Prove that
N15Q(F + G), Q, SP) < NiEQF, Q, &)N15QG, Q, <S) for i = 1, 2.
[19] A condition involving only covering numbers for P would not be enough
to give a uniform strong law of large numbers. Let P be Uniform@, 1). Let 3)
consist of all sets that are unions of at most n intervals each with length less than
n'2, for n = 1, 2,.... Show that sups \PnD - PD \ = 1, even though N^s, P, 9s)
< oo for each s > 0.

Problems 41
[20] Deduce Theorem 2 from Theorem 24.
[21] Under the conditions set down in Example 26, the function H(-,P) achieves its
minimum. [If H{9h P) converges to the infimum as i -> oo, use Fatou's lemma to
show that the infimum is achieved at a cluster point of {0J; condition B7) rules
out cluster points at infinity.] Notice that only left continuity of ф is needed for
the proof. Find and overcome the extra complications in the argument that would
be caused if ф were only left-continuous.
[22] This problem assumes familiarity with convexity methods, as described in Section
4.2 of Tong A980). Suppose that the distribution P of Example 26 has a density
p(-) whose high level sets Dt = {p > t} are convex and symmetric about the origin.
Prove that H(9, P) has a minimum at 9 = 0. [By Fubini's theorem,
Щ9, P) = (if {0 < s < ф(\х - в\)}{0 < t < p(x)} ds dt dx
= || volume[B@, ф))с п ?>,] ds dt,
where В(в, г) denotes the closed ball of radius r centered at 9. The volume of
B(9, r) n Dt is maximized at в = 0.] When is the minimum unique? Show that a
multivariate normal with zero means and non-singular variance matrix satisfies
the condition for uniqueness.
[23] Suppose {?,} are independent, but that the distribution of ?,-, call it Q,, changes
with i. Write P(B) for the average distribution of the first n observations, Pin) =
n~1(Q1 + ¦¦¦ + ()„). Show for a permissible polynomial class 9) that
sup | Pn D - PMD | -> 0 almost surely.
s
What difficulties occur in the extension to more general classes of sets, or functions?
[Adapt the double-sample symmetrization method of Lemma 33: sample a pair
(X2l-i, X2t) from Qt; use the selection variable т; to choose which member of the
pair is allocated to Р„, and which to P'n.~\
[24] Show that
^ KQ! + Q2), &) < N2E, Qu
[Let /ix be the density of Qt with respect to Q^ + Q2. Consider the approximating
functions 0,{/i! >j}+ g2{ht < i}.]
[25] Let P be the uniform distribution on [0, I]2. For a sample of n independent
observations on P show that
!P{some square of area а„ contains no observations} -> 1
if а„ is just slightly smaller than rTx log n. [Break [0, I]2 into N subsquares each
with area slightly less than n~1 log n. Set At = {fth subsquare contains at least one
observation}. Show that W{Ai+1 |^4X n • • • n At) < РЛ;+1- The probability that
each of these subsquares contains at least one point is less than (P^4X)N. Bertrand-
Retali A978).]

42 II. Uniform Convergence of Empirical Measures
[26] In one dimension, write the bias term for the kernel density estimate as
p(x) - p(x) = JK(z)[p(X + az) - p(x)-] dz.
Suppose p has a bounded derivative, and that J \z\K(z) dz < со. Show that the
bias is of order O(a). Generalize to higher dimensions. [If p has higher-order
smooth derivatives, and К is replaced by a function orthogonal to low degree
polynomials, the bias can be made to depend only on higher powers of ст.]
[27] The graphs of translated kernels Kx „ have polynomial discrimination for any К
on the real line with bounded variation. [Break К into a difference of monotone
functions.]
[28] Let К be a density on JRd of the form h( | x \), where h( ¦) is a monotone decreasing
function on [0, oo). Adapt the method of Example 26 to prove that the graphs of
the functions Kx a have polynomial discrimination.
[29] Modify the density estimate of Example 38 for distributions on the real line by
choosing К as a function of bounded variation for which J K(z) dz = 0 and
j zK(z) dz = 1 and J \zK(z)\dz < oo. Replace р„ by qn(x) = а~2РпКха. Show
that Wqn{x) converges to the derivative of p. How fast can a tend to zero without
destroying the almost sure uniform convergence sup^ \qn(x) — Wqn(x)\ -+01

CHAPTER III
Convergence in Distribution in
Euclidean Spaces
... which runs through some of the standard methods for proving convergence
in distribution of sequences of random vectors, and for proving weak con-
convergence of sequences of probability measures on euclidean spaces. These include:
checking convergence for expectations of smooth functions of the random vectors;
checking moment conditions for sums of independent random variables (the
Central Limit Theorems); checking convergence of characteristic functions
(the Continuity Theorems for characteristic functions); and reduction to analo-
analogous problems of almost sure convergence via quantile transformations.
III.1. The Definition
Convergence in distribution of a sequence {Xn} of real random variables is
traditionally defined to mean convergence of distribution functions at each
continuity point of the limit distribution function :
IP{Xn < x} -> IP{X < x} whenever IP{X = x} = 0.
Although convenient for work with order statistics and quantiles, this
definition does have some disadvantages. Distribution functions are not
well suited to calculations involving sums of independent random variables.
The simplest proofs of the Central Limit Theorem, for example, do not
directly check pointwise convergence of distribution functions; they show
that sequences of characteristic functions, or expectations of other smooth
functions of the sums, converge. With the extensions to sequences of random
vectors (measurable maps into multidimensional euclidean space JRd), the
difficulties multiply. And for random elements of more general spaces not
equipped with a partial ordering, even the concept of distribution function
disappears. With all this in mind, let us start afresh from an equivalent
definition, which lends itself more readily to generalization.
1 Definition. A sequence of random vectors {Х„} is said to converge in
distribution to a random vector X, written Xn ~» X, if 1Р/(Х„) -»IP/(X) for
every / belonging to the class ^(Ж*) of all bounded, continuous, real func-
functions on IR*. ?
This notion of convergence does not specify the limit random vector
uniquely. If X and Y have the same distribution, that is, if
IP{X e A} = IP{ Ye A} for each borel set A,

44 III. Convergence in Distribution in Euclidean Spaces
then Х„ -» X means the same as Xn ~> Y. This invites slight abuses of nota-
notation. For example, it is convenient to write Xn ~> N@, 1), meaning that the
sequence of real random variables {Xn} converges in distribution to any
random variable having a standard normal distribution. The symbol N@, 1)
stands not only for a particular probability measure on the borel u-field
ЩЖ) but also for any random variable having that distribution. Similarly,
we can avoid much circumlocution by writing, for example,
instead of:
if Х„ has a binomial distribution with parameters n and \, and X
has a normal distribution with mean zero and variance \, then
In general, for probability measures on the borel <7-field &(TRk), define
Р„ ~> Р to mean convergence in distribution for random vectors having these
distributions. This definition is equivalent to the requirement: Pnf^Pf
for every / in ^(SR.k). Most authors call this weak convergence.
III.2. The Continuous Mapping Theorem
Suppose Xn ~> X, as random vectors taking values in Шк, and let Я be a
measurable map from IR* into Ж5. Does it follow that HXn -* HX1 That is,
does 1РДЯХ„) -+ Щ(НХ) for every / in <^(Ж5)? If H were continuous,
/off would belong to %(Жк) for every / in %(Ж3). The result would be
trivially true. The convergence HXn ~» HX also holds under a slightly
weaker assumption: it suffices that H be continuous at almost all points of
the range of X. This will follow as a simple corollary to the next lemma.
2 Convergence Lemma. Let hbe a bounded, measurable, real-valued function
on Жк, continuous at each point of a measurable set C.
(i) Let {Xn} be a sequence of random vectors converging in distribution to X.
If P{X ? Q = 1, then 1Р/г(Х„) ^ Wh(X).
(ii) Let {Pn} be a sequence of probability measures converging weakly to P.
IfP(C) = 1, then PJx -> Ph.
Proof. As the two assertions are similar, we need only prove (ii). Consider
any increasing sequence {/;} of bounded, continuous functions for which
f<h everywhere and f | h at each point of C. Accept for the moment that
such a sequence exists. Then weak convergence of {Pn} to P implies that
C) Pn f -> Pf for each fixed i.

III.2. The Continuous Mapping Theorem 45
On the left-hand side bound /j by h.
liminf Pnh > Pft for each fixed ;'.
Invoke monotone convergence as i tends to infinity on the right-hand side.
D) liminf Pn h > Ph.
The companion inequality obtained by substituting — h for h combines with
D) to complete the proof.
Now to construct the functions {/}. They must be chosen from the family
If we can find a countable subfamily of 3F, say {gu g2, ¦..}, whose pointwise
supremum equals h at each point of C, then setting/; = maxf^, ...,#;}
will do the trick.
Without loss of generality suppose h > 0. (A constant could be added to
h to achieve this.) For each subset A of Ж* define the distance function
d(-,A)by
d(x, A) = inf{|x - y\; у е A].
It is a continuous function of x, for each fixed A. For positive integral m and
positive rational r define
fm,Ax) = г л md(x, {h < r}).
Each/m r is bounded and continuous; it is at most r if h(x) > r; it takes the
value zero if h(x) < r: it belongs to 3F. Given a point x in С and an s > 0,
choose a positive rational number r with h{x) — s < r < h(x). Continuity
of h at x keeps its value greater than r in some neighborhood of x. Conse-
Consequently, d(x, {h < r}) > 0 and/m r(x) = r > h(x) - s for all m large enough.
П
Weak convergence of {Pn} was needed only to establish the convergence
C) for the functions {/,}. These functions were, however, not just continuous,
but uniformly continuous. The functions from which they were constructed
even satisfied a Lipschitz condition:

46 III. Convergence in Distribution in Euclidean Spaces
Thus the lemma could have been proved using only convergence of expecta-
expectations of bounded, uniformly continuous functions of the random vectors. In
particular, such a requirement would imply the convergence Pnh -> Ph for
each h in
5 Corollary. // Pnf —> Pf for every bounded, uniformly continuous f then
Pn ~> P. (And similarly for convergence in distribution of random vectors.) ?
The lemma also provides an answer to the question asked at the start of
the section.
6 Continuous Mapping Theorem. Let H be a measurable mapping from Жк
into ЖЛ Write С for the set of points in Шк at which H is continuous. If a se-
sequence {Xn} of random vectors taking values in Жк converges in distribution
to a random vector X for which IP{X e C) = 1, then HXn ~> HX.
Proof. For each fixed / in ^(JR.S), the bounded function f°His continuous
at all points of C. ?
Some authors seem to regard this result as trivial and obvious; they
scarcely notice that, at least implicitly, they make use of it for many applica-
applications. It is better to recognize these covert appeals to the Continuous Mapping
Theorem. That way the more general form of the result in Chapter IV will
come as no surprise.
7 Example. If the real random variables {Xn} converge in distribution to X
then Р{Х„ < x} -* IP{X < x} at each x for which IP{X = x} = 0. That is,
the sequence converges at each continuity point x of the distribution function
of X. This holds because x is the only point of discontinuity of (the indicator
function of) the set ( — oo, x]. Problem 1 shows you how to go the other way,
from pointwise convergence of distribution functions to convergence in
distribution of the random variables.
The same result is true in higher dimensions if the inequalities Xn < x
and X < x are taken componentwise and ( —oo,x] is interpreted as a
multidimensional orthant with vertex at x. Continuity of the multi-
multidimensional distribution function of X at x requires that X lands on the
boundary of (—oo, x] with zero probability. ?
8 Example. Consider the multinomial distribution obtained by independent
placement of n objects into к disjoint cells. Write Fn = (Fnl,..., Fnk) for
the column vector of observed frequencies—cell i receives Fni objects—and
P — (Ръ ¦ • ¦ > Pk) f°r tne column vector of cell probabilities. Pearson's chi-
square statistic is
к
Zn= E(fni - nPiJ/nPi.

III.2. The Continuous Mapping Theorem 47
Write Zn as a function of a standardized column vector, by setting
Xn = n-l'2(Fn-np)
and
Zn = Х'„А~2Хп,
where Л denotes the diagonal matrix with sfpi as its ith diagonal element. By
the Multivariate Central Limit Theorem, which will be proved later in this
chapter (Theorem 30), the random vectors {Х„} converge to a N@, V)
distribution whose variance matrix V has (i, j)th element pt — pf if i = j
and —PiPj otherwise. Manufacture a random vector with this limit distribu-
distribution by applying a linear transformation to a column vector W of indepen-
independent N@, 1) random variables: Xn ~> Л^ — uu')W, where и denotes the unit
column vector (y/p~i,..., sfpd-
The mapping H from Шк into Ж defined by
Hx = x'A'2x = \А~гх\2
is continuous. Apply the Continuous Mapping Theorem.
Zn = HXn
-* HA(Ik - uu')W
= \{Ik- uu')W\2
~ Xk-u
because Ik — uu' represents the projection orthogonal to the unit vector u.
(The squared length of the projection of W onto any (k — l)-dimensional
subspace has a chi-square distribution with к — I degrees of freedom.) ?
9 Example. Suppose ?и ?2,... are independent random variables each with
a Uniform@, 1) distribution. Neyman A937) developed a goodness-of-fit
test whose asymptotic properties depended on the behavior of the statistics
к Г п -12
j=i b=i J
where n0, пъ ... are given polynomials defined on [0, 1], with the ortho-
normality property:
Explicitly, no(y) = 1,Я1(з0 = y/U(y - Uitj(y) = У5[6(у - iJ - fl.and
so on. Define random column vectors
Xt = (fi1(.?d, ••;*№) for I" =1,2,....
The statistic Gn can then be written as

48 III. Convergence in Distribution in Euclidean Spaces
The Multivariate Central Limit Theorem (Theorem 30) and the ortho-
normality properties of the polynomials ensure that
The Continuous Mapping Theorem (applied to which map?) allows us to
deduce that Gn -> xl¦ Neyman used this to determine the approximate
critical region for his test. ?
III.3. Expectations of Smooth Functions
There are two sorts of perturbation of a random vector X that don't affect
the expectation Wf(X) of a smooth, bounded function of X too greatly:
changes, however gross, that occur with only small probability; and changes
that might occur with high probability but which alter X by only small
amounts. These effects are easy to quantify when the smooth function / is
uniformly continuous. Suppose | f(x) — f(z)\ < г whenever \x — z\ < 5.
Write H/ll for the supremum of/(-)- Then for any random vectors X and Y,
whether dependent or not,
A0) |P/(X)-F/(X + 7)|
<W{\Y\<S}\f(X)-f(X+Y)\
The inequality lets us deduce convergence in distribution of a sequence of
random vectors from convergence of slightly perturbed sequences.
11 Lemma. Let {Х„}, X and Y be random vectors for which Xn + aY-+
X + oY for each fixed positive a. Then Xn -¦ X.
Proof. Remember (Corollary 5) we have only to check that Р/(Х„) -> P/(X)
for each bounded, uniformly continuous /. Apply inequality A0) with X
replaced by Xn and Y replaced by oY.
sup \SPf(Xn) - Wf(Xn + oY)\ < г + 2||/||1P{| Y| > Sa'1}.
n
Similarly
|F/CY) - 1РДХ + oY)\ < e + 2||/||Р{|У| > Sa'1}.
Choose a small enough to make both right-hand sides less than 2e, then
invoke the known convergence of {P/(XH + aY)} to P/(X + aY) to
deduce that limsup | P/(ZB) - P/(X) \ < 4e. ?
Now, instead of thinking of the aY as a perturbation of the random
vectors, treat it as a means for smoothing the function/. This can be arranged

III.3. Expectations of Smooth Functions 49
by choosing independently of X and the {Xn} a random vector Y having a
smooth density function with respect to lebesgue measure—for convenience,
take Y to have a JV(O, Ik) distribution. Integrate out first with respect to
the distribution of Y then with respect to the distribution of X.
JPf(X + oY) = WfJ(X),
where
fa(x) = JB7r)-t/2/(x + ay) exp(-4M2) dy
= (Bna2yk'2f(z) exp(-4|z - x\2/a2) dz.
The function / has been smoothed by convolution. Dominated convergence
justifies repeated differentiation under the last integral sign to prove that
fa belongs to the class ^"(Ж*) of all bounded real functions on Шк having
bounded, continuous partial derivatives of all orders.
12 Theorem. // F/(XB) -> F/(X) for every fin # "(Ж*) then Х„ ~> X.
Proof. Convergence holds for every fa produced by convolution smoothing.
Apply Lemma 11. ?
For the remainder of the section assume that к = 1. That is, consider
only real random variables. As the results of Section 5 will show, no great
generality will be lost thereby—a trick with multidimensional characteristic
functions will reduce problems of convergence of random vectors to their
one-dimensional analogues.
For expectations of smooth functions of X, the effect of small perturba-
perturbations can be expressed in terms of moments by applying Taylor's theorem.
Suppose / belongs to ^"(Ж*). Then, ignoring the niceties of convergence,
we can write
/(* + y) = f(x) + уf\x) + b2f"(x) + ¦¦¦.
Suppose the random variable X is incremented by an independent amount
Y. Then, again ignoring problems of convergence and finiteness, deduce
A3) Wf(X + Y) = F/(Z) + W(Y)Wf'(X) + ^
Try to mimic the effect of the increment У by a different increment W, also
independent of X. As long as F(Y) = JP(W) and F(Y2) = W(W2), the
expectations F/(Z + Y) and F/(X + W) should differ only by terms in-
involving third or higher moments of Y and W. These higher-order terms
should amount to very little provided both Y and W are small; the effect of
substituting W for Y should be small in that case.
This method of substitution can be applied repeatedly for a random
variable Z made up of a lot of little independent increments. We can replace

50 III. Convergence in Distribution in Euclidean Spaces
the increments one after another by new independent random variables. If
at each substitution we match up the first and second moments, as above,
the overall effect on F/(Z) should involve only a sum of quantities of third
or higher order. In the next section this approach, with normally distributed
replacement increments, will establish the Liapounoff and Lindeberg forms
of the Central Limit Theorem.
To make these approximation ideas more precise we need to bound the
remainder terms in the informal expansion A3). Because only the first two
moments of the increments are to be matched, a Taylor expansion to quad-
quadratic terms will suffice. Existence of third derivatives for/will help to control
the error terms.
Assume / belongs to the class ^(IR) of all bounded real functions on Ж
having bounded continuous derivatives up to third order. Then the remainder
term in the Taylor expansion
A4) f(x + y) = f(x) + yf'ix) + \y2f"ix) + Rix, y)
can be expressed as
R(x, y) = b3f'"(x + 0ijO
i (depending on x and y) between 0 and 1. Write ||/'"|| for the supremum
of | /'"(•) | .Then
A5) |Д(*,H1<|11/'1М3,
Set С equal to i||/'"||. Then from A4) and A5),
|F/(X + Y) - F/(X) - F(Y)F/'(X) - %JPiY2)TPf"iX)\
<W\RiX,Y)\
<CF(|Y|3).
Apply the same argument with Y replaced by the increment W, which is
also independent of X. Because F(Y) = F(PF) and F(Y2) = F(W2), when
the resulting expansion for F/(X + W) is subtracted from the expansion
for JPfiX + Y) most of the terms cancel out, leaving,
A6) |F/(X + У) - F/(X + WO| ^ TP\R(X, Y)\ + W\RiX, W)\
< CF(|Y|3) f
This inequality is sharp enough for the proof of a limit theorem for sums of
independent random variables with third moments.
III.4. The Central Limit Theorem
A sum of a large number of small, independent random variables is approxi-
approximately normally distributed—that is roughly what the Central Limit
Theorem asserts. The rigorous formulations of the theorem set forth condi-
conditions for convergence in distribution of sums of independent random

III.4. The Central Limit Theorem 51
variables to a standard normal distribution. We shall prove two versions
of the theorem.
To begin with, consider a sum Z = ^ + ¦ • • + ?,k of independent random
variables with finite third moments. Write c? for P??. Standardize, if
necessary, to ensure that P?; = 0 for each ;' and o\ + ¦ ¦ ¦ + ak = 1. Inde-
Independently of the {?j}, choose independent iV@, <T?)-distributed random
variables {n}}, for; = 1,..., k. Start replacing the {^} by the {tjj}, beginning
at the right-hand end. Define
Sj = ?i + • • • + ?,-_! + nJ+1 + ¦ ¦ ¦ + r\k.
Notice that Sk + ?k = Z and that Sj + пг has a JV(O, 1) distribution.
Choose a ^3(IR) function /, as in Section 3. Theorem 12 has shown that
convergence for expectations of infinitely differentiable functions of random
variables is enough to establish convergence in distribution; convergence
for functions in ^3(IR) is more than enough. We need to show that P/(Z)
is close to P/(iV@, 1)).
Apply inequality A6) with X = SJ} Y = ?j, and W = r\y Because
Sj + ?j = SJ+1 + nJ+1 for; = 1,..., к - 1,
A7) | P/(Z) - TPf(N@, 1)) | < E
<c X
J=l J=l
With this bound in hand, the proof of the first version of the Central Limit
Theorem presents no difficulty.
18 Liapounoff Central Limit Theorem. For each n let Zn be a sum of indepen-
independent random variables ?nl, ?n2,. ¦., ^„fc(n) with zero means and variances that
sum to one. If the Liapounoff condition,
A9) f Pl^f^O flsn^oo,
is satisfied, then Zn ^» N@, 1).
Proof. Choose and fix an / in ^3(IR). Check that P/(Zn) -> Р/(ЛГ@, 1)).
The replacement normal random variables are denoted by nnl,..., п„к(п).
The sum nnl + ¦ ¦ ¦ + nnk(n) has a JV(O, 1) distribution. Write a\-} for the
variance of ?nJ, and Xn for the sum on the left-hand side of A9). With sub-
subscripting n's attached, the bound A7) becomes
|p/(zn) - p/(jv(o, i»| <cxn + c x
J"=l

52 III. Convergence in Distribution in Euclidean Spaces
By Jensen's inequality, o*j = (P<^K/2 < IP| ^„j|3, which shows that the sum
contributed by the normal increments is less than P|iV@, 1)|3А„. Two calls
upon A9) as n -* со complete the proof. ?
The Liapounoff condition A9) imposes the unnecessary constraint of
finite third moments upon the summands. Liapounoff himself was able to
weaken this to a condition on the B + <5)th moments, for some 3 > 0.
The remainder term R(x, y) in the Taylor expansion A4) does not increase
as fast as \y\2+d though:
B0) | R(x, y) | = | [/(* + y) - f{x) - yf'(x)-] - b2
< II /"IIIУ I2 for all x and y.
The new bound improves upon A5) for large \y\, but not for small \y\. To
have it both ways, apply A5) if \y\ < e and B0) otherwise. Increase С to the
maximum of ?||/"'|| and ||/"||. Then the bound on the expected remainder
is sharpened to
B1) W\R(X, Y)\ <РС|У|3{|У| <е} +РС|У|2{|У| > е}
<еСР(У2) + СРУ2{|Г| > e}.
22 Lindeberg Central Limit Theorem. For each n let Zn be a sum of indepen-
independent random variables ?„и ?„2,..., ?nk(n) with zero means and variances that
sum to one. If, for each fixed e > 0, the Lindeberg condition
B3) t &?j{IZnj\ > ?} - 0 as n -> со
is satisfied, then Zn -^ JV(O, 1).
Proof. Use the same notation as in the proof of Theorem 18. Denote the
left-hand side of B3) by Ln(e). Stop one line earlier than before in the ap-
application of A7):
k(n) k(n)
|P/(Zn) - IP/(JV(O, 1))| < X F\R(Snj, Li)\ + E JP\R(Snj, nnJ)\.
j=l J=l
From B1), the first sum is less than
С ? ЙЙ
j=
which equals Cs + CLn(s) because the variance sum to one. For the second
sum retain the bound
J=l

Ш.4. The Central Limit Theorem 53
but, in place of Jensen's inequality, use:
12 / \2Г1с(и) -|2
tk(n) -12 /
= max a*]
i
<?2 + LB(e). ?
The strange-looking Lindeberg condition is not as artificial as one might
think. For example, consider the standardized summands for a sequence
{Yn} of independent, identically distributed random variables with zero
means and unit variances: ?nj = n~ ll2Yj for; = 1, 2,..., n. In this case
LB(e) = nWn-1Yj{\Y1\> nll2s},
which tends to zero as n tends to infinity, by dominated convergence, because
Y\ is integrable. It is even more comforting to know that the Lindeberg
condition comes very close to being a necessary condition (Feller 1971,
Section XV.6) for the Central Limit Theorem to hold.
24 Example (A Central Limit Theorem for the Sample Median). Let М„ be
the median of a sample {Yx, Y2,..., Yn} from a distribution function G
with median M. For simplicity, suppose the sample size n is odd, n = 2JV + 1,
so that М„ equals the (N + l)st order statistic of the sample. Suppose also
that the underlying distribution function G has a positive derivative у at its
median. To prove that
it suffices to check pointwise convergence of the distribution functions.
B5) IP{n1/2(Mn - M) < x}
= JP{Mn<M + n~1/2x}
= F {at least N + 1 observations < M + n~i/2x}
= F\n + 1 < ? {Yj<M + n/2x}l.
Define
M + n~ll2x} = G(M + n/2x),
tn = (N + 1 - пр„)/Ь„.

54 III. Convergence in Distribution in Euclidean Spaces
Check that P?nj- = 0 and P^ H + Р<^„ = 1. Continuity of G at M
gives pn-*\; differentiability of G at M gives tn -> — 2xy. The last probability
in B5) becomes
B6) JpL < f Ц.
As each | ?nj\ is less than b~1, which converges to zero, both the Liapounoff
and Lindeberg conditions are easy to check.
By either of two routes,
Problem 13 shows that substitution of — 2xy for tn in B6) leads to the
correct result:
Р{и1/2(М„ - M) < x) -> IP{iV@, 1) > -2xy}. П
III.5. Characteristic Functions
The smoothing argument of Section 3 showed that only expectations of
<^C0(IR'') functions need be considered when checking convergence in distri-
distribution of random vectors. In this section, further analysis of the method of
smoothing will show that the class of functions can be narrowed a little
more: it suffices to check the convergence
P exp(iy • Xn) -+ P exp(;y • X)
for each fixed vector у inlR*. That is, pointwise convergence of characteristic
functions implies convergence in distribution.
Start again from the convolution expression
aY) = P fB7iff2)-i(/2/(z)exp(-||z - X\2/a2)dz,
derived under the assumption that Y has a N@, Ik) distribution independent
of X. This holds for every bounded measurable /. Reverse the order of

III.5. Characteristic Functions 55
integration (Fubini) on the right-hand side to show that
TPf(X + aY)= jf(z)J(z) dz,
where
B7) J(z) = WBno2ym exp(-i|z - X\2/a2).
The distribution of X + oY has density function J(-) with respect to lebesgue
measure. The dependence on a, which will remain fixed for most of the
argument, need not be made explicit.
The integrand appearing on the right-hand side of B7) comes from the
density function of oY evaluated at (z — X). It is also proportional to the
characteristic function of Y evaluated at (z — X)/a:
exp(-i|z -X\2/a2) = jBnyk'2 exP(iy(z - X)/a - i\y\2)dy.
Invoke Fubini's theorem.
B8) J(z) = ГB7кт)-"Р ехр(-iy • X/o) exp(iy • z/a - \\y\2) dy
= ^BпаУкф{-у/а) exp(iy ¦ z/a - i|y|2) dy,
where ф(-) denotes the characteristic function of X.
Formula B8) shows that the characteristic function of X uniquely deter-
determines the distribution of X + aY for each fixed a. Because X + oY con-
converges almost surely to X as a tends to zero, this proves that the characteristic
function also uniquely determines the distribution of X. A stronger result
can be proved by exploiting continuity properties of the dependence of J
on ф.
29 Continuity Theorem. A sequence of random vectors {Xn} converges in
distribution to a random vector X if and only if the corresponding sequence of
characteristic functions converges pointwise:
IP exp(iy • Xn) ->¦ P exp(jy • X)
for each fixed у in Шк.
Proof. Prove necessity by splitting exp(ix • y) into its real and imaginary
parts, both of which define functions in

56
HI. Convergence in Distribution in Euclidean Spaces
For the proof of sufficiency, denote the characteristic function of Xn by
ф„, and the density function of Х„ + aY by Jn. The domain of integration
is Жк. For fixed / and a,
\JPf(Xn + aY) - JPf(X + aY)
f(z)Jn(z) dz - jf(z)J(z) dz
< 11/11 11 Uz)-J(z) | dz
= ll/ll U(J(z)-Jn(z))+dz -
= 2Ц/11 J(J(z)-Jn(z))+rfz,
- Jn(z))~ dz
because
0=1-1
(z) - Jn{z)) dz
(z) - Ш)+ dz - j(J(z)-Jn(z)y dz.
Dominated convergence applied to B8) with ф replaced by ф„ shows that
the functions {Jn} converge pointwise to J. Thus {(J - Jn)+) converges
pointwise to zero. Notice that (J — Jn)+ < J for each n. Invoke dominated
convergence again.
2Ц/11 j(J(z)-Jn(z))+dz^0.
Complete the proof by an appeal to Lemma 11. ?
Perhaps this result should have been proved before we launched into the
Central Limit Theorem proofs of Section 4. At least for identically distributed
random variables, Theorems 18 and 22 could have been disposed of more
rapidly; but the general cases would have required much the same level of
effort, applied to exp(y • ix) in place of a general ^3(IR) function. The ad-
advantages of working with characteristic functions come from the special
multiplicative properties of the complex exponential. A central limit theorem
for martingales in Chapter VIII will make use of this property.
Theorem 29 contains a bonus: convergence in distribution of random
vectors, Xn ~> X, is equivalent to the convergence in distribution of all
linear functions of the random vectors: у ¦ Х„ ~> у ¦ X, for every y. (For the

III.6. Quantile Transformations and Almost Sure Representations 57
non-trivial half of the assertion, notice that 0n(t) is the expectation of a
bounded continuous function of t • Xn.) Convergence problems for random
vectors reduce to collections of convergence problems for random variables;
limit theorems for random vectors can be deduced from their univariate
analogues.
30 Multivariate Central Limit Theorem. For every sequence {?„} of indepen-
independent, identically distributed random vectors with P?,- = 0 and Р(?у?}) = V,
n-1/2(^+--- + ^)-JV@, V).
Proof. For fixed y, the random variables {y ¦ ?„} are independent and
identically distributed with zero means and variances equal to y'Vy. If
y'Vy = 0 the random variables у ¦ ?j and у ¦ N@, V) degenerate to zero; the
assertion then holds for trivial reasons. Otherwise standardize to unit
variance. The standardized sequence satisfies the Lindeberg condition. By
Theorem 22,
(y'Vyy l'2y ¦ n- v\ix +... + ?,)-> N@, 1)
= (y'Vyyll2y-N@, V). ?
III.6. Quantile Transformations and Almost Sure
Representations
The definition of convergence in distribution by means of pointwise con-
convergence of distribution functions does have its advantages, at least for real
random variables. Behind some of these advantages lies a construction that
reduces problems involving arbitrary distributions on the real line to the
uniform case.
For each distribution function F define a quantile function
Q(u) = M{t: F(t) >u} for 0 < и < 1.
Right continuity of F shows that Q(u) sits at the left endpoint of the closed
interval oft values for which F(t) > u; in other words,
F(t) > и if and only if Q(u) < t.
If ? has the Uniform@, 1) distribution then
> ?} = F(t).
That is, <2(?) has distribution function F.
The same construction gives a method for representing weakly convergent
sequences of probability measures by sequences of random variables that
converge almost surely.

58 III. Convergence in Distribution in Euclidean Spaces
31 Representation Theorem. Let {Р„} be a sequence of probability measures
on the real line that converges weakly to P. There exist random variables {Х„}
with distributions {Pn}, and X, with distribution P, such that Х„ -» X almost
surely.
Proof. Write Fn for the distribution function of Р„, and F for the distribution
function of P. Denote the quantile function corresponding to Fn by Qn.
Choose ? with distribution Uniform@, 1) then define Xn = QM)- It suffices
to show that {Qn(u)} converges for each и outside a countable subset of
@, 1), for then we can define X (almost surely) as the limit of the {Х„}.
(Problem 21 gives a more concrete representation for X.)
Recall that Fa(t) -*¦ F(t) for each t in the dense set To of points for which
]P{X = t} = 0. Each non-empty open interval contains points of TQ. In
particular, if {Qn{u)} does not converge for some fixed и then there exist
points t and s in To for which
liminf Qn(u) < t < s < limsup Qn(u).
Since Qn(u) < t implies Fn(t) > u, the limit along a subsequence gives
F(i) > u. Similarly, Qn(u) > s infinitely often implies that F(s) < u. Thus F
takes the constant value и throughout the interval [t, s]. There can be at
most countably many values of и (a set of zero lebesgue measure) for which
F has such a flat spot: different и values produce disjoint flat spots, and the
real line can accommodate only countably many disjoint intervals. This
proves that liminf Qn(u) = limsup Qn(u) for almost all values of и in @, 1).
If X is to be defined as the pointwise limit of {Xn} we should also exclude
the possibility of infinite limits. Given и in @, 1), choose t in To such that
F(t) > u. Convergence at points of To gives Fn(t) > и eventually, which
implies Qn{u) < t eventually, and so limsup Qn(u) < со. The liminf can be
handled similarly. ?
Some weak convergence results, such as the one-dimensional case of the
Continuous Mapping Theorem, reduce to trivialities when the Representa-
Representation Theorem is invoked. Similar representations do obtain in higher
dimensions and for probability measures on abstract metric spaces, but
these require a completely different construction. More about that in
Chapter IV.
32 Example. Given a probability measure P on ^(IR) and a bounded se-
sequence of measurable functions {hn} converging pointwise to a limit function,
h, under what conditions will Pnhn -> Ph for every sequence {Р„} converging
weakly to P?
Consider first the special case where P concentrates at a single point x.
Choose Pn concentrated at х„, for any sequence of real numbers {х„} con-
converging to x. In this case the requirement reduces to hn(xn) -* h{x). This
shows that something stronger than mere pointwise convergence of hn to

Ш.6. Quantile Transformations and Almost Sure Representations 59
h is needed. (Notice that if hn = h for every n, the requirement is equivalent
to continuity of h at x.)
In general it will suffice if hn(xn) -* h(x) for every x in a set of P measure
one and every sequence {х„} converging to such an x. Represent {Р„} by an
almost surely convergent sequence {Xn} of random variables. Then hn(Xn) ->
h{X) almost surely. By dominated convergence, Whn(Xn) -> JPh(X), which
implies that Pnhn -> Ph. П
33 Example. Let G be an open subset of IR. Suppose that Pn ~> P. The
Representation Theorem can be used to show that liminf PJfi) ^ P(G).
(The proof of the Convergence Lemma also contains the result implicitly.)
Switch to almost surely convergent representations Х„ and X. For an со
at which convergence holds, if X(a>) belongs to G then so does Х„(со) for all
n large enough. That is,
liminf {Xn e G} > {X e G} almost surely.
Apply Fatou's lemma.
liminf PnG = liminf W{Xn e G}
> P liminf{Zn e G}
> W{X e G}
= P(G). П
34 Example. Let {ф„} be the characteristic functions corresponding to a
weakly convergent sequence of probability measures {?„}, and ф be the
characteristic function of the limit distribution P. Take {Х„} as the almost
surely convergent representing sequence of random variables. For each
fixed t, the sequence {exp(itXn)} converges almost surely to exp(itX).
Moreover, the convergence is uniform on each bounded interval / of f-values:
sup|exp(rtZJ - exp(itX)| < sup | exp(it(Xn — X)) — 1)|
-> 0 almost surely,
by virtue of the continuity of the exponential function at zero. Because
sup |IP exp(;tZn) - IP exp(itX)\ < P sup \exp(it(Xn - X)) - 1)[,
the sequence {(jf)n(t)} converges uniformly to 0@ over bounded intervals. П
The argument for the proof of the Representation Theorem made little
explicit use of the limit distribution function F. The essentials were:
(i) existence of lim Fn(t) for each t in a dense subset To;
(ii) for each fixed и in @, 1) there exists a t in To for which liminf Fn(t) > u,
and an s in To for which limsup Fn(s) < u.

60 III. Convergence in Distribution in Euclidean Spaces
The second property is equivalent to the existence, for each fixed e > 0, of a
constant К such that
C5) limsup Pa\_-K, KJ < s.
Of course К depends on e. A sequence {Р„} with this property is said to be
uniformly tight. The corresponding property for a sequence of random
variables {Х„}—that for each e > 0 there exists а К such that
limsupIP{|XJ > К} < е
—is also called uniform tightness.
36 Theorem. Each uniformly tight sequence of probability measures on the
real line has a subsequence that converges weakly.
Proof. Use Cantor's diagonalization procedure to select a subsequence
{Р„.} for which {Fn>(t)} converges for each rational t. The subsequence
satisfies both of the conditions (i) and (ii) noted above. Construct an almost
surely convergent sequence of random variables {Х„,} such that Xn. has
distribution Pn, and Х„. -* X almost surely. The sequence {?„-} converges
weakly to the distribution of X. ?
By providing a method for constructing probability measures on
this theorem allows specification of the limit distribution to be omitted from
some important results. For example, the Continuity Theorem for charac-
characteristic functions, as proved (Theorem 29) in Section 5, can be improved
upon slightly.
37 Continuity Theorem (General Form). Let {Xn} be a sequence of real
random variables whose characteristic functions {</>„} converge pointwise to
some function ф. If ф is continuous at the origin then it must be the charac-
characteristic function of an X for which Xn ~> X.
Proof. Once we prove that ф is a characteristic function, Theorem 29 will
do the rest. Let us show that {Х„} is uniformly tight. Then, by Theorem 36
it will have at least one subsequence that converges in distribution. The limit
of any such subsequence must have ф as its characteristic function.
The inequality C5), which defines uniform tightness, calls for a constraint
to be placed on the tails of the distributions of the random variables {Xn}.
Here we can use an inequality, valid for any random variable Z with charac-
characteristic function p, that relates the tail behavior of a distribution to the values
of its characteristic function near the origin: there exists a positive constant
a (approximately 6.308) such that
C8) P{|Z| > /Г1} < (a/2/0 f [1 - (>№ dt
J-h

Notes 61
for every positive h. This follows (Fubini's theorem) from the equality
Г*
B/г) [1 - IP exp(iZt)] dt = P[l - (sin hZ)/hZ].
J-h
Interpret (sin 0)/0 as 1. Because the integrand is non-negative everywhere
and greater than a71 = 1 - sin 1 on the set {| Z [ > h~1}, the expectation is
greater than or4P{|Z| > /Г1}.
Apply C8) with Z = Xn and p = ф„.
limsup 1Р{\Х„\ > /Г1} < limsup (a/2h) [1 - фп
n n J —h
dt
= (a/2h) [1 - фAI dt
J-h
by dominated convergence. Thanks to the continuity of ф at the origin, the
last expression can be brought close to
«[1 - 0@)] = lim «[1 - фМ1 = О
П
by choosing h small enough. ?
Notes
Everything in this chapter is classical.
The name Continuous Mapping Theorem seems to have taken hold in
the literature, despite its evident inappropriateness. Billingsley A968,
Section 5) has more on the history of the theorem. Some authors refer to
F-almost-sure-continuity as P-Riemann-integrability.
Convolution with a normal kernel as a means of smoothing is an ancient
idea. Weierstrass A885) used it to prove his famous approximation theorem.
Liapounoff A900) introduced an auxiliary normal variable as a device for
smoothing a sum of independent variables. That enabled him to complete
the argument of Glaisher A872), and thereby prove with full rigor a central
limit theorem. The Liapounoff condition did not appear in the 1900 paper;
instead Liapounoff imposed a stronger requirement on the maximum third
absolute moment. In his 1901 paper he improved this to a requirement on
the sum of B + <5)th moments about the mean. (Problem 19 gives a minor
variation on his 1901 result.)
Section 4 follows in part the lucid exposition of Lindeberg A922), with
a little help from Billingsley A968, Section 7).
Uspensky A937, Chapter XIV) attributed the Continuity Theorem to
Liapounoff A900); the key idea was implicit in Liapounoff's proof. Levy
A922) proved it explicitly. Both Liapounoff and Levy required uniform
convergence on compacta. The dominated convergence trick in the text is
often called Scheffe's lemma.

62 Ш. Convergence in Distribution in Euclidean Spaces
Problems
[1] Suppose X, Xu X2,--- are real random variables such that 1Р{Х„ < x] ->
IP{X < x] whenever IP{X = x} = 0. Prove that Xn ~» X. [For a fixed / in «"(IR)
find points {a,-} with — oo < o^ < a2 < • • • < ak < oo such that IP{X = aj = 0
and both of W{X < aj and IP{X > ak) are small. Choose the {aj close enough
together to ensure that / changes by very little over each interval [a;,aj+1].
Approximate / by a step function constant on each of those intervals.] Extend
the argument to higher dimensions.
[2] Let {xn} be a sequence of points in IR*, and {Х„} be a sequence of degenerate
random vectors with 1Р{Х„ = х„} = 1. Show that {Х„} converges in distribution
if and only if {х„} converges.
[3] Let P be a probability measure on the <r-field ^(IR*). Show that for each s > 0 and
each borel set В there exists a closed set F? contained in В and an open set G?
containing В such that P(Gf — F?) < к. [Prove that the class of all such В forms
a (т-field. It contains every closed set, because closed sets can be written as count-
countable intersections of open sets.] Deduce that if IP/(X) = IP/( Y) for every bounded
uniformly continuous / then X and У have the same distribution. [Represent open
sets as the limits of increasing sequences of uniformly continuous functions.]
[4] A sequence of probability measures on ^(IR*) can converge weakly to at most one
limit distribution: if Р„ ~> P and Pn ~> Q then P and Q agree for all borel sets.
[Use Problem 3.]
[5] If Х„ ~> X then Х„ + op(l) -> X. [Appendix A explains the op(-) notation. This
result is sometimes called Slutsky's theorem.]
[6] If {Х„} converges in probability to X then Xn ~> X. The converse is false.
[7] If Х„ ~* X and IP{X = c} = 1, for some constant c, then {Xn} converges in
probability to с
[8] Give an example of a continuous function g and a sequence of random vectors
with Х„ ~-> X for which {ТРд(Х„)} does not converge to IP#(X). [Don't forget the
word bounded in the definition of convergence in distribution.]
[9] Give an example of a map H from IR* into itself and a sequence of random vectors
in Жк with Xn ~» X for which HXn does not converge in distribution to HX.
[10] The set D of discontinuity points of a map H from IR* into IRS is borel measurable.
[Start from the set Dm „ of all those points x for which \Hy — Hz\ > m'1 for at
least one pair of points with \x — y\ < n and \x — z\ < n~l. Prove that Dm „
is open. Tops0e.]
[11] A random IR^-vector X and a random IRJ-vector У defined on the same probability
space can be combined into a single random IR-/+t-vector (X, У). If Х„ -~> X, as
random vectors in IR*, and У„ -+ У, as random vectors in IRJ, it need not follow that
(Х„, У„) ~» (X, Y). But if IP{X = c} = 1 for some constant с the result is true.
[Consider expectations of bounded uniformly continuous functions on IRJ+*.]
Use characteristic functions to prove that the result also holds if X is independent
of У and Xn is independent of У„, for each n.

Problems 63
[12] If Х„ has a t-distribution on n degrees of freedom then Xn -* N@, 1) as n -* oo.
[Use Problem 11.]
[13] If real random variables {Х„} converge in distribution to X and {xn} is a sequence
of real numbers converging to an x for which IP{X = x) = 0, prove that
> xn} -> IP{X > x}.
[14] If a sequence of random vectors {Х„} converges in distribution then Xn = Op(l).
[Appendix A explains the Op(-) notation.]
[15] Let Я be a measurable map from IR* into IRS that is differentiable at a point x0.
That is, there exists a linear map L from Жк into IRS such that
Hx = Hx0 + L(x — x0) + o(x — x0) near x0.
If nll2(Xn - xo)^Z, prove that пт{НХ„ - Hx0) -> LZ. [Some authors call
this the delta method.]
[16] If Xn ~ Bin(n, 9) for some fixed 9 in @, 1), find the limiting distribution of
n1/2[arsin(X»1/2 - arsin 01/2] as n -> oo.
[17] For у real, define Я„(у) = ехр(г» - 1 - iy (iy)n/nl Prove that \Н„(у)\ <
\y\n+1/(n + 1)!. [Proceed inductively, using i$'o Hj[s)ds = Hn+1(t) for t > 0.
Take complex conjugates for t < 0. Borrowed from Feller A971).]
[18] Use the inequality in the previous problem to show that the characteristic function
of n/2(Poisson(n) — n) converges pointwise to exp( —12/2). [This is one way to
find the characteristic function of the N@, 1) distribution.]
[19] For each n let Zn be a sum of independent random variables ?„ь ..., ^„t(n) with
zero means and variances that sum to one. If for some 5 > 0,
4")
P|^|2+<5->0 as n^oo
then Zn -~> N@, 1). [Apply the Lindeberg Central Limit Theorem. Liapounoff.]
[20] Suppose a random vector X has a characteristic function ф that is integrable with
respect to lebesgue measure on IR*. Prove that the distribution of X has a bounded,
uniformly continuous density
g(z) = Bn) k exp(-i
[Choose a continuous / vanishing outside a bounded region of IR*. Start from the
expression for IP/(X + oT) derived in Section 5:
Bm7)-*/(z) exp(iz .y/a - ||у\2)ф(-у/<т) dy dz.
Я
Make a change of variable w = —yja, then take the limit as a -> 0 under the
integral sign. Adapt Problem 3 to complete the proof.]
[21] In the proof of the Representation Theorem write Q for the quantile function
corresponding to F. Show that {Qn(u)} converges to Q(u) for each и that is a point
of continuity for Q.

CHAPTER IV
Convergence in Distribution in
Metric Spaces
... in which that theory from Chapter III depending only on the metric space
properties of IR* is extended to general metric spaces. It is argued that the theory
should consider not just borel-measurable random elements. A Continuous
Mapping Theorem and an analogue of the almost sure Representation Theorem
survive the generalization. A compactness condition—uniform tightness—is
shown to guarantee existence of cluster points of sequences of probability
measures.
IV. 1. Measurability
We write a statistic as a functional on the sample paths of a stochastic
process in order to break an analysis of the statistic into two parts: the study
of continuity properties of the functional; the study of the stochastic process
as a random element of a space of functions. The method has its greatest
appeal when many different statistics can be written as functionals on the
same process, or when the process has a form that suggests a simple ap-
approximation, as in the goodness-of-fit example from Chapter I. There we
expressed various statistics as functionals on the empirical process Un, which
defines a random element of D[0, 1]. Doob's heuristic argument suggested
that Un should behave like a brownian bridge, in some distributional sense.
Formalization of the heuristic, the task we embark upon in this chapter,
requires a notion of convergence in distribution for random elements of
D[0, 1]. As for euclidean spaces, the definition will involve convergence of
expectations of bounded, continuous functions of the processes. For this
we need a notion of distance. Equip D[0, 1] with its uniform metric, which
assigns the maximum separation
||x-y|| =sup|x(t)-XOI
r
as the distance between x and y. We shall find it easiest to prove convergence
in distribution of {[/„} using this metric, even though it does create some
minor measurability difficulties. Chapter VI will examine another metric,
for which these difficulties disappear, at the cost of greater topological
complexity.
An expectation Р/([/„) is well defined only when /([/„) is measurable. If
Un lives on a probability space (O, $, P), we can arrange for measurability

I V.I. Measurability 65
by equipping D[0, 1] with a c-field, SP say, then checking & /^-measurability
of Un and ^-measurability of/. The borel c-field will not be the best choice for
3P. The definition of convergence in distribution for random elements of a
general metric space anticipates this complication for D[0, 1].
1 Definition. An d%2/-measurable map X from a probability space (O, S, IP)
into a set Ж with c-field s/ is called a random element of 3C.
If <F is a metric space, the set of all bounded, continuous, stf/ЩЖ)-
measurable, real-valued functions on 3C is denoted by ^{ЗС; stf).
A sequence {Х„} of random elements of 3C converges in distribution to a
random element X, written Xn ~> X, if Р/(Х„) -> IP/(X) for each / in
A sequence {Pn} of probability measures on s/ converges weakly to P,
written Pn ~> P, if PB / -> Pf for every / in <€(<X; sf). D
The borel (т-field ®{9C), the (т-field generated by the closed sets, will
always contain s/. For those spaces where we need sd strictly smaller than
the borel сг-field, we will usually have it generated by the collection of all
closed balls in 3C. Also the trace of s/ on each separable subset of 3C will
coincide with the trace of the borel c-field on the same subset. Limit distri-
distributions will always be borel measures concentrating on separable, .im-
.immeasurable subsets of 3C. We could build these properties into the definition
of weak convergence, but it would neither save us any extra work, nor
simplify the theory much.
2 Example. If D[0, 1] is equipped with the borel c-field 38 generated by the
closed sets under the uniform metric, the empirical processes {[/„} will not
be random elements of D[0, 1] in the sense of Definition 1. That is, UH is not
(f/^-measurable.
Consider, for example, the situation for a sample of size one. (Problem 1
extends the argument to larger sample sizes.) For each subset A of [0, 1]
define
GA = {x e D[0, 1]: x has a jump at some point of A}.
Each GA is open because |x(t) — x(t — )\ depends continuously upon x, for
fixed t. If Ux were §/^-measurable, the set {U1 e GA} = {^ e 4}jvould
belong to S. A probability measure ц could be defined on the class of all
subsets of [0, 1] by setting ц{А) = TP{^ e A}. This ц would be an extension
of the uniform distribution to all subsets of [0, 1]. Unfortunately, such an
extension cannot coexist with the usual axioms of set theory (Oxtoby 1971,
Section 5): if we wish to retain the axiom of choice, or accept the continuum
hypothesis, we must give up borel measurability of Uv The borel c-field
generated by the uniform metric on D[0, 1] contains too many sets.
There is a simple alternative to the borel (т-field. For each fixed t, the
map [/„(¦, t) from О into IR is a random variable. That is, if щ denotes the

66 IV. Convergence in Distribution in Metric Spaces
coordinate projection map that takes a function x in D[0, 1] onto its value
at t, the composition щ ° Un is &/^(IR)-measurable. Each Un is measurable
with respect to the a-field S? generated by the coordinate projection maps
(Problem 2). Call SP the projection a-field. Problem 4 shows that SP coincides
with the (т-field generated by the closed balls. All interesting functionals on
D[0, 1] are ^-measurable. ?
Too large a c-field si makes it too difficult for a map into 3C to be a random
element. We must also guard against too small an si. Even though the metric
on 3C has lost the right to have si equal to the borel a-field, it can still demand
some degree of compatibility before a fruitful weak convergence theory will
result. If #(<?¦; sf) contains too few functions, the approximation arguments
underlying the Continuous Mapping Theorem will fail. Without that key
theorem, weak convergence becomes a barren theory. An extreme example
should give you some idea of the worst that might happen.
3 Example. Allow the real line to retain its usual euclidean metric, but change
its ff-field to the one generated by the intervals of the form [n, n + 1), with
n ranging over the integers. Call this a-field 01. Functions measurable with
respect to M must stay constant over each of the generating intervals. For a
continuous function, this imposes a harsh restriction; continuity at each
integer forces an ^-measurable function to be constant over the whole real
line. This completely degrades the weak convergence concept: every sequence
of ^-measurable random elements converges in distribution. It bodes ill for
a sensible Continuous Mapping Theorem.
Consider the map Я from the disfigured real line into the real real line
(equipped with its usual metric and (т-field) defined by Hx = 1 if 0 < x < 3
and Hx = 0 otherwise. It is a perfectly good ^-measurable map, continuous
at the point 1. Apply it to random elements {Xn} identically equal to 3, and
X identically equal to 1. Even though Xn ~> X in the sense of Definition 1,
{HXn} does not converge in distribution to HX. ?
IV.2. The Continuous Mapping Theorem
Suppose Х„ ~> X, as .^/-measurable random elements of a metric space 3C,
and let Я be an j<//j</'-measurable map from 3C into another metric space
T. If Я is continuous at each point of an .^/-measurable set С with
W{XeC} = 1, does it follow that HXn ~> HX? That is, does {Р/(ЯХ„)}
converge to Р/(ЯХ) for every / in <ёBГ; st'I
We found an answer to the analogous question for random vectors in
Section III.2 by reducing it to an application of the Convergence Lemma.
The same approach works here. We need to prove JPh(Xn) -* JPh(X) for
every bounded, .^/-measurable, real-valued h that is continuous at each
point of C. Were si equal to the borel (т-field $8BE), the proof would go

IV.2. The Continuous Mapping Theorem 67
through almost exactly as before, with only a few words difference. For borel-
measurable random elements of metric spaces, the theory parallels the theory
in Chapter III very closely, at least as far as the Continuous Mapping
Theorem is concerned. Example 3 warns us that non-borel (т-fields require
more careful handling.
With this in mind, let's rework the Convergence Lemma of Chapter III,
paying more attention to measurability difficulties. To begin with we assume
only that si is a sub-c-field of Щ&). Define
D) & = {fe ^{Ж;^): f < h}.
Last time we constructed a countable subfamily of J^ whose pointwise
supremum achieved the upper bound h at each point of C. Functions in the
subfamily took the form
/m,rW = г л md(x, {h < r})
Continuity offmr suffices for borel measurability, but it needn't imply
^-measurability. We must find a substitute for these functions. This is
possible if we impose a regularity condition, which ensures that the pointwise
supremum of #" equals h at each point of C. If С is separable (meaning that
it has a countable, dense subset), we can then extract from J^ a countable
subfamily having the same supremum as J* at each point of С The regularity
condition will capture the key property enjoyed by fm r.
Without loss of generality suppose h > 0. Suppose also that h is continuous
at a point x. Choose r with 0 < r < h(x). Look for an / in 3F with/(x) > r.
Continuity provides a S > 0 such that h(y) > r on the closed ball B(x, 5)
centered at x. If we could find a g in #(#"; s/) with 0 < g < B(x, 8) and
g{x) = 1, the function rg would meet our requirements. Notice the
similarity to the topological notion of complete regularity (Simmons 1963,
Section 27). If si happened to contain all the closed balls centered at x, a
property enjoyed by the projection a-field on D[0, 1] (Problem 4), the
function
E)
would do, because {g > 1 — s} = B(x, sd). For general si we must postulate
existence of the appropriate g.
To maintain the parallel with euclidean spaces as closely as possible,
strengthen the requirements on g to include uniform continuity. We lose
only a scintilla of generality thereby; the special g of E) still passes the test.
6 Definition. Call a point x in 3C completely regular (with respect to the
metric d and the (т-field si) if to each neighborhood V of x there exists a
uniformly continuous, .^/-measurable function g with g(x) = 1 and g < V.
О
You might well object to yet another mathematical notion attaining the
status of regularity; the world is already overloaded with instances of

68
IV. Convergence in Distribution in Metric Spaces
"regular" as a synonym for "amenable to our current theory." At least it
has the virtue of reminding us of its topological counterpart. (A more
sadistic author might have called it T3±.) The terminology would not be
wasted if we were to expand our weak convergence theory to cover borel
measures on general topological spaces, for there topological complete
regularity seems just the thing needed for a well-behaved theory.
7 Convergence Lemma. Let h be a bounded, s4'-measurable, real-valued
function on 9C. Ifh is continuous at each point of some separable, sd-measurable
set С of completely regular points, then:
(i) Xn -> X and TP{X e C} = 1 imply Wh(Xn) -> Wh(X);
(ii) Р„~>Р and PC = 1 imply Pnh -* Ph.
Proof. As the arguments for both assertions are quite similar, let us prove (ii)
only. Assume that h > 0 (add a constant to h if necessary). Define !F as in
D), but with the continuity requirement strengthened to uniform con-
continuity. At those completely regular points of 3C where h is continuous, the
supremum of OF equals h. This applies to points in C.
Separability of С will enable us to extract a suitable countable subfamily
from $F. Argue as for the classical Lindelof theorem (Simmons 1963, Section
18). Let Co be a countable, dense subset of С Let {gu д2,...}Ъе the set of
all those functions of the form rB, with r rational, В a closed ball of rational
radius centered at a point of Co, and rB < f for at least one / in JF. For each
gt choose one/satisfying the inequality gt < f. Denote it byf. This picks out
the required countable subfamily:
(8) sup f = sup J* on C.
i
To see this, consider any point z in С and any / in J*\ For each rational
number r such that /(z) > r > 0 choose a rational e for which / > r at all
points within a distance 2e of z. Let В be the closed ball of radius e centered
at a point x of Co for which d(x, z) < e. The function rB lies completely
below/; it must be one of the {g(}. The corresponding/ takes a value greater
than r at z. Assertion (8) follows.

IV.2. The Continuous Mapping Theorem 69
Complete the argument as for the Convergence Lemma of Section III.2.
Assume without loss of generality that/ ] h at points of C. Then
liminf Pnh > liminf Pn/ for each г
= Pfi because Pn ~> P
-* Ph as i -* со, by monotone convergence.
Replace h by — h + (a big constant) to get the companion inequality for the
limsup. ?
9 Corollary. // TPf(Xn) -*Wf(X) for each bounded, uniformly continuous,
srf-measurable f, and if X concentrates on a separable set of completely
regular points, then Xn ^> X. ?
The corollary flows directly from the decision to insist upon uniformly
continuous separating functions in the definition of a completely regular
point. As with its counterpart for euclidean spaces, it makes some weak
convergence arguments just a little bit more straightforward than the corre-
corresponding arguments with continuous functions.
10 Example. Let 3C be a space equipped with a c-field si and metric d, and
<& be a space equipped with a u-field 8) and metric e. Equip 3C ® ®J with its
product (т-field and the metric a defined by
<rL(x, y), (x1, /)] = max[rf(x, x'\ e(y, /)].
Suppose Х„ ~* X, as random elements of Ж If Wx concentrates on a separable
set of completely regular points, and Yn -> y0 in probability for some fixed
completely regular point y0 in <&, then (Х„, Yn) ^ (X, y0), as random elements
of the product space Ж ®<W.
Of course the assertion only makes sense if Xn and Yn are defined on the
same probability space. Given that prerequisite, measurability with respect to
the product c-field presents no problem, because
(Xn, Yny\A ®B) = (X;U) n (Y^B),
and similarly for (X, y0).
Write С for the separable set on which ]PX concentrates. Then JPx,yo
concentrates on the product set С ® {y0}, which is separable. Each point
of this set is completely regular: if/(c) = 1 and/ = 0 outside the ball of
d-radius e, and g(y0) = 1 and g = 0 outside a ball of e-radius e, then the
product f(x)g(y) equals 1 at (c, y0) and vanishes outside a ball of c-radius e.
The product is uniformly continuous if both / and g are bounded and uni-
uniformly continuous; it is si ® ^-measurable if/ is ^-measurable and g is
^-measurable.
By virtue of Corollary 9, to prove (Xn, Yn) ~> (X, y0) we have only to
check that JPh{Xn, У„) -> Wh(X, y0) for each bounded, uniformly con-
continuous, si ® ^-measurable, real function h on Ж ® <%l¦ Given г > 0 choose

70 IV. Convergence in Distribution in Metric Spaces
д > 0 so that | h(x, y) - h(x', y')\ < s whenever o[(x, y), {%', /)] < 5. Write
k(-) for the bounded, uniformly continuous, ^/-measurable function h{-, yQ).
Then
\Wh(Xn, Yn) - Wh(X, yo)\ < e + 2\\h\\TP*{e(Yn, y0) > 5}
+ \Wk(Xn) - JPk(X)\.
Convergence in probability of Yn to y0 makes the middle term converge to
zero. (Notice the outer measure P*. By definition, P*Z equals the infimum
of PWover all $ -measurable real functions with W > Z. For most applica-
applications e(-, y0) will be ^-measurable, in which case P* can be replaced by P.)
The last term converges to zero because Х„ ~+ X. ?
11 Example (Convergence in Distribution via Uniform Approximation).
Let X, Хъ X2,... be random elements of Ж with Wx concentrated on a
separable set of completely regular points. Suppose, for each e > 0 and
5 > 0, there exist approximating random elements AX, AXU AX2, ¦ ¦ ¦ such
that:
(i) W*{d(X, AX)>5}<e;
(ii) limsup №*{d(Xn, AXn) > 8} < e;
(iii) AXn -> AX.
Then Х„ ~> X. Notice again the use of outer measure to guard against non-
measurability.
We have already met a special case of this result in Lemma III. 11, where
АХ„ = Х„ + aY. In applications to stochastic processes, the approximations
are typically constructed from the values of the processes at a fixed, finite
set of index points. For such approximations, classical weak convergence
methods can handle (iii). The assumptions (i) and (ii) place restrictions on
the irregularity of the sample paths. Chapter V will take up this idea.
The convergence Xn ~» X follows from convergence of expectations for
every bounded, uniformly continuous, .^/-measurable/. If | f(x) — f{y) \ < г
whenever d(x, y) < S then \Wf(Xn) - TPf(X)\ is less than
IPI/PO - f(AXn)\ + \Wf{AXn) - Wf(AX)\ + IP|/(AST) - f(X)\.
The convergence (iii) takes care of the middle term. Handle the first term by
splitting it into the contributions from {d(Xn, AXn) > 5} and its comple-
complement; and similarly for the last term. ?
The Convergence Lemma has one other important corollary, the result
that tells us how to transfer convergence in distribution of random elements of
Ж to convergence in distribution of selected functionals of those random
elements. For substantial applications turn to Chapter V.
12 Continuous Mapping Theorem. Let H be an s//sf -measurable map from
3C into another metric space X'. If H is continuous at each point of some
separable, .^-measurable set С of completely regular points, then Х„ ^> X and
W{X e C} = 1 together imply HXn ~> HX. ?

IV.3. Representation by Almost Surely Convergent Sequences 71
IV.3. Representation by Almost Surely
Convergent Sequences
In Section III.6 we used the quantile transformation to construct almost
surely convergent sequences of random variables representing weakly con-
convergent sequences of probability measures. That method will not work for
probabilities on more general spaces; it even breaks down for IR2. But the
representation result itself still holds.
13 Representation Theorem. Let {Pn} be a sequence of probability measures
on a metric space. IfPn ~> P andP concentrates on a separable set of completely
regular points, then there exist random elements {Х„} and X with distributions
{Pn} and P such that Xn -> X almost surely. ?
The new construction makes repeated use of a lemma that can be applied
to any two probability measures P and Q that are close in a weak convergence
sense. Roughly speaking, the idea is to cut up the metric space 3C into pieces
Bo, Bu ..., Bk for which PBt a> QBt for each i, so that the set Bo has small
P measure and each of the other B/s has small diameter. We use these sets
to construct a random element Y of Ж, starting from an X with distribution
P. If X lands in Bt choose Y in Bt according to the conditional distribution
Q(-\Bt). For i > 1 this forces Y to lie close to X, because Bt doesn't contain
any pairs of points too far apart. The random element Y has approximately
the distribution Q:
к
A4) W>{YeA} = ? TP{YeA\XeBt}TP{XeBt}
i = O
;=o
i = 0
= Q(A).
A slight refinement of the construction will turn the approximation into an
equality. When applied with Q = Р„ and partitions growing finer with n, it
will generate the sequence {Х„} promised by the Representation Theorem.
15 Lemma. For each e > 0 and each P concentrating on a separable set of
completely regular points, the space 9C can be partitioned into finitely many
disjoint, ssf-measurable sets Bo, Bu ..., Bk such that:
(i) the boundary of each Bt has zero P measure (a P-continuity set);
(ii) P(BQ) < e;
(iii) diameter(Bj) < 2s for i = l,2,...,k.

72 IV. Convergence in Distribution in Metric Spaces
Proof. Call the separable set C. To each x in С there exists a uniformly
continuous, ^-measurable / with f(x) = 1 and/ = 0 for points a distance
greater than e from x. The open sets of the form {/ > a], for 0 < a < 1, are
all .«/-measurable and of diameter less than 2s. At each point on the boundary
of {/ > a}, the continuous function / takes the value a. Because P{/ = a]
can be non-zero for at most countably many different values of a, there must
exist at least one a. for which the probability equals zero. Choose and fix such
an a, then write G(x) for the corresponding set {/ > a}. It has diameter less
than 2e and is a P-continuity set.
The union of the family of open sets {G(x): x e C} contains the separable
set C. Extract a countable subfamily {G(x;): i = 1,2,...} containing C.
(Every open cover of a separable subset of a metric space has a countable
subcover: Problem 5.) Because
Р <7СхЛ t P G(x} > P(O = 1
there exists a /c such that
Г * "I
> 1 - e.
Define Bi = G(Xi)\[G(xi)u---uG(ii-1)] for i = l,...,k and Bo =
[G(xO и • • • и G(xJ]c, a process known to the uncouth as disjointification.
The boundary of Bt is covered by the union of the boundaries of the P-
continuity sets G(xi),..., G(xk). Each Bt lies completely inside the cor-
corresponding G(xi), a set of diameter less than 2e if i > 1. ?
Proof of Theorem 13. Holding e fixed for the moment, carry out the con-
construction detailed in the proof of the lemma, generating P-continuity sets
Bo, Bu ..., Bk as described.
The indicator function of Bt is almost surely continuous [P] because it has
discontinuities only at the boundary of B{. So by the Convergence Lemma
Р„(Б,) -> P(Bt). When n is large enough, say n > n(s),
A6) Pn(Bt) > A - e)P(Bd for i = 0,l,...,L
Write nm for nB~m). Without loss of generality suppose 1 = nt < n2 < ¦ ¦ ¦ .
For nm < n < nm+1, construct Xn using the {BJ partition corresponding to
sm = 2~m. Notice that Bt now depends on n through the value of m.
Let ?, be a random variable that has a Uniform@, 1) distribution in-
independent of X. If ^ < 1 — sm and X lands in B;, choose Xn according to the
conditional distribution Р„(-|В,). So far no Bt has received more than its
quota of Р„ measure, because of A6). The extra probability will be distributed
over the space 3C to bring Х„ up to its desired distribution Р„. If ?, > 1 — em
choose X„ according to the distribution ц„ determined by
Pn(A) = fin(A)nt > 1 - ej + ? PK(A\B,X1 -
< = o

IV.3. Representation by Almost Surely Convergent Sequences 73
That is,
Hn(A) = e I Рп(А\Вд1Рп(Вд - A - em)P(Bt)l
i=0
By A6), the right-hand side is non-negative. And clearly ц„Ж = 1.
Except on the set Qm = {X e Bo or ?, > 1 — sm}, which has measure at
most 2em, the random elements X and Xn lie within 2гт of each other.
On the complement of the set {Qm infinitely often}, the sequence {Xn}
converges to X. By the Borel-Cantelli lemma P{Qm infinitely often} = 0.
?
The applications of Theorem 13 follow the same pattern as in Section
Ш.6. Problems of weak convergence transform into problems of almost sure
convergence, to which the standard tools (monotone convergence, dominated
convergence, and so on) can be applied.
17 Example. Most of the proof of the Convergence Lemma did not use the
full force of almost sure continuity for the function h. To get the inequality
for the liminf we only needed lower-semicontinuity of h at points of C.
(Remember that semicontinuity imposes only half the constraint of con-
continuity: only a lower bound is set on the oscillations of h in a neighborhood
of a point. Problem 9 will refresh your memory on semicontinuity.) The
Representation Theorem gives a quick proof of the same result.
If g is bounded below, lower-semicontinuous, and ^-measurable (auto-
(automatic if л/ equals the borel cr-field), then liminf Р„д > Pg whenever Р„ ~> Р
with P concentrated on a separable set of completely regular points. To
prove it, switch to almost surely convergent representations. Lower-semi-
Lower-semicontinuity at X(a>) plus almost sure convergence of the representing sequence
imply
liminf g(X„(со)) > д(Х(со)) almost surely.
Take expectations.
liminf Png = liminf TPg(Xn)
by Fatou's lemma
A similar inequality holds for upper-semicontinuous, .^/-measurable func-
functions that are bounded above. As a special case,
A8) limsup PnF <PF
for each closed, ^/-measurable set F. If inequality A8) holds for all such F
then necessarily Р„ --¦» P (Problem 12). ?
19 Example. Let <§ be a uniformly bounded class of ^/-measurable, real
functions on Ж. Suppose that Pn --¦» P, with P concentrated on a separable

74 IV. Convergence in Distribution in Metric Spaces
set of completely regular points. Suppose also that <g is equicontinuous at
almost all points [P] of Ж. That is, for almost all x and each e > 0 there exists
a 5 > 0, depending on x but not on g, such that \g{y) — g(x)\ < e whenever
d(x, y) < 5, for every g in <S. Then
B0) sup | Png-Pg |-> 0.
This result underlies the success of most of the functions that have been
constructed in the literature to metrize the topology of weak convergence.
To prove B0), represent the probability measures by almost surely
convergent random elements {Х„}, then deduce from equicontinuity that
B1) sup \д{Х„) - g(X) | -* 0 almost surely.
It would be tempting to appeal to dominated convergence to get
sup \TPg(Xn) - JPg(X)\ < IP sup \g(Xn) - g(X)\ -> 0,
but that would assume measurability of the supremum in B1). Instead, note
that B0) could fail only if, for some s > 0, there were functions {д„} in <S
for which \Pngn - Рд„\ > s infinitely often. Apply the dominated conver-
convergence argument to the countable family ^0 = {gu g2,...} to reach a con-
contradiction. ?
22 Example (The Bounded-Lipschitz Metric for Weak Convergence).
Suppose that ,s/ contains all the closed balls, as in the case of D[0, 1] under
its uniform metric. The function /(•) = r[l — md(-, z)]+, which serves to
separate z from points outside a small neighborhood of z, has the strong
uniformity property
\f(x)-f(y)\<mrd(x,y).
A function satisfying such a condition, with mr replaced possibly by a
different constant, is called a Lipschitz function. For the proof of the Con-
Convergence Lemma, Pnf-*Pf for each bounded, ^-measurable Lipschitz
function would have sufficed; convergence for bounded Lipschitz functions
implies weak convergence. From Example 19 we draw a sharper conclusion.
Define ?f to be the set of all ^-measurable Lipschitz functions for which
| f(x) - f(y) | < d(x, y) and supx | f(x) | < 1. The class <? is equicontinuous
at each point of 9E. Every bounded Lipschitz function can be expressed as a
multiple of a function in $?.
Define the distance between two probability measures on л/ by
You can check that Я has all the properties required of a metric. If P con-
concentrates on a separable set and Pn ~» P, the distance X{Pn, P) converges to
zero, in obedience to the uniformity result of Example 19. Conversely, the

IV.3. Representation by Almost Surely Convergent Sequences 75
convergence of X(Pn,P) to zero would ensure that Pnf-*Pf for each
bounded Lipschitz function /, which, as noted above, implies weak con-
convergence. ?
23 Example (The Prohorov Metric for Weak Convergence). Suppose 3C is
a separable metric space equipped with its borel cr-field. For each 8 > 0 and
each borel subset A of Ж define
A6 = {x e %: d{x, A) < 5}.
(Visualize the open set As as A wearing a halo of thickness 8.) Define the
Prohorov distance between two borel probability measures as
p(P, Q) = inf{<5 > 0: PA6 + 8 > QA for every A}.
This distance has great appeal for robustniks, who interpret the delta halo
as a way of constraining small migrations of Q mass and the added delta as
insurance against a small proportion of gross changes. To us it will be just
another metric for weak convergence.
It is not obvious that p is symmetric, one of the properties required of a
metric. We need to show that QA& + 8 > PA for every A, whenever
p(P, Q) < 8. Set В equal to the complement of A6. We know that QB <
PBd + 8. Subtract both sides from 1, after replacing Bs by the complement
of A, a larger set. (No point of A can be less than 8 from a point in B.) We
have symmetry.
If p(P, Q) = 0 then certainly PFa + 8 > QF for every closed F and
every 8 > 0. Hold F fixed but let 8 tend to zero through a sequence of values.
The sequence {Fd} shrinks to F, giving PF > QF in the limit. Interchange
the roles of P and Q then repeat the argument to deduce that P and Q agree
on all closed sets, and hence (Problem 11) on all borel sets.
For the triangle inequality, suppose that p(P, Q) < 8 and p(Q, R) < r\.
Temporarily set В = Ац. Then
RA < QA" + ц = QB + ц < PB* + tj + 8.
Check that A5+<] contains B6. Deduce that p(R, P) < ц + д.
Next, show that weak convergence implies convergence in the p metric.
It suffices to deduce that p(Pn, P) < 8 eventually if Pn ~> P. For each borel
set A define
Notice that A6 > fA> A. Also, because
IMx) - fA(y)\<8~l\d(x, A) - d(y, A)\<8~1 d(x, y),
the class of all such fA functions is equicontinuous. By Example 19,
svv\PJA-PfA\-+0.
A

76 IV. Convergence in Distribution in Metric Spaces
Call this supremum А„. Then
PA3 > PfA > PJA - А„ > Р„А - А„
for every A. Wait until А„ < 5 to be able to assert that p(P, Pn) < 5.
Finally, if p(Pn, P) -> 0 then, for fixed closed F,
limsup PnF < PFd + S
for every <5 > 0. Let <5 decrease to zero then deduce from Problem 12 that
Р„ ^> P. Convergence in the p metric is equivalent to weak convergence. ?
IV.4. Coupling
The Representation Theorems of Sections III. 6 and IV. 3 both depended
upon methods for coupling distributions Pn and P. That is, we needed to
construct random elements Х„ and X, on the same probability space, such
that Х„ had distribution Pn and X had distribution P. Closeness of Р„ and
P, in a weak convergence sense, allowed us to choose Xn and X close in a
stronger, almost sure sense. This section will examine coupling in more
detail.
A coupling of probability measures P and Q, on a space 3C, can be realized
as a measure M on the product space SC ® 3C, with X and Y defined by the
coordinate projections. The product measure P ® Q is a coupling, albeit
a not very informative one. More useful are those couplings for which M
concentrates near the diagonal. For example, in the Representation Theorem
we put as much mass as possible on the set {(x, y): d(x, y) < e}.
Roughly speaking, one can construct such couplings in two steps. First
treat the desired property—that as much mass as possible be allocated to a
particular region D in the product space—as a strict requirement. Imagine
building up M slowly by drawing off mass from the P marginal measure and
relocating it within D, subject to a matching constraint: to put an amount <5
near (x, y) one must deplete the P supply near x by <5 and the Q supply near
у by 8. When as much mass as possible has been shifted into D by this method,
forget about the constraint imposed by D. In the second step, complete the
transfer of mass from P into the product space subject only to the matching
constraint. The final M will have the correct marginals, P and Q.
A precise formulation of the coupling algorithm just sketched is easiest
when both P and Q concentrate on a finite set of points. The first step can
be represented by a picture that looks like a crossword puzzle. Label the
points on which Q concentrates as 1,..., r; let these correspond to rows of
a two-way array of cells. Similarly, let 1,..., с label both the points on
which P concentrates and the columns of the two-way array. The stack beside
row i represents the mass Q puts on point i, and the stack under column j
represents the mass P puts on j. The unshaded cells correspond to D. The

IV.4. Coupling
77
aim is to place as much mass as possible in the unshaded cells without
violating the constraint that the total mass in a row or column should not
exceed the amount originally in the marginal stacks.
У
\' i
This formulation makes sense even if the marginal supplies don't both
correspond to measures with total mass one. In general we could allow any
non-negative masses R(i) and C(/) in the supply stacks for row i and column
j. We would seek a non-negative allocation M(i, j) of as much mass as possible
into the unshaded cells, subject to
and
for each i andy. A continuous analogue of the classical marriage lemma (a sort
of fractional polygamy) will give the necessary and sufficient conditions for
existence of an M that turns the inequalities for the columns into equalities.
Treat С and R as measures. Write C(J) for the sum of supply masses in a
set of columns J. Denote by D} the set of rows i for which cell (i, y) belongs
to D for at least one column j in J. It is easy to see that M can have column
marginal С only if R(Dj) > C(J) for every J, because the rows in D, contain
all the D-cells in the columns of J. Sufficiency is a little trickier.
24 Allocation Lemma. // R(Dj) > C(J) for every set of columns J, then
there exists an allocation M(i,j) into the cells ofD such that
and
for every i and).
Proof. Use induction on the number of columns. The result is trivial for
с = 1. Suppose it is true for every number of columns strictly less than c.
Construct M by transferring mass from the column margins into D.
Shift mass at a constant rate into each of the D-cells in row r. For any mass

78
IV. Convergence in Distribution in Metric Spaces
shifted from C(j) into (r,j) discard an equal amount from R(r). If R(r)
becomes exhausted, move on to row r — 1, and so on. Stop when either:
(i) some C(j) is exhausted; or
(ii) one of the constraints R(Dj) > C{J) would be violated by continuation
of the current method of allocation.
Here R and С are used as variable measures that decrease as mass is drawn
off; the supply stacks diminish as the allocation proceeds. Notice that the
mass transferred at each step can be specified as the largest solution to a
system of linear inequalities.
If the allocation halts because of (i), the problem is transformed into an
allocation for с — 1 columns. The inductive hypothesis can be invoked to
complete the allocation.
If allocation halts because of (ii), then there must now exist some К for
which R(DK) = C(K). Continued allocation would have caused R(DK) <
C{K). The matching-constraint prevents К from containing every column:
the total column supply always decreases at the same rate as the total row
supply. Write Kc for the non-empty set of columns not in K.
к
¦шщ
HI
qW
1
I
У
с
Dk
If the marginal demands of the columns in К are to be met, the entire
remaining supply R(DK) must be devoted to those columns. With this
requirement the problem splits into two subproblems: rows in DK may match
only mass drawn off from the columns in K; from the rows DQK not in DK,
match mass from the columns in Kc. Both subproblems satisfy the initial
assumptions of the lemma. For subsets of К this follows because allocation
halted before R(Dj) < C(J) for any J. For subsets of Kc, it follows from
R(Dj n Z)|)
K) - R(DK)
> C(J и К) - C(K)
= C(J).
Invoke the inductive hypothesis for both subproblems to complete the proof
of the lemma. ?

IV.4. Coupling
79
25 Corollary. If R and С have the same total mass and R(Dj) > C(J) for
every J, then the allocation measure M has marginal measures R and C. ?
The Allocation Lemma applies directly only to discrete distributions
supported by finite sets. For distributions not of that type a preliminary
discretization, as in the proof of the Representation Theorem, is needed.
26 Example. Let P and Q be borel probability measures on a separable
metric space. The Prohorov distance p{P, Q) determines how closely P
and Q can be coupled, in the sense that p(P, Q) equals the infimum of those
values of e such that
B7) JP{d(X, Y) > e} < e,
with X having distribution P and Y having distribution Q. We can use the
Allocation Lemma to help prove this.
Half of the argument is easy. From B7) deduce, for every A,
QA = W{YeA}
< W>{X e Ac) + JP{d(X, Y) > e}
< PAE + e,
whence p(P, Q) < s.
For the other half of the argument suppose p(P, Q) < s. Construct X
and Y by means of a two-stage coupling. Apply the method of Lemma 15
twice to partition the underlying space into sets Bo, Bu...,Bk with both
QBQ < S and PB0 < 5, and diameter(B,) < 5 for i = \,...,k. Choose 5 as
a quantity much smaller than e; it will eventually be forced down to zero
while г stays fixed. The requirement that each Bt be a Q- or P-continuity set
is irrelevant to our present purpose.
Set R(i) equal to QBt and CO') equal to PBj. Into the region D allow only
those cells (i, j), for 1 < i < к and 1 < j < k, whose corresponding Bt and
Bj contain a pair of points, one in Bt and one in Bj3 a distance < e apart.
Augment the double array by one more row, call it oo, whose row stack
contains mass e + 25. Include (oo, 0),..., (oo, k) in the region D.
Bo
V
в,
¦ --
Bk

80 IV. Convergence in Distribution in Metric Spaces
The hypotheses of the Allocation Lemma are satisfied. For any collection
of columns J,
< pb0 + p({j в
\J\{0]
< 5 + q( U bX + г
V\{0> /
< <5 + Q\ (J ВЛ + QB0 + e by definition of D
< 5 + R(Dj\{oo}) + 5 + s
Distribute all the mass from the column stacks into D, as in the Allocation
Lemma. The oo row acts as a temporary repository for the small amount of
mass that cannot legally be shifted into the desired small-diameter cells.
Return the mass in this row to the column stacks, leaving at least I — e — 25
of the original С mass in the desired cells.
Strip away the oo row. Allocate the remaining mass in the column stacks
after expanding D to include all cells (i,j), for 0 < i < к and 0 < j < k.
So far we have only decided the allocation of masses M(i, j) between the
cells. Within the cells distribute according to the product measures
The resulting Monf ®f has marginal measures P and Q. For example,
within Bo the column marginal is
YtM(i,0)Q(Bi\Bi)P(-\Bo) = P(B0)P(-\B0) = P(-Bo).
i
The M measure concentrates at least 1 — e — 2<5 of its mass within the
original D, a cluster of cells each of diameter less than 5 in both row and
column directions. For a point (x, y) lying in a cell (i,j) of this cluster, there
exists points z,- and Zj with
d(x, zt) < 5, d(zt, zj) < e, d(z}, y) < 5,
which gives d(x, y) < e + 25. Put another way, if X and Y denote the co-
coordinate projections then
П»да, Y) > s + 25} < s + 25.
As 5 can be chosen arbitrarily small, and г can be chosen as close to p(P, Q)
as we please, we have the desired result.
Problem 17 gives a condition under which the bound p(P, Q) can be
achieved by a coupling of P and Q. ?

IV.5. Weakly Convergent Subsequences 81
IV. 5. Weakly Convergent Subsequences
A reader not interested in existence theorems could skip this section, which
presents a method for constructing measures on metric spaces. The results
will be used in Section V.3 to prove existence of the brownian bridge. The
method will be generalized in Chapter VII.
We saw in Section III.6 how to modify the quantile-transformation
construction of the one-dimensional Representation Theorem to turn it
into an existence theorem, a method for constructing a probability measure
as the distribution of the almost sure limit of a sequence of random variables.
We had to impose a uniform tightness constraint to stop the sequence from
drifting off to infinity. The analogous result for probabilities on metric
spaces plays a much more important role than in euclidean spaces, because
existence theorems of any sort are so much harder to come by in abstract
spaces. Again the key to the construction is a uniform tightness property,
which ensures that sequences that ought to converge really do converge.
The setting is still that of a metric space 3C equipped with a sub-a-field s# of
its borel a-field.
28 Definitions. Call a probability measure P on ,s/ tight if for every в > О
there exists a compact set K(s) of completely regular points such that
PK(s) > 1 - e.
Call a sequence {Pn} of probability measures on si uniformly tight if for
every e > 0 there exists a compact set K(e) of completely regular points such
that liminf Pn G > 1 — г for every open, ^-measurable G containing K(e).
?
Problem 7 justifies the implicit assumption of ^-measurability for the
K(e) in the definition of tightness; every compact set of completely regular
points can be written as a countable intersection of open, ^-measurable
sets.
If G is replaced by K(s), the uniform tightness condition becomes a
slightly tidier, but stronger, condition. It is, however, more natural to retain
the open G. If Р„ ~> P and P is tight then, by virtue of the results proved in
Example 17, the liminf condition for open G is satisfied; it might not be
satisfied if G were replaced by K(s). More importantly, one does not need
the stronger condition to get weakly convergent subsequences, as will be
shown in the next theorem.
For the proof of the theorem we shall make use of a property of compact
sets:
If {х„} is a Cauchy sequence in a metric space, and if d(xn, К) -» О
for some fixed compact set K, then {х„} converges to a point of K.
This follows easily from one of a set of alternative characterizations of
compactness in metric spaces. As we shall be making free use of these charac-
characterizations in later chapters, a short digression on the topic will not go amiss.

82 IV. Convergence in Distribution in Metric Spaces
To prove the assertion we have only to choose, according to the definition
of d(xn, K), points {yn} in К for which d(xn, у„) -» О. From {у„} we can
extract a subsequence converging to a point у in K. For if no subsequence
of {у„} converged to a point of K, then around each x in К we could put an
open neighborhood Gx that excluded у„ for all large enough values of n.
This would imply that {yn} is eventually outside the union of the finite
collection of Gx sets covering the compact K, a contradiction. The cor-
corresponding subsequence of {xn} also converges to y. The Cauchy property
forces {х„} to follow the subsequence in converging to y.
A set with the property that every sequence has a convergent subsequence
(with limit point in the set) is said to be sequentially compact. Every compact
set is sequentially compact. This leads to another characterization of
compactness:
A sequentially compact set is complete (every Cauchy sequence
converges to a point of the set) and totally bounded (for every
positive e, the set can be covered by a finite union of closed balls
of radius less than e).
For clearly a Cauchy sequence in a sequentially compact К must converge
to the same limit as the convergent subsequence. And if К were not totally
bounded, there would be some positive г for which no finite collection of
balls of radius e could cover K. We could extract a sequence {xn} in К with
xn+1 at least e away from each of xlt..., xn for every n. No subsequence of
{х„} could converge, in defiance of sequential compactness.
For us the last link in the chain of characterizations will be the most
important:
A complete, totally bounded subset of a metric space is compact.
Suppose, to the contrary, that {GJ is an open cover of a totally bounded
set К for which no finite union of {G;} sets covers K. We can cover К by
a finite union of closed balls of radius \, though. There must be at least
one such ball, Bx say, for which К глВх has no finite {GJ subcover. Cover
К n Bx by finitely many closed balls of radius ?. For at least one of these
balls, B2 say, К глВ1глВ2 has no finite {GJ subcover. Continuing in
this way we discover a sequence of closed balls {В„} of radii {2~"} for
which К n Bi n • • • n Bn has no finite {G;} cover. Choose a point xn
from this (necessarily non-empty) intersection. The sequence {xn} is
Cauchy. If К were also complete, {xn} would converge to some x in K.
Certainly x would belong to some Gh which would necessarily contain
Bn for n large enough. A single G; is about as finite a subcover as one could
wish for. Completeness would indeed force {G,} to have a finite subcover
for K. End of digression.
29 Compactness Theorem. Every uniformly tight sequence of probability
measures contains a subsequence that converges weakly to a tight borel
measure.

IV.5. Weakly Convergent Subsequences 83
Proof. Write {Pn} for the uniformly tight sequence, and Kk for the compact
set K(ek), for a fixed sequence {sk} that converges to zero. We may assume
that {Kk} is an increasing sequence of sets.
The proof will use a coupling to represent a subsequence of {Р„} by an
almost surely convergent sequence of random elements. The limit of these
random elements will concentrate on the union of the compact Kk sets; it
will induce the tight borel measure on 9C to which the subsequence {Р„} will
converge weakly.
Complete regularity of each point in Kk allows us to cover Kk by a collec-
collection of open ^-measurable sets, each of diameter less than sk. Invoke
compactness to extract a finite subcover, {Uki: 1 < i < ik}. Define л/т
to be the finite subfield of s/ generated by the open sets Uki for 1 < к < т
and 1 < i < ik.
The union of the fields {stfm} is a countable subfield s/x of s/. Apply
Cantor's diagonalization argument to extract a subsequence of {Pn} along
which lim Р„А exists for each A in stf^. Write XA for this limit. It is a finitely
additive measure on the field ^ao. Avoid the mess of double-subscripting
by assuming, with no loss of generality, that the subsequence is {Р„} itself.
If {Pn} were weakly convergent to a measure P we would be able to deduce
that P(interior of A) < XA < P(closure of A) for each A in stf х. If we could
assume further that P put zero mass on the boundary of each such A, we
would know the P measure of enough sets to allow almost surely convergent
representing sequences to be constructed as in the Representation Theorem.
Unfortunately there is no reason to expect P to cooperate in this way.
Instead, we must turn to Я as a surrogate for the unknown, but sought after,
probability measure P.
Since X need not be countably additive, it would be wicked of us to presume
the existence of a random element of 3C having distribution X. We must take
a more devious approach.
We can build a passable imitation of s/x on the unit interval. Partition
@, 1) into as many intervals as there are atoms ofjtfu making the lebesgue
measure of each interval A equal to the X measure of the corresponding A in
s/t. These intervals generate a finite field s$± on @, 1). Partition each atom
A in s& \ into as many subintervals as there are atoms of st2 ш A matching
up lebesgue and X measures as before. The subintervals together generate a
second field sf2 on @, 1); finer tnan ^i- Continuing in similar fashion, we
set up an increasing sequence of fields {j?k} on @,1) that fit together in the
same way as the fields {stfk} on Ж. The union of the ~Mk% is a countable
subfield Лх of @, 1). There is a bijection Л ¦-> A between stf ^ and st x that
preserves inclusion, maps slk onto s#k, and preserves measure, in the sense
that the lebesgue measure of A equals XA. The construction ensures that,
if ц has a Uniform@, 1) distribution, IP{>7 e 1} = XA for every A in si'«,.
The random variable ц chooses between the sets in s$k in much the same
way as a random element X with distribution P would choose between the
sets in s/k.

84 IV. Convergence in Distribution in Metric Spaces
By definition of X, there exists an n(k) such that
C0) Р„А > A — sk)XA for every A in s/k whenever n > n(k).
Lighten the notation by assuming that n(k) = k. (If you suspect these
notational tricks for avoiding an orgy of subsequencing, feel free to rewrite
the argument using, by now, triple subscripting.) As in the proof of the
Representation Theorem, this allows us to construct a random element Xn,
with distribution Р„, by means of an auxiliary random variable ?, that has a
Uniform@, 1) distribution independent of r\:
For each atom A of sdn, \lr\ falls in the corresponding A of stfn and
? < 1 — е„, distribute Xn on A according to the conditional
distribution Pn(-\A). If i > 1 - en distribute Х„ with whatever
conditional distribution is necessary to bring its overall distribution
uptoPn.
We have coupled each Р„ with lebesgue measure on the unit square.
To emphasize that Xn depends on r\, ?;, and the randomization necessary
to generate observations on Pn(-\A), write it as Xn(co, Ц, 0- Notice that the
same ц and ?, figure in the construction of every X„.
It will suffice for us to prove that {Xn(co, Ц, ?)} converges to a point
X(со, ц, ?) of Kk for every со and every pair (щ, <!;) lying in a region of prob-
probability at least A — skJ, a result stronger than mere almost sure convergence
to a point in the union of the compact sets {Kk}. Problem 16 provides the
extra details needed to deduce borel measurability of X.
For each m greater than k, let Gmk be the smallest open, ^„-measurable
set containing Kk. Uniform tightness tells us that
XGmk = liminf PnGmk > 1 - sk,
which implies IP{>7eGmfc} > 1 — ek. Define Gk as the intersection of the
decreasing sequence of sets {Gmk} for m = к, к + 1,.... The overbar here
is slightly misleading, because Gk need not belong to si'x. But it is a borel
subset of @, 1). Countable additivity of lebesgue measure allows us to deduce
that IP[r] e Gk} > 1 — sk. Notice how we have gotten around lack of count-
countable additivity for X, by pulling the construction back into a more familiar
measure space.
Whenever ц falls in Gk and ? < 1 — sk, which occurs with probability at
least A — skJ, the random elements Xk, Xk+1,... crowd together into a
shrinking neighborhood of a point of Kk. There exists a decreasing sequence
{AJ with:
(i) Am is an atom of $4m\
(ii) Am is contained in Gmk;
(iii) Xm(co, ц, О lies in Am.
Properties (i) and (iii) are consequences of the method of construction for
Xm; property (ii) holds because Gk is a subset of Gmk. The set Gmk, being the

Notes 85
smallest open, j/m-measurable set containing Kk, must be contained within
the union of those Umi that intersect Kk. The atom Am must lie wholly within
one such Umi, a set of diameter less than гт. So whenever tj falls in Gk and
?, < 1 — ?t, the sequence {Xm} satisfies:
(i) d(-X"m(co, j/, О, Хп(со, ц, 0) < sm for к < m < n;
(ii) d(Xm(co, ц, 0, Kk) < em for к < т.
As explained at the start of the digression, this forces convergence to a point
X(co,r,,i) of Kk. ?
Notes
Any reader uncomfortable with the metric space ideas used in this chapter
might consult Simmons A963, especially Chapters 2 and 5).
The advantages of equipping a metric space with a cr-field different from
the borel a-field were first exploited by Dudley A966a, 1967a), who developed
a weak convergence theory for measures living on the cr-field generated by
the closed balls. The measurability problem for empirical processes (Example
2) was noted by Chibisov A965); he opted for the Skorohod metric. Руке
and Shorack A968) suggested another way out: X„ ~> X should mean
JPf(Xn) -* JPf(X) for all those bounded, continuous / that make/(Xn) and
f(X) measurable. They noted the equivalence of this definition to the defini-
definition based on the Skorohod metric, for random elements of D[0, 1] con-
converging to a process with continuous sample paths.
Separability has a curious role in the theory. With it, the closed balls
generate the borel сг-field (Problem 6); but this can also hold without
separability (Talagrand 1978). Borel measures usually have separable
support (Dudley 1967a, 1976, Lecture 5).
Alexandroff A940, 1941, 1943) laid the foundation for a theory of weak
convergence on abstract spaces, not necessarily topological. Prohorov A956)
reset the theory in complete, separable metric space, where most probabilistic
and statistical applications can flourish. He and LeCam A957) proved
different versions of the Compactness Theorem, whose form (but not the
proof) I have borrowed from Dudley A966a). Weak convergence of baire
measures on general topological spaces was thoroughly investigated by
Varadarajan A965). Topsee A970) put together a weak convergence theory
for borel measures; he used the liminf property for semicontinuous functions
(Example 17) to define weak convergence. These two authors made clear
the need for added regularity conditions on the limit measure and separation
properties on the topology. One particularly nice combination—a completely
regular topology and a г-additive limit measure—corresponds closely to my
assumption that limit measures concentrate on separable sets of completely
regular points.
The best references to the weak convergence theory for borel measures on
metric spaces remain Billingsley A968, 1971) and Parthasarathy A967).

86 IV. Convergence in Distribution in Metric Spaces
Dudley's A976) lecture notes offer an excellent condensed exposition of both
the mathematical theory and the statistical applications.
Example 11 is usually attributed to Wichura A971), although Hajek
A965) used a similar approximation idea to prove convergence for random
elements of C[0, 1].
Skorohod A956) hit upon the idea of representing sequences that converge
in distribution by sequences that converge almost surely, for the case of
random elements of complete, separable metric spaces. The proof in Section 3
is adapted from Dudley A968). He paid more attention to some of the points
glossed over in my proof—for example, he showed how to construct a
probability space supporting all the {Х„}. Here, and in Section 5, one needs
the existence theorem for product measures on infinite-product spaces. Руке
A969,1970) has been a most persuasive advocate of this method for proving
theorems about weak convergence. Many of the applications now belong
to the folklore.
The uniformity result of Example 19 comes from Ranga Rao A962);
Billingsley and Tops0e A967) and Tops0e A970) perfected the idea. Not
surprisingly, the original proofs of this type of result made direct use of the
dissection technique of Lemma 15. Prohorov A956) defined the Prohorov
metric; Dudley A966b) defined the bounded Lipschitz metric.
Strassen A965) invoked convexity arguments to establish the coupling
characterization of the Prohorov metric (Example 26). My proof comes
essentially from Dudley A968), via Dudley A976, Lecture 18), who introduced
the idea of building a coupling between discrete measures by application of
the marriage lemma. The Allocation Lemma can also be proved by the
max-flow-min-cut theorem (an elementary result from graph theory; for a
proof see Bollobas A979)). The conditions of my Lemma ensure that the
minimum capacity of a cut will correspond to the total column mass.
Appendix В of Jacobs A978) contains an exposition of this approach,
following Hansel and Troallic A978). Major A978) has described more
refined forms of coupling.
Problems
[1] Suppose the empirical process U2 were measurable with respect to the borel cr-field
on D[0, 1] generated by the uniform metric. For each subset A of A, 2) define JA
as the open set of functions in D[0, 1] with jumps at some pair of distinct points
tx and t2 in [0, 1] with tl + t2 in A. Define a non-atomic measure on the class of
all subsets of A,2) by setting y(A) = 1P{U2 e JA}. This contradicts the continuum
hypothesis (Oxtoby 1971, Section 5). Manufacture from у an extension of the
uniform distribution to all subsets of A, 2) if you would like to offend the axiom
of choice as well. Extend the argument to larger sample sizes.
[2] Write si for the cr-field on a set Ж generated by a family {/•} of real-valued func-
functions on Ж. That is, si is the smallest cr-field containing /f 1B for each i and each
borel set B. Prove that a map X from (Q, S) into Ж is <f/^/-measurable if and only
if the composition ft ° X is (f/J'CIR^measurable for each i.

Problems 87
[3] Every function in ?>[0, 1] is bounded: \x(tn)\ -> oo as n -> oo would violate either
the right continuity or the existence of the left limit at some cluster point of the
sequence {(„}.
[4] Write 0> for the projection cr-field on D[0, 1] and @0 for the cr-field generated by
the closed balls of the uniform metric. Write щ for the projection map that takes
an x in ?>[0, 1] onto its value x(t).
(a) Prove that each nt is ^„-measurable. [Express {x: ntx > a} as a countable
union of closed balls B(xn, n), where xn equals a plus (n + n'1) times the
indicator function of [t, t + n~').] Deduce that 3D0 contains &>.
(b) Prove that the cr-field &> contains each closed ball B(x, r). [Express the ball
as an intersection of sets {z: \n,x — ntz\ < r}, with t rational.] Deduce that
&> contains 38 0 .
[5] Let {G;} be a family of open sets whose union covers a separable subset С of a
metric space. Adapt the argument of Lemma 7 to prove that С is contained in the
union of some countable subfamily of the {G;}. [This is Lindelof's theorem.]
[6] Every separable, open subset of a metric space can be written as a countable union
of closed balls. [Rational radii, centered at points of the countable dense set.]
The closed balls generate the borel cr-field on a separable metric space.
[7] Every closed, separable set of completely regular points belongs to j/. [Cover it
with open, ^/-measurable sets of small diameter. Use Lindelof's theorem to
extract a countable subcover. The union of these sets belongs to $2. Represent the
closed set as a countable intersection of such unions.]
[8] Let Co be the countable subset of C[0, 1] consisting of all piecewise linear func-
functions with corners at only a finite set of rational pairs (t;, r,). Argue from uniform
continuity to prove that C[0, 1] equals the closure of Co. Deduce that C[0, 1]
is a projection-measurable subset of D[0, 1].
[9] A function h is said to be lower-semicontinuous at a point x if, for each M < h(x),
h is greater than M in some neighborhood of x. To say h is lower-semicontinuous
means that it is lower-semicontinuous at every point. Show that the upper envelope
of any set of continuous functions is lower-semicontinuous. Adapt the construction
of Lemma 7 to prove that every lower-semicontinuous function that is bounded
below can be represented on a separable set of completely regular points as the
point wise limit of an increasing sequence of continuous functions. How would one
define upper-semicontinuity? Which sets should have upper-semicontinuous
indicator functions? What does a combination of both semicontinuities imply?
[10] If Х„ --» X as random elements of a metric space ЭЕ and d(Xn, Yn) -> 0 in prob-
probability, then Yn -* X, provided that IP* concentrates on a separable set of completely
regular points. [Convergence in probability means JP*{d(Xn, Yn) > e} -> 0 for
each e > 0.]
[11] Let P be a borel measure on a metric space. For every borel set В there exists an
open GE containing В and a closed Fe contained in В with P(GC\FS) < s. [The
class of all sets with this property forms a cr-field. Each closed set has the property
because it can be written as a countable intersection of open sets.] Deduce that
P is uniquely determined by the values it gives to closed sets. Extend the result
to measures defined on the cr-field generated by the closed balls.

88 IV. Convergence in Distribution in Metric Spaces
[12] Suppose limsup PnF < PF for each closed, ^/-measurable set F. Prove that
Р„ --» P by applying the inequalities
for each non-negative / in %?(Ж; j/). [The summands are identically zero for all
i large enough. Apply the same argument to —/ + (a big constant).]
[13] If Р„В -> PB for each ^/-measurable set В whose boundary has zero P measure
then Р„ ~* P. [Replace the levels i/k of the previous problem by levels f; for which
[14] The functions in <?(Ж; jf) generate a sub-a-field 38 c of s/. A map X from (Q, S)
into Ж is S/^.-measurable if and only if f(X) is <f/J'(IR)-measurable for each /in
[15] (Continued). The trace of @c on any separable set S of completely regular points
of Ж coincides with the borel ст-field on S. [Sets of the form {/ > 0} n S, with /
in ЩЖ; sd\ form a basis for the relative topology on S. Every relatively open
subset of S is a countable union of such sets, by Lindelof's theorem.]
[16] (Continued). Let {Xm} be a sequence of (f/j^-measurable random elements of Ж
that converges pointwise to a map X. Prove that X is S/^-measurable. If X
takes values only in a fixed separable set of completely regular points, then it is
[17] Let P and Q be tight probability measures on the borel cr-field of a separable metric
space 3C. There exists a coupling for which IP{rf(Z, Y) > A} < A, where A =
p(P, Q), the Prohorov distance between P and Q. [From Example 26, there exist
measures Mn on X ® 3C, wi(th marginals P and Q, for which
Мя{(х, у): d(x, у) > А + n} < A + n.
The sequence {М„} is uniformly tight. The limit of a weakly convergent subse-
subsequence defines the required coupling. Is separability of Ж really needed ?]
[18] Let Ж be a compact metric space, and <?(Ж) be the vector space of all bounded,
continuous, real functions on Ж. Let Г be a non-negative linear functional on
<€(Ж) with Г1 = 1. These steps show that Tf = Pf for some borel probability
measure P;
(a) Given у > 0 find functions gu . ..,gk in Сё+{Ж) with diameter{#; > 0} < 2y
and 0, + • • • + & = 1. [Find /x in %+(Ж) with /,(x) > 0 and fx(y) = 0 for
d(y, x) > y. Cover К by finitely many open sets {fx > 0}. Standardize the
corresponding functions to sum to one everywhere.]
(b) Choose x, with g^x) > 0. Define Py as the discrete probability measure that
puts mass Tgt aXxb for each i. If h belongs to ЩЖ), show that \Pyh - Th\->0
as у -> 0.
(c) Extract a subsequence of {Py} that converges weakly. Show that the limit
measure P has the desired property.

CHAPTER V
The Uniform Metric on Spaces of
Cadlag Functions
... in which random elements of metric spaces of cadlag functions—stochastic
processes whose sample paths have at worst simple jump discontinuities—are
treated. Necessary and sufficient conditions for convergence in distribution are
found then specialized to prove limit theorems for empirical processes and
processes with independent increments. The two most commonly occurring
limit processes, brownian motion and brownian bridge, are studied.
V. 1. Approximation of Stochastic Processes
The theory developed in Chapter IV justifies its existence by what it has to
say about the limiting distributions of functionals defined on sequences of
stochastic processes. Processes {Xn(t): t e T} are identified with random
elements of some space Ж of functions on T, a space large enough to contain
the sample paths of every Х„; the functionals are maps defined on Ж to which
the Continuous Mapping Theorem can be applied. In this chapter we shall
specialize the general theory to the particular function space D[0, 1], under
its uniform metric. It will turn out that most applications, especially those
that come up with brownian bridges and brownian motions as limit pro-
processes, require no fancier setting than this.
Recall that for a function defined on a subset T of the extended real line
[ - oo, oo] to deserve the title cadlag it must be right continuous ("continue a
droite") and have left limits ("limites a gauche") at each point of T.
Many of the well-studied stochastic processes—processes with independent
increments, or the markov property, or some form of martingale structure—
have versions with cadlag sample paths. For T = [0, 1], these processes can
be analyzed as random elements of D[0, 1], the set of all real-valued cadlag
functions on [0, 1]. Notice that right continuity at 1 and existence of the
left limit at 0 become empty requirements for this space. The analysis for
other spaces of cadlag functions on compact intervals (such as D[ — oo, oo],
the natural space in which to study empirical distribution functions over the
real line), and the extensions to non-compact index sets (Section 5), will call
for only slight variations on the methods developed for D[0, 1].

90 V. The Uniform Metric on Spaces of Cadlag Functions
For each finite subset S of [0, 1] write tts for the projection map from
D[0, 1] into IRS that takes an x onto its vector of values {x(t): t e S}. Abbre-
Abbreviate 7r{t} to nt. These projections generate the projection a-field, 2P. Questions
of measurability in this chapter will always refer to this a-field. A stochastic
process X on (Q, S, IP) with sample paths in D[0, 1], such as an empirical
process, is ^/^-measurable provided n, ° X is ё/^(IR)-measurable for each
fixed t (Problem IV.2).
Probability measures on SP are uniquely determined by the values they
give to the generating sets {п$ 1B}, with S a finite subset of [0, 1] and В а
borel subset of IR5. Equivalently, the distribution of a random element of
D[0, 1] is uniquely specified by giving the distributions of all its finite-
dimensional projections.
As every cadlag function on [0, 1] is bounded (Problem IV.3), the uniform
distance
d(x, y) = ||x - y\\ = sup{|x(t) - y(t)\: 0 < t < 1}
defines a metric on D[0, 1]. No other metric will be used for D[0, 1] in this
chapter. The closed balls for d generate & but not the larger borel a-field
(Problem IV.4). Every point in D[0, 1] is completely regular—see Definition
IV.6 and the discussion preceding it.
The difficulty with the a-field, which can be blamed on the lack of a
countable, dense subset of functions in D[0, 1], has dissuaded many authors
from working with the uniform metric. Chapter IV showed that the difficulty
can be surmounted, at least when limit distributions concentrate on separable
subsets of D[0, 1]. As compensation for persisting with the uniform metric,
we shall find its topological properties much easier to understand and
manipulate than those of its main competitor, the Skorohod metric, which
will be discussed in Chapter VI. That will make life more pleasant for us in
Section 3 when we come to apply the Compactness Theorem.
The limit processes for the applications in this chapter will always con-
concentrate in a separable subset of D[0, 1], usually C[0, 1], the set of all con-
continuous, real functions on [0, 1]. As a closed (uniform convergence preserves
continuity), separable, ^-measurable subset of D[0, 1] (Problems IV.8),
C[0, 1] has several attractive properties. For example, it inherits complete-
completeness from D[0, 1], and its projection u-field coincides with its borel a-field
for the uniform metric (Problem IV.6).
How do we establish convergence in distribution of a sequence {Х„} of
random elements of D[0,1] to a limit process XI Certainly we need the
finite-dimensional projections {nsXn} to converge in distribution, as
random vectors in IRS, to the finite-dimensional projection nsX, for each
finite subset S of [0, 1]. Continuity of 7rs and the Continuous Mapping
Theorem make that a necessary condition. The methods of Chapter III usually
help here. But that alone could hardly suffice. Continuous functionals such
as Mx = || x ||, a typical non-trivial example, depend on x through more than
its values at a fixed, finite S. Indeed, direct examination of that very functional
gives the clue to the extra condition needed.

V.I. Approximation of Stochastic Processes 91
Intuitively, it should be possible to approximate МХ„ by taking the
maximum of |Xn(s)| over a large, finely spaced, finite subset S of [0, 1]. If
S expands up to a countable, dense subset of [0, 1] containing 1, then
for every sample path of Xn. The cadlag property assures us of this. (Notice
the special treatment demanded by 1.) Given any д > 0 and e > 0, a large
enough S could be found to ensure that
A) JP{\MsXn-MXn\>5}<e.
At first sight this seems to have solved the problem. Because MsXn depends
continuously on nsXn, it converges in distribution to MSX as n -> oo. For
/ bounded and uniformly continuous, JPf(MsXn) -» JPf(MsX). From A)
with 5 chosen appropriately,
\Wf(MsXn) - Wf(MXn)\ < a + 2||/||e.
A similar inequality holds for X. Taken together, these seem to add up to
convergence in distribution of МХ„ to MX. But the argument is flawed.
The problem lies with the choice of S in A). If the sample paths of Xn
jumped about more and more wildly as n increased, S would have to get bigger
with n. That would undermine the convergence of {MsXn} to MSX, since
finite-dimensional convergence says nothing about {nsXn} for S varying
with n. We need the same S to work for each n.
A similar argument can be invoked for any other continuous functional
H on D[0, 1]. We approximate НХ„ by a continuous function of nsXn for
some large, finite set S not depending on n. We construct the approximation
by applying H to an element AsXn in D[0, 1]. For simplicity, suppose S has
been rearranged into increasing order and augmented by the points 0 and 1,
if necessary, to form a grid 0 = s0 < sx < ¦ ¦ ¦ < sk = 1. For x in D[0, 1]
define the approximating path Asx by
B) (Asx)(t) = x(st) for Si<t<s^u
and (Asx)(l) = x(l). Notice that Asx depends on x only through nsx.
In order that H(Asx) be close to Hx, it suffices that \\Asx — x\\ be small.
Consider, for example, a uniformly continuous H. There exists a 5 > 0 such
that \H(Asx) — Hx\ < e whenever \\Asx — x\\ < S; the random variable HXn
lieswithineofff(/4sXn)withprobabilitynolessthanP{|HsXn - Xn\\ < S}.
Again, the same grid would have to work for every Х„—or at least for every
Х„ with n large enough—to allow finite-dimensional convergence to imply
convergence in distribution of {AsXn} to ASX. (The example of the empirical
process Un, whose sample paths have jumps of size n~1/2, shows that uniform
approximation of Xn by AsXn can only be required for large values of n,
anyway.) If such a map As does exist, the argument sketched above for the
supremum functional M will carry over to the functional H, proving that
НХ„ ~> НХ. (The argument might seem familiar—it is a special case of
ExampleIV.ll).

92 V. The Uniform Metric on Spaces of Cadlag Functions
For the sake of brevity, from now on shorten "finite-dimensional distri-
distributions" to fidis and "finite-dimensional projections" to fidi projections.
3 Theorem. Let X, Xu X2,... be random elements of D[0, 1] {under its
uniform metric and projection a-field). Suppose JP{X e C} = 1 for some separ-
separable subset С of D[0, 1]. The necessary and sufficient conditions for {Xn} to
converge in distribution to X are:
(i) the fidis of Х„ converge to the fidis of X; that is, nsXn ^ nsX for each
finite subset S of [0, 1];
(ii) to each e > 0 and <5 > 0 there corresponds a grid 0 = t0 < tj <••• <tm= 1
such that
D) limsup w\max sup|Xn{t) - Xn(t,)\ > s\ < s,
I i Ji J
where J, denotes the interval [t;, ti+ x), for i = 0, 1,..., m — 1.
Proof. Suppose that Xn ~> X, The projection map 7is is both projection-
measurable (by definition) and continuous. Condition (i) follows by the
Continuous Mapping Theorem.
To simplify the proof of (ii) suppose that the separable subset С equals
C[0, 1]. Continuity of the sample paths makes the choice of the grid in (ii)
easier. (Problem 4 outlines the extra arguments needed for the general case.)
Let {s0, sb ...} be a countable, dense subset of [0, 1]. To avoid trivialities,
assume that s0 = 0 and Si = 1. Write Ak for the interpolation map con-
constructed as in B) from the grid obtained by rearranging s0,..., sk into
increasing order. For fixed x in C[0, 1], the distance \\Akx — x\\ converges
to zero as к increases, by virtue of the uniform continuity of x. Applied to
the sample paths of X, this shows that \\AkX — X\\ -»0 almost surely.
Convergence in probability would be enough to assure the existence of
some к for which
E) JP{\\AkX - X\\ > 5} < e.
Choose and fix such a k. Because \\Akx — x|| varies continuously with x, the
set
F = {x ? D[0, 1]: ||Дх - x|| > 6}
is closed. By Example IV. 17 (inequality IV. 18 to be precise),
limsup JP{XneF} <JP{X<= F}.
The left-hand side bounds the limsup in D). We could require "> <5"
rather than " > <5" in D), but that would create a minor complication in the
next lemma.
Now let us show that (i) and (ii) imply Х„ ~> X. Retain the assumption
that X has continuous sample paths (see Problem 4 for the general case),

V.I. Approximation of Stochastic Processes
93
so that E) still holds. Choose any bounded, uniformly continuous, projection-
measurable, real function / on D[0, 1]. Given e > 0 find b > 0 such that
\f(x) — f{y)\ < e whenever \\x — y\\ < 5. Write AT for the approximation
map constructed from the grid in (ii) corresponding to this 5 and e. Condition
D) becomes
limsup JP{\\ATXn - Х„\\ > 5} < e.
With no loss of generality we may assume that the Ak of E) equals AT,
because if we combine the two underlying grids we have at worst to replace
b by 2<5 to preserve the two approximations. For example, iff; < Sj < t <ti+i
then, whenever \\АТХ„ - XJ < 5,
\Xn(t) - Xn(Sj)\ < \Xn(t) - Xn(tt)\ + \Xn(Sj) - Xn(td\ < 26.
The argument now follows the lines sketched out before. Write the com-
composition f ° AT as g о пт where g is a bounded, continuous function on
D[0, 1]—a fancy way of saying that f(ATx) depends on x continuously
through the values that x takes at the grid points.
JPf(Xn) - JPf(X)\
< TP\f(Xn) - f(ATXn)\ + \JPf(ATXn) - JPf(ATX)\
+ JP\ f(ATX)-f(X)\
< s + 2\\f\\JP{\\Xn - ATXn\\ >b} + \Wg(nTXn) -
The middle term in the last line converges to zero as n
fidi convergence nTXn -> nTX.
oo, because of the
П
We now know the price we pay for wanting to make probability state-
statements about functional that depend on the whole sample path of a stochastic
process: with high probability we need to rule out nasty behavior between the
grid points. For uniform approximation of sample paths, a large value
of\Xn(t) - Xn(tt)\ would be nasty; inequality D) rules it out. So how do we

94 V. The Uniform Metric on Spaces of Cadlag Functions
control the left-hand side of D)? It involves the probability of a union of m
events, which we may bound by the sum of the probabilities of those events,
m-l
F) ?
i = 0
Then we can concentrate on what happens in an interval Jt between adjacent
grid points. For many stochastic processes, good behavior of the increment
-^п(г;+1) ~ Xn(td forces good behavior for the whole segment of sample
path over Jt.
7 Lemma. Let {Z(t): 0 < t < b} be a process with cadlag sample paths
taking the value zero at t = 0. Suppose Z(t) is S\-measurable, for some in-
increasing family ofa-fields {€t: 0 < t < b}. If at each point of {\Z(i)\ > S},
W{\Z{b) - Z(t)\ <%\Z(t)\\St} > p,
where /3 is a positive number depending only on S, then
sup \Z(t)\ > d\ < /r4P{|Z(b)| > Щ.
O<t<b J
Proof. Let S be a finite subset of [0, b~\ containing the point b. By virtue of
right continuity, as S expands up to a countable, dense subset of [0, b],
max | Z(t) | -»• sup | Z{t) \ for every sample path of Z.
S [0,6]
Notice the minor complication here if " > 5" were replaced by " > 5". It is
good enough to establish the inequality with the supremum taken over S.
Define т as the first point of S for which | Z(t) \ > S if there is such a t,
otherwise set т = oo. The event {т = t} belongs to St. For distinct points t
and t' in S the events {т = t} and {т = t'} are disjoint. This justifies a first-
passage decomposition.
Is
= t}JP{\Z(b) - Z(t)\
= t, \zq>) - z(t)\ <
щ. ?
For the lemma to help us we need to know something about the condi-
conditional distribution of Z(b) — Z(t) given what has happened up to time t.
When do we have such information? Here are some possibilities. The
distribution of the increment might not depend on the past at all—the

V.2. Empirical Processes 95
process might have independent increments. In that case the requirement
imposed by the lemma reduces to something involving the marginal
distributions of the increments. Or the distribution of Z{b) — Z(t) might
depend on the past only through the value of Z(t)—the markov property.
In that case the requirement turns into a condition on the transition proba-
probabilities. Or Z(b) — Z(t) might merely have zero conditional expectation—the
martingale property. And then we have all those clever stopping time and
maximal inequality tricks to play with.
In each of these cases it turns out to be quite easy to find sufficient condi-
conditions for strengthening fidi convergence to the full convergence in distribution
of the processes as random elements of D[0, 1]. Sections 4 and 2 illustrate
the first two possibilities; Chapter VIII will play tricks with martingales. In
each case the limit process will be either a brownian bridge, a brownian
motion, or a close relative.
8 Definition. A brownian motion is a process with continuous sample
paths and
@ B@) = 0;
(ii) for 0 < tY < ¦ ¦ ¦ < tk the increments B(t2) - B(ti), • ¦ ¦, B(tk) - B(tt_i)
are mutually independent; each B(tt) — B(t;_i) has a N@, t; — ?,_!)
distribution.
A brownian bridge, or tied-down brownian motion, is a process on [0, 1]
with continuous sample paths and
(i) 1/@) = t/(l) = 0;
(ii) for each finite subset S of [0, 1] the random vector 7is U has a multi-
variate normal distribution with zero means and covariances given by
WU(s)U(t) = s(l - 0 for s < t. ?
The name tied-down brownian motion comes from one method for
constructing a random element distributed like U. Apply to В the continuous
transformation that takes the function x(t) onto the function x(t) — ?x(l).
That is, tie down the loose end of В at t = 1 to force upon it the constraint
(i) placed on U. The tied-down process has gaussian fidi projections whose
means and covariances agree with those for the brownian bridge; the pro-
processes generate the same distribution on the projection cr-field.
V.2. Empirical Processes
From a sequence {?;} of independent random variables, each having a
Uniform@, 1) distribution, construct the empirical process
UJit) = n~112 t № < t} - t] for 0 < t < 1.

96 V. The Uniform Metric on Spaces of Cadlag Functions
For fixed finite S, the projection ns Un is a normed sum of independent,
identically distributed random vectors. By the Multivariate Central Limit
Theorem (III. 30) it converges to a zero-mean multivariate normal distribu-
distribution with the same covariance structure as the brownian bridge; that is, the
fidis of ?/„¦ converge to the fidis of U. (No accident of course—that's why U
got the covariances it did.) By design, the first condition of Theorem 3 is
satisfied. A markov property of Un will strengthen this to convergence in
distribution.
9 Empirical Central Limit Theorem (Uniform Case). The empirical processes
{?/„} constructed by independent sampling from Uniform@, 1) converge in
distribution, as random elements of D[0, 1], to the brownian bridge.
Proof. We need to find a grid 0 = t0 < t1 < ¦ ¦ ¦ < tm = 1 satisfying D),
for fixed e and S. Do this by making the sum in F), with Xn replaced by Un,
small. Take the {?;} equally spaced. By reasons of symmetry, or stationarity
if you prefer, the sum reduces to mP{supo<r<6 \Un(t)\ > S}, where b = m~1.
We seek an m that makes it less than e.
Take St as the ст-field generated by Un{s)for 0 < s < t. It tells us how many
of the observations ?u. ..,?„ have landed in [0, t], and where they lie.
Suppose we know that [0, t] contains exactly к of the observations.
Given this information, the other n — к observations distribute themselves
uniformly in the interval (t, 1]. Formally, on {Un(t) = n~1/2(k - nt)} the
conditional distribution of Un(b) — Un(i) given St is
n~1/2[Bin(n -k,B)- n(b - t)],
where в = (b — t)/(\ — t). Notice the markov property: the conditional
distribution depends only on Un{t). Apply Tchebychev's inequality on the
set where | Un(t)\ > 5, that is, where \k — nt\> nll28.
JP{\Un(b)-Un(t)\>i\Un(t)\\^}
= P{|Bin(n - к, в) - n(b - f)l > i\k - nt\}
< 4[(и - к)в{\ -в) + [(и - к)в - n(b - tj]2]/(k - ntJ
< 4пв/(к - ntJ + 4в2
< [4Ь/A - Ь)У32 + 4Ь2/A - ЪJ
< \ for small enough values of b.
Notice that the argument would break down if we replaced \\ Un(t)\ by \b
in the first line. Apply Lemma 7 with /? = \ and m large enough.
sup |?/„@1 > 8\ <
o<t<b )
With m fixed, let n -»• oo. Invoke fidi convergence to make the right-hand
side of the inequality converge to
2mJP{\N@, b - b2)\ > i<5} < 2m[(b -
which is less than e for m large enough. ?

V.2. Empirical Processes 97
The restriction to samples from the uniform distribution was unnecessary.
A similar result holds for the empirical process
En(r) = n^[_Fn{r) - F(r)] = n-1'2 t lint <r}~ F(r)l
constructed by independent sampling from any distribution function F on
the real line. Think of ?„ as a random element of D[— oo, oo], the space of
cadlag functions on [ — oo, oo]. Right continuity at — oo can be achieved by
setting ?„( — 00) equal to zero; the left limit at +00 also equals zero, the
natural value for ?„( + 00).
It would cause us no great hardship to carry all the theory for D[0, 1]
over to D[ — 00, 00] then track through the proof of the general Empirical
Central Limit Theorem. The uniform metric is well-defined, because of the
boundedness (Problem 6) of functions in D[— 00, 00]; the projection cr-field
plays the same role as in D[0, 1]. This time the limit gaussian process ?
would have sample paths in O[— 00, 00] and multivariate normal fidis with
zero means and covariance kernel
A0) WE(r)E(s) = F(r) - F(r)F(s) for r < s.
The E process need not have continuous sample paths; it jumps where F
jumps.
Some small complications would arise with the choice of grid points for
the general empirical process, but these could be overcome easily (Problem
7). Everything else would go through in much the same way as for the
uniform distribution. There is, however, a much simpler way to prove the
general Empirical Central Limit Theorem: use the quantile transformation
(Section III.6) to represent ?„ as a continuous image of [/„.
11 Empirical Central Limit Theorem. The empirical processes {?„} con-
constructed by independent sampling from a distribution function F converge in
distribution, as random elements of D\_ —со, oo], to the gaussian process E with
covariance kernel given by A0).
Proof. Define a map H from D[0, 1] into D[ — 00, 00] by setting (Hx)(r) =
x(F(r)). It is measurable (both spaces are equipped with their projection
сг-fields) and uniformly continuous:
\\Hx - Hy\\ = sup \x(F(r)) - y(F(r))\ < \\x - y\\.
r
By the uniform case of the Empirical Central Limit Theorem and the Con-
Continuous Mapping Theorem, HUn ~> HU. Remember from Section III.6 that
the quantile function has the property: ?; < F(r) if and only if <2(?г) ^ r-
(HUn)(r) = n'1'2 ta^i < F(r)} - F(r)]
i= 1

98 V. The Uniform Metric on Spaces of Cadlag Functions
The random variables {Q(?,d} form an independent sample from the distri-
distribution F; the last sum has the same distribution as Е„. You can complete the
proof by checking that HU satisfies the defining requirements for E. (This
provides one method for proving the existence of E.) ?
Even though the quantile transformation worked like a charm here, you
should beware of applying it unnecessarily to force processes that want to
live in ?>[ —oo, oo] into migrating to D[0, 1]. Gratuitous rescaling of the
time axis, perhaps with the aim of reducing a problem to a case involving
only uniformly distributed random variables, can complicate an otherwise
straightforward argument.
12 Example. The Central Limit Theorem for the sample median, proved by
direct methods in Section III.4, can be deduced from the Empirical Central
Limit Theorem. As before, assume that the sampling distribution F has a
continuous, positive density / in a neighborhood of its median m.
To finesse the problems caused by the jumps in the empirical distribution
function Fn, regard any random variable т„ for which
as a sample median. From the condition on /,
F(t) = \ + (t - m)lf(m)
Because mn must converge to m (Example II. 1), we deduce that
F(mn) = i + (mn - m)U(m) + op(l)].
Combine the expressions for Fn(mn) and F(mn) to get
En(mn) = -nll2(mn - m)U(m) + op(l)] + op(l),
which rearranges to
A3) n^\mn - m) = I-EM,) ~ op(l)]/[/(m) + op(l)].
Think of the right-hand side as a function of the four random elements ?„,
т„, op(l), and op(l). (Perhaps we should distinguish between the two op(l)
symbols—they stand for different variables.) Better still, think of it as a
function of the random element (?„, mn, op(l), op(l)) of D[ — oo, oo] ® Ж3.
Since (ш„,орA), орA)) -»¦ (m, 0,0) in probability and ?„->?, and since
Problem 8 shows that E concentrates on a separable subset of D\_ — oo, oo],
Example IV. 10 lets us deduce that
A4) (?„, mn, op(l), op(l)) -> (E, m, 0, 0).
The right-hand side of A3) can be constructed from the left-hand side of A4)
through application of the map H from D[— oo, oo] ® Ж3 into Ж denned by
H(x, a, p, y) = [-x(a) - ?l/[/(m) + y].

V.2. Empirical Processes 99
H is continuous at (x, m, 0, 0) for every x that has m as a point of continuity.
Almost every sample path of E has this property because, as shown by the
representation of ? given in the proof of Theorem 11, each such sample path
can be written in the form и ° F, where и belongs to C[0, 1]. (Don't forget
that F is continuous at m.) Call out the Continuous Mapping Theorem.
n^2(mn - m) = H(En, mn, op
~> ЩЕ, m, 0, 0)
= -E(m)/f(m),
which has the desired N@, |/(m)~2) distribution. ?
15 Example. What happens to the limit distribution of Kolmogorov's
goodness-of-fit statistic when parameters are estimated? That is, if Fn is
obtained by independent sampling on the distribution function F(-,90),
with 60 the true value of an unknown parameter в, what is the asymptotic
behavior of the statistic
Dn = n1'2 sup \Fn(t)-F(t,en)\
t
constructed using an estimator вп for 01
Suppose F were uniformly differentiable, in the strong sense (Problem 9)
that
\\F{-, в) - F(-, в0) -(в- 0О)Д(.)|| = о(в - во), near 0O,
for some fixed function A(-) in D[ —oo, oo]. Then, provided в„ were one of
those nice estimators that converge in towards в0 at the Op(n~112) rate, Dn
could be written as
Dn = n1/2\\Fn(-) - F(-, в0) - (в„ - 0О)Д(.)|| + ор(п1/2@„ - во))
= \\Е„ - п112(в„ - 0О)Д|| + орA).
We know how ?„ behaves for large n—like the gaussian process E—but what
will be the effect of adding on the extra term п1/2(вп - во)А? That depends
upon the joint distribution of ?„ and в„.
According to the statistical folklore, good estimators can often be coerced
into the form
for some function L satisfying 1РЦ^) = 0 and JPL(rjtJ = a2 < oo. Let's
assume that our estimator has such a representation. That makes it much
easier to analyze the random elements Zn = (Е„,п112(вп — в0)) of
D[-oo, oo] <g) Ж. Look at the fidis. Write (En(tt),..., EB(tk), п1/2(вп - в0))
as
n~112 ? ({m < h} - F(tu в0), ..-,{%< U - F(tk, 0O),

100 V. The Uniform Metric on Spaces of Cadlag Functions
The Multivariate Central Limit Theorem gives convergence in distribution
to a zero mean multivariate normal random vector (?(tx),..., E(tk), y),
where the covariances amongst the first к components follow A0) and
WE(t)y = JP{rii < t}L{rii). That suggests a limit process (E, y) for Zn, with E
and у having a sort of gaussian distribution inD[—oo, oo]®lR.
We actually already have all the tools needed to formalize the limit result
for Zn. All that maximal inequality and finite-dimensional approximation
stuff goes through as for the Empirical Central Limit Theorems; most of the
brain work has been carried out, and only very messy details remain. But
that would be terribly inelegant. Why not wait for the neater proof in Section
VII.5, and just accept the fidi proof as sufficient evidence in the meantime?
Elegance will return when the functions {77,- < t} and L(f?r) are accorded
equal status.
Check continuity and measurability for the map from D[— 00, 00] ® 1R
into IR that sends (x, r) onto ||x(-) — rA(-)ll. then bring out the Continuous
Mapping Theorem, yet again, to deduce that Dn^D = \\E{-) — yA(-)\\.
The parameter estimation propagates through to the limit to add a random
drift term yA(-) onto the gaussian process ?(•). If only we could calculate the
distribution of D then we would have solved the problem completely.
Problem 10 looks at a special case to show what difficulties this can present.
Try reworking the example with all the processes in D[ — 00, 00] rescaled
using F(-, 60) if you want to discover some of the perils of too automatic a
recourse to the quantile transformation. ?
V.3. Existence of Brownian Bridge and
Brownian Motion
This section proves the existence of gaussian processes satisfying the re-
requirements of Definition 8. The proof uses the Compactness Theorem from
Section IV.5. This is neither the simplest nor the fastest method known to
mankind; but it will generalize readily into an existence proof for more
exotic gaussian processes. Any reader who found Section IV.5 too tedious
could safely skip onto the next section, with perhaps just a quick look at
Problem 11.
A good way to construct a process is to set up a sequence of approxima-
approximations that should converge to the process, if it exists. One then needs to show
that the approximating sequence, or one of its subsequences, actually does
converge to something. One hopes this something has all the required
properties of the desired process.
We have not yet considered any sequence of processes that should con-
converge to a brownian motion, but in the uniform empirical processes {?/„}
we do have a sequence converging to a close relative, the brownian bridge.
Let us extract a brownian bridge as the limit of a convergent subsequence.

V.3. Existence of Brownian Bridge and Brownian Motion
101
16 Theorem. There exists a brownian bridge.
Proof. Check the uniform tightness condition in the Compactness Theorem
of Section IV.5. We need to find a compact subset К of D[0, 1] for which
A7)
liminf!P{C/neG} > 1 - e,
whenever G is an open, projection-measurable set containing K. To force
the brownian bridge to live in C[0, 1], we shall also want К to contain only
continuous functions.
Our К will be represented as the intersection of a sequence of closed sets
{Dk}, with Dk a finite union of closed balls of radius 2/fc (any sequence of radii
decreasing to zero would suffice). As the digression after the start of Section
IV.5 pointed out, such а К is compact: it inherits completeness from D[0, 1]
and, by the choice of {Dk}, it is certainly totally bounded.
The closed balls making up each Dk center themselves on piecewise
linear, continuous functions obtained by linear interpolation between a
finite set of vertices @, r0),..., (t,-, r;),..., A, rj:
With T standing for the grid 0 = tQ < ¦ ¦ ¦ < tm = 1, write LT for the map
from Жг to C[0, 1] that takes the vector г = (r0,..., rm) onto this inter-
interpolated function. The map satisfies a Lipschitz condition,
A8)
||Lr(r) - LT(s)|| <maxh-sr
The composition LT о nT defines a map from D[0, 1] into C[0, 1], a piecewise
linear, continuous approximation map for which
\\LT ° nT(x) — x\\ < 2 max sup |x(t) — x(tt)\,
where Jt = [ti,ti+1). The right-hand side should look familiar. It is the
functional of x, call it hT(x), that lurks behind the convergence criterion of
Theorem 3.

102 V. The Uniform Metric on Spaces of Cadlag Functions
For given 5 > 0 and e > 0, the proof of Theorem 9 came up with a grid T
for which
limsup JP{hT(Un) > 3} < e.
You should convince yourself that it was unnecessary to assume the existence
of the brownian bridge process to get this inequality. Reinterpret it as a
statement of how well Un can be approximated by linear interpolation:
given e and S there exists a grid T for which
limsup F{||Lr о nT(Un) - UJ > 5} < e.
Why are there no measurability difficulties here?
Now come back to the task of specifying the closed balls for the Dk sets.
For each fc choose a grid T(k) such that
liminf W{\\Lnk)°nTik)(Un) - l/J < 1/fc} > 1 - e/2k+1.
Because the random vectors {nT(k)(Un)} converge in distribution, there exists
a compact subset Hk of JRT(k) such that
liminf JP{nnk)UneHk] > 1 - e/2*+1
hence
liminf TP{LTW о nm){Un)eLnk){Hk)} > 1 - s/2*+1.
As a continuous image of a compact set, LT(k)(Hk) is compact, and so can
be covered by a finite collection of closed balls with radius 1/fc. Set Dk equal
to the union of the finite collection of closed balls with the same centers,
but radius 2/fc.
The larger radius allows for a 1/fc distance between Un and its linearly
interpolated approximation LT(k) ° nnk)(Un):
liminf IP{ Un e Dk} > 1 - e/2k for every fc.
The e/211 was contrived so that
liminf JP{Un ? Dx r> ¦ ¦ ¦ n Dk} > 1 — 6 for every k.
Of course we cannot just let к tend to infinity now and hope to replace the
finite intersections by their limit K. We really do need that open set G in the
definition of uniform tightness.
Inequality A7) will follow from what we have proved about the {Dk} if
we show that G э Dx n • • • n Dk for some fc. If there were no such fc, we
could choose an xk from each closed set Fk = Gc n D1 n • • • n Dk. As shown
in the next paragraph, we could then apply Cantor's diagonalization argu-
argument to extract a Cauchy subsequence from {xk}. The limit of the Cauchy
subsequence (remember that D[0, 1] is complete) would belong to Fk for
every fc, an impossibility because
P| Fk = Gc n К = 0.
к
The existence of the desired fc would follow.

V.4. Processes with Independent Increments 103
Here is how we would construct the Cauchy subsequence. Dt is a finite
union of closed balls of radius 2; some subsequence of {xk} would lie com-
completely within one of these, Bl say. Bl n D2 would be contained in a finite
union of closed balls of radius 1; some sub-subsequence would lie completely
within Bx n B2, with B2 a closed ball of radius 1. Bx n B2 n D3 would be
contained in a finite union of closed balls of radius f; a sub-sub-subsequence
would lie within B1 n B2 n B3, with B3 a closed ball of radius f. And so on.
The desired Cauchy sequence would take the first element of the first sub-
subsequence, the second element of the sub-subsequence, the third element of the
sub-sub-subsequence, and so on.
Every function in К is continuous, a uniform limit of piecewise linear,
continuous functions from the sets LT(k)(Hk).
By the Compactness Theorem, some subsequence of the uniformly tight
{?/„} converges in distribution to a probability measure concentrating on a
countable union of compact subsets of C[0, 1]. Its fidis, the limits of the
corresponding empirical process fidis, identify it as the distribution of the
sought-after brownian bridge. ?
There are two ways of constructing brownian motion from the brownian
bridge U. One can either set
Bj@ = U(t) + tZ for 0 < t < 1,
where Z has a ]V@, 1) distribution independent of U, or one can rescale by
setting
B2(t) = A + t)^(-p—) for 0 < t < oo.
In both cases one gets a gaussian process with continuous sample paths.
Direct calculation of means and covariances, which uniquely determine
multivariate normal distributions, shows that B± has the right fidis to identify
it as a brownian motion on [0, 1] and B2 has the right fidis for brownian
motion on [0, oo).
V.4. Processes with Independent Increments
A stochastic process Z indexed by an interval of the real line is said to have
independent increments if Z(t0), Z{tx) — Z(t0),..., Z{tk) — Z(tk-t) are
mutually independent whenever t0 < tt < •¦• < tk. For random elements
{Х„} of D[0, 1] with independent increments satisfying a mild regularity
condition, the criterion for convergence in distribution given in Theorem
3 reduces to a particularly simple form when the limit process has continuous
sample paths. Essentially we have only to check convergence for each
increment Xn(t) - Xn(s). Just one thing can spoil the nice clean charac-
characterization. Because the increments would be unchanged by the addition of an

104 V. The Uniform Metric on Spaces of Cadlag Functions
arbitrary constant to Х„, they surely cannot determine uniquely its finite-
dimensional distributions; Xn@) must be specified.
19 Theorem. Let X, Xu X2,... be random elements of D[0, 1], each with
independent increments. Suppose X has continuous sample paths. Then Xn ^ X
if and only if:
(i) XJP)
(ii) for each pair s < t, the increment Xn{i) — Xn(s) converges in distribution
to X(t) - X(s);
(iii) given b > 0 there exist a > 0 and ft > 0 and an integer n0 such that
W{\Xn(t) - Xn(s)\ <Щ> P whenever \t - s\ < a and n > n0.
Proof. Forgive the mysterious factor of \ in (iii); it's there to make the
notation fit better with Lemma 7.
Suppose Xn --»X. Conditions (i) and (ii) follow from convergence of the
fidis. Condition (iii) is a simple consequence of the Continuous Mapping
Theorem applied to the continuous map from D[0, 1] into Ж defined by
Яах = sup{|x(t) - x(s)\: \t - s\< a}.
(Take the supremum over rational pairs to check measurability.) At each
continuous x the supremum converges to zero as a -> 0, because continuity
on a compact interval implies uniform continuity. As this applies to every
sample path of X, there must exist an a > 0 and /? > 0 such that
JP{HaX < j5} — 2ft. Because HaXn ~> HaX and because ( — oo, jS) is open,
liminfF{#aAr,I < \d} > 2/?
П
by authority of Example IV.17. For all n large enough,
W{\Xn(t) - Xn(s)\ < \b} > W{HxXn < Щ > p,
whenever \t — s\ < a.
Now let us show that the two requirements listed in Theorem 3 follow
from conditions (i), (ii), and (iii). Choose 0 = s0 < Sj < • • • < sk. The joint
characteristic function of the random vector
(Xn(s0), Xn(Sl) - Xn(s0),..., Xn(sk) - Xn(sk_,))
factorizes into a product of one-dimensional characteristic functions,
because Xn has independent increments. By (i) and (ii), the product converges
to the corresponding product for X, the joint characteristic function of the
random vector
(X(s0), X(Sl) - X(s0),..., X(sk) - X(sk-,)).
With a continuous (linear, even) transformation, recover the desired fidi
convergence.
For the second requirement of Theorem 3 try a grid 0 = t0 < ¦ ¦ ¦ < tm = 1
for which max;(t>+1 — tt) < у < a, with a as given by (iii). The value of у

V.4. Processes with Independent Increments 105
will be specified at the end of the proof. Fix for the moment a value of n
greater than n0. Write <gt for the сг-field generated by {Xn(s)\ s < t). Suppose
t, < t < ti+1. Then, on the set {\XJt) - Xn(tt)\ > 6},
JP{\Xn(ti+1) - Xn(t)\ < \\Xn{t) - XM\\«t)
>JP{\Xn(ti+1)-Xn(t)\<j6}
> P.
Apply Lemma 7 to the process Х„{-) — Xn{tt) on each interval J; = [tb ti+1).
limsup IP< max sup | Xn(t) — Xn(tt) | > S
limsup У JP< sup I Xn(t) - Xn(ti) \ > S
1 = 0 I J,
limsup W{\Xn{ti+x) — Xn(td\
=~o
m
¦=o
m-l
!=0
which is less than
m-l
Г1 Е
! = 0
because (Xn(ti+1), Xn(tt)) - (X(ti+1), X(tt)) and {(u, v)eJR2: \u - v\ > Щ
is closed. Write Eo,..., ?m_i for the independent events appearing in the
last summation. From the inequality exp( — WEt) > 1 — 1Р?; deduce that
r=o >
= -lodl-JP\jE?j
< -log(l - JP{HyX > ^.5}) because max(fi+1 - tt) < y.
i
Argue as before that HyX [ 0 for each sample path, as у [ 0, to see that у
could have been chosen to make the logarithmic expression less than /?e. ?
Notice how we used the sample path continuity of the limit process X.
The argument would break down if X were allowed to grow by jumps. For
example, if X were a poisson process, with X(tj) — Xit^^ distributed
Poisson(A(t; — ?г_ J), the sum would not decrease as у tended to zero:
m— 1 m-1
X JP{\x(ti+1) - x(ti)\ >is}= x LMtt+i - h) + o(ti+1 - t,.)]
i = 0 i = 0
= X + o(l).
The uniform metric is inappropriate for such an X. A better metric, which
can tolerate a jumpy limit process, will be introduced in Chapter VI.

106 V. The Uniform Metric on Spaces of Cadlag Functions
Theorem 19 really only says something about convergence in distribution
to gaussian limit processes. If X@) has a normal distribution, the combina-
combination of independent increments and continuous sample paths forces X to be
a rescaled and recentered brownian motion (the gaussian process described
in Definition 8). An approximation argument shows why. Because Hy(X) -> 0
almost surely as у -» 0, there exists a sequence {е„} with е„ [ 0 and
ЩН2/п(Х) > ej - 0. Set ?„, = Xn(i/n) - Xn((i - 1)/и). Define step-func-
step-function approximations to X by
Xn(t) = X@) + t {i/n < t}U\U < ?„}•
Whenever H2/n(X) < en, the Xn process lies uniformly within е„ of X. Thus
||Х„ - X|| -» 0 in probability.
Fix t. Write al for the variance of the sum appearing in the definition of
Xn(t), and an for its expectation. If an -> 0, then Xn(t) = X@) + an + op(l)
from which we get X(t) — X@) = constant, a degenerate sort of normality.
If, on the other hand, {&„} does not tend to zero, then, along some subse-
subsequence, (Xn(t) - X@) - an)/an ~> N@, 1) by the Lindeberg Central Limit
Theorem. (Necessarily e.Jan -* 0 along any subsequence for which an is
bounded away from zero.) Convergence of types (Breiman 1968, Section 8.8)
allows nothing but normality for the distribution of X(t) - X@). A similar
argument works for any other increment of X.
Even with the gloss of apparent generality rubbed off the limit distribution,
Theorem 19 still has enough content to handle some non-classical problems
about sums of independent random variables.
20 Example. Let {?ni: i = 1,..., k(n); n = 1, 2,...} be a triangular array of
random variables satisfying the conditions of the Lindeberg Central Limit
Theorem. That is, they are independent within each row, they have zero
means, they have variances {o^r} that sum to one within each row, and
B1) Е1РЙ{|&,,|>е}->0
i
for each fixed e > 0. Set Sni = ?nl + • • • + t,ni. What is the asymptotic
behavior of a random variable such as max; Sni? We may write it as a func-
functional of the partial-sum process Sn(-), the random element of D[0, 1]
defined by
Sn(t) = Sni for var Sni<t< var Sn> i+1.
The curious choice for the location of the jumps of Sn(-) has the virtue that
var Sn(t) -> t as n -> oo, for each fixed t, because max,- alt -> 0. The increments
of Sn(-) over disjoint subintervals of [0, 1] are sums over disjoint groups of
the {?„;}; the process has independent increments. If we joined up the
vertices, to produce an interpolated partial-sum process with sample paths
in C[0, 1], we would destroy this property.

V.5. Infinite Time Scales 107
For fixed s and t, with s less than t, the increment Sn(t) — Sn(s) is also a
sum of the elements in a triangular array of independent random variables.
Because
var[SB@ - Sn(s)~] -> t - s
and because B1) also holds for summation over any subset of the {?„;}, the
Lindeberg Central Limit Theorem makes Sn(t) — Sn(s) converge to a
N@, t — s) distribution—one of the conditions we need to make Sn(-)
converge in distribution to a brownian motion.
What about condition (iii) of Theorem 19? Tchebychev's inequality
is good enough.
W{\Sn(t) - Sn(s)\ <Щ>\- var[Sn(t) -
-> 1 - (f - S)HW,
uniformly in t and s. You can figure out what a should be from this.
Only our ingenuity in thinking up functional that are continuous at the
sample paths of brownian motion can curtail our supply of limit theorems
for the partial sums. One example:
max Sni = sup Sn(t) ~> sup B(t).
i t 1
Calculation of the limit distribution in closed form awaits us in Section 6,
where this and several other functionals of brownian motion and brownian
bridge will be examined. ?
V.5. Infinite Time Scales
Both the spaces I>[0, 1] and D[— oo, oo] have compact intervals of the
extended real line as their index sets; the theories for convergence in distri-
distribution of random elements of these spaces differ only superficially. For spaces
with non-compact index sets some extra complications arise. As a typical
example, consider D[0, oo), the set of all real-valued cadlag functions on
[0, oo).
A function x in D[0, oo) must be right-continuous at each point of [0, oo),
with a finite left limit existing at each point of @, oo). Because the limit point
+ oo does not belong to the index set, no constraint is placed on the behavior

108 V. The Uniform Metric on Spaces of Cadlag Functions
of x(t) as t -> oo; it could diverge or oscillate about in any fashion. Such a
function need not be bounded. The uniform distance between two functions
in D[0, oo) might be infinite.
Even for bounded random elements of D[0, oo), convergence in distribu-
distribution in the sense of the uniform metric may impose far stronger requirements
on their tail behavior (that is, on what happens as t -» oo) than we can hope to
verify. Sometimes the best we can try for is control over large compact sub-
intervals of [0, oo). That corresponds to convergence in the sense of the
metric for uniform convergence on compacta.
22 Definition. A sequence of functions {х„} in D[0, oo) converges uniformly
on compacta to a function x if sup(<fc \xn(t) — x(t)| -> 0 as n -> oo for each
fixed k. Equivalently, d(xn, x) -» 0, where
GO
d(xn,x)= ? 2"'min[l,4(x,,x)]
dk{xn,x) = sup|xn(t)-x(t)|. ?
t<k
In all that follows, ?>[0, oo) will be equipped with the metric d and the
projection сг-field. With that combination, each x in D[0, oo) is completely
regular (Problem 12).
For each к define a truncation map Lk from D[0, oo) into D[0, /c]; set
Lkx equal to the restriction of x to the interval [0, k]. By construction, {х„}
converges to x if and only if {Lkxn} converges uniformly to Lkx for each k.
Convergence in distribution has a similar characterization.
23 Theorem. Let X,XUX2,... be random elements ofD[0, oo), withJP{X e C}
for some separable set C. Then Xn ^*X if and only if Lk Xn --» Lk X, as random
elements of D[0, k],for each fixed k.
Proof. Necessity of the condition follows from the Continuous Mapping
Theorem, because each Lk is both continuous and measurable.
For sufficiency, define a continuous, measurable map Hk from D[0, /c]
into D[0, oo) by (Hkz)(t) = z(t л к). It comes as close as possible to defining
an inverse to the map Lk: the function Hk ° Lkx equals x on [0, /c]. Deduce
that d(x, Hk ° Lkx) < 2~k for every x in D[0, oo).
Suppose LkXn ~> LkX. Because Hk is continuous, Hk ° LkXn <-» Hk ° LkX.
Pick к large enough to make 2~k < e. Then both JP{d(Xn, Hk о LkXn) > e}
and JP{d(X, Hk о LkX) > e} equal zero. Example IV.l 1, with Hk ° Lk playing
the role of the approximation map, does the rest. ?
Typically С equals C[0, oo), the set of all continuous functions in D[0, oo).
As in the case of compact time intervals, C[0, oo) sits inside ?>[0, oo) as a
closed, separable, measurable subset (Problem 13). The borel сг-field on
C[0, oo), generated by the open sets for the d metric, coincides with the

V.5. Infinite Time Scales
109
projection cr-field. The most important of the limit processes living in
C[0, oo) is brownian motion on [0, oo).
24 Example. Let Sn denote the nth partial sum of a sequence ?l5 ?2,... of
independent, identically distributed random variables for which 1Р<!;г = О
and IPiJf = 1. What is the asymptotic behavior of the hitting time
as n tends to infinity? Define Hx = inf{t > 0: x(t) > 1}, with the usual
convention that the empty set has infimum + oo. Right continuity of functions
in D[0, oo) allows us to take the infimum over rational t values to verify
measurability of H. Express n~ 1zn as the functional HXn of the process
Xn(t) = n-ll2Sj, j/n < t < (j + l)/n.
Use Theorem 23 to check that Х„ ~> В, a brownian motion on [0, oo).
For fixed k, the truncated process LkXn has the same form as the partial-sum
process of Example 20, but stretched out and rescaled to fit the interval
[0, /c] instead of [0, 1]. The modifications have little effect on the arguments
adduced there to prove convergence to brownian motion; exactly the same
idea works in D[0, k~].
If we are to use the Continuous Mapping Theorem to prove HXn ~> HB
we will need H continuous at almost all sample paths of B. By itself, continuity
of a sample path x will not suffice for continuity of H at x. You can construct
sequences of functions {х„} that converge uniformly to a bad continuous x
without having Нх„ converging to Hx.
BAD:
Hx
The functional H has better success with a good path like this:
GOOD
- E г т + E

по
V. The Uniform Metric on Spaces of Cadlag Functions
Here Hx = т. If x(t) < 1 — e for t < z — 5 (a continuous function achieves
its maximum on a compact interval) and x(z + S) > 1 + e, then any
у in D[0, oo) with dx+d(x, y) < s must satisfy z — д < Ну < т + д. That
brownian motion has only good sample paths, almost surely, can best
be shown by arguments based on a strong markov property, a topic to be
taken up in the next section. The distribution of the functional HB of brown-
brownian motion will also be derived in that section. For the moment, we must
content ourselves with knowing that п~1т„ has a limiting distribution that
we could calculate if we were better acquainted with the limit process of the
sequence {Xn}. ?
V.6. Functionals of Brownian Motion and
Brownian Bridge
From the definition of brownian motion on [0, oo) it is easy to deduce
(Problem 14), for each fixed z, that the shifted process
Bz(i) = B(z + t) - B(z) for 0 < t < oo
is a new brownian motion, independent of the сг-field ST generated by the
random variables {B(s): 0 < s < z}. Equivalently,
B5) РДВ.М =
for every A in Sz and every bounded measurable / on C[0, oo).
The assertion is also valid for a wide class of random z values. The precisely
formulated generalization, known as the strong markov property, underlies
most of the clever tricks one can perform with brownian motion. It will get
us the distributions of a few interesting functionals.
The random variable z will need to be a stopping time, that is, a random
variable, taking values in [0, oo], for which {г < t) belongs to St for each t.

V.6. Functional of Brownian Motion and Brownian Bridge
111
Interpret Sx as the smallest u-field containing every St—we learn everything
about the brownian motion if we watch it forever. Define
Bz(co, t) = [B(co, t
- B(co,
oo}.
Problem 17 shows that Bz is projection-measurable, as a map from Q into
C[0, oo).
The stopping time т determines a u-field Sz, which captures the intuitive
idea of an event being observable before time т. By definition, an event A
belongs to Sz if and only if A{z < t} belongs to St for every t. The strong
markov property asserts that Bt is a brownian motion independent of Sz on
{t < oo}. Equivalently, B5) holds for every #t-measurable A contained in
{t < oo} and every bounded, measurable / on C[0, oo). To prove the
assertion it suffices that we check the equality for each bounded, uniformly
continuous /. (Apply Problem 15 to the distributions induced on C[0, oo)
by B, under IP, and Bz, under IP(-1 A).) With such an /, continuity of sample
paths implies
f{Bz)A = lim ? f(Bk/n)A{(k - l)/n < т < k/n}.
k=l
Of course only one term in the sum will be non-zero for each fixed n.
The event A{(k — l)/n < т < k/n} belongs to Skln; the independence
asserted in B5) can be invoked for each k/n value.
W№)A = lim ? Wf(Bkln)A{(k - 1)/и < т < k/n}
= lim ? TPf(B)TPA{(k - 1)/и < т < k/n} by B5)
jt=i
= 1Р/(В)ПМ because т < oo on A.
The strong markov property is established.

112
V. The Uniform Metric on Spaces of Cadlag Functions
26 Example. The classical reflection principle for brownian motion involves
a hidden appeal to the strong markov property. It enables us to find the
distribution of the stopping time т = inf{t: B{t) = a} for fixed a > 0.
The reflection principle uses the symmetry of brownian motion—the
processes Bt and — Bz have the same distribution—to argue that В should
be just as likely to hit a and end up with B(t) > a as it is to hit a and end up
with B(t) < a. The probability that В hits a before time t should be twice
the probability that B(t) > a. A formal proof works backwards.
IP{B@ > a} = IP{B@ > a, т < t]
= IP{B(t) + Bz(t - t) > a, x < t]
= TP{Bz(t - t) > 0, z < t} because B{x) = a if т < oo
< t}JP{Bz(t - т)
Because т is #t-measurable it can be treated as a constant inside the condi-
conditional probability. (A slightly more formal justification would invoke
Fubini's theorem.) On {т < t}, symmetry of the normal distribution allows
us to replace the conditional probability by j, and then deduce
IP{B(f) > a} = iIP{T < t}.
An explicit expression for IP{t < t} can be found by using the N@, t)
distribution for B(t).
For our later purposes it will be more important that we know the Laplace
transform LaB) = IP exp( — At) for X > 0. Two applications of Fubini's
theorem, then a differentiation, will lead to a simple differential equation
for U.
IP Г{т
•Jo
t}dt
z) dt

V.6. Functionals of Brownian Motion and Brownian Bridge
113
= IP [Xe-Xt{t > a2/N@, IJ} dt
= IPexp(-Aa2/iV@, IJ)
= B/яI/2 Г^ехрС-Яа^ - \z2) dz.
Jo
Differentiate with respect to A, then change variables by setting у =
BAI/2az-1.
L'JLX) = B/яI/2 Г-а2г~2 exp(-A<x2z - \z2) dz
Jo
= -<хBАГ1/2B/яI/2 Гехр(-^2 - Xa2y~2)dy
Jo
The only solution with LJO) = 1 is La(A) = ехр(-аBЯI/2).
П
Doob A949), in the non-heuristic part of his paper, was able to parlay
the result from Example 26 into an expression for the distribution function
of \\U\\, which his heuristic argument suggested as the limit for Kolmogorov's
goodness-of-fit statistic. His approach provides a case study in the application
of the strong markov property.
First one transforms the brownian bridge U by a scale change:
B(t) = A + t)U
+t
for 0 < t < oo.
The means and covariances of this gaussian random element of C[0, oo)
identify it as a brownian motion (the same trick as in Section 3). Because
= 0,
t
U
> <x:0 < t < oo
к1 + Ч
= W{\B(i)\ = <x(l + t) for at least one t}
= JP{B hits a + at or В hits -a - at}.

114
V. The Uniform Metric on Spaces of Cadlag Functions
To find the probability that В ever leaves the region bounded by the two
sloping lines ± (a + at), one first solves the simpler problem where there is
only one sloping line. For a fixed fi > 0 (and not just fi = a) set
ф(а) = IP {В hits a + fit},
the dependence on fi being suppressed for the while. If В is ever to hit the line
ai + a2 + ft, f°r ai > 0 and a2 > 0, it must first hit the line ax + fit. Call
the stopping time at which this happens a. Invoke the strong markov property.
ф(ос1 + a2) = JP{B hits ax + oc2 + fit}
= IP{{7 < oo, Ba hits a2 + fit}
= IPfff < oo}IP{B hits a2 + fit}
The equation ф(ос1 + a2) = ф(ос1)ф(ос2) has non-negative, decreasing solu-
solutions with ф@) = 1 only of the form ф(&) = exp( — ca) for a > 0. The positive
constant с remains to be determined.
Another strong-markov argument, working with the stopping time т at
which В first hits level a, gives the value of the constant.

V.6. Functionals of Brownian Motion and Brownian Bridge
115
exp(-ca) = F{B hits a + fit}
= F{t < oo, Bt hits fix + fit}
= F[{t < oo}F{B, hits fix + fit\Sz}-\
= IP{t < 00} (j)(fix)
= IPexp(- с fix)
= La(cfi) in the notation of Example 26
= exp(-aBcj3I/2).
Solve for c.
B7) F{B hits a + ?i} = exp(-2a?) for a > 0, fi > 0.
Notice that the hitting probability depends on a and fi only through their
product. The probability of В hitting the line кос + k~ lfit is the same for
each к > 0.
The two barrier problem is slightly harder. For brevity write, temporarily,
I for the line a + fit and II for the line -a- fit. Then
F{ |B(OI = oc + fit for at least one t] = W{B hits I or В hits II}.
The two events {B hits 1} and {B hits II} are not disjoint. An inclusion-
exclusion argument is needed. The infinite sum
{hit 1} - {hit I then II} + {hit I then II then 1}
+ {hit II} - {hit II then 1} + {hit II then I then II}
need not
hit I before
it hits II
might hit
again
r
... then II...
takes the value 0 if В never hits a barrier, it takes the value 1 if В makes a
finite (positive) number of alternating hits, and it is undefined if В makes
infinitely many alternating hits. If we prove that
B8) IP{B makes infinitely many alternating hits} = 0
then dominated convergence will justify
B9) IP{B hits I or В hits II}
= IPfhit 1} - IPfhit I then II} + F{hit I then II then 1}
+ F{hit II} - F{hit II then 1} + F{hit II then I then II} .

116
V. The Uniform Metric on Spaces of Cadlag Functions
Our success with the calculation for one barrier will extend, in strongly
markov fashion, to each of the individual terms in this sum of probabilities.
To avoid notational indigestion let us chew on just one of the terms,
IP{hit I then II}, say. Define the stopping time x (different from the last т)
as the first t at which В hits I. From the perspective of Bt, barrier II is
defined by the line of slope — /? and intercept — 2(<x + fix). (We can treat
x as if it were a constant if we condition on Sz.) By symmetry, Bz has the
same probability of hitting the line 3(<x + fix) + fj(t - x), which it sees
as having slope fj and intercept 2(<x + fix). Formula B7) allows Bz to double
the slope and halve the intercept of the line it is trying to hit. That is,
JP{BZ hits
= JP{BZ hits 3(a + /hr) + fi{t - x)\Sz)
= W{BZ hits 2(a + fix) + 2fi(t - x)\Sz}
= IP{Bthits2a + 2fit\iz).
Notice that the new line sits above a + fJt for t > 0. Integrate out over the
event {x < oo}.
JP{B hits I then II} = П>{т < oo, Bx hits II}
= IP{t < oo, Bz hits 2a + 2f3t}
= JP{B hits 2a + 2f3t}
= exp[ - 2Ba)Bj?)] from B7).
Similar reasoning succeeds with higher numbers of alternating hits. For
example, if a denotes the time of the first hit on II after a hit on I, then
W{B hits I then II then 1} = IP{t < oo, a < oo, Ba hits 1}
= IP{t < oo, a < oo, Ba hits -2a - 2/ft}
= IP{t < oo, Bz hits -2a - 2f3t}
= IP{t < oo, Bz hits 3a + 3jSt}
= JP{B hits 3a + 3f3t}
= exp[-2Ca)CiS)].

Notes 117
You will observe the significance of the point — a//? on the t axis if you draw
a picture for the argument.
In general, for a string of m alternating hits on barriers I and II, the
probability is exp( — 2m2uf}). Let m tend to infinity to prove B8). Formula
B9) with a = /? provides the solution to the problem with which we began:
P{||C/J > <x}->P{||t/|| >a} = 2 f (-l)m+1 exp(-2m2a2).
m=l
Read Doob's paper.
Notes
Most authors avoid the uniform metric once they recognize the measurability
problems it creates with the empirical process. Billingsley A968, Section 18)
made the case against it clear. Dudley A966a, 1967a) had already pointed
the way around the problem, but his solution has been mostly overlooked in
the literature. Dudley A966a) had even proved a multivariate version of the
Empirical Central Limit Theorem. He used a markov property of the
empirical process.
The idea of interpreting conditions like D) as a means for constructing
simple approximations to stochastic processes appears explicitly in
Wichura A971). Hajek A965) had already applied the idea to characterize
weak convergence in C[0, 1]. Skorohod A956) based his study of weak
convergence in D[0, 1] under its various Skorohod metrics on the same
idea. Unfortunately this simple approach seems to have lost out to the
uniform tightness approach (exposited by Billingsley A968), for example),
possibly because the approximation method appears to demand a separate
proof for existence of the limit process as a random element of a space such
as C[0, 1]. Actually, D) plus fidi convergence imply uniform tightness; the
argument in Section 3 is easily generalized. Lemma 7, specialized to pro-
processes with independent increments and with \d replacing i|Z(r)| in the
hypothesis, is sometimes attributed to Skorohod A957), although it clearly
has a strong similarity to Levy's symmetrization inequality. Notice that the
^|Z(t)| does gain us something—one benefit was noted during the proof of
the Empirical Central Limit Theorem, in Section 2. Gihman and Skorohod
A974, VI.5) made more systematic use of a form of the lemma in a study of
weak convergence for markov processes. Stute A982a) obtained delicate
oscillation results for empirical processes by exploiting their markovian
properties.
The Empirical Central Limit Theorem usually goes by the name of
Donsker's theorem; Donsker A952) proved it (using the uniform metric) in
justifying Doob's A949) heuristic approach to the Kolmogorov and Smirnov
theorems. Donsker A952) used a poissonization trick in his proof; he got it
from Chung A949), who got it from Kolmogorov A933). Kac A949) knew
that a poissonized process was easier to analyze, because of its independent

118 V. The Uniform Metric on Spaces of Cadlag Functions
increments. Breiman A968, Section 13.6) used another version of the trick—
representation of uniform order statistics as rescaled points of a poisson
process—in relating the empirical process to a partial-sum process. The
quantile-transformation trick belongs to the folklore.
The interest aroused when Durbin A973a) applied weak convergence
methods to get limit distributions for statistics analogous to those of
Kolmogorov and Smirnov, but with estimated parameters, died down when
the intractable limit processes asserted themselves. The papers in the con-
conference on empirical processes (Gaenssler and Revesz 1976), the lecture
notes by Durbin A973b), and Pollard A980), offer several perspectives on
minimum distance methods. Parr A982) has compiled a bibliography.
The existence proof in Section 3 adapts an argument of Dudley A966a,
Proposition 2; 1978, Lemma 1.3).
Erdos and Kac A946,1947) proved that the limit distributions for certain
functions of partial sums of independent random variables depend on the
summands only through their first two moments. Their paper led Donsker
A951) to formulate the results as limit theorems for functionals on a partial-
sum process. Donsker proved convergence of the process to brownian
motion. Skorohod A957) studied processes with independent increments in
great detail; a nice account of some of his results has appeared in Gihman
and Skorohod A969, Chapter IX).
Infinite time scales with various topologies (uniform or Skorohod) of
convergence on compacta often cause minor confusion. The simple theory
for C[0, oo) (Whitt 1970) proves remarkably unwilling to carry over to
Z)[0, oo), at least for Skorohod metrics (Lindvall 1973, Stone 1963). The
obvious nature of the main theorem (Theorem 23) for uniform metrics makes
it hard to appreciate why Skorohod metrics should be any more difficult.
More about this in Chapter VI.
Billingsley A968, Section 11) has derived distributions for functionals of
brownian motion by taking limits for functionals of simple random walks
(partial-sum processes). Stopping-time arguments for brownian motion give
other first-exit distributions (Breiman 1968, Section 13.7). The results can
also be derived by elegant martingale methods (Loeve 1978, Complements
to Section 42).
Problems
[1] D[0, 1] is complete under its uniform metric. [If {xn} is a uniform Cauchy sequence
prove pointwise, and then uniform, convergence. Show that uniform convergence
preserves the cadlag property.]
[2] Let С be a separable subset of Z)[0, 1] under its uniform metric. There exists a
countable subset To of [0, 1] such that functions in С can have discontinuities
only at points of To. [A function in D[0, 1] has only finitely many jumps > s,
otherwise the cadlag property would fail at a cluster point of those jumps. Take
the union over rational s. If x is a uniform limit of {xn} then it can jump only where
one of the {х„} jumps.]

Problems 119
[3] Calculate the covariances of the tied-down process B(i) — ?.8A) to show that its
fidis agree with those of the brownian bridge. [Means and covariances determine
a multivariate normal distribution.]
[4] Extend the proof of Theorem 3 to limit processes with jumps. [If X has paths
in a separable set C, and if the grid points for the approximation maps Ak pick up
every jump point of С as к -> со, then \\AkX — X\ -> 0 almost surely.]
[5] Would the argument in the proof of Theorem 9 for bounding Un(t) over [0, b]
work for a different interval, say [1 — b, 1]? This would be necessary if Un were
not stationary; direct analysis of the non-uniform empirical process would
require such an extension. [Try a different definition for $,.]
[6] Every function in Z)[— со, со] is bounded. [If |x(fn)| > n then x would not have
the cadlag property at a cluster point of {?„}.]
[7] Give a direct proof of the Empirical Central Limit Theorem for sampling from a
general F. [Make sure the jump points of F appear in the sequence of grids from
which fidi approximations are calculated.]
[8] The image of a separable metric space under a continuous map is separable. The
empirical process E concentrates on a separable subset of ?>[— со, со]. [The
image of a dense subset is dense.]
[9] Suppose the distribution function F(x, 9) has a bounded partial derivative A(x, в)
with respect to в. If A(-, •) is uniformly continuous (or even just A(x, •) equi-
continuous), then
sup \F(x, в) - F(x, 6»0) - (в - е0Щх, 0)| = o(9 - в0).
X
[10] Find the limiting distribution of Kolmogorov's goodness-of-fit statistic for
sampling from the N(9,1) distribution, when в is estimated by the sample mean.
Show that this limit does not depend on the true value в0; express it as a functional
of the gaussian process with covariance function <J>(s)[l — <D(f)] — $(s)$(t),
where Ф and ф denote the distribution function and density function of the iV@, 1)
distribution.
[11] Here is another construction for the brownian bridge.
(a) Temporarily suppose there exists a brownian bridge U. Define, recursively,
new processes Yo, Yu ... and Zo, Zu ... by setting Yo = U, Yn+i = Yn — Zn,
and
where hnJ is the function whose graph is an isosceles triangle of height one
sitting on the base [0' - l)/2",J/2"]. Show that Yn+1 and Zn are independent.
Deduce that the {ZJ are mutually independent. [Calculate covariances. The
process Yn would be obtained from the brownian bridge by tying it down at the
points {B/ — l)/2"}; it is made up of 2" independent, rescaled brownian
bridges sitting side by side. The covariance calculations all reduce to the same
thing.]
(b) Show that the process Zo + ••• + Zn interpolates linearly between the vertices
C//2"+ \ U(J/2"+!)), for j = 0,..., 2"+'.

120 V. The Uniform Metric on Spaces of Cadlag Functions
(c) Now run the argument the other way. Construct processes {Х„} with the
same distributions as the {ZJ, then recover the brownian bridge as a limit of
sums of the {XJ. Let {?„,¦: j = 1, ¦•., 2"; и = 0,1,...} be independent
random variables, with (nj distributed JV@,1/2"+2). Define
Xn(t) = I hnj(t)?nj.
At the points {j/2n+ x: j = 0,..., 2"+ *} the sum Xo + • • ¦ + Х„ has the right
fidis for a brownian bridge. [Only the fidis of U, which we know are well
defined, were needed to calculate the distributional properties of
(d) Show that IP{||XJ > sj < 2"+1 exp(-2"+1s2). [Apply the exponential
inequality from Appendix В for normal tails: \\Xn\\ is a maximum of 2" in-
independent |ЛГ@,1/2"+2) | random variables.]
(e) By choosing sn = Bn/2"+1I/2 and then applying the Borel-Cantelli lemma,
show that ?S=o XJf) converges uniformly in t, almost surely; it defines a
process X with continuous sample paths, almost surely.
(f) At dyadic rational values for t, the series for X contains only finitely many
non-zero terms. The fidi projections of X at a dense subset of [0,1] have the
distributions of a brownian bridge.
The process X, with a negligible set of sample paths discarded, is a brownian
bridge.
[12] Every point of Z)[0, oo) is completely regular. [The distance dk(x, y) can be ex-
expressed as a supremum involving only rational time points. For each x and k, the
function dk(x, ¦) is projection measurable. Use [1 — mdk(x, -)]+ as a separating
function.]
[13] C[0, oo) is a closed, separable subset of Z)[0, oo). [It is the closure of a countable
collection of piecewise linear, continuous functions, each constant over an interval
of the form [t, oo).]
[14] For each fixed т and each finite subset S of [0, oo), the random vector nsBz is
independent of Sz, because brownian motion has independent increments. Deduce
independence of Bt and Sz. [The projection maps {ns} generate the projection
o--field on C[0, oo).]
[15] Let P and Q be probability measures on the c-field @0 generated by the closed
balls of a metric space. If Pf = Qf for every bounded, uniformly continuous,
^-measurable / then P = Q. [Every closed ball is a pointwise decreasing limit
of a sequence of such functions. The same is true for the intersection of any finite
collection of closed balls. These sets form a generating class for 38Q that is closed
under the formation of finite intersections.]
[16] The borel tr-field on the product Ж ® <% of two separable metric spaces
coincides with the product tr-field Я{9С) ® Щ<Ш). [Every open set in the product
space is a countable union of sets of the form Gx ® Gy, with both Gx and Gy open.
Compare with Problem IV.5.]
[17] Show that the shifted process Bz is projection measurable for each random time t.
[Prove measurability of B(a>, t + z(co)) by writing it as a composition of two

Problems 121
measurable maps: со t-* (t(co), B(co, •)) from О into [0, oo] ® C[0, oo) and (s, x) h->
limn x((t + s) л n)[l л (n - s)+] from [0, oo] ® C[0, oo) into IR.
[18] In the sense of Example 24, almost all brownian motion sample paths are good.
If т denotes the first time that В hits level 1, prove that
IP{Bt(s) < 0 for 0 < s < 8} = 0.
Let 8 tend to zero through a sequence to show that bad paths belong to a set of
probability zero. [You could try letting the level a in Example 26 sink to zero.]

CHAPTER VI
The Skorohod Metric on Z)[0, oo)
... in which an alternative to the metric of uniform convergence on compacta
is studied. With the new metric the limit processes need not confine their jumps
to a countable set of time points. Amongst the convergence criteria developed is
an elegant condition based on random increments, due to Aldous. The chapter
might be regarded as an extended appendix to Chapter V.
VI. 1. Properties of the Metric
The uniform metric on D[0, 1] is the best choice for applications where the
limit distribution concentrates on C[0, 1], or on some other separable subset
of Z)[0, 1]. It is well suited for convergence to brownian motion, brownian
bridge, and the gaussian processes that appear as limits in the Empirical
Central Limit Theorem. But it excludes, for example, poisson processes and
other non-gaussian processes with independent increments, whose jumps
are not constrained to lie in a fixed, countable subset of [0, 1]. To analyze
such processes, Skorohod A956) introduced four new metrics, all weaker
than the uniform metric. Of these, the Jx metric has since become the most
popular. (Too popular in my opinion—too often it is dragged into problems
for which the uniform metric would suffice.) But Skorohod's Jx metric on
D[0, 1] will not be the main concern of this chapter. Instead we shall in-
investigate a sort of Jx convergence on compacta for D[0, oo), the space where
the interesting applications live.
With the results from Section V.5 in mind, and even without seeing the J1
metric defined, you might suspect that convergence Xn ~> X of random
elements of D[0, oo) should reduce to convergence of their restrictions to
each finite interval [0, Г], in the sense of the Jx metric on D[0, Г]. This is
almost true. We need to avoid those values of T at which X has positive
probability of jumping. The difficulty arises because projection maps are
not automatically continuous for Jt metrics. Both the points 0 and T require
special treatment from the /t metric on D[0, Г], whereas only 0 has an
a priori right to special treatment in D[0, oo). That tiny distinction makes it
slightly more convenient to study D[0, oo) directly than to deduce all its
properties from those of ?)[0, Г]. As we shall not be concerned with

VI. 1. Properties of the Metric
123
Skorohod's J2, Mu and M2 metrics, let us drop the Jx designation from
now on.
1 Definition. For each finite T and each pair of functions x and у in D[0, oo)
define the distance d-r(x, y) as the infimum of all those values of д for which
there exist grids 0 = t0 < t, < ¦ • ¦ < tk, with tk > T, and 0 = s0 < st < • • •
< sk, with sk > T, such that 11,; — st \ < д for i = 0,..., k, and
and Si<s<si+1,
\x(t) - y(s)\ < 5 if tt<t<ti+1
for i = 0,..., к — 1. The weighted sum
d(x, y) = | 2"
defines the Skorohod metric on D[0, oo).
The proof that d is a metric contains no surprises (Problem 1). The
requirement tk> T prevents discontinuities near T from overinfiating the
distance dT(x, y): if, say, у jumps just before T then it can be matched by a
similar jump in x just a little beyond T. By allowing dT to depend on more
than the segment of path over [0, Г], we avoid a difficulty that other authors
have encountered in defining a metric for /х convergence on compacta.
The main difference between convergence in the uniform sense and
convergence in the Jx sense appears at the discontinuity points of a limit
function. If {xn} converges uniformly on compacta to x and if x has a jump
O—-'
at f 0, then each х„, for n large, must have a jump of almost the same magnitude
precisely at t0. Skorohod's metric still forces each xn to have a jump of almost
the same magnitude, but not precisely at t0.
O---'

124
VI. The Skorohod Metric on D[0, oo)
Straight from the definition we deduce that convergence of xn to x under
the d metric is equivalent to dT(xn, x) -> 0 for each finite T. Another equiv-
equivalent form is (Problem 2):
d{xn, x) -> 0 if and only if there exist continuous, strictly increasing
maps {!„} from [0, со) onto itself such that, uniformly over compact
sets of t values, Xn(t) - t -> 0 and x(Xn(t)) - xn(t) -> 0.
Sometimes continuity of a functional on D[0, oo) can be verified more easily
when convergence is invoked in this form.
2 Example. Consider once again the first passage time functional from
Example V.24: for x in D[0, oo),
Hx = inf{t > 0: x(t) > 1}
with the infimum over an empty set defined to be +co. The functional
is continuous, in the sense of the Skorohod metric, at every good path x.
Hx
If d(xn, x) -» 0 then the functions {у„} constructed from {xn} and {Я„} by
setting yn(t) = х„(Яп@) converge uniformly to x over compact sets of t
values. The argument from Example V.24 gives Hyn -» Hx. Complete the
proof of continuity of H by checking that Hyn = Я„(Ях„). ?
As in Chapter V, we shall find a criterion for weak convergence in D[0, oo)
by considering the approximations to processes built from the values they
take at a fixed, finite grid of points 0 = t0 < tx < ¦ ¦ ¦ < tk. Define the inter-
interpolation map A from D[0, oo) into D[0, oo) by
= x(tk){tk < t < со} + X x(ti){ti < t < ti+1}.
;=o

VI. 1. Properties of the Metric
125
All the jumps of Ax occur at grid points. For a fine grid, x and Ax will be
close if x does not vary much between the grid points, except possibly for a
single jump in each interval. In that case values of x before the jump are
close to the value at the left grid point; values of x after the jump are close
to the value at the right grid point:
Multiple jumps, though, could be missed completely:
Bad behavior of x in \tk, oo) affects d(x, Ax) only through those dT with
T > tk. Good behavior of x in [О, Г] may be quantified by a modulus
function.
3 Definition. For each x in D[0, oo) and each finite interval [a, b\ the
modulus function A(x, [a, b~\) equals the infimum of those positive e for
which there exists a point s in (a, fe] such that \x{t) — x(a)\ < s if a<t<s
and \x(t) — x(b)\ < s if s < t < b. ?

126 VI. The Skorohod Metric on O[0, oo)
4 Lemma. Let Ax be the approximation to x constructed from the values
taken at grid points 0 = t0 < • • • < tk = T, with tt — t?_ j < д for each i.
Then
dT(x, Ax) < ё v max A(x, [t;_b tj).
i<k
Proof. Write A for the maximum A(x, [*,-_!, tj) value. Choose e > 0. The
definition of the modulus function gives points {s,} with
0 = s0 = t0 < Si < ti < s2 < ¦ ¦ ¦ < sk < tk
and
\x(s) - x(ti)\ < A + s if st < s < si+1 for i = 0,..., к — I,
\x(s) - x(tk)\ < A + e if sk< s < tk.
Possibly sk will be strictly less than T. Add on one more point sk+1 with
tk < h+i < tk + д and
\x(s) — x(tk)\ < A + s for tk < s < sk+1.
Set tk+1 = sk+1. The extra points make {sj and {tt} suitable grids for the
calculation of dT(x, Ax). Since (Ax)(t) = x(tt) throughout \tu ti+1),
\(Ax)(t) — x(s)\ < A + s for tt<t<ti+1 and st<s<si+1.
Deduce that dT(x, Ax) < 3 v (A + s), then let e tend to zero. ?
The lemma shows that the error committed in the approximation of x by
Ax depends upon the variability of x, as measured by the modulus function.
If we sufficiently refine the grid used to construct Ax, the approximation
improves; for a fixed x, we can make the error as small as we please by
choosing the grid fine enough.
5 Lemma. For each fixed x in D[0, oo), each fixed T, and each e > 0, there
exists a 5 > 0 such that
max A(x, [r,_1; tj) < s
i
for every grid 0 = t0 < ¦ ¦ ¦ < tk = T with tt — tf_ j < b for each i.
Proof. If A(x, [a, a']) > s, we get points a < т < /? < a' with
\x(t) - x(oc)\ >\s and \x{x) - x(f})\ > \г
by setting
т = inf{t > a: \x(t) — x(a)\ > js},
P = inf{t > t: \x(t) - x(t)\ > is}.

VI. 1. Properties of the Metric 127
So if the assertion were false there would exist sequences of points <х„ < т„ <
Р„<Т for which Р„ - а„ -> 0 and
\х(тп) - х(осп)\ >ie and \х(тп) - х(рп)\ > ^s.
We could extract subsequences along which <xn -» а, т„ -» и, /?„ -» и, for some
и in [О, Г]. This would violate the cadlag property at a: there must exist a
<5 > 0 for which
x(s) — x(a — )\<je for и — д < s < a,
\x(s) — x(a)\ < jz for a < s < a + 5,
but eventually both members of at least one of the pairs (<х„, т„) or (т„, fjn)
would have to lie on the same side of a. ?
Our main use for Lemmas 4 and 5 will be the construction of finite-
dimensional approximating processes to characterize convergence in
distribution for random elements of D[0, oo). They also help us dispose of
measurability complications.
6 Theorem. Under its Skorohod metric D[0, oo) is separable. The borel
a-field coincides with the projection a-field.
Proof. Write Do for the countable class of functions that take constant,
rational values on each of the intervals [0, 1/JV),..., [(iV2 — 1)/JV, Л0,
[./V, oo) for ЛГ = 1, 2,... . Let us show that the class Do is dense in D[0, oo).
Write S(N) for the set of grid points {j/N: j = 0,..., N2}. As usual,
nSiN) denotes the map that projects an x in D[0, oo) onto its vector of values at
the points oiS(N). Define hN as the continuous map from 1RS(JV) into Z)[0, oo)
that takes a vector г = (r0,..., rN2) onto the step function with constant
value rj on the jth grid interval. If the components of г are rational, /ijv(r)
belongs to Do.
Given x and an e > 0, choose ./V with 2~N less than e and N~1 less than
the ё of Lemma 5. We may assume that 5 < s. Construct the approximation
AN x to x using the grid points S(N). That is, set AN = hN о nsm. From Lemma
4, dN(x, ANx) < s. Hence d(x, ANx) < 2s. Find rational numbers r} with
| г3 — x(J/N) | < e. Then /^(r) belongs to Do and d(x, hN(i)) < 3s.
For the moment write Si for the borel u-field and SP for the projection
ff-field on Z)[0, oo). To prove that SP ? 3&, observe that x(t0) = lim Hn(x)
for fixed t0, where
Н„(х) = sup x(t)sn(t),
t
sn(t) = (I - n\t - t0 - 2/n\)+.
For fixed n, the functional Hn is continuous (Problem 3). As a pointwise
limit of continuous functions on D[0, oo), the projection nt0 must be 8%-
measurable. The projections generate SP.

128 VI. The Skorohod Metric on D[0, oo)
To prove that 3§ ^ 3P,it is enough to establish that each continuous, real
function / on D[0, oo) is ^s/^(IR)-measurable. (Every closed set in a metric
space can be represented as /~ 1{0} for some continuous /.)
We know that d(x, hN ° nS(N)(x)) -» 0 as ./V -> oo. Thus/ ° hN ° nS(N)(x) -»
f(x) for each x, by continuity of/. The map /o hN is continuous from 1RS<N)
into IR, and hence ^(IRS(iV))/^(]R)-rneasurable. The map nS(N) is, by definition,
^y^(IRS(N))-measurable. Thus their composition f°hN°nS(N) must be
^s/^(IR)-measurable. As a pointwise limit of such functions, / must also
be ^/^(IR)-measurable. ?
Needless to say, from now on we shall always equip ?>[0, oo) with its
projection u-field, alias the borel u-field for the Skorohod metric. Every
point of ?>[0, oo) is completely regular under this u-field.
7 Example. The asymptotic theory for maxima of independent random
variables bears some similarity to the theory for sums of independent random
variables. The role played by the normal is taken over by the extreme-value
distributions, whose distribution functions are of the form exp( — G(x)) for
G(x) equal to one oie~x, or x~x{x > 0}, or ( —x)~a{x < 0}, with a a positive
parameter. If the maximum Mn of n independent observations from a distri-
distribution function F can be standardized to converge in distribution, then the
limit must be one of these: for constants а„ (positive) and bn,
(8) W{Mn < anx + bn} = F\anx + bn) -> exp(- G(x)).
This convergence implies a much stronger result for the joint asymptotic
behavior of the maxima at different sample sizes, a result analogous to the
convergence of the partial-sum process to brownian motion (Example V.20).
Define the maxima process У„(-) as the random element of D[0, oo) with
Yn{t) = (Mj - bn)lan for j/n <t<(j+ l)/n.
The assumption (8) gives convergence for { Yn(l)}. Using only the facts about
the Skorohod metric that we have so far accumulated, we can strengthen
this to convergence in distribution of the {Yn} process.
The method of proof depends upon a representation of Yn as a continuous
transformation of a poisson process. To minimize extraneous detail, assume
G(x) = x~~a{x > 0}. Trivial modifications of the argument would cover the
other two cases.
Define a sequence of measures {Н„} on @, oo) by means of their distribu-
distribution functions:
Я„@, x] = и exp(-G(x)/n) - (и - 1) exp(-G(x)/(n - 1)).
Calculate their density functions if you doubt that these are well-defined
measures. On S = [0, со) ® @, oo) generate independent poisson processes

VI. 1. Properties of the Metric 129
{л„} with intensity measures {X ® Hn}, where X denotes lebesgue measure
on [0, oo). The sum an = n1 + ¦ ¦ ¦ + п„ is also a poisson process. As n tends
to infinity, an increases to a poisson process a on S with intensity measure
CD П
X X ® Hi = X ® lim ? Hi = x ® У,
"=i n ;=i
the measure у being determined on @, oo) by
y(x, oo) = lim[n - n exp( -
Label the points of <?„ as (rjni, hni), where t]nl < r\n2 < ¦ ¦ •. The {t]ni} form
a poisson process on [0, oo), with intensity nX, independent of the {hni};
the gaps between adjacent t]ni have independent exponential distributions
with mean n. The {hni} are independent observations on the distribution
function exp( — G/n). When subjected to a slight vertical perturbation they
will become standardized observations on F.
Let Q be the quantile transformation corresponding to the distribution
function F (Section III.6). Define
ВД = [Q(exp(-GG0/n)) - №..
For large n, this transformation hardly disturbs y:
Tn(y) = inf{(z - bn)/an: F(z) > exp( - G(y)/n)}
= M{x: F\anx + b) > exp(-GOO)}
-* inf{x: exp( - G(x)) > exp( - G(y))} by (8)
= У-
The transformed variables {а„ Tn(hni) + bn) form a sequence of independent
observations on F, because hni has distribution function exp( — G/n):
&{anTn(hni) + bn<x} = JP{Q(exp(-G(hni)/n)) < x}
= P{Uniform@, 1) < F(x)}
= F(x).
The Т„ has the desired effect, in the vertical direction, on the points of an.
Define a random element Zn of D[0, oo) by setting
Zn(t) = sup{Tn(hni): цп1 < t}.
If Zn had its jumps at n~l, In"i,... instead of at цпЪ r\n2,... it would be a
probabilistic copy of Yn. Remedy the defect by applying to the time axis the
random, piecewise linear transformation yn that sends 0 onto 0 and;/n onto
t]nj, for j = 1, 2,... . The processes У„(-) and Zn ° }>„(•) have the same distri-
distribution as random elements of D[0, oo).
By the weak law of large numbers, yn{t) -> t in probability uniformly on
compact intervals (Problem 4). Thus d(Zn, Zn о yn) -> 0 in probability. The

130 VI. The Skorohod Metric on O[0, oo)
random elements {Zn} themselves converge almost surely to the random
element
Z(t) = sup{hi:rii<t},
where (r\u hj), (rj2, h2),... denote the points of the poisson process a arranged
in order of increasing time coordinate. Deduce that {Zn}, and hence {У„},
converges in distribution to Z. ?
VI.2. Convergence in Distribution
In Section V.I we found a necessary and sufficient condition for convergence
in distribution of random elements {XJ of D[0, 1], under its uniform metric,
to a limit process X concentrating on a separable subset. The separability
allowed X to have discontinuities only at fixed locations. For the proof we
constructed an approximation АХ„ to each Х„ based on the values it took
at a fixed finite grid on [0, 1]. The conditions we imposed ensured that, with
high probability, the АХ„ process was uniformly close to Xn.
A similar method of proof will apply for convergence in distribution of
random elements of D[0, oo) under d, its Skorohod metric. The constraint
on the limit process will disappear, because D[0, oo) itself is separable under
d. Each approximation AXn will, with high probability, be close to its Xn
in the sense of d distance. But one extra complication will arise because the
fidi projections are not automatically continuous.
If x belongs to D[0, oo) and x{x) ф х(т —), the projection map nz is not
continuous at x. For example, if xn(t) = x(nt/(n + 1)) then d(xn, x) -» 0 but
nzxn -» x(t—) Ф nTx. For x a continuity point of x, however, щ is continuous
at x (Problem 5). Necessarily, n0 is continuous at every x, because every
increasing A that maps [0, oo) onto itself must set A@) equal to 0. For a
random element X of D[0, oo), the projection nz will be continuous at all
sample paths except those that have a jump at т.
9 Lemma. For each random element X of D[0, oo) there exists a subset Tx
of [0, oo) such that [0, оо)\Гх is countable and JP{X(t) = X(t-)} = 1 for
t in Гх. The projection щ is IPX almost surely continuous at each t in Гх.
Proof. It is enough to show that if e > 0 then
JP{\X(t)-X(t-)\>s}>s
for at most finitely many t values in each bounded interval [0, Г]. Write
Jit) for {| X(t) - X(t -) | > e}, If TPJ(tn) > e for an infinite sequence {tn} of
distinct points in [0, Г] then 1Р{/({„) infinitely often} > e. There would exist
an со belonging to infinitely many of the J(tn) sets. At some cluster point t in

VI.2. Convergence in Distribution 131
[0, T] the inequality \X(a>, Q - X(co, tn-)\ > г would hold for infinitely
many distinct tn values in every neighborhood of t. This would violate the
cadlag property of X(co, •) at t. ?
Because the projection n0 is continuous at every x in D[0, oo), let us also
admit 0 as a point of Tx, even though X@—) is not defined.
10 Theorem. Let X, X1, X2, ¦¦ ¦ be random elements of D[0, oo). Necessary
and sufficient conditions for Xn ^* X, in the sense of the Skorohod metric, are:
(i) the fidis of Xn corresponding to finite subsets of Tx converge to the fidis
ofX;
(ii) for each s > 0, each ц > 0, and each finite T in Tx, there exists a grid
0 = t0 < ¦ ¦ ¦ < tK = T of points from Tx with
limsup iNmax A(Xn, [t;_b tj) > r\> < s.
Proof of Necessity. Appeal to the Representation Theorem (IV. 13) for a
new sequence {Xn}, with the same distributions as {Xn}, and an X with the
same distribution as X, for which
A1) d(Xa(co, -),X(co, •)) -> 0 for almost all со.
For t in Гх,
Xn(co, t) -> X(co, t) at almost every со.
Fidi convergence for {Х„} at points of Tx follows.
Find a grid 0 = t0 < ¦ ¦ ¦ < tk = Г of points from Гх with
A2) ipjmax A(X, [t;_ u tj) > Л < s.
Such a grid exists by virtue of Lemma 5: as a sequence of grids is refined
down to a countable, dense subset of [0, Г], the maximum value of A over
the grid intervals converges to zero at each sample path of X. Write Ar(x)
as an abbreviation for max; A(x, [t;_ l5 tj).
Consider an со at which the convergence A1) holds. Write xn(-) for
Х„(со, ¦), and x(-) for X(a>, •). If we show limsup Ат(х„) < 2Аг(х), then it will
follow that
limsup Р{АГ(Х„) > 2ц\ = limsup TP{AT(Xn) > 2r\\
< IP{limsup АТ(Х„) > 2rj}
< W{AT(X) > r,}
< ? by A2)
as required by hypothesis (ii).

132 VI. The Skorohod Metric on D[0, со)
Choose <5 > 0. By definition of AT(x), there exists points {sj with ?,_i <
Si < ti and
|x@-x(*;-i)l < AT(x) + <5 for ti_1<t<s,-,
|xCt)-x(t,)l <AT(*) + <5 for Si<t<ti.
Continuity of x at both ti^1 and t; allows us to assume that the strict in-
inequality st < tt holds, and also that the range of validity for the first inequality
is {,_! — S' < t < Si, and for the second s? < t < tt + S', for some 3' > 0.
Because d(xa, x) -> 0, there exists a sequence of continuous, increasing
maps {Xn} from [0, oo) onto [0, oo) such that
Ut)~t^0 and х(Я„@) - xH(t) - 0
uniformly on compacta. When n is large enough, ?г_х < Ап'Ч5;) < h- Use
st,n = ^"'(Sj) as the split point in [tj_l5 tj for bounding Дт(х„). If t;_i <
t < sijn, and n is large enough,
|х„@ - x^-JI < |х(Я„(г)) - xCA^ti-i))! + <5
< |х(Яи@) - xfe.OI + Ix^^) - xafe^))! + 8
< 2AT(x) + 3d,
because, eventually, t,-t — 8' < ln{tt_x) < Xn(t) < An(si>n) = st. A similar
argument applies to t in [s; „, t;].
Proof of Sufficiency. Let / be a bounded, uniformly continuous, real
function on D[0, oo). We need to show that JPf(Xn) -> P/(X). Given г > 0,
find »? > 0 so that | /(x) - /(y) | < e whenever d(x, y) < 217. Choose from
Tx а Г large enough to ensure d(x, y) < dT(x, y) + r\ for every pair x, y.
Let AT(-) have the same meaning as in the proof of necessity. According
to hypothesis (ii) there exists a grid on [0, T] for which
1Р{АТ(ХП) >ц) < г if n>n0.
Also we may assume that the grid points are less than r\ apart and, by the
same reasoning as for A2), that W{AT(X) > r\} < s. Lemma 4 shows that
the approximations constructed from this grid are, with high probability,
close to the sample paths of the processes:
W{dT(Xn,AXJ>t1}<s if n>n0,
W{dT(X, AX) >rj} <e.
Complete the proof in the usual way. Write A for the interpolation map
constructed from the values at the grid points.
< JP\f(Xn) - f(AXn)\ + \Wf{AXn) - Wf{AX)\
+ W\f(AX)-f(X)\
< s + 2||/||1Рде„, AXn) > 2ц] + \JPf(AXn) - JPf(AX)\
+ s + 2\\f\\JP{d(X,AX)>2r,}
< e + 2||/||e + e + e + 2||/||e eventually,
because AXn -> AX and d(x, Ax) < dT(x, Ax) + ц for every x. ?

VI.2. Convergence in Distribution 133
Roughly speaking, the modulus А(Х„, [f;_ ъ tj) will be small if Xn has at
worst one large jump in the interval [?;_15 tj. To prove convergence in
distribution, we need some way of stopping large jumps from piling up in a
small interval. If Х„ has a jump at х„, which in general will be a random
time, then we need
W{\Xn(xn + t) - Xn(xn)\ small, for all small t) x 1,
and the approximation should hold uniformly in n. A maximal inequality
seems required. Contrast this vague, formidable task with an elegant suffi-
sufficient condition due to Aldous A978): Х„ ~> X if the fidis converge and, for
each fixed T,
A3) Х„(Рп + 3n) - Xn(Pn) -+ 0 in probability,
whenever {<5„} is a sequence of positive numbers converging to zero and
{pn} is a sequence of stopping times taking values in [0, T]. (The stopping
time property means that the event {pn < t} should belong to the a-field
generated by the random variables Xn(s), for 0 < s < t.)
An equivalent form of A3) is: for each T, each r\ > 0, and each s > 0,
there exists a 3 > 0 and an n0 such that
A4) Щ\Хп{рп + д')-Хп(рп)\>г1}<г for n>n0,
whenever pn is a stopping time for Х„ that takes values in [0, Г] and 3' is a
real number with 0 < d' < д. The proof of Aldous's result is built up by
repeated application of inequality A4).
15 Lemma. Let Z be a random element of D[0, oo) for which
JP{\Z(p + 5') - Z(p)\ > n} < s
for each real d' in [0, <5] and each stopping time p taking values in [0, Г]. If
a and x are stopping times for which a < г and \ Z(t) — Z{a) \ > 2r\ on {т < oo},
then JP{z < T л (a + Щ} < 4s.
Proof. Integrate both sides of the inequality for p with respect to lebesgue
measure on [0, 8], interchange the order of integration, then make a change
of variable in the inner integral.
гд
IP J{|Z(s) - Z(p)\ >rj}{p<s<p
Apply this inequality twice with p equal to the stopping tiroes aT = T л а
and xT = T a x, then add.
led > IP J{|Z(s) - Z(aT)\ >r,}{<rT<s<<TT + 3}
+ {|Z(s) - Z(xT)\ > rj} {xT < s < xT + 3} ds
> IP ^{Z(s) - Z(aT)\ >r,} + {|Z(s) - Z(xT)\ > ny\
x {tt < s < aT + 3} ds.

134 VI. The Skorohod Metric on D[0, со)
On the set {т < T}, the sum of the two indicators is at least 1, because at
least one of the inequalities
|Z(s) - Z(aT)\>n, |Z(s) -Z(zT)\>n
must hold if aT = a, zT = x, and \Z(a) - Z(x)\ > 2ц. Deduce that
2sS>JP f {t < T} {z < s < a + 3} ds
> IP j {
< Гл (о- + |<5)}{a + ^d < s < <7 + d} ds
< T a (a + i<5)}. ?
16 Theorem. Let X, Xu X2,... be random elements of D[0, со)for which:
(i) the fidis of Xn corresponding to finite subsets of Tx converge to the fidis
ofX;
(ii) Aldous's condition A3) holds.
Then Xn ^ X in the sense of the Skorohod metric.
Proof. Verify condition (ii) of Theorem 10. Set down a grid 0 = t0 < ¦ ¦ ¦
< tk = Г of points from Tx with the maximum grid interval shorter than
^<x, for a value of a that will be specified soon.
Fix an n, then define stopping times for Xn by
то = О,
zJ+1 = inf{t > zf \Xn(t) - Xn(zj)\ > In}
with the usual convention that the infimum of the empty set equals + oo. We
should perhaps add an extra subscript n to each Zj.
If (ti-i,tj contains at most one of the {zj} then we must have
А(Х„, [f^, t,J) < 4n. For if т,--! < t;_! < zj < t{ < zj+1 then
\XJt)-XJxj-i)\<2ri if U^KtKZj,
\Xn(t) - Xn(zj)\ <2n if zj<t<ti
and hence
\Xn(t)-Xn(ti_1)\<4n if t^^tKZj,
\Xn(t) - Xn(tt)\ <4n if zj<t<tt.
Apply the same reasoning to each grid interval.
maxA(Xn,[t;_bt;]) > 4n
< {some (?;_ь t;] contains at least two of the {т^}
< {some pair т,-_ ь tj has Zj < T л (zj_ x + |a)}.

VI.2. Convergence in Distribution 135
Fix an integer m, whose value will be specified soon. Bound the last indicator
function by
f {t; < T л (xj.! + !«)} + m f {xj < Т л (т,_х + Т/т)}.
j=i j=i
The reasoning here is: either z2m > T, in which case the pair х3_и xi would
be detected by the first sum; or z2m < T, in which case at least m terms of the
second sum must equal one, for otherwise [О, Г] would have to contain
(m + 1) disjoint intervals (zj_u tj) of length greater than T/m. Take expec-
expectations.
A7) ЩтахМХя, [tt.utJ)>
< 2m max JP{tj < Г л (zj_l + ^
j<2m
+ 2 max W{tj < T л (т7-_! + T/m)}.
j<2m
Now we choose m and a.
Invoke Lemma 15 for Z = Х„ with о = т^_1; т = т,- and the 3 provided
by A4). For n> n0,
JP{Tj< Г Л (Т,._! +^)} <4?.
Choose ш so that T/m < \5; hold it fixed. The second term in the bound A7)
is then less than the 8s if n > n0. From A4) find the a for which
П\Хп(рп + д')-Хп(рп)\>г,}<г/т for n>ni
if 0 < d' < a. and if р„ is a stopping time for Х„ with 0 < р„ < T Invoke
Lemma 15 again, but replace S by a and ? by г/т. For n > n1,
1Р{т; < Г л (tj_ ! + jtx)} < 4s/m.
The first term in the bound A7) is less than 8s if n > nx. In summary:
IP<max А(Х„, [t;_ ь tj) > 4rj > < 16e for n > max(n0, nj),
which completes the proof. ?
18 Example. For processes with independent increments, the criterion for
convergence in distribution is particularly simple. Under the uniform
metric for D[0, 1], Theorem V.I9 showed that fidi convergence plus an
equicontinuity condition on the increments suffices if the limit process X
has continuous sample paths. With a minor variation we get Aldous's
condition; essentially, we can drop the constraint on the sample paths of X
if we reinterpret the result as convergence in the Skorohod sense.

136 VI. The Skorohod Metric on D[0, со)
Suppose that for each e > 0, ц > 0, and Г < oo, there exists a 3 > 0 and
an n0 such that JP{\Xn(t + 6') - Xn(t)\ > rj} < г whenever 0 < t < T,
0 < 3' < S, and n > n0. If р„ is a stopping time that takes only finitely many
different values, and these values all lie in [0, T], then for n > n0,
« = t}JP{\Xn(t + d') - Xn(t)\ > ц\рп = t}]
t
< ?.
Every stopping time can be approximated arbitrarily closely from above by
a stopping time that takes only finitely many different values (round up to
the next value on a finely spaced grid); the cadlag property of the sample
paths carries the inequality over to the stopping times covered by Aldous's
condition. ?
Problem VIII.8 will give a more interesting application of Theorem 16
to convergence of martingales.
Notes
Much of this chapter draws ideas from Billingsley A968, Chapter 4) and
Gihman and Skorohod A974, Sections Ш.4 and VI.5).
Skorohod A956) defined on D[0, 1] several metrics, which allowed
different sorts of behavior of a convergent sequence near a discontinuity of
the limit function. Billingsley A968) introduced a variation on Skorohod's
Jl metric, thereby making D[0,1] complete. This is not so important if we
consider only convergence to a known limit process, but it does greatly
simplify the theory if existence of the limit process must be proved by com-
compactness arguments. The metric of Definition 1 is modeled on the metric
of Kolmogorov A956) for D[0, 1].
Whitt A980), and Lindvall A973) have shown how difficult it is to write
down a metric for Jx convergence on compacta, as defined by Stone A963).
The modulus function is a fixed-grid analogue of Skorohod's A956)
AJt, or the Ac of Gihman and Skorohod A974, page 423). Billingsley A968,
Section 14) has defined other moduli for D[0, 1]. If existence of the limit
process is not assumed in Theorem 10, the conditions will not guarantee its
existence; the conditions of the theorem do not translate directly into a
characterization of uniform tightness.
Theorem 6 borrows from Parthasarathy A967, Section VII.6). The
elegant argument for inclusion of the borel a-field in the projection a-field
was attributed by Straf A969, page 67) to Wichura.
Example 7 is based on the method of Resnick A975). The idea of embedd-
embedding the maxima process into a two-dimensional poisson process comes from
Pickands A971).

Problems 137
The convergence criterion of Theorem 10 corresponds to Theorem VI.5.2
of Gihman and Skorohod A974).
Aldous A978) proved a result slightly different from Theorem 16; he gave
a sufficient condition for uniform tightness in D[0, 1]. I found Kurtz's
A981, Chapter 2) rearrangement of the proof helpful.
Skorohod A957) studied convergence of processes with independent
increments. His paper contains many fascinating sample-path arguments.
Problems
[1] Prove that the d of Definition 1 is a metric on D[0, oo). [To show that d(x, у) = 0
implies x = y, fix a t, then choose Г larger than t. For each z with t < z < T, deduce
from the definition of dT the existence of a sequence {sn} with sn -» z and v(sn) -* x(%)-
Deduce that either x(z) ~ y{z) or x(z) = y(z—). Choose a sequence {т,} strictly
decreasing to t. Right continuity of у at t gives both v(t,) -> y(t) and y{zt —) -> y(t);
right continuity of x gives x(zt) ->¦ x(t)-]
[2] Show that d(xn, x) ->¦ 0 if and only if there exist continuous, strictly increasing
maps {Д„} from [0, oo) onto itself such that, uniformly on compact sets of г values,
Xn(t)~t^O and х(А„(г)) ~ xn(t) -» 0.
[Construct Х„ as a piecewise linear map that takes gridpoints for х„ onto gridpoints
for x, for pairs of grids chosen according to the definition of dj{xn, x) with T
depending on n.~\
[3] Prove continuity of the functional Hn that appeared in the proof of Theorem
6. [If d(x, x,) -> 0, choose continuous, increasing {Aj} with x(Af(t)) — x;(?) -> 0
and Aj(l) — t -> 0 uniformly on compact t sets. For г large enough, bound
|Я„(х)-Я„(х()|Ьу
sup|x(t)|sup|sn(Ai(O) - sM\ + sup|x(Ai(t)) - x,{t)\
t<c t t<c
for some constant c]
[4] Let {у„} be a sequence of random, increasing maps from [0, oo) onto itself such that
Уп@ ->¦ ? in probability, for each fixed t. Show that
SUP |уи@ - t| -» 0 in probability
0<1<Г
for each fixed Г. [If|yn(s) ~ s\ < e and \yn(t) - t[ < ethen |у„(и) -u\<e + \s-t\
for и between s and t]
[5] Suppose d(xn, x) -» 0 and that x is continuous at т. Show that х„(т) ->¦ x(z). [Use
х(А„(т)) - х„(т) -> 0 and Л„(т) -> т.]
[6] If d(xn, x) ->¦ 0 and x belongs to C[0, oo) then х„ converges to x uniformly on
compacta.
[7] If Xn "+ X in the Skorohod sense, and if X has sample paths in C[0, oo), then
Х„ "+ A" in the sense of the metric for uniform convergence on compacta. [Switch
to versions that converge almost surely in the Skorohod sense.]

CHAPTER VII
Central Limit Theorems
... in which the chaining method for proving maximal inequalities for the
increments of stochastic processes is established. Applications include con-
construction of gaussian processes with continuous sample paths, central limit
theorems for empirical measures, and justification of a stochastic equicontinuity
assumption that is needed to prove central limit theorems for statistics defined
by minimization of a stochastic process.
VII. 1. Stochastic Equicontinuity
Much asymptotic theory boils down to careful application of Taylor's
theorem. To bound remainder terms we impose regularity conditions, which
add rigor to informal approximation arguments, but usually at the cost of
increased technical detail. For some asymptotics problems, especially those
concerned with central limit theorems for statistics defined by maximization
or minimization of a random process, many of the technicalities can be
drawn off into a single stochastic equicontinuity condition. This section
shows how. Empirical process methods for establishing stochastic equi-
equicontinuity will be developed later in the chapter.
Maximum likelihood estimation is the prime example of a method that
defines a statistic by maximization of a random criterion function. Indepen-
Independent observations ? u ..., ?„ are drawn from a distribution P, which is
assumed to be a member of a parametric family defined by density functions
{?(¦> #)}• F°r simplicity take 6 real-valued. The true, but unknown, 60 can
be estimated by the value 6n that maximizes
Let us recall how one proves asymptotic normality for в„, assuming it is
consistent for 60.
Write 0o(-, <?)forlogK-, e),andgi(; в),д2(-, в),д3(-, в), for the first three
partial derivatives with respect to в, whose existence we impose as a regularity
condition. Using Taylor's theorem, expand go(-, в) into
0o(-. ^o) + (в - во)д1(-, в0) + КО - 0oJ9z(-, 0o) + Ш - воKд3(-, в*)

VII. 1. Stochastic Equicontinuity 139
with в* between в0 and в. Integrate with respect to the empirical measure Р„.
GM = GB@O) + (в - во)Р„д1 + № - воJРпд2 + Rn@).
If we impose, as an extra regularity condition, the domination
\д3(-,в)\<Н(-) for all в,
then the remainder term will satisfy
\Rn(9)\ <Ш- во\3Р„\дз(;в*)\ <Цв - во\3Рпн.
Assume PH < оо and P\g2\ < °о. Then, by the strong law of large numbers,
for each sequence of shrinking neighborhoods of 60 we can absorb the re-
remainder term into the quadratic, leaving
A) Gn(ff) = Gn(e0) + (в- во)Р„д1 +K0- во)\Р9г + op(l)) near в0.
The Op(l) stands for a sequence of random functions of в that are bounded
uniformly on the shrinking neighborhoods of в0 by random variables of
order op(l). Provided Pg2 < 0, such a bound on the error of approximation
will lead to the usual central limit theorem for {п1/2F„ — в0)}. As a more
general result will be proved soon, let us not pursue that part of the argument
further. Instead, reconsider the regularity conditions.
The third partial derivative of go( ¦, в) was needed only to bound the
remainder term in the Taylor expansion. The second partial derivative enters
A) only through its integrated value Pg2 ¦ But the first partial derivative plays
a critical role; its value at each ?; comes into the linear term. That suggests
we might relax the assumptions about existence of the higher derivatives and
still get A). We can. In place of Pg2 we shall require a second derivative
for Pgo(-, в); and for the remainder term we shall invoke stochastic equi-
equicontinuity.
In its abstract form stochastic equicontinuity refers to a sequence of
stochastic processes {Zn(t): t e T} whose shared index set Г comes equipped
with a semimetric d(-, •)• (In case you have forgotten, a semimetric has all
the properties of a metric except that d(s, t) = 0 need not imply that s equals
t.) We shall later need it in that generality.
2 Definition. Call {ZJ stochastically equicontinuous at t0 if for each ц > 0
and g > 0 there exists a neighborhood U of t0 for which
limsup pi sup | Zn(t) - ZB(t0) \ > Л < s. ?
There might be measure theoretic difficulties related to taking a supremum
over an uncountable set of t values. We shall ignore them as far as possible
during the course of this chapter. A more careful treatment of measurability
details appears in Appendix С
Because stochastic equicontinuity bounds Zn uniformly over the neigh-
neighborhood U, it also applies to any randomly chosen point in the neighborhood.

140 VII. Central Limit Theorems
If {т„} is a sequence of random elements of Г that converges in probability
to t0,then
C) ZH(zH) - Zn(t0) -» 0 in probability,
because, with probability tending to one, т„ will belong to each U. When we
come to check for stochastic equicontinuity the form in Definition 2 will be
the one we use; the form in C) will be easier to apply, especially when be-
behavior of a particular {т„} sequence is under investigation.
The maximum likelihood method generalizes to other maximization
problems, where {log p(-, в)} is replaced by other families of functions. For
future reference it will be more convenient if we pose them as minimization
problems.
Suppose <F = {/(•, i):teT}, with T a subset of Ж*, is a collection of
real, P-integrable functions on the set S where P lives. Denote by Р„ the
empirical measure formed from n independent observations on P, and
define the empirical process Е„ as the signed measure nll2(Pn — P). Define
We shall prove a central limit theorem for sequences {%„} that come close
enough to minimizing the (Fn(-)}-
Suppose /(-, t) has a linear approximation near the t0 at which F(-) takes
on its minimum value:
D) /(•, 0 = /(•, t0) + (t- to)'A(O + \t~ to\r(; t).
For completeness set r(-, t0) = 0. The A(-) is a vector of к real functions on
S. Of course, if the approximation is to be of any use to us, the remainder
function r(-,t) must in some sense be small near t0. If we want a central
limit theorem for {т„}, stochastic equicontinuity of {Enr(-, t)} at t0 is the ap-
appropriate sense.
Usually r(-,t) will also tend to zero in the ??2{P) sense:P|r(-, t)\2 -» 0 as
t -> t0. That is, /(•, 0 will be differentiable in quadratic mean. In that case,
we may work directly with the i?2(P) seminorm pP on the set 0t of all re-
remainder functions {r(-,t)}. Stochastic equicontinuity of {Enr(-,t}} would
then follow from: for each г > 0 and r\ > 0 there exists in M a neighborhood
V of 0 such that
limsup IP<sup \Enr\ > r\> < s.
I v )
The neighborhood V would take the form {ref :pP(r) < 3} for some
3 > 0. This would be convenient for empirical process calculations. Differ-
Differentiability in quadratic mean would also imply that PA = 0. For if PA were
non-zero the integrated form of D),
P/(-, 0 = Pf{; t0) + (t- to)'PA + o(t - t0) near t0,
would contradict existence of even a local minimum at t0.

VII.1. Stochastic Equicontinuity 141
5 Theorem. Suppose {т„} is a sequence of random vectors converging in
probability to the value t0 at which F(-) has its minimum. Define r(-, i) and the
vector of functions A(-) by D). //
(i) t0 is an interior point of the parameter set T;
(ii) F(-) has a non-singular second derivative matrix V at t0;
(iii) Fn(-in) = op(n-1) + mf,Fn(t);
(iv) the components of A(-) all belong to =S?2(P);
(v) the sequence {Enr(-, t)} is stochastically equicontinuous at t0 ;
then nll2(zn - t0) ~» JV(O, V~l[P{AA') -
Proof. Reparametrize to make t0 equal to zero and V equal to the identity
matrix. Then (ii) implies
F(t) = F@) + \[ 112 + o( 1112) near 0.
Separate the stochastic and deterministic contributions to the function Fn(t)
by writing Р„ as the sum P + n~mEn. Write Zn(t) for ?>(•, t). Stochastic
equicontinuity implies Zn(xn) = op(l). For values of t near zero,
F) Fn(t) - Fn@) = P[/(•, 0 - /(•, 0)] + n~ l'2En [/(¦, 0 - Я •, 0)]
Invoke (iii). Because Fn{xn) comes within op(n~l) of the infimum, which is
smaller than Fn@),
op{n-')>Fn{xn)-FM
= ikl2 + op{\xn\2) + n-Vh&A + op(n-ll2\zn\).
The random vector ?„Д has an asymptotic N@, P(AA') - (PA)(PA)')
distribution; it is of order Op(l). Consequently, by the Cauchy-Schwarz in-
inequality, т'„Е„А> — |т„|ОрA). Tidy up the last inequality.
op{»~1) ^ Й - орA)]|т„|2 - n^^lTjO/l) - op(n-^2\zn\)
= Й- орA)][|ти| - Op(n~1/2)]2 - Op{n~').
It follows that the squared term is at most Op(n~ *), and hence т„ = Op(n~1/2).
(Look at Appendix A if you want to see the argument written without the
Op(-) and ОД-) symbols.) Representation F) for t = т„ now simplifies:
Fn(zn) = Fn@) + i\zn\2 + n~ll2z'nEnA + op(n-')
= Fn@) + $|ти + п/2?„А|2 - in^l^AI2 + o/n).
The same simplification would apply to any other sequence of t values of
order Op(n~1/2). In particular,
Fn{-n-ll2EnA) = Fn@) - in-^AI2 + оДи).
Notice the surreptitious appeal to (i). We need n~ 1/2?„ A to be a point of T
before stochastic equicontinuity applies; with probability tending to one as
n —> со, it is.

142 VII. Central Limit Theorems
Now invoke (iii) again, comparing the values of Fn at т„ and — п~1/2Е„А
to get
whence п1/2т„ = — ?„Д + op(l). When transformed back to the old para-
metrization, this gives
ni/2Fi/2(Tn _ to) = _ к-1/2Е„Л + op{\)
~>V~ 1/2iV@, P(AA') - (PA)(PA)'). ?
Examples 18 and 19 in Section 4 will apply the theorem just proved. But
before we can get to the applications we must acquire the means for verifying
the stochastic equicontinuity condition.
VII.2. Chaining
Chaining is a technique for proving maximal inequalities for stochastic
processes, the sorts of things required if we want to check the stochastic
equicontinuity condition defined in Section 1. It applies to any process
{Z(t):teT} whose index set is equipped with a semimetric d(-, •) that
controls the increments:
W{\Z(s)-Z(t)\>ri}<A{n,d{s,t)) for r,>0.
It works best when A(-, •) takes the form
with D a positive constant. Under some assumptions about covering numbers
for Г, the chaining technique will lead to an economical bound on the tail
probabilities for a supremum of | Z(s) — Z(t) | over pairs (s, t).
The idea behind chaining, and the reason for its name, is easiest to under-
understand when Г is finite. Suppose Tu T2,..., Tk+1 = Г are subsets with the
property that each t lies within 5t of at least one point in Tt. Imagine each
point of Ti+! linked to its nearest neighbor in Tit for i = 1,..., к. From every
t stretches a chain with links t = tk+ ъ tk,..., t1 joining it to a point in Tt.
Q = point of Г,
О = point of т2
• = point of T3

VII.2. Chaining 143
The value of the process at t equals its value at tx plus a sum of increments
across the links joining t to tv The error involved in approximating Z{t) by
Z(ti) is bounded, uniformly in t, by
? max\Z(ti+1)-Z(td\.
If 7] contains Nt points, the maximum in the ith summand runs over Ni+1
different increments, each across a link of length at most <5;. The probability
of the summand exceeding щ is bounded by a sum of Ni+x terms, each less
than А(?7г, Si).
)
G) JP\max\Z(t) - Z{tx)\ >r]l+--- + г,Л < ? Л,-+1Д(ъ, 5,)-
It ) i=i
This inequality is useful if we can choose цг, Зь and 7] to make both the right-
hand side and the sum of the O7J small. In that case the maximum of
| Z(s) — Z{t) | over all pairs in Г is, with high probability, close to the maxi-
maximum for pairs taken from the smaller class Tv
When ДО/, <5) = 2 exp(—^r\2jD2d2), a good combination seems to be:
{<5,} decreasing geometrically and {?/,} chosen so that Ы{+1А(г/г, <5;) = 2<5;,
that is,
With these choices the right-hand side of G) is bounded by the tail of the
geometric series ?f 3t, and the sum of the {j/J on the left-hand side can be
approximated by an integral that reflects the rate at which Nt increases as
<5; decreases.
8 Definition. The covering number N(S), or NC, d, T) if there is any risk of
ambiguity, is the size of the smallest <5-net for Г. That is, NC) equals the
smallest m for which there exist points tu...,tm with min; d(t, tt) < 3 for
every t in T. The associated covering integral is
Cd
J{3) = J(S, d,T)= [2 log(iV(MJ/M)]1/2 du for 0<^<l. ?
Jo
The N(uJ, in place of N(u), will allow us to bound maxima over more than
just the nearest-neighbor links from Ti+1 to 7].
If we interpret P as standing for the ^C1(P) or J?2(P) semimetrics on #",
the notation iV\(<5, P, #") and N2C, P, #") used in Chapter II almost agrees
with Definition 8. Here we implicitly restrict tu ..., tm to be points of T.
In Chapter II the approximating functions were allowed to lie outside J5!
They could have been restricted to lie in 2F without seriously affecting any
of the results.
The proof of our main result, the Chaining Lemma, will be slightly more
complicated than indicated above. To achieve the most precise inequality,

144 VII. Central Limit Theorems
we replace nt by a function of the link lengths. And we eliminate a few pesky
details by being fastidious in the construction of the approximating sets
7]. But apart from that, the idea behind the proof is the same.
As you read through the argument please notice that it would also work
if N(-) were replaced throughout by any upper bound and, of course, J(-)
were increased accordingly. This trivial observation will turn out to be most
important for applications; we seldom know the covering numbers exactly,
but we often have upper bounds for them.
9 Chaining Lemma. Let {Z(t): t e T} be a stochastic process whose index
set has a finite covering integral J(-). Suppose there exists a constant D such
that, for all s and t,
A0) IP{|Z(s) - Z(t)\ > rjd(s, t)} < 2 exp(-^rj2/D2) for n > 0.
Then there exists a countable dense subset T* of T such that, for 0 < s < 1,
IP{ |Z(s) - Z(t)| > 26DJ(d(s, 0) far some s, t in T* with d(s, t) < s} < 2s
We can replace T* by T if Z has continuous sample paths.
Proof. Write Щи) for [2 log(N(wJ/t<)]1/2. It increases as и decreases. Set
<5; = s/2' for j = 1, 2,.... Construct 2<5rnets 7] in a special way, to ensure
that 7\ ? T2 ? • • • . (The extra 2 has little effect on the chaining argument.)
Start with any point tx. If possible choose a t2 with d(t2, tt) > 2d1; then
a t3 with d(t3, tj) > 231 and d(t3, t2) > 2дг; and so on. After some tm, with
m no greater than NC^, the process must stop: if m > NC^ then some pair
tt, tj would have to fall into one of the NC^ closed balls of radius <5X that
cover T. Take Tt as the set {tu ..., tm}. Every t in T lies within 23X of at least
one point in Tv
Next choose tm+1, if possible, with d(tm+u t,) > 232 for i < m; then tm+2
with d(tm+2, ti) > 282 for i < m + 1; and so on. When that process stops
we have built Ti up to T2, a 2<52-net of at most NC2) points.
The sets T3, T4,... are constructed in similar fashion. Define T* to be
the union of all the {7]}.
For the chaining argument sketched earlier (for finite T) we bounded the
increment of Z across each link joining a point of Ti+ г to its nearest neighbor
in Tt. This time Ti+1 contains Tt; all the links run between points of 7J+1.
With only an insignificant increase in the probability bound we can increase
the collection of links to cover all pairs in Ti+1, provided we replace the
suggested п{ by a quantity depending on the length of the link. Set
At = {|Z(s) - Z(f)\ > Dd(s, t)HCt) for some s, t in 7]}.
It is a union of at most NCiJ events, each of whose probabilities can be
bounded using A0).
WA, < 2N{3f expE-W,.J] = 2St.
The union of all the {A{}, call it A, has probability at most 2e.

VII.2. Chaining 145
Consider any pair (s, t) in Г* for which d(s, t) < s. Find the n for which
3„ < d(s, t) < 23 „. Because the {7]} expand as i increases, both s and t
belong to some Tm+1 with m > n. With a chain s = sm+1, sm,..., sn link s
to an sn in Т„, choosing each st to be the closest point of Tt to si+u thereby
ensuring that d(si+1,si) < 2dt. Define a chain {tj for t similarly. Break
Z(s) — Z(t) into Z(sn) — Z(tn) plus sums of increments across the links of
the two chains; |Z(s) - Z(t)\ is no greater than
m
|Z(sJ - Z(tn)| + X [|Z(si+1) - Z(s,.)| + |Z(tl+1) - Z(tf)|].
i = n
Both si+ x and s,- belong to Ti+1. On Л • +15
|Z(si+1) - Z(sd\ < Dd(si+1, sdH(8,+ 1) < ЩЯЙ+1).
On Ac, these bounds, together with their companions for (sn, tn) and (ti+1, tt),
allow |Z(s) - Z(t)| to be at most
m
Dd(sn, QHEn) + 2 X 2D8tHEt+1).
i — n
The distance d(sn, tn) is at most
mm m
d(s, 0 + Z <*(*+1, sO + Z rf(f.-+1. *i) < 2^„ + 2 X 25, < 105„.
i = n i = и i = n
Also E,- = 4(<5,+! - ^i+2). Thus, on Ac,
m
|Z(s) - Z(t)| < ЮГ»^ЯЯ) + 4D Z4(<5l+i -
m л
X {<5i
i = n J
16Z> X {<5i+2 < u < 3i+1}H(u) du
<26DJ(d(s,t)).
If Z has continuous sample paths, the inequality with Г* replaced by Г is
the limiting case of the inequalities for Г* with s replaced by s + n~l. ?
Often we will apply the inequality from the Chaining Lemma in the weaker
form:
IP{ |Z(s) - Z(t)| > 26DJ(s) for some s, t in Г* with d(s, t) < &} < 2s.
A direct derivation of the weaker inequality would be slightly simpler than
the proof of the lemma. But there are applications where the stronger result
is needed.

146 VII. Central Limit Theorems
11 Example. Brownian motion on [0, 1], you will recall, is a stochastic
process {B(-,t):O<t < 1} with continuous sample paths, independent
increments, B(-, 0) = 0, and B(t) - B(s) distributed JV(O, t - s) for t > s.
If we measure distances between points of [0, 1] in a strange way, the Chain-
Chaining Lemma will give a so-called modulus of continuity for the sample paths
of Я
The normal distribution has tails that decrease exponentially fast: from
Appendix B,
JP{\B(t) - B(s)\ > rj} <
Define a new metric on [0, 1] by setting d(s, t) = \s — r|1/2. Then В satisfies
inequality A0) with D = 1. The covering number NE, d, [0, 1]) is smaller
than 23~2, which gives the bound
C3
JC) < [2 log 4 + 10 log(l/M)]1/2 du
•>o
< B log 4I/2<5 + УШ[к^A/<5)Г1/2
Jo
g(/) du
< 4<5[log(l/<5)]1/2 for 5 small enough.
From the Chaining Lemma,
JP{|B(s) - B(t)\ > 26J(d(s, 0) for some pair with d(s, t) < 6} < Ж
The event appearing on the left-hand side gets smaller as 3 decreases. Let
5 I 0. Conclude that for almost all со,
\B(co,s) - B(co, 01 < 741(s - 01og|s -
I 1/2
for \s — t\1/2 < d(co). Except for the unimportant factor of 74, this is the
best modulus possible (McKean 1969, Section 1.6). ?
VII.3. Gaussian Processes
In Section 5 we shall generalize the Empirical Central Limit Theorem of
Chapter V to empirical processes indexed by classes of functions. The limit
processes will be analogues of the brownian bridge, gaussian processes with
sample paths continuous in an appropriate sense. Even though existence of
the limits will be guaranteed by the method of proof, it is no waste of effort
if we devote a few pages here to a direct construction, which makes non-trivial
application of the Chaining Lemma. The direct argument tells us more about
the sample path properties of the gaussian processes.
We start with analogues of brownian motion. The argument will extend
an idea already touched on in Example 11.

VII.3. Gaussian Processes 147
Look at brownian motion in a different way. Regard it as a stochastic
process indexed by the class of indicator functions
#"= {[O,t]:O<r<: 1}.
The со variance P[B(-, f)B(-,g)~] can then be written as P(fg), where P =
Uniform[0, 1]. The process maps the subset ^ of „S?2(P) into the space
=S?2 (IP) in such a way that inner products are preserved. From this perspec-
perspective it becomes more natural to characterize the sample path property as
continuity with respect to the =S?2 (P) seminorm pP on J5". Notice that
Pp(I[0, s] - [0, t]|) = (ЛЕО, s] - [0, t]|2I/2 = Is - f|1/2.
It is no accident that we used the same distance function in Example 11.
The new notion of sample path continuity also makes sense for stochastic
processes indexed by subclasses of other =S?2(P) spaces, for probability
measures different from Uniform[0, 1].
12 Definition. Let #" be a class of measurable functions on a set S with a
cr-field supporting a probability measure P. Suppose ZF is contained in
5?2{P). A P-motion is a stochastic process {BP(-,f):f e #"} indexed by !F
for which:
(i) BP has joint normal finite-dimensional distributions with zero means
and covariance JP[BP(-,f)BP(-, g)~\ = P(fg);
(ii) each sample path BP(co, ¦) is bounded and uniformly continuous with
respect to the ?C2(P) seminorm pP(-) on !F.
The name does not quite fit unless one reads "Uniform[0, 1]" as
" brownian," but it is easy to remember. The uniform continuity and bounded-
ness that crept into the definition come automatically for brownian motion
on the compact interval [0,1]- In general #" need not be a compact subset
of if2 (P), although it must be totally bounded if it is to index a P-motion
(Problem 3); uniformly continuous functions on a totally bounded OF must
be bounded.
We seek conditions on P and 3F for existence of the P-motion. The
Chaining Lemma will give us much more: a bound on the increments of the
process in terms of the covering integral
JE) = J(d, pP, P)= Г [2 log(JV(M, pP, iF)»]1/2 du.
Jo
Finiteness of J(-) will guarantee existence of BP.
13 Theorem. Let ^ be a subset of =S?2(P) with a finite covering integral,
J(-), under the i?2(P) seminorm pP(-)- There exists a P-motion, BP, indexed
by !F,for which
| BP(co, f) - BP(co, g) | < 26J(pP(f - g)) if pP(f - g) < d(co),
with 5(co) finite for every оз.

148 VII. Central Limit Theorems
Proof. Construct the process first on a countable dense subset J^ = {./}}
of J*. Such a subset exists because !F has a finite E-net for each <5 > 0 (other-
(otherwise J could not be finite). Apply the Gram-Schmidt procedure to J^,
generating an orthonormal sequence of functions {uj}. Each / in J^ is a
finite linear combination Yj (.uj>f)uj because {ult... ,un} spans the same
subspace as {/j,... ,/„}. Here, temporarily, (u, /> denotes the inner product
in i?2 (P): <м, /> = Р(м/). Choose a probability space (Д <f, IP) supporting
a sequence {C/j} of independent iV@, 1) random variables. For each /
in J^ and со in ?2 define
The sum converges for every со, because only finitely many of the coefficients
<Mj-,/> are non-zero. The finite-dimensional distributions of Z are joint
normal with zero means and the desired covariances:
as required for a P-motion.
The =S?2(P) seminorm is tailor-made for the chaining argument. Because
IP{ | JV(O, 1)| > x} < 2 exp(-^x2) for x > 0 (Appendix B),
- Z(g)\ > n) <
Apply the Chaining Lemma to the process Z on J^. Because J^ itself is
countable we may as well assume the countable dense subset promised by
the Lemma coincides with J^. Let G{5) denote the set of со for which
\Z(f) - Z(g) | > 26J(pP(f - g)) for some pair with pP(f - g) < д.
Then IPG(<5) < 2<5 for every 3 > 0. As 5 decreases, G(<5) contracts to a
negligible set G@). For each со not in G@),
| Z(co, /) - Z(co, g) | < 26J(pP(/ - g)) if pP(f - g) < <5(co).
Reduce D. to Q\G@). Then each sample path Z(co, •) is uniformly continuous.
Extend it from the dense J^, up to a uniformly continuous function on the
whole of 2F. The extension preserves the bound on the increments, because
both J and pP are continuous. Complete the proof by checking that the
resulting process has the finite dimensional distributions of a P-motion. ?
For brownian motion, continuity of sample paths in the usual sense
coincides with continuity in the pP sense, with P = Uniform[0, 1]. The
P-motion processes for different P measures on [0, 1] (or on Ж, or on Ж*)

VII.4. Random Covering Numbers 149
do not necessarily have the same property. If P has an atom of mass a at a
point t0 > the sample paths of the BP indexed by intervals {[0, ?]} will all have
a jump at t0. The size of the jump will be N@, a) distributed independently
of all increments that don't involve a pair of intervals bracketing t0. All
sample paths are cadlag in the usual sense.
We encountered similar behavior in the gaussian limit processes for the
Empirical Central Limit Theorem (V.ll) on the real line. We represented
the limit as U(F(-)), with U a brownian bridge and F the distribution function
for the sampling measure P. We can also manufacture the limit process
directly from the P-motion, in much the same way that we get a brownian
bridge from brownian motion. Denote by 1 the function taking the constant
value one. Then the process obtained from BP by setting
EP(;f) = BP(-,f)-(Pf)BP(;l),
is a gaussian process analogous to the brownian bridge.
14 Definition. Call a stochastic process EP indexed by a subclass !F of
i?2(P) a P-bridge over <F if
(i) EP has joint normal finite-dimensional distributions with zero means
and covariance IP[?P(-,/)?P(-, g)] = P(fg) - (Pf)(Pg);
(ii) each sample path EP(a>, •) is bounded and uniformly continuous with
respect to the i?2(P) seminorm on J^ ?
The P-bridge will return in Section 5 as the limit in a central limit theorem
for empirical processes indexed by a class of functions.
VII.4. Random Covering Numbers
The two methods developed in Chapter II, for proving uniform strong laws
of large numbers, can be adapted to the task of proving the maximal in-
inequalities that lurk behind the stochastic equicontinuity conditions in-
introduced in Section 1. The second method, the one based on symmetrization
of the empirical measure, lends itself more readily to the new purpose because
it is the easier to upgrade by means of a chaining argument. We have the
tools for controlling the rate at which covering numbers grow; we have a
clean exponential bound for the conditional distribution of the increments
of the symmetrized process. The introduction of chaining into the first
method is complicated by a messier exponential bound. Section 6 will tackle
that problem.
Recall that the symmetrization method relates Pn — P to the random
signed measure P° that puts mass + n~* at each of ?u ..., ?„, the signs being
allocated independently plus or minus, each with probability \. For central

150 VII. Central Limit Theorems
limit theorem calculations it is neater to work with the symmetrized empirical
process E° = n1/2P°. Hoeffding's Inequality (Appendix B) gives the clean
exponential bound for E° conditional on everything but the random signs.
For each fixed function /,
i= 1
That is, if distances between functions are measured using the =S?2(Pn)
seminorm then tail probabilities of E° under IP(-|i;) satisfy the exponential
bound required by the Chaining Lemma, with D = 1. For the purposes of
the chaining argument, E° will behave very much like the gaussian process
BP of Section 3, except that the bound involves the random covering number
calculated using the =S?2(Pn) seminorm. Write
J2C, Pn,P)= f [2 log(iV2(M, Р„, #-)»]1/2 du
Jo
for the corresponding covering integral.
Stochastic equicontinuity of the empirical processes {?„} at a function
/0 in J5" means roughly that, with high probability and for all n large enough,
Enf — Enf0\ should be uniformly small for all/close enough to/0. Here
closeness should be measured by the i?2(P) seminorm pP. With the Chaining
Lemma in hand we can just as easily check for what seems a stronger property
—but if you look carefully you'll see that it's equivalent to stochastic equi-
equicontinuity for a larger class of functions. Of course we need 3F to be per-
permissible (Appendix C).
15 Equicontinuity Lemma. Let ^ be a permissible class of functions with
envelope F in „S?2(P). Suppose the random covering numbers satisfy the uni-
uniformity condition: for each n > 0 and г > 0 there exists а у > 0 such that
A6) limsup IP{ J2(y, Pn,^)> n) < e.
n
Then there exists a S > Ofor which
limsup HNsup \En(f - g)\ > n\ < s,
I is] J
where И = {(/, g):f,ge& and Pp(f - g) < 6).
Proof. The idea will be: replace En by the symmetrized process E°; replace
[<5] by a random analogue,
<2<5> = {(/, g):f,geP and (Р„(/ - ^JI'2 < 23};
then apply the Chaining Lemma for the conditional distributions IP(-|^).

VII.4. Random Covering Numbers 151
For fixed / and g in [<5] we have var(En(f- g)) = P(f- gJ < S2.
Argue as in the First and Second Symmetrization steps of Section II.3:
when 3 is small enough,
pjsup \En(f - g)\ > Л < 4pjsup |?°(/ - g)\ >
That gets rid of Е„.
If with probability tending to one the class <2<5> contains [<5], we will
waste only a tiny bit of probability in replacing [<5] by <2<5>:
pj
jsup |?°(/ - 0)| > |Л < pjsup \E°(f -
I M J 1
It would suffice if we showed sup^2 \Pnh — Ph\ -»0 almost surely, where
J^ = {(/ — gJ: /, g e #"}. This follows from Theorem 11.24 because the
condition A6) implies
A7) log N,C, Р„, JD = op(n) for each 8 > 0.
Problems 5 and 6 provides the details behind A7). That gets rid of [<5].
The reason we needed to replace [<5] by <2<5> becomes evident when we
condition on \. Write />„(•) for the j??2(Pn) seminorm. We have no direct
control over />„(/ — g) for functions in [<5]; but for <2<5>, whose members
are determined as soon as \ is specified, pn(f — 9) < 2<5. Apply the Chaining
Lemma.
IP{I W - 9)\ > 26J2BS, Pn, &) for some (/, g) in <2«5>*|^} < 43.
The countable dense subclass <2<5>* can be replaced by <2<5> itself, because
E° is a continuous function on IF for each fixed \:
\E°n(f -g)\< n'l2Pn\f -g\< n^2pn(f - g).
Integrate out over \, then choose b so that both P{26J2B<5, Pn, #") > %r\}
and 4E are small. ?
Now that we have the maximal inequalities for empirical processes, we
can take up again the central limit theorems for statistics defined by minimiza-
minimization of a random process, the topic we left hanging at the end of Section 1.
Recall that we need the processes {Enr(-, t)}, which is indexed by the
class 01 = {r(-,t): t e T} of remainder functions, stochastically equi-
continuous at t0. If /(•, t) is differentiable in quadratic mean at t0, it will
suffice if we find a neighborhood V = {r e @:pP(r) < d} for which
limsup P<sup |?nr| >
I v
Notice that V ? {rx - r2:pP(r1 — r2) < d}, because r(-, tQ) = 0 by defini-
definition. Thus we may check for stochastic equicontinuity by showing:
(i) The class & has an envelope belonging to ??2(P).

152 VII. Central Limit Theorems
(ii) /(•, t) is differentiable in quadratic mean at t0. From (i), this follows by
dominated convergence if r( •, t) -» 0 almost surely [P] as t -»t0.
(iii) Condition A6) is satisfied for & = M.
These three conditions place constraints on the class {/(•, 0}-
18 Example. The spatial median of a bivariate distribution P is the value
of в that minimizes М(в) = P\x — в\. Estimate it by the в„ that minimizes
Mn(&) = Pn | x — в |. Example 11.26 gave conditions for consistency of such
an estimator. Those conditions apply when P equals the symmetric normal
N@,12), a pleasant distribution to work with because explicit values can be
calculated for all the quantities connected with the asymptotics for {$„}.
For this P, convexity and symmetry force M(-) to have its unique minimum
at zero, so в„ converges almost surely to zero. Theorem 5 will produce the
central limit theorem,
п^в„ ~> ЛГ(О, D/n)I2),
after we check its non-obvious conditions (ii), (iv), and (v).
Change variables to reexpress M{&) in a form that makes it easier to find
derivatives.
M@) =
1 (\
Differentiate under the integral sign.
M'@) = 0, of course,
M"@) = Bл)-1 [\х\(хх' - /
A random vector X with a N@,12) distribution has the factorization X = RU
where R2 = \X\2 has a ^distribution independent of the random unit
vector U = X/\X\, which is uniformly distributed around the rim of the
unit circle.
V = M"@) = JP(R3UU' - RI2)
= JPR3JPUU' -(JPR)I2
= (n/8)ll2I2.
Condition (ii) wasn't so hard to check.
To figure out the A(x) that should appear in the linear approximation
carry out the usual pointwise differentiation. That gives A(x) = x/\x\ for
x/0. Set A@) = 0, for completeness. The components of A( •) all belong
to SC2(P). Indeed, PAA' = WUU' = \l2. That's condition (iv) taken care of.

VII.4. Random Covering Numbers 153
Now comes the hard part—or at least it would be hard if we hadn't
already proved the Equicontinuity Lemma. Start by checking that the class
0t of remainder functions r( •, в) has an envelope in J?2(P). For в ф О,
< 4.
It follows that | • — в | is differentiable in quadratic mean at в = 0. We have
only to verify condition A6) of the Equicontinuity Lemma to complete the
proof of stochastic equicontinuity.
Each r(-, в), for в ф 0, can be broken into a difference of two bounded
functions:
Write 0tx and 0t2 for the corresponding classes of functions.
The linear space spanned by 8&^ has finite dimension; the graphs have
polynomial discrimination, by Lemma 11.28; the covering numbers
N2(u, Pn, ^j) are bounded by a polynomial Au~w in u~l, with A and W
not depending on Pn (Lemma 11.36).
The graphs of functions in 8%г also have polynomial discrimination,
because {(x, t):\x — в\ — \x\ > \0\t} can be written as
{-2e'x + \e\2>2\e\\x\t + \e\2t2}r>{\x\ + \e\t>oi}vj{\x\ + \e\t < o}.
This is built up from sets of the form {g > 0} with g in the finite-dimensional
vector space of functions
The covering numbers for 0l2 are also uniformly bounded by a polynomial
in и.
These two polynomial bounds combine (Problem 11.18) to give a similar
uniform bound for the covering numbers of 01, which amply suffices for the
Equicontinuity Lemma: for each ц > 0 there exists а у such that
•^гG> Pn> 8%) ^ Л f°r every Pn. The conditions of Theorem 5 are all satisfied;
the central limit theorem for {в„} is established. ?
19 Example. Independent observations are sampled from a distribution P
on the real line. The optimal 2-means cluster centers an, bn minimize
W(a, b, Pn) = Pnfa,b, where /а>ь(х) = |х-а|2л|х-Ь|2. In Examples
II.4 and 11.29 we found conditions under which а„, Ь„ converge almost surely
to the centers a*, b* that minimize W(a, b, P) = Pfuyb. Theorem 5 refines
the result to a central limit theorem.

154 VII. Central Limit Theorems
Keep the calculations simple by taking P as the Uniform[0, 1] distribu-
distribution. The argument could be extended to other P distributions, higher
dimensions, and more clusters, at the cost of more cumbersome notation
and the imposition of a few extra regularity conditions.
The parameter set consists of all pairs (a, b) with 0 < a < b < 1. For the
Uniform[0, 1] distribution direct calculation gives explicitly the values
a*, b* that minimize W(a, b, P).
W(a, b, P) = J{0 < x < i(a + b)}\x - a\2
+ ii(a + b)<x< l}|x - b\2dx
= ia3 + i(l - bf +Ub~ aK-
Minimizing values: a* = ?, b* = f, as you might expect. Near these optimal
centers,
W(a, b,P) = & + |(fl - iJ - |(fl - № - |) + Kb - IJ
+ cubic terms
The function fab(x) has partial derivatives with respect to a and b except
when x = j(a + b). That suggests for A(x) the two components
Aa(x)= -2(x-i){0<x<|},
Ab(x)= -2(x-M<x<l}.
Both functions belong to J?2(P). The remainder function is denned by
subtraction of the linear approximation from/ai). Simplify the notation by
writing s — a — \, t = b — f; change fab to gst and r(-, a, b) to R(-, s, t).
(\s\ + \t\)R(x, s, t) = gSil(x) - jo,oW + 2s(x - i){0 < x < |}
+ 2t(x - M < x < 1}
= a piecewise. linear function of x.
envelope
s~R(;s,

VII.5. Empirical Central Limit Theorems 155
The remainder functions are bounded by a fixed envelope in i?2(P).
|K(x, s, 01 < [|(x - i - sJ - (x - iJ| + |(x - ! - tf - (x - |J
<4|x -i| + 4\x -|| + 2.
Deduce differentiability in quadratic mean of/„ b at the optimal centers.
The graphs of the piecewise linear functions in 01 have only polynomial
discrimination, and they have an envelope in ?C2(P). Lemma 11.36 gives a
uniform bound on the covering numbers that ensures J2(y, Рп,0?)<ц for
every Р„ if у is chosen small enough. The Equicontinuity Lemma applies;
the processes {Enr(-, a, b)} are stochastically equicontinuous at (\, f); the
optimal centers obey a central limit theorem
{n'l\an - |), n"\bn - I)) ~» N@, F^PCAAOF),
where
[! 1} Ч? Д
VII.5. Empirical Central Limit Theorems
As random elements of D[0, 1], the uniform empirical processes {Un}
converge in distribution to a brownian bridge. More generally, the empirical
processes {?„} for observations from an arbitrary distribution on the real
line converge in distribution, as random elements of D\_— oo, oo], to a
gaussian process obtained by stretching out the brownian bridge. Both
results treat the empirical measure as a process indexed by intervals of the
real line. In this section we shall generalize these results to empirical measures
indexed by classes of functions.
Convergence in distribution, as we have denned it, deals with random
elements of metric spaces. Once we leave the safety of intervals on the real
line it becomes quite a problem to decide what metric space of functions
empirical process sample paths should belong to. Without the natural
ordering of the intervals, it is difficult to find a completely satisfactory sub-
substitute for the cadlag property; without the simplification of cadlag sample
paths, empirical processes run straight into the measure-theoretic complica-
complications we have so carefully been avoiding. Appendix С describes one way of
overcoming these complications. A class of functions satisfying the regularity
conditions described there is said to be permissible. Most classes that arise
from specific applications are permissible.
Let !F be a pointwise bounded, permissible class of functions for which
sup^r \Pf\ < oo. The empirical processes {?„} define bounded functions on
SF\ their sample paths belong to the space ЭС of all bounded, real functions

156 VII. Central Limit Theorems
on 3F. To avoid some of the confusion that might be caused by the hierarchy
of functions on spaces of functions on spaces of functions, call members of
Ж functionals. Equip Ж with the metric generated by the uniform norm,
||x|| = sup^ |x(/)|. Be careful not to confuse the norm || • || onf with the
У2(Р) seminorm pP{-) on &'.
The choice of <7-field for Ж is tied up with the measurability problems
handled in Appendix C. We need it small enough to make En a measurable
random element of Ж, but large enough to support a rich supply of measur-
measurable, continuous functions. The limit distributions must concentrate on sets
of completely regular points (Section IV.2). That suggests that the «r-field
should at least contain the balls centered at the functionals that are uniformly
continuous for the pP seminorm.
20 Definition. Write C(#", P) for the set of all functionals x(-) in Ж that are
uniformly continuous with respect to the i?2(P) seminorm on !F. That is, to
each г > 0 there should exist a b > 0 for which | x(f) — x(g) | < s whenever
Pp(f - 9) < 8- Define 38P as the smallest <7-field on Ж that: (i) contains all
the closed balls with centers in C(#", P); (ii) makes all the finite-dimensional
projections measurable. ?
Notice that CijF, P) is complete, because it is a closed subset of the
complete metric space (Ж, || • ||). Notice also that <%p depends on the sampling
distribution P. Each Е„ is a ^-measurable random element of Ж under mild
regularity conditions (Appendix C).
The finite-dimensional projections of {?„} (the fidis) converge in distri-
distribution to the fidis of EP, the P-bridge process over ?F (Definition 14). Of
course some doubts arise over the existence of EP; getting a version with
sample paths in C(#", P) is no simple matter, as we saw in Section 3. Happily,
the questions of existence and convergence are both taken care of by a single
property of the empirical processes, uniform tightness.
Recall from Section IV.5 that uniform tightness for {?„} requires existence
of a compact set Ke of completely regular points in Ж such that
liminfP{EneG} > 1 -e
for every open, ^-measurable set G containing KE. From uniform tightness
we would get a subsequence of {?„} that converged in distribution to a tight
borel measure on 3C. If C(#; P) contained each Ke, the limit would con-
concentrate in C{ZF, P). Its fidis would identify it (Problem 8) as the P-bridge
over #".
Uniform tightness of {?„} would also imply convergence of the whole
sequence to EP. For if {Wh(En)} did not converge to JPh(EP) for some
bounded, continuous, ^p-measurable lionf then
\JPh(En) - TPh(EP)\ > s infinitely often
for some s > 0. The subsequence along which the inequality held would
also be uniformly tight; it would have a sub-subsequence converging to a

VII.5. Empirical Central Limit Theorems 157
process whose fidis still identified it as a P-bridge. That would give a con-
contradiction : along the sub-subsequence, {JPh(En)} would converge to JPh(EP)
without ever getting closer than s.
21 Theorem. Let SF be a pointwise bounded, totally bounded, permissible
subset ofJ?2(P). If for each n > 0 and s > 0 there exists a3 > Ofor which
B2) limsup IP {sup | En{f - g)\ > Л < e,
I № )
where [<5] = {(/, g): f, g e !F and pP(f — g) < 3}, then En ~> EP as random
elements of 3E. The limit P-bridge process EP is a tight, gaussian random
element of 3C whose sample paths all belong to C(!F, P).
Proof. Check the uniform tightness. Given s > 0 find a compact subset К
of C(#", P) with liminf Р{?„ ? G} > 1 — s for every open, ^p-measurable
G containing K. Construct К as an intersection of sets Du D2,..., where Dk
is a finite union of closed balls of radius k~1 centered at points of C{?F, P).
Every functional in Dk will lie uniformly within k~1 of a member of C(JF, P);
every functional in К will therefore belong to C(#", P), being a uniform limit
of functionals in C(#", P). The proof follows closely the ideas used in Theorem
V.16 to prove existence of the brownian bridge. Only the continuous inter-
interpolation between values taken at a finite grid requires modification.
Fix n > 0 and г > 0 for the moment, and choose 3 according to B2).
Invoke the total boundedness assumption on !F to find a finite subclass #"г =
{/i> •••>/»} °f &" such that each / in J* has an /* in J^ for which
Pplf ~ /*) < ?¦
We need to find a finite collection of closed balls in 9C, each with a specified
small radius and centered on a functional in C(?F, P), such that En lies with
specified large probability in the union of the balls. Construct the centers
for the balls by a continuous interpolation between the values taken on at
each f in J^ by realizations of an En.
For each /(, the sequence of random variables {En(f)} converges in
distribution. There exists a constant С for which
limsup IP< max | Elf) | > C> < s.
I ' J
Define
D.n = < со:sup |En(a>,f - g)\ < n and max |Еп(ю,/f)| < С
By the choice of 3 and С we ensure that liminf PQn > 1 - 2s. Write Sn for
the bounded set of all points in IRra with coordinates ?„(со,/,) for an oo in Qn,
and S for the union of the {?„}. If r belongs to S then \r(\ < С for each
coordinate and \rt — rj\ < 2ц whenever there exists an / in !F with
Pp(f ~ f) <3 and pP(f - /•) < д.

158 VII. Central Limit Theorems
Construct weight functions A,(-) on #" by
Mf) = »«(/)/D>i(/) + ¦¦¦ + vm(f)l
Each vt(-) is a uniformly continuous function (under the pP seminorm) that
vanishes outside a ball of radius д about/;. For every / there is at least one
/, its/*, for which vt(f) > \; the denominator in the denning quotient for
A,- is never less than \. The A;(-) are non-negative, uniformly continuous
functions that sum to one everywhere in 3F.
For г in S define an interpolation function by x(f, r) = Yj=1 A,-(/)r;.
Each x(-, r) belongs to C(J^ P). If pP(f — /) < «5, all the rt values corre-
corresponding to non-zero {A;(/)} satisfy \rt — rj\ < 2ц. As a convex combination
of these values, x(f, r) must also lie within 2r\ of r,-. An En(co, •) corresponding
to an со in D,n has a similar property:
\En(co,f) -En{co,f})\ <n when pP(f - f}) < д.
Thus, if (o belongs to D,n and |Е„(а>, fj) — rj\ <r\ for every j then
Choose from the bounded set S a finite subset (r(l),..., r(p)} for which
min max | rj — rj(k) | < ц for every г in S.
к j
Abbreviate x{-, r(/c)) to xk(-), for к = 1,..., p. Then from what we have just
proved
min sup |En(co,/) - xk(f)\ < 4rj,
к &
whenever со belongs to D.n. If we set D equal to the union of the balls
B(Xl,4ri),...,B(xp,4ri),tben
liminfff {?„??>} > 1 - 2e.
Repeat the argument with щ replaced by {4k)'г and s replaced by s/2k+1, for
к = 1, 2,..., to get each of the Dk sets promised at the start of the proof:
Dk is a finite union of closed balls of radius k~г and
liminf JP{EneDk} > 1 - г/2".
The remainder of the proof follows Theorem V.I6 almost exactly.
The intersection of the sets Du D2,... is a closed and totally bounded
subset of the complete metric space C(#", P); it defines the sought-after
compact K. The open G contains some finite intersection D1 n • • • n Dk. If
not, there would exist a sequence Y = {yk} with yk in Gc n Dx n • • • n Dk
for each k. Some subsequence Y' of Y would lie within one of the balls making

VII.5. Empirical Central Limit Theorems 159
up Dx; some subsequence Y" of Y' would lie within one of the balls making
up D2; and so on. The sequence constructed by taking the first member of
Y', the second member of Y", and so on, would be Cauchy; it would con-
converge to a point у (9С is complete) belonging to all the closed sets
{Gc n Dt n • • • n Dk}. This would contradict
f) Gc n Dt n ¦ ¦ ¦ n Dk = Gc n К = 0.
*=i
Complete the uniform tightness proof by noting that
liminf JP{En e G} > liminf Р{?„ e Dt n • • • n ?>J > 1 - s
if G contains D-^гл ¦ ¦ ¦ n Dk. ?
Condition B2) points the way towards mass production of empirical
central limit theorems. The Chaining Lemma makes it easy. For example,
from the Equicontinuity Lemma of Section 4, we get conditions on the
random covering numbers under which {?„} converges in distribution. The
next section will describe other sufficient conditions.
23 Example. We left unfinished back in Example V.I5 a limit problem for
goodness-of-fit statistics with estimated parameters. The empirical processes
were indexed by intervals of the real line; the estimators took the form
р
1 = 1
for an L with PL = 0, PL2 < oo. We wanted to find the limiting distribution
of
Dn = sup | ?B(-oo51] - nll2@n - 0o)A(t)| + o,(l)
t
the A(-) being a fixed cadlag function on [ — oo, oo].
Set #¦ equal to {L} и {(— oo, t]: — oo < t < oo}. Express Dn in terms of
a function on the corresponding 9C. Define
H(x) = sup \x((- oo, t]) - x(L)A(t)\.
t
Guard against measurability evils by restricting the supremum to rational
t values: it makes no difference to ?„, A, or the limiting P-bridge. Clearly
H(-) is a continuous function on 9C.
You can check condition B2) by means of the Equicontinuity Lemma.
The intervals have only polynomial discrimination; the inclusion of the
single extra function L has a barely perceptible effect on the covering num-
numbers. Deduce that Dn = Н(Е„) + oJl) -» H(EP). ?

160 VII. Central Limit Theorems
VII.6. Restricted Chaining
In this section the method of the Chaining Lemma is modified to develop
another approach to empirical central limit theorems. The arguments for
three representative examples are sketched. You might want to skip over the
section at the first reading.
The chaining arguments in Section 2 assumed that the increments of the
stochastic process had exponentially decreasing tail probabilities,
B4) P{|Z(s) - Z(t)| > n} < 2 exp(--^2/?>2<52) if Ф, 0 ^ <5-
The inequality held for every n > 0 and 3 > 0. We shall carry the argument
further to cover processes, such as the empirical process, for which the
inequality holds only in a restricted region 0> of (n, 3) pairs.
Suppose / is a bounded function, \f\ < C. Let S2 be an upper bound
for the variance a2(f) = Pf2 — (PfJ- Bennett's Inequality (Appendix B)
gives
B5) JP{\EJ\>t,} =
>
Л/2
2 expl-i(tj2/d2)BBCr,/(nl'2d2))-}
iV/<52) if S2h >2C/(nll2B-\A))
for any fixed X between 0 and 1, because B(-) is a continuous, decreasing
function, with B@) = 1.
The restricted range complicates the task of proving maximal inequalities
for the stochastic process {Z(t): t e T}. We can chain as in Section 2 as long
as the (nh 3t) pairs remain within SP, but eventually the chain will hit the
boundary of 3?, when the links are getting down to lengths less than some
tiny a, say. That leaves the problem of how to bound increments of Z across
little links from points in T to their nearest neighbors in an a-net for T.
Remember the abbreviations NC), for the covering number NC, d, T),
and JC), for the covering integral
Cd
JC, d,T)= [2 log(WMJ/")]1/ du.
=
Jo
The chaining argument will work for maximal deviations down to about
J(a). That explains the constraint J(a) < y/\2D in the next theorem. The
other constraints on a and у are cosmetic.
26 Theorem. Let {Z(t):teT} be a stochastic process that satisfies the
exponential inequality B4)for every ц > 0 and д > 0 with ё > ar]il2,for some
constant a. Suppose T has a finite covering integral J(-). Let T(a) be an a-net
(containing N(a) points) for T; let tx be the closest point in T(a) to t; and let

VII.6. Restricted Chaining 161
[<5] denote the set of pairs (s, t) with d{s, t) < 3. Given s > 0 and у > 0 there
exists a 3 > 0, depending on s, y, and J(-), for which
pjsup |Z(s) - Z(f)| > 5yl < 2г + pjsup \Z(t) - Z{Q\ >
(. и J l т
provided oc <\sandy < 144 and J(oc) < тт{у/12Д 3/D}.
Proof. The argument is similar to the one used for the Chaining Lemma.
Write H(u) for [2 log(W(uJ/u)]1/2, as before. Choose the largest 3 for which
3 < ^s and JC) < y/l2D. The assumptions about a ensure 3 > a. Find the
integer к for which 3 < 3*a < 3<5 then define
3t = 3*-;a and nt = ОЗ{Н(ё1+1) for г = 0,..., /с.
Notice that ^! < «5 < c>0 and ^ = a. Also
i = 0
¦^ E \{3i+2<u<3i+l}H{u)du
i = 0 J
< у because J{31) < J{3) < yj\2D.
Choose <5rnets Tt containing NCi) points, making sure that Tk = T(<x). Link
each t to a t0 in To through a chain of points,
with tt being the closest point of 7] to ti+ v By this construction, d(ti+ u tj) < 3t.
The smallest value of the ratios {3?/п(}, for i = 0,..., к — 1, occurs at
i' = к — 1; all the ratios are greater than
3a/DH(a) > 3a2/DJ(a) > a2.
The 07;, 3t) pairs all belong to the region in which the exponential inequality
B4) holds. Apply the inequality for increments across links of the chains.
pjmax |Z(ta) - Z(to)\ > y\ < ? pjmax \Z(ti+1) - Z(t,)\ > m)
I T(a) J i = 0 lTi+1
k-1
i = 0
00
i+1) exp[-\og(N(8t+ tf/3^ J]
i = 0
Notice how one of the N(<51+ j) factors was wasted; both factors will be needed
later. The last series sums to less than s because «5, < ^г and the {3t} decrease
geometrically.

162
VII. Central Limit Theorems
¦T(a)
Join each (s, i) pair in [<5] by two chains leading up to To plus a link
between s0 and t0.
sup |Z(s) - Z(t)\ < 2 sup |Z@ - Z(OI + 2 max|Z(tJ - Z(to)|
[й] Г T(a)
+ SUp |Z(SO) - Z(to)\.
Partition the 5y correspondingly.
fsup|Z(s)-Z(t)|>:
и
The distance between the s0 and t0 of each pair appearing in the last term
is less than
d(s0, sj + d(sa, s) + d(s, t) + d(t, tj + d(tx, t0)
i=0 i = 0
< 3<5O + 2a + д
< 12^.
There are at most N(d0J such pairs. The exponential inequality holds for
each pair, because 12<5/(ay1/2) > I2y~112 > 1.
FJsup|Z(so)-Z(to)I >?}
I [S]
< N(S0J2 exp(-$y2/144Dzd2) because d(s0, t0) < 123
< 25 exp[log(iV(<5J/<5) - i/A44^2] because N(d0) < NC)
< 2д ехр[^-2(Я(<5J<52 - 72/144?>2)]
< 25 because HC)d < JC) < y/l2D
<s. ?

VII.6. Restricted Chaining 163
Theorem 21 states a sufficient condition for empirical processes indexed
by a pointwise bounded, totally bounded, permissible subclass #" of SC2(P)
to converge in distribution to a P-bridge: given ц > 0 there exists а б > 0
for which
limsup P<sup \En(f - g) \ > rj> < s.
For a permissible class of bounded functions, say 0 < / < 1, any condition
implying finiteness of Nt(-, P, #") or N2(-, P, #") will take care of the total
boundedness. Finiteness of a covering integral will allow us to apply Theorem
26, leaving only a supremum over the class Ж = {/ — fa :/?#"} of little
links. It will then suffice to prove sup^ \Enh\ = op(l) to get the empirical
central limit theorem. Notice that a, and hence Ж, will depend on n. The
next three examples sketch typical methods for handling Ж.
27 Example. Equip #" with the semimetric d(J, g) = (P\f - g\I12- (This
is the SC2(P) seminorm applied to the function |/ — g\112.) The square root
ensures that the variance a2(f — g) is less than d(f, gJ. If we take X = \,
the exponential bound B5) becomes, for d(f, g) < 3,
Щ\ЕЛ -g)\>r,}<2 ехр(-^2/<52) if <52A? > гДп1'2!*-'&)).
That is, D = 2 and a = B/B~ 1(i))ll2n~1'4 for Theorem 26.
The covering numbers for d(-, •) are closely related to the S?l(P) covering
numbers: in terms of the covering integral,
rs
B8) J(8, d,&)= [2 log(N1(M2, P, JF)»]1/2 du for 0 < 3 < 1.
If J is finite, Theorem 26 can chain down to leave a class Ж of little links with
|/г | < 1 and P | h | < a2. If we add to this the condition
B9) log N^cn-1/2, Рп,Ж) = ор(п112) for each о О,
the empirical central limit theorem will hold.
The methods of Section II.6 work for the class Ж112 = {\Щ112:кеЖ}.
Notice that
N2C, Pn, Ж1/2) < N.id2, Р„, Ж)
because Рп(\^\1/2 - \K\mf < PB|*i - h2\. From Lemma 11.33,
C0) p|sup(Pn|/i|I/2 > 8al < 4P[N1(a2, Pn, Ж) exp(-na2) л 1]
= 4P[exp(log Nt(a2, Р„, Ж) - ш2) л 1]
-»0 by B9).

164 VII. Central Limit Theorems
Symmetrize. For n large enough,
pisup|?n/i| > 4y\ < 4p|sup|?°/i| > yi.
I x J I ж J
Condition on \. Cover Ж by M = N^yn'112, Pn, Ж) balls, for the &\Pn)
seminorm, with centers gu ..., gM in Ж Then as in Section И.6,
pjsup|?°/i| > ?|4 <? МтахР{|ВД > \y\\).
On the set of \ where sup# Pn\h\ < 64a2, Hoeffding's Inequality bounds the
right-hand side by
2 exp[log M - ЩуJ/F4a2)]
which is of order op(l) because B9) says log M = op(n1/2). The central limit
theorem follows. ?
31 Example. The direct approximation method of Section II.2 gave uniform
strong laws of large numbers. With a suitable bound on the number of
functions needed for the approximations, we get central limit theorems.
Define a direct covering number А(д, Р, Ж) as the smallest M for which
there exist functions дъ.--,дм sucn that, for every h in Ж,
\h\ < g{ and Pgi < b + P\h\ for some i.
We may assume 0 < gt < 1. If
C2) log A(cn~1/2, P, Ж) = o(n1/2) for each c> 0,
and if the covering integral B8) from the previous example is finite, then the
empirical central limit theorem holds.
Given у > 0, choose X in the exponential inequality B5) so that
2/B~1(X) = y. The dependence of X on у does not vitiate the chaining argu-
argument in Theorem 26; it does ensure that functions in Ж satisfy
P\h\ <a2 = yn~112.
Find gi,..-,gM according to the definition of A(yn~112, P, Ж). Because
Pgi < 2yn~112 for each i, the contributions of the means to Е„ are small.
JP<sup\Enh\ > 4yi < pisupn1/2Pn|/z| > 3yi because n1/2P\h\ < у
{ ж J l ж J
< Р<тахи1/2Р„д; > Ъу\ because \h\ < gr,for some i
I ' J
< M maxP{?ngr; > y) because n1/2Pgt < 2y
i
< M max 2 exp[ -^/Рд^ЩуЦп^РдШ from B5)
< 2 exp[log M - W
= o(l) by C2). ?

Notes 165
33 Example. In the previous two examples, the method of chaining left
links of small ^C1(P) seminorm at the end of the chain; if1 approximation
methods took care of Ж If we chain instead with i?2(P) covering numbers,
we need if2 approximation methods for Ж.
Set d(-, •) equal to the i?2(P) semimetric. Because a2(f - g) < d(f, gf,
the chaining down to Ж requires J2(l, P, &) finite. At the end
Ph2 <a2 = B/B(i))n/2.
Invoke Lemma 11.33.
Ipisup(Pn/i2I/2 < 8a I -> 1 as n -> oo
i ж J
if the random covering numbers satisfy
log N2(cn~1/4, Р„, Ж) = op(n112) for each с > 0.
This would follow from
C4) J2(cn-1/4, Pn, Ж) = op(l) for each c> 0,
because
op{n^) = (сп-1^ГЧ2(сп-1'\ Рп, Ж)
> [2 log(N2(c?T1/4, Pn, ЖJп1/4/с)У12.
Symmetrize. For all n large enough,
p|sup|?n/i| >
Now we are back to the sort of problem we were solving in Section 4. Condi-
Condition on \. On the set of those \ for which sup^(Pn/i2I/2 < 8a, chain using
the Hoeffding Inequality to bound the tail probabilities. Apply the Chaining
Lemma for P(-1?), the i?2(PJ seminorm, and s = 8a.
\E°h\ > 26J2(8a, Р„, Ж)\Л < 16a if sup(Pn/i2I/2 < 8a.
)
Condition C4) and finiteness of J2(l, P, #") are sufficient for the empirical
central limit theorem to hold. ?
Notes
Theorem 5 draws on ideas from Chernoff A954), but substitutes stochastic
equicontinuity where he placed domination conditions on third-order
partial derivatives. The theorem also holds if t0 is just a local minimum for
F(-), or if т„ is a minimum for Fn(-) over a large enough neighborhood of t0.
Huber A967, Lemma 3) made explicit the role of stochastic equicontinuity
in a proof of the central limit theorem for an M-estimator.

166 VII. Central Limit Theorems
The chaining argument abstracts the idea behind construction of processes
on a dyadic rational skeleton. It appears to have entered weak convergence
theory through the work of Kolmogorov and the Soviet School; it is closely
related to the arguments for construction of measures in function spaces
(Gihman and Skorohod 1974, Sections Ш.4, III.5). The Chaining Lemma
is based on an arrangement by Le Cam A983) of an argument of Dudley
A967a, 1973,1978). Le Cam's approach avoids the complications introduced
into Dudley's proof by the nuisance possibility that covering numbers N(S)
might not increase rapidly enough as 3 decreases to zero. Alexander A984a,
1984b) has refined Dudley's form of the chaining argument to prove the most
precise maximal inequalities for general empirical processes to be found in
the literature.
Theorem 13 is based on Theorem 2.1 of Dudley A973), but with his
modulus function increased slightly to take advantage of Le Cam's A983)
cleaner bound for the error term. The extra (<5 log(i/<5)I/2 does not change
the order of magnitude of the modulus for most processes.
The argument in Section 4 is based on Pollard A982c), except for the
substitution of convergence in probability (condition A6)) for uniform
convergence. Kolchinsky A982) developed a similar technique to prove a
similar central limit theorem for bounded classes of functions. He imposed
finiteness of J2(-,P, #") plus a growth condition on N\(-, Pn, #") to get
results closer to those of my Example 27. Gine and Zinn A984) have found
a necessary and sufficient random entropy condition for the empirical
central limit theorem.
Brown A983) sketched the large-sample theory for the spatial median. He
referred to Brown and Kildea A979) and the appendix he wrote for Maritz
A981) for rigorous proofs, which depend on a form of stochastic equi-
continuity.
The central limit theorem for /c-means was proved by Pollard A982b,
1982d) for a fixed number of clusters in euclidean space. The one-dimen-
one-dimensional result was proved by Hartigan A978), using a different method.
Dudley A978, 1981a, 1981b, 1984) has developed the application of
metric entropy (covering numbers) to empirical process theory. These papers
extended his earlier work on entropy and sample path properties of gaussian
processes A967b, 1973), and on the multidimensional empirical distribution
function A966a).
Dudley A966a, 1978) introduced most of the ideas needed to prove
central limit theorems for empirical processes indexed by sets. He extended
these ideas to classes of functions in A981a, 1981b). His lecture notes A984)
provide the best available overview of empirical process theory, as of this
writing. The proof of my Theorem 21 was inspired by Chapter 4 of those
lecture notes, which reworked ideas from Dudley and Philipp A983). If
Pf = 0 for each / in J5", a standardization that can be imposed without
affecting Е„ or EP, the conditions of Theorem 21 are also necessary for the
empirical central limit theorem.

Problems 167
The first central limit theorems for empirical processes indexed by classes
of sets were proved by the direct approximation method. Bolthausen A978)
worked with the class of compact, convex subsets of the unit square in Ж2.
He applied an entropy bound due to Dudley A974). Revesz A976) indexed
the processes by classes of sets with smooth boundaries. Earlier work of
Sun was, unfortunately, not published until quite recently (Руке and Sun
1982). Dudley's A978) Theorem 5.1 imposed a condition on the "metric
entropy with inclusion" that corresponds to finiteness of a covering integral.
Strassen and Dudley A969) proved a central limit theorem for empirical
processes indexed by classes of smooth functions. They deduced the result
from their central limit theorem for sums of independent random elements
of spaces of continuous functions. All these theorems depend on existence
of good bounds for the rate of growth of entropy functions (covering num-
numbers). For more about this see Dudley A984, Sections 6 and 7) and Gaenssler
A984).
Theorem 26 resets an argument of Le Cam A983). Such an approximation
theorem has been implicit in the work of Dudley. Gine and Zinn A984) have
pointed out the benefits of stripping off the i?2(P) chaining argument, to
expose more clearly the problem of how to handle the little links left at the
end of the chain. They have also stressed the strong parallels between
empirical processes and gaussian processes. The examples in Section 6
follow the lead of Gine and Zinn: Example 27 is based on their adaptation
of Le Cam's A983) square-root trick; Example 31 is based on their im-
improvement of Dudley's A978) "metric entropy with inclusion" method;
Example 33 is based on their Theorem 5.5.
Problems
[1] Prove that the stochastic equicontinuity concept of Definition 2 follows from:
Zn(TJ — Zn{t0) -> 0 in probability for every sequence {т„} that converges in
probability to t0. [Suppose the defining property fails for some r\ > 0 and s > 0.
For a sequence of neighborhoods {Uk} that shrink to t0 find positive integers n(l) <
nB) < • • • with
P<sup \Zn(k)(t) - Zm(t0)\ >
( uk
for every k. Choose random elements {т„} of T such that, for n(k) < n < n(k + 1),
| Zn(ffl, т„(«)) - Zn(ffl, @) | > A sup | Zn(w, t) - Zn{w, ?0) I
uk
and т„(ш) belongs to Uk. Appendix С covers measurability of т„.]
[2] Let {/(¦, t): t e T} be a collection of IR'-valued functions indexed by a subset of IR'.
Suppose P\f(-,t)\2 < oc for each t. Set F(t) = Pf(-, t) and Fn(t) = PJ{-, t). Let
{т„} be a sequence converging in probability to a value t0 at which F(t0) = 0. If
(a) F(-) has a non-singular derivative matrix D at t0;

168 VII. Central Limit Theorems
(c) {?„/(¦, t)} is stochastically equicontinuous at f0 ;
then п1/2(т„ - ro)~>iV(O, D'lP\_f(-,to)f(,-,to)''\D~l). [Compare with Huber
A967).]
[3] For a class J* to index a P-motion it must be totally bounded under the ??2{P)
seminorm pP. [First show $F is bounded: otherwise \BP(fn)\ -> oo in probability
for some {/„}, violating boundedness of P-motion sample paths. Total boundedness
will then follow from: for each e > 0, every / lies within e of some linear combination
of a fixed, finite subclass of 3?. If for some s no such finite subclass exists, find {/„}
such that
n
fn+l = вп+l + Y.a»jfj'
where pP{gn+i) > e and gn+1 is orthogonal tofu ...,/„• Fix an M. Show that there
exists a 5 > 0, depending on M and e, for which
ЩВР(/„+1) > M\BP(fy),..., 5Р(/„)} > S.
Deduce that JP{supnBP(fn) > M} = 1 for every M, which contradicts boundedness
of the sample paths. Notice that continuity of the sample paths does not enter the
argument. Dudley A967b).]
[4] If sup^ |P/| is finite then J* must be totally bounded under the i?2{F) seminorm
pP if it supports a P-bridge. [Choose Z with a iV,@, 1) distribution independent of
EP. The process B(f) = EP(f) + ZPf is a P-motion with bounded sample paths.
Invoke Problem 3. The condition on the means is needed—consider the У con-
consisting of all constant functions. The P-bridge is unaffected by addition of arbitrary
constants to functions in !F; it depends only on the projection of J^ onto the sub-
space of .Sf 2(P) orthogonal to the constants.]
[5] Let Жу be a class of functions with an envelope H in ?C2(P). Set Ж2 = {h2: h e Жх}.
Show that
'2, Q, Ж2) < N2Be, Q,
[By the Cauchy-Schwarz inequality,
Q\hj - h\\ < Q{2H\hx -h2\)< 2(QH2y2(Q\hy - hz\2I'2
if both | fell < Hand \h2\ < Я.]
[6] Let J* be a permissible class of functions with envelope F. Suppose
J2(S, Р„, JF) = op(n) for each 5 > 0.
[Condition A6) of the Equicontinuity Lemma implies that J2{5, Р„, !F) = Op(l)
for each 5 > 0.] Show that Ж2 = {(/ - gJ: f - g e J^} satisfies the sufficient
condition (Theorem 11.24) for the uniform strong law of large numbers:
log Ny(e, Р„, Ж2) = op(n), for each s > 0.
[Set Я = IF and Жх = {/ - g:f, ge3F}. Show that, for 1 > e > 0,
iV2Be, Р„, Ж,) < N2(s, Р„, J^J < e ехрЙУ2(е, Р„, J^J/e2).

Problems 169
Deduce from this inequality, Problem 5, and the strong law of large numbers for
{Р„Н2} that, if 1 > e > 0,
P{log ЛГ1DеBРЯ2I/2, Р„, Ж2) > щ}
< IP{log ЛГ!Dе(Р„Н2I/2, Р„, Ж2) > щ] + JP{PnH2 > 2PH2}
< IP{log iV2Be, Р„, Жх) > щ) + W{PnH2 > 2PH2}
< Р{^2(е, Р„, 3Ffj&2 > Щ) + 1Р{Р„Н2 > 2PH2}
A weaker result was proved by Pollard A982c).]
[7] If J^ is totally bounded under the <?2{P) seminorm, then the space С(Ж, Р) of
bounded, uniformly continuous, real functions on & is separable. [Suppose
|x(/) — x(g)\ < e whenever pP(f — g) < 25. Choose {/l5... ,/„} as a maximal
set with pP(fi — fj) > jd. Use the weighting functions A,(-) from the proof of
Theorem 21 to interpolate between rational approximations to the {x(/;)}.]
[8] Suppose Э' is totally bounded under the ?C2(P) seminorm. If two probability
measures X and ц on the u-field &F have the same fidis, and if both concentrate on
C(#", P), then they must agree everywhere on 3I'. [Show that X and ц agree for all
finite intersections of fidi sets and closed balls with centers in C(#", P). For example,
consider a closed ball B(x, r) with x in C{^, P). Let {/b/2, ...} be a countable,
dense subset of C(&, P). Define
В„ = {z e C(J% P): |z(/;) - x(fd\ < r for 1 < i < и}.
Show that цВ(х, г) < цВ„ = ХВ„ -> XB{x, r) as n -> оо. Extend the result to finite
collections of closed balls and fidi sets, then apply a generating-class argument.]
[9] The property that the graphs have only polynomial discrimination is not preserved
by the operation of summing two classes of functions. That is, both #" and <§ can
have the property without the class У = {/ + g:fe3',деЩ having it. Let
3> = {Du D2, ¦ ¦ •} be the set of indicator functions of all finite sets of rational
numbers in [0,1]. Let & = {2n + Dn: n = 1, 2,...} and ^ = {-2n: n = 1, 2,...}.
The graphs from neither class can shatter two-point sets, but Sf can shatter arbi-
arbitrarily large finite sets of rationals in [0, 1]. [The roundabout reasoning used to
bound the covering numbers in Example 18 may not be completely unnecessary.]

CHAPTER VIII
Martingales
... in which martingale central limit theorems in discrete and continuous time
are proved. An extended non-trivial application to Kaplan-Meier estimators —
estimation of distribution functions from censored data—is sketched.
VIII. 1. A Central Limit Theorem for
Martingale-Difference Arrays
Martingale theory must surely be the most successful of all the attempts to
extend the classical theory for sums of independent random variables to
cover dependent variables. Many of the classical limit theorems have
martingale analogues that rival them for elegance and far exceed them in
diversity of application. We shall explore two of these martingale theorems
in this chapter.
One main change in technique will become apparent. Where proofs for
independent summands use truncation to protect against occasional ab-
abnormally large increments—the sort of thing implicit in something like the
Lindeberg condition—martingale proofs can resort to stopping time
arguments. Optional stopping preserves the conditional expectation connec-
connection within a martingale sequence as long as the decision to stop is based
only on past behavior of the sequence. The prohibition against peering into
the future to anticipate abnormally large increments imposes a characteristic
feature on martingale theorems. One needs conditions that protect against
the behavior of the worst single increment, because the decision to stop can
be taken only after that increment has had its effect.
In central limit theorems, independence allows one to factorize expecta-
expectations of products and separate out the contribution of a particular increment
from contributions of past and future increments. The martingale property
allows a weaker factorization, for expectations conditional on the past alone.
Conditional variances take over the role played by variances of independent
summands. But apart from that, the arguments for martingales share the

VIII. 1. A Central Limit Theorem for Martingale-Difference Arrays 171
same inspiration as the proof of the Lindeberg Central Limit Theorem in
Section III.4.
We shall prove asymptotic normality for row sums of martingale-differ-
martingale-difference arrays. That is, for each n we have random variables ?nl, ...,?,„„ (we
avoid some messy notation, and lose no generality, by assuming exactly n
variables in the nth row) and сг-fields <fn0 ? • • • ? $„„ for which
(a) ^
(b) ?nJ is <fnj-measurable.
Define conditional variances
vnJ = JP(Zlj\*n.j-i) for j=l,...,n.
Notice that vnj is an §n^-i-measurable random variable. Convergence of
sums of conditional variances will be the only connection tying together the
variables in different rows of the array.
1 Theorem. Let {?nj} be a martingale-difference array. If as n -* со,
(i) Yjj vnj ~*(j2 i" probability, with a2 a positive constant;
(ii) for every г > 0, the sum ?,- 1Р(<!;?Д|?nj| > e}\Saj-1) converges in prob-
probability to zero (a Lindeberg condition);
then ?nl + ..• + ?„„ ^ N@, a2).
Proof. Without loss of generality set a2 = 1. Let us check pointwise con-
convergence of characteristic functions:
IP cxp[it(Z»i + • • • + Ul - exp(-it2).
We really will need some of the special multiplicative properties of the
complex exponential function, and not just its smoothness by way of three
bounded, continuous derivatives as in Section III.4. The randomness of the
conditional variances fouls up the argument based on successive substitution
of matching normal increments, which worked for independent summands.
At the risk of notational abuse (more will follow) abbreviate the condi-
conditional expectation TP(-\$nj) to IP/-)- Write R(x) for the remainder term
e'x — 1 — ix and Snj for the partial sum ?nl н h ?nj. Define rnj as the condi-
conditional expectation JPj_ ^(t^j). When the {?n]} are small, in a sense to be
made precise by condition (ii), we shall get rnJ x —%vnj.
The proof will work by successive conditioning. We pin down the effect
of individual increments by evaluating
IP expOtSJ = POPoflP^O • • (IPn_! cxp(itSJ ¦ ¦ •))))),
a layer at a time. Start with the innermost conditional expectation. Factorize
out the part depending on ^nj-1 then expand the remaining exp(it?m).
^^l + rj.

172 VIII. Martingales
The 1 + rm factor foils the attempt to work the same idea with 1Р„_2; it
won't cooperate by slipping outside the conditional expectation, leaving
exp(itSnn_ J to enjoy the same treatment from 1Р„-2 that exp(j't5nn) received
from №„_!• We could clear away the obstacle by dividing out the offending
factor.
IP^EV^l + rj expats™)] = IP^expatS,,,,,-!)
= exp(irSn>n_2)(l + »•„,,,_!).
We could get rid of the A + г„ п^г) in a similar fashion.
If you pursue this idea through each layer of conditioning, you will see
the sense in starting from
The remote possibility that one rnj might get close to — 1 could cause minor
difficulty when we come to bound the product term. To avoid this, replace
A + rnj)~l by A — rnj). When rnJ к 0 the change has little effect. Define
k
Tnk = П A - rHj) and Znk = Tnk exv(itSnk).
If we could show IP | Т„„ - exp(^t2) | -»¦ 0 and JPZnn -»¦ 1, then it would follow
that
< exP(-it2)[IP|expatSnn + it2) - Zm\ + \TPZm - 1|]
->0.
We would get the desired results for {Tnn} and {Zm} from:
(a) Zj rnj -*¦ -if2 m probability;
(b) I>nj-I < t2;
(c) max; | rnj | -* 0 in probability.
The second of these requirements need not be satisfied but, without loss of
generality, we may act as if it were.
We replace ?nJ by ?nJ{j < an), where а„ = max{fc: Yj=ivnj ^ 2I- Inter-
Interpret <т„ as zero if vnl > 2. Because vnJ is <fnj_:-measurable, the event {j < an}
is Sn> j-!-measurable; an is a stopping time. The new variables are martingale
differences. The new row sums have the same asymptotic behavior as the
original row sums:
A i tnj ф i tnjij < <uj < п»{и > <u = n» j i vnj > 2} -, 0.
U'=i j=i J (.j = i J
The quantities wnj = Wj_ iR(t?nj{j < an}) satisfy analogues of the require-
requirements (a), (b), and (c). The argument depends on the inequalities (Problem 1)
for real x,
\R(x)\<±\x\2;

VIII.1. A Central Limit Theorem for Martingale-Difference Arrays
173
For the analogue of (c): if 5 > 0, no | wnj\ exceeds
maxHy^t2^; < an) < |A2 + E^-i^l^-l >
Set д small, then invoke the Lindeberg condition. For the analogue of (b):
J=l
0' ^ cr»}j;nj because {/ < а„} is «f^^-measurable
< t2 by definition of сг„.
For the analogue of (a), fix S > 0:
-JRit^U < U) + tf&jU < <U + tf&iij >
The first sum can be made small with high probability, by an appropriate
choice of S; the last two terms converge in probability to zero.
We could carry an along throughout the rest of the argument, but that
would clutter up the notation. Instead, let us assume that (a), (b), and (c)
hold. While we are at it, let us drop the n subscript for variables in the nth
row of the array; all calculations take place within the nth row. Our task is
to prove that
and
WZn -. 1,
B) IP|rn-exp(it2)HO
where Tk = П)=1A - rj) and Zk = Tk exV(itSk).
The path leading from (a), (b), and (c) to the first assertion in B) is well-
worn (Chung 1968, Section 7.1). For complex 9,
|log(l-0) + 0|<|0|2 if \9\<i
Apply the inequality to each r}. When max^ |r,-| < j,
<rmax|r,-| by(b).
j

174 VIII. Martingales
It follows from (a), (c), and continuity of the exponential function that {Tn}
converges in probability to exp(^t2). Each | Т„ | is bounded, because of (b):
C) I Tn\ < П A + \rj\) < expfe \rA < exp(t2).
Boundedness plus convergence in probability imply convergence in L1.
For the proof of the second assertion in B), bound the errors that accrue
during the calculation of conditional expectations layer-by-layer.
JP j^Zj = TPj^lT
Thus \WZj - IPZ^i | < P|r?Z7-_i | < exp(/2)IP|r2|, because inequality C)
implies \Z^X \ = 17}_t | < ехр(г2). Sum over;'.
n
\WZn - 1| < exp(J2)IP X lol2 -* ° as n -" °°'
j=i
because ^-1r} |2 < min{t4, t2 maxj \ rj\}. ?
4 Example. A sequence of real random variables Xo, Xu... is called an
autoregression of order one if Х„ = 0ОХ„_1 + wn for some fixed 0O. The
innovations {un} are assumed independent and identically distributed with
zero mean and finite variance a2. The initial value Xo is independent of the
{un}. The least-squares estimator 9„ minimizes
Solving for вп and standardizing, we get
E) nv\en - в0) = [n-^2 iuj
With the help of Theorem 1 we can prove
nll2(9n - в0) - N@, 1 - 61)
provided |0O| < 1 and IPX2, < oo. This will follow from convergence results
for the denominator and numerator in E):
n
n'1 ZxJ-i-+ ff2/(l - eo) in probability,
Start with the denominator.

VIII. 1. A Central Limit Theorem for Martingale-Difference Arrays 175
Square both sides of the defining equation of the autoregression, then
sum over/.
3=1 3=1 3=1 3=1
Rearrange then divide through by n.
F) A - el)n~' t Щ-1 = и"l t u) + 290n-1 t "jXj-i
3=1 3=1 3=1
+ п-1(^о-^„2).
On the right-hand side, the first term converges almost surely to <r2, by the
strong law of large numbers. The third term converges in L1 to zero, because
repeated application of the equality
IPX2 = Wu2n + 280lPunVXn_1 + 920JPX2n.L = a2 + d2WX2n_u
yields
ГОу2 „2/1 , Д2 , i д2п-2\ , n2nmv2 . я-2/zi O2\
1глп = a {I + tH + ¦ ¦ • + b0 ) + tH \гло -* а Д1 — u0).
The n brings the limit down to zero. The middle term on the right-hand
side of F) converges in L2 to zero:
= n 2 Yj №u]X2_ l independence kills the cross-product terms
3=1
= О(и-1) because {IPX?} is convergent.
So much for the denominator.
Write Sn for the a-field generated by {Xo, uu ..., uj. Abbreviate IP(-|<^)
to IP/-)- The variables {ujXj^^ are martingale differences for {Sj}. Apply
Theorem 1 to the sum in the numerator of E). For condition (i):
vnJ = Wj^in-'ujX2.,) = и" VX?_1;
t vnj = о2п~г t хз-1 ^ ^4/(l - в1) in probability.
3=1 3=1
The Lindeberg condition demands a more delicate argument if higher-
moment constraints on the innovations are to be avoided.
3=1
[u2X2.1l{u2>en^2} + {X2.i>
3=1
_ „-1 V Y2 TP«2f,.2 -. ™l/2l i „-1 V /r2v2 SY2
— n 2-i Ji-j-i'*ui\ui > si ' ) + n 2j ff •^j-ii-^j-i
3=1 3=1
en
1/21

176 VIII. Martingales
The first sum converges to zero in probability, because u\ is integrable. The
second sum converges in L1 to zero because the sequence {X2} is uniformly
integrable (Problem 2). ?
VIII.2. Continuous Time Martingales
A stochastic process {Z(r): 0 < t < со} is said to be a martingale with
respect to an increasing family of сг-fields {$t; 0 < t < oo} if Z(t) is adapted
to the (т-fields (that is, Z(t) is ^-measurable) and IP(Z(s) | St) = Z(t) whenever
s > t. After some fiddling around with sets of measure zero it can usually be
arranged that such a process has cadlag sample paths (Dellacherie and Meyer
1982, Section VI. 1), in which case it may be studied as a random element of
D[0, oo).
Call Z an L2-martingale if it has cadlag sample paths and IPZ(fJ < oo
for each t. The behavior of an L2-martingale is largely determined by the
conditional variances of its increments. The conditional expectation of
[Z(t + S) — Z{t)~]2 given St plays a role similar to that of the conditional
variance vnj in Section 1. The most economical way to explain this uses some
deeper results from the Strasbourg theory of stochastic processes. We could
avoid the appeal to the deeper theory by building its special consequences
into the martingale calculations for each particular application. That would
always work for martingales that evolve by discrete jumps; the calculations
would be similar to those in Section 1. The theory would be more self-
contained, but it would disguise the unifying concept of the conditional
variance process.
According to the Doob-Meyer decomposition (Theorem VII. 12 of
Dellacherie and Meyer A982), applied to the supermartingale — Z2), for
each L2-martingale Z, the process Z2 has a unique representation as a sum
V + M of a martingale M and an increasing, predictable, conditional
variance process V with V@) = 0. Both M and V have cadlag sample paths.
(Strictly speaking, for this decomposition we need the сг-fields {<%t} to satisfy
the " usual conditions": &0 should contain all IP-negligible sets and each St
should equal the intersection of the сг-fields Ss for s > t.) The adjective
"predictable" has the technical meaning that V(co, t) is measurable with
respect to the сг-field onfl® [0, oo) generated by the class of all adapted,
left-continuous processes. So V behaves something like a process with
left-continuous sample paths; its paths can be predicted a tiny instant into
the future. We will need the predictability property only in Lemma 11.
If the martingale Z changes only by jumps ?u ?2,. • • occurring at fixed
times ti < t2 < ¦•-, and if St = Stk for tk < t < tk+1, then V is just a sum
of conditional variances:
G) V(t) =

VIII.2. Continuous Time Martingales 177
whenever tk < t < tk+1. You can check directly that Z2 — V is a martingale
and that there exists a sequence of left-continuous, adapted processes con-
converging pointwise on fi ® [0, oo) to V. The value of V at tk corresponds to
what we would have written as vt + ¦ • • + vk in Section 1. You might take
this as your guiding example for the rest of the section if you wish to avoid
all appeals to the Strasbourg theory.
The process V carries information about the conditional variances of the
increments of Z. If s > t,
(8)
= IP(Z(SJK) - Z(tJ - 2Z(t)F(Z(s) - Z(t)\$t)
= IP(J/(s)K) + P(M(s)|<fr) - V(t) - M(t)
= JP(V(s) - V{i)\St).
For s very close to t, the predictability of V makes V(s) — V(t) almost
<f,-measurable; the last conditional expectation almost equals V(s) — V(t),
in a sense that will receive a more rigorous meaning later.
By means of a simple Tchebychev inequality we get from V a bound on
the size of the increments of Z in precisely the form required by the stopping
time argument of Lemma V.7. The maximal inequality provided by that
lemma lies at the heart of any proof for convergence in distribution in spaces
of cadlag functions.
9 Lemma. Let {Z(t): 0 < t < b) be an L2-martingale with conditional
variance process V. If , for every t,
TP(V(b) - V{t)\St) < b2j\2 almost surely,
then Ipisup|Z@ - Z@)| > 4 < 3IP{|Z(b) - Z@)| > id}.
Proof. With no loss of generality we may assume Z@) = 0. Write IP(( •) for
expectation conditional on $t. Lemma V.7 invites us to check that
i on
We shall do this by bounding the conditional probability from below by
A0) |-4|Z@r2IPt|Z(b)-Z@|2,
which is greater than f - 4S~2(S2/l2) on the set {\Z(t)\ > S}. Start from
the inequality
\Z(b)\ < \Z(b) - Z(t)\ + \Z(t)\
<||Z@|{|Z(b)-z(t)|<i|Z@|}
+ 3\z(b) - z{t)\{\z(b) - z(oi

178 VIII. Martingales
Keeping in mind that the absolute value of a martingale is a submartingale,
take conditional expectations given St.
\Z(t)\ < TP,\Z(b)\
{|Z(b)-z(OI<i|Z@|}
- Z(t)\{\Z(b) - Z(t)\
Z(b)-z(OI<i|Z(OI}
+ 6|Z(or1iP(|Z(b)-z(t)|2.
On {\Z(t)\ > S} we may divide through by f |Z(t)| to get the bound A0).
?
Martingales with cadlag sample paths define random elements of the
space D[0, oo) under its projection сг-field. As we shall be concerned only
with convergence in distribution to limit processes having continuous sample
paths, we equip D[0, oo) with its metric for uniform convergence on compacta
(Section V.5).
The main theorem will give conditions for a sequence of L2-martingales to
converge in distribution to a stretched-out brownian motion BH, the process
constructed from brownian motion by applying a continuous, strictly
increasing transformation H(-) to the time scale: BH{t) = B(H(t)). This new
gaussian process is an L2-martingale with conditional variance process
H(t), provided Я@) = О.
One hypothesis of the theorem will be pointwise convergence of the
conditional variance processes to H. It is analogous to the hypothesis in
Theorem 1 that the sum of conditional variances converges to one. As in the
proof of that theorem, we shall use the hypothesis to introduce stopping
times for the martingales. For fullest generality we must draw upon the
Strasbourg theory for the properties of predictable stopping times; for the
special case of martingales with conditional variance processes as in G),
the argument from Theorem 1 would suffice.
11 Lemma. Let {Xn} be a sequence of L2'-martingales whose conditional
variance processes converge pointwise to a fixed, continuous, increasing
function: Vn(t) -* H(t) in probability, for each fixed t. Then there exists a
sequence of stopping times {an} and constants {е„}, with cn -» oo in probability
and ?„ I 0, such that sup, | Vn(t л <тп) - H(t л <т„) | < е„ almost surely.
Proof. First fix an e > 0, and define a stopping time
The hypothesis implies that т„ -> oo in probability: if 0 = s0 < sx < ¦ ¦ ¦ < sk
are chosen so that H(s;) — H(s(_ x) < \г for each i, then (by monotonicity of
Vn and H)
1Р{т„ < sk} < TP{\Vn(Si) - H(st)\ > \г for some i} -> 0.

VIII.2. Continuous Time Martingales 179
From the definition of т„,
A2) \Vn(t) - H(t)\ < s for t<xn.
Only the time t = xn might cause trouble; Vn might have a jump there. That
is where predictability comes to the rescue.
There exists an increasing sequence of stopping times {xnj} with xnJ < xn
and xnj I xn almost surely (Dellacherie and Meyer 1978, IV.69, IV.77),
because predictability allows us to peer just a little distance into the future
for Vn. Replace х„ by ти>Ля) for some j(n) such that xnJ(n) -* oo in probability.
This lops off the troublesome point t = х„ in A2); the inequality holds for
every t in the range [0, тв,Ди)].
If the argument works for fixed e it must also work for a sequence {е„}
decreasing slowly enough. Write <т„(е) for the stopping time т„ J(n) just
identified. There exist integers n(l) < n{2) < ¦¦¦ such that
Ща^к'1)^ к} <к~г for n > n(k).
Set е„ = fc and а„ = ajk.'1) when n(k) < n < n(k + 1). ?
Stopping time tricks like the one used in this proof pop up all over the
place in martingale limit theory. Sample path properties that hold with
probability tending to one can often be made to hold with probability one
by enforcing an appropriate stopping rule.
Something must be added to the convergence of conditional variance
processes. Otherwise we could specialize the result to processes with in-
independent increments, obtaining as a by-product the Central Limit Theorem
for sums of independent random variables without having to impose any-
anything like a Lindeberg condition. The extra something takes the form of a
constraint on the maximum jump in the sample path. Define the jump
functional JT on D[0, oo) by:
JT(x) = max{|x(s) - x(s-)|: 0 < s < Г}.
It is both continuous and measurable (Problem 4). If Xn >-* BH then certainly
JT(Xn) ^ Jt(Bh) = 0; convergence in probability of {JT(Xn)} to zero is a
necessary condition. The theorem assumes just a little bit more to get a
sufficient condition.
13 Theorem. Let {Xn} be a sequence of L2-martingales with conditional
variance processes {Vn}. Let H be a continuous, increasing function on [0, oo)
with Я@) = 0. Sufficient conditions for convergence in distribution of {Х„},
as random elements of D[0, oo), to the stretched-out brownian motion BH are:
(i) XJO) ^ 0 in probability;
(ii) Vn(t) -» H(t) in probability, for each fixed t;
(iii) JPJk(XnJ -»¦ 0 for each fixed k, as n -»¦ oo.

180 VIII. Martingales
Proof. By virtue of Theorem V.23, we have only to prove convergence in
distribution for the truncations of the processes to each compact interval
[0, Г]. The argument for the typical case Г = 1 will suffice. According to
Theorem V.3 we shall need to establish
(a) Fidi Convergence: the fidis of the {Xn} converge to the fidis of BH
(b) The Grid Condition: to each e > 0 and S > 0 there corresponds a grid
0 = t0 < t1 < ¦ ¦ • < tm = 1 such that
limsupIP^max sup \Xn(t) — Xn(tj)\ > d> < s
Theorem 1 will take care of (a); Lemma 9, applied to the Xn processes stopped
appropriately, will take care of (b).
By Lemma 11 there exist stopping times {<т„} and constants {е„}, with
an -* oo in probability and en [ 0, such that | Vn(t л <т„) — H(t л <т„) | < е„
almost surely. We may assume that Х„ has at most one jump of size greater
than е„ up to time an. Formally, we would replace {е„} by a more slowly
decreasing sequence such that
}(Х„) > ?„} -»¦ 0 as n -> оо
for some slowly diverging sequence {k(ri)}. Such sequences exist because of
(iii). Then we would replace an by
mf{t<an:\Xn(t)-Xn(t-)\>sn}.
These modifications would not disturb the other properties of {<т„} and {е„}.
The stopped martingale X*(t) = Xn(t л <т„) has conditional variance
process V*(-) — Vn{- л ап). It enjoys (i), (ii), and (iii) in strengthened form:
(i)' X*@) = Х„@) ^ 0 in probability;
(ii)' \V*(t)~H(t Aan)\<sjora\lt;
(iii)' X* has at most one jump of size > &„ and WJX{X*J -* 0.
These will make it easy to prove that X* ~> BH. The required convergence
of the truncation of Xn to [0, 1] will then follow because
W{Xn{t) = X*{t) for 0 < t < 1} > W{an > 1} -> 1.
Simplify the notation by dropping the star.
Fidi Convergence
Let us prove only that Х„A) -»• N@, ЯA)). Problem 6 extends the result to
higher-dimensional fidis. Because of (i)', we can do this by breaking
Xn{\) — Xn@) into a sum of martingale differences that satisfy the conditions
of Theorem 1.
Focus for the moment on a fixed Х„ by setting Z(t) = Xn{t) — Х„@) and
writing V instead of Vn for its conditional variance process. Break Z(l) into

VIII.2. Continuous Time Martingales 181
a sum of increments Z{z]) — Z(t,-_ Д with 0 = т0 < xr < ¦ ¦ • a sequence of
stopping times defined inductively by
Tj+1 = inf{r > ту. \Z(t) - Z(tj)\ > &„} л 1 л (tj + Sn).
Choose {Sn} so that \H(t + Sn) - H(t)\ < е„ for every t in [0, 1]. Denote
expectations given Sz by IP/-); write AjJ for the increment/^) — /(Tj.j)
of any function / between successive stopping times. Then
Z(l)= ? A,Z.
j=i
Along any particular sample path of Z all except finitely many of these
increments equal zero; there is no problem with convergence of the sum.
Check the conditions of Theorem 1 for the martingale differences {A^Z}.
Strictly speaking the theorem applies only to triangular arrays with a finite
number of variables in each row, but there may be infinitely many A^Z
increments. We would need to apply the theorem to a finite sum of A^Z,
for 1 < j < j(n), with j(n) chosen so that both
?a,(Z) and fjPj.^AjZ)
converge to zero in probability.
By property (8) of the conditional variance process,
Informally speaking, predictability of V almost makes A;- V measurable with
respect to SZj_l; the last sum almost equals ?V AjV, which we know will
converge in probability to ЯA) as n -* oo. Formally, Yj (A,-F — Щ-iAjV)
is a sum of martingale differences with zero mean and variance less than
Y 1Р(А7- VJ because the cross-product terms vanish
j
< X IP[Ben + AjH( ¦ л an))Aj П by (ii)'
j
< 3enIP \Y Aj v) by the choice of {д„}
< Зе„(е„
-> 0 as n -* oo.
That takes care of condition (i) of Theorem 1.
For the Lindeberg condition it suffices to check the stronger L1 conver-
convergence.
e} = П>Х(А^J{|А^| > е}
j
if 6 > 2en

182 VIII. Martingales
The inequality follows, by the definition of xh from:
\Z(tj) - Z(tj^)\ < \Z(tj) - Z(T}-)\ + |Z(tj-) - Z(Vl)|
< | jump at zj | + en
At most one increment AjZ can exceed 2е„ in absolute value, and that
happens only if Z has its one jump greater than en at Zj. An appeal to the
second part of (Hi)' completes the proof of the fidi convergence.
The Grid Condition
Choose 0 = t0 < ¦ • ¦ < tm = I so that H(tJ+i) - H(tj) < S2/24 for each j.
For n large enough to make 2en < <52/24, the strengthened condition (ii)'
implies
A4) ВД(Г;+1) - VM\*,) ^ ?>2I\2 almost surely
if tj< t < tJ+ !• Invoke Lemma 9.
A5) limsupipi max sup \Xn(t) - Xn(t})\ > 5
« I J [tj,tJ+i)
J
J=0 n
Y, 3IP{|BH(tJ+1) - BH(tj)\ > id} by fidi convergence.
j=o
m-l
X ЗР{|ЛГ@, H(tj+1) - H(tj))\ > id}
m-l
j=o
j
which is less than e if the grid points are close enough together. ?
VIII.3. Estimation from Censored Data
The empirical distribution function based on an independent sample
?u ..., ?,„ from P is a natural estimator for the distribution function of P.
How should one modify it when the observations are subject to censorship?
One possibility, the Kaplan-Meier estimator, can be analyzed by the martin-
martingale methods from the previous section. What follows is a heuristic account.
The notes to the chapter will point you towards more rigorous treatments,
which draw on results from the theory of stochastic integration.

VIII.3. Estimation from Censored Data 183
Consider the simplest model for censorship. Independent variables
с!,..., с„, drawn from a censoring distribution C, cut short the natural
lifetime ?;; we observe the value yt = ?,- л ct. Each yt has distribution Q,
where Q(s, oo) = P(s, oo)C(s, oo). In addition, we can tell whether the
уi represents natural death, ?t < ct, or a case of death by censorship, ^ > c;.
To construct the Kaplan-Meier empirical measure К„, start from the usual
empirical measure Qn for the observations Ух,..-,у„. Working from the
left-hand end, distribute successively all the mass from each censored point
equally amongst all the {y;} lying to its right. In the situation pictured, y1
keeps its mass ^; then y2 surrenders mass } x ^to each of its seven successors
y3,..., y9; then y3 surrenders ^ x (^ + j x |) to each of y4,..., y9; and
so on. At the last point, yg, there are no more {yj to inherit its mass, so
dump it all down on a fictitious super-survivor out at +00. If y9 had not
been censored, it would have kept all its inherited mass. In any case, Kn will
distribute its total mass of one amongst the naturally deceased {yt}, with
maybe a little bit on + 00. Notice that К„[0, t] < gn[0, f] for each t.
Make the analysis as simple as possible by assuming that both P and С
are continuous distributions living on [0, 00), with neither concentrated on
a finite interval. Write St for the cr-field corresponding to everything we
learn up to time t about which ?,. have died or been censored. Write IP/-)
forP(-l^).
Calculate (to first-order terms) the conditional distribution of the incre-
increment AKn = Kn(t, t + h] given St, for tiny positive h. From St we learn the
value Qn[0, i]. Define m to be nQn[0, t]. The remaining n — m observations
in (t, 00) are generated by choosing each ?; from the conditional distribution
P(-\? > t), then censoring it by a ct chosen from C(-\c > t)- Write AP for
P(t, t + h] and AKn for Kn(t, t + h]. To first order, each of the n - m
observations has conditional probability AP/P(t, 00) of registering a natural
death during the interval (t, t + h]. A single such observation would receive
a fraction (n — m)~ * of the К„ measure for (t, 00]. Thus, to first order,
= (n- mT^KJt, оо](и - m)AP/P(t, 00).
This suggests that
JPtAKn/Kn(t, 00] = AP/P(t, 00) + lower order terms,
which would lead us to believe that log Kn(t, 00] — log P(t, 00) is a contin-
continuous-time martingale for each n.
An attempt to add rigor to the first-order analysis would reveal a few
illegal divisions by zero. There is a positive probability that either n — m
or Kn(t, 00] could equal zero. A suitable stopping time can save us from

184 VIII. Martingales
embarrassment. Let {<*„} be a sequence of positive numbers converging to
zero. Make sure that nocn is an integer. Define р„ as the first t for which Qn(t, oo)
equals an. Then certainly
Kn(t a pn, oo] > Qn(t a pn, oo) > а„ > О for all t.
The first-order analysis could be made rigorous enough to show that the
process
Xn(i) = log Kn(t л /)„, oo] - log P(t л /)„, со)
is a continuous-time martingale for each n.
On the set {р„ > t}, the increment in Vn, the conditional variance process
of Xn, would be
Wt(AXnJ = (n - m)~2(n - m)AP/P(t, oo) + smaller order terms
On {pn < t} the increment would be zero. Recover Vn as a limit of sums of
conditional variances (see Problem 5).
Vn(t) = n'1 J{0 < s < t a pn}Qn(s, oo)?^, юГ'
By definition of р„,
[
o
Thus {Vn} converges in probability to zero uniformly over compact sets.
Apply Lemma 9. For b fixed and и large enough,
Ip{sup|Xn@| > Л < №{\ХЯ(Ь)\ > 1<5}
< 12ё~21РХ„фJ
-*0.
That is, for each fixed b,
sup |log Kn(t а р„, oo] — log P(t a pn, oo)| -* 0 in probability.
t<b
Because both К„ and P are probability measures, and because р„ ->¦ oo in
probability, it follows that
A6) sup | К„[0, t] - P[0, i] | -»¦ 0 in probability.
t
The Kaplan-Meier measure estimates P consistently, in the sense of uniform
convergence of distribution functions.
Now consider the normalized martingale nll2Xn(t). It has conditional
variance process nVn, which converges in probability to the continuous
function
H(t)
= f{0 < s < t}Q(s, 00)"^(s, аэ)-1 P(ds).

Notes 185
That suggests {п1/2Х„} converges in distribution to a brownian motion
stretched out by H. Theorem 13 proves it.
We already have the first two requirements of the theorem holding. What
about the maximum jump of nll2Xn on an interval [0, t]? It can only occur
at one of the {yt} that died naturally. In the worst case, where all except the
largest y{ in [0, t] are censored, the measure К„ puts mass (nQn(t л р„, oo))~*
at that largest y(. The corresponding jump J in n1/2Xn cannot exceed
nll2(na.^)~la.~l, which converges to zero provided <х„ |> n~llA. That is more
than enough to force IPJ2 -* 0 as n -> oo. Thus nll2Xn ~> BH.
The effect of the stopping wears off as n -> oo, because р„ -» oo in proba-
probability. So we get from the convergence result for {Х„} that
n1'2 loglKn(t, oo]/P(t, oo)] - BH
as random elements of D[0, oo). Rewrite the left-hand side as
и1'2 log[l + (P[0, t] - К„[0, t])/P(t, oo)]
= - и1/2(К„[0, f] - P[0, t])(l + op(l))/P(t, oo)
with the op(l) converging in probability to zero uniformly over compact
intervals, by virtue of A6). Deduce that
nll2(Kn[_0, Q - P[0, t])/P{t, oo) - BH,
or, equivalently,
п1/2(Х„[0, t] - P[0, ф ^ P(t, oo)BH.
The covariances of the gaussian process P(-, oo)BH(-) depend on both P
and C.
Notes
Martingales have been one of the hottest topics on the weak convergence
market for some years now, partly because of a boom in stochastic integral
stocks. For discrete-time martingales key papers are Brown A971), McLeish
A974), and Aldous A978). Both McLeish and Aldous obtained results
sharper than those in Sections 1 and 2. Hall and Heyde A980, Section 6.4)
gave a much fancier version of Example 4. The literature on continuous time
martingales has grown rapidly; the Shiryayev A981) survey, Metivier A982),
the two-volume monograph by Liptser and Shiryayev A977, 1978), and
Gihman and Skorohod A979), are accessible. Meyer A966), and Dellacherie
and Meyer A978, 1982), make harder reading. Dellacherie A972) contains
much about subtle cr-fields.
What I called the conditional variance process usually gets written as
<Z, Z>. The more widely definable quadratic variation process [Z, Z] has
taken over the role I gave to <Z, Z> in Section 2. In the discrete-time theory,
[Z, Z] would be the sum of squared increments; <Z, Z) would be the sum
of conditional variances.

186 VIII. Martingales
Martingales and stochastic integrals simplify the theory for estimation of
distribution functions from censored data. Gill A980) covered this in some
detail; Jacobsen A982) explained the essential prerequisites from the
Strasbourg theory of processes. Bremaud A981) gave applications to point
processes. The martingale proof of the central limit theorem for Kaplan-
Meier estimators makes an interesting comparison with a more traditional
weak convergence approach of Breslow and Crowley A974).
Problems
[1] For у real, define Н„(у) = eiy - 1 - iy (iy)"/n\. Prove that \Н„(у)\ <
\y\n+1/(n + 1)!. [Proceed inductively, using i$'0Hn(s) ds = Hn+i(t) for t > 0.]
Deduce that \eiy - 1 - iy + \y2\ < min{^2 + \y2, i\y\3}.
[2] Prove uniform integrability of the squared autoregressive process {X2} from
Example 4. Break и,- into a sum Vj + Wj, where Vj = Uj{\uj\ < M} — JPuj{\uj\ < M}
for some large truncation level M. Define
Show that Х„ = Yn + Zn + B^Xq. The sequence {Y2} is uniformly bounded.
Because JPZ2 = TPw2 -\ \- el"'2JPw\, the M can be chosen to make WZ2 < e
for every n. [A similar argument would work for {u,} a sequence of martingale
differences. Solution provided by Kai Fun Yu.]
[3] For the uniform empirical process Un prove that {[/„(t)/(l — t):O<r<l}isa
martingale with respect to an appropriate family of ст-fields. Bound
P{suptsi,[i7n@l > 8} for small enough b, then deduce the Empirical Central
Limit Theorem. Generalize to ?„. [Over [a, b~\ take St as the cr-field determined
by the locations of sample points in [a, t], for a < t < b.]
[4] The jump functional is projection measurable; JT(x) is the limit of
max,- \x(ti) — jc(f;_ j)| as the grid 0 = t0 < • ¦ • < tm = T is refined down to a
countable dense subset of [0, T].
[5] Let Z be an L2 martingale on [0, 1] with conditional variance process V. If
V has continuous sample paths, adapt the arguments in the Fidi Convergence part
of the proof of Theorem 13 to show that
? P([Z(t;) - Z(t;_ J] Vtj_,) -> V{\) in probability,
i
as the maximum spacing in the grid 0 = t0 < ¦ ¦ • < tm = 1 tends to zero. [The
intuitive interpretation of V as a limit of a sum of conditional variances of incre-
increments is close to the mark.]
[6] Suppose that martingales {Xn} satisfy the conditions of Theorem 13. Prove that
their fidis converge to the fidis of BH. [If 0 < t1 < ¦ ¦ ¦ < tk < 1, then YJf) =
Yj=1 ajXn(t л tj) is a martingale with У„A) = ?,. aJXB(tJ).']

Problems 187
[7] Using the notation from Section 3, show that
Zn{i) = Kn[0, t л р„] - J{0 < s < t л Pn}Kn{s а р„, a>]P(s, ооГ'ВД
is a martingale. Deduce from the consistency result for К„ that {rc1/2Zj converges
in distribution to a stretched-out gaussian process. Reprove the central limit
theorem for the Kaplan-Meier estimator.
[8] Let {Х„} be a sequence of L2-martingales with conditional variance processes
{Vn}. Suppose the fidis converge to the fidis of a stretched out brownian motion
BH, with H continuous. If Vn(t) -> H(t) in probability for each fixed t, then Xn-* BH.
[If р„ is a stopping time taking only finitely many different values, then
„ + <5„) - Х„(р„)]2 = Р[К,(р„ + <5„) - К,(р„)].
Check Aldous's condition (VI.13) for suitably stopped processes. Use Lemma 11.]

APPENDIX A
Stochastic-Order Symbols
Mann and Wald A943) defined the symbols 0p(-) and op(-\ the stochastic
analogues of O(-) and o(-), as a way of avoiding many of the messy details
that bedevil asymptotic calculations in probability theory.
Let {Х„} and {Yn} be sequences of random vectors. Write Х„ = Op(Yn) to
mean: for each e > 0 there exists a real number M such that
ТР{\Х„\ > M\Yn\} < s if n is large enough. Write Xn = op(Yn) to mean:
Р{ \Х„\ > e| Yn\} -> 0 for each e > 0. Of course Xn and Yn must be defined
on the same probability space.
Typically {Yn} will be non-random. For example: Х„ = op(l) is another
way of writing Х„ -> 0 in probability; a sequence of order Op(l) is said to be
stochastically bounded; statistical estimators commonly come within
Op(n~112) of the parameters they estimate.
With the Op(-), op{-) notation, Problem III. 15 (the delta method) has a
quick solution. Differentiability of the map H means
H(x) = H(x0) + L(x - x0) + o(x - x0) near x0,
where L is a fixed s x к matrix. If п1/2(Х„ — x0) ^* Z then
H(Xn) = Я(х0) + L(Xn - x0) + ор(Х„ - x0)
= H(x0) + L(Xn - x0) + op(n-^2).
Thus
nll2(H(Xn) - H(x0)) = Ln1'2(Xn - x0) + op(l) -> LZ.
A few hidden assertions here need justification. For example:
n1'\Xn - x0) -> Z implies Zn - x0 = O/n/2);
if Zn -»• Z then LZn + op(l) ~> LZ.
These conceal the asymptotic details of the proof.

190 A. Stochastic-Order Symbols
For comparison, here is part of the proof of Theorem VII.5 written
without the benefit of the notation. Remember that
/(-, 0 = /(•, 0) + t'A(-) + U|r(-, 0,
F(t) = Pf(-,t) = F@) + i\t\2 + Kt),
where for each e > 0 there exists a S > 0 such that
\h(t)\<e\t\2 if \t\<S.
By assumption,
t
where пХ„ -* 0 in probability as n -* oo. Also we know that
i1/2(F.(t») - F(xn)) = ?„/(-, 0) + т;?„Л + |т„| У„
or
B) ?„(т„) = F(tJ + n-l>2Enf(-,0) + n-x'nEnA + n~^2\xn\Yn,
where Yn = Enr{-, xn) ->• 0 in probability.
From A) and B),
Х„ > Fn(xn) - Fn@)
= F(xn) + n~^2x'nEnA + n-^2\xn\Yn - F@)
-= т IT ~r~ "t"/1( T ) т~ W T ii Zi i" W I T I /
Z I M I ' Z V VO/ ' П П — ' I И I И
Take absolute values. Because Xn > 0,
nXn > \n\xn\2 - \n\h{xn)\ - п1/2|т„| |?„А[ - п1/2|т„| | У„|
>^A-е)|т„|2-п1/2|т„|(|?„А| + |Уп|) if \xn\ < д
> Kl - е)(и1/2|т„| - ^„J - i(l - e) ^2,
where Wn = A - e)-1(|?HA| + | У„|). For fixed M deduce that
1Р{(п1/21тп| - Wnf > 2M} < Р{[т„| >ё} + П>{2A - е)иХ„ > М}
+ F{W2 > М}.
The first and second terms converge to zero. If M is large enough, the third
term is eventually less than any fixed a because ... well, because Wn is of
order Op{\). The mess gets too great if we try to avoid the stochastic-order
symbols any longer.
If you need further convincing read the beautiful expository article by
ChernoffA956).

APPENDIX В
Exponential Inequalities
The central limit theorem leads one to expect sums of independent random
variables to behave as if they were normally distributed; tail probabilities
for standardized sums can be approximated by normal tail probabilities.
For the limit theorems proved in this book, we need upper bounds rather
than approximations. A few such bounds are collected together here.
The tails of the normal distribution decay rapidly. For ц > 0,
VI п"
The important factor is the exp(—\n2). We need a similar upper bound for
the tail probabilities of a sum of independent random variables Yb ..., Yn.
Set S = Yj + • • • + Yn. For each t > 0,
П
A) IP{S >n}< exp(-nt)TPexp(tS) = exp(- nt) Ц IPexp(t Y;).
The trick will be to find a t that makes the last product small. For the normal
distribution it is easy to find the best t directly:
Р{ЛГ@, 1) > n} < inf exp(^t2 - nt) = exp(-^2).
t>0
For other distributions we have to work harder. We must maneuver the
moment generating function of Yt into a tractable form that gives us some
clue about which value of t to choose.
2 Hoeffding's Inequality. Let Yu Y2,..., Ynbe independent random variables
with zero means and bounded ranges: a{ < Yt < bt. For each n > 0,
f ¦ • • + Yn > n} <
?(Ь( - аА

192 В. Exponential Inequalities
Proof. Use convexity to bound the moment generating function of Y{. Drop
the subscript i temporarily.
e'Y < eta(b - Y)/(b - a) + e'\Y - a)/(b - a).
Take expectations, remembering that IP У = 0.
ffVy < ewb/(b - a) - etba/(b - a).
Seta = 1 — P = — a/(b — a)andM = t(b — a).Note:a > Obecausea < 0 < b.
log Pety < logOSe""" + aee") = -аи + log(jS + ae").
Write L(u) for this function of u. Differentiate twice.
L'(u) = -a + aeu/(P + aeu) = -a + a/(a + jSe"""),
L"(m) = ai?e-"/(a + ^e""J = [«/(a + 0
The inequality is a special case of: x(l — x) < % for 0 < x < 1. Expand by
Taylor's theorem.
L(u) = L@) + mL'(O) + i^2L"(M*) < |м2| = it2(b - йJ.
Apply the inequality to each Yt, then use A)
Set t = 4^/^; (bt - atJ to minimize the quadratic. ?
3 Corollary. Apply the same argument to {— У;} then combine with the in-
inequality for {Yj} to get a two-sided bound under the same conditions:
Y, + ¦ ¦ ¦ + Yn| > n} < 2 expf-2n2\ ?(b, - ad2]. ?
L /¦¦=! J
In one special case the proof can be shortened slightly. If Yt takes only
two values, + at, each with probability \, then
IPe171 = \\etai + e~ta4 = | (aitJk/Bky. < exp&ft2).
The rest of the proof is the same as before. We only need the Hoeffding
Inequality for this special case.
4 Bennett's Inequality. Let Yu ..., У„ be independent random variables with
zero means and bounded ranges: \ Yt\ < M. Write a2 for the variance of Yt.
Suppose V > a2 + ¦ ¦ • + a2. Then for each r\ > 0,
1 + ..-+ Yn\ > n} <
where B(X) = 2Я[A + Я) Iog(l + X) - Ц for X > 0.

В. Exponential Inequalities 193
Proof. It suffices to establish the corresponding one-sided inequality. The
two-sided inequality will then follow by combining it with the companion
inequality for {— YJ.
Bound the moment generating function of Yt. Drop the subscript i
temporarily.
IPe'y = 1 + г1РУ + ? (tk/k\)JP(Y2Yk-2)
k=2
= 1 + a2g(t) where g(t) = (etM - 1 - tM)/M2
< exp[a2gr@3.
From A) deduce 1P{S >ц}< exp[Fgr(t) - r\t\. Differentiate to find the
minimizing value, t = M~1 log(l + Mr\V~ l\ which is positive. ?
The function B(-) is well-behaved: continuous, decreasing, and B@ +) = 1.
When X is large, B(X) « 2X~l log A in the sense that the ratio tends to one as
X -» oo; the Bennett Inequality does not give a true exponential bound for ц
large compared to V/M. For smaller r\ it comes very close to the bound for
normal tail probabilities.
Problem 2 shows that B(X) > A + \X)~l for all X > 0. If we replace B(-)
by this lower bound we get
r,} < 2
which is known as Bernstein's inequality.
Notes
Feller A968, Chapter VII) analyzed the tail probabilities of binomial and
normal distributions—sharp results obtained by elementary methods.
Bennett A962) and Hoeffding A963) derived and compared a number of
inequalities on tail probabilities for sums of independent random variables.
Dudley A984) noted the simpler derivation of Hoeffding's Inequality when
Yt takes only values ±at. Bernstein's inequality apparently dates from the
1920's; it appeared as Problem X.14 in Uspensky's A937) book.
Problems
[1] For independent iV(O, l)-distributed random variables Yu Y2, ¦ ¦ ¦, show that
(maxjsn Yj)/B log иI'2 converges in probability to one. [Write М„ for the maximum.
Show
JP{Mn < Bri log nI'2} = [1 - IP{N@, 1) > Br\ log nI/2}]",
then use the exponential inequalities for normal tails.]

194 В. Exponential Inequalities
[2] For the function B() appearing in Bennett's Inequality prove that
A + iA)B(A) > 1 for all X > 0.
[Apply l'Hdpital's rule twice to reduce the left-hand side to
(l+iA*Xl + A*r1+$log(l +A»)
for some X* less than A. Then use log(l + A*) > A*/(l + A*).]
[3] If a random variable Y has zero mean, finite variance a2, and is bounded above by
a constant M, then for t > 0,
IIV < (<tVm + М2е-/м)/(<72 + M2).
[Subject to the constraints IPY = 0 and JPY2 < a2, the value of ПУ is maximized
when IPy concentrates on the two values M and — <j2/M.] To prove this let ф{у)
be the quadratic
е-/мс-\М - j>)[1 + (С + ОС)' + ст2/М)] + ёмС~\у + а2/МJ,
where С = M + <j2/M. Check that the coefficient of y2 is strictly positive and that
ф satisfies
ф(М) = е'м, ф( - <J2/M) = e ' ''м, ф'(- <72/M) = te~ '°2<м.
Show that ф(у) > e'y for у < M, with equality at у = M and у = — c2/M. [The
function h{y) = е~'уф(у) has a local minimum of 1 at —<J2/M. Also й(М) = 1. If
h(y*) were equal to 1 for some y* in the interval ( — a2/M, y*), the quadratic e'yh'(y)
would have three real roots: one at — <J2/M, one in the interval (— <J2/M, y*), and
one in the interval (y*, M).~\ The distribution IPy concentrated at M and — a2/M
achieves equality in nVy < ТРф(У). Bennett A962, page 42).]
[4] For the one-sided form of Bennett's Inequality one needs only zero means and
Yi < M for each i. Reexpress the inequality from the previous problem as
ПУ < exp[tM + log/(l + <t2/M2)],
where/(y) = 1 - y~l + y~xe~'My. Prove that d2/dy2 log f{y) equals
-2y-3e-MyletM> -1-tMy-
which is less than zero for у > 1. Deduce from a Taylor expansion to quadratic
terms that
log /(>•)< log /A) + 0» - l)/'(l)//0) for y>\,
whence HV-*1 < exp[<j2(e'M — 1 — tM)/M2']. Complete the argument as before.
[Hoeffding A963, page 24).]

APPENDIX С
Measurability
The defining properties of cr-fields ensure that the usual.countable operations
in probability theory—countable unions and intersections, pointwise limits
of sequences, and the like—cause no measurability difficulties. In Chapters
II and VII, however, we needed to take suprema over uncountable families
of measurable functions. The possibility of a non-measurable supremum
was brushed aside by an assurance that a regularity condition, dubbed
permissibility, would take care of everything. This appendix will supply the
missing details.
The discussion will take as axiomatic certain properties of analytic sets.
A complete treatment may be found in Sections III.l to 111.20, 111.27 to
111.33, and 111.44 to 45, of Dellacherie and Meyer A978). Square brackets
containing the initials DM followed by a number will point to the section
of that book where you can find the justification for any unproved assertions.
Suppose M is a set equipped with a cr-field Л. The analytic (^-analytic
in DM terminology) subsets of M form a class slightly larger than Л.
Denote it by =я/(М). If Л is complete for some probability measure \i, (that
is, Л contains all the sets of zero \i measure) then sf(M) = Л [DM 33].
For example, the analytic subsets generated by ^[0, 1] contain that cr-field
properly, but the cr-field of lebesgue measurable subsets of [0, 1] coincides
with its analytic sets. You see from this example that we should be writing
л/(Ж) rather than s/(M). The ambiguity is not serious when M is equipped
with only one cr-field. For metric spaces, we will always choose Л to be the
borel cr-field; for product spaces, it will always be the product cr-field.
We considered empirical processes indexed by a class of functions.
Formally, ?u <jj2, • • ¦ were measurable maps from a probability space (Q, S, IP)
into a set S equipped with a cr-field У. A class 3F of 5^/^(IR)-measurable,

196 С. Measurability
real-valued functions on S was given. The empirical measure Pn attached to
each / in J* the real number
We were assuming measurability of functions of a> such as sup^r \Pnf — Pf\.
Let us now consider more carefully the dependence on &>, which we emphasize
by writing Р„(со, •) instead of Pn(-)-
Permissible Classes
Suppose that the class J* is indexed by a parameter t that ranges over some
set T. That is, & = {f(-,t):te T}. We lose no generality by assuming & so
indexed; T could be J* itself. When more convenient, write ft instead of /(•, t).
Assume T is a separable metric space. The metric on T will be important
only insofar as it determines the borel ст-field ЩТ) on T.
1 Definition. Call the class J* permissible if it can be indexed by a T in such
a way that
(i) the function /(•,•) is ^ <g) ^(T)-measurable as a function from S <g) T
into the real line;
(ii) T is an analytic subset of a compact metric space T (from which it
inherits its metric and borel er-field). ?
Some authors call a T satisfying (ii) a Souslin measurable space [DM 16].
The usual sorts of class parametrized by borel subsets [DM 12] of an
euclidean space are permissible. (Take T as the one-point compactification.)
So are fancier classes such as all indicator functions of compact, convex
subsets of euclidean space (Problem 2).
Assume from now on that J* is permissible and that (Q, S, IP) is complete.
Here are the properties of analytic sets that make the definition of per-
permissibility a good one for empirical process applications. For every measur-
measurable space (M, Ji\
(a) si{M ® Г) contains the product ст-field Ji <g) ЩТ);
(b) for each H in s/(M <g) T), and in particular for each Ji <g) ^(T)-measur-
able set, the projection nMH of H onto M is in sf(M) [DM 13, DM 9:
the set H is also in srf(M <g) T), because T is analytic];
(c) for each A in s/(M) and each S/^-measurable map ц from (Q, S, IP)
into M, the set {rj e Л} is an analytic subset of Q [DM 11]; hence {rj e s/}
belongs to S, because (Q, S, IP) is complete.
From these properties we shall be able to deduce measurability for functions
defined by certain uncountable operations.

С. Measurability
197
Measurable Suprema
Suppose g{-, •) is an Ji <g) ^(T)-measurable real function on M <g) T Set
G(w) = supt g(m, 0- Then by (a), d(M <g) T) contains the set
Яа = {(m, t): g(m, t) ><x}.
The projection of Hx onto M is an analytic set, by (b). It consists of all those
m for which G(m) > a. Thus {G > a} belongs to =я/(М) for each real a. If
ц is a measurable map from a complete probability space (Q, S, IP) into M
then, by (c), the set {со: G{t]{a>)) > a} is $ -measurable. That is, supr g(r\((o\ t)
is an S-measurable function of со.
If !F is permissible and if P | ft \ < oo for each t, requirement (i) of Defini-
Definition 1 plus Fubini's theorem make Pft a measurable function of t. Apply the
argument given above, with M = S", Ji = the product er-field У™, ц = the
vector (fi,..., ?„), and
[/(s,-, 0 - ^/,
g(s, t) =
to deduce that sup^ \Pn(a>,f) — Pf\ is a measurable function of со.
Measurable Cross-Sections
The Symmetrization Lemma II.8 made use of another property of analytic
sets. We had a stochastic process {Zt: teT}; we assumed existence of a
random x for which, almost surely, \ZZ\ > г whenever sup,|Z,[ > e. For
this we need a cross-section theorem [DM44-45]. Under requirement (ii) of
Definition 1, and for any complete probability space (M, Ji, \i),
(d) if H belongs to srf{M ® T) there exists a measurable map h from M into
T и {oo} (where oo is an ideal point added to T) such that: (w, h(m))
belongs to H whenever h(m) ф oo; and Ыт) Ф oo for \i almost all m in
the projection nMH. Call h a measurable cross-section for H.
Write Z(co, t) to emphasize the role of Z as an $ <g) ^(T)-measurable
function on Q (g) T. Let {ej} be a strictly decreasing sequence of real numbers
converging to e, with ex = oo. Set
j = <m: ej+1 < sup \Z(co, t)\ <
j = {(со, t): sJ+1 < \Z(co, t)\ < sj}.

198 С. Measurability
The sets {Aj} all belong to S. The sets {Bj} all belong to S ® ЩТ), and hence
are analytic. Let t0 be any fixed element of T. Choose a measurable cross-
section tj for each 5,. Set т equal to т,- on Aj, and equal to t0 outside the
union of the {Aj}. Redefine т(ю) to be t0 whenever (d) would set it equal to
oo. For almost all со, if sup, | Z(co, t)\> e then (со, x(co)) belongs to Bj for some
j; that is,
ej+1 < Z(co, x(co)) < ?j for some j,
as required.
A formal proof of the Symmetrization Lemma II.8 is now possible.
Require Z and Z' to be defined on a product space Q ® Q' equipped with
product measure IP ® IP',
Z(co, со', t) = Z(co, t),
Z'(co, со', t) = Z'(co', t).
The t constructed above need depend only on со. For almost all со,
IP'{со1: \Z'(o)',z(co))\ <a} ^ /?.
The rest of the proof goes through as before, with Fubini's theorem
formalizing the conditioning argument.
Shattered Sets
Theorem 11.21 placed a condition on the behavior of Vn{?,x,..., ?„), the
smallest integer к such that 3) shatters no collection of к points from
{?i, • • • > ?«}• Assume that the indicator functions of sets in 3) form a per-
permissible class J*\ Then Vn is measurable.
For example, here is how to prove that {Vn < 2} belongs to S. Define
g(s1,s2, tu t2, t3, t4)as
f(su t])f(s2, tj) + f(su t2)[l - f(s2, t2)]
+ [1 - f(slt h)ms2, t3) + [1 - f(su t4)][l - f(s2, U)l
Clearly g is У2 ® ^(T4)-measurable. The function
G(co) = max sup дЩсо), ^(co), t)
is (f-measurable. The set {Vn < 2} equals {G < 4}.
Covering Numbers
Functions of the empirical measure that enter into arguments depending on
Fubini's theorem demand that we take measurability seriously. But for
other functions, such as the random covering numbers appearing in Sections

С. Measurability 199
П.5, П.6, VI.4, and VI.6, we do not really need all the machinery of analytic
sets. For example, we could just as well interpret a condition like
log N,E, Pn,^) = op(n)
in terms of outer measure: for each e > 0,
IP*{log JViOS, Р„, &) > ne} -»• 0.
The proofs go through almost exactly as before. Equivalently, we could
interpret the conditions on the covering numbers to mean JP{Zn > ns} -> 0
for some measurable random variable Zn greater than log JV,(E, Pn, #").
For permissible classes, another solution would be to replace covering
numbers by packing numbers: define М^д, Р„, .?) as the smallest m for
which there exist functionsfu...,fm in J* with Pn\fj — fk\ > $ for j Ф k.
This is a measurable function of со (and even jointly measurable in со and
S); the set
{со: М^д, Рп(со, •), У) > т]
equals the projection on Q of the S ® ^(Tm)-measurable set
(со, t): min n~l ? 1/F0»), tj) - /KM h)\ >
j*k i=l
The packing numbers are closely related to covering numbers:
M,{25, Р„, Г) < N,C, Pn, !F) < M,{5, Р„, IF).
Theorems stated in terms of random covering numbers have equivalent
versions for random packing numbers.
I cannot prove measurability for covering numbers of permissible
classes. The set of со where ЫгE, Р„, J5') strictly exceeds m is the complement
of a projection of a complement of an analytic set, which apparently need
not be measurable.
The Function Space Ж
For the results in Chapter VII we required the class J^ to be pointwise
bounded and have sup^r|P/| finite. The empirical process,
had bounded sample paths; the functions En(co, •) all belonged to the set Ж
of bounded, real functions on %F. To avoid confusion, we called members of
ЭС functionals. We equipped Ж with the uniform norm, |Jx|| = sup^ |x(/)|.
The limit processes had sample paths in the set C(#, P) of functionals that
were uniformly continuous with respect to the if 2(P) seminorm pP on !F.
The cr-field 38r was smallest for which
(i) all the closed balls (for ||-||) belonged to <MP;
(ii) all the finite-dimensional projections were ^-measurable.

200 С. Measurability
We assumed in Chapter VII that Е„ is <f/^-measurable. That is true for a
permissible OF that is separable under the pP seminorm.
Because each ?„( •, /) is a real random variable, the finite-dimensional
projections create no difficulty for the <f/^p-measurability. The properties
of analytic sets are needed to prove that {a>\ ||?„(со, •) — x(-)|| < r} belongs
to $ whenever x(-) belongs to C(J*, P). Introduce the index set T as described
in Definition 1. Problem 1 shows that x(/r) is ^(T)-measurable. Equip S"
with its product cr-field У. The function
g(s, t) =
~112
is Уп (х) ^(r)-measurable. The argument in Measurable Suprema estab-
establishes $ -measurability of sup, g(%{a>), t), which equals \\En(co, •) — x(-)\\.
Notes
Dudley A978) introduced a condition, which he termed Pe-Suslin, as a way
of handling the measurability problems for empirical processes indexed by
classes of sets. Since then he has refined the definition several times. He has
called the latest version (Dudley 1984) of the condition "image admissible
Suslin " (via a parameter space); it is almost the same as permissibility.
Le Cam A983) has also imposed a Suslin type of condition.
Dudley and Philipp A983) have systematically replaced measurability
assumptions by conditions framed using measurable cover functions (the
idea touched on in Covering Numbers—an unfortunate clash of ter-
terminology—with the Zn random variables).
Problems
[1] Suppose !F has a countable, dense subset {#,} under the У2(Р) seminorm pP.
Suppose also that J^ satisfies condition (i) of Definition 1. If x(-) is a bounded real
functional on 3F that is pP continuous, show that x(ft) is a measurable function
of t. [Assume x > 0. Represent x(ft) as
limsup sup x(gj){rft) < n'1},
where r/f) = pP(ft — gj). Use Fubini's theorem to prove measurability of r,-.]
[2] The class # of all non-empty, compact, convex subsets of the unit square [0, I]2
is permissible. [Equip % with the metric
d(Ci, C2) = inf{s > 0:Сх ? C\ and C2 ? C\),
where CE denotes the set of x less than s away from at least one point of C. Under
this metric, <% is compact (Eggleston 1969, Section 4.2). The set of (x, C) with x in
С is a closed subset of [0, l]2 (Я) (ё. Take 9S as its own indexing set.]
[3] The class of sets {/ > 0}, for / running through a permissible class, is itself per-
permissible.

References
Aldous, D. A978). Stopping times and tightness. Annals of Probability 6:335-340.
Alexander, K. S. A984a). Probability inequalities for empirical processes and a law of
the iterated logarithm. Annals of Probability. (Based on his 1982 Ph.D. dissertation,
MIT.) To appear.
Alexander, K. S. A984b). Rates of growth for weighted 'empirical processes. (The
Proceedings of the Neyman-Kiefer Memorial Conference, Berkeley, 1983. Belmont,
CA: Wadsworth.) To appear.
Alexandroff, A. D. A940). Additive set functions in abstract spaces. Mat. Sbornik
50(NS 8): 307-342. (Chapter 1.)
Alexandroff, A. D. A941). Additive set functions in abstract spaces. Mat. Sbornik
51(NS 9): 563-621. (Chapters 2 and 3.)
Alexandroff, A. D. A943). Additive set functions in abstract spaces. Mat. Sbornik
55(NS 13): 169-234. (Chapters 4 and 5.)
Araujo, A. and Gine, E. A980). The Central Limit Theorem for Real and Banach Valued
Random Variables. New York: Wiley.
Ash, R. B. A972). Real Analysis and Probability. New York: Academic Press.
Bennett, G. A962). Probability inequalities for the sum of independent random variables.
Journal of the American Statistical Association 57:33^-5
Bertrand-Retali, M. A974). Convergence uniforme d'un estimateur d'une densite de
probabilitedans Rs. ComptesRendusdeVAcademiedesSciences, ParisllS:451-453.
Bertrand-Retali, M. A978). Convergence uniforme d'un estimateur de la densite par
la methode du noyau. Revue Roumaine de Mathimatiques Pures et Appliques
23:361-385.
Billingsley, P. A968). Convergence of Probability Measures. New York: Wiley.
Billingsley, P. A971). Weak Convergence of Measures: Applications in Probability.
Philadelphia: Society for Industrial and Applied Mathematics. (Regional Confer-
Conference Series in Applied Mathematics # 5.)
Billingsley, P. and Tops0e, F. A967). Uniformity in weak convergence. Zeitschrift fur
Wahrscheinlichkeitstheorie und Verw. Geb. 7:1-16.
Bollobas, B. A979). Graph Theory: An Introductory Course. New York: Springer-Verlag.
Bolthausen, E. A978). Weak convergence of an empirical process indexed by the closed
convex subsets of I2. Zeitschrift fur Wahrscheinlichkeitstheorie und Verw. Geb.
43:173-181.

202 References
Breiman, L. A968). Probability. Reading, MA: Addison-Wesley.
Bremaud, P. A981). Point Processes and Queues: Martingale Dynamics. New York:
Springer-Verlag.
Breslow, N. and Crowley, J. A974). A large sample study of the life table and product
limit estimates under random censorship. Annals of Statistics 2:437^-53.
Brown, В. М. A971). Martingale central limit theorems. Annals of Mathematical
Statistics 42:59-66.
Brown, В. М. A983). Statistical uses of the spatial median. Journalof the Royal Statistical
Society, Series В 45:25-30.
Brown, В. М. and Kildea, D. G. A979). Outlier-detection tests and robust estimators
based on signs of residuals. Communications in Statistics—Theory and Methods
A8:257-269.
Chernoff, H. A954). On the distribution of the likelihood ratio. Annals of Mathe-
Mathematical Statistics 25:573-578.
Chernoff, H. A956). Large sample theory: parametric case. Annals of Mathematical
Statistics 27:1-22.
Chibisov, D. M. A965). An investigation of the asymptotic power of tests of fit. Theory
of Probability and Its Applications 10:421^-37.
Chung, K. L. A949). An estimate concerning the Kolmogorov limit distribution.
Transactions of the American Mathematical Society 67:36-50.
Chung, K. L. A968). A Course in Probability Theory. New York: Harcourt, Brace and
World.
DeHardt, J. A971). Generalizations of the Glivenko-Cantelli theorem. Annals of
Mathematical Statistics 42:2050-2055.
Dellacherie, C. A972). Capacitis et Processus Stochastiques. Heidelberg-Berlin:
Springer-Verlag.
Dellacherie, С and Meyer, P-A. A978). Probabilities and Potential, Part A. Amsterdam:
North-Holland. Translation of the 1975 French second edition of the first four
chapters from Meyer A966).)
Dellacherie, С and Meyer, P-A. A982). Probabilities and Potential, Part B. Amsterdam:
North-Holland.
Denby, L. and Martin, R. D. A979). Robust estimation of the first-order autoregressive
parameter. Journal of the American Statistical Association 74:140-146.
Donsker, M. D. A951). An invariance principle for certain probability limit theorems.
Memoirs of the American Mathematical Society 6:1-12.
Donsker, M. D. A952). Justification and extension of Doob's heuristic approach to the
Kolmogorov-Smirnov theorems. Annals of Mathematical Statistics 23:277-
281.
Doob, J. L. A949). Heuristic approach to the Kolmogorov-Smirnov theorems. Annals
of Mathematical Statistics 20:393^103.
Doob, J. L. A953). Stochastic Processes. New York: Wiley.
Dudley, R. M. A966a). Weak convergence of probabilities on nonseparable metric
spaces and empirical measures on euclidean spaces. Illinois Journal of Mathe-
Mathematics 10:109-126.
Dudley, R. M. A966b). Convergence of Baire measures. Studia Mathematica
27:251-268.
Dudley, R. M. A967a). Measures on non-separable metric spaces. Illinois Journal of
Mathematics 11:449-453.
Dudley, R. M. A967b). The sizes of compact subsets of Hilbert space and continuity
of gaussian processes. Journal of Functional Analysis 1:290-330.
Dudley, R. M. A968). Distances of probability measures and random variables. Annals
of Mathematical Statistics 39:1563-1572.
Dudley, R. M. A973). Sample functions of the gaussian process. Annals of Probability
1:66-103.

References 203
Dudley, R. M. A974). Metric entropy of some classes of sets with differentiable boun-
boundary. Journal of Approximation Theory 10:227-236.
Dudley, R. M. A976). Probabilities and Metrics: Convergence of laws on metric spaces,
with a view to statistical testing. Aarhus: Aarhus University. (Mathematics Insti-
Institute Lecture Note Series #45.)
Dudley, R. M. A978). Central limit theorems for empirical measures. Annals of Prob-
Probability 6:899-929. (Correction, ibid. 7 A979):909-911.)
Dudley, R. M. A981a). Donsker classes of functions. In Csorgo, M., Dawson, D. A.,
Rao, J. N. K., and Salek, A. K. Md. E. (Editors), Statistics and Related Topics,
pages 341-352. Amsterdam: North-Holland.
Dudley, R. M. A981b). Vapnik-Cervonenkis Donsker classes of functions. In Aspects
Statistiques et Aspects Physiques des Processus Gaussiens. CNRS, Paris, St. Flour
1980.
Dudley, R. M. A984). A Course on Empirical Processes. Springer Lecture Notes in
Mathematics. (Lectures given at Ecole d'Ete de Probabilites de St. Flour, 1982.)
To appear.
Dudley, R. M. and Philipp, W. A983). Invariance principles for sums of Banach space
valued random elements and empirical processes. Zeitschrift fur Wahrschein-
lichkeitstheorie und Verw. Geb. 62:509-552.
Durbin, J. A973a). Weak convergence of the sample distribution function when para-
parameters are estimated. Annals of Statistics 1:279-290.
Durbin, J. A973b). Distribution Theory for Tests Based on the Sample Distribution
Function. Philadelphia: Society for Industrial and Applied Mathematics. (Regional
Conference Series in Applied Mathematics # 9.)
Durst, M. and Dudley, R. M. A981). Empirical processes, Vapnik-Cervonenkis classes
and Poisson processes. Probability and Mathematical Statistics {Wroclaw)
1:109-115.
Eggleston, H. G. A969). Convexity. Cambridge: Cambridge University Press.
Erdos, P. and Kac, M. A946). On certain limit theorems in the theory of probability.
Bulletin of the American Mathematical Society 52:292-302.
Erdos, P. and Kac, M. A947). On the number of positive sums of independent random
variables. Bulletin of the American Mathematical Society 53:1011-1020.
Feller, W. A968). An Introduction to Probability Theory and Its Applications, Vol. I
(Third edition). New York: Wiley.
Feller, W. A971). An Introduction to Probability Theory and Its Applications, Vol. II
(Second edition). New York: Wiley.
Gaenssler, P. A984). Empirical Processes. Institute of Mathematical Statistics Lecture
Notes—Monograph Series 3.
Gaenssler, P. and Revesz, P. (Editors) A976). Empirical Distributions and Processes.
Springer Lecture Notes in Mathematics 566.
Gaenssler, P. and Stute, W. A979). Empirical processes: a survey of results for inde-
independent and identically distributed random variables. Annals of Probability
7:193-243.
Gihman, I. I. and Skorohod, A. V. A969). Introduction to the Theory of Random
Processes. Philadelphia: Saunders. (Earlier version of the three volumes 1974,
1975, 1979.)
Gihman, 1.1. and Skorohod, A. V. A974). The Theory of Stochastic Processes, I. New
York: Springer-Verlag.
Gihman, 1.1. and Skorohod, A. V. A975). The Theory of Stochastic Processes, II. New
York: Springer-Verlag.
Gihman, I. I. and Skorohod, A. V. A979). The Theory of Stochastic Processes, III.
New York: Springer-Verlag.
Gill, R. D. A980). Censoring and Stochastic Integrals. Mathematical Centre Tract 124.
Amsterdam: Mathematisch Centrum.

204 References
Gine, E. and Zinn, J. A984). On the central limit theorem for empirical processes. Annals.
of Probability. To appear.
Glaisher, J. W. L. A872). On the law of facility of errors of observations, and on the
method of least squares. Memoirs of the Royal Astronomical Society 39:75-124.
Gnedenko, B. V. A968). The Theory of Probability (Fourth edition). New York:
Chelsea.
Hajek, J. A965). Extension of the Kolmogorov-Smirnov test to regression alternatives.
In J. Neyman and L. M. Le Cam (Editors), Bernoulli, Bayes, Laplace: Anniversary
Volume, pages 45-60. New York: Springer-Verlag.
Hall, P. and Heyde, С. С A980). Martingale Limit Theory and Its Applications. New
York: Academic Press.
Halmos, P. R. A969). Measure Theory. New York: Van Nostrand.
Hansel, G. and Troallic, J. P. A978). Mesures marginales et theoreme de Ford-
Fulkerson. Zeitschriftfur Wahrscheinlichkeitstheorie und Verw. Geb. 43:245-251.
Hartigan, J. A. A975). Clustering Algorithms. New York: Wiley.
Hartigan, J. A. A978). Asymptotic distributions for clustering criteria. Annals of
Statistics 6:117-131.
Hoeffding, W. A963). Probability inequalities for sums of bounded random variables.
Journal of the American Statistical Association 58:13-30.
Huber, P. J. A967). The behavior of maximum likelihood estimates under nonstandard
conditions. In Fifth Berkeley Symposium on Mathematical Statistics and Prob-
Probability, pages 221-233. Berkeley, CA: University of California.
Jacobs, K. A978). Measure and Integral. New York: Academic Press.
Jacobsen, M. A982). Statistical Analysis of Counting Processes. Springer Lecture Notes
in Statistics 12.
Kac, M. A949). On deviations between theoretical and empirical distributions. Pro-
Proceedings of the National Academy of Science USA 35:252-257.
Kelley, J. L. A955). General Topology. New York: Van Nostrand.
Kolchinsky, V. I. A982). On the central limit theorem for empirical measures. Theory
of Probability and Mathematical Statistics (Kiev) 24:71-82.
Kolmogorov, A. N. A933). Sulla determinazione empirica di una legge di distribuzione.
Giornale dell'Institute Ital. degli Attuari 4:83-91.
Kolmogorov, A. N. A956). On Skorohod convergence. Theory of Probability and Its
Applications 1:213-222.
Kurtz. T. G. A981). Approximation of Population Processes. Philadelphia: Society for
Industrial and Applied Mathematics. (Regional Conference Series in Applied
Mathematics #36.)
Le Cam, L. A957). Convergence in distribution of stochastic processes. University of
California Publications in Statistics 2:207-236.
Le Cam, L. A983). A remark on empirical measures. In Bickel, P., Doksum, K., and
Hodges, J. (Editors), Festschrift for E. L. Lehmann, pages 305-327. Belmont, CA:
Wadsworth.
Levy, P. A922). Sur la determination des lois de probability par leurs fonctions carac-
teristiques. Comptes Rendus de VAcademie des Sciences, Paris 175:854—856.
Liapounoff, A. M. A900). Sur une proposition de la theorie des probabilites. Bulletin
de VAcademie imperiale des Sciences de St. Petersbourg 13:359-386.
Liapounoff, A. M. A901). Nouvelle forme du theoreme sur la limite de probabilites.
Memoires de VAcademie imperiale des Sciences de St. Petersbourg 12:Number 5.
Lindeberg, J. W. A922). Eine neue Herleitung des Exponentialgesetzes in der Wahr-
scheinlichkeitsrechnung. Mathematische Zeitschrift 15:211-225.
Lindvall, T. A973). Weak convergence of probability measures and random functions
in the function space Z>[0, oo). Journal of Applied Probability 10:109-121.
Liptser, R. S. and Shiryayev, A. N. A977). Statistics of Random Processes, I: General
Theory. New York: Springer-Verlag.

References 205
Liptser, R. S. and Shiryayev, A. N. A978). Statistics of Random Processes, II: Applica-
Applications. New York: Springer-Verlag.
Loeve, M. A978). Probability Theory, II. (Fourth Edition) New York: Springer-Verlag.
Major, P. A978). On the invariance principle for sums of independent identically
distributed random variables. Journal of Multivariate Analysis 8:487-517.
Mann, H. B. and Wald, A. A943). On stochastic limit and order relationships. Annals
of Mathematical Statistics 14:217-226.
Maritz, J. S. A981). Distribution-Free Statistical Methods. London: Chapman and Hall.
McKean, H. P., Jr. A969). Stochastic Integrals. New York: Academic Press.
McLeish, D. L. A974). Dependent central limit theorems and invariance principles.
Annals of Probability 2:620-628.
Metivier, M. A982). Semimartingales: A Course on Stochastic Processes. Berlin: De-
Gruyter.
Meyer, P-A. A966). Probability and Potential. (First edition). Boston: Blaisdell.
Neveu, J. A975). Discrete-Parameter Martingales. Amsterdam:North-Holland.
Neyman, J. A937). 'Smooth' test for goodness of fit. Skandinavisk Aktuarietidskrift
20:149-199. (Reprinted in "A Selection of Early Statistical Papers of J. Neyman,"
Cambridge: Cambridge University Press, 1967.)
Oxtoby, J. С A971). Measure and Category. New York: Springer-Verlag.
Parr, W. C. A982). Minimum distance estimation: a bibliography. Communications in
Statistics, /4 11:1511-1518.
Parthasarathy, K. R. A967). Probability Measures on Metric Spaces. New York:
Academic Press.
Pickands, J. A971). The two-dimensional Poisson process and extremal processes.
Journal of Applied Probability 8:745-756.
Pollard, D. A980). The minimum distance method of testing. Metrika 27:43-70.
Pollard, D. A981a). Limit theorems for empirical processes. Zeitschriftfur Wahrschein-
lichkeitstheorie und Verw. Geb. 57:181-195.
Pollard, D. A981b). Strong consistency of A>means clustering. Annals of Statistics
9:135-140.
Pollard, D. A982a). Beyond the heuristic approach to Kolmogorov-Smirnov theorems.
In Gani J. and Hannan E. J. (Editors), Essays in Statistical Science. Journal
of Applied Probability, Special Volume 19A, pages 359-365. (Festschrift for P. A. P.
Moran.)
Pollard, D. A982b). Quantization and the method of fc-means. IEEE Transactions on
Information Theory IT-28:199-205.
Pollard, D. A982c). A central limit theorem for empirical processes. Journal of the
Australian Mathematical Society (Series A) 33:235-248.
Pollard, D. A982d). A central limit theorem for A:-means clustering. Annals of Prob-
Probability 10:919-926.
Prohorov, Yu. V. A956). Convergence of random processes and limit theorems in
probability. Theory of Probability and Its Applications 1:157-214.
Руке, R. A969). Applications of almost surely convergent constructions of weakly
convergent processes. Springer Lecture Notes in Mathematics 89:187-200.
Руке, R. A970). Asymptotic results for rank statistics. In M. L. Puri (Editor), Non-
parametric Techniques in Statistical Inference, pages 21-37. Cambridge: Cambridge
University Press.
Руке, R. and Shorack, G. A968). Weak convergence of a two-sample empirical process
and a new approach to Chernoff-Savage theorems. Annals of Mathematical
Statistics 39:755-771.
Руке, R. and Sun, T. G. A982). Weak convergence of empirical processes. (University
of Washington Statistics Department Technical Report # 19.)
Ranga Rao, R. A962). Relations between weak and uniform convergence of measures
with applications. Annals of Mathematical Statistics 33:659-680.

206 References
Resnick, S. I. A975). Weak convergence to extremal processes. Annals of Probability
3:951-960.
Revesz, P. A976). Three theorems of multivariate empirical process. Springer Lecture
Notes in Mathematics 566:106-126.
Sauer, N. A972). On the density of families of sets. Journal of Combinatorial Theory (A)
13:145-147.
Shelah, S. A971). Stability, the f.c.p., and superstability; model theoretic properties of
formulas in first-order theory. Annals of Mathematical Logic 3:271-362.
Shelah, S. A972). A combinatorial problem; stability and order for models and theories
in infinitary languages. Pacific Journal of Mathematics 41:247-261.
Shiryayev, A. N. A981). Martingales: recent developments, results and applications.
International Statistical Review 49 :199-233.
Silverman, B. W. A978). Weak and strong uniform consistency of the kernel estimate
of a density and its derivatives. Annals of Statistics 6:177-184.
Simmons, G. F. A963). Introduction to Topology and Modern Analysis. New York:
McGraw-Hill.
Skorohod, A. V. A956). Limit theorems for stochastic processes. Theory of Probability
and Its Applications 1:261-290.
Skorohod, A. V. A957). Limit theorems for stochastic processes with independent
increments. Theory of Probability and Its Applications 2:138-171.
Steele, J. M. A975). Combinatorial Entropy and Uniform Limit Laws. Ph.D. thesis,
Stanford. (Reproduced by University Microfilms, Ann Arbor.)
Steele, J. M. A978). Empirical discrepancies and subadditive processes. Annals of
Probability 6:118-127.
Stone, С A963). Weak convergence of stochastic processes defined on semi-infinite
time intervals. Proceedings of the American Mathematical Society 14:694-696.
Straf, M. A969). A general Skorohod space and its applications to the weak convergence
of stochastic processes with several parameters. Ph.D. thesis, University of Chicago.
Strassen, V. A965). The existence of probability measures with given marginals. Annals
of Mathematical Statistics 36:423-439.
Strassen, V. and Dudley, R. M. A969). The central limit theorem and e-entropy. Springer
Lecture Notes in Mathematics 89:224-231.
Stute, W. A982a). The oscillation behavior of empirical processes. Annals of Probability
10:86-107.
Stute, W. A982b). A law of the logarithm for kernel density estimators. Annals of
Probability 10:414-422.
Talagrand, M. A978). Les boules peuvent-elles engendrer la tribu borelienne d'un
espace metrisable non separable? Siminaire Choquet, Paris 17e annee: C5.
Tong, Y. L. A980). Probability Inequalities in Multivariate Distributions. New York:
Academic Press.
Tops0e, F. A970). Topology and Measure. Springer Lecture Notes in Mathematics 133.
Uspensky, J. V. A937). Introduction to Mathematical Probability. New York: McGraw-
Hill.
Vapnik, V. N. and Cervonenkis, A. Ya. A971). On the uniform convergence of
relative frequencies of events to their probabilities. Theory of Probability and Its
Applications 16:264-280.
Vapnik, V. N. and Cervonenkis, A. Ya. A981). Necessary and sufficient conditions for
the uniform convergence of means to their expectations. Theory of Probability and
Its Applications 26:532-553.
Varadarajan, V. S. A965). Measures on topological spaces. American Mathematical
Society Translations 48:161-228. (Translation from Mat. Sbornik 55 (97)
A961): 35-100.)

References 207
Weierstrass, К. A885). tJber die analytische Darstellbarkeit sogenannter willkurlicher
Funktionen reeller Argumente. Sitzungsberichte der Preussischen Akademie der
Wissenschaften:633-640, 789-906. (Also in Collected Works, Vol. Ill, pp. 1-37,
Berlin: Mayer u. Muller 1903.)
Whitt, W. A970). Weak convergence of probability measures on the function space
C[0, oo). Annals of Mathematical Statistics 41:939-944.
Whitt, W. A980). Some useful functions for functional limit theorems. Mathematics
of Operation Research 5:67-85. (First circulated as preprint in 1970.)
Wichura, M. A971). A note on the weak convergence of stochastic processes. Annals
of Mathematical Statistics 42:1769-1772.

Author Index
Aldous, D. 133, 137, 185, 187
Alexander, K. S. 38, 166
Alexandroff, A. D. 85
Amato, B. R. viii
Araujo, A. 37
Ash, R. B. 39
Donsker, M. D. 117-118
Doob, J. L. 3,4,64,113,117
Dudley, R. M. vii, 37-39, 85-86,
117-118, 166-168, 193,200
Durbin, J. 118
Durst, M. 39
Barry, D. G. viii
Bennett, G. 193-194
Bertrand-Retali, M. 38, 41
Billingsley, P. 36, 61, 85-86, 117-118,
136
Bollobas, B. 86
Bolthausen, E. 167
Boyce, J. S. viii
Breiman, L. 4,22,106,118
Bremaud, P. 186
Breslow, N. 186
Brown, В. М. 166, 185
Cervonenkis, A. Ya. 37
Chernoff, H. 165, 190
Chibisov, D. M. 85
Chung, K. L. 117,173
Crowley, J. 186
DeHardt, J. 36
Dellacherie, С 176, 179, 185, 195-197
Denby, L. 37
Eggleston, H. G. 200
Erdos, P. 118
Feller, W. 53, 63, 193
Gaenssler, P. vii, 36-37, 40, 118, 167
Gihman, 1.1. 4, 117-118, 136-137, 166,
185
Gill, R. D. 186
Gine, E. 37, 40, 166-167
Glaisher, J. W. L. 61
Gnedenko, B. V. 37
Hajek.J. 86,117
Hall, P. 185
Halmos, P. R. 39
Hansel, G. 86
Hartigan, J. A. viii, 36, 166
Heyde, С. С. 185
Hoeffding, W. 193-194
Huber, P. J. 38, 165, 168

210
Author Index
Jacobs, K. 86
Jacobsen, M. 186
Kac, M. 117-118
Kelley, J. L. 39
Kildea, D. G. 166
Kolchinsky, V. I. 38, 166
Kolmogorov, A. N. 99, 117, 136, 166
Kurtz, T. G. 137
Le Cam, L. 37-38, 85, 166-167, 200
Levy, P. 61
Liapounoff, A. M. 61, 63
Lindeberg, J. W. 61
Lindvall, T. 118,136
Liptser, R. S. 185
Loeve.M. 118
Ranga Rao, R. 86
Resnick, S. I. 136
Revesz, P. 118,167
Sauer, N. 37
Shelah, S. 37
Shiryayev, A. N. 185
Shorack, G. vii-viii, 85
Silverman, B. W. 38
Simmons, G. F. 67-68, 85
Skorohod, A. V. 4, 86, 117-118, 122,
136-137, 166, 185
Steele, J. M. 37
Stone, С 118,136
Straf, M. 136
Strassen, V. 86, 167
Stute, W. vii, 36-38, 117
Sun, T. G. 167
Major, P. 86
Mann, H. B. 189
Maritz, J. S. 166
Martin, R. D. 37
McKean, H. P. 146
McLeish, D. L. 185
Metivier, M. 185
Meyer, P-A. 176, 179, 185, 195-197
Talagrand, M. 85
Tong, Y. L. 41
Tops0e, F. 36-37, 62, 85-86
Troallic, J. P. 86
Uspensky, J. V. 61, 193
Neveu, J. 22
Neyman, J. 47
Nolan, D. A. viii
Oxtoby, J. С 65, 86
Parr, W. 118
Parthasarathy, K. R. 85, 136
Philipp, W. 166, 200
Pickands, J. 136
Pollard, D. 36-39, 118, 166, 169
Prohorov, Yu. V. 85-86
Руке, R. vii, 85-86, 167
Vapnik, V. N. 37
Varadarajan, V. S. 85
Wald, A. 189
Weierstrass, K. 61
Wellner, J. viii
Whitt, W. 118,136
Wichura, M. 86,117,136
Yu, K. F. 186
Zinn, J. 37, 38, 40, 166-167

Subject Index
abuse, of notation 44,171
adapted 176
Aldous's condition 133
Allocation Lemma 77
almost-sure representation. See
Representation Theorem
Approximation Lemma 27
autoregression 12, 174
axiom of choice 65, 86
BH. See brownian motion, stretched out
BP. See P-motion
3$p 156
Bennett's Inequality 160, 192, 194
Bernstein's inequality 193
bias, of density estimator 35,42
binomial coefficient 19
bounded variation 42
brownian bridge 64
denned 3,95
existence 101, 119
brownian motion
construction from brownian bridge
103
denned 95
modulus of continuity for 146
stretched-out 178
tied-down 3
156
108
P)
C[0, oo)
C[0, 1] 1, 90
cadlag 3, 89, 176
Cauchy sequence 81
censored data, estimation from 182
Central Limit Theorem
Empirical, for distribution functions
97
Empirical, uniform case 96
Liapounoff 51
Lindeberg 52
Multivariate 57
central limit theorem
autoregression 174
empirical process 157,163-165
martingale-difference array 171
minimization functional 141
Chaining Lemma 144
chaining 37
restricted 160-163
for stochastic processes 142-145
characteristic function 54
Continuity Theorem 55, 60
inversion formula 63
uniform convergence on compacta 59
uniquely determines distribution 55
chi-square, Pearson's 46
clustering. See A>means
combinatorial method 13
compactness
in metric spaces 81-82
sequential 82
Compactness Theorem 82

212
Subject Index
completely regular
point of metric space 67
topological space 68
conditional variance
of increment 171
process 176
continuity set 71
Continuity Theorem, for characteristic
functions 55, 60
Continuous Mapping Theorem
for euclidean space 46
failure of 66
for metric spaces 70
consistency
for autoregression 12
of Л-means 9
of optimal solutions 12
convergence in distribution
Aldous's condition for, in D[0, oo)
134
in D[0, oo), for Skorohod metric 131
in D[Q, oo), for uniform convergence
on compacta 108
of empirical process 97,157
in euclidean space 43
of L2 martingale 179
in metric spaces 65
process with independent increments
104, 135
via semicontinuity 73
via uniform approximation 70
of uniform empirical process 96
Convergence Lemma
for euclidean space 44-46
for metric spaces 68
convex sets, shattering 17, 23
convolution 35, 49, 54
coupling 76-80
in Compactness Theorem 84
covering integral
for metric space 143
as modulus of continuity 147
covering numbers
direct 164
inequalities between 34
?C1, for classes of functions 25
?C2, for classes of functions 31
measurability of 198
for metric spaces 143
for polynomial classes 27
random 150
criterion function
minimization of 28
random 12
cross-section, measurable
crossword puzzle 76
197
D[0, oo)
definition of 107
metric of uniform convergence on
compacta 108
Skorohod metric 123
Z>[0, 1] 3, 89-90
A.® 21
delta method 63, 189
density
convolution 55
estimation 35
differentiability, in quadratic mean 140,
151
direct approximation 8, 164
discrimination, linear; polynomial;
quadratic 17
distribution function, pointwise
convergence of 43, 46, 53
Donsker's theorem. See Central Limit
Theorem, Empirical
Doob-Meyer decomposition 176
l-argument 24
Е„. See empirical process
EP. See P-bridge
empirical measure
bivariate 12
defined 6
empirical process
central limit theorem for 96-97
defined 2, 95, 97, 140
measurability of 65, 199
envelope, for classes of functions 24,
151
equicontinuity
of class of functions 74
stochastic 139
Equicontinuity Lemma 150
ergodicity 9, 12
Exponential Bounds 16
exponential bound. See tail probability
extreme-value distribution 128
fidi projection 92. See also projection
maps
fidis (finite-dimensional distributions)
3, 90, 92
function space Ж 155, 199

Subject Index
213
functional
first passage time 109, 124
in function space 156
jump 179
terminology 2, 7
functions, infinitely differentiable
49
gaussian process
characterization 106
indexed by functions 146
Glivenko-Cantelli theorem
classical 6-7, 13
generalized, for classes of functions
25
generalized, for classes of sets 18, 22
goodness-of-fit statistic 2
with estimated parameters 99,159
Kolmogorov's, limiting distribution of
113
Neyman's 47
graph, of real-valued function 27
grid, approximations constructed from
91, 124
heuristic argument, Doob's 4, 64
Hewitt-Savage, zero-one law of 22
Hoeffding's Inequality 16, 26, 31, 150,
164, 191
independent increments
convergence in distribution
definition 103
174
16,21
104, 135
innovations
Integration
interpolation
continuous
in Z>[0, oo)
in Z>[0, 1]
101
124
92
in a function space 158
JF) xiv. See also covering integral
Kaplan-Meier estimator 182
kernel
smoothing 35
translates of 42
fc-means
central limit theorem 153
consistency 9, 30
method of 9
Laplace transform, for hitting time 112
Liapounoff condition 51
Lindeberg condition 52, 106
for martingale-difference arrays 171
Lindelof's theorem 68, 72, 87
Lipschitz condition 45, 74, 101
little links 163
location, center of 28
marginal stacks, with coupling 77
markov property for empirical process
96
marriage lemma 77
martingale
continuous time 176
difference array 171
L2 176
L2, convergence to gaussian process
179
reversed 39
reversed, empirical measure as 22
stopped 180
Maximal Inequality
maximal inequality
for cadlag processes
by chaining 144
for L2-martingales
maximum likelihood
measurability, for random elements of
metric spaces 64-66
median
central limit theorem for 53, 98
consistency 7
spatial, central limit theorem for 152
spatial, consistency of 28
M-estimator
generalized 12
stochastic equicontinuity 165
metric
bounded-Lipschitz 74
entropy 166
Prohorov 75, 79, 88
Skorohod, on Z>[0, oo) 123-124
minimization of random criterion
function 140
modulus function for D[0, oo) 125
multinomial distribution 46
15
94
177
138
NF) xiv. See also covering number
|| • ||. See norm
norm 14, 24, 156

214
Subject Index
normal distribution
characteristic function of
multivariate 29
tail probabilities 191
optimal centers 10
outer measure 70
63
remainder term, in Taylor expansion
50, 139
Representation Theorem
with random elements 71
with random variables 58
pP xiv
robustnik 75
&. See projection a-field
Pn. See empirical measure
P°n 15, 26, 149
partial sum process 106, 109
P-bridge 149
permissible
classes of functions 24
classes of sets 17
definition 196
7ts. See projection maps
P-motion 147
polynomial
bound for number of sets picked out
19
class 17,27
discriminating 17
discrimination 17,20
orthogonal 47
poisson process 105, 129
predictable 176
product a-field 69
projection, measurability of 196
projection maps 66, 90
forD[0, oo) 130
projection <7-field 66, 90
on D[0, oo) 127
on D[0, 1] 87
Prohorov's theorem. See Compactness
Theorem
quadratic variation process 185
quantile transformation 57, 97, 129
random element of metric space 1, 65
random vector
characteristic functions 56
convergence in distribution of 43,
56
perturbation of 48
rates of convergence 30
reflection principle, for brownian motion
112
sample path 1
Scheffe's lemma 61
semicontinuity 73, 87
seminorm xiv
separability
of C[0, 1] 87
of D[0, oo), under Skorohod metric
127
and a-fields 85
for subsets of metric spaces 67
universal 38
shatter 18,21,198
sigma field. See also $gp, 0>
generated by balls 87
sign variables 15
signed measure. See P°n
Slutsky's theorem 62
Souslin 196
square-root trick 32, 37
Stirling's approximation 21
stochastic
equicontinuity 139-140
order symbol 141, 189
process 1
stopping time 110, 172
Strasbourg theory 176
strong markov property, for brownian
motion 111
submartingale 178
reversed 22,25
subsets
hidden 19
picked out 18
substitution, of increments 49
symmetrization 32, 149
inequality 14-15
Symmetrization, First 14
Symmetrization Lemma 14, 198
Symmetrization, Second 15
symmetry, and empirical measures 21
tail probability
characteristic function bound
exponential bound 160,191
60

Subject Index
215
tied-down brownian motion. See
brownian bridge
tight measure
on metric space 81
on real line 60
total boundedness
of metric space 82
and P-motion 168
triangular array 51, 106
truncation 25
2-means. See A>means
uniform tightness
for general empirical processes 156
for uniform empirical processes 102
usual conditions 176
VC class 37
vector space, finite-dimensional 20,
30,38
Un. See empirical process, uniform
uniform convergence of density estimator
36
uniform integrability 176
uniform strong law of large numbers 7
for classes of functions 25, 168
for classes of sets 18, 22
for convex sets 22
weak convergence
in euclidean space 44
in metric spaces 65
weight functions 158
3C. See function space