Текст
                    David Pollard
Convergence of
Stochastic Processes
With 36 Illustrations
i
Springer-Verlag
New York Berlin Heidelberg Tokyo


David Pollard Department of Statistics Yale University New Haven, CT 06520 U.S.A. AMS Subject Classifications: 60F99, 60G07, 60H99, 62M99 Library of Congress Cataloging in Publication Data Pollard, David Convergence of stochastic processes. (Springer series in statistics) Bibliography: p. Includes index. 1. Stochastic processes. 2. Convergence. I. Title. II. Series. QA274.P64 1984 519.2 84-1401 © 1984 by Springer-Verlag New York Inc. All rights reserved. No part of this book may be translated or reproduced in any form without written permission from Springer-Verlag, 175 Fifth Avenue, New York, New York 10010, U.S.A. Typeset by Composition House Ltd., Salisbury, England. Printed and bound by R. R. Donnelley & Sons, Harrisonburg, Virginia. Printed in the United States of America. 987654321 ISBN 0-387-90990-7 Springer-Verlag New York Berlin Heidelberg Tokyo ISBN 3-540-90990-7 Springer-Verlag Berlin Heidelberg New York Tokyo
To Barbara Amato
Preface A more accurate title for this book might be: An Exposition of Selected Parts of Empirical Process Theory, With Related Interesting Facts About Weak Convergence, and Applications to Mathematical Statistics. The high points are Chapters II and VII, which describe some of the developments inspired by Richard Dudley's 1978 paper. There I explain the combinatorial ideas and approximation methods that are needed to prove maximal inequalities for empirical processes indexed by classes of sets or classes of functions. The material is somewhat arbitrarily divided into results used to prove consistency theorems and results used to prove central limit theorems. This has allowed me to put the easier material in Chapter II, with the hope of enticing the casual reader to delve deeper. Chapters III through VI deal with more classical material, as seen from a different perspective. The novelties are: convergence for measures that don't live on borel cr-fields; the joys of working with the uniform metric on D[0, 1]; and finite-dimensional approximation as the unifying idea behind weak convergence. Uniform tightness reappears in disguise as a condition that justifies the finite-dimensional approximation. Only later is it exploited as a method for proving the existence of limit distributions. The last chapter has a heuristic flavor. I didn't want to confuse the martingale issues with the martingale facts. My introduction to empirical processes came during my 1977-8 stay with Peter Gaenssler and Winfried Stute at the Ruhr University in Bochum, while I was supported by an Alexander von Humboldt Fellowship. Peter and I both spent part of 1982 at the University of Washington in Seattle, where we both gave lectures and absorbed the empirical process wisdom of Ron Руке and Galen Shorack. The published lecture notes (Gaenssler 1984) show how closely our ideas have evolved in parallel since Bochum. I also
viii Preface had the privilege of seeing a draft manuscript of a book on empirical processes by Galen Shorack and Jon Wellner. At Yale I have been helped by a number of friends. Dan Barry read and criticized early drafts of the manuscript. Deb Nolan did the same for the later drafts, and then helped with the proofreading. First Jeanne Boyce, and then Barbara Amato, fed innumerable versions of the manuscript into the DEC-20. John Hartigan inspired me to think. The National Science Foundation has supported my research and writing over several summers. I am most grateful to everyone who has encouraged and aided me to get this thing finished.
Contents Notation xiii CHAPTER I Functionals on Stochastic Processes 1 1. Stochastic Processes as Random Functions 1 Notes 4 Problems 4 CHAPTER II Uniform Convergence of Empirical Measures 6 1. Uniformity and Consistency 6 2. Direct Approximation 8 3. The Combinatorial Method 13 4. Classes of Sets with Polynomial Discrimination 16 5. Classes of Functions 24 6. Rates of Convergence 30 Notes 36 Problems 38 CHAPTER III Convergence in Distribution in Euclidean Spaces 43 1. The Definition 43 2. The Continuous Mapping Theorem 44 3. Expectations of Smooth Functions 48 4. The Central Limit Theorem 50 5. Characteristic Functions 54 6. Quantile Transformations and Almost Sure Representations 57 Notes 61 Problems 62
Contents CHAPTER IV Convergence in Distribution in Metric Spaces 64 1. Measurability 64 2. The Continuous Mapping Theorem 66 3. Representation by Almost Surely Convergent Sequences 71 4. Coupling 76 5. Weakly Convergent Subsequences 81 Notes 85 Problems 86 CHAPTER V The Uniform Metric on Spaces of Cadlag Functions 89 1. Approximation of Stochastic Processes 89 2. Empirical Processes 95 3. Existence of Brownian Bridge and Brownian Motion 100 4. Processes with Independent Increments 103 5. Infinite Time Scales 107 6. Functionals of Brownian Motion and Brownian Bridge 110 Notes 117 Problems 118 CHAPTER VI The Skorohod Metric on ?>[0, oo) 122 1. Properties of the Metric 122 2. Convergence in Distribution 130 Notes 136 Problems 137 CHAPTER VII Central Limit Theorems 138 1. Stochastic Equicontinuity ' 138 2. Chaining 142 3. Gaussian Processes 146 4. Random Covering Numbers 149 5. Empirical Central Limit Theorems 155 6. Restricted Chaining 160 Notes 165 Problems 167 CHAPTER VIII Martingales 170 1. A Central Limit Theorem for Martingale-Difference Arrays 170 2. Continuous Time Martingales 176 3. Estimation from Censored Data 182 Notes 185 Problems 186
Contents xi APPENDIX A Stochastic-Order Symbols 189 APPENDIX В Exponential Inequalities 191 Notes 193 Problems 193 APPENDIX С Measurability 195 Notes 200 Problems 200 References 201 Author Index 209 Subject Index 211
Notation Integrals and expectations are written in linear functional notation; sets are identified with their indicator functions. Thus, instead of \A f{x)JP{dx) write JP(fA). When the variable of integration needs to be identified, as in iterated integrals, I return to the traditional notation. And orthodoxy constrains me to write J f(x) dx for the lebesgue integral, in whatever dimension is appropriate. If unspecified, the domain of integration is the whole space. Abbreviations can stand for a probability measure or a random variable distributed according to that probability measure: Bin(n, p) = binomial distribution for n trials with success probability p. N(/j,, a2) = normal distribution with mean \i and variance a2. N(fi, V) = multivariate normal distribution with mean vector p. and variance matrix V. Uniform(a, b) — uniform distribution on the open interval (a, b); square brackets, as in Uniform[0, 1], indicate closed intervals. Poisson(A) = poisson distribution with mean A. The symbol ? denotes end of proof, end of definition, and so on—some- on—something to indicate resumption of the main text. Product measures, product spaces, and product c-fields share the product symbol (x). Maxima and minima are v and л. Set-theoretic difference is \; symmetric difference is A. If ajbn -> oo, for sequences {а„} and {bn}, then write an^bn. Invariably IR denotes the real line, and IR' denotes /c-dimensional euclidean space. The borel cr-field on a metric space 3C is always 38(SE). The symbol IP
xiv Notation denotes a probability measure on a (sometimes unspecified) measurable space (Q, &); miscellaneous random variables live on this space. An ~>, a cross between ~ (the sign for "is distributed according to") and an ordinary arrow -> (for convergence), is used for convergence in distribu- distribution and weak convergence. A result stated and proved in the text is always referred to with initial letters capitalized. Thus the Multivariate Central Limit Theorem is numbered 111.30, but Taylor's theorem and dominated convergence are not reproved. The letters B, U, BP, EP usually denote the gaussian processes: brownian motion, brownian bridge, P-motion, and P-bridge. The letters Un and En denote empirical processes, with Un generated by observations on Uniform@, 1). Usually Р„ is the empirical measure. The set of all square-integrable functions with respect to a measure p. is written y2(fi); the corresponding space of equivalence classes is Ь2(ц). А similar distinction holds for Jz? 1(fi) and Ll{fi). Often pP denotes the Jz?2(.P) seminorm; || ¦ || is the supremum norm on a space of functions. The symbols nt,ns, and so on, are usually projection maps on function spaces. Expressions like N(e), Nt(d, d, T), and N2(d, P, #") represent various covering numbers; J(S), Ji{5, d, T), and J2(S, P, !F) are the corresponding covering integrals.
CHAPTER I Functionals on Stochastic Processes ... which introduces the idea of studying random variables determined by the whole sample path of a stochastic process. I.I. Stochastic Processes as Random Functions Functions analyzed as points of abstract spaces of functions appear in many branches of mathematics. Geometric intuitions about distance (or approximation, or convergence, or orthogonality, or any other ideas learned from the study of euclidean space) carry over to those abstract spaces, lending familiarity to operations carried out on the functions. We enjoy similar benefits in the study of stochastic processes if we analyze them as random elements of spaces of functions. Remember that a stochastic process is a collection {X, :teT} of real random variables, all defined on a common probability space (Q, S, IP). Often T will be an interval of the real line (which makes the temptation to think of the process as evolving in time almost irresistible), but we shall also meet up with fancier index sets—subsets of higher-dimensional euclidean spaces, and collections of functions. The random variable X, depends on both t and the point со in Q at which it is evaluated. To emphasize its role as a function of two variables, write it as X(co, t). For fixed t, the function X(-, t) is, by assumption, a measurable map from Q into IR. For fixed со, the function X(co, •) is called a sample path of the stochastic process. If all the sample paths lie in some fixed collection 3C of real-valued functions on T, the process X can be thought of as a map from Q into Ж, a random element of 3C. For example, if a process indexed by [0, 1] has continuous sample paths it defines a random element of the space C[0, 1] of all continuous, real-valued functions on [0, 1]. (In Chapter IV we shall formalize the definition by adding a measurability requirement.) Each sample path of X is a single point in 3C. Each random variable Z for which Z(co) depends only on the sample path X(co, •), such as the maximum
2 I. Functionals on Stochastic Processes of X(co, t) over all f, can be expressed by means of a functional on 3C. That is, the value Z(co) is found by applying to X(co, ¦) a map H that sends functions in 9? onto points of the real line. The name functional serves to distinguish functions on spaces of functions from functions on other abstract sets, an outmoded distinction, but one that can help us to remember where H lives. By breaking Z into the composition of a functional with a map from Q into 9C we also break any analysis of Z into two parts: calculations involving only the random element X; and calculations involving only the functional H. This allows us to study many different Z's simultaneously, just by varying the choice of H. Of course we only gain by this if most of the hard work can be disposed of once and for all in the analysis of X. The idea can be taken further. Suppose that a second stochastic process {Yt: t e T} puts all its sample paths in the same function space 9C. Suppose we want to study the same functional H of both processes; we want to show that ИХ and HY have distributions that are close, perhaps. Break the problem into its two parts: show that the distributions of X and Y (the probability measures they induce on ЗГ) are close; then show that H has a continuity property ensuring that closeness of the distributions of X and Y implies closeness of the distributions of HX and HY. Such an approach would make the analysis easier for other functionals with the same sort of continuity property; for a different H only the second part of the analysis would need repeating. 1 Example. Goodness-of-fit test statistics can often be expressed as func- functionals on a suitably standardized empirical distribution function. Consider the basic case of an independent sample ?v, ...,?„ from the Uniform@,1) distribution. Define the uniform empirical process Un by Un(co, t) = и'1'2 t Ш<») <t}~t) for 0 < t < 1. 1=1 This has the standardized binomial distribution, n~1/2(Bin(n, t) — nt), for fixed t. Each sample path is piecewise linear with jump discontinuities at the n points ^(ш),..., ?„(со). The process U„ defines a random element of any function space 9C that contains all sample paths of this form. We could take 3E as the smallest such set of functions, but there is some advantage to choosing a space that can accommodate the sample paths of other important stochastic processes.
I.I. Stochastic Processes as Random Functions 3 That way the analysis for this particular problem will carry over to many other limit and approximation problems. The usual choice is the space D[0, 1] of all real-valued functions on [0, 1] that are right continuous at each point of [0, 1) with left limits existing at each point of @, 1]. The catchy French acronym cadlag (continue a droite, limites a gauche) offers a most convenient way to avoid repeating this mouthful of a description. From now on it's cadlag. The processes U„ define random elements of the space D[0, 1] of all (real-valued) cadlag functions on [0, 1]. Numerous statistics have been suggested as candidates for testing whether the observations really do come from the uniform distribution, amongst them: -i Un(oj, tf dt. How do these statistics behave as n tends to infinity? We can try to answer the question by first determining how {Un}, as a sequence of stochastic processes indexed by [0, 1], behaves as n tends to infinity. Suppress the argument со. At each fixed t the sequence of real random variables {Un(t)} converges in distribution to a iV@, t — t2) distribution, by virtue of the Central Limit Theorem; at each fixed pair (s, t) the sequence of bivariate random vectors {(Un(s), Un(t))} converges to a bivariate normal distribution with covariance the same as the covariance between Un(s) and Un(t), by virtue of the bivariate form of the Central Limit Theorem; and similarly for the higher finite-dimensional distributions. That is, the joint distributions of the random variables obtained by sampling Un at any fixed, finite set of index points converge to the corresponding distributions of the so-called brownian bridge (also known as the tied-down brownian motion), the gaussian process U(t) characterized by: (i) U has continuous sample paths (it is a random element of C[0, 1]) between the two fixed points U@) = U(l) = 0; (ii) For fixed tb ... ,tk the random vector (t/faj),..., U(tk)) has a multi- variate normal distribution with zero means and cov(C/(t;), U(tj)) = The process U need not live on the same probability space as the empirical processes. It must, however, have sample paths in C[0, 1]. The reason we need this apparently irrelevant property will emerge in Chapter V, when we make the argument rigorous. When n is large, the process Un behaves something like U. On this basis, Doob A949) made the bold heuristic assumption that the distributions of statistics calculated from I)'„ should be close to the distributions of the
I. Functionals on Stochastic Processes corresponding statistics for U; he conjectured that functionals of Un should converge in distribution to the analogous functionals of U. In his words (he wrote х„ and x instead of Un and U): We shall assume, until a contradiction frustrates our devotion to heuristic reasoning, that in calculating asymptotic xn(t) process distributions when n —> oo we may simply replace the xn(t) processes by the x(t) process. It is clear that this cannot be done in all possible situations, but let the reader who has never used this sort of reasoning exhibit the first counter example. Happily, the heuristic does give the right answer for each of the three func- functionals mentioned above. ? This example will guide us in our definition of convergence in distribution for stochastic processes treated as random elements of function spaces. We shall explore one possible meaning for closeness of two processes in a distributional sense. The empirical process application will point us towards a theory slightly different from the one normally espoused in the weak convergence literature. Our progress towards the theory will begin in Chapter III with another look at the classical results for convergence of real random variables and random vectors, partly as a refresher course in the basic techniques, and partly as a way to get-a running start on the theory for random elements of general metric spaces in Chapter IV. The momentum of the development will carry us through to Chapter V, where Doob's heuristic argument will receive its rigorous justification, as a weak convergence result for D[0,1]. Chapter VII will extend the theory to empirical processes with index sets more complicated than [0, 1], building upon methods that will be introduced and developed in Chapter II. Notes Stochastic processes have frequently been treated as random functions in the probability literature. Doob A953) pioneered. Gihman and Skorohod A974, Chapter III) have collected together criteria for existence of stochastic processes with sample paths in C[0, 1] and D[0, 1]. Breiman A968, Chapters 12 to 14) is a good place to start. Problems [1] Verify that U has the same covariance function as Un, by writing (Un{s), Un(t)) as a normalized sum of the independent random vectors ({?,- < s}, {?,• < t}). [2] Show that the random variables H3 Un and H3 U both have expectation 5. [Remem- [Remember that Un and U have the same first and second moment structures.]
Problems [3] Show that H3 Un can be written in terms of the order statistics ?(i) as [Complicated functions of the observations can sometimes be represented by simple functionals of a suitably constructed random process.]
CHAPTER II Uniform Convergence of Empirical Measures ... in which uniform analogues of the strong law of large numbers are proved by two methods. These generalize the classical Glivenko-Cantelli theorem, which concerns only empirical measures indexed by intervals on the real line, to uniform convergence over classes of sets and uniform convergence over classes of functions. The results are applied to prove consistency for statistics expressible as continuous functionals of the empirical measure. A refinement of the second method gives rates of convergence. ILL Uniformity and Consistency For independent sampling from a distribution function F, the strong law of large numbers tells us that the proportion of points in an interval (-co, f] converges almost surely to F(t). The classical Glivenko-Cantelli theorem strengthens the result by adding that the convergence holds uniformly over all t. The strong law also tells us that the proportion of points in any fixed set converges almost surely to the probability of that set. The strengthening of this result, to give uniform convergence over classes of sets more interesting than intervals on the real line, and its further generalization to classes of functions, will be the main concern of this chapter. For the most part we shall consider only independent sampling from a fixed distribution P on a set S. The probability measure Pn that puts equal mass at each of the n observations ?,u ..., ?„ will be called the empirical measure. It captures everything we might need to know about the observa- observations, except for the order in which they were taken. Averages over the observations can be written as expectations with respect to Р„: If P\f\ < oo, the average converges almost surely to its expected value, Pf. We shall be finding conditions under which the convergence is uniform over a class $F of functions. Of course we should not expect uniform convergence over all classes of functions, except in trivial cases. Unless P is a discrete distribution, the difference PnD — PD cannot even converge to zero uniformly over all sets;
II. 1. Uniformity and Consistency 7 there always exists a countable set with Р„ measure one. But there are non- trivial classes over which the convergence is uniform. When we have such a class !F we can deduce consistency results for statistics that depend on the observations only through the values Pnf, for / in ?F. 1 Example. The median of a distribution P on the real line can be defined as the smallest value of m for which P(— со, m] > \. If P( — со, t] > \ for each t > m then the median is a continuous functional, in the sense that | median(g) — median(P) [ < s whenever the distribution Q is close enough to P. Close means sup|g(-co, t] - P(-oo, t]\ <5, t where the tiny б is chosen so that P( — со, m — s] < j — S, P(-co, m + s] > ^+ 8. The argument goes: if Q has median m then P(-oo, m'] > S(-co,m'] - S > i - d, so certainly m' > m — s. Similarly, for every m" < m', P(-со, m"] < Q(-со, m"] + ё < \ + 6, which implies m" < m + s, and hence m' < m + s. Next comes the probability theory. If the empirical measure Р„ is con- constructed from a sample of independent observations on P, the Glivenko- Cantelli theorem tells us that sup \Р„(— со, i] — P(— со, t]| ->¦ 0 almost surely. From this we deduce that, almost surely, |median(Pn) — median(P)| < s eventually. The sample median is strongly consistent as an estimator of the population median. ? For this example we didn't have to prove the uniformity result; the Glivenko-Cantelli theorem is the oldest and best-known uniform strong law of large numbers in the literature. But as we encounter new functions (usually called functionals) of the empirical measure, new uniform con- convergence theorems will be demanded. We shall be exploring two methods for proving these theorems. The first method is simpler in concept, but harder in execution. It involves direct approximation of functions in an infinite class $F by a
8 II. Uniform Convergence of Empirical Measures finite collection of functions. Classical convergence results, such as the strong law of large numbers or the ergodic theorem, ensure uniform convergence for the finite collection; the form of approximation carries the uniformity over to 3F. Section 2 deals with direct approximation. The second method depends heavily upon symmetry properties implied by independence. It uses simple combinatorial arguments to identify classes satisfying uniform strong laws of large numbers under independent sampling. Sections 3 to 5 assemble the ideas behind this method. II.2. Direct Approximation Throughout the section !F will be a class of (measurable) functions on a set S with a ff-field that carries a probability measure P. The empirical measure Р„ is constructed by sampling from P. Assume P\f\ < oo for each / in 3F. If 3F were finite, the convergence of Р„ f to Pf assured by the strong law of large numbers would, for trivial reasons, be uniform in /. If J^ can be ap- approximated by a finite class (not necessarily a subclass of 3F) in such a way that the errors of approximation are uniformly small, the uniformity carries over to !F. The direct method achieves this by requiring that each member of 3F be sandwiched between a pair of approximating functions taken from the finite class. 2 Theorem. Suppose that for each s > 0 there exists a finite class ?FS containing lower and upper approximations to each f in S', for which ft,L<f<fe,V and P(fs,V ~ fs,L)< S. Then sup^r \Pnf — Pf\ -* 0 almost surely. Proof. Break the asserted convergence into a pair of one-sided results: liminfinf(Pn/-P/)> 0 and limsup sup(Pn/ - Pf) < 0 or, equivalently, liminf inf(Р„(-/) - P(-f)) > 0. Then two applications of the next theorem will complete the proof. ? 3 Theorem. Suppose that for each s > 0 there exists a finite class J^ of functions for which: to each fin #" there exists anft in !Ft such thatfE <f and Pfe ^ Pf- e. Then liminf inf(Pn/ - Pf) > 0 almost surely.
II.2. Direct Approximation 9 Proof. For each e > 0, liminf inf (Р„/ - Pf) > liminf inf (Р„/е - Pf) because fe < f > liminf inf (Р„/? - Pf) + inf(P/? - Pf) > 0 H s almost surely, as #"? is finite. Throw away an aberrant null set for each positive rational s to arrive at the asserted result. ? You might have noticed that independence enters only as a way of guaranteeing the almost sure convergence of Pnfe to Pf for each approximat- approximating /e. Weaker assumptions, such as stationarity and ergodicity, could substitute for independence. 4 Example. The method of fc-means belongs to the host of ad hoc procedures that have been suggested as ways of partitioning multivariate data into groups somehow indicative of clusters in the underlying population. We can prove a consistency theorem for the procedure by application of the one- onesided uniformity result of Theorem 3. For purposes of illustration, consider only the simple case where observa- observations ?b ...,?„ from a distribution P on the real line are to be partitioned into two groups. The method prescribes that the two groups be chosen to minimize the within-groups sum of squares. Equivalently, we may choose optimal centers an and bn to minimize t\Zt ~ a\2 л \Zt - b\\ i=l then allocate each ?; to its nearest center. The optimal centers must lie at the mean of those observations drawn into their clusters, hence the name fe-means (or 2-means, in the present case). In terms of the empirical measure Pn, the method seeks to minimize where fa,b(x) = \x - a\2 л \x - b\2. As the sample size increases, W(a, b, Pn) converges almost surely to for each fixed (a, b). This suggests that (an, bn), which minimizes W{-, ¦, Р„), might converge to the (a*,b*) that minimizes W{-, -,P). Given a few obvious conditions, that is indeed what happens. To ensure finiteness of W{-, -, P), assume that P|x|2 < oo. Assume also that there exists a unique (a*, b*) minimizing W. Adopt the convention that
10 II. Uniform Convergence of Empirical Measures a < b in order that (b*, a*) be ruled out as a distinct minimizing pair. Without uniqueness the consistency statement needs a slight reinterpretation (Problem 1). The continuity argument lurking behind the consistency theorem does depend on one-sided uniform convergence of W(-, -, Pn) to W(-, -, P), but not uniformly over all possible choices for the centers/We must first force (an,bn) into a region C = [-M, M]<g)IR и IR ®[-M,M] for some suitably large M, then prove liminf inf(PK/ab - Pfa,b) ^ 0 almost surely, с We need at least one of the centers within a bounded region [ — M, M] to get the uniformity. Determine how large M needs to be by invoking optimality oi(an,bn). W(an,bn,Pn)<W@,0,Pn) -»• W@, 0, P) almost surely = P|x|2. If both а„ and bn lay outside [ — M, M] then Ща„, К, Р„) > (iMJPn[-iM, iM] -»• (iMJP[ -^M, iM] almost surely. If we choose M so that P\x\2 <{\МJР[-%МЛЩ then there must eventually be at least one of the optimal centers within [ —M, M], almost surely. We shall later also need M so large that (a*, b*) belongs to C. Explicit construction of the finite approximating class demanded by Theorem 3 is straightforward, but a trifle messy. That is one of the dis- disadvantages of brute-force methods. First note that fa,b(x) <(x - Mf + (x + MJ for (a, b) in С Write F(x) for the upper bound. Because PF < со, there exists a constant D, larger than M, for which PF[-D, DJ < s. We have only to worry about the approximation to/ab on [ — D, D~\. We may assume that both a and b lie in the interval [-3D, 3D]. For if, say, | b | > 3D then /,,»(*){I*I <D} = \x- a\2 = fUx){\x\ < D} because \a\ < M; the lower approximation for/a a on [ — D, D~\ will also serve forfa>b.
II.2. Direct Approximation 11 Let C? be a finite subset of [ — 3D, 3D]2 such that each (a, b) in that square has an (a', b') with \a — a'\ < s/D and \b — b'\ < s/D. Then for each x in [—D, D~\, \fa,t(x) -/e-,b'(x)| < l(x - flJ - (x - a'J1 + |(x-bJ -(x - fe'Jl < 2|a - a'l |x - Ka + a')l + 2|fc - fe'l I* - Kb + fe')l < 2(s/D)(D + 3D) + 2(s/D)(D + 3D) = 16e. The class J^33? consists of all functions (/a<;;,<(X) — 16е){|х| < D} for (a', &') ranging over C!:. From Theorem 3, limmfinf(Pn/aib-P/a;b)>0. с Eventually the optimal centers (а„, bn) lie in С Thus liminf(FF(an, Ь„, Р„) - W(an, bn, PJ) > 0 almost surely. Since W(an, bn, Pn) < W(a*, b*, Pn) because (an, bn) is optimal for Р„ -» W(a*, b*, P) almost surely < W(an, bn, P) because (a*, b*) is optimal for P, we then deduce that W{an, bn,P)^ W(a*, b*, P) almost surely. Notice what happened. The uniformity allowed us to transfer optimality of (а„,Ь„) for Р„ to a sort of asymptotic optimality for P; the processes W(-, •, Pn) have disappeared, leaving everything in terms of the fixed, non- random function W( ¦, ¦, P). We have assumed that W(-, ¦, P) achieves its unique minimum at (a*, b*). Complete the argument by strengthening this to: for each neighborhood U of(a*,b*\ inf W(a, b, P) > W(a*, b*, P). c\u Continuity of W(-, ¦, P) takes care of the infimum over bounded regions of C\U. If there were an unbounded sequence (a;, /?;) in С with we could extract a subsequence along which, say, a; -> — oo and /?; -»• Д with P | < M. Dominated convergence would give W(a*,b*,P) = P\x-p\\ which would contradict uniqueness of (a*, b*): for every a, the pair (a, /?) would minimize W(-, -,P). The pair (an,bn), by seeking out the unique minimum of W(-, •, P) over the region C, must converge to (a*, b*). ?
12 II. Uniform Convergence of Empirical Measures The /c-means example typifies consistency proofs for estimators denned by optimization of a random criterion function. By ad hoc arguments one forces the optimal solution into a restricted, often compact, region. That is usually the hardest part of the proof. (Problem 2 describes one particularly nice ad hoc argument.) Then one appeals to a uniform strong law over the restricted region, to replace the random criterion function by a deterministic limit function. Global properties of the limit function force the optimal solution into desired neighborhoods. If one wants consistency results that apply not just to independent sequences but also, for example, to stationary ergodic sequences, one is stuck with cumbersome direct approximation arguments; but for independent sampling, slicker methods are available for proving the uniform strong laws. We shall return to the /c-means problem in Section 5 (Example 29 to be precise) after we have developed these methods. 5 Example. Let в be the parameter of a stationary autoregressive process for independent, identically distributed innovations {и„}. Stationarity requires |0| < 1. A generalized M-estimator for в is any value в„ for which the random function Hn(9) = (n- I)'1 ? д(у,ЖУ1+1 ~ 0Уд takes the value zero. We would hope that в„ converges to the в* at which the deterministic function H(ff) = ШУ1ЖУ2 ~ Оуг) takes the value zero. If | g \ < 1 and | ф | < 1 and ф is continuous, we can go part of the way towards proving this by means of a uniform strong law for a bivariate empirical measure. Write <2„ for the probability measure that puts equal mass (n — I) on each of the pairs (yu y2),..., (у„_и у„). For fixed (integrable) /(•, •), Qnf -* Of almost surely, where Q denotes the joint distribution of (yb y2). This follows from the ergodic theorem for the stationary bivariate process {(у„, у„+1)}. Check the approximation conditions of Theorem 2, with Q in place of P, for the class of functions /(*!, х2)в) = д(х1)ф(х2 - BXl) for -1 < в < 1. First, choose an integer К so large that Щ\у1\<к,\у2\<к}>1-?.
II.3. The Combinatorial Method 13 Then appeal to uniform continuity of ф on the compact interval [—2K, 2K~\ to find a S > 0 such that | ф(а) — ф(Ь) \ < г whenever | a - b \ < 3 and \a\ < 2K and \b\ < 2K. For в in the interval [kS/K, (k + l)S/K], \f(xl,x2,e)-f(xux2,k3/K)\ < г + 2{\Xl\ >K} + 2{\x2\ > K}. With the integer к running over the finite range needed for these intervals to cover [ — 1, 1], the functions f(x1,x2,k3/K)±e±2{\x1\ > К} ±2{\х2\ Ж} provide the upper and lower approximations required by Theorem 2. As noted following Theorem 3, the uniform strong laws also apply to empirical measures constructed from stationary ergodic sequences. Ac- Accordingly, F) sup \Qnf(-, ¦, в) - Qf(-, -, 0)| -> 0 almost surely, that is, sup \Hn(9) - H@)\ -»• 0 almost surely. l«l<i Provided 6n lies in the range [— 1, 1], we can deduce from F) that Н(в„) -> 0 almost surely. It would be a sore embarrassment if the estimate of the auto- regressive parameter were not in this range. Usually one avoids the em- embarrassment by insisting only that Н„(в„) -> 0, with в„ in [— 1, 1]. Such а в„ always exists because Н„@*) -> О almost surely. Convergence results for в„ depend upon the form of H(). We know в„ gets forced eventually into the set {| H | < e} for each e > 0. If this set shrinks to в* as e | 0 then в„ must converge to в*, which necessarily would have to be the unique zero of H(-). If we assume that H does have these properties we get the consistency result for the generalized M-estimator. ? II.3. The Combinatorial Method Since understanding of general methods grows from insights into simple special cases, let us begin with the best-known example of a uniform strong law of large numbers, the classical Glivenko-Cantelli theorem. This asserts that, for every distribution P on the real line, G) sup | Pn(- oo, t] - P( — oo, t] | -» 0 almost surely, when the empirical measure Р„ comes from independent sampling on P. The ideas that will emerge from the treatment of this special case will later be expanded into methods applicable to other classes of functions. To facilitate back reference, break the proof into five steps.
14 II. Uniform Convergence of Empirical Measures Keep the notation tidy by writing || • || to denote the supremum over the class J of intervals ( —oo, t], for — oo < t < oo. We could restrict the supremum to rational t to ensure measurability. First Symmetrization. Instead of matching Pn against its parent distribution P, look at the difference between Pn and an independent copy, P'n say, of itself. The difference Pn — Р'„ is determined by a set of In points (albeit random) on the real line; it can be attacked by combinatorial methods, which lead to a bound on deviation probabilities for \\Р„ — P'n\\. A symmetrization inequality converts this into a bound on || Pn — P\\ deviations. 8 Symmetrization Lemma. Let {Z(t): t e T} and {Z'{t)\ t eT} be indepen- independent stochastic processes sharing an index set T. Suppose there exist constants P>0anda>0 such that F{ | Z'(t) | < a} > fi for every t in T. Then (9) P^sup |Z(t)| >s\< jT^sup \Z(t) - Z\t)\ > e - a Proof. Select a random x for which |Z(t)| > e on the set {sup \Z(t)\ > s}. Since т is determined by Z, it is independent of Z'. It behaves like a fixed index value when we condition on Z: | < a|Z} > p. Integrate out. L Z(t)\ >s\< 1P{|Z'(t)| < a, |Z(t)| > s} < ЕР jsup \Z(t) - Z'(t)\ > s - <x\. П Close inspection of the proof would reveal a disregard for a number of measure-theoretic niceties. A more careful treatment may be found in Appendix С For our present purpose it would suffice if we assumed T countable; the proof is impeccable for stochastic processes sharing a count- countable index set. We could replace suprema over all intervals (—00, t] by suprema over intervals with a rational endpoint. For fixed t, Р„(— oo, t] is an average of the n independent random variables {?; < t}> eacn having expected value P( — 00, t] and variance P( — 00, t] — (P( — 00, t]J, which is less than one. By Tchebychev's inequality,
II.3. The Combinatorial Method 15 Apply the Symmetrization Lemma with Z = Р„ — P and Z' = P'n — P, the class J as index set, a = js, and /? = j. A0) P{||Pn - P\\ Second Symmetrization. 2ТР{\\Р„ - Р'„\\ if n > 8е The difference Р„ — Р'„ depends on 2n observations. The double sample size creates a minor nuisance, at least notationally. It can be avoided by a second symmetrization trick, at the cost of a further diminution of the e. Independently of the observations ?,u ...,?,„, ?,\, ...,?,'„ from which the empirical measures are constructed, generate independent sign random variables au...,an for which P{o-; = +1} = W{a{ = -1} = \. The sym- symmetric random variables {^ < t} — {?¦ < t}, for i = 1,..., n and — oo < t < oo, have the same joint distribution as the random variables <7г[{?г < t} — {?'i < t}]. (Consider the conditional distribution given Thus Р{||Р„ -Р'я\\> Ы = < IP<sup n Y, a№i ^ f) > h + Write P° for the signed measure that places mass n 1ai at ?,-. The two sym- metrizations give, for n > 8e~2, A1) IP{||-Pn - P\\ > e} < 4F{||P°|| > ie}. To bound the right-hand side, work conditionally on the vector of observa- observations \, leaving only the randomness contributed by the sign variables. Maximal Inequality. Once the locations of the % observations are fixed, the supremum ||P°|| reduces to a maximum taken over a strategically chosen set of intervals Ij = ( — oo, tj2, for j = 0, 1,..., n. Of course the choice of these intervals depends on %; we need one tj between each pair of adjacent observations. (The t0 and tn are not really necessary.) With the number of intervals reduced so drastically, we can afford a crude bound for the supremum. A2) j=o
16 II. Uniform Convergence of Empirical Measures This bound will be adequate for the present because the conditional proba- probabilities decrease exponentially fast with n, thanks to an inequality of Hoeffding for sums of independent, bounded random variables. Exponential Bounds. Let Yu..., Ynbs independent random variables, each with zero mean and bounded range: at < Yt < bt. For each ц > 0, Hoeffding's Inequality (Appendix B) asserts F{| Y, + ¦ ¦ ¦ + Yn\ > n) < 2 ехрГ-2,?2 / J>; - fl;Jl. Apply the inequality with Yt = а{{^ < t}. Given \, the random variable Yt takes only two values, ± {?,- < t}, each with probability \. F{|PB°(-oo,t] / <2exp(-ne2/32), because the indicator functions sum to at most n. Use this for each Ij in inequality A2). F{||PB°|| > Щ® < 2(и + 1) exp(-ne2/32). Notice that the right-hand side now does not depend on %. Integration. Take expectations over % JP{\\Pn - P\\ > e} < 8(n + 1) exp(-ne2/32). This gives very fast convergence in probability, so fast that ? JP{\\Pn - P\\ > ?} < 00 n=l for each e > 0. The Borel-Cantelli lemma turns this into the full almost sure convergence asserted by the Glivenko-Cantelli theorem. II.4. Classes of Sets with Polynomial Discrimination We made use of very few distinguishing properties of intervals for the proof of the Glivenko-Cantelli theorem in Section 3. The main requirement was that they should pick out at most n + 1 subsets from any set of n points. Other classes have a similar property. For example, quadrants of the form ( —oo, t] in IR2 can pick out fewer than (n + IJ different subsets from a
II.4. Classes of Sets with Polynomial Discrimination 17 set of n points in the plane—there are at most n + 1 places to set the horizontal boundary and at most n + 1 places to set the vertical boundary. (Problem 8 gives the precise upper bound.) With (n + IJ replacing the n + 1 factor, we could repeat the arguments from Section 3 to get the bivariate analogue of the Glivenko-Cantelli theorem. The exponential bound would swallow up (n + IJ, just as it did the n + 1. Indeed, it would swallow up any poly- polynomial. The argument works for intervals, quadrants, and any other class of sets that picks out a polynomial number of subsets. 13 Definition. Let 3i be a class of subsets of some space S. It is said to have polynomial discrimination (of degree v) if there exists a polynomial p(-) (of degree v) such that, from every set of N points in S, the class picks out at most p(N) distinct subsets. Formally, if So consists of N points, then there are at most p(N) distinct sets of the form So n D with D in 3i. Call p(-) the discriminating polynomial for Q). ? When the risk of confusion with the algebraic sort of polynomial is slight, let us shorten the name "class having polynomial discrimination" to "polynomial class," and adopt the usual terminology for polynomials of low degree. For example, the intervals on the real line have linear discrimination (they form a linear class) and the quadrants in the plane have quadratic discrimination (they form a quadratic class). Of course there are classes that don't have polynomial discrimination. For example, from every col- collection of N points lying on the circumference of a circle in IR2 the class of closed, convex sets can pick out all 2N subsets, and 2N increases much faster than any polynomial. The method of proof set out in Section 3 applies to any polynomial class of sets, provided measurability complications can be taken care of. Appendix С describes a general method for guarding against these complications. Classes satisfying the conditions described there are called permissible. Every specific class we shall encounter will be permissible. As the precise details of the method are rather delicate—they depend upon properties of analytic sets—let us adopt a naive approach. Ignore measurability problems from now on, but keep the term permissible as a reminder that some regularity conditions are needed if pathological examples (Problem 10) are to be ex- excluded. Problems 3 through 7 describe a simpler approach, based on the more familiar idea of existence of countable, dense subclasses.
18 II. Uniform Convergence of Empirical Measures 14 Theorem. Let Pbea probability measure on a space S. For every permissible class 3} of subsets of S with polynomial discrimination, sup | PnD — PD | -» 0 almost surely. Proof. Go back to Section 3, change J to 2, replace the n + 1 multiplier by the polynomial appropriate to 2), and strike out the odd reference to interval and real line. ? Which classes have only polynomial discrimination? We already know about intervals and quadrants; their higher-dimensional analogues have the property too. Other classes can be built up from these. 15 Lemma. If *€ and 2 have polynomial discrimination, then so do each of: (i) {Dc:De2}; (ii) {С и D: С e Vand D e 2}; (iii) {C n D: С e #and D e 2}. Proof. Write c(-) and d(-) for the discriminating polynomials. We may assume them both to be increasing functions of JV. From a set So consisting of N points, suppose <€ picks out subsets Su...,Sk with к < c(N). Suppose St consists of JV; points. The class 2 picks out at most d(Nt) distinct subsets from St. This gives the bound d(JVi) + • • • + d(Nk) for the size of the class in (iii). The sum is less than c(N) d(N). That proves the assertion for (iii). The other two are just as easy. ? The lemma can be applied repeatedly to generate larger and larger polynomial classes. We must place a fixed limit on the number of operations allowed, though. For instance, the class of all singletons has only linear discrimination, but with arbitrary finite unions of singletons we can pick out any finite set. Very quickly we run out of interesting new classes to manufacture by means of Lemma 15 from quadrants and the like. Fortunately, there are other systematic methods for finding polynomial classes. Polynomials increase much more slowly than exponentials. For JV large enough, a polynomial class must fail to pick out at least one of the 2N subsets from each collection of JV points. Surprisingly, this characterizes poly- polynomial discrimination. Some picturesque terminology to describe the situation has become accepted in the literature. A class 3) is said to shatter a set of points F if it can pick out every possible subset (the empty subset and the whole of F included); that is, 3i shatters F if each of the subsets off has the form D n F for some D in 2). This conveys a slightly inappropriate image, in which F gets broken into tiny fragments, rather than an image of a diligent 2) trying to pick out all the different subsets of F; but at least it is vivid.
II.4. Classes of Sets with Polynomial Discrimination 19 For example, the class of all closed discs in IR2 can shatter each three-point set, provided the points are not collinear. But from no set of four points, no matter what its configuration, can the discs pick out more than 15 of the 16 possible subsets. The discs shatter some sets of three points; they shatter no set of four points. 16 Theorem. Let So be a set of N points in S. Suppose there is an integer V < N such that 3) shatters no set of V points in So. Then 3) picks out no more than ($) + A) + ••• + (v-i) subsets from So. Proof. Write Fu...,Fk for the collection of all subsets of V elements from So. Of course к = (у). By assumption, each F( has a "hidden" subset Hi that 3) overlooks: D nFt^ Ht for every D in 2. That is, all the sets of the form D n So, with D in 3i, belong to <ё0 = {С с So: С n Ft Ф H, for each i}. It will suffice to find an upper bound for the size of #0. In one special case it is possible to count the number of sets in ^0 directly. If Ht = Ft for every i then no С in ^0 can contain an Ft; no С can contain a set of V points. In other words, members of ^0 consist of either 0, 1,..., or V — 1 points. The sum of the binomial coefficients gives the number of sets of this form. By playing around with the hidden sets we can reduce the general case to the special case just treated. Label the points of So as 1,..., N. For each i define Щ = (H; и {1}) n F;; that is, augment Ht by the point 1, provided it can be done without violating the constraint that the hidden set be con- contained in F{. Define the corresponding class ?! = {Сс50:Сп]?;54Я; for each i}. The class (el has nothing much to do with ^0. The only connection is that all its hidden sets, the sets it overlooks, are bigger. Let us show that this implies ^i has a greater cardinality than ^0. (Notice: the assertion is not that (fi r— (fi \ '©O — » W Check that the map Ci-» C\{1} is one-to-one from <ё0\(ё1 into ^"Д^о. Start with any С in ^oX^i- By definition, С n F^ Ht for every i, but С n Fj = H'j for at least one j. Deduce that Я,- ф H'j, so 1 belongs to С and Fj and H'j, but not to H}. The stripping of the point 1 does define a one-to-one map. Why should C\{1} belong to "^Д^о? Observe that (C\{1}) П Fj = H'j\{l) = Hj, which bars C\{1} from belonging to ^0. Also, if Ft contains 1 then so must Щ, but C\{1} certainly cannot; and iff; doesn't contain 1 then (C\{1}) nFi = CnFi*Hi = Щ. In either case (C\{1}) n F{ ф Н\, so C\{1} belongs to ^
20 II. Uniform Convergence of Empirical Measures Repeat the procedure, starting from the new hidden sets and with 2 taking over the role played by 1. Define H- = (Щ и {2}) n Ft and <?2 = {C ? So: С n F, Ф Щ for each i}. The cardinality of #2 is greater than the cardinality of ^v Another N — 2 repetitions would generate classes (ёъ, #4,..., ^N with increasing cardinali- cardinalities. The hidden sets for ^N would fill out the whole of each Ft: the special case already treated. ? 17 Corollary. If a class shatters no set of V points, then it must have polynomial discrimination of degree no greater than V — 1. ? All we lack now is a good method for identifying classes that have trouble picking out subsets from large enough sets of points. 18 Lemma. Let <§ be a finite-dimensional vector space of real functions on S. The class of sets of the form {g > 0},for g in <&, has polynomial discrimination of degree no greater than the dimension of IS. Proof. Write V — 1 for the dimension of <S. Choose any collection {sb ..., sv} of distinct points from S. (Everything reduces to triviality if S contains fewer than V points.) Define a linear map L from <& into IR7 by Ug) = (fif(si), • • •, g(sv)). Since L'S is a linear subspace of IR*' of dimension at most V — 1, there exists in ]RF a non-zero vector у orthogonal to VS. That is, E Mfo) = ° f°r each g in <&, i or A9) Z 7ie(Si) = E (-УдФд for each g. {+} {-> Here { +} stands for the set of those i for which yt > 0, and { — } for those with уi < 0. Replacing у by — у if necessary, we may assume that { —} is non-empty. Suppose there were a g for which {g > 0} picked out precisely those points S; with i in { + }. For this g, the left-hand side of A9) would be > 0, but the right-hand side would be < 0. We have found a set that cannot be picked out. ? Many familiar classes of geometric objects fall within the scope of the lemma. For example, the class of subsets of the plane generated by the linear space of quadratic forms ax2 + bxy + cy2 + dx + ey + f includes all closed discs, ellipsoids, and (as a degenerate case) half-spaces. More com- complicated regions, such as intersections of 257 closed or open half-spaces, can be built up from these by means of Lemma 15. You can feed them into
II.4. Classes of Sets with Polynomial Discrimination 21 Theorem 14 to churn out a whole host of generalizations of the classical Glivenko-Cantelli theorem. Uniform limit theorems for polynomial classes of sets have one thing in common: they hold regardless of the sampling distribution. This happens because the number An(i;) of subsets picked out by the class from the sample {^l5 ...,?„} can be bounded above by a polynomial in n, independently of the configuration of that sample. Without the uniform bound the inequality A2) would be replaced by B0) F{ ||РИ° || > ie 1%} < 2Д„© exp( - ns2/32). Write Wn for the minimum of 1 and the right-hand side of B0). Then the argument from the Integration step gives the sharper bound Р{||Р„ - P|| > e} < 4JPWn for all n large enough. Thus a sufficient condition for ||PB — P|| to converge in probability to zero is: JPWn -» 0 for each e > 0. Equivalently, because 0 < Wn < 1, we could check that log А„(^) = op(ri). Theorem 16 helps here. where V = K(?i> ¦•¦,й is the smallest integer such that Sj shatters no collection of V points from {?u ..., ?„}. Set к = V - 1. If к < \n, all the terms in the sum for Bn(k) are less than (?): '1 log B(k) < n'1 log Bn(k) < n'1 log[(/c + l)n!/(n - /с)! /с!]. Three applications of Stirling's approximation and some tidying up reduce the right-hand side to -A - k/n) log(l - k/n) - (k/n) log(tyn) + o(l), which tends to zero as k/n -» 0. It follows that both n log А„ -» О and Il-P« — P|| -» 0 in probability, if V/n -» 0 in probability. If we don't know how fast V/n converges to zero, we can't use the Borel- Cantelli lemma to deduce from these inequalities that ||Р„ — Р|| converges almost surely to zero. But there is another reason why the convergence in probability implies the stronger result. Symmetry properties would force ||Р„ — Р|| to converge almost surely to some constant, no matter how V/n behaved. Given Р„, the unordered set {^!,..., ?„} is uniquely determined, but there's no way of deciding the order in which the observations were generated. Given Pn+ u we know slightly less about {^lr..., ?,„}; it could be any of the (n + 1) possible subsets of size n obtained by deleting one of the support points of Pn+1. (Count coincident observations as distinct support points.) The conditional distribution of Р„ given Pn+1 must be uniform on one of these (n + 1) subsets, each subset being chosen with probability (n + I). The conditional expectation of Р„ given Pn+1 (in the intuitive sense of the average over the n + 1 possible choices for Pn) must be Pn+1. The extra information carried
22 II. Uniform Convergence of Empirical Measures by Pn+2> Рп+з>--- adds nothing more to our knowledge about Pn; the conditional expectation of Pn given the u-field generated by Pn+1, Pn+2, ¦ ¦ ¦ still equals Pn+1. That is, the sequence {Pn} is a reversed martingale, in some wonderful measure-valued sense. Apply Jensen's inequality to the convex function that takes Pn onto \\Pn - P\\ to deduce that {\\Р„ - Р\\} is a bounded, reversed submartingale. (Problem 11 arrives at the same conclusion in a slightly more rigorous manner.) Such a sequence must converge almost surely (Neveu 1975, Proposition V-3-13) to a limit random variable, W. Since W is unchanged by finite permutations of {?;}, the zero-one law of Hewitt and Savage (Breiman 1968, Section 3.9) forces it to take on a constant value almost surely. The only question remaining for the proof of a uniform strong law of large numbers is whether the constant equals zero or not: convergence in probability to zero plus convergence almost surely to a constant gives convergence almost surely to zero. 21 Theorem. Let Si be a permissible class of subsets of S. A necessary and sufficient condition for sup \PnD — PD | -* 0 almost surely is the convergence ofn~lVn to zero in probability, where Vn = Vn{?,u ..., ?„) is the smallest integer such that 3) shatters no collection of Vn points from {Si. • • •. U- Proof. You can formalize the sufficiency argument outlined above; necessity is taken care of in Problem 12. Ill Because 0 < n 1Vn < 1, convergence in probability of n 1K, to zero is equivalent to n~ ^WVn -* 0. This has an appealing interpretation. The uniform strong law of large numbers holds if and only if, on the average, the class of sets behaves as if it has polynomial discrimination with degree but a tiny fraction of the sample size. 22 Example. Let's see how easy it is to check the necessary and sufficient condition stated in Theorem 21. Consider the class <? of all closed, convex subsets of the unit square [0,1]2. We know that there exist arbitrarily large collections of points shattered by (€. Were we sampling from a non-atomic
II.4. Classes of Sets with Polynomial Discrimination 23 distribution concentrated around the rim of a disc inside [0, I]2, the class <€ could always pick out too many subsets from the sample. Indeed, there would always exist a convex С with PnC = 1 and PC = 0. But such con- configurations of sample points should be thoroughly atypical for sampling from the uniform distribution on [0, I]2. Theorem 21 should say something useful in that case. How large a subcollection of sample points can <€ shatter? Suppose it is larger than the size requested by Theorem 21. That is, for some e > 0, JP{n~1Vn > s}>s infinitely often. This will lead us to a contradiction. A set of к points is shattered by ^ if and only if none of the points can be written as a convex combination of the others; each must be an extreme point of their convex hull. So there exists a convex set whose boundary has empirical measure at least k/n, which seems highly unlikely because P puts zero measure around the boundary of every convex set. Be careful of this plausibility argument; it contains a hidden appeal to the very uniformity result we are trying to establish. An approximation argument will help us to avoid the trap. Divide [0, I]2 into a patchwork of m2 equal subsquares, for some fixed m that will be specified shortly. Because the class si of all possible unions of these subsquares is finite, Ш sup | Pn A - PA | > \г \ < \z for all n large enough. The js here is chosen to ensure that, for some n, IPIn^ > e and sup \Р„А - РА\<\г}> \г. si Since a set with positive probability can't be empty, there must exist a sample configuration for which <€ shatters some collection of at least ns sample points and for which \Р„А - PA\ < \г for every A in stf. Write H for the convex hull of the shattered set, and AH for the union of those subsquares that intersect the boundary of H. The set AH contains all the extreme points of H, so Р„АН > г; it belongs to si, so \PnAH - PAH\ < \г. Consequently PAH > \г, which will give the desired contradiction if we make m large enough. \шкш «в : И и я ш • •
24 II. Uniform Convergence of Empirical Measures Experiment with values of m equal to a power of 3. No convex set can have boundary points in all nine of the subsquares; the middle subsquare would lie inside the convex hull of four points occupying each of the four corner squares. For every convex С the P measure of the union of those —-— \— subsquares intersecting its boundary must be less than f. Subdivide each of the nine subsquares into nine parts, then repeat the same argument eight times. This brings the measure of squares on the boundary down to (fJ. Keep repeating the argument until the power of | falls below js. That destroys the claim made for AH. ? II.5. Classes of Functions The direct approximation methods of Section 2 gave us sufficient conditions for the empirical measure Pn to converge to the underlying P uniformly over a class of functions, sup | Pnf-Pf 3? 0 almost surely. The conditions, though straightforward, can prove burdensome to check. In this section a transfusion of ideas from Sections 3 and 4 will lead to a more tractable condition for the uniform convergence. The method will depend heavily on the independence of the observations {?;}, but the assumption of identical distribution could be relaxed (Problem 23). Throughout the section write |[ • || to denote sup»» | ¦ |. Let us again adopt a naive approach towards possible measurability difficulties, with only the word permissible (explained in Appendix C) to remind us that some regularity conditions are needed to exclude patho- pathological examples. A domination condition will guard against any complications that could be caused by #" containing unbounded functions. Call each measurable F such that |/| < F, for every /in J% an envelope for J*\ Often F will be taken as the pointwise supremum of | /1 over J5", the natural envelope, but it will
II.5. Classes of Functions 25 be convenient not to force this. We shall assume PF < oo. With the proper centering, the natural envelope must satisfy this condition (Problem 14) if the uniform strong law holds. The key to the uniform convergence will again be an approximation condition, but this time with distances calculated using the j§? * seminorm for the empirical measures themselves. This allows us to drop the require- requirement that the approximating functions sandwich each member of J^ 23 Definition. Let Q be a probability measure on S and 2F be a class of functions in jS? 1(Q). For each e > 0 define the covering number iV\(e, Q, 3F) as the smallest value of m for which there exist functions gu ..., gm (not necessarily in J*) such that miny Q | / - g} \ < e for each / in #". For definite- ness set iV\(e, <2, #¦) = oo if no such m exists. ? If #" has envelope F we can require that the approximating functions satisfy the inequality \gj\ < F without increasing iV^e, Q, ?F): replace gs by max{-F, min[F, gj]}. We could also require gj to belong to #, at the cost of a doubling of e: replace gj by an fj in J^ for which Q\fj — g}I < e. 24 Theorem. Let 3F be a permissible class of functions with envelope F. Suppose PF < oo. If Р„ is obtained by independent sampling from the prob- probability measure P and if log N±(e, Р„, ?F) = op(n) for each fixed e > 0, then \Pnf — Pf\ -» 0 almost surely. Proof. Problem 11 (or the slightly less formal symmetry argument leading up to Theorem 21 in Section 4) shows that {\\Р„ — Р\\} is a reversed sub- martingale ; it converges almost surely to a constant. It will suffice if we deduce from the approximation condition that {\\Р„ — Р\\} converges in probability to zero. Exploit integrability of the envelope to truncate the functions back to a finite range. Given e > 0, choose a constant К so large that PF{F > K} < e. Then sup \PJ -Pf\< sup \PJ{F <K}- Pf{F < K}\ + supPn\f\{F > K} + supP\f\{F >K}. Because | /1 < F for each / in !F, the last two terms sum to less than PnF{F > K} + PF{F > K}. This converges almost surely to 2PF{F > K}, which is less than 2e. It remains for us to show that the supremum over the functions f{F < K} converges in probability to zero. As truncation can only decrease the ?>1(Р„) distance between two functions, the condition on log covering numbers also
26 II. Uniform Convergence of Empirical Measures holds if each/is replaced by its truncation; without loss of generality we may assume that | /1 < К for each / in J^. In the two Symmetrization steps of the proof of the Glivenko-Cantelli theorem (Section 3) we showed that JP{\\Pn - P\\ > e} < 4P{||P°|| > ie} for n > 8e -2 where ||-|| denoted a supremum over intervals ( — oo, t] of the real line. The signed measure P° put mass ±n~1 on each observation ?b ...,?„, the random ± signs being decided independently of the {^}. The argument works just as well if || • || denotes a supremum over #, the interpretation adopted in the current section. The only property of the indicator function (— oo, /] needed in the Symmetrization steps was the boundedness, which implied var(Pn( — oo, tj) < n~1. This time an extra factor of K2 would appear in the lower bound for n. With intervals we were able to reduce ||P°|| to a maximum over a finite collection; for functions the reduction will not be quite so startling. Given %, choose functions gu ..., gM, where M = N^e, Р„, JF), such that min Pn | / - 0j | < h for each / in <F. j Write /* for the gj at which the minimum is achieved. Now we reap the benefits of approximation in the j§? 1(PH) sense. For any function g, = Ря\д\- Choose g = / — /* for each / in turn. P^sup|P°/| > Щ& < P^sup[|P°/*| + PJ/-/*|] >ie < Jp\max\P°gj\ > УЩ because Р„|/ - /*| < |e (. J J < iVi^e, Р„, J*) max j Once again Hoeffding's Inequality (Appendix B) gives an excellent bound on the conditional probabilities for each g-}. n f Bgj(QJ] 2 ехр(-и?2Д28К2) because \g}\ < K.
II.5. Classes of Functions 27 When the logarithm of the covering number is less than ns2/256K2, the in- inequality P{||P°|| > ie|^} < 2 exp[log N&e, Pn, P) - пе2/тК21 will serve us well; otherwise use the trivial upper bound of 1. Integrate out. IP{ira > Ы < 2exp(-ne2/256K2) + P{log iV^e, Pn, &) > ne2/256K2}. Both terms on the right-hand side of the inequality converge to zero. ? For some classes of functions the conditions of the theorem are easily met because iV^e, Р„, 3P) remains bounded for each fixed e > 0. This happens if the graphs of the functions in #" form a polynomial class of sets. The graph of a real-valued function / on a set S is defined as the subset Gf = {(s,t):0<t<f(s) or f(s)<t<0} of S ® IR. We learn something about the covering numbers of a class J^ by observing how its graphs pick out sets of points in S ® IR. 25 Approximation Lemma. Let 2F be a class of functions on a set S with envelope F, and let Qbe a probability measure on S with 0 < QF < oo. If the graphs of functions in S' form a polynomial class of sets then N&QF, Q, &) <As~w for 0 < e < 1, where the constants A and W depend upon only the discriminating polynomial of the class of graphs. Proof. Let/j,..., fm be a maximal collection of functions in i* for which Q\f-fj\>eQF if i*j. Maximality means that no larger collection has the same property; each / must lie within eQF of at least one f-r Thus m > N^eQF, Q, !F). Choose independent points (s1? ГД ..., (sk, tk) in S ® IR by a two-step procedure. First sample s( from the distribution Q(-F)/Q(F) on S. Given st, sample tt from the conditional distribution Uniform[ — F(s{), F(s;)]. The value of k, which depends on m and e, will be specified soon. -F
28 II. Uniform Convergence of Empirical Measures The graphs Gt and G2, corresponding to fx and/2, pick out the same subset from this sample if and only if every one of the к points lands outside the region G( AG2. This occurs with probability equal to П El - IWPtfSi, tt) e G, Л G2|s;}] = [1 - JP(\ frist) - f2(s1)\/2F(s1))f /i -/2 < (i - W < exp(-jks). Apply the same reasoning to each of the B) possible pairs of functions ft and fj. The probability that at least one pair of graphs picks out the same set of points from the к sample is less than 1 exp(- jks) < \ expB log m - \кг). Choose к to be the smallest value that makes the upper bound strictly less than 1. Certainly к < A + 4 log m)/s. With positive probability the graphs all pick different subsets from the к sample; there exists a set of к points in S ® IR from which the polynomial class of graphs can pick out m distinct subsets. From the defining property of polynomial classes, there exist constants В and V such that m < Bkv for all к > 1. Find n0 so that A+4 log n)v < n1'2 for all n > n0. Then either m < n0 or m < Bm1/2s~v. Set W = 2Fand A = max(B2, n0). ? To show that a class of graphs has only polynomial discrimination we can call upon the results of Section 4. We build up the graphs as finite unions and intersections (Lemma 15) of simpler classes of sets. We establish their dis- discrimination properties by direct geometric argument (as for intervals and quadrants) or by exploitation of finite dimensionality (as in Lemma 18) of a generating class of functions. 26 Example. Define a center of location for a distribution P on IRm as any value в minimizing the criterion function. Щ6, P) = P<K\x - 6\), where ф(-) is a continuous, non-decreasing function on [0, 00) and | • | denotes the usual euclidean distance. If Рф(\х\) < оо and ф(-) does not increase too rapidly, in the sense that there exists a constant С for which <K2t) < for all t, then the function Щ-, P) is well defined: HF,P) < Р1фB\9\){\х\ < \9\} + Сф(\х\){\х\ > |0|}] < 00. If trivial cases are ruled out by the requirement B7) Р{х:ф(\х\)<ф(ю-)}>0,
II.5. Classes of Functions 29 the minimizing value will be achieved (Problem 21); extra regularity condi- conditions on P, which are satisfied by distributions such as the multivariate normal, ensure uniqueness (Problem 22). For this example, let us not get bogged down by the exact conditions needed; just assume that H{-,P) has a unique minimum at some 90. Estimate 90 by any value 9„ that minimizes the sample criterion function H( •, Р„). То show that 9n converges to 90 almost surely, it will suffice to prove that Н(в„, P) -» HF0, P) almost surely, because H(9, P) is bounded away from H(90, P) outside each neighborhood of 90. The argument follows the same pattern as for k-means (Example 4). First show that 9„ eventually stays within a large compact ball {| x | < K}. Choose the К greater than 1601 and large enough to ensure that ф&К)Р{\х\<^К}>Рф(\х\), which is possible by B7): as К tends to infinity the left-hand side converges to ф(оо-). Such а К will suffice because Я@, Р„) = Р„ф(\х\) and Щ9, Pn) > ф№)Рп{М < Ж for every 9 with \в\> К. If we prove uniform almost sure convergence of Р„ to P over the class then we can deduce almost surely that Щ9„, P) -» H(90, P) from H(9n, Pn) - H(9n, P) -> 0, H(9n, Pn) < H(90, Pn) -> H(90, P) < H(9n, P). Here's our chance to apply Theorem 24. The class !F has envelope фBК) + Сф(\х\), which satisfies the first requirement of the theorem. Bound the covering numbers by showing that the graphs of functions in J* have only polynomial discrimination. We may assume that 0@) = 0. The graph of ф(\- — 9\) contains a point (y, i), with t > 0, if and only if \y - 9\ > a(t), where a(t) denotes the smallest value of a for which ф(а) > t. From a collection of points {(y(, t;)} the graph picks out those points satisfying |y;|2 - 2yr 9 + \9\2 - a(ttJ > 0. Construct from (yt, Q a point zt = (yt, \yt\2 - a(t;J) in IRm+1. On IRm+1 define а vector space ^ of functions 9p,y,s(x,s) = P-x + ys + S with parameters fi in IRm and y, S in IR. By Lemma 18, the sets {g > 0}, for g in <§, pick out only a polynomial number of subsets from {z;}; those sets corresponding to functions in ^ with fi = -29, у = 1, and д = \9\2 pick out even fewer subsets from {z;}. The graphs of functions ф(\- — 9\) have only polynomial discrimination. ?
30 II. Uniform Convergence of Empirical Measures Buried within the argument of the last example lies a mini lemma relating finite dimensionality of a class of functions to discrimination properties of the graphs. It is perhaps worth noting. 28 Lemma. Let #" be a finite-dimensional vector space of real functions on S. The class of graphs of functions in 2F has polynomial discrimination. Proof. Define on S ® IR the vector space & of real functions Define Gi = {<7/,i >0} = {(s,t):f(s)>t} G2 = {<?-/,-! >0} = {(M):/(s)<t} Observe that Gf = Gx{t > 0} и G2{t < 0}. Invoke Lemmas 18 and 15. ? 29 Example. Now let's have another try at the /c-means problem introduced in Example 4. There we met the class of functions of the form fa,b(x) = \x - a\2 л \x - b\2 with (a, b) ranging over the subset С of IR2. We know that supc/a ъ < F for an F with PF < oo, provided P|x|2 < oo. The graphs of functions | x — a\2 form a class with polynomial discrimina- discrimination, by Lemma 28. Intersect pairs of such graphs in all possible ways to get the graphs of all functions/e> b. Apply Lemma 15 (to handle the intersections), then the Approximation Lemma (to bound covering numbers), then Theorem 24: sup |Р„/а>b- Pfa,b\-*0 almost surely, с Compare this with the direct approximation argument of Example 4. ? II.6. Rates of Convergence Theorem 24 imposed the condition log N^s, Р„, 2F) = op(ri) on the rate of growth of the covering numbers. Many classes meet the condition easily. For example, if the graphs of functions from J* have only polynomial discrimination, the covering numbers stay smaller than a fixed polynomial in e'1. The method of proof will deliver a finer result for such a class; we can get good bounds not just for a fixed e deviation but also for an г„ that de- decreases to zero as n increases. That is, we get a rate of convergence for the
II.6. Rates of Convergence 31 uniform strong law of large numbers. The method will also allow the class of functions to change with n, provided the covering numbers do not grow too rapidly. If the classes are uniformly bounded, and if the supremum of Pf2 over the nth class tends to zero as n increases, this will speed the rate of convergence. Consider the effect upon the two key steps of the argument for Theorem 24 if we let both e and $F depend on n. As before, replace Р„ - P by the signed measure P° that places mass + n~x at each of ?u ..., ?„. The sym- metrization inequality C0) IPJsup \PJ-Pf\> 8Л < 4ip|sup \P°J\ > 2en still holds provided var(Pn /)/Dе„J < \ for each/in Jv The approximation argument and Hoeffding's Inequality still lead to C1) ipjsup |P°/| > 2б„| J < 2ади, Р„, #„) ехрГ-^?п2/(тах Png2 where the maximum runs over all W1(?n,Pn,#n) functions {g^ in the approximating class. If the supremum over J*n of P/2 tends to zero, one might expect that the maximum over the {Р„д2} should converge to zero at about the same rate. The next lemma will help us make the idea precise if the approximating {g^ are chosen from #"„. As squares of functions are involved, covering numbers need to be calculated using if2 seminorms rather than the if1 seminorms of Definition 23. 32 Definition. Let Q be a probability measure on S and #" be a class of functions in if 2(B)- For each г > 0 define the covering number N2(e, Q, 3?) as the smallest value of m for which there exist functions gu ..., gm (not necessarily in J*) such that ттДB(/— fifjJI'2 < ? for each /in J*. For defmiteness set N2(e, Q, #") = oo if no such m exists. ? As before, if #" has envelope F we can require that \g}\ < F; and we could require g} to belong to J^ at the cost of a doubling of г, by substituting for gj an fj in & such that (QfJ} - «?/I/2 < е. 33 Lemma. Let 3F be a permissible class of functions with |/| < 1 and (Pf2I12 < 6 for each fin &. Then H>|sup(Pn/2I/2 > 8<4 < 4H>rjV2(<5, Р„, &) ехр(-и<52) л 1]. Proof. Let P'n be an independent copy of Р„. Write Z(/) for (Pn/2I/2 and Z'(f) for (P'nf2I12. From the Symmetrization Lemma of Section 3, C4) IP jsup | Z(/) | > 84 < f IP {sup | Z(f) - Z'(f) | >
32 II. Uniform Convergence of Empirical Measures because, for each / in 3F, F{|Z'(/)| < 23} > 1 - IP(Z'(/J)/4<52 = 1 - (P/2)/4<52. The intrusion of the square-root into the definition of Z and Z' would complicate reduction to the P° process. Instead, construct Р„ and Р'„ by a method that guarantees equal numbers of observations for both empirical measures. Sample 2n observations Xu ..., X2n from P. Independently of the vector X of these observations, generate independent selection variables т1;..., т„ with 1Р{т(г) = 1} = 1Р{т@ = 0} = \. Use these to choose one observation from each of the pairs (Х2;_1; X2i), for ? = 1,..., n. Construct Pn from these observations, and Р'„ from the remaining observations. For- Formally, set ?,- = Z2i_i+t(i) and ^ = I2j_l(i), then put mass n on each point Ь for Pn, and put mass n~x on each ?• for P;. Set S2n = i(PB + P^). It has the same distribution as P2n. Temporarily write p(-) for the ?>2(S2n) seminorm: p(f) = (S2nf2I/2- Given X, find functions gu..., gM, where M = N2(^/2<5, S2n, #"), for which min p(/ — gj) < y/28 for every / in j We may assume that \gj\ < 1 for every;'. The awkward y/2 will disappear at the end when we convert to if 2(Р„) covering numbers. From the triangle inequality for the i?2(Pn) seminorm, and the bound 2S2n for Р„, deduce for each / and g that - Z(g)\ < Z(f ~g)< BS2n(f - gJI'2 = Jlp(f - g) and similarly for Z'. For/in #" set g equal to the g} that minimizes p(/ - gfj). Then lj)-Z'(gj)\ + Z'(gj-f) ?46 + \Z{gj)-Z'(gj)\, whence IP\sup \Z(f) - Z'(/)| > 6<5|xi < P-Wx|Z@;) - Z'(gj)\ > 2d\X < M max JP{\Z(gj) - Z'(9j)[ > 2<5|X}. Fix a g with |gf| < 1. Bound \Z(g) - Z'(g)\ by \Z{gf ~ Z'{gf\l[Z{g) which is less than \Png2 - P'ng2\/BS2ng2y2
II.6. Rates of Convergence 33 thanks to the inequality a112 + b1'2 > (a + bI/2, for a, b > 0. Apply Hoeffding's Inequality (Appendix B). <P i=l >2n5BS2ng2I'2\X \- < 2Qxp\-l6n2S2S2ng < 2ехр(-2и<52) because the inequality | g \ < 1 implies *2<-i) - 92(Х2дУ < Ь^м) + g\X2d ='2nS2ng2. i = 1 i = 1 Notice how the <S2ngf2 factor cancelled. That happened because we sym- symmetrized Z instead of Pn. Setting g equal to each gj in turn, we end up with pjsup |Z(/) - Z'(/)| > 651 xl < 2W2(y2<5, 52„, #-) exp(-2n<52) I ^ J Decrease the right-hand side to the trivial upper bound of 1, if necessary, then average out over X: C5) ipjsup \Z(f) - Z'(f)\ > 6s\ < JP[_2N2(^25, S2n, F) ехр(-2п<52) л 1] The presence of the S2n is aesthetically unpleasing, especially since both S and J5" will always depend on n in applications. Problem 24 allows us to replace it by Pn, at a small cost: ^, S2n, ^) exp(-2nS2) л 1] < IP[2N2(<5, Pn, ^)N2(S, P'n, &) ехр(-2и<52) л < IP[2N2(<5, Р„, #") ехр(-и<52) л 1] by virtue of the inequality xy л 1 < (x л 1) + (у л 1) for x, у > 0. The empirical measures Р„ and Р'„ have the same distribution; the sum of expec- expectations is less than т52)л 1]. Combine the last bound with C4) and C5) to complete the proof. ? The bounds we have for j§?i covering numbers can be converted into bounds for if2 covering numbers. For the sake of future reference, consider
34 II. Uniform Convergence of Empirical Measures a class 3F with envelope F. Set F equal to a constant to recover the inequalities for a uniformly bounded 3F. 36 Lemma. Let 3Fbe a class of functions with strictly positive envelope F, and Qbe a probability measure with QFZ < со. Define P(-) = Q(F2)/Q(F2) and g= {f/F:feF}. Then: (i) NMQF2I'2, Q, &) < N2(e, P, <S) < N.&2, P, 9); (ii) if the class of graphs of functions in ?F has only polynomial discrimination then there exist constants A and W, not depending on Q and F, such that 21'2, Q, &) <AS-wfor0<S< 1. Proof. For every pair of functions /b /2 with \fi\ < F and |/21 < F, (QF'r'QIA - /2|2 = P\fJF - f2/F\2 < 2P\fJF - fJF|. Assertion (i) follows from these connections between the seminorms used in the definitions of the covering numbers. The graph of/ covers a point (x, t) if and only if the graph off/F covers (x, t/F(x)); the graphs of functions in 'S also have polynomial discrimination. Invoke the Approximation Lemma A1.25) for classes with envelope 1. Rechristen A2W as Amd2W as W. ? It is now an easy matter to prove rate of convergence theorems for uni- uniformly bounded classes of functions. As an example, here is a result tailored to classes whose graphs have polynomial discrimination. (Remember that the notation х„ g> yn means xjyn -> со.) 37 Theorem. For each n, let 2Fn be a permissible class of functions whose covering numbers satisfy sup N^e, Q, jg < Ae~w for 0 < s < 1 Q with constants A and W not depending on n. Let {an} be a non-increasing sequence of positive numbers for which nS2a2 g> log n. If \f\ < 1 and (P/2I/2 < <5„ for each fin 3Fn, then sup \Pnf — Pf\ -4 52an almost surely, Proof. Fix e > 0. Set г„ = se2an. Because var(Pn/)/DenJ < (l6ne2S2nz2yl < (logn) the symmetrization inequality C0)'holds for all n large enough: pjsup \PJ - Pf\ > sj < 4ipjsup \P°f\ > 2e\
II.6. Rates of Convergence 35 Condition on \. Find approximating functions {gj} as in C1). We may assume that each gf belongs to $Fn. (More formally, we could replace gj by an fj in <Fn for which Pn\f} - g}\ < e, then replace e by 2e throughout the argument.) From C1), ipjsup \P°J\ > 2e\ < 2As;w ехр(-^е„2/645и2) + pjsup PJ2 > 64<5„2 The first term on the right-hand side equals 2Ae~w exp[PF log(l/<52cO - ne2<5* which decreases faster than every power of n because log(l/<52an) increases more slowly than log n, while n<52a2 increases faster than log n. Lemma 33 bounds the second term by nyw exp(-n<52) which converges to zero even faster than the first term. An application of the Borel-Cantelli lemma completes the proof. ? Specialized to the case of constant {an}, the constraint placed on {5„} by Theorem 37 becomes <52 §> n~l log n. This particular rate pops up in many limit theorems involving smoothing of the empirical measure because (Problem 25) it corresponds to the size of the largest ball containing no sample points. 38 Example. Let P be a probability measure on Md having a bounded density p( ¦) with respect to d-dimensional lebesgue measure. One theoretically attractive method for estimating p(-) is kernel smoothing: convolution of the empirical measure Р„ with a convenient density function to smear out the point masses into a continuous distribution. The estimate is Pn(x) = PnG-dKx,a, where for some density function К on lRd and a scaling factor a that depends on n. Note that the a~d is not part of Kx_ff. The traditional method for analyzing р„ compares it with the correspond- corresponding smoothed form of p, p(x) = FpB(x) = Pa'dK^a. The difference pn — p breaks into a sum of a random component, pn — p, and a bias component p — p. The smaller the value of a, the smaller the bias (Problem 26); the slower a tends to zero, the faster р„ — p converges to zero. These two conflicting demands determine the rate at which р„ — р can tend to zero.
36 II. Uniform Convergence of Empirical Measures If pn — p is to converge uniformly to zero, we must not allow a to decrease too fast. Otherwise we might somewhere be smoothing Pn over a region where it puts too little mass, making \pn — p\ too large. Theorem 37 will let <7d decrease almost as fast as и log n, the best rate possible. For concreteness take К to be the standard normal density, which enjoys the uniform bound 0 < Kx a < 1 for all x and a (the constant {2%)~il2 is too awkward for repeated use.) The class of graphs of all Kx „ functions has polynomial discrimination (Problem 28 proves this for a whole class of kernel functions). Assume also that the density p(-) is bounded, say 0 < p(-) < 1. In that case because x)/a)p(y) dy = ad ЫA)р(х + at) dt. Everything is set up for Theorem 37. Put <х„ = 1 and b2n = od. Provided a* 5> n'1 logn, sup \PnKxa - PKxa\ < ad almost surely, X that is, sup | pn(x) — p(x) | -> 0 almost surely. X Smoothness properties of p(-) determine the rate at which the bias term converges to zero (Problem 26). For example, one bounded derivative would give maximum bias of order 0{a) in one dimension. We would then want something like a3 g> n~l logn to get a comparable rate of convergence from Theorem 37 for р„ — p. ? Notes Uniform strong laws of large numbers have a long history, which is described in the first section of the survey paper by Gaenssler and Stute A979). Theorem 2 comes from DeHardt A971), but the idea behind it is much older. Billingsley and Topsoe A967) and Topsoe A970, Sections 12 to 15) developed much deeper results for the closely related area of uniformity in weak convergence. Hartigan A975) is a good source for information about clustering. Hartigan A978) and Pollard A981b, 1982b, 1982c) have proved limit
Notes 37 theorems for fe-means. Engineers know the method of /c-means by the name quantization. The March 1982 IEEE Transactions on Information Theory was devoted to the topic. Denby and Martin A979) proposed the generalized M-estimator of Example 5. It has long been appreciated that comparison of two independent empirical distribution functions transforms readily into a combinatorial problem. Gnedenko A968, Section 68), for example, reduced the analysis of two- sample Smirnov statistics to a random walk problem. The method in the text has evolved from the ideas introduced by Vapnik and Cervonenkis A971). Their method of conditioning turned calculations for a single fixed set into an application of an exponential bound for hypergeometric tail probabilities. Classes with polynomial discrimination are often called VC classes in the literature. The symmetrization described in Section 3 is a well- known method for proving central limit theorems in Banach spaces (Araujo and Gine 1980, Section 2; Gine and Zinn 1984, Section 1). Steele A975,1978) discovered subadditivity properties of empirical measures that strengthen the Vapnik-Cervonenkis convergence in probability results to necessary and sufficient conditions for almost sure convergence. Pollard A981b) introduced the martingale tools and the randomization method described in Problem 12 to rederive Steele's conditions. Theorem 21 and Example 22 are based on Steele A978); Flemming Tops0e explained to me the f-trick for convex sets. See Gaenssler and Stute A979) for more about the history of this example. The proof of Theorem 16, which is often called the Vapnik-Cervonenkis lemma, is adapted from Steele A975). Sauer A970) was the first to publish the inequality in precisely this form, although he suggested that Shelah had also established the result. (I am unable to follow the two papers of Shelah that Sauer cited.) Vapnik and Cervonenkis A971) proved an insignificantly weaker version of the inequality. Dudley A978, Section 7) has dug further into the history. Lemma 18 is due to Steele A975) and Dudley A978). The sum of binomial coefficients in Theorem 16 and the randomization method of Problem 12 suggest that a direct probabilistic path might lead to the necessary and sufficient conditions of Theorem 21. Does there exist a set of n independent experiments that can be performed to decide whether a particular subset of a collection of n points can be picked out by a particular class of sets? Or maybe the experiments could be linked in some way. For a class that picks out only subsets with fewer than V points the solution is easy—it lies buried within the proof of Theorem 16. The notes to Chapter VII will give more background to the concept of covering number. Vapnik and Cervonenkis A981) have found necessary and sufficient conditions for uniform almost sure convergence over bounded classes of functions. They worked with if00 and if1 distances between functions. Gine and Zinn A984) applied chaining inequalities and gaussian-process methods (see Chapter VII) to deduce if2 necessary and sufficient conditions. The square-root trick in Lemma 33 comes from Le Cam A983) via Gine and
38 II. Uniform Convergence of Empirical Measures Zinn. Kolchinsky A982) and Pollard A982c) independently introduced the symmetrization method used in Lemma 33. The Approximation Lemma is due to Dudley A978), who proved it for classes of sets. The extension to classes of functions was proved ad hoc by Pollard A982d), using an idea of Le Cam A983). The density estimation literature is enormous. Silverman A978) and Stute A982b) have found sharp results involving the и log и critical rate. Bertrand-Retali A974, 1978) proved that ad g> n~l log и is both necessary and sufficient for uniform consistency over the class of all uniformly con- continuous densities on IRA Universal separability was mentioned in passing by Dudley A978) as a way of avoiding measurability difficulties. Most of the results in the chapter can be extended to independent, non- identically distributed observations (Alexander 1984a). Problems [1] In Example 4 relax the assumption that W{-, ¦, P) has a unique minimum; assume the function achieves its minimum for each (a, b) in a region D. Prove that the distance from the optimal (а„, Ь„) to D converges to zero almost surely, provided P does not concentrate at a single point. [The condition rules out the possibility of a minimizing pair for W(-, -, P) with one center off at infinity.] [2] Here is an example of an ad hoc method to force an optimal solution into a restricted region. Suppose an estimator corresponds to the/, that minimizes Pnf over a class J5", and that we want to force/„ into a region Ж. Write y0 for the infi- mum, assumed finite, of Pf over ,?. Suppose there exists a positive function b(-) on J^ such that, for some г > 0, b(f) > 7o + e for / in >[inf//&(/)] Show that liminf inf^jr Р„/> y0 almost surely. [Trivial if y0 < 0.] Deduce that/,, belongs to Ж eventually (almost surely). Now read the case A consistency proof of Huber A967). Compare the last part of his argument with our Theorem 3. [3] Call a class J5" of functions universally separable if there exists a countable subclass J^ such that each/in jF can be written as a point wise limit of a sequence in J^. If J* has an envelope F for which PF < со, prove that universal separability implies measurability of \\Pn — P\\. [4] For any finite-dimensional vector space'S of real functions on S, the class S> of sets of the form {g > 0}, for g in '$, is universally separable. [Express each g in 'S as a linear combination of some fixed finite collection of non-negative functions. Let ^o be the countable subclass generated by taking rational coefficients. For each g in 40 there exists a sequence {д„} in <SU for which д„ [ д. Show that {д„ > 0} J. {g > 0} pointwise.] [5] The operations in Lemma 15 preserve universal separability.
Problems 39 [6] For a universally separable class 3), the quantity Vn defined in Theorem 21 is unchanged if® is replaced by its countable subclass @>0. Prove that Vn is measur- measurable. [7] Prove that the class of indicator functions of closed, convex subsets of TRd is universally separable. [Consider convex hulls of finite sets of points with rational coordinates.] [8] Theorem 16 informs us that the class of quadrants in Ж2 picks out at most 1 + jN + jN2 subsets from any collection of N points. Find a configuration for which this bound is achieved. [9] For N > 2 the sum of binomial coefficients singled out by Theorem 16 is bounded by Nv. [Count subsets of {1, ..., N} containing fewer than V elements by arrang- arranging each subset into increasing order then padding it out with enough copies of the largest element to bring it up to a F-tuple. Don't forget the empty set.] [10] Let M be a subset of [0, 1] that has inner lebesgue measure zero and outer lebesgue measure one (Halmos 1969, Section 16). Define the probability measure ц as the trace of lebesgue measure on M (the measure defined in Theorem A of Halmos A969), Section 17). Assuming the validity of the continuum hypothesis, put M into a one-to-one correspondence with the space [0, U) of all ordinals less than the first uncountable ordinal U (Kelley 1955, Chapter 0). Define <2> as the class of subsets of [0, 1] corresponding to the initial segments [0, x] in [0, U). (a) Show that 2 has linear discrimination. [It shatters no two-point set.] (b) Equip Mm with its product cr-field and product measure цт. Generate ob- observations ?,, ?2,... on P = Uniform@, 1) by taking them as the coordinate projection maps on M°°. Construct empirical measures {Р„} from these observations. Show that sups \PnD — PD\ is identically one. (c) Repeat the construction with the same <2>, but replace (M°°, /t°°) by a countable product of copies of Mc equipped with the product measure Я™, where Я equals the trace of lebesgue measure on Mc. This time sups \PnD — PD\ is identically zero. [Funny things can happen when 2 has measurability problems. Argument adapted from Pollard A981a) and Durst and Dudley A981).] [11] For independent and identically distributed random elements {?(}, write <f" for the cr-field generated by all symmetric functions of ?u ..., ?N as N ranges over n, n + 1, For a fixed function /, apply the usual reversed martingale argument (Ash 1972, page 310) to show that 1Р(Р„/|<Г+1) = PB+1/. If P(sup^ |/|) < 00, deduce p \PJ - Pf\ \Г^\ > sup \Pn+J - Pf\ I & for every class of functions 3" that makes both suprema measurable. [12] Here is one way to prove necessity in Theorem 21. Suppose \\Р„ — P\\ -> 0 almost surely. Construct ^„+ by placing mass n~l at each ?; for which the sign variable tr,- equals +1; construct ц~ similarly from the remaining ?;'s. Notice that P° = Hn — Hn- Let N be the number of sign variables cu...,on equal to +1. (a) Prove that (n/N)n+ has the same distributions as PN. [What if N = 0?] (b) Deduce that both ||/in+ - \P\\ -> 0 and \\ц~ - jP\\ -> 0, in probability.
40 II. Uniform Convergence of Empirical Measures (c) Deduce that ||/*„+ - /л„ \\ -* 0 in probability. (d) Suppose 3> shatters a set F consisting of at least щ of the points ?b ...,?„. Without loss of generality, at least \m\ of the points in F are allocated to /*„+. Choose a D to pick out just those points from F. Use independence properties of the {tr;} to show that, with high probability, /?(D\F) and n~(D\F) are nearly equal. [Argue conditionally on Pn and the at for those ^ in F.] (e) Show that /Jn+(D) — ftT(D) ~ hi with high conditional probability. This contradicts (c). [13] Rederive the uniform strong law for convex sets (Example 22) by the direct approximation method of Theorem 2. [14] Let #" be a permissible class with natural envelope F = sup^ | /1. If \\Р„ - P\\ -* 0 almost surely and if sup^ \Pf\ < oo then PF < oo. [The condition on supj; \Pf\ excludes trivial cases such as SF consisting of all constant functions. From ||Р„ - P\\<& and ||Fn_! -P\\<e deduce n'x\f(Q -Pf\< 2s; almost sure convergence implies KVsup \f{Q -Pf\>n infinitely often \ = 0. if ) Invoke the non-trivial half of the Borel-Cantelli lemma, then replace each t,a by ?i to get oo > IPjsup |/(?i) - Pf\ I > IPFC^) - constant. Noted by Gine and Zinn A984).] [15] Here is an example of how Theorem 24 can go wrong if the envelope F has PF = oo. Let P be the Uniform@, 1) distribution and let J^ be the countable class consisting of the sequence {/J, where ft(x) = x~2{(i + I) < x < Г1}. Show that the graphs have polynomial discrimination and that Pft = 1 for every ;'. But sup,- Pnfi^> oo almost surely. [Find an а„ with naj -> 0, such that [0, а„] contains at least one observation, for n large enough.] [16] Let #^ be the class of all monotone increasing functions on IR taking values in the range [0,1]. The class of graphs does not have polynomial discrimination, but it does satisfy the conditions of Theorem 24 for every P. [If {x;} and {t;} are strictly increasing sequences, the graphs can shatter the set of points (xlt tx\ ..., (xN, tN).] [17] For the J^ of the previous problem, rewrite Pnf as JJ Pn{f > t} dt. Deduce uniform almost sure convergence from the classical result for intervals. [Suggested by Peter Gaenssler.] [18] Let J^ and ^ be classes of functions on S with envelopes F and G. Write У for the class of all sums f + g with / in J5" and g in <&. Prove that N15Q(F + G), Q, SP) < NiEQF, Q, &)N15QG, Q, <S) for i = 1, 2. [19] A condition involving only covering numbers for P would not be enough to give a uniform strong law of large numbers. Let P be Uniform@, 1). Let 3) consist of all sets that are unions of at most n intervals each with length less than n'2, for n = 1, 2,.... Show that sups \PnD - PD \ = 1, even though N^s, P, 9s) < oo for each s > 0.
Problems 41 [20] Deduce Theorem 2 from Theorem 24. [21] Under the conditions set down in Example 26, the function H(-,P) achieves its minimum. [If H{9h P) converges to the infimum as i -> oo, use Fatou's lemma to show that the infimum is achieved at a cluster point of {0J; condition B7) rules out cluster points at infinity.] Notice that only left continuity of ф is needed for the proof. Find and overcome the extra complications in the argument that would be caused if ф were only left-continuous. [22] This problem assumes familiarity with convexity methods, as described in Section 4.2 of Tong A980). Suppose that the distribution P of Example 26 has a density p(-) whose high level sets Dt = {p > t} are convex and symmetric about the origin. Prove that H(9, P) has a minimum at 9 = 0. [By Fubini's theorem, Щ9, P) = (if {0 < s < ф(\х - в\)}{0 < t < p(x)} ds dt dx = || volume[B@, ф))с п ?>,] ds dt, where В(в, г) denotes the closed ball of radius r centered at 9. The volume of B(9, r) n Dt is maximized at в = 0.] When is the minimum unique? Show that a multivariate normal with zero means and non-singular variance matrix satisfies the condition for uniqueness. [23] Suppose {?,} are independent, but that the distribution of ?,-, call it Q,, changes with i. Write P(B) for the average distribution of the first n observations, Pin) = n~1(Q1 + ¦¦¦ + ()„). Show for a permissible polynomial class 9) that sup | Pn D - PMD | -> 0 almost surely. s What difficulties occur in the extension to more general classes of sets, or functions? [Adapt the double-sample symmetrization method of Lemma 33: sample a pair (X2l-i, X2t) from Qt; use the selection variable т; to choose which member of the pair is allocated to Р„, and which to P'n.~\ [24] Show that ^ KQ! + Q2), &) < N2E, Qu [Let /ix be the density of Qt with respect to Q^ + Q2. Consider the approximating functions 0,{/i! >j}+ g2{ht < i}.] [25] Let P be the uniform distribution on [0, I]2. For a sample of n independent observations on P show that !P{some square of area а„ contains no observations} -> 1 if а„ is just slightly smaller than rTx log n. [Break [0, I]2 into N subsquares each with area slightly less than n~1 log n. Set At = {fth subsquare contains at least one observation}. Show that W{Ai+1 |^4X n • • • n At) < РЛ;+1- The probability that each of these subsquares contains at least one point is less than (P^4X)N. Bertrand- Retali A978).]
42 II. Uniform Convergence of Empirical Measures [26] In one dimension, write the bias term for the kernel density estimate as p(x) - p(x) = JK(z)[p(X + az) - p(x)-] dz. Suppose p has a bounded derivative, and that J \z\K(z) dz < со. Show that the bias is of order O(a). Generalize to higher dimensions. [If p has higher-order smooth derivatives, and К is replaced by a function orthogonal to low degree polynomials, the bias can be made to depend only on higher powers of ст.] [27] The graphs of translated kernels Kx „ have polynomial discrimination for any К on the real line with bounded variation. [Break К into a difference of monotone functions.] [28] Let К be a density on JRd of the form h( | x \), where h( ¦) is a monotone decreasing function on [0, oo). Adapt the method of Example 26 to prove that the graphs of the functions Kx a have polynomial discrimination. [29] Modify the density estimate of Example 38 for distributions on the real line by choosing К as a function of bounded variation for which J K(z) dz = 0 and j zK(z) dz = 1 and J \zK(z)\dz < oo. Replace р„ by qn(x) = а~2РпКха. Show that Wqn{x) converges to the derivative of p. How fast can a tend to zero without destroying the almost sure uniform convergence sup^ \qn(x) — Wqn(x)\ -+01
CHAPTER III Convergence in Distribution in Euclidean Spaces ... which runs through some of the standard methods for proving convergence in distribution of sequences of random vectors, and for proving weak con- convergence of sequences of probability measures on euclidean spaces. These include: checking convergence for expectations of smooth functions of the random vectors; checking moment conditions for sums of independent random variables (the Central Limit Theorems); checking convergence of characteristic functions (the Continuity Theorems for characteristic functions); and reduction to analo- analogous problems of almost sure convergence via quantile transformations. III.1. The Definition Convergence in distribution of a sequence {Xn} of real random variables is traditionally defined to mean convergence of distribution functions at each continuity point of the limit distribution function : IP{Xn < x} -> IP{X < x} whenever IP{X = x} = 0. Although convenient for work with order statistics and quantiles, this definition does have some disadvantages. Distribution functions are not well suited to calculations involving sums of independent random variables. The simplest proofs of the Central Limit Theorem, for example, do not directly check pointwise convergence of distribution functions; they show that sequences of characteristic functions, or expectations of other smooth functions of the sums, converge. With the extensions to sequences of random vectors (measurable maps into multidimensional euclidean space JRd), the difficulties multiply. And for random elements of more general spaces not equipped with a partial ordering, even the concept of distribution function disappears. With all this in mind, let us start afresh from an equivalent definition, which lends itself more readily to generalization. 1 Definition. A sequence of random vectors {Х„} is said to converge in distribution to a random vector X, written Xn ~» X, if 1Р/(Х„) -»IP/(X) for every / belonging to the class ^(Ж*) of all bounded, continuous, real func- functions on IR*. ? This notion of convergence does not specify the limit random vector uniquely. If X and Y have the same distribution, that is, if IP{X e A} = IP{ Ye A} for each borel set A,
44 III. Convergence in Distribution in Euclidean Spaces then Х„ -» X means the same as Xn ~> Y. This invites slight abuses of nota- notation. For example, it is convenient to write Xn ~> N@, 1), meaning that the sequence of real random variables {Xn} converges in distribution to any random variable having a standard normal distribution. The symbol N@, 1) stands not only for a particular probability measure on the borel u-field ЩЖ) but also for any random variable having that distribution. Similarly, we can avoid much circumlocution by writing, for example, instead of: if Х„ has a binomial distribution with parameters n and \, and X has a normal distribution with mean zero and variance \, then In general, for probability measures on the borel <7-field &(TRk), define Р„ ~> Р to mean convergence in distribution for random vectors having these distributions. This definition is equivalent to the requirement: Pnf^Pf for every / in ^(SR.k). Most authors call this weak convergence. III.2. The Continuous Mapping Theorem Suppose Xn ~> X, as random vectors taking values in Шк, and let Я be a measurable map from IR* into Ж5. Does it follow that HXn -* HX1 That is, does 1РДЯХ„) -+ Щ(НХ) for every / in <^(Ж5)? If H were continuous, /off would belong to %(Жк) for every / in %(Ж3). The result would be trivially true. The convergence HXn ~» HX also holds under a slightly weaker assumption: it suffices that H be continuous at almost all points of the range of X. This will follow as a simple corollary to the next lemma. 2 Convergence Lemma. Let hbe a bounded, measurable, real-valued function on Жк, continuous at each point of a measurable set C. (i) Let {Xn} be a sequence of random vectors converging in distribution to X. If P{X ? Q = 1, then 1Р/г(Х„) ^ Wh(X). (ii) Let {Pn} be a sequence of probability measures converging weakly to P. IfP(C) = 1, then PJx -> Ph. Proof. As the two assertions are similar, we need only prove (ii). Consider any increasing sequence {/;} of bounded, continuous functions for which f<h everywhere and f | h at each point of C. Accept for the moment that such a sequence exists. Then weak convergence of {Pn} to P implies that C) Pn f -> Pf for each fixed i.
III.2. The Continuous Mapping Theorem 45 On the left-hand side bound /j by h. liminf Pnh > Pft for each fixed ;'. Invoke monotone convergence as i tends to infinity on the right-hand side. D) liminf Pn h > Ph. The companion inequality obtained by substituting — h for h combines with D) to complete the proof. Now to construct the functions {/}. They must be chosen from the family If we can find a countable subfamily of 3F, say {gu g2, ¦..}, whose pointwise supremum equals h at each point of C, then setting/; = maxf^, ...,#;} will do the trick. Without loss of generality suppose h > 0. (A constant could be added to h to achieve this.) For each subset A of Ж* define the distance function d(-,A)by d(x, A) = inf{|x - y\; у е A]. It is a continuous function of x, for each fixed A. For positive integral m and positive rational r define fm,Ax) = г л md(x, {h < r}). Each/m r is bounded and continuous; it is at most r if h(x) > r; it takes the value zero if h(x) < r: it belongs to 3F. Given a point x in С and an s > 0, choose a positive rational number r with h{x) — s < r < h(x). Continuity of h at x keeps its value greater than r in some neighborhood of x. Conse- Consequently, d(x, {h < r}) > 0 and/m r(x) = r > h(x) - s for all m large enough. П Weak convergence of {Pn} was needed only to establish the convergence C) for the functions {/,}. These functions were, however, not just continuous, but uniformly continuous. The functions from which they were constructed even satisfied a Lipschitz condition:
46 III. Convergence in Distribution in Euclidean Spaces Thus the lemma could have been proved using only convergence of expecta- expectations of bounded, uniformly continuous functions of the random vectors. In particular, such a requirement would imply the convergence Pnh -> Ph for each h in 5 Corollary. // Pnf —> Pf for every bounded, uniformly continuous f then Pn ~> P. (And similarly for convergence in distribution of random vectors.) ? The lemma also provides an answer to the question asked at the start of the section. 6 Continuous Mapping Theorem. Let H be a measurable mapping from Жк into ЖЛ Write С for the set of points in Шк at which H is continuous. If a se- sequence {Xn} of random vectors taking values in Жк converges in distribution to a random vector X for which IP{X e C) = 1, then HXn ~> HX. Proof. For each fixed / in ^(JR.S), the bounded function f°His continuous at all points of C. ? Some authors seem to regard this result as trivial and obvious; they scarcely notice that, at least implicitly, they make use of it for many applica- applications. It is better to recognize these covert appeals to the Continuous Mapping Theorem. That way the more general form of the result in Chapter IV will come as no surprise. 7 Example. If the real random variables {Xn} converge in distribution to X then Р{Х„ < x} -* IP{X < x} at each x for which IP{X = x} = 0. That is, the sequence converges at each continuity point x of the distribution function of X. This holds because x is the only point of discontinuity of (the indicator function of) the set ( — oo, x]. Problem 1 shows you how to go the other way, from pointwise convergence of distribution functions to convergence in distribution of the random variables. The same result is true in higher dimensions if the inequalities Xn < x and X < x are taken componentwise and ( —oo,x] is interpreted as a multidimensional orthant with vertex at x. Continuity of the multi- multidimensional distribution function of X at x requires that X lands on the boundary of (—oo, x] with zero probability. ? 8 Example. Consider the multinomial distribution obtained by independent placement of n objects into к disjoint cells. Write Fn = (Fnl,..., Fnk) for the column vector of observed frequencies—cell i receives Fni objects—and P — (Ръ ¦ • ¦ > Pk) f°r tne column vector of cell probabilities. Pearson's chi- square statistic is к Zn= E(fni - nPiJ/nPi.
III.2. The Continuous Mapping Theorem 47 Write Zn as a function of a standardized column vector, by setting Xn = n-l'2(Fn-np) and Zn = Х'„А~2Хп, where Л denotes the diagonal matrix with sfpi as its ith diagonal element. By the Multivariate Central Limit Theorem, which will be proved later in this chapter (Theorem 30), the random vectors {Х„} converge to a N@, V) distribution whose variance matrix V has (i, j)th element pt — pf if i = j and —PiPj otherwise. Manufacture a random vector with this limit distribu- distribution by applying a linear transformation to a column vector W of indepen- independent N@, 1) random variables: Xn ~> Л^ — uu')W, where и denotes the unit column vector (y/p~i,..., sfpd- The mapping H from Шк into Ж defined by Hx = x'A'2x = \А~гх\2 is continuous. Apply the Continuous Mapping Theorem. Zn = HXn -* HA(Ik - uu')W = \{Ik- uu')W\2 ~ Xk-u because Ik — uu' represents the projection orthogonal to the unit vector u. (The squared length of the projection of W onto any (k — l)-dimensional subspace has a chi-square distribution with к — I degrees of freedom.) ? 9 Example. Suppose ?и ?2,... are independent random variables each with a Uniform@, 1) distribution. Neyman A937) developed a goodness-of-fit test whose asymptotic properties depended on the behavior of the statistics к Г п -12 j=i b=i J where n0, пъ ... are given polynomials defined on [0, 1], with the ortho- normality property: Explicitly, no(y) = 1,Я1(з0 = y/U(y - Uitj(y) = У5[6(у - iJ - fl.and so on. Define random column vectors Xt = (fi1(.?d, ••;*№) for I" =1,2,.... The statistic Gn can then be written as
48 III. Convergence in Distribution in Euclidean Spaces The Multivariate Central Limit Theorem (Theorem 30) and the ortho- normality properties of the polynomials ensure that The Continuous Mapping Theorem (applied to which map?) allows us to deduce that Gn -> xl¦ Neyman used this to determine the approximate critical region for his test. ? III.3. Expectations of Smooth Functions There are two sorts of perturbation of a random vector X that don't affect the expectation Wf(X) of a smooth, bounded function of X too greatly: changes, however gross, that occur with only small probability; and changes that might occur with high probability but which alter X by only small amounts. These effects are easy to quantify when the smooth function / is uniformly continuous. Suppose | f(x) — f(z)\ < г whenever \x — z\ < 5. Write H/ll for the supremum of/(-)- Then for any random vectors X and Y, whether dependent or not, A0) |P/(X)-F/(X + 7)| <W{\Y\<S}\f(X)-f(X+Y)\ The inequality lets us deduce convergence in distribution of a sequence of random vectors from convergence of slightly perturbed sequences. 11 Lemma. Let {Х„}, X and Y be random vectors for which Xn + aY-+ X + oY for each fixed positive a. Then Xn -¦ X. Proof. Remember (Corollary 5) we have only to check that Р/(Х„) -> P/(X) for each bounded, uniformly continuous /. Apply inequality A0) with X replaced by Xn and Y replaced by oY. sup \SPf(Xn) - Wf(Xn + oY)\ < г + 2||/||1P{| Y| > Sa'1}. n Similarly |F/CY) - 1РДХ + oY)\ < e + 2||/||Р{|У| > Sa'1}. Choose a small enough to make both right-hand sides less than 2e, then invoke the known convergence of {P/(XH + aY)} to P/(X + aY) to deduce that limsup | P/(ZB) - P/(X) \ < 4e. ? Now, instead of thinking of the aY as a perturbation of the random vectors, treat it as a means for smoothing the function/. This can be arranged
III.3. Expectations of Smooth Functions 49 by choosing independently of X and the {Xn} a random vector Y having a smooth density function with respect to lebesgue measure—for convenience, take Y to have a JV(O, Ik) distribution. Integrate out first with respect to the distribution of Y then with respect to the distribution of X. JPf(X + oY) = WfJ(X), where fa(x) = JB7r)-t/2/(x + ay) exp(-4M2) dy = (Bna2yk'2f(z) exp(-4|z - x\2/a2) dz. The function / has been smoothed by convolution. Dominated convergence justifies repeated differentiation under the last integral sign to prove that fa belongs to the class ^"(Ж*) of all bounded real functions on Шк having bounded, continuous partial derivatives of all orders. 12 Theorem. // F/(XB) -> F/(X) for every fin # "(Ж*) then Х„ ~> X. Proof. Convergence holds for every fa produced by convolution smoothing. Apply Lemma 11. ? For the remainder of the section assume that к = 1. That is, consider only real random variables. As the results of Section 5 will show, no great generality will be lost thereby—a trick with multidimensional characteristic functions will reduce problems of convergence of random vectors to their one-dimensional analogues. For expectations of smooth functions of X, the effect of small perturba- perturbations can be expressed in terms of moments by applying Taylor's theorem. Suppose / belongs to ^"(Ж*). Then, ignoring the niceties of convergence, we can write /(* + y) = f(x) + уf\x) + b2f"(x) + ¦¦¦. Suppose the random variable X is incremented by an independent amount Y. Then, again ignoring problems of convergence and finiteness, deduce A3) Wf(X + Y) = F/(Z) + W(Y)Wf'(X) + ^ Try to mimic the effect of the increment У by a different increment W, also independent of X. As long as F(Y) = JP(W) and F(Y2) = W(W2), the expectations F/(Z + Y) and F/(X + W) should differ only by terms in- involving third or higher moments of Y and W. These higher-order terms should amount to very little provided both Y and W are small; the effect of substituting W for Y should be small in that case. This method of substitution can be applied repeatedly for a random variable Z made up of a lot of little independent increments. We can replace
50 III. Convergence in Distribution in Euclidean Spaces the increments one after another by new independent random variables. If at each substitution we match up the first and second moments, as above, the overall effect on F/(Z) should involve only a sum of quantities of third or higher order. In the next section this approach, with normally distributed replacement increments, will establish the Liapounoff and Lindeberg forms of the Central Limit Theorem. To make these approximation ideas more precise we need to bound the remainder terms in the informal expansion A3). Because only the first two moments of the increments are to be matched, a Taylor expansion to quad- quadratic terms will suffice. Existence of third derivatives for/will help to control the error terms. Assume / belongs to the class ^(IR) of all bounded real functions on Ж having bounded continuous derivatives up to third order. Then the remainder term in the Taylor expansion A4) f(x + y) = f(x) + yf'ix) + \y2f"ix) + Rix, y) can be expressed as R(x, y) = b3f'"(x + 0ijO i (depending on x and y) between 0 and 1. Write ||/'"|| for the supremum of | /'"(•) | .Then A5) |Д(*,H1<|11/'1М3, Set С equal to i||/'"||. Then from A4) and A5), |F/(X + Y) - F/(X) - F(Y)F/'(X) - %JPiY2)TPf"iX)\ <W\RiX,Y)\ <CF(|Y|3). Apply the same argument with Y replaced by the increment W, which is also independent of X. Because F(Y) = F(PF) and F(Y2) = F(W2), when the resulting expansion for F/(X + W) is subtracted from the expansion for JPfiX + Y) most of the terms cancel out, leaving, A6) |F/(X + У) - F/(X + WO| ^ TP\R(X, Y)\ + W\RiX, W)\ < CF(|Y|3) f This inequality is sharp enough for the proof of a limit theorem for sums of independent random variables with third moments. III.4. The Central Limit Theorem A sum of a large number of small, independent random variables is approxi- approximately normally distributed—that is roughly what the Central Limit Theorem asserts. The rigorous formulations of the theorem set forth condi- conditions for convergence in distribution of sums of independent random
III.4. The Central Limit Theorem 51 variables to a standard normal distribution. We shall prove two versions of the theorem. To begin with, consider a sum Z = ^ + ¦ • • + ?,k of independent random variables with finite third moments. Write c? for P??. Standardize, if necessary, to ensure that P?; = 0 for each ;' and o\ + ¦ ¦ ¦ + ak = 1. Inde- Independently of the {?j}, choose independent iV@, <T?)-distributed random variables {n}}, for; = 1,..., k. Start replacing the {^} by the {tjj}, beginning at the right-hand end. Define Sj = ?i + • • • + ?,-_! + nJ+1 + ¦ ¦ ¦ + r\k. Notice that Sk + ?k = Z and that Sj + пг has a JV(O, 1) distribution. Choose a ^3(IR) function /, as in Section 3. Theorem 12 has shown that convergence for expectations of infinitely differentiable functions of random variables is enough to establish convergence in distribution; convergence for functions in ^3(IR) is more than enough. We need to show that P/(Z) is close to P/(iV@, 1)). Apply inequality A6) with X = SJ} Y = ?j, and W = r\y Because Sj + ?j = SJ+1 + nJ+1 for; = 1,..., к - 1, A7) | P/(Z) - TPf(N@, 1)) | < E <c X J=l J=l With this bound in hand, the proof of the first version of the Central Limit Theorem presents no difficulty. 18 Liapounoff Central Limit Theorem. For each n let Zn be a sum of indepen- independent random variables ?nl, ?n2,. ¦., ^„fc(n) with zero means and variances that sum to one. If the Liapounoff condition, A9) f Pl^f^O flsn^oo, is satisfied, then Zn ^» N@, 1). Proof. Choose and fix an / in ^3(IR). Check that P/(Zn) -> Р/(ЛГ@, 1)). The replacement normal random variables are denoted by nnl,..., п„к(п). The sum nnl + ¦ ¦ ¦ + nnk(n) has a JV(O, 1) distribution. Write a\-} for the variance of ?nJ, and Xn for the sum on the left-hand side of A9). With sub- subscripting n's attached, the bound A7) becomes |p/(zn) - p/(jv(o, i»| <cxn + c x J"=l
52 III. Convergence in Distribution in Euclidean Spaces By Jensen's inequality, o*j = (P<^K/2 < IP| ^„j|3, which shows that the sum contributed by the normal increments is less than P|iV@, 1)|3А„. Two calls upon A9) as n -* со complete the proof. ? The Liapounoff condition A9) imposes the unnecessary constraint of finite third moments upon the summands. Liapounoff himself was able to weaken this to a condition on the B + <5)th moments, for some 3 > 0. The remainder term R(x, y) in the Taylor expansion A4) does not increase as fast as \y\2+d though: B0) | R(x, y) | = | [/(* + y) - f{x) - yf'(x)-] - b2 < II /"IIIУ I2 for all x and y. The new bound improves upon A5) for large \y\, but not for small \y\. To have it both ways, apply A5) if \y\ < e and B0) otherwise. Increase С to the maximum of ?||/"'|| and ||/"||. Then the bound on the expected remainder is sharpened to B1) W\R(X, Y)\ <РС|У|3{|У| <е} +РС|У|2{|У| > е} <еСР(У2) + СРУ2{|Г| > e}. 22 Lindeberg Central Limit Theorem. For each n let Zn be a sum of indepen- independent random variables ?„и ?„2,..., ?nk(n) with zero means and variances that sum to one. If, for each fixed e > 0, the Lindeberg condition B3) t &?j{IZnj\ > ?} - 0 as n -> со is satisfied, then Zn -^ JV(O, 1). Proof. Use the same notation as in the proof of Theorem 18. Denote the left-hand side of B3) by Ln(e). Stop one line earlier than before in the ap- application of A7): k(n) k(n) |P/(Zn) - IP/(JV(O, 1))| < X F\R(Snj, Li)\ + E JP\R(Snj, nnJ)\. j=l J=l From B1), the first sum is less than С ? ЙЙ j= which equals Cs + CLn(s) because the variance sum to one. For the second sum retain the bound J=l
Ш.4. The Central Limit Theorem 53 but, in place of Jensen's inequality, use: 12 / \2Г1с(и) -|2 tk(n) -12 / = max a*] i <?2 + LB(e). ? The strange-looking Lindeberg condition is not as artificial as one might think. For example, consider the standardized summands for a sequence {Yn} of independent, identically distributed random variables with zero means and unit variances: ?nj = n~ ll2Yj for; = 1, 2,..., n. In this case LB(e) = nWn-1Yj{\Y1\> nll2s}, which tends to zero as n tends to infinity, by dominated convergence, because Y\ is integrable. It is even more comforting to know that the Lindeberg condition comes very close to being a necessary condition (Feller 1971, Section XV.6) for the Central Limit Theorem to hold. 24 Example (A Central Limit Theorem for the Sample Median). Let М„ be the median of a sample {Yx, Y2,..., Yn} from a distribution function G with median M. For simplicity, suppose the sample size n is odd, n = 2JV + 1, so that М„ equals the (N + l)st order statistic of the sample. Suppose also that the underlying distribution function G has a positive derivative у at its median. To prove that it suffices to check pointwise convergence of the distribution functions. B5) IP{n1/2(Mn - M) < x} = JP{Mn<M + n~1/2x} = F {at least N + 1 observations < M + n~i/2x} = F\n + 1 < ? {Yj<M + n/2x}l. Define M + n~ll2x} = G(M + n/2x), tn = (N + 1 - пр„)/Ь„.
54 III. Convergence in Distribution in Euclidean Spaces Check that P?nj- = 0 and P^ H + Р<^„ = 1. Continuity of G at M gives pn-*\; differentiability of G at M gives tn -> — 2xy. The last probability in B5) becomes B6) JpL < f Ц. As each | ?nj\ is less than b~1, which converges to zero, both the Liapounoff and Lindeberg conditions are easy to check. By either of two routes, Problem 13 shows that substitution of — 2xy for tn in B6) leads to the correct result: Р{и1/2(М„ - M) < x) -> IP{iV@, 1) > -2xy}. П III.5. Characteristic Functions The smoothing argument of Section 3 showed that only expectations of <^C0(IR'') functions need be considered when checking convergence in distri- distribution of random vectors. In this section, further analysis of the method of smoothing will show that the class of functions can be narrowed a little more: it suffices to check the convergence P exp(iy • Xn) -+ P exp(;y • X) for each fixed vector у inlR*. That is, pointwise convergence of characteristic functions implies convergence in distribution. Start again from the convolution expression aY) = P fB7iff2)-i(/2/(z)exp(-||z - X\2/a2)dz, derived under the assumption that Y has a N@, Ik) distribution independent of X. This holds for every bounded measurable /. Reverse the order of
III.5. Characteristic Functions 55 integration (Fubini) on the right-hand side to show that TPf(X + aY)= jf(z)J(z) dz, where B7) J(z) = WBno2ym exp(-i|z - X\2/a2). The distribution of X + oY has density function J(-) with respect to lebesgue measure. The dependence on a, which will remain fixed for most of the argument, need not be made explicit. The integrand appearing on the right-hand side of B7) comes from the density function of oY evaluated at (z — X). It is also proportional to the characteristic function of Y evaluated at (z — X)/a: exp(-i|z -X\2/a2) = jBnyk'2 exP(iy(z - X)/a - i\y\2)dy. Invoke Fubini's theorem. B8) J(z) = ГB7кт)-"Р ехр(-iy • X/o) exp(iy • z/a - \\y\2) dy = ^BпаУкф{-у/а) exp(iy ¦ z/a - i|y|2) dy, where ф(-) denotes the characteristic function of X. Formula B8) shows that the characteristic function of X uniquely deter- determines the distribution of X + aY for each fixed a. Because X + oY con- converges almost surely to X as a tends to zero, this proves that the characteristic function also uniquely determines the distribution of X. A stronger result can be proved by exploiting continuity properties of the dependence of J on ф. 29 Continuity Theorem. A sequence of random vectors {Xn} converges in distribution to a random vector X if and only if the corresponding sequence of characteristic functions converges pointwise: IP exp(iy • Xn) ->¦ P exp(jy • X) for each fixed у in Шк. Proof. Prove necessity by splitting exp(ix • y) into its real and imaginary parts, both of which define functions in
56 HI. Convergence in Distribution in Euclidean Spaces For the proof of sufficiency, denote the characteristic function of Xn by ф„, and the density function of Х„ + aY by Jn. The domain of integration is Жк. For fixed / and a, \JPf(Xn + aY) - JPf(X + aY) f(z)Jn(z) dz - jf(z)J(z) dz < 11/11 11 Uz)-J(z) | dz = ll/ll U(J(z)-Jn(z))+dz - = 2Ц/11 J(J(z)-Jn(z))+rfz, - Jn(z))~ dz because 0=1-1 (z) - Jn{z)) dz (z) - Ш)+ dz - j(J(z)-Jn(z)y dz. Dominated convergence applied to B8) with ф replaced by ф„ shows that the functions {Jn} converge pointwise to J. Thus {(J - Jn)+) converges pointwise to zero. Notice that (J — Jn)+ < J for each n. Invoke dominated convergence again. 2Ц/11 j(J(z)-Jn(z))+dz^0. Complete the proof by an appeal to Lemma 11. ? Perhaps this result should have been proved before we launched into the Central Limit Theorem proofs of Section 4. At least for identically distributed random variables, Theorems 18 and 22 could have been disposed of more rapidly; but the general cases would have required much the same level of effort, applied to exp(y • ix) in place of a general ^3(IR) function. The ad- advantages of working with characteristic functions come from the special multiplicative properties of the complex exponential. A central limit theorem for martingales in Chapter VIII will make use of this property. Theorem 29 contains a bonus: convergence in distribution of random vectors, Xn ~> X, is equivalent to the convergence in distribution of all linear functions of the random vectors: у ¦ Х„ ~> у ¦ X, for every y. (For the
III.6. Quantile Transformations and Almost Sure Representations 57 non-trivial half of the assertion, notice that 0n(t) is the expectation of a bounded continuous function of t • Xn.) Convergence problems for random vectors reduce to collections of convergence problems for random variables; limit theorems for random vectors can be deduced from their univariate analogues. 30 Multivariate Central Limit Theorem. For every sequence {?„} of indepen- independent, identically distributed random vectors with P?,- = 0 and Р(?у?}) = V, n-1/2(^+--- + ^)-JV@, V). Proof. For fixed y, the random variables {y ¦ ?„} are independent and identically distributed with zero means and variances equal to y'Vy. If y'Vy = 0 the random variables у ¦ ?j and у ¦ N@, V) degenerate to zero; the assertion then holds for trivial reasons. Otherwise standardize to unit variance. The standardized sequence satisfies the Lindeberg condition. By Theorem 22, (y'Vyy l'2y ¦ n- v\ix +... + ?,)-> N@, 1) = (y'Vyyll2y-N@, V). ? III.6. Quantile Transformations and Almost Sure Representations The definition of convergence in distribution by means of pointwise con- convergence of distribution functions does have its advantages, at least for real random variables. Behind some of these advantages lies a construction that reduces problems involving arbitrary distributions on the real line to the uniform case. For each distribution function F define a quantile function Q(u) = M{t: F(t) >u} for 0 < и < 1. Right continuity of F shows that Q(u) sits at the left endpoint of the closed interval oft values for which F(t) > u; in other words, F(t) > и if and only if Q(u) < t. If ? has the Uniform@, 1) distribution then > ?} = F(t). That is, <2(?) has distribution function F. The same construction gives a method for representing weakly convergent sequences of probability measures by sequences of random variables that converge almost surely.
58 III. Convergence in Distribution in Euclidean Spaces 31 Representation Theorem. Let {Р„} be a sequence of probability measures on the real line that converges weakly to P. There exist random variables {Х„} with distributions {Pn}, and X, with distribution P, such that Х„ -» X almost surely. Proof. Write Fn for the distribution function of Р„, and F for the distribution function of P. Denote the quantile function corresponding to Fn by Qn. Choose ? with distribution Uniform@, 1) then define Xn = QM)- It suffices to show that {Qn(u)} converges for each и outside a countable subset of @, 1), for then we can define X (almost surely) as the limit of the {Х„}. (Problem 21 gives a more concrete representation for X.) Recall that Fa(t) -*¦ F(t) for each t in the dense set To of points for which ]P{X = t} = 0. Each non-empty open interval contains points of TQ. In particular, if {Qn{u)} does not converge for some fixed и then there exist points t and s in To for which liminf Qn(u) < t < s < limsup Qn(u). Since Qn(u) < t implies Fn(t) > u, the limit along a subsequence gives F(i) > u. Similarly, Qn(u) > s infinitely often implies that F(s) < u. Thus F takes the constant value и throughout the interval [t, s]. There can be at most countably many values of и (a set of zero lebesgue measure) for which F has such a flat spot: different и values produce disjoint flat spots, and the real line can accommodate only countably many disjoint intervals. This proves that liminf Qn(u) = limsup Qn(u) for almost all values of и in @, 1). If X is to be defined as the pointwise limit of {Xn} we should also exclude the possibility of infinite limits. Given и in @, 1), choose t in To such that F(t) > u. Convergence at points of To gives Fn(t) > и eventually, which implies Qn{u) < t eventually, and so limsup Qn(u) < со. The liminf can be handled similarly. ? Some weak convergence results, such as the one-dimensional case of the Continuous Mapping Theorem, reduce to trivialities when the Representa- Representation Theorem is invoked. Similar representations do obtain in higher dimensions and for probability measures on abstract metric spaces, but these require a completely different construction. More about that in Chapter IV. 32 Example. Given a probability measure P on ^(IR) and a bounded se- sequence of measurable functions {hn} converging pointwise to a limit function, h, under what conditions will Pnhn -> Ph for every sequence {Р„} converging weakly to P? Consider first the special case where P concentrates at a single point x. Choose Pn concentrated at х„, for any sequence of real numbers {х„} con- converging to x. In this case the requirement reduces to hn(xn) -* h{x). This shows that something stronger than mere pointwise convergence of hn to
Ш.6. Quantile Transformations and Almost Sure Representations 59 h is needed. (Notice that if hn = h for every n, the requirement is equivalent to continuity of h at x.) In general it will suffice if hn(xn) -* h(x) for every x in a set of P measure one and every sequence {х„} converging to such an x. Represent {Р„} by an almost surely convergent sequence {Xn} of random variables. Then hn(Xn) -> h{X) almost surely. By dominated convergence, Whn(Xn) -> JPh(X), which implies that Pnhn -> Ph. П 33 Example. Let G be an open subset of IR. Suppose that Pn ~> P. The Representation Theorem can be used to show that liminf PJfi) ^ P(G). (The proof of the Convergence Lemma also contains the result implicitly.) Switch to almost surely convergent representations Х„ and X. For an со at which convergence holds, if X(a>) belongs to G then so does Х„(со) for all n large enough. That is, liminf {Xn e G} > {X e G} almost surely. Apply Fatou's lemma. liminf PnG = liminf W{Xn e G} > P liminf{Zn e G} > W{X e G} = P(G). П 34 Example. Let {ф„} be the characteristic functions corresponding to a weakly convergent sequence of probability measures {?„}, and ф be the characteristic function of the limit distribution P. Take {Х„} as the almost surely convergent representing sequence of random variables. For each fixed t, the sequence {exp(itXn)} converges almost surely to exp(itX). Moreover, the convergence is uniform on each bounded interval / of f-values: sup|exp(rtZJ - exp(itX)| < sup | exp(it(Xn — X)) — 1)| -> 0 almost surely, by virtue of the continuity of the exponential function at zero. Because sup |IP exp(;tZn) - IP exp(itX)\ < P sup \exp(it(Xn - X)) - 1)[, the sequence {(jf)n(t)} converges uniformly to 0@ over bounded intervals. П The argument for the proof of the Representation Theorem made little explicit use of the limit distribution function F. The essentials were: (i) existence of lim Fn(t) for each t in a dense subset To; (ii) for each fixed и in @, 1) there exists a t in To for which liminf Fn(t) > u, and an s in To for which limsup Fn(s) < u.
60 III. Convergence in Distribution in Euclidean Spaces The second property is equivalent to the existence, for each fixed e > 0, of a constant К such that C5) limsup Pa\_-K, KJ < s. Of course К depends on e. A sequence {Р„} with this property is said to be uniformly tight. The corresponding property for a sequence of random variables {Х„}—that for each e > 0 there exists а К such that limsupIP{|XJ > К} < е —is also called uniform tightness. 36 Theorem. Each uniformly tight sequence of probability measures on the real line has a subsequence that converges weakly. Proof. Use Cantor's diagonalization procedure to select a subsequence {Р„.} for which {Fn>(t)} converges for each rational t. The subsequence satisfies both of the conditions (i) and (ii) noted above. Construct an almost surely convergent sequence of random variables {Х„,} such that Xn. has distribution Pn, and Х„. -* X almost surely. The sequence {?„-} converges weakly to the distribution of X. ? By providing a method for constructing probability measures on this theorem allows specification of the limit distribution to be omitted from some important results. For example, the Continuity Theorem for charac- characteristic functions, as proved (Theorem 29) in Section 5, can be improved upon slightly. 37 Continuity Theorem (General Form). Let {Xn} be a sequence of real random variables whose characteristic functions {</>„} converge pointwise to some function ф. If ф is continuous at the origin then it must be the charac- characteristic function of an X for which Xn ~> X. Proof. Once we prove that ф is a characteristic function, Theorem 29 will do the rest. Let us show that {Х„} is uniformly tight. Then, by Theorem 36 it will have at least one subsequence that converges in distribution. The limit of any such subsequence must have ф as its characteristic function. The inequality C5), which defines uniform tightness, calls for a constraint to be placed on the tails of the distributions of the random variables {Xn}. Here we can use an inequality, valid for any random variable Z with charac- characteristic function p, that relates the tail behavior of a distribution to the values of its characteristic function near the origin: there exists a positive constant a (approximately 6.308) such that C8) P{|Z| > /Г1} < (a/2/0 f [1 - (>№ dt J-h
Notes 61 for every positive h. This follows (Fubini's theorem) from the equality Г* B/г) [1 - IP exp(iZt)] dt = P[l - (sin hZ)/hZ]. J-h Interpret (sin 0)/0 as 1. Because the integrand is non-negative everywhere and greater than a71 = 1 - sin 1 on the set {| Z [ > h~1}, the expectation is greater than or4P{|Z| > /Г1}. Apply C8) with Z = Xn and p = ф„. limsup 1Р{\Х„\ > /Г1} < limsup (a/2h) [1 - фп n n J —h dt = (a/2h) [1 - фAI dt J-h by dominated convergence. Thanks to the continuity of ф at the origin, the last expression can be brought close to «[1 - 0@)] = lim «[1 - фМ1 = О П by choosing h small enough. ? Notes Everything in this chapter is classical. The name Continuous Mapping Theorem seems to have taken hold in the literature, despite its evident inappropriateness. Billingsley A968, Section 5) has more on the history of the theorem. Some authors refer to F-almost-sure-continuity as P-Riemann-integrability. Convolution with a normal kernel as a means of smoothing is an ancient idea. Weierstrass A885) used it to prove his famous approximation theorem. Liapounoff A900) introduced an auxiliary normal variable as a device for smoothing a sum of independent variables. That enabled him to complete the argument of Glaisher A872), and thereby prove with full rigor a central limit theorem. The Liapounoff condition did not appear in the 1900 paper; instead Liapounoff imposed a stronger requirement on the maximum third absolute moment. In his 1901 paper he improved this to a requirement on the sum of B + <5)th moments about the mean. (Problem 19 gives a minor variation on his 1901 result.) Section 4 follows in part the lucid exposition of Lindeberg A922), with a little help from Billingsley A968, Section 7). Uspensky A937, Chapter XIV) attributed the Continuity Theorem to Liapounoff A900); the key idea was implicit in Liapounoff's proof. Levy A922) proved it explicitly. Both Liapounoff and Levy required uniform convergence on compacta. The dominated convergence trick in the text is often called Scheffe's lemma.
62 Ш. Convergence in Distribution in Euclidean Spaces Problems [1] Suppose X, Xu X2,--- are real random variables such that 1Р{Х„ < x] -> IP{X < x] whenever IP{X = x} = 0. Prove that Xn ~» X. [For a fixed / in «"(IR) find points {a,-} with — oo < o^ < a2 < • • • < ak < oo such that IP{X = aj = 0 and both of W{X < aj and IP{X > ak) are small. Choose the {aj close enough together to ensure that / changes by very little over each interval [a;,aj+1]. Approximate / by a step function constant on each of those intervals.] Extend the argument to higher dimensions. [2] Let {xn} be a sequence of points in IR*, and {Х„} be a sequence of degenerate random vectors with 1Р{Х„ = х„} = 1. Show that {Х„} converges in distribution if and only if {х„} converges. [3] Let P be a probability measure on the <r-field ^(IR*). Show that for each s > 0 and each borel set В there exists a closed set F? contained in В and an open set G? containing В such that P(Gf — F?) < к. [Prove that the class of all such В forms a (т-field. It contains every closed set, because closed sets can be written as count- countable intersections of open sets.] Deduce that if IP/(X) = IP/( Y) for every bounded uniformly continuous / then X and У have the same distribution. [Represent open sets as the limits of increasing sequences of uniformly continuous functions.] [4] A sequence of probability measures on ^(IR*) can converge weakly to at most one limit distribution: if Р„ ~> P and Pn ~> Q then P and Q agree for all borel sets. [Use Problem 3.] [5] If Х„ ~> X then Х„ + op(l) -> X. [Appendix A explains the op(-) notation. This result is sometimes called Slutsky's theorem.] [6] If {Х„} converges in probability to X then Xn ~> X. The converse is false. [7] If Х„ ~* X and IP{X = c} = 1, for some constant c, then {Xn} converges in probability to с [8] Give an example of a continuous function g and a sequence of random vectors with Х„ ~-> X for which {ТРд(Х„)} does not converge to IP#(X). [Don't forget the word bounded in the definition of convergence in distribution.] [9] Give an example of a map H from IR* into itself and a sequence of random vectors in Жк with Xn ~» X for which HXn does not converge in distribution to HX. [10] The set D of discontinuity points of a map H from IR* into IRS is borel measurable. [Start from the set Dm „ of all those points x for which \Hy — Hz\ > m'1 for at least one pair of points with \x — y\ < n and \x — z\ < n~l. Prove that Dm „ is open. Tops0e.] [11] A random IR^-vector X and a random IRJ-vector У defined on the same probability space can be combined into a single random IR-/+t-vector (X, У). If Х„ -~> X, as random vectors in IR*, and У„ -+ У, as random vectors in IRJ, it need not follow that (Х„, У„) ~» (X, Y). But if IP{X = c} = 1 for some constant с the result is true. [Consider expectations of bounded uniformly continuous functions on IRJ+*.] Use characteristic functions to prove that the result also holds if X is independent of У and Xn is independent of У„, for each n.
Problems 63 [12] If Х„ has a t-distribution on n degrees of freedom then Xn -* N@, 1) as n -* oo. [Use Problem 11.] [13] If real random variables {Х„} converge in distribution to X and {xn} is a sequence of real numbers converging to an x for which IP{X = x) = 0, prove that > xn} -> IP{X > x}. [14] If a sequence of random vectors {Х„} converges in distribution then Xn = Op(l). [Appendix A explains the Op(-) notation.] [15] Let Я be a measurable map from IR* into IRS that is differentiable at a point x0. That is, there exists a linear map L from Жк into IRS such that Hx = Hx0 + L(x — x0) + o(x — x0) near x0. If nll2(Xn - xo)^Z, prove that пт{НХ„ - Hx0) -> LZ. [Some authors call this the delta method.] [16] If Xn ~ Bin(n, 9) for some fixed 9 in @, 1), find the limiting distribution of n1/2[arsin(X»1/2 - arsin 01/2] as n -> oo. [17] For у real, define Я„(у) = ехр(г» - 1 - iy (iy)n/nl Prove that \Н„(у)\ < \y\n+1/(n + 1)!. [Proceed inductively, using i$'o Hj[s)ds = Hn+1(t) for t > 0. Take complex conjugates for t < 0. Borrowed from Feller A971).] [18] Use the inequality in the previous problem to show that the characteristic function of n/2(Poisson(n) — n) converges pointwise to exp( —12/2). [This is one way to find the characteristic function of the N@, 1) distribution.] [19] For each n let Zn be a sum of independent random variables ?„ь ..., ^„t(n) with zero means and variances that sum to one. If for some 5 > 0, 4") P|^|2+<5->0 as n^oo then Zn -~> N@, 1). [Apply the Lindeberg Central Limit Theorem. Liapounoff.] [20] Suppose a random vector X has a characteristic function ф that is integrable with respect to lebesgue measure on IR*. Prove that the distribution of X has a bounded, uniformly continuous density g(z) = Bn) k exp(-i [Choose a continuous / vanishing outside a bounded region of IR*. Start from the expression for IP/(X + oT) derived in Section 5: Bm7)-*/(z) exp(iz .y/a - ||у\2)ф(-у/<т) dy dz. Я Make a change of variable w = —yja, then take the limit as a -> 0 under the integral sign. Adapt Problem 3 to complete the proof.] [21] In the proof of the Representation Theorem write Q for the quantile function corresponding to F. Show that {Qn(u)} converges to Q(u) for each и that is a point of continuity for Q.
CHAPTER IV Convergence in Distribution in Metric Spaces ... in which that theory from Chapter III depending only on the metric space properties of IR* is extended to general metric spaces. It is argued that the theory should consider not just borel-measurable random elements. A Continuous Mapping Theorem and an analogue of the almost sure Representation Theorem survive the generalization. A compactness condition—uniform tightness—is shown to guarantee existence of cluster points of sequences of probability measures. IV. 1. Measurability We write a statistic as a functional on the sample paths of a stochastic process in order to break an analysis of the statistic into two parts: the study of continuity properties of the functional; the study of the stochastic process as a random element of a space of functions. The method has its greatest appeal when many different statistics can be written as functionals on the same process, or when the process has a form that suggests a simple ap- approximation, as in the goodness-of-fit example from Chapter I. There we expressed various statistics as functionals on the empirical process Un, which defines a random element of D[0, 1]. Doob's heuristic argument suggested that Un should behave like a brownian bridge, in some distributional sense. Formalization of the heuristic, the task we embark upon in this chapter, requires a notion of convergence in distribution for random elements of D[0, 1]. As for euclidean spaces, the definition will involve convergence of expectations of bounded, continuous functions of the processes. For this we need a notion of distance. Equip D[0, 1] with its uniform metric, which assigns the maximum separation ||x-y|| =sup|x(t)-XOI r as the distance between x and y. We shall find it easiest to prove convergence in distribution of {[/„} using this metric, even though it does create some minor measurability difficulties. Chapter VI will examine another metric, for which these difficulties disappear, at the cost of greater topological complexity. An expectation Р/([/„) is well defined only when /([/„) is measurable. If Un lives on a probability space (O, $, P), we can arrange for measurability
I V.I. Measurability 65 by equipping D[0, 1] with a c-field, SP say, then checking & /^-measurability of Un and ^-measurability of/. The borel c-field will not be the best choice for 3P. The definition of convergence in distribution for random elements of a general metric space anticipates this complication for D[0, 1]. 1 Definition. An d%2/-measurable map X from a probability space (O, S, IP) into a set Ж with c-field s/ is called a random element of 3C. If <F is a metric space, the set of all bounded, continuous, stf/ЩЖ)- measurable, real-valued functions on 3C is denoted by ^{ЗС; stf). A sequence {Х„} of random elements of 3C converges in distribution to a random element X, written Xn ~> X, if Р/(Х„) -> IP/(X) for each / in A sequence {Pn} of probability measures on s/ converges weakly to P, written Pn ~> P, if PB / -> Pf for every / in <€(<X; sf). D The borel (т-field ®{9C), the (т-field generated by the closed sets, will always contain s/. For those spaces where we need sd strictly smaller than the borel сг-field, we will usually have it generated by the collection of all closed balls in 3C. Also the trace of s/ on each separable subset of 3C will coincide with the trace of the borel c-field on the same subset. Limit distri- distributions will always be borel measures concentrating on separable, .im- .immeasurable subsets of 3C. We could build these properties into the definition of weak convergence, but it would neither save us any extra work, nor simplify the theory much. 2 Example. If D[0, 1] is equipped with the borel c-field 38 generated by the closed sets under the uniform metric, the empirical processes {[/„} will not be random elements of D[0, 1] in the sense of Definition 1. That is, UH is not (f/^-measurable. Consider, for example, the situation for a sample of size one. (Problem 1 extends the argument to larger sample sizes.) For each subset A of [0, 1] define GA = {x e D[0, 1]: x has a jump at some point of A}. Each GA is open because |x(t) — x(t — )\ depends continuously upon x, for fixed t. If Ux were §/^-measurable, the set {U1 e GA} = {^ e 4}jvould belong to S. A probability measure ц could be defined on the class of all subsets of [0, 1] by setting ц{А) = TP{^ e A}. This ц would be an extension of the uniform distribution to all subsets of [0, 1]. Unfortunately, such an extension cannot coexist with the usual axioms of set theory (Oxtoby 1971, Section 5): if we wish to retain the axiom of choice, or accept the continuum hypothesis, we must give up borel measurability of Uv The borel c-field generated by the uniform metric on D[0, 1] contains too many sets. There is a simple alternative to the borel (т-field. For each fixed t, the map [/„(¦, t) from О into IR is a random variable. That is, if щ denotes the
66 IV. Convergence in Distribution in Metric Spaces coordinate projection map that takes a function x in D[0, 1] onto its value at t, the composition щ ° Un is &/^(IR)-measurable. Each Un is measurable with respect to the a-field S? generated by the coordinate projection maps (Problem 2). Call SP the projection a-field. Problem 4 shows that SP coincides with the (т-field generated by the closed balls. All interesting functionals on D[0, 1] are ^-measurable. ? Too large a c-field si makes it too difficult for a map into 3C to be a random element. We must also guard against too small an si. Even though the metric on 3C has lost the right to have si equal to the borel a-field, it can still demand some degree of compatibility before a fruitful weak convergence theory will result. If #(<?¦; sf) contains too few functions, the approximation arguments underlying the Continuous Mapping Theorem will fail. Without that key theorem, weak convergence becomes a barren theory. An extreme example should give you some idea of the worst that might happen. 3 Example. Allow the real line to retain its usual euclidean metric, but change its ff-field to the one generated by the intervals of the form [n, n + 1), with n ranging over the integers. Call this a-field 01. Functions measurable with respect to M must stay constant over each of the generating intervals. For a continuous function, this imposes a harsh restriction; continuity at each integer forces an ^-measurable function to be constant over the whole real line. This completely degrades the weak convergence concept: every sequence of ^-measurable random elements converges in distribution. It bodes ill for a sensible Continuous Mapping Theorem. Consider the map Я from the disfigured real line into the real real line (equipped with its usual metric and (т-field) defined by Hx = 1 if 0 < x < 3 and Hx = 0 otherwise. It is a perfectly good ^-measurable map, continuous at the point 1. Apply it to random elements {Xn} identically equal to 3, and X identically equal to 1. Even though Xn ~> X in the sense of Definition 1, {HXn} does not converge in distribution to HX. ? IV.2. The Continuous Mapping Theorem Suppose Х„ ~> X, as .^/-measurable random elements of a metric space 3C, and let Я be an j<//j</'-measurable map from 3C into another metric space T. If Я is continuous at each point of an .^/-measurable set С with W{XeC} = 1, does it follow that HXn ~> HX? That is, does {Р/(ЯХ„)} converge to Р/(ЯХ) for every / in <ёBГ; st'I We found an answer to the analogous question for random vectors in Section III.2 by reducing it to an application of the Convergence Lemma. The same approach works here. We need to prove JPh(Xn) -* JPh(X) for every bounded, .^/-measurable, real-valued h that is continuous at each point of C. Were si equal to the borel (т-field $8BE), the proof would go
IV.2. The Continuous Mapping Theorem 67 through almost exactly as before, with only a few words difference. For borel- measurable random elements of metric spaces, the theory parallels the theory in Chapter III very closely, at least as far as the Continuous Mapping Theorem is concerned. Example 3 warns us that non-borel (т-fields require more careful handling. With this in mind, let's rework the Convergence Lemma of Chapter III, paying more attention to measurability difficulties. To begin with we assume only that si is a sub-c-field of Щ&). Define D) & = {fe ^{Ж;^): f < h}. Last time we constructed a countable subfamily of J^ whose pointwise supremum achieved the upper bound h at each point of C. Functions in the subfamily took the form /m,rW = г л md(x, {h < r}) Continuity offmr suffices for borel measurability, but it needn't imply ^-measurability. We must find a substitute for these functions. This is possible if we impose a regularity condition, which ensures that the pointwise supremum of #" equals h at each point of C. If С is separable (meaning that it has a countable, dense subset), we can then extract from J^ a countable subfamily having the same supremum as J* at each point of С The regularity condition will capture the key property enjoyed by fm r. Without loss of generality suppose h > 0. Suppose also that h is continuous at a point x. Choose r with 0 < r < h(x). Look for an / in 3F with/(x) > r. Continuity provides a S > 0 such that h(y) > r on the closed ball B(x, 5) centered at x. If we could find a g in #(#"; s/) with 0 < g < B(x, 8) and g{x) = 1, the function rg would meet our requirements. Notice the similarity to the topological notion of complete regularity (Simmons 1963, Section 27). If si happened to contain all the closed balls centered at x, a property enjoyed by the projection a-field on D[0, 1] (Problem 4), the function E) would do, because {g > 1 — s} = B(x, sd). For general si we must postulate existence of the appropriate g. To maintain the parallel with euclidean spaces as closely as possible, strengthen the requirements on g to include uniform continuity. We lose only a scintilla of generality thereby; the special g of E) still passes the test. 6 Definition. Call a point x in 3C completely regular (with respect to the metric d and the (т-field si) if to each neighborhood V of x there exists a uniformly continuous, .^/-measurable function g with g(x) = 1 and g < V. О You might well object to yet another mathematical notion attaining the status of regularity; the world is already overloaded with instances of
68 IV. Convergence in Distribution in Metric Spaces "regular" as a synonym for "amenable to our current theory." At least it has the virtue of reminding us of its topological counterpart. (A more sadistic author might have called it T3±.) The terminology would not be wasted if we were to expand our weak convergence theory to cover borel measures on general topological spaces, for there topological complete regularity seems just the thing needed for a well-behaved theory. 7 Convergence Lemma. Let h be a bounded, s4'-measurable, real-valued function on 9C. Ifh is continuous at each point of some separable, sd-measurable set С of completely regular points, then: (i) Xn -> X and TP{X e C} = 1 imply Wh(Xn) -> Wh(X); (ii) Р„~>Р and PC = 1 imply Pnh -* Ph. Proof. As the arguments for both assertions are quite similar, let us prove (ii) only. Assume that h > 0 (add a constant to h if necessary). Define !F as in D), but with the continuity requirement strengthened to uniform con- continuity. At those completely regular points of 3C where h is continuous, the supremum of OF equals h. This applies to points in C. Separability of С will enable us to extract a suitable countable subfamily from $F. Argue as for the classical Lindelof theorem (Simmons 1963, Section 18). Let Co be a countable, dense subset of С Let {gu д2,...}Ъе the set of all those functions of the form rB, with r rational, В a closed ball of rational radius centered at a point of Co, and rB < f for at least one / in JF. For each gt choose one/satisfying the inequality gt < f. Denote it byf. This picks out the required countable subfamily: (8) sup f = sup J* on C. i To see this, consider any point z in С and any / in J*\ For each rational number r such that /(z) > r > 0 choose a rational e for which / > r at all points within a distance 2e of z. Let В be the closed ball of radius e centered at a point x of Co for which d(x, z) < e. The function rB lies completely below/; it must be one of the {g(}. The corresponding/ takes a value greater than r at z. Assertion (8) follows.
IV.2. The Continuous Mapping Theorem 69 Complete the argument as for the Convergence Lemma of Section III.2. Assume without loss of generality that/ ] h at points of C. Then liminf Pnh > liminf Pn/ for each г = Pfi because Pn ~> P -* Ph as i -* со, by monotone convergence. Replace h by — h + (a big constant) to get the companion inequality for the limsup. ? 9 Corollary. // TPf(Xn) -*Wf(X) for each bounded, uniformly continuous, srf-measurable f, and if X concentrates on a separable set of completely regular points, then Xn ^> X. ? The corollary flows directly from the decision to insist upon uniformly continuous separating functions in the definition of a completely regular point. As with its counterpart for euclidean spaces, it makes some weak convergence arguments just a little bit more straightforward than the corre- corresponding arguments with continuous functions. 10 Example. Let 3C be a space equipped with a c-field si and metric d, and <& be a space equipped with a u-field 8) and metric e. Equip 3C ® ®J with its product (т-field and the metric a defined by <rL(x, y), (x1, /)] = max[rf(x, x'\ e(y, /)]. Suppose Х„ ~* X, as random elements of Ж If Wx concentrates on a separable set of completely regular points, and Yn -> y0 in probability for some fixed completely regular point y0 in <&, then (Х„, Yn) ^ (X, y0), as random elements of the product space Ж ®<W. Of course the assertion only makes sense if Xn and Yn are defined on the same probability space. Given that prerequisite, measurability with respect to the product c-field presents no problem, because (Xn, Yny\A ®B) = (X;U) n (Y^B), and similarly for (X, y0). Write С for the separable set on which ]PX concentrates. Then JPx,yo concentrates on the product set С ® {y0}, which is separable. Each point of this set is completely regular: if/(c) = 1 and/ = 0 outside the ball of d-radius e, and g(y0) = 1 and g = 0 outside a ball of e-radius e, then the product f(x)g(y) equals 1 at (c, y0) and vanishes outside a ball of c-radius e. The product is uniformly continuous if both / and g are bounded and uni- uniformly continuous; it is si ® ^-measurable if/ is ^-measurable and g is ^-measurable. By virtue of Corollary 9, to prove (Xn, Yn) ~> (X, y0) we have only to check that JPh{Xn, У„) -> Wh(X, y0) for each bounded, uniformly con- continuous, si ® ^-measurable, real function h on Ж ® <%l¦ Given г > 0 choose
70 IV. Convergence in Distribution in Metric Spaces д > 0 so that | h(x, y) - h(x', y')\ < s whenever o[(x, y), {%', /)] < 5. Write k(-) for the bounded, uniformly continuous, ^/-measurable function h{-, yQ). Then \Wh(Xn, Yn) - Wh(X, yo)\ < e + 2\\h\\TP*{e(Yn, y0) > 5} + \Wk(Xn) - JPk(X)\. Convergence in probability of Yn to y0 makes the middle term converge to zero. (Notice the outer measure P*. By definition, P*Z equals the infimum of PWover all $ -measurable real functions with W > Z. For most applica- applications e(-, y0) will be ^-measurable, in which case P* can be replaced by P.) The last term converges to zero because Х„ ~+ X. ? 11 Example (Convergence in Distribution via Uniform Approximation). Let X, Хъ X2,... be random elements of Ж with Wx concentrated on a separable set of completely regular points. Suppose, for each e > 0 and 5 > 0, there exist approximating random elements AX, AXU AX2, ¦ ¦ ¦ such that: (i) W*{d(X, AX)>5}<e; (ii) limsup №*{d(Xn, AXn) > 8} < e; (iii) AXn -> AX. Then Х„ ~> X. Notice again the use of outer measure to guard against non- measurability. We have already met a special case of this result in Lemma III. 11, where АХ„ = Х„ + aY. In applications to stochastic processes, the approximations are typically constructed from the values of the processes at a fixed, finite set of index points. For such approximations, classical weak convergence methods can handle (iii). The assumptions (i) and (ii) place restrictions on the irregularity of the sample paths. Chapter V will take up this idea. The convergence Xn ~» X follows from convergence of expectations for every bounded, uniformly continuous, .^/-measurable/. If | f(x) — f{y) \ < г whenever d(x, y) < S then \Wf(Xn) - TPf(X)\ is less than IPI/PO - f(AXn)\ + \Wf{AXn) - Wf(AX)\ + IP|/(AST) - f(X)\. The convergence (iii) takes care of the middle term. Handle the first term by splitting it into the contributions from {d(Xn, AXn) > 5} and its comple- complement; and similarly for the last term. ? The Convergence Lemma has one other important corollary, the result that tells us how to transfer convergence in distribution of random elements of Ж to convergence in distribution of selected functionals of those random elements. For substantial applications turn to Chapter V. 12 Continuous Mapping Theorem. Let H be an s//sf -measurable map from 3C into another metric space X'. If H is continuous at each point of some separable, .^-measurable set С of completely regular points, then Х„ ^> X and W{X e C} = 1 together imply HXn ~> HX. ?
IV.3. Representation by Almost Surely Convergent Sequences 71 IV.3. Representation by Almost Surely Convergent Sequences In Section III.6 we used the quantile transformation to construct almost surely convergent sequences of random variables representing weakly con- convergent sequences of probability measures. That method will not work for probabilities on more general spaces; it even breaks down for IR2. But the representation result itself still holds. 13 Representation Theorem. Let {Pn} be a sequence of probability measures on a metric space. IfPn ~> P andP concentrates on a separable set of completely regular points, then there exist random elements {Х„} and X with distributions {Pn} and P such that Xn -> X almost surely. ? The new construction makes repeated use of a lemma that can be applied to any two probability measures P and Q that are close in a weak convergence sense. Roughly speaking, the idea is to cut up the metric space 3C into pieces Bo, Bu ..., Bk for which PBt a> QBt for each i, so that the set Bo has small P measure and each of the other B/s has small diameter. We use these sets to construct a random element Y of Ж, starting from an X with distribution P. If X lands in Bt choose Y in Bt according to the conditional distribution Q(-\Bt). For i > 1 this forces Y to lie close to X, because Bt doesn't contain any pairs of points too far apart. The random element Y has approximately the distribution Q: к A4) W>{YeA} = ? TP{YeA\XeBt}TP{XeBt} i = O ;=o i = 0 = Q(A). A slight refinement of the construction will turn the approximation into an equality. When applied with Q = Р„ and partitions growing finer with n, it will generate the sequence {Х„} promised by the Representation Theorem. 15 Lemma. For each e > 0 and each P concentrating on a separable set of completely regular points, the space 9C can be partitioned into finitely many disjoint, ssf-measurable sets Bo, Bu ..., Bk such that: (i) the boundary of each Bt has zero P measure (a P-continuity set); (ii) P(BQ) < e; (iii) diameter(Bj) < 2s for i = l,2,...,k.
72 IV. Convergence in Distribution in Metric Spaces Proof. Call the separable set C. To each x in С there exists a uniformly continuous, ^-measurable / with f(x) = 1 and/ = 0 for points a distance greater than e from x. The open sets of the form {/ > a], for 0 < a < 1, are all .«/-measurable and of diameter less than 2s. At each point on the boundary of {/ > a}, the continuous function / takes the value a. Because P{/ = a] can be non-zero for at most countably many different values of a, there must exist at least one a. for which the probability equals zero. Choose and fix such an a, then write G(x) for the corresponding set {/ > a}. It has diameter less than 2e and is a P-continuity set. The union of the family of open sets {G(x): x e C} contains the separable set C. Extract a countable subfamily {G(x;): i = 1,2,...} containing C. (Every open cover of a separable subset of a metric space has a countable subcover: Problem 5.) Because Р <7СхЛ t P G(x} > P(O = 1 there exists a /c such that Г * "I > 1 - e. Define Bi = G(Xi)\[G(xi)u---uG(ii-1)] for i = l,...,k and Bo = [G(xO и • • • и G(xJ]c, a process known to the uncouth as disjointification. The boundary of Bt is covered by the union of the boundaries of the P- continuity sets G(xi),..., G(xk). Each Bt lies completely inside the cor- corresponding G(xi), a set of diameter less than 2e if i > 1. ? Proof of Theorem 13. Holding e fixed for the moment, carry out the con- construction detailed in the proof of the lemma, generating P-continuity sets Bo, Bu ..., Bk as described. The indicator function of Bt is almost surely continuous [P] because it has discontinuities only at the boundary of B{. So by the Convergence Lemma Р„(Б,) -> P(Bt). When n is large enough, say n > n(s), A6) Pn(Bt) > A - e)P(Bd for i = 0,l,...,L Write nm for nB~m). Without loss of generality suppose 1 = nt < n2 < ¦ ¦ ¦ . For nm < n < nm+1, construct Xn using the {BJ partition corresponding to sm = 2~m. Notice that Bt now depends on n through the value of m. Let ?, be a random variable that has a Uniform@, 1) distribution in- independent of X. If ^ < 1 — sm and X lands in B;, choose Xn according to the conditional distribution Р„(-|В,). So far no Bt has received more than its quota of Р„ measure, because of A6). The extra probability will be distributed over the space 3C to bring Х„ up to its desired distribution Р„. If ?, > 1 — em choose X„ according to the distribution ц„ determined by Pn(A) = fin(A)nt > 1 - ej + ? PK(A\B,X1 - < = o
IV.3. Representation by Almost Surely Convergent Sequences 73 That is, Hn(A) = e I Рп(А\Вд1Рп(Вд - A - em)P(Bt)l i=0 By A6), the right-hand side is non-negative. And clearly ц„Ж = 1. Except on the set Qm = {X e Bo or ?, > 1 — sm}, which has measure at most 2em, the random elements X and Xn lie within 2гт of each other. On the complement of the set {Qm infinitely often}, the sequence {Xn} converges to X. By the Borel-Cantelli lemma P{Qm infinitely often} = 0. ? The applications of Theorem 13 follow the same pattern as in Section Ш.6. Problems of weak convergence transform into problems of almost sure convergence, to which the standard tools (monotone convergence, dominated convergence, and so on) can be applied. 17 Example. Most of the proof of the Convergence Lemma did not use the full force of almost sure continuity for the function h. To get the inequality for the liminf we only needed lower-semicontinuity of h at points of C. (Remember that semicontinuity imposes only half the constraint of con- continuity: only a lower bound is set on the oscillations of h in a neighborhood of a point. Problem 9 will refresh your memory on semicontinuity.) The Representation Theorem gives a quick proof of the same result. If g is bounded below, lower-semicontinuous, and ^-measurable (auto- (automatic if л/ equals the borel cr-field), then liminf Р„д > Pg whenever Р„ ~> Р with P concentrated on a separable set of completely regular points. To prove it, switch to almost surely convergent representations. Lower-semi- Lower-semicontinuity at X(a>) plus almost sure convergence of the representing sequence imply liminf g(X„(со)) > д(Х(со)) almost surely. Take expectations. liminf Png = liminf TPg(Xn) by Fatou's lemma A similar inequality holds for upper-semicontinuous, .^/-measurable func- functions that are bounded above. As a special case, A8) limsup PnF <PF for each closed, ^/-measurable set F. If inequality A8) holds for all such F then necessarily Р„ --¦» P (Problem 12). ? 19 Example. Let <§ be a uniformly bounded class of ^/-measurable, real functions on Ж. Suppose that Pn --¦» P, with P concentrated on a separable
74 IV. Convergence in Distribution in Metric Spaces set of completely regular points. Suppose also that <g is equicontinuous at almost all points [P] of Ж. That is, for almost all x and each e > 0 there exists a 5 > 0, depending on x but not on g, such that \g{y) — g(x)\ < e whenever d(x, y) < 5, for every g in <S. Then B0) sup | Png-Pg |-> 0. This result underlies the success of most of the functions that have been constructed in the literature to metrize the topology of weak convergence. To prove B0), represent the probability measures by almost surely convergent random elements {Х„}, then deduce from equicontinuity that B1) sup \д{Х„) - g(X) | -* 0 almost surely. It would be tempting to appeal to dominated convergence to get sup \TPg(Xn) - JPg(X)\ < IP sup \g(Xn) - g(X)\ -> 0, but that would assume measurability of the supremum in B1). Instead, note that B0) could fail only if, for some s > 0, there were functions {д„} in <S for which \Pngn - Рд„\ > s infinitely often. Apply the dominated conver- convergence argument to the countable family ^0 = {gu g2,...} to reach a con- contradiction. ? 22 Example (The Bounded-Lipschitz Metric for Weak Convergence). Suppose that ,s/ contains all the closed balls, as in the case of D[0, 1] under its uniform metric. The function /(•) = r[l — md(-, z)]+, which serves to separate z from points outside a small neighborhood of z, has the strong uniformity property \f(x)-f(y)\<mrd(x,y). A function satisfying such a condition, with mr replaced possibly by a different constant, is called a Lipschitz function. For the proof of the Con- Convergence Lemma, Pnf-*Pf for each bounded, ^-measurable Lipschitz function would have sufficed; convergence for bounded Lipschitz functions implies weak convergence. From Example 19 we draw a sharper conclusion. Define ?f to be the set of all ^-measurable Lipschitz functions for which | f(x) - f(y) | < d(x, y) and supx | f(x) | < 1. The class <? is equicontinuous at each point of 9E. Every bounded Lipschitz function can be expressed as a multiple of a function in $?. Define the distance between two probability measures on л/ by You can check that Я has all the properties required of a metric. If P con- concentrates on a separable set and Pn ~» P, the distance X{Pn, P) converges to zero, in obedience to the uniformity result of Example 19. Conversely, the
IV.3. Representation by Almost Surely Convergent Sequences 75 convergence of X(Pn,P) to zero would ensure that Pnf-*Pf for each bounded Lipschitz function /, which, as noted above, implies weak con- convergence. ? 23 Example (The Prohorov Metric for Weak Convergence). Suppose 3C is a separable metric space equipped with its borel cr-field. For each 8 > 0 and each borel subset A of Ж define A6 = {x e %: d{x, A) < 5}. (Visualize the open set As as A wearing a halo of thickness 8.) Define the Prohorov distance between two borel probability measures as p(P, Q) = inf{<5 > 0: PA6 + 8 > QA for every A}. This distance has great appeal for robustniks, who interpret the delta halo as a way of constraining small migrations of Q mass and the added delta as insurance against a small proportion of gross changes. To us it will be just another metric for weak convergence. It is not obvious that p is symmetric, one of the properties required of a metric. We need to show that QA& + 8 > PA for every A, whenever p(P, Q) < 8. Set В equal to the complement of A6. We know that QB < PBd + 8. Subtract both sides from 1, after replacing Bs by the complement of A, a larger set. (No point of A can be less than 8 from a point in B.) We have symmetry. If p(P, Q) = 0 then certainly PFa + 8 > QF for every closed F and every 8 > 0. Hold F fixed but let 8 tend to zero through a sequence of values. The sequence {Fd} shrinks to F, giving PF > QF in the limit. Interchange the roles of P and Q then repeat the argument to deduce that P and Q agree on all closed sets, and hence (Problem 11) on all borel sets. For the triangle inequality, suppose that p(P, Q) < 8 and p(Q, R) < r\. Temporarily set В = Ац. Then RA < QA" + ц = QB + ц < PB* + tj + 8. Check that A5+<] contains B6. Deduce that p(R, P) < ц + д. Next, show that weak convergence implies convergence in the p metric. It suffices to deduce that p(Pn, P) < 8 eventually if Pn ~> P. For each borel set A define Notice that A6 > fA> A. Also, because IMx) - fA(y)\<8~l\d(x, A) - d(y, A)\<8~1 d(x, y), the class of all such fA functions is equicontinuous. By Example 19, svv\PJA-PfA\-+0. A
76 IV. Convergence in Distribution in Metric Spaces Call this supremum А„. Then PA3 > PfA > PJA - А„ > Р„А - А„ for every A. Wait until А„ < 5 to be able to assert that p(P, Pn) < 5. Finally, if p(Pn, P) -> 0 then, for fixed closed F, limsup PnF < PFd + S for every <5 > 0. Let <5 decrease to zero then deduce from Problem 12 that Р„ ^> P. Convergence in the p metric is equivalent to weak convergence. ? IV.4. Coupling The Representation Theorems of Sections III. 6 and IV. 3 both depended upon methods for coupling distributions Pn and P. That is, we needed to construct random elements Х„ and X, on the same probability space, such that Х„ had distribution Pn and X had distribution P. Closeness of Р„ and P, in a weak convergence sense, allowed us to choose Xn and X close in a stronger, almost sure sense. This section will examine coupling in more detail. A coupling of probability measures P and Q, on a space 3C, can be realized as a measure M on the product space SC ® 3C, with X and Y defined by the coordinate projections. The product measure P ® Q is a coupling, albeit a not very informative one. More useful are those couplings for which M concentrates near the diagonal. For example, in the Representation Theorem we put as much mass as possible on the set {(x, y): d(x, y) < e}. Roughly speaking, one can construct such couplings in two steps. First treat the desired property—that as much mass as possible be allocated to a particular region D in the product space—as a strict requirement. Imagine building up M slowly by drawing off mass from the P marginal measure and relocating it within D, subject to a matching constraint: to put an amount <5 near (x, y) one must deplete the P supply near x by <5 and the Q supply near у by 8. When as much mass as possible has been shifted into D by this method, forget about the constraint imposed by D. In the second step, complete the transfer of mass from P into the product space subject only to the matching constraint. The final M will have the correct marginals, P and Q. A precise formulation of the coupling algorithm just sketched is easiest when both P and Q concentrate on a finite set of points. The first step can be represented by a picture that looks like a crossword puzzle. Label the points on which Q concentrates as 1,..., r; let these correspond to rows of a two-way array of cells. Similarly, let 1,..., с label both the points on which P concentrates and the columns of the two-way array. The stack beside row i represents the mass Q puts on point i, and the stack under column j represents the mass P puts on j. The unshaded cells correspond to D. The
IV.4. Coupling 77 aim is to place as much mass as possible in the unshaded cells without violating the constraint that the total mass in a row or column should not exceed the amount originally in the marginal stacks. У \' i This formulation makes sense even if the marginal supplies don't both correspond to measures with total mass one. In general we could allow any non-negative masses R(i) and C(/) in the supply stacks for row i and column j. We would seek a non-negative allocation M(i, j) of as much mass as possible into the unshaded cells, subject to and for each i andy. A continuous analogue of the classical marriage lemma (a sort of fractional polygamy) will give the necessary and sufficient conditions for existence of an M that turns the inequalities for the columns into equalities. Treat С and R as measures. Write C(J) for the sum of supply masses in a set of columns J. Denote by D} the set of rows i for which cell (i, y) belongs to D for at least one column j in J. It is easy to see that M can have column marginal С only if R(Dj) > C(J) for every J, because the rows in D, contain all the D-cells in the columns of J. Sufficiency is a little trickier. 24 Allocation Lemma. // R(Dj) > C(J) for every set of columns J, then there exists an allocation M(i,j) into the cells ofD such that and for every i and). Proof. Use induction on the number of columns. The result is trivial for с = 1. Suppose it is true for every number of columns strictly less than c. Construct M by transferring mass from the column margins into D. Shift mass at a constant rate into each of the D-cells in row r. For any mass
78 IV. Convergence in Distribution in Metric Spaces shifted from C(j) into (r,j) discard an equal amount from R(r). If R(r) becomes exhausted, move on to row r — 1, and so on. Stop when either: (i) some C(j) is exhausted; or (ii) one of the constraints R(Dj) > C{J) would be violated by continuation of the current method of allocation. Here R and С are used as variable measures that decrease as mass is drawn off; the supply stacks diminish as the allocation proceeds. Notice that the mass transferred at each step can be specified as the largest solution to a system of linear inequalities. If the allocation halts because of (i), the problem is transformed into an allocation for с — 1 columns. The inductive hypothesis can be invoked to complete the allocation. If allocation halts because of (ii), then there must now exist some К for which R(DK) = C(K). Continued allocation would have caused R(DK) < C{K). The matching-constraint prevents К from containing every column: the total column supply always decreases at the same rate as the total row supply. Write Kc for the non-empty set of columns not in K. к ¦шщ HI qW 1 I У с Dk If the marginal demands of the columns in К are to be met, the entire remaining supply R(DK) must be devoted to those columns. With this requirement the problem splits into two subproblems: rows in DK may match only mass drawn off from the columns in K; from the rows DQK not in DK, match mass from the columns in Kc. Both subproblems satisfy the initial assumptions of the lemma. For subsets of К this follows because allocation halted before R(Dj) < C(J) for any J. For subsets of Kc, it follows from R(Dj n Z)|) K) - R(DK) > C(J и К) - C(K) = C(J). Invoke the inductive hypothesis for both subproblems to complete the proof of the lemma. ?
IV.4. Coupling 79 25 Corollary. If R and С have the same total mass and R(Dj) > C(J) for every J, then the allocation measure M has marginal measures R and C. ? The Allocation Lemma applies directly only to discrete distributions supported by finite sets. For distributions not of that type a preliminary discretization, as in the proof of the Representation Theorem, is needed. 26 Example. Let P and Q be borel probability measures on a separable metric space. The Prohorov distance p{P, Q) determines how closely P and Q can be coupled, in the sense that p(P, Q) equals the infimum of those values of e such that B7) JP{d(X, Y) > e} < e, with X having distribution P and Y having distribution Q. We can use the Allocation Lemma to help prove this. Half of the argument is easy. From B7) deduce, for every A, QA = W{YeA} < W>{X e Ac) + JP{d(X, Y) > e} < PAE + e, whence p(P, Q) < s. For the other half of the argument suppose p(P, Q) < s. Construct X and Y by means of a two-stage coupling. Apply the method of Lemma 15 twice to partition the underlying space into sets Bo, Bu...,Bk with both QBQ < S and PB0 < 5, and diameter(B,) < 5 for i = \,...,k. Choose 5 as a quantity much smaller than e; it will eventually be forced down to zero while г stays fixed. The requirement that each Bt be a Q- or P-continuity set is irrelevant to our present purpose. Set R(i) equal to QBt and CO') equal to PBj. Into the region D allow only those cells (i, j), for 1 < i < к and 1 < j < k, whose corresponding Bt and Bj contain a pair of points, one in Bt and one in Bj3 a distance < e apart. Augment the double array by one more row, call it oo, whose row stack contains mass e + 25. Include (oo, 0),..., (oo, k) in the region D. Bo V в, ¦ -- Bk
80 IV. Convergence in Distribution in Metric Spaces The hypotheses of the Allocation Lemma are satisfied. For any collection of columns J, < pb0 + p({j в \J\{0] < 5 + q( U bX + г V\{0> / < <5 + Q\ (J ВЛ + QB0 + e by definition of D < 5 + R(Dj\{oo}) + 5 + s Distribute all the mass from the column stacks into D, as in the Allocation Lemma. The oo row acts as a temporary repository for the small amount of mass that cannot legally be shifted into the desired small-diameter cells. Return the mass in this row to the column stacks, leaving at least I — e — 25 of the original С mass in the desired cells. Strip away the oo row. Allocate the remaining mass in the column stacks after expanding D to include all cells (i,j), for 0 < i < к and 0 < j < k. So far we have only decided the allocation of masses M(i, j) between the cells. Within the cells distribute according to the product measures The resulting Monf ®f has marginal measures P and Q. For example, within Bo the column marginal is YtM(i,0)Q(Bi\Bi)P(-\Bo) = P(B0)P(-\B0) = P(-Bo). i The M measure concentrates at least 1 — e — 2<5 of its mass within the original D, a cluster of cells each of diameter less than 5 in both row and column directions. For a point (x, y) lying in a cell (i,j) of this cluster, there exists points z,- and Zj with d(x, zt) < 5, d(zt, zj) < e, d(z}, y) < 5, which gives d(x, y) < e + 25. Put another way, if X and Y denote the co- coordinate projections then П»да, Y) > s + 25} < s + 25. As 5 can be chosen arbitrarily small, and г can be chosen as close to p(P, Q) as we please, we have the desired result. Problem 17 gives a condition under which the bound p(P, Q) can be achieved by a coupling of P and Q. ?
IV.5. Weakly Convergent Subsequences 81 IV. 5. Weakly Convergent Subsequences A reader not interested in existence theorems could skip this section, which presents a method for constructing measures on metric spaces. The results will be used in Section V.3 to prove existence of the brownian bridge. The method will be generalized in Chapter VII. We saw in Section III.6 how to modify the quantile-transformation construction of the one-dimensional Representation Theorem to turn it into an existence theorem, a method for constructing a probability measure as the distribution of the almost sure limit of a sequence of random variables. We had to impose a uniform tightness constraint to stop the sequence from drifting off to infinity. The analogous result for probabilities on metric spaces plays a much more important role than in euclidean spaces, because existence theorems of any sort are so much harder to come by in abstract spaces. Again the key to the construction is a uniform tightness property, which ensures that sequences that ought to converge really do converge. The setting is still that of a metric space 3C equipped with a sub-a-field s# of its borel a-field. 28 Definitions. Call a probability measure P on ,s/ tight if for every в > О there exists a compact set K(s) of completely regular points such that PK(s) > 1 - e. Call a sequence {Pn} of probability measures on si uniformly tight if for every e > 0 there exists a compact set K(e) of completely regular points such that liminf Pn G > 1 — г for every open, ^-measurable G containing K(e). ? Problem 7 justifies the implicit assumption of ^-measurability for the K(e) in the definition of tightness; every compact set of completely regular points can be written as a countable intersection of open, ^-measurable sets. If G is replaced by K(s), the uniform tightness condition becomes a slightly tidier, but stronger, condition. It is, however, more natural to retain the open G. If Р„ ~> P and P is tight then, by virtue of the results proved in Example 17, the liminf condition for open G is satisfied; it might not be satisfied if G were replaced by K(s). More importantly, one does not need the stronger condition to get weakly convergent subsequences, as will be shown in the next theorem. For the proof of the theorem we shall make use of a property of compact sets: If {х„} is a Cauchy sequence in a metric space, and if d(xn, К) -» О for some fixed compact set K, then {х„} converges to a point of K. This follows easily from one of a set of alternative characterizations of compactness in metric spaces. As we shall be making free use of these charac- characterizations in later chapters, a short digression on the topic will not go amiss.
82 IV. Convergence in Distribution in Metric Spaces To prove the assertion we have only to choose, according to the definition of d(xn, K), points {yn} in К for which d(xn, у„) -» О. From {у„} we can extract a subsequence converging to a point у in K. For if no subsequence of {у„} converged to a point of K, then around each x in К we could put an open neighborhood Gx that excluded у„ for all large enough values of n. This would imply that {yn} is eventually outside the union of the finite collection of Gx sets covering the compact K, a contradiction. The cor- corresponding subsequence of {xn} also converges to y. The Cauchy property forces {х„} to follow the subsequence in converging to y. A set with the property that every sequence has a convergent subsequence (with limit point in the set) is said to be sequentially compact. Every compact set is sequentially compact. This leads to another characterization of compactness: A sequentially compact set is complete (every Cauchy sequence converges to a point of the set) and totally bounded (for every positive e, the set can be covered by a finite union of closed balls of radius less than e). For clearly a Cauchy sequence in a sequentially compact К must converge to the same limit as the convergent subsequence. And if К were not totally bounded, there would be some positive г for which no finite collection of balls of radius e could cover K. We could extract a sequence {xn} in К with xn+1 at least e away from each of xlt..., xn for every n. No subsequence of {х„} could converge, in defiance of sequential compactness. For us the last link in the chain of characterizations will be the most important: A complete, totally bounded subset of a metric space is compact. Suppose, to the contrary, that {GJ is an open cover of a totally bounded set К for which no finite union of {G;} sets covers K. We can cover К by a finite union of closed balls of radius \, though. There must be at least one such ball, Bx say, for which К глВх has no finite {GJ subcover. Cover К n Bx by finitely many closed balls of radius ?. For at least one of these balls, B2 say, К глВ1глВ2 has no finite {GJ subcover. Continuing in this way we discover a sequence of closed balls {В„} of radii {2~"} for which К n Bi n • • • n Bn has no finite {G;} cover. Choose a point xn from this (necessarily non-empty) intersection. The sequence {xn} is Cauchy. If К were also complete, {xn} would converge to some x in K. Certainly x would belong to some Gh which would necessarily contain Bn for n large enough. A single G; is about as finite a subcover as one could wish for. Completeness would indeed force {G,} to have a finite subcover for K. End of digression. 29 Compactness Theorem. Every uniformly tight sequence of probability measures contains a subsequence that converges weakly to a tight borel measure.
IV.5. Weakly Convergent Subsequences 83 Proof. Write {Pn} for the uniformly tight sequence, and Kk for the compact set K(ek), for a fixed sequence {sk} that converges to zero. We may assume that {Kk} is an increasing sequence of sets. The proof will use a coupling to represent a subsequence of {Р„} by an almost surely convergent sequence of random elements. The limit of these random elements will concentrate on the union of the compact Kk sets; it will induce the tight borel measure on 9C to which the subsequence {Р„} will converge weakly. Complete regularity of each point in Kk allows us to cover Kk by a collec- collection of open ^-measurable sets, each of diameter less than sk. Invoke compactness to extract a finite subcover, {Uki: 1 < i < ik}. Define л/т to be the finite subfield of s/ generated by the open sets Uki for 1 < к < т and 1 < i < ik. The union of the fields {stfm} is a countable subfield s/x of s/. Apply Cantor's diagonalization argument to extract a subsequence of {Pn} along which lim Р„А exists for each A in stf^. Write XA for this limit. It is a finitely additive measure on the field ^ao. Avoid the mess of double-subscripting by assuming, with no loss of generality, that the subsequence is {Р„} itself. If {Pn} were weakly convergent to a measure P we would be able to deduce that P(interior of A) < XA < P(closure of A) for each A in stf х. If we could assume further that P put zero mass on the boundary of each such A, we would know the P measure of enough sets to allow almost surely convergent representing sequences to be constructed as in the Representation Theorem. Unfortunately there is no reason to expect P to cooperate in this way. Instead, we must turn to Я as a surrogate for the unknown, but sought after, probability measure P. Since X need not be countably additive, it would be wicked of us to presume the existence of a random element of 3C having distribution X. We must take a more devious approach. We can build a passable imitation of s/x on the unit interval. Partition @, 1) into as many intervals as there are atoms ofjtfu making the lebesgue measure of each interval A equal to the X measure of the corresponding A in s/t. These intervals generate a finite field s$± on @, 1). Partition each atom A in s& \ into as many subintervals as there are atoms of st2 ш A matching up lebesgue and X measures as before. The subintervals together generate a second field sf2 on @, 1); finer tnan ^i- Continuing in similar fashion, we set up an increasing sequence of fields {j?k} on @,1) that fit together in the same way as the fields {stfk} on Ж. The union of the ~Mk% is a countable subfield Лх of @, 1). There is a bijection Л ¦-> A between stf ^ and st x that preserves inclusion, maps slk onto s#k, and preserves measure, in the sense that the lebesgue measure of A equals XA. The construction ensures that, if ц has a Uniform@, 1) distribution, IP{>7 e 1} = XA for every A in si'«,. The random variable ц chooses between the sets in s$k in much the same way as a random element X with distribution P would choose between the sets in s/k.
84 IV. Convergence in Distribution in Metric Spaces By definition of X, there exists an n(k) such that C0) Р„А > A — sk)XA for every A in s/k whenever n > n(k). Lighten the notation by assuming that n(k) = k. (If you suspect these notational tricks for avoiding an orgy of subsequencing, feel free to rewrite the argument using, by now, triple subscripting.) As in the proof of the Representation Theorem, this allows us to construct a random element Xn, with distribution Р„, by means of an auxiliary random variable ?, that has a Uniform@, 1) distribution independent of r\: For each atom A of sdn, \lr\ falls in the corresponding A of stfn and ? < 1 — е„, distribute Xn on A according to the conditional distribution Pn(-\A). If i > 1 - en distribute Х„ with whatever conditional distribution is necessary to bring its overall distribution uptoPn. We have coupled each Р„ with lebesgue measure on the unit square. To emphasize that Xn depends on r\, ?;, and the randomization necessary to generate observations on Pn(-\A), write it as Xn(co, Ц, 0- Notice that the same ц and ?, figure in the construction of every X„. It will suffice for us to prove that {Xn(co, Ц, ?)} converges to a point X(со, ц, ?) of Kk for every со and every pair (щ, <!;) lying in a region of prob- probability at least A — skJ, a result stronger than mere almost sure convergence to a point in the union of the compact sets {Kk}. Problem 16 provides the extra details needed to deduce borel measurability of X. For each m greater than k, let Gmk be the smallest open, ^„-measurable set containing Kk. Uniform tightness tells us that XGmk = liminf PnGmk > 1 - sk, which implies IP{>7eGmfc} > 1 — ek. Define Gk as the intersection of the decreasing sequence of sets {Gmk} for m = к, к + 1,.... The overbar here is slightly misleading, because Gk need not belong to si'x. But it is a borel subset of @, 1). Countable additivity of lebesgue measure allows us to deduce that IP[r] e Gk} > 1 — sk. Notice how we have gotten around lack of count- countable additivity for X, by pulling the construction back into a more familiar measure space. Whenever ц falls in Gk and ? < 1 — sk, which occurs with probability at least A — skJ, the random elements Xk, Xk+1,... crowd together into a shrinking neighborhood of a point of Kk. There exists a decreasing sequence {AJ with: (i) Am is an atom of $4m\ (ii) Am is contained in Gmk; (iii) Xm(co, ц, О lies in Am. Properties (i) and (iii) are consequences of the method of construction for Xm; property (ii) holds because Gk is a subset of Gmk. The set Gmk, being the
Notes 85 smallest open, j/m-measurable set containing Kk, must be contained within the union of those Umi that intersect Kk. The atom Am must lie wholly within one such Umi, a set of diameter less than гт. So whenever tj falls in Gk and ?, < 1 — ?t, the sequence {Xm} satisfies: (i) d(-X"m(co, j/, О, Хп(со, ц, 0) < sm for к < m < n; (ii) d(Xm(co, ц, 0, Kk) < em for к < т. As explained at the start of the digression, this forces convergence to a point X(co,r,,i) of Kk. ? Notes Any reader uncomfortable with the metric space ideas used in this chapter might consult Simmons A963, especially Chapters 2 and 5). The advantages of equipping a metric space with a cr-field different from the borel a-field were first exploited by Dudley A966a, 1967a), who developed a weak convergence theory for measures living on the cr-field generated by the closed balls. The measurability problem for empirical processes (Example 2) was noted by Chibisov A965); he opted for the Skorohod metric. Руке and Shorack A968) suggested another way out: X„ ~> X should mean JPf(Xn) -* JPf(X) for all those bounded, continuous / that make/(Xn) and f(X) measurable. They noted the equivalence of this definition to the defini- definition based on the Skorohod metric, for random elements of D[0, 1] con- converging to a process with continuous sample paths. Separability has a curious role in the theory. With it, the closed balls generate the borel сг-field (Problem 6); but this can also hold without separability (Talagrand 1978). Borel measures usually have separable support (Dudley 1967a, 1976, Lecture 5). Alexandroff A940, 1941, 1943) laid the foundation for a theory of weak convergence on abstract spaces, not necessarily topological. Prohorov A956) reset the theory in complete, separable metric space, where most probabilistic and statistical applications can flourish. He and LeCam A957) proved different versions of the Compactness Theorem, whose form (but not the proof) I have borrowed from Dudley A966a). Weak convergence of baire measures on general topological spaces was thoroughly investigated by Varadarajan A965). Topsee A970) put together a weak convergence theory for borel measures; he used the liminf property for semicontinuous functions (Example 17) to define weak convergence. These two authors made clear the need for added regularity conditions on the limit measure and separation properties on the topology. One particularly nice combination—a completely regular topology and a г-additive limit measure—corresponds closely to my assumption that limit measures concentrate on separable sets of completely regular points. The best references to the weak convergence theory for borel measures on metric spaces remain Billingsley A968, 1971) and Parthasarathy A967).
86 IV. Convergence in Distribution in Metric Spaces Dudley's A976) lecture notes offer an excellent condensed exposition of both the mathematical theory and the statistical applications. Example 11 is usually attributed to Wichura A971), although Hajek A965) used a similar approximation idea to prove convergence for random elements of C[0, 1]. Skorohod A956) hit upon the idea of representing sequences that converge in distribution by sequences that converge almost surely, for the case of random elements of complete, separable metric spaces. The proof in Section 3 is adapted from Dudley A968). He paid more attention to some of the points glossed over in my proof—for example, he showed how to construct a probability space supporting all the {Х„}. Here, and in Section 5, one needs the existence theorem for product measures on infinite-product spaces. Руке A969,1970) has been a most persuasive advocate of this method for proving theorems about weak convergence. Many of the applications now belong to the folklore. The uniformity result of Example 19 comes from Ranga Rao A962); Billingsley and Tops0e A967) and Tops0e A970) perfected the idea. Not surprisingly, the original proofs of this type of result made direct use of the dissection technique of Lemma 15. Prohorov A956) defined the Prohorov metric; Dudley A966b) defined the bounded Lipschitz metric. Strassen A965) invoked convexity arguments to establish the coupling characterization of the Prohorov metric (Example 26). My proof comes essentially from Dudley A968), via Dudley A976, Lecture 18), who introduced the idea of building a coupling between discrete measures by application of the marriage lemma. The Allocation Lemma can also be proved by the max-flow-min-cut theorem (an elementary result from graph theory; for a proof see Bollobas A979)). The conditions of my Lemma ensure that the minimum capacity of a cut will correspond to the total column mass. Appendix В of Jacobs A978) contains an exposition of this approach, following Hansel and Troallic A978). Major A978) has described more refined forms of coupling. Problems [1] Suppose the empirical process U2 were measurable with respect to the borel cr-field on D[0, 1] generated by the uniform metric. For each subset A of A, 2) define JA as the open set of functions in D[0, 1] with jumps at some pair of distinct points tx and t2 in [0, 1] with tl + t2 in A. Define a non-atomic measure on the class of all subsets of A,2) by setting y(A) = 1P{U2 e JA}. This contradicts the continuum hypothesis (Oxtoby 1971, Section 5). Manufacture from у an extension of the uniform distribution to all subsets of A, 2) if you would like to offend the axiom of choice as well. Extend the argument to larger sample sizes. [2] Write si for the cr-field on a set Ж generated by a family {/•} of real-valued func- functions on Ж. That is, si is the smallest cr-field containing /f 1B for each i and each borel set B. Prove that a map X from (Q, S) into Ж is <f/^/-measurable if and only if the composition ft ° X is (f/J'CIR^measurable for each i.
Problems 87 [3] Every function in ?>[0, 1] is bounded: \x(tn)\ -> oo as n -> oo would violate either the right continuity or the existence of the left limit at some cluster point of the sequence {(„}. [4] Write 0> for the projection cr-field on D[0, 1] and @0 for the cr-field generated by the closed balls of the uniform metric. Write щ for the projection map that takes an x in ?>[0, 1] onto its value x(t). (a) Prove that each nt is ^„-measurable. [Express {x: ntx > a} as a countable union of closed balls B(xn, n), where xn equals a plus (n + n'1) times the indicator function of [t, t + n~').] Deduce that 3D0 contains &>. (b) Prove that the cr-field &> contains each closed ball B(x, r). [Express the ball as an intersection of sets {z: \n,x — ntz\ < r}, with t rational.] Deduce that &> contains 38 0 . [5] Let {G;} be a family of open sets whose union covers a separable subset С of a metric space. Adapt the argument of Lemma 7 to prove that С is contained in the union of some countable subfamily of the {G;}. [This is Lindelof's theorem.] [6] Every separable, open subset of a metric space can be written as a countable union of closed balls. [Rational radii, centered at points of the countable dense set.] The closed balls generate the borel cr-field on a separable metric space. [7] Every closed, separable set of completely regular points belongs to j/. [Cover it with open, ^/-measurable sets of small diameter. Use Lindelof's theorem to extract a countable subcover. The union of these sets belongs to $2. Represent the closed set as a countable intersection of such unions.] [8] Let Co be the countable subset of C[0, 1] consisting of all piecewise linear func- functions with corners at only a finite set of rational pairs (t;, r,). Argue from uniform continuity to prove that C[0, 1] equals the closure of Co. Deduce that C[0, 1] is a projection-measurable subset of D[0, 1]. [9] A function h is said to be lower-semicontinuous at a point x if, for each M < h(x), h is greater than M in some neighborhood of x. To say h is lower-semicontinuous means that it is lower-semicontinuous at every point. Show that the upper envelope of any set of continuous functions is lower-semicontinuous. Adapt the construction of Lemma 7 to prove that every lower-semicontinuous function that is bounded below can be represented on a separable set of completely regular points as the point wise limit of an increasing sequence of continuous functions. How would one define upper-semicontinuity? Which sets should have upper-semicontinuous indicator functions? What does a combination of both semicontinuities imply? [10] If Х„ --» X as random elements of a metric space ЭЕ and d(Xn, Yn) -> 0 in prob- probability, then Yn -* X, provided that IP* concentrates on a separable set of completely regular points. [Convergence in probability means JP*{d(Xn, Yn) > e} -> 0 for each e > 0.] [11] Let P be a borel measure on a metric space. For every borel set В there exists an open GE containing В and a closed Fe contained in В with P(GC\FS) < s. [The class of all sets with this property forms a cr-field. Each closed set has the property because it can be written as a countable intersection of open sets.] Deduce that P is uniquely determined by the values it gives to closed sets. Extend the result to measures defined on the cr-field generated by the closed balls.
88 IV. Convergence in Distribution in Metric Spaces [12] Suppose limsup PnF < PF for each closed, ^/-measurable set F. Prove that Р„ --» P by applying the inequalities for each non-negative / in %?(Ж; j/). [The summands are identically zero for all i large enough. Apply the same argument to —/ + (a big constant).] [13] If Є -> PB for each ^/-measurable set В whose boundary has zero P measure then Р„ ~* P. [Replace the levels i/k of the previous problem by levels f; for which [14] The functions in <?(Ж; jf) generate a sub-a-field 38 c of s/. A map X from (Q, S) into Ж is S/^.-measurable if and only if f(X) is <f/J'(IR)-measurable for each /in [15] (Continued). The trace of @c on any separable set S of completely regular points of Ж coincides with the borel ст-field on S. [Sets of the form {/ > 0} n S, with / in ЩЖ; sd\ form a basis for the relative topology on S. Every relatively open subset of S is a countable union of such sets, by Lindelof's theorem.] [16] (Continued). Let {Xm} be a sequence of (f/j^-measurable random elements of Ж that converges pointwise to a map X. Prove that X is S/^-measurable. If X takes values only in a fixed separable set of completely regular points, then it is [17] Let P and Q be tight probability measures on the borel cr-field of a separable metric space 3C. There exists a coupling for which IP{rf(Z, Y) > A} < A, where A = p(P, Q), the Prohorov distance between P and Q. [From Example 26, there exist measures Mn on X ® 3C, wi(th marginals P and Q, for which Мя{(х, у): d(x, у) > А + n} < A + n. The sequence {М„} is uniformly tight. The limit of a weakly convergent subse- subsequence defines the required coupling. Is separability of Ж really needed ?] [18] Let Ж be a compact metric space, and <?(Ж) be the vector space of all bounded, continuous, real functions on Ж. Let Г be a non-negative linear functional on <€(Ж) with Г1 = 1. These steps show that Tf = Pf for some borel probability measure P; (a) Given у > 0 find functions gu . ..,gk in Сё+{Ж) with diameter{#; > 0} < 2y and 0, + • • • + & = 1. [Find /x in %+(Ж) with /,(x) > 0 and fx(y) = 0 for d(y, x) > y. Cover К by finitely many open sets {fx > 0}. Standardize the corresponding functions to sum to one everywhere.] (b) Choose x, with g^x) > 0. Define Py as the discrete probability measure that puts mass Tgt aXxb for each i. If h belongs to ЩЖ), show that \Pyh - Th\->0 as у -> 0. (c) Extract a subsequence of {Py} that converges weakly. Show that the limit measure P has the desired property.
CHAPTER V The Uniform Metric on Spaces of Cadlag Functions ... in which random elements of metric spaces of cadlag functions—stochastic processes whose sample paths have at worst simple jump discontinuities—are treated. Necessary and sufficient conditions for convergence in distribution are found then specialized to prove limit theorems for empirical processes and processes with independent increments. The two most commonly occurring limit processes, brownian motion and brownian bridge, are studied. V. 1. Approximation of Stochastic Processes The theory developed in Chapter IV justifies its existence by what it has to say about the limiting distributions of functionals defined on sequences of stochastic processes. Processes {Xn(t): t e T} are identified with random elements of some space Ж of functions on T, a space large enough to contain the sample paths of every Х„; the functionals are maps defined on Ж to which the Continuous Mapping Theorem can be applied. In this chapter we shall specialize the general theory to the particular function space D[0, 1], under its uniform metric. It will turn out that most applications, especially those that come up with brownian bridges and brownian motions as limit pro- processes, require no fancier setting than this. Recall that for a function defined on a subset T of the extended real line [ - oo, oo] to deserve the title cadlag it must be right continuous ("continue a droite") and have left limits ("limites a gauche") at each point of T. Many of the well-studied stochastic processes—processes with independent increments, or the markov property, or some form of martingale structure— have versions with cadlag sample paths. For T = [0, 1], these processes can be analyzed as random elements of D[0, 1], the set of all real-valued cadlag functions on [0, 1]. Notice that right continuity at 1 and existence of the left limit at 0 become empty requirements for this space. The analysis for other spaces of cadlag functions on compact intervals (such as D[ — oo, oo], the natural space in which to study empirical distribution functions over the real line), and the extensions to non-compact index sets (Section 5), will call for only slight variations on the methods developed for D[0, 1].
90 V. The Uniform Metric on Spaces of Cadlag Functions For each finite subset S of [0, 1] write tts for the projection map from D[0, 1] into IRS that takes an x onto its vector of values {x(t): t e S}. Abbre- Abbreviate 7r{t} to nt. These projections generate the projection a-field, 2P. Questions of measurability in this chapter will always refer to this a-field. A stochastic process X on (Q, S, IP) with sample paths in D[0, 1], such as an empirical process, is ^/^-measurable provided n, ° X is ё/^(IR)-measurable for each fixed t (Problem IV.2). Probability measures on SP are uniquely determined by the values they give to the generating sets {п$ 1B}, with S a finite subset of [0, 1] and В а borel subset of IR5. Equivalently, the distribution of a random element of D[0, 1] is uniquely specified by giving the distributions of all its finite- dimensional projections. As every cadlag function on [0, 1] is bounded (Problem IV.3), the uniform distance d(x, y) = ||x - y\\ = sup{|x(t) - y(t)\: 0 < t < 1} defines a metric on D[0, 1]. No other metric will be used for D[0, 1] in this chapter. The closed balls for d generate & but not the larger borel a-field (Problem IV.4). Every point in D[0, 1] is completely regular—see Definition IV.6 and the discussion preceding it. The difficulty with the a-field, which can be blamed on the lack of a countable, dense subset of functions in D[0, 1], has dissuaded many authors from working with the uniform metric. Chapter IV showed that the difficulty can be surmounted, at least when limit distributions concentrate on separable subsets of D[0, 1]. As compensation for persisting with the uniform metric, we shall find its topological properties much easier to understand and manipulate than those of its main competitor, the Skorohod metric, which will be discussed in Chapter VI. That will make life more pleasant for us in Section 3 when we come to apply the Compactness Theorem. The limit processes for the applications in this chapter will always con- concentrate in a separable subset of D[0, 1], usually C[0, 1], the set of all con- continuous, real functions on [0, 1]. As a closed (uniform convergence preserves continuity), separable, ^-measurable subset of D[0, 1] (Problems IV.8), C[0, 1] has several attractive properties. For example, it inherits complete- completeness from D[0, 1], and its projection u-field coincides with its borel a-field for the uniform metric (Problem IV.6). How do we establish convergence in distribution of a sequence {Х„} of random elements of D[0,1] to a limit process XI Certainly we need the finite-dimensional projections {nsXn} to converge in distribution, as random vectors in IRS, to the finite-dimensional projection nsX, for each finite subset S of [0, 1]. Continuity of 7rs and the Continuous Mapping Theorem make that a necessary condition. The methods of Chapter III usually help here. But that alone could hardly suffice. Continuous functionals such as Mx = || x ||, a typical non-trivial example, depend on x through more than its values at a fixed, finite S. Indeed, direct examination of that very functional gives the clue to the extra condition needed.
V.I. Approximation of Stochastic Processes 91 Intuitively, it should be possible to approximate МХ„ by taking the maximum of |Xn(s)| over a large, finely spaced, finite subset S of [0, 1]. If S expands up to a countable, dense subset of [0, 1] containing 1, then for every sample path of Xn. The cadlag property assures us of this. (Notice the special treatment demanded by 1.) Given any д > 0 and e > 0, a large enough S could be found to ensure that A) JP{\MsXn-MXn\>5}<e. At first sight this seems to have solved the problem. Because MsXn depends continuously on nsXn, it converges in distribution to MSX as n -> oo. For / bounded and uniformly continuous, JPf(MsXn) -» JPf(MsX). From A) with 5 chosen appropriately, \Wf(MsXn) - Wf(MXn)\ < a + 2||/||e. A similar inequality holds for X. Taken together, these seem to add up to convergence in distribution of МХ„ to MX. But the argument is flawed. The problem lies with the choice of S in A). If the sample paths of Xn jumped about more and more wildly as n increased, S would have to get bigger with n. That would undermine the convergence of {MsXn} to MSX, since finite-dimensional convergence says nothing about {nsXn} for S varying with n. We need the same S to work for each n. A similar argument can be invoked for any other continuous functional H on D[0, 1]. We approximate НХ„ by a continuous function of nsXn for some large, finite set S not depending on n. We construct the approximation by applying H to an element AsXn in D[0, 1]. For simplicity, suppose S has been rearranged into increasing order and augmented by the points 0 and 1, if necessary, to form a grid 0 = s0 < sx < ¦ ¦ ¦ < sk = 1. For x in D[0, 1] define the approximating path Asx by B) (Asx)(t) = x(st) for Si<t<s^u and (Asx)(l) = x(l). Notice that Asx depends on x only through nsx. In order that H(Asx) be close to Hx, it suffices that \\Asx — x\\ be small. Consider, for example, a uniformly continuous H. There exists a 5 > 0 such that \H(Asx) — Hx\ < e whenever \\Asx — x\\ < S; the random variable HXn lieswithineofff(/4sXn)withprobabilitynolessthanP{|HsXn - Xn\\ < S}. Again, the same grid would have to work for every Х„—or at least for every Х„ with n large enough—to allow finite-dimensional convergence to imply convergence in distribution of {AsXn} to ASX. (The example of the empirical process Un, whose sample paths have jumps of size n~1/2, shows that uniform approximation of Xn by AsXn can only be required for large values of n, anyway.) If such a map As does exist, the argument sketched above for the supremum functional M will carry over to the functional H, proving that НХ„ ~> НХ. (The argument might seem familiar—it is a special case of ExampleIV.ll).
92 V. The Uniform Metric on Spaces of Cadlag Functions For the sake of brevity, from now on shorten "finite-dimensional distri- distributions" to fidis and "finite-dimensional projections" to fidi projections. 3 Theorem. Let X, Xu X2,... be random elements of D[0, 1] {under its uniform metric and projection a-field). Suppose JP{X e C} = 1 for some separ- separable subset С of D[0, 1]. The necessary and sufficient conditions for {Xn} to converge in distribution to X are: (i) the fidis of Х„ converge to the fidis of X; that is, nsXn ^ nsX for each finite subset S of [0, 1]; (ii) to each e > 0 and <5 > 0 there corresponds a grid 0 = t0 < tj <••• <tm= 1 such that D) limsup w\max sup|Xn{t) - Xn(t,)\ > s\ < s, I i Ji J where J, denotes the interval [t;, ti+ x), for i = 0, 1,..., m — 1. Proof. Suppose that Xn ~> X, The projection map 7is is both projection- measurable (by definition) and continuous. Condition (i) follows by the Continuous Mapping Theorem. To simplify the proof of (ii) suppose that the separable subset С equals C[0, 1]. Continuity of the sample paths makes the choice of the grid in (ii) easier. (Problem 4 outlines the extra arguments needed for the general case.) Let {s0, sb ...} be a countable, dense subset of [0, 1]. To avoid trivialities, assume that s0 = 0 and Si = 1. Write Ak for the interpolation map con- constructed as in B) from the grid obtained by rearranging s0,..., sk into increasing order. For fixed x in C[0, 1], the distance \\Akx — x\\ converges to zero as к increases, by virtue of the uniform continuity of x. Applied to the sample paths of X, this shows that \\AkX — X\\ -»0 almost surely. Convergence in probability would be enough to assure the existence of some к for which E) JP{\\AkX - X\\ > 5} < e. Choose and fix such a k. Because \\Akx — x|| varies continuously with x, the set F = {x ? D[0, 1]: ||Дх - x|| > 6} is closed. By Example IV. 17 (inequality IV. 18 to be precise), limsup JP{XneF} <JP{X<= F}. The left-hand side bounds the limsup in D). We could require "> <5" rather than " > <5" in D), but that would create a minor complication in the next lemma. Now let us show that (i) and (ii) imply Х„ ~> X. Retain the assumption that X has continuous sample paths (see Problem 4 for the general case),
V.I. Approximation of Stochastic Processes 93 so that E) still holds. Choose any bounded, uniformly continuous, projection- measurable, real function / on D[0, 1]. Given e > 0 find b > 0 such that \f(x) — f{y)\ < e whenever \\x — y\\ < 5. Write AT for the approximation map constructed from the grid in (ii) corresponding to this 5 and e. Condition D) becomes limsup JP{\\ATXn - Х„\\ > 5} < e. With no loss of generality we may assume that the Ak of E) equals AT, because if we combine the two underlying grids we have at worst to replace b by 2<5 to preserve the two approximations. For example, iff; < Sj < t <ti+i then, whenever \\АТХ„ - XJ < 5, \Xn(t) - Xn(Sj)\ < \Xn(t) - Xn(tt)\ + \Xn(Sj) - Xn(td\ < 26. The argument now follows the lines sketched out before. Write the com- composition f ° AT as g о пт where g is a bounded, continuous function on D[0, 1]—a fancy way of saying that f(ATx) depends on x continuously through the values that x takes at the grid points. JPf(Xn) - JPf(X)\ < TP\f(Xn) - f(ATXn)\ + \JPf(ATXn) - JPf(ATX)\ + JP\ f(ATX)-f(X)\ < s + 2\\f\\JP{\\Xn - ATXn\\ >b} + \Wg(nTXn) - The middle term in the last line converges to zero as n fidi convergence nTXn -> nTX. oo, because of the П We now know the price we pay for wanting to make probability state- statements about functional that depend on the whole sample path of a stochastic process: with high probability we need to rule out nasty behavior between the grid points. For uniform approximation of sample paths, a large value of\Xn(t) - Xn(tt)\ would be nasty; inequality D) rules it out. So how do we
94 V. The Uniform Metric on Spaces of Cadlag Functions control the left-hand side of D)? It involves the probability of a union of m events, which we may bound by the sum of the probabilities of those events, m-l F) ? i = 0 Then we can concentrate on what happens in an interval Jt between adjacent grid points. For many stochastic processes, good behavior of the increment -^п(г;+1) ~ Xn(td forces good behavior for the whole segment of sample path over Jt. 7 Lemma. Let {Z(t): 0 < t < b} be a process with cadlag sample paths taking the value zero at t = 0. Suppose Z(t) is S\-measurable, for some in- increasing family ofa-fields {€t: 0 < t < b}. If at each point of {\Z(i)\ > S}, W{\Z{b) - Z(t)\ <%\Z(t)\\St} > p, where /3 is a positive number depending only on S, then sup \Z(t)\ > d\ < /r4P{|Z(b)| > Щ. O<t<b J Proof. Let S be a finite subset of [0, b~\ containing the point b. By virtue of right continuity, as S expands up to a countable, dense subset of [0, b], max | Z(t) | -»• sup | Z{t) \ for every sample path of Z. S [0,6] Notice the minor complication here if " > 5" were replaced by " > 5". It is good enough to establish the inequality with the supremum taken over S. Define т as the first point of S for which | Z(t) \ > S if there is such a t, otherwise set т = oo. The event {т = t} belongs to St. For distinct points t and t' in S the events {т = t} and {т = t'} are disjoint. This justifies a first- passage decomposition. Is = t}JP{\Z(b) - Z(t)\ = t, \zq>) - z(t)\ < щ. ? For the lemma to help us we need to know something about the condi- conditional distribution of Z(b) — Z(t) given what has happened up to time t. When do we have such information? Here are some possibilities. The distribution of the increment might not depend on the past at all—the
V.2. Empirical Processes 95 process might have independent increments. In that case the requirement imposed by the lemma reduces to something involving the marginal distributions of the increments. Or the distribution of Z{b) — Z(t) might depend on the past only through the value of Z(t)—the markov property. In that case the requirement turns into a condition on the transition proba- probabilities. Or Z(b) — Z(t) might merely have zero conditional expectation—the martingale property. And then we have all those clever stopping time and maximal inequality tricks to play with. In each of these cases it turns out to be quite easy to find sufficient condi- conditions for strengthening fidi convergence to the full convergence in distribution of the processes as random elements of D[0, 1]. Sections 4 and 2 illustrate the first two possibilities; Chapter VIII will play tricks with martingales. In each case the limit process will be either a brownian bridge, a brownian motion, or a close relative. 8 Definition. A brownian motion is a process with continuous sample paths and @ B@) = 0; (ii) for 0 < tY < ¦ ¦ ¦ < tk the increments B(t2) - B(ti), • ¦ ¦, B(tk) - B(tt_i) are mutually independent; each B(tt) — B(t;_i) has a N@, t; — ?,_!) distribution. A brownian bridge, or tied-down brownian motion, is a process on [0, 1] with continuous sample paths and (i) 1/@) = t/(l) = 0; (ii) for each finite subset S of [0, 1] the random vector 7is U has a multi- variate normal distribution with zero means and covariances given by WU(s)U(t) = s(l - 0 for s < t. ? The name tied-down brownian motion comes from one method for constructing a random element distributed like U. Apply to В the continuous transformation that takes the function x(t) onto the function x(t) — ?x(l). That is, tie down the loose end of В at t = 1 to force upon it the constraint (i) placed on U. The tied-down process has gaussian fidi projections whose means and covariances agree with those for the brownian bridge; the pro- processes generate the same distribution on the projection cr-field. V.2. Empirical Processes From a sequence {?;} of independent random variables, each having a Uniform@, 1) distribution, construct the empirical process UJit) = n~112 t № < t} - t] for 0 < t < 1.
96 V. The Uniform Metric on Spaces of Cadlag Functions For fixed finite S, the projection ns Un is a normed sum of independent, identically distributed random vectors. By the Multivariate Central Limit Theorem (III. 30) it converges to a zero-mean multivariate normal distribu- distribution with the same covariance structure as the brownian bridge; that is, the fidis of ?/„¦ converge to the fidis of U. (No accident of course—that's why U got the covariances it did.) By design, the first condition of Theorem 3 is satisfied. A markov property of Un will strengthen this to convergence in distribution. 9 Empirical Central Limit Theorem (Uniform Case). The empirical processes {?/„} constructed by independent sampling from Uniform@, 1) converge in distribution, as random elements of D[0, 1], to the brownian bridge. Proof. We need to find a grid 0 = t0 < t1 < ¦ ¦ ¦ < tm = 1 satisfying D), for fixed e and S. Do this by making the sum in F), with Xn replaced by Un, small. Take the {?;} equally spaced. By reasons of symmetry, or stationarity if you prefer, the sum reduces to mP{supo<r<6 \Un(t)\ > S}, where b = m~1. We seek an m that makes it less than e. Take St as the ст-field generated by Un{s)for 0 < s < t. It tells us how many of the observations ?u. ..,?„ have landed in [0, t], and where they lie. Suppose we know that [0, t] contains exactly к of the observations. Given this information, the other n — к observations distribute themselves uniformly in the interval (t, 1]. Formally, on {Un(t) = n~1/2(k - nt)} the conditional distribution of Un(b) — Un(i) given St is n~1/2[Bin(n -k,B)- n(b - t)], where в = (b — t)/(\ — t). Notice the markov property: the conditional distribution depends only on Un{t). Apply Tchebychev's inequality on the set where | Un(t)\ > 5, that is, where \k — nt\> nll28. JP{\Un(b)-Un(t)\>i\Un(t)\\^} = P{|Bin(n - к, в) - n(b - f)l > i\k - nt\} < 4[(и - к)в{\ -в) + [(и - к)в - n(b - tj]2]/(k - ntJ < 4пв/(к - ntJ + 4в2 < [4Ь/A - Ь)У32 + 4Ь2/A - ЪJ < \ for small enough values of b. Notice that the argument would break down if we replaced \\ Un(t)\ by \b in the first line. Apply Lemma 7 with /? = \ and m large enough. sup |?/„@1 > 8\ < o<t<b ) With m fixed, let n -»• oo. Invoke fidi convergence to make the right-hand side of the inequality converge to 2mJP{\N@, b - b2)\ > i<5} < 2m[(b - which is less than e for m large enough. ?
V.2. Empirical Processes 97 The restriction to samples from the uniform distribution was unnecessary. A similar result holds for the empirical process En(r) = n^[_Fn{r) - F(r)] = n-1'2 t lint <r}~ F(r)l constructed by independent sampling from any distribution function F on the real line. Think of ?„ as a random element of D[— oo, oo], the space of cadlag functions on [ — oo, oo]. Right continuity at — oo can be achieved by setting ?„( — 00) equal to zero; the left limit at +00 also equals zero, the natural value for ?„( + 00). It would cause us no great hardship to carry all the theory for D[0, 1] over to D[ — 00, 00] then track through the proof of the general Empirical Central Limit Theorem. The uniform metric is well-defined, because of the boundedness (Problem 6) of functions in D[— 00, 00]; the projection cr-field plays the same role as in D[0, 1]. This time the limit gaussian process ? would have sample paths in O[— 00, 00] and multivariate normal fidis with zero means and covariance kernel A0) WE(r)E(s) = F(r) - F(r)F(s) for r < s. The E process need not have continuous sample paths; it jumps where F jumps. Some small complications would arise with the choice of grid points for the general empirical process, but these could be overcome easily (Problem 7). Everything else would go through in much the same way as for the uniform distribution. There is, however, a much simpler way to prove the general Empirical Central Limit Theorem: use the quantile transformation (Section III.6) to represent ?„ as a continuous image of [/„. 11 Empirical Central Limit Theorem. The empirical processes {?„} con- constructed by independent sampling from a distribution function F converge in distribution, as random elements of D\_ —со, oo], to the gaussian process E with covariance kernel given by A0). Proof. Define a map H from D[0, 1] into D[ — 00, 00] by setting (Hx)(r) = x(F(r)). It is measurable (both spaces are equipped with their projection сг-fields) and uniformly continuous: \\Hx - Hy\\ = sup \x(F(r)) - y(F(r))\ < \\x - y\\. r By the uniform case of the Empirical Central Limit Theorem and the Con- Continuous Mapping Theorem, HUn ~> HU. Remember from Section III.6 that the quantile function has the property: ?; < F(r) if and only if <2(?г) ^ r- (HUn)(r) = n'1'2 ta^i < F(r)} - F(r)] i= 1
98 V. The Uniform Metric on Spaces of Cadlag Functions The random variables {Q(?,d} form an independent sample from the distri- distribution F; the last sum has the same distribution as Е„. You can complete the proof by checking that HU satisfies the defining requirements for E. (This provides one method for proving the existence of E.) ? Even though the quantile transformation worked like a charm here, you should beware of applying it unnecessarily to force processes that want to live in ?>[ —oo, oo] into migrating to D[0, 1]. Gratuitous rescaling of the time axis, perhaps with the aim of reducing a problem to a case involving only uniformly distributed random variables, can complicate an otherwise straightforward argument. 12 Example. The Central Limit Theorem for the sample median, proved by direct methods in Section III.4, can be deduced from the Empirical Central Limit Theorem. As before, assume that the sampling distribution F has a continuous, positive density / in a neighborhood of its median m. To finesse the problems caused by the jumps in the empirical distribution function Fn, regard any random variable т„ for which as a sample median. From the condition on /, F(t) = \ + (t - m)lf(m) Because mn must converge to m (Example II. 1), we deduce that F(mn) = i + (mn - m)U(m) + op(l)]. Combine the expressions for Fn(mn) and F(mn) to get En(mn) = -nll2(mn - m)U(m) + op(l)] + op(l), which rearranges to A3) n^\mn - m) = I-EM,) ~ op(l)]/[/(m) + op(l)]. Think of the right-hand side as a function of the four random elements ?„, т„, op(l), and op(l). (Perhaps we should distinguish between the two op(l) symbols—they stand for different variables.) Better still, think of it as a function of the random element (?„, mn, op(l), op(l)) of D[ — oo, oo] ® Ж3. Since (ш„,орA), орA)) -»¦ (m, 0,0) in probability and ?„->?, and since Problem 8 shows that E concentrates on a separable subset of D\_ — oo, oo], Example IV. 10 lets us deduce that A4) (?„, mn, op(l), op(l)) -> (E, m, 0, 0). The right-hand side of A3) can be constructed from the left-hand side of A4) through application of the map H from D[— oo, oo] ® Ж3 into Ж denned by H(x, a, p, y) = [-x(a) - ?l/[/(m) + y].
V.2. Empirical Processes 99 H is continuous at (x, m, 0, 0) for every x that has m as a point of continuity. Almost every sample path of E has this property because, as shown by the representation of ? given in the proof of Theorem 11, each such sample path can be written in the form и ° F, where и belongs to C[0, 1]. (Don't forget that F is continuous at m.) Call out the Continuous Mapping Theorem. n^2(mn - m) = H(En, mn, op ~> ЩЕ, m, 0, 0) = -E(m)/f(m), which has the desired N@, |/(m)~2) distribution. ? 15 Example. What happens to the limit distribution of Kolmogorov's goodness-of-fit statistic when parameters are estimated? That is, if Fn is obtained by independent sampling on the distribution function F(-,90), with 60 the true value of an unknown parameter в, what is the asymptotic behavior of the statistic Dn = n1'2 sup \Fn(t)-F(t,en)\ t constructed using an estimator вп for 01 Suppose F were uniformly differentiable, in the strong sense (Problem 9) that \\F{-, в) - F(-, в0) -(в- 0О)Д(.)|| = о(в - во), near 0O, for some fixed function A(-) in D[ —oo, oo]. Then, provided в„ were one of those nice estimators that converge in towards в0 at the Op(n~112) rate, Dn could be written as Dn = n1/2\\Fn(-) - F(-, в0) - (в„ - 0О)Д(.)|| + ор(п1/2@„ - во)) = \\Е„ - п112(в„ - 0О)Д|| + орA). We know how ?„ behaves for large n—like the gaussian process E—but what will be the effect of adding on the extra term п1/2(вп - во)А? That depends upon the joint distribution of ?„ and в„. According to the statistical folklore, good estimators can often be coerced into the form for some function L satisfying 1РЦ^) = 0 and JPL(rjtJ = a2 < oo. Let's assume that our estimator has such a representation. That makes it much easier to analyze the random elements Zn = (Е„,п112(вп — в0)) of D[-oo, oo] <g) Ж. Look at the fidis. Write (En(tt),..., EB(tk), п1/2(вп - в0)) as n~112 ? ({m < h} - F(tu в0), ..-,{%< U - F(tk, 0O),
100 V. The Uniform Metric on Spaces of Cadlag Functions The Multivariate Central Limit Theorem gives convergence in distribution to a zero mean multivariate normal random vector (?(tx),..., E(tk), y), where the covariances amongst the first к components follow A0) and WE(t)y = JP{rii < t}L{rii). That suggests a limit process (E, y) for Zn, with E and у having a sort of gaussian distribution inD[—oo, oo]®lR. We actually already have all the tools needed to formalize the limit result for Zn. All that maximal inequality and finite-dimensional approximation stuff goes through as for the Empirical Central Limit Theorems; most of the brain work has been carried out, and only very messy details remain. But that would be terribly inelegant. Why not wait for the neater proof in Section VII.5, and just accept the fidi proof as sufficient evidence in the meantime? Elegance will return when the functions {77,- < t} and L(f?r) are accorded equal status. Check continuity and measurability for the map from D[— 00, 00] ® 1R into IR that sends (x, r) onto ||x(-) — rA(-)ll. then bring out the Continuous Mapping Theorem, yet again, to deduce that Dn^D = \\E{-) — yA(-)\\. The parameter estimation propagates through to the limit to add a random drift term yA(-) onto the gaussian process ?(•). If only we could calculate the distribution of D then we would have solved the problem completely. Problem 10 looks at a special case to show what difficulties this can present. Try reworking the example with all the processes in D[ — 00, 00] rescaled using F(-, 60) if you want to discover some of the perils of too automatic a recourse to the quantile transformation. ? V.3. Existence of Brownian Bridge and Brownian Motion This section proves the existence of gaussian processes satisfying the re- requirements of Definition 8. The proof uses the Compactness Theorem from Section IV.5. This is neither the simplest nor the fastest method known to mankind; but it will generalize readily into an existence proof for more exotic gaussian processes. Any reader who found Section IV.5 too tedious could safely skip onto the next section, with perhaps just a quick look at Problem 11. A good way to construct a process is to set up a sequence of approxima- approximations that should converge to the process, if it exists. One then needs to show that the approximating sequence, or one of its subsequences, actually does converge to something. One hopes this something has all the required properties of the desired process. We have not yet considered any sequence of processes that should con- converge to a brownian motion, but in the uniform empirical processes {?/„} we do have a sequence converging to a close relative, the brownian bridge. Let us extract a brownian bridge as the limit of a convergent subsequence.
V.3. Existence of Brownian Bridge and Brownian Motion 101 16 Theorem. There exists a brownian bridge. Proof. Check the uniform tightness condition in the Compactness Theorem of Section IV.5. We need to find a compact subset К of D[0, 1] for which A7) liminf!P{C/neG} > 1 - e, whenever G is an open, projection-measurable set containing K. To force the brownian bridge to live in C[0, 1], we shall also want К to contain only continuous functions. Our К will be represented as the intersection of a sequence of closed sets {Dk}, with Dk a finite union of closed balls of radius 2/fc (any sequence of radii decreasing to zero would suffice). As the digression after the start of Section IV.5 pointed out, such а К is compact: it inherits completeness from D[0, 1] and, by the choice of {Dk}, it is certainly totally bounded. The closed balls making up each Dk center themselves on piecewise linear, continuous functions obtained by linear interpolation between a finite set of vertices @, r0),..., (t,-, r;),..., A, rj: With T standing for the grid 0 = tQ < ¦ ¦ ¦ < tm = 1, write LT for the map from Жг to C[0, 1] that takes the vector г = (r0,..., rm) onto this inter- interpolated function. The map satisfies a Lipschitz condition, A8) ||Lr(r) - LT(s)|| <maxh-sr The composition LT о nT defines a map from D[0, 1] into C[0, 1], a piecewise linear, continuous approximation map for which \\LT ° nT(x) — x\\ < 2 max sup |x(t) — x(tt)\, where Jt = [ti,ti+1). The right-hand side should look familiar. It is the functional of x, call it hT(x), that lurks behind the convergence criterion of Theorem 3.
102 V. The Uniform Metric on Spaces of Cadlag Functions For given 5 > 0 and e > 0, the proof of Theorem 9 came up with a grid T for which limsup JP{hT(Un) > 3} < e. You should convince yourself that it was unnecessary to assume the existence of the brownian bridge process to get this inequality. Reinterpret it as a statement of how well Un can be approximated by linear interpolation: given e and S there exists a grid T for which limsup F{||Lr о nT(Un) - UJ > 5} < e. Why are there no measurability difficulties here? Now come back to the task of specifying the closed balls for the Dk sets. For each fc choose a grid T(k) such that liminf W{\\Lnk)°nTik)(Un) - l/J < 1/fc} > 1 - e/2k+1. Because the random vectors {nT(k)(Un)} converge in distribution, there exists a compact subset Hk of JRT(k) such that liminf JP{nnk)UneHk] > 1 - e/2*+1 hence liminf TP{LTW о nm){Un)eLnk){Hk)} > 1 - s/2*+1. As a continuous image of a compact set, LT(k)(Hk) is compact, and so can be covered by a finite collection of closed balls with radius 1/fc. Set Dk equal to the union of the finite collection of closed balls with the same centers, but radius 2/fc. The larger radius allows for a 1/fc distance between Un and its linearly interpolated approximation LT(k) ° nnk)(Un): liminf IP{ Un e Dk} > 1 - e/2k for every fc. The e/211 was contrived so that liminf JP{Un ? Dx r> ¦ ¦ ¦ n Dk} > 1 — 6 for every k. Of course we cannot just let к tend to infinity now and hope to replace the finite intersections by their limit K. We really do need that open set G in the definition of uniform tightness. Inequality A7) will follow from what we have proved about the {Dk} if we show that G э Dx n • • • n Dk for some fc. If there were no such fc, we could choose an xk from each closed set Fk = Gc n D1 n • • • n Dk. As shown in the next paragraph, we could then apply Cantor's diagonalization argu- argument to extract a Cauchy subsequence from {xk}. The limit of the Cauchy subsequence (remember that D[0, 1] is complete) would belong to Fk for every fc, an impossibility because P| Fk = Gc n К = 0. к The existence of the desired fc would follow.
V.4. Processes with Independent Increments 103 Here is how we would construct the Cauchy subsequence. Dt is a finite union of closed balls of radius 2; some subsequence of {xk} would lie com- completely within one of these, Bl say. Bl n D2 would be contained in a finite union of closed balls of radius 1; some sub-subsequence would lie completely within Bx n B2, with B2 a closed ball of radius 1. Bx n B2 n D3 would be contained in a finite union of closed balls of radius f; a sub-sub-subsequence would lie within B1 n B2 n B3, with B3 a closed ball of radius f. And so on. The desired Cauchy sequence would take the first element of the first sub- subsequence, the second element of the sub-subsequence, the third element of the sub-sub-subsequence, and so on. Every function in К is continuous, a uniform limit of piecewise linear, continuous functions from the sets LT(k)(Hk). By the Compactness Theorem, some subsequence of the uniformly tight {?/„} converges in distribution to a probability measure concentrating on a countable union of compact subsets of C[0, 1]. Its fidis, the limits of the corresponding empirical process fidis, identify it as the distribution of the sought-after brownian bridge. ? There are two ways of constructing brownian motion from the brownian bridge U. One can either set Bj@ = U(t) + tZ for 0 < t < 1, where Z has a ]V@, 1) distribution independent of U, or one can rescale by setting B2(t) = A + t)^(-p—) for 0 < t < oo. In both cases one gets a gaussian process with continuous sample paths. Direct calculation of means and covariances, which uniquely determine multivariate normal distributions, shows that B± has the right fidis to identify it as a brownian motion on [0, 1] and B2 has the right fidis for brownian motion on [0, oo). V.4. Processes with Independent Increments A stochastic process Z indexed by an interval of the real line is said to have independent increments if Z(t0), Z{tx) — Z(t0),..., Z{tk) — Z(tk-t) are mutually independent whenever t0 < tt < •¦• < tk. For random elements {Х„} of D[0, 1] with independent increments satisfying a mild regularity condition, the criterion for convergence in distribution given in Theorem 3 reduces to a particularly simple form when the limit process has continuous sample paths. Essentially we have only to check convergence for each increment Xn(t) - Xn(s). Just one thing can spoil the nice clean charac- characterization. Because the increments would be unchanged by the addition of an
104 V. The Uniform Metric on Spaces of Cadlag Functions arbitrary constant to Х„, they surely cannot determine uniquely its finite- dimensional distributions; Xn@) must be specified. 19 Theorem. Let X, Xu X2,... be random elements of D[0, 1], each with independent increments. Suppose X has continuous sample paths. Then Xn ^ X if and only if: (i) XJP) (ii) for each pair s < t, the increment Xn{i) — Xn(s) converges in distribution to X(t) - X(s); (iii) given b > 0 there exist a > 0 and ft > 0 and an integer n0 such that W{\Xn(t) - Xn(s)\ <Щ> P whenever \t - s\ < a and n > n0. Proof. Forgive the mysterious factor of \ in (iii); it's there to make the notation fit better with Lemma 7. Suppose Xn --»X. Conditions (i) and (ii) follow from convergence of the fidis. Condition (iii) is a simple consequence of the Continuous Mapping Theorem applied to the continuous map from D[0, 1] into Ж defined by Яах = sup{|x(t) - x(s)\: \t - s\< a}. (Take the supremum over rational pairs to check measurability.) At each continuous x the supremum converges to zero as a -> 0, because continuity on a compact interval implies uniform continuity. As this applies to every sample path of X, there must exist an a > 0 and /? > 0 such that JP{HaX < j5} — 2ft. Because HaXn ~> HaX and because ( — oo, jS) is open, liminfF{#aAr,I < \d} > 2/? П by authority of Example IV.17. For all n large enough, W{\Xn(t) - Xn(s)\ < \b} > W{HxXn < Щ > p, whenever \t — s\ < a. Now let us show that the two requirements listed in Theorem 3 follow from conditions (i), (ii), and (iii). Choose 0 = s0 < Sj < • • • < sk. The joint characteristic function of the random vector (Xn(s0), Xn(Sl) - Xn(s0),..., Xn(sk) - Xn(sk_,)) factorizes into a product of one-dimensional characteristic functions, because Xn has independent increments. By (i) and (ii), the product converges to the corresponding product for X, the joint characteristic function of the random vector (X(s0), X(Sl) - X(s0),..., X(sk) - X(sk-,)). With a continuous (linear, even) transformation, recover the desired fidi convergence. For the second requirement of Theorem 3 try a grid 0 = t0 < ¦ ¦ ¦ < tm = 1 for which max;(t>+1 — tt) < у < a, with a as given by (iii). The value of у
V.4. Processes with Independent Increments 105 will be specified at the end of the proof. Fix for the moment a value of n greater than n0. Write <gt for the сг-field generated by {Xn(s)\ s < t). Suppose t, < t < ti+1. Then, on the set {\XJt) - Xn(tt)\ > 6}, JP{\Xn(ti+1) - Xn(t)\ < \\Xn{t) - XM\\«t) >JP{\Xn(ti+1)-Xn(t)\<j6} > P. Apply Lemma 7 to the process Х„{-) — Xn{tt) on each interval J; = [tb ti+1). limsup IP< max sup | Xn(t) — Xn(tt) | > S limsup У JP< sup I Xn(t) - Xn(ti) \ > S 1 = 0 I J, limsup W{\Xn{ti+x) — Xn(td\ =~o m ¦=o m-l !=0 which is less than m-l Г1 Е ! = 0 because (Xn(ti+1), Xn(tt)) - (X(ti+1), X(tt)) and {(u, v)eJR2: \u - v\ > Щ is closed. Write Eo,..., ?m_i for the independent events appearing in the last summation. From the inequality exp( — WEt) > 1 — 1Р?; deduce that r=o > = -lodl-JP\jE?j < -log(l - JP{HyX > ^.5}) because max(fi+1 - tt) < y. i Argue as before that HyX [ 0 for each sample path, as у [ 0, to see that у could have been chosen to make the logarithmic expression less than /?e. ? Notice how we used the sample path continuity of the limit process X. The argument would break down if X were allowed to grow by jumps. For example, if X were a poisson process, with X(tj) — Xit^^ distributed Poisson(A(t; — ?г_ J), the sum would not decrease as у tended to zero: m— 1 m-1 X JP{\x(ti+1) - x(ti)\ >is}= x LMtt+i - h) + o(ti+1 - t,.)] i = 0 i = 0 = X + o(l). The uniform metric is inappropriate for such an X. A better metric, which can tolerate a jumpy limit process, will be introduced in Chapter VI.
106 V. The Uniform Metric on Spaces of Cadlag Functions Theorem 19 really only says something about convergence in distribution to gaussian limit processes. If X@) has a normal distribution, the combina- combination of independent increments and continuous sample paths forces X to be a rescaled and recentered brownian motion (the gaussian process described in Definition 8). An approximation argument shows why. Because Hy(X) -> 0 almost surely as у -» 0, there exists a sequence {е„} with е„ [ 0 and ЩН2/п(Х) > ej - 0. Set ?„, = Xn(i/n) - Xn((i - 1)/и). Define step-func- step-function approximations to X by Xn(t) = X@) + t {i/n < t}U\U < ?„}• Whenever H2/n(X) < en, the Xn process lies uniformly within е„ of X. Thus ||Х„ - X|| -» 0 in probability. Fix t. Write al for the variance of the sum appearing in the definition of Xn(t), and an for its expectation. If an -> 0, then Xn(t) = X@) + an + op(l) from which we get X(t) — X@) = constant, a degenerate sort of normality. If, on the other hand, {&„} does not tend to zero, then, along some subse- subsequence, (Xn(t) - X@) - an)/an ~> N@, 1) by the Lindeberg Central Limit Theorem. (Necessarily e.Jan -* 0 along any subsequence for which an is bounded away from zero.) Convergence of types (Breiman 1968, Section 8.8) allows nothing but normality for the distribution of X(t) - X@). A similar argument works for any other increment of X. Even with the gloss of apparent generality rubbed off the limit distribution, Theorem 19 still has enough content to handle some non-classical problems about sums of independent random variables. 20 Example. Let {?ni: i = 1,..., k(n); n = 1, 2,...} be a triangular array of random variables satisfying the conditions of the Lindeberg Central Limit Theorem. That is, they are independent within each row, they have zero means, they have variances {o^r} that sum to one within each row, and B1) Е1РЙ{|&,,|>е}->0 i for each fixed e > 0. Set Sni = ?nl + • • • + t,ni. What is the asymptotic behavior of a random variable such as max; Sni? We may write it as a func- functional of the partial-sum process Sn(-), the random element of D[0, 1] defined by Sn(t) = Sni for var Sni<t< var Sn> i+1. The curious choice for the location of the jumps of Sn(-) has the virtue that var Sn(t) -> t as n -> oo, for each fixed t, because max,- alt -> 0. The increments of Sn(-) over disjoint subintervals of [0, 1] are sums over disjoint groups of the {?„;}; the process has independent increments. If we joined up the vertices, to produce an interpolated partial-sum process with sample paths in C[0, 1], we would destroy this property.
V.5. Infinite Time Scales 107 For fixed s and t, with s less than t, the increment Sn(t) — Sn(s) is also a sum of the elements in a triangular array of independent random variables. Because var[SB@ - Sn(s)~] -> t - s and because B1) also holds for summation over any subset of the {?„;}, the Lindeberg Central Limit Theorem makes Sn(t) — Sn(s) converge to a N@, t — s) distribution—one of the conditions we need to make Sn(-) converge in distribution to a brownian motion. What about condition (iii) of Theorem 19? Tchebychev's inequality is good enough. W{\Sn(t) - Sn(s)\ <Щ>\- var[Sn(t) - -> 1 - (f - S)HW, uniformly in t and s. You can figure out what a should be from this. Only our ingenuity in thinking up functional that are continuous at the sample paths of brownian motion can curtail our supply of limit theorems for the partial sums. One example: max Sni = sup Sn(t) ~> sup B(t). i t 1 Calculation of the limit distribution in closed form awaits us in Section 6, where this and several other functionals of brownian motion and brownian bridge will be examined. ? V.5. Infinite Time Scales Both the spaces I>[0, 1] and D[— oo, oo] have compact intervals of the extended real line as their index sets; the theories for convergence in distri- distribution of random elements of these spaces differ only superficially. For spaces with non-compact index sets some extra complications arise. As a typical example, consider D[0, oo), the set of all real-valued cadlag functions on [0, oo). A function x in D[0, oo) must be right-continuous at each point of [0, oo), with a finite left limit existing at each point of @, oo). Because the limit point + oo does not belong to the index set, no constraint is placed on the behavior
108 V. The Uniform Metric on Spaces of Cadlag Functions of x(t) as t -> oo; it could diverge or oscillate about in any fashion. Such a function need not be bounded. The uniform distance between two functions in D[0, oo) might be infinite. Even for bounded random elements of D[0, oo), convergence in distribu- distribution in the sense of the uniform metric may impose far stronger requirements on their tail behavior (that is, on what happens as t -» oo) than we can hope to verify. Sometimes the best we can try for is control over large compact sub- intervals of [0, oo). That corresponds to convergence in the sense of the metric for uniform convergence on compacta. 22 Definition. A sequence of functions {х„} in D[0, oo) converges uniformly on compacta to a function x if sup(<fc \xn(t) — x(t)| -> 0 as n -> oo for each fixed k. Equivalently, d(xn, x) -» 0, where GO d(xn,x)= ? 2"'min[l,4(x,,x)] dk{xn,x) = sup|xn(t)-x(t)|. ? t<k In all that follows, ?>[0, oo) will be equipped with the metric d and the projection сг-field. With that combination, each x in D[0, oo) is completely regular (Problem 12). For each к define a truncation map Lk from D[0, oo) into D[0, /c]; set Lkx equal to the restriction of x to the interval [0, k]. By construction, {х„} converges to x if and only if {Lkxn} converges uniformly to Lkx for each k. Convergence in distribution has a similar characterization. 23 Theorem. Let X,XUX2,... be random elements ofD[0, oo), withJP{X e C} for some separable set C. Then Xn ^*X if and only if Lk Xn --» Lk X, as random elements of D[0, k],for each fixed k. Proof. Necessity of the condition follows from the Continuous Mapping Theorem, because each Lk is both continuous and measurable. For sufficiency, define a continuous, measurable map Hk from D[0, /c] into D[0, oo) by (Hkz)(t) = z(t л к). It comes as close as possible to defining an inverse to the map Lk: the function Hk ° Lkx equals x on [0, /c]. Deduce that d(x, Hk ° Lkx) < 2~k for every x in D[0, oo). Suppose LkXn ~> LkX. Because Hk is continuous, Hk ° LkXn <-» Hk ° LkX. Pick к large enough to make 2~k < e. Then both JP{d(Xn, Hk о LkXn) > e} and JP{d(X, Hk о LkX) > e} equal zero. Example IV.l 1, with Hk ° Lk playing the role of the approximation map, does the rest. ? Typically С equals C[0, oo), the set of all continuous functions in D[0, oo). As in the case of compact time intervals, C[0, oo) sits inside ?>[0, oo) as a closed, separable, measurable subset (Problem 13). The borel сг-field on C[0, oo), generated by the open sets for the d metric, coincides with the
V.5. Infinite Time Scales 109 projection cr-field. The most important of the limit processes living in C[0, oo) is brownian motion on [0, oo). 24 Example. Let Sn denote the nth partial sum of a sequence ?l5 ?2,... of independent, identically distributed random variables for which 1Р<!;г = О and IPiJf = 1. What is the asymptotic behavior of the hitting time as n tends to infinity? Define Hx = inf{t > 0: x(t) > 1}, with the usual convention that the empty set has infimum + oo. Right continuity of functions in D[0, oo) allows us to take the infimum over rational t values to verify measurability of H. Express n~ 1zn as the functional HXn of the process Xn(t) = n-ll2Sj, j/n < t < (j + l)/n. Use Theorem 23 to check that Х„ ~> В, a brownian motion on [0, oo). For fixed k, the truncated process LkXn has the same form as the partial-sum process of Example 20, but stretched out and rescaled to fit the interval [0, /c] instead of [0, 1]. The modifications have little effect on the arguments adduced there to prove convergence to brownian motion; exactly the same idea works in D[0, k~]. If we are to use the Continuous Mapping Theorem to prove HXn ~> HB we will need H continuous at almost all sample paths of B. By itself, continuity of a sample path x will not suffice for continuity of H at x. You can construct sequences of functions {х„} that converge uniformly to a bad continuous x without having Нх„ converging to Hx. BAD: Hx The functional H has better success with a good path like this: GOOD - E г т + E
по V. The Uniform Metric on Spaces of Cadlag Functions Here Hx = т. If x(t) < 1 — e for t < z — 5 (a continuous function achieves its maximum on a compact interval) and x(z + S) > 1 + e, then any у in D[0, oo) with dx+d(x, y) < s must satisfy z — д < Ну < т + д. That brownian motion has only good sample paths, almost surely, can best be shown by arguments based on a strong markov property, a topic to be taken up in the next section. The distribution of the functional HB of brown- brownian motion will also be derived in that section. For the moment, we must content ourselves with knowing that п~1т„ has a limiting distribution that we could calculate if we were better acquainted with the limit process of the sequence {Xn}. ? V.6. Functionals of Brownian Motion and Brownian Bridge From the definition of brownian motion on [0, oo) it is easy to deduce (Problem 14), for each fixed z, that the shifted process Bz(i) = B(z + t) - B(z) for 0 < t < oo is a new brownian motion, independent of the сг-field ST generated by the random variables {B(s): 0 < s < z}. Equivalently, B5) РДВ.М = for every A in Sz and every bounded measurable / on C[0, oo). The assertion is also valid for a wide class of random z values. The precisely formulated generalization, known as the strong markov property, underlies most of the clever tricks one can perform with brownian motion. It will get us the distributions of a few interesting functionals. The random variable z will need to be a stopping time, that is, a random variable, taking values in [0, oo], for which {г < t) belongs to St for each t.
V.6. Functional of Brownian Motion and Brownian Bridge 111 Interpret Sx as the smallest u-field containing every St—we learn everything about the brownian motion if we watch it forever. Define Bz(co, t) = [B(co, t - B(co, oo}. Problem 17 shows that Bz is projection-measurable, as a map from Q into C[0, oo). The stopping time т determines a u-field Sz, which captures the intuitive idea of an event being observable before time т. By definition, an event A belongs to Sz if and only if A{z < t} belongs to St for every t. The strong markov property asserts that Bt is a brownian motion independent of Sz on {t < oo}. Equivalently, B5) holds for every #t-measurable A contained in {t < oo} and every bounded, measurable / on C[0, oo). To prove the assertion it suffices that we check the equality for each bounded, uniformly continuous /. (Apply Problem 15 to the distributions induced on C[0, oo) by B, under IP, and Bz, under IP(-1 A).) With such an /, continuity of sample paths implies f{Bz)A = lim ? f(Bk/n)A{(k - l)/n < т < k/n}. k=l Of course only one term in the sum will be non-zero for each fixed n. The event A{(k — l)/n < т < k/n} belongs to Skln; the independence asserted in B5) can be invoked for each k/n value. W№)A = lim ? Wf(Bkln)A{(k - 1)/и < т < k/n} = lim ? TPf(B)TPA{(k - 1)/и < т < k/n} by B5) jt=i = 1Р/(В)ПМ because т < oo on A. The strong markov property is established.
112 V. The Uniform Metric on Spaces of Cadlag Functions 26 Example. The classical reflection principle for brownian motion involves a hidden appeal to the strong markov property. It enables us to find the distribution of the stopping time т = inf{t: B{t) = a} for fixed a > 0. The reflection principle uses the symmetry of brownian motion—the processes Bt and — Bz have the same distribution—to argue that В should be just as likely to hit a and end up with B(t) > a as it is to hit a and end up with B(t) < a. The probability that В hits a before time t should be twice the probability that B(t) > a. A formal proof works backwards. IP{B@ > a} = IP{B@ > a, т < t] = IP{B(t) + Bz(t - t) > a, x < t] = TP{Bz(t - t) > 0, z < t} because B{x) = a if т < oo < t}JP{Bz(t - т) Because т is #t-measurable it can be treated as a constant inside the condi- conditional probability. (A slightly more formal justification would invoke Fubini's theorem.) On {т < t}, symmetry of the normal distribution allows us to replace the conditional probability by j, and then deduce IP{B(f) > a} = iIP{T < t}. An explicit expression for IP{t < t} can be found by using the N@, t) distribution for B(t). For our later purposes it will be more important that we know the Laplace transform LaB) = IP exp( — At) for X > 0. Two applications of Fubini's theorem, then a differentiation, will lead to a simple differential equation for U. IP Г{т •Jo t}dt z) dt
V.6. Functionals of Brownian Motion and Brownian Bridge 113 = IP [Xe-Xt{t > a2/N@, IJ} dt = IPexp(-Aa2/iV@, IJ) = B/яI/2 Г^ехрС-Яа^ - \z2) dz. Jo Differentiate with respect to A, then change variables by setting у = BAI/2az-1. L'JLX) = B/яI/2 Г-а2г~2 exp(-A<x2z - \z2) dz Jo = -<хBАГ1/2B/яI/2 Гехр(-^2 - Xa2y~2)dy Jo The only solution with LJO) = 1 is La(A) = ехр(-аBЯI/2). П Doob A949), in the non-heuristic part of his paper, was able to parlay the result from Example 26 into an expression for the distribution function of \\U\\, which his heuristic argument suggested as the limit for Kolmogorov's goodness-of-fit statistic. His approach provides a case study in the application of the strong markov property. First one transforms the brownian bridge U by a scale change: B(t) = A + t)U +t for 0 < t < oo. The means and covariances of this gaussian random element of C[0, oo) identify it as a brownian motion (the same trick as in Section 3). Because = 0, t U > <x:0 < t < oo к1 + Ч = W{\B(i)\ = <x(l + t) for at least one t} = JP{B hits a + at or В hits -a - at}.
114 V. The Uniform Metric on Spaces of Cadlag Functions To find the probability that В ever leaves the region bounded by the two sloping lines ± (a + at), one first solves the simpler problem where there is only one sloping line. For a fixed fi > 0 (and not just fi = a) set ф(а) = IP {В hits a + fit}, the dependence on fi being suppressed for the while. If В is ever to hit the line ai + a2 + ft, f°r ai > 0 and a2 > 0, it must first hit the line ax + fit. Call the stopping time at which this happens a. Invoke the strong markov property. ф(ос1 + a2) = JP{B hits ax + oc2 + fit} = IP{{7 < oo, Ba hits a2 + fit} = IPfff < oo}IP{B hits a2 + fit} The equation ф(ос1 + a2) = ф(ос1)ф(ос2) has non-negative, decreasing solu- solutions with ф@) = 1 only of the form ф(&) = exp( — ca) for a > 0. The positive constant с remains to be determined. Another strong-markov argument, working with the stopping time т at which В first hits level a, gives the value of the constant.
V.6. Functionals of Brownian Motion and Brownian Bridge 115 exp(-ca) = F{B hits a + fit} = F{t < oo, Bt hits fix + fit} = F[{t < oo}F{B, hits fix + fit\Sz}-\ = IP{t < 00} (j)(fix) = IPexp(- с fix) = La(cfi) in the notation of Example 26 = exp(-aBcj3I/2). Solve for c. B7) F{B hits a + ?i} = exp(-2a?) for a > 0, fi > 0. Notice that the hitting probability depends on a and fi only through their product. The probability of В hitting the line кос + k~ lfit is the same for each к > 0. The two barrier problem is slightly harder. For brevity write, temporarily, I for the line a + fit and II for the line -a- fit. Then F{ |B(OI = oc + fit for at least one t] = W{B hits I or В hits II}. The two events {B hits 1} and {B hits II} are not disjoint. An inclusion- exclusion argument is needed. The infinite sum {hit 1} - {hit I then II} + {hit I then II then 1} + {hit II} - {hit II then 1} + {hit II then I then II} need not hit I before it hits II might hit again r ... then II... takes the value 0 if В never hits a barrier, it takes the value 1 if В makes a finite (positive) number of alternating hits, and it is undefined if В makes infinitely many alternating hits. If we prove that B8) IP{B makes infinitely many alternating hits} = 0 then dominated convergence will justify B9) IP{B hits I or В hits II} = IPfhit 1} - IPfhit I then II} + F{hit I then II then 1} + F{hit II} - F{hit II then 1} + F{hit II then I then II} .
116 V. The Uniform Metric on Spaces of Cadlag Functions Our success with the calculation for one barrier will extend, in strongly markov fashion, to each of the individual terms in this sum of probabilities. To avoid notational indigestion let us chew on just one of the terms, IP{hit I then II}, say. Define the stopping time x (different from the last т) as the first t at which В hits I. From the perspective of Bt, barrier II is defined by the line of slope — /? and intercept — 2(<x + fix). (We can treat x as if it were a constant if we condition on Sz.) By symmetry, Bz has the same probability of hitting the line 3(<x + fix) + fj(t - x), which it sees as having slope fj and intercept 2(<x + fix). Formula B7) allows Bz to double the slope and halve the intercept of the line it is trying to hit. That is, JP{BZ hits = JP{BZ hits 3(a + /hr) + fi{t - x)\Sz) = W{BZ hits 2(a + fix) + 2fi(t - x)\Sz} = IP{Bthits2a + 2fit\iz). Notice that the new line sits above a + fJt for t > 0. Integrate out over the event {x < oo}. JP{B hits I then II} = П>{т < oo, Bx hits II} = IP{t < oo, Bz hits 2a + 2f3t} = JP{B hits 2a + 2f3t} = exp[ - 2Ba)Bj?)] from B7). Similar reasoning succeeds with higher numbers of alternating hits. For example, if a denotes the time of the first hit on II after a hit on I, then W{B hits I then II then 1} = IP{t < oo, a < oo, Ba hits 1} = IP{t < oo, a < oo, Ba hits -2a - 2/ft} = IP{t < oo, Bz hits -2a - 2f3t} = IP{t < oo, Bz hits 3a + 3jSt} = JP{B hits 3a + 3f3t} = exp[-2Ca)CiS)].
Notes 117 You will observe the significance of the point — a//? on the t axis if you draw a picture for the argument. In general, for a string of m alternating hits on barriers I and II, the probability is exp( — 2m2uf}). Let m tend to infinity to prove B8). Formula B9) with a = /? provides the solution to the problem with which we began: P{||C/J > <x}->P{||t/|| >a} = 2 f (-l)m+1 exp(-2m2a2). m=l Read Doob's paper. Notes Most authors avoid the uniform metric once they recognize the measurability problems it creates with the empirical process. Billingsley A968, Section 18) made the case against it clear. Dudley A966a, 1967a) had already pointed the way around the problem, but his solution has been mostly overlooked in the literature. Dudley A966a) had even proved a multivariate version of the Empirical Central Limit Theorem. He used a markov property of the empirical process. The idea of interpreting conditions like D) as a means for constructing simple approximations to stochastic processes appears explicitly in Wichura A971). Hajek A965) had already applied the idea to characterize weak convergence in C[0, 1]. Skorohod A956) based his study of weak convergence in D[0, 1] under its various Skorohod metrics on the same idea. Unfortunately this simple approach seems to have lost out to the uniform tightness approach (exposited by Billingsley A968), for example), possibly because the approximation method appears to demand a separate proof for existence of the limit process as a random element of a space such as C[0, 1]. Actually, D) plus fidi convergence imply uniform tightness; the argument in Section 3 is easily generalized. Lemma 7, specialized to pro- processes with independent increments and with \d replacing i|Z(r)| in the hypothesis, is sometimes attributed to Skorohod A957), although it clearly has a strong similarity to Levy's symmetrization inequality. Notice that the ^|Z(t)| does gain us something—one benefit was noted during the proof of the Empirical Central Limit Theorem, in Section 2. Gihman and Skorohod A974, VI.5) made more systematic use of a form of the lemma in a study of weak convergence for markov processes. Stute A982a) obtained delicate oscillation results for empirical processes by exploiting their markovian properties. The Empirical Central Limit Theorem usually goes by the name of Donsker's theorem; Donsker A952) proved it (using the uniform metric) in justifying Doob's A949) heuristic approach to the Kolmogorov and Smirnov theorems. Donsker A952) used a poissonization trick in his proof; he got it from Chung A949), who got it from Kolmogorov A933). Kac A949) knew that a poissonized process was easier to analyze, because of its independent
118 V. The Uniform Metric on Spaces of Cadlag Functions increments. Breiman A968, Section 13.6) used another version of the trick— representation of uniform order statistics as rescaled points of a poisson process—in relating the empirical process to a partial-sum process. The quantile-transformation trick belongs to the folklore. The interest aroused when Durbin A973a) applied weak convergence methods to get limit distributions for statistics analogous to those of Kolmogorov and Smirnov, but with estimated parameters, died down when the intractable limit processes asserted themselves. The papers in the con- conference on empirical processes (Gaenssler and Revesz 1976), the lecture notes by Durbin A973b), and Pollard A980), offer several perspectives on minimum distance methods. Parr A982) has compiled a bibliography. The existence proof in Section 3 adapts an argument of Dudley A966a, Proposition 2; 1978, Lemma 1.3). Erdos and Kac A946,1947) proved that the limit distributions for certain functions of partial sums of independent random variables depend on the summands only through their first two moments. Their paper led Donsker A951) to formulate the results as limit theorems for functionals on a partial- sum process. Donsker proved convergence of the process to brownian motion. Skorohod A957) studied processes with independent increments in great detail; a nice account of some of his results has appeared in Gihman and Skorohod A969, Chapter IX). Infinite time scales with various topologies (uniform or Skorohod) of convergence on compacta often cause minor confusion. The simple theory for C[0, oo) (Whitt 1970) proves remarkably unwilling to carry over to Z)[0, oo), at least for Skorohod metrics (Lindvall 1973, Stone 1963). The obvious nature of the main theorem (Theorem 23) for uniform metrics makes it hard to appreciate why Skorohod metrics should be any more difficult. More about this in Chapter VI. Billingsley A968, Section 11) has derived distributions for functionals of brownian motion by taking limits for functionals of simple random walks (partial-sum processes). Stopping-time arguments for brownian motion give other first-exit distributions (Breiman 1968, Section 13.7). The results can also be derived by elegant martingale methods (Loeve 1978, Complements to Section 42). Problems [1] D[0, 1] is complete under its uniform metric. [If {xn} is a uniform Cauchy sequence prove pointwise, and then uniform, convergence. Show that uniform convergence preserves the cadlag property.] [2] Let С be a separable subset of Z)[0, 1] under its uniform metric. There exists a countable subset To of [0, 1] such that functions in С can have discontinuities only at points of To. [A function in D[0, 1] has only finitely many jumps > s, otherwise the cadlag property would fail at a cluster point of those jumps. Take the union over rational s. If x is a uniform limit of {xn} then it can jump only where one of the {х„} jumps.]
Problems 119 [3] Calculate the covariances of the tied-down process B(i) — ?.8A) to show that its fidis agree with those of the brownian bridge. [Means and covariances determine a multivariate normal distribution.] [4] Extend the proof of Theorem 3 to limit processes with jumps. [If X has paths in a separable set C, and if the grid points for the approximation maps Ak pick up every jump point of С as к -> со, then \\AkX — X\ -> 0 almost surely.] [5] Would the argument in the proof of Theorem 9 for bounding Un(t) over [0, b] work for a different interval, say [1 — b, 1]? This would be necessary if Un were not stationary; direct analysis of the non-uniform empirical process would require such an extension. [Try a different definition for $,.] [6] Every function in Z)[— со, со] is bounded. [If |x(fn)| > n then x would not have the cadlag property at a cluster point of {?„}.] [7] Give a direct proof of the Empirical Central Limit Theorem for sampling from a general F. [Make sure the jump points of F appear in the sequence of grids from which fidi approximations are calculated.] [8] The image of a separable metric space under a continuous map is separable. The empirical process E concentrates on a separable subset of ?>[— со, со]. [The image of a dense subset is dense.] [9] Suppose the distribution function F(x, 9) has a bounded partial derivative A(x, в) with respect to в. If A(-, •) is uniformly continuous (or even just A(x, •) equi- continuous), then sup \F(x, в) - F(x, 6»0) - (в - е0Щх, 0)| = o(9 - в0). X [10] Find the limiting distribution of Kolmogorov's goodness-of-fit statistic for sampling from the N(9,1) distribution, when в is estimated by the sample mean. Show that this limit does not depend on the true value в0; express it as a functional of the gaussian process with covariance function <J>(s)[l — <D(f)] — $(s)$(t), where Ф and ф denote the distribution function and density function of the iV@, 1) distribution. [11] Here is another construction for the brownian bridge. (a) Temporarily suppose there exists a brownian bridge U. Define, recursively, new processes Yo, Yu ... and Zo, Zu ... by setting Yo = U, Yn+i = Yn — Zn, and where hnJ is the function whose graph is an isosceles triangle of height one sitting on the base [0' - l)/2",J/2"]. Show that Yn+1 and Zn are independent. Deduce that the {ZJ are mutually independent. [Calculate covariances. The process Yn would be obtained from the brownian bridge by tying it down at the points {B/ — l)/2"}; it is made up of 2" independent, rescaled brownian bridges sitting side by side. The covariance calculations all reduce to the same thing.] (b) Show that the process Zo + ••• + Zn interpolates linearly between the vertices C//2"+ \ U(J/2"+!)), for j = 0,..., 2"+'.
120 V. The Uniform Metric on Spaces of Cadlag Functions (c) Now run the argument the other way. Construct processes {Х„} with the same distributions as the {ZJ, then recover the brownian bridge as a limit of sums of the {XJ. Let {?„,¦: j = 1, ¦•., 2"; и = 0,1,...} be independent random variables, with (nj distributed JV@,1/2"+2). Define Xn(t) = I hnj(t)?nj. At the points {j/2n+ x: j = 0,..., 2"+ *} the sum Xo + • • ¦ + Х„ has the right fidis for a brownian bridge. [Only the fidis of U, which we know are well defined, were needed to calculate the distributional properties of (d) Show that IP{||XJ > sj < 2"+1 exp(-2"+1s2). [Apply the exponential inequality from Appendix В for normal tails: \\Xn\\ is a maximum of 2" in- independent |ЛГ@,1/2"+2) | random variables.] (e) By choosing sn = Bn/2"+1I/2 and then applying the Borel-Cantelli lemma, show that ?S=o XJf) converges uniformly in t, almost surely; it defines a process X with continuous sample paths, almost surely. (f) At dyadic rational values for t, the series for X contains only finitely many non-zero terms. The fidi projections of X at a dense subset of [0,1] have the distributions of a brownian bridge. The process X, with a negligible set of sample paths discarded, is a brownian bridge. [12] Every point of Z)[0, oo) is completely regular. [The distance dk(x, y) can be ex- expressed as a supremum involving only rational time points. For each x and k, the function dk(x, ¦) is projection measurable. Use [1 — mdk(x, -)]+ as a separating function.] [13] C[0, oo) is a closed, separable subset of Z)[0, oo). [It is the closure of a countable collection of piecewise linear, continuous functions, each constant over an interval of the form [t, oo).] [14] For each fixed т and each finite subset S of [0, oo), the random vector nsBz is independent of Sz, because brownian motion has independent increments. Deduce independence of Bt and Sz. [The projection maps {ns} generate the projection o--field on C[0, oo).] [15] Let P and Q be probability measures on the c-field @0 generated by the closed balls of a metric space. If Pf = Qf for every bounded, uniformly continuous, ^-measurable / then P = Q. [Every closed ball is a pointwise decreasing limit of a sequence of such functions. The same is true for the intersection of any finite collection of closed balls. These sets form a generating class for 38Q that is closed under the formation of finite intersections.] [16] The borel tr-field on the product Ж ® <% of two separable metric spaces coincides with the product tr-field Я{9С) ® Щ<Ш). [Every open set in the product space is a countable union of sets of the form Gx ® Gy, with both Gx and Gy open. Compare with Problem IV.5.] [17] Show that the shifted process Bz is projection measurable for each random time t. [Prove measurability of B(a>, t + z(co)) by writing it as a composition of two
Problems 121 measurable maps: со t-* (t(co), B(co, •)) from О into [0, oo] ® C[0, oo) and (s, x) h-> limn x((t + s) л n)[l л (n - s)+] from [0, oo] ® C[0, oo) into IR. [18] In the sense of Example 24, almost all brownian motion sample paths are good. If т denotes the first time that В hits level 1, prove that IP{Bt(s) < 0 for 0 < s < 8} = 0. Let 8 tend to zero through a sequence to show that bad paths belong to a set of probability zero. [You could try letting the level a in Example 26 sink to zero.]
CHAPTER VI The Skorohod Metric on Z)[0, oo) ... in which an alternative to the metric of uniform convergence on compacta is studied. With the new metric the limit processes need not confine their jumps to a countable set of time points. Amongst the convergence criteria developed is an elegant condition based on random increments, due to Aldous. The chapter might be regarded as an extended appendix to Chapter V. VI. 1. Properties of the Metric The uniform metric on D[0, 1] is the best choice for applications where the limit distribution concentrates on C[0, 1], or on some other separable subset of Z)[0, 1]. It is well suited for convergence to brownian motion, brownian bridge, and the gaussian processes that appear as limits in the Empirical Central Limit Theorem. But it excludes, for example, poisson processes and other non-gaussian processes with independent increments, whose jumps are not constrained to lie in a fixed, countable subset of [0, 1]. To analyze such processes, Skorohod A956) introduced four new metrics, all weaker than the uniform metric. Of these, the Jx metric has since become the most popular. (Too popular in my opinion—too often it is dragged into problems for which the uniform metric would suffice.) But Skorohod's Jx metric on D[0, 1] will not be the main concern of this chapter. Instead we shall in- investigate a sort of Jx convergence on compacta for D[0, oo), the space where the interesting applications live. With the results from Section V.5 in mind, and even without seeing the J1 metric defined, you might suspect that convergence Xn ~> X of random elements of D[0, oo) should reduce to convergence of their restrictions to each finite interval [0, Г], in the sense of the Jx metric on D[0, Г]. This is almost true. We need to avoid those values of T at which X has positive probability of jumping. The difficulty arises because projection maps are not automatically continuous for Jt metrics. Both the points 0 and T require special treatment from the /t metric on D[0, Г], whereas only 0 has an a priori right to special treatment in D[0, oo). That tiny distinction makes it slightly more convenient to study D[0, oo) directly than to deduce all its properties from those of ?)[0, Г]. As we shall not be concerned with
VI. 1. Properties of the Metric 123 Skorohod's J2, Mu and M2 metrics, let us drop the Jx designation from now on. 1 Definition. For each finite T and each pair of functions x and у in D[0, oo) define the distance d-r(x, y) as the infimum of all those values of д for which there exist grids 0 = t0 < t, < ¦ • ¦ < tk, with tk > T, and 0 = s0 < st < • • • < sk, with sk > T, such that 11,; — st \ < д for i = 0,..., k, and and Si<s<si+1, \x(t) - y(s)\ < 5 if tt<t<ti+1 for i = 0,..., к — 1. The weighted sum d(x, y) = | 2" defines the Skorohod metric on D[0, oo). The proof that d is a metric contains no surprises (Problem 1). The requirement tk> T prevents discontinuities near T from overinfiating the distance dT(x, y): if, say, у jumps just before T then it can be matched by a similar jump in x just a little beyond T. By allowing dT to depend on more than the segment of path over [0, Г], we avoid a difficulty that other authors have encountered in defining a metric for /х convergence on compacta. The main difference between convergence in the uniform sense and convergence in the Jx sense appears at the discontinuity points of a limit function. If {xn} converges uniformly on compacta to x and if x has a jump O—-' at f 0, then each х„, for n large, must have a jump of almost the same magnitude precisely at t0. Skorohod's metric still forces each xn to have a jump of almost the same magnitude, but not precisely at t0. O---'
124 VI. The Skorohod Metric on D[0, oo) Straight from the definition we deduce that convergence of xn to x under the d metric is equivalent to dT(xn, x) -> 0 for each finite T. Another equiv- equivalent form is (Problem 2): d{xn, x) -> 0 if and only if there exist continuous, strictly increasing maps {!„} from [0, со) onto itself such that, uniformly over compact sets of t values, Xn(t) - t -> 0 and x(Xn(t)) - xn(t) -> 0. Sometimes continuity of a functional on D[0, oo) can be verified more easily when convergence is invoked in this form. 2 Example. Consider once again the first passage time functional from Example V.24: for x in D[0, oo), Hx = inf{t > 0: x(t) > 1} with the infimum over an empty set defined to be +co. The functional is continuous, in the sense of the Skorohod metric, at every good path x. Hx If d(xn, x) -» 0 then the functions {у„} constructed from {xn} and {Я„} by setting yn(t) = х„(Яп@) converge uniformly to x over compact sets of t values. The argument from Example V.24 gives Hyn -» Hx. Complete the proof of continuity of H by checking that Hyn = Я„(Ях„). ? As in Chapter V, we shall find a criterion for weak convergence in D[0, oo) by considering the approximations to processes built from the values they take at a fixed, finite grid of points 0 = t0 < tx < ¦ ¦ ¦ < tk. Define the inter- interpolation map A from D[0, oo) into D[0, oo) by = x(tk){tk < t < со} + X x(ti){ti < t < ti+1}. ;=o
VI. 1. Properties of the Metric 125 All the jumps of Ax occur at grid points. For a fine grid, x and Ax will be close if x does not vary much between the grid points, except possibly for a single jump in each interval. In that case values of x before the jump are close to the value at the left grid point; values of x after the jump are close to the value at the right grid point: Multiple jumps, though, could be missed completely: Bad behavior of x in \tk, oo) affects d(x, Ax) only through those dT with T > tk. Good behavior of x in [О, Г] may be quantified by a modulus function. 3 Definition. For each x in D[0, oo) and each finite interval [a, b\ the modulus function A(x, [a, b~\) equals the infimum of those positive e for which there exists a point s in (a, fe] such that \x{t) — x(a)\ < s if a<t<s and \x(t) — x(b)\ < s if s < t < b. ?
126 VI. The Skorohod Metric on O[0, oo) 4 Lemma. Let Ax be the approximation to x constructed from the values taken at grid points 0 = t0 < • • • < tk = T, with tt — t?_ j < д for each i. Then dT(x, Ax) < ё v max A(x, [t;_b tj). i<k Proof. Write A for the maximum A(x, [*,-_!, tj) value. Choose e > 0. The definition of the modulus function gives points {s,} with 0 = s0 = t0 < Si < ti < s2 < ¦ ¦ ¦ < sk < tk and \x(s) - x(ti)\ < A + s if st < s < si+1 for i = 0,..., к — I, \x(s) - x(tk)\ < A + e if sk< s < tk. Possibly sk will be strictly less than T. Add on one more point sk+1 with tk < h+i < tk + д and \x(s) — x(tk)\ < A + s for tk < s < sk+1. Set tk+1 = sk+1. The extra points make {sj and {tt} suitable grids for the calculation of dT(x, Ax). Since (Ax)(t) = x(tt) throughout \tu ti+1), \(Ax)(t) — x(s)\ < A + s for tt<t<ti+1 and st<s<si+1. Deduce that dT(x, Ax) < 3 v (A + s), then let e tend to zero. ? The lemma shows that the error committed in the approximation of x by Ax depends upon the variability of x, as measured by the modulus function. If we sufficiently refine the grid used to construct Ax, the approximation improves; for a fixed x, we can make the error as small as we please by choosing the grid fine enough. 5 Lemma. For each fixed x in D[0, oo), each fixed T, and each e > 0, there exists a 5 > 0 such that max A(x, [r,_1; tj) < s i for every grid 0 = t0 < ¦ ¦ ¦ < tk = T with tt — tf_ j < b for each i. Proof. If A(x, [a, a']) > s, we get points a < т < /? < a' with \x(t) - x(oc)\ >\s and \x{x) - x(f})\ > \г by setting т = inf{t > a: \x(t) — x(a)\ > js}, P = inf{t > t: \x(t) - x(t)\ > is}.
VI. 1. Properties of the Metric 127 So if the assertion were false there would exist sequences of points <х„ < т„ < Р„<Т for which Р„ - а„ -> 0 and \х(тп) - х(осп)\ >ie and \х(тп) - х(рп)\ > ^s. We could extract subsequences along which <xn -» а, т„ -» и, /?„ -» и, for some и in [О, Г]. This would violate the cadlag property at a: there must exist a <5 > 0 for which x(s) — x(a — )\<je for и — д < s < a, \x(s) — x(a)\ < jz for a < s < a + 5, but eventually both members of at least one of the pairs (<х„, т„) or (т„, fjn) would have to lie on the same side of a. ? Our main use for Lemmas 4 and 5 will be the construction of finite- dimensional approximating processes to characterize convergence in distribution for random elements of D[0, oo). They also help us dispose of measurability complications. 6 Theorem. Under its Skorohod metric D[0, oo) is separable. The borel a-field coincides with the projection a-field. Proof. Write Do for the countable class of functions that take constant, rational values on each of the intervals [0, 1/JV),..., [(iV2 — 1)/JV, Л0, [./V, oo) for ЛГ = 1, 2,... . Let us show that the class Do is dense in D[0, oo). Write S(N) for the set of grid points {j/N: j = 0,..., N2}. As usual, nSiN) denotes the map that projects an x in D[0, oo) onto its vector of values at the points oiS(N). Define hN as the continuous map from 1RS(JV) into Z)[0, oo) that takes a vector г = (r0,..., rN2) onto the step function with constant value rj on the jth grid interval. If the components of г are rational, /ijv(r) belongs to Do. Given x and an e > 0, choose ./V with 2~N less than e and N~1 less than the ё of Lemma 5. We may assume that 5 < s. Construct the approximation AN x to x using the grid points S(N). That is, set AN = hN о nsm. From Lemma 4, dN(x, ANx) < s. Hence d(x, ANx) < 2s. Find rational numbers r} with | г3 — x(J/N) | < e. Then /^(r) belongs to Do and d(x, hN(i)) < 3s. For the moment write Si for the borel u-field and SP for the projection ff-field on Z)[0, oo). To prove that SP ? 3&, observe that x(t0) = lim Hn(x) for fixed t0, where Н„(х) = sup x(t)sn(t), t sn(t) = (I - n\t - t0 - 2/n\)+. For fixed n, the functional Hn is continuous (Problem 3). As a pointwise limit of continuous functions on D[0, oo), the projection nt0 must be 8%- measurable. The projections generate SP.
128 VI. The Skorohod Metric on D[0, oo) To prove that 3§ ^ 3P,it is enough to establish that each continuous, real function / on D[0, oo) is ^s/^(IR)-measurable. (Every closed set in a metric space can be represented as /~ 1{0} for some continuous /.) We know that d(x, hN ° nS(N)(x)) -» 0 as ./V -> oo. Thus/ ° hN ° nS(N)(x) -» f(x) for each x, by continuity of/. The map /o hN is continuous from 1RS<N) into IR, and hence ^(IRS(iV))/^(]R)-rneasurable. The map nS(N) is, by definition, ^y^(IRS(N))-measurable. Thus their composition f°hN°nS(N) must be ^s/^(IR)-measurable. As a pointwise limit of such functions, / must also be ^/^(IR)-measurable. ? Needless to say, from now on we shall always equip ?>[0, oo) with its projection u-field, alias the borel u-field for the Skorohod metric. Every point of ?>[0, oo) is completely regular under this u-field. 7 Example. The asymptotic theory for maxima of independent random variables bears some similarity to the theory for sums of independent random variables. The role played by the normal is taken over by the extreme-value distributions, whose distribution functions are of the form exp( — G(x)) for G(x) equal to one oie~x, or x~x{x > 0}, or ( —x)~a{x < 0}, with a a positive parameter. If the maximum Mn of n independent observations from a distri- distribution function F can be standardized to converge in distribution, then the limit must be one of these: for constants а„ (positive) and bn, (8) W{Mn < anx + bn} = F\anx + bn) -> exp(- G(x)). This convergence implies a much stronger result for the joint asymptotic behavior of the maxima at different sample sizes, a result analogous to the convergence of the partial-sum process to brownian motion (Example V.20). Define the maxima process У„(-) as the random element of D[0, oo) with Yn{t) = (Mj - bn)lan for j/n <t<(j+ l)/n. The assumption (8) gives convergence for { Yn(l)}. Using only the facts about the Skorohod metric that we have so far accumulated, we can strengthen this to convergence in distribution of the {Yn} process. The method of proof depends upon a representation of Yn as a continuous transformation of a poisson process. To minimize extraneous detail, assume G(x) = x~~a{x > 0}. Trivial modifications of the argument would cover the other two cases. Define a sequence of measures {Н„} on @, oo) by means of their distribu- distribution functions: Я„@, x] = и exp(-G(x)/n) - (и - 1) exp(-G(x)/(n - 1)). Calculate their density functions if you doubt that these are well-defined measures. On S = [0, со) ® @, oo) generate independent poisson processes
VI. 1. Properties of the Metric 129 {л„} with intensity measures {X ® Hn}, where X denotes lebesgue measure on [0, oo). The sum an = n1 + ¦ ¦ ¦ + п„ is also a poisson process. As n tends to infinity, an increases to a poisson process a on S with intensity measure CD П X X ® Hi = X ® lim ? Hi = x ® У, "=i n ;=i the measure у being determined on @, oo) by y(x, oo) = lim[n - n exp( - Label the points of <?„ as (rjni, hni), where t]nl < r\n2 < ¦ ¦ •. The {t]ni} form a poisson process on [0, oo), with intensity nX, independent of the {hni}; the gaps between adjacent t]ni have independent exponential distributions with mean n. The {hni} are independent observations on the distribution function exp( — G/n). When subjected to a slight vertical perturbation they will become standardized observations on F. Let Q be the quantile transformation corresponding to the distribution function F (Section III.6). Define ВД = [Q(exp(-GG0/n)) - №.. For large n, this transformation hardly disturbs y: Tn(y) = inf{(z - bn)/an: F(z) > exp( - G(y)/n)} = M{x: F\anx + b) > exp(-GOO)} -* inf{x: exp( - G(x)) > exp( - G(y))} by (8) = У- The transformed variables {а„ Tn(hni) + bn) form a sequence of independent observations on F, because hni has distribution function exp( — G/n): &{anTn(hni) + bn<x} = JP{Q(exp(-G(hni)/n)) < x} = P{Uniform@, 1) < F(x)} = F(x). The Т„ has the desired effect, in the vertical direction, on the points of an. Define a random element Zn of D[0, oo) by setting Zn(t) = sup{Tn(hni): цп1 < t}. If Zn had its jumps at n~l, In"i,... instead of at цпЪ r\n2,... it would be a probabilistic copy of Yn. Remedy the defect by applying to the time axis the random, piecewise linear transformation yn that sends 0 onto 0 and;/n onto t]nj, for j = 1, 2,... . The processes У„(-) and Zn ° }>„(•) have the same distri- distribution as random elements of D[0, oo). By the weak law of large numbers, yn{t) -> t in probability uniformly on compact intervals (Problem 4). Thus d(Zn, Zn о yn) -> 0 in probability. The
130 VI. The Skorohod Metric on O[0, oo) random elements {Zn} themselves converge almost surely to the random element Z(t) = sup{hi:rii<t}, where (r\u hj), (rj2, h2),... denote the points of the poisson process a arranged in order of increasing time coordinate. Deduce that {Zn}, and hence {У„}, converges in distribution to Z. ? VI.2. Convergence in Distribution In Section V.I we found a necessary and sufficient condition for convergence in distribution of random elements {XJ of D[0, 1], under its uniform metric, to a limit process X concentrating on a separable subset. The separability allowed X to have discontinuities only at fixed locations. For the proof we constructed an approximation АХ„ to each Х„ based on the values it took at a fixed finite grid on [0, 1]. The conditions we imposed ensured that, with high probability, the АХ„ process was uniformly close to Xn. A similar method of proof will apply for convergence in distribution of random elements of D[0, oo) under d, its Skorohod metric. The constraint on the limit process will disappear, because D[0, oo) itself is separable under d. Each approximation AXn will, with high probability, be close to its Xn in the sense of d distance. But one extra complication will arise because the fidi projections are not automatically continuous. If x belongs to D[0, oo) and x{x) ф х(т —), the projection map nz is not continuous at x. For example, if xn(t) = x(nt/(n + 1)) then d(xn, x) -» 0 but nzxn -» x(t—) Ф nTx. For x a continuity point of x, however, щ is continuous at x (Problem 5). Necessarily, n0 is continuous at every x, because every increasing A that maps [0, oo) onto itself must set A@) equal to 0. For a random element X of D[0, oo), the projection nz will be continuous at all sample paths except those that have a jump at т. 9 Lemma. For each random element X of D[0, oo) there exists a subset Tx of [0, oo) such that [0, оо)\Гх is countable and JP{X(t) = X(t-)} = 1 for t in Гх. The projection щ is IPX almost surely continuous at each t in Гх. Proof. It is enough to show that if e > 0 then JP{\X(t)-X(t-)\>s}>s for at most finitely many t values in each bounded interval [0, Г]. Write Jit) for {| X(t) - X(t -) | > e}, If TPJ(tn) > e for an infinite sequence {tn} of distinct points in [0, Г] then 1Р{/({„) infinitely often} > e. There would exist an со belonging to infinitely many of the J(tn) sets. At some cluster point t in
VI.2. Convergence in Distribution 131 [0, T] the inequality \X(a>, Q - X(co, tn-)\ > г would hold for infinitely many distinct tn values in every neighborhood of t. This would violate the cadlag property of X(co, •) at t. ? Because the projection n0 is continuous at every x in D[0, oo), let us also admit 0 as a point of Tx, even though X@—) is not defined. 10 Theorem. Let X, X1, X2, ¦¦ ¦ be random elements of D[0, oo). Necessary and sufficient conditions for Xn ^* X, in the sense of the Skorohod metric, are: (i) the fidis of Xn corresponding to finite subsets of Tx converge to the fidis ofX; (ii) for each s > 0, each ц > 0, and each finite T in Tx, there exists a grid 0 = t0 < ¦ ¦ ¦ < tK = T of points from Tx with limsup iNmax A(Xn, [t;_b tj) > r\> < s. Proof of Necessity. Appeal to the Representation Theorem (IV. 13) for a new sequence {Xn}, with the same distributions as {Xn}, and an X with the same distribution as X, for which A1) d(Xa(co, -),X(co, •)) -> 0 for almost all со. For t in Гх, Xn(co, t) -> X(co, t) at almost every со. Fidi convergence for {Х„} at points of Tx follows. Find a grid 0 = t0 < ¦ ¦ ¦ < tk = Г of points from Гх with A2) ipjmax A(X, [t;_ u tj) > Л < s. Such a grid exists by virtue of Lemma 5: as a sequence of grids is refined down to a countable, dense subset of [0, Г], the maximum value of A over the grid intervals converges to zero at each sample path of X. Write Ar(x) as an abbreviation for max; A(x, [t;_ l5 tj). Consider an со at which the convergence A1) holds. Write xn(-) for Х„(со, ¦), and x(-) for X(a>, •). If we show limsup Ат(х„) < 2Аг(х), then it will follow that limsup Р{АГ(Х„) > 2ц\ = limsup TP{AT(Xn) > 2r\\ < IP{limsup АТ(Х„) > 2rj} < W{AT(X) > r,} < ? by A2) as required by hypothesis (ii).
132 VI. The Skorohod Metric on D[0, со) Choose <5 > 0. By definition of AT(x), there exists points {sj with ?,_i < Si < ti and |x@-x(*;-i)l < AT(x) + <5 for ti_1<t<s,-, |xCt)-x(t,)l <AT(*) + <5 for Si<t<ti. Continuity of x at both ti^1 and t; allows us to assume that the strict in- inequality st < tt holds, and also that the range of validity for the first inequality is {,_! — S' < t < Si, and for the second s? < t < tt + S', for some 3' > 0. Because d(xa, x) -> 0, there exists a sequence of continuous, increasing maps {Xn} from [0, oo) onto [0, oo) such that Ut)~t^0 and х(Я„@) - xH(t) - 0 uniformly on compacta. When n is large enough, ?г_х < Ап'Ч5;) < h- Use st,n = ^"'(Sj) as the split point in [tj_l5 tj for bounding Дт(х„). If t;_i < t < sijn, and n is large enough, |х„@ - x^-JI < |х(Я„(г)) - xCA^ti-i))! + <5 < |х(Яи@) - xfe.OI + Ix^^) - xafe^))! + 8 < 2AT(x) + 3d, because, eventually, t,-t — 8' < ln{tt_x) < Xn(t) < An(si>n) = st. A similar argument applies to t in [s; „, t;]. Proof of Sufficiency. Let / be a bounded, uniformly continuous, real function on D[0, oo). We need to show that JPf(Xn) -> P/(X). Given г > 0, find »? > 0 so that | /(x) - /(y) | < e whenever d(x, y) < 217. Choose from Tx а Г large enough to ensure d(x, y) < dT(x, y) + r\ for every pair x, y. Let AT(-) have the same meaning as in the proof of necessity. According to hypothesis (ii) there exists a grid on [0, T] for which 1Р{АТ(ХП) >ц) < г if n>n0. Also we may assume that the grid points are less than r\ apart and, by the same reasoning as for A2), that W{AT(X) > r\} < s. Lemma 4 shows that the approximations constructed from this grid are, with high probability, close to the sample paths of the processes: W{dT(Xn,AXJ>t1}<s if n>n0, W{dT(X, AX) >rj} <e. Complete the proof in the usual way. Write A for the interpolation map constructed from the values at the grid points. < JP\f(Xn) - f(AXn)\ + \Wf{AXn) - Wf{AX)\ + W\f(AX)-f(X)\ < s + 2||/||1Рде„, AXn) > 2ц] + \JPf(AXn) - JPf(AX)\ + s + 2\\f\\JP{d(X,AX)>2r,} < e + 2||/||e + e + e + 2||/||e eventually, because AXn -> AX and d(x, Ax) < dT(x, Ax) + ц for every x. ?
VI.2. Convergence in Distribution 133 Roughly speaking, the modulus А(Х„, [f;_ ъ tj) will be small if Xn has at worst one large jump in the interval [?;_15 tj. To prove convergence in distribution, we need some way of stopping large jumps from piling up in a small interval. If Х„ has a jump at х„, which in general will be a random time, then we need W{\Xn(xn + t) - Xn(xn)\ small, for all small t) x 1, and the approximation should hold uniformly in n. A maximal inequality seems required. Contrast this vague, formidable task with an elegant suffi- sufficient condition due to Aldous A978): Х„ ~> X if the fidis converge and, for each fixed T, A3) Х„(Рп + 3n) - Xn(Pn) -+ 0 in probability, whenever {<5„} is a sequence of positive numbers converging to zero and {pn} is a sequence of stopping times taking values in [0, T]. (The stopping time property means that the event {pn < t} should belong to the a-field generated by the random variables Xn(s), for 0 < s < t.) An equivalent form of A3) is: for each T, each r\ > 0, and each s > 0, there exists a 3 > 0 and an n0 such that A4) Щ\Хп{рп + д')-Хп(рп)\>г1}<г for n>n0, whenever pn is a stopping time for Х„ that takes values in [0, Г] and 3' is a real number with 0 < d' < д. The proof of Aldous's result is built up by repeated application of inequality A4). 15 Lemma. Let Z be a random element of D[0, oo) for which JP{\Z(p + 5') - Z(p)\ > n} < s for each real d' in [0, <5] and each stopping time p taking values in [0, Г]. If a and x are stopping times for which a < г and \ Z(t) — Z{a) \ > 2r\ on {т < oo}, then JP{z < T л (a + Щ} < 4s. Proof. Integrate both sides of the inequality for p with respect to lebesgue measure on [0, 8], interchange the order of integration, then make a change of variable in the inner integral. гд IP J{|Z(s) - Z(p)\ >rj}{p<s<p Apply this inequality twice with p equal to the stopping tiroes aT = T л а and xT = T a x, then add. led > IP J{|Z(s) - Z(aT)\ >r,}{<rT<s<<TT + 3} + {|Z(s) - Z(xT)\ > rj} {xT < s < xT + 3} ds > IP ^{Z(s) - Z(aT)\ >r,} + {|Z(s) - Z(xT)\ > ny\ x {tt < s < aT + 3} ds.
134 VI. The Skorohod Metric on D[0, со) On the set {т < T}, the sum of the two indicators is at least 1, because at least one of the inequalities |Z(s) - Z(aT)\>n, |Z(s) -Z(zT)\>n must hold if aT = a, zT = x, and \Z(a) - Z(x)\ > 2ц. Deduce that 2sS>JP f {t < T} {z < s < a + 3} ds > IP j { < Гл (о- + |<5)}{a + ^d < s < <7 + d} ds < T a (a + i<5)}. ? 16 Theorem. Let X, Xu X2,... be random elements of D[0, со)for which: (i) the fidis of Xn corresponding to finite subsets of Tx converge to the fidis ofX; (ii) Aldous's condition A3) holds. Then Xn ^ X in the sense of the Skorohod metric. Proof. Verify condition (ii) of Theorem 10. Set down a grid 0 = t0 < ¦ ¦ ¦ < tk = Г of points from Tx with the maximum grid interval shorter than ^<x, for a value of a that will be specified soon. Fix an n, then define stopping times for Xn by то = О, zJ+1 = inf{t > zf \Xn(t) - Xn(zj)\ > In} with the usual convention that the infimum of the empty set equals + oo. We should perhaps add an extra subscript n to each Zj. If (ti-i,tj contains at most one of the {zj} then we must have А(Х„, [f^, t,J) < 4n. For if т,--! < t;_! < zj < t{ < zj+1 then \XJt)-XJxj-i)\<2ri if U^KtKZj, \Xn(t) - Xn(zj)\ <2n if zj<t<ti and hence \Xn(t)-Xn(ti_1)\<4n if t^^tKZj, \Xn(t) - Xn(tt)\ <4n if zj<t<tt. Apply the same reasoning to each grid interval. maxA(Xn,[t;_bt;]) > 4n < {some (?;_ь t;] contains at least two of the {т^} < {some pair т,-_ ь tj has Zj < T л (zj_ x + |a)}.
VI.2. Convergence in Distribution 135 Fix an integer m, whose value will be specified soon. Bound the last indicator function by f {t; < T л (xj.! + !«)} + m f {xj < Т л (т,_х + Т/т)}. j=i j=i The reasoning here is: either z2m > T, in which case the pair х3_и xi would be detected by the first sum; or z2m < T, in which case at least m terms of the second sum must equal one, for otherwise [О, Г] would have to contain (m + 1) disjoint intervals (zj_u tj) of length greater than T/m. Take expec- expectations. A7) ЩтахМХя, [tt.utJ)> < 2m max JP{tj < Г л (zj_l + ^ j<2m + 2 max W{tj < T л (т7-_! + T/m)}. j<2m Now we choose m and a. Invoke Lemma 15 for Z = Х„ with о = т^_1; т = т,- and the 3 provided by A4). For n> n0, JP{Tj< Г Л (Т,._! +^)} <4?. Choose ш so that T/m < \5; hold it fixed. The second term in the bound A7) is then less than the 8s if n > n0. From A4) find the a for which П\Хп(рп + д')-Хп(рп)\>г,}<г/т for n>ni if 0 < d' < a. and if р„ is a stopping time for Х„ with 0 < р„ < T Invoke Lemma 15 again, but replace S by a and ? by г/т. For n > n1, 1Р{т; < Г л (tj_ ! + jtx)} < 4s/m. The first term in the bound A7) is less than 8s if n > nx. In summary: IP<max А(Х„, [t;_ ь tj) > 4rj > < 16e for n > max(n0, nj), which completes the proof. ? 18 Example. For processes with independent increments, the criterion for convergence in distribution is particularly simple. Under the uniform metric for D[0, 1], Theorem V.I9 showed that fidi convergence plus an equicontinuity condition on the increments suffices if the limit process X has continuous sample paths. With a minor variation we get Aldous's condition; essentially, we can drop the constraint on the sample paths of X if we reinterpret the result as convergence in the Skorohod sense.
136 VI. The Skorohod Metric on D[0, со) Suppose that for each e > 0, ц > 0, and Г < oo, there exists a 3 > 0 and an n0 such that JP{\Xn(t + 6') - Xn(t)\ > rj} < г whenever 0 < t < T, 0 < 3' < S, and n > n0. If р„ is a stopping time that takes only finitely many different values, and these values all lie in [0, T], then for n > n0, « = t}JP{\Xn(t + d') - Xn(t)\ > ц\рп = t}] t < ?. Every stopping time can be approximated arbitrarily closely from above by a stopping time that takes only finitely many different values (round up to the next value on a finely spaced grid); the cadlag property of the sample paths carries the inequality over to the stopping times covered by Aldous's condition. ? Problem VIII.8 will give a more interesting application of Theorem 16 to convergence of martingales. Notes Much of this chapter draws ideas from Billingsley A968, Chapter 4) and Gihman and Skorohod A974, Sections Ш.4 and VI.5). Skorohod A956) defined on D[0, 1] several metrics, which allowed different sorts of behavior of a convergent sequence near a discontinuity of the limit function. Billingsley A968) introduced a variation on Skorohod's Jl metric, thereby making D[0,1] complete. This is not so important if we consider only convergence to a known limit process, but it does greatly simplify the theory if existence of the limit process must be proved by com- compactness arguments. The metric of Definition 1 is modeled on the metric of Kolmogorov A956) for D[0, 1]. Whitt A980), and Lindvall A973) have shown how difficult it is to write down a metric for Jx convergence on compacta, as defined by Stone A963). The modulus function is a fixed-grid analogue of Skorohod's A956) AJt, or the Ac of Gihman and Skorohod A974, page 423). Billingsley A968, Section 14) has defined other moduli for D[0, 1]. If existence of the limit process is not assumed in Theorem 10, the conditions will not guarantee its existence; the conditions of the theorem do not translate directly into a characterization of uniform tightness. Theorem 6 borrows from Parthasarathy A967, Section VII.6). The elegant argument for inclusion of the borel a-field in the projection a-field was attributed by Straf A969, page 67) to Wichura. Example 7 is based on the method of Resnick A975). The idea of embedd- embedding the maxima process into a two-dimensional poisson process comes from Pickands A971).
Problems 137 The convergence criterion of Theorem 10 corresponds to Theorem VI.5.2 of Gihman and Skorohod A974). Aldous A978) proved a result slightly different from Theorem 16; he gave a sufficient condition for uniform tightness in D[0, 1]. I found Kurtz's A981, Chapter 2) rearrangement of the proof helpful. Skorohod A957) studied convergence of processes with independent increments. His paper contains many fascinating sample-path arguments. Problems [1] Prove that the d of Definition 1 is a metric on D[0, oo). [To show that d(x, у) = 0 implies x = y, fix a t, then choose Г larger than t. For each z with t < z < T, deduce from the definition of dT the existence of a sequence {sn} with sn -» z and v(sn) -* x(%)- Deduce that either x(z) ~ y{z) or x(z) = y(z—). Choose a sequence {т,} strictly decreasing to t. Right continuity of у at t gives both v(t,) -> y(t) and y{zt —) -> y(t); right continuity of x gives x(zt) ->¦ x(t)-] [2] Show that d(xn, x) ->¦ 0 if and only if there exist continuous, strictly increasing maps {Д„} from [0, oo) onto itself such that, uniformly on compact sets of г values, Xn(t)~t^O and х(А„(г)) ~ xn(t) -» 0. [Construct Х„ as a piecewise linear map that takes gridpoints for х„ onto gridpoints for x, for pairs of grids chosen according to the definition of dj{xn, x) with T depending on n.~\ [3] Prove continuity of the functional Hn that appeared in the proof of Theorem 6. [If d(x, x,) -> 0, choose continuous, increasing {Aj} with x(Af(t)) — x;(?) -> 0 and Aj(l) — t -> 0 uniformly on compact t sets. For г large enough, bound |Я„(х)-Я„(х()|Ьу sup|x(t)|sup|sn(Ai(O) - sM\ + sup|x(Ai(t)) - x,{t)\ t<c t t<c for some constant c] [4] Let {у„} be a sequence of random, increasing maps from [0, oo) onto itself such that Уп@ ->¦ ? in probability, for each fixed t. Show that SUP |уи@ - t| -» 0 in probability 0<1<Г for each fixed Г. [If|yn(s) ~ s\ < e and \yn(t) - t[ < ethen |у„(и) -u\<e + \s-t\ for и between s and t] [5] Suppose d(xn, x) -» 0 and that x is continuous at т. Show that х„(т) ->¦ x(z). [Use х(А„(т)) - х„(т) -> 0 and Л„(т) -> т.] [6] If d(xn, x) ->¦ 0 and x belongs to C[0, oo) then х„ converges to x uniformly on compacta. [7] If Xn "+ X in the Skorohod sense, and if X has sample paths in C[0, oo), then Х„ "+ A" in the sense of the metric for uniform convergence on compacta. [Switch to versions that converge almost surely in the Skorohod sense.]
CHAPTER VII Central Limit Theorems ... in which the chaining method for proving maximal inequalities for the increments of stochastic processes is established. Applications include con- construction of gaussian processes with continuous sample paths, central limit theorems for empirical measures, and justification of a stochastic equicontinuity assumption that is needed to prove central limit theorems for statistics defined by minimization of a stochastic process. VII. 1. Stochastic Equicontinuity Much asymptotic theory boils down to careful application of Taylor's theorem. To bound remainder terms we impose regularity conditions, which add rigor to informal approximation arguments, but usually at the cost of increased technical detail. For some asymptotics problems, especially those concerned with central limit theorems for statistics defined by maximization or minimization of a random process, many of the technicalities can be drawn off into a single stochastic equicontinuity condition. This section shows how. Empirical process methods for establishing stochastic equi- equicontinuity will be developed later in the chapter. Maximum likelihood estimation is the prime example of a method that defines a statistic by maximization of a random criterion function. Indepen- Independent observations ? u ..., ?„ are drawn from a distribution P, which is assumed to be a member of a parametric family defined by density functions {?(¦> #)}• F°r simplicity take 6 real-valued. The true, but unknown, 60 can be estimated by the value 6n that maximizes Let us recall how one proves asymptotic normality for в„, assuming it is consistent for 60. Write 0o(-, <?)forlogK-, e),andgi(; в),д2(-, в),д3(-, в), for the first three partial derivatives with respect to в, whose existence we impose as a regularity condition. Using Taylor's theorem, expand go(-, в) into 0o(-. ^o) + (в - во)д1(-, в0) + КО - 0oJ9z(-, 0o) + Ш - воKд3(-, в*)
VII. 1. Stochastic Equicontinuity 139 with в* between в0 and в. Integrate with respect to the empirical measure Р„. GM = GB@O) + (в - во)Р„д1 + № - воJРпд2 + Rn@). If we impose, as an extra regularity condition, the domination \д3(-,в)\<Н(-) for all в, then the remainder term will satisfy \Rn(9)\ <Ш- во\3Р„\дз(;в*)\ <Цв - во\3Рпн. Assume PH < оо and P\g2\ < °о. Then, by the strong law of large numbers, for each sequence of shrinking neighborhoods of 60 we can absorb the re- remainder term into the quadratic, leaving A) Gn(ff) = Gn(e0) + (в- во)Р„д1 +K0- во)\Р9г + op(l)) near в0. The Op(l) stands for a sequence of random functions of в that are bounded uniformly on the shrinking neighborhoods of в0 by random variables of order op(l). Provided Pg2 < 0, such a bound on the error of approximation will lead to the usual central limit theorem for {п1/2F„ — в0)}. As a more general result will be proved soon, let us not pursue that part of the argument further. Instead, reconsider the regularity conditions. The third partial derivative of go( ¦, в) was needed only to bound the remainder term in the Taylor expansion. The second partial derivative enters A) only through its integrated value Pg2 ¦ But the first partial derivative plays a critical role; its value at each ?; comes into the linear term. That suggests we might relax the assumptions about existence of the higher derivatives and still get A). We can. In place of Pg2 we shall require a second derivative for Pgo(-, в); and for the remainder term we shall invoke stochastic equi- equicontinuity. In its abstract form stochastic equicontinuity refers to a sequence of stochastic processes {Zn(t): t e T} whose shared index set Г comes equipped with a semimetric d(-, •)• (In case you have forgotten, a semimetric has all the properties of a metric except that d(s, t) = 0 need not imply that s equals t.) We shall later need it in that generality. 2 Definition. Call {ZJ stochastically equicontinuous at t0 if for each ц > 0 and g > 0 there exists a neighborhood U of t0 for which limsup pi sup | Zn(t) - ZB(t0) \ > Л < s. ? There might be measure theoretic difficulties related to taking a supremum over an uncountable set of t values. We shall ignore them as far as possible during the course of this chapter. A more careful treatment of measurability details appears in Appendix С Because stochastic equicontinuity bounds Zn uniformly over the neigh- neighborhood U, it also applies to any randomly chosen point in the neighborhood.
140 VII. Central Limit Theorems If {т„} is a sequence of random elements of Г that converges in probability to t0,then C) ZH(zH) - Zn(t0) -» 0 in probability, because, with probability tending to one, т„ will belong to each U. When we come to check for stochastic equicontinuity the form in Definition 2 will be the one we use; the form in C) will be easier to apply, especially when be- behavior of a particular {т„} sequence is under investigation. The maximum likelihood method generalizes to other maximization problems, where {log p(-, в)} is replaced by other families of functions. For future reference it will be more convenient if we pose them as minimization problems. Suppose <F = {/(•, i):teT}, with T a subset of Ж*, is a collection of real, P-integrable functions on the set S where P lives. Denote by Р„ the empirical measure formed from n independent observations on P, and define the empirical process Е„ as the signed measure nll2(Pn — P). Define We shall prove a central limit theorem for sequences {%„} that come close enough to minimizing the (Fn(-)}- Suppose /(-, t) has a linear approximation near the t0 at which F(-) takes on its minimum value: D) /(•, 0 = /(•, t0) + (t- to)'A(O + \t~ to\r(; t). For completeness set r(-, t0) = 0. The A(-) is a vector of к real functions on S. Of course, if the approximation is to be of any use to us, the remainder function r(-,t) must in some sense be small near t0. If we want a central limit theorem for {т„}, stochastic equicontinuity of {Enr(-, t)} at t0 is the ap- appropriate sense. Usually r(-,t) will also tend to zero in the ??2{P) sense:P|r(-, t)\2 -» 0 as t -> t0. That is, /(•, 0 will be differentiable in quadratic mean. In that case, we may work directly with the i?2(P) seminorm pP on the set 0t of all re- remainder functions {r(-,t)}. Stochastic equicontinuity of {Enr(-,t}} would then follow from: for each г > 0 and r\ > 0 there exists in M a neighborhood V of 0 such that limsup IP<sup \Enr\ > r\> < s. I v ) The neighborhood V would take the form {ref :pP(r) < 3} for some 3 > 0. This would be convenient for empirical process calculations. Differ- Differentiability in quadratic mean would also imply that PA = 0. For if PA were non-zero the integrated form of D), P/(-, 0 = Pf{; t0) + (t- to)'PA + o(t - t0) near t0, would contradict existence of even a local minimum at t0.
VII.1. Stochastic Equicontinuity 141 5 Theorem. Suppose {т„} is a sequence of random vectors converging in probability to the value t0 at which F(-) has its minimum. Define r(-, i) and the vector of functions A(-) by D). // (i) t0 is an interior point of the parameter set T; (ii) F(-) has a non-singular second derivative matrix V at t0; (iii) Fn(-in) = op(n-1) + mf,Fn(t); (iv) the components of A(-) all belong to =S?2(P); (v) the sequence {Enr(-, t)} is stochastically equicontinuous at t0 ; then nll2(zn - t0) ~» JV(O, V~l[P{AA') - Proof. Reparametrize to make t0 equal to zero and V equal to the identity matrix. Then (ii) implies F(t) = F@) + \[ 112 + o( 1112) near 0. Separate the stochastic and deterministic contributions to the function Fn(t) by writing Р„ as the sum P + n~mEn. Write Zn(t) for ?>(•, t). Stochastic equicontinuity implies Zn(xn) = op(l). For values of t near zero, F) Fn(t) - Fn@) = P[/(•, 0 - /(•, 0)] + n~ l'2En [/(¦, 0 - Я •, 0)] Invoke (iii). Because Fn{xn) comes within op(n~l) of the infimum, which is smaller than Fn@), op{n-')>Fn{xn)-FM = ikl2 + op{\xn\2) + n-Vh&A + op(n-ll2\zn\). The random vector ?„Д has an asymptotic N@, P(AA') - (PA)(PA)') distribution; it is of order Op(l). Consequently, by the Cauchy-Schwarz in- inequality, т'„Е„А> — |т„|ОрA). Tidy up the last inequality. op{»~1) ^ Й - орA)]|т„|2 - n^^lTjO/l) - op(n-^2\zn\) = Й- орA)][|ти| - Op(n~1/2)]2 - Op{n~'). It follows that the squared term is at most Op(n~ *), and hence т„ = Op(n~1/2). (Look at Appendix A if you want to see the argument written without the Op(-) and ОД-) symbols.) Representation F) for t = т„ now simplifies: Fn(zn) = Fn@) + i\zn\2 + n~ll2z'nEnA + op(n-') = Fn@) + $|ти + п/2?„А|2 - in^l^AI2 + o/n). The same simplification would apply to any other sequence of t values of order Op(n~1/2). In particular, Fn{-n-ll2EnA) = Fn@) - in-^AI2 + оДи). Notice the surreptitious appeal to (i). We need n~ 1/2?„ A to be a point of T before stochastic equicontinuity applies; with probability tending to one as n —> со, it is.
142 VII. Central Limit Theorems Now invoke (iii) again, comparing the values of Fn at т„ and — п~1/2Е„А to get whence п1/2т„ = — ?„Д + op(l). When transformed back to the old para- metrization, this gives ni/2Fi/2(Tn _ to) = _ к-1/2Е„Л + op{\) ~>V~ 1/2iV@, P(AA') - (PA)(PA)'). ? Examples 18 and 19 in Section 4 will apply the theorem just proved. But before we can get to the applications we must acquire the means for verifying the stochastic equicontinuity condition. VII.2. Chaining Chaining is a technique for proving maximal inequalities for stochastic processes, the sorts of things required if we want to check the stochastic equicontinuity condition defined in Section 1. It applies to any process {Z(t):teT} whose index set is equipped with a semimetric d(-, •) that controls the increments: W{\Z(s)-Z(t)\>ri}<A{n,d{s,t)) for r,>0. It works best when A(-, •) takes the form with D a positive constant. Under some assumptions about covering numbers for Г, the chaining technique will lead to an economical bound on the tail probabilities for a supremum of | Z(s) — Z(t) | over pairs (s, t). The idea behind chaining, and the reason for its name, is easiest to under- understand when Г is finite. Suppose Tu T2,..., Tk+1 = Г are subsets with the property that each t lies within 5t of at least one point in Tt. Imagine each point of Ti+! linked to its nearest neighbor in Tit for i = 1,..., к. From every t stretches a chain with links t = tk+ ъ tk,..., t1 joining it to a point in Tt. Q = point of Г, О = point of т2 • = point of T3
VII.2. Chaining 143 The value of the process at t equals its value at tx plus a sum of increments across the links joining t to tv The error involved in approximating Z{t) by Z(ti) is bounded, uniformly in t, by ? max\Z(ti+1)-Z(td\. If 7] contains Nt points, the maximum in the ith summand runs over Ni+1 different increments, each across a link of length at most <5;. The probability of the summand exceeding щ is bounded by a sum of Ni+x terms, each less than А(?7г, Si). ) G) JP\max\Z(t) - Z{tx)\ >r]l+--- + г,Л < ? Л,-+1Д(ъ, 5,)- It ) i=i This inequality is useful if we can choose цг, Зь and 7] to make both the right- hand side and the sum of the O7J small. In that case the maximum of | Z(s) — Z{t) | over all pairs in Г is, with high probability, close to the maxi- maximum for pairs taken from the smaller class Tv When ДО/, <5) = 2 exp(—^r\2jD2d2), a good combination seems to be: {<5,} decreasing geometrically and {?/,} chosen so that Ы{+1А(г/г, <5;) = 2<5;, that is, With these choices the right-hand side of G) is bounded by the tail of the geometric series ?f 3t, and the sum of the {j/J on the left-hand side can be approximated by an integral that reflects the rate at which Nt increases as <5; decreases. 8 Definition. The covering number N(S), or NC, d, T) if there is any risk of ambiguity, is the size of the smallest <5-net for Г. That is, NC) equals the smallest m for which there exist points tu...,tm with min; d(t, tt) < 3 for every t in T. The associated covering integral is Cd J{3) = J(S, d,T)= [2 log(iV(MJ/M)]1/2 du for 0<^<l. ? Jo The N(uJ, in place of N(u), will allow us to bound maxima over more than just the nearest-neighbor links from Ti+1 to 7]. If we interpret P as standing for the ^C1(P) or J?2(P) semimetrics on #", the notation iV\(<5, P, #") and N2C, P, #") used in Chapter II almost agrees with Definition 8. Here we implicitly restrict tu ..., tm to be points of T. In Chapter II the approximating functions were allowed to lie outside J5! They could have been restricted to lie in 2F without seriously affecting any of the results. The proof of our main result, the Chaining Lemma, will be slightly more complicated than indicated above. To achieve the most precise inequality,
144 VII. Central Limit Theorems we replace nt by a function of the link lengths. And we eliminate a few pesky details by being fastidious in the construction of the approximating sets 7]. But apart from that, the idea behind the proof is the same. As you read through the argument please notice that it would also work if N(-) were replaced throughout by any upper bound and, of course, J(-) were increased accordingly. This trivial observation will turn out to be most important for applications; we seldom know the covering numbers exactly, but we often have upper bounds for them. 9 Chaining Lemma. Let {Z(t): t e T} be a stochastic process whose index set has a finite covering integral J(-). Suppose there exists a constant D such that, for all s and t, A0) IP{|Z(s) - Z(t)\ > rjd(s, t)} < 2 exp(-^rj2/D2) for n > 0. Then there exists a countable dense subset T* of T such that, for 0 < s < 1, IP{ |Z(s) - Z(t)| > 26DJ(d(s, 0) far some s, t in T* with d(s, t) < s} < 2s We can replace T* by T if Z has continuous sample paths. Proof. Write Щи) for [2 log(N(wJ/t<)]1/2. It increases as и decreases. Set <5; = s/2' for j = 1, 2,.... Construct 2<5rnets 7] in a special way, to ensure that 7\ ? T2 ? • • • . (The extra 2 has little effect on the chaining argument.) Start with any point tx. If possible choose a t2 with d(t2, tt) > 2d1; then a t3 with d(t3, tj) > 231 and d(t3, t2) > 2дг; and so on. After some tm, with m no greater than NC^, the process must stop: if m > NC^ then some pair tt, tj would have to fall into one of the NC^ closed balls of radius <5X that cover T. Take Tt as the set {tu ..., tm}. Every t in T lies within 23X of at least one point in Tv Next choose tm+1, if possible, with d(tm+u t,) > 232 for i < m; then tm+2 with d(tm+2, ti) > 282 for i < m + 1; and so on. When that process stops we have built Ti up to T2, a 2<52-net of at most NC2) points. The sets T3, T4,... are constructed in similar fashion. Define T* to be the union of all the {7]}. For the chaining argument sketched earlier (for finite T) we bounded the increment of Z across each link joining a point of Ti+ г to its nearest neighbor in Tt. This time Ti+1 contains Tt; all the links run between points of 7J+1. With only an insignificant increase in the probability bound we can increase the collection of links to cover all pairs in Ti+1, provided we replace the suggested п{ by a quantity depending on the length of the link. Set At = {|Z(s) - Z(f)\ > Dd(s, t)HCt) for some s, t in 7]}. It is a union of at most NCiJ events, each of whose probabilities can be bounded using A0). WA, < 2N{3f expE-W,.J] = 2St. The union of all the {A{}, call it A, has probability at most 2e.
VII.2. Chaining 145 Consider any pair (s, t) in Г* for which d(s, t) < s. Find the n for which 3„ < d(s, t) < 23 „. Because the {7]} expand as i increases, both s and t belong to some Tm+1 with m > n. With a chain s = sm+1, sm,..., sn link s to an sn in Т„, choosing each st to be the closest point of Tt to si+u thereby ensuring that d(si+1,si) < 2dt. Define a chain {tj for t similarly. Break Z(s) — Z(t) into Z(sn) — Z(tn) plus sums of increments across the links of the two chains; |Z(s) - Z(t)\ is no greater than m |Z(sJ - Z(tn)| + X [|Z(si+1) - Z(s,.)| + |Z(tl+1) - Z(tf)|]. i = n Both si+ x and s,- belong to Ti+1. On Л • +15 |Z(si+1) - Z(sd\ < Dd(si+1, sdH(8,+ 1) < ЩЯЙ+1). On Ac, these bounds, together with their companions for (sn, tn) and (ti+1, tt), allow |Z(s) - Z(t)| to be at most m Dd(sn, QHEn) + 2 X 2D8tHEt+1). i — n The distance d(sn, tn) is at most mm m d(s, 0 + Z <*(*+1, sO + Z rf(f.-+1. *i) < 2^„ + 2 X 25, < 105„. i = n i = и i = n Also E,- = 4(<5,+! - ^i+2). Thus, on Ac, m |Z(s) - Z(t)| < ЮГ»^ЯЯ) + 4D Z4(<5l+i - m л X {<5i i = n J 16Z> X {<5i+2 < u < 3i+1}H(u) du <26DJ(d(s,t)). If Z has continuous sample paths, the inequality with Г* replaced by Г is the limiting case of the inequalities for Г* with s replaced by s + n~l. ? Often we will apply the inequality from the Chaining Lemma in the weaker form: IP{ |Z(s) - Z(t)| > 26DJ(s) for some s, t in Г* with d(s, t) < &} < 2s. A direct derivation of the weaker inequality would be slightly simpler than the proof of the lemma. But there are applications where the stronger result is needed.
146 VII. Central Limit Theorems 11 Example. Brownian motion on [0, 1], you will recall, is a stochastic process {B(-,t):O<t < 1} with continuous sample paths, independent increments, B(-, 0) = 0, and B(t) - B(s) distributed JV(O, t - s) for t > s. If we measure distances between points of [0, 1] in a strange way, the Chain- Chaining Lemma will give a so-called modulus of continuity for the sample paths of Я The normal distribution has tails that decrease exponentially fast: from Appendix B, JP{\B(t) - B(s)\ > rj} < Define a new metric on [0, 1] by setting d(s, t) = \s — r|1/2. Then В satisfies inequality A0) with D = 1. The covering number NE, d, [0, 1]) is smaller than 23~2, which gives the bound C3 JC) < [2 log 4 + 10 log(l/M)]1/2 du •>o < B log 4I/2<5 + УШ[к^A/<5)Г1/2 Jo g(/) du < 4<5[log(l/<5)]1/2 for 5 small enough. From the Chaining Lemma, JP{|B(s) - B(t)\ > 26J(d(s, 0) for some pair with d(s, t) < 6} < Ж The event appearing on the left-hand side gets smaller as 3 decreases. Let 5 I 0. Conclude that for almost all со, \B(co,s) - B(co, 01 < 741(s - 01og|s - I 1/2 for \s — t\1/2 < d(co). Except for the unimportant factor of 74, this is the best modulus possible (McKean 1969, Section 1.6). ? VII.3. Gaussian Processes In Section 5 we shall generalize the Empirical Central Limit Theorem of Chapter V to empirical processes indexed by classes of functions. The limit processes will be analogues of the brownian bridge, gaussian processes with sample paths continuous in an appropriate sense. Even though existence of the limits will be guaranteed by the method of proof, it is no waste of effort if we devote a few pages here to a direct construction, which makes non-trivial application of the Chaining Lemma. The direct argument tells us more about the sample path properties of the gaussian processes. We start with analogues of brownian motion. The argument will extend an idea already touched on in Example 11.
VII.3. Gaussian Processes 147 Look at brownian motion in a different way. Regard it as a stochastic process indexed by the class of indicator functions #"= {[O,t]:O<r<: 1}. The со variance P[B(-, f)B(-,g)~] can then be written as P(fg), where P = Uniform[0, 1]. The process maps the subset ^ of „S?2(P) into the space =S?2 (IP) in such a way that inner products are preserved. From this perspec- perspective it becomes more natural to characterize the sample path property as continuity with respect to the =S?2 (P) seminorm pP on J5". Notice that Pp(I[0, s] - [0, t]|) = (ЛЕО, s] - [0, t]|2I/2 = Is - f|1/2. It is no accident that we used the same distance function in Example 11. The new notion of sample path continuity also makes sense for stochastic processes indexed by subclasses of other =S?2(P) spaces, for probability measures different from Uniform[0, 1]. 12 Definition. Let #" be a class of measurable functions on a set S with a cr-field supporting a probability measure P. Suppose ZF is contained in 5?2{P). A P-motion is a stochastic process {BP(-,f):f e #"} indexed by !F for which: (i) BP has joint normal finite-dimensional distributions with zero means and covariance JP[BP(-,f)BP(-, g)~\ = P(fg); (ii) each sample path BP(co, ¦) is bounded and uniformly continuous with respect to the ?C2(P) seminorm pP(-) on !F. The name does not quite fit unless one reads "Uniform[0, 1]" as " brownian," but it is easy to remember. The uniform continuity and bounded- ness that crept into the definition come automatically for brownian motion on the compact interval [0,1]- In general #" need not be a compact subset of if2 (P), although it must be totally bounded if it is to index a P-motion (Problem 3); uniformly continuous functions on a totally bounded OF must be bounded. We seek conditions on P and 3F for existence of the P-motion. The Chaining Lemma will give us much more: a bound on the increments of the process in terms of the covering integral JE) = J(d, pP, P)= Г [2 log(JV(M, pP, iF)»]1/2 du. Jo Finiteness of J(-) will guarantee existence of BP. 13 Theorem. Let ^ be a subset of =S?2(P) with a finite covering integral, J(-), under the i?2(P) seminorm pP(-)- There exists a P-motion, BP, indexed by !F,for which | BP(co, f) - BP(co, g) | < 26J(pP(f - g)) if pP(f - g) < d(co), with 5(co) finite for every оз.
148 VII. Central Limit Theorems Proof. Construct the process first on a countable dense subset J^ = {./}} of J*. Such a subset exists because !F has a finite E-net for each <5 > 0 (other- (otherwise J could not be finite). Apply the Gram-Schmidt procedure to J^, generating an orthonormal sequence of functions {uj}. Each / in J^ is a finite linear combination Yj (.uj>f)uj because {ult... ,un} spans the same subspace as {/j,... ,/„}. Here, temporarily, (u, /> denotes the inner product in i?2 (P): <м, /> = Р(м/). Choose a probability space (Д <f, IP) supporting a sequence {C/j} of independent iV@, 1) random variables. For each / in J^ and со in ?2 define The sum converges for every со, because only finitely many of the coefficients <Mj-,/> are non-zero. The finite-dimensional distributions of Z are joint normal with zero means and the desired covariances: as required for a P-motion. The =S?2(P) seminorm is tailor-made for the chaining argument. Because IP{ | JV(O, 1)| > x} < 2 exp(-^x2) for x > 0 (Appendix B), - Z(g)\ > n) < Apply the Chaining Lemma to the process Z on J^. Because J^ itself is countable we may as well assume the countable dense subset promised by the Lemma coincides with J^. Let G{5) denote the set of со for which \Z(f) - Z(g) | > 26J(pP(f - g)) for some pair with pP(f - g) < д. Then IPG(<5) < 2<5 for every 3 > 0. As 5 decreases, G(<5) contracts to a negligible set G@). For each со not in G@), | Z(co, /) - Z(co, g) | < 26J(pP(/ - g)) if pP(f - g) < <5(co). Reduce D. to Q\G@). Then each sample path Z(co, •) is uniformly continuous. Extend it from the dense J^, up to a uniformly continuous function on the whole of 2F. The extension preserves the bound on the increments, because both J and pP are continuous. Complete the proof by checking that the resulting process has the finite dimensional distributions of a P-motion. ? For brownian motion, continuity of sample paths in the usual sense coincides with continuity in the pP sense, with P = Uniform[0, 1]. The P-motion processes for different P measures on [0, 1] (or on Ж, or on Ж*)
VII.4. Random Covering Numbers 149 do not necessarily have the same property. If P has an atom of mass a at a point t0 > the sample paths of the BP indexed by intervals {[0, ?]} will all have a jump at t0. The size of the jump will be N@, a) distributed independently of all increments that don't involve a pair of intervals bracketing t0. All sample paths are cadlag in the usual sense. We encountered similar behavior in the gaussian limit processes for the Empirical Central Limit Theorem (V.ll) on the real line. We represented the limit as U(F(-)), with U a brownian bridge and F the distribution function for the sampling measure P. We can also manufacture the limit process directly from the P-motion, in much the same way that we get a brownian bridge from brownian motion. Denote by 1 the function taking the constant value one. Then the process obtained from BP by setting EP(;f) = BP(-,f)-(Pf)BP(;l), is a gaussian process analogous to the brownian bridge. 14 Definition. Call a stochastic process EP indexed by a subclass !F of i?2(P) a P-bridge over <F if (i) EP has joint normal finite-dimensional distributions with zero means and covariance IP[?P(-,/)?P(-, g)] = P(fg) - (Pf)(Pg); (ii) each sample path EP(a>, •) is bounded and uniformly continuous with respect to the i?2(P) seminorm on J^ ? The P-bridge will return in Section 5 as the limit in a central limit theorem for empirical processes indexed by a class of functions. VII.4. Random Covering Numbers The two methods developed in Chapter II, for proving uniform strong laws of large numbers, can be adapted to the task of proving the maximal in- inequalities that lurk behind the stochastic equicontinuity conditions in- introduced in Section 1. The second method, the one based on symmetrization of the empirical measure, lends itself more readily to the new purpose because it is the easier to upgrade by means of a chaining argument. We have the tools for controlling the rate at which covering numbers grow; we have a clean exponential bound for the conditional distribution of the increments of the symmetrized process. The introduction of chaining into the first method is complicated by a messier exponential bound. Section 6 will tackle that problem. Recall that the symmetrization method relates Pn — P to the random signed measure P° that puts mass + n~* at each of ?u ..., ?„, the signs being allocated independently plus or minus, each with probability \. For central
150 VII. Central Limit Theorems limit theorem calculations it is neater to work with the symmetrized empirical process E° = n1/2P°. Hoeffding's Inequality (Appendix B) gives the clean exponential bound for E° conditional on everything but the random signs. For each fixed function /, i= 1 That is, if distances between functions are measured using the =S?2(Pn) seminorm then tail probabilities of E° under IP(-|i;) satisfy the exponential bound required by the Chaining Lemma, with D = 1. For the purposes of the chaining argument, E° will behave very much like the gaussian process BP of Section 3, except that the bound involves the random covering number calculated using the =S?2(Pn) seminorm. Write J2C, Pn,P)= f [2 log(iV2(M, Р„, #-)»]1/2 du Jo for the corresponding covering integral. Stochastic equicontinuity of the empirical processes {?„} at a function /0 in J5" means roughly that, with high probability and for all n large enough, Enf — Enf0\ should be uniformly small for all/close enough to/0. Here closeness should be measured by the i?2(P) seminorm pP. With the Chaining Lemma in hand we can just as easily check for what seems a stronger property —but if you look carefully you'll see that it's equivalent to stochastic equi- equicontinuity for a larger class of functions. Of course we need 3F to be per- permissible (Appendix C). 15 Equicontinuity Lemma. Let ^ be a permissible class of functions with envelope F in „S?2(P). Suppose the random covering numbers satisfy the uni- uniformity condition: for each n > 0 and г > 0 there exists а у > 0 such that A6) limsup IP{ J2(y, Pn,^)> n) < e. n Then there exists a S > Ofor which limsup HNsup \En(f - g)\ > n\ < s, I is] J where И = {(/, g):f,ge& and Pp(f - g) < 6). Proof. The idea will be: replace En by the symmetrized process E°; replace [<5] by a random analogue, <2<5> = {(/, g):f,geP and (Р„(/ - ^JI'2 < 23}; then apply the Chaining Lemma for the conditional distributions IP(-|^).
VII.4. Random Covering Numbers 151 For fixed / and g in [<5] we have var(En(f- g)) = P(f- gJ < S2. Argue as in the First and Second Symmetrization steps of Section II.3: when 3 is small enough, pjsup \En(f - g)\ > Л < 4pjsup |?°(/ - g)\ > That gets rid of Е„. If with probability tending to one the class <2<5> contains [<5], we will waste only a tiny bit of probability in replacing [<5] by <2<5>: pj jsup |?°(/ - 0)| > |Л < pjsup \E°(f - I M J 1 It would suffice if we showed sup^2 \Pnh — Ph\ -»0 almost surely, where J^ = {(/ — gJ: /, g e #"}. This follows from Theorem 11.24 because the condition A6) implies A7) log N,C, Р„, JD = op(n) for each 8 > 0. Problems 5 and 6 provides the details behind A7). That gets rid of [<5]. The reason we needed to replace [<5] by <2<5> becomes evident when we condition on \. Write />„(•) for the j??2(Pn) seminorm. We have no direct control over />„(/ — g) for functions in [<5]; but for <2<5>, whose members are determined as soon as \ is specified, pn(f — 9) < 2<5. Apply the Chaining Lemma. IP{I W - 9)\ > 26J2BS, Pn, &) for some (/, g) in <2«5>*|^} < 43. The countable dense subclass <2<5>* can be replaced by <2<5> itself, because E° is a continuous function on IF for each fixed \: \E°n(f -g)\< n'l2Pn\f -g\< n^2pn(f - g). Integrate out over \, then choose b so that both P{26J2B<5, Pn, #") > %r\} and 4E are small. ? Now that we have the maximal inequalities for empirical processes, we can take up again the central limit theorems for statistics defined by minimiza- minimization of a random process, the topic we left hanging at the end of Section 1. Recall that we need the processes {Enr(-, t)}, which is indexed by the class 01 = {r(-,t): t e T} of remainder functions, stochastically equi- continuous at t0. If /(•, t) is differentiable in quadratic mean at t0, it will suffice if we find a neighborhood V = {r e @:pP(r) < d} for which limsup P<sup |?nr| > I v Notice that V ? {rx - r2:pP(r1 — r2) < d}, because r(-, tQ) = 0 by defini- definition. Thus we may check for stochastic equicontinuity by showing: (i) The class & has an envelope belonging to ??2(P).
152 VII. Central Limit Theorems (ii) /(•, t) is differentiable in quadratic mean at t0. From (i), this follows by dominated convergence if r( •, t) -» 0 almost surely [P] as t -»t0. (iii) Condition A6) is satisfied for & = M. These three conditions place constraints on the class {/(•, 0}- 18 Example. The spatial median of a bivariate distribution P is the value of в that minimizes М(в) = P\x — в\. Estimate it by the в„ that minimizes Mn(&) = Pn | x — в |. Example 11.26 gave conditions for consistency of such an estimator. Those conditions apply when P equals the symmetric normal N@,12), a pleasant distribution to work with because explicit values can be calculated for all the quantities connected with the asymptotics for {$„}. For this P, convexity and symmetry force M(-) to have its unique minimum at zero, so в„ converges almost surely to zero. Theorem 5 will produce the central limit theorem, п^в„ ~> ЛГ(О, D/n)I2), after we check its non-obvious conditions (ii), (iv), and (v). Change variables to reexpress M{&) in a form that makes it easier to find derivatives. M@) = 1 (\ Differentiate under the integral sign. M'@) = 0, of course, M"@) = Bл)-1 [\х\(хх' - / A random vector X with a N@,12) distribution has the factorization X = RU where R2 = \X\2 has a ^distribution independent of the random unit vector U = X/\X\, which is uniformly distributed around the rim of the unit circle. V = M"@) = JP(R3UU' - RI2) = JPR3JPUU' -(JPR)I2 = (n/8)ll2I2. Condition (ii) wasn't so hard to check. To figure out the A(x) that should appear in the linear approximation carry out the usual pointwise differentiation. That gives A(x) = x/\x\ for x/0. Set A@) = 0, for completeness. The components of A( •) all belong to SC2(P). Indeed, PAA' = WUU' = \l2. That's condition (iv) taken care of.
VII.4. Random Covering Numbers 153 Now comes the hard part—or at least it would be hard if we hadn't already proved the Equicontinuity Lemma. Start by checking that the class 0t of remainder functions r( •, в) has an envelope in J?2(P). For в ф О, < 4. It follows that | • — в | is differentiable in quadratic mean at в = 0. We have only to verify condition A6) of the Equicontinuity Lemma to complete the proof of stochastic equicontinuity. Each r(-, в), for в ф 0, can be broken into a difference of two bounded functions: Write 0tx and 0t2 for the corresponding classes of functions. The linear space spanned by 8&^ has finite dimension; the graphs have polynomial discrimination, by Lemma 11.28; the covering numbers N2(u, Pn, ^j) are bounded by a polynomial Au~w in u~l, with A and W not depending on Pn (Lemma 11.36). The graphs of functions in 8%г also have polynomial discrimination, because {(x, t):\x — в\ — \x\ > \0\t} can be written as {-2e'x + \e\2>2\e\\x\t + \e\2t2}r>{\x\ + \e\t>oi}vj{\x\ + \e\t < o}. This is built up from sets of the form {g > 0} with g in the finite-dimensional vector space of functions The covering numbers for 0l2 are also uniformly bounded by a polynomial in и. These two polynomial bounds combine (Problem 11.18) to give a similar uniform bound for the covering numbers of 01, which amply suffices for the Equicontinuity Lemma: for each ц > 0 there exists а у such that •^гG> Pn> 8%) ^ Л f°r every Pn. The conditions of Theorem 5 are all satisfied; the central limit theorem for {в„} is established. ? 19 Example. Independent observations are sampled from a distribution P on the real line. The optimal 2-means cluster centers an, bn minimize W(a, b, Pn) = Pnfa,b, where /а>ь(х) = |х-а|2л|х-Ь|2. In Examples II.4 and 11.29 we found conditions under which а„, Ь„ converge almost surely to the centers a*, b* that minimize W(a, b, P) = Pfuyb. Theorem 5 refines the result to a central limit theorem.
154 VII. Central Limit Theorems Keep the calculations simple by taking P as the Uniform[0, 1] distribu- distribution. The argument could be extended to other P distributions, higher dimensions, and more clusters, at the cost of more cumbersome notation and the imposition of a few extra regularity conditions. The parameter set consists of all pairs (a, b) with 0 < a < b < 1. For the Uniform[0, 1] distribution direct calculation gives explicitly the values a*, b* that minimize W(a, b, P). W(a, b, P) = J{0 < x < i(a + b)}\x - a\2 + ii(a + b)<x< l}|x - b\2dx = ia3 + i(l - bf +Ub~ aK- Minimizing values: a* = ?, b* = f, as you might expect. Near these optimal centers, W(a, b,P) = & + |(fl - iJ - |(fl - № - |) + Kb - IJ + cubic terms The function fab(x) has partial derivatives with respect to a and b except when x = j(a + b). That suggests for A(x) the two components Aa(x)= -2(x-i){0<x<|}, Ab(x)= -2(x-M<x<l}. Both functions belong to J?2(P). The remainder function is denned by subtraction of the linear approximation from/ai). Simplify the notation by writing s — a — \, t = b — f; change fab to gst and r(-, a, b) to R(-, s, t). (\s\ + \t\)R(x, s, t) = gSil(x) - jo,oW + 2s(x - i){0 < x < |} + 2t(x - M < x < 1} = a piecewise. linear function of x. envelope s~R(;s,
VII.5. Empirical Central Limit Theorems 155 The remainder functions are bounded by a fixed envelope in i?2(P). |K(x, s, 01 < [|(x - i - sJ - (x - iJ| + |(x - ! - tf - (x - |J <4|x -i| + 4\x -|| + 2. Deduce differentiability in quadratic mean of/„ b at the optimal centers. The graphs of the piecewise linear functions in 01 have only polynomial discrimination, and they have an envelope in ?C2(P). Lemma 11.36 gives a uniform bound on the covering numbers that ensures J2(y, Рп,0?)<ц for every Р„ if у is chosen small enough. The Equicontinuity Lemma applies; the processes {Enr(-, a, b)} are stochastically equicontinuous at (\, f); the optimal centers obey a central limit theorem {n'l\an - |), n"\bn - I)) ~» N@, F^PCAAOF), where [! 1} Ч? Д VII.5. Empirical Central Limit Theorems As random elements of D[0, 1], the uniform empirical processes {Un} converge in distribution to a brownian bridge. More generally, the empirical processes {?„} for observations from an arbitrary distribution on the real line converge in distribution, as random elements of D\_— oo, oo], to a gaussian process obtained by stretching out the brownian bridge. Both results treat the empirical measure as a process indexed by intervals of the real line. In this section we shall generalize these results to empirical measures indexed by classes of functions. Convergence in distribution, as we have denned it, deals with random elements of metric spaces. Once we leave the safety of intervals on the real line it becomes quite a problem to decide what metric space of functions empirical process sample paths should belong to. Without the natural ordering of the intervals, it is difficult to find a completely satisfactory sub- substitute for the cadlag property; without the simplification of cadlag sample paths, empirical processes run straight into the measure-theoretic complica- complications we have so carefully been avoiding. Appendix С describes one way of overcoming these complications. A class of functions satisfying the regularity conditions described there is said to be permissible. Most classes that arise from specific applications are permissible. Let !F be a pointwise bounded, permissible class of functions for which sup^r \Pf\ < oo. The empirical processes {?„} define bounded functions on SF\ their sample paths belong to the space ЭС of all bounded, real functions
156 VII. Central Limit Theorems on 3F. To avoid some of the confusion that might be caused by the hierarchy of functions on spaces of functions on spaces of functions, call members of Ж functionals. Equip Ж with the metric generated by the uniform norm, ||x|| = sup^ |x(/)|. Be careful not to confuse the norm || • || onf with the У2(Р) seminorm pP{-) on &'. The choice of <7-field for Ж is tied up with the measurability problems handled in Appendix C. We need it small enough to make En a measurable random element of Ж, but large enough to support a rich supply of measur- measurable, continuous functions. The limit distributions must concentrate on sets of completely regular points (Section IV.2). That suggests that the «r-field should at least contain the balls centered at the functionals that are uniformly continuous for the pP seminorm. 20 Definition. Write C(#", P) for the set of all functionals x(-) in Ж that are uniformly continuous with respect to the i?2(P) seminorm on !F. That is, to each г > 0 there should exist a b > 0 for which | x(f) — x(g) | < s whenever Pp(f - 9) < 8- Define 38P as the smallest <7-field on Ж that: (i) contains all the closed balls with centers in C(#", P); (ii) makes all the finite-dimensional projections measurable. ? Notice that CijF, P) is complete, because it is a closed subset of the complete metric space (Ж, || • ||). Notice also that <%p depends on the sampling distribution P. Each Е„ is a ^-measurable random element of Ж under mild regularity conditions (Appendix C). The finite-dimensional projections of {?„} (the fidis) converge in distri- distribution to the fidis of EP, the P-bridge process over ?F (Definition 14). Of course some doubts arise over the existence of EP; getting a version with sample paths in C(#", P) is no simple matter, as we saw in Section 3. Happily, the questions of existence and convergence are both taken care of by a single property of the empirical processes, uniform tightness. Recall from Section IV.5 that uniform tightness for {?„} requires existence of a compact set Ke of completely regular points in Ж such that liminfP{EneG} > 1 -e for every open, ^-measurable set G containing KE. From uniform tightness we would get a subsequence of {?„} that converged in distribution to a tight borel measure on 3C. If C(#; P) contained each Ke, the limit would con- concentrate in C{ZF, P). Its fidis would identify it (Problem 8) as the P-bridge over #". Uniform tightness of {?„} would also imply convergence of the whole sequence to EP. For if {Wh(En)} did not converge to JPh(EP) for some bounded, continuous, ^p-measurable lionf then \JPh(En) - TPh(EP)\ > s infinitely often for some s > 0. The subsequence along which the inequality held would also be uniformly tight; it would have a sub-subsequence converging to a
VII.5. Empirical Central Limit Theorems 157 process whose fidis still identified it as a P-bridge. That would give a con- contradiction : along the sub-subsequence, {JPh(En)} would converge to JPh(EP) without ever getting closer than s. 21 Theorem. Let SF be a pointwise bounded, totally bounded, permissible subset ofJ?2(P). If for each n > 0 and s > 0 there exists a3 > Ofor which B2) limsup IP {sup | En{f - g)\ > Л < e, I № ) where [<5] = {(/, g): f, g e !F and pP(f — g) < 3}, then En ~> EP as random elements of 3E. The limit P-bridge process EP is a tight, gaussian random element of 3C whose sample paths all belong to C(!F, P). Proof. Check the uniform tightness. Given s > 0 find a compact subset К of C(#", P) with liminf Р{?„ ? G} > 1 — s for every open, ^p-measurable G containing K. Construct К as an intersection of sets Du D2,..., where Dk is a finite union of closed balls of radius k~1 centered at points of C{?F, P). Every functional in Dk will lie uniformly within k~1 of a member of C(JF, P); every functional in К will therefore belong to C(#", P), being a uniform limit of functionals in C(#", P). The proof follows closely the ideas used in Theorem V.16 to prove existence of the brownian bridge. Only the continuous inter- interpolation between values taken at a finite grid requires modification. Fix n > 0 and г > 0 for the moment, and choose 3 according to B2). Invoke the total boundedness assumption on !F to find a finite subclass #"г = {/i> •••>/»} °f &" such that each / in J* has an /* in J^ for which Pplf ~ /*) < ?¦ We need to find a finite collection of closed balls in 9C, each with a specified small radius and centered on a functional in C(?F, P), such that En lies with specified large probability in the union of the balls. Construct the centers for the balls by a continuous interpolation between the values taken on at each f in J^ by realizations of an En. For each /(, the sequence of random variables {En(f)} converges in distribution. There exists a constant С for which limsup IP< max | Elf) | > C> < s. I ' J Define D.n = < со:sup |En(a>,f - g)\ < n and max |Еп(ю,/f)| < С By the choice of 3 and С we ensure that liminf PQn > 1 - 2s. Write Sn for the bounded set of all points in IRra with coordinates ?„(со,/,) for an oo in Qn, and S for the union of the {?„}. If r belongs to S then \r(\ < С for each coordinate and \rt — rj\ < 2ц whenever there exists an / in !F with Pp(f ~ f) <3 and pP(f - /•) < д.
158 VII. Central Limit Theorems Construct weight functions A,(-) on #" by Mf) = »«(/)/D>i(/) + ¦¦¦ + vm(f)l Each vt(-) is a uniformly continuous function (under the pP seminorm) that vanishes outside a ball of radius д about/;. For every / there is at least one /, its/*, for which vt(f) > \; the denominator in the denning quotient for A,- is never less than \. The A;(-) are non-negative, uniformly continuous functions that sum to one everywhere in 3F. For г in S define an interpolation function by x(f, r) = Yj=1 A,-(/)r;. Each x(-, r) belongs to C(J^ P). If pP(f — /) < «5, all the rt values corre- corresponding to non-zero {A;(/)} satisfy \rt — rj\ < 2ц. As a convex combination of these values, x(f, r) must also lie within 2r\ of r,-. An En(co, •) corresponding to an со in D,n has a similar property: \En(co,f) -En{co,f})\ <n when pP(f - f}) < д. Thus, if (o belongs to D,n and |Е„(а>, fj) — rj\ <r\ for every j then Choose from the bounded set S a finite subset (r(l),..., r(p)} for which min max | rj — rj(k) | < ц for every г in S. к j Abbreviate x{-, r(/c)) to xk(-), for к = 1,..., p. Then from what we have just proved min sup |En(co,/) - xk(f)\ < 4rj, к & whenever со belongs to D.n. If we set D equal to the union of the balls B(Xl,4ri),...,B(xp,4ri),tben liminfff {?„??>} > 1 - 2e. Repeat the argument with щ replaced by {4k)'г and s replaced by s/2k+1, for к = 1, 2,..., to get each of the Dk sets promised at the start of the proof: Dk is a finite union of closed balls of radius k~г and liminf JP{EneDk} > 1 - г/2". The remainder of the proof follows Theorem V.I6 almost exactly. The intersection of the sets Du D2,... is a closed and totally bounded subset of the complete metric space C(#", P); it defines the sought-after compact K. The open G contains some finite intersection D1 n • • • n Dk. If not, there would exist a sequence Y = {yk} with yk in Gc n Dx n • • • n Dk for each k. Some subsequence Y' of Y would lie within one of the balls making
VII.5. Empirical Central Limit Theorems 159 up Dx; some subsequence Y" of Y' would lie within one of the balls making up D2; and so on. The sequence constructed by taking the first member of Y', the second member of Y", and so on, would be Cauchy; it would con- converge to a point у (9С is complete) belonging to all the closed sets {Gc n Dt n • • • n Dk}. This would contradict f) Gc n Dt n ¦ ¦ ¦ n Dk = Gc n К = 0. *=i Complete the uniform tightness proof by noting that liminf JP{En e G} > liminf Р{?„ e Dt n • • • n ?>J > 1 - s if G contains D-^гл ¦ ¦ ¦ n Dk. ? Condition B2) points the way towards mass production of empirical central limit theorems. The Chaining Lemma makes it easy. For example, from the Equicontinuity Lemma of Section 4, we get conditions on the random covering numbers under which {?„} converges in distribution. The next section will describe other sufficient conditions. 23 Example. We left unfinished back in Example V.I5 a limit problem for goodness-of-fit statistics with estimated parameters. The empirical processes were indexed by intervals of the real line; the estimators took the form р 1 = 1 for an L with PL = 0, PL2 < oo. We wanted to find the limiting distribution of Dn = sup | ?B(-oo51] - nll2@n - 0o)A(t)| + o,(l) t the A(-) being a fixed cadlag function on [ — oo, oo]. Set #¦ equal to {L} и {(— oo, t]: — oo < t < oo}. Express Dn in terms of a function on the corresponding 9C. Define H(x) = sup \x((- oo, t]) - x(L)A(t)\. t Guard against measurability evils by restricting the supremum to rational t values: it makes no difference to ?„, A, or the limiting P-bridge. Clearly H(-) is a continuous function on 9C. You can check condition B2) by means of the Equicontinuity Lemma. The intervals have only polynomial discrimination; the inclusion of the single extra function L has a barely perceptible effect on the covering num- numbers. Deduce that Dn = Н(Е„) + oJl) -» H(EP). ?
160 VII. Central Limit Theorems VII.6. Restricted Chaining In this section the method of the Chaining Lemma is modified to develop another approach to empirical central limit theorems. The arguments for three representative examples are sketched. You might want to skip over the section at the first reading. The chaining arguments in Section 2 assumed that the increments of the stochastic process had exponentially decreasing tail probabilities, B4) P{|Z(s) - Z(t)| > n} < 2 exp(--^2/?>2<52) if Ф, 0 ^ <5- The inequality held for every n > 0 and 3 > 0. We shall carry the argument further to cover processes, such as the empirical process, for which the inequality holds only in a restricted region 0> of (n, 3) pairs. Suppose / is a bounded function, \f\ < C. Let S2 be an upper bound for the variance a2(f) = Pf2 — (PfJ- Bennett's Inequality (Appendix B) gives B5) JP{\EJ\>t,} = > Л/2 2 expl-i(tj2/d2)BBCr,/(nl'2d2))-} iV/<52) if S2h >2C/(nll2B-\A)) for any fixed X between 0 and 1, because B(-) is a continuous, decreasing function, with B@) = 1. The restricted range complicates the task of proving maximal inequalities for the stochastic process {Z(t): t e T}. We can chain as in Section 2 as long as the (nh 3t) pairs remain within SP, but eventually the chain will hit the boundary of 3?, when the links are getting down to lengths less than some tiny a, say. That leaves the problem of how to bound increments of Z across little links from points in T to their nearest neighbors in an a-net for T. Remember the abbreviations NC), for the covering number NC, d, T), and JC), for the covering integral Cd JC, d,T)= [2 log(WMJ/")]1/ du. = Jo The chaining argument will work for maximal deviations down to about J(a). That explains the constraint J(a) < y/\2D in the next theorem. The other constraints on a and у are cosmetic. 26 Theorem. Let {Z(t):teT} be a stochastic process that satisfies the exponential inequality B4)for every ц > 0 and д > 0 with ё > ar]il2,for some constant a. Suppose T has a finite covering integral J(-). Let T(a) be an a-net (containing N(a) points) for T; let tx be the closest point in T(a) to t; and let
VII.6. Restricted Chaining 161 [<5] denote the set of pairs (s, t) with d{s, t) < 3. Given s > 0 and у > 0 there exists a 3 > 0, depending on s, y, and J(-), for which pjsup |Z(s) - Z(f)| > 5yl < 2г + pjsup \Z(t) - Z{Q\ > (. и J l т provided oc <\sandy < 144 and J(oc) < тт{у/12Д 3/D}. Proof. The argument is similar to the one used for the Chaining Lemma. Write H(u) for [2 log(W(uJ/u)]1/2, as before. Choose the largest 3 for which 3 < ^s and JC) < y/l2D. The assumptions about a ensure 3 > a. Find the integer к for which 3 < 3*a < 3<5 then define 3t = 3*-;a and nt = ОЗ{Н(ё1+1) for г = 0,..., /с. Notice that ^! < «5 < c>0 and ^ = a. Also i = 0 ¦^ E \{3i+2<u<3i+l}H{u)du i = 0 J < у because J{31) < J{3) < yj\2D. Choose <5rnets Tt containing NCi) points, making sure that Tk = T(<x). Link each t to a t0 in To through a chain of points, with tt being the closest point of 7] to ti+ v By this construction, d(ti+ u tj) < 3t. The smallest value of the ratios {3?/п(}, for i = 0,..., к — 1, occurs at i' = к — 1; all the ratios are greater than 3a/DH(a) > 3a2/DJ(a) > a2. The 07;, 3t) pairs all belong to the region in which the exponential inequality B4) holds. Apply the inequality for increments across links of the chains. pjmax |Z(ta) - Z(to)\ > y\ < ? pjmax \Z(ti+1) - Z(t,)\ > m) I T(a) J i = 0 lTi+1 k-1 i = 0 00 i+1) exp[-\og(N(8t+ tf/3^ J] i = 0 Notice how one of the N(<51+ j) factors was wasted; both factors will be needed later. The last series sums to less than s because «5, < ^г and the {3t} decrease geometrically.
162 VII. Central Limit Theorems ¦T(a) Join each (s, i) pair in [<5] by two chains leading up to To plus a link between s0 and t0. sup |Z(s) - Z(t)\ < 2 sup |Z@ - Z(OI + 2 max|Z(tJ - Z(to)| [й] Г T(a) + SUp |Z(SO) - Z(to)\. Partition the 5y correspondingly. fsup|Z(s)-Z(t)|>: и The distance between the s0 and t0 of each pair appearing in the last term is less than d(s0, sj + d(sa, s) + d(s, t) + d(t, tj + d(tx, t0) i=0 i = 0 < 3<5O + 2a + д < 12^. There are at most N(d0J such pairs. The exponential inequality holds for each pair, because 12<5/(ay1/2) > I2y~112 > 1. FJsup|Z(so)-Z(to)I >?} I [S] < N(S0J2 exp(-$y2/144Dzd2) because d(s0, t0) < 123 < 25 exp[log(iV(<5J/<5) - i/A44^2] because N(d0) < NC) < 2д ехр[^-2(Я(<5J<52 - 72/144?>2)] < 25 because HC)d < JC) < y/l2D <s. ?
VII.6. Restricted Chaining 163 Theorem 21 states a sufficient condition for empirical processes indexed by a pointwise bounded, totally bounded, permissible subclass #" of SC2(P) to converge in distribution to a P-bridge: given ц > 0 there exists а б > 0 for which limsup P<sup \En(f - g) \ > rj> < s. For a permissible class of bounded functions, say 0 < / < 1, any condition implying finiteness of Nt(-, P, #") or N2(-, P, #") will take care of the total boundedness. Finiteness of a covering integral will allow us to apply Theorem 26, leaving only a supremum over the class Ж = {/ — fa :/?#"} of little links. It will then suffice to prove sup^ \Enh\ = op(l) to get the empirical central limit theorem. Notice that a, and hence Ж, will depend on n. The next three examples sketch typical methods for handling Ж. 27 Example. Equip #" with the semimetric d(J, g) = (P\f - g\I12- (This is the SC2(P) seminorm applied to the function |/ — g\112.) The square root ensures that the variance a2(f — g) is less than d(f, gJ. If we take X = \, the exponential bound B5) becomes, for d(f, g) < 3, Щ\ЕЛ -g)\>r,}<2 ехр(-^2/<52) if <52A? > гДп1'2!*-'&)). That is, D = 2 and a = B/B~ 1(i))ll2n~1'4 for Theorem 26. The covering numbers for d(-, •) are closely related to the S?l(P) covering numbers: in terms of the covering integral, rs B8) J(8, d,&)= [2 log(N1(M2, P, JF)»]1/2 du for 0 < 3 < 1. If J is finite, Theorem 26 can chain down to leave a class Ж of little links with |/г | < 1 and P | h | < a2. If we add to this the condition B9) log N^cn-1/2, Рп,Ж) = ор(п112) for each о О, the empirical central limit theorem will hold. The methods of Section II.6 work for the class Ж112 = {\Щ112:кеЖ}. Notice that N2C, Pn, Ж1/2) < N.id2, Р„, Ж) because Рп(\^\1/2 - \K\mf < PB|*i - h2\. From Lemma 11.33, C0) p|sup(Pn|/i|I/2 > 8al < 4P[N1(a2, Pn, Ж) exp(-na2) л 1] = 4P[exp(log Nt(a2, Р„, Ж) - ш2) л 1] -»0 by B9).
164 VII. Central Limit Theorems Symmetrize. For n large enough, pisup|?n/i| > 4y\ < 4p|sup|?°/i| > yi. I x J I ж J Condition on \. Cover Ж by M = N^yn'112, Pn, Ж) balls, for the &\Pn) seminorm, with centers gu ..., gM in Ж Then as in Section И.6, pjsup|?°/i| > ?|4 <? МтахР{|ВД > \y\\). On the set of \ where sup# Pn\h\ < 64a2, Hoeffding's Inequality bounds the right-hand side by 2 exp[log M - ЩуJ/F4a2)] which is of order op(l) because B9) says log M = op(n1/2). The central limit theorem follows. ? 31 Example. The direct approximation method of Section II.2 gave uniform strong laws of large numbers. With a suitable bound on the number of functions needed for the approximations, we get central limit theorems. Define a direct covering number А(д, Р, Ж) as the smallest M for which there exist functions дъ.--,дм sucn that, for every h in Ж, \h\ < g{ and Pgi < b + P\h\ for some i. We may assume 0 < gt < 1. If C2) log A(cn~1/2, P, Ж) = o(n1/2) for each c> 0, and if the covering integral B8) from the previous example is finite, then the empirical central limit theorem holds. Given у > 0, choose X in the exponential inequality B5) so that 2/B~1(X) = y. The dependence of X on у does not vitiate the chaining argu- argument in Theorem 26; it does ensure that functions in Ж satisfy P\h\ <a2 = yn~112. Find gi,..-,gM according to the definition of A(yn~112, P, Ж). Because Pgi < 2yn~112 for each i, the contributions of the means to Е„ are small. JP<sup\Enh\ > 4yi < pisupn1/2Pn|/z| > 3yi because n1/2P\h\ < у { ж J l ж J < Р<тахи1/2Р„д; > Ъу\ because \h\ < gr,for some i I ' J < M maxP{?ngr; > y) because n1/2Pgt < 2y i < M max 2 exp[ -^/Рд^ЩуЦп^РдШ from B5) < 2 exp[log M - W = o(l) by C2). ?
Notes 165 33 Example. In the previous two examples, the method of chaining left links of small ^C1(P) seminorm at the end of the chain; if1 approximation methods took care of Ж If we chain instead with i?2(P) covering numbers, we need if2 approximation methods for Ж. Set d(-, •) equal to the i?2(P) semimetric. Because a2(f - g) < d(f, gf, the chaining down to Ж requires J2(l, P, &) finite. At the end Ph2 <a2 = B/B(i))n/2. Invoke Lemma 11.33. Ipisup(Pn/i2I/2 < 8a I -> 1 as n -> oo i ж J if the random covering numbers satisfy log N2(cn~1/4, Р„, Ж) = op(n112) for each с > 0. This would follow from C4) J2(cn-1/4, Pn, Ж) = op(l) for each c> 0, because op{n^) = (сп-1^ГЧ2(сп-1'\ Рп, Ж) > [2 log(N2(c?T1/4, Pn, ЖJп1/4/с)У12. Symmetrize. For all n large enough, p|sup|?n/i| > Now we are back to the sort of problem we were solving in Section 4. Condi- Condition on \. On the set of those \ for which sup^(Pn/i2I/2 < 8a, chain using the Hoeffding Inequality to bound the tail probabilities. Apply the Chaining Lemma for P(-1?), the i?2(PJ seminorm, and s = 8a. \E°h\ > 26J2(8a, Р„, Ж)\Л < 16a if sup(Pn/i2I/2 < 8a. ) Condition C4) and finiteness of J2(l, P, #") are sufficient for the empirical central limit theorem to hold. ? Notes Theorem 5 draws on ideas from Chernoff A954), but substitutes stochastic equicontinuity where he placed domination conditions on third-order partial derivatives. The theorem also holds if t0 is just a local minimum for F(-), or if т„ is a minimum for Fn(-) over a large enough neighborhood of t0. Huber A967, Lemma 3) made explicit the role of stochastic equicontinuity in a proof of the central limit theorem for an M-estimator.
166 VII. Central Limit Theorems The chaining argument abstracts the idea behind construction of processes on a dyadic rational skeleton. It appears to have entered weak convergence theory through the work of Kolmogorov and the Soviet School; it is closely related to the arguments for construction of measures in function spaces (Gihman and Skorohod 1974, Sections Ш.4, III.5). The Chaining Lemma is based on an arrangement by Le Cam A983) of an argument of Dudley A967a, 1973,1978). Le Cam's approach avoids the complications introduced into Dudley's proof by the nuisance possibility that covering numbers N(S) might not increase rapidly enough as 3 decreases to zero. Alexander A984a, 1984b) has refined Dudley's form of the chaining argument to prove the most precise maximal inequalities for general empirical processes to be found in the literature. Theorem 13 is based on Theorem 2.1 of Dudley A973), but with his modulus function increased slightly to take advantage of Le Cam's A983) cleaner bound for the error term. The extra (<5 log(i/<5)I/2 does not change the order of magnitude of the modulus for most processes. The argument in Section 4 is based on Pollard A982c), except for the substitution of convergence in probability (condition A6)) for uniform convergence. Kolchinsky A982) developed a similar technique to prove a similar central limit theorem for bounded classes of functions. He imposed finiteness of J2(-,P, #") plus a growth condition on N\(-, Pn, #") to get results closer to those of my Example 27. Gine and Zinn A984) have found a necessary and sufficient random entropy condition for the empirical central limit theorem. Brown A983) sketched the large-sample theory for the spatial median. He referred to Brown and Kildea A979) and the appendix he wrote for Maritz A981) for rigorous proofs, which depend on a form of stochastic equi- continuity. The central limit theorem for /c-means was proved by Pollard A982b, 1982d) for a fixed number of clusters in euclidean space. The one-dimen- one-dimensional result was proved by Hartigan A978), using a different method. Dudley A978, 1981a, 1981b, 1984) has developed the application of metric entropy (covering numbers) to empirical process theory. These papers extended his earlier work on entropy and sample path properties of gaussian processes A967b, 1973), and on the multidimensional empirical distribution function A966a). Dudley A966a, 1978) introduced most of the ideas needed to prove central limit theorems for empirical processes indexed by sets. He extended these ideas to classes of functions in A981a, 1981b). His lecture notes A984) provide the best available overview of empirical process theory, as of this writing. The proof of my Theorem 21 was inspired by Chapter 4 of those lecture notes, which reworked ideas from Dudley and Philipp A983). If Pf = 0 for each / in J5", a standardization that can be imposed without affecting Е„ or EP, the conditions of Theorem 21 are also necessary for the empirical central limit theorem.
Problems 167 The first central limit theorems for empirical processes indexed by classes of sets were proved by the direct approximation method. Bolthausen A978) worked with the class of compact, convex subsets of the unit square in Ж2. He applied an entropy bound due to Dudley A974). Revesz A976) indexed the processes by classes of sets with smooth boundaries. Earlier work of Sun was, unfortunately, not published until quite recently (Руке and Sun 1982). Dudley's A978) Theorem 5.1 imposed a condition on the "metric entropy with inclusion" that corresponds to finiteness of a covering integral. Strassen and Dudley A969) proved a central limit theorem for empirical processes indexed by classes of smooth functions. They deduced the result from their central limit theorem for sums of independent random elements of spaces of continuous functions. All these theorems depend on existence of good bounds for the rate of growth of entropy functions (covering num- numbers). For more about this see Dudley A984, Sections 6 and 7) and Gaenssler A984). Theorem 26 resets an argument of Le Cam A983). Such an approximation theorem has been implicit in the work of Dudley. Gine and Zinn A984) have pointed out the benefits of stripping off the i?2(P) chaining argument, to expose more clearly the problem of how to handle the little links left at the end of the chain. They have also stressed the strong parallels between empirical processes and gaussian processes. The examples in Section 6 follow the lead of Gine and Zinn: Example 27 is based on their adaptation of Le Cam's A983) square-root trick; Example 31 is based on their im- improvement of Dudley's A978) "metric entropy with inclusion" method; Example 33 is based on their Theorem 5.5. Problems [1] Prove that the stochastic equicontinuity concept of Definition 2 follows from: Zn(TJ — Zn{t0) -> 0 in probability for every sequence {т„} that converges in probability to t0. [Suppose the defining property fails for some r\ > 0 and s > 0. For a sequence of neighborhoods {Uk} that shrink to t0 find positive integers n(l) < nB) < • • • with P<sup \Zn(k)(t) - Zm(t0)\ > ( uk for every k. Choose random elements {т„} of T such that, for n(k) < n < n(k + 1), | Zn(ffl, т„(«)) - Zn(ffl, @) | > A sup | Zn(w, t) - Zn{w, ?0) I uk and т„(ш) belongs to Uk. Appendix С covers measurability of т„.] [2] Let {/(¦, t): t e T} be a collection of IR'-valued functions indexed by a subset of IR'. Suppose P\f(-,t)\2 < oc for each t. Set F(t) = Pf(-, t) and Fn(t) = PJ{-, t). Let {т„} be a sequence converging in probability to a value t0 at which F(t0) = 0. If (a) F(-) has a non-singular derivative matrix D at t0;
168 VII. Central Limit Theorems (c) {?„/(¦, t)} is stochastically equicontinuous at f0 ; then п1/2(т„ - ro)~>iV(O, D'lP\_f(-,to)f(,-,to)''\D~l). [Compare with Huber A967).] [3] For a class J* to index a P-motion it must be totally bounded under the ??2{P) seminorm pP. [First show $F is bounded: otherwise \BP(fn)\ -> oo in probability for some {/„}, violating boundedness of P-motion sample paths. Total boundedness will then follow from: for each e > 0, every / lies within e of some linear combination of a fixed, finite subclass of 3?. If for some s no such finite subclass exists, find {/„} such that n fn+l = вп+l + Y.a»jfj' where pP{gn+i) > e and gn+1 is orthogonal tofu ...,/„• Fix an M. Show that there exists a 5 > 0, depending on M and e, for which ЩВР(/„+1) > M\BP(fy),..., 5Р(/„)} > S. Deduce that JP{supnBP(fn) > M} = 1 for every M, which contradicts boundedness of the sample paths. Notice that continuity of the sample paths does not enter the argument. Dudley A967b).] [4] If sup^ |P/| is finite then J* must be totally bounded under the i?2{F) seminorm pP if it supports a P-bridge. [Choose Z with a iV,@, 1) distribution independent of EP. The process B(f) = EP(f) + ZPf is a P-motion with bounded sample paths. Invoke Problem 3. The condition on the means is needed—consider the У con- consisting of all constant functions. The P-bridge is unaffected by addition of arbitrary constants to functions in !F; it depends only on the projection of J^ onto the sub- space of .Sf 2(P) orthogonal to the constants.] [5] Let Жу be a class of functions with an envelope H in ?C2(P). Set Ж2 = {h2: h e Жх}. Show that '2, Q, Ж2) < N2Be, Q, [By the Cauchy-Schwarz inequality, Q\hj - h\\ < Q{2H\hx -h2\)< 2(QH2y2(Q\hy - hz\2I'2 if both | fell < Hand \h2\ < Я.] [6] Let J* be a permissible class of functions with envelope F. Suppose J2(S, Р„, JF) = op(n) for each 5 > 0. [Condition A6) of the Equicontinuity Lemma implies that J2{5, Р„, !F) = Op(l) for each 5 > 0.] Show that Ж2 = {(/ - gJ: f - g e J^} satisfies the sufficient condition (Theorem 11.24) for the uniform strong law of large numbers: log Ny(e, Р„, Ж2) = op(n), for each s > 0. [Set Я = IF and Жх = {/ - g:f, ge3F}. Show that, for 1 > e > 0, iV2Be, Р„, Ж,) < N2(s, Р„, J^J < e ехрЙУ2(е, Р„, J^J/e2).
Problems 169 Deduce from this inequality, Problem 5, and the strong law of large numbers for {Р„Н2} that, if 1 > e > 0, P{log ЛГ1DеBРЯ2I/2, Р„, Ж2) > щ} < IP{log ЛГ!Dе(Р„Н2I/2, Р„, Ж2) > щ] + JP{PnH2 > 2PH2} < IP{log iV2Be, Р„, Жх) > щ) + W{PnH2 > 2PH2} < Р{^2(е, Р„, 3Ffj&2 > Щ) + 1Р{Р„Н2 > 2PH2} A weaker result was proved by Pollard A982c).] [7] If J^ is totally bounded under the <?2{P) seminorm, then the space С(Ж, Р) of bounded, uniformly continuous, real functions on & is separable. [Suppose |x(/) — x(g)\ < e whenever pP(f — g) < 25. Choose {/l5... ,/„} as a maximal set with pP(fi — fj) > jd. Use the weighting functions A,(-) from the proof of Theorem 21 to interpolate between rational approximations to the {x(/;)}.] [8] Suppose Э' is totally bounded under the ?C2(P) seminorm. If two probability measures X and ц on the u-field &F have the same fidis, and if both concentrate on C(#", P), then they must agree everywhere on 3I'. [Show that X and ц agree for all finite intersections of fidi sets and closed balls with centers in C(#", P). For example, consider a closed ball B(x, r) with x in C{^, P). Let {/b/2, ...} be a countable, dense subset of C(&, P). Define В„ = {z e C(J% P): |z(/;) - x(fd\ < r for 1 < i < и}. Show that цВ(х, г) < цВ„ = ХВ„ -> XB{x, r) as n -> оо. Extend the result to finite collections of closed balls and fidi sets, then apply a generating-class argument.] [9] The property that the graphs have only polynomial discrimination is not preserved by the operation of summing two classes of functions. That is, both #" and <§ can have the property without the class У = {/ + g:fe3',деЩ having it. Let 3> = {Du D2, ¦ ¦ •} be the set of indicator functions of all finite sets of rational numbers in [0,1]. Let & = {2n + Dn: n = 1, 2,...} and ^ = {-2n: n = 1, 2,...}. The graphs from neither class can shatter two-point sets, but Sf can shatter arbi- arbitrarily large finite sets of rationals in [0, 1]. [The roundabout reasoning used to bound the covering numbers in Example 18 may not be completely unnecessary.]
CHAPTER VIII Martingales ... in which martingale central limit theorems in discrete and continuous time are proved. An extended non-trivial application to Kaplan-Meier estimators — estimation of distribution functions from censored data—is sketched. VIII. 1. A Central Limit Theorem for Martingale-Difference Arrays Martingale theory must surely be the most successful of all the attempts to extend the classical theory for sums of independent random variables to cover dependent variables. Many of the classical limit theorems have martingale analogues that rival them for elegance and far exceed them in diversity of application. We shall explore two of these martingale theorems in this chapter. One main change in technique will become apparent. Where proofs for independent summands use truncation to protect against occasional ab- abnormally large increments—the sort of thing implicit in something like the Lindeberg condition—martingale proofs can resort to stopping time arguments. Optional stopping preserves the conditional expectation connec- connection within a martingale sequence as long as the decision to stop is based only on past behavior of the sequence. The prohibition against peering into the future to anticipate abnormally large increments imposes a characteristic feature on martingale theorems. One needs conditions that protect against the behavior of the worst single increment, because the decision to stop can be taken only after that increment has had its effect. In central limit theorems, independence allows one to factorize expecta- expectations of products and separate out the contribution of a particular increment from contributions of past and future increments. The martingale property allows a weaker factorization, for expectations conditional on the past alone. Conditional variances take over the role played by variances of independent summands. But apart from that, the arguments for martingales share the
VIII. 1. A Central Limit Theorem for Martingale-Difference Arrays 171 same inspiration as the proof of the Lindeberg Central Limit Theorem in Section III.4. We shall prove asymptotic normality for row sums of martingale-differ- martingale-difference arrays. That is, for each n we have random variables ?nl, ...,?,„„ (we avoid some messy notation, and lose no generality, by assuming exactly n variables in the nth row) and сг-fields <fn0 ? • • • ? $„„ for which (a) ^ (b) ?nJ is <fnj-measurable. Define conditional variances vnJ = JP(Zlj\*n.j-i) for j=l,...,n. Notice that vnj is an §n^-i-measurable random variable. Convergence of sums of conditional variances will be the only connection tying together the variables in different rows of the array. 1 Theorem. Let {?nj} be a martingale-difference array. If as n -* со, (i) Yjj vnj ~*(j2 i" probability, with a2 a positive constant; (ii) for every г > 0, the sum ?,- 1Р(<!;?Д|?nj| > e}\Saj-1) converges in prob- probability to zero (a Lindeberg condition); then ?nl + ..• + ?„„ ^ N@, a2). Proof. Without loss of generality set a2 = 1. Let us check pointwise con- convergence of characteristic functions: IP cxp[it(Z»i + • • • + Ul - exp(-it2). We really will need some of the special multiplicative properties of the complex exponential function, and not just its smoothness by way of three bounded, continuous derivatives as in Section III.4. The randomness of the conditional variances fouls up the argument based on successive substitution of matching normal increments, which worked for independent summands. At the risk of notational abuse (more will follow) abbreviate the condi- conditional expectation TP(-\$nj) to IP/-)- Write R(x) for the remainder term e'x — 1 — ix and Snj for the partial sum ?nl н h ?nj. Define rnj as the condi- conditional expectation JPj_ ^(t^j). When the {?n]} are small, in a sense to be made precise by condition (ii), we shall get rnJ x —%vnj. The proof will work by successive conditioning. We pin down the effect of individual increments by evaluating IP expOtSJ = POPoflP^O • • (IPn_! cxp(itSJ ¦ ¦ •))))), a layer at a time. Start with the innermost conditional expectation. Factorize out the part depending on ^nj-1 then expand the remaining exp(it?m). ^^l + rj.
172 VIII. Martingales The 1 + rm factor foils the attempt to work the same idea with 1Р„_2; it won't cooperate by slipping outside the conditional expectation, leaving exp(itSnn_ J to enjoy the same treatment from 1Р„-2 that exp(j't5nn) received from №„_!• We could clear away the obstacle by dividing out the offending factor. IP^EV^l + rj expats™)] = IP^expatS,,,,,-!) = exp(irSn>n_2)(l + »•„,,,_!). We could get rid of the A + г„ п^г) in a similar fashion. If you pursue this idea through each layer of conditioning, you will see the sense in starting from The remote possibility that one rnj might get close to — 1 could cause minor difficulty when we come to bound the product term. To avoid this, replace A + rnj)~l by A — rnj). When rnJ к 0 the change has little effect. Define k Tnk = П A - rHj) and Znk = Tnk exv(itSnk). If we could show IP | Т„„ - exp(^t2) | -»¦ 0 and JPZnn -»¦ 1, then it would follow that < exP(-it2)[IP|expatSnn + it2) - Zm\ + \TPZm - 1|] ->0. We would get the desired results for {Tnn} and {Zm} from: (a) Zj rnj -*¦ -if2 m probability; (b) I>nj-I < t2; (c) max; | rnj | -* 0 in probability. The second of these requirements need not be satisfied but, without loss of generality, we may act as if it were. We replace ?nJ by ?nJ{j < an), where а„ = max{fc: Yj=ivnj ^ 2I- Inter- Interpret <т„ as zero if vnl > 2. Because vnJ is <fnj_:-measurable, the event {j < an} is Sn> j-!-measurable; an is a stopping time. The new variables are martingale differences. The new row sums have the same asymptotic behavior as the original row sums: A i tnj ф i tnjij < <uj < п»{и > <u = n» j i vnj > 2} -, 0. U'=i j=i J (.j = i J The quantities wnj = Wj_ iR(t?nj{j < an}) satisfy analogues of the require- requirements (a), (b), and (c). The argument depends on the inequalities (Problem 1) for real x, \R(x)\<±\x\2;
VIII.1. A Central Limit Theorem for Martingale-Difference Arrays 173 For the analogue of (c): if 5 > 0, no | wnj\ exceeds maxHy^t2^; < an) < |A2 + E^-i^l^-l > Set д small, then invoke the Lindeberg condition. For the analogue of (b): J=l 0' ^ cr»}j;nj because {/ < а„} is «f^^-measurable < t2 by definition of сг„. For the analogue of (a), fix S > 0: -JRit^U < U) + tf&jU < <U + tf&iij > The first sum can be made small with high probability, by an appropriate choice of S; the last two terms converge in probability to zero. We could carry an along throughout the rest of the argument, but that would clutter up the notation. Instead, let us assume that (a), (b), and (c) hold. While we are at it, let us drop the n subscript for variables in the nth row of the array; all calculations take place within the nth row. Our task is to prove that and WZn -. 1, B) IP|rn-exp(it2)HO where Tk = П)=1A - rj) and Zk = Tk exV(itSk). The path leading from (a), (b), and (c) to the first assertion in B) is well- worn (Chung 1968, Section 7.1). For complex 9, |log(l-0) + 0|<|0|2 if \9\<i Apply the inequality to each r}. When max^ |r,-| < j, <rmax|r,-| by(b). j
174 VIII. Martingales It follows from (a), (c), and continuity of the exponential function that {Tn} converges in probability to exp(^t2). Each | Т„ | is bounded, because of (b): C) I Tn\ < П A + \rj\) < expfe \rA < exp(t2). Boundedness plus convergence in probability imply convergence in L1. For the proof of the second assertion in B), bound the errors that accrue during the calculation of conditional expectations layer-by-layer. JP j^Zj = TPj^lT Thus \WZj - IPZ^i | < P|r?Z7-_i | < exp(/2)IP|r2|, because inequality C) implies \Z^X \ = 17}_t | < ехр(г2). Sum over;'. n \WZn - 1| < exp(J2)IP X lol2 -* ° as n -" °°' j=i because ^-1r} |2 < min{t4, t2 maxj \ rj\}. ? 4 Example. A sequence of real random variables Xo, Xu... is called an autoregression of order one if Х„ = 0ОХ„_1 + wn for some fixed 0O. The innovations {un} are assumed independent and identically distributed with zero mean and finite variance a2. The initial value Xo is independent of the {un}. The least-squares estimator 9„ minimizes Solving for вп and standardizing, we get E) nv\en - в0) = [n-^2 iuj With the help of Theorem 1 we can prove nll2(9n - в0) - N@, 1 - 61) provided |0O| < 1 and IPX2, < oo. This will follow from convergence results for the denominator and numerator in E): n n'1 ZxJ-i-+ ff2/(l - eo) in probability, Start with the denominator.
VIII. 1. A Central Limit Theorem for Martingale-Difference Arrays 175 Square both sides of the defining equation of the autoregression, then sum over/. 3=1 3=1 3=1 3=1 Rearrange then divide through by n. F) A - el)n~' t Щ-1 = и"l t u) + 290n-1 t "jXj-i 3=1 3=1 3=1 + п-1(^о-^„2). On the right-hand side, the first term converges almost surely to <r2, by the strong law of large numbers. The third term converges in L1 to zero, because repeated application of the equality IPX2 = Wu2n + 280lPunVXn_1 + 920JPX2n.L = a2 + d2WX2n_u yields ГОу2 „2/1 , Д2 , i д2п-2\ , n2nmv2 . я-2/zi O2\ 1глп = a {I + tH + ¦ ¦ • + b0 ) + tH \гло -* а Д1 — u0). The n brings the limit down to zero. The middle term on the right-hand side of F) converges in L2 to zero: = n 2 Yj №u]X2_ l independence kills the cross-product terms 3=1 = О(и-1) because {IPX?} is convergent. So much for the denominator. Write Sn for the a-field generated by {Xo, uu ..., uj. Abbreviate IP(-|<^) to IP/-)- The variables {ujXj^^ are martingale differences for {Sj}. Apply Theorem 1 to the sum in the numerator of E). For condition (i): vnJ = Wj^in-'ujX2.,) = и" VX?_1; t vnj = о2п~г t хз-1 ^ ^4/(l - в1) in probability. 3=1 3=1 The Lindeberg condition demands a more delicate argument if higher- moment constraints on the innovations are to be avoided. 3=1 [u2X2.1l{u2>en^2} + {X2.i> 3=1 _ „-1 V Y2 TP«2f,.2 -. ™l/2l i „-1 V /r2v2 SY2 — n 2-i Ji-j-i'*ui\ui > si ' ) + n 2j ff •^j-ii-^j-i 3=1 3=1 en 1/21
176 VIII. Martingales The first sum converges to zero in probability, because u\ is integrable. The second sum converges in L1 to zero because the sequence {X2} is uniformly integrable (Problem 2). ? VIII.2. Continuous Time Martingales A stochastic process {Z(r): 0 < t < со} is said to be a martingale with respect to an increasing family of сг-fields {$t; 0 < t < oo} if Z(t) is adapted to the (т-fields (that is, Z(t) is ^-measurable) and IP(Z(s) | St) = Z(t) whenever s > t. After some fiddling around with sets of measure zero it can usually be arranged that such a process has cadlag sample paths (Dellacherie and Meyer 1982, Section VI. 1), in which case it may be studied as a random element of D[0, oo). Call Z an L2-martingale if it has cadlag sample paths and IPZ(fJ < oo for each t. The behavior of an L2-martingale is largely determined by the conditional variances of its increments. The conditional expectation of [Z(t + S) — Z{t)~]2 given St plays a role similar to that of the conditional variance vnj in Section 1. The most economical way to explain this uses some deeper results from the Strasbourg theory of stochastic processes. We could avoid the appeal to the deeper theory by building its special consequences into the martingale calculations for each particular application. That would always work for martingales that evolve by discrete jumps; the calculations would be similar to those in Section 1. The theory would be more self- contained, but it would disguise the unifying concept of the conditional variance process. According to the Doob-Meyer decomposition (Theorem VII. 12 of Dellacherie and Meyer A982), applied to the supermartingale — Z2), for each L2-martingale Z, the process Z2 has a unique representation as a sum V + M of a martingale M and an increasing, predictable, conditional variance process V with V@) = 0. Both M and V have cadlag sample paths. (Strictly speaking, for this decomposition we need the сг-fields {<%t} to satisfy the " usual conditions": &0 should contain all IP-negligible sets and each St should equal the intersection of the сг-fields Ss for s > t.) The adjective "predictable" has the technical meaning that V(co, t) is measurable with respect to the сг-field onfl® [0, oo) generated by the class of all adapted, left-continuous processes. So V behaves something like a process with left-continuous sample paths; its paths can be predicted a tiny instant into the future. We will need the predictability property only in Lemma 11. If the martingale Z changes only by jumps ?u ?2,. • • occurring at fixed times ti < t2 < ¦•-, and if St = Stk for tk < t < tk+1, then V is just a sum of conditional variances: G) V(t) =
VIII.2. Continuous Time Martingales 177 whenever tk < t < tk+1. You can check directly that Z2 — V is a martingale and that there exists a sequence of left-continuous, adapted processes con- converging pointwise on fi ® [0, oo) to V. The value of V at tk corresponds to what we would have written as vt + ¦ • • + vk in Section 1. You might take this as your guiding example for the rest of the section if you wish to avoid all appeals to the Strasbourg theory. The process V carries information about the conditional variances of the increments of Z. If s > t, (8) = IP(Z(SJK) - Z(tJ - 2Z(t)F(Z(s) - Z(t)\$t) = IP(J/(s)K) + P(M(s)|<fr) - V(t) - M(t) = JP(V(s) - V{i)\St). For s very close to t, the predictability of V makes V(s) — V(t) almost <f,-measurable; the last conditional expectation almost equals V(s) — V(t), in a sense that will receive a more rigorous meaning later. By means of a simple Tchebychev inequality we get from V a bound on the size of the increments of Z in precisely the form required by the stopping time argument of Lemma V.7. The maximal inequality provided by that lemma lies at the heart of any proof for convergence in distribution in spaces of cadlag functions. 9 Lemma. Let {Z(t): 0 < t < b) be an L2-martingale with conditional variance process V. If , for every t, TP(V(b) - V{t)\St) < b2j\2 almost surely, then Ipisup|Z@ - Z@)| > 4 < 3IP{|Z(b) - Z@)| > id}. Proof. With no loss of generality we may assume Z@) = 0. Write IP(( •) for expectation conditional on $t. Lemma V.7 invites us to check that i on We shall do this by bounding the conditional probability from below by A0) |-4|Z@r2IPt|Z(b)-Z@|2, which is greater than f - 4S~2(S2/l2) on the set {\Z(t)\ > S}. Start from the inequality \Z(b)\ < \Z(b) - Z(t)\ + \Z(t)\ <||Z@|{|Z(b)-z(t)|<i|Z@|} + 3\z(b) - z{t)\{\z(b) - z(oi
178 VIII. Martingales Keeping in mind that the absolute value of a martingale is a submartingale, take conditional expectations given St. \Z(t)\ < TP,\Z(b)\ {|Z(b)-z(OI<i|Z@|} - Z(t)\{\Z(b) - Z(t)\ Z(b)-z(OI<i|Z(OI} + 6|Z(or1iP(|Z(b)-z(t)|2. On {\Z(t)\ > S} we may divide through by f |Z(t)| to get the bound A0). ? Martingales with cadlag sample paths define random elements of the space D[0, oo) under its projection сг-field. As we shall be concerned only with convergence in distribution to limit processes having continuous sample paths, we equip D[0, oo) with its metric for uniform convergence on compacta (Section V.5). The main theorem will give conditions for a sequence of L2-martingales to converge in distribution to a stretched-out brownian motion BH, the process constructed from brownian motion by applying a continuous, strictly increasing transformation H(-) to the time scale: BH{t) = B(H(t)). This new gaussian process is an L2-martingale with conditional variance process H(t), provided Я@) = О. One hypothesis of the theorem will be pointwise convergence of the conditional variance processes to H. It is analogous to the hypothesis in Theorem 1 that the sum of conditional variances converges to one. As in the proof of that theorem, we shall use the hypothesis to introduce stopping times for the martingales. For fullest generality we must draw upon the Strasbourg theory for the properties of predictable stopping times; for the special case of martingales with conditional variance processes as in G), the argument from Theorem 1 would suffice. 11 Lemma. Let {Xn} be a sequence of L2'-martingales whose conditional variance processes converge pointwise to a fixed, continuous, increasing function: Vn(t) -* H(t) in probability, for each fixed t. Then there exists a sequence of stopping times {an} and constants {е„}, with cn -» oo in probability and ?„ I 0, such that sup, | Vn(t л <тп) - H(t л <т„) | < е„ almost surely. Proof. First fix an e > 0, and define a stopping time The hypothesis implies that т„ -> oo in probability: if 0 = s0 < sx < ¦ ¦ ¦ < sk are chosen so that H(s;) — H(s(_ x) < \г for each i, then (by monotonicity of Vn and H) 1Р{т„ < sk} < TP{\Vn(Si) - H(st)\ > \г for some i} -> 0.
VIII.2. Continuous Time Martingales 179 From the definition of т„, A2) \Vn(t) - H(t)\ < s for t<xn. Only the time t = xn might cause trouble; Vn might have a jump there. That is where predictability comes to the rescue. There exists an increasing sequence of stopping times {xnj} with xnJ < xn and xnj I xn almost surely (Dellacherie and Meyer 1978, IV.69, IV.77), because predictability allows us to peer just a little distance into the future for Vn. Replace х„ by ти>Ля) for some j(n) such that xnJ(n) -* oo in probability. This lops off the troublesome point t = х„ in A2); the inequality holds for every t in the range [0, тв,Ди)]. If the argument works for fixed e it must also work for a sequence {е„} decreasing slowly enough. Write <т„(е) for the stopping time т„ J(n) just identified. There exist integers n(l) < n{2) < ¦¦¦ such that Ща^к'1)^ к} <к~г for n > n(k). Set е„ = fc and а„ = ajk.'1) when n(k) < n < n(k + 1). ? Stopping time tricks like the one used in this proof pop up all over the place in martingale limit theory. Sample path properties that hold with probability tending to one can often be made to hold with probability one by enforcing an appropriate stopping rule. Something must be added to the convergence of conditional variance processes. Otherwise we could specialize the result to processes with in- independent increments, obtaining as a by-product the Central Limit Theorem for sums of independent random variables without having to impose any- anything like a Lindeberg condition. The extra something takes the form of a constraint on the maximum jump in the sample path. Define the jump functional JT on D[0, oo) by: JT(x) = max{|x(s) - x(s-)|: 0 < s < Г}. It is both continuous and measurable (Problem 4). If Xn >-* BH then certainly JT(Xn) ^ Jt(Bh) = 0; convergence in probability of {JT(Xn)} to zero is a necessary condition. The theorem assumes just a little bit more to get a sufficient condition. 13 Theorem. Let {Xn} be a sequence of L2-martingales with conditional variance processes {Vn}. Let H be a continuous, increasing function on [0, oo) with Я@) = 0. Sufficient conditions for convergence in distribution of {Х„}, as random elements of D[0, oo), to the stretched-out brownian motion BH are: (i) XJO) ^ 0 in probability; (ii) Vn(t) -» H(t) in probability, for each fixed t; (iii) JPJk(XnJ -»¦ 0 for each fixed k, as n -»¦ oo.
180 VIII. Martingales Proof. By virtue of Theorem V.23, we have only to prove convergence in distribution for the truncations of the processes to each compact interval [0, Г]. The argument for the typical case Г = 1 will suffice. According to Theorem V.3 we shall need to establish (a) Fidi Convergence: the fidis of the {Xn} converge to the fidis of BH (b) The Grid Condition: to each e > 0 and S > 0 there corresponds a grid 0 = t0 < t1 < ¦ ¦ • < tm = 1 such that limsupIP^max sup \Xn(t) — Xn(tj)\ > d> < s Theorem 1 will take care of (a); Lemma 9, applied to the Xn processes stopped appropriately, will take care of (b). By Lemma 11 there exist stopping times {<т„} and constants {е„}, with an -* oo in probability and en [ 0, such that | Vn(t л <т„) — H(t л <т„) | < е„ almost surely. We may assume that Х„ has at most one jump of size greater than е„ up to time an. Formally, we would replace {е„} by a more slowly decreasing sequence such that }(Х„) > ?„} -»¦ 0 as n -> оо for some slowly diverging sequence {k(ri)}. Such sequences exist because of (iii). Then we would replace an by mf{t<an:\Xn(t)-Xn(t-)\>sn}. These modifications would not disturb the other properties of {<т„} and {е„}. The stopped martingale X*(t) = Xn(t л <т„) has conditional variance process V*(-) — Vn{- л ап). It enjoys (i), (ii), and (iii) in strengthened form: (i)' X*@) = Х„@) ^ 0 in probability; (ii)' \V*(t)~H(t Aan)\<sjora\lt; (iii)' X* has at most one jump of size > &„ and WJX{X*J -* 0. These will make it easy to prove that X* ~> BH. The required convergence of the truncation of Xn to [0, 1] will then follow because W{Xn{t) = X*{t) for 0 < t < 1} > W{an > 1} -> 1. Simplify the notation by dropping the star. Fidi Convergence Let us prove only that Х„A) -»• N@, ЯA)). Problem 6 extends the result to higher-dimensional fidis. Because of (i)', we can do this by breaking Xn{\) — Xn@) into a sum of martingale differences that satisfy the conditions of Theorem 1. Focus for the moment on a fixed Х„ by setting Z(t) = Xn{t) — Х„@) and writing V instead of Vn for its conditional variance process. Break Z(l) into
VIII.2. Continuous Time Martingales 181 a sum of increments Z{z]) — Z(t,-_ Д with 0 = т0 < xr < ¦ ¦ • a sequence of stopping times defined inductively by Tj+1 = inf{r > ту. \Z(t) - Z(tj)\ > &„} л 1 л (tj + Sn). Choose {Sn} so that \H(t + Sn) - H(t)\ < е„ for every t in [0, 1]. Denote expectations given Sz by IP/-); write AjJ for the increment/^) — /(Tj.j) of any function / between successive stopping times. Then Z(l)= ? A,Z. j=i Along any particular sample path of Z all except finitely many of these increments equal zero; there is no problem with convergence of the sum. Check the conditions of Theorem 1 for the martingale differences {A^Z}. Strictly speaking the theorem applies only to triangular arrays with a finite number of variables in each row, but there may be infinitely many A^Z increments. We would need to apply the theorem to a finite sum of A^Z, for 1 < j < j(n), with j(n) chosen so that both ?a,(Z) and fjPj.^AjZ) converge to zero in probability. By property (8) of the conditional variance process, Informally speaking, predictability of V almost makes A;- V measurable with respect to SZj_l; the last sum almost equals ?V AjV, which we know will converge in probability to ЯA) as n -* oo. Formally, Yj (A,-F — Щ-iAjV) is a sum of martingale differences with zero mean and variance less than Y 1Р(А7- VJ because the cross-product terms vanish j < X IP[Ben + AjH( ¦ л an))Aj П by (ii)' j < 3enIP \Y Aj v) by the choice of {д„} < Зе„(е„ -> 0 as n -* oo. That takes care of condition (i) of Theorem 1. For the Lindeberg condition it suffices to check the stronger L1 conver- convergence. e} = П>Х(А^J{|А^| > е} j if 6 > 2en
182 VIII. Martingales The inequality follows, by the definition of xh from: \Z(tj) - Z(tj^)\ < \Z(tj) - Z(T}-)\ + |Z(tj-) - Z(Vl)| < | jump at zj | + en At most one increment AjZ can exceed 2е„ in absolute value, and that happens only if Z has its one jump greater than en at Zj. An appeal to the second part of (Hi)' completes the proof of the fidi convergence. The Grid Condition Choose 0 = t0 < ¦ • ¦ < tm = I so that H(tJ+i) - H(tj) < S2/24 for each j. For n large enough to make 2en < <52/24, the strengthened condition (ii)' implies A4) ВД(Г;+1) - VM\*,) ^ ?>2I\2 almost surely if tj< t < tJ+ !• Invoke Lemma 9. A5) limsupipi max sup \Xn(t) - Xn(t})\ > 5 « I J [tj,tJ+i) J J=0 n Y, 3IP{|BH(tJ+1) - BH(tj)\ > id} by fidi convergence. j=o m-l X ЗР{|ЛГ@, H(tj+1) - H(tj))\ > id} m-l j=o j which is less than e if the grid points are close enough together. ? VIII.3. Estimation from Censored Data The empirical distribution function based on an independent sample ?u ..., ?,„ from P is a natural estimator for the distribution function of P. How should one modify it when the observations are subject to censorship? One possibility, the Kaplan-Meier estimator, can be analyzed by the martin- martingale methods from the previous section. What follows is a heuristic account. The notes to the chapter will point you towards more rigorous treatments, which draw on results from the theory of stochastic integration.
VIII.3. Estimation from Censored Data 183 Consider the simplest model for censorship. Independent variables с!,..., с„, drawn from a censoring distribution C, cut short the natural lifetime ?;; we observe the value yt = ?,- л ct. Each yt has distribution Q, where Q(s, oo) = P(s, oo)C(s, oo). In addition, we can tell whether the уi represents natural death, ?t < ct, or a case of death by censorship, ^ > c;. To construct the Kaplan-Meier empirical measure К„, start from the usual empirical measure Qn for the observations Ух,..-,у„. Working from the left-hand end, distribute successively all the mass from each censored point equally amongst all the {y;} lying to its right. In the situation pictured, y1 keeps its mass ^; then y2 surrenders mass } x ^to each of its seven successors y3,..., y9; then y3 surrenders ^ x (^ + j x |) to each of y4,..., y9; and so on. At the last point, yg, there are no more {yj to inherit its mass, so dump it all down on a fictitious super-survivor out at +00. If y9 had not been censored, it would have kept all its inherited mass. In any case, Kn will distribute its total mass of one amongst the naturally deceased {yt}, with maybe a little bit on + 00. Notice that К„[0, t] < gn[0, f] for each t. Make the analysis as simple as possible by assuming that both P and С are continuous distributions living on [0, 00), with neither concentrated on a finite interval. Write St for the cr-field corresponding to everything we learn up to time t about which ?,. have died or been censored. Write IP/-) forP(-l^). Calculate (to first-order terms) the conditional distribution of the incre- increment AKn = Kn(t, t + h] given St, for tiny positive h. From St we learn the value Qn[0, i]. Define m to be nQn[0, t]. The remaining n — m observations in (t, 00) are generated by choosing each ?; from the conditional distribution P(-\? > t), then censoring it by a ct chosen from C(-\c > t)- Write AP for P(t, t + h] and AKn for Kn(t, t + h]. To first order, each of the n - m observations has conditional probability AP/P(t, 00) of registering a natural death during the interval (t, t + h]. A single such observation would receive a fraction (n — m)~ * of the К„ measure for (t, 00]. Thus, to first order, = (n- mT^KJt, оо](и - m)AP/P(t, 00). This suggests that JPtAKn/Kn(t, 00] = AP/P(t, 00) + lower order terms, which would lead us to believe that log Kn(t, 00] — log P(t, 00) is a contin- continuous-time martingale for each n. An attempt to add rigor to the first-order analysis would reveal a few illegal divisions by zero. There is a positive probability that either n — m or Kn(t, 00] could equal zero. A suitable stopping time can save us from
184 VIII. Martingales embarrassment. Let {<*„} be a sequence of positive numbers converging to zero. Make sure that nocn is an integer. Define р„ as the first t for which Qn(t, oo) equals an. Then certainly Kn(t a pn, oo] > Qn(t a pn, oo) > а„ > О for all t. The first-order analysis could be made rigorous enough to show that the process Xn(i) = log Kn(t л /)„, oo] - log P(t л /)„, со) is a continuous-time martingale for each n. On the set {р„ > t}, the increment in Vn, the conditional variance process of Xn, would be Wt(AXnJ = (n - m)~2(n - m)AP/P(t, oo) + smaller order terms On {pn < t} the increment would be zero. Recover Vn as a limit of sums of conditional variances (see Problem 5). Vn(t) = n'1 J{0 < s < t a pn}Qn(s, oo)?^, юГ' By definition of р„, [ o Thus {Vn} converges in probability to zero uniformly over compact sets. Apply Lemma 9. For b fixed and и large enough, Ip{sup|Xn@| > Л < №{\ХЯ(Ь)\ > 1<5} < 12ё~21РХ„фJ -*0. That is, for each fixed b, sup |log Kn(t а р„, oo] — log P(t a pn, oo)| -* 0 in probability. t<b Because both К„ and P are probability measures, and because р„ ->¦ oo in probability, it follows that A6) sup | К„[0, t] - P[0, i] | -»¦ 0 in probability. t The Kaplan-Meier measure estimates P consistently, in the sense of uniform convergence of distribution functions. Now consider the normalized martingale nll2Xn(t). It has conditional variance process nVn, which converges in probability to the continuous function H(t) = f{0 < s < t}Q(s, 00)"^(s, аэ)-1 P(ds).
Notes 185 That suggests {п1/2Х„} converges in distribution to a brownian motion stretched out by H. Theorem 13 proves it. We already have the first two requirements of the theorem holding. What about the maximum jump of nll2Xn on an interval [0, t]? It can only occur at one of the {yt} that died naturally. In the worst case, where all except the largest y{ in [0, t] are censored, the measure К„ puts mass (nQn(t л р„, oo))~* at that largest y(. The corresponding jump J in n1/2Xn cannot exceed nll2(na.^)~la.~l, which converges to zero provided <х„ |> n~llA. That is more than enough to force IPJ2 -* 0 as n -> oo. Thus nll2Xn ~> BH. The effect of the stopping wears off as n -> oo, because р„ -» oo in proba- probability. So we get from the convergence result for {Х„} that n1'2 loglKn(t, oo]/P(t, oo)] - BH as random elements of D[0, oo). Rewrite the left-hand side as и1'2 log[l + (P[0, t] - К„[0, t])/P(t, oo)] = - и1/2(К„[0, f] - P[0, t])(l + op(l))/P(t, oo) with the op(l) converging in probability to zero uniformly over compact intervals, by virtue of A6). Deduce that nll2(Kn[_0, Q - P[0, t])/P{t, oo) - BH, or, equivalently, п1/2(Х„[0, t] - P[0, ф ^ P(t, oo)BH. The covariances of the gaussian process P(-, oo)BH(-) depend on both P and C. Notes Martingales have been one of the hottest topics on the weak convergence market for some years now, partly because of a boom in stochastic integral stocks. For discrete-time martingales key papers are Brown A971), McLeish A974), and Aldous A978). Both McLeish and Aldous obtained results sharper than those in Sections 1 and 2. Hall and Heyde A980, Section 6.4) gave a much fancier version of Example 4. The literature on continuous time martingales has grown rapidly; the Shiryayev A981) survey, Metivier A982), the two-volume monograph by Liptser and Shiryayev A977, 1978), and Gihman and Skorohod A979), are accessible. Meyer A966), and Dellacherie and Meyer A978, 1982), make harder reading. Dellacherie A972) contains much about subtle cr-fields. What I called the conditional variance process usually gets written as <Z, Z>. The more widely definable quadratic variation process [Z, Z] has taken over the role I gave to <Z, Z> in Section 2. In the discrete-time theory, [Z, Z] would be the sum of squared increments; <Z, Z) would be the sum of conditional variances.
186 VIII. Martingales Martingales and stochastic integrals simplify the theory for estimation of distribution functions from censored data. Gill A980) covered this in some detail; Jacobsen A982) explained the essential prerequisites from the Strasbourg theory of processes. Bremaud A981) gave applications to point processes. The martingale proof of the central limit theorem for Kaplan- Meier estimators makes an interesting comparison with a more traditional weak convergence approach of Breslow and Crowley A974). Problems [1] For у real, define Н„(у) = eiy - 1 - iy (iy)"/n\. Prove that \Н„(у)\ < \y\n+1/(n + 1)!. [Proceed inductively, using i$'0Hn(s) ds = Hn+i(t) for t > 0.] Deduce that \eiy - 1 - iy + \y2\ < min{^2 + \y2, i\y\3}. [2] Prove uniform integrability of the squared autoregressive process {X2} from Example 4. Break и,- into a sum Vj + Wj, where Vj = Uj{\uj\ < M} — JPuj{\uj\ < M} for some large truncation level M. Define Show that Х„ = Yn + Zn + B^Xq. The sequence {Y2} is uniformly bounded. Because JPZ2 = TPw2 -\ \- el"'2JPw\, the M can be chosen to make WZ2 < e for every n. [A similar argument would work for {u,} a sequence of martingale differences. Solution provided by Kai Fun Yu.] [3] For the uniform empirical process Un prove that {[/„(t)/(l — t):O<r<l}isa martingale with respect to an appropriate family of ст-fields. Bound P{suptsi,[i7n@l > 8} for small enough b, then deduce the Empirical Central Limit Theorem. Generalize to ?„. [Over [a, b~\ take St as the cr-field determined by the locations of sample points in [a, t], for a < t < b.] [4] The jump functional is projection measurable; JT(x) is the limit of max,- \x(ti) — jc(f;_ j)| as the grid 0 = t0 < • ¦ • < tm = T is refined down to a countable dense subset of [0, T]. [5] Let Z be an L2 martingale on [0, 1] with conditional variance process V. If V has continuous sample paths, adapt the arguments in the Fidi Convergence part of the proof of Theorem 13 to show that ? P([Z(t;) - Z(t;_ J] Vtj_,) -> V{\) in probability, i as the maximum spacing in the grid 0 = t0 < ¦ ¦ • < tm = 1 tends to zero. [The intuitive interpretation of V as a limit of a sum of conditional variances of incre- increments is close to the mark.] [6] Suppose that martingales {Xn} satisfy the conditions of Theorem 13. Prove that their fidis converge to the fidis of BH. [If 0 < t1 < ¦ ¦ ¦ < tk < 1, then YJf) = Yj=1 ajXn(t л tj) is a martingale with У„A) = ?,. aJXB(tJ).']
Problems 187 [7] Using the notation from Section 3, show that Zn{i) = Kn[0, t л р„] - J{0 < s < t л Pn}Kn{s а р„, a>]P(s, ооГ'ВД is a martingale. Deduce from the consistency result for К„ that {rc1/2Zj converges in distribution to a stretched-out gaussian process. Reprove the central limit theorem for the Kaplan-Meier estimator. [8] Let {Х„} be a sequence of L2-martingales with conditional variance processes {Vn}. Suppose the fidis converge to the fidis of a stretched out brownian motion BH, with H continuous. If Vn(t) -> H(t) in probability for each fixed t, then Xn-* BH. [If р„ is a stopping time taking only finitely many different values, then „ + <5„) - Х„(р„)]2 = Р[К,(р„ + <5„) - К,(р„)]. Check Aldous's condition (VI.13) for suitably stopped processes. Use Lemma 11.]
APPENDIX A Stochastic-Order Symbols Mann and Wald A943) defined the symbols 0p(-) and op(-\ the stochastic analogues of O(-) and o(-), as a way of avoiding many of the messy details that bedevil asymptotic calculations in probability theory. Let {Х„} and {Yn} be sequences of random vectors. Write Х„ = Op(Yn) to mean: for each e > 0 there exists a real number M such that ТР{\Х„\ > M\Yn\} < s if n is large enough. Write Xn = op(Yn) to mean: Р{ \Х„\ > e| Yn\} -> 0 for each e > 0. Of course Xn and Yn must be defined on the same probability space. Typically {Yn} will be non-random. For example: Х„ = op(l) is another way of writing Х„ -> 0 in probability; a sequence of order Op(l) is said to be stochastically bounded; statistical estimators commonly come within Op(n~112) of the parameters they estimate. With the Op(-), op{-) notation, Problem III. 15 (the delta method) has a quick solution. Differentiability of the map H means H(x) = H(x0) + L(x - x0) + o(x - x0) near x0, where L is a fixed s x к matrix. If п1/2(Х„ — x0) ^* Z then H(Xn) = Я(х0) + L(Xn - x0) + ор(Х„ - x0) = H(x0) + L(Xn - x0) + op(n-^2). Thus nll2(H(Xn) - H(x0)) = Ln1'2(Xn - x0) + op(l) -> LZ. A few hidden assertions here need justification. For example: n1'\Xn - x0) -> Z implies Zn - x0 = O/n/2); if Zn -»• Z then LZn + op(l) ~> LZ. These conceal the asymptotic details of the proof.
190 A. Stochastic-Order Symbols For comparison, here is part of the proof of Theorem VII.5 written without the benefit of the notation. Remember that /(-, 0 = /(•, 0) + t'A(-) + U|r(-, 0, F(t) = Pf(-,t) = F@) + i\t\2 + Kt), where for each e > 0 there exists a S > 0 such that \h(t)\<e\t\2 if \t\<S. By assumption, t where пХ„ -* 0 in probability as n -* oo. Also we know that i1/2(F.(t») - F(xn)) = ?„/(-, 0) + т;?„Л + |т„| У„ or B) ?„(т„) = F(tJ + n-l>2Enf(-,0) + n-x'nEnA + n~^2\xn\Yn, where Yn = Enr{-, xn) ->• 0 in probability. From A) and B), Х„ > Fn(xn) - Fn@) = F(xn) + n~^2x'nEnA + n-^2\xn\Yn - F@) -= т IT ~r~ "t"/1( T ) т~ W T ii Zi i" W I T I / Z I M I ' Z V VO/ ' П П — ' I И I И Take absolute values. Because Xn > 0, nXn > \n\xn\2 - \n\h{xn)\ - п1/2|т„| |?„А[ - п1/2|т„| | У„| >^A-е)|т„|2-п1/2|т„|(|?„А| + |Уп|) if \xn\ < д > Kl - е)(и1/2|т„| - ^„J - i(l - e) ^2, where Wn = A - e)-1(|?HA| + | У„|). For fixed M deduce that 1Р{(п1/21тп| - Wnf > 2M} < Р{[т„| >ё} + П>{2A - е)иХ„ > М} + F{W2 > М}. The first and second terms converge to zero. If M is large enough, the third term is eventually less than any fixed a because ... well, because Wn is of order Op{\). The mess gets too great if we try to avoid the stochastic-order symbols any longer. If you need further convincing read the beautiful expository article by ChernoffA956).
APPENDIX В Exponential Inequalities The central limit theorem leads one to expect sums of independent random variables to behave as if they were normally distributed; tail probabilities for standardized sums can be approximated by normal tail probabilities. For the limit theorems proved in this book, we need upper bounds rather than approximations. A few such bounds are collected together here. The tails of the normal distribution decay rapidly. For ц > 0, VI п" The important factor is the exp(—\n2). We need a similar upper bound for the tail probabilities of a sum of independent random variables Yb ..., Yn. Set S = Yj + • • • + Yn. For each t > 0, П A) IP{S >n}< exp(-nt)TPexp(tS) = exp(- nt) Ц IPexp(t Y;). The trick will be to find a t that makes the last product small. For the normal distribution it is easy to find the best t directly: Р{ЛГ@, 1) > n} < inf exp(^t2 - nt) = exp(-^2). t>0 For other distributions we have to work harder. We must maneuver the moment generating function of Yt into a tractable form that gives us some clue about which value of t to choose. 2 Hoeffding's Inequality. Let Yu Y2,..., Ynbe independent random variables with zero means and bounded ranges: a{ < Yt < bt. For each n > 0, f ¦ • • + Yn > n} < ?(Ь( - аА
192 В. Exponential Inequalities Proof. Use convexity to bound the moment generating function of Y{. Drop the subscript i temporarily. e'Y < eta(b - Y)/(b - a) + e'\Y - a)/(b - a). Take expectations, remembering that IP У = 0. ffVy < ewb/(b - a) - etba/(b - a). Seta = 1 — P = — a/(b — a)andM = t(b — a).Note:a > Obecausea < 0 < b. log Pety < logOSe""" + aee") = -аи + log(jS + ae"). Write L(u) for this function of u. Differentiate twice. L'(u) = -a + aeu/(P + aeu) = -a + a/(a + jSe"""), L"(m) = ai?e-"/(a + ^e""J = [«/(a + 0 The inequality is a special case of: x(l — x) < % for 0 < x < 1. Expand by Taylor's theorem. L(u) = L@) + mL'(O) + i^2L"(M*) < |м2| = it2(b - йJ. Apply the inequality to each Yt, then use A) Set t = 4^/^; (bt - atJ to minimize the quadratic. ? 3 Corollary. Apply the same argument to {— У;} then combine with the in- inequality for {Yj} to get a two-sided bound under the same conditions: Y, + ¦ ¦ ¦ + Yn| > n} < 2 expf-2n2\ ?(b, - ad2]. ? L /¦¦=! J In one special case the proof can be shortened slightly. If Yt takes only two values, + at, each with probability \, then IPe171 = \\etai + e~ta4 = | (aitJk/Bky. < exp&ft2). The rest of the proof is the same as before. We only need the Hoeffding Inequality for this special case. 4 Bennett's Inequality. Let Yu ..., У„ be independent random variables with zero means and bounded ranges: \ Yt\ < M. Write a2 for the variance of Yt. Suppose V > a2 + ¦ ¦ • + a2. Then for each r\ > 0, 1 + ..-+ Yn\ > n} < where B(X) = 2Я[A + Я) Iog(l + X) - Ц for X > 0.
В. Exponential Inequalities 193 Proof. It suffices to establish the corresponding one-sided inequality. The two-sided inequality will then follow by combining it with the companion inequality for {— YJ. Bound the moment generating function of Yt. Drop the subscript i temporarily. IPe'y = 1 + г1РУ + ? (tk/k\)JP(Y2Yk-2) k=2 = 1 + a2g(t) where g(t) = (etM - 1 - tM)/M2 < exp[a2gr@3. From A) deduce 1P{S >ц}< exp[Fgr(t) - r\t\. Differentiate to find the minimizing value, t = M~1 log(l + Mr\V~ l\ which is positive. ? The function B(-) is well-behaved: continuous, decreasing, and B@ +) = 1. When X is large, B(X) « 2X~l log A in the sense that the ratio tends to one as X -» oo; the Bennett Inequality does not give a true exponential bound for ц large compared to V/M. For smaller r\ it comes very close to the bound for normal tail probabilities. Problem 2 shows that B(X) > A + \X)~l for all X > 0. If we replace B(-) by this lower bound we get r,} < 2 which is known as Bernstein's inequality. Notes Feller A968, Chapter VII) analyzed the tail probabilities of binomial and normal distributions—sharp results obtained by elementary methods. Bennett A962) and Hoeffding A963) derived and compared a number of inequalities on tail probabilities for sums of independent random variables. Dudley A984) noted the simpler derivation of Hoeffding's Inequality when Yt takes only values ±at. Bernstein's inequality apparently dates from the 1920's; it appeared as Problem X.14 in Uspensky's A937) book. Problems [1] For independent iV(O, l)-distributed random variables Yu Y2, ¦ ¦ ¦, show that (maxjsn Yj)/B log иI'2 converges in probability to one. [Write М„ for the maximum. Show JP{Mn < Bri log nI'2} = [1 - IP{N@, 1) > Br\ log nI/2}]", then use the exponential inequalities for normal tails.]
194 В. Exponential Inequalities [2] For the function B() appearing in Bennett's Inequality prove that A + iA)B(A) > 1 for all X > 0. [Apply l'Hdpital's rule twice to reduce the left-hand side to (l+iA*Xl + A*r1+$log(l +A») for some X* less than A. Then use log(l + A*) > A*/(l + A*).] [3] If a random variable Y has zero mean, finite variance a2, and is bounded above by a constant M, then for t > 0, IIV < (<tVm + М2е-/м)/(<72 + M2). [Subject to the constraints IPY = 0 and JPY2 < a2, the value of ПУ is maximized when IPy concentrates on the two values M and — <j2/M.] To prove this let ф{у) be the quadratic е-/мс-\М - j>)[1 + (С + ОС)' + ст2/М)] + ёмС~\у + а2/МJ, where С = M + <j2/M. Check that the coefficient of y2 is strictly positive and that ф satisfies ф(М) = е'м, ф( - <J2/M) = e ' ''м, ф'(- <72/M) = te~ '°2<м. Show that ф(у) > e'y for у < M, with equality at у = M and у = — c2/M. [The function h{y) = е~'уф(у) has a local minimum of 1 at —<J2/M. Also й(М) = 1. If h(y*) were equal to 1 for some y* in the interval ( — a2/M, y*), the quadratic e'yh'(y) would have three real roots: one at — <J2/M, one in the interval (— <J2/M, y*), and one in the interval (y*, M).~\ The distribution IPy concentrated at M and — a2/M achieves equality in nVy < ТРф(У). Bennett A962, page 42).] [4] For the one-sided form of Bennett's Inequality one needs only zero means and Yi < M for each i. Reexpress the inequality from the previous problem as ПУ < exp[tM + log/(l + <t2/M2)], where/(y) = 1 - y~l + y~xe~'My. Prove that d2/dy2 log f{y) equals -2y-3e-MyletM> -1-tMy- which is less than zero for у > 1. Deduce from a Taylor expansion to quadratic terms that log /(>•)< log /A) + 0» - l)/'(l)//0) for y>\, whence HV-*1 < exp[<j2(e'M — 1 — tM)/M2']. Complete the argument as before. [Hoeffding A963, page 24).]
APPENDIX С Measurability The defining properties of cr-fields ensure that the usual.countable operations in probability theory—countable unions and intersections, pointwise limits of sequences, and the like—cause no measurability difficulties. In Chapters II and VII, however, we needed to take suprema over uncountable families of measurable functions. The possibility of a non-measurable supremum was brushed aside by an assurance that a regularity condition, dubbed permissibility, would take care of everything. This appendix will supply the missing details. The discussion will take as axiomatic certain properties of analytic sets. A complete treatment may be found in Sections III.l to 111.20, 111.27 to 111.33, and 111.44 to 45, of Dellacherie and Meyer A978). Square brackets containing the initials DM followed by a number will point to the section of that book where you can find the justification for any unproved assertions. Suppose M is a set equipped with a cr-field Л. The analytic (^-analytic in DM terminology) subsets of M form a class slightly larger than Л. Denote it by =я/(М). If Л is complete for some probability measure \i, (that is, Л contains all the sets of zero \i measure) then sf(M) = Л [DM 33]. For example, the analytic subsets generated by ^[0, 1] contain that cr-field properly, but the cr-field of lebesgue measurable subsets of [0, 1] coincides with its analytic sets. You see from this example that we should be writing л/(Ж) rather than s/(M). The ambiguity is not serious when M is equipped with only one cr-field. For metric spaces, we will always choose Л to be the borel cr-field; for product spaces, it will always be the product cr-field. We considered empirical processes indexed by a class of functions. Formally, ?u <jj2, • • ¦ were measurable maps from a probability space (Q, S, IP) into a set S equipped with a cr-field У. A class 3F of 5^/^(IR)-measurable,
196 С. Measurability real-valued functions on S was given. The empirical measure Pn attached to each / in J* the real number We were assuming measurability of functions of a> such as sup^r \Pnf — Pf\. Let us now consider more carefully the dependence on &>, which we emphasize by writing Р„(со, •) instead of Pn(-)- Permissible Classes Suppose that the class J* is indexed by a parameter t that ranges over some set T. That is, & = {f(-,t):te T}. We lose no generality by assuming & so indexed; T could be J* itself. When more convenient, write ft instead of /(•, t). Assume T is a separable metric space. The metric on T will be important only insofar as it determines the borel ст-field ЩТ) on T. 1 Definition. Call the class J* permissible if it can be indexed by a T in such a way that (i) the function /(•,•) is ^ <g) ^(T)-measurable as a function from S <g) T into the real line; (ii) T is an analytic subset of a compact metric space T (from which it inherits its metric and borel er-field). ? Some authors call a T satisfying (ii) a Souslin measurable space [DM 16]. The usual sorts of class parametrized by borel subsets [DM 12] of an euclidean space are permissible. (Take T as the one-point compactification.) So are fancier classes such as all indicator functions of compact, convex subsets of euclidean space (Problem 2). Assume from now on that J* is permissible and that (Q, S, IP) is complete. Here are the properties of analytic sets that make the definition of per- permissibility a good one for empirical process applications. For every measur- measurable space (M, Ji\ (a) si{M ® Г) contains the product ст-field Ji <g) ЩТ); (b) for each H in s/(M <g) T), and in particular for each Ji <g) ^(T)-measur- able set, the projection nMH of H onto M is in sf(M) [DM 13, DM 9: the set H is also in srf(M <g) T), because T is analytic]; (c) for each A in s/(M) and each S/^-measurable map ц from (Q, S, IP) into M, the set {rj e Л} is an analytic subset of Q [DM 11]; hence {rj e s/} belongs to S, because (Q, S, IP) is complete. From these properties we shall be able to deduce measurability for functions defined by certain uncountable operations.
С. Measurability 197 Measurable Suprema Suppose g{-, •) is an Ji <g) ^(T)-measurable real function on M <g) T Set G(w) = supt g(m, 0- Then by (a), d(M <g) T) contains the set Яа = {(m, t): g(m, t) ><x}. The projection of Hx onto M is an analytic set, by (b). It consists of all those m for which G(m) > a. Thus {G > a} belongs to =я/(М) for each real a. If ц is a measurable map from a complete probability space (Q, S, IP) into M then, by (c), the set {со: G{t]{a>)) > a} is $ -measurable. That is, supr g(r\((o\ t) is an S-measurable function of со. If !F is permissible and if P | ft \ < oo for each t, requirement (i) of Defini- Definition 1 plus Fubini's theorem make Pft a measurable function of t. Apply the argument given above, with M = S", Ji = the product er-field У™, ц = the vector (fi,..., ?„), and [/(s,-, 0 - ^/, g(s, t) = to deduce that sup^ \Pn(a>,f) — Pf\ is a measurable function of со. Measurable Cross-Sections The Symmetrization Lemma II.8 made use of another property of analytic sets. We had a stochastic process {Zt: teT}; we assumed existence of a random x for which, almost surely, \ZZ\ > г whenever sup,|Z,[ > e. For this we need a cross-section theorem [DM44-45]. Under requirement (ii) of Definition 1, and for any complete probability space (M, Ji, \i), (d) if H belongs to srf{M ® T) there exists a measurable map h from M into T и {oo} (where oo is an ideal point added to T) such that: (w, h(m)) belongs to H whenever h(m) ф oo; and Ыт) Ф oo for \i almost all m in the projection nMH. Call h a measurable cross-section for H. Write Z(co, t) to emphasize the role of Z as an $ <g) ^(T)-measurable function on Q (g) T. Let {ej} be a strictly decreasing sequence of real numbers converging to e, with ex = oo. Set j = <m: ej+1 < sup \Z(co, t)\ < j = {(со, t): sJ+1 < \Z(co, t)\ < sj}.
198 С. Measurability The sets {Aj} all belong to S. The sets {Bj} all belong to S ® ЩТ), and hence are analytic. Let t0 be any fixed element of T. Choose a measurable cross- section tj for each 5,. Set т equal to т,- on Aj, and equal to t0 outside the union of the {Aj}. Redefine т(ю) to be t0 whenever (d) would set it equal to oo. For almost all со, if sup, | Z(co, t)\> e then (со, x(co)) belongs to Bj for some j; that is, ej+1 < Z(co, x(co)) < ?j for some j, as required. A formal proof of the Symmetrization Lemma II.8 is now possible. Require Z and Z' to be defined on a product space Q ® Q' equipped with product measure IP ® IP', Z(co, со', t) = Z(co, t), Z'(co, со', t) = Z'(co', t). The t constructed above need depend only on со. For almost all со, IP'{со1: \Z'(o)',z(co))\ <a} ^ /?. The rest of the proof goes through as before, with Fubini's theorem formalizing the conditioning argument. Shattered Sets Theorem 11.21 placed a condition on the behavior of Vn{?,x,..., ?„), the smallest integer к such that 3) shatters no collection of к points from {?i, • • • > ?«}• Assume that the indicator functions of sets in 3) form a per- permissible class J*\ Then Vn is measurable. For example, here is how to prove that {Vn < 2} belongs to S. Define g(s1,s2, tu t2, t3, t4)as f(su t])f(s2, tj) + f(su t2)[l - f(s2, t2)] + [1 - f(slt h)ms2, t3) + [1 - f(su t4)][l - f(s2, U)l Clearly g is У2 ® ^(T4)-measurable. The function G(co) = max sup дЩсо), ^(co), t) is (f-measurable. The set {Vn < 2} equals {G < 4}. Covering Numbers Functions of the empirical measure that enter into arguments depending on Fubini's theorem demand that we take measurability seriously. But for other functions, such as the random covering numbers appearing in Sections
С. Measurability 199 П.5, П.6, VI.4, and VI.6, we do not really need all the machinery of analytic sets. For example, we could just as well interpret a condition like log N,E, Pn,^) = op(n) in terms of outer measure: for each e > 0, IP*{log JViOS, Р„, &) > ne} -»• 0. The proofs go through almost exactly as before. Equivalently, we could interpret the conditions on the covering numbers to mean JP{Zn > ns} -> 0 for some measurable random variable Zn greater than log JV,(E, Pn, #"). For permissible classes, another solution would be to replace covering numbers by packing numbers: define М^д, Р„, .?) as the smallest m for which there exist functionsfu...,fm in J* with Pn\fj — fk\ > $ for j Ф k. This is a measurable function of со (and even jointly measurable in со and S); the set {со: М^д, Рп(со, •), У) > т] equals the projection on Q of the S ® ^(Tm)-measurable set (со, t): min n~l ? 1/F0»), tj) - /KM h)\ > j*k i=l The packing numbers are closely related to covering numbers: M,{25, Р„, Г) < N,C, Pn, !F) < M,{5, Р„, IF). Theorems stated in terms of random covering numbers have equivalent versions for random packing numbers. I cannot prove measurability for covering numbers of permissible classes. The set of со where ЫгE, Р„, J5') strictly exceeds m is the complement of a projection of a complement of an analytic set, which apparently need not be measurable. The Function Space Ж For the results in Chapter VII we required the class J^ to be pointwise bounded and have sup^r|P/| finite. The empirical process, had bounded sample paths; the functions En(co, •) all belonged to the set Ж of bounded, real functions on %F. To avoid confusion, we called members of ЭС functionals. We equipped Ж with the uniform norm, |Jx|| = sup^ |x(/)|. The limit processes had sample paths in the set C(#, P) of functionals that were uniformly continuous with respect to the if 2(P) seminorm pP on !F. The cr-field 38r was smallest for which (i) all the closed balls (for ||-||) belonged to <MP; (ii) all the finite-dimensional projections were ^-measurable.
200 С. Measurability We assumed in Chapter VII that Е„ is <f/^-measurable. That is true for a permissible OF that is separable under the pP seminorm. Because each ?„( •, /) is a real random variable, the finite-dimensional projections create no difficulty for the <f/^p-measurability. The properties of analytic sets are needed to prove that {a>\ ||?„(со, •) — x(-)|| < r} belongs to $ whenever x(-) belongs to C(J*, P). Introduce the index set T as described in Definition 1. Problem 1 shows that x(/r) is ^(T)-measurable. Equip S" with its product cr-field У. The function g(s, t) = ~112 is Уп (х) ^(r)-measurable. The argument in Measurable Suprema estab- establishes $ -measurability of sup, g(%{a>), t), which equals \\En(co, •) — x(-)\\. Notes Dudley A978) introduced a condition, which he termed Pe-Suslin, as a way of handling the measurability problems for empirical processes indexed by classes of sets. Since then he has refined the definition several times. He has called the latest version (Dudley 1984) of the condition "image admissible Suslin " (via a parameter space); it is almost the same as permissibility. Le Cam A983) has also imposed a Suslin type of condition. Dudley and Philipp A983) have systematically replaced measurability assumptions by conditions framed using measurable cover functions (the idea touched on in Covering Numbers—an unfortunate clash of ter- terminology—with the Zn random variables). Problems [1] Suppose !F has a countable, dense subset {#,} under the У2(Р) seminorm pP. Suppose also that J^ satisfies condition (i) of Definition 1. If x(-) is a bounded real functional on 3F that is pP continuous, show that x(ft) is a measurable function of t. [Assume x > 0. Represent x(ft) as limsup sup x(gj){rft) < n'1}, where r/f) = pP(ft — gj). Use Fubini's theorem to prove measurability of r,-.] [2] The class # of all non-empty, compact, convex subsets of the unit square [0, I]2 is permissible. [Equip % with the metric d(Ci, C2) = inf{s > 0:Сх ? C\ and C2 ? C\), where CE denotes the set of x less than s away from at least one point of C. Under this metric, <% is compact (Eggleston 1969, Section 4.2). The set of (x, C) with x in С is a closed subset of [0, l]2 (Я) (ё. Take 9S as its own indexing set.] [3] The class of sets {/ > 0}, for / running through a permissible class, is itself per- permissible.
References Aldous, D. A978). Stopping times and tightness. Annals of Probability 6:335-340. Alexander, K. S. A984a). Probability inequalities for empirical processes and a law of the iterated logarithm. Annals of Probability. (Based on his 1982 Ph.D. dissertation, MIT.) To appear. Alexander, K. S. A984b). Rates of growth for weighted 'empirical processes. (The Proceedings of the Neyman-Kiefer Memorial Conference, Berkeley, 1983. Belmont, CA: Wadsworth.) To appear. Alexandroff, A. D. A940). Additive set functions in abstract spaces. Mat. Sbornik 50(NS 8): 307-342. (Chapter 1.) Alexandroff, A. D. A941). Additive set functions in abstract spaces. Mat. Sbornik 51(NS 9): 563-621. (Chapters 2 and 3.) Alexandroff, A. D. A943). Additive set functions in abstract spaces. Mat. Sbornik 55(NS 13): 169-234. (Chapters 4 and 5.) Araujo, A. and Gine, E. A980). The Central Limit Theorem for Real and Banach Valued Random Variables. New York: Wiley. Ash, R. B. A972). Real Analysis and Probability. New York: Academic Press. Bennett, G. A962). Probability inequalities for the sum of independent random variables. Journal of the American Statistical Association 57:33^-5 Bertrand-Retali, M. A974). Convergence uniforme d'un estimateur d'une densite de probabilitedans Rs. ComptesRendusdeVAcademiedesSciences, ParisllS:451-453. Bertrand-Retali, M. A978). Convergence uniforme d'un estimateur de la densite par la methode du noyau. Revue Roumaine de Mathimatiques Pures et Appliques 23:361-385. Billingsley, P. A968). Convergence of Probability Measures. New York: Wiley. Billingsley, P. A971). Weak Convergence of Measures: Applications in Probability. Philadelphia: Society for Industrial and Applied Mathematics. (Regional Confer- Conference Series in Applied Mathematics # 5.) Billingsley, P. and Tops0e, F. A967). Uniformity in weak convergence. Zeitschrift fur Wahrscheinlichkeitstheorie und Verw. Geb. 7:1-16. Bollobas, B. A979). Graph Theory: An Introductory Course. New York: Springer-Verlag. Bolthausen, E. A978). Weak convergence of an empirical process indexed by the closed convex subsets of I2. Zeitschrift fur Wahrscheinlichkeitstheorie und Verw. Geb. 43:173-181.
202 References Breiman, L. A968). Probability. Reading, MA: Addison-Wesley. Bremaud, P. A981). Point Processes and Queues: Martingale Dynamics. New York: Springer-Verlag. Breslow, N. and Crowley, J. A974). A large sample study of the life table and product limit estimates under random censorship. Annals of Statistics 2:437^-53. Brown, В. М. A971). Martingale central limit theorems. Annals of Mathematical Statistics 42:59-66. Brown, В. М. A983). Statistical uses of the spatial median. Journalof the Royal Statistical Society, Series В 45:25-30. Brown, В. М. and Kildea, D. G. A979). Outlier-detection tests and robust estimators based on signs of residuals. Communications in Statistics—Theory and Methods A8:257-269. Chernoff, H. A954). On the distribution of the likelihood ratio. Annals of Mathe- Mathematical Statistics 25:573-578. Chernoff, H. A956). Large sample theory: parametric case. Annals of Mathematical Statistics 27:1-22. Chibisov, D. M. A965). An investigation of the asymptotic power of tests of fit. Theory of Probability and Its Applications 10:421^-37. Chung, K. L. A949). An estimate concerning the Kolmogorov limit distribution. Transactions of the American Mathematical Society 67:36-50. Chung, K. L. A968). A Course in Probability Theory. New York: Harcourt, Brace and World. DeHardt, J. A971). Generalizations of the Glivenko-Cantelli theorem. Annals of Mathematical Statistics 42:2050-2055. Dellacherie, C. A972). Capacitis et Processus Stochastiques. Heidelberg-Berlin: Springer-Verlag. Dellacherie, С and Meyer, P-A. A978). Probabilities and Potential, Part A. Amsterdam: North-Holland. Translation of the 1975 French second edition of the first four chapters from Meyer A966).) Dellacherie, С and Meyer, P-A. A982). Probabilities and Potential, Part B. Amsterdam: North-Holland. Denby, L. and Martin, R. D. A979). Robust estimation of the first-order autoregressive parameter. Journal of the American Statistical Association 74:140-146. Donsker, M. D. A951). An invariance principle for certain probability limit theorems. Memoirs of the American Mathematical Society 6:1-12. Donsker, M. D. A952). Justification and extension of Doob's heuristic approach to the Kolmogorov-Smirnov theorems. Annals of Mathematical Statistics 23:277- 281. Doob, J. L. A949). Heuristic approach to the Kolmogorov-Smirnov theorems. Annals of Mathematical Statistics 20:393^103. Doob, J. L. A953). Stochastic Processes. New York: Wiley. Dudley, R. M. A966a). Weak convergence of probabilities on nonseparable metric spaces and empirical measures on euclidean spaces. Illinois Journal of Mathe- Mathematics 10:109-126. Dudley, R. M. A966b). Convergence of Baire measures. Studia Mathematica 27:251-268. Dudley, R. M. A967a). Measures on non-separable metric spaces. Illinois Journal of Mathematics 11:449-453. Dudley, R. M. A967b). The sizes of compact subsets of Hilbert space and continuity of gaussian processes. Journal of Functional Analysis 1:290-330. Dudley, R. M. A968). Distances of probability measures and random variables. Annals of Mathematical Statistics 39:1563-1572. Dudley, R. M. A973). Sample functions of the gaussian process. Annals of Probability 1:66-103.
References 203 Dudley, R. M. A974). Metric entropy of some classes of sets with differentiable boun- boundary. Journal of Approximation Theory 10:227-236. Dudley, R. M. A976). Probabilities and Metrics: Convergence of laws on metric spaces, with a view to statistical testing. Aarhus: Aarhus University. (Mathematics Insti- Institute Lecture Note Series #45.) Dudley, R. M. A978). Central limit theorems for empirical measures. Annals of Prob- Probability 6:899-929. (Correction, ibid. 7 A979):909-911.) Dudley, R. M. A981a). Donsker classes of functions. In Csorgo, M., Dawson, D. A., Rao, J. N. K., and Salek, A. K. Md. E. (Editors), Statistics and Related Topics, pages 341-352. Amsterdam: North-Holland. Dudley, R. M. A981b). Vapnik-Cervonenkis Donsker classes of functions. In Aspects Statistiques et Aspects Physiques des Processus Gaussiens. CNRS, Paris, St. Flour 1980. Dudley, R. M. A984). A Course on Empirical Processes. Springer Lecture Notes in Mathematics. (Lectures given at Ecole d'Ete de Probabilites de St. Flour, 1982.) To appear. Dudley, R. M. and Philipp, W. A983). Invariance principles for sums of Banach space valued random elements and empirical processes. Zeitschrift fur Wahrschein- lichkeitstheorie und Verw. Geb. 62:509-552. Durbin, J. A973a). Weak convergence of the sample distribution function when para- parameters are estimated. Annals of Statistics 1:279-290. Durbin, J. A973b). Distribution Theory for Tests Based on the Sample Distribution Function. Philadelphia: Society for Industrial and Applied Mathematics. (Regional Conference Series in Applied Mathematics # 9.) Durst, M. and Dudley, R. M. A981). Empirical processes, Vapnik-Cervonenkis classes and Poisson processes. Probability and Mathematical Statistics {Wroclaw) 1:109-115. Eggleston, H. G. A969). Convexity. Cambridge: Cambridge University Press. Erdos, P. and Kac, M. A946). On certain limit theorems in the theory of probability. Bulletin of the American Mathematical Society 52:292-302. Erdos, P. and Kac, M. A947). On the number of positive sums of independent random variables. Bulletin of the American Mathematical Society 53:1011-1020. Feller, W. A968). An Introduction to Probability Theory and Its Applications, Vol. I (Third edition). New York: Wiley. Feller, W. A971). An Introduction to Probability Theory and Its Applications, Vol. II (Second edition). New York: Wiley. Gaenssler, P. A984). Empirical Processes. Institute of Mathematical Statistics Lecture Notes—Monograph Series 3. Gaenssler, P. and Revesz, P. (Editors) A976). Empirical Distributions and Processes. Springer Lecture Notes in Mathematics 566. Gaenssler, P. and Stute, W. A979). Empirical processes: a survey of results for inde- independent and identically distributed random variables. Annals of Probability 7:193-243. Gihman, I. I. and Skorohod, A. V. A969). Introduction to the Theory of Random Processes. Philadelphia: Saunders. (Earlier version of the three volumes 1974, 1975, 1979.) Gihman, 1.1. and Skorohod, A. V. A974). The Theory of Stochastic Processes, I. New York: Springer-Verlag. Gihman, 1.1. and Skorohod, A. V. A975). The Theory of Stochastic Processes, II. New York: Springer-Verlag. Gihman, I. I. and Skorohod, A. V. A979). The Theory of Stochastic Processes, III. New York: Springer-Verlag. Gill, R. D. A980). Censoring and Stochastic Integrals. Mathematical Centre Tract 124. Amsterdam: Mathematisch Centrum.
204 References Gine, E. and Zinn, J. A984). On the central limit theorem for empirical processes. Annals. of Probability. To appear. Glaisher, J. W. L. A872). On the law of facility of errors of observations, and on the method of least squares. Memoirs of the Royal Astronomical Society 39:75-124. Gnedenko, B. V. A968). The Theory of Probability (Fourth edition). New York: Chelsea. Hajek, J. A965). Extension of the Kolmogorov-Smirnov test to regression alternatives. In J. Neyman and L. M. Le Cam (Editors), Bernoulli, Bayes, Laplace: Anniversary Volume, pages 45-60. New York: Springer-Verlag. Hall, P. and Heyde, С. С A980). Martingale Limit Theory and Its Applications. New York: Academic Press. Halmos, P. R. A969). Measure Theory. New York: Van Nostrand. Hansel, G. and Troallic, J. P. A978). Mesures marginales et theoreme de Ford- Fulkerson. Zeitschriftfur Wahrscheinlichkeitstheorie und Verw. Geb. 43:245-251. Hartigan, J. A. A975). Clustering Algorithms. New York: Wiley. Hartigan, J. A. A978). Asymptotic distributions for clustering criteria. Annals of Statistics 6:117-131. Hoeffding, W. A963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58:13-30. Huber, P. J. A967). The behavior of maximum likelihood estimates under nonstandard conditions. In Fifth Berkeley Symposium on Mathematical Statistics and Prob- Probability, pages 221-233. Berkeley, CA: University of California. Jacobs, K. A978). Measure and Integral. New York: Academic Press. Jacobsen, M. A982). Statistical Analysis of Counting Processes. Springer Lecture Notes in Statistics 12. Kac, M. A949). On deviations between theoretical and empirical distributions. Pro- Proceedings of the National Academy of Science USA 35:252-257. Kelley, J. L. A955). General Topology. New York: Van Nostrand. Kolchinsky, V. I. A982). On the central limit theorem for empirical measures. Theory of Probability and Mathematical Statistics (Kiev) 24:71-82. Kolmogorov, A. N. A933). Sulla determinazione empirica di una legge di distribuzione. Giornale dell'Institute Ital. degli Attuari 4:83-91. Kolmogorov, A. N. A956). On Skorohod convergence. Theory of Probability and Its Applications 1:213-222. Kurtz. T. G. A981). Approximation of Population Processes. Philadelphia: Society for Industrial and Applied Mathematics. (Regional Conference Series in Applied Mathematics #36.) Le Cam, L. A957). Convergence in distribution of stochastic processes. University of California Publications in Statistics 2:207-236. Le Cam, L. A983). A remark on empirical measures. In Bickel, P., Doksum, K., and Hodges, J. (Editors), Festschrift for E. L. Lehmann, pages 305-327. Belmont, CA: Wadsworth. Levy, P. A922). Sur la determination des lois de probability par leurs fonctions carac- teristiques. Comptes Rendus de VAcademie des Sciences, Paris 175:854—856. Liapounoff, A. M. A900). Sur une proposition de la theorie des probabilites. Bulletin de VAcademie imperiale des Sciences de St. Petersbourg 13:359-386. Liapounoff, A. M. A901). Nouvelle forme du theoreme sur la limite de probabilites. Memoires de VAcademie imperiale des Sciences de St. Petersbourg 12:Number 5. Lindeberg, J. W. A922). Eine neue Herleitung des Exponentialgesetzes in der Wahr- scheinlichkeitsrechnung. Mathematische Zeitschrift 15:211-225. Lindvall, T. A973). Weak convergence of probability measures and random functions in the function space Z>[0, oo). Journal of Applied Probability 10:109-121. Liptser, R. S. and Shiryayev, A. N. A977). Statistics of Random Processes, I: General Theory. New York: Springer-Verlag.
References 205 Liptser, R. S. and Shiryayev, A. N. A978). Statistics of Random Processes, II: Applica- Applications. New York: Springer-Verlag. Loeve, M. A978). Probability Theory, II. (Fourth Edition) New York: Springer-Verlag. Major, P. A978). On the invariance principle for sums of independent identically distributed random variables. Journal of Multivariate Analysis 8:487-517. Mann, H. B. and Wald, A. A943). On stochastic limit and order relationships. Annals of Mathematical Statistics 14:217-226. Maritz, J. S. A981). Distribution-Free Statistical Methods. London: Chapman and Hall. McKean, H. P., Jr. A969). Stochastic Integrals. New York: Academic Press. McLeish, D. L. A974). Dependent central limit theorems and invariance principles. Annals of Probability 2:620-628. Metivier, M. A982). Semimartingales: A Course on Stochastic Processes. Berlin: De- Gruyter. Meyer, P-A. A966). Probability and Potential. (First edition). Boston: Blaisdell. Neveu, J. A975). Discrete-Parameter Martingales. Amsterdam:North-Holland. Neyman, J. A937). 'Smooth' test for goodness of fit. Skandinavisk Aktuarietidskrift 20:149-199. (Reprinted in "A Selection of Early Statistical Papers of J. Neyman," Cambridge: Cambridge University Press, 1967.) Oxtoby, J. С A971). Measure and Category. New York: Springer-Verlag. Parr, W. C. A982). Minimum distance estimation: a bibliography. Communications in Statistics, /4 11:1511-1518. Parthasarathy, K. R. A967). Probability Measures on Metric Spaces. New York: Academic Press. Pickands, J. A971). The two-dimensional Poisson process and extremal processes. Journal of Applied Probability 8:745-756. Pollard, D. A980). The minimum distance method of testing. Metrika 27:43-70. Pollard, D. A981a). Limit theorems for empirical processes. Zeitschriftfur Wahrschein- lichkeitstheorie und Verw. Geb. 57:181-195. Pollard, D. A981b). Strong consistency of A>means clustering. Annals of Statistics 9:135-140. Pollard, D. A982a). Beyond the heuristic approach to Kolmogorov-Smirnov theorems. In Gani J. and Hannan E. J. (Editors), Essays in Statistical Science. Journal of Applied Probability, Special Volume 19A, pages 359-365. (Festschrift for P. A. P. Moran.) Pollard, D. A982b). Quantization and the method of fc-means. IEEE Transactions on Information Theory IT-28:199-205. Pollard, D. A982c). A central limit theorem for empirical processes. Journal of the Australian Mathematical Society (Series A) 33:235-248. Pollard, D. A982d). A central limit theorem for A:-means clustering. Annals of Prob- Probability 10:919-926. Prohorov, Yu. V. A956). Convergence of random processes and limit theorems in probability. Theory of Probability and Its Applications 1:157-214. Руке, R. A969). Applications of almost surely convergent constructions of weakly convergent processes. Springer Lecture Notes in Mathematics 89:187-200. Руке, R. A970). Asymptotic results for rank statistics. In M. L. Puri (Editor), Non- parametric Techniques in Statistical Inference, pages 21-37. Cambridge: Cambridge University Press. Руке, R. and Shorack, G. A968). Weak convergence of a two-sample empirical process and a new approach to Chernoff-Savage theorems. Annals of Mathematical Statistics 39:755-771. Руке, R. and Sun, T. G. A982). Weak convergence of empirical processes. (University of Washington Statistics Department Technical Report # 19.) Ranga Rao, R. A962). Relations between weak and uniform convergence of measures with applications. Annals of Mathematical Statistics 33:659-680.
206 References Resnick, S. I. A975). Weak convergence to extremal processes. Annals of Probability 3:951-960. Revesz, P. A976). Three theorems of multivariate empirical process. Springer Lecture Notes in Mathematics 566:106-126. Sauer, N. A972). On the density of families of sets. Journal of Combinatorial Theory (A) 13:145-147. Shelah, S. A971). Stability, the f.c.p., and superstability; model theoretic properties of formulas in first-order theory. Annals of Mathematical Logic 3:271-362. Shelah, S. A972). A combinatorial problem; stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics 41:247-261. Shiryayev, A. N. A981). Martingales: recent developments, results and applications. International Statistical Review 49 :199-233. Silverman, B. W. A978). Weak and strong uniform consistency of the kernel estimate of a density and its derivatives. Annals of Statistics 6:177-184. Simmons, G. F. A963). Introduction to Topology and Modern Analysis. New York: McGraw-Hill. Skorohod, A. V. A956). Limit theorems for stochastic processes. Theory of Probability and Its Applications 1:261-290. Skorohod, A. V. A957). Limit theorems for stochastic processes with independent increments. Theory of Probability and Its Applications 2:138-171. Steele, J. M. A975). Combinatorial Entropy and Uniform Limit Laws. Ph.D. thesis, Stanford. (Reproduced by University Microfilms, Ann Arbor.) Steele, J. M. A978). Empirical discrepancies and subadditive processes. Annals of Probability 6:118-127. Stone, С A963). Weak convergence of stochastic processes defined on semi-infinite time intervals. Proceedings of the American Mathematical Society 14:694-696. Straf, M. A969). A general Skorohod space and its applications to the weak convergence of stochastic processes with several parameters. Ph.D. thesis, University of Chicago. Strassen, V. A965). The existence of probability measures with given marginals. Annals of Mathematical Statistics 36:423-439. Strassen, V. and Dudley, R. M. A969). The central limit theorem and e-entropy. Springer Lecture Notes in Mathematics 89:224-231. Stute, W. A982a). The oscillation behavior of empirical processes. Annals of Probability 10:86-107. Stute, W. A982b). A law of the logarithm for kernel density estimators. Annals of Probability 10:414-422. Talagrand, M. A978). Les boules peuvent-elles engendrer la tribu borelienne d'un espace metrisable non separable? Siminaire Choquet, Paris 17e annee: C5. Tong, Y. L. A980). Probability Inequalities in Multivariate Distributions. New York: Academic Press. Tops0e, F. A970). Topology and Measure. Springer Lecture Notes in Mathematics 133. Uspensky, J. V. A937). Introduction to Mathematical Probability. New York: McGraw- Hill. Vapnik, V. N. and Cervonenkis, A. Ya. A971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications 16:264-280. Vapnik, V. N. and Cervonenkis, A. Ya. A981). Necessary and sufficient conditions for the uniform convergence of means to their expectations. Theory of Probability and Its Applications 26:532-553. Varadarajan, V. S. A965). Measures on topological spaces. American Mathematical Society Translations 48:161-228. (Translation from Mat. Sbornik 55 (97) A961): 35-100.)
References 207 Weierstrass, К. A885). tJber die analytische Darstellbarkeit sogenannter willkurlicher Funktionen reeller Argumente. Sitzungsberichte der Preussischen Akademie der Wissenschaften:633-640, 789-906. (Also in Collected Works, Vol. Ill, pp. 1-37, Berlin: Mayer u. Muller 1903.) Whitt, W. A970). Weak convergence of probability measures on the function space C[0, oo). Annals of Mathematical Statistics 41:939-944. Whitt, W. A980). Some useful functions for functional limit theorems. Mathematics of Operation Research 5:67-85. (First circulated as preprint in 1970.) Wichura, M. A971). A note on the weak convergence of stochastic processes. Annals of Mathematical Statistics 42:1769-1772.
Author Index Aldous, D. 133, 137, 185, 187 Alexander, K. S. 38, 166 Alexandroff, A. D. 85 Amato, B. R. viii Araujo, A. 37 Ash, R. B. 39 Donsker, M. D. 117-118 Doob, J. L. 3,4,64,113,117 Dudley, R. M. vii, 37-39, 85-86, 117-118, 166-168, 193,200 Durbin, J. 118 Durst, M. 39 Barry, D. G. viii Bennett, G. 193-194 Bertrand-Retali, M. 38, 41 Billingsley, P. 36, 61, 85-86, 117-118, 136 Bollobas, B. 86 Bolthausen, E. 167 Boyce, J. S. viii Breiman, L. 4,22,106,118 Bremaud, P. 186 Breslow, N. 186 Brown, В. М. 166, 185 Cervonenkis, A. Ya. 37 Chernoff, H. 165, 190 Chibisov, D. M. 85 Chung, K. L. 117,173 Crowley, J. 186 DeHardt, J. 36 Dellacherie, С 176, 179, 185, 195-197 Denby, L. 37 Eggleston, H. G. 200 Erdos, P. 118 Feller, W. 53, 63, 193 Gaenssler, P. vii, 36-37, 40, 118, 167 Gihman, 1.1. 4, 117-118, 136-137, 166, 185 Gill, R. D. 186 Gine, E. 37, 40, 166-167 Glaisher, J. W. L. 61 Gnedenko, B. V. 37 Hajek.J. 86,117 Hall, P. 185 Halmos, P. R. 39 Hansel, G. 86 Hartigan, J. A. viii, 36, 166 Heyde, С. С. 185 Hoeffding, W. 193-194 Huber, P. J. 38, 165, 168
210 Author Index Jacobs, K. 86 Jacobsen, M. 186 Kac, M. 117-118 Kelley, J. L. 39 Kildea, D. G. 166 Kolchinsky, V. I. 38, 166 Kolmogorov, A. N. 99, 117, 136, 166 Kurtz, T. G. 137 Le Cam, L. 37-38, 85, 166-167, 200 Levy, P. 61 Liapounoff, A. M. 61, 63 Lindeberg, J. W. 61 Lindvall, T. 118,136 Liptser, R. S. 185 Loeve.M. 118 Ranga Rao, R. 86 Resnick, S. I. 136 Revesz, P. 118,167 Sauer, N. 37 Shelah, S. 37 Shiryayev, A. N. 185 Shorack, G. vii-viii, 85 Silverman, B. W. 38 Simmons, G. F. 67-68, 85 Skorohod, A. V. 4, 86, 117-118, 122, 136-137, 166, 185 Steele, J. M. 37 Stone, С 118,136 Straf, M. 136 Strassen, V. 86, 167 Stute, W. vii, 36-38, 117 Sun, T. G. 167 Major, P. 86 Mann, H. B. 189 Maritz, J. S. 166 Martin, R. D. 37 McKean, H. P. 146 McLeish, D. L. 185 Metivier, M. 185 Meyer, P-A. 176, 179, 185, 195-197 Talagrand, M. 85 Tong, Y. L. 41 Tops0e, F. 36-37, 62, 85-86 Troallic, J. P. 86 Uspensky, J. V. 61, 193 Neveu, J. 22 Neyman, J. 47 Nolan, D. A. viii Oxtoby, J. С 65, 86 Parr, W. 118 Parthasarathy, K. R. 85, 136 Philipp, W. 166, 200 Pickands, J. 136 Pollard, D. 36-39, 118, 166, 169 Prohorov, Yu. V. 85-86 Руке, R. vii, 85-86, 167 Vapnik, V. N. 37 Varadarajan, V. S. 85 Wald, A. 189 Weierstrass, K. 61 Wellner, J. viii Whitt, W. 118,136 Wichura, M. 86,117,136 Yu, K. F. 186 Zinn, J. 37, 38, 40, 166-167
Subject Index abuse, of notation 44,171 adapted 176 Aldous's condition 133 Allocation Lemma 77 almost-sure representation. See Representation Theorem Approximation Lemma 27 autoregression 12, 174 axiom of choice 65, 86 BH. See brownian motion, stretched out BP. See P-motion 3$p 156 Bennett's Inequality 160, 192, 194 Bernstein's inequality 193 bias, of density estimator 35,42 binomial coefficient 19 bounded variation 42 brownian bridge 64 denned 3,95 existence 101, 119 brownian motion construction from brownian bridge 103 denned 95 modulus of continuity for 146 stretched-out 178 tied-down 3 156 108 P) C[0, oo) C[0, 1] 1, 90 cadlag 3, 89, 176 Cauchy sequence 81 censored data, estimation from 182 Central Limit Theorem Empirical, for distribution functions 97 Empirical, uniform case 96 Liapounoff 51 Lindeberg 52 Multivariate 57 central limit theorem autoregression 174 empirical process 157,163-165 martingale-difference array 171 minimization functional 141 Chaining Lemma 144 chaining 37 restricted 160-163 for stochastic processes 142-145 characteristic function 54 Continuity Theorem 55, 60 inversion formula 63 uniform convergence on compacta 59 uniquely determines distribution 55 chi-square, Pearson's 46 clustering. See A>means combinatorial method 13 compactness in metric spaces 81-82 sequential 82 Compactness Theorem 82
212 Subject Index completely regular point of metric space 67 topological space 68 conditional variance of increment 171 process 176 continuity set 71 Continuity Theorem, for characteristic functions 55, 60 Continuous Mapping Theorem for euclidean space 46 failure of 66 for metric spaces 70 consistency for autoregression 12 of Л-means 9 of optimal solutions 12 convergence in distribution Aldous's condition for, in D[0, oo) 134 in D[0, oo), for Skorohod metric 131 in D[Q, oo), for uniform convergence on compacta 108 of empirical process 97,157 in euclidean space 43 of L2 martingale 179 in metric spaces 65 process with independent increments 104, 135 via semicontinuity 73 via uniform approximation 70 of uniform empirical process 96 Convergence Lemma for euclidean space 44-46 for metric spaces 68 convex sets, shattering 17, 23 convolution 35, 49, 54 coupling 76-80 in Compactness Theorem 84 covering integral for metric space 143 as modulus of continuity 147 covering numbers direct 164 inequalities between 34 ?C1, for classes of functions 25 ?C2, for classes of functions 31 measurability of 198 for metric spaces 143 for polynomial classes 27 random 150 criterion function minimization of 28 random 12 cross-section, measurable crossword puzzle 76 197 D[0, oo) definition of 107 metric of uniform convergence on compacta 108 Skorohod metric 123 Z>[0, 1] 3, 89-90 A.® 21 delta method 63, 189 density convolution 55 estimation 35 differentiability, in quadratic mean 140, 151 direct approximation 8, 164 discrimination, linear; polynomial; quadratic 17 distribution function, pointwise convergence of 43, 46, 53 Donsker's theorem. See Central Limit Theorem, Empirical Doob-Meyer decomposition 176 l-argument 24 Е„. See empirical process EP. See P-bridge empirical measure bivariate 12 defined 6 empirical process central limit theorem for 96-97 defined 2, 95, 97, 140 measurability of 65, 199 envelope, for classes of functions 24, 151 equicontinuity of class of functions 74 stochastic 139 Equicontinuity Lemma 150 ergodicity 9, 12 Exponential Bounds 16 exponential bound. See tail probability extreme-value distribution 128 fidi projection 92. See also projection maps fidis (finite-dimensional distributions) 3, 90, 92 function space Ж 155, 199
Subject Index 213 functional first passage time 109, 124 in function space 156 jump 179 terminology 2, 7 functions, infinitely differentiable 49 gaussian process characterization 106 indexed by functions 146 Glivenko-Cantelli theorem classical 6-7, 13 generalized, for classes of functions 25 generalized, for classes of sets 18, 22 goodness-of-fit statistic 2 with estimated parameters 99,159 Kolmogorov's, limiting distribution of 113 Neyman's 47 graph, of real-valued function 27 grid, approximations constructed from 91, 124 heuristic argument, Doob's 4, 64 Hewitt-Savage, zero-one law of 22 Hoeffding's Inequality 16, 26, 31, 150, 164, 191 independent increments convergence in distribution definition 103 174 16,21 104, 135 innovations Integration interpolation continuous in Z>[0, oo) in Z>[0, 1] 101 124 92 in a function space 158 JF) xiv. See also covering integral Kaplan-Meier estimator 182 kernel smoothing 35 translates of 42 fc-means central limit theorem 153 consistency 9, 30 method of 9 Laplace transform, for hitting time 112 Liapounoff condition 51 Lindeberg condition 52, 106 for martingale-difference arrays 171 Lindelof's theorem 68, 72, 87 Lipschitz condition 45, 74, 101 little links 163 location, center of 28 marginal stacks, with coupling 77 markov property for empirical process 96 marriage lemma 77 martingale continuous time 176 difference array 171 L2 176 L2, convergence to gaussian process 179 reversed 39 reversed, empirical measure as 22 stopped 180 Maximal Inequality maximal inequality for cadlag processes by chaining 144 for L2-martingales maximum likelihood measurability, for random elements of metric spaces 64-66 median central limit theorem for 53, 98 consistency 7 spatial, central limit theorem for 152 spatial, consistency of 28 M-estimator generalized 12 stochastic equicontinuity 165 metric bounded-Lipschitz 74 entropy 166 Prohorov 75, 79, 88 Skorohod, on Z>[0, oo) 123-124 minimization of random criterion function 140 modulus function for D[0, oo) 125 multinomial distribution 46 15 94 177 138 NF) xiv. See also covering number || • ||. See norm norm 14, 24, 156
214 Subject Index normal distribution characteristic function of multivariate 29 tail probabilities 191 optimal centers 10 outer measure 70 63 remainder term, in Taylor expansion 50, 139 Representation Theorem with random elements 71 with random variables 58 pP xiv robustnik 75 &. See projection a-field Pn. See empirical measure P°n 15, 26, 149 partial sum process 106, 109 P-bridge 149 permissible classes of functions 24 classes of sets 17 definition 196 7ts. See projection maps P-motion 147 polynomial bound for number of sets picked out 19 class 17,27 discriminating 17 discrimination 17,20 orthogonal 47 poisson process 105, 129 predictable 176 product a-field 69 projection, measurability of 196 projection maps 66, 90 forD[0, oo) 130 projection <7-field 66, 90 on D[0, oo) 127 on D[0, 1] 87 Prohorov's theorem. See Compactness Theorem quadratic variation process 185 quantile transformation 57, 97, 129 random element of metric space 1, 65 random vector characteristic functions 56 convergence in distribution of 43, 56 perturbation of 48 rates of convergence 30 reflection principle, for brownian motion 112 sample path 1 Scheffe's lemma 61 semicontinuity 73, 87 seminorm xiv separability of C[0, 1] 87 of D[0, oo), under Skorohod metric 127 and a-fields 85 for subsets of metric spaces 67 universal 38 shatter 18,21,198 sigma field. See also $gp, 0> generated by balls 87 sign variables 15 signed measure. See P°n Slutsky's theorem 62 Souslin 196 square-root trick 32, 37 Stirling's approximation 21 stochastic equicontinuity 139-140 order symbol 141, 189 process 1 stopping time 110, 172 Strasbourg theory 176 strong markov property, for brownian motion 111 submartingale 178 reversed 22,25 subsets hidden 19 picked out 18 substitution, of increments 49 symmetrization 32, 149 inequality 14-15 Symmetrization, First 14 Symmetrization Lemma 14, 198 Symmetrization, Second 15 symmetry, and empirical measures 21 tail probability characteristic function bound exponential bound 160,191 60
Subject Index 215 tied-down brownian motion. See brownian bridge tight measure on metric space 81 on real line 60 total boundedness of metric space 82 and P-motion 168 triangular array 51, 106 truncation 25 2-means. See A>means uniform tightness for general empirical processes 156 for uniform empirical processes 102 usual conditions 176 VC class 37 vector space, finite-dimensional 20, 30,38 Un. See empirical process, uniform uniform convergence of density estimator 36 uniform integrability 176 uniform strong law of large numbers 7 for classes of functions 25, 168 for classes of sets 18, 22 for convex sets 22 weak convergence in euclidean space 44 in metric spaces 65 weight functions 158 3C. See function space