Author: Taylor J.C.

Tags: introduction, probability

ISBN: 0-387-94830-9

Year: 1996

                    J.C. Taylor
An Introduction to Measure
and Probability
With 12 Illustrations
Springer


J.C. Taylor
Department of Mathematics and Statistics
McGill University
805 Sherbrooke Street West
Montreal, Quebec H3A 2K6
Canada

Library of Congress Cataloging-in-Publication Data
Taylor, J.C. (John Christopher), 1936-
An introduction to measure and probability / J.C. Taylor.
p. cm.
Includes bibliographical references and index.
ISBN 0-387-94830-9 (pbk. : alk. paper)
1. Measure theory. 2. Probabilities. I. Title.
QA325.T39 1996
519.2-dc20 96-25447

Printed on acid-free paper.

© 1997 Springer-Verlag New York, Inc.
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer-Verlag New York, Inc., 175 Fifth Avenue, New York, NY 10010, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use of general descriptive names, trade names, trademarks, etc., in this publication, even if the former are not especially identified, is not to be taken as a sign that such names, as understood by the Trade Marks and Merchandise Marks Act, may accordingly be used freely by anyone.

Production managed by Hal Henglein; manufacturing supervised by Jeffrey Taub.
Camera-ready copy prepared from the author's AMS-TeX files.
Printed and bound by Maple-Vail Book Manufacturing Group, York, PA.
Printed in the United States of America.

9 8 7 6 5 4 3 2 1

ISBN 0-387-94830-9 Springer-Verlag New York Berlin Heidelberg SPIN 10524399
PREFACE

The original version of this book was an attempt to present the basics of measure theory, as it relates to probability theory, in a form that might be accessible to students without the usual background in analysis. This set of notes was intended to be a "primer" for measure theory and probability — a cross between a workbook and a text — rather than a standard "manual". They were intended to give the reader a "hands-on" understanding of the basic techniques of measure theory, together with an illustration of their use in probability by way of an introduction to the laws of large numbers and martingale theory. The sections of the first five chapters not marked by an asterisk are the current version of the original set of notes and constitute the essential core of the book. Since the main goal of the core of this book is to be a "fast-track", self-contained approach to martingale theory for the reader without a solid technical background, many details are put into exercises, which the reader is advised to do if a basic mastery of the subject is the goal.

The present book retains the basic purpose and philosophy of the original set of notes but has been expanded to make it of more use to statisticians by adding a sixth chapter on weak convergence, which is an introduction to this very large circle of ideas as well as to the central limit theorem. In addition, having in mind its possible use by students of analysis, several sections in this book branch off from the central core to examine the Radon-Nikodym theorem, functions of bounded variation, Fourier series (very briefly), and differentiation theory and maximal functions. Measure theory for σ-finite measures reduces in many respects to the study of finite measures and hence (after normalization) to the study of probabilities. From this point of view, this book can serve as an introduction to measure theory.

Each chapter, except for Chapter V, concludes with a section of additional exercises in which various ancillary topics are explored. They may be omitted on first reading and are not used in the core of this text. Much of the material in the exercises may be found in the standard texts on measure theory or probability that are cited. The author is of the opinion that it is more useful to fill in details for many of these items than to read completed arguments, but is well aware that it is more time-consuming! Note that the additional exercises and other sections marked by asterisks may be omitted without prejudice to understanding the core of these notes.

There are six chapters: the first two chapters constitute the basic introduction, and the remaining four explain some uses of measure theory and integration in probability.
They have been written to be "self-contained", and as a result, some background information from analysis is scattered throughout. While this may no doubt appear to be excessive to the "well-prepared" reader, it is my hope that with this approach the technical details of the subject will be made accessible to the "well-motivated" reader who may not have all the appropriate background. In other words, I hope to empower readers who wish to learn the subject but who have not come to this point by a standard route. As in many things, a user-friendly attitude seems to me to be very desirable. Whether these notes exhibit that is obviously up to you, the reader, to judge.

As stated before, the main purpose of these notes is to explain and then make use of the main tools and techniques of the theory of Lebesgue integration as they apply to probability. To begin, Chapter I shows how to construct a probability measure on the σ-algebra 𝔅(R) of Borel subsets of the real line from a distribution function F. Then Chapter II defines integration with respect to a probability measure and proves the main theorems of integration theory (e.g., the theorem of dominated convergence). Also, it defines Lebesgue measure on the real line and explains the relation between the Lebesgue integral and the Riemann integral. These two chapters (except for §2.7 and §2.8 on the Radon-Nikodym theorem and functions of bounded variation, respectively) constitute the essential part of the central core of the book, and the reader is advised to be familiar with them before proceeding further.

Chapter III is concerned with independence and product measures. It deals mainly with finite products, although the existence of countable products of probability measures is proved following ideas of Kakutani (and independently Jessen) as an illustration of the use of measure-theoretic techniques. This is the first technically difficult result in these notes, and the reader can skip the details (as indicated) without any loss of continuity. The extension of these ideas to prove the existence of Markov chains associated with transition kernels on the real line is made in §3.5*.

Chapter IV discusses the main convergence concepts for random variables (i.e., measurable functions) and proves Khintchine's weak law and Kolmogorov's strong law of large numbers for sums of independent, identically distributed (i.i.d.) integrable random variables. The chapter closes with a discussion of uniform integrability and truncation.

The last chapter of the original version, Chapter V, covers conditional expectation and martingales. While sufficient statistics are discussed in §5.3* as an application of conditional expectation, the bulk of the chapter is an introduction to martingales. Doob's optional stopping theorem (finite case) and the martingale convergence theorem (for discrete martingales) are proved. The chapter concludes with Doob's proof of the Radon-Nikodym theorem on countably generated σ-algebras and the backward martingale proof of Kolmogorov's strong law.

The final chapter, Chapter VI, examines weak convergence and the central limit theorem.
It is independent of Chapter V and can be read after having worked through §4.1 and §4.3 of Chapter IV together with the beginning of §4.2. Tightness is emphasized, and an attempt is made to enhance the mathematical focus of the standard discussions on weak convergence. The discussion of uniqueness and the continuity theorem for the characteristic function follows Feller's and Lamperti's approach of using the Gaussian or normal densities. The central limit theorem is first discussed in the i.i.d. case and concludes with the case of a sequence or triangular array satisfying the Lindeberg condition.

The reader may find the references to standard texts and sources on probability distracting. In that case, please ignore them. They are there to encourage one to look at and explore other treatments, as well as to acknowledge what I have learned from them. In addition, if you are currently using one of them as a text, it may be helpful to have some connections pointed out. References are mainly to books listed in the Bibliography, and the occasional reference to an article is done with a footnote. The book by Loève [L2] has many references in it and will be helpful if the reader wants to explore the classical literature. For information and references on martingales, Doob's book [D1] has lots of historical notes in an appendix and an extensive bibliography. Since these notes emphasize the connection between analysis and probability, the reader wishing to go further into the subject should consult the recent book [S2] by Stroock, which has a decidedly analytic flavour. Finally, for a more abstract version of many of the topics covered in these notes and for detailed historical notes, the book by Dudley [D2] is recommended.

For several years since 1986, versions of these notes have been used at McGill University (successfully, from my perspective) to teach a one-semester course in probability to a mixed audience of statisticians and electrical engineers. Depending on the preparation of the class, most of the original version of this book was covered in a one-semester course. These notes have also been used to teach an introductory course on measure theory to honours students (in this case, the course was restricted to the first four chapters, with the sections on infinite product measures and Markov chains omitted as well as §4.5* and §4.6*).

I would like to thank all the students who have (unwittingly) participated in the process that has led to the creation of this set of notes; my wife, Brenda MacGibbon of UQAM, for her enthusiastic encouragement; and my colleagues Minggao Gu and Jim Ramsay of McGill University and Ahmid Bose of Carleton University for their support and encouragement, as well as Regina Liu of Rutgers and D. Phong of Columbia University. I thank Jal Choksi of McGill University and Miklós Csörgő of Carleton University for useful conversations on various points. In addition, I wish to thank all my colleagues and friends who have helped me in many, many ways over the years. I also wish to thank Mr. Masoud Asgharian and Mr. Jun Cai for their diligent search for misprints and errors, and Mr. Erick Lamontagne for help in final editing.
Finally, I also wish to thank my copy editor, Kristen Cassereau, and my editor at Springer, John Kimmel, for their patience and professional help.

J.C. Taylor
CONTENTS

Preface
List of Symbols

Chapter I. Probability Spaces
1. Introduction to R
2. What is a probability space? Motivation
3. Definition of a probability space
4. Construction of a probability from a distribution function
5. Additional exercises*

Chapter II. Integration
1. Integration on a probability space
2. Lebesgue measure on R and Lebesgue integration
3. The Riemann integral and the Lebesgue integral
4. Probability density functions
5. Infinite series again
6. Differentiation under the integral sign
7. Signed measures and the Radon-Nikodym theorem*
8. Signed measures on R and functions of bounded variation*
9. Additional exercises*

Chapter III. Independence and Product Measures
1. Random vectors and Borel sets in Rn
2. Independence
3. Product measures
4. Infinite products
5. Some remarks on Markov chains*
6. Additional exercises*

Chapter IV. Convergence of Random Variables and Measurable Functions
1. Norms for random variables and measurable functions
2. Continuous functions and Lp*
3. Pointwise convergence and convergence in measure or probability
4. Kolmogorov's inequality and the strong law of large numbers
5. Uniform integrability and truncation*
6. Differentiation: the Hardy-Littlewood maximal function*
7. Additional exercises*

Chapter V. Conditional Expectation and an Introduction to Martingales
1. Conditional expectation and Hilbert space
2. Conditional expectation
3. Sufficient statistics*
4. Martingales
5. An introduction to martingale convergence
6. The three-series theorem and the Doob decomposition
7. The martingale convergence theorem

Chapter VI. An Introduction to Weak Convergence
1. Motivation: empirical distributions
2. Weak convergence of probabilities: equivalent formulations
3. Weak convergence of random variables
4. Empirical distributions again: the Glivenko-Cantelli theorem
5. The characteristic function
6. Uniqueness and inversion of the characteristic function
7. The central limit theorem
8. Additional exercises*
9. Appendix*

Bibliography
Index
CHAPTER I

PROBABILITY SPACES

1. Introduction to R

An introduction to analysis usually begins with a study of properties of R, the set of real numbers. It will be assumed that you know something about them. More specifically, it is assumed that you realize that (i) they form a field (which means that you can add, multiply, subtract and divide in the usual way), and (ii) they are (totally) ordered (i.e., for any two numbers a, b ∈ R, either a < b, a = b, or a > b), and a > b if and only if b − a > 0. Note that a < b and 0 < c imply ac < bc; a < b implies a + c < b + c for any c ∈ R; 0 < 1 and −1 < 0.

Inside R lies the set Z of integers {..., −2, −1, 0, 1, 2, 3, ...} and the set N of natural numbers {1, 2, ...}. Also, there is a smallest field inside R that contains Z, namely Q, the set of rational numbers {p/q : p, q ∈ Z and q ≠ 0}. A real number that is not rational is said to be irrational.

It is often useful to view R as the points on a straight line (see Fig. 1.1).

[Fig. 1.1: the real line, with 0 marked to the left of 1]

Then a < b is indicated by placing a strictly to the left of b, if the line is oriented by putting 0 to the left of 1. The field Q is also totally ordered, but it differs from R in that there are "holes" in Q. To explain this, it is useful to make a definition.

Definition 1.1.1. Let A ⊂ R. A number b with the property that b ≥ a for all a ∈ A is said to be an upper bound of A. If b ≤ a for all a ∈ A, b is said to be a lower bound of A.

Example. Let A = {a ∈ R | a ≤ 1}. Then any number b ≥ 1 is an upper bound of A. Here there is a smallest or least upper bound (l.u.b.), namely 1.

Exercise 1.1.2. Let A = {a ∈ Q | a^2 < 2}. Show that if there is a least upper bound of A, it cannot belong to Q. [Hints: show that, for small p > 0, a^2 < 2 implies that (a + p)^2 < 2 and a^2 > 2 implies that (a − p)^2 > 2; conclude that if b is a l.u.b. of A, then b^2 = 2; show that there is no rational number b with b^2 = 2.]
One can view √2 ∈ R as a number that "fills a hole in Q". "Holes" exist because it is often possible to split the set Q into two disjoint subsets A, B so that (i) a ∈ A and b ∈ B implies that a < b; (ii) there is no l.u.b. of A in Q and no largest or greatest lower bound (g.l.b.) of B in Q. A decomposition of Q satisfying (i) is called a Dedekind cut of Q.

1.1.3. Axiom of the least upper bound. Every subset A of R that has an upper bound has a least upper bound.

This axiom is an extremely important property of R. It will be taken for granted that the real numbers exist and have this property. (Starting with suitable axioms for set theory, one can show that (i) there exist fields with the properties of R and (ii) any two such fields are isomorphic (i.e., are the "same"); see Halmos [H2].)

Exercise 1.1.4. (1) Show that N has no upper bound. [Hint: show that if b is an upper bound of N, then b − 1 is also an upper bound of N.]
(2) Let ε > 0 be a positive real number. Show that there is a natural number n ∈ N with 0 < 1/n < ε. [Hint: use (1) and the fact that multiplication by a positive number preserves an inequality.]

This last exercise shows that the ordered field R has the Archimedean property, i.e., for any number ε > 0 there is a natural number n with 0 < 1/n < ε.

If E and F are two sets, it will be taken for granted that the concept of a function f : E → F is understood. When E = N, a function f : N → F is also called a sequence of elements from F. The function f in this case is often denoted by (f(n))_{n≥1} or (f_n)_{n≥1}, where f(n) = f_n is the value of f at n. When the domain of n is understood, (f_n)_{n≥1} is often shortened to (f_n). Given a sequence (f_n)_{n≥1}, a subsequence of (f_n)_{n≥1} is a sequence of the form (f_{n_k})_{k≥1}, where n : N → N is a strictly increasing function k ↦ n_k, e.g., n_k = 2k for all k ≥ 1. If the original sequence is written as (f(n))_{n≥1}, a subsequence may be indicated as (f(n_k))_{k≥1} or (f(n(k)))_{k≥1}. The important thing is that in a subsequence one selects elements from the original sequence by using a strictly increasing function n.

Definition 1.1.5. A sequence (b_n)_{n≥1} of real numbers converges to B as n → +∞ if, for any positive number ε > 0, B − ε < b_n < B + ε for all sufficiently large n. That is, |b_n − B| < ε if n ≥ n(ε), where n(ε) is an integer depending on ε and the sequence (see Exercise 1.1.16 for the definition of |a|). This is denoted by writing B = lim_{n→+∞} b_n. A sequence (b_n)_{n≥1} of real numbers converges to +∞ if for any N ∈ N, b_n ≥ N for n ≥ n(N).
Exercise 1.1.6. Let (b_n)_{n≥1} be a non-decreasing sequence of real numbers, i.e., b_n ≤ b_{n+1} for all n ≥ 1. Show that
(1) the sequence converges,
(2) the limit is finite if and only if the sequence has an upper bound,
(3) when the limit is finite, it equals the l.u.b. of {b_n | n ≥ 1}.
Let (b′_n)_{n≥1} be another non-decreasing sequence with b_n ≤ b′_n for all n ≥ 1. Show that
(4) lim_n b_n ≤ lim_n b′_n.

Definition 1.1.7. Let (a_n)_{n≥0} be a sequence of real numbers. Let s_n = a_0 + a_1 + a_2 + ··· + a_n. The sequence (s_n)_{n≥0} is called the infinite series Σ_{n=0}^∞ a_n, and the series is said to converge if (s_n)_{n≥0} converges.

Exercise 1.1.8. Let (a_n)_{n≥0} be a sequence of positive real numbers. Show that
(1) the series Σ_{n=0}^∞ a_n converges if and only if {s_n | n ≥ 0} has an upper bound,
(2) if Σ_{n=0}^∞ a_n converges, then it converges to l.u.b. {s_n | n ≥ 0}.
Finally, show that
(3) if Σ_{n=0}^∞ a_n converges to a finite limit, then lim_{N→∞} Σ_{n=N}^∞ a_n = 0.

Exercise 1.1.9. Suppose one has a random variable X whose values are non-negative integers. Let (a_n)_{n≥0} be a sequence of positive real numbers. When can the probability that X is n equal c·a_n, where c is a fixed constant? What happens if a_n = 1/n!? (if a_n = 1/n, or a_n = 1/n^2, or a_n = 1/n^3?)

This exercise begs the question: what is a random variable? For the time being, think of it as a procedure that assigns probabilities to certain outcomes. The definition is given in Chapter II; see Definition 2.1.6.

Exercise 1.1.10. A random variable X has a Poisson distribution with mean 1 if the probability that X is n is e^{−1}/n! (see Feller [F1] for the Poisson distribution).

Proposition 1.1.11. (Exchange of order of limits). Let (b_{m,n})_{m,n≥1} be a double sequence of real numbers, i.e., a function b : N × N → R. Assume that
(1) m_1 ≤ m_2 ⇒ b_{m_1,n} ≤ b_{m_2,n} for all n ≥ 1,
(2) n_1 ≤ n_2 ⇒ b_{m,n_1} ≤ b_{m,n_2} for all m ≥ 1.
Then

lim_{n→+∞} (lim_{m→+∞} b_{m,n}) = lim_{m→+∞} (lim_{n→+∞} b_{m,n}) = lim_{n→+∞} b_{n,n},

where an increasing sequence has limit +∞ if it is unbounded.

Proof. By symmetry it suffices to verify the second equality. Now b_{m,m} ≤ b_{n,n} if m ≤ n, and so B = lim_{n→∞} b_{n,n} exists, as does B_m = lim_{n→∞} b_{m,n}.
Also, by Exercise 1.1.6, B_m ≤ B, as b_{m,n} ≤ b_{n,n} when n ≥ m. It follows from (1) and Exercise 1.1.6 that B_{m_1} ≤ B_{m_2} if m_1 ≤ m_2. Hence, B′ = lim_{m→∞} B_m exists and is less than or equal to B (see Exercise 1.1.6 again).

If B is finite, let ε > 0 and n(ε) be such that B − ε < b_{n,n} ≤ B if n ≥ n(ε). Let m = n(ε). Then B_m = lim_{n→∞} b_{m,n} ≥ b_{n(ε),n(ε)} > B − ε. Hence, B′ = B, as B′ ≥ B_m for all m. If B = +∞ and N ≤ b_{n,n} for n ≥ n(N), then B_m = lim_{n→∞} b_{m,n} ≥ N if m = n(N), and so B′ = +∞. □

Corollary 1.1.12. Σ_{i=1}^∞ (Σ_{j=1}^∞ a_{ij}) = Σ_{j=1}^∞ (Σ_{i=1}^∞ a_{ij}) if a_{ij} ≥ 0, 1 ≤ i, j.

Proof. Let b_{m,n} = Σ_{i=1}^m Σ_{j=1}^n a_{ij} and verify that the conditions of Proposition 1.1.11 are satisfied. □

Exercise 1.1.13. Decide whether Proposition 1.1.11 is valid when only one of (1) and (2) is assumed. In Corollary 1.1.12, what happens if a_{ii} = 1, a_{i,i+1} = −1 for all i ≥ 1, and all other a_{ij} = 0?

Exercise 1.1.14. (See Feller [F1], p. 267.) Let p_n ≥ 0 for all n ≥ 0 and assume Σ_{n=0}^∞ p_n = 1. Let m_r = Σ_{n=0}^∞ n^r p_n. Show that

Σ_{r=0}^∞ (m_r/r!) t^r = Σ_{n=0}^∞ p_n e^{nt}  for t ≥ 0.
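Exercise 1.1.14 is an instance of exchanging the order of summation (Corollary 1.1.12), and it lends itself to a numerical check. The sketch below is a minimal illustration in Python; the choice p_n = 2^{−(n+1)}, the value t = 1/2, and the truncation points are mine, purely for illustration.

    import math

    # A concrete choice of probabilities: p_n = 2^-(n+1), which sum to 1.
    def p(n):
        return 2.0 ** -(n + 1)

    N = 200   # truncation point for the sums over n
    R = 40    # truncation point for the sum over r
    t = 0.5

    # Right-hand side: sum_n p_n e^(nt).
    rhs = sum(p(n) * math.exp(n * t) for n in range(N))

    # Moments m_r = sum_n n^r p_n, then left-hand side sum_r m_r t^r / r!.
    def m(r):
        return sum((n ** r) * p(n) for n in range(N))

    lhs = sum(m(r) * t ** r / math.factorial(r) for r in range(R))

    print(lhs, rhs)   # the two truncated sums agree to many decimal places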
This brief discussion of properties of R concludes with a discussion of intervals.

Definition 1.1.15. A set I ⊂ R is said to be an interval if x < y < z and x, z ∈ I implies y ∈ I.

If an interval I has an upper bound, then I ⊂ (−∞, b], where b = l.u.b. I and (−∞, b] = {x ∈ R | x ≤ b}. If it also has a lower bound, then I ⊂ [a, b] = {x | a ≤ x ≤ b} if a = g.l.b. I. A bounded interval I — one having both upper and lower bounds — is said to be
(1) a closed interval when I = [a, b],
(2) an open interval when I = (a, b) = {x | a < x < b} (often denoted by ]a, b[),
(3) a half-open interval when I = (a, b] or [a, b), where (a, b] = {x | a < x ≤ b} and similarly [a, b) = {x | a ≤ x < b}. One also denotes (a, b] by ]a, b] and [a, b) by [a, b[.
An unbounded interval I is one of the following: (−∞, b) = {x | x < b}; (−∞, b] = {x | x ≤ b}; (a, +∞) = {x | a < x}; [a, +∞) = {x | a ≤ x}; or (−∞, +∞) = R.

Exercise 1.1.16. If x ∈ R, define |x| = x if x ≥ 0 and |x| = (−1)x if x < 0. Let a, b be any two real numbers. Show that
(1) |a + b| ≤ |a| + |b| (the triangle inequality);
(2) conclude that ||a| − |b|| ≤ |a − b| by two applications of the triangle inequality.
If a, b are two numbers, let a ∨ b denote their maximum, also denoted by max{a, b}, and a ∧ b denote their minimum, also denoted by min{a, b}. Define x^+ to be max{x, 0} = x ∨ 0 and x^− to be max{−x, 0} = (−x) ∨ 0. Show that
(3) x^− = −(x ∧ 0),
(4) x = x^+ − x^−,
(5) |x| = x^+ + x^−,
(6) a ∨ b = ½{a + b + |a − b|}, and
(7) a ∧ b = ½{a + b − |a − b|}.

Exercise 1.1.17. Verify the following statements for −∞ < a < b < +∞:
(1) (a, b) = ∪_{n=N}^∞ (a, b − 1/n], where b − a > 1/N;
(2) [a, b) = ∩_{n=N}^∞ (a − 1/n, b);
(3) [a, b] = ∩_{n=1}^∞ (a − 1/n, b + 1/n).
Show that if x_0 ∈ (a, b), then (x_0 − δ, x_0 + δ) ⊂ (a, b) for some δ > 0 (this implies that (a, b) is an open set; see Exercise 1.3.10).

Exercise 1.1.18. Let 0 < a < 1. If p > 0, show that a^{1/p} = e^{(log a)/p} is an increasing function of p and that lim_{p→∞} a^{1/p} = 1.

Exercise 1.1.19. Let R be the union of two disjoint intervals I_1 and I_2. Show that
(1) one of the two intervals is to the left of the other (either x_1 ∈ I_1 and x_2 ∈ I_2 always implies x_1 < x_2, or vice-versa), and
(2) if neither of these two intervals is the void set, then sup I_1 = inf I_2 if I_1 is to the left of I_2.
Let (I_n) be an increasing sequence of intervals. Show that
(3) if I = ∪_n I_n, then I is an interval.
Assume that each I_n above is unbounded above and bounded below, i.e., there is a_n ∈ R with (a_n, +∞) ⊂ I_n ⊂ [a_n, +∞). Assume that I has the same property. Show that
(4) if (a, +∞) ⊂ I ⊂ [a, +∞), then lim a_n = a.

For further information on the real line and general background information in analysis, consult Marsden [M1] or Rudin [R4].

2. What is a probability space? Motivation

A probability space can be viewed as something that models an "experiment" whose outcomes are "random" (whatever that means).
There are often "simple" or "elementary outcomes" in the model (as points in an underlying set Ω), and weights assigned to these outcomes that indicate the likelihood or probability of the outcome occurring (see Feller [F1]). The general outcome or "event" is often a collection of "elementary outcomes". For example, consider the following.

Example 1.2.1. The "experiment" consists of rolling a fair six-sided die two times. The "elementary outcomes" could be taken to be ordered pairs ω = (m, n), where m and n are integers from 1 to 6. The set Ω of elementary outcomes may be taken to be the set of all such ordered pairs (it is usual to denote this set as the Cartesian product {1, 2, ..., 6} × {1, 2, ..., 6}). The set of all events may be taken as the collection of all subsets of Ω, denoted by 𝒫(Ω). If each elementary outcome ω is assigned weight 1/36, then one may define the probability P(A) of an event A by P(A) = Σ_{ω∈A} P({ω}) = |A|/36, where |A| denotes the number of elements in A. If the die is not fair — if, for example, the probability of getting a 1, 2, 3, or a 4 differs from that of getting a 5 or a 6 — then the basic probabilities or weights of the elementary outcomes will need to be altered to correspond to the new situation.
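Since Ω is finite, Example 1.2.1 can be checked by brute force. The following sketch does this in Python; the particular events used ("the rolls sum to 7", "the rolls sum to 2") are my own illustrations, not part of the text.

    from itertools import product
    from fractions import Fraction

    # The sample space of Example 1.2.1: ordered pairs (m, n), 1 <= m, n <= 6.
    omega = list(product(range(1, 7), repeat=2))
    assert len(omega) == 36

    def P(A):
        # P(A) = |A|/36 for a fair die rolled twice
        return Fraction(len(A), len(omega))

    A = [w for w in omega if w[0] + w[1] == 7]   # the rolls sum to 7
    B = [w for w in omega if w[0] + w[1] == 2]   # the rolls sum to 2
    print(P(A))                                  # 1/6
    print(P(A + B) == P(A) + P(B))               # additivity on disjoint events: True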
For elementary situations as in Example 1.2.1, it suffices to consider a so-called finitely additive probability space. This is a triple (Ω, 𝔄, P), where Ω is a set (corresponding intuitively to the set of "elementary outcomes"), 𝔄 is a collection of subsets of Ω (the "events") with certain "algebraic" properties that make it into a Boolean algebra of subsets of Ω, and for each event A ∈ 𝔄 there is a number P(A) assigned that lies between 0 and 1 (the probability of the occurrence of A). More explicitly, to say that 𝔄 is a Boolean algebra means that the collection 𝔄 of subsets satisfies the following conditions:
(𝔄₁) Ω ∈ 𝔄;
(𝔄₂) A_1, A_2 ∈ 𝔄 implies that A_1 ∪ A_2 ∈ 𝔄; and
(𝔄₃) A ∈ 𝔄 implies that A^c ∈ 𝔄, where A^c = {ω ∈ Ω | ω ∉ A} = ∁A.
The statement that P is a probability means that it is a function defined on 𝔄 with the following properties:
(FAP₁) P(Ω) = 1;
(FAP₂) 0 ≤ P(A) ≤ 1; and
(FAP₃) A_1 ∩ A_2 = ∅ ⇒ P(A_1 ∪ A_2) = P(A_1) + P(A_2).
It is not hard to see that Example 1.2.1 is a finitely additive probability space. Some simple consequences of the properties of a Boolean algebra 𝔄 and a finitely additive probability defined on it are given in the next exercise.

Exercise 1.2.2. Show that
(1) A_1, A_2 ∈ 𝔄 implies that A_1 ∩ A_2 ∈ 𝔄 and A_1 ∩ A_2^c ∈ 𝔄 (one often denotes A_1 ∩ A_2^c by A_1∖A_2 or A_1 ∩ ∁A_2),
(2) P(∅) = 0,
(3) A_1 ⊂ A_2 implies that P(A_1) ≤ P(A_2),
(4) P(A_1 ∪ A_2) ≤ P(A_1) + P(A_2),
(5) P(A_1 ∪ A_2) + P(A_1 ∩ A_2) = P(A_1) + P(A_2),
(6) P(∪_{k=1}^n A_k) = Σ_{k=1}^n P(A_k ∖ ∪_{i=1}^{k−1} A_i) ≤ Σ_{k=1}^n P(A_k).

Remark 1.2.3. In Exercise 1.2.2 (6), a union of sets ∪_{k=1}^n A_k is converted to a disjoint union ∪_{k=1}^n A′_k, where A′_k = A_k ∖ ∪_{i=1}^{k−1} A_i. This is a standard device or trick that is often used, especially for countable unions A = ∪_{k=1}^∞ A_k.

Here is another example of an "experiment" with "random" outcomes.

Example 1.2.4. What probability space is it natural to use to discuss the probability of choosing a number at random from [0,1]? What is the probability of choosing a number from a given subinterval (a, b]? Clearly, one should take Ω to be [0,1] and 𝔄 to be the collection of finite unions of intervals contained in [0,1] (so that 𝔄 is a Boolean algebra containing intervals and their unions). Show that 𝔄 is a Boolean algebra. How do you define P on 𝔄?

Example 1.2.5. Continuing with the same probability space as in Example 1.2.4, suppose that one wants to discuss the probability of selecting at random a number x with the following property: it does not lie in (1/3, 2/3) — i.e., the middle third of [0,1] — nor does it lie in the middle third of either [0, 1/3] or [2/3, 1] — and so on, infinitely often. This describes a subset C of [0,1], an event. Look at the complementary event C^c: it is the disjoint union of middle third intervals;

C^c = (1/3, 2/3) ∪ [(1/9, 2/9) ∪ (7/9, 8/9)] ∪ [(1/27, 2/27) ∪ (7/27, 8/27) ∪ (19/27, 20/27) ∪ (25/27, 26/27)] ∪ ··· .

Let q be the probability of C^c. Then one sees that
(1) q ≥ 1/3,
(2) q ≥ 1/3 + 1/9 + 1/9 = 1/3 + 2/9,
(3) q ≥ 1/3 + 2/9 + 4/27,
(n) q ≥ 1/3 + 2/9 + 4/27 + ··· + 2^{n−1}/3^n.
To verify this, one makes use of the principle of mathematical induction, stated below.

The principle of mathematical induction. Let P(n) be a proposition or statement for each n ∈ N. If (1) P(1) is true and (2) P(n + 1) is true provided P(n) is true, then P(n) is true for all n. (This principle amounts to saying that if A ⊂ N is such that (1) 1 ∈ A and (2) n ∈ A implies n + 1 ∈ A, then A = N.)

Now 1/3 + 2/9 + 4/27 + ··· + 2^{n−1}/3^n is the nth partial sum of the series Σ_{n=0}^∞ (1/3)(2/3)^n. This is a geometric series with a_0 = 1/3 and r = 2/3, and so the sum is (1/3)(1 − 2/3)^{−1} = 1. Consequently, one expects the probability of choosing such an x from C to be zero!
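The partial sums in (n) are easy to compute exactly; a short sketch in Python (the stage counts chosen are arbitrary):

    from fractions import Fraction

    # q_n: total length removed from [0,1] after n stages of the
    # middle-thirds construction in Example 1.2.5.
    def q(n):
        return sum(Fraction(1, 3) * Fraction(2, 3) ** (k - 1) for k in range(1, n + 1))

    for n in (1, 2, 3, 10, 40):
        print(n, q(n), float(q(n)))
    # q_n = 1 - (2/3)^n increases to 1, so the Cantor set C should carry probability 0.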
Remark. The subset C of [0,1] described in Example 1.2.5 is called the Cantor set or Cantor discontinuum. It contains no interval with distinct end points. Why? What would happen if instead of extracting middle thirds one removed the middle quarter at each stage?

Note that neither the Cantor set C nor its complement is in the Boolean algebra 𝔄 defined in Example 1.2.4, as the set C^c is the union of an infinite number of open intervals each of which is in the Boolean algebra 𝔄: it can be shown that C^c cannot be expressed as a finite union of sets from 𝔄 — this has to do with the fact that each of the intervals involved is a connected component of C^c, i.e., for any point x in one of these intervals I, the largest open interval that contains it coincides with I (see Exercise 1.3.11).

Example 1.2.6. (Coin tossing) Suppose one tosses a fair coin until a head occurs. What is the probability of this event? To begin with, what could one take as Ω, the set of "elementary outcomes"? Take Ω to be N, where each n ≥ 1 corresponds to a finite string of length n that commences with n − 1 tails and concludes with a head. Here probabilities may be assigned to each integer n ≥ 1, namely 1/2^n for the string of length n. Since Σ_{n=1}^∞ 1/2^n = (1/2)(1 − 1/2)^{−1} = 1, this gives a probability space with 𝔄 taken to be all the subsets of Ω and P(A) defined to be Σ_{n∈A} 1/2^n. Therefore, the probability of first obtaining a head on an even-numbered toss is Σ_{k=1}^∞ 1/2^{2k} = (1/4)(1 − 1/4)^{−1} = 1/3.

In this example, you can verify that if A = ∪_{n=1}^∞ A_n, and A_n ∩ A_m = ∅ when n ≠ m, then P(A) = Σ_{n=1}^∞ P(A_n). This is clearly a desirable property of a probability, but how can it be obtained in the context of Example 1.2.4, where Ω = [0,1]?
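The computations in Example 1.2.6 can be reproduced numerically. A Python sketch (the truncation at 200 terms is an arbitrary choice of mine); it also checks the countable-additivity identity just mentioned on the partition of Ω into even and odd outcomes:

    from fractions import Fraction

    # Example 1.2.6: Omega = N, P({n}) = 2^-n (first head on toss n).
    def P_point(n):
        return Fraction(1, 2 ** n)

    # Probability that the first head occurs on an even-numbered toss.
    even = sum(P_point(n) for n in range(2, 200, 2))
    print(float(even))            # ~ 0.3333, i.e. 1/3

    # Countable additivity on the partition of Omega into evens and odds:
    odd = sum(P_point(n) for n in range(1, 200, 2))
    print(float(even + odd))      # ~ 1.0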
Example 1.2.7. Suppose X is a random variable with unit normal distribution, i.e., the probability that a < X ≤ b is

(1/√(2π)) ∫_a^b e^{−x^2/2} dx = P((a, b]).

What is the probability p that X takes values in the Cantor set? Imitating Example 1.2.5, one may compute the probability q of the complementary set. Then
(1) q ≥ P((1/3, 2/3)),
(2) q ≥ P((1/3, 2/3)) + P((1/9, 2/9)) + P((7/9, 8/9)), and
(n) q ≥ P((1/3, 2/3)) + P((1/9, 2/9)) + P((7/9, 8/9)) + ··· + P(((3^n − 2)/3^n, (3^n − 1)/3^n)).
Therefore, one expects the probability p to be

P((0, 1]) − lim_{n→∞} {P((1/3, 2/3)) + P((1/9, 2/9)) + P((7/9, 8/9)) + P((1/27, 2/27)) + P((7/27, 8/27)) + P((19/27, 20/27)) + P((25/27, 26/27)) + ··· + P(((3^n − 2)/3^n, (3^n − 1)/3^n))}.

At this point, it is not so clear what the answer should be here. In fact it is zero! (See Proposition 2.4.1.)
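One can watch this limit emerge numerically. The Python sketch below uses math.erf for the normal distribution function and computes the normal probability carried by the 2^n intervals of [0,1] remaining at stage n of the construction; by the formula above, p is the limit of these values. (The stage counts are arbitrary choices of mine.)

    import math

    def Phi(x):
        # the unit normal distribution function, via the error function
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    def remaining_mass(n):
        # normal probability of the 2^n intervals left after n removal stages
        intervals = [(0.0, 1.0)]
        for _ in range(n):
            new = []
            for a, b in intervals:
                third = (b - a) / 3.0
                new.append((a, a + third))      # keep the left third
                new.append((b - third, b))      # keep the right third
            intervals = new
        return sum(Phi(b) - Phi(a) for a, b in intervals)

    for n in (1, 5, 10, 15):
        print(n, remaining_mass(n))
    # The values decrease toward 0, consistent with p = 0 (Proposition 2.4.1).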
Exercise 1.2.8. Write a more explicit formula for the probability that the above random variable X takes values in the set C_n, which results after the procedure for constructing the Cantor set has been applied n times. This amounts to getting a handle on this set by writing an explicit description of the intervals involved in the removal process. [Hints: after completion of the nth stage of the procedure for constructing the Cantor set, one is left with 2^n intervals, each of length 1/3^n. They are all translates of the interval [0, 1/3^n]. To describe C_n, it suffices to determine the left hand endpoints of these intervals. One may do this by observing that, for each integer n, if k < 3^n, then k has a unique expression as k = Σ_{i=0}^{n−1} a_i 3^i with the a_i ∈ {0, 1, 2}. One may use this to write a triadic "decimal" expansion for the left hand endpoints of the intervals remaining at the nth stage. Show that they will be of the form 0.b_1b_2 ··· b_n with b_i ∈ {0, 2}, where 0.b_1b_2 ··· b_n = (1/3^n) Σ_{i=1}^n b_i 3^{n−i}. Show that one may obtain them from the 2^{n−1} left hand endpoints occurring at the (n − 1)st stage by first putting a zero in the "first position", i.e., by shifting the "decimal" over to the right by one place and inserting a zero, and then doing the same thing but this time inserting a two.]

What Examples 1.2.5, 1.2.6, and 1.2.7 hint at is the following: while one often starts to construct a probability using some "obvious" definition for certain simple sets (those in 𝔄), it is soon useful and necessary to try to extend the probability to more complicated sets that are made up from those of 𝔄 by infinite operations. In addition, the probability should not only be finitely additive (i.e., satisfy (FAP₃)), but also countably additive, as its computation will often involve infinite series.

3. Definition of a probability space

Definition 1.3.1. (Ω, 𝔉, P) is said to be a probability space if Ω is a set, 𝔉 is a σ-algebra of subsets of Ω, and P is a (countably additive) probability on 𝔉. To say that 𝔉 is a σ-algebra means that the collection 𝔉 of subsets of Ω satisfies
(𝔉₁) Ω ∈ 𝔉;
(𝔉₂) A ∈ 𝔉 implies A^c ∈ 𝔉; and
(𝔉₃) (A_n)_{n≥1} ⊂ 𝔉 implies ∪_{n=1}^∞ A_n ∈ 𝔉.
Furthermore, P is a function defined on 𝔉 that satisfies
(P₁) P(Ω) = 1;
(P₂) 0 ≤ P(A) ≤ 1 if A ∈ 𝔉; and
(P₃) (A_n)_{n≥1} ⊂ 𝔉 and A_n ∩ A_m = ∅ if n ≠ m implies P(∪_{n=1}^∞ A_n) = Σ_{n=1}^∞ P(A_n) (countable additivity).

Remarks. (1) A probability differs from a finitely additive probability by the important property of being countably additive (P₃) (also called σ-additive).
(2) A probability space is also a finitely additive probability space, since countable additivity (P₃) implies finite additivity (FAP₃).
(3) A σ-algebra is also called a σ-field.

In Example 1.2.4, the basic way of computing probability was to assign length to intervals contained in [0,1]. Corresponding to this is a natural function F, which determines these probabilities: define F(x) to be the length of [0, x] if 0 ≤ x ≤ 1. This function can be extended to all of R by setting F(x) = 0 when x < 0 and F(x) = 1 if 1 < x. This extended function can be used to assign a probability to any interval (a, b]: namely, P((a, b]) = F(b) − F(a). This probability is the length of (a, b] ∩ [0,1]. As a result, the "experiment" of Example 1.2.4 can also be modeled by using all the intervals of R as basic events, with the proviso that any interval disjoint from [0,1] has probability zero. The function F is an example of a distribution function (see Definition 1.3.5).

Whenever one has a probability space (Ω, 𝔉, P) with Ω = R and the σ-algebra 𝔉 contains every interval (−∞, x], then the probability determines a natural function in the same way: let F(x) = P((−∞, x]). This function has the properties (DF₁), (DF₂), and (DF₃) stated in the following proposition.

Proposition 1.3.2. The function F(x) = P((−∞, x]) associated with a probability P has the following properties:
(DF₁) x ≤ y implies F(x) ≤ F(y) (i.e., F is a non-decreasing function);
(DF₂) (i) lim_{x→+∞} F(x) = 1 (i.e., for any positive number ε, F(x) > 1 − ε if x is large enough and positive); and (ii) lim_{x→−∞} F(x) = 0 (i.e., for any positive number ε, F(x) < ε if x is negative and |x| is sufficiently large);
(DF₃) for any x, if (x_n) is a sequence that decreases to x, then the values F(x_n) decrease to F(x) (i.e., F(x) = lim_{n→∞} F(x_n)).

Proof. As an exercise, prove (DF₁) and (DF₂) using properties (P₁), (P₂), and (P₃). [For the first part of (DF₂) use Exercise 1.1.8 (3) and the fact that R = ∪_{n=−∞}^∞ (n, n + 1] to show that 1 = F(n) + Σ_{k=n}^∞ P((k, k + 1]).]

The property (DF₃) says that for any x, the values of F(x_n) as x_n approaches x from above (i.e., from the right) approach F(x). The technical formulation of this statement is that lim_{n→∞} F(x_n) = F(x) if x ≤ x_{n+1} ≤ x_n and lim_n x_n = x. Since F is non-decreasing, it suffices to show that for any x, F(x) = lim_n F(x + 1/n). Let A_n = (x, x + 1/n]. Then A_{n+1} ⊂ A_n and ∩_{n=1}^∞ A_n = ∅. Now A_1 = (x, x + 1] = ∪_{k=1}^∞ (x + 1/(k+1), x + 1/k] and A_n = ∪_{k=n}^∞ (x + 1/(k+1), x + 1/k]. Hence, P(A_1) = Σ_{k=1}^∞ P((x + 1/(k+1), x + 1/k]) ≤ 1, and so Σ_{k=n}^∞ P((x + 1/(k+1), x + 1/k]) = P(A_n) → 0 as n → +∞, since it is the tail end of a convergent series.
Now F(x + 1/n) = F(x) + P(A_n). Hence, lim_{n→∞} F(x + 1/n) = F(x). □

Exercise 1.3.3. Let (A_n)_{n≥1} be a sequence in 𝔉, where (Ω, 𝔉, P) is a probability space. Show that P(A_n) → 0 if (1) A_n ⊃ A_{n+1} for all n ≥ 1 and (2) ∩_{n=1}^∞ A_n = ∅.

The property (DF₃) of a distribution function is a reflection of the countable additivity of P. It can be reformulated by saying that F(x) is right continuous at every point x.

Definition 1.3.4. A function F : R → R has a right limit A at a point a ∈ R if, for any ε > 0, there exists a δ > 0, where δ = δ(a, ε), such that |F(x) − A| < ε whenever a < x < a + δ. The right limit A is denoted by F(a+). The function F is right continuous at a if F(a) = F(a+). Similarly, one defines the left limit F(a−) to be A if, for any ε > 0, there exists a δ > 0, where δ = δ(a, ε), such that |F(x) − A| < ε whenever a − δ < x < a, and F is left continuous at a if F(a) = F(a−).

Remark. When F is the distribution function of a probability, then F(a−) = P((−∞, a)) (see Exercise 1.4.14 (3)).

Definition 1.3.5. A function F : R → [0,1] is said to be a distribution function if
(1) it is non-decreasing and right continuous; and
(2) lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1.

A distribution function F determines the probability of certain sets, namely the probability of (a, b], where P((a, b]) = F(b) − F(a). This notion of probability can then be extended to the Boolean algebra 𝔄 generated by the intervals (a, b], where A ∈ 𝔄 if and only if A is a finite union of intervals of the form (a, b], with −∞ ≤ a ≤ b ≤ +∞, and by convention one takes (−∞, +∞] = R. Note that 𝔄 is the smallest Boolean algebra containing all the intervals (−∞, x], x ∈ R.

A priori, there is no reason why the probability P on 𝔄 determined in this way by a distribution function F should come from a probability on a σ-algebra 𝔉 containing the Boolean algebra 𝔄. This raises the following issue.

Basic Problem 1.3.6. Given a distribution function F on R, are there a σ-algebra 𝔉 ⊃ 𝔄 and a probability P on 𝔉 such that F is the distribution function of P?

Note that for such a probability P, it follows that P((a, b]) = F(b) − F(a) whenever a < b.
Remark. As will be seen later in Theorem 2.2.2, this basic problem has an important generalization: the distribution function F is replaced by any right continuous, non-decreasing function G on R. Then the question becomes: is there a σ-algebra 𝔉 ⊃ 𝔄, and is there a function μ on 𝔉 that behaves like a probability except that its value on Ω = R is not forced to be 1? For example, if G(x) = x for all x, then such a set function μ would compute the length of sets.

Returning to distribution functions, the rest of this chapter is devoted to the solution of Basic Problem 1.3.6: to show that every distribution function comes from a probability. As a first step, one has the following.

Exercise 1.3.7. Let 𝔄 be the collection of finite unions of intervals of the form (a, b], where −∞ ≤ a ≤ b ≤ +∞, and by convention one takes (−∞, +∞] = R. Verify that
(1) 𝔄 is a Boolean algebra (i.e., it satisfies (𝔄₁), (𝔄₂), (𝔄₃) of §1.2), and
(2) there is one and only one finitely additive probability P on 𝔄 such that P((a, b]) = F(b) − F(a) for all a < b.
Hints:
(1) Show that any A ∈ 𝔄 can be written as a finite pairwise disjoint union of intervals of the form (a, b]; note that the union of two intervals of this type is an interval of this type if they are not disjoint, and observe that (a, b] ∩ (c, d] is ∅ or (a ∨ c, b ∧ d].
(2) Convince oneself of (2) by showing that if A = ∪_{i=1}^n (a_i, b_i] with the intervals pairwise disjoint, then Σ_{i=1}^n {F(b_i) − F(a_i)} is not dependent on the particular way A is written as a disjoint union. For example, if A = (0,3] = (0,1] ∪ (1,3] = (0,2] ∪ (2,3], then P((0,3]) = F(3) − F(0) = {F(1) − F(0)} + {F(3) − F(1)} = P((0,1]) + P((1,3]). [Suggestion: given two disjoint unions, make up a "finer" disjoint union by using the second one to "cut up" all the intervals of the first.]

Convention 1.3.8. Until further notice, unless otherwise stated or the context makes it evident (see Definition 1.4.2), 𝔄 will denote the above Boolean algebra of finite unions of half-open intervals (a, b] ⊂ R.

Example 1.3.9. Here are three well-known distribution functions:
(1) for the uniform distribution on [0,1] (see Fig. 1.2),

F(x) = 0 if x < 0, x if 0 ≤ x ≤ 1, and 1 if 1 < x.
[Fig. 1.2: the graph of the uniform distribution function, rising linearly from (0, 0) to (1, 1)]

(2) for the unit normal distribution (see Fig. 1.3),

F(x) = (1/√(2π)) ∫_{−∞}^x e^{−t^2/2} dt.

[Fig. 1.3: the graph of the unit normal distribution function]

(3) for the Poisson distribution with mean one (see Fig. 1.4),

F(x) = 0 if x < 0, and F(x) = e^{−1} Σ_{0≤n≤x} 1/n! if 0 ≤ x.

[Fig. 1.4: the graph of the Poisson (mean one) distribution function — a step function with a jump of e^{−1}/n! at each integer n ≥ 0, taking the values 1/e, (1 + 1)/e, (1 + 1 + 1/2)/e, ... at x = 0, 1, 2, ...]
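All three distribution functions of Example 1.3.9 are easy to evaluate; the Python sketch below (math.erf is used for the normal case) also illustrates the right continuity (DF₃) of the Poisson distribution function at its jump at x = 1.

    import math

    def F_uniform(x):
        # the uniform distribution function on [0, 1]
        return 0.0 if x < 0 else (x if x <= 1 else 1.0)

    def F_normal(x):
        # the unit normal distribution function, via the error function
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    def F_poisson(x):
        # the Poisson (mean one) distribution function: a step function
        if x < 0:
            return 0.0
        return math.exp(-1.0) * sum(1.0 / math.factorial(n)
                                    for n in range(int(math.floor(x)) + 1))

    print(F_uniform(0.5), F_normal(0.0), F_poisson(2.0))   # 0.5 0.5 ~0.9197

    # Right continuity at the jump: F(1 + 1/n) stays at F(1) = 2/e, while
    # F(1 - 1/n) stays at the strictly smaller left limit F(1-) = 1/e.
    for n in (10, 100, 1000):
        print(F_poisson(1 + 1.0 / n), F_poisson(1 - 1.0 / n))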
Exercise 1.3.10. (a) Let Ω be a set and let ℰ be a collection of subsets of Ω. Show that there is a smallest σ-algebra of subsets of Ω that contains ℰ. It is called the σ-algebra generated by ℰ and will be denoted by σ(ℰ).
(b) Let Ω = R. Show that the smallest σ-algebra containing each of the following collections ℰ of subsets of R is the same:
(1) ℰ = {(a, b] | a < b};
(2) ℰ = {(a, b) | a < b};
(3) ℰ = {[a, b) | a < b};
(4) ℰ = {[a, b] | a < b};
(5) ℰ is the collection of open subsets G of R; and
(6) ℰ is the collection of closed subsets F of R.
The σ-algebra that results is called the σ-algebra of Borel subsets of R and is denoted by 𝔅(R).
[Hints for (1) to (4) in (b): make use of Exercise 1.1.17.]
[Hints for (2), (5), and (6) in (b): for the two other collections of sets, two definitions are needed: a subset G of R is said to be open if x_0 ∈ G implies that, for some ε > 0, (x_0 − ε, x_0 + ε) ⊂ G; a subset F is said to be closed if F^c is open. As a result, the σ-algebras generated by collections (5) and (6) coincide. To show that collections (2) and (5) generate the same σ-algebras, it is necessary to know the following fact: every open set can be written as a countable union of open intervals (a fact that is part of the next exercise).]
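When Ω is finite, the σ-algebra generated by a collection ℰ, as in Exercise 1.3.10 (a), can be computed outright, since closure under finite unions already gives closure under countable ones. A Python sketch (the function name and the sample collection are mine):

    from itertools import combinations

    def generated_algebra(omega, collection):
        # smallest family of subsets of a *finite* omega containing `collection`
        # and closed under complement and union
        omega = frozenset(omega)
        sets = {omega} | {frozenset(c) for c in collection}
        changed = True
        while changed:
            changed = False
            for a in list(sets):
                if omega - a not in sets:
                    sets.add(omega - a)
                    changed = True
            for a, b in combinations(list(sets), 2):
                if a | b not in sets:
                    sets.add(a | b)
                    changed = True
        return sets

    # sigma({ {1}, {2,3} }) on omega = {1,2,3,4}: the 8 unions of the
    # "atoms" {1}, {2,3}, {4}.
    for s in sorted(generated_algebra({1, 2, 3, 4}, [{1}, {2, 3}]),
                    key=lambda s: (len(s), sorted(s))):
        print(sorted(s))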
Exercise 1.3.11. (a) Let 𝒟 be the collection of open subsets of R. Show that
(1) R ∈ 𝒟,
(2) G_1, G_2 ∈ 𝒟 ⇒ G_1 ∩ G_2 ∈ 𝒟,
(3) the union of any collection of open sets is open.
(b) Show that if G is open and x_0 ∈ G, then there is a largest open interval (a, b) ⊂ G that contains x_0. [Hint: the union of a collection of intervals that contain a fixed point is itself an interval: recall Definition 1.1.15.]
(c) Let G be an open set and let x_1, x_2 ∈ G. Show that the largest open interval I_1 ⊂ G that contains x_1 either equals I_2, the largest open interval ⊂ G that contains x_2, or is disjoint from I_2.
(d) Let G be an open set. Show that G is a disjoint union of at most a countable number of open intervals. [Hint: suppose G ⊂ (−1, 1). Show that if G is expressed as a disjoint union of open intervals using (c), at most a finite number can have length > 1/n, where n ≥ 1 is any fixed positive integer.]

Comment. It is standard mathematical terminology to call a collection 𝒟 of sets a topology if it satisfies (1), (2), and (3) in the above exercise. The complements of open sets are defined to be the closed sets, and it follows from (3) that for any set A, the intersection of all the closed sets containing it is a closed set. It is called the closure of A and is usually denoted by Ā.

Digression on countable sets. It is about time to say what the word "countable" means. A set E is countable if all its elements can be labeled by natural numbers in a 1:1 way, i.e., if there is a function c : N → E such that (i) E = {c(n) | n ∈ N}, and (ii) c(n_1) = c(n_2) implies n_1 = n_2. A set is at most countable if it is either finite (i.e., it can be "counted" using {1, ..., n} for some n) or countable (i.e., it can be "counted" using N). Given two sets A and B, the Cartesian product A × B = {(a, b) | a ∈ A, b ∈ B}.

Proposition 1.3.12. If A and B are countable, then A × B is countable.

Proof. To begin with, A × B is clearly not finite. To "count" A × B is really the same as "counting" N × N. This can be viewed as a set in the plane. Figure 1.5 explains how to "count" or "enumerate" N × N. □

[Fig. 1.5: N × N drawn as a lattice of points (m, n) in the plane, enumerated by snaking through the successive diagonals m + n = 2, 3, 4, ..., starting from (1, 1)]

Proposition 1.3.13. Q is countable.

Proof sketch. It suffices to show that {q ∈ Q | q > 0} is countable. Why? Look at the diagram of N × N, and at each "site" (n, m) attach the rational n/m. It should then be clear how to count {q ∈ Q | q > 0}! □
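The diagonal enumeration of Fig. 1.5 and the counting of the positive rationals translate directly into code. A Python sketch (the order within each diagonal is one of the two possible choices):

    from fractions import Fraction

    # Enumerate N x N along the diagonals m + n = 2, 3, 4, ... (Prop. 1.3.12).
    def pairs():
        s = 2
        while True:
            for m in range(1, s):
                yield (m, s - m)
            s += 1

    gen = pairs()
    print([next(gen) for _ in range(10)])
    # [(1, 1), (1, 2), (2, 1), (1, 3), (2, 2), (3, 1), (1, 4), ...]

    # Counting {q in Q | q > 0} (Prop. 1.3.13): attach m/n to each site,
    # skipping values already seen so that the labelling is 1:1.
    def rationals():
        seen = set()
        for m, n in pairs():
            q = Fraction(m, n)
            if q not in seen:
                seen.add(q)
                yield q

    gen = rationals()
    print([str(next(gen)) for _ in range(8)])
    # ['1', '1/2', '2', '1/3', '3', '1/4', '2/3', '3/2']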
Proposition 1.3.14. (Cantor's diagonal argument) (0,1] and hence R is not countable.

Proof. If 0 < a ≤ 1, a has a decimal expansion a = 0.a_1a_2 ··· a_n ··· = Σ_{i=1}^∞ a_i/10^i. If one eliminates decimals that terminate in an unbroken string of zeros, this decimal expression is unique (explain!).

Assume that the numbers in (0,1] can be enumerated, and write them in a sequence using only decimals that fail to terminate in an unbroken string of zeros:

a_1 = 0.a_{11}a_{12}a_{13} ··· a_{1n} ···
a_2 = 0.a_{21}a_{22}a_{23} ··· a_{2n} ···
...
a_n = 0.a_{n1}a_{n2}a_{n3} ··· a_{nn} ···
...

Let a = 0.a′_{11}a′_{22}a′_{33} ··· a′_{nn} ···, where 0 ≠ a′_{nn} ≠ a_{nn} for each n. In other words, use the diagonal entries of the above infinite table to make a number in (0,1]. Then a is not in the list, since its decimal expansion fails to agree with that of any of the expansions for a_1, a_2, ... . This is a contradiction. Since (0,1] is not countable, R is not countable. □

This argument applies to any non-void interval since, for example, the function f(x) = (x − a)/(b − a) maps [a, b] in a 1:1 way onto [0,1]. As a result, any non-void open set is not countable, as it contains some interval [a, b] with a ≠ b. In particular, one has the following observation.

Exercise 1.3.15. Let C be any countable subset of R. Then the closure of R∖C equals R. Show this by observing that C contains the complement of the closure of R∖C, which is an open set. This property of the complement R∖C of C is referred to by saying it is dense. Show that a set D is dense in R if and only if every non-void open set contains a point of D.

Finally, one can verify the following fact.

Exercise 1.3.16. Let E_1, ..., E_n be a finite collection of disjoint countable sets. Then E = ∪_{i=1}^n E_i is countable. Let E_1, ..., E_n, ... be a countable collection of countable sets. Then E = ∪_{i=1}^∞ E_i is countable.

4. Construction of a probability from a distribution function

Now consider the basic problem of constructing a probability from a distribution function F. Exercise 1.3.10 (b) shows that if one is to get a probability space (R, 𝔉, P) from F with 𝔉 ⊃ 𝔄, then 𝔉 will have to contain the σ-algebra 𝔅(R) of all Borel subsets of R. Furthermore, if A ∈ 𝔄 and A = ∪_{n=1}^∞ A_n, where the A_n are in 𝔄 and pairwise disjoint — for example, (0,1] = ∪_{n=1}^∞ (1/(n+1), 1/n] — then it will be necessary (as shown later) that

P(A) = Σ_{n=1}^∞ P(A_n)

if there is to be a probability P on a σ-algebra 𝔉 ⊃ 𝔄. Recall that P is to be σ-additive.
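The decomposition (0,1] = ∪_n (1/(n+1), 1/n] makes this requirement concrete. The Python sketch below uses the unit normal distribution function as an assumed example of F and compares F(1) − F(0) with the partial sums over the disjoint pieces (compare Exercise 1.4.1, which follows).

    import math

    def F(x):
        # any distribution function will do; here, the unit normal
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    total = F(1.0) - F(0.0)                 # P((0, 1])
    partial = sum(F(1.0 / n) - F(1.0 / (n + 1)) for n in range(1, 100000))
    print(total, partial)                   # the partial sums converge to F(1) - F(0)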
Exercise 1.4.1. Show that F(1) − F(0) = Σ_{n=1}^∞ {F(1/n) − F(1/(n+1))}. [Hint: use Exercise 1.1.8 and the right continuity of F.]

Definition 1.4.2. A finitely additive probability P on a Boolean algebra 𝔄 is said to be σ-additive if A = ∪_{n=1}^∞ A_n implies

P(A) = Σ_{n=1}^∞ P(A_n),

when A is in 𝔄 and the sets A_n are all in 𝔄 and are pairwise disjoint.

A finitely additive probability P on 𝔄 need not be σ-additive, as the following example shows.

Example. Define P on 𝔄 by setting P(A) = 0 if A has an upper bound and P(A) = 1 if A has no upper bound (remember that 𝔄 is a special Boolean algebra — see Convention 1.3.8). Show that (R, 𝔄, P) satisfies conditions (𝔄₁), (𝔄₂), (𝔄₃), (FAP₁), (FAP₂), and (FAP₃) in §1.2. Show that it is not σ-additive. To see this, try to calculate P(R) as Σ_{n=−∞}^∞ P((n, n + 1]). Notice that this "probability" does not come from a distribution function. What is missing?

Returning to the problem of extending P from 𝔄 to 𝔅(R), it will now be shown that if P on 𝔄 is determined by a distribution function, then it is σ-additive.

Exercise 1.4.3. Show that P on 𝔄 is σ-additive if and only if (a, b] = ∪_{k=1}^∞ (c_k, d_k], with the (c_k, d_k] pairwise disjoint, implies

P((a, b]) = F(b) − F(a) = Σ_{k=1}^∞ {F(d_k) − F(c_k)}.

Theorem 1.4.4. Let F be a distribution function on R. Let P be the unique finitely additive probability on 𝔄 such that P((a, b]) = F(b) − F(a) whenever a < b. Then P is σ-additive on 𝔄.

Remark. This theorem may appear obvious to you in view of Exercise 1.4.3. In a way it should. However, its actual justification depends on Axiom 1.1.3.

Proof. By Exercise 1.4.3, it suffices to prove that if (a, b] = ∪_{k=1}^∞ (c_k, d_k] with the intervals (c_k, d_k] pairwise disjoint, then

(*) F(b) − F(a) = Σ_{k=1}^∞ {F(d_k) − F(c_k)}.

Now it is obvious that F(b) − F(a) ≥ Σ_{k=1}^n {F(d_k) − F(c_k)}, since (a, b] ⊃ ∪_{k=1}^n (c_k, d_k]. Hence, by Exercise 1.1.8, F(b) − F(a) ≥ Σ_{k=1}^∞ {F(d_k) − F(c_k)}.
Therefore, it is enough to verify the opposite inequality, and it is here that Axiom 1.1.3 and the right continuity of the distribution function F come into play.

First, one shows that it suffices to verify (*) when −∞ < a < b < +∞. Suppose, for example, that −∞ = a < b < +∞. Then, for large n, one has −n < b. If (−∞, b] = ∪_{k=1}^∞ (c_k, d_k] with the intervals (c_k, d_k] pairwise disjoint, then (−n, b] = ∪_{k=1}^∞ (c_k ∨ (−n), d_k]. If (*) holds when the endpoints are both finite, then

F(b) − F(−n) = Σ_{k=1}^∞ {F(d_k) − F(c_k ∨ (−n))} ≤ Σ_{k=1}^∞ {F(d_k) − F(c_k)}.

Since P((−∞, b]) = F(b) and F(−n) converges to zero as n tends to +∞, it follows that F(b) ≤ Σ_{k=1}^∞ {F(d_k) − F(c_k)}, and hence F(b) = Σ_{k=1}^∞ {F(d_k) − F(c_k)} (the opposite inequality is valid, as observed above). Similar arguments apply when −∞ < a < b = +∞ and when −∞ = a < b = +∞. This shows that the theorem holds provided it holds for intervals with finite endpoints.

Assume −∞ < a < b < +∞, and choose any positive number ε > 0. For each k, there is a positive number ε_k > 0 such that

F(d_k) ≤ F(d_k + ε_k) ≤ F(d_k) + ε/2^k,

because the distribution function F is increasing and right continuous. In other words, P((c_k, d_k]) ≤ P((c_k, d_k + ε_k]) ≤ P((c_k, d_k]) + ε/2^k. Therefore,

Σ_{k=1}^∞ {F(d_k + ε_k) − F(c_k)} = Σ_{k=1}^∞ P((c_k, d_k + ε_k]) ≤ Σ_{k=1}^∞ {F(d_k) − F(c_k)} + ε.

Note that (a, b] = ∪_{k=1}^∞ (c_k, d_k] ⊂ ∪_{k=1}^∞ (c_k, d_k + ε_k), which is an open set. Using the right continuity once again (at x_0 = a), there is a positive number ε′ with ε′ < b − a such that F(a + ε′) ≤ F(a) + ε. In other words,

P((a, b]) − ε ≤ P((a + ε′, b]).

Since [a + ε′, b] ⊂ (a, b] ⊂ ∪_{k=1}^∞ (c_k, d_k + ε_k), the closed interval [a + ε′, b] is contained in a countable union of open intervals. A very famous theorem (the Heine-Borel theorem) asserts that, as a result, the union of some finite number of the open intervals (c_k, d_k + ε_k) contains [a + ε′, b]: the proof will be given below, and it uses Axiom 1.1.3.
Assuming the validity of the Heine-Borel theorem, suppose that one has [a + ε′, b] ⊂ ∪_{k=1}^n (c_k, d_k + ε_k). Then (a + ε′, b] ⊂ ∪_{k=1}^n (c_k, d_k + ε_k] = A′ ∈ 𝔄, and so

P((a + ε′, b]) = F(b) − F(a + ε′) ≤ P(A′) ≤ Σ_{k=1}^n P((c_k, d_k + ε_k]) = Σ_{k=1}^n {F(d_k + ε_k) − F(c_k)}

(recall that P(A′) ≤ Σ_{k=1}^n P((c_k, d_k + ε_k]) by Exercise 1.2.2 (6)). The conclusion is then that

F(b) − F(a) − ε = P((a, b]) − ε ≤ P((a + ε′, b]) ≤ Σ_{k=1}^n {F(d_k + ε_k) − F(c_k)} ≤ Σ_{k=1}^∞ {F(d_k + ε_k) − F(c_k)} ≤ Σ_{k=1}^∞ {F(d_k) − F(c_k)} + ε.

Since ε is any positive number, the desired inequality is proved. □

For completeness, the key theorem that was used in the proof of this result will now be proved.

Theorem 1.4.5 (Heine-Borel). Let [a, b] be a closed, bounded interval in R. Assume that there is a collection of open intervals (a_i, b_i) whose union contains [a, b]. Then the union of some finite number of the given collection of open intervals also contains [a, b].

Proof. The idea of the proof is to see how large an interval [a, x] with a ≤ x ≤ b can actually be covered by a finite number of the open intervals (i.e., [a, x] ⊂ some finite union of these intervals). One knows that for some i, say i_1, a ∈ (a_{i_1}, b_{i_1}). So, if x = min{b, ½(a + b_{i_1})}, then [a, x] ⊂ (a_{i_1}, b_{i_1}).

Let H equal the set of x in the interval [a, b] such that [a, x] is contained in a finite union of the intervals. This is a set with an upper bound b. Note that if x ∈ H and a ≤ y ≤ x, then y ∈ H. Also, if d is an upper bound of H, then d < e ≤ b implies e ∉ H. By Axiom 1.1.3, H has an l.u.b. Call it c.

Exercise 1.4.6. (1) Show that [a, c] is contained in a finite union of the open intervals. [Hint: c is in some interval.] (2) Show that if [a, x] is contained in a finite union of the open intervals and x < b, then x is not an upper bound of H.

This exercise implies c = b, and so the theorem is proved. □
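The l.u.b. argument has a constructive flavour: scan [a, b] from left to right, at each step choosing an interval that contains the point reached so far. The Python sketch below (the greedy choice and the sample cover are my own, and a finite list of open intervals that really covers [a, b] is assumed) extracts a finite subcover.

    def finite_subcover(a, b, intervals):
        # scan from left to right, always picking an interval that contains
        # the current point and reaches furthest to the right
        chosen, x = [], a
        while True:
            c, d = max(((c, d) for (c, d) in intervals if c < x < d),
                       key=lambda iv: iv[1])
            chosen.append((c, d))
            if d > b:
                return chosen   # [a, b] is covered by the chosen intervals
            x = d               # everything below d is covered; continue from d

    cover = [(-0.1, 0.3), (0.2, 0.6), (0.5, 0.9), (0.8, 1.05), (0.25, 0.4)]
    print(finite_subcover(0.0, 1.0, cover))
    # [(-0.1, 0.3), (0.2, 0.6), (0.5, 0.9), (0.8, 1.05)]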
Remark. An equivalent form of the Heine-Borel theorem is the following result.

Theorem 1.4.7. Let (O_i)_{i∈I} be a family of open sets that covers the closed, bounded interval [a, b] ⊂ R (i.e., ∪_{i∈I} O_i ⊃ [a, b]). Then a finite number of the sets O_i covers [a, b] (i.e., for some finite set F ⊂ I, ∪_{i∈F} O_i ⊃ [a, b]).

Exercise 1.4.8. Show that Theorem 1.4.7 and Theorem 1.4.5 are equivalent.

The Heine-Borel theorem is so basic that the class of sets for which it is true is given a name.

Definition 1.4.9. A set K ⊂ R is said to be compact if, whenever K ⊂ ∪_{i∈I} O_i, each O_i open, there is a finite set F ⊂ I with K ⊂ ∪_{i∈F} O_i.

The Heine-Borel theorem states that every closed and bounded interval is compact. Given this theorem, it is not hard to show that a set K ⊂ R is compact if and only if K is closed and bounded (see Exercise 1.5.6). For example, the Cantor set is compact!

Returning again to the extension problem for P, it is clear that if A = ∪_{n=1}^∞ A_n, with the A_n ∈ 𝔄 for all n and A not necessarily in 𝔄, then A ∈ 𝔉 for any σ-algebra 𝔉 ⊃ 𝔄. Also, by Remark 1.2.3, A can be written as a disjoint union of sets in 𝔄: one replaces each A_n by A_n ∖ ∪_{i=1}^{n−1} A_i. Consequently, if there is an extension, P(A) = Σ_{n=1}^∞ P(A_n ∖ ∪_{i=1}^{n−1} A_i) ≤ Σ_{n=1}^∞ P(A_n). Without assuming the extension to be possible, one may define P*(A) to be the greatest lower bound of

{Σ_{n=1}^∞ P(A_n) | A = ∪_{n=1}^∞ A_n, A_n ∈ 𝔄 for all n}.

Then P*(A) is an estimate for the value P(A) of a possible extension when A ∈ 𝔄_σ, the collection of sets that are countable unions of sets from 𝔄. Since R = ∪_{n=−∞}^∞ (n, n + 1], every E ⊂ R is a subset of some set A ∈ 𝔄_σ. Hence if E ⊂ A, then one expects to have P(E) ≤ P*(A). This motivates the definition of the following set function P*. If E ⊂ R, define P*(E) to be the greatest lower bound of {P*(A) | E ⊂ A ∈ 𝔄_σ}, i.e.,

P*(E) = inf_{E ⊂ A ∈ 𝔄_σ} P*(A) = inf {Σ_{n=1}^∞ P(A_n) | E ⊂ ∪_{n=1}^∞ A_n, A_n ∈ 𝔄 for all n}.

Remark. The terms infimum (abbreviated to "inf") and supremum (abbreviated to "sup") are merely other words for "g.l.b." and "l.u.b.", respectively.

In general, the set function P* does not behave like a probability because (P₃) need not hold! However, it has certain important properties which make it into what is called an outer measure. They are stated in the next definition.
Definition 1.4.10. An outer measure on the subsets of $\mathbb{R}$ is a set function $P^*$ such that
(1) $0 \le P^*(E)$ for all $E \subset \mathbb{R}$,
(2) $E_1 \subset E_2$ implies $P^*(E_1) \le P^*(E_2)$, and
(3) $E = \bigcup_{n=1}^\infty E_n$ implies $P^*(E) \le \sum_{n=1}^\infty P^*(E_n)$ (i.e., it is countably subadditive).

Proposition 1.4.11. The set function $P^*$ defined above is an outer measure with $P^*(E) \le 1$ for all $E \subset \mathbb{R}$. Furthermore, since $P$ is $\sigma$-additive on $\mathfrak{A}$, $P(A) = P^*(A)$ for all $A \in \mathfrak{A}$.

Proof. Properties (1) and (2) of Definition 1.4.10 are obvious. Let $\epsilon > 0$ and, for each $n$, let $E_n \subset \bigcup_{k=1}^\infty A_{n,k}$ be such that $\sum_{k=1}^\infty P(A_{n,k}) \le P^*(E_n) + \frac{\epsilon}{2^n}$, where the sets $A_{n,k} \in \mathfrak{A}$. Then $E = \bigcup_{n=1}^\infty E_n \subset \bigcup_{n=1}^\infty \bigcup_{k=1}^\infty A_{n,k}$, which is a countable union of sets from $\mathfrak{A}$. Hence,

$$P^*(E) \le \sum_{n=1}^\infty \sum_{k=1}^\infty P(A_{n,k}) \le \sum_{n=1}^\infty \Big( P^*(E_n) + \frac{\epsilon}{2^n} \Big) = \sum_{n=1}^\infty P^*(E_n) + \epsilon.$$

If $A \in \mathfrak{A}$ is contained in $\bigcup_{n=1}^\infty A_n$, $A_n \in \mathfrak{A}$, $n \ge 1$, then $A = \bigcup_{n=1}^\infty A'_n$, where $A'_n = A \cap [A_n \setminus \bigcup_{k=1}^{n-1} A_k]$. Then, by the $\sigma$-additivity of $P$ on $\mathfrak{A}$ (Theorem 1.4.4), $P(A) = \sum_{n=1}^\infty P(A'_n) \le \sum_{n=1}^\infty P(A_n)$ as $A'_n \subset A_n$ for all $n \ge 1$. This shows that $P(A) \le P^*(A)$ and hence $P(A) = P^*(A)$ if $A \in \mathfrak{A}$. $\square$

Remark. The fact that $P$ and $P^*$ agree on $\mathfrak{A}$ is crucial in what follows. This is why it is so important that $P$ be $\sigma$-additive on $\mathfrak{A}$.

While $P^*$ is defined for all subsets of $\mathbb{R}$, it is not necessarily a probability. This raises the problem as to whether there is a natural class of sets on which it is a probability. The following way of solving this problem is due to a well-known Greek mathematician, C. Carathéodory. He observed that (i) the sets in any $\sigma$-algebra $\mathfrak{G}$ containing $\mathfrak{A}$ have a special property (see (C) below) provided the outer measure $P^*$ restricted to $\mathfrak{G}$ is a probability, and (ii) the collection of all sets with this property is in fact a $\sigma$-algebra, and the restriction of $P^*$ to this $\sigma$-algebra is a probability.

First, notice that because of property (3) of Definition 1.4.10, for any two sets $E$ and $Q$, one has $P^*(E) \le P^*(E \cap Q) + P^*(E \setminus Q)$. However, if $Q \in \mathfrak{A}$, then in fact

(C) $P^*(E) = P^*(E \cap Q) + P^*(E \cap Q^c)$, for any set $E$,

because $P$ and $P^*$ agree on $\mathfrak{A}$. To prove (C) for sets $Q \in \mathfrak{A}$, let $\epsilon > 0$ and let $A_n \in \mathfrak{A}$ for all $n \ge 1$ be such that $E \subset \bigcup_{n=1}^\infty A_n$ and $\sum_{n=1}^\infty P(A_n) \le P^*(E) + \epsilon$. These sets $A_n$ may be assumed to be pairwise disjoint by Remark 1.2.3. Further, $A_n \cap Q^c \in \mathfrak{A}$ for all $n \ge 1$, since $Q \in \mathfrak{A}$.
Now

$$P^*(E \cap Q) + P^*(E \cap Q^c) \le P^*\big(\bigcup_{n=1}^\infty (A_n \cap Q)\big) + P^*\big(\bigcup_{n=1}^\infty (A_n \cap Q^c)\big) \le \sum_{n=1}^\infty P(A_n \cap Q) + \sum_{n=1}^\infty P(A_n \cap Q^c) = \sum_{n=1}^\infty \{P(A_n \cap Q) + P(A_n \cap Q^c)\} = \sum_{n=1}^\infty P(A_n) \le P^*(E) + \epsilon.$$

Therefore, $P^*(E \cap Q) + P^*(E \cap Q^c) \le P^*(E)$, proving (C).

Now suppose that $\mathfrak{G} \supset \mathfrak{A}$ is a $\sigma$-algebra and that $P^*$ restricted to $\mathfrak{G}$ is a probability, say $R$. Then, since a $\sigma$-algebra is a Boolean algebra and $R$ is $\sigma$-additive on $\mathfrak{G}$, one could construct a new outer measure $R^*$ from $R$. As stated later in Exercise 1.5.4, in fact $R^* = P^*$. Therefore, it follows from what has just been proved for $\mathfrak{A}$ that condition (C) holds for all the sets in $\mathfrak{G}$.

This suggests that one should look at the collection $\mathfrak{F}$ of all sets for which condition (C) holds, i.e.,

$$\mathfrak{F} = \{Q \mid P^*(E) = P^*(E \cap Q) + P^*(E \cap Q^c) \text{ for any } E \subset \mathbb{R}\}.$$

It will now be shown that (i) $\mathfrak{F}$ is a $\sigma$-algebra (containing $\mathfrak{A}$), and (ii) $P^*$ restricted to $\mathfrak{F}$ is a probability. Hence, $(\mathbb{R}, \mathfrak{F}, P^*)$ is a probability space and $\mathfrak{F} \supset \mathfrak{A}$ (also, in view of Exercise 1.3.10 (b), $\mathfrak{F} \supset \mathcal{B}(\mathbb{R})$ — the $\sigma$-algebra of Borel sets). It remains to verify (i) and (ii).

To verify (i), first note that $Q = \mathbb{R} \in \mathfrak{F}$ and that $A \in \mathfrak{F}$ implies $A^c \in \mathfrak{F}$. If $A_1, A_2 \in \mathfrak{F}$, then $A_1 \cup A_2 \in \mathfrak{F}$. To see this, let $E \subset \mathbb{R}$. Then, by (C), one has

$$P^*(E) = P^*(E \cap A_1) + P^*(E \cap A_1^c).$$

Since, by (C),

$$P^*(E \cap A_1) = P^*(E \cap A_1 \cap A_2) + P^*(E \cap A_1 \cap A_2^c),$$

and, again by (C),

$$P^*(E \cap A_1^c) = P^*(E \cap A_1^c \cap A_2) + P^*(E \cap A_1^c \cap A_2^c),$$

this implies that

(1) $P^*(E) = P^*(E \cap A_1 \cap A_2) + P^*(E \cap A_1 \cap A_2^c) + P^*(E \cap A_1^c \cap A_2) + P^*(E \cap A_1^c \cap A_2^c)$.
Since

$$E \cap (A_1 \cup A_2) = (E \cap A_1 \cap A_2) \cup (E \cap A_1 \cap A_2^c) \cup (E \cap A_1^c \cap A_2),$$

it follows from (1) and Definition 1.4.10 (3) that

$$P^*(E) \ge P^*(E \cap (A_1 \cup A_2)) + P^*(E \cap A_1^c \cap A_2^c).$$

Hence, $A_1, A_2 \in \mathfrak{F}$ implies that $A_1 \cup A_2 \in \mathfrak{F}$. By now it should be clear that $\mathfrak{F}$ is a Boolean algebra.

Therefore, $\mathfrak{F}$ is a $\sigma$-algebra provided $\bigcup_{n=1}^\infty A_n \in \mathfrak{F}$ whenever the $A_n \in \mathfrak{F}$ are pairwise disjoint. To verify this, one has to show that for any set $E \subset \mathbb{R}$,

$$P^*(E) = P^*\big(E \cap (\bigcup_{n=1}^\infty A_n)\big) + P^*\big(E \cap (\bigcup_{n=1}^\infty A_n)^c\big),$$

when the $A_n \in \mathfrak{F}$ are pairwise disjoint. To do this, it will suffice to show that for all $n \ge 1$,

(2) $P^*(E) \ge \sum_{i=1}^n P^*(E \cap A_i) + P^*\big(E \cap (\bigcup_{n=1}^\infty A_n)^c\big)$.

The reason is that this inequality and Definition 1.4.10 (3) imply that

$$P^*(E) \ge \sum_{n=1}^\infty P^*(E \cap A_n) + P^*\big(E \cap (\bigcup_{n=1}^\infty A_n)^c\big) \ge P^*\big(E \cap (\bigcup_{n=1}^\infty A_n)\big) + P^*\big(E \cap (\bigcup_{n=1}^\infty A_n)^c\big).$$

In view of the validity of the opposite inequality, again by Definition 1.4.10 (3), one then has

$$P^*(E) = P^*\big(E \cap (\bigcup_{n=1}^\infty A_n)\big) + P^*\big(E \cap (\bigcup_{n=1}^\infty A_n)^c\big),$$

i.e., $\bigcup_{n=1}^\infty A_n \in \mathfrak{F}$.

Now (2) holds if

(3) $P^*(E) \ge \sum_{i=1}^n P^*(E \cap A_i) + P^*\big(E \cap (\bigcup_{i=1}^n A_i)^c\big)$.

Hence, to verify (2), it suffices to prove this last inequality (i.e., inequality (3)), as $(\bigcup_{i=1}^n A_i)^c \supset (\bigcup_{i=1}^\infty A_i)^c$. Note that (3) is equivalent by Definition 1.4.10 (3) to the identity

(4) $P^*(E) = \sum_{i=1}^n P^*(E \cap A_i) + P^*\big(E \cap (\bigcup_{i=1}^n A_i)^c\big)$.
To prove (4), let $A_1, \ldots, A_n$ be pairwise disjoint and in $\mathfrak{F}$. Then, by applying the defining property of $\mathfrak{F}$ first to $A_1$ using $E$, and then to $A_2$ using $E \cap A_1^c$ in place of $E$, it follows that

$$P^*(E) = P^*(E \cap A_1) + P^*(E \cap A_1^c) = P^*(E \cap A_1) + P^*(E \cap A_1^c \cap A_2) + P^*(E \cap A_1^c \cap A_2^c) = P^*(E \cap A_1) + P^*(E \cap A_2) + P^*(E \cap A_1^c \cap A_2^c),$$

since $A_1 \cap A_2 = \emptyset$. Hence, (4) holds for $n = 2$. Note that this follows immediately from formula (1) if $A_1 \cap A_2 = \emptyset$.

Assume that (4) is true for $n-1$ pairwise disjoint sets $A_i$, i.e., for any $E \subset \mathbb{R}$,

$$P^*(E) = \sum_{i=1}^{n-1} P^*(E \cap A_i) + P^*\big(E \cap (\bigcup_{i=1}^{n-1} A_i)^c\big).$$

Apply the defining property of $\mathfrak{F}$ to $A_n$ and use the set $E \cap (\bigcup_{i=1}^{n-1} A_i)^c$. It then follows that

$$P^*\big(E \cap (\bigcup_{i=1}^{n-1} A_i)^c\big) = P^*\big(E \cap (\bigcup_{i=1}^{n-1} A_i)^c \cap A_n\big) + P^*\big(E \cap (\bigcup_{i=1}^{n-1} A_i)^c \cap A_n^c\big) = P^*(E \cap A_n) + P^*\big(E \cap (\bigcup_{i=1}^n A_i)^c\big).$$

Therefore,

$$P^*(E) = \sum_{i=1}^n P^*(E \cap A_i) + P^*\big(E \cap (\bigcup_{i=1}^n A_i)^c\big).$$

This completes the proof of (i). The proof of (ii) is given below.

Definition 1.4.12. The $\sigma$-algebra $\mathfrak{F}$ of sets $Q$ that satisfy

(C) $P^*(E) = P^*(E \cap Q) + P^*(E \cap Q^c)$ for all $E \subset \mathbb{R}$

is called the $\sigma$-algebra of $P^*$-measurable sets.

The following theorem is the goal to which all these arguments have been leading.

Theorem 1.4.13. Let $F$ be a distribution function on $\mathbb{R}$.
(1) Then there is a unique probability $P$ on $\mathcal{B}(\mathbb{R})$, the $\sigma$-algebra of Borel subsets of $\mathbb{R}$, such that $P((a,b]) = F(b) - F(a)$ whenever $a < b$.
(2) The $\sigma$-algebra $\mathfrak{F}$ of $P^*$-measurable subsets contains $\mathcal{B}(\mathbb{R})$, and $P^*$ restricted to $\mathfrak{F}$ is a probability such that $P^*((a,b]) = F(b) - F(a)$ whenever $a < b$.
Furthermore, the $\sigma$-algebra $\mathfrak{F}$ of $P^*$-measurable sets is the largest $\sigma$-algebra $\mathfrak{G}$ containing $\mathfrak{A}$ with the property that the restriction of $P^*$ to $\mathfrak{G}$ is a probability.

Proof. First consider (2). It has been shown that $\mathfrak{F}$ is a $\sigma$-algebra. To prove that $P^*$ restricted to $\mathfrak{F}$ is a probability, it will suffice to verify $(P_3)$ of Definition 1.3.1 since $P^*$ obviously satisfies $(P_1)$ and $(P_2)$. Let $(A_n)_{n \ge 1} \subset \mathfrak{F}$ be pairwise disjoint. Since $P^*$ is an outer measure, it is countably subadditive, i.e., (3) of Definition 1.4.10 holds. Hence

$$P^*\big(\bigcup_{n=1}^\infty A_n\big) \le \sum_{n=1}^\infty P^*(A_n).$$

To prove the reverse inequality, note that $P^*(\bigcup_{n=1}^\infty A_n) \ge P^*(\bigcup_{i=1}^n A_i) = \sum_{i=1}^n P^*(A_i)$ for every $n$, in view of (4) (take $E = \bigcup_{i=1}^n A_i$); letting $n \to \infty$ gives the reverse inequality. This completes the proof of (2) and of (ii) above.

To prove the first statement, let $P_1$ and $P_2$ be two probabilities on $\mathcal{B}(\mathbb{R})$ such that $P_1((a,b]) = P_2((a,b]) = F(b) - F(a)$ whenever $a < b$. Let $\mathfrak{M} = \{A \in \mathcal{B}(\mathbb{R}) \mid P_1(A) = P_2(A)\}$. Then, by Exercise 1.3.7, $P_1$ and $P_2$ agree on $\mathfrak{A}$.

Exercise 1.4.14. Let $(\Omega, \mathfrak{F}, P)$ be a probability space and let $(A_n)_{n \ge 1} \subset \mathfrak{F}$. Show that
(1) if $A_n \subset A_{n+1}$ for all $n$, then $\lim_{n \to \infty} P(A_n) = P(\bigcup_{n=1}^\infty A_n)$,
(2) if $A_n \supset A_{n+1}$ for all $n$, then $\lim_{n \to \infty} P(A_n) = P(\bigcap_{n=1}^\infty A_n)$. [Hint: recall Exercise 1.3.3.]
If $F$ is the distribution function of a probability $P$ on $\mathcal{B}(\mathbb{R})$, show that
(3) $F(x-) = P((-\infty, x))$ for all $x \in \mathbb{R}$.

Exercise 1.4.15. Show that the set $\mathfrak{M}$ defined above has the following two properties (where $(A_n)_{n \ge 1} \subset \mathfrak{M}$):
(1) $A_n \subset A_{n+1}$ for all $n$ implies $\bigcup_{n=1}^\infty A_n \in \mathfrak{M}$; and
(2) $A_n \supset A_{n+1}$ for all $n$ implies $\bigcap_{n=1}^\infty A_n \in \mathfrak{M}$,
i.e., $\mathfrak{M}$ is a so-called monotone class.

Exercise 1.4.16. (Monotone class theorem for sets) Let $\mathfrak{M}$ be a monotone class that contains a Boolean algebra $\mathfrak{A}$. Show that $\mathfrak{M}$ contains the smallest $\sigma$-algebra $\mathfrak{F}$ containing $\mathfrak{A}$. [Hints: let $\mathfrak{M}_0$ be the smallest monotone class containing $\mathfrak{A}$ (why does it exist?). Let $A \in \mathfrak{A}$. Show that $\{B \in \mathfrak{M}_0 \mid B \cup A \in \mathfrak{M}_0\}$ is a monotone class. Conclude that $B \in \mathfrak{M}_0$, $A \in \mathfrak{A}$ imply $B \cup A \in \mathfrak{M}_0$. Now fix $B_1 \in \mathfrak{M}_0$ and look at $\{B \in \mathfrak{M}_0 \mid B_1 \cup B \in \mathfrak{M}_0\}$. Conclude as before that $B_1, B \in \mathfrak{M}_0$ implies $B_1 \cup B \in \mathfrak{M}_0$. Make a similar argument to show that $\{B^c \mid B \in \mathfrak{M}_0\}$ is $\mathfrak{M}_0$. Conclude that $\mathfrak{M}_0 \supset \mathfrak{F}$.]
The monotone class theorem has a function version, which is stated as Theorem 3.6.14. It gives conditions that ensure that a given collection $H$ of bounded functions on $\Omega$ contains all the bounded random variables in the $\sigma$-algebra determined by a subset $C$ of $H$ that is closed under multiplication.

From these three exercises and Exercise 1.3.10 (b), one concludes that $\mathfrak{M} = \mathcal{B}(\mathbb{R})$, as it contains $\mathfrak{A}$, and so (1) is established.

The last statement of the theorem is a consequence of Exercise 1.5.4 and the observation made following Proposition 1.4.11 that condition (C) holds for all the sets in $\mathfrak{G}$ if $\mathfrak{G}$ is a $\sigma$-algebra containing $\mathfrak{A}$ such that $P^*$ restricted to $\mathfrak{G}$ is a probability. $\square$

Remark 1.4.17. It is easily verified that the collection $\mathfrak{M}$ defined above has the following properties: (i) $\Omega \in \mathfrak{M}$ and (ii) if $A, B \in \mathfrak{M}$ with $A \subset B$, then $B \setminus A \in \mathfrak{M}$. Dynkin proved (see Proposition 3.2.6) that the smallest such system $\mathfrak{L}$ containing a collection $\mathfrak{C}$ of sets closed under finite intersections is the smallest $\sigma$-field containing $\mathfrak{C}$. This is another version of the monotone class theorem. It is proved in Exercise 3.2.5, part C, that any collection of sets $\mathfrak{L} \supset \mathfrak{C}$ satisfying (i) and (ii) also contains the smallest Boolean algebra $\mathfrak{A}$ containing $\mathfrak{C}$ if the collection $\mathfrak{C}$ is closed under finite intersections. Taking $\mathfrak{C}$ as $\{(a,b] \mid -\infty \le a \le b < +\infty\}$, Proposition 3.2.6 also proves the uniqueness of the probability $P$ on $\mathcal{B}(\mathbb{R})$ for which $P((a,b]) = F(b) - F(a)$.

5. Additional exercises*

Exercise 1.5.1. Let $(\Omega, \mathfrak{F}, P)$ be a probability space. Show that
(1) $P(\bigcup_{n=1}^\infty A_n) \le \sum_{n=1}^\infty P(A_n)$ (see Exercise 1.2.2 (6)); and
(2) $P(\bigcup_{n=1}^\infty A_n) = 0$ if $P(A_n) = 0$ for all $n \ge 1$.

Exercise 1.5.2. The so-called Heaviside function is the function $H$, where

$$H(x) = \begin{cases} 0, & x < 0, \\ 1, & x \ge 0. \end{cases}$$

Calculate for this distribution function the corresponding outer measure $P^*$, and determine the $\sigma$-algebra $\mathfrak{F}$ of $P^*$-measurable subsets of $\mathbb{R}$. [Hint: guess the $\sigma$-algebra $\mathfrak{F}$ and $P^*$. Then see if they "work".] The resulting measure, denoted by $\epsilon_0$ or $\delta_0$, is called the Dirac measure at the origin or unit point mass at the origin. Replace $H(x)$ by $H_a(x) \stackrel{\text{def}}{=} H(x - a)$, and let $\epsilon_a$ denote the resulting measure. Give the formula for $\epsilon_a(A)$, $A \in \mathfrak{F}$, where $\mathfrak{F}$ is the corresponding $\sigma$-algebra of measurable sets. The measure $\epsilon_a$ is the Dirac measure or unit point mass at $a$.

Exercise 1.5.3. For the distribution function of the Poisson distribution (see Example 1.3.9 (3)), calculate the corresponding outer measure $P^*$ and determine the $\sigma$-algebra $\mathfrak{F}$ of $P^*$-measurable subsets of $\mathbb{R}$.
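For Exercise 1.5.2 it may help to experiment before proving anything: a guessed set function can be encoded and sanity-checked. A minimal Python sketch (an illustration only, not the worked solution), with sets represented as membership predicates and the candidate set function charging a set 1 exactly when it contains the point $a$:

```python
def point_mass(a):
    """Candidate set function for the distribution H(x - a): all the
    mass sits at the point a, so a set is charged 1 exactly when it
    contains a.  Sets are represented by membership predicates."""
    return lambda Q: 1.0 if Q(a) else 0.0

eps0 = point_mass(0.0)
half_line = lambda x: x > -1.0        # the interval (-1, infinity)
middle = lambda x: 1/3 < x < 2/3      # an interval missing the origin
union = lambda x: half_line(x) or middle(x)

assert eps0(half_line) == 1.0 and eps0(middle) == 0.0
assert eps0(middle) <= eps0(union)                      # monotone
assert eps0(union) <= eps0(half_line) + eps0(middle)    # subadditive
```

Checks like these do not replace the verification asked for in the exercise, but they make the shape of the answer visible.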
Exercise 1.5.4. Let $\mathfrak{G} \supset \mathfrak{A}$ be a $\sigma$-algebra, and assume that $P^*$ restricted to $\mathfrak{G}$ is a probability $R$. Define the outer measure $R^*$ by setting $R^*(E) = \inf\{\sum_{n=1}^\infty R(B_n) \mid E \subset \bigcup_{n=1}^\infty B_n, B_n \in \mathfrak{G} \text{ for all } n\}$. Show that
(1) $R^*(E) = \inf\{R(B) \mid E \subset B, B \in \mathfrak{G}\}$,
(2) $R^*(E) \le P^*(E)$ for all $E \subset \mathbb{R}$, and
(3) $R^*(E) = P^*(E)$ for all $E \subset \mathbb{R}$. [Hint: $R(B) = P^*(B)$.]

Exercise 1.5.5. Show that $Q$ is $P^*$-measurable if and only if $P^*(Q) + P^*(Q^c) = 1$. This exercise may be done by going through the following steps.
(1) Let $Q, E$ be subsets of $\mathbb{R}$, and let $(A_m)$, $(B_n)$, and $(C_p)$ be pairwise disjoint collections of sets from $\mathfrak{A}$ such that (i) $E \subset \bigcup_p C_p = C$ and (ii) $Q \subset \bigcup_m A_m = A$, $Q^c \subset \bigcup_n B_n = B$. Show that

$$P^*(E \cap Q) + P^*(E \cap Q^c) \le \sum_{m,p} P(A_m \cap C_p) + \sum_{n,p} P(B_n \cap C_p) \le \sum_p P(C_p) + \sum_{m,n,p} P(A_m \cap B_n \cap C_p) \le \sum_p P(C_p) + \sum_{m,n} P(A_m \cap B_n).$$

[Hint: for the second inequality use Exercise 1.2.2 (5).]
(2) If $(A_m)$ and $(B_n)$ are two collections of pairwise disjoint sets from $\mathfrak{A}$ such that (i) $\mathbb{R} = A \cup B$, where $A = \bigcup_{m=1}^\infty A_m$ and $B = \bigcup_{n=1}^\infty B_n$, and (ii) $\sum_m P(A_m) + \sum_n P(B_n) \le 1 + \epsilon$, show that $\sum_{m,n} P(A_m \cap B_n) \le \epsilon$. [Hint: make use of Exercise 1.2.2 (5) to compute $P(A_m) + P(B_n)$ and use the fact that $\mathbb{R} = \bigcup_{m,n} A_m \cup B_n$.]
(3) If $P^*(Q) + P^*(Q^c) = 1$ and $E \subset \mathbb{R}$, show that for any $\epsilon > 0$ one has $P^*(E \cap Q) + P^*(E \cap Q^c) \le P^*(E) + \epsilon$.

Remarks. Result (2) says that $P(A \cap B) \le \epsilon$, i.e., the overlap of $A$ and $B$ is small (relative to $P$) if (i) and (ii) hold. It will help to realize that all the sets in $\mathfrak{A}_\sigma$ are Borel and that, for example, $\sum_m P(A_m) = P(A)$, where $P$ is the probability on $\mathcal{B}(\mathbb{R})$. It is also of some interest to realize that all the above computations can be done without knowing that $P$ on $\mathfrak{A}$ has an extension as a probability to $\mathcal{B}(\mathbb{R})$. As a result, the $P^*$-measurable sets $Q$ may be defined by the equation $P^*(Q) + P^*(Q^c) = 1$ and Theorem 1.4.13 proved using this definition (see Neveu [N1]).

Exercise 1.5.6. (The Bolzano-Weierstrass property) This exercise characterizes the compact subsets of $\mathbb{R}$.

Part A. Let $A$ be a subset of $\mathbb{R}$. Show that $A$ is compact if and only if it is closed and bounded. [Hints: if $A$ is closed and bounded, then for some
$N > 0$ one has $A \subset [-N, N]$; if $A \subset \bigcup_i O_i$, then $[-N, N] \subset \bigcup_i O_i \cup A^c$; now use the Heine-Borel Theorem 1.4.5. For the converse, observe $A \subset \bigcup_n (-n, n)$; and if $x_0 \in A^c$ and no open interval $(x_0 - \frac{1}{n}, x_0 + \frac{1}{n}) \subset A^c$, look at the open sets $[x_0 - \frac{1}{n}, x_0 + \frac{1}{n}]^c$.]

Part B. Let $A$ be a compact subset of $\mathbb{R}$, and let $(x_n)_{n \ge 1}$ be a sequence of points of $A$. Let $S = \{x_n \mid n \ge 1\}$. If the sequence has no convergent subsequence, show that
(1) $S$ is infinite [Hint: what happens if it is finite?],
(2) $S$ is closed [Hint: if $a \notin S$, show that some open interval centered at $a$ is disjoint from $S$; otherwise what happens?],
(3) for each point $s$ of $S$, there is an open interval centered at $s$ containing no other point of $S$. [Hint: start off with the first point.]
Show that (3) contradicts the assumption that $A$ is compact [use (2)]. Conclude that $(x_n)_{n \ge 1}$ has a convergent subsequence.

Part C. Let $A$ be a subset of $\mathbb{R}$, and assume that every sequence $(x_n)_{n \ge 1}$ of points in $A$ has a subsequence that converges to a point of $A$. Show that
(1) $A$ is closed [Hint: consider the hint for Part B (2)],
(2) $A$ is bounded. [Hint: if a sequence converges, it is bounded.]

This exercise emphasizes the importance of the Bolzano-Weierstrass property (see Royden [R3]) for a set: every sequence (in the set) has a convergent subsequence (convergent to a point in the set). It is equivalent to the property of compactness.
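The bisection argument that usually proves the Bolzano-Weierstrass property can be imitated numerically. The sketch below is a rough illustration only (the finite forward searches stand in for "infinitely many terms", and all names are hypothetical): it halves the ambient interval repeatedly, each time recording the index of a term trapped in the surviving half.

```python
def bw_subsequence(x, lo, hi, steps=12):
    """Bisection sketch of Bolzano-Weierstrass for a sequence x(0), x(1), ...
    with values in [lo, hi]: repeatedly halve the interval, preferring the
    left half when a forward search finds a term there, and record the
    index found.  A finite search approximates "infinitely many terms"."""
    indices, last = [], 0
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        window = range(last + 1, last + 200_000)
        n = next((k for k in window if lo <= x(k) <= mid), None)
        if n is not None:
            hi = mid                     # keep the left half
        else:
            n = next(k for k in window if mid < x(k) <= hi)
            lo = mid                     # keep the right half
        indices.append(n)
        last = n
    return indices

seq = lambda n: (-1) ** n + 1.0 / (n + 1)     # bounded in [-1, 2]
idx = bw_subsequence(seq, -1.0, 2.0)
print([round(seq(n), 5) for n in idx[-4:]])   # terms piling up near -1
```

For a proof one must keep a half that contains infinitely many terms; the finite look-ahead here can in principle choose badly, which is exactly the gap the rigorous argument closes.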
CHAPTER II

INTEGRATION

1. Integration on a probability space

In elementary probability, if $\Omega$ is finite, say $\Omega = \{1, 2, \ldots, n\}$, $\mathfrak{F}$ is the collection of all subsets of $\Omega$, and if $P$ gives weight $a_i \ge 0$ to $\{i\}$ with $\sum_{i=1}^n a_i = 1$, then the (mathematical) expectation $E[X]$ of a random variable $X$ on $\Omega$ (i.e., a function $X : \Omega \to \mathbb{R}$) is defined to be $\sum_{i=1}^n X(i) a_i = \sum_{i=1}^n X(i) P(\{i\})$. When $a_i = 1/n$ for each $i$, this expectation is the usual average value of $X$. Heuristically, this number is what we expect as the average of a large number of "observations" of $X$ — see the weak law of large numbers in Chapter IV. Also, if $X$ is non-negative, then $E[X]$ can be conceived as the "area" under the graph of $X$: over each $i$ one may imagine a rectangle of width $P(\{i\})$ and height $X(i)$; then $\sum_{i=1}^n X(i) P(\{i\})$ is the sum of the "areas" of the rectangles that make up the set under the graph.

Now let $(\Omega, \mathfrak{F}, P)$ be an arbitrary probability space and let $X$ be a function on $\Omega$. For what functions $X$ can an expectation $E[X]$ or average be defined? In other words, what functions $X$ can be integrated or summed against $P$?

For example, let $\Omega = \mathbb{R}$, $\mathfrak{F} = \mathcal{B}(\mathbb{R})$, and $P$ be the uniform distribution on $[0,1]$ (see Example 1.3.9 (1)). Recall that the distribution function $F$ is given by

$$F(x) = \begin{cases} 0, & x < 0, \\ x, & 0 \le x \le 1, \\ 1, & 1 < x. \end{cases}$$

Let $X(x) = \sin x$ if $x \in C$, the Cantor set (see the remark following Example 1.2.5), and $X(x) = 0$ if $x \notin C$. Can one integrate $X$ using $P$? The answer is "yes", and it turns out that the value of the integral is zero since $P(C) = 0$ (see the calculation in Example 1.2.5).

The basic idea in integration is that if a function $X$ takes a constant value $a$ on a set $A$ in $\mathfrak{F}$ and is zero elsewhere, then $\int X \, dP$ should be $aP(A)$ (i.e., when $a > 0$, it should be the "area" of the picture in $\Omega \times \mathbb{R}$ shown in Fig. 2.1, where $A$ is indicated by a heavy line).

[Fig. 2.1: a rectangle of height $a$ over the set $A$.]
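On a finite $\Omega$, the elementary expectation defined above is directly computable. A minimal sketch (names are illustrative) of $E[X] = \sum_i X(i) a_i$:

```python
def expectation(X, weights):
    """E[X] = sum of X(i) * a_i over a finite sample space, where
    `weights` maps each outcome i to its probability a_i >= 0."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(X(i) * a for i, a in weights.items())

# A fair die: a_i = 1/6, and with X the identity, E[X] is the plain average.
die = {i: 1.0 / 6.0 for i in range(1, 7)}
print(expectation(lambda i: i, die))        # -> 3.5 (up to rounding)
print(expectation(lambda i: i * i, die))    # -> 91/6, the second moment
```

Each term $X(i)P(\{i\})$ is one of the rectangles in the "area" picture described above.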
Just as a distribution function $F$ immediately determines a probability $P$ on simple sets, namely the intervals $(a,b]$, so does a probability (more generally, a measure) determine an obvious notion of integral for certain simple functions. In both instances, the problem arises as to how to extend to more general situations.

Define the characteristic or indicator function of a set $A$ to be the function that equals 1 on $A$ and 0 on $A^c$. In these notes, it will be denoted by $1_A$. (It is also often denoted by $\chi_A$.)

Definition 2.1.1. A function $s : \Omega \to \mathbb{R}$ is said to be a simple function or to be $\mathfrak{F}$-simple if it can be written as $s = \sum_{n=1}^N a_n 1_{A_n}$ with $a_n \ge 0$, and the sets $A_n \in \mathfrak{F}$ pairwise disjoint. Let $\mathcal{S}^+$ denote the class of simple functions.

Proposition 2.1.2. Let $s$ be a simple function, and assume

$$s = \sum_{n=1}^N a_n 1_{A_n} = \sum_{m=1}^M b_m 1_{B_m},$$

where the collections of sets $A_n$, $1 \le n \le N$, and $B_m$, $1 \le m \le M$, are both pairwise disjoint. Then

$$\sum_{n=1}^N a_n P(A_n) = \sum_{m=1}^M b_m P(B_m).$$

Proof. Note that $s$ has a "natural" representation as $\sum_{n=1}^N a_n 1_{A_n}$, with the $A_n$ pairwise disjoint and the $a_n$ all distinct. To obtain it, consider the sets $\{\omega \mid s(\omega) = a\}$ as $a$ varies over $\mathbb{R}$. If $s$ is expressed as $\sum_{m=1}^M b_m 1_{B_m}$, then

$$\sum_{m=1}^M b_m P(B_m) = \sum_{n=1}^N a_n \Big\{ \sum_{m,\, b_m = a_n} P(B_m) \Big\} = \sum_{n=1}^N a_n P(A_n)$$

as $\bigcup_m \{B_m \mid b_m = a_n\} = A_n$. $\square$

Remark. This result corresponds to Exercise 1.3.7 (2), for the probability $P$ defined by a distribution function.

Definition 2.1.3. Let $s \in \mathcal{S}^+$. Define the integral or expectation of $s$ with respect to $P$ to be $\sum_{n=1}^N a_n P(A_n)$ if $s = \sum_{n=1}^N a_n 1_{A_n}$, the sets $A_n \in \mathfrak{F}$ and pairwise disjoint. This number will be variously denoted by $E[s]$, $\int s \, dP$, or $\int s(\omega) P(d\omega)$.
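When $P$ is computable on the sets $A_n$, Definition 2.1.3 is itself a one-line computation. A small sketch (assumed representations: sets as intervals, with $P$ supplied by the caller):

```python
def integral_simple(terms, P):
    """Integral of the simple function s = sum of a_n * 1_{A_n} with the
    A_n pairwise disjoint: E[s] = sum of a_n * P(A_n).  `terms` is a list
    of pairs (a_n, A_n) in whatever set representation P accepts."""
    return sum(a * P(A) for a, A in terms)

def uniform01(interval):
    """P((a, b]) for the uniform distribution on [0, 1]: the length of
    the part of (a, b] lying inside [0, 1]."""
    a, b = interval
    return max(0.0, min(b, 1.0) - max(a, 0.0))

s = [(2.0, (0.0, 0.25)), (5.0, (0.5, 0.75))]  # 2 on (0, 1/4], 5 on (1/2, 3/4]
print(integral_simple(s, uniform01))          # 2(1/4) + 5(1/4) = 1.75
```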
Remark. Proposition 2.1.2 shows that this definition makes sense, i.e., the value of $E[s]$ does not depend on the particular representation of $s$.

Proposition 2.1.4. (Properties of simple functions and their integrals) Let $s, s_1, s_2$, etc., denote simple functions, i.e., elements of $\mathcal{S}^+$. The collection $\mathcal{S}^+$ has the following properties:
$(S_1)$ for all $\lambda \ge 0$, $\lambda s \in \mathcal{S}^+$ and $\int (\lambda s) \, dP = \lambda (\int s \, dP)$;
$(S_2)$ $s_1 + s_2 \in \mathcal{S}^+$ and $\int (s_1 + s_2) \, dP = \int s_1 \, dP + \int s_2 \, dP$;
$(S_3)$ $s_1 \le s_2$ implies $\int s_1 \, dP \le \int s_2 \, dP$, where $s_1 \le s_2$ means that for all $\omega$, $s_1(\omega) \le s_2(\omega)$;
$(S_4)$ $s_1, s_2 \in \mathcal{S}^+$ implies $s_1 \wedge s_2$ and $s_1 \vee s_2 \in \mathcal{S}^+$, where $(s_1 \wedge s_2)(\omega) = \min\{s_1(\omega), s_2(\omega)\}$ and $(s_1 \vee s_2)(\omega) = \max\{s_1(\omega), s_2(\omega)\}$; and
$(S_5)$ if $(s_n)$ is a sequence of $\mathfrak{F}$-simple functions with $s_n \le s_{n+1} \le s$ for all $n$, where $s$ is $\mathfrak{F}$-simple and such that $s = \lim_n s_n$ (i.e., $\lim_n s_n(\omega) = s(\omega)$ for all $\omega$), then $\lim_n \int s_n \, dP = \int s \, dP$.

Proof. $(S_1)$ is obvious. It suffices to verify $(S_2)$ for $s_1 = a 1_A$ and $s_2 = \sum_{n=1}^N a_n 1_{A_n}$. Then $s_1 + s_2 = a 1_B + \sum_{n=1}^N \{a_n 1_{A_n \setminus A} + (a_n + a) 1_{A_n \cap A}\}$, where $B = A \setminus \bigcup_{n=1}^N A_n$.

Exercise. Use this expression for $s_1 + s_2$ when $s_1 = a 1_A$ to verify that in this case

$$\int (s_1 + s_2) \, dP = \int s_1 \, dP + \int s_2 \, dP.$$

Explain why it is enough to prove $(S_2)$ in this case.

To prove $(S_3)$, let $s = s_2 - s_1$. Then $s \in \mathcal{S}^+$, as can be seen by writing a formula for $s$ if $s_2 = \sum_{n=1}^N a_n 1_{A_n} \ge s_1 = \sum_{k=1}^K b_k 1_{B_k}$.

Exercise. Determine the desired formula. First, show that $\bigcup_{n=1}^N A_n \supset \bigcup_{k=1}^K B_k$ and then use the $B_k$ to "cut up" each $A_n$ and use this observation to define $s$.

Hence, $\int s_2 \, dP = \int s_1 \, dP + \int s \, dP \ge \int s_1 \, dP$.

To show $(S_4)$, first note that if $s = \sum_{n=1}^N a_n 1_{A_n}$ with the sets $A_n$ pairwise disjoint, then $s = \vee_{n=1}^N a_n 1_{A_n}$ (i.e., $s(\omega) = \max\{a_n 1_{A_n}(\omega) \mid n = 1, \ldots, N\}$).

Exercise. Use this observation to verify that if $s_1, s_2 \in \mathcal{S}^+$, then $s_1 \vee s_2 \in \mathcal{S}^+$ (first let $s_1 = a 1_A$ and $s_2 = \sum_{n=1}^N a_n 1_{A_n}$).
Exercise. Let $a$ and $b$ be any two real numbers. Show that $a + b = a \vee b + a \wedge b$, where $a \vee b = \max\{a,b\}$ and $a \wedge b = \min\{a,b\}$.

$(S_4)$ follows from the above two exercises and from the first part of $(S_2)$.

The most difficult part of the proposition — and also the most important — is the last part. To prove $(S_5)$, first assume $s = 1_A$. Choose $m \ge 1$, and let $A_{n,m} = \{\omega \mid s_n(\omega) > 1 - 1/m\}$. Since $s_n \uparrow 1_A$ (i.e., the functions $s_n$ increase to $1_A$), the sets $A_{n,m}$ increase with $n$: $A_{n,m} \subset A_{n+1,m}$ and $\bigcup_{n=1}^\infty A_{n,m} = A$.

Exercise. Verify this statement.

Returning to the proof, note that $(1 - 1/m) 1_{A_{n,m}} \le s_n \le 1_A$. Hence, $(1 - 1/m) P(A_{n,m}) \le \int s_n \, dP \le P(A)$. From the next exercise and Exercise 1.4.14, it follows that $\int s_n \, dP \to P(A) = \int s \, dP$. This proves $(S_5)$ in case $s = a 1_A$.

Exercise 2.1.5. (Properties of limits: see Definition 1.1.5) Let $(a_n)$ have limit $A$ and $(b_n)$ have limit $B$. Show that
(1) $(a_n + b_n)$ has limit $A + B$ (i.e., $\lim_n (a_n + b_n) = A + B$), and
(2) $(a_n b_n)$ has limit $AB$ (i.e., $\lim_n a_n b_n = AB$).
Suppose that $A = B$ and that $(x_n)$ is such that $a_n \le x_n \le b_n$. Show that
(3) $(x_n)$ has limit $A$.

Now consider $s = \sum_{k=1}^K a_k 1_{A_k}$, where the $A_k$ are pairwise disjoint. Since $0 \le s_n \le s$, it follows that $s_n = \sum_{k=1}^K s_n 1_{A_k}$, where $(s_n 1_{A_k})(\omega) = s_n(\omega) 1_{A_k}(\omega)$. Now $\int s_n \, dP = \sum_{k=1}^K \int s_n 1_{A_k} \, dP$ and, by what has been proved, $\lim_{n \to \infty} \int s_n 1_{A_k} \, dP = a_k P(A_k)$. Therefore, $\lim_{n \to \infty} \int s_n \, dP$ exists and equals $\sum_{k=1}^K a_k P(A_k)$, which by definition is $\int s \, dP$. $\square$

The next definition determines the class of functions on $(\Omega, \mathfrak{F}, P)$ that can be nicely approximated by simple functions (see Proposition 2.1.11 $(RV_7)$). One defines the integral of such a function as the limit of the integrals of the approximating functions.

Definition 2.1.6. A function $X : \Omega \to \mathbb{R} \cup \{\pm\infty\}$ such that, for all $\lambda \in \mathbb{R}$, $\{\omega \mid X(\omega) \le \lambda\} \in \mathfrak{F}$ is called a random variable on the probability space $(\Omega, \mathfrak{F}, P)$ or a measurable function on $(\Omega, \mathfrak{F}, P)$ ($\mathfrak{F}$-measurable if the $\sigma$-algebra $\mathfrak{F}$ needs to be indicated).

Recall that one assumes $-\infty < x < +\infty$ for all $x \in \mathbb{R}$.

Remarks. (1) The above concept can be extended to functions on a measurable space, i.e., a pair $(\Omega, \mathfrak{F})$, where $\Omega$ is a set and $\mathfrak{F}$ is a $\sigma$-algebra. Define $X : \Omega \to \mathbb{R} \cup \{\pm\infty\}$ to be a measurable function on the measurable space $(\Omega, \mathfrak{F})$ or an $\mathfrak{F}$-measurable function if, for all $\lambda \in \mathbb{R}$, $\{\omega \mid X(\omega) \le \lambda\} \in \mathfrak{F}$. Note that no probability is needed for the
definition. However, the use of the term "random variable" signifies that the underlying measurable space is equipped with a probability.
(2) The fact that $X$ is a random variable w.r.t. $\mathfrak{F}$ or an $\mathfrak{F}$-measurable function will be denoted by writing "$X \in \mathfrak{F}$" or "$X$ in $\mathfrak{F}$".

Exercise 2.1.7. Show that $X$ is a random variable if and only if $X$ has any one of the following properties:
(1) for all $\lambda \in \mathbb{R}$, $\{\omega \mid X(\omega) < \lambda\} \in \mathfrak{F}$;
(2) for all $\lambda \in \mathbb{R}$, $\{\omega \mid X(\omega) > \lambda\} \in \mathfrak{F}$;
(3) for all $\lambda \in \mathbb{R}$, $\{\omega \mid X(\omega) \ge \lambda\} \in \mathfrak{F}$.
Show that every $\mathfrak{F}$-simple function is a random variable.

Definition 2.1.8. Let $(a_n) \subset \mathbb{R} \cup \{\pm\infty\}$. The limit supremum or limsup of $(a_n)$ is defined to be $\inf_n \{\sup_{m \ge n} a_m\}$. The limit infimum or liminf of $(a_n)$ is defined to be $\sup_n \{\inf_{m \ge n} a_m\}$. These numbers "measure" the "oscillation at $+\infty$" of the sequence $(a_n)$ and are denoted by $\limsup_n a_n$ and $\liminf_n a_n$, respectively.

Exercise 2.1.9. Show that for any sequence $(a_n) \subset \mathbb{R} \cup \{\pm\infty\}$,
(1) $\limsup_n a_n$ and $\liminf_n a_n$ exist, and
(2) $\liminf_n a_n \le \limsup_n a_n$. [Hint: find some monotone sequences.]
Calculate $\limsup_n a_n$ and $\liminf_n a_n$ for the following sequences.
(3) $a_n = 1 - 1/n$ if $n$ is even and $a_n = 1/n$ if $n$ is odd,
(4) $a_n = \sin(n\pi/2)$,
(5) $a_n = \sin n\pi$ if $n$ is not divisible by … and $a_n = n$ otherwise.

Exercise 2.1.10. Show that $\liminf_n a_n = \limsup_n a_n$ and is finite if and only if the sequence $(a_n)$ converges in $\mathbb{R}$ to this common value.

Proposition 2.1.11. (Properties of random variables) Let $X, Y, X_1, X_2, \ldots$, etc., be random variables on $(\Omega, \mathfrak{F}, P)$. Then
$(RV_1)$ $\alpha \in \mathbb{R}$, $X \in \mathfrak{F}$ implies $\alpha X \in \mathfrak{F}$;
$(RV_2)$ $X_1 + X_2 \in \mathfrak{F}$ if $X_i \in \mathfrak{F}$, $i = 1$ or 2, where $(X_1 + X_2)(\omega) \stackrel{\text{def}}{=} X_1(\omega) + X_2(\omega)$, providing the sum makes sense (i.e., where one observes the conventions $\lambda + (\pm\infty) = \pm\infty$ if $\lambda \in \mathbb{R}$ and $(\pm\infty) + (\pm\infty) = \pm\infty$, while $(+\infty) + (-\infty)$ is not defined);
$(RV_3)$ $X_1 \vee X_2$ and $X_1 \wedge X_2 \in \mathfrak{F}$, where $(X_1 \vee X_2)(\omega) \stackrel{\text{def}}{=} \max\{X_1(\omega), X_2(\omega)\}$ for all $\omega \in \Omega$, and $(X_1 \wedge X_2)(\omega) \stackrel{\text{def}}{=} \min\{X_1(\omega), X_2(\omega)\}$ for all $\omega \in \Omega$;
$(RV_4)$ $(X_n) \subset \mathfrak{F}$ implies $\inf_n X_n$ and $\sup_n X_n \in \mathfrak{F}$, where $(\inf_n X_n)(\omega) \stackrel{\text{def}}{=} \inf\{X_n(\omega) \mid n \ge 1\}$ for all $\omega \in \Omega$, and $(\sup_n X_n)(\omega) \stackrel{\text{def}}{=} \sup\{X_n(\omega) \mid n \ge 1\}$ for all $\omega \in \Omega$;
$(RV_5)$ $\liminf_n X_n$ and $\limsup_n X_n \in \mathfrak{F}$ if $(X_n)_{n \ge 1} \subset \mathfrak{F}$, where $(\liminf_n X_n)(\omega) \stackrel{\text{def}}{=} \liminf_n X_n(\omega)$ for all $\omega \in \Omega$, and
$(\limsup_n X_n)(\omega) \stackrel{\text{def}}{=} \limsup_n X_n(\omega)$ for all $\omega \in \Omega$;
$(RV_6)$ if $X = \lim_n X_n$ and $(X_n)_{n \ge 1} \subset \mathfrak{F}$, then $X \in \mathfrak{F}$, where $(\lim_n X_n)(\omega) \stackrel{\text{def}}{=} \lim_n X_n(\omega)$ for all $\omega \in \Omega$;
$(RV_7)$ if $X \in \mathfrak{F}$ and $X$ is non-negative, then there is a sequence $(s_n)_{n \ge 1}$ of simple functions with $s_n \le s_{n+1}$ for all $n$ and $X = \lim_n s_n$;
$(RV_8)$ if $X, Y \in \mathfrak{F}$ and are both non-negative or both real-valued, then $XY \in \mathfrak{F}$.

Proof. $(RV_1)$ is obvious in view of Exercise 2.1.7.

$(RV_2)$: Let $\lambda \in \mathbb{R}$. Let $A = \bigcup_{r \in \mathbb{Q}} \{\omega \mid X_1(\omega) < r\} \cap \{\omega \mid X_2(\omega) < \lambda - r\}$. Since $\mathbb{Q}$, the set of rational numbers, is countable, the set $A \in \mathfrak{F}$. Also, if $\omega \in A$, then $X_1(\omega) + X_2(\omega) < \lambda$ and so $A \subset \{X_1 + X_2 < \lambda\}$ (note that $\{X_1 + X_2 < \lambda\}$ denotes $\{\omega \mid X_1(\omega) + X_2(\omega) < \lambda\}$).

To simplify matters, assume that $X_1$ and $X_2$ are real-valued and that $(X_1 + X_2)(\omega) = X_1(\omega) + X_2(\omega) < \lambda$. Let $\alpha_1 = X_1(\omega)$ and $\alpha_2 = X_2(\omega)$. The point $\omega \in A$ if there is a rational number $r$ with $\alpha_1 < r$ and $\alpha_2 < \lambda - r$. Since $\alpha_1 + \alpha_2 < \lambda$ implies $\alpha_1 < \lambda - \alpha_2$, one wants to find a rational number $r$ with $\alpha_1 < r < \lambda - \alpha_2$. This is possible in view of the following fundamental property of $\mathbb{R}$.

Exercise 2.1.12. Let $a < b$ be two real numbers. Then there is a rational number $r$ with $a < r < b$. Show that, as a result, in every non-void open subset of $\mathbb{R}$ there is a rational number. (Recall that a subset $A$ of $\mathbb{R}$ is said to be dense if it has this property (see Exercise 1.3.15): in other words, $\mathbb{Q}$ is dense in $\mathbb{R}$.) [Hints: choose an $n \ge 1$. The distance between consecutive rational numbers of the form $k/n$, $k \in \mathbb{Z}$, is $1/n$. Choose $n$ so that $1/n < b - a$ (this is possible in view of Exercise 1.1.4). Now look at the numbers $k/n \le a$ and those of this form $\ge b$. If there is no rational number between $a$ and $b$, what happens?] For a related exercise, see Exercise 2.9.15.

The upshot of this is that $A = \{X_1 + X_2 < \lambda\}$ and so $X_1 + X_2 \in \mathfrak{F}$ if both are real-valued.

Exercise. Prove $(RV_2)$ when the $X_i$ are not necessarily finite. [Hint: let $A = \{X_1 > -\infty, X_2 > -\infty\}$. Show that $A \cap \{X_1 + X_2 < \lambda\} \in \mathfrak{F}$.]

Exercise. Prove $(RV_3)$. [Hint: express $\{\omega \mid \max\{X_1(\omega), X_2(\omega)\} < \lambda\}$ in terms of $\{X_1 < \lambda\}$ and $\{X_2 < \lambda\}$.]

Exercise. Prove $(RV_4)$. [Hint: express $\{\inf_n X_n < \lambda\}$ in terms of the sets $\{X_n < \lambda\}$, $n \ge 1$.]

Exercise. Prove $(RV_5)$. [Hint: use $(RV_4)$.]

Exercise. Prove $(RV_6)$. [Hint: use Exercise 2.1.10.]

The verification of $(RV_7)$ is a bit more exciting. To understand the argument, Fig. 2.2 is useful:
[Fig. 2.2]

The set where $\frac{k-1}{2^n} \le X < \frac{k}{2^n}$ is indicated by heavy lines. For each $n \ge 1$, divide $\mathbb{R}$ into intervals using the dyadic rationals $\frac{k}{2^n}$, $k \in \mathbb{Z}$. Define

$$s_n(\omega) = \begin{cases} \frac{k-1}{2^n} & \text{if } \frac{k-1}{2^n} \le X(\omega) < \frac{k}{2^n} \text{ and } 1 \le k \le 2^{2n}, \\ 2^n & \text{if } 2^n \le X(\omega), \end{cases}$$

i.e.,

$$s_n = \sum_{k=1}^{2^{2n}} \frac{k-1}{2^n} 1_{\{\frac{k-1}{2^n} \le X < \frac{k}{2^n}\}} + 2^n 1_{\{X \ge 2^n\}}.$$

Then each $s_n$ is simple and $s_n \le X$ for all $n$. Since the dyadic rational $\frac{2k-1}{2^{n+1}}$ divides the interval $[\frac{k-1}{2^n}, \frac{k}{2^n}]$ in half, $\frac{k-1}{2^n} \le X(\omega) < \frac{k}{2^n}$ implies $s_{n+1}(\omega) \ge \frac{k-1}{2^n} = s_n(\omega)$, and so $s_n \le s_{n+1} \le X$ for all $n$. Let $s = \lim_{n \to \infty} s_n$. Then $s \le X$. If $X(\omega) < +\infty$, then $s(\omega) = X(\omega)$ since $|s_n(\omega) - X(\omega)| \le \frac{1}{2^n}$ for large enough $n$, since $X(\omega) < 2^n$ for large $n$. If $X(\omega) = +\infty$, then $s_n(\omega) = 2^n$ for all $n$ and so $s(\omega) = +\infty$. Therefore, $s = X$.

Exercise. Prove $(RV_8)$. [Hint: first verify it for products of $\mathfrak{F}$-simple functions; then observe that each real-valued $X$ can be written as $X = X^+ - X^-$, where $X^+ \stackrel{\text{def}}{=} X \vee 0$ and $X^- \stackrel{\text{def}}{=} -(X \wedge 0) = (-X) \vee 0$ (see Exercise 1.1.16).] $\square$

When the probability space $(\Omega, \mathfrak{F}, P)$ has a specific description, it is often useful and important to know that certain functions are random variables. For example, if $\Omega = \mathbb{R}$ and $\mathfrak{F} = \mathcal{B}(\mathbb{R})$, then every continuous function is measurable.
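The approximating functions of $(RV_7)$ are explicit enough to compute. A brief sketch (illustrative only; here $X$ is just any non-negative function):

```python
def dyadic_approx(X, n):
    """The simple function s_n of (RV_7): round X(omega) down to the
    dyadic grid k/2^n, capping at 2^n.  Then s_n <= s_{n+1} <= X and
    s_n increases pointwise to X."""
    def s_n(omega):
        x = X(omega)
        if x >= 2 ** n:
            return float(2 ** n)
        k = int(x * 2 ** n)      # largest integer k with k/2^n <= x
        return k / 2 ** n
    return s_n

import math
X = lambda t: math.exp(t)        # a non-negative "random variable" on R
for n in (1, 2, 4, 8):
    print(n, dyadic_approx(X, n)(1.0))   # 2.0, 2.5, 2.6875, ... -> e
```

Doubling $n$ both refines the grid and raises the cap, which is why the sequence increases instead of merely converging.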
Definition 2.1.13. A function $f : \mathbb{R} \to \mathbb{R}$ is continuous at $x_0 \in \mathbb{R}$ if for any $\epsilon > 0$ there is a $\delta = \delta(x_0, \epsilon) > 0$ such that $|f(x) - f(x_0)| < \epsilon$ when $|x - x_0| < \delta$. A function $f : \mathbb{R} \to \mathbb{R}$ is continuous if it is continuous at every point of $\mathbb{R}$.

A function $f : E \subset \mathbb{R} \to \mathbb{R}$ (i.e., defined on a subset $E$ of $\mathbb{R}$) is continuous at $x_0 \in E$ if for any $\epsilon > 0$ there is a $\delta = \delta(x_0, \epsilon)$ such that $|f(x) - f(x_0)| < \epsilon$ when $|x - x_0| < \delta$ and $x \in E$. It is continuous on $E$ if it is continuous at every point of $E$.

Exercise 2.1.14. Let $f : \mathbb{R} \to \mathbb{R}$.
(1) Show that $f$ is continuous at $x_0$ if and only if for any open set $O$ containing $y_0 = f(x_0)$, the set $f^{-1}O = \{x \mid f(x) \in O\}$ contains an open interval about $x_0$ (such a set is called a neighbourhood of $x_0$).
(2) Show that $f : \mathbb{R} \to \mathbb{R}$ is continuous if and only if for any open set $O \subset \mathbb{R}$, $f^{-1}O$ is also open.
Remember that a set $O$ is open if and only if $a \in O$ implies $O$ contains an open interval about $a$ (i.e., if and only if $O$ is a neighbourhood of each of its points).

Given the above results, one can prove the following.

Proposition 2.1.15. Let $(\mathbb{R}, \mathfrak{F}, P)$ be a probability space, and assume $\mathfrak{F} \supset \mathcal{B}(\mathbb{R})$. Then every continuous function $X : \mathbb{R} \to \mathbb{R}$ is a random variable, i.e., is measurable.

Proof. $\{\omega \mid X(\omega) < \lambda\}$ is open since $(-\infty, \lambda)$ is open. Every open set is a Borel set (see Exercise 1.3.10 (b)). $\square$

Definition 2.1.16. A function $\varphi : \mathbb{R} \to \mathbb{R}$ is called a Borel function if it is Borel-measurable, i.e., if it is a measurable function on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$.

Proposition 2.1.17. (Composition of measurable functions) Let $X$ be a finite random variable or measurable function on $(\Omega, \mathfrak{F}, P)$, and let $\varphi$ be a Borel function. Then the function $\varphi \circ X$ defined by $(\varphi \circ X)(\omega) = \varphi(X(\omega))$ for all $\omega \in \Omega$ is again a random variable.

In particular, if $B \in \mathcal{B}(\mathbb{R})$, then $X^{-1}(B) \stackrel{\text{def}}{=} \{\omega \mid X(\omega) \in B\} \in \mathfrak{F}$, as it is $\{\omega \mid (1_B \circ X)(\omega) > 0\}$. Hence, $X$ is a random variable if and only if $X^{-1}(B) \in \mathfrak{F}$ for all $B \in \mathcal{B}(\mathbb{R})$.

Proof. $\{\omega \mid (\varphi \circ X)(\omega) < \lambda\} = \{\omega \mid X(\omega) \in \{x \mid \varphi(x) < \lambda\}\}$. Now $\{x \mid \varphi(x) < \lambda\} = \varphi^{-1}((-\infty, \lambda)) \in \mathcal{B}(\mathbb{R})$, so the result will be proved if $\{\omega \mid X(\omega) \in B\} \in \mathfrak{F}$ whenever $B$ is a Borel set (i.e., if the proposition is proved for $\varphi = 1_B$, $B \in \mathcal{B}(\mathbb{R})$).
Exercise 2.1.18. If $A \subset \mathbb{R}$, define $X^{-1}(A) = \{\omega \mid X(\omega) \in A\}$. Show that
(1) $X^{-1}(A^c) = (X^{-1}(A))^c$,
(2) $X^{-1}(A_1 \cap A_2) = X^{-1}(A_1) \cap X^{-1}(A_2)$,
(3) $X^{-1}(\bigcup_{n=1}^\infty A_n) = \bigcup_{n=1}^\infty X^{-1}(A_n)$.
Let $\mathfrak{G} = \{A \subset \mathbb{R} \mid X^{-1}(A) \in \mathfrak{F}\}$. Show that
(4) $\mathfrak{G}$ is a $\sigma$-algebra.

Given this exercise, the fact that $X$ is a random variable implies that $\mathfrak{G} \supset \{(-\infty, \lambda] \mid \lambda \in \mathbb{R}\}$. Hence, in view of Exercise 1.3.10 (b), $\mathfrak{G} \supset \mathcal{B}(\mathbb{R})$. Clearly, if $\mathfrak{G} \supset \mathcal{B}(\mathbb{R})$, it follows that $X$ is a random variable since the intervals $(-\infty, \lambda]$ are Borel sets. $\square$

As an almost immediate consequence, one has the following important result.

Proposition 2.1.19. Let $X$ be a finite random variable on a probability space $(\Omega, \mathfrak{F}, P)$. Then there is a unique probability $Q$ on $\mathcal{B}(\mathbb{R})$ such that its distribution function $F$ satisfies $F(x) = P\{X \le x\}$ for all $x \in \mathbb{R}$.

Proof. Let $B \in \mathcal{B}(\mathbb{R})$. Define $Q(B)$ to be $P(X^{-1}(B))$. Since $X^{-1}$ preserves the $\sigma$-algebra operations by Exercise 2.1.18, it follows automatically that $Q$ is a probability. Clearly, $Q((-\infty, x]) = F(x)$ is $P\{X \le x\}$, and the uniqueness of $Q$ follows from Theorem 1.4.13. $\square$

Remark. One way to prove this is to start from the distribution function of $X$, which is $F(x) = P[X \le x]$, and then appeal to Theorem 1.4.13 for the existence of $Q$. The point of the above argument is that since one has a probability available, namely $P$, one obtains $Q$ by a straightforward set-theoretic formal procedure. In analysis, one refers to $Q$ as the image of $P$ under the measurable map $X$. This procedure, which produces an image measure, is very general and is not at all restricted to the situation of the above proposition. It will appear again in Proposition 3.1.10 when random vectors are considered.

Definition 2.1.20. The probability $Q$ that occurs in Proposition 2.1.19 is called the distribution (or probability law) of $X$.

Remarks 2.1.21. (1) Every probability $Q$ on $\mathcal{B}(\mathbb{R})$ is the distribution of some random variable, namely $X(x) = x$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}), Q)$.
(2) Statisticians are usually interested only in the distribution of a random variable $X$; its domain of definition is often of little interest to them. The construction of random variables with specified properties is a mathematical problem that for statistical purposes may often be taken for granted.

Returning to the main theme, what can be said about the integral of a random variable $X$, its so-called expectation $E[X]$? To define this, one
uses property $(RV_7)$ of Proposition 2.1.11 of a positive random variable. The next result allows $E[X]$ to be defined by approximation.

Proposition 2.1.22. Let $X$ be a non-negative random variable, and let $(s_n)_{n \ge 1}$ and $(s'_m)_{m \ge 1}$ be two increasing sequences of simple functions with $\lim_n s_n = X = \lim_m s'_m$. Then

$$\lim_n \int s_n \, dP = \lim_m \int s'_m \, dP.$$

Proof. Let $t_{n,m} = s_n \wedge s'_m$. By $(S_4)$ and $(S_5)$ of Proposition 2.1.4, $t_{n,m} \in \mathcal{S}^+$ and $\lim_{m \to \infty} \int t_{n,m} \, dP = \int s_n \, dP$. Hence

$$\lim_{n \to \infty} \int s_n \, dP = \lim_{n \to \infty} \lim_{m \to \infty} \int t_{n,m} \, dP = \lim_{m \to \infty} \lim_{n \to \infty} \int t_{n,m} \, dP \quad \text{(by Proposition 1.1.11)} = \lim_{m \to \infty} \int s'_m \, dP. \quad \square$$

Definition 2.1.23. Let $X$ be a non-negative random variable on a probability space $(\Omega, \mathfrak{F}, P)$, and let $(s_n)_{n \ge 1}$ be an increasing sequence of simple random variables with $X = \lim_n s_n$. Define $\int X \, dP$ to be $\lim_n \int s_n \, dP$. It is called the integral of $X$ with respect to $P$ or the expectation of $X$ and is also denoted by $E[X]$.

Remarks. (1) This definition makes sense because of Proposition 2.1.22.
(2) The integral $\int X \, dP$ can also be defined as the supremum of $\{\int s \, dP \mid s \text{ simple}, 0 \le s \le X\}$ (see Exercise 2.9.8).

The expectation of $X$ may be $+\infty$ for $X \ge 0$. If $E[X] < +\infty$, then $X$ is said to be integrable, and one writes $X \in L^1(\Omega, \mathfrak{F}, P)$ to indicate this.

Example 2.1.24. (1) Let $\Omega = \mathbb{N} \cup \{0\}$, and let $\mathfrak{F}$ be the collection $\mathcal{P}(\Omega)$ of all subsets of $\Omega$. A probability $P$ is defined on $\mathfrak{F}$ by the numbers $p_n = P(\{n\})$ provided that $\sum_{n=0}^\infty p_n = 1$. A real-valued random variable $X$ on $(\Omega, \mathfrak{F}, P)$ may be viewed as a sequence $(a_n)_{n \ge 0}$, i.e., $a_n = X(n)$. When $X$ is non-negative, the simple functions $s_n$ defined by $s_n(i) = X(i) = a_i$, $0 \le i \le n$, and $s_n(i) = 0$ for $i > n$ have the property that $s_n \uparrow X$. Further, $\int s_n \, dP = \sum_{i=0}^n a_i p_i$, and so $E[X] = \sum_{i=0}^\infty a_i p_i$. Consequently, on this probability space integration amounts to "summing" infinite series.
(2) Let $P$ be the Poisson distribution with mean $\lambda > 0$, i.e., $p_n = e^{-\lambda} \frac{\lambda^n}{n!}$ if $n \ge 0$. Show that
(i) if $X(n) = n$ for all $n \ge 0$, then $E[X] = \lambda$, and
(ii) if $X(n) = n!$ for all $n \ge 0$, then $E[X] < +\infty$ if and only if $0 < \lambda < 1$.
(3) In Feller's classic book [F1], the definition of mathematical expectation for a non-negative $X$ is the extension of part (1) of this example obtained by replacing $\Omega$ with the countable set $\{x_n \mid n \ge 1\}$ of values of $X$. This means that the mathematical expectation is defined by means of the distribution of $X$; in other words, $E[X] = \sum_{i=0}^\infty X(i) p_i = \sum_{n=1}^\infty x_n P(\{i \mid X(i) = x_n\})$ (see Lemma 3.5.3).

Proposition 2.1.25. (Properties of the integral for non-negative $X$)
$(E_1)$ $\lambda \ge 0$ implies $E[\lambda X] = \lambda E[X]$.
$(E_2)$ $E[X_1 + X_2] = E[X_1] + E[X_2]$ if $X_1$ and $X_2$ are non-negative random variables.
$(E_3)$ $0 \le X_1 \le X_2$ implies $E[X_1] \le E[X_2]$.
$(E_4)$ (Principle of monotone convergence) If $(X_n)$ is a sequence of non-negative random variables $X_n$ such that $0 \le X_n \le X_{n+1} \le X$ for all $n$ and $\lim_n X_n = X$, then $E[X] = \lim_n E[X_n]$.
$(E_5)$ (Fatou's lemma) $E[\liminf_n X_n] \le \liminf_n E[X_n]$ for any sequence $(X_n)$ of non-negative random variables $X_n$.

Proof. $(E_1)$ is obvious since $s_n \uparrow X$ implies $\lambda s_n \uparrow \lambda X$, where as before $s_n \uparrow X$ means that $s_n \le s_{n+1} \le X$ for all $n$ and $\lim_n s_n(\omega) = X(\omega)$ for all $\omega \in \Omega$.

$(E_2)$ is also obvious since $s_n \uparrow X_1$, $t_n \uparrow X_2$ implies $s_n + t_n \uparrow X_1 + X_2$ and hence $E[X_1 + X_2] = \lim_{n \to \infty} E[s_n + t_n] = \lim_{n \to \infty} (E[s_n] + E[t_n]) = \lim_{n \to \infty} E[s_n] + \lim_{n \to \infty} E[t_n] = E[X_1] + E[X_2]$ (even if one of the limits is $+\infty$).

To prove $(E_3)$, let

$$X(\omega) = \begin{cases} X_2(\omega) - X_1(\omega) & \text{if } X_1(\omega) < +\infty, \\ 0 & \text{if } X_1(\omega) = +\infty. \end{cases}$$

Then $X$ is a random variable and $X_2 = X + X_1$ (remember that $X_1, X_2$ are non-negative and can take infinite values). Then, by $(E_2)$, $E[X_2] = E[X] + E[X_1] \ge E[X_1]$.

The proof of $(E_4)$ is less trivial. First, observe that $X$ is a random variable as $X = \lim_n X_n$. Note that $E[X_n] \le E[X_{n+1}] \le E[X]$ for all $n$ and so $\lim_{n \to \infty} E[X_n] \le E[X]$. Let $(s_{n,k})_{k \ge 1}$ be an increasing sequence of simple functions with $\lim_k s_{n,k} = X_n$ for each $n \ge 1$.
Consider the following infinite array:

$$\begin{array}{ccccl}
s_{1,1} & s_{1,2} & s_{1,3} & \cdots s_{1,n} \cdots & \to X_1 \\
s_{2,1} & s_{2,2} & s_{2,3} & \cdots s_{2,n} \cdots & \to X_2 \\
\vdots & & & & \vdots \\
s_{m,1} & s_{m,2} & s_{m,3} & \cdots s_{m,n} \cdots & \to X_m \\
s_{m+1,1} & s_{m+1,2} & s_{m+1,3} & \cdots s_{m+1,n} \cdots & \to X_{m+1} \\
\vdots & & & &
\end{array}$$

If one replaces the entry at the $(m,n)$ position with the maximum of all the simple functions $s_{k,\ell}$ for $1 \le k \le m$, $1 \le \ell \le n$ and calls this simple function $t_{m,n}$, then it is clear that
(1) $s_{m,n} \le t_{m,n} \le X_m$, for all $m$, for all $n$;
(2) $t_{m,n} \le t_{m+1,n}$ and $t_{m,n} \le t_{m,n+1}$, for all $m$, for all $n$.

Hence, $\lim_{n \to \infty} t_{m,n} = X_m$ for all $m$, and the sequence $(t_{m,m})$ is increasing. Let $Y = \lim_m t_{m,m}$. Then $Y = X$ since by Proposition 1.1.11, $\lim_m t_{m,m}(\omega) = \lim_m \lim_n t_{m,n}(\omega) = \lim_m X_m(\omega) = X(\omega)$.

Consequently, $E[X] \ge \lim_m E[X_m] = \lim_m \lim_n E[t_{m,n}]$, which by another application of Proposition 1.1.11 equals $\lim_m E[t_{m,m}]$. Since by Proposition 2.1.22, $\lim_m E[t_{m,m}] = E[X]$, this proves $(E_4)$.

Fatou's lemma $(E_5)$ is obtained by applying the principle of monotone convergence $(E_4)$. Let $U_n \stackrel{\text{def}}{=} \inf_{m \ge n} X_m$. Then $\liminf_n X_n = \lim_n U_n = U$ and $U_n \le U_{n+1}$ for all $n$. Hence $E[U] = \lim_n E[U_n]$ by $(E_4)$. Now $U_n \le X_m$ if $m \ge n$ and so $E[U_n] \le E[X_m]$ if $m \ge n$. Therefore, $E[U_n] \le \inf_{m \ge n} E[X_m] \le \liminf_n E[X_n]$, and so $E[U] \le \liminf_n E[X_n]$. $\square$

Remarks. (1) Fatou's lemma is an extremely useful estimating tool.
(2) Example 2.1.24 (1) is an illustration of $(E_4)$, the principle of monotone convergence.

Exercise 2.1.26. Show that if $X$ is a non-negative integrable random variable, then $P[X = +\infty] = 0$. [Hint: let $A = \{\omega \mid X(\omega) = +\infty\}$. Then $s_n \ge 2^n 1_A$, where $s_n$ is defined in Proposition 2.1.11 $(RV_7)$.]

Exercise 2.1.27. Let $Y$ and $Z$ be two random variables on $(\Omega, \mathfrak{F}, P)$, and let $A \in \mathfrak{F}$. Show that $Y 1_A + Z 1_{A^c}$ is a random variable, where by convention $0 \cdot (\pm\infty) = 0$. Note that the sum is always defined since for any $\omega$ at most one of $Y(\omega) 1_A(\omega)$ and $Z(\omega) 1_{A^c}(\omega)$ is infinite.
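Remark (2) can be seen numerically: on $\Omega = \{0, 1, 2, \ldots\}$ with Poisson weights, the truncations $s_n$ of Example 2.1.24 (1) increase to $X(n) = n$, so by $(E_4)$ their integrals — the partial sums — climb to $E[X] = \lambda$. A hedged sketch:

```python
import math

lam = 3.0
p = lambda n: math.exp(-lam) * lam ** n / math.factorial(n)   # Poisson weights

def E_s(n):
    """Integral of the simple function s_n that agrees with X(i) = i for
    i <= n and vanishes beyond: the n-th partial sum of the series."""
    return sum(i * p(i) for i in range(n + 1))

for n in (1, 5, 10, 20, 40):
    print(n, E_s(n))   # increases toward E[X] = lam = 3.0
```

The monotone increase of the printed values is $(E_4)$ in miniature; no domination hypothesis is needed.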
Exercise 2.1.28. A random variable $N$ such that $P[N \ne 0] = 0$ will be called a null variable or null function. Show that
(1) $P(A) = 0$ if and only if $1_A$ is a null function,
(2) if $N$ is a non-negative null variable, then $E[N] = 0$,
(3) every non-negative null variable $N$ is integrable with $E[N] = 0$, and $|N|$ is a null variable if $N$ is a null variable,
(4) if $E[|X|] = 0$, then $X$ is a null variable or function, where $|X|(\omega) = |X(\omega)|$ for all $\omega \in \Omega$ (note that $|x| = x^+ + x^-$ by Exercise 1.1.16).

Exercise 2.1.29. Let $Y$ be a non-negative integrable random variable, and let $A \in \mathfrak{F}$ be such that $P(A) = 1$. Show that
(1) $E[Y] = E[Y 1_A] = E[U]$, where $U = Y 1_A + Z 1_{A^c}$ and $Z$ is any non-negative random variable, and
(2) $\int_B Y \, dP = \int_B Y 1_A \, dP = \int_B U \, dP$ for any $B \in \mathfrak{F}$.

Exercise 2.1.29 shows that modifying (in a measurable way) a non-negative random variable $Y$ on a set of probability zero does not change its expectation or its integral over any set $B$ in the $\sigma$-algebra $\mathfrak{F}$. In view of Exercise 2.1.28, any non-negative integrable random variable $X$ may be modified to be (say) zero on $\{X = +\infty\}$ without changing its expectation or integral over any set in $\mathfrak{F}$. So when dealing with questions involving integration of non-negative integrable random variables, one may as well assume that they are real-valued, i.e., are everywhere finite. It is therefore convenient to make the following convention.

Convention 2.1.30. If $X$ is a non-negative integrable random variable, it will be assumed that $X$ is real-valued.

To integrate arbitrary finite random variables $X$, it suffices to express them in terms of non-negative finite random variables.

Definition 2.1.31. A finite random variable $X$ is said to be integrable if $X = X_1 - X_2$, where $X_i$, $i = 1, 2$, are finite, non-negative integrable random variables. If $X$ is an integrable random variable with $X = X_1 - X_2$ and $X_i$ non-negative and integrable, then $E[X] = E[X_1] - E[X_2]$ (i.e., $\int X \, dP \stackrel{\text{def}}{=} \int X_1 \, dP - \int X_2 \, dP$). Let $L^1 = L^1(\Omega, \mathfrak{F}, P)$ denote the set of finite integrable random variables.

Remarks 2.1.32. (1) The definition of $E[X]$ makes sense if it is independent of the expression of $X$ as a difference of non-negative integrable random variables. Let $X = X_1 - X_2 = Y_1 - Y_2$, where the $X_i$ and the $Y_i$ are non-negative integrable random variables. Then $X_1 + Y_2 = Y_1 + X_2$ and by $(E_2)$, $E[X_1] + E[Y_2] = E[Y_1] + E[X_2]$. Hence, $E[X_1] - E[X_2] = E[Y_1] - E[Y_2]$ since these are real (and hence finite) numbers.
(2) For any random variable $X$, if $X \vee 0 = X^+$ and $X \wedge 0 = -X^-$, then $X$ will also be said to be integrable if $X^+$ and $X^-$ are integrable whether
they are finite or not, and one sets $E[X] = E[X^+] - E[X^-]$. Let $Y(\omega) = X(\omega)$ if $|X(\omega)| < \infty$ and $= 0$ otherwise. Then $Y$ is integrable in the sense of Definition 2.1.31, and $E[Y] = E[Y^+] - E[Y^-] = E[X^+] - E[X^-]$. This extended notion of integrability is equivalent to the following: there is a set $A \in \mathfrak{F}$ with $P(A) = 1$ and two non-negative random variables $X_1$ and $X_2$, both finite on $A$ and with $E[X_i] < \infty$, $i = 1, 2$, such that $X(\omega) = X_1(\omega) - X_2(\omega)$ for all $\omega \in A$. As a result, a random variable $X$ is integrable in the above extended sense if and only if, after being modified to equal zero on a set of probability zero, the resulting random variable is integrable in the sense of Definition 2.1.31. In order to comply with standard usage, one writes $X \in L^1$ if $X$ is integrable in the above sense. The above remark explains the connection between this usage and the requirement of Definition 2.1.31 that the random variables in $L^1(\Omega, \mathfrak{F}, P)$ be finite.
(3) In the context of Example 2.1.24 (1), where $\Omega = \mathbb{N} \cup \{0\}$, $\mathfrak{F}$ is the collection of all subsets of $\Omega$, and $P(\{n\}) = p_n$ for all $n$, a random variable $X$ is integrable if and only if the series $\sum_{n=0}^\infty X(n) p_n$ is absolutely convergent. For such an $X$, $E[X] = \sum_{n=0}^\infty X(n) p_n$. In Feller [F1], $X$ is said to have finite expectation if it is integrable.

Proposition 2.1.33. Let $(\Omega, \mathfrak{F}, P)$ be a probability space. Then
(1) $L^1(\Omega, \mathfrak{F}, P)$ is a real vector space and the function $X \to E[X]$ is linear (i.e., it is a linear functional);
(2) $X \in L^1$ if and only if $|X| \stackrel{\text{def}}{=} X^+ + X^- \in L^1$, and $|E[X]| \le E[|X|]$;
(3) $|X| \le Y$, $Y \in L^1$ implies $X \in L^1$.

Proof. (1) Let $X = X_1 - X_2$ and $Y = Y_1 - Y_2$. Then $X + Y = (X_1 + Y_1) - (X_2 + Y_2) \in L^1$ if $X, Y \in L^1$. Let $X = X_1 - X_2 \in L^1$ and $\lambda \in \mathbb{R}$. Then $\lambda X = \lambda X_1 - \lambda X_2 \in L^1$ if $\lambda \ge 0$ and $\lambda X = (-\lambda) X_2 - (-\lambda) X_1 \in L^1$ if $\lambda < 0$. It is clear that $E[\lambda X + \mu Y] = \lambda E[X] + \mu E[Y]$ if $X, Y \in L^1$ and $\lambda, \mu \in \mathbb{R}$.
(2) If $X = X_1 - X_2 \in L^1$, $X_i \ge 0$, then $X^+ \le X_1$ and $X^- \le X_2$. Hence $X^\pm \in L^1$ and so $|X| = X^+ + X^- \in L^1$. Conversely, if $|X| \in L^1$, then $X^\pm \in L^1$ and so $X = X^+ - X^- \in L^1$. It follows from the triangle inequality (Exercise 1.1.16) that $|E[X]| = |E[X^+] - E[X^-]| \le E[X^+] + E[X^-] = E[|X|]$. Therefore, $|E[X]| \le E[|X|]$ if $X \in L^1$.
(3) $|X| = X^+ + X^- \le Y \in L^1$ implies $E[X^+], E[X^-] < \infty$. $\square$

Everything is now in place (barring the three exercises that follow) for a proof of Lebesgue's famous theorem of dominated convergence.

Theorem 2.1.34. (Theorem of dominated convergence, first version) Let $(X_n)$ be a sequence of random variables dominated in modulus by an integrable random variable $Y$ (i.e., for all $n$, $|X_n| \le Y \in L^1$). If $X = \lim_n X_n$, then
(1) $X \in L^1$ and
(2) $E[X] = \lim_{n \to \infty} E[X_n]$.
Proof. (1) $|X_n| \to |X|$ by Exercise 2.1.35 and so $|X| \le Y \in L^1$. Hence, by Proposition 2.1.33 (3), $X \in L^1$.
(2) The proof uses Fatou's lemma. Note that $X_n + Y \ge 0$ for all $n$. Hence, by Fatou's lemma (i.e., $(E_5)$ of Proposition 2.1.25), $E[\liminf_n (Y + X_n)] \le \liminf_n E[Y + X_n] = \liminf_n (E[Y] + E[X_n]) = E[Y] + \liminf_n E[X_n]$ in view of Exercise 2.1.36. Applying this exercise again, one has that $\liminf_n (Y + X_n) = Y + \liminf_n X_n = Y + X$. Hence, $E[Y] + E[X] \le E[Y] + \liminf_n E[X_n]$ and so $E[X] \le \liminf_n E[X_n]$.

Since $\lim_n (-X_n) = -X$ and $|-X_n| = |X_n| \le Y$, it follows from what has been proved that $-E[X] = E[-X] \le \liminf_n E[-X_n]$. By Exercise 2.1.37, $\liminf_n E[-X_n]$ equals $-\limsup_n E[X_n]$ as $E[-X_n] = -E[X_n]$. Hence, $-E[X] \le -\limsup_n E[X_n]$ and so $\limsup_n E[X_n] \le E[X] \le \liminf_n E[X_n]$. By Exercise 2.1.10, this implies that $E[X] = \lim_n E[X_n]$. $\square$

Exercise 2.1.35. If $(a_n)$ is a sequence of real numbers that converges to $A$, then $(|a_n|)$ converges to $|A|$. [Hint: use Exercise 1.1.16.]

Exercise 2.1.36. Let $(a_n)$ be any sequence of real numbers. Then if $a$ is a real number, $\liminf_n (a_n + a) = (\liminf_n a_n) + a$. [Hint: remember that $\liminf_n a_n$ is the lower bound for the oscillation at $\infty$ of the sequence $(a_n)$.]

Exercise 2.1.37. Let $(a_n)$ be any sequence of real numbers. Then $\liminf_n (-a_n) = -(\limsup_n a_n)$.

If $N$ is a null variable, $E[|N|] = 0$. So if $X \in L^1$ in the sense of Remark 2.1.32 (2), then $X \pm N \in L^1$ and $E[X \pm N] = E[X]$. Consequently, to decide whether a random variable $X$ is integrable in the sense of Remark 2.1.32 (2), it suffices to know $X$ modulo a null variable $N$, i.e., $X \pm N$. This leads to the final version of Lebesgue's theorem.

Theorem 2.1.38. (Theorem of dominated convergence) Let $(X_n)$ be a sequence of random variables dominated in modulus almost surely (usually written as "a.s.") by an integrable random variable $Y$ (i.e., for all $n$, $P(\{\omega \mid |X_n(\omega)| \le Y(\omega)\}) = 1$). If $X = \lim_n X_n$ a.s. (i.e., $P(\{\omega \mid X(\omega) = \lim_n X_n(\omega)\}) = 1$), then
(1) $X \in L^1$ and
(2) $E[X] = \lim_{n \to \infty} E[X_n]$.

Proof. Let $F = \{\omega \mid X(\omega) = \lim_{n \to \infty} X_n(\omega)\}$, and let $F_n = \{\omega \mid |X_n(\omega)| \le Y(\omega)\}$. Then $\Omega_1 = F \cap [\bigcap_{n=1}^\infty F_n] \in \mathfrak{F}$ and $P(\Omega_1) = 1$ (see Exercise 1.5.1 (2)). If $U$ is a random variable on $(\Omega, \mathfrak{F}, P)$, let $\tilde{U}$ be defined by setting $\tilde{U}(\omega) = U(\omega)$ if $\omega \in \Omega_1$ and 0 otherwise. Then $\tilde{U} = U + N$, $N$ a null variable, and $\tilde{U}$ is in $L^1$ if $U$ is in $L^1$ in the sense of Remark 2.1.32 (2). The initial version of the theorem, Theorem 2.1.34, applies to $(\tilde{X}_n)_{n \ge 1}$ and
$\tilde{X}$. Since $X = \tilde{X} + N_1$, $N_1$ a null variable, $X \in L^1$ and $E[X] = E[\tilde{X}] = \lim_n E[\tilde{X}_n] = \lim_n E[X_n]$. $\square$

Remark. In analysis, the phrase "almost surely" is replaced by "almost everywhere" and consequently "a.s." by "a.e."

Remark 2.1.39. The hypotheses of the theorem of dominated convergence not only imply that $E[X_n] \to E[X]$ as $n \to \infty$: they also imply that $E[|X_n - X|] \to 0$ as $n \to \infty$. It suffices to observe that $|X_n - X| \to 0$ a.e. and that $|X_n - X| \le 2Y \in L^1$. Conversely, since $|E[X_n] - E[X]| = |E[X_n - X]| \le E[|X_n - X|]$, it follows that $E[X_n] \to E[X]$ as $n \to \infty$ if $E[|X_n - X|] \to 0$ as $n \to \infty$. In other words, there is an equivalent form of the theorem of dominated convergence, which is stated below.

Theorem 2.1.40. (Theorem of dominated convergence: equivalent form) Let $(X_n)$ be a sequence of random variables dominated in modulus almost everywhere (usually written as a.e.) by an integrable random variable $Y$ (i.e., for all $n$, $P(\{\omega \mid |X_n(\omega)| \le Y(\omega)\}) = 1$). If $X = \lim_n X_n$ a.e. (i.e., $P(\{\omega \mid X(\omega) = \lim_n X_n(\omega)\}) = 1$), then
(1) $X \in L^1$ and
(2) $\lim_{n \to \infty} E[|X_n - X|] = 0$.

Remarks. (1) The convergence $E[|X_n - X|] \to 0$ is what is later referred to as convergence in $L^1$, i.e., $X_n$ converges to $X$ in $L^1$ if $E[|X_n - X|] \to 0$ (see Definition 4.1.23).
(2) This theorem gives a very important sufficient condition in order that one may pass to the limit when integrating to get $E[\lim_n X_n] = \lim_n E[X_n]$. It is by no means necessary, since for non-decreasing sequences of non-negative random variables, the result is always valid by monotone convergence. Later on, in Chapter IV, a necessary and sufficient condition is given for this result to be true (see Remark 4.5.10).

2. Lebesgue measure on $\mathbb{R}$ and Lebesgue integration

(A) $\sigma$-finite measures. The preceding discussion of probability and integration makes very little use of the fact that $P$ is a probability. What is extremely important is that $P$ is a function defined on a $\sigma$-algebra $\mathfrak{F}$ of subsets of a set $\Omega$ that has the following two properties:
$(M_1)$ $0 \le P(A) \le +\infty$ for all $A \in \mathfrak{F}$ and $P(\emptyset) = 0$;
$(M_2)$ if $A = \bigcup_{n=1}^\infty A_n$ with the sets $A_n \in \mathfrak{F}$ and pairwise disjoint, then $P(A) = \sum_{n=1}^\infty P(A_n)$.

Definition 2.2.1. A non-negative measure $\mu$ on a $\sigma$-algebra $\mathfrak{F}$ is a function satisfying $(M_1)$ and $(M_2)$. A measure $\mu$ on a $\sigma$-algebra $\mathfrak{F}$ of subsets of $\Omega$ is said to be finite if $\mu(\Omega) < \infty$ and is said to be $\sigma$-finite if $\Omega = \bigcup_{n=1}^\infty A_n$ with $\mu(A_n) < +\infty$ for all $n$. A measure space $(\Omega, \mathfrak{F}, \mu)$
is defined to be a triple consisting of a set $\Omega$, a $\sigma$-algebra $\mathfrak{F}$ of subsets of $\Omega$, and a non-negative measure $\mu$ on $\mathfrak{F}$. It is said to be a finite measure space (respectively, a $\sigma$-finite measure space) if $\mu$ is finite (respectively, $\sigma$-finite).

It was shown in Chapter I (Theorem 1.4.13) that there is a one-to-one correspondence between probabilities on $\mathcal{B}(\mathbb{R})$ and distribution functions $F$ on $\mathbb{R}$. In the case of $\sigma$-finite measures $\mu$ on $\mathcal{B}(\mathbb{R})$, there is an analogous result provided one assumes that $\mu((a,b])$ is finite whenever $-\infty < a < b < +\infty$. One replaces a distribution function $F$ on $\mathbb{R}$ by a finite-valued, non-constant, non-decreasing, right continuous function $G$. Each such function determines a unique $\sigma$-finite measure on $\mathcal{B}(\mathbb{R})$ as stated in the following result, and the measure determines $G$ uniquely if one sets $G(0) = 0$. This corresponds to the solution for non-decreasing functions of the analogue of Basic Problem 1.3.6 for distribution functions (see the remark following its statement).

Theorem 2.2.2. To each non-constant, non-decreasing, right continuous $G : \mathbb{R} \to \mathbb{R}$, there is a unique $\sigma$-finite measure $\mu$ on $\mathcal{B}(\mathbb{R})$ with $\mu((a,b]) = G(b) - G(a)$ if $a < b$ are any two real numbers. Further, if $G(0) = 0$, then $\mu$ determines $G$:

$$G(x) = \begin{cases} \mu((0,x]), & x \ge 0, \\ -\mu((x,0]), & x < 0. \end{cases}$$

One way to prove this theorem consists of looking a little carefully at the arguments used to prove Theorem 1.4.13 and "handwaving" a bit. To begin, it is usual to replace $\mathfrak{A}$ by $\mathfrak{R}$, the collection of finite unions of finite intervals $(a,b]$. It is a Boolean ring: $A, B \in \mathfrak{R}$ implies that $A \cap B^c$, $A \cup B$, and $A \cap B$ all belong to $\mathfrak{R}$; so that $A^c$ need not be in $\mathfrak{R}$ if $A \in \mathfrak{R}$, but the relative complement $B \cap A^c$ is in $\mathfrak{R}$ for all $B \in \mathfrak{R}$. Exercise 1.3.7 has to be modified. There is a unique finite set function $\mu$ on $\mathfrak{R}$ that is finitely additive (i.e., $(FAP_3)$ holds) such that $\mu((a,b]) = G(b) - G(a)$ for all $a < b$, and with $\mu(A) \ge 0$ for all $A \in \mathfrak{R}$.

Remark. One may also carry the whole discussion through using the Boolean algebra $\mathfrak{A}$. The price one pays is that the set function on $\mathfrak{A}$ may take $+\infty$ as a value; e.g., in the case of $G(x) = x$, which is used to define Lebesgue measure.

As every subset $E$ of $\mathbb{R}$ is a subset of a countable union of sets from $\mathfrak{R}$, one may define the outer measure $\mu^*$ as

$$\mu^*(E) = \inf\Big\{\sum_{n=1}^\infty \mu(A_n) \,\Big|\, E \subset \bigcup_{n=1}^\infty A_n, A_n \in \mathfrak{R} \text{ for all } n \ge 1\Big\}.$$

Since, as before, $\mu$ is $\sigma$-additive on $\mathfrak{R}$, $\mu(R) = \mu^*(R)$ for $R \in \mathfrak{R}$. Furthermore, Carathéodory's definition of $\mu^*$-measurable sets is the same as
that for $P^*$-measurable sets, and once again the class $\mathfrak{F}$ of $\mu^*$-measurable sets is a $\sigma$-algebra, and the restriction of $\mu^*$ to $\mathfrak{F}$ is a measure on $\mathfrak{F}$. Exercise 1.4.14 (1) remains true, but (2) is false unless $\mu(A_n) < +\infty$ for some $n$.

To verify the uniqueness statement (the analogue of that in Theorem 1.4.13), one uses the fact that $\mu((-n,n]) = G(n) - G(-n) < +\infty$. Then, if $\mu_1$ and $\mu_2$ are two measures on $\mathcal{B}(\mathbb{R})$ that agree on intervals $(a,b]$, the set

$$\mathfrak{M}_n = \{A \cap (-n,n] \mid A \in \mathcal{B}(\mathbb{R}),\ \mu_1(A \cap (-n,n]) = \mu_2(A \cap (-n,n])\}$$

is a monotone class of subsets of $(-n,n]$ that contains the Boolean algebra $\mathfrak{A}_n$ of subsets $R \cap (-n,n]$ of $(-n,n]$, where $R \in \mathfrak{R}$. Hence, by Exercise 1.4.16, $\mathfrak{M}_n$ contains the smallest $\sigma$-algebra $\mathfrak{F}_n$ of subsets of $(-n,n]$ containing $\mathfrak{A}_n$.

Exercise 2.2.3. Show that $\mathfrak{F}_n$ equals $\{A \cap (-n,n] \mid A \in \mathcal{B}(\mathbb{R})\}$. [Hint: show that $\{A \in \mathcal{B}(\mathbb{R}) \mid A \cap (-n,n] \in \mathfrak{F}_n\}$ is a monotone class containing $\mathfrak{A}$.]

Consequently, $\mu_1(A) = \mu_2(A)$ since it follows from Exercise 1.4.14 (1) that, for $i = 1, 2$, $\mu_i(A) = \lim_{n \to \infty} \mu_i(A \cap (-n,n])$ for all Borel sets $A$. $\square$

(B) Lebesgue measure. In particular, if $G(x) = x$ for all $x$, the resulting measure is called Lebesgue measure. It is commonly denoted by $dx$ if $x$ is the variable of integration and will also sometimes be denoted by $\lambda$. This measure computes the length of a set. If $E$ is a Lebesgue measurable set, its Lebesgue measure is often denoted by $|E|$ (i.e., $|E| = \lambda^*(E)$, where $\lambda^*$ is the outer measure corresponding to the measure $\lambda$ on $\mathfrak{R}$ such that $\lambda((a,b]) = b - a$ if $a < b$).

Another way to prove Theorem 2.2.2 is outlined in the following exercise for the case of Lebesgue measure, i.e., when $G(x) = x$ for all $x \in \mathbb{R}$.

Exercise 2.2.4. For each $n \ge 1$, define the distribution function $F_n$ by the formula

$$F_n(x) = \begin{cases} 0, & x \le -n, \\ \frac{1}{2n}(x + n), & -n \le x \le n, \\ 1, & n \le x. \end{cases}$$

This distribution function corresponds to the probability of choosing a number at random from $[-n,n]$, the uniform distribution on $[-n,n]$. It is also constructed by adding $n$ and scaling by $\frac{1}{2n}$ the function $G_n$ defined by

$$G_n(x) = \begin{cases} -n, & x \le -n, \\ x, & -n \le x \le n, \\ n, & n \le x. \end{cases}$$

Let $P_n$ be the probability on $\mathcal{B}(\mathbb{R})$ determined by the distribution function $F_n$.
Show that $\lambda_n \stackrel{\text{def}}{=} 2n P_n$ is the unique positive measure $\mu$ on $\mathcal{B}(\mathbb{R})$ such that
(1) $\mu((a,b]) = b - a$ if $-n \le a < b \le n$, and
(2) $\mu((-\infty, -n]) = \mu((n, \infty)) = 0$.
Show that
(3) if $A$ is a Borel set and $A \subset (-n,n]$, then $\lambda_{n+1}(A) = \lambda_n(A)$.
Define $\mu$ on $\mathcal{B}(\mathbb{R})$ by setting $\mu(A) = \sum_{n=1}^\infty \lambda_n(A_n)$, where $A_1 = A \cap (-1,1]$, $A_{n+1} = (A \cap (-n-1, -n]) \cup (A \cap (n, n+1])$. Show that
(4) $\mu$ is the unique positive measure $\nu$ on $\mathcal{B}(\mathbb{R})$ such that $\nu((a,b]) = b - a$ if $-\infty < a < b < \infty$; in other words, that $\mu$ agrees with $\lambda$ on $\mathcal{B}(\mathbb{R})$.

With the approach of Exercise 2.2.4, there remains the problem of determining the $\sigma$-field $\mathfrak{F}$ of Lebesgue measurable sets. Following Exercise 1.5.4, one may use the measure $\mu$ to determine an outer measure $\mu^*$ and then show that the $\sigma$-algebra of $\mu^*$-measurable sets is the same as the $\sigma$-algebra $\mathfrak{F}$ of Lebesgue measurable sets produced via the Carathéodory procedure applied to the function $G(x) = x$. Hence, Lebesgue measure $\lambda$ is the same as the restriction of the outer measure $\mu^*$ to $\mathfrak{F}$. Alternatively, one may use completion to extend the measure $\mu$ (which is Lebesgue measure on $\mathcal{B}(\mathbb{R})$) to the Lebesgue measurable sets (see Exercises 2.9.11 and 3.3.13).

Remark. The basic idea of Exercise 2.2.4 is that given a positive measure $\mu$ on a $\sigma$-algebra $\mathfrak{F}$, to each set $A \in \mathfrak{F}$ with $0 < \mu(A) < \infty$ one can associate a probability $P_A$ on $\mathfrak{F}$ by defining $P_A(E) = \frac{1}{\mu(A)} \mu(E \cap A)$. In other words, by "normalizing" the measure $\mu$ on the subsets of $A$, one obtains a probability. When $\mu$ itself is a probability, the probability $P_A$ is called the conditional probability of $E$ given $A$. $P_A(E)$ is usually denoted by $P[E \mid A]$, and one has the familiar formula $P[E \mid A] = P[E \cap A]/P[A]$.

Exercise 2.2.5. Carry through the procedure outlined in the previous exercise for any non-constant, non-decreasing, right continuous, finite function $G$ on $\mathbb{R}$. Begin by choosing $n$ large enough that $G(n) - G(-n) > 0$.

Exercise 2.2.6. If $E$ is a Borel set, let $|E|$ denote its Lebesgue measure. Define $\mu(E) = |E \cap [0,1]|$. Show that $\mu$ is a probability and that its distribution function $F$ is the first one of Example 1.3.9. Hence, $\mu$ is the probability on $\mathcal{B}(\mathbb{R})$ corresponding to $F$.

(C) Integration with respect to a $\sigma$-finite measure $\mu$. Let $(\Omega, \mathfrak{F}, \mu)$ be a $\sigma$-finite measure space (Definition 2.2.1). Instead of being called random variables, the functions $X$ that satisfy the condition in Definition 2.1.6 are called measurable functions and, as before, every non-negative $X$ is the limit of an increasing sequence $(s_n)_{n \ge 1}$ of non-negative simple functions $s_n$ (defined as before). For a simple function $s$, $\int s \, d\mu$
is defined as before. As in the case of a probability measure, $\lim_n \int s_n \, d\mu$ depends only on $X$, and hence one defines $\int X \, d\mu$ as this limit. The discussion of integration on a probability space applies without change to a $\sigma$-finite measure space. One writes $\int X \, d\mu$, $\int X(x) \, \mu(dx)$, or $\int X(x) \, d\mu(x)$ for the integral with respect to $\mu$ of $X \in L^1 = L^1(\Omega, \mathfrak{F}, \mu)$. The principle of monotone convergence, Fatou's lemma, and Lebesgue's theorem of dominated convergence all hold in the context of a $\sigma$-finite measure space.

In particular, if $\Omega = \mathbb{R}$ and $d\mu = dx$, the integral $\int X \, dx$ is called the Lebesgue integral of $X$. One often denotes $L^1(\mathbb{R}, \mathcal{B}(\mathbb{R}), dx)$ by $L^1(dx)$ or $L^1(\mathbb{R})$.

If $E \subset \mathbb{R}$ is a Lebesgue measurable set, let $\mathcal{B}(E) = \{A \cap E \mid A \in \mathcal{B}(\mathbb{R})\}$. The set $E$ determines a $\sigma$-finite measure space: $\Omega = E$, $\mathfrak{F} = \mathcal{B}(E)$, and the measure $\mu(A \cap E) = |A \cap E|$ — this measure is called Lebesgue measure on $E$. Note that since the usual metric on $\mathbb{R}$ induces a metric on $E$, the $\sigma$-algebra $\mathcal{B}(E)$ is also the same as the smallest $\sigma$-algebra of subsets of $E$ that contains all the open subsets of the metric space $E$ (Definition 4.1.25). This is because a subset $O'$ of $E$ is open in $E$ if and only if $O' = O \cap E$, $O$ an open subset of $\mathbb{R}$ (see Exercise 3.6.3).

Exercise 2.2.7. Let $\mathfrak{F}$ denote the $\sigma$-algebra of Lebesgue measurable sets (i.e., the $\lambda^*$-measurable sets, where $\lambda^*$ is the outer measure corresponding to $G(x) = x$). Then one knows that $\mathfrak{F} \supset \mathcal{B}(\mathbb{R})$. Show the following:
(1) if $E \in \mathfrak{F}$, then there is a Borel set $A \supset E$ with $|A| = |E|$ (where these values may be $+\infty$);
(2) if $E \in \mathfrak{F}$ and $|E| < \infty$, then there is a Borel set $A \supset E$ with $|A \setminus E| = 0$;
(3) if $E \in \mathfrak{F}$ and $|E| < \infty$, show that there are Borel sets $B$ and $A$ with $B \subset E \subset A$ and $|A \setminus E| = |E \setminus B| = 0$; [Hint: let $E_n = E \cap (-n,n]$, $E \in \mathfrak{F}$, and verify (3) for $E_n$ and $E_n^c$. Note that (3) says $1_B \le 1_E \le 1_A$ and $\int 1_B(x) \, dx = \int 1_A(x) \, dx$, cf. the situation in Proposition 2.3.4.]
(4) show that $E \in \mathfrak{F}$ if and only if there are Borel sets $B$ and $A$ with $B \subset E \subset A$ and $|A \setminus B| = 0$. [Hints: (only if) use (3); (if) use (2.3.6) or use the definition of a $\lambda^*$-measurable set.]
(5) let $X$ be $\mathfrak{F}$-measurable and non-negative. Show that there are Borel measurable functions $U$ and $V$ such that (a) $0 \le U \le X \le V$ and (b) $|\{U < V\}| = 0$.

Exercise 2.2.8. Let $\mu$ be a $\sigma$-finite measure on $\mathcal{B}(\mathbb{R})$. If $b \in \mathbb{R}$, define $\mu_b(A) = \mu(A - b)$ for $A \in \mathcal{B}(\mathbb{R})$, where $A - b = \{u - b \mid u \in A\}$. Show that
(1) $\mu_b$ is a $\sigma$-finite measure on $\mathcal{B}(\mathbb{R})$,
(2) if $\lambda$ denotes Lebesgue measure, then $\lambda_b((c,d]) = \lambda((c,d])$ for any $c < d$, and conclude that
(3) $\lambda = \lambda_b$ for any $b \in \mathbb{R}$ (this property of Lebesgue measure is called
(4) Lebesgue measure is the only translation invariant measure µ on 𝔅(ℝ) such that µ((0,1]) = 1. [Hint: show that µ((0,r]) = r for any rational number r > 0.]
Remark 2.2.9. (A set that is not Lebesgue measurable) Not every subset of ℝ is Lebesgue measurable. The standard example of a non-(Lebesgue-)measurable set E makes use of a countable dense subgroup G of ℝ; for example ℚ, or the structurally simpler one that is discussed in Exercise 2.9.15. The sets x + G, x ∈ ℝ, are called the cosets of G and either are disjoint or else coincide: (x₁ + G) ∩ (x₂ + G) ≠ ∅ if and only if x₁ − x₂ ∈ G, in which case x₁ + G = x₂ + G. The set E is given by choosing exactly one point out of each of the cosets of G, an apparently obvious possibility but one that, technically speaking, involves the use of the so-called axiom of choice (see [H2]). It follows that if y₁, y₂ ∈ E are distinct, then y₁ − y₂ ∉ G.
One proves that E is not Lebesgue measurable by contradiction: (i) if E is Lebesgue measurable, then |E| ≠ 0, since ℝ = ∪_{g∈G}(g + E) implies that |ℝ| ≤ Σ_{g∈G} |g + E| (recall G is countable), which equals zero if |E| = 0, since for any Lebesgue measurable set E, |E| = |g + E|; (ii) for any Lebesgue measurable set E with |E| ≠ 0, the set D = {y₁ − y₂ | y₁, y₂ ∈ E} contains a non-void open set (see Exercise 4.7.18), and so, by the denseness of the countable group G, the set D contains a non-zero element of G; this contradicts the basic property of E, namely that if y₁, y₂ ∈ E are distinct, then y₁ − y₂ ∉ G.
Example 2.2.10. X ∈ L¹ = L¹(ℝ, 𝔅(ℝ), dx) if X(x) = e^{−x²/2}, x ∈ ℝ. To see this, one can define a new function Y(x) = e^{−k²/2} if k ≤ |x| < k + 1, k = 0, 1, 2, … . Then, since 0 ≤ X ≤ Y, it suffices to show Y ∈ L¹, by Proposition 2.1.33 (3) (as applied to a σ-finite measure space). Let sₙ = Y·1_{(−n,n)}. Then sₙ ≤ sₙ₊₁ and limₙ sₙ = Y. Also, the sₙ are simple functions and ∫ sₙ(x)dx = 2 Σ_{k=1}^{n} e^{−(k−1)²/2}. Since the series Σ_{k=1}^{∞} e^{−(k−1)²/2} converges, lim_{n→∞} ∫ sₙ(x)dx = ∫ Y(x)dx < ∞, and so Y ∈ L¹. Note that this discussion does not compute ∫ e^{−x²/2}dx. It merely says that it is finite. To actually compute the integral, it is necessary to relate the Lebesgue integral to the Riemann integral.
3. The Riemann integral and the Lebesgue integral
Let φ(x) be a bounded function on [a,b]. A partition π of [a,b] is a finite sequence of points a = t₀ < t₁ < ⋯ < tₙ = b. A partition π′ refines π if π′ is given by adding more points to those that determine π. Let mᵢ = inf{φ(x) | tᵢ ≤ x ≤ tᵢ₊₁} and Mᵢ = sup{φ(x) | tᵢ ≤ x ≤ tᵢ₊₁}. The upper sum U(π,φ) is defined to be Σ_{i=0}^{n−1} Mᵢ(tᵢ₊₁ − tᵢ) and the lower sum L(π,φ) is defined to be Σ_{i=0}^{n−1} mᵢ(tᵢ₊₁ − tᵢ).
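A quick numerical sketch of the upper and lower sums just defined (the test function and partitions are choices of ours, not the text's): for φ(x) = x² on [0,1] and uniform partitions, U − L shrinks like 1/n, anticipating the integrability criterion of the next definition.

```python
import numpy as np

def upper_lower(phi, a, b, n):
    # upper and lower Riemann sums over the uniform partition with n pieces;
    # phi below is monotone increasing, so inf/sup sit at the endpoints
    # (for a general phi one would optimize over each subinterval)
    t = np.linspace(a, b, n + 1)
    dt = np.diff(t)
    return np.sum(phi(t[1:]) * dt), np.sum(phi(t[:-1]) * dt)

for n in [10, 100, 1000]:
    U, L = upper_lower(lambda x: x**2, 0.0, 1.0, n)
    print(n, U, L, U - L)    # both sums approach 1/3, and U - L -> 0
```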
Definition 2.3.1. A bounded function φ on [a,b] is Riemann integrable if for any ε > 0 there is a partition π with U(π,φ) − L(π,φ) < ε.
Proposition 2.3.2. Assume φ is Riemann integrable on [a,b]. Then inf_π U(π,φ) = sup_π L(π,φ).
Proof. If π′ refines π, then L(π,φ) ≤ L(π′,φ) ≤ U(π′,φ) ≤ U(π,φ) (one sees this by adding one point to π, then another, and so on). Given any two partitions π₁, π₂, there is a least partition π that refines both π₁ and π₂. Hence, L(π₁,φ) ≤ L(π,φ) ≤ U(π,φ) ≤ U(π₂,φ). Consequently, inf_π U(π,φ) ≥ sup_π L(π,φ). The fact that φ is Riemann integrable implies that the difference between these numbers is less than ε, for any ε > 0. Therefore, these numbers agree. □
Definition 2.3.3. Let φ be a bounded function on [a,b] that is Riemann integrable. The Riemann integral of φ over [a,b] is defined to be inf_π U(π,φ) = sup_π L(π,φ). It is denoted by ∫ₐᵇ φ(x)dx.
Assume that |φ(x)| ≤ M for all x ∈ [a,b]. Let π be a partition of [a,b] defined by n + 1 points tᵢ, and define
  ℓ_π(x) = 0 if x < a;  −M if x = tᵢ, 0 ≤ i ≤ n;  mᵢ if tᵢ < x < tᵢ₊₁, 0 ≤ i ≤ n − 1;  0 if b < x,
and
  u_π(x) = 0 if x < a;  M if x = tᵢ, 0 ≤ i ≤ n;  Mᵢ if tᵢ < x < tᵢ₊₁, 0 ≤ i ≤ n − 1;  0 if b < x.
These functions are Borel functions, ℓ_π ≤ u_π, and |ℓ_π(x)|, |u_π(x)| ≤ M·1_{[a,b]} ∈ L¹. These functions are therefore in L¹(dx), and one has L(π,φ) = ∫ ℓ_π(x)dx, U(π,φ) = ∫ u_π(x)dx. They are like simple functions, except that they are not necessarily positive.
Proposition 2.3.4. Assume φ is bounded and Riemann integrable on [a,b]. Then there exists a sequence (πₙ) of partitions, with πₙ₊₁ a refinement of πₙ for all n ≥ 1, such that
  ∫ₐᵇ φ(x)dx = limₙ→∞ U(πₙ,φ) = limₙ→∞ L(πₙ,φ).
Hence, there are Borel functions f ≤ φ·1_{[a,b]} ≤ g that are in L¹ with
  ∫ f(x)dx = ∫ g(x)dx.
Consequently, φ·1_{[a,b]} is a Lebesgue measurable function that is integrable, and (the Lebesgue integral) ∫ φ(x)1_{[a,b]}(x)dx = ∫ₐᵇ φ(x)dx (the Riemann integral).
Proof. For each n, there is a partition πₙ′ with U(πₙ′,φ) − L(πₙ′,φ) < 1/n. Let πₙ be the smallest partition that refines π₁′, π₂′, …, and πₙ′. Then L(πₙ′,φ) ≤ L(πₙ,φ) ≤ U(πₙ,φ) ≤ U(πₙ′,φ), and so U(πₙ,φ) − L(πₙ,φ) < 1/n. Hence,
  ∫ₐᵇ φ(x)dx = limₙ→∞ U(πₙ,φ) = limₙ→∞ L(πₙ,φ).
Since πₙ₊₁ refines πₙ, ℓ_{πₙ} ≤ ℓ_{πₙ₊₁} ≤ u_{πₙ₊₁} ≤ u_{πₙ}. Let f = limₙ→∞ ℓ_{πₙ} and g = limₙ→∞ u_{πₙ}. These are Borel functions, with f ≤ φ·1_{[a,b]} ≤ g. Since |ℓ_{πₙ}|, |u_{πₙ}| ≤ M·1_{[a,b]} ∈ L¹, by dominated convergence (Theorem 2.1.38 applied to dx), f, g ∈ L¹ and ∫ f(x)dx = limₙ→∞ ∫ ℓ_{πₙ}(x)dx = limₙ→∞ L(πₙ,φ) = ∫ₐᵇ φ(x)dx = limₙ→∞ U(πₙ,φ) = limₙ→∞ ∫ u_{πₙ}(x)dx = ∫ g(x)dx.
At this point, the full strength of Theorem 1.4.13 comes into play. So far no explicit use has been made of the σ-algebra 𝔉 of λ*-measurable sets (the Lebesgue measurable sets if λ* is the outer measure determined by G(x) = x) as opposed to the σ-algebra 𝔅(ℝ) of Borel subsets. Also, in the case of a probability, no explicit use has been made of the P*-measurable sets that are not Borel measurable, where P* is the outer measure defined by a distribution function.
Look at E = {x | g(x) > f(x)}. This is a Borel set since g and f are Borel functions. Its Lebesgue measure is zero: let Eₙ = {x | g(x) > f(x) + 1/n}; as g ≥ f and ∫ g(x)dx = ∫ f(x)dx, one has ∫ {g(x) − f(x)}dx = 0; since ∫ {g(x) − f(x)}dx ≥ (1/n) ∫ 1_{Eₙ}(x)dx = (1/n)|Eₙ|, it follows that |Eₙ| = 0 and hence |E| = 0.
Since g ≥ φ·1_{[a,b]} ≥ f, the function φ·1_{[a,b]} agrees with a Borel function (say f or g) except possibly on the set E of Lebesgue measure zero. Once one shows that φ·1_{[a,b]} is measurable with respect to the σ-algebra 𝔉 of Lebesgue measurable sets, then by integrating on (ℝ, 𝔉, dx), rather than on (ℝ, 𝔅(ℝ), dx), the integral ∫ φ(x)1_{[a,b]}(x)dx is defined (in the sense of Lebesgue). Since f ≤ φ·1_{[a,b]} ≤ g, where f, g ∈ L¹ and ∫ f(x)dx = ∫ g(x)dx, the fact that ∫ f(x)dx ≤ ∫ φ(x)1_{[a,b]}(x)dx ≤ ∫ g(x)dx implies that ∫ φ(x)1_{[a,b]}(x)dx = ∫ₐᵇ φ(x)dx. To complete the proof of Proposition 2.3.4, it suffices to verify the Lebesgue measurability of φ·1_{[a,b]}.
Lemma 2.3.5. Let φ be a function on ℝ such that there exists (1) a Borel set E with |E| = 0 and (2) a Borel function f with φ = f on ℝ\E. Then φ is Lebesgue measurable, i.e., is measurable with respect to the σ-algebra 𝔉 of Lebesgue measurable sets.
Proof. {φ < λ} = ({φ < λ} ∩ (ℝ\E)) ∪ ({φ < λ} ∩ E) = ({f < λ} ∩ (ℝ\E)) ∪ ({φ < λ} ∩ E). Since f is Borel, {f < λ} ∈ 𝔅(ℝ) and so {f < λ} ∩ (ℝ\E) ∈ 𝔅(ℝ). Also, if {φ < λ} ∩ E = E₁, it is a subset of a Borel set E of Lebesgue measure zero.
Sublemma 2.3.6. Every subset E₁ of a set E of Lebesgue measure zero is Lebesgue measurable and of measure zero.
Proof. Let F ⊂ ℝ be any set. Since λ*(F ∩ E₁) ≤ |E| = 0, λ*(F) ≥ λ*(F ∩ E₁ᶜ) = λ*(F ∩ E₁) + λ*(F ∩ E₁ᶜ). As the opposite inequality is always true by Definition 1.4.10 (3), the result is proved. □
The sublemma shows that {φ < λ} ∩ E ∈ 𝔉, the σ-algebra of Lebesgue measurable sets. Hence, {φ < λ} ∈ 𝔉 for all λ ∈ ℝ, i.e., φ is Lebesgue measurable. □
Remark. Sublemma 2.3.6 is stated for Lebesgue measure, but it applies to any probability associated with a distribution function and equally well to any positive σ-finite measure.
Exercise 2.3.7. (See Lusin's theorem (Exercise 4.7.9).) Let E be a Lebesgue measurable set and f : E → ℝ a function such that, for any ε > 0, there is a closed set A with A ⊂ E such that (i) the restriction of f to A is continuous (on A) (see Definition 2.1.13) and (ii) |E\A| < ε. For each n ≥ 1, let Aₙ be a closed set with Aₙ ⊂ E such that (i) the restriction of f to Aₙ is continuous and (ii) |E\Aₙ| < 1/n. Show that
(1) Aₙ ∩ {x ∈ E | f(x) ≥ λ} is closed for each λ,
(2) if A := ∪ₙ₌₁^∞ Aₙ, then A ∩ {x ∈ E | f(x) ≥ λ} is a Borel set for each λ, and
(3) |E\A| = 0.
Conclude that f is Lebesgue measurable, i.e., {x ∈ E | f(x) ≥ λ} is a Lebesgue measurable set for any λ. Note that Lusin's theorem (Exercise 4.7.9) states that the converse is true.
Remark. From now on, unless explicitly stated or the context makes it clear, if φ is a Lebesgue measurable function defined on an interval [a,b] (or any of the other three bounded intervals with these endpoints), then ∫ φ(x)dx denotes the Lebesgue integral ∫ φ(x)1_{[a,b]}(x)dx of the Lebesgue measurable function φ·1_{[a,b]} on ℝ, which agrees with φ on [a,b] and is zero on its complement.
In a calculus course, the Riemann integral is usually discussed only for a function φ(x) that is continuous on [a,b].
For such functions, the Riemann integral exists. A proof is outlined below.
2.3.8. Some properties of real-valued functions continuous on [a,b].
(1) Let φ₁, φ₂ both be continuous on [a,b]. Then φ₁ ± φ₂ and φ₁φ₂ are continuous on [a,b]. If φ(x) > 0 for all x, a ≤ x ≤ b, and φ is continuous, then 1/φ is continuous on [a,b]. (These facts are to be found in [M1] or [R4], for example. One can also try to prove them.)
(2) φ is bounded on [a,b]. For each x, a ≤ x ≤ b, if one takes ε = 1, by definition of continuity at x there is a δ = δ(x) > 0 such that |φ(x) − φ(y)| < 1 if |x − y| < δ(x) and y ∈ [a,b]; the open intervals (x − δ(x), x + δ(x)) cover [a,b]; by the Heine-Borel theorem (Theorem 1.4.5), there is a finite number of points x₁, …, xₘ such that [a,b] ⊂ ∪ᵢ₌₁ᵐ (xᵢ − δ(xᵢ), xᵢ + δ(xᵢ)). Hence, if y ∈ [a,b], then |φ(y) − φ(xᵢ)| < 1 for some 1 ≤ i ≤ m. Consequently, |φ(y)| ≤ 1 + max₁≤ᵢ≤ₘ |φ(xᵢ)| = M, i.e., φ is bounded. Alternatively, the continuous image of a compact set is compact and so bounded.
(3) φ is uniformly continuous on [a,b]: i.e., given ε > 0, there exists δ > 0 such that |φ(x) − φ(y)| < ε whenever |x − y| < δ and {x,y} ⊂ [a,b]. Note that the δ does not depend on x or y.
Proof. Choose ε > 0, and for each x let δ(x) > 0 be such that |φ(x) − φ(y)| < ε/2 if |x − y| < δ(x). The open intervals (x − δ(x)/2, x + δ(x)/2) cover [a,b]. Let [a,b] ⊂ ∪ᵢ₌₁ᵐ (xᵢ − δ(xᵢ)/2, xᵢ + δ(xᵢ)/2), i = 1, …, m, where such a set of points xᵢ exists by virtue of the Heine-Borel theorem. Let δ = min₁≤ᵢ≤ₘ δ(xᵢ)/2. Let |s − t| < δ and s, t ∈ [a,b]. Then s is in some [xᵢ − δ(xᵢ)/2, xᵢ + δ(xᵢ)/2], and by the choice of δ, t ∈ (xᵢ − δ(xᵢ), xᵢ + δ(xᵢ)). Hence, |φ(s) − φ(t)| ≤ |φ(s) − φ(xᵢ)| + |φ(xᵢ) − φ(t)| < ε/2 + ε/2 = ε.
(4) If φ is continuous on [a,b], there are points x_m, x_M ∈ [a,b] such that m = φ(x_m) ≤ φ(x) ≤ φ(x_M) = M, a ≤ x ≤ b. In other words, φ attains its maximum and its minimum on [a,b].
Proof. By (2) above, φ is bounded above and below. Let M = sup{φ(x) | a ≤ x ≤ b}. The function f(x) = M − φ(x) is continuous on [a,b] by (1), and if it never vanishes, 1/f(x) is again continuous on [a,b]. It is bounded — see (2) — by a constant N. So 1/{M − φ(x)} ≤ N if a ≤ x ≤ b. Hence, 1/N ≤ M − φ(x) (i.e., φ(x) ≤ M − 1/N if a ≤ x ≤ b). This contradicts the fact that M = sup{φ(x) | a ≤ x ≤ b}.
Therefore, M − φ(x) vanishes somewhere. Applying this to −φ, the function −m + φ(x) also vanishes somewhere.
Proposition 2.3.9. Let φ be a continuous real-valued function on [a,b]. Then φ is Riemann integrable.
Proof. By (2) above, φ is bounded. Let ε > 0, and choose n so that |s − t| ≤ (b − a)/n implies |φ(s) − φ(t)| < ε (possible by (3)). Let π be given by dividing [a,b] into intervals of length (b − a)/n. Then, in view of (4), applied to each closed subinterval [tᵢ, tᵢ₊₁], and the choice of n, Mᵢ − mᵢ ≤ ε for all i. Hence, U(π,φ) − L(π,φ) ≤ (b − a)ε. Since ε is arbitrary, φ is Riemann integrable.
Improper integrals and Lebesgue integrable functions. The function X(x) = e^{−x²/2} is continuous, even representable by a power series that converges uniformly. As a result, the Riemann integral ∫₀^R e^{−x²/2}dx exists.
Definition 2.3.10. ∫₀^{+∞} f(x)dx = lim_{R→+∞} ∫₀^R f(x)dx is called the improper Riemann integral of f over [0,+∞), if this limit exists (to simplify questions of existence of the integral over [0,R], f is assumed continuous). If the improper integral ∫₀^{+∞} |X(x)|dx < +∞, the improper integral ∫₀^{+∞} X(x)dx exists and is said to be absolutely convergent. If the improper integral ∫₀^{+∞} X(x)dx exists but ∫₀^{+∞} |X(x)|dx = +∞, the improper integral is said to be conditionally convergent.
From calculus (see Marsden [M1], p. 322) one learns that ∫₀^{+∞} e^{−x²/2}dx = √(2π)/2, and so ∫_{−∞}^{+∞} e^{−x²/2}dx = √(2π).
Proposition 2.3.11. Let X(x) be a non-negative continuous function on [0,+∞). If ∫₀^{+∞} X(x)dx exists, then X·1_{[0,+∞)} ∈ L¹ and ∫ X·1_{[0,+∞)}(x)dx = ∫₀^{+∞} X(x)dx. When X is continuous but not necessarily non-negative, the improper integral ∫₀^{+∞} X(x)dx can be conditionally convergent.
Proof. Assume X is non-negative. Since X·1_{[0,+∞)} = lim_{N→∞} X·1_{[0,N]}, with X·1_{[0,N]} ≤ X·1_{[0,N+1]}, and ∫₀^{+∞} X(x)dx = lim_{N→+∞} ∫₀^N X(x)dx, it follows from the principle of monotone convergence (Proposition 2.1.25 (E4)) that
  ∫ X·1_{[0,+∞)}(x)dx = lim_{N→∞} ∫ X·1_{[0,N]}(x)dx = lim_{N→∞} ∫₀^N X(x)dx = ∫₀^{+∞} X(x)dx.
Let
  X(x) = (sin x)/x if x > 0, and X(x) = 1 if x = 0.
Then one can show that |X(x)|·1_{[0,+∞)}(x) ∉ L¹, i.e., X·1_{[0,+∞)} ∉ L¹, but that ∫₀^{+∞} ((sin x)/x) dx exists, by using an argument similar to the argument used to show that an alternating series converges. □
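A small numerical illustration of this conditional convergence (the arch-by-arch scheme below is a choice of ours, not the text's): integrating (sin x)/x over the arches [kπ, (k+1)π], the signed partial integrals settle down to π/2, while the integrals of |sin x|/x grow without bound, roughly like log R.

```python
import numpy as np
from scipy.integrate import quad

f = lambda x: 1.0 if x == 0 else np.sin(x) / x

for n_arches in [10, 100, 1000]:
    arches = [quad(f, k * np.pi, (k + 1) * np.pi)[0] for k in range(n_arches)]
    I = sum(arches)             # signed partial integral -> pi/2 ~ 1.5708
    A = sum(abs(v) for v in arches)   # each arch has constant sign
    print(n_arches, I, A)       # A keeps growing: no absolute convergence
```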
Remark 2.3.12. When X is non-negative, then ∫₀^{+∞} X(x)dx = ∫ X·1_{[0,+∞)}(x)dx even if this limit (the improper integral) is infinite.
Proposition 2.3.11 shows how to calculate the integral ∫ e^{−x²/2}dx of the L¹-function X(x) = e^{−x²/2}. The value of this Lebesgue integral is √(2π).
4. Probability density functions
Consider the distribution function F(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−u²/2}du of Example 1.3.9 (2). According to Theorem 1.4.13, there is a unique probability P on 𝔅(ℝ) with F as its distribution function.
Proposition 2.4.1. For any Borel set E,
  P(E) = (1/√(2π)) ∫_{−∞}^{+∞} 1_E(u) e^{−u²/2} du = (1/√(2π)) ∫_E e^{−u²/2} du.
In particular, if C is the Cantor set, then P(C) = 0. (See Example 1.2.7.)
Proof. Define µ(E) = (1/√(2π)) ∫_E e^{−u²/2}du for any Borel set E. Then F(x) = µ((−∞,x]), and the first statement follows from Theorem 1.4.13 once it is shown that µ is a probability. Clearly, 0 ≤ µ(E) ≤ 1 and µ(ℝ) = 1, since the Lebesgue integral over ℝ equals the improper Riemann integral (1/√(2π)) ∫_{−∞}^{+∞} e^{−u²/2}du, which equals 1. Let A = ∪ₙ₌₁^∞ Aₙ with the sets Aₙ pairwise disjoint and Borel. Let Bₙ = ∪ᵢ₌₁ⁿ Aᵢ and Yₙ(u) = (1/√(2π)) e^{−u²/2} 1_{Bₙ}(u) = Σᵢ₌₁ⁿ 1_{Aᵢ}(u) · (1/√(2π)) e^{−u²/2}. Then limₙ→∞ Yₙ(u) = (1/√(2π)) e^{−u²/2} 1_A(u). Call this function Y. Now Yₙ ≤ Yₙ₊₁. Thus, by the principle of monotone convergence ((E4) of Proposition 2.1.25 applied to the Lebesgue integral), ∫ Y(u)du = limₙ→∞ ∫ Yₙ(u)du = limₙ→∞ Σᵢ₌₁ⁿ (1/√(2π)) ∫_{Aᵢ} e^{−u²/2}du. In other words,
  µ(A) = limₙ→∞ Σᵢ₌₁ⁿ µ(Aᵢ) = Σᵢ₌₁^∞ µ(Aᵢ).
To compute P(C) in Example 1.2.7, notice that |C| = 0. Let s be a simple function. Then ∫_C s(x)dx = 0, as s·1_C is a null function by Exercise 2.1.28. Hence, ∫_C X(x)dx = 0 for any non-negative, measurable function X. □
This proposition illustrates a general fact. Namely, the integral of a non-negative integrable random variable X over a set A ∈ 𝔉 determines a set function ν(A) that is a measure (see Definition 2.2.1). The next exercise explains the details of the argument.
Exercise 2.4.2. Use the preceding arguments to show the following, where (Ω, 𝔉, µ) is a σ-finite measure space.
(1) Let X be non-negative and µ-integrable, i.e., X ∈ L¹(Ω, 𝔉, µ). Let ν(A) = ∫_A X dµ := ∫ 1_A(ω) X(ω) µ(dω).
(2) Show that ν is a σ-finite measure on 𝔉 with ν(Ω) = ∫ X dµ.
(3) Let X be a non-negative, measurable function, and let E ∈ 𝔉 be a set with µ(E) = 0: show that ∫_E X dµ = 0.
(4) Show that ∫ f dν = ∫ fX dµ for any non-negative measurable function f.
(5) Show that f ∈ L¹(Ω, 𝔉, ν) if and only if fX ∈ L¹(Ω, 𝔉, µ), and that then ∫ f dν = ∫ fX dµ.
Examples 2.4.3. (1) Let λ > 0, and set P(E) = ∫_{E∩[0,+∞)} λe^{−λx}dx. Then P is a probability. This follows once it is observed that the function X ∈ L¹, with ∫ X(x)dx = 1, where
  X(x) = λe^{−λx} if x ≥ 0, and X(x) = 0 if x < 0.
A random variable with this distribution or law is said to be exponentially distributed (with parameter λ).
(2) The function X ∈ L¹, where
  X(x) = x^{ν−1} e^{−x/α} if x > 0, and X(x) = 0 if x ≤ 0,
and α, ν > 0. The function Γ(t) = ∫₀^{+∞} u^{t−1} e^{−u} du is called the Gamma function; Γ(n+1) = n!, n = 0, 1, … . The probability P = P_{α,ν} given by P(E) = (1/(α^ν Γ(ν))) ∫_E X(x)dx is called a Gamma distribution.
(3) Other familiar distributions from statistics are given by integrating other L¹-functions (e.g., Beta distributions, Student's t, etc.; see Feller [F2] for additional examples).
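For Example 2.4.3 (1), a quick sanity check in code (parameter value and interval are choices of ours): the exponential density integrates to 1, and the probability of an interval (a,b] matches the closed form e^{−λa} − e^{−λb} obtained by direct integration.

```python
import numpy as np
from scipy.integrate import quad

lam = 2.0
density = lambda x: lam * np.exp(-lam * x) if x >= 0 else 0.0

total, _ = quad(density, 0, np.inf)
a, b = 0.5, 1.5
p_ab, _ = quad(density, a, b)

print(total)                                        # ~ 1.0
print(p_ab, np.exp(-lam * a) - np.exp(-lam * b))    # the two values agree
```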
These examples (should) raise the following question. When can a probability P on 𝔅(ℝ) be written as P(E) = ∫_E X(x)dx, X ∈ L¹? In view of Exercise 2.4.2 (3), for this to happen it is necessary that P(E) = 0 if |E| = 0. It is an important result that this condition is not only necessary but sufficient. In other words, a probability P on 𝔅(ℝ) is determined by a non-negative L¹-function of integral one — a so-called probability density function — if and only if P(E) = 0 when |E| = 0. This theorem is a special case of the Radon-Nikodym theorem (Theorem 2.7.19), which will be proved later in §7 using the Hahn decomposition for signed measures and again at the end of these notes using martingale theory (see Corollary 5.7.4 and Exercise 5.7.6). Such a probability is said to be absolutely continuous with respect to Lebesgue measure λ. The (probability) density is called the Radon-Nikodym derivative of P with respect to λ.
The derivative of the distribution function F of a probability P exists almost everywhere (see Wheeden and Zygmund [W1], p. 111, Theorem 7.21). In particular, the derivative of the distribution function of an absolutely continuous probability P exists almost everywhere, as is shown in Chapter IV. In this case, the derivative X (or f) of the distribution function is the probability density function for P.
In general, the derivative f of a distribution function F is a non-negative integrable function. The probability P can then be written as the sum of the absolutely continuous measure ξ given by f and another, so-called singular, measure η, where a measure µ is said to be singular if there is a Borel set E with |E| = 0 and µ(Eᶜ) = 0. This decomposition of P (Theorem 2.7.22) is known as Lebesgue's decomposition theorem, and the measures ξ and η are unique. By normalizing these measures (i.e., multiplying them by scalars), it follows that every probability that is neither absolutely continuous nor singular is a unique convex combination of an absolutely continuous probability and a singular one (Exercise 2.7.25).
Another decomposition of a probability results from the fact that a distribution function has at most a countable number of jumps (Exercise 2.9.13), where a jump occurs at x if F is not continuous at x (i.e., if F(x) ≠ F(x−), where F(x−) is defined in Definition 1.3.4). It is explained in Exercise 2.9.14 that every distribution function that is neither continuous nor discrete is a unique convex combination of a continuous distribution function and a so-called discrete one: a distribution function F is said to be discrete if the corresponding probability P is concentrated on a countable set D (i.e., P(D) = 1); note that one may assume that P({x}) > 0 for all x ∈ D. Since a countable set has Lebesgue measure zero, every discrete probability is singular. Further, Exercise 2.4.4 explains how the corresponding distribution function increases only by jumps.
Continuous distribution functions do not necessarily determine absolutely continuous probabilities. In fact, a continuous distribution function F can have a derivative equal to 0 almost everywhere: a classic example, the Cantor-Lebesgue function (see Wheeden and Zygmund [W1], p. 35), is defined by using the complement of the Cantor set (see Exercise 2.9.16). The resulting probability is therefore singular with respect to Lebesgue measure. In the case of the Cantor-Lebesgue function, it lives on the Cantor set C (i.e., the Cantor set has probability equal to one).
If a continuous probability is not absolutely continuous, Lebesgue's decomposition theorem (Theorem 2.7.22) applied to it produces a singular continuous probability. As a result, every probability that is neither continuous nor discrete, neither absolutely continuous nor singular, is a unique convex combination of an absolutely continuous probability, a singular continuous probability, and a discrete probability (Exercise 2.9.14).
Alternatively, one may obtain this result by using Exercise 2.9.14 to decompose the singular part of a probability into a continuous part and a discrete part. Hence, every probability is a convex combination of an absolutely continuous one, a continuous singular one, and a discrete one. This decomposition is in fact unique in the sense that the resulting measures are unique (see Exercise 2.9.14 (5)).
Exercise 2.4.4. Let F be the distribution function of a finite random variable X defined on a probability space (Ω, 𝔉, P). Let E be a Borel subset of ℝ, and assume that X(Ω) ⊂ E (i.e., all the values of X lie in E). Let Q be the distribution of X, i.e., the unique probability on (ℝ, 𝔅(ℝ)) determined by F. Show that
(1) Q(E) = 1, and
(2) F(x) = P[X ∈ E ∩ (−∞,x]] = Q(E ∩ (−∞,x]) for all x ∈ ℝ.
Assume that E is a countable set and that P[X = x] = Q({x}) > 0 for each x ∈ E. Show that
(3) F(x) = F(x−) if and only if x ∉ E;
(4) conclude that F increases only by jumps and that at a point x of E the jump size is Q({x}).
5. Infinite series again
The analogue for ℕ (or any countable set) of Lebesgue measure is often called counting measure. If E ⊂ ℕ, let |E| denote the number of points in E. Then | · | is a measure on 𝔓(ℕ) — the collection of all subsets of ℕ. The corresponding integral could be written as ∫ X dn. The real-valued functions a on ℕ are just the real-valued sequences (aₙ)ₙ≥₁.
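A minimal sketch of the identification just described (the sequence is a choice of ours, not the text's): integration against counting measure on ℕ is summation, and the truncations a·1_{{1,…,N}} play the role of the increasing simple functions sₙ.

```python
# a_n = 2^{-n}, n = 1..30: a non-negative "function" on N
a = [1.0 / 2**n for n in range(1, 31)]

integral = sum(a)            # integral of a against counting measure
print(integral)              # close to 1

for N in [5, 10, 20, 30]:
    print(N, sum(a[:N]))     # monotone convergence of truncations to 1
```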
Exercise 2.5.1. Let a = (aₙ)ₙ≥₁ be ≥ 0. Show that Σₙ₌₁^∞ aₙ is the integral of a with respect to counting measure dn on ℕ.
Exercise 2.5.2. Let ℓ¹ denote the space L¹(ℕ, 𝔓(ℕ), dn). Show that a ∈ ℓ¹ if and only if Σₙ₌₁^∞ aₙ is absolutely convergent.
Definition 2.5.3. Let V be a real vector space. A norm on V is a function ‖ ‖ : V → ℝ₊ such that
(1) for all u, v ∈ V, ‖u + v‖ ≤ ‖u‖ + ‖v‖;
(2) for all u ∈ V and λ ∈ ℝ, ‖λu‖ = |λ| ‖u‖; and
(3) ‖u‖ = 0 implies u = 0.
Exercise 2.5.4. Show that ℓ¹ is a real vector space and that the function a → Σₙ₌₁^∞ |aₙ| = ‖a‖₁ is a norm on ℓ¹.
Exercise 2.5.5. Let ℓ² denote the set of sequences a = (aₙ)ₙ≥₁ such that Σₙ₌₁^∞ aₙ² < +∞ (the set of square summable sequences). Show that
(1) ℓ² is a real vector space, and
(2) a, b ∈ ℓ² implies ab ∈ ℓ¹, where (ab)ₙ = aₙbₙ, n ≥ 1.
Definition 2.5.6. Let V be a real vector space. An inner product on V is a function ( , ) : V × V → ℝ such that
(1) for all u, v ∈ V, (u,v) = (v,u),
(2) for all u₁, u₂, v ∈ V, (u₁ + u₂, v) = (u₁,v) + (u₂,v),
(3) for all u, v ∈ V and λ ∈ ℝ, (λu, v) = λ(u,v),
(4) for all u ≠ 0 in V, (u,u) > 0.
One often denotes (u,v) by ⟨u,v⟩. In addition, if the vector space is a complex vector space, then an inner product is complex-valued, and the above conditions are satisfied, with the important difference that (1) now states that (v,u) is the complex conjugate of (u,v). The simplest example of a complex inner product is defined on ℂ itself by (u,v) := u·v̄, where if v = a + ib, then v̄ = a − ib. In this case, the corresponding norm of v = a + ib is the modulus |v| = √(a² + b²) of the complex number v.
Exercise 2.5.7. For a, b ∈ ℓ², define (a,b) = Σₙ₌₁^∞ aₙbₙ (why does this series converge?). Show that ( , ) is an inner product on ℓ².
Exercise 2.5.8. Verify the Cauchy-Schwarz inequality: |(a,b)| ≤ ‖a‖₂ ‖b‖₂, where ‖a‖₂ = [Σₙ₌₁^∞ aₙ²]^{1/2}. Show that a → ‖a‖₂ is a norm. [Hint: consider the discriminant B² − 4AC of the quadratic polynomial in λ, Q(λ) = ‖a + λb‖₂² = Aλ² + Bλ + C ≥ 0. The quadratic formula λ = (−B ± √(B² − 4AC))/2A for the roots of Q(λ) shows that, since Q ≥ 0, (i) B² − 4AC ≤ 0 and (ii) B² − 4AC = 0 if and only if Q has a (unique) root, and this root equals −B/2A, as it is a root of Q′(λ) = 2Aλ + B = 0. (See also Proposition 5.1.4.)]
Example 2.5.9. The Poisson distribution on ℕ ∪ {0} with mean 1 can be viewed as a probability distribution given by an ℓ¹-density a, where aₙ = e^{−1}(1/n!). Is this density in ℓ²?
6. Differentiation under the integral sign
In many circumstances functions f(x) are transformed by integrating against a kernel K(x,y), a function of x, y ∈ ℝ. For example, (i) K(x,y) = (1/√(2π)) e^{−(x−y)²/2} or (ii) K(x,y) = e^{−xy}. The transformed function is Kf(x) = ∫ K(x,y)f(y)dy, assuming the integral (with respect to Lebesgue measure dy on ℝ) exists. The kernel may be differentiable in x, and it is then natural to expect that Kf is differentiable and that
  (d/dx) Kf(x) = ∫ (∂/∂x) K(x,y) f(y) dy.
When this is so, differentiation under the integral sign allows the computation of (Kf)′.
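Before the formal statement, here is a numerical sanity check (kernel (i) above; the test function f and the point x₀ are choices of ours): a central finite difference of Kf agrees with the integral of (∂K/∂x)·f, as the theorem below predicts.

```python
import numpy as np
from scipy.integrate import quad

K  = lambda x, y: np.exp(-(x - y)**2 / 2) / np.sqrt(2 * np.pi)
dK = lambda x, y: -(x - y) * K(x, y)       # partial derivative of K in x
f  = lambda y: 1.0 / (1.0 + y * y)

Kf = lambda x: quad(lambda y: K(x, y) * f(y), -np.inf, np.inf)[0]

x0, h = 0.7, 1e-5
finite_diff = (Kf(x0 + h) - Kf(x0 - h)) / (2 * h)
exact = quad(lambda y: dK(x0, y) * f(y), -np.inf, np.inf)[0]
print(finite_diff, exact)                  # agree to several decimal places
```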
Theorem 2.6.1. Let y → K(x,y)f(y) ∈ L¹(ℝ) for a < x < b. If a < x₀ < b, assume that there is an L¹-function Y(y) with
  |{K(x₀ + h, y) − K(x₀, y)}/h| · |f(y)| ≤ Y(y)
for almost all y ∈ ℝ and all h sufficiently small. If (∂/∂x)K(x,y) exists for all y at x = x₀, then Kf(x) = ∫ K(x,y)f(y)dy is differentiable at x₀, and
  (Kf)′(x₀) = ∫ (∂/∂x)K(x₀,y) f(y) dy.
Proof. Let a < x₀ < b, and let (hₙ)ₙ≥₁ be a sequence with hₙ ≠ 0 and a < x₀ + hₙ < b for all n. Set
  Fₙ(y) = {(K(x₀ + hₙ, y) − K(x₀, y))/hₙ} f(y).
If hₙ → 0 as n → ∞, then by hypothesis Fₙ(y) → (∂/∂x)K(x₀,y) f(y). Since |Fₙ(y)| ≤ Y(y), by dominated convergence (Theorem 2.1.38),
  ∫ (∂/∂x)K(x₀,y) f(y) dy = limₙ {Kf(x₀ + hₙ) − Kf(x₀)}/hₙ.
Hence, (Kf)′(x₀) exists and is this limit, since the sequence (hₙ)ₙ≥₁ was arbitrary. □
7. Signed measures and the Radon-Nikodym theorem*
Let (Ω, 𝔉, µ) be a σ-finite measure space. Exercise 2.4.2 indicates how to define a new measure ν by means of a non-negative function X ∈ L¹(Ω, 𝔉, µ): set ν(A) = ∫_A X dµ := ∫ 1_A(ω) X(ω) dµ(ω). The measure ν is non-negative because X is non-negative. However, the set function ν(A) = ∫_A X dµ is still defined without this requirement of positivity, and it continues to satisfy the condition (M2) of Definition 2.2.1.
Exercise 2.7.1. Let X ∈ L¹(Ω, 𝔉, µ), and define ν(A) = ∫_A X dµ for all A ∈ 𝔉. Show that
(1) ν(A) ∈ ℝ for all A ∈ 𝔉 (i.e., ν is finite),
(2) if A = ∪ₙ₌₁^∞ Aₙ, where the Aₙ are pairwise disjoint sets in 𝔉, then ν(A) = Σₙ₌₁^∞ ν(Aₙ) [Hint: make use of dominated convergence (Theorem 2.1.38) and the fact that |X| ∈ L¹(Ω, 𝔉, µ).],
(3) show that ν is the difference of two (non-negative) measures. [Hint: X = X⁺ − X⁻.]
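A toy version of Exercise 2.7.1 in the purely atomic case (the weights below are invented for illustration): with µ counting measure on a finite set and X a signed density, ν(A) splits as ν⁺ − ν⁻ via X = X⁺ − X⁻.

```python
# signed density X on a four-point space; mu is counting measure
X = {"a": 2.0, "b": -1.5, "c": 0.5, "d": -0.25}

def nu(A):        # the signed measure nu(A) = sum of X over A
    return sum(X[w] for w in A)

def nu_plus(A):   # the measure defined by X+ = max(X, 0)
    return sum(max(X[w], 0.0) for w in A)

def nu_minus(A):  # the measure defined by X- = max(-X, 0)
    return sum(max(-X[w], 0.0) for w in A)

A = {"a", "b", "d"}
print(nu(A), nu_plus(A) - nu_minus(A))   # both equal 0.25
```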
The set function thus defined by an L¹-function is an example of a signed measure.
Definition 2.7.2. A real-valued function ν defined on a σ-algebra 𝔉 will be called a signed measure if whenever A = ∪ₙ₌₁^∞ Aₙ, with the Aₙ pairwise disjoint sets in 𝔉, then ν(A) = Σₙ₌₁^∞ ν(Aₙ).
Remarks. (1) If ν is a signed measure, then ν(∅) = 0, as ∅ = ∅ ∪ ∅.
(2) The word "measure" will be reserved to mean a non-negative measure in the sense of Definition 2.2.1. At times, this usage will be reinforced by also writing "non-negative measure".
(3) Usually the term "signed measure" denotes a set function that, in addition to taking real values, may also take one of the values ±∞. For such signed measures, the analogue of a σ-finite measure is a signed measure (in the above extended sense) with the property that Ω is a countable union of sets of finite measure. Such a signed measure is called a σ-finite signed measure.
(4) Complex-valued measures are also considered, but not in this book: they are σ-additive complex-valued functions; their real and imaginary parts are signed measures in the sense of Definition 2.7.2.
The signed measure ν determined by an L¹-function can be written as the difference of two finite measures ν₁ and ν₂, as indicated in Exercise 2.7.1 (3): let ν₁ be the measure determined by X⁺ and ν₂ be the measure determined by X⁻. Given any two finite measures ν₁ and ν₂, it is easy to see that their difference ν₁ − ν₂ is a signed measure whose value at A ∈ 𝔉 is ν₁(A) − ν₂(A) (and it is a σ-finite signed measure if one of the two measures is σ-finite while the other is finite). It is a natural question to ask whether every signed measure can be written as the difference of two finite measures. This is in fact the case, as will be shown below.
As a first step in proving this result, one has the following characterization, for X ∈ L¹, of the set where X ≥ 0, in terms of the signed measure ν defined by X.
Proposition 2.7.3. Let X ∈ L¹(Ω, 𝔉, µ) and A ∈ 𝔉. The following properties of A are equivalent, where ν is the signed measure defined by X:
(1) ν(A) = ν⁺(A), where ν⁺(A) = ∫_A X⁺ dµ;
(2) ν(A) = ∫_A X⁺ dµ;
(3) ν⁻(A) = 0, where ν⁻(A) = ∫_A X⁻ dµ;
(4) if B ⊂ A, B ∈ 𝔉, then ν(B) ≥ 0;
(5) µ(A ∩ {X < 0}) = 0.
Proof. (1) implies (2) by the definition of ν⁺. Since ν(A) = ν⁺(A) − ν⁻(A), it is obvious that (2) implies (3). If B ⊂ A, then 0 ≤ ν⁻(B) = ∫_B X⁻ dµ ≤ ∫_A X⁻ dµ. Hence, if ν⁻(A) = 0, it follows that ν⁻(B) = 0 and so ν(B) = ν⁺(B) ≥ 0, i.e., (3) implies (4).
Assume (4), and let B = A ∩ {X < 0}. Then B = ∪ₙ₌₁^∞ Bₙ, where Bₙ = A ∩ {X⁻ ≥ 1/n} = A ∩ {X ≤ −1/n}. If µ(B) > 0, then for some n, µ(Bₙ) > 0. Now ν(Bₙ) = ∫_{Bₙ} X dµ ≤ −(1/n)µ(Bₙ) < 0, which contradicts (4). Hence, (4) implies (5).
Finally, (5) implies (1). Since A ∩ {X < 0} has µ-measure zero, it follows that ∫_A X dµ = ∫_A 1_{{X ≥ 0}} X dµ = ∫_A X⁺ dµ = ν⁺(A). □
Note that condition (4) in this proposition is formulated in terms of the signed measure ν alone. It motivates the following definition.
Definition 2.7.4. Let ν be a signed measure on a measurable space (Ω, 𝔉). A set A ∈ 𝔉 is said to be ν-positive or positive if ν(B) ≥ 0 for all 𝔉-measurable subsets B of A. It is said to be ν-negative or negative if ν(B) ≤ 0 for all 𝔉-measurable subsets B of A.
Remark. Obviously, a set is ν-negative if and only if it is (−1)ν-positive. Using this terminology, Proposition 2.7.3 implies that if dν = X dµ (i.e., if ν(A) = ∫_A X dµ for all A ∈ 𝔉), then a set is ν-positive if and only if it is a subset of {X ≥ 0}, except for a set of µ-measure zero.
Given a subset E ∈ 𝔉 that is ν-positive, the restriction of ν to E is a finite (non-negative) measure, as the next exercise shows.
Exercise 2.7.5. Let ν be a signed measure on a measurable space (Ω, 𝔉) and E ∈ 𝔉 be a ν-positive set. Show that the set function η(A) := ν(A ∩ E) is a finite measure on (Ω, 𝔉).
This suggests that one way to decompose a signed measure ν into the difference of two finite measures is to look for a largest ν-positive set E and see if the difference between ν and its restriction to E is a finite measure. The first thing to check is whether, in fact, given a signed measure ν, there are any ν-positive sets.
Let E ∈ 𝔉, and assume that A ⊂ E. Then ν(E) = ν(A) + ν(E\A). If the set E is ν-positive, then it is clear that ν(E) ≥ ν(A). This property characterizes ν-positive sets, as stated in the next exercise.
Exercise 2.7.6. Show that a set E ∈ 𝔉 is ν-positive if and only if, for any subset A ∈ 𝔉 of E, one has ν(E) ≥ ν(A).
It is therefore of interest to investigate the set function V̄(E), where V̄(E) = sup_{A⊂E} ν(A) for E ∈ 𝔉. It is called the upper variation of ν (see Halmos [H1], p. 122, and Wheeden and Zygmund [W1], p. 164).
Exercise 2.7.7. Show that
(1) the upper variation of a signed measure ν is a measure [Hint: first show it is countably subadditive (Definition 1.4.10).],
(2) the upper variation of a (non-negative) measure µ coincides with µ, and
(3) the upper variation of a signed measure equals zero if the signed measure is −µ, with µ a finite (non-negative) measure.
In fact, not only is the upper variation of ν a measure, it is a finite measure, as the next proposition shows.
Proposition 2.7.8. If ν is a signed measure, then V̄(Ω) < ∞.
The key to proving this result is the following lemma.
Lemma. Let ν be a signed measure, and assume that A ∈ 𝔉 has V̄(A) = +∞. If α > 0, there is a subset B of A with (i) V̄(B) = +∞ and (ii) ν(B) > α.
Proof of the lemma. Since V̄(A) = +∞, there is a subset B₁ of A with ν(B₁) > α. As V̄ is a measure, either (i) V̄(B₁) = +∞ or (ii) V̄(B₁) < +∞. Assume that the second alternative holds (otherwise B₁ is a set with the desired property). Then V̄(A\B₁) = +∞, and hence there is a subset B₂ of A\B₁ with ν(B₂) > α. Again assume that the second alternative holds, and continue in this way, always assuming the second alternative holds. Then there is a sequence of pairwise disjoint subsets (Bₖ)ₖ≥₁ of A with ν(Bₖ) > α. Since ν is finite-valued, this is impossible, as ν(∪ₖ≥₁ Bₖ) = Σₖ≥₁ ν(Bₖ) = +∞. It follows, therefore, that there is a subset B of A with the desired property. □
Proof of Proposition 2.7.8. If V̄(Ω) = +∞, then by the lemma there is a subset A₁ with (i) V̄(A₁) = +∞ and (ii) ν(A₁) > 1. Using the lemma once again, one gets a subset A₂ of A₁ with (i) V̄(A₂) = +∞ and (ii) ν(A₂) > ν(A₁) + 1. Continuing in this way, one obtains a sequence (Aₖ)ₖ≥₁ of subsets of Ω with Aₖ ⊃ Aₖ₊₁ and ν(Aₖ₊₁) > ν(Aₖ) + 1 for all k ≥ 1. Let A_∞ = ∩ₙ₌₁^∞ Aₙ, and let A = A₁\A_∞ = ∪ₖ₌₁^∞ (Aₖ\Aₖ₊₁). Once again, since ν(A) is finite, this leads to a contradiction, as ν(Aₖ) = ν(Aₖ\Aₖ₊₁) + ν(Aₖ₊₁) implies that ν(Aₖ\Aₖ₊₁) < −1, and so ν(A) = Σₖ ν(Aₖ\Aₖ₊₁) = −∞. □
Proposition 2.7.9. Let ν be a signed measure that takes on both negative and positive values. If E₁ ∈ 𝔉 and ν(E₁) > 0, then there is a subset E of E₁ with ν(E) > 0 that is ν-positive.
Proof. The idea of the proof is to remove successively from E₁ sets of negative measure that are as large as possible. What is left over will then turn out to be the desired ν-positive set. Of course, if by luck every subset of E₁ has non-negative ν-measure, then one takes E to be E₁. The concept of "as large as possible" amounts to the following: choose a measurable subset A₁ of E₁ (i.e., one belonging to 𝔉) with ν(A₁) ≤ −1/n₁, where n₁ ≥ 1 is as small as possible. If E₁\A₁ is a ν-positive set, it has positive ν-measure, as 0 < ν(E₁) = ν(A₁) + ν(E₁\A₁) and ν(A₁) < 0; one sets E = E₁\A₁. If E₁\A₁ is not a positive set, then again choose a subset A₂ of negative measure that is less than or equal to −1/n₂, where n₂ ≥ 1 is as small as possible: note that n₂ ≥ n₁.
If E₁\(A₁ ∪ A₂) is not ν-positive, continue, by choosing as large as possible a subset A₃ of E₁ disjoint from A₁ ∪ A₂ that has negative ν-measure, and so on. If this procedure stops at the ℓth stage, then E₁\(∪ₖ₌₁^ℓ Aₖ) is a ν-positive set of positive ν-measure. On the other hand, if it never stops, then one has a sequence (Aₖ)ₖ≥₁ of pairwise disjoint subsets Aₖ of E₁ with ν(Aₖ) ≤ −1/nₖ, where nₖ is as small as possible. Let A = ∪ₖ₌₁^∞ Aₖ. Then −∞ < ν(A) = Σₖ₌₁^∞ ν(Aₖ) ≤ −Σₖ₌₁^∞ 1/nₖ. This implies that the series Σₖ₌₁^∞ 1/nₖ converges, and hence limₖ nₖ = +∞.
Now suppose that B ⊂ E₁ is disjoint from A. If ν(B) < 0, then there is an integer k ≥ 1 with ν(B) ≤ −1/k. This means that at every stage of the above procedure, the set B was an admissible choice, so that nⱼ ≤ k (i.e., 1/nⱼ ≥ 1/k) for every j. This contradicts the fact that the series Σⱼ₌₁^∞ 1/nⱼ converges. Hence, ν(B) ≥ 0, i.e., E = E₁\A is a ν-positive set and, since 0 < ν(E₁) = ν(A) + ν(E), it follows that ν(E) > 0. □
Without loss of generality, for a signed measure one can assume ν(Ω) ≥ 0 (if not, consider −ν). The basic idea of the above proof is now used to determine a measurable subset of Ω that is ν-positive and has a ν-negative complement, a so-called Hahn decomposition of Ω relative to ν. The argument uses the following observation.
Exercise 2.7.10. Let (Aₙ) be a sequence of pairwise disjoint ν-positive sets. Then A = ∪ₙ₌₁^∞ Aₙ is also ν-positive.
Theorem 2.7.11. (Decompositions associated with a signed measure) Let ν be a signed measure on (Ω, 𝔉).
(1) There is a set E⁺ ∈ 𝔉 that is positive and whose complement Ω\E⁺ := E⁻ is negative (this is a Hahn decomposition of Ω relative to ν).
(2) If ν⁺ denotes the restriction of ν to E⁺ and ν⁻ denotes the restriction of (−1)ν to E⁻, then ν± are finite measures and ν = ν⁺ − ν⁻ (see the Jordan decomposition of ν below).
Proof. Statement (2) is obvious given the existence of the sets E⁺ and E⁻: each set A = (A ∩ E⁺) ∪ (A ∩ E⁻); by Exercise 2.7.5, ν± are finite measures and ν(A) = ν(A ∩ E⁺) + ν(A ∩ E⁻) = ν⁺(A) − ν⁻(A).
It suffices to verify (1) when ν(Ω) ≥ 0. To determine the set E⁺, let A₁ be a measurable subset of Ω that is (i) ν-positive with positive ν-measure and (ii) as large as possible (i.e., ν(A₁) ≥ 1/n₁, where n₁ ≥ 1 is as small as possible). If Ω\A₁ is not ν-negative, let A₂ be a subset of Ω\A₁ that is (i) ν-positive and (ii) as large as possible. If this procedure stops at the ℓth stage, set E⁺ = ∪ₖ₌₁^ℓ Aₖ and E⁻ = Ω\E⁺. By Exercise 2.7.10, these sets have the desired properties.
If the procedure does not stop, then one has, as in the proof of Proposition 2.7.9, a sequence (Aₖ)ₖ≥₁ of pairwise disjoint ν-positive sets Aₖ with ν(Aₖ) ≥ 1/nₖ and nₖ as defined above. Since Σₖ₌₁^∞ ν(Aₖ) = ν(A) < +∞, the series Σₖ₌₁^∞ 1/nₖ converges: as a result, limₖ nₖ = +∞. If B is disjoint from A, then, by an argument similar to the one used in the proof of Proposition 2.7.9, one has ν(B) ≤ 0. Set E⁺ = A. By Exercise 2.7.10, it has the desired property. □
The measures ν± are given by the upper and lower variations of ν, as shown in the next exercise. For this reason, the decomposition of ν is its so-called Jordan decomposition ν = V̄ − V̲.
Exercise 2.7.12. Show that
(1) the measure ν⁺ equals V̄. Define the lower variation V̲ of ν by setting V̲(E) = −inf_{A⊂E} ν(A) = sup_{A⊂E} −ν(A). Show that
(2) ν⁻ = V̲; and hence, conclude that
(3) ν = V̄ − V̲ (the Jordan decomposition of ν).
Note that the total variation |ν| of ν is defined to be V̄ + V̲ (see [H1], p. 122, and [W1], p. 164). In case dν = X dµ with X ∈ L¹(Ω, 𝔉, µ), show that
(4) dV̄ = X⁺ dµ (i.e., V̄(A) = ∫_A X⁺ dµ),
(5) dV̲ = X⁻ dµ (i.e., V̲(A) = ∫_A X⁻ dµ), and
(6) d|ν| = |X| dµ (i.e., |ν|(A) = ∫_A |X| dµ).
Remark. These decompositions exist for σ-finite signed measures ν. One decomposes the whole space Ω into a sequence of disjoint measurable sets Aₙ, on each of which the restriction of ν is a signed measure νₙ, i.e., a finite signed measure. On each of these sets, Hahn and Jordan decompositions for νₙ may be defined, and one merely puts them together in the obvious way to get a decomposition for the whole space. For example, if Aₙ = Aₙ⁺ ∪ Aₙ⁻ is a Hahn decomposition for νₙ, then A⁺ = ∪ₙ Aₙ⁺ and A⁻ = ∪ₙ Aₙ⁻ give a Hahn decomposition of Ω for ν.
Definition 2.7.13. Let µ₁ and µ₂ be two measures defined on a measurable space (Ω, 𝔉). They are said to be mutually singular if there is a set A ∈ 𝔉 with µ₁(A) = 0 and µ₂(Ω\A) = 0.
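In the purely atomic case, the Hahn and Jordan decompositions of Theorem 2.7.11 are transparent; the following toy sketch (weights invented for illustration) computes E⁺, E⁻, and the mutually singular pieces ν⁺, ν⁻, which live on the disjoint sets E⁺ and E⁻.

```python
# nu is determined by signed weights on finitely many atoms
weights = {"a": 2.0, "b": -1.5, "c": 0.5, "d": -0.25, "e": 0.0}

E_plus  = {w for w, x in weights.items() if x >= 0}   # a nu-positive set
E_minus = set(weights) - E_plus                       # its nu-negative complement

def nu(A):
    return sum(weights[w] for w in A)

A = {"a", "b", "e"}
nu_plus, nu_minus = nu(A & E_plus), -nu(A & E_minus)  # restrictions of nu, -nu
print(nu(A), nu_plus - nu_minus)                      # both equal 0.5
```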
It follows from Theorem 2.7.11 that the two measures ν⁺ = V̄ and ν⁻ = V̲ are mutually singular. This property characterizes the Jordan decomposition of a signed measure ν.
Exercise 2.7.14. Let ν = ν₁ − ν₂ = µ₁ − µ₂ be a signed measure, where ν₁, ν₂, µ₁, and µ₂ are finite measures. Then ν₁ = µ₁ and ν₂ = µ₂ if ν₁ and ν₂, as well as µ₁ and µ₂, are pairs of mutually singular measures.
Hence, not only is every signed measure the difference of two finite measures, but in addition the measures in this decomposition are unique and "live" on disjoint sets if they are mutually singular.
Hahn decompositions of the underlying set Ω into positive and negative parts are by no means unique, because a set of measure zero need not be empty (see the following exercise).
Exercise 2.7.15. Let ν denote the signed measure ε₁ − ε₋₁ on (ℝ, 𝔅(ℝ)). Determine all the Hahn decompositions of ℝ associated with ν.
The Radon-Nikodym theorem. An essential feature of a signed measure ν given by X ∈ L¹(Ω, 𝔉, µ) is that ν(A) = 0 if µ(A) = 0.
Definition 2.7.16. Let µ be a measure on a measurable space (Ω, 𝔉). A signed measure ν on (Ω, 𝔉) is said to be absolutely continuous with respect to µ if µ(A) = 0 implies that ν(A) = 0. This is denoted by writing ν ≪ µ.
In other words, the signed measures determined by L¹-functions are all absolutely continuous with respect to the basic measure µ. The Radon-Nikodym theorem states that the converse is true.
Remark 2.7.17. Note that a signed measure ν is absolutely continuous with respect to µ if and only if ν± are both absolutely continuous with respect to µ.
To prove the Radon-Nikodym theorem, it suffices to prove it for a (non-negative) measure. If X ∈ L¹(µ) is non-negative, the measure dν = X dµ clearly has the property that if F_{n,k} = {k/2ⁿ ≤ X < (k+1)/2ⁿ}, where k ≥ 0, then
  (k/2ⁿ) µ(A ∩ F_{n,k}) ≤ ν(A ∩ F_{n,k}) ≤ ((k+1)/2ⁿ) µ(A ∩ F_{n,k}).
Given the connection between a Hahn decomposition for ν − (k/2ⁿ)µ and the sets {k/2ⁿ ≤ X}, {X < k/2ⁿ}, the above suggests a way to approximate the measure ν by measures defined by simple functions sₙ.
Let ν be any finite measure, and let D denote the set of positive dyadic rationals (i.e., r ∈ D if and only if r = k/2ⁿ for some k, n ∈ ℕ). Assume that for each α ∈ D it is possible to determine a Hahn decomposition Ω = E_α⁺ ∪ E_α⁻ for ξ_α = ν − αµ such that if α₁ < α₂, then E_{α₁}⁺ ⊃ E_{α₂}⁺ (while it is not totally obvious that this monotone condition can be satisfied, it is not hard to find such a countable family of Hahn decompositions, as shown later in Proposition 2.7.28). Since ν is a measure (i.e., it is non-negative), the Hahn decomposition corresponding to 0 can be assumed to be Ω = E₀⁺ and ∅ = E₀⁻.
The function sₙ is constructed from the Hahn decompositions associated with Dₙ⁺, the set of non-negative dyadic rationals of the form k/2ⁿ, 1 ≤ k ≤ 2²ⁿ, n ≥ 1. Let
  F_{n,0} = E_{1/2ⁿ}⁻,  F_{n,k} = E_{k/2ⁿ}⁺ \ E_{(k+1)/2ⁿ}⁺ = E_{k/2ⁿ}⁺ ∩ E_{(k+1)/2ⁿ}⁻,  for 1 ≤ k < 2²ⁿ.
These sets are pairwise disjoint, and for any A ∈ 𝔉 it follows that, for 0 ≤ k < 2²ⁿ,
  (k/2ⁿ) µ(A ∩ F_{n,k}) ≤ ν(A ∩ F_{n,k}) ≤ (k/2ⁿ) µ(A ∩ F_{n,k}) + (1/2ⁿ) µ(A ∩ F_{n,k}).
Define
  sₙ(ω) = 0 if ω ∈ F_{n,0};  sₙ(ω) = k/2ⁿ if ω ∈ F_{n,k}, for 1 ≤ k < 2²ⁿ;  sₙ(ω) = 2ⁿ if ω ∈ E_{2ⁿ}⁺.
It follows that, for any A ∈ 𝔉, one has
  (†)  ∫_{A∩E_{2ⁿ}⁻} sₙ dµ ≤ ν(A ∩ E_{2ⁿ}⁻) ≤ ∫_{A∩E_{2ⁿ}⁻} sₙ dµ + (1/2ⁿ) µ(Ω),
where E_{2ⁿ}⁻ = (E_{2ⁿ}⁺)ᶜ = ∪_{0≤k<2²ⁿ} F_{n,k}. Since E_{k/2ⁿ}⁺ = E_{2k/2ⁿ⁺¹}⁺ ⊃ E_{(2k+1)/2ⁿ⁺¹}⁺, it follows that sₙ ≤ sₙ₊₁ for all n ≥ 1. Let X = limₙ sₙ. It follows from (†) that ∫_{A∩E_{2ⁿ}⁻} X dµ = ν(A ∩ E_{2ⁿ}⁻) for any A ∈ 𝔉. Let E_f denote the union of all the sets E_{2ⁿ}⁻, n ≥ 1. This is the set on which X is finite.
Lemma 2.7.18. Let E_fᶜ denote the complement of E_f, i.e., E_fᶜ = ∩ₙ E_{2ⁿ}⁺. Then µ(E_f) = µ(Ω), i.e., µ(E_fᶜ) = 0.
Proof. If B ⊂ E_fᶜ, then ν(B) ≥ 2ⁿ µ(B) for every n ≥ 1. Since ν(B) < +∞, this implies that µ(B) = 0 if B ⊂ E_fᶜ. Hence, µ(E_fᶜ) = 0. □
This completes most of the proof of the following theorem.
Theorem 2.7.19. (Radon-Nikodym theorem: measures) Let µ be a finite measure on a measure space (Ω, 𝔉) and ν be another finite measure that is absolutely continuous with respect to µ (i.e., ν ≪ µ). Then there is a non-negative function X ∈ L¹(Ω, 𝔉, µ) such that
  (*)  ∫_A X dµ = ν(A) for all A ∈ 𝔉.
Furthermore, X is unique in the sense that if Y satisfies (*), then X = Y µ-a.e. (i.e., µ({ω | X(ω) ≠ Y(ω)}) = 0).
Proof. Using the previous notations, ν(E_fᶜ) = 0 since ν ≪ µ, and so, for any set A ∈ 𝔉, one has
  ν(A) = limₙ ν(A ∩ E_{2ⁿ}⁻) = limₙ ∫_{A∩E_{2ⁿ}⁻} X dµ = ∫_A X dµ.
Since ν is finite, this also implies that X ∈ L¹(µ). Assume that Y satisfies (*), and for ε > 0 let A = {Y ≥ X + ε}. Then ∫_A X dµ = ν(A) = ∫_A Y dµ ≥ ∫_A X dµ + ε µ(A). Hence, µ(A) = 0, and so Y ≤ X µ-a.e. By symmetry, it follows that X = Y µ-a.e. □
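On a finite set, the theorem is elementary and the density can be written down directly; the following sketch (measures invented for illustration) computes it: ν ≪ µ means ν gives no mass to atoms where µ vanishes, and X(ω) = ν({ω})/µ({ω}) on {µ > 0}.

```python
mu = {"a": 0.5, "b": 0.25, "c": 0.25, "d": 0.0}
nu = {"a": 0.1, "b": 0.6, "c": 0.3, "d": 0.0}   # nu << mu here

# the Radon-Nikodym derivative dnu/dmu, atom by atom
X = {w: (nu[w] / mu[w] if mu[w] > 0 else 0.0) for w in mu}

def integral_over(A):          # integral of X over A against mu
    return sum(X[w] * mu[w] for w in A)

A = {"a", "c"}
print(integral_over(A), nu["a"] + nu["c"])   # both equal 0.4
```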
Corollary 2.7.20. (Radon-Nikodym theorem: signed measures) Let µ be a finite measure on a measure space (Ω, 𝔉) and ν be a signed measure that is absolutely continuous with respect to µ (i.e., ν ≪ µ). Then there is a function X ∈ L¹(Ω, 𝔉, µ) such that
  (*)  ∫_A X dµ = ν(A) for all A ∈ 𝔉.
Furthermore, X is unique in the sense that if Y satisfies (*), then X = Y µ-a.e. (i.e., µ({ω | X(ω) ≠ Y(ω)}) = 0).
Proof. Let ν = ν⁺ − ν⁻ denote the Jordan decomposition of ν. Then, as pointed out in Remark 2.7.17, both ν± are absolutely continuous. Let dν⁺ = X₁ dµ and dν⁻ = X₂ dµ. Then X = X₁ − X₂ ∈ L¹(µ) and satisfies (*). Uniqueness is proved as in the case of measures. □
Remark. The Radon-Nikodym theorem is usually stated when the basic measure µ is assumed to be σ-finite and ν is a signed measure. The extension from the case when µ is finite to the more general σ-finite case may be carried out by cutting Ω up into a countable collection of pairwise disjoint sets of finite µ-measure and applying Corollary 2.7.20 to each piece.
Exercise 2.7.21. Prove the Radon-Nikodym theorem when µ is σ-finite.
The Lebesgue decomposition of a finite signed measure with respect to a finite measure. The basic argument just used to prove the Radon-Nikodym theorem only made use of the hypothesis of absolute continuity to guarantee that ν(A) = ν(A ∩ E_f). In fact, if ν is any finite measure, one always has
  (†)  ν(A ∩ E_f) = limₙ ν(A ∩ E_{2ⁿ}⁻) = limₙ ∫_{A∩E_{2ⁿ}⁻} X dµ = ∫_A X dµ,
since µ(E_fᶜ) = 0 by Lemma 2.7.18. Let η(A) = ν(A ∩ E_fᶜ). Then η is a finite measure that is singular with respect to µ, and so one has
  ν(A) = η(A) + ν(A ∩ E_f) = η(A) + ∫_A X dµ for any A ∈ 𝔉.
This proves the following result.
Theorem 2.7.22. (Lebesgue decomposition theorem: measures) Let µ be a finite measure on a measure space (Ω, 𝔉) and ν be another finite measure. Then there is a unique decomposition of ν as a sum
  (*)  ν = ξ + η
of an absolutely continuous measure ξ and a singular measure η. In terms of the notations used earlier, η(A) = ν(A ∩ E_fᶜ) and ξ(A) = ∫_A X dµ with X a non-negative function in L¹(Ω, 𝔉, µ), i.e.,
  ν(A) = ∫_A X dµ + η(A) for any A ∈ 𝔉.
The L¹(µ)-function X is unique modulo a null function.
Proof. If ξ(A) = ν(A ∩ E_f), then by (†) it follows that η and ξ are two mutually singular measures whose sum is ν, with ξ ≪ µ and η singular with respect to µ. Assume that ν is the sum η₁ + ξ₁ of two mutually singular measures with ξ₁ ≪ µ and η₁ singular with respect to µ.
Lemma 2.7.23. Let ν = η₁ + ξ₁ with η₁ singular with respect to µ. Let E be a set in 𝔉 on which ν − αµ is negative for some 0 < α < +∞. Then η₁(B) = 0 for all B ⊂ E.
Proof. Since η₁(B) ≤ ν(B) ≤ αµ(B) for all B ⊂ E, the fact that η₁ and µ are mutually singular implies that η₁(B) = 0: if S carries η₁ and µ(S) = 0, then η₁(B) = η₁(B ∩ S) ≤ αµ(B ∩ S) = 0. □
It follows from this lemma that for any B ⊂ E_{2ⁿ}⁻, one has η₁(B) = 0. Hence, for all B ⊂ E_f, it follows that η₁(B) = 0. As a result, for any A ∈ 𝔉, it follows that
  ν(A ∩ E_f) = η₁(A ∩ E_f) + ξ₁(A ∩ E_f) = ξ₁(A ∩ E_f).
Since ξ₁ ≪ µ and µ(E_fᶜ) = 0, it follows that
  ξ₁(A) = ξ₁(A ∩ E_f) + ξ₁(A ∩ E_fᶜ) = ξ₁(A ∩ E_f).
Hence, ξ₁ = ξ, and so η₁ = η. The uniqueness of X follows in the usual way, or by appeal to the Radon-Nikodym theorem (Theorem 2.7.19). □
Exercise 2.7.24. Formulate and prove the Lebesgue decomposition theorem (i) when ν is a signed measure with µ finite and (ii) when µ is σ-finite and ν is a signed measure.
Exercise 2.7.25. Let P be a probability on (ℝ, 𝔅(ℝ)) that is neither absolutely continuous nor singular. Show that there is a unique probability P_s that is singular with respect to Lebesgue measure dx, a unique non-negative L¹-function f with ∫ f(x)dx = 1, and a unique λ ∈ [0,1] such that, for any Borel set A, one has
  P(A) = λ ∫_A f(x)dx + (1 − λ) P_s(A).
In the terminology of the next section, the probability determined by the L¹-function f is absolutely continuous with respect to Lebesgue measure. In other words, every probability that is neither absolutely continuous nor singular is a unique convex combination of a singular probability and an absolutely continuous one, i.e., if P_ac(A) = ∫_A f(x)dx, then P has a unique representation
  P = λ P_ac + (1 − λ) P_s
as a convex combination of an absolutely continuous probability and a singular probability.
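In the purely atomic setting, the Lebesgue decomposition is again transparent; the following sketch (measures invented for illustration) splits ν into the part carried by {µ > 0}, which is absolutely continuous with respect to µ, and the part carried by {µ = 0}, which is singular.

```python
mu = {"a": 0.5, "b": 0.5, "c": 0.0, "d": 0.0}
nu = {"a": 0.2, "b": 0.3, "c": 0.4, "d": 0.1}

xi  = {w: (nu[w] if mu[w] > 0 else 0.0) for w in nu}   # xi << mu
eta = {w: (nu[w] if mu[w] == 0 else 0.0) for w in nu}  # eta singular w.r.t. mu

# nu = xi + eta atom by atom
print(all(abs(nu[w] - (xi[w] + eta[w])) < 1e-12 for w in nu))  # True
```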
Countable families of Hahn decompositions. It remains to show that the basic assumption made above can be verified: namely, that for each α ∈ D, the set of dyadic rationals, it is possible to determine a Hahn decomposition Ω = E_α⁺ ∪ E_α⁻ for ξ_α = ν − αµ such that if α₁ < α₂, then E_{α₁}⁺ ⊃ E_{α₂}⁺.
The first step is to see how Hahn decompositions of ξ_α = ν − αµ may be chosen to be "compatible" in the above sense for two values of α.
Exercise 2.7.26. Let E_α⁺ denote the positive subset of a Hahn decomposition for ν − αµ. Show that
(1) if α₁ < α₂, then µ(E_{α₂}⁺\E_{α₁}⁺) = 0 [Hint: if A ⊂ E_α⁺, then ν(A) ≥ αµ(A), and if A ⊂ E_α⁻, then ν(A) ≤ αµ(A)],
(2) if A is (ν − α₁µ)-negative, it is (ν − α₂µ)-negative,
(3) E_{α₁}⁺ ∩ E_{α₂}⁺ is the positive set for a Hahn decomposition associated with ν − α₂µ, and
(4) E_{α₁}⁻ ∪ E_{α₂}⁻ is the negative set for a Hahn decomposition associated with ν − α₂µ.
Therefore, the original Hahn decomposition E_{α₂}⁺ and E_{α₂}⁻ for ν − α₂µ, where α₁ < α₂, may be replaced by E_{α₁}⁺ ∩ E_{α₂}⁺ and E_{α₁}⁻ ∪ E_{α₂}⁻. The modification amounts to observing that the union of the two (ν − α₂µ)-negative sets E_{α₂}⁻ and E_{α₁}⁻ is also a (ν − α₂µ)-negative set. This observation about the union of negative sets may be extended to countable unions, as indicated in the following exercise.
Exercise 2.7.27. Let ξ be a signed measure and (Bₖ)ₖ≥₁ be a sequence of ξ-negative sets. Show that ∪ₖ≥₁ Bₖ is a ξ-negative set. [Hint: express the union as a countable disjoint union of sets (see Remark 1.2.3).]
Proposition 2.7.28. Let D be a countable subset of ℝ. For each α ∈ D, there is a Hahn decomposition of ν − αµ such that if E_α⁺ is the positive set of this decomposition, then α₁ < α₂ implies E_{α₁}⁺ ⊃ E_{α₂}⁺ for α₁, α₂ ∈ D.
Proof. For each α ∈ D, let E′_α be the positive set of some Hahn decomposition associated with ν − αµ. Define E_α⁺ to be ∩_{β∈D, β≤α} E′_β. Then E_α⁺ is (ν − αµ)-positive, as it is a subset of the (ν − αµ)-positive set E′_α, and its complement is (E_α⁺)ᶜ = ∪_{β∈D, β≤α} (E′_β)ᶜ. Since (E′_β)ᶜ is (ν − βµ)-negative, it is (ν − αµ)-negative for β ≤ α by Exercise 2.7.26 (2), and so it follows from Exercise 2.7.27 that (E_α⁺)ᶜ is (ν − αµ)-negative. Let E_α⁻ equal (E_α⁺)ᶜ. Then E_α⁺ ∪ E_α⁻ is a Hahn decomposition of ν − αµ. The "compatibility" property holds by construction. □
8. Signed measures on ℝ and functions of bounded variation*
Functions of bounded variation. Let ν = ν⁺ − ν⁻ be a signed measure on 𝔅(ℝ), and define
  T(x) = ν((0,x]) if 0 < x;  T(x) = 0 if x = 0;  T(x) = −ν((x,0]) if x < 0.
Then, if −∞ < a < b < +∞, it follows that T(b) − T(a) = ν((a,b]). Furthermore, while T is right continuous, it is not necessarily non-decreasing, as the signed measure ν is not necessarily a measure. On the other hand, the measures ν± define non-decreasing, right continuous functions G± by Theorem 2.2.2:
  G±(x) = ν±((0,x]) if 0 < x;  G±(x) = 0 if x = 0;  G±(x) = −ν±((x,0]) if x < 0.
It follows from these definitions that T = G⁺ − G⁻. This formula has the following proposition as an immediate consequence, since tᵢ₋₁ < tᵢ implies that
  |T(tᵢ) − T(tᵢ₋₁)| ≤ {G⁺(tᵢ) − G⁺(tᵢ₋₁)} + {G⁻(tᵢ) − G⁻(tᵢ₋₁)} = ν⁺((tᵢ₋₁,tᵢ]) + ν⁻((tᵢ₋₁,tᵢ]).
Proposition 2.8.1. Let π : t₀ = a < t₁ < t₂ < ⋯ < tₙ = b be a finite partition of the bounded interval [a,b]. Then
  Σᵢ₌₁ⁿ |T(tᵢ) − T(tᵢ₋₁)| ≤ {G⁺(b) − G⁺(a)} + {G⁻(b) − G⁻(a)} = ν⁺((a,b]) + ν⁻((a,b]).
Since the upper bound ν⁺((a,b]) + ν⁻((a,b]), while dependent upon the interval, does not depend upon the partition, the function T is of bounded variation on every finite interval [a,b]: in point of fact, in this instance, the upper bound ν⁺((a,b]) + ν⁻((a,b]) ≤ ν⁺(ℝ) + ν⁻(ℝ) < ∞, since it is assumed here, unless otherwise stated, that a signed measure is the difference of two finite measures.
Definition 2.8.2. Let Φ be a function defined on [a,b]. It is said to have bounded variation or to be of bounded variation if there is a constant M ≥ 0 such that, for any partition π : t₀ = a < t₁ < t₂ < ⋯ < tₙ = b of [a,b], one has
  Σᵢ₌₁ⁿ |Φ(tᵢ) − Φ(tᵢ₋₁)| := V(Φ,π) ≤ M.
The least upper bound of {V(Φ,π) | π a partition of [a,b]} is defined to be the total variation of Φ on [a,b] and will be denoted by V(Φ,[a,b]).
Remark. If π′ refines π (i.e., if it contains all the partition points of π), then V(Φ,π) ≤ V(Φ,π′).
It will be shown in what follows that every function of bounded variation is the difference of two non-decreasing functions and that, in the case of the function T associated with a signed measure, its total variation over a finite interval [a,b] is exactly ν⁺((a,b]) + ν⁻((a,b]) (see Proposition 2.8.11 later). The argument used to establish Proposition 2.8.1 shows that if Φ is a function defined on [a,b] and if Φ is the difference of two non-decreasing functions (defined on [a,b]), then Φ is of bounded variation. The converse is stated as the following result.
Proposition 2.8.3. Let Φ be a function of bounded variation on [a,b], and let V(Φ)(x) = V(Φ,[a,x]). Then V(Φ) is a finite non-decreasing function on [a,b], and V(Φ) ± Φ is also a non-decreasing function on [a,b]. Hence,
  Φ = V(Φ) − {V(Φ) − Φ} = {V(Φ) + Φ} − V(Φ)
is the difference of two non-decreasing functions.
This result is an easy consequence of the following lemma.
Lemma 2.8.4. Let Φ be a function of bounded variation on the finite interval [a,b]. If a ≤ x < x + h ≤ b, then
  V(Φ,[a,x+h]) = V(Φ,[a,x]) + V(Φ,[x,x+h]).
Proof. To begin, if π₁ and π₂ are two partitions contained in [a,b] and if the partition π₁ ends at the first point of π₂, then their union π₁ ∪ π₂ is a partition such that V(Φ,π₁) + V(Φ,π₂) = V(Φ,π₁ ∪ π₂). From this, it follows that V(Φ,[a,x+h]) ≥ V(Φ,[a,x]) + V(Φ,[x,x+h]). Since one may always add the point x to any partition of [a,x+h], it follows that V(Φ,[a,x+h]) ≤ V(Φ,[a,x]) + V(Φ,[x,x+h]). □
Proof of Proposition 2.8.3. First observe that, when h > 0, every partition π of [a,x] can be extended to give a partition π_h of [a,x+h] simply by adding the point x + h. Hence, V(Φ,π) ≤ V(Φ,π_h), which implies that V(Φ)(x) ≤ V(Φ)(x+h), i.e., V(Φ) is a non-decreasing function of x. It follows from Lemma 2.8.4 that V(Φ)(x+h) − V(Φ)(x) = V(Φ,[x,x+h]) ≥ |Φ(x+h) − Φ(x)| ≥ ±{Φ(x+h) − Φ(x)}, and so V(Φ)(x) ± Φ(x) are two non-decreasing functions of x on [a,b], the sign choice being fixed independent of x. □
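A small numerical sketch (the test function and grid are choices of ours): the total variation of Φ on [a,b], approximated over a fine uniform partition, together with the decomposition Φ = ½{V(Φ) + Φ} − ½{V(Φ) − Φ} into non-decreasing pieces that appears in Proposition 2.8.5 below.

```python
import numpy as np

Phi = np.cos                        # a smooth function of bounded variation
t = np.linspace(0.0, 2 * np.pi, 100_001)
steps = np.diff(Phi(t))

# running total variation V(Phi)(t) along the partition
V = np.concatenate([[0.0], np.cumsum(np.abs(steps))])
H1 = (V + Phi(t)) / 2               # non-decreasing
H2 = (V - Phi(t)) / 2               # non-decreasing, and Phi = H1 - H2

print(V[-1])                        # total variation of cos on [0, 2*pi] is 4
print(np.all(np.diff(H1) >= -1e-12), np.all(np.diff(H2) >= -1e-12))  # True True
```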
While every function Φ of bounded variation is a difference of finite, non-decreasing functions, say H₁ and H₂, this representation is not unique, since not only can a constant be added to a non-decreasing function without changing this property, but also any non-decreasing function may be added. Note that if Φ = H₁ − H₂, then V(Φ) ≤ H₁ + H₂, and if H₁ = ½{V(Φ) + Φ} and H₂ = ½{V(Φ) − Φ}, then the representation of Φ given by Φ = H₁ − H₂ is such that H₁ + H₂ = V(Φ). The converse is true since, given γ and δ, if two numbers α and β are such that α + β = γ and α − β = δ, then α = ½{γ + δ} and β = ½{γ − δ}. This proves the following proposition.
Proposition 2.8.5. Let Φ be a function of bounded variation on [a,b], and let V(Φ)(x) = V(Φ,[a,x]) be its total variation on [a,x]. Then the representation of Φ given by
  Φ(x) = ½{V(Φ)(x) + Φ(x)} − ½{V(Φ)(x) − Φ(x)}
is the unique representation of Φ as the difference of two non-decreasing functions whose sum is the total variation V(Φ).
Functions of bounded variation determine signed measures when they are right continuous. Any non-decreasing function H on [a,b] may always be "regularized" by looking at the corresponding right continuous, non-decreasing function H̄, where H̄(x) = inf{H(u) | x < u}.
Exercise 2.8.6. Show that H̄ is right continuous on [a,b].
Consequently, every function Φ of bounded variation on [a,b] may also be "regularized" by taking right limits, and so determines a signed measure.
Given a function f on a closed finite interval [a,b], let its "natural" extension to all of ℝ (also denoted by f) be defined by setting f(x) = f(a) if x < a and f(x) = f(b) if b < x. With this convention, to each right continuous function Φ of bounded variation on a finite interval, there corresponds a right continuous function on ℝ that is of bounded variation on every finite interval.
Exercise 2.8.7. Let Φ be a real-valued function on ℝ whose restriction to every finite interval is of bounded variation. Show that
(1) if Ψ(y) := −Φ(−y), then V(Φ,[a,b]) = V(Ψ,[−b,−a]),
(2) there are non-decreasing functions H₁ and H₂ on ℝ such that Φ = H₁ − H₂ [Hint: first define these functions for 0 ≤ x < +∞, then use (1) and imitate the formula in Theorem 2.2.2.],
(3) if Φ is right continuous, the functions Hᵢ in (2) can be assumed to be right continuous and "standardized" to have value zero at 0.
Further, if the total variation V(Φ,[a,b]) ≤ M for all finite intervals [a,b], for some constant M > 0, show that
(4) Φ is a bounded function, and H₁ and H₂ may be taken to be bounded.
74 II. INTEGRATION It follows from this exercise and Theorem 2.2.2, that to each right continuous function $ on R for which the total variation V($, [a, 6]) is uniformly bounded by a constant M > 0 (i.e., independent of the interval), there corresponds a signed measure n on *8(R) such that (i((a, b}) = ¢(6) - $(a) for each finite interval (a, 6]. Remark. If one wants to include er-finite signed measures in the discussion, then at least one of the non-decreasing functions in a representation of $ has to be bounded. When $ is a right continuous function on K of bounded variation on every finite interval, it suffices that there is a constant M > 0 such that for every finite interval [a, b] one has either (i) V($, [a,b]) - {¢(6) - 4>(a)} < M or (ii) V(*. [a,6]) + {¢(6) - ¢(0)} < M. The signed measure fx has a total variation \fi\ = fi+ + fi~. This total variation measure (Exercise 2.7.12) is related to the total variation of the function ¢: in fact, |/i|((a,6]) = V(^ [a,6]) for any finite interval [a,b\. This is not entirely obvious and will require some additional discussion of the variation V(^7r) given by any partition. Lemma 2.8.8. If n : to = a < t\ < <2 < ••• < tn = b is a partition of the finite interval [a, 6], Jet V+{$,n) =f £?=1{*(*i) - ¢(^-1)} V 0 and V-(*,tt) ^ (-l)[E?=i{*(*0 - *(«i-i)} AO] = E?=i[(-l){*(*i) - ¢(¢.-1)}] VO. Then (1) V(*,b-) = V+(*,b-) + V-(*,b-), (2) V+(*,7r)<V(M)((a,6]),and (3) V-(*,tt)< ^)(((0,6]). where V((i)((a,b\) is the value of the upper variation of the signed measure fx for the set (a, 6] and V{fi){{a,b]) is the value of the lower variation (Exercise 2.7.12) of the signed measure fx for the same set. Hence, (4) V{9,[a,b))<V(n)((a,b]), where V(fi)((a,b}) = \fx\((a,b\) is the value of the total variation of the signed measure fx for the set (a, 6]. Proof. Identity (1) is obvious, since for any real number a one has by Exercise 1.1.16 that \a\ = a V 0 - {a A 0} = a V 0 + {-a} V 0. Let A denote the union of the intervals (ti-i,tj] for which $(U) - *(«i_i) > 0. Then n(A) = V+(*,7r) and so (2) V+(*,7r) < V((a,b\) = fi+((a,b}). Similarly, (3) V~{9,ir) < V((a,b]) = fj.-((a,b}), and so by (1) V(*,tt) < V((a,b]) = \n\{{a,b]), i.e., (4) holds. □ An almost immediate consequence of this lemma and Lemma 2.8.4 is the following corollary.
8. SIGNED MEASURES ON R AND FUNCTIONS OF BOUNDED VARIATION 75 Corollary 2.8.9. Let $ be a right continuous function on R that is of bounded variation on each finite interval. Then, for any reai number a, (1) the function V($)(x) = V($, [a,x]) is right continuous for x > a, and (2) the function V($, [x,a]) is right continuous for x < a. Proo/. Let ft > 0. Lemma 2.8.4 implies that |V($)(x) - V($)(x + h)\ = V($, [x,x + ft]) < \fi\((x,x + ft]). To complete the argument for (1), note that \fi\((x,x + ft] tends to zero as ft tends to zero. The argument for (2) is the same since |V($, [x,a]) - V($, [x + ft,a])| = V($, [x,x + ft]) as long as0<ft<a-x. D Exercise 2.8.10. Use Hahn and Jordan decompositions of fi to show that if the signed measure fi = fii — fi2, then for any measurable set A one has fj,\(A) > n+{A) and fi2{A) > n~{A). [Hint: if E+ is the positive set of a Hahn decomposition, show that for any measurable set A, one has ti+(A)<ti1(Ar\E+).} Proposition 2.8.11. Let $ be right continuous on R with uniformly bounded variation on each finite interval [a,b], i.e., there is a constant M > 0 such that V($, [a, 6]) < M for all finite intervals [a, b}. Let n be the corresponding signed measure. Then, for any finite interval [a,b], V($,[a,b)) = \»\((a,b)). Proof. One may assume a > 0: if not, consider the right continuous function $a{x) = $(x + a). Let fi' denote the corresponding signed measure. ThenM'((c-a,d-a]) = M((c,d]), V($a,[c-a,d-a]) = V($, [c,d]), |/x'|((c— a,d-a} = \n\(c,d}), and V($a,[c -a,d-a)) = V($, [c,d]. Since V($)(x) = V([0,x]) is right continuous in x for x > 0, the representation of 4» ( for x > 0) as $ = ^{V($) + $}- ±{V($)-$} determines two measures pi and H2 on R for which /ii((-oo,0]) = /12((-00,0]) = 0. They are the measures such that m{(a, b}) = ^{V{$) + $}(&) - ^{V(<&) + $}(a) if a > 0 and fi\((a,b}) = 0 if b < 0, and m is similarly denned by i{V($) - $}. It follows from Exercise 2.8.10 that |/i|((0,x]) < /n((0,x]) + /i2((0,x]) = V($)(x) = V($, [0,x]). This combined with Lemma 2.8.8 (4) proves that V($, [0,x]) = |/i|((0,x]) for all x > 0. Since by Lemmas 2.8.4 and 2.8.8, V($,[0,6]) = V($,[0,a]) + V($, [a, 6]) < |M|((0,a]) + |M|((a,6]) = |M|((0,6]) = V($,[0,6]), it follows that V($,[a,6]) = |/i|((a,&]) for all 0 < a < b < +00. □ Absolutely continuous functions. Absolutely continuous measures on R. Let $ be a function on the finite interval [a, b}. If it is of bounded variation and right continuous, then, for any e > 0, it follows from the right continuity of V($, [a, x]) and its additivity (Lemma 2.8.4) that there exists a 6 = 6(x) > 0 such that V($, [x,x + ft]) < e if 0 < ft < 6(x).
76 II. INTEGRATION Definition 2.8.12. A function * on [a, b] is said to be absolutely continuous on [a, b] if for any e > 0 there exists a.6 = 6t > 0 such that for any finite collection of points a < a\ < b\ < a? < 62 < • ■ ■ < bn-i < °n < &n < b (equivalently for any finite collection of subintervals [a;, 6j],l < i < n, of [a, b] with non-overlapping interiors), it follows that n n J21*(* - *(<>i)l < * if £(fc - °«) < 6- i=l i=l Remark. The key point in this definition is that the intervals are not required to be contiguous. When 6j_i = a*, 2 < i < n, then 5Z"=11^(¾ - «(a0| = V([ai,fc„],«). The first thing to observe is that if one takes e = 1, then V([x, x+/i], ¢) < 1 if 0 < h < 6\. From the additivity property (Lemma 2.8.4), it follows that * is of finite variation on [a, 6] if * is absolutely continuous on [a, 6]. It also follows from the definition that V([0,x], *V(*)(x), and *(x) are right continuous functions of x (even continuous functions). The significance of the non-contiguity of the intervals is shown by the following result Proposition 2.8.13. Let v denote the measure on 93(R) defined by the "natural" right continuous extension ofty to R: it equaJs ¢(0) for x < a and ¢(6) for b < x. If E C [a, b] is a Borel set of measure zero, then v(E) = \v\{E) = 0. Hence, v and \v\ are absolutely continuous with respect to Lebesgue measure dx. Proof. Since the endpoint a has Lebesgue measure zero, one may assume E C (a, b}. Recall that \E\ = 0 implies that for any 6 > 0 there is a sequence (an,6n])n of intervals that can be assumed to be pairwise disjoint with (i) E c U£°=1(an,6n] and (ii) | U~=1 (an,6n]| = £~=i(&» - an) < 6. Furthermore, one has \v\{E) = 0 if for any e > 0 this sequence can also be chosen so that (iii) |^|(U^=1(an,6n]) = £~=1 |^|((an,6n]) < e. Since E c (a, b], all these intervals can be assumed to be subintervals of (a,b}. If (an,6n] c (a,b], then |*(6n) - *(an)| < V(*,[an,6n]) = |i>|((an,6n]). Assume that 6 = 6t, where 6t implies that n n 2 |«(Oi- ¢(04)^6 if £(¾-*)<««. t=l t=l Clearly, one wants to replace 5Z"=1 ^(M — *(°t)| m this inequality by HiLi y([at>&t]>*) = H"=i M((atA])- I' ^s now shown that this gives a similar inequality that holds for 6 = 6t but with e replaced by 2e. This suffices to prove the proposition. Observe that for each interval (a„, 6n] one may find a partition nn such that V(*,7rn) + ^r > V(*,[an,6nj) = M((an,6n]). Each partition wn
8. SIGNED MEASURES ON R AND FUNCTIONS OF BOUNDED VARIATION 77 subdivides [an,6n] into subintervals with non-overlapping interiors. It follows that 5Z"=1 V(^,7rn) < e since the sum of the lengths of the intervals involved equals 5Z"=1(&t — a,) < 6 = 6t. Hence, 5Z"=1 |^|((ai,6i]) = E?=i ^(*. Kb]) < E?=i V(*,*») + ET=i F < 2e. Since this inequality holds independent of n, it follows that Yl™=i M((anifcn]) < 2e. As a result, M(E) = 0 and so also !/(£) = 0. □ From this, it follows that if $ is an absolutely continuous function on [a, 6], then there is a signed measure v on 93 (R) such that all the mass of \v\ and hence of v is concentrated on [a, b], which is absolutely continuous in the sense of Definition 2.7.5. Hence, by the Radon-Nikodym theorem (Theorem 2.7.20), there is an L'-function ip such that v(A) = fAip(x)dx and M(j4) = JA \i/}(x)\dx. As a result, it follows that one may assume that {x | \il>{x)\ ^ 0} C [a, 6] (since this is true up to a set of measure zero, one may assume it by modifying V if necessary on a set of Lebesgue measure zero); in other words, V € ^([a,b],dx). This essentially proves the first part of the following result. Theorem 2.8.14. A function * on [a,b] is absolutely continuous if and only if there is a function V € Ll([a,b],dx) with *(x) = f* ip(u)du,a < x<b. Proof. If $ is absolutely continuous in the sense of Definition 2.8.12, the resulting signed measure v is determined by the fact that \v\(A) = 0 if A C (—oo,a) U (6,+00) is a Borel subset and <£(x) = u(a, x] = Jx ip(u)du,a < x < b, where the last inequality follows from the Radon-Nikodym theorem. On the other hand, if v is given by a function ip e Ll([a,b],dx) in the sense that v(A) = $Aip(u)du for any A e 93(R), then given e > 0, |v|(A) < e if |A| < 6 = 6(e) as stated in the following exercise. Exercise 2.8.15. (see Lemma 4.5.5) Let ip e L^R), and define \v\(A) = JA \i/}\(u)du. Show that, for any n > 0, one has (1) M(A) < n \A n {x I |V(x)| < n}\ + J{xM{x)>n} 1>{u)du, and (2) /(1111(-(1)^711^(11)^11 —► 0 as n —► 00 [Hint: use dominated convergence (Theorem 2.1.35).], and conclude that (3) if e > 0, then there is a 6 = 6(ijj, e,n) such that |f |(A) < e if |j4| < 6. It follows from this that if *(x) = f* ip(u)du, then * is absolutely continuous in the sense of Definition 2.8.12: observe that if Oj < 6j, then |*(6i) - ¢(04)1 < V(*, [04,¾]) = |i/|((oi,Oi])- Hence, if in Definition 2.8.12 I U?=1 (oi,bi]\ = E?=i(fc* - °0 < «(^,e,n). then £?=! |*(6*) - 9{oi)\ < EIU V(*,(<*iA]) = |i/|(U?=i(ai,M) < ¢- □ In other words, the absolutely continuous functions on [a, 6] are the "indefinite" integrals in the sense of Lebesgue of the L'-functions on [a, b] (relative to Lebesgue measure on [a, b}). From calculus, one is familiar with the fact that for continuous functions V. the derivative of its indefinite
78 II. INTEGRATION integral coincides with tp. This is true a.e. for the indefinite Lebesgue integral, but its proof is more difficult and is postponed until Chapter IV, where it is proved as a consequence of Lebesgue's differentiation theorem, which states that, if f € Ll (R), then a.e. ^ f*+£ }{u)du -► /(x) as h -» 0. 9. Additional exercises* Exercise 2.9.1. Define G(x) = [x], where [x] = n if n < x < n + 1. (1) Show that every subset of K is /i*-measurable, where fx is the measure for which n((a,b]) = G(b) — G(a). (2) What are the /i-measurable functions? (3) Determine £,1 (R,*P(R),^t) (recall from (1) that <P(K) is the <r-field of /i*-measurable sets). Exercise 2.9.2. Show that (1) the union of a countable family of sets of measure zero is also of measure zero (here "measure" refers to any er-finite measure, for example, one constructed from a right continuous, non-decreasing function G : R -> R), (2) E is a Lebesgue measurable set if and only if there is a Borel set B with EAB a set of Lebesgue measure zero where EAB = (E\B)\J (B\E) (this set is called the symmetric difference of E and B. [Hint: see Exercise 2.9.12.] Exercise 2.9.3. Let E C R. Show that (1) the outer Lebesgue measure X*(E) is the infimum of {|0| | E C 0,0 open} [Hint: if E c U^L^an.&n], replace each interval (an,bn] by (an,bn+ ■£).], (2) if E is bounded and Lebesgue measurable, then for any e > 0 there is a bounded open set O D E with |0\E| < |. Assume E C [-N, TV]. Show that (3) if P D Ectl[-N,N] = ECN is open and |P\E^| < f, then |E\C| < \ where C = Pc n [-N, N] C E. The set C is closed and bounded, i.e., it is compact (Exercise 1.5.6). Conclude that (4) if E is a bounded Lebesgue measurable set and e > 0, then there exist a compact set C and an open set O with C C E C O and \0\C\ < e, and (5) if E is a Lebesgue measurable set with \E\ < +oo, then there exist a compact set C and an open set O with C C E C O and |0\C| < e. Finally, show that (6) for any Lebesgue measurable set E, \E\ = sup{|C| | C compact C E) = inf{|0| | O open D E).
9. ADDITIONAL EXERCISES 79 Exercise 2.9.4. (Differentiating the Fourier transform) Let f € L^R). Show that (1) fcos{tx)f(x)dx is denned for all (eR, and (2) /sin(tx)/(x)dx is denned for all t € R. Assume that x —► x/(x) € L^K). Show that (3) ft /cos(tx)/(x)dx = -/xsin(tx)/(x)dx, and (4) £ fsin(tx)f(x)dx = J xcos(tx)f(x)dx. Using complex notation with elxt = cos(xt) + i sin(xt) (see the appendix to Chapter VI), this shows that the derivative of the Fourier transform f(t) =f f e~ixt f{x)dx (see Korner [K3]) is given by (5) ft J eixtf(x)dx =-/ ixeixtf(x)dx. In probability and statistics, when / > 0 and J f(x)dx = 1 (i.e., / is a probability density function), a variant of this form of the Fourier transform is called the characteristic function of the corresponding probability. The value of the characteristic function at t is the value f(—t) at —t of the Fourier transform. Exercise 2.9.5. (Differentiating the Laplace transform) Let / be a Borel measurable function on R with /(x) = 0, x < 0. Assume that for some constants C and q > 0, |/(x)| < CeQX, x > 0. Let £/(s) = J e~axf(x)dx,s > q. Show that (1) Cf is differentiable on (q,+oo), and (2) (Cf)'(8) = -Jxe-«f(x)dx. Exercise 2.9.6. Let n(x) = ni(x) = 775=6-^-^ &nd nt{x) = -):^(-¾) = -75=? e ~("it"). The main purpose of this exercise is to show that if nt(x) = ~j2n{-j;)i then for any t and bounded measurable function /, the function nt * f is smooth — it has derivatives of all orders and hence is a so-called infinitely differentiable function — where nt * /(x) = Jnt{x — y)f{y)dy. This exercise is long and is divided into parts A,B, and C. The main reason for all the difficulty is that at this point convolution (Exercise 3.3.21) has not been denned and the relation (Theorem 4.2.5) between continuous functions and Lebesgue integrable functions has not been made clear (see also Exercise 4.2.10). This exercise follows from Proposition 4.2.29 once it is shown that all the derivatives of nt(x) are integrable (part C (3)]. Part A. Show that for all r > 0 one has (1) e~^ < Co(a)e~r [Hint: complete the square r2 - 2ar.], (2) rm < C{m)e* if m > 1 [Hint: recall the power series for eu.], and (3) rme~T < C(m)e-5 if m > 1.
80 II. INTEGRATION Part B. Let / be a non-negative even function that takes its maximum value at x = 0 ("even" means that /(x) = /(-x) for all x € R). Assume that / e L^K) = L^R.fodx), where J is the <r-algebra of Lebesgue measurable subsets of R. (1) Find a function tpN e L^K) such that, if |6| < N, then <Pn(x) > f(x-b) for all x € R. [Hint: "split" the graph of the function at x = 0 and "insert" a constant value C on [-N, N] so that J ipN(x)dx = 2CN + ff(x)dx.} Assume that (1 + |x|m)/(x) e L^R), where m > 1. (2) Find a function 0N € Ll(R) such that (l+|x-6|m)/(x-&) < <£N(x) for all x e K provided |6| < N. [Hint: first show that |x - b\ < |x - 7V| + 2N and (a + b)m < 2m(am + bm) for a, b > 0 implies that 1 + |x - b\m < C(l + |x - N\m if |6| < N; then make use of the function ipn in (1).] (3) Find a function 7N e L^K) such that (l+\x-b\m)e^x-bl < 7N(x) for all x e K provided |6| < N. [Hint: make use of (2) and part A (3).] Part C. Recall that nt(x) = ~w=te~(~ii~\ Use mathematical induction to show that (1) -gp;Tit{x) = Hn{x,t)nt{x), for each n > 1, where for each fixed t > 0, Hn(x,t) is a polynomial of degree n in x, the so-called Hermite polynomial of degree n when t = 1. Show that (2) nt(x) < C(t)e"|:E| [Hint: see part A.], (3) |tfn(x,t)nt(x)| < C{t,n)e-^ [Hint: see part A]. Conclude that -§^nt € L^K). Find a function 0N,t,m,fc e L^K) such that (4) (1 + rfc|x - b\m)nt{x -b) < 9N:t,m,k{x) for all x e K provided |6| < N. [Hint: make use of (2) and part B (3).] Finally, show that if / is a bounded Borel (or even Lebesgue measurable) function, then (5) i^[Jnt(x-y)f(y)dy] = jHn{x-y,t)nt{x-y)f{y)dy, for each n> 1. Conclude that J nt{x — y)f(y)dy is a C°°-function of x (i.e., it is smooth). Exercise 2.9.7. Find a sequence (</?n)n>i of Riemann integrable functions tpn on [0,1] that are uniformly bounded, converge everywhere to a function <p, and are such that (1) the Riemann integral /0 ipn(x)dx = 1, for all n > 1, and (2) the Riemann integral /0 <p(x)dx does not exist.
9. ADDITIONAL EXERCISES 81 Note that /</5(x)l[0,i](x)dx = limn/</5n(x)l[0,i](x)dx = 1, in view of the theorem of dominated convergence (2.1.38). [Hint: enumerate the ratio- nals.] Exercise 2.9.8. Let X be non-negative. Show that E[X] = / XdP = sup{/sdP | 0 < s < X, s simple}. Remark. The integral J XdP is often denned as sup{/sdP | 0 < s < X, s simple}, see Rudin [R4]. One then has to prove the additivity of the integral, which amounts to verifying the above exercise using the definition of the integral given in Definition 2.1.23. Exercise 2.9.9. Let ea denote the Dirac measure at a or unit point mass at a (also denoted by 6a). If X € i'(ll,i8(R),e0), compute /Xdea. For any measurable space (0,5), define an analogous measure eUo, for any point uo€fl, and compute J Xdewo. Exercise 2.9.10. Let Y be a random variable on (0,5, P), and let a(Y) denote the smallest er-algebra (5 contained in 5 such that Y is (5-measurable (i.e., is a random variable on (fl,(5,P)). Show that A e (5 if and only if there is a Borel set B with A = Y~l(B); equivalently, 1a = 1b ° Y. Exercise 2.9.11. Let G be any non-decreasing, real-valued function on R, and let fx be the corresponding er-finite measure that Caratheodory's procedure (Theorem 1.4.13) produces. It is defined on the er-algebra 5 of /i*-measurable subsets of R. Show that (1) if E e 5, then there are two Borel sets A and B with (i)BcEcA and (ii) n(A\B) = 0 (see Exercise 2.2.7), and (2) J is the smallest er-algebra, containing 93(R) and all the sets that are subsets of a Borel set B with fi{B) = 0; and (3) consider fx restricted to the Borel er-algebra, and show that if one starts the Caratheodory procedure with 93 (R) and fj,, in place of 21 and (i, the resulting measure is the same, i.e., the same <r-algebra 5 results and the measure on it is fx (see Exercise 1.5.4). Exercise 2.9.12. If fx is a er-finite measure on 93(R), let Ot denote the collection of sets that are subsets of a Borel set B with fi(B) = 0. Define 5 to be the set of symmetric differences AAN, where A € 93(R) and N € 9T. Show that (1) 5 is a <7-algebra, and (2) the function v defined by v(AAN) = fi{A) is a er-finite measure on 5- Remark. The er-algebra determined in Exercise 2.9.12 is called the completion of 93(R) with respect to fx and is often denoted by 93(R) . Exercise 2.9.12 says that a <r-finite measure on 93(R) can be extended as a measure to the completion 93 (R) . The intersection over all possible er-finite (even
82 II. INTEGRATION finite) measures fx of the completions 93 (R) is an important er-algebra. It is called the er-algebra of universally measurable sets. Every er-finite measure can be extended as a measure to the er-algebra of universally measurable sets. Exercise 2.9.13. This exercise shows that a non-decreasing function is not too discontinuous. In particular, the set of points at which it is discontinuous has Lebesgue measure zero, since it is a countable set. Let H be a non-decreasing, finite-valued function defined on R. Observe that (1) for any a € R, H(a) is an upper bound of the set {H(x) \ x < a} and is a lower bound of {H(x) \ x > a}. Define H(a-) to be the least upper bound of {H(x) \ x < a} and H(a+) to be the greatest lower bound of {H(x) \ x > a}. Show that (2) limxTa H(x) = H(a-), where A = limxTa H(x) if for any e > 0 there is a 6 > 0 such that \H(x) -A|<eifa-6<x<a, def (3) limxia H(x) = H(a+), where A = limxia H(x) if for any e > 0 there is a 6 > 0 such that \H(x) — A| < e if a < x < a + 6, (4) there are at most a countable number of points a with H(a+) ^ H(a-) (note that the difference H(a+) — H(a~) measures the size of the jump of H at a) [Hint: estimate the number of points a in [-N, N] with the jump of H at a greater than £.], (5) H is continuous at a if and only if H{a—) = H(a+), (6) if Hi and H% are two non-decreasing functions on R that are right continuous, they coincide if Ht(x) = H2(x) for any point x at which both functions are continuous. [Hint: make use of Exercise 1.3.15.] Exercise 2.9.14. Let F be a distribution function that is not continuous. Then there is a non-void, at most countable set of points x at which F is not continuous. Since F is right continuous, Exercise 2.9.13 implies that this happens if and only if F(x) ^ F(x—) Let G(x) = J2a<x{F(a) ~ F(a~)} (see tne following remark). Show that (1) G is non-decreasing, right continuous, with limxj-oo G(x) = 0, (2) F(x) - G(x) is continuous and non-decreasing. Let G(+oo) = £aeiJF(a) - F{a-)} and let Fd(x) =f ^¾. Show that if G ^ F, then (3) there is a continuous distribution function Fc and a number A e (0,1) with F = \Fc + (l — \)Fd, i.e., F is a convex combination of Fc and Fd- Equivalently, every probability that is neither continuous nor discrete is a unique convex combination of a continuous and a discrete probability.
9. ADDITIONAL EXERCISES 83 Show that (4) every continuous probability that is neither absolutely continuous nor singular is a unique convex combination of an absolutely continuous probability and a continuous singular probability. A measure fj, is said to be continuous if fj.{{x}) = 0 for all x e K; singular if there is a Borel set E with \E\ = 0 and (i(Ec) = 0; and discrete if there is a countable set D with fi(Dc) = 0. Finally, show that (5) given any probability P, there are three unique non-negative measures Hac,f*sc,f*d are absolutely continuous, singular and continuous, and discrete, respectively, such that P = fiac + Mac + (id', and (6) if a probability is neither continuous nor discrete, neither absolutely continuous nor singular, then there are three unique positive scalars Ai, A2, A3 with A! + A2 + A3 = 1 and three unique probabilities Pac, Pes and P<j that are absolutely continuous, singular and continuous, and discrete, respectively, for which P = A1Pac + A2P3C + A3Pd. Remark. Since all the terms are non-negative, the sum Y2a<x{F(a) — F(a-)} may be defined as supj5Za<ia6J{F(a)-F(a-)}, where the supre- mum is taken over the finite subsets J of (—oo,x]. Note that these finite sums are always bounded above by F(x). Exercise 2.9.15. Let £ be an irrational number and let G = {m + n£ \ m, n € Z}. Show that (1) G is an additive subgroup of K, i.e., (G is closed under addition and subtraction), (2) if x = m + n£ = m' + n'£, then n = n' and m = m', (3) if n ^ 0, there is a unique integer mn such that 0 < mn + n£ < 1 (in fact, 0 < mn +n£ < 1), (4) if 0 < it, at least one of the it intervals ({, ^),0 < £ < k - 1, contains two distinct points of G (in fact, an infinite number), (5) if 0 < it, there is a point a € G with 0 < q < ^, (6) if -oo < a < b < +oo, show that (a, b) contains a point of G [Hint: consider the subgroup Zq and recall the use in Exercise 2.1.12 of Z^.], and finally, (7) if O is a non-void open subset of K, then O DG ^ 0, i.e., G is dense inK.
84 II. INTEGRATION Exercise 2.9.16. (a singular distribution function on [0,1]) (See Wheeden and Zygmund [Wl], pp. 34-35, where it is referred to as the Cantor-Lebesgue function.) Let F0(x) = x if 0 < x< 1 and Fi(z) \x ifO<x<i, \x-\ if § < x < 1. The graph of the function F\ may be obtained from the graph of Fo by viewing it as a diagonal of the rectangle denned by (0,0) and (1,1); divide the base of this rectangle into three equal pieces by two lines parallel to the y-axis and the vertical sides into two halves by a line parallel to the x-axis. This subdivides the original rectangle into 6 new rectangles. The graph of F\ coincides with the line bisecting the vertical sides between the two lines that divide the base and the top into thirds. Outside these two lines, it is given by two diagonal lines in two of the 6 new rectangles: the line joining (0,0) to {\,\) and the line joining (\,\) to (1,1). The graph of F2 is obtained from the graph of F\ by carrying out this construction for each of the new rectangles in which the graph of F\ is given by a diagonal line. Continuing by induction on n, this defines a sequence of non-decreasing, continuous (even piecewise linear) functions Fn that map [0,1] to [0,1]. Show that (1) for each n > 1, the maximum of |Fn_i(x) - Fn(x)|,0 < x < 1 is Conclude that (2) the sequence of continuous functions converges uniformly on [0,1] to a non-decreasing function F with F(0) = 0 and F(l) = 1. Extend F to all of R by setting F(x) = 0 if x < 0 and F(x) = 1 if X > 0. Then F is a continuous distribution function. If P is the corresponding probability, show that (3) P(C) = 1, where C is the Cantor set (see Example 1.2.5), and (4) P is singular with respect to Lebesgue measure (and any probability that has a probability density function relative to Lebesgue measure). Exercise 2.9.17. (see Titchmarsh [T], pp. 229-230 and pp. 366-367) It follows from Exercise 1.2.8 that the intervals removed at the nth stage of the procedure that produce the Cantor set are of the form (5Z"-T\ §^ + ^r.Er^i1 h + £)• where bi e {°>2}- Let c« = 6'/2-1 < » < " - 1- Show that (1) if Er=7 ^ + £ < * < E?="il h + £, then F(x) = £^1 §> + £ = Fn(x), where F„(x) and F(x) are defined in Exercise 2.9.16.
9. ADDITIONAL EXERCISES 85 Each x e [0,1] has a triadic expansion of the form x = 0.aiO2 ... an ... , where a* e {0,1,2}; in other words, x = J2iL\ y- Show that (2) x € C if and only if x has a triadic expansion in which only 0 or 2 occurs; e.g., 3 = 0.1... = 0.0222..., and so 3 e C, and (3) if x € C, then F(0.ai<i2 ... an ...) = O.&1&2 ...bn... , where aj e {0,2} and 6j = ^Oj (note that the expansion of y = F(x) = O.6162 ... 6n ... is understood to be dyadic; i.e., y = J2iLi ?»• Exercise 2.9.18. (characterization of a bounded Riemann integrate function on [a, 6]) In the proof of Proposition 2.3.4, show that (1) each partition nn can be assumed to contain (for example) all the dyadic rationals on [0,1] of the form ^-,0 < k < 2n, which implies that the "mesh" of nn (which is defined as the maximum distance between adjacent partition points) is < ^-. Let J denote the union of all the points in all the partitions 7rn. Then \J\ = 0. Show that (2) if f(x) = g{x) and x ¢ J (see Proposition 2.3.4 for the definition of / and g), then <p is continuous at x; and conversely, (3) if tp is continuous at x ¢ J, then /(x) = g{x); (4) conclude that <p is Riemann integrable if and only if it is continuous at almost every point of [a, 6] (i.e., ip is a.e. continuous on [a, 6]). Exercise 2.9.19. Let W be a random variable and assume that W = X\ — Y\ = X2 — ¥2, where ^1,^2,^1,^2 aie a^ non-negative random variables and in addition Y\ and Y2 are integrable. Show that (1) E[Xl}-E[Yl} = E[X2\-E[Y2\. Define E[W] to be E[X] - E[Y] if X and Y are non-negative and Y is integrable. Now assume that W = Z + X with Z non-negative and X integrable. Show that (2) E[W) = E[Z) + E[X).
CHAPTER III INDEPENDENCE AND PRODUCT MEASURES 1. Random vectors and Borel sets in Rn Let (fi, 5, P) be a probability space and let X : Q —► R2 be a vector valued function, i.e., the values are vectors in R2. For each w € fi, let X(lj) = (Xi(lj),X2(lj)), where X\(lj) and X2(u) are the components of X(lj) with respect to the canonical basis of R2 consisting of ei = (1,0) and 62 = (0,1). It is a natural question to ask what condition X should satisfy in order to be a measurable vector-valued function, in other words, a random vector. Suppose that X\ and X2 are random variables, i.e., they are measurable when R is equipped with the Borel er-algebra 95(R). What happens if instead of the basis {ei, 62} one uses another basis? The new components remain measurable, as they are linear combinations of X\ and X2. So this is a basis-free condition and is the one that will be imposed: X will be said to be measurable or a random vector if of its components Xx,X2 with respect to the canonical basis are measurable, i.e., if they are random variables. A (real-valued) random variable X has a distribution (on R). Does a vector-valued random variable X (with values in R2, say) have a distribution on R2? To answer this question, it is necessary to introduce a er-algebra on R2 — the Borel <r-algebra 95(R2) of R2. Proposition 3.1.1. Let X : fl —► R2 and Jet 95 be the smallest a-algebra on R2 containing all the sets Ex x E2 where Ex and E2 G 95(R). The following statements are equivalent: (1) BeiB implies X~l(B) G & (2) if X = X\e\ + X2e2, then X\ and X2 are random variables on (0,5, P). Proof. Assume (1). Let Ex G 95(R). Then Xil{Ei)^{u | Xi(u) G #1} = {w I X(lj) G Ex x R}. Since EixRg95, (1) implies Xf'(Ei) G & i.e., Xi is a random variable. Similarly, (1) implies that X2 is a random variable. Assume (2) and let EUE2 € 95(R). Then {w | X{u) G Ex x E2) = {w | Xi(w) G Ex} n {lj I X2(u) eE2}e$ (i.e., X-^Ex x Ei) G J). Let (5 be the collection of subsets E of R2 such that X~1(E) G 5. Then, by Exercise 2.1.18 (with R replaced by R2), 0 is a er-algebra. Hence, since Ex x E2 G (5 for all Ex,E2 e 95(R), the <r-algebra (5 D 95. D 86
1. RANDOM VECTORS AND BOREL SETS IN Rv 87 Definition 3.1.2. The er-algebra of Borel subsets of K2 (respectively Rn), is the smallest a- algebra, containing all the open subsets ( Definition 3.1.3) of R2 (respectively, Rn). It will be denoted by <8(R2) (respectively, 93(Rn)). Definition 3.1.3. A subset O C K2 is said to be an open subset of K2 if a € O implies, that for some r > 0, O contains the open disc about a of radius r (i.e., {x | (xi - a.\)2 + (X2 - 02)2 < r2} C O). Similarly, Ocl" is said to be an open subset of Kn if a € O implies that there is some r > 0 such that O contains the open ball about a of radius r (i.e., B(a;r) = {x | 5Z"=1(xj - Oj)2 < r2}, is a subset of 0). Exercise 3.1.4. On Kn define || x ||2= 5Z"=1 x2 and |x| = maxi<j<n |Xj|. (1) Show that || • || and | • | are norms on Kn (see Definition 2.5.3 for the definition of a norm). (2) Given a norm on Kn, define the open ball about Xo of radius r to be the set of x such that the norm of x — Xo is less than r. Sketch the open ball about 0 € R2 of radius 1 for each of the norms in (1). (3) Show that a set O C Kn is open if and only if for any Xo € O there exists r such that {x | |x —Xo| <r} c O, where |y| is defined above. (4) Show that an open ball is an open set. Remark. This exercise shows that the open sets of Kn do not depend upon which of the two norms is used to define open balls. In fact, any norm on Kn determines the open sets in the same way. The collection of open sets is a topology ( Exercise 1.3.11), and, as in R, the closure of a set A is defined to be the smallest closed set that contains it. Proposition 3.1.5. The a-algebra 93(R2) is the smallest (T-algebra S containing all the sets Etx E2,Eie 2$(R), i = 1,2. To carry out the proof, it will be convenient to use the following lemma. Lemma. Every open set O C R2 is the union of those open balls with rational radii that are centered at points with rational coordinates and that lie inside O. Exercise 3.1.6. Prove this lemma, using either of the two norms given in Exercise 3.1.4 to define the open balls. This lemma relies on the fact that between any two real numbers there lies a rational number (see Exercise 2.1.12). Remark 3.1.7. The lemma says that there is a fixed, countable collection of open balls such that any open set is a union of a certain number of these balls (this implies that K2 satisfies what is known as the second axiom of countability). Proof of Proposition 3.1.5. For convenience, use the norm | • |, since an open ball is a "box". In view of the lemma, *8(R2) is the smallest er-algebra of
88 III. INDEPENDENCE AND PRODUCT MEASURES subsets of K2 that contains every open ball. Now {x \ \x - a\ < r} = (a\ - r,a\ + r) x (02 -r,a2 + r) if a = (01,02), and so <8(R2) c S since S contains all of these "boxes". To show that S c <8(R2), it suffices to show that if Ex and E2 € <8(R), ^xRandMx^e <8(R2), since then ElxE2 = (#i xR)fl(Kx E2) € 93(R2). Let (5 be the collection of subsets E of R such that £ x R 6 B(R2), i.e., 0 = {E c R I prf 1{E) e S(K2)}, where pri(ii,i2) = Xi- Then, by Exercise 2.1.18 (with a minor change), 0 is a er-algebra. Since it contains all open intervals (a, b), it contains *8(R). Therefore, E\ e 93(R) implies ^xMe <8(R2). Similarly, Rx E2e <8(R2) if #2 e <8(R). D Exercise 3.1.8. (1) Show that <8(Rn) is the smallest <r-algebra S containing all the sets Et x E2 x • • • x En, Ei € <8(R). (2) Let X : n -» Kn. Show that if X = (Xu-.-.X,,), then each component Xj, i = 1,2,..., n, is a random variable if and only if for all E € <8(Rn),X_1(E) € 5 (here £1 denotes a probability space (ft.foP)). Finally, one defines a random vector as follows. Dennition 3.1.9. Let X : SI -► Rn where (Sl,$, P) is a probabiJity space. Then X is a random vector if it is measurable with respect to or relative to 93(Rn), i.e., E e <8(Rn) implies X'^E) € £. Proposition 3.1.10. Let (ft, 5, P) be a probabiJity space and X : ft -► Kn be a random vector. Then there is a unique probability Q on 93(Rn) such that E e 93(Rn) impJies Q(E) = P(X~1(£:)). Proof. Use the argument that established Proposition 2.1.19. □ Dennition 3.1.11. The probability Q in Proposition 3.1.10 is caJJed the distribution or law of the random vector X. Remark. A probability Q on Rn is determined by a distribution function F as in the case of R: F(xi,x2,... ,xn) = Q((-oo,xi] x (-00,12] x ••• x (-00,xn]) = P(Xi < Xi,X2 < ¢2,...,Xn < £„)■ The distribution function determines the values of Q on sets of the form (ai, b\] x (02, b2] x • • • x (an, bn], which determines the probability Q on 93(Rn) in view of the next exercise. Exercise 3.1.12. Show that *8(Rn) is the smallest <r-algebra of subsets of Rn that contains any one of the following collections ¢, of sets: (1) ¢1: the finite "boxes" or intervals of the form (a 1,61] x (a2,b2] x •••x (a„,b„]; (2) €2: the finite "boxes" or intervals of the form (ai,&i) x (a2,b2) x •••x (a„,6„); (3) ¢3: the finite "boxes" or intervals of the form [ai.&i) x [02,62) x •••x [a„.bn);
2. INDEPENDENCE 89 (4) <£4: the finite "boxes" or intervals of the form [ai,6i] x [02,62] x •••x [an,bn]; (5) C5: the closed subsets of Rn. If X = (Xi,...,Xn), the "randomness" of a vector-valued function X : il —► Rn is determined by that of its components Xi,. This leads one to ask what can be said about the distribution of X in terms of the distributions of its components? For example, can it be calculated knowing the distribution of each of the Xj's? For example, given a two dimensional random vector X for which the so-called marginal distributions (i.e., the distributions of X\,X2) are known, does this enable one to compute or to say anything about the distribution XI More concretely, suppose that X measures two variables or parameters of an individual, say height and weight. Knowing that height and weight have normal distributions with given means and variances, can one say anything about the joint distribution of XI (More dramatic examples involving, say, smoking and lung cancer, etc., can be imagined.) In general, one cannot determine the distribution of X = (X\,..., Xn) from the distributions of its components Xi unless they are independent. Then, the distribution of X is the product of the distributions of the components, as will be shown in the next section. 2. Independence One of the ways in which probability is distinguished from merely being a branch of analysis is the emphasis given to (and the importance of) the notion of independence. Definition 3.2.1. Let (Q, 5, P) be a probability space and Ci, ¢2 be two subsets of J. They are said to be independent collections of events (relative to P) if At e €UA2 e ¢2 implies P(At n A2) = P(A!)P(A2). Example 3.2.2. Let A, B e £ be such that P(A nfl) = P(A)P(B), in which case A and B are said to be independent events. Set 5i = {¢, A, Ac£l} and fo = &,B,Bc,£l}. Then 5i and $2 are independent <7-algebras. Proposition 3.2.3. Let ffi. 3½ be independent sub <7-aJgebras of 3 and Jet X\ e ffii X2 € $2 (i.e., Xi is an (fi.ffi.P) random variable and X2 is an (n.foiP) random variable). Assume that X\,X2 are either both integrabJe or both non-negative. Then, E[XiX2] = E[Xi]E[X2].
90 III. INDEPENDENCE AND PRODUCT MEASURES In particular, XiX2 is integrable (in both cases) if each Xi is integrable. Proof. Consider first the case where Xi and X2 are both non-negative. Let X2 = 1A„ A2 € fo, and 9l = {Xr > 0,Xx € 3i | E[XilA,] = E[Xi]E[lM) = E[Xl}P(A2)}. Since the er-algebras are independent, 1,1, € ¢1 for all A\ € 3i- Hence, every simple function s = J2n=ian^An,An € 3i, 1 < n < N, is in ty^ The principle of monotone convergence and (RVs) in Proposition 2.1.11 imply that ¢1 contains all the non-negative random variables X\ € 3i- Now fix a non-negative Xi € 3i and look at ¢2 = {X2 € 32,¾ > 0 | E[XiX2] = E[Xi]E[X2]}. Since by what has been shown, ¢2 contains lyi2 for all A2 € 32, it follows (as above) that ¢2 contains all the non- negative X2 € 32- This proves the result for X\ and X2 when they are both non-negative. Now assume X\ = Y\ — Z\, X2 = Y2 - Z2, where Vj and Zi are non-negative, belong to 3t, and both are integrable (i.e., i?[yi],iJ[2?j] < +00). Then Y\Y2, Y1Z2, Z1Y2, and Z1Z2 are all integrable by what has been proved. Also, for example, £[^1-2] = E[Yi]E[y2] < +°o- Hence, X1X2 = {YiY2 + Z1Z2) - (YiZ2 + ZiY2) is integrable and E[XiX2] = E[YxY2 + ZxZ2] - E\YXZ2 + ZXY2\ = E[Xi]E[X2], since for each of the products Y\ir2,Y\Z2,Z\ir2, and Z\Z2, the expectation of the product is the product of the expectations of the factors. □ Remark. In general, the product of two integrable random variables is not integrable. However, if each of their squares is integrable, then the product is integrable (see Proposition 4.1.6). The next result shows that the marginal distributions of X = (X\,X2) determine the law of X if X\ and X2 are independent random variables. Proposition 3.2.4. Let 3i &nd 32 be independent sub er-afgebras of 3, where (fl,3, P) is a probability space. Let X\ € 3i and X2 € 32 have distributions Qi and Q2, respectively. Let X = (Xi,X2). The distribution Q of X is determined by Qi and Q2 and is the unique probability Q on 93 (R2) such that (*) Q(E1 x E2) = Qi(£i)Q2(£2) ifEt e 93(R). Proof. X~l(Ei x E2) = X-\EX) n X^ifr). Let A{ = Xrl(Ei). Then P(A1nA2) = P(Ai)P(A2), i.e., Q(#i x E2) = Qi(^i)Q2(^2). It remains to verify that there is at most one probability on 93(R2) satisfying (*). Let 21 denote the collection of finite unions U"=1£i,i x E2,i, where E\ti and E2,i € 93(R). Then, in view of Exercise 3.2.5, 21 is a Boolean algebra. Any two probabilities on 93(R2) that satisfy (*) agree on 21. Since by Proposition 3.1.5 93(R2) is the smallest <r-algebra containing 21, it follows from the monotone class argument used in the proof of Theorem 1.4.13 that two probabilities on 93(R2) that agree on 21 also agree on 93(R2). □
2. INDEPENDENCE 91 Exercise 3.2.5. This is an algebraic exercise that has to do with the relation between collections € of sets that are closed under finite intersections and the smallest Boolean algebra 21 containing ¢. It has three parts. Part A. Let € be a collection of subsets of fl such that (1) Ci,C2 € € implies C\ n C2 € ¢, and (2) C € € implies Cc is the union of a finite number of sets from ¢. Let 21 = {A | A = U?=1Ci,Ci€£,n> 1}. Show that (3) 21 is a Boolean algebra of subsets of fi. Part B. Let <£ be a collection of subsets of il that is closed under complements, i.e., if E € ¢, then Ec e ¢. Let <£ be the collection of finite intersections of sets from <£: C € € implies C = Ci"=1Ei, Ei € <£. Show that (1) if C € ¢, then Cc is a finite disjoint union of sets from ¢. [Hint: use Remark 1.2.3.] Now assume that <£ has property (1). Show that (2) each A € 21 may be written as a finite disjoint union of sets from €. Part C. Let C be a collection of subsets of il such that (1) Ci, C2 € € implies C\ n C2 € ¢, and let £ D C be a collection of subsets of fl such that (2) !)6£, and (3) if A c B are sets in £, then B\A € £. Let <£ = € U Cc, where B e Cc if and only if Bc € ¢. Show that (4) <H C £, (5) if A, Bi, B2,. ■ ■, Bp e € then A n (npi=1B-) e £. [Hints: AnB{ e £ since it coincides with A\(A n Bi); since A n Bf 6 £ for any A, B! e ¢, then A n Bf n B2 e £, and so (A n Bf)\(A n Bf n B2) = A n Bf n B| € £.] Let Co be the collection of finite intersections of sets from ¢. Then (5) implies that C0 C £. It follows that <£0 satisfies part B (1), and so the collection of finite disjoint unions of sets from <£0 is a Boolean algebra by part B (2). Show that (6) if 21 is the smallest Boolean algebra containing ¢, then 21 D Co, (7) 21 is the collection of finite disjoint unions of sets from <£0, and (8) if Ei, E2, ■ ■ ■, Eq are pairwise disjoint sets from ¢0, then u'=1#i € £. [Hint: if EUE2 € £ and Ex n E2 = 0, then E\ D E2 and so E\ n Ec2 e £.] Conclude that 21 C £. The uniqueness question that was settled before this long exercise can be handled very easily using the next result, which is due to Dynkin [D3]
92 III. INDEPENDENCE AND PRODUCT MEASURES (referred to as Dynkin's w-\ theorem by Billingsley [Bl]). It follows from part C of the previous exercise that it is in effect another version of the monotone class theorem (1.4.16) and is therefore more appropriately referred to nowadays by that name. The advantage of Dynkin's version of this theorem is that its hypotheses are easier to apply than those of the earlier form in Exercise 1.4.16. Proposition 3.2.6. (Dynkin's version of the monotone class theorem). Let € be a class of subsets of fl that is closed under finite intersections. Let £ D <£ be a monotone class (Exercise 1.4.15) of subsets such that (i) (l6£ and (ii) A C B sets € £ implies B\A € £. Then £ contains the smaiJest <7-aJgebra So containing ¢. Proof. Let £o D € be the smallest monotone class satisfying (i) and (ii). Then So 3 £o- Since £o is closed under complementation (in view of (i) and (ii)), it is closed under finite intersections if and only if it is closed under finite unions. When either of these two conditions is verified, £o is a <7-algebra since U^=1An is the monotone union U^^U^^A,,). Hence, £o = So if £o is closed under finite intersections. To show that £o is closed under finite intersections, first note that £i = {E € £0 | E n A € £o for all A e €} is a monotone class containing € that satisfies (i) and (ii). Hence £i = £0 (i.e., E € £o, A € € implies EnAe£o). Now let £2 = {E e £o I E n F e £o for all F € £0}. Then £2 is a monotone class containing € (by the previous observation) and satisfies (i) and (ii). Hence, £2 = £o- d Remarks. (1) One cannot replace condition (ii) by the weaker condition (ii): £ is closed under complements. Consider the class consisting of the unbounded intervals of K and the empty set: it is a monotone class, closed under complements, contains K, and yet does not contain the smallest a- algebra containing all the intervals (-oo,x],x e K, which is 93(R). (2) A monotone class £ satisfying the conditions of Proposition 3.2.6 is called a A-system by Dynkin [D3] if, in addition, it contains A\ U A2 whenever A\, A2 € £ and A\ n A2 = 0. Corollary 3.2.7. Let il be a set and J be a <7-aJgebra of subsets of il. Let Pi and P2 be two probabilities on J. If C c 5 is closed under finite intersections and P\(A) = P2(^4) for all A e €, then Pi and P2 agree on So, the smallest <7-aJgebra containing ¢. Proof. Let £ = {E € 5 | Pi(E) = P2(#)}. Then £ D C is a monotone class satisfying (i) and (ii) of Proposition 3.2.6 and so £ D Jo- d Remark. This corollary gives the uniqueness of the probability Q in Proposition 3.2.4: let t = {Et x E2 \ E, € <8(R)}. Let {Xi)\<i<n be a finite collection of real-valued functions on a set fi. Consider all the <r-algebras (5 on fl for which all the Xi are measurable.
2. INDEPENDENCE 93 Such er-algebras exist, e.g., (5 = ^P(fi), the er-algebra of all subsets of fi. Let <r({Xi | 1 < i < n}) denote the smallest of these er-algebras (the intersection of all the above er-algebras). If, for each i, Ei e 2$(R), then X~\Ei) = {Xi e E^ e a({Xi | 1 < i < n}) and so too is n?=lXrl(Ei). Let € be the collection of all sets of this form (it includes, for example, n^X,"1^), by setting En = R, and each Xrl(Ei) by setting all the other Ej = R). Then € is closed under finite intersections, and a({Xi | 1 < i < n}) is the smallest er-algebra of subsets of il containing ¢. In particular, if (0,5, P) is a probability space and the Xi are random variables, then CcJ and hence <r{{Xi | 1 < i < n}) c J. The result in Proposition 3.2.4 about the distribution of X = (Xi.-fo) when Xi and Xi are independent, — when they are measurable with respect to independent <7-algebras — has a corresponding version for n random variables, which is stated as the next proposition. Notice that here the form of the distribution of X = (X\, X2, ■ ■ ■, Xn) is equivalent to a statement about independence. Proposition 3.2.8. Let {Xi)i<i<„ be a finite collection of real-valued random variables on (fi, 0, P). The following conditions are equivalent: (1) if Q is the distribution of X = (X\,X2,-- ■ ,Xn) and Qj is the distribution of Xi, 1 < i < n, then Q(Ei x E2 x • • • x En) = JJQi(Ei); (2) if Ei € S(R), for 1 < i < n, then n P[nU{XieEi}]=Y[p[xieEi] (where Y\"=l Pi denotes the product of the numbers pi); and (3) foreveryFc {1,2,... ,n} andF' = {1,2,... ,n}\F, the a-algebras 5 = <r{{Xi \ieF}) and 5' = o{{Xi \ i € F'}) are independent. Proof. The equivalence of (1) and (2) is evident. Assume (2). Let C be the collection of sets of the form r\i£p{Xi e Ei} and <£.' be the collection of sets of the form CiieF'{Xi € Ei}. Hypothesis (2) implies P(A n A') = P(i4)P(i4') if A e £ and A' e €' (to see this, note, for example, that (2) implies that P[r\i^F{Xi € Ei}] = Tli^fP[Xi € Ei]: set the remaining Ei = R). Let £ = {E € 5 I for all A' e €',P(E n A') = P{E)P{A')}. Then £ D £ is a monotone class satisfying (i) and (ii) of Proposition 3.2.6 and so £ D a({Xi I i € F}). Similarly, if £1 = {E € 5 | for all A € <r{{Xi \ i € F}),P{AnE) = P(A)P(E)}, then £1 is a monotone class containing €' and hence o-{{X, | i € F'}). This proves (3).
94 III. INDEPENDENCE AND PRODUCT MEASURES Assume (3). If n = 2, then (2) is obvious. Assume that the conclusion holds forn = it. Ifn = k + 1, let F = {1,2,...,it} and F' = {it + 1}. Then P[(n?=1{Xi € Ei})n{Xk+l e Ek+l}} = P[n?=1{Xi e Ei}\P[Xk+i e Ek+i ]. By the inductive assumption, this is J}*+/ PfnJlt^Xi € Ei}}. □ Definition 3.2.9. A finite collection of random variables (Ai)i<i<„ is said to be independent if n for any collection (£i)i<j<„ of Bore] sets Ei. Remark. Proposition 3.2.8 shows that a finite collection of independent random variables is a particular case of a family of independent random variables (see Definition 3.4.3.). Definition 3.2.9 is the standard definition of a finite collection of independent random variables (see Loeve [L2], Chung [C], pp. 49-50, Billingsley [Bl], p. 16). Corollary 3.2.10. Let (Xj)i<j<„ be a finite collection of real-valued random variables on (0,5, P) that is independent, i.e., is such that, if Ei e <8(R), 1 < i < n, then n P[nU{XieEi}]=Y[p[xieEi]. i = l Let tp(xi,... ,xjt) and V(xfc-(-i,... ,xn) be BoreJ functions on Kfc,Kn_fc, i < k < n. Define® = <p{Xi,... ,Xk) and * = %lj(Xk+i,..-,Xn). Then for A, Be »(R), P[$e A,*e B] = P[$e A]P[9 e B]. In other words, X and Y are independent random variables. Proof. Let X = (Xi,...,Xk) and Y = (Xk+l,.. .,Xn). Then 9~l(A) = X~1((p~l(A)). Since <p~l(A) € <8(Rfc) and X is a random vector relative to Jo = v{{X, | 1 < i < it}), it follows that 4>-1(i4) € &)• Similarly, *-1(B) € 5i = <r({Xi | fc + 1 < i <}). Since these two a-algebras are independent, the result follows. □ Exercise 3.2.11. Let C be a class of measurable subsets of a probability space (n,5,P). If <L is independent of itself, show that P(A) = 0 or 1 if A € €. In particular, show that if X c 5 is a er-algebra that is independent of itself, then the probability of any event is either 0 or 1. (The probability space (fl,X, P) is therefore an example of an ergodic probability space.)
3. PRODUCT MEASURES 95 Exercise 3.2.12. Let £1 = {0,1,2,... ,n} and J = 93(ft). Determine all the probabilities on (fi, 5) that take only the values 0 and 1. Exercise 3.2.13. Let € be a class of measurable subsets of a probability space (0,5, P). Show that the class £ of sets A such that P(A n B) = P(A)P(B) for all B € <£ is a A-system (see the remarks following Proposition 3.2.6 for the definition of a A-system). Show, with an example, that £ need not be a <r-algebra (see Feller [Fl], p. 127). 3. Product measures At this point, it is not evident that independent random variables exist, even pairs of independent random variables. Intuitively, in cases like coin tossing, they can be seen to exist, but in these notes no mathematical construct has yet been given that explains their existence. In the case of the real line, it follows from the first two chapters that for any distribution function F\ (equivalently, the corresponding probability Pi) on the real line there is at least one probability space and a random variable on it whose distribution function is the given one: one may take the probability space to be (R,93(R), Pi) and X{x) = x for all x e R. In view of the earlier discussion in §1 concerning random vectors, one could also take the following space (R2,5?,Pi) and X\(x,y) = x for all (x, y) € R2, where Ji consists of the sets AxR, A € 93(R), and P?(AxR) = Pi (A). Suppose that P2 is another probability on the real line. Then it is the distribution of the random variable X2 defined on (R2,^, P°), where X2(x,y) = y for all y € R, P^(R x B) = P2(B) for all B € 93(R), and $% consists of the sets RxBBe 93(R). The two a-algebras 5? and $2 generate the a-algebra 93 (R2) in view of Proposition 3.1.5, i.e., 93(R2) is the smallest er-algebra containing them both. So the question arises: is there a probability Q on 93 (R2) that agrees with P° on each 5^? If so, then given two probabilities on the real line, one would have one probability space and a pair of random variables on it whose distributions are the respective probabilities. Notice that Proposition 3.2.4 shows that there is at most one solution Q to the problem, if in addition one requires X\ and X2 to be independent random variables. To motivate further the construction of the probability Q from Pi and P2, consider the problem of choosing a point (x, y) at random from the unit square [0,1] x [0,1]. From Fig. 3.1, it is intuitively clear that if g < x < 5 and \ < y < |, then this probability should be |(|, ^]| x |(^, |]| = -¾.
96 III. INDEPENDENCE AND PRODUCT MEASURES 75 25 —b~ t • \ ■ ! \ ; ; .123 .5 (1.1) Fig. 3.1 What about the probability that the point lies in a more general set? For example, suppose that x € E\ and y e E2, where E\ and E2 are Borel subsets of [0,1]. The probability that x lies in E\ is \E\\ and that y lies in Ei is |i?2|. This suggests that the probability that the point (x, y) lies in E\ x E2 should be |Ei| x |£^|. What if x < y? Then the point (x,y) lies above the diagonal of the square and the probability should be \, the area of this portion of the square. For a general subset E of the square one expects the probability to be the area of the set E. This begs the question: which subsets of the unit square have an "area"? Quite general "boxes" E\ x Ei with Borel sides E\, E2 should have an area equal to the product of the length of the sides, where length is determined by Lebesgue measure. In general, the "area" of a Borel "box" E\ x E2 is determined by two probabilities or positive measures, one for the first side and the other for the second. The problem is then how to extend this notion of "area" to the Borel subsets of the plane. Rather than proceeding with this specific example, consider the problem of defining the product of two probability spaces (or, more generally, two measure spaces) (^1,^1, Pi) and (^2,^2^2)- The first question to settle is how it is possible to define a "natural" a-algebra 5 on Qi x fl2 that contains all the "boxes" E\ x E2, with Ei € &, 1 = 1,2. The determination of this a-algebra 5 is not hard. Define two projections pr\: Sl\ x £l2 —* Sl\ and pr2: fii x JI2 -• ^2 by pr\{bj\,bJ2) = wi and pr2(u 1,^2) = W2. Let 5 be the smallest a-algebra of subsets of fli x f^2 that contains all the sets pr^l(Ei),Ei € &. Equivalently, let 5 be the smallest a-algebra of subsets of S7i x fl2 that contains all the sets E\ x fl2 and ill x E2. Note that j>r^l{E{) = Ex x £l2 and pr^'(^2) = fli x E2. Fig. 3.2 is a diagram of E\ x fl2-
3. PRODUCT MEASURES 97 E, XQ, Fig. 3.2 The <T-algebra 5 will be denoted by 5i x fo (it is often denoted by 5i <8> fo)- Note that in case fij = R and fc = ®(R), then $x$2 = S(R2) by Proposition 3.1.5. The next step is to make use of the two probabilities (or measures) to define a probability P (or measure) on 5i x 3½ such that (*) p(e1xE2) = Pi(e1)p2(e2) whenever £; 6 J, i = 1,2. The er-algebra Si x $2 is generated by a Boolean algebra 21. To describe it, let € be the collection of sets of the form E\ x E2 C fii x fl2, where £t G ^, i = 1,2. If <£ consists of the sets of the form E\ x fl2 or fl i x £2 with £^i e ^, this collection is closed under complements and € is the collection of finite intersections of sets from ¢. Hence € satisfies the hypotheses of part B of Exercise 3.2.5. As a result, if 21 is the collection of finite unions of sets from ¢, it follows that 21 is a Boolean algebra and every set A e 21 can be written as a finite disjoint union of sets from ¢. Exercise 3.3.1. Show that there is a unique finitely additive probability Pon2lsuchthatP(£:1x£;2) = P1(£;1)P2(£;2) for all sets Ei €$i,i = 1,2. [Hints: First show that it suffices to show that if a "box" E\ x E2 is a finite disjoint union of "boxes" £jx£J,l<!<n, then (**) P(ElxE2) = J2P(E\xE2). i=l The identity (**) may be proved in at least two ways. (1) Use the "sides" of the boxes E\ x E2 to define an equivalence relation on E\ and on £J2: set x ~ y if x € E\ if and only if y € E\. The equivalence relation on E\ splits it into a finite number of measurable sets j4i,.A2,...,j4fc,.. .,A/<, and similarly i?2 is split into a finite number of
98 III. INDEPENDENCE AND PRODUCT MEASURES measurable sets Bi,B2,..., Bt,..., BL. Show that the sets Ak x Bt partition E\ x E2, i.e., they are pairwise disjoint and their union is £1 x E2. Make use of this partition to prove (**). (2) The ideas involved in Lemma 3.3.3 may also be used to prove (**).] If P is er-additive on 21, the Caratheodory procedure applies without change to the corresponding outer measure P*, which one defines by setting P*(£)=finf{£r= iP(^) I E C U?=1An,An e H, for all n > 1}. As a result, to obtain a probability P on 5 satisfying (*), it suffices to show that P is er-additive (see Definition 1.4.2) on 21. Note that since 5 = 5i x fo is the smallest er-algebra containing 21, the outer measure P* is a probability when restricted to J. Exercise 3.3.2. Let P be a finitely additive probability on a Boolean algebra 21. Show that the following conditions are equivalent: (1) P is er-additive on 21; (2) (A„)„>i C!J,A„D An+l, and n~ xAn = A imply P(An) | P(A) if A e 21; (3) (An)„>i ca,A„D An+U and n~ xAn = 0 imply P(An) | 0; (4) if {An)n>! c 21, An D An+1, and for all n > 1, P(An) > 6 for some 6 > 0, then n^=lAn ^ 0; and (5) (>!„)„>! ca,AnC An+U and u~ xAn = A imply P(An) T P(A) when A € 21. The principle of monotone convergence implies that the probability P is er-additive on 21. To see this, the following lemma is useful. Lemma 3.3.3. If A e 21 and u\ e fii, then (1) the function lj2 —* 1/1(^1,^2) is in $2, i.e., it is measurable relative to ^2, (2) the function lj\ —> f 1,4(^1,^2)^2(^2) is in Ji, and (3) P(A) = /[/lyl(wi,W2)ft(dta2)]Pl(dWl). Proo/. Clearly, (1), (2), and (3) hold if A = #i XE2, Ei e fc. It is therefore true for finite disjoint unions of sets of the form E\ x E2, £j e J(, and hence for all sets in 21. □ To prove Exercise 3.3.2 (4), and thus the er-additivity of P on 21, note that \a„ T l/i if An C An+i and Unj4n = A. The principle of monotone convergence and Lemma 3.3.3 (2) imply that Yn(LJl)= J lA„("U"2)P2(duJ2) T Yitj^1 J Ia(lj1,lj2)P2(cLj2) and hence (again by monotone convergence and Lemma 3.3.3) P(An) = yVB(wi)Pi(dwi) T JYfaWdui) = P(A). D This completes most of the proof of the following theorem.
3. PRODUCT MEASURES 99 Theorem 3.3.4. Let (fii.Si.Pi) and (^2,^2^2) be two probability spaces. Then there is a smallest <r-algebra 5 of subsets ofQi x fl2 and a unique probability PonJ such that (1) £1 x £2 6 J if Ei e $i,i = 1,2, and (2) P(E! x E2) = P1(El)P2(E2) ifEi € &,i = 1,2. Proof. It remains to show the uniqueness of P. This follows from Corollary 3.2.7. □ Remarks. (1) The probability P is called the product of Pi and P2 and is denoted by Pi x P2. When Ji x fo is denoted by Ji <8>32, it is usual to denote Pi xP2 by PixP2. The probability space {^ixQ2>5i x3r2,Pi><P2) is called the product of the probability spaces (fii.ffi.Pi) &nd (^2>i?2>P2). (2) If P = Pi x P2, an integral JXdP = /A"d(Pi x P2) will sometimes be denoted by J X(lj)Pi x P2(du}), f X(lji,lj2)Pi x P2(tL;i,tL;2), or J JX(iJi,LJ2)Pi(duJi)P2(duj2)- This last notation is especially compatible with Fubini's theorem (Theorems 3.3.5 and 3.3.6), where the integral is computed as an iterated integral. (3) Proposition 3.2.4 states in effect that the distribution Q of (Xi, X2) is Qi x Q2 if Qj is the distribution of X, and the random variables Xi, X2 are independent. (4) If 0i = {Ei x n2 I #i € Si} and 02 = {^i x E2 | E2 € &}> then 0i and 02 are independent relative to P = Pi x P2. The next theorem has to do with computing an integral on a product of probability spaces as an iterated integral. Theorem 3.3.5. (Fubini's theorem for positive random variables functions) Let (fli x n2, $1 x 3½. Pi x P2) be the product of the probability spaces (ni.ffi.Pi) and (n2,&,P2). Let P = pi * p2- If X is a non-negative random variable on (fli x fi2, ffi x &2>P)> then (1) for aJJ u\ e fii, the function w2 —► X(wi,w2) € fo, and (2) the function u\ —► /X(wi, ^2^2(^2) G 5i , and (3) /[/X(wi,W2)ft(dW2)]Pl(dwi) = /XdP. Pnoo/. First, assume that X is the characteristic function of a set A 6 J. If A € 21, then, by Lemma 3.3.3, X = 1A satisfies (1), (2), and (3). Let OT = {A e 5 I (1), (2), and (3) hold for 1^}. Then OJl is a monotone class: it is closed under increasing unions by the principle of monotone convergence and under decreasing intersections by the theorem of dominated convergence. Since it contains 21, it follows from Exercise 1.4.16 that Conditions (1), (2), and (3) hold for Xi + X2 and XX, A > 0, if they are satisfied by X 1,.^2, and X. Consequently, the result holds for any simple function.
The next theorem has to do with computing an integral on a product of probability spaces as an iterated integral.

Theorem 3.3.5. (Fubini's theorem for positive random variables) Let $(\Omega_1 \times \Omega_2, \mathcal{F}_1 \times \mathcal{F}_2, P_1 \times P_2)$ be the product of the probability spaces $(\Omega_1,\mathcal{F}_1,P_1)$ and $(\Omega_2,\mathcal{F}_2,P_2)$. Let $P = P_1 \times P_2$. If $X$ is a non-negative random variable on $(\Omega_1 \times \Omega_2, \mathcal{F}_1 \times \mathcal{F}_2, P)$, then
(1) for all $\omega_1 \in \Omega_1$, the function $\omega_2 \mapsto X(\omega_1,\omega_2) \in \mathcal{F}_2$,
(2) the function $\omega_1 \mapsto \int X(\omega_1,\omega_2)\,P_2(d\omega_2) \in \mathcal{F}_1$, and
(3) $\int \big[ \int X(\omega_1,\omega_2)\,P_2(d\omega_2) \big] P_1(d\omega_1) = \int X\,dP$.

Proof. First, assume that $X$ is the characteristic function of a set $A \in \mathcal{F}$. If $A \in \mathcal{A}$, then, by Lemma 3.3.3, $X = 1_A$ satisfies (1), (2), and (3). Let $\mathcal{M} = \{A \in \mathcal{F} \mid (1), (2), \text{ and } (3) \text{ hold for } 1_A\}$. Then $\mathcal{M}$ is a monotone class: it is closed under increasing unions by the principle of monotone convergence and under decreasing intersections by the theorem of dominated convergence. Since it contains $\mathcal{A}$, it follows from Exercise 1.4.16 that $\mathcal{M} \supset \mathcal{F}$, i.e., (1), (2), and (3) hold for $1_A$ whenever $A \in \mathcal{F}$.

Conditions (1), (2), and (3) hold for $X_1 + X_2$ and $\lambda X$, $\lambda \ge 0$, if they are satisfied by $X_1$, $X_2$, and $X$. Consequently, the result holds for any simple function. If (1), (2), and (3) are true for $X_n$, $n \ge 1$, with $0 \le X_n \le X$ and $\lim_n X_n = X$, then $X$ satisfies these three conditions. This follows by monotone convergence. $\square$

Corollary 3.3.6. (Fubini's theorem for integrable random variables) Let $(\Omega_1 \times \Omega_2, \mathcal{F}_1 \times \mathcal{F}_2, P_1 \times P_2)$ be the product of the probability spaces $(\Omega_1,\mathcal{F}_1,P_1)$ and $(\Omega_2,\mathcal{F}_2,P_2)$. Let $P = P_1 \times P_2$. If $X$ is an integrable random variable on $(\Omega_1 \times \Omega_2, \mathcal{F}_1 \times \mathcal{F}_2, P)$, then
(1) for $P_1$-almost all $\omega_1$, the function $\omega_2 \mapsto X(\omega_1,\omega_2) \in \mathcal{F}_2$,
(2) there is an integrable random variable $Y$ on $(\Omega_1,\mathcal{F}_1,P_1)$ such that $P_1$-almost everywhere $\int X(\omega_1,\omega_2)\,P_2(d\omega_2) = Y(\omega_1)$, and
(3) $\int Y(\omega_1)\,P_1(d\omega_1) = \int X\,dP$.

Proof. $X$ is integrable if and only if $X^+$ and $X^-$ are integrable. Applying Fubini's theorem for positive functions (Theorem 3.3.5) to $X^+$ (respectively, $X^-$), it follows from Exercise 2.1.26 that for $P_1$-almost all $\omega_1$ (i.e., the exceptional set has $P_1$-measure zero), $\omega_2 \mapsto X^+(\omega_1,\omega_2)$ is an integrable random variable on $(\Omega_2,\mathcal{F}_2,P_2)$ (respectively, $\omega_2 \mapsto X^-(\omega_1,\omega_2)$ is an integrable random variable on $(\Omega_2,\mathcal{F}_2,P_2)$). Hence, there is a set $N_1 \in \mathcal{F}_1$ with $P_1(N_1) = 0$ such that if $\omega_1 \notin N_1$, then $\int X^+(\omega_1,\omega_2)\,P_2(d\omega_2) < +\infty$ and $\int X^-(\omega_1,\omega_2)\,P_2(d\omega_2) < +\infty$. As a result, $\int X(\omega_1,\omega_2)\,P_2(d\omega_2) = \int X^+(\omega_1,\omega_2)\,P_2(d\omega_2) - \int X^-(\omega_1,\omega_2)\,P_2(d\omega_2)$ is defined for $\omega_1 \notin N_1$. Define
\[
Y(\omega_1) = \begin{cases} 0 & \text{if } \omega_1 \in N_1, \\ \int X(\omega_1,\omega_2)\,P_2(d\omega_2) & \text{if } \omega_1 \notin N_1. \end{cases}
\]
Now, by Exercise 2.1.31,
\[
\int_{\Omega_1 \setminus N_1} \Big[ \int X^+(\omega_1,\omega_2)\,P_2(d\omega_2) \Big] P_1(d\omega_1) = \int X^+\,dP < +\infty,
\]
and
\[
\int_{\Omega_1 \setminus N_1} \Big[ \int X^-(\omega_1,\omega_2)\,P_2(d\omega_2) \Big] P_1(d\omega_1) = \int X^-\,dP < +\infty.
\]
Hence, $\int X\,dP = \int Y(\omega_1)\,P_1(d\omega_1)$. $\square$
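On a finite product the integral in Theorem 3.3.5 is a finite sum, and the content of the theorem reduces to the fact that a double sum can be computed iterated in either order. A small sketch (illustrative only, continuing the finite spaces used above; the function X is an arbitrary choice):

```python
# Fubini on a finite product: the integral against P1 x P2 equals the
# iterated "integral" (a double sum), evaluated first over omega_2.
p1 = {"H": 0.3, "T": 0.7}
p2 = {k: 1 / 6 for k in range(1, 7)}

def X(w1, w2):
    # an arbitrary non-negative "random variable" on Omega1 x Omega2
    return (2.0 if w1 == "H" else 1.0) * w2**2

# integral of X against the product probability
lhs = sum(a * b * X(w1, w2) for (w1, a) in p1.items() for (w2, b) in p2.items())

# iterated integral: first over omega_2, then over omega_1
inner = {w1: sum(b * X(w1, w2) for (w2, b) in p2.items()) for w1 in p1}
rhs = sum(a * inner[w1] for (w1, a) in p1.items())

assert abs(lhs - rhs) < 1e-12
```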
Remarks 3.3.7. Condition (2) in Corollary 3.3.6 is a bit clumsy because one does not know that, for integrable $X$, $\int X(\omega_1,\omega_2)\,P_2(d\omega_2)$ is defined for all $\omega_1$. All that can be said is that it is defined and finite outside a set of $P_1$-probability zero. The formulation of Corollary 3.3.6 (2) can be tidied up if one uses the phrase "integrable random variable" to mean a function $Y$ defined on a subset $\Omega_1$ of $\Omega$, where $Y : \Omega_1 \to \mathbb{R} \cup \{-\infty,+\infty\}$, such that (a) there is a finite integrable random variable $Z$ on $(\Omega,\mathcal{F},P)$ (i.e., $E[Z^+], E[Z^-] < +\infty$), with (b) $P(\{Z = Y\}) = 1$; note that $\{Z = Y\} \subset \Omega_1$. Then it follows from Exercise 2.1.31 that $E[Z]$ depends only on $Y$. One sets $E[Y]$ equal to $E[Z]$. With this terminological modification, Corollary 3.3.6 (2) and (3) can be replaced by
(2') the function $\omega_1 \mapsto \int X(\omega_1,\omega_2)\,P_2(d\omega_2)$ is an integrable random variable on $(\Omega_1,\mathcal{F}_1,P_1)$, and
(3') $\int \big[ \int X(\omega_1,\omega_2)\,P_2(d\omega_2) \big] P_1(d\omega_1) = \int X\,dP$.
This meaning of the phrase "integrable random variable" is clearly at variance with Convention 2.1.32, which requires an integrable random variable to be finite valued. That convention was introduced to make it easy to see why $L^1(\Omega,\mathcal{F},P)$ is a real vector space. With this extended notion of integrable random variable, it can be seen that such functions form a real vector space provided (i) one views null functions as "equal to zero" and (ii) one defines $(X+Y)(\omega)$ to be $X(\omega) + Y(\omega)$ whenever both $X(\omega)$ and $Y(\omega)$ are finite, and zero otherwise. These somewhat clumsy devices can be avoided by passing to the vector space of equivalence classes of integrable functions, where $X$ is equivalent to $Y$ if $X$ equals $Y$ plus a null function. This formalizes the identification of the null functions with zero. In this volume, however, this formal step will not be taken.

Once the product of two probability spaces has been defined, arbitrary finite products
\[
\Omega_1 \times \Omega_2 \times \cdots \times \Omega_n = (\Omega_1 \times \Omega_2 \times \cdots \times \Omega_n,\ \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n,\ P_1 \times P_2 \times \cdots \times P_n)
\]
can be defined starting with $\Omega_1 \times \Omega_2$, then $(\Omega_1 \times \Omega_2) \times \Omega_3$, etc. One notes that there is no "associativity" problem, that is, $\Omega_1 \times (\Omega_2 \times \Omega_3)$ and $(\Omega_1 \times \Omega_2) \times \Omega_3$ are the same, and so the product $\Omega_1 \times \Omega_2 \times \cdots \times \Omega_n$ may be defined via the obvious formula: $\Omega_1 \times \Omega_2 \times \cdots \times \Omega_n = (\Omega_1 \times \Omega_2 \times \cdots \times \Omega_{n-1}) \times \Omega_n$. The following result makes this statement more explicit.

Proposition 3.3.8. Let $(\Omega_i,\mathcal{F}_i,P_i)$, $1 \le i \le n$, be $n$ probability spaces. Let $\mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n$ denote the smallest $\sigma$-algebra on $\Omega_1 \times \Omega_2 \times \cdots \times \Omega_n$ that contains all the sets of the form $E_1 \times E_2 \times \cdots \times E_n$, $E_i \in \mathcal{F}_i$, $1 \le i \le n$. Then there is a unique probability $P$ on $\mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n$ such that
\[
(*)\qquad P(E_1 \times E_2 \times \cdots \times E_n) = P_1(E_1)P_2(E_2)\cdots P_n(E_n) \quad \text{for all } E_i \in \mathcal{F}_i.
\]
Furthermore, if $1 \le k < n$, $\mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n$ is the smallest $\sigma$-algebra on $\Omega_1 \times \Omega_2 \times \cdots \times \Omega_n$ containing the sets $A \times B$, where $A \in \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_k$ and $B \in \mathcal{F}_{k+1} \times \mathcal{F}_{k+2} \times \cdots \times \mathcal{F}_n$. Also, $P$ is the unique probability on $\mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n$ such that
\[
(**)\qquad P(A \times B) = (P_1 \times P_2 \times \cdots \times P_k)(A)\,(P_{k+1} \times P_{k+2} \times \cdots \times P_n)(B)
\]
for all $A \in \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_k$ and $B \in \mathcal{F}_{k+1} \times \mathcal{F}_{k+2} \times \cdots \times \mathcal{F}_n$.
Proof. Let $\mathcal{C}$ denote the collection of sets of the form $E_1 \times E_2 \times \cdots \times E_n$, $E_i \in \mathcal{F}_i$, $1 \le i \le n$. It is closed under finite intersections, and so by Corollary 3.2.7 there is at most one probability $P$ on $\mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n$ satisfying (*). For a similar reason, there is at most one probability $P$ on $\mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n$ satisfying (**), provided $\mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n$ is also the smallest $\sigma$-algebra $\mathcal{G}$ containing all the sets $A \times B$, $A \in \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_k$ and $B \in \mathcal{F}_{k+1} \times \mathcal{F}_{k+2} \times \cdots \times \mathcal{F}_n$. This will now be proved.

Since $E_1 \times E_2 \times \cdots \times E_k = A \in \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_k$ and $E_{k+1} \times E_{k+2} \times \cdots \times E_n = B \in \mathcal{F}_{k+1} \times \mathcal{F}_{k+2} \times \cdots \times \mathcal{F}_n$ when $E_i \in \mathcal{F}_i$, $1 \le i \le n$, it follows that $\mathcal{G} \supset \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n$. Fix $B = E_{k+1} \times E_{k+2} \times \cdots \times E_n$, $E_i \in \mathcal{F}_i$, $k+1 \le i \le n$. Then $\{A \in \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_k \mid A \times B \in \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n\}$ is a $\sigma$-algebra that contains all sets of the form $E_1 \times E_2 \times \cdots \times E_k$, $E_j \in \mathcal{F}_j$, $1 \le j \le k$. It therefore equals $\mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_k$, and so, for all $A \in \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_k$ and $E_i \in \mathcal{F}_i$, $k+1 \le i \le n$, one has $A \times (E_{k+1} \times E_{k+2} \times \cdots \times E_n) \in \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n$. Now fix $A \in \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_k$. By an entirely similar argument, $A \times B \in \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n$ for all $B \in \mathcal{F}_{k+1} \times \mathcal{F}_{k+2} \times \cdots \times \mathcal{F}_n$. Hence, $\mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n \supset \mathcal{G}$.

To complete the proof of this proposition, it remains to establish the existence of $P$ satisfying (*). Proceeding by induction, when $n = 2$ this restates the result proved earlier as Theorem 3.3.4. Assume that the proposition holds for $n = N$, i.e., that the product probability $P_1 \times P_2 \times \cdots \times P_N$ exists on $\mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_N$ in this case. Now assume $n = N+1$. The product probability $P_1 \times P_2 \times \cdots \times P_N = Q$ exists on $\mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_N$, and there is a unique probability $P$ on $(\mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_N) \times \mathcal{F}_{N+1}$ such that $P(A \times B) = Q(A)P_{N+1}(B)$ whenever $A \in \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_N$ and $B \in \mathcal{F}_{N+1}$. Taking $A = E_1 \times E_2 \times \cdots \times E_N$, $E_i \in \mathcal{F}_i$, $1 \le i \le N$, it is clear that $P$ satisfies (*). $\square$

Definition 3.3.9. The probability determined in Proposition 3.3.8 will be referred to as the product probability and will be denoted by $P_1 \times P_2 \times \cdots \times P_n$. The probability space $(\Omega_1 \times \Omega_2 \times \cdots \times \Omega_n, \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n, P_1 \times P_2 \times \cdots \times P_n)$ will be called the product of the probability spaces $(\Omega_i,\mathcal{F}_i,P_i)$, $1 \le i \le n$.

Remark. The second part of the previous proposition states that
\[
(\Omega_1 \times \Omega_2 \times \cdots \times \Omega_n,\ \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n,\ P_1 \times P_2 \times \cdots \times P_n) = (\tilde{\Omega}_1,\tilde{\mathcal{F}}_1,\tilde{P}_1) \times (\tilde{\Omega}_2,\tilde{\mathcal{F}}_2,\tilde{P}_2),
\]
where $\tilde{\Omega}_1 = \Omega_1 \times \Omega_2 \times \cdots \times \Omega_k$, $\tilde{\mathcal{F}}_1 = \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_k$, $\tilde{P}_1 = P_1 \times P_2 \times \cdots \times P_k$, and $\tilde{\Omega}_2$, $\tilde{\mathcal{F}}_2$, and $\tilde{P}_2$ are similarly defined using the spaces $(\Omega_i,\mathcal{F}_i,P_i)$, $k+1 \le i \le n$.

The possibility of defining such a product allows one to solve the following problem.
Let $Q_1, Q_2, \ldots, Q_n$ be $n$ probabilities on $\mathbb{R}$. Are there a probability space $(\Omega,\mathcal{F},P)$ and a collection of random variables $(X_i)_{1 \le i \le n}$ on $\Omega$ such that (1) they are independent and (2) $Q_i$ is the distribution of $X_i$, $1 \le i \le n$? Yes. It suffices to take $\Omega = \mathbb{R}^n$, $X_i(x_1,\ldots,x_n) = x_i$, and $P = Q_1 \times Q_2 \times \cdots \times Q_n$ (in view of Proposition 3.2.8).

These results on the products of probability spaces extend with very little additional work to products of $\sigma$-finite measure spaces. To begin with, given two $\sigma$-finite measure spaces $(\Omega_1,\mathcal{F}_1,\mu_1)$ and $(\Omega_2,\mathcal{F}_2,\mu_2)$, define the set function $\mu$ on "boxes" $E_1 \times E_2$ by setting $\mu(E_1 \times E_2) = \mu_1(E_1)\mu_2(E_2)$, where this product is taken to be zero if one of the terms $\mu_i(E_i)$ equals zero, even if the other is $+\infty$. Note that in general the values of $\mu$ lie in $[0,+\infty]$. Then $\mu$ extends as a finitely additive function (recall that $(+\infty) + a = a + (+\infty) = +\infty$ for any real number $a$) to the smallest Boolean algebra $\mathcal{A}$ containing the sets $E_1 \times E_2$, since by Exercise 3.2.5 each $A \in \mathcal{A}$ is a finite disjoint union of "boxes", and Exercise 3.3.1 holds in this context. In addition, Lemma 3.3.3 is still valid, and so $\mu$ is countably additive on $\mathcal{A}$. The Carathéodory procedure applies, and since $\mathcal{F} = \mathcal{F}_1 \times \mathcal{F}_2$ is the smallest $\sigma$-algebra containing $\mathcal{A}$, it follows that the restriction to $\mathcal{F}$ of the outer measure $\mu^*$ is a $\sigma$-finite measure such that $\mu(E_1 \times E_2) = \mu_1(E_1)\mu_2(E_2)$ for any pair of sets $E_i \in \mathcal{F}_i$. This property defines a unique measure, as the next exercise shows.

Exercise 3.3.10. Let $(\Omega_i,\mathcal{F}_i,\mu_i)$, $1 \le i \le n$, be $n$ $\sigma$-finite measure spaces. Assume that none of the measures $\mu_i$ is identically equal to zero. Let $\mu$ and $\mu'$ be two $\sigma$-finite measures defined on $\mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n = \mathcal{F}$ that agree on sets of the form $E_1 \times E_2 \times \cdots \times E_n$, where $E_i \in \mathcal{F}_i$ and $\mu_i(E_i) < +\infty$, with common value $\prod_{i=1}^n \mu_i(E_i)$. Show that $\mu$ and $\mu'$ agree on $\mathcal{F}$. [Hints: (i) if $E \in \mathcal{F}$ and $\mu(E)$ and $\mu'(E)$ are finite and positive for some $E \in \mathcal{F}$, use the basic idea of Exercise 2.2.4 (see the remark that follows the exercise) to show that if $A \in \mathcal{F}$, $A \subset E$, then $\mu(A) = \mu'(A)$ (it will be slightly simpler to first show that $E$ may be taken to satisfy $E \subset E_1^0 \times E_2^0 \times \cdots \times E_n^0$ with $0 < \mu_i(E_i^0) < +\infty$, $1 \le i \le n$); and (ii) show that $\Omega_1 \times \Omega_2 \times \cdots \times \Omega_n$ is a countable increasing union of sets $E$ for which both $\mu(E)$ and $\mu'(E)$ are finite.]

One consequence of this is that the same measure results if, instead of defining $\mu$ on the Boolean algebra $\mathcal{A}$ determined by arbitrary "boxes", one first defines it only on the Boolean ring $\mathcal{R}$ generated by the "boxes" whose sides $E_i$ have finite measure $\mu_i(E_i)$ and then applies the procedure of Carathéodory. Using the uniqueness result of Exercise 3.3.10, the proof of Proposition 3.3.8 also proves the following result.

Proposition 3.3.11. Let $(\Omega_i,\mathcal{F}_i,\mu_i)$, $1 \le i \le n$, be $\sigma$-finite measure spaces. Then there is a unique $\sigma$-finite measure $\mu$ on $\mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n$ such that if $E_i \in \mathcal{F}_i$, $1 \le i \le n$, then
\[
(*)\qquad \mu(E_1 \times E_2 \times \cdots \times E_n) = \mu_1(E_1)\mu_2(E_2)\cdots\mu_n(E_n).
\]
This measure $\mu$, denoted by $\mu_1 \times \mu_2 \times \cdots \times \mu_n$, is also the unique measure $\mu$ on $\mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n$ such that
\[
(**)\qquad \mu(A \times B) = (\mu_1 \times \mu_2 \times \cdots \times \mu_k)(A)\,(\mu_{k+1} \times \mu_{k+2} \times \cdots \times \mu_n)(B)
\]
for all $A \in \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_k$ and $B \in \mathcal{F}_{k+1} \times \mathcal{F}_{k+2} \times \cdots \times \mathcal{F}_n$.

These observations applied to Lebesgue measure on $\mathcal{B}(\mathbb{R})$ determine its $n$-fold product on $\mathcal{B}(\mathbb{R}) \times \mathcal{B}(\mathbb{R}) \times \cdots \times \mathcal{B}(\mathbb{R}) = \mathcal{B}(\mathbb{R}^n)$.

Definition 3.3.12. The $\sigma$-finite measure defined by the product of $n$ copies of $(\mathbb{R},\mathcal{B}(\mathbb{R}),dx)$ is called Lebesgue measure on $\mathcal{B}(\mathbb{R}^n)$.

Remark. If, in the above definition, the $\sigma$-algebra $\mathcal{F}$ of Lebesgue measurable sets is used instead of the $\sigma$-algebra $\mathcal{B}(\mathbb{R})$, the resulting measure extends Lebesgue measure on $\mathcal{B}(\mathbb{R}^n)$. However, it is not the "largest" extension. This extension may be obtained by applying the Carathéodory procedure to the function $\lambda$ defined on the Boolean ring $\mathcal{R}$ of sets that are finite unions of "boxes" of the type $(a_1,b_1] \times (a_2,b_2] \times \cdots \times (a_n,b_n]$, where the endpoints of all the intervals are finite, and for which $\lambda((a_1,b_1] \times (a_2,b_2] \times \cdots \times (a_n,b_n]) = \prod_{i=1}^n (b_i - a_i)$, the volume of the "box". Since the restriction of Lebesgue measure on the Borel sets of $\mathbb{R}^n$ to $\mathcal{R}$ coincides with $\lambda$, it follows that $\lambda$ is $\sigma$-additive on $\mathcal{R}$. Hence, one may then apply the procedure of Carathéodory. The resulting measure, which will also be denoted by $\lambda$, is defined on what is called the $\sigma$-algebra of Lebesgue measurable subsets of $\mathbb{R}^n$. Since it agrees with Lebesgue measure on the "boxes" $(a_1,b_1] \times (a_2,b_2] \times \cdots \times (a_n,b_n]$, it agrees with Lebesgue measure on the Borel sets in view of Exercise 3.1.12. The measure $\lambda$ defined on the $\sigma$-algebra of Lebesgue measurable subsets of $\mathbb{R}^n$ is called Lebesgue measure on $\mathbb{R}^n$. Note that $\lambda$ is not a product measure, because the $\sigma$-algebra of Lebesgue measurable subsets of $\mathbb{R}^n$ is not a product of $\sigma$-algebras, one for each copy of $\mathbb{R}$ (see Exercise 3.6.20).

There is another way to obtain Lebesgue measure on $\mathbb{R}^n$ from the Lebesgue measure on the Borel sets. It involves what is called completing the measure and is explained in the following exercise.

Exercise 3.3.13. Show that
(1) a subset $N$ of $\mathbb{R}^n$ is Lebesgue measurable and has $|N| = 0$ if and only if for any $\epsilon > 0$ there is a sequence $(A_n)_{n\ge1}$ of Borel sets with $N \subset \bigcup_{n=1}^\infty A_n$ and $\sum_{n=1}^\infty |A_n| < \epsilon$. A set with this property will be called a set of Lebesgue measure zero or a null set; and
(2) a set has Lebesgue measure zero if and only if it is a subset of a Borel set of Lebesgue measure zero.
Let $\mathcal{S}$ be the collection of symmetric differences $A \triangle N = (A \setminus N) \cup (N \setminus A)$ (see Exercise 2.9.2), where $A \in \mathcal{B}(\mathbb{R}^n)$ and $N$ is a set of measure zero.
Show that
(3) if, for all $n \ge 1$, $A_n \in \mathcal{B}(\mathbb{R}^n)$ and $N_n$ has Lebesgue measure zero, then $\bigcup_{n=1}^\infty (A_n \triangle N_n) \in \mathcal{S}$ [Hint: show $E \in \mathcal{S}$ if and only if $E = (A \setminus N_0) \cup N_1$, where $N_0$ and $N_1$ are sets of Lebesgue measure zero and $A \in \mathcal{B}(\mathbb{R}^n)$],
(4) $\mathcal{S}$ is a $\sigma$-algebra,
(5) $\mathcal{S}$ is the $\sigma$-algebra of Lebesgue measurable subsets of $\mathbb{R}^n$, and
(6) $\mathcal{S}$ is the smallest $\sigma$-algebra containing $\mathcal{B}(\mathbb{R}^n)$ and all the sets that are subsets of a Borel set of measure zero (see Exercise 2.9.12).

Definition 3.3.14. Let $(\Omega,\mathcal{F},\mu)$ be a measure space. A subset $E$ of $\Omega$ is said to have measure zero (relative to $\mu$) if, for any $\epsilon > 0$, there is a sequence $(A_n)_{n\ge1} \subset \mathcal{F}$ with (i) $E \subset \bigcup_{n=1}^\infty A_n$ and (ii) $\sum_{n=1}^\infty \mu(A_n) < \epsilon$. A $\sigma$-finite measure space $(\Omega,\mathcal{F},\mu)$ is said to be a complete measure space if every subset of $\Omega$ of measure zero is in $\mathcal{F}$.

Exercise 3.3.15. Show that the $\sigma$-finite measure space $(\mathbb{R}^n,\mathcal{F},dx)$ is complete, where $\mathcal{F}$ is the $\sigma$-algebra of Lebesgue measurable sets. Show that if $X$ is an $\mathcal{F}$-measurable function, then there is a Borel function $Y$ with $\{X \ne Y\}$ of Lebesgue measure zero.

Exercise 3.3.13 illustrates a procedure that can be applied to any $\sigma$-finite measure space to obtain a complete measure space. This complete measure space is called the completion of the original one (see Exercise 2.9.12 and the remark that follows).

Having now established the existence of products of measure spaces, the following Fubini theorems may be proved almost exactly as in the case of probability spaces (Theorem 3.3.5 and Corollary 3.3.6). The only slight difference occurs in the proof of the analogue of Theorem 3.3.5. The argument involving the class $\mathcal{M}$ needs to be handled a little carefully. One way is to proceed as in the proof of Theorem 2.2.2. The whole space $\Omega_1 \times \Omega_2 = \bigcup_{n=1}^\infty A_n$, where $A_n = E_1^n \times E_2^n$ and $\mu_i(E_i^n) < +\infty$, $i = 1,2$. Let $\mathcal{M}_n = \{A_n \cap E \mid E \in \mathcal{F}$ such that (1), (2), and (3) of Theorem 3.3.16 hold for $1_{A_n \cap E}\}$. Then, for each $n$, $\mathcal{M}_n$ is a monotone class that satisfies the hypotheses of Dynkin's result (Proposition 3.2.6). Since it contains the collection $\mathcal{C}_n = \{A_n \cap (E_1 \times E_2) \mid E_i \in \mathcal{F}_i, i = 1,2\}$, which is closed under finite intersections, it follows from Proposition 3.2.6 that it contains $\mathcal{F}_n = \{A_n \cap E \mid E \in \mathcal{F}\}$. This shows that Fubini's theorem is true for any function $X = 1_{A_n \cap E}$, $E \in \mathcal{F}$. It then follows from monotone convergence that Fubini's theorem holds for $X = 1_E$, $E \in \mathcal{F}$. Once this is established, the rest of the proof goes through without change.

Theorem 3.3.16. (Fubini's theorem for positive functions) Let $(\Omega_1 \times \Omega_2, \mathcal{F}_1 \times \mathcal{F}_2, \mu_1 \times \mu_2)$ be the product of the measure spaces $(\Omega_1,\mathcal{F}_1,\mu_1)$ and $(\Omega_2,\mathcal{F}_2,\mu_2)$. Let $\mu$ denote $\mu_1 \times \mu_2$.
Let $X$ be a non-negative, measurable function. Then
(1) for all $\omega_1 \in \Omega_1$, the function $\omega_2 \mapsto X(\omega_1,\omega_2) \in \mathcal{F}_2$,
(2) the function $\omega_1 \mapsto \int X(\omega_1,\omega_2)\,\mu_2(d\omega_2) \in \mathcal{F}_1$, and
(3) $\int \big( \int X(\omega_1,\omega_2)\,\mu_2(d\omega_2) \big) \mu_1(d\omega_1) = \int X\,d(\mu_1 \times \mu_2)$.

This theorem extends to $L^1$-functions exactly as in the case of probability measures.

Corollary 3.3.17. (Fubini's theorem for $L^1$-functions) Let $(\Omega_1 \times \Omega_2, \mathcal{F}_1 \times \mathcal{F}_2, \mu_1 \times \mu_2)$ be the product of the measure spaces $(\Omega_1,\mathcal{F}_1,\mu_1)$ and $(\Omega_2,\mathcal{F}_2,\mu_2)$. Let $\mu$ denote $\mu_1 \times \mu_2$. If $X$ is an integrable function on $(\Omega_1 \times \Omega_2, \mathcal{F}_1 \times \mathcal{F}_2, \mu)$, then
(1) for $\mu_1$-almost all $\omega_1$, the function $\omega_2 \mapsto X(\omega_1,\omega_2)$ is an integrable function on $(\Omega_2,\mathcal{F}_2,\mu_2)$,
(2) there is an integrable function $Y$ on $(\Omega_1,\mathcal{F}_1,\mu_1)$ such that $\mu_1$-almost everywhere $\int X(\omega_1,\omega_2)\,\mu_2(d\omega_2) = Y(\omega_1)$, and
(3) $\int Y(\omega_1)\,\mu_1(d\omega_1) = \int X\,d\mu$.

Remark. Fubini's theorem is also valid for Lebesgue measurable functions on $\mathbb{R}^n$ (see Exercise 3.6.5). The statement is complicated by the fact that the $\sigma$-algebra of Lebesgue measurable subsets of $\mathbb{R}^n$ is not a product of $\sigma$-algebras.

An alternate way to construct the product of a finite number of $\sigma$-finite measures is illustrated in the next exercise for Lebesgue measures. This exercise therefore gives another way to construct Lebesgue measure on $\mathbb{R}^n$ and is the $n$-dimensional version of Exercise 2.2.4.

Exercise 3.3.18. Let $B_m$ be the closed box in $\mathbb{R}^n$ all of whose sides equal $[-m,m]$, i.e., $B_m = [-m,m] \times [-m,m] \times \cdots \times [-m,m]$. Show that
(1) there is a unique probability $P_m$ on $\mathcal{B}(\mathbb{R}^n)$ such that, for any box $B = (a_1,b_1] \times (a_2,b_2] \times \cdots \times (a_n,b_n]$, $P_m(B) = \frac{1}{(2m)^n}|B \cap B_m|$, where $|B \cap B_m|$ denotes the volume of the box $B \cap B_m$, i.e., $|B \cap B_m| = \prod_{i=1}^n |[-m,m] \cap (a_i,b_i]| = \prod_{i=1}^n (b_i \wedge m - a_i \vee (-m))$ (interpreted as zero if any factor is negative).
Define $\mu_m = (2m)^n P_m$, $m \ge 1$. Show that
(2) if $A \in \mathcal{B}(\mathbb{R}^n)$ is a subset of $B_m$, then $\mu_m(A) = \mu_{m+1}(A)$,
(3) there is a unique $\sigma$-additive measure $\mu$ on $\mathcal{B}(\mathbb{R}^n)$ such that $\mu(A) = \mu_m(A)$ if $A \subset B_m$, and
(4) $\mu((a_1,b_1] \times (a_2,b_2] \times \cdots \times (a_n,b_n]) = |(a_1,b_1] \times (a_2,b_2] \times \cdots \times (a_n,b_n]|$ for any finite box $(a_1,b_1] \times (a_2,b_2] \times \cdots \times (a_n,b_n]$.
Conclude that $\mu$ equals Lebesgue measure $\lambda$ on $\mathcal{B}(\mathbb{R}^n)$.

Riemann and Lebesgue integration on $\mathbb{R}^n$. Consider a bounded function $\varphi$ on a finite closed "box" $B = [a_1,b_1] \times [a_2,b_2] \times \cdots \times [a_n,b_n]$. To define the Riemann integral of $\varphi$ on $B$, one subdivides each of the sides of the box $B$ to get subboxes, and forms upper and lower sums.
By definition, the function is Riemann integrable over the box $B$ if and only if the supremum of the lower sums equals the infimum of the upper sums. In exactly the same way as in the case of $\mathbb{R}$ (see Chapter II, §3), one shows that if the Riemann integral of $\varphi$ exists, then there are two bounded Borel functions $f$ and $g$ with $f \le \varphi \le g$ and $\int f(x)\,dx = \int g(x)\,dx$. This implies, as before, that (i) $\varphi$ is Lebesgue measurable (extending it to have value zero outside the box $B$), (ii) its Lebesgue integral exists (i.e., it is in $L^1$), and (iii) the two integrals agree.

Example 3.3.19. Let $D = \{(x,y) \mid 0 < x,\ 0 < y < x\}$, and let $f(x,y) = 2e^{-x^2}$. Then $f 1_D$ is a Borel function, and Fubini's theorem shows that
\[
\int_0^{+\infty} \Big[ \int_y^{+\infty} f(x,y)\,dx \Big] dy = \int_0^{+\infty} \Big[ \int_0^x f(x,y)\,dy \Big] dx = 1,
\]
since both integrals coincide with the Lebesgue integral over $\mathbb{R}^2$ of $1_D f$, and the second one is computable. To see this, note that $F(y) = \int_y^{+\infty} f(x,y)\,dx = \int 1_D(x,y)f(x,y)\,dx$ (Lebesgue integral) by what was said earlier about improper Riemann integrals. Fubini's theorem states that $F$ is Borel measurable and $\int F(y)\,dy$ is the Lebesgue integral $\int 1_D f\,dx\,dy$ of $f$ on $\mathbb{R}^2$. Since $F$ is non-negative, it follows from Remark 2.3.12 that the improper integral $\int_0^\infty F(y)\,dy = \int F(y)\,dy$, provided that the Riemann integral $\int_0^b F(y)\,dy$ exists for any $b > 0$. For this it suffices to show that $F$ is continuous: if $y_1 < y_2$, then $F(y_2) - F(y_1) = -\int_{y_1}^{y_2} 2e^{-x^2}\,dx$, which tends to zero as $y_1 \to y_2$ or vice versa. Hence, $\int_0^{+\infty}[\int_y^{+\infty} f(x,y)\,dx]\,dy = \int 1_D f\,dx\,dy$. Note that, although this is not needed for the above, in fact $\int_0^\infty F(y)\,dy < \infty$, as $\int_y^{+\infty} e^{-x^2}\,dx \sim \frac{1}{2y}e^{-y^2}$ (see Exercise 3.6.5) as $y$ tends to $\infty$. To see that the other iterated integral equals $\int 1_D f\,dx\,dy$, one proceeds in a similar fashion.
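The two iterated integrals in Example 3.3.19 can also be checked numerically. In the sketch below (an illustration, not part of the text; the cutoff 30 and the midpoint rule are ad hoc choices), the inner integrals are $F(y) = \int_y^{+\infty} 2e^{-x^2}\,dx = \sqrt{\pi}\,\mathrm{erfc}(y)$ and $\int_0^x 2e^{-x^2}\,dy = 2xe^{-x^2}$:

```python
# Numerical sanity check of Example 3.3.19: both iterated integrals ~ 1.
import math

def F(y):
    # inner integral in the first order: int_y^inf 2 e^{-x^2} dx
    return math.sqrt(math.pi) * math.erfc(y)

def G(x):
    # inner integral in the other order: int_0^x 2 e^{-x^2} dy
    return 2 * x * math.exp(-x * x)

def integrate(h, a, b, n=200_000):
    # plain midpoint rule on [a, b]
    dx = (b - a) / n
    return sum(h(a + (i + 0.5) * dx) for i in range(n)) * dx

print(integrate(F, 0.0, 30.0))  # ~ 1.0 (the tail beyond 30 is negligible)
print(integrate(G, 0.0, 30.0))  # ~ 1.0
```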
Example 3.3.20. A random vector $X$ (on $(\Omega,\mathcal{F},P)$) has a (non-singular) multivariate normal distribution with mean $\theta$ and (positive definite) covariance matrix $\Sigma$ if the distribution $Q$ of $X$ has a density $f(x)$ with respect to Lebesgue measure $dx_1 \cdots dx_n$ on $\mathbb{R}^n$ and
\[
Q(E) = \int_E f(x)\,dx_1 dx_2 \cdots dx_n \quad \text{for } E \in \mathcal{B}(\mathbb{R}^n),
\]
where
\[
f(x) = \frac{1}{(2\pi)^{n/2}(\det\Sigma)^{1/2}}\, e^{-\frac{1}{2}(x-\theta)^t \Sigma^{-1} (x-\theta)}.
\]
Notice that to show that $Q$ is indeed a probability, one must ultimately resort to computing an improper Riemann integral in $\mathbb{R}^n$. First, one observes that $\Sigma = ODO^t$, where $O$ is an orthogonal matrix and $D$ is a diagonal matrix with all its entries strictly positive. Assuming that the usual change-of-variable formula is valid, the integral reduces to computing the integral for the case where $\Sigma = D$. In other words, it reduces to showing that
\[
\frac{1}{(2\pi)^{n/2}(\det D)^{1/2}} \int \exp\Big\{ -\frac{1}{2}\sum_{i=1}^n \frac{(x_i-\theta_i)^2}{d_i} \Big\}\,dx_1 dx_2 \cdots dx_n = 1,
\]
where the $d_i$ are the diagonal entries of $D$. One shows, as in one dimension, that the integral coincides with the improper Riemann integral
\[
\int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} \exp\Big\{ -\frac{1}{2}\sum_{i=1}^n \frac{(x_i-\theta_i)^2}{d_i} \Big\}\,dx_1 dx_2 \cdots dx_n.
\]
By making use of Fubini's theorem, this integral is easily seen to equal $(2\pi)^{n/2}(\det D)^{1/2}$.

Convolution of measures.

Exercise 3.3.21. Let $P_1 \times P_2$ be a probability on $(\mathbb{R}^2,\mathcal{B}(\mathbb{R}^2))$. The random variable $s : \mathbb{R}^2 \to \mathbb{R}$ defined by $s(x,y) = x + y$ has a distribution $Q$. This distribution is called the convolution $P_1 * P_2$ of $P_1$ and $P_2$. If $A \in \mathcal{B}(\mathbb{R})$, then
\[
P_1 * P_2(A) \overset{\text{def}}{=} P_1 \times P_2(\{(x,y) \mid x+y \in A\}) = \int 1_A(x+y)\,(P_1 \times P_2)(dx,dy) = \int\!\!\int 1_A(x+y)\,P_1(dx)P_2(dy)
\]
(see remark (2) following Theorem 3.3.4). It follows from Fubini's theorem (Theorem 3.3.5) that
\[
P_1 * P_2(A) = \int \Big[ \int 1_A(x+y)\,P_2(dy) \Big] P_1(dx) = \int P_2(A-x)\,P_1(dx)
\]
\[
= \int \Big[ \int 1_A(x+y)\,P_1(dx) \Big] P_2(dy) = \int P_1(A-y)\,P_2(dy) = P_2 * P_1(A),
\]
where $A - u = \{a - u \mid a \in A\}$. Assume that $P_1(dx) = f_1(x)\,dx$. Show that
(1) $P_1(A-y) = \int_A f_1(x-y)\,dx$ [Hint: Lebesgue measure is invariant under translation (Exercise 2.2.8).], and
(2) $P_1 * P_2(du) = g(u)\,du$, where $g(u) = f_1 * P_2(u) \overset{\text{def}}{=} \int f_1(u-y)\,P_2(dy)$.
If, in addition, $P_2(dy) = f_2(y)\,dy$, note that $g(u) = f_1 * f_2(u)$, where
\[
f_1 * f_2(u) \overset{\text{def}}{=} \int f_1(u-y)f_2(y)\,dy.
\]
Show that
(3) if $f_1(x) = \frac{1}{\sqrt{2\pi}\,\sigma_1} e^{-(x-m_1)^2/2\sigma_1^2}$ and $f_2(x) = \frac{1}{\sqrt{2\pi}\,\sigma_2} e^{-(x-m_2)^2/2\sigma_2^2}$, then
\[
f_1 * f_2(x) = \frac{1}{\sqrt{2\pi(\sigma_1^2+\sigma_2^2)}}\, e^{-(x-m_1-m_2)^2/2(\sigma_1^2+\sigma_2^2)}.
\]
Extend the definition of convolution to $\sigma$-finite measures, and show that
(4) $(\mu_1 + \mu_2) * \nu = \mu_1 * \nu + \mu_2 * \nu$ for any three $\sigma$-finite measures $\mu_1$, $\mu_2$, and $\nu$.
Compute
(5) $\varepsilon_a * \varepsilon_b$ (see Exercise 1.5.2).
Let $0 < p, q < 1$ and $p + q = 1$. Compute
(6) $(q\varepsilon_0 + p\varepsilon_1) * (q\varepsilon_0 + p\varepsilon_1) = (q\varepsilon_0 + p\varepsilon_1)^{*2}$ [Hint: use (4).], and
(7) $(q\varepsilon_0 + p\varepsilon_1)^{*n}$, the $n$-fold convolution of $q\varepsilon_0 + p\varepsilon_1$ with itself.
Finally, show that if $X_1$ and $X_2$ are two independent random variables, then
(8) the distribution of $X_1 + X_2$ is the convolution of the distributions of $X_1$ and $X_2$, and
(9) the distribution function $F_{X+Y}$ is the convolution $F_X * P_Y$ of the distribution function of $X$ and the distribution $P_Y$ of $Y$ (see (2) for the definition of the convolution of a function and a probability).

Remark. While it is usual to denote the distribution function $F_X * P_Y$ by $F_X * F_Y$, this notation will not be used in these notes. It is an unfortunate notation, since it is standard mathematical usage to define the convolution of two functions by the formula $f_1 * f_2(x) = \int f_1(x-y)f_2(y)\,dy$. The (mathematical) convolution is $F_X * F_Y(x) = \int F_X(x-y)F_Y(y)\,dy$, whereas
\[
F_X * P_Y(x) = \int F_X(x-y)\,dP_Y(y) = \int F_X(x-y)\,dF_Y(y),
\]
using the notation $dF(y)$ for $dP(y)$ when $F$ is the distribution function associated with $P$. For example, $H_a * H_b \ne H_{a+b}$, where $H_a(x) = H(x-a)$ is the distribution function of the point mass $\varepsilon_a$. In case $P_Y$ has a density $f_Y$, then in fact $F_X * P_Y(x) = F_X * f_Y(x)$. Further, if $f_X$ is the density of $P_X$, the distribution of $X$, then it is true that $f_X * f_Y$ is the density $f_{X+Y}$ of the distribution of $X+Y$, as shown in Exercise 3.3.21 (2).

Exercise 3.3.22. Let $X_1$ and $X_2$ be two independent Poisson random variables with means $\lambda_1$ and $\lambda_2$ (a Poisson random variable has a Poisson distribution; see Exercise 2.1.24 (2)). Show that $X_1 + X_2$ is a Poisson random variable with mean $\lambda_1 + \lambda_2$.

Exercise 3.3.23. Compute $H_a * H_b$ for $a < b$ (where $H_c(x) = H(x-c)$ is the distribution function of the point mass $\varepsilon_c$), and show why this convolution is not a distribution function.
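For the discrete convolutions in parts (5) to (7) and in Exercise 3.3.22, everything can be verified mechanically, since convolution of probabilities on $\{0,1,2,\ldots\}$ is the Cauchy product of their sequences of point masses. A sketch (an illustration, not part of the text; the truncation of the Poisson distributions is an ad hoc device):

```python
# Discrete convolutions: (q e_0 + p e_1)^{*n} is binomial(n, p), and the
# convolution of two Poisson distributions is Poisson with the summed mean.
from math import comb, exp, factorial

def convolve(p, q):
    r = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return r

# (7): the n-fold convolution of q*eps_0 + p*eps_1 is binomial(n, p)
p_, q_, n = 0.3, 0.7, 6
dist = [1.0]
for _ in range(n):
    dist = convolve(dist, [q_, p_])
assert all(abs(dist[k] - comb(n, k) * p_**k * q_**(n - k)) < 1e-12
           for k in range(n + 1))

# Exercise 3.3.22: Poisson(l1) * Poisson(l2) = Poisson(l1 + l2),
# checked on a truncated range (the tails beyond N are negligible here)
def poisson(lam, N=60):
    return [exp(-lam) * lam**k / factorial(k) for k in range(N)]

l1, l2 = 2.0, 3.5
s = convolve(poisson(l1), poisson(l2))
target = poisson(l1 + l2)
assert all(abs(s[k] - target[k]) < 1e-9 for k in range(40))
```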
Exercise 3.3.24. Let $\mu$ be any probability on $\mathbb{R}$ (i.e., on $\mathcal{B}(\mathbb{R})$). A bounded Borel function $\psi$ is said to vanish at infinity if, for any $\epsilon > 0$, there is an integer $N \ge 1$ such that $|x| > N$ implies that $|\psi(x)| < \epsilon$. Show that if $\psi$ vanishes at infinity, then $\psi * \mu$ also vanishes at infinity. [Hints: $\psi * \mu(x) = \int \psi(x-y)\,\mu(dy) = \int_{\{|y| \le M\}} \psi(x-y)\,\mu(dy) + \int_{\{|y| > M\}} \psi(x-y)\,\mu(dy)$; choose $M$ with $\mu([-M,M]) > 1 - \epsilon$, and observe that one can force $|x-y|$ to be very large for all $y \in [-M,M]$ if $|x|$ is very large.]

4. Infinite products

Let $(\Omega,\mathcal{F},P)$ be a probability space.

Definition 3.4.1. A stochastic process on $(\Omega,\mathcal{F},P)$ is a family $(X_i)_{i \in I}$ of random variables on $(\Omega,\mathcal{F},P)$.

Heuristically speaking, one may think of $I$ as a set of "times" and $X_i$ as the observation (of some phenomenon) at time $i$. If one takes a finite set $F \subset I$, say $F = \{i_1,\ldots,i_n\}$, then the $n$ random variables $X_{i_k}$ give "information" about the states of the phenomenon or process "during $F$". One may then calculate probabilities for the events $E \in \sigma(\{X_i \mid i \in F\})$. The random vector $(X_{i_1},\ldots,X_{i_n})$ associates with this $\sigma$-algebra and $P$ a distribution $Q$ on $\mathbb{R}^n$. This is called the finite-dimensional joint distribution of the stochastic process $(X_i)_{i \in I}$ corresponding to the finite set $F \subset I$.

Just as for a finite collection of random variables there is a smallest $\sigma$-algebra relative to which they are all measurable, so too for any collection $(X_i)_{i \in I}$ of random variables (or stochastic process) there is a smallest $\sigma$-algebra for which the functions $X_i$ are all measurable. It will be denoted by $\sigma(\{X_i \mid i \in I\})$. The events that the stochastic process $(X_i)_{i \in I}$ determines belong to $\mathcal{G} = \sigma(\{X_i \mid i \in I\})$, and the probability $P$ on $\mathcal{G}$ is completely determined by knowing all the finite-dimensional joint distributions of the process. This fact is an immediate consequence of the following proposition.

Proposition 3.4.2. Let $(X_i)_{i \in I}$ be a stochastic process on $(\Omega,\mathcal{F},P)$. Let $\mathcal{G}$ denote $\sigma(\{X_i \mid i \in I\})$ and $\mathcal{G}_F = \sigma(\{X_i \mid i \in F\})$, where $F \subset I$ denotes an arbitrary finite subset. Then
(1) $\mathcal{A} = \bigcup_{F \subset I} \mathcal{G}_F$ is a Boolean algebra, and
(2) $\mathcal{G}$ is the smallest $\sigma$-algebra containing $\mathcal{A}$.

Proof. (1) If $A_1, A_2 \in \mathcal{A}$, then there are finite sets $F_i \subset I$ with $A_i \in \mathcal{G}_{F_i}$. Since $F = F_1 \cup F_2$ is finite, $A_1, A_2 \in \mathcal{G}_F$, and so $A_1 \cup A_2 \in \mathcal{A}$ and $A_1 \cap A_2 \in \mathcal{A}$. It is clear that $A \in \mathcal{A}$ implies $A^c \in \mathcal{A}$.
(2) If $J \subset I$, then $\mathcal{G}_J \subset \mathcal{G}_I = \mathcal{G}$, and so $\mathcal{A} \subset \mathcal{G}$. If $\mathcal{H}$ is a $\sigma$-algebra $\supset \mathcal{A}$, then each random variable $X_i$ is measurable with respect to $\mathcal{H}$, and so $\mathcal{H} \supset \sigma(\{X_i \mid i \in I\}) = \mathcal{G}$. $\square$
Remark. If $I_0$ denotes an arbitrary at most countable subset of $I$, then $\mathcal{G} = \bigcup_{I_0 \subset I} \mathcal{G}_{I_0}$.

There is a special class of stochastic processes for which the finite-dimensional joint distributions are always finite products of the distributions of the individual random variables. This is the class of processes that are independent.

Definition 3.4.3. A family $(X_i)_{i \in I}$ of random variables on $(\Omega,\mathcal{F},P)$ (i.e., a stochastic process) is said to be an independent family of random variables if, for any $J \subset I$, the $\sigma$-algebras $\mathcal{F}_1 = \sigma(\{X_i \mid i \in J\})$ and $\mathcal{F}_2 = \sigma(\{X_i \mid i \in I \setminus J\})$ are independent. A family $(E_i)_{i \in I}$ of events or sets is said to be an independent family of events or sets if the family $(1_{E_i})_{i \in I}$ of random variables $1_{E_i}$ is independent (see Exercise 3.6.8).

Remark. In view of Proposition 3.2.8, this definition agrees with Definition 3.2.9 in case $I$ is finite.

Proposition 3.4.4. (Kolmogorov's 0-1 law) Let $(X_n)_{n \ge 1}$ be a sequence of independent random variables on a probability space $(\Omega,\mathcal{F},P)$. Let $\mathcal{T}_n = \sigma(\{X_i \mid i > n\})$ and $\mathcal{T} = \bigcap_{n=1}^\infty \mathcal{T}_n$. Then, for $A \in \mathcal{T}$, it follows that either $P(A) = 0$ or $P(A) = 1$. In addition, if $Z$ is a $\mathcal{T}$-measurable random variable, then $Z$ is a.s. constant.

Proof. Let $\mathcal{F}_n = \sigma(\{X_i \mid 1 \le i \le n\})$ and $\mathcal{A} = \bigcup_{n=1}^\infty \mathcal{F}_n$. It follows from Definition 3.4.3 that $\mathcal{F}_n$ and $\mathcal{T}_n$ are independent for all $n \ge 1$, and hence $\mathcal{T}$ is independent of each $\mathcal{F}_n$, $n \ge 1$. Hence, $P(A \cap T) = P(A)P(T)$ if $A \in \mathcal{A}$ and $T \in \mathcal{T}$. Let $\mathcal{F}_\infty = \sigma(\{X_i \mid i \ge 1\})$, and let $\mathcal{M} = \{B \in \mathcal{F}_\infty \mid P(B \cap T) = P(B)P(T)$ for all $T \in \mathcal{T}\}$. Note that $\mathcal{F}_\infty$ is the smallest $\sigma$-algebra containing $\mathcal{A}$.

Exercise. Show that $\mathcal{M}$ is a monotone class (Exercise 1.4.15).

Now $\mathcal{A}$ is a Boolean algebra: since each $\mathcal{F}_n$ is a $\sigma$-algebra, $\mathcal{A}$ contains $\Omega$ and $A^c$ if $A \in \mathcal{A}$; also, if $A_1, A_2 \in \mathcal{A}$, then there is an integer $n$ such that both $A_1$ and $A_2$ belong to $\mathcal{F}_n$; hence, $A_1 \cup A_2 \in \mathcal{A}$. Since $\mathcal{A} \subset \mathcal{M}$, it follows from Exercise 1.4.16 that $\mathcal{M} \supset \mathcal{F}_\infty$. It follows, for any $B \in \mathcal{F}_\infty$ and $T \in \mathcal{T}$, that $P(B \cap T) = P(B)P(T)$. In particular, since $\mathcal{T} \subset \mathcal{F}_\infty$, if $A \in \mathcal{T}$, then $P(A) = P(A)^2$, i.e., $P(A)(1 - P(A)) = 0$. Hence, either $P(A) = 0$ or $P(A) = 1$.

If $Z$ is $\mathcal{T}$-measurable and $\lambda \in \mathbb{R}$, then $P[Z \le \lambda]$ equals 0 or 1. If it is always 0, then $Z = +\infty$ a.s. If it is always 1, then $Z = -\infty$ a.s. If neither is the case, there is a unique $\lambda_0 \in \mathbb{R}$ with $P[Z < \lambda_0] = 0$ and $P[Z \le \lambda_0 + \delta] = 1$ for any $\delta > 0$. This implies $Z = \lambda_0$ a.s., as $P[\lambda_0 \le Z \le \lambda_0 + \frac{1}{n}] = 1$ for all $n \ge 1$. $\square$
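Kolmogorov's 0-1 law cannot be "verified" by simulation, but it can be illustrated: a tail-measurable quantity such as $Z = \limsup_n S_n/n$, where $S_n$ is the number of heads in $n$ independent fair coin tosses, must be a.s. constant, and the constant (here $1/2$) is the one identified by the law of large numbers. In the Python sketch below (an illustration only, with ad hoc parameters), $Z$ is approximated by $S_n/n$ at a large fixed $n$, and across independent runs the values cluster at a single constant:

```python
# Illustration (not a proof) of the 0-1 law: a tail random variable of an
# i.i.d. sequence takes essentially one value across independent runs.
import random

random.seed(0)
n, runs = 200_000, 8
for _ in range(runs):
    s = sum(random.random() < 0.5 for _ in range(n))   # number of heads
    print(round(s / n, 4))                             # all values ~ 0.5
```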
Remark. One may also prove this result by making use of Dynkin's version of the monotone class theorem (Proposition 3.2.6) if one prefers to argue from the independence of the $\sigma$-algebra $\mathcal{T}$ and the class $\mathcal{C}$ of sets of the form $\bigcap_{i=1}^n X_i^{-1}(E_i)$, $E_i \in \mathcal{B}(\mathbb{R})$.

Exercise 3.4.5. Let $(X_i)_{i \in I}$ be a family of independent random variables defined on a probability space $(\Omega,\mathcal{F},P)$. For each $i$, let $\psi_i : \mathbb{R} \to \mathbb{R}$ be a Borel function. Let $Y_i = \psi_i \circ X_i$. Show that $(Y_i)_{i \in I}$ is again an independent family. In particular, show that if $(E_i)_{i \in I}$ is a family of independent sets (i.e., the family of random variables $(1_{E_i})_{i \in I}$ is independent), then the family $(E_i^c)_{i \in I}$ of sets $E_i^c$ is also independent (see Exercise 3.6.7).

The question of independence involves only the finite-dimensional joint distributions, as the next result shows.

Proposition 3.4.6. A family $(X_i)_{i \in I}$ of random variables is independent if and only if, for every finite $F \subset I$, the family $(X_i)_{i \in F}$ is independent.

Proof. If $(X_i)_{i \in I}$ is independent and $I' \subset I$, then $(X_i)_{i \in I'}$ is independent. Hence, for any finite $F \subset I$, the family $(X_i)_{i \in F}$ is independent.

Conversely, if $J \subset I$, let $\mathcal{C}_1$ be the class of sets of the form $\bigcap_{i \in F_1} X_i^{-1}(E_i)$, where $F_1$ runs over the collection of finite subsets of $J$, and the sets $E_i$ are Borel subsets of $\mathbb{R}$. Let $\mathcal{C}_2$ be the class of sets of the form $\bigcap_{i \in F_2} X_i^{-1}(E_i)$, where $F_2$ runs over the collection of finite subsets of $I \setminus J$, and the sets $E_i$ are as before. Since the union of two finite sets is finite, each of these classes $\mathcal{C}_j$ is closed under finite intersections. Furthermore, if $A_1 \in \mathcal{C}_1$ and $A_2 \in \mathcal{C}_2$, then $P(A_1 \cap A_2) = P(A_1)P(A_2)$, since by hypothesis every finite subfamily $(X_i)_{i \in F}$ is independent. The $\sigma$-algebra $\sigma(\{X_i \mid i \in J\})$ is the smallest $\sigma$-algebra containing $\mathcal{C}_1$. The $\sigma$-algebra $\sigma(\{X_i \mid i \in I \setminus J\})$ is the smallest $\sigma$-algebra containing $\mathcal{C}_2$. It therefore follows, as in the proof of Proposition 3.2.8, that these two $\sigma$-algebras are independent. $\square$

As in the case of finite families of distributions (or probabilities) $Q_n$ on $\mathbb{R}$, the problem arises of constructing a sequence of independent random variables $(X_n)_{n \ge 1}$ on a probability space with a prescribed distribution $Q_n$ for each $X_n$. For example, $Q_n(dx)$ could be $\frac{1}{2}\{\varepsilon_0(dx) + \varepsilon_1(dx)\}$ for each $n$, and then one wants to find an independent sequence of random variables $X_n$ with equally likely values 0 or 1 (intuitively, such a space exists in view of the possibilities for tossing a fair coin). To show that there are probability spaces $(\Omega,\mathcal{F},P)$ with such sequences of random variables on them, one now proceeds to construct the (countably) infinite product of a sequence of probability spaces.

To begin with, if $\Omega_1, \Omega_2, \ldots, \Omega_n, \ldots$ is a sequence of sets, what is the set $\times_{n=1}^\infty \Omega_n$? If one notices that $\Omega_1 \times \cdots \times \Omega_n$ (as a set) can be viewed as the set of functions $\omega : \{1,\ldots,n\} \to \bigcup_{i=1}^n \Omega_i$ with $\omega(i) \in \Omega_i$, $1 \le i \le n$, then it is clear how to define the set $\times_{n=1}^\infty \Omega_n$.
Definition 3.4.7. The (infinite) product set $\times_{n=1}^\infty \Omega_n$ is the set of functions $\omega : \mathbb{N} \to \bigcup_{n=1}^\infty \Omega_n$ such that $\omega(n) \in \Omega_n$ for each $n \in \mathbb{N}$.

On the infinite product $\times_{n=1}^\infty \Omega_n$ there is a "natural" sequence $(X_n)_{n \ge 1}$ of functions (the coordinate projections), where $X_n(\omega) = \omega(n)$ for each $n$ (i.e., $X_n(\omega)$ is the evaluation of the function $\omega \in \times_{n=1}^\infty \Omega_n$ at $n$). If, for example, $\Omega_n = \mathbb{R}$ for each $n$, then $\times_{n=1}^\infty \Omega_n$ is the set of all real-valued sequences $a = (a_n)_{n \ge 1}$, and $X_n(a) = a_n$ is the $n$th term of the sequence $a$. If one plots the sequences in the plane, one obtains a "picture" of this product space, as shown in Fig. 3.3.

[Fig. 3.3: the sets $\Omega_1, \Omega_2, \ldots, \Omega_n, \ldots$ pictured as columns over the integers $1, 2, \ldots, n, \ldots$.]

In general, one may think schematically as follows: arrange the sets $\Omega_1, \Omega_2, \ldots, \Omega_n, \ldots$ in a sequence, viewing $\Omega_n$ as lying over the integer $n$; choosing a point in each set $\Omega_n$ over $n$ determines a point $\omega \in \times_{n=1}^\infty \Omega_n$. In Fig. 3.3, two points in $\times_{n=1}^\infty \Omega_n$ are indicated: the values of one point are indicated by the locations of an "x" and the other by an "o".

Suppose now that, for each $n \ge 1$, a probability space $(\Omega_n,\mathcal{F}_n,P_n)$ is given. Form the set $\times_{n=1}^\infty \Omega_n = \Omega$, and consider the functions $X_n : \Omega \to \Omega_n$ given by $X_n(\omega) = \omega(n)$. On the set $\Omega_n$ there is a $\sigma$-algebra $\mathcal{F}_n$.

Definition 3.4.8. The smallest $\sigma$-algebra $\mathcal{F}$ on $\Omega$ such that each $X_n : \Omega \to \Omega_n$, $n \ge 1$, is measurable with respect to $\mathcal{F}$ and $\mathcal{F}_n$ will be called the product of the $\sigma$-algebras $\mathcal{F}_n$, $n \ge 1$. It will be denoted by $\times_{n=1}^\infty \mathcal{F}_n$ (often denoted by $\otimes_{n=1}^\infty \mathcal{F}_n$). Note that $\times_{n=1}^\infty \mathcal{F}_n = \sigma(\{X_n \mid n \ge 1\})$.

For each $n$, the finite product $(\Omega_1 \times \Omega_2 \times \cdots \times \Omega_n, \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n, P_1 \times P_2 \times \cdots \times P_n)$ is defined. The aim is to construct a probability $P$ on $\times_{n=1}^\infty \mathcal{F}_n$ such that, for each $n \ge 1$, $P_1 \times P_2 \times \cdots \times P_n$ is the (finite-dimensional) joint distribution of $(X_1, X_2, \ldots, X_n)$; in other words,
\[
P(E_1 \times \cdots \times E_n \times \Omega_{n+1} \times \Omega_{n+2} \times \cdots) = \prod_{i=1}^n P_i(E_i).
\]
Heuristically, the finite product corresponds to forgetting about the remaining factors. Formally, for each $n$, one may define $pr_n : \Omega \to \Omega_1 \times \Omega_2 \times \cdots \times \Omega_n$ by setting $pr_n(\omega) = (\omega(1), \omega(2), \ldots, \omega(n))$.
Then, if $\hat{X}_k(\omega_1, \omega_2, \ldots, \omega_n) = \omega_k$, $1 \le k \le n$, denotes the $k$th coordinate map on the finite product, it follows that $X_k = \hat{X}_k \circ pr_n$, $1 \le k \le n$. In other words, Fig. 3.4 is commutative.

[Fig. 3.4: the commutative diagram $\Omega \xrightarrow{pr_n} \Omega_1 \times \Omega_2 \times \cdots \times \Omega_n$, with $X_k = \hat{X}_k \circ pr_n$.]

The function $pr_n$, which "forgets" the tail end of a point in $\Omega$, allows one to set up a correspondence between the sets in $\times_{n=1}^\infty \mathcal{F}_n$ that are determined by the coordinate functions $X_1, X_2, \ldots, X_n$ and the sets in $\mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n$. More precisely, if $\mathcal{G}_n = \sigma(\{X_i \mid 1 \le i \le n\})$, one has the following result.

Proposition 3.4.9. The following statements about $\Lambda \subset \Omega$ are equivalent:
(1) $\Lambda \in \mathcal{G}_n$;
(2) $\Lambda = pr_n^{-1}(A)$, $A \in \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n$.
Furthermore, the map $A \mapsto pr_n^{-1}(A)$ maps $\mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n$ in a one-to-one way onto $\mathcal{G}_n$. In addition,
(3) $\Lambda = A \times (\times_{k=n+1}^\infty \Omega_k)$, $A \in \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n$.

Proof. To verify that (1) is equivalent to (2), first note that $\bigcap_{i=1}^n X_i^{-1}(E_i) = pr_n^{-1}(E_1 \times E_2 \times \cdots \times E_n)$, since $\omega \in \bigcap_{i=1}^n X_i^{-1}(E_i)$ if and only if, for all $i$, $1 \le i \le n$, $X_i(\omega) = \omega(i) \in E_i$. Hence, the proposition holds for $\Lambda = \bigcap_{i=1}^n X_i^{-1}(E_i)$.

Let $\tilde{\mathcal{F}} = \{A \in \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n \mid pr_n^{-1}(A) \in \mathcal{G}_n\}$. Then $\tilde{\mathcal{F}}$ is a $\sigma$-algebra (see Exercise 2.1.18). Since it contains all the sets $E_1 \times E_2 \times \cdots \times E_n$, $E_i \in \mathcal{F}_i$, $1 \le i \le n$, it follows that $\tilde{\mathcal{F}} = \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n$, and so $A \in \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n$ implies that $pr_n^{-1}(A) \in \mathcal{G}_n$. Conversely, let $\tilde{\mathcal{G}}_n = \{\Lambda \in \mathcal{G}_n \mid \Lambda = pr_n^{-1}(A)$ for some $A \in \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n\}$. It is also a $\sigma$-algebra, and as it contains all the sets $\bigcap_{i=1}^n X_i^{-1}(E_i)$, it coincides with $\mathcal{G}_n$. This shows that (1) is equivalent to (2).

To see that $\Lambda$ and $A$ determine each other, observe that
(a) if $(x_1, x_2, \ldots, x_n) = x \in \times_{i=1}^n \Omega_i$, then there is a function $\omega \in \Omega$ with $\omega(i) = x_i$, $1 \le i \le n$ (i.e., $x = pr_n(\omega)$ for some $\omega \in \Omega$), and
(b) if $\omega \in \Lambda \in \mathcal{G}_n$ and $pr_n(\omega') = pr_n(\omega)$ (i.e., if $\omega(i) = \omega'(i)$, $1 \le i \le n$), then $\omega' \in \Lambda$.
The first statement is clear enough: pick any function $\omega_0 \in \Omega$ and define $\omega(i) = x_i$, $1 \le i \le n$, and $\omega(i) = \omega_0(i)$, $i > n$. The second one is true because (i) it is true for any set $\Lambda$ of the form $\bigcap_{i=1}^n X_i^{-1}(E_i)$;
(ii) if it is true for $\Lambda_1$ and for $\Lambda_0 \subset \Lambda_1$, then it is true for $\Lambda_1 \setminus \Lambda_0$; and finally, (iii) the class of sets in $\mathcal{G}_n$ for which it is true is a monotone class (Exercise 1.4.15); hence, by Proposition 3.2.6, it is true for every set in $\mathcal{G}_n$, as the class $\mathcal{C}$ of sets of the form $\bigcap_{i=1}^n X_i^{-1}(E_i)$ is closed under finite intersections.

Let $pr_n^{-1}(A) = \Lambda$. Then $A = pr_n(\Lambda)$: if $x \in A \subset \times_{i=1}^n \Omega_i$, then by (a) above there is a function $\omega \in \Omega$ such that $x_i = \omega(i)$, $1 \le i \le n$; then $\omega \in \Lambda$ and $pr_n(\omega) = x$; since $pr_n(\Lambda) \subset A$ in any case, $A = pr_n(\Lambda)$. Hence, $A_1 = A_2$ if $pr_n^{-1}(A_1) = pr_n^{-1}(A_2)$.

On the other hand, if $pr_n(\Lambda_1) = pr_n(\Lambda_2) = A$, then $\Lambda_1 = \Lambda_2$: if $\omega_1 \in \Lambda_1$ and $x = pr_n(\omega_1) = pr_n(\omega_2)$ with $\omega_2 \in \Lambda_2$, then by (b) above, $\omega_1 \in \Lambda_2$ (i.e., $\Lambda_1 \subset \Lambda_2$); by symmetry, $\Lambda_1 = \Lambda_2$. This shows that $\Lambda$ and $A$ determine each other.

Formula (3) follows from the above discussion. $\square$

The main goal here is to construct a probability $P$ on $\times_{n=1}^\infty \mathcal{F}_n$. Using Proposition 3.4.9, one determines $P$ on the Boolean algebra $\mathcal{A} = \bigcup_{n=1}^\infty \mathcal{G}_n$ as follows.

Corollary 3.4.10. If $\Lambda \in \mathcal{A} = \bigcup_{n=1}^\infty \mathcal{G}_n$, define $P(\Lambda) = (P_1 \times P_2 \times \cdots \times P_n)(A)$, where $\Lambda = A \times (\times_{k=n+1}^\infty \Omega_k)$. Then $P$ is a finitely additive probability on $\mathcal{A}$.

Proof. Since $\Lambda \in \mathcal{A}$, for some $n$, $\Lambda \in \mathcal{G}_n$. By Proposition 3.4.9, $\Lambda = A \times (\times_{k=n+1}^\infty \Omega_k)$ for a unique $A \in \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n$. Since $\Lambda \in \mathcal{G}_{n+1}$, one has $\Lambda = A \times \Omega_{n+1} \times (\times_{k=n+2}^\infty \Omega_k)$. In order to see that the formula for $P$ defines a function on $\mathcal{A}$, it suffices to show that, for any $A \in \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n$, $(P_1 \times P_2 \times \cdots \times P_n)(A) = (P_1 \times P_2 \times \cdots \times P_n \times P_{n+1})(A \times \Omega_{n+1})$. However, by Proposition 3.3.8, this last probability is
\[
((P_1 \times P_2 \times \cdots \times P_n) \times P_{n+1})(A \times \Omega_{n+1}) = (P_1 \times P_2 \times \cdots \times P_n)(A)\,P_{n+1}(\Omega_{n+1}) = (P_1 \times P_2 \times \cdots \times P_n)(A).
\]
The fact that $P$ is a finitely additive probability on $\mathcal{A}$ follows from the fact that $P$ restricted to each $\mathcal{G}_n$ is a probability. $\square$

Exercise 3.4.11. Verify this last statement.

At this stage, one has $(\times_{n=1}^\infty \Omega_n, \times_{n=1}^\infty \mathcal{F}_n)$ defined and $P$ defined on $\mathcal{A} = \bigcup_{n=1}^\infty \mathcal{G}_n$, which is a Boolean algebra with the property that $\times_{n=1}^\infty \mathcal{F}_n = \mathcal{F}$ is the smallest $\sigma$-algebra containing $\mathcal{A}$. As a result, there is at most one way to define $P$ on $\mathcal{F}$, and this is possible if and only if $P$ is $\sigma$-additive on $\mathcal{A}$.

Comment. The reader may avoid the following technicalities and skip to the statement of Theorem 3.4.14.

The $\sigma$-additivity of $P$ on $\mathcal{A}$ is proved using a technique due to Jessen (see Andersen and Jessen¹) and to Kakutani.² The argument given here can be found in Halmos [H1], pp. 157-158.

¹Some limit theorems on integrals in an abstract set, Det Kgl. Danske Videnskabernes Selskab, Mat.-Fys. Medd., Bind XXII, Nr. 14 (1946), 3-29.
²Notes on infinite product measure spaces, I, Proc. Imp. Acad. Tokyo, vol. XIX (1943), 148-151.
To see more closely how this technique works, it is worthwhile first to consider two probability spaces $(\Omega_1,\mathcal{F}_1,P_1)$, $(\Omega_2,\mathcal{F}_2,P_2)$ and Fubini's theorem. Let $E \in \mathcal{F}_1 \times \mathcal{F}_2$ and, for each $\omega_1 \in \Omega_1$, let $E(\omega_1)$ be the "slice" or section of $E$ over $\omega_1$, i.e., $E(\omega_1) = \{\omega_2 \mid (\omega_1,\omega_2) \in E\}$ (see Fig. 3.5).

[Fig. 3.5: a set $E \subset \Omega_1 \times \Omega_2$ with its slice $E(\omega_1)$ over a point $\omega_1 \in \Omega_1$.]

Fubini's theorem (Theorem 3.3.5 (1)) states that (i) $E(\omega_1) \in \mathcal{F}_2$ for each $\omega_1 \in \Omega_1$, (ii) $\omega_1 \mapsto P_2(E(\omega_1))$ is a random variable on $(\Omega_1,\mathcal{F}_1,P_1)$, and (iii) $(P_1 \times P_2)(E) = \int P_2(E(\omega_1))\,P_1(d\omega_1)$.

Lemma 3.4.12. If $(P_1 \times P_2)(E) \ge \delta > 0$, then $P_1(\{\omega_1 \mid P_2(E(\omega_1)) \ge \frac{\delta}{2}\}) \ge \frac{\delta}{2}$.

Proof. Let $B = \{\omega_1 \mid P_2(E(\omega_1)) \ge \frac{\delta}{2}\} \in \mathcal{F}_1$. Then
\[
0 < \delta \le \int P_2(E(\omega_1))\,P_1(d\omega_1) = \int_B P_2(E(\omega_1))\,P_1(d\omega_1) + \int_{B^c} P_2(E(\omega_1))\,P_1(d\omega_1) \le P_1(B) + \frac{\delta}{2}P_1(B^c) \le P_1(B) + \frac{\delta}{2}. \quad \square
\]

To prove the $\sigma$-additivity of $P$ on $\mathcal{A}$, one uses a variant of this lemma over and over again to construct a function $\omega^0 \in \bigcap_{n=1}^\infty C_n$ if $(C_n)_{n \ge 1} \subset \mathcal{A}$ is decreasing and $P(C_n) \ge \delta > 0$ for all $n \ge 1$. It follows from this that $P$ is $\sigma$-additive on $\mathcal{A}$ by Exercise 3.3.2 (4): if not, then there is a decreasing sequence $(C_n)_{n \ge 1}$ of sets $C_n \in \mathcal{A}$ and a positive number $\delta$ such that (i) $\bigcap_{n=1}^\infty C_n = \emptyset$ and (ii) $\lim_n P(C_n) = \delta > 0$. If, as will now be shown in the context of the infinite product, (ii) implies that (i) is false, then by Exercise 3.3.2 (4), $P$ is $\sigma$-additive on $\mathcal{A}$.
The argument goes as follows. After relabeling (if necessary), one may assume that $C_n = A_n \times (\times_{k=n+1}^\infty \Omega_k)$ with $A_n \in \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n$.

Exercise. Explain this statement. [For example, if $C_1 \in \mathcal{G}_{n_1}$, one may redefine $C_n$ to be $\Omega$ for $1 \le n < n_1$ and $C_{n_1}$ to be $C_1$. Explain how to continue: show that there is a strictly increasing sequence $(k_n)_n$ such that $C_n \in \mathcal{G}_{k_n}$.]

Having arranged the labeling of the sets $C_n$, the problem of showing that $\bigcap_{n=1}^\infty C_n \ne \emptyset$ can be reformulated. It amounts to proving the following lemma.

Lemma 3.4.13. Assume that, for each $n \ge 1$, one has a set $A_n \in \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_n$ such that, for all $n \ge 1$,
(1) $A_n \times \Omega_{n+1} \supset A_{n+1}$, and
(2) $(P_1 \times P_2 \times \cdots \times P_n)(A_n) \ge \delta > 0$.
Then there is a function $\omega^0 \in \times_{k=1}^\infty \Omega_k$ such that $(\omega^0(1), \omega^0(2), \ldots, \omega^0(n)) \in A_n$ for all $n \ge 1$.

To determine a value for $\omega^0(1)$, let $A_n(\omega_1)$ be the slice of $A_n$ over $\omega_1$ in $\Omega_1 \times (\Omega_2 \times \cdots \times \Omega_n)$ when $n \ge 2$ (i.e., in Lemma 3.4.12, take $\tilde{\Omega}_1 = \Omega_1$ and $\tilde{\Omega}_2 = \Omega_2 \times \cdots \times \Omega_n$). Since $(P_1 \times P_2 \times \cdots \times P_n)(A_n) \ge \delta$ for all $n \ge 2$, it follows from Lemma 3.4.12 that $P_1(B_n) \ge \frac{\delta}{2}$ for all $n \ge 2$, where $B_n = \{\omega_1 \mid (P_2 \times \cdots \times P_n)(A_n(\omega_1)) \ge \frac{\delta}{2}\}$.

Since $A_1 \times \Omega_2 \supset A_2$, if $A_2(\omega_1) \ne \emptyset$, then $\omega_1 \in A_1$: $(\omega_1,\omega_2) \in A_2$ implies $\omega_1 \in A_1$. Hence, $A_1 \supset B_2$. Further, $A_n \times \Omega_{n+1} \supset A_{n+1}$ implies that $A_n(\omega_1) \times \Omega_{n+1} \supset A_{n+1}(\omega_1)$: if $(\omega_2,\ldots,\omega_{n+1}) \in A_{n+1}(\omega_1)$, then $(\omega_1,\omega_2,\ldots,\omega_{n+1}) \in A_{n+1}$, and so $(\omega_1,\ldots,\omega_n) \in A_n$ since $(\omega_1,\omega_2,\ldots,\omega_{n+1}) \in A_n \times \Omega_{n+1}$. This implies that $(\omega_2,\ldots,\omega_{n+1}) \in A_n(\omega_1) \times \Omega_{n+1}$. Hence, $(P_2 \times \cdots \times P_{n+1})(A_{n+1}(\omega_1)) \le (P_2 \times \cdots \times P_n)(A_n(\omega_1))$, and so $B_n \supset B_{n+1}$ for all $n \ge 2$.

Since $P_1$ is a probability and $A_1 \supset B_n \supset B_{n+1}$ for all $n \ge 2$, the fact that $P_1(B_n) \ge \frac{\delta}{2}$ for all $n \ge 2$ implies that $P_1(\bigcap_{n \ge 2} B_n) \ge \frac{\delta}{2}$. Hence, there is a point $\omega_1^0 \in \bigcap_{n \ge 2} B_n$. Note that $\omega_1^0 \in A_1$ and $A_n(\omega_1^0) \ne \emptyset$ for all $n \ge 2$. This establishes the following variant of Lemma 3.4.12.

Lemma 3.4.12*. Assume that $(P_1 \times P_2 \times \cdots \times P_n)(A_n) \ge \delta$ for all $n \ge 2$. Then there is a point $\omega_1^0 \in A_1$ such that $(P_2 \times \cdots \times P_n)(A_n(\omega_1^0)) \ge \frac{\delta}{2}$ for all $n \ge 2$.
One now "forgets" about $\Omega_1$ and replaces the sets $A_n$ by their slices $A_n(\omega_1^0) \in \mathcal{F}_2 \times \mathcal{F}_3 \times \cdots \times \mathcal{F}_n$. Since $(P_2 \times P_3 \times \cdots \times P_n)(A_n(\omega_1^0)) \ge \frac{\delta}{2}$ for all $n \ge 2$, it follows from the above lemma that there is a point $\omega_2^0 \in A_2(\omega_1^0)$ such that $(P_3 \times \cdots \times P_n)(A_n(\omega_1^0,\omega_2^0)) \ge \frac{\delta}{4}$ for all $n \ge 3$, where $A_n(\omega_1^0,\omega_2^0) = A_n(\omega_1^0)(\omega_2^0)$. Using this lemma, it follows by induction on $m$ that, for each $m \ge 2$, if $n \ge m$ there are $\omega_1^0, \omega_2^0, \ldots, \omega_{m-1}^0$ such that
(a) $(\omega_1^0, \omega_2^0, \ldots, \omega_{m-1}^0) \in A_{m-1}$, and
(b) $(P_m \times P_{m+1} \times \cdots \times P_n)(A_n(\omega_1^0, \omega_2^0, \ldots, \omega_{m-1}^0)) \ge \frac{\delta}{2^{m-1}}$.
To go from the case $m = k$ to the case $m = k+1$, one replaces the sets $A_n(\omega_1^0, \omega_2^0, \ldots, \omega_{k-1}^0)$ by the slices $A_n(\omega_1^0, \omega_2^0, \ldots, \omega_k^0)$, where $A_n(\omega_1^0, \omega_2^0, \ldots, \omega_k^0) = A_n(\omega_1^0, \omega_2^0, \ldots, \omega_{k-1}^0)(\omega_k^0)$, and then makes use of the lemma. This proves Lemma 3.4.13 and hence the $\sigma$-additivity of $P$ on $\mathcal{A}$. This completes the proof of the following result.

Theorem 3.4.14. Let $((\Omega_n,\mathcal{F}_n,P_n))_{n \ge 1}$ denote a sequence of probability spaces. Then there is a unique probability $P$ on the $\sigma$-algebra $\times_{n=1}^\infty \mathcal{F}_n$ on $\times_{n=1}^\infty \Omega_n$ such that, for all $\Lambda \in \mathcal{G}_n = \sigma(\{X_i \mid 1 \le i \le n\})$, $P(\Lambda) = (P_1 \times \cdots \times P_n)(A)$ if $\Lambda = A \times (\times_{k=n+1}^\infty \Omega_k)$. It will be denoted by $\times_{n=1}^\infty P_n$.

Definition 3.4.15. The (infinite) product of the probability spaces $((\Omega_n,\mathcal{F}_n,P_n))_{n \ge 1}$ is defined to be the probability space constructed by Theorem 3.4.14. It will be denoted by $(\times_{k=1}^\infty \Omega_k, \times_{k=1}^\infty \mathcal{F}_k, \times_{k=1}^\infty P_k)$.

As a result, given a sequence of probabilities $(Q_n)_{n \ge 1}$ on $\mathbb{R}$, one can find a probability space $(\Omega,\mathcal{F},P)$ and a sequence of random variables $(X_n)_{n \ge 1}$ such that (1) the sequence is independent and (2) the distribution of each $X_n$ is $Q_n$.

Exercise 3.4.16. Let $(\Omega_n,\mathcal{F}_n,P_n) = (\mathbb{R},\mathcal{B}(\mathbb{R}),Q_n)$, $n \ge 1$. Show that the infinite product of these spaces, with the random variables $X_n(\omega) = \omega(n)$, gives a probability space with the desired properties.
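In computational terms, Exercise 3.4.16 is what every random number generator enacts: successive calls play the role of the coordinates $X_n$ of a point of the infinite product space, independent and with prescribed distributions $Q_n$. A Python sketch (an illustration added here, not part of the text; the particular choice of the $Q_n$ is arbitrary), with $Q_n = \frac{1}{2}(\varepsilon_0 + \varepsilon_1)$ for odd $n$ and the uniform distribution on $[0,1]$ for even $n$:

```python
# Sampling the first few coordinates of a point of the product space
# x_{n>=1} (R, B(R), Q_n), with independent coordinates.
import random

def sample_coordinates(n_coords, rng):
    omega = []
    for n in range(1, n_coords + 1):
        if n % 2 == 1:
            omega.append(rng.randrange(2))   # Q_n = (eps_0 + eps_1) / 2
        else:
            omega.append(rng.random())       # Q_n = uniform on [0, 1]
    return omega   # a point of the product space, truncated to n_coords terms

rng = random.Random(1)
print(sample_coordinates(8, rng))
```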
Remarks. (1) The first proof of this theorem is due to von Neumann.³
(2) A corollary of the theorem is its extension to arbitrary products, as stated below. The reason is that if $(X_i)_{i \in I}$ is any family of functions, the $\sigma$-additivity of a set function $P$ on the Boolean algebra $\mathcal{A} = \bigcup_{F \subset I \text{ finite}} \sigma(\{X_i \mid i \in F\})$ may be determined by using only a countable number of the variables (see the remark following Proposition 3.4.2).

³Functional Operators, Vol. I: Measures and Integrals, Annals of Mathematics Studies no. 21, Princeton University Press, Princeton, N.J., 1950.

Corollary 3.4.17. Let $(\Omega_i,\mathcal{F}_i,P_i)$, $i \in I$, be any family of probability spaces. Let $\Omega$ be the product set $\times_{i \in I} \Omega_i$ (i.e., the set of functions $\omega : I \to \bigcup_{i \in I} \Omega_i$ with $\omega(i) \in \Omega_i$ for all $i \in I$), and let $\times_{i \in I} \mathcal{F}_i$ be the $\sigma$-algebra $\sigma(\{X_i \mid i \in I\})$, where $X_i(\omega) = \omega(i)$ for all $\omega \in \Omega$, $i \in I$. Then there is a unique probability $P$ on $\times_{i \in I} \mathcal{F}_i$ such that
\[
(*)\qquad P\Big(\bigcap_{i \in I}\{X_i \in E_i\}\Big) = \prod_i P_i(E_i)
\]
if $E_i \in \mathcal{F}_i$ for all $i$, and $E_i = \Omega_i$ for all but a finite number of indices.

5. Some remarks on Markov chains*

This starred section of the book may be omitted without any loss of continuity in the succeeding chapters. It is even more technical than the proof of $\sigma$-additivity for the product measure. In fact, it deals with an important generalization of the concept of a product measure, namely a so-called Markov chain, one simple example of which may be constructed from a sequence of real-valued i.i.d. random variables $(X_n)_{n \ge 1}$ by considering the partial sums of the random series $\sum_{n=0}^\infty X_n$, where $X_0$ is an arbitrary random variable.

In Neveu [N1], the above results on infinite products are obtained as consequences of a more general result due to Ionescu Tulcea (see [N1], p. 162; also in Doob [D1], pp. 613-615). This more general result also applies to Markov chains. What follows is a presentation of the result for Markov chains (see Revuz [R1], Theorem 2.8).

Let $(X_n)_{n \ge 0}$ be a real-valued stochastic process defined on a probability space $(\Omega,\mathcal{F},P)$, where the parameter $n$ is to be thought of as the time at which the $n$th observation of some phenomenon is made. In general, one might suppose that the $(n+1)$st observation depends upon all the previous ones (the complete past). Since these observations are known only probabilistically (i.e., via their distributions $P_n$, $n \ge 0$), this suggests that to compute $P_{n+1}(B) = P[X_{n+1} \in B]$, one should average over the past using the joint distribution $Q_n$ of $(X_0, X_1, \ldots, X_n)$. Intuitively, there is a transition that occurs from the observation $X_n$ to the observation $X_{n+1}$ that in principle involves $X_0, X_1, \ldots, X_n$. The process is said to be a Markov chain if in fact this transition depends only upon $X_n$. Furthermore, if the way the transition occurs does not depend upon the particular time, the chain is said to be homogeneous. In this case, there is a function $N(x,B)$ of position $x$ and Borel set $B$ that describes the transition from being in "state" $x$ at one time to being in $B$ at the next time with probability $N(x,B)$. Using this so-called transition function, one has (for a homogeneous chain) $P[X_{n+1} \in B] = E[N(X_n,B)]$, the average over the possible positions of $X_n$ of the probability of then being in $B$ after time increases by one unit. For this formula to make sense, the transition function $N$ has to satisfy certain conditions, which are spelled out in the following definition. Note that in the above heuristic discussion, $N$ is a transition function from $(\mathbb{R},\mathcal{B}(\mathbb{R}))$ to $(\mathbb{R},\mathcal{B}(\mathbb{R}))$ in the terminology to follow.
Definition 3.5.1. Let $(\Omega_0,\mathcal{F}_0)$ and $(\Omega_1,\mathcal{F}_1)$ be measurable spaces (i.e., each one consists of a set $\Omega$ and a $\sigma$-algebra $\mathcal{F}$, e.g., $\Omega = \mathbb{R}^n$, $\mathcal{F} = \mathcal{B}(\mathbb{R}^n)$). A transition function (or Markovian kernel) $N$ from $\Omega_0$ to $\Omega_1$ is a function $N : \Omega_0 \times \mathcal{F}_1 \to \mathbb{R}^+$ such that
(1) for each $\omega_0 \in \Omega_0$, $A \mapsto N(\omega_0,A)$ is a probability measure on $\mathcal{F}_1$ (a probability because $N$ is to be Markovian), and
(2) for each $A \in \mathcal{F}_1$, the function $\omega_0 \mapsto N(\omega_0,A) \in \mathcal{F}_0$
(see Revuz [R1]).

Exercise 3.5.2. If $X \in \mathcal{F}_1$ is non-negative, show that $NX \in \mathcal{F}_0$, where $NX(\omega_0) \overset{\text{def}}{=} \int N(\omega_0,d\omega_1)\,X(\omega_1)$.

Examples. (1) $\Omega_1 = \mathbb{R}$, $\mathcal{F}_1 = \mathcal{B}(\mathbb{R})$, and $N(x,A) = \frac{1}{\sqrt{2\pi}} \int_A e^{-(y-x)^2/2}\,dy$;
(2) $\Omega_1 = \mathbb{Z}$, $\mathcal{F}_1 =$ all subsets of $\mathbb{Z}$, and $N(n,A) = \frac{1}{2}|A \cap \{n-1, n+1\}|$.
These two examples of kernels have extensions to higher dimensions.
(3) $\Omega_1 = \mathbb{R}^n$, $\mathcal{F}_1 = \mathcal{B}(\mathbb{R}^n)$, and $N(x,A) = \frac{1}{(2\pi)^{n/2}} \int_A e^{-|y-x|^2/2}\,dy$;
(4) $\Omega_1 = \mathbb{Z}^d$, $\mathcal{F}_1 =$ all subsets of $\mathbb{Z}^d$, and $N(n,A) = \frac{1}{2d}|A \cap \{n \pm e_i,\ 1 \le i \le d\}|$, where $e_i$ is the canonical basis vector of $\mathbb{R}^d$ that has a 1 in the $i$th position and zeros elsewhere, and $n = (n_1, n_2, \ldots, n_d)$; i.e., $2d\,N(n,A)$ is the number of so-called nearest neighbours of $n$ that are in $A$.
(5) $\Omega_1$ is a countable set, say $\mathbb{N}$, and $\mathcal{F}_1$ is the $\sigma$-algebra of all subsets of $\Omega_1$. Let $(\pi(i,j))_{i,j \ge 1}$ be an infinite stochastic matrix (i.e., $\pi(i,j) \ge 0$ for all $i,j$ and $\sum_{j=1}^\infty \pi(i,j) = 1$ for all $i$). Define $N(i,A) = \sum_{j \in A} \pi(i,j)$.

The first four examples are all convolution kernels, i.e., they have the form $N(x,A) = \mu(A-x)$, where $\mu$ is a probability on the abelian group $\mathbb{R}^n$ or $\mathbb{Z}^d$. For example, in (1), the probability is the unit normal $n(x)\,dx$, since
\[
\frac{1}{(2\pi)^{1/2}} \int 1_{A-x}(u)\,e^{-u^2/2}\,du = \frac{1}{(2\pi)^{1/2}} \int 1_A(u+x)\,e^{-u^2/2}\,du = \frac{1}{(2\pi)^{1/2}} \int 1_A(y)\,e^{-(y-x)^2/2}\,dy.
\]

Knowing $N$ and $X_n$, or rather its distribution $P_n$, one may immediately compute $P_{n+1}$ as $P_{n+1}(B) = E[N(X_n,B)]$; by Lemma 3.5.3 below, applied to $\varphi(x) = N(x,B)$, this means that
\[
(1)\qquad P_{n+1}(B) = \int P_n(dx)\,N(x,B).
\]
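Example (2) and formula (1) are easy to make concrete. In the following Python sketch (an illustration, not part of the text; the dict-based representation of discrete probabilities is an ad hoc choice), the kernel of the simple random walk on $\mathbb{Z}$ is implemented directly, and the update $P_{n+1}(B) = \int P_n(dx)\,N(x,B)$ becomes a finite sum:

```python
# The kernel of the simple random walk on Z, N(n, A) = (1/2)|A ∩ {n-1, n+1}|,
# and the one-step update of the distribution of the chain.
def N(n, A):
    # transition probability from the integer n into the set A
    return len(A & {n - 1, n + 1}) / 2

def step(P):
    # push a discrete distribution P (dict: point -> mass) through the kernel
    Q = {}
    for x, mass in P.items():
        for y in (x - 1, x + 1):
            Q[y] = Q.get(y, 0.0) + mass * N(x, {y})
    return Q

P0 = {0: 1.0}                  # start at the origin
P1 = step(P0)                  # {-1: 0.5, 1: 0.5}
P2 = step(P1)                  # {-2: 0.25, 0: 0.5, 2: 0.25}
print(P2, sum(P2.values()))    # the total mass stays 1
```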
Lemma 3.5.3. Let $X$ be a random variable on $(\Omega,\mathcal{F},P)$ with distribution $Q$. Let $\varphi : \mathbb{R} \to \mathbb{R}^+$ be a non-negative Borel function. Then $E[\varphi \circ X] = \int \varphi(x)\,Q(dx)$.

Proof. Let $\varphi \ge 0$ be Borel and let $s_n \uparrow \varphi$ be a sequence of Borel simple functions. Then $\int \varphi(x)\,Q(dx) = \lim_{n\to\infty} \int s_n(x)\,Q(dx)$. Also, $s_n \circ X \uparrow \varphi \circ X$, and so $E[\varphi \circ X] = \lim_{n\to\infty} \int (s_n \circ X)(\omega)\,P(d\omega)$. It therefore suffices to verify the formula for simple functions. Let $s = \sum_{n=1}^N a_n 1_{A_n}$ be a Borel simple function. Then $\sum_{n=1}^N a_n 1_{\{X \in A_n\}} = s \circ X$, and since by definition $Q(A_n) = P[X \in A_n]$, $E[s \circ X] = \sum_{n=1}^N a_n P[X \in A_n] = \int s(x)\,Q(dx)$. $\square$

Remark. The integral $\int P_n(dx)\,N(x,B)$ equals $\int \varphi(x)\,P_n(dx)$, where $\varphi(x) = N(x,B)$. This different order in the expression for the integral is used because in effect two integrations occur in a specified order: first one computes $N(x,B) = \int_B N(x,dy)$; then, to compute $P_{n+1}$, one integrates with respect to $P_n$. There is a definite order here, suggested by writing the integral in this new way, which looks "backwards". This usage will be common in what follows (see Proposition 3.5.4).

Now suppose that one wants to determine the joint distribution of $X_n$ and $X_{n+1}$. To compute $P[X_n \in A, X_{n+1} \in B] = P[(X_n,X_{n+1}) \in A \times B]$, note that if $P[X_n \in A] = P_n(A) \ne 0$, then formula (1), using the conditional probability $P_n[\,\cdot \mid A]$ in place of $P_n$, gives
\[
P[X_{n+1} \in B \mid X_n \in A] = \int P_n(dx \mid A)\,N(x,B).
\]
Since $P[X_n \in A, X_{n+1} \in B] = P[X_n \in A]\,P[X_{n+1} \in B \mid X_n \in A]$, and $\int P_n(dx \mid A)\,\psi(x) = \frac{1}{P_n(A)} \int_A P_n(dx)\,\psi(x)$, it follows that
\[
(2)\qquad P[X_n \in A, X_{n+1} \in B] = \int_A P_n(dx)\,N(x,B).
\]
To summarize, given a real-valued homogeneous Markov chain $(X_n)_{n \ge 0}$ on a probability space $(\Omega,\mathcal{F},P)$ with transition kernel $N$, the distributions $P_n$ are related by the transition kernel: $P_{n+1}(B) = \int P_n(dx)\,N(x,B)$; and the joint distributions $P_{n,n+1}$ of $(X_n,X_{n+1})$ are given by $P_{n,n+1}(A \times B) = \int_A P_n(dx)\,N(x,B)$.

The goal now is to obtain the formulas for the joint distributions $Q_n$ of $(X_0, X_1, \ldots, X_n)$ and to relate them to the distributions of the $X_k$ and $N$. First, however, it is necessary to show that the right-hand side of formula (2) determines a probability on $\mathcal{B}(\mathbb{R}^2)$: the reason is that ultimately one wants to show the existence of the Markov chain, i.e., of the basic probability space $(\Omega,\mathcal{F},P)$ and of the sequence of random variables $X_n$ with the above properties. Of course, when the Markov chain exists, it is clear that (2) defines a probability, namely the distribution of $(X_n,X_{n+1})$. In fact, the right-hand side of (2) always determines a probability, as the next result shows.
Proposition 3.5.4. Let $P_0$ be a probability on $(\Omega_0,\mathcal{F}_0)$ and let $N$ be a Markovian kernel from $\Omega_0$ to $\Omega_1$. Then there is a unique probability $P$ on $(\Omega_0 \times \Omega_1, \mathcal{F}_0 \times \mathcal{F}_1)$ such that
\[
(*)\qquad P(E_0 \times E_1) = \int_{E_0} \Big[ \int N(\omega_0,d\omega_1)\,1_{E_1}(\omega_1) \Big] P_0(d\omega_0) \overset{\text{def}}{=} \int_{E_0} P_0(d\omega_0) \int N(\omega_0,d\omega_1)\,1_{E_1}(\omega_1),
\]
where $\int N(\omega_0,d\omega_1)\,1_{E_1}(\omega_1) = N(\omega_0,E_1)$, $E_i \in \mathcal{F}_i$.

Proof. Let $\mathcal{A}$ be the Boolean algebra of finite disjoint unions of sets of the form $E_0 \times E_1$, $E_i \in \mathcal{F}_i$. The formula (*) determines $P$ on $\mathcal{A}$. To verify that $P$ is $\sigma$-additive on $\mathcal{A}$, one first proves a lemma for Markov kernels that corresponds to Lemma 3.3.3. Then the argument used for the $\sigma$-additivity in the case of the product of two measures applies. $\square$

Interpretation of Proposition 3.5.4 in the case of Example (4) for $d = 2$. Consider an initial probability distribution $P_0$ on $\mathbb{Z}^2 = \Omega$ together with the given transition kernel $N$. Now $N(\omega,A)$ can be interpreted as giving the probability for the transition or motion of a random particle starting at $\omega = (n_1,n_2) = n$ to a point in $A \subset \mathbb{Z}^2$ after one jump. On $\Omega \times \Omega = \mathbb{Z}^2 \times \mathbb{Z}^2$, define $X_0(\omega_0,\omega_1) = \omega_0$ and $X_1(\omega_0,\omega_1) = \omega_1$. One may view $X_0$ as giving the initial position of the random particle and $X_1$ as giving its position after one jump, with the result distributed according to the transition kernel $N$ and the initial distribution $P_0$ of position. The probability $P$, which is the joint distribution of $(X_0,X_1)$, also determines the distribution $P_0$ of $X_0$, since $P[X_0 \in A] = P(A \times \mathbb{Z}^2) = \int_A N(\omega_0,\mathbb{Z}^2)\,P_0(d\omega_0) = P_0(A)$, and the distribution of $X_1$, since $P[X_1 \in A] = P(\mathbb{Z}^2 \times A) = \int_{\mathbb{Z}^2} N(n,A)\,P_0(dn)$.

For example, if $P_0 = \frac{1}{3}\varepsilon_0 + \frac{1}{2}\varepsilon_{e_1} + \frac{1}{6}\varepsilon_{e_2}$, where $\varepsilon_n$ is a unit or point mass at the point $n$ (see Exercise 2.9.9), and $A = \{0, e_1, e_2, e_1+e_2\}$, then $P[X_1 \in A] = \frac{1}{3}N(0,A) + \frac{1}{2}N(e_1,A) + \frac{1}{6}N(e_2,A) = (\frac{1}{3} + \frac{1}{2} + \frac{1}{6})\frac{1}{2} = \frac{1}{2}$, as each point in $A$ has two nearest neighbours in $A$. If $B = \{0, e_1, e_2, -e_1+e_2\}$, then $P[X_1 \in B] = \frac{1}{3}N(0,B) + \frac{1}{2}N(e_1,B) + \frac{1}{6}N(e_2,B) = \frac{1}{6} + \frac{1}{8} + \frac{1}{12} = \frac{3}{8}$. Also, if $C = \{0, -e_1, e_2, -e_1+e_2\}$, then $P(C \times B) = \frac{1}{3}N(0,B) + \frac{1}{6}N(e_2,B) = \frac{1}{6} + \frac{1}{12} = \frac{1}{4}$.
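The three computations just made can be confirmed mechanically. A Python sketch (an illustration, not part of the text; tuples stand for points of $\mathbb{Z}^2$, and exact rational arithmetic is used so the answers $1/2$, $3/8$, and $1/4$ come out exactly):

```python
# Checking the Z^2 example: P[X1 in A], P[X1 in B], and P(C x B).
from fractions import Fraction as F

def N(n, A):
    # nearest-neighbour kernel of Example (4) with d = 2
    x, y = n
    nbrs = {(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)}
    return F(len(A & nbrs), 4)

O, e1, e2 = (0, 0), (1, 0), (0, 1)
P0 = {O: F(1, 3), e1: F(1, 2), e2: F(1, 6)}

def dist_X1(A):
    # P[X_1 in A] = sum over starting points of P0(n) * N(n, A)
    return sum(p * N(n, A) for n, p in P0.items())

A = {O, e1, e2, (1, 1)}
B = {O, e1, e2, (-1, 1)}
C = {O, (-1, 0), e2, (-1, 1)}
print(dist_X1(A))                                  # 1/2
print(dist_X1(B))                                  # 3/8
print(sum(P0[n] * N(n, B) for n in P0 if n in C))  # P(C x B) = 1/4
```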
Turning to the computation of the joint distributions $Q_n$, observe that $Q_0 = P_0$ and that $Q_1$ is determined by $P_0$ and $N$ as
\[
Q_1(E_0 \times E_1) = \int_{E_0} P_0(dx_0)\,N(x_0,E_1).
\]
Notice that
\[
Q_2(\mathbb{R} \times E_1 \times E_2) = P_{1,2}(E_1 \times E_2) = \int_{E_1} P_1(dx_1)\,N(x_1,E_2).
\]
Now
\[
P_1(E_1) = \int P_0(dx_0)\,N(x_0,E_1) = \int P_0(dx_0) \int_{E_1} N(x_0,dx_1),
\]
i.e., $P_1(dx_1) = \int P_0(dx_0)\,N(x_0,dx_1)$, and so
\[
(3)\qquad Q_2(\mathbb{R} \times E_1 \times E_2) = \int P_0(dx_0) \int_{E_1} N(x_0,dx_1)\,N(x_1,E_2).
\]
Assume that $P[X_0 \in E_0] = P_0(E_0) \ne 0$. Then one may replace $P_0$ in (3) by the conditional probability $P_0[\,\cdot \mid E_0]$ and obtain
\[
P[X_1 \in E_1, X_2 \in E_2 \mid X_0 \in E_0] = \int P_0(dx_0 \mid E_0) \Big[ \int_{E_1} N(x_0,dx_1)\,N(x_1,E_2) \Big] = \frac{1}{P_0(E_0)} \int_{E_0} P_0(dx_0) \Big[ \int_{E_1} N(x_0,dx_1)\,N(x_1,E_2) \Big].
\]
As a result,
\[
Q_2(E_0 \times E_1 \times E_2) = P[X_0 \in E_0, X_1 \in E_1, X_2 \in E_2] = \int_{E_0} P_0(dx_0) \Big[ \int_{E_1} N(x_0,dx_1)\,N(x_1,E_2) \Big].
\]
In this formula, the term $\int_{E_1} N(x_0,dx_1)\,N(x_1,E_2)$ in the brackets determines a kernel $M_2(x_0,E)$ from $(\mathbb{R},\mathcal{B}(\mathbb{R}))$ to $(\mathbb{R}^2,\mathcal{B}(\mathbb{R}^2))$, where
\[
M_2(x_0,E) \overset{\text{def}}{=} \int N(x_0,dx_1) \Big[ \int N(x_1,dx_2)\,1_E(x_1,x_2) \Big].
\]

Exercise 3.5.5. Verify this statement. [Hint: consider the collection of Borel sets $E \subset \mathbb{R}^2$ such that $\int N(x_1,dx_2)\,1_E(x_1,x_2)$ is a Borel function of $x_1$.]

The kernel $M_2$ will also be denoted by the following formal expression:
\[
M_2(x_0,dx) = M_2(x_0,dx_1,dx_2) = N(x_0,dx_1)\,N(x_1,dx_2),
\]
where $dx$ stands for $dx_1\,dx_2$ and the integer 2 indicates the dimension of $\mathbb{R}^2$. In a similar way, the probability $Q_2$ will also be denoted by the formal expression
\[
Q_2(dx_0,dx_1,dx_2) = P_0(dx_0)\,N(x_0,dx_1)\,N(x_1,dx_2) = P_0(dx_0)\,M_2(x_0,dx).
\]
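The formal rule behind these expressions, namely integrate on $x_2$ first, then on $x_1$, then on $x_0$, becomes a nested sum in the discrete case. The following sketch (illustrative only, reusing the $\mathbb{Z}^2$ kernel above) evaluates $Q_2(E_0 \times E_1 \times E_2)$ exactly as the formal expression prescribes:

```python
# Q_2(dx0, dx1, dx2) = P0(dx0) N(x0, dx1) N(x1, dx2) as a nested sum,
# evaluated innermost (x2) first, for the Z^2 nearest-neighbour walk.
from fractions import Fraction as F

def neighbours(n):
    x, y = n
    return [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]

O, e1, e2 = (0, 0), (1, 0), (0, 1)
P0 = {O: F(1, 3), e1: F(1, 2), e2: F(1, 6)}

def Q2(E0, E1, E2):
    total = F(0)
    for x0, p in P0.items():
        if x0 not in E0:
            continue
        for x1 in neighbours(x0):       # N(x0, dx1): mass 1/4 per neighbour
            if x1 not in E1:
                continue
            inner = sum(F(1, 4) for x2 in neighbours(x1) if x2 in E2)
            total += p * F(1, 4) * inner
    return total

# e.g. the chance of starting at 0, stepping to e1, then returning to 0:
print(Q2({O}, {e1}, {O}))   # = (1/3) * (1/4) * (1/4) = 1/48
```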
These formal formulas for kernels and probabilities have to be interpreted, for integration purposes, as requiring integration first on $x_2$, then on $x_1$, and finally on $x_0$. Since $Q_2(dx_0,dx_1,dx_2) = P_0(dx_0)\,M_2(x_0,dx)$, it follows that $Q_2$ is the probability determined by Proposition 3.5.4 from $P_0$ and the kernel $M_2$. It is also determined in the same way by $Q_1$ and $N$, since $Q_2(dx_0,dx_1,dx_2) = Q_1(dx_0,dx_1)\,N(x_1,dx_2)$, as $Q_1(dx_0,dx_1) = P_0(dx_0)\,N(x_0,dx_1)$. The kernel $N(x_1,dx_2)$ is here viewed as a kernel from $(\mathbb{R}^2,\mathcal{B}(\mathbb{R}^2))$ to $(\mathbb{R},\mathcal{B}(\mathbb{R}))$: as a function of $(x_0,x_1)$, the probability $N(x_1,A)$ of $A \in \mathcal{B}(\mathbb{R})$ is Borel measurable.

The statement about how to use formulas like $Q_2(dx_0,dx_1,dx_2) = Q_1(dx_0,dx_1)\,N(x_1,dx_2)$ for integration follows from the next formal result, which can be viewed as a generalization of Fubini's theorem.

Proposition 3.5.6. (An extension of Fubini's theorem) Let $P_0$ be a probability on $(\Omega_0,\mathcal{F}_0)$ and let $N$ be a Markov kernel from $(\Omega_0,\mathcal{F}_0)$ to $(\Omega_1,\mathcal{F}_1)$. Let $P$ be the probability on $(\Omega_0 \times \Omega_1, \mathcal{F}_0 \times \mathcal{F}_1)$ such that
\[
P(E_0 \times E_1) = \int_{E_0} P_0(d\omega_0) \int N(\omega_0,d\omega_1)\,1_{E_1}(\omega_1) \overset{\text{def}}{=} \int P_0(d\omega_0) \int N(\omega_0,d\omega_1)\,1_{E_0 \times E_1}(\omega_0,\omega_1),
\]
where $E_i \in \mathcal{F}_i$, $i = 0$ or $1$. Let $X$ be a non-negative random variable on $(\Omega_0 \times \Omega_1, \mathcal{F}_0 \times \mathcal{F}_1, P)$. Then
(1) for all $\omega_0 \in \Omega_0$, the function $\omega_1 \mapsto X(\omega_0,\omega_1) \in \mathcal{F}_1$,
(2) the function $\omega_0 \mapsto \int N(\omega_0,d\omega_1)\,X(\omega_0,\omega_1) \in \mathcal{F}_0$, and
(3) $E[X] = \int P_0(d\omega_0) \big[ \int N(\omega_0,d\omega_1)\,X(\omega_0,\omega_1) \big]$.

Proof. The proof of this proposition is essentially the same as the proof of Fubini's theorem (Theorem 3.3.5). One reduces to the case of $X = 1_E$, $E \in \mathcal{F}_0 \times \mathcal{F}_1$, and then uses the same argument. Note that (2) extends Exercise 3.5.2. $\square$

Remark. If $E \in \mathcal{F}_0 \times \mathcal{F}_1$ and $E(\omega_0)$ is the slice of $E$ over $\omega_0 \in \Omega_0$, this result applied to $X = 1_E$ shows that
\[
P(E) = \int N(\omega_0,E(\omega_0))\,P_0(d\omega_0).
\]
Then, one has the following result.
Lemma 3.5.7. If $P(E) \ge \delta$, then $P_0(\{\omega_0 \mid N(\omega_0,E(\omega_0)) \ge \frac{\delta}{2}\}) \ge \frac{\delta}{2}$.

Proof. Repeat the proof of Lemma 3.4.12. $\square$

It should be clear by now what to expect for the formula for $Q_n$, namely
\[
Q_n(dx_0,dx_1,\ldots,dx_n) = Q_{n-1}(dx_0,dx_1,\ldots,dx_{n-1})\,N(x_{n-1},dx_n).
\]
Since $Q_{n-1}(dx_0,dx_1,\ldots,dx_{n-1}) = Q_{n-2}(dx_0,dx_1,\ldots,dx_{n-2})\,N(x_{n-2},dx_{n-1})$, it follows that
\[
Q_n(dx_0,dx_1,\ldots,dx_n) = Q_{n-2}(dx_0,dx_1,\ldots,dx_{n-2})\,N(x_{n-2},dx_{n-1})\,N(x_{n-1},dx_n) = P_0(dx_0)\,N(x_0,dx_1)\,N(x_1,dx_2)\cdots N(x_{n-1},dx_n).
\]
This is formalized as the following result.

Proposition 3.5.8. For any $n \ge 1$, there is a unique probability $Q = Q_n$ on $(\mathbb{R} \times \mathbb{R}^n, \mathcal{B}(\mathbb{R}^{n+1}))$ such that
\[
Q_n(dx_0,dx_1,\ldots,dx_n) = Q_{n-1}(dx_0,dx_1,\ldots,dx_{n-1})\,N(x_{n-1},dx_n),
\]
i.e., such that
\[
Q_n(E_0 \times E_1 \times \cdots \times E_n) = \int_{E_0 \times E_1 \times \cdots \times E_{n-1}} Q_{n-1}(dx_0,dx_1,\ldots,dx_{n-1})\,N(x_{n-1},E_n)
\]
\[
(\dagger)\qquad = \int_{E_0} P_0(dx_0) \Big[ \int_{E_1} N(x_0,dx_1) \Big[ \int_{E_2} N(x_1,dx_2) \cdots \Big[ \int_{E_n} N(x_{n-1},dx_n) \Big] \cdots \Big] \Big],
\]
where $E_i \in \mathcal{B}(\mathbb{R})$, $0 \le i \le n$.
126 III. INDEPENDENCE AND PRODUCT MEASURES random variable on (Kn+1,03(Kn+1),Q„), then E[X] = / Q„(dx0,dxi,... ,dxn)X(x0,Xi,... ,xn) = Qn-i{dx0,dxu...,dxn-l) / N(xn_1,dxn)X(xo,x1,...,xn) (tt) = / P0(dx0) /N(xo,dxi) / N(x0,dx2) / N(xn_i,dx„)X(xo,xi,...,x„) Proof. There is at most one such probability since *B(Rn+1) is the smallest er-algebra containing all the sets Eq x E\ x • • • x En. For n = 1 this result follows from Propositions 3.5.4 and 3.5.6. Assume the result to be true for n = k. Let Qk = Po be the probability on (KxKfc,03(Kfc+1)) such that (f) holds. DefineN((x0,Xi,... ,xfc),dxfc+1) = N(xfc,dxfc+1). Then N is a Markov kernel from fi0 = (R * R\B(Rfc+1)) to Qi = (R,B(R)). Let Qfc+i be the unique probability on (K x Rk x R,B(Rfc+1) x B(R)) = (R x Rfc+1,B(Rfc+2)) such that for E0 € B(Rfc+1) and Ek+l € B(R), Qk+l(Eo x Ek+\) = I Qk(dx0,dxi,...,dxk) / N((xo,Xi,...,xfc),dxfc+1) = Qk(dx0,dxi,...,dxk) / N(xfc,dxfc+1) J Eo U Ek+i (4) JEo Qk+i{dx0,dxi,... ,dxk)N(xk,Ek+i). By using (ft) for the case n = k and the function X(xo,Xi,... ,xk) = lgJxcX!,... ,xk)N(xk,Ek+i), when E0 = E0 x Et x • • • x Ek, it follows that JEo Qk(dx0,dxi,.. .,dxk)N(xk,Ek+i) = f P0(dxo)\ [ N(x0,dii)f/" N(xlfdi2)-- JEo UE, UE2 ••• / N(xfc,dxfc+1) ••
5. SOME REMARKS ON MARKOV CHAINS 127 which verifies the first part (f) of the result for n = k + 1. To show that the "Fubini" formula (ft) holds for Qt+i, it suffices to apply Proposition 3.5.6 to Qt = Po and N and to use the fact that it holds for Qfc. □ Remarks 3.5.9. (1) If E € <B(Rn+1), Q„(E) is the probability that the trajectory or path (X0(lj),Xi(lj),. .. ,Xn(uj)) = (x0,Xi,... ,x„) of the random particle during the time set {0,1,..., n} lies in E. (2) Formula (4) in the proof of Proposition 3.5.8 implies that if Eq G <8(Rfc+1), then E[lEo(x)^(xk+i)] = / Qfc+i(dxo,dxi,...,dxfc+i)V(xfc+i) JEo = / Qfc(dxo,dxi,...,dxfc)N(xfc,V) JEo = £[l£0(x)N(xfc,V>)] for any non-negative Borel function ip, where N(x,V0 = $N(x,dy)ip{y) and the expectation is taken relative to Qt+i. This observation will be explained in Chapter V in terms of conditional expectations and the Markov property (see Remark 5.2.11 and Exercise 5.2.12). If the Markov kernel N(x,dy) does not depend on x (i.e., N(x,dy) = Q(dy) for all x), the probability Qn = Po x Q x • • • x Q. These probabilities are the finite-dimensional joint distributions of an independent process (Xn)n>o, with P0 the distribution of Xo and the remaining Xn all identically distributed with common distribution Q. This raises the problem as to whether for a general Markovian kernel N there is a stochastic process with the Qn as the finite-dimensional joint distribution corresponding to {0,1,..., n}. Such processes exist and are called Markov chains with transition probability N and initial distribution P0. To simplify the following discussion, one assumes the Markov chain to be real-valued. To construct such a process, one uses the probabilities Qn to construct a probability on the Boolean algebra 21 = U^L0®n of subsets of fi = x2°=0fin,fin = R for all n > 0, where <&n = a({Xi \ 0 < i < n}) and Xt : fi —» R is defined by Xi(uj) = w(i) for all lj € fi. These notations were used before when constructing the infinite product space (xJJ'Ljfi,,, x^LjJ,,, x^jPn). Here fin = R and $n = <B(R) for all n > 0, but the sequence (P„)„>i of probabilities Pn is not given. In the construction of the infinite product space, it was shown in Proposition 3.4.9 that A € <5n if and only if A = A x (Xj£Ln+lQk), where A is uniquely determined by A and vice versa. Define Q on 21 by setting Q(A) = Qn(-<4) (for all n > 0). This defines a function on 21 provided, for all n > 0 and A € *B(Rn+1), Qn(i4) = Qn+i(i4 x R). In view of formula (f) in Proposition 3.5.8, this is clear because /RN(xn \,dxn) = N(xn_i,R) = 1 since N is a Markovian
128 III. INDEPENDENCE AND PRODUCT MEASURES kernel. To extend Q from 21 = U^°=o0„ to x~=0,ff„, $n = <8(R) for all n > 0, one has therefore to prove that Q is er-additive on A. In view of Lemma 3.5.7, the pattern of the argument used for the case of infinite products can be followed to show that if (Cn)n>o C 21, Cn D Cn+\, and Q(C„) > 26 > 0 for all n > 0, then n^°=0C„ ^ 0. As before, this implies that Q is er-additive on 21 (see Exercise 3.3.2 (4)). As before one may assume Cn € <5n for all n > 0, i.e., Cn = A'n x (x£in+1fifc), A'n e <8(Kn+1). This allows one to transpose the problem to the analogue of Lemma 3.4.13. Before doing this, it will be helpful to define several kernels. Given the Markov kernel N, define the Markov kernel Mn from (K, 93(R)) to (Rn,<8(Rn)) by setting M„(x0, dxi,dx2,..., dxn) = N(x0, dxi)N(xi, dx2) ■ ■ ■ N(xn_i, dxn). Note that the subscript n on Mn indicates that the probability corresponding to Xo is defined on the Borel subsets of Rn. In keeping with the convention established earlier, this formula is to be read as follows: if E € <8(Rn), then M„(x0,E) = / N(xo,dx!) /' /' / N(x1;dx2)--- N(xn_i,dx„)lE(xi,X2,...,x„) In particular, if E = E\ x i% x • • • x En, then Mn(x0,^i x E2 x ••• x En) = / N(x0,dxi) / N(xi,dx2) J Ei Uei / ^Xn-^dXn)]- Je„ Hence, the earlier formula for (¾ extends, i.e., (*) Q„(dxo,dxi,...,dx„) =Po(dxo)M„(x0,dxi,...,dxn). For a fixed Xo, the probability N(xo,dxi)Mn_ ^11,(^2,^3,.. • ,dxn) is the probability defined by Proposition 3.5.4 with Po(du>o) = N(x0,du;o) and N(u;0,du;i) = Mn_i(u;o,du;i), i.e., u;0 = Xi and dw\ = (dx2,dx3,... ,dxn). It follows from Proposition 3.5.4 that the value of this probability for E = (E^. x E2) with £iCR and E2 c Rn_1 is JE N(x0,dx1)M„_1(x1,E2). If E2 = E2 x ■■■ x En, then Je, N(xo,dxi)Mn_i(xi,E2 x ••• x En) = / N(x0,dx!) / N(x!,dx2)--- / N(xn_1,dxn)] • ■ Jex Ue2 Ue,,
5. SOME REMARKS ON MARKOV CHAINS 129 In other words, the kernel Mn(x0,dx) = N(x0,dxi)Mn_1(xi,dx), where dx = (dx\,dx2, ■ ■ ■ ,dxn) and di denotes (dx2,dx2,... ,dxn) since two measures that agree on all the Borel sets E\ x E<i x • • • x En are identical. Returning to the proof that Q is er-additive on 21, as a first step one applies Lemma 3.5.7 to the probabilities Qn using the decomposition given by formula (*). If Bn = {x0 | M„(x0, A'n{x0)) > 6, then P0(B„) > 6, for all n > 1. As in the infinite product case, one has Bn D Bn+i as A^xK D A'n+1 for all n > 1 implies A'n(xo) x R D A'n+1(xo) and Mn(x0, j44(x0)) = Mn+i{x0,An{x0) xR)> Mn+1(x0,>i;+1(xo)). Hence P0(n^=1Bn) > 6, and so there is a point x§ € A'0 such that Mn(x{), ^(Xq)) > 6 for all n > 1. To continue the induction and simplify the notation, let An now denote j4^(Xo) for all n > 1. The analogue of Lemma 3.4.13 is the following lemma. Lemma 3.5.10. Assume that, for each n > 1, one has a set An € 93(Rn) such that, for aJJ n > 0, (1) An x Rd An+U and (2) Mn(x0,,4n)>6>0. Then there is a sequence (x°)fc>i of reai numbers such that, for all n > 1, one has (x°, x°,..., x°) e An. Once this lemma is proved, it follows that Q is er-additive on 21 and hence extends uniquely to S = xjjl03n; the point (Xq,x?,... ,x°,...) e n£L0Cn. Since Mn(xo,dx) = N(x0,dxi)Mn_i(xi,dx), it follows from Lemma 3.5.7 that if Bn = {xi \ Mn-i{xuAn{xi)) > §}, then N(x0,B„) > §. Just as before, it follows that j4„(xi) x R d j4„+i(xi) and so Bn D Bn+l since Mn_i{xi,An{xi)) = Mn(x1,i4n(x1)xK) > Mn(x1,i4n+1(x1)). Hence, N(xo,n^1Bn) > §, and so there is a point X® € Ai such that M„_1(x?,>ln(x?)) > § for all n > 2. This proves the following variant of Lemma 3.5.10. Lemma 3.5.10*. Assume that Mn(xo, An) > 6 for all n> 2. Then there is a point x° € A\ such that Mn_i(x°,An(x°)) > 1 for all n> 2. One now "forgets" about the first copy of R and replaces the sets An by their slices An(x?) € 95(Rn_1). Since Mn_1(x?,>ln(x?)) > § for all n > 2, and Mn_i(ii,di2,... ,dxn) = N(xi,dx2)Mn_2(x2,dx3,... ,dxn), it follows from the above lemma that there is a point x° 6 A<i(x°) (i.e., (x^x") € A2), such that Mn_2(x2,i4n(xi,X2)) > 51 for all n > 3, where i4n(xi,x^) = j4„(Xi)(x2). Using this lemma, it follows by induction on m that, for each m > 2, if n > m + 1, there are x°,x°,..., x^,_ 1 such that (a) (3:1,3:2.--.^-1) ei4m_! and (b) Mn_(m_i)(x°m_1,An(x?,x°,...,^m-i)) > 2^r-
To go from the case $m = k$ to the case $m = k+1$, one replaces the sets $A_n(x^0_1,x^0_2,\dots,x^0_{k-1})$ by their slices $A_n(x^0_1,x^0_2,\dots,x^0_k)$, where $A_n(x^0_1,x^0_2,\dots,x^0_k) = A_n(x^0_1,x^0_2,\dots,x^0_{k-1})(x^0_k)$, and then makes use of the lemma. This proves Lemma 3.5.10 and hence the $\sigma$-additivity of $Q$ on $\mathfrak{A}$, which completes the proof of the following result.

Theorem 3.5.11. Let $P_0$ be a probability on $\mathfrak{B}(\mathbb{R})$ and $N$ be a Markov kernel from $(\mathbb{R},\mathfrak{B}(\mathbb{R}))$ to $(\mathbb{R},\mathfrak{B}(\mathbb{R}))$. Let $\Omega$ be the set of real-valued functions $\omega$ on $\mathbb{N}\cup\{0\}$, and define $X_n:\Omega\to\mathbb{R}$ by $X_n(\omega) = \omega(n)$ for all $n\ge 0$. Denote by $\mathfrak{F}$ the countable product of the Borel $\sigma$-algebras, i.e., $\mathfrak{F} = \sigma(\{X_k\mid k\ge 0\})$. Then there is a unique probability $Q$ on $\mathfrak{F}$ such that (for all $n$) the finite-dimensional joint distribution of $(X_0,X_1,\dots,X_n)$ is the probability $Q_n$ on $\mathfrak{B}(\mathbb{R}^{n+1})$ for which
$$Q_n(E_0\times E_1\times\cdots\times E_n) = \int_{E_0\times\cdots\times E_{n-1}} Q_{n-1}(dx_0,dx_1,\dots,dx_{n-1})\,N(x_{n-1},E_n) = \int_{E_0}P_0(dx_0)\int_{E_1}N(x_0,dx_1)\cdots\int_{E_n}N(x_{n-1},dx_n).$$

Remarks. $(\Omega,\mathfrak{F},Q)$ together with the family $(X_n)_{n\ge 0}$ of random variables $X_n$ is a stochastic process that has been constructed from its finite-dimensional joint distributions. In general, to construct stochastic processes from their finite-dimensional distributions, one is obliged to use a different technique: namely Kolmogorov's extension theorem (Kolmogorov [K2], p. 31; also Billingsley [B1], p. 510 and Loève [L2], p. 94), which makes use of local compactness and hence requires a topology. The case of a real-valued process is discussed in [K2] and is well worth reading: there is a change in terminology, with "field" indicating a Boolean algebra and "Borel field" indicating a $\sigma$-field. In Loève, the theorem is called a consistency theorem. The issue is the following: given an arbitrary product of copies of the real numbers indexed by a set $I$, and for each finite subset $F$ of $I$ a probability $P_F$ on the $\sigma$-algebra $\mathfrak{F}_F$ generated by the coordinate functions $X_i$, $i\in F$, such that these probabilities are "consistent" (i.e., $P_{F_1}$ and $P_{F_2}$ agree with $P_{F_1\cap F_2}$ on $\mathfrak{F}_{F_1\cap F_2}$), is there a probability on the $\sigma$-algebra $\mathfrak{F}$ generated by all the coordinate functions? Clearly, $\mathfrak{A} := \cup_F\mathfrak{F}_F$ is a Boolean algebra, and the consistency of the probabilities shows that there is a unique finitely additive probability $P$ on $\mathfrak{A}$ that agrees with each $P_F$ on each $\mathfrak{F}_F$. The problem is, as always, to show that $P$ is $\sigma$-additive on $\mathfrak{A}$. The proof of Kolmogorov uses the compactness of closed and bounded subsets of $\mathbb{R}$ and makes use of Cantor's diagonal procedure (see the discussion in the appendix of Chapter VI on Helly's selection principle).
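Theorem 3.5.11 also licenses the obvious simulation recipe for a Markov chain: draw $X_0$ from $P_0$ and then, repeatedly, $X_{k+1}$ from $N(X_k,\cdot)$; the joint law of $(X_0,\dots,X_n)$ so produced is exactly $Q_n$. A minimal finite-state sketch, with a hypothetical initial law and kernel:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_path(P0, N, n, rng):
    """Return one trajectory (X0, X1, ..., Xn) of the chain of
    Theorem 3.5.11 on a finite state space: P0 is a probability
    vector and N a stochastic matrix standing in for N(x, dy)."""
    k = len(P0)
    x = rng.choice(k, p=P0)          # X0 ~ P0
    path = [x]
    for _ in range(n):
        x = rng.choice(k, p=N[x])    # X_{m+1} ~ N(X_m, .)
        path.append(x)
    return path

P0 = np.array([0.5, 0.3, 0.2])
N = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2],
              [0.3, 0.3, 0.4]])
# The empirical frequency of (X0, X1) = (0, 1) should approach
# Q1({0} x {1}) = P0[0] * N[0, 1] = 0.30.
paths = [sample_path(P0, N, 1, rng) for _ in range(100_000)]
freq = sum(1 for p in paths if p == [0, 1]) / len(paths)
print(freq, P0[0] * N[0, 1])
```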
6. Additional exercises*

Definition 3.6.1. Let $E$ be any subset of $\mathbb{R}^n$. A subset $U$ of $E$ is said to be an open subset of $E$, or open relative to $E$, if there is an open set $O$ of $\mathbb{R}^n$ with $U = O\cap E$, i.e., $U$ is open in the topological subspace $E$ of $\mathbb{R}^n$.

Definition 3.6.2. The $\sigma$-algebra of Borel subsets of $E$ is defined to be the smallest $\sigma$-algebra containing all the open subsets of $E$. It will be denoted by $\mathfrak{B}(E)$.

Exercise 3.6.3. Show that $A$ is a Borel subset of $E$ if and only if there is a Borel subset $F$ of $\mathbb{R}^n$ with $A = F\cap E$.

Exercise 3.6.4. (Fubini's theorem for Lebesgue integrable functions on $\mathbb{R}^n$) Let $X$ be a Lebesgue measurable function on $\mathbb{R}^n$ that is integrable. By Exercise 3.3.15, there is a Borel function $Y$ on $\mathbb{R}^n$ (which can be assumed to be finite valued) with $A = \{X\neq Y\}$ of Lebesgue measure zero on $\mathbb{R}^n$. Recall that by Exercise 2.1.31, $Y$ is integrable. Let $n = p+q$, and if $x\in\mathbb{R}^n$, let $x = (x_1,x_2)$, with $x_1\in\mathbb{R}^p$ and $x_2\in\mathbb{R}^q$. Show that
(1) the slice $A(x_1)$ of $A$ over $x_1\in\mathbb{R}^p$ has Lebesgue measure zero (in $\mathbb{R}^q$) for almost all $x_1$, where $x_2\in A(x_1)$ if and only if $(x_1,x_2)\in A$,
(2) if $|A(x_1)| = 0$, then $x_2\to X(x_1,x_2)$ is Lebesgue measurable on $\mathbb{R}^q$,
(3) for almost all $x_1$ with $|A(x_1)| = 0$, $x_2\to X(x_1,x_2)$ is Lebesgue integrable with $\int X(x_1,x_2)dx_2 = \int Y(x_1,x_2)dx_2$ (recall that by Corollary 3.3.17, $|\{x_1\mid Y(x_1,\cdot)\notin L^1\}| = 0$ as $Y\in L^1$), and
(4) $\int X\,dx = \int\big[\int X(x_1,x_2)dx_2\big]dx_1$, where the inner integral is taken to be zero if $x_1$ is such that $|A(x_1)| = 0$ and $x_2\to X(x_1,x_2)$ is not integrable.
[Hint: make use of Fubini's theorem for Borel functions.]

Note that the "arbitrary" definition of $\int X(x_1,x_2)dx_2 = 0$ on the bad set in (4) amounts to modifying the original function $X$ to have value zero on a set $T$ of Lebesgue measure zero (which does not affect its integral on $\mathbb{R}^n$): let $T = (N_0\cup N_1)\times\mathbb{R}^q$, where $N_0 = \{x_1\mid A(x_1)$ does not have Lebesgue measure zero on $\mathbb{R}^q\}$ and $N_1 = \{x_1\mid |A(x_1)| = 0$ and $x_2\to X(x_1,x_2)$ is not Lebesgue integrable on $\mathbb{R}^q\}$.

Exercise 3.6.5. Show that $\int_y^{+\infty} e^{-x^2/2}\,dx \sim \frac{1}{y}e^{-y^2/2}$ as $y\to+\infty$ (i.e., the ratio tends to 1 as $y\to+\infty$). [Hint: show that the ratio of the derivatives of the two expressions tends to 1 as $y\to+\infty$.]

Exercise 3.6.6. Let $(E_i)_{i\in I}$ be a family of events $E_i$. Show that the $\sigma$-algebra $\sigma(\{1_{E_i}\mid i\in I\})$ is the smallest $\sigma$-algebra $\mathfrak{F}$ that contains one of the following collections of sets:
(1) all the sets $E_i$, $i\in I$;
(2) all the sets $E_i^c$, $i\in I$;
(3) all the sets $A_i$, $i\in I$, where, for each $i$, $A_i = E_i$ or $E_i^c$;
(4) all the sets $\cap_{i\in F}E_i$, where $F$ is any finite subset of $I$;
(5) all the sets $\cap_{i\in F}E_i^c$, where $F$ is any finite subset of $I$;
(6) all the sets $\cap_{i\in F}A_i$, where $F$ is any finite subset of $I$ and, for each $i$, $A_i = E_i$ or $E_i^c$.
[Hint: note that $1_E\in\mathfrak{F}$ if and only if $E\in\mathfrak{F}$; equivalently, if and only if $E^c\in\mathfrak{F}$.]

Exercise 3.6.7. Let $(E_i)_{1\le i\le n}$ be a finite collection of sets. Show that it is independent in the sense of Definition 3.4.3 if and only if one of the following equivalent conditions is verified:
(1) if $A_i = E_i$, $E_i^c$, or $\Omega$, then $P(\cap_{i=1}^n A_i) = \prod_{i=1}^n P(A_i)$;
(2) if $A_i = E_i$ or $\Omega$, then $P(\cap_{i=1}^n A_i) = \prod_{i=1}^n P(A_i)$;
(3) for any $F\subset\{1,2,\dots,n\}$, $P(\cap_{i\in F}A_i) = \prod_{i\in F}P(A_i)$, where $A_i = E_i$ for all $i\in F$.

Remark. Condition (3) is the usual definition of a collection of independent events (see Billingsley [B1], p. 48); a numerical check of this condition is sketched after Exercise 3.6.11 below.

Exercise 3.6.8. Show that for a family $(E_i)_{i\in I}$ of events $E_i$, the following are equivalent:
(1) the family of random variables $(1_{E_i})_{i\in I}$ is independent (Definition 3.4.3);
(2) for any finite subset $F\subset I$, $P(\cap_{i\in F}E_i) = \prod_{i\in F}P(E_i)$.

Exercise 3.6.9. Let $\mathcal{C}_i$, $1\le i\le n$, be $n$ classes of events each of which is closed under finite intersections. Show that the following are equivalent:
(1) $P(\cap_{i=1}^n A_i) = \prod_{i=1}^n P(A_i)$ if, for each $i$, either $A_i\in\mathcal{C}_i$ or $A_i = \Omega$;
(2) $P(\cap_{i\in F}A_i) = \prod_{i\in F}P(A_i)$, $A_i\in\mathcal{C}_i$, for any $F\subset\{1,2,\dots,n\}$;
(3) for any $F\subset\{1,2,\dots,n\}$ and $F' = \{1,2,\dots,n\}\setminus F$, the $\sigma$-algebras $\sigma(\cup_{i\in F}\mathcal{C}_i)$ and $\sigma(\cup_{i\in F'}\mathcal{C}_i)$ are independent, where, for example, $\sigma(\cup_{i\in F}\mathcal{C}_i)$ is the smallest $\sigma$-algebra that contains all the sets in the classes $\mathcal{C}_i$, $i\in F$.

Remark. Condition (1) is the definition (see Definition 3.6.10) in Billingsley ([B1], p. 50) of $n$ independent classes of events (without the requirement that the classes be closed under finite intersections). Note that Billingsley's Theorem 4.2 ([B1], p. 50) is an immediate consequence of condition (2).

Definition 3.6.10. Let $(\mathcal{C}'_\lambda)_{\lambda\in\Lambda}$ be a family of classes of events $\mathcal{C}'_\lambda$. It is said to be an independent family of classes of events if $P(\cap_{\lambda\in F}A_\lambda) = \prod_{\lambda\in F}P(A_\lambda)$, $A_\lambda\in\mathcal{C}'_\lambda$, for any finite subset $F$ of $\Lambda$.

Exercise 3.6.11. Let $(\mathcal{C}'_\lambda)_{\lambda\in\Lambda}$ be a family of classes of events $\mathcal{C}'_\lambda$. Show that the following conditions are equivalent, where $\mathcal{C}_\lambda$ is the class of sets that are finite intersections of sets from $\mathcal{C}'_\lambda$:
(1) the family $(\mathcal{C}_\lambda)_{\lambda\in\Lambda}$ of classes $\mathcal{C}_\lambda$ is an independent family of classes;
(2) for each $\Gamma\subset\Lambda$, the $\sigma$-algebras $\sigma(\cup_{\lambda\in\Gamma}\mathcal{C}_\lambda)$ and $\sigma(\cup_{\lambda\in\Lambda\setminus\Gamma}\mathcal{C}_\lambda)$ are independent.
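Conditions like (3) of Exercise 3.6.7 and (2) of Exercise 3.6.9 involve only finitely many subsets $F$, so small examples can be tested exhaustively. A sketch, assuming a hypothetical eight-point sample space carrying the uniform probability and three events given as sets of sample points:

```python
from itertools import combinations

# Hypothetical example: Omega = {0, ..., 7} with the uniform probability.
omega = set(range(8))
events = [{0, 1, 2, 3}, {0, 1, 4, 5}, {0, 2, 4, 6}]
prob = lambda A: len(A) / len(omega)

def independent(events):
    """Check condition (3) of Exercise 3.6.7: P(inter_{i in F} E_i) equals
    prod_{i in F} P(E_i) for every subset F with at least two indices."""
    n = len(events)
    for r in range(2, n + 1):
        for F in combinations(range(n), r):
            inter = set.intersection(*(events[i] for i in F))
            rhs = 1.0
            for i in F:
                rhs *= prob(events[i])
            if abs(prob(inter) - rhs) > 1e-12:
                return False
    return True

print(independent(events))   # True: three independent "coordinate bit" events
```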
Remark. Note that Exercise 3.6.6 implies that $\sigma(\cup_{\lambda\in\Gamma}\mathcal{C}_\lambda) = \sigma(\cup_{\lambda\in\Gamma}\mathcal{C}'_\lambda)$ and $\sigma(\cup_{\lambda\in\Lambda\setminus\Gamma}\mathcal{C}_\lambda) = \sigma(\cup_{\lambda\in\Lambda\setminus\Gamma}\mathcal{C}'_\lambda)$. Hence, one might think that condition (2) is equivalent to (1) for the classes $\mathcal{C}'_\lambda$. This is not the case, as one may have the original family $(\mathcal{C}'_\lambda)_{\lambda\in\Lambda}$ of classes independent and the family $(\mathcal{C}_\lambda)_{\lambda\in\Lambda}$ derived from it by taking finite intersections of the sets in each class not independent. A counterexample is easily made: Feller [F1], p. 127 contains an example of three events $A_1, A_2, A_3$ any two of which are independent but for which $P(A_1\cap A_2\cap A_3)\neq P(A_1)P(A_2)P(A_3)$; take $\mathcal{C}'_1 = \{A_1,A_2\}$ and $\mathcal{C}'_2 = \{A_3\}$; then these two classes are independent, but the classes $\mathcal{C}_1 = \{A_1\cap A_2, A_1, A_2\}$ and $\mathcal{C}_2 = \mathcal{C}'_2 = \{A_3\}$ are not.

Exercise 3.6.12. Let $(E_i)_{i\in I}$ be a family of independent events $E_i$. Let $I = \cup_{\lambda\in\Lambda}J_\lambda$, where the sets $J_\lambda$ are pairwise disjoint (e.g., if $I = \mathbb{N}$ and $\Lambda = \{1,2\}$, one could have $J_1 = \{1,3,\dots,2n+1,\dots\}$, the set of odd natural numbers, and $J_2 = \{2,4,\dots,2n,\dots\}$, the set of even natural numbers). Define $\mathcal{C}'_\lambda$ to be the family $(E_i)_{i\in J_\lambda}$ of events corresponding to the indices $i$ in $J_\lambda$. Let $\mathfrak{F}_\lambda$ be the $\sigma$-algebra $\sigma(\mathcal{C}'_\lambda) = \sigma(1_{E_i}\mid i\in J_\lambda)$ (by Exercise 3.6.6 this is the smallest $\sigma$-algebra that contains all the sets $E_i$, $i\in J_\lambda$). Show that the family $(\mathfrak{F}_\lambda)_{\lambda\in\Lambda}$ is an independent family of collections of sets, i.e., an independent family of $\sigma$-algebras. [Hint: verify the independence of the family of classes $(\mathcal{C}_\lambda)_{\lambda\in\Lambda}$, where $\mathcal{C}_\lambda$ is derived from $\mathcal{C}'_\lambda$ by taking finite intersections of the sets in $\mathcal{C}'_\lambda$.]

Exercise 3.6.13. Let $X, Y_1$, and $Y_2$ be three independent, real-valued random variables on a probability space $(\Omega,\mathfrak{F},P)$. Assume that $Y_1$ and $Y_2$ have the same distribution. Show that
(1) for all $a\in\mathbb{R}$, $P[X+Y_1\le a] = P[X+Y_2\le a]$.
Assume that all three random variables are strictly positive and that the distribution of $X$ is the same as that of $\sqrt{Y_1+X}$. Show that
(2) if $a > 0$, $P[X < a] = P[Y_3 < a]$, where $Y_3 = \sqrt{Y_1+\sqrt{Y_2+X}}$.
[Hint: use transformations of $x$, $y_1$, and $y_2$.]

The next series of exercises is devoted to proving the monotone class theorem for functions. The notations used in the statement of the theorem will be used in these exercises without additional comment.

Theorem 3.6.14. (Monotone class theorem for functions). Let $\mathcal{H}$ be a vector space of bounded functions on a set $\Omega$. Assume that the constant function 1 belongs to $\mathcal{H}$ and that if $(f_n)\subset\mathcal{H}$ is a sequence of functions that converges uniformly to a function $f$, then $f\in\mathcal{H}$ (i.e., $\mathcal{H}$ is uniformly closed). In addition, assume that $\mathcal{H}$ satisfies the following monotone condition: if $(f_n)$ is a sequence of functions in $\mathcal{H}$, and $M$ is a constant such that $f_n\le f_{n+1}\le M$ for all $n$, then $\lim_n f_n\in\mathcal{H}$. Let $\mathcal{C}\subset\mathcal{H}$ be closed under multiplication of functions. Then $\mathcal{H}$ contains every bounded function that is measurable relative to the $\sigma$-field $\sigma(\mathcal{C}) = \sigma(\{f\mid f\in\mathcal{C}\})$.

To begin with, the notions of uniform convergence and uniform closure are defined.

Definition 3.6.15. A sequence $(f_n)_{n\ge 1}$ of real-valued functions $f_n$ on a set $\Omega$ converges uniformly to a real-valued function $f$ if, for any $\varepsilon > 0$, there is an integer $N = N(\varepsilon)$ such that $|f_n(\omega)-f(\omega)| < \varepsilon$ for all $\omega\in\Omega$ if $n\ge N$. This will be denoted by writing $f_n\rightrightarrows f$. A set $S$ of real-valued functions is said to be uniformly closed if $f\in S$ whenever there is a sequence of functions in $S$ that converges uniformly to $f$. The uniform closure of a set $S$ of functions is the intersection of all the uniformly closed sets of functions that contain $S$. It is the smallest uniformly closed set containing $S$.

Exercise 3.6.16. Let $\mathcal{B}$ denote the uniform closure of $\mathcal{L}$, the linear subspace of the vector space $\mathcal{F}_b(\Omega)$ of all bounded functions on $\Omega$ that is generated by $\mathcal{C}$ and the constant function 1 (hence all constant functions): note that $h\in\mathcal{L}$ if and only if $h = (\lambda_1 f_1-\lambda_2 f_2)+\mu$, where $f_i\in\mathcal{C}$ and $\lambda_i,\mu\in\mathbb{R}$. Show that
(1) $\mathcal{B}$ is a linear subspace of $\mathcal{F}_b(\Omega)$ and is closed under multiplication (this amounts to saying that $\mathcal{B}$ is an algebra, and since it is also a Banach space, it is a so-called Banach algebra).
Make use of the Weierstrass approximation theorem (Theorem 4.3.15) to show that
(2) if $f\in\mathcal{B}$, then $|f|\in\mathcal{B}$ [Hint: show that for any $M > 0$, the function $x\to|x|$ can be uniformly approximated on $[-M,M]$ by a polynomial.],
(3) conclude that if $f,g\in\mathcal{B}$, then $f\wedge g$ and $f\vee g$ are in $\mathcal{B}$. [Hint: for two real numbers $a,b$, recall that $\max\{a,b\} = a\vee b = \frac{1}{2}\{a+b+|a-b|\}$.]

Exercise 3.6.17. A set $\mathcal{M}$ of bounded functions will be called a monotone class of functions if it satisfies the following monotone conditions for sequences $(f_n)$ of functions $f_n$ in $\mathcal{M}$:
(1) if there is a constant $M$ such that $f_n\le f_{n+1}\le M$ for all $n$, then $\lim_n f_n\in\mathcal{M}$, and
(2) if there is a constant $m$ such that $m\le f_{n+1}\le f_n$ for all $n$, then $\lim_n f_n\in\mathcal{M}$.
If, in addition, $\mathcal{M}$ is uniformly closed, then it will be called a closed monotone class. Show that
(3) given any collection of bounded functions, there is a smallest closed monotone class that contains it.

Let $\mathcal{M}_0$ be the smallest closed monotone class that contains $\mathcal{B}$. Show that
(4) if $f\in\mathcal{B}$ and $g\in\mathcal{M}_0$, then $f+g\in\mathcal{M}_0$ [Hint: imitate the first hint for Exercise 1.4.16 by considering $\{g\in\mathcal{M}_0\mid f+g\in\mathcal{M}_0\}$],
(5) $\mathcal{M}_0$ is closed under addition (i.e., if $g_1,g_2\in\mathcal{M}_0$, then $g_1+g_2\in\mathcal{M}_0$) [Hint: imitate the second hint for Exercise 1.4.16.] and scalar multiplication,
(6) if $g_1,g_2\in\mathcal{M}_0$, then $g_1\vee g_2\in\mathcal{M}_0$ [Hint: copy the proof for addition],
(7) if $g_1,g_2\in\mathcal{M}_0$, then $g_1\wedge g_2\in\mathcal{M}_0$,
(8) if $g_1,g_2\in\mathcal{M}_0$ and are non-negative, then $g_1g_2\in\mathcal{M}_0$,
(9) $\mathcal{M}_0$ is closed under multiplication and so is a Banach algebra. [Hint: every function in $\mathcal{M}_0$ is a difference of non-negative functions in $\mathcal{M}_0$.]

Exercise 3.6.18. Let $\varphi$ be a non-negative function in $\mathcal{M}_0$. Show that
(1) $1_A\in\mathcal{M}_0$, where $A = \{\varphi > 0\}$ [Hint: the function $(n\varphi)\wedge 1\in\mathcal{M}_0$; take a limit.], and
(2) $\mathfrak{F} := \{A\mid 1_A\in\mathcal{M}_0\}$ is a $\sigma$-field.

With the aid of these exercises, one may now prove the monotone class theorem for functions.

Proof of Theorem 3.6.14. The monotone hypothesis satisfied by $\mathcal{H}$ ensures that $\mathcal{H}\supset\mathcal{M}_0$. Furthermore, if $f\in\mathcal{C}$ and $\lambda\in\mathbb{R}$, then $\{f < \lambda\} = \{\varphi > 0\}$, where $\varphi = \lambda - f\wedge\lambda\in\mathcal{M}_0$. It follows from Exercise 3.6.18 that $A = \{f < \lambda\}\in\mathfrak{F}$. The following exercise shows that $\mathfrak{F}\supset\sigma(\mathcal{C})$.

Exercise 3.6.19. The $\sigma$-field $\sigma(\mathcal{C}) = \sigma(\{f\mid f\in\mathcal{C}\})$ is the smallest $\sigma$-field that contains all the sets $\{f < \lambda\}$, $f\in\mathcal{C}$, $\lambda\in\mathbb{R}$.

To conclude the proof, it suffices to observe that if a monotone class of (bounded) functions is also a vector space and contains all the functions $1_A$, $A\in\mathfrak{F}$, then it contains all the bounded $\mathfrak{F}$-measurable functions since it necessarily contains every $\mathfrak{F}$-simple function. □

Exercise 3.6.20. Let $\mathfrak{F}(2)$ denote the $\sigma$-algebra of Lebesgue measurable subsets of $\mathbb{R}^2$. Show that
(1) if $\mathfrak{F}(2) = \mathfrak{G}\times\mathfrak{G}$, where $\mathfrak{G}$ is a $\sigma$-algebra of subsets of $\mathbb{R}$, then $\mathfrak{G} = \mathfrak{F}(1)$ [Hint: use Fubini's theorem to study the slices of a set in $\mathfrak{F}(2)$], and
(2) $\mathfrak{F}(2)\neq\mathfrak{F}(1)\times\mathfrak{F}(1)$.

Exercise 3.6.21. (Integration by parts) Let $\mu$ and $\nu$ be two probabilities on $\mathfrak{B}(\mathbb{R})$ with distribution functions $F$ and $G$. Let $\mu(dx)$ and $\nu(dx)$ also be denoted by $dF(x)$ and $dG(x)$. Let $B = (a,b]\times(a,b]$, and set $B^+ = \{(x,y)\in B\mid x < y\}$, $B^- = \{(x,y)\in B\mid x\ge y\}$.
(1) Use Fubini's theorem to express $(\mu\times\nu)(B^-)$ as two distinct integrals.
(2) Let $F(x-) = \lim_{y\uparrow x}F(y)$. Show that $\mu((a,c)) = F(c-)-F(a)$.
(3) Use (1) and (2) to show that
$$(\mu\times\nu)(B) = \{F(b)-F(a)\}\{G(b)-G(a)\} = \int_{(a,b]}\{F(u-)-F(a)\}\,dG(u) + \int_{(a,b]}\{G(u)-G(a)\}\,dF(u).$$
(4) Deduce that
$$F(b)G(b)-F(a)G(a) = \int_{(a,b]}F(u-)\,dG(u) + \int_{(a,b]}G(u)\,dF(u).$$
These results apply not only to probabilities but also to any two $\sigma$-finite measures on $\mathbb{R}$.
Now assume that $F$ is a distribution function with $F(0-) = 0$. Show that
(5) $\int_0^n\{1-F(x)\}\,dx = n\{1-F(n)\} + \int_{[0,n)}x\,dF(x)$; and if $\int x\,dF(x) < +\infty$, then
(6) $\int_0^{+\infty}\{1-F(x)\}\,dx = \int x\,dF(x)$. [Hint: $nP[X > n]\le\int_{\{X>n\}}X\,dP$.]
Finally, use integration by parts to show that
(7) if $X\in L^p(\Omega,\mathfrak{F},P)$, then $E[|X|^p] = \int_0^{+\infty}px^{p-1}P(|X| > x)\,dx$.

Remark. This exercise on integration by parts is based on an article by E. Hewitt (American Math. Monthly 67 (1960), 419-422). For another more direct proof of (6) and (7) see Exercise 4.7.1.
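Parts (6) and (7) lend themselves to a numerical sanity check (an illustration, not a proof). The sketch below uses the exponential distribution with rate 1, for which $1-F(x) = e^{-x}$, $E[X] = 1$, and $E[X^p] = \Gamma(p+1)$, and compares both sides by a trapezoidal quadrature on a truncated half-line:

```python
import numpy as np

# Exponential distribution with rate 1: P(X > x) = exp(-x), E[X] = 1,
# and E[X^p] = Gamma(p + 1), so E[X^3] = 6.
x = np.linspace(0.0, 50.0, 500_001)   # [0, 50] stands in for [0, +infinity)
dx = x[1] - x[0]
tail = np.exp(-x)                     # 1 - F(x) = P(X > x)

def integrate(f):                     # simple trapezoidal rule
    return (0.5 * (f[1:] + f[:-1])).sum() * dx

print(integrate(tail))                     # (6): ~1.0 = E[X]
p = 3
print(integrate(p * x**(p - 1) * tail))    # (7): ~6.0 = E[X^3]
```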
CHAPTER IV

CONVERGENCE OF RANDOM VARIABLES AND MEASURABLE FUNCTIONS

1. Norms for random variables and measurable functions

In what follows, the results will be stated and proved for probability spaces. Their extension to general $\sigma$-finite measure spaces is to be taken for granted unless commented upon. The probabilist's notation $E[X]$ for the integral $\int X\,dP$ or $\int X\,d\mu$ will be used frequently.

Let $(\Omega,\mathfrak{F},P)$ be a probability space and $L$ be the vector space of finite random variables on $(\Omega,\mathfrak{F},P)$. Depending on the context, $L$ will also denote the set of finite measurable functions on a $\sigma$-finite measure space $(\Omega,\mathfrak{F},\mu)$.

Proposition 4.1.1. The set $L^1$ of integrable random variables is a linear subspace of $L$. If $X\in L^1$, set $\|X\|_1 = E[|X|]$; the function $X\to\|X\|_1$ is a norm on $L^1$, i.e.,
(1) $\|\lambda X\|_1 = |\lambda|\,\|X\|_1$ if $\lambda\in\mathbb{R}$, $X\in L^1$;
(2) $\|X+Y\|_1\le\|X\|_1+\|Y\|_1$ if $X,Y\in L^1$;
(3) $\|X\|_1 = 0$ implies $X$ is a null function, i.e., a.s. equal to 0, and hence will be viewed as the same as the zero function.

Proof. The first statement repeats Proposition 2.1.33(1). Statements (1) and (2) follow from the properties of the expectation $E[X]$ and the facts that $|\lambda X| = |\lambda|\,|X|$ and $|X+Y|\le|X|+|Y|$. In order that $\|\cdot\|_1$ be a norm, it must have the property that $\|X\|_1 = 0$ implies $X = 0$ (see Definition 2.5.3). Now $E[|X|] = 0$ implies $P(\{|X| > 0\}) = 0$ (see Exercise 2.1.28(4)), and so $\|X\|_1 = 0$ implies that $P$-almost surely, $X = 0$. □

Remark 4.1.2. In order to be entirely logical, one should not distinguish between integrable random variables that differ only on a set of measure zero. Then $L^1$ would be taken to be the vector space whose points are equivalence classes of integrable functions, and $\|\cdot\|_1$ would then be a norm on this vector space. However, it is common mathematical practice to ignore the equivalence classes and to use the actual functions themselves as the elements of $L^1$. Then one considers $\|\cdot\|_1$ as a norm with the reservation that $\|X\|_1 = 0$ does not imply $X(\omega) = 0$ for all $\omega$ but only for $P$-almost all $\omega$. Such a function is a null function. By "abus de langage", $\|\cdot\|_1$ will be called a norm on $L^1$.

From the statistical point of view, the actual random variable $X$ is far less important than its distribution. This raises several questions for random
variables: (1) is it possible to determine when $X$ is in $L^1$ in terms of its distribution $Q$? (2) if so, can $\|X\|_1$ be computed from $Q$? (3) given $X,Y\in L^1$ with distributions $Q$ and $R$, is it possible to compute the distribution of $X+Y$ from $Q$ and $R$?

Remark. The concept of "distribution" is not used this way in general measure theory, although it makes sense. It is in effect what is sometimes called the image measure of the underlying measure under the measurable function, i.e., given a measurable function $f$ on a measure space $(\Omega,\mathfrak{F},\mu)$, the function $B\to\mu(f^{-1}(B))$ is a Borel measure that may be called the image measure or "push forward" of $\mu$ by $f$ (see the remark following Proposition 2.1.19). There is a connection between the probabilist's concept of distribution function and what analysts call the "distribution function" of a non-negative measurable function (see Exercise 4.7.3). When the measure is finite, then, up to a scaling constant, the image measure is the distribution of the measurable function relative to the normalized measure. In Wheeden and Zygmund [W1], the "distribution function" is defined for finite measure spaces $(\Omega,\mathfrak{F},\mu)$ as $\mu(\{f > \lambda\})$. In the case of a probability, this function is $1-F$, where $F$ is the distribution function of the random variable $f$.

The first two questions are answered by making use of the following lemma, which extends Lemma 3.5.3 to integrable functions.

Lemma 4.1.3. Let $X$ be a random variable on $(\Omega,\mathfrak{F},P)$ with distribution $Q$. Let $\varphi:\mathbb{R}\to\mathbb{R}$ be either a non-negative Borel function or such that $\varphi\circ X$ is integrable. Then
$$E[\varphi\circ X] = \int\varphi(x)\,Q(dx).$$
In addition, if $\varphi:\mathbb{R}\to\mathbb{R}$ is a Borel function, then $\varphi\circ X\in L^1(\Omega,\mathfrak{F},P)$ if and only if $\varphi\in L^1(\mathbb{R},\mathfrak{B}(\mathbb{R}),Q)$. Further, if $(\Omega,\mathfrak{F},P)$ is a probability space and $X:\Omega\to\mathbb{R}^n$ is a random vector, then for any Borel function $\varphi$ on $\mathbb{R}^n$ that is either non-negative or such that $\varphi\circ X\in L^1(\Omega,\mathfrak{F},P)$, it follows that
$$E[\varphi\circ X] = \int\varphi(x)\,Q(dx),$$
where $Q$ is the distribution (Definition 3.1.11) of $X$.

Proof. Lemma 3.5.3 proves the formula for non-negative Borel functions. Since $\varphi = \varphi^+-\varphi^-$, $\varphi\in L^1(\mathbb{R},\mathfrak{B}(\mathbb{R}),Q) = L^1(Q)$ if and only if $\varphi^+,\varphi^-\in L^1(Q)$. In this case $\int\varphi\,dQ = \int\varphi^+dQ-\int\varphi^-dQ$. Since $(\varphi\circ X)^\pm = \varphi^\pm\circ X$, the lemma follows from Lemma 3.5.3. The extension of the result to random vectors has exactly the same proof as in the random variable case. □
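Lemma 4.1.3 is what justifies computing an expectation either by averaging over samples of $X$ or by integrating $\varphi$ against the distribution $Q$. A minimal numerical illustration, with the hypothetical choice of $X$ uniform on $[0,1]$ (so that $Q(dx) = dx$ on $[0,1]$) and $\varphi(x) = x^2$:

```python
import numpy as np

rng = np.random.default_rng(1)

# X uniform on [0, 1] and phi(x) = x**2.  Lemma 4.1.3:
# E[phi o X] = integral of phi(x) Q(dx) = integral_0^1 x^2 dx = 1/3.
X = rng.random(1_000_000)
print(np.mean(X ** 2))   # sample average of phi(X): ~0.3333
print(1 / 3)             # the integral of phi against the distribution Q
```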
Remarks. (1) $E[\varphi(\lambda X)] = \int\varphi(\lambda x)Q(dx)$: let $\psi(x) = \varphi(\lambda x)$.
(2) The above result is essentially formal in the sense that if $X:\Omega\to E$ is measurable, with $(E,\mathfrak{E})$ a measurable space, then it extends immediately to measurable functions $\varphi:E\to\mathbb{R}$. The distribution $Q$ of $X$ is again defined on $\mathfrak{E}$ by $Q(B) = P(X^{-1}(B))$, and one has as before that $E[\varphi\circ X] = \int\varphi(x)Q(dx)$ whenever $\varphi$ is non-negative or $\varphi\circ X\in L^1(\Omega,\mathfrak{F},P)$.
(3) For example, the image (see the above remarks) of Lebesgue measure $dx$ on $(0,+\infty)$ under the map $x\to x^{1/p} = y$ ($1\le p < \infty$) is the measure $\nu(dy) = py^{p-1}dy$ (i.e., $|\{x\mid x^{1/p}\in A\}| = \int_A py^{p-1}dy$): this holds for all Borel sets $A\subset(0,+\infty)$ because $0 < a < x^{1/p} < b$ if and only if $a^p < x < b^p$, and $\int_a^b py^{p-1}dy = b^p-a^p$. A similar "change-of-variable" result holds whenever $y = \psi(x)$ and $\psi$ is a strictly increasing, continuously differentiable function.

Corollary 4.1.4. A random variable $X$ with distribution $Q$ is in $L^1$ if and only if $\int|x|\,Q(dx) < +\infty$, i.e., if and only if $|x|\in L^1(\mathbb{R},\mathfrak{B}(\mathbb{R}),Q)$. Also, $\|X\|_1 = \int|x|\,Q(dx)$ and $E[X] = \int x\,Q(dx)$.

This corollary answers questions (1) and (2) posed earlier. The answer to the third question is affirmative, provided $X$ and $Y$ are independent. The resulting distribution is the convolution of $Q$ and $R$ (see Exercise 3.3.21).

Example 4.1.5. Let $X$ be a random variable whose distribution $Q$ has a density $f$. If $f(x) = 1_{(-\infty,-1]\cup[1,+\infty)}(x)\,\frac{1}{2x^2}$, then $X$ is not in $L^1$ and hence has no expectation $E[X]$ (see Feller [F1] for comments on random variables without expectations).

If one thinks of a random variable $X$ as a vector with as many coordinates as there are points in $\Omega$, then the norm $\|X\|_1$ is the analogue of the norm $|x| = |x_1|+|x_2|+|x_3|$ for the vector $x = (x_1,x_2,x_3)\in\mathbb{R}^3$. The usual Euclidean norm $\|x\| = (x_1^2+x_2^2+x_3^2)^{1/2}$ by analogy suggests consideration of $(\int X^2(\omega)P(d\omega))^{1/2} = E[X^2]^{1/2}$. Note that this is finite if and only if $\int x^2Q(dx) < \infty$.

Proposition 4.1.6. Let $L^2$ be the set of $X\in L$ such that $E[X^2] < \infty$. Then $L^2$ is a linear subspace of $L$, and the function $X\to\|X\|_2 = E[X^2]^{1/2}$ is a norm (in the sense discussed in Remark 4.1.2) on $L^2$. For $X,Y\in L^2$, $XY\in L^1$ and $|E[XY]|\le\|X\|_2\|Y\|_2$ (the inequality of Cauchy-Schwarz).

Proof. Since $2ab\le a^2+b^2$, if $X,Y\in L^2$, then $XY\in L^1$. Consequently, $X,Y\in L^2$ implies that $X+Y\in L^2$. Let $\lambda\in\mathbb{R}$. Clearly, $X\in L^2$ implies $\lambda X\in L^2$, and so $L^2$ is a linear subspace of $L$. Also, if $X,Y\in L^2$, the function $\lambda\to\|X+\lambda Y\|_2^2 = \|X\|_2^2+2\lambda E[XY]+\lambda^2\|Y\|_2^2$ is non-negative, and so its discriminant $b^2-4ac = 4E[XY]^2-4\|X\|_2^2\|Y\|_2^2$ is non-positive; in other words, $|E[XY]|\le\|X\|_2\|Y\|_2$. Since $\|X+Y\|_2^2\le(\|X\|_2+\|Y\|_2)^2$ by the Cauchy-Schwarz inequality, it follows that $X\to\|X\|_2$ is a norm, with the proviso as before that $\|X\|_2 = 0$ implies $X = 0$, $P$-a.s. □

Definition 4.1.7. Let $X$ be a random variable with distribution $Q$. Let $1\le p < \infty$. If $\int|x|^pQ(dx) < \infty$, it is said to have a $p$th absolute moment (about zero). When this is the case, $\int x^pQ(dx)$ exists and is called its $p$th moment (about zero).

The next result is true only for finite measures.

Proposition 4.1.8. Let $X\in L^2$. Then $X\in L^1$.

Proof. $|X| = |X|1_{\{|X|\le 1\}}+|X|1_{\{|X|>1\}}\le 1+|X|^21_{\{|X|>1\}}$. Since $1\in L^1$, it follows that $X\in L^1$. □

Important remark. Proposition 4.1.8 is false for measure spaces that are merely $\sigma$-finite. Note that $1\in L^1(\mu)$ if and only if $\mu$ is a bounded measure, i.e., $\mu(\Omega) < \infty$. Lebesgue measure on $\mathbb{R}$ gives standard counterexamples, as the next exercise shows.

Exercise 4.1.9. Let $f(x) = \frac{1}{x^r}$, $1\le x < +\infty$, $r\in\mathbb{R}$, and $f(x) = 0$ otherwise. Show that $f\in L^1(\mathbb{R})$ if and only if $r > 1$. Determine a function $f\in L^2(\mathbb{R})$ with $f\notin L^1(\mathbb{R})$.

In the case of probability spaces, the $L^2$-norm is used to define two well-known statistical quantities.

Definition 4.1.10. Let $X\in L^2$. The second moment of $X-E[X]$ is called the variance of $X$ and is denoted by $\sigma^2(X) = \sigma^2$. The standard deviation of $X$ is $\sigma(X) = \sqrt{\sigma^2} = \sigma$.

Remark. The variance of a random variable is a measure of the concentration or "spread" of its distribution around the mean. For example, in the case of a normal random variable with distribution $n_t(x)dx$ (Exercise 2.9.6), as the variance $t = \sigma^2$ tends to zero, the density function concentrates as a "spike" around zero. The larger the variance, the more spread out is the distribution. In the extreme case when the variance of a random variable is zero, this indicates that it is a.s. equal to its mean value and so there is zero "spread".

Exercise 4.1.11. Let $X\in L^2$. Show that
(1) $\sigma^2(X) = E[X^2]-E[X]^2 = \|X\|_2^2-m^2$, where $m = E[X]$ is the mean or expectation of $X$,
(2) $\sigma^2(X-a) = \sigma^2(X)$ for any $a\in\mathbb{R}$, and
(3) if $X$ has a binomial distribution $b(n,p)$, $E[X] = np$ and $\sigma^2(X) = np(1-p) = npq$.
[Hints: the binomial distribution $b(n,p)$ is the distribution of the number $S$ of "successes" or "heads" obtained when tossing a coin $n$ times with the probability of success equal to $p$. It follows from Exercise 3.3.21 or directly, since $\binom{n}{k}$ is the number of ways to choose $k$ objects from among $n$, that $P[S = k] = \binom{n}{k}p^k(1-p)^{n-k}$. The law or distribution of $S$ is
$$Q = \sum_{k=0}^{n}\binom{n}{k}p^k(1-p)^{n-k}\,\delta_k.$$
Note that all the mass is concentrated on the set $\{0,1,2,\dots,k,\dots,n\}$. Consequently,
$$E[f(S)] = \int f\,dQ = \sum_{k=0}^{n}\binom{n}{k}p^k(1-p)^{n-k}f(k),$$
for any $Q$-integrable function $f$. Scaling $S$ by a positive constant $\lambda > 0$, the distribution of $\lambda S$ is
$$R = \sum_{k=0}^{n}\binom{n}{k}p^k(1-p)^{n-k}\,\delta_{k\lambda}.$$
The mass is now concentrated on the set $\{0,\lambda,2\lambda,\dots,k\lambda,\dots,n\lambda\}$, and
$$E[f(\lambda S)] = \int f\,dR = \sum_{k=0}^{n}\binom{n}{k}p^k(1-p)^{n-k}f(k\lambda) = \int f(\lambda x)\,Q(dx),$$
for any $Q$-integrable function $f$ (see the remarks following Lemma 4.1.3).]

Definition 4.1.12. Let $1\le p < \infty$. Define $L^p$ to be the set of finite random variables $X$ or measurable functions $X$ for which $E[|X|^p] < +\infty$.

Remark. In the case of a random variable $X$, it follows from Lemma 4.1.3 that $X\in L^p$ if and only if $X$ possesses a $p$th absolute moment.

Remark. Since $(a+b)^p\le 2^p(a^p+b^p)$ if $p > 0$, $a,b\ge 0$, it is clear that $L^p$ is a linear subspace of $L$ for any $p > 0$.

Exercise 4.1.13. If $x = (x_1,x_2)\in\mathbb{R}^2$, let $\|x\|_p := (|x_1|^p+|x_2|^p)^{1/p}$. This exercise shows the dependence of $\|x\|_p$ on $p$. For $p = 1,2,3,4$, sketch the set of points $x$ in the plane for which $(|x_1|^p+|x_2|^p)^{1/p} = 1$. What is the limiting position of this set as $p\to+\infty$?

If $X\in L^p$, one may define the analogue of the norm $\|x\|_p$, $x\in\mathbb{R}^n$, by setting $\|X\|_p = (\int|X|^pdP)^{1/p} = (E[|X|^p])^{1/p}$. It turns out that $X\to(E[|X|^p])^{1/p}$ is indeed a norm; in order to show this, it will be necessary to discuss the analogue for $L^p$ of the Cauchy-Schwarz inequality for $L^2$. This is called Hölder's inequality. To begin, if $1 < p < +\infty$, let $q\in(1,+\infty)$ be the unique number such that $\frac{1}{p}+\frac{1}{q} = 1$. Note that $q = \frac{p}{p-1}$. Then $p$ and $q$ are said to be conjugate indices.
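In connection with Exercise 4.1.13, the dependence of $\|x\|_p$ on $p$ is easy to tabulate. The computation below, for one hypothetical vector, suggests the answer to the limiting-position question, which is confirmed just before Definition 4.1.20:

```python
import numpy as np

x = np.array([3.0, -4.0])    # a hypothetical vector in R^2

for p in (1, 2, 3, 4, 10, 100):
    norm_p = (np.abs(x) ** p).sum() ** (1.0 / p)
    print(p, norm_p)
# 1 -> 7.0, 2 -> 5.0, ..., 100 -> 4.000...: the values decrease toward
# max(|x_1|, |x_2|) = 4, the sup-norm of Definition 4.1.20.
```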
Proposition 4.1.14. Let $p$ and $q$ be conjugate indices and $a,b\ge 0$. Then
$$G(a,b) = a^{1/p}b^{1/q} \le \frac{1}{p}a+\frac{1}{q}b = A(a,b).$$
In other words, the geometric mean is less than or equal to the arithmetic mean.

Proof. The inequality is obvious if either $a$ or $b$ equals 0. Assume $a$ and $b$ are both positive. If $t = \frac{a}{b}$ and $\lambda = \frac{1}{p}$, the inequality is equivalent to $\lambda t+(1-\lambda)\ge t^\lambda$, i.e., $\phi(t) = \lambda t-t^\lambda+(1-\lambda)\ge 0$. Since, for any $\lambda\in(0,1)$, $\phi'(1) = 0$ and $\phi''(t) > 0$ for all $t > 0$, it follows that $\phi(1) = 0$ is the minimum value of $\phi(t)$ on $(0,+\infty)$. □

Remarks. (1) For a proof using a Lagrange multiplier, see Exercise 4.7.4.
(2) $G(a,b)\le A(a,b)$ is equivalent to a special case of Young's inequality: $cd\le\frac{c^p}{p}+\frac{d^q}{q}$ if $c,d\ge 0$ (see Exercise 4.7.5).

Proposition 4.1.15. (Hölder's inequality). Let $X\in L^p$, $Y\in L^q$, where $p$ and $q\in(1,\infty)$ are conjugate indices. Then $XY\in L^1$ and
$$E[|XY|]\le\|X\|_p\|Y\|_q.$$

Proof. Let $U = |X|^p$, $V = |Y|^q$. Then both functions belong to $L^1$. Hence, by Proposition 4.1.14,
$$\Big[\frac{U}{E[U]}\Big]^{1/p}\Big[\frac{V}{E[V]}\Big]^{1/q} \le \Big(\frac{1}{p}\Big)\frac{U}{E[U]}+\Big(\frac{1}{q}\Big)\frac{V}{E[V]}.$$
Integrate both sides. The right-hand side has integral 1 as $\frac{1}{p}+\frac{1}{q} = 1$. Therefore,
$$\int\Big[\frac{U(\omega)}{E[U]}\Big]^{1/p}\Big[\frac{V(\omega)}{E[V]}\Big]^{1/q}P(d\omega)\le 1.$$
Hence, $\int|X(\omega)Y(\omega)|P(d\omega)\le\|X\|_p\|Y\|_q$ as $E[U]^{1/p} = \|X\|_p$ and $E[V]^{1/q} = \|Y\|_q$. Also, $XY\in L^1$. □

Hölder's inequality implies the following inequality, due to Minkowski, which in effect shows that $X\to\|X\|_p$ is a norm on $L^p$.

Proposition 4.1.16. (Minkowski's inequality) Let $X,Y\in L^p$. Then $X+Y\in L^p$ and
$$\|X+Y\|_p\le\|X\|_p+\|Y\|_p.$$

Proof. It has been pointed out (following Definition 4.1.12) that $L^p$ is a linear subspace of $L$. Now,
$$E[|X+Y|^p] = \int|X+Y|^pdP \le \int|X|\,|X+Y|^{p-1}dP+\int|Y|\,|X+Y|^{p-1}dP.$$
As $Z\in L^p$ implies $Z^{p-1}\in L^q$ if $q$ and $p$ are conjugate, since $\|Z^{p-1}\|_q = (E[|Z|^p])^{1/q} = \|Z\|_p^{p/q}$ as $p+q = pq$, it follows by Hölder's inequality that
$$E[|X+Y|^p] \le \|X\|_p\,\|\,|X+Y|^{p-1}\|_q+\|Y\|_p\,\|\,|X+Y|^{p-1}\|_q.$$
Therefore,
$$\|X+Y\|_p^p \le \big(\|X+Y\|_p^{p/q}\big)\big[\|X\|_p+\|Y\|_p\big] < +\infty,$$
and so $\|X+Y\|_p^{p(1-1/q)}\le\|X\|_p+\|Y\|_p$. Since $p(1-\frac{1}{q}) = 1$, this completes the proof. □

Exercise 4.1.17. Let $1\le p_1 < p_2 < \infty$. Show that
(1) for a finite measure space $(\Omega,\mathfrak{F},\mu)$, $L^{p_2}\subset L^{p_1}$ [Hint: make use of the argument used to prove Proposition 4.1.8],
(2) this statement is false for $(\mathbb{R},\mathfrak{B}(\mathbb{R}),dx)$. [Hint: see Exercise 4.1.9.]
Let $(\mathbb{R},\mathfrak{B}(\mathbb{R}),P)$ be the probability space corresponding to the uniform distribution on $[0,1]$ (see Example 1.3.9(1)). Since the probability has no mass outside $[0,1]$, one might as well take $\Omega$ to be $[0,1]$. The resulting measure space corresponds to Lebesgue measure on $[0,1]$, i.e., $([0,1],\mathfrak{B}([0,1]),dx)$. Let $f(x) = \frac{1}{x^r}$, $0 < x\le 1$ (there is no need to define $f$ at zero; why?). Show that
(3) $f\in L^1([0,1])$ if and only if $r < 1$,
(4) $L^{p_2}([0,1])\subsetneq L^{p_1}([0,1])$ for $1\le p_1 < p_2$.
Consider the spaces $\ell^p$: for $1\le p < \infty$, a sequence of real numbers $a = (a_n)_{n\ge 1}$ is said to be in $\ell^p$ if $\sum_{n=1}^{\infty}|a_n|^p < +\infty$, and $\|a\|_p := (\sum_{n=1}^{\infty}|a_n|^p)^{1/p}$; see Exercises 2.5.4 and 2.5.5 for $p = 1,2$. If $p = \infty$, then $\ell^{\infty}$ is the set of bounded sequences and $\|a\|_{\infty} := \sup_n|a_n|$. Show that
(5) $1\le p_1 < p_2\le\infty$ implies that $\ell^{p_1}\subset\ell^{p_2}$.
Finally, if $1\le p_1 < p_2 < \infty$, for Lebesgue measure on $\mathbb{R}$, show that
(6) $L^{p_1}(\mathbb{R})\not\subset L^{p_2}(\mathbb{R})$ and $L^{p_2}(\mathbb{R})\not\subset L^{p_1}(\mathbb{R})$.

Exercise 4.1.18. Use Hölder's inequality to show that, on a probability space,
(1) if $1 < p < +\infty$ and $X\in L^p$, then $X\in L^1$ and $\|X\|_1\le\|X\|_p$ [Hint: $1\in L^q$.],
(2) if $1 < p_1 < p_2 < \infty$ and $X\in L^{p_2}$, then $X\in L^{p_1}$ and $\|X\|_{p_1}\le\|X\|_{p_2}$. [Hint: use (1).]
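The monotonicity in Exercise 4.1.18(2) can be watched numerically. In the sketch below, the average over a large sample plays the role of the expectation; since the empirical measure of a sample is itself a probability, the inequality holds exactly for the sample averages. The lognormal sample is a hypothetical choice with all moments finite:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.lognormal(size=100_000)   # hypothetical random variable

# ||X||_p computed against the empirical (hence probability) measure:
norms = [np.mean(np.abs(X) ** p) ** (1.0 / p) for p in (1, 1.5, 2, 3, 4)]
print(norms)   # non-decreasing in p, as Exercise 4.1.18(2) predicts
assert all(a <= b + 1e-12 for a, b in zip(norms, norms[1:]))
```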
Remark. What modification is needed to Exercise 4.1.18 in order that it be valid for a bounded measure?

The first result (1) in Exercise 4.1.18 is a special case of a more general inequality known as Jensen's inequality. This inequality involves convex functions. Recall that a function $\varphi:(a,b)\to\mathbb{R}$ is convex if for any $a < x < y < b$ and $t\in[0,1]$, $\varphi(tx+(1-t)y)\le t\varphi(x)+(1-t)\varphi(y)$.

Proposition 4.1.19. (Jensen's inequality) Assume that $X$ is an integrable random variable valued in $(a,b)$. If $\varphi$ is a convex function on $(a,b)$ and $\varphi(X)$ is integrable, then $\varphi(E[X])\le E[\varphi(X)]$.

Proof. This result is an easy consequence of the fact that any convex function $\varphi$ is the supremum of a countable number of affine functions (see Exercise 4.7.25). Let $L_n(x) = a_nx+b_n$ for $n\ge 1$ and assume that $\varphi(x) = \sup_n L_n(x)$. Since $X$ is integrable and $E[L_n(X)] = a_nE[X]+b_n = L_n(E[X])$, it follows that $L_n(E[X])\le E[\varphi(X)]$ for all $n\ge 1$. The result follows by taking the supremum over $n$. □

Remarks. (1) The function $t\to t^p$ is convex if $p\ge 1$ since its second derivative is non-negative. Hence, by Jensen's inequality, $(E[|X|])^p\le E[|X|^p]$ if $p\ge 1$ as long as $X\in L^1$. Note that the inequality is trivial if $X\notin L^1$.
(2) Jensen's inequality is only valid on probability spaces (see Exercise 4.7.6).
(3) The hypothesis that $\varphi(X)$ is integrable is not necessary once one has sorted out how to define the integral for the sum of a non-negative random variable and an integrable one (see Exercise 2.9.19).

Exercise 4.1.13 asks for the limiting position as $p\to+\infty$ of $\{x = (x_1,x_2)\mid(|x_1|^p+|x_2|^p)^{1/p} = 1\}$. It is $\{x\mid\max\{|x_1|,|x_2|\} = 1\}$. This set is the unit sphere for the norm on $\mathbb{R}^2$ defined by $\|x\|_{\infty} = \max\{|x_1|,|x_2|\}$, the largest of the components. By analogy, one makes the following definition.

Definition 4.1.20. $X\in L^{\infty}$ if $X\in L$ and there is a number $C\ge 0$ such that $P(\{|X| > C\}) = 0$. The smallest number with this property is called the essential supremum of $X$ and is denoted by $\|X\|_{\infty}$.

Remark. Note that $X\in L^{\infty}$ need not imply that $X$ is bounded. However, modulo a null function, $X$ is bounded. If $X\in L^{\infty}$, it is also said to be essentially bounded.

Exercise 4.1.21. Let $(\Omega,\mathfrak{F},\mu)$ be a $\sigma$-finite measure space and let $X,Y$ be measurable functions. Show that
(1) $\mu(\{|X+Y| > C+D\}) = 0$ if $\mu(\{|X| > C\}) = 0$ and $\mu(\{|Y| > D\}) = 0$ [Hint: consider $\Omega\setminus(E\cup F)$, where $E = \{|X| > C\}$ and $F = \{|Y| > D\}$.],
(2) $L^{\infty}$ is a linear subspace of $L$,
(3) $X\to\|X\|_{\infty}$ is a norm on $L^{\infty}$ with the usual proviso that $\|X\|_{\infty} = 0$ implies $X = 0$, $\mu$-a.e.
In the case of a probability space $(\Omega,\mathfrak{F},P)$, show that
(4) $L^{\infty}\subset L^p$ if $1\le p < \infty$ and in general $L^{\infty}\neq\cap_{1\le p<\infty}L^p(\Omega,\mathfrak{F},P)$ [Hint: show that $X(x) = x$ is in every $L^p(\mathbb{R},\mathfrak{B}(\mathbb{R}),\frac{1}{\sqrt{2\pi}}e^{-x^2/2}dx)$],
(5) $\|X\|_p\le\|X\|_{\infty}$ if $X\in L^{\infty}$.

Remark. In the case of a probability space $(\Omega,\mathfrak{F},P)$, $\sup_{1\le p<\infty}\|X\|_p = \|X\|_{\infty}$ if $X\in L^{\infty}$. This is not true in general for a $\sigma$-finite measure space. However, if the measure is finite, then $\lim_{p\to\infty}\|X\|_p = \|X\|_{\infty}$. It suffices to observe the effect on the $L^p$-norms of normalizing the total mass (see Exercise 4.7.11).

Exercise 4.1.22. Let $X$ be a random variable with distribution $Q$. Show that
(1) $X\in L^{\infty}$ if and only if $Q([-M,M]) = 1$ for some $M\ge 0$,
(2) if $X\in L^{\infty}$ and $M = \|X\|_{\infty}$, then $Q(\{x\mid M-\varepsilon < |x|\le M\}) > 0$ for any $\varepsilon > 0$ that is less than $M$.
Use this property (2) together with Chebychev's inequality (Proposition 4.3.6) and Exercise 1.1.18 to show that on a probability space, $\|X\|_{\infty} = \sup_{1\le p<\infty}\|X\|_p$.

Definition 4.1.23. Let $1\le p\le+\infty$. A sequence $(X_n)_{n\ge 1}$ of r.v. in $L^p$ converges to 0 in $L^p$ or in $L^p$-norm if $\lim_{n\to\infty}\|X_n\|_p = 0$. This is often denoted by writing $X_n\xrightarrow{L^p}0$.

Important remark. Assume that the measure space $(\Omega,\mathfrak{F},\mu)$ is finite, i.e., $\mu(\Omega) < \infty$. If $p_1 < p_2$ and $(X_n)_{n\ge 1}$ is a sequence in $L^{p_2}$, it is also a sequence in $L^{p_1}$. In view of Exercise 4.1.18, if it converges to zero in $L^{p_2}$, it converges to zero in $L^{p_1}$, but not conversely. In particular, if all the random variables are in $L^2$ (i.e., $(X_n)_{n\ge 1}\subset L^2$), to say that it converges to 0 in $L^2$ is a stronger statement than saying that it converges to 0 in $L^1$. The higher the $p$, the stronger the condition "converges to 0 in $L^p$" becomes.

Example 4.1.24. Let $\Omega = [0,1]$, $\mathfrak{F} = \mathfrak{B}([0,1])$, and $P = dx$. In other words, $P$ is Lebesgue measure on $[0,1]$. Let $X_n = \sqrt{n}\,1_{[0,1/n]}$. Then $(X_n)_{n\ge 1}$ is in $L^{\infty}$ and $\|X_n\|_2 = 1$ for all $n$. However, $\|X_n\|_p = n^{1/2-1/p}$, and so $X_n\xrightarrow{L^p}0$ if $1\le p < 2$. Clearly, $X_n$ does not converge in $L^2$ to zero. Replacing $\sqrt{n}$ by $a_n$, one can get similar examples that fail to converge in $L^{p_2}$, for $1\le p_2 < +\infty$, but converge in $L^{p_1}$, $1\le p_1 < p_2$.

Important remark. The situation in the above example involving the vector space $L^2([0,1])$ is completely different from the situation on $\mathbb{R}^n$. There, because the dimension is finite, all norms give the same topology.

To sum up, for any measure space $(\Omega,\mathfrak{F},\mu)$ or probability space $(\Omega,\mathfrak{F},P)$, a large family of normed linear spaces is associated: namely, the spaces $L^p(\Omega,\mathfrak{F},\mu)$ or $L^p(\Omega,\mathfrak{F},P)$, $1\le p\le\infty$.

Definition 4.1.25. A metric space is a set $E$ together with a function $d:E\times E\to\mathbb{R}^+$ called a metric such that, for all $x,y,z\in E$,
(1) $d(x,y) = d(y,x)$,
(2) $d(x,z)\le d(x,y)+d(y,z)$ (the triangle inequality), and
(3) $d(x,y) = 0$ if and only if $x = y$.
A sequence $(x_n)_{n\ge 1}$ in $E$ converges to $x$ if for any $\varepsilon > 0$ there is an integer $n(\varepsilon)$ such that $d(x_n,x) < \varepsilon$ when $n\ge n(\varepsilon)$ (Definition 1.1.5). A sequence $(x_n)_{n\ge 1}$ in $E$ is said to be a Cauchy sequence if for any $\varepsilon > 0$, there is an integer $N(\varepsilon)$ such that $d(x_m,x_n) < \varepsilon$ when $n,m\ge N(\varepsilon)$. A metric space is said to be a complete metric space if every Cauchy sequence converges.

Remarks 4.1.26. (1) A norm $\|\cdot\|$ on a vector space $V$ (see Definition 2.5.3) determines a metric $d$ by setting $d(x,y) = \|x-y\|$. The simplest example is $V = \mathbb{R}$, with the norm being the absolute value and $d(a,b) = |a-b|$. It turns out (see Exercise 4.7.12) that $\mathbb{R}$ is complete under the metric given by the absolute value, as also are all the vector spaces $\mathbb{R}^n$ with either of the norms of Exercise 3.1.4 (or in fact any norm). A vector space with a norm is called a normed vector space. A normed vector space that is complete in the associated metric is called a Banach space.
(2) A metric has associated with it a topology: a subset $U$ of a metric space is said to be open if for any $a\in U$ there is an $\varepsilon > 0$ such that $\{x\mid d(x,a) < \varepsilon\}$ is a subset of $U$; the open ball about $a$ of radius $\varepsilon$ is defined to be $\{x\mid d(x,a) < \varepsilon\}$; it is an open set in view of the triangle inequality. Hence, a set $U$ is open if and only if for any $a\in U$ there is an open ball about $a$ that is contained in $U$.

The next two results imply that the normed spaces $L^p(\Omega,\mathfrak{F},\mu)$ and $L^p(\Omega,\mathfrak{F},P)$, $1\le p\le\infty$, are all complete and so are Banach spaces. The first result is a general fact from functional analysis.

Proposition 4.1.27. Let $V$ be a normed real or complex vector space. The following are equivalent:
(1) it is complete (as a metric space);
(2) a series $\sum_{n=1}^{\infty}x_n$ converges in $V$ provided $\sum_{n=1}^{\infty}\|x_n\|$ converges in $\mathbb{R}$ (such a series is said to be absolutely summable or normally summable).

Proof. Assume (1) and that $\sum_{n=1}^{\infty}\|x_n\| < +\infty$. The sequence $(y_n)_{n\ge 1}$ of partial sums $y_n = \sum_{k=1}^{n}x_k$ is a Cauchy sequence: $\|y_n-y_{n+m}\|\le\sum_{k=n+1}^{n+m}\|x_k\|$, which tends to zero as $n$ tends to infinity. Hence, the series converges to a vector $x$, which is denoted by $\sum_{n=1}^{\infty}x_n$.

Assume (2). Let $(y_n)_{n\ge 1}$ be a Cauchy sequence. One cannot immediately assume that the $y_n$ are the partial sums of an absolutely summable series. However, by using the Cauchy property, one can determine a sequence $(n_k)_{k\ge 1}$ of integers $n_k\ge 1$ such that, for all $k\ge 1$,
$$\|y_m-y_n\| < \frac{1}{2^k},\quad n,m\ge n_k.$$
One may also assume that $n_k < n_{k+1}$ for all $k\ge 1$. This ensures that the $y_{n_k}$ are the partial sums of an absolutely summable series: define $x_1 = y_{n_1}$, $x_{k+1} = y_{n_{k+1}}-y_{n_k}$, and consider the series $\sum_{k=1}^{\infty}x_k$; then $y_{n_k}$ is its $k$th partial sum. Since $\sum_{k=2}^{\infty}\|x_k\|\le\sum_{k=1}^{\infty}2^{-k} = 1$, (2) implies that the series $\sum_{k=1}^{\infty}x_k$ converges to a limit $y$. Hence, $y = \lim_k y_{n_k}$. The proof concludes with the observation that if a Cauchy sequence has a convergent subsequence, then it converges: let $\varepsilon > 0$ and let $\ell$ be such that $\frac{1}{2^{\ell}} < \frac{\varepsilon}{2}$; then $\|y_{n_k}-y\|\le\frac{\varepsilon}{2}$ for all $k\ge\ell$, and $\|y_n-y\|\le\|y_n-y_{n_{\ell}}\|+\|y_{n_{\ell}}-y\| < \varepsilon$ if $n\ge n_{\ell}$. □

To prove that all the $L^p$-spaces are Banach spaces, it therefore suffices to show that every series in $L^p$ that is absolutely summable also converges.

Proposition 4.1.28. (Riesz-Fischer theorem) A series $\sum_{n=1}^{\infty}X_n$ in the normed linear space $L^p$, $1\le p\le\infty$, converges if it is absolutely summable. Hence, $L^p$, $1\le p\le\infty$, is complete.

Proof. First, consider the case where $1\le p < \infty$. Let $C = \sum_{n=1}^{\infty}\|X_n\|_p$. Let $Z_N = \sum_{n=1}^{N}|X_n|$. Since $\|X\|_p = \|\,|X|\,\|_p$ for any measurable function $X$, Minkowski's inequality (Proposition 4.1.16) implies that $\|Z_N\|_p\le\sum_{n=1}^{N}\|X_n\|_p\le\sum_{n=1}^{\infty}\|X_n\|_p = C$. Hence, it follows from monotone convergence that if $Z = \lim_N Z_N$, then $E[Z^p]\le C^p < \infty$. As a result, $Z$ is finite a.e. If $E$ denotes $\{Z = +\infty\}$, the series $\sum_{n=1}^{\infty}X_n(\omega)$ is absolutely convergent on $\Omega\setminus E$. Let
$$Y(\omega) = \begin{cases}\sum_{n=1}^{\infty}X_n(\omega) & \text{if }\omega\notin E,\\ 0 & \text{if }\omega\in E.\end{cases}$$
Then, if $Y_N(\omega) = \sum_{n=1}^{N}X_n(\omega)$, $|Y_N(\omega)|\le Z_N(\omega)\le Z(\omega)$, and so $|Y|\le Z\in L^p$. Consequently, $Y\in L^p$. Now $|Y-Y_N|\le 2Z\in L^p$, since $|Y-Y_N|^p\le 2^pZ^p\in L^1$. Since $|Y-Y_N|^p\to 0$ a.e., it follows from the theorem of dominated convergence (Theorem 2.1.38) that $E[|Y-Y_N|^p]\to 0$. In other words, $\|Y-Y_N\|_p\to 0$, and so the series converges in $L^p$ to $Y$.

The case when $p = \infty$ is simpler. Assume that $C = \sum_{n=1}^{\infty}\|X_n\|_{\infty} < +\infty$. For each $n\ge 1$, let $E_n$ be the set where $|X_n| > \|X_n\|_{\infty}$ and set $E = \cup_{n=1}^{\infty}E_n$. This is a set of measure zero, and on $\Omega\setminus E$ the series $\sum_{n=1}^{\infty}X_n$ converges uniformly: $\omega\notin E$ implies $|\sum_{n=N+1}^{\infty}X_n(\omega)|\le\sum_{n=N+1}^{\infty}\|X_n\|_{\infty}$.
Let
$$Y(\omega) = \begin{cases}\sum_{n=1}^{\infty}X_n(\omega) & \text{if }\omega\notin E,\\ 0 & \text{if }\omega\in E.\end{cases}$$
Then, if $Y_N = \sum_{n=1}^{N}X_n$, $\|Y-Y_N\|_{\infty}\le\sum_{n=N+1}^{\infty}\|X_n\|_{\infty}$. This is because $E$ is a set of measure zero and on $\Omega\setminus E$ the inequality holds pointwise. Consequently, the series converges in $L^{\infty}$ to $Y$. □

In probability, there is a famous lemma, the Borel-Cantelli lemma, the first part of which is related to the proof of the completeness of $L^1$. Recall that $\|1_E\|_1 = P(E)$. Given a sequence $(E_n)_{n\ge 1}$ of events in $\mathfrak{F}$, let the set $\{\omega\mid E_n$ occurs infinitely often$\} = \{\omega\mid\omega$ belongs to an infinite number of sets $E_n\}$ be denoted by $\{E_n$ i.o.$\}$. Then $\{E_n$ i.o.$\} = \cap_{m=1}^{\infty}(\cup_{n=m}^{\infty}E_n)$; equivalently, $1_{\{E_n\text{ i.o.}\}} = 1_{\cap_{m=1}^{\infty}(\cup_{n=m}^{\infty}E_n)} = \limsup_n 1_{E_n}$.

Proposition 4.1.29. (Borel-Cantelli lemma) Let $(E_n)_{n\ge 1}$ be a sequence of events in $\mathfrak{F}$.
(1) If $\sum_{n=1}^{\infty}P(E_n) < +\infty$, then $P(\{E_n$ i.o.$\}) = 0$.
(2) Conversely, if the events $E_n$ are independent, then $P(\{E_n$ i.o.$\}) = 0$ implies that $\sum_{n=1}^{\infty}P(E_n) < +\infty$.

Proof. (1) $P(E_n) = \|1_{E_n}\|_1$, and so by the proof of Proposition 4.1.28 for $p = 1$, $Y = \sum_{n=1}^{\infty}1_{E_n}\in L^1$ if $\sum_{n=1}^{\infty}P(E_n) < +\infty$. Since $Y(\omega) < +\infty$ with probability one, $\{\omega\mid$ the sequence $(1_{E_n}(\omega))_{n=1}^{\infty}$ has an infinite number of ones$\} = \{\omega\mid E_n$ occurs infinitely often$\}$ has probability zero.

(2) Let $A = \{E_n$ i.o.$\}$. Then, as observed above, $1_A = \limsup_n 1_{E_n}$. If $P(A) = 0$ then, as $m\to\infty$, $P(\cup_{n=m}^{\infty}E_n)\downarrow 0$ and so $P(\cap_{n=m}^{\infty}E_n^c)\uparrow 1$. Now the sets $(E_n^c)_{n\ge 1}$ are independent since the sets $(E_n)_{n\ge 1}$ are independent (see Exercises 3.4.5 and 3.6.7). Hence,
$$P(\cap_{n=m}^{N}E_n^c) = \prod_{n=m}^{N}P(E_n^c) = \prod_{n=m}^{N}[1-P(E_n)] \le \prod_{n=m}^{N}\exp\{-P(E_n)\} = \exp\Big\{-\sum_{n=m}^{N}P(E_n)\Big\},$$
as $1-x\le e^{-x}$ if $x\ge 0$. Let $N\to+\infty$. This gives the inequality
$$P(\cap_{n=m}^{\infty}E_n^c) \le \exp\Big\{-\sum_{n=m}^{\infty}P(E_n)\Big\}.$$
Now $P(\cap_{n=m}^{\infty}E_n^c)\uparrow 1$ as $m\to+\infty$, and so there is at least one $m$ with $P(\cap_{n=m}^{\infty}E_n^c) > \frac{1}{2}$ (say). The fact that $\frac{1}{2} < \exp\{-\sum_{n=m}^{\infty}P(E_n)\}$ implies that $\sum_{n=m}^{\infty}P(E_n) < +\infty$. □
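Both halves of the lemma can be illustrated by simulating independent events $E_n$ with $P(E_n) = p_n$: when $\sum p_n < \infty$ (e.g., $p_n = 1/n^2$) a sample point typically lies in only a few $E_n$, all with small index, while for $p_n = 1/n$ the occurrences keep accumulating. Only finitely many events can be simulated, so this is an illustration, not a proof:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100_000
n = np.arange(1, N + 1)

for p in (1.0 / n**2, 1.0 / n):
    # One sample point omega; hits[k] is True iff omega lies in E_{k+1}.
    hits = rng.random(N) < p
    last = n[hits][-1] if hits.any() else 0
    print(hits.sum(), last)
# With p_n = 1/n^2 there are only a handful of occurrences, all of small
# index (consistent with P(E_n i.o.) = 0); with p_n = 1/n the occurrences
# (about log N of them) continue all the way out to N.
```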
2. Continuous functions and $L^p$*

This section can be omitted without loss of continuity. While it begins with some extra results about continuous functions that are outlined in exercises, its main purpose is to introduce Fourier series and to discuss the use of convolution in summing them (see Fejér's theorem 4.2.27). In addition, it is shown that convolution and differentiation commute, thereby extending Theorem 2.6.1.

It was shown in Proposition 2.1.15 that every continuous function on $\mathbb{R}$ is a Borel function, i.e., is measurable with respect to $\mathfrak{B}(\mathbb{R})$. There are many Borel functions that are not continuous, for example, the characteristic function $1_A$ of any Borel set different from $\mathbb{R}$ or the empty set $\emptyset$. However, in the $L^p$-spaces, for $1\le p < \infty$, the continuous functions can be used to approximate in $L^p$-distance any measurable function in $L^p$. Since such functions are limits of simple functions if they are non-negative (Proposition 2.1.11), it is more or less clear that the first thing to do, if one wants to verify this, is to approximate the characteristic function $1_A$ of a Borel set by a continuous function. Exercise 4.2.2 sets the stage for this.

To have an idea of what is going on, imagine approximating the function $1_{[0,1]}$ by the continuous piecewise linear functions $\varphi_n$, where
$$\varphi_n(x) = \begin{cases} 0 & \text{if } x < -\frac{1}{n},\\ nx+1 & \text{if } -\frac{1}{n}\le x < 0,\\ 1 & \text{if } 0\le x\le 1,\\ -nx+n+1 & \text{if } 1 < x\le 1+\frac{1}{n},\\ 0 & \text{if } 1+\frac{1}{n} < x.\end{cases}$$
The area of the region between the graphs of $1_{[0,1]}$ and $\varphi_n$ tends to zero as $n$ tends to infinity.

First, recall that in Exercise 2.9.3(4), it is shown that if $E$ is a bounded Lebesgue measurable set and $\varepsilon > 0$, then there exist a compact set $C$ and a bounded open set $O$ with $C\subset E\subset O$ and $|O\setminus C| < \varepsilon$. Since, as explained in the next exercise, $1_E$ is continuous on $C\cup O^c$, it follows that the measurable function $1_E$ is continuous except on a set of Lebesgue measure less than $\varepsilon$. This is a special case of Lusin's theorem (see later). Recall that in Exercise 2.3.7, it was shown that a function $f:\mathbb{R}\to\mathbb{R}$ is Lebesgue measurable if, for any $\varepsilon > 0$, it is continuous except on a set of measure less than $\varepsilon$.

Exercise 4.2.1. (Simple case of Lusin's theorem) Let $C$, $E$, and $O$ be as above. Show that
(1) the characteristic function $1_E$ of $E$ is continuous (on $C\cup O^c$) when restricted to the closed set $C\cup O^c$ (Definition 2.1.13). [Hints: if $x\in C$, some small open interval about $x$ lies inside $O$; if $x\in O^c$, some small open interval about $x$ lies in $C^c$.]
Note that the Lebesgue measure of $\mathbb{R}\setminus(C\cup O^c) = O\setminus C$ is less than $\varepsilon$. Now let $s = \sum_{n=1}^{N}a_n1_{A_n}$ be a simple Lebesgue measurable function with $|A_n| < \infty$ for each $n$. Make use of Exercise 2.9.3(5) to show that
(2) if $\varepsilon > 0$ there is a closed subset $A$ of $\mathbb{R}$ such that (i) the restriction of $s$ to $A$ is continuous (on $A$), and (ii) $|\mathbb{R}\setminus A| < \varepsilon$.

Since every non-negative measurable function can be approximated by simple functions, it is natural to ask whether (2) holds for any Lebesgue measurable function on $\mathbb{R}$. The answer is affirmative, as stated in a preliminary version of the following theorem.

Lusin's theorem. (see Exercise 4.7.9) A real-valued function $f$ on $\mathbb{R}$ is Lebesgue measurable if and only if, for any $\varepsilon > 0$, there is a closed set $A$ such that (i) the restriction of $f$ to $A$ is continuous on $A$, and (ii) $|\mathbb{R}\setminus A| < \varepsilon$.

While Lusin's theorem shows that measurable functions are not far from being continuous functions, there is another more quantitative way in which continuous functions are close to the functions in $L^p$. This involves actually approximating the characteristic function of a bounded Lebesgue measurable set by a continuous function defined on all of $\mathbb{R}$.

Exercise 4.2.2. Let $C$, $E$, and $O$ be as above. Show that
(1) there exists a continuous function $\varphi$ on $\mathbb{R}$, $0\le\varphi\le 1$, such that $\varphi(x) = 0$ for $x\notin O$ and $\varphi(x) = 1$ for $x\in C$ [Hint: consider $\psi(x) = \mathrm{dist}(x,O^c)$, where $\mathrm{dist}(x,F) = \inf\{|x-y|\mid y\in F\}$ (see Exercise 4.7.17); use results of that exercise to show that for some $R > 0$ one may take $\varphi = \frac{1}{R}(\psi\wedge R)$],
(2) $|1_E(x)-\varphi(x)| = 0$ for $x\in C\cup O^c$ and $\le 1$ for $x\in O\setminus C$. In other words, $|1_E-\varphi|\le 1_{O\setminus C}$.

Proposition 4.2.3. Let $E$ be a Lebesgue measurable set with $|E| < +\infty$. For each $p$, $1\le p < \infty$, and $\varepsilon > 0$, there is a continuous function $\varphi$, $0\le\varphi\le 1$, such that $\|1_E-\varphi\|_p < \varepsilon$ and $\{x\mid\varphi(x)\neq 0\}$ is bounded.

Proof. Let $E_n = E\cap[-n,n]$. Then $1_{E_n}\uparrow 1_E$ and so, by dominated convergence, $\|1_E-1_{E_n}\|_p\to 0$. Choose an $n$ such that $\|1_E-1_{E_n}\|_p < \frac{\varepsilon}{2}$. It follows from Exercise 4.2.2 that there is a continuous function $\varphi$ with $\{x\mid\varphi(x)\neq 0\}\subset[-(n+1),n+1]$ and such that $|1_{E_n}-\varphi|\le 1_{O\setminus C}$, where $|O\setminus C| < (\frac{\varepsilon}{2})^p$. Hence, $\|1_{E_n}-\varphi\|_p\le\frac{\varepsilon}{2}$. □

Definition 4.2.4. A continuous function $\varphi$ on $\mathbb{R}$ is said to have compact support if the closure of $\{x\mid\varphi(x)\neq 0\}$ is compact. Similarly, if $O$ is an open subset of $\mathbb{R}$, a continuous function $\varphi$ on $O$ is said to have compact support if the closure of $\{x\mid\varphi(x)\neq 0\}$ is a compact subset of $O$.

Let $C_c(\mathbb{R})$ denote the set of continuous functions on $\mathbb{R}$ with compact support. Similarly, if $O$ is an open set, let $C_c(O)$ denote the set of continuous functions on $O$ with compact support. It is a linear subspace of the space of all real-valued functions on $\mathbb{R}$ because $\{x\mid\varphi(x)+\psi(x)\neq 0\}\subset\{x\mid\varphi(x)\neq 0\}\cup\{x\mid\psi(x)\neq 0\}$ and the union of two compact sets is compact (by part A of Exercise 1.5.6). In addition, for all $p$, $1\le p\le+\infty$, $C_c(\mathbb{R})\subset L^p(\mathbb{R})$.

Theorem 4.2.5. If $1\le p < +\infty$, then $C_c(\mathbb{R})$ is a dense subspace of $L^p(\mathbb{R})$.

Proof. It follows from Proposition 4.2.3 that the closure of $C_c(\mathbb{R})$ in $L^p(\mathbb{R})$, which is a linear subspace, contains all the simple functions and their differences. Note that $X\in L^p$ if and only if $|X|\in L^p$, which holds if and only if $X^+$ and $X^-\in L^p$. As a result, every function $X\in L^p$ is the limit in $L^p$ of a sequence of differences of simple functions provided every positive function $X\in L^p$ is the limit in $L^p$ of a sequence of simple functions. Hence, to prove the theorem, it is enough to verify the last statement. Let $0\le X\in L^p$ and $(s_n)_{n\ge 1}$ be a sequence of simple functions that increases to $X$. It follows from dominated convergence (Theorem 2.1.38) that $E[|X-s_n|^p]\to 0$, as $|X-s_n|^p\le X^p$. □

If $E\subset\mathbb{R}$, let $C(E)$ denote the set of continuous functions $f:E\to\mathbb{R}$ (see Definition 2.1.13).

Corollary 4.2.6. $C([a,b])$ is dense in $L^p([a,b])$, $1\le p < +\infty$.

Proof. $L^p([a,b])$ is the $L^p$-space associated with Lebesgue measure on $[a,b]$. Recall that this measure can be viewed as the restriction of Lebesgue measure to the Lebesgue measurable subsets of $[a,b]$ or as the measure obtained from the right continuous function $G(x) = a$, $x < a$; $G(x) = x$, $a\le x < b$; and $G(x) = b$, $b\le x$. Every function $f\in L^p([a,b])$ can be viewed as a Lebesgue measurable function on $\mathbb{R}$ that vanishes outside $[a,b]$. If $\varphi\in C_c(\mathbb{R})$ and $\|f-\varphi\|_p < \varepsilon$, then
$$\int 1_{[a,b]}(x)|f(x)-\varphi(x)|^pdx \le \int|f(x)-\varphi(x)|^pdx < \varepsilon^p.$$
Since the restriction of $\varphi$ to $[a,b]$ belongs to $C([a,b])$, the result follows. □

Exercise 4.2.7. Make use of the proof of Proposition 4.2.3 to show that the space $C_c((a,b))$ is dense in $L^p([a,b])$. In particular, every function in $L^p$ can be approximated in $L^p$ by continuous functions $\psi$ on $[a,b]$ for which $\psi(a) = \psi(b)$.

Remark. Note that $\psi\in C_c((a,b))$ if and only if it is continuous on $[a,b]$ and vanishes near the endpoints $a$ and $b$.

These density results depend on a relation between the topology of $\mathbb{R}$ and the measure $dx$ which is stated in Exercise 2.9.3. A measure $\mu$ on $\mathfrak{B}(\mathbb{R})$ for which this exercise holds is called a regular Borel measure. All the measures determined by a non-decreasing, right continuous function $G$ as in Chapter II are regular. The concept of regularity makes sense in a
fairly general setting, in particular on $\mathbb{R}^n$. It is not hard to verify Exercise 2.9.3(4) for Lebesgue measure on $\mathbb{R}^n$ (see Exercise 4.2.9). As a result, the following holds since, in addition, the proof suggested for Exercise 4.2.2 works in $\mathbb{R}^n$.

Theorem 4.2.8. The set $C_c(\mathbb{R}^n)$ of continuous functions on $\mathbb{R}^n$ with compact support is dense in $L^p$, $1\le p<\infty$.

Remark. The concept of compact support for a function on $\mathbb{R}^n$ is defined as in Definition 4.2.4. It requires that the concept of compactness be defined in $\mathbb{R}^n$ (one copies Definition 1.4.9). For this to have any meaning, one needs to know that, as before on $\mathbb{R}$, a set is compact if and only if it is closed and bounded, i.e., lies in some ball centered at the origin (see Marsden [M1]).

Exercise 4.2.9. Let $E$ be a bounded Lebesgue measurable subset of $\mathbb{R}^n$ with $|E|<+\infty$. Let $\varepsilon>0$. Show that

(1) there exists a bounded open set $O\supset E$ with $|O\setminus E|<\frac{\varepsilon}{2}$. [Hint: show that the outer measure $\lambda^*$ that may be used to define Lebesgue measure starting from the Boolean ring of "half-open boxes" $(a_1,b_1]\times(a_2,b_2]\times\cdots\times(a_n,b_n]$ can be computed by using open boxes; replace intervals of the form $(a_i,b_i]$ by open intervals $(a_i,b_i+\epsilon)$ for small $\epsilon$ (depending on the box).]

Assume $E\subset[-N,N]\times[-N,N]\times\cdots\times[-N,N]$. Show that

(2) if $P\supset E^c\cap[-N,N]\times[-N,N]\times\cdots\times[-N,N]\overset{def}{=}E^c_N$ is open and $|P\setminus E^c_N|<\frac{\varepsilon}{2}$, then $|E\setminus C|<\frac{\varepsilon}{2}$, where $C=P^c\cap[-N,N]\times[-N,N]\times\cdots\times[-N,N]$.

Conclude that

(3) if $E$ is a bounded Lebesgue measurable set, then there is a bounded open set $O$ and a compact set $C$ with $C\subset E\subset O$ and $|O\setminus C|<\varepsilon$ (recall that a compact set is one that is closed and bounded).

The above density results are false for $p=\infty$. This is a consequence of either of the next two exercises.

Exercise 4.2.10. If $f$ is a function on $\mathbb{R}$, let $f_h$ denote the translate of $f$ by $h$, i.e., $f_h(x)=f(x+h)$ for all $x\in\mathbb{R}$. Show that

(1) if $\varphi\in C_c(\mathbb{R})$, then $\|\varphi_h-\varphi\|_p\to0$ as $|h|\to0$ for $1\le p<\infty$;

(2) if $f\in L^p(\mathbb{R})$ and $1\le p<\infty$, then $\|f_h-f\|_p\to0$ as $|h|\to0$ [Hint: use the density result (Theorem 4.2.5) and relate $\|f_h-f\|_p$ to $\|\varphi_h-\varphi\|_p$ if $\varphi$ approximates $f$; recall that Lebesgue measure is translation invariant (Exercise 2.2.8).]; and

(3) if $p=+\infty$, it is false that $\|f_h-f\|_p\to0$ as $|h|\to0$. [Hint: consider $f=1_{[0,1]}$; see the numerical sketch below.]

Extend these results to $L^p(\mathbb{R}^n)$.
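The contrast between (2) and (3) in Exercise 4.2.10 can be seen numerically. The following sketch is illustrative only (grid-based sums are crude stand-ins for the Lebesgue integrals): it computes $\|f_h-f\|_1$ and the sup-norm of $f_h-f$ for $f=1_{[0,1]}$ as $h\to0$; the first tends to $0$ while the second stays equal to $1$.

    import numpy as np

    x = np.linspace(-1.0, 2.0, 300001)
    dx = x[1] - x[0]
    f = ((x >= 0.0) & (x <= 1.0)).astype(float)   # f = indicator of [0,1]

    for h in [0.5, 0.1, 0.01, 0.001]:
        fh = ((x + h >= 0.0) & (x + h <= 1.0)).astype(float)  # f_h(x) = f(x+h)
        l1 = np.sum(np.abs(fh - f)) * dx   # -> 0 as h -> 0, as in (2)
        sup = np.max(np.abs(fh - f))       # stays 1: the L^infinity norm fails, as in (3)
        print(h, l1, sup)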
Remark. The definition of $f_h$ in effect shifts the graph of $f$ to the left if $h>0$. To shift to the right, one either uses $-h$ or redefines the translation of $f$ by $h$ by setting $f_h(x)=f(x-h)$. The definition given in the exercise is compatible with the definition of the translation of a measure given in Proposition 6.5.4(8).

Exercise 4.2.11. Let $\varphi$ denote a continuous bounded function on $\mathbb{R}$. Its uniform norm $\|\varphi\|_u$ is defined to be $\sup_{x\in\mathbb{R}}|\varphi(x)|$. Show that

(1) if $(\varphi_n)$ is a sequence of bounded continuous functions that converges uniformly (see Definition 3.6.15) to a function $\varphi$, then $\varphi$ is bounded and continuous; i.e., $C_b(\mathbb{R})$, the vector space of continuous bounded functions on $\mathbb{R}$, is uniformly closed (Definition 3.6.15). [Hint: $|\varphi(x_0)-\varphi(x)|\le|\varphi(x_0)-\varphi_n(x_0)|+|\varphi_n(x_0)-\varphi_n(x)|+|\varphi_n(x)-\varphi(x)|$.]

Comment. This property of continuous functions holds in general: given any set $E\subset\mathbb{R}^n$ (say), the vector space $C_b(E)$ of continuous bounded functions on $E$ is uniformly closed; the above hint applies.

Continuing the exercise, show that

(2) if $\varphi$ is a bounded continuous function, then $\|\varphi\|_u\ge\|\varphi\|_\infty$ (its $L^\infty$-norm relative to Lebesgue measure), and

(3) $\|\varphi\|_u\le\|\varphi\|_\infty$. [Hint: if $M=\|\varphi\|_u$, then $\{x\mid\varphi(x)>M-\varepsilon\}$ is a non-empty open set.]

Conclude that

(4) the vector space $C_b(\mathbb{R})$ is not dense in $L^\infty(\mathbb{R})$, the space of essentially bounded, Lebesgue measurable functions on $\mathbb{R}$.

Remark. Since $\|\varphi\|_u=\|\varphi\|_\infty$ if $\varphi$ is a bounded continuous function, it is usual to denote the uniform norm by $\|\varphi\|_\infty$.

For $L^2$, these density results, combined with the Stone-Weierstrass theorem, prove that $f\in L^2([0,2\pi])$ is zero if $f$ is orthogonal to all the trigonometric polynomials, where the inner product (Definition 2.5.6) is defined by $(f,g)=\int_0^{2\pi}f(x)g(x)dx$.

Theorem 4.2.12. Let $f\in L^2([0,2\pi])$. Assume that
$$\int_0^{2\pi}f(x)\cos nx\,dx=\int_0^{2\pi}f(x)\sin nx\,dx=0, \quad\text{for all } n\ge0.$$
Then $f=0$, i.e., $f=0$ a.e.

Proof. A trigonometric polynomial $p$ is by definition a finite linear combination of the functions $\cos nx$, $\sin mx$, $n\ge0$, $m\ge1$, on $[0,2\pi]$. It follows from the addition formulas for sines and cosines that the collection $\mathcal{V}$ of trigonometric polynomials is an algebra of functions (i.e., it is a linear subspace closed under (pointwise) multiplication).
By the Stone-Weierstrass theorem (see Marsden [M1], p. 120, Rudin [R4]), every continuous function $\varphi$ on $[0,2\pi]$ for which $\varphi(0)=\varphi(2\pi)$ is a uniform limit (Definition 3.6.15) of trigonometric polynomials. Let $\varepsilon>0$ and $f\in L^2([0,2\pi])$. By Exercise 4.2.7, there is a continuous function $\varphi$ with $\varphi(0)=\varphi(2\pi)$ such that $\|f-\varphi\|_2<\frac{\varepsilon}{2}$. From the above observations, if $\eta>0$, there is a trigonometric polynomial $p$ with $\|\varphi-p\|_\infty<\eta$. This implies that $\int_0^{2\pi}|\varphi(x)-p(x)|^2dx\le2\pi\eta^2$. Hence, if $\sqrt{2\pi}\,\eta<\frac{\varepsilon}{2}$,
$$\|f-p\|_2\le\|f-\varphi\|_2+\|\varphi-p\|_2<\varepsilon.$$
The assumption that $f$ is orthogonal to all trigonometric polynomials $p$ implies that
$$\|f-p\|_2^2=\|f\|_2^2-2(f,p)+\|p\|_2^2=\|f\|_2^2+\|p\|_2^2<\varepsilon^2.$$
Therefore, $\|f\|_2<\varepsilon$ for all $\varepsilon>0$, and so $f=0$. $\square$

The vector space $L^2([0,2\pi])$ is complete in the $L^2$-norm in view of the Riesz-Fischer theorem (Theorem 4.1.28). As a result, it is a so-called Hilbert space: a complete normed vector space $V$ whose norm is defined by an inner product (see Exercise 2.5.8 and Proposition 5.1.8).

Definition 4.2.13. Let $V$ be a Hilbert space. An orthogonal system in $V$ is a sequence $(\varphi_n)_{n\ge1}$ of vectors such that any two distinct vectors are orthogonal. If, in addition, the vectors are all of length one, the system is called an orthonormal system. An orthogonal system is said to be complete if the only vector perpendicular to all the vectors $\varphi_n$ is the zero vector.

In the Hilbert space $L^2([0,2\pi])$, the functions $1,\cos nx,\sin nx$, $n\ge1$, form an orthogonal system.

Exercise 4.2.14. Verify this statement. [Hint: make use of the addition formulas for sines and cosines (see the appendix of Chapter VI) to express (for example) $\sin mx\cos nx$ as a linear combination of sines and cosines.] Also verify that the length in $L^2([0,2\pi])$ of $1$ is $\sqrt{2\pi}$ and that each of the other functions has length $\sqrt{\pi}$.

In other words, Theorem 4.2.12 says that the functions $1,\cos nx,\sin nx$, $n\ge1$, form a complete orthogonal system. Whenever this happens in a Hilbert space, one may expand any vector in a unique way in terms of the complete orthogonal system. For the above system in $L^2([0,2\pi])$, this expansion for a function $f$ is called its Fourier series. This expansion will now be discussed by first explaining how it works in an abstract Hilbert space.

Let $(\phi_n)_{n\ge1}$ be an orthonormal system in a Hilbert space $V$. Then it is necessarily a linearly independent set: if $\sum_{i=1}^{n}a_i\phi_i=0$, then $0=(\sum_{i=1}^{n}a_i\phi_i,\phi_k)=\sum_{i=1}^{n}a_i(\phi_i,\phi_k)=a_k$ if $1\le k\le n$.

Let $L_N$ be the linear subspace of $V$ with basis $\phi_1,\phi_2,\dots,\phi_N$. It is a closed subspace: if $\sum_{n=1}^{N}a_n(k)\phi_n=x_k\to x\in V$, then $(x_k,y)\to(x,y)$ for
any $y\in V$; taking $y=\phi_j$, $1\le j\le N$, shows that each sequence $(a_j(k))_{k\ge1}$ converges to a limit $a_j$; hence, $\|x_k-\sum_{n=1}^{N}a_n\phi_n\|^2=\sum_{n=1}^{N}\{a_n(k)-a_n\}^2$ tends to $0$; as the limit is unique, $x=\sum_{n=1}^{N}a_n\phi_n\in L_N$.

Lemma 4.2.15. If $u\in V$, then there exists a unique $\ell=\ell_N\in L_N$ such that $u-\ell$ is perpendicular to $L_N$. More specifically,
$$\ell_N=\sum_{n=1}^{N}(u,\phi_n)\phi_n \quad\text{and}\quad \|\ell_N\|^2=\sum_{n=1}^{N}(u,\phi_n)^2\le\|u\|^2.$$

Proof. Assume that it is possible to write $u=\ell+w$ with $\ell=\sum_{i=1}^{N}a_i\phi_i$ in $L_N$ and $w$ perpendicular to $L_N$. Since $0=(u-\ell,\phi_n)$ for $1\le n\le N$, it follows that $(u,\phi_n)=(\ell,\phi_n)=a_n$. This shows that there is at most one $\ell\in L_N$ with $w=u-\ell$ perpendicular to $L_N$. Define $\ell$ to be $\sum_{n=1}^{N}(u,\phi_n)\phi_n$. Then $(u-\ell,\phi_n)=0$ for $1\le n\le N$, i.e., $w=u-\ell$ is perpendicular to $L_N$. Since $(\ell,w)=0$, it follows that
$$\|u\|^2=\|\ell\|^2+\|w\|^2\ge\|\ell\|^2=\sum_{n=1}^{N}(u,\phi_n)^2. \quad\square$$

As an immediate corollary, one has the following result.

Corollary 4.2.16. Let $(\phi_n)_{n\ge1}$ be an orthonormal system in a Hilbert space $V$. If $u\in V$, then
$$\sum_{n=1}^{\infty}(u,\phi_n)^2\le\|u\|^2 \quad\text{(Bessel's inequality)}.$$

Bessel's inequality implies that the partial sums of the series $\sum_{n=1}^{\infty}(u,\phi_n)\phi_n$ form a Cauchy sequence, and so the series converges in the Hilbert space $V$ by Proposition 4.1.27. The significance of the completeness of an orthonormal system is that in this case the above series converges to $u$ and so gives an expansion of $u$ in terms of the complete orthonormal system.

Proposition 4.2.17. Let $(\phi_n)_{n\ge1}$ be a complete orthonormal system in a Hilbert space $V$. If $u\in V$, then
$$u=\sum_{n=1}^{\infty}(u,\phi_n)\phi_n \quad\text{and}\quad \sum_{n=1}^{\infty}(u,\phi_n)^2=\|u\|^2 \quad\text{(Parseval's equality)}.$$
Proof. Let $\ell=\sum_{n=1}^{\infty}(u,\phi_n)\phi_n$ and $\ell_N=\sum_{n=1}^{N}(u,\phi_n)\phi_n$. Then, since $\ell=\lim_{N\to\infty}\ell_N$, it follows that
$$(\ell,\phi_k)=\lim_{N\to\infty}(\ell_N,\phi_k)=\lim_{N\to\infty}\Big(\sum_{n=1}^{N}(u,\phi_n)\phi_n,\phi_k\Big)=(u,\phi_k).$$
As a result, $(u-\ell,\phi_k)=0$ for all $k\ge1$. Completeness of the orthonormal system implies that $u=\ell$. It remains to compute $\|\ell\|$. Since $\ell_N\to\ell$, it follows that $\|\ell_N\|\to\|\ell\|$. Now $\|\ell_N\|^2=\sum_{n=1}^{N}(u,\phi_n)^2$ by Lemma 4.2.15, and so $\|\ell\|^2=\sum_{n=1}^{\infty}(u,\phi_n)^2$. $\square$

Corollary 4.2.18. (Parseval's theorem) An orthonormal system is complete if and only if for any $u\in V$ one has
$$(*)\qquad \sum_{n=1}^{\infty}(u,\phi_n)^2=\|u\|^2.$$

Proof. It remains to show that the condition $(*)$ implies completeness. The condition itself shows that $\|u\|^2-\|\ell_N\|^2\to0$ for any $u\in V$. Since $u=\ell_N+w_N$ with $w_N$ perpendicular to $L_N$, it follows that $\|w_N\|^2\to0$ and so $\ell_N\to u$. If $u\in V$ and $u$ is orthogonal to each $\phi_n$, then $\ell_N=0$ for all $N$ and so $u=0$, i.e., the system is complete. $\square$

In the case of $L^2([0,2\pi])$, the functions $1,\cos nx,\sin nx$, $n\ge1$, form an orthogonal system. If in a Hilbert space one has an orthogonal system $(\psi_n)_{n\ge1}$, then by normalizing the vectors $\psi_n$ one obtains an orthonormal system $(\phi_n)_{n\ge1}$. Assuming the orthogonal system to be complete, and since $\phi_n=\frac{1}{\|\psi_n\|}\psi_n$, the expansion of $u$ can be written as
$$u=\sum_{n=1}^{\infty}(u,\phi_n)\phi_n=\sum_{n=1}^{\infty}\frac{(u,\psi_n)}{\|\psi_n\|^2}\psi_n \quad\text{and}\quad \|u\|^2=\sum_{n=1}^{\infty}\frac{(u,\psi_n)^2}{\|\psi_n\|^2}.$$
As a result, for any $f\in L^2([0,2\pi])$, in view of Exercise 4.2.14, one has the following Fourier expansion of $f$:
$$f(x)=\frac{a_0}{2\pi}+\frac{1}{\pi}\sum_{n=1}^{\infty}\{a_n\cos nx+b_n\sin nx\},$$
where
$$a_n=\int_0^{2\pi}f(t)\cos nt\,dt, \text{ for all } n\ge0, \quad\text{and}\quad b_n=\int_0^{2\pi}f(t)\sin nt\,dt, \text{ for all } n\ge1.$$
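These formulas are easy to test numerically. The sketch below is an illustration only (the test function $f(t)=t(2\pi-t)$ and the Riemann-sum quadrature are assumptions made for the demonstration): it computes the coefficients $a_n$, $b_n$ and checks Parseval's equality for this orthogonal system, namely that $\frac{a_0^2}{2\pi}+\frac{1}{\pi}\sum_{n\ge1}(a_n^2+b_n^2)$ approximates $\int_0^{2\pi}f^2$.

    import numpy as np

    t = np.linspace(0.0, 2*np.pi, 200001)
    dt = t[1] - t[0]
    f = t * (2*np.pi - t)            # a concrete continuous function on [0, 2*pi]

    N = 50
    a0 = np.sum(f) * dt              # a_0 = int f(t) dt
    total = a0**2 / (2*np.pi)
    for n in range(1, N + 1):
        an = np.sum(f * np.cos(n*t)) * dt   # a_n = int f(t) cos(nt) dt
        bn = np.sum(f * np.sin(n*t)) * dt   # b_n = int f(t) sin(nt) dt
        total += (an**2 + bn**2) / np.pi

    print(total, np.sum(f*f) * dt)   # the two numbers agree as N grows (Parseval)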
This expansion is to be understood in $L^2$. It does not mean that for all $x\in[0,2\pi]$ the series sums to $f(x)$. The expansion could be written more appropriately as
$$f(\cdot)=\frac{a_0}{2\pi}+\frac{1}{\pi}\sum_{n=1}^{\infty}\{a_n\cos n(\cdot)+b_n\sin n(\cdot)\},$$
where the equality is read as convergence in $L^2$ of the partial sums and, for example, $\cos n(x)=\cos nx$. The study of other types of convergence for this type of series is the subject of classical Fourier analysis (see Körner [K3]). For example, it was an open question for many years as to whether the series converges a.e. to $f$ when $f\in L^2$. It was solved in the affirmative in 1966 by Lennart Carleson (Acta Math. 116 (1966), pp. 135-157).

Convolution and Fourier series. Convolution was defined earlier (Exercise 3.3.21) primarily to discuss the distribution of sums of independent random variables. The definition of convolution for functions in $L^1(\mathbb{R})$ was implicit in part of that discussion.

Definition 4.2.19. If $f$ and $g$ are in $L^1(\mathbb{R})$, the convolution of $f$ with $g$ is the function denoted by $f*g$, where
$$(f*g)(x)=\int f(x-y)g(y)dy.$$

Notice that, for a specific $x\in\mathbb{R}$, there is no a priori reason why the product $f(x-y)g(y)$ should be integrable. However, Fubini's theorem implies that, for a.e. $x$, it is in $L^1(\mathbb{R})$.

Proposition 4.2.20. If $f$ and $g$ are in $L^1(\mathbb{R})$, then $f*g\in L^1(\mathbb{R})$ and $\|f*g\|_1\le\|f\|_1\|g\|_1$.

Proof. One may assume that $f$ and $g$ are finite Borel functions, modifying $f$ and $g$ if necessary by adding null functions. Then $\Phi(x,y)=f(x-y)g(y)$ is a Borel function on $\mathbb{R}^2$: the function $(x,y)\mapsto x-y$ is Borel, and so
$f(x-y)$ is a Borel function of $(x,y)$ (Proposition 2.1.17); $(x,y)\mapsto g(y)$ is Borel for the same reason; and the product of measurable functions is measurable (Proposition 2.1.11), as $(u,v)\mapsto uv$ is measurable (even continuous). The function $\Phi\in L^1(\mathbb{R}^2)$: using Fubini's theorem (Theorem 3.3.16), one has
$$\int|\Phi|=\int|g(y)|\Big(\int|f(x-y)|dx\Big)dy=\|f\|_1\int|g(y)|dy=\|f\|_1\|g\|_1,$$
as the invariance of Lebesgue measure under translation (Exercise 2.2.8) implies that $\int|f(x-y)|dx=\int|f(x)|dx$. Since $\Phi$ is integrable, by Fubini's theorem (Theorem 3.3.17), $\int f(x-y)g(y)dy$ is defined and integrable for almost all $x$. Therefore, the function $f*g$ is defined as a function a.e. (set it equal to zero, say, when the integral is not defined). Furthermore,
$$\int|(f*g)(x)|dx\le\int\Big(\int|f(x-y)g(y)|dy\Big)dx=\|\Phi\|_1=\|f\|_1\|g\|_1. \quad\square$$

In principle, the convolution product $f*g$ depends upon the order. However, $f*g=g*f$ for any two functions in $L^1$. This is explained in the next exercise, using the fact that for any Lebesgue measurable set $A$ one has $|A|=|-A|$, where $-A=\{-x\mid x\in A\}$: in other words, Lebesgue measure is invariant under the transformation $x\mapsto-x$.

Exercise 4.2.21. Show that

(1) if $f$ is a Borel (respectively, Lebesgue measurable) function, the function $\tilde f(x)=f(-x)$ has the same property;

(2) if $f=1_A$, then $\tilde f=1_{-A}$, where $-A=\{-x\mid x\in A\}$;

(3) $|A|=|-A|$ for any Lebesgue measurable set $A$ [Hint: first verify this for intervals.];

(4) if $f\in L^1(\mathbb{R})$, then $\tilde f\in L^1(\mathbb{R})$ and $\int\tilde f(x)dx=\int f(x)dx$ [Hint: first verify this for simple functions.];

(5) if $f,g\in L^1(\mathbb{R})$, then $(f*g)(x)=\int\tilde f(y-x)g(y)dy=\int\tilde f(u)g(x+u)du$ [Hint: use translation invariance.]; and finally,

(6) $\int\tilde f(u)g(x+u)du=\int f(u)g(x-u)du=(g*f)(x)$. [Hint: use (4).]

Note that in (5) and (6) one assumes that $x$ is such that $y\mapsto f(x-y)g(y)$ is in $L^1$. Conclude that

(7) if $f,g\in L^1(\mathbb{R})$, then $f*g=g*f$ (a numerical sketch follows).

Remark 4.2.22. All of these observations about convolution carry over automatically to functions in $L^1(\mathbb{R}^n)$.
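As a quick sanity check of (7), the following sketch is an illustration (the choice of $f$, $g$ and the Riemann sums on a grid are assumptions for the demonstration, not part of the text): it evaluates $(f*g)(x_0)=\int f(x_0-y)g(y)dy$ and $(g*f)(x_0)=\int g(x_0-y)f(y)dy$ as two genuinely different numerical integrals and observes that they agree.

    import numpy as np

    y = np.linspace(-10.0, 10.0, 40001)
    dy = y[1] - y[0]
    f = lambda t: np.exp(-np.abs(t))                       # f in L^1(R)
    g = lambda t: ((t >= 0.0) & (t <= 1.0)).astype(float)  # g = 1_[0,1], in L^1(R)

    for x0 in [-1.0, 0.3, 2.0]:
        fg = np.sum(f(x0 - y) * g(y)) * dy   # (f*g)(x0)
        gf = np.sum(g(x0 - y) * f(y)) * dy   # (g*f)(x0)
        print(x0, fg, gf)                    # the two columns agree: f*g = g*f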
In classical Fourier analysis, the interval $[0,2\pi]$ plays a privileged role. The functions $f$ on $\mathbb{R}$ that are of interest are all $2\pi$-periodic (i.e., $f(x+2\pi)=f(x)$ for all $x$). These functions may therefore be identified with functions on $[0,2\pi]$ that have the same value at $0$ and $2\pi$. By an "abus de langage", a $2\pi$-periodic function $f$ on $\mathbb{R}$ will be said to be a $2\pi$-periodic integrable function, or simply to be integrable, if its restriction to $[0,2\pi]$ is integrable: strictly speaking, such a function is locally integrable in the sense of Remark 4.6.8. Note that such functions have associated with them a Fourier series, since the integrals that define the coefficients $a_n$ and $b_n$ are defined.

Since the interval $[0,2\pi]$ can be identified with the unit circle by the map $t\mapsto e^{it}=\cos t+i\sin t$, the above amounts to considering functions on the circle $S^1$ (also called a one-dimensional torus and denoted by $\mathbb{T}$). The laws of exponents for the complex exponential (equivalently, the addition formulas for $\sin$ and $\cos$) imply that $S^1$ is a group, even a compact commutative group: $e^{is}e^{it}=e^{i(s+t)}$ (see the appendix to Chapter VI). The uniform distribution on $[0,2\pi]$ (i.e., $\mu(dx)=\frac{1}{2\pi}1_{(0,2\pi)}(x)dx$) has an image $m$ on $S^1$ under the exponential map $t\mapsto e^{it}$. This map may be viewed as a random vector on the probability space $([0,2\pi],\mathfrak{B}([0,2\pi]),\mu)$. The distribution $m$ of the exponential map is then a probability on the circle: it is the uniform distribution on the circle, i.e., $m(\alpha)=\frac{1}{2\pi}|\alpha|$, where $|\alpha|$ is arc length for any arc $\alpha$ on the circle. As a result, if $f$ is $2\pi$-periodic on $\mathbb{R}$ and $\phi(e^{it})=f(t)$, for integrable $f$ it follows that
$$\frac{1}{2\pi}\int_0^{2\pi}f(t)dt=\frac{1}{2\pi}\int_0^{2\pi}\phi(e^{it})dt=\int\phi\,dm.$$
In this way $L^1([0,2\pi])$ can be viewed as $L^1(dm)$.

It turns out that because $S^1$ is a group, convolution can be defined for functions on $S^1$ that are in $L^1(m)$ (see Exercise 4.7.26). However, it can be directly defined using $2\pi$-periodic functions on $\mathbb{R}$: the important thing to observe is that if $f$ is $2\pi$-periodic, then by the translation invariance of Lebesgue measure, $\int_a^{a+2\pi}f(x)dx=\int_0^{2\pi}f(u)du$ for any $a\in\mathbb{R}$.

Proposition 4.2.23. Let $f$ and $g$ be two $2\pi$-periodic integrable functions on $\mathbb{R}$. Then
$$\int_0^{2\pi}f(x-y)g(y)dy=\int_0^{2\pi}g(x-u)f(u)du$$
is an integrable $2\pi$-periodic function of $x$.

Proof. First, observe that as in Proposition 4.2.20 the functions $f$ and $g$ may be assumed to be Borel, and that
$$\int_0^{2\pi}\Big(\int_0^{2\pi}|f(x-y)||g(y)|dy\Big)dx=\Big(\int_0^{2\pi}|f(x)|dx\Big)\Big(\int_0^{2\pi}|g(y)|dy\Big)$$
since $\int_0^{2\pi}|f(x-y)|dx=\int_{-y}^{2\pi-y}|f(u)|du$, which equals $\int_0^{2\pi}|f(u)|du$ by $2\pi$-periodicity. Hence, for almost all $x\in[0,2\pi]$, the functions $y\mapsto f(x-y)g(y)$ and $y\mapsto g(x-y)f(y)$ are in $L^1([0,2\pi])$.

Let $x\in[0,2\pi]$ be such that $y\mapsto f(x-y)g(y)$ and $y\mapsto g(x-y)f(y)$ are in $L^1([0,2\pi])$. The integral $\int_0^{2\pi}f(x-y)g(y)dy=\int 1_{[0,2\pi]}(y)f(x-y)g(y)dy$. Following the line of reasoning in Exercise 4.2.21, it follows that
$$\int 1_{[0,2\pi]}(y)f(x-y)g(y)dy=\int 1_{[-x,2\pi-x]}(u)\tilde f(u)g(x+u)du=\int 1_{[x-2\pi,x]}(u)f(u)g(x-u)du=\int_0^{2\pi}f(u)g(x-u)du,$$
since $\int_a^{a+2\pi}h(u)du=\int_0^{2\pi}h(x)dx$ for any $a\in\mathbb{R}$, as $u\mapsto h(u)=f(u)g(x-u)$ is $2\pi$-periodic. The $2\pi$-periodicity of $\int_0^{2\pi}f(x-y)g(y)dy$ is obvious. The integrability of $x\mapsto\int_0^{2\pi}f(x-y)g(y)dy$ follows by Fubini's theorem from the initial computation, which showed that $\Phi(x,y)=f(x-y)g(y)$ is in $L^1([0,2\pi]\times[0,2\pi])$. $\square$

This suggests that one could reasonably denote $\frac{1}{2\pi}\int_0^{2\pi}f(x-y)g(y)dy$ by $(f*_{2\pi}g)(x)$ if $f$ and $g$ are $2\pi$-periodic functions on $\mathbb{R}$. In [K3], Körner denotes this convolution by $\frac{1}{2\pi}\int_{-\pi}^{\pi}f(y)g(x-y)dy$. In particular, if $K$ is $2\pi$-periodic and non-negative, and $\frac{1}{2\pi}\int_0^{2\pi}K(x)dx=1$, then $\frac{1}{2\pi}1_{[0,2\pi]}(x)K(x)dx$ determines a probability $\mu$ on the circle $S^1$, and the convolution $(f*_{2\pi}K)$ could be denoted by $\phi*_{S^1}\mu$ if $\phi(e^{ix})=f(x)$.

In classical Fourier analysis, there is a very well-known sequence $(\mu_n)_{n\ge1}$ of probabilities $\mu_n$ of this type which are collectively referred to as the Fejér kernel. The $n$th probability $\mu_n$ is given by the function
$$K_n(x)\overset{def}{=}\frac{1}{n+1}\Big(\frac{\sin\frac{(n+1)x}{2}}{\sin\frac{x}{2}}\Big)^2$$
(note that it is not obvious from this formula that $\frac{1}{2\pi}\int_0^{2\pi}K_n(x)dx=1$; see Lemma 4.2.26). The significance of this kernel is that for any $n\ge1$ and $2\pi$-periodic integrable function $f$, one has
$$(\dagger)\qquad (f*_{2\pi}K_n)(x)=\frac{1}{n+1}\sum_{\ell=0}^{n}\Big(\frac{a_0}{2\pi}+\frac{1}{\pi}\sum_{k=1}^{\ell}\{a_k\cos kx+b_k\sin kx\}\Big),$$
where the $a_k$ and $b_k$ are the Fourier coefficients of $f$, and the summation on $k$ is set equal to zero when $\ell=0$.
The expressions
$$s_\ell=\frac{a_0}{2\pi}+\frac{1}{\pi}\sum_{k=1}^{\ell}\{a_k\cos kx+b_k\sin kx\}$$
are the partial sums $s_\ell$ of the Fourier series of $f$, and so $(\dagger)$ states that the average of the first $n+1$ partial sums (see Exercise 4.3.14) is given by a convolution kernel.

To verify $(\dagger)$, it is useful to relate the Fourier coefficients of $f$ to convolution operators. The Fourier coefficients of $f\in L^1([0,2\pi])$ can be expressed in terms of the complex exponentials $e^{ikx}$, since $\cos kx=\frac{e^{ikx}+e^{-ikx}}{2}$ and $\sin kx=\frac{e^{ikx}-e^{-ikx}}{2i}$. One has the following lemma.

Lemma 4.2.24. If $f\in L^1([0,2\pi])$ and $k\ge1$, then
$$\frac{1}{2\pi}\int_0^{2\pi}f(u)\{e^{-iku}e^{ikx}+e^{iku}e^{-ikx}\}du=\frac{1}{\pi}\{a_k\cos kx+b_k\sin kx\}.$$
In other words,
$$\frac{1}{\pi}\{a_k\cos kx+b_k\sin kx\}=(f*_{2\pi}\{e^{ik}+e^{-ik}\})(x)=2(f*_{2\pi}\cos k)(x),$$
where $e^{ik}(x)=e^{ikx}$ and $\cos k(x)=\cos kx$.

Proof. If $k\ge1$, then
$$\frac{1}{2\pi}\Big(\int_0^{2\pi}f(u)e^{-iku}du\Big)e^{ikx}=\frac{1}{2\pi}\Big(\int_0^{2\pi}f(u)\{\cos ku-i\sin ku\}\{\cos kx+i\sin kx\}du\Big)$$
$$=\frac{1}{2\pi}\Big(\int_0^{2\pi}f(u)\{\cos ku\cos kx+\sin ku\sin kx\}du\Big)+\frac{i}{2\pi}\Big(\int_0^{2\pi}f(u)\{\cos ku\sin kx-\sin ku\cos kx\}du\Big)$$
$$=\frac{1}{2\pi}\{a_k\cos kx+b_k\sin kx\}+\frac{i}{2\pi}\Big(\int_0^{2\pi}f(u)\{\cos ku\sin kx-\sin ku\cos kx\}du\Big).$$
The result follows since $f$ real implies that $\frac{1}{2\pi}\big(\int_0^{2\pi}f(u)e^{iku}du\big)e^{-ikx}$ is the complex conjugate of $\frac{1}{2\pi}\big(\int_0^{2\pi}f(u)e^{-iku}du\big)e^{ikx}$. $\square$
Hence,
$$s_\ell=\frac{a_0}{2\pi}+\frac{1}{\pi}\sum_{k=1}^{\ell}\{a_k\cos kx+b_k\sin kx\}=\sum_{k=-\ell}^{\ell}\frac{1}{2\pi}\int_0^{2\pi}f(u)e^{ik(x-u)}du=\Big(f*_{2\pi}\Big\{\sum_{k=-\ell}^{\ell}e^{ik}\Big\}\Big)(x),$$
and so
$$\sigma_n\overset{def}{=}\frac{1}{n+1}\sum_{\ell=0}^{n}s_\ell=\Big(f*_{2\pi}\Big\{\frac{1}{n+1}\sum_{\ell=0}^{n}\Big(\sum_{k=-\ell}^{\ell}e^{ik}\Big)\Big\}\Big)(x)=(f*_{2\pi}K_n)(x),$$
in view of (5) in the next exercise. This completes the proof of $(\dagger)$.

Exercise 4.2.25. Show that

(1) $\sum_{k=0}^{\ell}z^k=\frac{1-z^{\ell+1}}{1-z}$ for any complex number $z\ne1$.

Make use of (1) to show that

(2) $D_\ell(x)=\sum_{k=-\ell}^{\ell}e^{ikx}=\frac{\sin(\ell+\frac12)x}{\sin\frac{x}{2}}$ ($D_\ell(x)$ is called Dirichlet's kernel),

(3) $\sum_{\ell=0}^{n}e^{i(\ell+\frac12)x}=\frac{\sin(n+1)x+i(1-\cos(n+1)x)}{2\sin\frac{x}{2}}$, and conclude that

(4) $\sum_{\ell=0}^{n}\sin(\ell+\tfrac12)x=\frac{1-\cos(n+1)x}{2\sin\frac{x}{2}}=\frac{\sin^2\frac{(n+1)x}{2}}{\sin\frac{x}{2}}$.

Finally, show that

(5) $\frac{1}{n+1}\sum_{\ell=0}^{n}\Big(\sum_{k=-\ell}^{\ell}e^{ikx}\Big)=\frac{1}{n+1}\sum_{\ell=0}^{n}D_\ell(x)=\frac{1}{n+1}\Big(\frac{\sin\frac{(n+1)x}{2}}{\sin\frac{x}{2}}\Big)^2=K_n(x)$.

Every continuous $2\pi$-periodic function is in $L^1([0,2\pi])$ and hence has a Fourier series. However, this is not enough to ensure that the partial sums $s_n(x)=\frac{a_0}{2\pi}+\frac{1}{\pi}\sum_{k=1}^{n}\{a_k\cos kx+b_k\sin kx\}$ converge to $f(x)$. For classical counterexamples, see Körner [K3], pp. 67-73. While the partial sums of the Fourier series of a continuous function need not converge to $f(x)$, their averages $\sigma_n=\frac{1}{n+1}\sum_{\ell=0}^{n}s_\ell$ do so (i.e., the Cesàro means (see Exercise 4.3.14, also [T] p. 411) of the Fourier series converge to $f(x)$). The expression $(\dagger)$ states that the $n$th Cesàro mean is obtained from $f$ by taking its convolution with the $n$th Fejér kernel $K_n(x)$. Fejér's theorem (Theorem 4.2.27) states that, if $f$ is continuous, these convolutions converge uniformly to $f$. This theorem is a formal consequence of the following lemma.
Lemma 4.2.26. The Fejér kernel $(K_n)_{n\ge1}$ has the following properties:

(1) $K_n(x)\ge0$;
(2) $\frac{1}{2\pi}\int_0^{2\pi}K_n(x)dx=1$; and
(3) if $\delta>0$ and $\varepsilon>0$, then $K_n(x)<\varepsilon$ for all $x\in[\delta,2\pi-\delta]$ if $n$ is sufficiently large; equivalently,
(4) if $\delta>0$ and $\varepsilon>0$, then $K_n(x)<\varepsilon$ for all $x\in[-\pi,-\delta]\cup[\delta,\pi]$ if $n$ is sufficiently large.

Proof. Property (1) is obvious. If $x\in[\delta,2\pi-\delta]$, then $\sin\frac{x}{2}\ge\sin\frac{\delta}{2}=m>0$, and so in this range $K_n(x)\le\frac{1}{n+1}\frac{1}{m^2}$, which is less than $\varepsilon$ for $n+1>\frac{1}{\varepsilon m^2}$. This proves (3) and hence (4). Since $K_n(x)=\frac{1}{n+1}\sum_{\ell=0}^{n}\big(\sum_{k=-\ell}^{\ell}e^{ikx}\big)$ and $\int_0^{2\pi}e^{ikx}dx=0$ for all $k\ne0$, property (2) follows immediately. $\square$

Theorem 4.2.27. (Fejér) If $f$ is a continuous $2\pi$-periodic function on $\mathbb{R}$, then, for $0\le x\le2\pi$,
$$(f*_{2\pi}K_n)(x)=\frac{1}{2\pi}\int_0^{2\pi}f(x-y)K_n(y)dy\to f(x) \text{ as } n\to\infty,$$
uniformly in $x$, i.e., the Cesàro means $f*_{2\pi}K_n$ converge uniformly to $f$. More generally, if $f$ is $2\pi$-periodic and integrable, then
$$(f*_{2\pi}K_n)(x_0)\to\tfrac12\{f(x_0+)+f(x_0-)\} \text{ as } n\to\infty$$
at any point $x_0$ where both limits exist.

Proof. If $f$ is continuous, it is uniformly continuous on $[-\pi,\pi]$ by Property 2.3.8(3). Let $\varepsilon>0$ and let $\delta>0$ be such that $|f(x)-f(y)|<\varepsilon$ if $|x-y|<\delta$, $x,y\in[-\pi,\pi]$. Then
$$|(f*_{2\pi}K_n)(x)-f(x)|=\Big|\frac{1}{2\pi}\int_{-\pi}^{\pi}\{f(x-y)-f(x)\}K_n(y)dy\Big|\le\frac{1}{2\pi}\int_{-\pi}^{\pi}|f(x-y)-f(x)|K_n(y)dy.$$
Now
$$\frac{1}{2\pi}\int_{-\pi}^{\pi}|f(x-y)-f(x)|K_n(y)dy=\frac{1}{2\pi}\int_{-\pi}^{-\delta}|f(x-y)-f(x)|K_n(y)dy+\frac{1}{2\pi}\int_{-\delta}^{\delta}|f(x-y)-f(x)|K_n(y)dy$$
$$+\frac{1}{2\pi}\int_{\delta}^{\pi}|f(x-y)-f(x)|K_n(y)dy\le\varepsilon\|f\|_\infty+\varepsilon+\varepsilon\|f\|_\infty$$
if $n$ is large enough to imply that $K_n(x)<\varepsilon$ on $[\delta,2\pi-\delta]$.

The proof for the more general case follows in essentially the same way once one observes that the Fejér kernel is symmetric, i.e., it is an even function of $x$. As a result, $\int_0^{\pi}K_n(x)dx=\int_{-\pi}^{0}K_n(x)dx=\pi$, and so one may "average" on each side of a point $x_0$ by splitting the integral over $[-\delta,\delta]$ into two integrals, one over $[-\delta,0]$ and the other over $[0,\delta]$. If $a,b\in\mathbb{R}$, then one has
$$(1)\qquad \Big|\frac{1}{2\pi}\int_0^{\pi}f(x_0-y)K_n(y)dy-\frac{a}{2}\Big|=\Big|\frac{1}{2\pi}\int_0^{\pi}\{f(x_0-y)-a\}K_n(y)dy\Big|$$
$$\le\frac{1}{2\pi}\int_0^{\pi}|f(x_0-y)-a|K_n(y)dy=\frac{1}{2\pi}\int_0^{\delta}|f(x_0-y)-a|K_n(y)dy+\frac{1}{2\pi}\int_{\delta}^{\pi}|f(x_0-y)-a|K_n(y)dy=I_1+I_2$$
and
$$(2)\qquad \Big|\frac{1}{2\pi}\int_{-\pi}^{0}f(x_0-y)K_n(y)dy-\frac{b}{2}\Big|\le\frac{1}{2\pi}\int_{-\delta}^{0}|f(x_0-y)-b|K_n(y)dy+\frac{1}{2\pi}\int_{-\pi}^{-\delta}|f(x_0-y)-b|K_n(y)dy=J_1+J_2.$$
Assume that $f(x_0-)$ exists, and let $a=f(x_0-)$ in (1). Then, for small $\delta>0$, it follows that $|f(x_0-y)-f(x_0-)|<\varepsilon$ if $0<y<\delta$, and so $I_1\le\varepsilon$ for small $\delta>0$. Fix one such $\delta$. Then, the second integral $I_2$ in (1) is dominated by $\varepsilon\{|a|+\frac{1}{2\pi}\int_0^{2\pi}|f(u)|du\}$ as long as $n\ge n(\delta,\varepsilon)$. If $f(x_0+)$ exists, let $b=f(x_0+)$ in (2). Then, for the same reasons, $J_1+J_2\le\varepsilon\{1+|b|+\frac{1}{2\pi}\int_0^{2\pi}|f(u)|du\}$ for a small $\delta>0$ and $n\ge n(\delta,\varepsilon)$ sufficiently large. This shows that $2(f*_{2\pi}K_n)(x_0)\to f(x_0+)+f(x_0-)$ as $n\to\infty$, provided both limits exist. $\square$
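Fejér's theorem can be watched in action. The sketch below is an illustration only (the test function $f(x)=|\sin x|$ and the quadrature grid are assumptions for the demonstration): it computes the Cesàro means $\sigma_n=f*_{2\pi}K_n$ directly from the partial sums of the Fourier series, using $(\dagger)$, and prints the uniform error, which decreases as $n$ grows.

    import numpy as np

    x = np.linspace(0.0, 2*np.pi, 20001)
    dx = x[1] - x[0]
    f = np.abs(np.sin(x))                 # a continuous 2*pi-periodic function

    def coeff(k):
        a = np.sum(f * np.cos(k*x)) * dx  # a_k = int_0^{2pi} f(t) cos(kt) dt
        b = np.sum(f * np.sin(k*x)) * dx  # b_k = int_0^{2pi} f(t) sin(kt) dt
        return a, b

    for n in [2, 8, 32, 128]:
        a0, _ = coeff(0)
        s = np.full_like(x, a0/(2*np.pi))     # partial sum s_0
        sigma = s.copy()                      # running sum of s_0, ..., s_n
        for k in range(1, n+1):
            a, b = coeff(k)
            s = s + (a*np.cos(k*x) + b*np.sin(k*x))/np.pi  # s_k from s_{k-1}
            sigma = sigma + s
        sigma = sigma/(n+1)                   # Cesaro mean = f *_{2pi} K_n, by (†)
        print(n, np.max(np.abs(sigma - f)))   # uniform error -> 0 (Fejér's theorem)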
Convolution and differentiation. Let $f\in L^p(\mathbb{R}^n)$, $1\le p<\infty$, and let $k\in C_c(\mathbb{R}^n)$. Then, for all $x\in\mathbb{R}^n$, the function $y\mapsto f(y)k(x-y)$ is in $L^p$, as it is dominated by $\|k\|_\infty|f(y)|$, and by dominated convergence the convolution $(f*k)(x)$ is a continuous function of $x$ on $\mathbb{R}^n$. If, in addition, $k\in C_c^1(\mathbb{R}^n)$, then it will now be shown that $f*k$ is $C^1$ and its partial derivatives $\frac{\partial}{\partial x_i}(f*k)$ are given by convolution with the partial derivatives $\frac{\partial}{\partial x_i}k$.

Let $e_i$ be the canonical basis vector of $\mathbb{R}^n$ all of whose components are zero, except for a $1$ in the $i$th position. The mean-value theorem implies that
$$k(x+he_i)-k(x)=h\frac{\partial}{\partial x_i}k(x+\theta he_i),$$
where $0<\theta<1$ and depends upon $x$, $h$, and $i$. Since $\frac{\partial}{\partial x_i}k\in C_c(\mathbb{R}^n)$, there is a constant $M$ with $|\frac{\partial}{\partial x_i}k(x)|\le M$ for all $i$ and $x$. Hence, if $B_N=\{x\mid\|x\|\le N\}$ contains the support of $k$, i.e., $B_N\supset\overline{\{x\mid k(x)\ne0\}}$, then, for $|h|\le1$,
$$|k(x+he_i)-k(x)|\le M|h|1_{B_{N+1}}(x), \text{ for all } i,\ 1\le i\le n, \text{ and } x\in\mathbb{R}^n.$$
It follows from this, by dominated convergence, as in the proof of Theorem 2.6.1, that
$$\frac{(f*k)(x+he_i)-(f*k)(x)}{h}=\int f(y)\frac{k(x+he_i-y)-k(x-y)}{h}dy$$
converges to $\int f(y)\frac{\partial}{\partial x_i}k(x-y)dy$ as $h\to0$. This proves the following important result about differentiating under the integral sign.

Theorem 4.2.28. Let $f\in L^p(\mathbb{R}^n)$, $1\le p<\infty$, and $k\in C_c^1(\mathbb{R}^n)$. Then $f*k\in C^1(\mathbb{R}^n)$ and $\frac{\partial}{\partial x_i}(f*k)=f*\frac{\partial}{\partial x_i}k$, for all $i$, $1\le i\le n$.

For bounded measurable functions $f$, this result extends to convolution by functions $k$ that are $C^1$ and integrable, but not necessarily of compact support. First, here is the one-dimensional version, which automatically solves Exercise 2.9.6 once one shows that all the derivatives of the Gaussian density $n(x)=\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}$ exist and are integrable.

Proposition 4.2.29. Let $f\in L^\infty(\mathbb{R})$ and $k\in C^1(\mathbb{R})\cap L^1(\mathbb{R})$. Assume that $k'\in C(\mathbb{R})\cap L^1(\mathbb{R})$. Then $(f*k)'=f*k'$.

Proof. Let $\phi_N\in C_c^1(\mathbb{R})$ with $0\le\phi_N\le1$ be such that $\phi_N(x)=1$ if $|x|\le N$ and $\phi_N(x)=0$ if $|x|\ge N+1$. One can also assume that the derivative $\phi_N'$ is uniformly bounded, independent of $N$, by a positive constant $M$.
Since $f\in L^\infty$ and $k$ is continuous, it follows from dominated convergence that $f*k$ is continuous: if $x_n\to x$, then $f(y)k(x_n-y)\to f(y)k(x-y)$, and $f(y)k(x-y)$ is integrable in $y$ for all $x$. Also, $f*k\phi_N\to f*k$ pointwise as $N\to\infty$, since $\int f(x-y)k(y)\phi_N(y)dy$ converges to $\int f(x-y)k(y)dy$ as $N\to\infty$, again by dominated convergence. Furthermore, by Theorem 4.2.28,
$$(\dagger)\qquad (f*k\phi_N)'=f*(k\phi_N)'=f*k'\phi_N+f*k\phi_N',$$
and hence $(f*k\phi_N)'\to f*k'$ pointwise, since
$$|(f*k\phi_N')(x)|\le M\|f\|_\infty\int_{\{|y|\ge N\}}|k(y)|dy\to0 \text{ as } N\to\infty$$
and $f*k'\phi_N\to f*k'$ pointwise. Note that, for any $u$, $(\dagger)$ implies
$$(\dagger\dagger)\qquad |(f*k\phi_N)'(u)|\le(|f|*|k'|)(u)+2M\|f\|_\infty \text{ for large } N,$$
with $(|f|*|k'|)(u)$ continuous. Now
$$(f*k)(x)=\lim_{N\to\infty}(f*k\phi_N)(x)=\lim_{N\to\infty}\Big[(f*k\phi_N)(0)+\int_0^x(f*k\phi_N)'(u)du\Big]=(f*k)(0)+\int_0^x(f*k')(u)du,$$
where the last limit holds by dominated convergence in view of $(\dagger\dagger)$, since $(|f|*|k'|)(u)$ is continuous. The continuity of $k'$ ensures that $f*k'$ is continuous, and so $\frac{d}{dx}\int_0^x(f*k')(u)du=(f*k')(x)$. $\square$

Corollary 4.2.30. Let $f\in L^\infty(\mathbb{R}^n)$ and $k\in C^1(\mathbb{R}^n)\cap L^1(\mathbb{R}^n)$. Assume that each partial derivative $\frac{\partial}{\partial x_i}k$ belongs to $L^1(\mathbb{R}^n)$. Then, for each $i$, $1\le i\le n$,
$$\frac{\partial}{\partial x_i}(f*k)=f*\frac{\partial}{\partial x_i}k.$$

Proof. Let $b_N(x)=\phi_N(\|x\|)$, where $\phi_N$ is the cutoff function used above. The norm of its gradient $\nabla b_N(x)=\phi_N'(\|x\|)\frac{x}{\|x\|}$ is then bounded by a positive constant $M$ independent of $N$. As a result, $(\dagger)$ and $(\dagger\dagger)$ hold with the derivative replaced in turn by each of the $n$ partial derivatives $\frac{\partial}{\partial x_i}$. Let $x\in\mathbb{R}^n$ and let $e_i$ be a basis vector. By integrating along the line segment from $x$ to $x+he_i$, one has, as before,
$$(f*k)(x+he_i)=\lim_{N\to\infty}(f*kb_N)(x+he_i)=\lim_{N\to\infty}\Big[(f*kb_N)(x)+\int_0^1\big(\nabla(f*kb_N)(x+the_i),he_i\big)dt\Big]$$
$$=\lim_{N\to\infty}\Big[(f*kb_N)(x)+\int_0^1 h\frac{\partial}{\partial x_i}(f*kb_N)(x+the_i)dt\Big]=(f*k)(x)+\int_0^1 h\Big(f*\frac{\partial}{\partial x_i}k\Big)(x+the_i)dt.$$
Since the partial derivative $\frac{\partial}{\partial x_i}k$ is continuous by hypothesis, it follows that $\frac{\partial}{\partial x_i}(f*k)(x)=(f*\frac{\partial}{\partial x_i}k)(x)$. $\square$

3. POINTWISE CONVERGENCE AND CONVERGENCE IN MEASURE OR PROBABILITY

Let $(X_n)_{n\ge1}$ be a sequence of random variables on $(\Omega,\mathfrak{F},P)$ or measurable functions on a $\sigma$-finite measure space $(\Omega,\mathfrak{F},\mu)$. Then, for each $\omega\in\Omega$, $(X_n(\omega))_{n\ge1}$ is a sequence of real numbers, and $\omega$ can be viewed as a parameter, a "random" parameter in the case of a probability space. One can then consider the set of parameter values for which the sequence $(X_n(\omega))_{n\ge1}$ converges.

Definition 4.3.1. A sequence $(X_n)_{n\ge1}$ of random variables converges $P$-a.s. to a random variable $X$ (or a.s. to $X$) if $P(\{\omega\mid\lim_{n\to\infty}X_n(\omega)=X(\omega)\})=1$. This will be denoted by writing $X_n\xrightarrow{P\text{-a.s.}}X$, or $X_n\xrightarrow{\text{a.s.}}X$ if the context determines $P$.

A sequence $(X_n)_{n\ge1}$ of measurable functions converges to a measurable function $X$ a.e. (almost everywhere) on a $\sigma$-finite measure space $(\Omega,\mathfrak{F},\mu)$ if $\mu(\Lambda)=0$, where $\Lambda=\{\omega\mid X_n(\omega)$ does not tend to $X(\omega)$ as $n\to\infty\}$. This will be denoted by writing $X_n\xrightarrow{\mu\text{-a.e.}}X$, or $X_n\xrightarrow{\text{a.e.}}X$ if $\mu$ is defined by the context.

On a probability space (and also for any bounded positive measure) the following result holds.

Proposition 4.3.2. If $X_n\xrightarrow{\text{a.s.}}0$ then, for any $\varepsilon>0$, $P(\{\omega\mid|X_n(\omega)|>\varepsilon\})\to0$ as $n\to\infty$. The converse is false, as shown by Example 4.3.3.

Proof. Let $\Lambda=\{\omega\mid X_n(\omega)\to0$ as $n\to+\infty\}$ and let $\varepsilon>0$. If $\Gamma(\varepsilon,N)=\{\omega\mid|X_n(\omega)|\le\varepsilon$ for all $n\ge N\}$, then $\Gamma(\varepsilon,N)\subset\Gamma(\varepsilon,N+1)$ and $\Lambda\subset\cup_{N=1}^{\infty}\Gamma(\varepsilon,N)$, since for each $\omega\in\Lambda$ there is a first time $N=N(\omega)$ after which $|X_n(\omega)|$ is never $>\varepsilon$. Let $\delta>0$. Since $P(\Lambda)=1$, there is an $N_\delta$ such that $P(\Gamma(\varepsilon,N))>1-\delta$ if $N\ge N_\delta$. Now $\{\omega\mid|X_n(\omega)|>\varepsilon\}\subset\Gamma(\varepsilon,N)^c$ if $n\ge N$. Hence, $n\ge N_\delta$ implies that $P(\{\omega\mid|X_n(\omega)|>\varepsilon\})\le\delta$. $\square$
Remark. The basic idea of the above proof is that $P(\Gamma(\varepsilon,N))$ is close to $1$ if $\varepsilon$ is small and $N$ is large enough (in terms of $\varepsilon$). By modifying it to let $\varepsilon$ tend to $0$ appropriately, one obtains a set $\Lambda$ with $P(\Lambda)<\varepsilon$ such that, off $\Lambda$, the sequence converges uniformly to zero. This result, known as Egorov's theorem, is used to prove Lusin's theorem (Exercise 4.7.9) (see also Exercise 4.2.1). See Exercise 4.7.8 for detailed hints on Egorov's theorem.

Example 4.3.3. Let $\Omega=[0,1]$, $\mathfrak{F}=\mathfrak{B}([0,1])$, and $P(dx)=dx$. For each $n$, divide $[0,1]$ into $2^n$ closed intervals $I_n(k)$ of equal length. Enumerate the random variables $1_{I_n(k)}=1_{[k/2^n,(k+1)/2^n]}$ in a sequence by first listing the intervals $I_1(k)$, then the intervals $I_2(k)$, and so on. Let the sequence of random variables that results be denoted by $(X_n)_{n\ge1}$. Then, if $0<\varepsilon<1$, $P[|X_n|>\varepsilon]\le\frac{1}{2^m}$ if $n\ge\sum_{\ell=0}^{m-1}2^\ell$. Hence, this probability tends to zero as $n\to\infty$. On the other hand, $P(\{x\mid\lim_{n\to\infty}X_n(x)$ exists$\})=0$, as one sees by considering what happens for $x$ irrational.

Proposition 4.3.2 motivates the next definition.

Definition 4.3.4. A sequence $(X_n)_{n\ge1}$ of random variables converges in probability to a random variable $X$ if, for any $\varepsilon>0$,
$$P[|X_n-X|>\varepsilon]\to0 \text{ as } n\to\infty.$$
This will be denoted by writing $X_n\xrightarrow{P}X$. A sequence $(X_n)_{n\ge1}$ of measurable functions converges in measure to a measurable function $X$ if, for any $\varepsilon>0$,
$$\mu(\{|X_n-X|>\varepsilon\})\to0 \text{ as } n\to\infty.$$
This will be denoted by writing $X_n\xrightarrow{\mu}X$.

Proposition 4.3.2 states that convergence a.s. to zero implies convergence to zero in probability, and Example 4.3.3 shows that the converse is false.

Remark. On a general measure space, the result that corresponds to Proposition 4.3.2 is that if $E$ is a set with finite measure and $X_n\xrightarrow{\text{a.e.}}0$, then $X_n\xrightarrow{\mu}0$ on $E$, i.e., $\mu(\{\omega\in E\mid|X_n(\omega)|>\varepsilon\})\to0$ for all $\varepsilon>0$. It amounts to restricting the measure to $E$ and using the resulting bounded measure space, where $\Omega=E$, the $\sigma$-algebra is $\{A\in\mathfrak{F}\mid A\subset E\}$, and the measure is $\mu$ restricted to it. Without the requirement that $\mu(E)<\infty$, Proposition 4.3.2 is false. Lebesgue measure on the real line gives an example: let $X_n=1_{[n,+\infty)}$.

While convergence of $(X_n)_{n\ge1}$ in measure does not imply convergence a.e., it does imply that a subsequence converges a.e., as stated in Exercise 4.3.5 below. First, here is a numerical look at Example 4.3.3.
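The following sketch is an illustration only (the sample point $x=1/\sqrt2$ and the number of stages are arbitrary choices): it generates the sliding-interval sequence of Example 4.3.3 and evaluates it at the fixed point $x$. The values are overwhelmingly $0$ (the intervals shrink, which is convergence in probability), yet $X_n(x)=1$ at every stage, so $(X_n(x))_{n\ge1}$ does not converge.

    import numpy as np

    x = 1/np.sqrt(2)                  # a fixed sample point in [0,1]
    vals = []
    for m in range(1, 15):            # stage m: 2^m intervals of length 2^-m
        for k in range(2**m):
            lo, hi = k/2**m, (k+1)/2**m
            vals.append(1.0 if lo <= x <= hi else 0.0)  # X_n(x) for the next n

    vals = np.array(vals)
    print(vals.mean())                # tiny: |X_n| > eps only rarely
    print(vals[-2**14:].max())        # still 1.0: X_n(x) = 1 infinitely often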
Exercise 4.3.5. Let $(X_n)_{n\ge1}$ be a sequence of random variables for which $X_n\xrightarrow{P}0$. Prove that there is a subsequence $(X_{n_k})_{k\ge1}$ such that $X_{n_k}\xrightarrow{\text{a.s.}}0$ by showing the following:

(1) for all $k\ge1$, there is an integer $n_k>n_{k-1}$ with $P[|X_{n_k}|>\frac{1}{2^k}]\le\frac{1}{2^k}$;

(2) if $\delta>\frac{1}{2^\ell}$, $\{\sup_{k\ge\ell}|X_{n_k}|>\delta\}\subset\cup_{k=\ell}^{\infty}\{|X_{n_k}|>\frac{1}{2^k}\}$;

(3) if $\delta>\frac{1}{2^\ell}$, $P[\sup_{k\ge\ell}|X_{n_k}|>\delta]\le\sum_{k=\ell}^{\infty}\frac{1}{2^k}=\frac{1}{2^{\ell-1}}$.

Conclude that $\lim_{k\to\infty}X_{n_k}=0$ a.s. [Hint: if $X_{n_k}(\omega)\nrightarrow0$, then for some $m\ge1$, $\omega\in\{\sup_{k\ge k_0}|X_{n_k}|>\frac{1}{m}\}$ for all large $k_0$.]

Proposition 4.3.6. (Chebychev's inequality) Let $X\in L^p$, $1\le p<+\infty$. Then
$$P[|X|>\varepsilon]\le\frac{\|X\|_p^p}{\varepsilon^p}.$$

Proof. $\|X\|_p^p=\int|X|^pdP\ge\int_{\{|X|>\varepsilon\}}|X|^pdP\ge\varepsilon^pP[|X|>\varepsilon]$. $\square$

Corollary 4.3.7. If $(X_n)_{n\ge1}\subset L^p$ and $X_n\xrightarrow{L^p}0$, then $X_n\xrightarrow{P}0$ and, in the case of a measure space, $X_n\xrightarrow{\mu}0$.

Proof. If $1\le p<\infty$, then for any $\varepsilon>0$, $P[|X_n|>\varepsilon]\le\varepsilon^{-p}\|X_n\|_p^p$, which tends to zero as $n\to\infty$. If $p=+\infty$, then $\|X_n\|_\infty<\varepsilon$ if $n\ge n(\varepsilon)$, and so $P[|X_n|>\varepsilon]=0$ if $n\ge n(\varepsilon)$. $\square$

Exercise 4.3.8. Modify Example 4.3.3 to show that convergence in probability does not imply convergence in any $L^p$, $1\le p\le+\infty$. Use Example 4.3.3 to show that convergence in $L^p$, $1\le p<\infty$, does not imply convergence a.s. On the other hand, show that convergence in $L^\infty$ implies convergence a.s.

Remark. In §5 a closer connection between convergence in probability and in $L^p$ is established using the concept of uniform integrability.

Exercise 4.3.9. Use Exercise 4.3.5 to show that if a sequence $X_n\xrightarrow{\mu}0$ on a probability space $(\Omega,\mathfrak{F},P)$, then a subsequence converges a.s. Show that Exercise 4.3.5 holds for arbitrary measure spaces and hence that if a sequence converges in $L^p$, a subsequence converges a.e.

In Chapter III independent random variables were defined, and the result on infinite products, Theorem 3.4.14, shows that, given a fixed probability $Q$ on $(\mathbb{R},\mathfrak{B}(\mathbb{R}))$, there are a probability space $(\Omega,\mathfrak{F},P)$ and a sequence $(X_n)_{n\ge1}$ of random variables defined on $\Omega$ such that they all have the common distribution $Q$ and are independent (see Exercise 3.4.16). Such a sequence is called an i.i.d. sequence, i.e., an independent and identically distributed sequence. For such a sequence,
$E[(X_k-m)(X_\ell-m)]=0$ for $k\ne\ell$, where $m$ is their common expectation, if they are integrable.

For example, if it is conceivable to toss a fair coin an infinite number of times, this "experiment" may be modeled by a sequence of i.i.d. random variables with common distribution $Q=\frac12\{\epsilon_0+\epsilon_1\}$, where $1$ corresponds to a head and $0$ to a tail. The result of running the "experiment" is an infinite sequence of $0$'s and $1$'s, i.e., a point in an infinite product space. One expects the average number of $1$'s in the sequence to tend to the common mean $m=\frac12$ as $n\to\infty$. In practice, this average is determined by computing the average of $n$ samples of the population consisting of $\{0,1\}$, i.e., computing the average number of heads after $n$ tosses of the coin. The following proposition shows that this sample average converges to the population mean $m$ in $L^2$.

Proposition 4.3.10. Let $(X_n)_{n\ge1}$ be a sequence of i.i.d. random variables. Assume that they have second moments, i.e., $(X_n)_{n\ge1}\subset L^2$. Let $m$ be their common expectation. Then the sequence $(\frac{S_n}{n})_{n\ge1}$ converges to $m$ in $L^2$, where $S_n=\sum_{k=1}^{n}X_k$.

Proof. Let $U_n=S_n-nm=\sum_{k=1}^{n}(X_k-m)=\sum_{k=1}^{n}Y_k$, where $Y_k=X_k-m$. The random variables $Y_k$ are in $L^2$, are independent (by Exercise 3.4.5), and have mean zero. Thus, $E[Y_kY_\ell]=0$ for $k\ne\ell$. Now $U_n^2=\sum_{k=1}^{n}Y_k^2+2\sum_{k<\ell}Y_kY_\ell$, and so $E[U_n^2]=\sum_{k=1}^{n}E[Y_k^2]=n\sigma^2$, where $\sigma^2$ is the common variance of the variables $X_n$. Therefore, $\|U_n\|_2=\sqrt{n}\,\sigma$ and so $\|\frac{1}{n}U_n\|_2=\frac{\sigma}{\sqrt{n}}\to0$ as $n\to\infty$. $\square$

Corollary 4.3.11. (A weak law of large numbers) Let $(X_n)_{n\ge1}$ be a sequence of i.i.d. random variables in $L^2$ (i.e., they have finite second moments). Then, if $m$ denotes the common mean,
$$\frac{S_n}{n}\xrightarrow{P}m, \quad\text{where } S_n=\sum_{k=1}^{n}X_k,$$
i.e., for all $\varepsilon>0$, $P\big[\big|\frac{S_n}{n}-m\big|>\varepsilon\big]\to0$ as $n\to\infty$.

Proof. It is a consequence of Proposition 4.3.10 and Corollary 4.3.7. $\square$
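A quick simulation of the coin-tossing model makes the statement concrete. The sketch below is illustrative only (sample sizes and seed are arbitrary): it generates fair-coin tosses and prints $|\frac{S_n}{n}-\frac12|$ next to the $L^2$-rate $\frac{\sigma}{\sqrt n}=\frac{1}{2\sqrt n}$ from Proposition 4.3.10.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=10**6)   # i.i.d. fair coin: P[X=1] = P[X=0] = 1/2

    for n in [10**2, 10**4, 10**6]:
        avg = X[:n].mean()               # S_n / n
        print(n, abs(avg - 0.5), 0.5/np.sqrt(n))  # error vs. sigma/sqrt(n), sigma = 1/2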
Remarks. (1) A weak law of large numbers says that under suitable hypotheses on $(X_n)_{n\ge1}$, $\frac{S_n}{n}\xrightarrow{P}m$.

(2) A strong law of large numbers is a stronger statement. It says that under suitable hypotheses, $\frac{S_n}{n}\xrightarrow{\text{a.s.}}m$ (this is a stronger statement because of Proposition 4.3.2). A simple strong law for $L^2$-random variables is given in Proposition 4.4.1. An even simpler one due to Borel, stated in Exercise 4.7.19, has to do with normal numbers. Consider an arbitrary number $x$ in $[0,1]$. It has a decimal expansion $x=0.a_1a_2\cdots a_n\cdots$. Fix a digit from $0$ to $9$, say $1$, and count the number of occurrences of $1$ among the first $n$ digits $a_i$ in the decimal expansion. Divide this number by $n$. One expects that this ratio will tend to $\frac{1}{10}$ as $n$ tends to infinity. The number $x$ is said to be a normal number if this is the case. Borel showed that a.s. every number in $[0,1]$ is normal. It is a consequence of Borel's strong law (Exercise 4.7.19) and the fact that, in a measure-theoretic sense, $([0,1],\mathfrak{B}(\mathbb{R}),dx)$ is essentially isomorphic to an infinite product of finite probability spaces (Exercise 4.7.20).

These laws of large numbers correspond to what one thinks of intuitively when "sampling" many times and then taking the average of the sample. One expects that in some sense it should be close to the actual mean of the "population". For example, the students at a certain university have an average height. Select a student at random and then measure that person's height. Repeat this procedure a large number of times, say 100 times. The sample average is the average of the observed heights. One way to say that the observed average is close to the true average is to say that this is so in probability if $n$ is large.

Of course, it is desirable to know something about the error that is being made, i.e., how well does $\frac{S_n}{n}$ approximate the mean $m$? The usual way in statistics makes use of confidence intervals: given an acceptable level of error $\alpha>0$, determine $a<b$ so that with probability $1-\alpha$ the random variable $\frac{\sqrt{n}}{\sigma}(\frac{S_n}{n}-m)$ lies in $[a,b]$. The reason for the scaling by $\frac{\sqrt{n}}{\sigma}$ is that, in general, one does not know the underlying distribution and hence the distribution of $\frac{S_n}{n}-m$. As a result, one is forced to rely upon the unit normal distribution to determine $a$ and $b$ (usually with $a=-b$). The reason for the use of the unit normal is that, by the central limit theorem (Theorem 6.7.4), its distribution is the limit (in the weak sense) of the distribution of $\frac{\sqrt{n}}{\sigma}(\frac{S_n}{n}-m)$ as $n\to\infty$, since the scaling by $\frac{\sqrt{n}}{\sigma}$ ensures that $\frac{\sqrt{n}}{\sigma}(\frac{S_n}{n}-m)$ has mean zero, variance one, and is in fact $\frac{1}{\sqrt{n}}$ times the sum of $n$ i.i.d. random variables $(Y_k)_{k\ge1}$ with mean zero and variance one: namely, $Y_k=\frac{1}{\sigma}(X_k-m)$, where $\sigma^2$ is the common variance of the original random variables $X_k$ (whose existence is an additional hypothesis). Having determined $a$ and $b$ to correspond to $\alpha$ by using the unit normal, it follows that $a\le\frac{\sqrt{n}}{\sigma}(\frac{S_n}{n}-m)\le b$ with an approximate probability of $1-\alpha$. Hence, $\frac{a\sigma}{\sqrt{n}}\le\frac{S_n}{n}-m\le\frac{b\sigma}{\sqrt{n}}$ with approximate probability $1-\alpha$. Another, but cruder, way to obtain estimates of error is by using Chebychev's inequality (Proposition 4.3.6) for the random variable $\frac{S_n}{n}-m$, as it gives an upper bound on the probability that the error is at least $\varepsilon$.

(3) As Chung [C] points out, the hypothesis that $(X_n)_{n\ge1}$ is i.i.d. can easily be weakened: provided the random variables are all in $L^2$, it suffices that they have mean zero, are orthogonal, and are bounded in $L^2$, i.e., their
$L^2$-norms are uniformly bounded. According to Chung, this result is due to Chebychev. In addition, with these hypotheses, Rajchman proved that the strong law holds (see Proposition 4.4.1).

(4) The natural moment condition to impose on the $X_n$ is that they are all in $L^1$, as then they have means. Khintchine's weak law (see Theorem 4.3.12) is the standard result in this case. While the random variables are required to be identically distributed, for independence one only requires it pairwise, i.e., for any two of the random variables.

Theorem 4.3.12. (Khintchine's weak law of large numbers) Let $(X_n)_{n\ge1}$ be a sequence of pairwise independent, identically distributed random variables on a probability space $(\Omega,\mathfrak{F},P)$. If the random variables are integrable (i.e., $\alpha=\int|x|Q(dx)<\infty$, where $Q$ is the common distribution), then $\frac{1}{n}\sum_{k=1}^{n}X_k\xrightarrow{P}m$ in probability, where $m=\int xQ(dx)$ is the common mean. (Note that random variables $X_n$, $n\ge1$, are said to be pairwise independent if any two are independent.)

Proof I. (see Feller [F1], pp. 246-248) One can assume $m=0$ (replace $X_n$ by $X_n-m$ if $m\ne0$). Let $\delta>0$ be chosen (its value will be fixed later). For each $n$ truncate $X_1,\dots,X_n$ at $\delta n$, i.e., for $1\le k\le n$, define
$$U_{k,n}(\omega)=\begin{cases} X_k(\omega) & \text{if } |X_k(\omega)|\le\delta n,\\ 0 & \text{if } |X_k(\omega)|>\delta n.\end{cases}$$
Let $V_{k,n}=X_k-U_{k,n}$, $1\le k\le n$.

Exercise 4.3.13. Let $Y$ and $Z$ be two random variables and let $X=Y+Z$. Show that

(1) $P[|X|>2\varepsilon]\le P[|Y|>\varepsilon]+P[|Z|>\varepsilon]$. [Hint: the triangle inequality (Exercise 1.1.16) implies that one of $|a|,|b|>\varepsilon$ if $|a+b|>2\varepsilon$.]

If $(Y_n)$ and $(Z_n)$ are two sequences of random variables, show that

(2) $Y_n+Z_n\xrightarrow{P}0$ in probability if both $Y_n\xrightarrow{P}0$ and $Z_n\xrightarrow{P}0$.

In view of Exercise 4.3.13, it suffices, given $\eta>0$, to show that for large $n$,

(1) $P\big[\big|\frac{1}{n}\sum_{k=1}^{n}U_{k,n}\big|>\varepsilon\big]\le\eta$, and

(2) $P\big[\big|\frac{1}{n}\sum_{k=1}^{n}V_{k,n}\big|>\varepsilon\big]\le\eta$.

First consider (1). Let $S_n'=\sum_{k=1}^{n}U_{k,n}$. Then
$$E[(S_n')^2]=\sum_{k=1}^{n}E[U_{k,n}^2]+2\sum_{i<j}E[U_{i,n}U_{j,n}]=\sum_{k=1}^{n}E[U_{k,n}^2]+2\sum_{i<j}E[U_{i,n}]E[U_{j,n}],$$
as the $U_{k,n}$, $1\le k\le n$, are pairwise independent by Exercise 3.4.5, since $U_{k,n}=\phi_n\circ X_k$, where
$$\phi_n(x)=\begin{cases} x & \text{if } |x|\le\delta n,\\ 0 & \text{if } |x|>\delta n.\end{cases}$$
Now $E[U_{k,n}^2]=\int_{\{|x|\le\delta n\}}x^2Q(dx)$ by Lemma 4.1.3, as $U_{k,n}^2=\phi_n^2\circ X_k$. Hence,
$$E[U_{k,n}^2]\le\delta n\int_{\{|x|\le\delta n\}}|x|Q(dx)\le\alpha\delta n, \quad\text{where } \alpha=\int|x|Q(dx).$$
Also, $E[U_{k,n}]=\int_{\{|x|\le\delta n\}}xQ(dx)=\int\phi_n(x)Q(dx)$, and $|\phi_n(x)|\le|x|=f(x)$, where $f\in L^1(Q)$ and $\phi_n(x)\to x$ as $n\to\infty$. Dominated convergence (Theorem 2.1.38) implies that
$$\int\phi_n(x)Q(dx)\to\int xQ(dx)=0 \text{ (by assumption) as } n\to\infty.$$
Hence, for large $n\ge n(\delta)$, $|E[U_{k,n}]|\le\alpha\delta$. Consequently,
$$E[(S_n')^2]\le n^2\alpha\delta+(n^2-n)\alpha^2\delta^2\le2n^2\alpha\delta,$$
as one may assume that $\alpha\delta\le1$. Hence, by Chebychev's inequality (Proposition 4.3.6),
$$\varepsilon^2P\Big[\frac{|S_n'|}{n}>\varepsilon\Big]\le\frac{1}{n^2}\int_{\{|S_n'|>n\varepsilon\}}(S_n')^2dP\le\frac{E[(S_n')^2]}{n^2}\le2\alpha\delta.$$
Choose a positive number $\eta$ and set $\delta=\delta(\varepsilon,\eta)=\min\{\frac{1}{\alpha},\frac{\eta\varepsilon^2}{2\alpha}\}$. Then, for $n\ge n(\delta)$,
$$(\dagger)\qquad P\Big[\frac{|S_n'|}{n}>\varepsilon\Big]\le\frac{2\alpha\delta}{\varepsilon^2}\le\eta,$$
as long as the truncation parameter $\delta\le\delta(\varepsilon,\eta)$.

It remains to prove (2). Let $S_n''=\sum_{k=1}^{n}V_{k,n}$. Now
$$\Big\{\omega\,\Big|\,\frac{|S_n''|}{n}>\varepsilon\Big\}\subset\{S_n''\ne0\}\subset\cup_{k=1}^{n}\{V_{k,n}\ne0\},$$
and so
$$P\Big[\frac{|S_n''|}{n}>\varepsilon\Big]\le\sum_{k=1}^{n}P\{V_{k,n}\ne0\}=n\int_{\{|x|>\delta n\}}Q(dx)\le\frac{1}{\delta}\int_{\{|x|>\delta n\}}|x|Q(dx).$$
As $n\to\infty$, this term tends to zero either by dominated convergence or by Exercise 2.4.2, as $E\mapsto\int_E|x|Q(dx)$ is a measure on $\mathfrak{B}(\mathbb{R})$ and $\cap_{n=1}^{\infty}\{|x|>\delta n\}=\emptyset$. $\square$
Remark. This proof does not show that either $\frac{1}{n}\sum_{k=1}^{n}U_{k,n}\xrightarrow{\text{a.s.}}0$ or $\frac{1}{n}\sum_{k=1}^{n}V_{k,n}\xrightarrow{\text{a.s.}}0$ as $n\to\infty$, since the truncation procedure changes with $n$ and so affects these two expressions.

Proof II. (see Chung [C], pp. 109-110) The subtlety of Proof I lies in the idea of repeated truncation, i.e., for each $n$, one truncates $X_1,\dots,X_n$ at height $\delta n$. As in Proof I, one can assume $E[X_n]=0$ for all $n$. In this proof, each variable is truncated once. Let
$$Y_n(\omega)=\begin{cases} X_n(\omega) & \text{if } |X_n(\omega)|\le n,\\ 0 & \text{otherwise.}\end{cases}$$

Step (1). Let $A_n=\{X_n\ne Y_n\}$. If $\sum_{n=1}^{\infty}P(A_n)<+\infty$, then by the Borel-Cantelli lemma (Proposition 4.1.29(1)), a.s. $Y_n(\omega)=X_n(\omega)$ for sufficiently large $n$. Now $P(A_n)=P(\{|X_n|>n\})=\int_{\{|x|>n\}}Q(dx)$. Consider $Q\times$(counting measure) on $\mathbb{R}\times\mathbb{N}$. Let $A=\cup_{n=1}^{\infty}\{(-\infty,-n]\cup[n,+\infty)\}\times\{n\}$. By Fubini's theorem (Theorem 3.3.16), $\sum_{n=1}^{\infty}P(A_n)\le\sum_{n=1}^{\infty}\big(\int_{\{|x|\ge n\}}Q(dx)\big)<+\infty$, since
$$\int 1_A\,dQ\times dn=\sum_{n=1}^{\infty}\Big(\int_{\{|x|\ge n\}}Q(dx)\Big)=\sum_{n=1}^{\infty}n\int_{\{n\le|x|<n+1\}}Q(dx)\le\int|x|Q(dx)<+\infty.$$
Alternatively, one may use Exercise 4.7.1(6) to show that $\sum_{n=1}^{\infty}P(A_n)<\infty$. Hence, a.s. $\frac{1}{n}\sum_{k=1}^{n}\{X_k(\omega)-Y_k(\omega)\}\to0$, since a.s. there are only a finite number of non-zero terms $X_k(\omega)-Y_k(\omega)$, where the number of terms depends on $\omega$. Therefore, if $\frac{1}{n}\sum_{k=1}^{n}Y_k\xrightarrow{P}0$, then $\frac{1}{n}\sum_{k=1}^{n}X_k\xrightarrow{P}0$ by Exercise 4.3.13(2), i.e., the weak law of large numbers holds.

Step (2). Here one shows that $\frac{1}{n}\sum_{k=1}^{n}Y_k\xrightarrow{P}0$. Let $m_k=E[Y_k]=\int_{\{|x|\le k\}}xQ(dx)$. Then, by the theorem of dominated convergence (Theorem 2.1.38), $m_k\to m=0$ as $k\to\infty$.

Exercise 4.3.14. Let $m_n\to L$ as $n\to\infty$. Then $\frac{m_1+m_2+\cdots+m_n}{n}\to L$ as $n\to\infty$. In other words, if $m_n$ converges to $L$, then it converges to $L$ in the sense of Cesàro. [Hint: choose $N$ so large that $|m_n-L|<\varepsilon$ if $n\ge N$. Observe that $\frac{m_1+m_2+\cdots+m_n}{n}=\frac{m_1+m_2+\cdots+m_N}{n}+\frac{m_{N+1}+\cdots+m_n}{n}$.]

Exercise 4.3.14 shows that $\frac{1}{n}\sum_{k=1}^{n}E[Y_k]\to0$. Thus, to show that $\frac{1}{n}\sum_{k=1}^{n}Y_k\xrightarrow{P}0$, it suffices to show that $T_n=\frac{1}{n}\sum_{k=1}^{n}\{Y_k-E[Y_k]\}\xrightarrow{P}0$ as $n\to\infty$. To do this, one estimates the variance $E[T_n^2]$ of the expression $T_n$, making use of the independence of the centered random variables $Y_k-E[Y_k]$. One has
$$E[T_n^2]=\frac{1}{n^2}\sum_{k=1}^{n}E[\{Y_k-E[Y_k]\}^2]+\frac{2}{n^2}\sum_{k<j}E[(Y_k-E[Y_k])(Y_j-E[Y_j])]$$
$$\le\frac{1}{n^2}\sum_{k=1}^{n}\sigma^2(Y_k)\le\frac{1}{n^2}\sum_{k=1}^{n}E[Y_k^2]=\frac{1}{n^2}\sum_{k=1}^{n}\int_{\{|x|\le k\}}x^2Q(dx).$$
The argument now continues as in Chung [C]. Choose a sequence $(a_n)_{n\ge1}$ of integers tending to $+\infty$ such that $\frac{a_n}{n}\to0$; e.g., $a_n=[\log n]+1$, where $[x]$ is the greatest integer $\le x$. Then,
$$(*)\qquad \sum_{k=1}^{n}\int_{\{|x|\le k\}}x^2Q(dx)=\sum_{k=1}^{a_n}\int_{\{|x|\le k\}}x^2Q(dx)+\sum_{k=a_n+1}^{n}\int_{\{|x|\le k\}}x^2Q(dx).$$
The second term
$$\sum_{k=a_n+1}^{n}\int_{\{|x|\le k\}}x^2Q(dx)=\sum_{k=a_n+1}^{n}\Big[\int_{\{|x|\le a_n\}}x^2Q(dx)+\int_{\{a_n<|x|\le k\}}x^2Q(dx)\Big]\le\sum_{k=a_n+1}^{n}\Big[a_n\int_{\{|x|\le a_n\}}|x|Q(dx)+k\int_{\{a_n<|x|\le k\}}|x|Q(dx)\Big],$$
and the first term
$$\sum_{k=1}^{a_n}\int_{\{|x|\le k\}}x^2Q(dx)\le\sum_{k=1}^{a_n}\int_{\{|x|\le a_n\}}a_n|x|Q(dx).$$
Combining this with the first expression in $(*)$, it follows that
$$\sum_{k=1}^{n}\int_{\{|x|\le k\}}x^2Q(dx)\le na_n\int|x|Q(dx)+n^2\int_{\{|x|>a_n\}}|x|Q(dx).$$
Hence,
$$E[T_n^2]\le\frac{a_n}{n}\Big[\int|x|Q(dx)\Big]+\int_{\{|x|>a_n\}}|x|Q(dx).$$
The first term goes to zero as $n\to\infty$ by the choice of the sequence $(a_n)$. The second goes to zero as $\cap_{n=1}^{\infty}\{a_n<|x|\}=\emptyset$, and $E\mapsto\int_E|x|Q(dx)$ is a measure on $\mathfrak{B}(\mathbb{R})$ by Exercise 2.4.2, or by dominated convergence. $\square$

A very famous application of the weak law of large numbers is S. Bernstein's proof of the Weierstrass approximation theorem on $[0,1]$. One form of the Weierstrass approximation theorem states that every continuous function on $[0,1]$ can be uniformly approximated by a polynomial. Bernstein's proof gives an explicit sequence of polynomials (constructed from the function $f$) that converges uniformly (Definition 3.6.15) to $f$.

Theorem 4.3.15. (S. Bernstein) Let $f$ be a continuous function on $[0,1]$. Then the sequence of polynomials $(P_n)_{n\ge1}$ converges uniformly to $f$, where
$$P_n(x)=\sum_{k=0}^{n}f\Big(\frac{k}{n}\Big)\binom{n}{k}x^k(1-x)^{n-k}$$
(the $n$th Bernstein polynomial).
Proof. Let $x\in[0,1]$ and let $(X_n)_{n\ge1}$ be a sequence of i.i.d. random variables on a probability space $(\Omega,\mathfrak{F},P)$ with $P[X_n=1]=x$ and $P[X_n=0]=1-x$. The weak law of large numbers in Corollary 4.3.11 applies (as does the far simpler strong law of Borel (Exercise 4.7.19)), and so $\frac{S_n}{n}=\frac{1}{n}\sum_{i=1}^{n}X_i\to x$ in probability. Since, by Exercise 3.3.21, $P[S_n=k]=\binom{n}{k}x^k(1-x)^{n-k}$, it follows from the hints in Exercise 4.1.11 that $E[f(\frac{S_n}{n})]=P_n(x)$. Now
$$|P_n(x)-f(x)|\le E\Big[\Big|f\Big(\frac{S_n}{n}\Big)-f(x)\Big|\Big]=\int_{\{|\frac{S_n}{n}-x|>\varepsilon\}}\Big|f\Big(\frac{S_n}{n}\Big)-f(x)\Big|dP+\int_{\{|\frac{S_n}{n}-x|\le\varepsilon\}}\Big|f\Big(\frac{S_n}{n}\Big)-f(x)\Big|dP.$$
The first integral is at most $2\|f\|_\infty P[|\frac{S_n}{n}-x|>\varepsilon]$, which tends to zero as $n$ tends to $\infty$. It does so uniformly in $x$, since by Chebychev's inequality (Proposition 4.3.6) and independence,
$$P\Big[\Big|\frac{S_n}{n}-x\Big|>\varepsilon\Big]\le\frac{1}{\varepsilon^2n^2}E\Big[\Big(\sum_{i=1}^{n}(X_i-x)\Big)^2\Big]=\frac{x(1-x)}{n\varepsilon^2}\le\frac{1}{4n\varepsilon^2}$$
(see Exercise 4.1.11 for the last equality). Since $f$ is uniformly continuous (see Property 2.3.8(3)), for any $\delta>0$ there exists $\varepsilon>0$ such that $|f(x)-f(y)|<\delta$ if $|x-y|\le\varepsilon$, $0\le x,y\le1$. Hence, the second integral is at most $\delta$. $\square$
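The Bernstein polynomials are easy to compute directly. The sketch below is an illustration only (the test function $f(x)=|x-\frac12|$ and the grid are arbitrary choices): it evaluates $P_n$ and prints the uniform error, which decreases as $n$ grows, as Theorem 4.3.15 predicts.

    import numpy as np
    from math import comb

    f = lambda x: np.abs(x - 0.5)        # a continuous (non-smooth) test function
    x = np.linspace(0.0, 1.0, 1001)

    def bernstein(n, x):
        # P_n(x) = sum_k f(k/n) C(n,k) x^k (1-x)^(n-k)
        return sum(f(k/n) * comb(n, k) * x**k * (1-x)**(n-k) for k in range(n+1))

    for n in [10, 50, 200]:
        print(n, np.max(np.abs(bernstein(n, x) - f(x))))   # uniform error -> 0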
4. KOLMOGOROV'S INEQUALITY AND THE STRONG LAW OF LARGE NUMBERS

If a sequence of random variables converges a.s. to zero (say), then it converges in probability to zero (Proposition 4.3.2). Hence, convergence a.s. is "stronger" than convergence in probability. The following is a simple strong law. It is a modification of an even simpler argument that may be used to prove Borel's strong law for i.i.d. Bernoulli random variables (see Exercise 4.7.19).

Proposition 4.4.1. Let $(X_n)_{n\ge1}$ be a sequence of random variables in $L^2(\Omega,\mathfrak{F},P)$. Assume that they are orthogonal, with mean zero, and are bounded in $L^2$, i.e., $E[X_n]=0$ for all $n\ge1$, $E[X_mX_n]=0$ if $m\ne n$, and for some $M>0$ one has $E[X_n^2]\le M$ for all $n\ge1$. Then $\frac{S_n}{n}\xrightarrow{\text{a.s.}}0$.

Proof. Orthogonality of the sequence implies that the square of the $L^2$-norm of $S_n$ is dominated by $Mn$, as $\|\sum_{k=1}^{n}X_k\|_2^2=\sum_{k=1}^{n}\|X_k\|_2^2$. Chebychev's inequality implies that if one looks at the subsequence $(S_{k^2})_{k\ge1}$, then for any $\varepsilon>0$ one has
$$\sum_{k=1}^{\infty}P\Big[\frac{|S_{k^2}|}{k^2}>\varepsilon\Big]\le\sum_{k=1}^{\infty}\frac{Mk^2}{\varepsilon^2k^4}<\infty.$$
By the Borel-Cantelli lemma (Proposition 4.1.29), $\frac{S_{k^2}}{k^2}\xrightarrow{\text{a.s.}}0$. Given that this subsequence converges, one needs only show that the maximum, for $k^2\le n<(k+1)^2$, of the difference $k^{-2}|S_n-S_{k^2}|$ tends to $0$ a.s., as
$$\frac{|S_n|}{n}\le\frac{|S_n|}{k^2}\le\frac{|S_{k^2}|}{k^2}+\frac{|S_n-S_{k^2}|}{k^2} \quad\text{if } k^2\le n<(k+1)^2.$$
For each $k$, let $M_k\overset{def}{=}\max_{k^2\le n<(k+1)^2}|S_n-S_{k^2}|$. Then
$$M_k^2\le\sum_{n=k^2+1}^{k^2+2k}|S_n-S_{k^2}|^2 \quad\text{and so}\quad \|M_k\|_2^2\le\sum_{n=k^2+1}^{k^2+2k}\|S_n-S_{k^2}\|_2^2\le4k^2M.$$
One now uses Chebychev's inequality again in the same way to get
$$\sum_{k=1}^{\infty}P\Big[\frac{M_k}{k^2}>\varepsilon\Big]\le\sum_{k=1}^{\infty}\frac{\|M_k\|_2^2}{\varepsilon^2k^4}\le\sum_{k=1}^{\infty}\frac{4M}{\varepsilon^2k^2}<\infty.$$
Hence, by the Borel-Cantelli lemma (Proposition 4.1.29), it follows that $\frac{M_k}{k^2}\xrightarrow{\text{a.s.}}0$, and so $\frac{S_n}{n}\xrightarrow{\text{a.s.}}0$. $\square$

Remark. The technique used in this proof of looking at subsequences is used in Exercise 4.7.19 to prove Borel's strong law, which itself is a special case of this result of Rajchman (see also Loève [L2], p. 19). The argument there is simpler because the random variables are uniformly bounded, which of course implies that they are bounded in $L^2$.

The classic strong law, which only requires first absolute moments, is due to Kolmogorov. The independence condition is more restrictive than in Khintchine's weak law, as it requires the random variables to be i.i.d. and not merely identically distributed and pairwise independent.

Theorem 4.4.2. (Kolmogorov's strong law) Let $(X_n)_{n\ge1}$ be a sequence of i.i.d. random variables. Let $Q$ be their common distribution. Assume $\int|x|Q(dx)<\infty$, i.e., the $X_n$ are all integrable. Then $\frac{S_n}{n}\to m$ a.s., where $m=\int xQ(dx)$ and $S_n=\sum_{k=1}^{n}X_k$.

As in the earlier discussion of the weak law, by centering the random variables, one may assume $m=0$. In order to control the behavior of the sequence $\frac{S_n(\omega)}{n}$ for all large values of $n$, as $\omega$ varies, one uses the following generalization of Chebychev's inequality.
Theorem 4.4.3. (Kolmogorov's inequality) Let $(X_k)_{1\le k\le n}$ be a finite collection of independent (not necessarily identically distributed) random variables in $L^2$ with $E[X_k]=0$ for all $k$. Then,
$$P\Big[\max_{1\le k\le n}|S_k|>\varepsilon\Big]\le\frac{\sigma^2(S_n)}{\varepsilon^2}=\frac{E[S_n^2]}{\varepsilon^2}.$$

Proof. Fix $\omega$, and let $\nu(\omega)$ denote the first time $\ell$ that $|S_\ell(\omega)|>\varepsilon$. This defines a function $\nu$ of $\omega$ on
$$A=\{\omega\mid\max_{1\le k\le n}|S_k(\omega)|>\varepsilon\}=\{\omega\mid\nu(\omega)\le n\}.$$
Further, $\{\omega\mid\nu(\omega)\le k\}\in\mathfrak{F}_k=\sigma(\{X_i\mid1\le i\le k\})$. Consequently, $A_k=\{\nu=k\}\in\mathfrak{F}_k$. As a result, using independence, one has
$$\int_{A_k}S_k(S_n-S_k)dP=E[S_k1_{A_k}(S_n-S_k)]=E[S_k1_{A_k}]E[S_n-S_k]=0,$$
since $S_k1_{A_k}$ is $\mathfrak{F}_k$-measurable and $S_n-S_k$ is measurable with respect to $\sigma(\{X_j\mid k+1\le j\le n\})$, with $E[S_n-S_k]=0$. Now $A=\cup_{1\le k\le n}A_k$ and
$$\sigma^2(S_n)=E[S_n^2]\ge\int_A S_n^2dP=\sum_{k=1}^{n}\int_{A_k}[S_k+(S_n-S_k)]^2dP\ge\sum_{k=1}^{n}\int_{A_k}S_k^2dP$$
in view of what has been proved. Hence,
$$\sigma^2(S_n)\ge\sum_{k=1}^{n}\int_{A_k}S_k^2dP\ge\varepsilon^2\sum_{k=1}^{n}P(A_k)=\varepsilon^2P(A). \quad\square$$

Remarks. (1) For random variables in $L^2$, Chebychev's inequality states that
$$P[|S_n|>\varepsilon]\le\frac{\sigma^2(S_n)}{\varepsilon^2}=\frac{E[S_n^2]}{\varepsilon^2}.$$
As $\{|S_n|>\varepsilon\}\subset\{\max_{1\le k\le n}|S_k|>\varepsilon\}$, the inequality of Kolmogorov is stronger than that of Chebychev (Proposition 4.3.6).

(2) Kolmogorov's inequality is a special case of a more general one for martingales due to Doob (see Theorem 5.4.20) that involves stopping times.
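A small Monte Carlo experiment (an illustration only; the sample size, threshold, and seed are arbitrary choices) compares the two sides of Kolmogorov's inequality for sums of independent, mean-zero, $\pm1$ variables, for which $\sigma^2(S_n)=n$.

    import numpy as np

    rng = np.random.default_rng(1)
    n, trials, eps = 100, 20000, 15.0
    X = rng.choice([-1.0, 1.0], size=(trials, n))   # independent, mean 0, variance 1
    S = np.cumsum(X, axis=1)                        # partial sums S_1, ..., S_n

    lhs = np.mean(np.max(np.abs(S), axis=1) > eps)  # estimate of P[max_k |S_k| > eps]
    print(lhs, n / eps**2)                          # bounded by sigma^2(S_n)/eps^2 = n/eps^2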
Proposition 4.4.4. (Kolmogorov's criterion) (See Feller [F1].) Let $(X_n)_{n\ge1}$ be a sequence of independent (but not necessarily identically distributed) random variables with finite second moments. If $\sum_{n=1}^{\infty}\frac{\sigma^2(X_n)}{n^2}<\infty$, then $\frac{S_n-E[S_n]}{n}\to0$ a.s.

Proof. To begin with, it can be assumed as usual that $E[X_n]=0$ for all $n$. Let $\varepsilon>0$ and let
$$A_\nu=\Big\{\omega\,\Big|\,\text{there exists } n\in(2^{\nu-1},2^{\nu}] \text{ with } \frac{|S_n(\omega)|}{n}>\varepsilon\Big\}.$$
Assume $\sum_{\nu=1}^{\infty}P(A_\nu)<+\infty$. Then $P(A_\nu\text{ i.o.})=0$ by the Borel-Cantelli lemma (Proposition 4.1.29(1)). Hence, for a given $\varepsilon>0$, a.s. there exists $\nu=\nu(\omega)$ such that $n>2^{\nu(\omega)-1}$ implies that $\frac{|S_n(\omega)|}{n}\le\varepsilon$. Since $\varepsilon>0$ is arbitrary, this shows that a.s. the sequence $\frac{S_n(\omega)}{n}\to0$: consider a sequence $\varepsilon_k=\frac{1}{k}$, say, that tends to zero as $k\to\infty$.

Now Kolmogorov's inequality (Theorem 4.4.3) implies that $\sum_{\nu=1}^{\infty}P(A_\nu)<+\infty$. To see this, first observe that Theorem 4.4.3, applied to $Y_1=S_{2^{\nu-1}+1}$, $Y_k=X_{2^{\nu-1}+k}$, $2\le k\le2^{\nu}-2^{\nu-1}$, implies that
$$P(A_\nu)=P\Big[\max_{2^{\nu-1}<n\le2^{\nu}}\frac{|S_n|}{n}>\varepsilon\Big]\le P\Big[\max_{2^{\nu-1}<n\le2^{\nu}}|S_n|>\varepsilon2^{\nu-1}\Big]\le\frac{\sigma^2(S_{2^{\nu}})}{\varepsilon^22^{2(\nu-1)}}.$$
Therefore, since $\sigma^2(S_n)=\sum_{k=1}^{n}\sigma^2(X_k)$,
$$\sum_{\nu=1}^{\infty}P(A_\nu)\le\frac{4}{\varepsilon^2}\sum_{\nu=1}^{\infty}\frac{1}{2^{2\nu}}\Big[\sum_{k=1}^{2^{\nu}}\sigma^2(X_k)\Big]=\frac{4}{\varepsilon^2}\sum_{k=1}^{\infty}\sigma^2(X_k)\sum_{2^{\nu}\ge k}\frac{1}{2^{2\nu}}\le\frac{16}{3\varepsilon^2}\sum_{k=1}^{\infty}\frac{\sigma^2(X_k)}{k^2}<\infty,$$
as $\sum_{2^{\nu}\ge k}\frac{1}{2^{2\nu}}\le\frac{4}{3}\cdot\frac{1}{2^{2\nu_0}}\le\frac{4}{3}\cdot\frac{1}{k^2}$ if $2^{\nu_0}\ge k>2^{\nu_0-1}$. $\square$

Remark. Kolmogorov's criterion is trivially satisfied if the sequence is bounded in $L^2$. It therefore extends the simpler result (Proposition 4.4.1) of Rajchman when applied to an i.i.d. sequence $(X_n)_{n\ge1}$ whose common mean is zero. Note that the criterion is a variation of the fact that $\sum_{k=1}^{\infty}\frac{1}{k^2}<+\infty$.
Proof of Theorem 4.4.2. (see Feller [F1], p. 260) As usual, one can assume the mean $m=0$. Let
$$Y_n(\omega)=\begin{cases} X_n(\omega) & \text{if } |X_n(\omega)|\le n,\\ 0 & \text{otherwise.}\end{cases}$$
Since the $(X_n)_{n\ge1}$ are identically distributed, $\sum_{n=1}^{\infty}P(\{X_n\ne Y_n\})<\infty$. Hence $\frac{S_n}{n}\to0$ a.s. if $\frac{1}{n}\sum_{k=1}^{n}Y_k\to0$ a.s. (see Step (1) in Proof II of Khintchine's weak law (Theorem 4.3.12)).

Assume the sequence $(Y_n)_{n\ge1}$ satisfies Kolmogorov's criterion (Proposition 4.4.4). Then $\frac{1}{n}\sum_{k=1}^{n}Y_k-\frac{1}{n}\sum_{k=1}^{n}E[Y_k]\to0$ a.s. It follows from Exercise 4.3.14 that $\frac{1}{n}\sum_{k=1}^{n}E[Y_k]\to0$, as $E[Y_n]\to0$ when $n\to\infty$. Hence, $\frac{1}{n}\sum_{k=1}^{n}Y_k\to0$ a.s.

It therefore remains to verify the hypothesis of Proposition 4.4.4 for the sequence $(Y_n)_{n\ge1}$, in other words, that $\sum_{n=1}^{\infty}\frac{\sigma^2(Y_n)}{n^2}<\infty$. Now
$$\sigma^2(Y_n)\le E[Y_n^2]=\int_{\{|x|\le n\}}x^2Q(dx)=\sum_{k=1}^{n}\int_{\{k-1<|x|\le k\}}x^2Q(dx)\le\sum_{k=1}^{n}k\int_{\{k-1<|x|\le k\}}|x|Q(dx).$$
Let $\alpha_k=\int_{\{k-1<|x|\le k\}}|x|Q(dx)$. Then,
$$\sum_{n=1}^{\infty}\frac{\sigma^2(Y_n)}{n^2}\le\sum_{n=1}^{\infty}\frac{1}{n^2}\sum_{k=1}^{n}k\alpha_k=\sum_{k=1}^{\infty}k\alpha_k\Big[\sum_{n=k}^{\infty}\frac{1}{n^2}\Big]\le2\sum_{k=1}^{\infty}\alpha_k=2\int|x|Q(dx)<\infty,$$
where the last inequality holds since, for $k\ge2$, $\sum_{n=k}^{\infty}\frac{1}{n^2}\le\int_{k-1}^{\infty}\frac{dx}{x^2}=\frac{1}{k-1}\le\frac{2}{k}$, and $\sum_{n=1}^{\infty}\frac{1}{n^2}<2$. $\square$

5. UNIFORM INTEGRABILITY AND TRUNCATION*

In the proofs of the laws of large numbers, truncation was used to reduce the study of sums of independent random variables to the case of bounded random variables. Truncation cuts the "tail" off the random variable, i.e., the "tail" of its distribution. Given a random variable $X$ and a positive constant $c$, define its truncation $X_c$ at height $c$ by
$$X_c(\omega)=\begin{cases} X(\omega) & \text{if } |X(\omega)|\le c,\\ 0 & \text{if } |X(\omega)|>c.\end{cases}$$
Lemma 4.5.1. Let $X\in L^1$ and $c>0$. Then

(1) $\|X-X_c\|_1=\int_{\{|X|>c\}}|X|dP$, and

(2) $\int_{\{|X|>c\}}|X|dP\to0$ as $c\to\infty$.

Proof. (1) is obvious, and (2) follows from the dominated convergence theorem (Theorem 2.1.38), since $1_{\{|X|>c\}}|X|\to0$ as $c\to\infty$. $\square$

Remarks. (1) There is a corresponding $L^p$ result, for all $p$, $1\le p<\infty$. In what follows, things will be stated for $L^1$, and the reader is left to formulate the corresponding $L^p$ statement (see Chung [C], p. 97 and Exercises 4.7.14, 4.7.15, and 4.7.16).

(2) It is also useful to use a Lipschitz truncation at height $c>0$, namely $\phi_c\circ X$, where
$$\phi_c(x)=\begin{cases} -c & \text{if } x<-c,\\ x & \text{if } -c\le x\le c,\\ c & \text{if } c<x.\end{cases}$$

(3) The function $\phi_c$ is the "natural" extension of the function $f(x)=x$, $-c\le x\le c$, in the sense of Proposition 2.8.13: one merely extends with the constant value given at the endpoint.

For large $c>0$, this Lipschitz truncation differs very little from the previous truncation $X_c$, as the following exercise shows.

Exercise 4.5.2. Define
$$\tau_c(x)=\begin{cases} -c & \text{if } x<-c,\\ 0 & \text{if } -c\le x\le c,\\ c & \text{if } c<x.\end{cases}$$
Show that

(1) $|\phi_c(x)-\phi_c(y)|\le|x-y|$ for all $x,y\in\mathbb{R}$;

(2) $X\in L^1$ implies $\phi_c\circ X\in L^1$ [Hint: $\phi_c(0)=0$ and so $|\phi_c(x)|\le|x|$.];

(3) $X_c+\tau_c\circ X=\phi_c\circ X$; and

(4) $\|X_c-\phi_c\circ X\|_1\le cP[|X|>c]\le\int_{\{|X|>c\}}|X|dP$.

Given a sequence $(X_n)_{n\ge1}$ of random variables, truncating all the variables at a height $c>0$ gives an approximation in $L^1$ to each term $X_n$ of the sequence. When is this approximation uniform? One may also ask this question for the Lipschitz truncation at height $c$. The answer to these questions involves the concept of uniform integrability.

Definition 4.5.3. A sequence $(X_n)_{n\ge1}$ of random variables in $L^1(\Omega,\mathfrak{F},P)$ is said to be uniformly integrable if, for $\varepsilon>0$, there is a constant $c=c(\varepsilon)>0$ such that
$$\int_{\{|X_n|>c\}}|X_n|dP\le\varepsilon \quad\text{for all } n\ge1.$$
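The definition can be probed numerically. The following sketch is an illustration only (the exponential example, the sequence $X_n=n1_{[0,1/n]}$ on $([0,1],dx)$, and the Monte Carlo estimates are assumptions made for the demonstration): for an i.i.d. integrable sequence a single tail bound works for every $n$, whereas for $X_n=n1_{[0,1/n]}$ the tail integral at any fixed $c$ equals $1$ once $n>c$, so no constant $c(\varepsilon)$ can serve all $n$.

    import numpy as np

    rng = np.random.default_rng(2)

    # (a) i.i.d. Exp(1): E[|X|; |X| > c] = (1+c)e^{-c} for every n, so one c works
    Z = rng.exponential(1.0, size=10**6)
    for c in [1.0, 5.0, 10.0]:
        print(c, np.mean(np.where(Z > c, Z, 0.0)))   # -> 0 as c grows

    # (b) X_n = n on [0, 1/n]: E[|X_n|; |X_n| > c] = 1 whenever n > c
    u = rng.random(10**6)                            # uniform samples on [0,1)
    for n in [10, 100, 1000]:
        Xn = np.where(u < 1.0/n, float(n), 0.0)
        print(n, np.mean(np.where(Xn > 5.0, Xn, 0.0)))  # stuck near 1: not uniformly integrable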
It follows from Lemma 4.5.1(1) and Exercise 4.5.2 that uniform integrability and uniform approximation by truncation at a large level are equivalent:

Lemma 4.5.4. Let $(X_n)_{n\ge1}$ be a sequence in $L^1$. The following are equivalent:

(1) for large $c$, the truncation of the sequence $(X_n)_{n\ge1}$ at height $c$ gives a bounded sequence that approximates the original sequence uniformly in $L^1$;

(2) for large $c$, the Lipschitz truncation of the sequence $(X_n)_{n\ge1}$ at height $c$ gives a bounded sequence that approximates the original sequence uniformly in $L^1$; and

(3) the sequence $(X_n)_{n\ge1}$ is uniformly integrable.

Remark. If $(X_n)_{n\ge1}\subset L^p$, the truncations at height $c>0$ approximate uniformly in $L^p$ if and only if $(|X_n|^p)_{n\ge1}$ is uniformly integrable. One could refer to this property by saying $(X_n)_{n\ge1}$ is uniformly integrable in $L^p$, but this is not standard terminology.

The point of introducing uniformly integrable sequences is that it enables one to reduce convergence questions to questions about bounded sequences. In particular, it gives insight (Remark 4.5.10) into exactly which sequences converge in $L^1$ and thereby extends the theorem of dominated convergence (see Example 4.5.11).

In order to characterize uniformly integrable sequences, it is important to prove the following continuity lemma.

Lemma 4.5.5. Let $X\in L^1$ and $A\in\mathfrak{F}$. Then
$$\int_A|X|dP\to0 \quad\text{as } P(A)\to0.$$

Proof. If $c>0$,
$$\int_A|X|dP=\int_{A\cap\{|X|\le c\}}|X|dP+\int_{A\cap\{|X|>c\}}|X|dP\le cP(A)+\int_{\{|X|>c\}}|X|dP.$$
The result follows from Lemma 4.5.1. $\square$

This lemma states that the measure $\nu(A)=\int_A|X|dP$ is continuous with respect to $P$ in the sense that if $\varepsilon>0$ then there is a $\delta>0$ such that $\nu(A)<\varepsilon$ if $P(A)<\delta$. Hence, given the Radon-Nikodym theorem (Theorem 2.7.19), it follows that if $\nu$ is absolutely continuous with respect to $P$ (Definition 2.7.16), then $\nu$ is continuous with respect to $P$ in the above sense. However, this observation does not depend upon the Radon-Nikodym theorem, as the following version of Lemma 4.5.5 shows.
Lemma 4.5.5*. (see Halmos [H1], p. 125) Let $\nu$ be a finite measure on $(\Omega, \mathcal{F})$ that is absolutely continuous with respect to a probability $P$. If $\varepsilon > 0$, then there exists a $\delta > 0$ such that $\nu(A) < \varepsilon$ if $P(A) < \delta$.

Proof. Assume the statement is false. Then there is a positive number $\varepsilon > 0$ and a sequence of sets $A_n \in \mathcal{F}$ such that, for all $n \ge 1$, (i) $\nu(A_n) \ge \varepsilon$ and (ii) $P(A_n) < \frac{1}{2^n}$. Let $B_m = \cup_{n\ge m} A_n$. Then, for all $m \ge 1$, one has $\nu(B_m) \ge \varepsilon$ and $P(B_m) \le \sum_{n\ge m} P(A_n) < (\frac{1}{2})^{m-1}$. Clearly, $B_m \supset B_{m+1}$, and so if $B = \cap_{m=1}^\infty B_m$, then (i) $\nu(B) \ge \varepsilon$ and (ii) $P(B) = 0$. This contradicts the hypothesis that $\nu$ is absolutely continuous with respect to $P$. $\square$

Combining this lemma with the definition of uniform integrability gives the following characterization of uniform integrability.

Proposition 4.5.6. $(X_n)_{n\ge 1}$ is uniformly integrable if and only if
(1) there is a constant $M$ such that $E[|X_n|] \le M$ for all $n \ge 1$, and
(2) $\int_A |X_n|\,dP \to 0$ uniformly in $n$ as $P(A) \to 0$.

Proof. Assume that the sequence is uniformly integrable. Then $E[|X_n|] = \int_{\{|X_n|\le c\}} |X_n|\,dP + \int_{\{|X_n|>c\}} |X_n|\,dP \le c + 1$ if $c = c(1)$ in Definition 4.5.3. This establishes (1). To show (2), observe that
$$\int_A |X_n|\,dP = \int_{A\cap\{|X_n|\le c\}} |X_n|\,dP + \int_{A\cap\{|X_n|>c\}} |X_n|\,dP \le cP(A) + \int_{\{|X_n|>c\}} |X_n|\,dP.$$
Hence, given $\varepsilon > 0$, if $P(A) < \varepsilon/2c$ and $\int_{\{|X_n|>c\}} |X_n|\,dP < \varepsilon/2$, then one has $\int_A |X_n|\,dP < \varepsilon$ for all $n$.

Conversely, by (1), $cP[|X_n| > c] \le M$ and so $P[|X_n| > c] \to 0$ uniformly in $n$ as $c \to \infty$. Property (2) then implies that $(X_n)_{n\ge 1}$ is uniformly integrable. More explicitly, given $\varepsilon > 0$, condition (2) states that there is a $\delta > 0$ such that, for all $n \ge 1$, $\int_A |X_n|\,dP < \varepsilon$ if $P(A) < \delta$. Since, for all $n$, $P[|X_n| > c] < \delta$ if $c > \frac{M}{\delta}$, the result follows by taking $A = \{|X_n| > c\}$. $\square$

Proposition 4.5.7. If $X_n \xrightarrow{L^1} X$, then (1) $X_n \xrightarrow{P} X$, and (2) $(X_n)_{n\ge 1}$ is uniformly integrable.

Proof. (1) is a repetition of Corollary 4.3.7. To prove (2), note that
$$\left| \int_A |X_n|\,dP - \int_A |X|\,dP \right| \le \int_A |X_n - X|\,dP \le \| X_n - X \|_1 .$$
Let $\varepsilon > 0$, and choose $\delta_0$ such that $\int_A |X|\,dP < \varepsilon/2$ if $P(A) < \delta_0$. Then $\int_A |X_n|\,dP < \varepsilon$, $n > n(\varepsilon)$, provided $\| X_n - X \|_1 < \varepsilon/2$ for $n > n(\varepsilon)$ and $P(A) < \delta_0$. Choose $\delta \le \delta_0$ such that $\int_A |X_n|\,dP < \varepsilon$, $1 \le n \le n(\varepsilon)$, if $P(A) < \delta$. Then $P(A) < \delta$ implies $\int_A |X_n|\,dP < \varepsilon$ for all $n \ge 1$. Since $\| X_n \|_1 \to \| X \|_1$, the result follows from Proposition 4.5.6. $\square$

As stated in Exercise 4.3.8, convergence in probability does not necessarily imply convergence in $L^1$. However, as shown in Theorem 4.5.8, for uniformly bounded random variables, it does. Consequently, for a uniformly bounded sequence $(X_n)_{n\ge 1}$ of random variables $X_n$, the sequence converges in $L^1$ if and only if it converges in probability.

Theorem 4.5.8. Let $(X_n)_{n\ge 1}$ be a sequence of uniformly bounded random variables (i.e., there is a constant $M$ with $|X_n| \le M$ for all $n \ge 1$). Let $1 \le p < \infty$. If $X_n \xrightarrow{P} X$, then $X_n \xrightarrow{L^p} X$.

Proof. Since $\{|X| > M + 1/m\} \subset \{|X_n - X| > 1/m\}$, convergence in probability implies that $P[|X| > M] = 0$. Hence, when integrating, one may assume $|X| \le M$, and so
$$\int |X_n - X|^p\,dP = \int_{\{|X_n-X|\le\varepsilon\}} |X_n - X|^p\,dP + \int_{\{|X_n-X|>\varepsilon\}} |X_n - X|^p\,dP \le \varepsilon^p + 2^p M^p P[|X_n - X| > \varepsilon]. \quad \square$$

Remark. It suffices for the above result that $X_n \in L^\infty$ with $\| X_n \|_\infty \le M$ for all $n$.

This theorem has a corollary which, as pointed out in Example 4.5.11, is a generalization of the theorem of dominated convergence (Theorem 2.1.38).

Corollary 4.5.9. Assume that $X$ and $X_n$, $n \ge 1$, are in $L^1$. If
(1) $X_n \xrightarrow{P} X$, and
(2) $(X_n)_{n\ge 1}$ is uniformly integrable,
then $X_n \xrightarrow{L^1} X$.

Proof. By Lemma 4.5.4, the Lipschitz truncation at height $c$ given by $\phi_c \circ X_n$ gives a uniform approximation in $L^1$ to the $X_n$. Let $\varepsilon > 0$. Choose $c > 0$ so that $\| X_n - \phi_c \circ X_n \|_1 < \frac{\varepsilon}{3}$, $n \ge 1$, and $\| X - \phi_c \circ X \|_1 < \frac{\varepsilon}{3}$. Since $|\phi_c(x) - \phi_c(y)| \le |x - y|$ for all $x, y \in \mathbb{R}$, it follows that $\{|X_n - X| > \eta\} \supset \{|\phi_c \circ X_n - \phi_c \circ X| > \eta\}$. Therefore, the sequence $(\phi_c \circ X_n)_{n\ge 1}$ converges in probability to $\phi_c \circ X$. It follows from Theorem 4.5.8 that $\phi_c \circ X_n \xrightarrow{L^1} \phi_c \circ X$. Consequently,
$$\| X_n - X \|_1 \le \| X_n - \phi_c \circ X_n \|_1 + \| \phi_c \circ X_n - \phi_c \circ X \|_1 + \| \phi_c \circ X - X \|_1 \le \tfrac{2\varepsilon}{3} + \| \phi_c \circ X_n - \phi_c \circ X \|_1 < \varepsilon$$
if $n > n(\varepsilon, c)$. $\square$
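The role of uniform integrability in Corollary 4.5.9 can be seen numerically. The following sketch (not from the text) contrasts the classical non-example $X_n = n 1_{[0,1/n]}$, which converges to $0$ in probability but not in $L^1$, with $Y_n = \sqrt{n}\, 1_{[0,1/n]}$, which is uniformly integrable by Example 4.5.11 (3) below (take $p = 2$) and does converge in $L^1$.

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.uniform(size=500_000)       # a sample from ([0,1], B([0,1]), dx)

for n in [10, 100, 1000]:
    X_n = n * (U <= 1.0 / n)            # E|X_n| = 1 for every n: not UI, no L^1 limit
    Y_n = np.sqrt(n) * (U <= 1.0 / n)   # E[Y_n^2] = 1 bounded: UI, and E|Y_n| -> 0
    print(n, np.mean(np.abs(X_n)), np.mean(np.abs(Y_n)))
```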
Remark 4.5.10. Conditions (1) and (2) in Corollary 4.5.9 imply that $X \in L^1$ (see Exercise 4.7.13). Hence, in view of Proposition 4.5.7, a sequence $(X_n)_{n\ge 1}$ of random variables in $L^1$ converges in $L^1$ if and only if (i) it converges in probability and (ii) it is uniformly integrable (Exercise 4.7.13).

Example 4.5.11. Let $(X_n)_{n\ge 1}$ be a sequence of random variables uniformly bounded by $Y \in L^1$, i.e., $|X_n| \le Y$ for all $n \ge 1$. Show that
(1) the sequence $(X_n)_{n\ge 1}$ is uniformly integrable, and
(2) if, in addition, $X_n \xrightarrow{P} X$, then $X_n \xrightarrow{L^1} X$.
Conclude that Corollary 4.5.9 is a generalization of the theorem of dominated convergence. In addition, show that
(3) the sequence $(X_n)_{n\ge 1}$ is uniformly integrable if there is a constant $M < \infty$ and $p > 1$ with $E[|X_n|^p] \le M$ for all $n \ge 1$. [Hint: make use of Hölder's inequality when establishing Lemma 4.5.4 (2).]

Exercise 4.5.12. Show that if $X$ and $X_n$, $n \ge 1$, are in $L^p$, then $X_n \xrightarrow{L^p} X$ if $X_n \xrightarrow{P} X$ and $(X_n)_{n\ge 1}$ is uniformly integrable in $L^p$, i.e., $(|X_n|^p)_{n\ge 1}$ is uniformly integrable.

This discussion of uniform integrability concludes with a result stating that, given convergence in probability, convergence occurs in $L^1$ when the $L^1$-norms converge. Thinking in Euclidean terms, this is obvious for a sequence of vectors in $\mathbb{R}^n$: they converge if and only if their components converge and their lengths converge. In fact, the second condition is superfluous. However, in an infinite-dimensional space like $L^1$, where such things are not so evident, it is intuitively appealing that a sequence converges if it converges pointwise and the norms $\| X_n \|_1$, which determine the sphere in $L^1$ on which $X_n$ lies, also converge. As pointwise convergence implies convergence in probability, the following result says something stronger: namely, convergence in probability suffices in order that convergence of the $L^1$-norm imply convergence in $L^1$.

Theorem 4.5.13. Assume that $X$ and $X_n$, $n \ge 1$, are in $L^1$ and that $X_n \xrightarrow{P} X$. The following are equivalent:
(1) $\| X_n \|_1 \to \| X \|_1$; and
(2) $X_n \xrightarrow{L^1} X$.

Proof. It suffices to show that (1) implies (2) since the converse is clear. If (2) is false, then there is a subsequence with $\| X_{n_k} - X \|_1 \ge \delta$, $k \ge 1$, for some $\delta > 0$. Since every sequence converging in probability contains a subsequence that converges a.s. (Exercise 4.3.5), it follows that if (2) is false, there is a subsequence that converges a.s. for which $\| X_{n_k} - X \|_1 \ge \delta$, $k \ge 1$. This is impossible in view of the following result, Scheffé's lemma.
Proposition 4.5.14. (Scheffé's lemma) Let $(X_n)_{n\ge 1}$ be a sequence in $L^1$ that converges a.s. to $X \in L^1$. Then $X_n \xrightarrow{L^1} X$ if $\| X_n \|_1 \to \| X \|_1$.

Proof. First, assume that all the random variables are non-negative. As a result, on $\{X_n - X \le 0\}$, $X - X_n \le X$. Since $X_n - X \to 0$ a.s., it follows from dominated convergence (Theorem 2.1.38) that $\int_{\{X_n - X \le 0\}} (X_n - X)\,dP \to 0$ as $n \to \infty$. The assumption that the random variables are non-negative, the identity
$$\int (X_n - X)\,dP = \int_{\{X_n - X > 0\}} (X_n - X)\,dP + \int_{\{X_n - X \le 0\}} (X_n - X)\,dP,$$
and the hypothesis that $\| X_n \|_1 - \| X \|_1 = \int (X_n - X)\,dP \to 0$ imply that $\int_{\{X_n - X > 0\}} (X_n - X)\,dP \to 0$ as $n \to \infty$. Hence,
$$\| X_n - X \|_1 = \int_{\{X_n - X > 0\}} (X_n - X)\,dP - \int_{\{X_n - X \le 0\}} (X_n - X)\,dP \to 0$$
as $n \to \infty$.

The general case follows by applying the result for non-negative random variables to the random variables $X_n^+, X^+$ and $X_n^-, X^-$. This presupposes that if $\| X_n \|_1 = \int |X_n|\,dP \to \int |X|\,dP = \| X \|_1$, then $\int X_n^+\,dP \to \int X^+\,dP$ and $\int X_n^-\,dP \to \int X^-\,dP$ (i.e., $\| X_n^\pm \|_1 \to \| X^\pm \|_1$). Using Fatou's lemma (Proposition 2.1.25), one sees that, in any case, $\int X^\pm\,dP \le \liminf_n \int X_n^\pm\,dP$. Since $\int (X_n^+ + X_n^-)\,dP = \| X_n \|_1 \to \| X \|_1 = \int (X^+ + X^-)\,dP$, it follows from the next exercise that $\int X_n^\pm\,dP \to \int X^\pm\,dP$. $\square$

Exercise 4.5.15. Let $(a_n)_{n\ge 1}$ and $(b_n)_{n\ge 1}$ be two real sequences. Let $a \le \liminf_n a_n$ and $b \le \liminf_n b_n$. Assume that $a_n + b_n \to a + b$. Show that $a_n \to a$ and $b_n \to b$. [Hint: make use of Exercise 2.1.10.]

6. Differentiation: the Hardy-Littlewood maximal function*

The Hardy-Littlewood maximal function. If $f$ is a continuous, real-valued function defined on $\mathbb{R}$, then it is clear that, for any symmetric interval $(x-h, x+h)$ about $x$ with $h > 0$, the mean value $\frac{1}{2h} \int_{x-h}^{x+h} f(u)\,du$ of $f$ over this interval converges to $f(x)$ as $h \to 0$. Furthermore, the fundamental theorem of calculus states that
$$\frac{d}{dx} \int_a^x f(u)\,du = \lim_{h\downarrow 0} \frac{1}{h} \int_x^{x+h} f(u)\,du = \lim_{h\downarrow 0} \frac{1}{h} \int_{x-h}^x f(u)\,du = f(x).$$
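As a quick numerical sketch of this averaging statement (not from the text), one can approximate the symmetric mean value by the average of $f$ over a fine grid on $(x-h, x+h)$ and watch it converge to $f(x)$ as $h \downarrow 0$:

```python
import numpy as np

f, x = np.cos, 0.7            # a continuous function and a point

for h in [0.5, 0.1, 0.02, 0.004]:
    u = np.linspace(x - h, x + h, 10_001)
    mean_value = f(u).mean()  # approximates (1/2h) * integral of f over (x-h, x+h)
    print(h, mean_value, f(x))
```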
Since the continuous functions with compact support are dense in $L^1(\mathbb{R})$ (Theorem 4.2.5), it is natural to ask to what extent these results hold for an arbitrary function $f \in L^1$. Since an $L^1$-function can be modified on a set of measure zero without changing its integral, it is clear that the best one can hope for is a result a.e.

Suppose that $\varphi_n \xrightarrow{L^1} f$ with the functions $\varphi_n \in C_c(\mathbb{R})$. Then
$$\left| \frac{1}{2h}\int_{x-h}^{x+h} f(u)\,du - f(x) \right| \le \left| \frac{1}{2h}\int_{x-h}^{x+h} \{f(u) - \varphi_n(u)\}\,du \right| + \left| \frac{1}{2h}\int_{x-h}^{x+h} \varphi_n(u)\,du - \varphi_n(x) \right| + |\varphi_n(x) - f(x)|.$$
Let $\varepsilon > 0$ and $E_\varepsilon = \{x \mid \limsup_{h\downarrow 0} |\frac{1}{2h}\int_{x-h}^{x+h} f(u)\,du - f(x)| > \varepsilon\}$. Since the second term goes to zero as $\varphi_n$ is continuous, it is clear that $E_\varepsilon \subset E_{\varepsilon,1} \cup E_{\varepsilon,2}$, where
$$E_{\varepsilon,1} = \left\{x \;\middle|\; |\varphi_n(x) - f(x)| > \frac{\varepsilon}{2}\right\} \quad \text{and} \quad E_{\varepsilon,2} = \left\{x \;\middle|\; \limsup_{h\downarrow 0} \left|\frac{1}{2h}\int_{x-h}^{x+h} \{f(u) - \varphi_n(u)\}\,du\right| > \frac{\varepsilon}{2}\right\}.$$
The first set $E_{\varepsilon,1}$ has small measure for large $n$ by Chebychev's inequality (Proposition 4.3.6) as $|E_{\varepsilon,1}| \le \frac{2}{\varepsilon} \| \varphi_n - f \|_1$. To control the measure of $E_\varepsilon$, it suffices to show that the measure of $E_{\varepsilon,2}$ can be made small. It is a remarkable fact that the measure of this set can be shown to be small by showing that a much larger set has small measure if $\| \varphi_n - f \|_1$ is small. This larger set is a sort of worst-case scenario: it is $\{x \mid \sup_{h>0} \frac{1}{2h}\int_{x-h}^{x+h} |f(u) - \varphi_n(u)|\,du > \frac{\varepsilon}{2}\}$, which certainly contains $E_{\varepsilon,2}$ since $\int_{x-h}^{x+h} |f(u) - \varphi_n(u)|\,du \ge |\int_{x-h}^{x+h} \{f(u) - \varphi_n(u)\}\,du|$.

Definition 4.6.1. Let $\psi \in L^1(\mathbb{R})$. The Hardy-Littlewood maximal function $\psi^*$ of $\psi$ is defined by setting
$$\psi^*(x) = \sup_{h>0} \frac{1}{2h} \int_{x-h}^{x+h} |\psi(u)|\,du.$$

If $\psi = f - \varphi_n$, then $\{x \mid \sup_{h>0} \frac{1}{2h}\int_{x-h}^{x+h} |f(u) - \varphi_n(u)|\,du > \frac{\varepsilon}{2}\} = \{x \mid \psi^*(x) > \frac{\varepsilon}{2}\}$. As will be shown later, the maximal function is not in $L^1$ and so one cannot use Chebychev's inequality to estimate the measure of $\{x \mid \psi^*(x) > \frac{\varepsilon}{2}\}$. However, $\psi^*$ belongs to the so-called weak $L^1(\mathbb{R})$.

Definition 4.6.2. A measurable function $f$ is said to be in weak $L^1(\mathbb{R})$ or to be of weak type (1,1) if there is a constant $c > 0$ such that for all $\varepsilon > 0$ one has
$$|\{x \mid |f(x)| > \varepsilon\}| \le \frac{c}{\varepsilon}.$$
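The following sketch (not from the text) discretizes the maximal function of $\psi = 1_{[0,1]}$ on a grid; `maximal` is an illustrative helper, and the supremum over $h > 0$ is replaced by a maximum over finitely many scales. For $|x|$ large, the printed values show the $1/(2|x|)$ behaviour that makes $\psi^*$ non-integrable (compare Exercise 4.6.14 below), even though $|\{\psi^* > \varepsilon\}|$ stays of the order $c/\varepsilon$.

```python
import numpy as np

grid = np.linspace(-50, 50, 20_001)
dx = grid[1] - grid[0]
psi = ((grid >= 0) & (grid <= 1)).astype(float)   # psi = 1_{[0,1]}, ||psi||_1 = 1

def maximal(x, n_scales=400):
    """Approximate psi*(x) = sup_{h>0} (1/2h) int_{x-h}^{x+h} |psi(u)| du."""
    best = 0.0
    for h in np.geomspace(dx, 100.0, n_scales):
        lo, hi = np.searchsorted(grid, [x - h, x + h])
        best = max(best, psi[lo:hi].sum() * dx / (2 * h))
    return best

for x in [0.5, 2.0, 10.0, 40.0]:
    print(x, maximal(x), 1 / (2 * abs(x)))   # ~ 1/(2|x|) away from the support
```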
Remark. The function defined by $f(x) = \frac{1}{x}$, $x \ne 0$, and $f(0) = 0$ is of weak type (1,1) but not integrable.

Not only does $\psi^*$ belong to weak $L^1(\mathbb{R})$ but, in fact, as proved by Hardy and Littlewood, one has the following inequality.

Proposition 4.6.3. If $\psi \in L^1(\mathbb{R})$, there is a constant $c > 0$ independent of $\psi$ such that for any $\varepsilon > 0$
$$|\{\psi^* > \varepsilon\}| \le \frac{c}{\varepsilon} \| \psi \|_1 .$$

Combining all of this information gives a proof of the following famous result.

Theorem 4.6.4. (Lebesgue's differentiation theorem) If $f \in L^1(\mathbb{R})$, then
$$\lim_{h\downarrow 0} \frac{1}{2h} \int_{x-h}^{x+h} f(u)\,du = f(x) \quad \text{a.e.}$$

Proof. Fix $\varepsilon > 0$, and let $\delta > 0$. If $n$ is sufficiently large, $\| f - \varphi_n \|_1 < \min\{\frac{\varepsilon\delta}{2}, \frac{\varepsilon\delta}{2c}\}$, and so $|E_{\varepsilon,1}| \le \delta$ and $|E_{\varepsilon,2}| \le \delta$. Hence, $|E_\varepsilon| \le 2\delta$ for every $\delta > 0$, and so $|E_\varepsilon| = 0$ for any $\varepsilon > 0$. $\square$

Remark. This theorem holds in $\mathbb{R}^n$ using the same proof with cubes centered at $x$ replacing symmetric intervals (see Wheeden and Zygmund [W1], p. 100). It also holds with balls replacing cubes (see Stein and Weiss [S1], p. 60, Theorem 3.12, for a very general theorem of this type).

To prove the analogue of the fundamental theorem of calculus, a further refinement of Lebesgue's differentiation theorem is needed.

Definition 4.6.5. If $f \in L^1(\mathbb{R})$, a point $x$ is called a Lebesgue point of $f$ if
$$\lim_{h\downarrow 0} \frac{1}{2h} \int_{x-h}^{x+h} |f(u) - f(x)|\,du = 0.$$
The collection of Lebesgue points of $f$ is called the Lebesgue set of $f$.

It is clear that at a Lebesgue point $\lim_{h\downarrow 0} \frac{1}{2h}\int_{x-h}^{x+h} f(u)\,du = f(x)$. It is not hard to prove, using the differentiation theorem, that almost every point is a Lebesgue point of $f$ (see Proposition 4.6.9). Using this fact one can prove the following theorem.

Theorem 4.6.6. If $f \in L^1(\mathbb{R})$ and $x$ is a Lebesgue point of $f$, then
$$f(x) = \lim_{h\downarrow 0} \frac{1}{h} \int_x^{x+h} f(u)\,du \quad \text{and} \quad f(x) = \lim_{h\downarrow 0} \frac{1}{h} \int_{x-h}^x f(u)\,du.$$
Hence, for any $a \in \mathbb{R}$,
$$\frac{d}{dx} \int_a^x f(u)\,du = f(x) \quad \text{a.e.}$$

Proof. Since $x$ is a Lebesgue point of $f$,
$$\left| \frac{1}{h}\int_x^{x+h} f(u)\,du - f(x) \right| \le \frac{1}{h}\int_x^{x+h} |f(u) - f(x)|\,du \le 2 \cdot \frac{1}{2h}\int_{x-h}^{x+h} |f(u) - f(x)|\,du \to 0 \text{ as } h \downarrow 0.$$
As the proof for the other limit is essentially the same, the result follows from the fact shown in Proposition 4.6.9 that almost every point is a Lebesgue point of $f$. $\square$

Corollary 4.6.7. Let $\nu$ be a signed measure on $\mathfrak{B}(\mathbb{R})$ that is absolutely continuous with respect to Lebesgue measure $dx$. Let $G$ be any function of bounded variation such that $\nu((a,b]) = G(b) - G(a)$ if $a < b$. Then, if $f \in L^1(\mathbb{R})$ is such that $f(x)dx = \nu(dx)$, it follows that $f(x) = G'(x)$ a.e. In particular, if $\nu$ is an absolutely continuous probability on $\mathfrak{B}(\mathbb{R})$, then $f(x) = F'(x)$ a.e., where $F$ is the distribution function of $\nu$. In other words, its distribution function $F$ is a.e. differentiable and the derivative $F'$ is its probability density function.

Proof. Up to a constant,
$$G(x) = \begin{cases} \int_0^x f(u)\,du & \text{if } 0 < x, \\ 0 & \text{if } x = 0, \\ -\int_x^0 f(u)\,du & \text{if } x < 0, \end{cases}$$
where $f \in L^1(\mathbb{R})$ is the Radon-Nikodym derivative of $\nu$ with respect to $dx$. It follows from Theorem 4.6.6 that $G'(x)$ exists and equals $f(x)$ a.e. The case of a probability is now immediate. $\square$

Remark 4.6.8. Since these differentiation results make use of the integrability of the function only over a closed, bounded interval, it follows that they all carry over to measurable functions that, for each $a \in \mathbb{R}$, are integrable on $(a-\delta, a+\delta)$ for some $\delta = \delta(a) > 0$. These are the so-called locally integrable functions. The Heine-Borel theorem implies that a function is locally integrable if and only if it is integrable on any compact set (equivalently, any bounded set). Any non-zero constant function is locally integrable although not in $L^1$. If $f \in L^1$, then $|f - r|$ is locally integrable for any constant $r \ne 0$ and not integrable as $|f| + |f - r| \ge |r|$.
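For a concrete instance of Corollary 4.6.7 (a numerical sketch, not from the text), take the exponential distribution function $F(x) = 1 - e^{-x}$ for $x \ge 0$: symmetric difference quotients of $F$ recover the density $f(x) = e^{-x}$ at every point $x > 0$.

```python
import numpy as np

def F(x):
    """Distribution function of the exponential law: F(x) = 1 - exp(-x), x >= 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, 1.0 - np.exp(-np.maximum(x, 0.0)), 0.0)

x = 1.3
for h in [0.1, 0.01, 0.001]:
    print(h, (F(x + h) - F(x - h)) / (2 * h), np.exp(-x))  # F'(x) ~ density f(x)
```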
Proposition 4.6.9. If $f \in L^1(\mathbb{R})$ or is even locally integrable, then almost every point is a Lebesgue point of $f$.

Proof. Let $r \in \mathbb{Q}$. Then $|f - r|$ is locally integrable, and so by the differentiation theorem (Theorem 4.6.4),
$$\lim_{h\downarrow 0} \frac{1}{2h} \int_{x-h}^{x+h} |f(u) - r|\,du = |f(x) - r| \quad \text{a.e.}$$
Since the rationals are countable, there is a set $E$ with $|E| = 0$ such that
$$\lim_{h\downarrow 0} \frac{1}{2h} \int_{x-h}^{x+h} |f(u) - r|\,du = |f(x) - r| \quad \text{for all } x \notin E \text{ and all } r \in \mathbb{Q}.$$
Every point in the complement of $E$ is a Lebesgue point of $f$ since, if $x \notin E$ and $r$ is rational,
$$\limsup_{h\downarrow 0} \frac{1}{2h} \int_{x-h}^{x+h} |f(u) - f(x)|\,du \le \limsup_{h\downarrow 0} \frac{1}{2h} \int_{x-h}^{x+h} \{|f(u) - r| + |r - f(x)|\}\,du = 2|f(x) - r|.$$
Since for $x \notin E$ the rational $r$ may be arbitrary, it follows that $r$ can be chosen with $|f(x) - r|$ arbitrarily small. $\square$

It remains to do the hard work: establish the properties of maximal functions and the Hardy-Littlewood result (Proposition 4.6.3).

Maximal functions. One has the following more or less obvious properties of maximal functions.

Exercise 4.6.10. Let $f, g, f_n \in L^1$. Show that
(1) $f^* = |f|^*$,
(2) if $|f| \le |g|$, then $f^* \le g^*$, and
(3) if $0 \le f_n \uparrow f$, then $f_n^* \uparrow f^*$.

To show that the maximal function is measurable, first observe that, for fixed $h > 0$, the mean value $\frac{1}{2h}\int_{x-h}^{x+h} f(u)\,du$ is a continuous function of $x$. This follows immediately from dominated convergence: if $x_n \to x$, then $1_{[x_n-h,\, x_n+h]}(u) \to 1_{[x-h,\, x+h]}(u)$ as long as $u \ne x \pm h$ and hence a.e.

Exercise 4.6.11. Let $\mu_h$ denote the uniform distribution on $[-h, h]$, i.e., $\mu_h(dx) = \frac{1}{2h} 1_{[-h,h]}(x)\,dx$. If $f \in L^1$, let $\tilde{f}(x) = f(-x)$. Show that
$$\frac{1}{2h} \int_{x-h}^{x+h} f(u)\,du = (f * \mu_h)(x) = (\tilde{f} * \mu_h)(-x).$$

The maximal function $f^* = \sup_{h>0} (|f| * \mu_h)$ is the supremum of a family of continuous functions. As a result, it is measurable, as the next exercise shows.
Definition 4.6.12. A function $\ell : \mathbb{R} \to \mathbb{R} \cup \{\pm\infty\}$ is defined to be lower semicontinuous at $x_0$ if for any $\varepsilon > 0$ there is a $\delta > 0$ such that $\ell(x) > \ell(x_0) - \varepsilon$ when $|x - x_0| < \delta$. It is said to be lower semicontinuous if it is lower semicontinuous at every point.

Exercise 4.6.13. Let $\ell$ be a lower semicontinuous function. Show that
(1) $\{x \mid \ell(x) > \lambda\}$ is open for any $\lambda \in \mathbb{R}$,
(2) any function satisfying (1) is lower semicontinuous.
Let $(\varphi_\alpha)_{\alpha\in I}$ be a family of continuous functions and let $\ell = \sup_\alpha \varphi_\alpha$. Show that
(3) $\ell$ satisfies (1) and so is lower semicontinuous,
(4) a lower semicontinuous function is Borel measurable.

Remark. This suggests that maximal functions can be profitably associated with any family $(\mu_t)_{t>0}$ of probabilities for which $\varphi * \mu_t \to \varphi$ pointwise, where $\varphi$ is any continuous function of compact support, e.g., $\mu_t(dx) = n_t(x)dx$, the Gaussian semigroup. Such a family is called an approximate identity or an approximation to the identity. See Wheeden and Zygmund [W1] and Stein and Weiss [S1] for further details.

For any measurable set $E$, the mean value
$$\frac{1}{2h} \int_{x-h}^{x+h} 1_E(u)\,du = \frac{1}{2h} |E \cap [x-h, x+h]|.$$

Exercise 4.6.14. Let $E$ be measurable and a subset of $[-N, N]$. Show that
(1) if $|x| > N$, then $1_E^*(x) \ge \frac{|E|}{2(|x|+N)} \ge \frac{|E|}{4|x|}$. [Hint: for $x > N$, consider $[-N, 2x+N]$, the smallest symmetric interval about $x$ containing $[-N, N]$.]
If $f \in L^1(\mathbb{R})$ and $f(x) = 0$ whenever $|x| > N$ (i.e., the support of $f$ is a subset of $[-N, N]$), show that
(2) if $|x| > N$, then $f^*(x) \ge \frac{\|f\|_1}{2(|x|+N)} \ge \frac{\|f\|_1}{4|x|}$.
Conclude that, if $f \in L^1(\mathbb{R})$ is not the zero function, then
(3) $f^*$ is not integrable. [Hint: use Exercise 4.6.10 (2).]

To get an upper estimate of the maximal function of a bounded, measurable set, consider the case of a bounded interval $[a,b]$: if $x > b$, one has
$$|[a,b] \cap [x-h, x+h]| = \begin{cases} 0 & \text{if } h < x-b, \\ b - x + h & \text{if } x-b \le h \le x-a, \\ b - a & \text{if } x-a < h; \end{cases}$$
and if $x < a$, one has
$$|[a,b] \cap [x-h, x+h]| = \begin{cases} 0 & \text{if } h < a-x, \\ x + h - a & \text{if } a-x \le h \le b-x, \\ b - a & \text{if } b-x < h. \end{cases}$$

Exercise 4.6.15. Show that
(1) $1_{[a,b]}^*(x) \sim \frac{1}{|x|}$ for large $x$, i.e., there is a constant $c > 0$ such that $\frac{c}{|x|} \le 1_{[a,b]}^*(x) \le \frac{1}{c|x|}$ for large $|x|$,
(2) if $E$ is bounded and measurable, then $1_E^*(x) \sim \frac{1}{|x|}$ for large $x$,
(3) if $f$ has compact support (i.e., for some $N > 0$ its support is contained in $[-N, N]$), then $f^*(x) \sim \frac{1}{|x|}$ for large $x$, and
(4) if $f$ has compact support, then $|\{f^* > \varepsilon\}| < +\infty$ for any $\varepsilon > 0$.

Proof of Proposition 4.6.3 (Hardy-Littlewood) that $f^*$ is of weak type (1,1). Assume $f \in L^1$ has compact support and that $\varepsilon > 0$. If $f^*(x) > \varepsilon$, there is a closed interval $[x-h, x+h]$ with $\frac{1}{2h}\int_{x-h}^{x+h} |f(u)|\,du > \varepsilon$. Hence, $\{f^* > \varepsilon\}$ is contained in the union of such intervals, i.e., it is covered by these intervals. Let $([x_i - h_i, x_i + h_i])_{1\le i\le m}$ be a finite disjoint collection of these intervals. If $A = \cup_{i=1}^m [x_i - h_i, x_i + h_i]$, then
$$\| f \|_1 \ge \int_A |f(u)|\,du = \sum_{i=1}^m \int_{x_i - h_i}^{x_i + h_i} |f(u)|\,du > \sum_{i=1}^m \varepsilon\, 2h_i = \varepsilon |A|.$$
The key step is a part of the Vitali covering lemma (see Proposition 4.6.16): it implies that, since $|\{f^* > \varepsilon\}| < +\infty$, there is a fixed number $\beta > 0$, independent of $f$ and $\varepsilon$, such that the collection of disjoint intervals $[x_i - h_i, x_i + h_i]$ can be chosen so that $|A| \ge \beta |\{f^* > \varepsilon\}|$. Taking $c = \beta^{-1}$ proves Proposition 4.6.3 in case the support of $f$ is compact.

Now for any $f \in L^1$ there is a sequence $(g_n)_{n\ge 1}$ of non-negative functions with compact support such that $g_n \uparrow |f|$. Since $g_n^* \uparrow f^*$ by Exercise 4.6.10 (3), it follows that $\{g_n^* > \varepsilon\} \subset \{g_{n+1}^* > \varepsilon\}$ and $\{f^* > \varepsilon\} = \cup_{n=1}^\infty \{g_n^* > \varepsilon\}$. Since $\| g_n \|_1 \uparrow \| f \|_1$, the result follows as $\beta^{-1} = c$ is independent of the $g_n$ and
$$|\{f^* > \varepsilon\}| = \lim_{n\to\infty} |\{g_n^* > \varepsilon\}| \le \lim_{n\to\infty} \frac{c}{\varepsilon} \| g_n \|_1 = \frac{c}{\varepsilon} \| f \|_1 . \quad \square$$

It remains to prove the covering lemma. Following Wheeden and Zygmund [W1], the Vitali covering lemma will be proved by fine-tuning the proof of the following simpler covering lemma, which suffices to complete the proof of Proposition 4.6.3.
Proposition 4.6.16. Let $E$ be a measurable set with $|E| < +\infty$. Assume that $\mathcal{I}$ is a collection of closed finite intervals $I$ whose union contains $E$. Then, if $0 < \beta < \frac{1}{5}$, there is a finite disjoint collection $I_1, I_2, \dots, I_m$ of these intervals such that
$$\sum_{i=1}^m |I_i| \ge \beta |E|.$$

Proof. Consider the lengths of the intervals in $\mathcal{I}$. If they are unbounded, there is even one interval $I \in \mathcal{I}$ with $|I| \ge |E| \ge \beta|E|$. If $I$ is a closed finite interval, let $5I$ denote the closed interval with the same midpoint but 5 times its length. Then, if $J$ is any closed interval that intersects $I$ and has $|J| \le 2|I|$, it follows that $J \subset 5I$.

Assume that the supremum $\ell_1$ of the lengths of the intervals in $\mathcal{I}$ is finite. Choose $I_1 \in \mathcal{I}$ with $|I_1| > \frac{1}{2}\ell_1$. Then the union of all the intervals that intersect $I_1$ is a subset of $5I_1$. The remaining intervals are all disjoint from $I_1$. Let $\ell_2$ be the supremum of their lengths, and choose $I_2 \in \mathcal{I}$ disjoint from $I_1$ with $|I_2| > \frac{1}{2}\ell_2$. Then $5I_1 \cup 5I_2$ contains all the intervals in $\mathcal{I}$ that intersect $I_1 \cup I_2$. In this way, for each $m \ge 1$, one obtains disjoint intervals $I_1, I_2, \dots, I_m$ such that all the intervals in $\mathcal{I}$ that intersect $I_1 \cup I_2 \cup \dots \cup I_m$ are contained in $5I_1 \cup 5I_2 \cup \dots \cup 5I_m$. Let $\ell_{m+1}$ be the supremum of the lengths of the intervals in $\mathcal{I}$ that are disjoint from $I_1, I_2, \dots, I_m$.

It could happen that at some stage the process terminates. This means that, for some $m$, every interval intersects $I_1 \cup I_2 \cup \dots \cup I_m$. As a result, $E \subset \cup_{i=1}^m 5I_i$, and so $|E| \le |\cup_{i=1}^m 5I_i| \le 5\sum_{i=1}^m |I_i|$, which completes the proof in this case.

If the process never terminates, then either $\sum_{i=1}^\infty |I_i| = +\infty$, in which case some partial sum of this series will be larger than $|E|$ (which gives the result), or $\sum_{i=1}^\infty |I_i| < +\infty$. If $\sum_{i=1}^\infty |I_i| < +\infty$, then $E \subset \cup_{i=1}^\infty 5I_i$: this is so if no $I \in \mathcal{I}$ is disjoint from all the $I_i$; if $I \in \mathcal{I}$ is disjoint from all the $I_i$, then $\ell_i \ge |I| > 0$ for all $i \ge 1$. But this is impossible as $\ell_i \le 2|I_i|$ and $|I_i| \to 0$ as the series converges. Since $E \subset \cup_i 5I_i$, it follows that $|E| \le 5\sum_{i=1}^\infty |I_i|$. Hence, if $\beta^{-1} > 5$ is fixed, $|E| < \beta^{-1}\sum_{i=1}^m |I_i|$ for some $m \ge 1$, which proves the result. $\square$

Remark. It is not necessary that $E$ be measurable in the above covering lemma. It suffices that it be a subset of a measurable set of finite measure or, equivalently, that it have finite outer measure $\lambda^*(E)$. The statement of the result then involves $\lambda^*(E)$ rather than $|E|$.

The Vitali covering lemma: differentiation of monotone functions. In Corollary 4.6.7, it was shown that, if $F$ is the distribution function of an absolutely continuous probability on $\mathbb{R}$, then it is differentiable a.e.
and $F(x) = \int_{-\infty}^x f(u)\,du$, where $f = F'$. It turns out that any distribution function is differentiable a.e., regardless of whether or not it is absolutely continuous. This is proved by using the Vitali covering lemma, which is a refinement of the covering lemma (Proposition 4.6.16).

Given any function $F$ on (say) $[a,b]$ and a point $x \in (a,b)$, the difference quotient $\frac{F(x+h) - F(x)}{h} \overset{\text{def}}{=} DF(x; h)$ is defined for sufficiently small $h \ne 0$. The function is differentiable at $x$ if and only if the following four numbers agree and are finite:
$$DF^+(x) \overset{\text{def}}{=} \limsup_{h\downarrow 0} DF(x;h); \qquad DF_+(x) \overset{\text{def}}{=} \liminf_{h\downarrow 0} DF(x;h);$$
$$DF^-(x) \overset{\text{def}}{=} \limsup_{h\uparrow 0} DF(x;h); \qquad DF_-(x) \overset{\text{def}}{=} \liminf_{h\uparrow 0} DF(x;h).$$
These are the four so-called Dini derivatives of $F$ at $x$. Note that
$DF^+(x) > q$ implies $DF(x;h) > q$ for arbitrarily small positive $h$;
$DF_+(x) < p$ implies $DF(x;h) < p$ for arbitrarily small positive $h$;
$DF^-(x) > q$ implies $DF(x;k) > q$ for arbitrarily small negative $k$; and
$DF_-(x) < p$ implies $DF(x;k) < p$ for arbitrarily small negative $k$,
where, for example, "$DF(x;h) > q$ for arbitrarily small positive $h$" means that, for any $\delta > 0$, there is an $h$ with $0 < h < \delta$ such that $DF(x;h) > q$.

Since $DF_+(x) \le DF^+(x)$ and $DF_-(x) \le DF^-(x)$, the four Dini derivatives agree if and only if $DF^+(x) \le DF_-(x)$ and $DF^-(x) \le DF_+(x)$.

Now assume $F$ is non-decreasing. Then the four Dini derivatives are all non-negative. Assume that $DF^+(x) > DF_-(x)$. Then there are two rational numbers $p < q$ with $DF^+(x) > q > p > DF_-(x)$, and the point $x$ belongs to
$$E_{p,q} = \{u \in (a,b) \mid DF(u;h) > q \text{ for arbitrarily small positive } h\} \cap \{u \in (a,b) \mid DF(u;k) < p \text{ for arbitrarily small negative } k\}.$$
Since there are a countable number of sets $E_{p,q}$ as $p < q$ run over the pairs of positive rational numbers, it follows that $DF^+(x) \le DF_-(x)$ a.e. on $(a,b)$ if $|E_{p,q}| = 0$ for any pair of positive rationals $p < q$. In view of the following exercise, which essentially reverses the order on $\mathbb{R}$, if $DF^+(x) \le DF_-(x)$ a.e. on $(a,b)$ for any non-decreasing function $F$, then it also follows that $DF^-(x) \le DF_+(x)$ a.e. on $(a,b)$ and, hence, that the four Dini derivatives agree a.e. on $(a,b)$.
Exercise 4.6.17. Let $F$ be a non-decreasing, real-valued function on $[a,b]$. Define $G(u) = -F(-u)$. Show that
(1) $G$ is a non-decreasing function on $[-b, -a]$,
(2) $DG(-x; -h) = DF(x; h)$ if $h \ne 0$,
(3) $DG^-(-x) = DF^+(x)$,
(4) $DG_-(-x) = DF_+(x)$,
(5) $DG^+(-x) = DF^-(x)$, and
(6) $DG_+(-x) = DF_-(x)$.

In order to prove that $|E_{p,q}| = 0$, one makes use of the coverings that are given by the intervals $[u, u+h]$ with $DF(u;h) > q$ and the intervals $[u+k, u]$ with $DF(u;k) < p$. Let $\mathcal{V}_+$ be the collection of closed intervals $[u, u+h]$, where $u \in E_{p,q}$ and $h > 0$ is such that $DF(u;h) > q$. Let $\mathcal{V}_-$ be the collection of closed intervals $[u+k, u]$, where $u \in E_{p,q}$ and $k < 0$ is such that $DF(u;k) < p$. Then both collections of intervals cover $E_{p,q}$, and any point $u$ of $E_{p,q}$ lies in an interval from $\mathcal{V}_\pm$ of arbitrarily small length. All of these intervals are non-trivial, i.e., they all contain more than one point (equivalently, they all have positive length). In other words, the collections $\mathcal{V}_\pm$ are Vitali covers of $E_{p,q}$.

Definition 4.6.18. A collection $\mathcal{V}$ of (non-trivial) closed intervals is said to be a Vitali cover of a set $E$ if their union contains $E$ and, for any $x \in E$, there is an interval in $\mathcal{V}$ of arbitrarily small length that contains $x$ (the Vitali property).

Remark 4.6.19. If $\mathcal{V}$ is a Vitali cover of a set $E$, and if $O$ is an open set, then the Vitali property ensures that the intervals in $\mathcal{V}$ contained in $O$ form a Vitali cover of $E \cap O$. This allows one to localize in a certain sense.

The Vitali covering lemma is the following theorem.

Theorem 4.6.20. (Vitali) Let $\mathcal{V}$ be a Vitali cover of a subset $E$ of $\mathbb{R}$. Then there is a pairwise disjoint countable family of intervals $(I_i)_{1\le i\le N}$, $1 \le N \le \infty$, from $\mathcal{V}$ such that
$$|E \setminus \{\cup_{i=1}^N I_i\}| = 0.$$
Further, if $0 < \lambda^*(E) < +\infty$, then for each $\varepsilon > 0$ there is a finite number $m \ge 1$ such that the outer measure
$$\lambda^*(E \setminus \{\cup_{i=1}^m I_i\}) < \varepsilon \quad \text{and} \quad |\cup_{i=1}^m I_i| \le (1+\varepsilon)\lambda^*(E).$$

Proof. Let $E_n = E \cap (n, n+1)$, where $n \in \mathbb{Z}$. Then $E$ differs from $\cup_n E_n$ by at most a countable set. In view of Remark 4.6.19, the intervals of $\mathcal{V}$ that lie in $(n, n+1)$ constitute a Vitali cover $\mathcal{V}_n$ of $E_n$. If the theorem
holds for each $E_n$, then, by applying it to each $E_n$ with $\varepsilon$ replaced by $\varepsilon_n$, where $\sum_n \varepsilon_n < \varepsilon$, the theorem follows for $E$: one merely puts together the collections of intervals obtained for each $n \in \mathbb{Z}$.

Since Lebesgue measure is translation-invariant, it suffices to prove the theorem when $E \subset (0,1)$. Consider the proof of Proposition 4.6.16 for $E \subset (0,1)$, where, by the Vitali property, one may assume that all the intervals in $\mathcal{V}$ are subsets of $(0,1)$. To begin with, $\ell_1 \le 1$, so that the selection process needs to be used. Here this process is modified by cutting down the family of intervals used at each stage: after having selected $I_1, I_2, \dots, I_m$, let $A_m = I_1 \cup I_2 \cup \dots \cup I_m$; the next interval is taken from the Vitali cover of $E \setminus A_m$ given by the intervals of $\mathcal{V}$ that are subsets of the open set $(0,1) \setminus A_m$; and $\ell_{m+1}$ is taken to be the supremum of the lengths of these intervals.

Now if the process of selection of the intervals $I_i$ terminates at stage $m$, it means that there are no intervals in $\mathcal{V}$ disjoint from the closed set $A_m$. As a result, $E \subset A_m$ since the intervals in $\mathcal{V}$ contained in the open set $(0,1) \setminus A_m$ form a Vitali cover of $E \setminus A_m$ by Remark 4.6.19.

If the selection process never terminates, then $\sum_{i=1}^\infty |I_i| \le 1 < +\infty$ and $E \subset \cup_{i=1}^\infty 5I_i$. Let $A = \cup_{i=1}^\infty I_i$. If $x \in E \setminus A$, then $x \in 5I_i$ infinitely often (i.e., $x \in \cup_{i=m+1}^\infty 5I_i$ for each $m \ge 1$): since $x \in E \setminus A_m$, by the Vitali property, there is an interval $I$ in $\mathcal{V}$ containing $x$ that is disjoint from $A_m$; this interval intersects some $I_k$, $k > m$, as $|I| > 0$ and $\ell_i \to 0$ as $i \to \infty$; hence, $x \in 5I_j$ if $j > m$ is the first time that $I \cap I_j \ne \emptyset$. Now $|\cup_{i=m+1}^\infty 5I_i| \le 5\sum_{i=m+1}^\infty |I_i| \to 0$ as $m \to \infty$. Hence, $|E \setminus A| = 0$. This completes the proof of the first statement.

Since $E \setminus \{\cup_{i=1}^m I_i\} \subset \{\cup_{i=m+1}^\infty I_i\} \cup (E \setminus A)$, this also proves that, for any $\varepsilon > 0$, there is an integer $m$ with $\lambda^*(E \setminus \{\cup_{i=1}^m I_i\}) < \varepsilon$.

Finally, if $0 < \lambda^*(E)$, let $\delta = \varepsilon\lambda^*(E) > 0$. Since $E \subset (0,1)$, there is an open set $O$ with $E \subset O \subset (0,1)$ and $|O| < \lambda^*(E) + \delta = (1+\varepsilon)\lambda^*(E)$. Now replace the original Vitali cover by the Vitali cover of intervals in $\mathcal{V}$ that are subsets of $O$. The sequence $(I_i)_{1\le i\le N}$ can then be taken from this new Vitali cover. Hence, $|A| = \sum_i |I_i| \le |O| < (1+\varepsilon)\lambda^*(E)$.

The last part of Vitali's theorem follows from what has been established for $E \subset (0,1)$: it suffices to observe that, in general, $\lambda^*(E) = \sum_n \lambda^*(E_n)$ is greater than zero if and only if at least one of the $\lambda^*(E_n) > 0$. Then one uses what has been proved to find, for each $n \in \mathbb{Z}$ with $\lambda^*(E_n) > 0$, (i) an open set $O_n$ with $E_n \subset O_n \subset (n, n+1)$ and $|O_n| < (1+\varepsilon)\lambda^*(E_n)$ and (ii) a finite disjoint family $(I_i^n)_{1\le i\le N_n}$ of intervals from the Vitali cover of $E_n$ by intervals in $\mathcal{V}$ that are subsets of $O_n$ such that $\lambda^*(E_n \setminus \{\cup_{i=1}^{N_n} I_i^n\}) < \varepsilon_n$ (where $\sum_n \varepsilon_n < \varepsilon$) and $|\cup_{i=1}^{N_n} I_i^n| \le (1+\varepsilon)\lambda^*(E_n)$. Then, putting all these disjoint collections together, one obtains the desired result. The fact that $\lambda^*(E) < +\infty$ implies that one can ignore all
but a finite number of the sets $E_n$ in putting together a finite subcollection $(I_j)_{1\le j\le m}$ such that $\lambda^*(E \setminus \{\cup_{j=1}^m I_j\}) < \varepsilon$. $\square$

Returning to the question of the differentiability of a monotone function $F$, assume that $\lambda^*(E_{p,q}) > 0$ for some pair of positive rational numbers $p < q$. Let $\varepsilon > 0$ and let $O$ be an open subset of $(a,b)$ containing $E_{p,q}$ such that $|O| < \lambda^*(E_{p,q}) + \varepsilon$. The sets in the Vitali cover $\mathcal{V}_-$ of $E_{p,q}$ that are subsets of $O$ are again a Vitali cover of $E_{p,q}$. Hence, there is a finite set of disjoint intervals $[x_i + k_i, x_i]$ in $\mathcal{V}_-$, $1 \le i \le m$, that are subsets of $O$ with $\lambda^*(E_{p,q} \setminus \{\cup_{i=1}^m [x_i + k_i, x_i]\}) < \varepsilon$. Note that this implies

(1) $\sum_{i=1}^m |k_i| \le |O| < \lambda^*(E_{p,q}) + \varepsilon$.

Let $U = \cup_{i=1}^m (x_i + k_i, x_i) \subset O$. The intervals in $\mathcal{V}_+$ that are subsets of $U$ constitute a Vitali cover of $E_{p,q} \cap U$, and so there is a finite set of disjoint intervals $[y_j, y_j + h_j]$ in $\mathcal{V}_+$, $1 \le j \le n$, that are subsets of $U$ with $\lambda^*(E_{p,q} \cap U \setminus \{\cup_{j=1}^n [y_j, y_j + h_j]\}) < \varepsilon$. This implies that
$$\lambda^*(E_{p,q} \cap U) \le \sum_{j=1}^n h_j + \varepsilon,$$
and so

(2) $\lambda^*(E_{p,q}) \le \lambda^*(E_{p,q} \cap U) + \lambda^*(E_{p,q} \setminus U) \le \sum_{j=1}^n h_j + 2\varepsilon$,

since $\lambda^*(E_{p,q} \setminus U) \le \lambda^*(E_{p,q} \setminus \{\cup_{i=1}^m [x_i + k_i, x_i]\}) < \varepsilon$. From this, it follows that
$$\sum_{j=1}^n h_j = |\cup_{j=1}^n [y_j, y_j + h_j]| \le |U| = |\cup_{i=1}^m [x_i + k_i, x_i]| = \sum_{i=1}^m |k_i|.$$
In addition, because $DF(y_j; h_j) > q$, $DF(x_i; k_i) < p$, and $\cup_{j=1}^n [y_j, y_j + h_j] \subset \cup_{i=1}^m (x_i + k_i, x_i)$, plus the assumption that $F$ is non-decreasing, one has

(3) $q\sum_{j=1}^n h_j < \sum_{j=1}^n \{F(y_j + h_j) - F(y_j)\} \le \sum_{i=1}^m \{F(x_i) - F(x_i + k_i)\} < p\sum_{i=1}^m |k_i|$.
Let $h = \sum_{j=1}^n h_j$, $|k| = \sum_{i=1}^m |k_i|$, and $\alpha = \lambda^*(E_{p,q})$. Then, by (1), (2), and (3), one has that $|k| < \alpha + \varepsilon$, $\alpha \le h + 2\varepsilon$, and $qh < p|k|$. Assume that $\varepsilon < \frac{\alpha}{2}$. Then
$$\frac{q}{p}(\alpha - 2\varepsilon) < \alpha + \varepsilon \quad \text{and so} \quad \frac{(q-p)}{(p+2q)}\,\alpha < \varepsilon < \frac{1}{2}\alpha.$$
Therefore, the assumption that $\alpha = \lambda^*(E_{p,q}) > 0$ leads to a contradiction since, in the above argument, $\varepsilon$ may be arbitrarily small. This completes the proof of most of the next result.

Theorem 4.6.21. (Lebesgue) Let $F$ be any real-valued, non-decreasing function on $[a,b]$. Then $F$ is differentiable a.e. Further, if $f = F'$, then $f$ is Lebesgue measurable and $\int_a^b f(u)\,du \le F(b-) - F(a+)$.

Proof. It remains to prove that a.e. the common value of the four Dini derivatives is finite, and that the resulting function $f$, which is defined a.e., is Lebesgue measurable with $\int_a^b f(u)\,du$ as specified. Let
$$D_n F(x) = \begin{cases} n\{F(x + \frac{1}{n}) - F(x)\} & \text{if } a < x \le b - \frac{1}{n}, \\ 0 & \text{if } b - \frac{1}{n} < x < b. \end{cases}$$
Then $D_n F(x) \ge 0$, and it converges to $f$ a.e. by what has been proved. Fatou's lemma implies that
$$\int_a^b f(u)\,du \le \liminf_n \int_a^b D_n F(u)\,du = \liminf_n \left[ n\int_{b-\frac{1}{n}}^b F(u)\,du - n\int_a^{a+\frac{1}{n}} F(u)\,du \right] = F(b-) - F(a+).$$
This implies that the non-negative function $f$ is a.e. finite and so $F$ has a derivative a.e. on $(a,b)$. $\square$

Corollary 4.6.22. Let $F$ be a distribution function on $\mathbb{R}$. The derivative $f$ of $F$ is the density of the absolutely continuous measure in the Lebesgue decomposition of the probability $P$ determined by $F$.

Proof. Let $G(x) = F(x) - \int_{-\infty}^x f(u)\,du$. Then $G$ is a non-decreasing function: if $a < b$, then $\{F(b) - F(a)\} - \int_a^b f(u)\,du \ge \{F(b) - F(a)\} - \{F(b-) - F(a+)\} =$
$F(b) - F(b-) \ge 0$, as $F$ is right continuous (so that $F(a+) = F(a)$). Hence, the measure $\mu$ given by $\mu(A) = \int_A f(u)\,du$ is dominated by $P$, i.e., for all $A$, $\mu(A) \le P(A)$. On the other hand, by Theorem 2.7.22, the probability $P(dx) = \phi(x)dx + \eta(dx)$ with $\eta$ a unique singular finite measure. Let $H(x) = \eta((-\infty, x])$. Then $F(x) = \int_{-\infty}^x \phi(u)\,du + H(x)$ and so a.e. $f(x) = \phi(x) + H'(x)$. Since $H$ is non-decreasing, it follows that a.e. $f \ge \phi$. Thus, the difference $f - \phi$ defines a measure $\nu$: set $\nu(A) = \int_A \{f(u) - \phi(u)\}\,du$. Since
$$P(A) = \int_A \phi(u)\,du + \eta(A) \ge \int_A f(u)\,du = \int_A \phi(u)\,du + \nu(A),$$
it follows that $\nu$ is dominated by $\eta$ and so $\nu$ is singular. Being absolutely continuous as well, $\nu = 0$, and this implies that $f = \phi$ a.e. $\square$

7. Additional exercises*

Exercise 4.7.1. Let $X$ be a random variable defined on a probability space $(\Omega, \mathcal{F}, P)$. Show that
(1) $\sum_{n=0}^\infty (n+1) P[n < |X| \le n+1] = \sum_{n=0}^\infty P[|X| > n]$. [Hint: observe that $P[0 < |X|] = P[0 < |X| \le 1] + P[1 < |X|]$, or use Fubini's theorem on the product of $(\Omega, \mathcal{F}, P)$ and $(\mathbb{N}\cup\{0\}, \mathcal{P}(\mathbb{N}\cup\{0\}), dn)$, where $dn$ is counting measure.]
Use (1) to show that
(2) $\sum_{n=1}^\infty n P[n < |X| \le n+1] = \sum_{n=1}^\infty P[|X| > n]$, and
(3) $\sum_{n=1}^\infty P[|X| > n] \le E[|X|] \le 1 + \sum_{n=1}^\infty P[|X| > n]$.
Modify (1) and show that
(4) $\sum_{n=1}^\infty n P[n \le |X| < n+1] = \sum_{n=1}^\infty P[|X| \ge n]$, and
(5) $\sum_{n=1}^\infty P[|X| \ge n] \le E[|X|] \le 1 + \sum_{n=1}^\infty P[|X| \ge n]$.
Conclude that
(6) $X \in L^1(\Omega, \mathcal{F}, P)$ if and only if $\sum_{n=1}^\infty P[|X| > n] < \infty$ and if and only if $\sum_{n=1}^\infty P[|X| \ge n] < \infty$.
Extend (5) and (6) to show that, if $m \ge 1$, then
(7) $\frac{1}{m}\sum_{\ell=1}^\infty P[|X| > \frac{\ell}{m}] \le E[|X|] \le \frac{1}{m} + \frac{1}{m}\sum_{\ell=1}^\infty P[|X| > \frac{\ell}{m}]$.
Conclude that
(8) $\int_0^\infty P[|X| > x]\,dx = E[|X|]$.
Assume that $X$ is non-negative, i.e., $F(0-) = 0$. Show that if $F(0-) = 0$ then
(9) $\int \{1 - F(x)\}\,dx = \int x\,dF(x) = E[X]$ (see Exercise 3.7.21 for a proof based on integration by parts), and
(10) $E[X^p] = \int x^p\,dF(x) = \int \{1 - F(x^{1/p})\}\,dx = \int \{1 - F(u)\}\, p u^{p-1}\,du$. [Hint: see Remark (3) following Lemma 4.1.3 for the last equality.]
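Parts (3) and (8) lend themselves to a quick Monte Carlo check (a sketch, not from the text; `surv` is an illustrative estimate of the survival function $x \mapsto P[X > x]$):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.sort(rng.pareto(a=2.5, size=400_000))   # non-negative, integrable (a > 1)

def surv(x):
    """Empirical survival function P[X > x]."""
    return 1.0 - np.searchsorted(X, x, side='right') / X.size

tail_sum = sum(surv(n) for n in range(1, 500))   # ~ sum_{n>=1} P[X > n]
EX = X.mean()
print(tail_sum, EX, 1.0 + tail_sum)              # (3): lower <= E[X] <= upper

xs = np.linspace(0.0, X[-1], 100_000)
print(np.mean(surv(xs)) * X[-1], EX)             # (8): int_0^inf P[X > x] dx ~ E[X]
```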
Exercise 4.7.2. (Alternate methods to do Exercise 4.7.1 (10)). Let $Y$ be a non-negative random variable on $(\Omega, \mathcal{F}, P)$. Show that, for any $p$, $1 \le p < \infty$,
$$E[Y^p] = \int_0^\infty p\lambda^{p-1} P[Y > \lambda]\,d\lambda.$$
Do this two ways: (1) first verify it for a simple function $s = \sum_{i=1}^n a_i 1_{A_i}$, where $a_1 < a_2 < \dots < a_n$, and then pass to the limit; and (2) use the identity
$$(Y(\omega) \wedge n)^p = \int_0^{Y(\omega)\wedge n} p\lambda^{p-1}\,d\lambda = \int_0^n p\lambda^{p-1} 1_{\{Y(\omega) > \lambda\}}\,d\lambda$$
and Fubini's theorem (Theorem 3.3.5) to compute $E[(Y\wedge n)^p]$ in two ways. Show also that
$$E[Y^p] = \int_0^\infty p\lambda^{p-1} P[Y \ge \lambda]\,d\lambda.$$

Remark. $P[Y > \lambda] = P[Y \ge \lambda]$ except for at most a countable number of values of $\lambda$. This is because, for a non-decreasing or non-increasing real-valued function $\phi$ defined on an interval $[a,b]$, the left limit $\phi(t-)$ equals the right limit $\phi(t+)$ at all but a countable number of values of $t$: consider for how many points $t$ the magnitude of the "jump" $\phi(t+) - \phi(t-)$ can exceed $1/n$.

Exercise 4.7.3. Let $f$ be a non-negative Lebesgue integrable function on $\mathbb{R}$. Then there is a unique non-negative measure $\eta$ on $\mathbb{R}$ such that $\eta(B) = |f^{-1}(B)|$ for all Borel subsets $B$ of $\mathbb{R}$. [This measure is the analogue of the distribution $Q$ of a random variable $X$: $Q(B) = P(X^{-1}(B))$; see the proof of Proposition 2.1.9.] Show that
(1) $B \to |f^{-1}(B)|$ is a non-negative measure $\eta$ on $\mathfrak{B}(\mathbb{R})$ that is the image of Lebesgue measure under $f$ (see the remark following Proposition 2.1.19 and Lemma 4.1.3),
(2) $\eta((-\infty, 0)) = 0$ and $\eta((a,b]) < \infty$ if $0 < a < b < +\infty$,
(3) $\eta(\{0\}) = +\infty$ if $|f^{-1}((0,+\infty))| < \infty$ and no conclusion can be drawn about $\eta(\{0\})$ if $|f^{-1}((0,+\infty))| = \infty$.
By (2), $\eta$ is a measure on $[0,+\infty)$. Let $\nu$ be its restriction to $(0,+\infty)$, i.e., to the Borel subsets of $(0,+\infty)$. Then $\nu$ is a $\sigma$-finite measure on $((0,+\infty), \mathfrak{B}((0,+\infty)))$. The measure $\nu$ is determined by what analysts call the distribution function of $f$ (see [W1], p. 77 and [S1], p. 57), which will be referred to here as the analyst's distribution function. This is the function $\omega_f$ defined by $\omega_f(\lambda) = |\{f > \lambda\}|$. Notice that the integrability
of $f$ implies that $|\{f > \lambda\}| < +\infty$ for $\lambda > 0$ but that $|\{f < \lambda\}|$ may very well be infinite. Show that
(4) the analyst's distribution function $\omega_f$ is decreasing and right continuous,
(5) $\nu$ is the unique $\sigma$-finite measure $\mu$ on $((0,+\infty), \mathfrak{B}((0,+\infty)))$ such that $\mu((a,b]) = \omega_f(a) - \omega_f(b)$ if $0 < a < b < +\infty$,
(6) $\int \phi(y)\,\nu(dy) = \int \phi(f(x))\,dx$ for all non-negative Borel functions $\phi : (0,+\infty) \to \mathbb{R}$, and
(7) $\int f(x)^p\,dx = \int_0^{+\infty} p\lambda^{p-1}\omega_f(\lambda)\,d\lambda$. [Hint: the method of Exercise 4.7.2 (1) can be used.]

Remark. Let $f$ be a Lebesgue measurable function on $\mathbb{R}$. The preceding exercise shows that if $f \in L^1$ and is non-negative, then the analyst's distribution function $\omega_f$ determines the measure on $(0,+\infty)$ that is the image of Lebesgue measure under $f$. In addition, if $E \subset \mathbb{R}$ has finite Lebesgue measure $|E|$ and $g = f|_E$, then the image $\eta_E$ of Lebesgue measure under $g$ is determined by the analyst's distribution function $\omega_g$ of $g$. In fact, $\omega_g(\lambda) = |\{f > \lambda\} \cap E|$ and, if $M = |E|$, then $G(\lambda) = M - \omega_g(\lambda)$ is non-negative, bounded, and right continuous. The measure $\eta_E$ is the finite measure determined by $G$ (see Theorem 2.2.2).

Exercise 4.7.4. Prove Proposition 4.1.14 by reducing to the case $a + b = 1$ and then using a Lagrange multiplier to locate the minimum of $G(a,b)$.

Exercise 4.7.5. (Young's inequality) Let $y = \varphi(x)$ be a continuous, strictly increasing function on $\mathbb{R}^+ = [0,\infty)$ with $\varphi(0) = 0$ and $\lim_{x\to\infty} \varphi(x) = +\infty$. Let $\psi(y) = x$ denote the inverse function (i.e., $\psi(y) = x$ if and only if $y = \varphi(x)$). Set $\Phi(x) = \int_0^x \varphi(u)\,du$ and $\Psi(y) = \int_0^y \psi(u)\,du$. Show that, for $x \ge 0$,
$$x\varphi(x) = \Phi(x) + \Psi(\varphi(x)).$$
[Hint: interpret $x\varphi(x)$ as the area of the rectangle determined by $(0,0)$ and $(x, \varphi(x))$.] Compare the area $\int_0^c \varphi(u)\,du$ under the curve $y = \varphi(x)$, $0 \le x \le c$, with the area of the rectangle determined by $(0,0)$ and $(c,d)$ when $d \le \varphi(c)$. Conclude that
(1) if $c, d \ge 0$, then $cd \le \Phi(c) + \Psi(d)$, and hence that
(2) $cd \le \frac{c^p}{p} + \frac{d^q}{q}$ if $c, d \ge 0$ and $\frac{1}{p} + \frac{1}{q} = 1$.

Exercise 4.7.6. Assume that Jensen's inequality in Proposition 4.1.19 holds for any convex function $\varphi$ and $X \in L^1(\Omega, \mathcal{F}, \mu)$, where $\mu$ is a positive measure. Show that $\mu$ is a probability. [Hint: let $\varphi$ be an affine function, i.e., $\varphi(x) = ax + b$.]

Exercise 4.7.7. Show that
(1) if $f \in L^1(\mathbb{R})$, $g \in L^p(\mathbb{R})$, $1 \le p \le \infty$, then $f * g \in L^p(\mathbb{R})$.
Also show that
(2) $\| f * g \|_p \le \| f \|_1 \| g \|_p$. [Hint: for $1 < p < \infty$, use Jensen's inequality (Proposition 4.1.19) to estimate $(\int |g|\,d\nu)^p$, where $\nu(dy) = \| f \|_1^{-1} |f(x-y)|\,dy$.]

Exercise 4.7.8. (Egorov's theorem) Let $(f_n)_{n\ge 1}$ be a sequence of measurable functions $f_n$ on a finite measure space $(\Omega, \mathcal{F}, \mu)$. Assume that $f_n \to 0$ $\mu$-a.e. Let $\varepsilon > 0$. Show that there is a measurable set $A$ such that
(1) $\mu(A^c) < \varepsilon$, and
(2) on $A$, the sequence $(f_n)_{n\ge 1}$ converges to zero uniformly.
[Hints: one may assume that $\mu$ is a probability; modify the argument of Proposition 4.3.2 to show that for each $k$ there is an integer $N(k,\varepsilon)$ with $\mu(\Gamma(\frac{1}{k}, N)) > 1 - \frac{\varepsilon}{2^k}$ if $N \ge N(\varepsilon, k)$; set $A = \cap_{k=1}^\infty \Gamma(\frac{1}{k}, N(\varepsilon, k))$.]

Exercise 4.7.9. (Lusin's theorem) (see Exercise 4.2.1) Let $E \subset \mathbb{R}$ be a Lebesgue measurable set, and let $f$ denote a function $f : E \to \mathbb{R}$. Then $f$ is Lebesgue measurable if and only if, for $\varepsilon > 0$, there is a closed set $A$ with $A \subset E$ such that (i) the restriction of $f$ to $A$ is continuous on $A$ and (ii) $|E \setminus A| < \varepsilon$. The proof of this result is a fairly long exercise and will be done in two parts. Recall that the necessity of the condition is fairly easy to prove and has already been done as Exercise 2.3.7. It remains to prove the hard part, i.e., the sufficiency of the condition. To begin, one reduces the result to the case where $|E| < \infty$.

Part A. Let $E_n = E \cap \{x \mid n-1 < |x| \le n\}$, $n \ge 1$. Then $E \setminus \{0\} = \cup_{n=1}^\infty E_n$. Assume that Lusin's theorem holds for $E$ bounded. Let $\varepsilon > 0$, and for each $n \ge 1$, let $A_n \subset E_n$ be a closed (and hence compact) set such that (i) the restriction of $f$ to $A_n$ is continuous and (ii) $|E_n \setminus A_n| < \frac{\varepsilon}{2^n}$. Show that
(1) $A = \cup_{n=1}^\infty A_n$ is a closed set. [Hint: if a sequence in $A$ converges in $\mathbb{R}$, show that it is bounded and so for some $N$ is contained in the closed set $\cup_{n=1}^N A_n$.],
(2) $|E \setminus A| < \varepsilon$, and
(3) the restriction of $f$ to $A$ is continuous on $A$. [Hint: use the suggestion for (1).]
Conclude that Lusin's theorem holds if it holds for bounded measurable sets.

To prove Lusin's theorem when $|E| < \infty$, one uses Egorov's theorem to pass from the simple case discussed in Exercise 4.2.1 to the final result. The idea of the proof is simple enough. First, it is enough to prove it for a non-negative function $f$. Such a function is a limit of a sequence $(s_n)_{n\ge 1}$ of simple functions $s_n$, and Egorov's theorem implies that, except
on a set of small measure, these functions converge uniformly to $f$. Lusin's theorem is valid for each simple function $s_n$ (Exercise 4.2.1), and since the uniform limit of continuous functions is continuous (see the comment following Exercise 4.2.11), one hopes to be able to control the exceptional sets corresponding to each simple function $s_n$ and thereby prove the result.

Part B. Assume that $|E| < \infty$. Let $f \ge 0$ and $(s_n)_{n\ge 1}$ be a sequence of simple functions that converges to $f$. Let $\varepsilon > 0$. By Egorov's theorem, there is a subset $F$ of $E$ such that (1) $|E \setminus F| < \varepsilon$ and (2) $s_n \to f$ uniformly on $F$ (see Definition 3.6.15). For each $n$, by using Lusin's theorem for simple functions (Exercise 4.2.1), choose a closed set $A_n$ with $A_n \subset F$ such that (i) the restriction of $s_n$ to $A_n$ is continuous on $A_n$ and (ii) $|F \setminus A_n| < \frac{\varepsilon}{2^n}$. Show that
(1) $A = \cap_{n=1}^\infty A_n$ is a closed set,
(2) $|F \setminus A| < \varepsilon$, and
(3) the restriction of $f$ to $A$ is continuous.
Conclude that Lusin's theorem holds if $f \ge 0$, and thus deduce its validity for any finite, measurable function $f$.

Exercise 4.7.10. Let $X$ be a random variable. By integrating over $\{|X| > \varepsilon\}$ and $\{|X| \le \varepsilon\}$, determine upper and lower bounds for $E\left[\frac{|X|}{1+|X|}\right]$ that involve $\varepsilon$ and $P[|X| > \varepsilon]$. Conclude that, for a sequence of random variables $(X_n)_{n\ge 1}$, $X_n \xrightarrow{P} X$ if and only if
$$E\left[\frac{|X_n - X|}{1 + |X_n - X|}\right] \to 0.$$
Show that $d(X,Y) = E\left[\frac{|X-Y|}{1+|X-Y|}\right]$ is a metric on the vector space $L$ of finite random variables, with the usual proviso about a random variable being zero if it is equal to zero a.s. Note that this shows that on $L$ convergence in probability is given by a metric.

Exercise 4.7.11. Let $1 \le p_1 < p_2 < \infty$ and $u \in L^{p_2}(\Omega, \mathcal{F}, \mu)$. Let $A \in \mathcal{F}$ with $\mu(A) < \infty$. Assume that $u = u1_A$. Show that
(1) $\left[\int |u|^{p_1}\,d\mu\right]^{1/p_1} \le \mu(A)^{\frac{1}{p_1} - \frac{1}{p_2}} \left[\int |u|^{p_2}\,d\mu\right]^{1/p_2}$, i.e., $\| u \|_{p_1} \le \mu(A)^{\frac{1}{p_1} - \frac{1}{p_2}} \| u \|_{p_2}$.
If $f \in L^1(\Omega, \mathcal{F}, \mu) \cap L^\infty(\Omega, \mathcal{F}, \mu)$, show that
(2) $f \in L^p(\Omega, \mathcal{F}, \mu)$ for all $p \in [1, +\infty)$, and
(3) $\| f \|_p \to \| f \|_\infty$ as $p \to +\infty$. [Hints: prove this first when $\mu(\Omega) < \infty$, and then, by using $\nu(d\omega) = |f(\omega)|\,\mu(d\omega)$, reduce to the case of a finite measure.]
Finally, show that
(4) the conclusion (3) holds for any measurable function $f$ if $\| f \|_\infty = +\infty$.
Remark. This exercise plays a key role in Cramér's theory of large deviations (see Stroock [S2]).

Exercise 4.7.12. Let $(a_n)_{n\ge 1}$ be a Cauchy sequence of real numbers relative to the usual distance $d(a,b) = |a - b|$. Show that
(1) $\limsup_n a_n < +\infty$ [Hint: show that there is an integer $N$ such that $a_n < a_N + 1$, $n \ge N$.],
(2) $\liminf_n a_n > -\infty$,
(3) if $\varepsilon > 0$, then $\limsup_n a_n - \liminf_n a_n < \varepsilon$.
Conclude from Exercise 2.1.10 that
(4) $(a_n)_{n\ge 1}$ converges, i.e., $\mathbb{R}$ is complete (relative to the usual distance).
If $x \in \mathbb{R}^n$, let $x(i)$ denote the $i$-th coordinate of $x$. Let $(x_m)_{m\ge 1}$ be a sequence in $\mathbb{R}^n$. Show that
(5) it converges (Definition 4.1.25) relative to the metric $d(x,y) = \| x - y \|$, where $\| x \|^2 = \sum_{i=1}^n x(i)^2$, if and only if, for each $i$, $1 \le i \le n$, the sequence $(x_m(i))_{m\ge 1}$ converges relative to the usual metric of $\mathbb{R}$,
(6) it is Cauchy relative to the Euclidean metric (the metric in (5)) if and only if, for each $i$, $1 \le i \le n$, the sequence $(x_m(i))_{m\ge 1}$ is Cauchy relative to the usual metric of $\mathbb{R}$.
Conclude that
(7) $\mathbb{R}^n$ is complete relative to the Euclidean metric.

Exercise 4.7.13. Let $(X_n)_{n\ge 1}$ be a sequence of random variables such that (1) $X_n \xrightarrow{P} X$, and (2) $(X_n)_{n\ge 1}$ is uniformly integrable. Show that $X \in L^1$. [Hint: use the proof of Corollary 4.5.9 and get an upper estimate for $E[|X_c|]$ in terms of $E[|X_n|]$.] Conclude that a sequence $(X_n)_{n\ge 1}$ of random variables in $L^1$ converges in $L^1$ if and only if (i) it converges in probability and (ii) it is uniformly integrable.

Exercise 4.7.14. Verify Corollary 4.5.9 for random variables in $L^p$, i.e., show that $X_n \xrightarrow{L^p} X$ if the sequence is uniformly integrable in $L^p$ and converges in probability to $X$.

Exercise 4.7.15. State and prove the $L^p$-analogue of Proposition 4.5.7. [Hint: make use of Theorem 4.5.8 and a Lipschitz contraction to uniformly approximate in $L^p$.] Conclude that a sequence $(X_n)_{n\ge 1}$ of random variables in $L^p$ converges in $L^p$ if and only if (i) it converges in probability and (ii) it is uniformly integrable in $L^p$.

Exercise 4.7.16. State and prove the analogue of Exercise 4.7.13 for a sequence $(X_n)_{n\ge 1}$ of random variables in $L^p$.
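The metric of Exercise 4.7.10 is easy to experiment with numerically. In the sketch below (not from the text; `d` is an illustrative helper), $d(X_n, 0) \to 0$ tracks convergence in probability for the sequence $X_n = n 1_{[0,1/n]}$, even though $E|X_n| = 1$ for every $n$, so there is no convergence in $L^1$.

```python
import numpy as np

rng = np.random.default_rng(3)
U = rng.uniform(size=300_000)

def d(A, B):
    """Estimate of d(X, Y) = E[|X - Y| / (1 + |X - Y|)] from samples A, B."""
    g = np.abs(A - B)
    return np.mean(g / (1.0 + g))

for n in [10, 100, 1000, 10_000]:
    X_n = n * (U <= 1.0 / n)      # -> 0 in probability, not in L^1
    print(n, d(X_n, 0.0), np.mean(np.abs(X_n)))
```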
Exercise 4.7.17. Let $A$ be a closed subset of $\mathbb{R}$ and define $\mathrm{dist}(x, A) = \inf\{|x - y| \mid y \in A\}$. Show that
(1) $|\mathrm{dist}(x_1, A) - \mathrm{dist}(x_2, A)| \le |x_1 - x_2|$ [Hint: use the triangle inequality (Exercise 1.1.16).] (Note that this proves the continuity of $x \to \mathrm{dist}(x, A)$.),
(2) $x \in A$ if and only if $\mathrm{dist}(x, A) = 0$.
Let $C$ be a compact set (i.e., closed and bounded, as stated in part A of Exercise 1.5.6) and assume $C \subset O$, where $O$ is an open set. Then, for any $x \in C$, there is an $r_x > 0$ such that $B(x, r_x) = (x - r_x, x + r_x) \subset O$. Show that
(3) there is an $R > 0$ such that $\mathrm{dist}(x, C) \le R$ implies $x \in O$. [Hint: use the Heine-Borel theorem (Theorem 1.4.5).]
Conclude that
(4) if $f(x) = \min\{\mathrm{dist}(x, C), R\}$, then $f$ is a continuous function with $0 \le f(x) \le R$ such that $f(x) = 0$ if and only if $x \in C$ and $f(x) = R$ if $x \notin O$.

Exercise 4.7.18. Let $E$ be a bounded, Lebesgue measurable set with $0 < |E| < \infty$, and let $0 < \alpha < 1$. Use Exercise 4.2.2 to show that
(1) there is an open set $O \supset E$ with $\alpha|O| < |E|$.
Use the fact that the open set $O$ is a disjoint union of at most countably many open intervals (Exercise 1.3.10 (d)) to show that
(2) there is a bounded open interval $(a,b)$ with $\alpha(b-a) = \alpha|(a,b)| < |E \cap (a,b)|$.
Let $E \cap (a,b) \overset{\text{def}}{=} E_1$ and $\delta_1 = \delta(b-a)$, $\delta > 0$. Show that, if $-\delta_1 < x < \delta_1$, then
(3) $E_1 + x = \{y + x \mid y \in E_1\} \subset (a - \delta_1, b + \delta_1)$,
(4) $x = u - v$, $u, v \in E_1$ if and only if $E_1 \cap (E_1 + x) \ne \emptyset$,
(5) $E_1 \cap (E_1 + x) = \emptyset$ implies $1 + 2\delta > 2\alpha$.
Conclude that if $\frac{3}{4} < \alpha < 1$ and $0 < \delta < \frac{1}{4}$, the open interval $(-\delta_1, \delta_1) \subset E_1 - E_1$. Deduce that for any Lebesgue measurable set $E$, if $|E| > 0$, then $D = \{y_1 - y_2 \mid y_i \in E\}$ contains a non-void open interval.

Exercise 4.7.19. (Borel's strong law of large numbers) Let $(X_n)_{n\ge 1}$ be an i.i.d. sequence of Bernoulli random variables $X_n$ with $P[X_n = 1] = p$ and $P[X_n = 0] = q$. It follows from Kolmogorov's strong law (Theorem 4.4.2) that $\frac{S_n}{n} \to p$ a.s. This exercise, following Loève [L2], outlines a simple proof of the result exploiting the fact that the common distribution is Bernoulli (i.e., $Q = q\varepsilon_0 + p\varepsilon_1$). Let $Z_n = \frac{S_n}{n}$. Show that
(1) $\sigma^2(Z_n) = \frac{pq}{n}$,
(2) $\sum_{k=1}^\infty \sigma^2(Z_{k^2}) < +\infty$, and
(3) if $k^2 \le n < (k+1)^2$, then $|Z_n - Z_{k^2}| \le \frac{4}{k}$. [Hint: observe that $Z_n = \frac{1}{n}\sum_{i=1}^{k^2} X_i + \frac{1}{n}\sum_{i=k^2+1}^n X_i$, and estimate, for example, $\sum_{i=k^2+1}^n X_i$ by $n - (k^2+1) + 1 \le 2k$.]
Use (2) to show that $Z_{k^2} \to p$ a.s. [Hints: note that a sequence $z_n \to 0$ if and only if, for all $m \ge 1$, one has $|z_n| \le \frac{1}{m}$ for large $n$. Chebychev's inequality implies that $P[|Z_n - p| > \frac{1}{m}] \le m^2\sigma^2(Z_n)$. (2) implies that $\sum_{k=1}^\infty P[|Z_{k^2} - p| > \frac{1}{m}] < \infty$.] Finally, use (3) to prove that $Z_n \to p$ a.s. as $n \to \infty$.

Now assume that the i.i.d. random variables are all bounded with (say) $|X_n| \le M$ for all $n \ge 1$. Show that
(4) $\sigma^2(Z_n) \le \frac{(M + |m|)^2}{n}$, where $m = E[X_n]$,
(5) $\sum_{k=1}^\infty \sigma^2(Z_{k^2}) < +\infty$, and
(6) $|Z_n - Z_{k^2}| \le \frac{4M}{k}$ if $k^2 \le n < (k+1)^2$.
Conclude, as in the Bernoulli case, that $Z_n \to m$ a.s.

Exercise 4.7.20. This exercise explains the relation between normal numbers and Borel's strong law by showing that $([0,1], \mathfrak{B}([0,1]), dx)$ is almost isomorphic as a probability space to an infinite product space. This terminology is now explained.

Definition 4.7.21. Let $(\Omega_1, \mathcal{F}_1, \mu_1)$ and $(\Omega_2, \mathcal{F}_2, \mu_2)$ be measure spaces. They are said to be isomorphic if there is a bijection (i.e., a 1-1, onto map) $\Phi : \Omega_1 \to \Omega_2$ such that
(1) $\Phi$ and $\Phi^{-1}$ are both measurable (i.e., $A_2 \in \mathcal{F}_2$ if and only if $\Phi^{-1}(A_2) \in \mathcal{F}_1$), and
(2) $\mu_1(\Phi^{-1}(A_2)) = \mu_2(A_2)$ for all $A_2 \in \mathcal{F}_2$.
Let $(\Omega, \mathcal{F}, \mu)$ be a measure space and $\Lambda$ be a measurable subset with $\mu(\Omega\setminus\Lambda) = 0$. Set $\mathfrak{G} \overset{\text{def}}{=} \{A \cap \Lambda \mid A \in \mathcal{F}\}$, and let $\nu(A \cap \Lambda) \overset{\text{def}}{=} \mu(A)$ for all $A \in \mathcal{F}$. Then $(\Lambda, \mathfrak{G}, \nu)$ is a measure space that is said to be almost isomorphic to $(\Omega, \mathcal{F}, \mu)$. Finally, two measure spaces $(\Omega_1, \mathcal{F}_1, \mu_1)$ and $(\Omega_2, \mathcal{F}_2, \mu_2)$ are said to be almost isomorphic if there are measurable subsets $\Lambda_1 \subset \Omega_1$ and $\Lambda_2 \subset \Omega_2$ with $\mu_1(\Omega_1\setminus\Lambda_1) = 0$ and $\mu_2(\Omega_2\setminus\Lambda_2) = 0$ such that the measure spaces $(\Lambda_1, \mathfrak{G}_1, \nu_1)$ and $(\Lambda_2, \mathfrak{G}_2, \nu_2)$ are isomorphic.

The exercise has several parts.

Part A. For each $n \ge 1$, consider the dyadic rational numbers of the form $\frac{k}{2^n}$, $0 \le k \le 2^n$. They partition $(0,1]$ into $2^n$ intervals $(\frac{k}{2^n}, \frac{k+1}{2^n}]$, each of length $2^{-n}$. Define the random variable $X_n$ on $((0,1], \mathfrak{B}((0,1]), dx)$ by setting
$$X_n(x) = \begin{cases} 1 & \text{if } x \in \left(\frac{2j+1}{2^n}, \frac{2j+2}{2^n}\right] \text{ for some integer } j \ge 0, \\ 0 & \text{otherwise} \end{cases}$$
(so that $X_n(x)$ is the $n$-th digit of a binary expansion of $x$).
Show that
(1) the random variables $X_n$, $n \ge 1$, are i.i.d. Bernoulli random variables with $|\{X_n = 0\}| = |\{X_n = 1\}| = \frac{1}{2}$,
(2) $X_n(x_1) = X_n(x_2)$ for all $n \ge 1$ if and only if $x_1 = x_2$,
(3) $\sum_{n=1}^\infty X_n(x)2^{-n} = x$.
If $a_n \in \{0,1\}$ for $n \ge 1$, let $0.a_1a_2\cdots a_n\cdots \overset{\text{def}}{=} \sum_{n=1}^\infty a_n 2^{-n} = a \in [0,1]$. Then $0.a_1a_2\cdots a_n\cdots$ is called a binary expansion of $a$. Show that
(4) the binary expansion of $x$ given by $x = 0.X_1(x)X_2(x)\cdots X_n(x)\cdots$ is the unique binary expansion of $x$ that does not terminate in an unbroken string of zeros, and
(5) if $t = \sum_{n=1}^N X_n(x)2^{-n}$, then $t = \frac{k}{2^N}$, where $k$ is the integer with $\frac{k}{2^N} < x \le \frac{k+1}{2^N}$.

Part B. Let $\Omega = (\Omega, \mathcal{F}, P)$ be the countable product of an infinite number of copies of the probability space $(\Omega_0, \mathcal{F}_0, P_0)$, where $\Omega_0 = \{0,1\}$ and the probability $P_0$ defined on all the subsets of $\Omega_0$ gives each point weight $\frac{1}{2}$. Let $\Lambda \subset \Omega$ be the set of functions $\omega : \mathbb{N} \to \{0,1\}$ such that $\omega(n) = 1$ infinitely often. Define the random variables $Y_n : \Omega \to \{0,1\}$ for $n \ge 1$ by setting $Y_n(\omega) = \omega(n)$. Show that
(1) $\Omega \setminus \Lambda$ is countable and $P(\Lambda) = 1$.
Define $X : (0,1] \to \Omega$ by setting $X(x)(n) = X_n(x)$ for all $n \ge 1$. Show that
(2) $X$ is a measurable map (or $\Omega$-valued random variable). [Hint: verify that the map $X$ into $\Omega$ is measurable if and only if $Y_n \circ X$ is measurable for all $n \ge 1$.]
The random variable $X$ maps $(0,1]$ to $\Lambda$ in view of Part A (4). It has an inverse $S : \Lambda \to (0,1]$ given by $S(\omega) \overset{\text{def}}{=} \sum_{n=1}^\infty \omega(n)2^{-n}$. Show that
(3) $S(X(x)) = x$ for all $x \in (0,1]$, and
(4) $X(S(\omega)) = \omega$ for all $\omega \in \Lambda$.
Hence, $X : (0,1] \to \Lambda$ is a bijection.
The goal now is to show that $S$ is measurable. Let $\mathfrak{G}$ denote the $\sigma$-algebra on $\Lambda$ consisting of sets of the form $A \cap \Lambda$, $A \in \mathcal{F}$. It is shown in Exercise 5.4.6 that $\mathfrak{B}((0,1])$ is the smallest $\sigma$-field that contains all the dyadic intervals. It follows from Exercise 2.1.18 that $S$ is measurable if $S^{-1}((\frac{k}{2^n}, \frac{k+1}{2^n}])$ is the intersection of $\Lambda$ with a set in the product $\sigma$-algebra $\mathcal{F}$. Show that
(5) $S^{-1}((\frac{k}{2^n}, \frac{k+1}{2^n}]) = \{\omega \mid \omega(i) = a_i, 1 \le i \le n\} \cap \Lambda$, where the coefficients $a_i \in \{0,1\}$ are uniquely determined by the property that $k2^{-n} = \sum_{i=1}^n a_i 2^{-i}$. [Hint: use Part A (5).]
This proves that the measurable spaces $(\Lambda, \mathfrak{G})$ and $((0,1], \mathfrak{B}((0,1]))$ are isomorphic.
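A short sketch of the two maps (not from the text; floating point limits the expansion to roughly the 52 bits of a double): `digits` computes $X_1(x), X_2(x), \dots$ via the doubling map, producing the expansion of Part A that never terminates in zeros, and `S` sums the series $\sum_n \omega(n)2^{-n}$.

```python
import numpy as np

def digits(x, N):
    """First N binary digits X_1(x), ..., X_N(x) of x in (0, 1]."""
    out = []
    for _ in range(N):
        b = 1 if x > 0.5 else 0   # digit is 1 exactly on the upper half (1/2, 1]
        out.append(b)
        x = 2 * x - b             # the remainder stays in (0, 1]
    return out

def S(bits):
    """Partial sum of S(omega) = sum_n omega(n) 2^{-n}."""
    return sum(b * 2.0 ** -(n + 1) for n, b in enumerate(bits))

print(digits(0.5, 8))             # 1/2 = 0.0111...: the expansion ending in ones
print(S(digits(0.37, 52)), 0.37)  # S(X(x)) recovers x (up to float precision)

# Borel's strong law (Exercise 4.7.19): digit frequencies of a random x are ~ 1/2
rng = np.random.default_rng(4)
print(np.mean([np.mean(digits(u, 50)) for u in rng.uniform(size=200)]))
```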
Part C. It remains to show that the distribution of $S$ is Lebesgue measure on $(0,1]$ and that the distribution of $X$ is the measure $P$ restricted to $\mathfrak{G}$ (the product $\sigma$-algebra $\mathcal{F}$ restricted to $\Lambda$). At the end of Part B, it was shown that a dyadic interval $(\frac{k}{2^n}, \frac{k+1}{2^n}]$ corresponds under $X$ and $S$ to the intersection with $\Lambda$ of the so-called cylinder set $\{\omega \mid \omega(i) = a_i, 1 \le i \le n\}$, where the $a_i$ are the coefficients of the finite binary expansion of $\frac{k}{2^n}$. These sets in $(0,1]$ and $\Omega$, respectively, generate the corresponding $\sigma$-algebras. Show that
(1) $|(\frac{k}{2^n}, \frac{k+1}{2^n}]| = P(\{\omega \mid \omega(i) = a_i, 1 \le i \le n\})$.
Conclude that
(2) the distribution of $S$ is Lebesgue measure, and
(3) $P$ is the distribution of $X$.
This shows that $(\Lambda, \mathfrak{G}, P)$ and $((0,1], \mathfrak{B}((0,1]), dx)$ are isomorphic measure spaces. Since $P(\Lambda) = 1$, the measure space $(\Lambda, \mathfrak{G}, P)$ is almost isomorphic to $(\Omega, \mathcal{F}, P)$. It follows that $([0,1], \mathfrak{B}([0,1]), dx)$ and the infinite product space $(\Omega, \mathcal{F}, P)$ are almost isomorphic measure spaces since $((0,1], \mathfrak{B}((0,1]), dx)$ is clearly almost isomorphic to $([0,1], \mathfrak{B}([0,1]), dx)$.

Exercise 4.7.22. Let $(X_n)_{n\ge 1}$ be a sequence of independent random variables in $L^1(\Omega, \mathcal{F}, P)$ with $E[X_n] = 0$ for all $n \ge 1$. Assume that this sequence satisfies the strong law of large numbers, i.e., $\frac{1}{n}\sum_{i=1}^n X_i \to 0$ a.s. Show that $\sum_{n=1}^\infty P[\frac{|X_n|}{n} > \varepsilon] < \infty$ for any $\varepsilon > 0$. [Hint: make use of the Borel-Cantelli lemma (Proposition 4.1.29).]

Exercise 4.7.23. Let $(X_i)_{1\le i\le n}$ be a finite set of i.i.d. random variables in $L^2(\Omega, \mathcal{F}, P)$. Let $\sigma^2$ denote the common variance. If $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$, show that
$$E\left[\frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2\right] = \sigma^2.$$
[Comment: this shows that $\frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2$ (the sample variance) is a so-called unbiased estimator of $\sigma^2$.]

Exercise 4.7.24. Let $(X_n)_{n\ge 1}$ be a sequence of independent random variables on $(\Omega, \mathcal{F}, P)$. Assume that $P[X_n = 1] = P[X_n = -1] = \frac{1}{2}(1 - 2^{-n})$ and that $P[X_n = 2^n] = P[X_n = -2^n] = 2^{-n-1}$.
(1) Show that $\sum_{n=1}^\infty \frac{\sigma^2(X_n)}{n^2}$ diverges, and that
(2) the sequence $(X_n)_{n\ge 1}$ satisfies the strong law of large numbers. [Hint: use a suitable truncation.]

Exercise 4.7.25. (Convex functions) A function $\varphi : (a,b) \to \mathbb{R}$ is convex if, for any $a < x < y < b$ and $t \in [0,1]$,
$$\varphi(tx + (1-t)y) \le t\varphi(x) + (1-t)\varphi(y).$$
This means that the line segment joining $(x, \varphi(x))$ to $(y, \varphi(y))$ lies above the graph of $\varphi$ (it may coincide in part). Let $L_n(x) = a_n x + b_n$ for $n \ge 1$. Then, if $\varphi(x) = \sup_n L_n(x)$, it is a convex function on $\{x \mid \varphi(x) < +\infty\}$. The aim of this exercise is to prove the converse: namely, if $\varphi : (a,b) \to \mathbb{R}$ is convex, there is a sequence $(L_n)_{n\ge 1}$ of affine functions $L_n(x) = a_n x + b_n$ such that $\varphi = \sup_n L_n$ on $(a,b)$. This is a long exercise and is presented in several parts.
Part A. Assume that $\varphi$ is a convex function on $(a,b)$. Show that, for any $x \in (a,b)$,
(1) the function $h \to \frac{\varphi(x+h) - \varphi(x)}{h}$ is a non-decreasing function of $h$ for $h > 0$ and for $h < 0$;
(2) $\frac{\varphi(x+k) - \varphi(x)}{k} \le \frac{\varphi(x+h) - \varphi(x)}{h}$ if $a < x + k < x < x + h < b$.
Let $\varphi'_+(x) \overset{\text{def}}{=} \lim_{h\downarrow 0} \frac{\varphi(x+h) - \varphi(x)}{h}$ and $\varphi'_-(x) \overset{\text{def}}{=} \lim_{h\uparrow 0} \frac{\varphi(x+h) - \varphi(x)}{h}$.

Part B. Show that
(1) $\varphi'_+(x)$ and $\varphi'_-(x)$ both exist, are finite non-decreasing functions for $a < x < b$, and $\varphi'_+(x) \ge \varphi'_-(x)$ if $x \in (a,b)$,
(2) $\varphi$ is continuous at each $x \in (a,b)$ [Hint: use (1).],
(3) if $a < x_0 < b$, then the graph of $\varphi$ lies above the graph of each of the two affine functions $L^+(x) = \varphi'_+(x_0)(x - x_0) + \varphi(x_0)$ and $L^-(x) = \varphi'_-(x_0)(x - x_0) + \varphi(x_0)$. [Hint: sketch the situation for a typical convex function, e.g., $\varphi(x) = x^2$.]

Part C. Let $(r_n)_{n\ge 1}$ be an enumeration (i.e., a counting) of $\mathbb{Q} \cap (a,b)$, the rationals in $(a,b)$. Define $L_n(x) = a_n(x - r_n) + b_n$, where $a_n = \varphi'_-(r_n)$ and $b_n = \varphi(r_n)$. Show that
(1) if for some $\delta > 0$, $\varphi(x_0) \ge L_n(x_0) + \delta$ for every $n$, then $\varphi'_-(x_0) = +\infty$. [Hint: compute $\varphi'_-(x_0)$ using a sequence of rationals $(r_{n_k})_k$ that increases to $x_0$, and use Part B (3).]
Conclude that
(2) $\varphi(x) = \sup_{n\ge 1} L_n(x)$ if $a < x < b$.

Exercise 4.7.26. If $\phi, \psi \in L^1(m)$, let
$$(\phi * \psi)(g) \overset{\text{def}}{=} \int \phi(gh^{-1})\psi(h)\,m(dh).$$
Denote by $f$ and $k$ the $2\pi$-periodic functions on $\mathbb{R}$ such that $f(t) = \phi(g)$ if $g = e^{it}$ and $k(y) = \psi(h)$ if $h = e^{iy}$. Show that
$$(\phi * \psi)(g) = \int_{-\pi}^{\pi} f(x - y)k(y)\,dy, \quad \text{where } g = e^{ix}.$$
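Part C of Exercise 4.7.25 can be visualized numerically (a sketch, not from the text): with $\varphi = \exp$ on $(-2, 2)$, support lines taken at a finite grid of base points (standing in for the enumeration of the rationals) already reproduce $\varphi$ up to a small error.

```python
import numpy as np

phi, dphi = np.exp, np.exp        # a convex function and its derivative on (-2, 2)

r = np.linspace(-2.0, 2.0, 41)    # grid of base points r_n (proxy for the rationals)
x = np.linspace(-1.9, 1.9, 9)     # evaluation points

# support lines L_n(x) = phi'(r_n) (x - r_n) + phi(r_n), tabulated for all (x, r_n)
L = dphi(r)[None, :] * (x[:, None] - r[None, :]) + phi(r)[None, :]
sup_L = L.max(axis=1)             # sup_n L_n(x), computed pointwise

print(np.max(np.abs(sup_L - phi(x))))   # small: sup of support lines ~ phi
```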
CHAPTER V

CONDITIONAL EXPECTATION AND AN INTRODUCTION TO MARTINGALES

1. Conditional expectation and Hilbert space

In this chapter, the conditional expectation operator will be defined and then used in the study of martingales. To begin, one considers the simplest cases of conditional expectation, which are closely related to conditional probability. Then, one proves the Riesz representation theorem for continuous linear functionals on Hilbert space as a tool for defining conditional expectation for square integrable random variables. Given this, it is easy to then define the conditional expectation of integrable random variables. Note that this way of explaining conditional expectation can be completely avoided if one wants to use the Radon-Nikodym theorem (Theorem 2.7.19). However, in the author's opinion, the use of Hilbert space ideas to define the conditional expectation operator is simpler and more direct. In addition, it emphasizes the point that conditional expectation is a projection.

The basic idea in the Riesz representation theorem is a simple geometric one: given any closed convex set, it contains a unique point closest to the origin. This is intuitively clear in the plane once one sketches some closed convex sets. The proof for the plane carries over without change to any real Hilbert space.

Conditional expectation: introduction and definition. Let $(\Omega, \mathcal{F}, P)$ be a probability space and $B \in \mathcal{F}$ with $0 < P(B) < 1$. Then, for any $A \in \mathcal{F}$, the conditional probability of $A$ given $B$ is defined to be $\frac{P(A\cap B)}{P(B)}$. If one writes $P(A \cap B)$ as $E[1_A 1_B]$, then this formula extends from random variables of the form $1_A$ to any positive random variable $X$. What results is the conditional expectation of $X$ given $B$: namely, $E[X|B] = \frac{1}{P(B)}\int_B X\,dP$. Similarly, one may define the conditional expectation of $X$ given $B^c$: $E[X|B^c] = \frac{1}{P(B^c)}\int_{B^c} X\,dP$. In this way, to each non-negative random variable $X$, one associates two values: $E[X|B]$ and $E[X|B^c]$. With these two values, one may define a random variable $Y$ on the probability space $(\Omega, \{\emptyset, B, B^c, \Omega\}, P)$ by the formula
$$Y(\omega) = \begin{cases} E[X|B] & \text{if } \omega \in B, \\ E[X|B^c] & \text{if } \omega \in B^c. \end{cases}$$
Note that, if $\mathfrak{G} = \{\emptyset, B, B^c, \Omega\}$, the random variable $Y$ satisfies the condition
$$(CE) \quad \int_C Y\,dP = \int_C X\,dP \quad \text{for all } C \in \mathfrak{G}.$$
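The defining property $(CE)$ for this two-set $\sigma$-field is easy to test numerically. In the sketch below (not from the text), expectations are sample means; $Y$ is built from the two conditional averages and $(CE)$ is checked on the generators $B$ and $B^c$:

```python
import numpy as np

rng = np.random.default_rng(5)
omega = rng.uniform(size=200_000)    # sample of ([0,1], B([0,1]), dx)
X = omega ** 2                       # a non-negative random variable
B = omega < 0.3                      # an event with 0 < P(B) < 1

Y = np.where(B, X[B].mean(), X[~B].mean())   # Y = E[X|B] on B, E[X|B^c] on B^c

# (CE): integrals of X and Y agree over C = B and C = B^c (hence over all of G)
print(np.mean(X * B),  np.mean(Y * B))
print(np.mean(X * ~B), np.mean(Y * ~B))
```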
If $\mathcal{G}$ denotes the $\sigma$-field $\{\emptyset, B, B^c, \Omega\}$, the random variable $Y$ will be denoted by $E[X|\mathcal{G}]$. It will be called a conditional expectation of $X$ given $\mathcal{G}$. This operation of passing from a non-negative random variable $X$ to its conditional expectation $Y$ given $\mathcal{G}$ has the following properties, which will be shown to follow from the defining property (CE):

(CE$_1$) $E[\lambda X|\mathcal{G}] = \lambda E[X|\mathcal{G}]$ for any $\lambda > 0$;
(CE$_2$) $E[X_1|\mathcal{G}] + E[X_2|\mathcal{G}] = E[X_1 + X_2|\mathcal{G}]$ if the $X_i$ are non-negative;
(CE$_3$) $E[X_1|\mathcal{G}] \le E[X_2|\mathcal{G}]$ $P$-a.s. if $0 \le X_1 \le X_2$;
(CE$_4$) if $X \in \mathcal{G}$, $E[X|\mathcal{G}] = X$, and hence $E[E[X|\mathcal{G}]|\mathcal{G}] = E[X|\mathcal{G}]$;
(CE$_5$) $E[X] = E[E[X|\mathcal{G}]]$.

In effect, it is a linear projection of the convex cone $\mathcal{F}^+$ of non-negative random variables on $(\Omega, \mathcal{F}, P)$ onto the convex cone $\mathcal{G}^+$ of non-negative random variables on $(\Omega, \mathcal{G}, P)$, where property (CE$_4$) guarantees that it is a projection: $E[E[X|\mathcal{G}]|\mathcal{G}] = E[X|\mathcal{G}]$ for all $X \in \mathcal{F}^+$. Furthermore, this projection depends upon the probability $P$.

If $B_1, B_2, \dots, B_n$ are $n$ disjoint sets in $\mathcal{F}$ with $P(B_i) > 0$ for all $i$ and $\cup_{i=1}^n B_i = \Omega$, the smallest $\sigma$-field $\mathcal{G}$ that contains them is the collection of finite unions of these sets. If $A \in \mathcal{F}$, the conditional probabilities $P[A|B_i] = a_i$ define a random variable $Y$ on $(\Omega, \mathcal{G}, P)$ whose value on $B_i$ is $a_i$. Similarly, one may define, for each non-negative random variable $X$, a random variable $Y$ on $(\Omega, \mathcal{G}, P)$ by setting it equal to $E[X|B_i]$ on $B_i$. If one denotes $Y$ by $E[X|\mathcal{G}]$, it is easy to see that this again defines a projection of $\mathcal{F}^+$ onto $\mathcal{G}^+$ that satisfies properties (CE$_1$) to (CE$_5$), since (CE) holds with $\mathcal{G}$ the smallest $\sigma$-field containing all the sets $B_i$.

Example 5.1.1. Let $(\Omega, \mathcal{F}, P) = ([0,1], \mathfrak{B}([0,1]), dx)$ and, for a fixed $n \ge 1$, let $B_0 = [0, \frac{1}{2^n}]$ and $B_k = (\frac{k}{2^n}, \frac{k+1}{2^n}]$ for $1 \le k < 2^n$. Let $X(x) = x$. Then $E[X|B_k] = 2^{-n-1}(2k+1)$. Hence, $Y(x) = 2^{-n-1}$ if $0 \le x \le \frac{1}{2^n}$, and $Y(x) = 2^{-n-1}(2k+1)$ if $\frac{k}{2^n} < x \le \frac{k+1}{2^n}$. Here the $\sigma$-algebra $\mathcal{G} \stackrel{def}{=} \mathcal{F}_n$ depends upon $n$ and, as one will see later, each $X$ defines in this way a martingale, since $E[E[X|\mathcal{F}_{n+1}]|\mathcal{F}_n] = E[X|\mathcal{F}_n]$. Replacing the uniform distribution on $[0,1]$ by the uniform distribution $1_{[0,\frac{1}{2}]}(x)\,2dx$ on $[0,\frac{1}{2}]$ changes the conditional expectation of $X$. Among other things, it can take any values on the interval $(\frac{1}{2}, 1]$, as pointed out later.

It is not necessary that all the $P(B_i)$ be strictly positive. If, for instance, $P(B_1) = 0$, then one may define the value of $E[X|\mathcal{G}]$ on $B_1$ to be any value (say zero). The really important and defining property (CE) is unaffected, as integration over $B_1$ has no effect. As a consequence, the conditional expectation is defined up to a null function.

This defining property for non-negative, integrable $X$ is equivalent to stating that the conditional expectation $Y$ of $X$ given $\mathcal{G}$ is a density for the measure on $\mathcal{G}$ defined by $X$: namely, $C \mapsto \int_C X\,dP$. In fact, one motivation for studying conditional expectation is its relation to the Radon-Nikodym
theorem, as mentioned earlier. Given any measure $\mu$ on $\mathcal{F}$ and a finite subfield $\mathcal{G}$, then, just as in the case of the measure given by $X$, there is a random variable $Y$ on $\mathcal{G}$ for which
$$\mu(C) = \int_C Y\,dP \quad \text{for all } C \in \mathcal{G},$$
provided that $\mu(C) = 0$ if $P(C) = 0$. The random variable $Y$ is some kind of approximate derivative of $\mu$ with respect to $P$. By taking larger and larger finite subfields, it is shown at the end of this chapter that these approximate derivatives converge to what is called the Radon-Nikodym derivative of $\mu$ with respect to $P$.

This sets the stage for seeking to extend the conditional expectation operator when $\mathcal{G}$ is an arbitrary sub-$\sigma$-field of $\mathcal{F}$, not necessarily determined by a finite number of sets. For example, if $(X_i)_{i\in I}$ is a stochastic process on $(\Omega, \mathcal{F}, P)$, then $\mathcal{G} = \sigma(\{X_i \mid i \in I\})$ is a sub-$\sigma$-field not necessarily determined by a finite number of sets, even if $I$ consists of one point. The aim is to show that for any sub-$\sigma$-field $\mathcal{G}$ of $\mathcal{F}$, to each probability $P$ on $\mathcal{F}$ there corresponds a linear projection of $\mathcal{F}^+$ onto $\mathcal{G}^+$ that satisfies the defining property (CE) and hence the properties (CE$_1$) to (CE$_5$). Because conditional expectation is to be linear on $\mathcal{F}^+$, as is the integral, it has an obvious extension to $L^1(\Omega, \mathcal{F}, P)$ that imitates the extension of the integral from non-negative random variables to integrable ones.

Definition 5.1.2. Let $X$ be a random variable on $(\Omega, \mathcal{F}, P)$ that is either non-negative or integrable, and let $\mathcal{G} \subset \mathcal{F}$ be a sub-$\sigma$-field. A conditional expectation of $X$ given $\mathcal{G}$ is a random variable $Y$ on $(\Omega, \mathcal{G}, P)$ such that
$$\text{(CE)} \qquad \int_C X\,dP = \int_C Y\,dP \quad \text{for all } C \in \mathcal{G}.$$
If $Y$ is a random variable satisfying (CE), this will be denoted by writing $Y = E[X|\mathcal{G}]$.

A conditional expectation, if it exists, is essentially unique. Furthermore, a conditional expectation of a non-negative or of an integrable random variable is, respectively, non-negative or integrable. These facts are direct consequences of the definition, as shown next.

Remark. It is implicit in the definition of a conditional expectation $Y$ that $\int_C Y\,dP$ be defined for all $C \in \mathcal{G}$: in other words, if $Y$ takes both positive and negative values, then either $Y^+$ or $Y^-$ is integrable, as then $\int_C Y\,dP \stackrel{def}{=} \int_C Y^+\,dP - \int_C Y^-\,dP$ is well defined. This point is clarified in Exercise 2.9.19 and in the following lemma.

Lemma 5.1.3. Let $X$ be a random variable on $(\Omega, \mathcal{F}, P)$ that is either non-negative or integrable. If $Y_1$ and $Y_2$ are two conditional expectations of $X$ given $\mathcal{G}$, then $Y_1 = Y_2$ $P$-a.s. If $X$ is non-negative or integrable, then a conditional expectation of $X$ is, respectively, non-negative or integrable.
Proof. First, consider the case of a non-negative random variable $X$. Let $Y$ be a conditional expectation of $X$ given $\mathcal{G}$. If $C_n = \{Y \le -\frac{1}{n}\}$, then $C_n \in \mathcal{G}$ and
$$\left(-\frac{1}{n}\right)P(C_n) \ge \int_{C_n} Y\,dP = \int_{C_n} X\,dP \ge 0.$$
Hence, $P(C_n) = 0$ for all $n \ge 1$, and so $Y \ge 0$ $P$-a.s. Also, if $X$ is a non-negative, integrable random variable, then a conditional expectation $Y$ of $X$ given $\mathcal{G}$ is integrable since $\int X\,dP = \int Y\,dP$.

Now assume that $Y_1$ and $Y_2$ are two conditional expectations of a non-negative random variable $X$. Let $C_{mk} = \{m \ge Y_1 \ge Y_2 + \frac{1}{k}\}$. Then $C_{mk} \in \mathcal{G}$, and so $\int_{C_{mk}} Y_1\,dP = \int_{C_{mk}} Y_2\,dP$. Also, this common value is finite. Furthermore, since by the definition of $C_{mk}$,
$$\int_{C_{mk}} Y_1\,dP \ge \int_{C_{mk}} Y_2\,dP + \left(\frac{1}{k}\right)P(C_{mk}),$$
it follows that $P(C_{mk}) = 0$. Hence, $Y_1 \le Y_2$ $P$-a.s., which implies, by symmetry, that $Y_1 = Y_2$ $P$-a.s.

Now assume that $X$ is integrable and that $Y$ is a conditional expectation of $X$ given $\mathcal{G}$. Then, if $C_+ = \{Y \ge 0\}$ and $C_- = \{Y < 0\}$, it follows that $\int Y^+\,dP = \int_{C_+} Y\,dP = \int_{C_+} X\,dP$ and $\int Y^-\,dP = -\int_{C_-} Y\,dP = -\int_{C_-} X\,dP$, since $C_{\pm} \in \mathcal{G}$. From this, it follows that $Y^+$ and $Y^-$ are integrable and so $Y$ is integrable. Finally, if $Y_1$ and $Y_2$ are two conditional expectations of an integrable random variable $X$, it follows (as in the case of non-negative $X$) that $Y_1 = Y_2$ $P$-a.s. $\square$

The defining property of a conditional expectation implies that properties (CE$_1$) to (CE$_5$) are automatically satisfied when they make sense.

Proposition 5.1.4. (Elementary properties of conditional expectation) Let $X$, $X_1$, and $X_2$ denote non-negative random variables for which the conditional expectations $E[X|\mathcal{G}]$, $E[X_1|\mathcal{G}]$, and $E[X_2|\mathcal{G}]$, respectively, are defined. Then
(CE$_1$) $\lambda E[X|\mathcal{G}] = E[\lambda X|\mathcal{G}]$ if $\lambda > 0$,
(CE$_2$) $E[X_1|\mathcal{G}] + E[X_2|\mathcal{G}] = E[X_1 + X_2|\mathcal{G}]$,
(CE$_3$) $E[X_1|\mathcal{G}] \le E[X_2|\mathcal{G}]$ $P$-a.s. if $X_1 \le X_2$,
(CE$_4$) if $X \in \mathcal{G}$, $E[X|\mathcal{G}] = X$, and hence $E[E[X|\mathcal{G}]|\mathcal{G}] = E[X|\mathcal{G}]$,
(CE$_5$) $E[X] = E[E[X|\mathcal{G}]]$.

Proof. All of these properties except (CE$_3$) are immediate consequences of the definition of a conditional expectation. The third property follows from (CE$_2$) and the fact (Lemma 5.1.3) that a conditional expectation of a non-negative random variable is itself non-negative. $\square$
Comment. Identities like (CE$_2$) have to be read modulo null random variables, in the same way that they are for functions in, for example, $L^1$: to say $\|X\|_1 = 0$ means that $X$ is a null function, not that it is identically zero. Similarly, for example, (CE$_2$) says that the sum of conditional expectations of $X_1$ and $X_2$ is a conditional expectation for the sum $X_1 + X_2$.

The issue now is to show that a conditional expectation exists for any random variable $X$ that is either non-negative or integrable. The easiest way to do this for non-negative integrable random variables is to use the Radon-Nikodym theorem (Theorem 2.7.19) from Chapter II (although that theorem must then first be proved). Assume that $X$ is a non-negative integrable random variable. Then $\nu(C) = \int_C X\,dP$, for $C \in \mathcal{G}$, is a finite measure on $\mathcal{G}$ by Exercise 2.4.2. Since $\nu(C) = 0$ if $P(C) = 0$, the Radon-Nikodym theorem implies that there is an integrable, non-negative random variable $Y$ on $(\Omega, \mathcal{G}, P)$ such that, for all $C \in \mathcal{G}$,
$$\int_C Y\,dP = \nu(C) = \int_C X\,dP.$$

In place of the Radon-Nikodym theorem, one may also use the Riesz representation theorem to define the conditional expectation for random variables in the Hilbert space $L^2(\Omega, \mathcal{F}, P)$ and then extend it to all integrable random variables. The advantage of this approach to conditional expectation is that it avoids having to prove the Radon-Nikodym theorem and shows the connection between conditional expectation and projection.

The Riesz representation theorem. Let $V$ be a real vector space. A set $C \subset V$ is said to be convex if for $x, y \in C$ and all $t \in [0,1]$, $tx + (1-t)y \in C$. In other words, the line segment joining $x$ to $y$ lies in $C$. For example, if $V = \mathbb{R}^2$, then $\{(x,y) \mid x^2 + y^2 \le 1\}$, $\{(x,y) \mid y \ge x^2\}$, and $\{(x,y) \mid x \ge 0,\ 0 \le y \le mx,\ m \ge 0\}$ are examples of convex sets $C$, as are their translates $C + u = \{v + u \mid v \in C\}$ by any vector $u$. The next two results should appear obvious in view of the familiar geometry of $\mathbb{R}^2$ (draw some diagrams).

Proposition 5.1.5. Let $C \subset \mathbb{R}^2$ be a closed convex set. Then there is a unique point $c_0 \in C$ closest to the origin. In other words, there is exactly one point $c_0 \in C$ such that $\|c_0\| \le \|c\|$ for all $c \in C$.

Proof. Suppose that $c_1, c_2$ are two points in $C$ that minimize the distance from $0$. The midpoint $\frac{1}{2}(c_1 + c_2)$ of the line segment joining $c_1$ to $c_2$ is in $C$. Hence, if $\delta = \|c_i\|$,
$$\delta^2 \le \tfrac{1}{4}\|c_1 + c_2\|^2 = \tfrac{1}{4}\{2\delta^2 + 2c_1 \cdot c_2\},$$
and so $\delta^2 \le c_1 \cdot c_2 \le \|c_1\|\,\|c_2\| = \delta^2$. Consequently, $c_1$ and $c_2$ are collinear, i.e., one is a scalar multiple of the other. Therefore, $c_1 = c_2$ since $c_1$ (say) $= \lambda c_2$ implies $\lambda = \pm 1$: if $\lambda = -1$, then $c_1 = -c_2$ and so $0 \in C$; this implies $c_1 = c_2 = 0$, as $\delta = 0$. Consequently, there is at most one point in $C$ that is closest to the origin.
Now let $(c_n)_{n\ge1} \subset C$ be such that $\|c_n\| \ge \|c_{n+1}\|$ and $\lim_n \|c_n\| = \delta = \inf_{c\in C}\|c\|$. If this sequence converges to a point $c_0$, then (i) $c_0 \in C$ and (ii) $\|c_0\| = \lim_n \|c_n\| = \delta$, as $|\,\|c_0\| - \|c_n\|\,| \le \|c_0 - c_n\|$ by the triangle inequality. To prove convergence, it suffices to verify that the Cauchy criterion is satisfied (see Definition 4.1.25 and Exercise 4.7.12): $\|c_n - c_m\| \le \varepsilon$ if $n, m \ge N = N(\varepsilon)$, for all $\varepsilon > 0$. Consider two vectors $u, v$ and the parallelogram they determine. Then
$$\|u\|^2 + \|v\|^2 = \tfrac{1}{2}\{\|u+v\|^2 + \|u-v\|^2\}.$$
Hence,
$$\tfrac{1}{4}\|c_n - c_m\|^2 = \tfrac{1}{2}(\|c_n\|^2 + \|c_m\|^2) - \tfrac{1}{4}\|c_n + c_m\|^2 \le \tfrac{1}{2}(\|c_n\|^2 + \|c_m\|^2) - \delta^2,$$
as $\frac{1}{2}(c_n + c_m) \in C$ and so $\|c_n + c_m\| \ge 2\delta$. Since $\|c_n\|^2 + \|c_m\|^2 \le 2\delta^2 + \varepsilon$ for any $\varepsilon > 0$, if $n, m \ge N$, it follows that $\|c_n - c_m\| \le \varepsilon$ for sufficiently large $n$ and $m$. $\square$

Corollary 5.1.6. Let $L \subset \mathbb{R}^2$ be a (necessarily closed) linear subspace. Then for each $u \in \mathbb{R}^2$ there is a unique vector $u_1 \in L + u$ of minimum length. This vector $u_1$ is orthogonal to $L$. Further, if $u_1 = \ell + u$, then $-\ell$ is the point in $L$ that is closest to $u$.

Proof. $C = L + u = \{v + u \mid v \in L\}$ is a closed convex subset of $\mathbb{R}^2$. Let $u_1$ be the unique vector in $C$ of minimum length. It will be shown to be perpendicular to $L$. If $u_1$ is not perpendicular to $L$, then for some $a \in L$, $u_1 \cdot a \neq 0$. One may write $u_1 = \lambda a + w$ with $w \cdot a = 0$ and $\lambda \neq 0$ ($\lambda = (u_1 \cdot a)/\|a\|^2$). As $u_1 = v_1 + u$, with $v_1 \in L$, it follows that $w \in L + u$. Now $\|u_1\|^2 = \lambda^2\|a\|^2 + \|w\|^2 > \|w\|^2$. This contradicts the defining property of $u_1$. If $y \in L$, then $d(u,y) = \|u - y\|$. Since $u - y \in L + u$, it follows that $d(u,y) \ge \|u_1\|$, with equality if and only if $u - y = u_1$, i.e., $y = -\ell$. $\square$

The proofs of these two results make no use of the dimension of $\mathbb{R}^2$ and so are valid in a much larger context, namely for Hilbert spaces (see Definition 5.1.9). A (real) Hilbert space is a real vector space equipped with an inner product (Definition 2.5.6) which satisfies a completeness condition.

Example 5.1.7. The following are examples of vector spaces equipped with an inner product (Definition 2.5.6). Recall that $\langle\ ,\ \rangle$ is also denoted by $(\ ,\ )$.
(1) $V = \mathbb{R}^n$, $\langle u, v \rangle = (u,v) = u \cdot v = \sum_{i=1}^n u_i v_i$.
(2) $V = \ell^2$, $\langle u, v \rangle = (u,v) = \sum_{i=1}^{\infty} u_i v_i$ (see Exercise 2.5.7).
(3) $V = L^2(\Omega, \mathcal{F}, P)$, $\langle X, Y \rangle = E[XY]$.

Remark. In this last example, $\langle X, X \rangle = 0$ implies $X = 0$ $P$-a.s., and so $X$ is a null function (see Exercise 2.1.28 and Remark 4.1.2).

Let $V$ be a real vector space with inner product, and let $\|u\|$ denote $\sqrt{\langle u, u \rangle}$, $u \in V$. Then, as shown in the following proposition, $u \mapsto \|u\|$ is a norm on $V$ (Definition 2.5.3).
Proposition 5.1.8. Let $V$ be a real vector space equipped with an inner product. The following statements are valid.
(1) (Parallelogram law) For all $u, v \in V$, $\|u\|^2 + \|v\|^2 = \frac{1}{2}\{\|u+v\|^2 + \|u-v\|^2\}$.
(2) (Cauchy-Schwarz inequality) For all $u, v \in V$, $|\langle u,v \rangle| \le \|u\|\,\|v\|$, with equality if and only if $u$ and $v$ are collinear.
(3) (The triangle inequality) For all $u, v \in V$, $\|u+v\| \le \|u\| + \|v\|$.
Hence, $\|\cdot\|$ is a norm on $V$ and defines a metric $d$ on $V$, where $d(u,v) = \|u - v\|$.

Proof. (1) $\|u \pm v\|^2 = \|u\|^2 \pm 2\langle u,v \rangle + \|v\|^2$. (2) Let $Q(\lambda) = \|u + \lambda v\|^2$. Then $Q(\lambda)$ is a quadratic polynomial in $\lambda$. Since $Q(\lambda) \ge 0$ for all $\lambda \in \mathbb{R}$, the discriminant satisfies $4\langle u,v \rangle^2 - 4\|u\|^2\|v\|^2 \le 0$. Further (see Exercise 2.5.8), this discriminant is zero if and only if $Q$ vanishes at $\lambda_0 = -\frac{\langle u,v \rangle}{\|v\|^2}$. Since $Q(\lambda_0) = 0$, $u = -\lambda_0 v$ (note that $v$ may be assumed $\neq 0$ here, as the inequality is trivial if $u$ or $v = 0$). Observe that (3) and (2) are equivalent since $\|u+v\|^2 = \|u\|^2 + 2\langle u,v \rangle + \|v\|^2$. $\square$

Definition 5.1.9. A real vector space $V$ equipped with an inner product (i.e., an inner product space) is said to be a (real) Hilbert space if it is complete with respect to the metric given by the norm $\|u\| = \sqrt{\langle u,u \rangle}$.

Remarks 5.1.10. (1) For a complex vector space, for example $\mathbb{C}^n$, Definition 2.5.6 needs to be modified: the inner product $\langle u,v \rangle \in \mathbb{C}$; in (3), $\lambda \in \mathbb{C}$; and (1) is replaced by $\langle u,v \rangle = \overline{\langle v,u \rangle}$ for all $u, v \in V$. For example, in the case of $\mathbb{C}^n$, one defines $\langle u,v \rangle$ to be $\sum_{i=1}^n u_i \bar{v}_i$.
(2) The Riesz-Fischer theorem (Proposition 4.1.28) and Example 5.1.7 (3) imply that $L^2(\Omega, \mathcal{F}, P)$ is a Hilbert space. So too is any $L^2(\Omega, \mathcal{F}, \mu)$.
(3) By using complex measurable functions, one also obtains examples of (complex) Hilbert spaces: $L^2_{\mathbb{C}}(\Omega, \mathcal{F}, \mu)$, the space of complex-valued $\mathcal{F}$-measurable functions $f : \Omega \to \mathbb{C}$ with $\int |f|^2\,d\mu < \infty$. Note that if $f = u + iv$, then $f$ is $\mathcal{F}$-measurable if and only if $u$ and $v$ are (see Proposition 3.1.1); $|f|^2 = u^2 + v^2$ implies that $f \in L^2_{\mathbb{C}}(\Omega, \mathcal{F}, \mu)$ if and only if $u$ and $v$ belong to $L^2(\Omega, \mathcal{F}, \mu)$; and finally, $\langle f,g \rangle = \int f\bar{g}\,d\mu$.

Example 5.1.11. Let $V = C([0,1])$, the vector space of continuous real-valued functions on $[0,1]$. Define $\langle f,g \rangle$ to be $\int_0^1 f(x)g(x)\,dx$. This defines an inner product on $V$, but $V$ is not a Hilbert space: for example, let
$$f_n(x) = \begin{cases} 0 & \text{for } 0 \le x \le \frac{1}{2} - \frac{1}{n}, \\ n\left(x - \frac{1}{2} + \frac{1}{n}\right) & \text{for } \frac{1}{2} - \frac{1}{n} \le x \le \frac{1}{2}, \\ 1 & \text{for } \frac{1}{2} \le x \le 1. \end{cases}$$
The sequence $(f_n)_{n\ge1}$ is Cauchy but fails to converge to a continuous function. Note that by Theorem 4.2.8, $V \subset L^2([0,1])$ is a dense subspace of $L^2([0,1])$, which is a Hilbert space by Proposition 4.1.28 (see Remarks 5.1.10).

The following two results have been established by the proofs of Proposition 5.1.5 and Corollary 5.1.6.

Proposition 5.1.12. Let $C$ be a closed convex set in a Hilbert space $V$. Then there is a unique point $c_0 \in C$ closest to the origin.

Corollary 5.1.13. Let $L$ be a closed linear subspace of a Hilbert space $V$. Then for each vector $u \in V$ there is a unique vector $u_1 \in L + u$ of minimum length. This vector is orthogonal to $L$. Further, if $u_1 = \ell + u$, then $-\ell$ is the point in $L$ that is closest to $u$.

Remark. If $L = L_N$ is the closed linear subspace determined by the first $N$ vectors of an orthonormal system $(\phi_n)_{n\ge1}$ in $V$, the vector $u_1$ equals $u - \sum_{n=1}^N \langle u, \phi_n \rangle \phi_n$ (see Lemma 4.2.15).

Theorem 5.1.14. (Riesz representation theorem) Let $V$ be a Hilbert space and let $\ell : V \to \mathbb{R}$ be a non-trivial, continuous linear functional. Then there is a unique $v_0 \in V$ such that $\ell(v) = \langle v, v_0 \rangle$ for all $v \in V$.

Proof. Let $\mathrm{Ker}(\ell) = \{v \in V \mid \ell(v) = 0\}$. This is a closed linear subspace $L$. Since $\ell$ is non-trivial, there is a vector $v_1$ with $\ell(v_1) = 1$. Let $E = L + v_1$, and let $u_0$ be the vector in $E$ of minimum length. Then, since $\ell = 1$ on $E$, $\|u_0\| \neq 0$. By Corollary 5.1.13, $\langle v, u_0 \rangle = 0$ for all $v \in L$. In other words, $\ell(v) = 0$ implies $\langle v, u_0 \rangle = 0$. Now let $\ell(v) = \alpha \neq 0$. Since $\ell(u_0) = 1$, $\ell(v - \alpha u_0) = 0$. Hence, $\langle v - \alpha u_0, u_0 \rangle = 0$, i.e., $\langle v, u_0 \rangle = \alpha\langle u_0, u_0 \rangle = \alpha\|u_0\|^2$. Consequently, if $v_0 = \frac{u_0}{\|u_0\|^2}$, then $\ell(v) = \langle v, v_0 \rangle$ for all $v \in V$. The vector $v_0$ is unique because $\ell(v) = \langle v, v_0' \rangle$ for all $v \in V$ implies $\langle v, v_0 - v_0' \rangle = 0$ for all $v \in V$; in particular, for $v = v_0 - v_0'$. Hence, $v_0 - v_0' = 0$. $\square$

2. Conditional expectation

It follows immediately from the Riesz representation theorem that every square integrable random variable has a conditional expectation.

Proposition 5.2.1. Let $X \in L^2(\Omega, \mathcal{F}, P)$ and $\mathcal{G} \subset \mathcal{F}$. Then the element $Y$ of $L^2(\Omega, \mathcal{G}, P)$ that represents the continuous linear functional $Z \mapsto E[XZ]$ on the Hilbert space $L^2(\Omega, \mathcal{G}, P)$ is a conditional expectation $Y$ of $X$ given $\mathcal{G}$.

Proof. Let $Z \in L^2(\Omega, \mathcal{G}, P)$. The function $Z \mapsto E[XZ]$ is a continuous linear functional on $L^2(\Omega, \mathcal{G}, P)$ since, by the Cauchy-Schwarz inequality,
$|E[XZ]| \le \|X\|_2\|Z\|_2$. Hence, by the Riesz representation theorem (Theorem 5.1.14), there is a unique (up to a null function) $Y \in L^2(\Omega, \mathcal{G}, P)$ such that $E[XZ] = E[YZ]$ for all $Z \in L^2(\Omega, \mathcal{G}, P)$. Since $1_C \in L^2(\Omega, \mathcal{G}, P)$ for all $C \in \mathcal{G}$, it follows that $E[1_C X] = E[1_C Y]$ (i.e., $\int_C X\,dP = \int_C Y\,dP$) for all $C \in \mathcal{G}$. In other words, $Y$ is a conditional expectation of $X$ given $\mathcal{G}$. $\square$

Remark 5.2.2. Proposition 5.2.1 is a well-known fact from Hilbert space theory. Given a closed (linear) subspace $V$ of a Hilbert space $H$, there is a unique linear operator $P : H \to V$ with $\|P\| \le 1$ and $P^2 = P$. In the case of $H = L^2(\Omega, \mathcal{F}, P)$ and $V = L^2(\Omega, \mathcal{G}, P)$, this operator is the conditional expectation operator $E[\,\cdot\,|\mathcal{G}]$. Such an operator is called a projection operator. If $P(x) = y$, then $y$ is the closest point in $V$ to $x$ (see Corollary 5.1.13). In other words, $y$ solves a least squares problem. This terminology comes from the observation that if $H = \mathbb{R}^n$ and $x = (x_1, \dots, x_n)$, then $y$ is the point $v \in V$ that minimizes $\sum_{i=1}^n (x_i - v_i)^2$.

By Lemma 5.1.3, $E[X|\mathcal{G}]$ is non-negative if $X \in L^2(\Omega, \mathcal{F}, P)$ is non-negative. This allows one to extend $E[\,\cdot\,|\mathcal{G}]$ to arbitrary positive random variables and so to integrable random variables.

Proposition 5.2.3. If $X$ is a non-negative random variable on $(\Omega, \mathcal{F}, P)$ and $\mathcal{G}$ is a sub-$\sigma$-algebra, let $Y = \lim_n E[X \wedge n|\mathcal{G}]$. Then $Y$ is a conditional expectation of $X$ given $\mathcal{G}$. If $X \in L^1(\Omega, \mathcal{F}, P)$, then $E[X^{\pm}|\mathcal{G}] \in L^1(\Omega, \mathcal{G}, P)$; let $Y \stackrel{def}{=} E[X^+|\mathcal{G}] - E[X^-|\mathcal{G}]$. Then $Y \in L^1(\Omega, \mathcal{G}, P)$ is a conditional expectation of $X$ given $\mathcal{G}$.

Proof. Assume that $X$ is non-negative. Then, for each $n \ge 1$, $X \wedge n \in L^2(\Omega, \mathcal{F}, P)$. Let $Y_n = E[X \wedge n|\mathcal{G}]$. Then $Y_{n+1} \ge Y_n$ $P$-a.s. Hence, by modifying the variables $Y_n$ on sets in $\mathcal{G}$ of probability zero (which does not change the fact that they are conditional expectations of the $X \wedge n$), one may assume $0 \le Y_n \le Y_{n+1}$ for all $n \ge 1$. Define $Y = \lim_n Y_n$, and let $C \in \mathcal{G}$. Then,
$$\int_C X\,dP = \lim_{n\to\infty}\int_C (X \wedge n)\,dP = \lim_{n\to\infty}\int_C Y_n\,dP = \int_C Y\,dP,$$
i.e., $Y$ is a conditional expectation of $X$ given $\mathcal{G}$.

If $X \in L^1(\Omega, \mathcal{F}, P)$, then the conditional expectations $E[X^{\pm}|\mathcal{G}]$ are integrable by Lemma 5.1.3, and so $Y = E[X^+|\mathcal{G}] - E[X^-|\mathcal{G}] \in L^1(\Omega, \mathcal{G}, P)$. It is a conditional expectation of $X$ given $\mathcal{G}$ since
$$\int_C X\,dP = \int_C X^+\,dP - \int_C X^-\,dP = \int_C Y\,dP \quad \text{for all } C \in \mathcal{G}. \qquad \square$$
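The identification of $E[\,\cdot\,|\mathcal{G}]$ with an orthogonal projection (Remark 5.2.2) can be made concrete on a finite probability space. In the sketch below, the six-point space, the weights, and the two-atom partition generating $\mathcal{G}$ are all assumptions of the illustration; the atom-wise average and the weighted least squares solution coincide.

    import numpy as np

    # E[X|G] as an orthogonal projection in L^2, on a finite probability space.
    # Omega = {0,...,5} with weights p; G is generated by the partition
    # {0,1,2} and {3,4,5}, so L^2(Omega, G, P) is spanned by the two indicators.
    rng = np.random.default_rng(1)
    p = rng.random(6); p /= p.sum()          # an arbitrary probability
    X = rng.normal(size=6)                   # an arbitrary random variable
    atoms = [np.array([1, 1, 1, 0, 0, 0.]), np.array([0, 0, 0, 1, 1, 1.])]

    # (a) directly: on each atom A, E[X|G] = (integral of X over A) / P(A)
    Y_direct = sum(a * (p @ (a * X)) / (p @ a) for a in atoms)

    # (b) as least squares: minimize E[(X - Y)^2] over G-measurable
    # Y = c1*1_A1 + c2*1_A2, via the normal equations in the weighted
    # inner product <U, V> = E[UV].
    G = np.column_stack(atoms)
    W = np.diag(p)
    c = np.linalg.solve(G.T @ W @ G, G.T @ W @ X)
    Y_ls = G @ c

    print(np.allclose(Y_direct, Y_ls))  # True: the projection is the atom-wise average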
Remark. The conditional expectation can also be defined for any random variable that is the sum of a non-negative random variable $Z$ and an integrable one $X$. Let $E[Z|\mathcal{G}] = W$ and $E[X|\mathcal{G}] = Y$. Then for any set $C \in \mathcal{G}$, it follows from Exercise 2.9.19 that $\int_C (Z+X)\,dP = \int_C Z\,dP + \int_C X\,dP = \int_C W\,dP + \int_C Y\,dP = \int_C (W+Y)\,dP$. As a result, $E[Z+X|\mathcal{G}] = E[Z|\mathcal{G}] + E[X|\mathcal{G}]$.

Proposition 5.2.4. (Additional properties of conditional expectation) Assume that each random variable is either non-negative or in $L^1$. Then the following properties hold:
(CE$_6$) if $(X_n)_{n\ge1}$ is such that $X_n \le X_{n+1}$ and $\lim_n X_n = X$, then $E[X|\mathcal{G}] = \lim_n E[X_n|\mathcal{G}]$;
(CE$_7$) if $\mathcal{H}$ is a sub-$\sigma$-field of $\mathcal{G}$, then $E[E[X|\mathcal{G}]|\mathcal{H}] = E[X|\mathcal{H}]$;
(CE$_8$) (Jensen's inequality) let $c : \mathbb{R} \to \mathbb{R}$ be convex and let $X \in L^1$; then $E[c \circ X|\mathcal{G}]$ is defined and $c \circ E[X|\mathcal{G}] \le E[c \circ X|\mathcal{G}]$;
(CE$_9$) if $X \in \mathcal{G}$, then $E[XY|\mathcal{G}] = XE[Y|\mathcal{G}]$ (here $X, Y \in L^2$ or both are non-negative);
(CE$_{10}$) if $X$ is independent of $\mathcal{G}$ (i.e., $\sigma(X)$ and $\mathcal{G}$ are independent $\sigma$-fields), then $E[X|\mathcal{G}] = E[X]$; and
(CE$_{11}$) if $\mathcal{H}$ and $\sigma(X) \cup \mathcal{G}$ are independent, then $E[X|\sigma(\mathcal{G} \cup \mathcal{H})] = E[X|\mathcal{G}]$, where $\sigma(\mathcal{G} \cup \mathcal{H})$ is the smallest $\sigma$-algebra containing $\mathcal{G} \cup \mathcal{H}$.

Proof. (CE$_6$) and (CE$_7$) are left to the reader to verify. To prove (CE$_8$), note that by Exercise 4.7.25 there exists a sequence $(L_n)_{n\ge1}$ of linear functions $L_n(x) = a_nx + b_n$ with $c(x) = \sup_n L_n(x)$, $x \in \mathbb{R}$. Now $c \circ X \ge L_n \circ X$, and so $c \circ X = Z_n + L_n \circ X$, $Z_n \ge 0$. It follows from the remark preceding this proposition that $E[c \circ X|\mathcal{G}]$ is defined (or one may impose the extra condition that $c \circ X \in L^1$) and that
$$E[c \circ X|\mathcal{G}] = E[Z_n|\mathcal{G}] + E[L_n \circ X|\mathcal{G}] \ge E[L_n \circ X|\mathcal{G}] = a_nE[X|\mathcal{G}] + b_n \quad \text{for all } n.$$
Hence, $E[c \circ X|\mathcal{G}] \ge c(E[X|\mathcal{G}])$.

If $A \in \mathcal{G}$, then (CE$_9$) holds for $X = 1_A$. Hence, it holds for all simple functions in $\mathcal{G}$ and consequently for all non-negative $X \in \mathcal{G}$. This proves (CE$_9$). (CE$_{10}$) follows immediately from Proposition 3.2.3 since $\int_C X\,dP = P(C)E[X]$ if $C \in \mathcal{G}$. To prove (CE$_{11}$), first note that $Y = E[X|\mathcal{G}] \in \mathcal{G}$ is independent of $\mathcal{H}$. Thus, by Proposition 3.2.3, for any $B \in \mathcal{H}$ and $A \in \mathcal{G}$,
$$\int_{A\cap B} Y\,dP = P[B]\int_A Y\,dP = P[B]\int_A X\,dP = \int 1_B(1_AX)\,dP = \int_{A\cap B} X\,dP,$$
where the last but one equality uses the hypothesis that $\mathcal{H}$ and $\sigma(X) \cup \mathcal{G}$ are independent. Since the class $\mathfrak{C}$ of sets $C \in \sigma(\mathcal{G} \cup \mathcal{H})$ for which $\int_C Y\,dP = \int_C X\,dP$ contains $C_2 \setminus C_1$ whenever it contains each $C_i$ and $C_1 \subset C_2$, the result follows from Proposition 3.2.6. $\square$

Corollary 5.2.5. Let $X \in L^p(\Omega, \mathcal{F}, P)$, $1 \le p \le +\infty$. Then $E[X|\mathcal{G}] \in L^p(\Omega, \mathcal{G}, P)$ and $\|E[X|\mathcal{G}]\|_p \le \|X\|_p$.

Proof. If $1 \le p < \infty$, $t \mapsto |t|^p$ is a convex function, and so by Jensen's inequality, $|E[X|\mathcal{G}]|^p \le E[|X|^p\,|\,\mathcal{G}]$. Hence, by (CE$_5$),
$$\|E[X|\mathcal{G}]\|_p^p = E\left[|E[X|\mathcal{G}]|^p\right] \le E\left[E[|X|^p\,|\,\mathcal{G}]\right] = E[|X|^p] = \|X\|_p^p.$$
If $p = +\infty$, then $-\|X\|_{\infty} \le X \le \|X\|_{\infty}$ $P$-a.e. Hence, by (CE$_3$), $-\|X\|_{\infty} \le E[X|\mathcal{G}] \le \|X\|_{\infty}$, as $E[1|\mathcal{G}] = 1$. $\square$

Remark. In particular, if $X \in L^2$, then $E[X|\mathcal{G}] \in L^2$ and so, as remarked before, it is (up to a null function) the random variable $Y$ in $L^2$ for which $E[XZ] = E[YZ]$ for all $Z \in L^2(\Omega, \mathcal{G}, P)$.

When the $\sigma$-algebra $\mathcal{G}$ is generated by a random variable $Y$ (i.e., $\mathcal{G} = \sigma(Y)$), then one often sees expressions of the type $E[X|Y = y]$ in connection with $E[X|\sigma(Y)] \stackrel{def}{=} E[X|Y]$. To explain why the expression $E[X|Y = y]$ is meaningful in this context, one first needs to see the relation between Borel functions and random variables that are measurable with respect to $\sigma(Y)$.

Exercise 5.2.6. Let $Y$ denote a random variable on a probability space $(\Omega, \mathcal{F}, P)$. Let $\sigma(Y)$ denote the smallest $\sigma$-algebra with respect to which $Y$ is measurable. Show that
(1) $A \in \sigma(Y)$ if and only if $A = Y^{-1}(B)$; equivalently, $1_A = 1_B \circ Y$ for some Borel subset $B$ of $\mathbb{R}$ (see Exercise 2.9.10);
(2) if $A_1, A_2 \in \sigma(Y)$ and are disjoint, then $A_i = Y^{-1}(B_i)$, $i = 1, 2$, implies that $A_2 = Y^{-1}(B_2 \setminus B_1)$; equivalently, $1_{A_2} = 1_{B_2\setminus B_1} \circ Y$;
(3) if $s$ is $\sigma(Y)$-simple, then there is a Borel simple function $t$ with $s = t \circ Y$;
(4) if $U$ is non-negative and $\sigma(Y)$-measurable, then there is a non-negative Borel function $\psi$ with $U = \psi \circ Y$; and finally,
(5) if $U$ is finite and $\sigma(Y)$-measurable, then there is a finite Borel function $\psi$ with $U = \psi \circ Y$.

Using the result of this exercise, one defines $E[X|Y = y]$ to be $f(y)$ if $E[X|Y] = f \circ Y$, $f \in \mathfrak{B}(\mathbb{R})$. This kind of notation is used in Karlin and Taylor [K1], p. 5, where a formula is given for $P[X \le x|Y = y]$ (the conditional distribution function of $X$ given $Y = y$) when $X$ and $Y$
are random variables for which the distribution of the random vector $(X,Y)$ has a density $p(x,y)$. This formula will now be explained as a special case of the formula for the conditional expectation of $g(X) = g \circ X$ given $Y = y$ (i.e., for $E[g(X)|Y = y]$; see [K1], p. 7 (1.5)), by using the concept of conditional expectation as introduced earlier (here $g$ is a Borel function for which $g \circ X \in L^1(\Omega, \mathcal{F}, P)$). Note that the density $p$, which is in principle only Lebesgue measurable on $\mathbb{R}^2$, can be taken to be Borel measurable.

Proposition 5.2.7. Let $X, Y$ be random variables on a probability space $(\Omega, \mathcal{F}, P)$, and let $g \in \mathfrak{B}(\mathbb{R})$ be such that $g \circ X \in L^1(\Omega, \mathcal{F}, P)$. Assume that the joint distribution of $(X,Y)$ has a density $p(x,y)$ (which may be assumed to be a Borel function of $x$ and $y$). Then $p_2(y) = \int p(x,y)\,dx$ is the density of the (marginal) distribution of $Y$, and
$$E[g \circ X|Y = y] = \begin{cases} \dfrac{\int g(x)p(x,y)\,dx}{p_2(y)} & \text{if } p_2(y) \neq 0, \\ 0 & \text{if } p_2(y) = 0. \end{cases}$$

Proof. If $Q(dx,dy) = p(x,y)\,dx\,dy$ is the joint distribution of $(X,Y)$, its second marginal is the distribution of $Y$, i.e.,
$$P[Y \le b] = Q[\mathbb{R} \times (-\infty,b]\,] = \int_{-\infty}^{b}\left[\int_{-\infty}^{+\infty} p(x,y)\,dx\right]dy,$$
which has as density the Borel function $p_2(y) = \int p(x,y)\,dx$ (see Fubini's theorem (Theorem 3.3.5)). If $E[g \circ X|\sigma(Y)] = U$, then for all $A = Y^{-1}(B) \in \sigma(Y)$, it follows that
$$\int_A (g \circ X)\,dP = \int_A U\,dP,$$
that is, $\int (1_B \circ Y)(g \circ X)\,dP = \int (1_B \circ Y)U\,dP$ for all $B \in \mathfrak{B}(\mathbb{R})$. Let $\Phi(x,y) = 1_B(y)g(x)$. Then
$$\int (1_B \circ Y)(g \circ X)\,dP = \int \Phi(X,Y)\,dP = \iint 1_B(y)g(x)p(x,y)\,dx\,dy.$$
Also, by Exercise 5.2.6, one has $U = \phi \circ Y$, where $\phi(y)$ is a Borel function, and $\int |\phi(y)|\,Q(dx,dy) < +\infty$ since $U$ is integrable. Hence,
$$\int (1_B \circ Y)U\,dP = \int (1_B \circ Y)(\phi \circ Y)\,dP = \int 1_B(y)\phi(y)p_2(y)\,dy,$$
and so
$$\iint 1_B(y)g(x)p(x,y)\,dx\,dy = \int 1_B(y)\phi(y)p_2(y)\,dy \quad \text{for all } B \in \mathfrak{B}(\mathbb{R}).$$
The left-hand side is the integral of $\Phi(x,y)p(x,y)$ with respect to Lebesgue measure on $\mathfrak{B}(\mathbb{R}^2)$. Obviously, one wants to compute it by Fubini's theorem and thus to get
$$\int 1_B(y)\left[\int g(x)p(x,y)\,dx\right]dy = \int 1_B(y)\phi(y)p_2(y)\,dy \quad \text{for all } B \in \mathfrak{B}(\mathbb{R}).$$
Assuming this is possible (which indeed is the case: see Exercise 5.2.8 (2)), if one sets
$$\psi(y) \stackrel{def}{=} \frac{\int_{-\infty}^{+\infty} g(x)p(x,y)\,dx}{\int_{-\infty}^{+\infty} p(x,y)\,dx} = \frac{\int_{-\infty}^{+\infty} g(x)p(x,y)\,dx}{p_2(y)} \quad \text{when } p_2(y) \neq 0,$$
and equal to zero otherwise, then $\psi(y)p_2(y) = \int_{-\infty}^{+\infty} g(x)p(x,y)\,dx$ for all $y \in \mathbb{R}$, and so
$$\int 1_B(y)\psi(y)p_2(y)\,dy = \int 1_B(y)\phi(y)p_2(y)\,dy \quad \text{for all } B \in \mathfrak{B}(\mathbb{R}).$$
Provided $\psi$ is a Borel function (see Exercise 5.2.8 (1)), it follows that
$$\int_{Y^{-1}(B)} (g \circ X)\,dP = \int_{Y^{-1}(B)} U\,dP = \int_{Y^{-1}(B)} (\phi \circ Y)\,dP = \int_{Y^{-1}(B)} (\psi \circ Y)\,dP \quad \text{for all } B \in \mathfrak{B}(\mathbb{R}).$$
This implies that $\psi \circ Y = E[g \circ X|\sigma(Y)]$, and so $E[g \circ X|Y = y] = \psi(y)$. $\square$

When $g = 1_A$, $g \circ X = 1_{\{X \in A\}}$, which implies that
$$P[X \in A|Y = y] = \frac{\int_A p(x,y)\,dx}{p_2(y)} \quad \text{if } p_2(y) \neq 0,$$
and is equal to zero otherwise. In particular, the conditional probability distribution function of $X$ given $Y = y$ is
$$P[X \le x|Y = y] = \frac{\int_{-\infty}^{x} p(u,y)\,du}{p_2(y)} \quad \text{if } p_2(y) \neq 0,$$
and equals zero otherwise. Notice that the normalization by $p_2(y)$ is exactly what is needed to give a distribution function. Finally, when $X \in L^1$,
$$E[X|Y = y] = \frac{\int_{-\infty}^{+\infty} u\,p(u,y)\,du}{p_2(y)} \quad \text{if } p_2(y) \neq 0,$$
and as usual equals zero when this is not the case.

To complete this discussion, it remains to settle the technical issues concerning the use of the Fubini theorem and the measurability of $\psi$. They are given as an exercise.
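Before turning to those technical details, the formula of Proposition 5.2.7 can be tested against a case where the conditional expectation is known in closed form. The sketch below assumes (this is a choice made for the illustration) that $(X,Y)$ is standard bivariate normal with correlation $\rho$, for which $E[X|Y=y] = \rho y$; the $x$-integrals are computed by the trapezoidal rule.

    import numpy as np

    # Numerical check of Proposition 5.2.7: E[X | Y = y] computed as
    # (integral of x p(x,y) dx) / p2(y) for a bivariate normal density,
    # compared with the closed form rho * y.
    rho = 0.6

    def p(x, y):  # joint density of (X, Y)
        z = (x * x - 2 * rho * x * y + y * y) / (1 - rho * rho)
        return np.exp(-z / 2) / (2 * np.pi * np.sqrt(1 - rho * rho))

    x = np.linspace(-10, 10, 4001)   # quadrature grid, wide enough for the tails
    dx = x[1] - x[0]

    def trapz(f):
        # trapezoidal rule on the uniform grid x
        return float((f[1:] + f[:-1]).sum() * dx / 2)

    for y in [-1.5, 0.0, 0.8, 2.0]:
        p2 = trapz(p(x, y))                      # marginal density of Y at y
        cond_exp = trapz(x * p(x, y)) / p2       # the formula of Prop. 5.2.7
        print(f"y = {y:+.1f}:  formula ~ {cond_exp:+.4f},  rho*y = {rho*y:+.4f}")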
Exercise 5.2.8. (Technical details for Proposition 5.2.7) If $V$ is a non-negative measurable function on $(\Omega, \mathcal{F})$, show that
(1) $Z$ is also a measurable function on $(\Omega, \mathcal{F})$, where
$$Z(\omega) = \begin{cases} \frac{1}{V(\omega)} & \text{if } V(\omega) \neq 0, \\ 0 & \text{if } V(\omega) = 0. \end{cases}$$
[Hint: first show that $\frac{1}{V}$ is a measurable function if one defines $\frac{1}{0}$ to be $+\infty$.]
Using the notations of Proposition 5.2.7, show that
(2) $|g(x)|p(x,y) \in L^1(\mathbb{R}^2)$, i.e., relative to Lebesgue measure on $\mathfrak{B}(\mathbb{R}^2)$, since $g \circ X \in L^1(\Omega, \mathcal{F}, P)$;
(3) there is a Borel set $\Gamma \subset \mathbb{R}$ with $|\Gamma| = 0$ such that the function $\gamma$ defined by setting
$$\gamma(y) = \begin{cases} \int g(x)p(x,y)\,dx & \text{if } y \notin \Gamma, \\ 0 & \text{if } y \in \Gamma, \end{cases}$$
is a Borel function and $\int 1_B(y)\gamma(y)\,dy = \iint 1_B(y)g(x)p(x,y)\,dx\,dy$.
If one lets this function $\gamma(y)$ stand for $\int g(x)p(x,y)\,dx$, the computation in Proposition 5.2.7 goes through. The function $\psi(y)$ is then the product of $\gamma(y)$ and a Borel function, by (1). This illustrates the technical difficulties of using Fubini's theorem for $L^1$-functions. The formalism is clear, but the fine print needs inspection.

The initial examples of $E[\,\cdot\,|\mathcal{G}]$ used a procedure for obtaining $E[X|\mathcal{G}]$ from $X$ that involved integration. In the case of $n$ sets $B_i$, each of positive probability, define $N(\omega, d\eta)$ by $N(\omega, A) = P(A|B_i)$ if $\omega \in B_i$. Then $N$ is a Markov kernel on $(\Omega, \mathcal{F})$ to $(\Omega, \mathcal{G})$ and $E[X|\mathcal{G}](\omega) = \int N(\omega, d\eta)X(\eta)$. This last equality holds because, if $P(d\omega|B)$ denotes the conditional probability $P(A|B)$, for $A \in \mathcal{F}$, where $P(B) > 0$, then $\int X(\omega)P(d\omega|B) = \frac{1}{P(B)}\int_B X\,dP$. To see this, note that it suffices to consider $X = 1_A$, $A \in \mathcal{F}$. Then
$$\int 1_A(\omega)P(d\omega|B) = P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{1}{P(B)}\int_B 1_A\,dP.$$
It is important (and useful) to know when the conditional expectation operator may be realized in this way by a Markov kernel. In particular, this Markov kernel automatically selects a conditional expectation of a random variable, as indicated in the following definition.

Definition 5.2.9. Let $(\Omega, \mathcal{F}, P)$ be a probability space, and let $\mathcal{G} \subset \mathcal{F}$. A Markov kernel $N$ from $(\Omega, \mathcal{F})$ to $(\Omega, \mathcal{G})$ is called a regular conditional probability if, for all non-negative $X$, the random variable $Y$ defined by
$$Y(\omega) = \int N(\omega, d\omega')X(\omega')$$
is a conditional expectation $E[X|\mathcal{G}]$.

From the previous remarks, it is clear that if $\mathcal{G}$ is a finite $\sigma$-field, $E[\,\cdot\,|\mathcal{G}]$ is realized by a regular conditional probability. This is because for a finite $\sigma$-algebra $\mathcal{G}$, the whole space $\Omega$ decomposes into a finite number of disjoint sets $B_i \in \mathcal{G}$, each of which has no proper subset in $\mathcal{G}$ (they are called the atoms of $\mathcal{G}$). If $P(B_i) > 0$ for each atom, then a regular conditional probability is defined as above. If some of the $P(B_i) = 0$, then define $N(\omega, A) = P(A)$ for $\omega \in B_i$ when $P(B_i) = 0$, and otherwise as above. A regular conditional probability $N$ is then defined.

Another case when a regular conditional probability may be defined is for $\mathcal{G} = \sigma(\{Y\})$, where $Y$ is a random variable that assumes at most a countable number of values $a_i$, $i \ge 1$. The sets $B_i = \{Y = a_i\}$ are the atoms of $\mathcal{G}$, and one may define (as before)
$$N(\omega, A) = \begin{cases} P(A|B_i) & \text{if } \omega \in B_i,\ P(B_i) > 0, \\ P(A) & \text{if } \omega \in B_i,\ P(B_i) = 0. \end{cases}$$

Remark 5.2.10. While it is not always possible to realize a conditional expectation operator by means of a regular conditional probability, for a large class of measurable spaces this is the case (see [I], p. 13, Theorem 3.1). These are the so-called standard measurable spaces: they are measure isomorphic to one of the following spaces (see Exercise 4.7.21, Ikeda and Watanabe [I], p. 13, or Neveu [N1]):
(i) $(\{1,2,\dots,n\}, \mathfrak{P}(\{1,2,\dots,n\}))$;
(ii) $(\mathbb{N}, \mathfrak{P}(\mathbb{N}))$;
(iii) $(\mathbb{R}, \mathfrak{B}(\mathbb{R}))$.
If $\Omega$ is a complete separable metric space (a so-called Polish space), then $(\Omega, \mathfrak{B}(\Omega))$ is a standard measurable space ("separable" means that there is a countable subset $S$ such that every non-void open set contains a point of $S$: e.g., $\mathbb{R}^n$ is a separable metric space relative to the Euclidean metric; the set $S$ of points all of whose coordinates are rational has this property). In addition, every measurable subset $C$ of a standard measurable space $(E, \mathfrak{E})$ is itself a standard measurable space when equipped with the $\sigma$-algebra of sets $C \cap A$, $A \in \mathfrak{E}$ (see Parthasarathy [P1], p. 135, Theorem 2.3).

Remark. The formula (1.5) in [K1] for $E[g(X)|Y = y]$ that was established in Proposition 5.2.7 is not given by a regular conditional probability, as it depends upon the density for the pair $(X,Y)$ and so varies with $X$.

Remark 5.2.11. In Proposition 3.5.8, a probability $Q$ is constructed on $(\mathbb{R} \times \mathbb{R}^n, \mathfrak{B}(\mathbb{R}^{n+1}))$ from a Markov kernel $N$ on $(\mathbb{R}, \mathfrak{B}(\mathbb{R}))$ and an initial probability $P_0$ on $\mathbb{R}$. In Remark 3.5.9 (2), the following formula is given, where $\psi$ is a non-negative function of $x_{k+1}$, $E_0$ is a Borel subset of $\mathbb{R} \times \mathbb{R}^k$,
and $N\psi(u) = \int N(u,dv)\psi(v)$:
$$E[1_{E_0}(X_0,\dots,X_k)\psi(X_{k+1})] = E[1_{E_0}(X_0,\dots,X_k)N\psi(X_k)].$$
Define the random variables $X_k$ on $\mathbb{R} \times \mathbb{R}^n$, for $0 \le k \le n$, by setting $X_k(x_0, x_1, \dots, x_n) = x_k$. The above formula can be restated as follows:
$$\int_{E_0} \psi(X_{k+1})\,dQ = \int_{E_0} (N\psi)(X_k)\,dQ \quad \text{if } E_0 \in \mathfrak{B}(\mathbb{R}^{k+1}).$$
In other words, if $\mathcal{F}_k = \sigma\{X_i \mid 0 \le i \le k\}$, then
$$E[\psi \circ X_{k+1}|\mathcal{F}_k] = N\psi \circ X_k,$$
since $A \in \mathcal{F}_k$ if and only if $A = E_0 \times \mathbb{R}^{n-k}$ with $E_0 \in \mathfrak{B}(\mathbb{R}^{k+1})$ (see Proposition 3.4.9).

Exercise 5.2.12. (Markov processes) Let $(X_n)_{n\ge0}$ be a (real-valued) stochastic process on $(\Omega, \mathcal{F}, P)$, and let $N$ be a transition kernel on $(\mathbb{R}, \mathfrak{B}(\mathbb{R}))$ to $(\mathbb{R}, \mathfrak{B}(\mathbb{R}))$. Denote by $\mathcal{F}_n$ the $\sigma$-field $\sigma(\{X_i \mid 0 \le i \le n\})$. The stochastic process is called a Markov process with transition kernel $N$ if
$$(*) \qquad E[\psi \circ X_{n+1}|\mathcal{F}_n] = N\psi \circ X_n$$
for any bounded Borel function $\psi$. The $\sigma$-fields $\mathcal{F}_n$ increase with $n$ and can be replaced by any increasing sequence (referred to as a filtration in §4), provided that $X_n \in \mathcal{F}_n$ for each $n$ (the process is then said to be adapted (Definition 5.4.1)).

Let $\mathcal{G}_{n+1} = \sigma(\{X_j \mid j \ge n+1\})$. This $\sigma$-field represents the future relative to time $n$, and $\mathcal{F}_n$ corresponds to the past. For a Markov process, the conditional expectation of any future event given the past depends only upon the present. The exercise explains this fact. First, show that
(1) if $(\psi_i)_{1\le i\le m}$ is any finite collection of bounded Borel functions, then $E[\prod_{i=1}^m (\psi_i \circ X_{n+i})|\mathcal{F}_n] = \Phi \circ X_n$ for some Borel function $\Phi$ on $\mathbb{R}$. [Hint: use induction on $m$ and the fact that $E[U|\mathcal{F}_{n+j}] = E[E[U|\mathcal{F}_{n+j+1}]|\mathcal{F}_{n+j}]$ to find a formula for $\Phi$.]
In particular, this implies that if $F$ is a finite subset of $\{j \mid j \ge n+1\}$ and $A_i \in \mathfrak{B}(\mathbb{R})$ for each $i \in F$, then $E[\prod_{i\in F}(1_{A_i} \circ X_i)|\mathcal{F}_n] = \Phi \circ X_n$ for some Borel function $\Phi$. Use this to show that
(2) if $A \in \mathcal{G}_{n+1}$, then $E[1_A|\mathcal{F}_n] = \Theta \circ X_n$ for some Borel function $\Theta$. [Hint: use Proposition 3.2.6.]
This shows that for every future event, its conditional expectation given the past is a Borel function of the present. Make use of (2) to show that
(3) if $U$ is a non-negative random variable measurable with respect to $\mathcal{G}_{n+1}$, then $E[U|\mathcal{F}_n]$ is a Borel function of $X_n$.
In addition, show the following:
(4) if $X \in \mathcal{F}_n^+$ and $Y \in \mathcal{G}_{n+1}^+$, then $E[XY|X_n] = E[X|X_n]E[Y|X_n]$. [Hint: first condition the product on $\mathcal{F}_n$ and then use Proposition 5.2.4 (CE$_7$).]
Whenever this last identity holds, one says that the past and the future are conditionally independent given the present (compare (4) with the formula in Proposition 3.2.3). A process for which this holds is said to have the Markov property. What this exercise shows is that the presence of the transition kernel implies that the process is Markov.

3. Sufficient statistics*

An important statistical concept that involves conditional expectation is the notion of a sufficient statistic. Consider a probability space $(\Omega, \mathcal{F}, P)$ and a sub-$\sigma$-field $\mathcal{G}$ of $\mathcal{F}$. If $X$ is a non-negative random variable (i.e., $X \in \mathcal{F}$ is non-negative), the probability $P$ determines the conditional expectation $E[X|\mathcal{G}] = Y$. In general, for a given $X$, there is no reason for $Y$ not to depend upon the particular probability $P$.

Example 5.3.1. Let $\Omega$ denote $\{0,1\}$, $\mathcal{F}$ denote the collection of all subsets of $\Omega$, and $\mathcal{G} = \{\emptyset, \Omega\}$. Let $X(0) = 0$, $X(1) = a$. Then $E[X|\mathcal{G}] = E[X] = aP(\{1\})$ if $P$ is a probability on $(\Omega, \mathcal{F})$, and so depends on $P$.

However, when several variables are involved, it is easy to construct examples of families $\mathcal{P}$ of probabilities $P$ for which the conditional expectation is independent of $P$.

Example 5.3.2. Let $\Omega = \mathbb{R}^2$, $\mathcal{F} = \mathfrak{B}(\mathbb{R}^2)$, and $\mathcal{P} = \{P = \nu \times \mu \mid \mu \text{ a probability on } \mathbb{R}\}$, where $\nu(dx) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,dx$ is the unit normal. Let $\mathcal{G} = \{\mathbb{R} \times B \mid B \in \mathfrak{B}(\mathbb{R})\} = X_2^{-1}(\mathfrak{B}(\mathbb{R})) = \sigma(X_2)$, where $X_2(x_1,x_2) = x_2$. Then, for any non-negative random variable $X$, it follows from Fubini's theorem (Theorem 3.3.7) that
$$E[X|\mathcal{G}](x_1,x_2) = \int X(x_1,x_2)\,\nu(dx_1),$$
which is independent of the particular probability $P \in \mathcal{P}$.

In the situation of Example 5.3.2, the random variable $X_2$ is called a sufficient statistic. To formulate the definition, it is convenient to introduce the concept of a sufficient $\sigma$-field.

Definition 5.3.3. Let $\mathcal{P}$ be a collection of probabilities $P$ on $(\Omega, \mathcal{F})$, and let $\mathcal{G} \subset \mathcal{F}$ be a sub-$\sigma$-field. The $\sigma$-field $\mathcal{G}$ is said to be sufficient for the collection $\mathcal{P}$ if, for all non-negative $\mathcal{F}$-measurable functions $X$ on $\Omega$, $E_P[X|\mathcal{G}] = Y$ is independent of $P \in \mathcal{P}$, where $E_P[X|\mathcal{G}]$ denotes the conditional expectation of $X$ relative to $P$.
If $T : (\Omega, \mathcal{F}) \to (E, \mathfrak{E})$, where $\mathfrak{E}$ is a $\sigma$-algebra on $E$, is a measurable function, that is, if $T^{-1}(\mathfrak{E}) \subset \mathcal{F}$, and if $\mathcal{G} = T^{-1}(\mathfrak{E})$ is sufficient for $\mathcal{P}$, then $T$ is said to be a sufficient statistic for $\mathcal{P}$.

Intuitively, a $\sigma$-field $\mathcal{G}$ is sufficient for a collection $\mathcal{P}$ of probabilities $P$ if enough information to distinguish between any two of them is to be found in $\mathcal{G}$. For example, one has the following result.

Exercise 5.3.4. Let $\mathcal{G}$ be sufficient for a collection $\mathcal{P}$ of probabilities on $(\Omega, \mathcal{F})$. Show that if $P_1, P_2 \in \mathcal{P}$ are distinct (that is, for some $A \in \mathcal{F}$, $P_1(A) \neq P_2(A)$), then when viewed as probabilities on $(\Omega, \mathcal{G})$ they are distinct. [Hint: $P_i(A) = \int E_i[1_A|\mathcal{G}]\,dP_i$, where $E_i$ indicates conditional expectation relative to $P_i$.]

However, the fact that a sub-$\sigma$-field $\mathcal{G}$ is capable of distinguishing between probabilities is not enough to guarantee that it is sufficient. This is not so surprising if one recalls that conditional expectation has to do with projection.

Exercise 5.3.5. Construct an example on the measurable space $(\Omega, \mathcal{F}) = (\{0,1,2\}, \mathfrak{P}(\{0,1,2\}))$ to show that a $\sigma$-field $\mathcal{G} \subset \mathcal{F}$ need not be sufficient for a class $\mathcal{P}$ of probabilities on $\mathcal{F}$ even if it distinguishes between the probabilities in $\mathcal{P}$.

According to Zacks ([Z], p. 29), "the main formulation of the concept 'sufficient statistic' is based on the assertion that if the conditional distribution of $X$, given $T$, is independent of (i.e., does not depend upon) the actual (non-conditional) distribution of $X$, then $T$ is sufficiently informative for $\mathcal{P}$."

Example 5.3.6. (see Zacks [Z], p. 30) Let $P_{\lambda}$ be the probability on $(\mathbb{R}^n, \mathfrak{B}(\mathbb{R}^n))$ that is the product of $n$ copies of the Poisson distribution on $\mathbb{R}$ with mean $\lambda$. Then $P_{\lambda}$ is concentrated on the points $(k_1, k_2, \dots, k_n)$ with $k_i$ non-negative integers, and
$$P_{\lambda}(\{(k_1, k_2, \dots, k_n)\}) = e^{-n\lambda}\frac{\lambda^{k_1+k_2+\cdots+k_n}}{k_1!k_2!\cdots k_n!}.$$
Let $T(x_1, x_2, \dots, x_n) = \sum_{i=1}^n x_i$. Then $T$ is a sufficient statistic for the collection $\mathcal{P}$ of probabilities $P_{\lambda}$, $\lambda > 0$.

To see this, let $B_{\ell} = \{T = \ell\}$, $\ell \in \mathbb{N} \cup \{0\}$, and let $B = \mathbb{R}^n \setminus \cup_{\ell=0}^{\infty}B_{\ell}$. Then $P_{\lambda}(B) = 0$ and $P_{\lambda}(B_{\ell}) = e^{-n\lambda}\frac{(n\lambda)^{\ell}}{\ell!}$, as $T$ has a Poisson distribution with parameter $n\lambda$ (see Exercise 3.3.22). Hence, as far as $P_{\lambda}$ is concerned, $T$ has only a countable number of values, and so a regular conditional probability $N$ may be computed. For $\omega \in B$ it does not matter what probability one associates with $\omega$, since $P_{\lambda}(B) = 0$. Define $N(\omega, A) = \varepsilon_0(A)$ if $\omega \in B$, where $\varepsilon_0$ is the unit point mass at $0 \in \mathbb{R}^n$.
For $\omega \in B_{\ell}$, define
$$N(\omega, A) = \frac{P_{\lambda}(A \cap B_{\ell})}{P_{\lambda}(B_{\ell})} = \frac{1}{P_{\lambda}(B_{\ell})}\sum_{(k_1,k_2,\dots,k_n)\,\in\,A\cap B_{\ell}} P_{\lambda}(\{(k_1,k_2,\dots,k_n)\}) = \frac{\ell!}{e^{-n\lambda}(n\lambda)^{\ell}}\sum_{(k_1,k_2,\dots,k_n)\,\in\,A\cap B_{\ell}} e^{-n\lambda}\frac{\lambda^{k_1+k_2+\cdots+k_n}}{k_1!k_2!\cdots k_n!} = \sum_{(k_1,k_2,\dots,k_n)\,\in\,A\cap B_{\ell}} \frac{\ell!}{k_1!k_2!\cdots k_n!}\,\frac{1}{n^{\ell}},$$
since $\sum_i k_i = \ell$. This regular conditional probability does not depend upon $\lambda$, and so
$$E_{\lambda}[X|T^{-1}(\mathfrak{B}(\mathbb{R}))] = \int N(\omega, d\sigma)X(\sigma)$$
also does not depend upon $\lambda$, and so $T$ is sufficient.

Another example, also taken from Zacks [Z], p. 30, is the following.

Example 5.3.7. Let $P_{\theta}$ be the probability on $(\mathbb{R}^n, \mathfrak{B}(\mathbb{R}^n))$ which is the product of $n$ copies of the normal law $\nu_{\theta}(dx) = \frac{1}{\sqrt{2\pi}}e^{-(x-\theta)^2/2}\,dx$. Then
$$P_{\theta}(dx) = \frac{1}{(2\pi)^{n/2}}e^{-\|x-\theta e\|^2/2}\,dx,$$
where $e = (1,1,\dots,1)$. Define $\bar{X} = \frac{1}{n}\sum_{i=1}^n x_i$. Then $\bar{X}$ is a sufficient statistic for the collection of probabilities $P_{\theta}$, $\theta \in \mathbb{R}$.

To see this, first note that $P_{\theta}$ is invariant under rotations about the point $\theta e$. Choose an orthonormal basis $f_1, f_2, \dots, f_n$ for $\mathbb{R}^n$, where $f_1 = \frac{1}{\sqrt{n}}e$ and $f_2, f_3, \dots, f_n$ is an orthonormal basis for the subspace $x_1 + x_2 + \cdots + x_n = 0$. Then, if $x = \sum_{i=1}^n y_if_i$, it follows that
$$P_{\theta}(dx) = \frac{1}{(2\pi)^{(n-1)/2}}e^{-\|\bar{y}\|^2/2}\,d\bar{y}\;\frac{1}{\sqrt{2\pi}}e^{-(y_1-\theta\sqrt{n})^2/2}\,dy_1,$$
where $\bar{y} = (y_2, \dots, y_n)$ (translate to the point $\theta e = \theta\sqrt{n}f_1$ and then rotate). In this system of coordinates, the probabilities $P_{\theta}$ are products $\nu(d\bar{y})\nu_{\theta\sqrt{n}}(dy_1)$, and the random variable $\bar{X}$ is a function of $y_1$ alone, since (i) the coordinates of each $f_i$, $i \ge 2$, sum to zero and (ii) $\sum_{i=1}^n x_i = \sqrt{n}\,y_1$, so that $\bar{X} = \frac{y_1}{\sqrt{n}}$. Consequently, as in Example 5.3.2, $\bar{X}$ is a sufficient statistic. So too is $n\bar{X}$.

In Zacks [Z], a statistic $T : (\Omega, \mathcal{F}) \to (E, \mathfrak{E})$ is said to be sufficient for a class $\mathcal{P}$ of probabilities $P$ if the regular conditional probability $N_P$ for which $E_P[X|T^{-1}(\mathfrak{E})](\omega) = \int N_P(\omega, d\sigma)X(\sigma)$ does not depend upon $P \in \mathcal{P}$. This implies that $T$ is sufficient in the sense of Definition 5.3.3. Note that this definition assumes a situation where all these regular conditional probabilities exist.
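Example 5.3.6 can also be seen by simulation: conditionally on $\{T = \ell\}$, the sample $(X_1, \dots, X_n)$ is multinomial with uniform cell probabilities, whatever $\lambda$ is. In the Python sketch below, the parameters $n = 3$, $\ell = 4$, and the two values of $\lambda$ are choices made for the illustration.

    import random
    from collections import Counter
    from math import exp, factorial

    random.seed(2)

    def poisson(lam):
        # Knuth's inverse-transform sampler for a Poisson variate
        L, k, prod = exp(-lam), 0, random.random()
        while prod > L:
            k += 1
            prod *= random.random()
        return k

    n, ell, wanted = 3, 4, 10_000
    probe = [(4, 0, 0), (2, 1, 1), (0, 2, 2)]
    multinom = lambda k: factorial(ell) / (factorial(k[0]) * factorial(k[1]) * factorial(k[2])) / n**ell

    for lam in (0.5, 2.0):
        counts, hits = Counter(), 0
        while hits < wanted:
            x = tuple(poisson(lam) for _ in range(n))
            if sum(x) == ell:            # condition on the event {T = ell}
                counts[x] += 1
                hits += 1
        print(f"lambda = {lam}: " + "  ".join(f"P{k} ~ {counts[k]/hits:.3f}" for k in probe))
    print("multinomial(4; 1/3,1/3,1/3): " + "  ".join(f"P{k} = {multinom(k):.3f}" for k in probe))

The two empirical rows agree with each other and with the multinomial values, reflecting that the regular conditional probability $N$ computed above does not involve $\lambda$.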
Exercise 5.3.8. Let $\mathcal{P}$ be any family of probabilities on $(\mathbb{R}, \mathfrak{B}(\mathbb{R}))$. Show that the statistic $T(x) = x$ is sufficient. Is it important that the measurable space is $(\mathbb{R}, \mathfrak{B}(\mathbb{R}))$?

For further information on this topic, one may consult Zacks [Z] or an article by Halmos and Savage.$^6$

4. Martingales

The most important concept that makes use of conditional expectation is that of a martingale. Let $(\Omega, \mathcal{F}, P)$ be a probability space and $(\mathcal{F}_t)_{t\in I}$ be an increasing family of sub-$\sigma$-fields of $\mathcal{F}$, where $I \subset [0, +\infty)$ (that is, $s < t$ in $I$ implies that $\mathcal{F}_s \subset \mathcal{F}_t \subset \mathcal{F}$). Such a family will be referred to as a filtration.

Definition 5.4.1. A process $(X_t)_{t\in I}$ on $(\Omega, \mathcal{F}, P)$ will be called a martingale (respectively, supermartingale), with respect to the filtration $(\mathcal{F}_t)_{t\in I}$, if
(M1) each $X_t \in \mathcal{F}_t$ (such a process is said to be adapted to the filtration) and is integrable,
(M2) for all $s < t$ in $I$, $E[X_t|\mathcal{F}_s] = X_s$ (respectively, $\le X_s$).
A process $X$ is called a submartingale if $-X$ is a supermartingale.

Remarks. (1) When $I = \mathbb{N} \cup \{0\}$ and $(X_n)_{n\in I}$ is a martingale, the process may be thought of as the fortune of a gambler in a fair game: the expected value of his fortune does not change with the number of times the game is played. (For additional information on the relations between martingales and gambling, see Billingsley [B1], p. 407, an entertaining article by Snell$^7$, and also the book of Williams [W2].)
(2) If the filtration is trivial, with each $\mathcal{F}_t = \mathcal{F}$ for all $t \in I$, then a process $X = (X_t)_{t\in I}$ is a martingale if and only if it is constant: each $X_t = Y$, a fixed random variable. Similarly, it is a supermartingale if and only if it is a non-increasing family of random variables, i.e., $X_s \ge X_t$ if $s < t$ and $s, t \in I$.

Example 5.4.2M. Let $(X_n)_{n\ge1}$ be a sequence of independent integrable random variables on a probability space $(\Omega, \mathcal{F}, P)$ with $E[X_n] = 0$ for all $n \ge 1$. Define $\mathcal{F}_n = \sigma(\{X_i \mid 1 \le i \le n\})$ and $S_n = \sum_{i=1}^n X_i$. Then $(S_n)_{n\ge1}$ is a martingale with respect to the given filtration: $E[S_{n+1}|\mathcal{F}_n] = S_n + E[X_{n+1}|\mathcal{F}_n]$; the hypothesis of independence implies by Proposition 5.2.4 (CE$_{10}$) that $E[X_{n+1}|\mathcal{F}_n] = E[X_{n+1}] = 0$, and so $E[S_{n+1}|\mathcal{F}_n] = S_n$ (this martingale appears in the study of the laws of large numbers in Chapter IV).

$^6$Application of the Radon-Nikodym theorem to the theory of sufficient statistics, Ann. Math. Stat. 20 (1949), 225-241.
$^7$Gambling, probability and martingales, The Math. Intelligencer 4 (1982), 118-124.
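The martingale property in Example 5.4.2M can be probed through the defining property (CE): for an $\mathcal{F}_n$-measurable event $A$, $\int_A S_{n+1}\,dP = \int_A S_n\,dP$. In the sketch below, the step distribution (uniform on $[-1,1]$) and the event $A = \{S_n > 0\}$ are assumptions of the illustration.

    import random

    # Example 5.4.2M by simulation: the defining property (CE) of
    # E[S_{n+1} | F_n] = S_n is tested on the F_n-measurable event A = {S_n > 0}.
    # The integrals int_A S_n dP and int_A S_{n+1} dP should agree.
    random.seed(3)
    n, paths = 10, 200_000
    int_Sn = int_Sn1 = 0.0
    for _ in range(paths):
        s = sum(random.uniform(-1, 1) for _ in range(n))   # S_n
        s1 = s + random.uniform(-1, 1)                     # S_{n+1}
        if s > 0:                                          # the event A is in F_n
            int_Sn += s
            int_Sn1 += s1
    print(f"int_A S_n dP ~ {int_Sn/paths:.4f}   int_A S_(n+1) dP ~ {int_Sn1/paths:.4f}")

The two estimates differ only by Monte Carlo noise, since $E[X_{n+1}1_A] = E[X_{n+1}]P(A) = 0$ by independence.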
Exercise 5.4.2S. Let $(X_n)_{n\ge1}$ be a sequence of independent integrable random variables on a probability space $(\Omega, \mathcal{F}, P)$ with $E[X_n] \le 0$ for all $n \ge 1$. Define $\mathcal{F}_n = \sigma(\{X_i \mid 1 \le i \le n\})$ and $S_n = \sum_{i=1}^n X_i$. Show that $(S_n)_{n\ge1}$ is a supermartingale with respect to the given filtration. Modify this exercise so that $(S_n)_{n\ge1}$ is a submartingale.

Example 5.4.3. Let $X \in L^1(\Omega, \mathcal{F}, P)$, and let $(\mathcal{F}_n)_{n\ge1}$ be a filtration. Show that
(1) $(X_n)_{n\ge1}$ is a martingale if $X_n \stackrel{def}{=} E[X|\mathcal{F}_n]$ for all $n \ge 1$ (see Example 5.1.1).
Let $(\mathcal{G}_n)_{n\ge1}$ be any sequence of $\sigma$-fields (not necessarily a filtration). Define the sequence of random variables $(Y_n)_{n\ge1}$ by setting $Y_n \stackrel{def}{=} E[X|\mathcal{G}_n]$. Show that
(2) the sequence $(Y_n)_{n\ge1}$ is uniformly integrable. [Hints: use the fact that $|E[X|\mathcal{G}_n]| \le E[|X|\,|\mathcal{G}_n]$ and Chebychev's inequality to show that $P[\,|Y_n| > c\,]$ is uniformly small for large $c$; make use of Lemma 4.5.5.]
Conclude that
(3) the martingale $(X_n)_{n\ge1}$ in (1) is uniformly integrable.

Example 5.4.4. Let $(X_n)_{n\ge1}$ be a martingale with respect to $(\mathcal{F}_n)_{n\ge1}$, and assume $X_n \in L^2$ for each $n \ge 1$. Then $(X_n^2)_{n\ge1}$ is a submartingale: by Jensen's inequality, $X_n^2 = (E[X_{n+1}|\mathcal{F}_n])^2 \le E[X_{n+1}^2|\mathcal{F}_n]$. In particular, if $S_n$ is as in Example 5.4.2M with $X_n \in L^2$ for all $n$, then $(S_n^2)_{n\ge1}$ is a submartingale.

Example 5.4.5. Let $\mu$ be a probability on $(\Omega, \mathcal{F})$ such that $\mu(A) = 0$ if $P(A) = 0$ (i.e., by Definition 2.7.16, $\mu$ is absolutely continuous with respect to $P$), and assume that $\mathcal{F}$ is the smallest $\sigma$-algebra containing $\mathfrak{A} = \cup_{n\ge1}\mathcal{F}_n$, where $\mathcal{F}_n \subset \mathcal{F}_{n+1}$ is an increasing sequence of finite $\sigma$-fields. For each $n$, the set $W_n(\omega) = \cap\{A \in \mathcal{F}_n \mid \omega \in A\}$ is in $\mathcal{F}_n$, and the sets $W_n(\omega)$ are identical or pairwise disjoint as $\omega$ varies through $\Omega$ (they are the atoms of $\mathcal{F}_n$: sets in $\mathcal{F}_n$ which are non-void and are minimal relative to inclusion). Define $X_n$ to be constant on the atoms $A$ of $\mathcal{F}_n$ with
$$X_n(\omega) = \begin{cases} \frac{\mu(A)}{P(A)} & \text{if } \omega \in A \text{ and } P(A) \neq 0, \\ 0 & \text{if } \omega \in A \text{ and } P(A) = 0. \end{cases}$$
Then $\int_C X_n\,dP = \mu(C)$ for any set $C \in \mathcal{F}_n$. To see this, observe that it is enough to verify it when $C$ is an atom. The sequence $(X_n)_{n\ge1}$ is a martingale: if $C \in \mathcal{F}_n$, then $\int_C X_n\,dP = \mu(C) = \int_C X_{n+1}\,dP$. This martingale will be used later in Exercise 5.7.6 to show that there is an $L^1$-density $f$ for $\mu$, that is, a function such that $\mu(C) = \int_C f\,dP$ for any $C \in \mathcal{F}$. In fact, $f = \lim_n X_n$. This is another way to
prove the Radon-Nikodym theorem (Theorem 2.7.19) when the $\sigma$-algebra $\mathcal{F}$ is generated by a Boolean algebra $\mathfrak{A}$ that is the union of a sequence of finite $\sigma$-algebras. Note that as $n$ increases, the martingale approximates some kind of "derivative" of the measure $\mu$ with respect to $P$. The hypothesis that $\mu$ is absolutely continuous plays no role in the definition of this martingale. For a general measure $\mu$, the martingale converges to the $L^1$-density of the absolutely continuous measure in the Lebesgue decomposition of $\mu$ (see Theorem 2.7.22).

Remark. As an illustration of the above example, consider Example 5.1.1, where $\Omega = [0,1]$, $\mathcal{F} = \mathfrak{B}([0,1])$, and $dP = dx$. Let $\mathcal{F}_n$ be the smallest $\sigma$-field containing the dyadic intervals $[0, \frac{1}{2^n}]$, $(\frac{k}{2^n}, \frac{k+1}{2^n}]$, $1 \le k \le 2^n - 1$: these intervals are in fact the atoms of $\mathcal{F}_n$. The following exercise states that $\mathfrak{B}([0,1])$ is the smallest $\sigma$-field that contains all the dyadic intervals.

Exercise 5.4.6. Let $\Omega = [0,1]$, and let $\mathcal{F}_n$ be the finite $\sigma$-field whose atoms are the dyadic intervals $[0, \frac{1}{2^n}]$ and $(\frac{k}{2^n}, \frac{k+1}{2^n}]$, $1 \le k \le 2^n - 1$. Show that $\mathfrak{B}([0,1])$ is the smallest $\sigma$-algebra containing $\mathfrak{A} = \cup_{n\ge1}\mathcal{F}_n$ or, equivalently, containing all the dyadic intervals. [Hint: if $0 \le a < b \le 1$, show that $(a,b)$ is a countable union of dyadic intervals.]

Exercise 5.4.7. Let $(X_n)_{n\ge0}$ be a Markov chain with transition kernel $N$ (Exercise 5.2.12). Let $\mathcal{F}_n = \sigma(\{X_k \mid 0 \le k \le n\})$. Show that
(1) if $\psi$ is a non-negative Borel function such that $N\psi \le \psi$ (a so-called excessive function), then $(\psi \circ X_n)_{n\ge0}$ is a supermartingale with respect to the filtration $(\mathcal{F}_n)_{n\ge0}$;
(2) if $\psi$ is a bounded Borel function such that $N\psi = \psi$ (a so-called bounded harmonic function), then $(\psi \circ X_n)_{n\ge0}$ is a martingale with respect to the same filtration.

Remarks. (1) The function $\psi(n) = (\frac{q}{p})^n$ is a harmonic function on $\mathbb{Z}$ relative to the convolution kernel $N$ given by $N(0,dx) = p\varepsilon_1(dx) + q\varepsilon_{-1}(dx)$, where $0 < p, q < 1$ and $p + q = 1$. Consequently, if $(X_n)_{n\ge0}$ is a Markov chain with this transition kernel and $X_0 = n_0$ for some fixed $n_0 \in \mathbb{Z}$, then $\psi(X_n) = (\frac{q}{p})^{X_n}$ is a martingale. Since the function $\psi$ is unbounded, this is not an immediate consequence of Exercise 5.4.7 (2) even though $N\psi = \psi$. However, since $X_n$ a.s. assumes only a finite number of values for each $n$, it follows that $\psi \circ X_n$ is integrable and, as a result, is a martingale.
(2) This connection between supermartingales and martingales on the one hand and excessive and harmonic functions on the other is at the basis of the close connection between classical potential theory and Brownian motion (see Dynkin and Yushkevich [D4]).

A martingale $(X_n)_{n\ge0}$ with respect to a filtration $(\mathcal{F}_n)_{n\ge0}$ can always be viewed as the sequence of partial sums of an infinite series $\sum_{n=0}^{\infty} D_n$ whose terms $D_n = X_n - X_{n-1}$, $n \ge 1$, $D_0 = X_0$, are such that $E[D_n|\mathcal{F}_{n-1}] = 0$
for all $n \ge 1$. It is of interest to ask how one can "weight" the martingale differences $D_n$ so as to still have a sequence of martingale differences. This will be useful for optional stopping and is a precursor of the kind of sums needed to define a stochastic integral. This transformation of the martingale by weights is called a martingale transform.

Proposition 5.4.8. (Martingale transform) Let $(X_n)_{n\ge0}$ be a martingale with respect to $(\mathcal{F}_n)_{n\ge0}$ and $f = (f_n)_{n\ge1}$ be a sequence of random variables with $f_n \in \mathcal{F}_{n-1}$ for all $n \ge 1$ ($f$ is usually called a predictable process). Define $(Y_n)_{n\ge0}$ by the formulas $Y_0 = X_0$ and
$$Y_{n+1} = Y_n + f_{n+1}(X_{n+1} - X_n), \quad n \ge 0,$$
that is,
$$Y_{n+1} = X_0 + \sum_{k=0}^n f_{k+1}(X_{k+1} - X_k).$$
Then $Y = (Y_n)_{n\ge0}$ is a martingale and will be denoted by $f \cdot X$. In addition, if $X$ is a supermartingale (respectively, a submartingale) and $f$ is non-negative, then $f \cdot X$ is also a supermartingale (respectively, a submartingale).

Proof. Consider first the case of a martingale $X$. Then $f \cdot X$ is a martingale since
$$E[Y_{n+1}|\mathcal{F}_n] = Y_n + E[f_{n+1}(X_{n+1} - X_n)|\mathcal{F}_n] = Y_n + f_{n+1}(E[X_{n+1}|\mathcal{F}_n] - X_n) = Y_n.$$
This also proves that $f \cdot X$ is a supermartingale if $X$ is a supermartingale and $f$ is non-negative, since then $f_{n+1}(E[X_{n+1}|\mathcal{F}_n] - X_n)$ is non-positive. The submartingale case follows by the same argument, with the proviso that $f_{n+1}(E[X_{n+1}|\mathcal{F}_n] - X_n)$ is then non-negative. $\square$

Remarks. (1) Note that if $X_0 = 0$, then $X \mapsto f \cdot X$ is linear, and replacing $X_n$ by $U_n = X_n - X_0$, one has $f \cdot X = f \cdot U + X_0$, since $U_{k+1} - U_k = X_{k+1} - X_k$, $k \ge 0$.
(2) In Billingsley [B1], p. 412, $(f_n)_{n\ge1}$ is called a betting system. Intuitively, the original process represents the gambler's fortune, and the betting system allows scaling of the increment $X_{n+1} - X_n$ of the gambler's fortune before knowing the outcome of the $(n+1)$st game. See also Williams [W2], p. 96.
(3) As mentioned, the martingale transform is the discrete analogue of the stochastic integral in the sense of Ito, which is defined as a limit of Riemann-like sums, each of which is a martingale transform (see Ikeda and Watanabe [I], p. 48, and Protter [P2], p. 44) relative to a sequence of stopping times.

The classes of martingales and supermartingales have some obvious elementary properties, which are now stated as two propositions.
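Proposition 5.4.8 translates directly into code. The sketch below plays the fair coin-tossing martingale against one particular predictable betting system (the rule "bet 2 after a down-step, 1/2 after an up-step" is an assumption made for the illustration); since $f \cdot X$ is again a martingale, $E[(f \cdot X)_n] = E[X_0] = 0$ whatever predictable rule is used.

    import random

    # Martingale transform f.X (Proposition 5.4.8) for fair coin tossing.
    random.seed(4)

    def transform_path(n):
        X = Y = 0.0
        bet = 1.0                               # f_1 (deterministic, hence predictable)
        for _ in range(n):
            d = random.choice((-1.0, 1.0))      # the increment X_{k+1} - X_k
            X += d
            Y += bet * d                        # Y_{k+1} = Y_k + f_{k+1}(X_{k+1} - X_k)
            bet = 2.0 if d < 0 else 0.5         # f_{k+2} uses only F_{k+1} information
        return Y

    paths = 200_000
    mean_Y = sum(transform_path(12) for _ in range(paths)) / paths
    print(f"E[(f.X)_12] ~ {mean_Y:+.4f}  (zero up to noise: f.X is again a martingale)")

The essential point is in the order of the two updates: the bet for step $k+1$ is fixed before the increment $X_{k+1} - X_k$ is drawn, which is exactly the predictability condition $f_{k+1} \in \mathcal{F}_k$.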
Proposition 5.4.9M. Let $X = (X_t)_{t\in I}$ and $Y = (Y_t)_{t\in I}$ be martingales with respect to the same filtration $(\mathcal{F}_t)_{t\in I}$. Then
(1) $Z \stackrel{def}{=} aX + bY$ is a martingale, where $Z_t = aX_t + bY_t$, for all real $a, b$;
(2) if $c : \mathbb{R} \to \mathbb{R}$ is convex and each $c \circ X_t \in L^1$ for all $t \in I$, then $C \stackrel{def}{=} c \circ X$ is a submartingale, where $C_t = c \circ X_t$.

Proof. (1) is obvious; (2) follows from Jensen's inequality (Proposition 5.2.4 (CE$_8$)): if $s < t$, then $c \circ X_s = c \circ E[X_t|\mathcal{F}_s] \le E[c \circ X_t|\mathcal{F}_s]$. $\square$

Proposition 5.4.9S. Let $X = (X_t)_{t\in I}$ and $Y = (Y_t)_{t\in I}$ be supermartingales with respect to the same filtration $(\mathcal{F}_t)_{t\in I}$. Then
(1) $Z \stackrel{def}{=} aX + bY$ is a supermartingale, $Z_t = aX_t + bY_t$, if $a, b \ge 0$;
(2) $Z \stackrel{def}{=} X \wedge Y$ is a supermartingale, $Z_t = X_t \wedge Y_t$, $t \in I$; and
(3) if $c : \mathbb{R} \to \mathbb{R}$ is concave and increasing and $c \circ X_t \in L^1$ for all $t \in I$, then $C = c \circ X$ is a supermartingale, where $C_t = c \circ X_t$.

Proof. (1) is clear since $a \ge 0$ implies $aX_s \ge aE[X_t|\mathcal{F}_s]$ if $s < t$. (2) is obvious once one checks that each $Z_t$ is integrable, as $s < t$ implies that $E[Z_t|\mathcal{F}_s] \le E[X_t|\mathcal{F}_s] \le X_s$ and similarly $E[Z_t|\mathcal{F}_s] \le Y_s$. Property (3) corresponds to (2) in Proposition 5.4.9M, where $c$ is replaced by $(-c)$. Jensen's inequality implies that $E[c \circ X_t|\mathcal{F}_s] \le c \circ E[X_t|\mathcal{F}_s]$ and that, as $c$ is increasing, $c \circ E[X_t|\mathcal{F}_s] \le c \circ X_s$, since $E[X_t|\mathcal{F}_s] \le X_s$ if $s < t$. $\square$

Consider now $(X_n)_{n\ge1}$, a sequence of independent identically distributed (i.i.d.) random variables on $(\Omega, \mathcal{F}, P)$ with $P[X_n = +1] = P[X_n = -1] = \frac{1}{2}$. Let $X_0(\omega) = 0$ for all $\omega \in \Omega$. Then $S_n = \sum_{i=0}^n X_i$ is a martingale with respect to the natural filtration $\mathcal{F}_n = \sigma(\{X_i \mid 0 \le i \le n\})$. The random variable $S_n$ represents the position of a particle moving on $\mathbb{Z}$ under the influence of the symmetric random walk started from $0$ at time $0$. As time passes, the value $S_n(\omega)$ varies over all of $\mathbb{Z}$, and so if one fixes an integer $N > 0$ there is a first time $T = T(\omega)$ at which $|S_n(\omega)|$ attains the value $N + 1$ (that is, at which the particle visits $N+1$ or $-(N+1)$). This time $T$ is random (in other words, it depends upon $\omega$). Furthermore, since $T(\omega) = \inf\{n \mid |S_n(\omega)| > N\}$, it follows that
$$\{T = k\} = \{\omega \mid |S_{\ell}(\omega)| \le N,\ 0 \le \ell \le k-1,\ |S_k(\omega)| = N+1\}.$$
This set is in $\mathcal{F}_k$, and so $\{\omega \mid T(\omega) \le k\} \in \mathcal{F}_k$ (in other words, this set is determined by the information available up to and including time $k$). Random times with this property are called stopping times and are very common in martingale theory. A random time of this type was used in the proof of Kolmogorov's inequality (Theorem 4.4.3).

Definition 5.4.10. Let $(\mathcal{F}_t)_{t\in I}$ be a filtration on $(\Omega, \mathcal{F}, P)$. A random variable $T : \Omega \to I \cup \{+\infty\} \subset [0, +\infty]$ such that, for all $t \in I$, $\{T \le t\} \in \mathcal{F}_t$ is called a stopping time for the filtration $(\mathcal{F}_t)_{t\in I}$.
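The first exit time discussed before Definition 5.4.10 is easy to simulate: whether $T \le k$ is decided from $S_0, \dots, S_k$ alone, which is precisely the stopping time property. The values quoted in the comments of the sketch below, $E[T] = (N+1)^2$ and $E[S_T] = 0$, are classical facts recorded here only for comparison; they are not proved at this point of the text.

    import random

    # First exit time T = inf{n : |S_n| = N+1} of the symmetric random walk.
    # The while-condition inspects only S_0,...,S_n, i.e. {T <= n} is in F_n.
    random.seed(5)
    N, paths = 4, 50_000
    tot_T = tot_ST = 0
    for _ in range(paths):
        s = n = 0
        while abs(s) <= N:
            s += random.choice((-1, 1))
            n += 1
        tot_T += n          # T for this path
        tot_ST += s         # S_T = +-(N+1)
    print(f"E[T] ~ {tot_T/paths:.2f}  (exact (N+1)^2 = {(N+1)**2}),  E[S_T] ~ {tot_ST/paths:+.4f}")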
Remark. The event associated with a stopping time $T$ need not occur. When this happens, it is often convenient to define $T$ to be $+\infty$. It is sometimes useful to define $\mathcal{F}_{\infty}$ to be the smallest $\sigma$-algebra containing all the $\mathcal{F}_t$, $t \in I$, in which case the set $I$ can be viewed as a subset of $[0, +\infty]$.

Examples 5.4.11. (1) If $T(\omega) = s \in I$ for all $\omega \in \Omega$, then $T$ is a stopping time, as $\{T \le t\} = \emptyset$ or $\Omega$.
(2) Consider a symmetric random walk on $\mathbb{Z}$ started at $0$. If $T$ is the first time that it leaves $[-N, N]$, then $T$ is a stopping time; see the discussion preceding Definition 5.4.10.
Assume that $I = [0, +\infty)$ and that $X = (X_t)_{t\ge0}$ is an adapted process such that for almost every $\omega$, the function $t \mapsto X_t(\omega)$ is continuous. Assume that $(\Omega, \mathcal{F}, P)$ is complete and that each $\mathcal{F}_t$ contains all the sets in $\mathcal{F}$ of probability zero. Then
(3) $T(\omega) = \inf\{t \mid |X_t(\omega)| \ge 1\}$, with $\inf\emptyset = +\infty$, is a stopping time. Assume that $t \mapsto X_t(\omega)$ is continuous. Then $T(\omega) \le t$ if and only if $\max_{0\le s\le t}|X_s(\omega)| \ge 1$. Hence, $T(\omega) > t$ if and only if, for some $n$, $\max_{0\le s\le t}|X_s(\omega)| \le 1 - \frac{1}{n}$; equivalently, by continuity, $\sup_{0\le r\le t,\,r\in\mathbb{Q}}|X_r(\omega)| \le 1 - \frac{1}{n}$. Since $\mathbb{Q}$ is countable, it follows that $\{T > t\} \cap \Omega_1 \in \mathcal{F}_t$, where $\Omega_1 = \{\omega \mid t \mapsto X_t(\omega) \text{ is continuous}\}$. Since by hypothesis $P(\Omega \setminus \Omega_1) = 0$, $\{T > t\} \in \mathcal{F}_t$. Consequently, $\{T \le t\} \in \mathcal{F}_t$.
Under the same hypotheses as in (3), one can prove that $\{T < t\} \in \mathcal{F}_t$ for all $t \ge 0$ if $T = \inf\{t \mid |X_t| > 1\}$. This implies that
(4) $T = \inf\{t \mid |X_t| > 1\}$ is a stopping time with respect to the filtration $(\widetilde{\mathcal{F}}_t)_{t\ge0}$, where $\widetilde{\mathcal{F}}_t = \cap_{s>t}\mathcal{F}_s$.

Remark. These last two examples indicate something of the technical complications that arise when looking at processes indexed by $[0, \infty)$ rather than a finite or a countable time set $I$, the classical example being Brownian motion in $\mathbb{R}^n$. From now on, $I$ will be finite or countable.

Exercise 5.4.12. Show that $T$ in Example 5.4.11 (4) is a stopping time.

Definition 5.4.13. A set $\Lambda \in \mathcal{F}$ is said to occur prior to a stopping time $T$ if, for all $t \in I$, $\Lambda \cap \{T \le t\} \in \mathcal{F}_t$. Let $\mathcal{F}_T$ denote the sets in $\mathcal{F}$ that occur prior to $T$.

Exercise 5.4.14. Show that $\mathcal{F}_T$ is a $\sigma$-field. [Hint: $\{\Lambda^c \cup \{T > t\}\} \cap \{T \le t\} = \Lambda^c \cap \{T \le t\}$.]

Exercise 5.4.15. If $S \le T$ are two stopping times, show that $\mathcal{F}_S \subset \mathcal{F}_T$. [Hint: $S \le T$ implies that $\{S \le t\} \cap \{T \le t\} = \{T \le t\}$.]

Example 5.4.16. In Example 5.4.11 (2), the following sets $\Lambda$ occur prior to $T$:
(1) $\Lambda = \{\omega \mid |S_n(\omega)| \le N \text{ for all } n\}$, as $\Lambda \cap \{T = k\} = \emptyset$ for all $k$;
(2) $\Lambda_{n_1} = \{\omega \mid |S_n(\omega)| \le N \text{ for } 0 \le n \le n_1\}$, as $\Lambda_{n_1} \cap \{T = k\} = \emptyset$ if $k \le n_1$ and $= \{T = k\}$ if $k > n_1$; and
(3) the set $\Lambda = \{\omega \mid |S_k(\omega)| \le N+1,\ 0 \le k \le 2N+2,\ S_{2N+2}(\omega) = 0\}$ does not occur prior to $T$, as
$$\Lambda \cap \{T = N+1\} = \{\omega \mid X_n(\omega) = +1,\ 1 \le n \le N+1,\ X_n(\omega) = -1,\ N+2 \le n \le 2N+2\},$$
which is not in $\mathcal{F}_{N+1}$.

Exercise 5.4.17. Let $I \subset [0, +\infty)$ be finite and $T : \Omega \to I$ be a stopping time for a filtration $(\mathcal{F}_t)_{t\in I}$. Let $(X_t)_{t\in I}$ be a real-valued process that is adapted to the filtration. Define $X_T(\omega) = X_{T(\omega)}(\omega)$. Show that $X_T$ is $\mathcal{F}_T$-measurable. [Hint: find an alternate expression for $\{X_T < \lambda\} \cap \{T = k\}$.]

Theorem 5.4.18. (Doob's optional stopping theorem: finite case) (See [M2], [I].) Let $I \subset [0, +\infty)$ be finite and let $S, T$ be two stopping times for a filtration $(\mathcal{F}_t)_{t\in I}$ on $(\Omega, \mathcal{F}, P)$. Let $(X_t)_{t\in I}$ be an $(\mathcal{F}_t)_{t\in I}$-supermartingale (respectively, submartingale). If $S \le T$, then $E[X_T|\mathcal{F}_S] \le X_S$ (respectively, $\ge X_S$).

Proof I. (See [M2].) One may as well assume that $I = \{0, 1, \dots, n\}$. Let $\Lambda \in \mathcal{F}_S$. Then
$$\int_{\Lambda} X_T\,dP = \sum_{j=0}^n \int_{\{T=j\}\cap\Lambda} X_j\,dP = \sum_{j=0}^n \sum_{i=0}^j \int_{\{T=j\}\cap\{S=i\}\cap\Lambda} X_j\,dP = \sum_{i=0}^n \sum_{j=i}^n \int_{\{T=j\}\cap\{S=i\}\cap\Lambda} X_j\,dP \quad \text{as } S \le T$$
$$= \sum_{i=0}^n \int_{\Lambda_i\cap\{T\ge i\}} X_T\,dP,$$
where $\Lambda_i = \{S = i\} \cap \Lambda \in \mathcal{F}_i$, since $S$ is a stopping time. Now
$$\int_{\Lambda_i\cap\{T\ge i\}} X_T\,dP = \int_{\Lambda_i\cap\{T=i\}} X_i\,dP + \int_{\Lambda_i\cap\{T>i\}} X_T\,dP.$$
Note that $\Lambda_i \cap \{T > i\} \in \mathcal{F}_i$. Assume that $T \le S + 1$. Then,
$$\int_{\Lambda_i\cap\{T>i\}} X_T\,dP = \int_{\Lambda_i\cap\{T>i\}} X_{i+1}\,dP \le \int_{\Lambda_i\cap\{T>i\}} X_i\,dP$$
(respectively, $\ge$), as $(X_t)_{t\in I}$ is a supermartingale (respectively, submartingale). Therefore, if $S \le T \le S+1$,
$$\int_{\Lambda} X_T\,dP \le \sum_{i=0}^n \int_{\Lambda_i} X_i\,dP = \int_{\Lambda} X_S\,dP \quad (\text{respectively, } \ge).$$
For the general case of $S \le T$, define a sequence of stopping times $T^{(k)}$, $0 \le k \le n$, as follows: $T^{(k)} = T \wedge (S + k)$. Then $T^{(k)} \le T^{(k+1)} \le T^{(k)} + 1$ and $\mathcal{F}_{T^{(k)}} \subset \mathcal{F}_{T^{(k+1)}}$. Hence, for $\Lambda \in \mathcal{F}_S$ and $0 \le k < n$, $\int_\Lambda X_{T^{(k+1)}} \, dP \le \int_\Lambda X_{T^{(k)}} \, dP$ (respectively, $\ge$). Since $T^{(0)} = S$ and $T^{(n)} = T$, the result follows. $\square$

Proof II. (See [I].) If $T$ is a stopping time for $(\mathcal{F}_t)_{t \in I}$, $I = \{0, 1, \dots, n\}$, and if $h_k = 1_{\{k \le T\}}$, then the process $h = (h_k)_{1 \le k \le n}$ is predictable, as $\{k \le T\} = \{T \le k-1\}^c \in \mathcal{F}_{k-1}$. Let $f_k = 1_{\{S < k \le T\}} = 1_{\{k \le T\}} - 1_{\{k \le S\}}$. Then $(f_k)_{1 \le k \le n}$ is also predictable and non-negative.

Let $(X_t)_{t \in I}$ be a supermartingale. First assume $X_0 = 0$. Then $(h \cdot X)_0 = 0$ by definition (see Proposition 5.4.8) and, for $k \ge 1$, $(h \cdot X)_k = X_{T \wedge k}$ by induction, since

\[
(h \cdot X)_{k+1} = (h \cdot X)_k + h_{k+1}(X_{k+1} - X_k) = X_{T \wedge k} + h_{k+1}(X_{k+1} - X_k)
\]

and

\[
h_{k+1}(X_{k+1} - X_k) = \begin{cases} X_{k+1} - X_k & \text{if } T \ge k+1, \\ 0 & \text{if } T < k+1, \end{cases}
\]

so that $(h \cdot X)_{k+1} = X_{T \wedge (k+1)}$. Consequently, by linearity (see Proposition 5.4.8), $(f \cdot X)_k = X_{T \wedge k} - X_{S \wedge k}$, $0 \le k \le n$, is a supermartingale. Therefore,

\[
0 \ge E[(f \cdot X)_n] = E[X_T] - E[X_S].
\]

If $X_0 \ne 0$, let $U_t = X_t - X_0$. Then $(U_t)_{t \in I}$ is a supermartingale with $U_0 = 0$, $U_T = X_T - X_0$, and $U_S = X_S - X_0$. Hence, $X_T - X_S = U_T - U_S$ and so $E[X_T] \le E[X_S]$.

Exercise 5.4.19. For any $\Lambda \in \mathcal{F}_S$, define

\[
S_\Lambda(\omega) = \begin{cases} S(\omega) & \text{if } \omega \in \Lambda, \\ n & \text{if } \omega \notin \Lambda, \end{cases}
\qquad
T_\Lambda(\omega) = \begin{cases} T(\omega) & \text{if } \omega \in \Lambda, \\ n & \text{if } \omega \notin \Lambda. \end{cases}
\]

Show that $S_\Lambda$ and $T_\Lambda$ are stopping times. Hence, by what has been proved,

\[
\int_\Lambda X_T \, dP + \int_{\Lambda^c} X_n \, dP = E[X_{T_\Lambda}] \le E[X_{S_\Lambda}] = \int_\Lambda X_S \, dP + \int_{\Lambda^c} X_n \, dP,
\]

and so $\int_\Lambda X_T \, dP \le \int_\Lambda X_S \, dP$. $\square$
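The conclusion $E[X_T] = E[X_S]$ for a martingale can be checked empirically. The following minimal sketch (Python, not part of the text; all parameter values are ad hoc) simulates a symmetric $\pm 1$ random walk, which is a martingale, together with two stopping times $S \le T$, namely the first exit times from $[-2, 2]$ and $[-5, 5]$, each with default value $n$ so that the index set is finite.

```python
# Monte Carlo sanity check of optional stopping (Theorem 5.4.18) for a
# symmetric +/-1 random walk.  S <= T are first exit times from [-a, a]
# and [-b, b] with a < b, both defaulting to the horizon n.
import random

def run(trials=50_000, n=100, a=2, b=5):
    sum_S = sum_T = 0.0
    for _ in range(trials):
        x, xS, xT = 0, None, None
        for _ in range(n):
            x += random.choice((-1, 1))
            if xS is None and abs(x) >= a:
                xS = x                        # value of the walk at time S
            if xT is None and abs(x) >= b:
                xT = x                        # value of the walk at time T
        sum_S += xS if xS is not None else x  # default value: time n
        sum_T += xT if xT is not None else x
    print("E[X_S] ~", sum_S / trials, "  E[X_T] ~", sum_T / trials)

run()  # both sample means should be near E[X_0] = 0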
Theorem 5.4.20. (Doob's maximal inequality: finite case) Let $I \subset [0, +\infty)$ be finite and let $(X_t)_{t \in I}$ be a submartingale with respect to a filtration $(\mathcal{F}_t)_{t \in I}$ on $(\Omega, \mathcal{F}, P)$. Let $\lambda > 0$. Then

\[
\lambda P\left[\max_{t \in I} X_t \ge \lambda\right] \le E\left[1_{\{\max_{t \in I} X_t \ge \lambda\}} X_\tau^+\right] \le E[X_\tau^+],
\]

where $X_\tau^+ = X_\tau \vee 0$ and $\tau$ is the largest element of $I$.

Proof. One may as well assume that $I = \{0, 1, \dots, N\}$. Let $T(\omega)$ be the first time in $I$ that $X_t(\omega) \ge \lambda$ (if this ever occurs) and $N$ otherwise. Then $T$ is a stopping time and, by Theorem 5.4.18, $X_T^+ \le E[X_N^+ \mid \mathcal{F}_T]$, as $(X_t^+)_{t \in I}$ is a submartingale. Hence, if $B = \{\max_{t \in I} X_t \ge \lambda\}$,

\[
E[X_N^+] \ge E[X_N^+ 1_B] \ge E[X_T^+ 1_B] = \sum_{n=0}^N E[X_n^+ 1_{B \cap \{T=n\}}] \ge \sum_{n=0}^N \lambda P[B \cap \{T = n\}] = \lambda P\left[\max_{t \in I} X_t \ge \lambda\right]. \quad \square
\]

Remark. This inequality is called a "maximal" inequality as it involves $X^* \stackrel{\text{def}}{=} \sup_n |X_n|$, the so-called maximal function of the submartingale. This maximal function is a direct analogue of the Hardy-Littlewood maximal function (Definition 4.6.1).

Corollary 5.4.21. (Kolmogorov's inequality) Let $(S_n)_{n \ge 1}$ be a martingale with respect to a filtration $(\mathcal{F}_n)_{n \ge 1}$ on $(\Omega, \mathcal{F}, P)$. Assume $S_n \in L^2$ for all $n$. Then, for any $\lambda > 0$,

\[
\lambda^2 P\left[\max_{1 \le n \le N} |S_n| \ge \lambda\right] \le E[S_N^2].
\]

Proof. $(S_n^2)_{n \ge 1}$ is a positive submartingale and

\[
\left\{\max_{1 \le n \le N} S_n^2 \ge \lambda^2\right\} = \left\{\max_{1 \le n \le N} |S_n| \ge \lambda\right\}. \quad \square
\]

Remark. Doob's maximal inequality is an improvement of Chebychev's inequality (Proposition 4.3.6): $\lambda^2 P[|S_N| \ge \lambda] \le E[S_N^2]$. It also generalizes Kolmogorov's inequality for sums of independent random variables (Theorem 4.4.3).
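As a numerical illustration (not from the text; the parameters are hypothetical), the next sketch estimates both sides of Kolmogorov's inequality for the random-walk martingale, for which $E[S_N^2] = N$.

```python
# Numerical check of Kolmogorov's inequality (Corollary 5.4.21):
# lambda^2 P[max_{n<=N} |S_n| >= lambda] <= E[S_N^2] = N for +/-1 steps.
import random

N, lam, trials = 50, 10.0, 20_000
hits, second_moment = 0, 0.0
for _ in range(trials):
    s, m = 0, 0
    for _ in range(N):
        s += random.choice((-1, 1))
        m = max(m, abs(s))
    hits += (m >= lam)
    second_moment += s * s
print("lambda^2 * P[max|S_n| >= lambda] ~", lam**2 * hits / trials)
print("E[S_N^2] ~", second_moment / trials, "(exactly", N, ")")
```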
5. An introduction to martingale convergence

If $(X_n)_{n \ge 1}$ is a sequence of i.i.d. integrable random variables with expectation zero, the process $(U_n)_{n \ge 1}$, where $U_n = \sum_{k=1}^n \frac{X_k}{k}$, is a martingale. Kronecker's lemma (Proposition 5.5.5) states that $\frac{1}{n} \sum_{k=1}^n X_k(\omega) \to 0$ if $\sum_{k=1}^\infty \frac{X_k(\omega)}{k}$ converges. Consequently, to prove the strong law of large numbers for $(X_n)_{n \ge 1}$, it would suffice to show that the martingale $(U_n)_{n \ge 1}$ converges a.s. as $n \to \infty$. In case the $X_n$ are in $L^2$, the martingale $(U_n)_{n \ge 1}$ converges as a consequence of the next result, since $E[U_n^2] = C \sum_{k=1}^n \frac{1}{k^2}$, where $C = E[X_1^2]$. This proves the i.i.d. version of Proposition 4.4.1. This result is a special convergence theorem and is by no means the most general result of this type (see Corollary 5.7.3 for a general result).

Proposition 5.5.1. ($L^2$-martingale convergence theorem) Let $(X_n)_{n \ge 0}$ be a martingale, and assume that for some constant $M$, $E[X_n^2] \le M$ for all $n$ (i.e., the martingale is $L^2$-bounded). Then there is a random variable $Y \in L^2$ such that $X_n \to Y$ a.s. In addition, $X_n \to Y$ in $L^2$.

Proof. (See Feller [F2], p. 236.) Since $(X_n^2)_{n \ge 0}$ is a submartingale, $E[X_n^2]$ increases to a limit $\mu \le M$. Let $\lambda > 0$. Kolmogorov's inequality implies that

(1) \[ \lambda^2 P\left[\max_{1 \le r \le R} |X_{N+r} - X_N| \ge \lambda\right] \le E[(X_{N+R} - X_N)^2], \]

as $(X_{N+r} - X_N)_{r \ge 1}$ is a martingale with respect to $(\mathfrak{G}_r)_{r \ge 1}$, $\mathfrak{G}_r = \mathcal{F}_{N+r}$. For any martingale in $L^2$,

\[
E[(X_{n+r} - X_n)^2 \mid \mathcal{F}_n] = E[X_{n+r}^2 - 2X_{n+r}X_n + X_n^2 \mid \mathcal{F}_n] = E[X_{n+r}^2 - X_n^2 \mid \mathcal{F}_n],
\]

as $E[X_{n+r} X_n \mid \mathcal{F}_n] = X_n^2$ by Proposition 5.2.4 (CE9). Hence,

(2) \[ E[(X_{N+R} - X_N)^2] = E[X_{N+R}^2 - X_N^2]. \]

Since $E[X_n^2] \to \mu$ as $n \to \infty$, (2) implies (i) that the martingale is Cauchy in $L^2$ and (ii), in view of (1), that for any $\varepsilon > 0$ and $\lambda > 0$ there exists $N_0 = N_0(\varepsilon)$ such that, for all $R$, if $N \ge N_0$, then

(3) \[ \lambda^2 P\left[\max_{1 \le r \le R} |X_{N+r} - X_N| \ge \lambda\right] < \varepsilon. \]

As shown by the following exercise, (3) implies that a.s. the sequence $(X_n(\omega))_{n \ge 1}$ is Cauchy. $\square$

Exercise 5.5.2. Let $k \ge 1$ and $E_N \stackrel{\text{def}}{=} \{\omega \mid \text{for some } r \ge 1,\ |X_{N+r}(\omega) - X_N(\omega)| \ge \frac{1}{k}\}$. Use (3) to show that

(1) there is an integer $N_k$ with $P(E_{N_k}) < \frac{1}{2^k}$.
Hence, show that

(2) a.s. $|X_{N+r}(\omega) - X_N(\omega)| < \frac{1}{k}$ for all $r \ge 1$ if $N$ is sufficiently large.

Conclude that

(3) a.s. the sequence $(X_n(\omega))_{n \ge 1}$ is Cauchy.

Remarks. (1) As observed at the beginning of this section, the simple strong law for sums of independent $L^2$-random variables (Proposition 4.4.1) follows from this convergence theorem.

(2) In Doob [D1] there is a related result (see [D1], p. 108, Theorem 2.3).

(3) The $L^p$-analogue for $1 < p < \infty$ of this theorem is also true. It follows from the martingale convergence theorem (5.7.3) that an $L^p$-bounded martingale converges a.s. However, it does not follow from the proof (as in the case $p = 2$) that this gives convergence in $L^p$. To get $L^p$-convergence, it suffices to determine a dominating function in $L^p$. There is a natural candidate for this function, the maximal function $M^* = \sup_n |M_n|$. As shown in the next exercise, this function is in $L^p$ (see Williams [W2], p. 143, and Ikeda and Watanabe [I], p. 28).

Exercise 5.5.3. (Doob's $L^p$-inequality: countable case) (See Revuz and Yor [R2], p. 51, and Ikeda and Watanabe [I], p. 28.) Let $(X_n)_{0 \le n \le N}$ be a submartingale and let $X^* = \sup_n |X_n|$. Use Doob's inequality (Theorem 5.4.20), Exercise 4.7.2, and Fubini's theorem (Theorem 3.3.5) to show that, for any integer $k$ and $1 < p < +\infty$,

(1) \[ E[(X^* \wedge k)^p] \le \int_0^k p \lambda^{p-2} E\left[|X_N| 1_{\{X^* \ge \lambda\}}\right] d\lambda = \frac{p}{p-1} E\left[|X_N| (X^* \wedge k)^{p-1}\right]. \]

(2) Apply Hölder's inequality (Proposition 4.1.15) to the last term (recall that $Z \in L^p$ if and only if $Z^{p-1} \in L^q$) and conclude that

\[
E[(X^* \wedge k)^p] \le \left(\frac{p}{p-1}\right) \left(E[|X_N|^p]\right)^{\frac{1}{p}} \left(E[(X^* \wedge k)^p]\right)^{\frac{p-1}{p}}.
\]

(3) Hence show that $\| X^* \|_p \le q \| X_N \|_p$, where $q$ is the conjugate index to $p$.

Use this to deduce Doob's $L^p$-inequality: if $X = (X_n)_{n \ge 0}$ is a martingale that is bounded in $L^p$, $1 < p < +\infty$, and $X^* = \sup_{n \ge 0} |X_n|$, then

(4) $X^* \in L^p$ and $\sup_n \| X_n \|_p \le \| X^* \|_p \le q \, (\sup_n \| X_n \|_p)$.
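For $p = 2$ (so that $q = 2$), the inequality of (3) can be observed numerically. The following sketch (illustrative only; ad hoc parameters) compares $\| X^* \|_2$ with $2 \| X_N \|_2$ for the random-walk martingale.

```python
# Illustration of Doob's L^p inequality (Exercise 5.5.3) with p = q = 2:
# ||X*||_2 <= 2 ||X_N||_2, where X* = max_{n<=N} |X_n| for a +/-1 walk.
import math, random

N, trials = 100, 20_000
sq_max = sq_end = 0.0
for _ in range(trials):
    s, m = 0, 0
    for _ in range(N):
        s += random.choice((-1, 1))
        m = max(m, abs(s))
    sq_max += m * m
    sq_end += s * s
print("||X*||_2   ~", math.sqrt(sq_max / trials))
print("2||X_N||_2 ~", 2 * math.sqrt(sq_end / trials))
```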
Remark. If $f \in L^p(\mathbb{R})$, $1 < p < \infty$, then the Hardy-Littlewood maximal function $f^*$ is in $L^p$ (see [W1], p. 155, Theorem 9.16). The proof of this is closely related to the above proof of the analogous result for martingales. For the finite set $I = \{0, 1, \dots, N\}$, it is obvious that $\sup_{0 \le n \le N} |X_n| \in L^p$, as it is dominated by $\sum_{n=0}^N |X_n|$; what is not obvious but crucial is that its norm in $L^p$ satisfies (3), and so the maximal function $X^* = \sup_{n \ge 0} |X_n|$ is in $L^p$ with a norm equivalent to the bound in $L^p$ of the martingale.

By using truncation to get an $L^2$-bounded martingale, Kolmogorov's strong law of large numbers will now be proved for $(X_n)_{n \ge 1}$ in $L^1$ by making use of the martingale convergence theorem for $L^2$-bounded martingales.

Theorem 5.5.4. (Kolmogorov's strong law of large numbers) Let $(X_n)_{n \ge 1}$ be a sequence of i.i.d. integrable random variables. Then $\frac{1}{n} S_n \to m$ a.s., where $m = \int x \, Q(dx)$, $Q$ is the common distribution, and $S_n = \sum_{k=1}^n X_k$.

Proof. (Compare this martingale argument with the proof of Theorem 4.4.2.) To begin, as usual, one may assume $m = 0$ by replacing $X_n$ with $X_n - m$. Let $Y_n$ be the truncation of $X_n$ at height $n$ (i.e., $Y_n(\omega) = X_n(\omega)$ if $|X_n(\omega)| \le n$ and equals $0$ otherwise). Let $\Lambda_n = \{Y_n \ne X_n\}$. Then, as was shown in Proof II of Theorem 4.3.12, $\sum_{n=1}^\infty P(\Lambda_n) < +\infty$, and so with probability 1, for each $\omega$ there exists $n_0(\omega) = n_0$ such that $Y_n(\omega) = X_n(\omega)$ for $n \ge n_0$. Hence, it suffices to show that $\frac{1}{n} \sum_{k=1}^n Y_k \to 0$ a.s.

Let $m_n = E[Y_n] = \int_{\{|x| \le n\}} x \, Q(dx)$. Define $Z_n = \frac{1}{n}(Y_n - m_n)$. Then, if $T_n = \sum_{k=1}^n Z_k$, the process $(T_n)_{n \ge 1}$ is a martingale with respect to $(\mathcal{F}_n)_{n \ge 1}$, $\mathcal{F}_n = \sigma(\{X_i \mid 1 \le i \le n\})$ (see Example 5.4.2M). Since each $Z_n$ is bounded, $T_n \in L^2(\Omega, \mathcal{F}, P)$ for all $n \ge 1$. In fact, $T = (T_n)_{n \ge 1}$ is an $L^2$-bounded martingale: using independence, one has $E[T_n^2] = \sum_{k=1}^n E[Z_k^2] = \sum_{k=1}^n \frac{\sigma_k^2}{k^2}$, where $\sigma_k^2$ is the variance of $Y_k$. In the earlier proof of this strong law it was shown that $\sum_{k=1}^\infty \frac{\sigma_k^2}{k^2} < +\infty$ (see the end of the proof of Theorem 4.4.2), which proves the $L^2$-boundedness of $T$.

By Proposition 5.5.1, there is a random variable $Z$ such that $T_n = \sum_{k=1}^n Z_k \to Z$ a.s., that is, $\sum_{k=1}^n \frac{Y_k - m_k}{k} \to Z$ a.s. Kronecker's lemma (see Proposition 5.5.5) implies that $\frac{1}{n} \sum_{k=1}^n (Y_k - m_k) \to 0$ a.s. Hence, it suffices to show that $\frac{1}{n} \sum_{k=1}^n m_k \to 0$. This is immediate: since $|x| \in L^1(Q)$, $m_k$ tends to $\int x \, Q(dx) = 0$, and so $m_k$ tends to zero in the sense of Cesàro, i.e., $\frac{1}{n} \sum_{k=1}^n m_k \to 0$ (see Exercise 4.3.14). $\square$

Proposition 5.5.5. (Kronecker's lemma) Let $(a_n)_{n \ge 1}$ be a sequence of real numbers, and assume that $\sum_{n=1}^\infty \frac{a_n}{n}$ converges. Then $\frac{1}{n} \sum_{k=1}^n a_k \to 0$.
Proof. The first problem is how to relate the two series. Let $b_n = \sum_{k=1}^n \frac{a_k}{k}$ and $b_0 = 0$. Then $a_n = \{b_n - b_{n-1}\} n$ for all $n \ge 1$, and so the partial sum $\sum_{k=1}^n a_k = \sum_{k=1}^n \{b_k - b_{k-1}\} k$. Formally, this looks like $\int b'(x) k(x) \, dx$, which suggests that one should try to use integration by parts to evaluate it. The following lemma gives the formula for integration by parts of a series.

Lemma 5.5.6. Let $(C_n)_{n \ge 1}$ and $(D_n)_{n \ge 1}$ be two sequences of real numbers. Then, for all $n \ge 2$,

\[
C_n D_n - C_1 D_1 = \sum_{k=1}^{n-1} \{C_{k+1} - C_k\} D_{k+1} + \sum_{k=1}^{n-1} \{D_{k+1} - D_k\} C_k.
\]

Proof of the lemma. $C_{k+1} D_{k+1} - C_k D_k = \{C_{k+1} - C_k\} D_{k+1} + \{D_{k+1} - D_k\} C_k$. $\square$

Applying Lemma 5.5.6 with $C_k = b_k$ and $D_k = k$ to integrate $\frac{1}{n} \sum_{k=1}^n a_k = \frac{1}{n} \sum_{k=1}^n \{b_k - b_{k-1}\} k$ by parts, one obtains

\[
\frac{1}{n} \sum_{k=1}^n a_k = \frac{1}{n} \left[ b_1 + \sum_{k=1}^{n-1} \{b_{k+1} - b_k\}(k+1) \right] = \frac{1}{n} \left[ b_1 + n b_n - b_1 - \sum_{k=1}^{n-1} b_k \right] = b_n - \frac{1}{n} \sum_{k=1}^{n-1} b_k.
\]

This last expression tends to zero, as $b_n$ converges by assumption and so $(b_n)_{n \ge 1}$ converges to the same limit in the sense of Cesàro (Exercise 4.3.14). $\square$

Remark. There is a martingale proof of this theorem, due to Doob, that makes no use of truncation. In place of truncation, it uses the concept of a reversed, or backward, martingale. It differs from a martingale in that the $\sigma$-algebras decrease rather than increase. This proof is given later as Theorem 5.7.9 (see [D1], p. 341).
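Kronecker's lemma is easy to observe numerically. The sketch below (illustrative only, not part of the text) takes $a_n = (-1)^n$, for which $\sum_n \frac{a_n}{n}$ converges by the alternating series test.

```python
# Numerical illustration of Kronecker's lemma (Proposition 5.5.5) with
# a_n = (-1)^n: sum a_n/n converges, and (1/n) sum_{k<=n} a_k tends to 0.
for n in (10, 100, 1000, 10000):
    b_n = sum((-1) ** k / k for k in range(1, n + 1))    # partial sums of a_k/k
    cesaro = sum((-1) ** k for k in range(1, n + 1)) / n  # (1/n) * sum a_k
    print(f"n={n:6d}  sum a_k/k = {b_n:+.6f}   (1/n) sum a_k = {cesaro:+.6f}")
```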
6. The three-series theorem and the Doob decomposition

A) The three-series theorem. Consider a series $\sum_n X_n$ of independent random variables. If the $X_n$ are all integrable with mean zero, then the partial sums of this series form a martingale, and the series converges if and only if the martingale converges (which, by the martingale convergence theorem (Theorem 5.7.3), is the case if the martingale is bounded in $L^1$). The question of convergence involves an event in the $\sigma$-field $\mathcal{T}_\infty = \cap_n \sigma(\{X_k \mid k \ge n\})$, as indicated by the following exercise.

Exercise 5.6.1. Show that $\{\omega \mid \sum_n X_n(\omega) \text{ converges}\}$ belongs to $\mathcal{T}_\infty$.

Kolmogorov's 0-1 law (Proposition 3.4.4) implies that either the series converges a.s. or the series diverges a.s.

Example 5.6.2. The harmonic series $\sum_n \frac{1}{n}$ diverges. Let $\varepsilon_n$, $n \ge 1$, be i.i.d. Bernoulli random variables with distribution $\frac{1}{2}(\epsilon_{-1} + \epsilon_{+1})$. It turns out that the random series $\sum_n \varepsilon_n \frac{1}{n}$ does converge a.s. Its partial sums $S_n = \sum_{k=1}^n \varepsilon_k \frac{1}{k}$, $n \ge 1$, define a martingale by Example 5.4.2M, and since $\sigma^2(S_n) = \sum_{k=1}^n \frac{1}{k^2}$, the partial sum of the convergent series $\sum_n \frac{1}{n^2}$, this martingale is bounded in $L^2$. As a result, it follows from Proposition 5.5.1 that it converges a.s. (A simulation of this series is sketched below.)
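The following sketch (illustrative only; the number of terms is an arbitrary choice) simulates the partial sums of the random series of Example 5.6.2. Different runs should settle near different finite limits.

```python
# Simulation of the random harmonic series sum_n eps_n / n with i.i.d.
# signs eps_n = +/-1 (Example 5.6.2).  The partial sums form an L^2-bounded
# martingale, so by Proposition 5.5.1 they converge a.s.
import random

for run in range(3):
    s = 0.0
    for n in range(1, 100_001):
        s += random.choice((-1, 1)) / n
    print(f"run {run}: partial sum after 10^5 terms = {s:+.4f}")
```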
There is a well-known general criterion for the convergence of an infinite series of independent random variables that makes use of truncation. It is stated as the next result.

Theorem 5.6.3. (The three-series theorem) Let $\sum_n X_n$ be a series of independent random variables, and let $Y_n$ denote the truncation of $X_n$ at height $c > 0$. The following are equivalent:

(1) $\sum_n X_n$ converges a.s.;
(2) for every truncation height $c > 0$, the three series (i) $\sum_n P[X_n \ne Y_n]$, (ii) $\sum_n E[Y_n]$, and (iii) $\sum_n \sigma^2(Y_n)$ all converge;
(3) for some truncation height $c > 0$, the three series all converge.

Proof. Assume (3). Then, by the Borel-Cantelli lemma (Proposition 4.1.29 (1)), the convergence of (i) implies that a.s. $X_n(\omega) = Y_n(\omega)$ for $n \ge n(\omega)$. Therefore, the series $\sum_n X_n$ converges if $\sum_n Y_n$ converges. Since (ii) converges, $\sum_n Y_n$ converges if and only if the martingale given by the partial sums of $\sum_n (Y_n - E[Y_n])$ converges. By independence, if $S_n = \sum_{k=1}^n (Y_k - E[Y_k])$, then $\| S_n \|_2^2 = \sum_{k=1}^n \sigma^2(Y_k)$. Since (iii) converges, this martingale is bounded in $L^2$, and so by Proposition 5.5.1 it converges. Hence (3) implies (1).

Now assume (1) and let $c > 0$ be any truncation height. Since the series $\sum_n X_n$ converges, a.s. $X_n(\omega) \to 0$, and so a.s. $X_n(\omega) = Y_n(\omega)$ for $n \ge n(\omega)$. It follows from independence and the Borel-Cantelli lemma (Proposition 4.1.29 (2)) that the series (i) converges. As a result, the series $\sum_n Y_n$ converges a.s. From this, it follows that the second series (ii) converges if and only if the series $\sum_n (Y_n - E[Y_n])$ converges. By Proposition 5.5.1, this series will converge if the third series (iii) converges. To prove (2), it will therefore suffice to prove the following lemma.

Lemma. If $|Y_n| \le c$, $n \ge 1$, are independent and $\sum_n Y_n$ converges a.s., then $\sum_n \sigma^2(Y_n)$ converges.

This will be proved as an application of the Doob decomposition (Proposition 5.6.4) of a submartingale. Since (2) implies (3), this completes the proof. $\square$

B) The Doob decomposition of a submartingale. One way to produce a submartingale would be to start with a martingale $(M_n)_{n \ge 1}$ and add an increasing adapted process $(A_n)_{n \ge 1}$, where increasing means that $A_n \le A_{n+1}$ for all $n \ge 1$. Then

\[
E[M_n + A_n \mid \mathcal{F}_{n-1}] = M_{n-1} + E[A_n \mid \mathcal{F}_{n-1}] \ge M_{n-1} + E[A_{n-1} \mid \mathcal{F}_{n-1}] = M_{n-1} + A_{n-1}.
\]

The Doob decomposition shows that every submartingale has this form. It also states that the increasing process $(A_n)_{n \ge 1}$ is unique if it is predictable.

Proposition 5.6.4. (Doob decomposition) Let $X = (X_n)_{n \ge 0}$ be a submartingale with respect to a filtration $(\mathcal{F}_n)_{n \ge 0}$ on a probability space $(\Omega, \mathcal{F}, P)$. Then there is a unique decomposition of $X$ into the sum of the $\mathcal{F}_0$-measurable random variable $X_0$, a martingale $M$ with $M_0 = 0$, and a predictable increasing process $A$ with $A_0 = 0$:

\[
X_n = X_0 + M_n + A_n, \qquad A_n = \sum_{k=1}^n \left( E[X_k \mid \mathcal{F}_{k-1}] - X_{k-1} \right).
\]

Proof. Assume that such a decomposition exists. Then

\[
E[X_n \mid \mathcal{F}_{n-1}] = X_0 + M_{n-1} + E[A_n \mid \mathcal{F}_{n-1}] = X_0 + M_{n-1} + A_n \ge X_{n-1} = X_0 + M_{n-1} + A_{n-1}.
\]

Hence,

\[
(*) \qquad E[X_n \mid \mathcal{F}_{n-1}] - X_{n-1} = A_n - A_{n-1}.
\]

This implies $A_0 = 0$ and $A_n = \sum_{k=1}^n (E[X_k \mid \mathcal{F}_{k-1}] - X_{k-1})$. Conversely, define $A_0 = 0$ and $A_n = \sum_{k=1}^n (E[X_k \mid \mathcal{F}_{k-1}] - X_{k-1})$. Then $A$ is increasing and predictable and, as $(*)$ holds,

\[
E[X_n - A_n \mid \mathcal{F}_{n-1}] = E[X_n \mid \mathcal{F}_{n-1}] - A_n = X_{n-1} - A_{n-1}.
\]

Hence, if $M_n = X_n - X_0 - A_n$, then $(M_n)_{n \ge 0}$ is a martingale with $M_0 = 0$. $\square$

Exercise 5.6.5. Suppose that $X$ is the sum of a martingale $M'$ and an increasing adapted process $A'$ (not necessarily predictable). Determine the Doob decomposition of the submartingale $X$.
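For the submartingale $X_n = S_n^2$ built from i.i.d. $\pm 1$ coin flips, $E[X_k \mid \mathcal{F}_{k-1}] = S_{k-1}^2 + 1$, so the predictable increasing process is $A_n = n$ and $M_n = S_n^2 - n$. The following sketch (ad hoc parameters, not from the text) checks that the sample means of $M_n$ stay near $E[M_0] = 0$.

```python
# The Doob decomposition (Proposition 5.6.4) for X_n = S_n^2 with +/-1
# steps: A_n = n is predictable and M_n = S_n^2 - n is a martingale.
import random

N, trials = 30, 50_000
avg_M = [0.0] * (N + 1)
for _ in range(trials):
    s = 0
    for n in range(1, N + 1):
        s += random.choice((-1, 1))
        avg_M[n] += (s * s - n) / trials   # M_n = X_n - A_n with A_n = n
print("E[M_n] for n = 10, 20, 30:", [round(avg_M[n], 3) for n in (10, 20, 30)])
```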
Exercise 5.6.6. Let $(Y_n)_{n \ge 1}$ be a sequence of independent random variables $Y_n$ in $L^2$ with mean zero. Let $S_n = \sum_{k=1}^n Y_k$ and $\mathcal{F}_n = \sigma(\{Y_k \mid 1 \le k \le n\})$. Show that the increasing process of the Doob decomposition for the submartingale $(S_n^2)_{n \ge 1}$ is the sequence of partial sums of the series $\sum_n \sigma^2(Y_n)$. [Hint: recall that $E[(M_n - M_{n-1})^2 \mid \mathcal{F}_{n-1}] = E[M_n^2 - M_{n-1}^2 \mid \mathcal{F}_{n-1}]$ if $M$ is an $L^2$-martingale; see the proof of (2) in Proposition 5.5.1.]

The result of this exercise is a key to proving the following proposition, which is very similar to the lemma that is needed to complete the three-series theorem.

Proposition 5.6.7. Let $(X_n)_{n \ge 1}$ be a sequence of independent random variables with mean zero that are uniformly bounded (i.e., $|X_n| \le C$, $n \ge 1$). If $\sum_n X_n$ converges a.s., then $\sum_n \sigma^2(X_n)$ converges.

Proof. If $S_n = \sum_{k=1}^n X_k$, then $S$ is a martingale, and $S_n^2 - \sum_{k=1}^n \sigma^2(X_k)$ is a martingale by Exercise 5.6.6. Therefore, for any stopping time $T$,

\[
(\dagger) \qquad E[S_{n \wedge T}^2 - A_{n \wedge T}] = 0, \quad \text{where } A_n = \sum_{k=1}^n \sigma^2(X_k).
\]

Let $c > 0$ and let $T = \inf\{n \mid |S_n| > c\}$ with $\inf \emptyset = +\infty$. This is a stopping time by Example 5.4.11 (2). The key step is the following lemma.

Lemma. For some $c > 0$, $P[T = +\infty] = a > 0$.

Assume the lemma. Since each jump $X_n$ of the martingale $S$ is bounded by $C$, it follows that $S_{n \wedge T}^2 \le (C + c)^2$. Consequently, it follows from $(\dagger)$ that

\[
a \sum_{k=1}^n \sigma^2(X_k) = E[A_{n \wedge T} 1_{\{T = +\infty\}}] \le E[A_{n \wedge T}] = E[S_{n \wedge T}^2] \le (C + c)^2. \quad \square
\]

Proof of the lemma. Let $S = \sum_{n=1}^\infty X_n$ and $\Lambda_{c_0} = \{|S| \le c_0\}$. Then, for some $c_0$, one has $P(\Lambda_{c_0}) > \frac{3}{4}$ (say). If $\omega \in \Lambda_{c_0}$, then for some $N$, $\omega \in \Lambda_{c_0, N} = \cap_{n \ge N} \{|S_n| \le c_0 + 1\}$. As a result, there is an $N$ with $P(\Lambda_{c_0, N}) > \frac{1}{2}$ (say). If $c = c_0 + 1 + NC$ and $\omega \in \Lambda_{c_0, N}$, then $|S_n(\omega)| \le c$ for all $n \ge 1$. $\square$

While this last result is not enough by itself to prove the missing lemma, by a clever trick involving independence (see Lamperti [L1], p. 36, and Williams [W2]), one may use Proposition 5.6.7 to prove this lemma. The original sequence of independent bounded random variables $(Y_n)_{n \ge 1}$ is defined on a probability space $(\Omega, \mathcal{F}, P)$. Take a copy $(\Omega', \mathcal{F}', P')$, $(Y'_n)$, of $(\Omega, \mathcal{F}, P)$, $(Y_n)$, and form the product $(\Omega \times \Omega', \mathcal{F} \times \mathcal{F}', P \times P')$ of the probability spaces. Define two processes on $\Omega \times \Omega'$: the first one, $Y$, has $Y_n(\omega, \omega') = Y_n(\omega)$, and the second one, $Y'$, has $Y'_n(\omega, \omega') = Y'_n(\omega')$. These
two processes on the product space are independent, i.e., $\sigma(\{Y_n \mid n \ge 1\})$ and $\sigma(\{Y'_n \mid n \ge 1\})$ are independent $\sigma$-fields, and they have the same finite-dimensional joint distributions. As a result, the sequence of random variables $Z_n = Y_n - Y'_n$ is independent (see Exercise 5.6.9), and each $Z_n$ has mean zero, since the distribution of $Y'_n$ equals the distribution of $Y_n$. Also, $|Z_n| \le 2c$. Provided the series $\sum_n Z_n$ converges a.s., Proposition 5.6.7 shows that $\sum_n \sigma^2(Z_n) = 2 \sum_n \sigma^2(Y_n)$ converges, which proves the lemma that completes the proof of the three-series theorem.

This question of convergence amounts to the a.s. convergence of the series $\sum_n Y'_n$. The distribution $Q$ of the processes $Y$ and $Y'$ coincides: it is the infinite product $\times_{n=1}^\infty Q_n$ on the product $\times_{n=1}^\infty \mathbb{R}$ of a countable number of copies of $\mathbb{R}$, where $Q_n$ is the distribution of $Y_n$. Now the original series $\sum_n Y_n$ on $(\Omega, \mathcal{F}, P)$ converges a.s. if and only if $Q(A) = 1$, where $A = \{(a_n)_{n \ge 1},\ a_n \in \mathbb{R} \mid \sum_n a_n \text{ converges}\}$ (which begs a measurability question: see the following exercise). Given this, it follows that the series $\sum_n Y_n$ converges a.s. if and only if the series $\sum_n Y'_n$ converges a.s.; since a.s. both series converge if this is true of one of them, the proof of the three-series theorem is complete.

Exercise 5.6.8. Show that $A \in \times_{n=1}^\infty \mathfrak{B}(\mathbb{R})$.

Exercise 5.6.9. Let $(\Omega, \mathcal{F}, P)$ be a probability space, and let $\mathcal{F}_1, \mathcal{F}_2, \mathfrak{G}_1, \mathfrak{G}_2$ be four $\sigma$-algebras contained in $\mathcal{F}$. Assume that (1) $\mathcal{F}_1$ and $\mathcal{F}_2$ are independent, (2) $\mathfrak{G}_1$ and $\mathfrak{G}_2$ are independent, and (3) $\sigma(\mathcal{F}_1 \cup \mathcal{F}_2)$ and $\sigma(\mathfrak{G}_1 \cup \mathfrak{G}_2)$ are independent. Show that $\sigma(\mathcal{F}_1 \cup \mathfrak{G}_1)$ and $\sigma(\mathcal{F}_2 \cup \mathfrak{G}_2)$ are independent. [Hint: if $A_i \in \mathcal{F}_i$ and $B_i \in \mathfrak{G}_i$, then $P(A_1 \cap A_2 \cap B_1 \cap B_2) = P(A_1) P(A_2) P(B_1) P(B_2)$; make use of Proposition 3.2.6 as in the proof of Proposition 3.2.8.] Conclude that if $(Y_n)_{n \ge 1}$ is an independent real-valued process $Y$ and $(Y'_n)_{n \ge 1}$ is an independent real-valued process $Y'$, then the process $(Z_n)_{n \ge 1}$, $Z_n = Y_n - Y'_n$, is an independent real-valued process, provided that $Y$ and $Y'$ are independent, i.e., provided the $\sigma$-algebras $\sigma(\{Y_n \mid n \ge 1\})$ and $\sigma(\{Y'_n \mid n \ge 1\})$ are independent.
7. The martingale convergence theorem

A sequence $(a_n)_{n \ge 1}$ of real numbers fails to converge if and only if $\liminf_n a_n < \limsup_n a_n$ (see Exercise 2.1.10). When this happens, there is a pair of rational numbers $p < q$ such that the sequence is infinitely often $< p$ and infinitely often $> q$. In other words, the sequence crosses the interval $[p, q]$ from $p$ to $q$ an infinite number of times, where exactly one such "upcrossing" takes place during the time interval $[k, \ell]$ if $a_k < p$, $a_\ell > q$, and in between times the value never gets to be $> q$, i.e., $\ell$ is the first time $n$ after $k$ that $a_n > q$. Consequently, to study the convergence of a process, one examines the random variable that describes the upcrossings of an interval $[a, b]$ by the paths of the process.

Proposition 5.7.1. (See Ikeda and Watanabe [I], p. 29, Theorem 6.3.) Let $X = (X_n)_{0 \le n \le N}$ be a submartingale. If $a < b$, let $U([a, b]) = U_X([a, b])$ denote the number of upcrossings of $[a, b]$ by $X$ during $\{0, 1, 2, \dots, N\}$. Then,

\[
E[U([a, b])] \le \frac{E[(X_N - a)^+] - E[(X_0 - a)^+]}{b - a}.
\]

Proof. Let $T_0 = 0$ and let $T_1$ be the first time that $X_n \le a$, with default value $N$ if this never occurs. Let $T_2$ be the first time after $T_1$ that $X_n \ge b$, with $N$ as default value. By induction, define $T_{2k+1}$ to be the first time after $T_{2k}$ that $X_n \le a$, with default value $N$. Similarly, let $T_{2(k+1)}$ denote the first time after $T_{2k+1}$ that $X_n \ge b$, again with default $N$. This is an increasing sequence of stopping times (see Exercise 5.7.2), all of which equal the default value $N$ if $2k > N$. Further, $U([a, b]) = \sum_k 1_{\Lambda_k}$, where $\Lambda_k = \{\omega \mid X_{T_{2k-1}}(\omega) \le a < b \le X_{T_{2k}}(\omega)\}$, and hence it is a random variable.

Replace the submartingale $X$ by the process $Y$, where $Y_n = (X_n - a)^+$, which is a submartingale by Proposition 5.4.9S (2) or (3). Note that $U_X([a, b]) = U_Y([0, b - a])$ and that, by optional stopping (Theorem 5.4.18), $E[Y_{T_{j+1}} - Y_{T_j}] \ge 0$ for all $j$. Hence, if $2k \ge N$,

\[
E[Y_N - Y_0] = \sum_{j=0}^{2k-1} E[Y_{T_{j+1}} - Y_{T_j}] \ge \sum_{i=1}^k E[Y_{T_{2i}} - Y_{T_{2i-1}}] = E\left[\sum_{i=1}^k (Y_{T_{2i}} - Y_{T_{2i-1}})\right] \ge (b - a) E[U_Y([0, b - a])],
\]

since $\sum_{i=1}^k (Y_{T_{2i}} - Y_{T_{2i-1}}) \ge (b - a) U_Y([0, b - a])$. $\square$

Exercise 5.7.2. Show by induction that the times $T_j$ above are stopping times. [Hint: if $m < N$, then $\{T_{2k} = m\} = \cup_{i < m} [\{T_{2k-1} = i\} \cap \{X_{i+1} < b, X_{i+2} < b, \dots, X_{m-1} < b, X_m \ge b\}] \in \mathcal{F}_m$.]
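The upcrossing inequality can be explored numerically. The sketch below (illustrative parameters, not from the text) counts upcrossings of $[a, b] = [-1, 1]$ by the random-walk martingale and compares the sample mean with the bound of Proposition 5.7.1; here $(X_0 - a)^+ = 1$.

```python
# Counting upcrossings of [a, b] (Proposition 5.7.1) for a +/-1 random walk
# and comparing E[U([a,b])] with (E[(X_N - a)^+] - E[(X_0 - a)^+]) / (b - a).
import random

N, a, b, trials = 200, -1.0, 1.0, 20_000
ups = bound_num = 0.0
for _ in range(trials):
    s, state, u = 0, "seek_low", 0
    for _ in range(N):
        s += random.choice((-1, 1))
        if state == "seek_low" and s <= a:
            state = "seek_high"           # reached the level <= a
        elif state == "seek_high" and s >= b:
            state, u = "seek_low", u + 1  # completed an upcrossing of [a, b]
    ups += u
    bound_num += max(s - a, 0.0)          # (X_N - a)^+
print("E[U([a,b])] ~", ups / trials)
print("bound       ~", (bound_num / trials - max(0 - a, 0.0)) / (b - a))
```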
Corollary 5.7.3. (Martingale convergence theorem: countable case) Let $(X_n)_{n \ge 0}$ be a submartingale with $\sup_n E[X_n^+] < +\infty$ or, equivalently, bounded in $L^1$. Then it converges a.s. to a limit in $L^1$. In particular, if $(X_n)_{n \ge 0}$ is a martingale that is bounded in $L^1$, it converges a.s. to a limit in $L^1$.

Proof. Since $|X_n| = 2X_n^+ - X_n$ and $E[|X_n|] = 2E[X_n^+] - E[X_n] \le 2E[X_n^+] - E[X_0]$, the process $X$ is bounded in $L^1$ (i.e., $\sup_n E[|X_n|] < +\infty$) if and only if $\sup_n E[X_n^+] < \infty$. If a sequence of random variables $X_n$ is bounded in $L^1$ and converges a.s. to $X$, then $X \in L^1$, since by Fatou's lemma (Proposition 2.1.25), $E[|X|] \le \liminf_n E[|X_n|] \le \sup_n E[|X_n|] < \infty$.

In view of the remarks preceding Proposition 5.7.1, the submartingale converges a.s. if, for each pair $a < b$, the number of upcrossings $U([a, b])$ is a.s. finite; this holds whenever the mean number of upcrossings is finite. Let $U_N([a, b])$ be the number of upcrossings of $[a, b]$ by the submartingale during $\{0, 1, 2, \dots, N\}$. Then the number $U([a, b])$ of upcrossings of $[a, b]$ by the submartingale during $\{0\} \cup \mathbb{N}$ is the limit of the increasing sequence $U_N([a, b])$. Since $(x - a)^+ \le x^+ + |a|$, it follows from Proposition 5.7.1 that

\[
E[U([a, b])] \le \frac{1}{b - a} \left( \sup_n E[X_n^+] + |a| \right) < +\infty. \quad \square
\]

Corollary 5.7.4. Let $1 < p < \infty$ and let $(X_n)_{n \ge 1}$ be a martingale that is bounded in $L^p$, i.e., for some constant $M$, $E[|X_n|^p] \le M$ for all $n$. Then there is a random variable $X \in L^p$ such that $X_n \to X$ a.s. In addition, $X_n \to X$ in $L^p$.

Proof. Since $E[X_n^+] \le \| X_n \|_1 \le \| X_n \|_p \le M^{1/p}$, the martingale is bounded in $L^1$ and hence, by Corollary 5.7.3, it converges a.s. Since by Exercise 5.5.3 its maximal function is in $L^p$, it follows from the theorem of dominated convergence (Theorem 2.1.38) that it converges in $L^p$. $\square$

Remarks. (1) The case of $p = \infty$ is even simpler, as then the maximal function is bounded.

(2) Note that there is nothing in the proof of the martingale convergence theorem that lets one conclude the $L^p$-convergence, except that in the special case of $p = 2$ a different proof of convergence also gives the $L^2$-convergence (see Proposition 5.5.1). For this reason, it is important to know that the maximal function is in $L^p$. For $p = 1$, uniform integrability of the martingale is necessary and sufficient for convergence in $L^1$ (see Proposition 4.5.7, Corollary 4.5.9, and Exercise 4.7.13). When $1 < p < \infty$, uniform integrability in $L^p$ for a martingale (i.e., the sequence of $p$th powers is uniformly integrable) is equivalent to its $L^p$-boundedness, since the maximal function is in $L^p$ (see Exercise 4.5.11 (1)).

Applying this convergence theorem to the martingale $X = (X_n)_{n \ge 0}$ of Example 5.4.5 gives a proof of the Radon-Nikodym theorem for a positive bounded measure $\mu$ on, say, $[0, 1]$ (see Doob [D1], p. 343). Since $\mu([0,1]) < +\infty$, the martingale is bounded in $L^1$. Therefore, it converges a.s. to a limit function $X \in L^1$. To prove the Radon-Nikodym theorem, it is necessary and sufficient to show that this martingale is uniformly integrable (see Exercise 5.7.5). When this is the case, it also converges in $L^1$ by Corollary 4.5.9. The significance of this is that, for all $\Lambda \in \mathfrak{A} = \cup_{n=1}^\infty \mathcal{F}_n$, one has $\int_\Lambda X_n \, dP \to \int_\Lambda X \, dP$, as $|\int_\Lambda X_n \, dP - \int_\Lambda X \, dP| \le \int |1_\Lambda (X_n - X)| \, dP \le \| X_n - X \|_1$. Now if $\Lambda \in \mathcal{F}_k$, then $\int_\Lambda X_n \, dP = \mu(\Lambda)$ for all $n \ge k$. Consequently, $\mu(\Lambda) = \int_\Lambda X \, dP$ for all $\Lambda \in \mathfrak{A}$.
Since $\mathfrak{A}$ generates $\mathfrak{B}([0,1])$ by Exercise 5.4.6, it follows from a monotone class argument (Exercise 1.4.16) or Proposition 3.2.6 that

\[
\mu(\Lambda) = \int_\Lambda X \, dP, \quad \text{for all } \Lambda \in \mathfrak{B}([0,1]).
\]

In other words, the limit of the martingale is the Radon-Nikodym derivative of $\mu$ with respect to $P$.

Exercise 5.7.5. (This exercise explains why the martingale in Example 5.4.5 is uniformly integrable.) Show that

(1) $P[X_n > c] \le \frac{\mu([0,1])}{c}$. [Hint: $\int_C X_n \, dP = \mu(C)$ for all $C \in \mathcal{F}_n$.]

The next step is to recall that the measure $\mu$ is "continuous" relative to $P$ by Lemma 4.5.5*: given $\varepsilon > 0$, there exists $\delta > 0$ such that $\mu(\Lambda) < \varepsilon$ if $P(\Lambda) < \delta$. Conclude from this that

(2) $\int_{\{X_n > c\}} X_n \, dP$ is uniformly small if $c$ is sufficiently large, and

(3) the martingale is uniformly integrable.

This martingale method can be used to prove the Radon-Nikodym theorem on $\sigma$-fields that are countably generated: the generators can be used to get an increasing sequence of finite $\sigma$-algebras whose union generates $\mathcal{F}$.

Exercise 5.7.6. Extend the results of Example 5.4.5 and Exercise 5.7.5 to prove the following version of the Radon-Nikodym theorem (Theorem 2.7.19). Let $(\Omega, \mathcal{F})$ be a measure space, and assume that there is a sequence $(A_n)_{n \ge 1} \subset \mathcal{F}$ such that $\mathcal{F} = \sigma(\{A_n \mid n \ge 1\})$ (i.e., the $\sigma$-algebra $\mathcal{F}$ is countably generated). Let $\mu$ and $\nu$ be two (non-negative, finite) measures on $\mathcal{F}$ such that $\mu(E) = 0$ implies $\nu(E) = 0$ (i.e., $\nu$ is absolutely continuous with respect to $\mu$). Then there is a (unique) function $f \in L^1(\Omega, \mathcal{F}, \mu)$ such that $\nu(E) = \int_E f \, d\mu$ for all $E \in \mathcal{F}$.

Backward martingales. Consider a filtration given by $\mathcal{F}_n$, where $n \in -\mathbb{N}$, i.e., for each $n \in -\mathbb{N}$ one has $\mathcal{F}_{n-1} \subset \mathcal{F}_n$. The filtration "provides" less and less "information" as the time $n$ decreases. Such filtrations arise naturally from sequences $(X_k)_{k \ge 1}$ of random variables: set $\mathcal{F}_n = \sigma(\{X_k \mid k \ge -n\})$. This type of filtration occurred in the proof of Kolmogorov's 0-1 law (Proposition 3.4.4). A martingale relative to this type of filtration is called a backward martingale.

Proposition 5.7.7. Let $(X_n)_{n \le -1}$ be a backward martingale with respect to the filtration $(\mathcal{F}_n)_{n \in -\mathbb{N}}$. Then $(X_n)_{n \le -1}$ is uniformly integrable and converges a.s. and in $L^1$ as $n \to -\infty$.

Proof. Since $X_n = E[X_{-1} \mid \mathcal{F}_n]$ for all $n \le -1$, it follows from Exercise 5.4.3 (2) that $(X_n)_{n \le -1}$ is uniformly integrable. Corollary 4.5.9 implies that it converges in $L^1$ if it converges a.s.
The mean upcrossing inequality in Proposition 5.7.1 applies to this martingale on the time set $\{-N, -N+1, \dots, -2, -1\}$, and so

\[
E[U([a, b])] \le \frac{1}{b - a} E[(X_{-1} - a)^+].
\]

Hence, a.s. the number $U([a, b])$ of upcrossings of $[a, b]$ by the martingale during $-\mathbb{N}$ is finite, and so it converges a.s. $\square$

Remark 5.7.8. In fact, the argument proves that a backward submartingale converges a.s.

Kolmogorov's strong law of large numbers (Theorem 4.4.2) follows from the convergence of backward martingales. Here is the restatement of the theorem in this context.

Theorem 5.7.9. (Kolmogorov's strong law) Let $(X_n)_{n \ge 1}$ be a sequence of i.i.d. integrable random variables with mean zero, and set $S_n = \sum_{k=1}^n X_k$. Let $\mathcal{F}_{-n} = \sigma(\{S_n\} \cup \{X_k \mid k \ge n+1\})$. Then $(\frac{1}{n} S_n)_{n \ge 1}$ is a backward martingale relative to the filtration $(\mathcal{F}_m)_{m \in -\mathbb{N}}$, and its a.s. limit is $0$.

Proof. Since $\mathcal{T}_{n+1} = \sigma(\{X_k \mid k \ge n+1\})$ and $\sigma(\{X_1, S_n\})$ are independent, it follows from Proposition 5.2.4 that $E[X_1 \mid \mathcal{F}_{-n}] = E[X_1 \mid S_n]$. If $\Lambda \in \sigma(S_n)$, by Exercise 2.9.10 there is a Borel set $B \subset \mathbb{R}$ with $1_\Lambda = 1_B \circ S_n$. Since $X_1$ and $S_n$ are $\sigma(\{X_j \mid 1 \le j \le n\})$-measurable, and the distribution of $(X_1, X_2, \dots, X_n)$ is $Q \times Q \times \cdots \times Q$ ($n$ copies), where $Q$ is the common distribution of the $X_k$, it follows from Lemma 4.1.3 that

\[
\int_\Lambda X_1 \, dP = \int x_1 1_B(x_1 + x_2 + \cdots + x_n) \, Q(dx_1) Q(dx_2) \cdots Q(dx_n).
\]

It follows by symmetry that, for all $j$, $1 \le j \le n$, $\int_\Lambda X_1 \, dP = \int_\Lambda X_j \, dP$, and so $\int_\Lambda X_1 \, dP = \frac{1}{n} \int_\Lambda S_n \, dP$ (i.e., $E[X_1 \mid \mathcal{F}_{-n}] = \frac{1}{n} S_n$). This proves that $(\frac{1}{n} S_n)_{n \ge 1}$ is a backward martingale.

Proposition 5.7.7 implies that the limit $Z$ of $\frac{1}{n} S_n$ exists and also that $E[Z] = 0$, since the convergence is also in $L^1$. The limit function $Z$ belongs to the tail field $\mathcal{T} = \cap_{m=1}^\infty \mathcal{T}_m$, since, for any $m \ge 1$, $Z = \lim_n \frac{1}{n} \sum_{k=m}^n X_k \in \mathcal{T}_m$. Kolmogorov's 0-1 law (Proposition 3.4.4) implies that $Z$ is constant a.s., and so $Z = 0$ a.s. $\square$

Remark. This proof is due to Doob and may be found in his book [D1], pp. 341-342. Many other things having to do with martingales and convergence of sequences and series of random variables are to be found in Doob's book. In addition, it is also worth consulting for the historical notes and references to the literature on these topics.
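The convergence asserted by Theorem 5.7.9 can be watched along a single trajectory. The following sketch (the uniform step distribution is an arbitrary illustrative choice, not from the text) tracks the backward martingale $\frac{1}{n} S_n$, which should tend to $0$.

```python
# The strong law via backward martingales (Theorem 5.7.9): S_n / n along
# one sequence of i.i.d. mean-zero steps tends to 0 as n grows.
import random

s = 0.0
for n in range(1, 1_000_001):
    s += random.uniform(-1.0, 1.0)        # i.i.d., integrable, mean zero
    if n in (10**2, 10**3, 10**4, 10**5, 10**6):
        print(f"n = 10^{len(str(n)) - 1}:  S_n/n = {s / n:+.5f}")
```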
CHAPTER VI

AN INTRODUCTION TO WEAK CONVERGENCE

1. Motivation: empirical distributions

Let $(X_n)_{n \ge 1}$ be a sequence of i.i.d. random variables on a probability space $(\Omega, \mathcal{F}, P)$. Let $F$ be the distribution function of the common law $Q$ of the random variables. Pick a real number $x$, and let $Y_n = 1_{(-\infty, x]} \circ X_n$. Then the $Y_n$ are i.i.d. random variables (see Exercise 3.4.5). They are Bernoulli with $P[Y_n = 1] = F(x)$ and $P[Y_n = 0] = 1 - F(x)$. It follows from the strong law of large numbers, either Kolmogorov's strong law (Theorem 4.4.2) or, more simply, Borel's strong law (see Exercise 4.7.19), that $\frac{1}{n} \sum_{k=1}^n Y_k \to F(x)$ a.s.

If the $X_k(\omega)$, $1 \le k \le n$, are viewed as $n$ independent samples from a population with underlying probability distribution $Q$, then one defines the empirical distribution function $F_n(x, \omega)$ to be the random variable $\frac{1}{n} \sum_{k=1}^n Y_k(\omega)$, which equals $\frac{1}{n}$ times the number of $k$, $1 \le k \le n$, for which $X_k(\omega) \le x$. For each $\omega \in \Omega$, this is a distribution function that takes the values $0, \frac{1}{n}, \frac{2}{n}, \dots, \frac{n-1}{n}, 1$. Hence, $F_n(x, \omega)$ is a random distribution function. By the strong law, there is a set $\Lambda \in \mathcal{F}$ with $P(\Lambda) = 1$ such that, for all $q \in \mathbb{Q}$, one has $\lim_n F_n(q, \omega) = F(q)$ if $\omega \in \Lambda$.

Lemma 6.1.1. Let $F$ and $F_n$, $n \ge 1$, be non-decreasing real-valued functions on $\mathbb{R}$. If $\lim_n F_n(q) = F(q)$ for all $q \in \mathbb{Q}$, and $F$ is continuous at $x_0$, then $\lim_n F_n(x_0) = F(x_0)$.

Proof. Let $\varepsilon > 0$, and assume $\delta > 0$ is such that $|F(x_0) - F(x)| < \varepsilon$ if $|x - x_0| < \delta$ (Definition 2.1.13). Since the rationals are dense in $\mathbb{R}$ (Exercise 2.1.12), there exist rational numbers $q_1, q_2$ with $|q_i - x_0| < \delta$ and, say, $q_1 < x_0 < q_2$. Then $F(q_1) \le F(x_0) \le F(q_2)$ and $F_n(q_1) \le F_n(x_0) \le F_n(q_2)$. By assumption, if $n$ is large enough, one has $|F_n(q_1) - F(q_1)| < \varepsilon$ and $|F_n(q_2) - F(q_2)| < \varepsilon$. Hence, $F_n(q_1)$ and $F_n(q_2)$ are both in $[F(q_1) - \varepsilon, F(q_2) + \varepsilon]$ for large $n$. As a result, $F_n(x_0)$ is also in this interval for large $n$, and so $F_n(x_0) \in [F(x_0) - 2\varepsilon, F(x_0) + 2\varepsilon]$ (i.e., $|F_n(x_0) - F(x_0)| \le 2\varepsilon$ for $n$ large enough). $\square$

Remark. While this proof makes no use of the right continuity of a distribution function, essential use is made of the fact that a distribution function is non-decreasing.
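The pointwise convergence $F_n(x, \cdot) \to F(x)$ described above is easily simulated. The sketch below (illustrative only; the exponential law is an arbitrary choice) evaluates the empirical distribution function of a large sample at a few points.

```python
# Empirical distribution function F_n(x, .) for an i.i.d. sample from the
# exponential law F(x) = 1 - exp(-x); by the strong law F_n(x) -> F(x).
import math, random

sample = [random.expovariate(1.0) for _ in range(100_000)]
for x in (0.5, 1.0, 2.0):
    F_n = sum(1 for s in sample if s <= x) / len(sample)
    print(f"x={x}:  F_n(x) = {F_n:.4f}   F(x) = {1 - math.exp(-x):.4f}")
```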
Corollary 6.1.2. Let $\Lambda$ be the set of probability 1 such that if $q \in \mathbb{Q}$, then $F_n(q, \omega) \to F(q)$ for all $\omega \in \Lambda$. If $F$ is continuous at $x_0$, then $F_n(x_0, \omega) \to F(x_0)$ for all $\omega \in \Lambda$.

This motivates the following definition.

Definition 6.1.3. A sequence of distribution functions $F_n$ on $\mathbb{R}$ converges weakly to a distribution function $F$ if $F_n(x) \to F(x)$ whenever $F$ is continuous at $x$. This will be denoted by writing $F_n \Rightarrow F$ or $F_n \xrightarrow{w} F$.

In other words, it follows that a.s. the empirical distribution function of a random sample of a population converges weakly to the underlying distribution function of the population.

2. Weak convergence of probabilities: equivalent formulations

Definition 6.2.1. A sequence $(\mu_n)_{n \ge 1}$ of probabilities on $(\mathbb{R}, \mathfrak{B}(\mathbb{R}))$ is said to converge weakly to a probability $\mu$ if $\int f \, d\mu_n \to \int f \, d\mu$ for all bounded continuous functions $f$ on $\mathbb{R}$. This will be denoted by writing $\mu_n \Rightarrow \mu$ or $\mu_n \xrightarrow{w} \mu$.

Exercise 6.2.2. Let $\mu$ and $\nu$ be two finite measures on $(\mathbb{R}, \mathfrak{B}(\mathbb{R}))$ (recall that they are non-negative). Assume that $\int \varphi \, d\mu = \int \varphi \, d\nu$ for all continuous functions $\varphi$ on $\mathbb{R}$ with compact support (see Definition 4.2.4). Show that

(1) $\mu([a, b]) = \nu([a, b])$ for any closed bounded interval $[a, b]$. [Hint: approximate $1_{[a,b]}$ by a larger continuous function whose support lies in $[a - \varepsilon, b + \varepsilon]$.]

Conclude that

(2) $\mu(\mathbb{R}) = \nu(\mathbb{R})$, and

(3) $\mu = \nu$. [Hint: normalize to get probabilities and show that they have the same distribution function, or use Proposition 3.2.6.]

Remark. Definition 6.2.1 makes sense on a metric space. It is in this context that weak convergence is usually discussed (see Parthasarathy [P1] and Billingsley [B1]).

It will be shown that this concept of weak convergence for probabilities is the same as the one defined earlier (Definition 6.1.3) using distribution functions. The first step is to show that weak convergence of probabilities implies weak convergence of their distribution functions.

Proposition 6.2.3. Let $(\mu_n)_{n \ge 1}$ converge weakly to $\mu$, and let $F_n$ be the distribution function associated with $\mu_n$. Then $F_n \xrightarrow{w} F$, where $F$ is the distribution function for $\mu$.
Proof. Let $x$ be a point of continuity of $F$. Define

\[
\psi_k(t) = \begin{cases} 1 & \text{if } t \le x - \frac{1}{k}, \\ -k(t - x) & \text{if } x - \frac{1}{k} < t \le x, \\ 0 & \text{if } t > x, \end{cases}
\]

and let $\varphi_k(t) = \psi_k(t - \frac{1}{k})$. Then

\[
1_{(-\infty, x - \frac{1}{k}]} \le \psi_k \le 1_{(-\infty, x]} \le \varphi_k \le 1_{(-\infty, x + \frac{1}{k}]}.
\]

Since $\psi_k$ and $\varphi_k$ are bounded continuous functions, one has

\[
F\left(x - \frac{1}{k}\right) \le \int \psi_k \, d\mu = \lim_n \int \psi_k \, d\mu_n \le \liminf_n F_n(x),
\]

and

\[
\limsup_n F_n(x) \le \lim_n \int \varphi_k \, d\mu_n = \int \varphi_k \, d\mu \le F\left(x + \frac{1}{k}\right).
\]

The continuity of $F$ at $x$ and these estimates imply that $\liminf_n F_n(x) = \limsup_n F_n(x) = F(x)$. It follows from Exercise 2.1.10 that $\lim_n F_n(x) = F(x)$. $\square$

Remark. The function $\psi_k$ is bounded and uniformly continuous on all of $\mathbb{R}$, i.e., given $\varepsilon > 0$, there is a $\delta > 0$ such that $|x - y| < \delta$ implies $|\psi_k(x) - \psi_k(y)| < \varepsilon$. This follows from Property 2.3.8 (3) applied to $\psi_k$ on the interval $[x - \frac{1}{k} - 1, x + 1]$ (say), since outside that interval the problem of finding $\delta$ is trivial. As a result, $F_n \xrightarrow{w} F$ as long as $\int f \, d\mu_n \to \int f \, d\mu$ merely for all bounded uniformly continuous functions $f$ on $\mathbb{R}$.

By using integration by parts, it is easy to see that weak convergence of distribution functions implies something very similar to weak convergence of the probabilities.

Proposition 6.2.4. Assume that $F_n \xrightarrow{w} F$ and that $\varphi \in C_c(\mathbb{R})$, i.e., that $\varphi$ is a continuous function with compact support (Definition 4.2.4). Then $\int \varphi \, d\mu_n \to \int \varphi \, d\mu$.

Proof. First, assume that $\varphi \in C_c^1(\mathbb{R})$, i.e., that $\varphi \in C_c(\mathbb{R})$ is continuously differentiable. The support of $\varphi$ lies in some interval $(-m, m]$ for sufficiently large $m$ (i.e., $(-m, m] \supset \{x \mid \varphi(x) \ne 0\}$).

Lemma 6.2.5. Let $Q$ be a probability on $\mathfrak{B}(\mathbb{R})$, and let $F$ denote its distribution function. If $\varphi \in C_c^1(\mathbb{R})$, then

\[
\int \varphi \, dQ = \int \varphi(x) \, dF(x) = -\int \varphi'(x) F(x) \, dx.
\]

(An outline of the proof of this integration by parts formula is given in Exercise 6.8.1.)

Assume Lemma 6.2.5. Now $F_n \to F$ a.e., as $F$ is continuous except at most at a countable number of points (Exercise 2.9.14). Thus, by
dominated convergence (Theorem 2.1.38), it follows that $\int F_n(x) \varphi'(x) \, dx \to \int F(x) \varphi'(x) \, dx$, and hence $\int \varphi \, d\mu_n \to \int \varphi \, d\mu$.

It is shown in the appendix to this chapter that if $\varphi \in C_c(\mathbb{R})$ and $\varepsilon > 0$, there exists $\psi \in C_c^1(\mathbb{R})$ with $\| \varphi - \psi \|_\infty < \varepsilon$ (where $\| \varphi - \psi \|_\infty = \sup_{x \in \mathbb{R}} |\varphi(x) - \psi(x)|$; see Exercise 4.2.11). This implies that $|\int \varphi \, dQ - \int \psi \, dQ| < \varepsilon$ for any probability $Q$. Hence, if $\int \psi \, d\mu - \varepsilon \le \int \psi \, d\mu_n \le \int \psi \, d\mu + \varepsilon$ for large $n$, it follows that $\int \varphi \, d\mu - 3\varepsilon \le \int \varphi \, d\mu_n \le \int \varphi \, d\mu + 3\varepsilon$ for large $n$. $\square$

To prove that $\mu_n \xrightarrow{w} \mu$, it is not enough to know that $\int \varphi \, d\mu_n \to \int \varphi \, d\mu$ for arbitrary functions $\varphi \in C_c(\mathbb{R})$. These are not generic bounded continuous functions. For example, the constant function 1 is not of this type. The next exercise indicates what can happen.

Exercise 6.2.6. Let $\epsilon_a$ denote the point mass at $a$ (Exercise 1.5.2). Show that

(1) if $x_n \to x$, then $\epsilon_{x_n} \xrightarrow{w} \epsilon_x$ and, on the other hand,

(2) there is no probability $\mu$ on $\mathfrak{B}(\mathbb{R})$ such that $\epsilon_n \xrightarrow{w} \mu$.

Let $F_0(x) = e^x$ for $x < 0$ and $F_0(x) = 1$ for $x \ge 0$, and let $F_n$ denote the distribution function $F_n(x) = F_0(x + n)$. Show that

(3) $F_n(x) \to 1$ for all $x$ as $n$ tends to infinity. Conclude that there is no distribution function $F$ such that $F_n \xrightarrow{w} F$.

Let $\mu_n$ be the probability defined by $F_n$. Show that

(4) for all $\varphi \in C_c(\mathbb{R})$, the sequence $\int \varphi \, d\mu_n$ converges to zero.

This exercise indicates that something needs to be added to the type of convergence of integrals given in Proposition 6.2.4 if one wants to conclude that the distribution functions converge weakly. In these examples, the probabilities involved all end up giving probability zero to any bounded set. In this sense, they "escape" to infinity. The solution is to "trap them" (or at least most of the mass) on a large bounded set.

Definition 6.2.7. A sequence $(\mu_n)_{n \ge 1}$ of probabilities on $(\mathbb{R}, \mathfrak{B}(\mathbb{R}))$ is said to be tight if for any $\varepsilon > 0$ there is a compact set $K$ such that $\mu_n(\mathbb{R} \setminus K) < \varepsilon$ for all $n \ge 1$.

Remarks 6.2.8. (1) Every compact set $K$ is a subset of some closed and bounded interval, and so the compact set $K$ may as well be taken to be $[-M, M]$ for some $M > 0$.

(2) A sequence of point masses $\epsilon_{x_n}$ is tight if and only if the sequence $(x_n)_{n \ge 1}$ is bounded.

(3) Any finite collection of probabilities is tight, since if $\nu$ is a finite measure, $\nu(\mathbb{R} \setminus [-n, n]) = \nu(\mathbb{R}) - \nu([-n, n]) \to 0$ as $n \to \infty$. As a result, a sequence $(\mu_n)_{n \ge 1}$ of probabilities is tight if for any $\varepsilon > 0$ there is a compact set $K$ such that $\mu_n(\mathbb{R} \setminus K) < \varepsilon$ for $n \ge N(\varepsilon, K)$.
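The "escape to infinity" of Exercise 6.2.6 can be made concrete. In the following sketch (illustrative only, not from the text), $\mu_n = \epsilon_n$ is the point mass at $n$, a non-tight sequence, and $\varphi$ is a tent function in $C_c(\mathbb{R})$.

```python
# Non-tight sequence of point masses eps_n (Exercise 6.2.6): for phi with
# compact support, integral phi d(eps_n) = phi(n) -> 0, yet (eps_n) has no
# weak limit probability, since all mass leaves every compact set.
def phi(x):                  # a tent function supported on [-1, 1]
    return max(0.0, 1.0 - abs(x))

for n in (0, 1, 2, 5, 10):
    print(f"n={n:2d}:  integral of phi d(eps_n) = phi(n) = {phi(n)}")
print("mass in [-M, M] is eventually 0 for every fixed M: not tight")
```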
Proposition 6.2.9. Let $(\mu_n)_{n \ge 1}$ be a tight sequence of probabilities on $(\mathbb{R}, \mathfrak{B}(\mathbb{R}))$, and assume that for some probability $\mu$ one has $\int \varphi \, d\mu_n \to \int \varphi \, d\mu$ for all $\varphi \in C_c(\mathbb{R})$. Then $\mu_n \xrightarrow{w} \mu$.

Proof. Let $\varepsilon > 0$, and let $K = [-M, M]$ be a compact set such that (i) $\mu_n(\mathbb{R} \setminus K) < \varepsilon$ for all $n \ge 1$ and (ii) $\mu(\mathbb{R} \setminus K) < \varepsilon$. The argument makes use of a "cut-off" function $\theta \in C_c(\mathbb{R})$, where

\[
\theta(x) = \begin{cases} 1 & \text{if } |x| \le M, \\ M + 1 - |x| & \text{if } M \le |x| \le M + 1, \\ 0 & \text{if } M + 1 \le |x|. \end{cases}
\]

Let $f$ be a bounded continuous function on $\mathbb{R}$. Then $\int f \, d\mu_n = \int f\theta \, d\mu_n + \int (1 - \theta) f \, d\mu_n$, and so

\[
\left| \int f \, d\mu_n - \int f \, d\mu \right| \le \left| \int f\theta \, d\mu_n - \int f\theta \, d\mu \right| + \int (1 - \theta)|f| \, d\mu_n + \int (1 - \theta)|f| \, d\mu \le \left| \int f\theta \, d\mu_n - \int f\theta \, d\mu \right| + 2\varepsilon \| f \|_\infty,
\]

where $\| f \|_\infty \stackrel{\text{def}}{=} \sup_{x \in \mathbb{R}} |f(x)|$ (see Exercise 4.2.11). Since $f\theta \in C_c(\mathbb{R})$, it follows from the hypothesis that $|\int f \, d\mu_n - \int f \, d\mu| \le \varepsilon + 2\varepsilon \| f \|_\infty$ if $n$ is large enough, and so $\lim_n \int f \, d\mu_n = \int f \, d\mu$. $\square$

Proposition 6.2.10. Let $F_n \xrightarrow{w} F$. If $\mu_n$ is the probability determined by $F_n$, then this sequence of probabilities is tight and $\mu_n \xrightarrow{w} \mu$, where $\mu$ is the probability determined by $F$.

Proof. The set of continuity points of $F$ is dense (Exercise 1.3.15), as its complement is at most countable (Exercise 2.9.14). Thus, for any $m$ and $\varepsilon > 0$ there are points of continuity $a$ and $b$ of $F$ with $a < -m$ and $m < b$. Choose $m$ so that $\mu(\mathbb{R} \setminus [-m, m]) < \varepsilon$; then $\mu(\mathbb{R} \setminus (a, b]) = 1 - \mu((a, b]) < \varepsilon$. Since $\mu_n((a, b]) = F_n(b) - F_n(a) \to F(b) - F(a) = \mu((a, b])$, it follows that for $n \ge n(\varepsilon)$ one has $\mu_n(\mathbb{R} \setminus (a, b]) < 2\varepsilon$, and so $\mu_n(\mathbb{R} \setminus [a, b]) < 2\varepsilon$ for $n \ge n(\varepsilon)$. Hence, by Remark 6.2.8 (3), the sequence is tight. The fact that this implies $\mu_n \xrightarrow{w} \mu$ then follows from Propositions 6.2.4 and 6.2.9. $\square$

These results may be summarized as the following theorem.
Theorem 6.2.11. Let $(\mu_n)_{n \ge 1}$ be a sequence of probabilities on $(\mathbb{R}, \mathfrak{B}(\mathbb{R}))$ with corresponding distribution functions $F_n$. If $\mu$ is a probability on $(\mathbb{R}, \mathfrak{B}(\mathbb{R}))$ with distribution function $F$, the following are equivalent:

(1) $\mu_n \xrightarrow{w} \mu$, i.e., $\mu_n \Rightarrow \mu$;
(2) $F_n \xrightarrow{w} F$, i.e., $F_n \Rightarrow F$.

Remark 6.2.12. Two other ways to show that $F_n \xrightarrow{w} F$ implies $\mu_n \xrightarrow{w} \mu$ are presented later in this chapter. One method, due to Gnedenko and Kolmogorov [G], can be found in Exercise 6.8.6. It avoids integration by parts and proves in an elegant manner the equivalence of Definitions 6.1.3 and 6.2.1 and of convergence in a metric due to Lévy (Definition 6.3.6). The other method, which also makes implicit use of the Lévy metric, involves a theorem due to Skorohod (Theorem 6.3.4), which states that weak convergence of distribution functions can be realized as a.s. convergence of a sequence of random variables.

3. Weak convergence of random variables

Since statisticians are mainly interested in the distribution of a random variable rather than the random variable itself, it is natural to look at the behaviour of the distributions of a sequence of random variables.

Definition 6.3.1. A sequence of random variables $(X_n)_{n \ge 1}$ is said to converge weakly, or in distribution, or in law, to a random variable $X$ if the distribution of $X_n$ converges weakly to the distribution of $X$. This will be denoted by writing $X_n \xrightarrow{w} X$, $X_n \xrightarrow{\mathcal{D}} X$, or $X_n \Rightarrow X$.

This notion is related to convergence in probability. First, note that convergence in probability implies tightness of the resulting distributions.

Proposition 6.3.2. Let $(X_n)_{n \ge 1}$ be a sequence of finite random variables on a probability space $(\Omega, \mathcal{F}, P)$. Assume that $X_n \xrightarrow{P} X$, and let $\mu_n$ be the distribution of $X_n$. Then the sequence $(\mu_n)_{n \ge 1}$ of probabilities $\mu_n$ is tight.

Proof. Let $\mu$ be the distribution of $X$, and assume that $\mu([-M, M]) > 1 - \varepsilon$. Let $A_n = \{|X_n| > M + 1\}$. Since $A_n = (A_n \cap \{|X_n - X| > 1\}) \cup (A_n \cap \{|X_n - X| \le 1\})$, it follows that

\[
\mu_n([-M - 1, M + 1]^c) = P[\{|X_n| > M + 1\}] \le P[\{|X_n - X| > 1\}] + P[\{|X| > M\}] \le P[\{|X_n - X| > 1\}] + \varepsilon.
\]

Hence, for sufficiently large $n$, $\mu_n([-M - 1, M + 1]) > 1 - 2\varepsilon$. This implies that the sequence $(\mu_n)_{n \ge 1}$ is tight (see Remark 6.2.8 (3)). $\square$

Not only does convergence in probability of a sequence of random variables ensure that their distributions form a tight sequence, but in fact they converge weakly.
Proposition 6.3.3. Let $(X_n)_{n \ge 1}$ be a sequence of random variables on a probability space $(\Omega, \mathcal{F}, P)$. If $X_n \xrightarrow{P} X$, then $X_n \xrightarrow{w} X$.

Proof. Since by Proposition 6.3.2 the sequence of distributions is tight, it follows from Proposition 6.2.9 that it is enough to show that $\int \varphi \, d\mu_n \to \int \varphi \, d\mu$ for all $\varphi \in C_c(\mathbb{R})$. By Lemma 4.1.3, therefore, it suffices to show that, for all $\varphi \in C_c(\mathbb{R})$, $E[\varphi(X_n)] \to E[\varphi(X)]$. Now, for any $\delta > 0$, one has

\[
E[\varphi(X_n)] - E[\varphi(X)] = \int_{\{|X_n - X| > \delta\}} \{\varphi \circ X_n - \varphi \circ X\} \, dP + \int_{\{|X_n - X| \le \delta\}} \{\varphi \circ X_n - \varphi \circ X\} \, dP.
\]

Let $\varepsilon > 0$. By uniform continuity (Property 2.3.8 (3)) there exists a $\delta > 0$ such that $|x - y| \le \delta$ implies $|\varphi(x) - \varphi(y)| < \varepsilon$. Hence,

\[
|E[\varphi(X_n)] - E[\varphi(X)]| \le 2 \| \varphi \|_\infty P[|X_n - X| > \delta] + \varepsilon.
\]

Consequently, if $X_n \xrightarrow{P} X$, it follows that $|E[\varphi(X_n)] - E[\varphi(X)]| = |\int \varphi \, d\mu_n - \int \varphi \, d\mu| \le 2\varepsilon$ for sufficiently large $n$. $\square$

Since the statement $X_n \xrightarrow{w} X$ gives no information about the behaviour of the random variables themselves, but rather about their distributions, it is remarkable that weak convergence of probabilities may be realized as a.s. convergence of some sequence of random variables. This is stated as the following theorem, due to Skorohod.

Theorem 6.3.4. Let $(\mu_n)_{n \ge 1}$ be a sequence of probabilities on $(\mathbb{R}, \mathfrak{B}(\mathbb{R}))$ that converges weakly to the probability $\mu$. Then there is a sequence $(Y_n)_{n \ge 1}$ of random variables $Y_n$ on $((0,1), \mathfrak{B}((0,1)), dx)$ and a further random variable $Y$ on the same probability space such that

(1) $\mu_n$ is the distribution of $Y_n$ for all $n \ge 1$, and $\mu$ is the distribution of $Y$; and
(2) $Y_n \to Y$ a.s.

Proof. The key to the argument is the observation that any distribution function $F$ can be "inverted" to obtain a random variable $Y$ on $(0,1)$ whose distribution function is $F$, as stated in the following lemma.

Lemma 6.3.5. Let $F$ be a distribution function on $\mathbb{R}$. Define $Y(t) = \inf\{x \mid t < F(x)\}$. Then $Y$ is a random variable on $((0,1), \mathfrak{B}((0,1)), dx)$ such that

(1) it is right continuous and non-decreasing, and
(2) its distribution function is $F$, i.e., $|\{t \mid Y(t) \le x\}| = F(x)$ for all $x \in \mathbb{R}$.
Continuation of the proof. Assuming this lemma, let $F_n$ be the distribution function associated with $\mu_n$ and $F$ be the distribution function of $\mu$. Let $Y_n$ and $Y$ be the corresponding random variables on $(0,1)$ given by the lemma. They obviously satisfy condition (1) of the theorem. It remains to verify that they converge a.s. This is done by making use of the Lévy distance between two distribution functions.

Definition 6.3.6. Let $F_1$ and $F_2$ be two distribution functions on $\mathbb{R}$. The Lévy distance $d(F_1, F_2)$ between them is defined to be the infimum of all positive $\varepsilon$ such that, for all $x \in \mathbb{R}$, one has

\[
F_1(x - \varepsilon) - \varepsilon \le F_2(x) \le F_1(x + \varepsilon) + \varepsilon.
\]

Note that $F_1(x - \varepsilon) - \varepsilon \le F_2(x) \le F_1(x + \varepsilon) + \varepsilon$ for all $x \in \mathbb{R}$ if and only if $F_2(x - \varepsilon) - \varepsilon \le F_1(x) \le F_2(x + \varepsilon) + \varepsilon$ for all $x \in \mathbb{R}$.

Exercise 6.3.7. Show that

(1) $d(F_1, F_2) \le 1$ for any two distribution functions $F_1$ and $F_2$,
(2) the Lévy distance defined above is a metric (Definition 4.1.25) on the set of all distribution functions on $\mathbb{R}$, and
(3) if $d(F_n, F) \to 0$, then $F_n \xrightarrow{w} F$.

Note that the Lévy distance also has the property that

(4) if $F_n \xrightarrow{w} F$, then $d(F_n, F) \to 0$ as $n$ goes to infinity.

Property (4) is established in two exercises in §8 (see Proposition 6.8.3).

By (4), it follows that if $\varepsilon > 0$, then, for large $n$, one has $F(x - \varepsilon) - \varepsilon \le F_n(x) \le F(x + \varepsilon) + \varepsilon$ for all $x \in \mathbb{R}$. It follows almost immediately from the definition of the inverse of the distribution function given in the lemma that

\[
Y(t - \varepsilon) - \varepsilon \le Y_n(t) \le Y(t + \varepsilon) + \varepsilon \quad \text{for all } t \in (\varepsilon, 1 - \varepsilon).
\]

More explicitly: if $t < F_n(x) \le F(x + \varepsilon) + \varepsilon$, then $t - \varepsilon < F(x + \varepsilon)$ and so $Y(t - \varepsilon) \le x + \varepsilon$, which implies that $Y(t - \varepsilon) - \varepsilon \le Y_n(t)$. Similarly, $F_n(x) \le t$ implies that $F(x - \varepsilon) - \varepsilon \le t$, and so $Y(t + \varepsilon) \ge x - \varepsilon$ (i.e., $Y(t + \varepsilon) + \varepsilon \ge Y_n(t)$).

The inequality $Y(t - \varepsilon) - \varepsilon \le Y_n(t) \le Y(t + \varepsilon) + \varepsilon$ for all $t \in (\varepsilon, 1 - \varepsilon)$ implies that $Y_n(t) \to Y(t)$ at all points of continuity of $Y$ in $(0,1)$. This completes the proof, since by part (1) of the lemma the function $Y$ is right continuous and non-decreasing, and hence is continuous except at a countable number of points in $(0,1)$ by Exercise 2.9.14. $\square$

Before giving the proof of Lemma 6.3.5 (see Lemma 6.3.10), here are several applications of Skorohod's theorem.

Proposition 6.3.8. Let $\mu_n \xrightarrow{w} \mu$, and let $h$ be a Borel function that is $\mu$-a.s. continuous. Then the distribution $Q_n$ of $h$ relative to $\mu_n$ converges weakly to the distribution $Q$ of $h$ relative to $\mu$ (in other words, using a common notation for these distributions, $\mu_n \circ h^{-1} \xrightarrow{w} \mu \circ h^{-1}$).
Proof. Let $D_h$ denote the set of discontinuities of $h$, and let $(Y_n)_{n \ge 1}$ be the sequence of random variables given by Skorohod's theorem. Since $Y_n \to Y$ a.s. and $|\{Y \in D_h\}| = \mu(D_h) = 0$, where $Y$ is the inverse of the distribution function of $\mu$, it follows that $h \circ Y_n \to h \circ Y$ a.s.: let $\Lambda_1 = \{t \in (0,1) \mid Y_n(t) \to Y(t)\}$ and $\Lambda_2 = \{t \in (0,1) \mid h \text{ is continuous at } Y(t)\}$; then $|\Lambda_1 \cap \Lambda_2| = 1$, and for $t \in \Lambda_1 \cap \Lambda_2$ one has $Y_n(t) \to Y(t)$ and $h(Y_n(t)) \to h(Y(t))$. Proposition 6.3.3 implies that $Q_n \xrightarrow{w} Q$. $\square$

Remark. This result can be proved without using Skorohod's theorem (see Billingsley [B2], p. 343, Theorem 25.7).

For weak convergence of random variables, one has the following consequences.

Proposition 6.3.9. Let $X_n \xrightarrow{w} X$ be a weakly convergent sequence of random variables on a probability space $(\Omega, \mathcal{F}, P)$, and let $h$ be a Borel function on $\mathbb{R}$. Denote by $D_h$ its set of discontinuities. Then

(1) $h \circ X_n \xrightarrow{w} h \circ X$ if $P[X \in D_h] = 0$;
(2) if $X = a$ a.s. (i.e., if $X_n \xrightarrow{w} a$) and $h$ is continuous at $x = a$, then $h \circ X_n \xrightarrow{w} h(a)$;
(3) if $a_n \to a$ and $b_n \to b$, then $a_n X_n + b_n \xrightarrow{w} aX + b$; and
(4) $E[|X|] \le \liminf_n E[|X_n|]$.

Proof. (1) and (2) are corollaries of Proposition 6.3.8. Skorohod's theorem and Fatou's lemma applied to the $(Y_n)$ prove (4): since $|Y_n| \to |Y|$ a.s., by Fatou, $\int_0^1 |Y(t)| \, dt \le \liminf_n \int_0^1 |Y_n(t)| \, dt$. Also, by Lemma 4.1.3, $\int_0^1 |Y(t)| \, dt = E[|X|]$ and $\int_0^1 |Y_n(t)| \, dt = E[|X_n|]$, since, for example, the law of $Y$ is the law $Q$ of $X$, and so $E[\varphi(X)] = \int \varphi(x) \, Q(dx) = \int_0^1 \varphi(Y(t)) \, dt$ for any bounded Borel function $\varphi$.

To verify (3), observe that if $h_n(x) = a_n x + b_n$ and $h(x) = ax + b$, then $Y_n \to Y$ a.s. implies that $h_n(Y_n) \to h(Y)$ a.s. Hence, (3) follows from Proposition 6.3.3. $\square$

Equivalence of (6.1.3) and (6.2.1). (See Billingsley [B2], pp. 344-345.) The most interesting application of Skorohod's theorem, however, is another proof of the equivalence of Definitions 6.1.3 and 6.2.1, which was already alluded to in Remark 6.2.12. While these two ways of describing weak convergence have already been shown to be equivalent, an inspection of the proof of Skorohod's theorem and its consequences, Propositions 6.3.8 and 6.3.9, shows that in fact everything can be formulated in terms of the distribution functions and Definition 6.1.3, so that no use is made of Definition 6.2.1. As shown earlier (Proposition 6.2.3), it is straightforward to show that if $\mu_n \xrightarrow{w} \mu$, then $F_n \xrightarrow{w} F$. The converse is a consequence of Proposition 6.3.8. Let $f$ be a bounded continuous function, and let $Y_n$ and $Y$ be the random variables in Skorohod's theorem corresponding
to the distribution functions $F_n$ and $F$. Then, by Proposition 6.3.8, $f \circ Y_n \to f \circ Y$ a.s., and these random variables on $(0,1)$ are uniformly bounded by $\| f \|_\infty$. Hence, by dominated convergence (or Theorem 4.5.8 if one prefers), it follows that $\int f \, d\mu_n = \int_0^1 f(Y_n(t)) \, dt \to \int_0^1 f(Y(t)) \, dt = \int f \, d\mu$, and so $\mu_n \xrightarrow{w} \mu$.

Before proving Lemma 6.3.5, it will be restated as follows.

Lemma 6.3.10. Let $F$ be a distribution function on $\mathbb{R}$. If $0 < t < 1$, let $Y(t) \stackrel{\text{def}}{=} \inf\{x \mid t < F(x)\}$. Then

(1) $Y(t) = \sup\{y \mid F(y) \le t\}$;
(2) $F(x) < t$ implies that $x < Y(t)$ and, hence, $F(Y(t)) \ge t$ for all $t \in (0,1)$;
(3) $Y$ is non-decreasing and right continuous; and
(4) $F(a) < t < F(b)$ implies $a < Y(t) \le b$, and $a < Y(t) \le b$ implies $F(a) \le t \le F(b)$.

Hence, as a random variable on $(0,1)$, the distribution of $Y$ is $F$, i.e., $|\{Y(t) \le x\}| = F(x)$.

Proof. First, note that $Y(t)$ has a finite value for each $t \in (0,1)$, since a distribution function $F$ has infimum equal to zero and supremum equal to one. For any $t \in (0,1)$, the distribution function $F$ may be used to split $\mathbb{R}$ into the disjoint union of two intervals, namely $I_1 = \{x \mid F(x) \le t\}$ and $I_2 = \{x \mid t < F(x)\}$. As a result, (1) follows from Exercise 1.1.19 (2), which implies that $Y(t)$ is a boundary point of both intervals.

To prove the first part of (2), observe that, by right continuity of $F$, if $F(x) < t$ there is an $x'$ with $x < x'$ and $F(x') < t$. Hence, $Y(t) \ge x' > x$. Now, as observed above, each $t \in (0,1)$ splits $\mathbb{R}$ into two disjoint intervals with $Y(t)$ the common boundary point. If $x = Y(t)$ is in the left-hand interval $I_1$, then $F(x) \le t$; since $F(x) < t$ would imply $x < Y(t) = x$ by what has just been proved, it follows that $F(Y(t)) = F(x) = t$. Hence, for all $t$, one has $F(Y(t)) \ge t$.

Clearly, $Y$ is non-decreasing. If $t_n \downarrow t$ and $t < F(x)$, then, for large $n$, $t_n < F(x)$, and so $\cup_n \{x \mid t_n < F(x)\} = \{x \mid t < F(x)\}$. The right continuity of $Y$ follows from Exercise 1.1.19 (4) with $I_n = \{x \mid t_n < F(x)\}$.

It is clear that (4) amounts to the following four statements: (i) $a < Y(t)$ implies $F(a) \le t$; (ii) $Y(t) < b$ implies $t < F(b)$; (iii) $F(a) < t$ implies $a < Y(t)$; and (iv) $t < F(b)$ implies $Y(t) \le b$. The first one is obvious from (1); the second holds because $Y(t) < b$ yields an $x < b$ with $t < F(x) \le F(b)$; the third one is part of (2); and the last one is also obvious. It follows from (4) that $|\{t \mid a < Y(t) \le b\}| = F(b) - F(a)$, and so the distribution function of $Y$ is $F$. $\square$
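Lemma 6.3.10 underlies the standard method of simulating a law with distribution function $F$: if $U$ is uniform on $(0,1)$, then $Y(U)$ has distribution function $F$. The sketch below (illustrative only; the exponential law is an arbitrary choice, for which $Y$ has a closed form) checks this empirically.

```python
# Inverse-CDF (quantile transformation) sampling, Lemma 6.3.10:
# for F(x) = 1 - exp(-x) the inverse is Y(t) = -log(1 - t).
import math, random

def Y(t):                              # quantile transformation of F
    return -math.log(1.0 - t)

sample = [Y(random.random()) for _ in range(100_000)]
for x in (0.5, 1.0, 2.0):
    emp = sum(1 for s in sample if s <= x) / len(sample)
    print(f"x={x}:  empirical F(x) = {emp:.4f}   target = {1 - math.exp(-x):.4f}")
```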
Remark 6.3.11. The operation that constructs the "inverse" $Y$ of a distribution function $F$ is often referred to as the inverse or quantile transformation (see an informative review by Csörgő (Bull. (new series) AMS 17 (1987), 189-200) of a monograph on empirical processes). There is a certain arbitrariness in taking $Y$ to be right continuous, and there is a corresponding left continuous version of the quantile transformation: one defines $Y(t) = \inf\{x \mid F(x) \ge t\}$, which is the one discussed in the review by Csörgő. In case the distribution function $F$ is continuous, $Y(t)$ is the last "time" $x$ that $F(x) = t$, whereas for the left continuous version of the quantile transformation it is the first "time" $x$ that $F(x) = t$. Note also that the "analyst's distribution function" of an integrable function (4.7.3) has an inverse that is referred to (see [S1], p. 189) as the non-increasing rearrangement or monotone rearrangement of the original measurable function.

Remark 6.3.12. Let $F$ be a distribution function on $\mathbb{R}$ with associated probability $\mu$. If $A \in \mathfrak{B}(\mathbb{R})$, then by Lemma 4.1.3, $\int_0^1 1_A(Y(t)) \, dt = \mu(A)$. Hence, as pointed out by Billingsley ([B2], p. 190), one obtains the measure $\mu$ on the Borel subsets of $\mathbb{R}$ from the distribution function $F$, its inverse $Y$, and Lebesgue measure on $(0,1)$, without going through the procedure given in Chapter I. However, this is not an entirely free ride, as in any case one needs to get hold of Lebesgue measure on $(0,1)$, which amounts to proving the key Theorem 1.4.13 for the distribution function of the uniform distribution on $(0,1)$ (see Example 1.3.9 (1)).

Exercise 6.3.13. Let $X$ be a random variable on a probability space $(\Omega, \mathcal{F}, P)$, and let $F$ denote its distribution function. Show that if $F$ is continuous, then $F \circ X = F(X)$ is uniformly distributed on $[0,1]$. [Hint: observe that $F(X(\omega)) < t$ if and only if $X(\omega) \in \{x \mid F(x) < t\} = (-\infty, x_0)$; compute $P[F \circ X < t]$ in terms of the distribution of $X$.]

4. Empirical distributions again: the Glivenko-Cantelli theorem

Exercises 6.8.4 and 6.8.5 show that the Lévy distance of $F_n$ from $F$ goes to zero as $n \to \infty$. When the distribution function $F$ is continuous, it is a fact, as indicated in the next exercise, that $F_n$ converges to $F$ uniformly in $x$ (i.e., given $\varepsilon > 0$, there is an integer $n(\varepsilon)$ such that $|F_n(x) - F(x)| < \varepsilon$ for all $x \in \mathbb{R}$, provided $n \ge n(\varepsilon)$).

Exercise 6.4.1. Let $F_n \xrightarrow{w} F$, and assume that $F$ is continuous. Let $\varepsilon > 0$. Show that

(1) there is a finite set of points $t_0 < t_1 < \cdots < t_{k+1}$ such that $F(t_0) < \varepsilon$, $F(t_{i+1}) - F(t_i) < \varepsilon$ for $0 \le i \le k$, and $F(t_{k+1}) > 1 - \varepsilon$. [Hints: first determine $t_0$ and $t_{k+1}$; then use either uniform continuity (Property 2.3.8 (3)) on $[t_0, t_{k+1}]$ or, having chosen $t_0 < t_1 < \cdots < t_i$, choose $t_{i+1}$ to be the last time that $F(t) \le F(t_i) + \varepsilon$, which amounts to looking at the quantile transformation or inverse $Y$ of $F$.]

(2) there is an $n(\varepsilon)$ such that $n \ge n(\varepsilon)$ implies $|F_n(t_i) - F(t_i)| < \varepsilon$ for $0 \le i \le k+1$.

Make use of the non-decreasing property of $F$ and of the $F_n$ to show that

(3) $|F_n(x) - F(x)| < 2\varepsilon$ for all $x \in \mathbb{R}$ if $n \ge n(\varepsilon)$.

A simulation of this uniform convergence is sketched below.
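The following sketch (illustrative only, not from the text) computes the sup-distance $D_n = \sup_x |F_n(x) - F(x)|$ for uniform $(0,1)$ samples; for a continuous $F$, the supremum is attained at the jump points of $F_n$, and $D_n$ should decrease toward $0$ as $n$ grows, in line with Exercise 6.4.1 and with the Glivenko-Cantelli theorem below.

```python
# Sup-distance D_n = sup_x |F_n(x) - F(x)| for i.i.d. uniform samples,
# where F(x) = x on [0,1].  At the i-th order statistic (0-indexed), F_n
# jumps from i/n to (i+1)/n, so the sup is taken over these jump points.
import random

def D_n(sample):
    xs, n = sorted(sample), len(sample)
    return max(max(abs((i + 1) / n - x), abs(i / n - x))
               for i, x in enumerate(xs))

for n in (100, 1_000, 10_000):
    print(f"n={n:6d}:  D_n = {D_n([random.random() for _ in range(n)]):.4f}")
```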
In the case of empirical distributions, where a.s. the random distribution function $F_n(x, \omega) \xrightarrow{w} F$, this can in fact be improved to give uniform convergence a.s., regardless of whether or not the "population" distribution function $F$ is continuous. This is the Glivenko-Cantelli theorem, which is often referred to as the "fundamental theorem of statistics", as it says that the (random) empirical distribution function, constructed from observations, converges uniformly a.s. to the "population" distribution function $F$.

Recall that the empirical distribution $F_n(x, \cdot)$ is defined to be $\frac{1}{n} \sum_{k=1}^n Y_k$, where the $Y_k = 1_{(-\infty, x]} \circ X_k = 1_{\{X_k \le x\}}$ are Bernoulli random variables with $P[Y_k = 1] = F(x)$ and $P[Y_k = 0] = 1 - F(x)$, and, as mentioned earlier, the strong law implies that $\frac{1}{n} \sum_{k=1}^n Y_k = F_n(x, \cdot) \to F(x)$ a.s. The random variables $Z_n = 1_{(-\infty, x)} \circ X_n = 1_{\{X_n < x\}}$ are also Bernoulli random variables, with $P[Z_n = 1] = F(x-)$, the left limit of $F$ at $x$, and $P[Z_n = 0] = 1 - F(x-)$. In the same way as for the random variables $Y_n$, it follows from the strong law that $\lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^n 1_{\{X_k < x\}} = \lim_n F_n(x-, \cdot) = F(x-)$ a.s. for any fixed value of $x$.

Theorem 6.4.2. (Glivenko-Cantelli) Let $(X_n)_{n \ge 1}$ be a sequence of i.i.d. random variables on a probability space $(\Omega, \mathcal{F}, P)$ with common distribution function $F$. Let $F_n(x, \omega) = \frac{1}{n} \sum_{k=1}^n 1_{\{X_k(\omega) \le x\}}$ denote the corresponding empirical distribution. Given $\varepsilon > 0$, there exists an integer $n(\varepsilon, \omega)$ such that a.s.

\[
\sup_{x \in \mathbb{R}} |F_n(x, \omega) - F(x)| \le \varepsilon \quad \text{if } n \ge n(\varepsilon, \omega).
\]

Proof. Let $k > 1$ and let $x_{j,k} = Y(\frac{j}{k})$ for $1 \le j \le k-1$, where $Y$ is the quantile transformation of $F$ (i.e., its inverse). The idea of the proof is to use the a.s. convergence of $F_n$ to $F$ at the values $x_{j,k}$ and $x_{j,k}-$ to control the potential discontinuity of $F$ and thus get uniform convergence.

To begin with, notice that the change in $F(x)$ between $x_{j,k}$ and $x_{j+1,k}-$ is at most $\frac{1}{k}$, since $F(x_{j,k}) = P[X \le x_{j,k}] \ge \frac{j}{k}$ by Lemma 6.3.10 (2). Also, (i) in the proof of Lemma 6.3.10 (4) implies that $F(x_{j+1,k}-) = P[X < x_{j+1,k}] \le \frac{j+1}{k}$. In other words, if $1 \le j \le k-2$ and $x_{j,k} \le x < x_{j+1,k}$, then

(1) \[ F(x_{j,k}) \le F(x) \le F(x_{j+1,k}-) \le F(x_{j,k}) + \frac{1}{k}. \]

Now, for each $k$ and $j$, $1 \le j \le k-1$, $F_n(x_{j,k}, \cdot) \to F(x_{j,k})$ a.s. and $F_n(x_{j,k}-, \cdot) \to F(x_{j,k}-)$ a.s. Denote by $\Lambda$ the set of probability 1 where this occurs for all $k$ and $1 \le j \le k-1$, and let $n_k = n_k(\omega)$, for $\omega \in \Lambda$, be such that, when $n \ge n_k$
and $1 \le j \le k-2$, then

(2) $|F_n(x_{j,k}-,\omega) - F(x_{j,k}-)| \le \frac{1}{k}$ and $|F_n(x_{j,k},\omega) - F(x_{j,k})| \le \frac{1}{k}$,

(3) $F(x_{j,k}) - \frac{1}{k} \le F_n(x_{j,k},\omega) \le F_n(x_{j+1,k}-,\omega) \le F(x_{j+1,k}-) + \frac{1}{k}$.

Assume that $x_{j,k} \le x < x_{j+1,k}$, where $1 \le j \le k-2$. It follows from (1) and (3) that if $n \ge n_k(\omega)$, then

(4) $F(x) - \frac{2}{k} \le F_n(x_{j,k},\omega) \le F_n(x,\omega) \le F_n(x_{j+1,k}-,\omega) \le F(x) + \frac{2}{k}$.

Combining (4) with (2), it follows that $|F_n(x,\omega) - F(x)| \le \frac{2}{k}$ if $x_{j,k} \le x < x_{j+1,k}$ for $1 \le j \le k-2$. It remains to consider what happens on $(-\infty, x_{1,k})$ and $(x_{k-1,k}, +\infty)$. On the first interval, $F(x)$ and $F_n(x,\omega)$ both lie in $[0, \frac{2}{k})$ if $n \ge n_k(\omega)$, since this is true at the right-hand endpoint: $F(x) \le F(x_{1,k}-) \le \frac{1}{k}$ and $F_n(x,\omega) \le F_n(x_{1,k}-,\omega) \le \frac{2}{k}$. Similarly, on $(x_{k-1,k}, +\infty)$ it follows that $F(x)$ and $F_n(x,\omega)$ both lie in $(\frac{k-2}{k}, 1]$ if $n \ge n_k(\omega)$. Hence, given $\varepsilon > 0$, it suffices to choose $k$ with $\frac{2}{k} \le \varepsilon$. This completes the argument. $\square$
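The uniform convergence in Theorem 6.4.2 can be watched numerically. The sketch below is an added illustration (Python with NumPy assumed), using a law with an atom at 0 to emphasize that continuity of $F$ is not needed; the supremum distance is evaluated just to the left and right of each order statistic.

```python
import numpy as np

rng = np.random.default_rng(1)

# X = 0 with probability 1/2, otherwise Exp(1): F jumps at 0.
def F(x):
    x = np.asarray(x, dtype=float)
    return np.where(x < 0.0, 0.0, 0.5 + 0.5 * (1.0 - np.exp(-x)))

def sup_dist(sample):
    xs = np.sort(sample)
    n = len(xs)
    fn_right = np.searchsorted(xs, xs, side='right') / n  # F_n(x_i), ties handled
    fn_left = np.searchsorted(xs, xs, side='left') / n    # F_n(x_i-)
    F_right = F(xs)
    F_left = np.where(xs == 0.0, 0.0, F(xs))              # F(x-): the only jump is at 0
    return max(np.max(np.abs(fn_right - F_right)),
               np.max(np.abs(fn_left - F_left)))

for n in (100, 1_000, 10_000):
    sample = np.where(rng.uniform(size=n) < 0.5, 0.0, rng.exponential(size=n))
    print(n, sup_dist(sample))   # decreases toward 0, as the theorem asserts
```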
5. THE CHARACTERISTIC FUNCTION

The limit theorems that are called laws of large numbers are geared to approximating the mean of a distribution. If a probability $Q$ on $\mathbb{R}$ admits a second moment, then random variables $X$ with distribution $Q$ are necessarily in $L^2$. Hence, they have variances.

Lemma 6.5.1. (See Proposition 4.3.10.) Let $(X_n)_{n\ge1}$ be a sequence of i.i.d. random variables, and assume their common distribution admits a second moment. If $S_n = \sum_{k=1}^n X_k$, then the variance $\sigma^2(S_n)$ of $S_n$ equals $n\sigma^2$, where $\sigma^2$ is the common variance of the $X_n$.

It is clear that $\sigma^2(\lambda Y) = \lambda^2\sigma^2(Y)$. As a result, if one wants to scale $S_n$ so that the variance is constant, one has to divide it by $\sqrt{n}$. Dividing it by $\sigma\sqrt{n}$ normalizes it to have variance equal to 1. The scaling used for the laws of large numbers (dividing by $n$) is appropriate to leaving the mean constant, but $\frac{\sigma^2}{n}$ is the variance of $\frac{1}{n}S_n$, and it goes to zero as $n$ tends to infinity, which is what one expects from the weak law of large numbers.

For mean-zero summands, the random variable $\frac{1}{\sigma\sqrt{n}}S_n$ converges weakly (i.e., in distribution) to the unit normal. This fact is called the central limit theorem. The basic tool for obtaining this result is the Fourier transform, or rather the particular form of it used in probability that is called the characteristic function. The key property of this transform is that it converts convolutions into products, which are easier to analyze than convolutions (or sums of independent random variables). The first central limit theorem that was proved was for Bernoulli random variables. It is called the de Moivre-Laplace limit theorem and can be proved directly without using characteristic functions (see Feller [F1], p. 182).

Definition 6.5.2. Let $\mu$ be a probability on $\mathbb{R}$. The characteristic function of $\mu$ is defined to be

$$f(t) = \int e^{itx}\mu(dx), \quad t \in \mathbb{R},$$

where $e^{iu} = \cos u + i\sin u$ (see the Appendix). If $X$ is a random variable with distribution $\mu$, the characteristic function of $X$ is defined to be the characteristic function of $\mu$.

Remark 6.5.3. The Fourier transform of $\mu$, denoted by $\hat{\mu}(t)$, is defined to be $f(-t)$ in Körner [K3] and $f(-2\pi t)$ in Stein and Weiss [S1]. In analysis, the minus sign is omnipresent, but the use of the scaling constant is not fixed. In analysis, the transform is used more for functions in an $L^p$-space than for finite measures, which is where it is used in probability theory. In probability, the characteristic function is not spoken of as a transform.

Proposition 6.5.4. (Basic properties of characteristic functions)

(1) If $f$ is the characteristic function of $\mu$, then it is continuous and bounded, with $|f(t)| \le 1 = f(0)$.
(2) If $f$ is the characteristic function of $\mu$, then its complex conjugate (see the Appendix) satisfies $\bar{f}(t) = f(-t)$, and $\bar{f}$ is the characteristic function of $\bar{\mu}$, where $\bar{\mu}(A) \overset{\text{def}}{=} \mu(-A)$ for all $A \in \mathfrak{B}(\mathbb{R})$.
(3) If $f_1$ and $f_2$ are the characteristic functions of $\mu_1$ and $\mu_2$, then $\lambda f_1 + (1-\lambda)f_2$ is the characteristic function of $\lambda\mu_1 + (1-\lambda)\mu_2$ if $0 \le \lambda \le 1$.
(4) If $\sum_{n=1}^\infty \lambda_n = 1$, $0 \le \lambda_n \le 1$, and $f_n$ is the characteristic function of $\mu_n$ for $n \ge 1$, then $\sum_{n=1}^\infty \lambda_nf_n$ is the characteristic function of $\sum_{n=1}^\infty \lambda_n\mu_n$.
(5) If $f$ is the characteristic function of $\mu$ and $X$ is a random variable with distribution $\mu$, then $f(t) = E[e^{itX}]$.
(6) If $f_1$ and $f_2$ are the characteristic functions of $\mu_1$ and $\mu_2$, then $f_1f_2$ is the characteristic function of $\mu_1 * \mu_2$.
(7) If $f$ is the characteristic function of $\mu$, then $|f|^2 = f\bar{f}$ is the characteristic function of $\mu * \bar{\mu}$.
(8) If $f$ is the characteristic function of $X$, then $e^{ibt}f(t)$ is the characteristic function of $X + b$ and of $\mu_b$, where $\int\psi(x)\mu_b(dx) = \int\psi(x+b)\mu(dx)$.
(9) If $f$ is the characteristic function of $X$ and $a \in \mathbb{R}$, then $f(at)$ is the characteristic function of $aX$.

Proof. (1) $\left|\int e^{itx}\mu(dx)\right| \le \int|e^{itx}|\mu(dx) = \int\mu(dx) = 1 = f(0)$. The continuity of $f$ follows by dominated convergence, since $e^{it_nx} \to e^{itx}$ if $t_n \to t$
and $|e^{itx}| = 1$ for all $x$ and $t$.

(2) Clearly, $\bar{f}(t) = \overline{\int e^{itx}\mu(dx)} = \int e^{-itx}\mu(dx) = f(-t)$. As $\int\psi(x)\bar{\mu}(dx) = \int\psi(-x)\mu(dx)$, it follows that $f(-t) = \int e^{itx}\bar{\mu}(dx)$.

(3) Property (3) follows immediately from the definition.

(4) If the probability $\nu$ is given by $\nu(A) = \lim_{n\to\infty}\sum_{k=1}^n\lambda_k\mu_k(A)$, then $\int\psi\,d\nu = \lim_{n\to\infty}\sum_{k=1}^n\lambda_k\int\psi\,d\mu_k$. Hence,

$$\int e^{ixt}\,d\nu = \lim_n\sum_{k=1}^n\lambda_k\int e^{ixt}\mu_k(dx) = \lim_n\sum_{k=1}^n\lambda_kf_k(t).$$

(5) is a particular case of Lemma 4.1.3: $E[\varphi \circ X] = \int\varphi(x)\mu(dx)$.

(6) follows from Proposition 3.2.3: let $X_1$ and $X_2$ be two independent random variables with distributions $\mu_1$ and $\mu_2$; if $f$ is the characteristic function of $\mu_1 * \mu_2$, then $f(t) = E[e^{it(X_1+X_2)}] = E[e^{itX_1}]E[e^{itX_2}] = f_1(t)f_2(t)$.

(7) follows immediately from (6) and (2).

(8) $E[e^{it(X+b)}] = e^{itb}E[e^{itX}] = e^{itb}f(t)$. Also, if $F$ is the distribution function of $X$, then $F(x-b)$ is the distribution function of $X + b$. Hence, the corresponding probability $\mu_b$ is related to $\mu$ by the formula $\mu_b((c,d]) = F(d-b) - F(c-b) = \mu((c-b, d-b])$. This implies that for any Borel set $A$ one has $\mu_b(A) = \mu(A-b)$. In terms of integrals, this says

$$\int 1_A(x)\mu_b(dx) = \int 1_{A-b}(x)\mu(dx) = \int 1_A(x+b)\mu(dx).$$

Hence, for any positive Borel function $\psi$, one has $\int\psi\,d\mu_b = \int\psi(x+b)\mu(dx)$.

(9) $E[e^{it(aX)}] = f(at)$. $\square$

Exercise 6.5.5. Determine the characteristic functions of (1) $\varepsilon_a$, (2) $p\varepsilon_0 + q\varepsilon_1$, (3) the binomial distribution $b(n,p)$ (see Example 4.1.11), and (4) the Poisson distribution with mean $\lambda$.
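Property (5) makes characteristic functions directly computable by Monte Carlo. The following sketch is an added illustration (Python with NumPy assumed): it compares the empirical average of $e^{itX}$ with the closed form for the Poisson distribution asked for in Exercise 6.5.5 (4), and tests the convolution property (6) on an independent sum.

```python
import numpy as np

rng = np.random.default_rng(2)
lam, n = 3.0, 200_000
t = np.linspace(-2.0, 2.0, 5)

x = rng.poisson(lam, size=n)
y = rng.poisson(lam, size=n)          # independent copy of X

emp  = np.exp(1j * t[:, None] * x).mean(axis=1)        # E[e^{itX}], property (5)
emp2 = np.exp(1j * t[:, None] * (x + y)).mean(axis=1)  # ch. f. of the sum X + Y
exact = np.exp(lam * (np.exp(1j * t) - 1.0))           # Poisson ch. f., Exercise 6.5.5 (4)

print(np.max(np.abs(emp - exact)))      # small Monte Carlo error
print(np.max(np.abs(emp2 - exact**2)))  # property (6): f_{X+Y} = f_X f_Y
```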
Exercise 6.5.6. Compute the characteristic function of a normal distribution by calculating

(1) $\frac{1}{\sqrt{2\pi}}\int e^{tx}e^{-x^2/2}\,dx$ (this is the value at $t$ of the Laplace transform of the unit normal distribution: one completes the square of the exponent),
(2) $\frac{1}{\sqrt{2\pi}}\int\cosh(tx)e^{-x^2/2}\,dx$ (recall that $\cosh u \overset{\text{def}}{=} \frac{e^u + e^{-u}}{2}$).

Recall the power series for $e^u$, and so

(3) determine the power series for $\cosh u$.

The well-known power series for the cosine is $\cos u = \sum_{k=0}^\infty(-1)^k\frac{u^{2k}}{(2k)!}$. Show that, for any $n \ge 1$,

(4) $\left|\sum_{k=0}^n(-1)^k\frac{u^{2k}}{(2k)!}\right| \le \cosh u$.

Conclude that

(5) $\frac{1}{\sqrt{2\pi}}\int\cos(tx)e^{-x^2/2}\,dx = \sum_{k=0}^\infty(-1)^k\frac{t^{2k}}{(2k)!}\left(\frac{1}{\sqrt{2\pi}}\int x^{2k}e^{-x^2/2}\,dx\right)$.

Use integration by parts to show that

(6) $\int x^{2n}e^{-x^2/2}\,dx = (2n-1)\int x^{2n-2}e^{-x^2/2}\,dx$ if $n \ge 1$.

Compute

(7) the even moments $m_{2n} = \frac{1}{\sqrt{2\pi}}\int x^{2n}e^{-x^2/2}\,dx$ of the unit normal distribution $n(x)dx$.

Show that

(8) the characteristic function $f(t) = \int\cos(tx)\,n(x)dx$ of the unit normal $n(x)dx$ is $e^{-t^2/2}$, and
(9) the characteristic function $f(t)$ of an $N(0,\sigma^2)$ random variable (i.e., of $n_{\sigma^2}(x)dx$) is $e^{-\sigma^2t^2/2}$.

Exercise 6.5.7. This is a method for computing the characteristic function of a normal distribution by using complex variable theory. Let $\phi(w) = e^{-w^2/2}$, where $w = u + iv$. Show that

(1) $\phi$ is analytic.

By integrating counterclockwise along the contour that is the boundary of the rectangle in $\mathbb{R}^2 = \mathbb{C}$ determined by $(R,0)$, $(R,t)$, $(-R,t)$, and $(-R,0)$, show that

(2) $\frac{1}{\sqrt{2\pi}}\int e^{-(x+it)^2/2}\,dx = 1$.

By completing the square in the exponent, show that

(3) $\frac{1}{\sqrt{2\pi}}\int e^{itx}e^{-x^2/2}\,dx = e^{-t^2/2}$.

Use (3) to show that

(4) the characteristic function of $n_{\sigma^2}(dx) = \frac{1}{\sigma\sqrt{2\pi}}e^{-x^2/(2\sigma^2)}\,dx$ is $e^{-\sigma^2t^2/2}$.

Remark 6.5.8. If you are not familiar with complex variable theory, you may also proceed as follows. First, compute the Laplace transform $\mathcal{L}(n)(t)$ of the unit normal distribution (see Exercise 6.5.6 (1)). Except for the sign of the exponent, it is the same as the characteristic function. One gets the correct formula by substituting $it$ for $t$. This substitution can be justified by complex variable theory. A word of warning: it is not true that the characteristic function can always be obtained from the Laplace transform in this way; for this to work, one needs the domain of the Laplace transform of the probability to be a neighbourhood of zero.
Exercise 6.5.9. Let $r$ be the probability density function where

$$r(x) = \begin{cases} 0 & \text{if } x < -1, \\ x+1 & \text{if } -1 \le x \le 0, \\ 1-x & \text{if } 0 \le x \le 1, \\ 0 & \text{if } 1 < x. \end{cases}$$

Show that

(1) the characteristic function of $r(x)dx$ is $f(t) = \frac{2(1-\cos t)}{t^2} = \left(\frac{\sin\frac{t}{2}}{\frac{t}{2}}\right)^2$.

Let $r_a$ be the probability density $r_a(x) = \frac{1}{a}r\left(\frac{x}{a}\right)$, where $a > 0$. Show that

(2) if the distribution of $X$ has probability density function $r$, then the distribution of $aX$ has probability density function $r_a$.

As a result,

(3) determine the characteristic function of $r_a(x)dx$.

For a basic table of characteristic functions see, for example, Feller [F2], p. 416, or Chung [C], p. 147.

6. UNIQUENESS AND INVERSION OF THE CHARACTERISTIC FUNCTION

There is a simple relation between two probabilities and their characteristic functions that has far-reaching effects when used in conjunction with the normal or Gaussian densities (as shown by Feller [F2] and Lamperti [L1]) and that illustrates some general summability results in harmonic analysis (see Stein and Weiss [S1], pp. 8-12). To emphasize this relation, the probability $\mu(dx)$ corresponding to a distribution function will sometimes be denoted by $dF(x)$.

Proposition 6.6.1. (Parseval's relation) Let $f_1$ and $f_2$ be the characteristic functions of the respective probabilities $\mu_1$ and $\mu_2$. Then

$$\int f_1(x)\mu_2(dx) = \int f_2(x)\mu_1(dx), \quad \text{i.e.,} \quad \int f_1(x)\,dF_2(x) = \int f_2(x)\,dF_1(x),$$

where $F_1$ and $F_2$ are the corresponding distribution functions.

Proof. This is a simple application of Fubini's theorem (Corollary 3.3.17). One has

$$\int f_1(x)\mu_2(dx) = \int\left[\int e^{ixy}\mu_1(dy)\right]\mu_2(dx) = \int\left[\int e^{ixy}\mu_2(dx)\right]\mu_1(dy) = \int f_2(y)\mu_1(dy). \qquad \square$$
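Parseval's relation is easy to test numerically. In the added sketch below (Python with NumPy assumed), $\mu_1$ is the unit normal, whose characteristic function is $e^{-x^2/2}$ (Exercise 6.5.6), and $\mu_2$ is the uniform law on $[-1,1]$, whose characteristic function is $\frac{\sin x}{x}$; the two integrals of Proposition 6.6.1 are estimated by sampling.

```python
import numpy as np

rng = np.random.default_rng(3)

# mu1 = N(0,1) with ch. f. f1(x) = exp(-x^2/2);
# mu2 = uniform on [-1,1] with ch. f. f2(x) = sin(x)/x (equal to 1 at x = 0).
f1 = lambda x: np.exp(-x**2 / 2.0)
f2 = lambda x: np.sinc(x / np.pi)   # numpy's sinc(u) = sin(pi u)/(pi u)

x1 = rng.standard_normal(500_000)       # samples from mu1
x2 = rng.uniform(-1.0, 1.0, 500_000)    # samples from mu2

lhs = f1(x2).mean()   # estimates ∫ f1 dmu2
rhs = f2(x1).mean()   # estimates ∫ f2 dmu1
print(lhs, rhs)       # the two averages agree up to Monte Carlo error
```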
Corollary 6.6.2. $\int e^{-ity}f_1(y)\,dF_2(y) = \int f_2(x-t)\,dF_1(x)$.

Proof. By Proposition 6.5.4 (8), the function $e^{-ity}f_1(y)$ is the characteristic function of $(\mu_1)_{-t}$, the translate of $\mu_1$ by $-t$ (i.e., the probability $\nu$ such that $\int\varphi\,d\nu = \int\varphi(x-t)\mu_1(dx) = \int\varphi(x-t)\,dF_1(x)$ for any bounded Borel function $\varphi$). The result then follows from Parseval's relation. $\square$

Corollary 6.6.3. Let $n_{\sigma^2}(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-x^2/(2\sigma^2)}$, and let $f$ be the characteristic function of $\mu$. Then

$$\frac{1}{2\pi}\int e^{-ity}f(y)e^{-\sigma^2y^2/2}\,dy = \int n_{\sigma^2}(t-x)\,\mu(dx).$$

Proof. By Exercise 6.5.6 (9), the characteristic function of $n_{1/\sigma^2}(y)\,dy$ is $e^{-x^2/(2\sigma^2)}$, while $n_{1/\sigma^2}(y) = \frac{\sigma}{\sqrt{2\pi}}e^{-\sigma^2y^2/2}$. Hence, by Corollary 6.6.2, if $dF_2(y) = \mu_2(dy) = n_{1/\sigma^2}(y)\,dy$ and $\mu_1 = \mu$, one has

$$\frac{\sigma}{\sqrt{2\pi}}\int e^{-ity}f(y)e^{-\sigma^2y^2/2}\,dy = \int e^{-(x-t)^2/(2\sigma^2)}\mu(dx).$$

Since normal laws are symmetric, the formula follows after multiplying both sides by $\frac{1}{\sigma\sqrt{2\pi}}$. $\square$

The expression $\int n_{\sigma^2}(t-x)\mu(dx)$ is the density of the probability $n_{\sigma^2}*\mu$ by Exercise 3.3.21 (2), where $n_{\sigma^2}$ also denotes the normal law with mean zero and variance $\sigma^2$ (i.e., $n_{\sigma^2}(dx) = n_{\sigma^2}(x)dx$). In other words, the density of $n_{\sigma^2}*\mu$ is expressed here in terms of the characteristic function $f$ of $\mu$, as in the Fourier inversion result (Theorem 6.6.10).

Proposition 6.6.4. Let $\mu$ be any probability on $\mathbb{R}$. Then $n_{\sigma^2}*\mu \xrightarrow{w} \mu$ as $\sigma$ tends to zero, i.e., for any sequence $(\sigma_n)_{n\ge1}$ of values of $\sigma$ converging to zero, $n_{\sigma_n^2}*\mu \xrightarrow{w} \mu$.

Proof. First, observe that, for any $\varphi \in C_c(\mathbb{R})$,

(*) $\int\varphi\,d(n_{\sigma^2}*\mu) = \int\int\varphi(x+y)\,n_{\sigma^2}(x)\,dx\,\mu(dy) = \int\int\varphi(y-x)\,n_{\sigma^2}(x)\,dx\,\mu(dy),$

since $\int\psi(x)n_{\sigma^2}(x)\,dx = \int\psi(-x)n_{\sigma^2}(x)\,dx$ by Exercise 4.2.21 (4). Furthermore,

(†) $\left|\int\varphi(y-x)\,n_{\sigma^2}(x)\,dx - \varphi(y)\right| \le \int_{\{|x|\le\delta\}}|\varphi(y-x)-\varphi(y)|\,n_{\sigma^2}(x)\,dx + \int_{\{|x|>\delta\}}|\varphi(y-x)-\varphi(y)|\,n_{\sigma^2}(x)\,dx.$
Since $\varphi$ is a continuous function with compact support, it is uniformly continuous (see Property 2.3.8 (3)). Given $\varepsilon > 0$, let $\delta > 0$ be such that $|u - v| \le \delta$ implies that $|\varphi(u) - \varphi(v)| < \varepsilon$. It follows that the first integral in (†) is less than $\varepsilon$ for any $\sigma > 0$. The second integral is less than $2\|\varphi\|_\infty\,n_{\sigma^2}(\{|x| > \delta\})$. This tends to zero as $\sigma$ tends to zero since

(††) $n_{\sigma^2}(\{|x| > \delta\}) = \frac{1}{\sigma\sqrt{2\pi}}\int_{\{|x|>\delta\}}e^{-x^2/(2\sigma^2)}\,dx = \frac{1}{\sqrt{2\pi}}\int_{\{|u|>\delta/\sigma\}}e^{-u^2/2}\,du.$

Hence, $\|\varphi*n_{\sigma^2} - \varphi\|_\infty \to 0$ as $\sigma \to 0$. It follows from (*) that

$$\int\varphi\,d(n_{\sigma^2}*\mu) \to \int\varphi\,d\mu \quad \text{as } \sigma \to 0.$$

To show that $n_{\sigma^2}*\mu \xrightarrow{w} \mu$, it suffices, in view of Proposition 6.2.9, to show that the set of measures $n_{\sigma^2}*\mu$ is tight when, say, $\sigma \le 1$. This fact is stated (for a general bound $c$) as the following lemma.

Lemma. The collection $(n_{\sigma^2})_{\sigma\le c}$ of probabilities $n_{\sigma^2}$ is tight, and hence the collection $(n_{\sigma^2}*\mu)_{\sigma\le c}$ of probabilities $n_{\sigma^2}*\mu$ is tight.

Proof. It follows from (††) that there is a constant $\delta_0$ such that $n_{\sigma^2}(\{|x| > \delta_0\}) < \varepsilon$ for all $\sigma \le c$ (i.e., this collection of probabilities is tight). Let $\varepsilon > 0$ and choose $M > 0$ such that $\mu([-M, M]) > 1 - \varepsilon$. Then $(n_{\sigma^2}*\mu)([-M-\delta_0, M+\delta_0]) \ge 1 - 2\varepsilon$ for all $\sigma \le c$. This is because, by the triangle inequality, $1_{[-M-\delta_0, M+\delta_0]}(x+y) \ge 1_{\{|x|\le\delta_0\}}(x)\,1_{\{|y|\le M\}}(y)$, and so

$$\int\int 1_{[-M-\delta_0, M+\delta_0]}(x+y)\,n_{\sigma^2}(dx)\,\mu(dy) \ge n_{\sigma^2}(\{|x|\le\delta_0\})\,\mu([-M, M]) \ge (1-\varepsilon)^2 \ge 1 - 2\varepsilon. \qquad \square$$

An immediate consequence of this result is the fact that a probability is determined by its characteristic function.

Theorem 6.6.5. (Uniqueness theorem) Let $\mu_1$ and $\mu_2$ be two probabilities with the same characteristic function $f$. Then $\mu_1 = \mu_2$.

Proof. It follows from Corollary 6.6.3 that $n_{\sigma^2}*\mu_1 = n_{\sigma^2}*\mu_2$. Proposition 6.6.4 implies that $\mu_1 = \mu_2$ since weak limits are unique by Exercise 6.2.2. $\square$

In order to use the transform given by the characteristic function to prove the central limit theorem, it is essential to relate weak convergence to continuity properties of the transform.
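Proposition 6.6.4 — the step that drives the uniqueness theorem — says that Gaussian smoothing washes out as $\sigma \to 0$. In the added sketch below (Python with NumPy assumed), a sample from $n_{\sigma^2}*\mu$ is produced as $X + \sigma Z$ with $X \sim \mu$ and $Z$ standard normal, and $\int\varphi\,d(n_{\sigma^2}*\mu)$ is seen to approach $\int\varphi\,d\mu$.

```python
import numpy as np

rng = np.random.default_rng(4)

# mu: the two-point law (1/2)(eps_0 + eps_1).  A sample from n_{s^2}*mu
# is X + s*Z with X ~ mu and Z standard normal, by the definition of
# the convolution of probabilities.
n = 400_000
x = rng.integers(0, 2, size=n).astype(float)   # X ~ mu
z = rng.standard_normal(n)

phi = lambda u: np.exp(-u**2)                  # a bounded continuous test function
target = phi(x).mean()                         # estimates ∫ phi dmu

for s in (1.0, 0.3, 0.1, 0.03):
    print(s, phi(x + s * z).mean() - target)   # difference tends to 0 as s → 0
```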
Theorem 6.6.6. (Continuity theorem) Let $(\mu_n)_{n\ge1}$ be a sequence of probabilities on $\mathbb{R}$ with corresponding characteristic functions $f_n$. The following conditions are equivalent:

(1) there is a probability $\mu$ such that $\mu_n \xrightarrow{w} \mu$; and
(2) $\lim_{n\to\infty}f_n$ exists pointwise and the limit function $f$ is continuous at zero.

The function $f$ is the characteristic function of $\mu$.

Proof. Assume that $\mu_n \xrightarrow{w} \mu$. Since $e^{itx}$ is a bounded continuous function of $x$ for any $t \in \mathbb{R}$, it follows that $\lim_nf_n = f$. Furthermore, $f$ is continuous by Proposition 6.5.4 (1).

The hard part of the proof is to show that (2) implies (1). It makes use of a compactness property due to Helly, which is stated as follows.

Lemma 6.6.7. (Helly's first selection principle) Let $(G_n)_{n\ge1}$ be any sequence of non-negative, right continuous, non-decreasing functions on $\mathbb{R}$ that are uniformly bounded by 1 (i.e., $G_n(x) \le 1$ for all $x \in \mathbb{R}$). Then there is a subsequence $(G_{n_k})_{k\ge1}$ and a non-negative, right continuous, non-decreasing function $G$ such that $G_{n_k}(x) \to G(x)$ for each continuity point $x$ of $G$.

Assume the lemma. Let $F_n$ be the distribution function of $\mu_n$. Then there is a subsequence $(F_{n_k})_{k\ge1}$ and a non-negative, right continuous, non-decreasing function $G$ such that $F_{n_k}(x) \to G(x)$ for each continuity point of $G$; in particular, $F_{n_k} \xrightarrow{w} G$. At this point, there is no reason to assume that $G$ is a distribution function. However, in any case there is a unique measure $\nu$ such that $G(x) = \nu((-\infty, x])$, i.e., $\nu(dx) = dG(x)$ and $\nu(\mathbb{R}) \le 1$. The aim now is to prove that $\nu$ is a probability. Then $G$ will be its distribution function, and $F_{n_k}$ will converge weakly to $G$.

The integration by parts formula of Exercise 6.8.2 (2) implies that

(*) $\int n_{\sigma^2}(t-x)\,\mu_{n_k}(dx) = -\int\frac{d}{dx}\left[n_{\sigma^2}(t-x)\right]F_{n_k}(x)\,dx \quad \text{and} \quad \int n_{\sigma^2}(t-x)\,\nu(dx) = -\int\frac{d}{dx}\left[n_{\sigma^2}(t-x)\right]G(x)\,dx.$

Now by Corollary 6.6.3 and dominated convergence,

$$\int n_{\sigma^2}(t-x)\,\mu_{n_k}(dx) = \frac{1}{2\pi}\int e^{-ity}f_{n_k}(y)e^{-\sigma^2y^2/2}\,dy \to \frac{1}{2\pi}\int e^{-ity}f(y)e^{-\sigma^2y^2/2}\,dy \quad \text{as } k \to \infty.$$

Since $F_{n_k} \xrightarrow{w} G$, it follows from (*) that

$$\frac{1}{2\pi}\int e^{-ity}f(y)e^{-\sigma^2y^2/2}\,dy = \int n_{\sigma^2}(t-x)\,\nu(dx).$$
Multiply both sides by $\sigma\sqrt{2\pi}$. Since $(\sigma\sqrt{2\pi})\,n_{\sigma^2}(x) = e^{-x^2/(2\sigma^2)} \le 1$, one obtains a lower bound for the total mass of $\nu$ (dependent upon $\sigma$):

$$\frac{\sigma}{\sqrt{2\pi}}\int e^{-ity}f(y)e^{-\sigma^2y^2/2}\,dy = \sigma\sqrt{2\pi}\int n_{\sigma^2}(t-x)\,\nu(dx) \le \nu(\mathbb{R}).$$

At this point, the continuity of $f$ at zero comes into play. Let $\sigma \to \infty$, so that the probabilities $n_{1/\sigma^2}(y)\,dy$ concentrate at the origin. Then

(†) $\frac{\sigma}{\sqrt{2\pi}}\int e^{-ity}f(y)e^{-\sigma^2y^2/2}\,dy = \int e^{-ity}f(y)\,n_{1/\sigma^2}(y)\,dy \to f(0) = 1,$

and so $\nu$ is a probability with $G$ as its distribution function. To prove (†), one uses the argument for (†) in Proposition 6.6.4 (note that one cannot simply invoke the weak convergence of $n_{1/\sigma^2}$ to $\varepsilon_0$). Let $\varepsilon > 0$. Then

$$\left|\int e^{-ity}f(y)\,n_{1/\sigma^2}(y)\,dy - 1\right| \le \int_{\{|y|\le\delta\}}|e^{-ity}f(y) - 1|\,n_{1/\sigma^2}(y)\,dy + \int_{\{|y|>\delta\}}|e^{-ity}f(y) - 1|\,n_{1/\sigma^2}(y)\,dy \le \varepsilon + 2\,n_{1/\sigma^2}(\{|y| > \delta\}),$$

for sufficiently small $\delta$, as by the continuity of $f$ at zero, $|e^{-ity}f(y) - 1| \le \varepsilon$ if $|y| \le \delta$. As before in the proof of Proposition 6.6.4 (see (††)), one has $n_{1/\sigma^2}(\{|y| > \delta\}) \le \varepsilon$ for all $\sigma$ sufficiently large.

It has now been proved that (2) implies that there is a subsequence $(\mu_{n_k})_{k\ge1}$ that converges weakly to a probability $\nu$ and hence (by the first part of the proof) $f = \lim_kf_{n_k}$ is its characteristic function. The uniqueness theorem (Theorem 6.6.5) implies that $\nu$ is the only probability that is the weak limit of a subsequence of the $\mu_n$, since $f$ is necessarily the characteristic function of any weak limit. This implies that the whole sequence converges weakly to $\nu$: the set of probabilities is a metric space with respect to the Lévy metric, and in a metric space a sequence $(x_n)_{n\ge1}$ converges to a point $x_0$ if every subsequence of $(x_n)_{n\ge1}$ has a further subsequence that converges to $x_0$ since, given $\varepsilon > 0$, this means it is impossible to find an infinite number of points in the sequence whose distance from $x_0$ is greater than or equal to $\varepsilon$. $\square$

Remark 6.6.8. If $\mu_n \xrightarrow{w} \mu$, the convergence of the characteristic functions $f_n$ to $f$ is better than pointwise. In fact, it occurs uniformly on the compact subsets of $\mathbb{R}$ (see Exercise 6.8.7).

To complete the proof of the continuity theorem, one needs to prove the key lemma, Lemma 6.6.7. Its proof makes use of the diagonal argument used by Cantor to prove that the real numbers are not countable (see Proposition 1.3.14) and is given in the appendix to this chapter.
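A classical instance of the continuity theorem is the Poisson approximation to the binomial: the characteristic functions of $b(n, \lambda/n)$ converge pointwise to that of the Poisson law with mean $\lambda$, and the limit is continuous at zero, so the laws converge weakly. The added sketch below (Python with NumPy assumed) exhibits the pointwise convergence.

```python
import numpy as np

lam, t = 2.0, np.linspace(-3.0, 3.0, 7)

# ch. f. of b(n, p): (1 - p + p e^{it})^n  (Exercise 6.5.5 (3));
# ch. f. of Poisson(lam): exp(lam (e^{it} - 1))  (Exercise 6.5.5 (4)).
f_pois = np.exp(lam * (np.exp(1j * t) - 1.0))

for n in (10, 100, 1000):
    p = lam / n
    f_bin = (1.0 - p + p * np.exp(1j * t)) ** n
    print(n, np.max(np.abs(f_bin - f_pois)))   # pointwise convergence in t
```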
Since the characteristic function determines a unique probability, it is natural to attempt to invert the transform, i.e., to get the probability from its characteristic function. There are many ways to do this (see Feller's comments [F2], p. 484). The one presented here is known as Gauss summability in harmonic analysis (see Stein and Weiss [S1], p. 11, Theorem 1.20). In the simple case where the characteristic function is in $L^1(\mathbb{R})$, it has a Fourier transform. This gives the following inversion formula.

Theorem 6.6.9. (Fourier inversion) Let $f$ be the characteristic function of a probability $\mu$, and assume that $f \in L^1(\mathbb{R})$. Then $\mu$ has as bounded continuous density the function $\varphi$, where

$$\varphi(x) = \frac{1}{2\pi}\int e^{-itx}f(t)\,dt.$$

Proof. As remarked earlier, the probability $n_{\sigma^2}*\mu$ has density

$$\varphi_\sigma(x) = \frac{1}{\sigma\sqrt{2\pi}}\int e^{-(x-y)^2/(2\sigma^2)}\mu(dy) = \int n_{\sigma^2}(x-y)\,\mu(dy) = \frac{1}{2\pi}\int e^{-ixt}f(t)e^{-\sigma^2t^2/2}\,dt,$$

where the last equality holds by Corollary 6.6.3. Since $n_{\sigma^2}*\mu \xrightarrow{w} \mu$ by Proposition 6.6.4 as $\sigma \to 0$, it looks hopeful that one can get the density $\varphi$ by setting $\sigma = 0$.

First of all, since $f \in L^1$, it follows from dominated convergence that $\lim_{\sigma\to0}\varphi_\sigma = \varphi$. This implies that $\varphi \ge 0$. The formula for $\varphi$ implies that $|\varphi(x)| \le \frac{1}{2\pi}\|f\|_1$ for all $x$ and, similarly, $|\varphi_\sigma(x)| \le \frac{1}{2\pi}\|f\|_1$ for all $x$. Hence, $\eta(A) = \int_A\varphi(x)\,dx$ defines a $\sigma$-finite measure on $\mathfrak{B}(\mathbb{R})$. It follows from Exercise 6.2.2 that $\eta = \mu$ (i.e., $\varphi$ is the density of $\mu$) if $\int\phi\,d\eta = \int\phi\,d\mu$ for all continuous functions $\phi$ with compact support. Since $n_{\sigma^2}*\mu \xrightarrow{w} \mu$, it follows that

$$\int\phi(x)\varphi_\sigma(x)\,dx = \int\phi\,d(n_{\sigma^2}*\mu) \to \int\phi\,d\mu \quad \text{as } \sigma \to 0.$$

On the other hand, the densities $\varphi_\sigma$ are uniformly bounded, and so by dominated convergence one has

$$\int\phi(x)\varphi_\sigma(x)\,dx \to \int\phi(x)\varphi(x)\,dx = \int\phi\,d\eta \quad \text{as } \sigma \to 0. \qquad \square$$

In spite of the fact that not every probability has a bounded continuous density, the argument just used shows how to determine an arbitrary probability from its characteristic function $f$: while $f$ need not be in $L^1(\mathbb{R})$ and hence need not be integrable, it is bounded, and so $f(t)e^{-\sigma^2t^2/2}$ is in $L^1$; its Fourier transform is the density of $n_{\sigma^2}*\mu$, which converges weakly to $\mu$ as $\sigma \to 0$. In this way, one may recover $\mu$ from its characteristic function.
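Theorem 6.6.9 can be checked numerically when $f$ is integrable. The added sketch below (Python with NumPy assumed) recovers the unit normal density from its characteristic function $e^{-t^2/2}$ by truncating and discretizing the inversion integral.

```python
import numpy as np

# Recover the unit normal density from its characteristic function
# f(t) = exp(-t^2/2) via phi(x) = (1/2*pi) ∫ e^{-itx} f(t) dt (Theorem 6.6.9).
t = np.linspace(-40.0, 40.0, 20_001)   # f is integrable; truncate the tail
f = np.exp(-t**2 / 2.0)

for x in (0.0, 1.0, 2.0):
    phi = np.trapz(np.exp(-1j * t * x) * f, t).real / (2.0 * np.pi)
    exact = np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)
    print(x, phi, exact)               # the two columns agree closely
```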
Theorem 6.6.10. (Inversion theorem) Let $f$ be the characteristic function of a probability $\mu$. If $a$ and $b$ are continuity points of $\mu$, then

$$\mu((a,b]) = \lim_{\sigma\to0}\frac{1}{2\pi}\int f(t)\,\frac{e^{-ita} - e^{-itb}}{it}\,e^{-\sigma^2t^2/2}\,dt.$$

Proof. To begin,

$$\mu((a,b]) = \lim_{\sigma\to0}(n_{\sigma^2}*\mu)((a,b]) = \lim_{\sigma\to0}\int_a^b\left[\frac{1}{2\pi}\int e^{-ity}f(t)e^{-\sigma^2t^2/2}\,dt\right]dy$$

in view of Corollary 6.6.3, since $(n_{\sigma^2}*\mu)(y) = \int n_{\sigma^2}(y-x)\mu(dx)$ is the density of $n_{\sigma^2}*\mu$ by Exercise 3.3.21 (2). The result follows provided

$$\int_a^b\left[\frac{1}{2\pi}\int e^{-ity}f(t)e^{-\sigma^2t^2/2}\,dt\right]dy = \frac{1}{2\pi}\int\left[\int_a^be^{-ity}\,dy\right]f(t)e^{-\sigma^2t^2/2}\,dt,$$

since $\int_a^be^{-ity}\,dy = \frac{e^{-ita} - e^{-itb}}{it}$. This follows from Fubini's theorem for Lebesgue measure (Corollary 3.3.17), as $|e^{-ity}f(t)e^{-\sigma^2t^2/2}| \le e^{-\sigma^2t^2/2}$, which is in $L^1([a,b]\times\mathbb{R})$. $\square$

Finally, to extend this inversion formula to the case when the endpoints are not continuity points of $F$, it is necessary to determine the limiting behaviour of the distribution function $F_{\sigma^2}$ of $n_{\sigma^2}*\mu$.

Proposition 6.6.11. Let $F_{\sigma^2}$ denote the distribution function of $n_{\sigma^2}*\mu$ and $F$ be the distribution function of $\mu$. Then, for all $x$,

$$\lim_{\sigma\to0}F_{\sigma^2}(x) = \frac{1}{2}\{F(x-) + F(x)\}.$$

Proof. Since $n_{\sigma^2}*\mu = \mu*n_{\sigma^2}$ by Exercise 3.3.21, it follows from Exercise 3.3.21 (9) that

$$F_{\sigma^2}(x) = \int F(x-y)\,n_{\sigma^2}(y)\,dy = (F*n_{\sigma^2})(x).$$

The convolving density $n_{\sigma^2}(y)$ is an even function of $y$ and so, just as in the proof of Fejér's theorem (Theorem 4.2.27), one may average on each side of $x$. More specifically, if $a, b \in \mathbb{R}$, then

(1) $\displaystyle\left|\int_0^{+\infty}F(x-y)\,n_{\sigma^2}(y)\,dy - \frac{1}{2}a\right| = \left|\int_0^{+\infty}\{F(x-y) - a\}\,n_{\sigma^2}(y)\,dy\right| \le \int_0^{+\infty}|F(x-y) - a|\,n_{\sigma^2}(y)\,dy = \int_0^\delta|F(x-y) - a|\,n_{\sigma^2}(y)\,dy + \int_\delta^{+\infty}|F(x-y) - a|\,n_{\sigma^2}(y)\,dy = I_1 + I_2$
and

(2) $\displaystyle\left|\int_{-\infty}^0F(x-y)\,n_{\sigma^2}(y)\,dy - \frac{1}{2}b\right| = \left|\int_{-\infty}^0\{F(x-y) - b\}\,n_{\sigma^2}(y)\,dy\right| \le \int_{-\infty}^0|F(x-y) - b|\,n_{\sigma^2}(y)\,dy = \int_{-\delta}^0|F(x-y) - b|\,n_{\sigma^2}(y)\,dy + \int_{-\infty}^{-\delta}|F(x-y) - b|\,n_{\sigma^2}(y)\,dy = J_1 + J_2.$

Since $F$ is non-decreasing, the left limit $F(x-)$ exists; let $a = F(x-)$ in (1). Given $\varepsilon > 0$, for small $\delta > 0$ it follows that $|F(x-y) - F(x-)| \le \varepsilon$ if $0 < y \le \delta$, and so $I_1 \le \varepsilon$ for small $\delta > 0$. Fix such a $\delta > 0$. The second integral $I_2$ in (1) is dominated by $\{1 + |a|\}\,n_{\sigma^2}(\{y \mid y > \delta\})$, which goes to zero as $\sigma \to 0$ by (††) in the proof of Proposition 6.6.4. Let $b = F(x)$ in (2). Then, for the same reasons, for a small $\delta > 0$, $J_1 + J_2 \le \varepsilon + \{1 + |b|\}\,n_{\sigma^2}(\{y \mid y < -\delta\})$, which is dominated by $2\varepsilon$ for sufficiently small $\sigma = \sigma(\delta,\varepsilon)$. This shows that

$$2(F*n_{\sigma^2})(x) \to F(x-) + F(x) \quad \text{as } \sigma \to 0. \qquad \square$$

Corollary 6.6.12. Let $a < b$. Then

$$\mu((a,b)) + \frac{1}{2}\mu(\{a\}) + \frac{1}{2}\mu(\{b\}) = \lim_{\sigma\to0}\frac{1}{2\pi}\int f(t)\,\frac{e^{-ita} - e^{-itb}}{it}\,e^{-\sigma^2t^2/2}\,dt.$$

Proof. Since $F_{\sigma^2}(b) - F_{\sigma^2}(a) = \int_a^b\left[\frac{1}{2\pi}\int e^{-ity}f(t)e^{-\sigma^2t^2/2}\,dt\right]dy$, the result follows from Proposition 6.6.11 and the fact that

$$\frac{1}{2}\bigl[\{F(b) + F(b-)\} - \{F(a) + F(a-)\}\bigr] = \{F(b-) - F(a)\} + \frac{1}{2}\{F(a) - F(a-)\} + \frac{1}{2}\{F(b) - F(b-)\} = \mu((a,b)) + \frac{1}{2}\mu(\{a\}) + \frac{1}{2}\mu(\{b\}). \qquad \square$$

7. THE CENTRAL LIMIT THEOREM

The simple i.i.d. case. The simplest case of the central limit theorem is for sums of i.i.d. random variables $(X_n)_{n\ge1}$. By centering the random variables (subtracting the mean) and then scaling them by $\frac{1}{\sigma}$, one may assume that they have common mean zero and common variance one.
The central limit theorem states that, under these conditions, the distribution of $\frac{1}{\sqrt{n}}S_n$ converges weakly to the unit normal distribution. Reformulated, without centering and rescaling, this theorem states that $\frac{1}{\sigma\sqrt{n}}(S_n - nm)$ converges in distribution to the unit normal.

To prove this, some basic analytic tools are needed. To begin, it is important to relate the existence of moments to analytic properties of the characteristic function $f$ of the common distribution $\mu$ of the $X_n$.

Proposition 6.7.1. Let $\mu$ be a probability on $\mathfrak{B}(\mathbb{R})$, and assume that it has a second moment (i.e., $\int x^2\mu(dx) < \infty$). If $f$ is the characteristic function of $\mu$, then

(1) $f$ has a continuous second derivative,
(2) $f'(t) = \int ixe^{itx}\mu(dx)$, and
(3) $f''(t) = -\int x^2e^{itx}\mu(dx)$.

In particular, $f'(0) = i\int x\,\mu(dx)$ and $f''(0) = -\int x^2\,\mu(dx)$. In other words, if $X$ is a random variable with distribution $\mu$, then $iE[X] = f'(0)$ and $-E[X^2] = f''(0)$.

Proof. Since $\mu$ has a second moment, it also has a first absolute moment (i.e., $\int|x|\mu(dx) < \infty$; see Proposition 4.1.8). The difference quotient

$$\frac{1}{h}\{f(t+h) - f(t)\} = \frac{1}{h}\int\bigl(e^{i(t+h)x} - e^{itx}\bigr)\mu(dx) = \int e^{itx}\,\frac{e^{ihx} - 1}{h}\,\mu(dx).$$

Since $|e^{ihx} - 1| \le |hx|$ (see Exercise 6.7.2), it follows that $\left|\frac{1}{h}\{e^{i(t+h)x} - e^{itx}\}\right| \le |x|$, which is in $L^1(\mu)$. It follows from Theorem 2.6.1 that one can differentiate under the integral sign, and so $f'(t) = \int ixe^{itx}\mu(dx)$. Since $x^2 \in L^1(\mu)$, the same argument shows that one can differentiate $\int ixe^{itx}\mu(dx)$ in $t$ by differentiating the integrand in $t$, and so $f''(t) = -\int x^2e^{itx}\mu(dx)$. $\square$

Exercise 6.7.2. Show that

(1) $|e^{ix} - 1|^2 = 2(1 - \cos x)$, and
(2) $2(1 - \cos x) \le x^2$. [Hint: the power series for $\cos x$ is an alternating series.]

This result, which obviously extends to higher derivatives in the presence of higher moments, shows that one has a Taylor expansion of $f(t)$ of order 2.

Exercise 6.7.3. Let $g$ be a continuous real-valued function on $\mathbb{R}$ that has a derivative $g'(0)$ at zero. Show that

(1) $g(x) = g(0) + xg'(0) + o(x)$, where $o(x)$ is a continuous function such that $\left|\frac{o(x)}{x}\right| \to 0$ as $x \to 0$. [Hint: recall the definition of the derivative.]

Let $f$ be a real-valued function such that $f'$ is continuous on $\mathbb{R}$ and has a derivative at zero. Show that

(2) $f(x) = f(0) + xf'(0) + \frac{x^2}{2}f''(0) + o(x^2)$, where $o(x^2)$ is a continuous function such that $\left|\frac{o(x^2)}{x^2}\right| \to 0$ as $x \to 0$. [Hint: let $g = f'$ in (1) and integrate from 0 to $x$.]
These results also hold for complex-valued functions $f(x)$ of $x \in \mathbb{R}$. It suffices to consider their real and imaginary parts $u$ and $v$, where $f(x) = u(x) + iv(x)$.

Let $(X_n)_{n\ge1}$ be a sequence of i.i.d. random variables with common distribution $\mu$. Assume that the common mean is 0 and that the second moment of $\mu$ equals 1 (i.e., the common variance is 1).

Theorem 6.7.4. (Central limit theorem for i.i.d. random variables) Let $S_n = \sum_{k=1}^nX_k$. If $\mu_n$ is the distribution of $\frac{1}{\sqrt{n}}S_n$, then $\mu_n \xrightarrow{w} n$, the unit normal distribution $n(dx) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,dx$.

Proof. Let $f$ be the common characteristic function. It follows from Proposition 6.5.4 (6) and (9) that the characteristic function of $\frac{1}{\sqrt{n}}S_n$ is $\varphi(t) = \left[f\left(\frac{t}{\sqrt{n}}\right)\right]^n$. Since $\mu$ has a second moment, the mean is 0, and the variance is 1, it follows from Exercise 6.7.3 (2) and Proposition 6.7.1 that

$$f(t) = 1 - \frac{t^2}{2} + o(t^2), \quad \text{and so} \quad f\left(\frac{t}{\sqrt{n}}\right) = 1 - \frac{t^2}{2n} + o\left(\frac{t^2}{n}\right).$$

If one neglects the term $o(\frac{t^2}{n})$, then, from calculus, one has

$$\varphi(t) = \left[f\left(\frac{t}{\sqrt{n}}\right)\right]^n \approx \left[1 - \frac{t^2}{2n}\right]^n \to e^{-t^2/2}$$

as $n \to \infty$. The theorem follows from the continuity theorem (Theorem 6.6.6) and the fact that $e^{-t^2/2}$ is the characteristic function of the unit normal. $\square$

The completion of this proof needs some technical work to justify dropping the error term $o(\frac{t^2}{n})$. The natural way to do this is to use the logarithm, although this may be bypassed by using a clever trick (see Billingsley [B2], p. 367, Lemma 1).
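Before the technical work, it is worth seeing Theorem 6.7.4 in action. The added simulation below (Python with NumPy assumed) sums standardized exponential variables — a decidedly skewed law with mean zero and variance one — and compares the distribution function of $\frac{1}{\sqrt{n}}S_n$ with the unit normal distribution function $\Phi$.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(5)

Phi = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))   # unit normal CDF

# Centered, variance-one summands from a skewed law: Exp(1) minus its mean.
n, reps = 400, 20_000
x = rng.exponential(size=(reps, n)) - 1.0          # mean 0, variance 1
s = x.sum(axis=1) / np.sqrt(n)                     # (1/sqrt(n)) S_n, one value per rep

for c in (-1.0, 0.0, 1.0):
    print(c, np.mean(s <= c), Phi(c))              # empirical CDF vs normal CDF
```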
Exercise 6.7.5. Recall that if $x > 0$, then $\log x = \int_1^x\frac{dt}{t}$. The geometric series gives a power series expansion for $\log(1-x)$ if $|x| < 1$:

(1) $\frac{1}{1-t} = \sum_{n=0}^\infty t^n$, $|t| < 1$.

Integrate both sides of (1) from 0 to $x$, and show that

(2) $-\log(1-x) = \sum_{n=0}^\infty\frac{x^{n+1}}{n+1}$, $|x| < 1$ (observe that if $-1 < x < 1$, then $0 < 1-x < 2$).

The logarithm in real calculus is the inverse of the exponential function: $y = \log x$ if and only if $x = e^y$. While there is no power series that defines $\log x$ for all $x > 0$, there is a well-known power series for $e^y$, namely $e^y = \sum_{k=0}^\infty\frac{y^k}{k!}$. Hence, if $y = \log(1-x)$, then

$$(1-x) = e^y = \sum_{k=0}^\infty\frac{\{\log(1-x)\}^k}{k!} \quad \text{and} \quad (1-x)^n = e^{ny} = \sum_{k=0}^\infty\frac{\{n\log(1-x)\}^k}{k!}.$$

If now $z, w \in \mathbb{C}$, then one may define $z = e^w$ by the power series $e^w = \sum_{k=0}^\infty\frac{w^k}{k!}$ (see the appendix to this chapter) and then attempt to invert this function to determine $w$ as a function of $z$. It turns out that this is not a simple matter, because the complex exponential map is not one-to-one (e.g., $e^{iv} = e^{i(2\pi+v)}$ for any real $v$). However, if $|z| < 1$, then the power series for $\log(1-x)$ makes sense with $x$ replaced by $z$, and it inverts the exponential function. More explicitly, let

(*) $w = \log(1-z) \overset{\text{def}}{=} -\sum_{n=0}^\infty\frac{z^{n+1}}{n+1}$, $|z| < 1$.

Then

$$(1-z) = e^w = \sum_{k=0}^\infty\frac{\{\log(1-z)\}^k}{k!} \quad \text{and} \quad (1-z)^n = e^{nw} = \sum_{k=0}^\infty\frac{\{n\log(1-z)\}^k}{k!}.$$

These identities are proved in complex calculus, for example by analytic continuation from the identities in the real case, or more directly by using the principal branch of the complex logarithm. One direct way to do this is to observe that the algebra involved in substituting the series for $\log(1-x)$ in the exponential series and computing to get $(1-x)$ in the real case is exactly the same as that involved when doing the analogous substitution in the complex case (see the Appendix for a very brief introduction to complex numbers).

The formula (*) for $w = \log(1-z)$ is the technical tool that justifies the statement that

$$\lim_{n\to\infty}\left[1 - \frac{t^2}{2n} + o\left(\frac{t^2}{n}\right)\right]^n = e^{-t^2/2} = \lim_{n\to\infty}\left[1 - \frac{t^2}{2n}\right]^n.$$

Let $z_n = \frac{t^2}{2n} - o(\frac{t^2}{n})$, so that $f(\frac{t}{\sqrt{n}}) = 1 - z_n$. Then

$$n\log(1-z_n) = -n\sum_{k=0}^\infty\frac{z_n^{k+1}}{k+1} = -nz_ng(z_n), \quad \text{where } g(z) = \sum_{k=0}^\infty\frac{z^k}{k+1}.$$

Note that $g$ is continuous, as it is defined by a convergent power series when $|z| < 1$, and $g(0) = 1$.
Now $nz_n = \frac{t^2}{2} - n\,o(\frac{t^2}{n})$, and since $\frac{o(u)}{u} \to 0$ as $u \to 0$, it follows that for $t \ne 0$ fixed, $n\,o(\frac{t^2}{n}) \to 0$ as $n$ tends to infinity. Hence, $nz_n \to \frac{t^2}{2}$ and $z_n \to 0$. Since $g$ is continuous in $z$, it follows that $g(z_n) \to g(0) = 1$, and so $n\log(1-z_n) \to -\frac{t^2}{2}$. Therefore,

$$(1-z_n)^n = \left[1 - \frac{t^2}{2n} + o\left(\frac{t^2}{n}\right)\right]^n = e^{n\log(1-z_n)} \to e^{-t^2/2}.$$

This completes the proof of the simple case of the central limit theorem.

The Lindeberg condition. There has been an enormous amount of effort spent in generalizing and extending the above simple case of the central limit theorem. The main classical result in this work is due to Lindeberg. He solved the problem for a sequence $(X_n)_{n\ge1}$ of independent random variables $X_n$, all with mean zero and finite variance $\sigma_n^2$. As before, let $S_n = \sum_{k=1}^nX_k$. Then, if $s_n^2$ denotes the variance of $S_n$, it follows from independence that $s_n^2 = \sum_{k=1}^n\sigma_k^2$, where $\sigma_k^2$ is the variance of $X_k$, $k \ge 1$.

The analogue of $\frac{1}{\sqrt{n}}S_n$ in the i.i.d. case is $\frac{1}{s_n}S_n$. The characteristic function $\varphi_n(t)$ of $S_n$ is the product of the characteristic functions $f_k(t)$ of the $X_k$, and so the characteristic function of $\frac{1}{s_n}S_n$ is

$$\varphi_n\left(\frac{t}{s_n}\right) = \prod_{k=1}^nf_k\left(\frac{t}{s_n}\right).$$

As before, the question is how to show that $\varphi_n(\frac{t}{s_n}) \to e^{-t^2/2}$. Using Exercise 6.7.3 (2) to expand each characteristic function $f_k$, one has

(†) $f_k\left(\frac{t}{s_n}\right) = 1 - \frac{t^2\sigma_k^2}{2s_n^2} + o_k\left(\frac{t^2}{s_n^2}\right).$

Clearly, one wants to show that $\sum_{k=1}^n\log f_k(\frac{t}{s_n}) \to -\frac{t^2}{2}$, as then

$$\varphi_n\left(\frac{t}{s_n}\right) = \exp\left(\sum_{k=1}^n\log f_k\left(\frac{t}{s_n}\right)\right) \to e^{-t^2/2}.$$

In order to be able to apply the power series for $\log(1-z_{nk})$, where $1 - z_{nk} = f_k(\frac{t}{s_n})$, it is necessary to know that, in fact, $|z_{nk}| < 1$. In the i.i.d. case this was no problem, as the variance $s_n^2 = n$ in this case and $\sigma_k^2 = \sigma^2 = 1$ for all $k$. This condition on the $z_{nk}$ demands that one have a better control of the term $o_k(\frac{t^2}{s_n^2})$ in the expansion (†) and that in addition one be able to show that $\frac{\sigma_k^2}{s_n^2} \to 0$, uniformly in $k \le n$, as $n$ tends to infinity.
Proposition 6.7.6. Let $Q$ be any probability on $\mathbb{R}$ that has a second moment. Let $f(t)$ be its characteristic function. Then, for any $c > 0$,

$$\left|f(t) - \left\{1 + tf'(0) + \frac{t^2}{2}f''(0)\right\}\right| \le |t|^3c\int x^2\,Q(dx) + t^2\int_{\{|x|>c\}}x^2\,Q(dx).$$

Proof. Since by Proposition 6.7.1,

$$f(t) - \left\{1 + tf'(0) + \frac{t^2}{2}f''(0)\right\} = \int\left[e^{itx} - \left\{1 + itx - \frac{t^2x^2}{2}\right\}\right]Q(dx),$$

it will suffice to estimate the above integral. The integrand is estimated by using the following lemma.

Lemma 6.7.7. The following estimates hold:

(1) $|e^{it} - 1| \le |t|$;
(2) $|e^{it} - \{1 + it\}| \le \frac{t^2}{2}$; and
(3) $|e^{it} - \{1 + it - \frac{t^2}{2}\}| \le \frac{|t|^3}{6}$.

Proof. The first one follows from the observation that $e^{it} - 1 = i\int_0^te^{iu}\,du$, since $|i\int_0^te^{iu}\,du| \le |t|$. The second follows in the same way, using (1), since $e^{it} - \{1 + it\} = i\int_0^t[e^{iu} - 1]\,du$. The third makes use, in the same way, of the identity $e^{it} - \{1 + it - \frac{t^2}{2}\} = i\int_0^t\left[e^{iu} - \{1 + iu\}\right]du$. To conclude, note that if $|h(u)| \le \frac{|u|^n}{n!}$, then $|\int_0^th(u)\,du| \le \frac{|t|^{n+1}}{(n+1)!}$. $\square$

Continuation of the proof of Proposition 6.7.6. It follows from the lemma that

$$\left|\int_{\{|x|\le c\}}\left[e^{itx} - \left\{1 + itx - \frac{t^2x^2}{2}\right\}\right]Q(dx)\right| \le \int_{\{|x|\le c\}}\frac{|t^3x^3|}{6}\,Q(dx) \le |t|^3c\int x^2\,Q(dx)$$

and

$$\left|\int_{\{|x|>c\}}\left[e^{itx} - \left\{1 + itx - \frac{t^2x^2}{2}\right\}\right]Q(dx)\right| \le \int_{\{|x|>c\}}\left[|e^{itx} - \{1 + itx\}| + \frac{t^2x^2}{2}\right]Q(dx) \le \int_{\{|x|>c\}}t^2x^2\,Q(dx).$$

Combining these two estimates gives the result. $\square$
Remark. One cannot immediately use Lemma 6.7.7 (3) in the proof of Proposition 6.7.6 because the probability does not necessarily have a third absolute moment. The estimate is therefore a subtle one, combining third- and second-order terms.

Returning to the problem of controlling the error term $o_k(\frac{t^2}{s_n^2})$, it follows from Proposition 6.7.6 and (†) that for any $c > 0$,

$$\left|o_k\left(\frac{t^2}{s_n^2}\right)\right| = \left|f_k\left(\frac{t}{s_n}\right) - \left\{1 - \frac{t^2\sigma_k^2}{2s_n^2}\right\}\right| \le \frac{|t|^3c}{s_n^3}\int x^2\,Q_k(dx) + \frac{t^2}{s_n^2}\int_{\{|x|>c\}}x^2\,Q_k(dx),$$

where $Q_k$ is the distribution of $X_k$. If one takes $c = s_n\varepsilon$, then one has

(*) $\left|o_k\left(\frac{t^2}{s_n^2}\right)\right| \le \frac{|t|^3\varepsilon}{s_n^2}\int x^2\,Q_k(dx) + \frac{t^2}{s_n^2}\int_{\{|x|>s_n\varepsilon\}}x^2\,Q_k(dx).$

Since $\sum_{k=1}^n\int x^2\,Q_k(dx) = s_n^2$, this implies that

$$\sum_{k=1}^n\left|o_k\left(\frac{t^2}{s_n^2}\right)\right| \le |t|^3\varepsilon + t^2\sum_{k=1}^n\frac{1}{s_n^2}\int_{\{|x|>s_n\varepsilon\}}x^2\,Q_k(dx).$$

The Lindeberg condition states that, for any $\varepsilon > 0$,

(L) $\displaystyle\sum_{k=1}^n\frac{1}{s_n^2}\int_{\{|x|>s_n\varepsilon\}}x^2\,Q_k(dx) \to 0$ as $n \to \infty$.

Hence, if $t$ is fixed, it follows that, given $\varepsilon > 0$, there is an integer $n_0 = n_0(\varepsilon)$ such that for all $n \ge n_0$ one has $\sum_{k=1}^n|o_k(\frac{t^2}{s_n^2})| \le 2\varepsilon|t|^3$.

The Lindeberg condition (L) also implies that $\frac{\sigma_k^2}{s_n^2}$ is uniformly small for large $n$ (i.e., given $\varepsilon > 0$ one has $\frac{\sigma_k^2}{s_n^2} \le 2\varepsilon^2$ for all $k \le n$ as long as $n$ is large). This is because

$$\sigma_k^2 = \int_{\{|x|\le s_n\varepsilon\}}x^2\,Q_k(dx) + \int_{\{|x|>s_n\varepsilon\}}x^2\,Q_k(dx) \le s_n^2\varepsilon^2 + \int_{\{|x|>s_n\varepsilon\}}x^2\,Q_k(dx).$$

Putting all this together, it follows that $|z_{nk}| \le \frac{t^2}{2}\left(\frac{\sigma_k^2}{s_n^2}\right) + \left|o_k\left(\frac{t^2}{s_n^2}\right)\right|$ is small for all $k \le n$ as long as $n$ is large. It follows that
$$\sum_{k=1}^n\log f_k\left(\frac{t}{s_n}\right) = \sum_{k=1}^n\log(1 - z_{nk}) = -\sum_{k=1}^nz_{nk}g(z_{nk}) \to -\frac{t^2}{2}$$

as in the simple case of i.i.d. random variables (see Exercise 6.7.8): $g(z_{nk})$ is uniformly close to 1 for all $k \le n$ for large $n$ and, for fixed $t$,

$$\sum_{k=1}^nz_{nk} = \frac{t^2}{2} - \sum_{k=1}^no_k\left(\frac{t^2}{s_n^2}\right) \to \frac{t^2}{2} \quad \text{as } n \to \infty,$$

since by (*) and the Lindeberg condition (L) it follows that if $n \ge n(\varepsilon)$, then $\sum_{k=1}^n|o_k(\frac{t^2}{s_n^2})| \le 2\varepsilon|t|^3$. $\square$

Exercise 6.7.8. Show that

(1) if $|g(z_{nk}) - 1| \le \varepsilon$ for all $k \le n$ when $n \ge n(\varepsilon)$, then $\left|\sum_{k=1}^n\frac{\sigma_k^2}{s_n^2}g(z_{nk}) - 1\right| \le \varepsilon$ if $n \ge n(\varepsilon)$ [Hint: $\sum_{k=1}^n\sigma_k^2 = s_n^2$],
(2) $\left|\sum_{k=1}^no_k\left(\frac{t^2}{s_n^2}\right)g(z_{nk})\right| \le (1+\varepsilon)\,2\varepsilon|t|^3$ if $n \ge n(\varepsilon)$ and $\sum_{k=1}^n\left|o_k\left(\frac{t^2}{s_n^2}\right)\right| \le 2\varepsilon|t|^3$, and
(3) $\sum_{k=1}^nz_{nk}g(z_{nk}) \to \frac{t^2}{2}$.

This completes the proof of the following version of the central limit theorem.

Theorem 6.7.9. (Central limit theorem: Lindeberg condition) Let $(X_k)_{k\ge1}$ be a sequence of independent random variables with mean zero that have second moments or, equivalently, are in $L^2$. Let $\sigma_k^2$ be the variance of $X_k$, $k \ge 1$, and let $s_n^2 = \sum_{k=1}^n\sigma_k^2$ and $S_n = \sum_{k=1}^nX_k$. If $\mu_n$ is the distribution of $\frac{1}{s_n}S_n$, then $\mu_n$ converges weakly to $n$, the unit normal distribution $n(dx) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,dx$, provided the Lindeberg condition (L) is satisfied: for any $\varepsilon > 0$,

(L) $\displaystyle\sum_{k=1}^n\frac{1}{s_n^2}\int_{\{|x|>s_n\varepsilon\}}x^2\,Q_k(dx) \to 0$ as $n \to \infty$,

where $Q_k$ is the distribution of $X_k$, $k \ge 1$.
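The Lindeberg condition is concrete enough to compute in examples. In the added sketch below (Python with NumPy assumed), $X_k$ is uniform on $[-k,k]$, so that $\sigma_k^2 = \frac{k^2}{3}$ and the truncated second moment is $\int_{\{|x|>c\}}x^2\,Q_k(dx) = \frac{k^3-c^3}{3k}$ for $c < k$ (and 0 otherwise); the Lindeberg sum is then evaluated in closed form and is seen to vanish, so Theorem 6.7.9 applies to this sequence.

```python
import numpy as np

# Lindeberg sum for independent X_k ~ Uniform[-k, k] (an added example):
# sigma_k^2 = k^2/3, and the truncated second moment equals
# (k^3 - c^3)/(3k) when c < k, and 0 when c >= k.
def lindeberg_sum(n, eps):
    k = np.arange(1, n + 1, dtype=float)
    s_n = np.sqrt(np.sum(k**2 / 3.0))   # s_n = sqrt(sum of variances)
    c = s_n * eps                        # truncation level c = s_n * eps
    tail = np.where(k > c, (k**3 - c**3) / (3.0 * k), 0.0)
    return np.sum(tail) / s_n**2

for n in (10, 100, 1000):
    print(n, lindeberg_sum(n, eps=0.1))  # tends to 0, so (L) holds
```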
There is another version of this theorem for so-called triangular arrays, whose proof is essentially the same as the proof for a sequence once the more or less obvious notational changes have been made.

Definition 6.7.10. A triangular array of random variables is a sequence of finite collections $X_{n1}, X_{n2}, \ldots, X_{nk_n}$ of independent random variables $X_{ni}$, $1 \le i \le k_n$, for $n \ge 1$.

Any sequence $(X_n)_{n\ge1}$ of independent random variables determines a triangular array in a natural way: let $k_n = n$ and $X_{ni} = X_i$ for $1 \le i \le n$. Given a triangular array where each random variable has mean zero and finite variance, for each $n \ge 1$ let $s_n^2 = \sum_{i=1}^{k_n}\sigma_{ni}^2$ and $S_n = \sum_{i=1}^{k_n}X_{ni}$. If $Q_{ni}$ is the distribution of $X_{ni}$, the triangular array will be said to satisfy the Lindeberg condition if

(L) $\displaystyle\sum_{i=1}^{k_n}\frac{1}{s_n^2}\int_{\{|x|>s_n\varepsilon\}}x^2\,Q_{ni}(dx) \to 0$ as $n \to \infty$.

By copying the proof of Theorem 6.7.9, with the new definitions of $s_n$ and $S_n$, one obtains a proof of the central limit theorem for triangular arrays.

Theorem 6.7.11. (Central limit theorem: triangular arrays) Given a triangular array $X_{n1}, X_{n2}, \ldots, X_{nk_n}$, $n \ge 1$, of random variables $X_{ni}$, if $s_n^2 = \sum_{i=1}^{k_n}\sigma_{ni}^2$ and $S_n = \sum_{i=1}^{k_n}X_{ni}$, then the distribution of $\frac{1}{s_n}S_n$ converges weakly to the unit normal distribution $n$ if the following Lindeberg condition (L) is satisfied:

(L) $\displaystyle\sum_{i=1}^{k_n}\frac{1}{s_n^2}\int_{\{|x|>s_n\varepsilon\}}x^2\,Q_{ni}(dx) \to 0$ as $n \to \infty$,

where $Q_{ni}$ is the distribution of $X_{ni}$, $1 \le i \le k_n$, $n \ge 1$.

8. ADDITIONAL EXERCISES*

Exercise 6.8.1. (Integration by parts) Let $G$ be a non-decreasing, right continuous, real-valued function on $\mathbb{R}$ with associated $\sigma$-finite measure $\mu$. Let $\varphi$ be a continuously differentiable function whose support (Definition 4.2.4) is contained in $[a,b]$, and let $a = t_0 < t_1 < t_2 < \cdots < t_k = b$. Show that

(1) $\sum_{i=1}^k\varphi(t_i)\{G(t_i) - G(t_{i-1})\} = -\sum_{i=1}^k\{\varphi(t_i) - \varphi(t_{i-1})\}G(t_{i-1})$. [Hint: use integration by parts for series (Lemma 5.5.6).]

Let $\varepsilon > 0$. Show that

(2) if $|\varphi(t_i) - \varphi(x)| \le \varepsilon$ for $t_{i-1} \le x \le t_i$, then one has $\left|\int\varphi\,d\mu - \sum_{i=1}^k\varphi(t_i)\{G(t_i) - G(t_{i-1})\}\right| \le \varepsilon\,\mu((a,b])$,
(3) if $G(t_{i-1}) \le G(x) \le G(t_{i-1}) + \varepsilon$ for $t_{i-1} \le x < t_i$, then $\left|\int\varphi'(x)G(x)\,dx - \sum_{i=1}^k\{\varphi(t_i) - \varphi(t_{i-1})\}G(t_{i-1})\right| \le \varepsilon\int|\varphi'(x)|\,dx$,
(4) the points $t_i$ can be chosen such that the conditions in (2) and (3) are both satisfied. [Hints: for (2) use uniform continuity, and for (3) verify that, for some $m$, there are $m$ points $s_j$ with $s_0 = a$, $s_m = b$, and $s_{j+1} = \sup\{x \mid G(x) \le G(s_j) + \varepsilon\}$.]

Conclude that

(5) $\int\varphi(x)\mu(dx) = -\int\varphi'(x)G(x)\,dx$.

Exercise 6.8.2. Let $\nu$ be a finite Borel measure and $G(x) = \nu((-\infty, x])$ be the corresponding distribution function. Let $\varphi$ be a continuously differentiable function. If $a < b$, show that

(1) $\int_{(a,b]}\varphi\,d\nu = \varphi(b)G(b) - \varphi(a)G(a) - \int_a^b\varphi'(x)G(x)\,dx$.

Assume that $\varphi' \in L^1(\mathbb{R})$ and that $\varphi$ vanishes at infinity. Show that

(2) $\int\varphi\,d\nu = -\int\varphi'(x)G(x)\,dx$.
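The integration by parts formula of Exercise 6.8.2 (2) is easily verified numerically. The added sketch below (Python with NumPy assumed) takes $\nu$ to be the exponential law with parameter 1, so $G(x) = 1 - e^{-x}$ for $x \ge 0$, and the test function $\varphi(x) = e^{-x^2}$, which vanishes at infinity and has integrable derivative.

```python
import numpy as np

# Check of Exercise 6.8.2 (2) with nu = Exp(1): G(x) = 1 - e^{-x} for
# x >= 0 (and G = 0 for x < 0, so only [0, oo) contributes), and
# phi(x) = e^{-x^2}.
x = np.linspace(0.0, 30.0, 300_001)

phi  = np.exp(-x**2)
dphi = -2.0 * x * np.exp(-x**2)
G    = 1.0 - np.exp(-x)                # distribution function of Exp(1)

lhs = np.trapz(phi * np.exp(-x), x)    # ∫ phi dnu, with nu(dx) = e^{-x} dx
rhs = -np.trapz(dphi * G, x)           # -∫ phi'(x) G(x) dx
print(lhs, rhs)                        # should agree to quadrature accuracy
```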
Remark. These two integration by parts formulas are special cases of well-known results for what are often referred to as Riemann-Stieltjes integrals (see Rudin [R4], p. 123, Theorem 6.30), as any function $\varphi$ with a continuous derivative is of bounded variation on a bounded interval $[a,b]$, since it satisfies a Lipschitz condition: if $a \le x < y \le b$, then $|\varphi(x) - \varphi(y)| \le M|x - y|$ with $M = \sup_{a\le t\le b}|\varphi'(t)|$. These exercises can also be proved by extending to signed measures the integration by parts formula given in Exercise 3.6.21.

Proposition 6.8.3. If $F_n \xrightarrow{w} F$, the Lévy distance of $F_n$ from $F$ goes to zero as $n$ tends to infinity.

The proof consists of two exercises with hints based on an argument in Gnedenko and Kolmogorov ([G], p. 34), which the reader is advised to consult for its clarity and elegance.

Exercise 6.8.4. Let $\varepsilon > 0$. Show that

(1) there are two continuity points $a$ and $b$ of $F$ with $F(a) < \frac{\varepsilon}{2}$ and $F(b) > 1 - \frac{\varepsilon}{2}$ [Hint: recall that the set of continuity points of $F$ is dense, as its complement is countable.],
(2) if $|F_n(a) - F(a)| < \frac{\varepsilon}{2}$, then $F_n(x) \le F_n(a) < \varepsilon$ if $x \le a$,
(3) if $|F_n(a) - F(a)| < \frac{\varepsilon}{2}$, then $F_n(x-\varepsilon) - \varepsilon \le F(x)$ and $F(x-\varepsilon) - \varepsilon \le F_n(x)$ if $x \le a$.

Similarly, show that

(4) if $|F_n(b) - F(b)| < \frac{\varepsilon}{2}$, then $1 \ge F_n(x) \ge F(x-\varepsilon) - \varepsilon$ and $1 \ge F(x) \ge F_n(x-\varepsilon) - \varepsilon$ if $x \ge b$. [Hint: show that $F(x) > 1 - \varepsilon \ge F_n(x) - \varepsilon$ and $F_n(x) > 1 - \varepsilon \ge F(x) - \varepsilon$ if $x \ge b$.]

Exercise 6.8.5. Let $\varepsilon > 0$. Show that

(1) there are continuity points $t_i$ of $F$ with $a = t_0 < t_1 < \cdots < t_{k+1} = b$ such that $t_{i+1} - t_i < \varepsilon$ for $0 \le i \le k$, and
(2) if $|F_n(t_i) - F(t_i)| < \varepsilon$ for $0 \le i \le k+1$, then $t_i \le x \le t_{i+1}$ implies that $F(x-\varepsilon) - \varepsilon \le F_n(t_i) \le F_n(x) \le F_n(t_{i+1}) \le F(x+\varepsilon) + \varepsilon$ for $0 \le i \le k$.

The above argument of Gnedenko and Kolmogorov is one part of an elegant theorem that shows the equivalence of the two notions of weak convergence (Definition 6.1.3 and Definition 6.2.1) as well as the equivalence to convergence in the Lévy distance. Their argument that Definition 6.1.3 implies Definition 6.2.1 avoids the use of integration by parts and approximation by smooth functions. It is based on the ideas used above and is outlined in the following exercise.

Exercise 6.8.6. Assume that $F_n \xrightarrow{w} F$, and let $\mu_n$ and $\mu$ be the corresponding probabilities. Let $\varepsilon > 0$ and $f$ be a bounded continuous function. Show that

(1) the continuity points $t_i$ in Exercise 6.8.5 can be chosen (if necessary, closer together) so that, in addition, for two points $x, y$ in any interval $[t_i, t_{i+1}]$, one has $|f(x) - f(y)| \le \varepsilon$ [Hint: make use of uniform continuity on $[a,b]$.],
(2) $\left|\int_{(a,b]}f(x)\mu(dx) - \sum_{i=0}^kf(t_{i+1})\mu((t_i, t_{i+1}])\right| \le \varepsilon$ and $\left|\int_{(a,b]}f(x)\mu_n(dx) - \sum_{i=0}^kf(t_{i+1})\mu_n((t_i, t_{i+1}])\right| \le \varepsilon$. [Hint: consider, for example, $\int_{(t_i,t_{i+1}]}f(x)\mu(dx)$.]

Let $n(\varepsilon)$ be such that $|F_n(a) - F(a)| < \frac{\varepsilon}{2}$, $|F_n(t_i) - F(t_i)| < \frac{\varepsilon}{2}$ for $1 \le i \le k$, and $|F_n(b) - F(b)| < \frac{\varepsilon}{2}$ if $n \ge n(\varepsilon)$. Show that, for $n \ge n(\varepsilon)$, if $C = \|f\|_\infty$, one has

(3) $\left|\sum_{i=0}^kf(t_{i+1})\mu_n((t_i, t_{i+1}]) - \sum_{i=0}^kf(t_{i+1})\mu((t_i, t_{i+1}])\right| \le \varepsilon(k+1)C$, $\left|\int_{(-\infty,a]}f(x)\mu_n(dx) - \int_{(-\infty,a]}f(x)\mu(dx)\right| \le \varepsilon C$, and $\left|\int_{(b,+\infty)}f(x)\mu_n(dx) - \int_{(b,+\infty)}f(x)\mu(dx)\right| \le \varepsilon C$.

Conclude that

(4) $\int f\,d\mu_n \to \int f\,d\mu$ and hence that $\mu_n \xrightarrow{w} \mu$.

Comment. This exercise is in essence the Helly-Bray lemma (see Loève [L2], p. 182).

Exercise 6.8.7. Let $(\mu_n)_{n\ge1}$ be a tight sequence of probabilities on $\mathbb{R}$. Show that

(1) the corresponding sequence $(f_n)_{n\ge1}$ of characteristic functions $f_n$ is uniformly equicontinuous, i.e., for each $t \in \mathbb{R}$ and $\varepsilon > 0$, there is a $\delta > 0$ that depends only on $\varepsilon$ such that

$$|f_n(t+h) - f_n(t)| < \varepsilon \quad \text{if } |h| < \delta.$$

[Comment: if one $\delta$ works for all $f_n$ at one point $t$, the sequence is called equicontinuous at $t$, and if it also works at all points $t$, it is said to be uniformly equicontinuous.] [Hint: verify that $|f_n(t+h) - f_n(t)| \le \int|e^{ihx} - 1|\,\mu_n(dx)$.]

Now assume that $\mu_n \xrightarrow{w} \mu$. Show that

(2) $f_n$ converges to $f$ uniformly on any finite interval $[a,b]$ (i.e., given $\varepsilon > 0$ there exists $n_0 = n(\varepsilon)$ such that $|f_n(t) - f(t)| \le \varepsilon$ for all $t \in [a,b]$ if $n \ge n_0$). [Hint: make use of (1), the fact that there are $k+1$ points $a = t_0 < t_1 < \cdots < t_k = b$ with $|t_i - t_{i+1}| < \delta$ for $0 \le i \le k-1$, and show that one can find $n_0$ with $|f_n(t_i) - f(t_i)| \le \varepsilon$ for $0 \le i \le k$ as long as $n \ge n_0$.]
Remark. There is an idea in the argument suggested for the solution of Exercise 6.8.7 (2). The underlying formal result (which has to do with when a pointwise limit of continuous functions is continuous) is called the Ascoli-Arzelà theorem (see Royden [R3]). Notice that, in fact, one does not apparently need to use uniform equicontinuity, as equicontinuity and compactness of $[a,b]$ suffice. However, equicontinuity and compactness of $[a,b]$ imply uniform equicontinuity, just as continuity and compactness imply uniform continuity.

9. APPENDIX*

Approximation by smooth functions. This section of the Appendix is devoted to a proof of the following result, where $C_c^\infty(\mathbb{R})$ denotes the space of infinitely differentiable functions with compact support.

Theorem 6.9.1. Let $\varphi \in C_c(\mathbb{R})$ and $\varepsilon > 0$. Then there exists $\psi \in C_c^\infty(\mathbb{R})$ with $\|\varphi - \psi\|_\infty < \varepsilon$.

Let $C_0(\mathbb{R})$ denote the continuous functions $\phi(x)$ on $\mathbb{R}$ that converge to zero as $|x| \to \infty$. This is a Banach space with respect to the norm $\|\phi\|_\infty = \sup_{x\in\mathbb{R}}|\phi(x)|$, since it is a closed subspace of the Banach space $C_b(\mathbb{R})$ (see Exercise 4.2.11).

Proposition 6.9.2. Let $n_t(x) = \frac{1}{\sqrt{2\pi t}}e^{-x^2/(2t)}$. If $\phi \in C_0(\mathbb{R})$, then $n_t*\phi \in C_0^\infty(\mathbb{R})$ and $\|n_t*\phi - \phi\|_\infty \to 0$ as $t \to 0$.

Proof. Exercise 3.3.24 shows that $n_t*\phi \in C_0(\mathbb{R})$ for any function $\phi \in C_0(\mathbb{R})$. The fact that $n_t*\phi \in C_0^\infty(\mathbb{R})$ was established in Exercise 2.9.6 (and also in an easier way using Proposition 4.2.29 and part C of Exercise 2.9.6). Finally, it follows from the fact that any function $\phi \in C_0(\mathbb{R})$ is uniformly continuous (it suffices to observe that by Property 2.3.8 (3) it is uniformly continuous on $[-N, N]$ for any $N \ge 1$), and from the proof of (†) in Proposition 6.6.4, that $\|n_t*\phi - \phi\|_\infty \to 0$ as $t \to 0$. $\square$

Given this result, to complete the proof of Theorem 6.9.1, it suffices to know how to "cut off" smooth functions in $C_0(\mathbb{R})$ to get smooth functions of compact support.

Proposition 6.9.3. Let $\alpha < \beta < \gamma < \delta$. Then there is a $C^\infty$ "cut-off" function $\theta$ with $0 \le \theta \le 1$ such that

$$\theta(x) = \begin{cases} 1 & \text{if } \beta \le x \le \gamma, \\ 0 & \text{if } x \le \alpha \text{ or } x \ge \delta. \end{cases}$$

Proof of Theorem 6.9.1. By Proposition 6.9.2, there is a $t > 0$ such that $\|\varphi - n_t*\varphi\|_\infty < \frac{\varepsilon}{2}$. Since the $C^\infty$-function $n_t*\varphi$ vanishes at infinity, there is an integer $M > 0$ such that $|n_t*\varphi(x)| < \frac{\varepsilon}{2}$ if $|x| \ge M$.
Let $\theta$ be a $C^\infty$ cut-off function corresponding to $-\beta = \gamma = M$ and $-\alpha = \delta = M+1$. Then $|n_t*\varphi(x) - \theta\cdot(n_t*\varphi)(x)| \le |n_t*\varphi(x)|$ for all $x$, and so $\|n_t*\varphi - \theta(n_t*\varphi)\|_\infty \le \frac{\varepsilon}{2}$, since the difference is non-zero only if $|x| \ge M$. Hence, $\|\varphi - \psi\|_\infty < \varepsilon$ if $\psi = \theta(n_t*\varphi)$. $\square$

The following technical lemma gives the essential ingredient for constructing the cut-off function of Proposition 6.9.3.

Lemma 6.9.4. The function $f$ defined by

$$f(x) = \begin{cases} 0 & \text{if } x \le 0, \\ e^{-\frac{1}{x}} & \text{if } x > 0, \end{cases}$$

is a $C^\infty$-function on $\mathbb{R}$.

Let $\alpha < \beta$, and define $f_l(x) = f(x - \alpha)$ and $f_r(x) = f(\beta - x)$, where $f$ is the function in Lemma 6.9.4. Let $g_{\alpha,\beta} \overset{\text{def}}{=} f_lf_r$. Then $g_{\alpha,\beta}$ is in $C_c^\infty(\mathbb{R})$ with compact support (Definition 4.2.4) contained in $[\alpha, \beta]$. Define

$$\psi_{\alpha,\beta}(x) = \frac{1}{C(\alpha,\beta)}\int_{-\infty}^xg_{\alpha,\beta}(u)\,du,$$

where $C(\alpha,\beta)$ is the normalizing constant equal to $\int_\alpha^\beta g_{\alpha,\beta}(u)\,du$. The $C^\infty$-function $\psi_{\alpha,\beta}$ takes values in $[0,1]$ and

$$\psi_{\alpha,\beta}(x) = \begin{cases} 0 & \text{if } x \le \alpha, \\ 1 & \text{if } x \ge \beta. \end{cases}$$

Let $\alpha < \beta < \gamma < \delta$. The function $\theta = \psi_{\alpha,\beta}(1 - \psi_{\gamma,\delta})$ is $C^\infty$ with compact support, takes values in $[0,1]$, and

$$\theta(x) = \begin{cases} 1 & \text{if } \beta \le x \le \gamma, \\ 0 & \text{if } x \le \alpha \text{ or } x \ge \delta. \end{cases} \qquad \square$$
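The cut-off construction above is entirely explicit and can be evaluated numerically. The added sketch below (Python with NumPy assumed) builds $\psi_{\alpha,\beta}$ by numerically integrating $g_{\alpha,\beta}$ and then forms $\theta = \psi_{\alpha,\beta}(1-\psi_{\gamma,\delta})$.

```python
import numpy as np

# f as in Lemma 6.9.4; the max() guard only avoids a division warning,
# since np.where discards the x <= 0 branch anyway.
f = lambda x: np.where(x > 0, np.exp(-1.0 / np.maximum(x, 1e-12)), 0.0)

def psi(x, a, b, grid=10_000):
    u = np.linspace(a, b, grid)
    g = f(u - a) * f(b - u)                    # g_{a,b} = f_l * f_r
    c = np.trapz(g, u)                         # normalizing constant C(a,b)
    vals = np.cumsum(g) * (u[1] - u[0]) / c    # approximates (1/C) ∫_a^x g
    return np.interp(x, u, np.clip(vals, 0.0, 1.0), left=0.0, right=1.0)

x = np.linspace(-3.0, 3.0, 13)
alpha, beta, gamma, delta = -2.0, -1.0, 1.0, 2.0
theta = psi(x, alpha, beta) * (1.0 - psi(x, gamma, delta))
print(np.round(theta, 3))   # 0 outside [-2,2], 1 on [-1,1], smooth in between
```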
Helly's selection principle. Consider the set $\mathcal{S}$ of sequences $a = (a_n)_{n\ge1}$ of real numbers $a_n$ such that $0 \le a_n \le 1$ for all $n \ge 1$. This is a subset of $\ell_\infty$ (Exercise 4.1.17). It can also be viewed as the product of an infinite number of copies of $[0,1]$. Define a metric on $\mathcal{S}$ by setting

(*) $d(a,b) = \sum_{n=1}^\infty\frac{1}{2^n}|a_n - b_n|$.

Exercise 6.9.5. Show that

(1) (*) defines a metric on $\mathcal{S}$,
(2) a sequence $(a^k)_{k\ge1}$ of points in $\mathcal{S}$ converges to $a$ if and only if $a_n = \lim_ka_n^k$ for all $n \ge 1$,
(3) a subset $E$ of $\mathcal{S}$ is open (relative to this metric) if and only if $a \in E$ implies that there is an integer $N$ and a positive number $\varepsilon$ such that $\{b \in \mathcal{S} \mid |a_j - b_j| < \varepsilon,\ 1 \le j \le N\} \subset E$.

Recall that a set $E$ in a metric space is open (Remarks 4.1.26) if for any $a \in E$ there is an $\varepsilon > 0$ such that the open ball about $a$ of radius $\varepsilon$ (i.e., $\{x \mid d(x,a) < \varepsilon\}$) is a subset of $E$.

Remarks. Condition (3) shows that the open sets defined by the metric $d$ (i.e., the metric topology) form the so-called product topology (see [R3]): the smallest topology such that the functions $a \to a_j$ are continuous for all $j \ge 1$. It is well known that the product of compact spaces is compact (see Tychonov's theorem [R3]). The selection principle of Helly in effect gives a proof of this topological fact for countable products of compact metric spaces. The product topology is then defined by a metric as above. For metric spaces, as is well known (see [R3]), compactness is equivalent to the Bolzano-Weierstrass property: every sequence has a convergent subsequence. The fact that the product is countable makes it possible to use Cantor's diagonal procedure (Proposition 1.3.14) to prove this.

Proposition 6.9.6. Any sequence of points in $\mathcal{S}$ has a convergent subsequence.

Proof. The idea is as follows: given a sequence $(a^k)_{k\ge1}$ of points in $\mathcal{S}$, the set $\{a_1^k \mid k \ge 1\}$ of first coordinates of these points is a subset of $[0,1]$ and so, by the compactness of $[0,1]$, it has a convergent subsequence (Exercise 1.5.6). This subsequence is defined by a strictly increasing function $m \to k(1,m)$ on $\mathbb{N}$ (see §1 of Chapter I) with the property that there is a point $\alpha_1 \in [0,1]$ for which $\alpha_1 = \lim_ma_1^{k(1,m)}$. In effect, one now forgets about the points $a^k$ that are not labeled by this function $m \to k(1,m)$, looks at the second coordinates of the remaining points, and applies the same argument, which leads to the dropping of still more points; one continues by applying the argument to what's left, and so on. One finally applies the diagonal procedure to the countable array of subsequences that arises in this way to get a subsequence that converges to the point $\alpha$ whose coordinates were determined by the choice of subsequence at each stage.

More formally, there is a sequence $(k(\ell,\cdot))_{\ell\ge1}$ of strictly increasing functions $m \to k(\ell,m)$ defined on $\mathbb{N}$ with the following properties valid for each $\ell \ge 1$:

(1) the sequence $(k(\ell+1,m))_{m\ge1}$ is a subsequence of the previous sequence $(k(\ell,m))_{m\ge1}$ (i.e., there is a strictly increasing function $m \to n(m)$ with $k(\ell+1,m) = k(\ell,n(m))$ for $m \ge 1$); and
(2) there is a point $\alpha_\ell \in [0,1]$ such that $\alpha_\ell = \lim_ma_\ell^{k(\ell,m)}$.

It follows from property (1) that for any $\ell \ge 1$ one has

(3) $\alpha_j = \lim_ma_j^{k(\ell,m)}$ for all $j$, $1 \le j \le \ell$,

as by (1) the sequence $(k(\ell,m))_{m\ge1}$ is a subsequence of each of the sequences $(k(j,m))_{m\ge1}$ for $1 \le j \le \ell$. In other words, one obtains the following array of natural numbers, where each row is strictly increasing and every row is a subsequence of the preceding row:
$$\begin{array}{cccccc} k(1,1) & k(1,2) & \cdots & k(1,m) & \cdots & \to\infty \\ k(2,1) & k(2,2) & \cdots & k(2,m) & \cdots & \to\infty \\ \vdots & \vdots & & \vdots & & \\ k(\ell,1) & k(\ell,2) & \cdots & k(\ell,m) & \cdots & \to\infty \\ \vdots & \vdots & & \vdots & & \end{array}$$

such that, for each $\ell \ge 1$, the first $\ell$ coordinates of the original sequence $(a^k)_{k\ge1}$ of points $a^k \in \mathcal{S}$ all converge along the subsequence defined by the $\ell$th row of this array.

By the diagonal procedure of Cantor, consider the strictly increasing function $m \to k(m,m) = k(m)$: the sequence $(k(m))_{m\ge1}$ is a subsequence of every one of the sequences $(k(\ell,m))_{m\ge1}$ for $\ell \ge 1$. The diagonal sequence $(k(m))_{m\ge1}$ determines a subsequence $(a^{k(m)})_{m\ge1}$ of the sequence $(a^k)_{k\ge1}$ of points $a^k \in \mathcal{S}$. Let $\alpha$ be the sequence for which $\alpha_\ell = \lim_ma_\ell^{k(\ell,m)}$ for each $\ell \ge 1$. Then the sequence $(a^{k(m)})_{m\ge1}$ converges to $\alpha$ (i.e., for each coordinate $\ell$, one has $\lim_ma_\ell^{k(m)} = \alpha_\ell$). This is obvious, as the sequence $(k(m))_{m\ge\ell}$ is a subsequence of each of the sequences $(k(\ell,m))_{m\ge1}$. $\square$

This compactness result is the key to Helly's selection principle.

Proposition 6.9.7. (Helly's first selection principle) Let $(G_n)_{n\ge1}$ be a sequence of non-decreasing, right continuous, non-negative functions on $\mathbb{R}$ that are uniformly bounded by 1 (i.e., $0 \le G_n(x) \le 1$ for all $x \in \mathbb{R}$). Then there is a non-decreasing, right continuous function $G$ on $\mathbb{R}$, with $0 \le G(x) \le 1$ for all $x$, and a subsequence $(G_{n_k})_{k\ge1}$ such that

$$\lim_kG_{n_k}(x) = G(x)$$

for all continuity points $x$ of $G$.

Proof. Enumerate the rationals $\mathbb{Q}$ with a 1:1 onto function $m \to r_m$. Each distribution function $G_n$ determines a sequence $a^n$ in $\mathcal{S}$: $a_\ell^n = G_n(r_\ell)$. Let $\alpha$ be a point in $\mathcal{S}$ that is a limit of a subsequence $(a^{n_k})_{k\ge1}$ of the sequence $(a^n)_{n\ge1}$. Define $\widetilde{G}(r_\ell) = \alpha_\ell$. Then $\lim_kG_{n_k}(r) = \widetilde{G}(r)$ for all $r \in \mathbb{Q}$, and the function $\widetilde{G}$ defined on the rationals is non-decreasing because each $G_{n_k}$ is non-decreasing. Define

$$G(x) = \inf\{\widetilde{G}(r) \mid x < r \in \mathbb{Q}\}.$$

Then $G$

(1) is non-decreasing,
(2) is right continuous, and
(3) satisfies $\lim_kG_{n_k}(x) = G(x)$ at each continuity point $x$ of $G$.
The first property is obvious. Let $\varepsilon > 0$ and consider $G(x_0)$: there is a rational number $r > x_0$ with $G(x_0) \le \widetilde{G}(r) < G(x_0) + \varepsilon$. If $x_0 < x < r$, then $G(x_0) \le G(x) \le \widetilde{G}(r) < G(x_0) + \varepsilon$. This proves (2).

The third property is more delicate and is subtle: it would follow immediately from Lemma 6.1.1 if $\widetilde{G}(r) = G(r)$ for all $r \in \mathbb{Q}$; this need not be the case; all that is certain is that $\widetilde{G}(r) \le G(r)$ for all $r \in \mathbb{Q}$. Now it is fairly clear that $G$ is continuous at $x_0$ (i.e., $G$ is left continuous at $x_0$) if and only if $G(x_0) = \sup\{\widetilde{G}(r) \mid r < x_0,\ r \in \mathbb{Q}\}$: if $x_0 - \delta < x_1 < x_0$ implies $G(x_0) - \varepsilon < G(x_1) \le G(x_0)$, then $x_1 < r_1 < x_0$ implies that $G(x_0) - \varepsilon < G(x_1) \le \widetilde{G}(r_1) \le G(x_0)$. Conversely, if there is a rational number $r_1 < x_0$ with $G(x_0) - \varepsilon < \widetilde{G}(r_1) \le G(x_0)$, then $G(x_0) - \varepsilon < \widetilde{G}(r_1) \le G(x) \le G(x_0)$ if $r_1 < x < x_0$.

Let $x_0$ be a continuity point of $G$ and let $\varepsilon > 0$. Then there are two rational numbers $r_1, r_2$ with $r_1 < x_0 < r_2$ such that

$$G(x_0) - \varepsilon < \widetilde{G}(r_1) \le G(x_0) \le \widetilde{G}(r_2) < G(x_0) + \varepsilon.$$

Let $k_0 = k(\varepsilon, r_1, r_2)$ be such that $|G_{n_k}(r_1) - \widetilde{G}(r_1)| < \varepsilon$ and $|G_{n_k}(r_2) - \widetilde{G}(r_2)| < \varepsilon$ if $k \ge k_0$. Since $G_{n_k}(r_1) \le G_{n_k}(x_0) \le G_{n_k}(r_2)$, it follows that $|G_{n_k}(x_0) - G(x_0)| < 2\varepsilon$ if $k \ge k_0$. $\square$

Remark 6.9.8. This result shows that the collection of finite measures $\nu$ on $(\mathbb{R}, \mathfrak{B}(\mathbb{R}))$ with total mass at most one (the collection of subprobabilities) is a compact metric space when equipped with the Lévy distance, since the proof of the proposition applies without any change to the case when the functions $G_n$ are the "distribution functions" of such measures.

Complex numbers. To maintain the plan that these notes be self-contained, it is appropriate to say a word about the field $\mathbb{C}$ of complex numbers.

Simply stated, $\mathbb{C}$ is the usual Euclidean plane $\mathbb{R}^2$ equipped with a multiplication that is compatible with the vector space structure of $\mathbb{R}^2$. More explicitly, any point $(x,y) \in \mathbb{R}^2$ can be written as $xe_1 + ye_2$, where $e_1$ and $e_2$ are the canonical basis vectors. To "multiply" $xe_1 + ye_2$ by $ae_1 + be_2$, it suffices to be able to "multiply" the basis vectors, as one expects that

$$\{xe_1 + ye_2\}\{ae_1 + be_2\} = xae_1e_1 + xbe_1e_2 + yae_2e_1 + ybe_2e_2.$$

The convention is made that $e_1 = (1,0)$ is denoted by 1, i.e., it is to be the multiplicative unit: multiplication by 1 changes nothing. Further, the "multiplication" is to be commutative. With these simplifications, one has

$$\{xe_1 + ye_2\}\{ae_1 + be_2\} = xa + (xb + ya)e_2 + ybe_2e_2.$$
This means that the "multiplication" will be defined by deciding on the value of $e_2e_2$. Now, looking at things in the plane, since $e_1e_2 = 1e_2 = e_2$, one observes that multiplication by $e_2$ rotates $1 = e_1$ counterclockwise by $\frac{\pi}{2}$. This leads one to define multiplication by $e_2$ as rotation counterclockwise by $\frac{\pi}{2}$: the upshot of this is that $e_2e_2 = (-1,0) = -1$, since $(-1,0)$ is obtained from $(0,1)$ by rotation counterclockwise by $\frac{\pi}{2}$. The basis vector $e_2$ is labeled as the complex number $i$, and one makes the following definition of multiplication for complex numbers.

Definition 6.9.9. Let $(x,y) = x + iy$ and $(a,b) = a + ib$ be two complex numbers. Their product is defined by the formula

$$(x + iy)(a + ib) = (xa - yb) + i(xb + ya).$$

With this definition, it is easy to verify all the usual rules of arithmetic (the axioms of a field) except possibly those related to division. If $z = x + iy$ is not zero, its reciprocal $\frac{1}{z}$ is defined by making use of the complex conjugate of $z$. This is by definition the complex number $x - iy$, and it is denoted by $\bar{z}$ (i.e., $\bar{z} = x - iy$ if $z = x + iy$). The reason the complex conjugate is so important is that $z\bar{z} = x^2 + y^2 = \|(x,y)\|^2$. The square root of $z\bar{z}$ is denoted by $|z|$ and is called the modulus of $z$ (i.e., $|z| = \|(x,y)\|$ if $z = x + iy$). In terms of the modulus and the complex conjugate, one has $\frac{1}{z} = \frac{\bar{z}}{|z|^2}$ if $z \ne 0$.

One of the most important functions of a complex variable is the complex exponential $e^z$. If $t$ is a real number, one sets $e^{it} = \cos t + i\sin t$. It satisfies the law of exponents

$$e^{i(t_1+t_2)} = e^{it_1}e^{it_2}$$

precisely because of the trigonometric formulas $\cos(t_1 + t_2) = \cos t_1\cos t_2 - \sin t_1\sin t_2$ and $\sin(t_1 + t_2) = \sin t_1\cos t_2 + \cos t_1\sin t_2$. One then sets $e^z = e^xe^{iy} = e^x\{\cos y + i\sin y\}$ if $z = x + iy$, where $e^x$ is the real exponential. The law of exponents $e^{z_1+z_2} = e^{z_1}e^{z_2}$ is satisfied once again, since it holds for real numbers and for purely imaginary numbers (those whose real part is zero).

Finally, the convergence of sequences and of series is formulated in exactly the same terms as for the real case. It suffices to "copy" the definitions. In particular, the well-known series for $e^x$ has an exact counterpart for the complex exponential, namely,
$$e^z = \sum_{n=0}^\infty\frac{z^n}{n!}.$$

Further information on functions of a complex variable can be found in any of the standard texts on this subject.
BIBLIOGRAPHY

[B1] Billingsley, P., Convergence of Probability Measures, John Wiley & Sons, Inc., New York, 1968.
[B2] Billingsley, P., Probability and Measure, 2nd edition, John Wiley & Sons, Inc., New York, 1986.
[C] Chung, K. L., A Course in Probability Theory, 2nd edition, Academic Press, New York, 1974.
[D1] Doob, J. L., Stochastic Processes, John Wiley & Sons, Inc., New York, 1953.
[D2] Dudley, R. M., Real Analysis and Probability, Wadsworth & Brooks/Cole, Pacific Grove, Calif., 1989.
[D3] Dynkin, E. B., Theory of Markov Processes, Prentice-Hall, Inc., Englewood Cliffs, N.J., 1961.
[D4] Dynkin, E. B. and Yushkevich, A. A., Markov Processes, Plenum Press, New York, 1969.
[F1] Feller, W., An Introduction to Probability and Its Applications, Vol. I, 3rd edition, John Wiley & Sons, Inc., New York, 1968.
[F2] Feller, W., An Introduction to Probability and Its Applications, Vol. II, John Wiley & Sons, Inc., New York, 1966.
[G] Gnedenko, B. V. and Kolmogorov, A. N., Limit Distributions for Sums of Independent Random Variables, Addison-Wesley, Cambridge, Mass., 1954.
[H1] Halmos, P. R., Measure Theory, Van Nostrand, New York, 1950.
[H2] Halmos, P. R., Naive Set Theory, Van Nostrand, Princeton, N.J., 1960.
[I] Ikeda, N. and Watanabe, S., Stochastic Differential Equations and Diffusion Processes, 2nd edition, North Holland Publ. Co., Amsterdam, 1989.
[K1] Karlin, S. and Taylor, H. M., A First Course in Stochastic Processes, 2nd edition, Academic Press, New York, 1975.
[K2] Kolmogorov, A. N., Foundations of Probability Theory, Chelsea Publ. Co., New York, 1950.
[K3] Körner, T. W., Fourier Analysis, Cambridge University Press, Cambridge, 1988.
[L1] Lamperti, J., Probability, W. A. Benjamin, Inc., New York, 1966.
[L2] Loève, M., Probability Theory I, II, 4th edition, Springer-Verlag, New York, 1978.
[M1] Marsden, J. E., Elementary Classical Analysis, W. H. Freeman and Co., San Francisco, 1974.
[M2] Meyer, P.-A., Probability and Potentials, Blaisdell Publ. Co., Waltham, Mass., 1966.
[N1] Neveu, J., Mathematical Foundations of the Calculus of Probability, Holden-Day, Inc., San Francisco, 1965.
[N2] Neveu, J., Discrete-Parameter Martingales, North Holland Publ. Co., Amsterdam, 1975.
[P1] Parthasarathy, K. R., Probability Measures on Metric Spaces, Academic Press, New York, 1967.
[P2] Protter, P., Stochastic Integration and Differential Equations, Springer-Verlag, Berlin, 1990.
[R1] Revuz, D., Markov Chains, North Holland Publ. Co., Amsterdam, 1975.
[R2] Revuz, D. and Yor, M., Continuous Martingales and Brownian Motion, Springer-Verlag, New York, 1991.
[R3] Royden, H. L., Real Analysis, 3rd edition, MacMillan, New York, 1988.
[R4] Rudin, W., Principles of Mathematical Analysis, McGraw-Hill, New York, 1953.
[S1] Stein, E. M. and Weiss, G., Introduction to Fourier Analysis on Euclidean Space, Princeton University Press, Princeton, N.J., 1971.
[S2] Stroock, D. W., Probability Theory, an Analytic View, Cambridge University Press, Cambridge, 1993.
[T] Titchmarsh, E. C., Theory of Functions, 2nd edition, Oxford University Press, Oxford, 1939.
[W1] Wheeden, R. L. and Zygmund, A., Measure and Integral, Marcel Dekker, New York, 1977.
[W2] Williams, D., Probability with Martingales, Cambridge University Press, Cambridge, 1991.
[Z] Zacks, S., The Theory of Statistical Inference, John Wiley & Sons, Inc., New York, 1971.
INDEX

Absolutely continuous function on [a,b], 76
absolutely continuous probability, 57
absolutely continuous with respect to μ, 66
absolutely convergent, 54
absolutely summable, 146
adapted, 229
almost isomorphic measure spaces, 206
analyst's distribution function, 200
approximate identity, 191
approximation to the identity, 191
Archimedean property, 2
arithmetic mean, 142
at most countable, 15
atoms, 224, 230
axiom of the least upper bound, 2
Backward martingale, 248
Banach algebra, 134
Banach space, 146
Bessel's inequality, 155
betting system, 232
Bolzano-Weierstrass property, 28
Boolean algebra, 6
Boolean ring, 45
Borel function, 36
Borel's strong law of large numbers, 205
Borel-Cantelli lemma, 148
bounded interval, 4
bounded variation, 71
Cantor discontinuum, 8
Cantor set, 8
Cantor's diagonal argument, 15
Cantor-Lebesgue function, 84
Cartesian product, 15
Cauchy sequence, 146
Cauchy-Schwarz inequality, 59, 216
central limit theorem: i.i.d. case, 275
central limit theorem: Lindeberg condition, 280
central limit theorem: triangular arrays, 281
Cesàro mean, 162
characteristic function of a set, 30
characteristic function, 79, 263
Chebychev's inequality, 169
closed, 14
closed interval, 4
closed monotone class, 134
closure, 15
compact, 20
compact support, 150
compact support in an open set, 150
complete measure space, 105
complete metric space, 146
complete orthonormal system, see orthonormal system
completing the measure, see completion
completion, 47, 105
completion of 𝔅(ℝ), 81
complex conjugate of a complex number, 289
conditional distribution function of X given Y = y, 221
conditional expectation of X, 212
conditional probability of E given A, 47
conditionally convergent, 54
conditionally independent, 226
conjugate indices, 141
continuity theorem, characteristic function, 269
continuous at x₀ ∈ E, 36
continuous at x₀ ∈ ℝ, 36
continuous measure, 83
continuous on E, 36
continuous on ℝ, 36
converge in distribution, 255
converge in law, 255
converge weakly, 251, 255
convergence in measure, 168
convergence in probability, 168
convergence P-a.e., 167
convergence μ-a.e., 167
converges to +∞, 2
converges to b, 2
converges to x, 146
converges to 0 in Lᵖ, 145
converges to 0 in Lᵖ-norm, 145
converges uniformly, 134
converges weakly, 251
convex, 144
convex set, 214
convolution kernels, 120
convolution of integrable functions, 157
convolution of two probabilities, 108
countable, 15
countable additivity, 9
countably additive, 9
countably generated, 248
countably subadditive, 21
Dedekind cut, 2
dense, 16
Dini derivatives, 194
Dirac measure at a, 26
Dirac measure at the origin, 26
Dirichlet's kernel, 162
discrete distribution function, 57
discrete measure, 83
distribution function, 11
distribution of a random vector X, 88
distribution of X, 37
Doob decomposition, 243
Doob's Lᵖ-inequality: countable case, 239
Doob's maximal inequality: finite case, 237
Egorov's theorem, 168, 202
empirical distribution function, 250
equicontinuous, 283
ergodic probability space, 94
essential supremum, 144
essentially bounded, 144
excessive function, 231
expectation, 30
expectation of X, 38
exponentially distributed (with parameter λ), 56
𝔉-measurable, 32
𝔉-simple function, 30
Fatou's lemma, 39
Fejér kernel, 160
Fejér's theorem, 163
filtration, 225, 229
finite expectation, 42
finite measure, 44
finite measure space, 45
finitely additive probability space, 6
Fourier inversion theorem, 271
Fourier series, 154
Fourier transform, 79
Fubini's theorem for integrable random variables, 100
Fubini's theorem for Lebesgue integrable functions on ℝⁿ, 131
Fubini's theorem for L¹-functions, 106
Fubini's theorem for positive functions, 105
Fubini's theorem for positive random variables, 99
G.l.b., see greatest lower bound
gamma distribution, 56
geometric mean, 142
Glivenko-Cantelli theorem, 261
greatest lower bound, 2
Hahn decomposition, 64
half-open interval, 4
Hardy-Littlewood maximal function, 187
harmonic function, 231
Heaviside function, 26
Heine-Borel theorem, 19
Helly's first selection principle, 269
Hermite polynomial, 80
Hilbert space, 154, 216
Hölder's inequality, 142
homogeneous Markov chain, 119
I.i.d. sequence, 170
image measure, 138
image of P, 37
improper Riemann integral of f over [0, +∞), 54
increasing process, 243
independent collections of events, 89
independent events, 89
independent family of classes of events, 132
independent family of events or sets, 111
independent family of random variables, 111
independent, finite collection of random variables, 94
indicator function of a set, 30
inequality of Cauchy-Schwarz, 139
inf, see infimum
infimum, 20
infinite product of a sequence of probability spaces, 118
infinite product set, 113
infinite series, 3
infinitely differentiable function, 79, see also smooth
initial distribution, 122
inner product, 59
integers, 1
integrable random variable, 41, see also non-negative integrable random variable
integral, 30
integral of X, 38
integration by parts, 135
integration by parts: functions, 281
integration by parts: series, 241
interval, 4
inverse of a distribution function, 259
inversion theorem, characteristic function, 272
irrational, 1
isomorphic measure spaces, 206
Jensen's inequality, 144
Jensen's inequality, conditional expectation, 219
joint distribution, finite-dimensional, 110
Jordan decomposition, 65
Khintchine's weak law of large numbers, 172
Kolmogorov's 0-1 law, 111
Kolmogorov's criterion, 179
Kolmogorov's inequality, 178, 237
Kolmogorov's strong law of large numbers, 177, 240
Kolmogorov's strong law: backward martingale, 249
Kronecker's lemma, 241
L²-bounded, 238
L²-martingale convergence theorem, 238
Laplace transform, 79
law of a random vector X, see distribution of a random vector X
law of X, see distribution of X
least squares, 218
least upper bound, 1
Lebesgue decomposition theorem for measures, 68
Lebesgue integral, 48
Lebesgue measurable sets, 51
Lebesgue measure, 46
Lebesgue measure on 𝔅(ℝⁿ), 104
Lebesgue measure on ℝ, 48
Lebesgue measure on ℝⁿ, 104
Lebesgue point, 188
Lebesgue set, 188
Lebesgue's differentiation theorem, 188
left continuous, 11
left limit, 11
liminf, see limit infimum
limit infimum, 33
limit supremum, 33
limsup, see limit supremum
Lindeberg condition, 280
Lipschitz truncation of a random variable at height c > 0, 181
locally integrable function, 189
lower bound, 1
lower semi-continuous, 191
lower semi-continuous at x₀, 191
lower sum, 49
lower variation, 65
l.u.b., see least upper bound
Marginal distributions, 89
Markov chain, 119
Markov chain with transition probability N, initial distribution P₀, 127
Markov process, 225
Markov process with transition kernel N, 225
Markov property, 127, 226
Markovian kernel, see transition kernel
martingale, 229
martingale convergence theorem: countable case, 246
martingale transform, 232
maximal function, see Hardy-Littlewood maximal function
maximal function: martingale, 239
maximal function: submartingale, 237
measurable, 86
measurable function, 32
measurable space, 32
measure space, 44
measure zero, 105
metric, 146
metric space, 146
metric space E, 48
Minkowski's inequality, 142
modulus of a complex number, 289
monotone class, 25
monotone class of functions, 134
monotone class theorem, Dynkin's version, 92
monotone class theorem for functions, 133
monotone class theorem for sets, 25
monotone rearrangement, 260
multivariate normal distribution, 107
mutually singular, 65
ν-negative set, 62
ν-positive set, 62
n-dimensional random vector, 88
natural numbers, 1
nearest neighbours, 120
neighbourhood of x₀, 36
non-increasing rearrangement, 260
non-Lebesgue measurable set, 49
non-negative integrable random variable, 38
non-negative measure, 44
norm, 58
normal number, 171
normally summable, 146
normed vector space, 146
null function, 41
null set, 104, see also null function
null variable, 41
Occurrence of a set prior to a stopping time, 234
open, 14
open ball in a metric space about a of radius ε, 146
open interval, 4
open relative to E, see open subset of E
open set in a metric space, 146
open subset of E, 131
open subset of ℝ², 87
open subset of ℝⁿ, 87
optional stopping theorem: finite case, 235
orthogonal system, 154
orthonormal system, 154
outer measure, 21
2π-periodic integrable function, 158
pth absolute moment, 140
pth moment, 140
pairwise independent random variables, 172
parallelogram law, 216
Parseval's equality, 155
Parseval's relation, 266
Parseval's theorem, 156
partition π, 49
path, 127
Poisson distribution, 13
Poisson distribution with mean λ > 0, 38
Polish space, 224
predictable process, 232
principle of mathematical induction, 7
principle of monotone convergence, 39
probability, 6, 9
probability density function, 56
probability space, 9
product of a sequence of σ-algebras, 113
product of complex numbers, 289
product of n probabilities, 102
product of n probability spaces, 102
product of two probabilities, 99
product of two probability spaces, 99
product topology, 286
projection operator, 218
Quantile transformation, 259
Radon-Nikodym derivative of P with respect to λ, 57
Radon-Nikodym theorem: measures, 67
Radon-Nikodym theorem: signed measures, 68
random variable, 32
random vector, 86
rational numbers, 1
real numbers, 1
regular Borel measure, 151
regular conditional probability, 223
Riemann integrable, 50
Riemann integral of φ over [a, b], 50
Riesz-Fischer theorem, 147
Riesz representation theorem, 217
right continuous, 11
right limit, 11
Scheffé's lemma, 186
second axiom of countability, 87
separable, 224
sequence of elements from a set, 2
set of Lebesgue measure zero, 104
σ-additive, 9
σ-additive probability on a Boolean algebra, 17
σ-algebra, 9
σ-algebra generated by 𝒞, 14
σ-algebra of Borel subsets of E, 131
σ-algebra of Borel subsets of ℝ, 14
σ-algebra of Borel subsets of ℝ², 87
σ-algebra of Borel subsets of ℝⁿ, 87
σ-algebra of Lebesgue measurable subsets of ℝⁿ, 104
σ-algebra of P*-measurable sets, 24
σ-field, 10
σ-finite measure, 44
σ-finite measure space, 45
σ-finite signed measure, 61
signed measure, 61
simple function, 30
singular distribution function, see Cantor-Lebesgue function
singular measure, 57
smooth, 79, see also infinitely differentiable function
square summable sequences, 59
standard deviation, 140
standard measurable spaces, 224
stochastic integral, 232
stochastic process, 110
stopping time, 234
strong law of large numbers, 170
submartingale, 229
subsequence, 2
sufficient σ-field, 226
sufficient statistic, 227
sup, see supremum
supermartingale, 229
supremum, 20
symmetric difference, 78
Theorem of dominated convergence, 43
theorem of dominated convergence, equivalent form, 44
theorem of dominated convergence, first version, 43
three-series theorem, 242
topology, 14
total variation, 65, 71
trajectory, 127
transition kernel, 120
translation invariance, 49
translation invariant, 49
triangle inequality, 5, 146, 216
triangular array, 280
trigonometric polynomial, 153
truncation of a random variable at height c, 180
Unbounded interval, 4
uniform closure, 134
uniform convergence, 134
uniform distribution on [0,1], 12
uniform norm, 153
uniformly closed, 134
uniformly equicontinuous, 283
uniformly integrable, 181
uniqueness theorem, characteristic function, 268
unit normal distribution, 13
unit point mass at a, 26
unit point mass at the origin, 26
universally measurable sets, 82
upcrossing, 245
upper bound, 1
upper sum, 49
upper variation of ν, 62
Vanishing at infinity, Borel function, 110
variance, 140
Vitali cover, 195
Vitali covering lemma, 195
Weak law of large numbers, 170
weak L¹(ℝ), 187
weak type, 187
Weierstrass approximation theorem, 175
Young's inequality, 142, 201