Текст
                    Probability and Measure
Third Edition
PATRICK BILLINGSLEY
The University of Chicago
A Wiley-Interscience Publication
JOHN WILEY & SONS
New York · Chichester · Brisbane · Toronto · Singapore


This text is printed on acid-free paper. Copyright © 1995 by John Wiley & Sons, Inc. All rights reserved. Published simultaneously in Canada. Reproduction or translation of any part of this work beyond that permitted by Section 107 or 108 of the 1976 United States Copyright Act without the permission of the copyright owner is unlawful. Requests for penrussioa oxiuxthejL- information should be addressed tMthe Permissrons~b)e#artment, John Wiley & Sons, Inc., 605 ThirfrAvenue, New Yori<|NY 10158-0012. Library of Congress Cataloging in Publication Data: Billingsley, Patrick. Probability and measure / Patrick-Billingsley. —3tfJ ed. p. cm. —(Wiley series in probability and mathematical statistics. Probability and mathematical statistics) "A Wiley-Interscience publication." Includes bibliographical references and index. ISBN 0-471-00710-2 1. Probabilities. 2. Measure theory. I. Title. II. Series. QA273.B575 1995 519.2—dc20 94-28500 Printed in the United States of America 10 9
Preface Edwaid Davenant said he "would have a man knockt in the head that should write anything in Mathematiques that had been written of before." So reports John Aubrey in his Brief Lives. What is new here then? To introduce the idea of measure the book opens with Borel's normal number theorem, proved by calculus alone, and there follow short sections establishing the existence and fundamental properties of probability measures, including Lebesgue measure on the unit interval. For simple random variables—ones with finite range—the expected value is a sum instead of an integral. Measure theory, without integration, therefore suffices for a completely rigorous study of infinite sequences of simple random variables, and this is carried out in the remainder of Chapter 1, which treats laws of large numbers, the optimality of bold play in gambling, Markov chains, large deviations, the law of the iterated logarithm. These developments in their turn motivate the general theory of measure and integration in Chapters 2 and 3. Measure and integral are used together in Chapters 4 and 5 for the study of random sums, the Poisson process, convergence of measures, characteristic functions, central limit theory. Chapter 6 begins with derivatives according to Lebesgue and Radon-Nikodym—a return to measure theory—then applies them to conditional expected values and martingales. Chapter 7 treats such topics in the theory of stochastic processes as Kolmogorov's existence theorem and separability, all illustrated by Brownian motion. What is new, then, is the alternation of probability and measure, probability motivating measure theory and measure theory generating further probability. The book presupposes a knowledge of combinatorial and discrete probability, of rigorous calculus, in particular infinite series, and of elementary set theory. Chapters 1 through 4 are designed to be taken up in sequence. Apart from starred sections and some examples, Chapters 5, 6, and 7 are independent of one another; they can be read in any order. My goal has been to write a book I would myself have liked when I first took up the subject, and the needs of students have been given precedence over the requirements of logical economy. For instance, Kolmogorov's exis- v
vi PREFACE tence theorem appears not in the first chapter but in the last, stochastic processes needed earlier having been constructed by special arguments which, although technically redundant, motivate the general result. And the general result is, in the last chapter, given two proofs at that. It is instructive, I think, to see the show in rehearsal as well as in performance. The Third Edition. The main changes in this edition are two For the theory of Hausdorff measures in Section 19 I have substituted an account of Lp spaces, with applications to statistics. And for the queueing theory in Section 24 I have substituted an introduction to ergodic theory, with applications to continued fractions and Diophantine approximation. These sections now fit better with the rest of the book, and they illustrate again the connections probability theory has with applied mathematics on the one hand and with pure mathematics on the other. For suggestions that have led to improvements in the new edition, I thank Raj Bahadur, Walter Philipp, Michael Wichura, and Wing Wong, as well as the many readers who have sent their comments. Envoy. I said in the preface to the second edition that there would not be a third, and yet here it is. There will not be a fourth. It has been a very- agreeable labor, writing these successive editions of my contribution to the river of mathematics. And although the contribution is small, the river is great: After ages of good service done to those who people its banks, as Joseph Conrad said of the Thames, it spreads out "in the tranquil dignity of a waterway leading to the uttermost ends of the earth." Patrick Billingsley Chicago, Illinois December 1994
Contents CHAPTER 1. PROBABILITY 1 1. Borel's Normal Number Theorem, 1 The Unit Interval—The Weak Law of Large Numbers—The Strong Law of Large Numbers—Strong Law Versus Weak—Length— The Measure Theory of Diophantine Approximation* 2. Probability Measures, 17 Spaces—Assigning Probabilities—Classes of Sets—Probability Measures—Lebesgue Measure on the Unit Interval—Sequence Space*—Constructing σ-Fields* 3. Existence and Extension, 36 Construction of the Extension—Uniqueness and the ττ-λ Theorem —Monotone Classes—Lebesgue Measure on the Unit Interval— Completeness—Nonmeasurable Sets—Two Impossibility Theorems* 4. Denumerable Probabilities, 51 General Formulas—Limit Sets—Independent Events— Subfields—The Borel-Cantelli Lemmas—The Zero-One Law 5. Simple Random Variables, 67 Definition—Convergence of Random Variables—Independence— Existence of Independent Sequences—Expected Value—Inequalities 6. The Law of Large Numbers, 85 The Strong Law—The Weak Law—Bernstein's Theorem— A Refinement of the Second Borel-Cantelli Lemma *Stars indicate topics that may be omitted on a first reading. vii
viii CONTENTS 7. Gambling Systems, 92 Gambler's Ruin—Selection Systems—Gambling Policies— Bold Play*—Timid Play* 8. Markov Chains, 111 Definitions—Higher-Order Transitions—An Existence Theorem— Transience and Persistence—Another Criterion for Persistence— Stationary Distnbutions—Exponential Convergence*—Optimal Stopping * 9. Large Deviations and the Law of the Iterated Logarithm,* 145 Moment Generating Functions—Large Deviations—Chernojfs Theorem—The Law of the Iterated Logarithm CHAPTER 2. MEASURE 158 10. General Measures, 158 Classes of Sets—Conventions Involving °o—Measures— Uniqueness 11. Outer Measure, 165 Outer Measure—Extension—An Approximation Theorem 12. Measures in Euclidean Space, 171 Lebesgue Measure—Regularity—Specifying Measures on the Line—Specifying Measures in Rk—Strange Euclidean Sets* 13. Measurable Functions and Mappings, 182 Measurable Mappings—Mappings into Rk—Limits and Measureability—Transformations of Measures 14. Distribution Functions, 187 Distribution Functions—Exponential Distributions—Weak Convergence—Convergence of Types*—Extremal Distributions* CHAPTER 3. INTEGRATION 199 15. The Integral, 199 Definition —Nonnega tive Functions—Uniqueness
CONTENTS 16. Properties of the Integral, 206 Equalities and Inequalities—Integration to the Limit—Integration over Sets—Densities—Change of Variable—Uniform Integrability —Complex Functions 17. The Integral with Respect to Lebesgue Measure, 221 The Lebesgue Integral on the Line—The Riemann Integral— The Fundamental Theorem of Calculus—Change of Variable— The Lebesgue Integral in Rk—Stieltjes Integrals 18. Product Measure and Fubini's Theorem, 231 Product Spaces—Product Measure—Fubini's Theorem— Integration by Parts—Products of Higher Order 19. The L" Spaces,* 241 Definitions—Completeness and Separability—Conjugate Spaces —Weak Compactness—Some Decision Theory—The Space L2—An Estimation Problem CHAPTER 4. RANDOM VARIABLES AND EXPECTED VALUES 20. Random Variables and Distributions, 254 Random Variables and Vectors—Subfields—Distributions— Multidimensional Distributions—Independence —Sequences of Random Variables—Convolution—Convergence in Probability —The Glivenko-Cantelli Theorem* 21. Expected Values, 273 Expected Value as Integral—Expected Values and Limits— Expected Values and Distributions—Moments—Inequalities— Joint Integrals—Independence and Expected Value—Moment Generating Functions 22. Sums of Independent Random Variables, 282 The Strong Law of Large Numbers—The Weak Law and Moment Generating Functions—Kolmogorov's Zero-One Law—Maximal Inequalities—Convergence of Random Series—Random Taylor Series *
χ 23. The Poisson Process, 297 Characterization of the Exponential Distribution—The Poisson Process—The Poisson Approximation—Other Characterizations of the Poisson Process—Stochastic Processes 24. The Ergodic Theorem,* 310 Measure-Preserving Transformations—Ergodicity—Ergodicity of Rotations—Proof of the Ergodic Theorem—The Continued- Fraction Transformation—Diophantine Approximation CHAPTER 5. CONVERGENCE OF DISTRIBUTIONS 25. Weak Convergence, 327 Definitions—Uniform Distribution Modulo Γ—Convergence in Distribution—Convergence in Probability—Fundamental Theorems—Helly's Theorem—Integration to the Limit 26. Characteristic Functions, 342 Definition—Moments and Derivatives—Independence—Inversion and the Uniqueness Theorem—The Continuity Theorem— Fourier Series* 27. The Central Limit Theorem, 357 Identically Distributed Summands—The Lindeberg and Lyapounov Theorems—Dependent Variables * 28. Infinitely Divisible Distributions,* 371 Vague Convergence—The Possible Limits—Characterizing the Limit 29. Limit Theorems in Rk, 378 The Basic Theorems—Characteristic Functions—Normal Distributions in Rk—The Central Limit Theorem 30. The Method of Moments,* 388 The Moment Problem—Moment Generating Functions—Central Limit Theorem by Moments—Application to Sampling Theory— Application to Number Theory
CONTENTS CHAPTER 6. DERIVATIVES AND CONDITIONAL PROBABILITY 31. Derivatives on the Line,* 400 The Fundamental Theorem of Calculus—Derivatives of Integrals —Singular Functions—Integrals of Derivatives—Functions of Bounded Variation 32. The Radon-Nikodyni Theorem, 419 Additive Set Functions—The Hahn Decomposition—Absolute Continuity and Singularity—The Main Theorem 33. Conditional Probability, 427 The Discrete Case—The General Case—Properties of Conditional Probability—Difficulties and Curiosities—Conditional Probability Distributions 34. Conditional Expectation, 445 Definition—Properties of Conditional Expectation—Conditional Distributions and Expectations—Sufficient Subfields*— Minimum-Variance Estimation * 35. Martingales, 458 Definition-Submartingdales—Gambling—Functions of Martingales—Stopping Times —Inequalities —Convergence Theorems—Applications: Derivatives—Likelihood Ratios— Reversed Martingales—Applications: de Finetti's Theorem— Bayes Estimation—A Central Limit Theorem* CHAPTER 7. STOCHASTIC PROCESSES 36. Kolmogorov's Existence Theorem, 482 Stochastic Processes —Finite-Dimensional Distributions—Product Spaces—Kolmogorov's Existence Theorem—The Inadequacy of &T—A Return to Ergodic Theory*—The Hewitt-Savage Theorem* 37. Brownian Motion, 498 Definition—Continuity of Paths—Measurable Processes— Irregularity of Brownian Motion Paths—The Strong Markov Property—The Reflection Principle—SkorohodEmbedding*— Invariance"
xii CONTENTS 38. Nondenumerable Probabilities,* 526 Introduction —Definitions—Existence Theorems—Consequences of Separability APPENDIX 536 NOTES ON THE PROBLEMS 552 BIBLIOGRAPHY 581 LIST OF SYMBOLS 585 INDEX 587
Probability and Measure
CHAPTER 1 Probability SECTION 1. BOREL'S NORMAL NUMBER THEOREM Although sufficient for the development of many interesting topics in mathematical probability, the theory of discrete probability spaces1 does not go far enough for the rigorous treatment of problems of two kinds: those involving an infinitely repeated operation, as an infinite sequence of tosses of a coin, and those involving an infinitely fine operation, as the random drawing of a point from a segment. A mathematically complete development of probability, based on the theory of measure, puts these two classes of problem on the same footing, and as an introduction to measure-theoretic probability it is the purpose of the present section to show by example why this should be so. The Unit Interval The project is to construct simultaneously a model for the random drawing of a point from a segment and a model for an infinite sequence of tosses of a coin. The notions of independence and expected value, familiar in the discrete theory, will have analogues here, and some of the terminology of the discrete theory will be used in an informal way to motivate the development. The formal mathematics, however, which involves only such notions as the length of an interval and the Riemann integral of a step function, will be entirely rigorous. All the ideas will reappear later in more general form. Let il denote the unit interval (0,1]; to be definite, take intervals open on the left and closed on the right. Let ω denote the generic point of Ω. Denote the length of an interval / = (a, b] by |/|: (1.1) \I\ = \(a,b]\ = b-a. For the discrete theory, presupposed here, see for example the first half of Volume I of Feller. (Names in capital letters refer to the bibliography on p. 581 ) 1
2 PROBABILITY If (ΐ·3) a=\ji,= ύκΛ], 1=1 1=1 where the intervals /, = (a,-, 6,] are disjoint [Α3Ϋ and are contained in Ω, assign to A the probability (ΐ·3) p(a)= £i/,i= £(&,.-«,). ί=1 /=I It is important to understand that in this section P(A) is defined only if A is a finite disjoint union of subintervals of (0,1]—never for sets A of any other kind. If Л and β are two such finite disjoint unions of intervals, and if A and В are disjoint, then A U В is a finite disjoint union of intervals and (1.4) P(AUB)=P(A)+P(B). This relation, which is certainly obvious intuitively, is a consequence of the additivity of the Riemann integral: (1.5) (\f{a>)+g{a>))dcu= Cf(io)du>+ Cg(u>)dio. Jo Jo -Ό If /(ω) is a step function taking value c; in the interval (x _ l5 jc;], where 0=дго< X\ < · · · <X/i = 1, then its integral in the sense of Riemann has the value к (1.6) j !{ω)άω= Σ c,(*,.-*,_,). '° /=1 If f = IA and g = IB are the indicators [A5] of A and B, then (1.4) follows from (1.5) and (1.6), provided A and В are disjoint. This also shows that the definition (1.3) is unambiguous—note that A will have many representations of the form (1.2) because (a, b] U (b, с] = (а, с]. Later these facts will be derived anew from the general theory of Lebesgue integration.* According to the usual models, if a radioactive substance has emitted a single α-particle during a unit interval of time, or if a single telephone call has arrived at an exchange during a unit interval of time, then the instant at which the emission or the arrival occurred is random in the sense that it lies in (1.2) with probability (1.3). Thus (1.3) is the starting place for the fA notation [An] refers to paragraph η of the appendix beginning on p. 536; this is a collection of mathematical definitions and facts required in the text. *Passages in small type concern side issues and technical matters, but their contents are sometimes required later.
SECTION 1. BOREL'S NORMAL NUMBER THEOREM 3 description of a point drawn at random from the unit interval: Ω is regarded as a sample space, and the set (1.2) is identified with the event that the random point lies in it. The definition (1.3) is also the starting point for a mathematical representation of an infinite sequence of tosses of a coin. With each ω associate its nonterminating dyadic expansion (1.7) «= Σ -ψ1 = Α(")<12(ω)..., n = l Δ each dn(w) being 0 or 1 [A31]. Thus (1.8) (<*,(«),<*,(«),...) is the sequence of binary digits in the expansion of ω. For definiteness, a point such as |=.1000... =.0111..., which has two expansions, takes the nonterminating one; 1 takes the expansion .111... . 1 1 0 10 1 Graph of (1,(ω) Graph of d2(u) Imagine now a coin with faces labeled 1 and 0 instead of the usual heads and tails. If ω is drawn at random, then (1.8) behaves as if it resulted from an infinite sequence of tosses of a coin. To see this, consider first the set of ω for which ίί,-(ω) = w, for i = Ι,.,.,η, where u,,...,u„ is a sequence of O's and l's. Such an ω satisfies n и "и- 1 γ4ΐτ<ω< УЦ+ У Л, 1—1 г)1 — 1—1 г\1 1—1 П1 ' 1-1 Δ i= l z i=n+\ Δ where the extreme values of ω correspond to the case ά^ω) = 0 for i > η and the case dt((o) = 1 for i > n. The second case can be achieved, but since the binary expansions represented by the ά^ω) are nonterminating—do not end in O's—the first cannot, and ω must actually exceed Σ"=1μ,/2'. Thus (1.9) [«:^.(«) = Ы<>1 = 1,...,/1] = (е|, Σ^ί + ψ ■
4 PROBABILITY The interval here is open on the left and closed on the right precisely because the expansion (1,7) is the nonterminating one. In the model for coin tossing the set (1.9) represents the event that the first η tosses give the outcomes w,,...,w„ in sequence. By (1.3) and (1.9), (1.10) P[w.di(co) = ui,i=l,...,n] = ±, which is what probabilistic intuition requires. oo . ci , io , и OOP t 001 , 010 , 0Ί [ 100 , 101 , 110 , 111 Decompositions by dyadic intervals The intervals (1.9) are called dyadic intervals, the endpcints being adjacent dyadic rationale k/2r and (k + l)/2" with the same denominator, and η is the rank or order of the interval. For each η the 2" dyadic intervals of rank η decompose or partition the unit interval. In the passage from the partition for η to that for η + 1, each interval (1.9) is split into two parts of equal length, a left half on which ί/π + ,(ω) is 0 and a right half on which ί/„ + ](ω) is 1. For и = 0 and for и = 1, the set [ω: ί/„ + 1(ω) = и] is thus a disjoint union of 2" intervals of length l/2" + 1 and hence has probability \: Ρ[ω: dn(w) = u] = \ for all n. Note that ί/,(ω) is constant over each dyadic interval of rank / and that for η > i each dyadic interval of rank η is entirely contained in a single dyadic interval of rank :'. Therefore, ά((ω) is constant over each dyadic interval of rank η if i < n. The probabilities of various familiar events can be written down immediately. The sum Σ"= λά^ω) is the number of l's among ί/,(ω),..., dn(w), to be thought of as the number of heads in η tosses of a fair coin. The usual binomial formula is (1.11) ω: Ε^,(ω) = к /=1 ΐ)ψ> °*k*»· This follows from the definitions: The set on the left in (1.11) is the union of those intervals (1.9) corresponding to sequences и,,...,«„ containing к l's and n-k O's; each such interval has length 1/2" by (1.10) and there are \"A of them, and so (1.11) follows from (1.3).
SECTION 1. BOREL's NORMAL NUMBER THEOREM 5 The functions άη{ω) can be looked at in two ways. Fixing η and letting ω vary gives a real function dn = dn(·) on the unit interval. Fixing ω and letting η vary gives the sequence (1.8) of O's and l's. The probabilities (1.10) and (1.11) involve only finitely many of the components d-<ia). The interest here, however, will center mainly on properties of the entire sequence (1.8). It will be seen that the mathematical properties of this sequence mirror the properties to be expected of a coin-tossing process that continues forever. As the expansion (1.7) is the nonterminating one, there is the defect that for no ω is (1.8) the sequence (1,0,0,0,...), for example. It seems clear that the chance should be 0 for the coin to turn up heads on the first toss and tails forever after, so that the absence of (1,0,0,0,...)—or of any other single sequence—should not matter. See on this point the additional remarks immediately preceding Theorem 1.2. The Weak Law of Large Numbers In studying the connection with coin tossing it is instructive to begin with a result that can, in fact, be treated within the framework of discrete probability, namely, the weak law of large numbers: Theorem 1.1. For each e? (1.12) lim Ρ ω: 1 " 1 -£>,.(ω)- J >e = 0. Interpreted probabilistically, (1.12) says that if η is large, then there is small probability that the fraction or relative frequency of heads in η tosses will deviate much from \, an idea lying at the base of the frequency conception of probability. As a statement about the structure of the real numbers, (1.12) is also interesting arithmetically. Since άι(ω) is constant over each dyadic interval of rank η if i <n, the sum Σ"=1ίί,(ω) is also constant over each dyadic interval of rank n. The set in (1.12) is therefore the union of certain of the intervals (1.9), and so its probability is well denned by (1.3). With the Riemann integral in the role of expected value, the usual application of Chevyshev's inequality will lead to a proof of (1.12). The argument becomes simpler if the dn(w) are replaced by the Rademacher functions, (1.13) r„(«) = 2rf„(«)-l = + 1 ifrf„(«) = l, -1 ifrf„(«)=0. +The standard e and δ of analysis will always be understood to be positive.
6 PROBABILITY Graph of r,fcjj 1 0 1 1 Graph of r3(tj) Consider the partial sums (1.14) i„(«)= Ση{ω). i=l Since Σ"=,ίί,(ω) = (sn(w) + n)/2, (1.12) with e/2 in place of e is the same thing as (1.15) lim Ρ η ->oo ω: «5«(ω> > e = 0. This is the form in which the theorem will be proved. The Rademacher functions have themselves a direct probabilistic meaning. If a coin is tossed successively, and if a particle starting from the origin performs a random walk on the real line by successively moving one unit in the positive or negative direction according as the coin falls heads or tails, then rf(<a) represents the distance it moves on the ith step and 5„(ω) represents its position after η steps. There is also the gambling interpretation: If a gambler bets one dollar, say, on each toss of the coin, r,(w) represents his gain or loss on the ith play and sn(w) represents his gain or loss in η plays. Each dyadic interval of rank /' - 1 splits into two dyadic intervals of rank /; Γ,(ω) has value - 1 on one of these and value + 1 on the other. Thus r/ω) is — 1 on a set of intervals of total length { and + 1 on a set of total length \. Hence }^η(ω)άω = 0 by (1.6), and (1.16) / 5„(ω) άω = 0 by (1.5). If the integral is viewed as an expected value, then (1.16) says that the mean position after η steps of a random walk is 0. Suppose that i <j. On a dyadic interval of rank /- 1, r,(w) is constant and r (ω) has value -1 on the left half and +1 on the right. The product
SECTION 1. BOREL's NORMAL NUMBER THEOREM 7 Γι(ω)η(ω) therefore integrates to 0 over each of the dyadic intervals of rank j — 1, and so (1.17) ( η(ω)η(ω)άω = 0, i Φ]. This corresponds to the fact that independent random variables are uncorrected. Since rf (ω) = 1, expanding the square of the sum (1.14) shows that (1.18) j 5^(ω) άω = п. This corresponds to the fact that the variances of independent random variables add. Of course (1.16), (1.17), and (1.18) stand on their own, in no way depend on any probabilistic interpretation. Applying Chebyshev's inequality in a formal way to the probability in (1.15) now leads to (1.19) Ρ[ω: |*„(ω)| > ne] < -~-2 f\2n(w) dw = -^. The following lemma justifies the inequality. Let / be a step function as in (1.6): /(w) = c; for ω e(x;._,, хД where 0 =x0 < · · · <xk = 1. Lemma. If fis a nonnegative step function, then [ω: /(ω) > a] is for a > 0 a finite union of intervals and (1.20) P[<o:f(a,)>a]<lff(w)dw. The shaded region has area αΡ[ω: /(ω)» α}. Proof. The set in question is the union of the intervals (χ;_,,χ;] for which Cj > a. If Σ denotes summation over those / satisfying c; > a, then Ρ[ω: f(w)>a] = L'(Xj-Xj_]) by the definition (1.3). On the other hand,
8 PROBABILITY since the c; are all nonnegative by hypothesis, (1.6) gives f /(ω) άω = Σ Cj(Xj ~Xj.,) > Е'сДXj -Xj-i) Hence (1.20). ■ Taking a = n2e2 and /(ω)-ί^(ω) in (1.20) gives (1.19). Clearly, (1.19) implies (1.15), and as already observed, this in turn implies (1.12). The Strong Law of Large Numbers It is possible with a minimum of technical apparatus to prove a stronger result that cannot even be formulated in the discrete theory of probability. Consider the set (1.21) N = 1 " 1 ω: Urn - £<(ω) η ^лш> 2 consisting of those ω for which the asymptotic relative frequency* of 1 in the sequence (1.8) is \. The points in (1.21) are called normal numbers. The idea is to show that a real number ω drawn at random from the unit interval is "practically certain" to be normal, or that there is "practical certainty" that 1 occurs in the sequence (1.8) of tosses with asymptotic relative frequency \. It is impossible at this stage to prove that P(N) = 1, because N is not a finite union of intervals and so has been assigned no probability. But the notion of "practical certainty" can be formalized in the following way. Define a subset Л of Ω to be negligible^ if for each positive e there exists a finite or countable* collection /,, /2,... of intervals (they may overlap) satisfying (1.22) Aa\JIk к and (1.23) ΣΙ/*Ι<*. к A negligible set is one that can be covered by intervals the total sum of whose lengths can be made arbitrarily small. If P(A) is assigned to such an *The frequency of 1 (the number of occurrences of it) among d,(c<j),...,dn{m) is Σ,"=ιίί,(ω), the lelatiue frequency is η-1Σ,"=1£*,(ω), and the asymptotic relative frequency is the limit in (1.21). +The term negligible is introduced for the purposes of this section only. The negligible sets will reappear later as the sets of Lebesgue measure 0. %СоиШаЫу infinite is unambiguous. Countable will mean finite or countably infinite, although it will sometimes for emphasis be expanded as here to finite or countable.
SECTION 1. BOREL'S NORMAL NUMBER THEOREM 9 A in any reasonable way, then for the Ik of (1.22) and (1.23) it ought to be true that P(A) < ZkP(Ik) = Zk\Ik\ < e, and hence P(A) ought to be 0. Even without any assignment of probability at all, the definition of negligibility can serve as it stands as an explication of "practical impossibility" and "practical certainty": Regard it as practically impossible that the random ω will lie in A if A is negligible, and regard it as practically certain that ω will lie in A if its complement Ac [Al] is negligible. Although the fact plays no role in the next proof, for an understanding of negligibility observe first that a finite or countable union of negligible sets is negligible. Indeed, suppose that AX,A2,... are negligible. Given e, for each η choose intervals lnl, In2,... such that An с (J*. Ink and Zk\Ink\ < e/2". All the intervals Ink taken together form a countable collection covering (J„ An, and their lengths add to ZnZk\lnk\ < E„e/2" = e. Therefore, U„ An is negligible. A set consisting of a single point is clearly negligible, and so every countable set is also negligible. The rationale for example form a negligible set. In the coin-tossing model, a single point of the unit interval has the role of a single sequence of 0's and l's, or of a single sequence of heads and tails. It corresponds with intuition that it should be "practically impossible" to toss a coin infinitely often and realize any one particular infinite sequence set down in advance. It is for this reason not a real shortcoming of the model that for no ω is (1.8) the sequence (1,0,0,0,...). In fact, since a countable set is negligible, it is not a shortcoming that (1.8) is never one of the countably many sequences that end in 0's. Theorem 1.2. The set of normal numbers has negligible complement. This is Borel's normal number theorem,1' a special case of the strong law of large numbers. Like Theorem 1.1, it is of arithmetic as well as probabilistic interest. The set Nc is not countable: Consider a point ω for which (ί/,(ω),ά2(ω),...) = (1,1,ы3,1,1,ы6,...)—that is, a point for which d,(w) = 1 unless i is a multiple of 3. Since n~lZ"=]dj(w) > f, such a point cannot be normal. But there are uncountably many such points, one for each infinite sequence (u3,u6,...)oi 0's and l's. Thus one cannot prove Nc negligible by proving it countable, and a deeper argument is required. Proof of Theorem 1.2. Gearly (1.21) and (1.24) N = ω: lim -5„(ω)=0 П tEmiIe Borel: Sur les probabilites denombrables et leurs applications arithmetiques, Circ. Mat d. Palermo, 29 (1909), 247-271. See Dudley for excellent historical notes on analysis and probability.
10 PROBABILITY define the same set (see (1.14)). To prove Nc negligible requires constructing coverings that satisfy (1.22) and (1.23) for A =NC. The construction makes use of the inequality (1.25) P[a>:\s„(a>)\>ne]i^-ifs*n(a>)d<». This follows by the same argument that leads to the inequality in (1.19)—it is only necessary to take /(ω) = s*(a)) and α = nAeA in (1.20). As the integral in (1.25) will be shown to have order n2, the inequality is stronger than (1.19). The integrand on the right in (1.25) is (1.26) 5π4(ω)= Σ''α(ω)Λβ(ω)^(ω)Λδ(ω), where the four indices range independently from 1 to л. Depending on how the indices match up, each term in this sum reduces to one of the following five forms, where in each case the indices are now distinct: r,» = l, г,2(«)г/(«) = 1, (1.27) {г,2(«)гД«К(«) = г;(«)гк(«), г/(«)гД«)=г1.(«)г/(й»), /,(ω)>';(ωΚ(ω)Λ,(ω). If, for example, к exceeds /, j, and /, then the last product in (1.27) integrates to 0 over each dyadic interval of rank к - I, because η(ω)η(ω)η(ω) is constant there, while rk(w) is -1 on the left half and +1 on the right. Adding over the dyadic intervals of rank к — 1 gives f rl-(u»)r/(ftl)rfc(ftl)r,(ftl)rfftl=0. yo This holds whenever the four indices are distinct. From this and (1.17) it follows that the last three forms in (1.27) integrate to 0 over the unit interval; of course, the first two forms integrate to 1. The number of occurrences in the sum (1.26) of the first form in (1.27) is n. The number of occurrences of the second form is Ъп{п — 1), because there are η choices for the α in (1.26), three ways to match it with β, y, or δ, and n-1 choices for the value common to the remaining two indices. A term-by- term integration of (1.26) therefore gives (1.28) [1sl(a>)da> = n + 3n(n-l) <3n2,
SECTION 1. BOREL's NORMAL NUMBER THEOREM 11 and it follows by (1.25) that (1.29) ω: -*„(") > e П2€А Fix a positive sequence {e„} going to 0 slowly enough that the series Е„е~4и-2 converges (take en = n~,/s, for example). If An = [ω: |«_15„(ω)|> e„], then P(A„) < 3e;4«"2 by (1.29), and so ΣπΡ(Αη) < oo. If, for some m, ω lies in Acn for all η greater than or equal to m, then \n~ 'ί„(ω)| < en for n>m, and it follows that ω is normal because en -> 0 (see (1.24)). In other words, for each m, f) ™-mAc„ c^» which is the same thing as /Vе с U°n=m A„. This last relation leads to the required covering: Given e, choose m so that YTn=mP(An) <e. Now Ar is a finite disjoint union \Jk Ink of intervals with Lk\Ink\ = P(An\ and therefore U^=m An is a countable union iX~m U* Ink °^ intervals (not disjoint, but that does not matter) with L~=mLk\Ink\ = L'"n,mP(AJ<e. The intervals Ink (n>m, k>\) provide a covering of /Vе of the kind the definition of negligibility calls for. ■ Strong Law Versus Weak Theorem 1.2 is stronger than Theorem 1.1. A consideration of the forms of the two propositions will show that the strong law goes far beyond the weak law. For each η let /„(ω) be a step function on the unit interval, and consider the relation (1.30) lim P[«: |/„(«)| 2:e]=0 together with the set (1.31) «: lim/„(«)= 0]. n-»°° J If /„(ω) = Η_15„(ω), then (1.30) reduces to the weak lav/ (1.15), and (1.31) coincides with the set (1.24) of normal numbers. According to a general result proved below (Theorem 5.2(ii)), whatever the step functions /„(ω) may be, if the set (1.31) has negligible complement, then (1.30) holds for each positive e. For this reason, a proof of Theorem 1.2 is automatically a proof of Theorem 1.1. The converse, however, fails: There exist step functions /„(ω) that satisfy (1.30) for each positive e but for which (1.31) fails to have negligible complement (Example 5.4). For this reason, a proof of Theorem 1.1 is not automatically a proof of Theorem 1.2; the latter lies deeper and its proof is correspondingly more complex. Length According to Theorem 1.2, the complement /Vе of the set of normal numbers is negligible. What if N itself were negligible? It would then follow that (0,1] = /V U /Vе was negligible as well, which would disqualify negligibility as an explication of "practical impossibility," as a stand-in for "probability zero." The proof below of the "obvious" fact that an interval of positive
12 PROBABILITY length is not negligible (Theorem 1.3(ii)), while simple enough, does involve the most fundamental properties of the real number system. Consider an interval I = (a,b] of length \I\ = b-a; see (1.1). Consider also a finite or infinite sequence of intervals Ik = (ak,bk]. While each of these intervals is bounded, they need not be subintervals of (0,1]. Theorem 1.3. (i) // (J^ Ik c/, and the Ik are disjoint, then Lk\Ik\ < |/|. (ii) /// с \J( Ik {the Ik need not be disjoint), then |/| < Σ J/J. (iii) /// = Ok Ik, and the lk are disjoint, then \l\ = *Lk\Ik\. Proof. Of course (iii) follows from (i) and (ii). Proof of (i): Finite case. Suppose there are η intervals. The result being obvious for η = 1, assume that it holds for η - 1. If an is the largest among au...,an (this is just a matter of notation), then U"Z\(ak,bk]<z (a,an\ so that L"kZ.\(bk ~ak) <an -a by the induction hypothesis, and hence Lk=](bk - ak) < (an - a) + (bn -an)<b-a. Infinite case. If there are infinitely many intervals, each finite subcollection satisfies the hypotheses of (i), and so L"k = i(bk - ak) < b - a by the finite case. But as η is arbitrary, the result follows. Proof of (ii): Finite case. Assume that the result holds for the case of η - 1 intervals and that (a,b]c \Jk = l(ak,bk]. Suppose that an <b <bn (notation again). If an < a, the result is obvious. Otherwise, (а, ап] с UkZ\(ak,bk], so that Z"kZ\(bk-ak)>an-a by the induction hypothesis and hence L"k= ,(b*. - ak) > (an -a) + (bn -an)>b -a. The finite case thus follows by induction. Infinite case. Suppose that (a, b] с \J*k = i(ak,bk]. If 0 <e <b -a, the open intervals (ak, bk + el~k) cover the closed interval [a + e,b], and it follows by the Heine-Borel theorem [A13] that [a + e,b]a \J"k=l(ak,bk +e2~k) for some n. But then (a +e,b]c U"k=l(ak,bk + e2~k], and by the finite case, b -(a+e)<L"k = i(bk + e2~k -ak)<Lk'=l(bk-ak) + e. Since e was arbitrary, the result follows. ■ Theorem 1.3 will be the starting point for the theory of Lebesgue measure as developed in Sections 2 and 3. Taken together, parts (i) and (ii) of the theorem for only finitely many intervals Ik imply (1.4) for disjoint A and B. Like (1.4), they follow immediately from the additivity of the Riemann integral; but the point is to give an independent development of which the Riemann theory will be an eventual by-product. To pass from the finite to the infinite case in part (i) of the theorem is easy. But to pass from the finite to the infinite case in part (ii) involves compactness, a profound idea underlying all of modern analysis. And it is part (ii) that shows that an interval / of positive length is not negligible: |/| is
SECTJON 1. BOREL'S NORMAL NUMBER THEOREM 13 a positive lower bound for the sum of the lengths of the intervals in any covering of /. The Measure Theory of Diophantine Approximation* Diophantine approximation has to do with the approximation of real numbers χ by rational fractions p/q. The measure theory of Diophantine approximation has to do with the degree of approximation that is possible if one disregards negligible sets of real x. For each positive integer q, χ must lie between some pair of successive multiples of 1 /q, so that for some p, \x— p/q\< 1 /q. Since for each q the intervals ^ (f-£■?♦£] decompose the line, the error of approximation can be further reduced to \/2q: For each q there is a p such that \x -p/q\< l/2q. These observations are of course trivial. But for "most" real numbers χ there will be many values of ρ and q for which x lies very near the center of the interval (1.32), so that p/q is a very sharp approximation to x. Theorem 1.4. If χ is irrational, there are infinitely many irreducible fractions ρ / q such that (i.33) μ-J ι This famous theorem of Dirichlet says that for infinitely many ρ and q, χ lies in (p/q ~ 1/q2, P/q + l/<72) and hence is indeed very near the center of (1.32). Proof. For a positive integer Q, decompose [0,1) into the Q subintervals [(/- l)/Q,i/Q), i = l,...,Q. The points (fractional parts) {qx} = qx -[qx\ for q = 0,1,...,Q lie in [0,1), and since there are Q + 1 points1 and only Q subintervals, it follows (Dirichlet's drawer principle) that some subinterval contains more than one point. Suppose that {q'x} and {q"x} lie in the same subinterval and 0 < q' < q" < Q. Take q=q"-q' and ρ = [q"x\ -\q'x\; then \<q<Q and \qx-p\ = \{q"x)-{q'x}\ <\/Q: (1.34) 1 1 < -TT < q qQ J -2 If ρ and q have any common factors, cancel them; this will not change the left side of (1.34), and it will decrease q. For each Q, therefore, there is an irreducible p/q satisfying (1.34).* Suppose there are only finitely many irreducible solutions of (1.33), say P\/ q\,---,pm /qm· Since χ is irrational, the I* — pk /qk\ are all positive, and it is possible to choose Q so that Q~l is smaller than each of them. But then the p/q of (1.34) is a solution of (1.33), and since |* -p/q\ < 1/Q, there is a contradiction. ■ *This topic may be omitted. +Although the fact is not technically necessary to the proof, these points are distinct: (q'x) = (q"x) implies iq" - q')x = [q"x J - [q'x], which in turn implies that χ is rational unless q' = q". *This much of the proof goes through even if χ is rational.
14 PROBABILITY In the measure theory of Diophantine approximation, one looks at the set of real χ having such and such approximation properties and tries to show that this set is negligible or else that its complement is. Since the set of rationals is negligible, Theorem 1.4 implies such a result: Apart from a negligible set of x, (1.33) has infinitely many irreducible solutions. What happens if the inequality (1.33) is tightened? Consider (1.35) 1 < <72<K<7) ' and let A consist of the real χ for which (1.35) has infinitely many irreducible solutions. Under what conditions on φ will Αψ have negligible complement? If cp(q)< 1, then (1.35) is weaker than (1.33): <p(<7)> 1 in the interesting cases. Since χ satisfies (1.35) for infinitely many irreducible p/q if and only if χ - [χ J does, Αφ may as well be redefined as the set of χ in (0,1) (or even as the set of irrational χ in (0, D) for which (1.35) has infinitely many solutions. Theorem 1.5. Suppose that ψ is positive and nondecreasing. If (1.36) Σ—7~τ=^ V , Ml) then Αφ has negligible complement. Theorem 1.4 covers the case cp(q) = 1. Although this is the natural place to state Theorem 1.5 in its general form, the proof, which involves continued fractions and the ergodic theorem, must be postponed; see Section 24, p. 324. The converse, on the other hand, has a very simple proof. Theorem 1.6. Suppose that ψ is positive. If q Ml) then Αφ is negligible. Proof. Given e, choose <70 so that Σ, b(? 1 /q<p(q) < e/A. If χ ^Αφ, then (1.35) holds for some q S: q0, and since 0 < χ < 1, the corresponding ρ lies in the range 0 <p < q. Therefore, ArcU и(|-йгг.#- The right side here is a countable union of intervals covering Αφ, and the sum of their lengths is Σ Σ-*-- Σ ^i±i>-< L^~<e яъяо p-0 «Ml) дъд0<1М<1) чъЧоМ<1) Thus A satisfies the definition ((1.22) and (1.23)) of negligibility. ■
SECTION 1. BOREL'S NORMAL NUMBER THEOREM 15 If <pi(q)=l, then (1.36) holds and hence Αφ has negligible complement (as follows also from Theorem 1.4). If φ2(ΐ) = <7£, however, then (1.37) holds and Αφ itself is negligible Outside the negligible set A^UA , therefore, \x-p/q\< \/q2 has infinitely many irreducible solutions but |* —p/q\ < l/q2+c has only finitely many. Similarly, since Lqi/(q log q) diverges but Lql/(q log1+£<7) converges, outside a negligible set \x-p/q\<l/(q2 \ogq) has infinitely many irreducible solutions but \x —p/q\ < l/(q2 \og1+tq) has only finitely many. Rational approximations to χ obtained by truncating its binary (or decimal) expansion are very inaccurate: see Example 4.17. The sharp rational approximations to χ come from truncation of its continued-fraction expansion: see Section 24. PROBLEMS Some problems involve concepts not required for an understanding of the text, or concepts treated only in later sections; there are no problems whose solutions are used in the text itself. An arrow Τ points back to a problem (the one immediately preceding if no number is given) the solution and terminology of which are assumed. See Notes on the Problems, p. 552. 1.1. (a) Show that a discrete probability space (see Example 2.8 for the formal definition) cannot contain an infinite sequence A1,A2,... of independent events each of probability \. Since An could be identified with heads on the nth toss of a coin, the existence of such a sequence would make this section superfluous. (b) Suppose that 0 <pn < 1, and put a„ = min{pn, 1 -p„]. Show that, if Σ„α„ diverges, then no discrete probability space can contain independent events Al,A2,... such that An has probability pn. 1.2. Show that N and Nc are dense [A15] in (0,1]. 1.3. ΐ Define a set A to be trifling* if for each e there exists a finite sequence of intervals /,. satisfying (1.22) and (1.23). This definition and the definition of negligibility apply as they stand to all sets on the real line, not just to subsets of (0,11 (a) Show that a trifling set is negligible. (b) Show that the closure of a trifling set is also trifling. (c) Find a bounded negligible set that is not trifling. (d) Show that the closure of a negligible set may not be negligible. (e) Show that finite unions of trifling sets are trifling but that this can fail for countable unions. 1.4. Τ For i = 0,..., r - 1, let Ar(i) be the set of numbers in (0,1] whose nonter- minating expansions in the base r do not contain the digit i. (a) Show that Ar(i) is trifling. (b) Find a trifling set A such that every point in the unit interval can be represented in the form χ +y with χ and у in A. tLike negligible, trifling is a nonce word used only here. The trifling sets are exactly the sets of content 0: See Problem 3.15
16 PROBABILITY (с) Let Ar(iu ■■■,ik) consist of the numbers in the unit interval in whose base-r expansions the digits г',,..., ik nowhere appear consecutively in that order. Show that it is trifling. What does this imply about the monkey that types at random? 1.5. Τ The Cantor set С can be defined as the closure of /1,(1) (a) Show that С is uncountable but trifling. (b) From [0,1] remove the open middle third ({, f); from the remainder, a union of two closed intervals, remove the two open middle thirds (£, §) and (|, §). Show that С is what remains when this process is continued ad infinitum. (c) Show that С is perfect [A15] 1.6. Put M(t) = lle's"Mdm, and show by successive differentiations under the integral that (1.38) M<*>(0)- flskn(<o)da> Over each dyadic interval of rank n, sn(a>) has a constant value of the form ± 1 + 1 + ■ ■ ■ + 1, and therefore M(t) = 2~Texp t(± 1 + 1 ± · ■ ± 1), where the sum extends over all 2" л-long sequences of + l's and — l's. Thus (1.39) M(t) = (e'+2e '] = (cosh г)" Use this and (1.38) to give new proofs of (1.16), (1.18), and (1.28). (This, the method of moment generating functions, will be investigated systematically in Section 9.) 1.7. Τ By an argument similar to that leading to (1.39) show that the Rademacher functions satisfy L 1 exp о k=\ ^ ~ 11 2 k=\ Δ η = Π cosak. k=\ Take ak = tl k, and from Σ^=1^(ω)2 * = 2ω — 1 deduce (1.40) «Ji-noos' k=\ ^ by letting η -»oo inside the integral above. Derive Vieta's formula 2. π J2 yjl + ΐϊ д/2 + J2 + fi 1.8. A number ω is normal in the base 2 if and only if for each positive e there exists an n0(e, ω) such that |η-1Σ7=1^,(ω) - -jl<e for all η exceeding n0(e, ω).
SECTION 2. PROBABILITY MEASURES 17 Theorem 1.2 concerns the entire dyadic expansion, whereas Theorem 1.1 concerns only the beginning segment. Point up the difference by showing that for e < j the n0(e, ω) above cannot be the same for all ω in N—in other words, η-1Σ"„ι^,·(ω) converges to \ for all ω in N, but not uniformly. But see Problem 13.9. 1.9. 1.3 Τ (a) Using the finite form of Theorem 1.3(ii), together with Problem 1.3(b), show that a trifling set is nowhere dense [A15]. (b) Put В = \J„(r„ -2~n~2, rn + 2~n~2], where rl,r2,... is an enumeration of the rationals in (0,1]. Show that (0,1] — В is nowhere dense but not trifling or even negligible. (c) Show that a compact negligible set is trifling 1.10. Τ A set of the first category [A15] can be, represented as a countable union of nowhere dense sets; this is a topological notion of smallness, just as negligibility is a metric notion of smallness. Neither condition implies the other: (a) Shov that the nonnegligible set N of normal numbers is of the first category by proving that Am = Γ\"^η[ω- \n~lsn(a>)\ < \] is nowhere dense and Nc (b) According to a famous theorem of Baire, a nonempty interval is not of the first category. Use this fact to prove that the negligible set Nc = (0,1] - N is not of the first category. 1.11. Prove: (a) If χ is rational, (1.33) has only finitely many irreducible solutions (b) Suppose that <p(q)> 1 and (1.35) holds for infinitely many pairs p,q but only for finitely many relatively prime ones. Then χ is rational. (c) If φ goes to infinity too rapidly, then Αφ is negligible (Theorem 1.6). But however rapidly φ goes to infinity, A is nonempty, even uncountable. Hint- Consider χ = ΣΧ=11/2α<Α:) for integral a(k) increasing very rapidly to infinity. SECTION 2. PROBABILITY MEASURES Spaces Let Ω be an arbitrary space or set of points ω. In probability theory Ω consists of all the possible results or outcomes ω of an experiment or observation. For observing the number of heads in η tosses of a coin the space Ω is {0,1,..., n); for describing the complete history of the η tosses Ω is the space of all 2" л-long sequences of H's and T's; for an infinite sequence of tosses Ω can be taken as the unit interval as in the preceding section; for the number of α-particles emitted by a substance during a unit interval of time or for the number of telephone calls arriving at an exchange Ω is {0,1,2,...}; for the position of a particle Ω is three-dimensional Euclidean space; for describing the motion of the particle Ω is an appropriate space of functions; and so on. Most Ω'β to be considered are interesting from the point of view of geometry and analysis as well as that of probability.
18 PROBABILITY Viewed probabilistically, a subset of Ω is an event and an element ω of Ω is a sample point. Assigning Probabilities In setting up a space Ω as a probabilistic model, it is natural to try and assign probabilities to as many events as possible. Consider again the case Ω = (0,1] —the unit interval. It is natural to try and go beyond the definition (1.3) and assign probabilities in a systematic way to sets other than finite unions of intervals. Since the set of nonnormal numbers is negligible, for example, one feels it ought to have probability 0. For another probabilistically interesting set that is not a finite union of intervals, consider oo (2.1) (J [ω: -a <s,(<u),..., 5„_,(ω) <b,sn(a>)= -a], n = \ where a and b are positive integers. This is the event that the gambler's fortune reaches —a before it reaches +b; it represents ruin for a gambler with a dollars playing against an adversary with b dollars, the rule being that they play until one or the other runs out of capital. The union in (2.1) is countable and disjoint, and for each η the set in the union is itself a union of certain of the intervals (1.9). Thus (2.1) is a countably infinite disjoint union of intervals, and it is natural to take as its probability the sum of the lengths of these constituent intervals. Since the set of normal numbers is not a countable disjoint union of intervals, however, this extension of the definition of probability would still not cover all the interesting sets (events) in (0,1]. It is, in fact, not fruitful to try to predict just which sets probabilistic analysis will require and then assign probabilities to them in some ad hoc way. The successful procedure is to develop a general theory that assigns probabilities at once to the sets of a class so extensive that most of its members never actually arise in probability theory. That being so, why not ask for a theory that goes all the way and applies to every set in a space Ω? In the case of the unit interval, should there not exist a well-defined probability that the random point ω lies in A, whatever the set A may be? The answer turns out to be no (see p. 45), and it is necessary to work within subclasses of the class of all subsets of a space Ω. The classes of the appropriate kinds—the fields and σ-fields—are defined and studied in this section. The theory developed here covers the spaces listed above, including the unit interval, and a great variety of others. Classes of Sets It is necessary to single out for special treatment classes of subsets of a space il, and to be useful, such a class must be closed under various of the
SECTION 2. PROBABILITY MEASURES 19 operations of set theory. Once again the unit interval provides an instructive example. Example 2.1* Consider the set N of normal numbers in the form (1.24), where 5„(ω) is the sum of the first η Rademacher functions. Since a point ω lies in N if and only if lim„ «-ι5„(ω) = 0, N can be put in the form (22) N = Π U П [ω:\η-\(ω)\<^\. к=1т=1п=т Indeed, because of the very meaning of union and of intersection, ω lies in the set on the right here if and only if for every к there exists an m such that \ri~xsn(b>)\<k~l holds for all n>m, and this is just the definition of convergence to 0—with the usual e replaced by k~l to avoid the formation of an uncountable intersection. Since s„(w) is constant over each dyadic interval of rank n, the set [ω: «_15η(ω)| </с-1] is a finite disjoint union of intervals. The formula (2.2) shows explicitly how N is constructed in steps from these simpler sets. ■ A systematic treatment of the ideas in Section 1 thus requires a class of sets that contains the intervals and is closed under the formation of countable unions and intersections. Note that a singleton [Al] {x} is a countable intersection f\n(x -n'1, x] of intervals. If a class contains all the singletons and is closed under the formation of arbitrary unions, then of course it contains all the subsets of Ω. As the theory of this section and the next does not apply to such extensive classes of sets, attention must be restricted to countable set-theoretic operations and in some cases even to finite ones. Consider now a completely arbitrary nonempty space Ω. A class OF of subsets of Ω is called a field1 if it contains Ω itself and is closed under the formation of complements and finite unions: (i) fieJ; (ii) /4ef implies Ac <ξ <F; (iii) А,В<еЗг implies ЛиВе^ Since Ω and the empty set 0 are complementary, (i) is the same in the presence of (ii) as the assumption 0 e JF". In fact, (i) simply ensures that ^ is nonempty: If A <= !F, then Ac <ξ S^ by (ii) and Ω = A UAC <ξ S^ by (iii). By DeMorgan's law, Α Π В = (Ac U Bc)c and A U В = (Ac Π BCY. If У is closed under complementation, therefore, it is closed under the formation of finite unions if and only if it is closed under the formation of finite intersec- Many of the examples in the book simply illustrate the concepts at hand, but others contain definitions and facts needed subsequently The term algebra is often used in place of field
2U PROBABILITY tions. Thus (iii) can be replaced by the requirement (iii') Л,Вё^ implies АСЛВ<е&. A class if of subsets of Ω is a σ-field if it is a field and if it is also closed under the formation of countable unions: (iv) AX,A2,. . <= if implies Л, UA2L) ·■■ &if. By the infinite form of DeMorgan's law, assuming (iv) is the same thing as assuming (iv') Л,, Л2,... £ if implies Л, Г\А2П ·· · e if. Note that (iv) implies (iii) because one can take AY=A and An = В for η > 2. A field is sometimes called a finitely additive field to stress that it need not be a σ-rield. A set in a given class if is said to be measurable if or to be an S^set. A field or σ-field of subsets of Ω will sometimes be called a field or σ-field in Ω. Example 2.2. Section 1 began with a consideration of the sets (1.2), the finite disjoint unions of subintervals of Ω = (0,1]. Augmented by the empty set, this class is a field ^0: Suppose that A = (a],a\]u ··· U(am,a'm], where the notation is so chosen that a, < ··· <am. If the (a(,a';] are disjoint, then Ar is (0, a,] L)(a\,a2] U ··· u(a'm_,,am] u(a'm, 1] and so lies in ^0 (some of these intervals may be empty, as d-t and ai + i may coincide). If B = (b1,b\]u ·· u(bn,b'„], the (Ь^Ь-] again disjoint, then AnB = U/li U"=ii(fl,·, fl/]n(fcy, by]}; each intersection here is again an interval or else the empty set, and the union is disjoint, and hence А Г) В is in @0. Thus &ϋ satisfies (i), (ii), and (iii'). Although S8{) is a field, it is not a σ-field: It does not contain the singletons {jc}, even though each is a countable intersection Π„(ϊ — η~\ x] of i$()-sets. And ί$() does not contain the set (2.1), a countable union of intervals that cannot be represented as a finite union of intervals. The set (2.2) of normal numbers is also outside &ϋ. ■ The definitions above involve distinctions perhaps most easily made clear by a pair of artificial examples. Example 2.3. Let ^consist of the finite and the cofinite sets (Л being cofinite if Л' is finite). Then if is a field. If Ω is finite, then ^contains all the subsets of Ω and hence is a σ-field as well. If Ω is infinite, however, then if is not a σ-field. Indeed, choose in Ω a set Л that is countably infinite and has infinite complement. (For example, choose a sequence ωλ,ω2,... of distinct points in Ω and take Л = {ω2,ω4,...}.) Then /iiJ, even though
SECTION 2. PROBABILITY MEASURES 21 A is the union, necessarily countable, of the singletons it contains and each singleton is in ^. This shows that the definition of σ-field is indeed more restrictive than that of field. ■ Example 2.4. Let & consist of the countable and the cocountable sets (A being cocountable if Ac is countable). Then F is a σ-field. If Ω is uncountable, then it contains a set A such that A and Ac are both uncountable.1 Such a set is not in &, which shows that even a σ-field may not contain all the subsets of Ω; furthermore, this set is the union (uncountable) of the singletons it contains and each singleton is in &~, which shows that a σ-field may not be closed under the formation of arbitrary unions. ■ The largest σ-fieid in Ω is the power class 2Ω, consisting of all the subsets of Ω; the smallest σ-field consists only of the empty set and Ω itself. The elementary facts about fields and σ-fields are easy to prove: If ^ is a field, then ^Se-J implies А -В=АпВс<в S? aiid A*B = (A - 5) и (B-A)<e.!F. Further, it follows by induction on η that Αν...,Αη^^ implies AX\J ··· иЛ„Ё ^and Axn · ·· пАп <ξ <F. A field is closed under the finite set-theoretic operations, and a σ-field is closed also under the countable ones. The analysis of a probability problem usually begins with the sets of some rather small class stf, such as the class of subintervals of (0,1]. As in Example 2.1, probabilistically natural constructions involving finite and countable operations can then lead to sets outside the initial class &/. This leads one to consider a class of sets that (i) contains sxf and (ii) is a σ-field; it is natural and convenient, as it turns out, to consider a class that has these two properties and that in addition (iii) is in a certain sense as small as possible. As will be shown, this class is the intersection of all the σ-fields containing &/; it is called the σ-field generated by sf and is denoted by σ(&/). There do exist σ-fields containing &/, the class of all subsets of Ω being one. Moreover, a completely arbitrary intersection of σ-fields (however many of them there may be) is itself a σ-field: Suppose that &~= Πβ -^, where θ ranges over an arbitrary index set and each J^ is a σ-field. Then Пе^ for all 0, so that Oef. And A e <F implies for each θ that Ле^ and hence Ac <ξ S^e, so that Ac <ξ &\ If An e &~ for each n, then A„ e S^ for each η and Θ, so that (J„ An lies in each 5^ and hence in ^. Thus the intersection in the definition of σ(&/) is indeed a σ-field containing «of. It is as small as possible, in the sense that it is contained in every σ-field that contains &/: if &/a^ and ^ is a σ-field, then ^ is one of If Ω is the unit interval, for example, take A = (0,5], say. To show that the general uncountable Ω contains such an A requires the axiom of choice [A8]. As a matter of fact, to prove the existence of the sequence alluded to in Example 2.3 requires a form of the axiom of choice, as does even something so apparently down-to-earth as proving that a countable union of negligible sets is negligible. Most of us use the axiom of choice completely unaware of the fact Even Borel and Lebesgue did; see Wagon, pp. 217 ff.
22 PROBABILITY the σ-fields in the intersection defining σ(.οΟ, so that a(^f)<z^f. Thus σ(&ί) has these three properties: (i) icff(j/); (ii) a(stf) is a σ-field; (iii) if £f<z^f and £ is a σ-field, then a(snf) c/ The importance of σ-fields will gradually become clear. Example 2.5. If fF is a σ-field, then obviously a(S^) = $*~.H ssf consists of the singletons, then a(ssf) is the σ-field in Example 2.4. If ssf is empty or sf= {0} or sf= {SI}, then σ(&/) = {0,Sl}. If sf<isf\ then a(j*)<za(j*'). If ^c/C(t(/), then а(я/) = аШ'). ■ Example 2.6. Let J^ be the class of subintervais of SI = (0,1], and define 6B = a{if). The elements of 9$ are called the Borel sets of the unit interval. The field 9$Q of Example 2.2 satisfies J^c 9$Q с ^, and hence a(9SQ) = 60. Since ί# contains the intervals and is a σ-field, repeated finite and countable set-theoretic operations starting from intervals will never lead outside 98. Thus 98 contains the set (2.2) of normal numbers. It also contains for example the open sets in (0,1]: If G is open and xeG, then there exist rationale ax and bx such that χ ^(ax,bx\<^G. But then G = Ox^G(ax,bx]; since there are only countably many intervals with rational endpoints, G is a countable union of elements of J^ and hence lies in 98. In fact, 9$ contains all the subsets of (0,1] actually encountered in ordinary analysis and probability. It is large enough for all "practical" purposes. It does not contain every subset of the unit interval, however; see the end of Section 3 (p. 45). The class 9$ will play a fundamental role in all that follows. ■ Probability Measures A set function is a real-valued function defined on some class of subsets of SI. A set function Ρ on a field !F is a probability measure if it satisfies these conditions: (i) 0<P(A)< lfor A e^; (ii) P(0) = 0, P(Sl) = 1; (iii) if Ai,A2,... is a disjoint sequence of J^sets and if \J^^iAke &~, then1 (2.3) p[\jAk)= LP(Ak). fAs the left side of (2 3) is invariant under permutations of the An, the same must be true of the right side. But in fact, according to Dirichlet's theorem [A26], a nonnegative series has the same value whatever order the terms are summed in.
SECTION 2. PROBABILITY MEASURES 23 The condition imposed on the set function Ρ by (iii) is called countable additivity. Note that, since & is a field but perhaps not a σ-field, it is necessary in (iii) to assume that U"= ι Ak lies in &. If A,,..., An are disjoint ^sets, then Ok = iAk is also in & and (2.3) with An + l =An,2 = ··· = 0 gives (2-4) p[ U^)= LP(Ak). \k=l I L-_l The condition that (2.4) holds for disjoint ^sets is finite additivity; it is a consequence of countable additivity. It follows by induction on η that Ρ is finitely additive if (2.4) holds for η = 2—if P(A и 5) = P(A) +P(B) for disjoint ^sets A and β. The conditions above are redundant, because (i) can be replaced by P(A) > 0 and (ii) by Ρ(Ω,) = 1. Indeed, the weakened forms (together with (iii)) imply that P(il) = Р(Щ + P(0) + P(0) + · · · , so that P(0) = 0, and 1 = Ρ(Ω) = P(A) + P(AC), so that P(A) < 1. Example 2.7. Consider as in Example 2.2 the field бёй of finite disjoint unions of subintervals of Ω = (0,1]. The definition (1.3) assigns to each ^0-set a number—the sum of the lengths of the constituent intervals—and hence specifies a set function Ρ on ^0. Extended inductively, (1.4) says that Ρ is finitely additive. In Section 1 this property was deduced from the additivity of the Riemann integral (see (1.5)). In Theorem 2.2 below, the finite additivity of Ρ will be proved from first principles, and it will be shown that Ρ is, in fact, countably additive—is a probability measure on the field бёй. The hard part of the argument is in the proof of Theorem 1.3, already done; the rest will be easy. ■ If & is a σ-field in Ω and Ρ is a probability measure on &, the triple (Ω, &, P) is called a probability measure space, or simply a probability space. A support of Ρ is any ^set A for which P(A) = 1. Example 2.8. Let & be the σ-field of all subsets of a countable space Ω, and let ρ(ω) be a nonnegative function on Ω. Suppose that Σω(Ξίιρ(ω) = 1, and define P(A) = Σω^Αρ(ω); since ρ(ω)>_0, the order of summation is irrelevant by Dirichlet's theorem [A26]. Suppose that A = U"=i A{, where the Aj are disjoint, and let ωη, ωί2,... be the points in At. By the theorem on nonnegative double series [A27], P(A) = Σ4ρ(ωα) = Σ,Σ,/Κω,·,) = LjPiAj), and so Ρ is countably additive. This (Ω, &, P) is a discrete probability space. It is the formal basis for discrete probability theory. ■ Example 2.9. Now consider a probability measure Ρ on an arbitrary σ-field & in an arbitrary space Ω; Ρ is a discrete probability measure if there exist finitely or countably many points ωΙ( and masses mk such that P(A) = Σω e Amk for A in &. Here Ρ is discrete, but the space itself may not be. In
24 PROBABILITY terms of indicator functions, the defining condition is P(A) = LkmkIA(wk) for A e &. If the set [ων ω2,...} lies in &, then it is a support of P. If there is just one of these points, say ω0, with mass m0 = 1, then Ρ is a unit mass at ω0. In this case P(A) = ΙΑ(ω0) for A e ^". ■ Suppose that Ρ is a probability measure on a field ^", and that Л,Ве^ and ЛсВ. Since P(A) + P(B -A) = P(B), Ρ is monotone: (2.5) P(A)<P(B) if ЛсВ. It follows further that P(B - A) = P(B) - P(A), and as a special case, (2.6) P(AC) = 1-P(A). Other formulas familiar from the discrete theory are easily proved. For example, (2.7) P(A) +P(B)=P(AuB)+P(Af)B), the common value of the two sides being P(A U Bc) + 2P(A C\B) + P(AC Π В). Subtraction gives (2.8) P(A ufi) =P(A) +P(B)-P(AnB). This is the case η = 2 of the general inclusion-exclusion formula: (2.9) ρίυΛ^Σ^ΛΟ-Σ^^η^) + Σ Р(Л/ПЛ;ПЛ,)+---+(-])" + 1Р(Л,П---пЛ„). i<j<fc To deduce this inductively from (2.8), note that (2.8) gives p\RAk )=p(piAk)+p(A»+i)-p(yyknA»+i))- Applying (2.9) to the first and third terms on the right gives (2.9) with η + 1 in place of n. If Βλ =Αλ and Bk =Ak СлА\ П · · · Г\Ак_и then the Bk are disjoint and \JUxAk=\J"k = xBk, so that P{\JUxAk) = ZUxP{Bk). Since P(Bk) < P{Ak) by monotonicity, this establishes the finite subadditivity of P: (2.10) Ρ\(jAk)< LP(Ak).
SECTION 2. PROBABILITY MEASURES 25 Here, of course, the Ak need not be disjoint. Sometimes (2.10) is called Boole's inequality. In these formulas all the sets are naturally assumed to lie in the field $*~. The derivations above involve only the finite additivity of P. Countable additivity gives further properties: Theorem 2.1. Let Ρ be a probability measure on a field ^. (i) Continuity from below: If An and A lie in & and* An1 A, then P(An)TP(A). (ii) Continuity from above: If An and A lie in & and An\ A, then P(An)iP(A). (iii) Countable subadditivity: IfAx,A2,··· and \Tks,\Ak lie in S*~(theAk need not be disjoint), then (OO \ 00 \jAk\< LP(Ak). k=l I k=l Proof. For (i), put Bx = AX and Bk=Ak-Ak_x. Then the Bk are disjoint, A = \J^^iBk, and An = Wk = \Bk, so that by countable and finite additivity, P(A) = Е"к = 1Р(Вк) = lim„ Znk = xP(Bk) = lim„ P(An). For (ii), observe that An i A implies Acn Τ Ac, so that 1-Р(Ап)П~ Р(А). As for (iii). increase the right side of (2.10) to L°^=iP(Ak) and then apply part (i) to the left side. ■ Example 2.10. In the presence of finite additivity, a special case of (ii) implies countable additivity. // Ρ is a finitely additive probability measure on the field $*~, and if An 10 for sets An in 5? implies P(An)iO, then Ρ is countably additive. Indeed, if В = \Jk Bk for disjoint sets Bk (B and the Bk in F), then Cn=Ok>nBk = B- \Jk<nBk lies in the field $?, and C„|0- The hypothesis, together with finite additivity, gives P(B) - Σ^,^ίβ^.) = P(Cn) ^ 0, and hence P(B) = Σ"„ ,/>(£*)· ■ Lebesgue Measure on the Unit Interval The definition (1.3) specifies a set function on the field 080 of finite disjoint unions of intervals in (0,1]; the problem is to prove Ρ countably additive. It will be convenient to change notation from Ρ to Λ, and to denote by ^ the class of subintervals (a, b] of (0,1]; then A(/) = \I\ = b - a is ordinary length. Regard 0 as an element of J^ of length 0. If A = (J"= ι /,, the /, being fFor the notation, see [A4] and [A10].
26 PROBABILITY disjoint J^sets, the definition (1.3) in the new notation is (2.12) λ(Α)= Σλ(/,.)= ΣΙ/,Ι· ί-1 (=1 As pointed out in Section 1, there is a question of uniqueness here, because A will have other representations as a finite disjoint union U"'= ι Jj of J^-sets. But Jf is closed under the formation of finite intersections, and so ihe finite form of Theorem 1.3(iii) gives η η m m (2.13) El/,l= Σ Σΐ/,η/,ΐ= ΣΙ-',Ι· (Some of the /, n/ may be empty, but the corresponding lengths are then 0.) The definition is indeed consistent. Thus (2.12) defines a set function A on ^(), a set function called Lebesgue measure. Theorem 2.2. Lebesgue measure A is a (countably additive) probability measure on the field &0. Proof. Suppose that A = U£=i Ak, where A and the Ak are ^0-sets and the Ak are disjoint. Then A = U"„i /,· and Ak = Uy"=( Л/ are disjoint unions of j^-sets, and (2.12) and Theorem 1.3(iii) give (2.14) \(А)= ΣΙ/,·Ι= Σ Σ Σΐ/,·η7„.| ί=1 ί-1 fc=l /=1 οο tV ^ со = Σ L\h<\= ΣΑ(Λ). ■ ft-lj-l k=l In Section 3 it is shown how to extend A from 98ϋ to the larger class 98 - σ(&0) of Borel sets in (0,1]. This will complete the construction of A as a probability measure (countably additive, that is) on &8. and the construction is fundamental to all that follows. For example, the set N of normal numbers lies in SB (Example 2.6), and it will turn out that A(/V) = 1, as probabilistic intuition requires. (In Chapter 2, A will be defined for sets outside the unit interval as well.) It is well to pause here and consider just what is involved in the construction of Lebesgue measure on the Borel sets of the unit interval. That length defines a finitely additive set function on the class J^ of intervals in (0,1] is a consequence of Theorem 1.3 for the case of only finitely many intervals and thus involves only the most elementary properties of the real number system. But proving countable additivity on J^ requires the deeper property of
SECTION 2. PROBABILITY MEASURES 27 compactness (the Heine-Borel theorem). Once λ has been proved countably additive on J^, extending it to ^0 by the definition (2.12) presents no real difficulty: the arguments involving (2.13) and (2.14) are easy. Difficulties again arise, however, in the further extension of Λ from бёй to & = σ(^0), and here new ideas are again required. These ideas are the subject of Section 3, where it is shown that any probability measure on any field can be extended to the generated σ-field. Sequence Space* Let 5 be a finite set of points regarded as the possible outcomes of a simple observation or experiment. For tossing a coin, 5 can be {H,T} or {0,1); for rolling a die, 5 = {1,... ,6}; in information theory, 5 plays the role of a finite alphabet. Let Ω = S°° be the space of all infinite sequences (2.15) ω = (ζ1(ω),ζ2(ω),...) of elements of 5: zk((o)^S for all ω eS°° and к > 1. The sequence (2.15) can be viewed as the result of repeating infinitely often the simple experiment represented by S. For 5 = {0,1}, the space 5°° is closely related to the unit interval; compare (1.8) and (2.15). The space 5°° is an infinite-dimensional Cartesian product. Each zk(·) is a mapping of 5°° onto 5; these are the coordinate functions, or the natural projections. Let S" = S x · · · x 5 be the Cartesian product of η copies of 5; it consists of the л-long sequences (u,,...,u„) of elements of 5. For such a sequence, the set (2.16) [ω: (ζ1(ω),...ίζ,Ι(ω)) = («„...,«„)] represents the event that the first η repetitions of the experiment give the outcomes ul,...,u„ in sequence. A cylinder of rank л is a set of the form (2.17) Α = \ω:{ζλ{ω),...,ζη{ω))^Η], where Я с 5". Note that A is nonempty if Η is. If Я is a singleton in 5", (2.17) reduces to (2.16), which can be called a thin cylinder. Let €Q be the class of cylinders of all ranks. Then ifQ is a field: S°° and the empty set have the form (2.17) for Η = S" and for Η = 0. If Η is replaced by 5" -H, then (2.17) goes into its complement, and hence if0 is *The ideas that follow are basic to probability theory and are used further on, in particular in Section 24 and (in more elaborate form) Section 36. On a first reading, however, one might prefer to skip to Section 3 and return to this topic as the need arises.
28 PROBABILITY closed under complementation As for unions, consider (2.17) together with (2.18) Β=[ω:(ζι(ω),...,ζ„(ω))<Ξΐ], a cylinder of rank m. Suppose that η < m (symmetry); if H' consists of the sequences («„...,«„) in 5'" for which the truncated sequence (и„...,и„) lies in Я, then (2.17) has the alternative form (2.19) Α = [ω:(ζι(ω),...,ζ„(ω))<=Η']. Since it is now clear that (2.20) /lUfl=[«:(z.(ft»),...,zm(fti))eH'u/] is also a cylinder, tf0 is closed under the formation of finite unions and hence is indeed a field. Let pu, ueS, be probabilities on 5—nonnegative and summing to 1. Define a set function Ρ on tf0 (it will turn out to be a probability measure) in this way: For a cylinder A given by (2.17), take (2-21) Ρ(Α)=Σρ„ι···Ρ^ η the sum extending over all the sequences (и,,..., un) in H. As a special case, (2.22) />[ω:(ζ,(ω),...,ζ„(ω)) = («,,...,«„)] =pUi · · · р„я. Because of the products on the right in (2.21) and (2.22), Ρ is called product measure; it provides a model for an infinite sequence of independent repetitions of the simple experiment represented by the probabilities pu on 5. In the case where 5 = {0,1} and p0=/7, = 5, it is a model for independent tosses of a fair coin, an alternative to the model used in Section 1. The definition (2.21) presents a consistency problem, since the cylinder A will have other representations. Suppose that A is also given by (2.19). If η —m, then Η and H' must coincide, and there is nothing to prove. Suppose then (symmetry) that η <m. Then H' must consist of those (u,,...,um) in Sm for which (u„...,u„) lies in Η: Η' = HxSm-". But then (2.23) Zpu, ■ ■ ■ Pu,pu^ · ■ ■ pu„ = Lpu, ■ ■ ■ pu„ Σ pU)i+i ■ · ■ pu„ Η' η s"'-" = ΣΡ«, ■■■Pu,,· Η The definition (2.21) is therefore consistent. And finite additivity is now easy: Suppose that A and В are disjoint cylinders given by (2.17) and (2.18).
SECTION 2. PROBABILITY MEASURES 29 Suppose that η <m, and put A in the form (2.19). Since A and В are disjoint, H' and / must be disjoint as well, and by (2.20), (2.24) P(AuB) = Σ Pui---pUm = P(A)+P{B). Η'ΌΙ Taking H = S" in (2.21) shows that Ρ(5°°) = 1. Therefore, (2.21) defines a finitely additive probability measure on the field ifQ. Now, Ρ is countably additive on €Q, but this requires no further argument, because of the following completely general result. Theorem 2.3. Every finitely additive probability measure on the field ifQ of cylinders in Sx is in fact countably additive. The proof depends on this fundamental fact: Lemma. If Anl A, where the An are nonempty cylinders, then A is nonempty. Proof of Theorem 2.3. Assume that the lemma is true, and apply Example 2.10 to the measure Ρ in question: If Anl0 for sets in tf0 (cylinders) but P(An) does not converge to 0, then P(An) > e > 0 for some 6. But then the An are nonempty, which by the lemma makes An 10 impossible. ■ Proof of the Lemma.* Suppose that A, is a cylinder of rank mn say (2.25) At = [a:(z^),...,zm^))^Ht], where H, c5m'. Choose a point ωη in An, which is nonempty by assumption. Write the components of the sequences in a square array: ζ,(ω,) ζ,(ω2) ζ,(ω3) ··■ (2.26) ζ2(ω,) ζ2(ω2) ζ2(ω3) ■■■ The «th column of the array gives the components of ωη. Now argue by a modification of the diagonal method [A14]. Since 5 is finite, some element w, of 5 appears infinitely often in the first row of (2.26): for an increasing sequence {л, k) of integers, ζ,(ωη ) = ы, for all k. By the same reasoning, there exist an increasing subsequence {n2 k) of {л, к) and an The lemma is a special case of Tychonov's theorem: If S is given the discrete topology, the topological product S°° is compact (and the cylinders are closed).
30 PROBABILITY element u2 of 5 such that ζ2(ωη ) = u2 for all k. Continue. If nk = nk k, then zr{u)nk) = ur for k>r, and hence (ζ,(ω^),..., zr(wni)) = (ы,,...,ыг) for k>r. Let ω° be the element of 5°° with components ur: ω° = (ы,,ы2,...) = (ζ,(ω°), ζ2(ω°),...). Let t be arbitrary. If к > t, then (nk is increasing) и*. > t and hence ω„ еЛ„ с Л,. It follows by (2.25) that, for k>t,H, contains the point (ζ,(ω„ ),-.., ζ„#(ω)) of 5m'. But for k>m,, this point is identical with (ζ,(ω0),...,ζΜ(ω0)), which therefore lies in Hr Thus ω° is a point common to all the A,. ■ Let € be the σ-field in 5°° generated by €ϋ. By the general theory of the next section, the probability measure Ρ defined on tf0 by (2.21) extends to if The term product measure, properly speaking, applies to the extended P. Thus (5°°, €, P) is a probability space, one important in ergodic theory (Section 24). Suppose that S = {0,1} and p0 =P] = \. In this case, (5", ■&, P) is closely related to ((0,l],i@, λ), although there are essential differences. The sequence (2.15) can end in O's, but (1.8) cannot. Thin cylinders are like dyadic intervals, but the sets in if0 (the cylinders) correspond to the finite disjoint unions of intervals with dyadic endpoints, a field somewhat smaller than ^0. While nonempty sets in ^c (for example, (5,5 + 2""]) can contract to the empty set, nonempty sets in if0 cannot. The lemma above plays here the role the Heine-Borel theorem plays in the proof of Theorem 1.3. The product probability measure constructed here on if0 (in the case 5 = {0,1b Ρο=ίΊ = I, that is) is analogous to Lebesgue measure on 9§й But a finitely additive probability measure on ^0 can fail to be countably additive/ which cannot happen in if0. Constructing σ-Fields* The σ-field σ($έ) generated by stf was defined from above or from the outside, so to speak, by intersecting all the σ-fields that contain srf (including the σ-field consisting of all the subsets of Ω). Can a(s>f) somehow be constructed from the inside by repeated finite and countable set-theoretic operations starting with sets in stfl For any class JV of sets in Ω let c№* consist of the sets in at?, the complements of sets in c№, and the finite and countable unions of sets in ai°. Given a class .я/, put sxf{) = srf and define si^, srf2,... inductively by (2.27) < =<*-!· That each sxfn is contained in a(sxf) follows by induction. One might hope that &fn -=a{snf) for some n, or at least that U"=0^ = a(sxf). But this process applied to the class of intervals fails to account for all the Borel sets. Let J^q consist of the empty set and the intervals in Ω = (0,1] with rational endpoints, and define ^n =J*-\ for η = 1,2,... . It will be shown that υ"=0^· " strictly smaller than e8 = a(J?0). +See Problem 2.15. *This topic may be omitted.
SECTION 2. PROBABILITY MEASURES 31 If a„ and b„ are rationals decreasing to a and ft, then (a,ft] = Um n„(flm,ftn] = Um(U „(am,bn¥Y e j^. The result would therefore not be changed by including in ,/Ό all Ihe intervals in (0,1]. To prove \J™=Qy„ smaller than S§, first put (2.28) Ψ(Αι,Α2,...)=Α'[υΑ2υΑ3υΑ4Ό ···. Since J^_! contains Ω = (0,1] and the empty set, every element of J? has the form (2.28) for some sequence Al,A2,... of sets in J^_]. Let every positive integer appear exactly once in the square array mn mn ■■· mn m22 ■■■ Inductively define (2.29) Φ0(Αι,Α2, ..)=ΑΧ, Фп{АиА2,.. )=Ψ(Φπ-1(Αηΐιι,Αη,π,...),Φη_ι(Αη,2ί,Αηΐ22,...), ..), « = 1,2,.... It follows by induction that every element of J? has the form Фп(АиА2,...) for some sequence of sets in ^0. Finally, put (2.30) Ф(А1,А2,...)=Ф1(Ати,Ат„,...)иФ2(Ат21,Ат22,...)и ■■·. Then every element of U"=0^ has the form (2.30) for some sequence AUA2,.. of sets in J?0. If AUA2,... are in ^, then (2.28) is in ^; it follows by induction that each Ф„(АиА2,...) is in & and therefore that (2.30) is in (%. With each ω in (0,1] associate the sequence (ω,, ω2,...) of positive integers such that u>]+ ··· +a>k is the position of the fcth 1 in the nonterminating dyadic expansion of ω (the smallest η for which L]=idj(w) = k). Then ω +*(ω1,ω2,...) is a one-to-one correspondence between (0,1] and the set of all sequences of positive integers. Let /,,/2,... be an enumeration of the sets in j^,put φ(ω) = Φ(/ω ,/ш ,...), and define Β = [ω: ω £ <ρ(ω)]. It will be shown that ΰ is a Borel set out is not contained in any of the J?. Since ω lies in В if and only if ω lies outside ψ(ω), Β Φ φ(ω) for every ω. But every element of OZ=o-?„ has the form (2.30) for some sequence in J^J, and hence has the form φ(ω) for some ω. Therefore, В is not a member of и"=0^· It remains to show that Л is a Borel set. Let Dk =[ω: ω e Ιω ]. Since Lk(n) = [ω: ω, + ··· +<ak =η] = [ω: Е"Г,Ц(ш) < fc = Σ;=1^/ω)] is a Borel set, so are [ω: wk = "]= Um = 1i-A_,(m)nLA(m+n) and 0* = [«:«εΛ,4]= UO:t**="]n/n)·
32 PROBABILITY Suppose that it is shown that (2.31) [ω·ωεΦη(/ω(|,/ωΐ2...)]=ΦΛ(0Μ|,0Μ2,...) for every η and every sequence ui,u2,... of positive integers. It will then follow from the definition (2.30) that Β'=[ω:ω<=φ(ω)]= (J [ω: ω e Φ„(/^ ,/ω^,.. )] η = 1 со = иФ«(°|.,я!.0Ив2.-)=Ф(01.°2.···)· But as remarked above, (2.30), is a Borel set if the An are. Therefore, (2.31) will imply that Bc and В are Borel sets. If η =0, (2.31) holds because it reduces by (2.29) to [ω: ω е/ш ] = DU . Suppose that (2.31) holds with η - 1 in place of n. Consider the condition "I I (2.32) ωεΦη_,(/ ,/ \ "raj | By (2.28) and (2.29), a necessary and sufficient condition for ω e Ф„(/ш ,/,...) is that either (2.32) is false for к = 1 or else (2.32) is true for some к exceeding 1. But by the induction hypothesis, (2.32) and its negation can be replaced by ω e Φ„_](Ομ ,D„ ,...) and its negation. Therefore, ω еф;;(/Ш1,/ш ,...) if and only if Thus U „У„ Ф ё&, and there are Borel sets thai cannot be arrived at from the intervals by any finite sequence of set-theoretic operations, each operation being finite or countable. It can even be shown that there are Borel sets that cannot be arrived at by any countable sequence of these operations. On the other hand, every Borel set can be arrived al by a countable ordered set of these operations if it is not required that they be performed in a simple sequence. The proof of this statement—and indeed even a precise explanation of its meaning—depends on the theory of infinite ordinal numbers.+ PROBLEMS 2.1. Define χ V у = тах{дг, у), and for a collection {xj define V„ xa = supa xa; define χ Λ у = minU, у) and Λ „xa = infβ xa. Prove that IA u B = IA V /B, IAnB = /4 /\IB, IA. =1 —IA, and IA&B = \IA - IB\, in the sense that there is equality at each point of Ω. Show that A <zB if and only if IA <IB pointwise. Check the equation χ Λ (у Vz) = (jt л у) V (χ л ζ) and deduce the distributive law 'See Problem 2.22.
SECTION 2. PROBABILITY MEASURES 33 АГ\(ВиС) = (АГ\В)и(АГ\ С). By similar arguments prove that A\J(BnC) = (AuB)n(AuC), AACcz(AAB)U(BAC), (lM„) = Г\А<„, η ' η 2.2. Let A1,...,A„ be arbitrary events, and put Uk = U(^,- η · · · (ΊΑ,) and /λ = П(Л; U · · · U Л; ), where the union and intersection extend over all the fc-tuples satisfying 1 < г, < ··· <ik <n. Show that Uk -I„_k + l. 2.3. (a) Suppose that Ω e & and that /Ι,ίε^ implies Л -В =Л n#f e 5*". Show that & is a field. (b) Suppose that Ω £ У and that У is closed under the formation of complements and finite disjoint unions. Show that S^ need not be a field. 2.4. Let ^,^2'- ■ be classes of sets in a common space Ω. (a) Suppose that 5*~n are fields satisfying 5Jc^ + l. Show that U*_i^ is a field. (b) Suppose that 5^ are σ-fields satisfying Уп с 5^ + 1. Show by example that lC = ii^ need not be a σ-field. 2.5. The field f{snf) generated by a class suf in Ω is defined as the intersection of all fields in Ω containing stf. (a) Show that /(ja/) is indeed a field, that stfczfisnf), and that f(stf) is minimal in the sense that if £ is a field and ja/c^, then /(.я/) с^. (b) Show that for nonempty ,af, /(ja/) is the class of sets of the form U/Ιι Π]LiAjj, where for each ί and ;' either Л,7 e ja/ or Л;; е ja/, and where the m sets η^ύ,/1ι7, 1 <i<m, are disjoint. The sets in f(sxf) can thus be explicitly presented, which is not in general true of the sets in a(s>f). 2.6. Τ (a) Show that if ja/ consists of the singletons, then f(s>f) is the field in Example 2.3. (b) Show that /(ja/) ca(ja/), that f(srf) = a(sif) if suf is finite, and that σ(/(Λ^))=σ(ΛΤ). (c) Show that if ja/ is countable, then f{snf) is countable. (d) Show for fields ^j and J^ that /(^j U У2) consists of the finite disjoint unions of sets Ax Г\А2 with At e &[. Extend. 2.7. 2.5 Τ Let Я be a a set lying outside У, where У is a field [or σ-field]. Show that the field [or σ-field] generated by &\J {Н} consists of sets of the form (2.33) (НПА)и(НспВ), A,B^SF.
34 PROBABILITY 2.8. Suppose for each A in srf that Ac is a countable union of elements of .я/. The class of intervals in (0,1] has this property. Show that a(sxf) coincides with the smallest class over stf that is closed under the formation of countable unions and intersections. 2.9. Show that, if В е<г(,я/), then there exists a countable subclass JnfB of ssf such that flea(j/fi), 2.10. (a) Show that if a(&f) contains every subset of Ω, then for each pair ω and ω' of distinct points in Ω there is in $f an A such that ΙΑ{ω) Φ ίΑ(ω') (b) Show that the reverse implication holds if Ω is countable. (c) Show by example that the reverse implication need not hold for uncountable Ω 2.11. A σ-field is countably generated, or separable, if it is generated by some countable class of sets. (a) Show that the σ-field 98 of Borel sets is countably generated. (b) Show that the σ-field of Example 2.4 is countably generated if and only if Ω is countable. (c) Suppose that ^j and &г are σ-fields, .^ с 5^, and ^ 1S countably generated. Show by example that S^ may not be countably generated. 2.12. Show that a σ-field cannot be countably infinite—its cardinality must be finite or else at least that of the continuum. Show by example that a field can be countably infinite. 2.13. (a) Let У be the field consisting of the finite and the cofinite sets in an infinite Ω, and define Ρ on & by taking P(A) to be 0 or 1 as A is finite or cofinite. (Note that Ρ is not well defined if fl is finite.) Show that Ρ is finitely additive. (b) Show that this Ρ is not countably additive if Ω is countably infinite. (c) Show that this Ρ is countably additive if Ω is uncountable. (d) Now let У be the σ-field consisting of the countable and the cocountable sets in an uncountable Ω, and define Ρ on У by taking P(A) to be 0 or 1 as A is countable or cocountable. (Note that Ρ is not well defined if Ω is countable.) Show that Ρ is countably additive. 2.14. In (0,1] let JF be the class of sets that either (i) are of the first category [A15] or (ii) have complement of the first category. Show that &~ is a σ-field. For A in У, take P(A) to be 0 in case (i) and 1 in case (ii). Show that Ρ is countably additive. 2.15. On the field SSU in (0, lj define P( A) to be 1 or 0 according as there does or does not exist some positive eA (depending on A) such that A contains the interval {{,{ +eA]. Show that Ρ is finitely but not countably additive. No such example is possible for the field ifn in S°° (Theorem 2.3). 2.16. (a) Suppose that Pisa probability measure on a field &~. Suppose that A, e У for г>0, that AscAt for s <t, and that A — \Jt>0A, e <F. Extend Theorem 2.1(i) by showing that P(A,)1 P(A) as t -»<». Show that A necessarily lies in У if it is a σ-field. (b) Extend Theorem 2.1(ii) in the same way.
SECTION 2. PROBABILITY MEASURES 35 2.17. Suppose that Ρ is a probability measure on a field &, that Al,A2,..., and A = (J „Л„ lie in &~, and that the An are nearly disjoint in the sense that Р(АтГ\Ап) = 0 for тФп. Show that />(Л) = Σ„/>(An). 2.18. Stochastic arithmetic. Define a set function Pn on the class of all subsets of П = {1,2,...}Ьу (2.34) Pn{A)=^#[m: \ <m <n,m<EA]; among the first η integers, the proportion that lie in A is just P„(A). Then P„ is a discrete probability measure. The set A has density (2.35) D(A)=\imP„(A), η provided this limit exists. Let 3 be the class of sets having density. (a) Show that D is finitely but not countably additive on 3. (b) Show that 3 contains the empty set and Ω and is closed under the formation of complements, proper differences, and finite disjoint unions, but is not closed under the formation of countable disjoint unions or of finite unions that are nol disjoint. (c) Let Л consist of the periodic sets Ma = [ka: к = 1,2,...]. Observe that (2.36) Pn{Ma)=l-\l\^\=D{Ma). Show that the field /(*#) generated by ^(see Problem 2.5) is contained in 3. Show that D is completely determined on f(~#) by the value it gives for each a to the event that m is divisible by a. (d) Assume that Σρ_1 diverges (sum over all primes; see Problem 5.20(e)) and prove that D, although finitely additive, is not countably additive on the field f(M). (e) Euler's function φ{η) is the number of positive integers less than η and relatively prime to it. Let pl,...,pr be the distinct prime factors of n; from the inclusion-exclusion formula for the events [m: ρ,-lm], (2.36), and the fact that the ρ, divide n, deduce <"" ^-йИ)- (f) Show for 0 < χ < 1 that D(A) =jc for some A. (g) Show that D is translation invariant: If В = [m + l: m e/1], then В has a density if and only if A does, in which case D(A) =D(B). 2.19. A probability measure space (Ω,&~,Ρ) is nonatomic if P(A)>0 implies that there exists а В such that Be A and 0<P(B)<P(A) (A and Β in &, of course). (a) Assuming the existence of Lebesgue measure λ on SS, prove that it is nonatomic. (b) Show in the nonatomic case that P(A) > 0 and e > 0 imply that there exists а В such that В с A and 0 < P(B) < e.
36 PROBABILITY (c) Show in the nonatomic case that 0 <x <P(A) implies that there exists a B such that BcA and P(B):=x. Hint: Inductively define classes J%, numbers A„, and sets Hn by <ЯГй = {0} = {//„}, сГп = [Н: HcA -Uk<nHk, PiUk<nHk) + P(H)^xl hn = sup[P(H)- //ε/Д and Ρ(Ηη)>ηη-η-χ. Consider (J kHk. (d) Show in the nonatomic case that, if pb p2,--- are nonnegative and add to 1, then A can be decomposed into sets B{,B2,... such that P(Bn) =pnP(A). 2.20. Generalize the construction of product measure: For η = 1,2,..., let S„ be a finite space with given probabilities p„tl, и е£„ Let 5, Χ 52 Χ · be the space of sequences (2.15), where now ζΑ(ω) ^Sk. Define Ρ on the class of cylinders, appropriately defined, by using the product pUl ■ ■ pnu on the right in (2.21) Prove Ρ countably additive on if0, and extend Theorem 2.3 and its lemma to this more general setting. Show that the lemma fails if any of the Sa are infinite 2.21. (a) Suppose that J3/={A{,A2,.. } is a countable partition of Ω Show (see (2.27)) that ja/] = &f* = .of* coincides with a(sxf). This is a case where a(snf) can be constructed "from the inside." (b) Show that the set of normal numbers lies in Jb. (c) Show that <^F* = Ж if and only if Jf is a σ-field. Show that J^_, is strictly smaller than j£ for ац п 2.22. Extend (2.27) to infinite ordinals a by defining snfa =(υί<σ^)*- Show that, if Ω is the first uncountable ordinal, then \Ja<n^a =σ(&/\ Show that, if the cardinality of srf does not exceed that of the continuum, then the same is true of a(&f). Thus &§ has the power of the continuum. 2.23. Τ Extend (2.29) to ordinals a < Ω as follows. Replace the right side of (2.28) by υ"=ι(Λ2„-ι u^2„)· Suppose that Φ„ is defined for β < a Let βα(1),βα(2),... be a sequence of ordinals sucn that βα(η) <ot and such that if β <ot, then β =βα(η) for infinitely many even η and for infinitely many odd n; define (2.38) Φ„(Λ1;Λ2,...) ж%.|(/,.„./|. J-W^,"4"»-·)····)· Prove by transfinite induction that (2.38) is in 98 if the An are, that every element of j^ has the form (2.38) for sets An in .X,, and that (2.31) holds with a in place of n. Define φα(ω) = Φ„(/ωι, /ш,,...), and show that Βα = [ω: ω g φα(ω)] lies in 98 - j^ for a < Ω. Show that J^ is strictly smaller than j^ for α < β < Ω. SECTION 3. EXISTENCE AND EXTENSION The main theorem to be proved here may be compactly stated this way: Theorem 3.1. A probability measure on a field has a unique extension to the generated σ-field. In more detail the assertion is this: Suppose that Ρ is a probability measure on a field ^ of subsets of Ω, and put Sr=a(Sr0). Then there
SECTION 3. EXISTENCE AND EXTENSION 37 exists a probability measure Q on & such that Q(A) = P{A) for A e &~0. Further, if Q' is another probability measure on & such that Q'(A) = P(A) for A e= ^o, then Q'(A) = Q(A) for Α<ξ&. Although the measure extended to fF is usually denoted by the same letter as the original measure on &~0, they are really different set functions, since they have different domains of definition. The class S^0 is only assumed finitely additive in the theorem, but the set function Ρ on it must be assumed countably additive (since this of course follows from the conclusion of the theorem). As shown in Theorem 2.2, Λ (initially defined for intervals as length: A(/) = |/|) extends to a probability measure on the field ^0 of finite disjoint unions of subintervais of (0,1]. By Theorem 3.1, λ extends in a unique way from ^0 to ^ = σ(^0), the class of Borel sets in (0,1]. The extended Λ is Lebesgue measure on the unit interval. Theorem 3.1 has many other applications as well. The uniqueness in Theorem 3.1 will be pioved iater; see Theorem 3.3. The first project is to prove that an extension does exist. Construction of the Extension Let Ρ be a probability measure on a field S*~0. The construction following extends Ρ to a class that in general is much larger than σ($*~0) but nonetheless does not in general contain all the subsets of Ω. For each subset A of Ω, define its outer measure by (3.1) Ρ*(Α) = ΜΣΡ(Αη), η where the infimum extends over all finite and infinite sequences Au A2,... of ^"0-sets satisfying A<z\Jn A„. If the A„ form an efficient covering of A, in the sense that they do not overlap one another very much or extend much beyond A, then Σ„Ρ(Αη) should be a good outer approximation to the measure of A if A is indeed to have a measure assigned it at all. Thus (3.1) represents a first attempt to assign a measure to A. Because of the rule P(AC) = 1 - P(A) for complements (see (2.6)), it is natural in approximating A from the inside to approximate the complement Ac from the outside instead and then subtract from 1: (3.2) P^{A) = \-P*{AC). This, the inner measure of A, is a second candidate for the measure of A? A plausible procedure is to assign measure to those A for which (3.1) and (3.2) An idea which seems reasonable at first is to define Pt(A) as the supremum of the sums Σ„Ρ(Αη) for disjoint sequences of ^-sets in A. This will not do. For example, in the case where Ω is the unit interval, Уй is S80 (Example 2.2), and Ρ is λ as defined by (2.12), the set N of normal numbers would have inner measure 0 because it contains no nonempty elements of ^0; in a satisfactory theory, N will have both inner and outer measure 1.
38 PROBABILITY agree, and to take the common value P*(A) = P*{A) as the measure. Since (3.1) and (3.2) agree if and only if (3.3) P*{A)+P*{AC) = \, the procedure would be to consider the class of A satisfying (3.3) and use P*(A) as the measure. It turns out to be simpler to impose on A the more stringent requirement that (3.4) Р*(АГ)Е)+Р*(АСПЕ)=Р*(Е) hold for every set E; (3.3) is the special case Ε = Ω,, because it v/ill turn out that Ρ*(Ω.) - V A set A is called P*-measurable if (3.4) holds for all E; let J( be the class of such sets. What will be shown is that JK contains σ(5^) and that the restriction of P* to σ(5^) is the required extension of P. The set function P* has four properties that will be needed: (i) />*(0) = O; (ii) P* is nonnegative: P*(A) > 0 for every А с Ω; (iii) Ρ* is monotone: A<zB implies P*(A) <P*(B); (iv) P* is countably subadditive: P*(\J„ An) < Σ„Ρ*(Αη). The others being obvious, only (iv) needs proof. For a given e, choose ^o-sets Bnk such that Ana\JkBnk and ZkP(Bnk) <P*(An) + el'", which is possible by the definition (3.1). Now UnAn<z\JnkBnk, so that P*iU„ A„) < ZnkP{Bnk) < LnP*(An) + e, and (iv) follows.* Of course, (iv) implies finite subadditivity. By definition, A lies in the class *d of P*-measurable sets if it splits each Ε in 2n in such a way that P* adds for the pieces—that is, if (3.4) holds. Because of finite subadditivity, this is equivalent to (3.5) Р*(АПЕ)+Р*(АСПЕ)<Р*(Е). Lemma 1. The class JZ is a field. fIt also turns out, after the fact, that (3.3) implies that (3.4) holds for aii Ε anyway, see Problem 3.2. *Сотраге the proof on p. 9 that a countable union of negligible sets is negligible.
SECTION 3. EXISTENCE AND EXTENSION 39 Proof. It is clear that ftei" and that J( is closed under complementation. Suppose that Л,5е/ and £cft. Then P*(E) = Р*(ВпЕ)+Р*(ВсГ\Е) = Р*(АпВпЕ)+Р*(АсПВпЕ) + Р*(АГ)ВСПЕ) + Р*(АСПВСПЕ) >Р*(АГ)ВпЕ) + Р*((АС Π Β Π Ε) U (А П Вс Π Ε) и(АсПВсПЕ)) = Р*(( А ПВ) П Ε) + Р*((А П В)с П Ε), the inequality following by subadditivity. Hence* Α Π β e ^, and ^ is a field. ■ kmma 2. // Л,, Л 2,... is a finite or infinite sequence of disjoint JK-sets, then for each £cft, (3.6) P*[En^\jAk^ = EP*(EnAk). Proof. Consider first the case of finitely many Ak, say η of them. For η = 1, there is nothing to prove. In the case η = 2, if Л, L)A2 = Ω, then (3.6) is just (3.4) with Л, (or A2) in the role of A. If Л, UA2 is smaller than Ω, split £П(Л, иЛ2) by Л, and Л^ (or by A2 and Л2) and use (3.4) and disjointness. Assume (3.6) holds for the case of η - 1 sets. By the case η = 2, together with the induction hypothesis, P*(E n((JJ[_, Лл)) = />*(£ η (U£:} Λ*)) + /3*(ЕПЛ) = Е^ = 1/'*(£ПЛЛ). Thus (3.6) holds in the finite case. For the infinite case use monotonicity: P*(En(\Jk> = lAk))>P4En(\J"k=lAk))=L"k = lP*(EnAk). Let n-*«, and conclude that the left side of (3.6) is greater than or equal to the right. The reverse inequality follows by countable subadditivity. ■ Lemma 3. The class JK is a σ-field, and P* restricted to Jif is countabfy additive. Proof. Suppose that Ax, Л2)... are disjoint ^sets with union A. Since F„= UJ[-,/4* lies in the field jf, P*(E) = P*(EnFn) + P*(Ef)Fnc). To the This proof does not work if (3.4) is weakened to (3.3).
40 PROBABILITY first term on the right apply (3.6), and to the second term apply monotonicity (F„c эАс): P*(E) > Znk = lP*(E f)Ak) + P*(E f)Ac). Let η -+ oo and use (3.6) again: P*(E) > Zmk = lP*(EnAk) + P*(E f)Ac) = P*(E f)A) + P*(E f)Ac). Hence A satisfies (3.5) and so lies in J%, which is therefore closed under the formation of countable disjoint unions. From the fact that J( is a field closed under the formation of countable disjoint unions it follows that J% is a σ-field (for sets Bk in JK, let Л, = Bx and Ak= ВкГ\Вх П · · · П Bk_t; then the Ak are disjoint ^sets and (J*. Bk = \JkAk^J?). The countable additivity of P* on Jf follows from (3.6): take Ε = Ω. ■ Lemmas 1, 2, and 3 use only the properties (i) through (iv) of P* derived above. The next two use the specific assumption that P* is defined via (3.1) from a probability measure Ρ on the field 5^. Lemma 4. If P* is defined by (3.1), then S^0 <^J(. Proof. Suppose that A e S^0. Given Ε and e, choose ^-sets An such that Ecz \JnAn and Σ„Ρ(Α„) <Ρ*(Ε) + e. The sets Bn=A„nA and Cn = An C\AC lie in $*~0 because it is a field. Also, Ε Г) А с \jn Bn and Ε C\AC с U„C„; by the definition of P* and the finite additivity of Ρ, Ρ*(Ε η A) + P*(E ПАС) < ΣηΡ(Βη) + Z„P(C„) = ΣηΡ(Αη) < P*(E) + e. Hence >4e^0 implies (3.5), and so ^, c^. ■ Lemma 5. //f* й defined by (3.1), {Леи (3.7) />*(Л)=/>(Л) forA&y0. Proof. It is obvious from the definition (3.1) that P*(A) < P(A) for A in ^0. If А с U„ /4„, where Л and the An are in J^, then by the countable subadditivity and monotonicity of Ρ on ^"0, P(A) <ΣηΡ(Α nAn) < ΣηΡ(Αη). Hence (3.7). ■ Proof of Extension in Theorem 3.1. Suppose that P* is defined via (3.1) from a (countably additive) probability measure Ρ on the field ^,. Let У= σ{&~0). By Lemmas 3 and 4,* By (3.7), Р*Ш) = Ρ(Ω) = 1. By Lemma 3, P* (which is defined on all of 2Ω) restricted to Λ is therefore a probability measure there. And then P* further restricted to & is clearly a probability measure on that class as well. fIn the case of Lebesgue measure, the relation is ^c^c^ci10'11, and each of the three inclusions is strict; see Example 2.2 and Problems 3.14 and 3.21
SECTION 3. EXISTENCE AND EXTENSION 41 This measure on & is the required extension, because by (3.7) it agrees with Ρ on &0. ■ Uniqueness and the ττ-λ Theorem To prove the extension in Theorem 3.1 is unique requires some auxiliary concepts. A class &> of subsets of Ω is a ir-system if it is closed under the formation of finite intersections: (тг) A,B<e&> implies AnB^g». A class ^f is a λ-system if it contains Ω and is closed under the formation of complements and of finite and countable disjoint unions: (A,) fie/; (A2) A <E^tf implies Ac <E^f\ (λ,) А1,Аг,..., е_г^ and AnC\Am = 0 for m Φ η imply \JnAn<E^f. Because of the disjointness condition in (λ3), the definition of Λ-system is weaker (more inclusive) than that of σ-field. In the presence of (A,) and (A2), which imply 0 е_г^, the countably infinite case of (A3) implies the finite one. In the presence of (A,) and (A3), (A2) is equivalent to the condition that S is closed under the formation of proper differences: (A'2) A, B ^^f and Ad В imply В -А еУ. Suppose, in fact, that ^f satisfies (A2) and (A3). If Л,5еУ and A <zB, then .^ contains B:, the disjoint union A VBC, and its complement (A U Bc)c = B -A. Hence (A'2). On the other hand, if ^f satisfies (A,) and (A'2), then /4e/ implies / = П-Ле/ Hence (A2). Although a σ-field is a Α-system, the reverse is not true (in a four-point space take S to consist of 0, Ω, and the six two-point sets). But the connection is close: Lemma 6. A class that is both a ττ-system and a λ-system is a σ-field. Proof. The class contains Ω by (A,) and is closed under the formation of complements and finite intersections by (A2) and (тг). It is therefore a field. It is a σ-field because if it contains sets An, then it also contains the disjoint sets Bn =АпСлА\С\ ■ ■ · Г\Асп_х and by (A 3) contains ΌηΑη = \JnBn.
42 PROBABILITY Many uniqueness arguments depend on Dynkin's тг-А theorem: Theorem 3.2. // & is a π-system and ^f is a λ-system, then &<z^f implies a{&)<i-if. Proof. Let jf{) be the Α-system generated by &— that is, the intersection of all Λ-systems containing &>. It is a Α-system, it contains &, and it is contained in every A-system that contains & (see the construction of generated σ-fields, p. 21). Thus &<z^?K) <z^f. If it can be shown that _г^() is also a тг-system, then it will follow by Lemma 6 that it is a σ-field. From the minimality of σ(^) it will then follow that a(&>)<Zjf0, so that Λσ(^) с ^f0 <z^f. Therefore, it suffices to show that „/J, is a тг-system. For each Л, let _^, be the class of sets В such, that А ПВ ^jf{). If A is assumed to lie in &, or even if A is merely assumed to lie in jf{), then jfA is a λ-system: Since ΑΓ)Ω=Α ^^?α by the assumption, _^, satisfies (A,). If ZJ,,#2e_^ and Bx<zB2, then the λ-system _^, contains AnBx and Af)B2 and hence contains the proper difference (А Г\В2) - (Α ηβ,) = AC\(B2- B{), so that -έ?Α contains B2-Bx: ^fA satisfies (A'2). If Bn are disjoint _^-sets, then -г^, contains the disjoint sets А Г) Вп and hence contains their union Α Π (U„ Bn): ^fA satisfies (A3). If A<^& and βε^, then {&> is a тг-system) ^ПВе^сУ0, or В e_^. Thus Л e ^ implies ^c_^, and since ^ is a Α-system, minimality gives _^0с_г^. Thus Л е ^ implies _г^0 с_^, or, to put it another way, Ле^ and βεΞ_^0 together imply that B^jfA and hence Л e_^B. (The key to. the proof is that Be^ if and only if A e_^B.) This last implication means that В е./',, implies &<ZjfB. Since .ifB is a Α-system, it follows by minimality once again that Be^J, implies _^() c_^B. Finally, Be^, and Ce^, together imply СеУй, or BnC^^f0. Therefore, _^, is indeed a тг- system. ■ Since a field is certainly a тг-system, the uniqueness asserted in Theorem 3.1 is a consequence of this result: Theorem 3.3. Suppose that f, and P2 are probability measures on σ(&), where & is a v-system. ///>, and P2 agree on 3s, then they agree on σ({?). Proof. Let ^f be the class of sets Л in σ{&) such that />,(Л) = P2(A). Clearly ОеУ. If ^e7, then />,ЫС) = 1 - Ρ,Μ) = 1 - P2(A) = P2(AC), and hence Ac <E^f. If An are disjoint sets in .£, then P1(UnAn) = Σ„Ρι(Α„) = ΣηΡ2(Αη) = P2(\Jn A„), and hence U„ A„ e^. Therefore J1 is a Α-system. Since by hypothesis &><^^f and Φ is a тг-system, the ττ-λ theorem gives σ(^) с_г^, as required. ■
SECTION 3. EXISTENCE AND EXTENSION 43 Note that the π-λ theorem and the concept of Λ-system are exactly what are needed to make this proof work: The essential property of probability measures is countable additivity, and this is a condition on countable disjoint unions, the only kind involved in the requirement (A3) in the definition of Α-system. In this, as in many applications of the π-λ theorem, ~ι?<ζσ{&) and therefore a(&)=jf, even though the relation a(&)<Zjf itself suffices for the conclusion of the theorem. Monotone Classes A class Jf of subsets of Ω is monotone if it is closed under the formation of monotone unions and intersections: (i) Αχ,Α2,... e^and A„"\ A imply Ле/; (ii) Ax, A2, .. s^ and An J, A imply ^e/. Halmos's monotone class theorem is a close relative of the π-λ theorem but will be less frequently used in this book. Theorem 3.4. // Уа is a field and Л is a monotone class, then ^0<ζΛ implies Proof. Let m(^0) be the minimal monotone class over 5^,—the intersection of all monotone classes containing У0. It is enough to prove ^(^ст^); this will follow if /?i(5^,) is shown to be a field, because a monotone field is a σ-field. Consider the class <0= [A: Ac e m(5^0)]. Since m(^0) is monotone, so is <f. Since 5^, is a field, У0с^, and so т(5^0) <zJ?. Hence т(У0) is closed under complementation. Define ^, as the class of A such that Α υβ e т(У0) for all βε^,. Then Jx is a monotone class and ^c^j; from the minimality of m(^()) follows »i(^)c^j. Define J2 as the class of В such that A U В e mKS^) for all A e т(^0). Then J2 is a monotone class. Now from m{^a)<z^l it follows that A e m(^0) and В е ^ together imply that /1 υβεηιί^); in other words, fie^0 implies that β s^2. Thus F0c^2; by minimality, m(5r0)<z^2-> a"d hence A,B em(^Q) implies that Lebesgue Measure on the Unit Interval Consider once again the unit interval (0,1] together with the field ^0 of finite disjoint unions of subintervals (Example 2.2) and the σ-field £8 = аШ0) of Borel sets in (0,1]. According to Theorem 2.2, (2.12) defines a probability measure Л on ^0. By Theorem 3.1, A extends to Ш, the extended A being Lebesgue measure. The probability space ((0,1],^, A) will be the basis for much of the probability theory in the remaining sections of this chapter. A few geometric properties of A will be considered here. Since the intervals in (0,1] form a T7-system generating Ш, λ is the only probability measure on &§ that assigns to each interval its length as its measure.
44 PROBABILITY Some Borel sets are difficult to visualize: Example 3.1. Let {r,, r2,...} be an enumeration of the rationale in (0,1). Suppose that e is small, and choose an open interval /„ = (an, bn) such that r„ e /„ с (0,1) and Л(/„) = b„ - an < e2~n. Put A = L£ =, /„. By subadditivity, 0<\(A)<e. Since A contains all the rationale in (0,1), it is dense there. Thus A is an open, dense set with measure near 0. If / is an open subinterval of (0,1), then / must intersect one of the /tt, and therefore λ(Α Π /) > 0. If В ■= (0,1) -A then 1-е < λ(Β) < 1. The set В contains no interval and is in fact nowhere dense [A 15]. Despite this, В has measure nearly 1. ■ Example 3.2. There is a set defined in probability terms that has geometric propenies similar to those in the preceding example. As in Section 1, let άη{ω) be the nth digit in the dyadic expansion of ω; see (1.7). Let /1„ = [ω e (0,1]: d/ίω) = dn+i(w) = d2n+i{w\ i = 1,..., n], and let A = ΐχ=, Α„. Probabilistically, A corresponds to the event that in an infinite sequence of tosses of a coin, some finite initial segment is immediately duplicated twice over. From Λ(/1„) = 2"·2-1" it follows that 0 < λ(Α) < Σ"_,2"2" = \. Again A is dense in the unit interval; its measure, less than \, could be made less than e by requiring that some initial segment be immediately duplicated к times over with к large. ■ The outer measure (3.1) corresponding to A on ^() is the infimum of the sums Σ„λ(Αη) for which An e £8{) and А с U„ An. Since each A„ is a finite disjoint union of intervals, this outer measure is (3.8) A*(/l) = inf£lU where the infimum extends over coverings of A by intervals /„. The notion of negligibility in Section 1 can therefore be reformulated: A is negligible if and only if λ*(A) = 0. For A in ^, this is the same thing as λ(Α) = 0. This ravers the set /V of normal numbers: Since the complement /Vе is negligible and lies in (8, A(/vrc) = 0. Therefore, the Borel set N itself has probability 1: A(/V)=l. Completeness This is the natural place to consider completeness, although it enters into probability theory in an essential way only in connection with the study of stochastic processes in continuous time; see Sections 37 and 38.
SECTION 3. EXISTENCE AND EXTENSION 45 A probability measure space (Ω, &~, P) is complete if A<zB, В e &, and P(B) = 0 together imply that A e ^"(and hence that P(A) = 0). If (Ω, &~, P) is complete, then the conditions /fe^, ^M'cfie/, and Ρ(β) = 0 together imply that /e^ and P(A') = P(A). Suppose that (Ω, &, P) is an arbitrary probability space. Define P* by (3.1) for &0 = &=σ{5Γ0), and consider the σ-field Jf of P*-measurable sets. The arguments leading to Theorem 3.1 show that P* restricted to J( is a probability measure. If />*(B) = 0 and Ac B, then Р*(А С\Е) + Р*(АС ПЕ) <P*(B) + P*(E) = P*(E) by monotonicity, so that A satisfies (3.5) and hence lies in JK. Thus (Ω,Λ,Ρ*) is a complete probability measure space. In any probability space it is therefore possible to enlarge the σ-field and extend the measure in such a way as to get a complete space. Suppose that ((0,1], 98, λ) is completed in this way. The sets in the completed σ-field Л are called Lebesgue sets, and λ extended to J( is still called Lebesgue measure. Nonmeasurable Sets There exist in (0,1] sets that lie outside £8. For the construction (due to Vitali) it is convenient to use addition modulo 1 in (0,1]. For x, у e (0,1] take χ ®y to be χ +y or χ +y - 1 according as χ -Vy lies in (0,1] or not.f Put A ®x = [a ®x: a eA]. Let S be the class of Borel sets A such that A®x is a Borel set and λ(Α ®χ) = λ(Α). Then ^f is a λ-system containing the intervals, and so ^c_/ by the тг-λ theorem. Thus A&S8 implies that /Iffijce^ and λ(Α ®x) = λ(Α). In this sense, λ is translation-invariant. Define χ and у to be equivalent (x~y) if x®r=y for some rational r in (0,1]. Let Я be a subset of (0,1] consisting of exactly one representative point from each equivalence class; such a set exists under the assumption of the axiom of choice [A8]. Consider now the countably many sets H®r for rational r These sets are disjoint, because no two distinct points of Η are equivalent. (If H®rx and Η ®r2 share the point hi®r1=h2®r2, then h{ ~h2; this is impossible unless h{ =h2, in which case rt =r2.) Each point of (0,1] lies in one of these sets, because Η has a representative from each equivalence class. (If x~heH, then x=h ® reH®r for some rational r.) Thus (0,1]= \Jr(H® /), a countable disjoint union. If Η were in &, it would follow that λ(0,1] = ErA(#©r). This is impossible: If the value common to the X(H®r) is 0, it leads to 1 = 0; if the common value is positive, it leads to a convergent infinite series of identical positive terms (a +a + · · · < oo and a > 0). Thus Η lies outside ^. ■ Two Impossibility Theorems* The argument above, which uses the axiom of choice, in fact proves this: There exists on 2(0·1] no probability measure Ρ such that P(A®x) = P(A) for all A e 2(0·1] and all χ e (0,1]. In particular it is impossible to extend λ to a translation-invariant probability measure on 2(0,1l This amounts to working in the circle group, where the translation у-»лФу becomes a rotation (1 is the identity). The rationals form a subgroup, and the set Η defined below contains one element from each coset. This topic may be omitted. It uses more set theory than is assumed in the rest of the book.
46 PROBABILITY There is a stronger result There exists on 2((ul no probability measure Ρ such that P[x} = 0 for each x. Since А{д:} = 0, this implies that it is impossible to extend λ to 2<°·4 at all.* The proof of this second impossibility theorem requires the well-ordering principle (equivalent to the axiom of choice) and also the continuum hypothesis. Let S be the set of sequences (s(l),s(2),...) of positive integers. Then S has the power of the continuum. (Let the nth partial sum of a sequence in S be the position of the nth 1 in the nonterminating dyadic representation of a point in (0,1]; this gives a one-to-one correspondence.) By the continuum hypothesis, the elements of S can be put in a one-to-one correspondence with the set of ordinals preceding the first uncountable ordinal. Carrying the well ordering of these ordinals- over to S by means of the correspondence gives to 5 a well-ordering relation <w with the property that each element has only countably many predecessors. For s, t in 5 write s < t if s(i) < t(i) for all i > 1. Say that t rejects s if t <w s and s < t, this is a transitive relation. Let Τ be the set of unrejected elements of S. Let Vs be the set of elements that reject s, and assume it is nonempty. If t is the first element (with respect to <w) of Vs, then t e Τ (if t' rejects t, then it also rejects s, and since t <w t, there is a contradiction). Therefore, if s is rejected at all, it is rejected by an element of Γ. Suppose Τ is countable and let tl,t2,... be an enumeration of its elements. If t*(k) = tk(k) + 1, then t* is not rejected by any tk and hence lies in T, which is impossible because it is distinct from each tk. Thus 7' is uncountable and must by the continuum hypothesis have the power of (0,1]. Let χ be a one-to-one map of Τ onto (0,1]; write the image of i as xr Let A\ = [x,: t(i) = k] be the image under χ of the set of г in Γ for which t(.i) = k. Since t(i) must have some value k, (J* = ( A'k = (0,1]. Assume that Ρ is countably additive and choose и in S in such a way that P(\JfL\ A'k)>l - l/2, + 1 for ί > 1. If oo u(i) oo Λ= Π ΙΜ'*= i)[x.-t(i)^u(i)] = [xl:t<u]t i**l k = l ;=i then P(A) > 0. HA is shown to be countable, this will contradict the hypothesis that each singleton has probability 0. Now, there is some г() in Τ such that и < t0 (if и e T, take t0 = u, otherwise, и is rejected by some t0 in T). If t < и for a ι in T, then t < t0 and hence t <w t0 (since otherwise г0 rejects t). This means that [t: t <u] is contained in the countable set [t. t <w г(;], and A is indeed countable. PROBLEMS 3.1. (a) In the proof of Theorem 3.1 the assumed finite additivity of Ρ is used twice and the assumed countable additivity of Ρ is used once. Where? (b) Show by example that a finitely additive probability measure on a field may not be countably subadditive. Show in fact that if a finitely additive probability measure is countably subadditive, then it is necessarily countably additive as well. +This refers to a countably additive extension, of course. If one is content with finite additivity, there is an extension to 2(0 ''; see Problem 3.8.
SECTION 3. EXISTENCE AND EXTENSION 47 (c) Suppose Theorem 2.1 were weakened by strengthening its hypothesis to the assumption that У is a σ-field. Why would this weakened result not suffice for the proof of Theorem 3.1? 3.2. Let Ρ be a probability measure on a field 5^ and for every subset A of Ω define P*(A) by (3.1). Denote also by Ρ the extension (Theorem 3.1) of Ρ to (a) Show that (3.9) P*(A) = M[P(B): AcB, В е ^] and (see (3.2)) (3.10) P*(A) = sup[P(C):CcA,C<E^}, and show that the infimum and supremum are always achieved. (b) Show that A is P*-measurable if and only if />*(/!) = P*(A). (c) The outer and inner measures associated with a probability measure Ρ on a σ-field & are usually defined by (3.9) and (3.10). Show that (3.9) and (3.10) are the same as (3.1) and (3.2) with & in the role of 3^0. 3.3. 2.13 2.15 3.2 Τ For the following examples, describe P* as defined by (3.1) and Λ=Λ(Ρ*) as defined by the requirement (3.4). Sort out the cases in which P* fails to agree with Ρ on ^Q and explain why. (a) Let ^o consist of the sets 0,{1},{2,3}, and Ω= {1,2,3}, and define probability measures P, and P2 on У0 by P,{1}=0 and P2[2,3) = 0. Note that Ж Pt) and ЖР?) differ. (b) Suppose that Ω is countably infinite, let 3*~0 be the field of finite and cofinite sets, and take P(A) to be 0 or 1 as A is finite or cofinite. (c) The same, but suppose that Ω is uncountable. (d) Suppose that Ω is uncountable, let 5^0 consist of the countable and the cocountable sets, and take P(A) to be 0 or 1 as A is countable or accountable. (e) The probability in Problem 2.15. (f) Let P(A) = ΙΑ(ω0) for A e &~0, and assume {ω0} e σ(3*~0). 3.4. Let / be a strictly increasing, strictly concave function on [0,oo) satisfying /(0) = 0. For Ac(0,1], define P*(A)=f(X*(A)l Show that P* is an outer measure in the sense that it satisfies P*(0) = 0 and is nonnegative, monotone, and countably subadditive. Show that A lies in Л (defined by the requirement (3.4)) if and only if λ*(Α) or X*(AC) is 0. Show that P* does not arise from the definition (3.1) for any probability measure Ρ on any field У0. 3.5. Let Ω be the unit square [(x, y): 0 <x, у < 1], let У be the class of sets of the form [(x, y): χ еЛ, 0 <y < 1], where /leJ, and let Ρ have value λ(Α) at :his set. Show that (Ω, &~, P) is a probability measure space. Show for A = [(x, y): 0 <x < 1, у = 4] that PJA) = 0 and P*(A) = 1.
48 PROBABILITY 3.6. Let Ρ be a finitely additive probability measure on a field ^0 For А с Ω, in analogy with (3.1) define (3.11) P°(/l) = infX;P(/ln), where now the infimum extends over all finite sequences of ^,-sets An satisfying А с (J„ A„. (If countable coverings are allowed, everything is different. It can happen that Ρ°(Ω) = 0; see Problem 3.3(e)) Let Л° be the class of sets A such that P°(£) = P°(A П E) + P°(AC Π Ε) for all Ε с Ω. (a) Show that P°(0) = 0 and that P° is nonnegative, monotone, and finitely subadditive Using these four properties of P°, prove: Lemma 1°- Л° is a field. Lemma 2°: If A{, A-,, is a finite sequence of disjoint ^°-sets, then for each £сП, (3.12) ^£п(ил)) = Е^И)- Lemma 3°: P° restricted to the fieid JK° is finitely additive. (b) Show that if P° is defined by (3.11) (finite coverings), then: Lemma 4°- ^"0 c^°. Lemma 5°: P°(A) = P(A) for A e &0. (c) Define P0(A)= \-P°(Ac). Prove that if Ε с A e J2,,, then (3.13) Po(E)=P(A)-P°(A-E). 3.7. 2.7 3.6 Τ Suppose that Η lies outside the field &~0, and let ^ be the field generated by У0и{Я}, so that 5*j consists of the sets (Η Π /1)и(Яг η β) with Α,Β е У0. The problem is to show that a finitely additive probability measure Ρ on ^ has a finitely additive extension to ^. Define β on 5^ by (3.14) Q((Hr\A)\J(HcnB))=P°(HnA)+Po(HcnB) for A, Be У0. (a) Show that the definition is consistent. (b) Shows that Q agrees with Ρ on 5?~0. (c) Show that Q is finitely additive on &x. Show that Q(H) = P°(H). (d) Define Q by interchanging the roles of P° and Po on the right in (3 14). Show that Q' is another finitely additive extension of Ρ to SFV The same is true of any convex combination Q" of Q and Q. Show that Q"(H) can take any value between P0(H) and P°(#). 3.8. Τ Use Zorn's lemma to prove a theorem of Tarski: A finitely additive probability measure on a field has a finitely additive extension to the field of all subsets of the space. 3.9. Τ (a) Let Ρ be a (countably additive) probability measure on a σ-field &~. Suppose that Η€&~, and let ^ = a{3F\j {#}). By adapting the ideas in Problem 3.7, show that Ρ has a countably additive extension from У to 5^.
SECTION 3. EXISTENCE AND EXTENSION 49 (b) It is tempting to go on and use Zorn's lemma to extend Ρ to a completely additive probability measure on the σ-field of all subsets of Ω. Where does the obvious proof break down? 3.10. 2.17 3.2 Τ As shown in the text, a probability measure space (Ω, У, P) has a complete extension—that is, there exists a complete probability measure space (Ω, Уи P{) such that Fc yt and P, agrees with Ρ on У. (a) Suppose that (Ω, ^2, P2) is a second complete extension Show by an example in a space of two points that P, and P2 need not agree on the σ-field ^пУ2. (b) There is, however, a unique minimal complete extension: Let У+ consist of the sets A for which there exist J^sets В and С such that A&BczC and P(C) = 0. Show that ^+ is a σ-field. For such a set A define P+(A) = P(B). Show that the definition is consistent, that P+ is a probability measure on У+, and that (Ω, У+, F+) is complete. Show that, if (Ω, У1гР{) is any complete extension of (Ω, ^ P), then Sr+cz^rl and P, agrees with P+ on SF+\ (Ω, F+, F+) is the completion of (Ω, ^ F). (c) Show that ,4εΓ if and only if P*(A) = P*(A), where P* and F* are defined by (3.9) and (3.10), and that P+(A) = P*(A) = P*(/l) in this case. Thus the complete extension constructed in the text is exactly the completion. 3.11. (a) Show that a Α-system satisfies the conditions (A4) A,Be^f ana ΛηΒ = 0 imply /1иВеУ, (A5) Л,,/12,... еУ and /1„ T /1 imply А еУ, (λ6) AVA2,... e^and Λ„ J, /1 imply /1 ёУ. (b) Show that _/ is a Α-system if and only if it satisfies (A,), (A'2), and (A5). (Sometimes these conditions, with a redundant (A4), are taken as the definition.) 3.12. 2.5 3.11 Τ (a) Show that if &> is a π-system, then the minimal Α-system over &> coincides with σ(&>). (b) Let &> be a π-system and J( a monotone class. Show that &<zJK does not imply a(&)dJf. (c) Deduce the π-Α theorem from the monotone class theorem by showing directly that, if a Α-system ^ contains a π-system &, then ^ also contains the field generated by &>. 3.13. 2.5 Τ (a) Suppose that &~Q is a field and P, and P2 are probability measures on σ(^,). Show by the monotone class theorem that if P, and F2 agree on 5*,,, then they agree on σ(^0). (b) Let ^o be the smallest field over the π-system &. Show by the inclusion- exclusion formula that probability measures agreeing on &> must agree also on ^q. Now deduce Theorem 3.3 from part (a). 3.14. 1.5 2.22T Prove the existence of a Lebesgue set of Lebesgue measure 0 that is not a Borel set. 3.15. 1.3 3.6 3.14Τ The outer content of a set A in (0,1] is c*(A) = infE„|/,,|, where the infimum extends over finite coverings of A by intervals /„. Thus A is
50 PROBABILITY trifling in the sense of Problem 1.3 if and only if c*(A) = 0. Define inner content by cJ).(A)= 1 -c*(Ac). Show that c*(A) = supEJ/J, where the supremum extends over finite disjoint unions of intervals /„ contained in A (of course the analogue for λ* fails) Show that c*(A) <c*(A); if the two are equal, their common value is taken as the content c(A) of A, which is then Jordan measurable Connect all this with Problem 3.6. Show that c*(A) = c*(A~\ where A' is the closure of A (the analogue for λ* fails). A trifling set is Jordan measurable Find (Pioblem 3.14) a Jordan measurable set that is not a Borel set. Show that c*(A) <\*(А) <λ*(Α) <c*(A). What happens in this string of inequalities if A consists of the rationals in (0, \] together with the irrationals in (i.iP 3.16. 1 5 Τ Deduce directly by countable additivity that the Cantor set has Lebesgue measure 0. 3.17. From the fact that λ(χ®Α) = λ(Α), deduce that sums and differences of normal numbers may be nonnormal. 3.18. Let Η be the nonmeasurable set constructed at the end of the section. (a) Show that, if A is a Borel set and A<zH, then λ(Α) = 0—that is, λ *(//) = 0. (b) Show that, if A*(£) > 0, then Ε contains a nonmeasurable subset. 3.19. The aim of this problem is the construction of a Borel set A in (0,1) such that 0 <λ(Α Π G) < A(G) for every nonempty open set G in (0,1). (a) It is shown in Example 3.1 how to construct a Borel set of positive Lebesgue measure that is nowhere dense. Show that every interval contains such a set. (b) Let {/„} be an enumeration of the open intervals in (0,1) with rational endpoints. Construct disjoint, nowhere dense Borel sets Ai,Bl,A2,B2,... of positive Lebesgue measure such that An US,, c7„. (c) Let A = \Jk Ak. A nonempty open G in (0,1) contains some /„. Show that 0 < k{A„) < AM nG)< \(A DG) + \(B„) < A(G). 3.20. Τ There is no Borel set A in (0,1) such that αλ(Ι)<λ(Α Dl)<b\(I) for every open interval / in (0,1), where 0 < a < b < 1. In fact prove: (a) If \(A П I) <b\(I) for all / and if b < 1, then A(/1)=0. Hint. Choose an open G such that А с G c(0,1) and A(G) <b-,A(/l); represent G as a disjoint union of intervals and obtain a contradiction. (b) If αλ(Ι)<λ(Α η I) for all / and if a > 0, then A(/l)= 1. 3.21. Show that not every subset of the unit interval is a Lebesgue set. Hint: Show that A* is translation-invariant on 2((>,1); then use the first impossibility theorem (p. 45). Or use the second impossibility theorem.
SECTION 4. DENUMERABLE PROBABILITIES 51 SECTION 4. DENUMERABLE PROBABILITIES Complex probability ideas can be гщае clear by the systematic use of measure theory, and probabilistic ideas of extramathematical origin, such as independence, can illuminate problems of purely mathematical interest. It is to this reciprocal exchange that measure-theoretic probability owes much of its interest. The results of this section concern infinite sequences of events in a probability space.1 They will be illustrated by examples in the unit interval. By this will always be meant the triple Ш, &~, P) for which Ω is (0,1], & is the σ-field Ш of Borel sets there, and P(A) is for A in &~ the Lebesgue measure λ(Α) of A. This is the space appropriate to the problems of Section 1, which will be pursued further. The definitions and theorems, as opposed to the examples, apply to all probability spaces. The unit interval will appear again and again in this chapter, and it is essential to keep in mind that there are many other important spaces to which the general theory will be applied later. General Formulas The formulas (2.5) through (2.11) will be used repeatedly. The sets involved in such formulas lie in the basic σ-field S*~ by hypothesis. Any probability argument starts from given sets assumed (often tacitly) to lie in tF; further sets constructed in the course of the argument must be shown to lie in J^as well, but it is usually quite clear how to do this. If P(A)>0, the conditional probability of В given A is defined in the usual way as (4.1) Р(ЩЛ)^ЛЛ^1. There are the chain-rule formulas P(AnB) =P(A)P(B\A), (4.2) Р(АпВпС)=Р(А)Р(В\А)Р(С\АГ)В), If у4,,у42,... partition Ω, then (4.3) Ρ(Β) = ΣΡ(Αη)Ρ(Β\Αη). η They come under what Borel in his first paper on the subject (see the footnote on p. 9) called probability denombrables, hence the section heading
52 PROBABILITY Note that for fixed A the function P(B\A) defines a probability measure as В varies over &~. If P(An) = 0, then by subadditive P(\jnAj = 0. If P(A„) = l, then П„ An has complement U„ Ac„ of probability 0. This gives two facts used over and over again: //Л,, A2, - - - are sets of probability 0, so is \Jn A„. IfAx, Аг,... are sets of probability 1, so is C\„A„. Limit Sets For a sequence Al,A2,··- of sets, define a set oo oo (4.4) \imsupA„ = П \jAk η л = 1 k-=n and a set (4.5) liminf^= (J f)Ak. " л=1к=п These setsf are the limits superior and inferior of the sequence {An}. They lie in &~ if all the An do. Now ω lies in (4.4) if and only if for each η there is some к > η for which ω ^Ak; in other words, ω lies in (4.4) if and only if it lies in infinitely many of the A„. In the same way, ω lies in (4.5) if and only if there is some η such that ω e Ak for all /c > n; in other words, ω lies in (4.5) if and only if it lies in all but finitely many of the An. Note that VCk=nAk Tliminf„ An and \J°*k=nAk J,limsupn An. For every m and n, Г\™=тАк с иХ^Лд., because for / > max{m, л}, At contains the first of these sets and is contained in the second. Taking the union over m and the intersection over η shows that (4.5) is a subset of (4.4). But this follows more easily from the observation that if ω lies in all but finitely many of the An then of course it lies in infinitely many of them. Facts about limits inferior and superior can usually be deduced from the logic they involve more easily than by formal set-theoretic manipulations. If (4.4) and (4.5) are equal, write (4.6) lim An = lim inf An = lim supAn. " " η To say that An has limit A, written An-^A, means that the limits inferior and superior do coincide and in fact coincide with A. Since lim inf„ Л„ с limsup„/4„ always holds, to check whether An->A is to check whether limsup„ An <zA climinf„ An. From Л„е J^and An -»Л follows Ле^". See Problems 4 1 and 4 2 for the analogy between set-theoretic and numerical limits superior and inferior.
SECTION 4. DENUMERABLE PROBABILITIES 53 Example 4.1. Consider the functions άη(ω) denned on the unit interval by the dyadic expansion (1.7), and let Ιη(ω) be the length of the run of O's starting at άη(ω): /„(ω) =/c if ά„(ω)= ■■· = ά„+Ιί_1(ω) = 0 and dn+k(w) = 1; here /„(ω) = 0 if άη(ω)= 1. Probabilities can be computed by (1.10). Since [ω: /„(ω) = к] is a union of 2"~l disjoint intervals of length 2~n~k, it lies in & and has probability 2~*-1. Therefore, [ω: /„(ω) > r] = [ω: ά^ω) = 0, л <i < л + r] lies also in & and has probability ЕЛгг2~*_|: (4.7) Ρ[ω;/η(ω-)>Λ]=2- If An is the event in (4.7), then (4.4) is the set of ω such that Ιη(ω) > ι for infinitely many n, or, η being regarded as a time index, such that /„(ω) > r infinitely often. В Because of the theory of Sections 2 and 3, statements like (4.7) are valid in the sense of ordinary mathematics, and using the traditional language of probability—"heads," "runs," and so on—does not change this. When η has the role of time, (4.4) is frequently written (4.8) \imsupAn = [Ani.o.], η where "i.o." stands for "infinitely often." Theorem 4.1. (i) For each sequence [An), (4.9) p(limMAn)<liminfP(An) <lim sup P(An) </41im sup/4 J. η η (ii) IfAn-+A, then P(An)-^P(A). Proof. Clearly (ii) follows from (i). As for (i), if Bn = C\1=nAk and C„= U1=nAk, then B„Tliminf„ An and C„ JJimsup,, An, so that, by parts (i) and (ii) of Theorem 2.1, P(An)> P(B„) -^ P(\imMn A„) and P(An) < P(Cn) ^ P(\im supn An). Ш Example 4.2. Define /„(ω) as in Example 4.1, and let An = [ω: /„(ω) > г] for fixed r. By (4.7) and (4.9), Ρ[ω: l„(a))>r i.o.]>2~r. Much stronger results will be proved later. Ш Independent Events Events A and В are independent if P(A Π Β) = P(A)P(B). (Sometimes an unnecessary mutually is put in front of independent.) For events of positive
54 PROBABILITY probability, this is the same thing as requiring P(B\A) = P(B) or P(A\B) = P(A). More generally, a finite collection Л,,■ ■ ■, Лп of events is independent if (4.10) P(Akin-nAkt)=P(Aki)---P(Akj) for 2 <j < η and 1 < &, < · ■ ■ <kj <n. Reordering the sets clearly has no effect on the condition for independence, and a subcollection of independent events is also independent. An infinite (perhaps uncountable) collection of events is defined to be independent in each of its finite subcollections is. If η = 3, (4.10) imposes for j = 2 the three constraints (4.11) Р(А1ПА2)=Р(А1)Р(А2)> Р(А,пА3) = />( Л,)/>( Л3), Р(А2Г,А3)=Р(А2)Р(А3)> and for j = 3 the single constraint (4.12) P(AlnA2nA3)=P(Al)P(A2)P(A3). Example 4.3. Consider in the unit interval the events BUL =[ω: ίίΜ(ω) = ίί,(ω)]—the wth and uth tosses agree—and let Л, = Bn, A2 = B13, A3 = B23. Then Av A2, A3 are pairwise independent in the sense that (4.11) holds (the two sides of each equation being \). But since Л, пА2 сЛ3, (4.12) does not hold (the left side is \ and the right is |), and the events are not independent. ■ Example 4.4. In the discrete space Ω = {1,..., 6} suppose each point has probability | (a fair die is rolled). If Л, ={1,2,3,4} and A2 =A3 = {4,5,6}, then (4.12) holds but none of the equations in (4.11) do. Again the events are not independent. ■ Independence requires that (4.10) hold for each j = 2,...,n and each choice of /c,,...,/cy, a total of Σ"^!"^2"- ! ~n constraints. This requirement can be stated in a different way: If Bu..., Bn are sets such that for each i = 1,..., и either В, = Л, or β, = Ω, then (4.13) P(Bt ПВ2 П · · · ПВЯ) = P{B,)P{B2) ■ · · Р(ВЯ). The point is that if Β. = Ω, then Bt can be ignored in the intersection on the left and the factor P{Bt) = 1 can be ignored in the product on the right. For example, replacing A2 by Ω reduces (4.12) to the middle equation in (4.11). From the assumed independence of certain sets it is possible to deduce the independence of other sets.
SECTION 4. DENUMERABLE PROBABILITIES 55 Example 4.5. On the unit interval the events H„ = [ω: άη(ω) = 0], η = 1,2,..., are independent, the two sides of (4.10) having in this case value 2~J. It seems intuitively clear that from this should follow the independence, for example, of [ω: d2(w) = 0] = H2 and [ω: ίί,(ω) = 0, ά3(ω) = 1] = Я, Г\Н^, since the two events involve disjoint sets of times. Further, any sets A and В depending, respectively, say, only on even and on odd times (like [ω: ά2η(ω) = 0 i.o.] and [ω: ά2η + ](ω) = 1 i.o.]) ought also to be independent. This raises the general question of what it means for A to depend only on even times. Intuitively, it requires that knowing which ones among H2, H4,... occurred entails knowing whether or not A occurred—that is, it requires that the sets H2,H4,... "determine" A. The set-theoretic form of this requirement is that A is to lie in the cr-field generated by H2,H4,... . From Α^σ(Η2,Η4,...) and 5еа(Я,,Я3, ..) it ought to be possible to deduce the independence of A and В. Ш The next theorem and its corollaries make such deductions possible. Define classes sxfb...,sin in the basic σ-field !F to be independent if for each choice of At from J3^, / = 1,..., n, the events Ab...,An are independent. This is the same as requiring that (4.13) hold whenever for each /', 1 < i < n, either В, е j^ or Β, = Ω. Theorem 4.2. If stfy,...,stfn are independent and each ^ is a π-system, then σ(^ι),..., σ(&ίη) are independent. Proof. Let ^ be the class j^ augmented by Ω (which may be an element of J^ to start with). Then each Ш{ is a T7-system, and by the hypothesis of independence, (4.13) holds if Β,-^Щ, i = l,...,n. For fixed sets B2,...,B„ lying respectively in £82,...,£8п, let Л be the class of J^sets Bx for which (4.13) holds. Then ^ is a λ-system containing the T7-system Шх and hence (Theorem 3.2) containing σ(^,) = a{ssfx). Therefore, (4.13) holds if BvB2,...,Bn lie respectively in σ(^,),Q82,...,£8n, which means that σ(^,),srf2,-..,£f„ are independent. Clearly the argument goes through if 1 is replaced by any of the indices 2,..., n. From the independence of σ($#γ), &f2,...,sxfn now follows that of σ(^), σ(^2), ^3, ...,J^,, and soon. ■ If si= {Ay,..., Ak) is finite, then each A in a(&f) can be expressed by a "formula" such as A =A2nAc5 or A =(A2nA7) u(A3nAc7r\As). If ssi is infinite, the sets in a(&f) may be very complicated; the way to make precise the idea that the elements of stf "determine" A is not to require formulas, but simply to require that A lie in σ(^). Independence for an infinite collection of classes is defined just as in the finite case: \stfe: θ e Θ] is independent if the collection [Αθ: θ e Θ] of sets is independent for each choice of Ae from &fe. This is equivalent to the independence of each finite subcollection sf9,...,sf9 of classes, because of
56 PROBABILITY the way independence for infinite classes of sets is defined in terms of independence for finite classes. Hence Theorem 4.2 has an immediate consequence: Corollary 1. // sxfe, θ e Θ, are independent and each sxfe is a π-system, then σ(&ίθ), йе0, are independent. Corollary 2. Suppose that the array An Ai2 (4.14) A2l A22 ··· of events is independent; here each row is a finite or infinite sequence, and there are finitely or infinitely many rows. If ^ is the σ-field generated by the ith row, then ^χ,^2,... are independent. Proof. If s^i is the class of all finite intersections of elements of the / th row of (4.14), then j^ is a T7-system and aissfj) = «9^. Let / be a finite collection of indices (integers), and for each / in / let /, be a finite collection of indices. Consider for i e / the element C, = П; <= }.AV of £/(. Since every finite subcollection of the array (4.14) is independent (the intersections and products here extend over : e / and j e/;), p{ Π c,)-p( Π Π Au) = Π YIp(au) = ГИп,л„) = Пр(с,)- i It follows that the classes ssfx,ssf2,... are independent, so that Corollary 1 applies. ■ Corollary 2 implies the independence of the events discussed in Example 4.5. The array (4.14) in this case has two rows: H2 H4 H6 Hi H3 H5 Theorem 4.2 also implies, for example, that for independent Al,...,An, (4.15) P(A\n---nAlnAk + ln---nAn) = P(A)-P(Al)P(Ak+l)--P(A„).
SECTION 4. DENUMERABLE PROBABILITIES 57 To prove this, let j^ consist of A; alone; of course, /1[е(г(^). In (4.15) any subcollection of the Л, could be replaced by their complements. Example 4.6. Consider as in Example 4.3 the events BUI that, in a sequence of tosses of a fair coin, the wth and v\h outcomes agree. Let srfx consist of the events B12 and B13, and let sf2 consist of the event B23. Since these three events are pairwise independent, the classes srfx and stf2 are independent. Since B23 = (Βι2*Β]3Υ lies in σ(·0^,), however, σ(^,) and a(&f2) are not independent. The trouble is that s/x is not a тг-system, which shows that this condition in Theorem 4.2 is essential. ■ Example 4.7. If stf= {Av A2,...} is a finite or countable partition of Ω and Ρ(Β\Α,) = p for each A, of positive probability, then P(B) =p and В is independent of a(&f): If Σ' denotes summation over those i for which P(Ai)>0, then Р(В) = ЕР(А,пВ) = ГР(А<)р=р, and so Β is independent of each set in the T7-system stfV) {0}. ■ Subfields Theorem 4.2 involves a number of σ-fields at once, which is characteristic of probability theory; measure theory not directed toward probability usually involves only one all-embracing σ-field &. In proability, σ-fields in ^—that is, sub-a-fields of .9^—play an important role. To understand their function it helps to have an informal, intuitive way of looking at them. A subclass sf of & corresponds heuristically to partial information. Imagine that a point ω is drawn from Ω according to the probabilities given by Ρ: ω lies in A with probability P(A). Imagine also an observer who does not know which ω it is that has been drawn but who does know for each A in sf whether ω еЛ or ω <£A—that is, who does not know ω but does know the value of ΙΑ(ω) for each A in &f. Identifying this partial information with the class stf itself will illuminate the connection between various measure- theoretic concepts and the premathematical ideas lying behind them. The set В is by definition independent of the class &f if P(B\A) = P(B) for all sets A in of for which P(A)>0. Thus if B is independent of suf, then the observer's probability for В is P(B) even after he has received the information in stf; in this case sxf contains no information about B. The point of Theorem 4.2 is that this remains true even if the observer is given the information in a{ssf), provided that sxf is a T7-system. It is to be stressed that here information, like observer and know, is an informal, extramathe- matical term (in particular, it is not information in the technical sense of entropy). The notion of partial information can be looked at in terms of partitions. Say that points ω and ω are ^equivalent if, for every A in sxf, ω and ω' lie
58 PROBABILITY either both in A or both in Ac—that is, if (4.16) ΙΑ{ω)=ΙΑ{ω'), A^s/. This relation partitions Ω into sets of equivalent points; call this the &f- partition. Example 4.8. If ω and ω are a(,af )-equivalent, then certainly they are ^equivalent. For fixed ω and ω', the class of A such that ΙΑ(ω) = ΙΑ(ώ) is a σ-field; if ω and ω are ^equivalent, then this σ-field contains &f and hence a(&f), so that ω and ω' are also a(.aO-equivalent. Thus ^equivalence and a(.aO-equivalence are the same thing, and the ^partition coincides with the σ(&/)-partition. Ш An observer with the information in σ(&/) knows, not the point ω drawn, but only the equivalence class containing it. That is exactly the infoima- tion he has. In Example 4.6, it is as though an observer with the items of information in ΰ/χ were unable to combine them to get information about #23. Example 4.9. If Ηη = [ω: άη(ω) = 0] as in Example 4.5, and if &/= {Hx, #3, H5,...}, then ω and ω satisfy (4.16) if and only if άη(ω) = άη(ω) for all odd n. The information in a(sf) is thus the set of values of άη(ω) for η odd. ■ One who knows that ω lies in a set A has more information about ω the smaller A is. One who knows ΙΑ(ω) for each A in a class s/, however, has more information about ω the larger si is. Furthermore, lo have the information in srfx and the information in $i2 is to have the information in $fx \Jsaf7, not that in $#x Π sf2. The following example points up the informal nature of this interpretation of σ-fields as information. Example 4.10. In the unit interval (Ω, &~, Ρ) let <0 be the σ-field consisting of the countable and the cocountable sets. Since P(G) is 0 or 1 for each G in <0, each set Η in У is independent of <0. But in this case the ^partition consists of the singletons, and so the information in <0 tells the observer exactly which ω in Ω has been drawn, (i) The σ-field <0 contains no information about Η—in the sense that Η and J are independent, (ii) The σ-field ^ contains all the information about Η—in the sense that it tells the observer exactly which ω was drawn. ■ In this example, (i) and (ii) stand in apparent contradiction. But the mathematics is in (i)—Η and <0 are independent—while (ii) only concerns heuristic interpretation. The source of the difficulty or apparent paradox here lies in the unnatural structure of the σ-field <0 rather than in any deficiency in the notion of independence.1 The heuristic equating of σ-fields and information is helpful even though it sometimes fSee Problem 4.10 for a more extreme example
SECTION 4. DENUMERABLE PROBABILITIES 59 breaks down, and of course proofs are indifferent to whatever illusions and vagaries brought them into existence. The Borel-Cantelli Lemmas This is the first Borel-Cantelli lemma: Theorem 4.3. // Σ„Ρ(Α„) converges, then f(limsup„ An) = 0. Proof. From lim sup„ An с \Jt = mAk follows P(l\m sup„ An) < P(U°°k = „,Ak)<Lu°k^mP(Ak), and this sum tends to 0 as m->oo if Σ„Ρ(Αη) converges. ■ By Theorem 4.1, P(An)-^0 implies that P(\imini„ /4„) = 0; in Theorem 4.3 hypothesis and conclusion are both stronger. Example 4.11. Consider the run length /„(ω) of Example 4.1 and a sequence {r„} of positive reals. If the series El/2r" converges, then (4.17) P[(o:ln((o)>r„ i.o.] = 0. To prove this, note that if sn is r„ rounded up to the next integer, then by (4.7), Ρ[ω: /„(ω)> rj = 2^« < 2~Ч Therefore, (4.17) follows by the first Borel-Cantelli lemma. If rn = (1 + e)log2η and e is positive, there is convergence because 2'r"= l/«1+e. Thus (4.18) P[w:l„(w)^(l+e)log2n\.o] =0. The limit superior of the ratio /n(w)/log2" exceeds 1 if and only if ω belongs to the set in (4.18) for some positive rational e. Since the union of this countable class of sets has probability 0, (4.19) Ρ To put it the other way around, /„(ω) ω: limsup , > 1 = 0. (4.20) Ρ /„(ω) ω: limsup , < 1 = 1. Technically, the probability in (4.20) refers to Lebesgue measure. Intuitively, it refers to an infinite sequence of independent tosses of a fair coin. ■ In this example, whether limsup„ /n(w)/log2" < 1 holds or not is a property of ω, and the property in fact holds for ω in an J^set of probability
60 PROBABILITY 1. In such a case the property is said to hold with probability 1, or almost surely. In nonprobabilistic contexts, a property that holds for ω outside a (measurable) set of measure 0 holds almost everywhere, or for almost all ω. Example 4.12. The preceding example has an interesting arithmetic consequence. Truncating the dyadic expansion at η gives the standard (n - l)-place approximation LkZ\dk(a>)2~k to ω; the error is between 0 and 2~" + 1, and the error relative to the maximum is (421) e,(a>)= ;—n = L4.+/-i(»)2 . which lies between 0 and 1. The binary expansion of e„(a>) begins with /,,(ω) O's, and then comes a 1 Hence .0 . 01 <en(w)< .0...0111..., where there are /„(ω) O's in the extieme terms. Therefore, (4·22) £>*TT^e«(«)^^). so that results on run length give information about the eiror of approximation. By the left-hand inequality in (4.22), е„(ш)<х„ (assume that 0 <xn < 1) implies that /„(ω)> -log2.r„-l; since E2~r"<<» implies (4.17), Σ*„<°° implies P[a>: e„(w) <xn i.o.] = 0. (Clearly, [ω: ε„(ω) <χ] is a Borel set) In particular, (4.23) P[a>:en(a>)< I/ni+e i.o.) =0. Technically, this probability refers to Lebesgue measure; intuitively, it refers to a point drawn at random from the unit interval. ■ Example 4.13. The final step in the proof of the normal number theorem (Theorem 1.2) was a disguised application of the first Borel-Cantelli lemma. If Αη = [ω: |«_15„(ω)|>«-1/8], then ΣΡ(Α„)<«>, as follows by (1.29), and so P[An i.o.] = 0. But for ω in the set complementary to [An i.o.], n~lsn(w) -*0. The proof of Theorem 1.6 is also, in effect, an application of the first Borel-Cantelli lemma. ■ This is the second Borel-Cantelli lemma: Theorem 4.4. If{An) is an independent sequence of events and ΣηΡ(Αη) diverges, then f(limsup„ An) = 1. Proof. It is enough to prove that P(\Tn = i П^=„Л^) = 0 and hence enough to prove that />(П£=„ Ack) = 0 for all n. Since 1 - χ < e~x, Ρ \Γ\ΑΙ = П(1-Р(Ак))<схр \k=n / k^" n+j LP(Ak)
SECTION 4. DENUMERABLE PROBABILITIES 61 Since ZkP(Ak) diverges, the last expression tends to 0 as /-»oo, and hence P(Crk=nAck) = UmJP(nnkt/nAl) = 0. Ш By Theorem 4.1, limsup„ P(An) > 0 implies P(\imsupn An) > 0; in Theorem 4.4, the hypothesis ΣηΡ(Αη) = oo is weaker but the conclusion is stronger because of the additional hypothesis of independence. Example 4.14. Since the events [ω: /„(ω) = 0] = [ω: άη(ω) = 1], η = 1,2,..., are independent and have probability \, Ρ[ω: /„(ω) = 0 i.o.] = 1. Since the events An = [ω: /,,(ω) = 1] = [ω: dn(w) = 0, άη + 1(ω) = 1], η = 1,2,..., are not independent, this argument is insufficient to prove that (4.24) Ρ[ω:!η(ω)-1\.ο.] = 1. But the events A2, A4, A6,... are independent (Theorem 4.2) and their probabilities form a divergent series, and so Ρ[ω: /2„(ω) = 1 i.o.] = 1, which implies (4.24). ■ Significant applications of the second Borel-Cantelli lemma usually require, in order to get around problems of dependence, some device of the kind used in the preceding example. Example 4.15. There is a complement to (4.17): If rn is nondecreasing and Y2~r"/rn diverges, then (4.25) Ρ[ω:Ιη(ω)>Γηι.ο.]=1. To prove this, note first that if /■„ is rounded up to the next integer, then Y2~r"/rn still diverges and (4.25) is unchanged. Assume then that rn = r(n) is integral, and define {nk} inductively by и, = 1 and nk + i = nk + rn , к > 1. Let Ak = [ω: /„(ω)> r„ ] = [ω: ίί,·(ω) = 0, nk<i <nk + 1]; since the Ak involve nonoverlapping sequences of time indices, it follows by Corollary 2 to Theorem 4.2 that Л,, A2, ■ ■ ■ are independent. By the second Borel-Cantelli lemma, P[Ak i.o.] = 1 if ZkP(Ak) = ΣΑ.2_,'("<) diverges. But since r„ is nondecreasing, E2-*"i>= Z2-^rl(nk)(nk + l-nk) k>l k>l ±Σ Σ 2-ч„-'-Е2л-'. Аг>1 nt<n<nt + l л>1 Thus the divergence of L„2~r"r~l implies that of Lk2~r("k\ and it follows that, with probability 1, /„ (ω) > rn for infinitely many values of k. But this is stronger than (4.25). The result in Example 4.2 follows if r„ = r, but this is trivial. If r„ = log2 n, then L2~r"/rn = Σΐ/(η log2 n) diverges, and therefore (4.26) Ρ[ω: /„(ω) > log2 η i.o.] = 1.
62 PROBABILITY By (4.26) and (4.20), (4.27) ω: hmsup log, л = 1 = 1. Thus for ω in a set of probability 1, log2 л as a function of л is a kind of "upper envelope" for the function /„(ω). ■ Example 4.16. By the right-hand inequality in (4.22), if /,,(w)>log2n, then е„(ш) < 1 /п. Hence (4.26) gives (4.28) ι ч l · 1. This and (4.23) show that, with probability 1, εη(ω) has 1/nasa "lower envelope." The discrepancy between ω and its (n — l)-place approximation L"Z\dk(<o)2~k will fall infinitely often below l/(n2n_1) but not infinitely often below !/(n1+e2"-'). ■ Example 4.17. Examples 4.12 and 4.16 have to do with the approximation of real numbers by rationals: Diophantine approximation. Change the xn = \/nl+c of (4.23) to l/((n - l)log2)1+i. Then Σχ„ still converges, and hence Ρ[ω: e„(w) < l/(log 2"-')'+i i.o.] = 0. And by (4.28), P[e:e„(»)<l/log2"_l i.o.] = 1. The dyadic rational L"kZ\dk(a>)2~k =p/q has denominator q = 2"~l, and en(<o) = q(<o -p/q). There is therefore probability 1 that, if q is restricted to the powers of 2, then 0 <ω -p/q < 1/0? log 4) holds for infinitely many p/q but 0<a>-p/q< 1 /(q log1+£<j) holds only for finitely many.f But contrast this with Theorems 1.5 and 1.6: The sharp rational approximations to a real number come not from truncating its dyadic (or decimal) expansion, but from truncating its continued-fraction expansion; see Section 24. ■ The Zero-One Law For a sequence Al,A2,·.. of events in a probability space Ш, &, P) consider the σ-fields σ(Αη, Αη +,,...) and their intersection (4.29) n = l +This ignores the possibility of even ρ (reducible p/q); but see Problem 1.11(b). And rounding ω up to (p + \)/q instead of down to p/q changes nothing; see Problem 4.13.
SECTION 4. DENUMERABLE PROBABILITIES 63 This is the tail σ-field associated with the sequence {An}, and its elements are called tail events. The idea is that a tail event is determined solely by the An for arbitrarily large n. Example 4.18. Since limsupm^4m = nk2.„U i>kAt and liminfm^4m = U ^>„П/>^,· are both in cr(An, An +,,...), the limits superior and inferior are tail events for the sequence {Л„}. ■ Example 4.19. Let /„(ω) be the run length, as before, and let Hn = [ω: άη(ω) = 0]. For each nQ, [w:ln(w)>rn\.o.] = f] (J [ω: lk(w) > rk] n>n0 tin = П U^n//t+1n-nHt+rrl. Thus [ω: Ι „(ω) > rn i.o.] is a tail event for the sequence {#„}. ■ The probabilities of tail events are governed by Kolmogorou's zero-one law:1' Theorem 4.5. If Ax, A2,-·· is an independent sequence of events, then for each event A in the tail σ-field (4.29), P(A) is either 0 or 1. Proof. By Corollary 2 to Theorem 4.2, σ(Λ,), .. . , σ(Λ„_,), σ(Αη,Αη + 1,...) are independent. If A e У", then A <Ea(An,An + i,...) and therefore Au.. ,An_l,A are independent. Since independence of a collection of events is defined by independence of each finite subcollection, the sequence A,A1,A2,--- is independent. By a second application of Corollary 2 to Theorem 4.2, σ(Α) and σ(Λ„Λ2,...) are independent. But A e Ус σ(Λ,, Л2,..·); from Α ^σ(Α) and Л εσ(^,, A2,...) it follows that Л is independent of itself: Р(АПА) = P(A)P(A). This is the same as P(A) - (/ЧЛ))2 and can hold only if />Ы) is 0 or 1. ■ Example 4.20. By the zero-one law and Example 4.18, f(limsup„ An) is 0 or 1 if the An are independent. The Borel-Cantelli lemmas in this case go further and give a specific criterion in terms of the convergence or divergence of ΣΡ(Αη). Ш Kolmogorov's result is surprisingly general, and it is in many cases quite easy to use it to show that the probability of some set must have one of the extreme values 0 and 1. It is perhaps curious that it should so often be very difficult to determine which of these extreme values is the right one. For a more general version, see Theorem 22.3
64 PROBABILITY Example 4.21. By Kolmogorov's theorem and Example 4.19, [ω: /„(ω) > r„ i.o.] has probability 0 or 1. Call the sequence {r„} an outer boundary or an inner boundary according as this probability is 0 or 1. In Example 4.11 it was shown that {r„} is an outer boundary if Σ2~Γ" <°°. In Example 4.15 it was shown that {rn} is an inner boundary if rn is nondecreasing and Y2~r"r~l = oo. By these criteria r„ = 01og2« gives an outer boundary if θ > 1 and an inner boundary if θ < 1. What about the sequence rn = log2 η + θ log2 log2 л? Here Σ2-Γ" = El//z(iog2 n)e, and this converges for θ > 1, which gives an outer boundary. Now 2~r"/~1 is of the order l/«(log2 л)1 +θ, and this diverges if θ < 0, which gives an inner boundary (this follows indeed from (4.26)). But this analysis leaves the range 0 < θ < 1 unresolved, although every sequence is either an inner or an outer boundary This question is pursued further in Example 6.5. PROBLEMS 4.1. 2.1 Τ The limits superior and inferior of a numerical sequence {xn} can be denned as the supremum and infimum of the set of limit points—that is, the set of limits of convergent subsequences. This is the same thing as defining CO CO (4.30) lim sup.*„ = Д V ** " n=\ k=n and oo то (4.31) liminfx„= V Л хк- " n=\ k=n Compare these relations with (4.4) and (4.5) and prove that 'limsup„,4„ = ''"Ι Sup/^ , hmM„A„ = lim miIAn- η " Prove that lim„ An exists in the sense of (4.6) if and only if lim„/4 (ω) exists for each ω. 4.2. Τ (a) Prove that (lim sup/l„)nllim supBJ Dlim sup(Anr\Bn), I lim sup/l„jullim supBJ = lim sup(AnuBn), π η η (lim inf/ljnilim ΊηΐΒ„) =lim Μ(ΑηΓ\Βη), (lim inf/l„)u(lim infB„)clim M(An\JBn). Show by example that the two inclusions can be strict.
SECTION 4. DENUMERABLE PROBABILITIES 65 (b) The numerical analogue of the first of the relations in part (a) is lim sup*„j л llim supy„) > lim sup(jcn лу„). η η η Write out and verify the numerical analogues of the others. (c) Show that lim supAcn = (lim inf/ln) , η " lim intAc„ = (lim sup/ln) , lim sup/ln - lim inf/ln = lim sup(/ln Π/1^ + 1) η " η = limsup(/Tnn/l„+1) η (d) Show that An-+A and Bn-+B together imply that AnUBn-*A υ β and ΑηΓ\Βη^>ΑΓ\Β. 4.3. Let An be the square [(x, y): \x\< 1, |y|< 1] rotated through the angle 2-πηθ. Give geometric descriptions of limsup„ An and liminf An in case (a) θ = |; (b) θ is rational; (c) θ is irrational. Hint: The 2-πηθ reduced modulo 2π are dense in [0,2тг] if θ is irrational. (d) When is there convergence is the sense of (4.6)? 4.4. Find a sequence for which all three inequalities in (4.9) are strict. 4.5. (a) Show that lim„ F(lim infk An Π Ack) = 0. Hint: Show that lim sup„ liminfA An Г)Аск is empty. Put A* = lim sup„ An and A * = lim inf„ An. (b) Show that P(An -A*)-* 0 and P(A* -A„)-*0. (c) Show that An->A (in the sense that A=A*=A^) implies Ρ(/1Δ/1η) -» 0. (d) Suppose that An converges to A in the weaker sense that Ρ(ΑΔΑ*) = Р(ААА*) = 0 (which implies that P(A* -Ajf) = 0). Show that Ρ(/1Δ/1η)^0 (which implies that P(An)-+P(A)). 4.6. In a space of six equally likely points (a die is rolled) find three events that are not independent even though each is independent of the intersection of the other two. 4.7. For events Al,...,An, consider the 2" equations Ρ{Βγ(Λ ■■■ Γ\Βη) = Ρ(β,) ··· P(Bn) with В,=А{ or £,·=/!' for each i. Show that /*,,...,/!„ are independent if all these equations hold. 4.8. For each of the following classes stf, describe the ^partition defined by (4.16). (a) The class of finite and cofinite sets. (b) The class of countable and cocountable sets.
66 PROBABILITY (c) A partition (of arbitrary cardinality) of Ω. (d) The level sets of sin χ (Ω = R1). (e) The σ-field in Problem 3.5. 4.9. 2.9 2.10 Τ In connection with Example 4.8 and Problem 2 10, prove these facts: (a) Every set in σ(β/) is a union of ^equivalence classes. (b) If snf=[Ae: 0e0], then the ^equivalence classes have the form Γ\βΒθ, where for each Θ, Be is Ae or Ace. (c) Every finite σ-field is generated by a finite partition of Ω. (d) If 5^ is a field, then each singleton, even each finite set, in σ(^ΰ) is a countable intersection of J^-sets. 4.10. 3.2 Τ There is in the unit interval a set Η that is nonmeasurable in the extreme sense that its inner and outer Lebesgue measures are 0 and 1 (see (3.9) and (3.10)): A*(tf) = 0 and X*{H) = 1. See Problem 12.4 for the construction Let Ω = (0,1], let ^ consist of the Borel sets in Ω, and let Η be the set just described. Show that the class ^"of sets of the form (ЯпС,)и(ЯспС2) for G, and G2 in J is a σ-field and that F[(tfn G,)u(tfc nG2)]= |A(G,) + jA(G2) consistently defines a probability measure on У. Show that P(H) = { and that P(G) = A(G) for Ge/, Show that £ is generated by a countable subclass (see Problem 2.11). Show that ^contains all the singletons and that Η and ^ are independent. The construction proves this: There exist a probability space (ίϊ,^,Ρ), a σ-field ^ in tF, and a set Η in ZF, such that P(H) = \, Η and J are independent, and ^ is generated by a countable subclass and contains all the singletons. Example 4.10 is somewhat similar, but there the σ-fieid £ is not countably generated and each set in it has probability either 0 or 1. In the present example ^ is countably generated and P(G) assumes every value between 0 and 1 as G ranges over <&. Example 4.10 is to some extent unnatural because the ^ there is not countably generated. Tne present example, on the other hand, involves the pathological set H. This example is used in Section 33 in connection with conditional probability; see Problem 33.11. 4.11. (a) If Al,A2,... are independent events, then P(f]™=lAn) = Π^= \Ρ(Α„)and P(\J°°n=iAn)= 1 - Π^=1(1 -P(An)). Prove these facts and from them derive the second Borel-Cantelli lemma by the well-known relation between infinite • series and products. (b) Show that F(limsup„/1„) = 1 if for each к the series En>kP(An\AckC\ • · C\Ac„_l) diverges. From this deduce the second Borel-Cantelli lemma once again. (c) Show by example that F(limsup„ An)= 1 does not follow from the divergence of Σ„Ρ(Αη\Α\ η · · · (~)Acn_i) alone. (d) Show that F(limsup„ An) = 1 if and only if ΣηΡ(Α Γ\Αη) diverges for each A of positive probability. (e) If sets An are independent and P(An) < 1 for all n, then P[An i.o.] = 1 if and only if F(U nAn)= 1. 4.12. (a) Show (see Example 4.21) that log2 η + log2 log2 η + θ log2 log2 log2 η is an outer boundary if θ > 1. Generalize. (b) Show that log2 η + log2 log2 log2 η is an inner boundary.
SECTION 5. SIMPLE RANDOM VARIABLES 67 4.13. Let φ be a positive function of integers, and define Βφ as the set of χ in (0,1) such that \x-p/2'\< \/2'φ{2') holds for infinitely many pairs p,i. Adapting the proof of Theorem 1.6, show directly (without reference to Example 4.12) that Σ,1Λ>(2') < °° implies λ(Βφ) = 0. 4.14. 2.19Τ Suppose that there are in (Ω, &~, P) independent events Al,A2,--- such that, if a„ = mm{P(An), 1 -P(An)), then Σα„ = °°. Show that Ρ is nonatomic. 4.15. 2.18 Τ Let F be the set of square-free integers—those integers not divisible by any perfect square. Let F, be the set of m such that p2\m for no ρ </, and show that D(F,) = npi,(l -p~2\ Show that Pn(F,-F) <Lp>,p~2, and conclude that the square-free integers have density Π„(1 -p~2) = 6/тг2. 4.16. 2.18 Τ Reconsider Problem 2.18(d). If D were countably additive on f(Jf\ it would extend to σ(Λ). Use the second Borel-Cantelli lemma. SECTION 5. SIMPLE RANDOM VARIABLES Definition Let (Ω, !F, P) be an arbitrary probability space, and let X be a real-valued function on Ω; Χ is a simple random variable if it has finite range (assumes only finitely many values) and if (5.1) [ω:Χ(ω)=χ] e^ for each real x. (Of course, [ω: Χ(ω) = χ] = 0 <ξ & for χ outside the range of X.) Whether or not X satisfies this condition depends only on &, not on P, but the point of the definition is to ensure that the probabilities Ρ[ω: Χ(ω) = χ] are defined. Later sections will treat the theory of general random variables, of functions on Ω having arbitrary range; (5.1) will require modification in the general case. The άη(ω) of the preceding section (the digits of the dyadic expansion) are simple random variables on the unit interval: the sets [ω: άη(ω) = 0] and [ω: άη(ω) = 1] are finite unions of subintervals and hence lie in the σ-field &8 of Borel sets in (0,1]. The Rademacher functions are also simple random variables. Although the concept itself is thus not entirely new, to proceed further in probability requires a systematic theory of random variables and their expected values. The run lengths /„(ω) satisfy (5.1) but are not simple random variables, because they have infinite range (they come under the general theory). In a discrete space, ^"consists of all subsets of Ω, so that (5.1) always holds. It is customary in probability theory to omit the argument ω. Thus X stands for a general value ΛΧω) of the function as well as for the function itself, and [X = x] is short for [ω: ΛΧω) = χ]
68 PROBABILITY A finite sum (5.2) X= Σ,χ,Ιλ, i is a random variable if the A, form a finite partition of Ω into 5^sets. Moreover, every simple random variable can be represented in the form (5.2): for the jc, take the range of X, and put Л, = [X = x,]. But X may have other such representations because XjIA can be replaced by Σ^χ,/^.. if the A,j form a finite decomposition of Л,· into ^sets. If ^ is a sub-a-field of !F, a simple random variable X is measurable ^, or measurable with respect to cf, if [Z = jc] e «^ for each jc. A simple random variable is by definition always measurable &. Since [X^H]= \j[X = x], where the union extends over the finitely many χ lying both in Η and in the range of X, [X^H]^cf for every HaR1 if X is a simple random variable variable measurable ^. The σ-field σ(Χ) generated by X is the smallest σ-field with respect to which X is measurable; that is, σ(Χ) is the intersection of all σ-fields with respect to which X is measurable. For a finite or infinite sequence Xv X2, ■ ■. of simple random variables, a(Xu X2,...) is the smallest σ-field with respect to which each X, is measurable. It can be described explicitly in the finite case: Theorem 5.1. Let Xx,..., Xn be simple random variables. (i) The σ-field σ(Χν..., Xn) consists of the sets (5.3) [(*„...,*„) 6Я] = [ω: (Χχ{ω),...,Χη{ω)) е=Я] for Η с R"; H in this representation may be taken finite. (ii) A simple random variable Υ is measurable σ(Χν..., XJ if and only if (5.4) Y = f(Xl,...,Xn) for some f: R"-*R1. Proof. Let J% be the class of sets of the form (5.3). Sets of the form [(*„ ...,Xn) = (xlt..., x„)] = П ,"_,[*, = χ,] must lie in σ(Χν...,Xn); each set (5.3) is a finite union of sets of this form because (A',,...,Xn), as a mapping from Ω to R", has finite range. Thus J?<za(Xx,...,Xn). On the other hand, J( is a σ-held because Ω = [(Xv..., Xj&R"], [и1,...Д„)еЯГ = [а1)...Д„)бЯс], and иД,...Д„)е^] = [(Xv..., Хп)Г\ U jHj]. But each X-, is measurable with respect to J?, because [X,=x] can be put in the form (5.3) by taking Η to consist of those (jc ,,..., jc„) in Λ" for which x,=x. It follows that σ(Χχ,...,Χη) is contained
SECTION 5. SIMPLE RANDOM VARIABLES 69 in J( and therefore equals JK. As intersecting Η with the range (finite) of (Xv...,Xn) in R" does not affect (5.3), Η may be taken finite. This proves (i). Assume that Υ has the form (5.4)—that is, Υ(ω) =/(Ζ,(ω),..., Χη(ω)) for every ω. Since [У = y] can be put in the form (5.3) by taking Η to consist of those x = (xl,...,x„) for which f(x) = y, it follows that Υ is measurable σ{Χχ,- ■ ■, Xn)- Now assume that Υ is measurable a(Xt,..., Xn). Let yv--,yr be the distinct values Υ assumes. By part (i), there exist sets Hx,...,Hr in R" such that [ω:Υ(ω)=νί}=[ω:(Χι(ω),...,Χη(ω))^Ηί}. Take / = Σ;= j >',-/w. Although the //,■ need not be disjoint, if tf,- and Hj share a point of the form (Χχ(ω),..., Χ„(ω)\ then Κ(ω) = >', and Κ(ω) = yy, which is impossible if i Φ-j. Therefore each (Χχ(ω),..., Χη(ω)) lies in exactly one of the H-„ and it follows that /(Χχ(ω),..., Χη(ω)) = Υ(ω). Μ Since (5.4) implies that Υ is measurable σ(Χχ,..., Xn), it follows in particular that functions of simple random variables are again simple random variables. Thus X2, e'x, and so on are simple random variables along with X. Taking / to be Σ"=1·*,, Y\"=ix;, or max,-<„ xt shows that sums, products, and maxima of simple random variables are simple random variables. As explained on p. 57, a sub-a-field corresponds to partial information about ω. From this point of view, σ(Χν...,Χη) corresponds to a knowledge of the values Χ^ω),...,Χη(ω). These values suffice to determine the value Κ(ω) if and only if (5.4) holds. The elements of the σ(Χν..., A'J-partition (see (4.16)) are the sets [Xx =xv...,Xn =xn] for jc, in the range of Xr Example 5.1. For the dyadic digits άη(ω) on the unit interval, d3 is not measurable a(dv d2); indeed, there exist ω' and ω" such that ίί,(ω') = ίί,(ω") and ά2(ω') = ά2(ω") but ά3(ω) Φ ά3(ω"), an impossibility if ά3(ω) = /(ίί,(ω), ά2{ω)) identically in ω. If such an / existed, one could unerringly predict the outcome ά3(ω) of the third toss from the outcomes ίί,(ω) and ά2(ω) of the first two. - ■ Example 5.2. Let 5„(ω) - L"k=irk(a)) be the partial sums of the Rademacher functions—see (1.14). By Theorem 5.1(H) sk is measurable a(r,,...,r„) for k<n, and rk = sk- sk_l is measurable a(s,,...,s„) for к < п. Thus a(r,,..., r„) = σ(ί,,...,ί„). In random-walk terms, the first η positions contain the same information as the first η distances moved. In gambling terms, to know the gambler's first η fortunes (relative to his initial fortune) is the same thing as to know his gains and losses on each of the first η plays. ■
70 PROBABILITY Example 5.3. An indicator lA is measurable £ if and only if A lies in £. And A e oi Av..., A„) if and only if IA = f(IAj,..., IA) for some /: Rn -* Rl. Convergence of Random Variables It is a basic problem, for given random variables X and XVX2,... on a probability space Ш, &, P), to look for the probability of the event that lim„ Χ„(ω) =Χ(ω). The normal number theorem is an example, one where the probability is 1. It is convenient to characterize the complementary event: Χ„(ω) fails to converge to Χ(ω) if and only if there is some e such that for no m does |Χη{ω) - Χ(ω)\ remain below e for all η exceeding m—that is to say, if and only if, for some e, \Χ„(ω) -Χ(ω)\ > e holds for infinitely many values of n. Therefore, (5.5) [lim *„=*]' = \J[\Xn-X\>ei.o.], where the union can be restricted to rational (positive) e because the set in the union increases as e decreases (compare (2.2)). The event [lim„ Xn - X] therefore always lies in the basic σ-field 3*~, and it has probability 1 if and only if (5.6) P[\Xn-X\>ei.o.]=0 for each e (rational or not). The event in (5.6) is the limit superior of the events [|Xn -X\> e], and it follows by Theorem 4.1 that (5.6) implies (5.7) limP[\Xn-X\>e]=0. η This leads to a definition: If (5.7) holds for each positive e, then Xn is said to converge to X in probability, written Xn -*p X. These arguments prove two facts: Theorem 5.2. (i) There is convergence limn Xn = X with probability 1 if and only if (5.6) holds for each e. (ii) Convergence with probability 1 implies convergence in probability. Theorem 1.2, the normal number theorem, has to do with the convergence with probability 1 of л_1Е"=1^,(ы) to \. Theorem 1.1 has to do instead with the convergence in probability of the same sequence. By Theorem 5.2(ii), then, Theorem 1.1 is a consequence of Theorem 1.2 (see (1.30) and (1.31)). The converse is not true, however—convergence in probability does not imply convergence with probability 1:
SECTION 5. SIMPLE RANDOM VARIABLES 71 Example 5.4. Take X = 0 and Xn = lA Then Xn -*P X is equivalent to P(An)-*0, and [lim„ Xn =X]c = [An i.o.]. Any sequence {An} such that P(An)-*0 but P[An i.o.]>0 therefore gives a counterexample to the converse to Theorem 5.2(H). Consider the event An = [ω: /„(ω) > log2 η] in Example 4.15. Here, P(An)< \/n ->0, while P[An i.o.] = 1 by (4.26), and so this is one counterexample. For an example more extreme and more transparent, define events in the unit interval in the following way. Define the first two by ^i=(<U], A2 = (±,l]. Define the next four by ^3=(0'4ji ^4=(?'lJ' J^5=\li'i\> -^6 = (ϊ' 1J - Define the next eight, A7, .., Au, as the dyadic intervals of rank 3. And so on. Certainly, P(A„)-*0, and since each point ω is covered by one set in each successive block of length 2k, the set [An i.o.] is all of (0,1]. ■ Independence A sequence Xl,X2,... (finite or infinite) of simple random variables is by definition independent if the classes σ(Χι),σ(Χ2),... are independent in the sense of the preceding section. By Theorem 5.1(i), a(Xt) consists of the sets [X-^H] for H<zRl. The condition for independence of Xb...,Xn is therefore that (5.8) P[XltEHl,...,XntEHn]=P[X1*EHi]---P[XntEHn] for linear sets #,,..., Hn. The definition (4.10) also requires that (5.8) hold if one or more of the [Xt e Hs] is suppressed; but taking #, to be Rl eliminates it from each side. For an infinite sequence XX,X2,..., (5.8) must hold for each n. A special case of (5.8) is (5.9) Ρ[Χι=χι,...,Χη=χη]=Ρ[Χι=χι]·-·Ρ[Χη=χη]. On the other hand, summing (5.9) over jc, e #„..., xn e Hn gives (5.8). Thus the X; are independent if and only if (5.9) holds for all jc,,..., jc„. Suppose that (5.10) Xn is an independent array of simple random variables. There may be finitely or X, 12 x. 22
72 PROBABILITY infinitely many rows, each row finite or infinite. If snf-t consists of the finite intersections ПДА^-е//.] with Hj(zR\ an application of Theorem 4.2 shows that the σ-fields σ(Χη,Χί2,...\ /'=1,2,... are independent. As a consequence, YX,Y2,... are independent if Vj is measurable σ(Χη, Xi2,...) for each /'. Example 5.5. The dyadic digits άλ(ω), ά2(ω),... on the unit interval are an independent sequence of random variables for which (5.11) P[dn = Q\=P[dn = \] = {. It is because of (5 11) and independence that the dn give a model for tossing a fair coin. The sequence (ίί,(ω), ά2(ω\...) and the point ω determine one another. It can be imagined that ω is determined by the outcomes άη{ω) of a sequence of tosses. It can also be imagined that ω is the result of drawing a point at random from the unit interval, and that ω determines the άη{ω). In the second interpretation the άη(ω) are all determined the instaflt ω is drawn, and so it should further be imagined that they are then revealed to the coin tosser or gambler one by one. For example, σ(άι, d2) corresponds to knowing the outcomes of the first two tosses—to knowing not ω but only ίί,(ω) and ά2(ω)—and this does not help in predicting the value ά3(ω), because a{dx,d2) and a(d3) are independent. See Example 5.1. ■ Example 5.6. Every permutation can be written as a product of cycles. For example, [11 ι tit з7Н15«Н">«>· This permutation sends 1 to 5, 2 to 1, 3 to 7, and so on. The cyclic form on the right shows that 1 goes to 5, which goes to 6, which goes to 2, which goes back to 1; and so on. To standardize this cyclic representation, start the first cycle with 1 and each successive cycle with the smallest integer not yet encountered. Let Ω consist of the n\ permutations of 1,2,..., n, all equally probable; &~ contains all subsets of Ω, and P(A) is the fraction of points in A. Let Xk(a>) be 1 or 0 according as the element in the /cth position in the cyclic representation of the permutation ω completes a cycle or not. Then 5(ω) = Σ£=1Λ^(ω) is the number of cycles in ω. In the example above, η = 7, Xl=X2 = X3 = X5 = О, ХЛ = X6 = X7= 1, and 5 = 3. The following argument shows that Xx,..., Xn are independent and P[Xk = 1] = l/(« - к + 1). This will lead later on to results on P[S^H].
SECTION 5. SIMPLE RANDOM VARIABLES 73 The idea is this: Λ\(ω) = 1 if and only if the random permutation ω sends 1 to itself, the probability of which is \/n. If it happens that Χ[ω) = 1—that ω fixes 1—then the image of 2 is one of 2,..., n, and Χ2(ω) = 1 if and only if this image is in fact 2; the conditional probability of this is l/(n - 1). If Χ[ω) = 0, on the other hand, then ω sends 1 to some / φ 1, so that the image of / is one of 1,...,/ - 1, /' + Ι,.,.,η, and Χ2(ω) = 1 if and only if this image is in fact 1; the conditional probability of this is again l/(n - 1). This argument generalizes. But the details are fussy Let Υ,(ω),..., ϊ„(ω) be the integers in the successive positions in the cyclic representation of ω Fix k, and let At be the set where (Xl,...,Xk_i,Y1,...,Yk) assumes a specific vector of values u = (xl,...,xk_l, yx, ..,yk) The /1, form a partition si of Ω, and if P\Xk = \\A\ = \/(n-k + 1) for each υ, then by Example 4.7, P[Xk = 1] = l/(n - к + 1) and Xk is independent of a(sf) and hence of the smaller σ-field σ(Χχ,..., Xk_x). It will follow by induction that Xx,..., Xn are independent. Let j be the position of the rightmost 1 among xl,...,xk_l (;' = 0 if there are none). Then ω lies in A, if and only if it permutes у,,..., y} among themselves (in a way specified by the values x,,..., д:у_,, л:у=1, у1;..., у,) and sends each of ν,+ ι,..., Ук-\ t0 the у just to its right. Thus Af contains (n - к + 1)! sample points. And Хк(ш)= 1 if and only if ω also sends yk to yj + l. Thus At C\[Xk = 1] contains (n - k)l sample points, and so the conditional probability of Xk = 1 is 1 /{n - к + 1). Existence of Independent Sequences The distribution of a simple random variable X is the probability measure μ defined for all subsets A of the line by (5.12) μ(Α)=Ρ[Χ(ΞΛ]. This does define a probability measure. It is discrete in the sense of Example 2.9: If χ^.-.,χ, are the distinct points of the range of X, then μ has mass Pi = P[X = x;] = μ{χ;} at xh and μ(Α) = Σρ,-, the sum extending over those i for which X; eA. As μ(Α) = 1 if Л is the range of X, not only is μ discrete, it has finite support. Theorem 5.3. Let {μη} be a sequence of probability measures on the class of all subsets of the line, each having finite support. There exists on some probability space (Ω,&~,Ρ) an independent sequence {XJ of simple random variables such that Xn has distribution μη. What matters here is that there are finitely or countably many distributions μη. They need not be indexed by the integers; any countable index set will do.
74 PROBABILITY Proof. The probability space will be the unit interval. To understand the construction, consider first the case in which each μη concentrates its mass on the two points 0 and 1. Put pn = μη{0} and qn = 1 -p„ = μη{1}. Split (0,1] into two intervals /0 and /, of lengths px and qx. Define Χχ(ω) = 0 for ω e /0 and Χχ(ω) = 1 for ω e /,. If Ρ is Lebesgue measure, then clearly P[XX = 0]=p1 and P[XX = 1] = qx, so that Xx has distribution μ.,. ΑΊ=0 ΑΊ = 1 ' « a β ΑΊ =0 ΑΊ=0 A", = l A", = l Χ2 = 0 Χ2=\ Χ2 = 0 Χ2 = 1 . 1 , 1 # Ρΐί>2 Ρί42 Ч\Рг Ч^у Now split /0 into two intervals /00 and /0l of lengths pxp2 and pxq2, and split /, into two intervals /10 and /,, of lengths q\P2 and qxq2- Define Χ2(ω) --= 0 for ω e /00 и /10 and Χ2(ω) = 1 for ω e /01 U /,,. As the diagram makes clear, P[XX = 0, X2 = 0]=pxp2, and similarly for the other three possibilities. It follows that Xx and X2 are independent and X2 has distribution μ2. Now X3 is constructed by splitting each of /00, /0!, /10, /,, in the proportions p3 and q3. ^d so on. If pn = qn = 2 for all n, then the successive decompositions here are the decompositions of (0,1] into dyadic intervals, and Χη(ω) = άη(ω). The argument for the general case is not very different. Let xnX,..., xnl be the distinct points on which μη concentrates its mass, and put pni = μη{χηι) for 1 < i < /„. Decompose1 (0,1] into /, subintervals /,(1),...,//" of respective lengths pxx,...,pxl[. Define Xx by setting Χχ(ω) = χΧί for ω e//1', 1 <i </,. Then (P is Lebesgue measure) Ρ[ω: Χγ(ω) = xxi] = Ρ(//υ) =ρχί, 1 <ί </,.Thus Χχ is a simple random variable with distribution μχ. Next decompose each //" into /2 subintervals Iff\ ■. ■, Iff] of respective lengths PuPn,»-tPuPiif Define Χ2(ω) = χ2] for we \J';LXI^\ l<j<l2. Then Ρ[ω: Χχ(ω) = χχί, Χ2(ω) = x2j] = PUjf) =РиРг;· Adding out /shows that Ρ[ω: Χ2(ω) =x2j] =p2J, as required. Hence P[Xx=xxh X2 = x2J] = P\iPij~P[Xi = хи]Р[Хг =x2j\ and X\ and X2 are independent. The construction proceeds inductively. Suppose that (0,1] has been decomposed into /]···/„ intervals (5.13) //;\, ι </,</„...,! <;„</„, +If ί>-α = δ,+ ··- +δ, and δ,-> 0, then /; = (α + Σί<ίδ/, α + Σ>£,δ;] decomposes (a, b] into subintervals Ix,...,li with lengths of δ, Of course, /,■ is empty if δ,- = 0.
SECTION 5. SIMPLE RANDOM VARIABLES 75 of lengths (5-14) />(///%,)=/>.,,, ••■/ν,ν Decompose /,■"',ir into /n + 1 subintervals //"+/)„..., I}"*]], of respective lengths PUf^'iYpn +1.1, ■ ■ ·, PUi"] ,·„)/>„ + u„+r these are the+intervals of the next decomposition. This construction gives' a sequence of decompositions (5.13) of (0,1] into subintervals; each decomposition satisfies (5.14), and each refines the preceding one. If μ„ is given for 1 < η < Ν, the procedure terminates after N steps; for an infinite sequence it does not terminate at all. For 1 </'</„, put Χη(ω) = χηί if ω«Ξ (J,, ,-п.|//"> ,-„.,,· Since each decomposition (5.13) refines the preceding, Xk(a)) = xkj for ω€Ξ//η)(. . . Therefore, each element of (5.13) is contained in the element with the same label /', ...:'„ in the decomposition Α-, ,·„ = [ω:^,(ω)=Λ1/ι>...,Λ'„(ω)=Λ„|.Β]. 1 <:/,</„..., l<i„</„. The two decompositions thus coincide, and it follows by (5.14) that P[Xl=xUi,...,X„=xniJ=Pi,ir--Pnj„- Adding out the indices /„...,:„_, shows that Xn has distribution μη and hence that Xu...,Xn are independent. But η was arbitrary. ■ In the case where the μ„ are all the same, there is an alternative construction based on probabilities in sequence space. Let S be the support (finite) common to the μ„, and let pu, и e S, be the probabilities common to the μ„. In sequence space 5°°, define product measure Ρ on the class if0 of cylinders by (2.21). By Theorem 2.3, Ρ is countably additive on tf0, and by Theorem 3.1 it extends to if=a(if0). The coordinate functions zk(·) are random variables on the probability space (S°°, €, F); take these as the Xk. Then (2.22) translates into P[Xl=ul,...,X„=un\=pu · · · pu, which is just what Theorem 5.3 requires in this special case. Probability theorems such as those in the next sections concern independent sequences {Xn} with specified distributions or with distributions having specified properties, and because of Theorem 5.3 these theorems are true not merely in the vacuous sense that their hypotheses are never fulfilled. Similar but more complicated existence theorems will come later. For most purposes the probability space on which the Xn are defined is largely irrelevant. Every independent sequence {Xn} satisfying P[X„ = 1] =p and P[X„ = 0] = \-p is a model for Bernoulli trials, for example, and for an event like υ^=1[Σ" = 1Α^ > an], expressed in terms of the Xn alone, the calculation of its probability proceeds in the same way whatever the underlying space (Ω, J2",/') may be.
76 PROBABILITY It is, of course, an advantage that such results apply not just to some canonical sequence {Xn} (such as the one constructed in the proof above) but to every sequence with the appropriate distributions. In some applications of probability within mathematics itself, such as the arithmetic applications of run theory in the preceding section, the underlying Ω does play a role. Expected Value A simpie random variable in the form (5.2) is assigned expected value or mean value (5.15) E[X]=E Σ*Λ = Σ*ΛΑί)- There is the alternative form (5.16) Ε[Χ]=Σ*Ρ[Χ = *], X the sum extending over the range of X; indeed, (5.15) and (5.16) both coincide with Σ,Σ,■. x .=fXjP(At). By (5.16) the definition (5.15) is consistent: different representations (5.2) give the same value to (5.15). From (5.16) it also follows that E[X] depends only on the distribution of X; hence E[X] = E[Y] if P[X=Y]=1. If X is a simple random variable on the unit interval and if the Л, in (5.2) happen to be subintervals, then (5.15) coincides with the Riemann integral as given by (1.6). More general notions of integral and expected value will be studied later. Simple random variables are easy to work with because the theory of their expected values is transparent and free of technical complications. As a special case of (5.15) and (5.16), (5-17) E[IA]=P(A). As another special case, if a constant a is identified with the random variable Χ(ω) = a, then (5.18) E[a]=a. From (5.2) follows f(X) = Z;fix,)IAl, and hence (5.19) E[f(X)] = Zf(*i)P(A,) = Lf(*)P[X = x]> i χ the last sum extending over the range of X. For example, the kth moment E[Xk] of X is defined by E[Xk] = LvyP[Xk = y], where у varies over the
SECTION 5. SIMPLE RANDOM VARIABLES 77 range of Xk, but it is usually simpler to compute it by E[Xk] = LxxkP[X = x], where χ varies over the range of X. If (5.20) X= Σ*,Ια,, Υ= ЕуД- ' j are simple random variables, then αΧ + βΥ= Συ(αχι>+ j8y-)/4.nB. has ex" pected value Σ,/ajc,- + /3y;)/>U,- П Bj) = αΣ,χ,ΡίΛ^ + βΣ^Ρ(Β}). Expected value is therefore linear: (5.21) Ε[αΧ + βΥ] = αΕ[Χ]+βΕ[Υ]. If Χ(ω)<Υ(ω) for all ω, then *[<У; if AjCiBj is nonempty, and hence 1LijxiP{AiC\Bj) < Lijy]-P(Aj C\Bj). Expected value therefore preserves order: (5.22) E[X]<E[Y] ifX<Y. (Ii is enough that X < Υ on a set of probability 1.) Two applications of (5.22) give £[--|*l] <£[*]<£[I* 13, so that by linearity, (5.23) \E[X]\<E[\X\]. And more generally, (5.24) \E[X-Y]\<E[\X-Y\]. The relations (5.17) through (5.24) will be used repeatedly, and so will the following theorem on expected values and limits. If there is a finite К such that \Χη(ω)\<Κ for ail ω and n, the Xn are uniformly bounded. Theorem 5.4. // {Xn} is uniformly bounded, and if X = limn Xn with probability 1, then E[X] = lim„ E[Xn]. Proof. By Theorem 5.2(H), convergence with probability 1 implies convergence in probability: Xn -*P X. And in fact the latter suffices for the present proof. Increase К so that it bounds \X\ (which has finite range) as well as all the \Xn\; then \X-Xn\<2K. If A = [\X-Xn\> e], then \Χ(ω)-Χη(ω)\< 2ΚΙΑ(ω)+€ΐΑ*>) ^ 2ΚΙΑ(ω)+€ for all ω. By (5.17), (5.18), (5.21), and (5.22), E[\X-Xn\] <2KP[\X-Xn\>e] +e.
78 PROBABILITY But since Xn ~*p X, the first term on the right goes to 0, and since e is arbitrary, E[\X- Xn\] -+ 0. Now apply (5.24). ■ Theorems of this kind are of constant use in probability and analysis. For the general version, Lebesgue's dominated convergence theorem, see Section 16. Example 5.7. On the unit interval, take Χ(ω) identically 0, and take A;(w)tobe n2 if 0<ω<η'1 and 0 if η'1 <ω < 1. Then Χη(ω)-*Χ(ω) for every ω, although E[Xn] = η does not converge tc E[X] = 0. Thus theorem 5.4 fails without some hypothesis such as that of uniform boundedness. See also Example 7.7. ■ An extension of (5.21) is an immediate consequence of Theorem 5.4: Corollary. If Χ=ΣηΧη on an tF-set of probability 1, and if the partial sums of ΣηΧη are uniformly bounded, then E[X\ = ΣηΕ[Χη]. Expected values for independent random variables satisfy the familiar product law. For X and Υ as in (5.20), XY = Σ,,-κ,ν^.ηβ. If the jc, are distinct and the y; are distinct, then Ai = [X = xi] and Bj = [Y = yj\·, for independent X and Y, P(AS nB;) = PiAjPiBj) by (5.9), and so E[XY] = ZijXiyjPiAjPiBj) = E[X]E[Y]. If Χ,Υ,Ζ are independent, then XY and Ζ are independent by the argument involving (5.10), so that E[XYZ] = E[XY]E[Z] = E[X]E[Y]E[Z). This obviously extends: (5.25) E[Xl--Xn] = E[Xi]--E[Xn] if Xu.·, Xn are independent. Various concepts fiom discrete probability carry over to simple random variables. If E[X] = m, the variance of X is (5.26) Уш[Х]=Е[{Х-т)2\ =Е[Х2]-т2; the left-hand equality is a definition, the right-hand one a consequence of expanding the square. Since αΧ + β has mean am + β, its variance is E[((aX + β) - (am + β))2] = E[a2(X - m)2]: (5.27) Var[ff* + /3]=<*2Var[Z]. If Xu...,Xn have means m,,...,m„, then 5 = E"„iAr/ has mean m = Σ?../я,-, and E[(S - m)2] = Е[(Ц= ,U,. - m,-))2] = Ц= ,£[U,. - m;)2] + 2Σ]&ί<]^ηΕ[(Χι - m^Xj - ntj)]. If the X; are independent, then so are the X; - m,, and by (5.25) the last sum vanishes. This gives the familiar formula
SECTION 5. SIMPLE RANDOM VARIABLES 79 for the variance of a sum of independent random variables: (5.28) Var /■=1 = Σ Var[*,.]. /'= 1 Suppose that X is nonnegative; order its range: 0 <x, <x2 < · · · <xk. Then = Σχι{Ρ[Χ>χ:\-ηχ^ι+Λ)+4Ρ[Χ^4] к = χιΡ[Χ>χχ] + £(*,-*,_,№>*,]. i'=2 Since /J[Ar>jc] = /5[Ar>jc1]forO<jc<jCi and P[X> x] = P[X>xs] for x,_l <x < jc,, it is possible to write the final sum as the Riemann integral of a step function: (5.29) E[X]= fp[X>x]dx. This holds if X is nonnegative. Since P[X>x] = 0 for χ >xk, the range of integration is really finite. There is for (5.29) a simple geometric argument involving the "area over the curve." If p, =P[X = x;], the area of the shaded region in the figure is the sum pxxx + · · · +pkxk = E[X] of the areas of the horizontal strips; it is also the integral of the height P[X>x] of the region. A 1 ^ 1.» '"' 1 \ ■ 1 I;-. , ' l" Pk 1 ι ^ X-, X X-, l*-l
80 PROBABILITY Inequalities There are for expected values several standard inequalities that will be needed. If X is nonnegative, then for positive a (sum over the range of X) Ε[Χ] = ΣχχΡ[Χ = χ]>Σχ χ>αχΡ[Χ = χ]>αΣχ χ>„/>[X = χ]. Therefore, (5.30) P[X>oc}<^E[X] if X is nonnegative and a positive. A special case of this is (1.20). Applied to IX\\ (5.30) gives Markov's inequality, (5.31) P[\X\>a]<-^E[\X\k], valid for positive a. If к = 2 and m = E[X] is subtracted from X, this becomes the Chebysheu (or Chebyshev-Bienayme) inequality: (5.32) P[\X-m\>a] <^Var[X]. a A function φ on an interval is convex [A32] if <p(px + (1 -p)y) <ρφ(χ) + (1 -ρ)φ(γ) for 0 <p < 1 and χ and у in the interval. A sufficient condition for this is that φ have a nonnegative second derivative. It follows by induction that <ρ(Σί„ ,/?,*,) < Σί·„,/?,·<ρ(χ,·) if the pt are nonnegative and add to 1 and the x{ are in the domain of φ. If X assumes the value x, with probability pt, this becomes Jensen's inequality, (5.33) φ(Ε[Χ])<Ε[φ(Χ)], valid if φ is convex on an interval containing the range of X. Suppose that (5.34) p + q=1' P>1' q>1· Holder's inequality is (5.35) E[\XY\]<El/p[\X\"]-El/q[\Y\q]. If, say, the first factor on the right vanishes, then X=0 with probability 1, hence AY = 0 with probability 1, and hence the left side vanishes also. Assume then that the right side of (5.35) is positive. If a and b are positive, there exist s and t such that a =ep~'s and b = eq~''. Since ex is convex,
SECTION 5. SIMPLE RANDOM VARIABLES 81 ep-'s+4 '· <p-les + q-le',OT L ap b" ab< — + —. Ρ Я This obviously holds for nonnegative as well as for positive a and b. Let и and υ be the two factors on the right in (5.35). For each ω, Χ(ω)Υ(ω) uv < 1 Χ (ω) + 1 Υ(ω) Taking expected values and applying (5.34) leads to (5.35). If ρ = q = 2, Holder's inequality becomes Schwarz's inequality: (5.36) E[\XY\]<El/2[X2]El/2[Y2]. Suppose that 0<α<β. In (5.35) take ρ = β/α, q = β/(β-a), and Υ(ω) = 1, and replace X by \X\". The result is Lyapounou's inequality, (5.37) £1/а[|ЛГ]<£1//3[|Л'|/3], 0<<*</3. PROBLEMS 5.1. (a) Show that X is measurable with respect to the σ-field <& if and only if a(X)c<#. Show that X is measurable σ(Υ) if and only if σ(Χ)<ζσ(Υ). (b) Show that, if ^={0,Ω}, then X is measurable ^ if and only if X is constant. (c) Suppose that P(A) is 0 or 1 for every A in ^. This holds, for example, if ^ is the tail field of an independent sequence (Theorem 4.5), or if £ consists of the countable and cocountable sets on the unit interval with Lebesgue measure. Show that if X is measurable ^, then P[X= c] = 1 for some constant с 5.2. 2.19 Τ Show that the unit interval can be replaced by any nonatomic probability measure space in the proof of Theorem 5.3. 5.3. Show that m =E[X] minimizes E[(X - m)2]. 5.4. Suppose that X assumes the values m—a,m,m+a with probabilities p, 1 — 2p,p, and show that there is equality in (5.32). Thus Chebyshev's inequality cannot be improved without special assumptions on X. 5.5. Suppose that X has mean m and variance σ2. (a) Prove Cantelli's inequality P[X-m>a]< σ2 + α2' a>0.
82 PROBABILITY (b) Show that P[\X~m\>a] <2σ2/(σ2 + a2). When is this better than Chebyshev's inequality? (c) By considering a random variable assuming two values, show that Cantelli's inequality is sharp. 5.6. The polynomial £[(ί|ΛΊ + \Υ\)2] in t has at most one real zero. Deduce Schwarz's inequality once more. 5.7. (a) Write (5.37) in the form Εβ/α[\Χ\α]<Ε[\Χ\α)β/α] and deduce it directly from Jensen's inequality. (b) Prove that E[l/X"]> l/E"[X] for ρ > 0 and X a positive random variable. 5.8. (a) Let / be a convex real function on a convex set С in the plane Suppose that (Χ(ω), Y(co)) e С for all ω and prove a two-dimensional Jensen's inequality: (5.38) f(E[X],E[Y])<E[f(X,Y)]. (b) Show that / is convex if it has continuous second derivatives that satisfy (5-39) /M>0, /22>0, fufnbfh- 5.9. Τ Holder's inequality is equivalent to E[Xl/pYi/q]<El/p[X]Ei/ci[Y] (p~l + q~x = 1), where X and Υ are nonnegative random variables. Derive this from (5.38). 5.10. Τ Minkowski's inequality is (5.40) El/°[\X+Y\p]<Ei/p[\X\p]+El/p[\Y\p], valid for p> 1. It is enough to prove that E[(Xi/p + Υ [/p)p] <(E[/p[X] + Ei/p[Y])p for nonnegative X and У. Use (5.38). 5.11. For events Al,A2,..., not necessarily independent, let Nn = YJjl_liA be the number to occur among the first n. Let (5-41) *n=l- i.P(Ak)> βη--^—^ Σ P(AjnAk). k = l \ ' \<,j<k<n Show that (5.42) E[n~%] =«„, Var[n-4] =^„-«„2+ ^^- Thus Var[n-'Nn]->0 if and only if /3„-«;;->0, which holds if the An are independent and P(An)=p (Bernoulli trials), because then an=p and βη = P2 = <*2n-
SECTION 5. SIMPLE RANDOM VARIABLES 83 5.12. Show that, if A' has nonnegative integers as values, then E[X] = T%=iP[X>n]. 5.13. Let /f = IA, be the indicators of η events having union A. Let Sk = Σ/, ··*/,·, where the summation extends over all fc-tuples satisfying 1 </, < · · · <ik<n. Then sk = E[Sk] are the terms in the inclusion-exclusion formula P{A) = sx - s2+ · ■· ±sn. Deduce the inclusion-exclusion formula from /,,=5,-52 + • · · ± 5„. Prove the latter formula by expanding the product Π"_ι(1 - /,·) ■ ι «-'< 5.14. Let fn(x) be n2x or In -ηλχ or 0 according as 0<*<η~' or χ <2n~x or 2n~l <x<\. This gives-a standard example of a sequence of continuous functions, that converges to 0 but not uniformly. Note that Ι^η(χ)άχ does not converge to 0; relate to Example 5.7. 5.15. By Theorem 5.3, for any prescribed sequence of probabilities pn, there exists (on some space) an independent sequence of events An satisfying P(An)=pn. Show that if pn -»0 but Σρη =■ °°, this gives a counterexample (like Example 5.4) to the converse of Theorem 5.2(ii). 5.16. Τ Suppose that 0 <pn < 1 and put an = min{pn, 1 ~p„). Show that, if Σ«„ converges, then on some discrete probability space there exist independent events An satisfying P(An)=pn. Compare Problem 1.1(b). 5.17. (a) Suppose that X„ ->P X and that / is continuous. Show that f(Xn) ->P f(X). (b) Show that E[\X~Xn\]- false. • 0 implies Xn -+P X. Show that the converse is 5.18. 2.20 Τ The proof given for Theorem 5.3 for the special case where the μη are all the same can be extended to cover the general case: use Problem 2.20. 5.19. 2.18 Τ For integers m and primes p, let ap(m) be the exact power of ρ in the prime factorization of m: m = Прр""(т). Let Sp(m) be 1 or 0 as ρ divides m or not. Under each Pr for distinct primes p{,...,pu, (see (2.34)) the ap and δρ are random variables. Show that (5.43) and (5.44) Similarly, (5.45) Рп[ар/>к„1<и] = ± ρί' •••Pu" p*....p*. ι = ι \ fi Pi I P„[8p-l,izu\-± P\---Pu Pi'"P«' According to (5.44), the ap are for large η approximately independent under P„, and according to (5.45), the same is true of the δρ.
PROBABILITY For a function / of positive integers, let (5-46) £„[/]=£ £/(™) m= 1 be its expected value under the probability measure Pn. Show that (5.47) lPki P-1' this says roughly that (p - 1) ' is the average power of ρ in the factorization of large integers. Τ (a) From Stirling's formula, deduce (5.48) E„[log] = logn + 0(l). From this, the inequality En[ap] <2/p, and the relation log m = Epap(m)\og p, conclude that Σρρ~[ log ρ diverges and that there are infinitely many primes, (b) Let log*m = LpSp(m)\ogp. Show that (5.49) £„[log*] = Σ log ρ = log и + 0(1). (с) Show that [2n/p\ — 2[n/p\ is always nonnegative and equals 1 in the range η <p < In. Deduce £2„[log*] - £„[log*] = 0(1) and conclude that (5.50) Σ \ogP=0(x). Use this to estimate the error removing the integral-part brackets introduces into (5.49), and show that (5.51) Σ p-'logp = log* + 0(l). p<x (d) Restrict the range of summation in (5.51) to θχ <p <x for an appropriate Θ, and conclude that (5.52) Σ log ρ xx, p<,x in the sense that the ratio of the two sides is bounded away from 0 and <». (e) Use (5.52) and truncation arguments to prove for the number π(χ) of primes not exceeding χ that (5.53) π(χ)~ΈΪΪ·
SECTION 6. THE LAW OF LARGE NUMBERS 85 (By the prime number theorem the ratio of the two sides in fact goes to 1.) Conclude that the rth prime pr satisfies pr ^ r log r and that (5-54) Σ^=»- Ρ ρ SECTION 6. THE LAW OF LARGE NUMBERS The Strong Law Let Xu X2,■ · · be a sequence of simple random variables on some probability space (Ω, &~, P). They are identically distributed if their distributions (in the sense of (5.12)) are all the same. Define 5„ =XX + · · · \-Xn. The strong law of large numbers: Theorem 6.1. // the Xn are independent and identically distributed and E[Xn] = m. then (6-1) lim η [S„ =m\ = 1. Proof. The conclusion is that n~lSn-m = n~l'Z"„l(Xj-m)-*Q with probability 1. Replacing Xi by Xt - m shows that there is no loss of generality in assuming that m = 0. The set in question does lie in fF (see (5.5)), and by Theorem 5.2(i), it is enough to show that P[\n ~~ lSn\ > e i.o.] = 0 for each e. Let E[X?\ = σ2 and E[Xf] = ξ4. The proof is like that for Theorem 1.2. First (see (1.26)), E[S*] = ΣΕ[ΧαΧβΧΎΧδ], the four indices ranging independently from 1 to n. Since E[Xt] = 0, it follows by the product rule (5.25) for independent random variables that the summand vanishes if there is one index different from the three others. This leaves terms of the form E[Xf] = ξ\ of which there are n, and terms of the form E\X}Xf\ = E\X}\E\Xf\ = σ4 for / Φ}, of which there are 3n(n - 1). Hence (6.2) £[S„4] =ηξ4 + 3η(η-\)σ4<Κη2, where К does not depend on n. By Markov's inequality (5.31) for к = 4, P[\Sn\ > ne] < Kn~2e~4, and so by the first Borel-Cantelli lemma, P[\n~lSn\ > e i.o.] = 0, as required. ■ Example 6.1. The classical example is the strong law of large numbers for Bernoulli trials. Here P[X„ = 1] =p, P[X„ = 0] = 1 -p, m =p; Sn represents the number of successes in η trials, and n~lSn ~*p with probability 1. The idea of probability as frequency depends on the long-range stability of the success ratio S /n. ■
86 PROBABILITY Example 6.2. Theorem 1.2 is the case of Example 6.1 in which Ш, !F, P) is the unit interval and the Χη(ω) are the digits άη(ω) of the dyadic expansion of ω. Here p=\. The set (1.21) of normal numbers in the unit interval has by (6.1) Lebesgue measure 1; its complement has measure 0 (and so in the terminology of Section 1 is negligible). ■ The Weak Law Since convergence with probability 1 implies convergence in probability (Theorem 5.2(ii)), it follows under the hypotheses of Theorem 6.1 that n~ '5„ -*ρ т. But this is of course an immediate consequence of Chebyshev's inequality (5.32) and the rule (5.28) for adding variances: П € Π € This is the weak law of large numbers. Chebyshev's inequality leads to a weak law in other interesting cases as well: Example 6.3. Let Ωπ consist of the n\ permutations of 1,2,...,n, all equally probable, and let Xnk(w) be 1 or 0 according as the kth element in the cyclic representation of ω e Ω„ completes a cycle or not. This is Example 5.6, although there the dependence on η was suppressed in the notation. The Xnl,...,Xnn are independent, and Sn=Xnl + ··· +Xnn is the number of cycles. The mean mnk of Xnk is the probability that it equals 1, namely (n - к + l)-1, and its variance is σ„\ = mnk(l - mnk). If Ln = Zk = lk~l, then Sn has mean T,nk = lmnk = Ln and variance L"k=slmnk(l - mnk) <Ln. By Chebyshev's inequality, \Sn-Ln [ L» >€ Of the n\ permutations on η letters, a proportion exceeding 1 - e~2L~l thus have their cycle number in the range (l+e)L„. Since Ln = log η + 0(1), most permutations on η letters have about log η cycles. For a refinement, see Example 27.3. Since Ωπ changes with и, it is the nature of the case that there cannot be a strong law corresponding to this result. ■ Bernstein's Theorem Some theorems that can be stated without reference to probability nonetheless have simple probabilistic proofs, as the last example shows. Bernstein's approach to the Weierstrass approximation theorem is another example.
SECTION 6. THE LAW OF LARGE NUMBERS 87 Let / be a function on [0,1]. The Bernstein polynomial of degree η associated with / is (6.3) Bn(x) = Д/(£)(«)х*(1_х)-* Theorem 6.2. If f is continuous, Bn(x) converges to f(x) uniformly on [0,1]· According to the Weierstrass approximation theorem, / can be uniformly approximated by polynomials; Bernstein's result goes further and specifies an approximating sequence. Proof. Let Μ = sup,, |/(x)|, and let 5(e) = sup[|/(x) -f(y)\: \x - y\ <e] be the modulus of continuity of /. It will be shown that Ί Μ (6.4) sup|/(x)-B„(x)|<S(£) + —. χ K€ By the uniform continuity of /, lim£^05(e) = 0, and so this inequality (for e = n~l/3, say) will give the theorem. Fix η > 1 and χ e [0,1] for the moment. Let Xl:..., Xn be independent random variables (on some probability space) such that P[X,= 1] = * and />[*, = 01= 1-х; put S = *,+ ·· +Xn. Since />[S = k] = ("k)xk(\ -x)n~k, the formula (5.19) for calculating expected values of functions of random variables gives E[f(S/n)] = Bn(x). By the law of large numbers, there should be high probability that S/n is near χ and hence (/ being continuous) that f(S/n) is near f(x); E[f(S/n)] should therefore be near f(x). This is the probabilistic idea behind the proof and, indeed, behind the definition (6.3) itself. Bound |/(«"'5)-/(x)|by 5(e) on the set [\n"lS - x\<e] and by 2M on the complementary set, and use (5.22) as in the proof of Theorem 5.4. Since E[S] = nx, Chebyshev's inequality gives \Bn(x)-f(x)\<E[\f(n-lS)-f(x)\] <5(e)P[|«-15-x|<e] + 2M/5[|«-'5-x!>e] <8(e)+2M Var[S]/rtV; since Var[5] = nx(l - x) < n, (6.4) follows. ■ A Refinement of the Second Borel-Cantelli Lemma For a sequence Al:A2,... of events, consider the number Nn = IA + ··· +IA of occurrences among Av...,An. Since [An i.o.] = [ω: supn Νη(ω) = o°], P[An i.o.] can be studied by means of the random variables JV„.
88 PROBABILITY Suppose that the An are independent. Put pk = P(Ak) and mn = pl + ·■■ +pn. From E[IAt]=pk and Var[/J = Pk{\ - pk) <pk follow E[N„] = mn and Vai[iV„] = Σ£ = , Varf/^J <m„. If mn >x, then (6.5) />[/Vn<x]</>[|/Vn-mJ>m„-x] Var[/V„] ^ m„ < (m„-x)2 (mn-x)2 If Σ/?„ = oo, so that m„ -»oo, it follows that lim„ P[Nn <x] = 0 for each x. Since (6.6) sup Nk<x zP[N„<x], ftsup^. Nk <x] = 0 and hence (take the union over χ = 1,2,...) ftsup^ Nk < oo] = 0. Thus P[An i.o.] = /5[sup„ /V, = oo] = 1 if the An are independent and Σρη = oo, which proves the second Borel-Cantelli lemma once again. Independence was used in this argument only to estimate Var[/V„]. Even without independence, E[Nn] = mn and the first two inequalities in (6.5) hold. Theorem 6.3. // ΣΡ(Αη) diverges and Σ Р(А;ПАк) (6.7) liminf^-^ — < 1, " (Σρ(α,)) Kk<n ' then Р[АП i.o.] = 1. As the proof will show, the ratio in (6.7) is at least 1; if (6.7) holds, the inequality must therefore be an equality. Proof. Let θη denote the ratio in (6.7). In the notation above, Var[JVj=£[jV„2]-^= Σ E\lAlA\-m\ J, k<n = Σ Р{А}пАк)-т2„ = (вп-1)т2п J, k<n (and 0„-l>O). Hence (see (6.5)) P[Nn <x] <(0„ - l)m2n/(mn -x)2 for χ <mn. Since m2n/(mn -x)2 ~> 1, (6.7) implies that liminf„ P[Nn <x] = 0. It still follows by (6.6) that P[supk Nk<x] = 0, and the rest of the argument is as before. ■
SECTION 6. THE LAW OF LARGE NUMBERS 89 Example 6.4. If, as in the second Borel-Cantelli lemma, the An are independent (or even if they are merely independent in pairs), the ratio in (6.7) is 1 + Lk <„( pk-p\)/m2n, so that ΣΡ(Αη) = οο implies (6.7). ■ Example 6.5. Return once again to the run lengths /π(ω) of Section 4. It was shown in Example 4.21 that {rn} is an outer boundary (P[ln > rn i.o.] = 0) if Σ2_Γ"<ο°. It was also shown that {rj is an inner boundary (P[ln>rn i.o.] = 1) if rn is nondecreasing and Σ2"Γτ~1 = οο, but Theorem 6.3 can be used to prove this under the sole assumption that Σ2~Γη = oo. As usual, the rn can be taken to be positive integers. Let An = [ln > rn] = [dn = ■ ■ ■ = dn+r _j = 0]. If j + η < к, then Aj and Ak are independent. If j<k<j + rj, then P(Aj\Ak)<P[dj= ·■· =dk_i = 0\Ak] = P[dJ= ■■■ = d'*_, = 0]= 1/2*4 and so Р(А;Г)Лк) <PUk)/2k~j. Therefore, Σ P{Aj^Ak) j, к <п < EP(Ak)+2 Σ P{Aj)P(Ak) + 2 Σ 2-<k-»P(Ak) k<n i<k<n j<k <n j + r^k k<jJrrj * LP(Ak) + (ZP(Ak)) +2ΣΡ(ΑΙζ). k<n k<n ' k<,n If ΣΡ(Α„) = Σ2~Γ- diverges, then (6.7) follows. Thus {rr} is an outer or an inner boundary according as Σ2~Γ" converges or diverges, which completely settles the issue. In particular, rn = log2" + 01og2log2rt gives an outer boundary for 0> 1 and an inner boundary for 0<1. ■ Example 6.6. It is now possible to complete the analysis in Examples 4.12 and 4.16 of the relative error επ(ω) in the approximation of ω by LnkZ\dk(a))2~k. If /„(ω) > - log2 xn (0 <xn < 1), then en(<o) <xn by (4.22). By the preceding example for the case rn = -log2^„, Σ*π = °° implies that Ρ[ω: еп{ш)<хп i.o.]= 1. By this and Example 4.12, [ω: en(a>) <xn i.o.] has Lebesgue measure 0 or 1 according as Σχπ converges or diverges. U PROBLEMS 6.1. Show that Zn-*Z with probability 1 if and only if for every positive e there exists an n such that P[\Zk~Z\<e, n <k <m\> \ —e for all m exceeding n. This describes convergence with probability 1 in "finite" terms. 6.2. Show in Example 6.3 that P[\S„ - L„\ 2: L'/2+£] -* 0. 6.3. As in Examples 5.6 and 6.3, let ω be a random permutation of 1,2,..., n Each k, 1 <k <n, occupies some position in the bottom row of the permutation ω;
90 PROBABILITY let XnkM be the number of smaller elements (between 1 and к - 1) lying to the right of к in the bottom row. The sum S„ = Xni+ ··· +Xnn is the total number of inversions—the number of pairs appearing in the bottom row in reverse order of size. For the permutation in Example 5.6 the values of Χη,...,ΧΊΊ are 0, 0, 0, 2, 4, 2, 4, and 57 = 12. Show that A",,,,.. ,Xnn are independent and P[Xnk = i] = k~' for 0 < i < k. Calculate E[S„] and Var[5„]. Show that S„ is likely to be near n2/A. 6.4. For a function / on [0,1] write ll/ll = supj/(*)|. Show that, if / has a continuous derivative /', then \\f - B„\\<e\\f'\\ + 2\\f\\ /ne2. Conclude that \\f~Bn\\ = 0(n-^\ 6.5. Prove Poisson's theorem: If Al,A2,... are independent events, pn = «"'Σ?.,ΡΚ·), and ,У = Е;'=1/Л|, then n-lN„-pn-*P0. In the following problems S„=Xi + ■ ■ ■ +Xr 6.6. Prove Cantelli's theorem. If A",, A';,... are independent, E[Xn] = 0, and E[X*] is bounded, then n~ lS„ -» 0 with probability 1. The Xn need not be identically distributed 6.7. (a) Let xbx2,... be a sequence of real numbers, and put sn=xl + · ·· +xn. Suppose that n~2sn2 -* 0 and that the xn are bounded, and show that n~ xsn -* 0. (b) Suppose that n~zS„: -* 0 with probability 1 and that the Xn are uniformly bounded (sup,, ω\Χη(ω)\ <<»). Show that n~[Sn-->0 with probability 1 Here the Xn need not be identically distributed or even independent. 6.8. Τ Suppose that A",, A"2>... are independent and uniformly bounded and E[X„] = 0. Using only the preceding result, the first Borel-Cantelli lemma, and Chebyshev's inequality, prove that n~lSn -* 0 with probability 1. 6.9. Τ Use the ideas of Problem 6.8 to give a new proof of Borel's normal number theorem, Theorem 1.2. The point is to return to first principles and use only negligibility and the other ideas of Section 1, not the apparatus of Sections 2 through 6; in particular, P(A) is to be taken as defined only if A is a finite, disjoint union of intervals. 6.10. 5.11 6.7Τ Suppose that (in the notation of (5.41)) β„ ~a2 = 0(1 /ή). Show that η lNn -an->0 with probability 1. What condition on βη - a2 will imply a weak law? Note that independence is not assumed here. 6.11. Suppose that ΑΊ, X2,... are m-dependent in the sense that random variables more than m apart in the sequence are independent. More precisely, let ja^* =a(Xj,...,Xk), and assume that ja^*',...,sxf^1 are independent if fc,_, + m </,· for i = 2,...,/. (Independent random variables are 0-dependent.) Suppose that the Xn have this property and are uniformly bounded and that E[X„] = 0. Show that n"'Sn->0. Hint: Consider the subsequences X/, -^i+m + i» Xi+%m+1), for 1 < ί < m + 1. 6.12. Τ Suppose that the Xn are independent and assume the values xx,..., x, with probabilities p(x,),.. .,ρ(χ,). For ux,...,uk a fc-tuple of the x,\ let
SECTION 6. THE LAW OF LARGE NUMBERS 91 N„(ult...,uk) be the frequency of the fc-tuple in the first n+k- 1 trials, that is, the number of t such that 1 <t <n and X, = ul,..., Xl+k_{ = uk. Show that with probability 1, all asymptotic relative frequencies are what they should be—that is, with probability 1, n~'JV„(h„ ..., uk) -*p(ux) ■ ■ ■ p(uk) for every к and every fc-tuple u{,...,uk. 6.13. Τ A number ω in the unit interval is completely normal if, for every base b and every к and every fc-tuple of base-b digits, the fc-tuple appears in the base-b expansion of ω with asymptotic relative frequency b~k. Show that the set of completely normal numbers has Lebesgue measure 1. 6.14. Shannon's theorem. Suppose that XhX2,... are independent, identically distributed random variables taking on the values l,...,r with positive probabilities ρ „..., pr If pn(ix,...,in) = ph ..p, and ρ„(ω) = ρ„(Χ[(ω),. .,Χ„(ω)), then ρ„(ω) is the probability that a new sequence of η trials would produce the particular sequence Xt(<a),..., Χ„(ω)οίoutcomes that happens actually to have been observed. Show that 1 r - log ρ„(ω) -» A = - £ p,. log p, ; = i with probability 1. In information theory 1, — r are interpreted as the letters of an alphabet, Xl,X2,..- are the successive letters produced by an information source, and h is the entropy of the source. Prove the asymptotic equipartition property: For large η there is probability exceeding 1-е that the probability ρ,,(ω) of the observed и-long sequence, or message, is in the range e~"(h±*\ 6.15. In the terminology of Example 6.5, show that log2 η + log2 log2 η + θ log2 log2 log2 η is an outer or inner boundary as θ > 1 or θ < 1. Generalize. (Compare Problem 4.12.) 6.16. 5.20? Let g(m) = Lp8p(m) be the number of distinct piime divisors of m. For an = En[g] (see (5.46)) show that an -> <». Show that (6.8) еАъ-1- ι \U~l- MLJ- + .L L\ n[p\)\ п[Я \j\~ np nq for ρ Φ q and hence that the variance of g under Pn satisfies (6-9) Var„[g]<3 £ -i p<1 Prove the Hardy-Ramanujan theorem: (6.10) limP m: g("0 - 1 ^€ Since απ ~ loglogn (see Problem 18.17), most integers under η have something like log log η distinct prime divisors. Since log log 107 is a little less than 3, the typical integer under 107 has about three prime factors—remarkably few.
92 PROBABILITY 6.17. Suppose that Xt,X2,... are independent and P[Xn = 0]=p. Let Ln be the length of the run of O's starting at the nth place: Ln = к if Xn = ■ ■ ■ = Xn+k_i = О Φ Хп + к. Show that P[Ln > rn i.o.] is 0 or 1 according as L„pr" converges or diverges. Example 6.5 covers the case ρ = γ. SECTION 7. GAMBLING SYSTEMS Let Xl,X2,--- be an independent sequence of random.variables (on some (Ω, JF", />)) taking on the two values +1 and -1 with probabilities P[X„ = + l]=p and P[X„= -\\ = q = \-p. Thioughout the section, Xn will be viewed as the gambler's gain on the nth of a series of plays at unit stakes. The game is favorable to the gambler if ρ > {-, fair if ρ = {, and unfavorable if ρ < {-. The case ρ < {- will be called the subfair case. After the classical gambler's ruin problem has been solved, it will be shown that every gambling system is in certain respects without effect and that some gambling systems are in other respects optimal. Gambling problems of the sort considered here have inspired many ideas in the mathematical theory of probability, ideas that carry far beyond their origin. Red-and-Ыаск will provide numerical examples. Of the 38 spaces on a roulette wheel, 18 are red, 18 are black, and 2 are green. In betting either on red or on black the chance of winning is Щ. Gambler's Ruin Suppose that the gambler enters the casino with capital a and adopts the strategy of continuing to bet at unit stakes until his fortune increases to с or his funds are exhausted. What is the probability of ruin, the probability that he will lose his capital, a? What is the probability he will achieve his goal, c? Here a and с are integers. Let (7.1) Sn=Xl+---+Xn, 50 = 0. The gambler's fortune after η plays is a +Sn. The event n-l (7.2) Aan = [a + Sn = c]n f][0<a + Sk<c] k=l represents success for the gambler at time n, and n- 1 (7.3) Bafn = [a+Sa = 0]n C\[0<a + Sk<c] k=l
SECTION 7. GAMBLING SYSTEMS 93 represents ruin at time n. If sc(a) denotes the probability of ultimate success, then (7.4) *c(«)-1 ΙΜ..« - ΣΡ(Λα,η) > л = 1 / n = l for 0 < a < с Fix с and let a vary. For η > 1 and 0 <a <c, define Aa n by (7.2), and adopt the conventions Aa 0 = 0 for 0<a<c and Лс0 = П (success is impossible at time 0 if a < с and certain if a = c), as well as A0 n = Ac „ = 0 for η > 1 (play never starts if α is 0 or c). By these conventions, sc(0) = 0 and sc(c)=h Because of independence and the fact that the sequence X2,X3,... is a probabilistic replica of XbX2,..., it seems clear that the chance of success for a gambler with initial fortune a must be the chance of winning the first wager times the chance of success for an initial fortune a + 1, plus the chance of losing the first wager times the chance of success for an initial fortune a - 1. It thus seems intuitively clear that (7.5) sc(a) =psc(a + 1) +qsc(a - 1), 0<a<c. For a rigorous argument, define A'an just as Aa „ but with S'n=X2 + ■■■ +χη + ι in Place of 5„ in (7.2). Now />[*,. = *,·, Ί" < n] =P[Xi+l =*,-, i < n] for each sequence x,,...,x„ of +l's and -l's, and therefore Ρ[(Χι,...,Χη)<ΞΗ] = Ρ[(Χ2,...,Χη + ι)(ΞΗ]ΐοτ Я с Л". Take Η to be the set of χ = (*,,...,*„) in R" satisfying x, = ±1, a +x, + · · · +xn = c, and +xk <c for к < п. It follows then that 0<a +x, + (7.6) P(AaJ=P(A'aJ. Moreover, Aa<n = ([*, = +1]пЛ'й + 1 ,„_,)u([Z, = - 1]пЛ'й_, „.,) for и > 1 and 0<a<c. By independence and (7.6), P(Aan) =pP(Aa + l n_r) + qP(Aa_x „_!); adding over и now gives (7.5). Note that this argument involves the entire infinite sequence Xx, X2,... It remains to solve the difference equation (7.5) with the side conditions 5C(0) = 0, sc(c) = 1. Let ρ = q/p be the odds against the gambler. Then [A19] there exist constants A and В such that, for 0<a < c, sc(a)=A + Bp" if ρ Φ q and sc(a) = A + Ba if ρ = q. The requirements sc(0) = 0 and sc(c) = 1 determine A and B, which gives the solution: The probability that the gambler can before ruin attain his goal of с from an initial capital of a is (7.7) sc(a) ^—[> 0<a<c, ifp=J#l, -, 0<a<c, ifp=J = l.
94 PROBABILITY Example 7.1. The gambler's initial capital is $900 and his goal is $1000. If ρ = \, his chance of success is very good: s1000(900) = .9. At red-and-black, ρ = з| and hence ρ = Щ; in this case his chance of success as computed by (7.7) is only about .00003. ' ■ Example 7.2. It is the gambler's desperate intention to convert his $100 into $20,000. For a game in which ρ = 5 (no casino has one), his chance of success is 100/20,000 = .005; at red-and-black it is minute—about 3 X 10 ~911 In the analysis leading to (7.7), replace (7.2) by (7.3). It follows that (7.7) with ρ and q interchanged (p goes to p_1) and a and с-a interchanged gives the probability rc(a) of ruin for the gambler: rc(a) = (p_(c ~a) - 1)/ (p_c-l) if ρ φ 1 and rc(a) = (c - a)/c if ρ = 1. Hence sc(a) -J- rc(a) = 1 holds in all cases: The probability L· 0 that play continues forever. For positive integers a and b, let На,ъ= U [[Sn = b]n"r\[-a<Sk<b]\ n-1\ fc=l / be the event that 5„ reaches +b before reaching -a. Its probability is simply (7.7) with c = a+b: P(HaJ,) = sa+b(a). Now let Hb= UHa<b= \J[Sn = b] = a=l n=l sup Sn>b be the event that 5„ ever reaches +b. Since Ha h Τ #fe as a -»oo, it follows that /4#6) = !ima 50+ί)(α); this is 1 if ρ = 1 or ρ < 1, and it is \/pb if ρ > 1. Thus (7.8) sup Sn>b 1 iip>q, (p/q)b ifp<q. This is the probability that a gambler with unlimited capital can ultimately gain b units. Example 7.3. The gambler in Example 7.1 has capital 900 and the goal of winning b = 100; in Example 7.2 he has capital 100 and b is 19,900. Suppose, instead, that his capital is infinite. If ρ = \, the chance of achieving his goal increases from .9 to 1 in the first example and from .005 to 1 in the second. At red-and-black, however, the two probabilities .9100 and .919900 remain essentially what they were before (.00003 and 3 X 10~911). ■
SECTION 7. GAMBLING SYSTEMS 95 Selection Systems Players often try to improve their luck by betting only when in the preceding trials the wins and losses form an auspicious pattern. Perhaps the gambler bets on the nth trial only when among Xv..., Χη_λ there are many more + l's than - l's, the idea being to ride winning streaks (he is "in the vein"). Or he may bet only when there are many more -l's than +l's, the idea being it is then surely time a +1 came along (the "maturity of the chances"). There is a mathematical theorem that, translated into gaming language, says all such systems are futile. It might be argued that it is sensible to bet if among Xl,...,X„_l there is an excess of + l's, on the ground that it is evidence of a high value of p. But it is assumed throughout that statistical inference is not at issue: ρ is fixed—at Щ, for example, in the case of red-and-black—and is known to the gambler, or should be. The gambler's strategy is described by random variables B-^ B2,... taking the two values 0 and 1: If Bn = 1, the gambler places a bet on the nth trial; if Bn = 0, he skips that trial. If Bn were (Xn + l)/2, so that Bn = 1 for Xn = + 1 and Bn = 0 for Xn = -1, the gambler would win every time he bet, but of course such a system requires he be prescient—he must know the outcome Xn in advance. For this reason the value of Bn is assumed to depend only on the values of Xl,...,Xn_[: there exists some function bn: Rn~l -* Rl such that (7-9) 5„ = &„(*„...,*„_,). (Here β] is constant.) Thus the mathematics avoids, as it must, the question of whether prescience is actually possible. Define mm /$>*(*„...,*„), n-1,2,..., и'Ш) \^Ο = {0,Ω}. The σ-field 5^_, generated by X1,...,Xn_1 corresponds to a knowledge of the outcomes of the first η - 1 trials. The requirement (7.9) ensures that Bn is measurable 5^_! (Theorem 5.1) and so depends only on the information actually available to the gambler just before the nth trial. For η = 1,2,..., let iV„ be the time at which the gambler places his nth bet. This nth bet is placed at time к or earlier if and only if the number Ef=1B, of bets placed up to and including time к is η or more; in fact, N„ is the smallest к for which Y%=lBt = n. Thus the event [Nn<k] coincides with [Ef„i£;>n]; by (7.9) this latter event lies in σ(β,,..., Bk)(z σ(*„...,Xk_{)= Ук_х. Therefore, (7.11) [Nn = k] = [Nn<k]-[Nn<k-l]tESrk_l.
96 PROBABILITY (Even though [Nn = k] lies in ^k_1 and hence in &~, Nn is, as a function on Ω, generally not a simple random variable, because it has infinite range. This makes no difference, because expected values of the Nn will play no role; (7.11) is the essential property.) To ensure that play continues forever (stopping rules will be considered later) and that the N„ have finite values with probability 1, make the further assumption that (7.12) P[B„ = I i.o.] = l. A sequence {Bn} of random variables assuming the values 0 and 1, having the form (7.9), and satisfying (7.12) is a selection system. Let Yn be the gambler's gain on the nth of the trials at which he does bet: Y„ = XNn. It is only on the set [Bn = 1 i.o ] that all the N„ and hence all the Y„ are well defined. To complete the definiiion, set Yn= -1, say, on [Bn= 1 i.o.]c; since this set has probability 0 by (7.12), it really makes no difference how YH is defined on it. Now Yn is a complicated function on Ω because Υη(ω) = ΧΝ (ω)(ω). Nonetheless, [ω: Υ„(ω) = l] = (J {[ω: Ν„(ω) = к] η [ω: Хк{а>) = l]) lies in JF, and so does its complement [ω: Υη(ω) = - 1]. Hence Y„ is a simple random variable. Example 7.4. An example will fix these ideas. Suppose that the rule is always to bet on the first ti ail, to bet on the second trial if and only if Xx= +1, to bet on the third trial if and only if X{ = X2, and to bet on all subsequent trails. Here Bx = 1, [B2= \] = [Xl = +1], [B3 = 1] = [Х{ =Х2], and B4 = B5= · · · = 1. The table shows the ways the gambling can start out. A dot represents a value undetermined by Xu X2, X3. Ignore the rightmost column for the moment. Χι -1 -1 -1 -1 + 1 + 1 + 1 + 1 *2 -1 -1 + 1 + 1 -1 -1 + 1 + 1 x3 в, в2 -1 + 1 -1 + 1 -1 + 1 ] -1 ] + 1 I 0 I 0 I 0 I 0 I 1 1 1 I 1 B3 N, N2 1 1 1 ] 0 1 0 1 0 1 0 1 1 1 1 1 3 3 4 4 2 2 2 2 ^3 4 4 5 5 4 4 3 3 #4 5 5 6 6 5 5 4 4 Yi -1 -1 -1 -1 + 1 + 1 + 1 + 1 y2 -1 + 1 -1 -1 + 1 + 1 ^3 -1 + 1 τ 1 1 1 1 2 2 3
SECTION 7. GAMBLING SYSTEMS 97 In the evolution represented by the first line of the table, the second bet is placed on the third trial (N2 = 3), which results in a loss because Y2 = XN = X3 = -1. Since X3 = -1, the gambler was "wrong" to bet. But remember that before the third trial he does not know Χ3(ω) (much less ω itself); he knows only Χ^ω) and Χ2(ω). See the discussion in Example 5.5. ■ Selection systems achieve nothing because {Yn} has the same structure as Theorem 7.1. For every selection system, {YJ is independent and P[Y„~ + l]=P,P[Yn=-l] = 1. Proof. Since random variables with indices that are themselves random variables are conceptually confusing at first, the w's here will not be suppressed as they have been in previous proofs. Relabel ρ and q as p( + l) and p(- 1), so that Ρ[ω: Хк(со) = x] = p(x) for χ = ±1. If A e &k_1, then A and [ω: Xk(a>) = x] are independent, and so Ρ(Αί)[ω: Xk(<o) = xl) = P(A)p(x). Therefore, by (7.11), Ρ[ω:Υη(ω)=χ]=Ρ[ω:ΧΝι(ω)(ω)=χ] oo = LP[<o:N^)=k,Xk(w)=x] k = l 00 = P(x)· More generally, for any sequence xv...,xn of + l's, Ρ[ω:Υ,{ω)=χ.„Ί^η] = Ρ[ω: ΧΝ^(ω) = χ;, i <n] Σ Р[о:Щ(о.)=к.„Хк1(а>)=Х;,1<п], *ι< ■··<*„ where the sum extends over и-tuples of positive integers satisfying /c, < · · · < kn. The event [ω: /ν,(ω) = kh i < η] η [ω: Xk((o) = xh i < n] lies in &~k , (note that there is no condition on Xk (ω)), and therefore />[ω:^( ω) =*,,/<«] = Σ /,([a>:tfi(<U)=*|.,i<n] Γ»[ω: ХкЦ(о)=хп i<n])p(x„).
98 PROBABILITY Summing kn over kn_l + l, k„_l + 2,... brings this last sum to Σ Ρ[ω: Ν,(ω) = k„ Χ,ί,ω) = x„ ι < η] ρ(χη) = />[ω: ΧΝι<ω)(ω)=Χί,ΐ<η]ρ(χη) = Ρ[ω:^.(ω)=χ|.,ί<η]ρ(χη). It follows by induction that Ρ[ω: ^(ω) =x,., i <n] = Пр(^) = ГИ*»: Ή") = *,]. and so the У) are independent (see (5.9)). ■ Gambling Policies There are schemes that go beyond selection systems and tell the gambler not only whether to bet but how much. Gamblers frequently contrive or adopt such schemes in the confident expectation that they can, by pure force of arithmetic, counter the most adverse workings of chance. If the wager specified for the nth trial is in the amount Wn and the gambler cannot see into the future, then Wn must depend only on Xl,...,X„_1. Assume therefore that Wn is a nonnegative function of these random variables: there is an /„: R"~l-+Rl such that (7.13) Wn=fn(Xl,...,Xn_l)>0. Apart from nonnegativity there are at the outset no constraints on the /„, although in an actual casino their values must be integral multiples of a basic unit. Such a sequence {Wn} is a betting system. Since Wn = 0 corresponds to a decision not to bet at all, betting systems in effect include selection systems. In the double-or-nothing system, Wn = 2n~' if Xl = · · · = Xn_, = - 1 (W, = 1) and Wn = 0 otherwise. The amount the gambler wins on the nth play is WnXn. If his fortune at time η is Fn, then (7-14) Fn = Fn_x + WnXn. This also holds for η = 1 if F0 is taken as his initial (nonrandom) fortune. It is convenient to let W„ depend on F0 as well as the past history of play and hence to generalize (7.13) to (7.15) K = 8„(F0,Xl,...,X„_l)^0
SECTION 7. GAMBLING SYSTEMS 99 for a function gn: R"->Rl. In expanded notation, \¥η(ω) = g„(F0, Χλ(ω), ... ,Ar„_ ι(ω)). The symbol Wn does not show the dependence on ω or on F0, either. For each fixed initial fortune F0, Wn is a simple random variable; by (7.15) it is measurable &n_v Similarly, F„ is a function of F0 as well as of Χ^ω),..., Χ„(.ωϊ Fn = Fn(F0, ω). If F0 = 0 and g„ = 1, the Fn reduce to the partial sums (7.1). Since ^_, and σ(Χη) are independent, and since Wn is measurable &'n_x (for each fixed F0), Wn and Z„ are independent. Therefore, E[WnXn\ = E[W„]-E\X„l Now E[Xn]=p-q<0 in the subfair case (p<j), with equality in the fair case (p = {). Since E[Wn] > 0, (7.14) implies that E[Fn] < £[F„_,]. Therefore, (7Л6) fo^^il* ■ >E[Fn]> ■■■ in the subfair case, and (7.17) ίΌ = Ε[*Ί]= ··" =^[f„l= ··· in the fair case. (If p<q and f[^Fn>0]>0, there is strict inequality in (7.16).) Thus no betting system can convert a subfair game into a profitable enterprise. Suppose that in addition to a betting system, the gambler adopts some policy for quitting. Perhaps he stops when his fortune reaches a set target, or his funds are exhausted, or the auguries are in some way dissuasive. The decision to stop must depend only on the initial fortune and the history of play up to the present. Let t(F0, ω) be a nonnegative integer for each ω in Ω and each F0 > 0. If τ = η, the gambler plays on the nth trial (betting tyn) and then stops; if τ = 0, he does not begin gambling in the first place. The event [ω: t(F0, ω) = η] represents the decision to stop just after the nth trial, and so, whatever value F0 may have, it must depend only on Λ',,...,Xn. Therefore, assume that (7.18) [W:r(F0,w)=n]e^, n = 0,l,2,... Α τ satisfying this requirement is a stopping time. (In general it has infinite range and hence is not a simple random variable; as expected values of τ play no role here, this does not matter.) It is technically necessary to let t(F0,o)) be undefined or infinite on an ω-set of probability 0. This has no effect on the requirement (7.18), which must hold for each finite n. But it is assumed that τ is finite with probability 1: play is certain to terminate. A betting system together with a stopping time is a gambling policy. Let π denote such a policy. Example 7.5. Suppose that the betting system is given by Wn =β„, with Bn as in Example 7.4. Suppose that the stopping rule is to quit after the first
100 PROBABILITY loss of a wager. Then* [τ = η] = \Jk= ,[i%. = и, У, = · · · = Ук_, = + 1, Yk = -1]. For j <k <n, [Nk = n, Yj = x]= U"m = i[Nk = n, Nj = m, Xm = x\ lies in ^ by (7.11); hence τ is a stopping time. The values of τ are shown in the rightmost column of the table. ■ The sequence of fortunes is governed by (7.14) until play terminates, and then the fortune remains for all future time fixed at Fr (with value FT(F ω)(ω)). Therefore, the gambler's fortune at time η is (7.19) F„* = F„ Ίίτ>η, F. if τ <η. Note that the case τ = η is covered by both clauses here If η - 1 < η < τ, then F„* = F„ = F„_1 + ^„Z„=F„*_1 + H;Z„; if r<n-Kn, then Fn* = FT = F„*_,. Therefore, if W* = I[Tin]Wn, then (7.20) F„* = Fn*-i+I[r>n]KXn = F«*-i + ^Xn. But this is the equation for a new betting system in which the wager placed at time η is W*. If τ>η (play has not already terminated), W* is the old amount IV„; if τ < η (play has terminated), W* is 0. Now by (7.18), [τ > η] = [τ < л]с lies in &n_x. Thus /(r >n] is measurable <^_l5 so that И^* as well as И^ is measurable ^_i, and {W^*} represents a legitimate betting system. Therefore, (7.16) and (7.17) apply to the new system: (7.21) F0 = F*>E[F,*]> ··· >E[Fn*]> ■■■ if ρ < j, and (7.22) F0 = F0*=-F[Fr]= ··· =£[/?]=... The gambler's ultimate fortune is FT. Now lim„ F* = FT with probability 1, since in fact F* = FT for л > т. If (7.23) limF[F„*]=F[Fr], then (7.21) and (7.22), respectively, imply that E[FT]<F0 and £[Fr]=F0. According to Theorem 5.4, (7.23) does hold if the F* are uniformly bounded. Call the policy bounded by Μ (Μ nonrandom) if (7.24) 0<F„*<M, «=0,1,2,.... If F* is not bounded above, the gambler's adversary must have infinite capital. A negative F* represents a debt, and if F* is not bounded below,
SECTION 7. GAMBLING SYSTEMS 101 the gambler must have a patron of infinite wealth and generosity from whom to borrow and so must in effect have infinite capital. In case F* is bounded below, 0 is the convenient lower bound—the gambler is assumed to have in hand all the capital to which he has access. In any real case, (7.24) holds and (7.23) follows. (There is a technical point that arises because the general theory of integration has been postponed: FT must be assumed to have finite range so that it will be a simple random variable and hence have an expected value in the sense of Section 5.f) The argument has led to this result: Theorem 7.2. For every policy, (7.21) holds if ρ < \ and (7.22) holds if ρ = j. If the policy is bounded (and FT has finite range), then E[FT] < F0 for ρ < \ and E[FT] = F0 for ρ = {. Example 7.6. The gambler has initial capital a and plays at unit stakes until his capital increases to с (0 < a < c) or he is ruined. Here F0 = a and Wn = 1, and so Fn = a + Sn. The policy is bounded by c, and FT is с or 0 according as the gambler succeeds or fails. If ρ = \ and if s is the probability of success, then a = F0 = FfF, ] = sc. Thus s = a/c. This gives a new derivation of (7.7) for the case p=\. The argument assumes however that play is certain to terminate. If ρ <\, Theorem 7.2 only gives s <a/c, which is weaker than (7.7). ■ Example 7.7. Suppose as before that F0 = a and Wn = 1, so that F„ = a + 5„, but suppose the stopping rule is to quit as soon as F„ reaches a +b. Here F* is bounded above by a + b but is not bounded below. If ρ = \, the gambler is by (7.8) certain to achieve his goal, so that FT = a + b. In this case F0 = a <a +b = F[Fr]. This illustrates the effect of infinite capital. It also illustrates the need for uniform boundedness in Theorem 5.4 (compare Example 5.7). ■ For some other systems (gamblers call them "martingales"), see the problems. For most such systems there is a large chance of a small gain and a small chance of a large loss. Bold Play* The formula (7.7) gives the chance that a gambler betting unit stakes can increase his fortune from α to с before being ruined. Suppose that a and с happen to be even and that at each trial the wager is two units instead of TSee Problem 7.11. *This topic may be omitted.
102 PROBABILITY one. Since this has the effect of halving a and c, the chance of success is now pa/1- 1 pa-\ pc/2+l q ,c/2_j рс-1ра/2+Г Ρ = ρΦ\. If p> 1 (p< \), the second factor on the right exceeds 1: Doubling the stakes increases the probability of success in the unfavorable case ρ > 1. In the case ρ = 1, the probability remains the same. There is a sense in which large stakes are optimal. It will be convenient to rescale so that the initial fortune satisfies 0 < F0 < 1 and the goal is 1. The policy of bold play is this: At each stage the gambler bets his entire fortune, unless a win would carry him past his goal of 1, in which case he bets just enough that a win would exactly achieve that goal: 1^2' 1 -F„. if ^ <F„_, < 1 2 (It is convenient to allow even irrational fortunes.) As for stopping, the policy is to quit as soon as Fn reaches 0 or 1. Suppose that play has not terminated by time к - 1; under the policy (7.25), if play is not to terminate at time k, then Xk must be +1 or - 1 according as Fk_l < \ or Fk_l > \, and the conditional probability of this is at most m -= max{/?, q). It follows by induction that the probability that bold play continues beyond time η is at most m", and so play is certain to terminate (τ is finite with probability 1). It will be shown that in the subfair case, bold play maximizes the probability of successfully reaching the goal of 1. This is the Dubins-Sauage theorem. It will further be shown that there are other policies that are also optimal in this sense, and this maximum probability will be calculated. Bold play can be substantially better than betting at constant stakes. This contrasts with Theorems 7.1 and 7.2 concerning respects in which gambling systems are worthless. From now on, consider only policies π that are bounded by 1 (see (7.24)). Suppose further that play stops as soon as F„ reaches 0 or 1 and that this is certain eventually to happen. Since FT assumes the values 0 and 1, and since [Fr = x] = O^=0[r:=n]n[Fn =x] for χ = 0 and χ = 1, FT is a simple random variable. Bold play is one such policy -π. The policy π leads to success if FT = 1. Let Q^ix) be the probability of this for an initial fortune F0 =x: (7.26) Q*(x)=P[FT=l] ■ forF0=x.
SECTION 7. GAMBLING SYSTEMS 103 Since F„ is a function tf>„(F0, Χ^ω),..., Χη(ω)) = Ψη(Ρ0,ω), (7.26) in expanded notation is Qu(x) = Ρ[ω: %(κ,ω)(χ,ω)= 1]. As π specifies that play stops at the boundaries 0 and 1, r7?7, QM = o, β„(ΐ) = ι, { ' 0<<2π(χ)<1, 0<χ<1. Let Q be the Qu for bold play. (The notation does not show the dependence of Q and Qn on p, which is fixed.) Theorem 7.3. In the subfair case, QJLx) < Q(x) for all π and all x. Proof. Under the assumption ρ < q, it will be shown later that (7.28) Q(x)>pQ(x + t)+qQ(x-t), 0 <x - t <x <x + t < 1. This can be interpreted as saying that the chance of success under bold play starting at χ is at least as great as the chance of success if the amount t is wagered and bold play then pursued from χ +1 in case of a win and from χ -1 in case of a loss. Under the assumption of (7.28), optimality can be proved as follows. Consider a policy π, and let Fn and F* be the simple random variables defined by (7.14) and (7.19) for this policy. Now Q(x) is a real function, and so Q(F*) is also a simple random variable; it can be interpreted as the conditional chance of success if π is replaced by bold play after time n. By (7.20), F* - χ + tX„ if F„*_, = χ and W* - t. Therefore, x,t where χ and t vary over the (finite) ranges of F„*_, and W*, respectively. For each χ and t, the indicator above is measurable Srn_l and Q(x + tXn) is measurable σ(Χη); since the Xn are independent, (5.25) and (5.17) give (7.29) E[Q(Fn*)\ = Y,P[Fn*-^x,Wn* = t}E{Q{x + tXn)\ x,t By (7.28), E[Q(x + tXn)] < Q(x) if 0 < χ - t < χ < χ + t < 1. As it is assumed of π that F* lies in [0,1] (that is, W* < min{F„*_„ 1 - F*_,}), the probability
104 PROBABILITY in (7.29) is 0 unless χ and t satisfy this constraint. Therefore, e[Q(f;)} < £/>[/?_, =χ, ^„* = ί]β(χ) = Lp[f:-x=aq(x)-e[q(f:_x)\ This is true for each n, and so E[Q(F*)] < E[Q(Fp] = <2(F0). Since Q(F*) = £?(^r) for " > τ, Theorem 5.4 implies that E[Q(FT)] < Q(F0). Since χ = 1 implies that β(χ)= 1, />[Fr = 1] <E[<2(FT)] < <2(F0). Thus G^Fq)^ <2(F0) for the policy π, whatever F0 may be. It remains to analyze Q and prove (7.28). Everything hinges on the functional equation (7.30) 0^'\р + т2х-^ Ux£l. For χ = 0 and χ = 1 this is obvious because <2(0) = 0 and Q(\) = 1. The idea is this: Suppose that the initial fortune is x. If χ < \, the first stake under bold play is x; if the gambler is to succeed in reaching 1, he must win the first trial (probability p) and then from his new fortune x+x = 2x go on to succeed (probability Q(2x)); this makes the first half of (7.30) plausible. If χ > \, the first stake is 1-х; the gambler can succeed either by winning the first trial (probability p) or by losing the first trial (probability q) and then going on from his new fortune χ-(1 -x) = 2x - 1 to succeed (probability Q(2x - 1)); this makes the second half of (7.30) plausible. It is also intuitively clear that Q(x) must be an increasing function of χ (0 < χ < 1): the more money the gambler starts with, the better off he is. Finally, it is intuitively clear that Q(x) ought to be a continuous function of the initial fortune x. A foimal proof of (7.30) can be constructed as for the difference equation (7.5). If β(χ) is χ for χ < \ and 1 -x for χ > \, then under bold play Wn = /3(F„_,). Starting from /o(x) =x, recursively define /„(x;xi,...,x„) =/„_1(χ;χ1,...,χ„_1) +^(/„_ι(χ;χ!,...,χ„_ι))χ„. Then Fn=fn(FQ;Xh...,Xn). Now define g„(x;x,,...,x„)= max fk(x;xx,...,xk). 0<k<n If Fq^x, then Tn(x):=[gn(x;Xl,...,Xn) = 1] is the event that bold play will by time η successfully increase the gambler's fortune to 1. From the recursive definition it
SECTION 7. GAMBLING SYSTEMS 105 follows by induction on η that for n^l, fn(x;xl,...,xn)=fn_i(x + fi(x)xl; x2,...,x„) and hence that g„( л:; *„...,*„)== тах{д:^п_1(д: + /3(д:)д:1;д:2,...,д:„)}. Since д:= 1 implies g„_iU + β(χ)χ1;x2,...yχη)^χ + β(χ)χ1 = 1, Tn(x) = [gn_i(x + β{χ)Χχ\ X2, ■ ■ ■, Xn) - 1]. and since the Xi are independent and identically distributed, P(Tn(x)) = F([Z, = + 1] η Γ„(*)) + F([Z, = -1] η Г„Ы) = pF[gn_1U+/3U);Z2,...,^n) = l]+<7F[^n_1(jc-/3U);^2,...,^n)=pF(rn_1U + /3(jt))) + <7F(r„_ ,(jc - /3(jc))). Letting η -» oo now gives g(jc) = pQ(jc + /3(jc)) +<72(jc - )3(jc)), which reduces to (7.30) because 2(0) = 0 and Q(\) = 1. Suppose that у =f„-i(x',xi, ,xn-i) is nondecreasing in x. If xn= +1, then f„{x\Xi, ■ ,x„)is2y if0<>< 5 and 1 if \<v< 1; if x„ = - 1, then /„(*;jc,,...,jc„) is 0 if 0<у<5 and 2y - 1 if у<><1. In any case, fn(x;*,,...,xj is also nondecreasing in x, and by induction this is true for every n. It follows that the same is true of g„(x;χ^.-.,χ,,), of P(Tn(x)\ and of Q(x). Thus Q(x) is nondecreasing. Since 2(1) = 1, (7.30) implies that βφ = ρβ(1)=ρ, β(£)=ρβφ = ρ2, QO = /?+<72(р =P + P<7 More generally, if p0 =p and /?, =<7, then (7.31) β(£)-Σ /=1 z 0<fc<2", n>l, the sum extending over и-tuples (ы,,...,ып) of 0's and l's satisfying the condition indicated. Indeed, it is easy to see that (7.31) is the same thing as (7.32) Q(.ul...un + 2-"')-Q(.ul. .un)=pn Pu. for each dyadic rational .ы,... un of rank n. If .ы, ...ы„ + 2 " < }, then ut = 0 and by (7.30) the difference in (7.32) is p0[Q(.u2...u„ + 2~n+l) -Q(.u2...un)]. But (7.32) follows inductively from this and a similar relation for the case .ы,... ы„ > |. Therefore Q(k2~n) — Q((k - 1)2"") is bounded by max{p", a"), and so by mono- tonicity 2 is continuous. Since (7.32) is positive, it follows that Q is strictly increasing over [0,1]. Thus Q is continuous and increasing and satisfies (7.30). The inequality (7.28) is still to be proved. It is equivalent to the assertion that 4r,s) = Q(a)-pQ(s)-qQ(r)2:0 if 0<r<5 < 1, where a stands for the average: a = {-(r + s). Since Q is continuous, it suffices to prove the inequality for r and s of the form k/2", and this will be done by induction on n. Checking all cases disposes of и =0. Assume that the inequality holds for a particular n, and that r and s have the form /c/2" + 1. There are four cases to consider. Case 1. s <{. By the first part of (7.30), A(r,s) =/?A(2r,2s). Since 2r and 2s have the form k/2", the induction hypothesis implies that A(2r,2s)>0.
106 PROBABILITY Case 2. \ < r. By the second part of (7.30), Д(г,5)=<7Д(2г-1,25-1)>0. Case 3. r < a < \ < s. By (7.30), Hr,s) =pQ(2a) -p[p + qQ(2s - 1)] -q[pQ(2r)]. From \<s <r + s = 2a <\, follows Q(2a) = p + qQ(4a - 1); and from 0 < 2a - \ < \, follows Q(2a - {) =pQ(4a - 1). Therefore, pQ(2a) =p2 + qQ(2a - j), and it follows that 4r,s) =q[Q(2a - {) -pQ(2s - 1) -PQ(2r)}. Since p<q, the right side does not increase if either of the two /?'s is changed to q. Hence A(r,s)>qmax[A(2r,2s- l),A(2s- l,2r)]. The induction hypothesis applies to 2r < 2s - 1 or to 2s - 1 < 2r, as the case may be, so one of the two A's on the right is nonnegative. Case 4. r < \ < a < s. By (7.30), A(r,s) =pq + qQ(2a - 1) -pqQ(2s - 1) -pqQ(2r). From 0<2a-l = r + s-l<j, follows Q(2a - 1) = pQ(4a - 2); and from \<2a-\-=r + s-\<\, follows β(2α - \) = p + qQ(4a - 2). Therefore, qQ(2a - 1) =pQ(2a - |) -/?2, and it follows that A(r,s) =p[q -p + Q(2a - {) - qQ(2s - \)-qQ(2r)\. If 2s - 1 < 2r, the right side here is Р[(Я-Р)(1 -Q(2r)) + A(2s - l,2r)] > 0. If 2r < 2s - 1, the right side is Р[(Я -P)(l - <2(2s - 1)) + Д(2г,25 - 1)] > 0. This completes the proof of (7.28) and hence of Theorem 7.3. ■
SECTION 7. GAMBLING SYSTEMS 107 The equation (7.31) has an interesting interpretation. Let Z,,Z2,... be independent random variables satisfying P[Ztl = 0] = p0 =/? and P[Z„ = 1] = px=q. From P[Z„=\ i.o.] = 1 and Σ,·>ΠΖ,·2-'< 2~" it follows that />[Σ°1,Ζ,2-' < A:2-"] < PtE^.Z^"' < £2""] < Р\ТГЫ^{1~1 < k2~n}. Since by (7.31) the middle term is Q(k2~n), (7.33) Q(x)=P ΣΖ,2-'<χ holds for dyadic rational χ and hence by continuity holds for all x. In Section 31, Q will reappear as a continuous, strictly increasing function singular in the sense of Lebesgue. On p. 408 is a graph for the case p0 = .25. Note that Q(x) = x in the fair case p= \. In fact, for a bounded policy Theorem 7.2 implies that E[FT] = F0 in the fair case, and if the policy is to stop as soon as the fortune reaches 0 or 1, then the chance of successfully reaching 1 is P[FT =■ 1] = E[FT] = F0. Thus in the fair case with initial fortune x, the chance of success is χ for every policy that stops at the boundaries, and χ is an upper bound even if stopping earlier is allowed. Example 7.8. The gambler of Example 7.1 has capital $900 and goal $1000. For a fair game (p = |) his chance of success is .9 whether he bets unit stakes or adopts bold play. At red-and-black (p =■ Щ), his chance of success with unit stakes is .00003; an approximate calculation based on (7.31) shows that under bold play his chance Q(.9) of success increases to about .88, which compares well with the fair case. ■ Example 7.9. In Example 7.2 the capital is $100 and the goal $20,000. At unit stakes the chance of successes is .005 for p=-\ and 3x 10~911 for ρ = Щ. Another approximate calculation shows that bold play at red-and-black gives the gambler probability about .003 of success, which again compares well with the fair case. This example illustrates the point of Theorem 7.3. The gambler enters the casino knowing that he must by dawn convert his $100 into $20,000 or face certain death at the hands of criminals to whom he owes that amount. Only red-and-black is available to him. The question is not whether to gamble—he must gamble. The question is how to gamble so as to maximize the chance of survival, and bold play is the answer. ■ There are policies other than the bold one that achieve the maximum success probability Q(x). Suppose that as long as the gambler's fortune χ is less than { he bets χ for χ < | and \ -x for £ <x < \. This is, in effect, the
108 PROBABILITY bold-play strategy scaled down to the interval [0, \], and so the chance he ever reaches \ is Q(2x) for an initial fortune of x. Suppose further that if he does reach the goal of \, or if he starts with fortune at least \ in the first place, then he continues, but with ordinary bold play. For an initial fortune χ > τ, the overall chance of success is of course Q(x), and for an initial fortune x< \, it is Q(2x)Q(±) = pQ(2x) = Q(x). The success probability is indeed Q(x) as for bold play, although the policy is different. With this example in mind, one can generate a whole series of distinct optimal policies. Timid Play* The optimality of bold play seems reasonable when one considers the effect of its opposite, timid play. Let the e-timid policy be to bet Wn = min{e, Fn_x, 1 -F„_x} and stop when Fn reaches 0 or 1. Suppose that ρ <q, fix an initial fortune χ = F0 with 0 <x < 1, and consider what happens as e -> 0. By the strong law of large numbers, lim„ n~lSn = E[XX] ^ ρ -q < 0. There is therefore probability 1 that sup^ Sk < o° and lim„ 5„ = -oo. Given η >0, choose e so that P[s\ipk(x + eSk)< 1] > 1 -η. Since P(\J°°n=x[x + eSn < 0]) = 1, with probability at least 1 - η there exists an n such that χ + eSn < 0 and ma\k<n(x + eSk) < 1. But under the e-timid policy the gambler is in this circumstance ruined. If Qe(x) is the probability of success under the e-timid policy, then lime _0 Qe(x) = 0 for 0 < χ < 1. The law of large numbers carries the timid player to his ruin.1 PROBLEMS 7.1. A gambler with initial capital a plays until his fortune increases b units or he is ruined. Suppose that p>\. The chance of success is multiplied by 1 + θ if his initial capital is infinite instead of a. Show that 0 < θ < (pa - l)~l < (a(p - I))"1; relate to Example 7.3. 7.2 As shown on p. 94, there is probability 1 that the gambler either achieves his goal of с or is ruined. For ρ Φ q, deduce this directly from the strong law of large numbers. Deduce it (for all p) via the Borel-Cantelli lemma from the fact that if play never terminates, there can never occur с successive +l's. 7.3. 6 12Τ If V„ is the set of и-long sequences of ±l's, the function b„ in (7.9) maps И„_, into {0,1}. A selection system is a sequence of such maps. Although there are uncountably many selection systems, how many have an effective 'This topic may be omitted fFor each e, however, there exist optimal policies under which the bet never exceeds e; see Dubins & Savage.
SECTION 7. GAMBLING SYSTEMS 109 description in the sense of an algorithm or finite set of instructions by means of which a deputy (perhaps a machine) could operate the system for the gambler? An analysis of the question is a matter for mathematical logic, but one can see that there can be only countably many algorithms or finite sets of rules expressed in finite alphabets. Let Υ}σ),Υ^σ\... be the random variables of Theorem 7.1 for a particular system σ, and let С be the ω-set where every fc-tuple of + l's (k arbitrary) occurs in Υ,<σ)(ω),Υ2<σ)(ω),... with the right asymptotic relative frequency (in the sense of Problem 6.12). Let С be the intersection of Ca over all effective selection systems σ. Show that С lies in У (the σ-field in the probability space (Ω, &~, P) on which the Xn are defined) and that P(C)=1. A sequence (Χ£ω),Χ2{ω),...) for ω in С is called a collective: a subsequence chosen by any of the effective rules σ contains all fc-tuples in the correct proportions 7.4. Let D„ be 1 or 0 according as X2n _, Φ Χ2π or not, and let Mk be the time ot the fcth 1—the smallest η such that E"=1D,- = fc. Let Zk = X2M. In other words, look at successive nonoverlapping pairs (X2n-i> ^гД discard accordant (X2n-i ~^2^ Pai'rs, and keep the second element of discordant (A,2n_, ФХ2п) pairs. Show that this process simulates a fair coin: ZUZ2,... are independent and identically distributed and P[Zk = +1] = P[Zk = -1] = 5, whatever ρ may be. Follow the proof of Theorem 7.1. 7.5. Suppose that a gambler with initial fortune 1 stakes a proportion θ (0 < θ < 1) of his current fortune: F0 = 1 and Wn = 6F„_X. Show that Fn = П£„,(1 + 0Хк) and hence that Og К = 7 £logi±!+lQg(l-fl2) Show that Fn-*0 with probability 1 in the subfair case. 7.6. In "doubling," W{ = 1, Wn - 2W„_l, and the rule is to stop after the first win. For any positive p, play is certain to terminate. Here FT = F0+ 1, but of course infinite capital is required. If F0 = 2k — 1 and Wn cannot exceed Fn_l. the probability of FT = F0 + 1 in the fair case is 1 - 2~k. Prove this via Theorem 7.2 and also directly. 7.7. In "progress and pinch," the wager, initially some integer, is increased by 1 after a loss and decreased by 1 after a win, the stopping rule being to quit if the next bet is 0. Show that play is certain to terminate if and only if ρ > 5. Show that FT = F0 + {-Wf + =ί(τ - 1). Infinite capital is required. 7.8. Here is a common martingale. Just before the nth spin of the wheel, the gambler has before him a pattern xh...,xk of positive numbers (k varies with n). He bets jc, +xk, or jc, in case к = 1. If he loses, at the next stage he uses the pattern xh...,xk,xl +xk (x\,X[ in case к = 1). If he wins, at the next stage he uses the pattern x2,...,xk_{, unless к is 1 or 2, in which case he quits. Show
по PROBABILITY that play is certain to terminate if ρ > у and that the ultimate gain is the sum of the numbers in the initial pattern. Infinite capital is again required. 7.9. Suppose that Wk = \, so that Fk=F0 + Sk. Suppose that p>.q and τ is a stopping time such that 1 <τ <η with probability 1. Show that E[FT] <E[Fn], with equality in case ρ = q. Interpret this result in terms of a stock option that must be exercised by time n, where Fu + Sk represents the price of the stock at time k. 7.10. For a given policy, let A* be the fortune of the gambler's adversary at time n. Consider these conditions on the policy, (i) W* <F*_X; (ii) W* <A*n_x\ (iii) F* +A* is constant. Interpret each condition, and show that together they imply that the policy is bounded in the sense of (7.24). 7.11. Show that FT has infinite range if FQ = 1, W„ = 2~", and τ is the smallest η for which X„= +1. 7.12. Let μ be a real function on [0,1], u(x) representing the utility of the fortune x. Consider policies bounded by 1; see (7.24). Let Q„(F0) = E[u(FT)]; this represents the expected utility under the policy ττ of an initial fortune F0. Suppose of a policy 7г0 that (7.34) и(х)<ОщЦх), 0<л:<1, and that (7.35) Q^x) >pQ^x + t)+ qQ^x-t), 0<x-t<x<x + t<l. Show that Q„(x) < £?„ (x) for all χ and all policies it. Such a тг0 is optimal. Theorem 7.3 is the special case of this result for ρ < \, bold play in the role of 7Гц, and u(x) = 1 or u(x) = 0 according as χ = 1 or χ < 1. The condition (7.34) says that gambling with policy тг0 is at least as good as not gambling at all; (7.35) says that, although the prospects even under 7г0 become on the average less sanguine as time passes, it is better to use тг0 now than to use some other policy for one step and then change to тг0. 7.13. The functional equation (7.30) and the assumption that Q is bounded suffice to determine Q completely. First, Q(0) and Q(\) must be 0 and 1, respectively, and so (7.31) holds. Let T0x = \x and Txx = \x + \\ let f0x=px and /,jc =p + qx. Then Q(TU[ ■■■ Tux)=fu ■ ■ ■ fu Q(x). If the binary expansions of χ and у both begin with the digits м„...",ип, they have the form x = Tu ...Tux' and У = TUx " ' Tuy. If К bounds Q and if m = max{p, q), it follows that \Q(x) ~ Q(y)\<Km". Therefore, Q is continuous and satisfies (7.31) and (7.33).
SECTION 8. MARKOV CHAINS HI SECTION 8. MARKOV CHAINS As Markov chains illustrate in a clear and striking way the connection between probability and measure, their basic properties are developed here in a measure-theoretic setting. Definitions Let 5 be a finite or countable set. Suppose that to each pair / and / in 5 there is assigned a nonnegative number prj and that these numbers satisfy the constraint (8.1) Σα-, = 1. ''eS- Let X0, X1,X2,... be a sequence of random variables whose ranges are contained in 5. The sequence is a Markov chain or Markov process if (8.2) P[Xn+l=j\X0 = i0,...,Xn=in] = Р[Х„ + 1=]\Х„=^]=Р!я} for every η and every sequence /0,...,/'„ in 5 for which P[X0 =i0,...,Xn = in] > 0. The set 5 is the state space or phase space of the process, and the ptj are the transition probabilities. Part of the defining condition (8.2) is that the transition probability (8.3) P[Xn+l=j\Xn=i]=P,; does not vary with n.1 The elements of 5 are thought of as the possible states of a system, Xn representing the state at time n. The sequence or process X0, Xv X2, ■ ■ ■ then represents the history of the system, which evolves in accordance with the probability law (8.2). The conditional distribution of the next state Xn + l given the present state Xn must not further depend on the past X0,..., X„_i. This is what (8.2) requires, and it leads to a copious theory. The initial probabilities are (8.4) a, = P[X0 = i]. The a-, are nonnegative and add to 1, but the definition of Markov chain places no further restrictions on them. Sometimes in the definition of the Markov chain P[Xn+ \ =-j\X„ = '] is allowed to depend on n. A chain satisfying (8.3) is then said to have stationary transition probabilities, a phrase that will be omitted here because (8 3) will always be assumed.
112 PROBABILITY The following examples illustrate some of the possibilities. In each one, the state space 5 and the transition probabilities ptj are described, but the underlying probability space Ш, &, P) and the Xn are left unspecified for now: see Theorem 8.1.* Example 8.1. The Bernoulli-Laplace model of diffusion. Imagine r black balls and r white balls distributed between two boxes, with the constraint that each box contains r balls. The state of the system is specified by the number of white balls in the first box, so that the state space is 5 =■ {0,1,..., r). The transition mechanism is this: at each stage one ball is chosen at random from each box and the two are interchanged. If the present state is /, the chance of a transition to / - 1 is the chance i/r of drawing one of the i white balls from the first box times the chance i/r of drawing one of the / black balls from the second box. Together with similar arguments for the other possibilities, this shows that the transition probabilities are ll\2 (r~i\2 Лг~1) Pu-i =■ [~r J · P«,,-+1 = [ — J > P„ = 2—~2— > the others being 0. This is the probablistic analogue of the model for the flow of two liquids between two containers. ■ The pu form the transition matrix Ρ = [/?,·■] of the process. A stochastic matrix is one whose entries are nonnegative and satisfy (8.1); the transition matrix of course has this property. Example 8.2. Random walk with absorbing barriers. Suppose that 5 = {0,l,...,r}and 1 A я 0 0 0 0 0 0 я 0 0 0 0 Ρ 0 0 0 0 0 0 Ρ 0 0 0 0 0 0 я 0 0 0 0 0 0 я 0 0 0 0 Ρ 0 0 о' 0 0 0 Ρ 1 That is, p!J+, =p and piJ_l = ^ = 1 -p for 0 < i < r and p00 =prr = 1. The chain represents a particle in random walk. The particle moves one unit to the right or left, the respective probabilities being ρ and q, except that each of 0 and r is an absorbing state—once the particle enters, it cannot leave. The state can also be viewed as a gambler's fortune; absorption in 0 fFor an excellent collection of examples from physics and biology, see Feller, Volume 1, Chapter XV.
SECTION 8. MARKOV CHAINS 113 represents ruin for the gambler, absorption in r ruin for his adversary (see Section 7). The gambler's initial fortune is usually regarded as nonrandom, so that (see (8.4)) ai = 1 for some /. ■ Example 8.3. Unrestricted random walk. Let S consist of all the integers i = 0, ± 1, ± 2,...,and take p: +i =p and /?,,_, =q = 1 -p. This chain represents a random walk without barriers, the particle being free to move anywhere on the integer lattice. The walk is symmetric if ρ = q. Ш The state space may, as in the preceding example, be countably infinite. If so, the Markov chain consists of functions Xn on a probability space (Ω, &~, P), but these will have infinite range and hence will not be random variables in the sense of the preceding sections. This will cause no difficulty, however, because expected values of the Xn will not be considered. All that is required is that for each /gS the set [ω: Χη(ω) = ϊ\ lie in 3^ and hence have a probability. Example 8.4. Symmetric random walk in space. Let 5 consist of the integer lattice points in Ar-dimensional Euclidean space Rk; x = (xl,...,xk) lies in 5 if the coordinates are all integers. Now χ has 2k neighbors, points of the form у = (χ,,..., χ, ± 1,..., xk)\ for each such у let pxy = (2k)~l. The chain represents a particle moving randomly in space; for к = 1 it reduces to Example 8.3 with ρ = q = \. The cases к < 2 and к > 3 exhibit an interesting difference. If к < 2, the particle is certain to return to its initial position, but this is not so if к > 3; see Example 8.6. ■ Since the state space in this example is not a subset of the line, the X0,XU... do not assume real values. This is immaterial because expected values of the Xn play no role. All that is necessary is that Xn be a mapping from Ω into 5 (finite or countable) such that [ω: Χη(ω) = i] e S*~ for / e 5. There will be expected values E[f(Xn)] for real functions f on S with finite range, but then f(Xn(oj)) is a simple random variable as defined before. Example 8.5. A selection problem. A princess must chose from among r suitors. She is definite in her preferences and if presented with all r at once could choose her favorite and could even rank the whole group. They are ushered into her presence one by one in random order, however, and she must at each stage either stop and accept the suitor or else reject him and proceed in the hope that a better one will come along. What strategy will maximize her chance of stopping with the best suitor of all? Shorn of some details, the analysis is this. Let 5„S2,.-.,Sr be the suitors in order of presentation; this sequence is a random permutation of the set of suitors. Let Xt—l and let X2, X3,... be the successive positions of suitors who dominate (are preferable to) all their predecessors. Thus X2 = 4 and ^3=6 means that 5, dominates S2 and S3 but 54 dominates S,, S2, 53, and that S4 dominates S5 but 56 dominates 5,,...,55. There can be at most r of these dominant suitors; if there are exactly m, Xm +, = Xm + 2= ··· =r+lby convention.
114 PROBABILITY As the suitors arrive in random order, the chance that 5, ranks highest among 5,,..., S; is (i - l)!/i! = 1/i. The chance that 5^ ranks highest among 5„..., 5- and 5,- ranks next is (;' - 2)!//! = 1//'(;' - 1). This leads to a chain with transition probabilities1 (8.5) P[xn + I=j\xn = i} = —L—, \<i<j<r. If Xn = i, then Xn + l=r + I means that 5, dominates Si+U. , 5r as well as S{,. .,S;, and the conditional probability of this is (8.6) P[Jf„+1=r+l|Jf„ = /] = i, \<i<r As downward transitions are impossible and r+l is absorbing, this specifies a transition matrix for 5 = {1,2,..., r + 1}. It is quite clear that in maximizing her chance of selecting the best suitor of all, the princes>s should reject those who do not dominate their predecessors. Her strategy therefore will be to stop with the suitor in position XT, where τ is a random variable representing her strategy. Since her decision to stop must depend only on the suitors she has seen thus far, the event [τ=η] must lie in σ(Χ,..., Xn) If XT = i, then by (8.6) the conditional probability of success is f(i) = i/r. The probability of success is therefore b[f(XT)], and the problem is to choose the strategy τ so as to maximize it. For the solution, see Example 8 17.* ■ Higher-Order Transitions The properties of the Markov chain are entirely determined by the transition and initial probabilities. The chain rule (4.2) for conditional probabilities gives P[^0 = i0,Jf1=i11^2=i2] = P[Jf0 = i0]P[^1=i1|^0 = i0]P[^2 = i2|Jf0 = i0,^1=i!] = ai0Pi0ilPili1· Similarly, (8.7) P[X=it, 0 <i <m] = a,, p, , ■■■ η , v Lit' J '(I ^'O'l r'm-l'm for any sequence iQ,il,...,im of states. Further, (8.8) P\Xm + t =)tA<t< n\Xs = i„ 0<s<m] =PimhPhh ■ ■ ■ Pin_-ln, +Тпе details can be found in Dynkin & Yushkevich. Chapter III. *With the princess replaced by an executive and the suitors by applicants for an office job, this is known as the secretary problem
SECTION 8. MARKOV CHAINS 115 as follows by expressing the conditional probability as a ratio and applying (8.7) to numerator and denominator. Adding out the intermediate states now gives the formula (8.9) p^ = P[Xm+n-=j\Xm = i] = Σ PikiPkik2'--Pkn_ii (the k[ range over 5) for the nth-order transition probabilities. Notice that p\p is the entry in position (:',;) of P", the nth power of the transition matrix P. If 5 is infinite, Ρ is a matrix with infinitely many rows and columns; as the terms in (8.9) are nonnegative, there are no convergence problems. It is natural to put рда-в..-/1 ifiW' Then P° is the identity /, as it should be. From (8.1) and (8.9) follow (8.10) /7<™+"> = Σρ^ΡΪ"\ Σρ^=1· " j An Existence Theorem Theorem 8.1. Suppose that Ρ = [ρ/;·] is a stochastic matrix and that a; are nonnegative numbers satisfying E/eSar,- = 1. There exists on some (Ω, 5*", P) a Markov chain X0, Xx, X2, ■.. with initial probabilities a-, and transition probabilities pu. Proof. Reconsider the proof of Theorem 5.3. There the space (Ω, &,P) was the unit interval, and the central part of the argument was the construction of the decompositions (5.13). Suppose for the moment that 5 = {1,2,...}. First construct a partition /}0), Z^0',... of (0,1] into countably many1 subinter- vals of lengths (P is again Lebesgue measure) F(//0)) = a;. Next decompose each //0) into subintervals //;> of lengths PU№) = a;pij. Continuing inductively gives a sequence of partitions {//"' ;: i0,...,i„ = 1,2,...} such that each refines the preceding and P(Ifn) ·. ) = a; p, ,· · · · p, ;. r ° Ό ιη Ό ΌΊ ^'л-1'л Put Χη(ω) = i if ω <ξ (J, .,· //"> in_ ,. It follows just as in the proof of Theorem 5.3 that the set [XQ = i0,...,Xn = in] coincides with the interval Я?■ ■:'„■ Thus P[X0 = i0,...,Xn = ij = ainpi{jii · · · /?;„_,;„· From this it follows immediately that (8.4) holds and that the first and third members of U S, + S2+ · =b -a and θ, > 0, then /, ={b - Σ;£,·θ;, b - Σ,<(·θ·], i = 1,2,..., decompose (a, b] into intervals of lengths θ;.
116 PROBABILITY (8.2) are the same. As for the middle member, it is P[Xn=in, X„ + l = j\/P[Xn=in\, the numerator is EarioP/o/| ■'' Pi„.{i„Pij the Sum extending over all i0, ...,/„_,, and the denominator is the same thing without the factor Pj j, which means that the ratio is pt;, as required. That completes the construction for the case 5 = {1,2,...}· For the general countably infinite 5, let g be a one-to-one mapping of {1,2,...} onto 5, and replace the Xn as already constructed by g(X„); the assumption 5 = {1,2,...} was merely a notational convenience. The same argument obviously works if S is finite.1 ■ Although strictly speaking the Markov chain is the sequence X0, X,,..., one often speaks as though the chain were the matrix Ρ together with the initial probabilities ai or even Ρ with some unspecified set of at. Theorem 8.1 justifies this attitude: For given Ρ and a, the corresponding Xn do exist, and the apparatus of probability theory—the Borel-Cantelli lemmas and so on—is available for the study of Ρ and of systems evolving in accordance with the Markov rule. From now on fix a chain X0, Χλ,... satisfying a-t > 0 for all i. Denote by f, probabilities conditional on [X0 = /]: Ρ/LA) = P[ A \X0 = /]. Thus (8.11) P,[X, = i„ 1 < t < n] =PUlP,li2 ■ ■ ■ Pin_iln by (8.8). The interest centers on these conditional probabilities, and the actual initial probabilities a-, are now largely irrelevant. From (8.11) follows (8.12) /f[/L,=/,,..., Xm = im, Xm+i =7ц- · ·, Xm+n = ]n\ -Pi[Xl=il,...,Xm=i„]Pim[Xl=j\,...,Xn=Jn]. Suppose that / is a set (finite or infinite) of m-long sequences of states, / is a set of л-long sequences of states, and every sequence in / ends in /. Adding both sides of (8.12) for (/l5..., im) ranging over / and (;',,..., ;'„) ranging over / gives (8.13) P![(Xl,...,Xm)^I,{Xm,l,...,Xm,n)^J] = P![(Xl,...,Xm)cEl]Pj[(Xl,...,Xn)fEj]. For this to hold it is essential that each sequence in / end in ;'. The formulas (8.12) and (8.13) are of central importance. For a different approach in the finite case, see Problem 8.1.
SECTION 8. MARKOV CHAINS 117 Transience and Persistence Let (8.14) f^ = Pi[xl*J\--,xn-l*i,x„=f] be the probability of a first visit to j at time η for a system that starts in /', and let (8-15) /„=/>,( U [**=/])= tfir be the probability of an eventual visit. A state i is persistent if a system starting at /' is certain sometime to return to /':/,·,· = 1. The state is transient in the opposite case: /„· < 1. Suppose that nl,...,nk are integers satisfying 1<и,< ··· <nk and consider the event that the system visits ; at times nl...nk but not in between; this event is determined by the conditions л, #./,..., Χηι_ιΦ], X„t=j, Χη,+ι ^ ],·■■■> xni-\^h Xni=j> Xnk_l+i*tji---i Xnk-i^ii Xnk=i- Repeated application of (8.13) shows that under />, the probability of this event is ftf°f}f2~n,) ■ · ■ f}"k~"k-l). Add this over the fc-tuples и„..., nk: the /^-probability that Xn=) for at least к different values of η is /;,/;*"'. Letting к -»oo therefore gives (8.16) />[*n=/i.0.] = Recall that i.o. means infinitely often. Taking / = / gives (8.17) />,[*„= ι i.e.] = Thus Pi[X„ = i io.] is either 0 or 1; compare the zero-one law (Theorem 4.5), but note that the events [Xn = i] here are not in general independent.+ 0 if/,.,·<!, 0 if/,<l, 1 if/„=l.· fSee Problem 8.35
118 PROBABILITY Theorem 8.2. (i) Transience of i is equivalent to P,{Xn = i i.o.] = 0 and to Σηρ^ < °°. (ii) Persistence of i is equivalent to P;[X„ = i i.o.] = 1 and to Σηρ^ = °°. Proof. By the first Borel- Cantelli lemma, Σηρ\Ρ < °° implies P,[X„ = i i.o.] = 0, which by (8.17) in turn implies fu < 1. The entire theorem will be proved if it is shown that /„ < 1 implies Σπ/?·,η) < °°. The proof uses a first-passage argument: By (8.13), n-\ = £^[^1^' ^-^a,-ri№=i] 5 = 0 Therefore, ί>ί/>= Σ Σ/Τ^/' l-1 ί=1ί=0 -ςΊ*1 ς rr^ twin- s=0 i=s+\ s=0 Thus (1 -βα)Σ"^ιΡ(ιΡ<^ι, and if /,, < 1, this puts a bound on the partial sums Σ?„ ,/?,<;>. ■ Example 8.6. Polya's theorem. For the symmetric /c-dimensional random walk (Example 8.4), all states are persistent if к = 1 or к = 2, and all states are transient if к > 3. To prove this, note first that the probability pW of return in η steps is the same for all states i\ denote this probability by a*f' to indicate the dependence on the dimension k. Clearly, α^'+ι = 0· Suppose that к = 1. Since return in In steps means η steps east and η steps west, By Stirling's formula, αψη ~ (irn)"i/2. Therefore, Σ„α(ηι) = oo, and all states are persistent by Theorem 8.2.
SECTION 8. MARKOV CHAINS 119 In the plane, a return to the starting point in 2n steps means equal numbers of steps east and west as well as equal numbers north and south: fl(2,= у (2")! 1 2n и~0и!и!(л-и)!(л-и)! 42" = 4^( η J^Q[uj[n-uJ- It can be seen on combinatorial grounds that the last sum is 12н"1, and so a<2n ~ (fl2n)2 ~ (""и)-1. Again, Σηα(2) = oo and every state is persistent. For three dimensions, в(э,_у l^i1 L 2" ^ M'.M!u!u!(n-u-u)!(n-M-u)! 62" ' the sum extending over nonnegative и and υ satisfying и + и < п. This reduces to (8-18) αψη= t (22;)(^)2"~2'(|)VU*£\ as can be checked by substitution. (To see the probabilistic meaning of this formula, condition on there being 2n - 2/ steps parallel to the vertical axis and 2/ steps parallel to the horizontal plane.) It will be shown that a^l = 0(n~3/2\ which will imply that Σηαψ < °°. The terms in (8.18) for / = 0 and l = n are each 0(n~3/2) and hence can be omitted. Now a(u° <Ku~l/2 and a(2) <Ku~\ as already seen, and so the sum in question is at most ^^(^^(^"""(jf^n-20-^(20-·. Since (2n - 2/)-1/2 < 2ni/2(2n - 2/Γ1 < 4ni/2(2n -21+ l)"1 and (2/)"1 < 2(2/ + 1)_1, this is at most a constant times „1/2 ДгеИПГ-^)· Thus Σ„α(„3) < °°, and the states are transient. The same is true for к = 4,5,..., since an inductive extension of the argument shows that α(„Λ) = 0(n ~k/2). ■ It is possible for a system starting in i to reach / (f{j > 0) if and only if p\V > 0 for some n. If this is true for all i and ;', the Markov chain is irreducible.
120 PROBABILITY Theorem 8.3. If the Markov chain is irreducible, then one of the following two alternatives holds. (i) All states are transient, Pj(\Jj[X„ =j i.o.]) = 0 for all i, and Σ„ρ\ρ < °° for all i and j. (ii) All states are persistent, Pi(f\j[X„ =j i.o.]) = 1 for all i, and Σηρ\Ρ = °° for all i and j. The irreducible chain itself can accordingly be called persistent or transient. In the persistent case the system visits every state infinitely often. In the transient case it visits each state only finitely often, hence visits each finite set only finitely often, and so may be said to go to infinity. Proof. For each / and j there exist r and s such thai p\0 > 0 and pf > 0. Now (8.19) ΡΪΓ'^ΖρΡρΡρΪΡ, and from p\pp]p > 0 it follows that Σ„ρ£ι)<°° implies Σ„ p'jp <°°: if one state is transient, they all are. In this case (8.16) gives Pj[X„ =j i.o.] = 0 for all i and j, so that />((J/*„=;' i.o.]) = 0 for all i. Since ΣΤηα1ρ^ = Σ:=ιΣ:=ι^ρ^-ν)-Σ:^^Σΐ=οΡ^<κ^ρ^ii follows that if jis transient, then (Theorem 8.2) Σηρ\Ρ converges for every :'. The other possibility is that all states are persistent. In this case P.[X„ = j i.o.] = 1 by Theorem 8.2, and it follows by (8.13) that /?jf> = />;([*„,=*] П [*„=/i.o.]) ^ Σ Р>[Хт = 1, Xm + l*i,...,X„-l*j, X„=i] n>m = ЦрТУГт)=рТ%· There is an m for which p),"" > 0, and therefore fn = 1. By (8.16), Pi[X„=j i.o.] =/,.. = 1. If Σηρ)Ρ were to converge for some : and /, it would follow by the first Borel-Cantelli lemma that P.[Xn =j i.o.] = 0. ■ Example 8.7. Since Σ}ρ\Ρ = 1, the first alternative in Theorem 8.3 is impossible if 5 is finite: a finite, irreducible Markov chain is persistent. ■ Example 8.8. The chain in Polya's theorem is certainly irreducible. If the dimension is 1 or 2, there is probability 1 that a particle in symmetric random walk visits every state infinitely often. If the dimension is 3 or more, the particle goes to infinity. ■
SECTION 8. MARKOV CHAINS 121 Example 8.9. Consider the unrestricted random walk on the line (Example 8.3). According to the ruin calculation (7.8), /0l = p/q for p<q. Since the chain is irreducible, all states are transient. By symmetry, of course, the chain is also transient if ρ > q, although in this case (7.8) gives /01 = 1. Thus /;· = 1 0 *tj) is possible in the transient case.1 If ρ = q = \, the chain is persistent by Polya's theorem. If η and / - i have the same parity, Pf = η \ η +j - i ~2~ , _1 2"' ™-, \j-i\<n. This is maximal if j = /' or j =:' ± 1, and by Stirling's formula the maximal value is of order n~l/2. Therefore, lim„ pj"} = 0, which always holds in the transient case but is thus possible in the persistent case as well (see Theorem 8.8). ■ Another Criterion for Persistence Let Q = [<7,y] be a matrix with rows and columns indexed by the elements of a finite or countable set U. Suppose it is substochastic in the sense that qu > 0 and Σ;0,7 < 1. Let Q" = [qW] be the nth power, so that (8.20) ίίΤ1)-ΣΧΦ}. €-su. Consider the row sums (8.21) ^"}=Σίί;}· From (8.20) follows (8.22) o-/-+l)=EiW}. Since Q is substochastic σ,·(1) < 1, and hence σ/"+1) = LjLvq^")qvj Zvqj")ay) < σ/η). Therefore, the monotone limits (8.23) σ-limEC But for each j there must be some i Φ] for which /,-· < 1; see Problem 8.7.
122 PROBABILITY exist. By (8.22) and the Weierstrass M-test [A28], σι■ = Σ^,,σ,, Thus the a-t solve the system [xi= Σ QijXj, 'eL/> (8.24) i*u (0<Jt,<l, ie(/. For an arbitrary solution, xt = Σ,-<?,·,.*,■ < Σ,<?,·,· = σ,(|), and χ, < σ/'0 for all i implies χ,<Σ#„-σ}η) = σ}" + 1) by (8.22). Thus x,■. < σ/'!) for all η by induction, and so Xj < σ,. Thus the σ, give the maximal solution to (8.24): Lemma 1. For a substochastic matrix Q the limits (8.23) are the maximal solution of (8.24). Now suppose that U is a subset of the state space 5. The pu for i and j in U give a substochastic matrix β. The row sums (8.21) are σ/η) = Σρ,7 ρ,· ·, • · · pin , where the /,,...,j„ range over i/, and so σ/η) = f,[X, e ί/, ί <и]". Let л -»oo: (8.25) σ, = Ρ,.[Ζ, et/,i= 1,2...], ίεί/. In this case, σ(· is thus the probability that the system remains forever in U, given that it starts at /'. The following theorem is now an immediate consequence of Lemma 1. Theorem 8.4. For U с 5 the probabilities (8.25) are the maximal solution of the system [xi= Σ Ρ/,-*/» ' e U, (8.26) i^u [0<jc,<1, /εΙ/. The constraint jc,· > 0 in (8.26) is in a sense redundant: Since xi = 0 is a solution, the maximal solution is automatically nonnegative (and similarly for (8.24)). And the maximal solution is xt- = 1 if and only if Σ;·ε6,ρ,·;· = 1 for all i in U, which makes probabilistic sense. Example 8.10. For the random walk on the line consider the set U = {0,1,2,...}. The System (8.26) is xi=P*i+i+Qxi-i> '^1. χο=Ρχι- It follows [A19] that xn =A +An if ρ =q and jc„ =A -A(q/p)" + l if p¥=q. The only bounded solution is xn = 0 if <?>/?, and in this case there is
SECTION 8. MARKOV CHAINS 123 probability 0 of staying forever among the nonnegative integers. If q <p, A = 1 gives the maximal solution xn = 1 -(q/p)" + l (and 0 <A < 1 gives exactly the solutions that are not maximal). Compare (7.8) and Example 8.9. ■ Now consider the system (8.26) with U = S - {i0} for an arbitrary single state /0: (·*,= Σ PijX,, i'^'o» (8.27) J+O [o<jc,<1, ΐΦί0- There is always the trivial solution—the one for which xi = 0. Theorem 8.5. An irreducible chain is transient if and only if (8.27) has a nontrivial solution. Proof. The probabilities (8.28) 1 -/,,o = />,.[*„ Φ i0, η > 1], ι Φ i0, are by Theorem 8.4 the maximal solution of (8.27). Therefore (8.27) has a nontrivial solution if and only if /,,· < 1 for some : Φ i0. If the chain is persistent, this is impossible by Theorem 8.3(ii). Suppose the chain is transient. Since //ο,ο = ρ.ο[^=,·0]+ Σ LPlo[Xi=i,X2*io>->Xn-i*io,Xn = io] η = 2 ι*ί„ — ^'o'o L· ΡίυίΙίία> ίφίυ and since /, , < 1, it follows that /,,· < 1 for some /" Φ /η. ■ Since the equations in (8.27) are homogeneous, the issue is whether they have a solution that is nonnegative, nontrivial, and bounded. If they do, 0 < jc, < 1 can be arranged by rescaling.1 fSee Problem 8.9.
124 PROBABILITY Example 8.11. In the simplest of queueing models the state space is {0,1,2,...} and the transition matrix has the form "ίο ίο 0 0 0 ίι ίι ίο 0 0 ί2 h ίι ίο 0 0 0 ί2 ίι ίο 0 0 0 ί2 ίι 0 ··· 0 ··· 0 ··· 0 ·· ί2 ··· If there are i customers in the queue and I > 1, the customer at the head of the queue is served and leaves, and then 0, 1, or 2 new customers arrive (probabilities t0,tl,t2), which leaves a queue of length /'- 1, /', or /'+1. If i = 0, no one is served, and the new customers bring the queue length to 0, 1, or 2. Assume that i0 and t2 are positive, so that the chain is irreducible. For /0 = 0 the system (8.27) is (8.29) Xl — tiXi + t2X2, xk^ {oxk-i+tixk +t2xk + i' k>2. Since t0,tut2 have the form q(l - ί), t, p(l -t) for appropriate p,q,t, the second line of (8.29) has the form xk =pxk + i + qxk_i, k>2. Now the solution [Al9] is A + B(q/p)k =A + B(t0/t2)k if f0 Φ t2 (pφq)аnd A+Bk if i0 = t2 (p = q), and A can be expressed in terms of В because of the first equation in (8.29). The result is β((ίο/ί2)*-ΐ) if Ό*'2. Bk if f0 = f2. There is a nontrivial solution if i0 < t2 but not if t0>t2. If i0 <t2, the chain is thus transient, and the queue size goes to infinity with proability l.lit0^t2, the chain is persistent. For a nonempty queue the expected increase in queue length in one step is t2 - i0, and the queue goes out of control if and only if this is positive. ■ Stationary Distributions Suppose that the chain has initial probabilities π,· satisfying (8.30) ΣπιΡα = π], Уе5. iei
SECTION 8. MARKOV CHAINS 125 It then follows by induction that (8.31) Ет,Х?) = 17-у, /eS, «=0,1,2,.... If π,· is the probability that X0 = i, then the left side of (8.31) is the probability that Xn =_/, and thus (8.30) implies that the probability of [ Xn =j] is the same for all n. A set of probabilities satisfying (8.30) is for this reason called a stationary distribution. The existence of such a distribution implies that the chain is very stable. To discuss this requires the notion of periodicity. The state / has period t if pjp > 0 implies that t divides η and if t is the largest integer with this property. In other words, the period of j is the greatest common divisor of the set of integers (8.32) {n:n>\,p)p>Q\. If the chain is irreducible, then for each pair / and / there exist r and s for which p\p and ρψ are positive, and of course (8.33) PJr^^PiPtfVP- Let i; and i; be the periods of i and /. Taking η = 0 in this inequality shows that t-, divides r + s; and now it follows by the inequality that p)p > 0 implies that i, divides r + s +n and hence divides n. Thus i, divides each integer in the set (8.32), and so i, < f ■. Since i and / can be interchanged in this argument, i and / have the same period. One can thus speak of the period of the chain itself in the irreducible case. The random walk on the line has period 2, for example. If the period is 1, the chain is aperiodic Lemma 2. In an irreducible, aperiodic chain, for each i and j, p\p > 0 for all η exceeding some n0(i, j). Proof. Since p)p+n)>p)p)p)p, if Μ is the set (8.32) then m eM and η e Μ together imply m + η e M. But it is a fact of number theory [A21] that if a set of positive integers is closed under addition and has greatest common divisor 1, then it contains all integers exceeding some и,. Given / and /, choose r so that pjp > 0. If η > n0 = η, + r, then p\f> >pjpp}"~r) > 0. ■ Theorem 8.6. Suppose of an irreducible, aperiodic chain that there exists a stationary distribution—a solution of (8.30) satisfying π, > 0 and Σ,-ττ,·= 1. Then the chain is persistent, (8.34) limpj^ir,· П for all i andj, the π. are all positive, and the stationary distribution is unique.
126 PROBABILITY The main point of the conclusion is that the effect of the initial state wears off. Whatever the actual initial distribution {a-,} of the chain may be, if (8.34) holds, then it follows by the M-test that the probability Σ,-α,-pfj0 of [Ar„=;] converges to iry. Proof. If the chain is transient, then pW -» 0 for all i and / by Theorem 8.3, and it follows by (8.31) and the M-test that тг- is identically 0, which contradicts Σ^ = 1. The existence of a stationary distribution therefore implies that the chain is persistent. Consider now a Markov chain with state space 5x5 and transition probabilities p(ij,kl) =PikPji (it is easy to verify that these form a stochastic matrix). Call this the coupled chain; it describes the joint behavior of a pair of independent systems, each evolving according to the laws of the original Markov chain. By Theorem 8.1 there exists a Markov chain (Xn,Y„), n = 0,1,...,having positive initial probabilities and transition probabilities ^(*„μΛ + ι) = (Μ)|(*„Λ) = (μ)]=ρ(«.*0· For η exceeding some n0 depending on i.;, k, /, the probability p("\ij, kl) <= ΡίΖ)ρ}") is positive by Lemma 2. Therefore, the coupled chain is irreducible. (This proof that the coupled chain is irreducible requires only the assumptions that the original chain is irreducible and aperiodic, a fact needed again in the proof of Theorem 8.7.) It is easy to check that тг(//) = тг,-тг· forms a set of stationary initial probabilities for the coupled chain, which, like the original one, must therefore be persistent. It follows that, for an arbitrary initial state (/',/) for the chain {(X„,Y„)} and an arbitrary t0 in 5, one has />iy[(A'„,Y„) = (/0,/0) i.o.] = 1. If τ is the smallest integer such that XT = YT = /0, then τ is finite with probability 1 under Pu. The idea of the proof is now this: Xn starts in i and Yn starts in /; once Xn = Y„ = :0 occurs, Xn and Yn follow identical probability laws, and hence the initial states i and ;' will lose their influence. By (8.13) applied to the coupled chain, if m < n, then PlMxn,Yn)=(k,l),r = m] = Pu[(X»Yi)*(i0,io),t<'n>(Xn„Ym) = (io,io)] = Ри[т = т]р$-"р%-»\ Adding out / gives Ри[Х„ = к, τ = m] = ΡιΊ[τ = m]pfy~m\ and adding out к gives Ри[Уп = 1, r = m] = Plj[T = m]pj"l~m). Take к = I, equate probabilities, and add over m = l,...,n: Pu[Xn = k,r <n] = ΡιΊ[Υη = k, τ <n].
SECTION 8. MARKOV CHAINS 127 From this follows Ри[Хп = к]<Ри[Хп = к,т<:п]+Ри[т>п] <Ри[У„ = к]+Ри[т>п]. This and the same inequality with X and Υ interchanged give \p\V-P^\ = \Pii[Xn=k\-Pii[Yn = k\\^Pij[r>nl Since τ is finite with probability 1, (8.35) \im\p\p-p)i>\ = Q. η (This proof of (8.35) goes through as long as the coupled chain is irreducible and persistent— no assumptions on the original chain are needed. This fact is used in the proof of the next theorem.) By (8.31),ттк -ρ)? = Σ,ττ,ί/?^ -ρ)Ρ\ and this goes to 0 by the M-test if (8.35) holds. Thus lim„ p^) = irk. As this holds for each stationary distribution, there can be only one of them. It remains to show that the тг· are all strictly positive. Choose r and s so that p\p and pjp are positive. Letting η ~* o° in (8.33) shows that тг,· is positive if тг· is; since some тг· is positive (they add to 1), all the тг,· must be positive. ■ Example 8.12. For the queueing model in Example 8.11 the equations (8.30) are 77 о :=7roio + 77iio» π1 =7Г0{1 +7Г1{1 +7Γ2ί0> ·π2 = π0ί2 + πιί2 + π2ίι +ττ3ί0, TTk =7r*-li2+'nVl +7r/t + li0' к>Ъ. Again write t0,tut2, as q(l -1), t, p(l -f). Since the last equation here is тгк = qirk + l +ртгл_„ the solution is ^ \A +B(p/q)k =A +B(t2/t0)k iff0#f2, ""* \A+Bk ifi0 = f2 for к > 2. If r0 < t2 and Σττ^. converges, then тг^. = О, and hence there is no stationary distribution; but this is not new, because it was shown in Example 8.11 that the chain is transient in this case. If tQ = t2, there is again no
128 PROBABILITY stationary distribution, and this is new because the chain was in Example 8.11 shown to be persistent in this case. If t0> t2, then Lvk converges, provided /4 = 0. Solving for Т70 and π, in the first two equations of the system above gives Τ70 = Bt2 and π, = Bt2(l - ί0)/ί0. From Lkvk = 1 it now follows that Β = (ί0 - t2)/t2, and the irk can be written down explicitly. Since irk = B(t2/t0)k for k>2, there is small chance of a large queue length. ■ If ί0 = ί2 ш this queueing model, the chain is persistent (Example 8.11) but has no stationary distribution (Example 8.12). The next theorem describes the asymptotic behavior of the pjp in this case. Theorem 8.7. If an irreducible, aperiodic chain has no stationary distribution, then (8.36) limpid = 0 for all i and j. If the chain is transient, (8.36) follows from Theorem 8.3. What is interesting here is the persistent case. Proof. By the argument in the proof of Theorem 8.6, the coupled chain is irreducible. If it is transient, then Ln(pjp)2 converges by Theorem 8.2, and the conclusion follows. Suppose, on the other hand, that the coupled chain is (irreducible and) persistent. Then the stopping-time argument leading to (8.35) goes through as before. If the pj-n) do not all go to 0, then there is an increasing sequence {nj of integers along which some pjp is bounded away from 0. By the diagonal method [A14], it is possible by passing to a subsequence of {nj to ensure that each p\"u) converges to a limit, which by (8.35) must be independent of i. Therefore, there is a sequence {nj such that limu p\"u) = tj exists for all i and /, where ί · is nonnegative for all / and positive for some /. If Μ is a finite set of states, then E;(=Mi, = limu E;eMp,("u) < 1, and hence 0< t = Ljtj < 1. Now LkfEMPik")Pkj <PJ"" + l) = ^kPikPknj"\ it is possible to pass to the limit (u -»oo) inside the first sum (if Μ is finite) and inside the second sum (by the M-test), and hence T.k<=Mtkpkj<'Zkpiktj = tj. Therefore, Lktkpkj <tj-, if one of these inequalities were strict, it would follow that Lktk = LjLktkpkj < Zjtj, which is impossible. Therefore Lktkpkj = tj for all j, and the ratios Vj = tj/t give a stationary distribution, contrary to the hypothesis. ■ The limits in (8.34) and (8.36) can be described in terms of mean return times. Let oo (8.37) μ-i-Znf^; n= 1
SECTION 8. MARKOV CHAINS 129 if the series diverges, write μ, = °°. in the persistent case, this sum is to be thought of as the average number of steps to first return to j, given that Lemma 3. Suppose that j is persistent and that lim„ /?)"' = u. Then u> 0 if and only if μι < °°, in which case и = 1 /μ-. Under the convention that 0 =■ l/°o, the case и = 0 and μ, = o° is consistent with the equation и = \/μ}. Proof. For к > 0 let pk = Hn>kfj"); the notation does not show the dependence on /, which is fixed. Consider the double series W+ff+f$>+■■■ + ... The &th row sums to pk (k > 0) and the nth column sums to nfjp (n > 1), and so [A27] the series in (8.37) converges if and only if Zkpk does, in which case 00 (8.38) Hj=Zpk- k=Q Since j is persistent, the /y-probability that the system does not hit / up to time η is the probability that it hits / after time n, and this is p„. Therefore, l-p^ = P,[Xn*j] n-l = Р;[х1Ф]\...,хпфП+ ZPj[xk=J,Xk+i*J,-,Xn*J] n-l =pn+ Σ Pjppn-k, and since p0 = 1, l=PoPJ")+PiPJr1)+-"+P"-i^) + P"i#)- Keep only the first к + 1 terms on the right here, and let η ~* o°; the result is 1 > (p0 + · ■ · +pk)u. Therefore и > 0 implies that Zkpk converges, so that /X; < 00. ^ince in general there is no upper bound to the number of steps to first return, it is not a simple random variable. It does come under the general theory in Chapter 4, and its expected value is indeed μ; (and (8 38) is just (5.29)), but for the present the interpretation of μ, as an average is informal See Problem 23.11
130 PROBABILITY Write xnk = pk p)1 ~k) for 0 < к < η and xnk = 0 for η < к. Then 0 < xnk < pk and lim„ xnk = pku. If μ; < <», then Σ^,ρ* converges and it follows by the M-test that 1 = Σ£=0χ„Λ -» Σ*=0ρ*«. By (8.38), 1 = /x;u, so that u > 0 and и = \/μ}. Μ The law of large numbers bears on the relation и = \/μι in the persistent case. Let Vn be the number of visits to state / up to time n. If the time from one visit to the next is about μ.;, then V„ should be about η/μ}: Vn/n ~ l/μ. . But (if X0 =j) Vn/n has expected value п~1И"к = 1р]р, which goes to и under the hypothesis of Lemma 3 [A30]. Consider an irreducible, aperiodic, persistent chain. There are two possibilities. If there is a stationary distribution, then the limits (8.34) are positive, and the chain is called positive persistent. It then follows by Lemma 3 that μι < oo and πj = Ι/μ for all /. In this case, it is not actually necessary to assume persistence, since this follows from the existence of a stationary distribution. On the other hand, if the chain has no stationary distribution, then the limits (8.36) are all 0, and the chain is called null persistent It then follows by Lemma 3 that μ; = с» for all j. This, taken together with Theorem 8.3, provides a complete classification: Theorem 8.8. For an irreducible, aperiodic chain there are three possibilities: (i) The chain is transient; then for all i and j, lim„ pjp = 0 and in fact (ii) The chain is persistent but there exists no stationary distribution (the null persistent case); then for all i and j, pjp goes to 0 but so slowly that ZnPi") = co, and /*; = °°- (iii) There exist stationary probabilities π- and (hence) the chain is persistent (the positive persistent case); then for all i and j, lim„ р}") = тгу> О and μ; = 1/π,. <». Since the asymptotic properties of the ρψ are distinct in the three cases, these asymptotic properties in fact characterize the three cases. Example 8.13. Suppose that the states are 0,1,2,... and the transition matrix is <7o Qi Qi Po 0 0 0 Pi 0 0 ··· " 0 ·· Рг ■■■ where p-t and qt are positive. The state / represents the length of a success
SECTION 8. MARKOV CHAINS 131 run, the conditional chance of a further success being pL. Clearly the chain is irreducible and aperiodic. A solution of the system (8.27) for testing for transience (with /0 = 0) must have the form xk=xx/py ■■pk_v Hence there is a bounded, nontrivial solution, and the chain is transient, if and only if the limit a of p0 · · ■ pn is positive. But the chance of no return to 0 (for initial state 0) in η steps is clearly p0 ■ ■ ■ pn_,; hence fm = 1 - a, which checks: the chain is persistent if and only if a = 0. Every solution of the steady-state equations (8.30) has the form vk = π0ρ0 ■ ■ ■ pk_x- Hence there is a stationary distribution if and only if Σ/(ρ0 ■ · ■ Pk converges; this is the positive persistent case. The null persistent case is that in which p0 ■ ■ ■ pk~* 0 but Lkp0 ■ · · pk diverges (which happens, for example, if qk = \/k for к > 1). Since the chance of no return to 0 in л steps is p0 ···/?„_,, in the persistent case (8.38) gives μ0 = Σ^=0/?0 · · ■ Pk-\. In the null persistent case this checks with μ0 =«; in the positive persistent case it gives μ0 = Σ*~ο7Γ*/7Γο = l/^o» which again is consistent. ■ Example 8.14. Since Ey/$° = 1, possibilities (i) and (ii) in Theorem 8.8 are impossible in the finite case: A finite, irreducible, aperiodic Markov chain has a stationary distribution. ■ Exponential Convergence* In the finite case, pjp converges to тг,- at an exponential rate: Theorem 8.9. // the state space is finite and the chain is irreducible and aperiodic, then there is a stationary distribution {тг,-}, and \pf-7rj\<Ap", where A > 0 and 0 < ρ < 1. Proof.1 Let m<n) = min, pW and Mjn) = max,- pW. By (8.10), mj" + "=min EAvPiy^min ZPtMP = rn\n\ M}" + »= max ΣΡίνΡΐΤ* max ΣΡιΜ") = Μΐη)' This topic may be omitted. fFor other proofs, see Problems 8.18 and 8 27.
132 PROBABILITY Since obviously m*n) < Mjn), (8.39) Q<m(p<mf< ·■· <MJ2) <MJV) < 1. Suppose temporarily that all the ptj are positive. Let s be the number of states and let δ = min,7 /?,-,·. From Σ;/?,7 > s8 follows 0 < δ < s~'. Fix states и and υ for the moment; let Σ' denote the summation over / in 5 satisfying puj >pL.j and let Σ" denote summation over / satisfying puj <pvj- Then (8-40) E'( Puj-P.j) + Σ"( Puj-Puj) = 1-1=0. Since L'paj + FpttJ > sS. (8-41) L'(Puj-P,j) = 1 - Σ'ρ,ΐ-Σ'ρ^ίΙ-'δ. Apply (8.40) and then (8.41): Puk Pi. к L·\Puj PujjPjk j * L'(pUJ-p,j)Mr + r(puj-P4Xn) = Е'(ри,-^)(мГ-<)) <(l-s8)(Mln)-m(kn)). Since и and υ are arbitrary, Ml"+ 1) - m(k"+l) < (1 -sS)(Mln) - m(kn)). Therefore, M(k") - mkn) < (1 -s8)n. It follows by (8.39) that m<-n) and Mjn> have a common limit Τ7;· and that (8.42) \p\p-irj\^(l-sS)n. Take Л = 1 and ρ = 1 - s8. Passing to the limit in Zip(v")piJ = pl]+1) shows that the ttl are stationary probabilities. (Note that the proof thus far makes almost no use of the preceding theory.) If the Pjj are not all positive, apply Lemma 2: Since there are only finitely many states, there exists an m such that /?^m) > 0 for all /' and /. By the case just treated, Mjmt) -m<.m,) <p'. Take A=p~l and then replace ρ by p1/m. ■
SECTION 8. MARKOV CHAINS 133 Example 8.15. Suppose that Po Ps-l Pi Pi ■ Po ■ Рг ■ ·· Ps-\~ ·· Ps-i ■ Po The rows of Ρ are the cyclic permutations of the first row: pij=pJ-i, j -i reduced modulo s. Since the columns of Ρ add to 1 as well as the rows, the steady-state equations (8.30) have the solution ir; = j_1. If the pt are all positive, the theorem implies that /?}"' converges to Γ1 at an exponential rate. If X0,Y1,Y2,·-- are independent random variables with range {0,1,.. ,5-1}, if each Yn has distribution {p0,---,ps-i}, and if Xn = X0 +F, -r - · · + Yn, where the sum is reduced modulo s, then P[X„ =j] ~* s~l. The Xn describe a random walk on a circle of points, and whatever the initial distribution, the positions become equally likely in the limit. ■ Optimal Stopping* Assume throughout the rest of the section that 5 is finite. Consider a function τ on Ω for which τ(ω) is a nonnegative integer for each ω. Let ^ = σ(Χ0,Χι,...,Χη);τ is a stopping time or a Markov time if (8.43) \ω:τ(ω) = η] e ^ for η = 0,1,... . This is analogous to the condition (7.18) on the gambler's stopping time. It will be necessary to allow τ(ω) to assume the special value oo; but only on a set of probability 0. This has no effect on the requirement (8.43), which concerns finite η only. If / is a real function on the state space, then f(X0), f(X\),■■■ are simple random variables. Imagine an observer who follows the successive states X0, Xb... of the system. He stops at time τ, when the state is XT (or Ζτ(ω)(ω)), and receives an reward or payoff f(XT)- The condition (8.43) prevents prevision on the part of the observer. This is a kind of game, the stopping time is a strategy, and the problem is to find a strategy that maximizes the expected payoff E[f(XT)]. The problem in Example 8.5 had this form; there 5 = {1,2,..., r + 1}, and the payoff function is /(/) = i/r for i<r (set /(r+l) = 0). If P(A) > 0 and Υ = Y,jysIB. is a simple random variable, the B- forming a finite decomposition of Ω into ^sets, the conditional expected value of Υ *This topic may be omitted.
134 PROBABILITY given A is defined by E[Y\A] = Zy!P(Bj\A). j Denote by Ei conditional expected values for the case A = [ X0 = /']: Ei[Y]=E{Y\X0=i] = ZyjPi(B}). J The stopping-time problem is to choose τ so as to maximize simultaneously E;[f(XT)] for all initial states /. If χ lies in the range of /, which is finite, and if τ is everywhere finite, then [ω:β(ΧτΜ(ω)) =χ]= υ™=,0[ω:τ(ω) = n, f(Xn(o))) = x] lies in ^, and so f(XT) is a simple random variable. In order that this always hold, put /(^τ(ω)(ω)) =■ 0, say, if τ(ω) = ο° (which happens only on a set of probability 0). The game with payoff function / has at / the value (8.44) v(i)=supEi[f(XT)}, the supremum extending over all Markov times т. It will turn out that the supremum here is achieved: there always exists an optimal stopping time. It will also turn out that there is an optimal τ that works for all initial states i. The problem is to calculate v(i) and find the best т. If the chain is irreducible, the system must pass through every state, and the best strategy is obviously to wait until the system enters a state for which / is maximal. This describes an optimal τ, and v(i) = max / for all i. For this reason the interesting cases are those in which some states are transient and others are absorbing (/?„· =1). A function φ on 5 is excessive or superharmonic, iff (8.45) φν)*Σρ,Μί)> 'eS. 1 In terms of conditional expectation the requirement is φ(ί)>Ε![φ(Χι)]. Lemma 4. The value function ν is excessive. Proof. Given e, choose for each У in 5 a "good" stopping time r,· satisfying E}[f(XT)]>v(j)-€. By (8.43), [η = n] = [(X0,...,Xn)eI]n] for some set JJn of (n + l)-long sequences of states. Set τ = η + 1 (η > 0) on the set [Xx =j\ Π [(Xt,...,Xn + l)e/;J; that is, take one step and then from the new state X{ add on the "good" stopping time for that state. Then τ is a fCompare the conditions (7.28) and (7.35).
SECTION 8. MARKOV CHAINS 135 stopping time and oo £,[/(*,)]= Σ ЕЕ^,=М^ ^,)e/;„j„+1=i]/(i) л=0 j ft 00 = Σ ΣΣΡ^·[(*ο>·--.*π)ε//π.*„ = *]/(*) л=0 ; ft = Σρ,·Λ·[/(^)]· j Therefore, v(i)>Et[f(XT))>Z;Pij(v(j)-e) = Z;Pijv(j)-e. Since e was arbitrary, υ is excessive. ■ Lemma 5. Suppose that φ is excessive. (i) For all stopping times τ, <ρ(/) > Ε([φ(Χτ)]. (ii) For all pairs of stopping times satisfying σ < τ, Ε;[φ(Χσ)] > Ε;[φ(Χτ)]. Part (i) says that for an excessive payoff function, τ = 0 represents an optimal strategy. Proof. To prove (i), put τΝ = min{r, N). Then τΝ is a stopping time, and (8.46) EI[4>(XJ]= Σ ЕФ = «Д» = *]?(*) n=0 ft ft Since [r>iV] = [t</V]c eFw_|, the final sum here is by (8.13) LLPl[r^N,XN.l=j,XN = k^(k) к J = ΣΣΡ^>Ν,ΧΝ_ι=}]ρ1Ι(φ(Ιί)<ΣΡ^>Ν,ΧΝ_ι=}]φα). к j j Substituting this into (8.46) leads to Ε;[φ(ΧΤΝ)] <Ε;[φ(ΧΤΝ )]. Since τ0 = 0 and Ε;[φ(Χ0)] = φ(Ο, it follows that Ε;[φ(ΧΤΝ)] <φ(ί) for"all N. But for τ(ω) finite, φ(Χτ^ω)(ω)) ~* φ(ΧΗω)(ω)) (there is equality for large N), and so 4<P(XJ] ~* £/[*>UT)] ЬУ Theorem 5.4.
136 PROBABILITY The proof of (ii) is essentially the same. If rN = πιίη{τ, σ + Ν), then τΝ is a stopping time, and Ει[φ{ΧΤΗ)]= Σ Σ LPi[<r = m,T = m + n,Xm+n = k]<p(k) m=0 л = 0 к oo + Σ Е/5,[^ = т,т>т+Лг, Zm+A, = /c]<p(/c). т = 0 fc Since [a = m, τ > m + Ν] =[a = m] - [σ = m, τ <m +N]e SrmJrN_v again E^XTn)\<E;^XTn_)\<E^{XTi)1 Since τ0 = σ, part (ii) follows from part (i) by another passage to the limit. ■ Lemma 6. // an excessive function φ dominates the payoff function f, then it dominates the value function ν as well. By definition, to say that g dominates h is to say that g(i) > MO for all i. Proof. By Lemma 5, <p(0 > Et[(p(XT)] > E:[f(XT)] for all Markov times τ, and so <p(i) > v(i) for all /'. ■ Since τ = 0 is a stopping time, ν dominates /. Lemmas 4 and 6 immediately characterize v: Theorem 8.10. The value function ν is the minimal excessive function dominating f. There remains the problem of constructing the optimal strategy τ. Let Μ be the set of states i for which v(i) =/(/); M, the support set, is nonempty, since it at least contains those : that maximize /. Let A = П"=0[^п ^^1 be the event that the system never enters M. The following argument shows that P;(A) = 0 for each /'. As this is trivial if M = S, assume that Μ Φ S. Choose δ > 0 so that /(/) < ϋ(ί) - δ for i <= S - M. Now Et[f(XT)] = Е"=0Ц Ph = n, X„ = k]f(k); replacing the f(k) by v(k) or v(k) - δ according as к εΜ or k(ES-M gives E:[f(XT)] <; E,[v(XT)] - 8Р,[ХТ eS-M]< ЕДиОД] - δΡ;(Α) < v(i) - SPt(A), the last inequality by Lemmas 4 and 5. Since this holds for every Markov time, taking the supremum over τ gives P;(A) = 0. Whatever the initial state, the system is thus certain to enter the support set M. Let T0(io) = min[n:In(u))eM] be the hitting time for M. Then τ0 is a Markov time, and τ0 = 0 if I0eM. It may be that Χη(ω)£Μ for all n, in which case τ0(ω) = с», but as just shown, the probability of this is 0.
SECTION 8. MARKOV CHAINS 137 Theorem 8.11. The hitting time τ0 is optimal: Е[/(ХТ )] = v(i) for all i. Proof. By the definition of т0, f(XTfj) = υ(ΧΤ()). Put <p(i) = £,[/(*Гц)] = Е([и(Хт )]. The first step is to show that φ is excessive. If τλ = min[«: η ^ 1 Д„ e M], then τ, is a Markov time and oo Ε{ν{Χτ)}= Σ Σ P-[X^M,...,Xn_^M,Xn = k]v(k) л=1keM oo = Σ Σ Σρ^[*ο*Λ/,···,*π-2«Μ.*η-. =*]«(*) η = 1 keM jeS -LpuEjHxJ]. J Since т0 < τ,, £'1[y(Arr())] > E,[y(Arri)] by Lemmas 4 and 5. This shows that φ is excessive. And φ(ϊ) < v(i) by the definition (8.44). If (p(i)>f(i) is proved, it will follow by Theorem 8.10 that φ(ί)>υ(0 and hence that φ(ί) = v(i). Since τ0 = 0 for X0 e M, if ieM then φ(ί) = ^/[Д^о)1 ^/(O. Suppose that <p(0 </(:') for some values of : in S - M, and choose /0 to maximize /(:') - φϋ). Then </>(') = <p(i) +/('O) ""^(''о) dominates / and is excessive, being the sum of a constant and an excessive function. By Theorem 8.10, ψ must dominate υ, so that </»(/0) > v(i0), or /(/„) > υ(υ0). But this implies that i0 e M, a contradiction ■ The optimal strategy need not be unique. If / is constant, for example, all strategies have the same value. Example 8.16. For the symmetric random walk with absorbing barriers at 0 and r (Example 8.2) a function ψ on 5 = {0,1,..., r) is excessive if φ(ί) > -\φ(ί - 1) ■+ \φ(ί + 1) for 1 < i < r - 1. The requirement is that φ give a concave function when extended by linear interpolation from 5 to the entire interval [0, r]. Hence υ thus extended is the minimal concave function dominating /. The figure shows the geometry: the ordinates of the dots are the values of / and the polygonal line describes v. The optimal strategy is to stop at a state for which the dot lies on the polygon.
138 PROBABILITY If f(r) = 1 and /(/) = 0 for i < r, then у is a straight line; v(i) = i/r. The optimal Markov time т0 is the hitting time for Μ = {0, r), and v(i) = Ei[f(XT )] is the probability of absorption in the state r. This gives another solution of the gambler's ruin problem for the symmetric case. ■ Example 8.17. For the selection problem in Example 8.5, the pn are given by (8.5) and (8.6) for \<i<r, while pr+lr+1= 1. The payoff is f(i) = i/r for i <r and f(r f 1) = 0. Thus u(r + 1) = 0, and since υ is excessive, г (8.47) o(i)>g(i)= Σ ТТТТТГКЛ. l~i<r- By Theorem 8.10, υ is the smallest function satisfying (8.47) and u(i)>f(i) = i/r, 1 <i <r. Since (8.47) puts no lower limit on u(r), it follows that u(r)=f(r) = 1, and r lies in the support set M. By minimality, (8.48) i>(i)=max{/(i),*(i)}. l<i<r. If ieM, then f(i) =_u(i) > g(i) > Σ^ί+ιΓι(] - 1)_1ДЛ =/(/)£;_,+ιΟ - l)"1, and hence Σ.Γ=,·+1(/ - 1) ' < 1. On the other hand, if this inequality holds and i+l,...,r all lie in M, then g(0 = Е;„/+1г>*"Чу- 1)"'/(;') =/(0E^/+1(/- l)"1 </(/), so that ίεΜ by (8.48). Therefore, M= {ir, ir + l,.. ,r,r+ 1}, where ir is determined by (8-49) ГЩ + -+7Ч^<]Я + Ь-+7Ч If / < ir, so that ι ί Μ, then v(i)>f(i) and so, by (8.48), It follows by backward induction starting with i = ir — \ that (8-50) lj(f)_Pr_kZi(_lT + ...+_lT) is constant for 1 < i < ir. In the selection problem as originally posed, Xx = \. The optimal strategy is to stop with the first Xn that lies in M. The princess should therefore reject the first ir — 1 suitors and accept the next one who is preferable to all his predecessors (is dominant). The probability of success is pr as given by (8.50). Failure can happen in two ways. Perhaps the first dominant suitor after ir is not the best of all suitors; in this case the princess will be unaware of failure. Perhaps no dominant suitor comes after i ; in this case the princess is obliged to take the last suitor of all and may be well
SECTION Ь. MARKOV CHAINS 139 aware of failure. Recall that the problem was to maximize the chance of getting the best suitor of all rather than, say, the chance of getting a suitor in the top half. If r is large, (8.49) essentially requires that log r - log ir be near 1, so that ir ~ r/e. In this case, pr = l/e. Note that although the system starts in state 1 in the original problem, its ν resolution by means of the preceding theory requires consideration of all possible initial s*ates. ■ This theory carries over in part to the case of infinite 5, although this requires the general theory of expected values, since f{XT) may not be a simple random variable. Theorem 8.10 holds for infinite 5 if the payoff function is nonnegative and the value function is finite.1 But then problems arise: Optimal strategies may not exist, and the probability of hitting the support set Μ may be less than 1. Even if this probability is 1, the strategy of stopping on first entering Μ may be the worst one of all.* PROBLEMS 8.1. Prove Theorem 8.1 for the case of finite S by constructing the appropriate probability measure on sequence space S°°: Replace the summand on the right in (2.21) by au pu u ■ ■ pu _ u , and extend the arguments preceding Theorem 2.3. If Α',,(-) = £„(■), then X{,X7,. . is the appropriate Markov chain (here time is shifted by 1). 8.2. Let У0,У,,... be independent and identically distributed with Р[У„=1]=р, P[Yn = 0] = q = 1 -ρ,ρ Φ q. Put Xn = Yn + Yn+, (mod2). Show that X0, Xv... is not a Markov chain even though Р[А'п+1 =/|А'п_1 =г'] = Р[А'п+1 =/]. Does this last relation hold for all Markov chains? Why? 8.3. Show by example that a function f{Xn\f(Xx),... of a Markov chain need not be a Markov chain. 8.4. Show that and prove that if / is transient, then Σηρ\"} <°° for each г (compare Theorem 8.3(0). If ;' is transient, then The only esseniial change in the argument is that Fatou's lemma (Theorem 16.3) must be used in place of Theorem 5 4 in the proof of Lemma 5. *See Problems 8 36 and 8.37
140 PROBABILITY Specialize to the case i =;: in addition to implying that ί is transient (Theorem 8.20)), a finite value for t^.ipj"'* suffices to determine fu exactly. 8.5. Call {jc,} a subsolution of (8.24) if xi < E;<7,;JC, and 0 <jc,< 1, i e U. Extending Lemma 1, show that a subsolution {jc,} satisfies ·*,<σ,: The solution {σ,} of (8.24) dominates all subsolutions as well as all solutions. Show that if jc, = E;<7,;JC and -1 <jc( < 1, then {|.r,|} is a subsolution of (8.24). 8.6. Show by solving (8.27) that the unrestricted random walk on the line (Example 8.3) is persistent if and only if ρ = \ 8.7. (a) Generalize an argument in the proof of Theorem 8.5 to show that fjk = Pik + ^j*kPafjk- Generalize this further to /W,V> + ■■+№ + Σ ад **.....*„-> **.*„=л/;* (b) Take k=i. Show that /(>>0 if and only if РДЛГ,*г,. .,Χ„.χΦ'ι, Xn =j] > 0 for some n, and conclude that г is transient if and only if fsi < 1 for some / Ψ i such that ftj > 0. (c) Show that an irreducible chain is transient if and only if for each г there is a ; Φι such thai />( < 1. 8.8. Suppose that S = [0,1,2,...}, p00 = 1, and /,0 > 0 for all i. (a) Show that ^.(U"=,[*„ =/ i.o.]) = 0 for all i. (b) Regard the state as the size of a population and interpret the conditions Poo = 1 and fi0 > 0 and the conclusion in part (a). 8.9. 8.5 Τ Show for an irreducible chain that (8.27) has a nontrivial solution if and only if there exists a nontrivial, bounded sequence {jc,·} (not necessarily nonnega- tive) satisfying xi, = Σ,·„, ,op/y·xjt i*i0. (See the remark following the proof of Theorem 8.5.) 8.10. Τ Show that an irreducible chain is transient if and only if (for arbitrary г0) the system y( = E;p,yyy, ί Φ i0 (sum over all j), has a bounded, nonconstant solution {у,, г e S). 8.11. Show that the Ρ,-probabilities of ever leaving U for i e U are the minimal solution of the system. [zi= Σ Pi,zj+ Σ Pir i^U, (8.51) I i^u i*u (о<г,<1, i<=U. The constraint zt < 1 can be dropped: the minimal solution automatically satisfies it, since z, = 1 is a solution. 8.12. Show that sup,--n0(i,/) = oo is possible in Lemma 2.
SECTION 8. MARKOV CHAINS 141 8.13. Suppose that {π,·} solves (8.30), where it is assumed that Σ,·Ιπ,Ι < °°, so that the left side is well defined. Show in the irreducible case that the тг1 are either all positive or all negative or all 0. Stationary probabilities thus exist in the irreducible case if and only if (8.30) has a nontrivial solution {π,} (Σ,-ττ, absolutely convergent). 8.14. Show by example that the coupled chain in the proof of Theorem 8.6 need not be irreducible if the original chain is not aperiodic. 8.15. Suppose that S consists of all the integers and Ρο,-ι = ίΌ,ο=Ρο,+ ι = I. Pk,k-\=4> Pk,k + l=P> Pk,k-\ =P' Pk.k + l =4> Show that the chain is irreducible and aperiodic. For which p's is the chain persistent? For which p's are there stationary probabilities? 8.16. Show that the period of / is the greatest common divisor of the set (8.52) [n:n>l,f^>0]. 8.17. Τ Recurrent events. Let fx, /2,... be nonnegative numbers with /= Σ^„ι/„ < 1. Define м,,м2,... recursively by u{ = /j and (8.53) "»=/i"„-i + ··· +/„-i«i+/„■ (a) Show that /< 1 if and only if L„un < <». (b) Assume that /= 1, set μ = Σ^ι/ι/,,, and assume that (8.54) gcdfn: л > 1,/„ > 0] = 1. Prove the renewal theorem. Under these assumptions, the iimit и = limn un exists, and и > 0 if and only if μ < », in which case и = Ι/μ. Although these definitions and facts are stated in purely analytical terms, they have a probabilistic interpretation: Imagine an event <? that may occur at times 1,2,.... Suppose fn is the probability <f occurs first at time n. Suppose further that at each occurrence of £ the system starts anew, so that /„ is the probability that <f next occurs η steps later. Such an £ is called a recurrent event. If un is the probability that <? occurs at time n, then (8.53) holds. The recurrent event <? is called transient or persistent according as /< 1 or /= 1, it is called aperiodic if (8.54) holds, and if /= 1, μ is interpreted as the mean recurrence time 8.18. (a) Let τ be the smallest integer for which XT = i0. Suppose that the state space is finite and that the pif are all positive. Find a p such that max,(l -p,·,·) <p < 1 and hence РДт>и]^р" for all i. (b) Apply this to the coupled chain in the proof of Theorem 8.6: \pj^ ~Рд"| < ρ". Now give a new proof of Theorem 8.9. *<-l, fc>l.
142 PROBABILITY 8.19. A thinker who owns r umbrellas wanders back and forth between home and office, taking along an umbrella (if there is one at hand) in rain (probability p) but not in shine (probability q). Let the state be the number of umbrellas at hand, irrespective of whether the thinker is at home or at work. Set up the transition matrix and find the stationary probabilities. Find the steady-state probability of his getting wet, and show that five umbrellas will protect him at the 5% level against any climate (any p). 8.20. (a) A transition matrix is doubly stochastic if Σ,-ρ,- = 1 for each ;. For a finite, irreducible, aperiodic chain with doubly stochastic transition matrix, show that the stationary probabilities are all equal. (b) Generalize Example 8.15: Let 5 be a finite group, let p(i) be probabilities, and put Pjj =p(j i~'), where product and inverse refer to the group operation. Show that, if all p(i) are positive, the states are all equally likely in the limit. (c) Let 5 be the symmetric group on 52 elements. What has (b) to say about card shuffling? 8.21. A set С in 5 is closed if L:<EcPij= * f°r 'e^: once the system enters С it cannot leave. Show that a chain is irreducible if and only if S has no proper closed subset. 8.22. Τ Let Τ be the set of transient states and define persistent states ί and / (if there are any) to be equivalent if /)· > 0. Show that this is an equivalence relation on S — Τ and decomposes it into equivalence classes C{,C2,..., so that S = TO C, U С, U · · Show that each Cm is closed and that ftj = 1 for i and j in the same Cm. 8.23. 8.11 8.21 Τ ' Let Γ be the set of transient states and let С be any closed set of persistent states. Show that the ^-probabilities of eventual absorption in С for i £ Τ are the minimal solution of У, = Σ Pijyj + Σ Pij, ' e T, j<ET ye С 0<у,<1, ier. 8.24. Suppose that an irreducible chain has period t> 1. Show that S decomposes into sets S0,...,S,_, such that pt] > 0 only if г'е£„ and j^Sv+1 for some ν {ν + 1 reduced modulo /)- Thus the system passes through the Sv in cyclic succession. 8.25. Τ Suppose that an irreducible chain of period t > 1 has a stationary distribution {ttj}. Show that, if i e S„ and j^Sv+a (v + a reduced modulo t), then lim„p((;'+<" = 7r;, Show that lim„ «-^_,p,if,=ir>// for all i and ;. 8.26. Eigenvalues. Consider an irreducible, aperiodic chain with state space {1, ..,s). Let i-0 = (ir1,...,irJ) be (Example 8.14) the row vector of stationary probabilities, and let c0 be the column vector of l's; then r0 and c0 are left and right eigenvectors of Ρ for the eigenvalue λ = 1. (a) Suppose that r is a left eigenvector for the (possibly complex) eigenvalue λ: rP = kr. Prove: If λ = 1, then r is a scalar multiple of rQ (λ = 1 has geometric (8.55)
SECTION 8. MARKOV CHAINS 143 multiplicity 1). If λ Φ 1, then |λ| < 1 and rc0 = 0 (the 1 X 1 product of 1 X s and s X 1 matrices). (b) Suppose that с is a right eigenvector: Pc = Ac. If λ = 1, then с is a scalar multiple of c0 (again the geometric multiplicity is 1). If λ Φ 1, then again |λ| < 1, and r0c = 0. 8.27. Τ Suppose Ρ is diagonalizable; that is, suppose there is a nonsingular С such that C~1PC = \, where Л is a diagonal matrix. Let λ,,.,.,λ^ be the diagonal elements of Λ, let ch...,cs be the successive columns of C, let R = C~l, and let ri,...,rs be the successive rows of R. (a) Show that c, and r; are right and left eigenvectors for the eigenvalue Aj, i = l,...,s. Show that τ-,ο, = δ(;. Let At = cjri (s Xs). Show that Λ" is a diagonal matrix with diagonal elements \",..., Xns and that P" = CK"R = Lsu = xX"uAu,n>\. (b) Part (a) goes through under the sole assumption that Ρ is a diagonalizable matrix. Now assume also that it is an irreducible, aperiodic stochastic matrix, and arrange the notation so that A, = 1. Show that each row of /4, is the vector (it,, ..., irs) of stationary probabilities. Since (8.56) Г"=А1+ ΣΚΑ„ и=-2 and | A J < 1 for 2 < и < s, this proves exponential convergence once more. (c) Write out (8.56) explicitly for the case 5 = 2. (d) Find an irreducible, aperiodic stochastic matrix that is not diagonalizable. 8.28. Τ (a) Show that the eigenvalue A = 1 has geometric multiplicity 1 if there is only one closed, irreducible set of states; there may be transient states, in which case the chain itself is not irreducible. (b) Show, on the other hand, that if there is more than one closed, irreducible set of states, then A = 1 has geometric multiplicity exceeding 1. (c) Suppose that there is only one dosed, irreducible set of states. Show that the chain has period exceeding 1 if and only if there is an eigenvalue other than 1 on the unit circle. 8.29. Suppose that {Xn) is a Markov chain with state space 5, and put Yn = (Хп, А'я + 1). Let Τ be the set of pairs (/,;') such that ptj > 0 and show that {Yn) is a Markov chain with state space T. Write down the transition probabilities. Show that, if {Xn} is irreducible and aperiodic, so is {Yn}. Show that, if π,· are stationary probabilities for {X„), then π,ρ,; are stationary probabilities for {Yn}. 830. 6.10 8.29 Τ Suppose that the chain is finite, irreducible, and aperiodic and that the initial probabilities are the stationary ones. Fix a state /, let An = [A", = i], and let Nn be the number of passages through ί in the first η steps. Calculate an and βη as defined by (5.41). Show that βη-α2η = 0(1/η), so that n~lNn^> ττ-ι with probability 1. Show for a function / on the state space that я"'Σ* = i/(■£*) -> Σ,ττ,/Ο) with probability 1. Show thai n~1L"k=ig(Xk, Xk + 1) -> LjjTTjPjjgQ, j) for functions g on S X 5.
144 PROBABILITY 8.31.6.14 8.30Τ If Χ0(ω) = ϊ0,...,ΧηΜ = ΐ„ for states i0,...,i„, put ρ„(ω) = ■""ί Pi i " ' Pi . i у so triat ρ„(ω) is the probability of the observation observed. Show that -n-1 logp„(u>) -*h = -Σ,,π,ρ,; logp,; with probability 1 if the chain is finite, irreducible, and aperiodic. Extend to this case the notions of source, entropy, and asymptotic equipartition. 8.32. A sequence [Xn] is a Markov chain of second order if Ρ[ΛΓ„ + , =j|AT0 = /0,...,А'п = гл] = Р[А'п+1=;|ЛГп_1=гп_1,А'л = гп]=р,.1_1,л). Show that nothing really new is involved because the sequence of pairs (Xn,Xn+1) is an ordinary Markov chain (of first order). Compare Problem 8.29. Generalize this idea into chains of order r. 8.33. Consider a chain on S = {0,1, .., r), where 0 and r are absorbing states and piJ+1 =Pj > 0, ρ,·,_ ι = q,:= 1 -Pj > 0 for 0 < i < r. Identify state i with a point ζ ι on the line, where 0 = z0 < · · · < zr and the distance from z, to zj+ j is <7,/p, times that from z,_j to z,. Given a function ψ on 5, consider the associated function φ on [0, zr] defined at the z, by <£(z,) = φ(0 and in between by linear interpolation. Show that φ is excessive if and only if ψ is concave. Show that the probability of absorption in r for initial state ί is tj_l/tr_i, where t,■ = Σ'* = ο9ι ' ' Чк/Р\ "" Ρ к- Deduce (7.7). Show that in the new scale the expected distance moved on each step is 0. 8.34. Suppose that a finite chain is irreducible and aperiodic. Show by Theorem 8.9 that an excessive function must be constant. 8.35. A zero-one law. Let the state space S contain s points, and suppose that e„ = sup;y|pf"' - ir-l —» 0, as holds under the hypotheses of Theorem 8.9. For a<b, let J?% be the σ-field generated by the sets [Xa =ua,..., Xb = ub]. Let ^ - σ(υU.f,b) and &= n:_i^· Show that \P(A ПB) - P(A)P(B)\ < s(en+eb+n) for A^c?o and йе^"1; the eb + n can be suppressed if the initial probabilities are the stationary ones. Show that this holds for A e <&£ and 5еУн„. Show that Се У implies that P(C) is either 0 or 1. 8.36+ Alter the chain in Example 8.13 so that q0 = 1 -p0 = 1 (the other p, and <7, still positive). Let β = limn pl ■ ■ ■ pn and assume that β > 0. Define a payoff function by /(0)=1 and f(i) = 1 -/,„ for />0. If X0,...,Xn are positive, put ση = η; otherwise let ση be the smallest к such that Xk = 0. Show that £,[/(A'<7n)] -> 1 as η -> oo, so that и(г') = 1. Thus the support set is Μ = {0}, and for an initial state i> 0 the probability of ever hitting Μ is fm < 1. For an arbitrary finite stopping time τ, choose η so that Ρ,·[τ <«=σπ]> 0. Then £,[/(A'T)]< 1 -/;+„ 0Ρ,·[τ<η =σ„]< 1. Thus no strategy achieves the value u(i) (except of course for г = 0). 8.37. Τ Let the chain be as in the preceding problem, but assume that /3 = 0, so that //0= 1 for all i. Suppose that λ,,λ2,... exceed 1 and that λ, · ·· λ„->λ < oo; put /(0) = 0 and /(г') = А, ··· A,_,/pi •••p/_1. For an arbitrary (finite) stopping time τ, the event [τ = η] must have the form [(X0,...,X„)eIn] for some set /„ of (n + l)-long sequences of states. Show that for each i there is at fThe final three problems in this section involve expected values for random variables with infinite range.
SECTION 9. LARGE DEVIATIONS AND THE ITERATED LOGARITHM 145 most one η > 0 such that (i,i + 1,...,/ + n)e In. If there is no such n, then E]if(XT)] = 0. If there is one, then £,[/(*r)]=^[(*o.· .,*„) = (i,...,i + n)]/(i + n), and hence the only possible values of E[f(XT)] are 0, /(i), p./O'+l) =/(')*,-, р,р, + 1/(г + 2)=/(ОА;А;+1,.... Thus o(i) =/(/)λ/λ, · · · λ,_, for ί > 1; no strategy this value. The support set is Μ = {0}, and the hitting time τ0 for Μ is finite, but ЕЦ(ХТо)] = О. 8.38. 5.12 Τ Consider an irreducible, aperiodic, positive persistent chain. Let τ,- be the smallest η such that X„=j, and let /и,у = Е,[тД Show that there is an r such that ρ = Ρ-[ΧιΦ},...,ΧΓ_ιΦ}, Xr = i] is positive; from f}jn + r) >pf^ and mj,<co> conclude that mit < со and /и,7 = 1^=0^[т,>п]. Starting from pjy'' = ς;_ιΛ?>ρ)Γ,>· show that 1 = 1 m = 0 Use the M-test to show that CO n= 1 If г =;, this gives m- = 1/ttj again; if г Ф], it shows how in principle m,; can be calculated from the transition matrix and the stationary probabilities. SECTION 9. LARGE DEVIATIONS AND THE LAW OF THE ITERATED LOGARITHM* It is interesting in connection with the strong law of large numbers to estimate the rate at which S„/n converges to the mean m. The proof of the strong law used upper bounds for the probabilities P[\Sn — m\ > a] for large a. Accurate upper and lower bounds for these probabilities will lead to the law of the iterated logarithm, a theorem giving very precise rates for Sn/n -»m. The first concern will be to estimate the probability of large deviations from the mean, which will require the method of moment generating functions. The estimates will be applied first to a problem in statistics and then to the law of the iterated logarithm. This section may be omitted \
146 PROBABILITY Moment Generating Functions Let X be a simple random variable asssuming the distinct values χ,,..., xt with respective probabilities ρ,,.,.,ρ,. Its moment generating function is (9.1) Μ(ί)=Ε[β,χ]=ΣΡ,-''χ'· (See (5.19) for expected values of functions of random variables.) This function, denned for all real t, can be regarded as associated with X itself or as associated with its distribution—that is, with the measure on the line having mass p, at x, (see (5.12)). If с = max,-!*,-!, the partial sums of the series c'x ■= Y%=0tkXk/k\ are bounded by e|,|c, and so the corollary to Theorem 5.4 applies: (9.2) M(i)= Σ ^E[Xk]. Thus M(t) has a Taylor expansion, and as follows from the general theory [A29], the coefficient of tk must be М(Ч0)Д! Thus (9.3) E[Xk]=Mw(0). Furthermore, term-by-term differentiation in (9.1) gives M(k\t) = Σ Pix*e,x-=E[Xke'x]; i=l taking t = 0 here gives (9.3) again. Thus the moments of X can be calculated by successive differentiation, whence M{t) gets its name. Note that M(0) = 1. Example 9.1. If X assumes the values 1 and 0 with probabilities ρ and q = \—p, as in Bernoulli trials, its moment generating function is M(t) = pe'+q. The first two moments are M'(0)=p and M"(0) =p, and the variance is ρ —p2 =pq. ■ If Xv...,Xn are independent, then for each t (see the argument following (5.10)), e'x\...,e'x" are also independent. Let Μ and M,,..., M„ be the respective moment generating functions of S=Xl+---+X„ and of J\f,,...,J\f„; of course, e'5 = П/е'Л/. Since by (5.25) expected values multiply for independent random variables, there results the fundamental relation (9.4) M{t)=Ml(t)---M„(t).
SECTION 9. LARGE DEVIATIONS AND THE ITERATED LOGARITHM 147 This is an effective way of calculating the moment generating function of the sum 5. The real interest, however, centers on the distribution of 5, and so it is important to know that distributions can in principle be recovered from their moment generating functions. Consider along with (9.1) another finite exponential sum iV(i) = H-q-e'**, and suppose that M{t) = N{t) for all t. If х1ц = max xi and yy = max yjt then M{t) ~pifje'x'f> and N(t) ~ qJoe'yio as t -»oo, and so xt = y,- and p,- = qt. The same argument now applies to ΣίΦίοΡιβ'χ' = Υ,ίΦ- q^^·, and it follows inductively that with appropriate relabeling, x, = y; and p( = ^ for each /. Thus the function (9.1) does uniquely determine the x; and pt. Example 9.2. If A"j,..., J\f„ are independent, each assuming values 1 and 0 v/ith probabilities ρ and g, then S=J\f,+ ··· +Z„ is the number of successes in η Bernoulli trials. By (9.4) and Example 9.1, 5 has the moment generating function £[е<Л=(ре' +q)"= t (?)pV-V*. The right-hand form shows this to be the moment generating function of a distribution with mass ί "k \pkq"~k at the integer k, 0 < к < п. The uniqueness just established therefore yields the standard fact that P[S = k] = [nk]pkq"~k. The cumulant generating function of X (or of its distribution) is (9.5) C(r) = log M(i) = log £[e'*]. (Note that M{t) is strictly positive.) Since C = M'/M and C"-{MM" - (M')2)/M2, and since M(0)= 1, (9.6) ;C(0)=0, C(0)=E[X], C"(0) = Var[*]. Let mk = E[Xk]. The leading term in (9.2) is m0=l, and so a formal expansion of the logarithm in (9.5) gives (9-7) Ф)=Е^ЕМ· υ=1 \fc=l ' / Since M{t) -»1 as t -»0, this expression is valid for ί in some neighborhood of 0. By the theory of series, the powers on the right can be expanded and
148 PROBABILITY terms with a common factor t' collected together. This gives an expansion (9.8) ОД= Σ^«', i'-l valid in some neighborhood of 0. The c( are the cumulants of X. Equating coefficients in the expansions (9.7) and (9.8) leads to ct =m, and c2 = m2-m2, which checks with (9.6). Each c, can be expressed as a polynomial in m1,...,mi and conversely, although the calculations soon become tedious. If E[X] = 0, however, so that m. c, = 0, it is not hard to check that _ τ. im% (9.9) c3=m3, cA = m. Taking logarithms converts the multiplicative relation (9.4) into the additive relation (9.10) C(r)=C1(i) + ---+Cll(i) for the corresponding cumulant generating functions; it is valid in the presence of independence. By this and the definition (9.8), it follows that cumulants add for independent random variables. Clearly, M"(t) = E[X2e'x) > 0. Since (M'(O)2 = E2[Xe,x] < E[e,x] ■ E[X2e,x] = M(t)M"(t) by Schwarz's inequality (5.36), C"(t)>0. Thus the moment generating function and the cumulant generating function are both convex. Large Deviations Let Υ be a simple random variable assuming values y; with probabilities p,. The problem is to estimate P[Y>a] when Fhas mean 0 and a is positive. It is notationally convenient to subtract a away from Υ and instead estimate P[Y> 0] when Υ has negative mean. Assume then that (9.11) E[Y]<0, P[Y>0]>0, the second assumption to avoid trivialities. Let M{t) = ZjPj-e'y' be the moment generating function of Y. Then M'(0) < 0 by the first assumption in >~t
SECTION 9. LARGE DEVIATIONS AND THE ITERATED LOGARITHM 149 (9.11), and M{t) -»oo as t -»oo by the second. Since M(t) is convex, it has its minimum ρ at a positive argument τ: (9.12) infM(i) =м(т)=р, 0<р<1, т > 0. Construct (on an entirely irrelevant probability space) an auxiliary random variable Ζ such that (9.13) />[Z=y,]=^/>[r = y,] for each y; in the range of Y. Note that the probabilities on the right do add to 1. The moment generating function of Ζ is (9.14) ψ*]-££?>,-iiiliil. and therefore г τ Μ'(τ) 1 Г ,1 Μ" (τ) (9.15) Ε[Ζ] = —— =0, s2 = £[Z2]= ^->0. For all positive t, P[Y> Q] = P[e'Y> 1] <M{t) by Markov's inequality (5.31), and hence (9.16) P[Y>0]<p. Inequalities in the other direction are harder to obtain. If Σ denotes summation over those indices / for which y;· > 0, then (9.17) P[Y>0]=ZPj = pL'e-TyjP[Z=yj]. Put the final sum here in the form e~e, and let ρ = P[Z > 0]. By (9.16), θ > 0. Since log χ is concave, Jensen's inequality (5.33) gives -Θ = logZ'e~ryiP~lP[Z = yj] +\ogp >L'(-Tyj)p-lP[z = yj}+\ogP = -T^-1E'7^[Z = yy]+logp. By (9.15) and Lyapounov's inequality (5.37),
150 PROBABILITY The last two inequalities give (9.18) 0<fl< ρ[ζ°>0] -logP[Z>0]. This proves the following result. Theorem 9.1. Suppose that Υ satisfies (9.11). Define ρ and τ by (9.12), let Ζ be a random variable with distribution (9.13J, and define s2 by (9.15). Then P[Y>0]=pe~e, where Θ satisfies (9.18). To use (9.18) requires a lower bound for P[Z > 0]. Theorem 9.2. IfE[Z] = 0, E[Z2] = s2, and E[Z4] = £4 > 0, then P[Z > 0] > i4/4fV Proof. Let Z + = Z/[Z>0] and Z~= -ΖΙ,Ζ<0]. Then Z+ and Z~ are nonnegative, Z = Z + -Z~,Z2 = (Z+)2 + (Z~Y, and (9.19) j2 = e[(Z + )2]+e[(Z-)2]. Let ρ = P[Z > 0]. By Schwarz's inequality (5.36), e[(z+)2]=e[/[2,0]z2] <E^2\l2z^'2\Z*\=p^2e· By Holder's inequality (5.35) (for ρ = f and g = 3) £[(Z-)2]=£[(Z-)2/3(Z-)4/3] <£2/3[z-]£'/3[(z-)4] <£2/3[z~]£4/3. Since E[Z] = 0, another application of Holder's inequality (for p = 4 and q=\) gives E[Z-]=E[Z+]=E[ZI[Z^0]] <E./4[Z4]E3/4[/v3o]]=^3/4 Combining these three inequalities with (9.19) gives s2 < ρι/2ξ2 + (^3/4)2/3^4/3 = 2ρΙ/2ξ2 , fFor a related result, see Problem 25.19.
SECTION 9. LARGE DEVIATIONS AND THE ITERATED LOGARITHM 151 Chernoffs Theorem1 Theorem 9.3. Let Xl,X2,... be independent, identically distributed simple random variables satisfying E[Xn]<0 and P[Xn > 0]> 0, let M(t) be their common moment generating function, and put ρ = inf, M(t). Then (9.20) lim -log/>[*,+ ··· +Xn>0]=\ogp. Π -^ oo ft Proof. Put Yn =Xt + ■ ■ ■ +Xn. Then E[YJ<0 and P[Yn > 0] > P"[XX > 0] > 0, and so the hypotheses of Theorem 9 1 are satisfied. Define pn and τ„ by inf, Mn(t) = Мя(тя) =ри, where Μ„(ί) is the moment generating function of Yn. Since Mn(t)=M"(t), it follows that pn =p" and τ„ = τ, where Μ(τ)=ρ. Let Z„ be the analogue for Yn of the Ζ described by (9.13). Its moment generating function (see (9.14)) is Μ„(τ + t)/p" = (М(т + t)/p)". This is also the moment generating function of Vx + · · · + Vn for independent random variables Vb...,Vn each having moment generating function M(r + t)/p. Now each Vt has (see (9.15)) mean 0 and some positive variance σ2 and fourth moment ξ4 independent of i. Since Z„ must have the same moments as V{ + ·- · +V„, it has mean 0, variance s2n = na2, and fourth moment ξ* = ηξ4 + Зп(п - 1)σ4 = 0(n2) (see (6.2)). By Theorem 9.2, P[Zn>0]> 54/4ξ4 > a for some positive a independent of n. By Theorem 9.1 then, P[Y„> Q] = p"e~e", where 0 <en<rnsna~l - log α = τα"χσ^η - log α. This gives (9.20), and shows, in fact, that the rate of convergence is 0(n~1/2). ■ This result is important in the theory of statistical hypothesis, testing. An informal treatment of the Bernoulli case will illustrate the connection. Suppose Sn=Xl+ · · · +Xn, where the A', are independent and assume the values 1 and 0 with probabilities ρ and q. Now P[Sn ^na] = P[Lnk=i{Xk -a) ^01, and Chernoffs theorem applies if ρ < a < 1. In this case M(t) = £te'(*1-e)] = e~'"(pe' + <7). Minimizing this shows that the ρ of Chernoffs theorem satisfies -\og ρ = K(a, p) = a\og - +blog-, where b = 1 - a. By (9.20), n"' log P[Sn ^ na] -> -K(a, p); express this as (9.21) P[Sn>na]«e'nK(a-p\ Suppose now that ρ is unknown and that there are two competing hypotheses concerning its value, the hypothesis #, that p=px and the hypothesis H2 that This theorem is not needed for the law of the iterated logarithm, Theorem 9.5.
152 PROBABILITY Ρ = P2> where ρλ <р2. Given the observed results Xlt. .,Xn of η Bernoulli trials, one decides in favor of H2 if Sn > na and in favor of #, if Sn < na, where a is some number satisfying px <a <p2. The problem is to find an advantageous value for the threshold a. By (9.21), (9.22) P[Sn>na\Hl]=e-nK(a-p>\ where the notation indicates that the probability is calculated for ρ =ρ,—that is, under the assumption of #,. By symmetry, (9 23) P[Sn<na\H2]=e-"K^^\ The left sides of (9.22) and (9 2?) are the probabilities of erroneously deciding in favor of #2 when #, is, in fact, true and of erroneously deciding in favor of #, when H2 is, in fact, true—the probabilities describing the level and power of the test. Suppose a is chosen so that K(a,P\)-=K{a,p2), which makes the two error probabilities approximately equal. This constraint gives for a a linear equation with solution (9.24) * = «(P,,P2)=log(p2M) + log(9i/92), where q-, = l -p; The common error probability is approximately β~ηΚ<·α·Ρθ for this value of a, and so the larger K(a,px) is, the easier it is to distinguish statistically between pt and p2. Although K(a(ph p2), p,) is a complicated function, it has a simple approximation for p, near p2. As χ -> 0, log(l +x) =x - \хг + 0(x3). Using this in the definition of К and collecting terms gives (9.25) K(p+x,p) = ^+0(x3), x^O. Fix p, =p, and let p2 —p + t; (9.24) becomes a function ψ(ί) of t, and expanding the logarithms gives (9.26) ιρ(ί)=ρ + \ί + θ(ίζ), г->0, after some reductions. Finally, (9.25) and (9.26) together imply that (9.27) К(ф(0,р) = ^+О(Р), *-0. In distinguishing pt=p from p2=p + t for small t, if α is chosen to equalize the two error probabilities, then their common value is about е~п,г/Ърч. For t fixed, the nearer ρ is to 2, the larger this probability is and the more difficult it is to distinguish ρ from p+t. As an example, compare ρ = Λ with ρ = .5. Now .36n/2/8UX-^ = иг2/8(.5)(.5). With a sample only 36 percent as large, Л can therefore be distinguished from Л+1 with about the same precision as .5 can be distinguished from .5+г.
SECTION 9. LARGE DEVIATIONS AND THE ITERATED LOGARITHM 153 The Law of the Iterated Logarithm The analysis of the rate at which S„/n approaches the mean depends on the following variant of the theorem on large deviations. Theorem 9.4. Let Sn = Xl + ■ · · +Xn, where the Xn are independent and identically distributed simple random variables with mean 0 and variance 1. // an are constants satisfying (9.28) e„-»oo, -^-»0, then (9.29) P[Sn>an^\ =e-«S> + W/2 for a sequence ζη going to 0. Proof. Put Yn --= S„ - ajn~ = Znk=l{Xk- aj{n~). Then E[Yn] < 0. Since λ', has mean 0 and variance 1, Р[Х{>0]>0, and it follows by (9.28) that P[XX > an/Jn] > 0 for η sufficiently large, in which case P[Yn > 0] > Ρ"[Α'1-β„/^">01>0. Thus Theorem 9.1 applies to Yn for all large enough n. Let Mn(t),pn,rn, and Z„ be associated with Yn as in the theorem. If m(t) and c(t) are the moment and cumulant generating functions of the Xn, then Mn(t) is the nth power of the moment generating function e~'an/^"m{t) of Xi - an/4n , and so Yn has cumulant generating function (9.30) Cn(t)=-tan^^nc{t). Since τ„ is the unique minimum of Cn(t), and since Cn(t)= -ajn + nc'(t), τ„ is determined by the equation с'(т„) = an/4n . Since Xx has mean 0 and variance 1, it follows by (9.6) that (9.31) c(0) = c'(0) = 0, c"(0) = l. Now c'(t) is nondecreasing because c(t) is convex, and since с'(т„) = an/4n goes to 0, т„ must therefore go to 0 as well and must in fact be 0{an/Jn). By the second-order mean-value theorem for c'(t), an/4n = с'(т„) = тя + 0(т„2), from which follows a„ „ /a2 (9.32) - T» = t+°W
154 PROBABILITY By the third-order mean-value theorem for c(t), logpn = Cn(Tn)= -tnajn~ +nc(r„) Applying (9.32) gives (9.33) log p„=-\al + o(a2„). Now (see (9.14)) Z„ has moment generating function Μη(τ,, + t)/pn and (see (9.30)) cumulant generating function Dn{t) = Сп{тп + t) - log p„ = -(т„ + t)anJn + nc(t + т„) - log p„. The mean of Z„ is £>ή(0) = 0· Its variance s;j is D;'(0); by (9.31), this is (9.34) s2n = nc"{rn) = n(c"(0) + 0(r„)) = n(l + o(l)). The fourth cumulant of Z„ is DJ"(0) = nc""(T„) = O(n). By the formula (9.9) relating moments and cumulants (applicable because E[Zn] = 0), E[Z*] = 3s^ +Д,"'(0). Therefore, E[Z^]/s4n-*3, and it follows by Theorem 9.2 that there exists an a such that />[Z„ > 0] > a > 0 for all sufficiently large n. By Theorem 9.1, P[Yn > 0] = pne~K with 0 < θη < τ„5„α_1 + loga. By (9.28), (9.32), and (9.34), θ„ = 0(an) = o{a2r), and it follows by (9.33) that P[Yn>0] = e-a"ii+o(1))/2. Ш The law of the iterated logarithm is this: Theorem 9.5. Let Sn=Xi+ ··· +X„, where the Xn are independent, identically distributed simple random variables with mean 0 and variance 1. Then (9.35) lim sup . " =■ = l „ V2«loglog« = 1. Equivalent to (9.35) is the assertion that for positive e (9.36) P[Sn>(l +e)V2nloglogni.o.] =0 and (9.37) /5[5„>(l-e)y2nloglogrti.o.] =1. The set in (9.35) is, in fact, the intersection over positive rational e of the sets in (9.37) minus the union over positive rational e of the sets in (9.36).
SECTION 9. LARGE DEVIATIONS AND THE ITERATED LOGARITHM 155 The idea of the proof is this. Write (9.38) φ(η) = ]/2n log log ίΓ If A± = [5„ > (1 ± е)ф(п)], then by (9.29), P(A±) is near (log «)"(1 ±6)2. If nk increases exponentially, say nk ~ вк for θ > 1, then P(A *) is of the order k'{i±() . Now Lkk~° ±e) converges if the sign is + and diverges if the sign is -. It will follow by the first Borel-Cantelli lemma that there is probability 0 that A* occurs for infinitely many k. In providing (9.36), an extra argument is required to get around the fact that the A* for пФпк must also be accounted for (this requires choosing θ near 1). If the A~ were independent, it would follow by the second Borel-Cantelli lemma that with probability 1, A~ occurs for infinitely many k, which would in turn imply (9.37). Ал extra argument is required to get around the fact that the An are dependent (this requires choosing θ large). For the proof of (9.36) a preliminary result is needed. Put Mk = max{50,5,,..., Sk), where 50 = 0. Theorem 9.6. // the Xk are independent simple random variables with mean 0 and variance 1, then for a > v2. (9.39) >a <2P Proof. If As = [My_,, < a\fn <Mf], then 4η- >a <P n-l ^<a-yf2 Since 5„ - Sj has variance η -), it follows by independence and Chebyshev's inequality that the probability in the sum is at most PlA.n " ' >y/2 \ [ vn = P{Aj)P \n Since и"Г/Лус[М„>а\/«], *Цл1)Чг**рМ· ^а <P Vn + \p ^a
156 PROBABILITY Proof of (9.36). Given e, choose 0 so that 0 > 1 but 02 < I + e. Let nk = [вк\ and xk = 0(21oglogrt,)1/2. By (9.29) and (9.39), M„ ^'* < 2exp (xk-Jl)\\+tk) where £A -»0. The negative of the exponent is asymptotically 02log/c and hence for large к exceeds 0 log k, so that M„ >Xi < ,0 · Since 0 > 1, it follows by the first Borel-Cantelli lemma that there is probability 0 that (see (9.38)) (9.40) М„к>вф(пк) for infinitely many k. Suppose that nk_l<n <nk and that (9.41) S„>(1+£)(/>(«). Now ф(п) >ф(пк_1)~в~1/2ф(пк); hence, by the choice of Θ, (1 +е)ф(п)> вф(пк) if к is large enough. Thus for sufficiently large k, (9.41) implies (9.40) (if nk_l <n <nk), and there is therefore proability 0 that (9.41) holds for infinitely many n. ■ Proof of (9.37). Given e, choose an integer θ so large that 30"1/2 <e. Take η k = вк. Now nk - η k _, -*«>, and (9.29) applies with n=nk- nk_l and an =xk/ynk ~ nk-\ > where xk = (1 - в~1)ф(пк). It follows that Ψ-.-V,***]-'IV-.-,***]-«* 2 nk-nk_l (1+&) where £ft -» 0. The negative of the exponent is asymptotically (1 - 0_1)log к and so for large к is less than log k, in which case P[Sn - Sn >*J >k~l. The events here being independent, it follows by the second Borel-Cantelli lemma that with probability 1, 5„ - 5 >xk for infinitely many k. On the other hand, by (9.36) applied to (-X$, there is probability 1 that -S < 2ф(пк_1) < 2в~1/2ф(пк) for all but finitely many k. These two inequalities give 5„t >xk - 2в~1/2ф(пк) > (1 - е)ф(пк), the last inequality because of the choice of 0. ■ That completes the proof of Theorem 9.5.
SECTION 9. LARGE DEVIATIONS AND THE ITERATED LOGARITHM 157 PROBLEMS 9.1. Prove (6.2) by using (9.9) and the fact that cumulants add in the presence of independence. 9.2. In the Bernoulli case, (9.21) gives P[Sn>np+xn] = txp -ηκ(ρ+^,ρ)(1+ο(1)) , where ρ < a < 1 and xn = n(a -p). Theorem 9.4 gives P[Sn>np+xn]=e.xp x2 -^-=-(1+0(1)) 2npQ v v 'I where xn=any/npq. Resolve the apparent discrepancy. Use (9.25) to compare the two expressions in case xn/r. is small. See Problem 27.17. 9.3. Relabel the binomial parameter ρ as θ =/(ρ), where / is increasing and continuously differentiable. Show by (9.27) that the distinguishability of θ from 0 + ΔΘ, as measured by K, is (Δθ)2/8ρ(1 -pXf'(p))2 + Ο(Δ0)3. The leading coefficient is independent of θ if f(p) = arcsiny'p . 9.4. From (9.35) and the same result for {—X„}, together with the uniform bounded- ness of the Xn, deduce that with probability 1 the set of limit points of the sequence {5„(2nloglogn)_1/2} is the closed interval from -1 to +1. 9.5. Τ Suppose X„ takes the values +1 with probability \ each, and show that P[Sn = 0 i.o.] = 1. (This gives still another proof of the persistence of symmetric random walk on the line (Example 8.6).) Show more generally that, if the Xn are bounded by M, then P[\S„\ <M i.o.] = 1. 9.6. Weakened versions of (9.36) are quite easy to prove. By a fourth- moment argument (see (6.2)), show that />[5„>n3/4(logn)(1 + °/4i.o.] = 0. Use (9.29) to give a simple proof that P[Sn > On log n)1/2i.o.] = 0. 9.7. Show that (9.35) is true if 5„ is replaced by |5„| or maxkunSk or maxkun\Sk\.
CHAPTER 2 Measure SECTION 10. GENERAL MEASURES Lebesgue measure on the unit interval was central to the ideas in Chapter 1. Lebesgue measure on the entire real line is important in probability as well as in analysis generally, and a uniform treatment of this and other examples requires a notion of measure for which infinite values are possible. The present chapter extends the ideas of Sections 2 and 3 to this more general setting. Classes of Sets The σ-field of Borel sets in (0, l] played an essential role in Chapter 1, and it is necessary to construct the analogous classes for the entire real line and for A;-dimensional Euclidean space. Example 10.1. Let χ = (xl,...,xk) be the generic point of Euclidean /c-space Rk. The bounded rectangles (10.1) [x = (xl,...,xk):ai<xi<bi,i=l,...,k] will play in Rk the role intervals (a, b] played in (0,1]. Let &k be the σ-field generated by these rectangles. This is the analogue of the class & of Borel sets in (0,1]; see Example 2.6. The elements of &k are the k- dimensional Borel sets. For к = 1 they are also called the linear Borel sets. Call the rectangle (10.1) rational if the a, and £>, are all rational. If G is an open set in Rk and yeG, then there is a rational rectangle Ay such that ye/4ycG. But then G= Ό y^GAy, and since there are only countably many rational rectangles, this is a countable union. Thus 3lk contains the open sets. Since a closed set has open complement, &k also contains the closed sets. Just as QB contains all the sets in (0,1] that actually arise in 158
SECTION 10. GENERAL MEASURES 159 ordinary analysis and probability theory, 3$k contains all the sets in Rh that actually arise. The σ-field &k is generated by subclasses other than the class of rectangles. If An is the x-set where a; <x; < b; + n~l, i = 1,..., k, then An is open and (10.1) is Π nAn. Thus 32k is generated by the open sets. Similarly, it is generated by the closed sets. Now an open set is a countable union of rational rectangles. Therefore, the {countable) class of rational rectangles generates &k. ■ The σ-field & on the line Rl is by definition generated by the finite intervals. The σ-field 98 in (0,1] is generated by the subintervals of (0,1]. The question naturally arises whether the elements of 98 are the elements of & that happen to lie inside (0,1], and the answer is yes. If sai is a class of sets in a space Ω and Ω0 is a subset of Ω, let stff\ il0 = [An Ω0·. A e s/\. Theorem 10.1. (i) If S^ is a σ-field in Ω, then &T\ Ω0 is a σ-field in Cl0. (ii) // &f generates the σ-field & in Ω, then ^ΠΩ0 generates the σ-field &Τ\ Ω0 in Ω0: σ(^η Ω0) = σ(^) Π Ω0. Proof. Of course Ω0 = Ω η Ω0 lies in &Ή Ω0. If Β lies in ^n Ω0, so that Β = Α ΠΩ0 for an A <= ^", then Ω0 -В = (Ω - Α)η Ω0 lies in Jnfi0. If Bn lies in &T\ Ω0 for all n, so that Bn=Ann Ω0 for an An e &~, then υ„β„ = (υ„/4„)ηΩ0 liesin^r^0. Hence part (i). Let S^ be the σ-field sz?C\ Ω0 generates in Ω0. Since s/d Ω0 с 3*~Γ\ Ω0 and Fn Ω0 is a σ-field by part (i), &~0 с ^"n Ω0. Now S^n Ω0 с $*~0 will follow if it is shown that Ле^ implies ΑΓ\Ω,0 e ^q, or, to put it another way, if it is shown that & is contained in d?= [А с Ω: Α η Ω0 e ^0]. Since Ле/ implies that ЛпП0 lies in sfc\ Ω0 and hence in &~C\ Ω0, it follows that .ofc ^. It is therefore enough to show that £ is a σ-field in Ω. Since Ω Π Ω0 = Ω0 lies in 3*~0, it follows that Ω<= d? If Α <ξ d?, then (П-Л)пО0 = О0-(Лп Ω0) lies in ^0 and hence Ω-Ле/ If Л„е^ for all n, then (и„Л„)П Ω0 = υ„(Λ„ηΩ0) lies in ^o and hence \JnAn<E^. ■ If Ωο^^, then Fnft0=[^: ЛсП0, Л е ^"]. If Ω = Λ\ Ω0 = (0,1], and &r=£%\ and if J^ is the class of finite intervals on the line, then s/C\ Ω0 is the class of subintervals of (0,1], and 9$ = σ(^Γι Ω0) is given by (10.2) &=[A: Лс(0,1], /lej1]. A subset of (0, l] is thus a Borel set (lies in 9S) if and only if it is a linear Borel set (lies in &l), and the distinction in terminology can be dropped.
160 MEASURE Conventions Involving °° Measures assume values in the set [0,°°] consisting of the ordinary nonnegative reals and the special value <», and some arithmetic conventions are called for. For x, у е [Ο,οο], χ <y means that у = °° or else χ and у are finite (that is, are ordinary real numbers) and χ <y holds in the usual sense. Similarly, χ <y means that у = 0° and χ is finite or else χ and у are both finite and χ <y holds in the usual sense. For a finite or infinite sequence x, xv x2,··· in [Ο,οο], (юз) *=Σ** к means that either (i) χ = oo and xk = oo for some k, or (ii) χ = oo and jcfc < oo for all к and Lkxk is an ordinary divergent infinite series, or (iii) χ <<χ> and xk<°° for all к and (10.3) holds in the usual sense for Lkxk an ordinary finite sum or convergent infinite series. By these conventions and Dirichlet's theorem [A26], the order of summation in (10.3) has no effect on the sum. For an infinite sequence x, *,, x2,... in [Ο,οο], (10.4) xkU means in the first place that xk<xk+l<x and in the second place that either (i) χ < oo and there is convergence in the usual sense, or (ii) xk =oo for some k, or (iii) χ = oo and the xk are finite reals converging to infinity in the usual sense. Measures A set function μ on a field ^ in Ω is a measure if it satisfies these conditions: (i) /х(Л)е [Ο,οο] for /|еУ; (ii) /t(0) = 0; (iii) if Л,, Л2,... is a disjoint sequence of ^sets and if \J1=lAk<E. &, then (see (10.3)) (oo \ oo The measure μ is finite or infinite as μ(Ω)<°° or μ.(Ω) = οο; 4t is a probability measure if μ,(Ω) = 1, as in Chapter 1. If Ω = Л, UЛ2 U ... for some finite or countable sequence of J^sets satisfying μ{ΑΙζ) < oo, then μ. is σ-finite. The significance of this concept will be seen later. A finite measure is by definition σ-finite; a σ-finite measure may be finite or infinite. If si is a subclass of &, then μ is σ-finite on stf if Ω = U ftЛл for some finite or infinite sequence of .onsets satisfying μ(Αι<) < oo. It is not required that these sets Ak be disjoint. Note that if Ω is not a finite or countable union of .onsets, then no measure can be σ-finite on si/. It
SECTION 10. GENERAL MEASURES 161 is important to understand that σ-finiteness is a joint property of the space Ω, the measure μ, and the class stf. If μ is a measure on a σ-field & in Ω, the triple (Ω, $*~,μ) is a measure space. (This term is not used if F is merely a field.) It is an infinite, a σ-finite, a finite, or a probability measure space according as μ has the corresponding property. If μ( Ac) = 0 for an ^set Л, then A is a support of /x, and /x is concentrated on Л. For a finite measure, Л is a support if and only if μ(Α) = μ(ίϊ). The pair (ft, $*~) itself is a measurable space if 5*" is a σ-field in il. To say that μ is a measure on (il, 5*~) indicates clearly both the space and the class of sets involved. As in the case of probability measures, (iii) above is the condition of countable additivity, and it implies finite additivity: If A,,..., An are disjoint J^sets, then m(lM*]= t^(Ak). As in the case of probability measures, if this holds for л = 2, then it extends inductively to all n. Example 10.2. A measure μ on (il, F) is discrete if there are finitely or countably many points ω, in Ω and masses mi in [0,oo] such that μ(Α) = Σω <EAmi f°r Л е ^". It is an infinite, a finite, or a probability measure as L!m! diverges, or converges, or converges to 1; the last case was treated in Example 2.9. If F contains each singleton {ω,}, then μ is σ-finite if and only if mi < oo for all i ■ Example 10.3. Let F be the σ-field of all subsets of an arbitrary Ω, and let μ(Α) be the number of points in Л, where μ(Α) = oo if Л is not finite. This μ is counting measure; it is finite if and only if Ω is finite, and is σ-finite if and only if Ω is countable. Even if &~ does not contain every subset of Ω, counting measure is well defined on &. ■ Example 10.4. Specifying a measure includes specifying its domain. If μ is a measure on a field F and FQ is a field contained in F, then the restriction μ0 of μ to 5^ is also a measure. Although often denoted by the same symbol, μ0 is really a different measure from μ unless $*~0 = F. Its properties may be different: If μ is counting measure on the σ-field F of all subsets of a countably infinite Ω, then μ is σ-finite, but its restriction to the σ-field Fo = {0, Ω} is not σ-finite. ■
162 MEASURE Certain properties of probability measures carry over immediately to the general case. First, μ is monotone: μ(Α) < μ(Β) if Л с β. This is derived, just like its special case (2.5), from μ(Α) + μ(Β -Л) = μ(Β). But it is possible to go on and write μ(Β -Α) = μ(Β) -μ(Α) only if μ(Β)<χ. If μ(Β) = οο and μ( A) < oo, then μ(Β - A) = oo; but for every a e [0, oo] there are cases where μ(Α) = μ{Β) = oo and μ(Β -Α) = α. The inclusion-exclusion formula (2.9) also carries over without change to ^sets of finite measure: (10.5) μ[ (J АЛ = Ε/*(Λ·) - Σμ(Α,ПА,) +■■■ + (-1)" + 1/х(Л1П---ПЛ„). The proof of finite subadditwity also goes through just as before: μ1 \JaA< ΣΜ4); \k=\ I k=\ here the Ak need not have finite measure. Theorem 10.2. Let μ be a measure on a field 5*~. (i) Continuity from below: If An and A lie in & and An\ Л, then* μ(Αη)\μ(Α). (ii) Continuity from above: If An and A lie in & and An I A, and if /х(Л,)<оо, then μ(Αη)Ιμ(Α). (iii) Countable subadditwity: If Av Л2,... and \J1=iAk lie in &~, then (oo \ oo \jAk\< Σμ(Α,). (iv) If μ is σ-finite on &, then & cannot contain an uncountable, disjoint collection of sets of positive μ-measure. Proof. The proofs of (i) and (iii) are exactly as for the corresponding parts of Theorem 2.1. The same is essentially true of (ii): If /х(Л,)<°°, subtraction is possible and Л, - Л„ Τ Αχ- A implies that /х(Л,) -/х(Л„) = μ(Αι-Αη)ϊμ(Αι-Α) = μ(Αι)-μ(Α). There remains (iv). Let [Ββ: ίεθ] be a disjoint collection of J^sets satisfying μ(Ββ)> 0. Consider an ^set Л for which μ(Α) < oo. If вь...,в„ are distinct indices satisfying μ(А Г\Вв)>е>0, then ne < Σ"=]μ(Α ηΒβ.) < μ(Α), and so n < μ(Α)/β. Thus the index set [θ: μ(Α Γ\ΒΘ) > e] is finite, +See (10.4).
SECTION 10. GENERAL MEASURES 163 and hence (take the union over positive rational e) [θ: μ(ΑΓ\Βθ)>0] is countable. Since μ is σ-finite, il=\JkAk for some finite or countable sequence of J^sets Ak satisfying /x(/4fc) < oo. But then %k = [Θ: ^{АкСЛВе) > 0] is countable for each k. Since μ(Ββ) > 0, there is а к for which μ{Αίι Π Βθ) > 0, and so Θ = \Jk®k: Θ is indeed countable. ■ Uniqueness According to Theorem 3.3, probability measures agreeing on a 77-system &> agree on σ(£Ρ). There is an extension to the general case. Theorem 10.3. Suppose that μ1 and μ2 are measures on σ(£Ρ), where & is a π-system, and suppose they are σ-finite on &. If μλ and μ2 agree on &, then they agree on σ(&). Proof. Suppose that fie^ and μ^Β) = μ2(Β) < °°, and let ^fB be the class of sets A in σ(&) for which μλ(Β η Α) = μ2(ΒηΑ). Then ^fB is a Λ-system containing &> and hence (Theorem 3.2) containing σ(&). By σ-finiteness there exist ^sets Bk satisfying Ω = Uk Bk and μϊ(ΒΙζ) = μ2(Β/<.) < oo. By the inclusion-exclusion formula (10.5), μα[ϋ(ΒίΠΑ)\= Σ ИМПЛ)- Σ μα(ΒίηΒ}ηΑ)+··· for a = 1,2 and all п. Since &> is a ir-system containing the Bt, it contains the BjDBj, and so on. For each σί^-βεί A, the terms on the right above are therefore the same for a = 1 as for a = 2. The left side is then the same for a = 1 as for a = 2; letting η -* oo gives μ.,(/4) = μ2( А). Ш Theorem 10.4. Suppose μ, and μ2 are finite measures on σ(£Ρ), where & is a тг-system and Ω is a finite or countable union of sets in &. If μχ and μ2 agree on &, then they agree on σ(£Ρ). Proof. By hypothesis, Ω = UkBk for ^sets Bk, and of course Va(Bk) ^ μα(Ω) < oo, cr=l,2. Thus μ, and μ2 are σ-finite on &>, and Theorem 10.3 applies. ■ Example 10.5. If &> consists of the empty set alone, then it is a 77-system and σ{&>) = {0, Ω}. Any two finite measures agree on &, but of course they need not agree on σ(&). Theorem 10.4 does not apply in this case, because Ω is not a countable union of sets in &>. For the same reason, no measure on σ(&) is σ-finite on @>, and hence Theorem 10.3 does not apply. ■
164 MEASURE Example 10.6. Suppose that (D,,Sr) = (Rl,&1) and &> consists of the half-infinite intervals (-00, χ]. By Theorem 10.4, two finite measures on & that agree on & also agree on ^. The ^sets of finite measure required in the definition of σ-finiteness cannot in this example be made disjoint. ■ Example 10.7. If a measure on (Ω, &) is σ-finite on a subfield 3^ of 5^ then Ω = (Jk Bk for disjoint 5^,-sets β^ of finite measure: if they are not disjoint, replace Bk by Bk Π B{ ■ ■ ■ Π β£_,. ■ The proof of Theorem 10.3 simplifies slightly if Ω = U^ Bk for disjoint ^sets with μ.,(βΑ) = μ.2(β*) <°°, because additivity itself can be used in place of the inclusion-exclusion formula. PROBLEMS 10.1. Show that if conditions (i) and (iii) in the definition of measure hold, and if μ(Α) < oo for some /1 e JF, then condition (ii) holds. 10.2. On the σ-field of all subsets of Ω = {1,2,...} put μ(Α) = T.k^A2"k if A is finite and μ(/0 = °° otherwise. Is μ finitely additive? Countably additive7 10.3. (a) In connection with Theorem 10.2(ii), show that if An J, A and μ(/1Α) < oo for some k, then μ(Αη)1 μ(Α). (b) Find an example in which An J, Α, μ(Απ) = oo, and A = 0. 10.4. The natural generalization of (4.9) is (10.6) μ liminf/ln) < lim Ίηϊμ(Αη) <lim supμ(Αη) <μ\\ίτη supAn\. π π Show that the left-hand inequality always holds. Show that the right-hand inequality holds if μ(υ* гп Ак) < <χ> for some η but can fail otherwise. 10.5. 3.10 A measure space (Ω, JF, μ) is complete if А с В, fie^ and μ(Β)= Ο together imply that A e &— the definition is just as in the probability case. Use the ideas of Problem 3.10 to construct a complete measure space (Ω, 5Γ+, μ+) such that JFc ^ and μ and μ+ agree on У. 10.6. The condition in Theorem 10.2(iv) essentially characterizes σ-finiteness. (a) Suppose that (Ω, JF, μ) has no "infinite atoms," in the sense that for every A in У-, if μ(Α) = oo, then there is in & а В such that В cA and 0 <μ(Β) <οο. Show that if JF does not contain an uncountable, disjoint collection of sets of positive measure, then μ is σ-finite. (Use Zorn's lemma.) (b) Show by example that this is false without the condition that there are no "infinite atoms." 10.7. Example 10.5 shows that Theorem 10.3 fails without the σ-finiteness condition. Construct other examples of this kind.
SECTION 11. OUTER MEASURE 165 SECTION 11. OUTER MEASURE Outer Measure An outer measure is a set function μ* that is denned for all subsets of a space Ω and has these four properties: (i) μ*(Α) e [0, oo] for every AcO,; (ii) μ*(0) = 0; (iii) μ* is monotone: A <zB implies μ*(Α) <μ*(Β); (iv) μ* is countably subadditive: /x*(U„ An) < Σ„μ*(Απ). The set function P* defined by (3.1) is an example, one which generalizes: Example 11.1. Let ρ be a set function on a class srf in Ω. Assume that 0e^ and p(0) = O, and that p(A)<E [0,oo] for Лё/; р and j/ are otherwise arbitrary. Put (11.1) μ*(Α) = ϊα£Σρ(Α„), where the infimum extends over all finite and countable coverings of A by .onsets An. If no such covering exists, take μ*(Α) = oo in accordance with the convention that the infimum over an empty set is oo. That μ* satisfies (i), (ii), and (iii) is clear. If μ*(Αη) = oo for some n, then obviously /x*(U„ An) < Σπμ*(Απ). Otherwise, cover each An by .onsets Bnk satisfying EkP(Bnk) < μ*(Αη) + e/2"; then ^*(U„ An) < Ln,kp(Bnk) < Σημ*(Αη) + €. Thus μ* is an outer measure. ■ Define A to be μ*-measurable if (11.2) μ*(ΑηΕ)+μ*(Α':ηΕ)=μ*(Ε) for every E. This is the general version of the definition (3.4) used in Section 3. By subadditivity it is equivalent to (11.3) μ*(ΑηΕ)+μ*(Α'ΠΕ)<μ*(Ε). Denote by Λ(μ*) the class of /x*-measurable sets. The extension property for probability measures in Theorem 3.1 was proved by a sequence of lemmas the first three of which carry over directly to the case of the general outer measure: If P* is replaced by μ* and JK by *Ж/х*) at each occurrence, the proofs hold word for word, symbol for symbol.
166 MEASURE In particular, an examination of the arguments shows that oo as a possible value for μ* does not require any changes. Lemma 3 in Section 3 becomes this: Theorem 11.1. If μ* is an outer measure, then ^(μ*) is a σ-field, and μ* restricted to «^(μ*) is a measure. This will be used to prove an extension theorem, but it has other applications as well. Extension Theorem 11.2. A measure on a field has an extension to the generated σ-field. If the original measure on the field is σ-finite, then it follows by Theorem 10.3 that the extension is unique. Theorem 11.2 can be deduced from Theorem 11.1 by the arguments used in the proof of Theorem 3.1.f It is unnecessary to retrace the steps, however, because the ideas will appear in stronger form in the proof of the next result, which generalizes Theorem 11.2. Define a class &f of subsets of Ω to be a semiring if (i) 0 e #f; (ii) А, В е of implies AnB<Esf; (iii) if A, B<es/ and A <zB, then there exist disjoint .onsets C,,...,C„ such that B-A={Jnk = lCk. The class of finite intervals in D, = Rl and the class of subintervals of Ω = (0,1] are the simplest examples of semirings. Note that a semiring need not contain Ω. Theorem 11.3. Suppose that μ is a set function on a semiring &f. Suppose that μ has values in [0, oo], that μ(0) = 0, and that μ is finitely additive and countably subadditive. Then μ extends to a measure on a(&f). This contains Theorem 11.2, because the conditions are all satisfied if stf is a field and μ is a measure on it. If Ω = (Jk Ak for a sequence of .onsets satisfying μ(ΑΙζ) < oo, then it follows by Theorem 10.3 that the extension is unique. See also Problem 11.1.
SECTION 11. OUTER MEASURE 167 Proof. If A, B, and the Ck are related as in condition (iii) in the definition of semiring, then by finite additivity μ(Β) = μ(Α) + L"k = ^(Ck) > μ(Α). Thus μ is monotone. Define an outer measure μ* by (11.1) for ρ = μ: (11-4) μ*(Α)=ΐηίΣμ(Αη), η the infimum extending over coverings of A by .onsets. The first step is to show that ,afc^(/x*). Suppose that /jeu If μ*(Ε) = oo, then (11.3) holds trivially. If μ*(Ε) < oo, for given e choose .onsets An such that Ε с (J„ An and Σημ(Αη) <μ*(Ε) + e. Since of is a semiring, Bn=AnAn lies in sf and AcnAn=An-Bn has the form \J™ixC„k for disjoint .onsets Cnk. Note that An= BnU (JknilCnk, where the union is disjoint, and that Α Π Ε с U„ β„ and Лс Π £ с U„ U™=i С„л. By the definition of /x* and the assumed finite additivity of μ, и, μ*(ΑηΕ)+μ*(Α<ΠΕ)<Σμ(Βη)+Σ Σ и(Слк) η π к = \ = Σν(Αη)<μ*(Ε)+€. Π Since € is arbitrary, (11.3) follows. Thus .я^сДд*). The next step is to show that μ* and μ agree on ssf. If А с (J„ /4„ for .onsets Л and Λη, then by the assumed countable subadditivity of μ and the monotonicity established above, μ(Α) < Σ„μ(Α ΠΑΠ) < Σ„μ(Απ). Therefore, Ле# implies that μ(Α) <μ*(Α) and hence, since the reverse inequality is an immediate consequence of (11.4), μ(Α) = μ*(Α). Thus μ* agrees with μ on stf. Since £/<ζΛ(μ*) and Λ(μ*) is a σ-field (Theorem 11.1), j/Cff(j/)c/(M*)c2" Since μ* is countably additive when restricted to Λ(μ*) (Theorem 11.1 again), μ* further restricted to σ{·ΰ/) is an extension of μ on of, as required. Example 11.2. For srf take the semiring of subintervals of Ω = (0,1] (together with the empty set). For μ take length Λ: A(a,b] = b - a. The finite additivity and countable subadditivity of Λ follow by Theorem 1.3.f By Theorem 11.3, Λ extends to a measure on the class a(szf) = (8 of Borel sets in (0,1]. ■ On a field, countable additivity implies countable subadditivity, and λ is in fact countably additive on si/— but st/ is merely a semiring. Hence the separate consideration of additivity and subadditivity; but see Problem 11.2.
168 MEASURE This gives a second construction of Lebesgue measure in the unit interval. In the first construction Λ was extended first from the class of intervals to the field SSQ of finite disjoint unions of intervals (see Theorem 2.2) and then by Theorem 11.2 (in its special form Theorem 3.1) from 6$ϋ to & = σ(&0). Using Theorem 11.3 instead of Theorem 11.2 effects a slight economy, since the extension then goes from &f directly to & without the intermediate stop at ί$0, and the arguments involving (2.13) and (2.14) become unnecessary. Example 11.3. In Theorem 11.3 take for &/ the semiring of finite intervals on the real line Rl, and consider Xl(a,b] = b -a. The arguments for Theorem 1.3 in no way require that the (finite) intervals in question be contained in (0,1], and so A, is finitely additive and countably subadditive on this class &/. Hence A, extends to the σ-field ^" of linear Borel sets, which is by definition generated by si. This defines Lebesgue measure A, over the whole real line. ■ A subset of (0,1] lies in & if and only if it lies in &l (see (10.2)). Now Л,(Л) = Л(Л) for subintervals A of (0.1], and it follows by uniqueness (Theorem 3.3) that k^A) = λ(Α) for all A in S8. Thus there is no inconsistency in dropping A! and using Λ to denote Lebesgue measure on &y as well as on 98. Example 11.4. The class of bounded rectangles in Rk is a semiring, a fact needed in the next section. Suppose that A = [x: jc, e /15 / < k] and В = [*: jc, e/(., i <k] are nonempty rectangles, the /, and /, being finite intervals. If А с В, then /,-c/,., so that /,·-/,· is a disjoint union /,'U/" of intervals (possibly empty). Consider the 3* disjoint rectangles [x: jc, e Ц, i <k], where for each /, Ц is /, or // or /". One of these rectangles is A itself, and В -А is the union of the others. The rectangles thus form a semiring. ■ An Approximation Theorem If &f is a semiring, then by Theorem 10.3 a measure on σ(&/) is determined by its values on &/ if it is σ-finite there. Theorem 11.4 shows more explicitly how the measure of a a(&f)-set can be approximated by the measures of .onsets. Lemma 1. If A,Ax,...,An are sets in a semiring &f, then there are disjoint stf-sets C,,..., Cm such that ΑηΑ\η ■■■ n^^=C,U ··· UCm. Proof. The case η = 1 follows from the definition of semiring applied to А Г\А\=А -(ADAi). If the result holds for n, then A nA\n ·· · nAcn + l = UJLi(Cy η Acn + l); apply the case η = 1 to each set in the union. ■
SECTION 11. OUTER MEASURE 169 Theorem 11.4. Suppose that si is a semiring, μ is a measure on &= σ( si), and μ is σ-finite on si. (i) //fie^ and e > 0, there exists a finite or infinite disjoint sequence Ax, A2,... of si-sets such that В a \Jk Ak and μ(((Jk Ak)- B)<e. (ii) If B<e $r and e > 0, and if μ(Β) < oo, then there exists a finite disjoint sequence A{,...,An of subsets such that /x(Z?a(|J£ = 1 Ak)) <e. Proof. Return to the proof of Theorem 11.3. If μ* is the outer measure defined by (11.4), then ^сЖ/х*) and μ* agrees with μ on si, as was shown. Since μ* restricted to & is a measure, it follows by Theorem 10.3 that μ* agrees with μ on & as well. Suppose now that В lies in & and μ(Β) = μ*{Β) < oo. There exist .onsets Ak such that В с \Jk Ak and μ(Ό/ί Ak) < Σkμ(Ak) <μ(Β) + e; but then μ((Όι<: Ak)-B) <€. To make the sequence {Ak} disjoint, replace Ak by АкГ\А\Г\ ··· Г\Аск_{, by Lemma 1, each of these sets is a finite disjoint union of sets in si. Next suppose that В lies in &~ and μ(Β) = μ*(Β) = oo. By σ-finiteness there exist .assets Cm such that Ω = Um Cm and /x(Cm) < oo. By what has just been shown, there exist .onsets Amk such that В nCma(JkAmk and μ((ΐΛ Amk) - (Β Π CJ)<e/2m. The sets Amk taken all together provide a sequence AX,A2,... of .onsets satisfying B<z (Jk Ak and μ((Ό/ζΑΙζ)-Β) <e. As before, the Ak can be made disjoint. To prove part (ii), consider the Ak of part (i). If В has finite measure, so has A = (Jk Ak, and hence by continuity from above (Theorem 10.2(ii)), μ(Α - (Jk<nAk)<e for some n. But then /x(£a(|J£ = 1 Ak)) < 2e. Ш If, for example, β is a linear Borel set of finite Lebesgue measure, then Α(β δ (U"k = iAk)) < ε for some disjoint collection of finite intervals Αγ,..., Αη. Corollary 1. // μ is a finite measure on a σ-field &generated by afield ^0, then for each &-set A and each positive e there is an S^-set В such that μ(Α δ Β) < е. Proof. This is of course an immediate consequence of part (ii) of the theorem, but there is a simple direct argument. Let ^ be the class of ^sets with the required property. Since AC^BC =A *B, ^ is closed under complementation. If A = (J„ An, where An&^, given e choose n0 so that μ(Α— Unan An)<e, and then choose ^-sets Bn, n<n0, so that μ(Απ*Βπ)<ε/η0. Since i\Jn£nuA„)*(\Jn£„uB„)c ^n<,n№n Δ #Д the ^o"set S=U„S„ Bn satisfies μ(Α δ Β) <2e. Of course &0 с J; since ^ is a σ-field, JFc <£, as required. a Corollary 2. Suppose that si is a semiring, Ω is a countable union of sisets, and μ-ι,μι are measures on S^=a(si). If μ[(Α)<μ2(Α)<<χ> for A e si, then μ^Β)< μ2(Β) for Be Sr.
170 MEASURE Proof. Since μ2 is σ-finite on srf, the theorem applies. If μ2(Β)<°°, choose disjoint .assets Ak such that β с \JkAk and Σ/ιμ2(ΑΙι) <μ2(Β) + e. Then μ{(Β)< Σ^ι(ΑΙι)<Σιιμ2(ΑΙ()<μ2(Β) + ε. Ш A fact used in the next section: Lemma 2. Suppose that μ is a nonnegative and finitely additive set function on a semiring stf, and let A, A{,..., An be sets in S>f. (i) // U"^ι Λ, C/1 and tne Ai are disjoint, then Σ"^ιμ(Αί)<μ{Α). (ii) If А с U"_, A„ then μ(Α)<Σ^ιμ(Αί). Proof. For part (i), use Lemma 1 to choose disjoint .assets CJ such that A - U"=|/l,= U™=|C Since μ is finitely additive and nonnegative, it follows that μ(Α) = Σ^ ιμ(Α,) + t% ,μ(ς·) > E?_ ,μ(Λ,). For(ii), take β, =Α Π Ax and Β, =Α Π Α,η Α\ Π · · П/^_, for i> 1. By Lemma 1, each β, is a finite disjoint union of .assets C,--. Since the β, are disjoint, /1 = UjB,= LLC,, and U;C,;c/l;., it follows by finite additivity and part (i) that μ(Α) = Е,Е;м(С,,) < Σ,μΜ,)· ■ Compare Theorem 1.3. PROBLEMS 11.1. The proof of Theorem 3.1 obviously applies if the probability measure is replaced by a finite measure, since this is only a matter of rescaling. Take as a starting point then the fact that a finite measure on a field extends uniquely to the generated σ-field. By the following steps, prove Theorem 11.2—that is, remove the assumption of finiteness. (a) Let μ be a measure (not necessarily even σ-finite) on a field J^, and let ^=σ(^ο). If A is a nonempty set in J^J, and μ(Α)<<χ>, restrict μ to a finite measure μΑ on the field З^ПА, and extend μΑ to a finite measure μΑ on the σ-field JFh/1 generated in A by У0 Г\А. (b) Suppose that £t^ If there exist disjoint ^-sets An such that Ε с (J„ An and μ(/1π)<<», put μ(Ε)= Ε„μ/(ι(£Γΐ/1π) and prove consistency. Otherwise put /2(E) = oo. (c) Show that β is a measure on JF and agrees with μ on J^. 11.2. Suppose that μ is a nonnegative and finitely additive set function on a semiring $1?. (a) Use Lemmas 1 and 2, without reference to Theorem 11.3, to show that μ is countably subadditive if and only if it is countably additive. (b) Find an example where μ is not countably subadditive. 11.3. Show that Theorem 11.4(ii) can fail if μ(β) = со. 11.4. This and Problems 11.5, 16.12, 17.12, 17.13, and 17.14 lead to proofs of the Daniell-Stone and Riesz representation theorems.
SECTION 12. MEASURES IN EUCLIDEAN SPACE 171 Let Λ be a real linear functional on a vector lattice J? of (finite) real functions on a space Ω. This means that if / and g lie in _/, then so do / ν g and f Ag (with values max{/(ω), g(w)}and min{/(w), g(w)}), as well as af + fig, and A(a/ + /3g) = aA(/) + /3A(g). Assume further of J' that /e_/ implies /Λΐε^ (where 1 denotes the function identically equal to 1). Assume further of Л that it is positive in the sense that /> 0 (pointwise) implies Л(/) > 0 and continuous from above at 0 in the sense that /„ J, 0 (pointwise) implies Л(/„) -» 0. (a) If/<g (/,gey), define in ΩχΛ1 an "interval" (U5) {f,g] = [(a>,t):f{a>)<t<g(a>)]. Show that these sets form a semiring snf0. (b) Define a set function v0 on я/0 by (П.6) "о(/.*]=Л(*-/). Show that vn is finitely additive and countably subadditive on ja/0. 11.5. Τ (a) Assume /ε/ and let /„ = («(/-/л 1)) л 1. Show that /(ω)<1 implies /π(ω) = 0 for all η and /(ω)>1 implies /π(ω)=1 for all sufficiently large n. Conclude that for χ > 0, (11.7) (0,^„]τ[«./(«)>1]χ(0,λ]. (b) Let JF be the smallest σ-field with respect to which every / in ^f is measurable: ^=a[f{H: f<E^f, Ней1]. Let ^ be the class of A in ^ for which A X (0,1] 6E σ(^/0). Show that 3^ is a semiring and that 52"=σ(52ο). (c) Let ν be the extension of v0 (see (11.6)) to a{saf0), and for /1 e J^ define μ0(A) = v(A x(0,1]). Show that μ0 is finitely additive and countably subadditive on the semiring ^0. SECTION 12. MEASURES IN EUCLIDEAN SPACE Lebesgue Measure In Example 11.3 Lebesgue measure Λ was constructed on the class &l of linear Borel sets. By Theorem 10.3, Λ is the only measure on £%x satisfying λ(α, b] = b - a for all intervals. There is in /c-space an analogous k- dimensional Lebesgue measure \k on the class &k of /c-dimensional Borel sets (Example 10.1). It is specified by the requirement that bounded rectangles have measure (12.1) Ak[x:ai<xi<bi,i=l,...,k]=U(bi-ai). 1 = 1 This is ordinary volume—that is, length (k = 1), area (k = 2), volume (k = 3), or hypervolume (k > 4).
172 MEASURE Since an intersection of rectangles is again a rectangle, the uniqueness theorem shows that (12.1) completely determines Xk. That there does exist such a measure on 3lk can be proved in several ways. One is to use the ideas involved in the case к = 1. A second construction is given in Theorem 12.5. A third, independent, construction uses the general theory of product measures; this is carried out in Section 18.f For the moment, assume the existence on &k of a measure Xk satisfying (12.1). Of course, Xk is σ-finite. A basic property of Xk is translation invariance* Theorem 12.1. If A<E&k, then A +x = [a +x: a<EA]<E&k and Xk(A) = Xk(A + x) for all x. Proof. If £ is the class of A such that A +x is in &k for all x, then cf is a σ-field containing the bounded rectangles, and so ^Z)&k. Thus A +x e For fixed χ define a measure μ on 3&k by μ(Α) = Xk(A + x). Then μ and Xk agree on the T7-system of bounded rectangles and so agree for all Borel sets. ■ If Л is a (k - l)-dimensional subspace and χ lies outside A, the hyper- planes A + tx for real t are disjoint, and by Theorem 12.1, all have the same measure. Since only countably many disjoint sets can have positive measure (Theorem 10.2(iv)), the measure common to the A + tx must be 0. Every (k - \)-dimensional hyperplane has k-dimensional Lebesgue measure 0. The Lebesgue measure of a rectangle is its ordinary volume. The following theorem makes it possible to calculate the measures of simple figures. Theorem 12.2. // T: Rk -»Rk is linear and nonsingular, then Ле^4 implies that ТА е &k and (12.2) Xk(TA)=\detT\-Xk(A). Since a parallelepiped is the image of a rectangle under a linear transformation, (12.2) can be used to compute its volume. If Г is a rotation or a reflection—an orthogonal or a unitary transformation—then det7"=+l, and so Xk(TA)=-Xk(A). Hence every rigid transformation or isometry (an orthogonal transformation followed by a translation) preserves Lebesgue measure. An affine transformation has the form Fx = Tx + x0 (the general +See also Problems 17.14 and 20.4 *An analogous fact was used in the construction of a nonmeasurable set on p. 45
SECTION 12. MEASURES IN EUCLIDEAN SPACE 173 linear transformation Τ followed by a translation); it is nonsingular if Τ is. It follows by Theorems 12.1 and 12.2 that Xk(FA) = \detT\-Xk(A) in the nonsingular case. Proof of the Theorem. Since Τ U„ An = U„ TAn and TAC = (TA)C because of the assumed nonsingularity of T, the class £= [A: TA^ &k\ is a σ-field. Since ТА is open for open A, it follows again by the assumed nonsingularity of Τ that £ contains all the open sets and hence (Example 10.1) all the Borel sets. Therefore, A<E&k implies TA<z&k. For A<a&k, set μι(Α) = λΙζ(ΤΑ) and /x2(^) = |det T\- Ak(A). Then μ{ and μ2 are measures, and by Theorem 10.3 they will agree on &k (which is the assertion (12.2)) if they agree on the T7-system consisting of the rectangles [x: a, <*,<&,, /=l,...,/c] for which the a, and the b,- are all rational (Example 10.1). It suffices therefore to prove (12.2) for rectangles with sides of rational length. Since such a rectangle is a finite disjoint union of cubes and Xk is translation-invariant, it is enough to check (12.2) for cubes (12.3) ^ = [jc:0<Jc,<c,i = l,...,A:] that have their lower corner at the origin. Now the general Τ can by elementary row and column operationst be represented as a product of linear transformations of these three special forms: (Γ) T(x1,...,xk):=(xv.i,.,.,xv.k), where π is a permutation of the set {1,2,..., A:}; (2°) T(xi,..., χk) = (axt, χ2,..., χk); (3°) T(x и..., xk) = (xl +x2,x2,...,xk). Because of the rule for multiplying determinants, it suffices to check (12.2) for Τ of these three forms. And, as observed, for each such Τ it suffices to consider cubes (12.3). (Γ): Such а Г is a permutation matrix, and so det T= ± 1. Since (12.3) is invariant under T, (12.2) is in this case obvious. (2°): Here det Τ = a, and TA = [x: jc, еЯ, 0 <x, < c, i = 2,..., k], where Η = (0, ас] if a > 0, Η = {0} if a = 0 (although a cannot in fact be 0 if Γ is nonsingular), and # = [ac,0)if a<0. In each case, λk(TA) = \a\ · ck = \a\ · Ak(A). Birkhoff & Mac Lane, Section 8.9
174 MEASURE (3°): Here detT = 1. Let B = [x: 0<jc,<c, i = 3,...,*], where B=Rk if к < 3, and define β, = [x:0<jc, <x2<c]nB, B2=[jc:0<jc2<jc, <c]nB, B3 = [x: с <x, <c +x2,0<x2<c] nB. Then Λ = β, Uβ2, TA=B2UB3, and β, + (с,О,...,0) = β3· Since Лк(В,) = ΛΛ(β3) by translation invariance, (12.2) follows by additivity. ■ If Τ is singular, then detr = 0 and ТА lies in a (k - l)-dimensional subspace. Since such a subspace has measure 0, (12.2) holds if A and ТА lie in 32k. The surprising thing is that A e ^ need not imply that ТА e ^ if Г is singular. Even for a very simple transformation such as the projection r(jc,,jc2) = (^1,0) in the plane, there exist Borel sets A for which ТА is not a Borel set. Regularity Important among measures on &k are those assigning finite measure to bounded sets. They share with Kk the property of regularity: Theorem 12.3. Suppose that μ is a measure on &k such that μ(Α) <οο if A is bounded. (i) For A&3lk and e > 0. there exist a closed С an open G such that C<^A <zG and μ(ϋ-0<β. (ii) If μ(Α) <co, then μ{Α) = βυρμ(Α'), the supremum extending over the compact subsets К of A. Proof. The second part of the theorem follows form the first: μ(Α) < oo implies that μ(Α~Α0) <e for a bounded subset A0 of A, and it then follows from the first part that μ(Α0-Κ)<β for a closed and hence compact subset К of A0. See Hausdorff, p. 241.
SECTION 12. MEASURES IN EUCLIDEAN SPACE 175 To prove (i) consider first a bounded rectangle A = [x: a, < jc, < b,·, i <k]. The set G„ = [*: α, <xi <b,,+ n~\ i </c]isopen and G„ I A. Since ^(G^is finite by hypothesis, it follows by continuity from above that /x(G„ -A) <e for large n. A bounded rectangle can therefore be approximated from the outside by open sets. The rectangles form a semiring (Example 11.4). For an arbitrary set A in <%k, by Theorem 11.4(i) there exist bounded rectangles Ak such that А с \JkAk and p((\JkAk)-A)<e. Choose open sets Gk such that AkcGk and μ(0Ιζ-ΑΙζ)< e/2k. Then G = \JkGk is open and /x(G -Л)< 2е. Thus the general /c-dimensional Borel set can be approximated from the outside by open sets. To approximate from the inside by closed sets, pass to complements. ■ Specifying Measures on the Line There are on the line many measures other than Л that are important for probability theory. There is a useful way to describe the collection of all measures on 3&x that assign finite measure to each bounded set. If μ is such a measure, define a real function F by ( μ(0,χ] ifjc^O, С") "(Χ)"{-μ(*.0] if^O. It is because μ(Α) < °° for bounded A that F is a finite function. Clearly, F is nondecreasing. Suppose that xn I x. If χ ^ 0, apply part (ii) of Theorem 10.2, and if χ < 0, apply part (i); in either case, F(xn)i F(x) follows. Thus F is continuous from the right. Finally, (12.5) /i(fl,b]=F(b)-F(e) for every bounded interval (a, b]. If μ is Lebesgue measure, then (12.4) gives FU)=jc. The finite intervals form a T7-system generating &l, and therefore by Theorem 10.3 the function F completely determines μ through the relation (12.5):But (12.5) and μ do not determine F: if F{x) satisfies (12.5), then so does F(x) + с On the other hand, for a given μ, (12.5) certainly determines F to within such an additive constant. For finite μ, it is customary to standardize F by defining it not by (12.4) but by (12.6) F(x)=n(-«>,x\; then \imx^_caF(x) = 0 and limx^oc F(x) = μ(Κ1). If μ is a probability measure, F is called a distribution function (the adjective cumulative is sometimes added).
176 MEASURE Measures μ are often specified by means of the function F. The following theorem ensures that to each F there does exist a μ. Theorem 12.4. If F is a nondecreasing, right-continuous real function on the line, there exists on &1 a unique measure μ satisfying (12.5) for all a and b. As noted above, uniqueness is a simple consequence of Theorem 10.3. The proof of existence is almost the same as the construction of Lebesgue measure, the case F(x)=x. This proof is not carried through at this point, because it is contained in a parallel, more general construction for k- dimensional space in the next theorem. For a very simple argument establishing Theorem 12.4, see the second proof of Theorem 14.1. Specifying Measures in Я* The σ-field &k of к -dimensional Borel sets is generated by the class of bounded rectangles (12.7) A = [x: a; <*,<&,·, i = l,...,k] (Example 10.1). If /,· = (a,·, b,·], A has the form of a Cartesian product (12.8) A =/,X ··· XV Consider the sets of the special form (12.9) Sx = [y:yi<xi,i = l,...,k]; Sx consists of the points "southwest" of χ = (χ,,..., xk); in the case к = 1 it is the half-infinite interval (-«>, x]. Now Sx is closed, and (12.7) has the form (12.10) v4 = 5(ft, w-[W.*.>u5<*.«2 Mu'"uS№ «*)]· Therefore, the class of sets (12.9) generates &k. This class is a ^-system. The objective is to find a version of Theorem 12.4 for /c-space. This will in particular give /c-dimensional Lebesgue measure. The first problem is to find the analogue of (12.5). A bounded rectangle (12.7) has 2fc vertices—the points x = (xu...,xk) for which each xt is either at or b;. Let sgn^ x, the signum of the vertex, be + 1 or - 1, according as the number of i (1 < i < k) satisfying jc,- = at is even or odd. For a real function F on Rk, the difference of F around the vertices of Л is &AF= Esgn^ χ ■ F(x), the sum extending over the 2k vertices χ of A. In the case к = 1, A = (a, b] and AAF = F(b) - F(a). In the case к = 2, Ay4F = F(b1,b2)-F(b1,fl2)-F(a1,b2) + F(a1>a2).
SECTION 12. MEASURES IN EUCLIDEAN SPACE 177 Since the fc-dimensional analogue of (12.4) is complicated, suppose at first that μ is a finite measure on 3&k and consider instead the analogue of (12.6), namely (12.11) F(x)=n[y\ y,<x,,i = l,...,/c]. Suppose that Sx is defined by (12.9) and A is a bounded rectangle (12.7). Then (12.12) μ(Α)=ΑΑΡ. To see this, apply to the union on the right in (12.10) the inclusion-exclusion formula (10.5). The к sets in the union give 2k - 1 intersections, and these are the sets Sx for χ ranging over the vertices of A other than (ft,,..., bk). Taking into account the signs in (10.5) leads to (12.12). Suppose x(n) i χ in the sense that дс(л) J, χ,, as η -»<χ> for each i = 1,..., k. Then Sr(„, I Sx, and hence F(xin))-> F(x) by Theorem 10.2(ii). In this sense, F is continuous from above. Theorem 12.5. Suppose that the real function F on Rk is continuous from above and satisfies AAF>0 for bounded rectangles A. Then there exists a unique measure μ on &k satisfying (12.12) for bounded rectangles A. The empty set can be taken as a bounded rectangle (12.7) for which a, = b; for some i, and for such a set A, AAF = 0. Thus (12.12) defines a finite-valued set function μ on the class of bounded rectangles. The point of the theorem is that μ extends uniquely to a measure on &k. The uniqueness is an immediate consequence of Theorem 10.3, since the bounded rectangles form a T7-system generating &k. If F is bounded, then μ will be a finite measure. But the theorem does not require that F be bounded. The most important unbounded F is F(x) =jc, • · · xk. Here AAF = {bl -a,) · · · (bk -ak) for A given by (12.7). This is the ordinary volume of A as specified by (12.1). The corresponding measure extended to &k is fc-dimensional Lebesgue measure as described at the beginning of this section.
178 MEASURE Proof of Theorem 12.5. As already observed, the uniqueness of the extension is easy to prove. To prove its existence it will first be shown that μ as defined by (12.12) is finitely additive on the class of bounded rectangles. Suppose that each side /,=(a;, b,-] of a bounded rectangle (12.7) is partitioned into n, subintervals J(] = (i,,>-i, tu], j = 1, · ·,«,, where a, = </0 < tn < ··· < *,„ = b,. The nxn2 · · nk rectangles (12ЛЗ) Bh t=V" X7*A' l£Ji£nt,.. ,\<jk<nk, then partition A Call such a partition regular. It will first be shown that μ adds for regular partitions: (12-14) μ(Λ) = Σ Л (Я/, J- The right side of (12.14) is ΣΡΣΧ sgnB χ F(x), where the outer sum extends over the rectangles В of the form (12.13) and the inner sum extends over the vertices χ of B. Now (12.15) Σ Σ sgnB x-F(x) -= ΣΗχ) Σ sgnB Jt, Βλ- χ Β where on the right the outer sum extends over each χ that is a vertex of one or more of the B's, and for fixed χ the inner sum extends over the B's of which it is a vertex. Suppose that χ is a vertex of one or more of the B's but is not a vertex of A. Then there must be an i such that x,- is neither a,- nor b,-. There may be several such i, but fix on one of them and suppose for notational convenience that it is i= 1. Then Jt, = tXj with 0 <j < nv The rectangles (12.13) of which χ is a vertex therefore come in pairs B'=BJj2 ; and B" = BJ+, J2 j7 and sgnB. jc= -sgnB»Jt. Thus the inner sum on the right in (12.15) is 0 if χ is not a vertex of A. On the other hand, if χ is a vertex of A as well as of at least one B, then for each i either jc, = a,- - tj0 or jc;. = b, - i/n.. In this case χ is a vertex of only one В of the form (12.13)—the one for which ;',■ = 1 or;', = n,-, according as jc, = a,- or jc,- = b,— and sgnB д: = sgn^ x. Thus the right side of (12.15) reduces to kAF, which proves (12.14). Now suppose that A = \}nu^xAu, where A is the bounded rectangle (12.8), Au = Ли x ''- x4« f°r и = 1,...,«, and the Au are disjoint. For each / (1 <i <k), the intervals /„,...,/,.„ have /, as their union, although they need not be disjoint. But their endpoints split /,. into disjoint subintervals 7;1,...,7,„. such that each Iiu is the union of certain of the Jr The rectangles В of the form (12.13) are a regular ■> I I 1
SECTION 12. MEASURES IN EUCLIDEAN SPACE 179 partition of A, as before; furthermore, the fi's contained in a single Au form a regular partition of Au. Since the Au are disjoint, it follows by (12.14) that Μ>0 = ΣΜ«)=Σ Σμ(β)=ΣΜΛ)· β u=l B<ZA„ u=l Therefore, μ is finitely additive on the class J?k of bounded fc-dimensional rectangles. As shown in Example 114, J? is a semiring, and so Theorem 11.3 applies. If A, A{,..·, An are sets in J^k, then by Lemma 2 of the preceding section, η n (12.16) μ(Α)< Σμ(Α„) if А с (J/1,,. ;; =■ 1 u =■ 1 To apply Theorem 11.3 requires showing that μ is countably subadditive on J^k. Suppose then that А с Ц^=, Au. where A and the Au are in J2^. The problem is ю prove that CO (12.17) μ(Α)< Σμ{Αα). и = 1 Suppose that e > 0. If A is given by (12.7) and В = [*· α; + δ <x, <b:, i <к], then μ(Β) > μ(Α) - e for small enough positive δ, because μ is defined by (12.12) and F is continuous from above. Note that A contains the closure B~=[x: α,. + δ <xt <bt, i < к] of B. Similarly, for each и there is in J?k a set Bu = [x: aju <jc; < bju + Su, i<k] such that μ(ΒΝ) <μ{Αιι) + €/2" and /1„ is in the interior B° =[x: aiu <jc, < biu + S,„i<k] of B„ Since B~<zA с U"=i Au с υζ„, θ°, it follows by the Heine-Borel theorem that BcB'c υ^,β° с \Jnn=!iB„ for some n. Now (12.16) applies, and so μ(Α)-€ < μ(Β)<Σ",^ιμ(Β„) <ZZ^i^(Au) + e. Since e was arbitrary, the proof of (12.17) is complete. Thus μ as defined by (12.12) is finitely additive and countably subadditive on the semiring J^k. By Theorem 11.3, μ extends to a measure on 3lk =a(^k). ■ Strange Euclidean Sets* It is possible to construct in the plane a simple curve—the image of [0,1] under a continuous, one-to-one mapping—having positive area. This is surprising because the curve is simple: if the continuous map is not required to be one-to-one, the curve can even fill a squared Such constructions are counterintuitive, but nothing like one due to Banach and Tarski: Two sets in Euclidean space are congruent if each can be carried onto the other by an isometry, a rigid transformation. Suppose of sets A and В in Rk that A can be decomposed into sets A{,...,An and В can be decomposed into sets Bt,...,B„ in such a way that Ai and Bi art congruent for each i=l,...,n. In this case A and В are said to be congruent by dissection. If all the pieces A, and β, are Borel sets, then of course kk{A) = E"=,,A/t(y4/) = Е|'=,λ*(£,·) = \k(B). But if nonmea- This topic may be omitted. A Peano curve see HAUSDORrr, ρ 231 For the construction of simple curves of positive area, see GriUAUM & Oi MSTri), pp. 135 ff.
180 MEASURE surable sets are allowed in the dissections, then something astonishing happens: If k>3, and if A and В are bounded sets in Rk and have nonempty interiors, then A and В are congruent by dissection. (The result does not hold if к is 1 or 2.) This is the Banach-Tarskiparadox. It is usually illustrated in 3-space this way: It is possible to break a solid ball the size of a pea into finitely many pieces and then put them back together again in such a way as to get a solid ball the size of the sun.f PROBLEMS 12.1. Suppose that μ is a measure on £%x that is finite for bounded sets and is translation-invariant: μ(Α + χ) = u(A). Show that μ(Α) = αλ(Α) for some a > 0. Extend to Rk 12.2. Suppose that A e &1, λ(Α) > 0, and 0 < 0 < 1. Show that there is a bounded open interval / such that k(A Π/)> 0λ(/). Hint: Show that X(A) may be assumed finite, and choose an open G such that AcG and λ(Α)>θλ(0. Now G=U„/„ for disjoint open intervals /π [Α12], and Σ„λ(Λ n/J> 0Σ„λ(/π); use an /„. 12.3. Τ If Л е&1 and \(A)>0, then the origin is interior to the difference set D(A) = [x -y: χ, ye A]. Hint Choose a bounded open interval / as in Problem 12.2 for 0 = f. Suppose that \z\< λ(/)/2; since А П/ and (Α Π/) + ζ are contained in an interval of length less than 3λ(/)/2 and hence cannot be disjoint, zeD(A). 12.4. Τ The following construction leads to a subset Η of the unit interval that is nonrr.easurable in the extreme sense that its inner and outer Lebesgue measures are 0 and 1: λ*(#) = 0 and λ*(#) = 1 (see. (3.9) and (3.10)). Complete the details. The ideas are those in the construction of a nonmeasurable set at the end of Section 3. It will be convenient tc work in G = [0,1); let θ and θ denote addition and subtraction modulo 1 in G, which is a group with identity 0. (a) Fix an irrational θ in G and for η = 0, ± 1, ± 2,... let θη be ηθ reduced modulo 1. Show that 0„ θ 0m = 0n+m, 0„e0m=0„_m, and the 0„ are distinct. Show that [θ2η: η = 0, ± 1,...} and {02π + 1: η = 0, ± 1,...} are dense in G. (b) Take χ and у to be equivalent if xGy lies in {0П: и = 0, + 1,...}, which is a subgroup. Let S contain one representative from each equivalence class (each coset). Show that G= U„(5e0„), where the union is disjoint. Put H= U„(5e02n)and show that G-H = H®6. (c) Suppose that A is a Borel set contained in H. If k(A) > 0, then D(A) contains an interval (0,e); but then some 62k + l lies in (0,e)cD(A)cD(H), and so 02* + ,=/?, -h2 = hi θh2 = isi ® θ2π)θ(s2® в2пг) for some hi,h2 in Η and some sl,s2 in S. Deduce that st=s2 and obtain a contradiction. Conclude that A*(#) = 0. (d) Show that АДЯе 0) = 0 and А*(Я) = 1. See Wagon for an account of these prodigies.
SECTION 12. MEASURES IN EUCLIDEAN SPACE 181 12.5. Τ The construction here gives sets Hn such that #„ ί G and λ*(#„) = 0. If Jn = G- Hn, then Jn 10 and A*(7„) = 1. (a) Let Hn = U£= _„(5 θ вк), so that H„ Τ G. Show that the sets H„ θ 0(2и + 1), are disjoint for different υ. (b) Suppose that A is a Borel set contained in Hn. Show that A and indeed all the Α θ 0(2Π+|), have Lebesgue measure 0. 12.6. Suppose that μ is nonnegative and finitely additive on &k and that μ(/?*) < °°. Suppose further that μ(Α) = $ιιρμ(Κ\ where К ranges over the compact subsets of A. Show that μ is countably additive (Compare Theorem 12.3(ii).) 12.7. Suppose μ is a measure on 32k such that bounded sets have finite measure. Given A, show that there exist an /^-set U (a countable union of closed sets) and a Gg-set V (a countable intersection of open sets) such that U с А с К and μ(Υ-υ) = 0. 12.8. 2.19ΐ Suppose that μ is a nonatomic probability measure on (Rk,&k) and that μ(Α)>0. Show that there is an uncountable compact set K such that KcA and μ(Κ) = 0. 12.9. The minimal closed support of a measure μ on 3ik is a closed set С such that СцсС for closed С if and only if С supports μ. Prove its existence and uniqueness. Characterize the points of CM as those χ such that μ(ί/) > 0 for every neighborhood U of x. If к = 1 and if μ and the function F(x) are related by (12.5), the condition is F(x - e) < F(x + e) for all e; χ is in this case called a poini of increase of f. 12.10. Of minor interest is the fc-dimensional analogue of (12.4). Let /, be (0, t] for t> 0 and (г,0] for t < 0, and let Ax =lx χ · · · χ lx . Let ip(jt) be + 1 or -1 according as the number of i, 1 < i < fc, fOr which xi < 0 is even or odd. Show that, if F(x) = φ(χ)μ(Αx), then (12.12) holds for bounded rectangles /1. Call F degenerate if it is a function of some к — 1 of the coordinates, the requirement in the case к = 1 being that F is constant. Show that kAF = 0 for every bounded rectangle if and only if F is a finite sum of degenerate functions; (12.12) determines F to within addition of a function of this sort. 12.11. Let С be a nondecreasing, right-continuous function on the line, and put F(x, y) = min{G(jc), y]. Show that F satisfies the conditions of Theorem 12.5, that the curve С = [(x,G(x)Y. χ e R{] supports the corresponding measure, and that A2(C) = 0. 12.12. Let F{ and F2 be nondecreasing, right-continuous functions on the line and put /r(jc1,jc2) = /:'i(^i)/:'2(jc2^ Show that F satisfies the conditions of Theorem 12.5. Let μ,μ{,μ2 be the measures corresponding to F, F{,F2, and prove that μ(Αι ΧΑ2) = μ[(Α[)μ2(Α2) for intervals A{ and A2. This μ is the product of μ! and μ2; products are studied in a general setting in Section 18.
182 MEASURE SECTION 13. MEASURABLE FUNCTIONS AND MAPPINGS If a real function AOnil has finite range, it is by the definition in Section 5 a simple random variable if [ω: Χ(ω) = χ] lies in the basic σ-field &~ for each x. The requirement appropriate for the general real function X is stronger; namely, [ω: Χ(ω)^Η] must lie in У for each linear Borel set H. An abstract version of this definition greatly simplifies the theory of such functions. Measurable Mappings Let (Ω,^) and (Ω', У) be two measurable spaces. For a mapping T. Ω -» Ω', consider the inverse images T~lA = [ω e Ω: Τω еЛ'] for А с Ω' (See [A7] for the properties of inverse images.) The mapping Τ is measurable У/У if T~lA e= У for each A e f. For a real function /, the image space Ω' is the line R1, and in this case &[ is always tacitly understood to play the role of У. A real function / on Ω is thus measurable У (or simply measurable, if it is clear from the context what У is involved) if it is measurable SF/3$X—that is, if f~[H = [ω: /(ω) eHje^ for every Η g^1. In probability contexts, a real measurable function is called a random variable. The point of the definition is to ensure that [ω: /(и)ей] has a measure or probability for all sufficiently regular sets Η of real numbers—that is, for all Borel sets H. Example 13.1. A real function / with finite range is measurable if /~'{·*} ^ У for each singleton {x}, but his is too weak a condition to impose on the general /. (It is satisfied if (Ω, У) = (R1, &x) and / is any one-to-one map of the line into itself; but in this case f~lH, even for so simple a set Η as an interval, can for an appropriately chosen / be any uncountable set, say the non-Borel set constructed in Section 3.) On the other hand, for a measurable / with finite range, f~xH e У for every Η cR1; but this is too stiong a condition to impose on the general /. (For (Ω, У) = (R\ &l), even f(x)=x fails to satisfy it.) Notice that nothing is required of fA; it need not lie in ^' for A in &~. ■ If in addition to (Ω, S?), (Ω', У), and the map Τ: Ω -» Ω', there is a third measurable space (Ω", У") and a map Τ': Ω' -» Ω", the composition T'T = T'° Τ is the mapping Ω -» Ω" that carries ω to Τ'(Τ(ω)). Theorem 13.1. (i) // T~ lA' e У for each A e sf and sf generates S?\ then Τ is measurable У/ У. (ii) // Τ is measurable У/У and Γ is measurable У/У then T'T is measurable У/У".
SECTION 13. MEASURABLE FUNCTIONS AND MAPPINGS 183 Proof. Since T~l(fl' - A ') = Ω, - T'^A' and T~l(\Jn A'n) = ΌπΤ~ιΑ'η, and since 5? is a σ-field in Ω, the class [A': T~lA' e <F\ is a σ-field in Ω'. If this σ-field contains of', it must also contain σ(αί'), and (i) follows. As for (ii), it follows by the hypotheses that A" e S^" implies that (TTlA" e &*, which in turn implies that (ГТГ'Л" = [ω: T'Tw<eA"] = [ω: Γω е(Г)-'Л"] = Г"'((Г)"'Л") е ^ ■ By part (i), if / is a real function such that [ω: F(w) <jc] lies in &~ for all a, then / is measurable &~. This condition is usually easy to check. Mappings into Rk For a mapping /: Ω-* Rk carrying il into /c-space, 3&k is always understood to be the σ-field in the image space. In probabilistic contexts, a measurable mapping into Rk is called a random vector. Now / must have the form (13.1) /(«) = (/i(*>),···,/*(«)) for real functions /;·(ω). Since the sets (12.9) (the "southwest regions") generate 3lk, Theorem 13.10) implies that / is measurable &~ if and only if the set к (13.2) [<o:f1(co)<x1,...,fk(co)<xk]= f| [«: /y(«) <*,·] lies in ^ for each (*,,..., xk). This condition holds if each f-s is measurable &~. On the other hand, if Xj = χ is fixed and jc, = ■ ■ · =·*,·_! = Jty+i = · ■ ■ = xk = η goes to oo, the sets (13.2) increase to [ω: //ω) <*]; the condition thus implies that each f] is measurable. Therefore, f is measurable S^ if and only if each component function /. is measurable &~. This provides a practical criterion for mappings into Rk. A mapping /: R'~* Rk is defined to be measurable if it is measurable 3$l /0$k. Such functions are often called Borel functions. To sum up, T: a -* Ω' is measurable ^/^' if 7_U' e= У for all Л' е S^'\ f: fl-+Rk is measurable &~ if it is measurable S^/ffi; and /: Λ' -»Λ* is measurable (a Borel function) if it is measurable 0$l /0$k. If Η lies outside 0$\ then IH (i = k = 1) is not a Borel function. Theorem 13.2. ///: R' -* Rk is continuous, then it is measurable. Proof. As noted above, it suffices to check that each set (13.2) lies in 3$'. But each is closed because of continuity. ■
184 MEASURE Theorem 13.3. // f/. D,~*R[ is measurable S^, j=l,...,k, then g(fi(o)), ■ ■■, fk(o))) is measurable SF if g: Rk -» R[ is measurable—in particular, if it is continuous. Proof. If the /. are measurable, then so is (13.1), so that the result follows by Theorem 13.1(H). ■ Taking g(x,, ..,xk) to be Σ^ιχί, Π,*,,*,·, and max{jt,,...,xk) in turn shows that sums, products, and maxima of measurable functions are measurable. If /(ω) is real and measurable, then so are sin/(ω), e'f(a), and so on, and if /(ω) never vanishes, then 1//(ω) is measurable as well. Limits and Measurability For a real function / it is often convenient to admit the artificial values oo and -oo—to work with the extended real line [-00,00]. Such an / is by definition measurable &~ if [ω: /(ω) <ξ Η] lies in &~ for each Borel set Η of (finite) real numbers and if [ω: /(ω) = oo] and [ω: /(ω) = - oo] both lie in &~. This extension of the notion of measurability is convenient in connection with limits and suprema, which need not be finite. Theorem 13.4. Suppose that fl,f2,..- are real functions measurable &~. (i) The functions supn /„, inf„/n, limsupn /„, and liminfn /„ are measurable &. (ii) // limn fn exists everywhere, then it is measurable &~. (iii) The ω-set where {/„(ω)} converges lies in &~. (iv) If f is measurable JF, then the ω-set where /„(ω) -»/(ω) lies in S^. Proof. Clearly, [sup„ fn<x]= fl„ [/„ <·*] lies in & even for χ = °° and χ = - oo, and so sup,, fn is measurable. The measurability of inf „ /„ follows the same way, and hence limsup„ fn = ΐηίπ8υρΛ>π fk and liminf„/„ = supn inik^„fk are measurable. If limn/„ exists, it coincides with these last two functions and hence is measurable. Finally, the set in (iii) is the set where limsup„ /„(ω) = liminf„ /„(ω), and that in (iv) is the set where this common value is /(ω). ■ Special cases of this theorem have been encountered before—part (iv), for example, in connection with the strong law of large numbers. The last three parts of the theorem obviously carry over to mappings into Rk. A simple real function is one with finite range; it can be put in the form (13.3) /= Σ*Λ,,
SECTION 13. MEASURABLE FUNCTIONS AND MAPPINGS 185 where the Л, form a finite decomposition of Ω. It is measurable !F if each Л, lies in &~. The simple random variables of Section 5 have this form. Many results concerning measurable functions are most easily proved first for simple functions and then, by an appeal to the next theorem and a passage to the limit, for the general measurable function. Theorem 13.5. If f is real and measurable &~, there exists a sequence {/„} of simple functions, each measurable JF, such that (13.4) 0</„(ω)ί/(ω) iff(co)>0 and (13.5) 0>/„(ω)|/(ω) ί//(ω)<0. Proof. Define if -oo </(ω) < -η, if -kl'n </(ω) < -(к- 1)2-", 1<к<п2", if (A: - 1)2-" </(ω) <k2'", 1 <k<n2", if η </(ω) < oo. This sequence has the required properties. ■ Note that (13.6) covers the possibilities /(ω) = oo and /(ω) = -oo. If A e &~, a function / defined only on A is by definition measurable if [ωε/4: /(ω) etf] lies in У for Яе^1 and for Η = Π and tf = {-oo}. Transformations of Measures Let (Ω, &~) and (IT, J^7) be measurable spaces, and suppose that the mapping Τ: Ω -» Ω' is measurable S^/ S^'. Given a measure μ. on 5*", define a set function μΤ'1 on ^ by (13.7) μΊ-1(Α')=μ(Τ-ιΑ'), A' e ^'. That is, /χΓ-1 assigns value /х(Г"14') to the set Л'. If A'^F', then T~XA' Gl &~ because Г is measurable, and hence the set function μΤ~ι is well denned on &*. Since Γ~! U„ Л'„ = U„ Г- lA'n and the Γ~',4'„ are disjoint (13.6) /„(«) = - л -(Л- 1)2-" (it-l)2_n
186 MEASURE sets in Ω if the A'n are disjoint sets in Ω', the countable additivity of μΤ~' follows from that of μ. Thus μΤ~ι is a measure. This way of transferring a measure from Ω to IT will prove useful in a number of ways. If μ is finite, so is μΤ~ι; if μ is a probability measure, so is μΤ'1.1 PROBLEMS 13.1. Functions are often defined in pieces (foi example, let f(x) be χλ or x~[ as χ > 0 or jc < 0), and the following result shows that the function is measurable if the pieces are. Consider measurable spaces (Ω, 5*") and (Ω', У) and a map Γ Ω -* Ω' Let Al,A2,... be a countable covering of Ω by J^sets. Consider the σ-field Srn = [A: AcAn, Ae ^]'m An and the restriction Tn of Τ lo An. Show that Τ is measurable &~/ SF' if and only if Tn is measurable &~η/&' for each n. 13.2. (a) For a map Γ and σ-fields & and ^', define Γ-1^' = 17~'/1': Α'<=&~'] and Г^= [/Г: Г-1/*' е 5*"]· Show that T1^' and 7\^" are σ-fields and that measurability ^/^' is equivalent to T'1^' с &~ and to 5*"' с Т&. (b) For given 5*"', T~l&~\ which is the smallest σ-field for which Τ is measurable ^/^', is by definition the σ-field generated by 7\ For simple random variables describe σ{Χχ,...,Xn) in these terms. (c) Let a'W) be the σ-field in Ω' generated by srf'. Show that σ(Τ~ι£/') = T~l(a'{szf)). Prove Theorem 10.1 by taking Τ to be the identity map from Ω0 to Ω. 13.3. Τ Suppose that /: Ω -*Rl. Show that / is measurable T~l&~' if and only if there exists a map φ: Ω' -»R[ such that ψ is measurable У and /= <pT. Hint: First consider simple functions and then use Theorem 13.5. 13.4. Τ Relate the result in Problem 13.3 to Theorem 5.1(ii). 13.5. Show of real functions / and g that f(a>) + g(a>) <x if and only if there exist rationals r and s such that r + s<x, /(w)<r, and g(a>)<s. Prove directly that f+g is measurable & if / and g are. 13.6. Let У be a σ-field in R}. Show that 31х с У if and only if every continuous function is measurable fF. Thus & is the smallest σ-field with respect to which all the continuous functions are measurable. 13.7. Consider on R[ the smallest class ST (that is, the intersection of all classes) of real functions containing all the continuous functions and closed under point- wise passages to the limit. The elements of ST art called В aire functions. Show that Baire functions and Borel functions on Rl are the same thing. 13.8. A real function / on the line is upper semicontinuous at χ if for each e there is a δ such that \x-y\<8 implies that f(y) <f(x) + e. Show that, if/ is everywhere upper semicontinuous, then it is measurable. fBut see Problem 13.14.
SECTION 14. DISTRIBUTION FUNCTIONS 187 13.9. Suppose that /„ and / are finite-valued, ^measurable functions such that /,,(ω) ->/(ω) for ω еЛ, where μ(Α) < <χ> (μ a measure on 5*"). Prove Egoroff's theorem: For each e there exists a subset В oi A such that μ(Β) <e and /„(ω)->/(ω) uniformly on /1 - β. Яшг; Let β^*> be the set of ω in /1 such that |/(ω) -/,(ω)| > к'' for some ί > п. Show that B(nk) 1/0 as и Т °°, choose nk so that μ(β<*>) <e/2*, and put В=Щш1 B(nk\ 13.10. Τ Show that Egoroffs theorem is false without the hypothesis μ(Α) <°°. 13.11. 2.9? Show that, if / is measurable a{snf), then there exists a countable subclass suff of suf such that / is measurable a{srf\ 13.12. Circular Lebesgue measure Let С be the unit circle in the complex plane, and define T: [0,1) -» С by Γώ = e2l,iw. Let ^ consist of the Borel subsets of [0,1), and let λ be Lebesgue measure on S8. Show that -€=\A\ T'lA e 6§\ consists of the sets in £%2 (identify R2 with the complex plane) that are contained in С Show that ■€ is generated by the arcs of C. Circular Lebesgue measure is defined as μ = λΤ~ι. Show that μ is invariant under rotations: μ[θζ: ζ ε/1] = μ(Α) for Ae if and θ e С. 13.13. Τ Suppose thai the circular Lebesgue measure of A satisfies μ(Α) > I— n'1 and that В contains at most η points. Show that some rotation carries В into Α ΘΒ czA foi some θ in С 13.14. Show by example that μ σ-finite does not imply μΤ~ι σ-finite. 13.15. Consider Lebesgue measure λ restricted to the class 98 of Borel sets in (0,1]. For a fixed permutation n,,n2>··· of the positive integers, if χ has dyadic expansion .xlx2···, take Tx = .xn *„,... . Show that Τ is measurable $/& and that λΤ~ι =λ. 13.16. Let Hk be the union of the intervals (d- l)/2k,i/2k] for i even, 1 <i<2k. Show that if 0</(ω)<1 for all ω and Ak=f'l{Hk\ then /(ω) = Σ*=μΛι (ω)/2*, an infinite linear combination of indicators. 13.17. Let S = {0,1}, and define a map Τ from sequence space 5" to [0,1] by Γω = Σ^=1β/ι(ω)/2*. Define a map i/of [0,1] to 5" by Ux = (di(x),d2(x\...), where the dk(x) are the digits of the nonterminating dyadic expansion of х (and dk(0) = 0). Show that Τ is measurable -€/98 and that U is measurable &/if. Let Ρ be the measure specified by (2.21) for p0=P\ = \. Describe PT'1 and AIT1. SECTION 14. DISTRIBUTION FUNCTIONS Distribution Functions A random variable as defined in Section 13 is a measurable real function X on a probability measure space Ш, У, f). The distribution or /aw of the
188 MEASURE random variable is the probability measure μ on (R\ &l) defined by (14.1) μ(Α) = Ρ[ΧεΑ], Ле^1. As in the case of the simple random variables in Chapter 1, the argument ω is usually omitted: P[X<=A] is short for Ρ[ω:Χ(ω) еЛ]. In the notation (13.7), the distribution is PX~ '. For simple random variables the distribution was defined in Section 5—see (5.12). There μ was defined for every subset of the line, however; from now on μ will be defined only for Borel sets, because unless X is simple, one cannot in general be sure that [.Ye/l] has a probability for A outside ^". The distribution function of X is defined by (14.2) F(x) =μ(-οο,χ] =Ρ[Χ<χ] for real x. By continuity from above (Theorem 10.2(ii)) for μ, F is right- continuous. Since F is nondecreasing, the left-hand limit F(x - ) = limyTj F(y) exists, and by continuity from below (Theorem 10.2(iJ) tor μ, (14.3) F(x-)=p(-oo,x)=P[X<x]. Thus the jump or saltus in F at χ is F(x) -F(jc-)=/x{jc} =/>[* = *]. Therefore (Theorem 10.2(iv)) F can have at most countably many points of discontinuity. Clearly, (14.4) lim F(x) =0, lim F(x) = l. X-* —oo X—>°° A function with these properties must, in fact, be the distribution function of some random variable: Theorem 14.1. If F is a nondecreasing, right-continuous function satisfying (14.4), then there exists on some probability space a random variable X for which F(x) = P[X<x]. First Proof. By Theorem 12.4, if F is nondecreasing and right-continuous, there is on (R1,^1) a measure μ for which μ(α^] = Ρ&) - F(a). But \imx__a,F(x) = 0 implies that μ(-°°,χ] = F(x), and limx^a,F(x) = l implies that μ(Λ') = 1. For the probability space take Ш, S?, Ρ) = (Λ1, @\μ\ and for X take the identity function: Χ(ω) = ω. Then P[X <x\ = μ[ω<ΞRl■. ω <x] = F(x). m
SECTION 14. DISTRIBUTION FUNCTIONS 189 Second Proof. There is a proof that uses only the existence of Lebesgue measure on the unit interval and does not require Theorem 12.4. For the probability space take the open unit interval: Я is (0,1), ^"consists of the Borel subsets of (G, 1), and P(A) is the Lebesgue measure of A. To understand the method, suppose at first that F is continuous and strictly increasing. Then F is a one-to-one mapping of Rl onto (0,1); let φ: (0,1) —»/?' be the inverse mapping. For 0 <ω < 1, let Χ(ω} = φ(ω). Since φ is increasing, certainly X is measurable &~. If 0 < и < 1, then <p(u) <x if and only if u<F(x). Since Ρ is Lebesgue measure, P[X<x] = Ρ[ω e (0,1): φ(ω) <x] - Ρ[ω e(0,1): ω <F(x)] = F(x), as required. F(x) I· »—>■ (u) φ(ν) If F has discontinuities or is not strictly increasing, define1 (14.5) <p(u) =inf[jt.-M <F(x)] for 0 < и < I. Since F is nondecreasing, [x: и < F(x)] is an interval stretching to o°; since F is right-continuous, this interval is closed on the left. For 0 <u < 1, therefore, [x: и <F(x)] = [<p(w),o°), and so <p(u) <x if and only if u<F(x). If Χ(ω) = φ(ω) for 0<ω<1, then by the same reasoning as before, X is a random variable and P[X <x] = F(x). ■ This second argument actually provides a simple proof of Theorem 12.4 for a probability distribution* F: the distribution μ (as defined by (14.1)) of the random variable just constructed satisfies μ(-<χ,χ] = F(x) and hence /i(a,b] = F(b)-F(a). Exponential Distributions There are a number of results which for their interpretation require random variables, independence, and other probabilistic concepts, but which can be discussed technically in terms of distribution functions alone and do not require the apparatus of measure theory. This is called the quantile function For the general case, see Problem 14 2.
190 MEASURE Suppose as an example that F is the distribution function of the waiting time to the occurrence of some event—say the arrival of the next customer at a queue or the next call at a telephone exchange. As the waiting time must be positive, assume that F(0) = 0. Suppose that F(x) < 1 for all x, and furthermore suppose that The right side of this equation is the probability that the waiting time exceeds y; by the definition (4.1) of conditional probability, the left side is the probability thai the waiting time exceeds χ + у given that it exceeds x. Thus (14.6) attributes to the waiting-time mechanism a kind of lack of memory or aftereffect: If after a lapse of χ units of time the event has not yet occurred, the waiting time stili remaining is conditionally distributed just as the entire waiting time from the beginning. For reasons that will emerge later (see Section 23), waiting times often have this property. The condition (14.6) completely determines the form of F. If U(x) = l-F(jt), (14.6) is U(x+y) = U(x)U(y). This is a form of Cauchy's equation [A20], and since U is bounded, U(x):=e~ax for some a. Since lim^. _„£/(*) = 0, a must be positive. Thus (14.6) implies that F has the exponential form (14-7) ^(*)"f? -« 'iX~°n K ' w (1-е °' ifx>0, and conversely. Weak Convergence Random variables Xl,...,Xn are defined to be independent if the events [A", ^Al],...,[Xn^An] are independent for all Borel sets Α^.,.,Α^so that P[Xt e A„ ί = 1,..., η] = Π?_, P[X{ e A,]. To find the distribution function of the maximum Mn = max{ Xb..., Xn), take /4,= ··· = An = (-<», x]. This gives P[Mn <x] = Π"=1Ρ[Χι <x]. If the X{ are independent and have common distribution function G and Mn has distribution function F„, then (14.8) Fn(x) = G»(x). It is possible without any appeal to measure theory to study the real function Fn solely by means of the relation (14.8), which can indeed be taken as defining F„. It is possible in particular to study the asymptotic properties of F„:
SECTION 14. DISTRIBUTION FUNCTIONS 191 Example 14.1. Consider a stream or sequence of events, say arrivals of calls at a telephone exchange. Suppose that the times between successive events, the interarrival times, are independent and that each has the exponential form (14.7) with a common value of a. By (14.8) the maximum Mn among the first η interarrival times has distribution function F„(jc) = (1 - e~aXY, χ > 0. For each x, lim„ Fn(x) = 0, which means that Mn tends to be large for η large. But P[Mn - a~l log η <x] = Fn(x + a"' log n). This is the distribution function of Mn -a"1 log n, and it satisfies (14.9) F„(jt + ff-1logrt)-(l-e-(<"+,ogn))"^<?-e"'" - as η -»o°; the equality here holds if log η > -ax, and so the limit holds for all x. This gives for large η the approximate distribution of the normalized random variable Mn - a~] log п. т If F„ and F are distribution functions, then by definition, F„ converges weakly to F, written F„ =» F, if (14.10) limF„(jc)=F(jc) for each χ at which F is continuous.' To study the approximate distribution of a random variable Yn it is often necessary to study instead the normalized or rescaled random variable (Yn -bn)/an for appropriate constants an and bn. If Yn has distribution function Fn and if an > 0, then P[(Yn ~bn)/an <x\ = P[Yn<anx +bn], and therefore (Yn -bn)/an has distribution function Fn(anx +bn). For this reason weak convergence often appears in the form* (14.11) Fn(anx+bn)~F(x). Ал example of this is (14.9): there an = 1, bn~ a~l log n, and F(x) = e'e "". Example 14.2. Consider again the distribution function (14.8) of the maximum, but suppose that G has the form r( \ /° ifjc<l, G(*) = \l-;c-« if,*!, where a > 0. Here Fn(nl/ax) = (1 - η'ιχ-α)" for χ > n'i/a, and therefore .. r- / \/a \ /0 if JC < 0, limFJ ηl/ax) = { η пУ ' \e'x ifjc > 0. This is an example of (14.11) in which an = nl/a and bn = 0. ■ fFor the role of continuity, see Example 14.4. *To write Fn(anx + bn) => F(x) ignores the distinction between a function and its value at an unspecified value of its argument, but the meaning of course is that Fn(anx + bn) -»F(x) at continuity points χ of F.
192 MEASURE Example 14.3. Consider (14.8) once more, but for [0 ifjc<0, G(x)= I-(I-jc)" if0<jc<l, U if jc > 1, where a > 0. This time Fn(n'l/ax + 1) = (1 - n-1(-*)")" if -n1/a<x<0. Therefore, limFn(n-^ax + l) = {e~(~X)° {icx£°> η nK ' \l if jc > 0, a case of (14.11) in which an = n~l/a and i»„ = 1. ■ Let Δ be the distribution function with a unit jump at the origin: 04-12) A(x)-('J ^<J· If Χ(ω) = 0, then X has distribution function Δ. Example 14.4. Let XlX2,... be independent random variables for which P[Xk = l] = P[Xk = -1] = j, and put 5„ = Xx + · · · +*„. By the weak law of large numbers, (14.13) P[\n-%\>e]^0 for e > 0. Let Fn be the distribution function of n~lSn. If jc > 0, then F„(jt) = l-/,[n_lSil>Al-»l; if jc<0, then F„(jc) ^P[|n_1SJ > I jc Π -» 0. As this accounts for all the continuity points of Δ, Fn => Δ. It is easy to turn the argument around and deduce (14.13) from F„ =» Δ. Thus the weak law of large numbers is equivalent to the assertion that the distribution function of n~lSn converges weakly to Δ. If η is odd, so that Sn = 0 is impossible, then by symmetry the events [S„ < 0] and [Sn > 0] each have probability \ and hence F„(0) = {. Thus F„(0) does not converge to Δ(0) = 1, but because Δ is discontinuous at 0, the definition of weak convergence does not require this. ■ Allowing (14.10) to fail at discontinuity points jc of F thus makes it possible to bring the weak law of large numbers under the theory of weak convergence. But if (14.10) need hold only for certain values of jc, there
SECTION 14. DISTRIBUTION FUNCTIONS 193 arises the question of whether weak limits are unique. Suppose that Fn=>F and F„ =» G. Then F(x) = lim„ Fn(x) = G(x) if F and G are both continuous at x. Since F and G each have only countably many points of discontinuity/ the set of common continuity points is dense, and it follows by right continuity that F and G are identical. A sequence can thus have at most one weak limit. Convergence of distribution functions is studied in detail in Chapter 5. The remainder of this section is devoted to some weak-convergence theorems which are interesting both for themselves and for the reason that they require so little technical machinery. Convergence of Types* Distribution functions F and G are of the same type if there exist constants a and b, a > 0, such that F(ax + b) = G(x) for all x. A distribution function is degenerate if it has the form Δ(* -.r0) (see (14.12.)) for some xQ; otherwise, it is nondegenerate. Theorem 14.2. Suppose that Fn(unx +υη) => F(x) and Fn{anx +6„)=> G(x), where un > 0, an > 0, and F and G are nondegenerate. Then there exist a and b, a > 0, such that an/un -»a, (bn - vn)/un -» b, and F{ax + b) = G{x). Thus there can be only one possible limit type and essentially only one possible sequence of norming constants. The proof of the theorem is for clarity set out in a sequence of lemmas. In all of them, a and the an are assumed to be positive. Lemma 1. If Fn =» F, an~* a, and bn -»b, then Fn(anx + bn) =» F(ax + b). Proof. If χ is a continuity point of F(ax + b) and e > 0, choose continuity points и and υ of F so that u<ax + b<v and F(v)-F(u)<e; this is possible because F has only countably many discontinuities. For large enough n, u<anx + bn<u, \Fn{u) - F{u)\<e, and \Fn(v) - F(v)\ <e; but then F(ax + b)-2e< F(u) -e< F„(u) < Fn{anx + b„) < F„(v) <F(v) + e< F(ax+b) + 2e. ■ Lemma 2. If Fn=* F and an -»°o, then Fn(anx) =» A(x). Proof. Given e, choose a continuity point и of F so large that F(u) > 1 - €. If χ > 0, then for all large enough n, anx > и and \Fn(u) - F(u)\ < e, so fThe proof following (14.3) uses measure theory, but this is not necessary If the saltus σ{χ) = F{x) — F{x -) exceeds e at л:, < · · <x„, then F(x,)- F(x,_ x)> e (take jt0 <*,), and so ne <F(x„)- F{xn)< 1; hence [д:· σ(χ)> e] is finite and [д:: σ(χ)> 0] is countable *This topic may be omitted.
194 MEASURE that F„(a„x) > Fn(u) > F{u) - e > 1 - 2e. Thus lim„ F„(a„x) = 1 for χ > 0; similarly, lim„ Fn(anx) = 0 for jc <0. ■ Lemma 3. If Fn=*F and bn is unbounded, then Fn(x + bn) cannot converge weakly. Proof. Suppose that bn is unbounded and that bn -> oo along some subsequence (the case bn-> -oo is similar). Suppose that Fn(x +bn) => G(x). Given e, choose a continuity point и of F so that F(u)> 1-е. Whatever χ may be, for η far enough out in the subsequence, χ +bn> и and Fn(u) > 1 - 2e, so that F„U + 6„) > 1 - 2e. Thus G(jc) = lim„ Fn(x + b„)=l for all continuity points χ of G, which is impossible. ■ Lemma 4. If F„(x) =» F(jc) α«ίί F„(a„a: + i>„) =» G(*), tv/iere Τ7 α«ίί G are nondegenerate, {Леи (14.14) 0 < inf a„ < sup a„ < oo, sup|6„|<oo. Proof. Suppose that a„ is not bounded above. Arrange by passing to a subsequence that an *-» oo. Then by Lemma 2, (14.15) Fn(a„x)=>b(x). Since (14.16) Fn{a„(x + bn/an))=Fn(anx+bn)=>G(x), it follows by Lemma 3 that bn/an is bounded along this subsequence. By passing to a further subsequence, arrange that b„/an converges to some c. By (14.15) and Lemma 1, Fn(an(x + b„/a„)) => Δ(χ +c) along this subsequence. But (14.16) now implies that G is degenerate, contrary to hypothesis. Thus an is bounded above. If Gn(x) =FJLanx + b„), then Gn(x)=*G(x) and Gn(a~lx - a~lbn) = Fn(x) => F(x). The result just proved shows that a~ ' is bounded. Thus an is bounded away from 0 and oo. If bn is not bounded, neither is bn/an; pass to a subsequence along which bn/an -> +oo and an converges to a positive a. Since, by Lemma 1, Fn(anx)=> F(ax) along the subsequence, (14.16) and bn/an ~* +oo stand in contradiction (Lemma 3 again). Therefore bn is bounded. ■ Lemma 5. // F(x) = F(ax + b) for all χ and F is nondegenerate, then a = 1 and i> = 0. Proof. Since F(x) = F(a"x + (a"'1 + ■ ■ ■ +a + l)b), it follows by Lemma 4 that a" is bounded away from 0 and oo, so that a = 1, and it then follows that nb is bounded, so that 6 = 0. ■
SECTION 14. DISTRIBUTION FUNCTIONS 195 Proof of Theorem 14.2. Suppose first that un = 1 and vn = 0. Then (14.14) holds. Fix any subsequence along which an converges to some positive a and bn converges to some b. By Lemma 1, Fn(anx + bn) => F(ax +b) along this subsequence, and the hypothesis gives F(ax + 6) = G(x). Suppose that along some other sequence, an -»и > 0 and bn -> v. Then F(ux + v) = G(x) and F(ax + b) = G(x) both hold, so that и = a and υ = b by Lemma 5. Every convergent subsequence of {(an,bn)} thus converges to (a, b), and so the entire sequence does. For the general case, let Hn(x) = Fn{unx + u„). Then Hn{x)=> F(x) and Ηη(αηιι~[χ +(bn - un)u~l) => G(^), and so by the case already treated, anu~x converges to some positive a and (bn - vn)u~' to some b, and as before, F(ax + 6) = G(jt). ■ Extremal Distributions* A distribution function F is extremal if it is nondegenerate and if, for some distribution function G and constants an (an > 0) and bn, (14.17) G"(a„x + b„)~F(x). These are the possible limiting distributions of normalized maxima (see (14.8)), and Examples 14.1, 14.2, and 14.3 give three specimens. The following analysis shows that these three examples exhaust the possible types. Assume that F is extremal. From (14.17) follow G"k(anx + bn) => Fk(x) and Gnk(ankx + bnk) => F(x), and so by Theorem 14.2 there exist constants ck and dk such that ck is positive and (14.18) Fk(x)=F(ckx + dk). From F(cjkx + djk) = FJk(x) = F'(ckx + dk) = F(cfckx + dk) + dj) follow (Lemma 5) the relations (14.19) cjk = CjCk, djk=cjdk+dj = ckdj + dk. Of course, Cj = 1 and dx = 0. There are three cases to be considered separately. Case 1. Suppose that ck = 1 for all fc. Then (14.20) Fk(x)=F(x + dk), Fl/k(x) = F(x-dk). This implies that Fi/k(x) =F(x + d — dk). For positive rational r=j/k, put 8r=d: ~dk; (14.19) implies that the definition is consistent, and Fr(x) = Fix + Sr). Since F is nondegenerate, there is an χ such that 0 < F(x) < 1, and it follows by (14.20) that dk is decreasing in k, so that 8r is strictly decreasing in r. This topic may be omitted.
196 MEASURE For positive real t let ^(i) = inf0<r</ Sr (r rational in the infimum). Then <p(t) is decreasing in t, and (14.21) F,(x)=F(x + 9(t)) for all χ and all positive t. Further, (14.19) implies that cpist) = cpis) + cp(t), so that by the theorem on Cauchy's equation [A20] applied to <p(ex\ <p(t)= —β log t, where β > 0 because cp(t) is strictly decreasing. Now (14.21) with t = ex/p gives Fix) = exp{e~x/l3 log /40)}, and so F must be of the same type as (14.22) f,(*)=«-'"'. Example 14.1 shows that this distribution function can arise as a limit of distributions of maxima—that is, Fl is indeed extremal. Case 2. Suppose that ck Φ 1 for some k0, which necessarily exceeds 1. Then there exists an x' such that ckix'"+dk=x'; but (14.18) then gives Fk»(x') = Fix'), so that Fix') is 0 or 1. (In Case 1, F has the type (14.22) and so never assumes the values 0 and 1.) Now suppose further that, in fact, /r(jc'):=0. Let xQ be the supremum of those χ for which F(x) = 0. By passing to a new F of the same type one can arrange that x0 = 0; then Fix) =- 0 for χ < 0 and Fix) > 0 for x> 0. The new F will satisfy (14.18), but with new constants dk. If a (new) dk is distinct from 0, then there is an χ near 0 for which the arguments on the two sides of (14.18) have opposite signs. Therefore, dk = 0 for all k, and (14.23) Fk(x)=F(ckx), Fx/k(*)=F(lr) for all к and x. This implies that Fi/kix) =F(xcj/ck). For positive rational r=j/k, put yr = Cj/ck. The definition is again consistent by (14.19), and Frix) = Fiyrx). Since 0 <F(jc)< 1 for some x, necessarily positive, it follows by (14.23) that ck is decreasing in k, so that y, is strictly decreasing in r. Put φ(ί) = inftl<r<l yr for positive real t. From (14.19) follows i/i(st) = ψ(ε)ψθ), and by the corollary to the theorem on Cauchy's equation [A20] applied to ф(ех\ it follows that ψ(ί) = t~( for some ξ > 0. Since F'(x) = F(<l/(t)x) for all χ and for t positive, Fix) = ехр{д:_ l/( log FiD) for χ > 0. Thus (take α = Ι/ξ) F is of the same type as Example 14.2 shows that this case can arise. Case 3. Suppose as in Case 2 that ck Φ 1 for some k0, so that Fix') is 0 or 1 for some x', but this time suppose that Fix ) = 1. Let xx be the infimum of those χ for which Fix)^= 1. By passing to a new F of the same type, arrange that jc, = 0; then Fix) < 1 for χ < 0 and Я·*) = 1 for jc > 0. If dk Φ 0, then for some χ near 0, one side of (14.18) is 1 and the other is not. Thus dk = 0 for all k, and (14.23) again holds. And
SECTION 14. DISTRIBUTION FUNCTIONS 197 again jj/k^Cj/Ck consistently defines a function satisfying Fr(x) = F(yrx). Since F is nondegenerate, 0 < F(x) < 1 for some x, but this time χ is necessarily negative, so that ck is increasing. The same analysis as before shows that there is a positive ξ such that F'(x) = F(t(x) for all χ and for t positive. Thus F(x) = exp{(-xy/t log F(-l)} for χ < 0, and F is of the type (.4.25) ^W=(f""' ΪΑΪ Example 14 3 shows that this distribution function is indeed extremal. This completely characterizes the class of extremal distributions: Theorem 14.3. The class of extremal distribution functions consists exactly of the distribution functions of the types (14.22), (14.24), and (14.25). It is possible to go on and characterize the domains of attraction. That is. it is possible for each extremal distribution function F to describe the class of G satisfying (14.17) for some constants a„ and bn—the class of G attracted to /*".* PROBLEMS 14.1. The general nondecreasing function F has at most countably many discontinuities. Prove this by considering the open intervals (supf(ii), inff(u)) —each nonempty one contains a rational. 14.2. For distribution functions F, the second proof of Theorem 14.1 shows how to construct a measure μ on (R1,^1) such that μ(α, b] = F(b) - F(a). (a) Extend to the case of bounded F. (b) Extend to the general case. Hint: Let F„(x) be —n or F(x) or η as F(x) < —n or —n < F(x) < η or η < F(x). Construct the corresponding μη and define μ(Α) = lim„μπ(Α). 14.3. (a) Suppose that X has a continuous, strictly increasing distribution function F. Show that the random vaiiable F(X) is uniformly distributed over the unit interval in the sense that P[F(X)<u] = u for 0<ы<1. Passing from X to F(X) is called the probability transformation. (b) Show that the function ^(м) defined by (14.5) satisfies F(<p(u) — )< и < /Ч(р(м)) and that, if F is continuous (but not necessarily strictly increasing), then F(^(m)) = и for 0 < и < 1. (c) Show that P[F(X) < и] = /г(^(и) - ) and hence that the result in part (a) holds as long as F is continuous. fThis theory is associated with the names of Fisher, Frechet, Gnedenko, and Tippet. For further information, see Galambos.
198 MEASURE 14.4. Τ Let С be the set of continuity points of F. (a) Show that for every Borel set A, P[F(X) ^A, A'sC] is at most the Lebesgue measure of A. (b) Show that if F is continuous at each point of F~lA, then P[F(X)eA] is at most the Lebesgue measure of A. 14.5. The Levy distance d(F,G) between two distribution functions is the infimum of those e such that G(x - e) - e <F(x) < G(x + e) + e for all x. Verify that this is a metric on the set of distribution functions. Show that a necessary and sufficient condition for Fn => F is that d(Fnl F) -» 0. 14.6. 12.3 Τ A Borel function satisfying Cauchy's equation [A20] is automatically bounded in some interval and hence satisfies fix) =jc/(1). Hint: Take К large enough that k[x\ χ > s, \f(x)\ < K] > 0. Apply Problem 12.3 and conclude that / is bounded in some interval *o the right of 0 14.7. Τ Consider sets S of reals that are linearly independent over the field of rationals in the sense that nixi + ■ ■ · +nkxk — 0 for distinct points jc,- in 5 and integers «,· (positive or negative) is impossible unless «, = 0. (a) By Zorn's lemma find a maximal such S. Show that it is a Hamel basis. That is, show that each real χ can be written uniquely as χ = ηχχλ + · · · +nkxk for distinct points xt in 5 and integers nt. (b) Define / arbitrarily on 5, and define it elsewhere by f(nlxl + ■ · ■ +nkxk) = n,/(jc,) + ■·· +nkf(xk). Show that / satisfies Cauchy's equation but need not satisfy fix) =jc/(1). (c) By means of Problem 14.6 give a new construction of a nonmeasurable function and a nonmeasurable set. 14.8. 14.5 Τ (a) Show that if a distribution function F is everywhere continuous, then it is uniformly continuous. (b) Let Sf.(e) = sup[F(x) — F(y): |л:-у|<е] be the modulus of continuity of F. Show that d(F, G)<e implies that supx \F(x) - G(x)\ <e+ 8F(e). (c) Show that, if Fn=>F and F is everywhere continuous, then F„(x)-*F(x) uniformly in x. V/hat if F is continuous over a closed interval? 14.9. Show that (14.24) and (14.25) are everywhere infinitely differentiable, although not analytic
CHAPTER 3 Integration SECTION 15. THE INTEGRAL Expected values of simple random variables and Rieinann integrals of continuous functions can be brought together with other related concepts under a general theory of integration, and this theory is the subject of the present chapter. Definition Throughout this section, /, g, and so on will denote real measurable functions, the values ±00 allowed, on a measure space (Ω, ^,μ).* The object is to define and study the definite integral /*№ = /7(«)M«) = (ί(ω)μ(άω). 1 Jn Jn Suppose first that / is nonnegative. For each finite decomposition {Л,·} of Ω into ^sets, consider the sum (15.1) inf /(ω) β(Α,). In computing the products here, the conventions about infinity are 0·οο=οο·0=0, (15.2) χ ■ 00 = oo-jc = 00 if0<jc<oo, 00 · 00 = 00. ^Although the definitions (15.3) and (15.6) apply even if / is not measurable &, the proofs of most theorems about integration do use the assumption of measurability in one way or another. For the role of measurability, and for alternative definitions of the integral, see the problems. 199
200 INTEGRATION The reasons for these conventions will become clear later. Also in foice are the conventions of Section 10 for sums and limits involving infinity; see (10.3) and (10.4). If A; is empty, the infimum in (15.1) is by the standard convention oo; but then μ.(/4(·) = 0, so that by the convention (15.2), this term makes no contribution to the sum (15.1). The integral of / is defined as the supremum of the sums (15.1): (15.3) //<*/* = sup Σ inf /(ω) ω<Ξ/4, H(A,). The supremum here extends over all finite decompositions {Αέ} of Ω into ^sets. For general /, consider its positive part, (15.4) Π«) = and its negative part, (15.5) ГЫН ί/(ω) ΐί0</(ω)<οο, ,0 ίί-°ο<^(ω)<0 -/(ω) if -°ο</^(ω) <0, 0 ΐί0</(ω)<οο. These functions are nonnegative and measurable, and / = /+—/ . The general integral is defined by (15.6) (/α·μ=(/+άμ-(Γάμ, unless //+ άμ = //" άμ =■ oo, in which case / has no integral. If //+ άμ and //_ άμ are both finite, then / is integrable, or integrable μ, or summable, and has (15.6) as its aefinite integral. If //+ άμ = °° and //" άμ < oo, then / is not integrable but in accordance with (15.6) is assigned oo as its definite integral. Similarly, if //+ άμ < °o and //" άμ = oo, then / is not integrable but has definite integral -oo. Note that / can have a definite integral without being integrable; it fails to have a definite integral if and only if its positive and negative parts both have infinite integrals. The really important case of (15.6) is that in which //+ άμ and //" άμ are both finite. Allowing infinite integrals is a convention that simplifies the statements of various theorems, especially theorems involving nonnegative functions. Note that (15.6) is defined unless it involves "oo - oo"; if one term on the right is oo and the other is a finite real x, the difference is defined by the conventions oo - χ = oo and χ - oo = - oo. The extension of the integral from the nonnegative case to the general case is consistent: (15.6) agrees with (15.3) if / is nonnegative, because then Г-0.
SECTION 15. THE INTEGRAL 201 Nonnegative Functions It is convenient first to analyze nonnegative functions. Theorem 15.1. (i) ///= Σ,·*,/^ is a nonnegative simple function, {Л(} being a finite decomposition of Ω into &-sets, then \$άμ = Σίχίμ(Αί). (ii) // 0 </(ω) <g(w) for all ω, then \$άμ < ^άμ. (in) // 0 </„(ω)Τ /(ω) for all ω, then 0 < //„ άμ | \$άμ. (iv) For nonnegative functions f and g and nonnegative constants a and β, |(af + βg)dμ=aJfdμ+β|gdμ. In part (iii) the essential point is that \$άμ = lim„ //„ άμ, and it is important to understand that both sides of this equation may be oo. If fn ■= lA and f = IA, where An | A, the conclusion is that μ is continuous from below (Theorem 10.2(0): Ιϊτηημ(Αη) = μ(Α); this equation often takes the form oo — oo. Proof of (i). Let {β·} be a finite decomposition of ii and let /3, be the infimum of / over Br If Α,Γ\Β;Φ 0, then β} <x,; therefore, Σ]β]μ(Β]) = Σ,^Μ(Λ,ηβρ)<Ε,νχ,·μ(Λ,ι~>β;·)= Σ,-χ,-μί/Ι,·). On the other hand, there is equality here if {Bj) coincides with {А). Ш Proof of (ii). The sums (15.1) obviously do not decrease if / is replaced byg. ■ Proof of (iii). By (ii) the sequence //„ άμ is nondecreasing and bounded above by jfdn. It therefore suffices to show that \$άμ < lim„ //„ άμ, or that m (15.7) Пт//„^>5= Χ>,μ(Λ,.) " i=\ if Ax,...,Am is any decomposition of Ω into ^sets and vt = ϊηίω(Ξ/4./(ω). In order to see the essential idea of the proof, which is quite simple, suppose first that S is finite and all the y,- and /х(Л;) are positive and finite. Fix an € that is positive and less than each f,, and put Ain = [ω еЛ,: /„(ω) > у,· - е). Since /„ Τ f, Ain Τ Л;. Decompose Ω into A,„,..., Amn and the complement of their union, and observe that, since μ is continuous from below, m m (15.8) ί/ηάμ> Σ(νί~€)μ(Αίη)^ Σ(ν,-€)μ(Α1) ί=1 i=l m = 5-бЕд(Л,). \=\ Since the /х(Л,·) are all finite, letting e -> 0 gives (15.7).
202 INTEGRATION Now suppose only that 5 is finite. Each product ι>,·μ.(/4,·) is then finite; suppose it is positive for i<m0 and 0 for i>m0. (Here m0<m; if the product is 0 for all i, then 5 = 0 and (15.7) is trivial.) Now y, and μ.(/4,) are positive and finite for i < m0 (one or the other may be oo for i > m0). Define Ain as before, but only for i <m0. This time decompose Ω into Ain,..., Am n and the complement of their union. Replace m by mQ in (15.8) and complete the proof as before. Finally, suppose that S = <χ. Then υίιμ(Αί ) = oo for some i0, so that vt and μ(Α( ) are both positive and at least one is oo. Suppose 0 <x < y, < oo and 0<y <μ(Αίί)<οο, and put Л(> = [ω еЛ/и: fn(w)>x]. From /„ Τ / follows Ain\At; hence μ(Αίη)>γ for η exceeding some n0. But then (decompose Ω into A{ and its complement) }/„άμ>χμ(Αι n) >xy for η > n0, and therefore lim„ //„ άμ > xy. If v-t = oo, let χ -»oo, and if μ(Α^) = oo, let у -> oo. in either case (15.7) follows: lim„ //„ άμ = oo. ■ Proof of (iv). Suppose at first that /=Σ,χ,·//4 and g = Y,jyjIB. are simple. Then af + βg = Συ(αχι+βγί)1Α.ηΒ., and so {(α/ + β8)άμ=Σ(<ΧΧί + βν])μ(ΑίηΒί) ij = «LxiH(Ai)+fiLyjV(Bj)=<xff<*H+fifg<in- ' i Note that the argument is valid if some of α, β, xh yj are infinite. Apart from this possibility, the ideas are as in the proof of (5.21). For general nonnegative / and g, there exist by Theorem 13.5 simple functions /„ and gr such that 0 </„ Τ / and 0 <gn Τ g. But then 0 <afn + βδη Τ осf + j8g and j(af„ + /3g„) d/x = <*//„ ^Д + Э/§„ ^Д>so that (iv) follows from (iii). ■ By part (i) of Theorem 15.1, the expected values of simple random variables in Chapter 1 are integrals: E[X) = (Χ(ω)Ρ(άω). This also covers the step functions in Section 1 (see (1.6)). The relation between the Riemann integral and the integral as defined here will be studied in Section 17. Example 15.1. Consider the line (Λ',^',Α) with Lebesgue measure. Suppose that - oo < α0 < α, < · · · < am < oo, and let / be the function with nonnegative value x,- on (ai_l, a,·], / = 1,... ,.m, and value 0 on (- oo, a0] and (flm,oo). By part (i) of Theorem 15.1, jfdk = Σ^,χ,ία,- - «,_,) because of the convention 0 · oo = о—see (15.2). If the "area under the curve" to the left of a0 and to the right of am is to be 0, this convention is inevitable. From oo · 0 = 0 it follows that jfdk = 0 if / is oo at a single point (say) and 0 elsewhere.
SECTION 15. THE INTEGRAL 203 If /=/(β,Μ). the area-under-the-curve point of view makes }/άμ = <χ> natural. Hence the second convention in (15.2), which also requires that the integral be infinite if / is oo on a nonempty interval and 0 elsewhere. ■ Recall that almost everywhere means outside a set of measure 0. Theorem 15.2. Suppose that f and g are nonnegative. (i) ///=0 almost everywhere, then {/άμ = 0. (ii) // μ[ω: /(ω) > 0] > 0, then {/άμ > 0. (iii) // {/άμ < oo, then / < oo almost everywhere. (iv) If '/<g almost everywhere, then {/άμ < ^άμ. (ν) ///= g almost everywhere, then {ίάμ = fgdμ. Proof. Suppose that /= 0 almost everywhere. If Л, meets [ω: /(ω) = 0], then the infimum in (15.1) is 0; otherwise, μ(/4,) = 0. Hence each sum (15.1) is 0, and (i) follows. If Ae = [ω: /(ω)> e], then Ae Τ[ω: /(ω) > 0] as e j.0, so that under the hypothesis of (ii) there is a positive e for which μ(Α() > 0. Decomposing Ω into Ae and its complement shows that f/άμ > βμ(Αε) > 0. If μ[/= oo] > 0, decompose Ω into [/= oo] and its complement: f/άμ > oo · μ[/= oo] = oo by the conventions. Hence (iii). To prove (iv), let G = [f<g]. For any finite decomposition {Л,,..., Am} of Ω, inf/ A: ^Σ inf/ μ(ΑίηΟ)<Σ\ inf / A,nG μ(ΑίΠθ) inf g AtnG μ(AlnG)^fgdμ, where the last inequality comes from a consideration of the decomposition A, Π G,..., Am Π G, Gc. This proves (iv), and (v) follows immediately. ■ Suppose that f = g almost everywhere, where / and g need not be nonnegative. If / has a definite integral, then since /+ = g+ and /" = g~~ almost everywhere, it follows by Theorem 15.2(v) that g also has a definite integral and j/άμ = ^άμ. Uniqueness Although there are various ways to frame the definition of the integral, they are all equivalent—they all assign the same value to ί/άμ. This is because the integral is uniquely determined by certain simple properties it is natural to require of it. It is natural to want the integral to have properties (i) and (iii) of Theorem 15.1. But these uniquely determine the integral for nonnegative functions: For / nonnegative, there exist by Theorem 13.5 simple functions fn such that 0</„T/; by (iii), {/άμ must be lim„ //„ άμ, and (i) determines the value of each // άμ.
204 INTEGRATION Property (i) can itself be derived from (iv) (linearity) together with the assumption that \1Αάμ = μ(Α) for indicators lA. /(Σ,·*,/^)^ = Τ.ϊΧ-,\1Α.άμ = Σ,·^,μ(/1,). If (iv) of Theorem 15.1 is to persist when the integral is extended beyond the class of nonnegative functions, )/άμ must be /(/+-/~Ыд = //+ άμ — }f~ άμ, which makes the definition (15.6) inevitable. PROBLEMS These problems outline alternative definitions of the integral and clarify the role measurability plays. Call (15.3) the lower integral, and write it as (15 9) [ /<^=sup£ inf /(ω) μ(Α,) to distinguish it from the upper integral (15.10) /74* = inf Σ sup /(ω) μ(Λ,)· The infimum in (15 10), like the supremum in (15.9), extends over all finite partitions [Aj] of il into insets. 15.1. Suppose that / is measurable and nonnegative. Show that }*/άμ = °χ> if μ[ω: /(ω) > 0] = oo or if μ[ω: /(ω) > a] > 0 for all a There are many functions familiar from calculus that ought to be integrable but are of the types in the preceding problem and hence have infinite upper integral. Examples are x~2I{l ^(x) and ·*~ι/2/(0,ι)(■*)· Therefore, (15.10) is inappropriate as a definition of {/άμ for nonnegative /. The only problem with (15.10), however, is that it treats infinity the wrong way. To see this, and to focus on essentials, assume that μ(Ω) < oo and that / is bounded, although not necessarily nonnegative or measurable &. 15.2. Τ (a) Show that E[ inf /(«)]μ(Λ)^Σ inf /(ω) *W if {Bj} refines {/!,}. Prove a dual relation for the sums in (15.10) and conclude that (15.11) Ι/α·μ<Ι*/άμ. (b) Now assume that / is measurable & and let Μ be a bound for |/|. Consider the partition Λ,. = [ω: ie </(ω) <(ί + l)e], where ι ranges from -N
SECTION 15. THE INTEGRAL 205 to N and N is large enough that Ne > M. Show that sup /(ω) μ(Α,)-Σ\ inf/(«) μ(Λ,)<6μ(Ω). Conclude that (15.12) Ι/άμ=/*/άμ. To define the integral as the common value in (15.12) is the Darboux-Ycung approach. The advantage of (15.3) as a definition is that (in the nonnegative case) it applies at once to unbounded / and infinite μ 15.3. 3.2 15.2T For AcQ,, define μ*{Α) and μ*(Α) by (3.9) and (3.10) with μ in place of P. Show that \*1Αάμ = μ*(Α) and \*1Αάμ =μίι.{Α) for every A. Therefore, (15.12) can fail if / is not measurable &. (Where was measurability used in the proof of (15.12)?) The definitions (15.3) and (15.6) always make formal sense (for finite μ(Ω) and supi/|), but they are reasonable—accord with intuition—only if (15.12) holds. Under what conditions does it hold? 15.4. 10.5 15.3 Τ (a) Suppose of / that there exist an S^-set A and a function g, measurable fF, such that μ(/1) = 0 and [fl=g]^A. This is the same thing as assuming that μ*[fΦg] = 0, or assuming that / is measurable with respect to 5r completed with respect to μ. Show that (15.12) holds. (b) Show that if (15.12) holds, then so does the italicized condition in part (a). Rather than assume that / is measurable JF, one can assume that it satisfies the italicized condition in Problem 15.4(a)—which in case (Ω, &, μ) is complete is the same thing anyway. For the next three problems, assume that μ(Ω) < <» and that / is measurable У and bounded. 15.5. Τ Show that for positive e there exists a finite partition {/!,·} such that, if {Bj} is any finer partition and ω, еВр then //4* "Σ/(«>(*;) <e. 15.6. Τ Show that j/άμ = lim Σ -^ϊϊ-μ \k\<n2" к-1 .. . к The limit on the right here is Lebesgue's definition of the integral.
206 INTEGRATION 15.7. Τ Suppose that the integral is defined for simple nonnegative functions by /(Σ,-jc,-/^ )ί/μ = LjX^iAj). Suppose that /„ and g„ are simple and nondecreas- ing and have a common limit: 0 <fn Τ / and 0 <g„ Τ /· Adapt the arguments used to prove Theorem 15.1(iii) and show that lim„ //„ άμ = lim„ fg„ άμ. Thus, in the nonnegative case, f/άμ can (Theorem 13.5) consistently be defined as lim„ //„ άμ for simple functions for which 0 </„ Τ /· SECTION 16. PROPERTIES OF THE INTEGRAL Equalities and Inequalities By definition, the requirement for integrability of / is that //+ άμ and //" άμ both be finite, which is the same as the requirement that //+ άμ + jf~ άμ < oo and hence is the same as the requirement that /(/+ + /")ti/x < oo (Theorem 15.1(iv)). Since /++/"= i/l, / is integrable if and only if (16.1) /|/|d/*<oo. it follows that if |/|<|g| almost everywhere and g is integrable, then / is integrable as well. If μ,(Ω) < oo, a bounded / is integrable. Theorem 16.1. (i) Monotonicity: If f ana g are integrable ana f <g almost everywhere, then (16.2) //άμ<Ι8άμ. (ii) Linearity: If f ana g are integrable and α, β are finite real numbers, then af + β8 is integrable ana (16.3) ((α/+β8)άμ = αΙ/άμ+βΙ§άμ. Proof of (i). For nonnegative / and g such that f<g almost everywhere, (16.2) follows by Theorem 15.2(iv). And for general integrable / and g, if f <g almost everywhere, then /+<g+ and f~>g~ almost everywhere, and so (16.2) follows by the definition (15.6). ■ Proof of (ii). First, af + fig is integrable because, by Theorem 15.1, /k/ + /3gk/z</(|tf|-|/l + l/3HgH/x = W/l/|rf/* + |j8|/|g|rf/*<oo.
SECTION 16. PROPERTIES OF THE INTEGRAL 207 By Theorem 15.1(iv) and the definition (15.6), }(α/)άμ = α{/άμ—consider separately the cases a > 0 and a < 0. Therefore, it is enough to check (16.3) for the case a = β = I. By definition, (f + g)+-(f + g)~ = f+g =f+-f~ + g+-g~ and therefore (f + g)+ + f~ + g~ = (f + g)~ + f+ + g+. All these functions being nonnegative, /(/ + g)+ άμ + ff~ άμ + fg~ άμ = f(f + g)~ άμ + ff+ άμ + fg+ άμ, which can be rearranged to give f(f + g)+dμ-f(f + g)~ άμ = ff+ άμ - ff άμ + jg+ άμ - fg~ άμ. But this reduces to (16.3). ■ Since -|/| </< l/l, it follows by Theorem 16.1 that (16 4) f/άμ <{\/\άμ for integrable /. Applying this to integrable / and g gives (16.5) ffdμ-fgdμ <.j\f-g\dii. Example 16.1. Suppose that Ci is countable, that & consists of all the subsets of Ω, and that μ is counting measure: each singleton has measure 1. To be definite, take Ω ■-= {1,2,...}. A function is then a sequence xx, x2,·.· · If xnm is xm or 0 as m < η or m> n, the function corresponding to xnl, xn2,··· has integral Yfm = xxm by Theorem 15.1(0 (consider the decomposition {1},...,{«}, [n + I,n +2,...}). It follows by Theorem I5.1(iii) that in the nonnegative case the integral of the function given by [xm] is the sum Lmxm (finite or infinite) of the corresponding infinite series. In the general case the function is integrable if and only if E^, = 1|jcm| is a convergent infinite series, in which case the integral is Σ™ =1x^ ~^m^\x~m· The function xm - (- l)m + 1m_1 is not integrable by this definition and even fails to have a definite integral, since YTm = xxJ'm = YTm = xx^n = °°. This invites comparison with the ordinary theory of infinite series, according to which the alternating harmonic series does converge in the sense that lim^ Σ^=1(- l)m + 1m_1 = log2. But since this says that the sum of the first Μ terms has a limit, it requires that the elements of the space Ω be ordered. If Ω consists not of the positive integers but, say, of the integer lattice points in 3-space, it has no canonical linear ordering. And if Zmxm is to have the same finite value no matter what the order of summation, the series must be absolutely convergent.1 This helps to explain why / is defined to be integrable only if //+ άμ and //" άμ are both finite. ■ Example 16.2. In connection with Example 15.1, consider the function /= 3/(a «,) - 2I(_x<a). There is no natural value for jfaX (it is "<» - oo"), and none is assigned by the definition. ^UDINj, p. 76.
208 INTEGRATION If a function / is bounded on bounded intervals, then each function /„=//(_„„) is integrable with respect to Λ. Since /=lim„/„, the limit of //„ dX, if it exists, is sometimes called the "principal value" of the integral of /. Although it is natural for some purposes to integrate symmetrically about the origin, this is not the right definition of the integral in the context of general measure theory. The functions g„=fl(-„,n + \) for example also converge to /, and jgn άΧ may have some other limit, or none at all; f(x) = x is a case in point. There is no general reason why /„ should take precedence over g„. As in the preceding example, /= E* = t(- l)kk'4(k k + X) has no integral, even though the //„ d\ above converge. ■ Integration to the Limit The first result, the monotone convergence theorem, essentially restates Theorem 15.1(iii). Theorem 16.2. // 0 </„ Τ f almost everywhere, then jfn άμ | {/άμ. Proof. If 0 </„ Τ / on a set A with μ(Α€) = 0, then 0 <fnIA Τ flA holds everywhere, and it follows by Theorem 15.1(iii) and the remark following Theorem 15.2 that //„ άμ = //„ IA άμ \ }fIA άμ = f/άμ. Ш As the functions in Theorem 16.2 are nonnegative almost everywhere, all the integrals exist. The conclusion of the theorem is that lim„ //„ άμ and (/άμ are both infinite or both finite and in the latter case are equal. Example 16.3. Consider the space {1,2,...} together with counting measure, as in Example 16.1. If for each m one has 0 <xnm Τ xm as η -»oo, then lim„ Lmxnm = Emxm, a standard result about infinite series. ■ Example 16.4. If μ is a measure on 5^ and S^ is a σ-field contained in 5^ then the restriction μ0 of μ to ^ is another measure (Example 10.4). If f=IA and A <= &~0, then j/άμ = f/άμο, the common value being μ(Α) = μ0(Α). The same is true by linearity for nonnegative simple functions measurable 3^. It holds by Theorem 16.2 for all nonnegative / that are measurable 5^ because (Theorem 13.5) 0 <fn | / for simple functions fn that are measurable S^. For functions measurable S^, integration with respect to μ is thus the same thing as integration with respect to μ0. ■
SECTION 16. PROPERTIES OF THE INTEGRAL 209 In this example a property was extended by linearity from indicators to nonnegative simple functions and thence to the general nonnegative function by a monotone passage to the limit. This is a technique of very frequent application. Example 16.5. For a finite or infinite sequence of measures μη on &~, μ(Α) = Σημη(Α) defines another measure (countably additive because [A27] sums can be reversed in a nonnegative double series). For indicators /, jfdn=Zffdnn, η and by linearity the same holds for simple /> 0. If 0 < fk t / for simple fk, then by Theorem 16.2 and Example 16.3, //άμ = lim^ jfk άμ = lim^ Lnffk άμη = Ση Уапк ffk άμη = Ση{/άμη. The relation in question thus holds for all nonnegative /. ■ An important consequence of the monotone convergence theorem is Fatou's lemma: Theorem 16.3. For nonnegative fn, (16.6) I liminf/„ άμ < liminf (/„ άμ. Proof. If gn = Ык >nfk, then 0 <g„ Τ g = liminf„ /„, and the preceding two theorems give //„ άμ > fgn άμ -* ^άμ. ■ Example 16.6. On (Λ',^',Λ), the functions /„ = "2/(0,„-') and f=Q satisfy /„(*) -*f(x) for each x, but jfa\ = 0 and //„ άλ = η -»oo. This shows that the inequality in (16.6) can be strict and that it is not always possible to integrate to the limit. This phenomenon has been encountered before; see Examples 5.7 and 7.7. ■ Fatou's lemma leads to Lebesgue's aominatea convergence theorem: Theorem 16.4. // |/„| < g almost everywhere, where g is integrable, ana if fn -»/ almost everywhere, then f ana the fn are integrable ana ffn άμ -> f/άμ. Proof. Assume at the outset, not that the fn converge, but only that they are dominated by an integrable g, which implies that all the fn as well
210 INTEGRATION as /* = limsupn/„ and /* = liminffn are integrable. Since g+/„ and g - f„ are nonnegative, Fatou's lemma gives ^άμ + ff* άμ = fliminf (g +/„) άμ < lim inf f(g+f„) άμ = (gd/x + lim inf ffn άμ, and Jgcf/x - ff*d\i = Jliminf (g -/„) cf/x <liminf J(g-/r)d/x = Jgάμ -hmsup j/,,άμ. " η Therefore (16.7) \\ιχηιηίΙηάμ<\\νΛΜ\ϊηάμ < lim sup Jfn άμ < Aim sup/„ άμ. η η (Compare this with (4.9).) Now use the assumption that f„-*f almost everywhere: / is dominated by g and hence is integrable, and the extreme terms in (16.7) agree with f/άμ. ■ Example 16.6 shows that this theorem can fail if no dominating g exists. Example 16.7. The Weierstrass M-test for series. Consider the space {1,2....} together with counting measure, as in Example 16.1. If \xnm\<Mm and LmMm < oo and if lim„ x„m =xm for each m, then lim„ Zmx„m = Zmxm. mm7 η rim τη * η τη nm τη τη This follows by an application of Theorem 16.4 with the function given by the sequence M,,M2,... in the role of g. This is another standard result on infinite series [A28]. ■ The next result, the boundea convergence theorem, is a special case of Theorem 16.4. It contains Theorem 5.4 as a further special case. Theorem 16.5. // μ,(Ω) < oo and the fn are uniformly bounaea, then fn ->/ almost everywhere implies jfn άμ -» \$άμ. The next two theorems are simply the series versions of the monotone and dominated convergence theorems.
SECTION 16. PROPERTIES OF THE INTEGRAL 211 Theorem 16.6. ///„ > 0, then / Σ„ /„ άμ = Σ„ //„ άμ. The members of this last equation are both equal either to oo or to the same finite, nonnegative real number. Theorem 16.7. // Σ„/„ converges almost everywhere and \L"k = ifk\<g almost everywhere, where g is integrable, then Σ„/„ and the fn are integrable and {Σηί„άμ = Ση}/„άμ. Corollary. // Σ„{\/η\άμ <°°, then Σ„/„ converges absolutely almost everywhere and is integrable, and /Σ„/„ άμ = Σ„//„ άμ. Proof. The function g = Σn\fn\ is integrable by Theorem 16.6 and is finite almost everywhere by Theorem 15.2(iii). Hence Σ„|/„| and Σ„/„ converge almost everywhere, and Theorem 16.7 applies. ■ In place of a sequence {/„} of real measurable functions on (Ω, У, μ), consider a family [/,: t > 0] indexed by a continuous parameter t. Suppose of a measurable / that (16 8) lim/,(«)=/(») on a set A, where (16.9) Aef, μ(Ω-Α) = 0. A technical point arises here, since У need not contain the ω-set where (16.8) holds: Exampie 16.8. Let У consist of the Borel subsets of Ω = [0,1), and let Η be a nonmeasurable s>et—a subset of Ω that does not lie in У (see the end of Section 3). Define /,(ω) = 1 if ω equals the fractional part t — [t\ of t and their common value lies in Hc; define /,(ω) = 0 otherwise. Each /, is measurable У, but if /(ω) = 0, then the ω-set where (16.8) holds is exactly H. Ш Because of such examples, the set A above must be assumed to lie in У. (Because of Theorem 13.4, no such assumption is necessary in the case of sequences.) Suppose that / and the /, are integrable. If /, = //, άμ converges to / = j/άμ as t -»oo, then certainly /, -»/ for each sequence [tn] going to infinity. But the converse holds as well: If /, does not converge to /, then there is a positive e such that I/, -/|>e for a sequence Un) going to infinity. To the question of whether /,_ converges to / the previous theorems apply. Suppose that (16.8) and \f,(u>)\<g((u) both hold for и£/1, where A satisfies (16.9) and g is integrable. By the dominated convergence theorem, / and the /, must then be integrable and /, -»/ for each sequence [tn} going to infinity. It follows that //, άμ -» j/άμ. In this result t could go continuously to 0 or to some other value instead of to infinity.
212 INTEGRATION Theorem 16.8. Suppose that /(ω, t) is a measurable and integrable function of ω for each t in (a,b). Let φ(ί) = //(ω, ί)μ(άω). (i) Suppose that for ω еЛ, where A satisfies (16.9), /(ω, t) is continuous in t at t0; suppose further that Ι/(ω, г)| < g(<u) for ω e A and \t —10\ < δ, where δ is independent of ω and g is integrable. Then φ(ί) is continuous at tQ. (ii) Suppose that for ω еЛ, where A satisfies (16.9), /(ω, t) has in (a, b) a deriuatiue /'(ω, г); suppose further that \f'(u>,t)\<g(cu) for ω еЛ and ί e (a, 6), where g is integrable. Then φ(ί) has deriuatiue jf'ioy, ί)μ(άω) on (a, b). Proof Part (i) is an immediate consequence of the preceding discussion. To prove part (ii), consider a fixed t. If ω e/1, then by the mean-value theorem, f((Q,t+h)-f(eu,t) Jj =J (ω,ί), where s lies between / and t + h. The ratio on the left goesf to /'(ω, t) as h -» 0 and is by hypothesis dominated by the integrable function g(w). Therefore, /J = J 1 μ(άω)-* Ι/(ω,ί)μ(άω). U The condition involving g in part (ii) can be weakened. It suffices to assume that for each t there is an integrable g(o>,t) such that |/'(ω, s)\ <g(cj,t) for ω eA and all s in some neighborhood of t. Integration over Sets The integral of / over a set A in &~ is defined by (16.10) /άμ= ΙΑ/άμ. JA J The definition applies if / is defined only on A in the first place (set /=0 outside Л). Notice that fAfd[i = 0 if μ( A) = 0. All the concepts and theorems above carry over in an obvious way to integrals over A. Theorems 16.6 and 16.7 yield this result: Theorem 16.9. If Αλ, A2,... are disjoint, and if f is either nonnegative or integrable, then \^ηΑηίάμ = Σ„{Α^άμ. +Letting h go to 0 through a sequence shows that each /'(·, t) is measurable 5*"on A; take it to be 0, say, elsewhere.
SECTION 16. PROPERTIES OF THE INTEGRAL 213 The integrals (16.10) usually suffice to determine /: Theorem 16.10. (i) If f and g are nonnegative and {Α/άμ = \Α%άμ for all A in &~, and if μ is σ-finite, then f = g almost everywhere. (ii) If f and g are integrable and \Α$άμ = \^άμ for all A in ^, then f= g almost everywhere. (iii) If f and g are integrable and \Α}άμ = \^άμ for all A in &, where &> is a TT-system generating ^ and Ω is a finite or countable union of &-sets, then f=g almost everywhere. Proof. Suppose that / and g are nonnegative and that {Α/άμ < \Αξάμ for all A in !F. If μ is σ-finite, there are J^sets An such that A„ Τ Ω and /х(Л„)<о°. If B„ = [0<g</, g<n], then the hypothesized inequality applied to AnnBn implies ΙΑηηΒ„ί<ίμ < \ΑηΓλΒβάμ <°ο (finite because ΑηΓ\Βη has finite measure and g is bounded there) and hence fIA nB(/-g)ii/x = 0. But then by Theorem 15.2(H), the integrand is 0 almost everywhere, and so μ(Αη Γ. Βη) = 0. Therefore, μ[0 < g </, g < ooj = 0, so that f <g almost everywhere; (i) follows. The argument for (ii) is simpler: If / and g are integrable and }Afdμ < Ι^άμ for all A in ,9*", then fl[g<f](f - g)dμ = 0 and hence μ[g </] = 0 by Theorem 15.2(H). Part (iii) for nonnegative / and g follows from part (ii) together with Theorem 10.4. For the general case, prove that f+ + g~ = f~ + g+ almost everywhere. ■ Densities Suppose that δ is a nonnegative measurable function and define a measure ν by (Theorem 16.9) (16.11) ν{Α)=\δάμ, A&F; JA δ is not assumed integrable with respect to μ. Many measures arise in this way. Note that μ(Α) = 0 implies that v( A) = 0. Clearly, ν is finite if and only if δ is integrable μ. Another function δ' gives rise to the same ν if δ = δ' almost everywhere. On the other hand, v(A) = jA8'άμ and (16.11) together imply that δ = δ' almost everywhere if μ is σ-finite, as follows from Theorem 16.100). The measure ν defined by (16.11) is said to have density 8 with respect to μ. A density is by definition nonnegative. Formal substitution dv = δάμ gives the formulas (16.12) and (16.13).
214 INTEGRATION Theorem 16.11. If ν has density δ with respect to μ, then (16.12) jfdv = ffSdii holds for nonnegatwe f. Moreover, f (not necessarily nonnegatwe) is integrable with respect to ν if and only if f8 is integrable with respect to μ, in which case (16.12) and (16.13) ίfdv=[fδdμ } A 3 A both hold. For nonnegative f, (16.13) always holds. Here /δ is to be taken as 0 if /= 0 or if δ = 0; this is consistent with the conventions (15.2). Note that ν[δ = 0] = 0. Proof. If f = IA, then jfdv = v(A), so that (16.12) reduces to the definition (16.11). If / is a simple nonnegative function, (16.12) then follows by linearity. If / is nonnegative, then jfndv = \$ηδάμ for the simple functions /„ of Theorem 13.5, and (16.12) follows by a monotone passage to the limit—that is, by Theorem 16.2. Note that both sides of (16.12) may be infinite. Even if / is not nonnegative, (16.12) applies to |/|, whence it follows that / is integrable with respect to ν if and only if /δ is integrable with respect to μ. And if / is integrable, (16.12) follows from differencing the same result for /+ and /". Replacing / by flA leads from (16.12) to (16.13). ■ Example 16.9. If v(A) = μ(Α ПА0), then (16.11) holds with δ = 1Ац, and (16.13) reduces to \Afdv = fAnA fdμ. ■ Theorem 16.11 has two features in common with a number of theorems about integration: (i) The relation in question, (16.12) in this case, in addition to holding for integrable functions, holds for all nonnegative functions—the point being that if one side of the equation is infinite, then so is the other, and if both are finite, then they have the same value. This is useful in checking for integrabil- ity in the first place. (ii) The result is proved first for indicator functions, then for simple functions, then for nonnegative functions, then for integrable functions. In this connection, see Examples 16.4 and 16.5. The next result is Scheffe's theorem.
SECTION 16. PROPERTIES OF THE INTEGRAL 215 Theorem 16.12. Suppose that vn(A) = \Αδηάμ and v(A) = \Αδάμ for densities 8n and 8. If (16.14) vn(f\) = v(i\)«», « = 1,2,..., and if δη~* δ except on a set of μ-measure 0, then (16.15) sup \V(A) -νη{Α)\<\\δ-δη\άμ-+ 0. AeF Jil Proof. The inequality in (16.15) of course follows from (16.5). Let gn = δ - δη. The positive part g* of gn converges to 0 except on a set of μ,-measure 0. Moreover, 0 < g+ < δ and δ is integrable, and so the dominated convergence theorem applies: /g+ άμ -» 0. But jgn άμ = 0 by (16.14), and therefore ί\8η\άμ=ί 8„άμ-ί gn άμ •Ώ J[gn>0] •/fg„<0] = 2/ 8ηάμ=2(8;άμ-*0. Ш J[g„>0] Jn A corollary concerning infinite series follows immediately—take μ as counting measure on Ω = {1,2,...}. Corollary. // Emx„m = E„.xm < oo, the terms being nonnegatwe, and if lim„ xnm =xm for each m, then lim„ Lm\xnm -xm\ = 0. Ifym is bounded, then \imnLmymxnm = Lmymxm. Change of Variable Let (Ω, ,90 and (Ω', ,90 be measurable spaces, and suppose that the mapping Τ: Ω ~* Ω' is measurable F/' F'. For a measure μ on ,9% define a measure μΤ~' on J*"' by (16.16) μΤ-ι(Α')=μ(Τ'ιΑ'), A'eF', as at the end of Section 13. Suppose / is a real function on Ω' that is measurable F\ so that the composition fT is a real function on Ω that is measurable F (Theorem 13.1(ii)). The change-of-variable formulas are (16.17) and (16.18). If A' = W, the second reduces to the first.
216 INTEGRATION Theorem 16.13. If f is nonnegatiue, then (16.17) ( /(Τω)μ(άω) = ( ί{ω')μΤ~\άω'). Jfi ·/Ω' A function f (not necessarily nonnegatiue) is integrable with respect to μΤ~ι if and only iffTis integrable with respect to μ, in which case (16.17) and (16.18) f /(Τω)μ(άω)= ί /(ω·)μΤ-\άω·) hold. For nonnegatiue f, (16.18) always holds. Proof. If f = IA, then fT = IT-iA, and so (16.17) reduces to the definition (16.16). By linearity, (16.17) holds for nonnegative simple functions. If fn are simple functions for which 0</„T/, then 0 <f„T Τ /Τ, and (16.17) follows by the monotone convergence theorem. Ал application of (16.17) to |/| establishes the assertion about integrabil- ity, and for integrable /, (16.17) follows by decomposition into positive and negative parts. Finally, if / is replaced by fIA, (16.17) reduces to (16.18). ■ Example 16.10. Suppose that (Ω', У) = (Л1, «#') and T= φ is an ordinary real function, measurable SF. If f(x) =x, (16.17) becomes (16.19) { φ(ω)μ(άω)= { χμφ-\άχ). If φ = Σ,'-ϊ,·/^. is simple, then ιιφ~ι has mass /х(Л,) at x,·, and each side of (16.19) reduces to Σ,χ,-Μ^;)· ■ Uniform Iniegrability If / is integrable, then l/U[|/|Sct] g°es to 0 almost everywhere as «-><» and is dominated by |/|, and hence (16.20) lim f \/\άμ = 0. α-»<»·'[|/|>α] A sequence {/„} is uniformly integrable if (16.20) holds uniformly in n: (16.21) lim supf |/„|d/i = 0. α->0° η JUfn\Ha]
SECTION 16. PROPERTIES OF THE INTEGRAL 217 If (16.21) holds and μ(Ω) < °°, and if a is large enough that the supremum in (16.21) is less than 1, then (16.22) /|/„|Λμ<αμ(Ω) + 1, and hence the /„ are integrable. On the other hand, (16.21) always holds if the /„ are uniformly bounded, but the /„ need not in that case be integrable if μ(Ω) =°°. For this reason the concept of uniform integrability is interesting only for μ finite If h is the maximum of |/| and \g\, then f \! + Β\άμ<ΐί ϊΐάμ<2\ \/\άμ + 2ί \8\άμ J\f+g\>2a Jh>a J\P>a J\g\>c Therefore, if {/„} and [gn} are uniformly integrable, so is {/„ + g„). Theorem 16.14. Suppose that μ(Ω) < <» and fn -»/ almost everywhere. (i) If the fn are uniformly integrable, then f is integrable and (16.23) //„^μ-//^μ. (ii) If f and the fn are nonnegatwe and integrable, then (16.23) implies that the fn are uniformly integrable. Proof. If the fn are uniformly integrable, it follows by (16.22) and Fatou's lemma that / is integrable. Define (α)=μ ifi/j<«, (a)j> ifi/i<«, Jn \0 if|/„l>a, У \0 ifl/l>a. If μ[|/|=α] = 0, then f{na) ->/<a) almost everywhere, and by the bounded convergence theorem, (16.24) {№>άμ->{/Μάμ. Since (16-25) ί/„άμ-ί^άμ=ί ίηάμ and (16.26) (ϊάμ-(ϊΜάμ=( ίάμ, J J J\f\^a
218 INTEGRATION it follows from (16.24) that limsup //„ άμ - ffdn < Sup f \f„\dp+f \ί\άμ. 1 J л J\f„\-Za J\f\>c And now (16.23) follows from the uniform integrability and the fact that μ[|/| = a] = 0 for all but countably many a. Suppose on the other hand that (16.23) holds, where / and the /„ are nonnegative and integrable. If μ[/ = α] = 0, then (16.24) holds, and from (16.25) and (16.26) follows (16.27) / /„«/μ-»/ ίάμ. Jf„>a Jf>a Since / is integrable, there is, for given e, an a such that the limit in (16 27) is less than e and μ[/=α] = 0. But then the integral on the left is less than e for all η exceeding some n0. Since the /„ are individually integrable, uniform integrability follows (increase a). ■ Corollary. Suppose that μ(Ω)<°°. If f and the /„ are integrable, and if f„~*f almost everywhere, then these conditions are equivalent: (i) /„ are uniformly integrable; (ϋ)/Ι/-/„Ι^μ-0; (iii) /|/„|Λμ-*/|/|Λμ. Proof. If (i) holds, then the differences I/-/J are uniformly integrable, and since they converge to 0 almost everywhere, (ii) follows by the theorem. And (ii) implies (iii) because 11/| -1/„| | < I/-/„I· Finally, it follows from the theorem that (iii) implies (i). ■ Suppose that (16.28) sup/|/„|1+erfM<~ tt for a positive e. If К is the supremum here, then / \ίη\άμ<^( |/„|'+^μ<£, Jl\fn\*a) a Jl\f„)>«) a and so {/„} is uniformly integrable. Complex Functions A complex-valued function on Ω has the form f(u)) = g(<u) + ih(u)\ where g and h are ordinary finite-valued real functions on Ω. It is, by definition, measurable У if g and h are. If g and h are integrable, then / is by definition integrable, and its integral is of course taken as (16.29) /(*+0ΐ)</μ = |ί</μ+ι]"Λ</μ.
SECTION 16. PROPERTIES OF THE INTEGRAL 219 Since max{|g|, \h\) <\f\ <\g\ + \h\, f is integrable if and only if /|/| άμ <<», just as in the real case. The linearity equation (16.3) extends to complex functions and coefficients—the proof requires only that everything be decomposed into real and imaginary parts. Consider the inequality (16.4) for the complex case: (16.30) f/άμ <\\ί\άμ. If f'= g +ih and g and h are simple, the corresponding partitions can be taken to be the same (g = LkxkIAi and h = YLkykIA ), and (16.30) follows by the triangle inequality. For the general integrable /, represent g and h as limits of simple functions dominated by |/|, and pass to the limit. The results on integration to the limit extend as well. Suppose that fk =gk +ihk are complex functions satisfying Lkf\fk\ άμ < °°. Then Lkf\gk\ άμ < oo, and so by the corollary to Theorem 16.7, Lkgk is integrable and integrates to Lkjgk άμ. The same is true cf the imaginary parts, and hence ~Lkfk is integrable and (16 31) /Σ/*4*=Σ//*4*. PROBLEMS 16.1. 13.9 Τ Suppose that μ(Ω)<<» and /„ are uniformly bounded. (a) Assume fn ->/ uniformly and deduce //„ άμ -»j/άμ from (16.5). (b) Use part (a) and Egoroffs theorem to give another proof of Theorem 16.5. 16.2. Prove that if 0 </„ ->/ almost everywhere and j/,,άμ <A<<*>, then / is integrable and j/άμ <A. (This is essentially the same as Fatou's lemma and is sometimes called by that name.) 16.3. Suppose that fn are integrable and sup„ //„ άμ < <χ>. Show that, if /„ Τ /, then / is integrable and //„ άμ -» j/άμ. This is Beppo Levi's theorem. 16.4. (a) Suppose that functions an,bn,fn converge almost everywhere to functions a,b,f, respectively. Suppose that the first two sequences may be integrated to the limit—that is, the functions are all integrable and jan άμ -> {αάμ, jbn άμ -» ^άμ. Suppose, finally, that the first two sequences enclose the third: an </„ <bn almost everywhere. Show that the third may be integrated to the limit. (b) Deduce Lebesgue's dominated convergence theorem from part (a). 16.5. About Theorem 16.8: (a) Part (i) is local: there can be a different set A for each t0. Part (ii) can be recast as a local theorem. Suppose that for ω eA, where A satisfies (16.9),
220 INTEGRATION /(ω, t) has derivative /'(ω, г0) at tQ; suppose further that /(ω,ί0+Α) -/(ω, ί0) (16.32) **ι(ω) for ω еЛ and 0<|Λ|<δ, where δ is independent of ω and gj is integrable. Then φ'(ί0) = ΙΓ(ω,ί0)μ,(άω). The natural way to check (16.32), however, is by the mean-value theorem, and this requires (for ω ε/1) a derivative throughout a neighborhood of t0 (b) If μ is Lebesgue measure on the unit interval CI, (a, b) = (0,1), and /(ω, t) = 1{Q 0(ω), then part (i) applies but part (ii) does not. Why? What about (16.32)? 16.6. Suppose that /(ω, ·) is, for each ω, a function on an open set W in the complex plane and that /(,z) is for ζ in W measurable У and integrable. Suppose that A satisfies (16.9), that /(ω, ·) is analytic in W for ω in A, and that for each z0 in W there is an integrable g(-,z0) such that |/'(ω,ζ)|< g(u>,zQ) for all ω eA and all ζ in some neighborhood of z0. Show that //(ω, ζ)μ(άω) is analytic in W. 16.7. (a) Show that if \f„\ <g and g is integrable, then {/„} is uniformly integrable. Compare the hypotheses of Theorems 16.4 and 16.14 (b) On the unit interval with Lebesgue measure, let /„ =(n/\ogn)I(0 n-i} for η > 3. Show that the /„ are uniformly integrable (and //„ άμ -> 0) although they are not dominated by any integrable g. (c) Show for /„ =nl(0 ,ri0)-nl(n-i 2n->) that the fn can be integrated to the limit (Lebesgue measure) even though the fn are not uniformly integrable. 16.8. Show that if / is integrable, then for each e there is a δ such that μ(Α)<δ implies {Α\/\άμ < e. 16.9. Τ Suppose that μ(Ω) < <». Show that {/„} is uniformly integrable if and only if (i) /|/„Μμ is bounded and (ii) for each e there is a δ such that μ(Α)<8 implies ΙΑ\/η\άμ <e for all n. 16.10. 2.19 16.9 Τ Assume μ(ίΙ)<°°. (a) Show by examples that neither of the conditions (i) and (ii) in the preceding problem implies the other. (b) Show that (ii) implies (i) for all sequences. {/„} if and only if μ is nonatomic. 16.11. Let / be a complex-valued function integrating to re'e, r > 0. From /(|/(ω)|- ε'ιβ/(ω))μ(άμ) = ί\/\άμ-Γ, deduce (16.30). 16.12. 11.5 Τ Consider the vector lattice _/ and the functional Λ of Problems 11.4 and 11.5. Let μ be the extension (Theorem 11.3) to 5r=a(5J0) of the set function μ0 on £Fq. (a) Show by (11.7) that for positive jc,y,,y2 one has "([/> 1]x(0,jc]) = χμ0[/> \]=χμ[!> l]and ν([γχ<[<γ2]χ(0,χ])=χμ[γι <f<y2].
SECTION 17. THE INTEGRAL WITH RESPECT TO LEBESGUE MEASURE 221 (b) Show that if /е_г^, then / is integrable and Λ(/) = //^μ. (Consider first the case />0.) This is the Daniell-Stone representation theorem. SECTION 17. THE INTEGRAL WITH RESPECT TO LEBESGUE MEASURE The Lebesgue Integral on the Line A real measurable function on the line is Lebesgue integrable if it is integrable with respect to Lebesgue measure Λ, and its Lebesgue integral jfdk is denoted by jf(x)dx, or, in the case of integration over an interval, by faf(x)dx. The theory of the preceding two sections of course applies to the Lebesgue integral. It is instructive to compare it with the Riemann integral. The Riemann Integral A real function / on an interval (a,b] is by definition1 Riemann integrable, with integral r, if this condition holds: For each e there exists a δ with the property that (17.1) '-Σ/(*..)Λ(/,·) <e if {/,} is any finite partition of (a, b] into subintervals satisfying Λ(/;) < δ and if x,e/, for each i. The Riemann integral for step functions was used in Section 1. Suppose that / is Borel measurable, and suppose that / is bounded, so that it is Lebesgue integrable. If / is also Riemann integrable, the r of (17.1) must coincide with the Lebesgue integral J„f(x) dx. To see this, first note that letting xt vary over /, leads from (17.1) to (17.2) r-Esup/(x)-A(/() .re/, <e. Consider the simple function g with value supxe; f(x) on /,. Now f<g, and the sum in (17.2) is the Lebesgue integral of g. By monotonicity of the fFor other definitions, see the first problem at the end of the section and the Note on terminology following it.
222 INTEGRATION Lebesgue integral, j„f(x)dx<j^g(x)dx<r + €. The reverse inequality follows in the same way, and so f^f(x)dx = r. Therefore, the Riemann integral when it exists coincides with the Lebsegue integral. Suppose that / is continuous on [a, b]. By uniform continuity, for each e there exists a δ such that \f(x) -f(y)\ <e/(b -a) if \x-y\<8. If Λφ < δ and χ(· <ξ /(·, then g = Σ,/(χ,)//. satisfies \f - g\ < e/(b - a) and hence \Safdx - jabgdx\ < e. But this is (17.1) with r replaced (as it must be) by the Lebesgue integral S„fdx: A continuous function on a closed interval is Riemann integrable. Example 17.1. If / is the indicator of the set of rationals in (0,1], then the Lebesgue integral /0'(х)Л is 0 because /= 0 almost everywhere. But for an arbitrary partition {/,·} of (0,1] into intervals, Σ,/(χ,)Λ(/,) with jc,e/. is 1 if each x, is taken from the rationals and 0 if each xt is taken from the irrationals. Thus / is not Riemann integrable. ■ Example 17.2. For the / of Example 17.1, there exists a g (namely, g = 0) such that f = g almost everywhere and g is Riemann integrable. To show that the Lebesgue theory is not reducible to the Riemann theory by the casting out of sets of measure 0, it is of interest to produce an / (bounded and Borel measurable) for which no such g exists. In Examples 3.1 and 3.2 there were constructed Borel subsets A of (0,1] such that 0 < Λ( A) < 1 and such that Λ( Α Π /) > 0 for each subinterval / of (0,1]. Take f=IA. Suppose that f = g almost everywhere and that {/,·} is a decomposition of (0,1] into subintervals. Since λ(ΙιΓ\Α D[/=g]) = Л(/,ПЛ) > 0, it follows that g(y,) = /(y;) = 1 for some y,· in /,· C\A, and therefore, (17.3) Е*(У,-)Л(/,) = 1>Л(Л). If g were Riemann integrable, its Riemann integral would coincide with the Lebesgue integrals jgdx = jfdx = Л(/4), in contradiction to (17.3). ■ It is because of their extreme oscillations that the functions in Examples 17.1 and 17.2 fail to be Riemann integrable. (It can be shown that a bounded function on a bounded interval is Riemann integrable if and only if the set of its discontinuities has Lebesgue measure 0.f) This cannot happen in the case of the Lebesgue integral of a measurable function: if / fails to be Lebesgue integrable, it is because its positive part or its negative part is too large, not because one or the other is too irregular. fSee Problem 17.1.
SECTION 17. THE INTEGRAL WITH RESPECT TO LEBESGUE MEASURE 223 Example 17.3. It is an important analytic fact that /,-, α ι· f sin χ , π (17.4) hm / ——dx = T. I—ma Jq X *· The existence of the limit is simple to prove, because Ι"ηπ_1)πχ~ι sin xdx alternates in sign and its absolute value decreases to 0; the value of the limit will be identified in the next section (Example 18.4). On the other hand, x"'sinx is not Lebesgue integrable over (0,oo), because its positive and negative parts integrate to oo. Within the conventions of the Lebesgue theory, (17.4) thus cannot be written fix'1 sin xdx = it/2—although such "improper" integrals appear in calculus texts. It is, of course, just a question of choosing the terminology most convenient for the subject at hand. ■ The function in Example 17.2 is not equal almost everywhere to any Riemann integrable function. Every Lebesgue integrable function can, however, be approximated in a certain sense by Riemann integrable functions of two kinds. Theorem 17.1. Suppose that /|/| dx < oo and e > 0. (i) There is a step function g = Т^=хх^А , with bounded intervals as the A,·, such that j\f- g\dx <e. (ii) There is a continuous integrable h with bounded support such that j\f-h\dx<€. Proof. By the construction (13.6) and the dominated convergence theorem, (i) holds if the Л, are not required to be intervals; moreover, Л(Л,) < oo for each г for which xi,Φ 0. By Theorem 11.4 there exists a finite disjoint union Bj of intervals such that A(/4,A£,) < e//c|x,|. But then Σ,·χ,·/β. satisfies the requirements of (i) with 2e in place of e. To prove (ii) it is only necessary to show that for the g of (i) there is a continuous h such that f\g - h\dx<e. Suppose that Ai = (.ahbi\, let /г,(х) be 1 on (a,·, b;] and 0 outside (a, - δ, bt + δ], and let it increase linearly from 0 to 1 over (α,--δ,α,·] and decrease linearly from 1 to 0 over (6„£>,· + δ]. Since j\IA. — h;\ dx -»0 as δ -» 0, h = Σχ,/г, for sufficiently small δ will satisfy the requirements. ■ The Lebesgue integral is thus determined by its values for continuous functions.1 This provides another way of defining the Lebesgue integral on the line. See Problem 17.13.
224 INTEGRATION The Fundamental Theorem of Calculus Adopt the convention that /,f = - jg if a > β. For positive h, 1 rx+ft 1 /-Χ+Λ, if*+ f(y)dy-f(x) *i£+\f(y)-f(x)\dy <sup[|/(y) -/(x)|: x<y <x + h], and the right side goes to 0 with h if / is continuous at x. The same thing holds for negative h, and therefore f„f(y)dy has derivative f(x): (17.5) ^fj(y)dy=f(x) if / is continuous at x. Suppose that F is a function with continuous derivative F'=f; that is, suppose that F is a primitive of the continuous function /. Then (17.6) fbf(x) dx = fbF'(x)dx = F(b)- F(a), a a as follows from the fact that F(x) - F(a) and ///(уЫу agree at χ = α and by (17.5) have identical derivatives. For continuous /, (17.5) and (17.6) are two ways of stating the fundamental theorem of calculus. To the calculation of Lebesgue integrals the methods of elementary calculus thus apply. As will follow from the general theory of derivatives in Section 31, (17.5) holds outside a set of Lebesgue measure 0 if / is integrable—it need not be continuous. As the following example shows, however, (17.6) can fail for discontinuous /. Example 17.4. Define Fix) =jc2sin x~2 for 0 <x < { and F(x) = 0 for χ <0 and for jc > 1. Now for \<x<l define F(x) in such a way that Γ is continuously differentiable over (0,oo). Then F is everywhere differentiable, but F'(0) = 0 and F'(x) = 2* sin jc-2- 2jc_1 cosjc-2 for 0 <x < \. Thus F' is discontinuous at 0; F' is, in fact, not even integrable over (0,1], which makes (17.6) impossible for a = 0. For a more extreme example, decompose (0,1] into countably many subintervals (an, bn]. Define G(x) = 0 for χ < 0 and χ > 1, and on (a„, bn] define G(x) = F((x - an)/(bn —an)). Then G is everywhere differentiable, but (17.6) is impossible for G if (a, b] contains any of the (an, bn], because G is not integrable over any of them. ■ Change of Variable For (17.7) [a,b]^[u,u]^Rl,
SECTION 17. THE INTEGRAL WITH RESPECT TO LEBESGUL MEASURE 225 the change-of-variable formula is (17.8) (bf{Tx)T\x)dx= fnf(y)dy. J π •'TV, If Τ exists and is continuous, and if / is continuous, the two integrals are finite because the integrands are bounded, and to prove (17.8) it is enough to let b be a variable and differentiate with respect to it.f With the obvious limiting arguments, this applies to unbounded intervals and to open ones: Example 17.5. Put T(x) = tan χ on {-ττ/Ι,ττ/ΐ). Then 7"(x)=l + T2(x\ and (17.8) for f(y) = (1 + y2)~ ' gives (17.9) / - — 00 J. dy + У2 = π. The Lebesgue Integral in Rk The /c-dimensional Lebesgue integral, the integral in (Rk,&k,\k\ is denoted ff(x) dx, it being understood that χ = (*,,..., xk). In low-dimensional cases it is also denoted fjAf(x\, x2)dxl dx2, and so on. As for the rule for changing variables, suppose that T: U -> Rk, where U is an open set in Rk. The map has the form Tx = (i,(x),...,tk(x)); it is by definition continuously differentiable if the partial derivatives ί,--(χ) = dtjdx. exist and are continuous in U. Let Dx = [ί,7(χ)] be the Jacobian matrix, let J(x) = det Dx be the Jacobian determinant, and let V= TO. Theorem 17.2. Let Τ be a continuously differentiable map of the open set U onto V. Suppose that Τ is one-to-one and that J(x) Φ 0 for all x. If f is nonnegative, then (17.10) ff(Tx)\J(x)\dx=ff(y)dy. By the inverse-function theorem [A35], V is open and the inverse point mapping Γ-1 is continuously differentiable. It is assumed in (17.10) that /: V-* Rl is a Borel function. As usual, for the general /, (17.10) holds with |/| in place of /, and if the two sides are finite, the absolute-value bars can be removed; and of course / can be replaced by flB or fITA. See Problem 17.11 for extensions.
226 INTEGRATION Example 17.6. Suppose that Γ is a nonsingular linear transformation on U = V=Rk. Then Dx is for each χ the matrix of the transformation. If Τ is identified with this matrix, then (17.10) becomes (17.11) \dztT\[f(Tx)dx=[f(y)dy. Ju Jv If f=ITA, this holds because of (12.2), and then it follows in the usual sequence for simple / and for the general nonnegative /: Theorem 17.2 is easy in the linear case. ■ Example 17.7. In R2 take U = [(p, 0): ρ > 0, 0 < θ< 2π] and T(p, θ) = (ρ cos θ, ρ sin θ). The Jacobian is Др,0) = р, and (17.10) gives the formula for integrating in polar coordinates: (17.12) ffp>Q f(pcose,psine)Pdpde = ff f(x,y)dxdy. 0<β<2ττ Here V is R2 with the ray [(χ,Ο): χ > 0] removed; (17.12) obviously holds even though the ray is included on the right. If the constraint on θ is replaced by 0 < θ < Αττ, for example, then (17.12) is false (a factor of 2 is needed on the right). This explains the assumption that Τ is one-to-one. ■ Theorem 17.2 is not the strongest possible; it is only necessary to assume that Τ is one-to-one on the set t/0 = [xe U: Κχ)Φϋ\. This is because, by Sard's theorem/ А*(К-Шо) = 0. Proof of Theorem 17.2. Suppose it is shown that (17.13) [f(Tx)\J(x)\dx>[f(y)dy Ju Jv for nonnegative /. Apply this to Γ"1: V-* U, which [A35] is continuously differen- tiable and has Jacobian J~l(T~1y) at y: jg{T-ly)\r\T-ly)\dy> j g{x)dx for nonnegative g on V. Taking g(x)=f(Tx)\J(x)\ here leads back to (17.13), but with the inequality reversed. Therefore, proving (17.13) will be enough. For f = ITA, (17.13) reduces to (17.14) f\J(x)\dx>\k(TA). fSpivAK, p. 72.
SECTION 17. THE INTEGRAL WITH RESPECT TO LEBESGUE MEASURE 227 Each side of (17.14) is a measure on <&= Uc\3lk. If srf consists of the rectangles A satisfying A~czU, then ja^ is a semiring generating <%, U is a countable union of .assets, and the left side of (17.14) is finite for A in sf (sup^-Щ <°°). It follows by Corollary 2 to Theorem 11.4 that if (17.14) holds for A in ja/, then it holds for A in <%. But then (linearity and monotone convergence) (17.13) will follow. Proof of (17.14) for A in stf. Split the given rectangle A into finitely many subrectangles Qt satisfying (17.15) diamg^S, '8 to be determined. Let xt be some point of β,. Given e, choose δ in the first place so that \J(x) - J(x')\ <e if x,x' eA~ and U-jc'I <δ. Then (17.15) implies (17.16) Σί'(*/)Ιλ*(β,·) ^ / \Ч*)\ dx + e\k(A). Let β/+Ε be a rectangle that is concentric with β, and similar to it and whose edge lengths are those of Q, multiplied by 1 + e. For χ in U consider the affine transformation (17.17) φχζ = Ωχ(ζ-χ)+Τχ, z<ERk; φχζ will [A34] be a good approximation to Tz for ζ near x. Suppose, as will be pioved in a moment, that for each e there is a δ such that, if (17.15) holds, then, for each i, φχ approximates Τ so well on Qt that (17.18) TQi^x.Qf- By Theorem 12.2, which shows in the nonsingular case how an affine transformation changes the Lebesgue measures of sets, λΑ(ψχ.βί+ί) = ΙΛ^)ΙλΑ(β/+£)· If (17.18) holds, then (17.19) \k(TA) = ЕЛ*(Л2,) < Σ^{Φχ&Γ) i i = ΣΐΑ*/)ΐλ*(βη=(ΐ + θ*Σΐ^,)ΐλ*(β,). / i (This, the central step in the proof, shows where the Jacobian in (17.10) comes from.) If for each e there is a δ such that (17.15) implies both (17.16) and (17.19), then (17.14) will follow. Thus everything depends on (17.18), and the remaining problem is to show that for each e there is a δ such that (17.18) holds if (17.15) does. Proof of (17.18). As (x, z) varies over the compact set A~x[z: \z\ = 1], \D~lz\ is continuous, and therefore, for some c, (17.20) \D~1z\<c\z\ for xe A, z <ERk. Since the tjt are uniformly continuous on A ~, 8 can be chosen so that \tjt(z) - iy/(jc)| <e/k2c for all ;,/ if z,x&A and \z—x\<8. But then, by linear approximation [A34: (16)], \Tz-Tx-Dx(z-x)\<ec-1\z-x\<ec-l8. If (17.15) holds and δ<1, then by the definition (17.17), (17.21) \Тг-фх.г\<е/с for ζ eg,.
228 INTEGRATION To prove (17.18), note that ζ e Q. implies \φ-ιΤζ-ζ\ = \φ-ιΤζ-φ-ιφΧιζ\ = \0-\ϋ-φΧιζ)\ <β\Τζ-φΧιΖ\<€, where the first inequality follows by (17.20) and the second by (17.21). Since φχιΤζ is within e of the point ζ of Q„ it lies in g,+ £: φχχΤζ e β,+£, or Γζ e <£,.(2,+£. "Hence (17.18) holds, which completes the proof. ' ■ Stieltjes Integrals Suppose that F is a function on Rk satisfying the hypotheses of Theorem 12.5, so that there exists a measure μ such that μ(Α) = AAF for bounded rectangles A. In integrals with respect to μ, μ(άχ) is often replaced by dF(x): (17.22) jf(x)dF(x) = [/(χ)μ(άχ). J A J A The left side of this equation is the Stieltjes integral of / with respect to F; since it is defined by the right side of the equation, nothing new is involved. Suppose that / is uniformly continuous on a rectangle A, and suppose that A is decomposed into rectangles Am small enough that \f(x) -f(y)\ < €/μ(Α) for x, у еЛт. Then /. f(x)dF(x)-Zf(xm)^f <€ for xm £/lm. In this case the left side of (17.22) can be defined as the limit of these approximating sums without any reference to the general theory of measure, and for historical reasons it is sometimes called the Riemann-Stieltjes integral; (17.22) for the general / is then called the Lebesgue-Stieltjes integral. Since these distinctions are unimportant in the context of general measure theory, ff(x)dF(x) and ffdF are best regarded as merely notational variants for }/(χ)μ(άχ) and {/άμ. PROBLEMS Let / be a bounded function on a bounded interval, say [0,1]. Do not assume that / is a Borel function. Denote by L^f and L*f (L for Lebesgue) the lower and upper integrals as defined by (15.9) and (15.10), where μ is now Lebesgue measure λ on the Borel sets of [0,1]. Denote by R^f and R*f(R for Riemann) the same quantities but with the outer supremum and infimum in (15.9) and (15.10) extending only over finite partitions of [0,1] into subintervals. It is obvious (see (15.11)) that (17.23) Riff<Liff<L*f<R*f.
SECTION 17. THE INTEGRAL WITH RESPECT TO LEBESGUE MEASURE 229 Suppose that / is bounded, and consider these three conditions: (i) There is an r with the property that for each e there is α δ such that (17.1) holds if {/,} partitions [0,1] into subinteruals with λ(/,) < δ and if xt e /,. (ii) К*/=«*/. (iii) If Dj is the set of points of discontinuity off, then \(Df) = 0. The conditions are equivalent. 17.1. Prove: (a) Df is a Borel set. (b) (i) implies (ii). (c) (ii) implies (iii). (d) (iii) implies (i). (e) The r of (i) mast coincide with the Rjf,f=R*f of (ii). A note on terminology. An / on the general (Ω, &,μ) is defined to be integrable not if (15.12) holds, but if (16.1) does. And an / on [0,1] is defined to be integrable with respect to Lebesgue measure not if L^f^Lff holds, but, rather, if (17.24) f\f(x)\dx <co does. The condition L*/ = L*f is not at issue, since for bounded / it always holds if / is a Borel function, and in this book / is always assumed to be a Borel function unless the contrary is explicitly stated. For the Lebesgue integral, the question is whether / is small enough that (17.24) holds, not whether it is sufficiently regular that L*/ = L*f. For the Riemann integral, the terminology is different because Riff<R*f holds for all sorts of important Borel functions, and one way to define Riemann integrability is to require RJff:=R*f. In the context of general integration theory, one occasionally looks at the Riemann integral, but mostly for illustration and comparison. 17.2. 3.15 17.1 Τ (a) Show that an indicator lA for А с [0,1] is Riemann integrable if and only if A is Jordan measurable. (b) Find a Riemann integrable function that is not a Borel function. 17.3. Extend Theorem 17.1 to Rk. 17.4. Show that if / is integrable, then lim f\f(x + t)-f(x)\dx = 0. Use Theorem 17.1. 17.5. Suppose that μ is a finite measure on 92k and A is closed. Show that μ(χ+Α) is upper semicontinuous in χ and hence measurable. 17.6. Suppose that /™|/(jc)| dx < °°. Show that for each e, λ[χ: x>ot, \f(x)\ > e] -»0 as a -»oo. Show by example that fix) need not go to 0 as χ -»oo (even if / is continuous).
230 INTEGRATION 17.7. Let f(x)=x"-l-2x2"-1. Calculate and compare /ο'Σ^,/„(*)<& and T£=.lflf„{x)dx. Relate this to Theorem 16.6 and to the corollary to Theorem 16.7. 17.8. Show that (1+y2)"1 has equal integrals over (-<»,- 1),(- 1,0),(0,1),(1,°°). Conclude from (17.9) that /„'(1 + y2)~l dy = ir/4. Expand the integrand in a geometric series and deduce Leibniz's formula 4 _1 3 T 5 7 + by Theorem 16.7 (note that its corollary does not apply). 17.9. Show that if / is integrable, there exist continuous, integrable functions g„ such thatgn(jc) -*f(x) except on a set of Lebesgue measure 0. (Use Theorem 17.1(H) with e=n~2.) 17.10. 13.9 17.9 Τ Let / be a finite-valued Borel function over [0,1] By the following steps, prove Lusin's theorem: For each e there exists a continuous function g such that λ[χ e (0,1): fix) Φ g(x)] < e. (a) Show that / may be assumed integrable, or even bounded. (b) Let g„ be continuous functions converging to / almost everywhere. Combine EgorofPs theorem and Theorem 12.3 to show that convergence is uniform on a compact set К such that λ((0,1) - К) <е. The limit lim„ g„(x)=/(x) must be continuous when restricted to K. (c) Exhibit (0,1) - К as a disjoint union of open intervals Ik [A12], define g as / on K, and define it by linear interpolation on each Ik. 17.11. Suppose in (17.7) that 7" exists and is continuous and / is a Borel function, and suppose that fa\f(Tx)T'(x)\ dx < <». Show in steps that /т-(0,ь)1/(y)l dy < <χ> and (17.8) holds. Prove this for (a) / continuous, (b) /=/[Si,|, (c) f—IB, (d) / simple, (e) f> 0, (f) / general. 17.12. 16.12 Τ Let ^f consist of the continuous functions on Rl with compact support. Show that _/ is a vector lattice in the sense of Problem 11.4 and has the property that /еУ implies /л1е/ (note that 1 ^J"\ Show that the σ-field ^generated by ^ is &ίχ. Suppose Λ is a positive linear functional on _/; show that Λ has the required continuity property if and only if /„(jc) J. 0 uniformly in χ implies Λ(/„) -» 0. Show under this assumption on Л that there is a measure μ on ^" such that (17.25) A(/) = //dM, /e^. Show that μ is σ-finite and unique. This is a version of the Riesz representation theorem. 17.13. Τ Let Λ(/) be the Riemann integral of /, which does exist for / in ^f. Using the most elementary facts about Riemann integration, show that the μ determined by (17.25) is Lebesgue measure. This gives still another way of constructing Lebesgue measure. 17.14. Τ Extend the ideas in the preceding two problems to Rk.
SECTION 18. PRODUCT MEASURE AND FUBINl's THEOREM 231 SECTION 18. PRODUCT MEASURE AND FUBINI'S THEOREM Let (A\ 3F) and (У, $/) be measurable spaces. For given measures μ and ν on these spaces, the problem is to construct on the Cartesian product XxY a product measure π such that π(Α XB) = μ(Α)ν(Β) for A cJi and В с У. In the case where μ and ν are Lebesgue measure on the line, π will be Lebesgue measure in the plane. The main result is Fubini's theorem, according to which double integrals can be calculated as iterated integrals. Product Spaces It is notationally convenient in this section to change from (Ω, &) to (X, 8£) and (У, 3/). In the product space Α'χ Υ a measurable rectangle is a product Лх В for which ^ef and Be^. The natural class of sets in Χ Χ Υ to consider is the σ-field 3£x 3/ generated by the measurable rectangles. (Of course, 8£x 3/ is not a Cartesian product in the usual sense.) Example 18.1. Suppose that X = У = R1 and 3T= &= &l. Then a measurable rectangle is a Cartesian product A X В in which A and В are linear Borel sets. The term rectangle has up to this point been reserved for Cartesian products of intervals, and so a measurable rectangle is more general. As the measurable rectangles do include the ordinary ones and the latter generate ^2, it follows that &2 <z&1 χ &l. On the other hand, if A is an interval, [B: A xB<e&2] contains Rl (A X Rl = Ό „(Α χ (-«,«])<= &2) and is closed under the formation of proper differences and countable unions; thus it is a σ-field containing the intervals and hence the Borel sets. Therefore, if β is a Borel set, [A: Ax В е^2] contains the intervals and hence, being a σ-field, contains the Borel sets. Thus all the measurable rectangles are in &2, and so &xx3ll = &2 consists exactly of the two- dimensional Borel sets. ■ As this example shows, 3£x З/ is in general much larger than the class of measurable rectangles. Theorem 18.1. (i) If Ε <e S£x &, then for each χ the set [у: (х, у) е Е] lies in "3/ and for each у the set [x: (x, y) e E] lies in 8E. (ii) /// is measurable 8£x З/, then for each fixed χ the function f(x, ·) is measurable 3/, and for each fixed у the function f(-,y) is measurable 8£. The set [y: (x, y) e E] is the section of Ε determined by x, and f(x, ■) is the section of / determined by x. Proof. Fix x, and consider the mapping Tx: У-♦А'х У defined by Txy = (x, y). If E=AxB is a measurable rectangle, T~lE is β or 0
232 INTEGRATION according as A contains χ or not, and in either case Tx XE e $/. By Theorem 13.10), Tx is measurable &/ЗГХ. &. Hence [y: (x,y)e£] = 77'Ее & for Ε e ^x S'. By Theorem 13.1(H), if / is measurable 2Гх &/Я\ then fTx is measurable %//{%х. Hence f(x,-)=fTx(·) is measurable S^. The symmetric statements for fixed у are proved the same way. ■ Product Measure Now suppose that (Χ, 8£, μ) and (У, 5/, y) are measure spaces, and suppose for the moment that μ and ν are finite. By the theorem just proved v[y: (x, y) <= E] is a well-defined function of x. If ^f is the class of Ε in <2"x ^ for which this function is measurable 3£, it is not hard to show that ^f is a λ-system. Since the function is IA(x)v(B) for E=AxB, ^f contains the T7-system consisting of the measurable rectangles. Hence _^ coincides with STY & by the T7-A theorem. It follows without difficulty that (18.1) ir'(£)= [ v[y: (x,y) ^ Ε]μ(άχ), Е&ЗГХ&, Jx is a finite measure on S£x З/, and similarly for (18.2) ir'(E)= [p[x:(x,y)e=E]v(dy), Е^ЗГх&. JY For measurable rectangles, (18.3) ττ'(ΑχΒ)=ττ"(ΑχΒ)=μ(Α)ν(Β). The class of Ε in S£x %/ for which тг'(Е) = тг"(Е) thus contains the measurable rectangles; since this class is a λ-system, it contains 3£x $/. The common value π'(Ε) = π"(Ε) is the product measure sought. To show that (18.1) and (18.2) also agree for σ-finite μ and v, let {A J and {Bn} be decompositions of X and Υ into sets of finite measure, and put μ„(.Α) = μ(Α DAJ and i/„(5) = ι>(Β η 5„). Since KB) = Em^m(B), the integrand in (18.1) is measurable £T in the σ-finite as well as in the finite case; hence π' is a well-defined measure on 8£x $/ and so is it". If ir'mn and тг"тп are (18.1) and (18.2) for μη and ^„,-then by the finite case, already treated (see Example 16.5), тг'(Е) = L„y„n(E) = Σ„„ιτ*„η(.Ε) = π"(Ε). Thus (18.1) and (18.2) coincide in the σ-finite case as well. Moreover, v'(AxB) = Σηιημηι(Α>η(Β) = μ(Α)ν(Β). Theorem 18.2. // (X, ££, μ) and (У, %/, v) are σ-finite measure spaces, 77(E) = тг'(Е) = тг"(Е) defines a σ-finite measure on 3£x ty; it is the only measure such that π(Α Χ Β) = μ(Α) ■ v(B) for measurable rectangles.
SECTION 18. PRODUCT MEASURE AND FUBINl's THEOREM 233 Proof. Only σ-finiteness and uniqueness remain to be proved. The products AmxBn for {Am} and {Bn} as above decompose Λ'χ У into measurable rectangles of finite T7-measure. This proves both σ-finiteness and uniqueness, since the measurable rectangles form a ^-system generating STx ^(Theorem 10.3). ■ The π thus defined is called product measure; it is usually denoted μΧν. Note that the integrands in (18.1) and (18.2) may be infinite for certain χ and y, which is one reason for introducing functions with infinite values. Note also that (18.3) in some cases requires the conventions (15.2). Fubini's Theorem Integrals with respect to π are usually computed via the formulas (18.4) / /(uMi(i,y))=i ff(x,y)v(dy) JXxY and H{dx) (18.5) / /(x,yMrf(x,y)) = / [/(χ,ν)μ(άχ) JXXY v(dy). The left side here is a double integral, and the right sides are iterated integrals. The formulas hold very generally, as the following argument shows. Consider (18.4). The inner integral on the right is (18.6) fYf(x,yHdy). Because of Theorem 18.1(ii), for / measurable 8£x & the integrand here is measurable 'З/; the question is whether the integral exists, whether (18.6) is measurable 8£ as a function of x, and whether it integrates to the left side of (18.4). First consider nonnegative /. If /=/£, everything follows from Theorem 18.2: (18.6) is v[y: (x, у) е £], and (18.4) reduces to π(Ε) = π'(Ε). Because of linearity (Theorem 15.1(iv)), if / is a nonnegative simple function, then (18.6) is a linear combination of functions measurable SC and hence is itself measurable 8£; further application of linearity to the two sides of (18.4) shows that (18.4) again holds. The general nonnegative / is the monotone limit of nonnegative simple functions; applying the monotone convergence theorem to (18.6) and then to each side of (18.4) shows that again / has the properties required. Thus for nonnegative /, (18.6) is a well-defined function of χ (the value °° is not excluded), measurable 3F, whose integral satisfies (18.4). If one side of
234 INTEGRATION (18.4) is infinite, so is the other; if both are finite, they have the same finite value. Now suppose that /, not necessarily nonnegative, is integrable with respect to v. Then the two sides of (18.4) are finite if / is replaced by |/|. Now make the further assumption that (18.7) [\f(x,y)\v(dy)<«> JY for all x. Then (18.8) ff(x,y)V(dy)= ff+(x,y)v(dy)- [r(x,y)v(dy). Jy Jy Jy The functions on the right here are measurable S£ and (since f+,f~< \f\) integrable with respect to μ, and so the same is true of the function on the left. Integrating out the χ and applying (18.4) to /+ and to Γ gives (18.4) for / itself. The set A0 of χ satisfying (18.7) need not coincide with X, but μ(Χ — A0) = 0 if / is integrable with respect to π, because the function in (18.7) integrates to f\f\dv (Theorem 15.2(iii)). Now (18.8) holds on A0, (18.6) is measurable ЙГоп А0, and (18.4) again follows if the inner integral on the right is given some arbitrary constant value on X - A0. The same analysis applies to (18.5): Theorem 18.3. Under the hypotheses of Theorem 18.2, for nonnegative f the functions (18.9) ff(x,y)»(dy),ff(x,yMdx) are measurable 8£ and 'З/, respectively, and (18.4) and (18.5) hold. If f (not necessarily nonnegative) is integrable with respect to π, then the two functions (18.9) are finite and measurable on A0 and on B0, respectively, where μ(Χ - A0) = v(Y-B0) = 0, and again (18.4) and (18.5) hold. It is understood here that the inner integrals on the right in (18.4) and (18.5) are set equal to 0 (say) outside A0 and B0J This is Fubini's theorem; the part concerning nonnegative / is sometimes called Tonelli's theorem. Application of the theorem usually follows a two-step procedure that parallels its proof. First, one of the iterated integrals is computed (or estimated above) with |/| in place of /. If the result is finite, ^ince two functions that are equal almost everywhere have the same integral, the theory of integration could be extended to functions that are only defined almost everywhere; then A0 and fi0 would disappear from Theorem 18.3.
SECTION 18. PRODUCT MEASURE AND FUBlNl'S THEOREM 235 then the double integral (integral with respect to π) of |/| must be finite, so that / is integrable with respect to π; then the value of the double integral of / is found by computing one of the iterated integrals of /. If the iterated integral of I /I is infinite, / is not integrable тт. Example 18.2. Let Dr be the closed disk in the plane with center at the origin and radius r. By (17.12), А2(Я) = ffdxdy = ff ράράθ. 0<p<r 0<0<2i7 The last integral can be evaluated by Fubini's theorem: A2(Dr) = 2π( ρ dp = vr2. ■ •Ό Example 18.3. Let / = jl^e"" dx. By Fubini's theorem applied in the plane and by the polar-coordinate formula, I2= ffe-<x2+y2)dxdy= ff e-e'pdpde. R1 P>0 Ο<0<2Τ7 The double integral on the right can be evaluated as an iterated integral by another application of Fubini's theorem, which leads to the famous formula (18.10) Γβ-χ2άχ = ντ7. As the integrand in this example is nonnegative, the question of integrability does not arise. ■ Example 18.4. It is possible by means of Fubini's theorem to identify the limit in (17.4). First, [ e_"*sin xfitc= A\ -e~"'(usini + cos t)], J0 \+u21 K n as follows by differentiation with respect to t. Since ri г00 ri I I \e~ux sin x\ du dx= I |sin x\ x~ ' dx <t <o°, Λ) l/o J Jo
236 INTEGRATION Fubini's theorem applies to the integration of e ux sin χ over (0, ί) χ (0, oo): dx j dx = / sin χ j e ux du J0 X J0 I/O ~ j I e~"xs'm xdx r°° du r°° e~"' = / τ-/ ? ( μ sin t +cosi) du. h l+u2 J0 1 + ы2 du The next-to-last integral is π/2 (see (17.9)), and a change of variable ut = s shows that the final integral goes to 0 as t -»<». Therefore, (18.11) /■I Sin X , 77 lim | or = у. ί-»οο-0 Integration by Parts Let F and G be two nondecreasing, right-continuous functions on an interval [a,b], and let μ and ν be the corresponding measures: n{x,y] = F{y)-F(x), i>(x,y}---G(y)-G(x), a<x<y<b. In accordance with the convention (17.22) write dF(x) and dG(x) in place of μ(άχ) and y(<±r). Theorem 18.4. // F and G have no common points of discontinuity in (a, b], then (18.12) f G(x)dF(x) J(a,b ] = F(b)G(b) - F(a)G(a) - f F(x)dG(x). J(a,b] In brief: jGdF = AFG - jFdG. This is one version of the partial integration formula. Proof. Note first that replacing F(x) by F(x)-C leaves (18.12) unchanged—it merely adds and subtracts Cv(a, b] on the right. Hence (take С = F(a)) it is no restriction to assume that F(x) = μ(α, χ] and no restriction to assume that G(x) = v(a, x]. If π = μ χ ν is product measure in the plane,
SECTION 18. PRODUCT MEASURE AND FUBlNl'S THEOREM 237 then by Fubini's theorem, (18.13) тг[(х, у): a <y<x<b] = ί ν(α,χ]μ(άχ)= f G(x)dF(x) J(a,b] J(a,b] and (18.14) ir[(x,y):a <x<y<b] = ( μ(α,ν]ν(άν) = [ F(y)dG(y). J(a,b] J(a,b] The two sets on the left have as their union the square 5 = (a,b]x (a,b]. The diagonal of 5 has T7-measure тг[(х,у): a <x=y <b] = [ ν{χ}μ(άχ) = 0 J(.a,b\ because of the assumption that μ and ν share no points of positive measure. Thus the left sides of (18.13) and (18.14) add to ir(S) = μ(α,^\ν(α^] = F(b)G(b). Ш Suppose that ν has a density g with respect to Lebesgue measure and let G(x) = c + f*g(t)dt. Transform the right side of (18.12) by the formula (16.13) for integration with respect to a density; the result is (18.15) [ G(x)dF(x) J(a,b] = F(b)G(b)-F(a)G(a) - fbF(x)g(x) dx. A consideration of positive and negative parts shows that this holds for any g integrable over (a, b]. Suppose further that μ has a density / with respect to Lebesgue measure, and let F(x) = c' + JaxfU)dt. Then (18.15) further reduces to (18.16) fbG(x)f(x) dx = F(b)G(b) - F(a)G(a) - iV(x)g(x) dx. Ja Ja Again, / can be any integrable function. This is the classical formula for integration by parts. Under the appropriate integrability conditions, (a,b] can be replaced by an unbounded interval.
238 INTEGRATION Products of Higher Order Suppose that (Χ,8£,μ), (Υ, 3^,u), and (Ζ,^,η) are three σ-finite measure spaces. In the usual way, identify the products XxYxZ and (XxY)xZ. Let 8£x &X φ be the σ-field in XxYxZ generated by the AxBxC with A,B,C in 8£, ^,^, respectively. For С in Z, let ^c be the class of Ε e QTx <& for which ExCe ЗГх %/χ φ. Then d?c is a σ-field containing the measurable rectangles in XxY, and so ^c = ^x ^· Therefore, (£Tx 30 x ^c ЙГх g^x g. But the reverse relation is obvious, and so (ЙГх %/) x $= STx <3/x $. Define the product μΧ^Χη on STx %/X $ as (μ Χ ν) Χ η. It gives to AxBxC the value (μ Χ ν) (Л χ β) · 77(C) = μ(Α)ι>(Β)η(€), and it is the only measure that does. The formulas (18.4) and (18.5) extend in the obvious way. Products of four or more components can clearly be treated in the same way. This leads in particular to another construction of Lebesgue measure in Rk = Λ1 X · · · X Д1 (see Example 18.1) as the product Λ X · · · Χ λ (к factors) on &k = &x x ■ ■ ■ x&1. Fubini's theorem of course gives a practical way to calculate volumes: Example 18.5. Let Vk be the volume of the sphere of radius 1 in Rk; by Theorem 12.2, a sphere in Rk with radius r has volume rkVk. Let A be the unit sphere in Rk, let В = [(xl,x2): x*+xj<l], and let C(x1,x2) = [(x3,-..,xk): Lk=,3...xf < 1 -x\-x\\. By Fubini's theorem, Vk= fdxl ■dxk= fdxldx2[ dx3-dxk JA JB JC(xl,x2) = \аххахгУк_г{\-х\-х$к-гШ JB -K-2 // (i-P2f~2)/2pdpde 0<θ<2ττ 0</><l If V0 is taken as 1, this holds for к = 2 as well as for к > 3. Since Vl = 2, it follows by induction that V 2(2π)'-' _ (2π)' 1 Χ 3x5 Χ ··· χ(2ι- 1) ' 2! 2Χ4Χ ■■■ Χ(2ι) for / = 1,2,... . Example 18.2 is a special case.
SECTION 18. PRODUCT MEASURE AND FUBlNl'S THEOREM 239 PROBLEMS 18.1. Show by Theorem 18.1 that if A Χ Β is nonempty and lies in Six @/, then /fe^'andfieg'. 18.2. 2.9Τ Suppose that X=Y is uncountable and S£= ^consists of the countable and the cocountable sets. Show that the diagonal Ε = [(χ, у): х = у] does not lie in 9Tx &, even though [y: U,y)£E]e & and [x: (x, y) e E] e Й" for all jc and y. 18.3. 10.5 18.1 Τ Let (Χ,3Τ,μ) = (Υ,3^,ν) be the completion of (R\&\k). Show that UxV, ЙГх ^,/4Ху) is not complete. 18.4. The assumption of o--finiteness in Theorem 18.2 is essential: Let μ be Lebesgue measure on the line, let ν be counting measure on the line, and take Ε = [(jc, y): x = y]. Then (18.1) and (Ί8.2) do not agree. 18.5. Example 18.2 in effect connects π as the area of the unit disk D, with the π of trigonometry. (a) A second way: Calculate A2(D1) directly by Fubini's theoiem: X2(Di) = /1(2(1 -x2)l/2dx. Evaluate the integral by trigonometric substitution. (b) A third way: Inscribe in the unit circle a regular polygon of η sides. Its interior consists of η congruent isosceles triangles with angle 2π/η at the apex; the area is ns\n(n/n)cos(n/n), which goes tc π. 18.6. Suppose that / is nonnegative on a σ-finite measure space (Ω, &, μ). Show that f /άμ = (μΧλ)[(ω,ν)<ΕςΐΧΐΙ[:ϋ<γ</(ω)]. a Prove that the set on the right is measurable. This gives the "area under the curve." Given the existence of μ Χ A on Ωχ R\ one can use the right side of this equation as an alternative definition of the integral. 18.7. Reconsider Problem 12.12. 18.8. Suppose that v[y. (jc, y)e E]= v[y: (x,y)eF] for all x, and show that (μ Χν)(Ε) = (μΧ vXF). This is a general version of Caualieri's principle. 18.9. (a) Suppose that μ is σ-finite, and prove the corollary to Theorem 16.7 by Fubini's theorem in the product of (Π,^,μ) and {1,2,...} with counting measure. (b) Relate the series in Problem 17.7 to Fubini's theorem.
240 INTEGRATION 18.10. (a) Let μ = ν be counting measure on X= Υ = {1,2,...}· If l2-2~x ifjt = y, f(x,y) = j -2 + 2-χ ifjt = y + l, \0 otherwise, then the iterated integrals exist but are unequal. Why does this not contradict Fubini's theorem? (b) Show that xy/(x2 + y2)2 is not integrable over the square [(*, y) \x\,\y\ < 1] even though the iterated integrals exist and are equal 18.11. Exhibit a case in which (18.12) fails because F and G have a common point of discontinuity. 18.12. Prove (18.16) for the case in which all the functions are continuous by differentiating with respect to the upper limit of integration. 18.13. Prove for distribution functions F that fl^Fix + c) - F(x))dx = с 18.14. Prove for continuous distribution functions that }1„F(x)dF(x) = j. 18.15. Suppose that a number /„ is defined for each η > n0 and put F(x) = L„o<„<,/„. Deduce from (18.15) that (18.17) Σ G{n)fn = F(x)G{x)-jXF{t)g{t) dt if G(y)- G(x)= f/g(t)dt, which will hold if G has continuous derivative g. First assume that the /„ are nonnegative. 18.16. Τ Take л0= I, Д = 1, and G(x)= l/x, and derive Σπ<χη~ι = log χ +y + 0{\/x\ where γ = 1 - /J°(f - [t\)t~2 di is Euler's constant. 18.17. 5.20 18.15 Τ Use (18.17) and (5.51) to prove that there exists a constant с such that (18.18) . Σ ^loglog^ + c + θί^Μ. 18.18. Euler's gamma function is defined for positive t by Γ(ί)= \^x'~xe~x dx. (a) Prove that T<k\t)= /J*'-"(log x)ke~xdx. (b) Show by partial integration that Г(г + 1) = tT(t) and hence that T(n + 1) = n\ for integral n. (c) From (18.10) deduce Гф = ^. (d) Show that the unit sphere in Rk has volume (see Example 18.5) (18.19) Vk = it ' Г((*/2) + 1)"
SECTION 19. THE Lp SPACES 241 18.19. By partial integration prove that /"(sin x)/x)2 dx = тг/2 and /™Jl- cos jc)jc~2<ic = тт. 18.20. Suppose that μ is a probability measure on (X, S£) and that, for each x in X, vx is a probability measure on (У, SO. Suppose further that, for each В in S^, ν,(Β) is, as a function of x, measurable 3Γ. Regard the μ(Α) as initial probabilities and the vx(B) as transition probabilities: (a) Show that, if Ε e ЗГх &, then ^[y: (x, у) е £] is measurable #". (b) Show that π(Ε) = lxvx[y: {x, у) е £]μ(ώ:) defines a probability measure on ^"x S^. If ι^ = ι- does not depend on x, this is just (18.1). (c) Show that if / is measurable S'xS' and nonnegative, then fYf(x, y)vx(dy) is measurable SC. Show further that / f(x,yMd(x,y))=f jf{x,y)vx{dy) JXXY μ(άχ), which extends Fubini's theorem (in the probability case). Consider also /'s that may be negative. (d) Let v(B)= [χνχ(Β)μ(άχ). Show that тг{Хх В) = v{B) and (Яу)К*) = /]//(уК(*) μ{άχ). SECTION 19. THE L" SPACES* Definitions Fix a measure space Ш, ^", μ). For 1 </? < oo, let Lp = LP(H. ^", /χ) be the class of (measurable) real functions / for which \f\p is integrable, and for / in Lp, write (19.1) \\ί\ράμ Up There is a special definition for the case ρ = °°: The essential supremum of / is (19.2) , = inf[a: μ[ω: |/(ω)|>α] = О]; take L°° to consist of those / for which this is finite. The spaces Lp have a geometric structure that can usefully guide the intuition. The basic facts are laid out in this section, together with two applications to theoretical statistics. *The results in this section are used nowhere else in the book. The proofs require some elementary facts about metric spaces, vector spaces, and convex sets, and in one place the Radon-Nikodym theorem of Section 32 is used. As a matter of fact, it is possible to jump ahead and read Section 32 at this point, since it makes no use of Chapters 4 and 5
242 INTEGRATION For 1 <p,q <oo, ρ and q are conjugate indices if p~l + q~l = 1, ρ and q are also conjugate if one is 1 and the other is oo (formally, 1-1 + oo-1 = 1). Holder's inequality says that if ρ and q are conjugate and if /e Lp and g <= Lq, then fg is integrable and (19.3) \&diL </|/g|^<||/IUIg| τ This is obvious if ρ = 1 and q = °°. If 1 </?, g < oo and μ. is a probability measure, and if / and g are simple functions, (19.3) is (5.35). But the proof in Section 5 goes over without change to the general case. Minkowski's inequality says that if f,g^Lp (l <p < oo), then f+g^Lp and (19.4) 11/ + *Н„£ 11/11,+ 1Ы1,. This is clear if ρ = 1 or ρ = oo. Suppose that 1 <p < oo. Since \f + g\< 2(| /Ί + IgD1^, /" + § does lie in Lp. Let q be conjugate to p. Since ρ - 1 =ρ/ς Holder's inequality gives ll/ + g||£ = /l/ + *l"d/* <\\f\\P-\\\f+g\p/ql+\\g\\P-\\\f+g\p/ql (11/11, Hlgll,) j\f + g\"dn ll/9 = (ll/ll„ + llgll„)ll/+g| Since ρ - p/<7 = 1, (19.4) follows.1 If a is real and f^Lp, then obviously af^Lp and (19.5) |a/ll„ = kHI/ll„. Define a metric on Lp by rfp(/, g) = ll/~g||p. Minkowski's inequality gives the triangle inequality for dp, and dp is certainly symmetric. Further, dpif,g) = Q if and only if \f-g\p integrates to 0, that is, f=g almost everywhere. To make Lp a metric space, identify functions that are equal almost everywhere. fThe Holder and Minkowski inequalities can also be proved by convexity arguments, see Problems 5.9 and 5.10.
SECTION 19. THE LP SPACES 243 If ll/-/Jp-»0 and ρ < oo, so that j\f-fn\"dp -»0, then /„ is said to converge to / in the mean of order p. If /=/' and g = g' almost everywhere, then f + g=f +g' almost everywhere, and for real a, af = af almost everywhere. In Lp, f and /' become the same function, and similarly for the pairs g and g', f + g and f + g', and af and af. This means that addition and scalar multiplication are unambiguously defined in Lp, which is thus a real vector space. It is a normed vector space in the sense that it is equipped with a norm || · ||p satisfying (19.4) and (19.5). Completeness and Separability A normed vector space is a Banach space if it is complete under the corresponding metric. According to the Riesz-Fischer theorem, this is true of L": Theorem 19.1. The space Lp is complete. Proof. It is enough to show that every fundamental sequence contains a convergent subsequence. Suppose first that ρ < °°. Assume that \\fm -f„\\p -* 0 as m, η -»oo; and choose an increasing sequence {nk} so that \\fm - fnVp < 2-(p+l)k for m,n>nk. Since j\fm-fn\p άμ>αρμ[^η-fn\>a] (this is just a general version of Markov's inequality (5.31)), μ[\/η-fm\>2~k]< 2pk\\fm-fn\\pP<2-k for m,n>nk. Therefore, Ud\f„k+i "/J * 2"*] <°o, and it follows by the first Borel-Cantelli lemma (which works for arbitrary measures) that, outside a set of μ,-measure 0, Lk\fn ,-f„k\ converges. But then fn converges to some / almost everywhere, and by Fatou's lemma, /I/-/„J" άμ < lirninf,- /|/„.-/„J" άμ. <2'k. Therefore, /eL" and II/ - fnk \\p -* 0. as required. If ρ = oo, choose {nk} so that \\fm -/„IL < 2~k for m,n>nk. Since l/„ -/„ |< 2~k almost everywhere,/„ converges to some/, and I/-/ I < 2"* almost everywhere. Again, ΙΙΖ-^.ΙΙρ-^Ο. ■ The next theorem has to do with separability. Theorem 19.2. (i) Let U be the set of simple functions Е"=,а,-/Я. for a-, and μ,(β,) finite. For 1 <p < <», t/ й rfenie /и L^. (ii) If μ is σ-finite and 5? is countably generated, and if ρ < oo, then Lp is separable. Proof. Proof of (i). Suppose first that ρ < <». For f^Lp, choose (Theorem 13.5) simple functions /„ such that /„ -*f and |/J < |/|. Then fn eL", and by the dominated convergence theorem, j\f - fn\p άμ -» 0. Therefore,
244 INTEGRATION \\f~~fn\\p<e f°r some n; but each fn is in U. As for the case ρ = », if n2">||/IL then the /„ defined by (13.6) satisfies ||/-/„|U < 2"" (<e for large «). /Voo/ of («). Suppose that ,9*" is generated by a countable class € and that Ω is covered by a countable class 2) of J^sets of finite measure. Let EX,E2,... be an enumeration of €\J3i; let &>n (n > 1) be the partition consisting of the sets of the form F, η · · · nF„, where f) is £,- or E,c; and let S^ be the field of unions of the sets in &>n Then 5^= U" = 1<^ is a countable field that generates ,9*", and /χ is σ-finite on 3^. Let ^ be the set of simple functions LJL [aiIAj for a, rational, Л, е 5^, and μ(Λ,) < <». Let g = Σ,"1 ,α,/β. be the element of £/ constructed in the proof of part (i). Then ||/- g||p <e, the at are rational by (13.6), and any a,· that vanish can be suppressed. By Theorem 11.4(H), there exist sets A, in ^ such that μ{ΒίΑΑί) <(e/m\a,\)p, provided ρ < oo, and then Λ-Σ^,α,-/^. lies in V and ||/ - Λ|!p < 2e. But V is countable.1 ' ■ Conjugate Spaces A linear functional on Lp is a real-valued function γ such that (19.6) у(«/ + «'/')=«У(Я+«'у(П- The functional is bounded if there is a finite Л/ such that (19.7) Ιγ(/)Ι<Λί·||/ΙΙΡ for all / in Lp. A bounded linear functional is uniformly continuous on Lp because \\f-f'\\p<e/M implies |y(/)- γ(/')| <e (if M>0; and M = 0 implies γ(/) s 0). The norm \\γ\\ of γ is the smallest Μ that works in (19.7): Hyll = sup[|y(/)|/||/||p:/#0]. Suppose ρ and q are conjugate indices and g e Lq. By Holder's inequality, (19.8) ygU)=jfgdn is defined for f^Lp and satisfies (19.7) if M>||g||9; and γ is obviously linear. According to the &'esz representation theorem, this is the most general bounded linear functional in the case ρ <<χ·_ Theorem 19.3. Suppose that μ is σ-finite, that 1 <p < o°, and that q is conjugate to p. Every bounded linear functional on Lp has the form (19.8) for some g e Lq; further, (19.9) UyJ = llglU, and g is unique up to a set of μ-measure 0. fPart (ii) definitely requires ρ < χ; see Problem 19.2.
SECTION 19. THE Lp SPACES 245 The space of bounded linear functionals on Lp is called the dual space, or the conjugate space, and the theorem identifies Lq as the dual of Lp. Note that the theorem does not cover the case ρ = oo.t Proof. Case Ι: μ finite. For A in &~, define φ(Α)= y(IA). The linearity of у implies that φ is finitely additive. For the Μ of (19.7), \φ{Α)\ < Μ · \\ΙΑ\\Ρ = Μ-\μ{Α)\ι/ρ. If A = U„An, where the An are disjoint, then φ{Α) = Σ^ΜΑη)+φωη>ΝΑη), and since |φ(υ „>NAJ <Μμ^ρω я>л,А>-> 0, it follows that ψ is an additive set function in the sense of (32.1). The Jordan decomposition (32.2) represents φ as the difference of two finite measures φ+ and φ~ with disjoint supports A+ and A~. If μ(Α) = 0, then φ + (Α) = φ{Α Γ)Α + )<Μμι/ρ(Α) = 0. Thus φ+ is absolutely continuous with respect to μ. and by the Radon-Nikodym theorem (p. 422) has an integrable density g + : φ+(Α)= fAgJr άμ. Together with the same result for φ~, this shows that there is an integrable g such that y{IA) = φ{Α) = \Αξάμ = \ΙΑ%άμ. Thus γ(/) = ^άμ for simple functions / in Lp. Assume that this g lies in Lq, and define yg by the equation (19.8). Then у and у are bounded linear functionals that agree for simple functions; since the latter are dense (Theorem 19.2(0), it follows by the continuity of у and yg that they agree on all of Lp. It is therefore enough (in the case of finite μ) to prove g eL'. It will also be shown that ||g||9 is at most the Μ of (19.7); since ||g||, does work as a bound in (19.7), (19.9) will follow. If yg{f) = 0, (19.9) will imply that g = 0 almost everywhere, and for the general у it will follow further that two functions g satisfying yg{f) = y{f) must agree almost everywhere. Assume that 1 <p, q < <». Let gn be simple functions such that 0 <gn t \g\q, and take hn ^g^sgng. Then hng = gln/p\g\>gln/pgyq = gn, and since hn is simple, it follows that /g,. άμ < \η^άμ = yg(hn) = y(hr) <M-\\h,\\p = M[fg„dp]x/p. Since l-l/p=l/q, this gives [^,,άμ]1/" <M. Now the monotone convergence theorem gives g^Lp and even ||g||9 <M. Assume that ρ = 1 andq = v>. In this case, l//gifyi| = lyg(/)| = Ιγ(/)Ι<Μ· ||/||i for simple functions / in L'.Take /= sgng ·7[|ί;|>α]. Then a/x[|g| >q] < fI[ш>a■^■\g\dμ = ffgdμ<M■\\f\\^ = Mμ[\g\>a]. If α > M, this inequality gives /4|g|>a] = 0; therefore ||g||o= = llgll, <M and g<ELT = Lq. Case II: μσ-finite. Let Ля be sets such that An1 Ω, and μ.(Λ„)<°°. If μη(Α) = μ(Α^Αη), then lyi/Z^UM-ll/^JI^M-t/l/r^J1/" for/e L"(fIA eL^/^cL" (μ„)). By the finite case, Λ„ supports a g„ in L9 such that y(fIA ) = ///^ g„ άμ for f^Lp, and ||g„||9 < M. Because of uniqueness, gn + 1 can be taken to agree with g„ on An (L/'(/xn + 1)cL/'(/xn)). There is therefore a function g on Ω such that g = gn on Λ„ and Ц/^ g\\q<M. It follows that ||g||9<M and g^Lq. By the dominated convergence theorem and the continuity of y, f e L^ implies //g d/x = limn ffIA g άμ = limn γί//^ ) = γ(/). Uniqueness follows as before. ■ Problem 19 3
246 INTEGRATION Weak Compactness For /eLp. and g eL', where ρ and q are conjugate, write (19.10) (f,8)-ffgdv. For fixed / in Lp, this defines a bounded linear functional on V; for fixed g in Lq, it defines a bounded linear functional on Lp. By Holder's inequality, (19.11) l(/,g)l<ll/IUigll,. Suppose that / and fn are elements of Lp. If (/, g) = lim„(/„ g) tor each g in L9, then /„ converges weakly to /. If ||/-/„ilp ~* 0, then certainly /„ converges weakly to /, although the converse is false.1 The next theorem says in effect that if p>\, then the unit ball Bp = [/ eLp: ||/||p < 1] is compact in the topology of weak convergence. Theorem 19.4. Suppose that μ is σ-finite and &~ is countably generated. If 1 <p < °°, then eveiy sequence in Bp contains a subsequence converging weakly to an element of Bp. Suppose of elements fn, f, and f of Lp that f„ converges weakly both to f and to f. Since, by hypothesis, μ is σ-finite and ρ > 1, Theorem 19.3 applies if the ρ and q there are interchanged. And now, since ('/, g) = (/', g) for all g in Lq, it follows by uniqueness thatf = f. Therefore, weak limits are unique under the present hypothesis. The assumption ρ > 1 is essential} Proof. Let q be conjugate to ρ (1 <q <°°). By Theorem 19.2(ii), Lq contains a countable, dense set G. Add to G all finite, rational linear combinations of its elements; it is still countable. Suppose that {fn} <zBp. By (19.11), K/„,g)l^llgli, ior g^U. Since {(/„,g)} is bounded, it is possible by the diagonal method [A14] to pass to a subsequence of {/„} along which, for each of the countably many g in G, the limit lim„(/„,g) = y(g) exists and |y(g)| < ||g||,. For g, g' e G, |y(g) - y(g')| = lira J(/„, g - g')l < \\g — g'\\q. Therefore, γ is uniformly continuous on G and so has a unique continuous extension to all of Lq. For g, g' e G, y(g+g') = lim„(/„, g +g') - y(g) + y(g'); by continuity, this extends to all of L9. For g^G and α rational, y(ag) = a lim„(/„, g) = ay(g); by continuity, this extends to all real a and all g in L9: γ is a linear functional on Lq. Finally, |y(g)| < \\g\\q extends from G to Lq by continuity, and γ is bounded in the sense of (19.7). +ProbIem 19.4 *Problem 19.5.
SECriON 19. THE Lp SPACES 247 By the Riesz representation theorem (1 <q <oo); there is an / in Lp (the space adjoint to Lq) such that y(g) = (/, g) for all g. Since γ has norm at most 1, (19.9) implies that ||/||p < 1: / lies in Bp. Now (/, g) = lim„(/„, g) for g in G. Suppose that g' is an arbitrary element of Lq, and choose g in G so that \\g' -g\\q <e. Then \(f,g')-(fn,g')\ ^l(/,g,)-(/.?)l + l(/,g)-(/„.?)l + l(/„.g)-(/„.g,)l < ll/Ulg' -gll, +1(/, g) - (/„, g)l + ll/JUIg -g'll, <2e+\{f,g)-(fn,g)\. Since g eG, the last term here goes to 0, and hence lim„(/n. g') = (/, g') for all g' in Lq. Therefore, /„ converges weakly to /. ■ Some Decision Theory The weak compactness of the unit ball in L°° has interesting implications for statistical decision theory. Suppose that μ is σ-finite and & is countably generated. Let /,,...,Д be probability densities with respect to μ—nonnegative and integrating to 1. Imagine that, for some i, ω is drawn from Ω according to the probability measure Pj(A) = jAft άμ. The statistical problem is to decide, on the basis of an observed ω, which /, is the right one. Assume that if the right density is /,, then a statistician choosing / incurs a nonnegative loss L(j\i). A decision rule is a vector function δ(ω) = (δ,(ω),..., δΑ(ω)), where the δ,(ω) are nonnegative and add to 1: the statistician, observing ω, selects /, with probability δ,(ω). If, for each ω, δ,(ω) is 1 for one ί and 0 for the others, δ is a nonrandomized rule; otherwise, it is a randomized rule. Let D be the set of all rules. The problem is to choose, in some more or less rational way that connects up with the losses L(j\i\ a rule δ from D. The risk corresponding to δ and /, is ?;(«) = / Ев;(«)ЦЛ0 /ι(ω)μ(άω), which can be interpreted as the loss a statistician using δ can expect if /; is the right density. The risk point for δ is R(S) = (Rl(S\...,Rk(8)). If Κ,(δ') <Κ;(δ) for all i and Λ,(δ') < Κ,·(δ) for some г—that is, if the point R(S') is "southwest" of R(8)—then of course δ' is taken as being better than δ. There is in general no rule better than all the others. (Different rules can have the same risk point, but they are then indistinguishable as regards the decision problem.) The risk set is the collection 5 of all the risk points; it is a bounded set in the first orthant of Rk. To avoid trivialities, assume that S does not contain the origin (as would happen for example if the L(j\i) were all 0). Suppose that δ and δ' are elements of D, and λ and λ' are nonnegative and add to 1. If δ;'(ω) = λδ,(ω) + λ'δ;(ω), then δ" is in D and Λ(δ") = λ/?(δ) + λ'Λ(δ'). Therefore, 5 is a convex set. Lying much deeper is the fact that S is compact. Given points x(n) in S, choose rules δ"" such that R(S(n)) =x(n). Now «]"»(·) is an element of L°°, in fact of B", and so by Theorem 19.4 there is a subsequence along which, for each ;'= l,...,fc, δ,·"'
248 INTEGRATION converges weakly to a function δ in B™. If μ(Α) < °°, then jSjIA άμ = lim„ /δ)")/^ άμ > 0 and /(1 - LjSj)Ia άμ = lim„ /(1 - Σ,8)η))ίΑ άμ = 0, so that δ, > О and Σ;δ;= ] almost everywhere on A. Since μ is σ-finite, the δ;· can be altered on a set of μ-measure 0 in such a way as to ensure that δ = (δ,,..., 8k) is an element of D. But, along the subsequence, x(n) -> R(8). Therefore: The risk set is compact and convex. The rest is geometry. For χ in Rk, let Qx be the set cf x' such that 0 <x) <xt for all ('. If χ = R(8) and *' = Λ(δ'), then δ' is better than δ if and only if x' e Qx and χ' Φ χ. A rule δ is admissible if there exists no δ' better than δ; it makes no sense to use a rule that is not admissible. Geometrically, admissibility means that, for χ = R(S), S П Q consists of χ alone. A f s ^Ш///Л X' Qx X \ >■ Let χ = R(8) be given, and suppose that δ is not admissible. Since S Π Qx is compact, it contains a point x' nearest the origin (x' unique, since S η Qx is convex as well as compact); let δ' be a conesponding rule: x' = R(S'). Since δ is not admissible, χ' Φχ, and δ' is better than δ. If S Π Qx- contained a point distinct from x', it would be a point of S η Qx nearer the origin than x', which is impossible. This means that Qx. contains no point of S other than x' itself, which means in turn that δ' is admissible. Therefore, if δ is not itself admissible, there is a δ' that is admissible and is bettei than δ. This is expressed by saying that the class of admissible rules is complete. Let ρ = (Pi,..-,pk) be a probability vector, and view pi as an a priori probability that fi is the correct density. A rule δ has Bayes risk R(p,8) = Σ,ρ,/?,(δ) with respect to p. This is a kind of compound risk: /( is correct with probability /?,·, and the statistician chooses jj with probability δ;(ω). A Bayes rule is one that minimizes the Bayes risk for a given p. In this case, take a =/?(/?, δ) and consider the hyperplane (19.12) and the half space Η ζ· ΣΡίζί=α (19.13) Я+ = z'· Σρ,ζ^« Then χ = R(8) lies on H, and S is contained in H+: χ is on the boundary of S, and
SECTION 19. THE Lp SPACES 249 Я is a supporting hyperplane. If /?, > 0 for all /, then Qx meets S only at x, and so δ is admissible. Suppose now that δ is admissible, so that χ = R(8) is the only point in S Π Qx and χ lies on the boundary of S. The problem is to show that δ is a Bayes rule, which means finding a supporting hyperplane (19.12) corresponding to a probability vector p. Let Τ consist of those у for which Qy meets S. Then Τ is convex: given a convex combination y" = λ у + к'у' of points in Τ, choose in S points ζ and z' southwest of у and y', respectively, and note that ζ" = λζ + λ'ζ' lies in S and is southwest of y". Since S meets Qx only in the point x, the same is true of T, so that χ is a boundary point of Τ as well as of S. Let (19.12) (ρ Φ 0) be a supporting hyperplane through χ- χ e Η and Τ с H+. If ρ, < 0, take ζ, = *, +1 and take ζ, = *; for the other i; then ζ lies in Γ but not in H+, a contradiction. (The right-hand figure shows the role of T: the planes H{ and H2 both support S, but only tf2 supports Τ and only tf2 corresponds to a probability vector.) Thus /?, > 0 for all i, and since Σ,ρ, = 1 can be arranged by normalization, δ is indeed a Bayes rule. Therefore The admissible rules are Bayes rules, and they form a complete class. The Space L2 The space L2 is special because ρ = 2 is its own conjugate index. If /, g e L2, the inner product (/, g) = //gd/x is well defined, and by (19.11)—write ||/|l in place of ||/||2—K/, g)\ < 11/11 · llgl!· This is the Schwarz (or Cauchy-Schwarz) inequality. If one of / and g is fixed, (/, g) is a bounded (hence continuous) linear functional in the other. Further, (/, g) = (g,/), the norm is given by II/II2 = (/»/), and L2 is complete under the metric d(f, g) = ||/-g||. A Hilbert space is a vector space on which is defined an inner product having all these properties. The Hilbert space L2 is quite like Euclidean space. If (/, g) = 0, then / and g are orthogonal, and orthogonality is like perpendicularity. If /,,.. .,/„ aie orthogonal (in pairs), then by linearity, (Σ,/^,Σ,/,) = Σ,Σ;(/,,/;) = Zi(fitfi): !ΙΣ,·/;||2 = Σ,ΙΙ/,ΙΙ2. This is a version of the Pythagorean theorem. If / and g are orthogonal, write /±g. For every /, /1 0. Suppose now that μ is σ-finite and ^" is countably generated, so that L? is separable as a metric space. The construction that follows gives a sequence (finite or infinite) φι,ψ2,... that is orthonormal in the sense that ||<p„|| = 1 for all η and (<pm, ψη) = 0 for m Φ η, and is complete in the sense that (/, φη) = 0 for all η implies /= 0—so that the orthonormal system cannot be enlarged. Start with a sequence /,,/2,··· that is dense in L2. Define g,,g2,... inductively: Let g, =/,. Suppose that g,,...,g„ have been defined and are orthogonal. Define g„+1 =/„ + 1 - Σ?_,<*„,£„ where «„,· is (/„ + 1,g,)/l|g,||2 if g,#0 and is arbitrary if g, = 0. Then g„ + 1 is orthogonal to g,,...,g„, and /„ +, is a linear combination of g,,..., g„ +,. This, the Gram-Schmidt method, gives an orthogonal sequence g1,g2,··- with the property that the finite linear combinations of the g„ include all the /„ and are therefore dense in L2. If gn Φ 0, take φη = g„/||gj; if g„ = 0, discard it from the sequence. Then φι,φ2,-.. is orthonormal, and the finite linear combinations of the φη are still dense. It can happen that all but finitely many of the g„ are 0, in which
250 INTEGRATION case there are only finitely many of the φη. In what follows it is assumed that φι,φ2,... is an infinite sequence; the finite case is analogous and somewhat simpler. Suppose that / is orthogonal to all the φη. If ai are arbitrary scalars, then /,αιφί,...,αηφη is an orthogonal set, and by the Pythagorean property, \\f - L?=lai(Pi\\2 = \\f\\2 + Σ?=ια* >\\f\\2. If 11/11 > 0, then / cannot be approximated by finite linear combinations of the φη, a contradiction: φί,φ2,.-- is a complete orthonormal system. Consider now a sequence a,, a2,... of scalars for which Σ°°= ,α2 converges. U sn = Σ"=Ια,<ρ,·, then the Pythagorean theorem gives \\sn - sm\\2 = Em</<„a2. Since the scalar series converges, {sj is fundamental and therefore by Theorem 19.1 converges to some g in L2. Thus g = limn Σ"ι = ιαιψι, which it is natural to express as g = Σ%ίαίφί. The series (that is to say, the sequence of partial sums) converges to g in the mean of order 2 (not almost everywhere). By the following argument, every element of L2 has a unique representation in this form. The Fourier coefficients of / with respect to {φ,) are the inner products «, = (/, φ(). For each n, 0 < ||/ - Σ^,α^ΙΙ2 = ll/ll2 - 2Σ,«;(/, Ψι) + Σ,7α,·α/<ρ,·, <py) = ll/ll2 - Σ"=ια2, and hence, η being arbitrary, Σ?=ια2 < ||/||2. By the argument above, the series Σ™=ιαίφί therefore converges. By linearity, (f-Σ"„^aiφi,φj) = 0 for n>j, and by continuity, (/- Σ°=ιαίψί,φί) = 0. Therefore, /— Е°°=1я,<р, is orthogonal to each <py and by completeness must beO: со (19-14) /= Σ (/,*>,>,- This is the Fourier representation of /. It is unique because if / = Σ^,ι^,-ιρ; is 0 (Σα,2 <o°), then a, = (/, <py) = 0. Because of (19.14), {φη} is also called an orthonormal basis for L2. A subset Μ of L2 is a subspace if it is closed both algebraically (/, /' eM implies af + a'f'^M) and topological^ (fn eM, /„ -»/ implies f^M). If L2 is separable, then so is the subspace M, and the construction above carries over: Μ contains an orthonormal system [φη] that is complete in M, in the sense that / = 0 if (/, φη) = 0 for all η and iff e M. And each f in Μ has the unique Fourier representation (19.14). Even if / does not lie in M, Σ7= ι(/> ψ)2 converges, so that Σ7=ι(/, <ρ,)<ρ,- is well defined. This leads to a powerful idea, that of orthogonal projection onto M. For an orthonormal basis {φ,) of M, define PMf= Σ7=ι(/,<Ρ,-)<Ρ/ for all / in L2 (not just for / in M). Clearly, PMf^M. Further, /- Σ"=1(/, <р,)<р, 1 φ; for η >j by linearity, so that f-PMf±_ φ] by continuity. But if f—PMf is orthogonal to each <p,, then, again by linearity and continuity, it is orthogonal to the general element Σ]=φ}φ; of M. Therefore, FM/eM and f-PMf±M. The map f~*PMf is the orthogonal projection on M.
SECTION 19. THE Lp SPACES 251 The fundamental properties of PM are these: (i) g<EM and f-g±M together imply g = PMf; (ii)/eM implies PMf = f; (Hi) g^M implies ll/-g||^||/-PM/ll; (iv) PM(af + «'/') = aPMf + a'PMf. Property (i) says that PMf is uniquely determined by the two conditions PMf^M and f-PM±M. To prove it, suppose that g,g'<EM, f-g±M, and f-g'lM. Then g -g' eM and g -g' ±M, so that g -g' is orthogonal to itself and hence Hg-g'll =0: g = g'- Thus the mapping PM is independent of the particular basis {<p,}; it is determined by Μ alone. Clearly, (ii) follows from (i); it implies that PM is idempotent in the sense that Puf= PMf. As for (iii), if g lies in M, so does PMf-g, so that, by the Pythagorean relation, \\f-g\\2 = \\f-PMf\\2 + \\PMf-g\\2>\\f-PMf\\2; the inequality is strict if g*PMf- Thus PMf is the unique point of Μ lying nearest to /. Property (iv), linearity, follows from (i). An Estimation Problem First, the technical setting: Let (Ω, У, u) and (Θ,<?,π) be a σ-finite space and a probability space, and assume that У and £ are countably generated. Let /θ(ω) be a nonnegative function on Θ Χ Ω, measurable (fx &~, and assume that /Ω/θ(ω)μ(^ω) = 1 for each 0 e Θ. For some unknown value of 0, ω is drawn from Ω according to the probabilities ΡΘ(Α) = {Α/β(ω)μ(άω), and the statistical problem is to estimate the value of g(0), where g is a real function on Θ. The statistician knows the functions /(-) and g(-), as well as the value of ω; it is the value of 0 that is unknown. For an example, take Ω to be the line, /(ω) a function known to the statistician, and fe(w)=af(aw + β), where θ=(α,β) specifies unknown scale and location parameters; the problem is to estimate g(0)=a, say. Or, more simply, as in the exponential case (14.7), take /θ(ω) = af(aa>\ where θ = g(0) = a. An estimator of g(0) is a function ί(ω). It is unbiased if (19.15) ίί(ω)ίθ(ω)μ(άω)=8(θ) J!l for all θ in Θ (assume the integral exists); this condition means that the estimate is on target in an average sense. A natural loss function is (ί(ω) -g(d))2, and if fe is the correct density, the risk is taken to be /Ω(ί(ω)^(0))2/θ(ω)μ(^0). If the probability measure π is regarded as an a priori distribution for the unknown 0, the Bayes risk of t is (19.16) R(n,t)= f f (f(«) -g(0))2/e(«)M(^)7r(^0); JeJii this integral, assumed finite, can be viewed as a joint integral or as an iterated integral (Fubini's theorem). And now t0 is a Bayes estimator of g with respect to π if it minimizes R(ir, t) over t. This is analogous to the Bayes rules discussed earlier. The
252 INTEGRATION following simple projection argument shows that, except in trivial cases, no Bayes estimator is unbaised f Let Q be the probability measure on (fx & having density /θ(ω) with respect to 7Γ χ μ, and let Lr be the space of square-integrable functions on (Θ Χ Ω, gx iF, Q). Then Q is finite and &x !F is countably generated. Recall that an element of L2 is an equivalence class of functions that are equal almost everywhere with respect to Q. Let G be the class of elements of L2 containing a function of the form g(e,w) = g(w) —functions of 0 alone Then G is a subspace. (That G is algebraically closed is clear; if /„ e G and ll/„-/l|-*0, then—see the proof of Theorem 19.1—some subsequence converges to / outside a set of β-measure 0, ?nd it follows easily that /e G.) Similarly, let Τ be the subspace of functions of ω alone: /(0, ω) = ί(ω). Consider only functions g and their estimators t for which the corresponding g and 1 are in L2. Suppose now that t0 is both an unbiased estimator of gn and a Bayes estimator of g0 with respect to тт. By (19 16) for g0, R(ir, t) = \\1 -g0|l , and since t0 is a Bayes estimator of gQ, it follows that l|f0-goll2 <||?-foil7 for all 1 in T. This means that ?„ is the orthogonal projection of g0 on the subspace Τ and hence that g0 - 10 _L 10. On the other hand, from the assumption that t0 is an unbiased estimator of g0, it follows that, for every g(0, ω) = g(6) in G, ('0 -g0, g) = / {(ΐ0(ω)-80(θ))ί(θ)/β(»)μ(άω)ΐΓ(άθ) ■Ώ-Ό ■'Θ-'Ω = fg(0) f(to(")-go(<>))fe("Md*>) ττ(άθ)=0. This means that ?() — g0±G: g0 is the orthogonal projection of ?0 on the subspace G But g0-/0_L/0 and ~t0-g0jLg0 together imply that t0-g0 is orthogonal to itself: ~t0 = g0. Therefore, t0M = ί0(θ, ω) = §0(θ, ω) = gQ(d) for (θ,ω) outside a set of Q- measure 0. This implies that tQ and g0 are essentially constant. Suppose for simplicity that /θ(ω)>0 for all (θ,ω), so that (Theorem 15.2) (πΧμΜθ,ω): /O(w)^g0(fl)] = 0. By Fubini's theorem, there is a θ such that, if a =g0(e), then μ[ω: ί0(ω)Φα] = 0; and there is an ω such that, if b = t0(<u), then π[θ: gu(e)*b] = 0. It follows that, for (θ,ω) outside a set of (v Хд)-теа5иге О, ί0(α>) and go(0) have the common value a = ft: tt[0: go(0) = α] = 1 and Ρθ[ω: ί0(ω) - α] = 1 for all 0 in Θ. PROBLEMS 19.1. Suppose that μ(ίϊ) < °° and /e L°°. Show that ||/||„ Τ ll/IL 19.2. (a) Show that Lx((0,1],^,λ) is not separable. (b) Show that L "((0,1], 9§, μ) is not separable if μ is counting measure (μ is not σ-finite). (c) Show that Lp(fl, SF, P) is not separable if (Theorem 36.2) there is on the space an independent stochastic process [X,: 0 <t < 1] such that X, takes the values ±1 with probability | each (У is not countably generated). +This is interesting because of the close connection between Bayes rules and admissibility; see Berger, pp. 546 ff.
SECTION 19. THE LP SPACES 253 19.3. Show that Theorem 19.3 fails for ЦЩ, \\@,\\ Hint: Take y(/) to be a Banach limit of п{У"/(х)ах. 19.4. Consider weak convergence in Lp((0,1], ё&, λ). (a) For the case ρ = <», find functions f„ and / such that /n goes weakly to / but \\f-f,,\\P does not go to 0. (b) Do the same for ρ = 2. 19.5. Show that the unit ball in L'((0,1], £8, λ) is not weakly compact. 19.6. Show that a Bayes rule corresponding to p=(pt, -,P^ таУ not be admissible if p, = 0 for some ί But there will be a better Bayes rule that is admissible 19.7. The Neyman-Pearson lemma. Suppose /, and /, are rival densities and L(j\i) is 0 or 1 as j = i or j Φ i, so that Λ,-(ό) is the probability of choosing the opposite density when ft is the right one. Suppose of δ that δ2(ω) = 1 if f2M > ί/ι(ω) and δ2(ω) = 0 if /2(ω) < ί/,(ω), where t > 0 Show that δ is admissible: For any rule δ', /δ'2/, άμ < /δ2/, άμ implies /δ',/2 άμ > /δ,/2 άμ. Hint. /(δ2 - δ2) (/2 — t/^άμ > 0, since the integrand is nonnegative. 19.8. The classical orthonormal basis for L2[0,2π] with Lebesgue measure is the trigonometric system (19.17) (2тг)~\ 7r-,/2sinm:, 7r_1/2cosm:, n=l,2,.. Prove orthonormality. Hint: Express the sines and cosines in terms of e'"x ± e~"'x, multiply out the products, and use the fact that j^e""xdx is Ιττ or 0 as m = 0 or m Φ 0. (For the completeness of the trigonometric system, see Problem 26.26.) 19.9. Drop the assumption that L2 is separable. Order by inclusion the orthonormal systems in L2, and let (Zorn's lemma) Φ = [φγ: γ e Γ] be maximal. (a) Show that I}= [y: (/, φγ) Φ 0] is countable. Hint- Use £?_,(/, <ΡΎ )2 < ll/ll2 and the argument for Theorem 10.2(iv). (b) Let Ρ/=Σγε1·(/,φΎ)φΎ- Show that f-P/ΐΦ and hence (maximality) f=Pf. Thus Φ is an orthonormal basis. (c) Show that Φ is countable if and only if L2 is separable. (d) Now take Φ to be a maximal orthonormal system in a subspace M, and define Puf=Lyerr(f,<p1)<pr Show that PMfeM and /-ΡΜ/ΐΦ, that g = PMg if geM, and that f-PMf±M. This defines the general orthogonal projection.
CHAPTER 4 Random Variables and Expected Values SECTION 20. RANDOM VARIABLES AND DISTRIBUTIONS This section and the next cover random variables and the machinery for dealing with them—expected values, distributions, moment generating functions, independence, convolution. Random Variables and Vectors A random variable on a probability space (Ω, &~, Ρ) is a real-valued function X = Χ(ω) measurable &~. Sections 5 through 9 dealt with random variables of a special kind, namely simple random variables, those with finite range. All concepts and facts concerning real measurable functions carry over to random variables; any changes are matters of viewpoint, notation, and terminology only. The positive and negative parts X+ and X~ of X are denned as in (15.4) and (15.5). Theorem 13.5 also applies: Define i(k-l)2-" if(k-l)2-"<x<k2-", (20.1) Ф„(х) = \ 1<к<п2", \n if χ > n. If X is nonnegative and Xn = ψη(Χ), then 0 < Xn Τ Χ. If X is not necessarily nonnegative, define ( ψ(Χ) ifZ>0, <20'2) *■-(-«-*> **-<»■ (This is the same as (13.6).) Then 0<Ζ„(ω)Τ*(ω) if Χ(ω)>0 and 0> 254
SECTION 20. RANDOM VARIABLES AND DISTRIBUTIONS 255 Χη(ω)ΐΧ(ω) if Χ(ω) < 0; and |Α"„(ω)|Τ \Χ(ω)\ for every ω. The random variable Xn is in each case simple. A random vector is a mapping from Ω to Rk that is measurable J^ Any mapping from Ω to /?* must have the form ω -»ΑΧ ω) = (Χ{(ω),..., Xk((o)), where each Χ;(ω) is real; as shown in Section 13 (see (13.2)), X is measurable if and only if each Xt is. Thus a random vector is simply a /c-tuple X= (Xt,..., Xk) of random variables. Subfields If ^ is a σ-field for which c^c ^ a /c-dimensional random vector X is of course measurable £ if [ω: Χ(ω)ЁЯ)е^ for every Η in ^,Ar. The σ-field σ(Χ) generated by X is the smallest σ-field with respect to which it is measurable. The σ-field generated by a collection of random vectors is the smallest σ-field with respect to which each one is measurable. As explained in Sections 4 and 5, a sub^r-field corresponds to partial information about ω. The information contained in σ(Χ) = σ(Χ{,..., Xk) consists of the к numbers Α',(ω),..., Xk(a>).< The following theorem is the analogue of Theorem 5.1, but there are technical complications in its proof. Theorem 20.1. Let X = (X{,...,Xk) be a random vector. (i) The σ-field σ(Χ) = a(Xu ..,Xk) consists exactly of the sets [Л'еЯ] forH^32k. (ii) In order that a random variable Υ be measurable σ(Χ) = a(Xu..., Xk) it is necessary and sufficient that there exist a measurable map f: Rk -»Rl such that Υ(ω)=/(Χι(ω),..., ХкЫУ) for all ω. Proof. The class £ of sets of the form [X^H] for //gJ* is a σ-field. Since X is measurable σ(Λ'), ^Ca(I). Since X is measurable ^f, σ(Χ)α^. Hence part (i). Measurability of / in part (ii) refers of course to measurability &k/&1. The sufficiency is easy, if such an / exists, Theorem 13.1(H) implies that Υ is measurable σ(Λ'). To prove necessity,* suppose at first that Υ is a simple random variable, and let у,,...,ym be its different possible values. Since At = [ω: yr(w) = y,·] lies in σ(X), it must by part (i) have the form [ω: Χ(ω) e #.] for some #,· in &k. Put /= E,y,/H.; certainly / is measurable. Since the At are disjoint, no X(ω) can lie in more than one H, (even though the latter need not be disjoint), and hence /(Χ(ω)) = Υ(ω). +The partition defined by (4.16) consists of the sets [ω. ΧΜ = χ] for χ e Rk. *For a general version of this argument, see Problem 13.3.
256 RANDOM VARIABLES AND EXPECTED VALUES To treat the general case, consider simple random variables Yn such that Υη(ω)-* Υ(ω) for each ω. For each n, there is a measurable function fn: Rk->Rl such that Υη(ω) = /η(Χ(ω)) for all ω. Let Μ be the set of χ in Rk for which {/„(*)} converges; by Theorem 13.4(iii), Μ lies in &k. Let /(*) = limn/„(x) for χ in M, and let /(x) = 0 for χ in Rk-M. Since /= limn /„/M and /я/м is measurable, / is measurable by Theorem 13.4(ii). For each ω, Κ(ω) = limn /η(Χ(ω)); this implies in the first place that Χ(ω) lies in Μ and in the second place that Κω) = lim„ /η(Χ(ω)) = /(Χ(ω)). ■ Distributions The distribution or law of a random variable X was in Section 14 denned as the probability measure on the line given by μ = PX~' (see (13.7)), or (20.3) μ(Α)=Ρ[ΧεΑ], /lej1. The distribution function of X was denned by (20.4) Ρ(χ)=μ(-™,χ]=Ρ[Χ<χ] for real x. The left-hand limit of F satisfies Ρ(χ-)=μ(~°ο,χ)=Ρ[Χ<χ], ' } F(x)-F(x-) = /,{*} =/>[* = *], and F has at most countably many discontinuities. Further, F is nondecreas- ing and right-continuous, and lim^^ F(x) = 0, lim^,,, F(x)= 1. By Theorem 14.1, for each F with these properties there exists on some probability space a random variable having F as its distribution function. A support for μ is a Borel set 5 for which /a(5)= 1. A random variable, its distribution, and its distribution function are discrete if μ has a countable support S = {*,, x2,- .·}· In this case μ is completely determined by the values μ{χι},μ{χ2},... . A familiar discrete distribution is the binomial: (20.6) Ρ[Χ=Γ]=μ{Γ} = ^ρΓ(1-ρ)"'Γ, r-0,l,...,n. There are many random variables, on many spaces, with this distribution: If {Xk} is an independent sequence such that P[Xk= l]=p and P[Xk = 0] = 1 -p (see Theorem 5.3), then X could be Z"=iX,; or Σ^,Λ',·, or the sum of any η of the X,-, Or Ω could be {0,1,---,"} if & consists of all subsets, P{r] = μ{ή, r = 0,!,...,«, and X(r) = r. Or again the space and random variable could be those given by the construction in either of the two proofs of Theorem 14.1. These examples show that, although the distribution of a
SECTION 20. RANDOM VARIABLES AND DISTRIBUTIONS 257 random variable X contains all the information about the probabilistic behavior of X itself, it contains beyond this no further information about the underlying probability space (Ω,^, Ρ) or about the interaction of X with other random variables on the space. Another common discrete distribution is the Poisson distribution with parameter Λ > 0: (20.7) ρ[Χ=Γ)=μ{Γ}=ε-^, - r = 0,l,.... A constant с can be regarded as a discrete random variable with Χ(ω) = с. In this case P[X = c] = μ{ο) = 1. For an artificial discrete example, let {*,, x2,...} be an enumeration of the rationale, and put (20.8) μ{χΓ}=2-'; the point of the example is that the support need not be contained in a lattice. A random variable and its distribution have density f with respect to Lebesgue measure if / is a nonnegative Borel function on Rl and (20.9) Ρ[Χ&Α] = μ(Α)= [ f(x)dx, /lei?1. 3 A In other words, the requirement is that μ have density / with respect to Lebesgue measure Λ in the sense of (16.11). The density is assumed to be with respect to Λ if no other measure is specified. Taking A = Rl in (20.9) shows that / must integrate to 1. Note that / is determined only to within a set of Lebesgue measure 0: if f=g except on a set of Lebesgue measure 0, then g can also serve as a density for X and μ. It follows by Theorem 3.3 that (20.9) holds for every Borel set A if it holds for every interval—that is, if F(b)-F(a) = ff(x)dx a holds for every a and b. Note that F need not differentiate to / everywhere (see (20.13), for example); all that is required is that / integrate properly—that (20.9) hold. On the other hand, if F does differentiate to / and / is continuous, it follows by the fundamental theorem of calculus that / is indeed a density for F.f +The general question of the relation between differentiation and integration is taken up in Section 31
258 RANDOM VARIABLES AND EXPECTED VALUES For the exponential distribution with parameter a > 0, the density is (20.10) f( \ = / ° lf x< RX) \ae-"x ifx> if χ < 0, 0. The corresponding distribution function (20.11) /0 if χ < 0, F(X) \l-e-°< if > 0 was studied in Section 14. For the normal distnbution with parameters m and σ, σ > 0, (20.12) /(*) = /2ττσ exp (χ - m) 2σ2 -oo <χ < oo; a change of variable together with (18.10) shows that / does integrate to 1. For the standard normal distribution, m = 0 and σ = 1. For the uniform distribution over an interval (a, b], (20.13) if a <x < b, otherwise. The distribution function F is useful if it has a simple expression, as in (20.11). It is ordinarily simpler to describe μ by means of a density f(x) or discrete probabilities μ{χΓ). If F comes from a density, it is continuous. In the discrete case, F increases in jumps; the example (20.8), in which the points of discontinuity are dense, shows that it may nonetheless be very irregular. There exist distributions that are not discrete but are not continuous either. An example is μ(Α)= j/i,,(/4) + \μ2(Α) for μ, discrete and μ2 coming from a density; such mixed cases arise, but they are few. Section 31 has examples of a more interesting kind, namely functions F that are continuous but do not come from any density. These are the functions singular in the sense of Lebesgue; the Q(x) describing bold play in gambling (see (7.33)) turns out to be one of them. See Example 31.1. If X has distribution μ and g is a real function of a real variable, (20.14) Ρ[8(Χ)*ΞΑ]=Ρ[ΧςΕ8-*Α]=μ(8-*Α). Thus the distribution of g(X) is /xg ' in the notation (13.7).
SECTION 20. RANDOM VARIABLES AND DISTRIBUTIONS 259 In the case where there is a density, / and F are related by (20.15) F(x) = f f(t)dt. J — CO Hence / at its continuity points must be the derivative of F. As noted above, if F has a continuous derivative, this derivative can serve as the density /. Suppose that / is continuous and g is increasing, and let T = g~l. The distribution function of g(X) is P[g(X) <x] = P[X< T(x)] = F(T(x)). If Τ is differentiable, this differentiates to f(T(x))T'(x), which is therefore the density for g(X). If g is decreasing, on the other hand, then P[g(X) <x] = P[X> T(x)] = 1 - F(T(x)\ and the derivative is equal to -f(T(x))T'(x) = f(T(x))\T'(x)\. In either case, g(X) has density (20.16) Шр[ё(х) <χ\ =f(T(x))\T'(x)\. If X has the normal density (20.12) and a > 0, (20.16) shows that aX + b has the normal density with parameters am Yb and ασ. Finding the density of g(X) from first principles, as in the argument leading to (20.16), often works even if g is many-to-one: Example 20.1. If X has the standard normal distribution, then P[X*<x] = ^=\ΓΧ e-'2/*dt= -jL· ft-*»dt for χ > 0. Hence X2 has density if χ < 0, χ '/V*/2 ifx>0. Multidimensional Distributions For a /c-dimensional random vector A'=(A'1,..., Xk), the distribution μ (a probability measure on &k) and the distribution function F (a real function on Rk) are defined by (20 17) ii{A)=P\{Xx,...,Xk)*A], Ae&, Ρ(χι,...,χΙζ)=Ρ[Χι<χι,...,ΧΙί<χΙζ) = μ(5χ), where 5x = [y: y, <x,, / = l,...,k] consists of the points "southwest" of x.
260 RANDOM VARIABLES AND EXPECTED VALUES Often μ and F are called the joint distribution and joint distribution function of X1,...,Xk. Now F is nondecreasing in each variable, and AAF>0 for bounded rectangles A (see (12.12)). As h decreases to 0, the set sx,h = [y- У,<х,+/г,/= l,...,k] decreases to Sx, and therefore (Theorem 2.1(H)) F is continuous from above in the sense that limAi0 F(x, + h,..., xk + h) = F(x,,..., xk). Further, F(x,,..., xk) -»0 if jc,- -» - oo for some / (the other coordinates held fixed), and F(x,,..., xk) -» 1 if jc,- -»oo for each /. For any F with these properties there is by Theorem 12.5 a unique probability measure μ on &k such that μ(Α) = &AF for bounded rectangles A, and μ(Ξχ) = F(x) for all x. As h decreases to 0, Sx_h increases to the interior S°=[y: y,<xh i = 1,..., k] of Sx, and so (20.18) limF(x, -h,...,xk-h) = /x(S°). h 10 Since F is nondecreasing in each variable, it is continuous at χ if and only if it is continuous from below there in the sense that this last limit coincides with F(x). Thus F is continuous at χ if and only if F(x) = μ(Ξχ) = /x(S°), which holds if and only if the boundary dSx = SX-S°X (the y-set where y,- < jc,- for all / and у,=х, for some i) satisfies μ(35χ) = 0. If к > 1, F can have discontinuity points even if μ has no point masses: if μ corresponds to a uniform distribution of mass over the segment В = [(x,0): 0 <x < 1] in the plane (μ(Α) = λ[χ: 0<χ< 1, (х,0)еЛ1), then F is discontinuous at each point of B. This also shows that F can be discontinuous at uncountably many points. On the other hand, for fixed χ the boundaries dSx h are disjoint for different values of h, and so (Theorem 10.2(iv)) only countably many of them can have positive μ,-measure. Thus χ is the limit of points (x, + h,..., xk + h) at which F is continuous: the continuity points of F are dense. There is always a random vector having a given distribution and distribution function: Take (Ω.,3Γ,Ρ) = (ΚΙζ,&Ιζ,μ) and Χ(ω) = ω. This is the obvious extension of the construction in the first proof of Theorem 14.1. The distribution may as for the line be discrete in the sense of having countable support. It may have density / with respect to /c-dimensional Lebesgue measure: μ(Α) = JAf(x)dx. As in the case к = 1, the distribution μ is more fundamental than the distribution function F, and usually μ is described not by F but by a density or by discrete probabilities. If X is a /c-dimensional random vector and g: Rk -»R' is measurable, then g(X) is an /-dimensional random vector; if the distribution of X is μ, the distribution of g(X) is μg'~', just as in the case к = 1—see (20.14). If g;: Rk ~*Rl is denned by gj(xlt...,xk)= x-, then gj(X) is Xj, and its distribution μj = μgJl is given by μ;(Α) = μ[(χ„ ..., xk): Xj еЛ] = P[Xj <eA] for
SECTION 20. RANDOM VARIABLES AND DISTRIBUTIONS 261 /le^'. The μι are the marginal distributions of μ. If μ has a density / in Rk, then μj has over the line the density (20.19) f;(x) = f kif(xl,...,xJ_l,x,xJ + l,...,xk)dxl ··· dxJ_1dxj + l ··· dxk, since by Fubini's theorem the right side integrated over A comes to /Ж*,,...,**): Xj^A]. Now suppose that g is a one-to-one, continuously differentiable map of V onto U, where £/ and К are open sets in Rk. Let Τ be the inverse, and suppose its Jacobian J(x) never vanishes. If X has a density / supported by V, then for Лс£/, />[g(Z) ёЛ] =/*[Ze ГЛ] = }TAf(y)dy, and by (17.10), this equals /^/(Тх^Дл:)! dx. Therefore, g(X) has density /(7x)|7(x)| forxei/, 0 for χ £ (/. This is the analogue of (20.16). Example 20.2. Suppose that (Л",, A^) has density /(x1,x2) = (277)-1exp[-i(x12 + x22)], and let g be the transformation to polar coordinates. Then U, V, and Τ are as in Example 17.7. If R and Θ are the polar coordinates of (Xu X2\ then (R,®) = g(Xi,X2) has densitv (2тгУ1ре~р2/2 in K. By (20.19), R has density pe-p /2on (0, «О, and Θ is uniformly distributed over (0,2 π). ■ For the normal distribution in Rk, see Section 29. Independence Random variables Xlt...,Xk are denned to be independent if the σ-fields σ(ΛΊ),...,a(Xk) they generate are independent in the sense of Section 4. This concept for simple random variables was studied extensively in Chapter 1; the general case was touched on in Section 14. Since a(Xt) consists of the sets [А',еЯ]йгЯе^'1, Х1,...,Хк are independent if and only if (20.21) P[Xl<EHl,...,XkeHk]=P[XleHl}---P[XkeHk} for all linear Borel sets Hu...,Hk. The definition (4.10) of independence requires that (20.21) hold also if some of the events [A',· e #,·] are suppressed on each side, but this only means taking H,- = Rl. (20.20) d(x) =
262 RANDOM VARIABLES AND EXPECTED VALUES Suppose that (20.22) P[Xl<xi,...,Xk<xk] = P[Xl<xi]-P[Xk<xk] for all real xlt...,xk; it then also holds if some of the events [A^x,] are suppressed on each side (let *,■-»<»). Since the intervals (-00, χ] form a T7-system generating &, the sets [Xt <x] form a ^-system generating a(Xt). Therefore, by Theorem 4.2, (20.22) implies that Xl,...,Xk are independent. If, for example, the Xt are integer-vaiued, it is enough that P\XX = nl,...,Xk = nk]=P[Xl=n1]---P[Xk = nk] for integral nv...,nk (see (5.9)). Let (Xv...,Xk) have distribution μ and distribution function F, and let the Xt have distributions μ., and distribution functions Ft (the marginals). By (20.21), X1,...,Xk are independent if and only if μ is product measure in the sense of Section 18: (20.23) μ = μιΧ ··· ΧμΙζ. By (20.22), X},...,Xk are independent if and only if (20.24) F(xi,...,xk)=Fi(xi)-Fk(xk). Suppose that each μ. has density /,; by Fubini's theorem, /,(y,) · · · fk(yk) integrated over (-00, χ,] χ ··· χ (-oo, xk] is just F,(x,) · · · Fk(xk), so that μ has density (20.25) /(*)=/,(*i) ···/*(**) in the case of independence. If cflt...,cfk are independent «--fields and Xt is measurable ^, / = 1,..., k, then certainly Xx,..., Xk are independent. If X{ is a ίί,-dimensional random vector, : = 1,...,k, then Xx,..., Xk are by definition independent if the σ-fields σ(Χχ),...,a(Xk) are independent. The theory is just as for random variables: Xu...,Xk are independent if and only if (20.21) holds for tf, e &d>, ...,Hk& &dk. Now (*„ ...,Xk) can be regarded as a random vector of dimension d = Ef=1rf,; if μ is its distribution in Rd = Rd[ X · · X Rd" and /x,- is the distribution of Xt in Rd', then, just as before, Xu...,Xk are independent if and only if μ = μ., Χ · · · X ju.^. In none of this need the di components of a single X{ be themselves independent random variables. An infinite collection of random variables or random vectors is by definition independent if each finite subcollection is. The argument following (5.10)
SECTION 20. RANDOM VARIABLES AND DISTRIBUTIONS 263 extends from collections of simple random variables to collections of random vectors: Theorem 20.2. Suppose that (20.26) *2i *22 "· is an independent collection of random vectors. If ^ is the σ-field generated by the ith row, then ^j, &~2,... are independent. Proof. Let s^i consist of the finite intersections of sets of the form [Xjj&H] with Η a Borel set in a space of the appropriate dimension, and apply Theorem 4.2. The σ-fields ^= a(szf), i = l,...,n, are independent for each n, and the result follows. ■ Each row of (20.26) may be finite or infinite, and there may be finitely or infinitely many rows. As a matter of fact, rows may be uncountable and there may be uncountably many of them. Suppose that X and Υ are independent random vectors with distributions μ and ν in R> and Rk. Then (X, Y) has distribution μ Χ ν in R> X Rk = Ri+k. Let χ range over R' and у over Rk. By Fubini's theorem, (20.27) (μχν)(Β)= ( v[y:(x,y)(=BU(dx), BeJi+i. JRl Replace В by (AxRk)nB, where Α<ξ&> and BeJ'+i. Then (20.27) reduces to (20.28) (μ X v)((A xRk) ПВ) = f v\y: (x, y) εί]μ(ώ), Ae&>, ВеЯ'+к. If Bx = [y: U,y)eB] is the x-section of B, so that Bx^&k (Theorem 18.1), then P[(x,Y) (Ξ B] = Ρ[ω: (χ,Υ(ω)) εβ] = Ρ[ω: Υ(ω) е Вх] = ι>(Βχ). Expressing the formulas in terms of the random vectors themselves gives this result: Theorem 20.3. // X and Υ are independent random vectors with distributions μ and ν in R' and Rk, then (20.29) Ρ[(Χ,Υ)(ΞΒ] = f Ρ[(χ,Υ)(ΞΒ]μ(άχ), βε#Λ JRJ
264 RANDOM VARIABLES AND EXPECTED VALUES and (20.30) P[XeA,(X,Y)f=B] = Ρ[(χ,Υ)^Β)μ(άχ), JA A<e&', B<B&i+k. Example 20.3. Suppose that X and Υ are independent exponentially distributed random variables. By (20.29), P[Y/X > ζ] = βΡ[Υ/χ > z]ae~axfitc = jQe-aXZae~axΛ = (1+ζ)_1. Thus Y/X has density (1 + ζΓ2 for z>0. Since P[X>zx, Y/X>z2] = j;P[Y/x>z2\ae'axdx by (20.30), the joint distribution of X and Y/X can be calculated as well. ■ The formulas (20.29) and (20.30) are constantly applied as in this example. There is no virtue in making an issue of each case, however, and the appeal to Theorem 20.3 is usually silent. Example 20.4. Here is a more complicated argument of the same sort. Let Xx,...,Xn be independent random variables, earh uniformly distributed over [0,f]. Let Yk be the fcth smallest among the Xt-, so that 0 < У, < ■ ■ · < Y„ < t. The Xt divide [0,f] into n+ 1 subintervals of lengths Yl,Y2-Yl,...,Yn-Yn_l,t -Y„; let Μ be the maximum of these lengths. Define t]/n(t,a) = P[M < a]. The problem is to show that (20.31) ic^EVi)')";1)!!-^", where x + = (x + \x\)/2 denotes positive part. Separate consideration of the possibilities 0<a<t/2, t/2 <a <t, and t <a disposes of the case η = 1. Suppose it is shown that the probability ψη(ί,α) satisfies the recursion (20.32) ^(Г|в) = и^я_1(г-лг,в)(1=£)Л '^. Now (as follows by an integration together with Pascal's identity for binomial coefficients) the right side of (20.31) satisfies this same recursion, and so it will follow by induction that (20.31) holds for all n. In intuitive form, the argument for (20.32) is this: If [M < a] is to hold, the smallest of the Xt must have some value χ in [0, a]. If XY is the smallest of the X„ then Χ2,···,Χ„ must all lie in [x, f] and divide it into subintervals of length at most a; the probability of this is (1 -χ/ί)"~)ψη_ι(ί -χ,ά), because X2,..., Xn have probability (1 — x/t)"~l of all lying in [x,t], and if they do, they are independent and uniformly distributed there. Now (20.32) results from integrating with respect to the density for Xt and multiplying by η to allow for the fact that any of Xx,...,Xn may be the smallest. To make this argument rigorous, apply (20.30) for j= 1 and к = η - 1. Let A be the interval [0, a], and let В consist of the points (jc„ ..., xn) for which 0 < xt < t, xl is the minimum of д:,,..., хп, and x2,...,xn divide [xx,t] into subintervals of length at most a. Then P[Xx = m\nX„ M< a] = P[XX eA, (*„..., X„) efl]. Take Xx for
SECTION 20. RANDOM VARIABLES AND DISTRIBUTIONS 265 X and (X2,...,X„) for У in (20.30). Since Xx has density \/t, (20.33) F[*1=min*,,M<a] = /<?F[(*,*2,. ,Дл)ей]у. If С is the event that χ < Xt < t for 2 < ί < л, then F(C) = (1 - jt/f)"~ '· A simple calculation shows that P[Xi - χ < s„ 2 < ι < n\C] = П?„2(·*;/^ ~~ *)); in other words, given C, the random variables X2~x,...,Xn-χ are conditionally independent and uniformly distributed over [0,ί -χ]. Now Л'г,..., Х„ are random variables on some probability space (Ω, У, F); replacing F by F(|C) shows that the integrand in (20.33) is the same as that in (20.32). The same argument holds with the index 1 replaced by any к (1 < к < η), which gives (20.32). (The events [Xk = min Xt, Υ <a] are not disjoint, but any two intersect in a set of probability 0.) ■ Sequences of Random Variables Theorem 5.3 extends to genera! distributions μη. Theorem 20.4. // {μη} is a finite or infinite sequence of probability measures on &\ there exists on some probability space (Ω, 3*~, Ρ) an independent sequence {Xn) of random variables such that Xn has distribution μη. Proof. By Theorem 5.3 there exists on some probability space an independent sequence Z„ Z2,... of random variables assuming the values 0 and 1 with probabilities P[Zn =0] = P[Zn = 1] = {. As a matter of fact, Theorem 5.3 is not needed: take the space to be the unit interval and the Ζ„(ω) to be the digits of the dyadic expansion of ω—the functions άπ(ω) of Sections and 1 and 4. Relabel the countably many random variables Z„ so that they form a double array, Zu Z12 ■^21 ^22 All the Znk are independent. Put Un = YTkalZnk2~k The series certainly converges, and Un is a random variable by Theorem 13.4. Further, £/,, U2,... is, by Theorem 20.2, an independent sequence. Now P[Zni = zt, 1 < i < k) = 2~k for each sequence z„..., zk of 0's and 1's; hence the 2k possible values j2'k, 0<j<2k, of Snk = EfDlZ„,2"'' all have probability 2"k. If 0 <x < 1, the number of the j2~k that lie in [0, x] is [2kx\ + 1, and therefore P[Snk<x] - ([2kx\ + \)/2k. Since 5πΛ(ω)Τ ί/„(ω) as A; T00, it follows that [Snk <x]l[Un <x] as M00, and so P[Un<x] = hmk P[Snk<x] = \\mk([2kx\ + \)/2k=x for 0 <x < 1. Thus U„ is uniformly distributed over the unit interval.
266 RANDOM VARIABLES AND EXPECTED VALUES The construction thus far establishes the existence of an independent sequence of random variables Un each uniformly distributed over [0,1]. Let Fn be the distribution function corresponding to μη, and put <p„(w) = inf[x: и <Fn(x)] for 0 <u < 1. This is the inverse used in Section 14—see (14.5). Set <p„(u) = 0, say, for и outside (0,1), and put Χ„(ω) = <ρ„(£/π(ω)). Since φ „(и) <x if and only if и <Fn(x)—see the argument following (14.5)—P[X„ <x] = P[Un<Fn(x)] = Fn(x). Thus Xn has distribution function Fn. And by Theorem 20.2, Xx, X2,... are independent. ■ This theorem of course includes Theorem 5.3 as a special case, and its proof does not depend on the earlier result. Theorem 20.4 is a special case of Kolmogorov's existence theorem in Section 36. Convolution Let X and Υ be independent random variables with distributions μ and v. Apply (20.27) and (20.29) to the planar set B = [(x,y): x + y^H] with (20.34) P[X+Y*=H]= Γ ν{Η-χ)μ{άχ) J — oo = Γ Ρ[Υ(ΞΗ-χ]μ(άχ). "* — OO The convolution of μ and ν is the measure μ * ν defined by (20.35) {μ*ν){Η)= Γ ν{Η-χ)μ{άχ), Ζ/eJ1. "* — 00 If X and Υ are independent and have distributions μ and v, (20.34) shows that X+Y has distribution μ * v. Since addition of random variables is commutative and associative, the same is true of convolution: μ * ν = ν * μ and μ *(ν * η) = (μ * ν)* η. If F and G are the distribution functions corresponding to μ and v, the distribution function corresponding to μ * ν is denoted F *G. Taking Η = (-oo,у] in (20.35) shows that (20.36) (f*G)(y)= Г G(y-x)dF(x). (See (17.22) for the notation dF(x).) If G has density g, then G(y -x) = jrjg(s)ds = fy_ocg(t--x)dt, and so the right side of (20.36) is fiJfZ^U -x)dF(x)]dt by Fubini's theorem. Thus F *G has density F * g, where (20.37) (F*g)(y)= Г g(y-x)dF(x);
SECTION 20. RANDOM VARIABLES AND DISTRIBUTIONS 267 this holds if G has density g. If, in addition, F has density /, (20.37) is denoted f * g and reduces by (16.12) to (20.38) (f*8)(y) = fg(y-x)f(x)dx. This defines convolution for densities, and μ * ν has density f * g if μ and ν have densities / and g. The formula (20.38) can be used for many explicit calculations. Example 20.5. Let Xx,...,Xk be independent random variables, each with the exponential density (20.10). Define gk by ,k-l (20.39) ^(xj-aif^L^e-*, put gk(x) = 0 for χ < 0. Now x>0, fc = l,2,...; U*-i*Si)(30 = f 8k-i(y-x)8i(x)dx> which reduces to gk(y). Thus gk=gk_1*g1, and since g, coincides with (20.10), it follows by induction that the sum A", + · · · +Xk has density gk. The corresponding distribution function is (20.40) Gt(x)-l-e-"*£^= £e — -^, *>0, fM ' 1 = 0 i-ft i! as follows by differentiation. ■ Example 20.6. Suppose that X has the normal density (20.12) with m — 0 and that У has the same density with r in place of σ. If X and У are independent, then X + Υ has density 1 Ίττστ f exp (y-x) __ j^_ 2σ2 2τ2 dx. A change of variable и = χ(σ2 + τ2)ι/2/στ reduces this to , 2 1 1 г ]/2ττ(σ2 + τ2) ^т ·>_ _ 1 exp ^|и-У τ/σ vWt2 / 2(σ2 + τ2) du
268 RANDOM VARIABLES AND EXPECTED VALUES Thus X+Y has the normal density with m = 0 and with σ2 + τ2 in place of σ\ ■ If μ and ν are arbitrary finite measures on the line, their convolution is defined by (20.35) even if they are not probability measures. Convergence in Probability Random variables Xn converge in probability to X, written Xn ~>P X, if (20.41) limP[\Xn-X\>c]=0 η for each positive e.f This is the same as (5.7), and the proof of Theorem 5.2 carries over without change (see also Example 5.4) Theorem 20.5. (i) IfXn-*X with probability I, then Xn ~>p X. (ii) Λ necessary and sufficient condition for Xn ~*P X is that each subsequence {Xn } contain a further subsequence {X„k ) such that Xnt. ~*X vAth probability 1 as i -»°° Proof. Only part (ii) needs proof. If Xn -*p X, then given {nk}, choose a subsequence {nkU)} so that k>k(i) implies that Р[\ХПк~Х\^Г1] <2"'. By the first Borel-Cantelli lemma there is probability 1 that \Xn - X\<i"1 for all but finitely many /'. Therefore, lim, Xn = X with probability 1. If Xn does not converge to X in probability, there is some positive e for which P[\Xn - ΛΊ > e] > e holds along some sequence [nk]. No subsequence of {X„ } can converge to X in probability, and hence none can converge to X with probability 1. ■ It follows from (ii) that if X„ -*p X and Xn ~>P Y, then X=Y with probability 1. It follows further that if / is continuous and Xn ~*p X, then f(Xn) -*P f(X). In nonprobabilistic contexts, convergence in probability becomes convergence in measure: If fn and / are real measurable functions on a measure space Ш, 3*~,μ), and if μ[ω: |/(ω) -/π(ω)|> e] -»0 for each e > 0, then fn converges in measure to /. The Glivenko-Cantelli Theorem* The empirical distribution function for random variables Xv...,Xn is the distribution function F„(x, ω) with a jump of и-1 at each ^(ω): (20.42) Ρη(χ,ω) = ± £/(-.„(**(«))· +This is often expressed plimn Хл = X *This topic may be omitted
SECTION 20. RANDOM VARIABLES AND D1STRJBUTIONS 269 If the Xk have a common unknown distribution function F(x), then Fn(x, ω) is its natural estimate. The estimate has the right limiting behavior, according to the Glivenko-Cantelli theorem: Theorem 20.6. Suppose thatXvX2,■■■ are independent and have a common distribution function F; put £>„(ω) = sup, \Fn(x, ω) - F{x)\. Then Dn~*0 with probability 1. For each x, Fn(x, ω) as a function of ω is a random variable. By right continuity, the supremum above is unchanged if χ is restricted to the rationals, and therefore Dn is a random variable. The summands in (20.42) are independent, identically distributed simple random variables, and so by the strong law of large numbers (Theorem 6.1), for each χ there is a set Ax of probability 0 such that (20.43) \imFn(x,w) = F(x) η for ω <£AX. But Theorem 20.6 says more, namely that (20.43) holds for ω outside some set A of probability 0, where A does not depend on χ—as there are uncountably many of the sets Ax, it is conceivable a priori that their union might necessarily have positive measure. Further, the convergence in (20.43) is uniform in x. Of course, the theorem implies that with probability 1 there is weak convergence Fn(x, ω) => F(x) in the sense of Section 14. Proof of the Theorem. As already observed, the set Ax where (20.43) fails has probability 0. Another application of the strong law of large numbers, with /(_«,,,) in place of /(_«,,,] in (20.42), shows that (see (20.5)) lim„ Fn(x - ,ω) = F(x - ) except on a set Bx of probability 0. Let <p(w) = inf[x: и <F(x)] for 0 <u < 1 (see (14.5)), and put x)n k = (p(k/m), m > 1, 1 < к < т. It is not hard to see that F(tp{u) - ) < и < F(<p(w)); hence F(xm k -)-F(xmk_i)<m-\ F{xmA-)<m-\ and F(xm J > 1 - τη \ Let Dmn{w) be the maximum of the quantities \Fn(xm k,oi) - F{xmk)\ and \Fn(xm,k- >ω)~ ЯхтД- )l for k= l,...,m. If 'x„ik.i <x<xmik, then Fn(x, ω) < Fn(xmk - , ω) < F(xmk - ) + DmJw)<F{x) + m-l+DmJw) and Fn{x,w)> Fn{xmΛ_„ω)> F(xm,M) - Dm „(ω) > F(x) - m ' - Dm „(ω). Together with similar arguments for the cases χ <xm , and χ > xm m, this shows that (20.44) , Dn(w)<DmJw)+m-i. If ω lies outside the union A of all the Ar and Br , then lim„ Dm „(ω) л тк л тк " "Ί" = 0 and hence lim„ £>„(ω) = 0 by (20.44). But A has probability 0. ■
270 RANDOM VARIABLES AND EXPECTED VALUES PROBLEMS 20.1. 2.11 Τ A necessary and sufficient condition for a σ-field ^ to be countably generated is that ^=σ(Χ) for some random variable X. Hint: If £= σ{Αν A2,...), consider X = ITk^if(IAt)/10k, where f(x) is 4 for χ = 0 and 5 for χ Φ 0. 20.2. If Λ" is a positive random variable with density /, then X'1 has density f(l/x)/x2. Prove this by (20.16) and by a direct argument. 20.3. Suppose that a two-dimensional distribution function F has a continuous density/. Show that f(x,y) = d2F(x,y)/dxdy. 20.4. The construction in Theorem 20.4 requires only Lebesgue measure on the unit interval. Use the theorem to prove the existence of Lebesgue measure on Rk. First construct \k restricted to (-η,η] χ · · x(-n,n], and then pass to the limit (n -»oo). The idea is to argue from first principles, and not to use previous constructions, such as those in Theorems 12.5 and 18.2. 20.5. Suppose that A, B, and С are positive, independent random variables with distribution function F. Show that the quadratic Azi + Bz + С has real zeros with probability {^F(x2/4y)dF(x)dF(y). 20.6. Show that Xi,X2,... are independent if σ(Χι,...,Χ„_ι) and σ(Χ„) are independent for each n. 20.7. Let X0, A",,... be a persistent, irreducible Markov chain, and for a fixed state / let Γ],Γ2,... be the times of the successive passages through /. Let Zx = 7\ and Z„ = T„-Tn_l, n>2. Show that Z1,Z2)... are independent and that P[Z„=k]=ffjk) for η > 2. 20.8. Ranks and records. Let X1,X2,... be independent random variables with a common continuous distribution function. Let В be the ω-set where XmM = Χ„(ω) for some pair m, η of distinct integers, and show that P( B) = 0. Remove В from the space Ω on which the X„ are defined. This leaves the joint distributions of the Xn unchanged and makes ties impossible. Let Τ(η\ω)^{Τ\η\ω),...,Τ^ηΚω)) be that permutation (ί,,...,ίη) of (1,..., n) for which Λ^ω) <Χ,±ω)< ■■■ < Χ, (ω). Let Yn be the rank of Xn among Xu...,Xn·. Yn = r if and only if X,-<X„ for exactly r- 1 values of г preceding n. (a) Show that Г(п) is uniformly distributed over the n\ permutations. (b) Show that P[Yn = /·] = \/n, \<r<n. (c) Show that Yk is measurable σ(Γ(π)) for к < п. (d) Show that У,,У2,... are independent. 20.9. Τ Record values. Let An be the event that a record occurs at time n: max^„ Xk<X„. (a) Show that AUA2,... are independent and P(A„)= \/n. (b) Show that no record stands forever. (c) Let Nn be the time of the first record after time n. Show that P[Nn=n + k] = n(n+k-irl(n+k)-1.
SECTION 20. RANDOM VARIABLES AND DISTRIBUTIONS 271 20.10. Use Fubini's theorem to prove that convolution of finite measures is commutative and associative. 20.11. Suppose that X and Υ are independent and have densities. Use (20.20) to find the joint density for (X+ Y, X) and then use (20.19) to find the density for X+Y. Check with (20.38). 20.12. If F(x - e)<F(x + e) for all positive e, then χ is a point of increase of F (see Problem 12.9). If F(x - )<F(x), then χ is an atom of F. (a) Show that, if χ and у are points of increase of F and G, then χ + y is a point of increase of F *G. (b) Show that, if χ and у are atoms of F and G, then χ +y is an atom of F*G. 20.13. Suppose that μ and ν consist of masses an and /3„ at η, η = 0,1,2,... . Show thai μ * ν consists of a mass of Σ£ = 0α>/3Π_Α at η, η = 0,1,2,... . Show that two Poisson distributions (the parameters may differ) convolve to a Poisson distribution. 20.14. The Cauchy distribution has density (20.45) Св(х) =!_£_, -oo<*<oo, for и > 0. (By (17.9), the density integrates to 1.) (a) Show that cu*ct =cu+l. Hint: Expand the convolution integrand in partial fractions. (b) Show that, if Xu...,Xn are independent and have density c„, then (A", + · · · +Xn)/n has density cu as well. 20.15. Τ (a) Show that, if X and Υ are independent and have the standard normal density, then X/Y has the Cauchy density with и = 1. (b) Show that, if X has the uniform distribution over (-π/2,π/2), then tan X has the Cauchy distribution with и = 1. 20.16. 18.18T Let Xh...,Xn be independent, each having the standard normal distribution Show that has density (20.46) —^ _x<»/2)-ie-*/2 over (0, °°). This is called the chi-squared distribution with η degrees of freedom.
272 RANDOM VARIABLES AND EXPECTED VALUEC e 20.17. Τ The gamma distribution has density (20.47) /(x;a;M)=_i!_x«-.e — over (0, °°) for positive parameters a and u. Check that (20.47) integrates to 1 Show that (20.48) /( ,a,u)*f( ;a,u)=f(-;a,u+u). Note that (20.46) is /(*; {-,n/2), and from (20.48) deduce again that (20.46) is the density of χ*. Note that the exponential density (20 10) is fix; a, 1), and from (20 48) deduce (20 39) cnce again. 20.18. Τ Let N,Xl,X2,... be independent, where P[N = n] = q"~xp, n>\, and each Xk has the exponential density f(x;a, 1). Show that X, + · · +ХЫ has density f(x;ap,l). 20.19. Let A„Je) = [\Zk ~Z\<e, η <k<m]. Show that Zn->Ζ with probability 1 if and only if limn limm P(Anm(e)) = 1 for all positive e, whereas Z„ -»,, Ζ if and only if lim„ P(Ann(e)) = 1 for all positive e. 20.20. (a) Suppose that /: R2 -» Rl is continuous. Show that X„ -*P X and Yn -*P Υ imply f(Xn,Y„)^Pf(X,Y). (b) Show that addition and multiplication preserve convergence in probability. 20.21. Suppose that the sequence {Xn} is fundamental in probability in the sense that fore positive there exists an Λ^ such that P[\Xm-Xn\>e]<e for m,n>Ne. (a) Prove there is a subsequence {X„ } and a random variable X such that nk limA Xn =X with probability 1. Hint: Choose increasing nk such that P[\Xm-X„\ > 2'k] < 2~k for m,n^nk. Analyze P[\X„k„ -X„t\ > 2"*]. (b) Show that Xn-*P X. 20.22. (a) Suppose that Xt<X2< ■■■ and that X„ ->P X. Show that X„-*X with probability 1. (b) Show by example that in an infinite measure space functions can converge almost everywhere without converging in measure. 20.23. If X„->0 with probability 1, then n~XrL'kalXk -> 0 with probability 1 by the standard theorem on Cesaro means [A30]. Show by example that this is not so if convergence with probability 1 is replaced by convergence in probability. 20.24. 2.19 Τ (a) Show that in a discrete probability space convergence in probability is equivalent to convergence with probability 1. (b) Show that discrete spaces are essentially the only ones where this equivalence holds: Suppose that Ρ has a nonatomic part in the sense that there is a set A such that P(A)>0 and P( \A) is nonatomic. Construct random variables X„ such that Xn -+p 0 but Xn does not converge to 0 with probability 1.
SECTION 21. EXPECTED VALUES 273 0.25. 20.21 20.24Τ Let d(X,Y) be the infimum of those positive e for which P[\X-Y\ >e]<e. (a) Show that d(X,Y) = 0 if and only if X=Y with probability 1. Identify random variables that are equal with probability 1, and show that d is a metric on the resulting space. (b) Show that X„ ->P X if and only if d(X„, X) -> 0. (c) Show that the space is complete. (d) Show that in general there is no metric d0 on this space such that Xn->X with probability 1 if and only if d0(Xn, X) -* 0. 20.26. Construct in Rk a random variable X that is uniformly distributed over the surface of the unit sphere in the sense that |A"| = 1 and UX has the same distribution as X for orthogonal transformations U. Hint Let Ζ be uniformly distributed in the unit ball in Rk, define ψ(χ) =χ/\χ\ (ψ(0) = (1,0, ,0), say), and take Χ = φ(ΖΥ 20.27. Τ Let θ and Φ be the longitude and latitude of a random point on the surface of the unit sphere in R3. Show that Θ and Φ are independent, Θ is uniformly distributed over [0,27г), and Φ is distributed over [-it/2, +тг/2] with density γ cos φ SECTION 21. EXPECTED VALUES Expected Value as Integral The expected value of a random variable X on (Ω, &~, P) is the integral of X with respect to the measure P: E[X] = [XdP= f Χ(ω)Ρ(άω). J Jn All the definitions, conventions, and theorems of Chapter 3 apply. For nonnegative X, E[X] is always defined (it may be infinite); for the general X, £[X] is defined, or X has an expected value, if at least one of E[X+] and E[X~\ is finite, in which case E[X] = E[X+] ~ E[X~\, and X is integrable if and only if Ε[|ΑΊ]<<». The integral jAXdP over a set A is defined, as before, as E[IAX]. In the case of simple random variables, the definition reduces to that used in Sections 5 through 9. Expected Values and Limits The theorems on integration to the limit in Section 16 apply. A useful fact: If random variables Xn are dominated by an integrable random variable, or if they are uniformly integrable, then E[Xn] -» E[X] follows if Xn converges to X in probability—convergence with probability 1 is not necessary. This follows easily from Theorem 20.5.
274 RANDOM VARIABLES AND EXPECTED VALUES Expected Values and Distributions Suppose that X has distribution μ. If g is a real function of a real variable, then by the change-of-variable formula (16.17), (21.1) E[g(X)]= Γ 8(χ)μ(άχ). J —oo (In applying (16.17), replace Τ: Ω, -> Ω' by Χ: Ω, -* R\ μ by Ρ, μΤ'1 by μ, and / by g.) This formula holds in the sense explained in Theorem 16.13: It holds in the nonnegative case, so that (212) E[\g(X)\]^r \8(χ)\μ(άχ); J — 00 if one side is infinite, then so is the other. And if the two sides of (2).2) are finite, then (21.1) holds. If μ is disciete and μ{χν x2,...} = 1, then (21.1) becomes (use Theorem 16.9) (21-3) E[g(*)]=E*(xr)/*{xr}. г If X has density /, then (21.1) becomes (use Theorem 16.11) (21.4) E[g(X)]=f g(x)f(x)dx. J — 00 If F is the distribution function of X and μ, (21.1) can be written E[g(X)] = FL„g(x)dF(x) in the notation (17.22). Moments By (21.2), μ and F determine all the absolute moments of X: (21.5) E[\X\k]= Γ\χ\"μ(άχ)= r\x\kdF(x), it = 1,2,.... ^ — 00 ^ — 00 Since j <k implies that |jc|y < 1 + |jc|*, if X has a finite absolute moment of order k, then it has finite absolute moments of orders 1,2,...,k - 1 as well. For each к for which (2.15) is finite, X has fcth moment (21.6) E[xk\ = Γ χ"μ((Ιχ)= Γ xkdF{x). These quantities are also referred to as the moments of μ and of F. They can be computed by (21.3) and (21.4) in the appropriate circumstances.
SECTION 21. EXPECTED VALUES 275 Example 21.1. Consider the normal density (20.12) with m = 0 and σ = 1. For each k, xke~x /2 goes to 0 exponentially as x-> ±o°, and so finite moments of all orders exist. Integration by parts shows that ■^Γι*ί-'!/2ώ~ί"ι*-2Γ'2/2ώ, it = 2,3,.... v2t7 •'-oo V2t7 ■'-co (Apply (18.16) to g(x)=x*-2 and f{x) = xe~"2/2, and let α -» -oo, b-*«>.) Of course, Ε[λΊ = 0 by symmetry and E[X°]= 1. It follows by induction that (21.7) E[X2k] = 1x3x5 X ··· x(2k-l), Λ = 1,2,..., and that the odd moments all vanish. ■ If the first two moments of X are finite and E[X] = m, then just as in Section 5, the variance is (21.8) V<\r[X] = E\(X-m)2] = f (χ-ηι)2μ(άχ) * — ου = E[X2]-m2. From Example 21.1 and a change of variable, it follows that a random variable with the normal density (20.12) has mean m and variance σ2. Consider for nonnegative X the relation (21.9) E[X]= fp[X>t]dt = fp[X>t]dt. Since P[X= t] can be positive for at most countably many values of t, the two integrands differ only on a set of Lebesgue measure 0 and hence the integrals are the same. For X simple and nonnegative, (21.9) was proved in Section 5; see (5.29). For the general nonnegative X, let Xn be simple random variables for which 0 < Xn Τ Χ (see (20.1)). By the monotone convergence theorem EtA'Jt E[X\; moreover, P[Xn> t]\ P[X> t], and therefore j^P[Xn >t]df\ JqP[X> t]dt, again by the monotone convergence theorem. Since (21.9) holds for each Xn, a passage to the limit establishes (21.9) for X itself. Note that both sides of (21.9) may be infinite. If the integral on the right is finite, then X is integrable. Replacing X by XI[X>a] leads from (21.9) to (21.10) f XdP = aP[X>a]+ [°°P[X>t]dt, a>0. J[X>a] Ja As long as a > 0, this holds even if X is not nonnegative.
276 RANDOM VARIABLES AND EXPECTED VALUES Inequalities Since the final term in (21.10) is nonnegative, aP[X> a] < f[X>a]XdP < E[X]. Thus (21.11) P[X>a]< -f XdP<-E[X], a>0, 01 J[X>a] a for nonnegative X. As in Section 5, there follow the inequalities (21.12) P[\X\>a] <\( \X\kdP<^E\\X\k\. It is the inequality between the two extreme terms here that usually goes under the name of Markov; but the left-hand inequality is often useful, too. As a special case there is Chebyshev's inequality, (21.13) P[\X-m\>a] <-^Var[Z] a {m=E[X]l Jensen's inequality (21.14) φ(Ε[Χ])<Ε[φ(Χ)] holds if φ is convex on an interval containing the range of X and if X and φ(Χ) both have expected values. To prove it, let l(x) = ax + b be a supporting line through (E[X],cp(E[X]))—a line lying entirely under the graph of φ [A33]. Then aX{co) + b <φ(Χ(ω)), so that aE[X] + b <E[cp(X)]. But the left side of this inequality is φ(Ε[Χ]). Holder's inequality is (21.15) E[\XY\] ^'/'[lATJE'/^liT], - + - = 1. For discrete random variables, this was proved in Section 5; see (5.35). For the general case, choose simple random variables Xn and Yn satisfying 0<|ZJTI*I and 0<|rjiin see (20.2). Then (5.35) and the monotone convergence theorem give (21.15). Notice that (21.15) implies that if \X\" and \Y\'' are integrable, then so is XY. Schwarz's inequality is the case ρ = q = 2: (21.16) E[\XY\] <Εχ/2\Χ2]Εχ/2\Υ2]. If X and Υ have second moments, then XY must have a first moment.
SECTION 21. EXPECTED VALUES 277 The. same reasoning shows that Lyapounov's inequality (5.37) carries over from the simple to the general case. Joint Integrals The relation (21.1) extends to landom vectors. Suppose that (A',,..., Xk) has distribution μ in /c-space and g: Rk ->/?'. By Theorem 16.13, (21.17) E[g{Xlt...,Xk))=i β(χ)μ(άχ), J Rk with the usual provisos about infinite values. For example, £[A",A"y] = /K*X;X;/x(iic). If E[Xi] = mi, the couariance of A"; and A"; is Cov[4^] =£[(i-m,.)(^-W/)] =/ (χ,.-Μί)(ίΓίη>(ώ). Random variables are uncorrelated if they have covariance 0. Independence and Expected Value Suppose that X and Υ are independent. If they are also simple, then E\XY] = E[X]E[Y], as proved in Section 5—see (5.25). Define Xn by (20.2) and similarly define Yn = ψη(Υ+)- ψη{Υ~). Then Xn and Yn are independent and simple, so that E[\XnYn\\ = E[\Хп\ШЮ1 and 0 < \Xn\1 IX\, 0 < !Г„|Т ΙΠ If X and У are integrable, then E[\XnYn\] = E[\Xn\]E[\Yn\]1 E[\X\]-E[\Y\], and it follows by the monotone convergence theorem that E[\XY\\ < oo; since XnYn ~>XY and \XnYn\<\XY\, it follows further by the dominated convergence theorem that £[AT] = lim„ E[XnYn] = \imn EiXn]E[Yn] = E[X]E[Y]. Therefore, XY is integrable if X and Υ are (which is by no means true for dependent random variables) and E[XY\ = E[X]E[Y]. This argument obviously extends inductively: If X1,...,Xk are independent and integrable, then the product A", · · · Xk is also integrable and (21.18) £[Α-1···Α-,]=£[Α-1]···Ε[Α-,]. Suppose that ^λ and ^2 are independent σ-fields, A lies in cfl,Xl is measurable ^,, and X2 is measurable ^f2- Then lAXx and X2 are independent, so that (21.18) gives (21.19) \ XxX2dP= \ XxdP-E{X2\ }A JA if the random variables are integrable. In particular, (21.20) f X2dP = P(A)E[X2]. 3 A
278 RANDOM VARIABLES AND EXPECTED VALUES From (21.18) it follows just as for simple random variables (see (5.28)) that variances add for sums of independent random variables. It is even enough that the random variables be independent in pairs. Moment Generating Functions The moment generating function is defined as (21.21) M(s) = E\esX\ = Γ ε"μ(άχ) = Г е"dF(x) J — oo ^ —oo for all s for which this is finite (note that the integrand is nonnegative). Section 9 shows in the case of simple random variables the power of moment generating function methods. This function is also called the Laplace transform of μ, especially in nonprobabilistic contexts. Now Ι^'χμ(άχ) is finite for s < 0, and if it is finite for a positive s, then it is finite for all smaller s. Together with the corresponding result for the left half-line, this shows that M(s) is defined on some interval containing 0. If X is nonnegative, this interval contains (-*>,()] and perhaps part of (0, oo); if X is nonpositive, it contains [0,°°) and perhaps part of (-oo,0). It is possible that the interval consists of 0 alone; this happens, for example, if μ is concentrated on the integers and μ{η] = μ{-η) = С/η2 for η = 1,2,... . Suppose that M(s) is defined throughout an interval (-s0,s0), where 50 > 0. Since e1"1 < esx + e~sx and the latter function is integrable μ for |s|<s0, so is Lk'=0\sx\k/k\ = e]sxK By the corollary to Theorem 16.7, μ has finite moments of all orders and 00 к °° к (21.22) M(s)- Σ πΦ"]= Σ πΓ xkit{dx), \s\<s0. Thus M(s) has a Taylor expansion about 0 with positive radius of convergence if it is defined in some (-s(),.s()), s()>0. If M(s) can somehow be calculated and expanded in a series Zkaksk, and if the coefficients ak can be identified, then, since ak must coincide with E[Xk]/k\, the moments of X can be computed: E[Xk] = akk\ It also follows from the theory of Taylor expansions [A29] that akk\ is the kth derivative M(k\s) evaluated at s =0: (21.23) M(k\0) =E[xk] = Γ xkii.(dx). This holds if M(s) exists in some neighborhood of 0. Suppose now that Μ is defined in some neighborhood of s. If ν has density esx/M(s) with respect to μ (see (16,11)), then ν has moment generating function N(u) = M(s + u)/M(s) for и in some neighborhood of 0.
SECTION 21. EXPECTED VALUES 279 But then by (21.23), JV(Ar)(0) = r_„xkv(dx) = /:„χ^"μ(ώ)/Μ(ί), and since /V(Ar)(0) = M(Ar)(s)/M(s), (21.24) Mik)(s) = Γ xkes^(dx). This holds as long as the moment generating function exists in some neighborhood of s. If s = 0, this gives (21.23) again. Taking к = 2 shows that M(s) is convex in its interval of definition. Example 21.2. For the standard normal density, M(s) = -jL· Г e"e-*2/2dx= -^e*2/2 f -ix-s)2/2 dx, '.tt J-oo \Ίττ and a change of variable gives (21.25) M(s)=ei2/2. The moment generating function in this case defined for all s. Since es /2 has the expansion ^/2- f J_f£l]*_ f 1Χ3χ-·-Χ(2/:-1) ~£о*!12' "£„ (2*)! the moments can be read off from (21.22), which proves (21.7) once more. ■ Example 21.3. In the exponential case (20.10), the moment generating function (21.26) M(s) = fesxae-axdx=-^- = Σ ^ •Ό α ό k = oa is defined for s <a. By (21.22) the kth moment is k\a~k. The mean and variance are thus a~l and a~2. ■ Example 21.4. For the Poisson distribution (20.7), (21.27) M(s)= ierse-k^=e^-'\ r = Q Since M'(s) = Xe'M(s) and M"(s) = (A2e2i + ke')M(s), the first two moments are M'(0) = Λ and M"(0) = Λ2 + Λ; the mean and variance are both A.
280 RANDOM VARIABLES AND EXPECTED VALUES Let Xv..., Xk be independent random variables, and suppose that each Xt has a moment generating function M,(s) = E[esXi] in (-s0,s0). For |s|<s0, each expisZ,) is integrable, and, since they are independent, their product exp(sEf=, A",·) is also integrable (see (21.18)). The moment generating function of Xj + · · · +Xk is therefore (2128) M(s)=Ml(s)··-Mk(s) in (-50,50). This relation for simple random variables was essential to the arguments in Section 9. For simple random variables it was shown in Section 9 that the moment generating function determines the distribution. This will later be proved for general random variables; see Theorem 22.2 for the nonnegative case and Section 30 for the general case. PROBLEMS 21.1. Prove differentiate к times with respect to t inside the integral (justify), and derive (21,7) again. 21.2. Show that, if X has the standard normal distribution, then Е[|ЛГ|2л + |] = 2nn\yJ2/^. 21.3. 20.9 Τ Records. Consider the sequence of records in the sense of Problem 20.9. Show that the expected waiting time to the next record is infinite. 21.4. 20.14 Τ Show that the Cauchy distribution has no mean. 21.5. Prove the first Borel-Cantelli lemma by applying Theorem 16.6 to indicator random variables. Why is Theorem 16.6 not enough for the second Borel-Cantelli lemma? 21.6. Prove (21.9) by Fubini's theorem. 21.7. Prove for integrable X that E[X]= Γρ[χ>ί]ώ- f° P[X<t]dt.
SECTION 21. EXPECTED VALUES 281 21.8. (a) Suppose that X and У have first moments, and prove E[Y]-E[X] = f (P[X<t<Y]-P[Y<t<X])dt. J —со (b) Let (А", У] be a nondegenerate random interval Show that its expected length is the integral with respect to t of the probability that it covers t. 21.9. Suppose that X and У are random variables with distribution functions F and G. (a) Show that if F and G have no common jumps, then E[F(Y)] + E[G(X)] = 1. (b) If F is continuous, then E[F(X)] = f. (c) Even if F and G have common jumps, if X and Υ are taken to be independent, then E[F(Y)] + E[G(X)] = 1 -rP[X = Y]. (d) Even if F has jumps, E[F(X)]= { f \JLxP2[X=x] 21.10. (a) Show that uncorrelated variables need not be independent. (b) Show that VarfE^, A",] = Ц,_, Covf*,, Xf] = E?_, Varf*,] + 2E];S,</s„Cov[A';, X]. The cross terms drop out if the A", are uncorrelated, and hence drop out if they are independent. 21.11. Τ Let X, У, and Ζ be independent random variables such that X and У assume the values 0,1,2 with probability \ each and Ζ assumes the values 0 and 1 with probabilities \ and \. Let X'= X and У =X+ Ζ (mod 3). (a) Show that А", У, and A" -f У have the same one-dimensional distributions as X, У, and X+Y, respectively, even though (А",У) and (X,Y) have different distributions. (b) Show that A" and У are dependent but uncorrelated (c) Show that, despite dependence, the moment generating function of A" + У is the product of the moment generating functions of A" and У. 21.12. Suppose that X and У are independent, nonnegative random variables and that Ε[ΑΊ = οο and £[У] = 0. What is the value common to E[XY] and E[X]E[Y]? Use the conventions (15.2) for both the product of the random variables and the product of their expected values. What if £[A"] = oo and 0<Е[У]<оо? 21.13. Suppose that X and У are independent and that f(x,y) is nonnegative. Put g(x) = E[f(x,Y)] and show that E[g(X)] = E[f(X,Y)]. Show more generally that fxeAg(X)dP = fxeAf(X, Y)dP. Extend to / that may be negative. 21.14. Τ The integrability of A" + y does not imply that of X and У separately. Show that it does if X and У are independent. 21.15. 20.25 Τ Write dx{X,Y) = E[|Ar- У 1/(1 +\X- У I)]. Show that this is a metric equivalent to the one in Problem 20.25. 21.16. For the density Cexp(~U|1/2), -°° <x <°°, show that moments of all orders exist but that the moment generating function exists only at 5 = 0.
282 RANDOM VARIABLES AND EXPECTED VALUES 21.17. 16.6 Τ Show that a moment generating function M(s) defined in (— % sn), s0 > 0, can be extended to a function analytic in the strip [z: -sa < Re ζ < s0]. If A/(s) is defined in [0, sQ), s0 > 0, show that it can be extended to a function continuous in [z: 0 < Re ζ <s0] and analytic in [z· 0 < Re ζ < s0]. 21.18. Use (21.28) to find the generating function of (20.39). 21.19. For independent random variables having moment generating functions, show by (21.28) that the variances add. 21.20. 20.17 Τ Show that the gamma density (20.47) has moment generating function (I —s/a)~" for s<a. Show that the fcth moment is u(u + 1) ••(u+k — \)/ak Show that the chi-squared distribution with η degrees of freedom has mean η and variance In. 21.21. Let A",, X2,... be identically distributed random variables with finite second moment. Show that ηΡ[\Χλ\> e\fn] ->0 and n'i/2maxki„\Xk\ -*P 0. SECTION 22. SUMS OF INDEPENDENT RANDOM VARIABLES Let XX,X2,... be a sequence of independent random variables on some probability space. It is natural to ask whether the infinite series Σ^,Λ^ converges with probability 1, or as in Section 6 whether n~lY."k=lXk converges to some limit with probability 1. It is to questions of this sort that the present section is devoted. Throughout the section, Sn will denote the partial sum L"k=lXk (50 = 0). The Strong Law of Large Numbers The central result is a general version of Theorem 6.1. Theorem 22.1. If X],X2,... are independent and identically distributed and have finite mean, then Sn/n -» E[X{] with probability 1. Formerly this theorem stood at the end of a chain of results. The following argument, due to Etemadi, proceeds from first principles. Proof. If the theorem holds for nonnegative random variables, then n-lS„ = n-lZnk,lX^-n-lZnk.iXk-*^.^i]-E[X;] = ElXl] with probability 1. Assume then that Xk > 0. Consider the truncated random variables Yk =XkI[X sk] and their partial sums 5* = Σ£= {Yk. For a > 1, temporarily fixed, let u„ = [a"\. The first step is to prove s'--e^ >; <.. (22.1) Σρ n=l
SECTION 22, SUMS OF INDEPENDENT RANDOM VARIABLES 283 Since the Xn are independent and identically distributed, Var[S„*]= Σ Var[r,]< £ e[y*] k=\ k~ 1 iE[XtIlXlSk]\znE[XtllXiSa]\. k= 1 It follows by Chebyshev's inequality that the sum in (22.1) is at most n= 1 V"K] 1- —— < — F 2 2 — 2 1 xi L ,. I[x,<;„ n= 1 U \.χι£"η) Let K= 2a/(a - 1), and suppose χ > 0. If N is the smallest η such that u„ >x, then aN>x, and since у < 2[y\ for у > 1, Σ "π_1<2 Σ α-" = *<*""<: ЯГ1. U„>X /TSlN Therefore, Σ£=,Μ~'/[ιν 511 ]</lY["' for Xx > 0, and the sum in (22.1) is at most Ke~2E[Xl]<oo. ' From (22.1) it follows by the first Borel-Cantelli lemma (take a union over positive, rational e) that (5*n - E[S*])/u„ -»0 with probability 1. But by the consistency of Cesaro summation [A30], n~lE[S%] = n~lZ"k=lE[Yk] has the same limit as E[Y„], namely. Ε[Χλ). Therefore S* /u„ ->£[Z,] with probability 1. Since 00 °° 00 another application of the first Borel-Cantelli lemma shows that (5* - Sn)/n -»0 and hence (22.2) ■ВД] with probability 1. If u„<k <u +l, then since A',. > 0,
284 RANDOM VARIABLES AND EXPECTED VALUES But u„ + i/un -»a, and so it follows by (22.2) that ι 5 5 -E[ Xx ] < lim inf -jr < lim sup -£ < ff£[ A", ] with probability 1. This is true for each a > 1. Intersecting the corresponding sets over rational o- exceeding 1 gives lim* Sk/k = E[Xt] with probability 1. Although the hypothesis that the Xn all have the same distribution is used several times in this proof, independence is used only through the equation Var[5*] = Σ'Ι^ι VarfFj, and for this it is enough that the Xn be independent in pairs. The proof given for Theorem 6.1 of course extends beyond the case of simple random variables, but it requires £[ A^4] < oo. Corollary. Suppose that Xl,X2,... are independent and identically distributed and £[Λ7]<οο, £[*,+] = οο (so that £[*,] = oo). Then n''Z"k^Xk -»oo with probability 1. Proof. By the theorem, и^Е^Л^-»E[A7l with probability 1, and so it suffices to prove the corollary for the case A^ =X*> 0. If [X„ ifO<X<u, " \0 ifXn>u, then η" 'Σ£=,Xk > n' lLU tXj·'0 -»E[X\U)\ by the theorem. Let и -»oo. ■ The Weak Law and Moment Generating Functions The weak law of large numbers (Section 6) carries over without change to the case of general random variables with second moments—only Chebyshev's inequality is required. The idea can be used to prove in a very simple way that a distribution concentrated on [0, oo) is uniquely determined by its moment generating function or Laplace transform. For each Л, let YK be a random variable (on some probability space) having the Poisson distribution with parameter Л. Since Υλ has mean and variance Λ (Example 21.4), Chebyshev's inequality gives
SECTION 22. SUMS OF INDEPENDENT RANDOM VARIABLES 285 Let GA be the distribution function of FA/A, so that k = 0 The result above can be restated as (22.3) HmGA(0 = (i ί^ί' v ; A-. AV ; \0 ifi<l. In the notation of Section 14, GA(x) => A(x - 1) as A -»oo. Now consider a probability distribution μ concentrated on [0,oo). Let F be the corresponding distribution function. Define (22.4) M(s)= Γβ~"μ(άχ), s>0; Jo here 0 is included in the range of integration. This is the moment generating function (21.21), but the argument has been reflected through the origin. It is a one-sided Laplace transform, defined for all nonnegative s. For positive s, (21.24) gives (22.5) MikXs) = (-l)k Гуке-'Щау). Therefore, for positive χ and s, lsxi (-l)k r°°[sxi (sv)k (22.6) Σ ^-skM«\s) = / Ъе-'уЩ-^) -jTc„(i)M(*). Fix χ > 0. If* 0 <y <x, then Gsy(x/y) -» 1 as 5 -»oo by (22.3); if у >х, the limit is 0. If μ{χ) = 0, the integrand on the right in (22.6) thus converges as s -»oo to /[0л.](у) except on a set of μ,-measure 0. The bounded convergence theorem then gives l«J /_J4* (22.7) lim Σ LTTZ-5A:M(Ar)(5)=/x[0,x]=F(x). fIf у = 0, the integrand in (22.5) is 1 for к = 0 and 0 for λ 2 1, hence for >> = 0, the integrand in the middle term of (22 6) is 1.
286 RANDOM VARIABLES AND EXPECTED VALUES Thus M(s) determines the value of F at χ if χ > 0 and μ{χ) =■ 0, which covers all but countably many values of χ in [0,°°), Since F is right-continuous, F itself and hence μ are determined through (22.7) by M(s). In fact μ is by (22,7) determined by the values of M(s) for s beyond an arbitrary s0: Theorem 22.2. Let μ and ν be probability measures on [0, °o). If Γβ-"μ(άχ)= Γβ-"ι·(άχ), s>s0, where s0 > 0, then μ = v. Corollary. Let fi and f2 be real functions on [0,°o), If f"'e-"fl(x)dx= fe-sxf2{x)dx, s>s0, where s0 > 0, then /, = f2 outside a set of Lebesgue measure 0. The fj need not be nonnegative, and they need not be integrable, but e ""/,·(x) must be integrable over [0,oo) for s > s0. Proof. For the nonnegative case, apply the theorem to the probability densities g,(x) = e~s"xf(x)/m, where m = fZe~s°xfi(x)dx, / = 1,2. For the general case, prove that /Γ+/^ = /^ + /Γ almost everywhere. ■ Example 22.1. If μ] * μ2 = μ3, then the corresponding transforms (22.4) satisfy Λ/,(5)Λ/2(ί) = M3(s) for s > 0. If μ., is the Poisson distribution with mean A,-, then (see (21.27)) M,(s) = exp[A,(e-i - 1)]. It follows by Theorem 22.2 that if two of the μ., are Poisson, so is the third, and Α, Ι- Α2 = A3. ■ Kolmogorov's Zero-One Law Consider the set Α οί ω for which Μ_ιΣ£ = |Λ^(ω) -»0 as η -»οο. For each m, the values of Χ](ω),...,Χηι_](ω) are irrelevant to the question of whether or not ω lies in A, and so A ought to lie in the σ-field σ(Χιη, Xm + l,...). In fact, lim„ n~ 1Σ™Γ()1Λ'Λ(ω) = 0 for fixed m, and hence ω lies in A if and only if lim„ n~lZ"k=s,„Xk(w) = 0. Therefore, (22.8) a= Π U Π e N>mn>N ω: «-' £**(«) k = m <e the first intersection extending over positive rational e. The set on the inside
SECTION 22. SUMS OF INDEPENDENT RANDOM VARIABLES 287 lies in a(Xm,Xm + i,,,,\ and hence so does A. Similarly, the ω-set where the series ΣηΧη(ω) converges lies in each a(Xm, Xm + ],,..). The intersection У~= Γ\°°η^]σ(Χη,Χη + ι,...) is the tail σ-field associated with the sequence XVX2,...; its elements are tail events. In the case Xn = lA , this is the σ-field (4.29) studied in Section 4. The following general form of Kolmogorov's zero-one law extends Theorem 4,5. Theorem 22.3. Suppose that {Xn} is independent and that A e У~= ГС-,*(*„,*„ + „...). Then either P(A) = 0 or P(A)=l. Proof. Let S^ = (J ^=la(Xt,,,,, Xk). The first thing to establish is that S^ is a field generating the σ-field a(Xt, X2,...). If В and С lie in S%, then Β εσ(ΛΊ,..., Xj) and С (Ea(Xx,...,Xk) for some ; and k; if m = max{/,/c}, then β and С both lie in σ(Λ'1,..., Л^Д so that BuCe σίΖ,,..., Xm) с 5^. Thus 5^ is closed under the formation of finite unions; since it is similarly closed under complementation, 5^ is a field. For H<e&\ [Ine//)e ^σσ(^), and hence Xn is measurable σ^); thus 5^ generates a(XuX2,...) (which in general is much larger than &~0). Suppose that A lies in &. Then A lies in a(Xk + l, Xk+2,...) for each /с. Therefore, if В e σ(.Υ,,..., Xk), then v4 and β are independent by Theorem 20.2. Therefore, A is independent of 3^ and hence by Theorem 4.2 is also independent of σ(Χν X2,...). But then A is independent of itself: P(A Π A) = P(A)P(A). Therefore, P(A) = P\A\ which implies that P(A) is either 0 or 1. ■ As noted above, the set where ΣηΧη(ω) converges satisfies the hypothesis of Theorem 22.3, and so does the set where n~iY.nk-xXk{.ai) -»0. In many similar cases it is very easy to prove by this theorem that a set at hand must have probability either 0 or 1. But to determine which of 0 and 1 is, in fact, the probability of the set may be extremely difficult. Maximal Inequalities Essential to the study of random series are maximal inequalities—inequalities concerning the maxima of partial sums. The best known is that of Kolmogorov. Theorem 22.4. Suppose that X],...,Xn are independent with mean 0 and finite variances. For a > 0, (22.9) Ρ max \Sk\>a < \ Var[S„]. ]<,k<,n
288 RANDOM VARIABLES AND EXPECTED VALUES Proof. Let Ak be the set where \Sk\>a but |5;| <a for j <k. Since the Ak are disjoint, E\s2n\> Σ/ sup k = 1 Ak = Σ/ [s,2 + 2S,(Sn-Sj+(S„-Sj2]^ *= ι Since Ak and 5^ are measurable a(Xu...,Xk) and Sn-Sk is measurable σ(Λ^ +,,..., Л'Д and since the means are all 0, it follows by (21.19) and independence that jAS,,(Sn- Sk)dP = 0. Therefore, E[S2n\> ΣΙ S2kdP> Z«2P(Ak) k=l Ak · · *=i - ™2 alP max |SJ > a \<k<n By Chebyshev's inequality, P[\Sn\ > a] <a~2 VarfSj. That this can be strengthened to (22.9) is an instance of a general phenomenon: For sums of independent variables, if maxk£n \Sk\ is large, then |5„| is probably large as well. Theorem 9.6 is an instance of this, and so is the following result, due to Etemadi. Theorem 22.5. Suppose that Xt,...,X„ are independent. For a>0, (22.10) max \Sk\> 3a 1 <k <n <3 max /5[|SJ>a']. 1 <k<n Proof. Let Bk be the set where \Sk\ > 3a but |5;| < 3a for / <k. Since the Bk are disjoint, max \Sk\ > 3a . 1 <,k<n <P[\Sn\>a] + "iP(Bkn[\Sn\<a}) <P[\Sn\>a] + "i P{Bkn[\Sn-Sk\>2a]) = P[\Sn\>a] + "i P(Bk)P[\Sn-Sk\>2a] k=\ <P[\Sn\>a] + max P[\S„ - Sk\ >2a] 1 <k <n ^P[\Sn\>a]+ max (P[\Sn\>a]+P[\Sk\>a]) \<k^n <3 max P[\Sk\>a]. 1 <k<n
SECTION 22. SUMS OF INDEPENDENT RANDOM VARIABLES 289 If the Xk have mean 0 and Chebyshev's inequality is applied to the right side of (22.10), and if a is replaced by a/3, the result is Kolmogorov's inequality (22.9) with an extra factor of 27 on the right side. For this reason, the two inequalities are equally useful for the applications in this section. Convergence of Random Series For independent Xn, the probability that ΣΧη converges is either 0 or 1. It is natural to try and characterize the two cases in terms of the distributions of the individual Xn. Theorem 22.6. Suppose that {Xn) is an independent sequence and E[Xn] = 0. // EVartZj <oo, then ΣΧη converges with probability 1. Proof. By (22.9), Ρ 1 max \Snl.k-S„\>e <— Σ Var[*„+fc]. . l<k<r J k= 1 Since the sets on the left are nondecreasing in r, letting r -»oo gives sup|S„+k-SJ>6 U>i <— Σ Уят[х„+к]. k= 1 Since Σ Уат[Х„] converges, (22.11) UmP η sup|Sn+A- .*>! -Sn\>e = 0 for each e. Let E(n,e) be the set where sup; k>n \Sj -Sk[> 2e, and put E(e) = Г\„Е(п,е). Then E(n,e)lE(e), and (22.11) implies P(£(e)) = 0. Now U eE(e), where the union extends over positive rational e, contains the set where the sequence {5„} is not fundamental (does not have the Cauchy property), and this set therefore has probability 0. ■ Example 22.2. Let Χη(ω) = rn(w)an, where the r„ are the Rademacher functions on the unit interval—see (1.13). Then Xn has variance a2n, and so Σα2η <οο implies that Er„(w)an converges with probability 1. An interesting special case is an = n~l. If the signs in Σ + η~ι are chosen on the toss of a coin, then the series converges with probability 1. The alternating harmonic series l-2_1+3~'+--is thus typical in this respect. ■ If ΣΧη converges with probability 1, then Sn converges with probability 1 to some finite random variable 5. By Theorem 20.5, this implies that
290 RANDOM VARIABLES AND EXPECTED VALUES Sn ~*p S. The reverse implication of course does not hold in general, but it does if the summands are independent. Theorem 22.7. For an independent sequence {Xn}, the Sn converge with probability 1 if and only if they converge in probability. Proof. It is enough to show that if Sn -*p S, then {5„} is fundamental with probability 1. Since 11W Sn ~*F S implies (22.12) But by (22.10), ~Sn\>€]<P lim supf \S - VI > - + P 15 -5|> max \S„ +j-Sn\>e \<i<k < 3 max Ρ 1 <j < к ls„+,-s„l>T and therefore sup|S„+*-S„|>e til <3sup/> |5n+,-5„|> 3 til L It now follows by (22.12) that (22.11) holds, and the proof is completed as before. ■ The final result in this direction, the three-series theorem, provides necessary and sufficient conditions for the convergence of ΣΑ',, in terms of the individual distributions of the Xn. Let A^c) be X„ truncated at c: X^c) = XnIl\x„\^cy Theorem 22.8. Suppose that {Xn} is independent, and consider the three series (22.13) ZP[\X.\>c], ΣΕ[Χ^], EVar[*r]. In order that ΣΧη converge with probability 1 it is necessary that the three series converge for all positive с and sufficient that they converge for some positive с
SECTION 22. SUMS OF INDEPENDENT RANDOM VARIABLES 291 Proof of Sufficiency. Suppose that the series (22.13) converge, and put m(„c) = E[X(nc)]. By Theorem 22.6, L(X(nc) - m(„c)) converges with probability 1, and since E/w(„c) converges, so does ZXj,c). Since Р[ХпфХ11с) i.o.] = 0 by the first Borel-Cantelli lemma, it follows finally that ΣΧη converges with probability 1. ■ Although it is possible to prove necessity in the three-series theorem by the methods of the present section, the simplest and clearest argument uses the central limit theorem as treated in Section 27. This involves no circularity of reasoning, since the three-series theorem is nowhere used in what follows. Proof of Necessity. Suppose that ΣΧη converges with probability 1, and fix с > 0. Since Xn ~* 0 with probability 1, it follows that ΣA^c) converges with probability 1 and, by the second Borel-Cantelli lemma, that LP[\XJ>c]<™. Let M(nc) and s(nc) be the mean and standard deviation of S{nc) = L"k = lX(kc). If s(nc) -»oo, then since the Xj,c) - m(„c) are uniformly bounded, it follows by the central limit theorem (see Example 27.4) that (22.14) UmP X ,(c) - У 1 /2 77 fe-2/4t. And since ΣΧ(η€) converges with probability 1, s(nc) -»oo also implies S(nc)/s(nc) -»0 with probability 1, so that (Theorem 20.5) (22.15) limP[\Sic)/sic)\>e] =0. η But (22.14) and (22.15) stand in contradiction: Since x<sJlz^ll<y X^ c(c) ~y> sic) ,(c) <e is greater than or equal to the probability in (22.14) minus that in (22.15), it is positive for all sufficiently large η (if χ <y). But then x-e< -M^c)/s^<y + €, and this cannot hold simultaneously for, say, (x - e, у +е) = (-1,0) and (x - €, у + e) = (0,1). Thus s(nc) cannot go to °°, and the third series in (22.13) converges. And now it follows by Theorem 22.6 that L(X^C) - m(„c)) converges with probability 1, so that the middle series in (22.13) converges as well. ■ Example 22.3. If Xn = rnan, where r„ are the Rademacher functions, then Σα2π < oo implies that ΣΧη converges with probability 1. If ΣΧη converges, then an is bounded, and for large с the convergence of the third
292 RANDOM VARIABLES AND EXPECTED VALUES series in (22.13) implies Σα\ < oo: If the signs in Σ ± an are chosen on the toss of a coin, then the series converges with probability 1 or 0 according as Σα2η converges or diverges. If Σα\ converges but Σ\α„\ diverges, then Σ ± an is with probability 1 conditionally but not absolutely convergent. ■ Example 22.4. If an J.0 but Σα\ = oo, then Σ + an converges if the signs are strictly alternating, but diverges with probability 1 if they are chosen on the toss of a coin. ■ Theorems 22.6, 22.7, and 22.8 concern conditional convergence, and in the most interesting cases, ΣΧη converges not because the Xn go to 0 at a high rate but because they tend to cancel each other out. In Example 22.4, the terms cancel well enough for convergence if the signs are strictly alternating, but not if they are chosen on the toss of a coin. Random Taylor Series* Consider a power series Σ±ζ", where the signs are chosen on the toss of a coin. The radius of convergence being 1, the series represents an analytic function in the open unit disk D0 = [z: \z\ < 1] in the complex plane. The question arises whether this function can be extended analytically beyond D0. The answer is no: With probability 1 the unit circle is the natural boundary. Theorem 22.9. Let {Xn} be an independent sequence such that (22.16) />[*„ = 1]=/>[X,= -I] = i, „-0,1,.... There is probability 0 that oo (22.17) F(w,z)= ΣΧη(")ζη coincides in D0 with a function analytic in an open set properly containing D0. It will be seen in the course of the proof that the ω-set in question lies in σ(Χ0, Χχ,...) and hence has a probability. It is intuitively clear that if the set is measurable at all, it must depend only on the Xn for large η and hence must have probability either 0 or 1. Proof. Since (22.18) |*„(«)| = 1, η = 0,1,.·· This topic, which requires complex variable theory, may be omitted.
SECTION 22. SUMS OF INDEPENDENT RANDOM VARIABLES 293 with probability 1, the series in (22.17) has radius of convergence 1 outside a set of measure 0. Consider an open disk D = [z: \z — ζ\ < r], where ζ <ξ D0 and r > 0. Now (22.17) coincides in D0 with a function analytic in D0UD if and only if its expansion F(*,z)~ Σ ^F<m>(a,,0(z-Om about ζ converges at least for \z - ζ\ < r. Let AD be the set of ω for which this holds. The coefficient π =m \ ' is a complex-valued random variable measurable a(Xm,Xm + l,...). By the root test, ω e AD if and only if Iimsupm \am(w)\l/m < r_1. For each m0, the condition for ω еЛс can thus be expressed in terms of am (ω), am H ,(a>),... alone, and so Л0 е.а(Хт ,Хт +,,...). Thus Л0 has a probability, and in fact P(AD) is 0 or 1 by the zero-one law. Of course, P(AD) = 1 if D cD0. The central step in the proof is to show that P(AD) = 0 if D contains points not in D0. Assume on the contrary that P(AD) = 1 for such a D. Consider that part of the circumference of the unit circle that lies in D, and let к be an integer large enough that this arc has length exceeding 2тт/к. Define Ι Χη(ω) if и £0 (mod*), π(ω) \-Χ„(ω) if nsO(modJt). Let BD be the ω-set where the function oo (22.19) G(w,z)= Σ^„(ω)ζ" coincides in D0 with a function analytic in D0 U D. The sequence {Y0, У,,...} has the same structure as the original sequence: the Yn are independent and assume the values ± 1 with probability \ each. Since BD is defined in terms of the Yn in the same way as AD is defined in terms of the Xn, it is intuitively clear that P(BD) and P(AD) must be the same. Assume for the moment the truth of this statement, which is somewhat more obvious than its proof. If for a particular ω each of (22.17) and (22.19) coincides in D0 with a function analytic in D0 U D, the same must be true of oo (22.20) F(«,z)-G(fti,z)=2 Σ Xmk{<»)zmk. m=0
294 RANDOM VARIABLES AND EXPECTED VALUES Let D, = [ze2rrll/k: z<eD]. Since replacing ζ by ze2nl/k leaves the function (22.20) unchanged, it can be extended analytically to each D0 U D,, / = 1,2,... . Because of the choice of k, it can therefore be extended analytically to [z: |z| < 1 + e] for some positive e; but this is impossible if (22.18) holds, since the radius of convergence must then be 1. Therefore, ADC\BD cannot contain a point ω satisfying (22.18). Since (22.18) holds with probability 1, this rules out the possibility P(AD) = P(BD) = 1 and by the zero-one law leaves only the possibility P(AD) = P(BD) = 0. Let A be the ω-set where (22.17) extends to a function analytic in some open set larger than D0. Then ω еЛ if and only if (22.17) extends to D0U D for some D = [z: \ζ-ζ\<τ] for which D -D0¥= 0, r is rational, and ζ has rational real and imaginary parts; in other words, A is the countable union of AD for such D. Therefore, A lies in сг(Х0,Х{,...) and has probability 0. It remains only to show that P(AD) = P(BD), and this is most easily done by comparing {Xn} and {Yn} with a canonical sequence having the same structure. Put Ζ„(ω) = (Χη(ω)+1)/2> and let Τω be ΥΤη=ύΖη{ω)2'η'χ on the ω-set A* where this sum lies in (0,1]; on Ω,-Α* let Τω be 1, say. Because of (22.16) P(A*)=l. Let У=а(Хй,Хи...) and let Я be the σ-field of Borel subsets of (0,1]; then Τ: Ω -»(0,1] is measurable F/Qe. Let rn(x) be the «th Rademacher function. If M--=[x: r,(x) = u,, / = 1—,n], where w,- = ±1 for each /', then P(T~ lM) = Ρ[ω: Χ^ω) = u,-,: = 0,1,..., η - 1] = 2"", which is the Lebesgue measure λ(Μ) of M. Since these sets form a 77-system generating @, P(T~lM) = λ(Μ) for all Μ in & (Theorem 3.3). Let MD be the set of χ for which E"=0r„ + 1(x)z" extends analytically to D0 UD. Then MD lies in SB, this being a special case of the fact that AD lies in tF. Moreover, if ω e#, then ω ^AD if and only if Τω eMe: Л* C\AD = Α* η Γ-'Λ/д. Since P(A*) = 1, it follows that P(AD) = λ(Μ0). This argument only uses (22.16), and therefore it applies to {Yn} and BD as weil. Therefore, P(BD) = λ(Μ0) = P(AD). ■ PROBLEMS 22.1. Suppose that Xv X2,... is an independent sequence and Υ is measurable σ(Χη,Χη+ι,...) for each n. Show that there exists a constant a such that P[F=a]=l. 22.2. Assume [Xn] independent, and define Xj,c) as in Theorem 22.8. Prove that for Σ\Χη\ to converge with probability 1 it is necessary that EPtlA'Jx:] and ΣΕ[\Xbc)\] converge for all positive с and sufficient that they converge for some positive c. If the three series (22.13) converge but ΣΕ[\X^c)\] = °°, then there is probability 1 that ΣΧη converges conditionally but not absolutely. 22.3. Τ (a) Generalize the Borel-Cantelli lemmas: Suppose Xn are nonnegative If ΣΕ[Χη] < oo, then ΣΧ„ converges with probability 1. If the Xn are indepen dent and uniformly bounded, and if Σ£'[Α'π] = οο) then ΣΧΠ diverges witi probability 1.
SECTION 22. SUMS OF INDEPENDENT RANDOM VARIABLES 295 (b) Construct independent, nonnegative Xn such that ΣΧΠ converges with probability 1 but ΣΕ[Χη] diverges For an extreme example, arrange that P[Xn > 0 i.o.] = 0 but E[X„] = oo. 22.4. Show under the hypothesis of Theorem 22.6 that ΣΧη has finite variance and extend Theorem 22.4 to infinite sequences. 22.5. 20.14 22.1 Τ Suppose that Xt,X2,... are independent, each with theCauchy distribution (20.45) for a common value of u. (a) Show that n~lL"k = iXk does not converge with probability 1. Contrast with Theorem 22.1. (b) Show that F[n"'max^„ Xk <*]->е~"/1ГДГ for χ > 0. Relate to Theorem 14.3. 22.6. If Xit X2,... are independent and identically distributed, and if P[XX > 0] = 1 and Ρ[ΑΊ>0]>0, then Σ„Χ„ = <Χ> with probability 1. Deduce this from Theorem 22.1 and its corollary and also directly: find a positive e such that Xr >c infinitely often with probability 1. 22.7. Suppose that Xl,X2,... are independent and identically distributed and £[!*,!] = oo. Use (21.9) to show that ΣηΡ[\Χη\>αη] = °ο for each a, and conclude that sup„ n~ l\Xn\ = <x> with probability 1. Now show that supnn~:|5J = oo with probability 1. Compare with the corollary to Theorem 22.1. 22.8. Wald's equation. Let X{,X2,... be independent and identically distributed with finite mean, and put S„=Xl + ··· +Xn- Suppose that τ is a stopping time: τ has positive integers as values and [τ = η] e a(Xu..., Xn); see Section 7 for examples. Suppose also that Ε[τ] < oo. (a) Prove that (22.21) E[ST]=E[Xl]E[r]. (b) Suppose that Xn is ± 1 with probabilities ρ and q, ρ Φ q, let τ be the first η for which Sn is — a or b (a and b positive integers), and calculate Ε[τ]. This gives the expected duration of the game in the gambler's ruin problem for unequal ρ and q. 22.9. 20.9 Τ Let Z„ be 1 or 0 according as at time η there is or is not a record in the sense of Problem 20.9. Let Rn = Z, + · · · +Zn be the number of records up to time n. Show that Rn/\ogn ->P 1. 22.10. 22.1 Τ (a) Show that for an independent sequence {Xn} the radius of convergence of the random Taylor series ΣηΧ„ζ" is r with probability 1 for some nonrandom r. (b) Suppose that the Xn have the same distribution and P[A", Φ 0] > 0. Show that r is 1 or 0 according as log^A^I has finite mean or not. 22.11. Suppose that X0, A",,... are independent and each is uniformly distributed over [0,27г]. Show that with probability 1 the series Σηε'χ"ζ" has the unit circle as its natural boundary. 22.12. Prove (what is essentially Kolmogorov's zero-one law) that if A is independent of a 7r-system &> and A e σ(&\ then P(A) is either 0 or 1.
296 RANDOM VARIABLES AND EXPECTED VALUES 22.13. Suppose that srf is a semiring containing Ω. (a) Show that if P(A C\B)< bP(B) for all В e sf, and if b < 1 and A e σ(^), then РЫ) = 0. (b) Show that if P(AnB) <P(A)P(B) for all flej/, and if /ίεσ(^), then P(A) is 0 or 1. (c) Show that if aP(B) < P(A Π Β) for all B&snf, and if α > 0 and A e σ(^), then Р(Я)=1. (d) Show that if P(A)P(B) < P(A Γ)Β) for all fie< and if /Ιεσ(^), then P(A) is 0 or 1. (e) Reconsider Problem 3.20 22.14. 22.12 Τ Burstings theorem. Let / be a Borel function on [0,1] with arbitrarily small periods: For each e there is a ρ such that 0 <p <e and fix) =f(x +p) for 0 <x < 1 -p. Show that such an / is constant almost everywhere: (a) Show that it is enough to prove that P(f~lB) is 0 or 1 for every Borel set B, where Ρ is Lebesgue measure on the unit interval. (b) Show that f~lB is independent of each interval [0, x], and conclude that P(/-'£)is0or 1. (c) Show by example that / need not be constant. 22.15. Assume that Xx,...,Xn are independent and s,t,a are nonnegative Let L(s)= maxP[|SJ>s],/?(s)= maxP[|5„ -Sk\ >s], k<.n k&n M(s) = P\ max|S,| >s], T(s) = P[\S„\ >s]. I k£n 1 (a) Following the first part of the proof of (22.10), show that (22.22) M(s + t)<T(t)+M(s + t)R(s). (b) Take s = 2a and f = a; use (22.22), together with the inequalities T(s) < L(s) and R(2s) <2L(s), to prove Etemadi*s inequality (22.10) in the form (22.23) M(3a)<BE(a)~lA3L(a). (c) Carry the rightmost term in (22.22) to the left side, take s = t=a, and prove Ottaviani's inequality: (22.24) M(2a)<B0(a) = lA ^a) ■ (d) Prove Be(a) < ЪВ0(а/2), В0(а) < 3BE(a/6). This shows that the Etemadi and Ottaviani inequalities are of the same power for most purposes (as, for example, for the proofs of Theorem 22.7 and (37.9)). Etemadi's inequality seems the more natural of the two. Neither inequality can replace (9.39) in the proof of the law of the iterated logarithm.
SECTION 23. THE POISSON PROCESS 297 SECTION 23. THE POISSON PROCESS Characterization of the Exponential Distribution Suppose that X has the exponential distribution with parameter a: (23.1) P[X>x]=e-ax, x>0. The definition (4.1) of conditional probability then gives (23.2) P[X>x+y\X>x]=P[X>y], x,y>0. Image X as the waiting time for the occurrence of some event such as the arrival of the next customer at a queue or telephone call at an exchange. As observed in Section 14 (see (14.6)), (23.2) attributes to the waiting-time mechanism a lack of memory or aftereffect. And as shown in Section 14, the condition (23.2) implies that X has the distribution (23.1) for some positive a. Thus if in the sense of (23.2) there is no aftereffect in the waiting-time mechanism, then the waiting time itself necessarily follows the exponential law. The Poisson Process Consider next a stream or sequence of events, say arrivals of calls at an exchange. Let Xx be the waiting time to the first event, let X2 be the waiting time between the first and second events, and so on. The formal model consists of an infinite sequence Xi,X2,··· of random variables on some probability space, and Sn=Xl+ · · · +Xn represents the time of occurrence of the nth event; it is convenient to write S0 = 0. The stream of events itself remains intuitive and unformalized, and the mathematical definitions and arguments are framed in terms of the Xn. If no two of the events are to occur simultaneously, the Sn must be strictly increasing, and if only finitely many of the events are to occur in each finite interval of time, Sn must go to infinity: (23.3) 0 = 50(ω)<5,(ω)<52(ω) < ···, supS„(co) = °o. This condition is the same thing as (23.4) *,(«)> 0, Χ2(ω)>0,..., Σ*„(ω) = <». η Throughout the section it will be assumed that these conditions hold everywhere—for every ω. If they hold only on a set A of probability 1, and if Χη(ω) is redefined as Χη(ω)= 1, say, for ωί/4, then the conditions hold everywhere and the joint distributions of the X and Sn are unaffected.
298 RANDOM VARIABLES AND EXPECTED VALUES Condition 0°. For each ω, (23.3) and (23.4) hold. The arguments go through under the weaker condition that (23.3) and (23.4) hold with probability 1, but they then involve some fussy and uninteresting details. There are at the outset no further restrictions on the Xt; they are not assumed independent, for example, or identically distributed. The number JV, of events that occur in the time interval [0, i] is the largest integer η such that Sn < t: (23.5) JV, = тах[и: Sn <t]. Note that JV, = 0 if t < 5, = A",; in particular, JV0 = 0. The number of events in (s, t] is the increment JV, - Ns. 3 - ι - So = 0 N, = 0 Nt J - 1 Nt = 2 I 1 s3 = x, +x2 + x3 From (23.5) follows the basic relation connecting the JV, with the Sn: (23.6) [Nt>n] = [Sn<t]. From this follows (23.7) [N, = n] = [Sn*t<S„ + 1]. Each JV, is thus a random variable. The collection [JV,: i^O] is a stochastic process, that is, a collection of random variables indexed by a parameter regarded as time. Condition 0° can be restated in terms of this process: Condition 0°. For each ω, Ν,(ω) is a nonnegative integer for t > 0, JV0(c<j) = 0, and ит,_„ JV,(w) = oo; further, for each ω, Ν,(ω) as a function of t is nondecreasing and right-continuous, and at the points of discontinuity the saltus Ν,(ω) - sup^ <t Ns((o) is exactly 1. It is easy to see that (23.3) and (23.4) and the definition (23.5) give random variables JV, having these properties. On the other hand, if the stochastic
SECTION 23. THE POISSON PROCESS 299 process [JV,: t ^ 0] is given and does have these properties, and if random variables are defined by 5„(ω) = ίηί[ί: Ν,(ω)>η] and Χπ(ω) = Sn(w) - 5„_,(ω), then (23.3) and (23.4) hold, and the definition (23.5) gives back the original JV,. Therefore, anything that can be said about the Xn can be stated in terms of the JV,, and conversely. The points 5,(ω),52(ω),... of (0,o°) are exactly the discontinuities of Ν,(ω) as a function of t; because of the queueing example, it is natural to call them arrival times. The program is to study the joint distributions of the JV, under conditions on the waiting times Xn and vice versa. The most common model specifies the independence of the waiting times and the absence of aftereffect: Condition Γ. The Xtl are independent, and each is exponentially distributed with parameter a. In this case P[Xn > 0] = 1 for each η and n~xSn -»a~l by the strong law of large numbers (Theorem 22.1), and so (23.3) and (23.4) hold with probability 1; to assume they hold everywhere (Condition 0°) is simply a convenient normalization. Under Condition Γ, S„ has the distribution function specified by (20.40), so that P[N, >n] = Z%ne-a'{aty/i\ by (23.6), and , . η (23.8) P[N, = n]=e-a'^-, η =0,1,.-. Thus JV, has the Poisson distribution with mean at. More will be proved in a moment. Condition 2°. (i) For 0 < t, < · · · < tk the increments JV, ,N, - N,,..., Nt - JV, _ are independent. (ii) The individual increments have the Poisson distribution: (23.9) P[N,-Ns = n] = e-«'-s)(a<<tn~S" , « = 0,1,..., 0<s<i. Since JV0 = 0, (23.8) is a special case of (23.9). A collection [JV,: t > 0] of random variables satisfying Condition 2° is called a Poisson process, and a is the rate of the process. As the increments are independent by (i), if r < s < t, then the distributions of Ns - Nr and JV, - Ns must convolve to that of JV, - Nr. But the requirement is consistent with (ii) because Poisson distributions with parameters и and ν convolve to a Poisson distribution with parameter и + v. Theorem 23.1. Conditions Г and 2° are equivalent in the presence of Condition 0°.
300 RANDOM VARIABLES AND EXPECTED VALUES Proof of Γ -» 2°. Fix t, and consider the events that happen after time t. By (23.5), SN <t <SN + l, and the waiting time from t to the first event following t is SN +, - i; the waiting time between the first and second events following t is XN +2; and so on. Thus (23.10) Xl'> = SNi+l-t, X^ = XNi + 2, XP = XNi + 3,... define the waiting times following t. By (23.6), Nl+i - Nt>m, or Nt+S > N, + m, if and only if SN +m < t + s, which is the same thing as X^ + ■ ■ ■ +Х„) < s. Thus (23.11) JV,+i-JV, = max[m: X\l) + ■■■ +X^^s]. Hence [Nl+S - N, = m] = [X\!) + ■■■ +*,<„'> < s < X\» + ■■■ +X«)+ll A comparison of (23.11) and (23.5) shows that for fixed t the random variables Nt+S - N, for s > 0 are defined in terms of the sequence (23.10) in exactly the same way as the Ns are defined in terms of the original sequence of waiting times. The idea now is to show that conditionally on the event [N,=n] the random variables (23.10) are independent and exponentially distributed. Because of the independence of the Xk and the basic property (23.2) of the exponential distribution, this seems intuitively clear. For a proof, apply (20.30). Suppose у > 0; if Gn is the distribution function of Sn, then since Xn +, has the exponential distribution, P[Sn<t<Sn^,Sn+l-t>y] = P[Sn<t,Xn + l>t+y-Sn] = / P[Xn + i>t+y-x]dGn(x) = e-^'\ P[Xn + l>t-x]dGn(x) x<t = e-ayp[Sn<t,Xn+l>t-Sn]. By the assumed independence of the Xn, P[S„ + ,-t>yl,Xn + 2>y2,...,X„+}>yj,Snzt<Sn + l] = P[Sn,i-t>yl,Sn<t<Sn+l]e-yi ■ ■ ■ e-» = P[Sn<t<Sn + l]e-ay* ■■■€-"*. If # = (y,,°°)x ··· x(yy,°°), this is (23.12) P[N, = n,(X\'\...,Xy)tEH}=P[Nt = n]P[(X1,...,Xj)<EH].
SECTION 23. THE POISSON PROCESS 301 By Theorem 10.4, the equation extends from Η of the special form above to all Η in &'. Now the event [Ns. = w„ 1 < i < u] can be put in the form [(X{, ...,Xj)<^ H], where j = mu + \ and Η is the set of χ in R' for which x, + · · ■ +xm. < ί,<χι+ ■·■ +x„l+i, l<i<u. But then [(X\'\..., X}1)) еЯ] is by (23.11) the same as the event [Nt+S. -Nt = mt, 1 < /' < и]. Thus (23.12) gives P[N, = n, Nl+Sj - JV, = mh l<i<u]=P[N,= n]P[Ns. = mh 1 <ι <u]. From this it follows by induction on к that if 0 = i0 < i, < · · ■ <tk, then к (23.13) P[Nli-Nli_i = ni,liiik] = YlP[N,i-,i.l = ni]· i= 1 Thus Condition Г implies (23.13) and, as already seen, (23.8). But from (23.13) and (23.8) follow the two parts of Condition 2°. ■ Proof of 2° -» 1°. If 2° holds, then by (23.6), Р[Х{ >t] = P[N, = 0] = e~°", so that Xx is exponentially distributed. To find the joint distribution of Ζ, and X2, suppose that 0 <s, < i, < s2 < t2 and perform the calculation P[s, <5, <ί,, 52<52<ί2] = F[iVii = 0,iV,|-iVii = l,iVi2-A',| = 0,iV,2-iVii>l] = e~aSi Χ α(ί, - ί1)β_α(''-ί') χ е-а^2-'|) χ (! _ е-«с2-*2)) = α(ί1-51)(^αί2-^α'2)=-ίί a2e-a^dy,dy2. i2<y2</2 Thus for a rectangle Л contained in the open set G = [(у,, у2): О <y, <y2]- P[(S1,S2)tEA} = [a2e-a»dyidy2. JA By inclusion-exclusion, this holds for finite unions of such rectangles and hence, by a passage to the limit, for countable ones. Therefore, it holds for A = G Π G if G' is open. Since the open sets form a T7-system generating the Borel sets, (5,,52) has density а2е~"У2 on G (of course, the density is'O outside G). By a similar argument in Rk (the notation only is more complicated), (5],...,Sk) has density ake~ayk on [y: 0 <y, < ··· <ук]. If a linear transformation g(y) = χ is defined by x, =y,-y,_i, then (Λ",,...; Xk) = g(Slt...,Sk) has by (20.20) the density Ylk=iae'ax· (the Jacobian is identically 1). This proves Condition Г. ■
302 RANDOM VARIABLES AND EXPECTED VALUES The Poisson Approximation Other characterizations of the Poisson process depend on a generalization of the classical Poisson approximation to the binomial distribution. Theorem 23.2. Suppose that for each n, Znl,...,Znr are independent random variables and Znk assumes the values 1 and 0 with probabilities pnk and \-pnk. If (23.14) Σ Pnk -»λ ^ 0, max pnk -» 0, k=l l<Lk<r then (23.15) L znk k=l ->.A' fM ' i! i = 0,l,2,. If λ = 0, the limit in (23.15) is interpreted as 1 for / = 0 and 0 for / > 1. In the case where rn = η and pnk = λ/η, (23.15) is the Poisson approximation to the binomial. Note that if λ > 0, then (23.14) implies rn -»oo. ,1 /o:1"p I hP -И Jn:e-P «/,-c-Pp/l! Proof. The argument depends on a construction like that in the proof of Theorem 20.4. Let t/„t/2,... be independent random variables, each uniformly distributed over [0,1). For each p, 0 <p < 1, split [0,1) into the two intervals I0(p) = [0,1 - p) and /,(p) = \\-p, 1), as well as into the sequence of intervals /f(p) = \£%\*-ρρ'>/j\, Σ)= xe~pp j/j\), / = 0,1 Define Vnk = 1 if Uk&I(pnk) and V„k = 0 if Uk^I0(pnk). Then Vnl,...,V„re are independent, and Vnk assumes the values 1 and 0 with probabilities P[Uk e Ix(pnk)\ = pnk and P[Uk<ElQ(pNk)]= l -Pnk. Since VnX,...,Vnrn have the same joint distribution as Znl,..., Znr, (23.15) will follow if it is shown that Vn = Σ£"= У„к satisfies (23.16) P[Vn = i]^e- λ[ ι! · Now define Wnk = i if Uk^J,.(pnk), i = 0,1 Then P[Wnk = i} = e~Pnkp'nk/rt—Wnk has the Poisson distribution with mean pnk. Since the Wnk
SECTION 23. THE POISSON PROCESS 303 are independent, Wn = Yfkn= ^Vnk has the Poisson distribution with mean A„ = Ek"=, pnk. Since 1 - ρ < e~p, /,(p») с /,(р) (see the diagram). Therefore, Р[Кк*Кк) = Р[К,с = 1*Кк]=Р[икеЧр„1с) -Чр«к)\ = Pnk-e~PnkPnk^P2nk> and Ρ[νηΦ\Υη\< ΣρΙ,^Κ max p„t-» 0 by (23.14). And now (23.16) and (23.15) follow because P[W„ = i] = e~x»)iji\ -» e-AA'/i! В Other Characterizations of the Poisson Process The condition (23.2) is an interesting characterization of the exponential distribution because it is essentially qualitative. There are qualitative characterizations of the Poisson process as well. For each ω, the function Ν,(ω) has a discontinuity at t if and only if 5„(ω) = t for some η > 1; t is a fixed discontinuity if the probability of this is positive. The condition that there be no fixed discontinuities is therefore (23.17) p[sn = t]=0, i>0, «>l; that is, each of 5,,52,... has a continuous distribution function. Of course there is probability 1 (under Condition 0°) that Ν,(ω) has a discontinuity somewhere (and indeed has infinitely many of them). But (23.17) ensures that a t specified in advance has probability 0 of being a discontinuity, or time of an arrival. The Poisson process satisfies this natural condition. Theorem 23.3. // Condition 0° holds and [Nt: t > 0] has independent increments and no fixed discontinuities, then each increment has a Poisson distribution. This is Prekopa's theorem. The conclusion is not that [Nr: t > 0] is a Poisson process, because the mean of JV, - Ns need not be proportional to t - s. If φ is an arbitrary nondecreasing, continuous function on [0, oo) and <p(0) = 0, and if [Nt: t > 0] is a Poisson process, then Νφ(ι) satisfies the conditions of the theorem.1 Proof. The problem is to show for t' <t" that /V,« - N,, has for some λ > 0 a Poisson distribution with mean A, a unit mass at 0 being regarded as a Poisson distribution with mean 0. +This is in fact the general process satisfying them; see Problem 23.8
304 RANDOM VARIABLES AND EXPECTED VALUES The procedure is to construct a sequence of partitions (23.18) ί' = ί„ο<ί„ι< ■■■<'-- ='-" nr„ of [i', t"] with three properties. First, each decomposition refines the preceding one: each tnk is a tn + lJ. Second, (23.19) EP[J4n4-A^_,>l]tA for some finite λ and (23.20) max P\N. - N. >ll-+0. Third, (23.21) p\ max (N. - N. )>2 Once the partitions have been constructed, the rest of the proof is easy: Let Z„k be 1 or 0 according as N. - N. is positive or not. Since [N: - nk 'л,к-I ' ί > 0] has independent increments, the Znk are independent for each n. By Theorem 23.2, therefore, (23.19) and (23.20) imply that Zn = Zrk"=xZnk satisfies P[Zn = i] -» e~kk'/i\ Now Nr - Nt, > Z„, and there is strict inequality if and only if N, t- N, t i > 2 for some k. Thus (23.21) implies P[N,„ - Ν,, Φ Z„] -» 0, and therefore'FtA/;,, - /v> = i] = е_АА'/Л To construct the partitions, consider for each t the distance D, = infm > , |i - Sm\ from t to the nearest arrival time. Since Sm -»oo, the infimum is achieved. Further, D, = 0 if and only if Sm = t foi some m, and since by hypothesis there are no fixed discontinuities, the probability of this is 0: P[D, = 0] = 0. Choose δ, so that 0<δ, <«"' and P[D, < δ,] <η~ι. The intervals (i -o,,t + δ,) for t' < t < t" cover [i', t"]. Choose a finite subcover, and in (23.18) take the tnk for 0 < к < rn to be the endpoints (of intervals in the subcover) that are contained in (i', t"). By the construction, (23.22) max (i„k-i„,k_,)-»0, and the probability that Un k_l,tnk] contains some Sm is less than n~l. This gives a sequence of partitions satisfying (23.20). Inserting more points in a partition cannot increase the maxima in (23.20) and (23.22), and so it can be arranged that each partition refines the preceding one. To prove (23.21) it is enough (Theorem 4.1) to show that the limit superior of the sets involved has probability 0. It is in fact empty: If for infinitely many n, Nl/tt(u)) - Ν,η 4_/ω) ^ 2 holds for some к < r„, then by (23.22), Ν,(ω) as a
SECTION 23. THE POISSON PROCESS 305 function of t has in [i', t"] discontinuity points (arrival times) arbitrarily close together, which requires 5m(w) e [i', t"\ for infinitely many m, in violation of Condition 0°. It remains to prove (23.19). If Znk and Z„ are defined as above and Pnk = P\-Zr,k = U then the sum in (23·19)is ^kPnk = E[Zn]. Since Z„+ x > Z„, Ek pnk is nondecreasing in n. Now P[Nr-N, = Q\=P[Znk = Q,k<rn] = ПО -P„j<exp k=\ Σ рПк k=\ If the left-hand side here is positive, this puts an upper bound on T,kpnk, and (23.19) follows. But suppose P[Nr - JV,, = 0] = 0. If s is the midpoint of t' and t", then since the increments are independent, one of P[ Ns - Nt. = 0] and P[Nt> - Ns = 0J must vanish. It is therefore possible to find a nested sequence of intervals [um,vn] such that vm - um -»0 and the event Am = [N, - Nu > 1] has probability 1. But then P(PlmA)= 1, and if t is the 1 m um mm point common to the [um,vm], there is an arrival at t with probability 1, contrary to the assumption that there are no fixed discontinuities. ■ Theorem 23.3 in some cases makes the Poisson model quite plausible. The increments will be essentially independent if the arrivals to time s cannot seriously deplete the population of potential arrivals, so that Ns has for t > s negligible effect on JV, - Ns. And the condition that there are no fixed discontinuities is entirely natural. These conditions hold for arrivals of calls at a telephone exchange if the rate of calls is small in comparison with the population of subscribers and calls are not placed at fixed, predetermined times. If the arrival rate is essentially constant, this leads to the following condition. Condition 3°. (i) For 0 < i, < ··· <tk the increments Ν,,Ν, -Ν,,..., Λ/, - Λ/, are independent. (ii) The distribution of JV, - Ns depends only on the difference t - s. Theorem 23.4. Conditions Γ, 2°, and 3° are equivalent in the presence of Condition 0°. Proof. Obviously Condition 2° implies 3°. Suppose that Condition 3° holds. If J, is the saltus at t (J, = Λ/, - sups<tNs), then [N, - N,_„-i > Щ [/, > 1], and it follows by (ii) of Condition 3° that P[J, > 1] is the same for all t. But if the value common to P[J, > 1] is positive, then by the independence of the increments and the second Borel-Cantelli lemma there is probability 1 that J, > 1 for infinitely many rational t in (0,1), for example, which contradicts Condition 0°.
306 RANDOM VARIABLES AND EXPECTED VALUES By Theorem 23.3, then, the increments have Poisson distributions. If /(f) is the mean of Nt, then JV, - Ns for s <t must have mean /(f) -f(s) and must by (ii) have mean f(t-s); thus f(t)=f(s)+f(t-s). Therefore, / satisfies Cauchy's functional equation [A20] and, being nondecreasing, must have the form /(f) = at for a > 0. Condition 0° makes a = 0 impossible. ■ One standard way of deriving the Poisson process is by differential equations. Condition 4°. // 0 < f ,<···< f*. and ifnl,...,nk are nonnegative integers, then (23.23) P[N,k,h-N,r\\N,=n1,j<k\=ah+o{h) and (23.24) P[Nlk+h-Nh>2\N,i = nj,j<k]=o(h) as h i0. Moreover, [Nt: f > 0] has no fixed discontinuities. The occurrences of o(h) in (23.23) and (23.24) denote functions, say <£,(/z), and <£2(/z), such that h~ ^t(h) -* 0 as h J.0; the <£, may depend a priori on k,ti,...,tk, and n1,...,nk as well as on h. It is assumed in (23.23) and (23.24) that the conditioning events have positive probability, so that the conditional probabilities are well defined. Theorem 23.5. Conditions Γ through 4° are all equivalent in the presence of Condition 0°. Proof of 2° -» 4°. For a Poisson process with rate a, the left-hand sides of (23.23) and (23.24) are e~ahah and 1 - e~ah - e~ahah, and these are ah + o(h) and o(h), respectively, because e~ah = 1 - ah + o(h). Ana by the argument in the preceding proof, the process has no fixed discontinuities. ■ Proof of 4° -»2°. Fix k, the tp and the n·, denote by A the event [Nt) = nj, j<k]; and for f>0 put pn(t) = P[N,k + t-N,t = n\A). It will be shown that (23.25) Pn(t)=e-a'^-, « = 0,1,....
SECTION 23. THE POISSON PROCESS 307 This will also be proved for the case in which pn(t) = P[N, = n]. Condition 2° will then follow by induction. If ί > 0 and \t -s\<n~l, then |Р[^ = п]-Р[Л^ = п]|<Р[^#^]<Р[^+„--^_„->1]. As η -* oo, the right side here decreases to the probability of a discontinuity at t, which is 0 by hypothesis. Thus P[Nt = n] is continuous at t. The same kind of argument works for conditional probabilities and for ί = 0, and so pn(t) is continuous for t > 0. To simplify the notation, put D, = N,k + I - N,k. If Dl+h = n, then D, = m for some m < n. If t > 0, then by the rules for conditional probabilities, Pn(t + h) =Pn{t)P[D,,h -D, = 0\An[D, = n]] + P«-i(t)P[Dl+h -D, = l\An[D, = n-l]] n-2 + Σ Pm{t)P[D,+h-Dt = n-m\An[Dt = m}}. For η < 1, the final sum is absent, and for η = 0, the middle term is absent as well. This holds in the case pn(t) = P[N, = n] if D, = N, and A = Ω. (If t = 0, some of the conditioning events here are empty; hence the assumption t > 0.) By (23.24), the final sum is o(h) for each fixed n. Applying (23.23) and (23.24) now leads to PnO + h)=P„(t)(l-ah)+pn.l(t)ah+o(h), and letting h 10 gives (23.26) p'n(t)=-apn{t)+apn_i(t). In the case η = 0, take p_ ,(i) to be identically 0. In (23.26), ί > 0 and p'n(t) is a right-hand derivative. But since pn(t) and the right side of the equation are continuous on [0, oo), (23.26) holds also for ί = 0 and p'n(t) can be taken as a two-sided derivative for ί > 0 [A22]. Now (23.26) gives [A23] Pn(t)=e- PnW+af'Pn-iW"* Since pn(0) is 1 or 0 as η = 0 or η > 0, (23.25) follows by induction on n.
308 RANDOM VARIABLES AND EXPECTED VALUES Stochastic Processes The Poisson process [N,: t > 0] is one example of a stochastic process—that is, a collection of random variables (on some probability space (Ω, & P)) indexed by a parameter regarded as representing time. In the Poisson case time is continuous. In some cases the time is discrete: Section 7 concerns the sequence {Fn} of a gambler's fortunes; there η represents time, but time that increases in jumps. Part of the structure of a stochastic process is specified by its finite-dimensional distributions. For any finite sequence f,,..., ffc of time points, the /c-dimensional random vector (N,,..., N,k) has a distribution μ, t over Rk. These measures μ, , are the finite-dimensional distributions of the process. Condition 2° of this section in effect specifies them for the Poisson case: * (ait -t \\η'~">-1 (23.27) f[w!,-.J,/a*]-n.-B--"-'v ν,,-';;",),— if 0 <л, <, ··· <nk and 0<f, < ·· · <tk (take n0 = t0 = 0). The finite-dimensional distributions do not, however, contain all the mathematically interesting information about the process in the case of continuous time. Because of (23.3), (23.4), and the definition (23.5), for each fixed ω, Λ/,(ω) as a function of t has the regularity properties given in the second version of Condition 0°. These properties are used in an essential way in the proofs. Suppose that /(f) is f or 0 according as t is rational or irrational. Let N be defined as before, and let (23.28) Μ,(ω)=Ν,(ω)+/(ί+Χι(ω)). If R is the set of rationale, then Ρ[ω: f(t +Ζ,(ω)) Φ 0] = Ρ[ω: Χ{(ω) <= R - t) = 0 for each t because R-t is countable and X} has a density. Thus P[Mt=Nt] = 1 for each t, and so the stochastic process [M,: t > 0] has the same finite-dimensional distributions as [N,: t > 0]. For ω fixed, however, Μ,(ω) as a function of f is everywhere discontinuous and is neither monotone nor exclusively integer-valued. The functions obtained by fixing ω and letting f vary are called the path functions or sample paths of the process. The example above shows that the finite-dimensional distributions do not suffice to determine the character of the path functions. In specifying a stochastic process as a model for some phenomenon, it is natural to place conditions on the character of the sample paths as well as on the finite-dimensional distributions. Condition 0° was imposed throughout this section to ensure that the sample paths are nonde- creasing, right-continuous, integer-valued step functions, a natural condition if JV, is to represent the number of events in [0, f]. Stochastic processes in continuous time are studied further in Chapter 7.
SECTION 23. THE POISSON PROCESS 309 PROBLEMS Assume the Poisson processes here satisfy Condition 0° as well as Condition Γ. 23.1. Show that the minimum of independent exponential waiting times is again exponential and that the parameters add. 23.2. 20.17T Show that the time Sn of the nth event in a Poisson stream has the gamma density f(x;a,n) as defined by (20.47). This is sometimes called the Erlang density. 23.3. Let A,=t — SN be the time back to the most recent event in the Poisson stream (or to 0), and let B, = SN + j — t be the time forward to the next event. Show that A, and B, are independent, that B, is distributed as Xl (exponentially with parameter a), and that A, is distributed as minfA".,?}: P[A, <t] is 0, 1 -e~ax, or 1 as χ <0, 0 <x <t, or x> t. 23.4. Τ Let L,=A, + Bl = SNi+l-SN be the length of the interarrival interval covering t. (a) Show that L, has density d( ) = ja2xe'ax if 0 <jc < ί, 'W~\a(l+ai)e-"x ifx>r. (b) Show thai E[L,] converges to 2E[XX] as t -»<». This seems paradoxical because L, is one of the Xn. Give an intuitive resolution of the apparent paradox. 23.5. Merging Poisson streams. Define a process {N,) by (23.5) for a sequence {Xn} of random variables satisfying (23.4). Let {X'n} be a second sequence of random variables, on the same probability space, satisfying (23.4), and define {jV/} by Л/,' = тах[л: X\+ ··· +X'„<t\ Define {Д"} by N'; = Nt+N't. Show that, if σ(Χχ,X?,...) and σ(Χ[, X'2,...) are independent and {N,} and {N',} are Poisson processes with respective rates a and β, then {N") is a Poisson process with rate a + β. 23.6. Τ The nth and (n + l)st events in the process {N,} occur at times Sn and S/r + l- (a) Find the distribution of the number N* — jVO of events in the other . . ... . , 3»+i 3" process during this time interval. (b) Generalize to Щ - Ni. 23.7. Suppose that Xl,X2,... are independent and exponentially distributed with parameter a, so that (23.5) defines a Poisson process [N,}. Suppose that Y,,Y2,... are independent and identically distributed and that a(XuX2,...) and σ(Υ,,Υ2,...) are independent. Put Z, = LkiNYk. This is the compound Poisson process. If, for example, the event at time Sn in the original process
310 RANDOM VARIABLES AND EXPECTED VALUES represents an insurance claim, and if Yn represents the amount of the claim, then Z, represents the total claims to time t. (a) If Yk = 1 with probability 1, then {Z,} is an ordinary Poisson process. (b) Show that {Z,} has independent increments and that ZSJr,-Zs has the same distribution as Z,. (c) Show that, if Yk assumes the values 1 and 0 with probabilities ρ and 1 — ρ (0 <p < 1), then {Z,} is a Poisson process with rate pa. 23.8. Suppose a process satisfies Condition 0° and has independent, Poisson-distrib- uted increments and no fixed discontinuities. Show that it has the form {Νφ(ί)}, where {N,} is a standard Poisson process and φ is a nondecreasing, continuous function on [0, oo) with φ(0) = 0. 23.9. If the waiting times Xn are independent and exponentially distributed with parameter a, then Sn/n ->a_1 with probability 1, by the strong law of large numbers. From \\m,^„N, = <» and SN <t <SN +1 deduce that lim, ^,^Ν,/t = a with probability 1. 23.10. Τ (a) Suppose that XVX2,... are positive, and assume directly that Sn/n -»m with probability 1, as happens if the X„ are independent and identically distributed with mean m. Show that lim, N,/t = \/m with probability 1. (b) Suppose now that Sn/n -»°° with probability 1, as happens if the Xn are independent and identically distributed and have infinite mean. Show that lim, N,/t =0 with probability 1. The results in Problem 23.10 are theorems in renewal theory: A component of some mechanism is replaced each time it fails or wears out The Xn are the lifetimes of the successive components, and N, is the number of replacements, or renewals, to time t. 23.11. 20.7 23.10 Τ Consider a persistent, irreducible Markov chain, and for a fixed state j let Nn be the number of passages through j up to time n. Show that Nn/n -»\/m with probability 1, where m = T%=ikfjjk) is the mean return lime (replace \/m by 0 if this mean is infinite). See Lemma 3 in Section 8. 23.12. Suppose that X and У have Poisson distributions with parameters a and β. Show that \P[X= i] - P[Y = i]\ < \a - β\. Hint: Suppose that a < β, and represent У as X+ D, where X and D are independent and have Poisson distributions with parameters a and β - a. 23.13. Τ Use the methods in the proof of Theorem 23.2 to show that the error in (23.15) is bounded uniformly in / by |λ -A„| + AnmaxA pnk. SECTION 24. THE ERGODIC THEOREM* Even though chance necessarily involves the notion of change, the laws governing the change may themselves remain constant as time passes: If time "This section may be omitted. There is more on ergodic theory in Section 36.
SECTION 24. THE ERGODIC THEOREM 311 does not alter the roulette wheel, the gambler's fortunes fluctuate according to constant probability laws. The ergodic theorem is a version of the strong law of large numbers general enough to apply to any system governed by probability laws that are invariant in time. Measure-Preserving Transformations Let (ίϊ, &, Ρ) be a probability space. A mapping Τ: Ω -»Ω is a measure-preserving transformation if it is measurable &/'& and P(T~lA) = P(A) for all A in &, from which it follows that P(T~"A) = P(A) for η > 0. If, further, Τ is a one-to-one mapping onto Ω and the point mapping T~l is measurable &/'& (ТА <= & for Α <ξ &\ then Г is inuertible; in this case Г"1 automatically preserves measure: P(A) = P(J"lTA) = P(TA). The first result is a simple consequence of the π-λ theorem (or Theorems 13.1(0 and 3.3). Lemma 1. If & is a π-systcm generating S^, and if Г'^е ^ and P(J~XA) = P(A) for A e {?, then Τ is a measure-preserving transformation. Example 24.1. The Bernoulli shift. Let 5 be a finite set, and consider the space S°° of sequences (2.15) of elements of 5. Define the shift Τ by (24.1) 7fti = (z2(«),z3(fti),...); the first element of ω = (ζ,(ω), ζ2(ω),...) is lost, and Τ shifts the remaining elements one place to the left: zk(JTo)) = zk + l(u>) for к > 1. If A is a cylinder (2.17), then (24.2) T~lA = [ω: (ζ2(ω),..., ζη + ι(ω)) е=я] = [ft>:(z1(ft,),...,zll+1(ft>))eSxtf] is another cylinder, and since the cylinders generate the basic σ-field if, Τ is measurable ■£/ €. For probabilities pu on 5 (nonnegative and summing to 1), define product measure Ρ on the field ifQ of cylinders by (2.21). Then Ρ is consistently defined and countably additive (Theorem 2.3) and hence extends to a probability measure on € = a(tf0). Since the thin cylinders (2.16) form a T7-system that generates if, Ρ is completely determined by the probabilities it assigns to them: (24.3) />[ω:(ζ1(ω),...,ζπ(ω)) = (Μ1, ...,uj] =p ---p
312 RANDOM VARIABLES AND EXPECTED VALUES If A is the thin cylinder on the left here, then by (24.2), (24.4) P(TlA) = ZPuPUl-PUn=PUl--PUn = P{A), ueS and it follows by the lemma that Τ preserves P. This Τ is the Bernoulli shift. If A = [ω: ζ,(ω) = u](u£: S), then IA{Tk~lw) is 1 or 0 according as zk(w) is и or not. Since by (24.3) the events [ω: гк(а>) = и] are independent, each with probability p„, the random variables IA(Tk~lw) are independent and take the value 1 with probability pu = P(A). By the strong law of large numbers, therefore, (24.5) lim± Σ'ΛΤ*-*ω)=Ρ(Α) " k=\ with probability 1. ■ Example 24.1 gives a model for independent trials, and that Τ preserves Ρ means the probability laws governing the trials are invariant in time. In the present section, it is this invariance of the probability laws that plays the fundamental role; independence is a side issue. The orbit under Τ of the point ω is the sequence (ω, Τω, Τ2ω,...), and (24.5) can be expressed by saying that the orbit enters the set A with asymptotic relative frequency P(A). For Α=[ω: (ζ,(ω), ζ2(ω)) = (uuu2)], the IA(Tk~lw) are not independent, but (24.5) holds anyway. In fact, for the Bernoulli shift, (24.5) holds with probability 1 whatever A may be (/ief). This is one of the consequences of the ergodic theorem (Theorem 24.1). What is more, according to this theorem the limit in (24.5) exists with probability 1 (although it may not be constant in ω) if Τ is an arbitrary measure-preserving transformation on an arbitrary probability space, of which there are many examples. Example 24.2. The Markov shift. Let Ρ = [ρ,·-] be a stochastic matrix with rows and columns indexed by the finite set 5, and let π, be probabilities on S. Replace (2.21) by P(A) = Lhttu pu u · ■ · pu _ u . The argument in Section 2 showing that product measure is consistently defined and finitely additive carries over to this new measure: since the rows of the transition matrix add to 1, "n+l "m and so the argument involving (2.23) goes through. The new measure is again
SECTION 24. THE ERGOD1C THEOREM 313 countably additive on €й (Theorem 2.3) and so extends to €. This probability measure Ρ on € is uniquely determined by the condition (24.6) P[w:(zl(w),...,zn(w)) = (ui,. ..,«„)] = ^,Ρ„,„2 · · · Ρ„π_,„π· Thus the coordinate functions zn() are a Markov chain with transition probabilities p,·· and initial probabilities тг,-. Suppose that the π, (until now unspecified) are stationary: Σ,-ττ,-ρ,-· = тг,·. Then L· 7ruPuuiPuiu2 Pu„_]U„~ πιιιΡιιιυ2 Pun.iU„> uGS and it follows (see (24.4)) that Τ preserves P. Under the measure Ρ specified by (24.6), Τ is the Markov shift. Ш The shift T, qua point transformation on S°°, is the same in Examples 24.1 and 24.2. A measure-preserving transformation, however, is the point transformation together with the σ-field with respect to which it is measurable and the measure it preserves. Example 24.3. Let Ρ be Lebesgue measure λ on the unit interval, and take Τω = 2ω (modi): \ΐω ίΐΟ<ω<{-, \2ω-\ \ί{<ω<\. If ω has nonterminating dyadic expansion ω = .ά^ω)ά2(ω)... , then To> = .ά2(ω)ά3(ω)...: Τ shifts the digits one place to the left—compare (24.1). Since Γ_1(0, jc] = (0, yjc]u(^, j + jjc], it follows by Lemma 1 that T preserves Lebesgue measure. This is the dyadic transformation. ■ Example 24.4. Let Ω be the unit circle in the complex plane, let У be the σ-field generated by the arcs, and let Ρ be normalized circular Lebesgue measure: map [0,1) to the unit circle by ф(х) = е27г1х and define Ρ by P(A) =λ(φ~ιΑ). For a fixed с in Ω, let Τω = co>. Since Τ is effectively the rotation of the circle through the angle argc, Τ preserves P. The rotation turns out to have radically different properties according as с is a root of unity or not. ■ Ergodicity The ^set A is invariant under Τ if T~XA =A\ it is a nontrivial invariant set if 0 <P(A)<1. And Τ is by definition ergodic if there are in JF no
314 RANDOM VARIABLES AND EXPECTED VALUES nontrivial invariant sets. A measurable function / is invariant if /(Τω) =/(ω) for all ω; A is invariant if and only if lA is. The ergodic theorem: Theorem 24.1. Suppose that Τ is a measure-preserving transformation on (Ω, &~, P) and that f is measurable and integrable. Then (24.7) lim^ Σ/(74-'ω)=/(ω) " k = \ with probability 1, where f is invariant and integrable and E[f] = E[f]. If Τ is ergodic, then f = E[f] with probability 1. This will be proved later in the section. In Section 34, / will be identified as a conditional expected value (see Example 34.3). If /= IA, (24.7) becomes (24.8) lim± Σ/ΛΓ*-1") =£(«). and in the ergodic case, (24.9) ΐΑ(ω)=Ρ(Α) with probability 1. If A is invariant, then ΙΑ(ω) is 1 on Л and 0 on Ac, and so the limit can certainly be nonconstant if Τ is not ergodic. Example 24.5. Take Ω = [a, b, c, d, e) and &= 2Ω. If Τ is the cyclic permutation T = (a,b,c,d,e) and Τ preserves P, then Ρ assigns equal probabilities to the five points. Since 0 and Ω are the only invariant sets, Τ is ergodic. It is easy to check (24.8) and (24.9) directly. The transformation Γ= (a, b,c\d, e), a product of two cycles, preserves Ρ if and only if a, b, с have equal probabilities and d, e have equal probabilities. If the probabilities are all positive, then since {a,b, c] is invariant, Τ is not ergodic. If, say, A = {a, d), the limit in (24.8) is j on {a, b, c] and \ on [d, e). This illustrates the essential role of ergodicity. ■ The coordinate functions zn(·) in Example 24.1 are independent, and hence by Kolmogorov's zero-one law every set in the tail field У~= Π„σ(ζπ, z„ + 1,...) has probability 0 or 1. (That the zn take values in the abstract set 5 does not affect the arguments.) If A e σ(ζ,,..., zk), then T~"A εσ(ζ„ +,,..., zn + k)<Za(zn + i, zn+2,...); since this is true for each k, A<e Sfr=a{zl,z2,...) implies (Theorem 13.1(0) Г"Леа(г1 + „гл+2,„.). For A invariant, it follows that A<e T: The Bernoulli shift is ergodic. Thus
SECTION 24. THE ERGODIC THEOREM 315 the ergodic theorem does imply that (24.5) holds with probability 1, whatever A may be. The ergodicity of the Bernoulli shift can be proved in a different way. If A = [(z,,..., z„) = u] and B = [(zi,...,zk) = v] for an л-tuple и and a k- tuple v, and if Ρ is given by (24.3), then P{A Π T~nB) = P(A)P(B) because T~"B = [(z„ +,,..., zn + k) = υ]. Fix л and Л, and use the тг-λ theorem to show that this holds for all В in &. If В is invariant, then P(AnB) = P(A)P(B) holds for the A above, and hence (π-λ again) holds for all A. Taking' В =A shows that P(B) = (P(B))2 for invariant B, and Ρ(β) is 0 or 1. This argument is very close to the proof of the zero-one law, but a modification of it gives a criterion for ergodicity that applies to the Markov shift and other transformations. Lemma 2. Suppose that ^"c &~0 с &~, where S*~0 is afield, every set in &~0 is a finite or countable disjoint union of &-sets, and &~Q generates ^. Suppose further that there exists a positive с with this property: For each A in & there is an integer nA such that (24.10) P(AnT-"*B)>cP(A)P(B) for all В in &>. Then Т~1С = С implies that P{C) is 0 or 1. It is convenient not to require that Τ preserve P; but if it does, then it is an ergodic measure-preserving transformation. In the argument just given, & consists of the thin cylinders, nA=n if A = [(z,,..., zn) = и], с = 1, and $*~0 = €ϋ is the class of cylinders. Proof. Since every ^g-set is a disjoint union of ^-sets, (24.10) holds for Be^0 (and A <= &>). Since for fixed A the class of В satisfying (24.10) is monotone, it contains & (Theorem 3.4). If В is invariant, it follows that Р(АПВ) > cP(A)P(B) for A in &>. But then, by the same argument, the inequality holds for all A in &. Take A = Bc: If В is invariant, then P(BC)P(B) = 0 and hence P(B) is 0 or 1. Ш To treat the Markov shift, take $*~0 to consist of the cylinders and & to consist of the thin ones. If A = [(zl,..., zn) = (и,,...,«„)], nA = η + m - 1, and B = [(zl,...,zk) = (v1,...,vk)], then P(A)P(B) = 7rUipUiU2 · · · PUn_iUjrliPlil2 ■ ■ ■ plt_ilt, (24.11) P(Α Π T~n*B) = iru pu „ · · · pu u p(um? PLL ·■■ pL , ■ The lemma will apply if there exist an integer m and a positive с such that pi™) > ciTj for all i and /. By Theorem 8.9 (or Lemma 2, p. 125), there is in the irreducible, aperiodic case an m such that all p\^ are positive; take с
316 RANDOM VARIABLES AND EXPECTED VALUES less than the minimum. By Lemma 2, the corresponding Markov shift is ergodic. Example 24.6. Maps preserve ergodicity. Suppose that φ: Ω -» Ω is measurable 3^/3^ and commutes with Τ in the sense that ψΤω = Τψω. If Τ preserves F, it also preserves Ρψ~ι: Ρψ-ι(Τ~ιΑ) = Ρ(Τ~1ψ-1Α) = Ρφ~χ(Α). And if T is ergodic under P, it is also ergodic under Ρψ~χ: if A is invariant, so is ψ~ιΑ, and hence Ρψ~ι(Α) is 0 or 1. These simple observations are useful in studying the ergodicity of stochastic processes (Theorem 36.4). ■ Ergodicity of Rotations The dyadic transformation, Example 24.3, is essentially the same as the Bernoulli shift. In any case, it is easy to use the zero-one law or Lemma 2 to show that it is ergodic. From this and the ergodic theorem, the normal number theorem follows once again. Consider the rotations of Example 24.4. If the complex number с defining the rotation (Τω =βω) is — 1, then the set consisting of the first and third quadrants is a nontrivial invariant set, and hence Τ is not ergodic. A similar construction shows that Τ is nonergodic whenever с is a root of unity. In the opposite case, с is ergodic. In the first place, it is an old number-theoretic fact due to Kronecker that if с is not a root of unity then the orbit (ω,εω,ε2ω,...) of every ω is dense. Since the orbits are rotations of one another, it suffices to prove that the orbit (1, c, c2,...) of ! is dense. But if с is not a root of unity, then the elements of this orbit are all distinct and hence by compactness have a limit point ω0. For arbitrary e, there are distinct points c" and cn + k within e/2 of ω0 and hence within e of each other (distance measured along the arc). But then, since the distance from cn+'k to cn + (;+1|t is the same as that from c" to cn+k, it is clear that for some m the points cn,cn + k,...,cn + mk form a chain which extends all the way around the circle and in which the distance from one point to the next is less than e. Thus every point on the circle is within e of some point of the orbit (1, c, c2,...), which is indeed dense. To use this result to prove ergodicity, suppose that A is invariant and P(A) > 0. To show that P(A) must then be 1, observe first that for arbitrary e there is an arc /, of length at most e, satisfying P(A Π /) > (1 - е)Р(Л). Indeed, A can be covered by a sequence /,, /2,... of arcs for which P(A)/(l - e) > E„F(/n); the arcs can be taken disjoint and of length less than e. Since ΣηΡ(Α Γ\Ιη) = Ρ(Α)> (1 -e)E„F(/„), there is an η for which P(A Π /„) > (1 - e)F(/„): take / = /„. Let / have length a; a <e. Since A is invariant and 2" is invertible and preserves F, it follows that P(A Π T"I)> (1 -e)P(T"I). Suppose the arc / runs from a to b. Let n, be arbitrary and, using the fact that {T"a} is dense, choose n2 so that T"4 and T"2I are disjoint and the distance from Γ"<6 to T"*a is less than ea. Then choose пъ so that Г"'/, Т"21, Т"Ч are disjoint and the distance from T"2b to T"^a is less than ea. Continue until T"kb is within a of T"la and a further step is impossible. Since the T"'I are disjoint, ka < 1; and by the construction, the T"'I cover the circle to within a set of measure kea +a, which is at most 2e. And now by disjointness, к к P(A)> Σ Ρ(Α Π Γ"'/) > (1 -e) Σ Ρ(Τ"Ί) > (1 -e)(l-2e). Since e v/as arbitrary, P(A) must be 1: Τ is ergodic if с is not a root of unity.f For a simple Fourier-series proof, see Problem 26.30.
SECTION 24. THE ERGODIC THEOREM 317 Proof of the Ergodic Theorem The argument depends on a preliminary result the statement and proof of which are most clearly expressed in terms of functional operators. For a real function / on Ω, let Uf be the real function with value (£//)(ω) =/(Γω) at ω. If / is integrable, then by change of variable (Theorem 16.13), (24.12) E[Uf] = ff(Tw)P(da>) = f f(co)PT-\dco) = E[f]. And the operator U is nonnegative in the sense that it carries nonnegative functions to nonnegative functions; hence f<g (pointwise) implies Uf < Ug. Make these pointwise definitions: S0f=0, Snf=f+Uf + ■■■ + U"~lf, M„f=max0£k£nSkf, and MJ=sup„3.0S„f=supni0M„f. The maximal ergodic theorem: Theorem 24.2. If f is integrable, then (24.13) f fdP>0. Proof. Since Bn = [Mnf> 0]|[М00/> О], it is enough, by the dominated convergence theorem, to show that {BfdP>0. On Bn, Mnf = max, <£<.„ S^/. Since the operator U is nonnegative, Skf = f+USk_lf< f+ UMJ forl<k<n, and therefore MJ<f+ UMJ on Bn. This and the fact that the function UMnf is nonnegative imply JMJdP= { MJdP<f (f+UMnf)dP Jn jb„ jb„ < [ fdP+ f UMjdP = f fdP+ f MJdP, jb„ Ja jb„ Jn where the last equality follows from (24.12). Hence fB fdP > 0. ■ Replace / by fIA. If A is invariant, then S„(fIA) = (SJ)IA, and M£fIA) = (M0O/)//4, and therefore (24.13) gives (24.14) f fdP>0 HT~lA=A. JAn[Myif>0] Now replace / here by /- Л, Л a constant. Clearly [M„(/- Л) > 0] is the set
318 RANDOM VARIABLES AND EXPECTED VALUES ^A = «: supi Ε/(Γ*-1«)>λ лЫ к = \ where for some η > 1, S„(f- A) > 0. or л 'S„/> A. Let (24.15) it follows by (24.14) that /„ nFa(/-X)dP> 0, or (24.16) \P(AnFx)<[ fdP if T~lA= A. The Л here can have either sign. Proof of Theorem 24.1. To prove that the averages α „(ω) = n~lL'l^if(Tk~lw) converge, consider the set Α·.β = ω: lim ίηίαπ(ω) <α <β < lim supa„(w) for <*</?. Since Aa β=Αα βΓ\Ρβ and /4α ^ is invariant, (24.16) gives βΡ(Αα β) < \A fdP. The same result with -/, -β, -α in place of /, α, β is \A fdP <aP(Aa β). Since α<β, the two inequalities together lead to P(Aa β) = 0. Take the union over rational a and β: The averages α„(ω) converge, that is, (24.17) lime„(fti)=/(«) with probability 1, where / may take the values + °o at certain values of ω. Because of (24.12), E[an]=E[f]; if it is shown that the απ(ω) are uniformly integrable, then it will follow (Theorem 16.14) that / is integrable and £[/] = £[/]· By (24.16), AP(FX) < E[\f\]. Combine this with the same inequality for -/: If GA = [ω: sup>„(«)| > A], then \P(GX) < 2E[\f\] (trivial if Λ < 0). Therefore, for positive a and λ, / K\dP<\t ί \ί{Τ*-ιω)\Ρ{άω) •V,I>a] n k.iJc, *iHf tl \f(Tk-lw)\P(dw)+aP(Gk)\ П к=1У\Г(Тк-1о,)\>а j = / \f(w)\P(dw)+aP(Gx) <( \f(w)\P(dw) + 2jE[\f\]. •Ί/(ω)|>α Λ Take α = Λ1/2; since / is integrable, the final expression here goes to 0 as
SECTION 24. THE ERGODIC THEOREM 319 Λ -»οο. The αη(ω) are therefore uniformly integrable, and E[f] = E[f]. The uniform integrability also implies E[\an -/|] -»0. Set /(ω) = 0 outside the set where the α„(ω) have a finite limit. Then (24.17) still holds with probability 1, and /(Τω) =/(ω). Since [ω: /(ω) <x] is invariant, in the ergodic case its measure is either 0 or 1; if x0 is the infimum of the χ for which it is 1, then /(ω)=χ0 with probability 1, and from x0 = E[f] = E[f] it follows that /(ω) = E[f] with probability 1. ■ The Continued-Fraction Transformation Let Ω consist of the irrationals in the unit interval, and for χ in Ω let Tx = {l/x} and a,(x) = [l/x\ be the fractional and integral parts of 1/x. This defines a mapping (24.18) Γ,= {!} = !-[ 1 , \ of Ω into itself, a mapping associated with the continued-fraction expansion of χ [Α36]. Concentrating on irrational χ avoids some trivial details connected with the rational case, where the expansion is finite; it is an inessential restriction because the interest here centers on results of the almost-every- where kind. For χεΩ and η > 1 let an(x) = α,(Τ"~ιχ) be the nth partial quotient, and define integer-valued functions pn(x) and qn(x) by the recursions (24.19) p_1(.x)=l, p0(x) = 0, P„(x) = an(x)pn_l(x)+pn_2(x), n>\, q_1(x)--=0, q0(x)=l, qn(x) = an(x)qn_i(x) + qn_2(x), n>\. Simple induction arguments show that (24.20) P„'-l(x)qn(x)-pn(x)qn.l(x) = (-l)\ n>0, and [A37: (27)] (24.21) x = JJa,(jc)+ ■·· +l}an_l(x) + l]an(x) + T"x, n>\. It also follows inductively [A36: (26)] that (24.22) ljaji) +··· +l]an_l(x) + \\an(x)+t P»(x) + tPn-i(*) <7„(*)+'<7„-i(*)' η > 1, 0<ί < 1.
320 RANDOM VARIABLES AND EXPECTED VALUES Taking ί = 0 here gives the formula for the nth convergent: (24.23) ΐ№)+···+ΐ№)=^|, η>1, where, as follows from (24.20), pn(x) and qn(x) are relatively prime. By (24.21) and (24.22), p„(x)-t (T"x)pn_,(x) (24.24) х=Щ-А—L„ i ; ν, ">0, which, together with (24.20), implies1 (2425) *-^ = ^ «>0 <?„(*) qn{x)({TnxYXqn{x)+qn^{x)Y Thus the convergents for even η fall to the left of x, and those for odd η fall to the right. And since (24.19) obviously implies that qn(x) goes to infinity with n, the convergents pn(,x)/qn(x) do converge to x: Each irrational χ in (0,1) has the infinite simple continued-fraction representation (24.26) χ = \JaJx) + lje2(x) + · · ■ . The representation is unique [A36: (35)], and Tx = lja2(x) + lja3(x) + ■ ■' ■ : 7" shifts the partial quotients in the same way the dyadic transformation (Example 24.3) shifts the digits of the dyadic expansion. Since the continued- fraction transformation turns out to be ergodic, it can be used to study the continued-fraction algorithm. Suppose now that α,,α2>... are positive integers and define pn and qn by the recursions (24.19) without the argument x. Then (24.20) again holds (without the x), and so pjqn -pn-Jqn-\ = (- 1)"+Ά„_ι4„, "> 1. Since qn increases to infinity, the right side here is the «th term of a convergent alternating series. And since p0/q0 = 0, the nth partial sum is P„/q„, which therefore converges to some limit: Every simple infinite continued fraction converges, and [A36: (36)] the limit is an irrational in (0,1). ^t Δα,. .an be the set of χ in Ω such that ak(x) = ak for 1 < к < η; call it a fundamental set of rank n. These sets are analogous to the dyadic intervals and the thin cylinders. For an explicit description of Δα α —necessary for the proof of ergodicity—consider the function (24.27) ^ι...β„(0=ΐΚ+··-+ΐΚΓί + ΐΚΤί· Theorem 1.4 follows from this.
SECTION 24. THE ERGOD1C THEOREM 321 If χ <ξ Δαι fln, then χ = φαι ..α(.Τ"χ) by (24.21); on the other hand, because of the uniqueness of the partial quotients [A36: (33)], if t is an irrational in the unit interval, then (24.27) lies in Δ0ι α . Thus Δα α is the image under (24.27) of Ω itself. '" ' " Just as (24.22) follows by induction, so does (24.28) φα fl(0=P"tf""'· And ψα α (t) is increasing or decreasing in t according as η is even or odd, as is clear from the form of (24.27) (or differentiate in (24.28) and use (24.20)). It follows that Δ«, Pn Pn+Pn-l η Ω if л is even, (Pn+Pn-\ Pn\r,n if · ^AA ; ,— nil if η is odd. By (24.20), this set has Lebesgue measure (24·29) *(Aa a)= ι 1 7- V ' V °' ""' Qnidn+Яп-д The fundamental sets of rank η form a partition of Ω, and their unions form a field У„; let ^= U"=,^· Then ^ is the field generated by the class 3s of all the fundamental sets, and since each set in ^, is in some ^, each is a finite or countable disjoint union of ^-sets. Since qn>2qn_2 by (24.19), induction gives qn > 2("~1)/2 for η > 0. And now (24.29) implies that &~0 generates the σ-field & of linear Borel sets that are subsets of Ω (use Theorem 10.1(H)). Thus &, $*~0, $*~ are related as in the hypothesis of Lemma 2. Clearly Τ is measurable iF/ iF. Although Τ does not preserve Λ, it does preserve Gauss's measure, defined by (24.30) P(A) = —J АЫ. In fact, since 7-'((0,ί)ηΩ)= U((^,i)nn),
322 RANDOM VARIABLES AND EXPECTED VALUES it is enough to verify dx Д rl/k dx Д rl/k dx f ax _ v1 V' — V il/tk l Jn 1 + χ J /(b ,« 1 + χ J, ,,. . 1 Gauss's measure is useful because it is preserved by Τ and has the same sets of measure 0 as Lebesgue measure does. Proof that Τ is ergodic. Fix ax,...,an, and write φη for φα α and An for Δα 0я. Suppose that η is even, so that <Д„ is increasing, if хеД„, then (since χ = \pn{T"x))s < T"x < t if and only if ψ„(ί) <χ < ψ„(ί); and this last condition implies χ e Δ„. Combined with (24.28) and (24.20), this shows that Λ(Δ„ П [χ: s < Τ" χ < ί]) = ψ(ί) - ώ (j) = -. ^ Γ. If β is an interval with endpoints s and t, then by (24.29), Λ(Δ„ηΓ-"β)=λ(Δ„)Λ(β) (<7„ + s<7n_i)(<?„ + i<?„_i) A similar argument establishes this for η odd. Since the ratio on the right lies between \ and 2, (24.31) 12λ(Αη)λ(Β) <Λ(Δ„ П Г-5) < 2λ(Δ„)λ(5). Therefore, (24.10) holds for 0,^^ as defined above, Α = Δ„, ηΑ = η, с = \, and λ in the role of P. Thus T~lC = С implies that Л(С) is 0 or 1, and since Gauss's measure (24.30) comes from a density, P(C) is 0 or 1 as well. Therefore, 7" is an ergodic measure-preserving transformation on (Ω, 9^,Ρ). It follows by the ergodic theorem that if / is integrable, then (24.32, ϊίΣΛΓ'-χ)-^/'^* holds almost everywhere. Since the density in (24.30) is bounded away from 0 and oo, the "integrable" and "almost everywhere" here can refer to Ρ or to λ indifferently. Taking / to be the indicator of the x-set where al(x) = k shows that the asymptotic relative frequency of к among the partial quotients is almost everywhere equal to _!_/■'/* dx _ 1 (£+l)2 log2JI/a + 1)l+x _ log2logA:(A: + 2)· In particular, the partial quotients are unbounded almost everywhere.
SECTION 24. THE ERGOD1C THEOREM 323 For understanding the accuracy of the continued-fraction algorithm, the magnitude of an(x) is less important than that of q„(x). The key relationship is (24.33) 9π(*)(*π(*)+9« + ι(*)) < Ρ«{χ) Яп(х) < 1 яп(х)яп+1(х)' which follows from (24.25) and (24.19). Suppose it is shown that 1 77 (24.34) Hm_l0g^(x) = __ almost everywhere; (24.33) will then imply that Pn(x) (24.35) lim — log 77 Q„(x) 61og2 The discrepancy between χ and its nth convergent is almost everywhere of the order e""17 /(6l0g2). To prove (24.34), note first that since pj+l(x) = qj(Tx) by (24.19), the product nnk=ip„_k + i(Tk~lx)/qn_k + l(Tk~lx) telescopes to l/qn(x): (24.36) **^=Σ***-'+ΑΤ"1χ) As observed earlier, qn(x) > 2("~l)/2 for η > 1. Therefore, by (24.33), Рп(х)/Я„(х) -1 1 qn + 1(x) ~ 2"/2 1 <ГГп. П>\. Since HoeCl +ί)|<4|ί| if |ί|< 1//2, log χ - log Pn{x) Qn{x) 2n/2 Therefore, by (24.36), 1 Σ logr^'x-log k=\ Чп\л) k=l log Tk lx - log ρ„-*+.σ*-1*) <7n-*+l(^ *)
324 RANDOM VARIABLES AND EXPECTED VALUES By the ergodic theorem, then,1 1 /-ilogx lim — log 1 1 /-ι log* -1 /-ι fi& 1 £ (-1» -77 log2*=0 (A: + l)2 121°^2' Hence (24.34). Diophantine Approximation The fundamental theorem of the measure theory of Diophantine approximation, due to Khinchine, is Theorem 1.5 together with Theorem 1.6. As in Section 1, let <p(q) be a positive function of integers and let A be the set of χ in (0,1) such that (24.37) x-P- < q ч>(ч) has infinitely many irreducible solutions p/q. If Σ1/<7<ρ(<?) converges, then A has Lebesgue measure 0, as was proved in Section 1 (Theorem 1.6). It remains to prove that if φ is nondecreasing and Tl/q<p(q) diveiges, then Αφ has Lebesgue measure 1 (Theorem 1.5). It is enough to consider irrational x. Lemma 3. For positive an, the probability (P or λ) of [x\ an(x) > an i.o.] is 0 or 1 as Σί/α„ converges or diverges. Proof. Let En = [x: an(x) > an]. Since P(£„) = P[x: a{(x) > an] is of the order l/an, the first Borel-Cantelli lemma settles the convergent case (not needed in the proof of Theorem 1.5). By (24.31), λ(Δ„Π£„+1) > ^А(Д„)А[д:: «,(*) >an+l] > ±λ(Δ„) 1 αη+ι + 1 Taking a union over certain of the Δη shows that for m <n, к(ЕстП ■■■ П£,;П£„+1)>А(£^П ··· ПЕС„) 1 2(«„+i + l) By induction on n, " 1 <exp - £ -τ- —p- as in the proof of the second Borel-Cantelli lemma. f Integrate by parts over (a, 1) and then let a 10. For the series, see Problem 26.28. The specific value of the limit in (24.34) is not needed for the application that follows.
SECTION 24. THE ERGOD1C THEOREM 325 Proof of Theorem 1.5. Fix an integer N such that log N exceeds the limit in (24.34). Then, except on a set of measure 0, (24.38) ?„(*)<#" holds for all but finitely many n. Since #\s nondecreasing, ^ 1 N < Nn<,q<Nn + i ^^^J rv I and ΣΙ/φ(Ν") diverges if Ll/q<p(q) does. By the lemma, outside a set of measure 0, a„ + l(x)> φ(Ν") holds for infinitely many n. If this inequality and (24.38) both hold, then by (24.33) and the assumption that φ is nondecreasing, Pn(x) <?„(·*) 1 1 <In{x)<In + \(x) an + l(x)ql(x) 1 1 -<p(N")q2n(x) 9{qn{x))q2„(x)' But p„(x)/qn(x) is irreducible by (24.20). ■ PROBLEMS 24.1. Fix (Ω, &~) and a T measurable 3^/9". The probability measures on (Ω, 3^) preserved by Τ form a convex set C. Show that Τ is ergodic under Ρ if and only if Ρ is an extreme point of С—cannot be represented as a proper convex combination of distinct elements of C. 24.2. Show that Τ is ergodic if and only if n~lZ"KZ\P{A Π T~kB) -» P(A)P(B) for all A and В (or all A and В in a 7r-system generating 5е). 24.3. Τ The transformation Τ is mixing if (24.39) P{ACiT-"B)^P{A)P(B) for all /1 and B. (a) Show that mixing implies ergodicity. (b) Show that Τ is mixing if (24.39) holds for all A and β in a 7r-system generating 5е. (c) Show that the Bernoulli shift is mixing. (d) Show that a cyclic permutation is ergodic but not mixing. (e) Show that if с is not a root of unity, then the rotation (Example 24.4) is ergodic but not mixing.
326 RANDOM VARIABLES AND EXPECTED VALUES 24.4. Τ Write r_"^= [TnA: Α ε У], and call the σ-field ^ = П "„,Γ""^ trivial if every set in it has probability either 0 or 1. (If Γ is invertible, ^ is У and hence is trivial only in uninteresting cases.) (a) Show that if 5£ is trivial, then Τ is ergodic. (A cyclic permutation is ergodic even though ^ is not trivial.) (b) Show that if the hypotheses of Lemma 2 are satisfied, then 5^ is trivial. (c) It can be shown by martingale theory that if 5£ is trivial, then Τ is mixing; see Problem 35.20. Reconsider Problem 24.3(c) 24.5. 8.35 24.4 ΐ (a) Show that the shift corresponding to an irreducible, aperiodic Markov chain is mixing. Do this first by Problem 8 35, then by Problem 24.4(b), (c). (b) Show that if the chain is irreducible but has period greater than 1, then the shift is ergodic but not mixing. (c) Suppose the state space splits into two closed, disjoint, nonempty subsets, and that the initial distribution (stationary) gives positive weight to each. Show that the corresponding shift is not ergodic. 24.6. Show that if Τ is ergodic and if / is nonncgative and Elf] — 00, then η-'Σ"=ι/(Γ*_1ω)^οο with probability 1. 24.7. 24.3 ΐ Suppose that PQ(A) = {A8dP for all A (8>0) and that Τ is mixing with respect to Ρ (T need not preserve F0). Use (21.9) to prove P0(T"A)= ( 8dP->P{A). JT-"A 24.8. 24.6 Τ (a) Show that and 1 " *=i ViK^..^x)-n(1 + iry \(log*)/(log2) almost everywhere, (b) Show that 1 log <?„(·*) Pn(x) Qn(x) 12 log 2' 24.9. 24.4 24.7 Τ (a) Show that the continued-fraction transformation is mixing, (b) Show that \[x:T"x<t] log(l + Q log 2 0 < ί < 1.
CHAPTER 5 Convergence of Distributions SECTION 25. WEAK CONVERGENCE Many of the best-known theorems in probability have to do with the asymptotic behavior of distributions. This chapter covers both general methods for deriving such theorems and specific applications. The present section concerns the general limit theory for distributions on the real line, and the methods of proof use in an essential way the order structure of the line. For the theory in Rk, see Section 29. Definitions Distribution functions Fn were defined in Section 14 to converge weakly to the distribution function F if (25.1) limF„(x)=F(x) η for every continuity point χ of F; this is expressed by writing Fn =» F. Examples 14.1, 14.2, and 14.3 illustrate this concept in connection with the asymptotic distribution of maxima. Example 14.4 shows the point of allowing (25.1) to fail if F is discontinuous at x; see also Example 25.4. Theorem 25.8 and Example 25.9 show why this exemption is essential to a useful theory. If μ„ and μ are the probability measures on (Λ\ «^') corresponding to Fn and F, then Fn=>F if and only if (25.2) 1\πΐμη(Α)=μ(Α) η for every A of the form A = (-<», x] for which μ{χ) = 0—see (20.5). In this case the distributions themselves are said to converge weakly, which is expressed by writing μη =>μ. Thus Fn=>F and μη=>μ are only different expressions of the same fact. From weak convergence it follows that (25.2) holds for many sets A besides half-infinite intervals; see Theorem 25.8. 327
328 CONVERGENCE OF DISTRIBUTIONS Example 25.1. Let Fn be the distribution function corresponding to a unit mass at n: Fn=I[na>). Then lim„ F„(jc) = 0 for every x, so that (25.1) is satisfied if F(x) = 0. But Fn=*F does not hold, because F is not a distribution function. Weak convergence is defined in this section only for functions Fn and F that rise from 0 at -oo to 1 at +°°—that is, it is defined only for probability measures μη and μ..* ■ Example 25.2. Poisson approximation to the binomial. Let μη be the binomia! distribution (20.6) for ρ = Л/л and let μ be the Poisson distribution (20.7). For nonnegative integers k, "Гс-(:)Ш'КГ 1 * "I / · \ A*(l-A/n) Η (1-Α/π)' if η > к. As η -» oo the second factor on the right goes to 1 for fixed k, and so μ„{Α:} -»μ,ί/c}; this is a special case of Theorem 23.2. By the series form of Scheffe's theorem (the corollary to Theorem 16.12), (25.2) holds for every set A of nonnegative integers. Since the nonnegative integers support μ and the μη, (25.2) even holds for every linear Borel set A. Certainly μη converges weakly to μ in this case. ■ Example 25.3. Let μη correspond to a mass of n~l at each point k/n, к = 0,1,..., η - 1; let μ be Lebesgue measure confined to the unit interval. The corresponding distribution functions satisfy Fn(x) = fl/ucj + \)/n -» F(x) for 0 <x < 1, and so Fn => F. In this case (25.1) holds for every x, but (25.2) does not, as in the preceding example, hold for every Borel set A: if A is the set of rationale, then μ „(A) = 1 does not converge to μ(Α) = 0. Despite this, μη does converge weakly to μ. Ш Example 25.4. If μη is a unit mass at xn and μ is a unit mass at x, then μη => μ if and only if xn -»x. If xn > χ for infinitely many n, then (25.1) fails at the discontinuity point of F. Ш Uniform Distribution Modulo 1* For a sequence xvx2,... of real numbers, consider the corresponding sequence of their fractional parts {jcn} =xn - [xn\. For each n, define a probability measure μη by (25.3) μ„(Α) = £#[*: 1 < к <п, {хк) eA]; fThere is (see Section 28) a related notion of vague convergence in which μ may be defective in the sense that μ(Λ') < 1. Weak convergence is in this context sometimes called complete convergence. *This topic, which requires ergodic theory, may be omitted.
SECTION 25. WEAK CONVERGENCE 329 μη has mass n~l at the points {*,},...,{·*„}, and if several of these points coincide, the masses add. The problem is to find the weak limit of {μη} in number-theoretically interesting cases. If the μ„ defined by (25.3) converge weakly to Lebesgue measure restricted to the unit interval, the sequence x{,x2,... is said to be uniformly distributed modulo 1. In this case every subinterval has asymptotically its proportional share of the points {*„}; by Theorem 25.8 below, the same is then true of every subset whose boundary has Lebesgue measure 0. Theorem 25.1. For 0 irrational, 0, 20,30,... is uniformly distributed modulo 1. Proof. Since {«0} = {л{0}}, 0 can be assumed to lie in [0,1]. As in Example 24.4, map [0,1) to the unit circle in the complex plane by ф(х) = е2тг'х. If 0 is irrational, then с = φ(θ) is not a root of unity, and so (p. 000) Τω=εω defines an ergodic transformation with respect to circular Lebesgue measure P. Let j^ be the class of open arcs with endpoints in some fixed countable, dense set. By the ergodic theorem, the orbit {Τ"ω} of almost every ω enters every / in ,/ with asymptotic relative frequency P(I). Fix such an ω. If /] c7c/2, where J is a closed arc and /,, /2 are in J?, then the upper and lower limits of n~xY."kw,lIJ{Tk~ia) are between F(/,) and F(/2), and therefore the limit exists and equals P(J). Since the orbits and the arcs are rotations of one another, every orbit enters every closed arc J with frequency P(J). This is true in particular of the orbit {c") of 1. Now carry all this back to [0,1) by φ"1: For every χ in [0,1), {«0} = ф~1(с") lies in [0, x] with asymptotic relative frequency x. и For a simple proof by Fourier series, see Example 26.3. Convergence in Distribution Let Xn and A' be random variables with respective distribution functions Fn and F. If Fn =» F, then Xn is said to converge in distribution or in law to X, written Xn =*X. This dual use of the double arrow will cause no confusion. Because of the defining conditions (25.1) and (25.2), Xn =» X if and only if (25.4) \imP[Xn<x]=P[X<x] for every χ such that P[X = x] = 0. Example 25.5. Let XX,X2,·· be independent random variables, each with the exponential distribution: P[Xn >x] = e~ax, x>0. Put M„ = max{Xx,..., Xn) and bn = a~l log n. The relation (14.9), established in Example 14.1, can be restated as P[Mn - bn <x]~> e~e "". If X is any random variable with distribution function e~e "', this can be written M„ - bn=>X. ■ One is usually interested in proving weak convergence of the distributions of some given sequence of random variables, such as the M„ - bn in this example, and the result is often most clearly expressed in terms of the random variables themselves rather than in terms of their distributions or
330 CONVERGENCE OF DISTRIBUTIONS distribution functions. Although the M„ - bn here arise naturally from the problem at hand, the random variable X is simply constructed to make it possible to express the asymptotic relation compactly by M„ - bn =» X. Recall that by Theorem 14.1 there does exist a random variable for any prescribed distribution. Example 25.6. For each n, let Ω„ be the space of л-tuples of 0's and l's, let Уп consist of all subsets of Ω„, and let Pn assign probability (А/и)*(1 - Л/л)""* to each ω consisting of к l's and η -к 0's. Let Χη(ω) be the number of l's in ω; then Xn, a random variable on (Ωη, &n, Pn), represents the number of successes in η Bernoulli trials having probability λ/η of success at each. Let I bea random variable, on some (Ω, &, P), having the Poisson distribution with parameter A. According to Example 25.2, Xn => X. ■ As this example shows, the random variables Xn may be defined on entirely different probability spaces. To allow for this possibility, the Ρ on the left in (25.4) really should be written Pn. Suppressing the η causes no confusion if it is understood that Ρ refers to whatever probability space it is that Xn is defined on; the underlying probability space enters into the definition only via the distribution μη it induces on the line. Any instance of Fn =» F or of μη => μ can be rewritten in terms of convergence in distribution: There exist random variables Xn and X (on some probability spaces) with distribution functions Fn and F, and Fn => F and Xn => X express the same fact. Convergence in Probability Suppose that X, Xx, X2,... are random variables all defined on the same probability space Ш, 5% P). If Xn->X with probability 1, then P[\Xn -X\ > € i.o.] = 0 for € > 0, and hence (25.5) limP[\Xn-X\>e]=0 η by Theorem 4.1. Thus there is convergence in probability Xn -*P X; see Theorems 5.2 and 20.5. Suppose that (25.5) holds for each positive e. Now P[X <x - e] - P[\Xn - X\>e]<P[Xn <x] <P[X<x + e] + P[\Xn -x\> e]; letting η tend to » and then letting e tend to 0 shows that P[X <x] < liminf„ P[Xn <x] < limsup„/>[*„<*] </>[*<;<:]. Thus P[X„ <x]-^P[X<x] if P[X = x] = 0, and so Xn => X: Theorem 25.2. Suppose that Xn and X are random variables on the same probability space. If Xn-^> X with probability 1, then Xn-^p X.· If Xn -*P X, thenXn=>X.
SECTION 25. WEAK CONVERGENCE 331 Of the two implications in this theorem, neither converse holds. Because of Example 5.4, convergence in probability does not imply convergence with probability 1. Neither does convergence in distribution imply convergence in probability: if X and Υ are independent and assume the values 0 and 1 with probability \ each, and if Xn = Y, then Xn => X, but Xn -*p X cannot hold because P[\X - Y\] = 1 = \. What is more, (25.5) is impossible if X and the Xn are defined on different probability spaces, as may happen in the case of convergence in distribution. Although (25.5) in general makes no sense unless X and the Xn are defined on the same probability space, suppose that X is replaced by a constant real number a—that is, suppose that Χ(ω) = α. Then (25.5) becomes (25.6) limP[\Xn-a\>e] =0, r. and this condition makes sense even if the space of Xn does vary with n. Now a can be regarded as a random variable (on any probability space at all), and it is easy to show that (25.6) implies that Xn => a: Put e = \x—a\; if x>a, then P[Xn<x]>P[\Xn-a\<e]-> I, and if x<a, then P[Xn<x]< P[\Xn - a\> e]-> 0. If a is regarded as a random variable, its distribution function is 0 for χ < a and 1 for χ > a. Thus (25.6) implies that the distribution function of Xn converges weakly to that of a. Suppose, on the other hand, that Xn =» a. Then P[\Xn - a\ > e] < P[Xn < a-e] + l~P[Xn<a+e]^0,so that (25.6) holds: Theorem 25.3. The condition (25.6) holds for all positive e if and only if Xn => a, that is, if and only if (0 ifx<a, If (25.6) holds for all positive e, Xn may be said to converge to a in probability. As this does not require that the Xn be defined on the same space, it is not really a special case of convergence in probability as defined by (25.5). Convergence in probability in this new sense will be denoted Xn => a, in accordance with the theorem just proved. Example 14.4 restates the weak law of large numbers in terms of this concept. Indeed, if Xx, X2,. ■ ■ are independent, identically distributed random variables with finite mean m, and if 5„ = X{ + · · · +Xn, the weak law of large numbers is the assertion «_15„=>m. Example 6.3 provides another illustration: If 5„ is the number of cycles in a random permutation on η letters, then 5„/log η =» 1.
332 CONVERGENCE OF DISTRIBUTIONS Example 25.7. Suppose that Xn => X and δ„ -»0. Given e and η, choose χ so that Ρ[\Χ\>χ]<η and P[X = ±x] = 0, and then choose n0 so that n>n0 implies that \8„\<e/x and \P[X„ <y] - P[X <y]\ <η for y = +x. Then F[!5„Ar„| > e] < 3η for n > n0. Thus Z^Iam/ δ„ -» 0 imp/y ί/ιαί δ„Α'η =» 0, a restatement of Lemma 2 of Section 14 (p. 193). ■ The asymptotic properties of a random variable should remain unaffected if it is altered by the addition of a random variable that goes to 0 in probability. Let (Xn> Yn) be a two-dimensional random vector. Theorem 25.4. // Xn =» X and Xn-Yn=> 0, then Yn =» X. Proof. Suppose that y'<x<y" and P[X = y'] = P[X = y"\ = 0. If y'< x-€<x<x + e<y", then (25.7) P[Zn <y'] - P[\Xn - Yn\> e] <P[Yn <x] <Р[Хя<У>]+Р[\Хп-¥п\*е]. Since Xn => X, letting n -»c» gives (25.8) /5[A'<y']^liminf/5[r„<x] ^limsupF[r„<x]<F[A'<y"]. Since f[Ar = y] = 0 for all but countably many y, if P[X = x] = 0, then у' and y" can further be chosen so that P[X <y'] and P[X <,y"] are arbitrarily near P[X<x]; hence Ρ[Κ„ <x]-*P[Ar <x]. ■ Theorem 25.4 has a useful extension. Suppose that (Χ^,Υ^ is a two-dimensional random vector. Theorem 25.5. //, for each и, Х(„и)=*Х(и) as n -> oo, i/Ar<") =>Xasu->oo, and if (25.9) limlimsupF[|A'iu)-yJ>e] =0 " n for positive e, ί/ien Yn =>X. Proof. Replace ^„ by X(nu) in (25.7). If P[X = y'] = 0 = ΡΐΑ"'"» =y'] and FtA' = y"] = 0 =P[X(u) = y"l letting и -»°° and then и -»oo gives (25.8) once again. Since P[A' = y] = 0 = P[Ar(u) =y] for all but countably many y, the proof can be completed as before. ■
SECTION 25. WEAK CONVERGENCE 333 Fundamental Theorems Some of the fundamental properties of weak convergence were established in Section 14. It was shown there that a sequence cannot have two distinct weak limits: If Fn=* F and Fn => G, then F = G. The proof is simple: The hypothesis implies that F and G agree at their common points of continuity, hence at all but countably many points, and hence by right continuity at all points. Another simple fact is this: // lim„ Fn(d) = F(d) for d in a set D dense in Rl, then Fn =» F. Indeed, if F is continuous at x, there are in D points d' and d" such that d'<x<d" and F(d") - F(d') <e, and it follows that the limits superior and inferior of Fn(x) are within e of F(x). For any probability measure on (R\&x) there is on some probability space a random variable having that measure as its distribution. Therefore, for probability measures satisfying μη =>μ, there exist landom variables Yn and Υ having these measures as distributions and satisfying Yn =» Y. According to the following theorem, the Yn and Υ can be constructed on the same probability space, and even in such a way that Υη(ω) -» Υ(ω) for every ω—a condition much stronger than Yn ^> Y. This result, Skorohod's theorem, makes possible very simple and transparent proofs of many important facts. Theorem 25.6. Suppose that μη and μ are probability measures on (Rl,&1) and μη=>μ. There exist random variables Yn and Υ on a common probability space Ш, !F, P) such that Yn has distribution μη, Υ has distribution μ, and Υη(ω) -» Υ(ω) for each ω. Proof. For the probability space Ш, &, P), take Ω = (0,1), let & consist of the Borel subsets of (0,1), and for P(A) take the Lebesgue measure of A. The construction is related to that in the proofs of Theorems 14.1 and 20.4. Consider the distribution functions Fn and F corresponding to μη and μ. For 0 < ω < 1, put Υ„(ω) = inf[x: ω < Fn(x)] and Υ(ω) = inf[x: ω < F(x)]. Since ω < Fn(x) if and only if Υη(ω) <χ (see the argument following (14.5)), Ρ[ω: Υη(ω)<χ] = Ρ[ω: ω <Fn(x)] = Fn(x). Thus Yn has distribution function Fn; similarly, Υ has distribution function F. It remains to show that Υη(ω) -» Υ(ω). The idea is that Yn and Υ are essentially inverse functions to Fn and F; if the direct functions converge, so must the inverses. Suppose that 0 <ω < 1. Given e, choose χ so that Υ(ω) - e <x < Υ(ω) and μ(χ} = 0. Then F(x) < ω; Fn(x) -»F(x) now implies that, for η large enough, Fn(x)<w and hence Υ(ω) - e <x < Υη(ω). Thus liminf„ Υη(ω)> Υ(ω). If ω < ω and e is positive, choose а у for which Υ(ω') <y < Υ(ω') + e and д{у} = 0. Now ω <ω <F(Y(a)')) <F(y), and so, for n large enough, (a<Fn(y) and hence Υη(ω) <у < Υ(ω') + e. Thus limsup„r„(w) < Υ(ω') if ω < ω. Therefore, Υη(ω) -» Υ(ω) if Y is continuous at ω.
334 CONVERGENCE OF DISTRIBUTIONS Since Υ is nondecreasing on (0,1), it has at most countably many discontinuities. At discontinuity points ω of Y, redefine Υη(ω) = Υ(ω) = 0. With this change, Υη(ω) -» Κω) for every ω. Since Υ and the Yn have been altered only on a set of Lebesgue measure 0, their distributions are still μη and μ. Ш Note that this proof uses the order structure of the real line in an essential way. The proof of the corresponding result in Rk is more complicated. The following mapping theorem is of very frequent use. Theorem 25.7. Suppose that h: Rl -»Rl is measurable and that the set Dh of its discontinuities is measurable* If μη=> μ and μ(Οίι) = 0, then μ„Α~' =» μ-Α"1. Recall (see (13.7)) that μΑ_1 has value μ(Α-1/ί) at A. Proof. Consider the random variables Yn and Υ of Theorem 25.6. Since У„Ы-» УС*»), if Y(w)£Dh then Α(Κ„(ω))-»Α(Κ(ω)). Since Ρ[ω: Υ(ω) e Dh] = μ(ΩΗ) = 0, it follows that h(Yn(a>)) -»Л(У(«)) with probability 1. Hence h(Y„) =»Л(К)by Theorem 25.2. Since P[h(Y) &A] = P[Yeh-lA] = μ(Α-!/l), Λ(Υ) has distiibution μΑ-1; similarly, A(F„) has distribution μ„Α_1. Thus A(Fn) =» A(F) is the same thing as μ„Α ~' =» μΑ ~'. ■ Because of the definition of convergence in distribution, this result has an equivalent statement in terms of random variables: Corollary 1. IfXn=>XandP[XeDh] = 0, then h(X„) => h(X). Take X = a: Corollary 2. IfX„ =» a and h is continuous at a, then h(Xn) => h(a). Example 25.8. From Xn=*X it follows directly by the theorem that aXn + b =» aX + b. Suppose also that an-^a and bn -»b. Then (a„ - fl)^ => 0 by Example 25.7, and so (a„X„ + b„) - (aXn + b) => 0. And now a„X„ + bn=* aX + b follows by Theorem 25.4: If Xn=*X, an->a, and bn-^b, then anXn + bn => aX + b. This fact was stated and proved differently in Section 14 —see Lemma 1 on p. 193. ■ By definition, μ„ =» μ means that the corresponding distribution functions converge weakly. The following theorem characterizes weak convergence fThat Dh lies in 3ix is generally obvious in applications. In point of fact, it always holds (even if h is not measurable): Let A(e,S) be the set of л: for which there exist у and ζ such that |jt-y|<«, U-z|<«, and \h{y)-h{z)\>t. Then A(e,S) is open and Dh = U£ Г\цА(е, δ), where e and δ range over the positive rationals.
SECTION 25. WEAK CONVERGENCE 335 without reference to distribution functions. The boundary dA of A consists of the points that are limits of sequences in A and are also limits of sequences in Ac; alternatively, dA is the closure of A minus its interior. A set A is a μ-continuity set if it is a Borel set and μ(3Α) = 0. Theorem 25.8. The following three conditions are equivalent. (i) μ„=»μ; (ii) \$άμη -» f/άμ for every bounded, continuous real function f; (iii) μ „(Α) -»μ(Α) for every μ-continuity set A. Proof. Suppose that μη => μ. and consider the random variables Yn and Υ of Theorem 25.6. Suppose that / is a bounded function such that /x(Df) = 0, where Df is the set of points of discontinuity of F. From P[YeDf] = /*(£,) = 0 it follows that f(Yn) ->/(*/) with probability 1, and so by change of variable (see (21.1)) and the bounded convergence theorem, \fdnn = E[f(Yn)} -»E[f(Y)] = f/άμ. Thus μη =» μ and /ДЯ,) = 0 together imply that jfdpn -»//d/x if / is bounded. In particular, (i) implies (ii). Further, if f=IA, then Df = dA, and from μ(3Α) = 0 and μη=*μ follows μη(Α) ■= //ίίμ„ -» //ίίμ = μ(/4). Thus (i) also implies (iii). Since d(-<x,x] = {x}, obviously (iii) implies (i). It therefore remains only to deduce μη =» μ from (ii). Consider the corresponding distribution functions. Suppose that x<y, and let f(t) be 1 for t<x, 0 for t>y, and interpolate linearly on [x,y]: f(t) = (y - t)/(y -x) for x<t<y. Since Fn(x)< ^άμη and (fdμ <F(y\ it follows from (ii) that limsup„ Fn(x) < F(y); letting у J, χ shows that limsup„ F„(x) <F(x). Similarly, F(u)< liminf„F„(x) for и <x and hence F(x - ) < liminf„ F„(x). This implies convergence at continuity points. ■ The function / in this last part of the proof is uniformly continuous. Hence μη =» μ follows if \$άμη -» \$άμ for every bounded and uniformly continuous /. Example 25.9. The distributions in Example 25.3 satisfy μη=*μ, but μ „(A) does not converge to μ(Α) if A is the set of rationale. Hence this A cannot be a μ,-continuity set; in fact, of course, dA = Rl. ■ The concept of weak convergence would be nearly useless if (25.2) were not allowed to fail when μ(3Α) > 0. Since F(x) - F(x - ) = μ[χ) = μ,(<9(-οο, χ]), it is therefore natural in the original definition to allow (25.1) to fail when χ is not a continuity point of F.
336 CONVERGENCE OF DISTRIBUTIONS Helly's Theorem One of the most frequently used results in analysis is the Helly selection theorem: Theorem 25.9. For every sequence {F„} of distribution functions there exists a subsequence {Fn } and a nondecreasing, right-continuous function F such that Hrr^ Fn/(x) = F(x) at continuity points χ of F. Proof. Ал application of the diagonal method [A14] gives a sequence [nk] of integers along which the limit G(r) = l\mk Fnt(r) exists for every rational r. Define F(x) = inf[GO): χ <r]. Clearly F is nondecreasing. To each χ and e there is an r for which x<r and G(r) <F(x) + e. If x<y<r, then F(y) < GO) <F(x) + e. Hence F is continuous from the right. If F is continuous at x, choose у < χ so that F(x) - e < F(y); now choose rational r and s so that у < r <x < s and G(s) < F(x) J,- e. From F(x) - e < G(r) < G(s) <F(x) + e and F„0) < Fn(x) < F„(s) it follows that as к goes to infinity Fn (x) has limits superior and inferior within e of F(x). Ш The F in this theorem necessarily satisfies 0 < F(x) < 1. But F need not be a distribution function: if Fn has a unit jump at n, for example, F(x) = 0 is the only possibility. It is important to have a condition which ensures that for some subsequence the limit F is a distribution function. A sequence of probability measures μη on (Rl,&1) is said to be tight if for each e there exists a finite interval (a, b] such that μη(α, b] > 1 - e for all n. In terms of the corresponding distribution functions Fn, the condition is that for each e there exist χ and у such that F„(x) < e and F„(y) > 1 - e for all n. If μ.„ is a unit mass at η, [μη] is not tight in this sense—the mass of μ„ "escapes to infinity." Tightness is a condition preventing this escape of mass. Theorem 25.10. Tightness is a necessary and sufficient condition that for every subsequence {μη } there exist a further subsequence {μ,4 } and a probability measure μ such that μ„ =>uaj/-» o°. Only the sufficiency of the condition in this theorem is used in what follows. Proof. Sufficiency. Apply Helly's theorem to the subsequence [Fn/} of corresponding distribution functions. There exists a further subsequence {F„ } such that lim;. Fn (x) = F(x) at continuity points of F, where F is nondecreasing and right-continuous. There exists by Theorem 12.4 a measure μ on (Λ1, ^") such that μ(α, b] = F(b) - F(a). Given e, choose a and b so that μη(α, b] > 1 - e for all n, which is possible by tightness. By decreasing a
SECTION 25. WEAK CONVERGENCE 337 and increasing b, one can ensure that they are continuity points of F. But then μ.(α,6]> 1-е. Therefore, μ is a probability measure, and of course Necessity. If [μη] is not tight, there exists a positive e such that for each finite interval (a, b], μη(α, b] < 1 - e for some n. Choose nk so that μη(~k,k]< 1-е. Suppose that some subsequence {μη .} of {μη } were to converge weakly to some probability measure μ. Choose (a,b] so that μ{α) = μ{6) = 0 and μ(α, b] > 1 - e. For large enough /, (a, b] с (-k(j), k(j)], and so 1 - e > /x„t(.(-k(j), k(j)] > μηΐι(.(α, b] ->μ(α, b]. Thus μ(α, b] < 1 - e, a contradiction. ■ Corollary. // {μη} is a tight sequence of probability measures, and if each subsequence that converges weakly at all converges weakly to the probability measure μ, then μη =» μ. Proof. By the theorem, each subsequence {μη } contains a further subsequence {μη } converging weakly (/ -»°o) to some limit, and that limit must by hypothesis be μ. Thus every subsequence {μη } contains a further subsequence {μη } converging weakly to μ. Suppose that μη => μ is false. Then there exists some χ such that μ{χ} = 0 but μ.η(-οο, χ] does not converge to μ.(-ο°, χ]. But then there exists a positive e such that \μη(-°°, χ]- μ( — °°, x]\>e for an infinite sequence {nk} of integers, and no subsequence of {μη } can converge weakly to μ. This contradiction shows that μη =» μ. ■ If μ.„ is a unit mass at x„, then {μη} is tight if and only if {*„} is bounded. The theorem above and its corollary reduce in this case to standard facts about real line; see Example 25.4 and A10: tightness of sequences of probability measures is analogous to boundedness of sequences of real numbers. Example 25.10. Let μη be the normal distribution with mean mn and variance σ„2. If mn and σ„2 are bounded, then the second moment of μη is bounded, and it follows by Markov's inequality (21.12) that [μη] is tight. The conclusion of Theorem 25.10 can also be checked directly: If {nk(j)} is chosen sothatlim,m„ =mandlim,a„2 = σ2, then u„ =» u, where μ is normal J nk(» J nk(j) nkU) with mean m and variance σ2 (a unit mass at m if σ2 = 0). If mn > b, then /x„(fr, «0 > \; if mn<a, then μη( - oo, a] > \. Hence {μη} cannot be tight if mn is unbounded. If mn is bounded, say by K, then μ.„(-οο,α] > ι/(-οο,(α — Κ)σ~ι\, where ν is the standard normal distribution. If a„is unbounded, then i/(-°°,(fl - /Οσ,,-1]-* 5 along some subsequence, and {/x„} cannot be tight. Thus a sequence of normal distributions is tight if and only if the means and variances are bounded. ■
338 CONVERGENCE OF DISTRIBUTIONS Integration to the Limit Theorem 25.11. IfXn =»X, then E[\X\] < liminf„ E[\Xnt Proof. Apply Skorohod's Theorem 25.6 to the distributions of Xn and X: There exist on a common probability space random variables Yn and Υ such that Y= lim„ Yn with probability 1, Yn has the distribution of Xn, and Υ has the distribution of X. By Fatou's lemma, E[|F|]< liminf„ £[Щ]. Since \X\ and \Y\ have the same distribution, they have the same expected value (see (21.6)), and similarly for \Xn\ and \Yn\. ■ The random variables Xn are said to be uniformly integrable if (25.10) lim sup f \Xn\dP = 0; «-*00 η J[\X„\>oc] see (16.21). This implies (see (16.22)) that (25.11) sup£[|Z„|] <oo. η Theorem 25.12. If Xn=>X and the Xn are uniformly integrable, then X is integrable and (25.12) E[Xn}-+E[X]. Proof. Construct random variables Yn and Υ as in the preceding proof. Since Yn -» Υ with probability 1 and the Yn are uniformly integrable in the sense of (16.21), E[Xn] = E[Yn]-+E[Y) = E[X] by Theorem 16.14. ■ If sup„ ZTflA'J +i]<oc for some positive e, then the Xn are uniformly integrable because (2513) bjJW'Mwi Since Xn =*X implies that Xrn=*Xr by Theorem 25.7, there is the following consequence of the theorem. Corollary. Let r be a positive integer. If Xn =*X and sup„ EllA^I1"1"*] < °°5 where e > 0, then E[\X\r] < oo and E[Xrn\ -»E[Xr\. The X„ are also uniformly integrable if there is an integrable random variable Ζ such that P[\X„\ >t]< P[\Z\ > t] for t > 0, because then (21.10)
SECTION 25. WEAK CONVERGENCE 339 ( \X„\dP<( \Z\dP. Jl\X„\>a] Ji\Z\>a] From this the dominated convergence theorem follows again. PROBLEMS 25.1. (a) Show by example that distribution functions having densities can converge weakly even if the densities do not converge: Hint: Consider f„(x) = 1 + cos27rn* on [0,1]. (b) Let /n be 2" times the indicator of the set of χ in the unit interval for which dn+l(x) = · · ■ = d2n(x) = 0, where dk(x) is the fcth dyadic digit. Show that /„(·*) -» 0 except on a set of Lebesgue measure 0; on this exceptional set, redefine f„(x) = 0 for all n, so that f„(x) -»0 everywhere. Show that the distributions corresponding to these densities converge weakly to Lebesgue measure confined to the unit interval. (c) Show that distributions with densities can converge weakly to a limit that has no density (even to a unit mass). (d) Show that discrete distributions can converge weakly to a distribution that has a density. (e) Construct an example, like that of Example 25.3, in which μ „(Α) -»μ(Α) fails but in which all the measures come from continuous densities on [0,1]. 25.2. 14.8 Τ Give a simple proof of the Gilvenko-Cantelli theorem (Theorem 20.6) under the extra hypothesis that F is continuous. 25.3. Initial digits, (a) Show that the first significant digit of a positive number χ is d (in the scale of 10) if and only if {log10 x) lies between log10 d and \ogl0(d + 1), d — 1,..., 9, where the braces denote fractional part. (b) For positive numbers xl,x2,..., let Nr(d) be the number among the first η that have initial digit d. Show that (25.14) lim^N„(i/) = log10(i/+l)-log1oi/, d=l,...,9, if the sequence log10 xn, η = 1,2,..., is uniformly distributed modulo 1. This is true, for example, of xn =#" if log10 τ? is irrational. (c) Let Dn be the first significant digit of a positive random variable Xn. Show that (25.15) \imP[D„=d] = \ogl0(d+l)-\ogl0d, d=l,...,9, η if {log1() Xn) =» U, where U is uniformly distributed over the unit interval. 25.4. Show that for each probability measure μ on the line there exist probability measures μη with finite support such that μη => μ. Show further that μη[χ} can
340 CONVERGENCE OF DISTRIBUTIONS be taken rational and that each point in the support can be taken rational. Thus there exists a countable set of probability measures such that every μ is the weak limit of some sequence from the set. The space of distribution functions is thus separable in the Levy metric (see Problem 14.5). 25.5. Show that (25.5) implies that P([X<x]*[Xn <x]) -> 0 if P[X-x] = 0. 25.6. For arbitrary random variables Xn there exist positive constants an such that 25.7. Generalize Example 25.8 by showing for three-dimensional random vectors (A„,B„,X„) and constants a and b, a > 0, that, if An=>a, Bn=*b, and X„=>X, then A„X„+B„=>aX+b Hint: First show that if Y„=>Y and D„ => 0, then D„Yn =» 0. 25.8. Suppose that Xn => X and that hn and h are Borel functions. Let Ε be the set of χ for which hnxn-*hx fails for some sequence xn-*x. Suppose that £e^' and P[X(=E] = 0. Show that h„Xn=*hX. 25.9. Suppose that the distributions of random variables Xn and X have densities /„ and /. Show that if f„(x) -*f(x) for χ outside a set of Lebesgue measure 0, then Xn =» X. 25.10. Τ Suppose that Xn assumes as values yn + k8n, к = 0, + 1,..., where 8„ > 0. Suppose that δ„ -» 0 and that, if kn is an integer varying with η in such a way that Уп + кп8п^х, then Ρ[Χ„ = ύ„ +Α:ηδη]δ„_1 ->/(*), where / is the density of a random variable X. Shew that Xn =*X. 25.11. Τ Let Sn have the binomial distribution with parameters η and p. Assume as known that (25.16) p[Sn=/cn](«p(l-p))1/2^-i=e-*2/2 V27T if (kn - np)(np(l -р))~1/2->л:. Deduce the DeMoivre-Laplace theorem: (S„ -npXnpil -p))~l/2 =>N, where N has the standard normal distribution. This is a special case of the central limit theorem; see Section 27. 25.12. Prove weak convergence in Example 25.3 by using Theorem 25.8 and the theory of the Riemann integral. 25.13. (a) Show that probability measures satisfy μ„ ==> μ if μ„(«, b] -» μ(α, b] whenever μ{α) = μ{6} = 0. (b) Show that, if //ίίμ„-»\fd\i for all continuous / with bounded support, then μη =>μ.
SECTION 25. WEAK CONVERGENCE 341 25.14. Τ Let μ be Lebesgue measure confined to the unit interval; let μ„ correspond to a mass of ■*„,,·-*„,,■_! at some point in (x„ ,_ j, jc .], where 0 = xn0 <xn\< ·■· <xn„=l. Show by considering the distribution functions that μ„=>μ if тах(<„(л„,-jc„ ,-_,)-» 0. Deduce that a bounded Borel function continuous almost everywhere on the unit interval is Riemann integrable. See Problem 17.1. 25.15. 2.18 5.19 Τ A function / of positive integers has distribution function F if F is the weak limit of the distribution function Pn[m: f(m) <x] of / under the measure having probability \/n at each of I,...,n (see 2.34)). In this case D[m: f(m)<x]=F(x) (see (2.35)) for continuity points χ of F. Show that <p(m)/m (see (2.37)) has a distribution: (a) Show by the mapping theorem that it suffices to prove that f(m) = logdp(m)/m) = EpSp(m)log(l - l/p) has a distribution. (b) Let fu(m) = LpStti:(m)log(l- l/p), and shew by (5.45) that /„ has distribution function Fu(x) = P[Lp^uXp log(l - l/p) <*], where the Xp are independent random variables (one for each prime p) such that P[Xp = 1] = l/p and P[Xp = 0] = l-l/p. (c) Show that ΣρΧρ log(l - l/p) converges with probability 1. Hint: Use Theorem 22.6. (d) Show that lim„_0Osupn E„[\f-fu\] = 0 (see (5.46) for the notation). (e) Conclude by Markov's inequality and Theorem 25.5 that / has the distribution of the sum in (c). 25.16. For /Ιεί?1 and T> 0, put λτ(Α) = λ([-7\Γ] Γ\Α)/2Τ, where λ is Lebesgue measure. The relative measure of A is (25.17) P(A)= lim λτ(Α), 7"-.oo provided that this limit exists. This is a continuous analogue of density (see (2.35)) for sets of integers. A Borel function / has a distribution under λ^; if this converges weakly to F, then (25.18) p[x:f(x)<u]=F(u) for continuity points и of F, and F is called the distribution function of /. Show that all periodic functions have distributions. 25.17. Suppose that sup„//d/in <<» for a nonnegative / such that /(*)-» oo as χ -» ±oo. Show that {μ„} is tight. 25.18. 23.4Τ Show that the random variables A, and L, in Problems 23.3 and 23.4 converge in distribution. Show that the moments converge. 25.19. In the applications of Theorem 9.2, only a weaker result is actually needed: For each К there exists a positive a = a(K) such that if E[X] = 0, E[X2] = 1, and E[X4]<K, then P[X>0]>a. Prove this by using tightness and the corollary to Theorem 25.12. 25.20. Find uniformly integrable random variables Xn for which there is no integrable Ζ satisfying />[|A;|>i]<P[|Z|>i]for t > 0.
342 CONVERGENCE OF DISTRIBUTIONS SECTION 26. CHARACTERISTIC FUNCTIONS Definition The characteristic function of a probability measure μ on the line is defined for real t by φ(ί)=Γβ"'μ(άχ) J — 00 00 00 = / cos ίχμ(άχ) +i / ύηίχμ(άχ); J -oo J — oo see the end of Section 16 fcr integrals of complex-valued functions.1 A random variable X with distribution μ has characteristic function <p(t)*=E[e',x]= Γ β'"μ(άχ). "* — 00 The characteristic function is thus defined as the moment generating function but with the real argument s replaced by it; it has the advantage that it always exists because e"x is bounded. The characteristic function in nonprob- abilistic contexts is called the Fourier transform. The characteristic function has three fundamental properties to be established here: (i) If μ, and μ2 have respective characteristic functions φχ(.ί) and <p2(t), then μ, *μ2 has characteristic function φι(ί)φ2(ί). Although convolution is essential to the study of sums of independent random variables, it is a complicated operation, and it is often simpler to study the products of the corresponding chaiacteristic functions. (ii) The characteristic function uniquely determines the distribution. This shows that in studying the products in (i), no information is lost. (iii) From the pointwise convergence of characteristic functions follows the weak convergence of the corresponding distributions. This makes it possible, for example, to investigate the asymptotic distributions of sums of independent random variables by means of their characteristic functions. Moments and Derivatives It is convenient first to study the relation between a characteristic function and the moments of the distribution it comes from. +Ргот complex variable theory only De Moivre's formula and the simplest properties of the exponential function are needed here.
SECTION 26. CHARACTERISTIC FUNCTIONS 343 Of course, φ(0) = 1, and by (16.30), |<p(i)l < 1 for all t. By Theorem 16.8(0, <p(i) is continuous in t. In fact, |<p(i + h) - φ(ί)\ < j\eihx - 1\μ(άχ), and so it follows by the bounded convergence theorem that cp(t) is uniformly continuous. In the following relations, versions of Taylor's formula with remainder, χ is assumed real. Integration by parts shows that (26.1) ((i-OVi-^^^-fV·*, and it follows by induction that (26.2) for η > 0. Replace η by η - 1 in (26.1), solve for the integral on the right, and substitute this for the integral in (26.2); this gives " (ix) >'" rX (26.3) β"= Σ^ + 7Γ4τ/(χ-ί)"ν-ΐ)ώ. Estimating the integrals in (26.2) and (26.3) (consider separately the cases χ > 0 and χ < 0) now leads to (26.4) fc = 0 ("Г it! < min in+l 2UT (« + !)!' n! for η > 0. The first term on the right gives a sharp estimate for |jc| small, the second a sharp estimate for |jc| large. For η = 0,1,2, the inequality specializes to (26.40) (26.4.) (26.42) |e/j:-l|<min{|x|,2}, \e'x - (1 + ix)\ <min{{x2,2\x\), \eix-(l+ix-{x2)\<min{i\x\\x2}. If X has a moment of order n, it follows that (26.5) 9(t)~ t^~E[Xk] k = 0 <E min |ίΛΊ" + 1 2\tX\" (" + !)!'
344 CONVERGENCE OF DISTRIBUTIONS For any t satisfying t\"E[\X\n] (26.6) lim „, = 0, φ(ί) must therefore have the expansion (26-7) φ(ί)= Σ^Ε[Χ*]; compare (21.22). If fc = 0 ' then (see (16.31)) (26.7) must hold. Thus (26.7) holds if X has a moment generating function over the whole line. Example 26.1. Since E[e|,Jf|] < oo if X has the standard normal distribution, by (26.7) and (21.7) its characteristic function is (26.8) 9(0- έ^ΐΧ3χ·-·Χ(2*-1)- £^(-£)* = e-'V2. This and (21.25) formally coincide if s = it. Ш If the power-series expansion (26.7) holds, the moments of X can be read off from it: (26.9) cp(h\0)=ikE[Xk]. This is the analogue of (21.23). It holds, however, under the weakest possible assumption, namely that EflA^O < oo. Indeed, *(' + *)-*) _E[iXei,x] = E h ,ltXeihx-l-ihX h By (26.4,), the integrand on the right is dominated by 2\X\ and goes to0 with h; hence the expected value goes to 0 by the dominated convergence theorem. Thus φ'(ί) = E[iXe"x]. Repeating this argument inductively gives (26.10) <pw{t) = E[{iX)kei,x\
SECTION 26. CHARACTERISTIC FUNCTIONS 345 if E[|**0 < oo. Hence (26.9) holds if E[\Xk\] < oo. The proof of uniform continuity for <p(i) works for cp(kKt) as well. If E[X2] is finite, then (26.11) φ(ί) = \+itE[X]-\t2E[X2] +o(t2), i-»0. Indeed, by (26.42), the error is at most i2E[min{|r| |ΛΊ3, X2)], and as t -»0 the integrand goes to 0 and is dominated by Z2. Estimates of this kind are essential for proving limit theorems. The more moments μ has, the more derivatives φ has. This is one sense in which lightness of the tails of μ is reflected by smoothness of ψ. There are results which connect the behavior of φ(ί) as |f|-><» with smoothness properties of μ. The Riemann-Lebesgue theorem is the most important of these: Theorem 26.1. // μ has a density, then φ(ί) -» 0 as \t\ -» oo. Proof. The problem is to prove for integrable / that Jf(x)e"xdx -»0 as |i| -»oo. There exists by Theorem 17.1 a step function g = LkackIA/r, a finite linear combination cf indicators of intervals Ak = (ak,bk], for which /!/ — g\dx<€. Now ff(x)e"xdx differs by at most e from fg(x)e"xdx = Lkak(ei,bk - e'""k)/it, and this goes to 0 as |i| -»oo. ■ Independence The multiplicative property (21.28) of moment generating functions extends to characteristic functions. Suppose that Xx and X2 are independent random variables with characteristic functions <p, and φ2. If Y; = cos Xj and Z, = sin tXj, then (Y^Z^ and (Y2, Z2) are independent; by the rules for integrating complex-valued functions, φι(ί)φ2(ί) = (Е[У,] +iE[Zl])(E[Y2] +iE[Z2}) = E[y,]E[y2]-£[Z1]£[Z2] + i(E[Yl]E[Z2]+E[Zi]E[Y2]) = E[r,r2-Z1Z2 + /(riZ2 + Zir2)]=E[e"№+^]. This extends to sums-of three or more: If X1,...,Xn are independent, then (26.12) Ε[β"«-Λ] = f\E[e"Xk].
346 CONVERGENCE OF DISTRIBUTIONS If X has characteristic function <p(i), then aX + b has characteristic function (26.13) E[eh(aX+b)] =e"V(ei)· In particular, -X has characteristic function cp(-t), which is the complex conjugate of φ(ί). Inversion and the Uniqueness Theorem A characteristic function φ uniquely determines the measure μ it comes from. This fundamental fact will be derived by means of an inversion formula through which μ can in principle be recovered from φ. Define S(T)*r^dx, Т.О. In Example 18.4 it is shown that (26.14) limS(T) = y; S(T) is therefore bounded. If sgn# is +1, 0, or -1 as θ is positive, 0, or negative, then (26.15) iT^^-dt=sgneS(T\e\), T>0. Theorem 26.2. // the probability measure μ has characteristic function ψ, and if μ{α) = μ{6} = 0, then (26.16) μ(α^]= \\m^fT^e "" ue "\{t)dt. Distinct measures cannot have the same characteristic function. Note: By (26.4,) the integrand here converges as t -» 0 to b - a, which is to be taken as its value for t = 0. For fixed a and b the integrand is thus continuous in t, and by (26.40) it is bounded. If μ is a unit mass at 0, then cp(t) = 1 and the integral in (26.16) cannot be extended over the whole line. Proof. The inversion formula will imply uniqueness: It will imply that if μ and ν have the same characteristic function, then μ(α, b] = v(a, b] if μ{α) = v{a) = μ{6} = v{b) = 0; but such intervals (a, b] form a 77-system generating 31K
SECTION 26. CHARACTERISTIC FUNCTIONS 347 Denote by IT the quantity inside the limit in (26.16). By Fubini's theorem 1 (26,17) 7^^LL—*—dt It μ(άχ). This interchange is legitimate because the double integral extends over a set of finite product measure and by (26.40) the integrand is bounded by \b - a\. Rewrite the integrand by DeMoivre's formula. Since sin s and cos s are odd and even, respectively, (26.15) gives -f sgn(x -a) π S(T\x-a\) sgn(x-fr) 77 S(T\x-b\) μ(άχ). The integrand here is bounded and converges as Τ -»oo to the function (26.18) ΦαΛ*) Ό ifx<a, \ if x = a, 1 Ha<x<b, \ if χ = b, β ifixx. Thus IT-> (фаЬ άμ, which implies that (26.16) holds if μ{α) = μ{6) = 0. The inversion formula contains further information. Suppose that (26.19) Гш\ dt <oo. In this case the integral in (26.16) can be extended over R1. By (26.40), |β"<6-«>-1| β-ίι6_β-Λβ :ί Ul <|6-el; therefore, μ{α^)<^ - a)j1^{.t)\dt, and there can be no point masses. By (26.16), the corresponding distribution function satisfies F(x+h)-F(x) _ J_ Г e~i,x - e-"ix+h) h ίττ / ith <p(t) dt (whether h is positive or negative). The integrand is by (26.40) dominated by \φ{ί)\ and goes to e"xq>(t) as h -»0. Therefore, F has derivative (26.20) /(*) = а/"в~"М')л.
348 CONVERGENCE OF DISTRIBUTIONS Since / is continuous for the same reason φ is, it integrates to F by the fundamental theorem of the calculus (see (17,6)). Thus (26.19) implies that μ has the continuous density (26,20). Moreover, this is the only continuous density. In this result, as in the Riemann-Lebesgue theorem, conditions on the size of <p(i) for large |i| are connected with smoothness properties of μ. The inversion formula (26.20) has many applications. In the first place, it can be used for a new derivation of (26.14). As pointed out in Example 17.3, the, existence of the limit in (26.14) is easy to prove. Denote this limit temporarily by π0/2—without assuming that π0 = π. Then (26.16) and (26.20) follow as before if π is replaced by π0. Applying the latter to the standard normal density (see (26.8)) gives (26.21) —Le-*2/2= _L Γ e-^e-<2/2dt where the π on the left is that of analysis and geometry—it comes ultimately from the quadrature (18.10). An application of (26.8) with χ and t interchanged reduces the right side of (26.21) to (^2ττ /2v0)e~x2/2, and therefore T70 does equal π. Consider the densities in the table. The characteristic function for the normal distribution has already been calculated. For the uniform distribution over (0,1), the computation is of course straightforward; note that in this case the density cannot be recovered from (26.20), because <p(i) is not integrable; this is reflected in the fact that the density has discontinuities at 0 and 1, Distribution Density Interval 1 _ 2 /7 1. Normal ,— e x ' -оо<д;<оо l/2ir 2. Uniform 1 0 <jc < 1 3. Exponential e~x 0<Jt<oo 4. Double exponential |е_|лг| — oo<jc<<» or Laplace 5. Cauchy r -oo<jc<oo -"■ 1 +jc2 6. Triangular 1 - U| — 1 <jc < 1 1 1 - cos χ 7. ; — oo < jc < oo IT j-2 Characteristic Function e~'2/2 e" - 1 it 1 1-/7 1 1+/2 е-"'· ,, 1 - cos / - /2 (i-M)/(_U)a)
SECTION 26. CHARACTERISTIC FUNCTIONS 349 The characteristic function for the exponential distribution is easily calculated; compare Example 21.3. As for the double exponential or Laplace distribution, e-|JVx integrates over (0,«>) to (1 - /f)-1 and over (-°°,0) to (1 +/f)_1, which gives the result. By (26.20), then, W-oo 1+i2 For χ = 0 this gives the standard integral jx_mdt/{\ +t2) = π; see Example 17.5. Thus the Cauchy density in the table integrates to 1 and has characteristic function e-1'1. This distribution has no first moment, and the characteristic function is not differentiable at the origin. A straightforward integration shows that the triangular density has the characteristic function given in the table, and by (26.20), (1-|xO/(-...)(*)-^/_"/","L=^£a- For χ = 0 this is/IJl - cos t)t"2 dt = π; hence the last line of the table. Each density and characteristic function in the table can be transformed by (26.13), which gives a family of distributions. The Continuity Theorem Because of (26.12), the characteristic function provides a powerful means of studying the distributions of sums of independent random variables. It is often easier to work with products of characteristic functions than with convolutions, and knowing the characteristic function of the sum is by Theorem 26.2 in principle the same thing as knowing the distribution itself. Because of the following continuity theorem, characteristic functions can be used to study limit distributions. Theorem 26.3. Let μ„,μ be probability measures with characteristic functions φη, φ. A necessary and sufficient condition for μ„ =» μ is that <ρ„(ί) -»<ρ(ί) for each t. Proof. Necessity. For each t, eUx has bounded modulus and is continuous in x. The necessity therefore follows by an application of Theorem 25.8 (to the real and imaginary parts of e"x).
350 CONVERGENCE OF DISTRIBUTIONS Sufficiency. By Fubini's theorem, χ> Γ ι /■" 1 1 г" (26.22) _^(ι-φ|ΐ(ί))Λ = ^|-^(ι-β"*)Λ *M„ ι ι 2 χ: χ > — ы (Note that the first integral is real.) Since φ is continuous at the origin and <p(0) = 1, there is for positive e а и for which и"'/"„(! - q>(t))dt <e. Since <p„ converges to φ, the bounded convergence theorem implies that there exists an n0 such that «"'/"„(! -φη(ί))άί < 2e for n>n0. If α = 2/и in (26.22), then μ„[χ: |x|>a]<2e for n>n0. Increasing a if necessary will ensure that this inequality also holds for the finitely many η preceding n0. Therefore, {μ„} is tight. By the corollary to Theorem 25.10, μη =» μ will follow if it is shown that each subsequence {μη/} that converges weakly at all converges weakly to μ. But if μ„4 => ν as к -»oo, then by the necessity half of the theorem, already proved, ν has characteristic function lim^ ψη (ί) = φ(ί). By Theorem 26.2, ν and μ must coincide. ■ Two corollaries, interesting in themselves, will make clearer the structure of the proof of sufficiency given above. In each, let μη be probability measures on the line with characteristic functions φη. Corollary 1. Suppose that limn <p„(i) = g(t) for each t, where the limit function g is continuous at 0. Then there exists α μ such that μη => μ, and μ has characteristic function g. Proof. The point of the corollary is that g is not assumed at the outset to be a characteristic function. But in the argument following (26.22), only <p(0) = 1 and the continuity of φ at 0 were used; hence {μ„} is tight under the present hypothesis. If μ„ => ν as &-»<», then ν must have characteristic function limfc<p„t(i) = g(t). Thus g is, in fact, a characteristic function, and the proof goes through as before. ■
SECTION 26. CHARACTERISTIC FUNCTIONS 351 In this proof the continuity of g was used to establish tightness. Hence if {μ„} is assumed tight in the first place, the hypothesis of continuity can be suppressed: Corollary 2. Suppose that limn <pri(i) = g(f) exists for each t and that [μη] is tight. Then there exists α μ such that μη => μ, and μ has characteristic function g. This second corollary applies, for example, if the μη have a common bounded support. Example 26.2. If μη is the uniform distribution over (-n,n), its characteristic function is (nt)~l sin tn for t Φ 0, and hence it converges to /{0}(i)- In this case {μ„} is not tight, the limit function is not continuous at 0, and μη does not converge weakly. ■ Fourier Series* Let μ be a probability measure on &x that is supported by [0,2·7γ]. Its Fouiier coefficients are defined by (26.23) cm= f2neimx,i(dx), m = 0, + l, + 2,... These coefficients, the values of the characteristic function for integer arguments, suffice to determine μ except for the weights it may put at 0 and 2π. The relation between μ and its Fourier coefficients can be expressed formally by (26.24) H(dx)~^= Σ c,e-"'dx: /=-oo if the μ,(άχ) in (26.23) is replaced by the right side of (26.24), and if the sum over / is interchanged with the integral, the result is a formal identity. To see how to recover μ from its Fourier coefficients, consider the symmetric partial sums sm(t) = (2тг)~1Е?=-тс,е~''' and their Cesaro averages am(t) = "ΐ_1Σ^'ί/(ί). From the trigonometric identity [A24] s'm2\mx (26.25) Σ Σ «№*-^£ / = 0 к- -1 Sln 2х it follows that (26.26) am(t) = ^ / , ,2, v -μ(άχ). *This topic may be omitted
352 CONVERGENCE OF DISTRIBUTIONS If μ is (2π) ' times Lebesgue measure confined to [0,2тг\ then c0 = 1 and cm = 0 for m Φ 0, so that am(t) = sm(t) = (2тг)-1; this gives the identity (26.27) —r vm J r sin2 \ms · 2~i ■π SHTjS <fc = l. Suppose that 0 <a <b <2π, and integrate (26.26) over (a,b). Fubini's theorem (the integrand is nonnegative) and a change of variable lead to (26.28) Ди(0Л_/Ч_1_/ ί-6-д: x sin2 jms sin 2 1. ds μ(άχ). The denominator in (26.27) is bounded away from 0 outside ( — δ, δ), and so as m goes to oo with δ fixed (0 < δ < π\ TmJ8<ls\ 2vm %\x\2jms \<тг sin2 2$ ds-»0, sin2 {ms ί ώ-»1. Therefore, the expression in brackets in (26.28) goes to OifO<i<a or b <x< 2π, and it goes to 1 if a < χ < b; and because of (26.27), it is bounded by 1. It follows by the bounded convergence theorem that (26.29) μ(α,δ] = Μιηί am(t)dt m •'л if μ{α) = μ{6} = 0 and 0 < a < b < 2π. This is the analogue of (26.16). If μ and ν have the same Fourier coefficients, it follows from (26.29) that μ(Α) = ν(Α) for /1с(0,2тг) and hence that μ{0,2ττ} = v[0,2тг). It is clear from periodicity that the coefficients (26.23) are unchanged if μ{0} and μ{2ττ) are altered but μ{0) + μ{2ττ) is held constant. Suppose that μη is supported by [0,2v ] and has coefficients c*£\ and suppose that lim„c^,) = cm for all m. Since {μη} is tight, μη=*μ will hold if μη/ι=>ν (fc-»oo) implies ν = μ. But in this case ν and μ have the same coefficients cm, and hence they are identical except perhaps in the way they split the mass ι/{0,2π} = μ{0,2тг) between the points 0 and 2тг. But this poses no problem if μ{0,27τ} = 0: If lim„ c^} = cm for all m and μ{0} = μ{27τ} = 0, then μη =» μ. Example 26.3. If μ is (2π)-1 times Lebesgue measure confined to the intep/al [0,2π], the condition is that limnc^I) = 0 for m Φ 0. Let xux2,... be a sequence of reals, and let μ„ put mass n_1 at each point 2тг{хк), \<k <n, where {xk} =xk -[xk J denotes fractional part. This is the probability measure (25.3) rescaled to [0,2π]. The sequence x1,x2,.-. is uniformly distributed modulo 1 if and only if 1 " 1 " ± у е2ттЦхк)т _ J. V е2тг!хкт k=\ k = \ for m Φ 0. This is Weyl's criterion.
SECTION 26. CHARACTERISTIC FUNCTIONS 353 If xk = кв, where 0 is irrational, then εχρ(2πίθπι) Φ 1 for m Φ 0 and hence 1 11 _ у „2irik6m _ _р2тг!вт± - '■■' η ι „2τγ ίηθ/η к = \ σ 2 ττΐθηι о. Thus 0,20,30,... is uniformly distributed modulo 1 if 0 is irrational, which gives another proof of Theorem 25.1. ■ PROBLEMS 26.1. A random variable has a lattice distribution if for some a and b, b > 0, the lattice [a+nb: и = 0, ±1,...] supports the distribution of X. Let X have characteristic function φ. (a) Show that a necessary condition for X to have a lattice distribution is that \(p{t)\= 1 for some t Φ$. (b) Show that the condition is sufficient as well. (c) Suppose that \tp(t.)\ = \(p(t')\ = 1 for incommensurable t and t' (t Φ 0, t' Φ 0, t/t' irrational). Show that P[X= c] = 1 for some constant c. 26.2. If μ(-°°, χ] = μ[-χ,οο) for all χ (which implies that μ(Α) = μ(-Α) for all A e £%x\ then μ is symmetric. Show that this holds if and only if the characteristic function is real. 26.3. Consider functions ψ that are real and nonnegative and satisfy cp{ —1) = <p(t) and ψ(0)= 1. (a) Suppose that dx,d2,... are positive and T%=ldk = <», that s, >s2> ··· > 0 and \\mk sk = 0, and that T%=lskdk = 1. Let φ be the convex polygon whose successive sides have slopes — slt — s2,--- and lengths di,d2,-.. when projected on the horizontal axis: ψ has value 1 — L^=1s dj at tk=d] + ■ ■· +dk. If sn = 0, there are in effect only η sides. Let φ0(ί) = (1 — ΙΗ)/(_ι,ΐ)(ί) be the characteristic function in the last line in the table on p. 348, and show that ip(t) is a convex combination of the characteristic functions <pQ(t/tk) and hence is itself a characteristic function. (b) Polya's criterion. Show that φ is a characteristic function if it is even and continuous and, on [0,<»), nonincreasing and convex (φ(0) = 1).
354 CONVERGENCE OF DISTRIBUTIONS 26.4. Τ Let <p\ and φ2 be characteristic functions, and show that the set A =[t: φχ(ΐ) = cp2(t)] is closed, contains 0, and is symmetric about 0. Show that every set with these three properties can be such an A. What does this say about the uniqueness theorem? 26.5. Show by Theorem 26.1 and integration by parts that if μ has a density / with integrable derivative /', then <p(t) = o(t~') as И -»°°. Extend to higher derivatives. 26.6. Show for independent random variables uniformly distributed over (-1, +1) that Xt + ■■■ +Xn has density π" '/^((sin i)/i)"cos txdt for η > 2. 26.7. 21.17 Τ Uniqueness theorem for moment generating functions. Suppose thai F has a moment generating function in (-s0,s0), s0>0. From the fact that /™Me"dF(x) is analytic in the strip —s0 < Re ζ <s0, prove that the moment generating function determines F. Show that it is enough that the moment generating function exist in [0, s0), s0 > 0. 26.8. 21.20 26.7 Τ Show that the gamma density (20.47) has characteristic function ;-.ц.-!)]. where the logarithm is the principal part. Show that j^e**fix; a, u) dx if analytic for Re ζ < a. 26.9. Use characteristic functions for a simple proof that the family of Cauch· distributions defined by (20.45) is closed under convolution; compare th argument in Problem 20.14(a). Do the same for the normal distributic (compare Example 20.6) and for the Poisson and gamma distributions. 26.10. Suppose that F„=* F and that the characteristic functions are dominated by integrable function. Show that F has a density that is the limit of the densit of the Fn. 26.11. Show for all a and b that the right side of (26.16) is μ(α, b) + \μ{α) + \μ 26.12. By the kind of argument leading to (26.16), show that 1 τ (26.30) μ{α)=\\π\-~φ\ e~h\{t)dt. 7"-><» ll J-T 26.13. Τ Let *i>*2>··· be the points of positive μ-measure. By the following prove that (26.31) lim ±fT |^(0|2Λ= Σ(Μ**})2· T~*oo **! J — f , 1 Π -it/ty\ ;r = exp
SECTION 26. CHARACTERISTIC FUNCTIONS 355 Let X and Υ be independent and have characteristic function ψ. (a) Show by (26.30) that the left side of (26.31) is P[X- Y= 0]. (b) Show (Theorem 20.3) that P[X - Υ = 0] = jl^PlX = yfcidy) = Σ*(μ{**))2. 26.14. Τ Show that μ has no point masses if <p2(t) is integrable. 26.15. (a) Show that if {μη} is tight, then the characteristic functions <p„(t) are uniformly equicontinuous (for each e there is a δ such that |χ-ί|<δ implies that \<pnis) - <pn(t)\ < e for all n). (b) Show that μη => μ implies that <pn(t) -»φ(ί) uniformly on bounded sets. (c) Show that the convergence in part (b) need not be uniform over the entire line. 26.16. 14.5 26.15 Τ For distribution functions F and G, define d'(F,G) = sup, \φ(ί) - ψ(ί)\/(1 + И), where φ and ψ are the corresponding characteristic functions. Show that this is a metric and equivalent to the Levy metric. 26.17. 25.16 Τ A real function / has mean value (26.32) M[f(x)]= lim ^ψ [Tf(x)dx, 7"-»°° ll J-T provided that / is integrable over each [ — T,T] and the limit exists. (a) Show that, if / is bounded and e"fix) has a mean value for each t, then / has a distribution in the sense of (25.18). (b) Show that (26.33) "['"]-{;;;;:!!: Of course, fix) =x has no distribution. 2( '8. Suppose that X is irrational with probability 1. Let μη be the distribution of the fractional part {riX}. Use the continuity theorem and Theorem 25.1 to show that «_1E" = iMA converges weakly to the uniform distribution on [0,1]. 26"). 25.13 Τ The uniqueness theorem for characteristic functions can be derived from the Weierstrass approximation theorem. Fill in the details of the following argument. Let μ and ν be probability measures on the line. For continuous / with bounded support choose a so that μ(-α, a) and vi — a, a) are nearly 1 and / vanishes outside i-a, a). Let g be periodic and agree with / in i-a,a), and by the Weierstrass theorem uniformly approximate gix) by a trigonometric sum pix) = !L%=lake"kX. If μ and ν have the same characteristic function, then $άμ ~ \%άμ = \ράμ = fpdv ~ fgdv = ffdv. 26.20. Use the continuity theorem to prove the result in Example 25.2 concerning the convergence of the binomial distribution to the Poisson.
356 CONVERGENCE OF DISTRIBUTIONS 26.21. According to Example 25.8, if Xn => X, an -»a, and bn -»b, then anXn + bn => aJf + b. Prove this by means of characteristic functions. 26.22. 26.1 26.15 Τ According to Theorem 14.2, if *„=> X and anXn + b„=>Y, where an > 0 and the distributions of X and У are nondegenerate, then an -»a > 0, b„ -»b, and aX+b and У have the same distribution. Prove this by characteristic functions. Let ψη,ψ,ψ be the characteristic functions of *„,*,У (a) Show that |^>n(ani)| -»\ψ(ί)\ uniformly on bounded sets and hence that an cannot converge to 0 along a subsequence. (b) Interchange the roles of φ and ф and show that an cannot converge to infinity along a subsequence. (c) Show that an converges to some a > 0. (d) Show that e"b" -»ψ(ί)/φ(αί) in a neighborhood of 0 and hence that /o'e'i6" ds -" Ιο(ψ(ε)/φ(α$)) ds. Conclude that bn converges. 26.23. Prove a continuity theorem for moment generating functions as defined by (22.4) for probability measures on [0,oo). For uniqueness, see Theorem 22.2; the analogue of (26.22) is lf\l-M(s))ds>^u,co). 26.24. 26.4 Τ Show by example that the values <p(m) of the characteristic function at integer arguments may not determine the distribution if it is not supported by [0,277]. 26.25. If / is integrable over [0,2π], define its Fourier coefficients as Q*e,mxf(x)dx. Show that these coefficients uniquely determine / up to sets of measure 0. 26.26. 19.8 26.25 Τ Show that the trigonometric system (19.17) is complete. 26.27. The Fourier-series analogue of the condition (26.19) is Lm\cm\ <°°. Show that it implies μ has density /(х)^(2п)~^тсте~'тх on [0,2ir], where / is continuous and f(0)=f(2ir). This is the analogue of the inversion formula (26.20). 26.28. Τ Show that (it-x) = -~- +4 У. 7-, 0<*<2тг. 3 «-ι m Show that Е^=11/ш2 = 7г2/6 and Г°т=1(- l)m + l/m2 = π2/\2. 26.29. (a) Suppose A" and X" are independent random variables with values in [0,2ir], and let X be X' + X" reduced module 2π. Show that the corresponding Fourier coefficients satisfy cm =c'mc"m. (b) Show that if one or the other of A" and X" is uniformly distributed, so is X.
SECTION 27. THE CENTRAL LIMIT THEOREM 357 26.30. 26.25 Τ The theory of Fourier series can be carried over from [0,2π] to the unit circle in the complex plane with normalized circular Lebesgue measure P. The circular functions eimx become the powers wm, and an integrable / is determined to within sets of measure 0 by its Fourier coefficients cm = Ιςϊω'η/(ω)Ρ(άω). Suppose that A is invariant under the rotation through the angle argc (Example 24.4). Find a relation on the Fourier coefficients of IA, and conclude that the rotation is ergodic if с is not a root of unity. Compare the proof on p. 316. SECTION 27. THE CENTRAL LIMIT THEOREM Identically Distributed Summands The central limit theorem says roughly that the sum of many independent random variables will be approximately normally distributed if each sum- mand has high probability of being small. Theorem 27.1, the Lindeberg-Levy theorem, will give an idea of the techniques and hypotheses needed for the more general results that follow. Throughout, N will denote a random variable with the standard normal distribution: (27.1) Ρ[Ν(ΞΑ] = -β=ί£-χ2'2άχ. V2t7 ja Theorem 27.1. Suppose that [Xn) is an independent sequence of random variables having the same distribution with mean с and finite positive variance σ2. IfS„ = Xi+··· +Xn, then (27.2) ^^jv. ayn By the argument in Example 25.7, (27.2) implies that л_15„=>с. The central limit theorem and the strong law of large numbers thus refine the weak law of large numbers in different directions. Since Theorem 27.1 is a special case of Theorem 27.2, no proof is really necessary. To understand the methods of this section, however, consider the special case in which Xk takes the values +1 with probability 1/2 each. Each Xk then has characteristic function φ(ί) = \e" + \e~a = cosi. By (26.12) and (26.13), Sn/Jn has characteristic function φη(χ/4n\ and so, by the continuity theorem, the problem is to show that cos" t/Jn -* E[ei,N] = e~' /2, or that rtlogcosi/vn^ (well denned for large n) goes to - \t2. But this follows by 1'HopitaPs rule: Let t/yfn =x go continuously to 0.
358 CONVERGENCE OF DISTRIBUTIONS For a proof closer in spirit to those that follow, note that (26.5) for η = 2 gives \φ(ί) - (1 - \t2)\ < |i|3 (\X\ < 1). Therefore, (27.3) 'Ш-(«-£) и3 - «3/2· Rather than take logarithms, use (27.5) below, which gives (n large) (27.4) φ fi 1~Ъг 2 η" Чз < -j= -»0. But of course (1 - t2/2n)" ->e~' /2, which completes the proof for this special case. Logarithms for complex arguments can be avoided by use of the following simple lemma. Lemma 1. Let z,,..., zm and w1,...,wm be complex numbers of modulus at most 1; then (27.5) Zl ■ ■ ■ Zm ~ W\ ■ ■ ■ W, ,1* ΣΙ Zk-Wk\- Proof. This follows by induction from z, (z, - w,Xz2 ■■■ zj + w,(z2 ■■■ zm-w2---wj. Two illustrations of Theorem 27.1: Example 27.1. In the classical De Moivre-Laplace theorem, Xn takes the values 1 and 0 with probabilities ρ and q = 1 - p, so that с =p, and σ2 =pq. Here 5„ is the number of successes in η Bernoulli trials, and (Sn — np)/^npq =>N. Ш Example 27.2. Suppose that one wants to estimate the parameter α of an exponential distribution (20.10) on the basis of an independent sample Χι,..., Xn. As η ~* oo the sample mean Xn = n~lL"k=lXk converges in probability to the mean 1 /a of the distribution, and hence it is natural to use \/X„ to estimate a itself. How good is the estimate? The variance of the exponential distribution being 1/a2 (Example 21.3), ar/n(Xn - 1/a) =>/V by the Lindeberg-Levy theorem. Thus Xn is approximately normally distributed with mean 1/a and standard deviation l/ajn . By Skorohod's Jbeorem 25.6 there exist on a single probability space random variables Y„_and Υ having the respective distributions of Xn and N and satisfying α^/η(Υη(ω) - 1/a) -*_Υ(ω) foreach ω. Now Υη(ω)-> 1/a and a'l)fn~(Yn(w)-1 -a) = aifn~(a-1 - Ϋη(ω))/αΥ„(ω)-* -Υ(ω). Since -Υ has
SECTION 27. THE CENTRAL LIMIT THEOREM 359 the distribution of N and Yn has the distribution of Xn, it follows that a U„ ' thus \/X„ is approximately normally distributed with mean a and standard deviation a/4n. In effect, 1/Xn has been studied through the local linear approximation to the function 1/x. This is called the delta method. Ш The Lindeberg and Lyapounov Theorems Suppose that for each η (27.6) X„i>~;XnrK are independent; the probability space for the sequence may change with n. Such a collection is called a triangular array of random variables. Put Sn =■ Xnl + ■ ■ ■ +Xnr. Theorem 27.1 covers the special case in which rn = η and Xnk =Xk. Example 6.3 on the number of cycles in a random permutation shows that the idea of triangular array is natural and useful. The central limit theorem for triangular arrays will be applied in Example 27.3 to the same array. To establish the asymptotic normality of Sn by means of the ideas in the preceding proof requires expanding the characteristic function of each Xnk to second-order terms and estimating the remainder. Suppose that the means are 0 and the variances are finite; write r„ (27.7) E[Xnk]=0, an2k = E[Xtk], s2n = £ о-Д. The assumption that Xnk has mean 0 entails no loss of generality. Assume si > 0 for large n. A successful remainder estimate is possible under the assumption of the Lindeberg condition: r" 1 (27.8) lim Σ ~[ *Л"^=0 for € > 0. Theorem 27.2. Suppose that for each η the sequence Xni,...,Xnr is independent and satisfies (27.7). // (27.8) holds for all positive e, then Sn/sn=>N.
360 CONVERGENCE OF DISTRIBUTIONS This theorem contains the preceding one: Suppose that Xnk = Xk and rn = n, where the entire sequence [Xk] is independent and the Xk all have the same distribution with mean 0 and variance σ2. Then (27.8) reduces to (27.9) \\xa\i X?dP = 0, which holds because [ΙΛ^Ι > eajn]l 0 as η Τ00· Proof of the Theorem. Replacing Xnk by Xnk/sn shows that there is no loss of generality in assuming rn (27.10) s2n= Ε<τΛ = 1· By (26.42), \e"x - (] + itx - ^t2x2)\ <mm[\tx\2,\tx\3}. Therefore, the characteristic function ipnk of Xnk satisfies (27.11) \<pnk(t) - (1 - ^2a2k)\<E[mm{\tXnk\2,\tXj3}]. Note that the expected value is finite. For positive e the right side of (27.11) is at most f \tXnk\3dP+\ \tXnk\2dP<e\t\3a2k + t2f X2kdP. Since the a2k add to 1 and e is arbitrary, it follows by the Lindeberg condition that (27-12) Ek(0-0-W)h0 k = l for each fixed t. The objective now is to show that (27.13) ГЬ„*(0= Π(1-ϊ«42*)+*(1) k=l k=l rn = rie-'V"'/2 + o(l)=e-'2/2 + o(l).
SECTION 27. THE CENTRAL LIMIT THEOREM 361 For € positive, σ„2,<62 + / XlkdP, and so it follows by the Lindeberg condition (recall that sn is now 1) that (27.14) max σ„\-* 0. \<,к<,гп For large enough n, 1 - jt2a?k are all between 0 and 1, and by (27.5), T\rk"=l<pnk(t) and Π£"=ιΟ - jt2a*k) differ by at most the sum in (27.12). This establishes the first of the asymptotic relations in (27.13). Now (27.5) also implies that Пг'2^-П(1-А24) k^l k=l < Ek'42'/2-i + l'42J k = l For complex z, (27.15) k*-l-z|<|z|2£ 1^-<|2|2еи. k = 2 Using this in the right member of the preceding inequality bounds it by ίν2Σ#_,σ„4*; by (27.14) and (27.10), this sum goes to 0, from which the second equality in (27.13) follows. ■ It is shown in the next section (Example 28.4) that if the independent array {Xnk} satisfies (27.7), and if Sn/sn =» N, then the Lindeberg condition holds, provided maxA ur c2k/s2 -» 0. But this converse fails without the extra condition: Take Xnk = Xk normal with mean 0 and variance a^k = ak, where σ,2 = 1 and σ„2 = rcs2_ j. Example 27.3. Goncharov's theorem. Consider the sum Sn = ΣΊ=ιΧη/ζ in Example 6.3. Here 5„ is the number of cycles in a random permutation on η letters, the Xnk are independent, and Ρ[Χ**-ΐ]= η-\ + 1 =i-P[*nk = o]. The mean mn is Ln = YTk=xk~l, and the variance i2 is Ln + 0(1). Lindeberg's condition for Xnk - (n - к + I)-1 is easily verified because these random variables are bounded by 1. The theorem gives (5„ - Ln)/sn =» N. Now, in fact, Ln = log η + 0(1), and so (see Example 25.8) the sum can be renormalized: (5„ - log n)/ /log η =» N.
362 CONVERGENCE OF DISTRIBUTIONS Suppose that the \Xnk\ + are integrable for some positive δ and that Lyapounov's condition (27.16) lim i-hsEl\Xj2+s]=0 holds. Then Lindeberg's condition follows because the sum in (27.8) is bounded by k = \ Sn J\X„k\*es„ fi» €. k^lSn Hence Theorem 27.2 has this corollary: Theorem 27.3. Suppose that for each η the sequence Xnl,...,Xnr is independent and satisfies (27.7). // (27.16) holds for some positive δ, then Sn/sn=>N. Example 27.4. Suppose that Xi,X2,... are independent and uniformly bounded and have mean 0. If the variance s% of Sn = X{ + ·· · +Xn gees to oo, then S„/s„ =» JV: If К bounds the Xn, then which is Lyapounov's condition for 5 = 1. ■ Example 27.5. Elements are drawn from a population of size n, randomly and with replacement, until the number of distinct elements that have been sampled is rn, where 1 < rn < n. Let Sn be the drawing on which this first happens. A coupon collector requires Sn purchases to fill out a given portion of the complete set. Suppose that rn varies with η in such a way that rn/n -> p, 0 <p < 1. What is the approximate distribution of 5„? Let Yp be the trial on which success first occurs in a Bernoulli sequence with probability ρ for success: P[Yp = k] = qk~lp, where q = \— p. Since the moment generating function is pes/(l-qes), E[Yp]=p~l and Var[Yp] = qp~2. If k-l distinct items have thus far entered the sample, the waiting time until the next distinct one enters is distributed as Yp as ρ = (η - к + \)/n. Therefore, Sn can be represented as Lk"=iXnk for independent summands Xnk distributed as Y(n-k + i)/n- Since rn ~ pn, the mean and variance above give
SECTION 27. THE CENTRAL LIMIT THEOREM 363 and Lyapounov's theorem applies for 8 = 2, and to check (27.16) requires the inequality (27.17) Ε[(Υρ-ρ-*)Α\<Κρ-< for some К independent of p. A calculation with the moment generating function shows that the left side is in fact qp~4(l + lq +q2). It now follows that (*.-:=Ы4к!('-^) Μ1 ,lf η ι ;o (1 и dx -x)4' (27.18) £ £ Since (27.16) follows from this, Theorem 27.3 applies: (5„ — mn)/sn =» N. Dependent Variables* The assumption of independence in the preceding theorems can be relaxed in various ways. Here a central limit theorem will be proved for sequences in which random variables far apart from one another are nearly independent in a sense to be defined. For a sequence Xi,X2,... of random variables, let an be a number such that (27.19) \P(AnB)-P(A)P(B)\<an for Α εσ(ΑΊ,...,Χλ), Β <=a(Xk+n, Xk+n + ls...), and к > 1, η > 1. Suppose that ая -»0, the idea being that Xk and Xk+n are then approximately independent for large n. In this case the sequence {XJ is said to be α-mixing. If the distribution of the random vector (Xn,Xn + l,...,Xn +j) does not depend on n, the sequence is said to be stationary. Example 27.6. Let {YJ be a Markov chain with finite state space and positive transition probabilities pip and suppose that Xn =f(Y„), where / is some real function on the state space. If the initial probabilities /?, are the stationary ones (see Theorem 8.9), then clearly {Xn} is stationary. Moreover, by (8.42), \pf-Pl\<pn, where p<\. By (8.11), Ρ[Υ, =/„...,Yk =ik, Ук+п =Jo,-, Yk+n+i =//! =А,Лу2 · · · Pi^iMPfrr·-^*' which differs 'This topic may be omitted.
364 CONVERGENCE OF DISTRIBUTIONS from F[F, = i„..., Yk = ik]P[Yk+n = jQ,.. ·, Yk+n+i = Λ1 ЬУ at m°st Ρ/,Ρ/,ίϊ · · · Pu-^Pjoi, · · · ph-iir h foIIows ЬУ addition that> if s is the number of states, then for sets of the form A = [(Yl,...,Yk)^H] and B = [(У4+ ,yi+11+,)6W'l (27.19) holds with a„ = sp". These sets (for к and η fixed) form fields generating σ-fields which contain a(Xu...,Xk) and <r(Xk+n,Xk+n + „...), respectively. For fixed Л the set of β satisfying (27.19) is a monotone class, and similarly if A and В are interchanged. It follows by the monotone class theorem (Theorem 3.4) that [XJ is α-mixing with an = sp". ■ The sequence is m-dependent if (Xu...,Xk) and (Xk+n,---,Xk+n+i) are independent whenever n> m. In this case the sequence is α-mixing with an = 0 for η > т. In this terminology an independent sequence is 0-depen- dent. Example 27.7. Let YX,Y2,... be independent and identically distributed, and put X4=fiY„,...,Y„+m) for a real function / on Rm + l. Then {Xn} is stationary and m-dependent. ■ Theorem 27.4. Suppose that Χλ, X2,··· is stationary and α-mixing with αη = 0(η~5) and that £[*„] = 0 and E[Xf]<oo. If Sn=Xl+ · ··+X„, then (27.20) n"1 Var[5„] -+σ2 = E{xf] + 2 Σ E[XtXl+k], where the series converges absolutely. If σ> 0, then Sn/ajn =* N. The conditions an = 0(n~5) and E[X*2] < c» are stronger than necessary; they are imposed to avoid technical complications in the proof. The idea of the proof, which goes back to Markov, is this: Split the sum X{+ · · · +Xn into alternate blocks of length bn (the big blocks) and /„ (the little blocks). Namely, let (27.21) Uni=X(i_l)(bn+ln)+1+ ■■■ +Ха-1хьп+1п)+ьп> \^l^rn> where rn is the largest integer i for which (i - 1)(6Я + /„) + bn<n. Further, let Ki=X(i-lXbn+ln) + bn+l+ '"' +ХцЬп+1п)> 1^'<ГЛ. (27 22^ Then Sn = Lri'LiUni+Z'i"=iVnj, and the technique will be to choose the /„ small enough that Y.tVni is small in comparison with Σ;ί/„; but large enough
SECTION 27. THE CENTRAL LIMIT THEOREM 365 that the Uni are nearly independent, so that Lyapounov's theorem can be adapted to prove Σ,·ί/η/ asymptotically normal. Lemma 2. // Υ is measurable σ(Χχ,..., Xk) and bounded by C, and if Ζ is measurable σ(ΧΙί +n, Xk+n +,,...) and bounded by D, then (27.23) \E[YZ]-E[Y]E[Z]\<4CDan. Proof. It is no restriction to take С = D = 1 and (by the usual approximation method) to take Y= Σ^/,,, and Ζ = Σ;ζ;/β simple (|y,|,|z;| < 1). If du = P(A, η Bj) - P(At)P(Bj), the 'left side of (27.23) is ΙΣ,,,ν,ζ,ίί,,! Take ξ, to be + 1 or - 1 as LjZjd^ is positive or not; new take η; to be + 1 or -1 as Σ,£,ί/,; is positive or not. Then Y,ytzidu < Σ Σ * A = Σ£·ΣζΑ ',; ^ Σ Σ&4υ = Σν;Σ^άί}= Σίί^/ο- '■; Let Л(0) [β(0)] be the union of the Л, [β;] for which ξ,, = +1 fy. = + 1]. and let Л(1) = Ω - Л(0> [В<" = Ω - 5(0)]. Then Σί,-'7/ιν< ΣΗ^Πβ*0)-^00)^0)!^"»· ■ ', 1 Ч-,Ι- Lemma 3. If Υ is measurable σ(Χχ,... > Xk) and E[F4] < С, and if Ζ is measurable a(Xk+n,Xk+n + x,...) and E[Z4] < D, then (27.24) \E[YZ] - E[Y]E[Z]\ < 8(1 + С + D)a1/2. Proof. Let Y0- Yl^f^a], /x — YI^Yl>a], Z0- ZI^z^aj, Zx-ZI^>ay By Lemma 2, \E[Y0Z0] - E[Y0]E[Z0]\ < 4a2an. Further, \E[Y0ZX]-E[Y0]E[ZX]\<E[\Y0-E[Y0]\-\ZX-E[ZX]\] < 2a ■ 2E[\Zl\] < 4βΕ[|Ζ,| · \Zx/a\3] < 4D/a2. Similary, \E[YXZ0] - E[YX]E[Z0]\ < AC /a2. Finally, \E[YXZX] - E[YX]E[ZX]\ ^Var'/^FjVar'/^Z,] <E1/2[Y2]E1/2[Z2} <E"2\Yt/a2\E"2\Z\/a2\<Cx/2D'/2/a2. Adding these inequalities gives 4a2an + 4(C + D)a~2 + C1/2D1/2a~2 as a
366 CONVERGENCE OF DISTRIBUTIONS bound for the left side of (27.24). Take a = a ~1/4 and observe that 4 + 4(C + D) + C'/2£)i/2 < 4 + 4(CI/2 + D1/2)2 < 4 + 8(C + D). Ш Proof of Theorem 27.4. By Lemma 3, \E[XxX1+n]\ < 8(1 + 2E[Xl4])a1/2 = 0(n'5/2), and so the series in (27.20) converges absolutely. If pk = E[XlXl+k], then by stationary £[S2] = np0 + 2L"kZ\(n - k)pk and therefore \σ2 - n-lE[S2]\ < 21Гк=п\Рк\ + 2п-^:11Гк=/\Рк\; hence (27.20). By stationarity again, ^[5B4]^4!/iEH[^,^,+l-^,+i+y^.-i+>+*]|, where the indices in the sum are constrained by /', j,k>0 and i+j + к <n. By Lemma 3 the summand is at most 8(1 +E[xt] +E[X\¥iXUi+iXUl+i+k\)«\/1, which is at most1 8(1 + E[Xt] + E[Xl2])aV2 = Kxa1/2. Similarly, Kxa\/2 is a bound. Hence £[S„4] <4!«2 Σ Кхттп{аУ2,а)/2} I,fc2;0 i + k<n <K2n2 Σ «V2 = Kin2i{k+\W2. 0<i<k 4 = 0 Since ak = 0(k~5), the series here converges, and therefore (27.25) E[St]<Kn2 for some К independent of n. Let bn = [n3/4\ and ln = IV/4J. If rn is the largest integer i such that (i- \Xbn + ln) + bn<n, then (27.26) К~пъ/\ /„~nI/4, г„~л1/4. Consider the random variables (27.21) and (27.22). By (27.25), (27.26), and *E\XYZ\ < Ε'^ΙΛΊ3 · E2/3|KZ|3/2 < Е'/3|л-|3 · E"W · Ε'/3|Ζ|3.
SECTION 27. THE CENTRAL LIMIT THEOREM 367 stationarity, r„-l -γ'ςκ, ** (27.25) and (27.26) also give * Σρ K\* ea ^ ^r.Kll к - eW " " eVV/4 0; σ4η \>e К e*aW2 0. Therefore, Σ\η=ινηι/σ\[η => 0, and by Theorem 25.4 it suffices to prove that EiLlL/,./^=*JV. Let IT,, 1 < i < rn, be independent random variables having the distribution common to the Unj. By Lemma 2 extended inductively the characteristic functions of ΣΊη=ιυηί/σ^η and of Σ-^ ,£/„',/σ\/ίΓ differ by at most1 16/-„az. Since an = 0(n~5), this difference is Oin"1) by (27.26). The characteristic function of Υ.\η=ι\]ηί/σ4η will thus approach e~'1/2 if that of YJ/LiU^/ayfn does. It therefore remains only to show that Zi"-iUni/<rvn=>N. Now £[|L/',.|2] = E[Un\] ~ bna2 by (27.20). Further, E[\UJ4] <Kb2 by (27.25). Lyapounov's condition (27.16) for δ = 2 therefore follows because '■„ψ;:.!4! e[wj\ _ K МкыЧ) r.b2a4 г„ал 0. Example 27.8. Let {YJ be the stationary Markov process of Example 27.6. Let / be a function on the state space, put m = Σ,.ρ,/0'), and define X„=f(Yn)-m. Then {Xn) satisfies the conditions of Theorem 27.4. If fiij = SuPi-piPj + 2piL1 = 1(p<ip-p/), then the σ2 in (27.20) is Σ,7/3,ν(/(0 - mXf(j)-m), and L"k = lf(Yk) is approximately normally distributed with mean nm and standard deviation σ4η . If /(0 = 5:,, then Σ£„ι/(^) is the number of passages through the state i0 in the first η steps of the process. In this case m : D: and σ 2 _ Example 27.9. If the A^ are stationary and w-dependent and have mean 0, Theorem 27.4 applies and σ2 = E[X2] + 2Σ™_,£[*,*,+J. Example 27.7 is a case in point. Taking m = 1 and /(x, y) = χ — у in that example gives an instance where σ2 = 0. ■ + The 4 in (27.23) has become 16 to allow for splitting into real and imaginary parts.
368 CONVERGENCE OF DISTRIBUTIONS PROBLEMS 27.1. Prove Theorem 23.2 by means of characteristic functions. Hint: Use (27.5) to compare the characteristic function of Lkn=iZnk with exp[Lkpnk(e" - 1)]. 27.2. If [Xn] is independent and the Xn all have the same distribution with finite first moment, then n~lSn -*Е[ХХ\ with probability 1 (Theorem 22.1), so that n~xSn =>E[XX]. Prove the latter fact by characteristic functions. Hint: Use (27.5). 27.3. For a Poisson variable Υλ with mean λ, show that (УА - λ)/ \fk =» N as λ -> oo. Show that (22.3) fails for / = 1. 27.4. Suppose that \Xnk\< Mn with probability 1 and Mn/sn -* 0 Verify Lyapounov's condition and then Lindeberg's condition. 27.5. Suppose that the random variables in any single row of the triangular array are identically distributed. To what do Lindeberg's and Lyapounov's conditions reduce? 27.6. Suppose that Zl5 Z2,... are independent and identically distributed with mean 0 and variance 1, and suppose that Xnk=o-„kZk. Write down the Lindeberg condition and show that it holds if maxA s a„k = ο(Σ£"=. i<r„k)· 27.7. Construct an example where Lindeberg's condition holds but Lyapounov's does not. 27.8. 22.9 Τ Prove a central limit theorem for the number Rn of records up to time n. 27.9. 6.3 Τ Let Sn be the number of inversions in a random permutation on η letters. Prove a central limit theorem for Sn. 27.10. The 8-method. Suppose that Theorem 27.1 applies to [X„), so that Vn σ~ι(Χ„ - c)=>N, where Xn = n~xlLnk = \Xk. Use Theorem 25.6 as in Example 27.2 to show thatjjf fix) has a nonzero derivative at c, then /n(f(Xn) - f(c))/a\f'(c)\=>N: Xn is approximately normal with mean с and standard deviation σ/ijn, and f(Xn) is approximately normal with mean /(c) and standard deviation \f'(c)\a/' 4n . Example 27.2 is the case f(x)= l/x. 27.11. Suppose independent Xn have density |jc|-3 outside (-1,+1). Show that (n\ogn)~1/2Sn^N. 27.12. There can be asymptotic normality even if there are no moments at all. Construct a simple example. 27.13. Let dn(<o) be the dyadic digits of a point ω drawn at random from the unit interval. For a fc-tuple (ux,...,uk) of 0's and l's, let Nn(ux,...,uk; ω) be the number of m<n for which (dm(a)),...,d)n+k_l(<o)) = (uu...,uk). Prove a central limit theorem for Nn{ux,...,uk,w). (See Problem 6.12.)
SECTION 27. THE CENTRAL LIMIT THEOREM 369 27.14. The central limit theorem for a random number of summands. Let X{, X2,... be independent, identically distributed random variables with mean 0 and variance σ2, and let Sn =A"1 + · · · +Xn. For each positive t, let v, be a random variable assuming positive integers as values; it need not be independent of the Xn. Suppose that there exist positive constants a, and 0 such that a, as t -> со. Show by the following steps that S S (27.27) —%---=>N, p— =>N. σ-yV, ay θα, (a) Show that it may be assumed that 0 = 1 and the a, are integers. (b) Show that it suffices to prove the second relation in (27.27). (c) Show that it suffices to prove (S„ - S )/ fiT, => 0. (d) Show that P[|S„(-SJ*^]* Ρ[|ν,-β,| ;>£'«,] + P max |5A -Sa\>€-^ \k-a,\<,e\. and conclude from Kolmogorov's inequality that the last probability is at most lea2 27.15. 21.21 23.10 23.14? A central limit theorem in renewal theory. Let Xl,X2,... be independent, identically distributed positive random variables with mean m and variance σ2, and as in Problem 23.10 let N, be the maximum η for which Sn < t. Prove by the following steps that N,-tm~l '/2„,-з/2 at'^m (a) Show by the results in Problems 21.21 and 23.10 that (SNi -1)/ fi => 0. (b) Show that it suffices to prove that N,-SNm-1 _ -(SNi-mN,) at^m-3'2 vtV2m-"2 (c) Show (Problem 23.10) that N,/t=>m ', and apply the theorem in Problem 27.14.
370 CONVERGENCE OF DISTRIBUTIONS 27.16. Show by partial integration that (27.28) 1 1 Птт as χ -> οο. 27.17. Τ Suppose that XUX2,... are independent and identically distributed with mean 0 and variance 1, and suppose that an -> oo. Formally combine the central limit theorem and (27.28) to obtain (27.29) P[Sn>anJn~] ~-^^Le-^/2 = e-aJ(i+i„)/2j where ζη -* 0 if an -> oo. For a case in which this does hold, see Theorem 9.4. 27.18. 21.2 Τ Stirling's formula. Let Sn = A", + · · · +X„, where the Xn are independent and each has the Poisson distribution with parameter 1. Prove successively. (a) Ε *-<Λ vn )K- n-k\nk nn + (l/2)e-n (b) |^=^| -ΛΓ. ^ (с) Е (Sn-n ->E[N~] = (d) n!~y/2^n"+(l/2)e-n. 27.19. Let /η(ω) be the length of the run of O's starting at the nth place in the dyadic expansion of a point ω drawn at random from the unit interval; see Example 4.1. (a) Show that /j,/2,... is an α-mixing sequence, where an = 4/2". (b) Show that Y.nk = llk is approximately normally distributed with mean η and variance 6n. 27.20. Prove under the hypotheses of Theorem 27.4 that Sn/n -> 0 with probability 1. Hint: Use (27.25). 27.21. 26.1 26.29T Let A"„ X2,... be independent and identically distributed, and suppose that the distribution common to the Xn is supported by [0,2π] and is not a lattice distribution. Let Sn=Xi+ ··· +X„, where the sum is reduced modulo 2тг. Show that Sn => U, where U is uniformly distributed over [0,2π].
SECTION 28. INFINITELY DIVISIBLE DISTRIBUTIONS 371 SECTION 28. INFINITELY DIVISIBLE DISTRIBUTIONS* Suppose that ZA has the Poisson distribution with parameter Λ and that Xnl,...,Xnn are independent and P[Xnk = 1] = λ/η, P[Xnk = 0] = 1 - λ/η. According to Example 25.2, Xni + · ■ ■ +Xnn => ZA. This contrasts with the central limit theorem, in which the limit law is normal. What is the class of all possible limit laws for independent triangular arrays? A suitably restricted form of this question will be answered here. Vague Convergence The theory requires two preliminary facts about convergence of measures Let μη and μ be finite measures on (Rl,&1). If μ„(α, b] ->μ(α.ί>] for every finite interval for which μ{α) = μ{ύ) = 0, then μη converges vaguely to μ, written μη-*ι μ.Π μ„ and μ are probability measures, it is not hard to see that this is equivalent to weak convergence- μη =>μ. On the other hand, if μη is a unit mass at η and μ(/?') = 0, then μη ->, μ, but μη -=> μ makes no sense, because μ is not a probability measure. The first fact needed is this: Suppose that μη ->, μ and (28.1) SUPM„(/?1)<oo; Л then (28.2) ]>μ„ - //^μ for every continuous real f that vanishes at + oo щ the sense that lim^, _(=о/(д:) = 0. Indeed, choose Μ so that μ(/?')<Μ and μπ(/?')<Μ for all n. Given e, choose a and b so that μ{α) = μ'φ) = 0 and |/(*)l < e/M if x£A= (a, b]. Then l/^/d/xJ < e and \{ΑΙ/άμ\<ε. If μ(/1)>0, define ι/(β) = μ(#ηΛ)/μ(/0 and νη(Β) = μη(Βη Α)/μη(Α). It is easy to see that u„ =>v, so that jfdvn -> //di/. But then \\Α$άμη- ΙΑ/άμ\ < e for large n, and hence |//^μ„ - //^μΙ < 3e for large п. If μ(Λ)^= 0, then //4/^μ„ -> 0, and the argument is even simpler. The other fact needed below is this: // (28.1) holds, then there is a subsequence {μ„ } and a finite measure μ such that μη ->t μ as к -> oo. Indeed, let /г„(д:) = μ„(— 0°, дс]. Since the /*"„ are uniformly bounded because of (28.1), the proof of Helly's theorem shows there exists a subsequence [Fn } and a bounded, nondecreasing, right-continuous function F such that \\mk Fn (x) = ^(jc) at continuity points χ of F. If μ is the measure for which μ(α,6] = F(b)-F(a) (Theorem 12.4), then clearly The Possible Limits Let Xnl,...,Xnr, η = 1,2,..., be a triangular array as in the preceding section. The random variables in each row are independent, the means are 0, *This section may be omitted.
372 CONVERGENCE OF DISTRIBUTIONS and the variances are finite: (28.3) E[Xnk] = 0, ση\ = Ε[Χ^\, s2n=fan\. Assume s2 > 0 and put Sn = Xnl + · · · +Xnr. Here it will be assumed that the total variance is bounded: (28.4) sups„2<°o. η In order that the Xnk be small compared with S„, assume that (28.5) limmax^ = 0. η k<rn The arrays in the preceding section were normalized by replacing Xnk by Xnk/Sn- This has the effect of replacing sn by 1, in which case of course (28.4) holds, and (28.5) is the same thing as maxA σ^/s2 -» 0. A distribution function F is infinitely divisible if for each η there is a distribution function Fn such that F is the л-fold convolution F„ * · · · * F„ (n copies) of Fn. The class of possible limit laws will turn out to consist of the infinitely divisible distributions with mean 0 and finite variance.1 It will be possible to exhibit the characteristic functions of these laws in an explicit way. Theorem 28.1. Suppose that (28.6) <p(t) = exp/ (ei,x - 1 - ίίχ)^μ(άχ), where μ is a finite measure. Then φ is the characteristic function of an infinitely divisible distribution with mean 0 and variance /x(i?')· By (26.42), the integrand in (28.6) converges to -t2/2 as χ -» 0; take this as its value at χ = 0. By (26.4,), the integrand is at most t2/2 in modulus and so is integrable. The formula (28.6) is the canonical representation of φ, and μ is the canonical measure. Before proceeding to the proof, consider three examples. Example 28.1. If μ consists of a mass of σ2 at the origin, (28.6) is £-σ ι /2^ ^е characteristic function of a centered normal distribution F. It is certainly infinitely divisible—take Fn normal with variance σ2/η. Ш fThere do exist infinitely divisible distributions without moments (see Problems 28.3 and 28.4), but they do not figure in the theory of this section
SECTION 28. INFINITELY DIVISIBLE DISTRIBUTIONS 373 Example 28.2. Suppose that μ consists of a mass of λχ2 at χ Φ 0. Then (28.6) is exp A(e"x - 1 - /fit); but this is the characteristic function of x(ZA - Λ), where ZA has the Poisson distribution with mean Λ. Thus (28.6) is the characteristic function of a distribution function F, and F is infinitely divisible—take Fn to be the distribution function of χ(Ζλ/η - λ/η). Ш Example 283. If φ jit) is given by (28.6) with /xy for the measure, and if μ = Σ*=1μι, then (28.6) is <p,(?)···<pДО- It follows by the preceding two examples that (28.6) is a characteristic function if μ consists of finitely many point masses. It is easy to check in the preceding two examples that the distribution corresponding to φ(ί) has mean 0 and variance μiRl), and since the means and variances add, the same must be true in the present example. Proof of Theorem 28.1. Let μΙζ have mass д(/2_А,(/ + l)2_Ar] at j2~k for / = 0, ±1,..., ± 22k. Then μκ -», μ. As observed in Example 28.3, if <pk(t) is (28.6) with μίί in place of μ, then <pk is a characteristic function. For each t the integrand in (28.6) vanishes at ±oo; since sup^ μ^-iR1) < oo, ψ^ί) -»φ(ί) follows (see (28.2)). By Corollary 2 to Theorem 26.3, q>it) is itself a characteristic function. Further, the distribution corresponding to q>kit) has second moment /хДЛ1), and since this is bounded, it follows (Theorem 25.11) that the distribution corresponding to <p(r) has a finite second moment. Differentiation (use Theorem 16.8) shows that the mean is <p'(0) = 0 and the variance is -q>"iO) = μϋί1). Thus (28.6) is always the characteristic function of a distribution with mean 0 and variance μiRl). If i/i„(f) is (28.6) with μ/η in place of μ, then <p(r) = ψ^ί), so that the distribution corresponding to (pit) is indeed infinitely divisible. ■ The representation (28.6) shows that the normal and Poisson distributions are special cases in a very large class of infinitely divisible laws. Theorem 28.2. Every infinitely divisible distribution with mean 0 and finite variance is the limit law of Sn for some independent triangular array satisfying (28.3), (28.4), and (28.5). ■ The proof requires this preliminary result: Lemma. // X and Υ are independent and X+Y has a second moment, then X and Υ have second moments as well. Proof. Since X2 + Y2 < iX + Y)2 + 2|AY|, it suffices to prove \XY\ integrable, and by Fubini's theorem applied to the joint distribution of X and Υ it suffices to prove |ΛΊ and \Y\ individually integrable. Since |У|<|х| + |x + F|, £[|yQ = °° would imply E[\x + Yft = a° for each x; by Fubini's
374 CONVERGENCE OF DISTRIBUTIONS theorem again E[\Y\] = o° would therefore imply E[\X + F|] = <», which is impossible. Hence E[\Y\1 < o°, and similarly E[\X\] < o°. ■ Proof of Theorem 28.2. Let F be infinitely divisible with mean 0 and variance σ2. If F is the л-fold convolution of Fn, then by the lemma (extended inductively) F„ has finite mean and variance, and these must be 0 and σ2/η. Take rn = n and take Xnl,..., Xnn independent, each with distribution function Fn. ■ Theorem 28.3. // F is the limit law of Sn for an independent triangular array satisfying (28.3), (28.4), and (28.5), then F has characteristic function of the form (28.6) for some finite measure μ. Proof. The proof will yield information making it possible to identify the limit. Let φη/ζ(ί) be the characteristic function of Xnk. The first step is to prove that (28.7) ГЬп*(')-ехр Σ (*>„*(')-l)-0 for each t. Since \z\ < 1 implies that \ez~l\ = eRez~l < 1, it follows by (27.5) that the difference 8n(t) in (28.7) satisfies |δ„(01 < Σ^,Ι^ίί) - exp(<pnA.(0 - 1)|. Fix t. If <pnk(t) - 1 = впк, then \впк\ < t\2k/2, and it follows by (28.4) and (28.5) that maxj0„J -* 0 and Zk\6nk\ = 0(1). Therefore, for sufficiently large η, |δ„(ί)Ι < EJl + θηΙζ - ев"к\ < e2Zk\enk\2 <e2 тах.к\впк\ · Zk\enk\ by (27.15). Hence (28.7). If Fnk is the distribution function of Xnk, then E(*W(0-1)= Σ/ {eux - I) dFnk{x) fc=l k=l R = Σ f (ei,x-l-itx)dFnk(x). fc-r «' Let μη be the finite measure satisfying (28.8) /*„(-<»,*]= Σ / y2dFnk{y), k=l У^х and put (28.9) φη(ί) = exp/ (β'» - 1 - ία)^μ„(Λ).
SECTION 28. INFINITELY DIVISIBLE DISTRIBUTIONS 375 Then (28.7) can be written (28.10) Π *>„*(')-*>„(')-0. By (28.8), /χ„(Λ') = ί^, and this is bounded by assumption. Thus (28.1) holds, and some subsequence {μη } converges vaguely to a finite measure μ. Since the integrand in (28.9) vanishes at +o°, ψη (t) converges to (28.6). But, of course, lim„ <p„(0 must coincide with the characteristic function of the limit law F, which exists by hypothesis. Thus F must have characteristic function of the form (28.6). ■ Theorems 28.1, 28.2, and 28.3 together show that the possible limit laws are exactly the infinitely divisible distributions with mean 0 and finite variance, and they give explicitly the form the characteristic functions of such laws must have. Characterizing the Limit Theorem 28.4. Suppose that F has characteristic function (28.6) and that an independent triangular array satisfies (28.3), (28.4), and (28.5). Then Sn has limit law F if and only if μη -*c μ, where μη is defined by (28.8). Proof. Since (28.7) holds as before, Sn has limit law F if and only if φη(ί) (defined by (28.9)) converges for each t to φ(ί) (defined by (28.6)). If μη -*υ μ, then cpn(t) ~* φ(ί) follows because the integrand in (28.9) and (28.6) vanishes at ±°° and because (28.1) follows from (28.4). Now suppose that <pn(t) -» φ(ί). Since //.„(Λ1) = si is bounded, each subsequence {μη } contains a further subsequence [μη .} converging vaguely to some v. If it can be shown that ν necessarily coincides with μ, it wil! follow by the usual argument that μη -»t μ. But by the definition (28.9) of <p„(0, it follows that φ(ί) must coincide with ψ(ί) = exp fRi(e"x - 1 - itx)x~2v(dx). Now φ'(ί) = 1>(0/Я'(е'" - 1)χ~ιμ(άχ), and similarly for φ'(ί). Hence <p(r) = (КО implies that fRl(ei,x - 1)χ~ιν(άχ) = }Rl(eiu - 1)χ~ιμ(.άχ). A further differentiation gives Ικ·.ε"χμ(άχ) = JRie"xv(dx). This implies that μ,(Λ') = viR1), and so μ = ν by the uniqueness theorem for characteristic functions. Example 28.4. The normal case. According to the theorem, 5„ =» N if and only if μη converges vaguely to a unit mass at 0. If s% = 1, this holds if and only if Lrk"= ι{Μζεχ2 dFnk(x) -» 0, which is exactly Lindeberg's condition. ■ Example 28.5. The Poisson case. Let Z„,,...,Z„r be an independent triangular array, and suppose Xnk = Znk - mnk satisfies the conditions of the
376 CONVERGENCE OF DISTRIBUTIONS theorem, where mnk = E[Znk]. If ZA has the Poisson distribution with parameter Λ, then Lk Xnk =» ZA - Λ if and only if μη converges vaguely to a mass of Λ at 1 (see Example 28.2). If s%-*\, the requirement is μη[1 -e, 1 +e]-»A, or (28.11) Σί (Znk-mnk)2dP-+0 к \2„к-тпк-\\>€ for positive e. If si and Lkmnk both converge to A, (28.11) is a necessary and sufficient condition for LkZnk =» ZA. The conditions are easily checked under the hypotheses of Theorem 23.2: Znk assumes the values 1 and 0 with probabilities pnk and 1 - pnk, Lkpnk ~* A, and maxk pnk -» 0. ■ PROBLEMS 28.1. Show that μη ->( μ, implies μ(/?') < liminfn/in(/?'). Thus in vague convergence mass can "escape to infinity" but mass cannot "enter from infinity." 28.2. (a) Show that μη ->( μ if and only if (28.2) holds for every continuous / with bounded support. (b) Show that if μη ->( μ but (28.1) does not hold, then there is a continuous / vanishing at ±oo for which (28.2) does not hold. 28.3. 23.7Τ Suppose that N,YUY2,... are independent, the Yn have a common distribution function F, and N has the Poisson distribution with mean a. Then S = Yt+ · · · + YN has the compound Poisson distribution. (a) Show that the distribution of 5 is infinitely divisible. Note that S may not have a mean. (b) The distribution function of S is r;,=0e~aa"F"*{x)/n\, where F"* is the i-fold convolution of F (a unit jump at 0 for η = 0). The characteristic function of S is expa/°Ue"* - l)dF(x). (c) Show that, if F has mean 0 and finite variance, then the canonical measure μ in (28.6) is specified by μ(Α) =ajAx2dF(x). 28.4. (a) Let ν be a finite measure, and define (28.12) <p(f)-exp where the integrand is -t2/2 at the origin. Show that this is the characteristic function of an infinitely divisible distribution. (b) Show that the Cauchy distribution (see the table on p. 348) is the case where γ = 0 and ν has density 77-~'(l +x2)~' with respect to Lebesgue measure. 28.5. Show that the Cauchy, exponential, and gamma (see (20.47)) distributions are infinitely divisible.
SECTION 28. INFINITELY DIVISIBLE DISTRIBUTIONS 377 28.6. Find the canonical representation (28.6) of the exponential distribution with mean 1: (a) The characteristic function is \^e"xe~x dx = (1 - it)"1 = <p(t). (b) Show that (use the principal branch of the logarithm or else operate formally for the moment) d(logip(0)/df = i<p(t) = i\^eUxe~xdx. Integrate with respect to t to obtain (28.13) -4^ = ехр/°°(е<<*-1)£-^ Verify (28 13) after the fact by showing that the ratio of the two sides has derivative 0. (c) Multiply (28.13) by e~" to center the exponential distribution at its mean: The canonical measure μ has density xe~x over (0,°°). 28.7. Τ If X and У are independent and each has the exponential density e~x, then X-Y has the double exponential density \e~w (see the table on p. 348). Show that its characteristic function is —Ц-=ехр Г (e"*-l-itt)-^|jc|e-l"dr. 1+r ■'-«Л ' χ2 28.8. Τ Suppose Xlt X2,··· are independent and each has the double exponential density. Show that Y?n=mXXn/n converges with probability 1. Show that the distribution of the sum is infinitely divisible and that its canonical measure has density |jt|e-|Jr|/(l -<Г|ДГ|) = ТГп=1\х\е^пхК 28.9. 26.8T Show that for the gamma density е~хх"~1/Г(и) the canonical measure has density uxe~x over (0,oo). The remaining problems require the notion of a stable law. A distribution function F is stable if for each η there exist constants an and bn, an > 0, such that, if Xu. .,Xn are independent and have distribution function F, then α~ι(Χγ + · · · +Xn) + bn also has distribution function F. 28.10. Suppose that for all a, a', b, b' there exist a", b" (here a, a', a" are all positive) such that F(ax + b)*F(a'x+b') = F(a"x+b"). Show that F is stable. 28.11. Show that a stable law is infinitely divisible. 28.12. Show that the Poisson law, although infinitely divisible, is not stable. 28.13. Show that the normal and Cauchy laws are stable. 28.14. 28.10 Τ Suppose that F has mean 0 and variance 1 and that the dependence of a",b" on a,a',b,b' is such that teMa-'i* rl + vl Show that F is the standard normal distribution.
378 CONVERGENCE OF DISTRIBUTIONS 28.15. (a) Let Ynk be independent random variables having the Poisson distribution with mean cna/\k\ , where c>0 and 0<a<2. Let Z„ = n~lZf=_nikYnk (omit к = 0 in the sum), and show that if с is properly chosen then the characteristic function of Z„ converges to e-1'1". (b) Show for 0 < a < 2 that e-1'1" is the characteristic function of a symmetric stable distribution; it is called the symmetric stable law of exponent a. The case a = 2 is the normal law, and a = 1 is the Cauchy law. SECTION 29. LIMIT THEOREMS IN Rk If Fn and F are distribution functions on Rk, then Fn converges weakly to F, written F„=>F, if lim„ Fn(x) = F(x) for all continuity points χ of F. The corresponding distributions μη and μ are in this case also said to converge weakly: μη => μ. If Xn and X are /c-dimensional random vectors (possibly on different probability spaces), Xn converges in distribution to X, written Xn => X, if the corresponding distribution functions converge weakly. The definitions are thus exactly as for the line. The Basic Theorems The closure A~ of a set in Rk is the set of limits of sequences in A; the interior is A° = Rk - (Rk -АУ; and the boundary is ЗА =А~- A". A Borel set Л is a μ-continuity set if μ(βΑ) = 0. The first theorem is the /c-dimensional version of Theorem 25.8. Theorem 29.1. For probability measures μη and μ on (Rk,&k), each of the following conditions is equivalent to the weak convergence of μη to μ: (i) lim„ \$άμη = //d/x for bounded continuous f; (ii) limsup„ /x„(C) < μ(0 for closed C; (iii) lim inf„ /x„(G) > /x(G) for open G; (iv) lim„ μη(Α) = μ(Α) for μ-continuity sets A. Proof. It will first be shown that (i) through (iv) are all equivalent, (i) implies (ii): Consider the distance dist(x,C) = inf[|x - y\: у e C] from χ to C. It is continuous in x. Let (1 ifr<0, v.^t) = h-jt if0<i<y_1, (θ ify-'<i.
SECTION 29. LIMIT THEOREMS IN Rk 379 Then fj(x) = ^(distCxjC)) is continuous and bounded by 1, and fj(x)i /с(*) as j T00 because С is closed. If (i) holds, then limsup„ μη(0 < lim„ //y άμη = If; άμ. As j T°°, //; d/x i Pc dV- = β(0- (ii) is equivalent to (iii). Take С = Rk - G. (ii) α«ίί (iii) imply (iv): From (ii) and (iii) follows μ( A°) < lim infμη{ A°) < lim infμη{ A) η η < Hmsup/хДЛ) < limsup/xn(/l_) <μ(Α~). Clearly (iv) follows from this. (iv) implies (i): Suppose that / is continuous and |/(дг)| is bounded by K. Given e, choose reals aQ<ai< ··· <a, so that aQ< -K <K <a,, ar;- a-,·.., <e, and μ[χ: f(x) = a;] = 0. The last condition can be achieved because the sets [x: f(x) = a] are disjoint for different a. Put Л, = [х: a!_l<f(x)<ai]. Since / is continuous, A~<z[x: ai_l <f(x) <at] and A°, э[х: αί_ι <f(x) < α J. Therefore, &4; c[x: /(χ) = «._,] υ [χ: f(x) = α, J, and therefore μ(8Α) = 0. Now \{/άμη - Σ^α^μ,,ίΛ,·)! < e and similarly for /x, and L'i=la^n(A;)-* Σ'^α,-μίΛ,·) because of (iv). Since e was arbitrary, (i) follows. It remains to prove these four conditions equivalent to weak convergence. (iv) implies μη => μ: Consider the corresponding distribution functions. If Sx = [y- У;^хь '-I,-··»£], then F is continuous at χ if and only if μ(95χ) = 0; see the argument following (20.18). Therefore, if F is continuous at x, Fn(x) = μ„(Ξχ) -»/x(S^) = F(x), and F„ =* F. μη => μ implies (iii): Since only countably many parallel hyperplanes can have positive μ,-measure, there is a dense set D of reals such that μ[χ: xt: = d] = 0 for d^D and / = !,...,&. Let .of be the class of rectangles A = [x: at < X; < 6,·. i = 1,..., k] for which the a; and the b; all lie in D. All 2* vertices of such a rectangle are continuity points of F, and so Fn=>F implies (see (12.12)) that μη(Α) = bAF„ ~> bAF = μ(Α). It follows by the inclusion-exclusion formula that μη(Β)~* μ(Β) for finite unions В of elements of &/. Since D is dense on the line, an open set G in Rk is a countable union of sets Am in ssi. But μ(Όm£LMAm) = \imn μη(υ m£.MAm) < lim inf„ μ„(ΰ). Letting Μ -*°o gives (iii). ■ Theorem 29.2. Suppose that h: Rk -» R' is measurable and that the set Dh of its discontinuities is measurable.1 If μη => μ in Rk and /x(DA) = 0, then μ.„Λ_1 =>μ.Λ_1 in R'. tThe argument in the footnote on p. 334 shows that in fact Dh e 9ik always holds.
380 CONVERGENCE OF DISTRIBUTIONS Proof. Let С be a closed set in R'. The closure (h lC) in Rk satisfies (r'O'cD^ur'C If /*„=»/*, then part (ii) of Theorem 29.1 gives lim sup/x„/r'(C) < lim sup/x„((/T'C)~) </x((/r'C)~) <n{Dh)+n(h-lC). Using (ii) again gives μ„Λ ~' => μ,Λ ~' if μ.(£>Α) = 0. ■ Theorem 29.2 is the /c-dimensional version of the mapping theorem—Theorem 25.7. The two proofs just given provide in the case к = 1 a second approach to the theory of Section 25, which there was based on Skorohod's theorem (Theorem 25.6). Skorohod's theorem does extend to Rk, but the proof is harder.1 Theorems 29.1 and 29.2 can of course be stated in terms of random vectors. For example, Xn =*X if and only if P[X<eG] < liminf„ P[Xn e G] for all open sets G. A sequence {//.„} of probability measures on (Rk, &k) is tight if for every e there is a bounded rectangle A such that μη(Α) > 1-е for all n. Theorem 29.3. // {μπ} is a tight sequence of probability measures, there is a subsequence [μπ) and a probability measure μ such that μη. =» μ as i-* °°. Proof. Take Sx = [y: y;<Xj, ]<k] and F„(x) = p„(Sx). The proof of Helly's theorem (Theorem 25.9) carries over: For points χ and у in Rk, interpret x<y as meaning xu<yu, u = l,...,k, and χ <y as meaning хи<Уи-> u = l,...,k. Consider rational points r—points whose coordinates are all rational—and by the diagonal method [A14] choose a sequence {л;} along which lim; Fn(r) = G(r) exists for each such r. As before, define F(x) = inf[G(r): χ < r]. Although F is clearly nondecreasing in each variable, a further argument is required to prove Δ^> 0 (see (12.12)). Given e and a rectangle A =(a1,b1]x ·■■ X(ak,bk], choose a δ such that if ζ = (δ,...,δ), then for each of the 2fc vertices χ of A, x <r <x +z implies \F(x) - G(r)\ <e/2k. Now choose rational points r and s such that a <r<a+z and b <s<b+z. If В = (r,,s,]X ··■ X(rk,sk], then \AAF - ABG\<e. Since ABG^=lim;ABFn >0 and e is arbitrary, it follows that Δ^>0. With the present interpretation of the symbols, the proof of Theorem 25.9 shows that F is continuous from above and lim,- Fn(x) = F(x) for continuity points χ of F. fThe approach of this section carries over to general metric spaces; for this theory and its applications, see BillingsleY] and Billingsley2. Since Skorohod's theorem is no easier in Rk than in the general metric space, it is not treated here.
SECTION 29. LIMIT THEOREMS IN Rk 381 By Theorem 12.5, there is a measure μ on (Rk, &k) such that μ(Α) = A^F for rectangles Л. By tightness, there is for given eat such that /x„[y: -i <yy < ί, / < к] > 1 - e for all л. Suppose that all coordinates of χ exceed t: If r>x, then Fn(r)> 1-е and hence (r rational) G(r)> 1 -e, so that F(x) > 1 - e. Suppose, on the other hand, that some coordinate of χ is less than — t: Choose a rational r such that x<r and some coordinate of r is less than -1; then Fn(r) < e, hence G(r) < e, and so F(x) < e. Therefore, for every € there is a t such that > 1 - e if Xj>t for all j, <e if jc ■ < -1 for some ;'. If Bs = [y: -i<yy<xy, /<A:], then μ(Ξχ) = lim,/χ(β,) = lim, ABF. Of the 2* teims in the sum ABF, all but F(x) go to 0 (s -»<») because of the second part of (29.1). Thus /x(5^) = Fix).1 Because of the other part of (29.1), μ is a probability measure. Therefore, Fn => F and μη =* μ. ■ Obviously Theorem 29.3 implies that tightness is a sufficient condition that each subsequence of {μη} contain a further subsequence converging weakly to some probability measure. (An easy modification of the proof of Theorem 25.10 shows that tightness is necessary for this as well.) And clearly the corollary to Theorem 25.10 now goes through: Corollary. // [μη] is α tight sequence of probability measures, and if each subsequence that converges weakly at all converges weakly to the probability measure μ, then μπ =*μ.. Characteristic Functions Consider a random vector X = (Xv..., Xk) and its distribution μ in Rk. Let t -x = Lk = 1tuxu denote inner product. The characteristic function of X and of μ is defined over Rk by (29.2) <p(t) = f e" χμ(άχ)=Ε[ε!ι χ]. jRk To a great extent its properties parallel those of the one-dimensional characteristic function and can be deduced by parallel arguments. +This requires proof because there exist (Problem 12.10) functions F' other than F for which /jl(A) = A^F' holds for all rectangles A. (29.1) F(x)
382 CONVERGENCE OF DISTRIBUTIONS The inversion formula (26.16) takes this form: For a bounded rectangle A = [x: au <xu <bu, u<k] such that μ(3Α) = 0, (29.3) 1 Г-»» (2Т7) JBTu=\ к е-""а Н"|| _ £ "ll'll it,. 4>(t)dt, where BT= [t e Rk: \tu\ <T, и <k] and dt is short for dtx · · ■ rfik. To prove it, replace <p(r) by the middle term in (29.2) and reverse the integrals as in (26.17): The integral in (29.3) is h = (2π)' l Ш k e-""a U"U — β "lAl it,. dt μ(άχ). The inner integral may be evaluated by Fubini's theorem in Rk, which gives k rsgn(xu-au) I. JRku=\ sgn(xu-bu) S(T-\xu-au\) 77 S(T-\xu-bJ) μ(άχ). Since the integrand converges to Пи=1Фа1гЬ(хи) (see (26.18)), (29.3) follows as in the case к = 1. The proof that weak convergence implies (iii) in Theorem 29.1 shows that for probability measures μ and ν on Rk there exists a dense set D of reals such that μ(3Α) = v(dA) = 0 for all rectangles A whose vertices have coordinates in D. If μ(Α) = v(A) for such rectangles, then μ and ν are identical by Theorem 3.3. Thus the characteristic function φ uniquely determines the probability measure μ Further properties of the characteristic function can be derived from the one-dimensional case by means of the following device of Cramer and Wold. For t<ERk: define A,: Rk ~> Rl by h,(x) = fx. For real a, [x: t χ < a] is a half space, and its μ -measure is (29.4) μ[χ:ίχ<α]=μ^\-™,α]. By change of variable, the characteristic function of μ.Α,_1 is (29.5) [ el"nhj\dy) = ( eisU χ)μ(άχ) JRl }Rk = <ρ(ίί1,..·,ίίΑ.), S<ERl. To know the μ,-measure of every half space is (by (29.4)) to know each μ.Α~' and hence is (by (29.5) for s = 1) to know φ(ί) for every t; and to know the
SECTION 29. LIMIT THEOREMS IN Rk 383 characteristic function φ of μ is to know μ. Thus μ is uniquely determined by the values it gives to the half spaces. This result, very simple in its statement, seems to require Fourier methods—no elementary proof is known. If μ„=>μ for probability measures on Rk, then <p„(r) -»(pit) for the corresponding characteristic functions by Theorem 29.1. But suppose that the characteristic functions converge pointwise. It follows by (29.5) that for each t the characteristic function of /х„Л,~' converges pointwise on the line to the characteristic function of μ.Λ~'; by the continuity theorem for characteristic functions on the line then, /х„ЛГ' =*μh~l. Take the wth component of t to be 1 and the others 0; then the /х„Л~' are the marginals for the uth coordinate. Since {μη'η~1} is weakly convergent, there is a bounded interval (au,bj such that μn[x<ERk^. au<xu <bu] = lL„h;\au,bu]> 1 -e/k for all n. But then μη(Α)> 1-е for the bounded rectangle A =[x: au <xu <bu, « = 1,...^]. The sequence {μη} is therefore, tight. If a subsequence [μη) converges weakly to v, then φ „it) converges to the characteristic function of v, which is therefore <p(r). By uniqueness, ν = μ, so that μη, =» μ. By the corollary to Theorem 29.3, μη=*μ. This proves the continuity theorem for A:-dimensional characteristic functions: μη => μ if and only if <p„(f) -»<p(0 for allt. The Cramer-Wold idea leads also to the following result, by means of which certain limit theorems can be reduced in a routine way to the one-dimensional case. Theorem 29.4. For random vectors X„ = iXnl,..., Xnk) and Y = (Y,,..., Yk), a necessary and sufficient condition for Xn=*Y is that Y,ku = ltu Xnu =»££_,гиУи for each itt,...,tk) in Rk. Proof. The necessity follows from a consideration of the continuous mapping h, above—use Theorem 29.2. As for sufficiency, the condition implies by the continuity theorem for one-dimensional characteristic functions that for each it„..., tk) for all real s. Taking s = 1 shows that the characteristic function of Xn converges pointwise to that of Y. ■ Normal Distributions in Rk By Theorem 20.4 there is (on some probability space) a random vector X =iXv...,Xk) with independent components each having the standard normal distribution. Since each Xu has density e~x /2/^2π, Χ has density
384 CONVERGENCE OF DISTRIBUTIONS к Π e""x" u= 1 к Ί2/2 (see (20.25)) (29.6) f(x)= Цт2е"и2/2· where UI2=E*„,x^ denotes Euclidean norm. This distribution plays the role of the standard normal distribution in Rk. Its characteristic function is (29.7) Let A = [aut ] be а к X к matrix, and put Y = AX. where X is viewed as a column vector. Since E[ Xa Χβ] = δαβ, the matrix X = [aUL ] of the covariances of Υ has entries aul = E[YUY, ] = Т,ка = хаиаа1а. Thus X =AA', where the prime denotes transpose. The matrix X is symmetric and nonnegative definite: Zu, aul xuxt = \A'x\2 > 0. View t also as a column vector with transpose t', and note that t ·χ = t'x. The characteristic function of AX is thus (29.8) E[ei,'(AX)] = E[ei(A',1x]=e^A'^/2 = e-''1'/2. Define a centered normal distribution as any probability measure whose characteristic function has this form for some symmetric nonnegative definite Σ. If Σ is symmetric and nonnegative definite, then for an appropriate orthogonal matrix U, ί/'Σί/ = D is a diagonal matrix whose diagonal elements are the eigenvalues of X and hence are nonnegative. If D0 is the diagonal matrix whose elements are the square roots of those of D, and if A = UD0, then X =AA. Thus for every nonnegative definite X there exists a centered normal distribution (namely the distribution of AX) with covari- ance matrix X and characteristic function exp(- \t'Xt). If X is nonsingular, so is the A just constructed. Since X has density (29.6), Υ = AX has, by the Jacobian transformation formula (20.20), density /(/T'jOldet/r'l. From X=AA' follows Idet A~l\ = (detX)"1/2. Moreover, X~l = (A')~iA~l, so that \A~lx\2 = x"l~lx. Thus the normal distribution has density (2T7)Ar/2(detX)"1/2exp(- jx'X"^) if X is nonsingular. If X is singular, the A constructed above must be singular as well, so that AX is confined to some hyperplane of dimension к - 1 and the distribution can have no density. By (29.8) and the uniqueness theorem for characteristic functions in Rk, a centered normal distribution is completely determined by its covariance matrix. Suppose the off-diagonal elements of X are 0, and let A be the diagonal matrix with the σ>/2 along the diagonal. Then X =AA', and if X has the standard normal distribution, the components Xt are independent and hence so are the components ai)/2Xi of AX. Therefore, the components of a
SECTION 29. LIMIT THEOREMS IN Rk 385 normally distributed random vector are independent if and only if they are uncorrelated. If Μ is a / X& matrix and Υ has in Rk the centered normal distribution with covariance matrix Σ, then MY has in R' the characteristic function exp(-\(M'tyi(M't)) = exp(- {t'(MlM')t) U^R'l Hence MY has the centered normal distribution in R' with covariance matrix ΜΈ,Μ'. Thus a linear transformation of a normal distribution is itself normal. These normal distributions are special in that all the first moments vanish. The general normal distribution is a translation of one of these centered distributions. It is completely determined by its means and covariances. The Central Limit Theorem Let Xn - (Xni,..., Xnk) be independent random vectors all having the same distribution. Suppose that E[X^U] < oo; let the vector of means be с = (с,,...,ck), where cu = E[Xnu], and let the covariance matrix be Σ = [auJ, where aU[ = E[{Xnu - cu\Xni - c, )1. Put 5„ = Xx + · · · +Xn. Theorem 29.5. Under these assumptions, the distribution of the random vector (Sn - nc)/ \fn converges weakly to the centered normal distribution with covariance matrix Σ. Proof. Let Y=(Yl,...,Yk) be a normally distributed random vector with 0 means and covariance matrix Σ. For given t = (t,,...,tk), let Z„ = L^ = itu(Xnu - cu) and Ζ = Zku = ltuYu. By Theorem 29.4, it suffices to prove that η~l/2Z"=lZj => Ζ (for arbitrary t). But this is an instant consequence of the Lindeberg-Levy theorem (Theorem 27.1). ■ PROBLEMS 29.1. A real function /on Rk is everywhere upper semicontinuous (see Problem 13.8) if for each χ and e there is a δ such that \x-y\<8 implies that /(y) </(*) -1- e; f is lower semicontinuous if —/ is upper semicontinuous. (a) Use condition (iii) of Theorem 29.1, Fatou's lemma, and (21.9) to show that, if μ„ =» μ and / is bounded and lower semicontinuous, then (29.9) liminf//<//i„>//<//i. (b) Show that, if (29.9) holds for all bounded, lower semicontinuous functions f, then μ„=>μ. (c) Prove the analogous results for upper semicontinuous functions.
386 CONVERGENCE OF DISTRIBUTIONS 29.2. (a) Show for probability measures on the line that μ„ χ νη => μ Χ ν if and only if μ„ =» μ and vn => v. (b) Suppose that Xn and Yn are independent and that X and У are independent. Show that, if Xn =*X and У„ => У, then (Л^.У,,) => (А', У) and hence that Xn + Yn~X+Y. (c) Show that part (b) fails without independence. (d) If F„ =» F and G„ =» G, then F„ * G„ =» F * G. Prove this by part (b) and also by characteristic functions. 29.3. (a) Show that {μ„} is tight if and only if for each e there is a compact set К such that μ„(Κ) > 1 - e for all n. (b) Show that {μη} is tight if and only if each of the к sequences of marginal distributions is tight on the line. 29.4. Assume of (X„, Y„) that X,=>X and Y„ => с Show that (Xa, Yn) => (X, c). This is an example of Problem 29.2(b) where Xn and Yr need not be assumed independent. 29.5. Prove analogues for Rk of the corollaries to Theorem 26.3. 29.6. Suppose that f(X) and g(Y) are uncorrelated for all bounded continuous / and g. Show that X and У are independent. Hint: Use characteristic functions. 29.7. 20.16 Τ Suppose that the random vector X has a centered fc-dimensional normal distribution whose covariance matrix has 1 as an eigenvalue of multiplicity r and 0 as an eigenvalue of multiplicity k — r. Show that \X\ has the chi-squared distribution with r degrees of freedom. 29.8. Τ Multinomial sampling. Let p1,...,pk be positive and add to 1, and let Z,,Z2,... be independent fc-dimensional random vectors such that Z„ has with probability pt a 1 in the tth component and 0's elsewhere. Then /„ = (/„p..-,/„fc)= TTm=xZm is the frequency count for a sample of size η from a multinomial population with cell probabilities p,. Put Xnj = (/,,,· — npj)/ фщ and *„ = (*„„...,*„*). (a) Show that Xn has mean values 0 and covariances σ/;· = (δ^ρ;· — PiPi>/JpiFj- (b) Show that the chi squared statistic Ef„,(/ni -np$l/npi has asymptotically the chi-squared distribution with к — 1 degrees of freedom. 29.9. 20.26Τ A theorem of Poincare. (a) Suppose that Xn = (XJll,...,Xnn) is uniformly distributed over the surface of a sphere of radius {n in R". Fix t, and show that Xnl,...,Xnl are in the limit independent, each with the standard normal distribution. Hint: If the components of У„ = (У„р--.,У„„) are independent, each with the standard normal distribution, then Xn has the same distribution as 4nYn/\Yn\. (b) Suppose that the distribution of X„ = (Xnl,...,Xnn) is spherically symmetric in the sense that Xn/\Xn\ is uniformly distributed over the unit sphere. Assume that \Xn\ /n => 1, and show that Xni,...,Xnt are asymptotically independent and normal.
SECTION 29. LIMIT THEOREMS IN Й* 387 29.10. Let X„ = (Xni,..., Xnk\ η = 1,2,..., be random vectors satisfying the mixing condition (27.19) with an = 0(n~5). Suppose that the sequence is stationary (the distribution of (Xn,..., Xn+j) is the same for all n), that E[Xm] = 0, and that the Xnu are uniformly bounded. Show that if Sn = Xl + · · · +Xn, then S„/ vn has in the limit the centered normal distribution with covariances E[XiM+ LE[XluXl+ll}+ Σ E[Xl+huXu]. Hint: Use the Cramer-Wold device. 29.11. Τ As in Example 27.6, let {Yn) be a Markov chain with finite state space S = {l,...,s}, say. Suppose the transition probabilities pui are all positive and the initial probabilities pu are the stationary ones. Let fnu be the number of i for which 1 <i <n and Yt = u. Show that the normalized frequency count n~1/2(fm ~ «Pi.···> fnk - nPk) has in the limit the centered normal distribution with covariances К -Pupl + Σ (pH)-pup,)+ Σ {p['u-p,Pu)- j=\ ;=i 29.12. Assume that Σ= σιι σΐ2 σ,2 σ22 is positive definite, invert it explicitly, and show that the corresponding two-dimensional normal density is ~Td(°"22*i _2о"12*1*2+о-пд:|) , where D = crua22 — ση. 29.13. Suppose that Ζ has the standard normal distribution in R1. Let μ be the mixture with equal weights of the distributions of (Z, Z) and (Ζ, -Ζ), and let (X, Y) have distribution μ. Prove: (a) Although each of X and У is normal, they are not jointly normal. (b) Although X and У are uncorrected, they are not independent. (29.10) ftxi.Xi)- 1 27tD'/2 exp
388 CONVERGENCE OF DISTRIBUTIONS SECTION 30. THE METHOD OF MOMENTS* The Moment Problem For some distributions the characteristic function is intractable but moments can nonetheless be calculated. In these cases it is sometimes possible to prove weak convergence of the distributions by establishing that the moments converge. This approach requires conditions under which a distribution is uniquely determined by its moments, and this is for the same reason that the continuity theorem for characteristic functions requires for its proof the uniqueness theorem. Theorem 30.1. Let μ be a probability measure on the line having finite moments ak = ρ_α>χΙζμ(άχ) of all orders. If the power series Y.kakrk/k\ has a positive radius of convergence, then μ is the only probability measure with the moments ax,a2,.... Proof. Let βΙζ = ρ_^\χ\Ιζμ(άχ) be the absolute moments. The first step is to show that (30.1) ^T^0' *-*00' for some positive r. By hypothesis there exists an s, 0 < s < 1, such that aksk/kl->0. Choose 0<r<s; then 2kr2k~l <s2k for large к. Since \x\2k-l<\+\x\2k, ft*-,'2*-' , г2*" , β^2' (2*-l)! _ (2k-1)1 (2k)I for large k. Hence (30.1) holds as к goes to infinity through odd values; since βΙζ = ak for к even, (30.1) follows. By (26.4), , / ., " (ihx)k\ and therefore the characteristic function φ of μ satisfies <p(t + h)- t £f(ir)V'>(a) k=o J-°° *This section may be omitted. IM"+1 * (rt + 1)!' ~ (n + 1)! ·
SECTION 30. THE METHOD OF MOMENTS 389 By (26.10), the integral here is <p(k)(t). By (30.1), (30.2) 00 a>{fc)(t) <p(t + h)= Σ ^ЧН-л*. ΙΛΙ =^ ■ it! If ν is another probability measure with moments ak and characteristic function ψ(ί), the same argument gives (30.3) Ψ(' + Λ)=Σ4-1^ \h\<r. Take r = 0; since <pw(0) = ikak = i/rw(0) (see (26'.9)), φ and ψ agree in ( — r, r) and hence have identical derivatives there. Taking t = r~e and t = -r + e in (30.2) and (30.3) shows that φ and ψ also agree in(-2r + e,2r — e) and hence in (- 2r, 2r). But then they must by the same argument agree in (-3r,3r) as well, and so on^ Thus φ and ψ coincide, and by the uniqueness theorem for characteristic functions, so do μ and v. ■ A probability measure satisfying the conclusion of the theorem is said to be determined by its moments. Example 30.1. For the standard normal distribution, \ak\ < k\, and so the theorem implies that it is determined by its moments. ■ But not all measures are determined by their moments: Example 30.2. If N has the standard normal density, then eN has the log-normal density f^L,-e-<los*>2/2 ifx>0, [θ ifx<0. Put g(x) =/(xXl + sin(2T7 log x)). If fxkf{ x) sin(2T7 log x) dx = 0, к = 0,1,2,..., •Ό then g, which is nonnegative, will be a probability density and will have the same moments as /. But a change of variable log χ = s + к reduces the This process is a version of analytic continuation.
390 CONVERGENCE OF DISTRIBUTIONS integral above to _Lek2/2f e-s2/2un2irsds, which vanishes because the integrand is odd. ■ Theorem 30.2. Suppose that the distribution of X is determined by its moments, that the Xn have moments of all orders, and that lim„ E[Xrn] = E[Xr] for r= 1,2,.... Then Xn =>X. Proof. Let μη and μ be the distributions of Xn and X. Since E[X2] converges, it is bounded, say by K. By Markov's inequality, P[\Xn\>x] < K/x2, which implies that the sequence [μJ is tight. Suppose that /x„t ■=> v, and let Υ be a random variable with distribution v. If и is an even integer exceeding r, the convergence and hence boundedness of E\Xun\ implies that E\Xr„k] -+E\Yr], by the corollary to Theorem 25.12. By the hypothesis, then, E[Yr] = E[Xr]—that is, ν and μ have the same moments. Since μ is by hypothesis determined by its moments, ν must be the same as μ, and so /x„t =*/x. The conclusion now follows by the corollary to Theorem 25.10. * ■ Convergence to the log-normal distribution cannot be proved by establishing convergence of moments (take X to have density / and the Xn to have density g in Example 30.2). Because of Example 30.1, however, this approach will work for a normal limit. Moment Generating Functions Suppose that μ has a moment generating function M(s) for s e[— a0,s0], s0> 0. By (21.22), the hypothesis of Theorem 30.1 is satisfied, and so μ is determined by its moments, which are in turn determined by M(s) via (21.23). Thus μ is determined by M(s) if it exists in a neighborhood of 0? The version of this for one-sided transforms was proved in Section 22—see Theorem 22.2. Suppose that μη and μ have moment generating functions in a common interval [-s0,s0], s0 > 0, and suppose that M„(s) -»M(s) in this interval. Since μη[(-α,αΥ) < e~s»a(Mn(-s0) + M„(s0)\ it follows easily that {μη} is tight. Since M(s) determines μ, the usual argument now gives μη =» μ. tFor another proof, see Problem 26.7. The present proof does not require the idea of anaiyticity.
SECTION 30. THE METHOD OF MOMENTS 391 Central Limit Theorem by Moments To understand the application of the method of moments, consider once again a sum 5„ = XnX + ■·· +Xnka, where XnV..., Xnk are independent and (30.4) E[Xnk) = 0, E[xtk\=aZk, s2n= Σ <U к. I Suppose further that for each η there is an M„ such that \Xnk\<Mn, к = 1,..., kn, with probability 1. Finally, suppose that M„ (30.5) ---0. All moments exist, and* (30.6) 5; = Σ Σ' τττ^ ^ Σ" xrni · · · *;r., «-= ι ' " where Σ' extends over the w-tuples (г,,...,ги) of positive integers satisfying r\ + '" +ru = r an^ Σ" extends over the «-tuples (/,,...,/„) of distinct integers in the range 1 < ia < kn. By independence, then, (30.7) where U") J ]:ргх\-:-ги\и\л^"--гЛ' (30.8) An(rl,...,ru)= Σ"^Ε[Χ^]-·- Ε[Χχ], and Σ' and Σ" have the same ranges as before. To prove that (30.7) converges to the rth moment of the standard normal distribution, it suffices to show that (30.9) \imAn(rl,...,ru) = {1 if,bri = ."'=r" = 2'. η \ 0 otherwise Indeed, if r is even, all terms in (30.7) will then go to 0 except the one for which и = r/2 and ra = 2, which will go to r\/(rx\ · · · rjul) =1x3x5 X · · · X (r - 1). And if r is odd, the terms will go to 0 without exception. +To deduce this from the multinomial formula, restrict the inner sum to ы-tuples satisfying 1 S ι', < · · · <iu<,kn and compensate by striking out the \/u\.
392 CONVERGENCE OF DISTRIBUTIONS If ra = 1 for some a, then (30.9) holds because by (30.4) each summand in (30.8) vanishes. Suppose that ra>2 for each a and ra > 2 for some a. Then r>2u, and since \Ε[Χ^]\<Μ^-"2)σ^, it follows that A„(rlt...,ru)< (Mn/snY~2uAn(2,...,2). But this goes to 0 because (30.5) holds and because An{2,.. .,2) is bounded by 1 (it increases to 1 if the sum in (30.8) is enlarged to include all the «-tuples (/,,...,/„)). It remains only to check (30.9) for r, = · · · = ru = 2. As just noted, An(2,... ,2) is at most 1, and it differs from 1 by Ls~2"a2t■, the sum extending over the (/,,...,/„) with at least one repeated index. Since σ„2<Μ„2, the terms for example with /„ = <„_, sum to at most Μ„2ίηΓ2"Σσ„2 ■ ■ · σ„2 _ < MX2. Thus 1 - A„(2,. ·.,2) < u2M2s-2 -* 0. This proves that the moments (30 7) converge to those of the normal distribution and hence that Sn/s„ => N Application to Sampling Theory Suppose that η numbers Xnl' *η2> · ' · ' Xnn> not necessarily distinct, are associated with the elements of a population of size n. Suppose that these numbers are normalized by the requirement η η (30.10) Σχ„., = 0, Σχ*Α = 1, M„=max|x„J. A-l A=l h^n An ordered sample XnV..., Xnk is taken, where the sampling is without replacement. By (30.10), E[Xnk] = 0 and E[X2k]=l/n. Let s2n = kn/n be the fraction of the population sampled. If the Xnk were independent, which they are not, S„ = Xnl -f · · · +Xnk would have variance s2. If kn is small in comparison with n, the effects of dependence should be small. It will be shown that Sn/s„ => N if (30.li) *- = τ-°· 17-*0· k"-"°· Since M2>n~x by (30.10), the second condition here in fact implies the third. The moments again have the form (30.7), but this time E\Xrn\ '"X^J cannot be factored as in (30.8). On the other hand, this expected value is by symmetry the same for each of the (&„)„ = kn(kn - 1) · · · (kn - и + 1) choices of the indices ia in the sum Σ". Thus An(rl,...,ru) = ^E[X^1---X^]. The problem again is to prove (30.9).
SECTION 30. THE METHOD OF MOMENTS 393 The proof goes by induction on u. Now An(r) = knsn rn~x^nh^lxrnh, so that Л„(1) = 0 and Л„(2)= 1. If r > 3, then \xrJ <M^2x2nh, and so |Л„(г)|< (MJsny-2-+0 by (30.11). Next suppose as induction hypothesis that (30.9) holds with и - 1 in place of u. Since the sampling is without replacement, E[Xr„\ ■ · ■ Xfo] = Lxr„'h · · · xr„l /(")„, where the summation extends over the «-tuples (hv...,hu) of distinct integers in the range 1 < ha < n. In this last sum enlarge the range by requiring of (hvh2,...,hu) only that h2,...,hu be distinct, and then compensate by subtracting away the terms where hx = h2, where /z, =h3, and so on. The result is a=2 \П)" This takes the place of the factorization made possible in (30.8) by the assumed independence there. It gives Л„(г,,...,гц) = „_и + 1 ~J1—£-n An(rx)An(r2,...,ru) kn-u + \ » ~ n-U+T L An(r2r---,rl + ra,...,r,l). By the induction hypothesis the last sum is bounded, and the factor in front goes to 0 by (30.11). As for the first term on the right, the factor in front goes to 1. If г, Ф2, then Л„(г,)-*0 and A„(r2,..., ru) is bounded, and so An(rv. ..,ru)-*0. The same holds by symmetry if ra Φ 2 for some a other than 1. If r, = · · · = ru = 2, then Л„(г,) = 1, and An(r2,..., rj -»1 by the induction hypothesis. Thus (30.9) holds in all cases, and Sn/sn =» N follows by the method of moments. Application to Number Theory Let g(m) be the number of distinct prime factors of the integer m; for example #(34x52) = 2. Since there are infinitely many primes, g(m) is unbounded above; for the same reason, it drops back to 1 for infinitely many m (for the primes and their powers). Since g fluctuates in an irregular way, it is natural to inquire into its average behavior. On the space Ω of positive integers, let Pn be the probability measure that places mass \/n at each of 1,2,...,n, so that among the first η positive
394 CONVERGENCE OF DISTRIBUTIONS integers the proportion that are contained in a given set A is just Pn( A). The problem is to study Pn[m: g(m) <x] for large n. If dp(m) is 1 or 0 according as the prime ρ divides m or not, then (30.12) g(m)=ZSp(m). Probability theory can be used to investigate this sum because under Pn the Sp(m) behave somewhat like independent random variables. If pv..., pu are distinct primes, then by the fundamental theorem of arithmetic, 8p(m) = ·· = Sp(m) = 1—that is, each /?, divides m—if and only if the product ρ ι · · ■ pu divides m. The probability under Pn of this is just n~x times the number of m in the range 1 < m < η that are multiples of p, · · · pu, and this number is the integer part of n/pl · · · pu. Thus (30.13) Pn[m:8p(m) = l,i = l,...,u] = ^ ■■■Pu for distinct P;. Now let Xp be independent random variables (on some probability space, one variable for each prime p) satisfying If p1,...,pu are distinct, then (30.14) ΡμΛ.1>/β lf...iM] = _L_. For fixed pv..., pu, (30.13) converges to (30.14) as η -»oo. Thus the behavior of the X can serve as a guide to that of the Sp(m). If m <n, (30.12) is Σρ<,„δρ(ητ), because no prime exceeding m can divide it. The ideaf is to compare this sum with the corresponding sum Ep£LnXp. This will require from number theory the elementary estimate* (30.15) Σ ^ = loglogx + <9(l). The mean and variance of Σρ^ηΧρ are Σριίηρ~ι and Σρ^ηρ~\ΐ -ρ"1); since Σρρ~2 converges, each of these two sums is asymptotically log log n. tCompare Problems 2.18, 5.19, and 6.16. *See, for example, Problem 18.17, or Hardy&Wright, Chapter XXII.
SECTION 30. THE METHOD OF MOMENTS 395 Comparing Lp<nSp(m) with Σρ&ηΧρ then leads one to conjecture the Erdos-Kac central limit theorem for the prime divisor function: Theorem 30.3. For all x, g(m) - loglogn (30.16) m: Vloglogn <x /2 77 ■f e-u*'2du. Proof. The argument uses the method of moments. The first step is to show that (30.16) is unaffected if the range of ρ in (30.12) is further restricted. Let {<*„} be a sequence going to infinity slowly enough that (30.17) but fast enough that (30.18) log л Σ - =o(loglog«)1/2. a„<p£n Because of (30.15), these two requirements are met if, for example, logan (Iog«)/loglog«. Now define Sn(m)= Σ 8p(m). (30.19) For a function / of positive integers, let E„[f] = n~l tf(m) m = l denote its expected value computed with respect to Pn. By (30.13) for и = 1, En\ Σ δJ= Σ Pn[m:8p(m) = l]z Σ \- p>an J an<p<n an<p^n By (30.18) and Markov's inequality, P„[m: \g(m)-gn(m)\>e(\og\ogn)l/2] -»0. Therefore (Theorem 25.4), (30.16) is unaffected if gn(m) is substituted for g(m).
396 CONVERGENCE OF DISTRIBUTIONS Now compare (30.19) with the corresponding sum Sn = Lp&aXp. The mean and variance of S„ are cn= Σ ± *l= Σ ifi-i PS«„ Pi«, and each is loglog η + o(loglog n)i/2 by (30.18). Thus (see Example 25.8), (30.16) with g(m) replaced as above is equivalent to (30.20) m: Sn(m) <x /2π •'-α. It therefore suffices to prove (30.20). Since the X are bounded, the analysis of the moments (30.7) applies here. The only difference is that the summands in 5„ are indexed not by the integers к in the range к <kn but by the primes ρ in the range ρ < an; also, Xp must be replaced by Xp-p~x to center it. Thus the rth moment of (5„ - cn)/sn converges to that of the normal distribution, and so (30.20) and (30.16) will follow by the method of moments if it is shown that as η -»oo, (30.21) *J»t C„ -E. for each /·. Now E[S'„] is the sum (30-22) Σ Σ'-rrf^Tj ϊγγ Σ'Έ{ χχ ■ ■ ■ χ;·;\, и = 1 '' "' ' where the range of Σ' is as in (30.6) and (30.7), and Σ" extends over the «-tuples (pv...,pu) of distinct primes not exceeding an. Since Xp assumes only the values 0 and 1, from the independence of the X and the fact that the p; are distinct, it follows that the summand in (30.22) is (30.23) E\Xn ■■■ Xn] = I Pi PuJ 1 P\'"P« By the definition (30.19), En[gr„] is just (30.22) with the summand replaced by Ε„[δΓ' ■ ■ ■ δΓ"]. Since δ (m) assumes only the values 0 and 1, from (30.13) and the fact that the /?, are distinct, it follows that this summand is (30.24) E-[« P\ ' ' δΡ»\ - η ■ Pu
SECTION 30. THE METHOD OF MOMENTS 397 But (30.23) and (30.24) differ by at most 1/n, and hence E[Sr„) and En[grn] differ by at most the sum (30.22) with the summand replaced by \/n. Therefore, (30.25) \E[S'H\-EH[g'H\\*U Σ lf<$. Now E{{Sn-cn)r]= iQ(rk)E[Skn}(-cny-k, and En[(gn - cnY] has the analogous expansion. Comparing the two expansions term for term and applying (30.25) shows that (30.26) \E[(S„-cS]-E„[(g„-c„)'\\ Since c„ < a„, and since arn/n -» 0 by (30.17), (30.21) follows as required. ■ The method of proof requires passing from (30.12) to (30.19). Without this, the an on the right in (30.26) would instead be n, and it would not follow that the difference on the left goes to 0; hence the truncation (30.19) for an an small enough to satisfy (30.17). On the other hand, an must be large enough to satisfy (30.18), in order that the truncation leave (30.16) unaffected. PROBLEMS 30.1. From the central limit theorem under the assumption (30.5) get the full Lindeberg theorem by a truncation argument. 30.2. For a sample of size kn with replacement from a population of size n, the probability of no duplicates is Flfio'd -j'/n). Under the assumption kn/ 4n -» 0 in addition to (30.10), deduce the asymptotic normality of S„ by a reduction to the independent case. 30.3. By adapting the proof of (21.24), show that the moment generating function of μ in an arbitrary interval determines μ. 30.4. 25.13 30.3 Τ Suppose that the moment generating function of μ„ converges to that of μ in some interval. Show that μη => μ.
398 CONVERGENCE OF DISTRIBUTIONS 30.5. Let μ be a probability measure on Rk for which /Λ*|*,Γμ(ώί)< °° for i = l,...,к and r = 1,2,... . Consider the cross moments 'Λ* for nonnegative integers r,. (a) Suppose for each i that (30.27) Ljrft\*M<b) has a positive radius of convergence as a power series in Θ. Show that μ is determined by its moments in the sense that, if a probability measure ν satisfies a(rlt...,rk) = fx^ · ■ xrkkv(dx) for all ι·,, — rk, then ν coincides with μ. (b) Show that a λ-dimensional normal distribution is determined by its moments. 30.6. Τ Let μ„ and μ be probability measures on Rk. Suppose that for each i, (30.27) has a positive radius of convergence. Suppose that jRk jRk for all nonnegative integers rt,...,rk. Show that μ„ =>μ. 30.7. 30.5T Suppose that X and У are bounded random variables and that Xm and Y" are uncorrelated for m, η = 1,2,... . Show that A' and У are independent. 30.8. 26.17 30.6 Τ (a) In the notation (26.32), show for λ Φ 0 that (30.28) M[(cosA*)r] = (r/2)^ for even r and that the mean is 0 for odd r. It follows by the method of moments that cos Ад; has a distribution in the sense of (25.18), and in fact of course the relative measure is (30.29) p[x: cos A* ^u] = 1 - -arccosu, -1<м<1. (b) Suppose that A„ A2,... are linearly independent over the field of rationale in the sense that, if η,Α, + · · · +nmkm = 0 for integers nv, then n, = · · · = nm = 0. Show that (30.30) Μ к IKcosA,,*/ v=\ к ΠΛ/[(008λ„*)Γ·] x=l for nonnegative integers r,,..., rk.
SECTION 30. THE METHOD OF MOMENTS 399 (c) Let Xl,X2,... be independent and have the distribution function on the right in (30.29). Show that (30.31) (d) Show that (30.32) limp χ: Σ cos kjX <u = P[XX+--+Xk<u]. '■ui<VT Σ cos\jX <u2 = — / e ,2/2du. v K , = , ι/2π Ju, For a signal that is the sum of a large number of pure cosine signals with incommensurable frequencies, (30.32) describes the relative amount of time the signal is between u, and u2. 30.9. 6.16T From (30.16), deduce once more the Hardy-Ramanujan theorem (see (6.10)). 30.10. Τ (a) Prove that (if Pn puts probability \/n at l,...,n) log log m - log log и (30.33) limR, >6 yJ\og log η (b) From (30.16) deduce that (see (2.35) for the notation) = 0. (30.34) D g(m) - loglog/я m: ' = <x yioglog) 30.11. Τ Let G(m) be the number of prime factors in m with multiplicity counted. In the notation of Problem 5.19, G(m) = Lpap(m). (a) Show for к > 1 that Pn[m: ap(m) - 8p(m) >k]< l/pk+l; hence En[ap - Sp] < 2/p2. (b) Show that En[G - g] is bounded. (c) Deduce from (30.16) that G(m) — log log и г: , : <j yloglogn (d) Prove for G the analogue of (30.34). 30.12. Τ Prove the Hardy-Ramanujan theorem in the form 1=(Χ e-"2/2du. D log log m 1 > 6 = 0. Prove this with G in place of g.
CHAPTER 6 Derivatives and Conditional Probability SECTION 31. DERIVATIVES ON THE LINE* This section on Lebesgue's theory of derivatives for real functions of a real variable serves to introduce the general theory of Radon-Nikodym derivatives, which underlies the modern theory of conditional probability. The results here are interesting in themselves and will be referred to later for purposes of illustration and comparison, but they will not be required in subsequent proofs. The Fundamental Theorem of Calculus To what extent are the operations of integration and differentiation inverse to one another? A function F is by definition an indefinite integral of another function / on [a, b] if (31.1) F(x)-F(a)=ff(t)dt Ja for a < χ < b; F is by definition a primitive of / if it has derivative /: (31.2) F'(x)=f(x) for a <x < b. According to the fundamental theorem of calculus (see (17.5)), these concepts coincide in the case of continuous /: Theorem 31.1. Suppose that f is continuous on [a, b]. (i) An indefinite integral of f is a primitive off: if (31.1) holds for all χ in [a,b], then so does (31.2). (ii) A primitive of f is an indefinite integral of f: if (31.2) holds for all χ in [a, b], then so does (31.1). *This section may be omitted. 400
SECTION 31. DERIVATIVES ON THE LINE 401 A basic problem is to investigate the extent to which this theorem holds if / is not assumed continuous. First consider part (i). Suppose / is integrable, so that the right side of (31.1) makes sense. If / is 0 for χ <m and 1 for χ >m (a < m < b), then an F satisfying (31.1) has no derivative at m. It is thus too much to ask that (31.2) hold for all x. On the other hand, according to a famous theorem of Lebesgue, if (31.1) holds for all x, then (31.2) holds almost everywhere—that is, except for χ in a set of Lebesgue measure 0. In this section almost everywhere will refer to Lebesgue measure only. This result, the most one" could hope for, will be proved below (Theorem 31.3). Now consider part (ii) of Theorem 31.1. Suppose that (31.2) holds almost everywhere, as in Lebesgue's theorem, just stated. Does (31.1) follow? The answer is no: If / is identically 0, and if F(x) is 0 for χ < m and 1 for χ > m (a <m <b), then (31.2) holds almost everywhere, but (31.1) fails for χ > т. The question was wrongly posed, and the trouble is not far to seek: If / is integrable and (31.1) holds, then (31.3) F(x + h)-F(x)=(bI(x,x+h)(t)f(t)dt-*0 Ja as h i 0 by the dominated convergence theorem. Together with a similar argument for h |0 this shows that F must be continuous. Hence the question becomes this: If F is continuous and / is integrable, and if (31.2) holds almost everywhere, does (31.1) follow? The answer, strangely enough, is still no: In Example 31.1 there is constructed a continuous, strictly increasing F for which F'(x) = 0 except on a set of Lebesgue measure 0, and (31.1) is of course impossible if / vanishes almost everywhere and F is strictly increasing. This leads to the problem of characterizing those F for which (31.1) does follow if (31.2) holds outside a set of Lebesgue measure 0 and / is integrable. In other words, which functions are the integrals of their (almost everywhere) derivatives? Theorem 31.8 gives the characterization. It is possible to extend part (ii) of Theorem 31.1 in a different direction. Suppose that (31.2) holds for every x, not just almost everywhere. In Example 17.4 there was given a function F, everywhere differentiable, whose derivative / is not integrable, and in this case the right side of (31.1) has no meaning. If, however, (31.2) holds for every x, and if / is integrable, then (31.1) does hold for all x. For most purposes of probability theory, it is natural to impose conditions only almost everywhere, and so this theorem will not be proved here.* The program then is first to show that (31.1) for integrable / implies that (31.2) holds almost everywhere, and second to characterize those F for which the reverse implication is valid. For the most part, / will be nonnegative and F will be nondecreasing. This is the case of greatest interest for probability theory; F can be regarded as a distribution function and / as a density. fFor a proof, see Rudin2, ρ 179
402 DERIVATIVES AND CONDITIONAL PROBABILITY In Chapters 4 and 5 many distribution functions F were either shown to have a density / with respect to Lebesgue measure or were assumed to have one, but such F's were never intrinsically characterized, as they will be in this section. Derivatives of Integrals The first step is to show that a nondecreasing function has a derivative almost everywhere. This requires two preliminary results. Let Λ denote Lebesgue measure. Lemma 1. Let A be a bounded linear Borel set, and let J^ be a collection of open intervals covering A. Then J^ contains a finite, disjoint subcollection I],...,Ik for which Σ?_,Λ(/,) > λ(Α)/6. Proof. By regularity (Theorem 12.3) A contains a compact subset К satisfying λ(Κ)>λ(Α)/2. Choose in J^ a finite subcollection J^0 covering K. Let /, be an interval in J^ of maximal length; discard from J^0 the interval /, and all the others that intersect It. Among the intervals remaining in J^0, let /2 be one of maximal length; discard /2 and all intervals that intersect it. Continue this way until J^0 is exhausted. The /, are disjoint. Let J,- be the interval with the same midpoint as /, and three times the length. If / is an interval in J^0 that is cast out because it meets /,·, then /c/f. Thus each discarded interval is contained in one of the Jh and so the 7, cover K. Hence ΣΛ(//) = ΣΛ(/,)/3 > лШ/3 > λ(Α)/6. Ш If (31.4) Δ: α = a0<al < ■■ ■ <ak = b is a partition of an interval [a, b] and F is a function over [a, b], let (31.5) \\F\\A= E|F(fl()-F(e/_,)|. 1 = 1 Lemma 2. Consider a partition (31.4) and a nonnegative Θ. If (31.6) F(a)<F(b), and if (31.7) Н',)-П*,-1)<_в for a set of intervals [α,,,,,α,·] of total length d, then \\F\\A>\F(b)-F(a)\ + 2ed.
SECTION 31. DERIVATIVES ON THE LINE 403 This also holds if the inequalities in (31.6) and (31.7) are reversed and -Θ is replaced by θ in the latter. Proof. The figure shows the case where к = 2 and the left-hand interval satisfies (31.7). Here F falls at least θά over [a, a + d], rises the same amount over [a + d, u], and then rises F(b) - F(a) over [u, b]. 1 ι I ι ι a a+i и b For the general case, let Σ' denote summation over those / satisfying (31.7) and let Σ" denote summation over the remaining i (1 <i <k). Then IIF||A=E'(F(«/-.)-F(«/))+E"l^(«i)-F(a/_1)| ^Е,(^(в.--|)-^Ю) + |Е*И«*)-^(«.--.))| = L'(F(ai_l)-F(ai))+\(F(b)-F(a)) + Z(F(ai_l)-F(ai))\. As all the differences in this last expression are nonnegative, the absolute- value bars can be suppressed; therefore, l№*F(ft)-F(e) + 2E'(*■(«,_,)-*■( a,)) *F(4>)-F(a) + 2eZ'("i'-«,-,). ■ A function F has at each χ four derivates, the upper and lower right derivatives Dr(x) = hmsup—- τ *-^-, η , ^ Γ · cF{x+h)-F{x) and the upper and lower left derivatives Fn, , ,. Fix)-F(x-h) D(x) = limsup—^—И '-, FD(x) = lim inf v —r^ -.
404 DERIVATIVES AND CONDITIONAL PROBABILITY There is a derivative at χ if and only if these four quantities have a common value. Suppose that F has finite derivative F'(x) at x. If и <x < v, then F(v)-F(u) ν - и F'(x) < υ-χ < ν — и χ - ν - F и и υ-χ F(x)-F(M) -F(x) F(x) Therefore, (31.8) F(v)-F(u) ν - и F'(x) as и Τ χ and υ Ι χ; that is to say, for each e there is a δ such that и <x < ν and 0 < ν - и < δ together imply that the quantities on either side of the arrow differ by less than e. Suppose that F is measurable and that it is continuous except possibly at countably many points. This will be true if F is nondecreasing or is the difference of two nondecreasing functions. Let Μ be a countable, dense set containing all the discontinuity points of F; let rn(x) be the smallest number of the form k/n exceeding x. Then D (x) = Hm sup "-,0° x<y<rn(x) f(y)-F(x). у -χ ' the function inside the limit is measurable because the x-set where it exceeds a is (J [x:x<y<r„(x),F(y)-F(x)>a(y-x)]. уем Thus DF(x) is measurable, as are the other three derivates. This does not exclude infinite values. The set where the four derivates have a common finite value F' is therefore a Borel set. In the following theorem, set F' = 0 (say) outside this set; F' is then a Borel function. Theorem 31.2. Л nondecreasing function F is differentiable almost everywhere, the derivative F' is nonnegative, and (31.9) for all a and b. fbF'(t)dt<F(b)-F(a) •'л This and the following theorems can also be formulated for functions over an interval.
SECTION 31. DERIVATIVES ON THE LINE 405 Proof. If it can be shown that (31.10) DF(x)<FD(x) except on a set of Lebesgue measure 0, then by the same result applied to G(x) = -F(-x) it will follow that FD(x) = DG(-x) <GD(-x) = DF(x) almost everywhere. This will imply that DF(x) <DF(x) < FD(x) < FD(x) < DF(x) almost everywhere, since the first and third of these inequalities are obvious, and so, outside a set of Lebesgue measure 0, F will have a derivative, possibly infinite. Since F is nondecreasing, F' must be nonnega- tive, and once (31.9) is proved, it will follow that F' is finite almost everywhere. If (31.10) is violated for a particular x, then for some pair α, β of rationale satisfying α<β, χ will lie in the set Ααβ = [χ: FD(x) <a <β < DF(x)]. Since there are only countably many of these sets, (31.10) will hold outside a set of Lebesgue measure 0 if λ(Ααβ) = 0 for all a and β. Put G(x) = F(x)- \(,α + β)χ and θ = \(β - a). Since differentiation is linear, Λαβ = Be = [x: GD(x) < -θ<θ< DG(x)]. Since F and G have only countably many discontinuities, it suffices to prove that Л(С9) = 0, where Ce is the set of points in Be that are continuity points of G. Consider an interval (a, b), and suppose for the moment that G(a)<G(b). For each χ in Ce satisfying a <x <b, from GD(x) < -Θ it follows that there exists an open interval (ax, bx) for which χ e (ax, bx) с (a, b) and G(bx)-G(ax) (31.11) h - < -Θ- X X There exists by Lemma 1 a finite, disjoint collection (ax.,bx) of these intervals of total length Z(bx. - ax) > Λ((α, b) Π C9)/6. Let Δ be the partition (31.4) of [a, b] with the points ax and bx_ in the role of the al,...,ak_l. By Lemma 2, (31.12) HGIU >\G(b)-G(a)\ + }0Λ((α, b) n Ce). If instead of G(a) < G(b) the reverse inequality holds, choose ax and bx so that the ratio in (31.11) exceeds Θ, which is possible because DG(x)> θ for x e Ce. Again (31.12) follows. In each interval [a, b] there is thus a partition (31.4) satisfying (31.12). Apply this to each interval [α,_,, α,] in the partition. This gives a partition Δ, that refines Δ, and adding the corresponding inequalities (31.12) leads to 1|С||Д1>||С||д + }0Л((а,Ь)ПС9). Continuing leads to a sequence of successively finer partitions Δη such that (31.13) ||G|U„ >«|a((«,£>) DCe).
406 DERIVATIVES AND CONDITIONAL PROBABILITY Now ||G|U is bounded by \F(b) - F(a)\ + {\a + /3|(6 - a) because F is mono- tonic. Thus (31.13) is impossible unless Λ((α, b) Π Ce) = 0. Since (a, b) can be any interval, Л(С9) = О. This proves (31.10) and establishes the differentiability of F almost everywhere. It remains to prove (31.9). Let (31.14) fM-F(x+-?,-p(x) Now fn is nonnegative, and by what has been shown, /„(*)-» F'(x) except on a set of Lebesgue measure 0. By Fatou's lemma and the fact that F is nondecreasing, /V(x) dx < liin inf fbf„(x) dx = lim inf η I F(x) dx - η ja+" F(x)dx η [ Jb Ja < lim inf [F(b + n'1) - F(a)] =F(b + ) -F(a). η Replacing b by b - e and letting e -»0 gives (31.9). ■ Theorem 31.3. Iffis nonnegative and integrable, and ifF(x) = fx_^f(t)dt, then F'(x) = f(x) except on a set of Lebesgue measure 0. Since / is nonnegative, F is nondecreasing and hence by Theorem 31.2 is differentiable almost everywhere. The problem is to show that the derivative F' coincides with / almost everywhere. Proof for Bounded /. Suppose first that / is bounded by M. Define fn by (31.14). Then fn(x) = nfxx+n~f(t)dt is bounded by Μ and converges almost everywhere to F'(x), so that the bounded convergence theorem gives (bF'(x)dx= lim (bfn(x)dx Ja η ■'a = lim n\ +" F(x) dx - η fa+" F(x)dx Since F is continuous (see (31.3)), this last limit is F(b)~ F(a) = f^f(x)dx. Thus fAF'(x)dx = fAf(x)dx for bounded intervals A = (a, b]. Since these form a T7-system, it follows (Theorem 16.10(iii)) that F'=f almost everywhere. ■
SECTION 31. DERIVATIVES ON THE LINE 407 Proof for Integrable /. Apply the result for bounded functions to / truncated at n: If hnix) is fix) or η as fix) < η or fix) > n, then Hnix) = jx-Jinit)dt differentiates almost everywhere to hnix) by the case already treated. Now Fix) = Hnix) + /I„(/(i) - hnit))dt; the integral here is nonde- creasing because the integrand is nonnegative, and it follows by Theorem 31.2 that it has almost everywhere a nonnegative derivative. Since differentiation is linear, F'ix) > H'nix) = hnix) almost everywhere. As η was arbitrary, F'ix) >fix) almost everywhere, and so fabF'ix)dx> Jabfix)dx = Fib) - Fia). But the reverse inequality is a consequence of (31.9). Therefore", jbiF'ix) - fix))dx = 0, and as before F' =f except on a set of Lebesgue measure 0. ■ Singular Functions If fix) is nonnegative and integrable, differentiating its indefinite integral fL^fiOdt leads back to fix) except perhaps on a set of Lebesgue measure 0. That is the content of Theorem 31.3. The converse question is this: If Fix) is nondecreasing and hence has almost everywhere a derivative F'ix), does integrating F'ix) lead back to Fix)! As stated before, the answer turns out to be no even if Fix) is assumed continuous: Example 31.1. Let Xl,X2,·.. be independent, identically distributed random variables such that P[Xn = 0] = p0 and P[Xn = lj =p, = 1 -p0, and let X= Σ°η = λΧη2-". Let Fix) = P[X<x] be the distribution function of X. For an arbitrary sequence u,,u2,... of 0's and l's, P[X„ = un, η = 1,2,...] = lim„ pu ■ · ■ pu = 0; since χ can have at most two dyadic expansions χ = Lnun2~", P[X = x] = 0. Thus F is everywhere continuous. Of course, F(0) = 0 and F(l) = l. For 0<A:<2", k2~" has the form Σ^,μ,-2-' for some л-tuple (и,,...,w„) of 0's and l's. Since F is continuous, (M.15) ,(*£!)-,(£)-,. <X< 2" 2" = P[Xi = ui,i<n]=pU[ ■■· PUn. This shows that F is strictly increasing over the unit interval. If Po=P\- \, the right side of (31.15) is 2"", and a passage to the limit shows that Fix) = x for 0 <x < 1. Assume, however, that ρ0Φρν It will be shown that F'ix) = 0 except on a set of Lebesgue measure 0 in this case. Obviously the derivative is 0 outside the unit interval, and by Theorem 31.2 it exists almost everywhere inside it. Suppose then that 0 < χ < 1 and that F has a derivative F'ix) at x. It will be shown that F'ix) - 0. For each η choose kn so that χ lies in the interval /„= ikn2~", ikn + 1)2""]; /„ is that dyadic interval of rank η that contains x. By (31.8), Pixel.] F{jkn+l)2-")-Fjkn2-") 2=7; = 2=" b \x>·
408 DERIVATIVES AND CONDITIONAL PROBABILITY Graph of F(.x) for p0 = .25, p\ = .75. Because of the recursion (31.17), the part of the graph over [0,.5] and the part over [.5,1] are identical, apart from changes in scale, with the whole graph Each segment of the curve therefore contains scaled copies of the whole, the extreme irregularity this implies is obscured by the fact that the accuracy is only to within the width of the printed line. , If F'(x) is distinct from 0, the ratio of two successive terms here must go to 1, so that (31.16) If /„ consists of the reals with nonterminating base-2 expansions beginning with the digits «„...,«„, then P[XeIn]=pUi ■ · · plln by (31.15). But Ia + l must for some u„ + , consist of the reals beginning u,,."..,u„,u„ + 1 (un + l is 1 or 0 according as χ lies to the right of the midpoint of /„ or not). Thus P[X <^ In + i]/P[X <^ I„] = Pu n+l is either p0 or/?,, and (31.16) is possible only if p0 =p„ which was excluded by hypothesis. Thus F is continuous and strictly increasing over [0,1], but F'(x) = 0 except on a set of Lebesgue measure 0. For 0 < χ < j independence gives
SECTION 31. DERIVATIVES ON THE LINE 409 F(x) = P[Xl=0, L:=2Xn2-" + ]<2x]=p0F(2x). Similarly, F(x)-p0 = p,F(2x- Dfor \<x<\. Thus (31.17) F(x)-K(2X> !^*· V ^ V ^ \p0+PiF(2x-l) й{<х<\. In Section 7, F(x) (there denoted <2(*)) entered as the probability of success at bold play; see (7.30) and (7.33). ■ A function is singular if it has derivative 0 except on a set of Lebesgue measure 0. Of course, a step function constant over intervals is singular. What is remarkable (indeed, singular) about the function in the preceding example is that it is continuous and strictly increasing but nonetheless has derivative 0 except on a set of Lebesgue measure 0. Note that there is strict inequality in (31.9) for this F. Further properties of nondecreasing functions can be discovered through a study of the measures they generate. Assume from now on that F is nondecreasing, that F is continuous from the right (this is only a normalization), and that 0 = lim, _ _„,, F(x) < lim, _ +00 F(x) = m < oo. Call such an F a distribution function, even though m need not be 1. By Theorem 12.4 there exists a unique measure μ on the Borel sets of the line for which (31.18) μ(β,&] =F(fc)-F(a). Of course, μ,(Λ') = m is finite. The larger F' is, the larger μ is: Theorem 31.4. Suppose that F and μ are related by (31.18) and that F'(x) exists throughout a Borel set A. (i) IfF'(x)<aforx<EA, then μ(Α)<αλ(Α). (ii) IfF'(x)>ot forx<EA, then μ(Α)>αλ(Α). Proof. It is no restriction to assume A bounded. Fix e for the moment. Let Ε be a countable, dense set, and let A„ = Γ\(Α Π /), where the intersection extends over the intervals / = (и, υ] for which u, υ e E, 0 <Λ(/) <n~', and (31.19) μ(Ι)<{α + €)λ{ί). Then An is a Borel set and (see (31.8)) An1 A under the hypothesis of (i). By Theorem 11.4 there exist disjoint intervals Ink (open on the left, closed on
410 DERIVATIVES AND CONDITIONAL PROBABILITY the right) such that An с (J kInk and (31.20) Ea(/„J<A(>4„)+6. It is no restriction to assume that each Ink has endpoints in E, meets An, and satisfies \(I„k) <n~]. Then (31.19) applies to each Ink, and hence /*(Λ)^ΣΜω^(«+ί)Σλ(υ^(«+ο(λ(^)+ί). In the extreme terms here let η -»oo and then e -»0; (i) follows. To prove (ii), let the countable, dense set Ε contain all the discontinuity points of F, and use the same argument with μ(Ι) >{α - e)A(I) in place of (31.19) and ΣΙζμ(ΙηΙζ)<μ(Α„) + € in place of (31.20). Since Ε contains all the discontinuity points of F, it is again no restriction to assume that each Ink has endpoints in E, meets An, and satisfies Α(/„Λ) <η~λ. It follows that /*M„) + *> Ем(и^(«-«)ЕмиМ«-ОМЛ)- к к Again let η -»oo and then e -» 0. ■ and 5A such that The measures μ. and Λ have disjoint supports if there exist Borel sets 5μ = 0, Λ(Λ' -5Α) = 0, (31.21) Theorem 31.5. Suppose that F and μ are related by (31.18). A necessary and sufficient condition for μ and A to have disjoint supports is that F'(x) = 0 except on a set of Lebesgue measure 0. Proof. By Theorem 31.4, μ[χ: |χ|<α, F'(x) <e] < 2ae, and so (let e->0 and then a -»oo) μ[χ: F'(x) = 0] = 0. If F'(x) = 0 outside a set of Lebesgue measure 0, then 5A = [x: F'(x) = 0] and 5μ = R] - 5A satisfy (31.21). Suppose that there exist 5μ and 5A satisfying (31.21). By the other half of Theorem 31.4, e\[x: F'(x) >e] = еЛ[х: ieSA, F'(x)>e] </x(5A) = 0, and so (let € -» 0) F'(x) = 0 except on a set of Lebesgue measure 0. ■ Example 31.2. Suppose that μ is discrete, consisting of a mass mk at each of countably many points xk. Then F(x)= Lmk, the sum extending over the к for which xk <x. Certainly, μ and Λ have disjoint supports, and so F' must vanish except on a set of Lebesgue measure 0. This is directly obvious if the xk have no limit points, but not, for example, if they are dense. ■
SECTION 31. DERIVATIVES ON THE LINE 411 Example 31.3. Consider again the distribution function F in Example 31.1. Here μ(Α) = P[X&A]. Since F is singular, μ and Λ have disjoint supports. This fact has an interesting direct probabilistic proof. For χ in the unit interval, let άλ(χ\ d2(x\... be the digits in its nontermi- nating dyadic expansion, as in Section 1. If (k2~",(k+ 1)2""] is the dyadic interval of rank η consisting of the reals whose expansions begin with the digits u,,...,u„, then, by (31.15), (31.22) μ[ψ, k + 1 2" = μ[χ:ά;(χ) = uhi<n] = p ·· P»„ If the unit interval is regarded as a probability space under the measure μ, then the d^x) become random variables, and (31.22) says that these random variables are independent and identically distributed and μ[χ: d;(x) = 0] = p0, μ[χ: d,(x)= l]=p]. Since these random variables have expected value pv the strong law of large numbers implies that their averages go to px with probability 1: (31.23) μ :(0,1]:НпД Ε4·(*)=Ρι " ί-ι = 1. On the other hand, by the normal number theorem, (31.24) = (0,1]:ИпДЕ4(*) = 1 » П I = 1 Z = 1. (Of course, (31.24) is just (31.23) for the special case p0 = /?, = \; in this case μ and Λ coincide in the unit interval.) If /?, Φ \, the sets in (31.23) and (31.24) are disjoint, so that μ and Λ do have disjoint supports. It was shown in Example 31.1 that if F'(x) exists at all (0 <x < 1), then it is 0. By part (i) of Theorem 31.4 the set where F'(x) fails to exist therefore has μ,-measure I; in particular, this set is uncountable. ■ In the singular case, according to Theorem 31.5, F' vanishes on a support of Λ. It is natural to ask for the size of F' on a support of μ. If В is the x-set where F has a finite derivative, and if (31.21) holds, then by Theorem 31.4, μ[χ<ΕΒ: F'(x) < η] = μ[χ <e Β Γι 5μ: F'(x) < η] < «Λ(5μ) = 0, and hence μ(Β) = 0. The next theorem goes further. Theorem 31.6. Suppose that F and μ are related by (31.18) and that μ and Λ have disjoint supports. Then, except for χ in a set of μ-measure 0,
412 DERIVATIVES AND CONDITIONAL PROBABILITY If μ has finite support, then clearly FD(x) = oo if μ{χ) > 0, while DF(x) = 0 for all x. Since F is continuous from the right, FD and DF play different roles.1 Proof. Let An be the set where FD(x)<n. The problem is to prove that μiAn) = 0J and by (31.21) it is enough to prove that μiAnnS|1) = ϋ. Further, by Theorem 12.3 it is enough to prove that μ(Κ) = 0 if К is a compact subset of An Π 5μ. Fix e. Since XiK) = 0, there is an open G such that K<zG and X(G) < e. If χ e K, then χ ^Λη, and by the definition of FD and the right-continuity of F, there is an open interval Ix for which xe/,cG and μ(Ιχ) <ηλ(Ιχ). By compactness, К has a finite subcover Ix ,..., Ix . If some three of these have a nonempty intersection, one of them must be contained in the union of the ether two. Such superfluous intervals can be removed from the subcover, and it is therefore possible to assume that no point of К lies in more than two of the / . But then μ(Κ)<μί^ύ^Σβ(ΐχ,)^ηΣλ(ΐΧι) ^ i l i i < 2ηλί (J /r.) < 2nX{G) < 2ne. ι Since 6 was arbitrary, Л(/0 = О, as required. ■ Example 31.4. Restrict the F of Examples 31.1 and 31.3 to (0,1), and let g be the inverse. Thus F and g are continuous, strictly increasing mappings of (0,1) onto itself. If A = [x e (0,1): F(;c) = 0], then k(A) = 1, as shown in the examples, while μ(Α) = 0. Let Я be a set in (0,1) that is not a Lebesgue set. Since H-A is contained in a set of Lebesgue measure 0, it is a Lebesgue set; hence HQ = HC\A is not a Lebesgue set, since otherwise H = H0iJ(H-A) would be a Lebesgue set. If B = (0, x], then λg-l(B) = λ(0, Fix)] = Fix) = μ(β), and it follows that λg-'(β) = μ(β) for all Borel sets B. Since g~]H0 is a subset of g~lA and kig~lA) = μiΛ) = 0J g~lH0 is a Lebesgue set. On the other hand, if g~lH0 were a Borel set, H0 = F~\g~]H0) would also be a Borel set. Thus g~lH0 provides an example of a Lebesgue set that is not a Borel set.1 ■ Integrals of Derivatives Return now to the problem of extending part (ii) of Theorem 31.1, to the problem of characterizing those distribution functions F for which F' integrates back to F: (31.25) F(x)= Γ F'(t)dt. ·' — re +See Problem 31.8 For a different argument, see Problem 3.14.
SECTION 31. DERIVATIVES ON THE LINE 413 The first step is easy: If (31.25) holds, then F has the form (31.26) F{x) = f f(t)dt * — 00 for a nonnegative, integrable / (a density), namely / = F'. On the other hand, (31.26) implies by Theorem 31.3 that F'=f outside a set of Lebesgue measure 0, whence (31.25) foliows. Thus (31.25) holds if and only if F has the form (31.26) for some /, and the problem is to characterize functions of this form. The function of Example 31.1 is not among them. As observed earlier (see (31.3)), an F of the form (31.26) with / integrable is continuous. It has a still stronger property: For each e there exists a δ such that (31.27) (f(x)dx<e if\(A)<8. JA Indeed, if An = [x: f(x)>n], then A„10, and since / is integrable, the dominated convergence theorem implies that jA f(x)dx <e/2 for large n. Fix such an η and take δ = e/2n. If А(Л) < δ, then jAf(x)dx< ίΑ-Αηί(χ)<ά + la /(*)dx <ηλ(Α) + 6/2 <e. If "F is given by"(3l.26), then F(b) - F(a) = fabf(x)dx, and (31.27) has this consequence: For every e theie exists a δ such that for each finite collection [a,·, bj], i = 1,..., k, of nonoverlappingt intervals, fc fc (31.28) Σ \ПЬ) -F(«,)| <€ if Σ {bt-at)<8. A function F with this property is said to be absolutely continuous* A function of the form (31.26) (/ integrable) is thus absolutely continuous. A continuous distribution function is uniformly continuous, and so for every e there is a δ such that the implication in (31.28) holds provided that к = 1. The definition of absolute continuity requires this to hold whatever к may be, which puts severe restrictions on F. Absolute continuity of F can be characterized in terms of the measure μ: Theorem 31.7. Suppose that F and μ are related by (31.18). Then F is absolutely continuous in the sense of (31.28) if and only if μ(Α) = 0 for every A for which λ(Α) = 0. f Intervals are nonoverlapping if their interiors are disjoint. In this definition it is immaterial whether the intervals are regarded as closed or open or half-open, since this has no effect on (31.28). The definition applies to all functions, not just to distribution functions If F is a distribution function as in the present discussion, the absolute-value bars in (31.28) are unnecessary
414 DERIVATIVES AND CONDITIONAL PROBABILITY Proof. Suppose that F is absolutely continuous and that Л(Л) = 0. Given e, choose δ so that (31.28) holds. There exists a countable disjoint union В = \JkIk of intervals such that A cB and Λ(β)<δ. By (31.28) it follows that μ((J £„]/*.) <e for each η and hence that μ(Α) <μ(Β) <e. Since e was arbitrary, μ(Α) = 0. If F is not absolutely continuous, then there exists an e such that for every δ some finite disjoint union A of intervals satisfies λ(Α)<δ and μ(Α)>€. Choose An so that λ(Αη) <n~2 and μ(Αη)>β. Then A(limsup„ An) = 0 by the first Borel-Canielli lemma (Theorem 4.3, the proof of which does not require Ρ to be a probability measure or even finite). On the other hand, /x(limsup„ An) > e > 0 by Theorem 4.1 (the proof of which applies because μ is assumed finite). ■ This result leads to a characterization of indefinite integrals. Theorem 31.8. A distribution function F(x) has the form /1„/(ί)ώ for an integrable f if and only if it is absolutely continuous in the sense of (31.28). Proof. That an F of the form (31.26) is absolutely continuous was proved in the argument leading to the definition (31.28). For another proof, apply Theorem 31.7: if F has this form, then λ(Α) = 0 implies that μ(Α) = fAf(t)dt = 0. To go the other way, define for any distribution function F (31.29) Fac(x)=f F'(t)dt and (31.30) Fs(x)=F(x)-Fac(x). Then Fs is right-continuous, and by (31.9) it is both nonnegative and nondecreasing. Since Fae comes form a density, it is absolutely continuous. By Theorem 31.3, Fa'e = F' and hence Fs' = 0 except on a set of Lebesgue measure 0. Thus F has a decomposition (31.31) F(x)=Fac(x)+Fs(x), where F3C has a density and hence is absolutely continuous and Fs is singular. This is called the Lebesgue decomposition. Suppose that F is absolutely continuous. Then Fs of (31.30) must, as the difference of absolutely continuous functions, be absolutely continuous itself. If it can be shown that Fs is identically 0, it will follow that F = Fae has the required form. It thus suffices to show that a distribution function that is both absolutely continuous and singular must vanish.
SECTION 31. DERIVATIVES ON THE LINE 415 If a distribution function F is singular, then by Theorem 31.5 there are disjoint supports 5μ and SA. But if F is also absolutely continuous, then from Λ(5μ) = 0 it follows by Theorem 31.7 that /χ(5μ) = 0. But then /х(Л') = О, and so F(x) = 0. Ш This theorem identifies the distribution functions that are integrals of their derivatives as the absolutely continuous functions. Theorem 31.7, on the other hand, characterizes absolute continuity in a way that extends to spaces Ω without the geometric structure of the line necessary to a treatment involving distribution functions and ordinary derivatives.* The extension is studied in Section 32. Functions of Bounded Variation The remainder of this section briefly sketches the extension of the preceding theory to functions that are not monotone. The results arc for simplicity given only for a finite interval [a,b] and for functions F on [a,b] satisfying F(a) = 0. If F(x) = j*f(t)dt is an indefinite integral, where / is integrable but not necessarily nonnegative, then F(x) = j*f+(t)dt- j*f~(t)dt exhibits F as the difference of two nondecreasing functions. The problem of characterizing indefinite integrals thus leads to the preliminary problem of characterizing functions representable as a difference of nondecreasing functions. Now F is said to be of bounded variation over [a,b] if supA \\F\\& is finite, where ||Л1л is defined by (31.5) and Δ ranges over all partitions (31.4) of [a,b\ Clearly, a difference of nondecreasing functions is of bounded variation. But the converse hold.s as well: For every finite collection Γ of nonoverlapping intervals [*,, y,·] in [a,b], put ΡΓ=Σ(ΡΜ-η*,))+> Nr=L(FM-n*,)V- Now define P(x) = sup Pr, N(x) = sup Nr, г г where the suprsma extend over partitions Г of [a, x]. If F is of bounded variation, then P(x) and Mjc) are finite. For each such Г, Pr = Nr + F(x). This gives the inequalities Pr<N(x)+F(x), P(x)>Nr + F(x), which in turn lead to the inequalities P(x)<N(x)+F(x), P(x)itN(x)+F(x). Thus (31.32) F(x) = P(x)-N(x) gives the required representation: A function is the difference of two nondecreasing functions if and only if it is of bounded variation. tTheorems 31.3 and 31 8 do have geometric analogues in Rk; see Rudin2, Chapter 8.
416 DERIVATIVES AND CONDITIONAL PROBABILITY If Tr = Pr + Nr, then Tr = ΣΙ Я у,-) - F(x;)\. According to the definition (31.28), F is absolutely continuous if for every e there exists a δ such that Tr < e whenever the intervals in the collection Γ have total length less than δ. If F is absolutely continuous, take the δ corresponding to an e of 1 and decompose [a,b] into a finite number, say n, of subintervals [u]_uuj] of lengths less than δ. Any partition Δ of [a, b] can by the insertion of the Uj be split into η sets of intervals each of total length less than 8, and it follows* that ||Л|д<и. Therefore, an absolutely continuous function is necessarily of bounded variation. An absolutely continuous F thus has a representation (31.32). It follows by the definitions that P(y) - P(x) is at most supr Tv, where Г ranges over the partitions of [x,y]. If [х,-,У;] are nonoverlapping intervals, then L(P(y;) -Я*,)) is at most suprrr, where now Г ranges over the collections of intervals that partition each of the [*,·, y,] Therefore, if F is absolutely continuous, there exists for each e a δ such that Е(у,·-*,) <δ implies that L(P(y,) -P(x,)) <e. In other words, Ρ is absolutely continuous. Similarly, N is absolutely continuous. Thus an absolutely continuous F is the difference of two nondecreasing absolutely continuous functions. By Theorem 31.8, each of these is an indefinite integral, which implies that F is an indefinite integral as well: For an F on [a,b] satisfying F(a) = 0, absolute continuity is a necessary and sufficient condition for F to be an indefinite integral—to have the form F(x) = j*f(t)dt for an integrablef. PROBLEMS 31.1. Extend Examples 31.1 and 31.3: Let Ρο>···>Α—ι be nonnegative numbers adding to 1, where r>2; suppose there is no i such that p,= l. Let Χγ, X2,... be independent, identically distributed random variables such that P[Xn = i]=p„ 0</ <r, and put X =L*=lX„r~n. Let F be the distribution function of X. Show that F is continuous. Show that F is strictly increasing over the unit interval if and only if all the p, are strictly positive. Show that F(x)=x for 0<*<1 if p, = r~l and that otherwise F is singular; prove singularity by extending the arguments both of Example 31.1 and of Exampie 31.3. What is the analogue of (31.17)? 31.2. Τ In Problem 31.1 take r = 3 and p0 =p2 = \, pl = 0. The corresponding F is called the Cantor function. The complement in [0,1] of the Cantor set (see Problems 1.5 and 3.16) consists of the middle third (},f), the middle thirds (|, f) and (|, |), and so on. Show that F is \ on the first of these intervals, \ on the second, 4 on the third, and so on. Show by direct argument that F' = 0 except on a set of Lebesgue measure 0. 31.3. A real function / of a real variable is a Lebesgue function if [*: fix) <a] is a Lebesgue set for each a. (a) Show that, if /, is a Borel function and /2 is a Lebesgue function, then the composition /,/2 is a Lebesgue function. (b) Show that there exists a Lebesgue function /, and a Lebesgue (even Borel, even continuous) function /2 such that /,/2 is not a Lebesgue function. Hint: Use Example 31.4. This uses the fact that ||F||i cannot decrease under passage to a finer partition.
SECTION 31. DERIVATIVES ON THE LINE 417 31.4. Τ An arbitrary function / on (0,1] can be represented as a composition of a Lebesgue function /, and a Borel function /2. For χ in (0,1], let dn(x) be the nth digit in its nonterminating dyadic expansion, and define /2(*) = T^=i2dn(x)/3". Show that /2 is increasing and that /2(0,1] is contained in the Cantor set. Take /,(*) to be fif^ix)) if χ e/2(0,1] and 0 if χ e (0,1] -/2(0,1]. Now show that /=/ι/2· 31.5. Let r,,r2)... be an enumeration of the rationale in (0,1) and put F(x) = Σ* r <,x^-~k- Define φ by (14.5) and prove that it is continuous and singular. 31.6. Suppose that μ and F are related by (31.18). If F is not absolutely continuous, then μ(Α) > 0 for some set A of Lebesgue measure 0. It is an interesting fact, however, that almost all translates of A must have μ-measure 0. From Fubini's theorem and the fact that λ is invariant under translation and reflection through 0, show that, if λ(/1) = ϋ and μ is σ-finite, then μ(Α +*) = 0 for χ outside a set of Lebesgue measure 0. 31.7. 17.4 31.6 Τ Show that F is absolutely continuous if and only if for each Borel set Α. μ(Α + χ) is continuous in x. 31.8. Let F*(x) = 11тв_0 mf(F(u)-F(u))/(u- и), where the infimum extends over и and υ such that и <x < υ and υ — и <δ. Define F*(x) as this limit with the infimum replaced by a supremum. Show that in Theorem 31.4, F' can be leplaced by F* in part (i) and by F+ in part (ii). Show that in Theorem 31.6, FD can be replaced by F^ (note that F+(x) < FD(x)). 31.9. Lebesgue's density theorem. A point χ is a density point of a Borel set A if Л((и,υ]Γ\Α)/(υ — и) -»1 as и Τ χ and υ Ι χ. From Theorems 31.2 and 31.4 deduce that almost all points of A are density points. Similarly, А((и,и]п Α)/(υ — и) -» 0 almost everywhere on Ac. 31.10. Let /: [a,b]->Rk be an aic; f(t) = (f1U\...,fk(ty). Show that the arc is rectifiable if and only if each /, is of bounded variation over [a, b]. 31.11. Τ Suppose that F is continuous and nondecreasing and that F(0) = 0, /41) = 1. Then fix) = (x, Fix)) defines an arc /: [0,1] -> R2. It is easy to see by monotonicity that the arc is rectifiable and that, in fact, its length satisfies L(f) < 2. It is also easy, given e, to produce functions F for which L(f) > 2 — e. Show by the arguments in the proof of Theorem 31.4 that L(/)= 2 if F is singular. 31.12. Suppose that the characteristic function of F satisfies limsup,_„I^O)|= 1. Show that F is singular. Compare the lattice case (Problem 26.1). Hint: Use the Lebesgue decomposition and the Rieman η-Lebesgue theorem. 31.13. Suppose that X{,X2,... are independent and assume the values ±1 with probability \ each, and let X= TTn = \Xn/2n. Show that X is uniformly distributed over [-1, +1]. Calculate the characteristic functions of X and Xn and deduce (1.40). Conversely, establish (1.40) by trigonometry and conclude that X is uniformly distributed over [-1, +1].
418 DERIVATIVES AND CONDITIONAL PROBABILITY 31.14. (a) Suppose that Xl,X2,... are independent and assume the values 0 and 1 with probability \ each. Let F and G be the distribution functions of m = i*2„-i/2 and ϊ?π = ιΧ2π/22η. Show that F and G are singular but that F * G is absolutely continuous. (b) Show that the convolution of an absolutely continuous distribution function with an arbitrary distribution function is absolutely continuous. 31.15. 31.2 Τ Show that the Cantor function is the distribution function of Σ?, = ιΧπ/3", where the Xn are independent and assume the values 0 and 2 with probability \ each. Express its characteristic function as an infinite product. 31.16. Show for the F of Example 31.1 that FD(l) = со and Df(0) = 0 if p0< {. From (31 17) deduce that pD{x) = ^ and DF(x) = 0 for all dyadic rationale x. Analyze the case p0 > j and sketch the graph 31.17. 6.14 Τ Let F be as in Example 31.1, and let μ be the corresponding probability measure on the unit interval. Let dn(x) be the nth digit in the nonterminating binary expansion of x, and let sn(x) = Lk1^idk(x). If /„(*) is the dyadic interval of order η containing x, then (31.33) -liogM(/„(i)) = -(l-^)logp0-^logp1. (a) Show that (31.33) converges on a set of μ-measure l to the entropy h = —p0 log p0 -p, log px. From the fact that this entropy is less than log2 if p0 Φ \, deduce in this case that on a set of μ-measure l, F does not have a finite derivative. (b) Show that (31.33) converges to - jlog p() - {log p, on a set of Lebesgue measure 1. If ρ0Φ \ this limit exceeds log2 (arithmetic versus geometric means), and so μ(Ιη(χ))/2~η -> 0 except on a set of Lebesgue measure 0. This does not prove that F(x) exists almost everywhere, but it does show that, except for χ in a set of Lebesgue measure 0. if F'(x) does exist, then it is 0. (c) Show that, if (31.33) converges to /, then (31.34) .im^M: la>1/^' K ' „ {!-") \0 if«<//log2. If (31.34) holds, then (roughly) F satisfies a Lipschitz condition of (exact) order //log 2. Thus F satisfies a Lipschitz condition of order h/log 2 on a set of μ-measure 1 and a Lipschitz condition of order (- \\о%рй - ylogp,)/log2 on a set of Lebesgue measure 1. 31.18. van der Waerden's continuous, nowhere differentiable function is /(*) = E£=0flA(jc), where a0(x) is the distance from χ to the nearest integer and ak(x) = 2~ka0(2kx). Show by the Weierstrass M-test that / is continuous. Use (31.8) and the ideas in Example 31.1 to show that / is nowhere differentiable. +A Lipschitz condition of order a holds at χ if F{x + h) - F{x) = 0{\h\a) as h - 0, for a > 1 this implies F'(jt) = o, and for 0 <a < 1 it is a smoothness condition stronger than continuity and weaker than differentiability.
SECTION 32. THE RADON-N1KODYM THEOREM 419 31.19. Show (see (31.31)) that (apart from addition of constants) a function can have only one representation /*", + F2 with F{ absolutely continuous and F2 singular. 31.20. Show that the Fs in the Lebesgue decomposition can be further split into Fd + Fcs, where Fa is continuous and singular and Fd increases only in jumps in the sense that the corresponding measure is discrete. The complete decomposition is then F = Fac + Fcs + Fd. 31.21. (a) Suppose that *1-<*2< ··· and Ln\F(xn)\ = <». Show that, if F assumes the value 0 in each interval (xn, xn + i), then it is of unbounded variation. (b) Define F over [0,1] by F(0) = 0 and F(x) = xa sin x~l for χ > 0. For which values of a is F of bounded variation? 31.22. 14.4 Τ If / is nonnegative and Lebesgue integrable, then by Theorem 31.3 and (31.8), except for χ in a set of Lebesgue measure 0, (31-35) ^fjO)dt-f(x) if и < χ < υ, и < υ, and и, υ -> χ. There is an analogue in which Lebesgue measure is replaced by a general probability measure μ: If / is nonnegative and integrable with respect to μ, then as h |0, <31·36> u(x I r + hll /('WO "»/(*) μ(Χ -П,χ + П\ J{x-h,x+h] on a set of μ-measure l. Let F be the distribution function corresponding to μ, and put <p(u) = inff*: и < F(x)] for 0 < и < 1 (see (14.5)). Deduce (31.36) from (31.35) by change of variable and Problem 14.4. SECTION 32. THE RADON-NIKODYM THEOREM If / is a nonnegative function on a measure space (Ω, $*~,μ), then i>(A) = {Α/άμ defines another measure on !F. In the terminology of Section 16, ν has density f with respect to μ; see (16.11). For each A in &, μ(Α)=0 implies that v(A) = 0. The purpose of this section is to show conversely that if this last condition holds and и and μ are σ-finite on $*~, then ν has a density with respect to μ. This was proved for the case (Λ',^',Λ) in Theorems 31.7 and 31.8. The theory of the preceding section, although illuminating, is not required here. Additive Set Functions Throughout this section, Ш, $*~) is a measurable space. All sets involved are assumed as usual to lie in !F.
420 DERIVATIVES AND CONDITIONAL PROBABILITY An additive set function is a function φ from & to the reals for which (32-1) <p(\jAn) = L<P(An) η η if Al>A2,--- is a finite or infinite sequence of disjoint sets. A set function differs from a measure in that the values φ(Α) may be negative but must be finite—the special values + °° and — <» are prohibited. It will turn out that the series on the right in (32.1) must in fact converge absolutely, but this need not be assumed. Note that <p(0) = 0. Example 32.1. If μ, and μ2 are finite measures, then φ(Α) = μχ(Α) — μ2(Α) is an additive set function. It will turn out that the general additive set function has this form. A special case of this if φ(Α) = {Α/άμ, where / is integrable (not necessarily nonnegative). ■ The proof of the main theorem of this section (Theorem 32.2) requires certain facts about additive set functions, even though the statement of the theorem involves only measures. Lemma 1. If Eu\ Ε or Eu I E, then cp(Eu) -»φ(Ε). Proof. If Ε„Τ Ε, then <p(£) = <p(£, U U:=1(£u + 1-£„)) = <p(£,) + ΪΖ-ΜΕ„ + ι ~ Ю = Hm Γφ(£,) + К11МЕи + 1 - Еи)] = lim, <p(EL) by (32.1). If Eu i Ε, then Ecu Τ £c, and hence <p(Eu) = φ(Ω) - φ(ΕΖ)->φ(Ω) - <p(Ec) = <p(E). Ш Although this result is essentially the same as the corresponding ones for measures, it does require separate proof. Note that the limits need not be monotone unless φ happens to be a measure. The Hahn Decomposition Theorem 32.1. For any additive set function φ, there exist disjoint sets A + and A ~ such that A + UA~= Ω, cp(E) > 0 for all Ε in A +, and φ(Ε) < 0 for all Ε in A ~. A set A is positive if φ(Ε)>0 for £сЛ and negative if <p(E) <0 for £сЛ. The A+ and A~ in the theorem decompose Ω into a positive and a negative set. This is the Hahn decomposition. If φ(Α) = ΙΑ{άμ (see Example 32.1), the result is easy: take A + = [/> 0] and A~=[f<0].
SECTION 32. THE RADON-N1KODYM THEOREM 421 Proof. Let a = sup[<p(/l): Ле^]. Suppose that there exists a set A + satisfying φ(Α+) = α (which implies that a is finite). Let Α~=Ω~Α+. If А сЛ+ and φ(Α)<0, then φ(Α + -Α)> a, an impossibility; hence A+ is a positive set. If A <zA ~ and φ(Α) > 0, then φ(Α + L)A) > a, an impossibility; hence A ~ is a negative set. It is therefore only necessary to construct a set A + for which φ(Α +) = a. Choose sets An such that φ(Αη)~>α, and let Л= (J nAn. For each « consider the 2" sets β„; (some perhaps empty) that are intersections of the iorm Г\пк = 1А'к, where each A'k is either Ak or else A - Ak. The collection &n = t^m: 1 ^' ^ 2"] of these sets partitions A. Clearly, 9ёп refines &„_{. each Bnj is contained in exactly one of the β„_,;. Let C„ be the union of those Bni in &„ for which <ρ(β„,) > 0. Since A„ is the union of certain of the β,,·, it follows that <рЫ„) < <р(С„). Since the partitions (&и(&г,... are successively finer, m <η implies that (Cm U · · · U C„_, U C„) - (Cm U · · · U C„_i) is the union (perhaps empty) of certain of the sets Bni; the Bni in this union must satisfy <p(Bni) > 0 because they are contained in Cn. Therefore, <p(Cm U · · · U C„_,) < <р(Сш и · · · U C„), so that by induction φ(Α„) < <p(Cm) < «p(Cm U · · · U C„). If Dm - (J ;.иСя1 then by Lemma 1 (take Et = C;n U · · · UCm+,) <p(Am) <«p(DJ. Let Л + = D; = 1Dm (note that Л + --= limsup„C„), so that Dm I A+. By Lemma 1, a = limm <р(Лт) < limm <p(Dm) = <р(Л + ). Thus /ί+ does have maximal <p-value. ■ If φ+(Α) = φ(Α ί)Α + ) and φ~(Α)= -φ(Α пА~), then φ+ and <p" are finite measures. Thus (32.2) φ(Α)=φ + (Α)-φ~(Α) represents the set function φ as the difference of two finite measures having disjoint supports. If Ε(zA, then φ(Ε) <φ + (Ε) <φ+(Α), and there is equality if E=AnA+. Therefore, φ+(Α) = sup£.C/4 φ(Ε). Similarly, φ~(Α) = -ΊηίΕ(ΖΑφ(Ε). The measures φ+ and φ" are called the upper and lower variations of φ, and the measure |<p| with value φ+(Α) + φ~(Α) at A is called the total variation. The representation (32.2) is the Jordan decomposition. Absolute Continuity and Singularity Measures μ and и on Ш, &") are by definition mutually singular if they have disjoint supports—that is, if there exist sets 5μ and S„ such that μ(Ω-5μ)=0, *(Ω-5„)=0, StinS„ = 0. (32.3)
422 DERIVATIVES AND CONDITIONAL PROBABILITY In this case μ is also said to be singular with respect to ν and ν singular with respect to μ. Note that measures are automatically singular if one of them is identically 0. According to Theorem 31.5 a finite measure on R1 with distribution function F is singular with respect to Lebesgue measure in the sense of (32.3) if and only if F'(x) = 0 except on a set of Lebesgue measure 0. In Section 31 the latter condition was taken as the definition of singularity, but of course it is the requirement of disjoint supports that can be generalized from Rl to an arbitrary Ω. The measure ν is absolutely continuous with respect to μ if for each A in &, μ(Α) = 0 implies v(A) = 0. In this case ν is also said to be dominated by μ, and the relation is indicated by ν «: μ. If ν «: μ and μ<~-ν, the measures are equivalent, indicated by ν = μ. A finite measure on the line is by Theorem 31.7 absolutely continuous in this sense with respect to Lebesgue measure if and only if the corresponding distribution function F satisfies the condition (31.28). The latter condition, taken in Section 31 as the definition of absolute continuity, is again not the one that generalizes from Rl to Ω. There is an e-8 idea related to the definition of absolute continuity given above. Suppose that for every e there exists a δ such that (32.4) v{A)<€ ϋμ{Α)<δ. If this condition holds, μ(Α) = 0 implies that v(A)<e for all e, and so y«)ii. Suppose, on the other hand, that this condition fails and that ν is finite. Then for some e there exist sets An such that μ(Αη) <η~2 and v(A„)>€. If Л = Нтьир„Л„, then μ(Α) = 0 by the first Borel-Cantelli lemma (which applies to arbitrary measures), but v(A)>e> 0 by the right- hand inequality in (4.9) (which applies because ν is finite). Hence ν «: μ fails, and so (32.4) follows if ν is finite and ν «с μ. If ν is finite, in order that ν <c μ it is therefore necessary and sufficient that for every e there exist α δ satisfying (32.4). This condition is not suitable as a definition, because it need not follow from ν <c μ if ν is infinite.* The Main Theorem If v(A) = \Α$άμ, then certainly ν <^μ. The Radon-Nikodym theorem goes in the opposite direction: Theorem 32.2. // μ and ν are σ-finite measures such that ν <^μ, then there exists a nonnegative f, a density, such that u(A) = ίΑ}άμ for all A e &~. For two such densities f and g, μ[fΦg] = 0. +See Problem 32.3.
SECTION 32. THE RADON-NIKODYM THEOREM 423 The uniqueness of the density up to sets of μ -measure 0 is settled by Theorem 16.10. It is only the existence that must be proved. The density / is integrable μ if and only if ν is finite. But since / is integrable μ over A if v(A)<°o, and since ν is assumed σ-finite, /<<» except on a set of μ,-measure 0; and / can be taken finite everywhere. By Theorem 16.11, integrals with respect to ν can be calculated by the formula (32.5) \hdv= f h/άμ. JA 3 A The density whose existence is to be proved is called the Radon-Nikodym derivative of ν with respect to μ and is often denoted dv/άμ. The term derivative is appropriate because of Theorems 31.3 and 31.8: For an absolutely continuous distribution function F on the line, the corresponding measure μ has with respect to Lebesgue measure the Radon-Nikodym derivative F'. Note that (32.5) can be written (32.6) \hdv= \Η^άμ. JA JA fl'X Suppose that Theorem 32.2 holds for finite μ and ν (which is in fact enough for the probabilistic applications in the sections that follow). In the σ-finite case there is a countable decomposition of Ω into J^sets An for which μ(Αη) and v(An) are both finite. If (32.7) μ„{Α)=μ{Α ПА„), vn(A) = v{A ПА„), then ν ·« μ implies vn «: μη, and so vn(A) = jAfn άμη for some density /„. Since u„ has density IA with respect to μ (Example 16.9), η η Λ η А = j ΣυΑηάμ· J л Thus LnfnIA is the density sought. It is therefore enough to treat finite μ and v. This requires a preliminary result. Lemma 2. // μ and ν are finite measures and are not mutually singular, then there exists a set A and a positive e such that μ(Α) > 0 and €μ(Ε) < ι>(Ε) for all Ε czA.
424 DERIVATIVES AND CONDITIONAL PROBABILITY Proof. Let Α*ΌΑ~ be a Hahn decomposition for the set function ν - η~ V; put Μ = U nA+, so that Mc = f\ nA~. Since Mc is in the negative set A~ for ν - η~χμ, it follows that v(Mc) <η~ιμ(Μ€); since this holds for all n, v(Mc) = 0. Thus Μ supports v, and from the fact that μ and ν are not mutually singular it follows that Mc cannot support μ—that is, that μ(Μ) must be positive. Therefore, μ(Α*)>0 for some n. Take A =A* and e=n~l. Ш Example 32.2. Suppose that Ш, &~) = (R\&1), μ is Lebesgue measure Λ, and v(a, b] = F(b) - F(a). If ν and Λ do not have disjoint supports, then by Theorem 31.5, Л[лг: F'(x) > 0] > 0 and hence for some e, A = [x: F'(x) > e] satisfies λ(Α) > 0. If Ε = (a, b] is a sufficiently small interval about an χ in A, then ν(Ε)/λ(Ε) = (F(b) - F(a))/(b -a)>e, which is the same thing as e\(E)<v(E). Я Thus Lemma 2 ties in with derivatives and quotients ν(Ε)/μ(Ε) for "small" sets E. Martingale theory links Radon-Nikodym derivatives with such quotients; see Theorem 35.7 and Example 35.10. Proof of Theorem 32.2. Suppose that μ and ν are finite measures satisfying ν ■« μ. Let ^ be the class of nonnegative functions g such that \^άμ < v(E) for all E. If g and g' lie in ^, then max(g, g') also lies in £ because ( mаx(g,g')dμ = f gdμ + f g'άμ JE JEn[g>e·) JEn[g'>g] <v(En[g>g'])+V(En[g'>g])=v(E). Thus ^ is closed under the formation of finite maxima. Suppose that functions gn lie in £ and gn Τ g. Then \^άμ = lim„ fcgn άμ < v(E) by the monotone convergence theorem, so that g lies in £. Thus £ is closed under nondecreasing passages to the limit. Let α = sup jgdμ for g ranging over £ (a <, v(£l)). Choose gn in £ so that fgndμ>α- n~l. If /„ = max(g,,..., gn) and /=lim/„, then / lies in cf and {/άμ = lim„//„ rf/x > limn fgn άμ = a. Thus / is an element of c0 for which J/άμ is maximal. Define vK by ^(E) = {Ε/άμ and *f by vt{E) = v(E) - i>JE). Thus (32.8) v(E) = Vac(E) +»t(E) = f fd» + »t(E). J Ε Since / is in <£, v% as well as vac is a finite measure—that is, nonnegative. Of course, vac is absolutely continuous with respect to μ.
SECTION 31. DERIVATIVES ON THE LINE 425 Suppose that vs fails to be singular with respect to μ. It then follows from Lemma 2 that there are a set A and a positive e such that μ(Α) > 0 and €μ(Ε) < vs(E) for all Ε <zA. Then for every Ε \ (f + eIA) άμ = ί/άμ +€μ(ΕΠΑ) < ί/άμ + vs(Ei)A) JE JE JE = ( /άμ + ν,(ΕΠΑ) + [ /άμ ■Έπα je-a = и(ЕГ)А) + [ /άμ<ν(ΕηΑ)+ν(Ε-Α) JE-A = V{E). In other words, f + eIA lies in cf; since ((/+€ΐΑ)άμ = a + €μ(Α)>a, this contradicts the maximality of /. Therefore, μ and vs are mutually singular, and there exists an 5 such that vt(S) = /x(5c) = 0. But since ν <«c μ, us(Sc) < v(Sc) = 0, and so *,(Ω) = 0. The rightmost term in (32.8) thus drops out. ■ Absolute continuity was not used until the last step of the proof, and what the argument shows is that ν always has a decomposition (32.8) into an absolutely continuous part and a singular part with respect to μ. This is the Lebesgue decomposition, and it generalizes the one in the preceding section (see (31.31)). PROBLEMS 32.1. There are two ways to show that the convergence in (32.1) must be absolute: Use the Jordan decomposition. Use the fact that a series converges absolutely if it has the same sum no matter what order the terms are taken in. 32.2. If A + UA~ is a Hahn decomposition of ψ, there may be other ones /l^U/lf. Construct an example of this. Show that there is uniqueness to the extent that <р(А+ьА?) = <р(А-ьА^) = 0. 32.3. Show that absolute continuity does not imply the e-δ condition (32.4) if ν is infinite. Hint. Let У consist of all subsets of the space of integers, let ν be counting measure, and let μ have mass n'2 at n. Note that μ is finite and ν is σ-finite. 32.4. Show that the Radon-Nikodym theorem fails if μ is not σ-finite, even if ν is finite. Hint: Let У consist of the countable and the cocountable sets in an uncountable Ω, let μ be counting measure, and let v(A) be 0 or 1 as A is countable or cocountable.
426 DERIVATIVES AND CONDITIONAL PROBABILITY 32.5. Let μ be the restriction of planar Lebesgue measure λ2 to the σ-field &~= {AXR1: Ae &l) of vertical strips. Define ν on & by v(A XR1)=\2(A X (0,1)). Show that ν is absolutely continuous with respect to μ but has no density. Why does this not contradict the Radon-Nikodym theorem? 32.6. Let μ, I*, and ρ be σ-finite measures on (Ω, &\ Assume the Radon-Nikodym derivatives here are everywhere nonnegative and finite. (a) Show that ν «: μ and μ <κ ρ imply that ν <κ ρ and dv _ dv άμ dp άμ dp (b) Show that v= μ implies dv _. MM ~' (c) Suppose that μ <s:p and ν «■[>, and let A be the set where dv/dp > 0 = άμ/dp. Show that ν <s: μ if and only if p(A) = 0, in which case dv _ dv/dp Ίμ~ hdv-/dP>o\άμ/dp ■ 32.7. Show that there is a Lebesgue decomposition (32.8) in the σ-finite as well as the finite case. Prove that it is unique. 32.8. The Radon-Nikodym theorem holds if μ is σ-finite, even if ν is not. Assume at first that μ is finite (and ν <κ μ). (a) Let Θ be the class of Onsets) В such that μ(£) = 0 or v(E) = °° for each Ε <ζΒ. Show that 9$ contains a set β0 of maximal μ-measure. (b) Let if be the class of sets in Ω0 = Bq that are countable unions of sets of finite i/-measure. Show that if contains a set C0 of maximal μ-measure. Let D0 = Ω0 — C0. (c) Deduce from the maximality of B0 and Q that μ(Ω0) = v(D0) = 0. (d) Let vQ(A) = v(A Π Ω0). Using the Radon-Nikodym theorem for the pair μ, vQ, prove it for μ, ν. (e) Now show that the theorem holds if μ is merely σ-finite. (f) Show that if the density can be taken everywhere finite, then ν is σ-finite. 32.9. Let μ and ν be finite measures on (Ω, У), and suppose that 5го is a σ-field contained in .9*". Then the restrictions μ° and v° of μ and ν to 3го are measures on (Ω, 5го). Let vac,vs, j/°c, v° be, respectively, the absolutely continuous and singular parts of ν and v° with respect to a and μ". Show that "«(£) ^ "«(£)and "s° (£) ^ "s(£) f°r £ e •^ro· 32.10. Suppose that μ,ν,νη are finite measures on (Ω, ^) and that v(A)= Σπνπ(Α) for all Λ. Let vn(A) = fAfn άμ + v'n(A) and v(A) = {Α/άμ + v'(A) be the decompositions (32.8); here v' and v'n are singular with respect to μ. Show that /= Σ„/„ except on a set of μ-measure О and that ν'(Α) = Σ„ν'„(Α) for all A. Show that ν <s: μ if and only if i/„ ■« μ for all n.
SECTION 33. CONDITIONAL PROBABILITY 427 32.11. 32.2T Absolute continuity of a set function φ with respect to a measure μ is defined just as if φ were itself a measure: μ(Α) = 0 must imply that <p(A) = 0. Show that, if this holds and μ is σ-finite, then φ(Α)= \Α$άμ for some integrable /. Show that Α + =[ω: /(ω)>0] and Α~ = [ω: /(ω) < 0] give a Hahn decomposition for ψ. Show that the three variations satisfy <p+(A) = fAf+ άμ, φ~(Α) = fAf' άμ, and \φ\(Α) = fA\f\ άμ. Hint: To construct /, start with (32.2). 32.12. Τ A signed measure φ is a set function that satisfies (32.1) if AX,A2,... are disjoint and may assume one of the values +°° and — <» but not both. Extend the Hahn and Jordan decompositions to signed measures 32.13. 31.22T Suppose that μ and ν are a probability measure and a σ-finite measure on the line and that ν <κ μ. Show that the Radon-Nikodym derivative / satisfies v(x-h,x + h] lim — ; r4 =f(x) n_c μ(χ-η,jc-f h\ on a set of μ-measure l. 32.14. Find on the unit interval uncountably many probability measures μρ, 0 <p < 1, with supports Sp such that μμ{χ\ = 0 for each χ and ρ and the Sp are disjoint in pairs. 32.15. Let 3^0 be the field consisting of the finite and the cofinite sets in an uncountable Ω. Define φ on 5^ by taking ψ(Α) to be the number of points in A if A is finite, and the negative of the number of points in Ac if A is cofinite. Show that (32.1) holds (this is not true if Ω is countable). Show that there are no negative sets for ψ (except the empty set), that there is no Hahn decomposition, and that ψ does not have bounded range. SECTION 33. CONDITIONAL PROBABILITY The concepts of conditional probability and expected value with respect to a σ-field underlie much of modern probability theory. The difficulty in understanding these ideas has to do not with mathematical detail so much as with probabilistic meaning, and the way to get at this meaning is through calculations and examples, of which there are many in this section and the next. The Discrete Case Consider first the conditional probability of a set A with respect to another set B. It is defined of course by P(A\B) = P(A nB)/P(B), unless P(B) vanishes, in which case it is not defined at all. It is helpful to consider conditional probability in terms of an observer in possession of partial information.1 A probability space (Ω, &, P) describes As always, observer, information, know, and so on are informal, nonmathematical terms; see the related discussion in Section 4 (p. 57).
428 DERIVATIVES AND CONDITIONAL PROBABILITY the working of a mechanism, governed by chance, which produces a result ω distributed according to P; P(A) is for the observer the probability that the point ω produced lies in A. Suppose now that ω lies in В and that the observer learns this fact and no more. From the point of view of the observer, now in possession of this partial information about ω, the probability that ω also lies in A is P(A\B) rather than P(A). This is the idea lying back of the definition. If, on the other hand, ω happens to lie in Bc and the observer learns of this, his probability instead becomes P(A\BC). These two conditional probabilities can be linked together by the simple function (P(A\B) if ω<Ξβ, The observer learns whether ω lies in В or in Bc; his new probability for the event ω^Α is then just /(ω). Although the observer does not in general know the argument ω of /, he can calculate the value /(ω) because he knows which of В and Bc contains ω. (Note conversely that from the value /(ω) it is possible to determine whether ω lies in В от in Bc, unless P(A\B) = P(A\BC)—that is, unless A and В are independent, in which case the conditional probability coincides with the unconditional one anyway.) The sets В and Bc partition Ω, and these ideas carry over to the general partition. Let β,, β2,... be a finite or countable partition of il into J^sets, and let £ consist of all the unions of the Br Then £ is the σ-field generated by the Br For A in &, consider the function with values (33.2) β(ω)=Ρ(Α\Βί)-Ρ(^ί) if ω£β,, / = 1,2,.... If the observer learns which element Bt of the partition it is that contains ω, then his new probability for the event ω еЛ is /(ω). The partition {β,}, or equivalent^ the σ-field ^, can be regarded as an experiment, and to learn which β, it is that contains ω is to learn the outcome of the experiment. For this reason the function or random variable / denned by (33.2) is called the conditional probability of A given £ and is denoted P[A\\cf). This is written Ρ[Α\\^]ω whenever the argument ω needs to be explicitly shown. Thus P[A\\cf] is the function whose value on Bi is the ordinary conditional probability P(A\Bt). This definition needs to be completed, because P(A\B,) is not defined if F(B;) = 0. In this case P[A\№] will be taken to have any constant value on Bt; the value is arbitrary but must be the same over all of the set β,·. If there are nonempty sets B} for which Ρ(β,) = 0, P[A\\^] therefore stands for any one of a family of functions on Ω. A specific such function is for emphasis often called a version of the conditional
SECTION 33. CONDITIONAL PROBABILITY 429 probability. Note that any two versions are equal except on a set of probability 0. Example 33.1. Consider the Poisson process. Suppose that 0 < s < t, and let A=[Ns = 0] and β; = [Λ/, = /], /=0,1,.... Since the increments are independent (Section 23), Ρ(Λ|β,) = P[NS = 0]P[N, -NS = i]/P[N, = /], and since they have Poisson distributions (see (23.9)), a simple calculation reduces this to (33.3) ρ[Ν5 = 0\\^]ω = (\-γ)' ifftiefi,, /==0,1,2,.... Since / = Ν,(ω) on B;, this can be written (33.4) Ρ[Ν5 = 0№]ω = (\-Ί) . Here the experiment or observation corresponding to {β,·} or c0 determines the number of events—telephone calls, say—occurring in the time interval [0, t]. For an observer who knows this number but not the locations of the calls within [0, t], (33.4) gives his probability for the event that none of them occurred before time s. Although this observer does not known ω, he knows Nt(w), which is all he needs to calculate the right side of (33.4). ■ Example 33.2. Suppose that X0, Xx,... is a Markov chain with state space S as in Section 8. The events (33.5) [XQ = iQ,...,Xn = in] form a finite or countable partition of Ω as /0, ...,/„ range over 5. If c0n is the σ-field generated by this partition, then by the defining condition (8.2) for Markov chains, P[Xn + l =]'№„]„ =PinJ holds for ω in (33.5). The sets (33.6) [Xn = i] for / e 5 also partition Ω, and they generate a σ-field ^fn° smaller than £n. Now (8.2) also stipulates Р[Хп+1=]~\\^п°]ш=Рц for ω in (33.6), and the essence of the Markov property is that (33.7) P[Xn + l =/U4] =P[Xn + l =j\\J„°]. ■ The General Case If £ is the σ-field generated by a partition BX,B2,..., then the general element of £ is a disjoint union β,- U β, U · · ·, finite or countable, of certain of the Br To know which set β, it is that contains ω is the same thing
430 DERIVATIVES AND CONDITIONAL PROBABILITY as to know which sets in £ contain ω and which do not. This second way of looking at the matter carries over to the general σ-field £ contained in &. (As always, the probability space is Ш, &, P).) The σ-field ^ will not in general come from a partition as above. One can imagine an observer who knows for each G in £ whether ω e G or ω £ Gc. Thus the σ-field £ can in principle be identified with an experiment or observation. This is the point of view adopted in Section 4; see p. 57. It is natural to try and define conditional probabilities P[A\\cf] with respect to the experiment cf. To do this, fix an A in & and define a finite measure ν on £ by v(G) = P(AnG), Ge/ Then P(G) = 0 implies that v(G) = 0. The Radon-Nikodym theorem can be applied to the measures ν and Ρ on the measurable space (Ω, cf) because the first one is absolutely continuous with respect to the second.1 It follows that there exists a function or random variable /, measurable £ and integrable with respect to P, such that1 P(Ai)G) --= v(G) = fGfdP for all G in J. Denote this function / by P[A\\cf]. It is a random variable with two properties: (i) P[A\\<#] is measurable £ and integrable. (ii) P[A\\<#] satisfies the functional equation (33.8) [ P[A\\J]dP = P(AnG), Ge/ JG There will in general be many such random variables P[A\\cf], but any two of them are equal with probability 1. A specific such random variable is called a version of the conditional probability. If £ is generated by a partition BUB2,... the function / defined by (33.2) is measurable cf because [ω: /(ω) еЯ] is the union of those B, over which the constant value of / lies in H. Any G in £ is a disjoint union G = U kB,k, and P(A nG) = LkP(A\Blk)P(Blk), so that (33.2) satisfies (33.8) as well. Thus the general definition is an extension of the one for the discrete case. Condition (i) in the definition above in effect requires that the values of P[A\\cf] depend only on the sets in cf. An observer who knows the outcome of £ viewed as an experiment knows for each G in £ whether it contains ω or not; for each χ he knows this in particular for the set [ω': Р[А\\с?]ш. = х), +Let P0 be the restriction of Ρ to ^ (Example 10.4), and find on (fl, <?) a density / for ν with respect to P0. Then, for Cei, v{G) = jGfdP0 = jcfdP (Example 16.4). If g is another such density, then P[f*g] = P0lf*g] = 0.
SECTION 33. CONDITIONAL PROBABILITY 431 and hence he knows in principle the functional value Ρ[Α\\^]ω even if he does not know ω itself. In Example 33.1 a knowledge of /ν,(ω) suffices to determine the value of (33.4)—ω itself is not needed. Condition (ii) in the definition has a gambling interpretation. Suppose that the observer, after he has learned the outcome of cf, is offered the opportunity to bet on the event A (unless A lies in £, he does not yet know whether or not it occurred). He is required to pay an entry fee of P[A\\cf] units and will win 1 unit if A occurs and nothing otherwise. If the observer decides to bet and pays his fee, he gains \ -P[A\\cf) if A occurs and -P[A\\cf] otherwise, so that his gain is {\-P[A\\J])IA + {-P[A\\<f])IA<=IA-P[A\\J]. If he declines to bet, his gain is of course 0. Suppose that he adopts the strategy of betting if G occurs but not otherwise, where G is some set in £. He can actually carry out this strategy, since after learning the outcome of the experiment £ he knows whether or not G occurred. His expected gain with this strategy is his gain integrated over G: f(IA--P[A\\J?])dP. But (33.8) is exactly the requirement that this vanish for each G in ^. Condition (ii) requires then that each strategy be fair in the sense that the observer stands neither to win nor to lose on the average. Thus P[A\\cf] is the just entry fee, as intuition requires. Example 33.3. Suppose that ^e^, which will always hold if £ coincides with the whole σ-field !F. Then IA satisfies conditions (i) and (ii), so that P[A\\cf] = lA with probability 1. If A e^, then to know the outcome of £ viewed as an experiment is in particular to know whether or not A has occurred. ■ Example 33.4. If £ is {0, Ω}, the smallest possible σ-field, every function measurable £ must be constant. Therefore, Р[А\\с?]ш =Р(А) for all ω in this case. The observer learns nothing from the experiment £. Ш According to these two examples, P[ Л||{0, Ω}] is identically P(A), whereas IA is a version of P[A\\S^]. For any cf, the function identically equal to P(A) satisfies condition (i) in the definition of conditional probability, whereas lA satisfies condition (ii). Condition (i) becomes more stringent as £ decreases, and condition (ii) becomes more stringent as £ increases. The two conditions work in opposite directions and between them delimit the class of versions of P[A\\^].
432 DERIVATIVES AND CONDITIONAL PROBABILITY Example 33.5. Let Ω be the plane R2 and let & be the class ^2 of planar Borel sets. A point of Ω is a pair (x,y) of reals. Let cf be the σ-field consisting of the vertical strips, the product sets ExR1 = [(jc, y): xe£], where Ε is a linear Borel set. If the observer knows for each strip Ex Rl whether or not it contains (x,y), then, as he knows this for each one-point set E, he knows the value of x. Thus the experiment £ consists in the determination of the first coordinate of the sample point. Suppose now that Ρ is a probability measure on &2 having a density fix, y) with respect to planar Lebesgue measure: P(A) = ffAf(x,y)dxdy. Let Л be a horizontal strip R1 X F = [{x,y): y^F],F being a linear Borel set. The conditional probability P[A\\cf] can be calculated explicitly. Put (33.9) Ч>(х,У) = jf{x,t)dt f f(x,t)dt Set φ(χ, y) = 0, say, at points where the denominator here vanishes; these points form a set of P-measure 0. Since φ(,χ, у) is a function of χ alone, it is measurable £. The general element of £ being Ε XR\ it will follow that φ is a version of P[A\\cf] if it is shown that (33.10) / (p{x,y)dP{x,y)=P(An{ExR1)). JExRl Since A = Rl x F, the right side here is P(E χ F). Since Ρ has density /, Theorem 16.11 and Fubini's theorem reduce the left side to flfRi<P(x'y)f(x> У) dy) dx = /(//(*. t) dt\ dx = {{ f(x,y)dxdy = P(ExF). J JEXF Thus (33.9) does give a version of P[Rl χ F\\J]. Ш The right side of (33.9) is the classical formula for the conditional probability of the event Rl χF (the event that yef) given the event {x} xR1 (given the value of x). Since the event {χ} χ Rl has probability 0, the formula P(A\B) = P(A η Β)/Ρ(Β) does not work here. The whole point of this section is the systematic development of a notion of conditional probability that covers conditioning with respect to events of probability 0. This is accomplished by conditioning with respect to collections of events—that is, with respect to σ-fields £.
SECTION 33. CONDITIONAL PROBABILITY 433 Example 33.6. The set A is by definition independent of the σ-field £ if it is independent of each G in d?: P(AnG) = P(A)P(G). This being the same thing as P(A r\G) = fGP(A)dP, A is independent of c0 if and only if P[ A ||J] = P( A) with probability 1. ■ The σ-field σ(Χ) generated by a random variable X consists of the sets [ω: 1(й)еЯ] for Яе^'; see Theorem 20.1. The conditional probability of A given X is defined as P[A\\<r(X)] and is denoted P[A\\X]. Thus />[Л||ЛГ] = Ρ[Α\\σ(Χ)\ by definition. From the experiment corresponding to the σ-field σ(Χ), one learns which of the sets [ω': Χ(ω') =χ] contains ω and hence learns the value Χ(ω). Example 33.5 is a case of this: take X(x, y) = x for (x, y) in the sample space Ω = R2 there. This definition applies without change to random vector, or, equivalentiy, to a finite set of random variables. It can be adapted to arbitrary sets of random variables as well. For any such set [X,, ! e Γ], the σ-field σ[Χ,, t e T] it generates is the smallest σ-field with respect to which each X, is measurable. It is generated by the collection of sets of the form [ω: Χ,(ω) <ξ Η] for t in 7 and Η in &. The conditional probability P[A\\X„ t e Τ]\ of A with respect to this set of random variables is by definition the conditional probability Ρ[Α\\σ[Χ„ t <= T]] of A with respect to the σ-field σ[Χ„ ie T]. In this notation the property (33.7) of Markov chains becomes (33.11) P[*„ + 1=/||*0 Xn]=P[Xn + l=j\\Xn]. The conditional probability of [ Xn + i =j] is the same for someone who knows the present state Xn as for someone who knows the present state Xn and the past states X0,..., Xn_l as well. Example 33.7. Let X and Υ be random vectors of dimensions / and k, let μ be the distribution of X over R\ and suppose that X and Υ are independent. According to (20.30), P[X(eH,(X,Y)(eJ] = ί Ρ[(χ,Υ)^]μ(άχ) JH for WeJJ and J e &J+k. This is a consequence of Fubini's theorem; it has a conditional-probability interpretation. For each χ in R' put (33.12) f(x)=P[(x,Y) e=/] =Ρ[ω': (χ,Υ(ω')) e=/]. By Theorem 20.1(H), f(X(a))) is measurable σ(Χ), and since μ is the distribution of X, a change of variable gives f f(X(w))P(dw)= [ f(xMdx)=P([(X,Y)eJ]n[XeH]).
434 DERIVATIVES AND CONDITIONAL PROBABILITY Since [1еЯ] is the general element of σ(Χ), this proves that (33.13) /(Χ(ω))=Ρ[(Χ,Υ)^\\Χ]ω with probability 1. ■ The fact just proved can be written P[(.X,Y)eJ\\X]„-P[(X(<o),Y)ej] = Р[б/:(*(б>),У(б/))еУ]. Replacing ω' by ω on the right here causes a notational collision like the one replacing у by χ causes in f^f(x,y)dy. Suppose that X and Υ are independent random variables and that У has distribution function F. For J = [(u, υ): тах{ы, υ) < m], (33.12) is 0 for m < χ and F(m) for m >x; if M = тах^Г}, then (33.13) gives (33.14) P[Mim\\X]u = I[X!Sm{a>)F(m) with probability 1. All equations involving conditional probabilities must be qualified in this way by the phrase with probability 1, because the conditional probability is unique only to within a set of probability 0. The following theorem is useful for checking conditional probabilities. Theorem 33.1. Let & be a ir-system generating the σ-field £, and suppose that Ω, is a finite or countable union of sets in &. An integrable function f is a version of P[A\\^] if it is measurable £ and if (33.15) f fdP = P(Af)G) JG holds for all G in &>. Proof. Apply Theorem 10.4. ■ The condition that Ω is a finite or countable union of ^sets cannot be suppressed; see Example 10.5. Example 33.8. Suppose that X and Υ are independent random variables with a common distribution function F that is positive and continuous. What is the conditional probability of [X<x] given the random variable Μ = maxfA', Y}7 As it should clearly be 1 if Μ <x, suppose that M>x. Since X<x requires Μ =Y, the chance of which is j by symmetry, the conditional probability of [X <x] should by independence be jF(x)/F(m) = jP[X<x\X<m] with the random variable Μ substituted
SECTION 33. CONDITIONAL PROBABILITY 435 for m. Intuition thus gives (33.16) Ρ[*<*ΙΙΜ]ω = /[Λ, ,„(*>) + \\м>*1«>р?щш)у It suffices to check (33.15) for sets G = [M <m\ because these form a π-system generating σ(Μ). The functional equation reduces to (33.17) P[M<mm{x,m)] + \( ^)-dP = P[M<m, X<x]. Δ Jx<M<,m r(M ) Since the other case is easy, suppose that χ <m. Since the distribution of (X,Y) is product measure, it follows by Fubini's theorem and the assumed continuity of F that K<M<mF(M) JJU<1 F(u) x<i <,m + fj<u T^dF(u)dF(o) = 2(F(m)-F(x)), x<u <,m which gives (33.17). ■ Example 33.9. A collection [A',: t > 0] of random variables is a Markov process in continuous time if for к > 1, 0 < ί, < · · · <tk <u, and Η e ^', (33.18) P\XueH\\Xl{,...,Xlt] =P[XueH\\Xlt] holds with probability 1. The analogue for discrete time is (33.11). (The Xn there have countable range as well, and the transition probabilities are constant in time, conditions that are not imposed here.) Suppose that t < u. Looking on the right side of (33.18) as a version of the conditional probability on the left shows that (33.19) ( P[XU eHWX,] dP=P([Xu^H] П G) JG if 0 < i, < · · · <tk = t<u and G e a(X,t,..., X,t). Fix i, u, and H, and let к and tl,...,tk vary. Consider the class &= 1)σ(Χ, ,.,.,Χ, ), the union extending over all к > 1 and all A>tuples satisfying 0 < i, < · · · <tk = t. If A(Ea(Xu,...,Xtt) and В ^a(XSi,...,Xs), then AnB^a(Xri,...,Xrj), where the ra are the sp and the ty merged together. Thus & is a 7r-system. Since &> generates a[Xs: s<t] and P[Xu^H\\Xt] is measurable with respect to this σ-field, it follows by (33.19) and Theorem 33.1 that P[XU <= H\\X,] is a version of P[XU f=H\\Xs, s < t]: (33.20) P[XueEH\\Xs,s<t}=P[XutEH\\X,}, t<u, with probability 1.
436 DERIVATIVES AND CONDITIONAL PROBABILITY This says that for calculating conditional probabilities about the future, the present a(Xt) is equivalent to the present and the entire past a[Xs: s < t]. This follows from the apparently weaker condition (33.18). ■ Example 33.10. The Poisson process [Nt: ί > 0] has independent increments (Section 23). Suppose that 0 < i, < ··· <tk <u. The random vector (N, ,Nt - N,,...,N, - /V, ) is independent of Nu- Nt, and so (Theorem 20.2) (iVf|, N ,..., N,t) is independent of Nu -Nlk. If / is the set of points (x{,...,xk,y)in Rk + l such that xk + y еЯ, where Яе^1, and if ν is the distribution of Nu-N.k, then (33.12) is PfU,,..., xk, Nu -N,t) e/] = P[xk + Nu - N,k e Я] = v(H - xk). Therefore, (33.13) gives P[NU e Я||Л/',|....,Л',4] = 1'(Я-\). This holds also if к = 1, and hence P[NU e Я||Л/...,Л/,4] = Р[Лг„еЯ||Лг^]. The Poisson process thus has the Markov property (33.18); this is a consequence solely of the independence of the increments. The extended Markov property (33.20) follows. ■ Properties of Conditional Probability Theorem 33.2. With probability 1, P[&\^\ = 0, Ρ[Ω\\<0] = 1; and (33.21) 0<P[A\\.^]<1 for each A. IfAu A2,--- is a finite or countable sequence of disjoint sets, then (33.22) with probability 1. IMJ^ Σρ[αΜ] Proof. For each version of the conditional probability, fGP[A\\^]dP = P(A η G) > 0 for each G in cf; since P[A\\c&] is measurable ^f, it must be nonnegative except on a set of P-measure 0. The other inequality in (33.21) is proved the same way. If the An are disjoint and if G lies in £, it follows (Theorem 16.6) that / (LP[An\\^])dP=Ef P[An\\J?]<ir=LP(AnnG) (ил) nG Thus L„P[An\\cf}, which is certainly measurable cf, satisfies the functional equation for P[(J nAn\\^}, and so must coincide with it except perhaps on a set of /"-measure 0. Hence (33.22). ■
SECTION 33. CONDITIONAL PROBABILITY 437 Additional useful facts can be established by similar arguments. If А с В, then (33.23) P[B-A\\S]=P[B\\S]-P[A\\S], P[A\\J>]<P[B\\J>]. The inclusion-exclusion formula (33.24) Ρ U A,\\S holds. If An Τ A, then (33.25) P[An\\^]] Ρ[Λ\\^}, and if An I A, then (33.26) P[AJJ]IP[A\\S]. Further, P(A) = 1 implies that (33.27) P[A\\S] = 1, and P(A) = 0 implies that (33.28) P[A\\J?]=0. Of course (33.23) through (33.28) hold with probability 1 only. Difficulties and Curiosities This section has been devoted almost entirely to examples connecting the abstract definition (33.8) with the probabilistic idea lying back of it. There are pathological examples showing that the interpretation of conditional probability in terms of an observer with partial information breaks down in certain cases. Example 33.11. Let (0-, &, P) be the unit interval Ω with Lebesgue measure Ρ on the σ-field & of Borel subsets of Ω. Take £ to be the σ-field of sets that are either countable or accountable. Then the function identically equal to P(A) is a version of P[A\\cf]: (33.8) holds because P(G) is either 0 or 1 for every G in £. Therefore, (33.29) Ρ[Α\\^]ω = Ρ(Α) with probability 1. But since £ contains all one-point sets, to know which
438 DERIVATIVES AND CONDITIONAL PROBABILITY elements of £ contain ω is to know ω itself. Thus £ viewed as an experiment should be completely informative—the observer given the information in £ should know ω exactly—and so one might expect that (33.30) ρμιι^]ω = | This is Example 4.10 in a new form. 1 if ω e A, 0 if ω ίΛ. The mathematical definition gives (33.29); the heuristic considerations lead to (33.30). Of course, (33.29) is right and (33.30) is wrong. The heuristic view breaks down in certain cases but is nonetheless illuminating and cannot, since it does not intervene in proofs, lead to any difficulties. The point of view in this section has been "global." To each fixed A in $*~ has been attached a function (usually a family of functions) Р[А\\с?]ш defined over all of Ω. What happens if the point of view is reversed—if ω is fixed and A varies over &Ί Will this result in a probability measure on 5?'} Intuition says it should, and if it does, then (33.21) through (33.28) all reduce to standard facts about measures. Suppose that Bv...,Br is a partition of Ω into insets, and let J?= σ(Βι Br). If Ρ(β,) = 0 and Ρ(β,) > 0 for the other i, then one version of P[A\\cf] is Ρ[Α№]ω = Bx, B„ i = 2,...,r. With this choice of version for each Α, Ρ[Α\\^]ω is, as a function of A, a probability measure on & if ω£ΰ2υ ··· U Br, but not if we β,. The "wrong" versions have been chosen. If, for example, iP(A) ifftiefi,, P[A\\J]a={ P(AnBi) P(B,) if ω<Ξ β , i = 2, then Ρ[Α\\^]ω is a probability measure in A for each ω. Clearly, versions such as this one exist if £ is finite. It might be thought that for an arbitrary σ-field ^ in & versions of the various P[A\\^\ can be so chosen that Ρ[Α\\^]ω is for each fixed ω a probability measure as A varies over &. It is possible to construct a
SECTION 33. CONDITIONAL PROBABILITY 439 counterexample showing that this is not so.f The example is possible because the exceptional ω-set of probability 0 where (33.22) fails depends on the sequence A1,A2,...; if there are uncountably many such sequences, it can happen that the union of these exceptional sets has positive probability whatever versions P[A\\cf] are chosen. The existence of such pathological examples turns out not to matter. Example 33.9 illustrates the reason why. From the assumption (33.18) the notably stronger conclusion (33.20) was reached. Since the set [1„еЯ] is fixed throughout the argument, it does not matter that conditional probabilities may not, in fact, be measures. What does matter for the theory is Theorem 33.2 and its extensions. Consider a point ω0 with the property that P(G) > 0 for every G in £ that contains ω0. This will be true if the one-point set {ω0} lies in S*~ and has positive probability. Fix any versions of the P[A\\cf]. For each A the set [ω: Р[А\\с?]ш <0] lies in ^ and has probability 0; it therefore cannot contain ω0. Thus Ρ[Α\υ>]ω >0. Similarly, Р[Щ^]Ш = 1, and, if the An are disjoint, Ρ[Ό ηΑΜΐ[ = ^η-ρ[Α\\^]ωο. Therefore, Ρ[Α\\*]ωο is a probability measure as A ranges over &. Thus conditional probabilities behave like probabilities at points of positive probability. That they may not do so at points of probability 0 causes no problem because individual such points have no effect on the probabilities of sets. Of course, sets of points individually having probability 0 do have an effect, but here the global point of view reenters. Conditional Probability Distributions Let Ibea random variable on (Ω, 5?, P), and let ^ be a σ-field in &. Theorem 33.3. There exists a function μ(Η, ω), defined for Η in &l and ω in Ω, with these two properties: (i) For each ω in Ω, μ(·, ω) is a probability measure on &1. (ii) For each Η in <%\ μ(Η, ■) is a version of P[X^H\\^]. The probability measure μ(·, ω) is a conditional distribution of X given £'. If £= σ(Ζ), it is a conditional distribution of X given Z. Proof. For each rational r, let F(r, ω) be a version of P[X < г\\с?]ш. If r < s, then by (33.23), (33.31) F(r,iu) <^(ί,ω) fThe argument is outlined in Problem 33.11. It depends on the construction of certain nonmeasurable sets.
440 DERIVATIVES AND CONDITIONAL PROBABILITY for ω outside a ^set Ars of probability 0. By (33.26), (33.32) F(r,w)= limF(r + n-\w) for ω outside a d?-set Br of probability 0. Finally, by (33.25) and (33.26), (33.33) . lim F(r,w)=0, limF(r,co) = l outside a c^set С of probability 0. As there are only countably many of these exceptional sets, their union Ε lies in £ and has probability 0. For ωί£ extend Fi-,ω) to all of Rl by setting F(jc,uj) = inffF(r,ω): jc <r]. For ωε£ take F(jc,cu) = F(x), where F is some arbitrary but fixed distribution function. Suppose that ωί£. By (33.31) and (33.32), F(x,cu) agrees with the first definition on the rationals and is nondecreasing; it is right-continuous; and by (33.33) it is a probability distribution function. Therefore, there exists a probability measure μ(·,ω) on (Rl,&l) with distribution function F(-,w). For ие£, let μ(·,ω) be the probability measure corresponding to F(x). Then condition (i) is satisfied. The class of Η for which μ(Η, ■) is measurable £ is a A-system containing the sets H = (-°°,r] for rational r; therefore μ(Η, ■) is measurable 4 for Η in ·#'. By construction, μ((-°ο, r], ω) = P[X < г\\с?]ш with probability 1 for rational r; that is, for Η = (- oo, r) as well as for Η = R\ / μ(//,ω)Ρ(Λω) =/>([*£//] η G) JG for all G in £. Fix G. Each side of this equation is a measure as a function of #, and so the equation must hold for all Η in &l. ■ Example 33.12. Let X and У be random variables whose joint distribution ν in Л2 has density f(x, y) with respect to Lebesgue measure: P[(.Y, F) (ΞΑ] = ι>(Α) = ffAf(x,y)dxdy. Let g(x,y)=f(x,y)/fRif(x,t)dt, and let μ(Η,χ) = fHg(x,y)dy have probability density #(x, ·); if jRif(x,t)dt = Q, let μ(·, jc) be an arbitrary probability measure on the line. Then μ(#, Χ(ω)) will serve as the conditional distribution of Υ given X. Indeed, (33.10) is the same thing as fExRw(F, x)dv(x, y) = v(E X F), and a change of variable gives flxeEp(.F,XM)PUa>) = P[XeE, YeF]. Thus M(F, Ζ(ω)) is a version of P[Y^ F\\Χ]ω. This is a new version of Example 33.5. α
SECTION 33. CONDITIONAL PROBABILITY 441 PROBLEMS 33.1. 20.27 Τ BoreVs paradox. Suppose that a random point on the sphere is specified by longitude Θ and latitude Φ, but restrict Θ by 0 < Θ < π, so that Θ specifies the complete meridian circle (not semicircle) containing the point, and compensate by letting Φ range over (-π,π]. (a) Show that for given Θ the conditional distribution of Φ has density ^|cos</>| over (-π, +тг]. If the point lies on, say, the meridian circle through Greenwich, it is therefore not uniformly distributed over that great circle. (b) Show that for given Φ the conditional distribution of θ is uniform over (0, π). If the point lies on the equator (Φ is 0 or π), it is therefore uniformly distributed over that great circle. Since the point is uniformly distributed over the spherical surface and great circles are indistinguishable, (a) and (b) stand in appaient contradiction. This shows again the inadmissibility of conditioning with respect to an isolated event of probability 0. The relevant σ-field must not be lost sight of. 33.2. 20.16 Τ Let X and У be independent, each having the standard normal distribution, and let (/?,Θ) be the polar coordinates for (X,Y). (a) Shew that X + Υ and А'- У are independent and that R2 = [(X+ У)2 + (X— У)2]/2, and conclude that the conditional distribution of R2 given X—Y is the chi-squared distribution with one degree of freedom translated by (Л--У)2/2. (b) Show that the conditional distribution of R2 given Θ is chi-squared with two degrees of freedom. (c) If X-Y=0, the conditional distribution of R2 is chi-squared with one degree of freedom. If Θ = π/4 or Θ = 5π/4, the conditional distribution of R2 is chi-squared with two degrees of freedom. But the events [Л"- У = 0] and [Θ = π/4] U [Θ = 5π/4] are the same. Resolve the apparent contradiction. 33.3. Τ Paradoxes of a somewhat similar kind arise in very simple cases. (a) Of three prisoners, call them 1, 2, and 3, two have been chosen by lot for execution. Prisoner 3 says to the guard, "Which of 1 and 2 is to be executed? One of them will be, and you give me no information about myself in telling me which it is." The guard finds this reasonable and says, "Prisoner 1 is to be executed." And now 3 reasons, "I know that 1 is to be executed; the other will be either 2 or me, and so my chance of being executed is now only j, instead of the | it was before," Apparently, the guard has given him information. If one looks for a σ-field, it must be the one describing the guard's answer, and it then becomes clear that the sample space is incompletely specified. Suppose that, if 1 and 2 are to be executed, the guard's response is "1" with probability ρ and "2" with probability 1 -p; and, of course, suppose that, if 3 is to be executed, the guard names the other victim. Calculate the conditional probabilities. (b) Assume that among families with two children the four sex distributions are equally likely. You have been introduced to one of the two children in such a family, and he is a boy. What is the conditional probability that the other is a boy as well?
442 DERIVATIVES AND CONDITIONAL PROBABILITY 33.4. (a) Consider probability spaces (Π,^,Ρ) and (Ω', ^\ P')\ suppose that T: Ω->Ω' is measurable &/&1 and P' = PT~l. Let J' be a σ-field in ^"', and take <0 to be the σ-field [Т~1С: G' e J\ For /1' e &', show by (16.18) that P[T-iA'\№]a> = P'[A'\\J!']Ta>mth F-probability 1. (b) Now take (Ω', 5Γ',Ρ') = (Λ2,^2,μ), wheie μ is the distribution of a random vector (X,Y) on (Ω, &~, P). Suppose that (X,Y) has density /, and show by (33.9) that P[YeF\\X]t fj(X(a>),t)dt / f(X(o),t)U with probability 1. 33.5. Τ (a) There is a slightly different approach to conditional probability. Let (Ω, &~, P) be a probability space, (Ω', &') a measurable space, and T: il -»Ω' a mapping measurable &/SF'. Define a measure ν on &' by j/( Л') = Р( Л П Γ-1/4') for A' s У. Prove that there exists a function ρ(Α\ω') on Ω', measurable $*~' andintegrable PT'\ such that //4.ρ(/ΐ|ω')ΡΓ-1(^ω') = Ρ(/1 η Γ_1/ί') for all A' in У. Intuitively, ρ(/1|ω') is the conditional probability that ω eA for someone who knows that Τω = ω'. Let <0= [T~ Ά': A' e У']; show that ^ is a σ-field and that ρ(Α\Τω) is a version of Ρ[Α\\^]ω. (b) Connect this with part (a) of the preceding problem. 33.6. Τ Suppose that Τ = X is a random variable, (Ω',&~') = (ϋ1,&1\ and jc is the general point of R1. In this case p(A\x) is sometimes written P[A \X = x]. What is the problem with this notation? 33.7. For the Poisson process (see Example 33.1) show that for 0 <s < t, ,ι,..„«,-((ϊ)(ί)'ΚΓ- -"■· [θ, k>Nr Thus the conditional distribution (in the sense of Theorem 33.3) of Ns given Nt is binomial with parameters N, and s/t. 33.8. 29.12Τ Suppose that (Xu X2) has the centered normal distribution—has in the plane the distribution with density (29.10). Express the quadratic form in the exponential as
SECTION 33. CONDITIONAL PROBABILITY 443 integrate out the x2 and show that f(Xi,x2) 1 f f{xut)dt J _ CD y/T, exp 2тЬ σΖΧΊ where τ = σ22 ~ог\г<т\\· Describe the conditional distribution of X2 given A",. 33.9. (a) Suppose that μ(Η,ω) has property (i) in Theorem 33.3, and suppose that μ(Η, ·) is a version of P[XeH\\cf] for Я in a тг-system generating J?1. Show that μ(·,ω) is a conditional distribution of X given ^. (b) Use Theorem 12.5 to extend Theorem 33.3 from Rl to Rk. (c) Show that conditional probabilities can be defined as genuine probabilities on spaces of the special form {Ω,σ(Χ{,..., Xk\ P). 33.10. Τ Deduce from (33.16) that the conditional distribution of X given Μ is I, / 1 μ(ΗП (-co, M(a>)]) 2WH](«>)+2 μ(-οο,Μ(ω)]) ' where μ is the distribution corresponding to F (positive and continuous). Hint: First check Я=( — °°, x]. 33.11. 4.10 12.4 Τ The following construction shows that conditional probabilities may not give measures. Complete the details. In Problem 4.10 it is shown that there exist a probability space (Ω, SF, P), a σ-field J in &, and a set Я in & such that P(H)= {-, Я and J are independent, ^ contains all the singletons, and ^ is generated by a countable subclass. The countable subclass generating ^ can be taken to be a 7r-system &= {Bi,B2,...} (pass to the finite intersections of the sets in the original class). Assume that it is possible to choose versions P[A\\^] so that Ρ[Α\\^]ω is for each ω a probability measure as A varies over fF. Let Cn be the ω-set where Ρ[Βη\\^]ω = ΙΒ(ω); show (Example 33.3) that C= f)„Cn has probability 1. Show that we't implies that Ρ[ΰ\\^]ω = /σ(ω) for all G in £ and hence that Ρ[{ω}||^]ω = 1. Now <о<вНГ)С implies that Ρ[Η\\^]ω>Ρ[{ω}\\^]ω= 1 and й><=НсГ)С implies that Р[Я||^]Ш < Ρ[Ω - {ω}||^]ω = 0. Thus ω e С implies that Ρ[Η\\^]ω = ΙΗ(ω). But since Я and J are independent, Р[Я||^]Ш = P(H)= \ with probability 1, a contradiction. This example is related to Example 4.10 but concerns mathematical fact instead of heuristic interpretation. 33.12. Let a and β be σ-finite measures on the line, and let f(x, y) be a probability density with respect to α Χβ. Define (33.34) gx(y) = li^1 f μχ,ΐ)β(ώ)
444 DERIVATIVES AND CONDITIONAL PROBABILITY unless the denominator vanishes, in which case take gx(y) = 0, say. Show that, if (A", Y) has density / with respect to α Χβ, then the conditional distribution of Υ given X has density gx(y) with respect to β. This generalizes Examples 33.5 and 33.12, where a and β are Lebesgue measure. 33.13. 18.20 Τ Suppose that μ and vx (one for each real x) are probability measures on the line, and suppose that vx(B) is a Borel function in χ for each SeJ1. Then (see Problem 18 20) (33.35) r{E)=fvx[y{x,y)eE]p{dx) defines a probability measure on (R2,&2). Suppose that (X,Y) has distribution π, and show that vx is a version of the conditional distribution of Υ given X. 33.14. Τ Let a and β be σ-finite measures on the line. Specialize the setup of Problem 33.13 by supposing that μ has density f(x) with respect to a and ν has density gx(y) with respect to β. Assume that gx(y) is measurable ^ in the pair (x, y), so that vx(B) is automatically measurable in x. Show that (33.35) has density f(x)gx(y) with respect to a Χ β: π(Ε) = }}Ef(x)gx(y)a(dx)fi(dy). Show that (33.34) is consistent with f(x,y) = f(x)gx(y). Put Py(x)=- · f f(s)gs(y)a(ds) ■Ή1 Suppose that (X,Y) has density f(x)gx(y) with respect to α Χβ, and show that Py(x) is a density with respect to a for the conditional distribution of X given Y. In the language of Bayes, f(x) is the prior density of a parameter x, gx(y) is the conditional density of the observation у given the parameter, and py(x) is the posterior density of the parameter given the observation. 33.15. Τ Now suppose that a and β are Lebesgue measure, that f(x) is positive, continuous, and bounded, and that gx(y) = e~{y~x) "/2/^2v/n. Thus the observation is distributed as the average of η independent normal variables with mean χ and variance 1. Show that for fixed χ and y. Thus the posterior density is approximately that of a normal distribution with mean у and variance \/n. 33.16. 32.13T Suppose that X has distribution μ. Now Ρ[/1||ΑΓ]ω=/(ΑΓ(ω)) for some Borel function /. Show that limA_0 P[A\x—h <X<x +h]=f(x) for χ in a set of μ-measure l. Roughly speaking, P[A\x — h <X<x +h]-*P[A\X = x]. Hint: Take v(B) = P(An[X<=Bl> in Problem 32.13.
SECTION 34. CONDITIONAL EXPECTATION 445 SECTION 34. CONDITIONAL EXPECTATION In this section the theory of conditional expectation is developed from first principles. The properties of conditional probabilities will then follow as special cases. The preceding section was long only because of the examples in it; the theory itself is quite compact. Definition Suppose that X is an integrable random variable on Ш, !F, P) and that cf is a σ-field in &. There exists a random variable E[X\\^], called the conditional expected value of X given £, having these two properties: (i) E[X\\^\ is measurable £ and integrable. (ii) E[X\\^\ satisfies the functional equation (34.1) [ E[X\\J]dP= (XdP, Ge/ JG JG To prove the existence of such a random variable, consider first the case of nonnegative X. Define a measure ν on £ by v(G) = fGXdP. This measure is finite because X is integrable, and it is absolutely continuous with respect to P. By the Radon-Nikodym theorem there is a function /, measurable ^f, such that v(G) = \GfdP? This / has properties (i) and (ii). If X is not necessarily nonnegative, E[X+\\^\ — E[X~\\cf] clearly has the required properties. There will in general be many such random variables E[X\\^]; any one of them is called a version of the conditional expected value. Any two versions are equal with probability 1 (by Theorem 16.10 applied to Ρ restricted to ^). Arguments like those in Examples 33.3 and 33.4 show that £[ΛΊ|{0, Ω}] = E[X] and that E[X\\&~]=X with probability 1. As £ increases, condition (i) becomes weaker and condition (ii) becomes stronger. The value Е[Х\\с?]ш at ω is to be interpreted as the expected value of X for someone who knows for each G in £ whether or not it contains the point ω, which itself in general remains unknown. Condition (i) ensures that E[X\\^] can in principle be calculated from this partial information alone. Condition (ii) can be restated as fG(X - E[X\\^])dP = 0; if the observer, in possession of the partial information contained in cf, is offered the opportunity to bet, paying an entry fee of E[X\\c?\ and being returned the amount X, and if he adopts the strategy of betting if G occurs, this equation says that the game is fair. fAs in the case of conditional probabilities, the integral is the same on (Ω, 5*", P) as on (Ω, ^) with Ρ restricted to £ (Example 16.4).
446 DERIVATIVES AND CONDITIONAL PROBABILITY Example 34.1. Suppose that Bu B2,... is a finite or countable partition of Ω generating the σ-field c0. Then E[X\\^] must, since it is measurable £, have some constant value over Bit say or,-. Then (34.1) for G = β, gives aiP(Bi) = (B/XdP. Thus (34.2) Е[Х№]„ = -рщ{вХаР, ,6fi, P(B,)>0. If P{Bt) = 0, the value of E[X\\£\ over β,· is constant but arbitrary. ■ Example 34.2. For an indicator IA the defining properties of E[IA\\^f] and P[A\\<f\ coincide; therefore, E[IA№\ = P[A\\<#\ with probability 1. It is easily checked that, more generally, E[X\\<#] = Σ,α^ΐΑ^] with pioba- bility 1 for a simple function X = Σ,-α,·/^.. ■ In analogy with the case of conditional probability, if [Xn t e T] is a collection of random variables, E[X\\Xt, t e T] is by definition E[X\\^f] with σ[^,, t <ξ Γ] in the role of J?. Example 34.3. Let J^ be the σ-field of sets invariant under a measure- preserving transformation^ on (Ω,,^,Ρ). For / integrable, the limit / in (24.7) is E[f\\^\. Since / is invariant, it is measurable «X If G is invariant, then the averages an in the proof of the ergodic^theorem (p. 318) satisfy E[IGan} = E[IGf]. But since the an converge to / and are uniformly integrable, E[lJ\ = E[Icfl Ш Properties of Conditional Expectation To prove the first result, apply Theorem 16.10(iii) to / and E[X\\cf] on Theorem 34.1. Let & be a v-system generating the σ-field .^, and suppose that Ω is a finite or countable union of sets in d?. An integrable function fis a version of E[X\\^\ if it is measurable £ and if (34.3) ffdP = fxdP G G holds for all G in &. In most applications it is clear that Пе^1. All the equalities and inequalities in the following theorem hold with probability 1.
SECTION 34. CONDITIONAL EXPECTATION 447 Theorem 34.2. Suppose that X, Y, Xn are integrable. (i) IfX = a with probability 1, then E[X\\^] = a. (ii) For constants a and b, E[aX + bY\\J>\ = aE[X\\^\ + bE[Y\\^\. (iii) IfX< Υ with probability 1, then E[X\\cf] < E\Y\\^\. Ы\Е\Х\\Л<Е\\Х\\\Л (ν) // limn Xn = X with probability 1, \Xn\ < Y, and Υ is integrable, then lim„ E[Xn\\cf] = E[X\\cf] with probability 1. Proof. If X = a with probability 1, the function identically equal to a satisfies conditions (i) and (ii) in the definition of E[X\\^\, and so (i) above follows by uniqueness. As for (ii), aE[X\\cf] + bE[Y\\cf) is integrable and measurable £, and [ (aE[X\\c?]+bE[Y\\J?])dP = af E[X\\J?]dP + b( E[Y\\J?]dP JG JG JG = a[ XdP + b[ YdP = f (aX+bY) dP JG JG JG for all G in £, so that this function satisfies the functional equation. If X<Y with probability 1, then fG(E[Y\\d?] - E[X\\J?])dP = fG(Y- X)dP>0 for all G in J. Since E[Y\\^\ - E[X\\d?] is measurable d?, it must be nonnegative with probability 1 (consider the set G where it is negative). This proves (iii), which clearly implies (iv) as well as the fact that E[X\\J\ = E[Y\\^\ \iX=Y with probability 1. To prove (v), consider Z„ = sup^jA^ - X]. Now Z„ J.0 with probability 1, and by (ii), (iii), and (iv), \E[Xn\\^}- E[X\\d?]\<E[Zn\\cf]. It suffices, therefore, to show that £[Ζ„||.^]10 with probability 1. By (iii) the sequence E[Zn\\^\ is nonincreasing and hence has a limit Z; the problem is to prove that Z = 0 with probability 1, or, Ζ being nonnegative, that E[Z] = Q. But 0<Z„<2F, and so (34.1) and the dominated convergence theorem give E[Z] = fE[Z\\cf]dP < jE[Zn\WdP = E[Zn]-+ 0. ■ The properties (33.21) through (33.28) can be derived anew from Theorem 34.2. Part (ii) shows once again that Ε[Σ,·ο',·//4.||^'] = Σ,-α,ΡΜ,-Ц.^] for simple functions. If X is measurable cf, then clearly E[X\\^] = X with probability 1. The following generalization of this is used constantly. For an observer with the information in £, X is effectively a constant if it is measurable £: Theorem 34 J. IfXis measurable ^f, and if Υ and XY are integrable, then (34.4) E[XY\\J?]=XE[Y\\cf] with probability 1.
448 DERIVATIVES AND CONDITIONAL PROBABILITY Proof. It will be shown first that the right side of (34.4) is a version of the left side if X = IG and G0^cf. Since IG E[Y\\^\ is certainly measurable £, it suffices to show that it satisfies the functional equation \GlGE\Y\\J?\dP = fGIGYdP, Gei. But this reduces to jGnG{E[Y\\^]dP = fGnGYdP, which holds by the definition of E[Y\\J\. Thus (34.4) holds if X is the indicator of an element of £. It follows by Theorem 34.2(ii) that (34.4) holds if X is a simple function measurable £. For the general X that is measurable <£, there exist simple functions Xn, measurable £, such that |Xr\ < \X\ and limn Xn = X (Theorem 13.5). Since I^FI^IAYI and \XY\ is integrable, Theorem 34.2(v) implies that lim„ E[XnY\\J?] = E[XY\\J?] with probability 1. But E[X„Y\\J] = XnE[Y\\df] by the case already treated, and of course limn XnE[Y\\cf] = XE[Y\\J?]. (Note that \X„E[Y\\J]\ = \E[X„Y\\J]\ < E[\XnY\ \\J\ < E[\XY\\\d?], so that the limit XE[Y\\^\ is integrable.) Thus (34.4) holds in general. Notice that X has not been assumed integrable. ■ Taking a conditional expected value can be thought of as an averaging or smoothing operation. This leads one to expect that averaging X with respect to £2 and then averaging the result with respect to a coarser (smaller) σ-field ^x should lead to the same result as would averaging with respect to ^"i in the first place: Theorem 34.4. If X is integrable and the σ-fields cfx and £2 satisfy ^,ci2, then (34.5) Е[Е[Х\\^2]\\^] = Е[ХЩ] with probability 1. Proof. The left side of (34.5) is measurable cfv and so to prove that it is a version of E[X\\JX\ it is enough to verify }αΕ[Ε[Χ\\^2]\\^]άΡ = jGXdP for Ge/,. But if Gej?,, then Ge^ and the left side here is (GE[X\\d?2]dP = fGXdP. Ш If cf2 = F, then E[X\\J?2] = X, so that (34.5) is trivial. If J? = {0, Ω} and cf2 = cf, then (34.5) becomes (34.6) E[E[X\\J?]]=E[X], the special case of (34.1) for G = Ω. If £x c-^2, then Е[Х\\^{], being measurable ^,, is also measurable £2, so that taking an expected value with respect to £2 does not alter it: Е[Е[Х\\^]\\^2] = Е[Х\\^1 Therefore, if ^, c^2, taking iterated expected values in either order gives Ε[Χ\\^].
SECTION 34. CONDITIONAL EXPECTATION 449 The remaining result of a general sort needed here is Jensen's inequality for conditional expected values: If φ is a convex function on the line and X and φ(Χ) are both integrable, then (34.7) φ(Ε[Χ№])<Ε[φ(Χ)\\4\ with probability 1. For each x0 take a support line [A33] through (jc0, φ(χ0)): φ(χ0) +A(x0Xx -x0) < φ(χ). The slope A(x0) can be taken as the right-hand derivative of φ, so that it is nondecreasing in jc0. Now <p{E[X\\J>}) + A{E[X\\J]){X- E[X\\J?}) <φ{Χ). Suppose that E[X\\^\ is bounded. Then all three terms here are integrable (if φ is convex on R1, then φ and A are bounded on bounded sets), and taking expected values with respect to cf and using (34.4) on the middle term gives (34.7). To prove (34.7) in general, let G„ = [|£[Z||^]| < n]. Then E[IGX\\cf] = IGnE[X\\cf] is bounded, and so (34.7) holds for ICX: cp(IG E[X\\J?]) < E[cp(IG X)\\c?l Now Ε[φ(Ια X)\\J] = E[IG φ(Χ) + /Сс«р(0)||.Я = IGE[<pCx)\\d?]+IGc<p(Q)-+E[<p(x")\\d?]. Since cp(IGE[X\\d?]) converges to φ(Ε[Χ\\^]) by the continuity of φ, (34.7) follows. If φ(χ) = \x\, (34.7) gives part (iv) of Theorem 34.2 again. Conditional Distributions and Expectations Theorem 34.5. Let μ(·,ω) be a conditional distribution with respect to £ of a random variable X, in the sense of Theorem 33.3. // φ: R1 -+R1 is a Borel function for which φ(Χ) is integrable, then (κιφ(χ)μ(άχ,ω) is a version of Ε[φ(Χ)\\^]ω. Proof. If φ =ΙΗ and Яе^', this is an immediate consequence of the definition of conditional distribution, and by Theorem 34.2(ii) it follows for φ a simple function over R1. For the general nonnegative φ, choose simple φη such that 0 <φη(χ)1 φ(χ) for each χ in R1. By the case already treated, {κιφη(χ)μ(άχ, ω) is a version of Ε[φη(Χ)\\^]ω. The integral converges by the monotone convergence theorem in (β',^',/Λ-,ω)) to (κιφ(χ)μ(άχ,ω) for each ω, the value +00 not excluded, and Ε[φη(Χ)\\^]ω converges to Ε[φ(Χ)\\^]ω with probability 1 by Theorem 34.2(v). Thus the result holds for nonnegative φ, and the general case follows from splitting into positive and negative parts. ■ It is a consequence of the proof above that }κιφ(χ)μ(άχ, ω) is measurable £ and finite with probability 1. If X is itself integrable, it follows by the
450 DERIVATIVES AND CONDITIONAL PROBABILITY theorem for the case φ(χ) =χ that E[X\\J]a= Γ χμ(άχ,ω) ^ — 00 with probability 1. If φ(Χ) is integrable as well, then (34.8) Ε[φ(Χ)№]ω= Γ ψ{χ)μ{άχ,ω) ^ — oo with probability 1. By Jensen's inequality (21.14) for unconditional expected values, the right side of (34.8) is at least φ(ρ_αχμ(άχ,ω)) if φ is convex. This gives another proof of (34.7). Sufficient Subfields* Suppose that for each θ in an index set Θ, P6 is a probability measure on (Ω, У). In statistics the problem is to draw inferences about the unknown parameter θ from an observation ω. Denote by Pe[A\\<f] and Ee[X\\<f] conditional probabilities and expected values calculated with respect to the probability measure Pe on (Ω, У). A σ-field ^ in &~ is sufficient for the family [Ρβ: θ e Θ] if versions P0[A\\^] can be chosen that are independent of θ—that is, if there exists a function ρ(Α,ω) of A e У and ω εΩ such that, for each A e & and θ e Θ, p(A, ■) is a version of Pe[A\\^\ There is no requirement that ρ(·,ω) be a measure for ω fixed. The idea is that although there may be information in & not already contained in ^, this information is irrelevant to the drawing of inferences about 0.+ A sufficient statistic is a random variable or random vector Τ such that σ(Τ) is a sufficient subfield. A family Jf of measures dominates another family JV if, for each A, from μ(Α) = 0 for all μ in Л, it follows that v(A) = 0 for all ν in jV. If each of Л and J/ dominates the other, they are equivalent. For sets consisting of a single measure these are the concepts introduced in Section 32. Theorem 34.6. Suppose thai [Рв: 9еО]и dominated by the σ-finite measure μ. A necessary and sufficient condition for £ to be sufficient is that the density fe of Pe with respect to μ can be put in the form fe = geh for a ge measurable <#. It is assumed throughout that ge and h are nonnegative and of course that h is measurable IF. Theorem 34.6 is called the factorization theorem, the condition being that the density fe splits into a factor depending on ω only through ^ and a factor independent of Θ. Although ge and h are not assumed integrable μ, their product fe, as the density of a finite measure, must be. Before proceeding to the proof, consider an application. Example 34.4. Let (Sl,Sr) = (Rk,^k), and for 6>0 let Pe be the measure having with respect to fc-dimensional Lebesgue measure the density f(x^-f(x хл-{е~к if°£*/*M = l.-.*> Je\x)-Je\x\'--->xk)-\ n t, 0 otherwise. *This topic may be omitted. fSee Problem 34.19.
SECTION 34. CONDITIONAL EXPECTATION 451 If X, is the function on Rk defined by X,(x)=xh then under Pe, Xx,...,Xk are independent random variables, each uniformly distributed over [0,0]. Let T(x) = maxi£k Х{х). If ge(t) is в~к for 0<f <0 and 0 otherwise, and if h(x) is 1 or 0 according as all д:^ are nonnegative or not, then fe(x) = ge(T(x))h(x). The factorization criterion is thus satisfied, and Г is a sufficient statistic. Sufficiency is clear on intuitive grounds as well: 0 is not involved in the conditional distribution of Xt,...,Xk given Τ because, roughly speaking, a random one of them equals Τ and the others are independent and uniform over [0, T]. If this is true, the distribution of Xt given Τ ought to have a mass of k~l at Τ and a uniform distribution of mass 1 — fc"1 over [0, T], so that (34.9) £e№]-£r+^I|-^r. Foi a proof of this fact, needed later, note that by (21 9) (34 10) ( XfdPe = f ΡΘ[Τ<t,X,.>u]du JT<t -Ό J0 θ \0j 20* if 0 < t < Θ. On the other hand, Pe[T <t] = 0/0)*, so that under Pe the distribution of Τ has density ktk'x/0* over [0,0]. Thus n,.n r k+l k + l f, ,uk-1, tk + 1 (34.11) / ~, TdPe = ~. uk—т-аи = —r. K ' h<t 2k 2k Jo вк 2вк Since (34.10) and (34.11) agree, (34.9) follows by Theorem 34.1. ■ The essential ideas in the proof of Theorem 34.6 are most easily understood through a preliminary consideration of special cases. Lemma 1. Suppose that [Ρβ: θ e Θ] is dominated by a probability measure Ρ and that each Pe has with respect to Ρ a density ge that is measurable <#. Then ^ is sufficient, and P[A\\^] is a version of ΡΘ[Α\\^] for each 0 in Θ. Proof. For G in <f, (34.4) gives ( P[A\\S]dP„= f E[lJS]gedP= f E[lAge\\S]dP JG JG JG = flA8edP = f gedP = Pe(AnG). JG JAnG Therefore, P[A\\<f]—the conditional probability calculated with respect to Ρ—does serve as a version of Pe[A\\<f] for each 0 in Θ. Thus ^ is sufficient for the family
452 DERIVATIVES AND CONDITIONAL PROBABILITY [Рв: 0 e Θ]—even for this family augmented by Ρ (which might happen to lie in the family to start with). ■ For the necessity, suppose first that the family is dominated by one of its members. Lemma 2. Suppose that [Ρθ: θ e Θ] is dominated by Pe for some eoe0. If ^ is sufficient, then each Pe has with respect to Pe a density ge that is measurable <?. Proof. Let p(A, ω) be the function in the definition of sufficiency, and take Ρθ[Α\\^]ω=ρ(Α,ω) for all A e F, ω efi, and θ e Θ. Let de be any density of Pe with respect to Pe . By a number of applications of (34.4), JEe\de\\J] dPBo = flAEej[de\\S] dP,o = fEeo{lAEeu[de\\J?]\\J?}dFetr lE4lA\\J)Ee£d№]dPeo = JE„\ EeJ IA\\S)de\\S] dPeu = /EeJ IA\\S)de dP„, = {PeSA\\^]dPe = fPe[A\\S]dPe = ΡΘ(Α), the next-to-last equality by sufficiency (the integrand on either side being ρ (A,-)). Thus ge = Ee[de№\, which is measurable <0, can serve as a density for Pe with respect to Pe . и To complete the proof of Theorem 34.6 requires one more lemma of a technical sort. Lemma 3. // [Pe\ θ εθ]ύ dominated by a σ-finite measure, then it is equivalent to some finite or countably infinite subfamily. In many examples, the Pe are all equivalent to each other, in which case the subfamily can be taken to consist of a single Pe Proof. Since μ is σ-finite, there is a finite or countable partition of Ω into J^sets An such that 0 <μ(Αη) <°°. Choose positive constants an, one for each An, in such a way that Σ„α„ <°°. The finite measure with value Σπαπμ(Α Γ)Αη)/μ(Αη) at A dominates μ. In proving the lemma it is therefore no restriction to assume the family dominated by a finite measure μ. Each Pe is dominated by μ and hep.ce has a density fe with respect to it. Let Se = [ω: /θ(ω) > 0]. Then ΡΘ(Α) = ΡΘ(Α Π Se) for all A, and ΡΘ(Α) = 0 if and only if μ(Α oSe) = 0. In particular, Se supports Ρθ. Call a set В in JF a kernel if BcSe for some Θ, and call a finite or countable union of kernels a chain. Let a be the supremum of μ(0 over chains С Since μ is finite and a finite or countable union of chains is a chain, a is finite and μ((7)= a for some chain C. Suppose that C= U„ #„, wheie each β„ is a kernel, and suppose that Bn с 5θπ. The problem is to show that [Ρθ: θ e Θ] is dominated by [Ρθ: η= 1,2,...] and hence equivalent to it. Suppose that Pe(A) = 0 for all n. Then μ(/1 nSe )=0, as observed above. Since С с U„5e, μ(/1 hC)=0, and it follows that Pe(A"r\C) = 0
SECTION 34. CONDITIONAL EXPECTATION 453 whatever θ may be. But suppose that Pe(A-C)>0. Then Pe((A - C)nSe) = Pe(A - C) is positive, and so (A —C)C\Se is a kernel, disjoint from C, of positive μ-measure; this is impossible because of the maximality of C. Thus ΡΘ(Α - С) is 0 along with Рв(А П С), and so Pe(A) = 0. ■ Suppose that [Pe: θ e Θ] is dominated by a σ-finite μ, as in Theorem 34.6, so that, by Lemma 3, it contains a finite or infinite sequence Ρβ ,ΡΘ ,... equivalent to the entire family. Fix one such sequence, and choose positive constants c„, one for each 0„, in such a way that E„c„ = 1. Now define a probability measure Ρ on 3е by (34.12) P(A)-Lc,MA)- Clearly, Ρ is equivalent to [Ρθ, Ρβ ,.. ] and hence to [Ρβ: θ e Θ], and all three are dominated by μ. (34.13) Ρ^[Ρβ^Ρβ2,...]=[Ρβ:θε%]«μ. Proof of Sufficiency in Theorem 34.6. If each Pe has density geh with respect to μ, then by the construction (34.12), Ρ has density fh with respect to ,u, where /=E„c„^. Put re = ge/f if />0, and re = 0 (say) if /=0. If each ge is measurable ^, the same is true of / and hence of the re. Since P[f^ 0] = 0 and Ρ is equivalent to the entire family, Pe[f= 0]= 0 foi all Θ. Therefore, f r„dP= f Γββάμ= ί Γββάμ=ί ζ^άμ JA J4 JAn[f>0] JAn[f>0] = Pe(An[f>0])=Pe(A). Each Pe thus has with respect to the probability measure Ρ a density measurable ^, and it follows by Lemma 1 thai ^ is sufficient ■ Proof of Necessity in Theorem 34.6. Let p(A, ω) be a function such that, for each A and 0, p(A,·) is a version of Pe[A\\^\ as required by the definition of sufficiency. For Ρ as in (34.12) and G e.#", (34.14) f ρ(Α,ω)Ρ(άω)=Σ(:„ί Ρ(Α,ω)Ρθι(άω) JG „ JG = Zcnf PeXA\\J]dPeii JG η υ = LcnPe(AnG) = P(AnG). Thus p(A, ■) serves as a version of P[A\\^] as well, and ^ is still sufficient if Ρ is added to the family Since Ρ dominates the augmented family, Lemma 2 implies that each Pe has with respect to Ρ a density ge that is measurable <?. But if h is the density of Ρ with respect to μ (see (34.13)), then Pe has density geh with respect to μ. ■
454 DERIVATIVES AND CONDITIONAL PROBABILITY A sub-<r-field ^0 sufficient with respect to [Рв: 0e0] is minimal if, for each sufficient ^, ^0 is essentially contained in ^ in the sense that for each A in ^0 there is а В in ^ such that ΡΘ(Α δ Β) = 0 for all θ in Θ. A sufficient £ represents a compression of the information in fF, and a minimal sufficient ^0 represents the greatest possible compression. Suppose the densities fe of the Pe with respect to μ have the property that /θ(ω) is measurable (fx У, where £ is a σ-field in Θ. Let π be a probability measure on (f, and define Ρ as }βΡβπ(άθ\ in the sense that P(A)= }β}Α^(ω)μ(άω)π(άθ) = {βΡθ(Α)ν(άθ). Obviously, Ρ<ζ[Ρθ: ве Θ]. Assume that (34 15) [Рв:ве9]«:Р = f Ρθ-π(άθ). If π has mass c„ at θη, then Ρ is given by (34.12), and of course, (35.15) holds if (34.13) does. Let re be a density for Pe with respect to P. Theorem 34.7. // (34.15) holds, then ^0 = a[re: 9εθ]ύα minimal sufficient sub-σ-field. Proof. That ^0 is sufficient follows by Theorem 34.6. Suppose that £ is sufficient. It follows by a simple extension of (34.14) that £ is still sufficient if Ρ is added to the family, and then it follows by Lemma 2 that each Pe has with respect to Ρ a density ge that is measurable <#. Since densities are essentially unique, P\.&e = re\= 1- ^l ^ be the class of A in ^0 such that P(A^B) = 0 for some В in <&. Then Ж is a σ-field containing each set of the form A = [reeH] (take β = [ge e #]) and hence containing ^0. Since, by (34.15), Ρ dominates each Pe, ^f0 is essentially contained in ^, in the sense of the definition. ■ Minimum-Variance Estimation* To illustrate sufficiency, let g be a real function on Θ, and consider the problem of estimating g(0). One possibility is that Θ is a subset of the line and g is the identity; another is that Θ is a subset of Rk and g picks out one of the coordinates. (This problem is considered from a slightly different point of view at the end of Section 19.) An estimate of g(0) is a random variable Z, and the estimate is unbiased if Ee[Z]=g(6) for all Θ. One measure of the accuracy of the estimate Ζ is Ee[(Z - g(60)2]. If ^ is sufficient, it follows by linearity (Theorem 34.2(ii)) that Ee[X№] has for simple X a version that is independent of Θ. Since there are simple Xn such that \Xn\ < \X\ and Xn -»ΛΓ, the same is true of any X that is integrable with respect to each Pe (use Theorem 34.2(v)). Suppose that cf is, in fact, sufficient, and denote by E[X\\J?] a version of Ee[X\\^] that is independent of Θ. Theorem 34.8. Suppose that Ee[(Z - g(0))2] < °o for all θ and that ^ is sufficient. Then (34.16) Ee[(E[Z\\J?]-g(6))2] <Εβ[(Ζ-8(θ))2] for all Θ. If Ζ is unbiased, then so is E[Z\\<f]. *This topic may be omitted.
SECTION 34. CONDITIONAL EXPECTATION 455 Proof. By Jensens's inequality (34.7) for φ(χ) = (χ- g(0))2, {E\Z\\J\ - g(0))2 <Ee[(Z - g(e))2\\<f]. Applying Ee to each side gives (34.16). The second statement follows from the fact that Ee[E[Z\\^]] = Ee[Z]. Ш This, the Rao-Blackwell theorem, says that E[Z\\<f] is at least as good an estimate as Ζ if £ is sufficient. Example 34.5. Returning to Example 34.4, note that each A', has mean 0/2 under Ρθ, so that if A"= fc~'E?=i^, is the sample mean, then 2X is an unbiased estimate of Θ. But there is a better one. By (34.9), Ee[2X\\T] = (k+ \)T/k = T, and by the Rao-Blackwell theorem, Τ is an unbiased estimate with variance at most that of IX. In fact, for an arbitrary unbiased estimate Ζ, ΕΘ[(Γ - θ)2] < Εβ[(Ζ - Θ)2]. To prove this, let δ = Γ - E[Z\\T]. By Theorem 20.1(ii), δ =/(Γ) for some Borel function /, and Ee[f(T)] - 0 for all Θ. Taking account of the density for Τ leads to 1${(χ)χ1"1 dx = 0, so thai f(x)xk~l integrates to 0 over all intervals. Therefore, fix) along with f(x)xk'1 vanishes for χ > 0, except on a set of Lebesgue measure 0, and hence />Θ[/(Γ) = 0]= 1 and ΡΘ[Γ =£[Z||T]]= 1 for all Θ. Therefore, Ee[{T - θ)2] = Εθ[(Ε[Ζ\\Τ]-θ)2]<Εβ[(Ζ-θ)2] for Ζ unbiased, and Γ has minimum variance among all unbiased estimates of Θ. ■ PROBLEMS 34.1. Work out for conditional expected values the analogues of Problems 33.4, 33.5, and 33.9. 34.2. In the context of Examples 33.5 and 33.12, show that the conditional expected value of У (if it is integrable) given X is g(X), where f(x,y)ydy f(x,y)dy 34.3. Show that the independence of X and У implies that £[У||ЛГ] = £[У], which in turn implies that E[XY] = E[X]E[Y]. Show by examples in an Ω of three points that the reverse implications are both false. 34.4. (a) Let В be an event with P(B) > 0, and define a probability measure PQ by P0(A) = P(A\B). Show that Р0[А\Ш = Р[АГ)В\\Л>]/Р[В\\^] on a set of ,P0-measure 1. (b) Suppose that JV is generated by a partition β,, Β2,..., and let ^v <%f= σ(^υ с^Г). Show that with probability 1, ί^ί-ς:/.*^·
456 DERIVATIVES AND CONDITIONAL PROBABILITY 34.5. The equation (34.5) was proved by showing that the left side is a version of the right side. Prove it by showing that the right side is a version of the left side. 34.6. Prove for bounded X and У that E[YE[X\\d?]] = E[XE[Y\\<f]]. 34.7. 33.9 Τ Generalize Theorem 34.5 by replacing X with a random vector 34.8. Assume that X is nonnegative but not necessarily integrable. Show that it is still possible to define a nonnegative random variable E[X\\^], measurable £, such that (34.1) holds. Prove versions of the monotone convergence theorem and Fatou's lemma. 34.9. (a) Show for nonnegative X that E[X\\^]= (qP[X> t\\^]dt with probability 1. (b) Generalize Markov's inequality: P[\X\ > a\\J\ <a'kE\\X\k\\J\ with probability 1. (c) Similarly generalize Chevyshev's and Holder's inequalities. 34.10. (a) Show that, if i,c/2 and E[X2]<oo, then E[(X-E[X\\j?2])2] <E[(X - Ε[Χ\\^])2]. The dispersion of X about its conditional mean becomes smaller as the σ-field grows. (b) Define Var[X\\^] = E[(X - E[X\\^])2\\^\ Prove that Var[*] = £[Var[ Jfll^D + УагШЛГМИ 34.11. Let JX,J2,J3 be σ-fields in &~, let ^ be the σ-field generated by ^U J?., and let At be the generic set in £r Consider three conditions: (i) P[A3\\J>l2] = P[A,\\c?2] for allA3. (ii) P[AlnA,\\j?2] = P[Ai]\j?2]P[A,\\<f2]forallAl and A3. (iii) P[A1\\J23] = P[Ai\\J2]forallA1. If <0X, £2, and ^з are interpreted as descriptions of the past, present, and future, respectively, (i) is a general version of the Markov property: the conditional probability of a future event Аъ given the past and present £X2 is the same as the conditional probability given the present ^2 alone. Condition (iii) is the same with time reversed. And (ii) says that past and future events A j and Аъ are conditionally independent given the present ^2. Prove the three conditions equivalent. 34.12. 33.7 34.11 Τ Use Example 33.10 to calculate P[NS = k\\Nu, и >t] (s <t) for the Poisson process. 34.13. Let L2 be the Hubert space of square-integrable random variables on (Ω, JF, P). For ^ a σ-field in JF, let Mj, be the subspace of elements of L2 that are measurable ^. Show that the operator P^ defined for XeL2 by Pj?X = E[X\\<#] is the perpendicular projection on M^. 34.14. Τ Suppose in Problem 34.13 that ^= σ(Ζ) for a random variable Ζ in L2. Let Sz be the one-dimensional subspace spanned by Z. Show that Sz may be much smaller than M.Z), so that Ε[ΛΓ||Ζ] (for XeL2) is by no means the projection of X on Z. Hint: Take Ζ the identity function on the unit interval with Lebesgue measure.
SECTION 34. CONDITIONAL EXPECTATION 457 34.15. Τ Problem 34.13 can be turned around to give an alternative approach to conditional probability and expected value. For a σ-field ^ in JF, let Pj, be the perpendicular projection on the subspace Mj,. Show that P^X has for XeL2 the two properties required of E[X\\<f]. Use this to define E[X\\<f] for A"e L2 and then extend it to all integrable X via approximation by random variables in L2. Now define conditional probability. 34.16. Mixing sequences. A sequence Ai,A2,... of J^sets in a probability space (Ω, JF, P) is mixing with constant a if (34.17) ПтР(АпГ)Е) = аР(Е) π for every Ε in JF. Then a = \imnP(An). (a) Show that [4r) is mixing with constant a if and only if (34.18) limf XdP = a(xdP for each integrable X (measurable JF). (b) Suppose that (34.17) holds for £e^, where &> is a тг-system, Πε^, and An ea(i?) for ail n. Show that {/!„} is mixing. Hint: First check (34.18) for X measurable σ(^) and then use conditional expected values with respect to σ{&>). (c) Show that, if P0 is a probability measure on (Ω, У) and P() <κ Ρ, then mixing is preserved if Ρ is replaced by P{). 34.17. Τ Application of mixing to the central limit theorem. Let X1,X2,... be random variables on (Ω, JF, P), independent and identically distributed with mean 0 and variance σ2, and put Sn=Xx+ ■ ■ ■ +X„. Then Sn/a\fn =»N by the Lindeberg-Levy theorem. Show by the steps below that this still holds if Ρ is replaced by any probability measure P() en (Ω, JF) that Ρ dominates. For example, the central limit theorem applies lo the sums ££„,ΓΛ(ω) of Rademacher functions if ω is chosen according to the uniform density over the unit interval, and this result shows that the same is true if ω is chosen according to an arbitrary density. Let Yv=Sn/a4n and Z„ = (5„ -5(10„π])/σ\/η, and take & to consist of the sets of the form [(AT,,..., Xk) e #], k>\, Η e <%k. Prove successively: (a) P[Yn<x]^P[N<x]. (b) P[\Yn-Zn\>e]^0. (c) P[Zn<x]^P[N<x]. (d) P(E Π [Z„ <jc]) -»P(E)P[N <x] for £εΛ (e) P(E Π[Ζ„ <jc]) -» P(E)P[N<x] for £ε^. (f) P0[Zn<x]^P[N<x]. (g) />o[|Y„-Zn|>e]-0. (h)P0[y„<Jc]^P[yV<Jc]. 34.18. Suppose that ^ is a sufficient subfield for the family of probability measures Ρβ> θ 6E Θ, on (Ω, JF). Suppose that for each θ and A, p(A, ω) is a version of Рв[А\\<#]ш. and suppose further that for each ω, ρ(-,ω) is a probability
458 DERIVATIVES AND CONDITIONAL PROBABILITY measure on JF. Define Qe on JF by Qe{A) = fnp(A, ω)Ρθ(άω), and show that Qe=Pe- The idea is that an observer with the information in £ (but ignorant of ω itself) in principle knows the values ρ(Α,ω) because each p(A, ·) is measurable ^. If he has the appropriate randomization device, he can draw an ω' from Ω according to the probability measure ρ(-,ω), and his ω will have the same distribution Qe = Pe that ω has. Thus, whatever the value of the unknown Θ, the observer can on the basis of the information in £ alone, and without knowing ω itself, construct a probabilistic replica of ω. 34.19. J4.13 Τ In the context of the discussion on p. 252, let JF be the σ-field of sets of the form Θ X A for A e JF. Show that under the probability measure Q, i0 is the conditional expected value of g() given !F. 34.20. (a) In Example 34.4, take π to have density e~e over Θ = (0,°°). Show by Theorem 34.7 that Γ is a minimal sufficient statistic (in the sense that σ(Τ) is minimal). (b) Let Ρθ be the distribution for samples of size η from a normal distribution with parameter θ = (ιη,σ2\ σ2>0, and let π put unit mass at (0,1). Show that the sampie mean and variance form a minimal sufficient statistic. ι SECTION 35. MARTINGALES Definition Let XUX2,... be a sequence of random variables on a probability space (Ω, &~, P\ and let ^v^2,... be a sequence of σ-fields in 5?. The sequence {(X„, ^n): η = 1,2,...} is a martingale if these four conditions hold: (ii) Xn is measurable 5^; (iii)E[|*„D<*>; (iv) with probability 1, (35.1) E[X„ + № ] =Xn. Alternatively, the sequence Xl.X2,... is said to be a martingale relative to the σ-fields $*~v S*~2,... . Condition (i) is expressed by saying the 5^ form a filtration and condition (ii) by saying the X„ are adapted to the filtration. If Xn represents the fortune of a gambler after the nth play and ^ represents his information about the game at that time, (35.1) says that his expected fortune after the next play is the same as his present fortune. Thus a martingale represents a fair game, and sums of independent random variables with mean 0 give one example. As will be seen below, martingales arise in very diverse connections.
SECTION 35. MARTINGALES 459 The sequence Xv X2,... is defined to be a martingale if it is a martingale relative to some sequence ^j,^2,.... In this case, the σ-fields £n = σ(Χχ,...,Χ^) always work: Obviously, ^,c^ + 1 and Xn is measurable ^, and if (35.1) holds, then E[Xn + 1\\c?n} = E[E[Xn + 1\\3*-J\c?n] = Е[Х„\Ш = Xn by (34.5). For these special σ-fields ^, (35.1) reduces to (35.2) E[Xn + i\\Xl,...,Xn]=Xn. Since a(Xu...,Xn)<z Уп if and only if Xn is measurable 5^ for each n, the σ(Χΐ7..., Xn) are the smallest σ-fields with respect to which the X„ are a martingale. The essential condition is embodied in (35.1) and in its specialization (35.2). Condition (iii) is of course needed to ensure that £[Arn + 1||5^] exists. Condition (iv) says that X„ is a version of E[Xn + 1\\3^]; since X„ is measurable 5^, the requirement reduces to (35.3) fXn + ldP=fxndP, A<=&„. JA J A Since the 5^ are nested, A <= ^ implies that \AXndP= jAXn + ]dP = ■■■ = jAXn+kdP. Therefore, Xri, being measurable 5^, is a version of Е[Хп+кШ- (354) E[Xn+k\\^n]=Xn with probability 1 for к > 1. Note that for A = Ω, (35.3) gives (35.5) E[Xl]=E[X2]= ■■·. The defining conditions for a martingale can also be given in terms of the differences (35.6) An=Xn-Xn_x (Δ, =Χι). By linearity, (35.1) is the same thing as (35.7) £[ΔΠ + 1||^]=0. Note that, since Xk = Δ, + ■ ■ ■ +Ak and Ak =Xk- Xk_u the sets Xv ..., Xn and Δ,,..., Δπ generate the same σ-field: (35.8) σ(*1,...,*π)=σ(Δ1,...,Δ„). Example 35.1. Let Δ,,Δ2,... be independent, integrable random variables such that £[ΔΠ] = 0 for η > 2. If ^n is the σ-field (35.8), then by independence £[ΔΠ + 1||5^] = £[ΔΠ + 1] = 0. If Δ is another random variable,
460 DERIVATIVES AND CONDITIONAL PROBABILITY independent of the Δ„, and if &n is replaced by σ(Δ, Δ,,..., Δπ), then the X„ = Δ, + · ■ ■ +ΔΠ are still a martingale relative to the 5^. It is natural and convenient in the theory to allow σ-fields 5^ larger than the minimal ones (35.8). ■ Example 35.2. Let (Ω, &~, Ρ) be a probability space, let ν be a finite measure on &~, and let ^j,^,... be a nondecreasing sequence of σ-fields in &. Suppose that Ρ dominates ν when both are restricted to 5^—that is, suppose that A e &n and Р(Л)= О together imply that г>(Л) = 0. There is then a density or Radon-Nikodym derivative Xn of ν with respect to Ρ when both are restricted to 5^; Λ^ is a function that is measurable Уп and integrable with respect to P, and it satisfies (35.9) XndP = v(A), Ле^. JA If Л е 5^, then /1 e ^ + 1 as well, so that JAX„ + l dP = v(A); this and (35.9) give (35.3). Thus the Xn are a martingale with respect to the 5^. ■ Example 35.3. For a specialization of the preceding example, lei Ρ be Lebesgue measure on the σ-field <F of Borel subsets of Ω = (0,1], and let Уп be the finite σ-field generated by the partition of Ω into dyadic intervals 0t2-", (k + 1)2-"], 0 < к < 2". If A e 5^ and />(Л) = 0, then Л is empty. Hence f dominates every finite measure j/ on 5^. The Radon-Nikodym derivative is (35.10) Χπ(ω) = -^2~Π'^π+1)2~Π] iftuG(yt2-",(yt + l)2-"]. There is no need here to assume that Ρ dominates ν when they are viewed as measures on all of !F. Suppose that ν is the distribution of Σ£=1ΖΑ2~* for independent Zk assuming values 1 and 0 with probabilities ρ and 1 - p. This is the measure in Examples 31.1 and 31.3 (there denoted by μ), and for ρ Φ \, ν is singular with respect to Lebesgue measure P. It is nonetheless absolutely continuous with respect to Ρ when both are restricted to^. ■ Example 35.4. For another specialization of Example 35.2, suppose that ν is a probability measure Q on & and that !Fn = σ(Κ1,..., Υη) for random variables Yx, Y2,... on Ш, ^"). Suppose that under the measure Ρ the distribution of the random vector (У-,,..., Fn) has density pn(y,,..., yn) with respect to л-dimensional Lebesgue measure and that under Q it has density Я„(Уи···■>УД Т° av°id technicalities, assume that pn is everywhere positive.
SECTION 35. MARTINGALES 461 Then the Radon-Nikodym derivative for Q with respect to f on ^ is (3511) Xn Pn{r»-,Yn)· To see this, note that the general element of 3^ is [(F,,..., Yn) еЯ], Η e &n; by the change-of-variable formula, 1, XndP=J -j- rTP„(yi,...,y„)rfy, ■•■dy„ = β[(ίΊ П)ея]. In statistical terms, (35.11) is a likelihood ratio: p„ and qn are rival densities, and the larger Xn is, the more strongly one prefers qn as an explanation of the observation (F,,..., Yn). The analysis is carried out under the assumption that Ρ is the measure actually governing the Yn\ that is, X„ is a martingale under Ρ and not in general under Q. In the most common case the Y„ are independent and identically distributed under both Ρ and Q: p„(y1,...,y„) = p(y1)---p(y„) and Яп(-У\> ■ ■ ■ > Уn> - я(У\)'' ' я(Уп) f°r densities p and q on the line, where ρ is assumed everywhere positive for simplicity. Suppose that the measures corresponding to the densities ρ and q are not identical, so that P[Yn &Η\φ Q[Yn^H] for some Яе^1. If Z„ = IlY eH], then by the strong law of large numbers, n~lL"k = lZk converges to P[Yl еЯ]опа set (in !F) of P-measure 1 and to Q[Yi еЯ]опа (disjoint) set of (2-measure 1. Thus Ρ and Q are mutually singular on J^even though Ρ dominates Q on 5^. ■ Example 35.5. Suppose that Ζ is an integrable random variable on Ш, &, P) and that Уп are nondecreasing σ-fields in &. If (35.12) Xn = E\ZWn\> then the first three conditions in the martingale definition are satisfied, and by (34.5), Ε[Χπ + ι\\^-η]-Ε[Ε[Ζ\\^η+1]\\^-η]^Ε[Ζ\\^-η]^Χη. Thus X„ is a martingale relative to 5^. ■ Example 35.6. Let Nnk, n, к = 1,2,..., be an independent array of identically distributed random variables assuming the values 0,1,2,... . Define Z0, Z,, Z2, . . . inductively by Ζ0(ω) = 1 and Ζ„(ω) = JV„ ,(ω) + · · · +Λ,π,ζ„_|(ω)(ω); Ζ„(ω) = 0 if Ζ„_,(ω) = 0. If N„k is thought of as the number of progeny of an organism, and if Z„ _, represents the size at time η - 1 of a population of these organisms, then Z„ represents the size at time n. If the expected number of progeny is E[Nnk\ = m, then £[ZjZ„_,] = Zn_]/n, so that Xn = Zn/m", η = 0,1,2,..., is a martingale. The sequence Z0, Z,,... is a branching process. Ш
462 DERIVATIVES AND CONDITIONAL PROBABILITY In the preceding definition and examples, η ranges over the positive integers. The definition makes sense if η ranges over 1,2,...,JV; here conditions (ii) and (iii) are required for 1 < η < N and conditions (i) and (iv) only for 1 < η < N. It is, in fact, clear that the definition makes sense if the indices range over an arbitrary ordered set. Although martingale theory with an interval of the line as the index set is of great interest and importance, here the index set will be discrete. Submartingales Random variables Xn are a submartingale relative to σ-fields 5^ if (i), (ii), and (iii) of the definition above hold and if this condition holds in place of (iv). (iv') with probability 1, (35.13) E\Xn + lWn]>Xn· As before, the X„ are a submartingale if they are a submartingale with respect to some sequence &n, and the special sequence 5^ = σ(Χχ,..., X„) works if any does. The requirement (35.13) is the same thing as (35.14) fXn + xdP>fXndP, A^Fn. JA JA This extends inductively (see the argument for (35.4)), and so (35.15) Е[Х„+к\\Г„]>Х„ for к > 1. Taking expected values in (35.15) gives (35.16) E[XX]<E[X2]< ···. Example 35.7. Suppose that the Δπ are independent and integrable, as in Example 35.1, but assume that Ε[ΔΠ] is for η > 2 nonnegative rather than 0. Then the partial sums Δ, + · · · +ΔΠ form a submartingale. ■ Example 35.8. Suppose that the Xn are a martingale relative to the ^. Then \Xn\ is measurable 5^ and integrable, and by Theorem 34.2(iv), E[\Xn + x\Wn\ > \E[Xn+l\\&-„]\ = \X„\. Thus the \Xn\ are a submartingale relative to the 5^. Note that even if Xv..., Xn generate Уп, in general IXx\,..., |Xn\ will generate a σ-field smaller than &n. ■ Reversing the inequality in (35.13) gives the definition of a supermartin- gale. The inequalities in (35.14), (35.15), and (35.16) become reversed as well. The theory for supermartingales is of course symmetric to that of submartingales.
SECTION 35. MARTINGALES 463 Gambling Consider again the gambler whose fortune after the nth play is Xn and whose information about the game at that time is represented by the σ-field &„. If 5^ = σ(ΑΊ,..., Xn), he knows the sequence of his fortunes and nothing else, but 5^ could be larger. The martingale condition (35.1) stipulates that his expected or average fortune after the next play equals his present fortune, and so the martingale is the model for a fair game. Since the condition (35.13) for a submartingale stipulates that he stands to gain (or at least not lose) on the average, a submartingale represents a game favorable to the gambler. Similarly, a supermartingale represents a game unfavorable to the gambler.f Examples of such games were studied in Section 7, and some of the results there have immediate generalizations. Start the martingale at η = 0, X0 representing the gambler's initial fortune. The difference Δ.π= Χπ- Χπ_λ represents the amount the gambler wins on the nth play,* a negative win being of course a loss. Suppose instead that Δπ represents the amount he wins if he puts up unit stakes. If instead of unit stakes he wagers the amount Wn on the rtth play, WnAn represents his gain on that play. Suppose that Wn > 0, and that Wn is measurable 5^_, to exclude prevision: Before the nth play the information available to the gambler is that in 5^_,, and his choice of stake Wn must be based on this alone. For simplicity take Wn bounded. Then №'„Α„ is integrable, and it is measurable 5^ if Δπ is, and if Xn is a martingale, then E[WnAnl^n_,\ = WnE[An\Wn_,\ = 0 by (34.2). Thus (35.17) X0+W^1+---+W^n is a martingale relative to the &~η. The sequence Wi,W2,... represents a betting system, and transforming a fair game by a betting system preserves fairness; that is, transforming Xn into (35.17) preserves the martingale property. The various betting systems discussed in Section 7 give rise to various martingales, and these martingales are not in general sums of independent random variables—are not in general the special martingales of Example 35.1. If W„ assumes only the values 0 and 1, the betting system is a selection system; see Section 7. If the game is unfavorable to the gambler—that is, if Xn is a supermartingale—and if Wn is nonnegative, bounded, and measurable 5^_1; then the same argument shows that (35.17) is again a supermartingale, is again unfavorable. Betting systems are thus of no avail in unfavorable games. The stopping-time arguments of Section 7 also extend. Suppose that {Xn} is a martingale relative to {5^}; it may have come from another martingale There is a reversal of terminology here: a subfair game (Section 7) is against the gambler, while a submartingale favors him. The notation has, of course, changed. The Fn and Xn of Section 7 have become Xn and Δ„.
464 DERIVATIVES AND CONDITIONAL PROBABILITY via transformation by a betting system. Let τ be a random variable taking on nonnegative integers as values, and suppose that (35.18) [т = п]Ы„. If τ is the time the gambler stops, [τ = η] is the event he stops just after the nth play, and (35.18) requires that his decision is to depend only on the information 5^ available to him at that time. His fortune at time η for this stopping rule is (X„ if η < τ, (35.19) Χ* = " .„ v ' " |I, ιί/ι>τ. Here XT (which has value ΧΓ(ω)(ω) at ω) is the gambler's ultimate fortune, and it is his fortune for all times subsequent to т. The problem is to show that Xq, X*,... is a martingale relative to ^o,^,,.... First, n-l Σ Since [τ > η] = Ω, - [τ < n] e ^, E[\Xn*\\ = "if \Xk\dP+( \Xn\dP< ЁВД<' [Xn*^H]= \J[T = k,XktEH]u[T>n,XntEH]tE&-„. fc = 0 Moreover, and (Xn*dP = f XndP+f XTdP JA -Άη[τ>π] ·ΆηΙτ<,η] (Xn*+ldP=f Xn+idP+f XTdP. Because of (35.3), the right sides here coincide if A e 5^; this establishes (35.3) for the sequence X*,X*,·.·, which is thus a martingale. The same kind of argument works for supermartingales. Since X* = XT for η > τ, X* ~* XT. As pointed out in Section 7, it is not always possible to integrate to the limit here. Let Xn =α + Δ, + · · · +Δ„, where the Δπ are independent and assume the values ± 1 with probability \ (X0 = a), and let τ be the smallest η for which Δ, + · · · +Δ„ = 1. Then EiX^^a and XT = a + l. On the other hand, if the X„ are uniformly bounded or uniformly integrable, it is possible to integrate to the limit: E[XT] = E[X0].
SECTION 35. MARTINGALES 465 Functions of Martingales Convex functions of martingales are submartingales: Theorem 35.1. (i) IfXi,X2,... is a martingale relative to ^,3^,..., if φ is convex, and if the <p(X„) are integrable, then φ(Χ{),φ(Χ2),... is a submartingale relative to ^χ,^2. (ii) //Xx, X2,... is a submartingale relative to !FX, &~2,..., if φ is nonde- creasing and convex, and if the φ(Χ„) яге integrable, then φ(Χ1), φ(Χ2), ...is a submartingale relative to ^, S^,... . Proof. In the submartingale case, Xn <E[Xn + l\\&n\, and if φ is non- decreasing, then φ(Χπ)<φ(Ε[Χπ + 1\\^]). In the martingale case, Xn = E[X„ + 1\\&^], and so <p(X„) = φ{Ε[Χη + ι\\3Γ„\). If ψ is convex, then by Jensen's inequality (34.7) for conditional expectations, it follows that φ(Ε[Χπ + 1\\^]) < EW„+l)\\r„l m Example 35.8 is the case of part (i) for φ(χ) = \x\. Stopping Times Let τ be a random variable taking as values positive integers or the special value oo. It is a stopping time with respect to {5^} if [т = к] е S^k for each finite к (see (35.18)), or, equivalently, if [т<к]^ S^k for each finite k. Define (35.20) &τ = \Α<Ξ&\ Α Π [г <к] е Ук, 1 < к < oo]. This is a σ-field, and the definition is unchanged if [т<к] is replaced by [t = k] on the right. Since clearly [r=j] e 3\ for finite j, τ is measurable If τ(ω) < oo for all ω and ^n = σ(Χχ,...,X„), then ΙΑ(ω) = ΙΑ(ω') for all A in ^ if and only if Χ,Χω) = Λ',(ω') for / ^ τ(ω) = τ(ω'): The information in ^ consists of the values τ(ω), Χχ(ω),...,Χτ(ω)(ω). Suppose now that τ, and τ2 are two stopping times and τ, < τ2. If A <= 5^, then Λ η[τ, <A:]e 5^ and hence A n[r2<k]=A п[тх <k]n [T2<kUsrk:5rT^5rTi. Theorem 35.2. If Xx,...,Xn is a submartingale with respect to !FX,..., J^ and τγ,τ2 are stopping times satisfying 1 <τ,<τ2<«, then XT,XT is a submartingale with respect to &~τ 3^2.
466 DERIVATIVES AND CONDITIONAL PROBABILITY This is the optional sampling theorem. The proof will show that XT, XT is a martingale if Xx,..., Xn is. Proof. Since the XT. are dominated by E"k = i\Xk\, they are integrable. It is required to show that E[XT2\\&~T ] > XTl, or (35.21) f^{XT2-XTi)dP>0, AeFTi. But A <= S^t implies that Α Π [τ, < к < т2] = (А Г) [т, < к - 1]) Г\[т2<к- 1]с lies in ^_,. If Ak = Xk-Xk_l, then f(xT2-xT[)dp = f ii[Ti<k,r2Adp A Ak = l Σί AkdP>0 fc = i ■/АП[т[<к<;т2] by the submartingale property. Inequalities There are two inequalities that are fundamental to the theory of martingales. Theorem 35.3. If Xu...,Xn is a submartingale, then for a > 0, (35.22) МшахЛ^а] < -Е[ЦГ„|]. I<.n a This extends Kolmogorov's inequality: If 5,,52,... are partial sums of independent random variables with mean 0, they form a martingale; if the variances are finite, then S,2, S\,... is a submartingale by Theorem 35.1(i), and (35.22) for this submartingale is exactly Kolmogorov's inequality (22.9). Proof. Let τ2 = η; let τ, be the smallest к such that Xk > a, if there is one, and η otherwise. If Mk = max/sk X„ then [Мп>а]п[т1<к] = [Mk > α] <ξ <Fk, and hence [M„ > a] is in ^. By Theorem 35.2, (35.23) αΡ[Λί„>α]< ί XT dP < f XndP J[M„>a) ' •'lM^a] </ Zn+^<£[Z„+]<£[|ZJ]. ■ This can also be proved by imitating the argument for Kolmogorov's inequality in Section 23. For improvements to (35.22), use the other integrals
SECTION 35. MARTINGALES 467 in (35.23). If Xx,..., X„ is a martingale, |Xx\,..., \Xn\ is a submartingale, and so (35.22) gives P[vaax.t&n\Xi\>α] <α~ιΕ[\Χχ The second fundamental inequality requires the notion of an upcrossing. Let [α,β] be an interval (α<β) and let Xx,...,Xn be random variables. Inductively define variables τ,, τ2,...: τ, is the smallest j such that 1 <j <n and X- <a, and is η if there is no such j; тк for even к is the smallest / such that тк_, <j <n and X}> β, and is η if there is no such /; тк for odd к exceeding 1 is the smallest / such that тк_1 <j <n and X- <a, and is η if there is no such j. The number U of upcrossings of [α,β] by Xv...,Xn is the largest / such that Χτ <α <β <XT . In the diagram, η = 20 and there are three upcrossings. T7=T« = Theorem 35.4. For α submartingale Xv..., Xn, the number U of upcrossings of [α, β] satisfies (35.24) 1 J β — a Proof. Let Yk = max{0, Xk - α} and Θ = β - α. By Theorem 35.1(H), F,,..., Yn is a submartingale. The тк are unchanged if in the definitions Xj < a is replaced by Yt, = 0 and A^- Si )3 by ^- Si 0, and so £/ is also the number of upcrossings of [0,0] by Yv...,Yn. If к is even and тл_, is a stopping time, then for j <n, ;-i [r*=/]= U [^-1='^ + ι<β,...,^_1<0,^>0] 1= 1
468 DERIVATIVES AND CONDITIONAL PROBABILITY lies in ^ and [тк = η] = [τ,. < η - l]c lies in 5^, and so тк is also a stopping time. With a similar argument for odd k, this shows that the тк are all stopping times. Since the тк are strictly increasing until they reach η, τπ = n. Therefore, where Σ£ and Σ0 are the sums over the even к and the odd к in the range 2<k <n. By Theorem 35.2, 10 has nonnegative expected value, and therefore, E[Y„] Ξ>£[ 1 Д If Ут,._1 = 0 < θ ζ Υτ2_ (which is the same thing as XT2j_i <α<β <XTJ, then the difference Y. - YT _ appears in the sum 1e and is at least Θ. Since there are U of these differences, le>0U, and therefore E[Yn]>. 6E[U]. In terms of the original variables, this is (β-α)Ε[ϋ]<[ (Xn-a)dP<E[\Xrl\]+\a\. ■ J\X„>cc] In a sense, an upcrossing of [α, β] is easy: since the Xk form a submartin- gale, they tend to increase. But before another upcrossing can occur, the sequence must make its way back down below a, which it resists. Think of the extreme case where the Xk are strictly increasing constants. This is reflected in the proof. Each of Xe and 10 has nonnegative expected value, but for 2e the proof uses the stronger inequality E[le] > Ε[θίΙ]. Convergence Theorems The martingale convergence theorem, due to Doob, has a number of forms. The simplest one is this: Theorem 35.5. Let XVX2,-.. be a submartingale. If К = sup„ £[| ЛГ„|] < oo, then Xn -»X with probability 1, where X is a random variable satisfying E[\X\]<K. Proof. Fix a and β for the moment, and let U„ be the number of upcrossings of [α,β] by Xu...,Xn. By the upcrossing theorem, E[Un]< (Ε[\Χπ\] + \α\)/(β-α)<(Κ + \α\)/(β-α). Since U„ is nondecreasing and E[Un] is bounded, it follows by the monotone convergence theorem that supn£/„ is integrable and hence finite-valued almost everywhere. Let X* and X* be the limits superior and inferior of the sequence X1,X2,...; they may be infinite. If X^ <a <β <Χ*, then U„ must go to infinity. Since U„ is bounded with probability 1, P[X* <a <β <Χ*) = 0.
SECTION 35. MARTINGALES 469 Now (35.25) [*„<**]= U[**«*</3<**], where the union extends over all pairs of rationale a and β. The set on the left therefore has probability 0. Thus X* and X^ are equal with probability 1, and Xn converges to their common value X, which may be +oo. By Fatou's lemma, £[|ΛΊ]< lim inf „ £[1^1] < K. Since it is integrable, X is finite with probability 1. ■ If the Xn form a martingale, then by (35.16) applied to the submartingale IZJJA^I,... the E[\X„\] are nondecreasing, so that K = lim„ E[\X„\]. The hypothesis in the theorem that К be finite is essential: If Xn = Δ, + · · · +ΔΠ, where the Δπ are independent and assume values ±1 with probability \, then X„ does not converge; ENA'J] goes to infinity in this case. If the X„ form a nonnegative martingale, then EtiA'J] = E[Xn] = E[Xl] by (35.5), and К is necessarily finite. Example 35.9. The X„ in Example 35.6 are nonnegative, and so Xn = Zn/m" ~*X, where X is nonnegative and integrable. If m < 1, then, since Z„ is an integer, Z„ = 0 for large n, and the population dies out. In this case, X = 0 with probability 1. Since E[X„] = E[X0] = 1, this shows that E[Xn] -* E[X] may fail in Theorem 35.5. ■ Theorem 35.5 has an important application to the martingale of Example 35.5, and this requires a lemma. Lemma. // Ζ is integrable and !Fn are arbitrary σ-fields, then the random variables E[Z\\&~n] are uniformly integrable. For the definition of uniform integrability, see (16.21). The ^n must, of course, lie in the σ-field 5*", but they need not, for example, be nondecreasing. Proof of the Lemma. Since \E[Z\\^]\ < E[\Z\ ||5^], Ζ may be assumed nonnegative. Let Ααπ = [Ε[Ζ\\^„] > a]. Since Аап е &n [ E[Z\\&-„]dP= [ ZdP. nan si an It is therefore enough to find, for given e, an a such that this last integral is less than e for all n. Now jAZdP is, as a function of A, a finite measure dominated by P; by the e-8 version of absolute continuity (see (32.4)) there is a δ such that Ρ(Α)<δ implies that fAZdP <e. But Ρ[Ε[Ζ\\^„]^α]^ α-ιΕ[Ε[Ζ\\3Γη]] = α-1Ε[Ζ]<δ for large enough α. ■
470 DERIVATIVES AND CONDITIONAL PROBABILITY Suppose that 5^ are σ-fields satisfying ^j с iF2 с · · ·. If the union U„^i 5^ generates the σ-field J£, this is expressed by 5^ t &<*,■ The requirement is not that ^ coincide with the union, but that it be generated by it. Theorem 35.6. // 5^ f 5^ and Ζ й integrable, then (35.26) E[Z||55;]-»E[Z||5L]. ΜίΛ probability 1. Proof. According to Example 35.5, the random variables Xn = E[Z\\^] form a martingale relative to the 5^. By the lemma, the Xn are uniformly integrable. Since ZsflA'J] <£[|Z|], by Theorem 35.5 the X„ converge to an integrable X. The problem is to identify X with E[Z\\3^\. Because of the uniform integrability, it is possible (Theorem 16.14) to integrate to the limit: jAXdP = lim„ fAXn dP. If A <= ^ and η >. к, then fAX„ dP = fAEXZWn\dP = JAZdP. Therefore, jAXdP = }AZdP for all A in the 77-system (J"=i S^k; since X is measurable 5^, it follows by Theorem 34.1 that X is a version of £[Z||^]. ■ Applications: Derivatives Theorem 35.7. Suppose that Ш, ^,P) is a probability space, ν is a finite measure on &~, and 5^|^c ^, Suppose that Ρ dominates ν on each &~n, and let Xn be the corresponding Radon-Nikodym derivatives. Then Xn-*X with probability 1, where X is integrable. (i) // Ρ dominates ν on 5^,, then X is the corresponding Radon-Nikodym derivative. (ii) If Ρ and ν are mutually singular on 5^, then X=Q with probability 1. Proof. The situation is that of Example 35.2. The density Xn is measurable 5^ and satisfies (35.9). Since Xn is nonnegative, E[\Xn\]= E[Xn] = ν(Ω), and it follows by Theorem 35.5 that Xn converges to an integrable X. The limit X is measurable 5^. Suppose that Ρ dominates ν on ^ and let Ζ be the Radon-Nikodym derivative: Ζ is measurable 5^, and fAZdP^=v(A) for /le^. It follows that JAZdP = JAX„dP for A in &„, and so Xn = E\Z\\&„\. Now Theorem 35.6 implies that X„ -* E[Z\\FJ = Z. Suppose, on the other hand, that Ρ and ν are mutually singular on 3^,, so that there exists a set 5 in J£ such that v(S) = 0 and P(S) = 1. By Fatou's lemma jAXdP < \\muAnfAXn dP. If Λ e ^, then fAX„ dP = v(A) ίοτ η > к, and so fAXdP < v(A) for A in the field U"= ι ^k. It follows by the monotone class theorem that this holds for all A in ^. Therefore, jXdP = jsXdP< v(S) = 0, and Ζ vanishes with probability 1. ■
SECTION 35. MARTINGALES 471 Example 35.10. As in Example 35.3, let ν be a finite measure on the unit interval with Lebesgue measure (Ω, &~, P). For ^ the σ-field generated by the dyadic intervals of rank n, (35.10) gives Xn. In this case <ί^ί5ζ = &· For each ω and η choose the dyadic rationale απ(ω) = k2~" and bn(w) = (A: + 1)2~" for which α „(ω) <ω < bn{w). By Theorem 35.7, if F is the distribution function for v, then V 6π(ω)-απ(ω) except on a set of Lebesgue measure 0. According to Theorem 31.2, F has a derivative F' except on a set of Lebesgue measure 0, and since the intervals (α„(ω), 6„(ω)] contract to ω, the difference ratio (35.27) converges almost everywhere to F'(w) (see (31.8)). This identifies X. Since (35.27) involves intervals (α„(ω), 6„(ω)] of a special kind, it does not quite imply Theorem 31.2. By Theorem 35.7, X- F' is the density for ν in the absolutely continuous case, and X = F' = 0 (except on a set of Lebesgue measure 0) in the singular case, facts proved in a different way in Section 31. The singular case gives another example where E[Xn] -»E[X] fails in Theorem 35.5. ■ Likelihood Ratios Return to Example 35.4: v~Q is a probability measure, 5^ = σ(Υν..., Yn) for random variables Yn, and the Radon-Nikodym derivative or likelihood ratio Xn has the form (35.11) for densities pn and qn on R". By Theorem 35.7 the Xn converge to some X which is integrable and measurable 3^ = σ(ΥνΥ2,...\ If the Yn are independent under Ρ and under Q, and if the densities are different, then Ρ and Q are mutually singular on σ(ΥνΥ2,···), as shown in Example 35.4. In this case X = 0 and the likelihood ratio converges to 0 on a set of P-measure 1. The statistical relevance of this is that the smaller Xn is, the more strongly one prefers Ρ over Q as an explanation of the observation {Yx,...,Yn), and Xn goes to 0 with probability 1 if Ρ is in fact the measure governing the Yn. It might be thought that a disingenuous experimenter could bias his results by stopping at an Xn he likes—a large value if his prejudices favor Q, a small value if they favor P. This is not so, as the following analysis shows. For this argument Ρ must dominate Q on each 5^ = a(Yu...,Yn), but the likelihood ratio Xn need not have any special form. Let τ be a positive-integer-valued random variable representing the time the experimenter stops. Assume that τ is finite, and to exclude prevision, assume that it is a stopping time. The σ-field ^ defined by (35.20) represents the information available at time τ, and the problem is to show that Χτ
472 DERIVATIVES AND CONDITIONAL PROBABILITY is the likelihood ratio (Radon-Nikodym derivative) for Q with respect to Ρ on &\. First, X, is clearly measurable 3\. Second, if A e ^, then Αη[τ = /i]e^, and therefore CO 00 [XTdP = Σ f XndP= Σ(2(Λη[τ = η])=βΜ), JA n = iJAnlr = n] „ = l as required. Reversed Martingales A left-infinite sequence..., A"_2, X-\ ls a martingale relative to σ- fields..., ^L2, &_x if conditions (ii) and (iii) in the definition of martingale are satisfied for η < - 1 and conditions (i) and (iv) are satisfied for η < - 1. Such a sequence is a reversed or backward martingale. Theorem 35.8. For a reversed martingale, lim„_0O X_n = X exists and is integrable, and E[X] = E[X_n] for all n. Proof. The proof is almost the same as that for Theorem 35.5. Let X* and X^ be the limits superior and inferior of the sequence X_vX^2->·· ■ ■ Again (35.25) holds. Let Un be the number of upcrossings of [α, β] by Ar_„,...,Ar_1. By the upcrossing theorem, £[{/„]< (E[|AL,|]+ |α|)/(/3-α). Again E[Un] is bounded, and so sup„£/„ is finite with probability 1 and the sets in (35.25) have probability 0. Therefore, lim„_0OX_n —X exists with probability 1. By the property (35.4) for martingales, X_n = ElALjI^J for η = 1,2,... . The lemma above (p. 469) implies that the AL„ are uniformly integrable. Therefore, X is integrable and E[X] is the limit of the E[X_n]\ these all have the same value by (35.5). ■ If 5^ are σ-fields satisfying ^ э &~2 ..., then the intersection fl"= 15^ = ^o is also a σ-field, and the relation is expressed by 5^ i 3%. Theorem 35.9. // Уп \ S^Q and Ζ is integrable, then (35.28) E\ZWn\-*E\Z\\^\ with probability 1. Proof. If X_n = E[Z\\&~n\, then ..., Х_г,Χ_λ is a martingale relative to ..., ^2->^\· By tne preceding theorem, Ε[Ζ\\&~η\ converges as η -»oo to an integrable X and by the lemma, the E[Z\\3^] are uniformly integrable. As the limit of the E[Z\\!Fn\ for η > к, X is measurable ^k; к being arbitrary, X is measurable 3*~Q.
SECTION 35. MARTINGALES 473 By uniform integrability, A e ^ implies that [XdP= l\m [e[Z\\&„ ]dP=lim[E[E[Z\\&r„ ]\MdP JA n Ja n J a ·= Vim ( E[Z\\F0]dP = [ E[Z\\y0]dP. η JA JA Thus X is a version of Ε[Ζ\\&~0]. ■ Theorems 35.6 and 35.9 are parallel. There is an essential difference between Theorems 35.5 and 35.8, however. In the latter, the martingale has a last random variable, namely A'_,, and so it is unnecessary in proving convergence to assume the £[|A'J] bounded. On the other hand, the proof in Theorem 35.8 that X is integrable would not work for a submartingale. Applications: de Finetti's Theorem Let Θ,ΧΧ,Χ2,... be random variables such that 0 < θ < 1 and, conditionally on Θ, the Xn are independent and assume the values 1 and 0 with probabilities θ and 1 - Θ: for и,,..., un a sequence of O's and ]'s, (35.29) Р\Х1=щ,...,Хп = ипЩ = в'{\-в)п~\ where s = w, + · · · +u„. To see that such sequences exist, let θ,Ζλ,Ζ2,··· be independent random variables, where θ has an arbitrarily prescribed distribution supported by [0,1] and the Z„ are uniformly distributed over [0,1]. Put X„ = I,Z sS|. If, for instance, f(x) =x(l —x) = P[Zx<x, Z2>x], then P[XX = 1, ΑΓ2=0!Ιβ]2=/(β(ω)) by (33.13). The obvious extension establishes (35.29). Integrate (35.29): (35.30) Р[Х1=и1,...,Хп = ип]=Е[в\1-в)п-']. Thus {Xn} is a mixture of Bernoulli processes. It is clear from (35.30) that the Xk are exchangeable in the sense that for each η the distribution of (Xlt...,X„) is invariant under permutations. According to the following theorem of de Finetti, every exchangeable sequence is a mixture of Bernoulli sequences. Theorem 35.10. // the random variables Xv X2,... are exchangeable and take values 0 and 1, then there is a random variable θ for which (35.29) and (35.30) hold.
474 DERIVATIVES AND CONDITIONAL PROBABILITY Proof. Let Sm = Xx + ··· +Xm. If t < m, then P[Sm=t]= Σ P[Xl = ul,...,Xm=um], where the sum extends over the sequences for which u, + · · · +um=t. By exchangeability, the terms on the right are all equal, and since there are I m I of them, Suppose that s<n<m and u, + ··· +w„=5<i<m, add out the Mn + i>·· >Mm that sum to t - s: ^.-«,.-.*.-».ι*.-Ί-(,;:;)/(Τ) _ (Q,(m-Q„-, . f (j_\ (m)„ T"^m\m)' where s— 1 / · \ n—s— Ι ι · ^ In — 1 / · > /.,..(*)-д(*-£) П (,-x-s)/ri(i-£). The preceding equations still hold if further constraints 5m + , =tl,...,Sm+j = i,- are joined to Sm = t. Therefore, P[XX = ux,...,X„ = u„\\Sm,..., Sm+j] = fn,s,m(Sm/m). Let .^-tf(Sm,Sm + 1,...)and.y= Г\т^т- Now fix η and u,,...,u„, and suppose that и, + · · · +un = 5. Let / -»00 and apply Theorem 35.6: /'[A', = M,,...,Ar„ = u„||ly^]=/„i m(Sm/m). Let m -» » and apply Theorem 35.9: P[Z1=Ml,...,Z„=MJ|^]w=lim/„ii,m(^^-) holds for ω outside a set of probability 0. Fix such an ω and suppose that {Sm(w)/m} has two distinct limit points. Since the distance from each Sm(w)/m to the next is less than 2/m, it follows that the set of limit points must fill a nondegenerate interval. But limkjcm =x implies lim4/^jm(xm ) =xs(l -x)"~s, and so it follows further that xs(l -x)n~s must be constant over this interval, which is impossible. Therefore, Sm(o))/m must converge to some limit θ(ω). This shows that P[XX = щ,...,Х„ = mJIi/Ί = 0*(1 - θ)""5 with probability 1. Take a conditional expectation with respect to σ(θ), and (35.29) follows. ■
SECTION 35. MARTINGALES 475 Bayes Estimation From the Bayes point of view in statistics, the 0 in (35.29) is a parameter governed by some a priori distribution known to the statistician. For given XU...,X„, the Bayes estimate of 0 is Ε[θ\\Χν...,Χη]. The problem is to show that this estimate is consistent in the sense that (35.31) Е[в\\Х1г...,Х„]-*в with probability 1. By Theorem 35.6, Ε[θ\\Χι,...,Χη}-+Ε[θ\\,'%\, where 3^ = σ(Χι, Χ2,...), and so what must be shown is that £[0||^] = 0 with probability 1. By an elementary argument that parallels the unconditional case, it follows from (35.29) for £„=*,+ ··· +Xn that E[Sn\\e] = ne and E[(Sn - и0)2||0] = «θ(1 -0). Hence E[(n~'Sn - 0)2] < n~\ and by Chebyshev's inequality n~lSn converges in probability to 0. Therefore (Theorem 20.5), Vimkn^lSn = 0 with probability 1 for some subsequence. Thus 0 = 0' with probability 1 for a 0' measurable S^, and Ε[θ\\3^\ = E[ff\\S^\ ■= 0' = 0 with probability 1. A Central Limit Theorem* Suppose Xl,X2,... is a martingale relative to !FV ^,..., and put Yn=Xn -*„_,(?!=*,), so that (35.32) E[U^-,]=0. View F„ as the gambler's gain on the nth trial in a fair game. For example, if Δ,,Δ7,... are independent and have mean 0, 5^ = σ(Δ,,...,Δ„), Wn is measurable ^_„ and Yn = Η^Δ„, then (35.32) holds (see (35.17)). A specialization of this case shows that Xn — Yfk-\^k пее(* not be approximately normally distributed for large n. Example 35.11. Suppose that Δ„ takes the values ± 1 with probability \ each and Wx = 0, and suppose that Wn = \ for η > 2 if Δ, = 1, while Wn = 2 for η > 2 if Δ, = -1. If Sn = Δ2 + · · · + Δ„, then Xn is 5„ or 25„ according as Δ, is +1 or -1. Since Sn/4n has approximately the standard normal distribution, the approximate distribution of Xn/ 4n is a mixture, with equal weights, of the centered normal distributions with standard deviations 1 and 2. ■ To understand this phenomenon, consider conditional variances. Suppose for simplicity that the Yn are bounded, and define (35-33) °t = E[Y?\\*n-i\ *This topic, which requires the limit theory of Chapter 5, may be omitted.
476 DERIVATIVES AND CONDITIONAL PROBABILITY (take ^o = {0, Ω}). Consider the stopping times (35.34) y, = min k=l Under appropriate conditions, Xv /4t will be approximately normally distributed for large t. Consider the preceding example. Roughly: If Δ, = +1, then L"k = 1ak = η - 1, and so v,~t and Xv /4~t ~ S, /i/7; if Δ, = - 1, then Σ£„,σ*2 = 4(n - 1), and so v, = i/4 and X^ /ft » 2St/jfi = S,/4/y774. In either case, ^ /i/7 approximately follows the standard normal law. If the rtth play takes σ„2 units of time, then v, is essentially the number of plays that take place during the first / units of time. This change of the time scale stabilizes the rate at which money changes hands. Theorem 35.11. Suppose the Yn = Xn — Xn _, are uniformly bounded and satisfy (35.32), and assume that Σ„σ„2 = °° with probability 1. Then Xv / -ft =* N. This will be deduced from a more general result, one that contains the Lindeberg theorem. Suppose that, for each n, Xnl, Xn2,... is a martingale with respect to ^ι,·^2>··· · Define Kk=Xnk~Xn,k-i> suppose the Ynk have second moments, and put σ„\ = E[Y*k\\&~„л -',](^0 = (0>Ω))· Tne probability space may vary with л. If the martingale is originally defined only for 1 < к < r„, take Ynk = 0 and Упк = 9~„tn for к > r„. Assume that Σ"=ιΥηΙζ and Σ"„,σ„2Λ converge with probability 1. Theorem 35.12. Suppose that (35.35) Σ<τη\-*Ρ<τ2, k-l ννΛβ/ΐ σ й α positive constant, and thai (36.36) EfiftiwJ-o for each е. ГЛел Σ"=1^ =» σΝ. Proof of Theorem 35.11. The proof will be given for t going to infinity through the integers.1 Let Ynk =I[V ^k^k/fi and !Fnk = &k. From [v„>k] = [Σ;*Γ,ν<η]ε^_2 follow E[Lll^-il = 0and о-Д-£[^11^,*-.] = /κ^,σ>. If A: bounds the \Yk\, then 1 < Σ^_,σ„\ = «-'£*"_,σ* < 1 + /£2/л, so that (35.35) holds for σ = 1. For л large enough that K/{n<€, the sum in (35.36) vanishes. Theorem 35.12 therefore applies, and rk"=lYk/Jn~ = Z:=1Ynk=>N. ш fFor the general case, first check that the proof of Theorem 35.12 goes through without change if η is replaced by a parameter going continuously to infinity.
SECTION 35. MARTINGALES 477 Proof of Theorem 35.12. Assume at first that there is a constant с such that (35.37) which in fact suffices for the application to Theorem 35.11. Write Sk = Z).xYn] (50 = 0), Sm = Z%lYnj, Σ* = Σ,*.,σ„2, (Σ0 = 0), and Σοο = ^%\σηί'ι tne dependence on η is suppressed in the notation. To prove E[e"s*] -*e~ ^2<T\ observe first that |Ε[β'«*·-β-*«ν]| = |£[β,ι5-(1-βϊ|2Σ-β-ϊ'ν)+β^ιν(^|2Σ-βίι5--1)]| <E[\l -^'24r;'V|] +\E[e^W,s·- l]| =A +B. The term A on the right goes to 0 as η -> <», because by (35.35) and (35.37) the integrand is bounded and goes to 0 in probability. The integrand in В is ου £е"5*-'(е"^-е-з'2<^)^'21*, k = l because the mth partial sum here telescopes to e"Sme'2' lm - 1. Since, by (35.37), this partial sum is bounded uniformly in m, and since Sk_l and l,k are measurable ^д_1; it follows (Theorem 16.7) that В f;£[e/,s*-«e5,2£*(e"y-*-e-i,2«r-*)] = 1 k=l k=l To complete the proof (under the temporary assumption (35.37)), it is enough to show that this last sum goes to 0. By (26.42), (35.38) β'ιΥ·* = 1+ίίΥηΙι-ϊ2Υη\ + θ, where (write Ink = 1[\улк\ье\ an(* 'et K, bound t2 and |i|3) |0|<min{|ir„J3, \tYnk\2}iK,(Yn\lnk + €Yn\).
478 DERIVATIVES AND CONDITIONAL PROBABILITY And (38.39) e " *'V"* = 1 - \t 2a2k + θ', where (use (27.15) and increase K,) \&\±{\^к)\^<№пке^<К^к. Because of the condition E[Ynk\\!Fn k _ x\ = 0 and the definition of a2k, the right sides of (35.38) and (35.39), minus θ and ff, respectively, have the same conditional expected value given &„л-у By (35.37), therefore, ft = l *K, Σ {E[Yn2kfnk\ reE[an\] + Ε[ση\]) ^Κ,ί Σ E[Yn2kInk] +€C + cE\Supan2k\). Since ο£ < E[e2 + Y2kInk\\^k^\ <e2 + Σ^,Ε^Λ/ΙΙ^.,-ιΙ it follows by (35.36) that the last expression above is, in the limit, at most K,(ec +ce2). Since e is arbitrary, this completes the proof of the theorem under the assumption (35.37). To remove this assumption, take с > σ2, define Ank = [Σ*=1σ„2<c] and Λ„οο = [Σ7=1σ„2,.<4 and take Znk = YnkIAnk. From Апк^^к_, follow E[Znk\\^k.l} = 0 and T2nk = E\Z2nk\\^k_x\ = lAntfk. Since Σ;„τ„2, is Σ;=1σ„2 on Ank--Ank+i and Σ7=ισ„2 on Ληοο, the Z-array satisfies (35.37). Now P(AnJ-+ 1 by (35.35), and on Лпоо, т2к=а2к for all A, so that the Z-array satisfies (35.35). And it satisfies (35.36) because \Znk\ <\Ynk\. Therefore, by the case already treated, Σ"=ιΖηΙζ=*σΝ. But since Σ%α1ΥηΙζ coincides with this last sum on An„, it, too, is asymptotically normal. ■ PROBLEMS 35.1. Suppose that Δ,,Δ2,... are independent random variables with mean 0. Let Λ^Δ, and Xn+l=Xn + An+lfn(Xb...,Xn\ and suppose that the Xn are integrable. Show that {Xn} is a martingale. The martingales of gambling have this form. 35.2. Let У„ Y2,... be independent random variables with mean 0 and variance σ2. Let X = (Σ"_ιΥ*)2 - ησ2 and show that {Xn} is a martingale.
SECTION 35. MARTINGALES 479 35.3. Suppose that [Yn} is a finite-state Markov chain with transition matrix [pu]. Suppose that LjPijx(j) = λχ(ϊ) for all ί (the д:(г) are the components of a right eigenvector of the transition matrix). Put Xn = λ ~"x(Y„) and show that {A^} is a martingale. 35.4. Suppose that Y1,Y2,... are independent, positive random variables and that E[Y„]=l.Put X„ = Yl ■■■¥„. (a) Show that {Xn} is a martingale and converges with probability 1 to an integrable X. (b) Suppose specifically that Yn assumes the values \ and § with probability \ each. Show that A"=0 with probability 1. This gives an example where Ε[Π"^1Υ„]^Π"^ιΕ[Υ„] for independent, integrable, positive random variables. Show, however, that Е[П"=1У„] < Π"=ι£[Υη] always holds. 35.5. Suppose that Xu X-,,... is a martingale satisfying E[X,] = 0 and E[X*] < oo. Show that Ei(Xn+r-Xn)2]=Lrk = iE[(Xn+k-Xn+k_iY] (the variance of the sum is the sum of the variances). Assume that ΣηΕ[(Χη — Xn_i)2]<<x> and prove that Xn converges with probability 1. Do this first by Theorem 35.5 and then (see Theorem 22.6) by Theorem 35.3. 35.6. Show that a submartingale Xn can be represented as Xn — Yn + Zn, where Yn is a martingale and 0 < Zx < Z2 < · · ·. Hint· Take X0 = 0 and &n=Xn—Xn_u and define Z„ = Ε2_,£[ΔΑ||^_,] (^ = {0,Ω}). 35.7. If XVX2,... is a martingale and bounded either above or below, then sup„£[|*J]<°°· 35.8. Τ Let Xn = Δ, + · · · +Δη, where the Δη are independent and assume the values ± 1 with probability { each. Let τ be the smallest η such that Xn = 1 and define X* by (35.19). Show that the hypotheses of Theorem 35.5 are satisfied by {X*} but that it is impossible to integrate to the limit. Hint: Use (7.8) and Problem 35.7. 35.9. Let Χγ, Χ2,... be a martingale, and assume that |ΛΓ,(ω)| and \Χη(ω) - ΛΤη_,(ω)| are bounded by a constant independent of ω and n. Let τ be a stopping time with finite mean. Show that XT is integrable and that E[XT\ = E[XX]. 35.10. 35.8 35.9 Τ Use the preceding result to show that the τ in Problem 35.8 has infinite mean. Thus the waiting time until a symmetric random walk moves one step up from the starting point has infinite expected value. 35.11. Let Xx, X2,... be a Markov chain with countable state space S and transition probabilities ptJ. A function ψ on S is excessive or superharmonic if <p(i) ^ LjP,j<p(j). Show by martingale theory that <p(X„) converges with probability 1 if ψ is bounded and excessive. Deduce from this that if the chain is irreducible and persistent, then φ must be constant. Compare Problem 8.34. 35.12. Τ A function φ on the integer lattice in Rk is superharmonic if for each lattice point χ, φ(χ)>(2Ιί)~ιΣφ()>), the sum extending over the 2k nearest neighbors y. Show for к = 1 and к = 2 that a bounded superharmonic function is constant. Show for к > 3 that there exist nonconstant bounded harmonic functions.
480 DERIVATIVES AND CONDITIONAL PROBABILITY 35.13. 32.7 32.9 Τ Let (Ω, У, P) be a probability space, let ν be a finite measure on У, and suppose that &~„Ί S^C &~. For η <<», let Xn be the Radon- Nikodym derivative with respect to Ρ of the absolutely continuous part of ν when Ρ and ν are both restricted to &η. The problem is to extend Theorem 35.7 by showing that Xn -* ΛΓ„ with probability 1. (a) For η < α», let р(А)=(х„аР + ая(А), /let '/4 be the decomposition of e into absolutely continuous and singular parts with respect to Ρ on &n. Show that Х1г Х2 is a supermartingale and converges with probability 1. (b) Let σ£Α)=(ζ„άΡ + σ<η(Α), Ae^n, J A be the decomposition of σ„ into absolutely continuous and singular parts with respect to Ρ on &n. Let Yn=E[Xj\&n\ and prove f (Y„ + Z„)dP + a„(A)= f X„dP + a„(A), A^Fn. 1A JA Conclude that Y„ + Z„=X„ with probability 1. Since Yn converges to X„, Z„ converges with probability 1 to some Z. Show that jAZdP <aJ,A) for A e 5£, and conclude that Ζ = 0 with probability 1. 35.14. (a) Show that {X„} is a martingale with respect to {^} if and only if, for all η and all stopping times τ such that τ <η, E[Xn\\^] = XT. (b) Show that, if {Xn) is a martingale and τ is a bounded stopping time, then E[XT] = E[Xl]. 35.15. 31.9 Τ Suppose that &~„ Τ 5£ and /1 e 5£, and prove that Ρ[Α\\3*~„]-*ΙΑ with probability 1. Compare Lebesgue's density theorem. 35.16. Theorems 35.6 and 35.9 have analogues in Hubert space. For η < <χ>, let P„ be the perpendicular projection on a subspace Mn. Then Pnx -~* P„x for all χ if either (a) M, cM2 с · · · and Mm is the closure of D„<aiMn or (b) M] эМ2 э ··· and M00= Пп<0ОМ„. 35.17. Suppose that θ has an arbitrary distribution, and suppose that, conditionally on Θ, the random variables У,,У2>··· are independent and normally distributed with mean θ and variance σ2. Construct such a sequence {θ, Yl,Y2,...}. Prove (35.31). 35.18. It is shown on p. 471 that optional stopping has no effect on likelihood ratios This is not true of tests of significance. Suppose that Xx, X2,··· are independent and identically distributed and assume the values 1 and 0 with probabilities ρ and I-p. Consider the null hypothesis that p= {- and the alternative that ρ > {-. The usual .05-level test of significance is to reject the null
SECTION 35. MARTINGALES 481 hypothesis if (35.40) -i(*,+ ···+*„-5/1) > 1.645. For this test the chance of falsely rejecting the null hypothesis is approximately P[N> 1.645] ~ .05 if η is large and fixed. Suppose that η is not fixed in advance of sampling, and show by the law of the iterated logarithm that, even if ρ is, in fact, 5, there are with probability 1 infinitely many η for which (35.40) holds. 35.19. (a) Suppose that (35.32) and (35.33) hold. Suppose further that, for constants si s;2ZLxat^Pl and J^EJ-.EltfW'J "* °' and show that s^LUiYk => N. Hint: Simplify the proof of Theorem 35.11. (b) The Lindeberg-Leuy theorem for martingales Suppose that ..., Υ-\, Vq,У],... is stationary and ergodic (p. 494) and that E[Yk2]<cc and E[Yk\\Yk_1,Yk_2,...] = 0. Prove that Lk = iYk /{n is asymptotically normal. Hint: Use Theorem 36.4 and the remark following the statement of Lindeberg's Theorem 27.2. 35.20. 24.4 Τ Suppose that the σ-field 9^ in Problem 24.4 is trivial. Deduce from Theorem 35.9 that Ρ[Α\\Τ'η^]^ Ρ[Α\\$ζ>] = Ρ(Α) with probability 1, and conclude that Τ is mixing.
CHAPTER 7 Stochastic Processes SECTION 36. KOLMOGOROVS EXISTENCE THEOREM Stochastic Processes A stochastic process is a collection [X/. ;еГ] of random variables on a probability space (Ω, $*~, P). The sequence of gambler's fortunes in Section 7, the sequences of independent random variables in Section 22, the martingales in Section 35—all these are stochastic processes for which Τ = {1,2,...}. For the Poisson process [N,: t > 0] of Section 23, T = [0, «>). For all these processes the points of Τ are thought of as representing time. In most cases, Τ is the set of integers and time is discrete, or else Τ is an interval of the line and time is continuous. For the general theory of this section, however, Τ can be quite arbitrary. Finite-Dimensional Distributions A process is usually described in terms of distributions it induces in Euclidean spaces. For each /c-tuple (f,,...,fk) of distinct elements of T, the random vector (X, ,...,X, ) has over Rk some distribution μ1{ ,: (36.1) βΐι lk(H)=P[{Xli,...,Xlk)tEH], Яе#. These probability measures μ, , are the finite-dimensional distributions of the stochastic process [X,: t e T]. The system of finite-dimensional distributions does not completely determine the properties of the process. For example, the Poisson process [N,: t > 0] as defined by (23.5) has sample paths (functions /ν,(ω) with ω fixed and t varying) that are step functions. But (23.28) defines a process that has the same finite-dimensional distributions and has sample paths that are not step functions. Nevertheless, the first step in a general theory is to construct processes for given systems of finite- dimensional distributions. 482
SECTION 36. KOLMOGOROV'S EXISTENCE THEOREM 483 Now (36.1) implies two consistency properties of the system μ, , . Suppose the Η in (36.1) has the form H = Hlx ··· xHk (H^^1), and consider a permutation ν of (1,2,..., к). Since [(X,,..,, X, ) <= (#, χ · · · χ Hk)] and [(Ζ, χ,...,Χ, k)<ξ(Яи1 χ · · · χЯ^)] are the same event, it follows by (36.1) that" (36.2) /z,, .„(Н.Х'-хЩ^,,, ..JH.iX'·' XffTt). For example, if μΜ = cXi'', then necessarily μ., ^ = ι/ Χ v. The second consistency condition is (36.3) μ,, (1.,(Я,Х··· ХЯЛ_,)=Д|1 „^(Я.Х-ХЯ^ХЯ1). This is clear because (A',,..., X, ) lies in Hx x · · · xHk_x if and only if U(i,...,Z/4 _,,*,,) lies in Я,Х ■'■'хЯ^.хЛ'. Measures μ., , coming from a process [Ζ,: ί e Γ] via (36.1) necessarily satisfy (36.2) and (36.3). Kolmogorcv's existence theorem says conversely that if a given system of measures satisfies the two consistency conditions, then there exists a stochastic process having these finite-dimensional distributions. The proof is a construction, one which is more easily understood if (36.2) and (36.3) are combined into a single condition. Define φπ: Rk -> Rk by φπ applies the permutation π to the coordinates (for example, if π sends x3 to first position, then π~ι1 = 3). Since <p~'(#i X · · · xHk) = Hn, X ■ · · X ΗπΙζ, it follows from (36.2) that Mi., л^Л") =/*„..„(") for rectangles H. But then (36.4) **,,.,, = /*,„,. i^-1· Similarly, if <p: Л*-»Л*"1 is the projection φ(χχ,. ..,xk) = (xu..., xk_x), then (36.3) is the same thing as (36·5) Μι,.. '*-, = Μι, . ι4<Ρ ■ ι The conditions (36.4) and (36.5) have a common extension. Suppose that (и,,..., um) is an m-tuple of distinct elements of Г and that each element of (r,,...,ίΛ) is also an element of (и,,...,um). Then (r„...,tk) must be the initial segment of some permutation of (и,,..., um); that is, к < т and there
484 STOCHASTIC PROCESSES is a permutation π of (1,2,..., m) such that {Ulr-il,...,Ulr-im)=(tl,...,tk,tk+x,...,tm), where ik+],...,im are elements of (u,,...,um) that do not appear in (f„.. ·, tk). Define φ: Rm -> Rk by (36.6) ф(х1,...,хт) = (χπ-ιι,...,χπ-4ί); ψ applies ж to the coordinates and then projects onto the first к of them. Since ф(Хи1,...,Хит) = (Х,1,...,Х,к), (36.7) μ,, ΐ4 = μΒι. „„Г1. This contains (36.4) and (36.5) as special cases, but as ψ is a coordinate permutation followed by a sequence of projections of the form (x,,..., x,) -* (*!,...,*,_,), it is also a consequence of these special cases. Product Spares The standard construction of the general process involves product spaces. Let Τ be an arbitrary index set, and let RT be the collection of all real functions on Τ—all maps from Τ into the real line. If Τ = {1,2,..., к), a real function on Τ can be identified with a /c-tuple (х{,...,хк) of real numbers, and so RT can be identified with k-dimensional Euclidean space Rk. If Τ = {1,2,...}, a real function on Γ is a sequence [xltx2,...} of real numbers. If Τ is an interval, RT consists of all real functions, however irregular, on the interval. The theory of RT is an elaboration of the theory of the analogous but simpler space 5°° of Section 2 (p. 27). Whatever the set Τ may be, an element of RT will be denoted x. The value of χ at t will be denoted x(t) or x,, depending on whether χ is viewed as a function of t with domain Γ or as a vector with components indexed by the elements t of T. Just as Rk can be regarded as the Cartesian product of к copies of the real line, RT can be regarded as a product space—a product of copies of the real line, one copy for each t in T. For each t define a mapping Z,: RT~* R1 by (36.8) Z,(x)=x(t)=x,. The Z, are called the coordinate functions or projections. When later on a probability measure has been denned on RT, the Z, will be random variables, the coordinate variables. Frequently, the value Z,(x) is instead denoted Z(t,x). If jc is fixed, Z(,x) is a real function on Τ and is, in fact, nothing other than x()—that is, χ itself. If t is fixed, Z(t, ·) is a real function on RT and is identical with the function Z, denned by (36.8).
SECTION 36. KOLMOGOROV'S EXISTENCE THEOREM 485 There is a natural generalization to RT of the idea of the σ-field of /c-dimensional Borel sets. Let &T be the σ-field generated by all the coordinate functions Z,, (еГ: &T = σ[Ζ,: t e T]. It is generated by the sets of the form [x(ERT: Z,(i)eH] = [x^RT: х,ея] for t e Τ and Η e ^". If Τ = {1,2,..., A:}, then &T coincides with Як. Consider the class 3$l consisting of the sets of the form (36.9) /l = [xe^:(Z,|(x),...,Z,i(x))eW] -[xtERT:(Xli,...,xlk)tEH], where к is an integer, (f,,...,fft) is a /c-tuple of distinct points of T, and // e &k. Sets of this form, elements of @%, are called finite-dimensional sets, or cylinders. Of course, .Й?^ generates ^r. Now &l is not a σ-field, does not coincide with 3$T (unless Τ is finite), but the following argument shows that it is a field. A <*1 '. h If Τ is an interval, the cylinder [xeRT: αι<χ{ίι)<βι, a2<x{t?) </32] consists of the functions that go through the two gates shown; у lies in the cylinder and ζ does not (they need not be continuous functions, of course) The complement of (36.9) is RT - A = [x <eRt: (xh,.. .,xlt)<=Rk - H], and so &l is closed under complementation. Suppose that A is given by (36.9) and В is given by (36.10) B=[xceRt:(xSi,...,xs.)ceI], where I^&J. Let (u,,...,um) be an m-tuple containing all the ta and all the Sp. Now (f,,...,fA) must be the initial segment of some permutation of (w,,...,wm), and if φ is as in (36.6) and Η' = ψ~ιΗ, then H' <E&m and A is
486 STOCHASTIC PROCESSES given by (36.11) A = [xtERT:(xUi,...,xuJtEH'] as well as by (36.9). Similarly, В can be put in the form (36.12) B = [xeRT:(xUi,...,xUm)eI'\, where /' e <%m. But then (36.13) AuB = [хеЛг:(х||1,...,хИя)еЯ'и/']. Since H' U /' e 3lm, AuB is a cylinder. This proves that 38% is a field such that @T = аШо). The Z, are measurable functions on the measurable space (RT, 3$т). If Ρ is a probability measure on 3$τ, then [Z,: (e Γ] is a stochastic process on (RT,&T,P), the coordinate-variable process. Kolmogorov's Existence Theorem The existence theorem can be stated two ways: Theorem 36.1. If μ, , are a system of distributions satisfying the consistency conditions (36.2) and (36.3), then there is a probability measure Ρ on &T such that the coordinate-variable process [Z,: t e T] on (RT, &$T, P) has the μι , as its finite-dimensional distributions. Theorem 36.2. If μ, , are a system of distributions satisfying the consistency conditions (36.2) and (36.3), then there exists on some probability space (Ω,,^,Ρ) a stochastic process [X,: iei] having the μ, , as its finite- dimensional distributions. For many purposes the underlying probability space is irrelevant, the joint distributions of the variables in the process being all that matters, so that the two theorems are equally useful. As a matter of fact, they are equivalent anyway. Obviously, the first implies the second. To prove the converse, suppose that the process [X,: t^T] on Ш, &, P) has finite-dimensional distributions μ, ,, and define a map ξ: D,~*RT by the requirement (36.14) Z,(f («))=*,(«), fe=7\ For each ω, ξ(ω) is an element of RT, a real function on T, and the
SECTION 36. KOLMOGOROV'S EXISTENCE THEOREM 487 requirement is that Χ,(ω) be its value at t. Clearly, (36.15) Γ'[^ΛΓ:(Ζ(1(χ) Ζ(ί(ι))ε//| = [<oeCi:(Zli{t(<o)),...,ZltMa>)))eH\ = [^ίΐ:Κ(») W)^l; since the X, are random variables, measurable $*~, this set lies in & if Η e &k. Thus Γ'/ί e J for A ^&Q, and so (Theorem 13.1) ξ is measurable Sf/&T. By (36.15) and the assumption that [X,: t <ξ Γ] has finite- dimensional distributions μ, ,, Ρξ~ι (see (13.7)) satisfies (36.16) Fr1[xe^:(Zii(x),...,Z,4(x))GH] = Ρ[ω^η:(Χ,ϋω),...,Χ1ΐί(ω))^Η}=μ,ι ,/Я). Thus the coordinate-variable process [Z,: (e Г] on (RT, &τ, Ρξ~ι) also has finite-dimensional distributions μ, , . Therefore, to prove either of the two versions of Kolmogorov's existence theorem is to prove the other one as well. Example 36.1. Suppose that Τ is finite, say Τ = {1,2,...,&}. Then (RT,&T) is (Rk,&k), and taking Ρ = μλ2,.. ,k satisfies the requirements of Theorem 36.1. ■ Example 36.2. Suppose that Τ = {1,2,...} and (36-17) μ,,.,.,^μ,, x ··· χμ,,, where μι,μ2,--· are probability distributions on the line. The consistency conditions are easily checked, and the probability measure Ρ guaranteed by Theorem 36.1 is product measure on the product space (RT,&T). But by Theorem 20.4 there exists on some Ш, &, P) an independent sequence XvX2,... of random variables with respective distributions μ,,μ2,...; then (36.17) is the distribution of (X,t,..., X,). For the special case (36.17), Theorem 36.2 (and hence Theorem 36.1) was thus proved in Section 20. The existence of independent sequences with prescribed distributions was the measure-theoretic basis of all the probabilistic developments in Chapters 4, 5, and 6: even dependent processes like the Poisson were constructed from independent sequences. The existence of independent sequences can also be made the basis of a proof of Theorems 36.1 and 36.2 in their full generality; see the second proof below. ■
488 STOCHASTIC PROCESSES Example 36.3. The preceding example has an analogue in the space 5°° of sequences (2.15). Here the finite set 5 plays the role of R\ the z„() are analogues of the Z„(), and the product measure defined by (2.21) is the analogue of the product measure specified by (36.17) with μ; = μ. See also Example 24.2. The theory for 5°° is simple because 5 is finite: see Theorem 2.3 and the lemma it depends on. ■ Example 36.4. If Γ is a subset of the line, it is convenient to use the order structure of the line and take the μ5 Sk to be specified initially only for /c-tuples (s,,..., sk) that are in increasing order: (36.18) s}<s2< ■■■ <sk. It is natural for example to specify the finite-dimensional distributions for the Poisson processes for increasing sequences of time points alone; see (23.27). Assume that the μ5{ Sk for A:-tuples satisfying (36.18) have the consistency property (36.19) μ„ Si^ St(HlX---xHi_lxHt+lX---xHk) = /*,,. ,4(Я,Х ■■xHi_lxRlxHi+lX·· XHk). For given s„..., sk satisfying (36.18), take (X ,..., XSk) to have distribution μ5ι s. If t1,...,tk is a permutation of s,,...,sk, take μ/( ,t to be the distribution of (X, ,...,X, ): (36.20) Mli 1к{НхХ---хНк)=Р[Х:^Н,,1<к\. This unambiguously defines a collection of finite-dimensional distributions. Are they consistent? If ίπ1,...,t^k is a permutation of i,,..., tk, then it is also a permutation of sl,...,sk, and by the definition (36.20), μ, t ,пк is the distribution of (Χ,ν,...,Χ,ν ), which immediately gives (36.2*), the first of the consistency conditions. Because of (36.19), uc c c . is the distribution of (XS],...,Xs._i, Xs.+[,...,XSk), and if tk = s-t, then r„..., tk_, is a permutation of sl,...,si_l,si+l,...,sk, which are in increasing order. By the definition (36.20) applied to tl,...,tk_l it therefore follows that μ,^ .,4_, is the distribution of (X,,...,X, ). But this gives (36.3), the second of the 11 'ft—I consistency conditions. It will therefore follow from the existence theorem that if TczR1 and μ5 Sk is defined for all /c-tuples in increasing order, and if (36.19) holds, then there exists a stochastic process [ X,: геГ] satisfying (36.1) for increasing i„..., tk. Ш
SECTION 36. KOLMOGOROV'S EXISTENCE THEOREM 489 Two proofs of Kolmogorov's existence theorem will be given. The first is based on the extension theorem of Section 3. First proof of Kolmogorov's theorem. Consider the first formulation, Theorem 36.1. If A is the cylinder (36.9), define (36.21) Ρ(Α)=β„. ,4(Я). This gives rise to the question of consistency because A will have other representations as a cylinder. Suppose, in fact, that A coincides with the cylinder В defined by (36.10). As observed before, if (и,,..., um) contains all the ta and sp, A is also given by (36.11), where Η' = φ~ιΗ and ψ is defined in (36.6). Since the consistency conditions (36.2) and (36.3) imply the more general one (36.7), Ρ(.Α)~μ,ι ι(.Η) = μιΙ{ иШ'). Similarly, (36.10) has the form (36.12), and Ρ(Β) = μ, Xl) = ^u "'(/')· Since the и are dis- tinct, for any real numbers ζ,,..., zm there are points χ of R for which (xu,...,xu ) = (z,,..., zm). From this it follows that if the cylinders (36.11) and (36.121 coincide, then H' = V. Hence A=B implies that P(A) = u„ „ Ш') ·■= u„ „ (/') = P(B), and the definition (36.21) is indeed consis- tent. Now consider disjoint cylinders A and B. As usual, the index sets may be taken identical. Assume then that A is given by (36.11) and В by (36.12), so that (36.13) holds. If Η' Π /' were nonempty, then Α η Β would be nonempty as well. Therefore, Н'пГ = 0, and Ρ(ΑυΒ)=μ„ι ,JH'u/') = μυι ^Η')+μ^^Ι')=Ρ(Α)+Ρ(Β). Therefore, Ρ is finitely additive on &£. Clearly, P(RT) = 1. Suppose that Ρ is shown to be countably additive on 3$%. By Theorem 3.1, Ρ will then extend to a probability measure on &T. By the way Ρ was defined on &l, (36.22) p[xe/?7-:(Zli(x),...,Zfi(x))eW]=/*li ,,t(H), and therefore the coordinate process [Z,: геГ] will have the required finite-dimensional distributions. It suffices, then, to prove Ρ countably additive on &£, and this will follow if АпеЩ and A„ 10 together imply P(An)iO (see Example 2.10). Suppose that Α ι эА2 э · · · and that P(An) > e > 0 f or all n. The problem is to show that Π nAn must be nonempty. Since An e ^, and since the index set involved in the specification of a cylinder can always be permuted and
490 STOCHASTIC PROCESSES expanded, there exists a sequence ί,,ί2,... of points in Τ for which Л„ = [хеЛ7':(д:,1,...,х,г])е//„], wheret Hn <= 31". Of course, Ρ(Αη) = μ, ...tn(Hn). By Theorem 12.3 (regularity), there exists inside Hn a compact set Kn such that μ, ,(Ηη - Kn) <e/2" + 1. If Д„=[хеЕЯг: (xlit...,Xl)eKni then F(/l„ -<) <e/2" + l. Put C„ = Πί-ιθ*. Then С„сВлсЛ„ and P(A„ - C„) <e/2, so that P(C„)>e/2 > 0. Therefore, C„ cC„_, and C„ is nonempty. Choose a point x(n) of RT in C„. If л > к, then х(п) е С„ с Cfc ^Bk and hence (x<n),..., χ'ζ1) <= /CA. Since ^ is bounded, the sequence {x(^,χ'ξ\...} is bounded for each k. By the diagonal method [A14] select an increasing sequence «,, n2,... of integers such that lim, х,(п,) exists for each k. There is in RT some point χ whose r^th coordinate is this limit for each k. But then, for each k, (x.,..., x. ) is the limit as / -»c» of (xin,),..., xin/)) and hence lies * I *k * I к ^^ in /Ck. But that means that χ itself lies in Bk and hence in Ak. Thus *e Π λ = ι^αγ' which completes the proof.* ■ The second proof of Kolmogorov's theorem goes in two stages, first for countable T, then for general T* Second proof for countable T. The result for countable Τ will be proved in its second formulation, Theorem 36.2. It is no restriction to enumerate Τ as {/j, t2,...) and then to identify tn with n; in other words, it is no restriction to assume that Γ ={1,2,...}. Write μ„ in place of μ, 2. ,„· By Theorem 20.4 there exists on a probability space (Ω, У, F) (which can be taken to be the unit interval) an independent sequence J7„ U2, · ■ · of random variables each uniformly distributed over (0,1). Let Fl be the distribution function corresponding to μ γ. If the "inverse" gl of Fl is defined over (0,1) by g,(s) = inffjc: χ </*"](*)], then A", =gj(f/,) has distribution μ, by the usual argument: P[giWi) <x] = P[Ul <F](jc)] = Fl(x). The problem is to construct X2, A"3,... inductively in such a way that (36.23) Xk=hk(Ult...,Uk) for a Borel function hk and (Χ^.,.,Χ,,) has the distribution μη. Assume that Xi,...,Xn_i have been defined (n > 2): they have joint distribution μη^ι and (36.23) holds for к < η - 1. The idea now is to construct an appropriate conditional distribution function Fn(x\xi,...,xn_l); here Ρη(χ\Χ,(ω),...,Χη_χ(ω)) will have the value Ρ[Χη<χ\\Χι,...,Χη_ι]ω would have if Xn were already defined. If gniAx\,-..,x„-\) +In general, An will involve indices tl7..., ta , where al<a2< ·■■. For notational simplicity a„ . is taken as л. As a matter of fact, this can be arranged anyway: Take A'a =An, A'k = [x: (xti,...,xlk)<=Rk] = RT for k<ai, and A'k = [x: (x, ,x,k)eHnxRk'a«]=An for an< k<an + l. Now relabel A'„ as An. *The last part of the argument is, in effect, the proof that a countable product of compact sets is compact. *This second proof, which may be omitted, uses the conditional-probability theory of Section 33.
SECTION 36. KOLMOGOROV'S EXISTENCE THEOREM 491 is the "inverse" function, then Χη(ω) = gn(Un(w)\Xx(w),...,Xn_x(w)) will by the usual argument have the right conditional distribution given Xx,...,Xn_x, so that (Xx,...,Xn_x,Xn) will have the right distribution over R". To construct the conditional distribution function, apply Theorem 33.3 in (/?",£%",μη) to get a conditional distribution of the last coordinate of (xx,...,xn) given the first η - 1 of them. This will have (Theorem 20.1) the form u(H; xx,..., x„-.\)', it is a probability measure as Η varies over 32х, and L ν(Η;χι *„-ιΗΜ·*ι> ···.·*„) ,λγ„_,)<ξΜ = H„[x e *"■ (*i>·· ·, *„-.) eM, xn e H]. Since the integrand involves only x1,...,xn_1, and since μ,η by consistency projects to μπ_, under the map (*,, ..., x„) -»(*,,.,.,*„_,), a change of variable gives = li„[xeR»:(xu...,xn_l)eM,xneH]. Define /^(лтк,,..., *„_,) = 1/((-о°,л:]; *,,...,*„_,). Then Fn(-\xx,...,xn_x) is a probability distribution function over the line, Fn(x\ ■) is a Borel function over Л"-1, and / Fn{x\xx,...,xn_x) άμη_χ(χχ,...,χη_χ) Μ = μ„[χ<ΕΗη:(χχ,...,χπ__χ)<=Μ,χη<χ]. Put g„(u\xx,..., xn_x)= inf[x: и < Fn(x\xx,..., xn_x)] for 0 < и < 1. Since Fn(x\xx,...,xn_x) is nondecreasing and right-continuous in x, gn(u\xx,...,xn_x)<x if and only if и < Fn(x\xx,..., xn_x). Set ЛГ„ = gn(U„\Xx,..., Xn_x). Since (A",,..., A^,) has distribution μ„_, and by (36.23) is independent of Un, an application of (20.30) gives P[(Xx,...,Xn_x)eM,Xn<x] = P[(Xx,...,Xn_x)eM,Un<Fn(x\Xx,...,Xn_x)] = f P[Un<Fn(x\xx,...,xn_x)]d^n_x{xx,...,xn_x) Μ = f Fn(x\xl,...,xn_x)dvn_l(xl,...,xn-l) JM = Hn[x<ER":(xx,...,xn_x)<=M,xn<x]. Thus (Xx,...,Xn) has distribution μ„. Note that Xn, as a function of XX,...,X„_X and t/n, is a function of Ux,...,Un because (36.23) was assumed to hold for k<n. Hence (36.23) holds for к = η as well. ■
492 STOCHASTIC PROCESSES Second proof for general T. Consider (RT,&T) once again. If S с Т, let ^ = σ[Ζ,: teS]. Then fjC^^7'. Suppose that S is countable. By the case just treated, there exists a process [X,: is5] on some (Ω, У, F)—the space and the process depend on S—such that (X,,..., X, ) has distribution μ, , for every fc-tuple (/,,...,/A) from S. Define a map ξ: Ω -> RT by requiring that г1(«.)>-(*'<-> :<;!*· (о utes. Now (36.15) holds as before if tl,...,tk all lie in S, and so ξ is measurable &/'5^s. Further, (36.16) holds for /,,...,/* in S. Put Ρ5 = Ρξ'1 on S*~s Then Fy is a probability measure on (Λ7", &s), and (36,24) F4^e^:(Z,i(^),,,.,Z,t(^))eW]=M,i ,/tf) if ί/ε^' and f],...,fA all lie in S. (The various spaces (Ω,,^,Ρ) and processes [A",: t e 5] now become irrelevant.) If 5U с5, and if A is a cylinder (36.9) for which the i1,...,tk lie in 50, then Fy(/1) and PS(A) coincide, their common value being μ, , (Η). Since these cylinders generate &s , Ру(Л) = PS(A) for all A in 5^ . If A 'lies*both in &~s and 5^ , then F^/O-Fy uS(A) = Ps(.A). Thus P(A) = PS(A) consistently defines a set function on the class (J s^S' 1^е uni°n extending over the countable subsets S of T. If /ln lies in this union and Ane &~s (Sn countable), then 5 = U nSn is countable and (J „Л„ lies in &s. Thus U j^j is a σ-field and so must coincide with 3lT. Therefore, F is a probability measure on £%T, and by (36.24) the coordinate process has under F the required finite-dimensional distributions. ■ The Inadequacy of ^r Theorem 36.3. Let [X,: t e T] be a family of real functions on Ω. (i) IfA(Ea[X,:t(ET]and(u(EA, and if Χ,(ω) = ΧΧω') for all t e T, then ω' G/4. (ii) If A e σ[ X,: t e T], then A e σ[Χ,: t e 5] /or some countable subset S of T. Proof. Define £: Ω-»ΛΓ by Ζ,(ξ(ω)) = Χ,(ω). Let ^"=σ[Α'/: (еГ]. By (36.15), ξ is measurable !?/{%Τ and hence ^ contains the class [ξ~ΛΜ: Μ e ^,Γ]. Ί he latter class is a σ-field, however, and by (36.15) it contains the sets [ω<ΞίΙ: (Λ',(ω),..., Л^с^еЯ], HeJ*, and hence contains the σ-field ^" they generate. Therefore (36.25) σ[Χ,:ί^Τ] = [ξ-1Μ:Μ^^τ]. This is an infinite-dimensional analogue of Theorem 20.1(i).
SECTION 36. KOLMOGOROV'S EXISTENCE THEOREM 493 As for (i), the hypotheses imply that ω еЛ = ξ~λΜ and ξ(ω) = ξ(ω), so that ω ^Α certainly follows. For S<zT, let &~s = a[X,: ieS]; (ii) says that &= Э^т coincides with £= U s^S' the union extending over the countable subsets 5 of T. If Av A2,... lie in c0, An lies in &s for some countable 5„, and so U nAn lies in cf because it lies in ^s for 5 = U nSn. Thus £ is a σ-field, and since it contains the sets [X, e //], it contains the σ-field & they generate. (This part of the argument was used in the second proof of the existence theorem.) ■ From this theorem it follows that various important sets lie outside the class &T. Suppose that Г = [0,оо). Of obvious interest is the subset С of RT consisting of the functions continuous over [0, <»). But С is not in &T. For suppose it were. By part (ii) of the theorem (let fl = RT and put [Z,: t e T] in the role of [X,: t e T]), С would lie in σ[Ζ,: t e 5] for some countable 5 с [0, oo). But then by part (i) of the theorem (let Ω = RT and put [Z,: t <ξ 5] in the role of [X,: t e T]\ if x<eC and Z,(x) = Z,(y) for all t e 5, then yeC. From the assumption that С lies in ^r thus follows the existence of a countable set 5 such that, if χ e С and x(t) = y(i) for all t in 5, then yeC. But whatever countable set 5 may be, for every continuous χ there obviously exist functions у that have discontinuities but agree with χ on 5. Therefore, cannot lie in &T. What the argument shows is this: A set A in RT cannot lie in &T unless there exists a countable subset 5 of Γ with the property that, if χ еЛ and x(t)=y(t) for all t in 5, then y^A. Thus A cannot lie in &T if it effectively involves all the points t in the sense that, for each χ in A and each t in T, it is possible to move χ out of A by changing its value at t alone. And С is such a set. For another, consider the set of functions χ over Τ = [0,o°) that are nondecreasing and assume as values x(t) only nonnegative integers: (36.26) [xe#"): x(s) <x(t), χ < t; x(t) e {0,1,...}, t > θ]. This, too, lies outside SI1. In Section 23 the Poisson process was defined as follows: Let Xv Хг,... be independent and identically distributed with the exponential distribution (the probability space Ω on which they are defined may by Theorem 20.4 be taken to be the unit interval with Lebesgue measure). Put 50 = 0 and S„ = Xt+ ··· +Xn. If 5„(ω) < 5„ + ι(ω) for η > 0 and 5„(ω) -* oo, put Mi, ω) = Ν,(ω) = тах[л: 5„(ω) < ί] for t > 0; otherwise, put Mi, ω) = Ν,(ω) = 0 for ί>0. Then the stochastic process [JV,: i>0] has the finite-dimensional distributions described by the equations (23.27). The function Μ,ω) is the path function or sample function1 corresponding to ω, and by the construction every path function lies in the set (36.26). This is a good thing if the +Other terms are realization of the process and trajectory.
494 STOCHASTIC PROCESSES process is to be a model for, say, calls arriving at a telephone exchange: The sample path represents the history of the calls, its value at t being the number of arrivals up to time t, and so it ought to be nondecreasing and integer-valued. According to Theorem 36.1, there exists a measure Ρ on RT for Τ = [0,«0 such that the coordinate process [Z,: t > 0] on (RT,&T,P) has the finite- dimensional distributions of the Poisson process. This time does the path function Ζ(·,χ) lie in the set (36.26) with probability 1? Since Ζ(·,χ) is just χ itself, the question is whether the set (36.26) has P-measure 1. But this set does not lie in &T, and so it has no measure at all. Ал application of Kolmogorov's existence theorem will always yield a stochastic process with prescribed finite-dimensional distributions, but the process may lack certain path-function properties that it is reasonable to require of it as a model for some natural phenomenon. The special construction of Section 23 gets around this difficulty for the Poisson process, and in the next section a special construction will yield a model for Brownian motion with continuous paths. Section 38 treats a general method for producing stochastic processes that have prescribed finite-dimensional distributions and at the same time have path functions with desirable regularity properties. A Return to Ergodic Theory* Write R°°, 39%, S$T for RT, &%, &T in the case where the index set {0, ± 1, ± 2,...} Gonsists of all the integers. Then R°° is analogous to S°° (Sections 2 and 24), except that here the sequences are doubly infinite: x=(...,Z_1(x),Z0(x),Z1(x),...). Let Τ (not an index set) denote the shift: Zk(Tx) = Zk + £x), k = 0,±l,.... This is like the shift in Section 24. Since A e <Щ implies ТА е <Щ, Τ is measurable ЗГ/ЗГ. Clearly, it is invertible. For a stochastic process X =(..., X_v X0,X1,...)on(O,,Sr, P), define ξ: Ω -* R°° by (36.14): ξω=Χ(ω) = (...,Χ_1(ω),Χ0(ω),Χ1(ω),...). The measure Ρξ~ι=ΡΧ'λ on (R°°,&°°) can be viewed as the distribution of X. Suppose that X is stationary in the sense that, for each к > 1 and Иe £&k, P[(X„+1,..., Xn+k)e H]is the same for all η = 0, ± 1,... . Then the shift preserves Ρξ~Λ (use (36.16) and Lemma 1, p. 311). The process X is defined to be ergodic if under Ρξ~* the shift is ergodic in the sense of Section 24. In the ergodic case, it follows by the ergodic theorem that (36.27) \if{Tkx)^j j(x)PC\dx) *This topic, which requires Section 24, may be omitted.
SECTION 36. KOLMOGOROV'S EXISTENCE THEOREM 495 on a set of Ρξ '-measure 1, provided / is measurable &°° and integrable. Carry (36.27) back to (Ω, &, P) by the inverse set mapping ξ'1. Then (36.28) i Ε /(.··, Xk-i,Xk,Xk + i, ...)-»£[/(···, *-„*„,*„...)] with probability 1: (36.28) holds al ω if and only if (36.27) holds at χ = ζω = Χ(ω). It is understood that on the left in (36.28), Xk is the center coordinate (the Oth coordinate) of the argument of /, and on the right, X0 is the center coordinate: For stationary, ergodic X and integrable f, (36.28) holds with probability 1. If the Xk are independent, then the Zk are independent under Ρξ'χ. In this case, lim„ Ρξ-\ΑηΤ-"Β) = Ρξ-\Α)Ρξ-\Β) for A and В in &%, because for large enough η the cylinders A and T'"B depend on disjoint sets of time indices and hence are independent. But then it follows by approximation (Corollary 1 to Theorem 11.4) that the same limit holds for all A and В in &Г. But for invariant B, this implies Ρξ'\Β€ Π Β) = Ρξ'1(Β':)Ρξ'\Β), so that Ρξ'\Β) is 0 or 1, and the shift is ergodic under Ρξ'1: If X is stationary and independent, then it is ergodic. If / depends on just one coordinate of x, then (36.28) is in the independent case a consequence of the strong law of large numbers, Theorem 22.1. But (36.28) follows by the ergodic theorem even if / involves all the coordinates in some complicated way. Consider now a measurable real function φ on R". Define ψ: Λ°° ->Л°° by ф(х) = (...,ф(Т'1х),ф(х),ф(Тх),...); here ф(х) is the center coordinate: Zk(ip(x)) = ф(Ткх). It is easy to show that φ is measurable &°°/&°° and commutes with the shift in the sense of Example 24.6. Therefore, Τ preserves Ρξ'1φ'1 if it preserves Ρξ'1, and it is ergodic under Ρξ'1ψ'1 if it is ergodic under Ρξ'1. This translates immediately into a result on stochastic processes. Define Y = (..., Υ_ j, Y0, Y],...) in terms of X by (36.29) Υη = φ(...,Χπ_νΧη,Χπ+ι,. .), that is to say, Υ(ω) = ψ(Χ(ω)) = ψξω. Since Ρξ'1 is the distribution of Χ, Ρξ'1^'1 = Ρ(φξ)'J = PY'J is the distribution of Y: Theorem 36.4. If X is stationary and ergodic, in particular if the Xn are independent and identically distributed, then Υ as defined by (36.29) is stationary and ergodic. This theorem fails if Υ is not defined in terms of X in a time-invariant way—if the φ in (36.29) is not the same for all n: If φη(χ) = Z_n(x) and φ is replaced by φπ in (36.29), then Y„ = X0; in this case Υ happens to be stationary, but it is not ergodic if the distribution of A"0 does not concentrate at a single point. Example 36.5. The autoregress'we model. Let φ(χ) = ΣΙ=0βΙίΖ_Ι[(χ) on the set where the series converges, and take φ(χ) = 0 elsewhere. Suppose that \β\ < 1 and that the Xn are independent and identically distributed with finite second moments. Then by Theorem 22.6, Y„ = Hk~ofikXn-k converges with probability 1, and by Theorem 36.4, the process Υ is ergodic. Note that Y„+J = βΥη + Χη+1 and that Xn+1 is independent of У . This is the linear autoregressive model of order 1. ■
496 STOCHASTIC PROCESSES The Hewitt-Savage Theorem* Change notation: Let (K",^) be the product space with {1,2,...} as the index set, the space of one-sided sequences. Let Ρ be a probability measure on 32™. If the coordinate variables Z„ are independent under F, then by Theorem 22.3, P(A) is 0 or 1 for each A in the tail σ-field J. If the Z„ are also identically distributed under F, a stronger result hofds. Let Уп be the class of ^""-sets A that are invariant under permutations of the first η coordinates: if π is a permutation of {l,...,n}, then χ lies in A if and only if (Ζπ1(χ),...,Z„n(x),Zn+1(x),.··) does. Then Sn is a σ-field. Let S= VCn.vSn be the σ-field of U^-sets invariant under all finite permutations of coordinates. Then .У is larger than ST, since, for example, the дг-set where E£ = ]ZA(jt) > cn infinitely often lies in У but not in 3~. The Hewitt-Savage theorem is a zero-one law for У in the independent, identically distributed case. Theorem 36.5. // the Z„ are independent and identically distributed under F, then F( A) is 0 or 1 for each A in У. Proof. By Corollary 1 to Theorem 11.4, there are for giver. A and e an η and a set y = [(Z, Z„)eW] (ffel") such that P(A ± U) < e. Let V = [(Z„ + ],..., Z2„)e Я ]· If the Zk are independent and identically distributed, then P(A δ U) is the same as 4(Z„ + I,...,Z2„,ZI,...,Z„)e//xKn]). But if /1 e </2„, this is in turn the same as P(A * V). Therefore, P(A δ £/) = P(A δ V). But then, P(A*V)<e and P(A*(Un V)) <P(A*U) + P(A δΚ) < 2c. Since t/ and К have the same probability and are independent, it follows that P(A) is within e of F(£/) and hence P2(A) is within 2e of P2W) = P(U)P(V) = P(U Π V), which is in turn within 2e of P(A). Therefore, \P2(A)-P(A)\ < 4e for all e, and so P(A) must be 0 or 1. ■ PROBLEMS 36.1. Τ Suppose that [A',: t e T] is a stochastic process on (Ω, ^ F) and /1 e &. Show that there is a countable subset S of Τ for which FMHA',, f еГ] = Ff/lUA',, f e 5] with probability 1. Replace A by a random variable and prove a similar result. 36.2. Let Τ be arbitrary and let K(s,t) be a real function over ГхГ. Suppose that К is symmetric in the sense that K(s,t) = K(t,s) and nonnegative-definite in the sense that Z'lj=!-lK{ti,tj)xixj> 0 for Ar^ 1, t1,...,tk in Г, and xv...,xk real. Show that there exists a process [A",: /e T] for which (A",,..., X, ) has the centered normal distribution with covariances K(tj,tj), i,j= l,...,к. *This topic may be omitted.
SECTION 36. KOLMOGOROV'S EXISTENCE THEOREM 497 36.3. Let L be a Borel set on the line, let S consist of the Borel subsets of L, and let LT consist of all maps from Τ into L. Define the appropriate notion of cylinder, and let ST be the σ-field generated by the cylinders. State a version of Theorem 36.1 for {lI,~fT). Assume Τ countable, and prove this theorem not by imitating the previous proof but by observing that ll is a subset of RT and lies in &T. 36.4. Suppose that the random variables XvX2,... assume the values 0 and 1 and P[Xn = 1 i.o.] = 1. Let μ be the distribution over (0,1] of Σ°η^Χη /2". Show that on the unit interval with the measure μ, the digits of the nonterminating dyadic expansion form a stochastic process with the same finite-dimensional distributions as Xv X2, 36.5. 36.3 Τ There is an infinite-dimensional version of Fubini's theorem. In the construction in Problem 36.3, let L = I = (0,1), Τ = {1,2,...}, let J? consist of the Borel subsets of /, and suppose that each Аг-dimensional distribution is the Аг-fold product of Lebesgue measure over the unit interval. Then IT is a countable product of copies of (0,1), its elements are sequences χ = (xv x2,···) of points of (0,1), and Kolmogorov's theorem ensures the existence οη(Ιτ,^τ) of a product probability measure v. v[x: *,·<«,·, i<n] = ai ■ ■ · an for 0 <a, < 1. Let /" denote the и-dimensional unit cube. (a) Define φ: Ι" χ/7"-»/7" by ф((х1,...,хп),(у1,у2,...)) = (х1,...,хп,у1,у2,...). Show that φ is measurable J?" X J?T/J?T and φ'1 is measurable J^7Д/"" x/r. Show that ф~\кп Хтг) = π, where λ„ is η-dimensional Lebesgue measure restricted to /". (b) Let / be a function measurable J?T and, for simplicity, bounded. Define /Π(·*Π+ΐ'*η+2>···)= f "· /'/(>Ί.···.>'η.·ϊη + ι.···)ί/>Ί ■■■(1Уп, •Ό -Ό in other words, integrate out the coordinates one by one. Show by Problem 34.18, martingale theory, and the zero-one law that (36.30) fn(xn+1,xn+2,---)^fi7f(y)Tr(dy) except for χ in a set of тг-measure 0. (c) Adopting the point of view of part (a), let gn(xv...,xn) be the result of integrating the variable (yn+i,yn + 2.···) out (w'tri respect to π) from /(*],....,xn,yn + v...). This may suggestively be written as e„(xi,---,xn)= f f •••/(*ι>···>*Π>:νΠ+ι,>'Π+2>···)4'Π+ι^Π+2 ··· ■Ό -Ό Show that g„(jtj,...,xn) ->f(x1,x2,...) except for χ in a set of π-measure 0. 36.6. (a) Let Τ be an interval of the line. Show that 32T fails to contain the sets of: linear functions, polynomials, constants, nondecreasing functions, functions of bounded variation, differentiable functions, analytic functions, functions con-
498 STOCHASTIC PROCESSES tinuous at a fixed f0, Borel measurable functions. Show that it fails to contain the set of functions that: vanish somewhere in T, satisfy x(s) <x(t) for some pair with s <t, have a local maximum anywhere, fail to have a local maximum, (b) Let С be the set of continuous functions опГ= [0,°°). Show that A e £%r and Ac С imply that A = 0. Show, on the other hand, that Ae&T and С о A do not imply that A = RT. 36.7. Not all systems of finite-dimensional distributions can be realized by stochastic processes for which Ω is the unit interval. Show that there is on the unit interval with Lebesgue measure no process [A",: t > 0] for which the X, are independent and assume the values 0 and 1 with probability { each. Compare Problem 1.1. 36.8. Here is an application of the existence theorem in which Τ is not a subset of the line. Let (Ν,Λ,ν) be a measure space, and take Τ to consist of the ^sets of finite ^-measure. The problem is to construct a generalized Poisson process, a stochastic process [XA: A e T] such that (i) XA has the Poiss,on distribution with mean v(A) and (ii) XA,...,XA are independent if Av...,An are disjoint. Hint: To define the finite-dimensional distributions, generalize this construction: For А, В in T, consider independent random variables YVY2, Y$ having Poisson distributions with means v(A Π Bc), u(A Π Β), v(Ac Π Β), take μΑ B to be the distribution of (У, + Y2, Y2 + Y3). SECTION 37. BROWNIAN MOTION Definition A Brownian motion or Wiener process is a stochastic process [Wt: t > 0], on some Ш, &, P), with these three properties: (i) The process starts at 0: (37.1) P[wo = 0] = l. (ii) The increments are independent: If (37.2) 0<i0<ij< ··· <tk, then (37.3) P[w,-w,i^H.ni<k\ = Y\P{w,-w,i^H.\. (iii) For 0 <s <t the increment Wl-Ws is normally distributed with mean 0 and vanance t - s: (37.4) P[W,-WseH]= , l =( e-x2/*'-s)dx. ■\]2n{t-s) jh The existence of such processes will be proved. Imagine suspended in a fluid a particle bombarded by molecules in thermal motion. The particle will perform a seemingly random movement
SECTION 37. BROWN1AN MOTION 499 first described by the nineteenth-century botanist Robert Brown. Consider a single component of this motion—imagine it projected on a vertical axis—and denote by W, the height at time t of the particle above a fixed horizontal plane. Condition (i) is merely a convention: the particle starts at 0. Condition (ii) reflects a kind of lack of memory. The displacements W, -W,,...,W, -W, _ the particle undergoes during the intervals [i0, ti],...,[tk_2, tk_i] in no way influence the displacement W, - W, _ it undergoes during [tk_vtk]. Although the future behavior of the particle depends on its present position, it does not depend on how the particle got there. As for (iii), that W, - Ws has mean 0 reflects the fact that the particle is as likely to go up as to go down—there is no drift. The variance grows as the length of the interval [s,t\, the particle tends to wander away from its position at time s, and having done so suffers no force tending to restore it to that position. To Norbert Wiener are due the mathematical foundations of the theory of this kind of random motion. A Brownian motion path. The increments of the Brownian motion process are stationary in the sense that the distribution of W, - Ws depends only on the difference t-s. Since W0 = 0, the distribution of these increments is described by saying that
500 STOCHASTIC PROCESSES W, is normally distributed with mean 0 and variance t. This implies (37.1). If 0 < s < t, then by the independence of the increments, £[^^,1 = E[(.WS(W, - Ws)] + E[WS2] = E[WS]E[W, - Ws] + E[WS2] = s. This specifies all the means, variances, and covariances: (37.5) E[W,] = 0, E[w,2]=t, £[H^] = min{s,i}. If 0 < f, < ··· <tk, the joint density of (W^W^- Wu,...,W,k- W,k_) is by (20.25) the product of the corresponding normal densities. By the Jacobian formula (20.20), (Wlt,..., W,t) has density (37.6) /,, l4(xI,...,xk) = n-==_=exp i = \ V277(i<_ii-i) (*.--*.--i)2 2(^-^-0 where i0 = x0 = 0. Sometimes W, will be denoted W(t), and its value at ω will be W(t, ω). The nature of the path functions W(·, ω) will be of great importance. The existence of the Brownian motion process follows from Kolmogorov's theorem. For 0 <t1 < · ■ ■ <tk let μ, , be the distribution in Rk with density (37.6). To put it another way, let μ, , be the distribution of (5i,...,5k), where Si = X1+ ··· +Xt and where Xl,...,Xk are independent, normally distributed random variables with mean 0 and variances ti,t2-tu...,tk -tk-v If g(xv...,xk) = (xu...,xl_i,xi+l,...,xk), then g(S1,...,Sk) = (S1,...,Si_1,Si+1,...,Sk) has the distribution prescribed for μ, , , , . This is because X{ + Xj+1 is normally distributed with mean 0 and variance ti+1 - if_,; see Example 20.6. Therefore, (37·7) ^, .',-'i+I. 4 = ^, ,kS~'- The μ, , defined in this way for increasing, positive ti,...,tk thus satisfy the conditions for Kolmogorov's existence theorem as modified in Example 36.4; (37.7) is the same thing as (36.19). Therefore, there does exist a process [W,: ί > 0] corresponding to the μ, , . Taking W, = 0 for ί = 0 shows that there exists on some (Ω,,^,Ρ) a process [W,: t > 0] with the finite-dimensional distributions specified by the conditions (i), (ii), and (iii). Continuity of Paths If the Brownian motion process is to represent the motion of a particle, it is natural to require that the path functions W(-,w) be continuous. But Kolmogorov's theorem does not guarantee continuity. Indeed, for Τ = [0,°°), the space (Ω, &~) in the proof of Kolmogorov's theorem is (RT, &T), and as shown in the last section, the set of continuous functions does not lie in &J.
SECTION 37. BROWN1AN MOTION 501 A special construction gets around this difficulty. The idea is to use for dyadic rational t the random variables Wt as already denned and then to redefine the other Wt in such a way as to ensure continuity. To carry this through requires proving that with probability 1 the sample path is uniformly continuous for dyadic rational arguments in bounded intervals. Fix a space Ш, ,9*", P) and on it a process [Wt: t > 0] having the finite- dimensional distributions prescribed for Brownian motion. Let D be the set of nonnegative dyadic rationals, let Ink = [k2~n,(k + 2)2""], and put (37.8) ΜπΙζ(ω)= sup \lV(i,w)- W(k2-",w)\ r<ElnknD Μη(ω)= max Mnk{o>). 0<k<n2" Suppose it is shown that ΣΡ[Μη > n~l] converges. The first Borel-Canielli lemma will then imply that В = [M„>n~1 i.o.] has probability 0. But suppose ω lies outside B. Then for every t and e there exists an η such that t <n, 2п~л <€, and Μη(ω)<η~Λ. Take δ = 2"". Suppose that r and r' are dyadic rationals in [0,i] and \r-r'\<8. Then r and r' must for some к < til" lie in a common interval Ink (length 2 X 2""), in which case \W(r, ω) - W(r',a))\ < 2M„>) < 2Μ„(ω) < 2л"1 < е. Therefore, ωίΰ implies that W(r, ω) is for every t uniformly continuous as r ranges over the dyadic rationals in [0, t], and hence it will have a continuous extension to [0, oo). To prove ΣΡ[ΜΠ> η~1]<<χ, use Etemadi's maximal inequality (22.10), which applies because of the independence of the increments. This, together with Markov's inequality, gives max|^Ci + 5i'2-m) - W{t)\> a li<2m 3 maxP[| W(t + 5/2~т) -W(t)\> a/3] ■•<2 (a/3) ΓΕ[(^(ί + δ)-^(ί))4]=4·3δ2 = Κδ2 a (see (21.7) for the moments of the normal distribution). The sets on the left here increase with m, and letting m -»oo leads to (37.9) Therefore, sup \W{t + r8)- W{t)\>a 0<r<\ reD Κδ2 г n K(2x2-")2 4Kn5 (η ') and ΣΡ[Μη > η J] does converge.
502 STOCHASTIC PROCESSES Therefore, there exists a measurable set В such that P(B) = 0 and such that for ω outside B, W(r,w) is uniformly continuous as r ranges over the dyadic rationale in any bounded interval. If ω £Β and r decreases to t through dyadic rational values, then W(r, ω) has the Cauchy property and hence converges. Put ( UmW(r,w) if ω£β, \ν;(ω) = W'{t,w) = гц |θ if we β, where r decreases to t through the set D of dyadic rationale. By construction, W(t, ω) is continuous in t for each ω in Ω. If ω £ В, then W(r, ω) = W(r,(u) for dyadic rationale, and W'{-, ω) is the continuous extension to all of [0,oo). The next thing is to show that the W't have the same joint distributions as the Wr It is convenient to prove this by a lemma which will be used again further on. Lemma 1. Let Xn and X be k-dimensional random vectors, and let Fn(x) be the distribution function of Xn. If Xn-> X with probability 1 and Fn(x) -» F(x) for all x, then F(x) is the distribution function of X. Proof.1 Let X have distribution function H. By two applications of Theorem 4.1, if h > 0, then F(xj,..., xk) = HmsupF„(x{,..., xk) < Н(хл,..., xk) η < HminfF„(;C] + h,...,xk + h) η =--F(x, +h,...,xk + h). It follows by continuity from above that F and Η agree. ■ Now, for 0 < i, < · · · <tk, choose dyadic rationals r,(n) decreasing to the t{. Apply Lemma 1 with (W (n),...,Wrki„^ and (W^,..., W,'k) in the roles of Xn and X, and with the distribution function with density (37.6) in the role of F. Since (37.6) is continuous in the ί(·, it follows by Scheie's theorem that Fn(x) -»F(x), and by construction X„-*X with probability 1. By the lemma, (W,',..., W, ) has distribution function F, which of course is also the distribution function of (W^,..., Wlk). Thus \W',: ί > 0] is a stochastic process, on the same probability space as [W,, t > 0], which has the finite-dimensional distributions required for Brown- ian motion and moreover has a continuous sample path W'{-, ω) for every ω. fThe lemma is an obvious consequence of the weak-convergence theory of Section 29; the point of the special argument is to keep the development independent of Chapters 5 and 6.
SECTION 37. BROWNIAN MOTION 503 By enlarging the set В in the definition of W't(ω) to include all the ω for which W(0, ω)Φθ, one can also ensure that ^'(0, ω) = 0. Now discard the original random variables Wt and relabel W, as Wr The new [W,\ t> 0] is a stochastic process satisfying conditions (i), (ii), and (iii) for Brownian motion and this one as well: (iv) For each ω, W(t, ω) is continuous in t and W{Q, ω) = 0. From now on, by a Brownian motion will be meant a process satisfying (iv) as well as (i), (ii), and (iii). What has been proved is this: Theorem 37.1. There exist processes [Wt: t > 0] satisfying conditions (i), (ii), (iii), and (iv)—Brownian motion processes. In the construction above, Wr for dyadic r was used to define W, in general. Foi that reason it suffices to apply Kolmogorov's theorem for a countable index set. By the second proof of that theorem, the space (Ω, JF, P) can be taken as the unit interval with Lebesgue measure. The next section treats a general scheme for dealing with path-function questions by in effect replacing an uncountable time set by a countable one. Measurable Processes Let Γ be a Borel set on the line, let [Xt: t e Γ] be a stochastic process on an (Ω, У, P), and consider the mapping (37.10) (ί,ω)-+Χι(ω)=Χ(ί,ω) carrying ΓχΩ into R1. Let У" be the σ-field of Borel subsets of T. The process is said to be measurable if the mapping (37.10) is measurable Ух &/&. In the presence of measurability, each sample path Λν,ω) is measurable У" by Theorem 18.1. Then, for example, j£(p(X(t,(o))dt makes sense if (a, b) с Τ and ψ is a Borel function, and by Fubini's theorem (\{X{t,-))dt =ibE[9{X,)\dt ■ujbE\W{Xl)\\dt<^. /a J Ja Ja Hence the usefulness of this result: Theorem 37.2. Brownian motion is measurable. Proof. If W(n\t,o)) = W{k2-n,o>) for k2-"<t<(k + l)2-", k = 0,l,2,...,
504 STOCHASTIC PROCESSES then the mapping (t,o>) ~> W(nXt,w) is measurable У~Х fF. But by the continuity of the sample paths, this mapping converges to the mapping (37.10) pointwise (for every (ί, ω)), and so by Theorem 13.4(H) the latter mapping is also measurable У~Х ^/^х. ■ Irregularity of Brownian Motion Paths Starting with a Brownian motion [Wt: t > 0] define (37.11) lv;(<tt) = c-llVci,(a>), where с > 0. Since t -»c2t is an increasing function, it is easy to see that the process [W',: i>0] has independent increments. Moreover, W't-W's = c~\Wc2i - WC2S), and for s <t this is normally distributed with mean 0 and variance c~2(c2t - c2s) = t -s. Since the paths W(-,w) all start from 0 and are continuous, [W,1: t > 0] is another Brownian motion. In (37.11) the time scale is contracted by the factor c2, but the other scale only by the factor c. That the transformation (37.11) preserves the properties of Brownian motion implies that the paths, although continuous, must be highly irregular. It seems intuitively clear that for с large enough the path W(-, ω) must with probability nearly 1 have somewhere in the time interval [0, c] a chord with slope exceeding, say, 1. But then Wi-,ω) has in [0,c_l] a chord with slope exceeding c. Since the Wt are distributed as the W,, this makes it plausible that ΐν(-,ω) must in arbitrarily small intervals [0,δ] have chords with arbitrarily great slopes, which in turn makes it plausible that W^-,ω) cannot be differentiable at 0. More generally, mild irregularities in the path will become ever more extreme under the transformation (37.11) with ever larger values of c. It is shown below that, in fact, the paths are with probability 1 nowhere differentiable. Also interesting in this connection is the transformation (37.12) KW-[m"M in>°' V ' \0 if t = 0. Again it is easily checked that the increments are independent and normally distributed with the means and variances appropriate to Brownian motion. Moreover, the path W"{-, ω) is continuous except possibly at t = 0. But (37.9) holds with W" in place of Ws because it depends only on the finite-dimensional distributions, and by the continuity of W"(-,w) over (0,oo) the supremum is the same if not restricted to dyadic rationals. Therefore, P[supSiLn-W\> п'л]<К/п2, and it follows by the first Borel-Cantelli lemma that W"(-,(o) is continuous also at 0 for ω outside a set Μ of probability 0. For ω e M, redefine W(t, ω) = 0; then [W": ί > 0] is a Brownian motion and (37.12) holds with probability 1.
SECTION 37. BROWN1AN MOTION 505 The behavior of W(-,w) near 0 can be studied through the behavior of Wi-,ω) near oo and vice versa. Since (W" - W£)/t = Wx/„ W"(-, ω) cannot have a derivative at 0 if W(-,w) has no limit at oo. Now, in fact, (37.13) inf Wn= - oo, sup Wn = + oo with probability 1. To prove this, note that Wn= Xx + %k = Wk - Wk_y are independent. Consider +X„, where the sup Wn < oo oo oo υ η и = 1 m=\ max Wt< и i<m this is a tail set and hence by the zero-one law has probability 0 or 1. Now -Χχ, -Χ2,... have the same joint distributions as Xl,X2,..., and so this event has the same probability as oo oo inf Wn > - oo L η = (J Π max(-^)<« If these two sets have probability 1, so has [supjH^I < oo], so that ftsuPnlH^I <x]>0 for some x. But P[\Wn\ <x] = P[\WX\ <x/nl/2] ~> 0. This proves (37.13). Since (37.13) holds with probability 1, W'i-,ω) has with probability 1 upper and lower right derivatives of + oo and - oo at t = 0. The same must be true of every Brownian motion. A similar argument shows that, for each fixed t, ΐν(-,ω) is nondifferentiable at t with probability 1. In fact, 1¥(·,ω) is nowhere differentiable: Theorem 37.3. For ω outside a set of probability Ο, 1¥(·,ω) is nowhere differentiable. Proof. The proof is direct—makes no use of the transformations (37.11) and (37.12). Let (37.14) ^ = max||^(*±l)-^(A)|f|^^l2)_^^l) By independence and the fact that the differences here have the distribution of 2-n/2Wb P[Xnk < e] = P\\WX\ < 2"/2e]; since the standard normal density is bounded by 1, P[Xnk < e] < (2 X 2"/2e)\ If Yn = min^„2,, Xnk, then (37.15) P[Yn<e]<n2n(2x2"/2e) .
506 STOCHASTIC PROCESSES Consider now the upper and lower right-hand derivatives Dw(t,w) = limsup —i τ - -, n ... lV(t + h,a>)-W(t,w) Dw{t,<a) = liminf—* ^ * -. Define Ε (not necessarily in &~) as the set of ω such that Dw(i, ω) and Dw{.t,o)) are both finite for some value of t. Suppose that ω lies in E, and suppose specifically that -K<Dw(t,w) <Dw(t,w) <K. There exists a positive δ (depending on ω, t, and K) such that t < s < t + δ implies \W(s, ω) - WU, ω)\ < K\s - t\. If η exceeds some n0 (depending on δ, К, and ί), then 4χ2""<δ, SK<n, n>t. Given such an n, choose k so that (к - Ш'п < t < к2~". Then \i2~" -t\<8 for i = k,k + l,k + 2,k + 3, and therefore ΧπΙζ(ω) <2Κ(4χ2~π) <η2~η. Since к - 1 < ί2" < η2", Υη(ω) < η2~η. What has been shown is that if ω lies in E, then ω lies in An = [Yn < nl~n\ for all sufficiently large η: Ε с liminf„ An. By (37.15), P( A„) < n2n{2 Χ 2η/1 Χ η2~"γ -* 0. By Theorem 4.1, liminf„ Лл has probability 0, and outside this set W(-,w) is nowhere differentiable—in fact, nowhere does it have finite upper and lower right-hand derivatives. (Similarly, outside a set of probability 0, nowhere does W(-, ω) have finite upper and lower left-hand derivatives.) ■ If A is the set of ω for which W(·, ω) has a derivative somewhere, what has been shown is that Л с β for a measurable В such that P(B) = 0; P(A) = 0 if A is measurable, but this has not been proved. To avoid such problems in the study of continuous-time processes, it is convenient to work in a complete probability space. The space (Ω, 5*", P) is complete (see p. 44) if A <zB, Be^, and P(B) = 0 together imply that A e <F (and then, of course, P(A) = 0). If the space is not already complete, it is possible to enlarge & to a new σ-field and extend Ρ to it in such a way that the new space is complete. The following assumption therefore entails no loss of generality: For the rest of this section the space (Ω, !F, P) on which the Brownian motion is defined is assumed complete. Theorem 37.3 now becomes: ΐν(-,ω) is with probability 1 nowhere differentiable.
SECTION 37. BROWNIAN MOTION 507 A nowhere-differentiable path represents the motion of a particle that at no time has a velocity. Since a function of bounded variation is differentiable almost everywhere (Section 31), ΐν(-,ω) is of unbounded variation with probability 1. Such a path represents the motion of a particle that in its wanderings back and forth travels an infinite distance in finite time. The Brownian motion model thus does not in its fine structure represent physical reality. The irregularity of the Brownian motion paths is of considerable mathematical interest, however. A continuous, nowhere-differentiable function is regarded as pathological, or used to be, but from the Brownian-motion point of view such functions are the rule not the exception.1 The set of zeros of the Brownian motion is also interesting. By property (iv), ί = 0 is a zero of W(-,w) for each ω. Now [W": ί > 0J as denned by (37.12) is another Brownian motion, and so by (37.13) the sequence {W"\ η = 1,2,...} = [nWl/n: η = 1,2,...} has supremum +oo and infimum -oo for ω outside a set of probability 0; for such an ω, W(-, ω) changes sign infinitely often near 0 and hence by continuity has zeros arbitrarily near 0. Let Ζ(ω) denote the set of zeros of Ψ(·,ω). What has just been shown is that 0 e Ζ(ω) for each ω and that 0 is with probability 1 a limit of positive points in Ζ(ω). From (37.13) it also follows that Ζ(ω) is with probability 1 unbounded above. More is true: Theorem 37.4. The set Ζ(ω) is with probability 1 perfect [A15], unbounded, nowhere dense, and of Lebesgue measure 0. Proof. Since Ψ{-,ω) is continuous, Ζ(ω) is closed for every ω. Let λ denote Lebesgue measure. Since Brownian motion is measurable (Theorem 37.2), Fubini's theorem applies: / λ(Ζ(ω))Ρ(άω) = (λχΡ)[(ί,ω): W(t,a>)=0\ Jn = Γρ[ω: W(t,o)) = 0]dt = 0. •'о Thus λ(Ζ(ω)) = 0 with probability 1. If W(-,(o) is nowhere differentiable, it cannot vanish throughout an interval / and hence must by continuity be nonzero throughout some subin- terval of /. By Theorem 37.3, then, Ζ(ω) is with probability 1 nowhere dense. It remains to show that each point of Ζ(ω) is a limit of other points of Ζ(ω). As observed above, this is true of the point 0 of Ζ(ω). For the general point of Ζ(ω), a stopping-time argument is required. Fix r>0 and let For the construction of a specific example, see Problem 31.18.
508 STOCHASTIC PROCESSES τ(ω) = inf[i: t > r, W(t, ω) = 0]; note that this set is nonempty with probability- 1 by (37.13). Thus τ(ω) is the first zero following r. Now [ω:τ(ω)<ί]= ω: inf \W{s,<a)\ = Q , I r<s<t \ and by continuity the infimum here is unchanged if s is restricted to rationals. This shows that τ is a random variable and that [ω. τ(ω) <t] <^a[Wu:u<t]. A nonnegative random variable with this property is a stopping time. To know the value of τ is to know at most the values of Wu for и < т. Since the inciements are independent, it therefore seems intuitively clear that the process (37.16) W?{o>) = WT{a)+l(w) - WT{a)(w) = WHa)+l(w), i > 0, ought itself to be a Brownian motion. This is, in fact, true by the next result, Theorem 37.5. What is proved there is that the finite-dimensional distributions of [W*: t > 0] are the right ones for Brownian motion. The other properties are obvious: IV*(·, ω) is continuous and vanishes at 0 by construction, and the space on which [W*; t > 0] is defined is complete because it is the original space (Ω, ,9*", P), assumed complete. If [W*: t > 0] is indeed a Brownian motion, then, as observed above, for ω outside a set Br of probability 0 there is a positive sequence {i„} such that t„ -» 0 and W*(t„, ω) = 0. But then W(t(<o) + t„, ω) = 0, so that τ(ω), a zero of ΐν(·,ω), is the limit of other larger zeros of Щ-,ω). Now τ(ω) was the first zero following r. (There is a different stopping time τ for each r, but the notation does not show this.) If В is the union of the Br for rational r, the first point of Ζ(ω) following r is a limit of other, larger points of Ζ(ω). Suppose that ωίβ and t e Ζ(ω), where t > 0; it is to be shown that t is a limit of other points of Ζ(ω). If t is the limit of smaller points of Ζ(ω), there is of course nothing to prove. Otherwise, there is a rational r such that r<t and ΐν(·,ω) does not vanish in [r, i); but then, since ω £ Br, t is a limit of larger points s that lie in Ζ(ω). This completes the proof of Theorem 37.4 under the provisional assumption that (37.16) is a Brownian motion. ■ The Strong Markov Property Fix i0 > 0 and put (37.17) w;=ivl<t+,-w,0, i>o. It is easily checked that [IV,': t > 0] has the finite-dimensional distributions
SECTION 37. BROWNIAN MOTION 509 appropriate to Brownian motion. As the other properties are obvious, it is in fact a Brownian motion. Let (37.18) y, = a[Ws:s<.t]. The random variables (37.17) are independent of ^. To see this, suppose that 0 < 5, < · · · < Sj < t0 and 0 < i, < · · · <tk. Put u, = ί0 + ί,. Since the increments are independent, (W!, W,'2 - W/,.. ., W't - W't ) = (Wu - W,n,WU2-WUi,...,WUk-WUk_) is independent of (Щ'V,2 -WSi,.. .,WS - W ). But then {.W;i,W,,2,...,W;k) is independent of (WSi,WSi,...,W). By Theorem 4.2, (W,\,..'., W[k) is independent of ^ц. Thus (37.19) P([K »;;)eH]n/i) = P[(Wl'i,...,Wt'k)tEH]P(A) = P[(Wli,...,Wlk)eH]P(A), /le^, where the second equality follows because (37.17) is a Brownian motion. This holds for all Η in Як. The problem now is to prove all this when i0 is replaced by a stopping time τ—a nonnegative random variable for which (37.20) [ω: τ(ω) <t] ε^, ί > 0. It will be assumed that τ is finite, at least with probability 1. Since [τ = ί] = [τ<ί]- UJt<(- и-1], (37.20) implies that (37.21) [ω:τ(ω)=ί] e ^ ί>0. The conditions (37.20) and (37.21) are analogous to the conditions (7.18) and (35.18), which prevent prevision on the part of the gambler. Now 5*Ί contains the information on the past of the Brownian motion up to time i0, and the analogue for τ is needed. Let ^ consist of all measurable sets Μ for which (37.22) Μη[ω:τ(ω)<ί] ej^ for all t. (See (35.20) for the analogue in discrete time.) Note that 3\ is a σ-field and τ is measurable ^. Since ΜΓ\[τ = t]= Μη[τ = ί] η [τ < ί], (37.23) Μη[ω:τ(ω) = ί] е ^ for Μ in ^. For example, r=inf[i: ^, = 1] is a stopping time and [infsiTWs> - И is in &T.
510 STOCHASTIC PROCESSES Theorem 37.5. Let τ be a stopping time, and put (37.24) IV,*(ω) = WT{a)+l(a>) - Ψτ(ω){ω), t > 0. Then [W*: t>] is a Brownian motion, and it is independent of ^—that is, a[W*: t > 0] is independent of ^: (37.25) p([(W,*,...,W,*)eH]nM) = P[(Wl*,...,Wl*)eH]P(M)=P[(lV,l,...,Wlt)<=H]P(M) for Η in @k and Μ in &;. That the transformation (37.24) preserves Brownian motion is the strong Markov property.* Part of .the conclusion is that the W* are random variables. Proof. Suppose first that τ has countable range V and let i0 be the general point of V. Since [w:W*(w)tEH]= (J [ω:^0+,(αι)-^0(αι)εΗ,τ(αι)=ί0], ,0eK W* is a random variable. Also, P([{W*,...,H>*)tEH\nM) = Е/'([к,..д:)ея]пмп[г=4 ,0eK If Μ e $?, then Μ η [г = ί0] е ^ц by (37.23). Further, if τ = ί0, then И^* coincides with И^' as defined by (37.17). Therefore, (37.19) reduces this last sum to Σ Ρ[(^..,^)εψ(Μη[τ = ι0]) ί0<ΞΚ = />[(^,...,^)еЯ]/>(М). This proves the first and third terms in (37.25) equal; to prove equality with the middle term, simply consider the case Μ = Ω. Since the Brownian motion has independent increments, it is a Markov process (see Examples 33.9 and 33.10); hence the terminology.
SECTION 37. BROWNIAN MOTION 511 Thus the theorem holds if τ has countable range. For the general τ, put k2~a И(к-1)2-"<т<к2-л,к=1,2,... (37.26) r„ = , 1 ; " '0 ifr = 0 If k2~n < t <(k + 1)2"", then [r„ <t] = [r <к2~п] е &кг-л с &t. Thus each тп is a stopping time. Suppose that Με^ and k2~" <t <(k + 1)2"". Then Mn[r„<i] = Mn[T<i2""]E^r„c^. Thus ^c^. Let И^Ы = ^Τπ(ω)+,(ω) - 0ςη(ω)(ω)— that is, let W,ln) be the W,* corresponding to the stopping time т„. If Μ e ^ then Μ <= &~τ and by an application of (37.25) to the discrete case already treated, (37.27) P([(W,\"\...,Wlln))tEH]nM)=P{(Wh,...,Wlt)tEH]P(M). But τ„(ω) -»τ(ω) for each ω, and by continuity of the sample paths, \ν,(η\ω) -» И^/Чю) for each ω. Condition on Μ and apply Lemma 1 with (IV™,..., W,ln)) for *„, (И-;*,..., W*) for X, and the distribution function of (И^,..., W,n)foT F = Fn. Then(37.25) follows from(37.27). ■ The τ in the proof of Theorem 37.4 is a stopping time, and so (37.16) is a Brownian motion, as required in that proof. Further applications will be given below. If У* = a[W*: t > 0], then according to (37.25) (and Theorem 4.2) the σ-fields &\ and ^* are independent: (37.28) P(AnB)=P(A)P(B), Ле^, ВеГ. For fixed t define т„ by (37.26) but with i2"" in place of 2"" at each occurrence. Then [WT <x] η [τ< ί] is the limit superior of the sets [WT <x] Π [τ < t], each of which lies in 5*j. This proves that [WT <x] lies in ·!?' and hence that WT is measurable 5*~τ. Since τ is measurable &~T, (37.29) [(T,WT)etf]e.^ for planar Borel sets #. The Reflection Principle For a stopping time τ, define (if, if t < τ, The sample path for [W": ί > 0] is the same as the sample path for [W,: t > 0] up to τ, and beyond that it is reflected through the point WT. See the figure.
512 STOCHASTIC PROCESSES The process defined by (37.30) is a Brownian motion, and to prove it, one need only check the finite-dimensional distributions: Р[(Щ ,..., Wk) eH] = P[(Wt",..., W"k) eH]. By the argument starting with (37.26), it is enough to consider the case where τ has countable range, and for this it is enough to check the equation when the sets are intersected with [τ = ί0]. Consider for notational simplicity a pair of points: (37.31) p[T = t0,(ws,wl)eH]=p[T = t0,(w;,wr)<=H]. If s < t < tu, this holds because the two events are identical. Suppose next that s < t0 < t. Since [τ = ί0] lies in 5^ > lt follows by the independence of the increments, symmetry, and the definition (37.30) that P[r = t0,(Ws,lVlu)^I,Wi-Wl^j} = P[r = t0,(Ws,Wj^I, -(Wt-WjtEj] = p[r = t0, (w;,w,i) e=/, w;'-w;'^j]. If K = IXJ, this is p[r = tQ,{ws,w,n,wt-wt^K]=p{r = tQ,{w;',w;i,w;'-w;i)^K], and by T7-A it follows for all Κ(Ξ&3. For the appropriate K, this gives (37.31). The remaining case, i0 < s < i, is similar. These ideas can be used to derive in a very simple way the distribution of M, = s\ips£lWs. Suppose that x>0. Let τ = inf[s: Ws>x], define W" by (37.30), and put r" = inf[i: Wsn7>x] and M," = sup,s, Ws". Since r" = r and W" is another Brownian motion, reflection through the point WT = x shows
SECTION 37. BROWNIAN MOTION 513 that P[r<t] P[r<t,Wt<x]+P[r<t,W,>x] Ρ[τ" <t,W," <x]+P[r<t,W,>x] P[r"<t,Wt>x]+P[r<t,Wt>x} P[r<t,Wt>x]+P[r<t,lVl>x] = 2P[Wt>x]. P[Mt>x] = -^f e-"2/2du. This argument, an application of the reflection principle? becomes quite transparent when referred to the diagram. Skorohod Embedding* Suppose that Xv X2,... are independent and identically distributed random variables with mean 0 and variance σ2. A powerful method, due to Skorohod, of studying the partial sums Sn=Xi+ · · ■ +Xn is to construct an increasing sequence τ0 = 0,τ,,τ2,... of stopping times such that W(t„) has the same distribution as 5„. The differences тк-тк_1 will turn out to be independent and identically distributed with mean σ2, so that by the law of large numbers η~ιτη = п~1ТГк=1(тк — тк_1) is likely to be near σ2. But if τ„ is near ησ2, then by the continuity of Brownian motion paths W(t„) will be near W(na2), and so the distribution of Sn/ai/n, which coincides with the distribution of W(rn)/ajn, will be near the distribution of W{na2)/a^fn —that is, will be near the standard normal distribution. The method will thus yield another proof of the central limit theorem, one independent of the characteristic-function arguments of Section 27. But it will also give more. For example, the distribution of max^ £n Sk/ai/n is exactly the distribution of max^^, W(Tk)/aJn , and this in turn is near the distribution of supi£.no.2 W(t)/ajn, which can be written down explicitly because of (37.32). It will thus be possible to derive the limiting distribution of тах^ £„ 5^. The joint behavior of the partial sums is closely related to the behavior of Brownian motion paths. The Skorohod construction involves the class У" of stopping times for which (37.33) E[WT] = 0, (37.34) E[t]=E[W2], P[M,>x] Therefore, (37.32) fSee Problem 37.18 for another application. *The rest of this section, which requires martingale theory, may be omitted.
514 STOCHASTIC PROCESSES and (37.35) £[r2]<4£[^r4]. Lemma 2. All bounded stopping times are members of У. Proof. Define Y9, = exp(6Wt - \d2t) for all 0 and for t > 0. Suppose that s < t and А е ^s. Since Brownian motion has independent increments, (Y9ldP= ( eew<-e2s/2 dP -E[e<*w>-^>-"2<'-*>/2]? }A ' }A and a calculation with moment generating functions (see Example 21.2) shows that (37.36) (YesdP= (YeidP, s<t, /le^. } A JA ' This says that for 0 fixed, [Υθ,: t > 0] is a continuous-time martingale adapted to the σ-fields &~r ^ IS tne moment-generating-function martingale associated with the Brownian motion. Let /(0, i) denote the right side of (37.36). By Theorem 16.8, ^f{6,t) = \\t{Wt-et)dP, -^-2f(<>,t) = f/eMwi-et)2-t]dP, °* f(6,t) = f Ye,,[{Wt - θί)4 -6{W,- etft + 3i2] dP. 4- 3Θ Differentiate the other side of the equation (37.36) the same way and set 0 = 0. The result is f WsdP= f W,dP, s<i, /le Fs, }A }A / (Ws2-s)dP= [ (W,2-t)dP, s<t, Α(Ξ&;, A A f (Ws4-6W2s + 3s2)dP= f (Wt4-6W2t + 3t2)dP, s<t, /4e^, A A This gives three more martingales: If Z, is any of the three random variables (37.37) W„ W2-t, W*-6W2t + 3t2,
SECTION 37. BROWNIAN MOTION 515 then Z0 = 0, Z, is integrable and measurable &„ and (37.38) [ ZsdP= f Z,dP, s<t, A^FS. }A JA In particular, E[Z,] = [Z0] = 0. If τ is a stopping time with finite range {i„...,im} bounded by t, then (37.38) implies that Ε[ΖΤ]=Σ{ ZtidP=Z( Z,dP = E[Z,] = 0. i ■'[t-I,.] ,· Jlr-I,] Suppose that τ is bounded by t but does not necessarily have finite iange. Put Tn = k2-"t if (k-l)2-"t<T^k2-nt, l<k<2", and put T„ = 0 if τ = 0. Then τ„ is a stopping time and E[Zr ] = 0. For each of the three possibilities (37.37) for Z,. supiS.,|Zj is integrable because of (37.32). It therefore follows by the dominated convergence theorem that E[Z ] = lim„ E[ZJ = 0. Thus E[ZT] = 0 for every bounded stopping time т. The three cases (37.37) give E[WT] = E[WT2 - r] = E[WT4 - 6WT2r + 3r2\ = 0. This implies (37.33), (37.34), and Q = E[W?\-6E[Wt2t\ +3E[t2] *E[Wt*]-6EV2[IVt<]E1/2[t2] + 3E[t2]. If С = El/2[WT4] and χ = £1/2[r2], the inequality is 0 > q(x) = 3x2-6Cx + C2. Each zero of q is at most 2C, and q is negative only between these two zeros. Therefore, χ <, 2C, which implies (37.35). ■ Lemma 3. Suppose that τ and τη are stopping times, that each τ„ is a member of У", and that τη~* τ with probability 1. Then τ is a member of 3~ if (i) E[WT4] < E[WT4] < oo for all n, or if (ii) the WT4 are uniformly integrable. Proof. Since Brownian motion paths are continuous, WT -» WT with probability 1. Each of the two hypotheses (i) and (ii) implies that £[И^4] is bounded and hence that Е[т2] is bounded, and it follows (see (16.28)) that the sequences {т„}, [WT}, and {W2} are uniformly integrable. Hence (37.33) and (37.34) for τ follow by Theorem 16.14 from the same relations for the τ„. The first hypothesis implies that liminf„ Е[1У*] <E[WT4], and the second implies that lim„ EiW*] = EiWf]. In either case it follows by Fatou's lemma that E[t2] < lim inf„ e\t2] < 41iminf„ E[WT4] < 4E[WT4]. ■
516 STOCHASTIC PROCESSES Suppose that a, b > 0 and a + b > 0, and let τ(α, b) be the hitting time for the set {-a,b): т(а, b) = inf[i: W, e {-a,b)]. By (37.13), Λα, b) is finite with probability 1, and it is a stopping time because τ(α, b) <t if and only if for every m there is a rational r < t for which Wr is within m~l of -a or of b. From If-Kminf-Ka, b), л})| < max{a, b) it follows by Lemma 3(ii) that т(а, b) is a member of У. Since WT(a b) assumes only the values -a and b, E[WT^a b)] = 0 implies that (37.39) P\WT(aM=-a) = ^, P\Kiab)^b]=aJL·-. This is obvious on grounds of symmetry in the case a = b. Let μ be a probability measure on the line with mean 0. The program is to construct a stopping time τ for which WT has distribution μ. Assume that μ.{0} < 1, since otherwise τ = 0 obviously works. If μ consists of two point masses, they must for some positive a and b be a mass of b/{a + b) at - a and a mass of a/(a + b) at b; in this case т(а b) is by (37.39) the required stopping time. The general case will be treated by adding together stopping times of this sort. Consider a random variable X having distribution μ. (The probability space for X has nothing to do with the space the given Brownian motion is defined on.) The technique will be to represent X as the limit of a martingale Xi,X2,... of a simple form and then to duplicate the martingale by WT, WT ,... for stopping times τ„; the τ„ will have a limit τ such that WT has the same distribution as X. The first step is to construct sets Δ„: α(0")<<)< ··· <a<rnn) and corresponding partitions ί/0" = (-οο,α(«>], &„: 4" = (в(^1,в(*я)],1^Л^„, Let M(H) be the conditional mean: Μ{Η) = ~μΤΗ)ίΗΧμ{άΧ) [ίμ(Η>>>0- Let Aj consist of the single point M(Rl) = E[X] = 0, so that &x consists of /о1 = (-oo,0] and l\ = (0,oo). Suppose that Δ„ and &>n are given. If μ((Ι"Τ) >0, split 4" by adding to Δ„ the point МЩ), which lies in ЩТ; if μ((Ι£Τ) = 0, /" appears again in &>n + l.
SECTION 37. BROWN1AN MOTION 517 Let c0n be the σ-field generated by the sets [Л'е/Д and put Xn = E[X\№n]. Then Xv X2,... is a martingale and Xn = M(I£) on [Ze/;]. The Xn have finite range, and their joint distributions can be written out explicitly. In fact, [Xx = MUl),.. ·, Xn = M(I£)] = [X^ Ilky..., *£=/£], and this set is empty unless l\ э ■ ■ ■ э1£, in which case it is [Xn = M(/£ )] = [ A"e 4"J. Therefore, if kn_\ = j and //"' = /£_, U 4", Ρΐ^ = ^ι^)\\Χι=Μ(ιΙι),...,χ^1=Μ(ι^)] = ή^ and p[xn = M(iD\\xl = M(ilki),...,xn_l = M(i?n-ji)} = provided the conditioning event has positive probability. Thus the martingale {Xn} has the Markov property, and if x = M(I"~l\ u = M(I£_l), and ν = M(I£\ then the conditional distribution of Xn given Xn_x =x is concentrated at the two points и and υ and has mean x. The structure of [Xj is determined by these conditional probabilities together with the distribution Ρ[Χι=Μ(ΐΙ)]=μ(ΐΙ), Ρ[Χι=Μ(ΙΙ)\=μ(ΐΙ). of Xx. If £= σ(υ „^„), then Xn -* E[X\\^] with probability 1 by the martingale theorem (Theorem 35.6). But, in fact, Xn-*X with probability 1, as the following argument shows. Let В be the union of all open sets of μ,-measure 0. Then β is a countable disjoint union of open intervals; enlarge В by adding to it any endpoints of μ,-measure О these intervals may have. Then μ(Β) = 0, and χίΰ implies that μ(χ - e, x] > 0 and μ[χ, χ + e) > 0 for all positive e. Suppose that χ=Χ(ω)<£Β and let χη=Χη(ω). Let /£ be the element of &„ containing x; then xn + l= M(I£) and 4" |/ for some interval /. Suppose that xn+l<x - e for η in an infinite sequence N of integers. Then xn+l is the left endpoint of /£*| for η in N and converges along N to the left endpoint, say a, of /, and (x - e, x]c/. Further, xn + l= M(I£)-* M(I) along N, so that M(I) = a. But this is impossible because μ(χ - e, x] > 0. Therefore, xn >x - e for large n. Similarly, xn <x + € for large n, and so x„ —*x. Thus Χη(ω)~*Χ(ω) if Χ(ω)<£Β, the probability of which is 1. Now XY = Ε[Χ\\^λ] has mean 0, and its distribution consists of point masses at -a = M(/0') and b = M(//). If τ, = τ(α,b) is the hitting time to
518 STOCHASTIC PROCESSES {-a,b}, then (see (37.39)) τ, is a stopping time, a member of У, and WT has the same distribution as Xy Let τ2 be the infimum of those t for which ί > τ, and W, is one of the points M(42), 0 < к < r2 + 1. By (37.13), т2 is finite with probability 1; it is a stopping time, because т2 < t if and only if for every m there are rationale r and s such that r<s + m_1, r<t, s<t, Wr is within m~l of one of the points Μ(Ι;ι), and И^ is within m~l of one of the points M(l£). Since |И/(тт{т2,«})| is at most the maximum of the values |М(/Л2)|, it follows by Lemma 3(ii) that т2 is a member of У. Define И^* by (37.24) with τ, for т. If χ = MUjX then χ is an endpoint common to two adjacent intervals 42_, and l£; put u = M(42_,) and у = M(l£). If И^ =x, then u and и are the only possible values of WT . If τ* is the first time the Brownian motion [W*: t > 0] hits и - χ or у -х, then by (37.39), p\W% = u-x\ = ^—^, P[W% = υ - χ] = —"-. lr Jy-«' lr Jy-u On the set [WT =x], τ2 coincides with τ, + τ*, and it follows by (37.28) that p[WTrx,WT2 = v\ =P[WTi=x,x+WT% = V\ = P[WT=x\P[WT% = v-x] = P[WT\^. This, together with the same computation with и in place of y, shows that for WT = χ the conditional distribution of WT is concentrated at the two points и and υ and has mean x. Thus the conditional distribution of WT given WT coincides with the conditional distribution of X2 given Xx. Since WT and Xx have the same distribution, the random vectors (WT, WT ) and (Xb X2) also have the same distribution. Ал inductive extension of this argument proves the existence of a sequence of stopping times т„ such that τι<τ2£ ■ · ·, each τ„ is a member of У", and for each n, WT,...,WT have the same joint distribution as Xv..., X„. Now suppose that X has finite variance. Since τ„ is a member of У, £[r„] = E[WT2] = E[Xf] = E[E2[X\\^J]<E[X2] by Jensen's inequality (34.7). Thus τ = lim„ τ„ is finite with probability 1. Obviously it is a stopping time, and by path continuity, WT -» WT with probability 1. Since Xn ~* X with probability 1, it is a consequence of the following lemma that WT has the distribution of X. Lemma 4. IfX„ ~* Xand Yn ~* Υ with probability 1, and if Xn and Yn have the same distribution, then so do X and Y.
SECTION 37. BROWN1AN MOTION 519 Proof.1 By two applications of (4.9), P[X <x] < P[X <x + e] <\im iniP[Xn <x + e] η <lim supP[Yn <x + e] <P[Y<x + e]. η Let e -» 0: P[X <x] < P[Y<x]. Now interchange the roles of X and Y. Ш Since X2 <E[X2\\cfn], the Xn are uniformly integrable by the lemma preceding Theorem 35.6. By the monotone convergence theorem and Theorem 16.14, E[r] = limnE[rn] = limnE[W2] = \imnE[X2] = E[X2]^E[W2]. If E[X*]<°o, then ElW?] = E[X*}<E[X*\ = E[W?\ (Jensen's inequality again), and so τ is a member of &. Hence Ε[τ2] < 4E[WT4]. This construction establishes the first of Skorohod's embedding theorems: Theorem 37.6. Suppose that X is a random variable with mean 0 and finite variance. There is a stopping time τ such that WT has the same distribution as X, E[t] = E[X2], and E[t2] < 4E[X4]. Of course, the last inequality is trivial unless E[X4] is finite. The theorem could be stated in terms not of X but of its distribution, the point being that the probability space X is defined on is irrelevant. Skorohod's second embedding theorem is this: Theorem 37.7. Suppose that Xl,X2,... are independent and identically distributed random variables with mean 0 and finite variance, and put Sn = Xl + ■ ■ ■ +Xn- There is a nondecreasing sequence τ„ τ2,... of stopping times such that the WT have the same joint distributions as the Sn and τ„ τ2 - τ1? τ3 - τ2,... are independent and identically distributed random variables satisfying E[rn - τ„_,] = E[X2) and E[(t„ - r„_ ,)2] < 4E[X*]. Proof. The method is to repeat the construction above inductively. For notational clarity write W, = W,w and put 3^0) = a[Ws(1): 0<s<i] and Sr(l) = a[W,a): t > 0]. Let δ, be the stopping time of Theorem 37.6, so that W£l) and Xl have the same distribution. Let ^1} be the class of Μ such that Μ Π [δ, < ί] e ^(1) for all t. Now put W™ = Wtfl, - Wg\ ^2) = a[W}2): 0<s<t], and ^(2) = a[W^2): t > 0]. By another application of Theorem 37.6, construct a stopping time δ2 for the Brownian motion [Wt(2): t > 0] in such a way that Wj;2) has the same distribution as Xx. In fact, use for δ2 the very same martingale construction as for δ„ so that (δ,, We(1)) and (δ2, W£2)) have the same distribution. Since Щ{) and 5r(2) are independent (see (37.28)), it follows (see (37.29)) that (δ„ W™) and (δ2, Wsf) are independent. fThis is obvious from the weak-convergence point of view.
520 STOCHASTIC PROCESSES Let jy2) be the class of Μ such that Μη[δ2<ί]<Ξ ^ for all t. If W,0) = W^\, - Wg> and 5r(3) is the σ-field generated by these random variables, then again 3^2) and J^3* are independent. These two σ-fields are contained in IF^, which is independent of 5^(1). Therefore, the three σ-fields 5g(1), 5^(2), 5r(3) are independent. The procedure therefore extends inductively to give independent, identically distributed random vectors (δ„, W^). If τ„ = δ, + · · · + δ„, then W™ = Wg> +■■■ + Ws[n) has the distribution oi Xx+ ■■■ +Xn. " " ■ Invariance * If E[X2] = σ2, then, since the random variables τ„ -τ„_, of Theorem 37.7 are independent and identically distributed, the strong law of large numbers (Theorem 22.1) applies and hence so does the weak one: (37.40) Р[\п-1тп-а2\>е] -»0. (If E[Xf]<«>, so that the τη-τη_λ have second moments, this follows immediately by Chebyshev's inequality.) Now Sn has the distribution of W(t„\ and τ„ is near ησ2 by (37.40); hence Sn should have nearly the distribution of W(na2), namely the normal distribution with mean 0 and variance ησ2. To prove this, choose an increasing sequence of integers Nk such that P[\n'1rn-a2\>k-1]<k-i for n> Nk, and put €n=k~l for Nk < η < Nk +,. Then e„-*0and Ρ[\η~ιτη- σ2\> €„]<€„. By two applications of (37.32), **(c)=P \Щпа2) - W(rn)\ >e στη <p[\n-'rn-a2\>en)+p <€n + 4P[\W(€nn)\^€CT^], sup \W{t)~W(na2)\>eaTn \t—na2\<,€„n and it follows by Chebyshev's inequality that lim„ Sn(e) = 0. Since 5„ is distributed as W(t„\ σνη -δ„(6)<Ρ <P <x σ4η W(na2) σνη <x + e + δ„(0· *This topic may be omitted.
SECTION 37. BROWNIAN MOTION 521 Here W(na2)/ajn can be replaced by a random variable N with the standard normal distribution, and letting η -* oo and then e -»0 shows that lim/5 σν« <x = P[N<x]. This gives a new proof of the central limit theorem for independent, identically distributed random variables with second moments (the Lindeberg-Levy theorem—Theorem 27.1). Observe that none of the convergence theory of Chapter 5 has been used. This proof of the central limit theorem is an application of the invariance principle: Sn has nearly the distribution of W(na2\ and the distribution of the latter does not depend on (vary with) the distribution common to the Xn. More can be said if the Xn have fourth moments. For each n, define a stochastic process [Yn(t): 0 <i < 1] by Yn(0,ω) = 0 and (37.41) Yn(t,w) = ^rSk(w) if^-<i<£, k=l,...,n. If k/n = t > 0 and η is large, then к is large, too, and Yn(t) = tx/1Sk/a{k is by the central limit theorem approximately normally distributed with mean 0 and variance t. Since the Xn are independent, the increments of (37.41) should be approximately independent, and so the process should behave approximately as a Brownian motion does. Let т„ be the stopping times of Theorem 37.7, and in analogy with (37.41) put Z„(0) = 0 and (37.42) Z„(i)--=^W(r,) \i^±<t<k-, k = l,...,n. By construction, the finite-dimensional distributions of [Yn(t): 0 < t < 1] coincide with those of [Zn(t)\ 0 < t < 1]. It will be shown that the latter process nearly coincides with [W(tna2)/a]fn: 0 < t < 1], which is itself a Brownian motion over the time interval [0,1]—see (37.11). Put Wn(.t) = W(tna2)/<rfn'. Let Βη(δ) be the event that \тк — ka2\ > δησ2 for some k<n. By Kolmogorov's inequality (22.9), Varir 1 4£[Z,41 (37.43) ,(,.(,))* .^L^-^LUo. If (k- On'1 <t <kn~x and η >δ ', then
522 STOCHASTIC PROCESSES on the event (Bn(8))c, and so \Zn(t)-Wn(t)\= Wn[^)-Wn{t) < sup \Wn{s) - Wn{t)\ \s-t\<2S on (Bn(S))c. Since the distribution of this last random variable is unchanged if the WnU) are replaced by W(t\ p\sup\Zn(t)-W„(t)\>e] <Ρ(Βη(δ))+Ρ sup sup \W(s) - W(t)\>e l£l \s-t\&2S Let η -» oo and then δ -» 0; it follows by (37.43) and the continuity of Brownian motion paths that (37.44) limP sup\Zn(t)-Wn(t)\>e L l£l = 0 for positive e. Since the processes (37.41) and (37.42) have the same finite- dimensional distributions, this proves the following general invariance principle or functional central limit theorem. Theorem 37.8. Suppose that XX,X2,·-· are independent, identically distributed random variables with mean 0, variance σ2, and finite fourth moments, and define Y„(t) by (37.41). There exist {on another probability space), for each n, processes [Z„(i): 0 < t < 1] and [Wn(t): 0 < t < 1] such that the first has the same finite-dimensional distributions as [Yn(t): 0 < t < 1], the second is a Brownian motion, and P[sup, <,|Z„(i) - WnU)\ > e] -» 0 for positive e. As an application, consider the maximum Mn = maxk£.n Sk. Now Μη/σ4η = sup, Yn(t) has the same distribution as sup, Z„(i), and it follows by (37.44) that supZn(t)-sup Wn(t) i£l t£l >e But P[suptalWn(0>x] = P[supt£.l W(t)>x] = 2P[N>x] for χ > 0 by (37.32). Therefore, (37.45) σ{η <x 2P[N<x], x>0.
SECTION 37. BROWN1AN MOTION 523 PROBLEMS 37.1. 36.2 Τ Show that K(s, t) = mm{s, t) is nonnegative-definite; use Problem 36.2 to prove the existence of a process with the finite-dimensional distributions prescribed for Brownian motion. 37.2. Let X(t) be independent, standard normal variables, one for each dyadic rational t (Theorem 20.4; the unit interval can be used as the probability space). Let W(0) = 0 and W(n)= Enk = lX(k). Suppose that W(t) is already defined for dyadic rationals of rank n, and put Prove by induction that the W(t) for dyadic t have the finite-dimensional distributions prescribed for Brownian motion. Now construct a Brownian motion with continuous paths by the argument leading to Theorem 37.1. This avoids an appeal to Kolmogorov's existence theorem. 37.3. Τ For each η define new variables Wn(t) by setting Wn(k/2") = W(k/2n) for dyadics of order η and interpolating linearly in between. Set δ„ = sup, s„|W„ + ,(i) - Wn(t)\, and show that δη = max 0<,k<n2n The construction in the preceding problem makes it clear that the difference here is normal with variance 1/2"+ i. Find positive xn such that Σχη and ΣΡ[δπ >x„] both converge, and conclude that outside a set of probability 0, Wn(t, ω) converges uniformly over bounded intervals. Replace W(t,a>) by limn Wn(t, ω). This gives another construction of a Brownian motion with continuous paths. 37.4. 36.6 Τ Let Γ = [0,°°), and let Ρ be a piobability measure on (RT,&T) having the finite-dimensional distributions prescribed for Brownian motion. Let С consist of the continuous elements of RT. (a) Show that /^(0 = 0, or P*(RT-C)=l (see (3.9) and (3.10)). Thus completing (RT,&T,P) will not give С probability 1. (b) Show that P*(C)=l. 37.5. Suppose that [Wt: t>0] is some stochastic process having independent, stationary increments satisfying £[W,] = 0 and E[Pf,2] = i. Show that if the finite-dimensional distributions are preserved by the transformation (37.11), then they must be those of Brownian motion. 37.6. Show that Dt>0a[Ws: s>t] contains only sets of probability 0 and 1. Do the same for П f >о^[Щ'· 0 < ί <e]; give examples of sets in this σ-field. 37.7. Show by a direct argument that W(-,a>) is with probability 1 of unbounded variation on [0,1]: Let Yn = Lfl №02'n) - W((i-1)2-")\. Show that Yn has mean 2"/2E[|W,|] and variance at most Var[|Pf,|]. Conclude that ΣΡ[Υη <и]<». *№)-ш^т
524 STOCHASTIC PROCESSES 37.8. Show that the Poisson process as defined by (23.5) is measurable. 37.9. Show that for Τ = [0,°°) the coordinate-variable process [Ζ,: ίε T]on(RT, &T) is not measurable. 37.10. Extend Theorem 37.4 to the set [t: W(t,<u) = a]. 37.11. Let τα be the first time the Brownian motion hits a > 0: T„=inf[i: W, >a]. Show that the distribution of τα has over (0,oo) the density (37.46) Ae(0= » 1 -«V* Show that E[ra] = oo. Show that τα has the same distribution as a2/N2, where N is a standard normal variable. 37.12. Τ (a) Show by the strong Markov property that ra and τα+β-τα are independent and that the latter has the same distribution as τβ. Conclude that ha * hp :=ha+p. Show that βτα has the same distribution as τα /-. (b) Show that each ha is stable—see Problem 28.10. 37.13. Τ Suppose that X1,X2,... are independent and each has the distribution (37.46). (a) Show that (A^ + ■ · · +Xn)/n2 also has the distribution (37.46). Contrast this with the law of large numbers. (b) Show that P[n~2 maxAin Xk <*]-» cxp(-a^2/vx) for χ > 0. Relate this to Theorem 14.3. 37.14. 37.11 Τ Let p(s, t) be the probability that a Brownian path has at least one zero in (s, t). From (37.46) and the Markov property deduce 2 IT (37.47) p(s,t) = — arccos-J - . Hint: Condition with respect to Ws. 37.15. Τ (a) Show that the probability of no zero in (г, 1) is (2/Tr)arcsiiiA/F and hence that the position of the last zero preceding 1 is distributed over (0,1) with density π_1(ί(1 -t)Yx/2. (b) Similarly calculate the distribution of the position of the first zero following time 1. (c) Calculate the joint distribution of the two zeros in (a) and (b). 37.16. Τ (a) Show by Theorem 37.8 that infiSUS, Yn(u) and infsйu£, Z„(u) both converge in distribution to \nfsSuslW(u) for 0<s<i<l. Prove a similar result for the supremum. (b) Let An(s, t) be the event that Sk, the position at time к in a symmetric random walk, is 0 for at least one к in the range sn<k< tn, and show that P(A„(s, Ο) -»(2/π) arccos yfs/t.
SECTION 37. BROWN1AN MOTION 525 (c) Let Tn be the maximum к such that к < η and Sk = 0. Show that T„/n has asymptotically the distribution with density v~l(t(l -t))~l/2 over (0,1)· As this density is larger at the ends of the interval than in the middle, the last time during a night's play a gambler was even is more likely to be either early or late than to be around midnight. 37.17. Τ Show that p(s,t) = p(t~\s~}) = p(cs,ct). Check this by (37.47) and also by the fact that the transformations (37.11) and (37.12) preserve the properties of Brownian motion. 37.18. Deduce by the reflection principle that (MnW,) has density It on the. set where y>0 and у >х. Now deduce from Theorem 37.8 the corresponding limit theorem for symmetric random walk. 37.19. Show by means of the transformation (37.12) that for positive a and b the probability is 1 that the process is within the boundary -at < W, <bt for all sufficiently large t. Show that a/(a +b) is the probability that it last touches above rather than below. 37.20. The martingale calculation used for (37.39) also works for slanting boundaries. For positive a,b,r, let τ be the smallest t such that either W, = -a +rt or W, = b +rt, and let p(a,b,r) be the piobability that the exit is through the upper barrier—that WT = b + rr. (a) For the martingale Ye, in the proof of Lemma 2, show that E[Ye ,]= 1. Operating formally at first, conclude that (37.48) Е[ев">-в2т/2] = 1. Take 0 = 2r, and note that WT-\<d2r is then 2rb if the exit is above (probability p(a, b, r)) and -Ira if the exit is below (probability 1 —p(a, b, r)). Deduce l-e2ra p(a,b,r)= £2rb_e_2ra. (b) Show that p(a,b,r) -»a/(a + b) as r -» 0, in agreement with (37.39). (c) It remains to justify (37.48) for θ = 2r. From E[Ye,] = 1 deduce (37.49) E[e2'<^-'2">] = 1 for nonrandom σ. By the arguments in the proofs of Lemmas 2 and 3, show that (37.49) holds for simple stopping times σ, for bounded ones, for σ = τ Λ η, for σ = т. 2(2y-jQ ΐ}/2ττί exp
526 STOCHASTIC PROCESSES SECTION 38. NONDENUMERABLE PROBABILITIES* Introduction As observed a number of times above, the finite-dimensional distributions do not suffice to determine the character of the sample paths of a process. To obtain paths with natural regularity properties, the Poisson and Brownian motion processes were constructed by ad hoc methods. It is always possible to ensure that the paths have a certain very general regularity property called separability, and from this property will follow in appropriate circumstances various other desirable regularity properties. Section 4 dealt with "denumerable" probabilities; questions about path functions involve all the time points and hence concern "nondenumerable" probabilities. Example 38.1. For a mathematically simple illustration of the fact that path properties are not entirely determined by the finite-dimensional distributions, consider a probability space (Ω, ,9*", P) on which is defined a positive random variable V with continuous distribution: P[V = x] = 0 for each x. For t > 0, put X(t, ω) = 0 for all ω, and put ,1 iiV(o))=t, Since V has continuous distribution, P[X, = Y,] = 1 for each t, and so [X,: t>0] and [Y,: t > 0] are stochastic processes with identical finite-dimensional distributions; for each t0...,tk, the distribution μ, , common to (Xt,...,X,k) and (Yt,..., Yt) concentrates all its mass at the origin of Rk. But what about the sample paths? Of course, Χ(·,ω) is identically 0, but Υ(·,ω) has a discontinuity—it is 1 at t = ν(ω) and 0 elsewhere. It is because the position of this discontinuity has a continuous distribution that the two processes have the same finite-dimensional distributions. ■ Definitions The idea of separability is to make a countable set of time points serve to determine the properties of the process. In all that follows, the time set Τ will for definiteness be taken as [0,°°). Most of the results hold with an arbitrary subset of the line in the role of T. As in Section 36, let RT be the set of all real functions over T= [0,°°). Let D be a countable, dense subset of T. A function χ—an element of RT—is separable D, or separable with respect to D, if for each t in Τ there exists a *This section may be omitted.
SECTION 38. NONDENUMBERABLE PROBABILITIES 527 sequence i,, t2,... of points such that (38.2) i„GD, f„-»f, x(i„)-»x(0· (Because of the middle condition here, it was redundant to require D dense at the outset.) For t in D, (38.2) imposes no condition on x, since tn may be taken as t. An χ separable with respect to D is determined by its values at the points of D. Note, however, that separability requires that (38.2) hold for every t—an uncountable set of conditions. It is not hard to show that the set of functions separable with respect to D lies outside &T. Example 38.2. If χ is everywhere continuous or right-continuous, then it is separable with respect ίο every countable, dense D. Suppose that x(t) is 0 for t Φ υ and 1 for ί = υ, where υ > 0. Then χ is not separable with respect to D unless υ lies in D. The paths Υ(·,ω) in Example 38.1 are of this form. ■ The condition for separability can be stated another way: χ is separable D if and only if for every t and every open interval / containing t, x(t) lies in the closure of [x(s): s e / nD]. Suppose that χ is separable D and that / is an open interval in T. If € > 0, then x(tQ) + e > sup/e/ x(t) = и for some i0 in /. By separability \x(sQ) - x(tQ)\ < € for some sQ in I C\D. But then x(sQ) + 2e> u, so that (38.3) supx(i) = sup x{t). Similarly, (38.4) inf.r(i)= inf x(t) ,is/ is/no and (38.5) sup |x(0-*('o)l= sup |x(f)-x(f0)|. ι0^'<Ό + δ ι0£ΐ<ι0+δ A stochastic process [X,: i>0] on (Ω,&~,Ρ) is separable D if D is a countable, dense subset of 7"=[0,<») and there is an ^set N such that P(N) = 0 and such that the sample path Χ(·,ω) is separable with respect to D for ω outside N. Finally, the process is separable if it is separable with respect to some D; this D is sometimes called a separant. In these definitions it is assumed for the moment that X(t, ω) is a finite real number for each t and ω. Example 38.3. If the sample path ΑΧ-,ω) is continuous for each ω, then the process is separable with respect to each countable, dense D. This covers Brownian motion as constructed in the preceding section. ■
528 STOCHASTIC PROCESSES Example 38.4. Suppose that [Wt: t> 0] has the finite-dimensional distributions of Brownian motion, but do not assume as in the preceding section that the paths are necessarily continuous. Assume, however, that [Wt: t > 0] is separable with respect to D. Fix i0 and δ. Choose sets Dm = [tml,..., tmm) of D-points such that tQ < tml < ■■■ < tmm < t0 + δ and Dm Τ D η (ί0, ί0 + 5). By the argument leading to (37.9), (38.6) sup \Wt-W \>a Κδ2 For sample points outside the N in the definition of separability, the supremum in (38.6) is unaltered, because of (38.5), if the restriction t e D is dropped. Since P(N) = 0, sup \W,-W,o\>a ioi'S/H + S < Κδ α4 2 Define Mn by (37.8) but with r ranging over all the reals (not just over the dyadic rationale) in [Jt2_",(* + 2)2_n]. Then P[Mn> n~[] <4Kn5/2n follows just as before. But for ω outside В = [Mn> n~l i.o.], W(-,w) is continuous. Since P(B) = 0, ^(-,ω) is continuous for ω outside an ^set of probability 0. If Ш, Sr,P)is complete, then the set of ω for which W{-, ω) is continuous is an ^set of probability 1. Thus paths are continuous with probability 1 for any separable process having the finite-dimensional distributions of Brownian motion—provided that the underlying space is complete, which can of course always be arranged. ■ As it will be shown below that there exists a separable process with any consistently prescribed set of finite-dimensional distributions, Example 38.4 provides another approach to the construction of continuous Brownian motion. The value of the method lies in its generality. It must not, however, be imagined that separability automatically ensures smooth sample paths: Example 38.5. Suppose that the random variables Xt, t > 0, are independent, each having the standard normal distribution. Let D be any countable set dense in Τ = [0, °°). Suppose that / and J are open intervals with rational endpoints. Since the random variables X, with t e D Π / are independent, and since the value common to the P[X,^J] is positive, the second Borel-Cantelli lemma implies that with probability 1, X,^J for some t in D П I. Since there are only countably many pairs / and J with rational endpoints, there is an ^set N such that P(N) = 0 and such that for ω outside N the set [Χ(ί,ω): t^Dnl] is everywhere dense on the line for every open interval / in T. This implies that [Xt: t > 0] is separable with respect to D. But also of course it implies that the paths are highly irregular.
SECTION 38. NONDENUMBERABLE PROBABILITIES 529 This irregularity is not a shortcoming of the concept of separability—it is a necessary consequence of the properties of the finite-dimensional distributions specified in this example. ■ Example 38.6. The process [Yt: t > 0] in Example 38.1 is not separable: The path Υ(·,ω) is not separable D unless D contains the point V(w). The set of ω for which Υ(-,ω) is separable D is thus contained in [ω: V(w) e D], a set of probability 0, since D is countable and V has a continuous distribution. ■ Existence Theorems It will be proved in stages that for every consistent system of finite-dimensional distributions there exists a separable process having those distributions. Define χ to be separable D at the point t if there exist points tn in D such that tn -*t and x(tn) -*x(t). Note that this is no restriction on χ if t lies in D, and note that separability is the same thing as separability at every t. Lemma 1. Let [X/. i>0] be a stochastic process on (Ω,,&~,Ρ). There exists a countable, dense set D in [0, <»), and there exists for each t an t^-set N(t), such that P(N(t)) = 0 andsuch that for ω outside N(t) the path function Χ(·,ω) is separable D at t. Proof. Fix open intervals / and J, and consider the probability P(U)=p(f)[Xs£J]) yse(/ ; for countable subsets U of / η Τ. As U increases, the intersection here decreases and so does p(U). Choose Un so that p(Un) -»inf^ p(U). If U(I,J)= U„t/„, then U(I,J) is a countable subset of ШТ making p(U) minimal: (38.7) p[ Π [*,*']UW П [*,*']) for every countable subset U of / η Τ. If t e / η Τ, then (38.8) p([X,<EJ]n Π [A,£/])=0, because otherwise (38.7) would fail for U = U(I, J) U {i}.
530 STOCHASTIC PROCESSES Let D = U U(I, J), where the union extends over all open intervals / and J with rational endpoints. Then D is a countable, dense subset of T. For each t let (38.9) N(t)= \ji[X,^J)n Π [ЪеП), where the union extends over all open intervals J that have rational end- points and ovei all open intervals / that have rational endpoints and contain t. Then MO is by (38.8) an ^set such that />(MO) = 0. Fix t and ω £ MO. The problem is to show that Χ(·,ω) is separable with respect to D at t. Given n, choose open intervals / and J that have rational endpoints and lengths less than л-' and satisfy t e/ and ΛΧί,ω)^/. Since ω lies outside (38.9), there must be an sn in £/(/,/) such that X(sn, ω)ε/. But then s, eD, |s„ — f| <«-1, and \X(sn,(o) ~X(t,a>)\ < n~l. Thus sn -»ί and AXsn, ω) -»ΑΌ, ω) for a sequence Sj, s2> · · · m D. Ш For any countable D, the set of ω for which Χ(·,ω) is separable with respect to D at ί is oo (38.10) p| (J [a>:\X(t,<o)-X(s,<o)\<n-1]. "=1 |j-i|<n"' ssO This set lies in fF for each t, and the point of the lemma is that it is possible to choose D in such a way that each of these sets has probability 1. Lemma 2. Let [X,: t > 0] be a stochastic process on (Ω, 5*", P). Suppose that for all t and ω (38.11) a<X(t,w)<b. Then there exists on (Ω,&~,Ρ) a process [X't: t > 0] having these three properties: (i) P[X',=X,]=l for each t. (ii) For some countable, dense subset D of [0,«0, Χ'(·,ω) is separable D for every ω in Ω. (iii) For all t and ω, (38.12) a<X'(t,(o)<b. Proof. Choose a countable, dense set D and ^sets MO of probability 0 as in Lemma 1. If t eD or if ωίΜΟ, define Χ'0,ω) = Χ(ί,ω). If iiD,
SECTION 38. NONDENUMBERABLE PROBABILITIES 531 fix some sequence {s*,0} in D for which limn s^ = t, and define Χ'(ί,ω) = limsup„ X(s(n\ ω) for ω e JV(i). To sum up, (38.13) X'(t,a>) = . Χ(ί,ω) if ieDortuiiV(i), limsupAr(i^'),w) if t £D and ω eJV(i). Since /V(i) «ξ ^", Χ; is measurable ^" for.each t. Since P(N(t)) = 0, P[X, = X\\ = \ for each i. Fix t and ω. If ί εβ, then certainly Ζ'ί',ω) is separable D at f, and so assume i£D. If ω<£Ν(ί), then by the construction of N(t), Χ(·,ω) is separable with respect to D at i, so that there exist points sn in D such that sn -*t and Ζ(ίπ,ω) -*X(t, ω). But X(sn,cu) =X'(sn, ω) because sn eE D, and Χ(ί,ω)=Χ'(ί, ω) because ω£/ν(ί)- Hence X'(sn,w) ~* X'(t, ω), and so Χ'(-,ω) is separable with respect to D at f. Finally, suppose that t <£D and ω e JVO). Then X'U, ω) = lim^ Α'ίί^', ω) for some sequence [nk] of integers. As A: ^oo, ,<£-^ and Χ'^,ω) = Χ^,ω) -*Χ'(ί,ω), so that again Χ'(-,ω) is separable with respect to D at t. Clearly, (38.11) implies (38.12). Example 38.7. One must allow for the possibility of equality in (38.12). Suppose that ν(ω)> 0 for all ω and that V has a continuous distribution. Define /(0-fe""' ifi#0' n ' \0 if ί = 0, and put Λ4ί,ω) =/(ί - Κω)). If [A',': t > 0] is any separable process with the same finite-dimensional distributions as [X,: t > 0], then Χ'(-,ω) must with probability 1 assume the value 1 somewhere. In this case (38.11) holds for a < 0 and 6 = 1, and equality in (38.12) cannot be avoided. ■ If (38.14) sup|Z(i,tu)|<oo, l,ω then (38.11) holds for some a and 6. To treat the case in which (38.14) fails, it is necessary to allow for the possibility of infinite values. If x(t) is °° or -oo, replace the third condition in (38.2) by x(tn) -»oo or x(tn) -» -oo. This extends the definition of separability to functions χ that may assume infinite values and to processes [X,: t > 0] for which X(t, ω) = ±<χ> is a possibility. Theorem 38.1. // [X,: t > 0] is a finite-valued process on Ш, &, P), there exists on the same space a separable process [X[: t > 0] such that P[X't = X,] = 1 for each t.
532 STOCHASTIC PROCESSES It is assumed for convenience here that X(t, ω) is finite for all t and ω, although this is not really necessary. But in some cases infinite values for certain X'(t, ω) cannot be avoided—see Example 38.8. Proof. If (38.14) holds, the result is an immediate consequence of Lemma 2. The definition of separability allows an exceptional set N of probability 0; in the construction of Lemma 2 this set is actually empty, but it is clear from the definition this could be arranged anyway. The case in which (38.14) may fail could be treated by tracing through the preceding proofs, making slight changes to allow for infinite values. A simple argument makes this unnecessary. Let g be a continuous, strictly increasing mapping of Rl onto (0,1). Let Y(t, ω) = g(X(t, ω)). Lemma 2 applies to [Yt: i>0]; there exists a separable process [F/: i>0] such that P[Y,'= Y,] ■-= 1. Since 0 < Y(t, ω) < 1, Lemma 2 ensures 0 < Y'(t, ω) ζ 1. Define ■oo if Y'(t,ω) =0, X'{t,o>) = {g-\Y'{t,fJ>)) ίίΟ<Γ(ί,ω)<1, + oo HY'(t,w) = l. Then [X't: t > 0] satisfies the requirements. Note thai P[X't = +oo] = 0 for each t. Ш Example 38.8. Suppose that V(w) > 0 for all ω and V has a continuous distribution. Define h(t) = lur if^o, w \0 ifi = 0, and put X(t,a)) = h(t - ν(ω)). This is analogous to Example 38.7. If [X't: t > 0] is separable and has the finite-dimensional distributions of [X,: t > 0], then X'( ■, ω) must with probability 1 assume the value oo for some t. Ш Combining Theorem 38.1 with Kolmogorov's existence theorem shows that for any consistent system of finite-dimensional distributions μ, , there exists a separable process with the μ, , as finite-dimensional distributions. As shown in Example 38.4, this leads to another construction of Brownian motion with continuous paths. Consequences of Separability The next theorem implies in effect that, if the finite-dimensional distributions of a process are such that it "should" have continuous paths, then it will in fact have continuous paths if it is separable. Example 38.4 illustrates this. The same thing holds for properties other than continuity.
SECTION 38. NONDENUMBERABLE PROBABILITIES 533 Let RT be the set of functions on Τ=[0,<χ>) with values that are ordinary reals or else oo or -oo. Thus RT is an enlargement of the RT of Section 36, an enlargement necessary because separability sometimes forces infinite values. Define the function Z, on RT by Z,(x) =Z(t, x) =x(t). This is just an extension of the coordinate function (36.8). Let &T be the σ-field in RT generated by the Z„ t > 0. _ _ Suppose that A is a subset of RT, not necessarily in &T. For DcF=[0, oo), let AD consist of those elements χ of RT that agree on D with some element у of A: (38.15) AD= (J Π [хе/гг:х(*)=У(0]· Of course, A <^AD. Let 5D denote the set of χ in RT that are separable with respect to D. In the following theorem, [X,: t > 0] and [X't: t > 0] are processes on spaces Ш, &, P) and (Ω', &', P'\ which may be distinct; the path functions are Z(-,oj)and Χ'(-,ω'). Theorem 38.2. Suppose of A that for each countable, dense subset D of Γ = [Ο,οο), the set (38.15) satisfies (38.16) Ad<e&t, ADnSD<zA. If [X,: t > 0] and [X't: t > 0] have the same finite-dimensional distributions, if [ω: Χ(·,ω) ^A] lies in У and has P-measure 1, and if [X't: t > 0] is separable, then [ω: Χ'(-,ω') еЛ] contains an S*~'-set of P'-measure 1. If (ίϊ, ,Φ~', Ρ') is complete, then of course [ω': Χ'(-,ω') еЛ] is itself an ^"'-set of f"-measure 1. Proof. Suppose that [X[: t > 0] is separable with respect to D. The difference [ω': Ζ'(-,ω') eAD] - [ω': Χ'(-,ω') f=A] is by (38.16) a subset of [ω': Χ'(-,ω') ейт- 5D], which is contained in an ^'-set of Л" of P-measure 0. Since the two processes have the same finite-dimensional distributions and hence induce the same distribution on (RT, &T), and since AD lies in Шт, it follows that Ρ'[ω': Χ'(-,ω') (EAD] = Ρ[ω: Χ(·, ω) ^AD]> Ρ[ω: Χ(-,ω)(ΞΑ] = 1. Thus the subset [ω': Χ'(-,ω') ^AD] ~Ν' of [ω': Χ'(-,ω') ^Α] lies in &' and has f'-measure 1. ■ Example 38.9. Consider the set С of finite-valued, continuous functions on T. If χ e 5D and yeC, and if χ and у agree on a dense D, then χ and у agree everywhere: x = y. Therefore, CD Π 5D с С. Further, CD= Π U C\[xeRT:\x{s)\«»Ax{i)\<«>>\x{s)-x{t)\<e\, €,t δ S
534 STOCHASTIC PROCESSES where e and δ range over the positive rationale, t ranges over D, and the inner intersection extends over the s in D satisfying \s —1\<8. Hence CD<^MT. Thus С satisfies the condition (38.16). Theorem 38.2 now implies that if a process has continuous paths with probability 1, then any separable process having the same finite-dimensional distributions has continuous paths outside a set of probability 0. In particular, a Brownian motion with continuous paths was constructed in the preceding section, and so any separable process with the finite-dimensional distributions of Brownia? motion has continuous paths outside a set of probability 0. The argument in Example 38.4 now becomes supererogatory. ■ Example 38.10. There is a somewhat similar argument for the step functions of the Poisson process. Let Z+ be the set of nonnegative integers; let Ε consist of the nonciecreasing functions χ in RT such that x(t)eZ+ for all t and such that for every η e Z+ there exists a nonempty interval / such that x(t) = η for t e /. Then ED= f\[x:x{t)^Z + }n Π [x:x(s)zx(t)] fGD s,t<=.D,s<i CO η П U П [х:хО) = п], η = 0 / iefin/ where / ranges over the open intervals with rational endpoints. Thus Εq g &?. Clearly, £Dn5DcE, and so Theorem 38.2 applies. In Section 23 was constructed a Poisson process with paths in E, and therefore any separable process with the same finite-dimensional distributions will have paths in Ε except for a set of probability 0. ■ Example 38.11. For Ε as in Example 38.10, let E0 consist of the elements of Ε that are right-continuous; a function in Ε need not lie in EQ, although at each t it must be continuous from one side or the other. The Poisson process as defined in Section 23 by N, = max[«: Sn<t] (see (23.5)) has paths in E0. But if Nl = max[n: Sn < t], then [Nf: t ^ 0] is separable and has the same finite-dimensional distributions, but its paths are not in E0. Thus EQ does not satisfy the hypotheses of Theorem 38.2. Separability does not help distinguish between continuity from the right and continuity from the left. ■ Example 38.12. The class of sets A satisfying (38.16) is closed under the formation of countable unions and intersections but is not closed under complementation. Define Xt and Yt as in Example 38.1, and let С be the set of continuous paths. Then [Yt: t > 0] and [X,: t^0] have the same finite-dimensional distributions, and the latter is separable; Υ(·,ω) is in RT— С for each ω, and Χ(·,ω) is in RT-C for no ω. ■ Example 38.13. As a final example, consider the set J of functions with discontinuities of at most the first kind: χ is in J if it is finite-valued, if x(t + ) = limJ,, x(s) exists (finite) for f> 0 and x(t - ) = limJT, s(s) exists (finite) for t> 0, and if *u)lies between x(t +) and x(t — ) for t > 0. Continuous and right-continuous functions are special cases.
SECTION 38. NONDENUMBERABLE PROBABILITIES 535 Let V denote the general system (38.17) V: k;ri,...,rk;sl,...,sk;al,...,ak, where к is an integer, where the /-,·, sh and a(. are rational, and where 0 = /-, <i, <r2 <s2 < · <rk<sk. Define к J(D,V,e)= f) [*:a,<*(0<«,+e, te(ritS;)nD] к П f) [jc:min{a,_,,a,.} <x(t) < max{a,_ ,, α,.} + e, /1 ($,._,, r,) nfl] i = 2 Let %m k & be the class of systems (38.17) that have a fixed value for к and satisfy r,· — Sj_, < δ, ί = 2,..., fc, and sk > m. It will be shown that CO CO (38.18) ]υ- Π Π U Π U HD,V,e), where e and δ range over the positive rationals. From this it will follow that JD e &T. It will also be shown that JD(~)SD<zJ, so that J satisfies the hypothesis of Theorem 38.2. Suppose that ye/. For fixed e, let Η be the set of nonnegative h for which there exist finitely many points г, such that 0 = t0 <r, < · · · <tr = h and \y(t) -y(t')\ <e for t and г' in the same interval (ί(·_,,ί,). If hneH and hn Τ h, then from the existence of y{h —) follows h e #. Hence Я is closed. If /i e H, from the existence of y(h + ) it follov/s that Я contains points to the right of h. Therefore, Η = [0,°°). From this it follows that the right side of (38.18) contains JD. Suppose that χ is a member of the right side of (38.18). It is not hard to deduce that for each t the limits (38.19) lim x(s), lim x(s) slt,s£D s\t,s£D exist and that x(t) lies between them if teD. For t eD take y(t) = x(t), and for iiD take y(t) to be the first limit in (38.19). Then ye7 and hence xeJD. This argument also shows that JD Π SD с J. и
Appendix Gathered here for easy reference are certain definitions and results from set theory and real analysis required in the text. Although there are many newer books, Hausdorff (the early sections) on set theory and Hardy on analysis are still excellent for the general background assumed here. Set Theory AL The empty set is denoted by 0. Sets are variable subsets of some space that is fixed in any one definition, argument, or discussion; this space is denoted either generically by Ω or by some special symbol (such as Rk for Euclidean fc-space). A singleton is a set consisting of just one point or element. That A is a subset of В is expressed by А с В. In accordance with standard usage, А с В does not preclude A = B; A is a proper subset of В if А с В and Α Φ Β. The complement of A is always relative to the overall space Ω; it consists of the points of Ω not contained in A and is denoted by Ac. The difference between A and B, denoted by A - B, is Α Π Bc; here В need not be contained in A, and if it is, then A -B is a proper difference. The symmetric difference A*B = (A Г\Вс)и(АсПВ) consists of the points that lie in one of the sets A and В but not in both. Classes of sets are denoted by script letters. The power set of Ω is the class of all subsets of Ω; it is denoted 2Ω. A2. The set of ω that lie in A and satisfy a given property ρ(ω) is denoted [ω eA: ρ(ω)]; if A = Ω, this is usually shortened to [ω: ρ(ω)]. A3. In this book, to say that a collection [Αβ: θ e Θ] is disjoint always means that it is pairwise disjoint: ΑθΓ\Αθ' = 0 if θ and Θ' are distinct elements of the index set Θ. To say that A meets B, or that В meets A, is to say that they are not disjoint: Α Π Β Φ 0. The collection [Αθ: θ e Θ] covers β if β с (J ΘΑΘ. The collection is a decomposition or partition of В if it is disjoint and В = U βΑβ. A4. By An1 A is meant Al<zA2<z ■■■ and A = \JnAn; by Anl A is meant Л,г>/12г> ··· and A = C\„An. AS. The indicator, or indicator function, of a set A is the function on Ω that assumes the value 1 on A and 0 on Ac; it is denoted lA. The alternative term "characteristic function" is reserved for the Fourier transform (see Section 26). 536
APPENDIX 537 A6. De Morgan's laws are (\J вАв)с = Π eA% and (Г\вАвУ = \JeAce. These and the other facts of basic set theory are assumed known: a countable union of countable sets is countable, and so on. A7. If Τ: Ω -»Ω' is a mapping of Ω into Ω' and A' is a set in Ω', the inverse image of A' is T~lA' = [ω e Ω: Τω еЛ']. It is easily checked that each of these statements is equivalent to the next: иеП-Г^иб Τ' ιΆ, Τω <£Α', Τω е Ω' -Α', ω е Γ" '(Ω' - A'). Therefore, Ω - Τ' ιΑ' = Τ ι(Π' - A'). Simple considerations of this kind show that ΌβΤ-ιΑ'β = Τι(ΌβΑ'β) and П вТ~1Л'в = Г'ЧПеЛ'Д and that A'nB' = 0 implies T~ lA' Π Τ'1 В' = 0 (the reverse implication is false unless ΓΩ = Ω')· If / maps Ω into another space, /(ω) is the value of the function / at an unspecified value of the argument ω. The function / itself (the rule defining the mapping) is sometimes denoted /(·). This is especially convenient for a function /(ω, t) of two arguments: For each fixed t, f{-,t) denotes the function on Ω with value /(ω, г) at ω. A8. The axiom of choice. Suppose that [Αβ: θ e Θ] is a decomposition of Ω into nonempty sets. The axiom of choice says that there exists a set (at least one set) С that contains exactly one point from each Ав: СГ\Ав is a singleton for each θ in Θ. The existence of such sets С is assumed in "everyday" mathematics, and the axiom of choice may even seem to be simply true. A careful treatment of set theory, however, is based on an explicit list of such axioms and a study of the relationships between them; see Halmos 2 or Dudley. A few of the problems require Zorn's lemma, which is equivalent to the axiom of choice; see Dudley or Kaplansky. The Real Line A9. The real line is denoted by Rl; χ ν у = maxU, у} and χ л у = min{*, у}. For real χ, [χ\ is the integer part of x, and sgn χ is +1, 0, or -1 as χ is positive, 0, or negative. It is convenient to be explicit about open, closed, and half-open intervals: (a,b) = [x: a <x <b], [a,b] = [x: a <x <b], (a,b] = [x: a <x <b], [a,b) = [*: a <x <b]. A10. Of course xn ->* means limn xn =x; xn t χ means xl <x2 <■■ ■ and i„->i; xn I x means x^ > x2 > ■ ■' and xn -»*. A sequence {*„} is bounded if and only if every subsequence {*„,) contains a further subsequence {хПк .} that converges to some x: lim; *„ =x. If {*„} is not bounded, then for each к there is an nk for which \xn \> k; no subsequence of {xn } can converge. The implication in the other direction is a simple consequence of the fact that every bounded sequence contains a convergent subsequence. // {*„} is bounded, and if each subsequence that converges at all converges to x, then lim„ xn =x. If xn does not converge to x, then \хПк — x\ > e for some positive e and some increasing sequence {nk} of integers; some subsequence of {xn } converges, but the limit cannot be x. All. A set G is defined as open if for each χ in G there is an open interval / such that χ e / с G. A set F is defined as closed if Fc is open. The interior of A, denoted
538 APPENDIX A°, consists of the χ in A for which there exists an open interval / such that χ s/<zA. The closure of A, denoted A', consists of the χ for which there exists a sequence {*„} in A with xn -»x. The boundary of A is dA =A~-A°. The basic facts of real analysis are assumed known: A is open if and only if A =A°; A is closed if and only if A =A "; A is closed if and only if it contains all limits of sequences in it; χ lies in dA if and only if there is a sequence {*„} in A and a sequence {y„} in Ac such that x„->x and yn -»*; and so on. A12. Every open set G on the line is a countable, disjoint union of open intervals. To see this, define points χ and у of G to be equivalent if χ <y and [i,y]cG or y<* and [j,j]cG. This is an equivalence relation. Each equivalence ciass is an interval, and since G is open, each is in fact an open interval. Thus G is a disjoint union of open (nonempty) intervals, and there can be only countably many of them, since each contains a rational. A13. The simplest form of ihe Heme-Borel theorem says that if [a,b]<z \J^=\(ak,bk), then [a,b]c \j^ = l(ak,bk] for some n. A set A is defined to be compact if each cover of it by open sets has a finite subcover—that is, if [Ge: θ e Θ] covers A and each Ge is open, then some finite subcollection {Ge ,...,Ge} covers A. Equivalent to the Heine-Borel theorem is the assertion that a bounded, closed set is compact. Also equivalent is the assertion that every bounded sequence of real numbers has a convergent subsequence. A14. The diagonal method. From this last fact follows one of the basic principles of analysis. Theorem. Suppose that each row of the array *1,1 *1,2 *1,3 (1) *2.1 *2,2 *2,3 is a bounded sequence of real numbers. Then there exists an increasing sequence nltn2,... of integers such that the limit limA * exists for r= 1,2,... . Proof. From the first row, select a convergent subsequence (2) *1,Π|.|'*1,π, 2' *1.»»| ,>···' here {/г, k} is an increasing sequence of integers and limA xl n exists. Look next at the second row of (1) along the sequence nl l,nl 2,··-'- (3) *2 π,.,»*2,π, 2> *2,Π| ,'··· · As a subsequence of the second row of (1), (3) is bounded. Select from it a convergent subsequence X2,n2 ,' *2,n2 2' Х2.пг ι'' ''' here {«2 *} is an increasing sequence of integers, a subsequence of {nl k}, and lim**2,n'2i exists.
APPENDIX 539 Continue inductively in the same way. This gives an array "l.i "1.2 "i,3 ··· (4) "2,1 "2,2 "2,3 ··· with three properties: (i) Each row of (4) is an increasing sequence of integers, (ii) The rth row is a subsequence of the (r - l)st. (iii) For each r, linn xr „ exists. Thus \ ) Xr,nr |> xr.n, 2' Xr,n, ,' · is a convergent subsequence of the rth row of (1). Put nk = nk k. Since each row of (4) is increasing and is contained in the preceding row, n,, n2, n3, ... is an increasing sequence of integers. Furthermore, nr,nr+l,nr + 2,... is a subsequence of the rth row of (4). Thus *,_„_,* , xr n^ ,... is a subsequence of (5) and is therefore convergent. Thus \\mk xr „ does exist. ' U Since {nk} is the diagonal of the array (4), application of this theorem is called the diagonal method. A15. The set A is by definition dense in the set В if for each χ in В and each open interval J containing x, J meets A. This is the same thing as requiring Β <ζΑ ~. The set Ε is by definition nowhere dense if each open interval / contains some open interval J that does not meet E. This makes sense: It / contains an open interval J that does not meet E, then Ε is not dense in /; the definition requires that Ε be dense in no interval /. A set A is defined to be perfect if it is closed and for each χ in A and positive e, there is а у in A such that 0 <|*-у|<е. An equivalent requirement is that A be closed and for each χ in A there exist a sequence {*„} in A such that xn Φ χ and xn -*x. The Cantor set is uncountable, nowhere dense, and perfect. A set that is nowhere dense is in a sense small. If A is a countable union of sets each of which -is nowhere dense, then A is said to be of the first category. This is a weaker notion of smallness. A set that is not of the first category is said to be of the second category. Euclidean fc-Space A16. Euclidean space of dimension к is denoted Rk. Points (ah...,ak) and (bv...,bk) determine open, closed, and half-open rectangles in Rk: [x: ai<x!<bt,i= l,...,fc], [x: ai<xi<bi,i=l,...,k], [x: a,<xl<bl-,i= l,...,fc]. A rectangle (without a qualifying adjective) is in this book a set of this last form. The Euclidean distance (E,-=i(*,--}v)2)1/2 between x = (xu...,xk) and y = (Уъ---,Ук^ 's denoted by U-yl. A17. All the concepts in All carry over to Rk: simply take the / there to be an open rectangle in Rk. The definition of compact set also carries over word for word, and the Heine-Borel theorem in Rk says that a closed, bounded set is compact.
540 APPENDIX Analysis A18. The standard Landau notation is used. Suppose that {xn) and {yn) are real sequences and yn > 0. Then xn = 0(yn) means xn/yn is bounded; xn = o(yn) means хп/Уп->Ь *п~Уп means Хп/Уп^и хп~Уп means х„/У„ and у„Д„ are both bounded. To write xn = zn + 0(yn), for example, means that xn = z„+un for some (u„) satisfying un = 0(yn)—that is, that j„ -z„ = 0(yn). A19. /1 difference equation. Suppose that a and b are integers and a < b. Suppose that xn is defined for a < η < b and satisfies (6) *„=P*n+i+<7*,,-i fora<n<b, where ρ and q are positive and p+q= 1. The general solution of this difference equation has the form χ =(A+B(q/p)" toa<n<b \fp*q, \A+Bn for a <n <b iip = q. That (7) always solves (6) is easily checked. Suppose the values xn and xn are given, where a < nl <n2<b. If ρ Φ q, the system А+В(д/р)щ=хП1, A+B(q/p)"2=x„2 can always be solved for A and B. If ρ = q, the system A+Bnx=xni, А+Вп2 = хПг can always be solved. Take nl = a and n2 = a + 1; the corresponding A and β satisfy (7) for η = a and for η = a + 1, ar.d it follows by induction that (7) holds for all n. Thus any solution of (6) can indeed be put in the form (7). Furthermore, the equation (6) and any pair of values xn and xn (nx Φ η2) suffice to determine all the xn. If xn is defined for a < η < °° and satisfies (6) for a < η < <», then there are constants A and В such that (7) holds for a < η < °°. A20. Cauchy's equation. Theorem. Let f be a real function on (0, °°), and suppose that f satisfies Cauchy's equation: fix +y)=f(x) +f(y) for x, у > 0. // there is some internal on which f is bounded above, then fix) = xf(l) for x>0. Proof. The problem is to prove that g(x)=f(x) -xfW vanishes identically. Clearly, g(l) = 0, and g satisfies Cauchy's equation and on some interval is bounded above. By induction, g(nx) = ng(x); hence ng(m/n) = g(m) = mg(\) = 0, so that g(r) = 0 for positive rational r. Suppose that g(xo)φ0 for some *0. If g(*0)<0, then g(r0—xc)= — g(xo)>0 for rational г0>л:0. It is thus no restriction to assume that g(x0) > 0. Let / be an open interval in which g is bounded above. Given a number M, choose η so that ng(x0) > M, and then choose a rational r so that nxQ + r lies in /. If r>0, then g(r + nx0) = g(r) + g(nx0) = g(nxQ) = ng(x0). If r<0, then ng(x0)=g(nx0) = g((-r) + (nx0 + r)) = g(-r) + g(nx0 + r)=g(nx0 + r). In either
APPENDIX 541 case, g(nx0) + r) = ng(x0); of course this is trivial if r=0. Since g(nxQ + r) = ng(x0) > Μ and Μ was arbitrary, g is not bounded above in /, a contradiction. ■ Obviously, the same proof works if / is bounded below in some interval. Corollary. Let U be a real function on (0, °°) and suppose that U(x +y) = U(x)U(y) for x, у > 0. Suppose further that there is some interval on which U is bounded above. Then either U(x) = 0 for x > 0, or else there is an A such that U(x) = eAx for x>0. Proof. Since U(x) = U2(x/2\ U is nonnegative. If U(x) = 0, then U(x/2n) = 0 and so U vanishes at points arbitrarily near 0. If U vanishes at a point, it must by the functional equation vanish everywhere to the right of that point. Hence U is identically 0 or else everywhere positive. In the latter case, the theorem applies to f(x) = \ogU(x\ this function being bounded above in some interval, and so fix) = Ax for A = log i/(l). И A21. A number-theoretic fact. Theorem. Suppose that Μ is a set of positive integers closed under addition and that Μ has greatest common divisor 1. Then Μ contains all integers exceeding some n0. Proof. Let Mx consist of all the integers m, —m, and m—m' with m and m' in M. Then Mx is closed under addition and subtraction (it is a subgroup of the group of integers). Let d be the smallest positive element of Mx. If η e Mx, write η =qd + r, where 0 <r <d. Since r = n-qd lies in Mx, r must actually be 0. Thus Mx consists of the multiples of d. Since d divides all the integers in Mx and hence all the integers in M, and since Μ has greatest common divisor 1, d=\. Thus Mx contains all the integers. Write 1 = m — m' with m and m' in Μ (if 1 itself is in M, the proof is easy), and take nQ = (m + m')2. Given η > nQ, write η = q(m + m') + r, where 0 < r < m + m'. From η > n0 > (r + lXm + m') follows q = (n - r)/(m + m') > r. But η = q(m+m') + r(m - m')=(q+r)m+(q-r)m', and since q+r>q-r>0, n lies in M. U A22. One- and two-sided derivatives. Theorem. Suppose that f and g are continuous on [0, <») and g is the right-hand derivative of f on (0,oo); f+(t) =g(t) for t > 0. Then /+(0) = g(0) as well, and g is the two-sided derivative of f on (0, <»). Proof. It suffices to show that F(t) =f(t) -/(0) - /0'g(i)rfi vanishes for t > 0. By assumption, F is continuous on [0,°°) and F+(t)-0 for t > 0. Suppose that F(t0)>F(tx), where 0<го<г,. Then G(t) = F(t)-(t-t0XF(tx)-F(t0))/(tx-t0) is continuous on [0,<»)5 G(t0) = G(tx), and G + (t) > 0 on (0,oo). But then the maximum of G over Uq,^] must occur at some interior point; since G + <0 at a local maximum, this is impossible. Similarly F(t0) <F(tx) is impossible. Thus F is constant over (0,oo) and by continuity is constant over [0, °°). Since F(0) = 0, F vanishes on [0, °o). ■ A23. A differential equation. The equation f'(t)=Af(t) +g(t) (t > 0; g continuous) has the particular solution f0(t) = eA'jOg(s)e~Asds; for an arbitrary solution /, (/(г)-/0(О)е_/" has derivative 0 and hence equals /(0) identically. All solutions thus have the form f(t) = eA'[f(0) + flg(s)e-Asds].
542 APPENDIX A24. A trigonometric identity. If ζ Φ 1 and ζ Φ 0, then ' 2/ J _z2/+i z-/_z/+i Σ zk=z-'Zzk-z-' k=-l k = 0 \-Z 1-2 and hence 1-2 l-2-w+l-2w _ ( l-z~m 1-2' — 2- m-1 / m-1 / /+1 Σ Σ ζ<= Σ^τ^ /=0 k = -I 1=0 7"i/2 _ _-m/2 1-2-1 1-г (l-2)(l-2-1) (2l/2_2-l/2)2 " Take 2 = е,дг. If * is not an integral multiple of 2π, then /=o *=-/ (sin 5*) If χ = 2πη, the left-hand side here is m2, which is the limit of the right-hand side as χ -»2vn. Infinite Series A25. Nonnegative series. Suppose xux2,... are nonnegative. If Ε is a finite set of integers, then £c{l,2,...,n) for some я, so that by nonnegativity Y.k^Exk < E"k=]xk. The set of partial sums Σ£= \xk thus has the same supremum as the larger set of sums Σ* <= Exk (^ finite). Therefore, the nonnegative series T^l-lxk converges if and only if the sums EksExk for finite Ε are bounded, in which case the sum is the supremum: Ц< = 1Хк =SUPE*~k(EExk· A26. Dirichlet's theorem. Since the supremum in A25 is invariant under permutations, so is Σ"=ι*£.' If the xk are nonnegative and yk=Xf(k) for some one-to-one map / of the positive integers onto themselves, then ΣΑ*Α and £kyk diverge or converge together and in the latter case have the same sum. A27. Double series. Suppose that xi}, i,j= 1,2,..., are nonnegative. The ith row gives a series Σ;*,7, and if each of these converges, one can form the series Σ,·Σ;*,7· Let the terms xi; be arranged in some order as a single infinite series Σ,7*,7; by Dirichlet's theorem, the sum is the same whatever order is used. Suppose each Σ;*,7 converges and Σ,·Σ;*,7 converges. If Ε is a finite set of the pairs (i,j), there is an η for which Σ(ί,;)ε£*,7 < £/<;„£JS„*l7 £ LiunL.xu < Σ/Σ;*,;; hence Σ,·;*,7 converges and has sum at most Σ,·Σ;*,7·. On the other hand, if Σ,7*,7 converges, then Σ,·5„Σ;£Πί/7 < Σ,7*,·;; letting η -»°° and then m -»<» shows that each Σ;*;; converges and that Σ,·Σ;*,7 < Σ,·;*,7·. Therefore, in the nonnegative case, Σ,·;*,·, converges if and only if the Σ;*,7 all converge and Σ,·Σ;*,7 converges, in which case 1^цХ^ = 2*,·2*.χ,7. By symmetry, Σ,7*,7 = Σ,Σ,*,7·. Thus the order of summation can be reversed in a nonnegative double series: Σ,·Σ;*,7= Σ;Σ,·*,7. A28. The Weierstrass M-test.
APPENDIX 543 Theorem. Suppose that lim„ xnk = xk for each к and that \xnk\<Mk, where Y.kMk < a». Then Y.kxk and all the t.kxnk converge, and limn Y.kxnk = Y.kxk. Proof. The series of course converge absolutely, since iLkMk<&>. Now fck*nk-'£kxk\<I.klikliUnk-xk\ + 2Zk>kliMk. Given e, choose k0 so that Σ*>* Mk <e/3, and then choose n0 so that n>nQ implies \xnk - xk\ <e/3k0 for к <k0'. Then η >n0 implies \Lkxnk ~Lkxk\<e. Ш A29. Power series. The principal fact needed is this: If fix) = Т^=0акхк converges in the range |*| <r, then it is differentiable there and (8) For a simple proof, choose r0 and rx so that |*| <r0<r{ <r. If \h\ <r0 — \x\, so that \x ± h\ < r0, then the mean-value theorem gives (here 0 < Qh < 1) (^) (* + *)-*' kx k-l = \k(x + ehh) Л- 1 ι k-l kxk-l\<2krS k-\ Since 2кгц~1/гк goes to 0, it is bounded by some M, and if Mk = \ak\-Mrk, then Y.kMk < oo and \ak\ times the left member of (9) is at most Mk for \h\ < r0 -\x\. By the Λί-test [A28] (applied with h -» 0 instead of η -»oo), . Д (x+hf-xk Д . , A^O * = 0 * = 0 Hence (8). Repeated application of (8) gives fU\x)= L*(*"l) ···(*-/+!)«** *-; *=; For χ = 0, this is aj=fuK0)/jl, the formula for the coefficients in a Taylor series. This shows in particular that the values of fix) for |*| <r determine the coefficients ak. A30. Cesaro averages. If xn -*x, then n~lYLnk=\xk -*x. To prove this, let Μ bound \xk\, and given e, choose fc0 so that \x— xk\<e/2 for k>kQ. If n>kQ and и > 4fc0M/e, then *d-l -^Σ**4 Σ2Μ+Ι Σ |< k=l k = l η — 2 A31. Dyadic expansions. Define a mapping Γ of the unit interval Ω = (0,1] into itself by Τω 2ω if 0 <ω < 5, 2ω - 1 if ч <ω < 1.
544 APPENDIX Define a function dl on Ω by /θ ίί0<ω<5, and let d;M = d{(T'~ [ω). Then (1°) L -,· <<■>< L -,· + τ7? i-l 2 i=i 2 for all ω e ω and я > 1. To verify this for η = 1, check the cases ω < 5 and ω > 5 separately. Suppose that (10) holds for a particular η and for all ω. Replace ω by Τω in (10) and use the fact that di(T<u) = di + l(a>); separate consideration of the cases ω < j and ω > 5 now shows that (10) holds with η + 1 in place of η Thus (10) holds for all η and ω, and it follows that ω = Σ7=ι^/ω)/2'. This gives the dyadic representation of ω. If dl(w) = 0 for all i>n, then ω = Σ,"=1ίί,(ω)/2', which contradicts the left-hand inequality in (10). Thus the expansion does not terminate in 0's. Convex Functions A32. A function φ on an open interval / (bounded or unbounded) is convex if (11) <p(tx + (l-t)y)<t<p(x)+(l-t)<p(y) for x,y e I and 0 <t < 1. From this it follows by induction that (12) Ψ Ep,*,U Σρ,<ρ(ϊ,·) if the *, lie in / and the p; are nonnegative and add to 1. If φ has a continuous, nondecreasing derivative φ' on /, then φ is convex. Indeed, if a <b <c, the average of ψ' over (a,b) is at most the average over (b,c): Щ^1 - ά/vw * * "<ч - Л/>) * у(с)-у(Ь) с-Ь The inequality between the extreme terms here reduces to (13) (c-a)<p(b) < (c-b)<p(a) + (b-a)<p(c), which is (11) with χ = а, у = с, t = (с - b)/(c - a). A33. Geometrically, (11) means that the point В in Figure (i) lies on or below the chord AC. But then slope AB < slope AC; algebraically, this is (ip(W - cp(a))/(b - a) < (<p(c) - <p(a))/(c - a), which is the same as (13). As В moves to A from the right,
APPENDIX 545 (О (ii) (ιϋ) (iv) slope AB is thus nonincreasing and hence has a limit. In other words, φ has a right-hand derivative φ+. Figure (ii) shows that slope DE < slope EF < slope FG. Let Ε move to D from the right and let С move to F from the right: The right-hand derivative at D is at most that at F, and φ + is nondecreasing. Since the slope of XY in Figure (iii) is at least as great as the right-hand derivative at X, the curve to the right of X lies on or above the line through X with slope φ + (χϊ· (14) <р(у)><р(*) + (у-л:)<Р + (х), у>х. Figure (iii) also makes it clear that φ is right-continuous. Similarly, φ has a nondecreasing left-hand derivative <p~ and is left-continuous. Since slope AB < slope ВС in Figure (i), φ~(ό) <φ^ (b). Since clearly <p+(b)<°° and -oo<<p-(fo), φ+ and φ~ are finite. Finally, (14) and its right-sided analogue show that the curve lies on or above each line through Ζ in Figure (iv) having slope between φ~(ζ) and ψ + (ζ): (15) <p(x) >ψ(ζ) +m(x-z), φ (ζ) <ΤΠ <φ + (ζ)· This is a support line. Some Multivariable Calculus A34. Suppose that U is an open set in Rk and T: U^>Rk is continuously differen- tiable; let Dx = [tu(x)] and J(x) = det Dx be the Jacobian matrix and determinant, as in Theorem 17.2. Let Q~ be a closed rectangle in U. Theorem. // \tu(x') -i,;(*)l <a forx, x' e Q~ and all i,j, then (16) \Tx'-Tx-Dx(x'-x)\<k2a\x'-x\, x,x'eQ- Before proceeding to the proof, note that, since the t^ are continuous, a can be taken to go to 0 as β contracts to the point x. In other words, (16) implies (17) lim \Tx'-Tx-Dx(x'-x)\ \x'-x\ = 0. This shows that Dx acts as a multivariable derivative. Suppose on the other hand that (17) holds at χ for an initially unspecified matrix Dx. Take x) = x; + n and A = xi f°r / Φ), and let h go to 0. It follows that the entires of Dx must be the partial derivatives tjjix): If (17) holds, then Dx must be the Jacobian matrix.
546 APPENDIX (v) (vi) Proof of (16). For j = 0,l,...,k, let z1 agree with x' in the first ;' places and with χ in the last к -j places. Then z°=x, zk=x', and \zJ -z'~l\ = |(z;-z;_1);| = \x'j — xj\ (Figure (v)). By the mean-value theorem in one dimension, there is a point wu on the segment from z'~l to z' such that tt(z}) -ti(zi~l)=tij(w'iKzi-zi~l)j. Since (D,(z>-z^1))i.= i:i,./(,)(z^-^-1)/ = i,v(x)(z>-z>-1);, / it follows that \Tx'-Tx-Dx(x'-x)\ *L\t,iz>)-t,{z>-*)-{Dx{z>-z>-%\ и = L\^ii)-t!j(x)\-\(^-z^)]\ ϋ < Y,a\x'j-Xj\<k2a\x'-x\. Ш о A35, The multwariable inverse-function theorem. Let *0 be a point of the open set U. Theorem. If Κχ0)Φθ, then there are open sets X and Y, containing xQ and y0 = TxQ, respectively, such that Τ is a one-to-one map from X onto Y= TX; further, T~l: Y-*Xis continuously differentiable, and the Jacobian matrix of T~l at у is D^-\y. This is a local theorem. It is not assumed, as in Theorem 17.2, that Τ is one-to-one on U and J(x) never vanishes; but under those additional conditions, TU is open and the inverse point mapping is continuously differentiable. To understand the role of the condition Кх0)Ф 0, consider the case where к = 1, *0 = 0, and Tx is x2 or *3. Proof. Let β be a rectangle such that *0e6°ci2_c£/ and 7(*)*0 for *e(2~. As (x,u) ranges over the compact set Q~X[u: \u\ = 1], \Dxu\ is bounded below by some positive β: (18) |Dxu|>/3|u| if x<EQ~,u<ERk.
APPENDIX 547 Making Q smaller will ensure that UtJ(x) -ti}{x')\ </3/2fc2 for all χ and x' in Q and all /,;'. Then (16) and (18) give, for x, x' e Q~, \Tx' - Tx\>\Dx(x' - x)\-\Tx' - Tx - Dx(x' - x)\ >\θχ(χ'-χ)\-{β\χ'-χ\>\β\χ'-χΙ Thus (19) \x' -x\<hTx'-Tx\ forx,x'<=Q-. This shows that Τ is one-to-one on Q~. Since x0 does not lie in the compact set dQ, inf^^ll* - Tx0\ = d > 0. Let У be the open ball with center y0 = Tx0 and radius d/2 (Figure (vi)). Fix а у in У. The problem is to show that у = Tx for some χ in Q°, which means finding an χ such that φ(χ) = \γ — Tx\ = E;(y,- tj(x))2 vanishes. By compactness, the minimum of φ on Q~ is achieved there. If χ edg(and у e Y), then 2|y -y0\<d <\Tx -y0| < |7*- yl +ly -Уо1> so lhat Iу - 7*0| <|y- 7дс|. Therefore, <p(*0) <<p(*) for χ edg, and so the minimum occurs in Q° rather than on dQ. At the minimizing point, дср/dxj = -E,2(y; - i,(*))i;/*) = 0, and since Dx is nonsingular, it follows that у = Tx: Each у in У is the image under Τ of some point χ in (2°. By (19), this χ is unique (although it is possible that у =Tz for some ζ outside Q). Let X= Q°Π Г_1У. ТЬгп A' is open and Г is a one-to-one map of X onto Y. Now let T~] denote the inverse point transformation on У. By (19), T~l is continuous. To prove differentiability, consider in У a fixed point у and a variable point y' such that y' -»y and y' =£y. Let jc = Г"'у and x' = T~ly'\ then jc' is a function of y', x' -»*, and χ' =£ jc. Define d by Tx' -Tx = Dx(x' - χ) + υ; then υ is a function of jc' and hence of y', and M/| *'- * I-» 0 by (17). Apply D~l: D;x{Tx' -Tx) = x' -x + D;lu, or Г"У - T-xy=D;\y' -y) -D;1;;. By (18) and (19), \T-ly'-T'ly-Dx-\y'-y)\ _ \D;lu\ \χ'-χ\ \υ\/β I \y'-y\ U'-Jtl Iv'-yl ^ Ιχ'-χΙ β' The right side goes to 0 as y' -»y. By the remark following (17), the components of D~l must be the partial derivatives of the inverse mapping. T'1 has Jacobian matrix Dj-\y at y. The components of an inverse matrix vary continuous with the components of the original matrix (think for example of the inverse as specified by the cofactors), and so T~l is even continuously differentiable on У. ■ Continued Fractions A36. In designing a planetarium, Christian Huygens confronted this problem: Given the ratio χ of the periods of two planets, approximate it by the ratio of the periods of two linked gears. If one gear has ρ teeth and the other has q, then the ratio of their periods is p/q-, so that the problem is to approximate the real number χ by the rational p/q. Of course x, being empirical, is already rational, but the numerator and denominator may be so large that gears with those numbers of teeth are not practical: in the approximation p/q, both ρ and q must be of moderate size.
548 APPENDIX Since the gears play symmetrical roles, there is no more reason to approximate χ by r=p/q than there is to approximate l/x by l/r = q/p. Suppose, to be definite, that χ and r lie to the left of 1. Then (20) !*-/■ 1 X 1 r <f 1 X 1 r xr < - and the inequality is strict unless χ = r: If χ < 1, it is better to approximate l/x and then invert the approximation, since that will control both errors. For a numerical illustration, approximate χ = .127 by rounding it up to r = .13; the calculations (to three places) are (21) r-x= .13-.127 =.003, - - -=7.874-7.692 = .182. The second error is large. So instead, approximate \/x = 7.874 by rounding it down to l/r' = 7.87. Since 1/7.87 = .1271 (to four places), the calculations are (22) - - Д = 7.874 - 7.87 = .004, χ r r'-x = . 1271 -.127 =.0001. This time, both errors are small, and the error .0001 in the new approximation to χ is smaller than the corresponding .003 in (21). It is because χ lies to the left of 1 that inversion improves the accuracy; see (20). If this inversion method decreases the error, why not do another inversion in the middle, in finding a rational approximation to l/x? It makes no sense to invert l/x itself, since it lies to the right of 1 (and inversion merely leads back to χ anyway). But to approximate l/x= 7.874 is to approximate the fractional part .874, and here a second inversion will help, for the same reason the first one does. This suggests Huygens's iterative procedure. In modern notation, the scheme is this. For χ (rational or irrational) in (0,1), let Tx = {l/x) and β^*) = H/jcJ be the fractional and integral parts of l/x; and set TO = 0. This defines a mapping of [0,1) onto itself: (23) Tx- χ ! χ = --ai(x) if 0 <jc < 1, if x = 0. Then (24) fli(jt) + Tx if0<jc<l. What (20) says is that replacing Tx on the right in (24) by a good rational approximation to it gives an even better rational approximation to χ itself.
APPENDIX 549 To carry this further requires a convenient notation. For positive variables z, define the continued fractions 11^=7-' Ц*1 + Цъ = p. z' + Γ '2 + т: and so on. It is "typographically" clear that (25) ΐΓ^+···+ΐΡ^=ι/[2ι + №+'"+ΐΓ^)] and (26) \jrx+ ··· +ir^=lF7+ ··· +ι|ζ„_, + ια„-. For a formal theory, use (25) as a recursive definition and then prove (26) by induction (or vice versa). An infinite continued fraction is defined by provided the limit exists. A continued fraction is simple if the z, are positive integers. If Tn~lx>0, let ап(д;) = а1(Г""1д:); the an(x) are the partial quotients of *. If jc and Tx are both positive, then (24) applies to each of them: χ = 1]β,(*) + 7лг= lja,(jc) + ljfl2(jc) + T2jc. If none of the iterates д:, Tx,...,T"~lx vanishes, then it follows by induction (use (26)) that (27) x = lfi£x)+ ■■■ +l]an_l(x) + ljan(x) + T"x. This is an extension of (24), and the idea following (24) extends as well: a good rational approximation to Tnx in (27) gives a still better rational approximation to x. Even if T"x is approximated very crudely by 0, there results a sharp approximation (28) ""ΐ№)+ ···+!№} to χ itself. The right side here is the «th convergent to x, and it goes very rapidly to x; see Section 24. By the definition (23), χ and Tx are both rational or both irrational. For an irrational x, therefore, T"x remains forever among the irrationals, and (27) holds for all n. If χ is rational, on the other hand, T"x remains forever among the rationals, and in fact, as the following argument shows, Tnx eventually hits 0 and stays there.
550 APPENDIX Suppose that χ is a rational in (0,1): χ = dl/d0, where 0 < d{ < d0. If Tx > 0, then Tx = Ыц/ат) = d2/db where 0 <d2<dl because 0 < Tx < 1. (If dx/d0 is irreducible, so is d2/dl.) If T2x > 0, the argument can be repeated: (29) x = ^, Tx=^, T2x = -£, d0>dl>d2>d3>0. And so on. Since the dn decrease as long as the T"x remain positive, Tnx must vanish for some n, and then Tmx = 0 for m > n. If nx is the smallest integer for which Tnx = 0 (n r > 1 if x> 0), then by (27), (30) х=1ЩТ)+ ■■+l^JF). Thus each positive rational has a representation as a finite simple continued fraction. If 0<x <1 and Tx = 0, then 1>χ=\/αλ{χ), so that α,ί*)^. Applied to Tn'~lx, this shows that the a„(x) in (30) must be at least 2. Section 24 requires*a uniqueness result. Suppose that (31) x = \Ja[+ ··· +\\a„_x + \\an + t, where the a, are positive integers and (32) 0<*<1, 0<ί<1, a„ + t>l. The last condition rules out an = 1 and t = 0 (which in the case η = 1 is also ruled out by χ < 1). It follows from (31) and (32) that (33) al(x)=al,...,an(x)=an, T"x = t. The case η = 1 being easy, suppose the implication holds for η — 1, where η > 2. Since 0<1/(β +ί)<1, the induction hypothesis (use (26)) gives ak(x) = ak for k<n and Tn~[x= l/(an + t). Now apply the case η = 1 to T"~lx. (If an= 1 and i = 0, then aA(jc) = aA for к <n-2, an_l(x) = an_l + 1, and Гп_1д: = 0.) Consider now the infinite case. Assume that (34) x = \Ja[+\Ja'2+ ■■■, converges, where the a„ are positive integers. Then (35) an(x) = an, T"x = lJ^T+ lj^+ · ■ ■, n>\. To prove this, let η -»oo in (25): the continued fraction i=jja^ + jja^+ ··· converges and χ = l/(«i + i). It follows by induction (use (26)) that (36) lja7> JJaT+ ··· +lja^^jja7+jr^, « > 2. Hence 0 <x < 1, and the same must be true of t. Therefore, ax and t are the integer and fractional parts of \/x, which proves (35) for η = 1. Apply the same argument to
APPENDIX 551 Tx, and continue. The χ defined by (34) is irrational: otherwise, Tnx = 0 for some n, which contradicts (35) and (36). Thus the value of an infinite simple continued fraction uniquely determines the partial quotients. The same is almost true of finite simple continued fractions. Since (31) and (32) imply (33), it follows that if χ is given by (30), then any continued fraction of nx terms that represents χ must indeed match (30) term for term. But, for example, 1_[3 + jJ3= 1_[3 + lj4 + lfT. This is always possible: replace an(x) in (30) (where an {x) > 2) by an (*) - 1 + lfT. Apart from this ambiguity, the representation is unique—and the representation (30) that results from repeated application of Τ to a rational χ never ends with a partial quotient of l.f See Rockett & Szusz for more on continued fractions.
Notes on the Problems These notes consist of hints, solutions, and references to the literature. As a rule a solution is complete in proportion to the frequency with which it is needed for the solution of subsequent problems Section 1 1.1. (a) Each point of the discrete space lies in one of the four sets Α[Γ\Α2, А\Г\А2, А{Г)АС2, А\Г\АС2 and hence would have probability at most 2~2; continue. (b) If, for each i, B, is At or А% then B\ П · · · Π Βη has probability at most Π,^α - а,) £ехр[-£?.,«,■] 1.3. (b) Suppose A is trifling and let A ~ be its closure. Given e choose intervals (ak,bk\ k = \,...,n, such that Ac \Jnk = l(ak,bk] and Z"k = lbk-ak)<e/2. If xk = ak -e/2n, then A~c U k = l(xk,bk] and Y."k = x(bk - xk) <e. For the other parts of the problem, consider the set of rationals in (0,1). 1.4. (a) Cover Ar{i) by (r- 1)" intervals of length r~". (c) Go to the base rk. Identify the digits in the base r with the keys of the typewriter. The monkey is certain eventually to reproduce the eleventh edition of the Britannka and even, unhappily, the fifteenth. 1.5. (a) The set A3{\) is itself uncountable, since a point in it is specified by a sequence of 0's and 2's (excluding the countably many that end in 0's). (b) For sequences ul,...,un of 0's, l's, and 2's, let Mu u consist of the points in (0,1] whose nonterminating base-3 expansions start out with those digits. Then A3(l) = (0,1]-L)M„ u , where the union extends over η > 1 and the sequences щ,...,ип containing at least one 1. The set described in part (b) is [0,1] - U M° u , where the union is as before, and this is the closure of A3(l). From this representation of C, it is not hard to deduce that it can be defined as the set of points in [0,1] that can be written in base 3 without any l's if terminating expansions are also allowed. For example, С contains f = .1222 · · · = .2000... because it is possible to avoid 1 in the expansion. (c) Given e and an ω in C, choose ω' in A3(l) within e/2 of ω; now define ω" by changing from 2 to 0 some digit of ω' far enough out that ω" differs from ω by at most e/2. 552
NOTES ON THE PROBLEMS 553 1.7. The interchange of limit and integral is justified because the series Т.кгк{ш)2~ converges uniformly in ω (integration to the limit is studied systematically in Section 16). There is a direct derivation of (1.40): let n-*°° in sin t = 2"sin2""i · Π£=ι cos2"~*i, which follows by induction from the half-angle formula. 1.10. (a) Given m and a subinterval (a, b] of (0,1], choose a dyadic interval / in (a, b], and then choose in / a dyadic interval J of order η > m such that \n~ [sn(w)\ > \ for ω ej. This is possible because to specify J is to specify the first η dyadic digits of the points in J, choose the first digits in such a way that 7c / and take the following ones to be 1, with η so large that η ~ lsn(w) is near 1 for ω e J. (b) A countable union of sets of the first category is also of the first category; (0, l] = NUNC would be of the first category if Nc were. For Baire's theorem, see Royden, ρ 139. 1.11. (a) If x=p0/q0ibp/q, then lfV?-<?oPl J_ (c) The rational YTk = il/2a{k) has denominator 2a(n) and approximates χ to within 2/2α(π + 1>. Section 2 2.3. (b) Let Ω consist of four points, and let У consist of the empty set, Ω itself, and all six of the two-point sets. 2.4. (b) For example, take Ω to consist of the integers, and let 5^ be the σ-field generated by the singletons {k} with к <п. As a matter of fact, any example in which 5^ is a proper subclass of 5^J+, for all η will do, because it can be shown that in this case U n&~„ necessarily fails to be a σ-field; see A. Broughton and B. W. Huff: A comment on unions of sigma-fields, Amer. Math. Monthly, 84 (1977), 553-554. 2.5. (b) The class in question is certainly contained in fisnf) and is easily seen to be closer under the formation of finite intersections. But (Uili Π "'=ιΑηΥ = nr=iU"Li^;, and U"Li^;= U"LiMf.n Г\[\\А-1к] has the required form. 2.8. If <%? is the smallest class over srf closed under the formation of countable unions and intersections, clearly .Же а(з/). То prove the reverse inclusion, first show that the class of A such that Ac е.Ж is closed under the formation of countable unions and intersections and contains srf and hence contains J¥. 2.9. Note that U „B„ea(\J nsnfB). 2.10. (a) Show that the class of A for which ΙΑ(ω) = //ω') is a σ-field. See Example 4.8. 2.11. (b) Suppose that У is the σ-field of the countable and the cocountable sets in Ω. Suppose that У is countably generated and Ω is uncountable. Show that 3е
554 NOTES ON THE PROBLEMS is generated by a countable class of singletons; if Ω0 is the union of these, then У must consist of the sets В and BuO.c0 with йсП0, and these do not include the singletons in flc0, which is uncountable because Ω is. (c) Let 5?2 consist of the Borel sets in Ω = (0,1], and let JF, consist of the countable and the cocountable sets there. 2.12. Suppose that AX,A2,... is an infinite sequence of distinct sets in a σ-field У, and let £ consist of the nonempty sets of the form n"=i#„, where Bn =An or Bn=Acn, η = 1,2,... . Each An is the union of the ^sets it contains, and since the An are distinct, £ must be infinite. But there are uncountably many distinct • countable unions of ^sets, and they all lie in &. 2.18. For this and the subsequent problems on applications of probability theory to arithmetic, the only number theory required is the fundamental theorem of arithmetic and its immediate consequences. The other problems on stochastic arithmetic are 4.15, 4.16, 5.19, 5.20, 6.16, 18.17, 25.15, 30.9, 30.10, 30.11, and 30.12. See also Theorem 30.3. (b) Let A consist of the even integers, let Ck = [m: uk <m < vk+x), and let В consist of the even integers in C, U C3 U · · · together with the odd integers in C2UC4U · · ·; take uk to increase very rapidly with к and consider Α Π Β. (c) If с is the least common multiple of a and b, then Ma (~)Mb = Mc. From Ma e 3 conclude in succession that Ma Π Мь e 3, Ma Π · · · Π Ma Π Μ^ Π · · · ПМ6С( е 31, f(«4') с 3. By the same sequence of steps, show how D on J? determines D on f(~#). (d) If B, = Ma - U p<iMap, then a eB/ and (the inclusion-exclusion formula requires only finite additivity) ο(β,) = 1- Σ —+ Σ v " a *·", ар ^ , ара p<.l P<q<l Choose la so that, if Ca = B,, then D(Ca) <2~"~l. If D were a probability measure on f(~#), D(Q.) < \ would follow. See Problem 4.16 for a different approach. 2.19. (a) Apply the intermediate-value theorem to the function fix) = λ(Α η(0, *]). Note that this even proves part (c) for λ (under the assumption that λ exists). (b) If 0<P(B)<P(A), then either 0 <P(B) <{PiA) or 0<P(B-A) < \P(A). Continue. (c) If P(\JkHk)<x, choose С so that CcA-\JkHk and 0<P(C)<x- P(UkHk). If п~1<Р(С), then PO)k<nHk) + hn<P(Uk<nHk) + P(Hn) + P(C)<P(Uk<nHk) + hn. 2.21. (c) If-^-i were a σ-field, @)с J?n_x would follow. 2.22. Use the fact that, if aba2,... is a sequence of ordinals satisfying απ <Ω, then there exists an ordinal a such that α <Ω and an <a for all n.
NOTES ON THE PROBLEMS 555 2.23. Suppose that Bj e \Ja<aJ^, j= 1,2,... . Choose odd integers n} in such a way that Bj ^J^a(n) and the ns are all distinct; choose J^-sets such that */8.(»J)(C«.Ji'C«.jl"")=B>: for η not of the form n-p choose J^0-sets for which Φβ (п)(Сш ,Cm ,...) is 0 or (0,1] as η is odd or even. Then U°°=1P; = Ф^С^С,,..."). Similarly, Bc = Фв(С„С2,...) for J^-sets C„ ifjB e Up<ct^. The rest of the proof is essentially the same as before. Section 3 3.1. (a) The finite additivity of Ρ is used in the proof that S^c.^i and again (via monotonicity; see (2.5)) in the proof of (3.7). The countable additivity of Ρ is used (via countable subadditivity; see Theorem 2.1) in the proof of (3.7). (b) For a specific example consider U „(2-1 +n~l, 1] in connection with Problem 2.15. But an example is provided by every Ρ that is finitely but not countably additive: If Ρ is finitely additive on a field &~0 and Ar are disjoint Уд-sets whose union A also lies in ^Q, then monotonicity (which requires finite additivity only) gives Zk <nP(Ak) = P(\Jk^nAk) <P(A) and hence EkP(Ak) <P(A). Countable subadditivity will ensure that there is equality here. (c) The proof of (3.7) involves the countable subadditivity of Ρ on S*~0, which is only assumed to be a field (that being the whole point of the theorem). 3.2. (a) Given e, choose yo-sets A„ such that А с U „A„ and ΣΡ(Α„)<Ρ*(Α) + e; if В = U nAn, then А о В, BeF, and P(B) < P*(A) + e; hence the right side of (3.9) is at most P*(A) On the other hand, А с В and В е SF imply P*(A)<P*(B) = P(Bl Hence(3.9). UAcBk, Bk ε У, Р(Вк) <Р*(А)+ к\ and В = П kBk, then AoB, Be^y and Ρ*(A) = P(P). For (ЗЛО), argue by complementation. (b) Suppose that Р*(А) = Р*(А) and chose J^sets Al and A2 in such a way that /1, c/1 <zA2 and ?(/!,) = F(/12). Given £, choose an J^set В in such a way that £ с β and P*(E) = P(P). Then Р*Ы П £) + Р*ЫС П £) < РЫ2 Г\В)+ Р(А\ Π £). Now use (2.7) to bound the last sum by P(B) + P(A2 -Al) = P*(E). 3.3. First note the general fact that P* agrees with Рол F0 if and only if Ρ is countably, additive there, a condition not satisfied in parts (b) and (e). Using Problem 3.2 simplifies the analysis of P* and JKi.P*') in the other parts. Note in parts (b) and (e) that, if P* and P* are defined by (3.1) and (3.2), then, since P*(A) = 0 for all A, (3.4) holds for all A and (3.3) holds for no A. Countable additivity thus plays an essential role in Problem 3.2. 3.6 (c) Split Ec hy A: /»„(£) = 1 -P°(£c) = 1 -P°(A Г)ЕС)-Р°(АС n£c) = 1 - P°(A Π Ec) - P(AC) = Pi A) - РЧА - £). 3.7, (b) Apply (3.13): For A e ^,, (2Ы) = Р°(ЯПЛ) +Ро(ЯсП/1) = Р°(НГ\А) + Ρ(Λ) - Р°Ы - (Яс ПИ)) = Р(Л). (с) If Ax and Λ2 are disjoint J^-sets, then by (3.12), P°(Hn(AlUA2))=P°(Hr\Al)+P°(HnA2).
556 NOTES ON THE PROBLEMS Apply (3.13) to the three terms in this equation, successively using A, VA2, A and A2 for A: P0{H< η (AtUA2)) = P0(HC П/1,) +P0(H< nA2). But for these two equations to hold it is enough that НГ\АХ Г\А2= 0 in the first case and Hc Γ\ΑιΓ\Α2 = 0 in the second (replacing AX by ΑιΓ\Α2 changes nothing). 3.8. By using Banacli limits (Banach, p. 34) one can similarly prove that density D on the class 3> (Problem 2.18) extends to a finitely additive probability on the class of all subsets of Ω = {1,2,...}. 3.14. The argument is based on cardinality. Since the Cantor set С has Lebesgue measure 0, 2C is contained in the class _/ of Lebesgue sets in (0,1]. But С is uncountable: card 9ё=- card(0,1] < card2c < card _/. 3.18. (a) Since the Α θ r are disjoint Borel sets, ΣΓλ(Α θ г) < 1, and so the common value λ(Α) of the λ(Α θ r) must be 0. Similarly, if A is a Borel set contained in some # θ r, then k(A) = 0. (b) If the £п(ЯФг) are all Borel sets, they all have Lebesgue measure 0, and so £ is a Borel set of Lebesgue measure 0. 3.19. (b) Given ЛЬВ1,...,А„_ЬВ„,Ь note that their union Cn is nowhere dense, so that /„ contains an interval Jn disjoint from Cn. Choose in Jn disjoint, nowhere dense sets An and Bn of positive measure. (c) Note that A and Bn are disjoint and that AnUBn<z G. 3.20. (a) If /„ are disjoint open intervals with union G, then b~l\(A) > Σ„λ(/Π)> Lnb-lX(AC)In)>b-lX(A). Section 4 4.1. Let r be the quantity on the right in (4.30), assumed finite. Suppose that χ < r; then χ < V"=nxk for n>\ and hence х<хк for some k>n: x<xn i.o. Suppose that x<xn i.o.; then χ < V"=„*A for η > 1: x<r. It follows that r = sup[*: χ <xn i.o.], which is easily seen to be the supremum of the limit points of the sequence. The argument for (4.31) is similar. '4.10. The class & is the σ-field generated by ^U {#} (Problem 2.7(a)). If (HnGi)U(Hc nG2) = (HnG'l)U(Hc oG'2\ then G^G\czHc and С2дС2сЯ; consistency now follows because АДЯ) = Xji.(Hc)= 0. If A = (Я η G[n)) U (Hc η G2n)) are disjoint, then G\m) η G\n) с Hc and G(2m) η G2"> с Η for тФп, and therefore (see Problem 2.17) P(\JnAn)= \\(\J nG[n)) + 5^(U ,,G2n)) = E„GA(G<n)) + JA(Gi,n))) = ΣηΡ(Αη). The intervals with rational endpoints generate cf. 4.14. Show as in Problem 1.1(b) that the maximum of F(B, Π · · · Γ\Βη), where β, is A, or Acf, goes to 0. Let Ax = [w. ΣηΙΑ(ω)2~" <χ\ show that PiAnAx) is continuous in x, and proceed as in Problem 2.19(a).
NOTES ON THE PROBLEMS 557 4.15. Calculate D(F,) by (2.36) and the inclusion-exclusion formula, and estimate P„(F!-F) by subadditive; now use 0 <P„(F,)-P„(F) = P„(Fl-F). For the calculation of the infinite product, see Hardy & Wright, p. 246. Section 5 5.5. (a) If m = 0, a > 0, and χ > 0, then P[X>a]< P[(X + x)2 ;> (a + x)2] < E[(X + x)2]/(a +x)2 = (σ2 +x2)/(a + x)2; minimize over x. 5.8. <b) It is enough to prove that φίί) =f(t(x', y') + (l -tXx, y)) is convex in t (0 < t < 1) for (x, y) and (*', y') in C. If a = x' - χ and β = у' - у, then (if /u>0) φ"=ίηα2 + 2/12αβ+/22β2 1 7"(/n« +/12/З)2 + W/11/22 -/n)^2 21 0. /11 /l! V Examples like f(x,y) = y2 -2xy show that convexity in each variable separately does not imply convexity. 5.9. Check (5.39) for fix, y) = -хЧ»уЧч. 5.10. Check (5.39) for f(x,y)= -(х^р + у^рУ. 5.19. For (5.43) use (2.36) and the fundamental theorem of arithmetic· since the p,- are distinct, the p;< individually divide m if and only if their product does. For (5.44) use inclusion-exclusion. For (5.47), use (5.29) (see Problem 5.12)). 5.20. (a) By (5.47), £„[«„]< ΣΓ-ιΡ~k <2/p. And, of course, n~l log/z! = E„[log] = ΣρΕη[α p]\og p. (b) Use (5.48j and the fact that En[ap - 8p] < ТГк=2р~к. (c) By (5.49), Σ OgP= Σ у" 2J 11°B Ρ <2n(£2„[log*]-£„[log*]) = 0(n). Deduce (5.50) by splitting the range of summation by successive powers of 2. (d) If К bounds the 0(1) terms in (5.51), then Σ \ο%Ρ^θχ Σ P~l\ogp^ex(\oge~l-2K). p<x θχ<ρ<χ (e) For (5.53) use Σ ρ<χ log p log x <тг(*) <χι'2 + * Σ ρ <χ' 2 log χ 1 + /2 χ Σ log ρ&χ Σ /2 <ρ ^дг Ρ· log ρ log*1/2
558 NOTES ON THE PROBLEMS By (5.53), π(χ)>χι/2 for large x, and hence \ogir(x) χ log χ and 7t(jc)x jc/log 7t(jc). Apply this witfi x=pr and note that тг(рг) = г. Section 6 6.3. Since for given values of Χη1(ω),..-, X„ik_£<·>) there are for Χπ(ω) the к possible values 0,1,. -.,k - 1, the number of values of (Χη1(ω),...,Χηη{ω)) is n\. Therefore, the map ω -»(Χηϊ(ω),..., Χη„(ω)) is one-to-one, and the Xnk(<o) determine ω. It follows that if 0 <jc, <i for l</<fc, then the number of permutations ω satisfying Χπ!(ω) = χ,-, 1 < ί < к, is just (к + lXfc + 2) · · ■ η, so that P[Xni=Xj, I <i <k]= l/k\. It now follows by induction on к that vnl'- ,Xnk are independent and P[Xnk =x] = k l (0 <x <k). Now calculate E[Xnk] E[S„Y k-l 0+1+ ■·· +(и-1) «(«-1) η Τ 02 + l2+ ·· + (*-!) Var[Jf„t]= fc Var[S„]=^L(fc2-l) = fc2-l -m-^ * = i 2я3 + 3я2-5я η 72 3 36· Apply Chebyshev's inequality. 6.7. (a) If k2 < η < (к + l)2, let an = fc2; if Μ bounds the \xn\, then I _ _L я s" a„ s" < 1 η 1 O-n ■nM+—(n-a„)M=2M-—— ->0. 6.16. From (5.53) and (5.54) it follows that an = Σρη Чп/р\ -*°°. The ieft side of (6.8) is IIJL _Ii» I|» <J__(I_I)(I_I)<J_ + J_. η [ pq η [ ρ J η [ q J p<? \p η )\q η ) np nq Section 7 7.3. If one grants that there are only countably many effective rules, the result is an immediate consequence of the mathematics of this and the preceding sections: С is a countable intersection of J^sets of measure 1. The argument proves in particular the nontrivial fact that collectives exist. 7.7. If η <τ, then Wn = Wn__l -X„^i = Wt - S„_u and τ is the smallest η for which 5„_, = Wx. Use (7.8) for the question of whether the game terminates. Now FT = F0+ L (W, -Sk_,)Xk =/=„ + ^5r_, - i(52_, -(τ- 1)). ft-i
NOTES ON THE PROBLEMS 559 7.8. Let xb....Xj be the initial pattern and put X0 = jt,+ ··· +*,. Define X„ = X„_j- WnXn, L0 = k, and ^„ = ^„.,-(3^+ l)/2. Then τ is the smallest « such that Ln < 0, and τ is by the strong law finite with probability 1 if E[3Xn+ l] = 6(p - \)> 0. For η < τ, X„ is the sum of the pattern used to determine Wn+1. Since F„ - Fn_Y = X„ _,- X„, it follows that /Γ„ = /Γ0 + Σ0-Σπ and /v^Fo + Xq. 7.9. Observe that £[/r„-fT] = £[EJ_I^/lT<A]]-EJi.1£[JfjP[T<*]. Section 8 8.8. (b) With probability 1 the population either dies out or goes to infinity. If, for example, pkQ = 1 — Pk,k + i = V·2» 1^еп extinction and explosion each have positive probability. 8.9. To prove that Xj = 0 is the only possibility in the persistent case, use Problem 8.5, or else argue directly: If x, = ΣίΦ!αρί)χ)., i * i0, and К bounds the U,|, then Xj = LpUi...pjn_iJXjii.> where the sum is over /ι,--·»/„-ι distinct from г0, and hence \x,\ < KP,[Xk Φ i0, к <η] -> 0. 8.13. Let Ρ be the set of i for which π,- > 0, let Μ be the set of i for which π, < 0, and suppose that Ρ and N are both nonempty. For i0 e Ρ and j0 e /V choose η so that p!"' > 0. Then o< Σ Σ*,ρΤ = Σ*>- Σ Σ^ρΙΓ = Σ^,ΣρΙ/^ο. Transfer from N to Ρ any г for which π,· = 0 and use a similar argument. 8.16. Denote the sets (8.32) and (8.52) by Ρ and by F, respectively. Since Fc/>, gcd Ρ < gcd /\ The reverse inequality follows from the fact that each integer in Ρ is a sum of integers in F. 8.17. Consider the chain with states 0,1,... and a and transition probabilities Р0У=/у+1 f0r J^°> P0a=}~f> P,,,-l = 1 f°r ΙΞϊ1> 3nd A«»=1 (» 'S ЗП absorbing state). The transition matrix is 7i 1 0 0 /2 0 1 0 h ■ 0 - 0 · о - - W" 0 0 1 Show that /^ =/„ and p$ = un, and apply Theorem 8.1. Then assume /= 1, discard the state a and any states / such that fk = 0 for k>j, and apply Theorem 8.8.
560 NOTES ON THE PROBLEMS In Feller, Volume 1, the renewal theorem is proved by purely analytic means and is then used as the starting point for the theory of Markov chains. Here the procedure is the reverse. 8.19. The transition probabilities are p0r = 1 and p,r_,+ !=p, p,r_,=i, 1<г<г; the stationary probabilities are ul = ■· =ur = q~lun = (r + q)~l. The chance of getting wet is u0p, of which the maximum is 2r+ I - 2y7(r + 1). For r= 5 this is .046, the pessimal value of ρ being .523. Of course, u0p < \/Ar. In more reasonable climates fewer umbrellas suffice: if ρ = .25 and г = Ъ, then u0p = .050; if ρ = .1 and r = 2, then uQp = .031. At the other end of the scale, if ρ = .8 and r = 3, then uQp = .050; and if ρ = .9 and r = 2, then и0р = .043. 8.22. For the last part, consider the chain with state space Cm and transition probabilities ptJ for г',/ eCm (show that they do add to 1). 8.23. Let С = 5-(ГиС), and take U= TUC in (8.51). The probability of absorption in С is the probability of ever entering it, and for initial states г in TUC these probabilities are the minimal solution of У, = Σ P.jVj + Σ PU + Σ P,jV;> '" e ru C'- 0 <y,-< 1, геГиС. Since the states in С (С = 0 is possible) are peisistent and С is closed, it is impossible to move from С to C. Therefore, in the minimal solution of the system above, y, = 0 for i e C. This gives the system (8.55). It also gives, for the minimal solution, E;e7-p,7y; + Lj^cPj^O, г'еС This makes probabilistic sense: for an г in C", not only is it impossible to move to a / in C, it is impossible to move to a / in Τ for which there is positive probability of absoiption in C. 8.24. Fix on a state i, and let S„ consist of those j for which pjp > 0 for some η congruent to ν modulo /. Choose к so that p(k) > 0; if р)р and p)f are positive, then t divides m+k and η + к, so that m and η are congruent modulo t. The Sv are thus well defined. 8.25. Show that Theorem 8,6 applies to the chain with transition probabilities pty. 8.27. (a) From PC = СЛ follows Ρο,^λ,ς·, from RP=AR follows r,P = A,.r„ and from RC = I follows ricJ = SiJ. Clearly Λ" is diagonal and Pn = CAnR. Hence Pip = ZulClu\l8ul Rtj = Σ„λ"„(ν„),·,. = Σ„λ"„(Λ„)„, (b) By Problem 8.26, there are scalars ρ and γ such that rx = pr0 = p(iru,,.,tts) and ci=yc0, where c0 is the column vector of l's. From rxcx = 1 follows ργ = 1, and hence /1, = c,r, = c0r0 has rows (irl,...,irs). Of course, (8.56) gives the exact rate of convergence. It is useful for numerical work; see Cinlar, pp. 364 ff. (c) Suppose all four pu are positive. Then 7Г] =p2i AP21 +Pn^ '""г ^РмАрн + pi2X the second eigenvalue is λ = 1 -р\2~Ргъ ап^ 7Г, 7Г2 7Г, 7Г2 + λ" 7Г2 -7Г2 -7Г, 7Г,
NOTES ON THE PROBLEMS 561 Note that for given η and e, A" > 1 - e is possible, which means that the p}"' are not yet near the it,·. In the case of positive pip Ρ is always diagonalizable by С 1 7Г2 1 -7Г, R -1 (d) For example, take t = \, 0 <e < t, and t t t t t t 1-е t+e t In tliis case, 0 is an eigenvalue with algebraic multiplicity 2 and geometric multiplicity 1. 8.30. Show that a„ = π,- and -ο(4,έ <-*>•)-o(±). where ρ is as in Theorem 8.9. 8.36. The definitions give £,[/(XJ] =Ρ,[ση <n)f(0) + Ρ{ση = n]f(i + n) = 1 - Ρ;[ση = η] +Ρ;[ση = й](1 -//+n,0) = 1 -Ρ,···Ρ,·+π-ι +/Ί···Ρ,+π-ι(Ρ,·+πΡ,+π + ι·· ). and this goes to 1. Since Ρ;[τ <n = σΠ]> Ρ{[τ <n}(~)[Xk> 0, fc>l])->l- /,.„ > 0, there is an η of the kind required in the last part of the problem. And now £,·[/(*r)] zP,[T<n=an]f(i + n) + 1 -Ρ,[τ <η = σ„] = 1-^[τ<π=σ„]//+„ι0. 8.37. If ί>1, ηί<η2, (i,...,i + nl)eln and 0\...,/+л2) e'„2> then ^-[τ = η,, τ = «2]>/';[ΛΓΑ =ί + к, к <п2]> 0, which is impossible. Section 9 9.3. See Bahadur. 9.7. Because of Theorem 9.6 there are for P[Mn >a] bounds of the same order as the ones for P[Sn >ot] used in the proof of (9.36).
562 NOTES ON THE PROBLEMS Section 10 10.7. Let μι be counting measure on the σ-field of all subsets of a countably infinite Ω, let μ2 = 2μ^ and let & consist of the cofinite sets. Granted the existence of Lebesgue measure λ on <ё?', one can construct another example: let μ, =λ and μ2 = 2λ, and let & consist of the half-infinite intervals (-°°, jc]. There are similar examples with a field /5^ in place of 9*. Let Ω consist of the rationals in (0,1], let μ{ be counting measure, let μ2 = 2μ1, and let &~0 consist of finite disjoint unions of "intervals" [r e Ω: a < r < b]. Section 11 11.4. (b) If (/,*] с U *(/*,**], then (/(»),*(»)] с и *(/*(»),/*(»)] for all ω, and Theorem 1.3 gives g(ω)-/(ω) <Σk(gk(ω)-fk(ω)). If Лш =(g-/-Σ* <„ (g*~/*))v0> then Λ„ 4,0 and g-f<Lk<,,(gk-fk) + h„. The positivity and continuity of Л now give v0(f,g]< HkvQ(fk,gk]· A similar, easier argument shows that Lkv0(fk,gk]< v0(f,g] if (/*,gj are disjoint subsets of (/, g]. 11.5. (b) From (11.7) it follows that [/> 1] e У0 for / in _/ Since _/ is linear, [f>x] and [/< -jc] are in 5*~0 for /e_/ and x> 0. Since the sets (*,<») and (-co, -jc) for χ > 0 generate &\ each / in -£ is measurable a(S*~0). Hence .9*"- σ-(^ο). It is easy to show that 3^ is a semiring and is in fact closed under the formation of proper differences. It can happen that Ω£ 3^—for example, in the case where Ω = {1,2} and-/ consists of the / with /(1)= 0. See Jiirgen Kindler: A simple proof of the Daniell-Stone representation theorem. Amer. Math. Monthly, 90 (1983), 396-397.) Section 12 12.4. (a) If θη = θ„, then вп_т = 0 and n=m because θ is irrational. Split G into finitely many intervals of length less than e; one of them must contain points θ2η and в2т with θ2η < e2m. Uk = m-n, then 0 < θ2„ - θ2η = в2т θ θ2π = в2к <е, and the points 62kl for 1 <1<{в2к\ form a chain in which the distance from each to the next is less than e, the first is to the left of e, and the last is to the right of 1 -e. (c) If s1Qs2 = e2k + iQ62n ®θ2η lies in the subgroup, then s, =s2 and 62k_, ] = 12.5. (a) The S ® вт are disjoint, and (2л + ί)υ + к = (2« + 1)ϋ' + к' with |fc|, |fc'| <n is impossible if κ Φ υ'. (b) The /1 θ θ(2η+1)ι are disjoint, contained in G, and have the same Lebesgue measure. 12.6. See Example 2.10 (which applies to any finite measure). 12.8. By Theorem 12.3 and Problem 2.19(b), A contains two disjoint compact sets of arbitrarily small positive measure. Construct inductively compact sets Ku iU (each м;. is 0 or 1) such that 0<μ(^„ ) <3"~" and KU[ 0 and Кщ иj' are disjoint subsets of K„ .. . Take К = f] „ U ,, ,, Ku „ . The Cantor set is , ΜΙ Μ η n M ( M ii " I " и a special case.
NOTES ON THE PROBLEMS 563 Section 13 13.3. If /= Σ,χ,ΙΑ. and ^еГ1^, take A', in &' so that A^T'%, and set φ = ILjXjIj.. For the general / measurable Г'У, there exist simple functions /„, measurable Г_1У, such that /π(ω)-»/(ω) for each ω. Choose ψη, measurable У, so that /„ = φηΤ. Let C" be the set of ω' for which φη(ω') has a finite limit, and define φ(ω') = lim„ φη{ω) for w'eC and ψ(ω')=0 for ω' € С. Theorem 20.1(ii) is a special case. 13.7. The class of Borel functions contains the continuous functions and is closed under pointwise passages to the limit and hence contains 8£. By imitating the proof of the тт-к theorem, show that, if / and g lie in S£, then so do f + g, fg, f-g, /Vg (note that, for example, [g: / + ge 5Г] is closed under passages to the limit). If fn(x) is 1 or 1 - nix - a) or 0 as χ < a or a<x<a+n~x or a +n~x <x, then fn is continuous and f„(x)-> /,_„, a](x). Show that [A: IAe S£\ is a A-system. Conclude that 2£ contains all indicators of Borel sets, all simple Borel functions, all Borel functions. 13.13. Let β = {ί>ι,...,Μ, where fc<«, Et = C-brlA, and E=U?=iE,·. Then E = C— \J?=ib~lA. Since μ is invariant under rotations, μ(Ε,) = 1 - μ(Α) < n~\ and hence μ(Ε) < 1. Therefore C-E= r\f=1bj~lA is nonempty. Use any θ in С - Ε. Section 14 14.3. (b) Since и < F(x) is equivalent to cp(u) <x, it follows that и <Fbp{u)). And since F(x) < и is equivalent to χ < <р(м), it follows further that Fbpiu) - e)<u for positive e. 14.4. (a) If 0<u<u<l, then P[u<F(X)<u, Х<вС] = Р[<р(и) <Χ <φ(υ), A'e C]. If <p(u) e C, this is at most P[ <p{u) <Χ<ψ(υ)] = F(<p(u) - ) - F(f(«) - ) = F(<p(u)- )- Л<р(и)) <u-u; if <р(м)€ С, it is at most P[<p(u) <X<φ(υ)] = F(<piu) - ) -F(<piu)) <u-u. Thus P[F(X)e [u,υ), Χe С] < А[м,и) if 0< м<и < 1. This is true also for M=0(let «10 and note that Р[ЯА') = 0]=0) and foi υ = 1 (let d Τ 1). The finite disjoint unions of intervals [u,u) in [0,1) form a field there, and by addition P[F(X)eA, A'e C]<X(A) for /1 in this field. By the monotone class theorem, the inequality holds for all Borel sets in [0,1). Since P[F(X) =1ДеС] = 0, this holds also for A = {1}. 14.5. The sufficiency is easy. To prove necessity, choose continuity points x,- of F in such a way that x0 <X\ < · · · <xk, F(x0) <e, F(xk)> 1-е, and *,·—■*,■_ι < e. If η exceeds some n0, \F(x;) -F„(x;)\ <e/2 for all i. Suppose that ·*,·_]< x<x;. Then Fll(x)<Fn(xi)<F(x!) + e/2<F(x + e)+e/2. Establish a similar inequality going the other direction, and give special arguments for the cases χ <x0 and х>хк· Section 15 15.1. Suppose there is an ^partition such that Σ,-Isup^ /]μ(/1;·) <<». Then a, = sup^ /<o° for / in the set / of indices for which μ(Αί)>0. If a = max/a,·,'then μ[/> a] = Σ,μ(/1, Π [/> a]) < Σ,μ(/1,·η [/> a,]) = 0. And Α ι Π [/> 0] = 0 for ι outside the set J of indices for which μ{Α-) < °°, so that μ[/> 0] = 1μ(Αί Π [/> 0]) < 1]μ{Α;) < со.
564 NOTES ON THE PROBLEMS 15.4. Let (Ω, 5**, μ + ) be the completion (Problems 3.10 and 10.5) of (Ω, S*~, μ). If g is measurable &,[f*g\^A, Л е &, μ(Α)=0, and He&\ then [/еЯ] = (#п[/£Я])и(Лп[/еЯ]) = (#п[?еЯ])и(/(п[/еЯ]) lies in ^, and hence / is measurable &*. (a) Since / is measurable &~*, it will be enough to prove that for each (finite) ^"-partition {fly} there is an ^partition {/!,} such that Χ,-Ιϊηί^/]μ(/4,-)> Σ;[ϊηίβ./]μ + (Α,), and to prove the dual relation for the upper sums. Choose (Problem 3.2) 5^sets At so that А,сВ, and μ(Λ,) = μ*(Α,) = μ + (Α,·). For the partition consisting of the /1, together with (U,/l,)c, the lower sum is at least Σ,[ίηίβ//]μ(/1,) - Σ,[ϊηίβ./]μ +(fl,). (b) Choose successively finer ^partitions {Am) in such a way that the corresponding upper and lower sums differ by at most l/n3. Let g„ and /„ have values inf^./ and sup^./ on Ani. Use Markov's inequality—since μ(Ω) is finite, it may as well be 1—to show that μ[/„ - g„ > l/n] < l/n2, and then use the first Borel-Cantelli lemma lo show that /„ - g„ -» 0 almost everywhere. Take g - lim„ g„. Section 16 16.3. 0 </„-/,?/-/,· 16.4. (a) By Fatou's lemma, \ί<ίμ ~ \αάμ = j lim(/„ - β„) άμ <lim inf J(/„ -απ)άμ = lim inf (/„άμ - ίαάμ and /Ьйл-//^ = /Нт(Ь„-/„)^ < lim inf /(b„ -/π)άμ= ^άμ- lim sup //„ ^μ. Therefore lim sup //„ ^μ < //^μ < lim inf //„ άμ. 16.6. For ω еЛ and small enough complex h, |/(»,z0 + A)-/(»,z0)| = <|/i|g(u>,z0). 16.8. Use the fact that /J/Ι^μ <αμ(Α) + Ιηη^α]\/\άμ.
NOTES ON THE PROBLEMS 565 16.9. If μ(Α)<δ implies (Α\ί„\άμ <e for all n, and if a"1 supjl/jd/ι. <δ, then μ[\fn\>a]<a~l|\fn\dμ<δ and hence /[|/(|ϊ,α]|/π^μ <e for all n. For the reverse implication adapt the argument in the preceding note. 16.10. (b) Suppose that /„ are nonnegative and satisfy condition (ii) and μ is nonatomic. Choose δ so that μ(Α)<δ implies {Α/ηάμ<1 for all n. If μ[fn = oo] > 0, there is an A such that A<z[fn = oo] and 0 < μ(Α) <δ; but then )Α/ηάμ = oo. Since μ[/Γι = οο] = 0, there is an a such that μ[/η>α]<δ < μ[/„>α]- Choose бс[/л=«] in such a way that A = [fn>a]UB satisfies μ(Α) = δ. Then αδ=αμ(Α)<)Α/ηάμ< 1 and //„ Λμ < 1 +ад(/1с) < 1 + δ'ιμ(Ω,). 16Л2. (b) Suppose that /еУ and />0. If /„ = (1 -«_1)/v0, then /„e/ and /„ Τ /, so that </"„,/] = Λ(/-/„)10. Since «4/,,/] < °°, it follows that ν[(ω,ί): /(ω) = t ] = 0. The disjoint anion β·-μΡ </^ /+1 X (°f]) increases to β, where Bc(0,/] and (0,/]-Κ[(„),(): Λω) = /]. Therefore л2" · A(/) = KO,/]=Hmi;(B„)=liin L 07?Μ :=1 2" <·'- 2" //^μ. Section 17 17.1. (a) Let /1€ be the set of χ such that for every δ there are points у and ζ satisfying |y -jtl <δ, \ζ -χ\<δ, and l/(y) -/(2)|> e. Show that Ae is closed and Df is the union of the Ae. (c) Given e and 77, choose a partition into intervals /, for which the corresponding upper and lower sums differ by at most εη. By considering those Ϊ, whose interiors meet Ae, show that er) >ek(Ae). (d) Ixt Μ bound l/l and, given e, find an open G such that DfoG and A(G) < e/M. Take С = [0,1] - G and show by compactness that there is a δ such that \f(y) -f(x)\ <e if χ (but perhaps not y) lies in С and \y -x\ <δ. If [0,1] is decomposed into intervals /, with λ(/,) < δ, and if x, e /,., let g be the function with value /(·*,·) on /;. Let Σ' denote summation over those i for which /: meets C, and let Σ" denote summation over the other i. Show that (if(x)dx- E/(*,-)A(/,) zf\f(x)-g(x)\dx ■Ό -Ό <E'2eA(/,)+ E"2MA(/,)<4e. 17.10. (c) Do not overlook the possibility that points in (0,1) - К converge to a point in K. 17.11. (b) Apply the bounded convergence theorem to f„(x) = (1 - η dist(*,[i, t]))+. (c) The class of Borel sets В in [u,u] for which f=IB satisfies (17.8) is a λ-system.
566 NOTES ON THE PROBLEMS (e) Choose simple /„ such that 0</„T/ To (17.8) for / = /„, apply the monotone convergence theorem on the right and the dominated convergence theorem on the left. 17.12. If g(x) is the distance from χ to [a,b], then /„ = (1 -ng) V Oj, I[a fc] and fne.^f; since the continuous functions are measurable &x, it follows that У~= &K. If /n(jc)iO for each x, then the compact sets [*: f„(x) > e] decrease to 0 and hence one of them is 0; thus the convergence is uniform. 17.13. The linearity and positivity of Λ are certainly elementary facts, and for the continuity property, note that if 0 <f<t and / vanishes outside [a,b], then elementary considerations show that 0 < Λ(/) < e(b — a). Section 18 18.2. First, ЗГх ST is generated by the sets ot the forms {χ) χ A' and Xx {x). If the diagonal Ε lies in &У. S£, then there must be a countable 5 in A' such that Ε lies in the σ-field iF generated by the sets of these two forms for χ in 5. If 3s consists of Sc and the singletons in S, then &~ is the class of unions of sets in the partition [P, X P2: Pu P2 e 3s]. But Ε e & is impossible. 18.3. Consider А У В, where A consists of a single point and В lies outside the completion of ^' with respect to λ. 18.17. Put f„=p~[ logp, and put /„ = 0 if η is not a prime In the notation of (18.17), Fix) = log χ + φ(χ\ where φ is bounded because of (5.51). If G(x) = -1/log x, then F(x) | fxF(t)dt log·* J2 t\og2t ψ(χ) cx dt r"4>(t)dt _ r«4p(t)dt log χ J2 t log t J2 t iog21 jx t iog21 ■ Section 19 19.3. See Banach, p. 34. 19.4. (a) Take /= 0 and /,,=/(0,,/n)- (b) Take /= 0, and let {/„} be an infinite orthonormal set. Use the fact that En(/„,g)2<llgll2. 19.5. Take fn = nl(0[/n), and suppose that f„k converges weakly to some / in L1. Integrate against the L°°-functions sgn/-/(€ 1} and conclude that /= 0 almost everywhere; now integrate against the function identically 1 and get a contradiction. Section 20 20.4. Suppose Ux,...,Uk are independent and uniformly distributed over the unit interval, put Vj = 2nUj-n, and let μ„ be (2n)k times the distribution of Σ| ρ<χ
NOTES ON THE PROBLEMS 567 (V{,...,Vk). Then μ„ is supported by Qn = (-n,n]x ·-· x(-n,n], and if / = («„b,]X'-'X(fli,i)i]cj2„1 then μ„(/) = Π^,(ί>ί-β1·). Further, if А с Qn с Qm (n < m), then μ „(Α) = μ J. A). Define \k(A) = lim„ μ„(/1 η β„). 20.7. By the argument preceding (8.16), /> [η = Й1,..., г, = пк ] =/#">/#"-'">... /,<·"- ■>. For the general initial distribution, average over i. 20.8. (a) Use the π-λ theorem to show that P[{X^,■■■, Χπ„)e H] is the same for all permutations тт. (b) Use part (a) and the fact that Yn = r if and only if Trin) = n. (c) If k<n, then Yk = r if and only if exactly r-l among the integers 1,.. , к - 1 precede к in the permutation Tin). (d) Observe that Г(п) = (/„.. .,/„) and Y„+i=r if and oniy if Г(п + 1) = ('ι,■--,',_[,и + l,tr,... ,tn), and conclude that σ(Υπ+1) is independent of σ(Γ(π)) and hence of σ(Υ„..., Y„)— see Problem 20.6. 20.12. If X and Υ are independent, then P[\(X +Y) - (jt +y)|<c] >P[\X -x\< &]P[\Y-y\< ±e] and P[X+Y = x+y]>P[X = x]P[Y = y]. 20.14. The partial-fraction expansion gives cu(y-x)c,(x)= ^j(A+B + C + D), where /? = (u2 - v2)2 + 2(u2 + v2)y2 +y4 and y2 + u2-u2 2y(y-x) A=— r, B = u2 + (y -x) u2 + (y-x) C-y2-u2 + u\ D-~2yX- υ +x υ +x After the fact this can of course be checked mechanically. Integrate over [-/,/] and let / -»oo: f'_,Ddx = 0, j'_,Bdx->0, and {ZJ.A +C)dx = (y2 + u2-u2)u~ V + (y2 - v2 + u2)u~ V = и~1и~ 1тг2Яси+1(у). There is a very simple proof by characteristic functions; see Problem 26.9. 20.16. See Example 20.1 for the case η = 1, prove by inductive convolution and a change of variable that the density must have the form Кпх(п/2)~1е~х/2, and then from the fact that the density must integrate to 1 deduce the form of Kn. 20.17. Show by (20.38) and a change of variable that the left side of (20.48) is some constant times the right side; then show that the constant must be 1.
568 NOTES ON THE PROBLEMS 20.20. (a) Given e choose Μ so that P[\X\>M]<e and P[\Y\>M]<e, and then choose δ so that |jc|,|y|<A/, \χ-χ'\<δ, and \y-y'\<8 imply that l/(jt',y')-/(jt,y)l<c Note that P[\f(Xn,Yn) -f(X,Y)\ >e] < 2e + P[\Xn- X\>8] + P[\Yn-Y\>8]. 20.23. Take, for example, independent Xn assuming the values 0 and я with probabilities 1-я-1 and n~l. Estimate the probability that Xk=k for some к in the range я/2 <к <n. 20.24. (b) For each m split A into 2m sets Amk of probability P(A)/2M. Arrange all the Amk in one infinite sequence, and let X„ be the indicator of the nth set in it. 20.27. To get the distribution of Ф, show by integration that for 0 < φ < 2π, the intersection with the unit ball of the (xlt x2, Jt3)-set where 0 <Jt3 < (jt,2-f jc2)1/2 tan φ has volume χττ sin φ. Section 21 21.5. Consider ΣΙΑ ■ A random variable is finite with probability 1 if (but not only if) it is integrable. 21.6. Calculate j^xdF(x) = /jf/*dydF(x) = ffl?dF(x)dy. 21.8. (a) Write E[Y-X] = fx<Yjx<l&Y dtdP-lY<xjY<l <xdtdP. 21.10. (a) The most important dependent uncorrelated random variables are the trigonometric functions—the random variables sinlirnw and coslirnw on the unit interval with Lebesgue measure. See Problem 19.8. 21.13. Use Fubini's theorem; see (20.29) and (20.30). 21.14. Even if X= -Y is not integrable, X+Y=0 is. Since |Y| <UI + U + У|, E[\Y\] = oo implies that E[\x + Y\] = oo for each x; use Problem 21.13. See also the lemma in Section 28. 21.21. Use (21.12). Section 22 22.2. For sufficiency, use Ε[Σ\Χ^\]= ΣΕ[\Χ^\]. ' 22.8. (a) Put U = LkI[kuT]Xt and V= Т<к1[к&г]Хк, so that ST = U-V. Since [τ > fc] = Ω - [τ < fc - 1] lies in σ{Χλ,..., Xk_,), it follows that E[ /[т ^к]Х£] = Е[11т^к]]Е[Х+] = Р[т>к]Е[Х{]. Hence E[U] = Σ^=ιΕ[Χ^]Ρ[τ > к] = Е[Х{]Е[т]. Treat V the same way. (b) To prove Е[т] <oo, show that P[r> (a +b)n] <(1 -pa+b)n. By (7.7), 5T is b with probability (1 — p")/(l - pa+b) and -a with the opposite probability. Since E[Xt]=p -q, c-r ι a a+b 1 ~P" 4 -l ι 1 J q-p q-p ι -po+b' y Ρ
NOTES ON THE PROBLEMS 569 22.11. For each θ, Σπε'χ"(ε'βζ)η has the same probabilistic behavior as the original series, because the Xn + ηθ reduced modulo 2v are independent and uniformly distributed. Therefore, the rotation idea in the proof of Theorem 22.9 carries over. See Kahane for further results. 22.14. (b) Let A=f~lB and suppose ρ is a period of /. Let m=[x/p\ and n=[l/p\. By periodicity, P(AC\[y,y +p]) is the same for all y; therefore, \Р(Ап[0,х])-тР(АГ)[0,р])\<р, \P(A) - nP(A Г)[0, р])\<р, and \P(A Π [0: x]) - P(A)x\ < 2p + \x - m/n\ < Ър. Since ρ can be taken arbitrarily small, P(A Π [0, χ]) = P(A)x. 22.15. (a) By the inequalities Lis) < M(s) and (22.24), BE{ls) = 1 л 3L(2s) < 3M(2s) < 3B0(s). For the other inequality, note first that T(s) is nonincreasing and that R(2s) < 2 Us) and T(s) < Lis) < Mis). If BE(s) < 1/3, then l-fi(6s) - l-2L(3s) _ l-2M(3s) l-2B£(s) <3£?t-(s). On the other hand, if BF(s)> 1/3, then B0(6s) < 1 <3Bt-(s). In either case, B0(6s) < 3BE(s). Section 23 23.3. Note that A, cannot exceed t. If 0 < и <t and υ > 0, then P[A, >u,Bt> u] = РЩ+1-И,_и = 0,] = е-аие-а1. 23.4. (a) Use (20.37) and the distributions of A, and B,. (b) A long interarrival interval has a better chance of covering ' than a short one does. 23.6. The probability that N's^k -ЩИ=] is J η J- -xk-le-axdx = 23.8. Let M, be the given process and put φ(ί) = E[M,]. Since there are no fixed discontinuities, <p(t) is continuous. Let t/f(w) = inf[i: u<cp(t)], and show that Nu = Мф(и) is an ordinary Poisson process and M, = Νφ(ι). 23.9. Let t -» oo in $N, J_ $N,+ 1 N, + 1
570 NOTES ON THE PROBLEMS 23.11. Restrict t in Problem 23.10 to integers. The waiting times are the Z„ of Problem 20.7, and account must be taken of the fact that the distribution of Zj may differ from that of the other Z„. Section 25 25.1. (e) Let G be an open set that contains the rationale and satisfies A(G) < j. For к = 0,1,...,η — I, construct a triangle whose base contains k/n and is contained in G: make these bases so narrow that they do not overlap, and adjust the heights of the triangles so that each has area l/n. For the nth density, piece together these triangular functions, and for the limit density, use the function identically 1 over the unit interval. 25.2. By Problem 14.8 it suffices to prove that F„( , ω) => F with probability 1, and for this it is enough that Fn(x, ω)-» Fix) with probability 1 for each rational x. 25.3. (b) It can be shown, for example, that (25.14) holds for xn=n\. See Persi Diaconis: The distribution of leading digits and uniform distribution modi, Ann. Prob., 5 (1977), 72-81. (c) The first significant digits of numbers drawn at random from empirical compilations such as almanacs and engineering handbooks seem approximately to follow the limiting distribution in (25.15) rather than the uniform distribution over 1,2,..., 9. This is sometimes called Benford's law. One explanation is that the distribution of the observation X and hence of log10 X will be spread over a large interval; if log10 X has a reasonably smooth density, it then seems plausible that {log10 X) should be approximately uniformly distributed. See Feller, Volume 2, p. 62. 25.9. Use Scheffe's theorem. 25.10. Put fn(x) = P[ Xn = yn + kdn]S;' for y„ + kS„ < χ < yn + (k + 1)S„. Construct random variables Yn with densities /„, and first prove Y„=>X. Show that Z„ = yn + [(Y„ - y„)/S„]5„ has the distribution of Xn and that Yn-Zn=> 0. 25.11. For a proof of (25.16) see Feller, Volume 1, Chapter 7. 25.13. (b) Follow the proof of Theorem 25.S, but approximate I(xy^ instead of 25.20. Let X„ assume the values η and 0 with probabilities pn=l/(nlogn) and - 1 -P„- Section 26 26.1. (b) Let μ be the distribution of X. If \φ(0\ = 1 and t Φ 0, then <p(i)=e"a for some a, and 0 = fZJ.1 - eUix"a))ii(dx) = jZJ.1 - cos t(x - α))μ(άχ). Since the integral vanishes, μ must confine its mass to the points where the nonnegative integrand vanishes, namely to the points χ for which t(x — a) = 2π« for юте integer n. (c) The mass of μ concentrates at points of the form a + 2vn/t and also at points of the form a' + 2πη/ί'. If μ is positive at two distinct points, it follows that t/t' is rational.
NOTES ON THE PROBLEMS 571 263. (a) Let f0(x) = v~1x~2(l -cosx) be the density corresponding to φ0(ί). If Pk = (sk _5* + 1)'*. then Σ^-ιΡ*= 1; s'nce Σ"=.ιΡ* *><)('/'*) = *>(') (check the points t = tj), cp(t) is the characteristic function of the continuous density Σϊ-ιΡ*'*/ο('*.*)· (b) If lim, _„„ φ(ί) = 0, approximate ψ by functions of the kind in part (a), pass to the limit, and use the first corollary to the continuity theorem. If φ does not vanish at infinity, mix in a unit mass at 0. 26.12. On the right in (26.30) replace φ(ί) by the integral defining it and apply Fubini's theorem; the integral average comes to , , r sin ТУ*-a) "{a]+Le T(x-a) '»(*>· Now use the bounded convergence theorem. 26.15. (a) Use (26.40) to prove that \<pn{t +h)-(p„(t)\ <2μη(-α. af +a\h\. (b) Use part (a). 26.17. (a) Use the second corollary to the continuity theorem. 26.19. For the Weierstrass approximation theorem, see Rudin,, Theorem 7.32. 26.22. (a) If an goes to 0 along a subsequence, then |ι/Ό)Ι=1; use part (c) of Problem 26.1. (c) Suppose two subsequences of (ял) converge to aQ and a, where 0 <a0 < a; put θ = a0/a and show that \<p(t)\ = |^(fl*/)|. (d) Observe that -ι [e"b"-l] f'e"b' ds 26.25. First do the nonnegative case; then note that if / and g have the same coefficients, so do f+ + g~ and g + + f~ Section 27 27.8. By the same reasoning as in Example 27.3, (Rn - log ή)/ )/\ogn =» N. 27.9. The Lindeberg theorem applies: (S„ - n2/4)/ y/n3/36 =» N. 27.11. Let Y„ be Xn or 0 according as \X} < n1/2 log η or not. Show that X„ = Yn for large n, with probability 1, and that Lyapounov's theorem (δ = 1) applies to the Y„. 27.12. For example, let the distribution of X„ be the mixture, with weights 1-й-2 and n~2, of the standard normal and Cauchy distributions. 27.16. Write fie'»'^du =χ-χβ~χ2'1 - jy^e""2^2du.
572 NOTES ON THE PROBLEMS 27.17. For another approach to large-deviation theory, see Mark Pinsky: An elementary derivation of Khintchine's estimate for large deviations, Proc. Amer. Math. Soc, 22 (1969), 288-290. 27.19. (a) Everything comes from (4.7). If A = [(/,, ...,lk)(=H] and Be Ο^Α+ηΛ+η + Ι.···). then \P(AnB)-P(A)P(B)\ zL\P(.[i«=iu,"zk]nB)-p[iu=iu,uzk]p(B)\, where the sum extends~over the fc-tuples (ij,...,i'A) of nonnegative integers in H. The summand vanishes if и + iu < к + η for и < к, the remaining terms add to at most 2LkualP[l„ >k+n-u]< 4/2" (b) To show that σ2 = 6 (see (27.20)), show that /, has mean 1 and variance 2 and that k-n \p[it=i]i(i-n) \ii>n. Section 28 28.2. (b) Pass to a subsequence along which μη(/?')-»°°, choose en so that it decreases to 0 and en/i„(/?') -»<», and choose xn so that it increases to <» and β„(-χ„,x„)> j^n{Rl)\ consider the / that satisfies f(±x„}^e„ for all η and is defined by linear interpolation in between these points. 28.4. (a) If all functions (28.12) are characteristic functions, they are all certainly infinitely divisible. Since (28.12) is continuous at 0, it need only be exhibited as a limit of characteristic functions. If μ„ has density /(_n n](l +x2) with respect to v, then exp J — ж l ~r X J —oo X is a characteristic function and converges to (28.12). It can also be shown that every infinitely divisible distribution (no moments required) has characteristic function of the form (28.12); see Gnedenko &Kolmogorov, p. 76. (b) Use (see Problem 18.19) - \t\ = w_1/"icos tx - 1)лГ2 dx. 28.14. If X1,X2,... are independent and have distribution function F, then (A-, + • · · + X„)/Jn also has distribution function F. Apply the central limit theorem. 28.15. The characteristic function of Zn is ^^о]^(в',4/""1Имрс^-|^л ; exp . .„ r" 1 -cos^ -c\t\ \ r-—dx
NOTES ON THE PROBLEMS 573 Section 29 29.1. (a) If / is lower semicontinuous, [x: fix) > t] is open. If / is positive, which is no restriction, then {/άμ = /£μ[/> t]dt < /^ Mm inf„ /i„[/> t]dt < liminfn/^n[/> t]dt = 1ίπιίηί„//Λμη. (b) If G is open, then /c is lower semicontinuous. 29.7. Let Σ be the covariance matrix. Let Μ be an orthogonal matrix such that the entries of ΜΣΜ' are 0 except for the first r diagonal entries, which are 1. If Y = MX, then У has covariance matrix ΜΣΜ', and so У=(У1,..-.-Уг,0,...,0), where У,, .,Yr are independent and have the standard normal distribution. But |*12 = Е?„,У,2. 29.8. By Theorem 29.5, Xn has asymptotically the centered normal distribution with covariances σ-. Put χ = (p\/2,...,pk/2) and show that Σχ' = 0, so that 0 is an eigenvalue of Σ. Show that Σ/ = y' if у is perpendicular to x, so that Σ has 1 as an eigenvalue of multiplicity к — I. Use Problem 29.7 together with Theorem 29.2 (h(x) =\x\2). 29.9. (a) Note that л_1Е"=,У,*; => 1 and that (Xnl,...,Xnt) has the same distribution as (^„...^/(η-'Σ^)'/2. Section 30 30.1. Rescale so that s2n= 1, and put Ln(e) = Lkf\Xi \>eX2kdP. Choose increasing nu so that Ln(u~l) <u~3 for n>nu, and put Mrl = u~1 for nu<n <nu + l. Then Mn-0 and- Ln(Mn)<Mn3. Put У„, =^/[|X,ij£MJ. Show that LkE[Ynk] -» 0 and LAE[y„J -» 1, and apply to LkY„k the central limit theorem under (30.5). Show that LkP[Xnk * Y„k] -» 0. 30.4. Suppose that the moment generating function Mn of μ„ converges to the moment generating function Μ of μ in some interval about s. Let vn have density esx/Mn(s) with respect to /tn, and let ν have density e"/M(s) with respect to μ. Then the moment generating function of vn converges to that of ν in some interval about 0, and hence v„ => v. Show that {1χ/(χ)μη(άχ) -» Ι°1αι/(χ)μ(άχ) if / is continuous and has bounded support; see Problem 25.13(b). 30.5. (a) By Holder's inequality ΙΣ^ι^.Γ < fc^'E^ili/Jt/, and so Lrerj \i,jtjXj\^(dx)/r] has positive radius of convergence. Now fj Σ'/*, n(dx)=Ltri,---tita(ri,...,rk), where the summation extends over fc-tuples that add to r. Project μ to the line by the mapping LjtjXj, apply Theorem 30.1, and use the fact that μ is determined by its values on half-spaces. 30.6. Use the Cramer-Wold idea.
574 NOTES ON THE PROBLEMS -M 30.8. Suppose that к = 2 in (30.30). Then Μ [(cos A,Jc)r'(cos A2Jt)r2J (е/А,дг + е-/А,дг \ Г| / е/А2ДГ + е-/А2ДГ\ '2 2 ) ( 2 ) = 2—2Е Ε Г' Г2 Лфхр^А^-г.Э+А^-,-,))*]. By (26.33) and the independence of A, and A2, the last mean here is 1 if 2;', —r{= 2j2 — r2 = 0 and is 0 otherwise. A similar calculation for к = 1 gives (30 28), and a similar calculation for general к gives (30.30). The actual form of the distribution in (30.29) is unimportant. For (30.31) use the multidimensional method of moments (Problem 30.6) and the mapping theorem. For (30.32) use the central limit theorem; by (30.28). Xl has mean 0 and variance ^. 30.10. If nX/2<m<n and the inequality in (30.33) holds, then loglogn1/2< log log и -e(loglogn)1/2, which implies loglogn<e~2 log2 2. For large η the probability in (30.33) is thus at most \/{n. Section 31 31.1. Consider the argument in Example 31.1. Suppose that F has a nonzero derivative at x, and let /n be the set of numbers whose base-r expansions agree in the first η places with that of x. The analogue of (31.16) is P[Ze /„ + ,]/P[^e/„] -> r~\ and the ratio here is one of pQ,...,pr. If pt* r~l for some i, use the second Borel-Cantelli lemma to show that the ratio is p, infinitely often except on a set of Lebesgue measure 0. (This last part of the argument is unnecessary if r= 2.) The argument in Example 31.3 needs no essential change. The analogue of (31.17) is / i + 1 Hx)=Po+ ··" +Pi-\+PiF(rx-i), -<x<—j—, 0<i<r-l. 31.3. (b) Take fi=Ig-'H(l and f2=F; (/ι/2Γ'{1} =Я0 is not a Lebesgue set. 31.9. Suppose that A is bounded, define μ by μ(Β) = λ(Β Γ\Α\ and let F be the corresponding distribution function. It suffices to show that F'(x)= 1 for χ in A, apart from a set of Lebesgue measure 0. Let Ct be the set of χ in A for which F'{x) < 1 - e. From Theorem 31.4(i) deduce that A(C6) = /a(C6) <(1 - e)A(C6) and hence Л(Сб) = 0. Thus F'(x)> 1 -e almost everywhere on A. Obviously, F'(x)< 1. 31.11. Let A be the set of χ in the unit interval for which F'(x) = 0, take a = 0, and define An as in the first part of the proof of Theorem 31.4. Choose η so that Α(Λ„)> 1-е. Split {1,2,..., л} into the set Μ of к for which ((k- l)/n,k/n] meets An and the opposite set N. Prove successively that Lk ^ M[F(k/n)— F((k-l)/n)]<e, LksN[F(k/n)-F((k-l)/n)]>l-e, LksMl/n>\(A„) il-f, ΣΖ_,!/(*/«) -f«k ~ D/»)! * 2 - 2c
NOTES ON THE PROBLEMS 575 3i.i5. п:=1(^ + ;к2"/3"). 31.18. For χ fixed, let un and un be the pair of successive dyadic rationals of order η (υ„ — un = 2~") for which un <x < υη. Show that /Ю-/К) уЧК)-ЫО "y'.T-r,.x where ak is the left-hand derivative. Since ak~ix) = + Lior all χ and k, the difference ratio cannot have a finite limit 31.22. Let A be the .r-set where (31.35) fails if / is replaced by /φ; then A has Lebesgue measure 0. Let G be the union of all open sets of μ-measure 0; represent G as a countable disjoint union of open intervals, and let В be G together with any endpoints of zero μ-measure of these intervals. Let D be the set of discontinuity points of F. If F(x)£A, x&B, and iiD, then F(x- h) <Fix) <Fix + h\ Fix± Λ) -»Fix), and 1 fF(x+h) rx+h)f(9(t))dt^f(9(F(x))). J Ft r _ U \ F(x + h)-F(x-h)JFix„h) Now x-e <<piFix))<x follows from Fix-e) <Fix), and hence <piF{x)) = x. If λ is Lebesgue measure restricted to (0,1), then μ=λφ~\ and (31.36) follows by change of variable. But (36.36) is easy if χ e D, and hence it holds outside В U iDc Π F~ XA). But μiB) = 0 by construction and μiDc nF~1A)=0 by Problem 14.4. Section 32 32.7. Define μη and v„ as in (32.7), and write ν = v^ + 4"\ where v™ is absolutely continuous with respect to μπ and ν*η) is singular with respect to μη. Take pK = Σ„ρ£ and v% = ΣΑ"'- Suppose that vaciE) + vsiE)--= v'aciE) + i>'siE) for ail Ε in &~. Choose an S in &~ that supports v% and v's and satisfies μiS) = 0. Then v.c(E) — vJLE Π Sc) = vjE Π Sc) + vsiE Π Sc) = v'JE η Sc) + v'siE η Sc) = t'jE η Sc) = 1/зС(£). A similar argument shows that us(E)=v's(E). 32.8. (a) Show that & is closed under the formation of countable unions, choose ^-sets Bn such that μ(Β„) -»sup^ μ(β) (<<»), and take B0 = UnBn. (b) The same argument. (c) Suppose μiD0)> 0. The maximality of B0 implies that B0UD0 contains an Ε such that μiE) > 0 and v{E) < oo. Since B0 П Ε cfl0 e 0, μiB0 П E) = 0 iviE) < oo rules out j/(B0 Л Ε) = °°). Therefore, μiDof}E)>0 and ι/(Ζ)0 Π Ε) < oo, which contradicts the maximality of C0. (d) Take the density to be oo on Пс0. 32.9. Define / and us as in (32.8), and let f° and v° be the corresponding function and measure for 3го: v{E) = }Ef° άμ + v°(E) for Ε e 5го, and there is an 3го-set 5° such that ι/°(Ω-5°) = 0 and μ(5°) = 0. If Ε e 5го, it follows that ίΕΓάμ=ΙΕ_^Ι° άμ= jE_^f° άμα =v0iE-S0) = viE-S0)>
576 NOTES ON THE PROBLEMS It is instructive to consider the extreme case 5го = {0, Ω}, in which v° is absolutely continuous with respect to μ° (provided μ(Ω)>0) and hence v° vanishes. Section 33 33.2. (a) To prove independence, check the covariance. Now use Example 33.7. (b) Use the fact that R and Θ are independent (Example 20.2). (c) As the single event [X = Y] = [X- Y= 0]= [Θ = π/4]υ [Θ = 5тг/4] has probability 0, the conditional probabilities have no meaning, and strictly speaking there is nothing to resolve But whether it is natural to regard the degrees of freedom as one or as two depends on whether the 45° line through the origin is regarded as an element of the decomposition of the plane into 45° lines or whether it is regarded as the union of two elements of the decomposition of the plane into rays from the origin. Borel's paradox can be explained the same way: The equator is an element of the decomposition of the sphere into lines of constant latitude; the Greenwich meridian is an element of the decomposition of the sphere into great circles with common poles. The decomposition matters, which is to say the σ-field matters. 33.3. (a) If the guard says, "1 is to be executed," then the conditional probability that 3 is also to be executed is 1/(1 + p). The "paradox" comes from assuming that ρ must be 1, in which case the conditional probability is indeed j. But if ρφ\, then the guard does give prisoner 3 some information. (b) Here "one" and "other" are undefined, and the problem ignores the possibility that you have been introduced to a girl. Let the sample space be οι β γ δ bbo j, bgo j, gbo j, ggo j, ,, 1-α 1-/3 l-> 1-Й bby —4—, bgy —4—, gby —4—. 8ВУ —4— · For example, bgo is the event (probability β/4) that the older child is a boy, the younger is a girl, and the child you have been introduced to is the older; and ggy is the event (probability (1 - δ)/4) that both children are girls and the one you have been introduced to is the younger. Note that the four sex' distributions do have probability j. If the child you have been introduced to is a boy, then the conditional probability that the other child is also a boy is ρ = 1/(2 +β - у). If β = 1 and γ = 0 (the parents present a son if they have one), then p = \. If β = γ (the parents are indifferent), then ρ = {. Any ρ between \ and 1 is possible. This problem shows again that one must keep in mind the entire experiment the sub-<7--field £ represents, not just one of the possible outcomes of the experiment. 33.6. There is no problem, unless the notation gives rise to the illusion that p(A\x) is P(An[X = x])/P[X = x]. 33.15. If N is a standard normal variable, then N i4+x)=7ir-'4+i)/f ny+-fi.
NOTES ON THE PROBLEMS 577 Section 34 34.3. IK*, У Hakes the values (0,0), (1, - 1), and (1,1) with probability j each, then X and У are dependent but £[У||Л"] = E[Y] = 0. If (X,Y) takes the values (-1,1), (0, -2), and (1,1) with probability \ each, then E[Al = E[Y] = E[AY] = Oandso E[XY] = E[X]E[Y\ but E[Y\\X] = У^0=Е[У]. Of course, this is another example of dependent but uncorrected random variables. 34.4. First show that jfdP0 = jBfdP/P(B) and that P[B\\^]> 0 on a set of F0- measure 1. Let G be the general set in <?. (a) Since [ P0[A\\cf]P[B\\cf]dP= [ Pa[A\\cf]lBdP= [lcPa[A\\cf]dP JG JG JB = P(B)[ IcP0\A\\J?]dP0 = P(B)P0(AnG) = ( P[AnB\\J?]dP, JG it follows that P0[A\\S]P[B\\J]=P[AnB\\S] holds on a set of F-measure 1. (b) If Pi(A) = P(A\Bi), then f Ρ{Α№] άΡ = F(Β,.) ί IcPiiAWJ?] dPt = P(B,)/>( AnG) JGnB, Jil = f P[A\\^v^]dP. ■One, Therefore, jcIBP;[A\\^]dP = jcI? P\A\\Jv Jif]dP if C=GC\B„ and of course this holds for C=Gr\Bj if ji=i. But C's of this form constitute a тг-system generating <0V Jf, and hence ΙΒ.Ρ,[Α\\^] = IB,P[A\\^V Jf] on a set of F-measure 1. Now use the result in part (a). 34.9. All such results can be proved by imitating the proofs for the unconditional case or else by using Theorem 34.5 (for part (c), as generalized in Problem 34.7). For part (a), it must be shown that it is possible to take the integral measurable cf. 34.10. (a) If Y = X-E[X\\^l], then X-E[X\\J>2] = Y-E\Y\\J2\ and E[(Y- E[Y\\J?2])2\\J?2] = E[Y2\\J?2]-E2[Y\\J?2]<E[Y2\\J?2]. Take expected values. 34.11. First prove that P[AinA3\\J?2]=E[lAiP[A3\\J?u]\\J?2].
578 NOTES ON THE PROBLEMS From this and (i) deduce (ii). From e[iap[a3\\^2]I^2]=p[a1\\^2]p[a3\\^2], (ii), and the preceding equation deduce [ P[A3\\J?2]dP = f P[A3\\Sl2]dP. JAlnA2 JAlnA2 The sets AlC)A2 form a π-system generating cf12. 34.16. (a) Obviously (34.18) implies (34.17). If (34.17) holds, then clearly (34.18) holds for X simple. For the general X, choose simple Xk such that liin A Xk - X and \Xk\<\X\. Note that f XdP-afXdP j XkdP-a\xkdP + (\+\a\)E[\X-Xk\]; let η -»oo and then let к -»oo. (b) If Ω e &>, then the class of Ε satisfying (34.17) is a λ-system, and so by the ττ-λ theorem and part (a), (34.18) holds if X is measurable σ(&>). Since Αη(Εσ{&>\ it follows that j XdP= j E[X\\a(&>)]dP^afE[X\\a(&>)]dP = afXdP. (c) Replace X by XdP0/dP in (34.18). 34.17. (a) The Lindeberg-Levy theorem. (b) Chebyshev's inequality. (c) Theoiem 25.4. (d) Independence of the Xn. (e) Problem 34.16(b). (f) Problem 34.16(c). (g) Part (b) here and the e-δ definition of absolute continuity, (h) Theorem 25.4 again. Section 35 35.4. (b) Let S„ be the number of к such that l<fc<n and Yk = \. Then Xn = 3s"/2". Take logarithms and use the strong law of large numbers.
NOTES ON THE PROBLEMS 579 35.9. Let К bound I*,! and the \X„ -*„_,!. Bound \X7\ by Κτ. Write jT&kXTdP = Σ^ι/τ„Λ^ = Σ*„ι(/τ&,·Λ',ΛΡ-/τ&Ι.+1Λ',ΛΡ). Transform the last integral by the martingale property and reduce the expression to £[ A", ] - /T > k Xk +, dP. Now <A:(fc + l)F[T>fc]<A:(fc + l)fc-1 f rdP^O. JT>k 35.13. (a) By the result in Problem 32.9, ΛΊ, X2, .. is a supermartingale. Since Ε[\Χη\] = Ε[Χ„]<ν(Ω), Theorem 35.5 applies. (b) If Λ e ^, then lA(Y„ + Z„)dP + σ'η{Α) = jAXmdP + σ<£Α) = v(A) = jAXndP + σπ(Α). Since the Lebesgue decomposition is unique (Problem 32.7), Yn+Zn=Xn with probability 1. Since Xn and Y„ converge, so does Z„. If /Ιε^ and η > к, then jAZ„dP < aJA), and by Fatou's lemma, the limit Ζ satisfies jAZdP<aJiA). This holds for A in UA^ and hence (monotone class theorem) for A in .$£. Choose /1 so that P(A) = 1 and aJ.A) = 0: Ε[Ζ]=//4Ζ^Ρ<σ„Μ) = 0. It can happen that ση(Ω) = 0 and ajil) = ι/(Ω) > 0, in which case ση does not converge to σ^ and the A^ cannot be integrated to the limit. 35.17. For a very general result, see J. L. Doob: Application of the theory of martingales, Le Calcul des Probabilites et ses Applications (Colloques Interna- tionaux du Centre de la Recherche Scientifique, Paris, 1949). Section 36 36.5. (b) Show by part (a) and Problem 34.18 that /n is the conditional expected value of / with respect to the σ-field ^ + ] generated by the coordinates x„ + i,x„ + 2 By Theorem 35.9, (36.30) will follow if each set in П „^ has тг-measure either 0 or 1, and here the zero-one law applies. (c) Show that g„ is the conditional expected value of / with respect to the σ-field generated by the coordinates xh...,xn, and apply Theorem 35.6. 36.7. Let S be the countable set of simple functions Σ,αιΙΑ. for at rational and [Aj] a finite decomposition of the unit interval into subintervals with rational endpoints. Suppose that the X, exist, and choose (Theorem 17.1) Y, in ~tf so that £[|Ar,-y,|]<i From £[1*, -X,\]= ±, conclude that E[\YS -Y,\]> 0 for s Φ t. But there are only countably many of the Yr It does no good to replace Lebesgue measure by some other measure on the unit interval. . ψ Section 37 37.1. If t tk are in increasing order and f0 = 0, then min(/,y) Σ*(',·.'/)*ί*/=Σ*ι«/ Σ (h-h-i) i.i ij /-1 = Σ(<ί-',-ι)(Σ*,)2ϊ:0. f X*+idP JT>k
580 NOTES ON THE PROBLEMS 37.4 (a) Use Problem 36.6(b). (b) Let Щ: t > 0] be a Brownian motion on (Ω, ^ P0\ where W(-, ω) e С for every ω. Define £: Ω-»ΛΓ by Ζ,(£(ω)) = ^,(ω). Show that £ is measurable &/& and Ρ = ?„£-'. If Cc/lei?7', then Ρ(Α) = Ρ0(ξ~ιΑ) = P0(il)=l. 37.5. Consider W{\) = Znkal{W{k/n) - W((k - \)/n)) for notational convenience. Since nf W2[-)dP=( W2(l)dP->0, the Lindeberg theorem applies. 37.14. By symmetry, p(s,t) = 2P Ws>0, inf (Wu-Ws)< -W,\\ Ws and the infimum here are independent because of the Markov property, and so by (20.30) (and symmetry again) P(s,t)=2fp[Txst-s]-7±=e-*2/2*dx Jf\ 1/ / IT С v2tts = 2Γ(,-'-±=-±-€-*1/ι*-Λ=€-**π'ώιάχ. x I c-x2/2u 1 __r2. ^o ^o y/lv u3/2 v^s" Reverse the integral, use j^xe * r/2 dx = 1/r, and put υ = (s/(s + ы))1/2: p(*.o = -/ τζιν**1
Bibliography Halmos and Saks have been the strongest measure-theoretic and Doob and Feller the strongest probabilistic influences on this book, and the spirit of Kac's smali volume has been very important. Aubrey: Brief Lives, John Aubrey; ed., O. L. Dick. Seker and Warburg, London, 1949. Bahadur: Some Limit Theorems in Statistics, R. R. Bahadur. SIAM, Philadelphia, 1971. Banach: Theone des Operations Lineaires, S. Banach. Monografje Matem- atyczne, Warsaw, 1932. Berger: Statistical Decision Theory, 2nd ed., James O. Berger. Springer- Verlag, New York, 1985. Bhattacharya & Waymire: Stochastic Processes with Applications, Rabi N. Bhattacharya and Edward С Waymire. Wiley, New York, 1990. Billingsley,: Convergence of Probability Measures, Patrick Billingsley. t Wiley, New York, 1968. Billingsley2: Weak Convergence of Measures: Applications in Probability, + Patrick Billingsley. SIAM, Philadelphia, 1971. "Birkhoff & Mac Lane: A Survey of Modem Algebra, 4th ed., Garrett BirkhofF and Saunders Mac Lane. Macmillan, New York, 1977. Qnlar: Introduction to Stochastic Processes, Erhan (Jinlar. Prentice-Hall, Englewood Cliffs, New Jersey, 1975. Cramer: Mathematical Methods of Statistics, Harald Cramer. Princeton University Press, Princeton, New Jersey, 1946. " Doob: Stochastic Processes, J. L. Doob. Wiley, New York, 1953. Dubins & Savage: How to Gamble If You Must, Lester E. Dubins and Leonard J. Savage. McGraw-Hill, New York, 1965. 581
582 BIBLIOGRAPHY Dudley: Real Analysis and Probability, Richard M. Dudley. Wadsworth and Brooks, Pacific Grove, California, 1989. Dynkin & Yushkevich: Markov Processes, English ed., Evgenii B. Dynkin and Aleksandr A. Yushkevich. Plenum Press, New York, 1969. Feller: An Introduction to Probability Theory and Its Applications, Vol. I. 3rd ed., Vol. II, 2nd ed., William Feller. Wiley, New York, 1968, 1971. Galambos: The Asymptotic Theory of Extreme Order Statistics, Janos Galambos. Wiley, New York, 1978. Gelbaum & Olmsted: Counterexamples in Analysis, Bernard R. Gelbaum and John M. Olmsted. Holden-Day, San Francisco, 1964. Gnedenko & Kolmogorov: Limit Distributions for Sums of Independent Random Variables, English ed., B. V. Gnedenko and A. N. Kolmogorov. Addison-Wesley, Reading, Massachusetts, 1954. Halmos,: Measure Theory, Paul R. Halmos. Van Nostrand, New York, 1950. Halmos2: Naive Set Theory, Paul R. Halmos. Van Nostrand, Princeton, 1960. Hardy: A Course of Pure Mathematics, 9th ed., G. H. Hardy. Macmillan, New York, 1946. Hardy & Wright: An Introduction to the Theory of Numbers, 4th ed., G. H. Hardy and Ε. Μ. Wright. Clarendon, Oxford, 1959. Hausdorff: Set Theory, 2nd English ed., Felix Hausdorff. Chelsea, New York, 1962. Jech: Set Theory, Thomas Jech. Academic Press, New York, 1978. Kac: Statistical Independence in Probability, Analysis and Number Theory, Carus Math. Monogr. 12, Marc Kac. Wiley, New York, 1959. Kahane: Some Random Series of Functions, Jean-Pierre Kahane. Heath, Lexington, Massachusetts, 1968. Kaplansky: Set Theory and Metric Spaces, Irving Kaplansky. Chelsea, New York, 1972. Karatzes & Shreve: Brownian Motion and Stochastic Calculus, Ioannis Karatzis and Steven E. Shreve. Springer-Verlag, New York, 1988. Karlin & Taylor: A First Course in Stochastic Processes, 2nd ed., A Second Course in Stochastic Processes, Samuel Karlin and Howard M. Taylor. Academic Press, New York, 1975 and 1981. Kolmogorov: Grundbegriffe der Wahrscheinlichkeitsrechnung, Erg. Math., Vol. 2, No. 3, A. N. Kolmogorov. Springer-Verlag, Berlin, 1933. Levy: Theorie de /'Addition des Variables Aleatoires, Paul Levy. Gauthier- Villars, Paris, 1937. Protter: Stochastic Integration and Differential Equations, Philip Protter. Springer-Verlag, 1990.
BIBLIOGRAPHY 583 Riesz & Sz.-Nagy: Functional Analysis, English ed., Frigyes Riesz and Bela Sz.-Nagy. Unger, New York, 1955. Rockett & Szusz: Continued Fractions, Andrew M. Rockett and Peter Sziisz. World Scientific, Singapore, 1992. Royden: Real Analysis, 2nd ed., H. I. Royden. Macmillan, New York, 1968. Rudin,: Principles of Mathematical Analysis, 3rd ed., Walter Rudin. McGraw-Hill, New York, 1976. Rudin2: Real and Complex Analysis, 2nd ed., Walter Rudin. McGraw-Hill New York, 1974. Saks: Theory of the integral, 2nd rev. ed., Stanislaw Saks. Hafner, New York, 1937. Spivak: Calculus on Manifolds, Michael Spivak. W. A. Benjamin, New York, 1965. Wagon: The Banach-Tarsky Paradox, Stan Wagon. Cambridge University Press, 1985.
List of Symbols Ω, 536 2Ω, 536 0,536 (а,Ы 537 lA, 536 Ш. ι Я, 2, 22 d„(o>), 3 r„(«), 5 ί„(ω), 6 Λ/, 8, 357 [jcJ, 537 sgn л:, 537 A - B, 536 Ac, 536 ΑδΒ, 536 /4 с β, 536 5^set, 20 ^0,20 σ(Λθ, 21 s,n SS,21 (Ω, У, Я), 23 ΛηίΛ, 536 у4„4. ^4, 536 λ, 25, 43, 168 Л, 537 ν, 537 /(лО. 33 Я„Ы), 35 D{A\ 35 Я 35 Р*, 37, 47 Я», 37, 47 5", 27, 311 5»,41 ^,41 λ*, 44 Я(ВЫ), 51 'imsupn /1„, 52 liminfn А„, 52 Нт„Л„, 52 i.o., 53 Т, 62, 287 Я1, 537 Я*, 539 [* = *], 67 с(ЛГ), 68, 255 μ, 73, 160, 256 Е[Х], 76, 273 Varl АГ], 78, 275 £Д/], 87 *Дя), 93 τ, 99, 133, 464, 508 Р:р Ш 5, Ш «,, in ir„ 124 МО), 146, 278,285 Як, 158 ^ΓιΩ0, 159 хк\х, 160,537 * = Σ***, 160 μ*, 165 Ж μ*), 165 λ,, 168 А,, 171 F, 175, 177 hAF, 176 Г"'/4', 537 57^', 182 Я„=»Я, 191,327, 378 dfU), 228 XX Υ, 231 Й"Х ^, 231 μ Χυ, 233
586 LIST OF SYMBOLS *,266 X„-*PX, 70,268,330 11/11, 249 II/IU 241 L", 241 μ„ ~ μ, 327, 378 X„ =* X, 329, 378 X„=>a, 331 a A, 538 φ(ί), 342 μ„-»,μ, 371 Fs, 414 FK, 414 ν <κ μ, 422 dv/d/u, 423 ",. 424 "ac> «4 F\A\\<0\ 428, 430 Я[/1||ЛГ„ геГ], 433 E[X\\d?], 445 Rr, 484 i?7, 485 W„ 498
Index Here Ал refers to paragraph η of the Appendix (p. 536): u.v refers to Problem ν in Section u, or else to a note on it (Notes on the Problems, p. 552); the other references are to pages. Greek letters are alphabetized by their Roman equivalents (m for μ, and so on). Names in the bibliography are not indexed separately. Absolute continuity, 413, 422 Absolutely continuous part, 425 Absolute moment, 274 Absoibing state, 112 Adapted σ-fields, 458 Additive set function, 420 Additivity: countable, 23, 161 finite, 23, 161 Admissible, 248, 252 Affine transformation, 172 Algebra, 19 Almost everywhere, 60 Almost surely, 60 α-mixing, 363, 29.10 Aperiodic, 125 Approximation of measure, 168 Area over the curve, 79 Area under the curve, 203 Asymptotic equipartition property, 91, 144 Asymptotic relative frequency, 8 Atom, 271 Autoregression, 495 Axiom of choice, 21, 45 Beppo Levi theorem, 16.3 Bernoulli-Laplace model of diffusion, 112 Bernoulli shift, 311 Bernoulli trials, 75 Bernstein polynomial, 87 Betting system, 98 Binary digit, 3 Binomial distribution, 256 Blackwell-Rao theorem, 455 Bold play, 102 Boole's inequality, 25 Bore!, 9 Borel-Cantelli lemmas, 59, 60 Borel function, 183 Borel normal number theorem, 9 Borel paradox, 441 Borei set, 22, 158 Boundary, All Bounded convergence theorem, 210 Bounded variation, 415 Branching process, 461 Britannica, 552 Brownian motion, 498 Burstin's theorem, 22.14 Baire category, 1 10, A15 Baire function, 13.7 Banach limits, 3.8, 19.3 Banach space, 243 Banach-Tarski paradox, 180 Bayes estimation, 475 Bayes risk, 248, 251 Benford law, 25.3 Canonical measure, 372 Canonical representation, 372 Cantelli inequality, 5.5 Cantelli theorem, 6.6 Cantor function, 31.2, 31.15 Cantor set, 1.5 Cardinality of σ-fieids, 2.12, 2.22 Cartesian product, 231
588 INDEX Category, A15, 1 10 Cauchy distribution, 20 14, 348 Cauchy equation, A20, 14.7 Cavalieri principle, 18.8 Central limit theorem, 291, 357, 385, 391, 34 17,475 Cesaro averages, A30, 20.23 Change of variable, 215, 224, 225, 274 Characteristic function, 342 Chebyshev inequality, 5, 80, 276 Chernoff theorem, 151 Chi-squared distribution, 20.15 Chi-squared statistic, 29.8 Circular Lebesgue measure, 13 12, 313 Class of sets, 18 Closed set, Al 1 Closed set of states, 8 21 Closed support, 12.9 Closure, All Cocountable set, 21 Cofinite set, 20 Collective, 109 Compact, A13 Complement, Al Completely normal number, 6.13 Complete space or measure, 44, 10.5 Completion, 3 10, 10.5 Complex functions, integration of, 218 Compound Poisson distribution, 28.3 Compound Poisson process, 32.7 Concentrated, 161 Conditional distribution, 439, 449 Conditional expected value, 133, 445 Conditional probability, 51, 427, 33.5 Congruent by dissection, 179 Conjugate index, 242 Conjugate space, 244 Consistency conditions for finite-dimensional distributions, 483 Content, 3.15 Continued·fraction transformation, 319, A36 Continuity from above, 25 Continuity of paths, 500 Continuum hypothesis, 46 Conventions involving °°, 160 Convergence in distribution, 329, 378 Convergence in mean, 243 Convergence in measure, 268 Convergence in probability, 70, 268, 330 Convergence with probability 1, 70, 330 Convergence of random series, 289 Convergence of types, 193 Convex functions, A32 Convolution, 266 Coordinate function, 27, 484 Coordinate variable, 484 Countable, 8 Countable additivity, 23, 161 Countable subadditivity, 25, 162 Countably generated σ-field, 2 11 Countably infinite, 8 Counting measure, 161 Coupled chain, 126 Coupon problem, 362 Covariance, 277 Cover, A3 Cramer-Wold theorem, 383 Cylinder, 27, 485 Danieii-Stone theorem, Π 14, 16 12 Darboux-Young definition, 15.2 Δ-distribution, 192 Decision theory, 247 Decomposition, A3 de Finetti theorem, 473 Definite integral, 200 Degenerate distribution function, 193 Delta method, 359 DeMoive-Laplace theorem, 25.11, 358 DeMorgan !aw, A6 Dense, A15 Density of measure, 213, 422 Density point, 31.9 Density of random variable or distribution, 257, 260 Density of set of integers, 2.18 Denumerable probabilities, 51 Dependent random variables, 363 Derivatives of integrals, 402 Diagonal method, 29, A14 Difference equation, A19 Difference set, Al Diophantine approximation, 13, 324 Dirichlet theorem, 13, A26 Discontinuity of the first kind, 534 Discrete measure, 23, 161 Discrete random variable, 256 Discrete space, 1.1, 23, 5.16 Disjoint, A3 Disjoint supports, 410, 421 Distribution: of random variable, 73, 187, 256 of random vector, 259 Distribution function, 175, 188, 256, 259, 409 Dominated convergence theorem, 78, 209 Dominated measure, 422 Double exponential distribution, 348 Double integral, 233
INDEX 589 Double series, A27 Doubly stochastic matrix, 8.20 Dual space, 245 Dubins-Savage theorem, 102 Dyadic expansion, 3, A31 Dyadic interval, 4 Dyadic transformation, 313 Dynkin's ir-A theorem, 42 e-δ definition of absolute continuity, 422 Egorov theorem, 13 9 Eigenvalues, 8.26 Empirical distribution function, 268 Empty set, Al Entropy, 57, 6.14, 8.31, 31.17 Equicontinuous, 355 Equivalence class, 58 Equivalent measures, 422 Erdos-Kac central limit theorem, 395 Ergodic theorem, 314 Erlang density, 23.2 Essential supremum, 241 Estimation, 251, 452 Etemadi, 282, 288, 22.15 Euclidean distance, A16 Euclidean space, ΑΙ, Α16 Euler function, 2.18 Event, 18 Excessive function, 134 Exchangeable, 473 Existence of independent sequences, 73, 265 Existence of Markov chains, 115 Expected value, 76, 273 Exponential convergence, 131, 8.18 Exponential distribution, 189, 258, 297, 348 Extension of measure, 36, 166, 11.1 Extremal distribution, 195 Factorization and sufficiency, 450 Fair game, 92, 463 Fatou lemma, 209 Field, 19, 2.5 Filtration, 458 Finite additivity, 20, 23, 2.15, 3 8, 161 Finite or countable, 8 Finite-dimensional distributions, 308, 482 Finite-dimensional sets, 485 Finitely additive field, 20 Finite subadditivity, 24, 162 First Borel-Cantelli lemma, 59 First category, 1.10, A15 First passage, 118 fixed discontinuity, 303 Fourier representation, 250 Fourier series, 351, 26.30 Fourier transform, 342 Frequency, 8 Fubini theorem, 233 Functional central limit theorem, 522 Fundamental in probability, 20 21 Fundamental set, 320 Fundamental theorem of calculus, 224, 400 Fundamental theorem of Diophantine approximation, 324 Gambling policy, 98 Gamma distribution, 20.17 Gamma function, 18 18 Generated σ· field, 21 Glivenko-Cantelli theorem, 269 Goncharov's theorem, 361 Hahn decomposition, 420 Hamel basis, 14 7 Hardy-Ramanujan theorem, 6.16 Heine-Borel theorem, A13, A17 Hewitt-Savage zero-one law, 496 Hubert space, 249 Hitting time, 136 Holder's inequality, 80, 5 9, 242, 276 Hypothesis testing, 151 Inadequacy of £%T, 492 Inclusion-exclusion formula, 24, 163 Indefinite integral, 400 Identically distributed, 85 Independent classes, 55 Independent events, 53 Independent increments, 299, 498 Independent random variables, 71, 261 Independent random vectors, 263 Indicator, A5 Infinitely divisible distributions, 371 Infinitely often, 53 Infinite series, A25 Information, 57 Initial digit problem, 25.3 Initial probabilities, 111 Inner boundary, 64 Inner measure, 37, 3.2 Integrable, 200, 206 Integral, 199 Integral with respect to Lebesgue measure, 221 Integrals of derivatives, 412 Integration by parts, 236 Integration over sets, 212 Integration with respect to a density, 214
590 INDEX Interior, All Interval, A9 Invariance principle, 520 Invariant set, 313 Inverse image, A7, 182 Inversion formula, 346 Irreducible chain, 119 Irregular paths, 504 Iterated integral, 233 Jacobian, 225, 261, 545 Jensen inequality, 80, 276, 449 Jordan decomposition, 421 Jordan measurable, 3.15 A:-dimensionaI Borel set, 158 A:-dimensionaI Lebesgue measure, 171, 177, 17 14, 20.4 Kolmogorov existence theorem, 483 Kolmogorov zero-one law, 63, 287 Landau notation, A18 Laplace distribution, 348 Laplace transform, 285 Large deviations, 148 Lattice distribution, 26.1 Law of the iterated logarithm, 153 Law of large numbers: strong, 9, 11, 85,282 weak, 5, 11, 86,284 Lebesgue decomposition, 414, 425 Lebesgue density theorem, 31.9, 35.15 Lebesgue function, 31.3 Lebesgue integrable, 221, 225 Lebesgue measure, 25, 43, 167, 171, 177 Lebesgue set, 45 Leibniz formula, 17.8 Levy distance, 14.5, 25.4, 26.16 Likelihood ratio, 461, 471 Limit inferior, 52, 4.1 Limit of sets, 52 Lindeberg condition, 359 Lindeberg-Levy theorem, 357 Linear Borel set, 158 Linear functional, 244 Linearity of expected value, 77 Linearity of the integral, 206 Linearly independent reals, 14.7, 30.8 Lipschitz condition, 418 Log-normal distribution, 388 Lower integral, 204, 228 Lower semicontinuous, 29.1 Lower variation, 421 Z,p-space, 241 A-system, 41 Lusin theorem, 17.10 Lyapounov condition, 362 Lyapounov inequality, 81, 277 Mapping theorem, 344, 380 Marginal distribution, 261 Markov chain, 111, 363, 367, 29.11, 429 Markov inequality, 80, 276 Markov process, 435, 510 Markov shift, 312 Markov time, 133 Martingale, 101, 458, 514 Martingale central limit theorem, 475 Martingale convergence theorem, 468 Maximal ergooic theorem, 317 Maximal inequality, 287 Maximal solution, 122 μ.-continuity set, 335, 378 m-dependent, 6 11, 364 Mean value, 26.17 Measurable mapping, 182 Measurable process, 503 Measurable rectangle, 231 Measurable with respect to a a- field, 68, 225 Measurable set, 20, 38, 165 Measurable space, 161 Measure, 22, 160 Measure-preserving transformation, 311 Measure space, 161 Meets, A3 Method of moments, 388, 30.6 Minimal sufficient field, 454 Minimum-variance estimation, 454 Minkowski inequality, 5.10, 242 Mixing, 24.3, 363, 29.10, 34.16 Mixture, 473 p.*-measurable, 165 Moment, 274 Moment generating function, 1.6, 146, 278, 284, 390 Monotone, 24, 162, 206 Monotone class, 43, 3.12 Monotone class theorem, 43 Monotone convergence theorem, 208 M-test, 210, A28 Multidimensional central limit theorem, 385 Multidimensional characteristic function, 381 Multidimensional distribution, 259 Multidimensional normal distribution, 383 Multinomial sampling, 29.8 Negative part, 200, 254 Negligible set, 8, 1.3, 1.9, 44
INDEX Neyman-Pearson lemma, 19.7 Nonatomic, 2.19 Nondenumerable probabilities, 526 Nonmeasurable set, 45, 12 4 Nonnegative series, A25 Norm, 243 Normal distribution, 258, 383 Normal number, 8, 1 8, 86, 6 13 Normal number theorem, 9, 6.9 Nowhere dense, A15 Nowhere differentiabie, 31.18, 505 Null persistent, 130 Number theory, 393 Open set, All Optional sampling theorem, 466 Optimal stopping, 133 Ordei of dyadic interval, 4 Orthogonal projection, 250 Orthonormai, 249 Ottaviani inequality, 22.15 Outer boundary, 64 Outer measure, 37, 3.2, 165 Pairwise disjoint, A3 Partial-fraction expansion, 20.14 Partial information, 57 Partition, A3 Path function, 308, 493, 500 Payoff function, 133 Perfect set, A15 Peano curve, 179 Period, 125 Permutation, 72, 86, 361 Persistent, 117, 120 Phase space, 111 ΤΓ-Λ theorem, 42 /^-measurable, 38 Poincare theorem, 29.9 Point of increase, 12 9, 20.12 Poisson approximation, 302, 328 Poisson distribution, 257, 299, 375 Poisson process, 297, 436 Poisson theorem, 6.5 Polar coordinates, 226, 261 Polya criterion, 26.3 Polya theorem, 118 Positive part, 200, 254 Positive persistent, 130 Power class, 21, Al Power series, A29 Prekopa theorem, 303 Primitive, 224, 400 Probability measure, 22 Probability measure space, 23 Probability transformation, 14 3 Product measure, 28, 12.12, 233, 487 Product space, 27, 231, 484 Projection, 27, 484 Proper difference, Al Proper subset, Al тг-system, 41 Rademacher functions, 5, 289, 291 Radon-Nykodym derivative, 423, 460, 470 Radon-Nykodym theorem, 422, 32.8 Random Taylor series, 292 Random variable, 67, 182, 254 Random vector, 183, 255 Random walk, 112 Rank, 4, 320 Ranks and records, 20.8 Rate of Poisson process, 299 Rational rectangle, 158 Realization of process, 493 Record values, 20.9, 21.3, 22.9, 27 8 Rectangle, 158, A16 Recurrent event, 8.17 Red-and-black, 92 Reflection principle, 511 Regularity, 174 Relative frequency, 8 Relative measure, 25.16 Renewal theory, 8.17, 310 Reversed martingale, 472 Riemann integral, 2, 12, 221, 228, 25 12 Riemann-Lebesgue lemma, 345 Riesz-Fischer theorem, 243 Riesz representation theorem, 17.12, 244 Right continuity, 175, 256 Rigid transformation, 172 Risk, 247, 251 Saitus, 188 Sample function, 188 Sample path, 188, 308 Sample point, 18 Sampling theory, 392 Scheffe theorem, 215 Schwarz inequality, 81, 5 6, 249, 276 Second Borel-Cantelli lemma, 60, 4.11, 88 Second category, 1.10, A15 Second-order Markov chain, 8.32 Secretary problem, 114 Section, 231 Selection problem, 113, 138 Selection system, 95, 7.3 Semicontinuous, 29 1
592 INDEX Semiring, 166 Separable function, 526 Separable process, 527, 531 Separable σ-field, 2.11 Separant, 527 Sequence space, 27, 311 Set function, 22, 420 tr-fieid, 20 <r-field generated by a class of sets, 21 <r-field generated by a random variable, 68, 255 a- finite on a class, 160 <r-finite measure, 160 Shannon theorem, 6.14, 8 31 Signed measure, 32.12 Simple function, 184 Simple random variable, 67, 185, 254 Singleton, Al Singular function, 407 Singularity, 421 Singular part, 425 Skorohod embedding, 513, 519 Skorohod theorem, 333 Southwest, 176, 247 Space, Al, 17 Souare-free integers, 4.15 Stable law, 377 Standard normal distribution, 258 State space, 111 Stationary distribution, 124 Stationary increments, 499 Stationary probabilities, 124 Stationary process, 363, 494 Stationary sequence of random variables, 363 Stationary transition probabilities, 111 Stieitjes integral, 228 Stirling formula, 27.18 Stochastic arithmetic, 2.18, 4.15, 4.16, 5.19, 6.16, 18.17, 25.15, 393, 30.9, 30.10, 30.11 Stochastic matrix, 112 Stochastic process, 298, 308, 482 Stopping time, 99, 133, 464, 465, 508 Strong law of large numbers, 8, 9, 11, 85, 6.8, 282, 312, 27.20 Strong Markov property, 508 Subadditivity: countable, 25, 162 finite, 24, 162 Subfair game, 92, 102 Submartingale, 462 Subset, Al Subsolution, 8.5 Sufficient subfield, 450 Superharmonic function, 134 Supermartingale, 462 Support line, A33 Support of measure, 23, 161 Symmetric difference, Al Symmetric random walk, 113, 35.10 Symmetric stable law, 378 System, 111 Tail σ-fieid, 63, 287, 496 Tarski theorem, 3.8 Taylor series, A29, 292 Thin cylinders, 27 Three series theorem, 290 Tightness, 336, 380 Timid play, 108 Tonelli theorem, 234 Total variation, 421 Trajectory of process, 493 Transformation of measures, 185, 215 Transient, 117, 120 Transition probabilities, 111 Translation invariance, 45, 172 Triangular array, 359 Triangular distribution, 348 Trifling set, 1.3, 1.9,3.15 Type, 193 Unbiased estimate, 251, 454 Uncorrelated random variables, 7, 277 Uniform distribution, 258, 348 Uniform distribution modulo 1, 328, 352 Uniform integrability, 216, 338 Uniformly equicontinuous, 26.15 Uniqueness of extension, 36, 42, 163 Uniqueness theorem for characteristic functions, 346, 382 Uniqueness theorem for moment generating functions, 147, 284, 26.7, 390 Unit interval, 51 Unit mass, 24 Upcrossing, 467 Upper integral, 204, 228 Upper semicontinuous, 29.1 Upper variation, 421 Utility function, 7 12 Vague convergence, 371 Value of payoff function, 134 van der Waerden function, 31.18
INDEX 593 Variance, 78, 275 Version of conditional expected value, 445 Version of conditional probability, 430 Vieta formula, 1.7 Waid equation, 22.8 Weak convergence, 190, 327, 378 Weak law of large numbers, 5, 11, 86, 284 Weierstrass approximation theorem, 87, 26 Weierstrass M-test, 210, A28 Weiner process, 498 With probability 1, 60 Zeroes of Brownian motion, 507 Zero-one law, 63, 117, 8.35, 286, 22.12, 314, 496, 37.4 Zorn's lemma, A8