                    Probability,
Random Variables,
and Stochastic
Processes
Third Edition
Athanasios Papoulis

McGraw-Hill Series in Electrical Engineering Consulting Editor Stephen W. Director, Carnegie-Mellon University Circuits and Systems Communications and Signal Processing Control Theory Electronics and Electronic Circuits Power and Energy Electromagnetics Computer Engineering Introductory Radar and Antennas VLSI Previous Consulting Editors Ronald N. Bracewell, Colin Cherry, James F. Gibbons, Willis W. Harman, Hubert Heffner, Edward W. Herold, John G. Linvill, Simon Ramo, Ronald A. Rohrer, Anthony E. Siegman, Charles Susskind, Frederick E. Terman, John G. Truxal, Ernst Weber, and John R. Whinnery
Communications and Signal Processing Consulting Editor Stephen W. Director, Carnegie-Mellon University Antoniou: Digital Filters: Analysis and Design Candy: Signal Processing: The Model-Based Approach Candy: Signal Processing: The Modern Approach Carlson: Communications Systems: An Introduction to Signals and Noise in Electrical Communication Cherin: An Introduction to Optical Fibers Collin: Antennas and Radiowave Propagation Collin: Foundations for Microwave Engineering Cooper and McGillem: Modern Communications and Spread Spectrum Davenport: Probability and Random Processes: An Introduction for Applied Scientists and Engineers Drake: Fundamentals of Applied Probability Theory Huelsman and Allen: Introduction to the Theory and Design of Active Filters Jong: Method of Discrete Signal and System Analysis Keiser: Local Area Networks Keiser: Optical Fiber Communications Kraus: Antennas Kuc: Introduction to Digital Signal Processing Papoulis: Probability, Random Variables, and Stochastic Processes Papoulis: Signal Analysis Papoulis: The Fourier Integral and Its Applications Peebles: Probability, Random Variables, and Random Signal Principles Proakis: Digital Communications Schwartz: Information Transmission, Modulation, and Noise Schwartz and Shaw: Signal Processing Smith: Modern Communication Circuits Taub and Schilling: Principles of Communication Systems
PROBABILITY, RANDOM VARIABLES, AND STOCHASTIC PROCESSES Third Edition Athanasios Papoulis Polytechnic Institute of New York McGraw-Hill, Inc. New York St. Louis San Francisco Auckland Bogota Caracas Hamburg Lisbon London Madrid Mexico Milan Montreal New Delhi Paris San Juan Sao Paulo Singapore Sydney Tokyo Toronto
This book was set in Times Roman by Science Typographers, Inc. The editors were Roger L. Howell and John M. Morriss; the production supervisor was Richard A. Ausburn. The cover was designed by Joseph Gillians. Project supervision was done by Science Typographers, Inc. R. R. Donnelley & Sons Company was printer and binder.

PROBABILITY, RANDOM VARIABLES, AND STOCHASTIC PROCESSES

Copyright © 1991, 1984, 1965 by McGraw-Hill, Inc. All rights reserved. Printed in the United States of America. Except as permitted under the United States Copyright Act of 1976, no part of this publication may be reproduced or distributed in any form or by any means, or stored in a data base or retrieval system, without the prior written permission of the publisher.

1234567890 DOC DOC 90987654321

Library of Congress Cataloging-in-Publication Data
Papoulis, Athanasios, (date). Probability, random variables, and stochastic processes / Athanasios Papoulis. 3rd ed. p. cm. (McGraw-Hill series in electrical engineering. Communications and signal processing) Includes bibliographical references and index. ISBN 0-07-048477-5 1. Probabilities. 2. Random variables. 3. Stochastic processes. II. Series. 1991 90-23127
CONTENTS

Preface to the Third Edition
Preface to the Second Edition
Preface to the First Edition

Part I Probability and Random Variables

1 The Meaning of Probability
  1-1 Introduction
  1-2 The Definitions
  1-3 Probability and Induction
  1-4 Causality versus Randomness
  Concluding Remarks

2 The Axioms of Probability
  2-1 Set Theory
  2-2 Probability Space
  2-3 Conditional Probability
  Problems

3 Repeated Trials
  3-1 Combined Experiments
  3-2 Bernoulli Trials
  3-3 Asymptotic Theorems
  3-4 Poisson Theorem and Random Points
  Problems

4 The Concept of a Random Variable
  4-1 Introduction
  4-2 Distribution and Density Functions
  4-3 Special Cases
  4-4 Conditional Distributions and Total Probability
  Problems

5 Functions of One Random Variable
  5-1 The Random Variable g(x)
  5-2 The Distribution of g(x)
  5-3 Mean and Variance
  5-4 Moments
  5-5 Characteristic Functions
  Problems

6 Two Random Variables
  6-1 Bivariate Distributions
  6-2 One Function of Two Random Variables
  6-3 Two Functions of Two Random Variables
  Problems

7 Moments and Conditional Statistics
  7-1 Joint Moments
  7-2 Joint Characteristic Functions
  7-3 Conditional Distributions
  7-4 Conditional Expected Values
  7-5 Mean Square Estimation
  Problems

8 Sequences of Random Variables
  8-1 General Concepts
  8-2 Conditional Densities, Characteristic Functions, and Normality
  8-3 Mean Square Estimation
  8-4 Stochastic Convergence and Limit Theorems
  8-5 Random Numbers: Meaning and Generation
  Problems

9 Statistics
  9-1 Introduction
  9-2 Parameter Estimation
  9-3 Hypothesis Testing
  Problems

Part II Stochastic Processes

10 General Concepts
  10-1 Definitions
  10-2 Systems with Stochastic Inputs
  10-3 The Power Spectrum
  10-4 Digital Processes
  Appendix 10A Continuity, Differentiation, Integration
  Appendix 10B Shift Operators and Stationary Processes
  Problems

11 Basic Applications
  11-1 Random Walk, Brownian Motion, and Thermal Noise
  11-2 Poisson Points and Shot Noise
  11-3 Modulation
  11-4 Cyclostationary Processes
  11-5 Bandlimited Processes and Sampling Theory
  11-6 Deterministic Signals in Noise
  11-7 Bispectra and System Identification
  Appendix 11A The Poisson Sum Formula
  Appendix 11B Schwarz's Inequality
  Problems

12 Spectral Representation
  12-1 Factorization and Innovations
  12-2 Finite-Order Systems and State Variables
  12-3 Fourier Series and Karhunen-Loève Expansions
  12-4 Spectral Representation of Random Processes
  Problems

13 Spectral Estimation
  13-1 Ergodicity
  13-2 Spectral Estimation
  13-3 Extrapolation and System Identification
  Appendix 13A Minimum-Phase Functions
  Appendix 13B All-Pass Functions
  Problems

14 Mean Square Estimation
  14-1 Introduction
  14-2 Prediction
  14-3 Filtering and Prediction
  14-4 Kalman Filters
  Problems

15 Entropy
  15-1 Introduction
  15-2 Basic Concepts
  15-3 Random Variables and Stochastic Processes
  15-4 The Maximum Entropy Method
  15-5 Coding
  15-6 Channel Capacity
  Problems

16 Selected Topics
  16-1 The Level-Crossing Problem
  16-2 Queueing Theory
  16-3 Shot Noise
  16-4 Markoff Processes
  Problems

Bibliography
Index
PREFACE TO THE THIRD EDITION

In this edition, about a third of the text is either new or substantially revised. The new topics include the following:

A chapter on statistics. With this addition, the first nine chapters of the book could form the basis for a senior-graduate course in probability and statistics.

A chapter on spectral estimation. This chapter starts with an expanded treatment of ergodicity and it covers the fundamentals of parametric and nonparametric estimation in the context of system identification.

A section on the meaning and generation of random numbers. This material is essential for the understanding of computer simulation of random phenomena and the use of statistics in the solution of deterministic problems (Monte Carlo techniques).

Other topics include bispectra, state variables and vector processes, factorization, and spectral representation.

I wrote the first edition of this book long ago. My objective was to develop the subject of probability and stochastic processes as a deductive discipline and to illustrate the theory with basic applications of general interest. I tried to stress clarity and economy, avoiding sophisticated mathematics or, at the other extreme, detailed discussion of practical applications. It appears that this approach met with some success. For over a quarter of a century, the book has been used as a basic text and standard reference not only in this country but throughout the world. I am deeply grateful.

McGraw-Hill and I would like to thank the following reviewers for their many helpful comments and suggestions: John Adams, Lehigh University; David Anderson, University of Michigan; V. Krishnan, University of Lowell; Robert J. Mulholland, University of Oklahoma; Stephen Sebo, Ohio State University; and Samir S. Soliman, Southern Methodist University.

Athanasios Papoulis
PREFACE TO THE SECOND EDITION

This is an extensively revised edition reflecting the developments of the last two decades. Several new topics are added, important areas are strengthened, and sections of limited interest are eliminated. Most additions, however, deal with applications; the first ten chapters are essentially unchanged. In the selection of the new material I have attempted to concentrate on subjects that not only are of current interest, but also contribute to a better understanding of the basic properties of stochastic processes.

The new material includes the following:

Discrete-time processes with applications in system theory
Innovations, factorization, spectral representation
Queueing theory, level crossings, spectra of FM signals, sampling theory
Mean square estimation, orthonormal expansions, Levinson’s algorithm, Wold’s decomposition, Wiener, lattice, and Kalman filters
Spectral estimation, windows, extrapolation, Burg’s method, detection of line spectra

This book concludes with a self-contained chapter on entropy developed axiomatically from first principles. It is presented in the context of earlier chapters, and it includes the method of maximum entropy in parameter estimation and elements of coding theory.

As in the first edition, I made a special effort to stress the conceptual difference between mental constructs and physical reality. This difference is summarized in the following paragraph, taken from the first edition:

Scientific theories deal with concepts, not with reality. All theoretical results are derived from certain axioms by deductive logic. In physical sciences the theories are so formulated as to correspond in some useful sense to the real world, whatever that may mean. However, this correspondence is approximate, and the physical justification of all theoretical conclusions is based on some form of inductive reasoning.

Responding to comments by a number of readers over the years, I would like to emphasize that this passage in no way questions the existence of natural laws (patterns). It is merely a reminder of the fundamental difference between concepts and reality.

During the preparation of the manuscript I had the benefit of lengthy discussions with a number of colleagues and friends. I thank in particular Hans Schreiber of Grumman, William Shanahan of Norden Systems, and my colleagues Frank Cassara and Basil Maglaris for their valuable suggestions. I wish also to express my appreciation to Mrs. Nina Adamo for her expert typing of the manuscript.

Athanasios Papoulis
PREFACE TO THE FIRST EDITION

Several years ago I reached the conclusion that the theory of probability should no longer be treated as adjunct to statistics or noise or any other terminal topic, but should be included in the basic training of all engineers and physicists as a separate course. I made then a number of observations concerning the teaching of such a course, and it occurs to me that the following excerpts from my early notes might give you some insight into the factors that guided me in the planning of this book:

“Most students, brought up with a deterministic outlook of physics, find the subject unreliable, vague, difficult. The difficulties persist because of inadequate definition of the first principles, resulting in a constant confusion between assumptions and logical conclusions. Conceptual ambiguities can be removed only if the theory is developed axiomatically. They say that this approach would require measure theory, would reduce the subject to a branch of mathematics, would force the student to doubt his intuition leaving him without convincing alternatives, but I don’t think so. I believe that most concepts needed in the applications can be explained with simple mathematics, that probability, like any other theory, should be viewed as a conceptual structure and its conclusions should rely not on intuition but on logic. The various concepts must, of course, be related to the physical world, but such motivating sections should be separated from the deductive part of the theory. Intuition will thus be strengthened, but not at the expense of logical rigor.

“There is an obvious lack of continuity between the elements of probability as presented in introductory courses, and the sophisticated concepts needed in today’s applications. How can the average student, equipped only with the probability of cards and dice, understand prediction theory or harmonic analysis?
The applied books give at most a brief discussion of background material; their objective is not the use of the applications to strengthen the student’s understanding of basic concepts, but rather a detailed discussion of special topics.
“Random variables, transformations, expected values, conditional densities, characteristic functions cannot be mastered with mere exposure. These concepts must be clearly defined and must be developed, one at a time, with sufficient elaboration. Special topics should be used to illustrate the theory, but they must be so presented as to minimize peripheral, descriptive material and to concentrate on probabilistic content. Only then the student can learn a variety of applications with economy and perspective.”

I realized that to teach a convincing course, a course that is not a mere presentation of results but a connected theory, I would have to reexamine not only the development of special topics, but also the proofs of many results and the method of introducing the first principles.

“The theory must be mathematical (deductive) in form but without the generality or rigor of mathematics. The philosophical meaning of probability must somehow be discussed. This is necessary to remove the mystery associated with probability and to convince the student of the need for an axiomatic approach and a clear distinction between assumptions and logical conclusions. The axiomatic foundation should not be a mere appendix but should be recognized throughout the theory.

“Random variables must be defined as functions with domain an abstract set of experimental outcomes and not as points on the real line. Only then infinitely dimensional spaces are avoided and the extension to stochastic processes is simplified.

“The inadequacy of averages as definitions and the value of an underlying space is most obvious in the treatment of stochastic processes. Time averages must be introduced as stochastic integrals, and their relationship to the statistical parameters of the process must be established only in the form of ergodicity.
“The emphasis on second-order moments and spectra, utilizing the student’s familiarity with systems and transform techniques, is justified by the current needs.

“Mean-square estimation (prediction and filtering), a topic of considerable importance, needs a basic reexamination. It is best understood if it is divorced from the details of integral equations or the calculus of variations, and is presented as an application of the orthogonality principle (linear regression), simply explained in terms of random variables.

“To preserve conceptual order, one must sacrifice continuity of special topics, introducing them as illustrations of the general theory.”

These ideas formed the framework of a course that I taught at the Polytechnic Institute of Brooklyn. Encouraged by the students’ reaction, I decided to make it into a book. I should point out that I did not view my task as an impersonal presentation of a complete theory, but rather as an effort to explain the essence of this theory to a particular group of students. The book is written neither for the handbook-oriented students nor for the sophisticated few who can learn the subject from advanced mathematical texts. It is written for the majority of engineers and physicists who have sufficient maturity to appreciate and follow a logical presentation, but, because of their limited mathematical background, would find a book such as Doob’s too difficult for a beginning text.
Although I have included many useful results, some of them new, my hope is that the book will be judged not for completeness but for organization and clarity. In this context I would like to anticipate a criticism and explain my approach. Some readers will find the proofs of many important theorems lacking in rigor. I emphasize that it was not out of negligence, but after considerable thought, that I decided to give, in several instances, only plausibility arguments. I realize too well that “a proof is a proof or it is not.” However, a rigorous proof must be preceded by a clarification of the new idea and by a plausible explanation of its validity. I felt that, for the purpose of this book, the emphasis should be placed on explanation, facility, and economy. I hope that this approach will give you not only a working knowledge, but also an incentive for a deeper study of this fascinating subject.

Although I have tried to develop a personal point of view in practically every topic, I recognize that I owe much to other authors. In particular, the books “Stochastic Processes” by J. L. Doob and “Théorie des Fonctions Aléatoires” by A. Blanc-Lapierre and R. Fortet influenced greatly my planning of the chapters on stochastic processes.

Finally, it is my pleasant duty to express my sincere gratitude to Mischa Schwartz for his encouragement and valuable comments, to Ray Pickholtz for his many ideas and constructive suggestions, and to all my colleagues and students who guided my efforts and shared my enthusiasm in this challenging project.

Athanasios Papoulis
PART I PROBABILITY AND RANDOM VARIABLES
CHAPTER 1
THE MEANING OF PROBABILITY

1-1 INTRODUCTION

The theory of probability deals with averages of mass phenomena occurring sequentially or simultaneously: electron emission, telephone calls, radar detection, quality control, system failure, games of chance, statistical mechanics, turbulence, noise, birth and death rates, and queueing theory, among many others.

It has been observed that in these and other fields certain averages approach a constant value as the number of observations increases and this value remains the same if the averages are evaluated over any subsequence specified before the experiment is performed. In the coin experiment, for example, the percentage of heads approaches 0.5 or some other constant, and the same average is obtained if we consider every fourth, say, toss (no betting system can beat the roulette).

The purpose of the theory is to describe and predict such averages in terms of probabilities of events. The probability of an event $\mathcal{A}$ is a number $P(\mathcal{A})$ assigned to this event. This number could be interpreted as follows:

If the experiment is performed $n$ times and the event $\mathcal{A}$ occurs $n_{\mathcal{A}}$ times, then, with a high degree of certainty, the relative frequency $n_{\mathcal{A}}/n$ of the occurrence of $\mathcal{A}$ is close to $P(\mathcal{A})$:

$$P(\mathcal{A}) \simeq n_{\mathcal{A}}/n \qquad (1\text{-}1)$$

provided that $n$ is sufficiently large.
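The frequency interpretation (1-1) and the remark about subsequences can be illustrated with a simulated coin. The following is only a sketch: the pseudorandom generator stands in for the physical experiment, and the seed, the number of tosses, and the helper name `relative_frequency` are assumptions made for the illustration, not part of the theory.

```python
import random


def relative_frequency(tosses, step=1):
    """Fraction of heads among every `step`-th toss of the sequence."""
    selected = tosses[::step]
    return sum(selected) / len(selected)


# A simulated fair coin; the seed is an arbitrary choice for reproducibility.
rng = random.Random(0)
tosses = [rng.random() < 0.5 for _ in range(100_000)]

# The relative frequency n_A / n for a growing number n of tosses.
for n in (100, 1_000, 10_000, 100_000):
    print(n, relative_frequency(tosses[:n]))

# The same average over a subsequence fixed in advance: every fourth toss.
print("every fourth toss:", relative_frequency(tosses, step=4))
```

For large n both averages should be close to 0.5, although, as in the physical experiment, neither the code nor the theory says exactly how close.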
This interpretation is imprecise: The terms “with a high degree of certainty,” “close,” and “sufficiently large” have no clear meaning. However, this lack of precision cannot be avoided. If we attempt to define in probabilistic terms the “high degree of certainty” we shall only postpone the inevitable conclusion that probability, like any physical theory, is related to physical phenomena only in inexact terms. Nevertheless, the theory is an exact discipline developed logically from clearly defined axioms, and when it is applied to real problems, it works.

OBSERVATION, DEDUCTION, PREDICTION. In the applications of probability to real problems, the following steps must be clearly distinguished:

Step 1 (physical) We determine by an inexact process the probabilities $P(\mathcal{A}_i)$ of certain events $\mathcal{A}_i$.
This process could be based on the relationship (1-1) between probability and observation: The probabilistic data $P(\mathcal{A}_i)$ equal the observed ratios $n_{\mathcal{A}_i}/n$. It could also be based on “reasoning” making use of certain symmetries: If, out of a total of $N$ outcomes, there are $N_{\mathcal{A}}$ outcomes favorable to the event $\mathcal{A}$, then $P(\mathcal{A}) = N_{\mathcal{A}}/N$.
For example, if a loaded die is rolled 1000 times and five shows 203 times, then the probability of five equals about 0.2. If the die is fair, then, because of its symmetry, the probability of five equals 1/6.

Step 2 (conceptual) We assume that probabilities satisfy certain axioms, and by deductive reasoning we determine from the probabilities $P(\mathcal{A}_i)$ of certain events $\mathcal{A}_i$ the probabilities $P(\mathcal{B}_j)$ of other events $\mathcal{B}_j$.
For example, in the game with a fair die we deduce that the probability of the event even equals 3/6. Our reasoning is of the following form:

If $P(1) = \cdots = P(6) = \frac{1}{6}$ then $P(\text{even}) = \frac{3}{6}$

Step 3 (physical) We make a physical prediction based on the numbers $P(\mathcal{B}_j)$ so obtained.
This step could rely on (1-1) applied in reverse: If we perform the experiment $n$ times and an event $\mathcal{B}$ occurs $n_{\mathcal{B}}$ times, then $n_{\mathcal{B}} \simeq nP(\mathcal{B})$.
If, for example, we roll a fair die 1000 times, our prediction is that even will show about 500 times.

We could not emphasize too strongly the need for separating the above three steps in the solution of a problem. We must make a clear distinction between the data that are determined empirically and the results that are deduced logically.

Steps 1 and 3 are based on inductive reasoning. Suppose, for example, that we wish to determine the probability of heads of a given coin. Should we toss the coin 100 or 1000 times? If we toss it 1000 times and the average number of heads equals 0.48 what kind of prediction can we make on the basis of this observation? Can we deduce that at the next 1000 tosses the number of heads will be about 480? Such questions can be answered only inductively.

In this book, we consider mainly step 2, that is, from certain probabilities we derive deductively other probabilities. One might argue that such derivations
are mere tautologies because the results are contained in the assumptions. This is true in the same sense that the intricate equations of motion of a satellite are included in Newton’s laws.

To conclude, we repeat that the probability $P(\mathcal{A})$ of an event $\mathcal{A}$ will be interpreted as a number assigned to this event as mass is assigned to a body or resistance to a resistor. In the development of the theory, we will not be concerned about the “physical meaning” of this number. This is what is done in circuit analysis, in electromagnetic theory, in classical mechanics, or in any other scientific discipline. These theories are, of course, of no value to physics unless they help us solve real problems. We must assign specific, if only approximate, resistances to real resistors and probabilities to real events (step 1); we must also give physical meaning to all conclusions that are derived from the theory (step 3). But this link between concepts and observation must be separated from the purely logical structure of each theory (step 2).

As an illustration, we discuss in the next example the interpretation of the meaning of resistance in circuit theory.

Example 1-1. A resistor is commonly viewed as a two-terminal device whose voltage is proportional to the current

$$R = \frac{v(t)}{i(t)} \qquad (1\text{-}2)$$

This, however, is only a convenient abstraction. A real resistor is a complex device with distributed inductance and capacitance having no clearly specified terminals. A relationship of the form (1-2) can, therefore, be claimed only within certain errors, in certain frequency ranges, and with a variety of other qualifications. Nevertheless, in the development of circuit theory we ignore all these uncertainties. We assume that the resistance $R$ is a precise number satisfying (1-2) and we develop a theory based on (1-2) and on Kirchhoff's laws. It would not be wise, we all agree, if at each stage of the development of the theory we were concerned with the true meaning of $R$.
1-2 THE DEFINITIONS

In this section, we discuss various definitions of probability and their roles in our investigation.

Axiomatic Definition

We shall use the following concepts from set theory (for details see Chap. 2): The certain event $\mathcal{S}$ is the event that occurs in every trial. The union $\mathcal{A} + \mathcal{B}$ of two events $\mathcal{A}$ and $\mathcal{B}$ is the event that occurs when $\mathcal{A}$ or $\mathcal{B}$ or both occur. The intersection $\mathcal{A}\mathcal{B}$ of the events $\mathcal{A}$ and $\mathcal{B}$ is the event that occurs when both events $\mathcal{A}$ and $\mathcal{B}$ occur. The events $\mathcal{A}$ and $\mathcal{B}$ are mutually exclusive if the occurrence of one of them excludes the occurrence of the other.
We shall illustrate with the die experiment: The certain event is the event that occurs whenever any one of the six faces shows. The union of the events even and less than 3 is the event 1 or 2 or 4 or 6 and their intersection is the event 2. The events even and odd are mutually exclusive.

The axiomatic approach to probability is based on the following three postulates and on nothing else: The probability $P(\mathcal{A})$ of an event $\mathcal{A}$ is a positive number assigned to this event

$$P(\mathcal{A}) \ge 0 \qquad (1\text{-}3)$$

The probability of the certain event equals 1:

$$P(\mathcal{S}) = 1 \qquad (1\text{-}4)$$

If the events $\mathcal{A}$ and $\mathcal{B}$ are mutually exclusive, then

$$P(\mathcal{A} + \mathcal{B}) = P(\mathcal{A}) + P(\mathcal{B}) \qquad (1\text{-}5)$$

This approach to probability is relatively recent (A. Kolmogoroff,† 1933). However, in our view, it is the best way to introduce probability even in elementary courses. It emphasizes the deductive character of the theory, it avoids conceptual ambiguities, it provides a solid preparation for sophisticated applications, and it offers at least a beginning for a deeper study of this important subject.

The axiomatic development of probability might appear overly mathematical. However, as we hope to show, this is not so. The elements of the theory can be adequately explained with basic calculus.

Relative Frequency Definition

The relative frequency approach is based on the following definition: The probability $P(\mathcal{A})$ of an event $\mathcal{A}$ is the limit

$$P(\mathcal{A}) = \lim_{n \to \infty} \frac{n_{\mathcal{A}}}{n} \qquad (1\text{-}6)$$

where $n_{\mathcal{A}}$ is the number of occurrences of $\mathcal{A}$ and $n$ is the number of trials.

This definition appears reasonable. Since probabilities are used to describe relative frequencies, it is natural to define them as limits of such frequencies. The problems associated with a priori definitions are eliminated, one might think, and the theory is founded on observation.
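Returning for a moment to the three postulates (1-3) to (1-5), the die illustration given with them can be checked mechanically. The sketch below represents events as sets of faces and assigns probabilities by counting; the equal weighting of the six faces is an assumption (a fair die, step 1), and the names `OUTCOMES` and `prob` are hypothetical helpers introduced only for this illustration.

```python
from fractions import Fraction

# The certain event of the die experiment: one of the six faces shows.
OUTCOMES = frozenset(range(1, 7))


def prob(event):
    """Probability of an event (a set of faces), assuming a fair die."""
    return Fraction(len(event & OUTCOMES), len(OUTCOMES))


even = frozenset({2, 4, 6})
odd = frozenset({1, 3, 5})
less_than_3 = frozenset({1, 2})

union = even | less_than_3         # the event 1 or 2 or 4 or 6
intersection = even & less_than_3  # the event 2

assert prob(even) >= 0                             # postulate (1-3)
assert prob(OUTCOMES) == 1                         # postulate (1-4)
assert not (even & odd)                            # even and odd are mutually exclusive
assert prob(even | odd) == prob(even) + prob(odd)  # postulate (1-5)

print(sorted(union), sorted(intersection), prob(even))
```

Exact fractions are used instead of floating-point numbers so that the additivity check in (1-5) holds as an equality rather than within a tolerance.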
†A. Kolmogoroff: Grundbegriffe der Wahrscheinlichkeitsrechnung, Ergeb. Math. und ihrer Grenzg., vol. 2, 1933.

However, although the relative frequency concept is fundamental in the applications of probability (steps 1 and 3), its use as the basis of a deductive theory (step 2) must be challenged. Indeed, in a physical experiment, the numbers $n_{\mathcal{A}}$ and $n$ might be large but they are only finite; their ratio cannot, therefore, be equated, even approximately, to a limit. If (1-6) is used to define
$P(\mathcal{A})$, the limit must be accepted as a hypothesis, not as a number that can be determined experimentally.

Early in the century, Von Mises† used (1-6) as the foundation for a new theory. At that time, the prevailing point of view was still the classical, and his work offered a welcome alternative to the a priori concept of probability, challenging its metaphysical implications and demonstrating that it leads to useful conclusions mainly because it makes implicit use of relative frequencies based on our collective experience. The use of (1-6) as the basis for deductive theory has not, however, enjoyed wide acceptance even though (1-6) relates $P(\mathcal{A})$ to observed frequencies. It has generally been recognized that the axiomatic approach (Kolmogoroff) is superior.

We shall venture a comparison between the two approaches using as illustration the definition of the resistance $R$ of an ideal resistor. We can define $R$ as a limit

$$R = \lim_{n \to \infty} \frac{e(t)}{i_n(t)}$$

where $e(t)$ is a voltage source and $i_n(t)$ are the currents of a sequence of real resistors that tend in some sense to an ideal two-terminal element. This definition might show the relationship between real resistors and ideal elements but the resulting theory is complicated. An axiomatic definition of $R$ based on Kirchhoff's laws is, of course, preferable.

Classical Definition

For several centuries, the theory of probability was based on the classical definition. This concept is used today to determine probabilistic data and as a working hypothesis. In the following, we explain its significance.

According to the classical definition, the probability $P(\mathcal{A})$ of an event $\mathcal{A}$ is determined a priori without actual experimentation: It is given by the ratio

$$P(\mathcal{A}) = \frac{N_{\mathcal{A}}}{N} \qquad (1\text{-}7)$$

where $N$ is the number of possible outcomes and $N_{\mathcal{A}}$ is the number of outcomes that are favorable to the event $\mathcal{A}$.

In the die experiment, the possible outcomes are six and the outcomes favorable to the event even are three; hence P(even) = 3/6.
It is important to note, however, that the significance of the numbers $N$ and $N_{\mathcal{A}}$ is not always clear. We shall demonstrate the underlying ambiguities with the following example.

†Richard Von Mises: Probability, Statistics and Truth, English edition, H. Geiringer, ed., G. Allen and Unwin Ltd., London, 1957.
Example 1-2. We roll two dice and we want to find the probability $p$ that the sum of the numbers that show equals 7.

To solve this problem using (1-7), we must determine the numbers $N$ and $N_{\mathcal{A}}$.

(a) We could consider as possible outcomes the 11 sums 2, 3, ..., 12. Of these, only one, namely the sum 7, is favorable; hence $p = 1/11$. This result is of course wrong.

(b) We could count as possible outcomes all pairs of numbers not distinguishing between the first and the second die. We have now 21 outcomes of which the pairs (3,4), (5,2), and (6,1) are favorable. In this case, $N_{\mathcal{A}} = 3$ and $N = 21$; hence $p = 3/21$. This result is also wrong.

(c) We now reason that the above solutions are wrong because the outcomes in (a) and (b) are not equally likely. To solve the problem “correctly,” we must count all pairs of numbers distinguishing between the first and the second die. The total number of outcomes is now 36 and the favorable outcomes are the six pairs (3,4), (4,3), (5,2), (2,5), (6,1), and (1,6); hence $p = 6/36$.

The above example shows the need for refining definition (1-7). The improved version reads as follows:

The probability of an event equals the ratio of its favorable outcomes to the total number of outcomes provided that all outcomes are equally likely.

As we shall presently see, this refinement does not eliminate the problems associated with the classical definition.

Notes 1. The classical definition was introduced as a consequence of the principle of insufficient reason†: “In the absence of any prior knowledge, we must assume that the events $\mathcal{A}_i$ have equal probabilities.” This conclusion is based on the subjective interpretation of probability as a measure of our state of knowledge about the events $\mathcal{A}_i$. Indeed, if it were not true that the events $\mathcal{A}_i$ have the same probability, then changing their indices we would obtain different probabilities without a change in the state of our knowledge.

2. As we explain in Chap.
15, the principle of insufficient reason is equivalent to the principle of maximum entropy.

CRITIQUE. The classical definition can be questioned on several grounds.

A. The term equally likely used in the improved version of (1-7) means, actually, equally probable. Thus, in the definition, use is made of the concept to be defined. As we have seen in Example 1-2, this often leads to difficulties in determining $N$ and $N_{\mathcal{A}}$.

B. The definition can be applied only to a limited class of problems. In the die experiment, for example, it is applicable only if the six faces have the same probability. If the die is loaded and the probability of four equals 0.2, say, the number 0.2 cannot be derived from (1-7).

†J. Bernoulli, Ars Conjectandi, 1713.
C. It appears from (1-7) that the classical definition is a consequence of logical imperatives divorced from experience. This, however, is not so. We accept certain alternatives as equally likely because of our collective experience. The probabilities of the outcomes of a fair die equal 1/6 not only because the die is symmetrical but also because it was observed in the long history of rolling dice that the ratio $n_{\mathcal{A}}/n$ in (1-1) is close to 1/6. The next illustration is, perhaps, more convincing:

We wish to determine the probability $p$ that a newborn baby is a boy. It is generally assumed that $p = 1/2$; however, this is not the result of pure reasoning. In the first place, it is only approximately true that $p = 1/2$. Furthermore, without access to long records we would not know that the boy-girl alternatives are equally likely regardless of the sex history of the baby's family, the season or place of its birth, or other conceivable factors. It is only after long accumulation of records that such factors become irrelevant and the two alternatives are accepted as equally likely.

D. If the number of possible outcomes is infinite, then to apply the classical definition we must use length, area, or some other measure of infinity for determining the ratio $N_{\mathcal{A}}/N$ in (1-7). We illustrate the resulting difficulties with the following example known as the Bertrand paradox.

Example 1-3. We are given a circle $C$ of radius $r$ and we wish to determine the probability $p$ that the length $l$ of a "randomly selected" chord $AB$ is greater than the side length $r\sqrt{3}$ of the inscribed equilateral triangle. We shall show that this problem can be given at least three reasonable solutions.

I. If the center $M$ of the chord $AB$ lies inside the circle $C_1$ of radius $r/2$ shown in Fig. 1-1a, then $l > r\sqrt{3}$. It is reasonable, therefore, to consider as favorable outcomes all points inside the circle $C_1$ and as possible outcomes all points inside the circle $C$.
Using as measure of their numbers the corresponding areas $\pi r^2/4$ and $\pi r^2$, we conclude that

$$p = \frac{\pi r^2/4}{\pi r^2} = \frac{1}{4}$$

FIGURE 1-1
II. We now assume that the end $A$ of the chord $AB$ is fixed. This reduces the number of possibilities but it has no effect on the value of $p$ because the number of favorable locations of $B$ is reduced proportionately. If $B$ is on the 120° arc $DBE$ of Fig. 1-1b, then $l > r\sqrt{3}$. The favorable outcomes are now the points on this arc and the total outcomes all points on the circumference of the circle $C$. Using as their measures the corresponding lengths $2\pi r/3$ and $2\pi r$, we obtain

$$p = \frac{2\pi r/3}{2\pi r} = \frac{1}{3}$$

III. We assume finally that the direction of $AB$ is perpendicular to the line $FK$ of Fig. 1-1c. As in II, this restriction has no effect on the value of $p$. If the center $M$ of $AB$ is between $G$ and $H$, then $l > r\sqrt{3}$. Favorable outcomes are now the points on $GH$ and possible outcomes all points on $FK$. Using as their measures the respective lengths $r$ and $2r$, we obtain

$$p = \frac{r}{2r} = \frac{1}{2}$$

We have thus found not one but three different solutions for the same problem! One might remark that these solutions correspond to three different experiments. This is true but not obvious and, in any case, it demonstrates the ambiguities associated with the classical definition, and the need for a clear specification of the outcomes of an experiment and the meaning of the terms "possible" and "favorable."

VALIDITY. We shall now discuss the value of the classical definition in the determination of probabilistic data and as a working hypothesis.

A. In many applications, the assumption that there are $N$ equally likely alternatives is well established through long experience. Equation (1-7) is then accepted as self-evident.
For example, "If a ball is selected at random from a box containing $m$ black and $n$ white balls, the probability that it is white equals $n/(m + n)$," or, "If a call occurs at random in the time interval $(0, T)$, the probability that it occurs in the interval $(t_1, t_2)$ equals $(t_2 - t_1)/T$." Such conclusions are, of course, valid and useful; however, their validity rests on the meaning of the word random. The conclusion of the last example that "the unknown probability equals $(t_2 - t_1)/T$" is not a consequence of the "randomness" of the call. The two statements are merely equivalent and they follow not from a priori reasoning but from past records of telephone calls.

B. In a number of applications it is impossible to determine the probabilities of various events by repeating the underlying experiment a sufficient number of times. In such cases, we have no choice but to assume that certain alternatives are equally likely and to determine the desired probabilities from (1-7). This
means that we use the classical definition as a working hypothesis. The hypothesis is accepted if its observable consequences agree with experience, otherwise it is rejected. We illustrate with an important example from statistical mechanics.

Example 1-4. Given $n$ particles and $m > n$ boxes, we place at random each particle in one of the boxes. We wish to find the probability $p$ that in $n$ preselected boxes, one and only one particle will be found.

Since we are interested only in the underlying assumptions, we shall only state the results (the proof is assigned as Prob. 3-15). We also verify the solution for $n = 2$ and $m = 6$. For this special case, the problem can be stated in terms of a pair of dice: The $m = 6$ faces correspond to the $m$ boxes and the $n = 2$ dice to the $n$ particles. We assume that the preselected faces (boxes) are 3 and 4.

The solution to this problem depends on the choice of possible and favorable outcomes. We shall consider the following three celebrated cases:

Maxwell-Boltzmann statistics. If we accept as outcomes all possible ways of placing $n$ particles in $m$ boxes distinguishing the identity of each particle, then

$$p = \frac{n!}{m^n}$$

For $n = 2$ and $m = 6$ the above yields $p = 2/36$. This is the probability for getting 3,4 in the game of two dice.

Bose-Einstein statistics. If we assume that the particles are not distinguishable, that is, if all their permutations count as one, then

$$p = \frac{(m - 1)!\,n!}{(n + m - 1)!}$$

For $n = 2$ and $m = 6$ this yields $p = 1/21$. Indeed, if we do not distinguish between the two dice, then $N = 21$ and $N_{\mathcal{A}} = 1$ because the outcomes 3,4 and 4,3 are counted as one.

Fermi-Dirac statistics. If we do not distinguish between the particles and also we assume that in each box we are allowed to place at most one particle, then

$$p = \frac{n!\,(m - n)!}{m!}$$

For $n = 2$ and $m = 6$ we obtain $p = 1/15$. This is the probability for 3,4 if we do not distinguish between the dice and also we ignore the outcomes in which the two numbers that show are equal.
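The three values for $n = 2$, $m = 6$ can be checked by brute-force enumeration. The following is only an illustrative sketch: the function names and the enumeration strategy are assumptions made here, not part of the text, and the closed forms are those stated above.

```python
from fractions import Fraction
from itertools import combinations, combinations_with_replacement, product
from math import factorial

def maxwell_boltzmann(n, m, target):
    # Distinguishable particles: each ordered assignment of boxes is one outcome.
    outcomes = list(product(range(1, m + 1), repeat=n))
    favorable = [o for o in outcomes if sorted(o) == sorted(target)]
    return Fraction(len(favorable), len(outcomes))

def bose_einstein(n, m, target):
    # Indistinguishable particles: outcomes are multisets of occupied boxes.
    outcomes = list(combinations_with_replacement(range(1, m + 1), n))
    favorable = [o for o in outcomes if sorted(o) == sorted(target)]
    return Fraction(len(favorable), len(outcomes))

def fermi_dirac(n, m, target):
    # Indistinguishable particles, at most one per box: outcomes are n-element subsets.
    outcomes = list(combinations(range(1, m + 1), n))
    favorable = [o for o in outcomes if sorted(o) == sorted(target)]
    return Fraction(len(favorable), len(outcomes))

n, m, target = 2, 6, (3, 4)  # two dice, preselected faces 3 and 4
p_mb = maxwell_boltzmann(n, m, target)
p_be = bose_einstein(n, m, target)
p_fd = fermi_dirac(n, m, target)

# Closed forms quoted in Example 1-4.
closed_mb = Fraction(factorial(n), m**n)
closed_be = Fraction(factorial(m - 1) * factorial(n), factorial(n + m - 1))
closed_fd = Fraction(factorial(n) * factorial(m - n), factorial(m))

print(p_mb, p_be, p_fd)  # 1/18 1/21 1/15
```

The enumeration only counts outcomes; which of the three counting rules is appropriate for a given physical system is exactly the question the example leaves to experiment.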
One might argue, as indeed it was argued in the early years of statistical mechanics, that only the first of these solutions is logical. The fact is that in the absence of direct or indirect experimental evidence this argument cannot be supported. The three models proposed are actually only hypotheses and the physicist accepts the one whose consequences agree with experience.

C. Suppose that we know the probability $P(\mathcal{A})$ of an event $\mathcal{A}$ in experiment 1 and the probability $P(\mathcal{B})$ of an event $\mathcal{B}$ in experiment 2. In general, from this
information we cannot determine the probability $P(\mathcal{A}\mathcal{B})$ that both events $\mathcal{A}$ and $\mathcal{B}$ will occur. However, if we know that the two experiments are independent, then

$$P(\mathcal{A}\mathcal{B}) = P(\mathcal{A})P(\mathcal{B}) \tag{1-8}$$

In many cases, this independence can be established a priori by reasoning that the outcomes of experiment 1 have no effect on the outcomes of experiment 2. For example, if in the coin experiment the probability of heads equals 1/2 and in the die experiment the probability of even equals 1/2, then we conclude "logically" that if both experiments are performed, the probability that we get heads on the coin and even on the die equals $1/2 \times 1/2$. Thus, as in (1-7), we accept the validity of (1-8) as a logical necessity without recourse to (1-1) or to any other direct evidence.

D. The classical definition can be used as the basis of a deductive theory if we accept (1-7) as an assumption. In this theory, no other assumptions are used and postulates (1-3) to (1-5) become theorems. Indeed, the first two postulates are obvious and the third follows from (1-7) because, if the events $\mathcal{A}$ and $\mathcal{B}$ are mutually exclusive, then $N_{\mathcal{A}+\mathcal{B}} = N_{\mathcal{A}} + N_{\mathcal{B}}$; hence

$$P(\mathcal{A} + \mathcal{B}) = \frac{N_{\mathcal{A}+\mathcal{B}}}{N} = \frac{N_{\mathcal{A}}}{N} + \frac{N_{\mathcal{B}}}{N} = P(\mathcal{A}) + P(\mathcal{B})$$

As we show in (2-25), however, this is only a very special case of the axiomatic approach to probability.

1-3 PROBABILITY AND INDUCTION

In the applications of the theory of probability we are faced with the following question: Suppose that we know somehow from past observations the probability $P(\mathcal{A})$ of an event $\mathcal{A}$ in a given experiment. What conclusion can we draw about the occurrence of this event in a single future performance of this experiment? (See also Sec. 9-1.)

We shall answer this question in two ways depending on the size of $P(\mathcal{A})$: We shall give one kind of an answer if $P(\mathcal{A})$ is a number distinctly different from 0 or 1, for example 0.6, and a different kind of an answer if $P(\mathcal{A})$ is close to 0 or 1, for example 0.999.
Although the boundary between these two cases is not sharply defined, the corresponding answers are fundamentally different.

Case 1 Suppose that $P(\mathcal{A}) = 0.6$. In this case, the number 0.6 gives us only a "certain degree of confidence that the event $\mathcal{A}$ will occur." The known probability is thus used as a "measure of our belief" about the occurrence of $\mathcal{A}$ in a single trial. This interpretation of $P(\mathcal{A})$ is subjective in the sense that it cannot be verified experimentally. In a single trial, the event $\mathcal{A}$ will either occur or will not occur. If it does not, this will not be a reason for questioning the validity of the assumption that $P(\mathcal{A}) = 0.6$.

Case 2 Suppose, however, that $P(\mathcal{A}) = 0.999$. We can now state with practical certainty that at the next trial the event $\mathcal{A}$ will occur. This conclusion
is objective in the sense that it can be verified experimentally. At the next trial the event $\mathcal{A}$ must occur. If it does not, we must seriously doubt, if not outright reject, the assumption that $P(\mathcal{A}) = 0.999$.

The boundary between these two cases, arbitrary though it is (0.9 or 0.99999?), establishes in a sense the line separating "soft" from "hard" scientific conclusions. The theory of probability gives us the analytic tools (step 2) for transforming the "subjective" statements of case 1 to the "objective" statements of case 2. In the following, we explain briefly the underlying reasoning.

As we show in Chap. 3, the information that $P(\mathcal{A}) = 0.6$ leads to the conclusion that if the experiment is performed 1000 times, then "almost certainly" the number of times the event $\mathcal{A}$ will occur is between 550 and 650. This is shown by considering the repetition of the original experiment 1000 times as a single outcome of a new experiment. In this experiment the probability of the event

$\mathcal{A}_1$ = {the number of times $\mathcal{A}$ occurs is between 550 and 650}

equals 0.999 (see Prob. 3-6). We must, therefore, conclude that (case 2) the event $\mathcal{A}_1$ will occur with practical certainty.

We have thus succeeded, using the theory of probability, in transforming the "subjective" conclusion about $\mathcal{A}$ based on the given information that $P(\mathcal{A}) = 0.6$, to the "objective" conclusion about $\mathcal{A}_1$ based on the derived conclusion that $P(\mathcal{A}_1) = 0.999$. We should emphasize, however, that both conclusions rely on inductive reasoning. Their difference, although significant, is only quantitative. As in case 1, the "objective" conclusion of case 2 is not a certainty but only an inference. This, however, should not surprise us; after all, no prediction about future events based on past experience can be accepted as logical certainty.

Our inability to make categorical statements about future events is not limited to probability but applies to all sciences.
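As a side check on the figure 0.999 quoted above, the probability that the number of occurrences in 1000 trials lies between 550 and 650 can be summed directly. This is a hedged sketch, not the derivation of Chap. 3: it assumes the 1000 trials are independent, so that the count is binomial, and the function name is a label chosen here for illustration.

```python
from fractions import Fraction
from math import comb

def binomial_interval_probability(n, p, low, high):
    """Exact P(low <= k <= high) when k counts successes in n independent trials."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(low, high + 1))

# P(A) = 0.6 is entered as the exact fraction 3/5 to avoid rounding error.
prob = binomial_interval_probability(1000, Fraction(3, 5), 550, 650)
print(float(prob))  # close to, but not exactly, 0.999
```

Exact rational arithmetic is used only to keep the sum free of floating-point concerns; a normal approximation would give nearly the same number.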
Consider, for example, the development of classical mechanics. It was observed that bodies fall according to certain patterns, and on this evidence Newton formulated the laws of mechanics and used them to predict future events. His predictions, however, are not logical certainties but only plausible inferences. To "prove" that the future will evolve in the predicted manner we must invoke metaphysical causes.

1-4 CAUSALITY VERSUS RANDOMNESS

We conclude with a brief comment on the apparent controversy between causality and randomness. There is no conflict between causality and randomness or between determinism and probability if we agree, as we must, that scientific theories are not discoveries of the laws of nature but rather inventions of the human mind. Their consequences are presented in deterministic form if we examine the results of a single trial; they are presented as probabilistic statements if we are interested in averages of many trials. In both cases, all statements are qualified. In the first case, the uncertainties are of the form "with certain errors and in certain ranges of the relevant parameters"; in the
second, "with a high degree of certainty if the number of trials is large enough." In the next example, we illustrate these two approaches.

FIGURE 1-2

Example 1-5. A rocket leaves the ground with an initial velocity $v$ forming an angle $\theta$ with the horizontal axis (Fig. 1-2). We shall determine the distance $d = OB$ from the origin to the reentry point $B$.

From Newton's law it follows that

$$d = \frac{v^2}{g} \sin 2\theta \tag{1-9}$$

The above seems to be an unqualified consequence of a causal law; however, this is not so. The result is approximate and it can be given a probabilistic interpretation.

Indeed, (1-9) is not the solution of a real problem but of an idealized model in which we have neglected air friction, air pressure, variation of $g$, and other uncertainties in the values of $v$ and $\theta$. We must, therefore, accept (1-9) only with qualifications. It holds within an error $\varepsilon$ provided that the neglected factors are smaller than $\delta$.

Suppose now that the reentry area consists of numbered holes and we want to find the reentry hole. Because of the uncertainties in $v$ and $\theta$, we are in no position to give a deterministic answer to our problem. We can, however, ask a different question: If many rockets, nominally with the same velocity, are launched, what percentage will enter the $n$th hole? This question no longer has a causal answer; it can only be given a random interpretation.

Thus the same physical problem can be subjected either to a deterministic or to a probabilistic analysis. One might argue that the problem is inherently deterministic because the rocket has a precise velocity even if we do not know it. If we did, we would know exactly the reentry hole. Probabilistic interpretations are, therefore, necessary because of our ignorance. Such arguments can be answered with the statement that the physicists are not concerned with what is true but only with what they can observe.
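The two readings of Example 1-5 can be put side by side in a short simulation. Everything numerical below is an assumption made for illustration only: the nominal velocity and angle, the Gaussian scatter in $v$ and $\theta$, the hole width, and the helper names are not from the text. The deterministic part is just (1-9).

```python
import math
import random

G = 9.81  # m/s^2, taken as constant, as in the idealized model

def reentry_distance(v, theta):
    """Idealized range d = (v^2 / g) * sin(2 * theta), Eq. (1-9)."""
    return v**2 / G * math.sin(2 * theta)

def hole_fractions(v0, theta0, sigma_v, sigma_theta, hole_width, trials, seed=0):
    """Fraction of simulated rockets landing in each hole (holes indexed by distance)."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(trials):
        # Assumed model of the uncertainty: independent Gaussian errors in v and theta.
        v = rng.gauss(v0, sigma_v)
        theta = rng.gauss(theta0, sigma_theta)
        hole = math.floor(reentry_distance(v, theta) / hole_width)
        counts[hole] = counts.get(hole, 0) + 1
    return {hole: count / trials for hole, count in sorted(counts.items())}

# Deterministic answer for the nominal launch.
nominal = reentry_distance(100.0, math.radians(45))

# Probabilistic answer: what fraction of nominally identical launches enters each hole?
fractions = hole_fractions(100.0, math.radians(45), 1.0, math.radians(0.5), 10.0, 10_000)
for hole, fraction in fractions.items():
    print(hole, round(fraction, 3))
```

The single number `nominal` answers the causal question; the table of fractions answers the question about percentages, and it changes with the assumed scatter.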
CONCLUDING REMARKS

In this book, we present a deductive theory (step 2) based on the axiomatic definition of probability. Occasionally, we use the classical definition but only to determine probabilistic data (step 1).

To show the link between theory and applications (step 3), we also give a relative frequency interpretation of the important results. This part of the book, written in small print under the title Frequency interpretation, does not obey the rules of deductive reasoning on which the theory is based.
CHAPTER 2 THE AXIOMS OF PROBABILITY

2-1 SET THEORY

A set is a collection of objects called elements. For example, "car, apple, pencil" is a set whose elements are a car, an apple, and a pencil. The set "heads, tails" has two elements. The set "1, 2, 3, 5" has four elements.

A subset $\mathcal{B}$ of a set $\mathcal{A}$ is another set whose elements are also elements of $\mathcal{A}$. All sets under consideration will be subsets of a set $\mathcal{S}$ which we shall call space.

The elements of a set will be identified mostly by the Greek letter $\zeta$. Thus

$$\mathcal{A} = \{\zeta_1, \ldots, \zeta_n\} \tag{2-1}$$

will mean that the set $\mathcal{A}$ consists of the elements $\zeta_1, \ldots, \zeta_n$. We shall also identify sets by the properties of their elements. Thus

$$\mathcal{A} = \{\text{all positive integers}\} \tag{2-2}$$

will mean the set whose elements are the numbers 1, 2, 3, ....

The notation

$$\zeta_i \in \mathcal{A} \qquad \zeta_i \notin \mathcal{A}$$

will mean that $\zeta_i$ is or is not an element of $\mathcal{A}$.

The empty or null set is by definition the set that contains no elements. This set will be denoted by $\{\emptyset\}$.

If a set consists of $n$ elements, then the total number of its subsets equals $2^n$.
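The count $2^n$ can be confirmed by listing the subsets explicitly. The sketch below is only an illustration; `all_subsets` is a helper name introduced here, and `frozenset` is used so that the subsets can themselves be elements of another set.

```python
from itertools import combinations

def all_subsets(elements):
    """Return every subset of the given collection, including the empty set."""
    items = list(elements)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]

subsets = all_subsets(["car", "apple", "pencil"])
print(len(subsets))  # 8, that is 2**3
for s in subsets:
    print(sorted(s))
```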
FIGURE 2-2

Note In probability theory, we assign probabilities to the subsets (events) of $\mathcal{S}$ and we define various functions (random variables) whose domain consists of the elements of $\mathcal{S}$. We must be careful, therefore, to distinguish between the element $\zeta$ and the set $\{\zeta\}$ consisting of the single element $\zeta$.

Example 2-1. We shall denote by $f_i$ the faces of a die. These faces are the elements of the set $\mathcal{S} = \{f_1, \ldots, f_6\}$. In this case, $n = 6$; hence $\mathcal{S}$ has $2^6 = 64$ subsets:

$$\{\emptyset\}, \{f_1\}, \ldots, \{f_1, f_2\}, \ldots, \{f_1, f_2, f_3\}, \ldots, \mathcal{S}$$

In general, the elements of a set are arbitrary objects. For example, the 64 subsets of the set $\mathcal{S}$ in the above example can be considered as the elements of another set. In Example 2-2, the elements of $\mathcal{S}$ are pairs of objects. In Example 2-3, $\mathcal{S}$ is the set of points in the square of Fig. 2-1.

Example 2-2. Suppose that a coin is tossed twice. The resulting outcomes are the four objects hh, ht, th, tt forming the set

$$\mathcal{S} = \{hh, ht, th, tt\}$$

where hh is an abbreviation for the element "heads-heads." The set $\mathcal{S}$ has $2^4 = 16$ subsets. For example,

$\mathcal{A}$ = {heads at the first toss} = {hh, ht}
$\mathcal{B}$ = {only one head showed} = {ht, th}
$\mathcal{C}$ = {heads shows at least once} = {hh, ht, th}

In the first equality, the sets $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{C}$ are represented by their properties as in (2-2); in the second, in terms of their elements as in (2-1).

Example 2-3. In this example, $\mathcal{S}$ is the set of all points in the square of Fig. 2-1. Its elements are all ordered pairs of numbers $(x, y)$ where

$$0 \le x \le T \qquad 0 \le y \le T$$
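FIGURE 2-3 FIGURE 2-4

The shaded area is a subset $\mathcal{A}$ of $\mathcal{S}$ consisting of all points $(x, y)$ such that $-b \le x - y \le a$. The notation

$$\mathcal{A} = \{-b \le x - y \le a\}$$

describes $\mathcal{A}$ in terms of the properties of $x$ and $y$ as in (2-2).

Set Operations

In the following, we shall represent a set $\mathcal{S}$ and its subsets by plane figures as in Fig. 2-2 (Venn diagrams).

The notation $\mathcal{B} \subset \mathcal{A}$ or $\mathcal{A} \supset \mathcal{B}$ will mean that $\mathcal{B}$ is a subset of $\mathcal{A}$, that is, that every element of $\mathcal{B}$ is an element of $\mathcal{A}$. Thus, for any $\mathcal{A}$,

$$\{\emptyset\} \subset \mathcal{A} \subset \mathcal{A} \subset \mathcal{S}$$

Transitivity If $\mathcal{C} \subset \mathcal{B}$ and $\mathcal{B} \subset \mathcal{A}$ then $\mathcal{C} \subset \mathcal{A}$

Equality $\mathcal{A} = \mathcal{B}$ iff† $\mathcal{A} \subset \mathcal{B}$ and $\mathcal{B} \subset \mathcal{A}$

Unions and intersections The sum or union of two sets $\mathcal{A}$ and $\mathcal{B}$ is a set whose elements are all elements of $\mathcal{A}$ or of $\mathcal{B}$ or of both (Fig. 2-3). This set will be written in the form

$$\mathcal{A} + \mathcal{B} \quad \text{or} \quad \mathcal{A} \cup \mathcal{B}$$

The above operation is commutative and associative:

$$\mathcal{A} + \mathcal{B} = \mathcal{B} + \mathcal{A} \qquad (\mathcal{A} + \mathcal{B}) + \mathcal{C} = \mathcal{A} + (\mathcal{B} + \mathcal{C})$$

We note that, if $\mathcal{B} \subset \mathcal{A}$, then $\mathcal{A} + \mathcal{B} = \mathcal{A}$. From this it follows that

$$\mathcal{A} + \mathcal{A} = \mathcal{A} \qquad \mathcal{A} + \{\emptyset\} = \mathcal{A} \qquad \mathcal{S} + \mathcal{A} = \mathcal{S}$$

The product or intersection of two sets $\mathcal{A}$ and $\mathcal{B}$ is a set consisting of all elements that are common to the sets $\mathcal{A}$ and $\mathcal{B}$ (Fig. 2-3). This set is written in the form

$$\mathcal{A}\mathcal{B} \quad \text{or} \quad \mathcal{A} \cap \mathcal{B}$$

The above operation is commutative, associative, and distributive (Fig. 2-4):

$$\mathcal{A}\mathcal{B} = \mathcal{B}\mathcal{A} \qquad (\mathcal{A}\mathcal{B})\mathcal{C} = \mathcal{A}(\mathcal{B}\mathcal{C}) \qquad \mathcal{A}(\mathcal{B} + \mathcal{C}) = \mathcal{A}\mathcal{B} + \mathcal{A}\mathcal{C}$$

† iff is an abbreviation for if and only if.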
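FIGURE 2-5 FIGURE 2-6

We note that if $\mathcal{A} \subset \mathcal{B}$, then $\mathcal{A}\mathcal{B} = \mathcal{A}$. Hence

$$\mathcal{A}\mathcal{A} = \mathcal{A} \qquad \{\emptyset\}\mathcal{A} = \{\emptyset\} \qquad \mathcal{A}\mathcal{S} = \mathcal{A}$$

Note If two sets $\mathcal{A}$ and $\mathcal{B}$ are described by the properties of their elements as in (2-2), then their intersection $\mathcal{A}\mathcal{B}$ will be specified by including these properties in braces. For example, if

$$\mathcal{S} = \{1, 2, 3, 4, 5, 6\} \qquad \mathcal{A} = \{\text{even}\} \qquad \mathcal{B} = \{\text{less than 5}\}$$

then†

$$\mathcal{A}\mathcal{B} = \{\text{even, less than 5}\} = \{2, 4\} \tag{2-3}$$

Mutually exclusive sets Two sets $\mathcal{A}$ and $\mathcal{B}$ are called mutually exclusive or disjoint if they have no common elements, that is, if

$$\mathcal{A}\mathcal{B} = \{\emptyset\}$$

Several sets $\mathcal{A}_1, \mathcal{A}_2, \ldots$ are called mutually exclusive if

$$\mathcal{A}_i\mathcal{A}_j = \{\emptyset\} \quad \text{for every } i \text{ and } j \ne i$$

Partitions A partition $\mathfrak{A}$ of a set $\mathcal{S}$ is a collection of mutually exclusive subsets $\mathcal{A}_i$ of $\mathcal{S}$ whose union equals $\mathcal{S}$ (Fig. 2-5).

$$\mathcal{A}_1 + \cdots + \mathcal{A}_n = \mathcal{S} \qquad \mathcal{A}_i\mathcal{A}_j = \{\emptyset\} \quad i \ne j \tag{2-4}$$

All partitions will be denoted by boldface German script (Fraktur) letters. Thus

$$\mathfrak{A} = [\mathcal{A}_1, \ldots, \mathcal{A}_n]$$

† We should stress the difference in the meaning of commas in (2-1) and (2-3). In (2-1) the braces include all elements $\zeta_i$ and

$$\{\zeta_1, \ldots, \zeta_n\} = \{\zeta_1\} \cup \cdots \cup \{\zeta_n\}$$

is the union of the sets $\{\zeta_i\}$. In (2-3) the braces include the properties of the sets {even} and {less than 5}, and

{even, less than 5} = {even} $\cap$ {less than 5}

is the intersection of the sets {even} and {less than 5}.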
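Complements The complement $\overline{\mathcal{A}}$ of a set $\mathcal{A}$ is the set consisting of all elements of $\mathcal{S}$ that are not in $\mathcal{A}$ (Fig. 2-6). From the definition it follows that

$$\mathcal{A} + \overline{\mathcal{A}} = \mathcal{S} \qquad \mathcal{A}\overline{\mathcal{A}} = \{\emptyset\} \qquad \overline{\overline{\mathcal{A}}} = \mathcal{A} \qquad \overline{\mathcal{S}} = \{\emptyset\} \qquad \overline{\{\emptyset\}} = \mathcal{S}$$

If $\mathcal{B} \subset \mathcal{A}$ then $\overline{\mathcal{B}} \supset \overline{\mathcal{A}}$; if $\mathcal{A} = \mathcal{B}$ then $\overline{\mathcal{A}} = \overline{\mathcal{B}}$.

De Morgan's law Clearly (see Fig. 2-7)

$$\overline{\mathcal{A} + \mathcal{B}} = \overline{\mathcal{A}}\,\overline{\mathcal{B}} \qquad \overline{\mathcal{A}\mathcal{B}} = \overline{\mathcal{A}} + \overline{\mathcal{B}} \tag{2-5}$$

Repeated application of (2-5) leads to the following: If in a set identity we replace all sets by their complements, all unions by intersections, and all intersections by unions, the identity is preserved.

We shall demonstrate the above using as example the identity

$$\mathcal{A}(\mathcal{B} + \mathcal{C}) = \mathcal{A}\mathcal{B} + \mathcal{A}\mathcal{C} \tag{2-6}$$

From (2-5) it follows that

$$\overline{\mathcal{A}(\mathcal{B} + \mathcal{C})} = \overline{\mathcal{A}} + \overline{\mathcal{B} + \mathcal{C}} = \overline{\mathcal{A}} + \overline{\mathcal{B}}\,\overline{\mathcal{C}}$$

Similarly,

$$\overline{\mathcal{A}\mathcal{B} + \mathcal{A}\mathcal{C}} = (\overline{\mathcal{A}\mathcal{B}})(\overline{\mathcal{A}\mathcal{C}}) = (\overline{\mathcal{A}} + \overline{\mathcal{B}})(\overline{\mathcal{A}} + \overline{\mathcal{C}})$$

and since the two sides of (2-6) are equal, their complements are also equal. Hence

$$\overline{\mathcal{A}} + \overline{\mathcal{B}}\,\overline{\mathcal{C}} = (\overline{\mathcal{A}} + \overline{\mathcal{B}})(\overline{\mathcal{A}} + \overline{\mathcal{C}}) \tag{2-7}$$

Duality principle As we know, $\overline{\mathcal{S}} = \{\emptyset\}$ and $\overline{\{\emptyset\}} = \mathcal{S}$. Furthermore, if in an identity like (2-7) all overbars are removed, the identity is preserved. This leads to the following version of De Morgan's law:

If in a set identity we replace all unions by intersections, all intersections by unions, and the sets $\mathcal{S}$ and $\{\emptyset\}$ by the sets $\{\emptyset\}$ and $\mathcal{S}$, the identity is preserved.

Applying the above to the identities

$$\mathcal{A}(\mathcal{B} + \mathcal{C}) = \mathcal{A}\mathcal{B} + \mathcal{A}\mathcal{C} \qquad \mathcal{S} + \mathcal{A} = \mathcal{S}$$

we obtain the identities

$$\mathcal{A} + \mathcal{B}\mathcal{C} = (\mathcal{A} + \mathcal{B})(\mathcal{A} + \mathcal{C}) \qquad \{\emptyset\}\mathcal{A} = \{\emptyset\}$$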
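The identities of this section can be verified exhaustively on a small space. The sketch below is only a check under stated assumptions, not a proof: the four-element space and the helper names are chosen here for illustration, with Python's `|` and `&` standing for the union $+$ and the intersection.

```python
from itertools import combinations

S = frozenset("abcd")  # a small space chosen only for illustration
EMPTY = frozenset()

def complement(a):
    """Complement of a relative to the space S."""
    return S - a

def all_subsets(space):
    items = sorted(space)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]

events = all_subsets(S)  # 2**4 = 16 subsets

# De Morgan's law (2-5), checked for every pair of subsets.
de_morgan_ok = all(
    complement(a | b) == complement(a) & complement(b)
    and complement(a & b) == complement(a) | complement(b)
    for a in events for b in events
)

# Distributive identity (2-6) and its dual, checked for every triple.
distributive_ok = all(
    a & (b | c) == (a & b) | (a & c)
    for a in events for b in events for c in events
)
dual_ok = all(
    a | (b & c) == (a | b) & (a | c)
    for a in events for b in events for c in events
)

# S + A = S and its dual {0}A = {0}.
absorbing_ok = all((S | a) == S and (EMPTY & a) == EMPTY for a in events)

print(de_morgan_ok, distributive_ok, dual_ok, absorbing_ok)  # True True True True
```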
2-2 PROBABILITY SPACE

In probability theory, the following set terminology is used: The space $\mathcal{S}$ is called the certain event, its elements experimental outcomes, and its subsets events. The empty set $\{\emptyset\}$ is the impossible event, and the event $\{\zeta_i\}$ consisting of a single element $\zeta_i$ is an elementary event. All events will be identified by script letters.

In the applications of probability theory to physical problems, the identification of experimental outcomes is not always unique. We shall illustrate this ambiguity with the die experiment as might be interpreted by players X, Y, and Z.

X says that the outcomes of this experiment are the six faces of the die forming the space $\mathcal{S} = \{f_1, \ldots, f_6\}$. This space has $2^6 = 64$ subsets and the event {even} consists of the three outcomes $f_2$, $f_4$, and $f_6$.

Y wants to bet on even or odd only. He argues, therefore, that the experiment has only the two outcomes even and odd forming the space $\mathcal{S} = \{\text{even}, \text{odd}\}$. This space has only $2^2 = 4$ subsets and the event {even} consists of a single outcome.

Z bets that one will show and the die will rest on the left side of the table. He maintains, therefore, that the experiment has infinitely many outcomes specified by the coordinates of its center and by the six faces. The event {even} consists not of one or of three outcomes but of infinitely many.

In the following, when we talk about an experiment, we shall assume that its outcomes are clearly identified. In the die experiment, for example, $\mathcal{S}$ will be the set consisting of the six faces $f_1, \ldots, f_6$.

In the relative frequency interpretation of various results, we shall use the following terminology.

Trial A single performance of an experiment will be called a trial. At each trial we observe a single outcome $\zeta_i$. We say that an event $\mathcal{A}$ occurs during this trial if it contains the element $\zeta_i$. The certain event occurs at every trial and the impossible event never occurs. The event $\mathcal{A} + \mathcal{B}$ occurs when $\mathcal{A}$ or $\mathcal{B}$ or both occur.
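The event $\mathcal{A}\mathcal{B}$ occurs when both events $\mathcal{A}$ and $\mathcal{B}$ occur. If the events $\mathcal{A}$ and $\mathcal{B}$ are mutually exclusive and $\mathcal{A}$ occurs, then $\mathcal{B}$ does not occur. If $\mathcal{A} \subset \mathcal{B}$ and $\mathcal{A}$ occurs, then $\mathcal{B}$ occurs. At each trial, either $\mathcal{A}$ or $\overline{\mathcal{A}}$ occurs.

If, for example, in the die experiment we observe the outcome $f_5$, then the event $\{f_5\}$, the event {odd}, and 30 other events occur.

The Axioms

We assign to each event $\mathcal{A}$ a number $P(\mathcal{A})$ which we call the probability of the event $\mathcal{A}$. This number is so chosen as to satisfy the following three conditions:

I $$P(\mathcal{A}) \ge 0 \tag{2-8}$$

II $$P(\mathcal{S}) = 1 \tag{2-9}$$

III $$\text{if } \mathcal{A}\mathcal{B} = \{\emptyset\} \text{ then } P(\mathcal{A} + \mathcal{B}) = P(\mathcal{A}) + P(\mathcal{B}) \tag{2-10}$$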
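These conditions are the axioms of the theory of probability. In the development of the theory, all conclusions are based directly or indirectly on the axioms and only on the axioms. The following are simple consequences.

Properties. The probability of the impossible event is 0:

$$P\{\emptyset\} = 0 \tag{2-11}$$

Indeed, $\mathcal{A}\{\emptyset\} = \{\emptyset\}$ and $\mathcal{A} + \{\emptyset\} = \mathcal{A}$; therefore [see (2-10)]

$$P(\mathcal{A}) = P(\mathcal{A} + \{\emptyset\}) = P(\mathcal{A}) + P\{\emptyset\}$$

For any $\mathcal{A}$,

$$P(\mathcal{A}) = 1 - P(\overline{\mathcal{A}}) \le 1 \tag{2-12}$$

because $\mathcal{A} + \overline{\mathcal{A}} = \mathcal{S}$ and $\mathcal{A}\overline{\mathcal{A}} = \{\emptyset\}$; hence

$$1 = P(\mathcal{S}) = P(\mathcal{A} + \overline{\mathcal{A}}) = P(\mathcal{A}) + P(\overline{\mathcal{A}})$$

For any $\mathcal{A}$ and $\mathcal{B}$,

$$P(\mathcal{A} + \mathcal{B}) = P(\mathcal{A}) + P(\mathcal{B}) - P(\mathcal{A}\mathcal{B}) \le P(\mathcal{A}) + P(\mathcal{B}) \tag{2-13}$$

To prove the above, we write the events $\mathcal{A} + \mathcal{B}$ and $\mathcal{B}$ as unions of two mutually exclusive events:

$$\mathcal{A} + \mathcal{B} = \mathcal{A} + \overline{\mathcal{A}}\mathcal{B} \qquad \mathcal{B} = \mathcal{A}\mathcal{B} + \overline{\mathcal{A}}\mathcal{B}$$

Therefore [see (2-10)]

$$P(\mathcal{A} + \mathcal{B}) = P(\mathcal{A}) + P(\overline{\mathcal{A}}\mathcal{B}) \qquad P(\mathcal{B}) = P(\mathcal{A}\mathcal{B}) + P(\overline{\mathcal{A}}\mathcal{B})$$

Eliminating $P(\overline{\mathcal{A}}\mathcal{B})$, we obtain (2-13).

Finally, if $\mathcal{B} \subset \mathcal{A}$, then

$$P(\mathcal{A}) = P(\mathcal{B}) + P(\mathcal{A}\overline{\mathcal{B}}) \ge P(\mathcal{B}) \tag{2-14}$$

because $\mathcal{A} = \mathcal{B} + \mathcal{A}\overline{\mathcal{B}}$ and $\mathcal{B}(\mathcal{A}\overline{\mathcal{B}}) = \{\emptyset\}$.

Frequency interpretation The axioms of probability are so chosen that the resulting theory gives a satisfactory representation of the physical world. Probabilities as used in real problems must, therefore, be compatible with the axioms. Using the frequency interpretation

$$P(\mathcal{A}) \simeq \frac{n_{\mathcal{A}}}{n}$$

of probability, we shall show that they are.

I. Clearly, $P(\mathcal{A}) \ge 0$ because $n_{\mathcal{A}} \ge 0$ and $n > 0$.

II. $P(\mathcal{S}) = 1$ because $\mathcal{S}$ occurs at every trial; hence $n_{\mathcal{S}} = n$.

III. If $\mathcal{A}\mathcal{B} = \{\emptyset\}$, then $n_{\mathcal{A}+\mathcal{B}} = n_{\mathcal{A}} + n_{\mathcal{B}}$ because if $\mathcal{A} + \mathcal{B}$ occurs then $\mathcal{A}$ or $\mathcal{B}$ occurs but not both. Hence

$$P(\mathcal{A} + \mathcal{B}) \simeq \frac{n_{\mathcal{A}+\mathcal{B}}}{n} = \frac{n_{\mathcal{A}}}{n} + \frac{n_{\mathcal{B}}}{n} \simeq P(\mathcal{A}) + P(\mathcal{B})$$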
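Properties (2-11) to (2-14) can be spot-checked on a concrete assignment of probabilities. The sketch below assumes a hypothetical loaded die in which face four has probability 0.2; the weights, the events, and the names `weights`, `P`, and `checks` are all choices made here for illustration and are not from the text.

```python
from fractions import Fraction

# Hypothetical loaded die: the weights are nonnegative and sum to 1.
weights = {1: Fraction(1, 10), 2: Fraction(1, 10), 3: Fraction(1, 5),
           4: Fraction(1, 5), 5: Fraction(1, 5), 6: Fraction(1, 5)}
S = frozenset(weights)
EMPTY = frozenset()

def P(event):
    """Probability of an event as the sum of the weights of its outcomes."""
    return sum((weights[z] for z in event), Fraction(0))

A = frozenset({2, 4, 6})     # {even}
B = frozenset({1, 2, 3, 4})  # {less than 5}
C = frozenset({2, 4})        # a subset of A

checks = {
    # Axiom I is checked here on the elementary events only.
    "axiom I": all(P(frozenset({z})) >= 0 for z in S),
    "axiom II": P(S) == 1,
    "(2-11)": P(EMPTY) == 0,
    "(2-12)": P(A) == 1 - P(S - A),
    "(2-13)": P(A | B) == P(A) + P(B) - P(A & B),
    "(2-14)": P(A) == P(C) + P(A - C) and P(A) >= P(C),
}
print(checks)
```

Because `P` is additive by construction, the checks succeed for any choice of events; the point is only to see each property as a concrete computation.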
FIGURE 2-8

Equality of events. Two events $\mathcal{A}$ and $\mathcal{B}$ are called equal if they consist of the same elements. They are called equal with probability 1 if the set

$$(\mathcal{A} + \mathcal{B})(\overline{\mathcal{A}\mathcal{B}}) = \mathcal{A}\overline{\mathcal{B}} + \overline{\mathcal{A}}\mathcal{B}$$

consisting of all outcomes that are in $\mathcal{A}$ or in $\mathcal{B}$ but not in $\mathcal{A}\mathcal{B}$ (shaded area in Fig. 2-8) has zero probability.

From the definition it follows that (see Prob. 2-4) the events $\mathcal{A}$ and $\mathcal{B}$ are equal with probability 1 iff

$$P(\mathcal{A}) = P(\mathcal{B}) = P(\mathcal{A}\mathcal{B}) \tag{2-15}$$

If $P(\mathcal{A}) = P(\mathcal{B})$ then we say that $\mathcal{A}$ and $\mathcal{B}$ are equal in probability. In this case, no conclusion can be drawn about the probability of $\mathcal{A}\mathcal{B}$. In fact, the events $\mathcal{A}$ and $\mathcal{B}$ might be mutually exclusive.

From (2-15) it follows that, if an event $\mathcal{N}$ equals the impossible event with probability 1 then $P(\mathcal{N}) = 0$. This does not, of course, mean that $\mathcal{N} = \{\emptyset\}$.

The Class $\mathfrak{F}$ of Events

Events are subsets of $\mathcal{S}$ to which we have assigned probabilities. As we shall presently explain, we shall not consider as events all subsets of $\mathcal{S}$ but only a class $\mathfrak{F}$ of subsets.

One reason for this might be the nature of the application. In the die experiment, for example, we might want to bet only on even or odd. In this case, it suffices to consider as events only the four sets $\{\emptyset\}$, {even}, {odd}, and $\mathcal{S}$.

The main reason, however, for not including all subsets of $\mathcal{S}$ in the class $\mathfrak{F}$ of events is of a mathematical nature: In certain cases involving sets with infinitely many outcomes, it is impossible to assign probabilities to all subsets satisfying all the axioms including the generalized form (2-21) of axiom III.

The class $\mathfrak{F}$ of events will not be an arbitrary collection of subsets of $\mathcal{S}$. We shall assume that, if $\mathcal{A}$ and $\mathcal{B}$ are events, then $\mathcal{A} + \mathcal{B}$ and $\mathcal{A}\mathcal{B}$ are also events. We do so because we will want to know not only the probabilities of various events, but also the probabilities of their unions and intersections. This leads to the concept of a field.

FIELDS.
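A field $\mathfrak{F}$ is a nonempty class of sets such that:

$$\text{If } \mathcal{A} \in \mathfrak{F} \text{ then } \overline{\mathcal{A}} \in \mathfrak{F} \tag{2-16}$$

$$\text{If } \mathcal{A} \in \mathfrak{F} \text{ and } \mathcal{B} \in \mathfrak{F} \text{ then } \mathcal{A} + \mathcal{B} \in \mathfrak{F} \tag{2-17}$$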
These two properties give a minimum set of conditions for $\mathfrak{F}$ to be a field. All other properties follow:

$$\text{If } \mathcal{A} \in \mathfrak{F} \text{ and } \mathcal{B} \in \mathfrak{F} \text{ then } \mathcal{A}\mathcal{B} \in \mathfrak{F} \tag{2-18}$$

Indeed, from (2-16) it follows that $\overline{\mathcal{A}} \in \mathfrak{F}$ and $\overline{\mathcal{B}} \in \mathfrak{F}$. Applying (2-17) and (2-16) to the sets $\overline{\mathcal{A}}$ and $\overline{\mathcal{B}}$, we conclude that

$$\overline{\mathcal{A}} + \overline{\mathcal{B}} \in \mathfrak{F} \qquad \overline{\overline{\mathcal{A}} + \overline{\mathcal{B}}} = \mathcal{A}\mathcal{B} \in \mathfrak{F}$$

A field contains the certain event and the impossible event:

$$\mathcal{S} \in \mathfrak{F} \qquad \{\emptyset\} \in \mathfrak{F} \tag{2-19}$$

Indeed, since $\mathfrak{F}$ is not empty, it contains at least one element $\mathcal{A}$; therefore [see (2-16)] it also contains $\overline{\mathcal{A}}$. Hence

$$\mathcal{A} + \overline{\mathcal{A}} = \mathcal{S} \in \mathfrak{F} \qquad \mathcal{A}\overline{\mathcal{A}} = \{\emptyset\} \in \mathfrak{F}$$

From the above it follows that all sets that can be written as unions or intersections of finitely many sets in $\mathfrak{F}$ are also in $\mathfrak{F}$. This is not, however, necessarily the case for infinitely many sets.

Borel fields. Suppose that $\mathcal{A}_1, \ldots, \mathcal{A}_n, \ldots$ is an infinite sequence of sets in $\mathfrak{F}$. If the union and intersection of these sets also belong to $\mathfrak{F}$, then $\mathfrak{F}$ is called a Borel field.

The class of all subsets of a set $\mathcal{S}$ is a Borel field. Suppose that $\mathfrak{C}$ is a class of subsets of $\mathcal{S}$ that is not a field. Attaching to it other subsets of $\mathcal{S}$, all subsets if necessary, we can form a field with $\mathfrak{C}$ as its subset. It can be shown that there exists a smallest Borel field containing all the elements of $\mathfrak{C}$.

Example 2-4. Suppose that $\mathcal{S}$ consists of the four elements a, b, c, d and $\mathfrak{C}$ consists of the sets {a} and {b}. Attaching to $\mathfrak{C}$ the complements of {a} and {b} and their unions and intersections, we conclude that the smallest field containing {a} and {b} consists of the sets

$$\{\emptyset\} \quad \{a\} \quad \{b\} \quad \{a, b\} \quad \{c, d\} \quad \{b, c, d\} \quad \{a, c, d\} \quad \mathcal{S}$$

Events. In probability theory, events are certain subsets of $\mathcal{S}$ forming a Borel field. This permits us to assign probabilities not only to finite unions and intersections of events, but also to their limits.

For the determination of probabilities of sets that can be expressed as limits, the following extension of axiom III is necessary.
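As a side check on Example 2-4, the smallest field containing a finite class of sets can be built mechanically by closing the class under complement (2-16) and union (2-17) until nothing new appears. The sketch below is one simple way to do this for a finite space; `generate_field` is a name introduced here, and the loop is an assumption about how one might implement the closure, not a construction given in the text.

```python
def generate_field(space, seed_sets):
    """Smallest class containing seed_sets that is closed under complement and union."""
    space = frozenset(space)
    field = {frozenset(s) for s in seed_sets}
    changed = True
    while changed:
        changed = False
        current = list(field)
        for a in current:
            # (2-16): the complement; (2-17): unions with every set found so far.
            candidates = [space - a] + [a | b for b in current]
            for c in candidates:
                if c not in field:
                    field.add(c)
                    changed = True
    return field

field = generate_field("abcd", [{"a"}, {"b"}])
for event in sorted(field, key=lambda s: (len(s), sorted(s))):
    print(sorted(event))
print(len(field))  # 8
```

Closure under intersection is not imposed separately; it follows from the other two properties, as in (2-18).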
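Repeated application of (2-10) leads to the conclusion that, if the events $\mathcal{A}_1, \ldots, \mathcal{A}_n$ are mutually exclusive, then

$$P(\mathcal{A}_1 + \cdots + \mathcal{A}_n) = P(\mathcal{A}_1) + \cdots + P(\mathcal{A}_n) \tag{2-20}$$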
The extension of the above to infinitely many sets does not follow from (2-10). It is an additional condition known as the axiom of infinite additivity:

IIIa. If the events $\mathcal{A}_1, \mathcal{A}_2, \ldots$ are mutually exclusive, then

$$P(\mathcal{A}_1 + \mathcal{A}_2 + \cdots) = P(\mathcal{A}_1) + P(\mathcal{A}_2) + \cdots \tag{2-21}$$

We shall assume that all probabilities satisfy axioms I, II, III, and IIIa.

Axiomatic Definition of an Experiment

In the theory of probability, an experiment is specified in terms of the following concepts:

1. The set $\mathcal{S}$ of all experimental outcomes.
2. The Borel field of all events of $\mathcal{S}$.
3. The probabilities of these events.

The letter $\mathcal{S}$ will be used to identify not only the certain event, but also the entire experiment.

We discuss next the determination of probabilities in experiments with finitely many and infinitely many elements.

Countable spaces. If the space $\mathcal{S}$ consists of $N$ outcomes and $N$ is a finite number, then the probabilities of all events can be expressed in terms of the probabilities

$$P\{\zeta_i\} = p_i$$

of the elementary events $\{\zeta_i\}$. From the axioms it follows, of course, that the numbers $p_i$ must be nonnegative and their sum must equal 1:

$$p_i \ge 0 \qquad p_1 + \cdots + p_N = 1 \tag{2-22}$$

Suppose that $\mathcal{A}$ is an event consisting of the $r$ elements $\zeta_{k_i}$. In this case, $\mathcal{A}$ can be written as the union of the elementary events $\{\zeta_{k_i}\}$. Hence [see (2-20)]

$$P(\mathcal{A}) = P\{\zeta_{k_1}\} + \cdots + P\{\zeta_{k_r}\} = p_{k_1} + \cdots + p_{k_r} \tag{2-23}$$

The above is true even if $\mathcal{S}$ consists of an infinite but countable number of elements $\zeta_1, \zeta_2, \ldots$ [see (2-21)].

Classical definition If $\mathcal{S}$ consists of $N$ outcomes and the probabilities $p_i$ of the elementary events are all equal, then

$$p_i = \frac{1}{N} \tag{2-24}$$
In this case, the probability of an event $\mathcal{A}$ consisting of $r$ elements equals $r/N$:

$$P(\mathcal{A}) = \frac{r}{N} \tag{2-25}$$

This very special but important case is equivalent to the classical definition (1-7), with one important difference, however: In the classical definition, (2-25) is deduced as a logical necessity; in the axiomatic development of probability, (2-24), on which (2-25) is based, is a mere assumption.

Example 2-5. (a) In the coin experiment, the space $\mathcal{S}$ consists of the outcomes h and t:

$$\mathcal{S} = \{h, t\}$$

and its events are the four sets $\{\emptyset\}$, {t}, {h}, $\mathcal{S}$. If $P\{h\} = p$ and $P\{t\} = q$, then $p + q = 1$.

(b) We consider now the experiment of the toss of a coin three times. The possible outcomes of this experiment are:

hhh, hht, hth, htt, thh, tht, tth, ttt

We shall assume that all elementary events have the same probability as in (2-24) (fair coin). In this case, the probability of each elementary event equals 1/8. Thus the probability $P\{hhh\}$ that we get three heads equals 1/8. The event

{heads at the first two tosses} = {hhh, hht}

consists of the two outcomes hhh and hht; hence its probability equals 2/8.

The real line. If $\mathcal{S}$ consists of a noncountable infinity of elements, then its probabilities cannot be determined in terms of the probabilities of the elementary events. This is the case if $\mathcal{S}$ is the set of points in an $n$-dimensional space. In fact, most applications can be presented in terms of events in such a space. We shall discuss the determination of probabilities using as illustration the real line.

Suppose that $\mathcal{S}$ is the set of all real numbers. Its subsets can be considered as sets of points on the real line. It can be shown that it is impossible to assign probabilities to all subsets of $\mathcal{S}$ so as to satisfy the axioms. To construct a probability space on the real line, we shall consider as events all intervals $x_1 \le x \le x_2$ and their countable unions and intersections.
These events form a field $\mathcal{F}$ that can be specified as follows: It is the smallest Borel field that includes all half-lines $x \le x_i$, where $x_i$ is any number. This field contains all open and closed intervals, all points, and, in fact, every set of points on the real line that is of interest in the applications. One might wonder whether $\mathcal{F}$ does not include all subsets of $\mathcal{S}$. Actually, it is possible to show that there exist sets of points on the real line that are not countable unions and intersections of intervals. Such sets, however, are of no interest in most applications. To complete the specification of $\mathcal{S}$, it suffices to
assign probabilities to the events $\{x \le x_i\}$. All other probabilities can then be determined from the axioms.

Suppose that $\alpha(x)$ is a function such that (Fig. 2-9a)

$$\int_{-\infty}^{\infty} \alpha(x)\,dx = 1 \qquad \alpha(x) \ge 0 \tag{2-26}$$

We define the probability of the event $\{x \le x_i\}$ by the integral

$$P\{x \le x_i\} = \int_{-\infty}^{x_i} \alpha(x)\,dx \tag{2-27}$$

This specifies the probabilities of all events of $\mathcal{S}$. We maintain, for example, that the probability of the event $\{x_1 < x \le x_2\}$ consisting of all points in the interval $(x_1, x_2)$ is given by

$$P\{x_1 < x \le x_2\} = \int_{x_1}^{x_2} \alpha(x)\,dx \tag{2-28}$$

Indeed, the events $\{x \le x_1\}$ and $\{x_1 < x \le x_2\}$ are mutually exclusive and their union equals $\{x \le x_2\}$. Hence [see (2-10)]

$$P\{x \le x_1\} + P\{x_1 < x \le x_2\} = P\{x \le x_2\}$$

and (2-28) follows from (2-27).

We note that, if the function $\alpha(x)$ is bounded, then the integral in (2-28) tends to 0 as $x_1 \to x_2$. This leads to the conclusion that the probability of the event $\{x_2\}$ consisting of the single outcome $x_2$ is 0 for every $x_2$. In this case, the probability of all elementary events of $\mathcal{S}$ equals 0, although the probability of their union equals 1. This is not in conflict with (2-21) because the total number of elements of $\mathcal{S}$ is not countable.

Example 2-6. A radioactive substance is selected at $t = 0$ and the time $t$ of emission of a particle is observed. This process defines an experiment whose outcomes are all points on the positive $t$ axis. This experiment can be considered as a special case of the real line experiment if we assume that $\mathcal{S}$ is the entire $t$ axis and all events on the negative axis have zero probability.
Suppose then that the function $\alpha(t)$ in (2-26) is given by (Fig. 2-9b)

$$\alpha(t) = c\,e^{-ct}\,U(t) \qquad U(t) = \begin{cases} 1 & t \ge 0 \\ 0 & t < 0 \end{cases}$$

Inserting into (2-28), we conclude that the probability that a particle will be emitted in the time interval $(0, t_0)$ equals

$$c \int_0^{t_0} e^{-ct}\,dt = 1 - e^{-c t_0}$$

Example 2-7. A telephone call occurs at random in the interval $(0, T)$. This means that the probability that it will occur in the interval $0 \le t \le t_0$ equals $t_0/T$. Thus the outcomes of this experiment are all points in the interval $(0, T)$ and the probability of the event {the call will occur in the interval $(t_1, t_2)$} equals

$$P\{t_1 \le t \le t_2\} = \frac{t_2 - t_1}{T}$$

This is again a special case of (2-28) with $\alpha(t) = 1/T$ for $0 \le t \le T$ and 0 otherwise (Fig. 2-9c).

Probability masses. The probability $P(\mathcal{A})$ of an event $\mathcal{A}$ can be interpreted as the mass of the corresponding figure in its Venn diagram representation. Various identities have similar interpretations. Consider, for example, the identity $P(\mathcal{A} + \mathcal{B}) = P(\mathcal{A}) + P(\mathcal{B}) - P(\mathcal{A}\mathcal{B})$. The left side equals the mass of the event $\mathcal{A} + \mathcal{B}$. In the sum $P(\mathcal{A}) + P(\mathcal{B})$, the mass of $\mathcal{A}\mathcal{B}$ is counted twice (Fig. 2-3). To equate this sum with $P(\mathcal{A} + \mathcal{B})$, we must, therefore, subtract $P(\mathcal{A}\mathcal{B})$.

2-3 CONDITIONAL PROBABILITY

The conditional probability of an event $\mathcal{A}$ assuming $\mathcal{M}$, denoted by $P(\mathcal{A}|\mathcal{M})$, is by definition the ratio

$$P(\mathcal{A}|\mathcal{M}) = \frac{P(\mathcal{A}\mathcal{M})}{P(\mathcal{M})} \tag{2-29}$$

where we assume that $P(\mathcal{M})$ is not 0.

The following properties follow readily from the definition:

$$\text{If } \mathcal{M} \subset \mathcal{A} \text{ then } P(\mathcal{A}|\mathcal{M}) = 1 \tag{2-30}$$

because then $\mathcal{A}\mathcal{M} = \mathcal{M}$. Similarly,

$$\text{if } \mathcal{A} \subset \mathcal{M} \text{ then } P(\mathcal{A}|\mathcal{M}) = \frac{P(\mathcal{A})}{P(\mathcal{M})} \ge P(\mathcal{A}) \tag{2-31}$$
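The definition (2-29) and the two properties (2-30) and (2-31) can be checked on a small finite space. The following is only a sketch: it assumes the fair-die space with equally likely outcomes as in (2-25), and the helper names `prob` and `cond_prob` are hypothetical, introduced here for illustration.

```python
from fractions import Fraction

def prob(event, space):
    """P(event) when all outcomes of `space` are equally likely, as in (2-25)."""
    return Fraction(len(event & space), len(space))

def cond_prob(a, m, space):
    """P(A|M) = P(AM) / P(M), the definition (2-29); P(M) must not be 0."""
    p_m = prob(m, space)
    if p_m == 0:
        raise ValueError("P(M) must be nonzero")
    return prob(a & m, space) / p_m

# Assumed experiment: one roll of a fair die.
S = frozenset(range(1, 7))
even = frozenset({2, 4, 6})
two = frozenset({2})
at_most_four = frozenset({1, 2, 3, 4})

p_230 = cond_prob(even, two, S)            # M = {2} is a subset of A = {even}
p_231 = cond_prob(two, even, S)            # A = {2} is a subset of M = {even}
p_mixed = cond_prob(at_most_four, even, S) # neither event contains the other

print(p_230, p_231, p_mixed)  # prints: 1 1/3 2/3
```

Exact rational arithmetic is used so that the comparisons with (2-30) and (2-31) are not blurred by floating-point rounding.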
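Frequency interpretation. Denoting by $n_{\mathcal{A}}$, $n_{\mathcal{M}}$, and $n_{\mathcal{A}\mathcal{M}}$ the number of occurrences of the events $\mathcal{A}$, $\mathcal{M}$, and $\mathcal{A}\mathcal{M}$ respectively, we conclude from (1-1) that

$$P(\mathcal{A}) \simeq \frac{n_{\mathcal{A}}}{n} \qquad P(\mathcal{M}) \simeq \frac{n_{\mathcal{M}}}{n} \qquad P(\mathcal{A}\mathcal{M}) \simeq \frac{n_{\mathcal{A}\mathcal{M}}}{n}$$

Hence

$$P(\mathcal{A}|\mathcal{M}) = \frac{P(\mathcal{A}\mathcal{M})}{P(\mathcal{M})} \simeq \frac{n_{\mathcal{A}\mathcal{M}}/n}{n_{\mathcal{M}}/n} = \frac{n_{\mathcal{A}\mathcal{M}}}{n_{\mathcal{M}}} \tag{2-32}$$

This result can be phrased as follows: If we discard all trials in which the event $\mathcal{M}$ did not occur and we retain only the subsequence of trials in which $\mathcal{M}$ occurred, then $P(\mathcal{A}|\mathcal{M})$ equals the relative frequency of occurrence $n_{\mathcal{A}\mathcal{M}}/n_{\mathcal{M}}$ of the event $\mathcal{A}$ in that subsequence.

Fundamental remark. We shall show that, for a specific $\mathcal{M}$, the conditional probabilities are indeed probabilities; that is, they satisfy the axioms.

The first axiom is obviously satisfied because $P(\mathcal{A}\mathcal{M}) \ge 0$ and $P(\mathcal{M}) > 0$:

$$P(\mathcal{A}|\mathcal{M}) \ge 0 \tag{2-33}$$

The second follows from (2-30) because $\mathcal{M} \subset \mathcal{S}$:

$$P(\mathcal{S}|\mathcal{M}) = 1 \tag{2-34}$$

To prove the third, we observe that if the events $\mathcal{A}$ and $\mathcal{B}$ are mutually exclusive, then (Fig. 2-10) the events $\mathcal{A}\mathcal{M}$ and $\mathcal{B}\mathcal{M}$ are also mutually exclusive. Hence

$$P(\mathcal{A} + \mathcal{B}|\mathcal{M}) = \frac{P[(\mathcal{A} + \mathcal{B})\mathcal{M}]}{P(\mathcal{M})} = \frac{P(\mathcal{A}\mathcal{M}) + P(\mathcal{B}\mathcal{M})}{P(\mathcal{M})}$$

This yields the third axiom:

$$P(\mathcal{A} + \mathcal{B}|\mathcal{M}) = P(\mathcal{A}|\mathcal{M}) + P(\mathcal{B}|\mathcal{M}) \tag{2-35}$$

From the above it follows that all results involving probabilities hold also for conditional probabilities. The significance of this conclusion will be appreciated later.

[FIGURE 2-10: $\mathcal{A}\mathcal{B} = \{\emptyset\}$, $(\mathcal{A}\mathcal{M})(\mathcal{B}\mathcal{M}) = \{\emptyset\}$]
[FIGURE 2-11]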
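The frequency interpretation (2-32) can be illustrated by simulation. This is a rough sketch under stated assumptions: the experiment is taken to be the roll of a fair die, $\mathcal{M}$ = {even}, $\mathcal{A}$ = {two}, and the function name `conditional_frequency`, the seed, and the trial count are arbitrary choices made here, not part of the text.

```python
import random

def conditional_frequency(n_trials, seed=0):
    """Relative frequency n_AM / n_M of A = {two} among the trials where
    M = {even} occurred, for simulated rolls of a fair die."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    n_m = 0                     # occurrences of M
    n_am = 0                    # occurrences of AM (here AM = A)
    for _ in range(n_trials):
        face = rng.randint(1, 6)
        if face % 2 == 0:       # keep only the subsequence where M occurred
            n_m += 1
            if face == 2:
                n_am += 1
    return n_am / n_m

freq = conditional_frequency(100_000)
print(round(freq, 3))  # close to P(A|M) = 1/3 for a large number of trials
```

The agreement with 1/3 is only approximate, which is exactly the imprecise relationship that (2-32) expresses.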
Example 2-8. In the fair-die experiment, we shall determine the conditional probability of the event $\{f_2\}$ assuming that the event even occurred. With

$$\mathcal{A} = \{f_2\} \qquad \mathcal{M} = \{\text{even}\} = \{f_2, f_4, f_6\}$$

we have $P(\mathcal{A}) = 1/6$ and $P(\mathcal{M}) = 3/6$. And since $\mathcal{A}\mathcal{M} = \mathcal{A}$, (2-29) yields

$$P\{f_2|\text{even}\} = \frac{P\{f_2\}}{P\{\text{even}\}} = \frac{1}{3}$$

This equals the relative frequency of the occurrence of the event {two} in the subsequence whose outcomes are even numbers.

Example 2-9. We denote by $t$ the age of a person when he dies. The probability that $t \le t_0$ is given by

$$P\{t \le t_0\} = \int_0^{t_0} \alpha(t)\,dt$$

where $\alpha(t)$ is a function determined from mortality records. We shall assume that

$$\alpha(t) = 3 \times 10^{-9}\, t^2 (100 - t)^2 \qquad 0 \le t \le 100 \text{ years}$$

and 0 otherwise (Fig. 2-11).

From (2-28) it follows that the probability that a person will die between the ages of 60 and 70 equals

$$P\{60 \le t \le 70\} = \int_{60}^{70} \alpha(t)\,dt = 0.154$$

This equals the number of people who die between the ages of 60 and 70 divided by the total population.

With

$$\mathcal{A} = \{60 \le t \le 70\} \qquad \mathcal{M} = \{t \ge 60\} \qquad \mathcal{A}\mathcal{M} = \mathcal{A}$$

it follows from (2-29) that the probability that a person will die between the ages of 60 and 70 assuming that he was alive at 60 equals

$$P\{60 \le t \le 70 \,|\, t \ge 60\} = \frac{\int_{60}^{70} \alpha(t)\,dt}{\int_{60}^{100} \alpha(t)\,dt} = 0.486$$

This equals the number of people who die between the ages of 60 and 70 divided by the number of people that are alive at age 60.

Example 2-10. A box contains three white balls $w_1, w_2, w_3$ and two red balls $r_1, r_2$. We remove at random two balls in succession. What is the probability that the first removed ball is white and the second is red?

We shall give two solutions to this problem. In the first, we apply (2-25); in the second, we use conditional probabilities.
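First solution. The space of our experiment consists of all ordered pairs that we can form with the five balls:

$$w_1w_2 \quad w_1w_3 \quad w_1r_1 \quad w_1r_2 \quad \cdots \quad r_2w_1 \quad r_2w_2 \quad r_2w_3 \quad r_2r_1$$

The number of such pairs equals $5 \times 4 = 20$. The event {white first, red second} consists of the six outcomes

$$w_1r_1 \quad w_2r_1 \quad w_3r_1 \quad w_1r_2 \quad w_2r_2 \quad w_3r_2$$

Hence [see (2-25)] its probability equals 6/20.

Second solution. Since the box contains three white and two red balls, the probability of the event $\mathcal{W}_1$ = {white first} equals 3/5. If a white ball is removed, there remain two white and two red balls; hence the conditional probability $P(\mathcal{R}_2|\mathcal{W}_1)$ of the event $\mathcal{R}_2$ = {red second} assuming {white first} equals 2/4. From this and (2-29) it follows that

$$P(\mathcal{W}_1\mathcal{R}_2) = P(\mathcal{R}_2|\mathcal{W}_1)P(\mathcal{W}_1) = \frac{2}{4} \times \frac{3}{5} = \frac{6}{20}$$

where $\mathcal{W}_1\mathcal{R}_2$ is the event {white first, red second}.

Total Probability and Bayes' Theorem

If $\mathcal{U} = [\mathcal{A}_1, \ldots, \mathcal{A}_n]$ is a partition of $\mathcal{S}$ and $\mathcal{B}$ is an arbitrary event (Fig. 2-5), then

$$P(\mathcal{B}) = P(\mathcal{B}|\mathcal{A}_1)P(\mathcal{A}_1) + \cdots + P(\mathcal{B}|\mathcal{A}_n)P(\mathcal{A}_n) \tag{2-36}$$

Proof. Clearly,

$$\mathcal{B} = \mathcal{B}\mathcal{S} = \mathcal{B}(\mathcal{A}_1 + \cdots + \mathcal{A}_n) = \mathcal{B}\mathcal{A}_1 + \cdots + \mathcal{B}\mathcal{A}_n$$

But the events $\mathcal{B}\mathcal{A}_i$ and $\mathcal{B}\mathcal{A}_j$ are mutually exclusive because the events $\mathcal{A}_i$ and $\mathcal{A}_j$ are mutually exclusive [see (2-4)]. Hence

$$P(\mathcal{B}) = P(\mathcal{B}\mathcal{A}_1) + \cdots + P(\mathcal{B}\mathcal{A}_n)$$

and (2-36) follows because [see (2-29)]

$$P(\mathcal{B}\mathcal{A}_i) = P(\mathcal{B}|\mathcal{A}_i)P(\mathcal{A}_i) \tag{2-37}$$

This result is known as the total probability theorem.

Since $P(\mathcal{B}\mathcal{A}_i) = P(\mathcal{A}_i|\mathcal{B})P(\mathcal{B})$, we conclude with (2-37) that

$$P(\mathcal{A}_i|\mathcal{B}) = P(\mathcal{B}|\mathcal{A}_i)\frac{P(\mathcal{A}_i)}{P(\mathcal{B})} \tag{2-38}$$

Inserting (2-36) into (2-38), we obtain Bayes' theorem†:

$$P(\mathcal{A}_i|\mathcal{B}) = \frac{P(\mathcal{B}|\mathcal{A}_i)P(\mathcal{A}_i)}{P(\mathcal{B}|\mathcal{A}_1)P(\mathcal{A}_1) + \cdots + P(\mathcal{B}|\mathcal{A}_n)P(\mathcal{A}_n)} \tag{2-39}$$

†The main idea of this theorem is due to Thomas Bayes (1763). However, its final form (2-39) was given by Laplace several years later.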
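Returning to Example 2-10, both solutions can be reproduced in a few lines. The sketch below assumes only what the example states (three white and two red balls, two drawn in succession without replacement, all ordered pairs equally likely); the variable names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

balls = ["w1", "w2", "w3", "r1", "r2"]

# First solution: enumerate all ordered pairs and count the favorable ones,
# then apply the classical ratio (2-25).
pairs = list(permutations(balls, 2))
favorable = [(a, b) for a, b in pairs if a[0] == "w" and b[0] == "r"]
p_first = Fraction(len(favorable), len(pairs))

# Second solution: P(W1 R2) = P(R2 | W1) P(W1), from (2-29).
p_white_first = Fraction(3, 5)
p_red_second_given_white_first = Fraction(2, 4)
p_second = p_red_second_given_white_first * p_white_first

print(len(pairs), len(favorable), p_first, p_second)  # prints: 20 6 3/10 3/10
```

Both routes give 6/20, which `Fraction` reduces to 3/10.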
Note. The terms a priori and a posteriori are often used for the probabilities $P(\mathcal{A}_i)$ and $P(\mathcal{A}_i|\mathcal{B})$.

Example 2-11. We have four boxes. Box 1 contains 2000 components of which 5 percent are defective. Box 2 contains 500 components of which 40 percent are defective. Boxes 3 and 4 contain 1000 each with 10 percent defective. We select at random one of the boxes and we remove at random a single component.

(a) What is the probability that the selected component is defective?

The space of this experiment consists of 4000 good (g) components and 500 defective (d) components arranged as follows:

Box 1: 1900g, 100d
Box 2: 300g, 200d
Box 3: 900g, 100d
Box 4: 900g, 100d

We denote by $\mathcal{B}_i$ the event consisting of all components in the $i$th box and by $\mathcal{D}$ the event consisting of all defective components. Clearly,

$$P(\mathcal{B}_1) = P(\mathcal{B}_2) = P(\mathcal{B}_3) = P(\mathcal{B}_4) = \frac{1}{4} \tag{2-40}$$

because the boxes are selected at random. The probability that a component taken from a specific box is defective equals the ratio of the defective to the total number of components in that box. This means that

$$P(\mathcal{D}|\mathcal{B}_1) = \frac{100}{2000} = 0.05 \qquad P(\mathcal{D}|\mathcal{B}_2) = \frac{200}{500} = 0.4 \qquad P(\mathcal{D}|\mathcal{B}_3) = \frac{100}{1000} = 0.1 \qquad P(\mathcal{D}|\mathcal{B}_4) = \frac{100}{1000} = 0.1 \tag{2-41}$$

And since the events $\mathcal{B}_i$ form a partition of $\mathcal{S}$, we conclude from (2-36) that

$$P(\mathcal{D}) = 0.05 \times \tfrac{1}{4} + 0.4 \times \tfrac{1}{4} + 0.1 \times \tfrac{1}{4} + 0.1 \times \tfrac{1}{4} = 0.1625$$

This is the probability that the selected component is defective.

(b) We examine the selected component and we find it defective. On the basis of this evidence, we want to determine the probability that it came from box 2.

We now want the conditional probability $P(\mathcal{B}_2|\mathcal{D})$. Since

$$P(\mathcal{D}) = 0.1625 \qquad P(\mathcal{D}|\mathcal{B}_2) = 0.4 \qquad P(\mathcal{B}_2) = 0.25$$

(2-38) yields

$$P(\mathcal{B}_2|\mathcal{D}) = 0.4 \times \frac{0.25}{0.1625} = 0.615$$

Thus the a priori probability of selecting box 2 equals 0.25 and the a posteriori probability assuming that the selected component is defective equals 0.615. These probabilities have the following frequency interpretation: If the experiment is performed $n$ times, then box 2 is selected $0.25n$ times.
If we consider only the $n_{\mathcal{D}}$ experiments in which the removed part is defective, then the number of times the part is taken from box 2 equals $0.615\,n_{\mathcal{D}}$.

We conclude with a comment on the distinction between assumptions and deductions: Equations (2-40) and (2-41) are not derived; they are merely reasonable assumptions. Based on these assumptions and on the axioms, we deduce that $P(\mathcal{D}) = 0.1625$ and $P(\mathcal{B}_2|\mathcal{D}) = 0.615$.
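The arithmetic of Example 2-11 follows directly from (2-36) and (2-38), and can be sketched as below. The function names `total_probability` and `bayes` are hypothetical helpers; the priors and likelihoods are the assumed values (2-40) and (2-41).

```python
from fractions import Fraction

def total_probability(likelihoods, priors):
    """P(B) = sum of P(B|A_i) P(A_i) over a partition, as in (2-36)."""
    return sum(l * p for l, p in zip(likelihoods, priors))

def bayes(likelihoods, priors):
    """A posteriori probabilities P(A_i|B) from (2-39)."""
    p_b = total_probability(likelihoods, priors)
    return [l * p / p_b for l, p in zip(likelihoods, priors)]

# Assumptions (2-40) and (2-41) of Example 2-11.
priors = [Fraction(1, 4)] * 4
likelihoods = [
    Fraction(100, 2000),  # box 1
    Fraction(200, 500),   # box 2
    Fraction(100, 1000),  # box 3
    Fraction(100, 1000),  # box 4
]

p_defective = total_probability(likelihoods, priors)
posteriors = bayes(likelihoods, priors)

print(float(p_defective))              # prints: 0.1625
print(round(float(posteriors[1]), 3))  # prints: 0.615
```

The exact a posteriori probability for box 2 is 8/13; the value 0.615 in the example is this fraction rounded to three decimals.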
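Independence

Two events $\mathcal{A}$ and $\mathcal{B}$ are called independent if

$$P(\mathcal{A}\mathcal{B}) = P(\mathcal{A})P(\mathcal{B}) \tag{2-42}$$

The concept of independence is fundamental. In fact, it is this concept that justifies the mathematical development of probability, not merely as a topic in measure theory, but as a separate discipline. The significance of independence will be appreciated later in the context of repeated trials. We discuss here only various simple properties.

Frequency interpretation. Denoting by $n_{\mathcal{A}}$, $n_{\mathcal{B}}$, and $n_{\mathcal{A}\mathcal{B}}$ the number of occurrences of the events $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{A}\mathcal{B}$ respectively, we have

$$P(\mathcal{A}) \simeq \frac{n_{\mathcal{A}}}{n} \qquad P(\mathcal{B}) \simeq \frac{n_{\mathcal{B}}}{n} \qquad P(\mathcal{A}\mathcal{B}) \simeq \frac{n_{\mathcal{A}\mathcal{B}}}{n}$$

If the events $\mathcal{A}$ and $\mathcal{B}$ are independent, then

$$\frac{n_{\mathcal{A}}}{n} \simeq P(\mathcal{A}) = \frac{P(\mathcal{A}\mathcal{B})}{P(\mathcal{B})} \simeq \frac{n_{\mathcal{A}\mathcal{B}}/n}{n_{\mathcal{B}}/n} = \frac{n_{\mathcal{A}\mathcal{B}}}{n_{\mathcal{B}}}$$

Thus, if $\mathcal{A}$ and $\mathcal{B}$ are independent, then the relative frequency $n_{\mathcal{A}}/n$ of the occurrence of $\mathcal{A}$ in the original sequence of $n$ trials equals the relative frequency $n_{\mathcal{A}\mathcal{B}}/n_{\mathcal{B}}$ of the occurrence of $\mathcal{A}$ in the subsequence in which $\mathcal{B}$ occurs.

We show next that if the events $\mathcal{A}$ and $\mathcal{B}$ are independent, then the events $\overline{\mathcal{A}}$ and $\mathcal{B}$ and the events $\overline{\mathcal{A}}$ and $\overline{\mathcal{B}}$ are also independent.

As we know, the events $\mathcal{A}\mathcal{B}$ and $\overline{\mathcal{A}}\mathcal{B}$ are mutually exclusive and

$$\mathcal{B} = \mathcal{A}\mathcal{B} + \overline{\mathcal{A}}\mathcal{B} \qquad P(\overline{\mathcal{A}}) = 1 - P(\mathcal{A})$$

From this and (2-42) it follows that

$$P(\overline{\mathcal{A}}\mathcal{B}) = P(\mathcal{B}) - P(\mathcal{A}\mathcal{B}) = [1 - P(\mathcal{A})]P(\mathcal{B}) = P(\overline{\mathcal{A}})P(\mathcal{B})$$

This establishes the independence of $\overline{\mathcal{A}}$ and $\mathcal{B}$. Repeating the argument, we conclude that $\overline{\mathcal{A}}$ and $\overline{\mathcal{B}}$ are also independent.

In the next two examples, we illustrate the concept of independence. In Example 2-12a, we start with a known experiment and we show that two of its events are independent. In Examples 2-12b and 2-13 we use the concept of independence to complete the specification of each experiment. This idea is developed further in the next chapter.

Example 2-12. If we toss a coin twice, we generate the four outcomes hh, ht, th, and tt.

(a) To construct an experiment with these outcomes, it suffices to assign probabilities to its elementary events. With $a$ and $b$ two positive numbers such that $a + b = 1$, we assume that

$$P\{hh\} = a^2 \qquad P\{ht\} = P\{th\} = ab \qquad P\{tt\} = b^2$$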
These probabilities are consistent with the axioms because

$$a^2 + ab + ab + b^2 = (a + b)^2 = 1$$

In the experiment so constructed, the events

$$\mathcal{H}_1 = \{\text{heads at first toss}\} = \{hh, ht\} \qquad \mathcal{H}_2 = \{\text{heads at second toss}\} = \{hh, th\}$$

consist of two elements each, and their probabilities are [see (2-23)]

$$P(\mathcal{H}_1) = P\{hh\} + P\{ht\} = a^2 + ab = a \qquad P(\mathcal{H}_2) = P\{hh\} + P\{th\} = a^2 + ab = a$$

The intersection $\mathcal{H}_1\mathcal{H}_2$ of these two events consists of the single outcome $\{hh\}$. Hence

$$P(\mathcal{H}_1\mathcal{H}_2) = P\{hh\} = a^2 = P(\mathcal{H}_1)P(\mathcal{H}_2)$$

This shows that the events $\mathcal{H}_1$ and $\mathcal{H}_2$ are independent.

(b) The above experiment can be specified in terms of the probabilities $P(\mathcal{H}_1) = P(\mathcal{H}_2) = a$ of the events $\mathcal{H}_1$ and $\mathcal{H}_2$, and the information that these events are independent.

Indeed, as we have shown, the events $\overline{\mathcal{H}}_1$ and $\mathcal{H}_2$ and the events $\overline{\mathcal{H}}_1$ and $\overline{\mathcal{H}}_2$ are also independent. Furthermore,

$$\mathcal{H}_1\mathcal{H}_2 = \{hh\} \qquad \mathcal{H}_1\overline{\mathcal{H}}_2 = \{ht\} \qquad \overline{\mathcal{H}}_1\mathcal{H}_2 = \{th\} \qquad \overline{\mathcal{H}}_1\overline{\mathcal{H}}_2 = \{tt\}$$

and $P(\overline{\mathcal{H}}_1) = 1 - P(\mathcal{H}_1) = 1 - a$, $P(\overline{\mathcal{H}}_2) = 1 - P(\mathcal{H}_2) = 1 - a$. Hence

$$P\{hh\} = a^2 \qquad P\{ht\} = a(1 - a) \qquad P\{th\} = (1 - a)a \qquad P\{tt\} = (1 - a)^2$$

[FIGURE 2-12]

Example 2-13. Trains $X$ and $Y$ arrive at a station at random between 8 A.M. and 8:20 A.M. Train $X$ stops for four minutes and train $Y$ stops for five minutes. Assuming that the trains arrive independently of each other, we shall determine various probabilities related to the times $x$ and $y$ of their respective arrivals. To do so, we must first specify the underlying experiment.

The outcomes of this experiment are all points $(x, y)$ in the square of Fig. 2-12. The event

$$\mathcal{A} = \{X \text{ arrives in the interval } (t_1, t_2)\} = \{t_1 \le x \le t_2\}$$

is a vertical strip as in Fig. 2-12a and its probability equals $(t_2 - t_1)/20$. This is
our interpretation of the information that the train arrives at random. Similarly, the event

$$\mathcal{B} = \{Y \text{ arrives in the interval } (t_3, t_4)\} = \{t_3 \le y \le t_4\}$$

is a horizontal strip and its probability equals $(t_4 - t_3)/20$.

Proceeding similarly, we can determine the probabilities of any horizontal or vertical sets of points. To complete the specification of the experiment, we must determine also the probabilities of their intersections. Interpreting the independence of the arrival times as independence of the events $\mathcal{A}$ and $\mathcal{B}$, we obtain

$$P(\mathcal{A}\mathcal{B}) = P(\mathcal{A})P(\mathcal{B}) = \frac{(t_2 - t_1)(t_4 - t_3)}{20 \times 20}$$

The event $\mathcal{A}\mathcal{B}$ is the rectangle shown in the figure. Since the coordinates of this rectangle are arbitrary, we conclude that the probability of any rectangle equals its area divided by 400. In the plane, all events are unions and intersections of rectangles forming a Borel field. This shows that the probability that the point $(x, y)$ will be in an arbitrary region $R$ of the plane equals the area of $R$ divided by 400. This completes the specification of the experiment.

(a) We shall determine the probability that train $X$ arrives before train $Y$. This is the probability of the event

$$\mathcal{C} = \{x \le y\}$$

shown in Fig. 2-12b. This event is a triangle with area 200. Hence

$$P(\mathcal{C}) = \frac{200}{400}$$

(b) We shall determine the probability that the trains meet at the station. For the trains to meet, $x$ must be less than $y + 5$ and $y$ must be less than $x + 4$. This is the event

$$\mathcal{D} = \{-4 \le x - y \le 5\}$$

of Fig. 2-12c. As we see from the figure, the region $\mathcal{D}$ consists of two trapezoids with common base, and its area equals 159.5. Hence

$$P(\mathcal{D}) = \frac{159.5}{400}$$

(c) Assuming that the trains met, we shall determine the probability that train $X$ arrived before train $Y$. We wish to find the conditional probability $P(\mathcal{C}|\mathcal{D})$. The event $\mathcal{C}\mathcal{D}$ is a trapezoid as shown and its area equals 72. Hence

$$P(\mathcal{C}|\mathcal{D}) = \frac{P(\mathcal{C}\mathcal{D})}{P(\mathcal{D})} = \frac{72}{159.5}$$

INDEPENDENCE OF THREE EVENTS. The events $\mathcal{A}_1$, $\mathcal{A}_2$, and $\mathcal{A}_3$ are called (mutually) independent if they are independent in pairs:

$$P(\mathcal{A}_i\mathcal{A}_j) = P(\mathcal{A}_i)P(\mathcal{A}_j) \qquad i \ne j \tag{2-43}$$

and

$$P(\mathcal{A}_1\mathcal{A}_2\mathcal{A}_3) = P(\mathcal{A}_1)P(\mathcal{A}_2)P(\mathcal{A}_3) \tag{2-44}$$
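Conditions (2-43) and (2-44) can be tested mechanically on a finite space. The sketch below is an illustration that is not taken from the text: it assumes two tosses of a fair coin with the events "heads at first toss", "heads at second toss", and "both tosses agree", a standard case in which (2-43) holds but (2-44) fails. The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def prob(event, probs):
    """Sum of the elementary probabilities of the outcomes in `event`, as in (2-23)."""
    return sum(probs[z] for z in event)

def pairwise_independent(events, probs):
    """Check (2-43) for every pair of events."""
    return all(
        prob(a & b, probs) == prob(a, probs) * prob(b, probs)
        for a, b in combinations(events, 2)
    )

def mutually_independent(events, probs):
    """Check (2-43) and (2-44) for exactly three events."""
    a, b, c = events
    triple_ok = prob(a & b & c, probs) == prob(a, probs) * prob(b, probs) * prob(c, probs)
    return pairwise_independent(events, probs) and triple_ok

# Assumed experiment: two tosses of a fair coin, all four outcomes equally likely.
probs = {z: Fraction(1, 4) for z in ("hh", "ht", "th", "tt")}
A = frozenset({"hh", "ht"})  # heads at first toss
B = frozenset({"hh", "th"})  # heads at second toss
C = frozenset({"hh", "tt"})  # both tosses agree

pairwise = pairwise_independent([A, B, C], probs)
mutual = mutually_independent([A, B, C], probs)
print(pairwise, mutual)  # prints: True False
```

Here each pair intersects in {hh} with probability 1/4 = 1/2 × 1/2, but the triple intersection is also {hh}, whose probability 1/4 differs from 1/8.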
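We should emphasize that three events might be independent in pairs but not independent. The next example is an illustration.

Example 2-14. Suppose that the events $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{C}$ of Fig. 2-13 have the same probability

$$P(\mathcal{A}) = P(\mathcal{B}) = P(\mathcal{C}) = \frac{1}{5}$$

and the intersections $\mathcal{A}\mathcal{B}$, $\mathcal{A}\mathcal{C}$, $\mathcal{B}\mathcal{C}$, and $\mathcal{A}\mathcal{B}\mathcal{C}$ also have the same probability

$$p = P(\mathcal{A}\mathcal{B}) = P(\mathcal{A}\mathcal{C}) = P(\mathcal{B}\mathcal{C}) = P(\mathcal{A}\mathcal{B}\mathcal{C})$$

(a) If $p = 1/25$, then these events are independent in pairs but they are not independent because

$$P(\mathcal{A}\mathcal{B}\mathcal{C}) \ne P(\mathcal{A})P(\mathcal{B})P(\mathcal{C})$$

(b) If $p = 1/125$, then $P(\mathcal{A}\mathcal{B}\mathcal{C}) = P(\mathcal{A})P(\mathcal{B})P(\mathcal{C})$ but the events are not independent because

$$P(\mathcal{A}\mathcal{B}) \ne P(\mathcal{A})P(\mathcal{B})$$

From the independence of the events $\mathcal{A}_1$, $\mathcal{A}_2$, and $\mathcal{A}_3$ it follows that:

1. Any one of them is independent of the intersection of the other two. Indeed, from (2-43) and (2-44) it follows that

$$P(\mathcal{A}_1\mathcal{A}_2\mathcal{A}_3) = P(\mathcal{A}_1)P(\mathcal{A}_2)P(\mathcal{A}_3) = P(\mathcal{A}_1)P(\mathcal{A}_2\mathcal{A}_3) \tag{2-45}$$

Hence the events $\mathcal{A}_1$ and $\mathcal{A}_2\mathcal{A}_3$ are independent.

2. If we replace one or more of these events with their complements, the resulting events are also independent. Indeed, since

$$\mathcal{A}_1\mathcal{A}_2 = \mathcal{A}_1\mathcal{A}_2\mathcal{A}_3 + \mathcal{A}_1\mathcal{A}_2\overline{\mathcal{A}}_3 \qquad P(\overline{\mathcal{A}}_3) = 1 - P(\mathcal{A}_3)$$

we conclude with (2-45) that

$$P(\mathcal{A}_1\mathcal{A}_2\overline{\mathcal{A}}_3) = P(\mathcal{A}_1\mathcal{A}_2) - P(\mathcal{A}_1\mathcal{A}_2)P(\mathcal{A}_3) = P(\mathcal{A}_1)P(\mathcal{A}_2)P(\overline{\mathcal{A}}_3)$$

Hence the events $\mathcal{A}_1$, $\mathcal{A}_2$, and $\overline{\mathcal{A}}_3$ are independent because they satisfy (2-44) and, as we have shown earlier in the section, they are also independent in pairs.

3. Any one of them is independent of the union of the other two.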
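To show that the events $\mathcal{A}_1$ and $\mathcal{A}_2 + \mathcal{A}_3$ are independent, it suffices to show that the events $\mathcal{A}_1$ and $\overline{\mathcal{A}_2 + \mathcal{A}_3} = \overline{\mathcal{A}}_2\overline{\mathcal{A}}_3$ are independent. This follows from 1 and 2.

Generalization. The independence of $n$ events can be defined inductively: Suppose that we have defined independence of $k$ events for every $k < n$. We then say that the events $\mathcal{A}_1, \ldots, \mathcal{A}_n$ are independent if any $k < n$ of them are independent and

$$P(\mathcal{A}_1 \cdots \mathcal{A}_n) = P(\mathcal{A}_1) \cdots P(\mathcal{A}_n) \tag{2-46}$$

This completes the definition for any $n$ because we have defined independence for $n = 2$.

PROBLEMS

2-1. Show that (a) $\overline{\overline{\mathcal{A}} + \overline{\mathcal{B}}} + \overline{\overline{\mathcal{A}} + \mathcal{B}} = \mathcal{A}$; (b) $(\mathcal{A} + \mathcal{B})(\overline{\mathcal{A}\mathcal{B}}) = \mathcal{A}\overline{\mathcal{B}} + \mathcal{B}\overline{\mathcal{A}}$.

2-2. If $\mathcal{A} = \{2 \le x \le 5\}$ and $\mathcal{B} = \{3 \le x \le 6\}$, find $\mathcal{A} + \mathcal{B}$, $\mathcal{A}\mathcal{B}$, and $(\mathcal{A} + \mathcal{B})(\overline{\mathcal{A}\mathcal{B}})$.

2-3. Show that if $\mathcal{A}\mathcal{B} = \{\emptyset\}$, then $P(\mathcal{A}) \le P(\overline{\mathcal{B}})$.

2-4. Show that (a) if $P(\mathcal{A}) = P(\mathcal{B}) = P(\mathcal{A}\mathcal{B})$, then $P(\mathcal{A}\overline{\mathcal{B}} + \mathcal{B}\overline{\mathcal{A}}) = 0$; (b) if $P(\mathcal{A}) = P(\mathcal{B}) = 1$, then $P(\mathcal{A}\mathcal{B}) = 1$.

2-5. Prove and generalize the following identity:

$$P(\mathcal{A} + \mathcal{B} + \mathcal{C}) = P(\mathcal{A}) + P(\mathcal{B}) + P(\mathcal{C}) - P(\mathcal{A}\mathcal{B}) - P(\mathcal{A}\mathcal{C}) - P(\mathcal{B}\mathcal{C}) + P(\mathcal{A}\mathcal{B}\mathcal{C})$$

2-6. Show that if $\mathcal{S}$ consists of a countable number of elements $\zeta_i$ and each subset $\{\zeta_i\}$ is an event, then all subsets of $\mathcal{S}$ are events.

2-7. If $\mathcal{S} = \{1, 2, 3, 4\}$, find the smallest field that contains the sets $\{1\}$ and $\{2, 3\}$.

2-8. If $\mathcal{A} \subset \mathcal{B}$, $P(\mathcal{A}) = 1/4$, and $P(\mathcal{B}) = 1/3$, find $P(\mathcal{A}|\mathcal{B})$ and $P(\mathcal{B}|\mathcal{A})$.

2-9. Show that $P(\mathcal{A}\mathcal{B}|\mathcal{C}) = P(\mathcal{A}|\mathcal{B}\mathcal{C})P(\mathcal{B}|\mathcal{C})$ and $P(\mathcal{A}\mathcal{B}\mathcal{C}) = P(\mathcal{A}|\mathcal{B}\mathcal{C})P(\mathcal{B}|\mathcal{C})P(\mathcal{C})$.

2-10. (Chain rule) Show that

$$P(\mathcal{A}_n \cdots \mathcal{A}_1) = P(\mathcal{A}_n|\mathcal{A}_{n-1} \cdots \mathcal{A}_1) \cdots P(\mathcal{A}_2|\mathcal{A}_1)P(\mathcal{A}_1)$$

2-11. We select at random $m$ objects from a set $\mathcal{S}$ of $n$ objects and we denote by $\mathcal{A}_m$ the set of the selected objects. Show that the probability $p$ that a particular element $\zeta_0$ of $\mathcal{S}$ is in $\mathcal{A}_m$ equals $m/n$. Hint: $p$ equals the probability that a randomly selected element of $\mathcal{S}$ is in $\mathcal{A}_m$.

2-12. A call occurs at time $t$ where $t$ is a random point in the interval $(0, 10)$. (a) Find $P\{6 \le t \le 8\}$. (b) Find $P\{6 \le t \le 8 \,|\, t > 5\}$.

2-13. The space $\mathcal{S}$ is the set of all positive numbers $t$.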
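Show that if $P\{t_0 \le t \le t_0 + t_1 \,|\, t \ge t_0\} = P\{t \le t_1\}$ for every $t_0$ and $t_1$, then $P\{t \le t_1\} = 1 - e^{-c t_1}$ where $c$ is a constant.

2-14. The events $\mathcal{A}$ and $\mathcal{B}$ are mutually exclusive. Can they be independent?

2-15. Show that if the events $\mathcal{A}_1, \ldots, \mathcal{A}_n$ are independent and $\mathcal{B}_i$ equals $\mathcal{A}_i$ or $\overline{\mathcal{A}}_i$ or $\mathcal{S}$, then the events $\mathcal{B}_1, \ldots, \mathcal{B}_n$ are also independent.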
2-16. Show that $2^n - (n + 1)$ equations are needed to establish the independence of $n$ events.

2-17. Box 1 contains 1 white and 999 red balls. Box 2 contains 1 red and 999 white balls. A ball is picked from a randomly selected box. If the ball is red, what is the probability that it came from box 1?

2-18. Box 1 contains 1000 bulbs of which 10 percent are defective. Box 2 contains 2000 bulbs of which 5 percent are defective. Two bulbs are picked from a randomly selected box. (a) Find the probability that both bulbs are defective. (b) Assuming that both are defective, find the probability that they came from box 1.

2-19. A train and a bus arrive at the station at random between 9 A.M. and 10 A.M. The train stops for 10 minutes and the bus for $x$ minutes. Find $x$ so that the probability that the bus and the train will meet equals 0.5.

2-20. Show that a set $\mathcal{S}$ with $n$ elements has

$$\frac{n(n - 1) \cdots (n - k + 1)}{1 \cdot 2 \cdots k} = \frac{n!}{k!\,(n - k)!}$$

$k$-element subsets.

2-21. We have two coins; the first is fair and the second two-headed. We pick one of the coins at random, we toss it twice and heads shows both times. Find the probability that the coin picked is fair.
CHAPTER 3 REPEATED TRIALS

3-1 COMBINED EXPERIMENTS

We are given two experiments: The first experiment is the rolling of a fair die

$$\mathcal{S}_1 = \{f_1, \ldots, f_6\} \qquad P_1\{f_i\} = \tfrac{1}{6}$$

The second experiment is the tossing of a fair coin

$$\mathcal{S}_2 = \{h, t\} \qquad P_2\{h\} = P_2\{t\} = \tfrac{1}{2}$$

We perform both experiments and we want to find the probability that we get "two" on the die and "heads" on the coin.

If we make the reasonable assumption that the outcomes of the first experiment are independent of the outcomes of the second, we conclude that the unknown probability equals $1/6 \times 1/2$.

The above conclusion is reasonable; however, the notion of independence used in its derivation does not agree with the definition given in (2-42). In that definition, the events $\mathcal{A}$ and $\mathcal{B}$ were subsets of the same space. In order to fit the above conclusion into our theory, we must, therefore, construct a space $\mathcal{S}$ having as subsets the events "two" and "heads." This is done as follows:

The two experiments are viewed as a single experiment whose outcomes are pairs $\zeta_1\zeta_2$, where $\zeta_1$ is one of the six faces of the die and $\zeta_2$ is heads or
tails.† The resulting space consists of the 12 elements

$$f_1h, \ldots, f_6h, f_1t, \ldots, f_6t$$

In this space, {two} is not an elementary event but a subset consisting of two elements

$$\{\text{two}\} = \{f_2h, f_2t\}$$

Similarly, {heads} is an event with six elements

$$\{\text{heads}\} = \{f_1h, \ldots, f_6h\}$$

To complete the experiment, we must assign probabilities to all subsets of $\mathcal{S}$. Clearly, the event {two} occurs if the die shows "two" no matter what shows on the coin. Hence

$$P\{\text{two}\} = P_1\{f_2\} = \tfrac{1}{6}$$

Similarly,

$$P\{\text{heads}\} = P_2\{h\} = \tfrac{1}{2}$$

The intersection of the events {two} and {heads} is the elementary event $\{f_2h\}$. Assuming that the events {two} and {heads} are independent in the sense of (2-42), we conclude that $P\{f_2h\} = 1/6 \times 1/2$ in agreement with our earlier conclusion.

CARTESIAN PRODUCTS. Given two sets $\mathcal{S}_1$ and $\mathcal{S}_2$ with elements $\zeta_1$ and $\zeta_2$ respectively, we form all ordered pairs $\zeta_1\zeta_2$ where $\zeta_1$ is any element of $\mathcal{S}_1$ and $\zeta_2$ is any element of $\mathcal{S}_2$. The cartesian product of the sets $\mathcal{S}_1$ and $\mathcal{S}_2$ is a set $\mathcal{S}$ whose elements are all such pairs. This set is written in the form

$$\mathcal{S} = \mathcal{S}_1 \times \mathcal{S}_2$$

Example 3-1. The cartesian product of the sets

$$\mathcal{S}_1 = \{\text{car}, \text{apple}, \text{bird}\} \qquad \mathcal{S}_2 = \{h, t\}$$

has six elements

$$\mathcal{S}_1 \times \mathcal{S}_2 = \{\text{car-}h, \text{car-}t, \text{apple-}h, \text{apple-}t, \text{bird-}h, \text{bird-}t\}$$

Example 3-2. If $\mathcal{S}_1 = \{h, t\}$, $\mathcal{S}_2 = \{h, t\}$, then

$$\mathcal{S}_1 \times \mathcal{S}_2 = \{hh, ht, th, tt\}$$

In this example, the sets $\mathcal{S}_1$ and $\mathcal{S}_2$ are identical. We note also that the element $ht$ is different from the element $th$.

†In the earlier discussion, the symbol $\zeta_i$ represented a single element of a set $\mathcal{S}$. In the following, $\zeta_i$ will also represent an arbitrary element of a set $\mathcal{S}_i$. It will be understood from the context whether $\zeta_i$ is one particular element or any element of $\mathcal{S}_i$.
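The cartesian products of Examples 3-1 and 3-2, and the 12-element die-and-coin space, can be generated with `itertools.product` from the Python standard library. This is a minimal sketch; the string labels for the outcomes and the helper name `cartesian` are choices made here for illustration.

```python
from itertools import product

def cartesian(s1, s2):
    """All ordered pairs of s1 and s2, written as 'a-b' labels."""
    return [f"{a}-{b}" for a, b in product(s1, s2)]

# Example 3-1
ex_3_1 = cartesian(["car", "apple", "bird"], ["h", "t"])

# Example 3-2: the two factor sets are identical, and ht differs from th
ex_3_2 = ["".join(pair) for pair in product("ht", repeat=2)]

# Die-and-coin space of Sec. 3-1, and the event {two} as a subset of it
die_coin = [f"f{i}{c}" for i, c in product(range(1, 7), "ht")]
two = [z for z in die_coin if z.startswith("f2")]

print(ex_3_1)
print(ex_3_2)               # prints: ['hh', 'ht', 'th', 'tt']
print(len(die_coin), two)   # prints: 12 ['f2h', 'f2t']
```

Because the pairs are ordered, `ht` and `th` appear as distinct elements, as noted in Example 3-2.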
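[FIGURE 3-1]

If $\mathcal{A}$ is a subset of $\mathcal{S}_1$ and $\mathcal{B}$ is a subset of $\mathcal{S}_2$, then the set

$$\mathcal{C} = \mathcal{A} \times \mathcal{B}$$

consisting of all pairs $\zeta_1\zeta_2$ where $\zeta_1 \in \mathcal{A}$ and $\zeta_2 \in \mathcal{B}$ is a subset of $\mathcal{S}$.

Forming similarly the sets $\mathcal{A} \times \mathcal{S}_2$ and $\mathcal{S}_1 \times \mathcal{B}$, we conclude that their intersection is the set $\mathcal{A} \times \mathcal{B}$:

$$\mathcal{A} \times \mathcal{B} = (\mathcal{A} \times \mathcal{S}_2) \cap (\mathcal{S}_1 \times \mathcal{B}) \tag{3-1}$$

Note. Suppose that $\mathcal{S}_1$ is the $x$ axis, $\mathcal{S}_2$ is the $y$ axis, and $\mathcal{A}$ and $\mathcal{B}$ are two intervals:

$$\mathcal{A} = \{x_1 \le x \le x_2\} \qquad \mathcal{B} = \{y_1 \le y \le y_2\}$$

In this case, $\mathcal{A} \times \mathcal{B}$ is a rectangle, $\mathcal{A} \times \mathcal{S}_2$ is a vertical strip, and $\mathcal{S}_1 \times \mathcal{B}$ is a horizontal strip (Fig. 3-1). We can thus interpret the cartesian product $\mathcal{A} \times \mathcal{B}$ of two arbitrary sets as a generalized rectangle.

Cartesian product of two experiments. The cartesian product of two experiments $\mathcal{S}_1$ and $\mathcal{S}_2$ is a new experiment $\mathcal{S} = \mathcal{S}_1 \times \mathcal{S}_2$ whose events are all cartesian products of the form

$$\mathcal{A} \times \mathcal{B} \tag{3-2}$$

where $\mathcal{A}$ is an event of $\mathcal{S}_1$ and $\mathcal{B}$ is an event of $\mathcal{S}_2$, and their unions and intersections.

In this experiment, the probabilities of the events $\mathcal{A} \times \mathcal{S}_2$ and $\mathcal{S}_1 \times \mathcal{B}$ are such that

$$P(\mathcal{A} \times \mathcal{S}_2) = P_1(\mathcal{A}) \qquad P(\mathcal{S}_1 \times \mathcal{B}) = P_2(\mathcal{B}) \tag{3-3}$$

where $P_1(\mathcal{A})$ is the probability of the event $\mathcal{A}$ in the experiment $\mathcal{S}_1$ and $P_2(\mathcal{B})$ is the probability of the event $\mathcal{B}$ in the experiment $\mathcal{S}_2$. The above is motivated by the interpretation of $\mathcal{S}$ as a combined experiment. Indeed, the event $\mathcal{A} \times \mathcal{S}_2$ of the experiment $\mathcal{S}$ occurs if the event $\mathcal{A}$ of the experiment $\mathcal{S}_1$ occurs no matter what the outcome of $\mathcal{S}_2$ is. Similarly, the event $\mathcal{S}_1 \times \mathcal{B}$ of the experiment $\mathcal{S}$ occurs if the event $\mathcal{B}$ of the experiment $\mathcal{S}_2$ occurs no matter what the outcome of $\mathcal{S}_1$ is. This justifies the two equations in (3-3).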
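The set identity (3-1) and the marginal relations (3-3) can be checked on the die-and-coin space of this section. The sketch below makes one assumption that (3-3) alone does not impose: the probability of each pair is taken to be the product of the individual probabilities, as in the opening die-and-coin discussion. The names `cart` and `prob` are hypothetical helpers.

```python
from fractions import Fraction
from itertools import product

# Component experiments: a fair die (P1) and a fair coin (P2).
S1 = {f"f{i}": Fraction(1, 6) for i in range(1, 7)}
S2 = {"h": Fraction(1, 2), "t": Fraction(1, 2)}

# ASSUMPTION: each pair gets the product of the component probabilities.
P = {(z1, z2): S1[z1] * S2[z2] for z1, z2 in product(S1, S2)}

def cart(a, b):
    """Cartesian product of two sets of outcomes, as a set of ordered pairs."""
    return set(product(a, b))

def prob(event):
    """Probability of an event of the combined experiment."""
    return sum(P[z] for z in event)

A = {"f2", "f4", "f6"}   # event {even} of the die experiment
B = {"h"}                # event {heads} of the coin experiment

lhs = cart(A, B)
rhs = cart(A, S2) & cart(S1, B)   # right side of (3-1)

p_strip_a = prob(cart(A, S2))     # should equal P1(A), see (3-3)
p_strip_b = prob(cart(S1, B))     # should equal P2(B), see (3-3)

print(lhs == rhs, p_strip_a, p_strip_b)  # prints: True 1/2 1/2
```

Any other assignment of pair probabilities with the same marginals would satisfy (3-3) equally well; the product assignment is only one admissible choice.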
These equations determine only the probabilities of the events $\mathcal{A} \times \mathcal{S}_2$ and $\mathcal{S}_1 \times \mathcal{B}$. The probabilities of events of the form $\mathcal{A} \times \mathcal{B}$ and of their unions and intersections cannot in general be expressed in terms of $P_1$ and $P_2$. To determine them, we need additional information about the experiments $\mathcal{S}_1$ and $\mathcal{S}_2$.

Independent experiments. In many applications, the events $\mathcal{A} \times \mathcal{S}_2$ and $\mathcal{S}_1 \times \mathcal{B}$ of the combined experiment $\mathcal{S}$ are independent for any $\mathcal{A}$ and $\mathcal{B}$. Since the intersection of these events equals $\mathcal{A} \times \mathcal{B}$ [see (3-1)], we conclude from (2-42) and (3-3) that

$$P(\mathcal{A} \times \mathcal{B}) = P(\mathcal{A} \times \mathcal{S}_2)P(\mathcal{S}_1 \times \mathcal{B}) = P_1(\mathcal{A})P_2(\mathcal{B}) \tag{3-4}$$

This completes the specification of the experiment $\mathcal{S}$ because all its events are unions and intersections of events of the form $\mathcal{A} \times \mathcal{B}$.

We note in particular that the elementary event $\{\zeta_1\zeta_2\}$ can be written as a cartesian product $\{\zeta_1\} \times \{\zeta_2\}$ of the elementary events $\{\zeta_1\}$ and $\{\zeta_2\}$ of $\mathcal{S}_1$ and $\mathcal{S}_2$. Hence

$$P\{\zeta_1\zeta_2\} = P_1\{\zeta_1\}P_2\{\zeta_2\} \tag{3-5}$$

Example 3-3. A box $B_1$ contains 10 white and 5 red balls and a box $B_2$ contains 20 white and 20 red balls. A ball is drawn from each box. What is the probability that the ball from $B_1$ will be white and the ball from $B_2$ red?

The above operation can be considered as a combined experiment. Experiment $\mathcal{S}_1$ is the drawing from $B_1$ and experiment $\mathcal{S}_2$ is the drawing from $B_2$. The space $\mathcal{S}_1$ has 15 elements: 10 white and 5 red balls. The event

$$\mathcal{W}_1 = \{\text{all white balls in } B_1\}$$

has 10 elements and its probability equals 10/15. The space $\mathcal{S}_2$ has 40 elements: 20 white and 20 red balls. The event

$$\mathcal{R}_2 = \{\text{all red balls in } B_2\}$$

has 20 elements and its probability equals 20/40. The space $\mathcal{S}_1 \times \mathcal{S}_2$ has $40 \times 15$ elements: all possible pairs that can be drawn.

We want the probability of the event

$$\mathcal{W}_1 \times \mathcal{R}_2 = \{\text{white from } B_1 \text{ and red from } B_2\}$$

Assuming independence of the two experiments, we conclude from (3-4) that

$$P(\mathcal{W}_1 \times \mathcal{R}_2) = P_1(\mathcal{W}_1)P_2(\mathcal{R}_2) = \frac{10}{15} \times \frac{20}{40}$$

Example 3-4. Consider the coin experiment where the probability of "heads" equals $p$ and the probability of "tails" equals $q = 1 - p$.
If we toss the coin twice, we obtain the space

$$\mathcal{S} = \mathcal{S}_1 \times \mathcal{S}_2 \qquad \mathcal{S}_1 = \mathcal{S}_2 = \{h, t\}$$

Thus $\mathcal{S}$ consists of the four outcomes hh, ht, th, and tt. Assuming that the
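experiments $\mathcal{S}_1$ and $\mathcal{S}_2$ are independent, we obtain

$$P\{hh\} = P_1\{h\}P_2\{h\} = p^2$$

Similarly,

$$P\{ht\} = pq \qquad P\{th\} = qp \qquad P\{tt\} = q^2$$

We shall use the above to find the probability of the event

$$\mathcal{H}_1 = \{\text{heads at the first toss}\} = \{hh, ht\}$$

Since $\mathcal{H}_1$ consists of the two outcomes hh and ht, (2-23) yields

$$P(\mathcal{H}_1) = P\{hh\} + P\{ht\} = p^2 + pq = p$$

This follows also from (3-4) because $\mathcal{H}_1 = \{h\} \times \mathcal{S}_2$.

Generalization. Given $n$ experiments $\mathcal{S}_1, \ldots, \mathcal{S}_n$, we define as their cartesian product

$$\mathcal{S} = \mathcal{S}_1 \times \cdots \times \mathcal{S}_n \tag{3-6}$$

the experiment whose elements are all ordered $n$-tuples $\zeta_1 \cdots \zeta_n$ where $\zeta_i$ is an element of the set $\mathcal{S}_i$. Events in this space are all sets of the form

$$\mathcal{A}_1 \times \cdots \times \mathcal{A}_n$$

where $\mathcal{A}_i \subset \mathcal{S}_i$, and their unions and intersections. If the experiments are independent and $P_i(\mathcal{A}_i)$ is the probability of the event $\mathcal{A}_i$ in the experiment $\mathcal{S}_i$, then

$$P(\mathcal{A}_1 \times \cdots \times \mathcal{A}_n) = P_1(\mathcal{A}_1) \cdots P_n(\mathcal{A}_n) \tag{3-7}$$

Example 3-5. If we toss the coin of Example 3-4 $n$ times, we obtain the space $\mathcal{S} = \mathcal{S}_1 \times \cdots \times \mathcal{S}_n$ consisting of the $2^n$ elements $\zeta_1 \cdots \zeta_n$ where $\zeta_i = h$ or $t$. Clearly,

$$P\{\zeta_1 \cdots \zeta_n\} = P_1\{\zeta_1\} \cdots P_n\{\zeta_n\} \qquad P_i\{\zeta_i\} = \begin{cases} p & \zeta_i = h \\ q & \zeta_i = t \end{cases} \tag{3-8}$$

If, in particular, $p = q = 1/2$, then

$$P\{\zeta_1 \cdots \zeta_n\} = \frac{1}{2^n} \tag{3-9}$$

From (3-8) it follows that, if the elementary event $\{\zeta_1 \cdots \zeta_n\}$ consists of $k$ heads and $n - k$ tails (in a specific order), then

$$P\{\zeta_1 \cdots \zeta_n\} = p^k q^{n-k} \tag{3-10}$$

We note that the event $\mathcal{H}_1$ = {heads at the first toss} consists of $2^{n-1}$ outcomes $\zeta_1 \cdots \zeta_n$ where $\zeta_1 = h$ and $\zeta_i = t$ or $h$ for $i > 1$. The event $\mathcal{H}_1$ can be written as a cartesian product

$$\mathcal{H}_1 = \{h\} \times \mathcal{S}_2 \times \cdots \times \mathcal{S}_n$$

Hence [see (3-7)]

$$P(\mathcal{H}_1) = P_1\{h\}P_2(\mathcal{S}_2) \cdots P_n(\mathcal{S}_n) = p$$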
because $P_i(\mathcal{S}_i) = 1$. We can similarly show that if

$$\mathcal{H}_i = \{\text{heads at the } i\text{th toss}\} \qquad \mathcal{T}_i = \{\text{tails at the } i\text{th toss}\}$$

then $P(\mathcal{H}_i) = p$ and $P(\mathcal{T}_i) = q$.

Dual meaning of repeated trials. In the theory of probability, the notion of repeated trials has two fundamentally different meanings. The first is the approximate relationship (1-1) between the probability $P(\mathcal{A})$ of an event $\mathcal{A}$ in an experiment $\mathcal{S}$ and the relative frequency of the occurrence of $\mathcal{A}$. The second is the creation of the experiment $\mathcal{S} \times \cdots \times \mathcal{S}$.

For example, the repeated tossings of a coin can be given the following two interpretations:

First interpretation (physical). Our experiment is the single toss of a fair coin. Its space has two elements and the probability of each elementary event equals 1/2. A trial is the toss of the coin once.

If we toss the coin $n$ times and heads shows $n_h$ times, then almost certainly $n_h/n \simeq 1/2$ provided that $n$ is sufficiently large. Thus the first interpretation of repeated trials is the above imprecise statement relating probabilities with observed frequencies.

Second interpretation (conceptual). Our experiment is now the toss of the coin $n$ times where $n$ is any number large or small. Its space has $2^n$ elements and the probability of each elementary event equals $1/2^n$. A trial is the toss of the coin $n$ times. All statements concerning the number of heads are precise and in the form of probabilities.

We can, of course, give a relative frequency interpretation to these statements. However, to do so, we must repeat the $n$ tosses of the coin a large number of times.

3-2 BERNOULLI TRIALS

It is well known from combinatorial analysis that, if a set has $n$ elements, then the total number of its subsets consisting of $k$ elements each equals

$$\binom{n}{k} = \frac{n(n - 1) \cdots (n - k + 1)}{1 \cdot 2 \cdots k} = \frac{n!}{k!\,(n - k)!} \tag{3-11}$$
For example, if $n = 4$ and $k = 2$, then

$$\binom{4}{2} = \frac{4 \cdot 3}{1 \cdot 2} = 6$$

Indeed, the two-element subsets of the four-element set $abcd$ are

ab ac ad bc bd cd

The above result will be used to find the probability that an event occurs $k$ times in $n$ independent trials of an experiment $\mathcal{S}$. This problem is essentially
the same as the problem of obtaining $k$ heads in $n$ tossings of a coin. We start, therefore, with the coin experiment.

Example 3-6. A coin with $P\{h\} = p$ is tossed $n$ times. We maintain that the probability $p_n(k)$ that heads shows $k$ times is given by

$$p_n(k) = \binom{n}{k} p^k q^{n-k} \qquad q = 1 - p \tag{3-12}$$

Proof. The experiment under consideration is the tossing of a coin $n$ times. A single outcome is a particular sequence of heads and tails. The event {$k$ heads in any order} consists of all sequences containing $k$ heads and $n - k$ tails. The positions of the $k$ heads in each such sequence form a $k$-element subset of the set of $n$ toss positions. As we noted, there are $\binom{n}{k}$ such subsets. Hence the event {$k$ heads in any order} consists of $\binom{n}{k}$ elementary events containing $k$ heads and $n - k$ tails in a specific order. Since the probability of each of these elementary events equals $p^k q^{n-k}$ [see (3-10)], we conclude that

$$P\{k \text{ heads in any order}\} = \binom{n}{k} p^k q^{n-k}$$

Special case. If $n = 3$ and $k = 2$, then there are three ways of getting two heads, namely, hht, hth, and thh. Hence $p_3(2) = 3p^2q$ in agreement with (3-12).

Success or Failure of an Event $\mathcal{A}$ in $n$ Independent Trials

We consider now our main problem. We are given an experiment $\mathcal{S}$ and an event $\mathcal{A}$ with

$$P(\mathcal{A}) = p \qquad P(\overline{\mathcal{A}}) = q \qquad p + q = 1$$

We repeat the experiment $n$ times and the resulting product space we denote by $\mathcal{S}^n$. Thus

$$\mathcal{S}^n = \mathcal{S} \times \cdots \times \mathcal{S}$$

We shall determine the probability $p_n(k)$ that the event $\mathcal{A}$ occurs exactly $k$ times.

FUNDAMENTAL THEOREM

$$p_n(k) = P\{\mathcal{A} \text{ occurs } k \text{ times in any order}\} = \binom{n}{k} p^k q^{n-k} \tag{3-13}$$

Proof. The event {$\mathcal{A}$ occurs $k$ times in a specific order} is a cartesian product $\mathcal{B}_1 \times \cdots \times \mathcal{B}_n$ where $k$ of the events $\mathcal{B}_i$ equal $\mathcal{A}$ and the remaining $n - k$ equal $\overline{\mathcal{A}}$. As we know from (3-7), the probability of this event equals

$$P(\mathcal{B}_1) \cdots P(\mathcal{B}_n) = p^k q^{n-k}$$
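because

$$P(\mathcal{B}_i) = \begin{cases} p & \text{if } \mathcal{B}_i = \mathcal{A} \\ q & \text{if } \mathcal{B}_i = \overline{\mathcal{A}} \end{cases}$$

In other words,

$$P\{\mathcal{A} \text{ occurs } k \text{ times in a specific order}\} = p^k q^{n-k} \tag{3-14}$$

The event {$\mathcal{A}$ occurs $k$ times in any order} is the union of the $\binom{n}{k}$ events {$\mathcal{A}$ occurs $k$ times in a specific order} and since these events are mutually exclusive, we conclude from (2-20) that $p_n(k)$ is given by (3-13).

In Fig. 3-2, we plot $p_n(k)$ for $n = 9$. The meaning of the dashed curves will be explained later.

FIGURE 3-2

Example 3-7. A fair die is rolled five times. We shall find the probability $p_5(2)$ that "six" will show twice. In the single roll of a die, $\mathcal{A} = \{\text{six}\}$ is an event with probability 1/6. Setting

$$P(\mathcal{A}) = \tfrac{1}{6} \qquad P(\overline{\mathcal{A}}) = \tfrac{5}{6} \qquad n = 5 \qquad k = 2$$

in (3-13), we obtain

$$p_5(2) = \frac{5!}{2!\,3!} \left(\frac{1}{6}\right)^2 \left(\frac{5}{6}\right)^3$$

Example 3-8. A pair of fair dice is rolled four times. We shall find the probability $p_4(0)$ that "seven" will not show at all. The space of the single roll of the two dice consists of the 36 elements $f_i f_j$, $i, j = 1, 2, \ldots, 6$. The event $\mathcal{A} = \{\text{seven}\}$ consists of the six elements

$$f_1 f_6 \quad f_2 f_5 \quad f_3 f_4 \quad f_4 f_3 \quad f_5 f_2 \quad f_6 f_1$$

Therefore $P(\mathcal{A}) = 6/36$ and $P(\overline{\mathcal{A}}) = 5/6$. With $n = 4$ and $k = 0$, (3-13) yields

$$p_4(0) = \left(\frac{5}{6}\right)^4$$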
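The fundamental theorem (3-13) is easy to check numerically. The following is a minimal sketch, not part of the original text: the function name `binomial_pmf` is an assumption made for illustration, and exact rational arithmetic is used so that the results of Examples 3-7 and 3-8 can be compared without rounding.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(n, k, p):
    """Probability that an event of probability p occurs exactly k times
    in n independent trials, following (3-13). The name is illustrative."""
    q = 1 - p
    return comb(n, k) * p**k * q**(n - k)


# Example 3-7: "six" shows twice in five rolls of a fair die.
p5_2 = binomial_pmf(5, 2, Fraction(1, 6))

# Example 3-8: "seven" never shows in four rolls of a pair of fair dice.
p4_0 = binomial_pmf(4, 0, Fraction(6, 36))

print(p5_2, float(p5_2))
print(p4_0, float(p4_0))

# The probabilities p_n(k), k = 0..n, must add up to 1.
total = sum(binomial_pmf(5, k, Fraction(1, 6)) for k in range(6))
print(total)
```

Passing a `Fraction` for `p` keeps every intermediate value exact; a float works as well when only an approximate number is needed.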
FIGURE 3-3

Example 3-9. We place at random $n$ points in the interval $(0, T)$. What is the probability that $k$ of these points are in the interval $(t_1, t_2)$ (Fig. 3-3)?

This example can be considered as a problem in repeated trials. The experiment $\mathcal{S}$ is the placing of a single point in the interval $(0, T)$. In this experiment, $\mathcal{A} = \{\text{the point is in the interval } (t_1, t_2)\}$ is an event with probability

$$P(\mathcal{A}) = p = \frac{t_2 - t_1}{T}$$

In the space $\mathcal{S}^n$, the event {$\mathcal{A}$ occurs $k$ times} means that $k$ of the $n$ points are in the interval $(t_1, t_2)$. Hence [see (3-13)]

$$P\{k \text{ points are in the interval } (t_1, t_2)\} = \binom{n}{k} p^k q^{n-k} \tag{3-15}$$

Example 3-10. A system containing $n$ components is put into operation at $t = 0$. The probability that a particular component will fail in the interval $(0, t)$ equals

$$p = \int_0^t \alpha(\tau)\,d\tau \quad \text{where} \quad \alpha(t) \ge 0 \qquad \int_0^\infty \alpha(t)\,dt = 1 \tag{3-16}$$

What is the probability that $k$ of these components will fail prior to time $t$?

This example can also be considered as a problem in repeated trials. Reasoning as above, we conclude that the unknown probability is given by (3-15).

Most likely number of successes. We shall now examine the behavior of $p_n(k)$ as a function of $k$ for a fixed $n$. We maintain that as $k$ increases, $p_n(k)$ increases reaching a maximum for

$$k = k_{\max} = [(n + 1)p] \tag{3-17}$$

where the brackets mean the largest integer that does not exceed $(n + 1)p$. If $(n + 1)p$ is an integer, then $p_n(k)$ is maximum for two consecutive values of $k$:

$$k = k_1 = (n + 1)p \quad \text{and} \quad k = k_2 = k_1 - 1 = np - q$$

Proof. We form the ratio

$$\frac{p_n(k - 1)}{p_n(k)} = \frac{kq}{(n - k + 1)p}$$

If this ratio is less than 1, that is, if $k < (n + 1)p$, then $p_n(k - 1)$ is less than $p_n(k)$. This shows that as $k$ increases, $p_n(k)$ increases reaching its maximum for $k = [(n + 1)p]$. For $k > (n + 1)p$, the above ratio is greater than 1; hence $p_n(k)$ decreases.
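The claim (3-17) can be compared against a brute-force search for the maximum of $p_n(k)$. This is a small sketch under stated assumptions: the helper names `pmf`, `most_likely_counts`, and `predicted_mode` are hypothetical, and the values of $n$ and $p$ are chosen only for illustration. Exact fractions are used so that a tie between two consecutive values of $k$ is detected reliably.

```python
from fractions import Fraction
from math import comb, floor


def pmf(n, k, p):
    """p_n(k) from (3-13)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


def most_likely_counts(n, p):
    """All k in 0..n at which p_n(k) attains its maximum (brute force)."""
    values = [pmf(n, k, p) for k in range(n + 1)]
    best = max(values)
    return [k for k, v in enumerate(values) if v == best]


def predicted_mode(n, p):
    """k_max = [(n + 1) p], the largest integer not exceeding (n + 1) p."""
    return floor((n + 1) * p)


# (n + 1) p = 6.3 is not an integer: a single maximum is expected.
print(most_likely_counts(20, Fraction(3, 10)), predicted_mode(20, Fraction(3, 10)))

# (n + 1) p = 4 is an integer: maxima at k1 = 4 and k1 - 1 = 3 are expected.
print(most_likely_counts(9, Fraction(2, 5)), predicted_mode(9, Fraction(2, 5)))
```

With floating-point `p` the equality test inside `most_likely_counts` could miss a tie because of rounding, which is why the sketch relies on `Fraction`.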
If $k_1 = (n + 1)p$ is an integer, then

$$\frac{p_n(k_1 - 1)}{p_n(k_1)} = \frac{k_1 q}{(n - k_1 + 1)p} = \frac{(n + 1)pq}{[n - (n + 1)p + 1]p} = 1$$

This shows that $p_n(k)$ is maximum for $k = k_1$ and $k = k_1 - 1$.

Example 3-11. (a) If $n = 10$ and $p = 1/3$, then $(n + 1)p = 11/3$; hence $k_{\max} = [11/3] = 3$. (b) If $n = 11$ and $p = 1/2$, then $(n + 1)p = 6$; hence $k_1 = 6$, $k_2 = 5$.

We shall, finally, find the probability

$$P\{k_1 \le k \le k_2\}$$

that the number $k$ of occurrences of $\mathcal{A}$ is between $k_1$ and $k_2$. Clearly, the events {$\mathcal{A}$ occurs $k$ times}, where $k$ takes all values from $k_1$ to $k_2$, are mutually exclusive and their union is the event $\{k_1 \le k \le k_2\}$. Hence [see (3-13)]

$$P\{k_1 \le k \le k_2\} = \sum_{k=k_1}^{k_2} p_n(k) = \sum_{k=k_1}^{k_2} \binom{n}{k} p^k q^{n-k} \tag{3-18}$$

Example 3-12. An order of $10^4$ parts is received. The probability that a part is defective equals 0.1. What is the probability that the total number of defective parts does not exceed 1100?

The experiment $\mathcal{S}$ is the selection of a single part. The probability of the event $\mathcal{A} = \{\text{the part is defective}\}$ equals 0.1. We want the probability that in $10^4$ trials, $\mathcal{A}$ will occur at most 1100 times. With

$$p = 0.1 \qquad n = 10^4 \qquad k_1 = 0 \qquad k_2 = 1100$$

(3-18) yields

$$P\{0 \le k \le 1100\} = \sum_{k=0}^{1100} \binom{10^4}{k} (0.1)^k (0.9)^{10^4 - k} \tag{3-19}$$

3-3 ASYMPTOTIC THEOREMS

In the preceding section, we showed that if the probability $P(\mathcal{A})$ of an event $\mathcal{A}$ of a certain experiment equals $p$ and the experiment is repeated $n$ times, then the probability that $\mathcal{A}$ occurs $k$ times in any order is given by (3-13) and the probability that $k$ is between $k_1$ and $k_2$ by (3-18). In this section, we develop simple approximate formulas for evaluating these probabilities.

Gaussian functions. In the following and throughout the book we use extensively the normal or gaussian function

$$g(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} \tag{3-20}$$
and its integral (see Fig. 3-4 and Table 3-1)

$$G(x) = \int_{-\infty}^{x} g(y)\,dy = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-y^2/2}\,dy \tag{3-21}$$

As is well known

$$\int_{-\infty}^{\infty} e^{-\alpha x^2}\,dx = \sqrt{\frac{\pi}{\alpha}} \tag{3-22}$$

From this it follows that

$$G(\infty) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-x^2/2}\,dx = 1 \tag{3-23}$$

Since $g(-x) = g(x)$, we conclude that

$$G(-x) = 1 - G(x) \tag{3-24}$$

With a change of variables, (3-21) yields

$$\frac{1}{a\sqrt{2\pi}} \int_{x_1}^{x_2} e^{-(x-b)^2/2a^2}\,dx = G\!\left(\frac{x_2 - b}{a}\right) - G\!\left(\frac{x_1 - b}{a}\right) \tag{3-25}$$

for any $a$ and $b$.
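For numerical work, $G(x)$ can be evaluated through the standard error function available in most languages. The sketch below assumes Python's `math.erf`, which is normalized as $\frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt$ and therefore differs from the function this chapter tabulates under the name erf $x$; the conversion is $G(x) = \tfrac{1}{2}[1 + \operatorname{erf}(x/\sqrt{2})]$. The names `g`, `G`, `normal_interval`, `scale`, and `shift` are chosen here for illustration only.

```python
from math import erf, exp, pi, sqrt


def g(x):
    """Normal (gaussian) function of (3-20)."""
    return exp(-x * x / 2.0) / sqrt(2.0 * pi)


def G(x):
    """Integral of g from -infinity to x, as in (3-21),
    expressed through the standard error function math.erf."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def normal_interval(x1, x2, scale, shift):
    """Right side of (3-25): area under the shifted and scaled
    normal curve between x1 and x2."""
    return G((x2 - shift) / scale) - G((x1 - shift) / scale)


print(G(0.0))                               # half of the total area
print(G(1.0) + G(-1.0))                     # symmetry (3-24): the sum is 1
print(normal_interval(40.0, 60.0, 5.0, 50.0))
```

The last call evaluates the area within two scale units of the shift, which by (3-25) equals $G(2) - G(-2)$.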
TABLE 3-1

$$\operatorname{erf} x = \frac{1}{\sqrt{2\pi}} \int_0^x e^{-y^2/2}\,dy = G(x) - \frac{1}{2}$$

 x     erf x      x     erf x      x     erf x      x     erf x
0.05  0.01994    0.80  0.28814    1.55  0.43943    2.30  0.48928
0.10  0.03983    0.85  0.30234    1.60  0.44520    2.35  0.49061
0.15  0.05962    0.90  0.31594    1.65  0.45053    2.40  0.49180
0.20  0.07926    0.95  0.32894    1.70  0.45543    2.45  0.49286
0.25  0.09871    1.00  0.34134    1.75  0.45994    2.50  0.49379
0.30  0.11791    1.05  0.35314    1.80  0.46407    2.55  0.49461
0.35  0.13683    1.10  0.36433    1.85  0.46784    2.60  0.49534
0.40  0.15542    1.15  0.37493    1.90  0.47128    2.65  0.49597
0.45  0.17364    1.20  0.38493    1.95  0.47441    2.70  0.49653
0.50  0.19146    1.25  0.39435    2.00  0.47726    2.75  0.49702
0.55  0.20884    1.30  0.40320    2.05  0.47982    2.80  0.49744
0.60  0.22575    1.35  0.41149    2.10  0.48214    2.85  0.49781
0.65  0.24215    1.40  0.41924    2.15  0.48422    2.90  0.49813
0.70  0.25804    1.45  0.42647    2.20  0.48610    2.95  0.49841
0.75  0.27337    1.50  0.43319    2.25  0.48778    3.00  0.49865

For large $x$, $G(x)$ is given approximately by (see Prob. 3-9)

$$G(x) \approx 1 - \frac{1}{x}\, g(x) \tag{3-26}$$

We note, finally, that $G(x)$ is often expressed in terms of the error function

$$\operatorname{erf} x = \frac{1}{\sqrt{2\pi}} \int_0^x e^{-y^2/2}\,dy = G(x) - \frac{1}{2}$$

DeMoivre-Laplace Theorem

It can be shown that, if $npq \gg 1$, then

$$\binom{n}{k} p^k q^{n-k} \approx \frac{1}{\sqrt{2\pi npq}}\, e^{-(k - np)^2/2npq} \tag{3-27}$$

for $k$ in the $\sqrt{npq}$ neighborhood of $np$.

This important approximation, known as the DeMoivre-Laplace theorem, can be stated as an equality in the limit: The ratio of the two sides tends to 1 as $n \to \infty$. The proof is based on Stirling's formula

$$n! \approx n^n e^{-n} \sqrt{2\pi n} \qquad n \to \infty \tag{3-28}$$

The details, however, will be omitted.†

† The proof can be found in Feller, 1957 (see references at the end of the book).
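The limit statement behind (3-27) can be observed numerically by forming the ratio of its two sides for growing $n$. The following is a rough sketch, not a proof: the function names are hypothetical, the value $p = 0.3$ is an arbitrary choice, and the exact binomial term is computed through `math.lgamma` (logarithm of the gamma function) only to avoid overflow in the factorials.

```python
from math import exp, lgamma, log, pi, sqrt


def binomial_pmf(n, k, p):
    """Left side of (3-27), evaluated in the log domain."""
    q = 1.0 - p
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(q))


def normal_approx(n, k, p):
    """Right side of (3-27)."""
    npq = n * p * (1.0 - p)
    return exp(-(k - n * p) ** 2 / (2.0 * npq)) / sqrt(2.0 * pi * npq)


def ratio_at_center(n, p):
    """Ratio of the two sides of (3-27) at k nearest to np."""
    k = round(n * p)
    return binomial_pmf(n, k, p) / normal_approx(n, k, p)


for n in (20, 200, 2000, 20000):
    print(n, ratio_at_center(n, 0.3))
```

The printed ratios should approach 1 as $n$ grows, which is the sense in which the approximation becomes an equality in the limit.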
Thus the evaluation of the probability of $k$ successes in $n$ trials, given exactly by (3-13), is reduced to the evaluation of the normal curve

$$\frac{1}{\sqrt{2\pi npq}}\, e^{-(x - np)^2/2npq} \tag{3-29}$$

for $x = k$.

Example 3-13. A fair coin is tossed 1000 times. Find the probability $p_a$ that heads will show 500 times and the probability $p_b$ that heads will show 510 times.

In this example

$$p = q = 0.5 \qquad n = 1000 \qquad \sqrt{npq} = 5\sqrt{10}$$

(a) If $k = 500$ then $k - np = 0$ and (3-27) yields

$$p_a \approx \frac{1}{\sqrt{2\pi npq}} = \frac{1}{10\sqrt{5\pi}} = 0.0252$$

(b) If $k = 510$ then $k - np = 10$ and (3-27) yields

$$p_b \approx \frac{e^{-0.2}}{10\sqrt{5\pi}} = 0.0207$$

As the next example indicates, the approximation (3-27) is satisfactory even for moderate values of $n$.

Example 3-14. We shall determine $p_n(k)$ for $p = 0.5$, $n = 10$, and $k = 5$.

(a) Exactly from (3-13)

$$p_n(k) = \binom{n}{k} p^k q^{n-k} = \frac{10!}{5!\,5!} \cdot \frac{1}{2^{10}} = 0.246$$

(b) Approximately from (3-27)

$$p_n(k) \approx \frac{1}{\sqrt{2\pi npq}}\, e^{-(k - np)^2/2npq} = \frac{1}{\sqrt{5\pi}} = 0.252$$

APPROXIMATE EVALUATION OF $P\{k_1 \le k \le k_2\}$. Using the approximation (3-27), we shall show that

$$\sum_{k=k_1}^{k_2} \binom{n}{k} p^k q^{n-k} \approx G\!\left(\frac{k_2 - np}{\sqrt{npq}}\right) - G\!\left(\frac{k_1 - np}{\sqrt{npq}}\right) \tag{3-30}$$

Thus, to find the probability that in $n$ trials the number of occurrences of an event $\mathcal{A}$ is between $k_1$ and $k_2$, it suffices to evaluate the tabulated normal function $G(x)$. The approximation is satisfactory if $npq \gg 1$ and the differences $k_1 - np$ and $k_2 - np$ are of the order of $\sqrt{npq}$.
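Before the proof, it may help to see how close the two sides of (3-30) are in a concrete case. The sketch below is an illustration under assumptions: the function names are hypothetical, the case $n = 2000$, $p = 0.3$, $k_1 = 570$, $k_2 = 630$ is an arbitrary choice, and `G` is evaluated through the standard `math.erf`.

```python
from math import erf, exp, lgamma, log, sqrt


def G(x):
    """Normal integral G(x) via the standard error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def exact_range_probability(n, p, k1, k2):
    """Left side of (3-30): the binomial sum, term by term in the log domain."""
    q = 1.0 - p
    total = 0.0
    for k in range(k1, k2 + 1):
        log_term = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                    + k * log(p) + (n - k) * log(q))
        total += exp(log_term)
    return total


def approx_range_probability(n, p, k1, k2):
    """Right side of (3-30)."""
    sigma = sqrt(n * p * (1.0 - p))
    return G((k2 - n * p) / sigma) - G((k1 - n * p) / sigma)


n, p, k1, k2 = 2000, 0.3, 570, 630
exact = exact_range_probability(n, p, k1, k2)
approx = approx_range_probability(n, p, k1, k2)
print(exact, approx, abs(exact - approx))
```

Here $npq = 420$ and the limits lie within about one and a half $\sqrt{npq}$ of $np$, so both conditions stated for (3-30) hold reasonably well.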
FIGURE 3-5

Proof. Inserting (3-27) into (3-18), we obtain

$$\sum_{k=k_1}^{k_2} \binom{n}{k} p^k q^{n-k} \approx \frac{1}{\sigma\sqrt{2\pi}} \sum_{k=k_1}^{k_2} e^{-(k - np)^2/2\sigma^2} \tag{3-31}$$

The normal curve is nearly constant in any interval of length 1 because $\sigma^2 = npq \gg 1$ by assumption; hence its area in such an interval equals approximately its ordinate (Fig. 3-5). From this it follows that the right side of (3-31) can be approximated by the integral of the normal curve in the interval $(k_1, k_2)$. This yields

$$\frac{1}{\sigma\sqrt{2\pi}} \sum_{k=k_1}^{k_2} e^{-(k - np)^2/2\sigma^2} \approx \frac{1}{\sigma\sqrt{2\pi}} \int_{k_1}^{k_2} e^{-(x - np)^2/2\sigma^2}\,dx \tag{3-32}$$

and (3-30) results [see (3-25)].

FIGURE 3-6

Error correction. The sum on the left of (3-31) consists of $k_2 - k_1 + 1$ terms. The integral in (3-32) is an approximation of the shaded area of Fig. 3-6a, consisting of $k_2 - k_1$ rectangles. If $k_2 - k_1 \gg 1$ the resulting error can be neglected. For moderate values of $k_2 - k_1$, however, the error is no longer negligible. To reduce it, we replace in (3-30) the limits $k_1$ and $k_2$ by $k_1 - 1/2$
and $k_2 + 1/2$ respectively (see Fig. 3-6b). This yields the improved approximation

$$\sum_{k=k_1}^{k_2} \binom{n}{k} p^k q^{n-k} \approx G\!\left(\frac{k_2 + 0.5 - np}{\sqrt{npq}}\right) - G\!\left(\frac{k_1 - 0.5 - np}{\sqrt{npq}}\right) \tag{3-33}$$

Example 3-15. A fair coin is tossed 10 000 times. What is the probability that the number of heads is between 4900 and 5100?

In this problem

$$n = 10\,000 \qquad p = q = 0.5 \qquad k_1 = 4900 \qquad k_2 = 5100$$

Since $(k_2 - np)/\sqrt{npq} = 100/50$ and $(k_1 - np)/\sqrt{npq} = -100/50$, we conclude from (3-30) that the unknown probability equals

$$G(2) - G(-2) = 2G(2) - 1 = 0.9545$$

Example 3-16. Over a period of 12 hours 180 calls are made at random. What is the probability that in a four-hour interval the number of calls is between 50 and 70?

The above can be considered as a problem in repeated trials with $p = 4/12$ the probability that a particular call will occur in the four-hour interval. The probability that $k$ calls will occur in this interval equals [see (3-27)]

$$\binom{180}{k} \left(\frac{1}{3}\right)^k \left(\frac{2}{3}\right)^{180-k} \approx \frac{1}{4\sqrt{5\pi}}\, e^{-(k - 60)^2/80}$$

and the probability that the number of calls is between 50 and 70 equals [see (3-30)]

$$\sum_{k=50}^{70} \binom{180}{k} \left(\frac{1}{3}\right)^k \left(\frac{2}{3}\right)^{180-k} \approx G(\sqrt{2.5}) - G(-\sqrt{2.5}) = 0.886$$

Note. It seems that we cannot use the approximation (3-30) if $k_1 = 0$ because the sum contains values of $k$ that are not in the $\sqrt{npq}$ vicinity of $np$. However, the corresponding terms are small compared to the terms with $k$ near $np$; hence the errors of their estimates are also small. Since

$$G(-np/\sqrt{npq}) = G(-\sqrt{np/q}) \approx 0 \quad \text{for} \quad np/q \gg 1$$

we conclude that if not only $n \gg 1$ but also $np \gg 1$, then

$$\sum_{k=0}^{k_2} \binom{n}{k} p^k q^{n-k} \approx G\!\left(\frac{k_2 - np}{\sqrt{npq}}\right) \tag{3-34}$$

In the sum (3-19) of Example 3-12,

$$np = 1000 \qquad npq = 900 \qquad \frac{k_2 - np}{\sqrt{npq}} = \frac{10}{3}$$
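Using (3-34), we obtain

$$\sum_{k=0}^{1100} \binom{10^4}{k} (0.1)^k (0.9)^{10^4 - k} \approx G\!\left(\frac{10}{3}\right) = 0.99957$$

We note that the sum of the terms of the above sum from 900 to 1100 equals $2G(10/3) - 1 = 0.99914$.

The Law of Large Numbers

According to the relative frequency interpretation of probability, if an event $\mathcal{A}$ with $P(\mathcal{A}) = p$ occurs $k$ times in $n$ trials, then $k \approx np$. In the following, we rephrase this heuristic statement as a limit theorem.

We start with the observation that $k \approx np$ does not mean that $k$ will be close to $np$. In fact [see (3-27)]

$$P\{k = np\} \approx \frac{1}{\sqrt{2\pi npq}} \to 0 \quad \text{as} \quad n \to \infty \tag{3-35}$$

As we show in the next theorem, the approximation $k \approx np$ means that the ratio $k/n$ is close to $p$ in the sense that, for any $\varepsilon > 0$, the probability that $|k/n - p| < \varepsilon$ tends to 1 as $n \to \infty$.

THEOREM. For any $\varepsilon > 0$,

$$P\left\{\left|\frac{k}{n} - p\right| \le \varepsilon\right\} \to 1 \quad \text{as} \quad n \to \infty \tag{3-36}$$

Proof. The inequality $|k/n - p| \le \varepsilon$ means that

$$n(p - \varepsilon) \le k \le n(p + \varepsilon)$$

With $k_1 = n(p - \varepsilon)$ and $k_2 = n(p + \varepsilon)$ we have

$$P\left\{\left|\frac{k}{n} - p\right| \le \varepsilon\right\} = P\{k_1 \le k \le k_2\} = \sum_{k=k_1}^{k_2} \binom{n}{k} p^k q^{n-k}$$

Inserting into (3-30), we obtain

$$P\{k_1 \le k \le k_2\} \approx G\!\left(\frac{k_2 - np}{\sqrt{npq}}\right) - G\!\left(\frac{k_1 - np}{\sqrt{npq}}\right) = 2G\!\left(\varepsilon\sqrt{\frac{n}{pq}}\right) - 1$$

But $\varepsilon\sqrt{n/pq} \to \infty$ as $n \to \infty$ for any $\varepsilon$. Hence

$$P\left\{\left|\frac{k}{n} - p\right| \le \varepsilon\right\} \approx 2G\!\left(\varepsilon\sqrt{\frac{n}{pq}}\right) - 1 \to 1 \quad \text{as} \quad n \to \infty \tag{3-37}$$

Example 3-17. Suppose that $p = q = 0.5$ and $\varepsilon = 0.05$. In this case

$$n(p - \varepsilon) = 0.45n \qquad n(p + \varepsilon) = 0.55n \qquad \varepsilon\sqrt{n/pq} = 0.1\sqrt{n}$$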
In the table below we show the probability $2G(0.1\sqrt{n}) - 1$ that $k$ is between $0.45n$ and $0.55n$ for various values of $n$.

n                    100     400     900
0.1√n                1       2       3
2G(0.1√n) - 1        0.682   0.954   0.997

Example 3-18. We now assume that $p = 0.6$ and we wish to find $n$ such that the probability that $k$ is between $0.59n$ and $0.61n$ is at least 0.98.

In this case, $p = 0.6$, $q = 0.4$, and $\varepsilon = 0.01$. Hence

$$P\{0.59n \le k \le 0.61n\} \approx 2G(0.01\sqrt{n/0.24}) - 1$$

Thus $n$ must be such that

$$2G(0.01\sqrt{n/0.24}) - 1 \ge 0.98$$

From Table 3-1 we see that $G(x) > 0.99$ if $x > 2.35$. Hence $0.01\sqrt{n/0.24} > 2.35$ yielding $n > 13\,254$.

GENERALIZATION OF BERNOULLI TRIALS. The experiment of repeated trials can be phrased in the following form: The events $\mathcal{A}_1 = \mathcal{A}$ and $\mathcal{A}_2 = \overline{\mathcal{A}}$ of the space $\mathcal{S}$ form a partition and their respective probabilities equal $p_1 = p$ and $p_2 = 1 - p$. In the space $\mathcal{S}^n$, the probability of the event {$\mathcal{A}_1$ occurs $k_1 = k$ times and $\mathcal{A}_2$ occurs $k_2 = n - k$ times in any order} equals $p_n(k)$ as in (3-13). We shall now generalize.

Suppose that

$$\mathfrak{U} = [\mathcal{A}_1, \ldots, \mathcal{A}_r]$$

is a partition of $\mathcal{S}$ consisting of the $r$ events $\mathcal{A}_i$ with

$$P(\mathcal{A}_i) = p_i \qquad p_1 + \cdots + p_r = 1$$

We repeat the experiment $n$ times and we denote by $p_n(k_1, \ldots, k_r)$ the probability of the event {$\mathcal{A}_1$ occurs $k_1$ times, ..., $\mathcal{A}_r$ occurs $k_r$ times in any order} where

$$k_1 + \cdots + k_r = n$$

We maintain that

$$p_n(k_1, \ldots, k_r) = \frac{n!}{k_1! \cdots k_r!}\, p_1^{k_1} \cdots p_r^{k_r} \tag{3-38}$$

Proof. Repeated application of (3-11) leads to the conclusion that the number of events of the form {$\mathcal{A}_1$ occurs $k_1$ times, ..., $\mathcal{A}_r$ occurs $k_r$ times in a specific order} equals

$$\frac{n!}{k_1! \cdots k_r!}$$
Since the trials are independent, the probability of each such event equals

$$p_1^{k_1} \cdots p_r^{k_r}$$

and (3-38) results.

Example 3-19. A fair die is rolled 10 times. We shall determine the probability that $f_1$ shows three times, and "even" shows six times.

In this case

$$\mathcal{A}_1 = \{f_1\} \qquad \mathcal{A}_2 = \{f_2, f_4, f_6\} \qquad \mathcal{A}_3 = \{f_3, f_5\}$$

Clearly,

$$p_1 = \tfrac{1}{6} \qquad p_2 = \tfrac{3}{6} \qquad p_3 = \tfrac{2}{6} \qquad k_1 = 3 \qquad k_2 = 6 \qquad k_3 = 1$$

and (3-38) yields

$$p_{10}(3, 6, 1) = \frac{10!}{3!\,6!\,1!} \left(\frac{1}{6}\right)^3 \left(\frac{1}{2}\right)^6 \frac{1}{3} = 0.0203$$

DeMoivre-Laplace theorem. We can show as in (3-27) that, if $k_i$ is in the $\sqrt{n}$ vicinity of $np_i$ and $n$ is sufficiently large, then

$$\frac{n!}{k_1! \cdots k_r!}\, p_1^{k_1} \cdots p_r^{k_r} \approx \frac{\exp\left\{-\dfrac{1}{2}\left[\dfrac{(k_1 - np_1)^2}{np_1} + \cdots + \dfrac{(k_r - np_r)^2}{np_r}\right]\right\}}{\sqrt{(2\pi n)^{r-1} p_1 \cdots p_r}} \tag{3-39}$$

Equation (3-27) is a special case.

3-4 POISSON THEOREM AND RANDOM POINTS

We have shown in (3-13) that the probability that an event $\mathcal{A}$ occurs $k$ times in $n$ trials equals

$$\binom{n}{k} p^k q^{n-k} = \frac{n(n - 1) \cdots (n - k + 1)}{1 \cdot 2 \cdots k}\, p^k q^{n-k} \tag{3-40}$$

In the following, we obtain an approximate expression for this probability under the assumption that $p \ll 1$. If $n$ is so large that $np \approx npq \gg 1$, then we can use the DeMoivre-Laplace theorem (3-27). If, however, $np$ is of order of one, (3-27) is no longer valid. In this case, the following approximation can be used: For $k$ of the order of $np$,

$$\frac{n!}{k!\,(n - k)!}\, p^k q^{n-k} \approx e^{-np}\, \frac{(np)^k}{k!} \tag{3-41}$$
Indeed, if $k$ is of the order of $np$, then $k \ll n$ and $kp \ll 1$. Hence

$$n(n - 1) \cdots (n - k + 1) \approx n \cdot n \cdots n = n^k$$

$$q = 1 - p \approx e^{-p} \qquad q^{n-k} \approx e^{-(n-k)p} \approx e^{-np}$$

Inserting into (3-40), we obtain (3-41).

The above approximation can be stated as a limit theorem (see Feller 1957):

POISSON THEOREM. If

$$n \to \infty \qquad p \to 0 \qquad np \to a$$

then

$$\frac{n!}{k!\,(n - k)!}\, p^k q^{n-k} \xrightarrow[n \to \infty]{} e^{-a}\, \frac{a^k}{k!} \tag{3-42}$$

Example 3-20. A system contains 1000 components. Each component fails independently of the others and the probability of its failure in one month equals $10^{-3}$. We shall find the probability that the system will function (i.e., no component will fail) at the end of one month.

This can be considered as a problem in repeated trials with $p = 10^{-3}$, $n = 10^3$, and $k = 0$. Hence [see (3-15)]

$$P\{k = 0\} = q^n = 0.999^{1000}$$

Since $np = 1$, the approximation (3-41) yields

$$P\{k = 0\} \approx e^{-np} = e^{-1} = 0.368$$

Applying (3-41) to the sum in (3-18), we obtain the following approximation for the probability that the number $k$ of occurrences of $\mathcal{A}$ is between $k_1$ and $k_2$:

$$P\{k_1 \le k \le k_2\} \approx e^{-np} \sum_{k=k_1}^{k_2} \frac{(np)^k}{k!} \tag{3-43}$$

Example 3-21. An order of 3000 parts is received. The probability that a part is defective equals $10^{-3}$. We wish to find the probability $P\{k > 5\}$ that there will be more than five defective parts.

Clearly,

$$P\{k > 5\} = 1 - P\{k \le 5\}$$

With $np = 3$, (3-43) yields

$$P\{k \le 5\} \approx e^{-3} \sum_{k=0}^{5} \frac{3^k}{k!} = 0.916$$

Hence

$$P\{k > 5\} \approx 0.084$$
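The quality of the Poisson approximation in Examples 3-20 and 3-21 can be checked by evaluating the exact binomial expressions next to (3-41) and (3-43). This is a minimal sketch; the function names are assumptions made for illustration and are not taken from the text.

```python
from math import comb, exp, factorial


def binomial_pmf(n, k, p):
    """Exact probability (3-13) of k occurrences in n trials."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def poisson_pmf(a, k):
    """Poisson approximation (3-41) with a = np."""
    return exp(-a) * a**k / factorial(k)


def poisson_range(a, k1, k2):
    """Approximation (3-43) of P{k1 <= k <= k2}."""
    return sum(poisson_pmf(a, k) for k in range(k1, k2 + 1))


# Example 3-20: no failure among 1000 components, p = 0.001.
exact_no_failure = binomial_pmf(1000, 0, 1e-3)
approx_no_failure = poisson_pmf(1.0, 0)
print(exact_no_failure, approx_no_failure)

# Example 3-21: at most five defective parts among 3000, p = 0.001.
exact_at_most_5 = sum(binomial_pmf(3000, k, 1e-3) for k in range(6))
approx_at_most_5 = poisson_range(3.0, 0, 5)
print(exact_at_most_5, approx_at_most_5, 1.0 - approx_at_most_5)
```

In both cases $p \ll 1$ and $np$ is of the order of one, which is the regime in which (3-41) was derived.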
Generalization of Poisson theorem. Suppose that $\mathcal{A}_1, \ldots, \mathcal{A}_{m+1}$ are the $m + 1$ events of a partition with $P\{\mathcal{A}_i\} = p_i$. Reasoning as in (3-42), we can show that if $np_i \to a_i$ for $i \le m$, then

$$\frac{n!}{k_1! \cdots k_{m+1}!}\, p_1^{k_1} \cdots p_{m+1}^{k_{m+1}} \xrightarrow[n \to \infty]{} \frac{e^{-a_1} a_1^{k_1}}{k_1!} \cdots \frac{e^{-a_m} a_m^{k_m}}{k_m!} \tag{3-44}$$

Random Poisson Points

An important application of Poisson's theorem is the approximate evaluation of (3-15) as $T$ and $n$ tend to $\infty$. We repeat the problem: We place at random $n$ points in the interval $(-T/2, T/2)$ and we denote by $P\{k \text{ in } t_a\}$ the probability that $k$ of these points will lie in an interval $(t_1, t_2)$ of length $t_2 - t_1 = t_a$. As we have shown in (3-15)

$$P\{k \text{ in } t_a\} = \binom{n}{k} p^k q^{n-k} \quad \text{where} \quad p = \frac{t_a}{T} \tag{3-45}$$

We now assume that $n \gg 1$ and $t_a \ll T$. Applying (3-41), we conclude that

$$P\{k \text{ in } t_a\} \approx e^{-nt_a/T}\, \frac{(nt_a/T)^k}{k!} \tag{3-46}$$

for $k$ of the order of $nt_a/T$.

Suppose, next, that $n$ and $T$ increase indefinitely but the ratio

$$\lambda = n/T$$

remains constant. The result is an infinite set of points covering the entire $t$ axis from $-\infty$ to $+\infty$. As we see from (3-46), the probability that $k$ of these points are in an interval of length $t_a$ is given by

$$P\{k \text{ in } t_a\} = e^{-\lambda t_a}\, \frac{(\lambda t_a)^k}{k!} \tag{3-47}$$

POINTS IN NONOVERLAPPING INTERVALS. Returning for a moment to the original interval $(-T/2, T/2)$ containing $n$ points, we consider two nonoverlapping subintervals $t_a$ and $t_b$ (Fig. 3-7).

FIGURE 3-7
We wish to determine the probability

$$P\{k_a \text{ in } t_a, k_b \text{ in } t_b\}$$

that $k_a$ of the $n$ points are in interval $t_a$ and $k_b$ in the interval $t_b$. We maintain that

$$P\{k_a \text{ in } t_a, k_b \text{ in } t_b\} = \frac{n!}{k_a!\,k_b!\,k_3!} \left(\frac{t_a}{T}\right)^{k_a} \left(\frac{t_b}{T}\right)^{k_b} \left(1 - \frac{t_a}{T} - \frac{t_b}{T}\right)^{k_3} \tag{3-48}$$

where $k_3 = n - k_a - k_b$.

Proof. The above can be considered as a generalized Bernoulli trial. The original experiment $\mathcal{S}$ is the random selection of a single point in the interval $(-T/2, T/2)$. In this experiment, the events $\mathcal{A}_1 = \{\text{the point is in } t_a\}$, $\mathcal{A}_2 = \{\text{the point is in } t_b\}$, and $\mathcal{A}_3 = \{\text{the point is outside the intervals } t_a \text{ and } t_b\}$ form a partition and

$$P(\mathcal{A}_1) = \frac{t_a}{T} \qquad P(\mathcal{A}_2) = \frac{t_b}{T} \qquad P(\mathcal{A}_3) = 1 - \frac{t_a}{T} - \frac{t_b}{T}$$

If the experiment $\mathcal{S}$ is performed $n$ times, then the event {$k_a$ in $t_a$ and $k_b$ in $t_b$} will equal the event {$\mathcal{A}_1$ occurs $k_1 = k_a$ times, $\mathcal{A}_2$ occurs $k_2 = k_b$ times, and $\mathcal{A}_3$ occurs $k_3 = n - k_1 - k_2$ times}. Hence (3-48) follows from (3-38) with $r = 3$.

We note that the events {$k_a$ in $t_a$} and {$k_b$ in $t_b$} are not independent because the probability (3-48) of their intersection {$k_a$ in $t_a$, $k_b$ in $t_b$} does not equal $P\{k_a \text{ in } t_a\}\,P\{k_b \text{ in } t_b\}$.

Suppose now that

$$\frac{n}{T} = \lambda \qquad n \to \infty \qquad T \to \infty$$

Since $nt_a/T = \lambda t_a$ and $nt_b/T = \lambda t_b$, we conclude from (3-48) and Prob. 3-16 that

$$P\{k_a \text{ in } t_a, k_b \text{ in } t_b\} = e^{-\lambda t_a}\, \frac{(\lambda t_a)^{k_a}}{k_a!}\; e^{-\lambda t_b}\, \frac{(\lambda t_b)^{k_b}}{k_b!} \tag{3-49}$$

From (3-47) and (3-49) it follows that

$$P\{k_a \text{ in } t_a, k_b \text{ in } t_b\} = P\{k_a \text{ in } t_a\}\,P\{k_b \text{ in } t_b\} \tag{3-50}$$

This shows that the events {$k_a$ in $t_a$} and {$k_b$ in $t_b$} are independent.

We have thus created an experiment whose outcomes are infinite sets of points on the $t$ axis. These outcomes will be called random Poisson points. The experiment was formed by a limiting process; however, it is completely specified
in terms of the following two properties.

1. The probability $P\{k_a \text{ in } t_a\}$ that the number of points in an interval $(t_1, t_2)$ equals $k_a$ is given by (3-47).
2. If two intervals $(t_1, t_2)$ and $(t_3, t_4)$ are nonoverlapping, then the events {$k_a$ in $(t_1, t_2)$} and {$k_b$ in $(t_3, t_4)$} are independent.

The experiment of random Poisson points is fundamental in the theory and the applications of probability. As illustrations we mention electron emission, telephone calls, cars crossing a bridge, and shot noise, among many others.

Example 3-22. Consider two consecutive intervals $(t_1, t_2)$ and $(t_2, t_3)$ with respective lengths $t_a$ and $t_b$. Clearly, $(t_1, t_3)$ is an interval with length $t_c = t_a + t_b$. We denote by $k_a$, $k_b$, and $k_c = k_a + k_b$ the number of points in these intervals. We assume that the number of points $k_c$ in the interval $(t_1, t_3)$ is specified. We wish to find the probability that $k_a$ of these points are in the interval $(t_1, t_2)$. In other words, we wish to find the conditional probability

$$P\{k_a \text{ in } t_a \mid k_c \text{ in } t_c\}$$

With $k_b = k_c - k_a$, we observe that

$$\{k_a \text{ in } t_a, k_c \text{ in } t_c\} = \{k_a \text{ in } t_a, k_b \text{ in } t_b\}$$

Hence

$$P\{k_a \text{ in } t_a \mid k_c \text{ in } t_c\} = \frac{P\{k_a \text{ in } t_a, k_b \text{ in } t_b\}}{P\{k_c \text{ in } t_c\}}$$

From (3-47) and (3-49) it follows that the above fraction equals

$$\frac{e^{-\lambda t_a}\,[(\lambda t_a)^{k_a}/k_a!]\; e^{-\lambda t_b}\,[(\lambda t_b)^{k_b}/k_b!]}{e^{-\lambda t_c}\,[(\lambda t_c)^{k_c}/k_c!]}$$

Since $t_c = t_a + t_b$ and $k_c = k_a + k_b$, the above yields

$$P\{k_a \text{ in } t_a \mid k_c \text{ in } t_c\} = \frac{k_c!}{k_a!\,k_b!} \left(\frac{t_a}{t_c}\right)^{k_a} \left(\frac{t_b}{t_c}\right)^{k_b} \tag{3-51}$$

This result has the following useful interpretation: Suppose that we place at random $k_c$ points in the interval $(t_1, t_3)$. As we see from (3-15), the probability that $k_a$ of these points are in the interval $(t_1, t_2)$ equals the right side of (3-51).

Density of Poisson points. The experiment of Poisson points is specified in terms of the parameter $\lambda$. We show next that this parameter can be interpreted as the density of the points. Indeed, if the interval $\Delta t = t_2 - t_1$ is sufficiently small, then

$$\lambda\,\Delta t\, e^{-\lambda \Delta t} \approx \lambda\,\Delta t$$

From this and (3-47) it follows that

$$P\{\text{one point in } (t, t + \Delta t)\} \approx \lambda\,\Delta t \tag{3-52}$$
Hence

$$\lambda = \lim_{\Delta t \to 0} \frac{P\{\text{one point in } (t, t + \Delta t)\}}{\Delta t} \tag{3-53}$$

Nonuniform density. Using a nonlinear transformation of the $t$ axis, we shall define an experiment whose outcomes are Poisson points specified by a minor modification of property 1.

Suppose that $\lambda(t)$ is a function such that $\lambda(t) \ge 0$ but otherwise arbitrary. We define the experiment of the nonuniform Poisson points as follows:

1. The probability that the number of points in the interval $(t_1, t_2)$ equals $k$ is given by

$$P\{k \text{ in } (t_1, t_2)\} = \exp\left\{-\int_{t_1}^{t_2} \lambda(t)\,dt\right\} \frac{\left[\int_{t_1}^{t_2} \lambda(t)\,dt\right]^k}{k!} \tag{3-54}$$

2. The same as in the uniform case.

The significance of $\lambda(t)$ as density remains the same. Indeed, with $t_2 - t_1 = \Delta t$ and $k = 1$, (3-54) yields

$$P\{\text{one point in } (t, t + \Delta t)\} \approx \lambda(t)\,\Delta t \tag{3-55}$$

as in (3-52).

PROBLEMS

3-1. A pair of fair dice is rolled 10 times. Find the probability that "seven" will show at least once.
Answer: $1 - (5/6)^{10}$.

3-2. A coin with $p\{h\} = p = 1 - q$ is tossed $n$ times. Show that the probability that the number of heads is even equals $0.5[1 + (q - p)^n]$.

3-3. (Hypergeometric series) A shipment contains $K$ good and $N - K$ defective components. We pick at random $n \le K$ components and test them. Show that the probability $p$ that $k$ of the tested components are good equals

$$p = \binom{K}{k} \binom{N - K}{n - k} \bigg/ \binom{N}{n}$$

3-4. A fair coin is tossed 900 times. Find the probability that the number of heads is between 420 and 465.
Answer: $G(2) + G(1) - 1 \approx 0.819$.

3-5. A fair coin is tossed $n$ times. Find $n$ such that the probability that the number of heads is between $0.49n$ and $0.52n$ is at least 0.9.
Answer: $G(0.04\sqrt{n}) + G(0.02\sqrt{n}) \ge 1.9$; hence $n > 4556$.

3-6. If $P(\mathcal{A}) = 0.6$ and $k$ is the number of successes of $\mathcal{A}$ in $n$ trials (a) show that $P\{550 \le k \le 650\} = 0.999$, for $n = 1000$. (b) Find $n$ such that $P\{0.59n \le k \le 0.61n\} = 0.95$.
3-7. A system has 100 components. The probability that a specific component will fail in the interval $(a, b)$ equals $e^{-a/T} - e^{-b/T}$. Find the probability that in the interval $(0, T/4)$, no more than 100 components will fail.

3-8. A coin is tossed an infinite number of times. Show that the probability that $k$ heads are observed at the $n$th toss but not earlier equals

$$\binom{n - 1}{k - 1} p^k q^{n-k}$$

3-9. Show that

$$\frac{1}{x}\left(1 - \frac{1}{x^2}\right) g(x) < 1 - G(x) < \frac{1}{x}\, g(x) \qquad x > 0$$

Hint: Prove the following inequalities and integrate from $x$ to $\infty$:

$$-\frac{d}{dx}\left(\frac{1}{x}\, e^{-x^2/2}\right) > e^{-x^2/2} \qquad -\frac{d}{dx}\left[\left(\frac{1}{x} - \frac{1}{x^3}\right) e^{-x^2/2}\right] < e^{-x^2/2}$$

3-10. Suppose that in $n$ trials, the probability that an event $\mathcal{A}$ occurs at least once equals $P_1$. Show that, if $P(\mathcal{A}) = p$ and $pn \ll 1$, then $P_1 \approx np$.

3-11. The probability that a driver will have an accident in 1 month equals 0.02. Find the probability that in 100 months he will have three accidents.
Answer: About $4e^{-2}/3$.

3-12. A fair die is rolled five times. Find the probability that one shows twice, three shows twice, and six shows once.

3-13. Show that (3-27) is a special case of (3-39) obtained with $r = 2$, $k_1 = k$, $k_2 = n - k$, $p_1 = p$, $p_2 = 1 - p$.

3-14. Players X and Y roll dice alternately starting with X. The player that rolls eleven wins. Show that the probability $p$ that X wins equals 18/35.
Outline: Show that

$$P(\mathcal{A}) = P(\mathcal{A} \mid \mathcal{M})P(\mathcal{M}) + P(\mathcal{A} \mid \overline{\mathcal{M}})P(\overline{\mathcal{M}})$$

Set $\mathcal{A} = \{X \text{ wins}\}$, $\mathcal{M} = \{\text{eleven shows at first try}\}$. Note that $P(\mathcal{A}) = p$, $P(\mathcal{A} \mid \mathcal{M}) = 1$, $P(\mathcal{M}) = 2/36$, $P(\mathcal{A} \mid \overline{\mathcal{M}}) = 1 - p$.

3-15. We place at random $n$ particles in $m > n$ boxes. Find the probability $p$ that the particles will be found in $n$ preselected boxes (one in each box). Consider the following cases: (a) M-B (Maxwell-Boltzmann): the particles are distinct; all alternatives are possible, (b) B-E (Bose-Einstein): the particles cannot be distinguished; all alternatives are possible, (c) F-D (Fermi-Dirac): the particles cannot be distinguished; at most one particle is allowed in a box.
Answer:

M-B: $p = \dfrac{n!}{m^n}$     B-E: $p = \dfrac{n!\,(m - 1)!}{(m + n - 1)!}$     F-D: $p = \dfrac{n!\,(m - n)!}{m!}$

Outline: (a) The number $N$ of all alternatives equals $m^n$.
The number $N_{\mathcal{A}}$ of favorable alternatives equals the $n!$ permutations of the particles in the preselected boxes. (b) Place the $m - 1$ walls separating the boxes in line ending with the $n$ particles. This corresponds to one alternative where all particles are in the last box. All other possibilities are obtained by a permutation of the $n + m - 1$ objects consisting of the $m - 1$ walls and the $n$ particles. All the $(m - 1)!$ permutations of
the walls and the $n!$ permutations of the particles count as one alternative. Hence $N = (m + n - 1)!/(m - 1)!\,n!$ and $N_{\mathcal{A}} = 1$. (c) Since the particles are not distinguishable, $N$ equals the number of ways of selecting $n$ out of $m$ objects: $N = \binom{m}{n}$ and $N_{\mathcal{A}} = 1$.

3-16. Reasoning as in (3-41), show that, if

$$k_1 + k_2 + k_3 = n \qquad p_1 + p_2 + p_3 = 1 \qquad k_1 p_1 \ll 1 \qquad k_2 p_2 \ll 1$$

then

$$\frac{n!}{k_1!\,k_2!\,k_3!}\, p_1^{k_1} p_2^{k_2} p_3^{k_3} \approx e^{-n(p_1 + p_2)}\, \frac{(np_1)^{k_1}}{k_1!}\, \frac{(np_2)^{k_2}}{k_2!}$$

Use the above to justify (3-49).

3-17. We place at random 200 points in the interval (0, 100). Find the probability that in the interval (0, 2) there will be one and only one point (a) exactly and (b) using the Poisson approximation.
CHAPTER 4
THE CONCEPT OF A RANDOM VARIABLE

4-1 INTRODUCTION

A random variable (abbreviation: RV) is a number $\mathbf{x}(\zeta)$ assigned to every outcome $\zeta$ of an experiment. This number could be the gain in a game of chance, the voltage of a random source, the cost of a random component, or any other numerical quantity that is of interest in the performance of the experiment.

Example 4-1. (a) In the die experiment, we assign to the six outcomes $f_i$ the numbers $\mathbf{x}(f_i) = 10i$. Thus

$$\mathbf{x}(f_1) = 10, \ldots, \mathbf{x}(f_6) = 60$$

(b) In the same experiment, we assign the number 1 to every even outcome and the number 0 to every odd outcome. Thus

$$\mathbf{x}(f_1) = \mathbf{x}(f_3) = \mathbf{x}(f_5) = 0 \qquad \mathbf{x}(f_2) = \mathbf{x}(f_4) = \mathbf{x}(f_6) = 1$$

THE MEANING OF A FUNCTION. An RV is a function whose domain is the set of experimental outcomes. To clarify further this important concept, we review briefly the notion of a function.

As we know, a function $x(t)$ is a rule of correspondence between values of $t$ and $x$. The values of the independent variable $t$ form a set $\mathcal{S}_t$ on the $t$ axis called the domain of the function and the values of the dependent variable $x$ form a set $\mathcal{S}_x$ on the $x$ axis called the range of the function. The rule of correspondence between $t$ and $x$ could be a curve, a table, or a formula, for example, $x(t) = t^2$.
The notation $x(t)$ used to represent a function is ambiguous: It might mean either the particular number $x(t)$ corresponding to a specific $t$, or the function $x(t)$, namely, the rule of correspondence between any $t$ in $\mathcal{S}_t$ and the corresponding $x$ in $\mathcal{S}_x$. To distinguish between these two interpretations, we shall denote the latter by $x$, leaving its dependence on $t$ understood.

The definition of a function can be phrased as follows: We are given two sets of numbers $\mathcal{S}_t$ and $\mathcal{S}_x$. To every $t \in \mathcal{S}_t$ we assign a number $x(t)$ belonging to the set $\mathcal{S}_x$. This leads to the following generalization: We are given two sets of objects $\mathcal{S}_\alpha$ and $\mathcal{S}_\beta$ consisting of the elements $\alpha$ and $\beta$ respectively. We say that $\beta$ is a function of $\alpha$ if to every element of the set $\mathcal{S}_\alpha$ we make correspond an element $\beta$ of the set $\mathcal{S}_\beta$. The set $\mathcal{S}_\alpha$ is the domain of the function and the set $\mathcal{S}_\beta$ its range.

Suppose, for example, that $\mathcal{S}_\alpha$ is the set of children in a community and $\mathcal{S}_\beta$ the set of their fathers. The pairing of a child with his or her father is a function.

We note that to a given $\alpha$ there corresponds a single $\beta(\alpha)$. However, more than one element from $\mathcal{S}_\alpha$ might be paired with the same $\beta$ (a child has only one father but a father might have more than one child). In Example 4-1b, the domain of the function consists of the six faces of the die. Its range, however, has only two elements, namely, the numbers 0 and 1.

The Random Variable

We are given an experiment specified by the space $\mathcal{S}$, the field of subsets of $\mathcal{S}$ called events, and the probability assigned to these events. To every outcome $\zeta$ of this experiment, we assign a number $\mathbf{x}(\zeta)$. We have thus created a function $\mathbf{x}$ with domain the set $\mathcal{S}$ and range a set of numbers. This function is called a random variable if it satisfies certain mild conditions to be soon given.

All random variables will be written in boldface letters.
The symbol $\mathbf{x}(\zeta)$ will indicate the number assigned to the specific outcome $\zeta$ and the symbol $\mathbf{x}$ will indicate the rule of correspondence between any element of $\mathcal{S}$ and the number assigned to it. In Example 4-1a, $\mathbf{x}$ is the table pairing the six faces of the die with the six numbers 10, ..., 60. The domain of this function is the set $\mathcal{S} = \{f_1, \ldots, f_6\}$ and its range is the set of the above six numbers. The expression $\mathbf{x}(f_2)$ is the number 20.

Events generated by random variables. In the study of RVs, questions of the following form arise: What is the probability that the RV $\mathbf{x}$ is less than a given number $x$, or what is the probability that $\mathbf{x}$ is between the numbers $x_1$ and $x_2$? If, for example, the RV is the height of a person, we might want the probability that it will not exceed certain bounds. As we know, probabilities are assigned only to events; therefore, in order to answer such questions, we should be able to express the various conditions imposed on $\mathbf{x}$ as events.

We start with the meaning of the notation

$$\{\mathbf{x} \le x\}$$
This notation represents a subset of $\mathcal{S}$ consisting of all outcomes $\zeta$ such that $\mathbf{x}(\zeta) \le x$. We elaborate on its meaning: Suppose that the RV $\mathbf{x}$ is specified by a table. At the left column we list all elements $\zeta_i$ of $\mathcal{S}$ and at the right the corresponding values (numbers) $\mathbf{x}(\zeta_i)$ of $\mathbf{x}$. Given an arbitrary number $x$, we find all numbers $\mathbf{x}(\zeta_i)$ that do not exceed $x$. The corresponding elements $\zeta_i$ on the left column form the set $\{\mathbf{x} \le x\}$. Thus $\{\mathbf{x} \le x\}$ is not a set of numbers but a set of experimental outcomes.

The meaning of

$$\{x_1 \le \mathbf{x} \le x_2\}$$

is similar. It represents a subset of $\mathcal{S}$ consisting of all outcomes $\zeta$ such that $x_1 \le \mathbf{x}(\zeta) \le x_2$ where $x_1$ and $x_2$ are two given numbers.

The notation

$$\{\mathbf{x} = x\}$$

is a subset of $\mathcal{S}$ consisting of all outcomes $\zeta$ such that $\mathbf{x}(\zeta) = x$.

Finally, if $R$ is a set of numbers on the $x$ axis, then

$$\{\mathbf{x} \in R\}$$

represents the subset of $\mathcal{S}$ consisting of all outcomes $\zeta$ such that $\mathbf{x}(\zeta) \in R$.

Example 4-2. We shall illustrate the above with the RV $\mathbf{x}(f_i) = 10i$ of the die experiment (Fig. 4-1).

The set $\{\mathbf{x} \le 35\}$ consists of the elements $f_1$, $f_2$, $f_3$ because $\mathbf{x}(f_i) \le 35$ only if $i = 1$, 2, or 3.

The set $\{\mathbf{x} \le 5\}$ is empty because there is no outcome such that $\mathbf{x}(f_i) \le 5$.

The set $\{20 \le \mathbf{x} \le 35\}$ consists of the elements $f_2$ and $f_3$ because $20 \le \mathbf{x}(f_i) \le 35$ only if $i = 2$ or 3.

The set $\{\mathbf{x} = 40\}$ consists of the element $f_4$ because $\mathbf{x}(f_i) = 40$ only if $i = 4$.

Finally, $\{\mathbf{x} = 35\}$ is the empty set because there is no experimental outcome such that $\mathbf{x}(f_i) = 35$.

FIGURE 4-1

Note. In the applications, we are interested in the probability that an RV $\mathbf{x}$ takes values in a certain region $R$ of the $x$ axis. This requires that the set $\{\mathbf{x} \in R\}$ be an event. As we noted in Sec. 2-2, that is not always possible. However, if $\{\mathbf{x} \le x\}$ is an event for every $x$ and $R$ is a countable union and intersection of intervals, then $\{\mathbf{x} \in R\}$ is also an event. In
the definition of RVs we shall assume, therefore, that the set $\{\mathbf{x} \le x\}$ is an event. This mild restriction is mainly of mathematical interest.

We conclude with a formal definition of an RV.

DEFINITION. An RV $\mathbf{x}$ is a process of assigning a number $\mathbf{x}(\zeta)$ to every outcome $\zeta$. The resulting function must satisfy the following two conditions but is otherwise arbitrary:

I. The set $\{\mathbf{x} \le x\}$ is an event for every $x$.
II. The probabilities of the events $\{\mathbf{x} = \infty\}$ and $\{\mathbf{x} = -\infty\}$ equal 0:

$$P\{\mathbf{x} = \infty\} = 0 \qquad P\{\mathbf{x} = -\infty\} = 0$$

The second condition states that, although we allow $\mathbf{x}$ to be $+\infty$ or $-\infty$ for some outcomes, we demand that these outcomes form a set with zero probability.

A complex RV $\mathbf{z}$ is a sum

$$\mathbf{z} = \mathbf{x} + j\mathbf{y}$$

where $\mathbf{x}$ and $\mathbf{y}$ are real RVs. Unless otherwise stated, it will be assumed that all RVs are real.

4-2 DISTRIBUTION AND DENSITY FUNCTIONS

The elements of the set $\mathcal{S}$ that are contained in the event $\{\mathbf{x} \le x\}$ change as the number $x$ takes various values. The probability $P\{\mathbf{x} \le x\}$ of the event $\{\mathbf{x} \le x\}$ is, therefore, a number that depends on $x$. This number is denoted by $F_x(x)$ and is called the (cumulative) distribution function of the RV $\mathbf{x}$.

DEFINITION. The distribution function of the RV $\mathbf{x}$ is the function

$$F_x(x) = P\{\mathbf{x} \le x\} \tag{4-1}$$

defined for every $x$ from $-\infty$ to $\infty$.

The distribution functions of the RVs $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{z}$ are denoted by $F_x(x)$, $F_y(y)$, and $F_z(z)$ respectively. In this notation, the variables $x$, $y$, and $z$ can be identified by any letter. We could, for example, use the notation $F_x(w)$, $F_y(w)$, and $F_z(w)$ to represent the above functions. Specifically,

$$F_x(w) = P\{\mathbf{x} \le w\}$$

is the distribution function of the RV $\mathbf{x}$. However, if there is no fear of ambiguity, we shall identify the RVs under consideration by the independent variable in (4-1) omitting the subscripts. Thus the distribution functions of the RVs $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{z}$ will be denoted by $F(x)$, $F(y)$, and $F(z)$ respectively.

Example 4-3.
In the coin-tossing experiment, the probability of heads equals $p$ and the probability of tails equals $q$. We define the RV $\mathbf{x}$ such that
$$\mathbf{x}(h) = 1 \qquad \mathbf{x}(t) = 0$$
We shall find its distribution function $F(x)$ for every $x$ from $-\infty$ to $\infty$.
If $x \ge 1$, then $\mathbf{x}(h) = 1 \le x$ and $\mathbf{x}(t) = 0 \le x$. Hence (Fig. 4-2)
$$F(x) = P\{\mathbf{x} \le x\} = P\{h, t\} = 1 \qquad x \ge 1$$
If $0 \le x < 1$, then $\mathbf{x}(h) = 1 > x$ and $\mathbf{x}(t) = 0 \le x$. Hence
$$F(x) = P\{\mathbf{x} \le x\} = P\{t\} = q \qquad 0 \le x < 1$$
If $x < 0$, then $\mathbf{x}(h) = 1 > x$ and $\mathbf{x}(t) = 0 > x$. Hence
$$F(x) = P\{\mathbf{x} \le x\} = P\{\emptyset\} = 0 \qquad x < 0$$

Example 4-4. In the die experiment of Example 4-2, the RV $\mathbf{x}$ is such that $\mathbf{x}(f_i) = 10i$. If the die is fair, then the distribution function of $\mathbf{x}$ is a staircase function as in Fig. 4-3.
We note, in particular, that
$$F(100) = P\{\mathbf{x} \le 100\} = P(\mathcal{S}) = 1$$
$$F(35) = P\{\mathbf{x} \le 35\} = P\{f_1, f_2, f_3\} = \tfrac{3}{6}$$
$$F(30.01) = P\{\mathbf{x} \le 30.01\} = P\{f_1, f_2, f_3\} = \tfrac{3}{6}$$
$$F(30) = P\{\mathbf{x} \le 30\} = P\{f_1, f_2, f_3\} = \tfrac{3}{6}$$
$$F(29.99) = P\{\mathbf{x} \le 29.99\} = P\{f_1, f_2\} = \tfrac{2}{6}$$

Example 4-5. A telephone call occurs at random in the interval (0, 1). In this experiment, the outcomes are time distances $t$ between 0 and 1, and the probability that $t$ is between $t_1$ and $t_2$ is given by
$$P\{t_1 \le t \le t_2\} = t_2 - t_1$$
We define the RV $\mathbf{x}$ such that
$$\mathbf{x}(t) = t \qquad 0 \le t \le 1$$

FIGURE 4-3
Thus the variable $t$ has a double meaning: It is the outcome of the experiment and the corresponding value $\mathbf{x}(t)$ of the RV $\mathbf{x}$. We shall show that the distribution function $F(x)$ of $\mathbf{x}$ is a ramp as in Fig. 4-4.
If $x \ge 1$, then $\mathbf{x}(t) \le x$ for every outcome. Hence
$$F(x) = P\{\mathbf{x} \le x\} = P\{0 \le t \le 1\} = P(\mathcal{S}) = 1 \qquad x \ge 1$$
If $0 \le x < 1$, then $\mathbf{x}(t) \le x$ for every $t$ in the interval $(0, x)$. Hence
$$F(x) = P\{\mathbf{x} \le x\} = P\{0 \le t \le x\} = x \qquad 0 \le x < 1$$
If $x < 0$, then $\{\mathbf{x} \le x\}$ is the impossible event because $\mathbf{x}(t) \ge 0$ for every $t$. Hence
$$F(x) = P\{\mathbf{x} \le x\} = P\{\emptyset\} = 0 \qquad x < 0$$

Example 4-6. Suppose that an RV $\mathbf{x}$ is such that $\mathbf{x}(\zeta) = a$ for every $\zeta$ in $\mathcal{S}$. We shall find its distribution function.
If $x \ge a$, then $\mathbf{x}(\zeta) = a \le x$ for every $\zeta$. Hence
$$F(x) = P\{\mathbf{x} \le x\} = P(\mathcal{S}) = 1 \qquad x \ge a$$
If $x < a$, then $\{\mathbf{x} \le x\}$ is the impossible event because $\mathbf{x}(\zeta) = a$. Hence
$$F(x) = P\{\mathbf{x} \le x\} = P\{\emptyset\} = 0 \qquad x < a$$
Thus a constant can be interpreted as an RV whose distribution function is a delayed step $U(x - a)$ as in Fig. 4-5.

Note. A complex RV $\mathbf{z} = \mathbf{x} + j\mathbf{y}$ has no distribution function because the inequality $\mathbf{x} + j\mathbf{y} \le x + jy$ has no meaning. The statistical properties of $\mathbf{z}$ are specified in terms of the joint distribution of the RVs $\mathbf{x}$ and $\mathbf{y}$ (see Chap. 6).

Percentiles. The $u$ percentile of an RV $\mathbf{x}$ is the smallest number $x_u$ such that
$$u = P\{\mathbf{x} \le x_u\} = F(x_u) \tag{4-2}$$
Thus $x_u$ is the inverse of the function $u = F(x)$. Its domain is the interval $0 \le u \le 1$, and its range is the $x$ axis. To find the graph of the function $x_u$, we interchange the axes of the $F(x)$ curve as in Fig. 4-6. The median of $\mathbf{x}$ is the smallest number $m$ such that $F(m) = 0.5$. Thus $m$ is the 0.5 percentile of $\mathbf{x}$.
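The staircase of Example 4-4, the ramp of Example 4-5, and the percentile definition (4-2) can be checked numerically. The following is a minimal sketch, not part of the original text. The names `die_cdf`, `ramp_cdf`, and `percentile` are illustrative. For a staircase $F(x)$ the equation $u = F(x_u)$ may have no exact solution, so the sketch assumes the common convention of taking the smallest $x$ with $F(x) \ge u$.

```python
from fractions import Fraction

# Die experiment of Examples 4-2 and 4-4: outcome f_i is mapped to x(f_i) = 10*i.
VALUES = [10 * i for i in range(1, 7)]
PROB = Fraction(1, 6)  # fair die


def die_cdf(x):
    """F(x) = P{x <= x}: total probability of the outcomes whose value does not exceed x."""
    return sum((PROB for v in VALUES if v <= x), Fraction(0))


def ramp_cdf(x):
    """F(x) of Example 4-5: 0 for x < 0, x on [0, 1), 1 for x >= 1."""
    return min(max(x, 0.0), 1.0)


def percentile(cdf, u, lo, hi, tol=1e-9):
    """Smallest x in (lo, hi] with cdf(x) >= u, located by bisection.

    Assumes cdf is nondecreasing with cdf(lo) < u <= cdf(hi).
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) >= u:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    print(die_cdf(35), die_cdf(29.99))           # values listed in Example 4-4
    print(percentile(ramp_cdf, 0.5, -1.0, 2.0))  # median of the ramp
    print(percentile(die_cdf, 0.5, 0.0, 100.0))  # median of the die RV
```

Exact fractions are used for the die so that the staircase values can be compared with Example 4-4 without rounding.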
Frequency interpretation of $F(x)$ and $x_u$. We perform the experiment $n$ times and we observe $n$ values $x_1, \ldots, x_n$ of the RV $\mathbf{x}$. We place these numbers on the $x$ axis and we form a staircase function $F_n(x)$ as in Fig. 4-6a. The steps are located at the points $x_i$ and their height equals $1/n$. They start at the smallest value $x_{\min}$ of $x_i$, and $F_n(x) = 0$ for $x < x_{\min}$. The function $F_n(x)$ so constructed is called the empirical distribution of the RV $\mathbf{x}$.

For a specific $x$, the number of steps of $F_n(x)$ equals the number $n_x$ of $x_i$'s that are smaller than $x$; thus $F_n(x) = n_x/n$. And since $n_x/n \simeq P\{\mathbf{x} \le x\}$ for large $n$, we conclude that
$$F_n(x) = \frac{n_x}{n} \to P\{\mathbf{x} \le x\} = F(x) \quad \text{as } n \to \infty \tag{4-3}$$

The empirical interpretation of the $u$ percentile $x_u$ is the Quetelet curve defined as follows: We form $n$ line segments of length $x_i$ and place them vertically in order of increasing length, distance $1/n$ apart. We then form a staircase function with corners at the endpoints of these segments as in Fig. 4-6b. The curve so obtained is the empirical interpretation of $x_u$, and it equals the empirical distribution $F_n(x)$ if its axes are interchanged.

Properties of Distribution Functions

In the following, the expressions $F(x^+)$ and $F(x^-)$ will mean the limits
$$F(x^+) = \lim F(x + \varepsilon) \qquad F(x^-) = \lim F(x - \varepsilon) \qquad 0 < \varepsilon \to 0$$
The distribution function has the following properties:

1. $F(+\infty) = 1 \qquad F(-\infty) = 0$
Proof. $F(+\infty) = P\{\mathbf{x} \le +\infty\} = P(\mathcal{S}) = 1 \qquad F(-\infty) = P\{\mathbf{x} = -\infty\} = 0$
2. It is a nondecreasing function of $x$:
$$\text{if } x_1 < x_2 \text{ then } F(x_1) \le F(x_2) \tag{4-4}$$
Proof. The event $\{\mathbf{x} \le x_1\}$ is a subset of the event $\{\mathbf{x} \le x_2\}$ because, if $\mathbf{x}(\zeta) \le x_1$ for some $\zeta$, then $\mathbf{x}(\zeta) \le x_2$. Hence [see (2-14)] $P\{\mathbf{x} \le x_1\} \le P\{\mathbf{x} \le x_2\}$ and (4-4) results.
From (4-4) it follows that $F(x)$ increases from 0 to 1 as $x$ increases from $-\infty$ to $\infty$.

3. $$\text{If } F(x_0) = 0 \text{ then } F(x) = 0 \text{ for every } x \le x_0 \tag{4-5}$$
Proof. It follows from (4-4) because $F(-\infty) = 0$.
The above leads to the following conclusion: Suppose that $\mathbf{x}(\zeta) \ge 0$ for every $\zeta$. In this case, $P\{\mathbf{x} < 0\} = 0$ because $\{\mathbf{x} < 0\}$ is the impossible event. Hence $F(x) = 0$ for every $x < 0$.

4. $$P\{\mathbf{x} > x\} = 1 - F(x) \tag{4-6}$$
Proof. The events $\{\mathbf{x} \le x\}$ and $\{\mathbf{x} > x\}$ are mutually exclusive and
$$\{\mathbf{x} \le x\} + \{\mathbf{x} > x\} = \mathcal{S}$$
Hence $P\{\mathbf{x} \le x\} + P\{\mathbf{x} > x\} = P(\mathcal{S}) = 1$ and (4-6) results.

5. The function $F(x)$ is continuous from the right:
$$F(x^+) = F(x) \tag{4-7}$$
Proof. It suffices to show that $P\{\mathbf{x} \le x + \varepsilon\} \to F(x)$ as $\varepsilon \to 0$ because $P\{\mathbf{x} \le x + \varepsilon\} = F(x + \varepsilon)$ and $F(x + \varepsilon) \to F(x^+)$ by definition. To prove the above, we must show that the sets $\{\mathbf{x} \le x + \varepsilon\}$ tend to the set $\{\mathbf{x} \le x\}$ as $\varepsilon \to 0$ and use the axiom IIIa of infinite additivity. We omit, however, the details of the proof because we have not introduced limits of sets.

6. $$P\{x_1 < \mathbf{x} \le x_2\} = F(x_2) - F(x_1) \tag{4-8}$$
Proof. The events $\{\mathbf{x} \le x_1\}$ and $\{x_1 < \mathbf{x} \le x_2\}$ are mutually exclusive because $\mathbf{x}(\zeta)$ cannot be both less than $x_1$ and between $x_1$ and $x_2$. Furthermore,
$$\{\mathbf{x} \le x_2\} = \{\mathbf{x} \le x_1\} + \{x_1 < \mathbf{x} \le x_2\}$$
Hence $P\{\mathbf{x} \le x_2\} = P\{\mathbf{x} \le x_1\} + P\{x_1 < \mathbf{x} \le x_2\}$ and (4-8) results.
7. $$P\{\mathbf{x} = x\} = F(x) - F(x^-) \tag{4-9}$$
Proof. Setting $x_1 = x - \varepsilon$ and $x_2 = x$ in (4-8), we obtain
$$P\{x - \varepsilon < \mathbf{x} \le x\} = F(x) - F(x - \varepsilon)$$
and with $\varepsilon \to 0$, (4-9) results.

8. $$P\{x_1 \le \mathbf{x} \le x_2\} = F(x_2) - F(x_1^-) \tag{4-10}$$
Proof. It follows from (4-8) and (4-9) because
$$\{x_1 \le \mathbf{x} \le x_2\} = \{x_1 < \mathbf{x} \le x_2\} + \{\mathbf{x} = x_1\}$$
and the last two events are mutually exclusive.

Statistics. We shall say that the statistics of an RV $\mathbf{x}$ are known if we can determine the probability $P\{\mathbf{x} \in R\}$ that $\mathbf{x}$ is in a set $R$ of the $x$ axis consisting of countable unions or intersections of intervals. From (4-1) and the axioms it follows that the statistics of $\mathbf{x}$ are determined in terms of its distribution function.

Continuous, discrete, and mixed types. We shall say that an RV $\mathbf{x}$ is of continuous type if its distribution function $F(x)$ is continuous. In this case, $F(x^-) = F(x)$; hence
$$P\{\mathbf{x} = x\} = 0 \tag{4-11}$$
for every $x$.

We shall say that $\mathbf{x}$ is of discrete type if $F(x)$ is a staircase function as in Fig. 4-7. Denoting by $x_i$ the discontinuity points of $F(x)$, we have
$$F(x_i) - F(x_i^-) = P\{\mathbf{x} = x_i\} = p_i \tag{4-12}$$
In this case, the statistics of $\mathbf{x}$ are determined in terms of $x_i$ and $p_i$. If the points $x_i$ are equidistant, that is, if $x_i = a + bi$, then the RV $\mathbf{x}$ is of lattice type.

We shall say that $\mathbf{x}$ is of mixed type if $F(x)$ is discontinuous but not a staircase.

Note that if the set $\mathcal{S}$ has finitely many elements, then any RV defined on $\mathcal{S}$ is of discrete type. However, an RV $\mathbf{x}$ might be of discrete type even if $\mathcal{S}$ has infinitely many elements.

FIGURE 4-7
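The frequency interpretation (4-3) and the interval property (4-8) can be illustrated by simulation. The following is a minimal sketch, assuming samples of an RV that is uniform in (0, 1), so that $F(x) = x$ there. The helper name `empirical_cdf` and the fixed seed are illustrative choices, and the count uses $x_i \le x$ to match the definition $F(x) = P\{\mathbf{x} \le x\}$.

```python
import bisect
import random


def empirical_cdf(samples):
    """Return F_n, where F_n(x) is the fraction of samples that do not exceed x."""
    ordered = sorted(samples)
    n = len(ordered)

    def F_n(x):
        # bisect_right counts the observed values x_i with x_i <= x.
        return bisect.bisect_right(ordered, x) / n

    return F_n


rng = random.Random(0)                        # fixed seed so the run is repeatable
samples = [rng.random() for _ in range(20000)]
F_n = empirical_cdf(samples)

# Relative frequency of the event {x1 < x <= x2}, counted directly.
x1, x2 = 0.2, 0.7
interval_freq = sum(1 for s in samples if x1 < s <= x2) / len(samples)

if __name__ == "__main__":
    print(F_n(0.3))                           # close to F(0.3) = 0.3 for large n
    print(interval_freq, F_n(x2) - F_n(x1))   # the two sides of (4-8)
```

The staircase $F_n$ satisfies (4-8) exactly for the observed frequencies, while its agreement with $F(x)$ is only approximate and improves as $n$ grows.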
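Example 4-7. If $\mathcal{A}$ is an arbitrary event of $\mathcal{S}$ and $\mathbf{x}_{\mathcal{A}}$ is an RV such that
$$\mathbf{x}_{\mathcal{A}}(\zeta) = \begin{cases} 1 & \zeta \in \mathcal{A} \\ 0 & \zeta \notin \mathcal{A} \end{cases} \tag{4-13}$$
then $\mathbf{x}_{\mathcal{A}}$ is called the zero-one RV associated with the event $\mathcal{A}$. Thus
$$\{\mathbf{x}_{\mathcal{A}} = 1\} = \mathcal{A} \qquad \{\mathbf{x}_{\mathcal{A}} = 0\} = \bar{\mathcal{A}}$$
Hence $\mathbf{x}_{\mathcal{A}}$ is of discrete type taking only the two values 0 and 1 with
$$P\{\mathbf{x}_{\mathcal{A}} = 1\} = P(\mathcal{A}) \qquad P\{\mathbf{x}_{\mathcal{A}} = 0\} = 1 - P(\mathcal{A})$$
The space $\mathcal{S}$, however, might have infinitely many elements.

The Density Function

The derivative
$$f(x) = \frac{dF(x)}{dx} \tag{4-14}$$
of $F(x)$ is called the density function (known also as the frequency function) of the RV $\mathbf{x}$.

If the RV $\mathbf{x}$ is of discrete type taking the values $x_i$ with probabilities $p_i$, then
$$f(x) = \sum_i p_i\,\delta(x - x_i) \qquad p_i = P\{\mathbf{x} = x_i\} \tag{4-15}$$
where $\delta(x)$ is the impulse function (Fig. 4-7). The term $p_i\,\delta(x - x_i)$ is shown as a vertical arrow at $x = x_i$ with length equal to $p_i$.

In Example 4-2, the RV $\mathbf{x}$ is of discrete type taking the six values $x_1 = 10, \ldots, x_6 = 60$ with $p_i = 1/6$. Hence
$$f(x) = \tfrac{1}{6}\,[\delta(x - 10) + \delta(x - 20) + \cdots + \delta(x - 60)]$$

PROPERTIES. From the monotonicity of $F(x)$ it follows that
$$f(x) \ge 0 \tag{4-16}$$
Integrating (4-14) from $-\infty$ to $x$ and using the fact that $F(-\infty) = 0$, we obtain
$$F(x) = \int_{-\infty}^{x} f(\xi)\,d\xi \tag{4-17}$$
Since $F(\infty) = 1$, the above yields
$$\int_{-\infty}^{\infty} f(x)\,dx = 1 \tag{4-18}$$
From (4-17) it follows that
$$F(x_2) - F(x_1) = \int_{x_1}^{x_2} f(x)\,dx \tag{4-19}$$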
Hence [see (4-8)]
$$P\{x_1 < \mathbf{x} \le x_2\} = \int_{x_1}^{x_2} f(x)\,dx \tag{4-20}$$
If the RV $\mathbf{x}$ is of continuous type, then the set on the left might be replaced by the set $\{x_1 \le \mathbf{x} \le x_2\}$. However, if $F(x)$ is discontinuous at $x_1$ or $x_2$, then the integration must include the corresponding impulses of $f(x)$.

With $x_1 = x$ and $x_2 = x + \Delta x$ it follows from (4-20) that, if $\mathbf{x}$ is of continuous type, then
$$P\{x \le \mathbf{x} \le x + \Delta x\} \simeq f(x)\,\Delta x \tag{4-21}$$
provided that $\Delta x$ is sufficiently small. This shows that $f(x)$ can be defined directly as a limit
$$f(x) = \lim_{\Delta x \to 0} \frac{P\{x \le \mathbf{x} \le x + \Delta x\}}{\Delta x} \tag{4-22}$$

Note. As we can see from (4-21), the probability that $\mathbf{x}$ is in a small interval of specified length $\Delta x$ is proportional to $f(x)$, and it is maximum if that interval contains the point $x_m$ where $f(x)$ is maximum. This point is called the mode or the most likely value of $\mathbf{x}$. An RV is called unimodal if it has a single mode.

Frequency interpretation. We denote by $\Delta n_x$ the number of trials such that
$$x \le \mathbf{x}(\zeta) \le x + \Delta x$$
From (1-1) and (4-21) it follows that
$$f(x)\,\Delta x \simeq \frac{\Delta n_x}{n} \tag{4-23}$$

4-3 SPECIAL CASES

In the preceding sections, we defined RVs starting from known experiments. In this section and throughout the book, we shall often consider RVs having specific distribution or density functions without any reference to a particular probability space.

Existence theorem. To do so, we must show that given a function $f(x)$ or its integral
$$F(x) = \int_{-\infty}^{x} f(\xi)\,d\xi$$
we can construct an experiment and an RV $\mathbf{x}$ with distribution $F(x)$ or density $f(x)$. As we know, these functions must have the following properties:
The function $f(x)$ must be nonnegative and its area must be 1. The function $F(x)$ must be continuous from the right and, as $x$ increases from $-\infty$ to $\infty$, it must increase monotonically from 0 to 1.
Proof. We consider as our space $\mathcal{S}$ the set of all real numbers, and as its events all intervals on the real line and their unions and intersections. We define the probability of the event $\{x \le x_1\}$ by
$$P\{x \le x_1\} = F(x_1) \tag{4-24}$$
where $F(x)$ is the given function. This specifies the experiment completely (see Sec. 2-2).

The outcomes of our experiment are the real numbers. To define an RV $\mathbf{x}$ on this experiment, we must know its value $\mathbf{x}(x)$ for every $x$. We define $\mathbf{x}$ such that
$$\mathbf{x}(x) = x \tag{4-25}$$
Thus $x$ is the outcome of the experiment and the corresponding value of the RV $\mathbf{x}$ (see also Example 4-5).

We maintain that the distribution function of $\mathbf{x}$ equals the given $F(x)$. Indeed, the event $\{\mathbf{x} \le x_1\}$ consists of all outcomes $x$ such that $\mathbf{x}(x) \le x_1$. Hence
$$P\{\mathbf{x} \le x_1\} = P\{x \le x_1\} = F(x_1) \tag{4-26}$$
and since this is true for every $x_1$, the theorem is proved.

In the following, we discuss briefly a number of common densities.

Normal. An RV $\mathbf{x}$ is called normal or gaussian if its density is the normal curve $g(x)$ [see (3-20)], shifted and scaled:
$$f(x) = \frac{1}{\sigma}\,g\!\left(\frac{x - \eta}{\sigma}\right) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-(x - \eta)^2/2\sigma^2} \tag{4-27}$$
This is a bell-shaped curve, symmetrical about the line $x = \eta$ (Fig. 4-8), and its area equals 1 as it should [see (3-22)]. The corresponding distribution function is given by
$$F(x) = G\!\left(\frac{x - \eta}{\sigma}\right) \tag{4-28}$$
where $G(x)$ is the tabulated integral of $g(x)$ [see (3-21)]. We shall use the notation $N(\eta; \sigma)$ to indicate that an RV $\mathbf{x}$ is normal as in (4-27). The significance of the constants $\eta$ and $\sigma$ will be given in Sec. 5-4 ($\eta$: mean, $\sigma$: standard deviation).

FIGURE 4-8

Example 4-8. An RV $\mathbf{x}$ is $N(1000; 50)$. We shall find the probability that $\mathbf{x}$ is between 900 and 1050. Clearly,
$$P\{900 \le \mathbf{x} \le 1050\} = F(1050) - F(900) = G(1) - G(-2)$$
Since
$$G(-x) = 1 - G(x) \tag{4-29}$$
we conclude from Table 3-1 that
$$P\{900 \le \mathbf{x} \le 1050\} = G(1) + G(2) - 1 = 0.819$$

Uniform. An RV $\mathbf{x}$ is called uniform between $x_1$ and $x_2$ if its density is constant in the interval $(x_1, x_2)$ and 0 elsewhere:
$$f(x) = \begin{cases} \dfrac{1}{x_2 - x_1} & x_1 \le x \le x_2 \\ 0 & \text{otherwise} \end{cases} \tag{4-30}$$
The corresponding distribution function is a ramp as in Fig. 4-9.

FIGURE 4-9

Example 4-9. A resistor $\mathbf{r}$ is an RV uniform between 900 and 1100 Ω. We shall find the probability that $\mathbf{r}$ is between 950 and 1050 Ω.
Since $f(r) = 1/200$ in the interval (900, 1100), (4-20) yields
$$P\{950 \le \mathbf{r} \le 1050\} = \frac{1}{200}\int_{950}^{1050} dr = 0.5$$

Binomial. We say that an RV $\mathbf{x}$ has a binomial distribution of order $n$ if it takes the values $0, 1, \ldots, n$ with
$$P\{\mathbf{x} = k\} = \binom{n}{k} p^k q^{n-k} \qquad p + q = 1 \tag{4-31}$$
Thus $\mathbf{x}$ is of lattice type and its density is a sum of impulses (Fig. 4-10a)
$$f(x) = \sum_{k=0}^{n} \binom{n}{k} p^k q^{n-k}\,\delta(x - k) \tag{4-32}$$
The corresponding distribution is a staircase function, and in the interval $(0, n)$ it is given by
$$F(x) = \sum_{k=0}^{m} \binom{n}{k} p^k q^{n-k} \qquad m \le x < m + 1 \tag{4-33}$$
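Examples 4-8 and 4-9 and the staircase (4-33) can be checked numerically. The following is a minimal sketch, assuming that the tabulated function $G(x)$ is evaluated through the error function as $G(x) = \tfrac{1}{2}[1 + \operatorname{erf}(x/\sqrt{2})]$ instead of being read from Table 3-1. The function names are illustrative.

```python
import math


def G(x):
    """Integral of the normal curve g(x) from -infinity to x, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normal_cdf(x, eta, sigma):
    """F(x) of an N(eta; sigma) RV, as in (4-28)."""
    return G((x - eta) / sigma)


def uniform_cdf(x, x1, x2):
    """Ramp distribution of an RV uniform between x1 and x2."""
    return min(max((x - x1) / (x2 - x1), 0.0), 1.0)


def binomial_cdf(x, n, p):
    """Staircase (4-33): sum of the binomial probabilities (4-31) for k <= x."""
    if x < 0:
        return 0.0
    m = min(math.floor(x), n)
    q = 1.0 - p
    return sum(math.comb(n, k) * p**k * q**(n - k) for k in range(m + 1))


if __name__ == "__main__":
    # Example 4-8: P{900 <= x <= 1050} for an N(1000; 50) RV.
    print(normal_cdf(1050, 1000, 50) - normal_cdf(900, 1000, 50))
    # Example 4-9: P{950 <= r <= 1050} for r uniform between 900 and 1100.
    print(uniform_cdf(1050, 900, 1100) - uniform_cdf(950, 900, 1100))
    # Binomial of order 3 with p = 0.5, evaluated inside the step 1 <= x < 2.
    print(binomial_cdf(1.5, 3, 0.5))
```

The first value agrees with the 0.819 of Example 4-8 to the three digits quoted there, and the second equals the 0.5 of Example 4-9.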
We note that, if $n$ is large, then [see (3-34)] $F(x)$ is close to an $N(np; \sqrt{npq})$ distribution. In other words,
$$F(x) \simeq G\!\left(\frac{x - np}{\sqrt{npq}}\right) \tag{4-34}$$

FIGURE 4-10

Example 4-10 Bernoulli trials. In the experiment of the $n$ tosses of a coin, an outcome is a sequence $\zeta_1 \cdots \zeta_n$ of $k$ heads and $n - k$ tails where $k = 0, \ldots, n$. We define the RV $\mathbf{x}$ such that
$$\mathbf{x}(\zeta_1 \cdots \zeta_n) = k$$
Thus $\mathbf{x}$ equals the number of heads. As we know [see (3-13)], the probability that $\mathbf{x} = k$ equals the right side of (4-31). Hence $\mathbf{x}$ has a binomial distribution.

Suppose that the coin is fair and it is tossed $n = 100$ times. We shall find the probability that $\mathbf{x}$ is between 40 and 60. In this case
$$p = q = 0.5 \qquad np = 50 \qquad \sqrt{npq} = 5$$
and (4-34) yields
$$P\{40 \le \mathbf{x} \le 60\} \simeq G\!\left(\frac{60 - 50}{5}\right) - G\!\left(\frac{40 - 50}{5}\right) = G(2) - G(-2) = 0.9545$$

Poisson. An RV $\mathbf{x}$ is Poisson distributed with parameter $a$ if it takes the values $0, 1, \ldots, n, \ldots$ with
$$P\{\mathbf{x} = k\} = e^{-a}\,\frac{a^k}{k!} \qquad k = 0, 1, \ldots \tag{4-35}$$
Thus $\mathbf{x}$ is of lattice type with density
$$f(x) = e^{-a}\sum_{k=0}^{\infty} \frac{a^k}{k!}\,\delta(x - k) \tag{4-36}$$
The corresponding distribution is a staircase function as in Fig. 4-10b.

With $p_k = P\{\mathbf{x} = k\}$, it follows from (4-35) that
$$\frac{p_{k-1}}{p_k} = \frac{e^{-a}a^{k-1}/(k-1)!}{e^{-a}a^k/k!} = \frac{k}{a}$$
If the above ratio is less than 1, that is, if $k < a$, then $p_{k-1} < p_k$. This shows that, as $k$ increases, $p_k$ increases, reaching its maximum for $k = [a]$. Hence
if $a < 1$, then $p_k$ is maximum for $k = 0$;
if $a > 1$ but it is not an integer, then $p_k$ increases as $k$ increases, reaching its maximum for $k = [a]$;
if $a$ is an integer, then $p_k$ is maximum for $k = a - 1$ and $k = a$.

Example 4-11 Poisson points. In the Poisson points experiment, an outcome $\zeta$ is a set of points $\mathbf{t}_i$ on the $t$ axis.
(a) Given a constant $t_0$, we define the RV $\mathbf{n}$ such that its value $\mathbf{n}(\zeta)$ equals the number of points $\mathbf{t}_i$ in the interval $(0, t_0)$. Clearly, $\mathbf{n} = k$ means that the number of points in the interval $(0, t_0)$ equals $k$. Hence [see (3-47)]
$$P\{\mathbf{n} = k\} = e^{-\lambda t_0}\,\frac{(\lambda t_0)^k}{k!} \tag{4-37}$$
Thus the number of Poisson points in an interval of length $t_0$ is a Poisson distributed RV with parameter $a = \lambda t_0$, where $\lambda$ is the density of the points.
(b) We denote by $\mathbf{t}_1$ the first random point to the right of the fixed point $t_0$, and we define the RV $\mathbf{x}$ as the distance from $t_0$ to $\mathbf{t}_1$ (Fig. 4-11a). From the definition it follows that $\mathbf{x}(\zeta) \ge 0$ for any $\zeta$. Hence the distribution function of $\mathbf{x}$ is 0 for $x < 0$. We maintain that for $x > 0$ it is given by
$$F(x) = 1 - e^{-\lambda x}$$
Proof. As we know, $F(x)$ equals the probability that $\mathbf{x} \le x$, where $x$ is a specific number. But $\mathbf{x} \le x$ means that there is at least one point between $t_0$ and $t_0 + x$. Hence $1 - F(x)$ equals the probability $p_0$ that there are no points in the interval $(t_0, t_0 + x)$. And since the length of this interval equals $x$, (4-37) yields
$$p_0 = e^{-\lambda x} = 1 - F(x)$$
The corresponding density
$$f(x) = \lambda e^{-\lambda x}\,U(x)$$
is called exponential (Fig. 4-11b).

FIGURE 4-11
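The location of the Poisson maximum and the exponential distribution of Example 4-11(b) can be verified with a short computation. The following is a minimal sketch with illustrative function names. The search for the maximum is cut off at a hypothetical bound `k_max`, which is assumed to be well above the parameter $a$.

```python
import math


def poisson_pmf(k, a):
    """p_k = e^{-a} a^k / k!, as in (4-35)."""
    return math.exp(-a) * a**k / math.factorial(k)


def poisson_modes(a, k_max=60):
    """Values of k in 0..k_max at which p_k is maximum (two values if a is an integer)."""
    probs = [poisson_pmf(k, a) for k in range(k_max + 1)]
    peak = max(probs)
    return [k for k, p in enumerate(probs) if math.isclose(p, peak, rel_tol=1e-12)]


def first_point_cdf(x, lam):
    """F(x) = 1 - e^{-lam x} for x > 0: distribution of the distance to the first point."""
    return 1.0 - math.exp(-lam * x) if x > 0 else 0.0


if __name__ == "__main__":
    print(poisson_modes(0.5))   # a < 1
    print(poisson_modes(3.5))   # a > 1, not an integer
    print(poisson_modes(4))     # a an integer
    # 1 - F(x) should equal the probability p_0 of no points in an interval of length x.
    lam, x = 2.0, 1.0
    print(1.0 - first_point_cdf(x, lam), poisson_pmf(0, lam * x))
```

The three calls to `poisson_modes` correspond to the three cases listed before Example 4-11.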
TABLE 4-1
Gamma. An RV $\mathbf{x}$ has a gamma distribution if
$$f(x) = \gamma\,x^{b-1}e^{-cx}\,U(x) \qquad \gamma = \frac{c^b}{\Gamma(b)} \tag{4-38}$$
In the above, $b$ and $c$ are positive numbers and
$$\Gamma(b + 1) = \int_0^{\infty} y^b e^{-y}\,dy \qquad b > -1 \tag{4-39}$$
is the gamma function. This function is also called the generalized factorial because $\Gamma(b + 1) = b\,\Gamma(b)$. If $b = n$ is an integer, $\Gamma(n + 1) = n\,\Gamma(n) = \cdots = n!$ because $\Gamma(1) = 1$. Furthermore,
$$\Gamma\!\left(\tfrac{1}{2}\right) = \int_0^{\infty} y^{-1/2}e^{-y}\,dy = 2\int_0^{\infty} e^{-z^2}\,dz = \sqrt{\pi}$$

The following densities are special cases of (4-38).

Erlang. If $b = n$ is an integer, the Erlang density results. With $n = 1$, we obtain the exponential density shown in Fig. 4-11.

Chi-square. For $b = n/2$ and $c = 1/2$, (4-38) yields
$$f(x) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\,x^{n/2 - 1}e^{-x/2}\,U(x) \tag{4-40}$$
This density is denoted by $\chi^2(n)$ and is called chi-square with $n$ degrees of freedom. It is used extensively in statistics.

In Table 4-1, we show a number of common densities. In the formulas of the various curves, a numerical factor is omitted. The omitted factor is determined from (4-18).

4-4 CONDITIONAL DISTRIBUTIONS

We recall that the probability of an event $\mathcal{A}$ assuming $\mathcal{M}$ is given by
$$P(\mathcal{A}\,|\,\mathcal{M}) = \frac{P(\mathcal{A}\mathcal{M})}{P(\mathcal{M})} \qquad \text{where } P(\mathcal{M}) \ne 0$$
The conditional distribution $F(x\,|\,\mathcal{M})$ of an RV $\mathbf{x}$, assuming $\mathcal{M}$, is defined as the conditional probability of the event $\{\mathbf{x} \le x\}$:
$$F(x\,|\,\mathcal{M}) = P\{\mathbf{x} \le x\,|\,\mathcal{M}\} = \frac{P\{\mathbf{x} \le x, \mathcal{M}\}}{P(\mathcal{M})} \tag{4-41}$$
In the above, $\{\mathbf{x} \le x, \mathcal{M}\}$ is the intersection of the events $\{\mathbf{x} \le x\}$ and $\mathcal{M}$, that is, the event consisting of all outcomes $\zeta$ such that $\mathbf{x}(\zeta) \le x$ and $\zeta \in \mathcal{M}$.
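Thus the definition of $F(x\,|\,\mathcal{M})$ is the same as the definition (4-1) of $F(x)$, provided that all probabilities are replaced by conditional probabilities. From this it follows (see Fundamental remark, Sec. 2-3) that $F(x\,|\,\mathcal{M})$ has the same properties as $F(x)$. In particular [see (4-3) and (4-8)]
$$F(\infty\,|\,\mathcal{M}) = 1 \qquad F(-\infty\,|\,\mathcal{M}) = 0 \tag{4-42}$$
$$P\{x_1 < \mathbf{x} \le x_2\,|\,\mathcal{M}\} = F(x_2\,|\,\mathcal{M}) - F(x_1\,|\,\mathcal{M}) = \frac{P\{x_1 < \mathbf{x} \le x_2, \mathcal{M}\}}{P(\mathcal{M})} \tag{4-43}$$
The conditional density $f(x\,|\,\mathcal{M})$ is the derivative of $F(x\,|\,\mathcal{M})$:
$$f(x\,|\,\mathcal{M}) = \frac{dF(x\,|\,\mathcal{M})}{dx} = \lim_{\Delta x \to 0} \frac{P\{x \le \mathbf{x} \le x + \Delta x\,|\,\mathcal{M}\}}{\Delta x} \tag{4-44}$$
This function is nonnegative and its area equals 1.

Example 4-12. We shall determine the conditional $F(x\,|\,\mathcal{M})$ of the RV $\mathbf{x}(f_i) = 10i$ of the fair-die experiment (Example 4-4), where $\mathcal{M} = \{f_2, f_4, f_6\}$ is the event "even."
If $x \ge 60$, then $\{\mathbf{x} \le x\}$ is the certain event and $\{\mathbf{x} \le x, \mathcal{M}\} = \mathcal{M}$. Hence (Fig. 4-12)
$$F(x\,|\,\mathcal{M}) = \frac{P(\mathcal{M})}{P(\mathcal{M})} = 1 \qquad x \ge 60$$
If $40 \le x < 60$, then $\{\mathbf{x} \le x, \mathcal{M}\} = \{f_2, f_4\}$. Hence
$$F(x\,|\,\mathcal{M}) = \frac{P\{f_2, f_4\}}{P(\mathcal{M})} = \frac{2/6}{3/6} \qquad 40 \le x < 60$$
If $20 \le x < 40$, then $\{\mathbf{x} \le x, \mathcal{M}\} = \{f_2\}$. Hence
$$F(x\,|\,\mathcal{M}) = \frac{P\{f_2\}}{P(\mathcal{M})} = \frac{1/6}{3/6} \qquad 20 \le x < 40$$
If $x < 20$, then $\{\mathbf{x} \le x, \mathcal{M}\} = \{\emptyset\}$. Hence
$$F(x\,|\,\mathcal{M}) = 0 \qquad x < 20$$

To find $F(x\,|\,\mathcal{M})$, we must, in general, know the underlying experiment. However, if $\mathcal{M}$ is an event that can be expressed in terms of the RV $\mathbf{x}$, then, for the determination of $F(x\,|\,\mathcal{M})$, knowledge of $F(x)$ is sufficient. The following two cases are important illustrations.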
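FIGURE 4-13

I. We wish to find the conditional distribution of an RV $\mathbf{x}$ assuming that $\mathbf{x} \le a$, where $a$ is a number such that $F(a) \ne 0$. This is a special case of (4-41) with
$$\mathcal{M} = \{\mathbf{x} \le a\}$$
Thus our problem is to find the function
$$F(x\,|\,\mathbf{x} \le a) = P\{\mathbf{x} \le x\,|\,\mathbf{x} \le a\} = \frac{P\{\mathbf{x} \le x, \mathbf{x} \le a\}}{P\{\mathbf{x} \le a\}}$$
If $x \ge a$, then $\{\mathbf{x} \le x, \mathbf{x} \le a\} = \{\mathbf{x} \le a\}$. Hence (Fig. 4-13)
$$F(x\,|\,\mathbf{x} \le a) = \frac{P\{\mathbf{x} \le a\}}{P\{\mathbf{x} \le a\}} = 1 \qquad x \ge a$$
If $x < a$, then $\{\mathbf{x} \le x, \mathbf{x} \le a\} = \{\mathbf{x} \le x\}$. Hence
$$F(x\,|\,\mathbf{x} \le a) = \frac{P\{\mathbf{x} \le x\}}{P\{\mathbf{x} \le a\}} = \frac{F(x)}{F(a)} \qquad x < a$$
Differentiating $F(x\,|\,\mathbf{x} \le a)$ with respect to $x$, we obtain the corresponding density: Since $F'(x) = f(x)$, the above yields
$$f(x\,|\,\mathbf{x} \le a) = \frac{f(x)}{F(a)} = \frac{f(x)}{\int_{-\infty}^{a} f(x)\,dx} \qquad \text{for } x < a \tag{4-45}$$
and it is 0 for $x \ge a$.

II. Suppose now that $\mathcal{M} = \{b < \mathbf{x} \le a\}$. In this case, (4-41) yields
$$F(x\,|\,b < \mathbf{x} \le a) = \frac{P\{\mathbf{x} \le x, b < \mathbf{x} \le a\}}{P\{b < \mathbf{x} \le a\}}$$
If $x \ge a$, then $\{\mathbf{x} \le x, b < \mathbf{x} \le a\} = \{b < \mathbf{x} \le a\}$. Hence
$$F(x\,|\,b < \mathbf{x} \le a) = \frac{F(a) - F(b)}{F(a) - F(b)} = 1 \qquad x \ge a$$
If $b \le x < a$, then $\{\mathbf{x} \le x, b < \mathbf{x} \le a\} = \{b < \mathbf{x} \le x\}$. Hence
$$F(x\,|\,b < \mathbf{x} \le a) = \frac{F(x) - F(b)}{F(a) - F(b)} \qquad b \le x < a$$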
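Finally, if $x < b$, then $\{\mathbf{x} \le x, b < \mathbf{x} \le a\} = \{\emptyset\}$. Hence
$$F(x\,|\,b < \mathbf{x} \le a) = 0 \qquad x < b$$
The corresponding density is given by
$$f(x\,|\,b < \mathbf{x} \le a) = \frac{f(x)}{F(a) - F(b)} \qquad \text{for } b \le x < a \tag{4-46}$$
and it is 0 otherwise (Fig. 4-14).

FIGURE 4-14

Example 4-13. We shall determine the conditional density $f(x\,|\,|\mathbf{x} - \eta| \le k\sigma)$ of an $N(\eta; \sigma)$ RV. Since
$$P\{|\mathbf{x} - \eta| \le k\sigma\} = P\{\eta - k\sigma \le \mathbf{x} \le \eta + k\sigma\} = G(k) - G(-k) = 2G(k) - 1$$
we conclude from (4-46) that
$$f(x\,|\,|\mathbf{x} - \eta| \le k\sigma) = \frac{1}{2G(k) - 1}\,\frac{1}{\sigma\sqrt{2\pi}}\,e^{-(x - \eta)^2/2\sigma^2}$$
for $x$ between $\eta - k\sigma$ and $\eta + k\sigma$, and 0 otherwise. This density is called truncated normal.

Frequency interpretation. In a sequence of $n$ trials, we reject all outcomes $\zeta$ such that $\mathbf{x}(\zeta) \le b$ or $\mathbf{x}(\zeta) > a$. In the subsequence of the remaining trials, $F(x\,|\,b < \mathbf{x} \le a)$ has the same frequency interpretation as $F(x)$ [see (4-3)].

Total Probability and Bayes' Theorem

We shall now extend the results of Sec. 2-3 to random variables.

1. Setting $\mathcal{B} = \{\mathbf{x} \le x\}$ in (2-36), we obtain
$$P\{\mathbf{x} \le x\} = P\{\mathbf{x} \le x\,|\,\mathcal{A}_1\}P(\mathcal{A}_1) + \cdots + P\{\mathbf{x} \le x\,|\,\mathcal{A}_n\}P(\mathcal{A}_n)$$
Hence [see (4-41) and (4-44)]
$$F(x) = F(x\,|\,\mathcal{A}_1)P(\mathcal{A}_1) + \cdots + F(x\,|\,\mathcal{A}_n)P(\mathcal{A}_n) \tag{4-47}$$
$$f(x) = f(x\,|\,\mathcal{A}_1)P(\mathcal{A}_1) + \cdots + f(x\,|\,\mathcal{A}_n)P(\mathcal{A}_n) \tag{4-48}$$
In the above, the events $\mathcal{A}_1, \ldots, \mathcal{A}_n$ form a partition of $\mathcal{S}$.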
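FIGURE 4-15

Example 4-14. Suppose that the RV $\mathbf{x}$ is such that $f(x\,|\,\mathcal{M})$ is $N(\eta_1; \sigma_1)$ and $f(x\,|\,\bar{\mathcal{M}})$ is $N(\eta_2; \sigma_2)$ as in Fig. 4-15. Clearly, the events $\mathcal{M}$ and $\bar{\mathcal{M}}$ form a partition of $\mathcal{S}$. Setting $\mathcal{A}_1 = \mathcal{M}$ and $\mathcal{A}_2 = \bar{\mathcal{M}}$ in (4-48), we conclude that
$$f(x) = p\,f(x\,|\,\mathcal{M}) + (1 - p)\,f(x\,|\,\bar{\mathcal{M}}) = \frac{p}{\sigma_1}\,g\!\left(\frac{x - \eta_1}{\sigma_1}\right) + \frac{1 - p}{\sigma_2}\,g\!\left(\frac{x - \eta_2}{\sigma_2}\right)$$
where $p = P(\mathcal{M})$.

2. From the identity
$$P(\mathcal{A}\,|\,\mathcal{B}) = \frac{P(\mathcal{B}\,|\,\mathcal{A})P(\mathcal{A})}{P(\mathcal{B})} \tag{4-49}$$
[see (2-38)] it follows that
$$P\{\mathcal{A}\,|\,\mathbf{x} \le x\} = \frac{P\{\mathbf{x} \le x\,|\,\mathcal{A}\}}{P\{\mathbf{x} \le x\}}\,P(\mathcal{A}) = \frac{F(x\,|\,\mathcal{A})}{F(x)}\,P(\mathcal{A}) \tag{4-50}$$

3. Setting $\mathcal{B} = \{x_1 < \mathbf{x} \le x_2\}$ in (4-49), we conclude with (4-43) that
$$P\{\mathcal{A}\,|\,x_1 < \mathbf{x} \le x_2\} = \frac{P\{x_1 < \mathbf{x} \le x_2\,|\,\mathcal{A}\}}{P\{x_1 < \mathbf{x} \le x_2\}}\,P(\mathcal{A}) = \frac{F(x_2\,|\,\mathcal{A}) - F(x_1\,|\,\mathcal{A})}{F(x_2) - F(x_1)}\,P(\mathcal{A}) \tag{4-51}$$

4. The conditional probability $P(\mathcal{A}\,|\,\mathbf{x} = x)$ of the event $\mathcal{A}$ assuming $\mathbf{x} = x$ cannot be defined as in (2-29) because, in general, $P\{\mathbf{x} = x\} = 0$. We shall define it as a limit
$$P\{\mathcal{A}\,|\,\mathbf{x} = x\} = \lim_{\Delta x \to 0} P\{\mathcal{A}\,|\,x < \mathbf{x} \le x + \Delta x\} \tag{4-52}$$
With $x_1 = x$, $x_2 = x + \Delta x$, we conclude from the above and (4-51) that
$$P\{\mathcal{A}\,|\,\mathbf{x} = x\} = \frac{f(x\,|\,\mathcal{A})}{f(x)}\,P(\mathcal{A}) \tag{4-53}$$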
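The mixture density of Example 4-14 and the limit formula (4-53) can be combined in a short numerical sketch. The code below is illustrative only: the parameter values are arbitrary, and the function names are not from the text. It evaluates $P\{\mathcal{M}\,|\,\mathbf{x} = x\}$ for the two-component normal mixture and checks, by a midpoint sum over an assumed range of $\pm 10$ standard deviations, that weighting this probability by $f(x)$ and integrating returns $p = P(\mathcal{M})$.

```python
import math


def g(x):
    """Normal curve g(x) with zero mean and unit standard deviation."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)


def normal_pdf(x, eta, sigma):
    """Density (4-27) of an N(eta; sigma) RV."""
    return g((x - eta) / sigma) / sigma


def mixture_pdf(x, p, eta1, sigma1, eta2, sigma2):
    """f(x) = p f(x|M) + (1 - p) f(x|not M), as in Example 4-14."""
    return p * normal_pdf(x, eta1, sigma1) + (1.0 - p) * normal_pdf(x, eta2, sigma2)


def posterior_of_M(x, p, eta1, sigma1, eta2, sigma2):
    """P{M | x = x} = f(x|M) P(M) / f(x), as in (4-53)."""
    return p * normal_pdf(x, eta1, sigma1) / mixture_pdf(x, p, eta1, sigma1, eta2, sigma2)


def integrate(func, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of func over (lo, hi)."""
    h = (hi - lo) / steps
    return h * sum(func(lo + (i + 0.5) * h) for i in range(steps))


if __name__ == "__main__":
    params = (0.3, -1.0, 1.0, 1.0, 1.0)   # p, eta1, sigma1, eta2, sigma2 (arbitrary)
    print(posterior_of_M(0.0, *params))
    total = integrate(
        lambda x: posterior_of_M(x, *params) * mixture_pdf(x, *params), -10.0, 10.0
    )
    print(total)   # should be close to p = 0.3
```

With equal weights and components placed symmetrically about the origin, the posterior at $x = 0$ is 0.5, which serves as a quick sanity check of (4-53).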
Total probability theorem. As we know [see (4-42)]
$$F(\infty\,|\,\mathcal{A}) = \int_{-\infty}^{\infty} f(x\,|\,\mathcal{A})\,dx = 1$$
Multiplying (4-53) by $f(x)$ and integrating, we obtain
$$\int_{-\infty}^{\infty} P(\mathcal{A}\,|\,\mathbf{x} = x)\,f(x)\,dx = P(\mathcal{A}) \tag{4-54}$$
This is the continuous version of the total probability theorem (2-36).

Bayes' theorem. From (4-53) and (4-54) it follows that
$$f(x\,|\,\mathcal{A}) = \frac{P(\mathcal{A}\,|\,\mathbf{x} = x)}{P(\mathcal{A})}\,f(x) = \frac{P(\mathcal{A}\,|\,\mathbf{x} = x)\,f(x)}{\int_{-\infty}^{\infty} P(\mathcal{A}\,|\,\mathbf{x} = x)\,f(x)\,dx} \tag{4-55}$$
This is the continuous version of Bayes' theorem (2-39).

Example 4-15. Suppose that the probability of heads in a coin-tossing experiment is not a number, but an RV $\mathbf{p}$ with density $f(p)$ defined in some space $\mathcal{S}_c$. The experiment of the toss of a randomly selected coin is a cartesian product $\mathcal{S}_c \times \mathcal{S}$. In this experiment, the event $\mathcal{H} = \{\text{head}\}$ consists of all pairs of the form $\zeta_c h$, where $\zeta_c$ is any element of $\mathcal{S}_c$ and $h$ is the element heads of the space $\mathcal{S} = \{h, t\}$. We shall show that
$$P(\mathcal{H}) = \int_0^1 p\,f(p)\,dp \tag{4-56}$$
Proof. The conditional probability of $\mathcal{H}$ assuming $\mathbf{p} = p$ is the probability of heads if the coin with $\mathbf{p} = p$ is tossed. In other words,
$$P\{\mathcal{H}\,|\,\mathbf{p} = p\} = p \tag{4-57}$$
Inserting into (4-54), we obtain (4-56) because $f(p) = 0$ outside the interval (0, 1).

PROBLEMS

4-1. Suppose that $x_u$ is the $u$ percentile of the RV $\mathbf{x}$, that is, $F(x_u) = u$. Show that if $f(-x) = f(x)$, then $x_{1-u} = -x_u$.
4-2. Show that if $f(x)$ is symmetrical about the point $x = \eta$ and $P\{\eta - a < \mathbf{x} < \eta + a\} = 1 - \alpha$, then $a = \eta - x_{\alpha/2} = x_{1-\alpha/2} - \eta$.
4-3. (a) Using Table 3-1 and linear interpolation, find the $z_u$ percentile of the $N(0; 1)$ RV $\mathbf{z}$ for $u = 0.9$, 0.925, 0.95, 0.975, and 0.99. (b) The RV $\mathbf{x}$ is $N(\eta; \sigma)$. Express its $x_u$ percentiles in terms of $z_u$.
4-4. The RV $\mathbf{x}$ is $N(\eta; \sigma)$ and $P\{\eta - k\sigma < \mathbf{x} < \eta + k\sigma\} = p_k$. (a) Find $p_k$ for $k = 1$, 2, and 3. (b) Find $k$ for $p_k = 0.9$, 0.99, and 0.999. (c) If $P\{\eta - z_u\sigma < \mathbf{x} < \eta + z_u\sigma\} = \gamma$, express $z_u$ in terms of $\gamma$.
4-5. Find $x_u$ for $u = 0.1, 0.2, \ldots, 0.9$ (a) if $\mathbf{x}$ is uniform in the interval (0, 1); (b) if $f(x) = 2e^{-2x}U(x)$.
4-6. We measure the resistance $\mathbf{R}$ of each resistor in a production line and we accept only the units the resistance of which is between 96 and 104 ohms. Find the percentage of the accepted units (a) if $\mathbf{R}$ is uniform between 95 and 105 ohms; (b) if $\mathbf{R}$ is normal with $\eta = 100$ and $\sigma = 2$ ohms.
4-7. Show that if the RV $\mathbf{x}$ has an Erlang density with $n = 2$, then $F_x(x) = (1 - e^{-cx} - cxe^{-cx})U(x)$.
4-8. The RV $\mathbf{x}$ is $N(10; 1)$. Find $f(x\,|\,(\mathbf{x} - 10)^2 < 4)$.
4-9. Find $f(x)$ if $F(x) = (1 - e^{-\alpha x})U(x - c)$.
4-10. If $\mathbf{x}$ is $N(0; 2)$ find (a) $P\{1 \le \mathbf{x} \le 2\}$ and (b) $P\{1 \le \mathbf{x} \le 2\,|\,\mathbf{x} \ge 1\}$.
4-11. The space $\mathcal{S}$ consists of all points $t_i$ in the interval (0, 1) and $P\{0 \le t_i \le y\} = y$ for every $y \le 1$. The function $G(x)$ is increasing from $G(-\infty) = 0$ to $G(\infty) = 1$; hence it has an inverse $G^{(-1)}(y) = H(y)$. The RV $\mathbf{x}$ is such that $\mathbf{x}(t_i) = H(t_i)$. Show that $F_x(x) = G(x)$.
4-12. If $\mathbf{x}$ is $N(1000; 20)$ find (a) $P\{\mathbf{x} < 1024\}$, (b) $P\{\mathbf{x} < 1024\,|\,\mathbf{x} > 961\}$, and (c) $P\{31 < \sqrt{\mathbf{x}} \le 32\}$.
4-13. A fair coin is tossed three times and the RV $\mathbf{x}$ equals the total number of heads. Find and sketch $F_x(x)$ and $f_x(x)$.
4-14. A fair coin is tossed 900 times and the RV $\mathbf{x}$ equals the total number of heads. (a) Find $f_x(x)$: 1. exactly; 2. approximately using (4-34). (b) Find $P\{435 \le \mathbf{x} \le 460\}$.
4-15. Show that, if $a \le \mathbf{x}(\zeta) \le b$ for every $\zeta \in \mathcal{S}$, then $F(x) = 1$ for $x > b$ and $F(x) = 0$ for $x < a$.
4-16. Show that if $\mathbf{x}(\zeta) \le \mathbf{y}(\zeta)$ for every $\zeta \in \mathcal{S}$, then $F_x(w) \ge F_y(w)$ for every $w$.
4-17. Show that if $\beta(t) = f(t\,|\,\mathbf{x} > t)$ is the conditional failure rate of the RV $\mathbf{x}$ and $\beta(t) = kt$, then $f(x)$ is a Rayleigh density (see also Sec. 7-3).
4-18. Show that $P(\mathcal{A}) = P(\mathcal{A}\,|\,\mathbf{x} \le x)F(x) + P(\mathcal{A}\,|\,\mathbf{x} > x)[1 - F(x)]$.
4-19. Show that
$$F_x(x\,|\,\mathcal{A}) = \frac{P(\mathcal{A}\,|\,\mathbf{x} \le x)\,F_x(x)}{P(\mathcal{A})}$$
4-20. Show that if $P(\mathcal{A}\,|\,\mathbf{x} = x) = P(\mathcal{B}\,|\,\mathbf{x} = x)$ for every $x \le x_0$, then $P(\mathcal{A}\,|\,\mathbf{x} \le x_0) = P(\mathcal{B}\,|\,\mathbf{x} \le x_0)$. Hint: Replace in (4-54) $P(\mathcal{A})$ and $f(x)$ by $P(\mathcal{A}\,|\,\mathbf{x} \le x_0)$ and $f(x\,|\,\mathbf{x} \le x_0)$.
4-21. The probability of heads of a random coin is an RV $\mathbf{p}$ uniform in the interval (0, 1). (a) Find $P\{0.3 \le \mathbf{p} \le 0.7\}$. (b) The coin is tossed 10 times and heads shows 6 times.
Find the a posteriori probability that $\mathbf{p}$ is between 0.3 and 0.7.
4-22. The probability of heads of a random coin is an RV $\mathbf{p}$ uniform in the interval (0.4, 0.6). (a) Find the probability that at the next tossing of the coin heads will show. (b) The coin is tossed 100 times and heads shows 60 times. Find the probability that at the next tossing heads will show.
CHAPTER 5
FUNCTIONS OF ONE RANDOM VARIABLE

5-1 THE RANDOM VARIABLE g(x)

Suppose that $\mathbf{x}$ is an RV and $g(x)$ is a function of the real variable $x$. The expression
$$\mathbf{y} = g(\mathbf{x})$$
is a new RV defined as follows: For a given $\zeta$, $\mathbf{x}(\zeta)$ is a number and $g[\mathbf{x}(\zeta)]$ is another number specified in terms of $\mathbf{x}(\zeta)$ and $g(x)$. This number is the value $\mathbf{y}(\zeta) = g[\mathbf{x}(\zeta)]$ assigned to the RV $\mathbf{y}$. Thus a function of an RV $\mathbf{x}$ is a composite function $\mathbf{y} = g(\mathbf{x}) = g[\mathbf{x}(\zeta)]$ with domain the set $\mathcal{S}$ of experimental outcomes.

The distribution function $F_y(y)$ of the RV so formed is the probability of the event $\{\mathbf{y} \le y\}$ consisting of all outcomes $\zeta$ such that $\mathbf{y}(\zeta) = g[\mathbf{x}(\zeta)] \le y$. Thus
$$F_y(y) = P\{\mathbf{y} \le y\} = P\{g(\mathbf{x}) \le y\} \tag{5-1}$$
For a specific $y$, the values of $x$ such that $g(x) \le y$ form a set on the $x$ axis denoted by $R_y$. Clearly, $g[\mathbf{x}(\zeta)] \le y$ if $\mathbf{x}(\zeta)$ is a number in the set $R_y$. Hence
$$F_y(y) = P\{\mathbf{x} \in R_y\} \tag{5-2}$$
The above leads to the conclusion that for $g(\mathbf{x})$ to be an RV, the function $g(x)$ must have the following properties:
1. Its domain must include the range of the RV $\mathbf{x}$.
2. It must be a Baire function, that is, for every $y$, the set $R_y$ such that $g(x) \le y$ must consist of the union and intersection of a countable number of intervals. Only then is $\{\mathbf{y} \le y\}$ an event.
3. The events $\{g(\mathbf{x}) = \pm\infty\}$ must have zero probability.

5-2 THE DISTRIBUTION OF g(x)

We shall express the distribution function $F_y(y)$ of the RV $\mathbf{y} = g(\mathbf{x})$ in terms of the distribution function $F_x(x)$ of the RV $\mathbf{x}$ and the function $g(x)$. For this purpose, we must determine the set $R_y$ of the $x$ axis such that $g(x) \le y$, and the probability that $\mathbf{x}$ is in this set. The method will be illustrated with several examples. Unless otherwise stated, it will be assumed that $F_x(x)$ is continuous.

1. We start with the function $g(x)$ in Fig. 5-1. As we see from the figure, $g(x)$ is between $a$ and $b$ for any $x$. This leads to the conclusion that if $y \ge b$, then $g(x) \le y$ for every $x$, hence $P\{\mathbf{y} \le y\} = 1$; if $y < a$, then there is no $x$ such that $g(x) \le y$, hence $P\{\mathbf{y} \le y\} = 0$. Thus
$$F_y(y) = \begin{cases} 1 & y \ge b \\ 0 & y < a \end{cases}$$
With $x_1$ and $y_1 = g(x_1)$ as shown, we observe that $g(x) \le y_1$ for $x \le x_1$. Hence
$$F_y(y_1) = P\{\mathbf{x} \le x_1\} = F_x(x_1)$$
We finally note that
$$g(x) \le y_2 \quad \text{if } x \le x_2' \text{ or if } x_2'' \le x \le x_2'''$$
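Hence
$$F_y(y_2) = P\{\mathbf{x} \le x_2'\} + P\{x_2'' \le \mathbf{x} \le x_2'''\} = F_x(x_2') + F_x(x_2''') - F_x(x_2'')$$
because the events $\{\mathbf{x} \le x_2'\}$ and $\{x_2'' \le \mathbf{x} \le x_2'''\}$ are mutually exclusive.

Example 5-1.
$$\mathbf{y} = a\mathbf{x} + b$$
To find $F_y(y)$, we must find the values of $x$ such that $ax + b \le y$.
(a) If $a > 0$, then $ax + b \le y$ for $x \le (y - b)/a$ (Fig. 5-2a). Hence
$$F_y(y) = P\left\{\mathbf{x} \le \frac{y - b}{a}\right\} = F_x\!\left(\frac{y - b}{a}\right) \qquad a > 0$$
(b) If $a < 0$, then $ax + b \le y$ for $x \ge (y - b)/a$ (Fig. 5-2b). Hence
$$F_y(y) = P\left\{\mathbf{x} \ge \frac{y - b}{a}\right\} = 1 - F_x\!\left(\frac{y - b}{a}\right) \qquad a < 0$$

Example 5-2.
$$\mathbf{y} = \mathbf{x}^2$$
If $y \ge 0$, then $x^2 \le y$ for $-\sqrt{y} \le x \le \sqrt{y}$ (Fig. 5-3a). Hence
$$F_y(y) = P\{-\sqrt{y} \le \mathbf{x} \le \sqrt{y}\} = F_x(\sqrt{y}) - F_x(-\sqrt{y}) \qquad y > 0$$

FIGURE 5-3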
If $y < 0$, then there are no values of $x$ such that $x^2 \le y$. Hence
$$F_y(y) = P\{\emptyset\} = 0 \qquad y < 0$$
Special case. If $\mathbf{x}$ is uniform in the interval $(-1, 1)$, then
$$F_x(x) = \frac{1}{2} + \frac{x}{2} \qquad |x| < 1$$
(Fig. 5-3b). Hence
$$F_y(y) = \sqrt{y} \quad \text{for } 0 \le y \le 1 \qquad \text{and} \qquad F_y(y) = \begin{cases} 1 & y > 1 \\ 0 & y < 0 \end{cases}$$

2. Suppose now that the function $g(x)$ is constant in an interval $(x_0, x_1)$:
$$g(x) = y_1 \qquad x_0 < x \le x_1$$
In this case
$$P\{\mathbf{y} = y_1\} = P\{x_0 < \mathbf{x} \le x_1\} = F_x(x_1) - F_x(x_0) \tag{5-3}$$
Hence $F_y(y)$ is discontinuous at $y = y_1$ and its discontinuity equals $F_x(x_1) - F_x(x_0)$.

Example 5-3. Consider the function (Fig. 5-4)
$$g(x) = 0 \quad \text{for } -c \le x \le c \qquad \text{and} \qquad g(x) = \begin{cases} x - c & x > c \\ x + c & x < -c \end{cases}$$
In this case, $F_y(y)$ is discontinuous for $y = 0$ and its discontinuity equals $F_x(c) - F_x(-c)$. Furthermore,
If $y \ge 0$ then $P\{\mathbf{y} \le y\} = P\{\mathbf{x} \le y + c\} = F_x(y + c)$
If $y < 0$ then $P\{\mathbf{y} \le y\} = P\{\mathbf{x} \le y - c\} = F_x(y - c)$

FIGURE 5-4
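The distribution functions of Examples 5-2 and 5-3 can be written directly in terms of $F_x$. The following is a minimal sketch, assuming $\mathbf{x}$ uniform in $(-1, 1)$ as in the special case above and an arbitrary dead-zone width $c = 0.5$. The function names are illustrative.

```python
import math


def uniform_cdf(x):
    """F_x(x) = 1/2 + x/2 for |x| < 1, clipped to 0 and 1 outside (-1, 1)."""
    return min(max(0.5 + x / 2.0, 0.0), 1.0)


def square_cdf(y, F_x):
    """F_y(y) for y = x**2 (Example 5-2): F_x(sqrt(y)) - F_x(-sqrt(y)), and 0 for y < 0."""
    if y < 0:
        return 0.0
    r = math.sqrt(y)
    return F_x(r) - F_x(-r)


def dead_zone_cdf(y, F_x, c):
    """F_y(y) for the function g(x) of Example 5-3, which equals 0 on (-c, c)."""
    return F_x(y + c) if y >= 0 else F_x(y - c)


if __name__ == "__main__":
    # Special case of Example 5-2: F_y(y) should equal sqrt(y) on (0, 1).
    for y in (0.04, 0.25, 0.81):
        print(y, square_cdf(y, uniform_cdf), math.sqrt(y))
    # Example 5-3: the jump at y = 0 should equal F_x(c) - F_x(-c).
    c = 0.5
    jump = dead_zone_cdf(0.0, uniform_cdf, c) - dead_zone_cdf(-1e-12, uniform_cdf, c)
    print(jump, uniform_cdf(c) - uniform_cdf(-c))
```

Passing $F_x$ as an argument keeps the two transformations independent of the particular distribution of $\mathbf{x}$, which mirrors the way (5-2) is stated.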
Example 5-4 Limiter. The curve $g(x)$ of Fig. 5-5 is constant for $x \le -b$ and $x \ge b$, and in the interval $(-b, b)$ it is a straight line. With $\mathbf{y} = g(\mathbf{x})$, it follows that $F_y(y)$ is discontinuous for $y = g(-b) = -b$ and $y = g(b) = b$ respectively. Furthermore,
If $y \ge b$ then $g(x) \le y$ for every $x$; hence $F_y(y) = 1$
If $-b \le y < b$ then $g(x) \le y$ for $x \le y$; hence $F_y(y) = F_x(y)$
If $y < -b$ then $g(x) \le y$ for no $x$; hence $F_y(y) = 0$

3. We assume next that $g(x)$ is a staircase function
$$g(x) = g(x_i) = y_i \qquad x_{i-1} < x \le x_i$$
In this case, the RV $\mathbf{y} = g(\mathbf{x})$ is of discrete type taking the values $y_i$ with
$$P\{\mathbf{y} = y_i\} = P\{x_{i-1} < \mathbf{x} \le x_i\} = F_x(x_i) - F_x(x_{i-1})$$

Example 5-5 Hard limiter. If
$$g(x) = \begin{cases} 1 & x > 0 \\ -1 & x \le 0 \end{cases}$$
then $\mathbf{y}$ takes the values $\pm 1$ with
$$P\{\mathbf{y} = -1\} = P\{\mathbf{x} \le 0\} = F_x(0) \qquad P\{\mathbf{y} = 1\} = P\{\mathbf{x} > 0\} = 1 - F_x(0)$$
Hence $F_y(y)$ is a staircase function as in Fig. 5-6.

FIGURE 5-6
FIGURE 5-7

Example 5-6 Quantization. If
$$g(x) = ns \qquad (n - 1)s < x \le ns$$
then $\mathbf{y}$ takes the values $y_n = ns$ with
$$P\{\mathbf{y} = ns\} = P\{(n - 1)s < \mathbf{x} \le ns\} = F_x(ns) - F_x(ns - s)$$

4. We assume, finally, that the function $g(x)$ is discontinuous at $x = x_0$ and such that
$$g(x) < g(x_0^-) \quad \text{for } x < x_0 \qquad g(x) > g(x_0^+) \quad \text{for } x > x_0$$
In this case, if $y$ is between $g(x_0^-)$ and $g(x_0^+)$, then $g(x) \le y$ for $x \le x_0$. Hence
$$F_y(y) = P\{\mathbf{x} \le x_0\} = F_x(x_0) \qquad g(x_0^-) \le y \le g(x_0^+)$$

Example 5-7. Suppose that
$$g(x) = \begin{cases} x + c & x \ge 0 \\ x - c & x < 0 \end{cases}$$
(Fig. 5-7). Thus $g(x)$ is discontinuous for $x = 0$ with $g(0^-) = -c$ and $g(0^+) = c$. Hence $F_y(y) = F_x(0)$ for $|y| \le c$. Furthermore,
If $y \ge c$ then $g(x) \le y$ for $x \le y - c$; hence $F_y(y) = F_x(y - c)$
If $-c \le y \le c$ then $g(x) \le y$ for $x < 0$; hence $F_y(y) = F_x(0)$
If $y \le -c$ then $g(x) \le y$ for $x \le y + c$; hence $F_y(y) = F_x(y + c)$

Example 5-8. The function $g(x)$ in Fig. 5-8 equals 0 in the interval $(-c, c)$ and it is discontinuous for $x = \pm c$ with $g(c^+) = c$, $g(c^-) = 0$, $g(-c^-) = -c$, $g(-c^+) = 0$. Hence $F_y(y)$ is discontinuous for $y = 0$ and it is constant for $0 \le y \le c$ and $-c \le y \le 0$. Thus
If $y \ge c$ then $g(x) \le y$ for $x \le y$; hence $F_y(y) = F_x(y)$
If $0 \le y < c$ then $g(x) \le y$ for $x < c$; hence $F_y(y) = F_x(c)$
If $-c \le y < 0$ then $g(x) \le y$ for $x \le -c$; hence $F_y(y) = F_x(-c)$
If $y < -c$ then $g(x) \le y$ for $x \le y$; hence $F_y(y) = F_x(y)$

FIGURE 5-8

5. We now assume that the RV $\mathbf{x}$ is of discrete type taking the values $x_k$ with probability $p_k$. In this case, the RV $\mathbf{y} = g(\mathbf{x})$ is also of discrete type taking the values $y_k = g(x_k)$.
If $y_k = g(x)$ for only one $x = x_k$, then
$$P\{\mathbf{y} = y_k\} = P\{\mathbf{x} = x_k\} = p_k$$
If, however, $y_k = g(x)$ for $x = x_k$ and $x = x_l$, then
$$P\{\mathbf{y} = y_k\} = P\{\mathbf{x} = x_k\} + P\{\mathbf{x} = x_l\} = p_k + p_l$$

Example 5-9.
$$\mathbf{y} = \mathbf{x}^2$$
(a) If $\mathbf{x}$ takes the values $1, 2, \ldots, 6$ with probability 1/6, then $\mathbf{y}$ takes the values $1^2, 2^2, \ldots, 6^2$ with probability 1/6.
(b) If, however, $\mathbf{x}$ takes the values $-2, -1, 0, 1, 2, 3$ with probability 1/6, then $\mathbf{y}$ takes the values 0, 1, 4, 9 with probabilities 1/6, 2/6, 2/6, 1/6 respectively.

Determination of $f_y(y)$

We wish to determine the density of $\mathbf{y} = g(\mathbf{x})$ in terms of the density of $\mathbf{x}$. Suppose, first, that the set $R$ of the $y$ axis is not in the range of the function $g(x)$, that is, that $g(x)$ is not a point of $R$ for any $x$. In this case, the probability that $g(\mathbf{x})$ is in $R$ equals 0. Hence $f_y(y) = 0$ for $y \in R$. It suffices, therefore, to consider the values of $y$ such that for some $x$, $g(x) = y$.
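The rule of item 5 for discrete-type RVs, illustrated by Example 5-9, amounts to adding the probabilities $p_k$ of all values $x_k$ that $g$ maps to the same $y$. The following is a minimal sketch with an illustrative helper name, using exact fractions.

```python
from collections import defaultdict
from fractions import Fraction


def pmf_of_function(pmf_x, g):
    """Probabilities of y = g(x): sum p_k over every x_k with g(x_k) equal to the same y."""
    pmf_y = defaultdict(Fraction)   # missing keys start at Fraction(0)
    for x_k, p_k in pmf_x.items():
        pmf_y[g(x_k)] += p_k
    return dict(pmf_y)


def square(x):
    return x * x


# Example 5-9(a): x takes the values 1, ..., 6, each with probability 1/6.
pmf_a = pmf_of_function({k: Fraction(1, 6) for k in range(1, 7)}, square)

# Example 5-9(b): x takes the values -2, -1, 0, 1, 2, 3, each with probability 1/6.
pmf_b = pmf_of_function({k: Fraction(1, 6) for k in range(-2, 4)}, square)

if __name__ == "__main__":
    print(sorted(pmf_a.items()))
    print(sorted(pmf_b.items()))
```

In case (b) the values $\pm 1$ and $\pm 2$ merge, which is why $y = 1$ and $y = 4$ each receive probability 2/6.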
FIGURE 5-9

FUNDAMENTAL THEOREM. To find $f_y(y)$ for a specific $y$, we solve the equation $y = g(x)$. Denoting its real roots by $x_n$,
$$y = g(x_1) = \cdots = g(x_n) = \cdots \tag{5-4}$$
we shall show that
$$f_y(y) = \frac{f_x(x_1)}{|g'(x_1)|} + \cdots + \frac{f_x(x_n)}{|g'(x_n)|} + \cdots \tag{5-5}$$
where $g'(x)$ is the derivative of $g(x)$.

Proof. To avoid generalities, we assume that the equation $y = g(x)$ has three roots as in Fig. 5-9. As we know
$$f_y(y)\,dy = P\{y < \mathbf{y} \le y + dy\}$$
It suffices, therefore, to find the set of values $x$ such that $y < g(x) \le y + dy$ and the probability that $\mathbf{x}$ is in this set. As we see from the figure, this set consists of the following three intervals
$$x_1 < x < x_1 + dx_1 \qquad x_2 + dx_2 < x < x_2 \qquad x_3 < x < x_3 + dx_3$$
where $dx_1 > 0$, $dx_3 > 0$ but $dx_2 < 0$. From the above it follows that
$$P\{y < \mathbf{y} < y + dy\} = P\{x_1 < \mathbf{x} < x_1 + dx_1\} + P\{x_2 + dx_2 < \mathbf{x} < x_2\} + P\{x_3 < \mathbf{x} < x_3 + dx_3\}$$
The right side equals the shaded area in Fig. 5-9. Since
$$P\{x_1 < \mathbf{x} < x_1 + dx_1\} = f_x(x_1)\,dx_1 \qquad dx_1 = dy/g'(x_1)$$
$$P\{x_2 + dx_2 < \mathbf{x} < x_2\} = f_x(x_2)\,|dx_2| \qquad dx_2 = dy/g'(x_2)$$
$$P\{x_3 < \mathbf{x} < x_3 + dx_3\} = f_x(x_3)\,dx_3 \qquad dx_3 = dy/g'(x_3)$$
we conclude that
$$f_y(y)\,dy = \frac{f_x(x_1)}{g'(x_1)}\,dy + \frac{f_x(x_2)}{|g'(x_2)|}\,dy + \frac{f_x(x_3)}{g'(x_3)}\,dy$$
and (5-5) results.

We note, finally, that if $g(x) = y_1 = $ constant for every $x$ in the interval $(x_0, x_1)$, then [see (5-3)] $F_y(y)$ is discontinuous for $y = y_1$. Hence $f_y(y)$ contains an impulse $\delta(y - y_1)$ of area $F_x(x_1) - F_x(x_0)$.

Conditional density. The conditional density $f_y(y \mid \mathcal{M})$ of the RV $\mathbf{y} = g(\mathbf{x})$ assuming an event $\mathcal{M}$ is given by (5-5) if on the right side we replace the terms $f_x(x_i)$ by $f_x(x_i \mid \mathcal{M})$ (see, for example, Prob. 5-17).

Illustrations

We give next several applications of (5-2) and (5-5).

1. $y = ax + b \qquad g'(x) = a$
The equation $y = ax + b$ has a single solution $x = (y - b)/a$ for every $y$. Hence
$$f_y(y) = \frac{1}{|a|}\, f_x\!\left(\frac{y - b}{a}\right) \tag{5-6}$$
Special case. If $\mathbf{x}$ is uniform in the interval $(x_1, x_2)$, then $\mathbf{y}$ is uniform in the interval $(ax_1 + b, ax_2 + b)$.

Example 5-10. Suppose that the voltage $\mathbf{v}$ is an RV given by
$$\mathbf{v} = i(\mathbf{r} + r_0)$$
where $i = 0.01$ A and $r_0 = 1000\ \Omega$. If the resistance $\mathbf{r}$ is an RV uniform between 900 and 1100 $\Omega$, then $\mathbf{v}$ is uniform between 19 and 21 V.

2. $y = \dfrac{1}{x} \qquad g'(x) = -\dfrac{1}{x^2}$
The equation $y = 1/x$ has a single solution $x = 1/y$. Hence
$$f_y(y) = \frac{1}{y^2}\, f_x\!\left(\frac{1}{y}\right) \tag{5-7}$$
Special case. If $\mathbf{x}$ has a Cauchy density with parameter $\alpha$,
$$f_x(x) = \frac{\alpha/\pi}{x^2 + \alpha^2} \qquad \text{then} \qquad f_y(y) = \frac{1/\alpha\pi}{y^2 + 1/\alpha^2}$$
is also a Cauchy density with parameter $1/\alpha$.
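The first two illustrations can be checked numerically with a direct transcription of (5-5). The sketch below is not from the original text: the function names, the convention that the caller supplies the list of real roots, and the test value $\alpha = 2$ are all assumptions chosen for illustration.

```python
import math

def density_of_function(f_x, roots, g_prime, y):
    """Evaluate (5-5): the sum of f_x(x_n) / |g'(x_n)| over the real roots of y = g(x).

    roots(y) must return every real solution x_n of y = g(x); the caller
    supplies it because root finding depends on the particular g.
    """
    return sum(f_x(x_n) / abs(g_prime(x_n)) for x_n in roots(y))

def uniform_density(lo, hi):
    return lambda x: 1.0 / (hi - lo) if lo < x < hi else 0.0

def cauchy_density(alpha):
    return lambda x: (alpha / math.pi) / (x * x + alpha * alpha)

# Example 5-10: v = i (r + r0) = a r + b with a = i and b = i r0.
i, r0 = 0.01, 1000.0

def f_v(v):
    return density_of_function(
        uniform_density(900.0, 1100.0),
        roots=lambda y: [(y - i * r0) / i],
        g_prime=lambda x: i,
        y=v,
    )

# Illustration 2: y = 1/x applied to a Cauchy density with parameter alpha.
alpha = 2.0

def f_recip(y):
    return density_of_function(
        cauchy_density(alpha),
        roots=lambda t: [1.0 / t],
        g_prime=lambda x: -1.0 / (x * x),
        y=y,
    )

print(f_v(20.0))                                      # density of v inside (19, 21)
print(f_recip(0.7), cauchy_density(1 / alpha)(0.7))   # the two values should agree
```

Inside (19, 21) the density of the voltage is constant, and outside it vanishes because the single root falls outside the support of the resistance.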
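Example 5-11. Suppose that the resistance $\mathbf{r}$ is uniform between 900 and 1100 $\Omega$ as in Fig. 5-10. We shall determine the density of the corresponding conductance
$$\mathbf{g} = 1/\mathbf{r}$$
Since $f_r(r) = 1/200$ S for $r$ between 900 and 1100, it follows from (5-7) that
$$f_g(g) = \frac{1}{200\,g^2} \qquad \frac{1}{1100} < g < \frac{1}{900}$$
and 0 elsewhere.

3. $y = ax^2 \qquad a > 0 \qquad g'(x) = 2ax$
If $y < 0$, then the equation $y = ax^2$ has no real solutions; hence $f_y(y) = 0$. If $y > 0$, then it has two solutions
$$x_1 = \sqrt{\frac{y}{a}} \qquad x_2 = -\sqrt{\frac{y}{a}}$$
and (5-5) yields
$$f_y(y) = \frac{1}{2a\sqrt{y/a}}\left[f_x\!\left(\sqrt{\frac{y}{a}}\right) + f_x\!\left(-\sqrt{\frac{y}{a}}\right)\right] \qquad y > 0 \tag{5-8}$$
We note that $F_y(y) = 0$ for $y < 0$ and
$$F_y(y) = P\left\{-\sqrt{\frac{y}{a}} \le \mathbf{x} \le \sqrt{\frac{y}{a}}\right\} = F_x\!\left(\sqrt{\frac{y}{a}}\right) - F_x\!\left(-\sqrt{\frac{y}{a}}\right) \qquad y > 0$$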
Example 5-12. The voltage across a resistor is an RV $\mathbf{e}$ uniform between 5 and 10 V. We shall determine the density of the power
$$\mathbf{w} = \frac{\mathbf{e}^2}{r} \qquad r = 1000\ \Omega$$
dissipated in $r$.
Since $f_e(e) = 1/5$ for $e$ between 5 and 10 and 0 elsewhere, we conclude from (5-8) with $a = 1/r$ that
$$f_w(w) = \sqrt{\frac{10}{w}} \qquad \frac{1}{40} < w < \frac{1}{10}$$
and 0 elsewhere.

Special case. Suppose that
$$f_x(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2} \qquad \mathbf{y} = \mathbf{x}^2$$
With $a = 1$, it follows from (5-8) and the evenness of $f_x(x)$ that (Fig. 5-11)
$$f_y(y) = \frac{1}{\sqrt{y}}\,f_x(\sqrt{y})\,U(y) = \frac{1}{\sqrt{2\pi y}}\,e^{-y/2}\,U(y)$$
We have thus shown that if $\mathbf{x}$ is an $N(0, 1)$ RV, the RV $\mathbf{y} = \mathbf{x}^2$ has a chi-square distribution with one degree of freedom [see (4-40)].

4. $y = \sqrt{x} \qquad g'(x) = \dfrac{1}{2\sqrt{x}}$
The equation $y = \sqrt{x}$ has a single solution $x = y^2$ for $y > 0$ and no solution for $y < 0$. Hence
$$f_y(y) = 2y\,f_x(y^2)\,U(y) \tag{5-9}$$
The chi density. Suppose that $\mathbf{x}$ has a chi-square density as in (4-40), and $\mathbf{y} = \sqrt{\mathbf{x}}$. In this case, (5-9) yields
$$f_y(y) = \frac{2}{2^{n/2}\,\Gamma(n/2)}\,y^{n-1} e^{-y^2/2}\,U(y) \tag{5-10}$$
This function is called the chi density with $n$ degrees of freedom. The following cases are of special interest.
Maxwell. For $n = 3$, (5-10) yields the Maxwell density
$$f_y(y) = \sqrt{2/\pi}\;y^2 e^{-y^2/2}\,U(y)$$
Rayleigh. For $n = 2$, we obtain the Rayleigh density $f_y(y) = y\,e^{-y^2/2}\,U(y)$.

5. $y = xU(x) \qquad g'(x) = U(x)$
Clearly, $f_y(y) = 0$ and $F_y(y) = 0$ for $y < 0$ (Fig. 5-12). If $y > 0$, then the
equation $y = xU(x)$ has a single solution $x_1 = y$. Hence
$$f_y(y) = f_x(y) \qquad F_y(y) = F_x(y) \qquad y > 0$$
Thus $F_y(y)$ is discontinuous at $y = 0$ with discontinuity $F_y(0^+) - F_y(0^-) = F_x(0)$. Hence
$$f_y(y) = f_x(y)\,U(y) + F_x(0)\,\delta(y)$$

6. $y = e^x \qquad g'(x) = e^x$
If $y > 0$, then the equation $y = e^x$ has the single solution $x = \ln y$. Hence
$$f_y(y) = \frac{1}{y}\,f_x(\ln y) \qquad y > 0$$
If $y < 0$, then $f_y(y) = 0$.
Special case. If $\mathbf{x}$ is $N(\eta; \sigma)$, then
$$f_y(y) = \frac{1}{\sigma y\sqrt{2\pi}}\,e^{-(\ln y - \eta)^2/2\sigma^2} \tag{5-11}$$
This density is called lognormal (see Table 4-1).

7. $y = a\sin(x + \theta) \qquad a > 0$
If $|y| > a$, then the equation $y = a\sin(x + \theta)$ has no solutions; hence $f_y(y) = 0$. If $|y| < a$, then it has infinitely many solutions (Fig. 5-13a)
$$x_n = \arcsin\frac{y}{a} - \theta \qquad n = \ldots, -1, 0, 1, \ldots$$
Since $|g'(x_n)| = |a\cos(x_n + \theta)| = \sqrt{a^2 - y^2}$, (5-5) yields
$$f_y(y) = \frac{1}{\sqrt{a^2 - y^2}}\sum_{n=-\infty}^{\infty} f_x(x_n) \qquad |y| < a \tag{5-12}$$
Special case. Suppose that $\mathbf{x}$ is uniform in the interval $(-\pi, \pi)$. In this case, the equation $y = a\sin(x + \theta)$ has exactly two solutions in the interval $(-\pi, \pi)$ for any $\theta$ (Fig. 5-14). The function $f_x(x)$ equals $1/2\pi$ for these two values and it equals 0 for any $x_n$ outside the interval $(-\pi, \pi)$. Retaining the
two nonzero terms in (5-12), we obtain
$$f_y(y) = \frac{2}{2\pi\sqrt{a^2 - y^2}} = \frac{1}{\pi\sqrt{a^2 - y^2}} \qquad |y| < a \tag{5-13}$$
To find $F_y(y)$, we observe that $\mathbf{y} \le y$ if $\mathbf{x}$ is either between $-\pi$ and $x_0$ or between $x_1$ and $\pi$ (Fig. 5-13a). Since the total length of the two intervals equals $\pi + 2x_0 + 2\theta$, we conclude, dividing by $2\pi$, that
$$F_y(y) = \frac{1}{2} + \frac{1}{\pi}\arcsin\frac{y}{a} \qquad |y| < a \tag{5-14}$$
We note that although $f_y(\pm a) = \infty$, the probability that $\mathbf{y} = \pm a$ is 0.

Smooth phase. If the density $f_x(x)$ of $\mathbf{x}$ is sufficiently smooth so that it can be approximated by a constant in any interval of length $2\pi$ (see Fig. 5-13b), then
$$\pi\sum_{n=-\infty}^{\infty} f_x(x_n) \simeq \int_{-\infty}^{\infty} f_x(x)\,dx = 1$$
because in each interval of length $2\pi$ the above sum has two terms. Inserting into (5-12), we conclude that the density of $\mathbf{y}$ is given approximately by (5-13).

FIGURE 5-14
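The distribution (5-14) can be checked without random sampling by counting grid points. The sketch below is an illustration added here, not part of the original text: an evenly spaced grid on $(-\pi, \pi)$ stands in for the uniform RV, and the values of `a` and `theta` are arbitrary assumptions.

```python
import math

def arcsine_cdf(y, a):
    """F_y(y) from (5-14) for |y| < a."""
    return 0.5 + math.asin(y / a) / math.pi

def grid_cdf(y, a, theta, n=200_000):
    """Fraction of an evenly spaced grid on (-pi, pi) with a*sin(x + theta) <= y.

    The grid stands in for an RV x that is uniform in (-pi, pi).
    """
    step = 2.0 * math.pi / n
    hits = 0
    for k in range(n):
        x = -math.pi + (k + 0.5) * step      # midpoint of the k-th cell
        if a * math.sin(x + theta) <= y:
            hits += 1
    return hits / n

a, theta = 2.0, 0.6
for y in (-1.5, 0.0, 0.5, 1.9):
    print(y, grid_cdf(y, a, theta), arcsine_cdf(y, a))
```

The two columns should agree to within a few grid cells, and the result should not depend on the choice of `theta`, as (5-14) indicates.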
FIGURE 5-15

Example 5-13. A particle leaves the origin under the influence of the force of gravity and its initial velocity $v$ forms an angle $\boldsymbol{\varphi}$ with the horizontal axis. The path of the particle reaches the ground at a distance
$$\mathbf{d} = \frac{v^2}{g}\sin 2\boldsymbol{\varphi}$$
from the origin (Fig. 5-15). Assuming that $\boldsymbol{\varphi}$ is an RV uniform between 0 and $\pi/2$, we shall determine: (a) the density of $\mathbf{d}$ and (b) the probability that $\mathbf{d} \le d_0$.

Solution. (a) Clearly,
$$\mathbf{d} = a\sin\mathbf{x} \qquad a = v^2/g$$
where the RV $\mathbf{x} = 2\boldsymbol{\varphi}$ is uniform between 0 and $\pi$. If $0 < d < a$, then the equation $d = a\sin x$ has exactly two solutions in the interval $(0, \pi)$. Reasoning as in (5-13), we obtain
$$f_d(d) = \frac{2}{\pi\sqrt{a^2 - d^2}} \qquad 0 < d < a$$
and 0 otherwise.
(b) The probability that $\mathbf{d} \le d_0$ equals the shaded area in Fig. 5-15:
$$P\{\mathbf{d} \le d_0\} = F_d(d_0) = \frac{2}{\pi}\arcsin\frac{d_0}{a}$$

8. $y = \tan x$
The equation $y = \tan x$ has infinitely many solutions for any $y$ (Fig. 5-16a)
$$x_n = \arctan y \qquad n = \ldots, -1, 0, 1, \ldots$$
Since $g'(x) = 1/\cos^2 x = 1 + y^2$, (5-5) yields
$$f_y(y) = \frac{1}{1 + y^2}\sum_{n=-\infty}^{\infty} f_x(x_n) \tag{5-15}$$
Special case. If $\mathbf{x}$ is uniform in the interval $(-\pi/2, \pi/2)$, then the term $f_x(x_1)$ in (5-15) equals $1/\pi$ and all others are 0 (Fig. 5-16b). Hence $\mathbf{y}$ has a
Cauchy density
$$f_y(y) = \frac{1/\pi}{1 + y^2} \tag{5-16}$$

FIGURE 5-16

As we see from the figure, $\mathbf{y} \le y$ if $\mathbf{x}$ is between $-\pi/2$ and $x_1$. Since the length of this interval equals $x_1 + \pi/2$, we conclude, dividing by $\pi$, that
$$F_y(y) = \frac{1}{\pi}\left(x_1 + \frac{\pi}{2}\right) = \frac{1}{2} + \frac{1}{\pi}\arctan y \tag{5-17}$$

Example 5-14. A particle leaves the origin in a free motion as in Fig. 5-17 crossing the vertical line $x = d$ at
$$\mathbf{y} = d\tan\boldsymbol{\varphi}$$
Assuming that the angle $\boldsymbol{\varphi}$ is uniform in the interval $(-\theta, \theta)$, we conclude as in (5-16) that
$$f_y(y) = \frac{d/2\theta}{d^2 + y^2} \qquad |y| < d\tan\theta$$
and 0 otherwise.

FIGURE 5-17
THE INVERSE PROBLEM. In the preceding discussion, we were given an RV $\mathbf{x}$ with known distribution $F_x(x)$ and a function $g(x)$ and we determined the distribution $F_y(y)$ of the RV $\mathbf{y} = g(\mathbf{x})$. We consider now the inverse problem: We are given the distribution of $\mathbf{x}$ and we wish to find a function $g(x)$ such that the distribution of the RV $\mathbf{y} = g(\mathbf{x})$ equals a specified function $F_y(y)$. This topic is developed further in Sec. 8-5. We start with two special cases.

From $F_x(x)$ to a uniform distribution. Given an RV $\mathbf{x}$ with distribution $F_x(x)$, we wish to find a function $g(x)$ such that the RV $\mathbf{u} = g(\mathbf{x})$ is uniformly distributed in the interval (0, 1). We maintain that $g(x) = F_x(x)$, that is,
$$\text{if} \quad \mathbf{u} = F_x(\mathbf{x}) \quad \text{then} \quad F_u(u) = u \quad \text{for} \quad 0 \le u \le 1 \tag{5-18}$$
Proof. Suppose that $x$ is an arbitrary number and $u = F_x(x)$. From the monotonicity of $F_x(x)$ it follows that $\mathbf{u} \le u$ iff $\mathbf{x} \le x$. Hence
$$F_u(u) = P\{\mathbf{u} \le u\} = P\{\mathbf{x} \le x\} = F_x(x) = u$$
and (5-18) results.

The RV $\mathbf{u}$ can be considered as the output of a nonlinear memoryless system (Fig. 5-18) with input $\mathbf{x}$ and transfer characteristic $F_x(x)$. Therefore if we use $\mathbf{u}$ as the input to another system with transfer characteristic the inverse $F_x^{(-1)}(u)$ of the function $u = F_x(x)$, the resulting output will equal $\mathbf{x}$:
$$\text{If} \quad \mathbf{x} = F_x^{(-1)}(\mathbf{u}) \quad \text{then} \quad P\{\mathbf{x} \le x\} = F_x(x)$$

From uniform to $F_y(y)$. Given an RV $\mathbf{u}$ with uniform distribution in the interval (0, 1), we wish to find a function $g(u)$ such that the distribution of the RV $\mathbf{y} = g(\mathbf{u})$ is a specified function $F_y(y)$. We maintain that $g(u)$ is the inverse of the function $u = F_y(y)$:
$$\text{If} \quad \mathbf{y} = F_y^{(-1)}(\mathbf{u}) \quad \text{then} \quad P\{\mathbf{y} \le y\} = F_y(y) \tag{5-19}$$

FIGURE 5-18
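As a numerical illustration of (5-19), the sketch below maps an evenly spaced grid on (0, 1) through the inverse of a target distribution and compares the resulting fractions with that target. The exponential target $F_y(y) = (1 - e^{-cy})U(y)$, the value of `c`, and all function names are assumptions chosen for illustration; they do not come from the original text.

```python
import math

def exponential_cdf(y, c):
    """Target distribution F_y(y) = (1 - exp(-c y)) U(y)."""
    return 1.0 - math.exp(-c * y) if y > 0 else 0.0

def exponential_inverse_cdf(u, c):
    """Inverse of u = F_y(y) for 0 < u < 1."""
    return -math.log(1.0 - u) / c

def fraction_below(y0, c, n=100_000):
    """Fraction of grid points u_k in (0, 1) whose image F_y^(-1)(u_k) is <= y0.

    The evenly spaced midpoints u_k stand in for samples of a uniform RV u.
    """
    count = sum(
        1 for k in range(n)
        if exponential_inverse_cdf((k + 0.5) / n, c) <= y0
    )
    return count / n

c = 2.0
for y0 in (0.1, 0.5, 1.0, 2.0):
    print(y0, fraction_below(y0, c), exponential_cdf(y0, c))
```

In practice the grid would be replaced by pseudorandom numbers; the grid is used here only so that the comparison is deterministic.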
Proof. The RV $\mathbf{u}$ in (5-19) is uniform and the function $F_x(x)$ is arbitrary. Replacing $F_x(x)$ by $F_y(y)$, we obtain (5-19) (see also Fig. 5-18).

From $F_x(x)$ to $F_y(y)$. We consider, finally, the general case: Given $F_x(x)$ and $F_y(y)$, find $g(x)$ such that the distribution of $\mathbf{y} = g(\mathbf{x})$ equals $F_y(y)$. To solve this problem, we form the RV $\mathbf{u} = F_x(\mathbf{x})$ as in (5-18) and the RV $\mathbf{y} = F_y^{(-1)}(\mathbf{u})$ as in (5-19). Combining the two, we conclude:
$$\text{If} \quad \mathbf{y} = F_y^{(-1)}(F_x(\mathbf{x})) \quad \text{then} \quad P\{\mathbf{y} \le y\} = F_y(y) \tag{5-20}$$

5-3 MEAN AND VARIANCE

The expected value or mean of an RV $\mathbf{x}$ is by definition the integral
$$E\{\mathbf{x}\} = \int_{-\infty}^{\infty} x f(x)\,dx \tag{5-21}$$
This number will also be denoted by $\eta_x$ or $\eta$.

Example 5-15. If $\mathbf{x}$ is uniform in the interval $(x_1, x_2)$, then $f(x) = 1/(x_2 - x_1)$ in this interval. Hence
$$E\{\mathbf{x}\} = \frac{1}{x_2 - x_1}\int_{x_1}^{x_2} x\,dx = \frac{x_1 + x_2}{2}$$

We note that, if the vertical line $x = a$ is an axis of symmetry of $f(x)$ then $E\{\mathbf{x}\} = a$; in particular, if $f(-x) = f(x)$, then $E\{\mathbf{x}\} = 0$. In the above example, $f(x)$ is symmetrical about the line $x = (x_1 + x_2)/2$.

Discrete type. For discrete type RVs the integral in (5-21) can be written as a sum. Indeed, suppose that $\mathbf{x}$ takes the values $x_i$ with probability $p_i$. In this case [see (4-15)]
$$f(x) = \sum_i p_i\,\delta(x - x_i)$$
Inserting into (5-21) and using the identity
$$\int_{-\infty}^{\infty} x\,\delta(x - x_i)\,dx = x_i$$
we obtain
$$E\{\mathbf{x}\} = \sum_i p_i x_i \qquad p_i = P\{\mathbf{x} = x_i\} \tag{5-22}$$

Example 5-16. If $\mathbf{x}$ takes the values $1, 2, \ldots, 6$ with probability 1/6, then
$$E\{\mathbf{x}\} = \tfrac{1}{6}(1 + 2 + \cdots + 6) = 3.5$$
FIGURE 5-19

Conditional mean. The conditional mean of an RV $\mathbf{x}$ assuming an event $\mathcal{M}$ is given by the integral in (5-21) if $f(x)$ is replaced by the conditional density $f(x \mid \mathcal{M})$:
$$E\{\mathbf{x} \mid \mathcal{M}\} = \int_{-\infty}^{\infty} x f(x \mid \mathcal{M})\,dx \tag{5-23}$$
For discrete type RVs the above yields
$$E\{\mathbf{x} \mid \mathcal{M}\} = \sum_i x_i\,P\{\mathbf{x} = x_i \mid \mathcal{M}\} \tag{5-24}$$

Example 5-17. With $\mathcal{M} = \{\mathbf{x} \ge a\}$, it follows from (5-23) that
$$E\{\mathbf{x} \mid \mathbf{x} \ge a\} = \int_{-\infty}^{\infty} x f(x \mid \mathbf{x} \ge a)\,dx = \frac{\int_a^{\infty} x f(x)\,dx}{\int_a^{\infty} f(x)\,dx}$$

Lebesgue integral. The mean of an RV can be interpreted as a Lebesgue integral. This interpretation is important in mathematics but it will not be used in our development. We make, therefore, only a passing reference:
We divide the $x$ axis into intervals $(x_k, x_{k+1})$ of length $\Delta x$ as in Fig. 5-19a. If $\Delta x$ is small, then the Riemann integral in (5-21) can be approximated by a sum
$$\int_{-\infty}^{\infty} x f(x)\,dx \simeq \sum_{k=-\infty}^{\infty} x_k f(x_k)\,\Delta x \tag{5-25}$$
And since $f(x_k)\,\Delta x \simeq P\{x_k < \mathbf{x} < x_k + \Delta x\}$, we conclude that
$$E\{\mathbf{x}\} \simeq \sum_{k=-\infty}^{\infty} x_k\,P\{x_k < \mathbf{x} < x_k + \Delta x\}$$
In the above, the sets $\{x_k < \mathbf{x} < x_k + \Delta x\}$ are differential events specified in terms of the RV $\mathbf{x}$, and their union is the space $\mathcal{S}$ (Fig. 5-19b). Hence, to find $E\{\mathbf{x}\}$, we multiply the probability of each differential event by the corresponding value of $\mathbf{x}$ and sum over all $k$. The resulting limit as $\Delta x \to 0$ is written in the form
$$E\{\mathbf{x}\} = \int_{\mathcal{S}} \mathbf{x}\,dP$$
and is called the Lebesgue integral of $\mathbf{x}$.
Frequency interpretation. We maintain that the arithmetic average $\bar{x}$ of the observed values $x_i$ of $\mathbf{x}$ tends to the integral in (5-21) as $n \to \infty$:
$$\bar{x} = \frac{x_1 + \cdots + x_n}{n} \to E\{\mathbf{x}\} \tag{5-26}$$
Proof. We denote by $\Delta n_k$ the number of $x_i$'s that are between $z_k$ and $z_k + \Delta x = z_{k+1}$. From this it follows that
$$x_1 + \cdots + x_n \simeq \sum_k z_k\,\Delta n_k$$
And since $f(z_k)\,\Delta x \simeq \Delta n_k/n$ [see (4-23)] we conclude that
$$\bar{x} \simeq \frac{1}{n}\sum_k z_k\,\Delta n_k = \sum_k z_k f(z_k)\,\Delta x \simeq \int_{-\infty}^{\infty} x f(x)\,dx$$
and (5-26) results.

We shall use the above to express the mean of $\mathbf{x}$ in terms of its distribution. From the construction of Fig. 5-20a it follows readily that $\bar{x}$ equals the area under the empirical percentile curve of $\mathbf{x}$. Thus
$$\bar{x} = (BCD) - (OAB)$$
where $(BCD)$ and $(OAB)$ are the shaded areas above and below the $u$ axis respectively. These areas equal the corresponding areas of Fig. 5-20b; hence
$$\bar{x} = \int_0^{\infty} [1 - F_n(x)]\,dx - \int_{-\infty}^{0} F_n(x)\,dx$$
where $F_n(x)$ is the empirical distribution of $\mathbf{x}$. With $n \to \infty$ this yields
$$E\{\mathbf{x}\} = \int_0^{\infty} R(x)\,dx - \int_{-\infty}^{0} F(x)\,dx \qquad R(x) = 1 - F(x) \tag{5-27}$$

FIGURE 5-20
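Formula (5-27) is easy to check numerically for a distribution whose mean is known. The following sketch is an illustration added here under stated assumptions: it uses a normal distribution with arbitrary parameters $\eta = 1.5$, $\sigma = 2$, a midpoint rule, and a truncation limit chosen so that the neglected tails are negligible. None of these choices come from the original text.

```python
import math

def normal_cdf(x, eta, sigma):
    """Distribution function of an N(eta, sigma) RV, written with math.erf."""
    return 0.5 * (1.0 + math.erf((x - eta) / (sigma * math.sqrt(2.0))))

def mean_from_cdf(cdf, limit, n=200_000):
    """Approximate (5-27) with the midpoint rule on (-limit, 0) and (0, limit).

    limit must be large enough that the tails beyond it are negligible.
    """
    h = limit / n
    upper = sum(1.0 - cdf((k + 0.5) * h) for k in range(n)) * h   # integral of R(x), x > 0
    lower = sum(cdf(-(k + 0.5) * h) for k in range(n)) * h        # integral of F(x), x < 0
    return upper - lower

eta, sigma = 1.5, 2.0
estimate = mean_from_cdf(lambda x: normal_cdf(x, eta, sigma), limit=40.0)
print(estimate)   # should be close to eta
```

The same function applies to any distribution function that can be evaluated pointwise, which is the practical appeal of (5-27): no density is needed.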
FIGURE 5-21

Mean of $g(\mathbf{x})$. Given an RV $\mathbf{x}$ and a function $g(x)$, we form the RV $\mathbf{y} = g(\mathbf{x})$. As we see from (5-21), the mean of this RV is given by
$$E\{\mathbf{y}\} = \int_{-\infty}^{\infty} y f_y(y)\,dy \tag{5-28}$$
It appears, therefore, that to determine the mean of $\mathbf{y}$, we must find its density $f_y(y)$. This, however, is not necessary. As the next basic theorem shows, $E\{\mathbf{y}\}$ can be expressed directly in terms of the function $g(x)$ and the density $f_x(x)$ of $\mathbf{x}$.

THEOREM
$$E\{g(\mathbf{x})\} = \int_{-\infty}^{\infty} g(x) f_x(x)\,dx \tag{5-29}$$
Proof. We shall sketch a proof using the curve $g(x)$ of Fig. 5-21. With $y = g(x_1) = g(x_2) = g(x_3)$ as in the figure, we see that
$$f_y(y)\,dy = f_x(x_1)\,dx_1 + f_x(x_2)\,dx_2 + f_x(x_3)\,dx_3$$
Multiplying by $y$, we obtain
$$y f_y(y)\,dy = g(x_1) f_x(x_1)\,dx_1 + g(x_2) f_x(x_2)\,dx_2 + g(x_3) f_x(x_3)\,dx_3$$
Thus to each differential in (5-28) there correspond one or more differentials in (5-29). As $dy$ covers the $y$ axis, the corresponding $dx$'s are nonoverlapping and they cover the entire $x$ axis. Hence the integrals in (5-28) and (5-29) are equal.

If $\mathbf{x}$ is of discrete type as in (5-22), then (5-29) yields
$$E\{g(\mathbf{x})\} = \sum_i g(x_i)\,P\{\mathbf{x} = x_i\} \tag{5-30}$$

Example 5-18. With $x_0$ an arbitrary number and $g(x)$ as in Fig. 5-22, (5-29) yields
$$E\{g(\mathbf{x})\} = \int_{-\infty}^{x_0} f_x(x)\,dx = F_x(x_0)$$
This shows that the distribution function of an RV can be expressed as expected value.
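FIGURE 5-22 ($g(x)$ with $E\{g(\mathbf{x})\} = F_x(x_0)$)

Example 5-19. In this example, we show that the probability of any event $\mathcal{A}$ can be expressed as expected value. For this purpose we form the zero-one RV $\mathbf{x}_{\mathcal{A}}$ associated with the event $\mathcal{A}$:
$$\mathbf{x}_{\mathcal{A}}(\zeta) = \begin{cases} 1 & \zeta \in \mathcal{A} \\ 0 & \zeta \notin \mathcal{A} \end{cases}$$
Since this RV takes the values 1 and 0 with respective probabilities $P(\mathcal{A})$ and $P(\bar{\mathcal{A}})$, (5-22) yields
$$E\{\mathbf{x}_{\mathcal{A}}\} = 1 \times P(\mathcal{A}) + 0 \times P(\bar{\mathcal{A}}) = P(\mathcal{A})$$

Linearity. From (5-29) it follows that
$$E\{a_1 g_1(\mathbf{x}) + \cdots + a_n g_n(\mathbf{x})\} = a_1 E\{g_1(\mathbf{x})\} + \cdots + a_n E\{g_n(\mathbf{x})\} \tag{5-31}$$
In particular, $E\{a\mathbf{x} + b\} = aE\{\mathbf{x}\} + b$.

Complex RVs. If $\mathbf{z} = \mathbf{x} + j\mathbf{y}$ is a complex RV, then its expected value is by definition
$$E\{\mathbf{z}\} = E\{\mathbf{x}\} + jE\{\mathbf{y}\}$$
From this and (5-29) it follows that if
$$g(\mathbf{x}) = g_1(\mathbf{x}) + jg_2(\mathbf{x})$$
is a complex function of the real RV $\mathbf{x}$ then
$$E\{g(\mathbf{x})\} = \int_{-\infty}^{\infty} g_1(x) f(x)\,dx + j\int_{-\infty}^{\infty} g_2(x) f(x)\,dx = \int_{-\infty}^{\infty} g(x) f(x)\,dx \tag{5-32}$$
In other words, (5-29) holds even if $g(x)$ is complex.

Variance. The variance of an RV $\mathbf{x}$ is by definition the integral
$$\sigma^2 = \int_{-\infty}^{\infty} (x - \eta)^2 f(x)\,dx \tag{5-33}$$
where $\eta = E\{\mathbf{x}\}$. The positive constant $\sigma$, denoted also by $\sigma_x$, is called the standard deviation of $\mathbf{x}$.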
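From the definition it follows that $\sigma^2$ is the mean of the RV $(\mathbf{x} - \eta)^2$. Thus
$$\sigma^2 = E\{(\mathbf{x} - \eta)^2\} = E\{\mathbf{x}^2 - 2\mathbf{x}\eta + \eta^2\} = E\{\mathbf{x}^2\} - 2\eta E\{\mathbf{x}\} + \eta^2$$
Hence
$$\sigma^2 = E\{\mathbf{x}^2\} - E^2\{\mathbf{x}\} \tag{5-34}$$

Example 5-20. If $\mathbf{x}$ is uniform in the interval $(-c, c)$, then $\eta = 0$ and
$$\sigma^2 = E\{\mathbf{x}^2\} = \frac{1}{2c}\int_{-c}^{c} x^2\,dx = \frac{c^2}{3}$$

Example 5-21. We have written the density of a normal RV in the form
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-(x - \eta)^2/2\sigma^2}$$
where up to now $\eta$ and $\sigma^2$ were two arbitrary constants. We show next that $\eta$ is indeed the mean of $\mathbf{x}$ and $\sigma^2$ its variance.
Proof. Clearly, $f(x)$ is symmetrical about the line $x = \eta$; hence $E\{\mathbf{x}\} = \eta$. Furthermore,
$$\int_{-\infty}^{\infty} e^{-(x - \eta)^2/2\sigma^2}\,dx = \sigma\sqrt{2\pi}$$
because the area of $f(x)$ equals 1. Differentiating with respect to $\sigma$, we obtain
$$\int_{-\infty}^{\infty} \frac{(x - \eta)^2}{\sigma^3}\,e^{-(x - \eta)^2/2\sigma^2}\,dx = \sqrt{2\pi}$$
Multiplying both sides by $\sigma^2/\sqrt{2\pi}$, we conclude that $E\{(\mathbf{x} - \eta)^2\} = \sigma^2$ and the proof is complete.

Discrete type. If the RV $\mathbf{x}$ is of discrete type, then
$$\sigma^2 = \sum_i p_i (x_i - \eta)^2 \qquad p_i = P\{\mathbf{x} = x_i\} \tag{5-35}$$

Example 5-22. The RV $\mathbf{x}$ takes the values 1 and 0 with probabilities $p$ and $q = 1 - p$ respectively. In this case
$$E\{\mathbf{x}\} = 1 \times p + 0 \times q = p \qquad E\{\mathbf{x}^2\} = 1^2 \times p + 0^2 \times q = p$$
Hence
$$\sigma^2 = E\{\mathbf{x}^2\} - E^2\{\mathbf{x}\} = p - p^2 = pq$$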
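For discrete RVs, the definition (5-35) and the shortcut (5-34) can be compared directly. The sketch below is illustrative only: the dictionary layout, the function names, and the choice $p = 1/4$ are assumptions, and exact fractions are used so that the two expressions can be compared without rounding.

```python
from fractions import Fraction

def mean(pmf):
    """E{x} from (5-22) for a discrete RV given as {value: probability}."""
    return sum(p * x for x, p in pmf.items())

def variance_direct(pmf):
    """Variance from the definition (5-35)."""
    eta = mean(pmf)
    return sum(p * (x - eta) ** 2 for x, p in pmf.items())

def variance_from_moments(pmf):
    """Variance from (5-34): E{x^2} - E^2{x}."""
    second = sum(p * x * x for x, p in pmf.items())
    return second - mean(pmf) ** 2

# Example 5-22 with p = 1/4, and the fair die of Example 5-16.
p = Fraction(1, 4)
zero_one = {1: p, 0: 1 - p}
die = {k: Fraction(1, 6) for k in range(1, 7)}

print(variance_direct(zero_one), variance_from_moments(zero_one), p * (1 - p))
print(mean(die), variance_direct(die), variance_from_moments(die))
```

With floating-point data, (5-34) can lose accuracy when the mean is large compared with the standard deviation, so the direct form (5-35) is usually the safer choice in numerical work.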
Example 5-23. A Poisson distributed RV with parameter $a$ takes the values $0, 1, \ldots$ with probabilities
$$P\{\mathbf{x} = k\} = e^{-a}\frac{a^k}{k!}$$
We shall show that its mean and variance equal $a$:
$$E\{\mathbf{x}\} = a \qquad E\{\mathbf{x}^2\} = a^2 + a \qquad \sigma^2 = a \tag{5-36}$$
Proof. We differentiate twice the Taylor expansion of $e^a$:
$$e^a = \sum_{k=0}^{\infty} \frac{a^k}{k!} \qquad e^a = \sum_{k=1}^{\infty} k\frac{a^{k-1}}{k!} = \frac{1}{a}\sum_{k=1}^{\infty} k\frac{a^k}{k!}$$
$$e^a = \sum_{k=2}^{\infty} k(k-1)\frac{a^{k-2}}{k!} = \frac{1}{a^2}\sum_{k=1}^{\infty} k^2\frac{a^k}{k!} - \frac{1}{a^2}\sum_{k=1}^{\infty} k\frac{a^k}{k!}$$
Hence
$$E\{\mathbf{x}\} = e^{-a}\sum_{k=1}^{\infty} k\frac{a^k}{k!} = a \qquad E\{\mathbf{x}^2\} = e^{-a}\sum_{k=1}^{\infty} k^2\frac{a^k}{k!} = a^2 + a$$
and (5-36) results.

Poisson points. As we have shown in (3-47), the number $\mathbf{n}$ of Poisson points in an interval of length $t_0$ is a Poisson distributed RV with parameter $a = \lambda t_0$. From this it follows that
$$E\{\mathbf{n}\} = \lambda t_0 \qquad \sigma_n^2 = \lambda t_0 \tag{5-37}$$
This shows that the density $\lambda$ of Poisson points equals the expected number of points per unit time.

Notes 1. The variance $\sigma^2$ of an RV $\mathbf{x}$ is a measure of the concentration of $\mathbf{x}$ near its mean $\eta$. Its relative frequency interpretation (empirical estimate) is the average of $(x_i - \eta)^2$:
$$\sigma^2 \simeq \frac{1}{n}\sum_i (x_i - \eta)^2$$
where $x_i$ are the observed values of $\mathbf{x}$. This average can be used as the estimate of $\sigma^2$ only if $\eta$ is known. If it is unknown, we replace it by its estimate $\bar{x}$ and we change $n$ to $n - 1$. This yields the estimate
$$\sigma^2 \simeq \frac{1}{n - 1}\sum_i (x_i - \bar{x})^2 \qquad \bar{x} = \frac{1}{n}\sum_i x_i$$
known as the sample variance of $\mathbf{x}$ [see (8-64)]. The reason for changing $n$ to $n - 1$ is explained later.
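2. A simpler measure of the concentration of $\mathbf{x}$ near $\eta$ is the first absolute central moment $M = E\{|\mathbf{x} - \eta|\}$. Its empirical estimate is the average of $|x_i - \eta|$:
$$M \simeq \frac{1}{n}\sum_i |x_i - \eta|$$
If $\eta$ is unknown, it is replaced by $\bar{x}$. This estimate avoids the computation of squares.

5-4 MOMENTS

The following quantities are of interest in the study of RVs:

Moments
$$m_n = E\{\mathbf{x}^n\} = \int_{-\infty}^{\infty} x^n f(x)\,dx \tag{5-38}$$
Central moments
$$\mu_n = E\{(\mathbf{x} - \eta)^n\} = \int_{-\infty}^{\infty} (x - \eta)^n f(x)\,dx \tag{5-39}$$
Absolute moments
$$E\{|\mathbf{x}|^n\} \qquad E\{|\mathbf{x} - \eta|^n\} \tag{5-40}$$
Generalized moments
$$E\{(\mathbf{x} - a)^n\} \qquad E\{|\mathbf{x} - a|^n\} \tag{5-41}$$
We note that
$$\mu_n = E\{(\mathbf{x} - \eta)^n\} = E\left\{\sum_{k=0}^{n}\binom{n}{k}\mathbf{x}^k(-\eta)^{n-k}\right\}$$
Hence
$$\mu_n = \sum_{k=0}^{n}\binom{n}{k} m_k (-\eta)^{n-k} \tag{5-42}$$
Similarly,
$$m_n = E\{[(\mathbf{x} - \eta) + \eta]^n\} = E\left\{\sum_{k=0}^{n}\binom{n}{k}(\mathbf{x} - \eta)^k \eta^{n-k}\right\}$$
Hence
$$m_n = \sum_{k=0}^{n}\binom{n}{k}\mu_k\,\eta^{n-k} \tag{5-43}$$
In particular,
$$\mu_0 = m_0 = 1 \qquad m_1 = \eta \qquad \mu_1 = 0 \qquad \mu_2 = \sigma^2$$
and
$$\mu_3 = m_3 - 3\eta m_2 + 2\eta^3 \qquad m_3 = \mu_3 + 3\eta\sigma^2 + \eta^3$$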
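The conversions (5-42) and (5-43) translate directly into code. In the sketch below, the function names and the list layout (index $n$ holds the $n$th moment) are assumptions made for illustration, and the fair die is used only as a convenient test case with exactly computable moments.

```python
from fractions import Fraction
from math import comb

def central_from_raw(m):
    """(5-42): central moments mu_0..mu_N from raw moments m_0..m_N (m[1] is the mean)."""
    eta = m[1]
    return [
        sum(comb(n, k) * m[k] * (-eta) ** (n - k) for k in range(n + 1))
        for n in range(len(m))
    ]

def raw_from_central(mu, eta):
    """(5-43): raw moments m_0..m_N from central moments mu_0..mu_N and the mean eta."""
    return [
        sum(comb(n, k) * mu[k] * eta ** (n - k) for k in range(n + 1))
        for n in range(len(mu))
    ]

# Raw moments m_0..m_4 of a fair die, computed exactly.
raw = [sum(Fraction(k ** n, 6) for k in range(1, 7)) for n in range(5)]
central = central_from_raw(raw)

print(central)                                    # mu_0 = 1, mu_1 = 0, mu_2 = variance, ...
print(raw_from_central(central, raw[1]) == raw)   # round trip through (5-43)
```

The third central moment of the die comes out as zero, as expected for a distribution that is symmetrical about its mean.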
Notes 1. If the function $f(x)$ is interpreted as mass density on the $x$ axis, then $E\{\mathbf{x}\}$ equals its center of gravity, $E\{\mathbf{x}^2\}$ equals the moment of inertia with respect to the origin, and $\sigma^2$ equals the central moment of inertia. The standard deviation $\sigma$ is the radius of gyration.

2. The constants $\eta$ and $\sigma$ give only a limited characterization of $f(x)$. Knowledge of other moments provides additional information that can be used, for example, to distinguish between two densities with the same $\eta$ and $\sigma$. In fact, if $m_n$ is known for every $n$, then, under certain conditions, $f(x)$ is determined uniquely [see also (5-69)]. The underlying theory is known in mathematics as the moment problem.

3. The moments of an RV are not arbitrary numbers but must satisfy various inequalities. For example [see (5-34)]
$$\sigma^2 = m_2 - m_1^2 \ge 0$$
Similarly, since the quadratic
$$E\{(\mathbf{x}^n - a)^2\} = m_{2n} - 2am_n + a^2$$
is nonnegative for any $a$, its discriminant cannot be positive. Hence
$$m_{2n} \ge m_n^2$$

Normal random variables. We shall show that if
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-x^2/2\sigma^2}$$
then
$$E\{\mathbf{x}^n\} = \begin{cases} 0 & n = 2k + 1 \\ 1 \cdot 3 \cdots (n - 1)\,\sigma^n & n = 2k \end{cases} \tag{5-44}$$
$$E\{|\mathbf{x}|^n\} = \begin{cases} 1 \cdot 3 \cdots (n - 1)\,\sigma^n & n = 2k \\ 2^k k!\,\sigma^{2k+1}\sqrt{2/\pi} & n = 2k + 1 \end{cases} \tag{5-45}$$
The odd moments of $\mathbf{x}$ are 0 because $f(-x) = f(x)$. To prove the lower part of (5-44), we differentiate $k$ times the identity
$$\int_{-\infty}^{\infty} e^{-\alpha x^2}\,dx = \sqrt{\frac{\pi}{\alpha}}$$
This yields
$$\int_{-\infty}^{\infty} x^{2k} e^{-\alpha x^2}\,dx = \frac{1 \cdot 3 \cdots (2k - 1)}{2^k}\sqrt{\frac{\pi}{\alpha^{2k+1}}}$$
and with $\alpha = 1/2\sigma^2$, (5-44) results.
Since $f(-x) = f(x)$, we have
$$E\{|\mathbf{x}|^{2k+1}\} = 2\int_0^{\infty} x^{2k+1} f(x)\,dx = \frac{2}{\sigma\sqrt{2\pi}}\int_0^{\infty} x^{2k+1} e^{-x^2/2\sigma^2}\,dx$$
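With $y = x^2/2\sigma^2$, the above yields
$$\sqrt{\frac{2}{\pi}}\,\frac{(2\sigma^2)^{k+1}}{2\sigma}\int_0^{\infty} y^k e^{-y}\,dy$$
and (5-45) results because the last integral equals $k!$.
We note in particular that
$$E\{\mathbf{x}^4\} = 3\sigma^4 = 3E^2\{\mathbf{x}^2\} \tag{5-46}$$

Example 5-24. If $\mathbf{x}$ has a Rayleigh density
$$f(x) = \frac{x}{\alpha^2}\,e^{-x^2/2\alpha^2}\,U(x)$$
then
$$E\{\mathbf{x}^n\} = \frac{1}{\alpha^2}\int_0^{\infty} x^{n+1} e^{-x^2/2\alpha^2}\,dx = \frac{1}{2\alpha^2}\int_{-\infty}^{\infty} |x|^{n+1} e^{-x^2/2\alpha^2}\,dx$$
From this and (5-45) it follows that
$$E\{\mathbf{x}^n\} = \begin{cases} 1 \cdot 3 \cdots n\,\alpha^n\sqrt{\pi/2} & n = 2k + 1 \\ 2^k k!\,\alpha^{2k} & n = 2k \end{cases} \tag{5-47}$$
In particular,
$$E\{\mathbf{x}\} = \alpha\sqrt{\pi/2} \qquad \sigma^2 = (2 - \pi/2)\,\alpha^2$$

Example 5-25. If $\mathbf{x}$ has a Maxwell density
$$f(x) = \frac{\sqrt{2}}{\alpha^3\sqrt{\pi}}\,x^2 e^{-x^2/2\alpha^2}\,U(x)$$
then
$$E\{\mathbf{x}^n\} = \frac{1}{\alpha^3\sqrt{2\pi}}\int_{-\infty}^{\infty} |x|^{n+2} e^{-x^2/2\alpha^2}\,dx$$
and (5-45) yields
$$E\{\mathbf{x}^n\} = \begin{cases} 1 \cdot 3 \cdots (n + 1)\,\alpha^n & n = 2k \\ 2^k k!\,\alpha^{2k-1}\sqrt{2/\pi} & n = 2k - 1 \end{cases} \tag{5-48}$$
In particular,
$$E\{\mathbf{x}\} = 2\alpha\sqrt{2/\pi} \qquad E\{\mathbf{x}^2\} = 3\alpha^2$$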
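The closed forms (5-44) and (5-45) for a zero-mean normal RV can be compared with a direct numerical integration. The sketch below is an added illustration: the function names, the value $\sigma = 1.5$, the midpoint rule, and the truncation at $12\sigma$ are assumptions, not part of the original text.

```python
import math

def normal_moment(n, sigma):
    """E{x^n} for an N(0, sigma) RV from (5-44)."""
    if n % 2 == 1:
        return 0.0
    return math.prod(range(1, n, 2)) * sigma ** n      # 1 * 3 * ... * (n - 1)

def normal_abs_moment(n, sigma):
    """E{|x|^n} for an N(0, sigma) RV from (5-45)."""
    if n % 2 == 0:
        return normal_moment(n, sigma)
    k = (n - 1) // 2
    return 2 ** k * math.factorial(k) * sigma ** n * math.sqrt(2.0 / math.pi)

def numeric_abs_moment(n, sigma, n_steps=100_000):
    """Midpoint-rule value of the integral of |x|^n f(x) dx, truncated at 12 sigma."""
    limit = 12.0 * sigma
    h = limit / n_steps
    total = 0.0
    for i in range(n_steps):
        x = (i + 0.5) * h
        total += x ** n * math.exp(-x * x / (2.0 * sigma * sigma))
    # The integrand is even, so the integral over x > 0 is doubled.
    return 2.0 * total * h / (sigma * math.sqrt(2.0 * math.pi))

sigma = 1.5
for n in range(1, 7):
    print(n, normal_abs_moment(n, sigma), numeric_abs_moment(n, sigma))
```

For even $n$ the absolute moment and the ordinary moment coincide, so the same table also checks (5-44) and, with $n = 4$, the relation (5-46).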
Poisson random variables. The moments of a Poisson distributed RV are functions of the parameter $a$:
$$m_n(a) = E\{\mathbf{x}^n\} = e^{-a}\sum_{k=0}^{\infty} k^n\frac{a^k}{k!} \tag{5-49}$$
$$\mu_n(a) = E\{(\mathbf{x} - a)^n\} = e^{-a}\sum_{k=0}^{\infty} (k - a)^n\frac{a^k}{k!} \tag{5-50}$$
We shall show that they satisfy the recursion equations
$$m_{n+1}(a) = a[m_n(a) + m_n'(a)] \tag{5-51}$$
$$\mu_{n+1}(a) = a[n\mu_{n-1}(a) + \mu_n'(a)] \tag{5-52}$$
Proof. Differentiating (5-49) with respect to $a$, we obtain
$$m_n'(a) = -e^{-a}\sum_{k=0}^{\infty} k^n\frac{a^k}{k!} + e^{-a}\sum_{k=0}^{\infty} k^{n+1}\frac{a^{k-1}}{k!} = -m_n(a) + \frac{1}{a}m_{n+1}(a)$$
and (5-51) results. Similarly, from (5-50) it follows that
$$\mu_n'(a) = -e^{-a}\sum_{k=0}^{\infty} (k - a)^n\frac{a^k}{k!} - ne^{-a}\sum_{k=0}^{\infty} (k - a)^{n-1}\frac{a^k}{k!} + e^{-a}\sum_{k=0}^{\infty} (k - a)^n k\frac{a^{k-1}}{k!}$$
Setting $k = (k - a) + a$ in the last sum, we obtain $\mu_n' = -\mu_n - n\mu_{n-1} + (1/a)(\mu_{n+1} + a\mu_n)$ and (5-52) results.

The preceding equations lead to the recursive determination of the moments $m_n$ and $\mu_n$. Starting with the known moments $m_1 = a$, $\mu_1 = 0$, and $\mu_2 = a$ [see (5-36)], we obtain $m_2 = a(a + 1)$ and
$$m_3 = a(a^2 + a + 2a + 1) = a^3 + 3a^2 + a \qquad \mu_3 = a(\mu_2' + 2\mu_1) = a$$

ESTIMATE OF THE MEAN OF $g(\mathbf{x})$. The mean of the RV $\mathbf{y} = g(\mathbf{x})$ is given by
$$E\{g(\mathbf{x})\} = \int_{-\infty}^{\infty} g(x) f(x)\,dx \tag{5-53}$$
Hence, for its determination, knowledge of $f(x)$ is required. However, if $\mathbf{x}$ is concentrated near its mean, then $E\{g(\mathbf{x})\}$ can be expressed in terms of the moments $\mu_n$ of $\mathbf{x}$.
Suppose, first, that $f(x)$ is negligible outside an interval $(\eta - \varepsilon, \eta + \varepsilon)$ and in this interval, $g(x) \simeq g(\eta)$. In this case, (5-53) yields
$$E\{g(\mathbf{x})\} \simeq g(\eta)\int_{\eta - \varepsilon}^{\eta + \varepsilon} f(x)\,dx \simeq g(\eta)$$
This estimate can be improved if $g(x)$ is approximated by a polynomial
$$g(x) \simeq g(\eta) + g'(\eta)(x - \eta) + \cdots + g^{(n)}(\eta)\frac{(x - \eta)^n}{n!}$$
Inserting into (5-53), we obtain
$$E\{g(\mathbf{x})\} \simeq g(\eta) + g''(\eta)\frac{\sigma^2}{2} + \cdots + g^{(n)}(\eta)\frac{\mu_n}{n!} \tag{5-54}$$
In particular, if $g(x)$ is approximated by a parabola, then
$$\eta_y = E\{g(\mathbf{x})\} \simeq g(\eta) + g''(\eta)\frac{\sigma^2}{2} \tag{5-55}$$
And if it is approximated by a straight line, then $\eta_y \simeq g(\eta)$. This shows that the slope of $g(x)$ has no effect on $\eta_y$; however, as we show next, it affects the variance $\sigma_y^2$ of $\mathbf{y}$.

Variance. We maintain that the first-order estimate of $\sigma_y^2$ is given by
$$\sigma_y^2 \simeq |g'(\eta)|^2\sigma^2 \tag{5-56}$$
Proof. We apply (5-55) to the function $g^2(x)$. Since its second derivative equals $2(g')^2 + 2gg''$, we conclude that
$$\sigma_y^2 + \eta_y^2 = E\{g^2(\mathbf{x})\} \simeq g^2 + [(g')^2 + gg'']\sigma^2$$
Inserting the approximation (5-55) for $\eta_y$ into the above and neglecting the $\sigma^4$ term, we obtain (5-56).

Example 5-26. A voltage $E = 120$ V is connected across a resistor whose resistance is an RV $\mathbf{r}$ uniform between 900 and 1100 $\Omega$. Using (5-55) and (5-56), we shall estimate the mean and variance of the resulting current
$$\mathbf{i} = \frac{E}{\mathbf{r}}$$
Clearly, $E\{\mathbf{r}\} = \eta = 10^3$, $\sigma^2 = 100^2/3$. With $g(r) = E/r$, we have
$$g(\eta) = 0.12 \qquad g'(\eta) = -12 \times 10^{-5} \qquad g''(\eta) = 24 \times 10^{-8}$$
Hence
$$E\{\mathbf{i}\} \simeq 0.12 + 0.0004\ \text{A} \qquad \sigma_i^2 \simeq 48 \times 10^{-6}\ \text{A}^2$$

Tchebycheff Inequality

A measure of the concentration of an RV near its mean $\eta$ is its variance $\sigma^2$. In fact, as the following theorem shows, the probability that $\mathbf{x}$ is outside an arbitrary interval $(\eta - \varepsilon, \eta + \varepsilon)$ is negligible if the ratio $\sigma/\varepsilon$ is sufficiently small. This result, known as the Tchebycheff inequality, is fundamental.
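Returning briefly to Example 5-26, the estimates (5-55) and (5-56) can be compared with the exact mean and variance, which are available in closed form for a uniform resistance. The sketch below is an added illustration; the exact expressions follow from (5-29) with $f(r) = 1/200$ and are worked out in the comments, and the variable names are assumptions.

```python
import math

E = 120.0                 # volts
lo, hi = 900.0, 1100.0    # r is uniform in (lo, hi) ohms

eta = (lo + hi) / 2.0             # mean of r
var_r = (hi - lo) ** 2 / 12.0     # variance of a uniform RV; equals 100^2 / 3 here

g = lambda r: E / r
g1 = lambda r: -E / r ** 2        # g'(r)
g2 = lambda r: 2.0 * E / r ** 3   # g''(r)

mean_est = g(eta) + g2(eta) * var_r / 2.0     # (5-55)
var_est = g1(eta) ** 2 * var_r                # (5-56)

# Exact values from (5-29) with f(r) = 1/(hi - lo):
#   E{i}   = E/(hi - lo) * ln(hi/lo)
#   E{i^2} = E^2/(hi - lo) * (1/lo - 1/hi)
mean_exact = E / (hi - lo) * math.log(hi / lo)
var_exact = E ** 2 / (hi - lo) * (1.0 / lo - 1.0 / hi) - mean_exact ** 2

print(mean_est, mean_exact)
print(var_est, var_exact)
```

The parabolic estimate of the mean is very close to the exact value, while the first-order variance estimate is off by roughly one to two percent, which is consistent with the neglected $\sigma^4$ term.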
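THEOREM. For any $\varepsilon > 0$,
$$P\{|\mathbf{x} - \eta| \ge \varepsilon\} \le \frac{\sigma^2}{\varepsilon^2} \tag{5-57}$$
Proof. The proof is based on the fact that
$$P\{|\mathbf{x} - \eta| \ge \varepsilon\} = \int_{-\infty}^{\eta - \varepsilon} f(x)\,dx + \int_{\eta + \varepsilon}^{\infty} f(x)\,dx = \int_{|x - \eta| \ge \varepsilon} f(x)\,dx$$
Indeed
$$\sigma^2 = \int_{-\infty}^{\infty} (x - \eta)^2 f(x)\,dx \ge \int_{|x - \eta| \ge \varepsilon} (x - \eta)^2 f(x)\,dx \ge \varepsilon^2\int_{|x - \eta| \ge \varepsilon} f(x)\,dx$$
and (5-57) results because the last integral equals $P\{|\mathbf{x} - \eta| \ge \varepsilon\}$.

Notes 1. From (5-57) it follows that, if $\sigma = 0$, then the probability that $\mathbf{x}$ is outside the interval $(\eta - \varepsilon, \eta + \varepsilon)$ equals 0 for any $\varepsilon$; hence $\mathbf{x} = \eta$ with probability 1. Similarly, if
$$E\{\mathbf{x}^2\} = \eta^2 + \sigma^2 = 0 \qquad \text{then} \qquad \eta = 0 \quad \sigma = 0$$
hence $\mathbf{x} = 0$ with probability 1.

2. For specific densities, the bound in (5-57) is too high. Suppose, for example, that $\mathbf{x}$ is normal. In this case, $P\{|\mathbf{x} - \eta| \ge 3\sigma\} = 2 - 2G(3) = 0.0027$. Inequality (5-57), however, yields $P\{|\mathbf{x} - \eta| \ge 3\sigma\} \le 1/9$.
The significance of Tchebycheff's inequality is the fact that it holds for any $f(x)$ and can, therefore, be used even if $f(x)$ is not known.

3. The bound in (5-57) can be reduced if various assumptions are made about $f(x)$ [see Chernoff bound (Prob. 5-30)].

MARKOFF INEQUALITY. If $f(x) = 0$ for $x < 0$, then, for any $\alpha > 0$,
$$P\{\mathbf{x} \ge \alpha\} \le \frac{\eta}{\alpha} \tag{5-58}$$
Proof.
$$E\{\mathbf{x}\} = \int_0^{\infty} x f(x)\,dx \ge \int_{\alpha}^{\infty} x f(x)\,dx \ge \alpha\int_{\alpha}^{\infty} f(x)\,dx$$
and (5-58) results because the last integral equals $P\{\mathbf{x} \ge \alpha\}$.

COROLLARY. Suppose that $\mathbf{x}$ is an arbitrary RV and $a$ and $n$ are two arbitrary numbers. Clearly, the RV $|\mathbf{x} - a|^n$ takes only positive values. Applying (5-58), with $\alpha = \varepsilon^n$, we conclude that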
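$$P\{|\mathbf{x} - a|^n \ge \varepsilon^n\} \le \frac{E\{|\mathbf{x} - a|^n\}}{\varepsilon^n}$$
Hence
$$P\{|\mathbf{x} - a| \ge \varepsilon\} \le \frac{E\{|\mathbf{x} - a|^n\}}{\varepsilon^n} \tag{5-59}$$
This result is known as the inequality of Bienaymé. Tchebycheff's inequality is a special case obtained with $a = \eta$ and $n = 2$.

5-5 CHARACTERISTIC FUNCTIONS

The characteristic function of an RV is by definition the integral
$$\Phi(\omega) = \int_{-\infty}^{\infty} f(x) e^{j\omega x}\,dx \tag{5-60}$$
This function is maximum at the origin because $f(x) \ge 0$:
$$|\Phi(\omega)| \le \Phi(0) = 1 \tag{5-61}$$
If $j\omega$ is changed to $s$, the resulting integral
$$\boldsymbol{\Phi}(s) = \int_{-\infty}^{\infty} f(x) e^{sx}\,dx \qquad \boldsymbol{\Phi}(j\omega) = \Phi(\omega) \tag{5-62}$$
is the moment (generating) function of $\mathbf{x}$.
The function
$$\Psi(\omega) = \ln\Phi(\omega) = \boldsymbol{\Psi}(j\omega) \tag{5-63}$$
is the second characteristic function of $\mathbf{x}$.
Clearly [see (5-32)]
$$\Phi(\omega) = E\{e^{j\omega\mathbf{x}}\} \qquad \boldsymbol{\Phi}(s) = E\{e^{s\mathbf{x}}\}$$
This leads to the following:
$$\text{If} \quad \mathbf{y} = a\mathbf{x} + b \quad \text{then} \quad \Phi_y(\omega) = e^{jb\omega}\,\Phi_x(a\omega) \tag{5-64}$$
because
$$E\{e^{j\omega\mathbf{y}}\} = E\{e^{j\omega(a\mathbf{x} + b)}\} = e^{jb\omega} E\{e^{ja\omega\mathbf{x}}\}$$

Example 5-27. We shall show that the characteristic function of an $N(\eta, \sigma)$ RV $\mathbf{x}$ equals
$$\Phi_x(\omega) = \exp\{j\eta\omega - \tfrac{1}{2}\sigma^2\omega^2\} \tag{5-65}$$
Proof. The RV $\mathbf{z} = (\mathbf{x} - \eta)/\sigma$ is $N(0, 1)$ and its moment function equals
$$\boldsymbol{\Phi}_z(s) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{sz} e^{-z^2/2}\,dz$$
With
$$sz - \frac{z^2}{2} = -\frac{1}{2}(z - s)^2 + \frac{s^2}{2}$$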
we conclude that
$$\boldsymbol{\Phi}_z(s) = e^{s^2/2}\,\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-(z - s)^2/2}\,dz = e^{s^2/2}$$
And since $\mathbf{x} = \sigma\mathbf{z} + \eta$, (5-65) follows from (5-64).

Inversion formula. As we see from (5-60), $\Phi(\omega)$ is the Fourier transform of $f(x)$. Hence the properties of characteristic functions are essentially the same as the properties of Fourier transforms. We note, in particular, that $f(x)$ can be expressed in terms of $\Phi(\omega)$:
$$f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \Phi(\omega) e^{-j\omega x}\,d\omega \tag{5-66}$$

Moment theorem. Differentiating (5-62) $n$ times, we obtain
$$\boldsymbol{\Phi}^{(n)}(s) = E\{\mathbf{x}^n e^{s\mathbf{x}}\}$$
Hence
$$\boldsymbol{\Phi}^{(n)}(0) = E\{\mathbf{x}^n\} = m_n \tag{5-67}$$
Thus the derivatives of $\boldsymbol{\Phi}(s)$ at the origin equal the moments of $\mathbf{x}$. This justifies the name "moment function" given to $\boldsymbol{\Phi}(s)$. In particular,
$$\boldsymbol{\Phi}'(0) = m_1 = \eta \qquad \boldsymbol{\Phi}''(0) = m_2 = \eta^2 + \sigma^2 \tag{5-68}$$
Note. Expanding $\boldsymbol{\Phi}(s)$ into a series near the origin and using (5-67), we obtain
$$\boldsymbol{\Phi}(s) = \sum_{n=0}^{\infty} \frac{m_n}{n!}s^n \tag{5-69}$$
This is valid only if all moments are finite and the series converges absolutely near $s = 0$. Since $f(x)$ can be determined in terms of $\boldsymbol{\Phi}(s)$, (5-69) shows that, under the stated conditions, the density of an RV is uniquely determined if all its moments are known.

Example 5-28. We shall determine the moment function and the moments of an RV $\mathbf{x}$ with gamma distribution:
$$f(x) = \gamma x^{b-1} e^{-cx}\,U(x) \qquad \gamma = \frac{c^b}{\Gamma(b)}$$
Here $\gamma$ is the constant that normalizes the area of $f(x)$ to 1, consistent with (5-70) below. From (4-39) it follows that
$$\boldsymbol{\Phi}(s) = \gamma\int_0^{\infty} x^{b-1} e^{-(c - s)x}\,dx = \frac{\gamma\,\Gamma(b)}{(c - s)^b} = \frac{c^b}{(c - s)^b} \tag{5-70}$$
Differentiating with respect to $s$ and setting $s = 0$, we obtain
$$\boldsymbol{\Phi}^{(n)}(0) = \frac{b(b + 1)\cdots(b + n - 1)}{c^n} = E\{\mathbf{x}^n\}$$
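With $n = 1$ and $n = 2$, this yields
$$E\{\mathbf{x}\} = \frac{b}{c} \qquad E\{\mathbf{x}^2\} = \frac{b(b + 1)}{c^2} \qquad \sigma^2 = \frac{b}{c^2}$$
The exponential density is a special case obtained with $b = 1$:
$$f(x) = ce^{-cx}\,U(x) \qquad \boldsymbol{\Phi}(s) = \frac{c}{c - s} \qquad E\{\mathbf{x}\} = \frac{1}{c} \qquad \sigma^2 = \frac{1}{c^2}$$
Chi square. Setting $b = m/2$ and $c = 1/2$ in (5-70), we obtain the moment function of the chi-square density $\chi^2(m)$:
$$\boldsymbol{\Phi}(s) = \frac{1}{\sqrt{(1 - 2s)^m}} \qquad E\{\mathbf{x}\} = m \qquad \sigma^2 = 2m \tag{5-71}$$

Cumulants. The cumulants $\lambda_n$ of RV $\mathbf{x}$ are by definition the derivatives
$$\frac{d^n\boldsymbol{\Psi}(0)}{ds^n} = \lambda_n \tag{5-72}$$
of its second moment function $\boldsymbol{\Psi}(s)$. Clearly [see (5-63)] $\boldsymbol{\Psi}(0) = \lambda_0 = 0$; hence
$$\boldsymbol{\Psi}(s) = \lambda_1 s + \frac{1}{2}\lambda_2 s^2 + \cdots + \frac{1}{n!}\lambda_n s^n + \cdots$$
We maintain that
$$\lambda_1 = \eta \qquad \lambda_2 = \sigma^2 \tag{5-73}$$
Proof. Since $\boldsymbol{\Phi} = e^{\boldsymbol{\Psi}}$, we conclude that
$$\boldsymbol{\Phi}' = \boldsymbol{\Psi}' e^{\boldsymbol{\Psi}} \qquad \boldsymbol{\Phi}'' = [\boldsymbol{\Psi}'' + (\boldsymbol{\Psi}')^2]\,e^{\boldsymbol{\Psi}}$$
With $s = 0$, this yields
$$\boldsymbol{\Phi}'(0) = \boldsymbol{\Psi}'(0) = m_1 \qquad \boldsymbol{\Phi}''(0) = \boldsymbol{\Psi}''(0) + [\boldsymbol{\Psi}'(0)]^2 = m_2$$
and (5-73) results.

Discrete Type

Suppose that $\mathbf{x}$ is a discrete type RV taking the values $x_i$ with probability $p_i$. In this case, (5-60) yields
$$\Phi(\omega) = \sum_i p_i e^{j\omega x_i} \tag{5-74}$$
Thus $\Phi(\omega)$ is a sum of exponentials. The moment function of $\mathbf{x}$ can be defined as in (5-62). However, if $\mathbf{x}$ takes only integer values, then a definition in terms of $z$ transforms is preferable.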
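The statement (5-73) about the first two cumulants can be illustrated numerically for the gamma case. The sketch below is an added illustration under stated assumptions: the gamma moment function (5-70) is taken as $c^b/(c - s)^b$, the parameter values and step size are arbitrary, and the derivatives at the origin are approximated by central differences rather than computed symbolically.

```python
import math

def second_moment_function(s, b, c):
    """Psi(s) = ln Phi(s) for the gamma moment function c^b / (c - s)^b, valid for s < c."""
    return b * (math.log(c) - math.log(c - s))

def cumulants_by_differences(psi, h=1e-3):
    """Central-difference estimates of Psi'(0) and Psi''(0), i.e. lambda_1 and lambda_2."""
    lam1 = (psi(h) - psi(-h)) / (2.0 * h)
    lam2 = (psi(h) - 2.0 * psi(0.0) + psi(-h)) / (h * h)
    return lam1, lam2

b, c = 2.5, 4.0
lam1, lam2 = cumulants_by_differences(lambda s: second_moment_function(s, b, c))
print(lam1, b / c)          # lambda_1 against the mean b/c
print(lam2, b / c ** 2)     # lambda_2 against the variance b/c^2
```

Working with $\ln\boldsymbol{\Phi}$ gives the variance directly from the second derivative, without the subtraction of $\eta^2$ that (5-68) would require.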
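LATTICE TYPE. If $\mathbf{n}$ is a lattice type RV taking integer values, then its moment function is by definition the sum
$$\Gamma(z) = E\{z^{\mathbf{n}}\} = \sum_{n=-\infty}^{\infty} p_n z^n \tag{5-75}$$
Thus $\Gamma(1/z)$ is the $z$ transform of the sequence $p_n = P\{\mathbf{n} = n\}$. With $\Phi(\omega)$ as in (5-74), the above yields
$$\Phi(\omega) = \Gamma(e^{j\omega}) = \sum_{n=-\infty}^{\infty} p_n e^{jn\omega}$$
Thus $\Phi(\omega)$ is the discrete Fourier transform (DFT) of $p_n$ and
$$\boldsymbol{\Psi}(s) = \ln\Gamma(e^s) \tag{5-76}$$

Moment theorem. Differentiating (5-75) $k$ times, we obtain
$$\Gamma^{(k)}(z) = E\{\mathbf{n}(\mathbf{n} - 1)\cdots(\mathbf{n} - k + 1)\,z^{\mathbf{n} - k}\}$$
With $z = 1$, this yields
$$\Gamma^{(k)}(1) = E\{\mathbf{n}(\mathbf{n} - 1)\cdots(\mathbf{n} - k + 1)\} \tag{5-77}$$
We note, in particular, that $\Gamma(1) = 1$ and
$$\Gamma'(1) = E\{\mathbf{n}\} \qquad \Gamma''(1) = E\{\mathbf{n}^2\} - E\{\mathbf{n}\} \tag{5-78}$$

Example 5-29. (a) If $\mathbf{n}$ takes the values 0 and 1 with $P\{\mathbf{n} = 1\} = p$ and $P\{\mathbf{n} = 0\} = q$, then
$$\Gamma(z) = pz + q \qquad \Gamma'(1) = E\{\mathbf{n}\} = p \qquad \Gamma''(1) = E\{\mathbf{n}^2\} - E\{\mathbf{n}\} = 0$$
(b) If $\mathbf{n}$ has the binomial distribution
$$p_n = P\{\mathbf{n} = n\} = \binom{m}{n} p^n q^{m-n} \qquad 0 \le n \le m$$
then
$$\Gamma(z) = \sum_{n=0}^{m}\binom{m}{n} p^n q^{m-n} z^n = (pz + q)^m$$
$$\Gamma'(1) = mp \qquad \Gamma''(1) = m(m - 1)p^2$$
Hence
$$E\{\mathbf{n}\} = mp \qquad \sigma^2 = mpq$$

Example 5-30. If $\mathbf{n}$ is Poisson distributed with parameter $a$,
$$P\{\mathbf{n} = n\} = e^{-a}\frac{a^n}{n!} \qquad n = 0, 1, \ldots$$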
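then
$$\Gamma(z) = e^{-a}\sum_{n=0}^{\infty} \frac{a^n z^n}{n!} = e^{a(z - 1)} \tag{5-79}$$
In this case [see (5-76)]
$$\boldsymbol{\Psi}(s) = a(e^s - 1) \qquad \boldsymbol{\Psi}'(0) = a \qquad \boldsymbol{\Psi}''(0) = a$$
and (5-73) yields $E\{\mathbf{n}\} = a$, $\sigma^2 = a$ in agreement with (5-36).

Determination of the density of $g(\mathbf{x})$. We show next that characteristic functions can be used to determine the density $f_y(y)$ of the RV $\mathbf{y} = g(\mathbf{x})$ in terms of the density $f_x(x)$ of $\mathbf{x}$.
From (5-32) it follows that the characteristic function
$$\Phi_y(\omega) = \int_{-\infty}^{\infty} e^{j\omega y} f_y(y)\,dy$$
of the RV $\mathbf{y} = g(\mathbf{x})$ equals
$$\Phi_y(\omega) = E\{e^{j\omega g(\mathbf{x})}\} = \int_{-\infty}^{\infty} e^{j\omega g(x)} f_x(x)\,dx \tag{5-80}$$
If, therefore, the above integral can be written in the form
$$\int_{-\infty}^{\infty} e^{j\omega y} h(y)\,dy$$
it will follow that (uniqueness theorem)
$$f_y(y) = h(y)$$
This method leads to simple results if the transformation $y = g(x)$ is one-to-one.

Example 5-31. Suppose that $\mathbf{x}$ is $N(0; \sigma)$ and $\mathbf{y} = a\mathbf{x}^2$. Inserting into (5-80) and using the evenness of the integrand, we obtain
$$\Phi_y(\omega) = \int_{-\infty}^{\infty} e^{j\omega ax^2} f(x)\,dx = \frac{2}{\sigma\sqrt{2\pi}}\int_0^{\infty} e^{j\omega ax^2} e^{-x^2/2\sigma^2}\,dx$$
As $x$ increases from 0 to $\infty$, the transformation $y = ax^2$ is one-to-one. Since
$$dy = 2ax\,dx = 2\sqrt{ay}\,dx$$
the above yields
$$\Phi_y(\omega) = \frac{2}{\sigma\sqrt{2\pi}}\int_0^{\infty} e^{j\omega y} e^{-y/2a\sigma^2}\,\frac{dy}{2\sqrt{ay}}$$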
Hence
$$f_y(y) = \frac{e^{-y/2a\sigma^2}}{\sigma\sqrt{2\pi ay}}\,U(y)$$
in agreement with (5-8).

Example 5-32. We assume finally that $\mathbf{x}$ is uniform in the interval $(-\pi/2, \pi/2)$ and $\mathbf{y} = \sin\mathbf{x}$. In this case
$$\Phi_y(\omega) = \int_{-\infty}^{\infty} e^{j\omega\sin x} f(x)\,dx = \frac{1}{\pi}\int_{-\pi/2}^{\pi/2} e^{j\omega\sin x}\,dx$$
As $x$ increases from $-\pi/2$ to $\pi/2$, the function $y = \sin x$ increases from $-1$ to 1 and
$$dy = \cos x\,dx = \sqrt{1 - y^2}\,dx$$
Hence
$$\Phi_y(\omega) = \frac{1}{\pi}\int_{-1}^{1} e^{j\omega y}\,\frac{dy}{\sqrt{1 - y^2}}$$
This leads to the conclusion that
$$f_y(y) = \frac{1}{\pi\sqrt{1 - y^2}} \qquad \text{for} \quad |y| < 1$$
and 0 otherwise, in agreement with (5-13).

PROBLEMS

5-1. The RV $\mathbf{x}$ is $N(5, 2)$ and $\mathbf{y} = 2\mathbf{x} + 4$. Find $\eta_y$, $\sigma_y$, and $f_y(y)$.
5-2. Find $F_y(y)$ and $f_y(y)$ if $\mathbf{y} = -4\mathbf{x} + 3$ and $f_x(x) = 2e^{-2x}U(x)$.
5-3. If the RV $\mathbf{x}$ is $N(0, c)$ and $g(x)$ is the function in Fig. 5-4, find and sketch the distribution and the density of the RV $\mathbf{y} = g(\mathbf{x})$.
5-4. The RV $\mathbf{x}$ is uniform in the interval $(-2c, 2c)$. Find and sketch $f_y(y)$ and $F_y(y)$ if $\mathbf{y} = g(\mathbf{x})$ and $g(x)$ is the function in Fig. 5-3.
5-5. The RV $\mathbf{x}$ is $N(0, b)$ and $g(x)$ is the function in Fig. 5-5. Find and sketch $f_y(y)$ and $F_y(y)$.
5-6. The RV $\mathbf{x}$ is uniform in the interval (0, 1). Find the density of the RV $\mathbf{y} = -\ln\mathbf{x}$.
5-7. We place at random 200 points in the interval (0, 100). The distance from 0 to the first random point is an RV $\mathbf{z}$. Find $F_z(z)$ (a) exactly and (b) using the Poisson approximation.
5-8. If $\mathbf{y} = \sqrt{\mathbf{x}}$ and $f_x(x) = ce^{-cx}U(x)$, find $f_y(y)$.
5-9. Express the density $f_y(y)$ of the RV $\mathbf{y} = g(\mathbf{x})$ in terms of $f_x(x)$ if (a) $g(x) = |x|$; (b) $g(x) = e^{-x}U(x)$.
5-10. Find $F_y(y)$ and $f_y(y)$ if $F_x(x) = (1 - e^{-2x})U(x)$ and (a) $\mathbf{y} = (\mathbf{x} - 1)U(\mathbf{x} - 1)$.
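5-11. Show that, if the RV $\mathbf{x}$ has a Cauchy density with $\alpha = 1$ and $\mathbf{y} = \arctan\mathbf{x}$, then $\mathbf{y}$ is uniform in the interval $(-\pi/2, \pi/2)$.

5-12. The RV $\mathbf{x}$ is uniform in the interval $(-2\pi, 2\pi)$. Find $f_y(y)$ if (a) $\mathbf{y} = \mathbf{x}^3$, (b) $\mathbf{y} = \mathbf{x}^4$, and (c) $\mathbf{y} = 2\sin(3\mathbf{x} + 40°)$.

5-13. The RV $\mathbf{x}$ is uniform in the interval $(-1, 1)$. Find $g(x)$ such that if $\mathbf{y} = g(\mathbf{x})$ then $f_y(y) = 2e^{-2y}U(y)$.

5-14. Given an RV $\mathbf{x}$ of continuous type, we form the RV $\mathbf{y} = g(\mathbf{x})$. (a) Find $f_y(y)$ if $g(x) = 2F_x(x) + 4$. (b) Find $g(x)$ such that $\mathbf{y}$ is uniform in the interval $(8, 10)$.

5-15. A fair coin is tossed 10 times and $\mathbf{x}$ equals the number of heads. (a) Find $F_x(x)$. (b) Find $F_y(y)$ if $\mathbf{y} = (\mathbf{x} - 3)^2$.

5-16. If $\mathbf{t}$ is an RV of continuous type and $\mathbf{y} = a\sin\omega\mathbf{t}$, show that

$$f_y(y) \xrightarrow[\omega\to\infty]{} \begin{cases} \dfrac{1}{\pi\sqrt{a^2 - y^2}} & |y| < a \\[2mm] 0 & |y| > a \end{cases}$$

5-17. Show that if $\mathbf{y} = \mathbf{x}^2$, then

$$f_y(y \mid \mathbf{x} \ge 0) = \frac{U(y)}{1 - F_x(0)}\,\frac{f_x(\sqrt{y})}{2\sqrt{y}}$$

5-18. (a) Show that if $\mathbf{y} = a\mathbf{x} + b$, then $\sigma_y = |a|\sigma_x$. (b) Find $\eta_y$ and $\sigma_y$ if $\mathbf{y} = (\mathbf{x} - \eta_x)/\sigma_x$.

5-19. Show that if $\mathbf{x}$ has a Rayleigh density with parameter $\alpha$ and $\mathbf{y} = b + c\mathbf{x}^2$, then $\sigma_y^2 = 4c^2\alpha^4$.

5-20. (a) Show that if $m$ is the median of $\mathbf{x}$, then

$$E\{|\mathbf{x} - a|\} = E\{|\mathbf{x} - m|\} + 2\int_a^m (x - a) f(x)\,dx$$

for any $a$. (b) Find $c$ such that $E\{|\mathbf{x} - c|\}$ is minimum.

5-21. Show that if the RV $\mathbf{x}$ is $N(\eta; \sigma)$, then

$$E\{|\mathbf{x}|\} = \sigma\sqrt{\frac{2}{\pi}}\,e^{-\eta^2/2\sigma^2} + 2\eta\,G\!\left(\frac{\eta}{\sigma}\right) - \eta$$

5-22. If $\mathbf{x}$ is $N(0, 2)$ and $\mathbf{y} = 3\mathbf{x}^2$, find $\eta_y$, $\sigma_y$, and $f_y(y)$.

5-23. Show that if $\mathcal{A} = [A_1, \ldots, A_n]$ is a partition of $\mathcal{S}$, then

$$E\{\mathbf{x}\} = E\{\mathbf{x} \mid A_1\}P(A_1) + \cdots + E\{\mathbf{x} \mid A_n\}P(A_n)$$

5-24. Show that if $\mathbf{x} \ge 0$ and $E\{\mathbf{x}\} = \eta$, then $P\{\mathbf{x} \ge \sqrt{\eta}\} \le \sqrt{\eta}$.

5-25. Using (5-55), find $E\{\mathbf{x}^3\}$ if $\eta_x = 10$ and $\sigma_x = 2$.

5-26. If $\mathbf{x}$ is uniform in the interval $(10, 12)$ and $\mathbf{y} = \mathbf{x}^3$, (a) find $f_y(y)$; (b) find $E\{\mathbf{y}\}$: 1, exactly; 2, using (5-55).

5-27. The RV $\mathbf{x}$ is $N(100, 3)$. Find approximately the mean of the RV $\mathbf{y} = 1/\mathbf{x}$ using (5-55).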
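5-28. We are given an even convex function $g(x)$ and an RV $\mathbf{x}$ whose density $f(x)$ is symmetrical as in Fig. P5-28 with a single maximum at $x = \eta$. Show that the mean $E\{g(\mathbf{x} - a)\}$ of the RV $g(\mathbf{x} - a)$ is minimum if $a = \eta$.

FIGURE P5-28

5-29. Show that if $\mathbf{x}$ and $\mathbf{y}$ are two RVs with densities $f_x(x)$ and $f_y(y)$ respectively, then

$$E\{\log f_x(\mathbf{x})\} \ge E\{\log f_y(\mathbf{x})\}$$

5-30. (Chernoff bound) (a) Show that for any $\alpha > 0$ and for any real $s$,

$$P\{e^{s\mathbf{x}} \ge \alpha\} \le \frac{\Phi(s)}{\alpha} \qquad \text{where } \Phi(s) = E\{e^{s\mathbf{x}}\} \tag{i}$$

Hint: Apply (5-58) to the RV $\mathbf{y} = e^{s\mathbf{x}}$. (b) For any $A$,

$$P\{\mathbf{x} \ge A\} \le e^{-sA}\Phi(s) \quad s > 0 \qquad\qquad P\{\mathbf{x} \le A\} \le e^{-sA}\Phi(s) \quad s < 0$$

Hint: Set $\alpha = e^{sA}$ in (i).

5-31. Show that (a) if $f(x)$ is a Cauchy density, then $\Phi(\omega) = e^{-\alpha|\omega|}$; (b) if $f(x)$ is a Laplace density, then $\Phi(\omega) = \alpha^2/(\alpha^2 + \omega^2)$.

5-32. Show that if $E\{\mathbf{x}\} = \eta$, then [the equation is illegible in the source; only the lower summation limit $n = 0$ survives].

5-33. Show that if $\Phi_x(\omega_1) = 1$ for some $\omega_1 \ne 0$, then the RV $\mathbf{x}$ is of lattice type taking the values $x_n = 2\pi n/\omega_1$. Hint:

$$0 = 1 - \Phi_x(\omega_1) = \int_{-\infty}^{\infty} (1 - e^{j\omega_1 x}) f_x(x)\,dx$$

5-34. The RV $\mathbf{x}$ has zero mean, central moments $\mu_n$, and cumulants $\lambda_n$. Show that $\lambda_3 = \mu_3$, $\lambda_4 = \mu_4 - 3\mu_2^2$; if $\mathbf{y}$ is $N(0; \sigma_y)$ and $\sigma_y = \sigma_x$, then $E\{\mathbf{x}^4\} = E\{\mathbf{y}^4\} + \lambda_4$.

5-35. An RV $\mathbf{x}$ has a geometric distribution if

$$P\{\mathbf{x} = k\} = pq^k \qquad k = 0, 1, \ldots \qquad p + q = 1$$

Find $\Gamma(z)$ and show that $\eta_x = q/p$, $\sigma_x^2 = q/p^2$.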
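5-36. An RV $\mathbf{x}$ has a Pascal (or negative binomial) distribution if

$$P\{\mathbf{x} = k\} = \binom{n + k - 1}{k} p^n q^k \qquad k = 0, 1, \ldots$$

Find $\Gamma(z)$ and show that $\eta_x = nq/p$, $\sigma_x^2 = nq/p^2$.

5-37. The RV $\mathbf{x}$ takes the values $0, 1, \ldots$ with $P\{\mathbf{x} = k\} = p_k$. Show that if $\mathbf{y} = (\mathbf{x} - 1)U(\mathbf{x} - 1)$ then

$$\Gamma_y(z) = p_0 + z^{-1}[\Gamma_x(z) - p_0] \qquad \eta_y = \eta_x - 1 + p_0 \qquad E\{\mathbf{y}^2\} = E\{\mathbf{x}^2\} - 2\eta_x + 1 - p_0$$

5-38. Show that, if $\Phi(\omega) = E\{e^{j\omega\mathbf{x}}\}$, then for any $a_i$,

$$\sum_{i=1}^{n}\sum_{j=1}^{n} \Phi(\omega_i - \omega_j)\,a_i a_j^* \ge 0$$

Hint:

$$E\left\{\left|\sum_{i=1}^{n} a_i e^{j\omega_i\mathbf{x}}\right|^2\right\} \ge 0$$

5-39. The RV $\mathbf{x}$ is $N(0; \sigma)$. (a) Using characteristic functions, show that if $g(x)$ is a function such that $g(x)e^{-x^2/2\sigma^2} \to 0$ as $|x| \to \infty$, then

$$\frac{dE\{g(\mathbf{x})\}}{dv} = \frac{1}{2}E\left\{\frac{d^2 g(\mathbf{x})}{d\mathbf{x}^2}\right\} \qquad v = \sigma^2 \tag{i}$$

(b) The moments $\mu_n$ of $\mathbf{x}$ are functions of $v$. Using (i), show that

$$\mu_n(v) = \frac{n(n-1)}{2}\int_0^v \mu_{n-2}(\beta)\,d\beta$$

5-40. Show that, if $\mathbf{n}$ is an integer-valued RV with moment function $\Gamma(z)$ as in (5-75), then

$$P\{\mathbf{n} = k\} = \frac{1}{2\pi}\int_{-\pi}^{\pi} \Gamma(e^{j\omega})\,e^{-jk\omega}\,d\omega$$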
CHAPTER 6

TWO RANDOM VARIABLES

6-1 BIVARIATE DISTRIBUTIONS

We are given two RVs $\mathbf{x}$ and $\mathbf{y}$, defined as in Sec. 4-1, and we wish to determine their joint statistics, that is, the probability that the point $(\mathbf{x}, \mathbf{y})$ is in a specified region† $D$ in the $xy$ plane. The distribution functions $F_x(x)$ and $F_y(y)$ of the given RVs determine their separate (marginal) statistics but not their joint statistics. In particular, the probability of the event

$$\{\mathbf{x} \le x\} \cap \{\mathbf{y} \le y\} = \{\mathbf{x} \le x, \mathbf{y} \le y\}$$

cannot be expressed in terms of $F_x(x)$ and $F_y(y)$. In the following, we show that the joint statistics of the RVs $\mathbf{x}$ and $\mathbf{y}$ are completely determined if the probability of this event is known for every $x$ and $y$.

†The region $D$ is arbitrary subject only to the mild condition that it can be expressed as a countable union or intersection of rectangles.

Joint Distribution and Density

The joint (bivariate) distribution $F_{xy}(x, y)$ or, simply, $F(x, y)$ of two RVs $\mathbf{x}$ and $\mathbf{y}$ is the probability of the event

$$\{\mathbf{x} \le x, \mathbf{y} \le y\} = \{(\mathbf{x}, \mathbf{y}) \in D_1\}$$

where $x$ and $y$ are two arbitrary real numbers and $D_1$ is the quadrant shown in
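Fig. 6-1a:

$$F(x, y) = P\{\mathbf{x} \le x, \mathbf{y} \le y\} \tag{6-1}$$

PROPERTIES.

1. The function $F(x, y)$ is such that

$$F(-\infty, y) = 0 \qquad F(x, -\infty) = 0 \qquad F(\infty, \infty) = 1$$

Proof. As we know, $P\{\mathbf{x} = -\infty\} = P\{\mathbf{y} = -\infty\} = 0$. And since

$$\{\mathbf{x} = -\infty, \mathbf{y} \le y\} \subset \{\mathbf{x} = -\infty\} \qquad \{\mathbf{x} \le x, \mathbf{y} = -\infty\} \subset \{\mathbf{y} = -\infty\}$$

the first two equations follow. The last is a consequence of the identities

$$\{\mathbf{x} \le \infty, \mathbf{y} \le \infty\} = \mathcal{S} \qquad P(\mathcal{S}) = 1$$

2. The event $\{x_1 < \mathbf{x} \le x_2, \mathbf{y} \le y\}$ consists of all points $(\mathbf{x}, \mathbf{y})$ in the vertical half-strip $D_2$ and the event $\{\mathbf{x} \le x, y_1 < \mathbf{y} \le y_2\}$ consists of all points $(\mathbf{x}, \mathbf{y})$ in the horizontal half-strip $D_3$ of Fig. 6-1b. We maintain that

$$P\{x_1 < \mathbf{x} \le x_2, \mathbf{y} \le y\} = F(x_2, y) - F(x_1, y) \tag{6-2}$$

$$P\{\mathbf{x} \le x, y_1 < \mathbf{y} \le y_2\} = F(x, y_2) - F(x, y_1) \tag{6-3}$$

Proof. Clearly,

$$\{\mathbf{x} \le x_2, \mathbf{y} \le y\} = \{\mathbf{x} \le x_1, \mathbf{y} \le y\} + \{x_1 < \mathbf{x} \le x_2, \mathbf{y} \le y\}$$

The last two events are mutually exclusive; hence [see (2-10)]

$$P\{\mathbf{x} \le x_2, \mathbf{y} \le y\} = P\{\mathbf{x} \le x_1, \mathbf{y} \le y\} + P\{x_1 < \mathbf{x} \le x_2, \mathbf{y} \le y\}$$

and (6-2) results. The proof of (6-3) is similar.

3.

$$P\{x_1 < \mathbf{x} \le x_2, y_1 < \mathbf{y} \le y_2\} = F(x_2, y_2) - F(x_1, y_2) - F(x_2, y_1) + F(x_1, y_1) \tag{6-4}$$

This is the probability that $(\mathbf{x}, \mathbf{y})$ is in the rectangle $D_4$ of Fig. 6-1c.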
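Proof. It follows from (6-2) and (6-3) because

$$\{x_1 < \mathbf{x} \le x_2, \mathbf{y} \le y_2\} = \{x_1 < \mathbf{x} \le x_2, \mathbf{y} \le y_1\} + \{x_1 < \mathbf{x} \le x_2, y_1 < \mathbf{y} \le y_2\}$$

and the last two events are mutually exclusive.

JOINT DENSITY. The joint density of $\mathbf{x}$ and $\mathbf{y}$ is by definition the function

$$f(x, y) = \frac{\partial^2 F(x, y)}{\partial x\,\partial y} \tag{6-5}$$

From this and property 1 it follows that

$$F(x, y) = \int_{-\infty}^{x}\int_{-\infty}^{y} f(\alpha, \beta)\,d\beta\,d\alpha \tag{6-6}$$

Joint statistics. We shall now show that the probability that the point $(\mathbf{x}, \mathbf{y})$ is in a region $D$ of the $xy$ plane equals the integral of $f(x, y)$ in $D$. In other words,

$$P\{(\mathbf{x}, \mathbf{y}) \in D\} = \iint_D f(x, y)\,dx\,dy \tag{6-7}$$

where $\{(\mathbf{x}, \mathbf{y}) \in D\}$ is the event consisting of all outcomes $\zeta$ such that the point $[\mathbf{x}(\zeta), \mathbf{y}(\zeta)]$ is in $D$.

Proof. As we know, the ratio

$$\frac{F(x + \Delta x, y + \Delta y) - F(x, y + \Delta y) - F(x + \Delta x, y) + F(x, y)}{\Delta x\,\Delta y}$$

tends to $\partial^2 F(x, y)/\partial x\,\partial y$ as $\Delta x \to 0$ and $\Delta y \to 0$. Hence [see (6-4) and (6-5)]

$$P\{x < \mathbf{x} \le x + \Delta x, y < \mathbf{y} \le y + \Delta y\} \simeq f(x, y)\,\Delta x\,\Delta y \tag{6-8}$$

We have thus shown that the probability that $(\mathbf{x}, \mathbf{y})$ is in a differential rectangle equals $f(x, y)$ times the area $\Delta x\,\Delta y$ of the rectangle. This proves (6-7) because the region $D$ can be written as the limit of the union of such rectangles.

Marginal statistics. In the study of several RVs, the statistics of each are called marginal. Thus $F_x(x)$ is the marginal distribution and $f_x(x)$ the marginal density of $\mathbf{x}$. In the following, we express the marginal statistics of $\mathbf{x}$ and $\mathbf{y}$ in terms of their joint statistics $F(x, y)$ and $f(x, y)$.

We maintain that

$$F_x(x) = F(x, \infty) \qquad F_y(y) = F(\infty, y) \tag{6-9}$$

$$f_x(x) = \int_{-\infty}^{\infty} f(x, y)\,dy \qquad f_y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx \tag{6-10}$$

Proof. Clearly, $\{\mathbf{x} \le \infty\} = \{\mathbf{y} \le \infty\} = \mathcal{S}$; hence

$$\{\mathbf{x} \le x\} = \{\mathbf{x} \le x, \mathbf{y} \le \infty\} \qquad \{\mathbf{y} \le y\} = \{\mathbf{x} \le \infty, \mathbf{y} \le y\}$$

The probabilities of the two sides above yield (6-9).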
Differentiating (6-6), we obtain

$$\frac{\partial F(x, y)}{\partial x} = \int_{-\infty}^{y} f(x, \beta)\,d\beta \qquad \frac{\partial F(x, y)}{\partial y} = \int_{-\infty}^{x} f(\alpha, y)\,d\alpha \tag{6-11}$$

Setting $y = \infty$ in the first and $x = \infty$ in the second equation, we obtain (6-10) because [see (6-9)]

$$f_x(x) = \frac{\partial F(x, \infty)}{\partial x} \qquad f_y(y) = \frac{\partial F(\infty, y)}{\partial y}$$

Existence theorem. From properties 1 and 3 it follows that

$$F(-\infty, y) = 0 \qquad F(x, -\infty) = 0 \qquad F(\infty, \infty) = 1 \tag{6-12}$$

and

$$F(x_2, y_2) - F(x_1, y_2) - F(x_2, y_1) + F(x_1, y_1) \ge 0 \tag{6-13}$$

for every $x_1 < x_2$, $y_1 < y_2$. Hence [see (6-6) and (6-8)]

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1 \qquad f(x, y) \ge 0 \tag{6-14}$$

Conversely, given $F(x, y)$ or $f(x, y)$ as above, we can find two RVs $\mathbf{x}$ and $\mathbf{y}$, defined in some space $\mathcal{S}$, with distribution $F(x, y)$ or density $f(x, y)$. This can be done by extending the existence theorem of Sec. 4-3 to joint statistics.

Joint normality. We shall say that the RVs $\mathbf{x}$ and $\mathbf{y}$ are jointly normal if their joint density is given by

$$f(x, y) = A\exp\left\{-\frac{1}{2(1 - r^2)}\left[\frac{(x - \eta_1)^2}{\sigma_1^2} - 2r\frac{(x - \eta_1)(y - \eta_2)}{\sigma_1\sigma_2} + \frac{(y - \eta_2)^2}{\sigma_2^2}\right]\right\} \tag{6-15}$$

This function is positive and its integral equals 1 if

$$A = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - r^2}} \qquad |r| < 1 \tag{6-16}$$

Thus $f(x, y)$ is an exponential and its exponent is a negative quadratic because $|r| < 1$. The function $f(x, y)$ will be denoted by

$$N(\eta_1, \eta_2; \sigma_1, \sigma_2; r)$$

As we shall presently see, $\eta_1$ and $\eta_2$ are the expected values of $\mathbf{x}$ and $\mathbf{y}$, and $\sigma_1^2$ and $\sigma_2^2$ their variances. The significance of $r$ will be given later (correlation coefficient).

We maintain that the marginal densities of $\mathbf{x}$ and $\mathbf{y}$ are given by

$$f_x(x) = \frac{1}{\sigma_1\sqrt{2\pi}}e^{-(x - \eta_1)^2/2\sigma_1^2} \qquad f_y(y) = \frac{1}{\sigma_2\sqrt{2\pi}}e^{-(y - \eta_2)^2/2\sigma_2^2} \tag{6-17}$$
Proof. To prove the above, we must show that if (6-15) is inserted into (6-10), the result is (6-17). The bracket in (6-15) can be written in the form

$$\left(\frac{x - \eta_1}{\sigma_1} - r\frac{y - \eta_2}{\sigma_2}\right)^2 + (1 - r^2)\frac{(y - \eta_2)^2}{\sigma_2^2}$$

Hence

$$\int_{-\infty}^{\infty} f(x, y)\,dx = A\exp\left\{-\frac{(y - \eta_2)^2}{2\sigma_2^2}\right\}\int_{-\infty}^{\infty}\exp\left\{-\frac{1}{2(1 - r^2)}\left(\frac{x - \eta_1}{\sigma_1} - r\frac{y - \eta_2}{\sigma_2}\right)^2\right\}dx$$

The last integral is a constant $B$ (independent of $x$ and $y$). Therefore [see (6-10)]

$$f_y(y) = ABe^{-(y - \eta_2)^2/2\sigma_2^2}$$

And since $f_y(y)$ is a density, its area must equal 1. This yields $AB = 1/\sigma_2\sqrt{2\pi}$ and the second equation in (6-17) results. The proof of the first is similar.

Notes

1. From (6-17) it follows that if two RVs are jointly normal, they are also marginally normal. However, as the next example shows, the converse is not true.

2. Joint normality can be defined as follows: Two RVs $\mathbf{x}$ and $\mathbf{y}$ are jointly normal if the sum $a\mathbf{x} + b\mathbf{y}$ is normal for every $a$ and $b$ [see (8-56)].

Example 6-1. We shall construct two RVs $\mathbf{x}_1$ and $\mathbf{y}_1$ that are marginally but not jointly normal. We start with two jointly normal RVs $\mathbf{x}$ and $\mathbf{y}$ with density $f(x, y)$ as in (6-15). Adding and subtracting small masses in the region $D$ of Fig. 6-2 consisting of four circles as shown, we obtain a new function $f_1(x, y)$ such that $f_1(x, y) = f(x, y) \pm \varepsilon$ in $D$ and $f_1(x, y) = f(x, y)$ everywhere else. The function $f_1(x, y)$ so formed is a density; hence it defines two new RVs $\mathbf{x}_1$ and $\mathbf{y}_1$. These RVs are not jointly normal because $f_1(x, y)$ is not of the form (6-15).

FIGURE 6-2

We maintain,
however, that they are marginally normal. Indeed, the densities of $\mathbf{x}_1$ and $\mathbf{y}_1$ are determined by the masses in the vertical strip $x < \mathbf{x}_1 \le x + dx$ and the horizontal strip $y < \mathbf{y}_1 \le y + dy$. As we see from the figure, the masses in these strips have not changed. This shows that $\mathbf{x}_1$ and $\mathbf{y}_1$ are normal because $\mathbf{x}$ and $\mathbf{y}$ are normal.

Discrete type. Suppose that the RVs $\mathbf{x}$ and $\mathbf{y}$ are of discrete type taking the values $x_i$ and $y_k$ with respective probabilities

$$P\{\mathbf{x} = x_i\} = p_i \qquad P\{\mathbf{y} = y_k\} = q_k \tag{6-18}$$

Their joint statistics are determined in terms of the joint probabilities

$$P\{\mathbf{x} = x_i, \mathbf{y} = y_k\} = p_{ik} \tag{6-19}$$

Clearly,

$$\sum_{i,k} p_{ik} = 1 \tag{6-20}$$

because, as $i$ and $k$ take all possible values, the events $\{\mathbf{x} = x_i, \mathbf{y} = y_k\}$ are mutually exclusive and their union equals the certain event.

We maintain that the marginal probabilities $p_i$ and $q_k$ can be expressed in terms of the joint probabilities $p_{ik}$:

$$p_i = \sum_k p_{ik} \qquad q_k = \sum_i p_{ik} \tag{6-21}$$

This is the discrete version of (6-10).

Proof. The events $\{\mathbf{y} = y_k\}$ form a partition of $\mathcal{S}$. Hence as $k$ ranges over all possible values, the events $\{\mathbf{x} = x_i, \mathbf{y} = y_k\}$ are mutually exclusive and their union equals $\{\mathbf{x} = x_i\}$. This yields the first equation in (6-21) [see (2-36)]. The proof of the second is similar.

Probability Masses

The probability that the point $(\mathbf{x}, \mathbf{y})$ is in a region $D$ of the plane can be interpreted as the probability mass in this region. Thus the mass in the entire plane equals 1. The mass in the half-plane $\mathbf{x} \le x$ to the left of the line $L_x$ of Fig. 6-3 equals $F_x(x)$. The mass in the half-plane $\mathbf{y} \le y$ below the line $L_y$ equals $F_y(y)$. The mass in the cross-hatched quadrant $\{\mathbf{x} \le x, \mathbf{y} \le y\}$ equals $F(x, y)$.

FIGURE 6-3
Finally, the mass in the clear quadrant $\{\mathbf{x} > x, \mathbf{y} > y\}$ equals

$$P\{\mathbf{x} > x, \mathbf{y} > y\} = 1 - F_x(x) - F_y(y) + F(x, y) \tag{6-22}$$

The probability mass in a region $D$ equals the integral [see (6-7)]

$$\iint_D f(x, y)\,dx\,dy$$

If, therefore, $f(x, y)$ is a bounded function, it can be interpreted as surface mass density.

Example 6-2. Suppose that

$$f(x, y) = \frac{1}{2\pi\sigma^2}e^{-(x^2 + y^2)/2\sigma^2} \tag{6-23}$$

We shall find the mass $m$ in the circle $x^2 + y^2 \le a^2$. Inserting (6-23) into (6-7) and using the transformation

$$x = r\cos\theta \qquad y = r\sin\theta$$

we obtain

$$m = \frac{1}{2\pi\sigma^2}\int_0^a\int_{-\pi}^{\pi} e^{-r^2/2\sigma^2}\,r\,d\theta\,dr = 1 - e^{-a^2/2\sigma^2} \tag{6-24}$$

POINT MASSES. If the RVs $\mathbf{x}$ and $\mathbf{y}$ are of discrete type taking the values $x_i$ and $y_k$, then the probability masses are 0 everywhere except at the points $(x_i, y_k)$. We have, thus, only point masses and the mass at each point equals $p_{ik}$ [see (6-19)]. The probability $p_i = P\{\mathbf{x} = x_i\}$ equals the sum of all masses $p_{ik}$ on the line $x = x_i$ in agreement with (6-21).

If $i = 1, \ldots, M$ and $k = 1, \ldots, N$, then the number of possible point masses on the plane equals $MN$. However, as the next example shows, some of these masses might be 0.

Example 6-3. (a) In the fair-die experiment, $\mathbf{x}$ equals the number of dots shown and $\mathbf{y}$ equals twice this number:

$$\mathbf{x}(f_i) = i \qquad \mathbf{y}(f_i) = 2i \qquad i = 1, \ldots, 6$$

In other words, $x_i = i$, $y_k = 2k$ and

$$p_{ik} = P\{\mathbf{x} = i, \mathbf{y} = 2k\} = \begin{cases} 1/6 & i = k \\ 0 & i \ne k \end{cases}$$

Thus there are masses only on the six points $(i, 2i)$ and the mass of each point equals 1/6 (Fig. 6-4a).

(b) We toss the die twice obtaining the 36 outcomes $f_i f_k$ and we define $\mathbf{x}$ and $\mathbf{y}$ such that $\mathbf{x}$ equals the first number that shows and $\mathbf{y}$ the second:

$$\mathbf{x}(f_i f_k) = i \qquad \mathbf{y}(f_i f_k) = k \qquad i, k = 1, \ldots, 6$$

Thus $x_i = i$, $y_k = k$, and $p_{ik} = 1/36$. We have, therefore, 36 point masses (Fig. 6-4b) and the mass of each point equals 1/36. On the line $x = i$ there are six points with total mass 1/6.
FIGURE 6-4

(c) Again the die is tossed twice but now

$$\mathbf{x}(f_i f_k) = |i - k| \qquad \mathbf{y}(f_i f_k) = i + k$$

In this case, $\mathbf{x}$ takes the values $0, 1, \ldots, 5$ and $\mathbf{y}$ the values $2, 3, \ldots, 12$. The number of possible points equals $6 \times 11 = 66$; however, only 21 have positive masses (Fig. 6-4c). Specifically, if $\mathbf{x} = 0$, then $\mathbf{y} = 2$, or $4, \ldots,$ or 12 because if $\mathbf{x} = 0$, then $i = k$ and $\mathbf{y} = 2i$. There are, therefore, six mass points on this line and the mass of each point equals 1/36. If $\mathbf{x} = 1$, then $\mathbf{y} = 3$, or $5, \ldots,$ or 11. There are, therefore, five mass points on the line $x = 1$ and the mass of each point equals 2/36. For example, if $\mathbf{x} = 1$ and $\mathbf{y} = 7$, then $i = 3$, $k = 4$, or $i = 4$, $k = 3$; hence $P\{\mathbf{x} = 1, \mathbf{y} = 7\} = 2/36$.

LINE MASSES. The following cases lead to line masses:

1. If $\mathbf{x}$ is of discrete type taking the values $x_i$ and $\mathbf{y}$ is of continuous type, then all probability masses are on the vertical lines $x = x_i$ (Fig. 6-5a). In particular, the mass between the points $y_1$ and $y_2$ on the line $x = x_i$ equals the probability of the event

$$\{\mathbf{x} = x_i, y_1 \le \mathbf{y} \le y_2\}$$

FIGURE 6-5
2. If $\mathbf{y} = g(\mathbf{x})$, then all the masses are on the curve $y = g(x)$. In this case, $F(x, y)$ can be expressed in terms of $F_x(x)$. For example, with $x$ and $y$ as in Fig. 6-5b, $F(x, y)$ equals the masses on the curve $y = g(x)$ to the left of the point $A$ and between $B$ and $C$ (heavy line). The masses to the left of $A$ equal $F_x(x_1)$. The masses between $B$ and $C$ equal $F_x(x_3) - F_x(x_2)$. Hence

$$F(x, y) = F_x(x_1) + F_x(x_3) - F_x(x_2) \qquad y = g(x_1) = g(x_2) = g(x_3)$$

3. If $\mathbf{x} = g(\mathbf{z})$ and $\mathbf{y} = h(\mathbf{z})$, then all probability masses are on the curve $x = g(z)$, $y = h(z)$ specified parametrically. For example, if $g(z) = \cos z$, $h(z) = \sin z$, then the curve is a circle (Fig. 6-5c). In this case, the joint statistics of $\mathbf{x}$ and $\mathbf{y}$ can be expressed in terms of $F_z(z)$.

Independence

Two RVs $\mathbf{x}$ and $\mathbf{y}$ are called (statistically) independent if the events $\{\mathbf{x} \in A\}$ and $\{\mathbf{y} \in B\}$ are independent [see (2-40)], that is, if

$$P\{\mathbf{x} \in A, \mathbf{y} \in B\} = P\{\mathbf{x} \in A\}P\{\mathbf{y} \in B\} \tag{6-25}$$

where $A$ and $B$ are two arbitrary sets on the $x$ and $y$ axes respectively.

Applying the above to the events $\{\mathbf{x} \le x\}$ and $\{\mathbf{y} \le y\}$, we conclude that, if the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent, then

$$F(x, y) = F_x(x)F_y(y) \tag{6-26}$$

Hence

$$f(x, y) = f_x(x)f_y(y) \tag{6-27}$$

It can be shown that, if (6-26) or (6-27) is true, then (6-25) is also true; that is, the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent [see (6-7)].

If the RVs $\mathbf{x}$ and $\mathbf{y}$ are of discrete type as in (6-19) and independent, then

$$p_{ik} = p_i q_k \tag{6-28}$$

This follows if we apply (6-25) to the events $\{\mathbf{x} = x_i\}$ and $\{\mathbf{y} = y_k\}$.

Example 6-4 Buffon's needle. A fine needle of length $2a$ is dropped at random on a board covered with parallel lines distance $2b$ apart where $b > a$ as in Fig. 6-6a. We shall show that the probability $p$ that the needle intersects one of the lines equals $2a/\pi b$.

In terms of RVs the above experiment can be phrased as follows: We denote by $\mathbf{x}$ the distance from the center of the needle to the nearest line and by $\boldsymbol{\theta}$ the angle between the needle and the direction perpendicular to the lines. We assume that the RVs $\mathbf{x}$ and $\boldsymbol{\theta}$ are independent, $\mathbf{x}$ is uniform in the interval $(0, b)$, and $\boldsymbol{\theta}$ is uniform in the interval $(0, \pi/2)$. From this it follows that

$$f(x, \theta) = f_x(x)f_\theta(\theta) = \frac{1}{b}\,\frac{2}{\pi} \qquad 0 \le x \le b \qquad 0 \le \theta \le \frac{\pi}{2}$$

and 0 elsewhere. Hence the probability that the point $(\mathbf{x}, \boldsymbol{\theta})$ is in a region $D$ included in the rectangle $R$ of Fig. 6-6b equals the area of $D$ times $2/\pi b$.
FIGURE 6-6

The needle intersects the lines if $\mathbf{x} < a\cos\boldsymbol{\theta}$. Hence $p$ equals the shaded area of Fig. 6-6b times $2/\pi b$:

$$p = P\{\mathbf{x} < a\cos\boldsymbol{\theta}\} = \frac{2}{\pi b}\int_0^{\pi/2} a\cos\theta\,d\theta = \frac{2a}{\pi b}$$

The above can be used to determine experimentally the number $\pi$ using the relative frequency interpretation of $p$: If the needle is dropped $n$ times and it intersects the lines $n_i$ times, then

$$\frac{n_i}{n} \simeq p = \frac{2a}{\pi b} \qquad \text{hence} \qquad \pi \simeq \frac{2an}{bn_i}$$

THEOREM. If the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent, then the RVs

$$\mathbf{z} = g(\mathbf{x}) \qquad \mathbf{w} = h(\mathbf{y})$$

are also independent.

Proof. We denote by $A_z$ the set of points on the $x$ axis such that $g(x) \le z$ and by $B_w$ the set of points on the $y$ axis such that $h(y) \le w$. Clearly,

$$\{\mathbf{z} \le z\} = \{\mathbf{x} \in A_z\} \qquad \{\mathbf{w} \le w\} = \{\mathbf{y} \in B_w\} \tag{6-29}$$

Therefore the events $\{\mathbf{z} \le z\}$ and $\{\mathbf{w} \le w\}$ are independent because the events $\{\mathbf{x} \in A_z\}$ and $\{\mathbf{y} \in B_w\}$ are independent.

INDEPENDENT EXPERIMENTS. As in the case of events (Sec. 3-1), the concept of independence is important in the study of RVs defined on product spaces. Suppose that the RV $\mathbf{x}$ is defined on a space $\mathcal{S}_1$ consisting of the outcomes $\xi_1$ and the RV $\mathbf{y}$ is defined on a space $\mathcal{S}_2$ consisting of the outcomes $\xi_2$. In the combined experiment $\mathcal{S}_1 \times \mathcal{S}_2$ the RVs $\mathbf{x}$ and $\mathbf{y}$ are such that

$$\mathbf{x}(\xi_1\xi_2) = \mathbf{x}(\xi_1) \qquad \mathbf{y}(\xi_1\xi_2) = \mathbf{y}(\xi_2) \tag{6-30}$$

In other words, $\mathbf{x}$ depends on the outcomes of $\mathcal{S}_1$ only, and $\mathbf{y}$ depends on the outcomes of $\mathcal{S}_2$ only.
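Returning to Buffon's needle (Example 6-4), the relative-frequency estimate of $\pi$ can be tried out with a short simulation. The sketch below is only illustrative: the function name `estimate_pi_buffon`, the values $a = 1$, $b = 2$, the number of drops, and the seed are assumptions made here, not part of the text. Note also that the sampler itself uses `math.pi` to draw the angle, so this checks the formula $p = 2a/\pi b$ rather than computing $\pi$ from scratch.

```python
import math
import random


def estimate_pi_buffon(n_drops, a=1.0, b=2.0, seed=0):
    """Simulate n_drops needle drops (needle length 2a, line spacing 2b, a < b).

    Returns (estimate of pi, relative frequency of intersections).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_drops):
        x = rng.uniform(0.0, b)                 # distance from needle center to nearest line
        theta = rng.uniform(0.0, math.pi / 2)   # angle to the perpendicular direction
        if x < a * math.cos(theta):             # intersection condition of Example 6-4
            hits += 1
    p_hat = hits / n_drops
    pi_hat = 2.0 * a * n_drops / (b * hits)     # pi ~ 2an / (b n_i)
    return pi_hat, p_hat


pi_hat, p_hat = estimate_pi_buffon(200_000)
print("relative frequency:", p_hat, " theoretical p:", 2 * 1.0 / (math.pi * 2.0))
print("estimate of pi:", pi_hat)
```

With a few hundred thousand drops the estimate is typically correct to about two decimal places; the error shrinks only like $1/\sqrt{n}$, which is why this is a demonstration rather than a practical way to compute $\pi$.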
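THEOREM. If the experiments $\mathcal{S}_1$ and $\mathcal{S}_2$ are independent, then the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent.

Proof. We denote by $\mathcal{A}_x$ the set $\{\mathbf{x} \le x\}$ in $\mathcal{S}_1$ and by $\mathcal{B}_y$ the set $\{\mathbf{y} \le y\}$ in $\mathcal{S}_2$. In the space $\mathcal{S}_1 \times \mathcal{S}_2$,

$$\{\mathbf{x} \le x\} = \mathcal{A}_x \times \mathcal{S}_2 \qquad \{\mathbf{y} \le y\} = \mathcal{S}_1 \times \mathcal{B}_y$$

From the independence of the two experiments, it follows that [see (3-4)] the events $\mathcal{A}_x \times \mathcal{S}_2$ and $\mathcal{S}_1 \times \mathcal{B}_y$ are independent. Hence the events $\{\mathbf{x} \le x\}$ and $\{\mathbf{y} \le y\}$ are also independent.

Example 6-5. A die with $P\{f_i\} = p_i$ is tossed twice and the RVs $\mathbf{x}$ and $\mathbf{y}$ are such that

$$\mathbf{x}(f_i f_k) = i \qquad \mathbf{y}(f_i f_k) = k$$

Thus $\mathbf{x}$ equals the first number that shows and $\mathbf{y}$ equals the second; hence the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent. This leads to the conclusion that

$$p_{ik} = P\{\mathbf{x} = i, \mathbf{y} = k\} = p_i p_k$$

Circular Symmetry

We say that the joint density of two RVs $\mathbf{x}$ and $\mathbf{y}$ is circularly symmetrical if it depends only on the distance from the origin, that is, if

$$f(x, y) = g(r) \qquad r = \sqrt{x^2 + y^2} \tag{6-31}$$

THEOREM. If the RVs $\mathbf{x}$ and $\mathbf{y}$ are circularly symmetrical and independent, then they are normal with zero mean and equal variance.

Proof. From (6-31) and (6-27) it follows that

$$g\left(\sqrt{x^2 + y^2}\right) = f_x(x)f_y(y) \tag{6-32}$$

Since

$$\frac{\partial g(r)}{\partial x} = \frac{dg(r)}{dr}\,\frac{\partial r}{\partial x} \qquad \text{and} \qquad \frac{\partial r}{\partial x} = \frac{x}{r}$$

we conclude, differentiating (6-32) with respect to $x$, that

$$\frac{x}{r}g'(r) = f_x'(x)f_y(y)$$

Dividing both sides by $xg(r) = xf_x(x)f_y(y)$, we obtain

$$\frac{1}{r}\,\frac{g'(r)}{g(r)} = \frac{1}{x}\,\frac{f_x'(x)}{f_x(x)} \tag{6-33}$$

The right side above is independent of $y$ and the left side is a function of $r$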
$= \sqrt{x^2 + y^2}$. This shows that both sides are independent of $x$ and $y$. Hence

$$\frac{1}{r}\,\frac{g'(r)}{g(r)} = \alpha = \text{constant}$$

From this it follows that

$$\frac{d\ln g(r)}{dr} = \alpha r \qquad g(r) = Ae^{\alpha r^2/2}$$

and (6-31) yields

$$f(x, y) = g\left(\sqrt{x^2 + y^2}\right) = Ae^{\alpha(x^2 + y^2)/2} \tag{6-34}$$

Thus the RVs $\mathbf{x}$ and $\mathbf{y}$ are normal with zero mean and variance $\sigma^2 = -1/\alpha$.

6-2 ONE FUNCTION OF TWO RANDOM VARIABLES

Given two RVs $\mathbf{x}$ and $\mathbf{y}$ and a function $g(x, y)$, we form the RV

$$\mathbf{z} = g(\mathbf{x}, \mathbf{y})$$

We shall express the statistics of $\mathbf{z}$ in terms of the function $g(x, y)$ and the joint statistics of $\mathbf{x}$ and $\mathbf{y}$.

With $z$ a given number, we denote by $D_z$ the region of the $xy$ plane such that $g(x, y) \le z$. This region might not be simply connected (Fig. 6-7). Clearly,

$$\{\mathbf{z} \le z\} = \{g(\mathbf{x}, \mathbf{y}) \le z\} = \{(\mathbf{x}, \mathbf{y}) \in D_z\}$$

Hence [see (6-7)]

$$F_z(z) = P\{\mathbf{z} \le z\} = P\{(\mathbf{x}, \mathbf{y}) \in D_z\} = \iint_{D_z} f(x, y)\,dx\,dy \tag{6-35}$$

Thus, to determine $F_z(z)$, it suffices to find the region $D_z$ for every $z$ and to evaluate the above integral.

The density of $\mathbf{z}$ can be determined similarly. With $\Delta D_z$ the region of the $xy$ plane such that $z < g(x, y) \le z + dz$, we have

$$\{z < \mathbf{z} \le z + dz\} = \{(\mathbf{x}, \mathbf{y}) \in \Delta D_z\}$$

FIGURE 6-7

FIGURE 6-8
Hence

$$f_z(z)\,dz = P\{z < \mathbf{z} \le z + dz\} = \iint_{\Delta D_z} f(x, y)\,dx\,dy \tag{6-36}$$

Illustrations

In the following, we use (6-35) and (6-36) to find the statistics of various functions of $\mathbf{x}$ and $\mathbf{y}$.

1. $\mathbf{z} = \mathbf{x} + \mathbf{y}$

The region $D_z$ of the $xy$ plane such that $x + y \le z$ is the shaded part of Fig. 6-8 to the left of the line $x + y = z$. Integrating over suitable strips, we obtain

$$F_z(z) = \int_{-\infty}^{\infty}\int_{-\infty}^{z - y} f(x, y)\,dx\,dy \tag{6-37}$$

We can find $f_z(z)$ either by differentiating $F_z(z)$ or directly from (6-36). The region $\Delta D_z$ such that $z < x + y \le z + dz$ is a diagonal strip bounded by the lines $x + y = z$ and $x + y = z + dz$. The coordinates of a point of this region are $z - y$, $y$ and the area of a differential element equals $dy\,dz$. Hence

$$f_z(z)\,dz = \int_{-\infty}^{\infty} f(z - y, y)\,dy\,dz \tag{6-38}$$

INDEPENDENCE AND CONVOLUTION. If the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent, then

$$f(x, y) = f_x(x)f_y(y)$$

Inserting into (6-38), we obtain

$$f_z(z) = \int_{-\infty}^{\infty} f_x(z - y)f_y(y)\,dy \tag{6-39}$$

The above integral is the convolution of the functions $f_x(x)$ and $f_y(y)$. We thus reach the following fundamental conclusion: If two RVs are independent, then the density of their sum equals the convolution of their densities.

We note that, if $f_x(x) = 0$ for $x < 0$ and $f_y(y) = 0$ for $y < 0$, then $f_z(z) = 0$ for $z < 0$ and

$$f_z(z) = \int_0^z f_x(z - y)f_y(y)\,dy \qquad z > 0 \tag{6-40}$$

Example 6-6. It follows from (6-39) that the convolution of two rectangles is a trapezoid. Hence, if the RVs $\mathbf{x}$ and $\mathbf{y}$ are uniform in the intervals $(a, b)$ and $(c, d)$ respectively, then the density of their sum $\mathbf{z} = \mathbf{x} + \mathbf{y}$ is a trapezoid as in Fig. 6-9a. If, in particular, $b - a = d - c$, then $f_z(z)$ is a triangle as in Fig. 6-9b.
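FIGURE 6-9

Suppose, for example, that resistors $\mathbf{r}_1$ and $\mathbf{r}_2$ are two independent RVs uniform between 900 and 1100 Ω. From the above it follows that, if they are connected in series, the density of the resulting resistor $\mathbf{r} = \mathbf{r}_1 + \mathbf{r}_2$ is a triangle between 1800 and 2200 Ω. In particular, the probability that $\mathbf{r}$ is between 1900 and 2100 Ω equals 0.75.

Example 6-7. If the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent and

$$f_x(x) = \alpha e^{-\alpha x}U(x) \qquad f_y(y) = \beta e^{-\beta y}U(y)$$

(Fig. 6-10) then for $z > 0$,

$$f_z(z) = \alpha\beta\int_0^z e^{-\alpha(z - y)}e^{-\beta y}\,dy = \begin{cases} \dfrac{\alpha\beta}{\beta - \alpha}\left(e^{-\alpha z} - e^{-\beta z}\right) & \beta \ne \alpha \\[2mm] \alpha^2 z e^{-\alpha z} & \beta = \alpha \end{cases} \tag{6-41}$$

FIGURE 6-10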
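The resistor figures in Example 6-6 can be checked by evaluating the convolution integral (6-39) numerically. The following is a minimal sketch using a midpoint rule; the helper names `uniform_pdf`, `convolve_at`, and `prob_between` and the grid sizes are choices made here for illustration only.

```python
def uniform_pdf(a, b):
    """Density of an RV uniform in the interval (a, b)."""
    def f(x):
        return 1.0 / (b - a) if a <= x <= b else 0.0
    return f


def convolve_at(fx, fy, z, lo, hi, n=1000):
    """Midpoint-rule value of (6-39): integral of fx(z - y) fy(y) dy over (lo, hi).

    (lo, hi) must cover the support of fy.
    """
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        y = lo + (k + 0.5) * h
        total += fx(z - y) * fy(y)
    return total * h


def prob_between(fz, z1, z2, n=200):
    """Midpoint-rule value of P{z1 < z <= z2} for a density fz."""
    h = (z2 - z1) / n
    return h * sum(fz(z1 + (k + 0.5) * h) for k in range(n))


f_r1 = uniform_pdf(900.0, 1100.0)
f_r2 = uniform_pdf(900.0, 1100.0)


def f_r(z):
    """Density of the series resistance r = r1 + r2."""
    return convolve_at(f_r1, f_r2, z, 900.0, 1100.0)


peak = f_r(2000.0)                      # apex of the triangle, 1/200
p = prob_between(f_r, 1900.0, 2100.0)   # should be close to 0.75
print(peak, p)
```

Because the uniform densities are discontinuous, the midpoint rule converges only linearly here, so the printed probability agrees with 0.75 to roughly three decimal places rather than to machine precision.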
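FIGURE 6-11

2. $\mathbf{z} = \mathbf{x}/\mathbf{y}$

The region $D_z$ of the $xy$ plane such that $x/y \le z$ is the shaded part of Fig. 6-11. Integrating over suitable strips, we obtain

$$F_z(z) = \int_0^{\infty}\int_{-\infty}^{yz} f(x, y)\,dx\,dy + \int_{-\infty}^{0}\int_{yz}^{\infty} f(x, y)\,dx\,dy \tag{6-42}$$

The region $\Delta D_z$ such that $z < x/y \le z + dz$ is a triangular sector bounded by the lines $x = yz$ and $x = y(z + dz)$. The coordinates of a point in this region are $zy$, $y$ and the area of a differential element equals $|y|\,dy\,dz$. Inserting into (6-36) and canceling $dz$, we obtain

$$f_z(z) = \int_{-\infty}^{\infty} |y|\,f(zy, y)\,dy \tag{6-43}$$

Normal densities. We maintain that, if the RVs $\mathbf{x}$ and $\mathbf{y}$ are jointly normal with zero mean

$$f(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - r^2}}\exp\left\{-\frac{1}{2(1 - r^2)}\left[\frac{x^2}{\sigma_1^2} - 2r\frac{xy}{\sigma_1\sigma_2} + \frac{y^2}{\sigma_2^2}\right]\right\} \tag{6-44}$$

then their ratio $\mathbf{z} = \mathbf{x}/\mathbf{y}$ has a Cauchy density centered at $r\sigma_1/\sigma_2$:

$$f_z(z) = \frac{\sigma_1\sigma_2\sqrt{1 - r^2}/\pi}{\sigma_2^2(z - r\sigma_1/\sigma_2)^2 + \sigma_1^2(1 - r^2)} \tag{6-45}$$

Proof. Inserting (6-44) into (6-43) and using the fact that $f(-x, -y) = f(x, y)$, we obtain

$$f_z(z) = \frac{2}{2\pi\sigma_1\sigma_2\sqrt{1 - r^2}}\int_0^{\infty} y\exp\left\{-\frac{y^2}{2(1 - r^2)}\left[\frac{z^2}{\sigma_1^2} - 2r\frac{z}{\sigma_1\sigma_2} + \frac{1}{\sigma_2^2}\right]\right\}dy$$

and (6-45) results because the above integral equals $(1 - r^2)$ divided by the quantity in brackets.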
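As a sanity check on the proof above, the integral (6-43) can be evaluated numerically for the zero-mean jointly normal density (6-44) and compared with the closed form (6-45). The sketch below uses a midpoint rule; the function names, the truncation of the integral at $\pm 12\sigma_2$, and the parameter values $\sigma_1 = 2$, $\sigma_2 = 1$, $r = 0.5$ are assumptions made here for illustration.

```python
import math


def joint_normal_pdf(x, y, s1, s2, r):
    """Zero-mean jointly normal density (6-44)."""
    q = (x * x / (s1 * s1) - 2.0 * r * x * y / (s1 * s2) + y * y / (s2 * s2))
    q /= 2.0 * (1.0 - r * r)
    return math.exp(-q) / (2.0 * math.pi * s1 * s2 * math.sqrt(1.0 - r * r))


def ratio_density_numeric(z, s1, s2, r, n=20000):
    """Midpoint-rule value of (6-43): integral of |y| f(zy, y) dy."""
    limit = 12.0 * s2            # the integrand is negligible beyond this range
    h = 2.0 * limit / n
    total = 0.0
    for k in range(n):
        y = -limit + (k + 0.5) * h
        total += abs(y) * joint_normal_pdf(z * y, y, s1, s2, r)
    return total * h


def ratio_density_cauchy(z, s1, s2, r):
    """Closed-form Cauchy density (6-45), centered at r*s1/s2."""
    num = s1 * s2 * math.sqrt(1.0 - r * r) / math.pi
    den = s2 * s2 * (z - r * s1 / s2) ** 2 + s1 * s1 * (1.0 - r * r)
    return num / den


s1, s2, r = 2.0, 1.0, 0.5
for z in (-1.0, 0.0, 1.0, 3.0):
    print(z, ratio_density_numeric(z, s1, s2, r), ratio_density_cauchy(z, s1, s2, r))
```

The two columns should agree to several decimal places, and the closed form peaks at $z = r\sigma_1/\sigma_2 = 1$ for these parameter values.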
Integrating (6-45) from $-\infty$ to $z$, we obtain the corresponding distribution function

$$F_z(z) = \frac{1}{2} + \frac{1}{\pi}\arctan\frac{\sigma_2 z - r\sigma_1}{\sigma_1\sqrt{1 - r^2}} \tag{6-46}$$

Quadrant masses. Using (6-46), we shall show that the probability masses $m_1$, $m_2$, $m_3$, $m_4$ in the four quadrants of the $xy$ plane are given by

$$m_1 = m_3 = \frac{1}{4} + \frac{\alpha}{2\pi} \qquad m_2 = m_4 = \frac{1}{4} - \frac{\alpha}{2\pi} \tag{6-47}$$

where (Fig. 6-12)

$$\alpha = \arcsin r = \arctan\frac{r}{\sqrt{1 - r^2}} \qquad -\frac{\pi}{2} < \alpha < \frac{\pi}{2}$$

Proof. The second and fourth quadrants form the region of the plane such that $x/y < 0$. The probability that the point $(\mathbf{x}, \mathbf{y})$ is in this region equals, therefore, the probability that the RV $\mathbf{z} = \mathbf{x}/\mathbf{y}$ is negative. Hence

$$m_2 + m_4 = P\{\mathbf{z} \le 0\} = F_z(0) = \frac{1}{2} - \frac{1}{\pi}\arctan\frac{r}{\sqrt{1 - r^2}}$$

and (6-47) results because

$$m_1 = m_3 \qquad m_2 = m_4 \qquad m_1 + m_2 + m_3 + m_4 = 1$$

This useful result could have been obtained by integrating $f(x, y)$ in each quadrant; the above method is, however, simpler.

3. $\mathbf{z} = \sqrt{\mathbf{x}^2 + \mathbf{y}^2}$

The region $D_z$ is the circle $x^2 + y^2 \le z^2$ and $F_z(z)$ equals the probability masses in this circle. If $f(x, y) = g(r)$ is circularly symmetrical, then

$$F_z(z) = 2\pi\int_0^z rg(r)\,dr \qquad z > 0$$
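Normal densities. (a) If

$$f(x, y) = \frac{1}{2\pi\sigma^2}e^{-(x^2 + y^2)/2\sigma^2} \tag{6-48}$$

then

$$F_z(z) = \frac{1}{\sigma^2}\int_0^z re^{-r^2/2\sigma^2}\,dr = 1 - e^{-z^2/2\sigma^2} \qquad z > 0 \tag{6-49}$$

Hence

$$f_z(z) = \frac{z}{\sigma^2}e^{-z^2/2\sigma^2}U(z) \tag{6-50}$$

Thus, if the RVs $\mathbf{x}$ and $\mathbf{y}$ are normal, independent with zero mean and equal variance, then the RV $\mathbf{z} = \sqrt{\mathbf{x}^2 + \mathbf{y}^2}$ has a Rayleigh density.

(b) Suppose now that

$$f(x, y) = \frac{1}{2\pi\sigma^2}e^{-[(x - \eta_x)^2 + (y - \eta_y)^2]/2\sigma^2} \tag{6-51}$$

The region $\Delta D_z$ of the plane such that $z < \sqrt{x^2 + y^2} \le z + dz$ is a circular ring with inner radius $z$ and thickness $dz$. With

$$x = z\cos\theta \qquad y = z\sin\theta \qquad dx\,dy = z\,dz\,d\theta$$

it follows that

$$f_z(z)\,dz = \iint_{\Delta D_z} f(x, y)\,dx\,dy = \frac{z\,dz}{2\pi\sigma^2}\int_0^{2\pi} e^{-[(z\cos\theta - \eta_x)^2 + (z\sin\theta - \eta_y)^2]/2\sigma^2}\,d\theta$$

Hence, with $\eta = \sqrt{\eta_x^2 + \eta_y^2}$,

$$f_z(z) = \frac{z}{2\pi\sigma^2}e^{-(z^2 + \eta^2)/2\sigma^2}\int_0^{2\pi} e^{z(\eta_x\cos\theta + \eta_y\sin\theta)/\sigma^2}\,d\theta$$

Since $\eta_x\cos\theta + \eta_y\sin\theta = \eta\cos(\theta - \varphi)$ for a suitable angle $\varphi$ and the integration extends over a full period, this yields

$$f_z(z) = \frac{z}{\sigma^2}I_0\!\left(\frac{z\eta}{\sigma^2}\right)e^{-(z^2 + \eta^2)/2\sigma^2} \qquad z > 0 \tag{6-52}$$

where

$$I_0(x) = \frac{1}{2\pi}\int_0^{2\pi} e^{x\cos\theta}\,d\theta \tag{6-53}$$

is the modified Bessel function.

Example 6-8. Consider the sine wave

$$\mathbf{x}\cos\omega t + \mathbf{y}\sin\omega t = \mathbf{r}\cos(\omega t + \boldsymbol{\theta})$$

Since $\mathbf{r} = \sqrt{\mathbf{x}^2 + \mathbf{y}^2}$, it follows from the above that, if the RVs $\mathbf{x}$ and $\mathbf{y}$ are normal as in (6-48), then the density of $\mathbf{r}$ is Rayleigh as in (6-50).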
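The density (6-52) and the integral representation (6-53) lend themselves to a quick numerical check. The sketch below computes $I_0$ directly from (6-53) with a midpoint rule, verifies that (6-52) has unit area, and confirms that it reduces to the Rayleigh density (6-50) when $\eta = 0$. The function names, the choice $\eta = 2$, $\sigma = 1$, and the truncation of the area integral at $z = 12$ are assumptions made here.

```python
import math


def bessel_i0(x, n=200):
    """Modified Bessel function I_0 from (6-53), by the midpoint rule on (0, 2*pi)."""
    h = 2.0 * math.pi / n
    total = sum(math.exp(x * math.cos((k + 0.5) * h)) for k in range(n))
    return total * h / (2.0 * math.pi)


def rayleigh_pdf(z, sigma):
    """Rayleigh density (6-50)."""
    if z < 0.0:
        return 0.0
    return (z / sigma ** 2) * math.exp(-z * z / (2.0 * sigma ** 2))


def offset_rayleigh_pdf(z, eta, sigma):
    """Density (6-52) of sqrt(x^2 + y^2) when the means satisfy eta = sqrt(eta_x^2 + eta_y^2)."""
    if z < 0.0:
        return 0.0
    s2 = sigma ** 2
    return (z / s2) * bessel_i0(z * eta / s2) * math.exp(-(z * z + eta * eta) / (2.0 * s2))


def area(pdf, lo, hi, n=1200):
    """Midpoint-rule integral of a density over (lo, hi)."""
    h = (hi - lo) / n
    return h * sum(pdf(lo + (k + 0.5) * h) for k in range(n))


total_mass = area(lambda z: offset_rayleigh_pdf(z, 2.0, 1.0), 0.0, 12.0)
print("I0(0) =", bessel_i0(0.0), " I0(1) =", bessel_i0(1.0))
print("area under (6-52):", total_mass)
print("eta = 0 check:", offset_rayleigh_pdf(1.0, 0.0, 1.0), rayleigh_pdf(1.0, 1.0))
```

The midpoint rule is a natural choice for (6-53) because the integrand is smooth and periodic, so even a modest number of points gives essentially full double precision.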
FIGURE 6-13

4. $\mathbf{z} = \max(\mathbf{x}, \mathbf{y})$ $\qquad$ $\mathbf{w} = \min(\mathbf{x}, \mathbf{y})$

(a) The region $D_z$ of the $xy$ plane such that $\max(x, y) \le z$ is the set of points such that $x \le z$ and $y \le z$ (shaded in Fig. 6-13a). Hence

$$F_z(z) = F_{xy}(z, z) \tag{6-54}$$

If the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent, then

$$F_z(z) = F_x(z)F_y(z) \qquad f_z(z) = f_x(z)F_y(z) + f_y(z)F_x(z) \tag{6-55}$$

(b) The region $D_w$ of the $xy$ plane such that $\min(x, y) \le w$ is the set of points such that $x \le w$ or $y \le w$ (shaded in Fig. 6-13b). Hence

$$F_w(w) = F_x(w) + F_y(w) - F_{xy}(w, w) \tag{6-56}$$

If the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent, then it is simpler to express the result in terms of the reliability function

$$R_x(x) = P\{\mathbf{x} > x\} = 1 - F_x(x) \tag{6-57}$$

Defining $R_y(y)$ and $R_w(w)$ similarly, we conclude from (6-56) that

$$R_w(w) = R_x(w)R_y(w) \qquad f_w(w) = f_x(w)R_y(w) + f_y(w)R_x(w) \tag{6-58}$$

Discrete type. If the RVs $\mathbf{x}$ and $\mathbf{y}$ are of discrete type taking the values $x_i$ and $y_k$, then the RV $\mathbf{z} = g(\mathbf{x}, \mathbf{y})$ is also of discrete type taking the values $z_r = g(x_i, y_k)$. The probability that $\mathbf{z} = z_r$ equals the sum of the point masses on the curve $g(x, y) = z_r$.

Example 6-9. A fair die is tossed twice and the RVs $\mathbf{x}$ and $\mathbf{y}$ are such that

$$\mathbf{x}(f_i f_k) = i \qquad \mathbf{y}(f_i f_k) = k$$

The $xy$ plane has 36 equal point masses as in Fig. 6-14. The RV $\mathbf{z} = \mathbf{x} + \mathbf{y}$ takes the values $z_r = x_i + y_k$ with probabilities $p_r = m/36$ where $m$ is the number of points on the line $x + y = z_r$. As we see from the figure

$$\begin{array}{c|ccccccccccc} z_r & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 \\ \hline p_r & \frac{1}{36} & \frac{2}{36} & \frac{3}{36} & \frac{4}{36} & \frac{5}{36} & \frac{6}{36} & \frac{5}{36} & \frac{4}{36} & \frac{3}{36} & \frac{2}{36} & \frac{1}{36} \end{array}$$

For example, there are four mass points on the line $x + y = 5$; hence $P\{\mathbf{z} = 5\} = 4/36$.
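For discrete RVs the rule "sum the point masses on the curve $g(x, y) = z_r$" is easy to carry out by enumeration. The sketch below does this for the two-dice experiment of Example 6-9 using exact fractions; the helper name `pmf_of_function` is an assumption introduced here, and the same helper is reused for the maximum and minimum of illustration 4.

```python
from collections import defaultdict
from fractions import Fraction


def pmf_of_function(g, faces=range(1, 7)):
    """Probabilities of z = g(x, y) for two tosses of a fair die.

    Each of the len(faces)**2 points (i, k) carries the same mass, and the
    mass of every point on the curve g(i, k) = z_r is added to P{z = z_r}.
    """
    point_mass = Fraction(1, len(faces) ** 2)
    pmf = defaultdict(Fraction)
    for i in faces:
        for k in faces:
            pmf[g(i, k)] += point_mass
    return dict(pmf)


pmf_sum = pmf_of_function(lambda i, k: i + k)   # z = x + y, as in Example 6-9
pmf_max = pmf_of_function(max)                  # z = max(x, y)
pmf_min = pmf_of_function(min)                  # w = min(x, y)

for z_r in sorted(pmf_sum):
    print(z_r, pmf_sum[z_r])
```

Note that `Fraction` prints values in lowest terms, so the entry for $z_r = 5$ appears as 1/9 rather than 4/36.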
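6-3 TWO FUNCTIONS OF TWO RANDOM VARIABLES

Given two RVs $\mathbf{x}$ and $\mathbf{y}$ and two functions $g(x, y)$ and $h(x, y)$, we form the RVs

$$\mathbf{z} = g(\mathbf{x}, \mathbf{y}) \qquad \mathbf{w} = h(\mathbf{x}, \mathbf{y}) \tag{6-59}$$

We shall express the joint statistics of $\mathbf{z}$ and $\mathbf{w}$ in terms of the functions $g(x, y)$ and $h(x, y)$ and the joint statistics of $\mathbf{x}$ and $\mathbf{y}$.

With $z$ and $w$ two given numbers, we denote by $D_{zw}$ the region of the $xy$ plane such that $g(x, y) \le z$ and $h(x, y) \le w$. Clearly,

$$\{\mathbf{z} \le z, \mathbf{w} \le w\} = \{(\mathbf{x}, \mathbf{y}) \in D_{zw}\}$$

Hence [see (6-7)]

$$F_{zw}(z, w) = P\{(\mathbf{x}, \mathbf{y}) \in D_{zw}\} = \iint_{D_{zw}} f_{xy}(x, y)\,dx\,dy \tag{6-60}$$

Suppose, for example, that

$$\mathbf{z} = \sqrt{\mathbf{x}^2 + \mathbf{y}^2} \qquad \mathbf{w} = \mathbf{y}/\mathbf{x} \tag{6-61}$$

In this case, the set $D_{zw}$ such that

$$\sqrt{x^2 + y^2} \le z \qquad y/x \le w$$

is the shaded region of Fig. 6-15a, and $F_{zw}(z, w)$ equals the mass in this region.

Example 6-10. If

$$f_{xy}(x, y) = \frac{1}{2\pi\sigma^2}e^{-(x^2 + y^2)/2\sigma^2} \qquad \mathbf{z} = \sqrt{\mathbf{x}^2 + \mathbf{y}^2} \qquad \mathbf{w} = \mathbf{y}/\mathbf{x}$$

then [see (6-49)] the mass in the circle $x^2 + y^2 \le z^2$ equals $1 - e^{-z^2/2\sigma^2}$. Since $f_{xy}(x, y)$ has circular symmetry we conclude that for $z > 0$:

$$F_{zw}(z, w) = \frac{2\theta}{2\pi}\left(1 - e^{-z^2/2\sigma^2}\right) \qquad \theta = \frac{\pi}{2} + \arctan w$$

and $F_{zw}(z, w) = 0$ for $z < 0$. This is a product of a function of $z$ times a function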
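of $w$. Hence the RVs $\mathbf{z}$ and $\mathbf{w}$ are independent with

$$F_z(z) = \left(1 - e^{-z^2/2\sigma^2}\right)U(z) \qquad F_w(w) = \frac{1}{2} + \frac{1}{\pi}\arctan w$$

In other words, $\mathbf{z}$ has a Rayleigh density and $\mathbf{w}$ has a Cauchy density [see (5-17)] as in Fig. 6-15b.

FIGURE 6-15

Joint Density

We shall determine the joint density of the RVs

$$\mathbf{z} = g(\mathbf{x}, \mathbf{y}) \qquad \mathbf{w} = h(\mathbf{x}, \mathbf{y})$$

in terms of the joint density of $\mathbf{x}$ and $\mathbf{y}$.

Fundamental theorem. To find $f_{zw}(z, w)$, we solve the system

$$g(x, y) = z \qquad h(x, y) = w \tag{6-62}$$

Denoting by $(x_n, y_n)$ its real roots

$$g(x_n, y_n) = z \qquad h(x_n, y_n) = w$$

we maintain that

$$f_{zw}(z, w) = \frac{f_{xy}(x_1, y_1)}{|J(x_1, y_1)|} + \cdots + \frac{f_{xy}(x_n, y_n)}{|J(x_n, y_n)|} + \cdots \tag{6-63}$$

where

$$J(x, y) = \begin{vmatrix} \dfrac{\partial z}{\partial x} & \dfrac{\partial z}{\partial y} \\[2mm] \dfrac{\partial w}{\partial x} & \dfrac{\partial w}{\partial y} \end{vmatrix} = \begin{vmatrix} \dfrac{\partial x}{\partial z} & \dfrac{\partial x}{\partial w} \\[2mm] \dfrac{\partial y}{\partial z} & \dfrac{\partial y}{\partial w} \end{vmatrix}^{-1} \tag{6-64}$$

is the Jacobian of the transformation (6-62).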
FIGURE 6-16

Proof. We denote by $\Delta D_{zw}$ the region in the $xy$ plane such that

$$z < g(x, y) \le z + dz \qquad w < h(x, y) \le w + dw$$

This region consists of differential parallelograms, one for each $(x_n, y_n)$ as in Fig. 6-16. The area of each parallelogram equals $dz\,dw/|J(x_n, y_n)|$ and its mass equals

$$f_{xy}(x_n, y_n)\,dz\,dw/|J(x_n, y_n)|$$

Since $f_{zw}(z, w)\,dz\,dw$ equals the mass in $\Delta D_{zw}$, we conclude, summing the masses in all parallelograms, that

$$f_{zw}(z, w)\,dz\,dw = \frac{f_{xy}(x_1, y_1)\,dz\,dw}{|J(x_1, y_1)|} + \cdots + \frac{f_{xy}(x_n, y_n)\,dz\,dw}{|J(x_n, y_n)|} + \cdots$$

and (6-63) results.

If the system (6-62) has no solutions in some region of the $zw$ plane, then $f_{zw}(z, w) = 0$ in that region.

We shall illustrate the above theorem with two special cases.

LINEAR TRANSFORMATION

$$\mathbf{z} = a\mathbf{x} + b\mathbf{y} \qquad \mathbf{w} = c\mathbf{x} + d\mathbf{y} \tag{6-65}$$

If $ad - bc \ne 0$, then the system $ax + by = z$, $cx + dy = w$ has one and only one solution

$$x = Az + Bw \qquad y = Cz + Dw$$

Since $J(x, y) = ad - bc$, (6-63) yields

$$f_{zw}(z, w) = \frac{1}{|ad - bc|}f_{xy}(Az + Bw, Cz + Dw) \tag{6-66}$$

Joint normality. From (6-66) it follows that if the RVs $\mathbf{x}$ and $\mathbf{y}$ are jointly normal and

$$\mathbf{z} = a\mathbf{x} + b\mathbf{y} \qquad \mathbf{w} = c\mathbf{x} + d\mathbf{y}$$

then $\mathbf{z}$ and $\mathbf{w}$ are also jointly normal.
Proof. Joint normality means that $f_{xy}(x, y)$ is an exponential whose exponent is a quadratic in $x$ and $y$. If, in this quadratic, we replace $x$ by $Az + Bw$ and $y$ by $Cz + Dw$ as in (6-66), then an exponential results whose exponent is a quadratic in $z$ and $w$. This shows that the RVs $\mathbf{z}$ and $\mathbf{w}$ are jointly, and therefore also marginally, normal.

From the above it follows that, if $\mathbf{x}$ and $\mathbf{y}$ are jointly normal and $\mathbf{z} = \mathbf{x} + \mathbf{y}$, then $\mathbf{z}$ is normal. We should emphasize, however, that if $\mathbf{x}$ and $\mathbf{y}$ are marginally but not jointly normal, then $\mathbf{z}$ is not, in general, normal. We give next a counterexample.

Example 6-11. We shall construct two marginally normal RVs $\mathbf{x}_1$ and $\mathbf{y}_1$ such that their sum $\mathbf{z}_1 = \mathbf{x}_1 + \mathbf{y}_1$ is not normal: We start with two jointly normal RVs $\mathbf{x}$ and $\mathbf{y}$ and add and subtract masses on the four circles of Fig. 6-17. The resulting mass distribution specifies the joint density of the RVs $\mathbf{x}_1$ and $\mathbf{y}_1$. As we have shown in Example 6-1, these RVs are marginally normal. However, their sum $\mathbf{z}_1$ is not normal.

Rotation. A special case of (6-65) is the transformation

$$\mathbf{z} = \mathbf{x}\cos\varphi + \mathbf{y}\sin\varphi \qquad \mathbf{w} = -\mathbf{x}\sin\varphi + \mathbf{y}\cos\varphi \tag{6-67}$$

In this case $a = d = \cos\varphi$, $b = -c = \sin\varphi$, and $ad - bc = 1$. Hence

$$x = z\cos\varphi - w\sin\varphi \qquad y = z\sin\varphi + w\cos\varphi$$

and (6-66) yields

$$f_{zw}(z, w) = f_{xy}(z\cos\varphi - w\sin\varphi, z\sin\varphi + w\cos\varphi) \tag{6-68}$$

Thus, if two RVs are rotated by an angle $\varphi$, their probability masses are rotated in the opposite direction by the same angle.

Circular symmetry. If $f_{xy}(x, y)$ is circularly symmetrical as in (6-31), then

$$f_{xy}(x, y) = f_{xy}(x\cos\varphi - y\sin\varphi, x\sin\varphi + y\cos\varphi) \tag{6-69}$$

because

$$(x\cos\varphi - y\sin\varphi)^2 + (x\sin\varphi + y\cos\varphi)^2 = x^2 + y^2$$

Hence [see (6-68)]

$$f_{zw}(z, w) = f_{xy}(z, w) = g\left(\sqrt{z^2 + w^2}\right) \tag{6-70}$$

FIGURE 6-17
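Conversely, if the RVs $\mathbf{x}, \mathbf{y}$ and $\mathbf{z}, \mathbf{w}$ have the same statistics for every $\varphi$, then their joint density is circularly symmetrical. From (6-34) it follows that if $\mathbf{x}$ and $\mathbf{y}$ are also independent, then they are normal.

POLAR COORDINATES. Consider the RVs
$$\mathbf{r} = \sqrt{\mathbf{x}^2 + \mathbf{y}^2} \qquad \boldsymbol{\varphi} = \arctan\frac{\mathbf{y}}{\mathbf{x}} \tag{6-71}$$
where we assume that $r > 0$ and $-\pi < \varphi < \pi$. With this assumption, the system $\sqrt{x^2 + y^2} = r$, $\arctan y/x = \varphi$ has a single solution
$$x = r\cos\varphi \qquad y = r\sin\varphi \qquad \text{for } r > 0$$
Since [see (6-64)]
$$J(x, y) = \begin{vmatrix} \cos\varphi & -r\sin\varphi \\ \sin\varphi & r\cos\varphi \end{vmatrix}^{-1} = \frac{1}{r}$$
we conclude from (6-63) that
$$f_{r\varphi}(r, \varphi) = r f_{xy}(r\cos\varphi,\; r\sin\varphi) \qquad r > 0 \tag{6-72}$$
and 0 for $r < 0$.

Example 6-12. We shall show that if
$$\mathbf{x}\cos\omega t + \mathbf{y}\sin\omega t = \mathbf{r}\cos(\omega t - \boldsymbol{\varphi}) \qquad |\boldsymbol{\varphi}| < \pi$$
and the RVs $\mathbf{x}$ and $\mathbf{y}$ are $N(0, \sigma)$ and independent, then the RVs $\mathbf{r}$ and $\boldsymbol{\varphi}$ are independent, $\boldsymbol{\varphi}$ is uniform in the interval $(-\pi, \pi)$, and $\mathbf{r}$ has a Rayleigh distribution.

Proof. Since $x = r\cos\varphi$, $y = r\sin\varphi$, and
$$f_{xy}(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-(x^2 + y^2)/2\sigma^2}$$
(6-72) yields
$$f_{r\varphi}(r, \varphi) = \frac{r}{2\pi\sigma^2}\, e^{-r^2/2\sigma^2} \qquad r > 0 \quad |\varphi| < \pi$$
and 0 otherwise. This is a product of a function of $r$ times a function of $\varphi$. Hence the RVs $\mathbf{r}$ and $\boldsymbol{\varphi}$ are independent with
$$f_r(r) = \frac{r}{\sigma^2}\, e^{-r^2/2\sigma^2} \qquad f_\varphi(\varphi) = \frac{1}{2\pi}$$
for $r > 0$, $-\pi < \varphi < \pi$, and 0 otherwise. The proportionality factors are so chosen as to make the area of each term equal to 1.

From the above it follows that, if the RVs $\mathbf{r}$ and $\boldsymbol{\varphi}$ are independent, $\mathbf{r}$ has a Rayleigh distribution, and $\boldsymbol{\varphi}$ is uniform in the interval $(-\pi, \pi)$, then the RVs
$$\mathbf{x} = \mathbf{r}\cos\boldsymbol{\varphi} \qquad \mathbf{y} = \mathbf{r}\sin\boldsymbol{\varphi}$$
are $N(0, \sigma)$ and independent.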
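The conclusion of Example 6-12 can be probed by simulation. The sketch below is only an illustration: the function name `polar_sample_stats`, the seed, and the sample size are assumptions of this example, and `math.atan2` is used as the four-quadrant form of $\arctan y/x$. It compares the sample mean of $\mathbf{r}$ with the Rayleigh mean $\sigma\sqrt{\pi/2}$ and the sample variance of $\boldsymbol{\varphi}$ with $\pi^2/3$, the variance of a uniform RV on $(-\pi, \pi)$. A sample covariance near zero is consistent with independence but does not by itself prove it.

```python
import math
import random


def polar_sample_stats(sigma=2.0, n=100_000, seed=0):
    """Draw independent N(0, sigma) pairs and summarize r and phi.

    Returns (mean of r, mean of phi, variance of phi, covariance of r and phi).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    s_r = s_phi = s_phi2 = s_rphi = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, sigma)
        y = rng.gauss(0.0, sigma)
        r = math.hypot(x, y)    # r = sqrt(x^2 + y^2)
        phi = math.atan2(y, x)  # angle in (-pi, pi]
        s_r += r
        s_phi += phi
        s_phi2 += phi * phi
        s_rphi += r * phi
    mean_r = s_r / n
    mean_phi = s_phi / n
    var_phi = s_phi2 / n - mean_phi ** 2
    cov = s_rphi / n - mean_r * mean_phi
    return mean_r, mean_phi, var_phi, cov


mean_r, mean_phi, var_phi, cov = polar_sample_stats()
print(mean_r, 2.0 * math.sqrt(math.pi / 2))  # sample mean vs. Rayleigh mean
print(var_phi, math.pi ** 2 / 3)             # sample variance vs. uniform variance
print(cov)                                   # should be close to 0
```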
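Auxiliary variables. The density of one function $\mathbf{z} = g(\mathbf{x}, \mathbf{y})$ of two RVs can be determined from (6-63), where $\mathbf{w}$ is a conveniently chosen auxiliary variable, for example $\mathbf{w} = \mathbf{x}$ or $\mathbf{w} = \mathbf{y}$. The density of $\mathbf{z}$ is then found by integrating the function $f_{zw}(z, w)$ so obtained with respect to $w$.

Example 6-13. We shall find the density of the RV $\mathbf{z} = a\mathbf{x} + b\mathbf{y}$ using as auxiliary variable the function $\mathbf{w} = \mathbf{y}$. The system $z = ax + by$, $w = y$ has a single solution: $x = (z - bw)/a$, $y = w$. Since
$$J(x, y) = \begin{vmatrix} a & b \\ 0 & 1 \end{vmatrix} = a$$
it follows from (6-63) that
$$f_{zw}(z, w) = \frac{1}{|a|}\, f_{xy}\!\left(\frac{z - bw}{a},\; w\right)$$
Hence
$$f_z(z) = \frac{1}{|a|} \int_{-\infty}^{\infty} f_{xy}\!\left(\frac{z - bw}{a},\; w\right) dw \tag{6-73}$$

Example 6-14. With
$$\mathbf{z} = \mathbf{x}\mathbf{y} \qquad \mathbf{w} = \mathbf{x}$$
the system $xy = z$, $x = w$ has a single solution: $x = w$, $y = z/w$. In this case, $J = -w$ and (6-63) yields
$$f_{zw}(z, w) = \frac{1}{|w|}\, f_{xy}\!\left(w,\; \frac{z}{w}\right)$$
Hence the density of the RV $\mathbf{z} = \mathbf{x}\mathbf{y}$ is given by
$$f_z(z) = \int_{-\infty}^{\infty} \frac{1}{|w|}\, f_{xy}\!\left(w,\; \frac{z}{w}\right) dw \tag{6-74}$$

FIGURE 6-18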
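Formula (6-74) lends itself to a direct numerical check. The following sketch is an illustration under stated assumptions: $\mathbf{x}$ and $\mathbf{y}$ are taken independent and uniform in $(0, 1)$, the integral is approximated with a simple midpoint rule, and the helper names `uniform01` and `product_density` are hypothetical. For this particular choice the integrand equals $1/w$ on $z < w < 1$, so the result should agree with $-\ln z$.

```python
import math


def uniform01(t):
    """Density of an RV that is uniform in the interval (0, 1)."""
    return 1.0 if 0.0 < t < 1.0 else 0.0


def product_density(z, fx, fy, lo=0.0, hi=1.0, steps=20_000):
    """Midpoint-rule approximation of (6-74) for independent x and y.

    The joint density is taken as fx(w) * fy(z / w); the interval (lo, hi)
    must cover the support of fx. Midpoints are used so that w is never 0.
    """
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        w = lo + (i + 0.5) * h
        total += fx(w) * fy(z / w) / abs(w)
    return total * h


for z in (0.1, 0.3, 0.9):
    print(z, product_density(z, uniform01, uniform01), -math.log(z))
```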
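Special case. We now assume that the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent and each is uniform in the interval $(0, 1)$. In this case,
$$f_{xy}\!\left(w,\; \frac{z}{w}\right) = 1$$
in the triangle $z < w < 1$, $0 < z < 1$ (shaded in Fig. 6-18) and 0 elsewhere. Inserting into (6-74), we obtain
$$f_z(z) = \int_z^1 \frac{1}{w}\, dw = \begin{cases} -\ln z & 0 < z < 1 \\ 0 & \text{elsewhere} \end{cases} \tag{6-75}$$

Example 6-15. An RV $\mathbf{z}$ has a Student-$t$ distribution $t(n)$ with $n$ degrees of freedom if
$$f_z(z) = \frac{\gamma_1}{(1 + z^2/n)^{(n+1)/2}} \qquad \gamma_1 = \frac{\Gamma[(n+1)/2]}{\sqrt{\pi n}\,\Gamma(n/2)} \tag{6-76}$$
We shall show that if $\mathbf{x}$ and $\mathbf{y}$ are two independent RVs, $\mathbf{x}$ is $N(0, 1)$, and $\mathbf{y}$ is $\chi^2(n)$:
$$f_x(x) \sim e^{-x^2/2} \qquad f_y(y) \sim y^{n/2 - 1} e^{-y/2}\, U(y)$$
then the RV
$$\mathbf{z} = \frac{\mathbf{x}}{\sqrt{\mathbf{y}/n}}$$
has a $t(n)$ distribution.

Proof. We introduce the RV $\mathbf{w} = \mathbf{y}$ and use (6-63) with
$$x = z\sqrt{\frac{w}{n}} \qquad y = w \qquad J = \sqrt{\frac{n}{w}}$$
This yields
$$f_{zw}(z, w) \sim \sqrt{w}\, e^{-z^2 w/2n}\, w^{n/2 - 1} e^{-w/2}\, U(w) = w^{(n-1)/2} \exp\left\{-\frac{w}{2}\left(1 + \frac{z^2}{n}\right)\right\} U(w)$$
Integrating with respect to $w$, we obtain
$$f_z(z) \sim \int_0^\infty w^{(n-1)/2} \exp\left\{-\frac{w}{2}\left(1 + \frac{z^2}{n}\right)\right\} dw \sim \frac{1}{(1 + z^2/n)^{(n+1)/2}}$$
and (6-76) results because
$$\int_0^\infty w^{a-1} e^{-bw}\, dw = \frac{\Gamma(a)}{b^a}$$
The constant $\gamma_1$ is determined from (4-18).

PROBLEMS

6-1. If $\mathbf{x}$ and $\mathbf{y}$ are the zero-one RVs associated with the events $\mathcal{A}$ and $\mathcal{B}$ respectively, (a) find the probability masses in the $xy$ plane and (b) show that the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent iff the events $\mathcal{A}$ and $\mathcal{B}$ are independent.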
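6-2. The RVs $\mathbf{x}$ and $\mathbf{y}$ are independent and $\mathbf{z} = \mathbf{x} + \mathbf{y}$. Find $f_y(y)$ if
$$f_x(x) = c e^{-cx}\, U(x) \qquad f_z(z) = c^2 z e^{-cz}\, U(z)$$

6-3. The RVs $\mathbf{x}$ and $\mathbf{y}$ are independent and $\mathbf{y}$ is uniform in the interval $(0, 1)$. Show that, if $\mathbf{z} = \mathbf{x} + \mathbf{y}$, then
$$f_z(z) = F_x(z) - F_x(z - 1)$$

6-4. (a) The function $g(x)$ is monotone increasing and $\mathbf{y} = g(\mathbf{x})$. Show that
$$F_{xy}(x, y) = \begin{cases} F_x(x) & \text{if } y > g(x) \\ F_y(y) & \text{if } y < g(x) \end{cases}$$
(b) Find $F_{xy}(x, y)$ if $g(x)$ is monotone decreasing.

6-5. Express $F_{zw}(z, w)$ in terms of $F_{xy}(x, y)$ if $\mathbf{z} = \max(\mathbf{x}, \mathbf{y})$, $\mathbf{w} = \min(\mathbf{x}, \mathbf{y})$.

6-6. The RVs $\mathbf{x}$ and $\mathbf{y}$ are $N(0, 2)$ and independent. Find $f_z(z)$ and $F_z(z)$ if (a) $\mathbf{z} = 2\mathbf{x} + 3\mathbf{y}$, and (b) $\mathbf{z} = \mathbf{x}/\mathbf{y}$.

6-7. The RVs $\mathbf{x}$ and $\mathbf{y}$ are independent with [densities not recoverable from the source]. Show that the RV $\mathbf{z} = \mathbf{x}\mathbf{y}$ is $N(0, \alpha)$.

6-8. The RVs $\mathbf{x}$ and $\mathbf{y}$ are independent with Rayleigh densities
$$f_x(x) = \frac{x}{\alpha^2}\, e^{-x^2/2\alpha^2}\, U(x) \qquad f_y(y) = \frac{y}{\beta^2}\, e^{-y^2/2\beta^2}\, U(y)$$
(a) Show that if $\mathbf{z} = \mathbf{x}/\mathbf{y}$, then
$$f_z(z) = \frac{2\alpha^2}{\beta^2}\, \frac{z}{(z^2 + \alpha^2/\beta^2)^2}\, U(z) \tag{i}$$
(b) Using (i), show that for any $k > 0$,
$$P\{\mathbf{x} \le k\mathbf{y}\} = \frac{k^2}{k^2 + \alpha^2/\beta^2}$$

6-9. The RVs $\mathbf{x}$ and $\mathbf{y}$ are independent with exponential densities
$$f_x(x) = \alpha e^{-\alpha x}\, U(x) \qquad f_y(y) = \beta e^{-\beta y}\, U(y)$$
Find the densities of the following RVs:
1. $2\mathbf{x} + \mathbf{y}$
2. $\mathbf{x} - \mathbf{y}$
3. [item not recoverable from the source]
4. $\max(\mathbf{x}, \mathbf{y})$
5. $\min(\mathbf{x}, \mathbf{y})$

6-10. The RVs $\mathbf{x}$ and $\mathbf{y}$ are independent and each is uniform in the interval $(0, a)$. Find the density of the RV $\mathbf{z} = |\mathbf{x} - \mathbf{y}|$.

6-11. Show that (a) the convolution of two normal densities is a normal density, and (b) the convolution of two Cauchy densities is a Cauchy density.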
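6-12. The RVs $\mathbf{x}$ and $\boldsymbol{\theta}$ are independent and $\boldsymbol{\theta}$ is uniform in the interval $(-\pi, \pi)$. Show that if $\mathbf{z} = \mathbf{x}\cos(\omega t + \boldsymbol{\theta})$, then [equation not recoverable from the source].

6-13. The RVs $\mathbf{x}$ and $\mathbf{y}$ are independent, $\mathbf{x}$ is $N(0, \sigma)$, and $\mathbf{y}$ is uniform in the interval $(0, \pi)$. Show that if $\mathbf{z} = \mathbf{x} + a\cos\mathbf{y}$, then
$$f_z(z) = \frac{1}{\pi\sigma\sqrt{2\pi}} \int_{-a}^{a} \frac{1}{\sqrt{a^2 - y^2}}\, e^{-(z - y)^2/2\sigma^2}\, dy$$

6-14. The RVs $\mathbf{x}$ and $\mathbf{y}$ are of discrete type, independent, with $P\{\mathbf{x} = n\} = a_n$, $P\{\mathbf{y} = n\} = b_n$, $n = 0, 1, \ldots$. Show that, if $\mathbf{z} = \mathbf{x} + \mathbf{y}$, then
$$P\{\mathbf{z} = n\} = \sum_{k=0}^{n} a_k b_{n-k}$$

6-15. The RV $\mathbf{x}$ is of discrete type taking the values $x_n$ with $P\{\mathbf{x} = x_n\} = p_n$ and the RV $\mathbf{y}$ is of continuous type and independent of $\mathbf{x}$. Show that if $\mathbf{z} = \mathbf{x} + \mathbf{y}$ and $\mathbf{w} = \mathbf{x}\mathbf{y}$, then
$$f_z(z) = \sum_n f_y(z - x_n)\, p_n \qquad f_w(w) = \sum_n \frac{1}{|x_n|}\, f_y\!\left(\frac{w}{x_n}\right) p_n$$

6-16. The RVs $\mathbf{x}$ and $\mathbf{y}$ are normal, independent, with the same variance. Show that, if $\mathbf{z} = \sqrt{\mathbf{x}^2 + \mathbf{y}^2}$, then $f_z(z)$ is given by (6-52) where $\eta = \sqrt{\eta_x^2 + \eta_y^2}$.

6-17. The RVs $\mathbf{x}_1$ and $\mathbf{x}_2$ are jointly normal with zero mean. Show that their density can be written in the form
$$f(x_1, x_2) = \frac{1}{2\pi\sqrt{\Delta}} \exp\left\{-\frac{1}{2} X C^{-1} X^t\right\} \qquad C = \begin{bmatrix} \mu_{11} & \mu_{12} \\ \mu_{21} & \mu_{22} \end{bmatrix}$$
where $X = [x_1, x_2]$, $\mu_{ij} = E\{\mathbf{x}_i \mathbf{x}_j\}$, and $\Delta = \mu_{11}\mu_{22} - \mu_{12}^2$.

6-18. Show that if the RVs $\mathbf{x}$ and $\mathbf{y}$ are normal and independent, then
$$P\{\mathbf{x}\mathbf{y} < 0\} = G\!\left(\frac{\eta_x}{\sigma_x}\right) + G\!\left(\frac{\eta_y}{\sigma_y}\right) - 2G\!\left(\frac{\eta_x}{\sigma_x}\right) G\!\left(\frac{\eta_y}{\sigma_y}\right)$$

6-19. The RVs $\mathbf{x}$ and $\mathbf{y}$ are independent with respective densities $\chi^2(m)$ and $\chi^2(n)$. Show that if
$$\mathbf{z} = \frac{\mathbf{x}/m}{\mathbf{y}/n} \qquad \text{then} \qquad f_z(z) = \gamma\, \frac{z^{m/2 - 1}}{\sqrt{(1 + mz/n)^{m+n}}}\, U(z)$$
This distribution is denoted by $F(m, n)$ and is called the Snedecor $F$ distribution. It is used in hypothesis testing (see Prob. 9-34).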
CHAPTER 7
MOMENTS AND CONDITIONAL STATISTICS

7-1 JOINT MOMENTS

Given two RVs $\mathbf{x}$ and $\mathbf{y}$ and a function $g(x, y)$, we form the RV $\mathbf{z} = g(\mathbf{x}, \mathbf{y})$. The expected value of this RV is given by
$$E\{\mathbf{z}\} = \int_{-\infty}^{\infty} z f_z(z)\, dz \tag{7-1}$$
However, as the next theorem shows, $E\{\mathbf{z}\}$ can be expressed directly in terms of the function $g(x, y)$ and the joint density $f(x, y)$ of $\mathbf{x}$ and $\mathbf{y}$.

THEOREM
$$E\{g(\mathbf{x}, \mathbf{y})\} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y) f(x, y)\, dx\, dy \tag{7-2}$$

Proof. The proof is similar to the proof of (5-29). We denote by $\Delta D_z$ the region of the $xy$ plane such that $z < g(x, y) < z + dz$. Thus to each differential in (7-1) there corresponds a region $\Delta D_z$ in the $xy$ plane. As $dz$ covers the $z$ axis, the regions $\Delta D_z$ are not overlapping and they cover the entire $xy$ plane. Hence the integrals in (7-1) and (7-2) are equal.

We note that the expected value of $g(\mathbf{x})$ can be determined either from (7-2) or from (5-29) as a single integral
$$E\{g(\mathbf{x})\} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x) f(x, y)\, dx\, dy = \int_{-\infty}^{\infty} g(x) f_x(x)\, dx$$
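This is consistent with the relationship (6-10) between marginal and joint densities.

If the RVs $\mathbf{x}$ and $\mathbf{y}$ are of discrete type taking the values $x_i$ and $y_k$ with probability $p_{ik}$ as in (6-19), then
$$E\{g(\mathbf{x}, \mathbf{y})\} = \sum_{i,k} g(x_i, y_k)\, p_{ik} \tag{7-3}$$

Linearity. From (7-2) it follows that
$$E\left\{\sum_{k=1}^{n} a_k g_k(\mathbf{x}, \mathbf{y})\right\} = \sum_{k=1}^{n} a_k E\{g_k(\mathbf{x}, \mathbf{y})\} \tag{7-4}$$
This fundamental result will be used extensively. We note in particular that
$$E\{\mathbf{x} + \mathbf{y}\} = E\{\mathbf{x}\} + E\{\mathbf{y}\} \tag{7-5}$$
Thus the expected value of the sum of two RVs equals the sum of their expected values. We should stress, however, that in general
$$E\{\mathbf{x}\mathbf{y}\} \ne E\{\mathbf{x}\}E\{\mathbf{y}\}$$

Frequency interpretation. As in (5-26)
$$E\{\mathbf{x} + \mathbf{y}\} \simeq \frac{\mathbf{x}(\zeta_1) + \mathbf{y}(\zeta_1) + \cdots + \mathbf{x}(\zeta_n) + \mathbf{y}(\zeta_n)}{n} = \frac{\mathbf{x}(\zeta_1) + \cdots + \mathbf{x}(\zeta_n)}{n} + \frac{\mathbf{y}(\zeta_1) + \cdots + \mathbf{y}(\zeta_n)}{n} \simeq E\{\mathbf{x}\} + E\{\mathbf{y}\}$$
However, in general,
$$E\{\mathbf{x}\mathbf{y}\} \simeq \frac{\mathbf{x}(\zeta_1)\mathbf{y}(\zeta_1) + \cdots + \mathbf{x}(\zeta_n)\mathbf{y}(\zeta_n)}{n} \ne \frac{\mathbf{x}(\zeta_1) + \cdots + \mathbf{x}(\zeta_n)}{n} \times \frac{\mathbf{y}(\zeta_1) + \cdots + \mathbf{y}(\zeta_n)}{n} \simeq E\{\mathbf{x}\}E\{\mathbf{y}\}$$

Covariance. The covariance $C$ or $C_{xy}$ of two RVs $\mathbf{x}$ and $\mathbf{y}$ is by definition the number
$$C = E\{(\mathbf{x} - \eta_x)(\mathbf{y} - \eta_y)\} \tag{7-6}$$
where $E\{\mathbf{x}\} = \eta_x$ and $E\{\mathbf{y}\} = \eta_y$. Expanding the product in (7-6) and using (7-4) we obtain
$$C = E\{\mathbf{x}\mathbf{y}\} - E\{\mathbf{x}\}E\{\mathbf{y}\} \tag{7-7}$$

Correlation coefficient. The correlation coefficient $r$ or $r_{xy}$ of the RVs $\mathbf{x}$ and $\mathbf{y}$ is by definition the ratio
$$r = \frac{C}{\sigma_x \sigma_y} \tag{7-8}$$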
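For RVs of discrete type, (7-3), (7-7), and (7-8) translate almost line by line into code. The sketch below is a minimal illustration; the function names and the small joint probability table are hypothetical and not taken from the text.

```python
import math


def expectation(g, xs, ys, p):
    """E{g(x, y)} for a discrete joint pmf, as in (7-3).

    p[i][k] is the probability that x = xs[i] and y = ys[k].
    """
    return sum(g(x, y) * p[i][k]
               for i, x in enumerate(xs)
               for k, y in enumerate(ys))


def covariance_and_r(xs, ys, p):
    """Return the covariance C of (7-7) and the correlation coefficient r of (7-8)."""
    eta_x = expectation(lambda x, y: x, xs, ys, p)
    eta_y = expectation(lambda x, y: y, xs, ys, p)
    c = expectation(lambda x, y: x * y, xs, ys, p) - eta_x * eta_y
    var_x = expectation(lambda x, y: (x - eta_x) ** 2, xs, ys, p)
    var_y = expectation(lambda x, y: (y - eta_y) ** 2, xs, ys, p)
    return c, c / math.sqrt(var_x * var_y)


# Hypothetical joint probabilities of two zero-one RVs
xs, ys = [0, 1], [0, 1]
p = [[0.4, 0.1],
     [0.1, 0.4]]
print(covariance_and_r(xs, ys, p))
```

For this table $E\{\mathbf{x}\} = E\{\mathbf{y}\} = 0.5$ and $E\{\mathbf{x}\mathbf{y}\} = 0.4$, so $C = 0.15$ and $r = 0.15/0.25 = 0.6$, up to floating-point rounding.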
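We maintain that
$$|r| \le 1 \qquad |C| \le \sigma_x \sigma_y \tag{7-9}$$

Proof. Clearly,
$$E\{[a(\mathbf{x} - \eta_x) + (\mathbf{y} - \eta_y)]^2\} = a^2\sigma_x^2 + 2aC + \sigma_y^2 \tag{7-10}$$
The above is a nonnegative quadratic for any $a$; hence its discriminant cannot be positive. In other words,
$$C^2 - \sigma_x^2\sigma_y^2 \le 0 \tag{7-11}$$
and (7-9) results.

We note that the RVs $\mathbf{x}$, $\mathbf{y}$ and $\mathbf{x} - \eta_x$, $\mathbf{y} - \eta_y$ have the same covariance and correlation coefficient.

Example 7-1. We shall show that the correlation coefficient of two jointly normal RVs is the parameter $r$ in (6-15). It suffices to assume that $\eta_x = \eta_y = 0$ and to show that $E\{\mathbf{x}\mathbf{y}\} = r\sigma_1\sigma_2$. Since
$$\frac{x^2}{\sigma_1^2} - \frac{2rxy}{\sigma_1\sigma_2} + \frac{y^2}{\sigma_2^2} = \left(\frac{x}{\sigma_1} - \frac{ry}{\sigma_2}\right)^2 + (1 - r^2)\frac{y^2}{\sigma_2^2}$$
we conclude with (6-44) that
$$E\{\mathbf{x}\mathbf{y}\} = \frac{1}{\sigma_2\sqrt{2\pi}} \int_{-\infty}^{\infty} y e^{-y^2/2\sigma_2^2} \int_{-\infty}^{\infty} \frac{x}{\sigma_1\sqrt{2\pi(1 - r^2)}} \exp\left\{-\frac{(x - ry\sigma_1/\sigma_2)^2}{2\sigma_1^2(1 - r^2)}\right\} dx\, dy$$
The inner integral is a normal density with mean $ry\sigma_1/\sigma_2$ multiplied by $x$; hence it equals $ry\sigma_1/\sigma_2$. This yields
$$E\{\mathbf{x}\mathbf{y}\} = \frac{r\sigma_1/\sigma_2}{\sigma_2\sqrt{2\pi}} \int_{-\infty}^{\infty} y^2 e^{-y^2/2\sigma_2^2}\, dy = r\sigma_1\sigma_2$$

Uncorrelatedness. Two RVs are called uncorrelated if their covariance is 0. This can be phrased in the following equivalent forms
$$C = 0 \qquad r = 0 \qquad E\{\mathbf{x}\mathbf{y}\} = E\{\mathbf{x}\}E\{\mathbf{y}\}$$

Orthogonality. Two RVs are called orthogonal if
$$E\{\mathbf{x}\mathbf{y}\} = 0$$
We shall use the notation $\mathbf{x} \perp \mathbf{y}$ to indicate that the RVs $\mathbf{x}$ and $\mathbf{y}$ are orthogonal.

Note (a) If $\mathbf{x}$ and $\mathbf{y}$ are uncorrelated, then $\mathbf{x} - \eta_x \perp \mathbf{y} - \eta_y$. (b) If $\mathbf{x}$ and $\mathbf{y}$ are uncorrelated and $\eta_x = 0$ or $\eta_y = 0$, then $\mathbf{x} \perp \mathbf{y}$.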
Vector space of random variables. We shall find it convenient to interpret RVs as vectors in an abstract space. In this space, the second moment $E\{\mathbf{x}\mathbf{y}\}$ of the RVs $\mathbf{x}$ and $\mathbf{y}$ is by definition their inner product and $E\{\mathbf{x}^2\}$ and $E\{\mathbf{y}^2\}$ are the squares of their lengths. The ratio
$$\frac{E\{\mathbf{x}\mathbf{y}\}}{\sqrt{E\{\mathbf{x}^2\}E\{\mathbf{y}^2\}}}$$
is the cosine of their angle. We maintain that
$$E^2\{\mathbf{x}\mathbf{y}\} \le E\{\mathbf{x}^2\}E\{\mathbf{y}^2\} \tag{7-12}$$
This is the cosine inequality and its proof is similar to the proof of (7-11): The quadratic
$$E\{(a\mathbf{x} - \mathbf{y})^2\} = a^2E\{\mathbf{x}^2\} - 2aE\{\mathbf{x}\mathbf{y}\} + E\{\mathbf{y}^2\}$$
is nonnegative for every $a$; hence its discriminant cannot be positive and (7-12) results. If (7-12) is an equality, then the quadratic is 0 for some $a = a_0$; hence $\mathbf{y} = a_0\mathbf{x}$. This agrees with the geometric interpretation of RVs because, if (7-12) is an equality, then the vectors $\mathbf{x}$ and $\mathbf{y}$ are on the same line.

The following illustration is an example of the correspondence between vectors and RVs: Consider two RVs $\mathbf{x}$ and $\mathbf{y}$ such that $E\{\mathbf{x}^2\} = E\{\mathbf{y}^2\}$. Geometrically, this means that the vectors $\mathbf{x}$ and $\mathbf{y}$ have the same length. If, therefore, we construct a parallelogram with sides $\mathbf{x}$ and $\mathbf{y}$, it will be a rhombus with diagonals $\mathbf{x} + \mathbf{y}$ and $\mathbf{x} - \mathbf{y}$ (Fig. 7-1). These diagonals are perpendicular because
$$E\{(\mathbf{x} + \mathbf{y})(\mathbf{x} - \mathbf{y})\} = E\{\mathbf{x}^2 - \mathbf{y}^2\} = 0$$

FIGURE 7-1

THEOREM. If two RVs are independent, that is, if
$$f(x, y) = f_x(x) f_y(y) \tag{7-13}$$
then they are uncorrelated.

Proof. It suffices to show that
$$E\{\mathbf{x}\mathbf{y}\} = E\{\mathbf{x}\}E\{\mathbf{y}\} \tag{7-14}$$
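From (7-2) and (7-13) it follows that
$$E\{\mathbf{x}\mathbf{y}\} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy f_x(x) f_y(y)\, dx\, dy = \int_{-\infty}^{\infty} x f_x(x)\, dx \int_{-\infty}^{\infty} y f_y(y)\, dy$$
and (7-14) results.

If the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent, then the RVs $g(\mathbf{x})$ and $h(\mathbf{y})$ are also independent [see (6-29)]. Hence
$$E\{g(\mathbf{x})h(\mathbf{y})\} = E\{g(\mathbf{x})\}E\{h(\mathbf{y})\} \tag{7-15}$$
This is not, in general, true if $\mathbf{x}$ and $\mathbf{y}$ are merely uncorrelated.

We note, finally, that if two RVs are uncorrelated they are not necessarily independent. However, for normal RVs uncorrelatedness is equivalent to independence. Indeed, if the RVs $\mathbf{x}$ and $\mathbf{y}$ are jointly normal and $r = 0$, then [see (6-15)] $f(x, y) = f_x(x) f_y(y)$.

Variance of the sum of two RVs. If $\mathbf{z} = \mathbf{x} + \mathbf{y}$, then $\eta_z = \eta_x + \eta_y$; hence
$$\sigma_z^2 = E\{(\mathbf{z} - \eta_z)^2\} = E\{[(\mathbf{x} - \eta_x) + (\mathbf{y} - \eta_y)]^2\}$$
From this and (7-10) it follows that
$$\sigma_z^2 = \sigma_x^2 + 2r\sigma_x\sigma_y + \sigma_y^2 \tag{7-16}$$
The above leads to the conclusion that if $r = 0$ then
$$\sigma_z^2 = \sigma_x^2 + \sigma_y^2 \tag{7-17}$$
Thus, if two RVs are uncorrelated, then the variance of their sum equals the sum of their variances. It follows from (7-14) that this is also true if $\mathbf{x}$ and $\mathbf{y}$ are independent.

Moments. The mean
$$m_{kr} = E\{\mathbf{x}^k\mathbf{y}^r\} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x^k y^r f(x, y)\, dx\, dy \tag{7-18}$$
of the product $\mathbf{x}^k\mathbf{y}^r$ is by definition a joint moment of the RVs $\mathbf{x}$ and $\mathbf{y}$ of order $k + r = n$. Thus $m_{10} = \eta_x$, $m_{01} = \eta_y$ are the first-order moments and
$$m_{20} = E\{\mathbf{x}^2\} \qquad m_{11} = E\{\mathbf{x}\mathbf{y}\} \qquad m_{02} = E\{\mathbf{y}^2\}$$
are the second-order moments.

The joint central moments of $\mathbf{x}$ and $\mathbf{y}$ are the moments of $\mathbf{x} - \eta_x$ and $\mathbf{y} - \eta_y$:
$$\mu_{kr} = E\{(\mathbf{x} - \eta_x)^k(\mathbf{y} - \eta_y)^r\} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x - \eta_x)^k (y - \eta_y)^r f(x, y)\, dx\, dy \tag{7-19}$$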
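The gap between uncorrelatedness and independence can be seen in a very small example. The sketch below uses a hypothetical pair that is not from the text: $\mathbf{x}$ takes the values $-1, 0, 1$ with equal probability and $\mathbf{y} = \mathbf{x}^2$. Exact rational arithmetic shows that the covariance is zero and (7-17) holds, while the product rule (7-15) fails for $g(x) = x^2$, $h(y) = y$.

```python
from fractions import Fraction

# Hypothetical example: x is -1, 0, or 1 with equal probability and y = x**2,
# so y is completely determined by x (the two RVs are clearly dependent).
third = Fraction(1, 3)
pmf = {(-1, 1): third, (0, 0): third, (1, 1): third}


def mean(g):
    """E{g(x, y)} computed as the discrete sum (7-3)."""
    return sum(g(x, y) * p for (x, y), p in pmf.items())


def variance(g):
    m = mean(g)
    return mean(lambda x, y: (g(x, y) - m) ** 2)


cov = mean(lambda x, y: x * y) - mean(lambda x, y: x) * mean(lambda x, y: y)
var_x = variance(lambda x, y: x)
var_y = variance(lambda x, y: y)
var_sum = variance(lambda x, y: x + y)

lhs = mean(lambda x, y: x ** 2 * y)                     # E{g(x)h(y)}
rhs = mean(lambda x, y: x ** 2) * mean(lambda x, y: y)  # E{g(x)}E{h(y)}

print(cov, var_sum, var_x + var_y, lhs, rhs)  # prints: 0 8/9 8/9 2/3 4/9
```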
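Clearly, $\mu_{10} = \mu_{01} = 0$ and
$$\mu_{11} = C \qquad \mu_{20} = \sigma_x^2 \qquad \mu_{02} = \sigma_y^2$$
Absolute and generalized moments are defined similarly [see (5-40) and (5-41)].

For the determination of the joint statistics of $\mathbf{x}$ and $\mathbf{y}$, knowledge of their joint density is required. However, in many applications, only the first- and second-order moments are used. These moments are determined in terms of the five parameters
$$\eta_x \qquad \eta_y \qquad \sigma_x \qquad \sigma_y \qquad r_{xy}$$
If $\mathbf{x}$ and $\mathbf{y}$ are jointly normal, then [see (6-15)] the above parameters determine uniquely $f(x, y)$.

Example 7-2. The RVs $\mathbf{x}$ and $\mathbf{y}$ are jointly normal with
$$\eta_x = 10 \qquad \eta_y = 0 \qquad \sigma_x = 2 \qquad \sigma_y = 1 \qquad r_{xy} = 0.5$$
We shall find the joint density of the RVs
$$\mathbf{z} = \mathbf{x} + \mathbf{y} \qquad \mathbf{w} = \mathbf{x} - \mathbf{y}$$
Clearly,
$$\eta_z = \eta_x + \eta_y = 10 \qquad \eta_w = \eta_x - \eta_y = 10$$
$$\sigma_z^2 = \sigma_x^2 + \sigma_y^2 + 2r_{xy}\sigma_x\sigma_y = 7 \qquad \sigma_w^2 = \sigma_x^2 + \sigma_y^2 - 2r_{xy}\sigma_x\sigma_y = 3$$
$$E\{\mathbf{z}\mathbf{w}\} = E\{\mathbf{x}^2 - \mathbf{y}^2\} = (100 + 4) - 1 = 103$$
$$r_{zw} = \frac{E\{\mathbf{z}\mathbf{w}\} - E\{\mathbf{z}\}E\{\mathbf{w}\}}{\sigma_z\sigma_w} = \frac{3}{\sqrt{7 \times 3}}$$
As we know [see (6-66)], the RVs $\mathbf{z}$ and $\mathbf{w}$ are jointly normal because they are linearly dependent on $\mathbf{x}$ and $\mathbf{y}$. Hence their joint density is
$$N\!\left(10, 10;\; \sqrt{7}, \sqrt{3};\; \sqrt{3/7}\right)$$

Estimate of the mean of $g(\mathbf{x}, \mathbf{y})$. If the function $g(x, y)$ is sufficiently smooth near the point $(\eta_x, \eta_y)$, then the mean $\eta_g$ and variance $\sigma_g^2$ of $g(\mathbf{x}, \mathbf{y})$ can be estimated in terms of the mean, variance, and covariance of $\mathbf{x}$ and $\mathbf{y}$:
$$\eta_g \simeq g + \frac{1}{2}\left(\frac{\partial^2 g}{\partial x^2}\sigma_x^2 + 2\frac{\partial^2 g}{\partial x\,\partial y}r\sigma_x\sigma_y + \frac{\partial^2 g}{\partial y^2}\sigma_y^2\right) \tag{7-20}$$
$$\sigma_g^2 \simeq \left(\frac{\partial g}{\partial x}\right)^2\sigma_x^2 + 2\frac{\partial g}{\partial x}\frac{\partial g}{\partial y}r\sigma_x\sigma_y + \left(\frac{\partial g}{\partial y}\right)^2\sigma_y^2 \tag{7-21}$$
where the function $g(x, y)$ and its derivatives are evaluated at $x = \eta_x$ and $y = \eta_y$.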
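Proof. We expand $g(x, y)$ into a series about the point $(\eta_x, \eta_y)$:
$$g(x, y) = g(\eta_x, \eta_y) + (x - \eta_x)\frac{\partial g}{\partial x} + (y - \eta_y)\frac{\partial g}{\partial y} + \cdots \tag{7-22}$$
Inserting the above into (7-2), we obtain the moment expansion of $E\{g(\mathbf{x}, \mathbf{y})\}$ in terms of the derivatives of $g(x, y)$ at $(\eta_x, \eta_y)$ and the joint moments $\mu_{kr}$ of $\mathbf{x}$ and $\mathbf{y}$. Using only the first five terms in (7-22), we obtain (7-20). Equation (7-21) follows if we apply (7-20) to the function $[g(x, y) - \eta_g]^2$ and neglect moments of order higher than 2.

7-2 JOINT CHARACTERISTIC FUNCTIONS

The joint characteristic function of the RVs $\mathbf{x}$ and $\mathbf{y}$ is by definition the integral
$$\Phi(\omega_1, \omega_2) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y) e^{j(\omega_1 x + \omega_2 y)}\, dx\, dy \tag{7-23}$$
From the above and the two-dimensional inversion formula for Fourier transforms, it follows that
$$f(x, y) = \frac{1}{4\pi^2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \Phi(\omega_1, \omega_2) e^{-j(\omega_1 x + \omega_2 y)}\, d\omega_1\, d\omega_2 \tag{7-24}$$
Clearly,
$$\Phi(\omega_1, \omega_2) = E\{e^{j(\omega_1\mathbf{x} + \omega_2\mathbf{y})}\} \tag{7-25}$$
The logarithm
$$\Psi(\omega_1, \omega_2) = \ln\Phi(\omega_1, \omega_2) \tag{7-26}$$
of $\Phi(\omega_1, \omega_2)$ is the joint second characteristic function of $\mathbf{x}$ and $\mathbf{y}$.

The marginal characteristic functions
$$\Phi_x(\omega) = E\{e^{j\omega\mathbf{x}}\} \qquad \Phi_y(\omega) = E\{e^{j\omega\mathbf{y}}\} \tag{7-27}$$
of $\mathbf{x}$ and $\mathbf{y}$ can be expressed in terms of their joint characteristic function $\Phi(\omega_1, \omega_2)$. From (7-25) and (7-27) it follows that
$$\Phi_x(\omega) = \Phi(\omega, 0) \qquad \Phi_y(\omega) = \Phi(0, \omega) \tag{7-28}$$
We note that, if $\mathbf{z} = a\mathbf{x} + b\mathbf{y}$, then
$$\Phi_z(\omega) = E\{e^{j(a\mathbf{x} + b\mathbf{y})\omega}\} = \Phi(a\omega, b\omega) \tag{7-29}$$
Hence $\Phi_z(1) = \Phi(a, b)$.

Cramér-Wold theorem. The above shows that if $\Phi_z(\omega)$ is known for every $a$ and $b$, then $\Phi(\omega_1, \omega_2)$ is uniquely determined. In other words, if the density of $a\mathbf{x} + b\mathbf{y}$ is known for every $a$ and $b$, then the joint density $f(x, y)$ of $\mathbf{x}$ and $\mathbf{y}$ is uniquely determined.

Independence and convolution. If the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent, then [see (7-15)]
$$E\{e^{j(\omega_1\mathbf{x} + \omega_2\mathbf{y})}\} = E\{e^{j\omega_1\mathbf{x}}\}E\{e^{j\omega_2\mathbf{y}}\}$$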
From this it follows that
$$\Phi(\omega_1, \omega_2) = \Phi_x(\omega_1)\Phi_y(\omega_2) \tag{7-30}$$
Conversely, if (7-30) is true, then the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent. Indeed, inserting (7-30) into the inversion formula (7-24) and using (5-66), we conclude that $f(x, y) = f_x(x) f_y(y)$.

Convolution theorem. If the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent and $\mathbf{z} = \mathbf{x} + \mathbf{y}$, then
$$E\{e^{j\omega\mathbf{z}}\} = E\{e^{j\omega(\mathbf{x} + \mathbf{y})}\} = E\{e^{j\omega\mathbf{x}}\}E\{e^{j\omega\mathbf{y}}\}$$
Hence
$$\Phi_z(\omega) = \Phi_x(\omega)\Phi_y(\omega) \qquad \Psi_z(\omega) = \Psi_x(\omega) + \Psi_y(\omega) \tag{7-31}$$
As we know [see (6-39)], the density of $\mathbf{z}$ equals the convolution of $f_x(x)$ and $f_y(y)$. From this and (7-31) it follows that the characteristic function of the convolution of two densities equals the product of their characteristic functions.

Example 7-3. We shall show that if the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent and Poisson distributed with parameters $a$ and $b$ respectively, then their sum $\mathbf{z} = \mathbf{x} + \mathbf{y}$ is also Poisson distributed with parameter $a + b$.

Proof. As we know (see Example 5-30)
$$\Psi_x(\omega) = a(e^{j\omega} - 1) \qquad \Psi_y(\omega) = b(e^{j\omega} - 1)$$
Hence
$$\Psi_z(\omega) = \Psi_x(\omega) + \Psi_y(\omega) = (a + b)(e^{j\omega} - 1)$$
It can be shown that the converse is also true: If the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent and their sum is Poisson distributed, then $\mathbf{x}$ and $\mathbf{y}$ are also Poisson distributed. The proof of this difficult theorem will not be given.

Example 7-4. It was shown in Sec. 6-3 that if the RVs $\mathbf{x}$ and $\mathbf{y}$ are jointly normal, then the sum $a\mathbf{x} + b\mathbf{y}$ is also normal. In the following we reestablish a special case of the above using (7-30): If $\mathbf{x}$ and $\mathbf{y}$ are independent and normal, then their sum $\mathbf{z} = \mathbf{x} + \mathbf{y}$ is also normal.

Proof. In this case [see (5-65)]
$$\Psi_x(\omega) = j\eta_x\omega - \tfrac{1}{2}\sigma_x^2\omega^2 \qquad \Psi_y(\omega) = j\eta_y\omega - \tfrac{1}{2}\sigma_y^2\omega^2$$
Hence
$$\Psi_z(\omega) = j(\eta_x + \eta_y)\omega - \tfrac{1}{2}(\sigma_x^2 + \sigma_y^2)\omega^2$$
It can be shown that the converse is also true (Cramér theorem): If the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent and their sum is normal, then they are also normal. The proof of this difficult theorem will not be given.†

†E. Lukacs: Characteristic Functions, Hafner Publishing Co., New York, 1960.
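The statement of Example 7-3 can also be checked in the probability domain, since for independent RVs of discrete type the probabilities of the sum are the discrete convolution of the individual probabilities. The sketch below is illustrative only; the function names, the parameter values, and the truncation point `n_max` are assumptions of this example. Truncating both lists at `n_max` does not affect the convolution for indices up to `n_max`, because $P\{\mathbf{z} = n\}$ involves only terms with index at most $n$.

```python
import math


def poisson_pmf(lam, n_max):
    """Poisson probabilities P{x = k} for k = 0, ..., n_max."""
    return [math.exp(-lam) * lam ** k / math.factorial(k)
            for k in range(n_max + 1)]


def convolve(p, q):
    """Discrete convolution: out[n] = sum over k of p[k] * q[n - k]."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            out[i + j] += p_i * q_j
    return out


a, b, n_max = 1.5, 2.5, 30
p_z = convolve(poisson_pmf(a, n_max), poisson_pmf(b, n_max))
target = poisson_pmf(a + b, n_max)

# Largest deviation over the indices where the truncated convolution is exact
print(max(abs(p_z[n] - target[n]) for n in range(n_max + 1)))
```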
Normal RVs. We shall show that the joint characteristic function of two jointly normal RVs is given by
$$\Phi(\omega_1, \omega_2) = e^{j(\omega_1\eta_1 + \omega_2\eta_2)}\, e^{-\frac{1}{2}(\omega_1^2\sigma_1^2 + 2r\omega_1\omega_2\sigma_1\sigma_2 + \omega_2^2\sigma_2^2)} \tag{7-32}$$

Proof. This can be derived by inserting $f(x, y)$ into (7-23). The following simpler proof is based on the fact that the RV $\mathbf{z} = \omega_1\mathbf{x} + \omega_2\mathbf{y}$ is normal and
$$\Phi_z(\omega) = e^{j\eta_z\omega - \frac{1}{2}\sigma_z^2\omega^2} \tag{7-33}$$
Since
$$\eta_z = \omega_1\eta_1 + \omega_2\eta_2 \qquad \sigma_z^2 = \omega_1^2\sigma_1^2 + 2r\omega_1\omega_2\sigma_1\sigma_2 + \omega_2^2\sigma_2^2$$
and $\Phi_z(\omega) = \Phi(\omega_1\omega, \omega_2\omega)$, (7-32) follows from (7-33) with $\omega = 1$.

The above proof is based on the fact that the RV $\mathbf{z} = \omega_1\mathbf{x} + \omega_2\mathbf{y}$ is normal for any $\omega_1$ and $\omega_2$; this leads to the following conclusion: If it is known that the sum $a\mathbf{x} + b\mathbf{y}$ is normal for every $a$ and $b$, then the RVs $\mathbf{x}$ and $\mathbf{y}$ are jointly normal. We should stress, however, that this is not true if $a\mathbf{x} + b\mathbf{y}$ is normal for only a finite set of values of $a$ and $b$. A counterexample can be formed by a simple extension of the construction in Fig. 7-2.

Example 7-5. We shall construct two RVs $\mathbf{x}_1$ and $\mathbf{y}_1$ with the following properties: $\mathbf{x}_1$, $\mathbf{y}_1$, and $\mathbf{x}_1 + \mathbf{y}_1$ are normal but $\mathbf{x}_1$ and $\mathbf{y}_1$ are not jointly normal.

Suppose that $\mathbf{x}$ and $\mathbf{y}$ are two jointly normal RVs with mass density $f(x, y)$. Adding and subtracting small masses in the region $D$ of Fig. 7-2 consisting of eight circles as shown, we obtain a new function $f_1(x, y)$ such that $f_1(x, y) = f(x, y) \pm \varepsilon$ in $D$ and $f_1(x, y) = f(x, y)$ everywhere else. The function $f_1(x, y)$ is a density; hence it defines two new RVs $\mathbf{x}_1$ and $\mathbf{y}_1$. These RVs are obviously not jointly normal. However, they are marginally normal because $\mathbf{x}$ and $\mathbf{y}$ are marginally normal and the masses in any vertical or horizontal strip have not changed. Furthermore, the RV $\mathbf{z}_1 = \mathbf{x}_1 + \mathbf{y}_1$ is also normal because $\mathbf{z} = \mathbf{x} + \mathbf{y}$ is normal and the masses in any diagonal strip of the form $z \le x + y \le z + dz$ have not changed.
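Moment theorem. The moment generating function of $\mathbf{x}$ and $\mathbf{y}$ is given by
$$\Phi(s_1, s_2) = E\{e^{s_1\mathbf{x} + s_2\mathbf{y}}\}$$
Expanding the exponential and using the linearity of expected values, we obtain the series
$$\Phi(s_1, s_2) = \sum_{n=0}^{\infty} \frac{1}{n!} \sum_{k=0}^{n} \binom{n}{k} E\{\mathbf{x}^k\mathbf{y}^{n-k}\}\, s_1^k s_2^{n-k} = 1 + m_{10}s_1 + m_{01}s_2 + \tfrac{1}{2}\left(m_{20}s_1^2 + 2m_{11}s_1s_2 + m_{02}s_2^2\right) + \cdots \tag{7-34}$$
From this it follows that
$$\frac{\partial^k\,\partial^r}{\partial s_1^k\,\partial s_2^r}\Phi(0, 0) = m_{kr} \tag{7-35}$$
The derivatives of the function $\Psi(s_1, s_2) = \ln\Phi(s_1, s_2)$ are by definition the joint cumulants $\lambda_{kr}$ of $\mathbf{x}$ and $\mathbf{y}$. It can be shown that
$$\lambda_{10} = m_{10} \qquad \lambda_{01} = m_{01} \qquad \lambda_{20} = \mu_{20} \qquad \lambda_{02} = \mu_{02} \qquad \lambda_{11} = \mu_{11}$$
Hence
$$\Psi(s_1, s_2) = \eta_1s_1 + \eta_2s_2 + \tfrac{1}{2}\left(\sigma_1^2s_1^2 + 2r\sigma_1\sigma_2s_1s_2 + \sigma_2^2s_2^2\right) + \cdots$$

Example 7-6a. Using (7-34), we shall show that if the RVs $\mathbf{x}$ and $\mathbf{y}$ are jointly normal with zero mean, then
$$E\{\mathbf{x}^2\mathbf{y}^2\} = E\{\mathbf{x}^2\}E\{\mathbf{y}^2\} + 2E^2\{\mathbf{x}\mathbf{y}\} \tag{7-36}$$

Proof. As we see from (7-32) with $s_1 = j\omega_1$, $s_2 = j\omega_2$,
$$\Phi(s_1, s_2) = e^{A} \qquad A = \tfrac{1}{2}\left(\sigma_1^2s_1^2 + 2Cs_1s_2 + \sigma_2^2s_2^2\right)$$
where $C = E\{\mathbf{x}\mathbf{y}\} = r\sigma_1\sigma_2$. To prove (7-36), we shall equate the coefficient
$$\frac{1}{4!}\binom{4}{2}E\{\mathbf{x}^2\mathbf{y}^2\}$$
of $s_1^2s_2^2$ in (7-34) with the corresponding coefficient of the expansion of $e^{A}$. In this expansion, the factors $s_1^2s_2^2$ appear only in the terms
$$\frac{A^2}{2} = \frac{1}{8}\left(\sigma_1^2s_1^2 + 2Cs_1s_2 + \sigma_2^2s_2^2\right)^2$$
Hence
$$\frac{1}{4!}\binom{4}{2}E\{\mathbf{x}^2\mathbf{y}^2\} = \frac{1}{8}\left(2\sigma_1^2\sigma_2^2 + 4C^2\right)$$
and (7-36) results.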
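Identity (7-36) is easy to test numerically. The sketch below is an illustration under stated assumptions: the zero-mean jointly normal pair is generated from two independent standard normal variables $u$ and $v$ through $x = \sigma_1 u$, $y = \sigma_2(ru + \sqrt{1 - r^2}\,v)$, and the expected values are approximated with a plain midpoint rule on a truncated grid. The function names and the parameter values are hypothetical.

```python
import math


def normal_pdf(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def joint_moment(g, s1, s2, r, half_width=8.0, steps=160):
    """E{g(x, y)} for zero-mean jointly normal x, y with standard
    deviations s1, s2 and correlation coefficient r.

    Uses x = s1*u, y = s2*(r*u + sqrt(1 - r^2)*v) with u, v independent
    N(0, 1), and a midpoint rule over the square |u|, |v| < half_width.
    """
    h = 2.0 * half_width / steps
    c = math.sqrt(1.0 - r * r)
    total = 0.0
    for i in range(steps):
        u = -half_width + (i + 0.5) * h
        for k in range(steps):
            v = -half_width + (k + 0.5) * h
            total += g(s1 * u, s2 * (r * u + c * v)) * normal_pdf(u) * normal_pdf(v)
    return total * h * h


s1, s2, r = 1.5, 0.8, 0.4
lhs = joint_moment(lambda x, y: x * x * y * y, s1, s2, r)
ex2 = joint_moment(lambda x, y: x * x, s1, s2, r)
ey2 = joint_moment(lambda x, y: y * y, s1, s2, r)
exy = joint_moment(lambda x, y: x * y, s1, s2, r)
print(lhs, ex2 * ey2 + 2.0 * exy ** 2)  # the two sides of (7-36)
```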
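Price's theorem.† Given two jointly normal RVs $\mathbf{x}$ and $\mathbf{y}$, we form the mean
$$I = E\{g(\mathbf{x}, \mathbf{y})\} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y) f(x, y)\, dx\, dy \tag{7-37a}$$
of some function $g(\mathbf{x}, \mathbf{y})$ of $(\mathbf{x}, \mathbf{y})$. The above integral is a function $I(\mu)$ of the covariance $\mu$ of the RVs $\mathbf{x}$ and $\mathbf{y}$ and of four parameters specifying the joint density $f(x, y)$ of $\mathbf{x}$ and $\mathbf{y}$. We shall show that if $g(x, y) f(x, y) \to 0$ as $(x, y) \to \infty$, then
$$\frac{\partial^n I(\mu)}{\partial\mu^n} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{\partial^{2n} g(x, y)}{\partial x^n\,\partial y^n} f(x, y)\, dx\, dy = E\left\{\frac{\partial^{2n} g(\mathbf{x}, \mathbf{y})}{\partial\mathbf{x}^n\,\partial\mathbf{y}^n}\right\} \tag{7-37b}$$

Proof. Inserting (7-24) into (7-37a) and differentiating with respect to $\mu$, we obtain
$$\frac{\partial^n I(\mu)}{\partial\mu^n} = \frac{(-1)^n}{4\pi^2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y) \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \omega_1^n\omega_2^n\,\Phi(\omega_1, \omega_2)\, e^{-j(\omega_1 x + \omega_2 y)}\, d\omega_1\, d\omega_2\, dx\, dy$$
From this and the derivative theorem, it follows that
$$\frac{\partial^n I(\mu)}{\partial\mu^n} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y)\, \frac{\partial^{2n} f(x, y)}{\partial x^n\,\partial y^n}\, dx\, dy$$
Integrating by parts and using the condition at $\infty$, we obtain (7-37b) (see also Prob. 5-31).

Example 7-6b. Using Price's theorem, we shall rederive (7-36). Setting $g(\mathbf{x}, \mathbf{y}) = \mathbf{x}^2\mathbf{y}^2$ into (7-37b), we conclude with $n = 1$ that
$$\frac{\partial I(\mu)}{\partial\mu} = E\left\{\frac{\partial^2(\mathbf{x}^2\mathbf{y}^2)}{\partial\mathbf{x}\,\partial\mathbf{y}}\right\} = 4E\{\mathbf{x}\mathbf{y}\} = 4\mu \qquad I(\mu) = \frac{4\mu^2}{2} + I(0)$$
If $\mu = 0$, the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent; hence $I(0) = E\{\mathbf{x}^2\mathbf{y}^2\} = E\{\mathbf{x}^2\}E\{\mathbf{y}^2\}$ and (7-36) results.

†R. Price, "A Useful Theorem for Nonlinear Devices Having Gaussian Inputs," IRE, PGIT, Vol. IT-4, 1958. See also A. Papoulis, "On an Extension of Price's Theorem," IEEE Transactions on Information Theory, Vol. IT-11, 1965.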
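7-3 CONDITIONAL DISTRIBUTIONS

As we have noted, conditional distributions can be expressed as conditional probabilities:
$$F_z(z \mid \mathcal{A}) = P\{\mathbf{z} \le z \mid \mathcal{A}\} = \frac{P\{\mathbf{z} \le z, \mathcal{A}\}}{P(\mathcal{A})} \tag{7-38}$$
$$F_{zw}(z, w \mid \mathcal{A}) = P\{\mathbf{z} \le z, \mathbf{w} \le w \mid \mathcal{A}\} = \frac{P\{\mathbf{z} \le z, \mathbf{w} \le w, \mathcal{A}\}}{P(\mathcal{A})}$$
The corresponding densities are obtained by appropriate differentiations. In this section, we evaluate these functions for various special cases.

Example 7-7. We shall first determine the conditional distribution $F_y(y \mid \mathbf{x} \le x)$ and density $f_y(y \mid \mathbf{x} \le x)$. With $\mathcal{A} = \{\mathbf{x} \le x\}$, (7-38) yields
$$F_y(y \mid \mathbf{x} \le x) = \frac{P\{\mathbf{x} \le x, \mathbf{y} \le y\}}{P\{\mathbf{x} \le x\}} = \frac{F(x, y)}{F_x(x)} \qquad f_y(y \mid \mathbf{x} \le x) = \frac{\partial F(x, y)/\partial y}{F_x(x)}$$

Example 7-8. We shall next determine the conditional distribution $F(x, y \mid \mathcal{A})$ for $\mathcal{A} = \{x_1 < \mathbf{x} \le x_2\}$. In this case, $F(x, y \mid \mathcal{A})$ is given by
$$F(x, y \mid x_1 < \mathbf{x} \le x_2) = \frac{P\{\mathbf{x} \le x, \mathbf{y} \le y, x_1 < \mathbf{x} \le x_2\}}{P\{x_1 < \mathbf{x} \le x_2\}} = \begin{cases} \dfrac{F(x_2, y) - F(x_1, y)}{F_x(x_2) - F_x(x_1)} & x > x_2 \\[2ex] \dfrac{F(x, y) - F(x_1, y)}{F_x(x_2) - F_x(x_1)} & x_1 < x \le x_2 \end{cases}$$
and it equals 0 for $x \le x_1$. Since $f = \partial^2 F/\partial x\,\partial y$, the above yields
$$f(x, y \mid x_1 < \mathbf{x} \le x_2) = \frac{f(x, y)}{F_x(x_2) - F_x(x_1)} \qquad x_1 < x \le x_2 \tag{7-39}$$
and 0 otherwise.

The determination of the conditional density of $\mathbf{y}$ assuming $\mathbf{x} = x$ is of particular interest. This density cannot be derived directly from (7-38) because, in general, the event $\{\mathbf{x} = x\}$ has zero probability. It can, however, be defined as a limit. Suppose first that $\mathcal{A} = \{x_1 < \mathbf{x} \le x_2\}$.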
In this case, (7-38) yields
$$F_y(y \mid x_1 < \mathbf{x} \le x_2) = \frac{P\{x_1 < \mathbf{x} \le x_2, \mathbf{y} \le y\}}{P\{x_1 < \mathbf{x} \le x_2\}} = \frac{F(x_2, y) - F(x_1, y)}{F_x(x_2) - F_x(x_1)}$$
Differentiating with respect to $y$, we obtain
$$f_y(y \mid x_1 < \mathbf{x} \le x_2) = \frac{\displaystyle\int_{x_1}^{x_2} f(x, y)\, dx}{F_x(x_2) - F_x(x_1)} \tag{7-40}$$
because [see (6-6)]
$$\frac{\partial F(x, y)}{\partial y} = \int_{-\infty}^{x} f(\alpha, y)\, d\alpha$$
To determine $f_y(y \mid \mathbf{x} = x)$, we set $x_1 = x$ and $x_2 = x + \Delta x$ in (7-40). This yields
$$f_y(y \mid x < \mathbf{x} \le x + \Delta x) = \frac{\displaystyle\int_{x}^{x + \Delta x} f(\alpha, y)\, d\alpha}{F_x(x + \Delta x) - F_x(x)} \simeq \frac{f(x, y)\,\Delta x}{f_x(x)\,\Delta x}$$
Hence
$$f_y(y \mid \mathbf{x} = x) = \lim_{\Delta x \to 0} f_y(y \mid x < \mathbf{x} \le x + \Delta x) = \frac{f(x, y)}{f_x(x)}$$
If there is no fear of ambiguity, the function $f_y(y \mid \mathbf{x} = x)$ will be written in the form $f(y \mid x)$. Defining $f(x \mid y)$ similarly, we obtain
$$f(y \mid x) = \frac{f(x, y)}{f(x)} \qquad f(x \mid y) = \frac{f(x, y)}{f(y)} \tag{7-41}$$
If the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent, then
$$f(x, y) = f(x) f(y) \qquad f(y \mid x) = f(y) \qquad f(x \mid y) = f(x)$$

Notes 1. For a specific $x$, the function $f(x, y)$ is a profile of $f(x, y)$; that is, it equals the intersection of the surface $f(x, y)$ by the plane $x = \text{constant}$. The conditional density $f(y \mid x)$ is the equation of this curve normalized by the factor $1/f(x)$ so as to make its area 1. The function $f(x \mid y)$ has a similar interpretation: It is the normalized equation of the intersection of the surface $f(x, y)$ by the plane $y = \text{constant}$.

2. As we know, the product $f(y)\,dy$ equals the probability of the event $\{y < \mathbf{y} \le y + dy\}$. Extending this to conditional probabilities, we obtain
$$f_y(y \mid x_1 < \mathbf{x} \le x_2)\, dy = \frac{P\{x_1 < \mathbf{x} \le x_2,\; y < \mathbf{y} \le y + dy\}}{P\{x_1 < \mathbf{x} \le x_2\}}$$
This equals the mass in the rectangle of Fig. 7-3a divided by the mass in the vertical strip $x_1 < x \le x_2$. Similarly, the product $f(y \mid x)\,dy$ equals the ratio of the mass in the differential rectangle $dx\,dy$ of Fig. 7-3b over the mass in the vertical strip $(x, x + dx)$.
FIGURE 7-3 (a) (b)

3. The joint statistics of $\mathbf{x}$ and $\mathbf{y}$ are determined in terms of their joint density $f(x, y)$. Since
$$f(x, y) = f(y \mid x) f(x)$$
we conclude that they are also determined in terms of the marginal density $f(x)$ and the conditional density $f(y \mid x)$.

Example 7-9. We shall show that, if the RVs $\mathbf{x}$ and $\mathbf{y}$ are jointly normal with zero mean as in (6-44), then
$$f(y \mid x) = \frac{1}{\sigma_2\sqrt{2\pi(1 - r^2)}} \exp\left\{-\frac{(y - r\sigma_2x/\sigma_1)^2}{2\sigma_2^2(1 - r^2)}\right\} \tag{7-42}$$

Proof. The exponent in (6-44) equals
$$-\frac{(y - r\sigma_2x/\sigma_1)^2}{2\sigma_2^2(1 - r^2)} - \frac{x^2}{2\sigma_1^2}$$
Division by $f(x)$ removes the term $-x^2/2\sigma_1^2$ and (7-42) results.

The same reasoning leads to the conclusion that if $\mathbf{x}$ and $\mathbf{y}$ are jointly normal with $E\{\mathbf{x}\} = \eta_1$ and $E\{\mathbf{y}\} = \eta_2$, then $f(y \mid x)$ is given by (7-42) if $y$ and $x$ are replaced by $y - \eta_2$ and $x - \eta_1$ respectively. In other words, for a given $x$, $f(y \mid x)$ is a normal density with mean $\eta_2 + r\sigma_2(x - \eta_1)/\sigma_1$ and variance $\sigma_2^2(1 - r^2)$.

Bayes' theorem and total probability. From (7-41) it follows that
$$f(x \mid y) = \frac{f(y \mid x) f(x)}{f(y)} \tag{7-43}$$
This is the density version of (2-38). The denominator $f(y)$ can be expressed in terms of $f(y \mid x)$ and $f(x)$. Since
$$f(y) = \int_{-\infty}^{\infty} f(x, y)\, dx \qquad \text{and} \qquad f(x, y) = f(y \mid x) f(x)$$
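we conclude that (total probability)
$$f(y) = \int_{-\infty}^{\infty} f(y \mid x) f(x)\, dx \tag{7-44}$$
Inserting into (7-43), we obtain Bayes' theorem for densities
$$f(x \mid y) = \frac{f(y \mid x) f(x)}{\displaystyle\int_{-\infty}^{\infty} f(y \mid x) f(x)\, dx} \tag{7-45}$$

Note As (7-44) shows, to remove the condition $\mathbf{x} = x$ from the conditional density $f(y \mid x)$, we multiply by the density $f(x)$ of $\mathbf{x}$ and integrate the product.

Discrete type. Suppose that the RVs $\mathbf{x}$ and $\mathbf{y}$ are of discrete type
$$P\{\mathbf{x} = x_i\} = p_i \qquad P\{\mathbf{y} = y_k\} = q_k \qquad P\{\mathbf{x} = x_i, \mathbf{y} = y_k\} = p_{ik} \qquad i = 1, \ldots, M \quad k = 1, \ldots, N$$
where [see (6-21)]
$$p_i = \sum_k p_{ik} \qquad q_k = \sum_i p_{ik}$$
From the above and (2-29) it follows that
$$P\{\mathbf{y} = y_k \mid \mathbf{x} = x_i\} = \frac{P\{\mathbf{x} = x_i, \mathbf{y} = y_k\}}{P\{\mathbf{x} = x_i\}} = \frac{p_{ik}}{p_i}$$

Markoff matrix. We denote by $\pi_{ik}$ the above conditional probabilities
$$P\{\mathbf{y} = y_k \mid \mathbf{x} = x_i\} = \pi_{ik}$$
and by $\Pi$ the $M \times N$ matrix whose elements are $\pi_{ik}$. Clearly,
$$\pi_{ik} = \frac{p_{ik}}{p_i} \tag{7-46}$$
Hence
$$\pi_{ik} \ge 0 \qquad \sum_k \pi_{ik} = 1 \tag{7-47}$$
Thus the elements of the matrix $\Pi$ are nonnegative and the sum on each row equals 1. Such a matrix is called Markoff. The conditional probabilities
$$P\{\mathbf{x} = x_i \mid \mathbf{y} = y_k\} = \pi^{ki} = \frac{p_{ik}}{q_k}$$
are the elements of an $N \times M$ Markoff matrix.

If the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent, then
$$p_{ik} = p_i q_k \qquad \pi_{ik} = q_k \qquad \pi^{ki} = p_i$$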
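The construction of the matrix $\Pi$ from a table of joint probabilities can be sketched in a few lines. The function name `markoff_matrix` and the $2 \times 3$ table below are hypothetical. The last line recovers $q_k$ as $\sum_i \pi_{ik} p_i$, which is the discrete counterpart of the total probability relation (7-44).

```python
def markoff_matrix(p):
    """From joint probabilities p[i][k] return (p_i, q_k, Pi),
    where Pi[i][k] = p[i][k] / p_i as in (7-46)."""
    p_i = [sum(row) for row in p]                                 # marginal of x
    q_k = [sum(row[k] for row in p) for k in range(len(p[0]))]    # marginal of y
    pi = [[p_ik / p_i[i] for p_ik in row] for i, row in enumerate(p)]
    return p_i, q_k, pi


# Hypothetical joint probabilities with M = 2 and N = 3
p = [[0.1, 0.2, 0.1],
     [0.3, 0.1, 0.2]]
p_i, q_k, pi = markoff_matrix(p)

print([sum(row) for row in pi])  # each row sum should equal 1, see (7-47)
print(q_k)
print([sum(pi[i][k] * p_i[i] for i in range(len(p))) for k in range(len(q_k))])
```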
We note that
$$\pi^{ki} = \frac{\pi_{ik}\,p_i}{q_k} \qquad q_k = \sum_i \pi_{ik}\,p_i \tag{7-48}$$
These equations are the discrete versions of Eqs. (7-43) and (7-44).

System Reliability

We shall use the term system to identify a physical device used to perform a certain function. The device might be a simple element, a light bulb, for example, or a more complicated structure. We shall call the time interval from the moment the system is put into operation until it fails the time to failure. This interval is, in general, random. It specifies, therefore, an RV $\mathbf{x} \ge 0$. The distribution $F(t) = P\{\mathbf{x} \le t\}$ of this RV is the probability that the system fails prior to time $t$, where we assume that $t = 0$ is the moment the system is put into operation. The difference
$$R(t) = 1 - F(t) = P\{\mathbf{x} > t\}$$
is the system reliability. It equals the probability that the system functions at time $t$.

The mean time to failure of a system is the mean of $\mathbf{x}$. Since $F(x) = 0$ for $x < 0$, we conclude from (5-27) that
$$E\{\mathbf{x}\} = \int_0^\infty x f(x)\, dx = \int_0^\infty R(t)\, dt \tag{7-49}$$
The probability that a system functioning at time $t$ fails prior to time $x > t$ equals
$$F(x \mid \mathbf{x} > t) = \frac{P\{\mathbf{x} \le x, \mathbf{x} > t\}}{P\{\mathbf{x} > t\}} = \frac{F(x) - F(t)}{1 - F(t)} \tag{7-50}$$
Differentiating with respect to $x$, we obtain
$$f(x \mid \mathbf{x} > t) = \frac{f(x)}{1 - F(t)} \qquad x > t \tag{7-51}$$
The product $f(x \mid \mathbf{x} > t)\,dx$ equals the probability that the system fails in the interval $(x, x + dx)$, assuming that it functions at time $t$.

Example 7-10. If $f(x) = ce^{-cx}$, then $F(t) = 1 - e^{-ct}$ and (7-51) yields
$$f(x \mid \mathbf{x} > t) = \frac{ce^{-cx}}{e^{-ct}} = f(x - t)$$
This shows that the probability that a system functioning at time $t$ fails in the interval $(x, x + dx)$ depends only on the difference $x - t$ (Fig. 7-4). We show later that this is true only if $f(x)$ is an exponential density.
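Both (7-49) and the conclusion of Example 7-10 can be checked numerically for an exponential time to failure. The sketch below is a minimal illustration; the rate `c = 0.5`, the integration limit, and the helper names are assumptions of this example, and the integrals are approximated with a midpoint rule on a finite interval whose neglected tail is negligible for this rate.

```python
import math

c = 0.5  # hypothetical failure-rate parameter of f(x) = c*exp(-c*x)


def f(x):
    return c * math.exp(-c * x) if x >= 0 else 0.0


def F(x):
    return 1.0 - math.exp(-c * x) if x >= 0 else 0.0


def conditional_density(x, t):
    """f(x | x > t) from (7-51); it is zero for x <= t."""
    return f(x) / (1.0 - F(t)) if x > t else 0.0


def integrate(g, a, b, steps=50_000):
    """Midpoint-rule approximation of the integral of g over (a, b)."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))


mean_from_density = integrate(lambda x: x * f(x), 0.0, 60.0)
mean_from_reliability = integrate(lambda t: 1.0 - F(t), 0.0, 60.0)
print(mean_from_density, mean_from_reliability, 1.0 / c)  # the two sides of (7-49)

print(conditional_density(5.0, 3.0), f(5.0 - 3.0))  # f(x | x > t) vs. f(x - t)
```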
FIGURE 7-4

Conditional failure rate. The conditional density $f(x \mid \mathbf{x} > t)$ is a function of $x$ and $t$. Its value at $x = t$ is a function only of $t$. This function is denoted by $\beta(t)$ and is called the conditional failure rate or the hazard rate of the system. From (7-51) and the definition it follows that
$$\beta(t) = f(t \mid \mathbf{x} > t) = \frac{f(t)}{1 - F(t)} \tag{7-52}$$
The product $\beta(t)\,dt$ is the probability that a system functioning at time $t$ fails in the interval $(t, t + dt)$. In Sec. 8-1 (Example 8-3) we interpret the function $\beta(t)$ as the expected failure rate.

Example 7-11. (a) If $f(x) = ce^{-cx}$, then $F(t) = 1 - e^{-ct}$ and
$$\beta(t) = \frac{ce^{-ct}}{e^{-ct}} = c$$
(b) If $f(x) = c^2xe^{-cx}$, then $F(x) = 1 - cxe^{-cx} - e^{-cx}$ and
$$\beta(t) = \frac{c^2te^{-ct}}{cte^{-ct} + e^{-ct}} = \frac{c^2t}{1 + ct}$$

From (7-52) it follows that
$$\beta(t) = \frac{F'(t)}{1 - F(t)} = -\frac{R'(t)}{R(t)}$$
We shall use this relationship to express the distribution of $\mathbf{x}$ in terms of the function $\beta(t)$. Integrating from 0 to $x$ and using the fact that $\ln R(0) = 0$, we obtain
$$-\int_0^x \beta(t)\, dt = \ln R(x)$$
Hence
$$R(x) = 1 - F(x) = \exp\left\{-\int_0^x \beta(t)\, dt\right\} \tag{7-53}$$
And since $f(x) = F'(x)$, this yields
$$f(x) = \beta(x)\exp\left\{-\int_0^x \beta(t)\, dt\right\} \tag{7-54}$$
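Relations (7-53) and (7-54) can be exercised on the failure rate of Example 7-11(b). The sketch below is illustrative only: the value `c = 2.0` and the function names are assumptions, and the integral of $\beta(t)$ is approximated with a midpoint rule. The recovered reliability and density are compared with $R(x) = (1 + cx)e^{-cx}$ and $f(x) = c^2xe^{-cx}$, which follow from the expressions in that example.

```python
import math


def reliability_from_hazard(beta, x, steps=10_000):
    """R(x) = exp(-integral of beta over (0, x)), as in (7-53), by midpoint rule."""
    h = x / steps
    area = h * sum(beta((i + 0.5) * h) for i in range(steps))
    return math.exp(-area)


def density_from_hazard(beta, x, steps=10_000):
    """f(x) = beta(x) * R(x), as in (7-54)."""
    return beta(x) * reliability_from_hazard(beta, x, steps)


c = 2.0  # hypothetical parameter


def beta(t):
    """Conditional failure rate of Example 7-11(b)."""
    return c * c * t / (1.0 + c * t)


x = 1.5
print(reliability_from_hazard(beta, x), (1.0 + c * x) * math.exp(-c * x))
print(density_from_hazard(beta, x), c * c * x * math.exp(-c * x))
```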
Example 7-12. A system is called memoryless if the probability that it fails in an interval $(t, x)$, assuming that it functions at time $t$, depends only on the length of this interval. In other words, if the system works a week, a month, or a year after it was put into operation, it is as good as new. This is equivalent to the assumption that $f(x \mid \mathbf{x} > t) = f(x - t)$ as in Fig. 7-4. From this and (7-52) it follows that with $x = t$:
$$\beta(t) = f(t \mid \mathbf{x} > t) = f(t - t) = f(0) = c$$
and (7-54) yields $f(x) = ce^{-cx}$. Thus a system is memoryless iff $\mathbf{x}$ has an exponential density.

Example 7-13. A special form of $\beta(t)$ of particular interest in reliability theory is the function
$$\beta(t) = ct^{b-1}$$
This is a satisfactory approximation of a variety of failure rates, at least near the origin. The corresponding $f(x)$ is obtained from (7-54):
$$f(x) = cx^{b-1}\exp\left\{-\frac{cx^b}{b}\right\} \tag{7-55}$$
This function is called the Weibull density.

We conclude with the observation that the function $\beta(t)$ equals the value of the conditional density $f(x \mid \mathbf{x} > t)$ for $x = t$; however, $\beta(t)$ is not a density because its area is not one. In fact its area is infinite. This follows from (7-53) because $R(\infty) = 1 - F(\infty) = 0$.

FIGURE 7-5

Interconnection of systems. We are given two systems $S_1$ and $S_2$ with times to failure $\mathbf{x}$ and $\mathbf{y}$ respectively, and we connect them in parallel or in series or in
standby as in Fig. 7-5, forming a new system $S$. We shall express the properties of $S$ in terms of the joint distribution of the RVs $\mathbf{x}$ and $\mathbf{y}$.

Parallel. We say that the two systems are connected in parallel if $S$ fails when both systems fail. Denoting by $\mathbf{z}$ the time to failure of $S$, we conclude that $\mathbf{z} = t$ when the larger of the numbers $\mathbf{x}$ and $\mathbf{y}$ equals $t$. Hence [see (6-54)]
$$\mathbf{z} = \max(\mathbf{x}, \mathbf{y}) \qquad F_z(z) = F_{xy}(z, z)$$
If the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent, $F_z(z) = F_x(z)F_y(z)$.

Series. We say that the two systems are connected in series if $S$ fails when at least one of the two systems fails. Denoting by $\mathbf{w}$ the time to failure of $S$, we conclude that $\mathbf{w} = t$ when the smaller of the numbers $\mathbf{x}$ and $\mathbf{y}$ equals $t$. Hence [see (6-56)]
$$\mathbf{w} = \min(\mathbf{x}, \mathbf{y}) \qquad F_w(w) = F_x(w) + F_y(w) - F_{xy}(w, w)$$
If the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent,
$$R_w(w) = R_x(w)R_y(w) \qquad \beta_w(t) = \beta_x(t) + \beta_y(t)$$
where $\beta_x(t)$, $\beta_y(t)$, and $\beta_w(t)$ are the conditional failure rates of systems $S_1$, $S_2$, and $S$ respectively.

Standby. We put system $S_1$ into operation, keeping $S_2$ in reserve. When $S_1$ fails, we put $S_2$ into operation. The system $S$ so formed fails when $S_2$ fails. If $t_1$ and $t_2$ are the times of operation of $S_1$ and $S_2$, $t_1 + t_2$ is the time of operation of $S$. Denoting by $\mathbf{s}$ the time to failure of system $S$, we conclude that
$$\mathbf{s} = \mathbf{x} + \mathbf{y}$$
The distribution of $\mathbf{s}$ equals the probability that the point $(\mathbf{x}, \mathbf{y})$ is in the shaded region of Fig. 7-5. If the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent, the density of $\mathbf{s}$ equals
$$f_s(s) = \int_0^s f_x(t) f_y(s - t)\, dt$$
as in (6-40).

7-4 CONDITIONAL EXPECTED VALUES

Applying theorem (5-29) to conditional densities, we obtain the conditional mean of $g(\mathbf{y})$:
$$E\{g(\mathbf{y}) \mid \mathcal{A}\} = \int_{-\infty}^{\infty} g(y) f(y \mid \mathcal{A})\, dy \tag{7-56}$$
This can be used to define the conditional moments of $\mathbf{y}$.
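The parallel and series rules for independent times to failure can be illustrated with two exponential components. The sketch below is only an illustration: the rates `a` and `b`, the helper names, and the use of a central finite difference for the density are assumptions of this example. For the series connection the computed failure rate should be close to $a + b$, in agreement with $\beta_w(t) = \beta_x(t) + \beta_y(t)$.

```python
import math


def exp_cdf(rate):
    """Distribution of an exponential time to failure with the given rate."""
    return lambda t: 1.0 - math.exp(-rate * t) if t > 0 else 0.0


def parallel_cdf(Fx, Fy):
    """F_z for z = max(x, y) with independent x and y."""
    return lambda t: Fx(t) * Fy(t)


def series_cdf(Fx, Fy):
    """F_w for w = min(x, y) with independent x and y."""
    return lambda t: Fx(t) + Fy(t) - Fx(t) * Fy(t)


def hazard(F, t, h=1e-5):
    """Conditional failure rate f(t) / (1 - F(t)) as in (7-52),
    with f(t) approximated by a central difference of F."""
    density = (F(t + h) - F(t - h)) / (2.0 * h)
    return density / (1.0 - F(t))


a, b = 0.5, 1.5  # hypothetical failure rates of S1 and S2
Fx, Fy = exp_cdf(a), exp_cdf(b)

print(hazard(series_cdf(Fx, Fy), 0.7), a + b)  # series: the rates add
print(hazard(parallel_cdf(Fx, Fy), 0.7))       # parallel: a smaller failure rate
```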
FIGURE 7-6

Using a limit argument as in (7-41), we can also define the conditional mean $E\{g(\mathbf{y}) \mid x\}$. In particular,

$$\eta_{y|x} = E\{\mathbf{y} \mid x\} = \int_{-\infty}^{\infty} y f(y \mid x)\, dy \tag{7-57}$$

is the conditional mean of $\mathbf{y}$ assuming $\mathbf{x} = x$, and

$$\sigma_{y|x}^2 = E\{(\mathbf{y} - \eta_{y|x})^2 \mid x\} = \int_{-\infty}^{\infty} (y - \eta_{y|x})^2 f(y \mid x)\, dy \tag{7-58}$$

is its conditional variance.

For a given $x$, the integral in (7-57) is the center of gravity of the masses in the vertical strip $(x, x + dx)$. The locus of these points, as $x$ varies from $-\infty$ to $\infty$, is the function

$$\varphi(x) = \int_{-\infty}^{\infty} y f(y \mid x)\, dy \tag{7-59}$$

known as the regression line (Fig. 7-6).

Note. If the RVs $\mathbf{x}$ and $\mathbf{y}$ are functionally related, that is, if $\mathbf{y} = g(\mathbf{x})$, then the probability masses on the $xy$ plane are on the line $y = g(x)$ (see Fig. 6-5b); hence $E\{\mathbf{y} \mid x\} = g(x)$.

Galton's law. The term regression has its origin in the following observation attributed to the geneticist Sir Francis Galton (1822-1911): "Population extremes regress toward their mean." This observation applied to parents and their adult children means that children of tall (or short) parents are on the average shorter (or taller) than their parents. In statistical terms this can be phrased in terms of conditional expected values:

Suppose that the RVs $\mathbf{x}$ and $\mathbf{y}$ model the height of parents and their children respectively. These RVs have the same mean and variance, and they are positively correlated:

$$\eta_x = \eta_y = \eta \qquad \sigma_x = \sigma_y = \sigma \qquad r > 0$$

According to Galton's law, the conditional mean $E\{\mathbf{y} \mid x\}$ of the height of children whose parents' height is $x$ is smaller (or larger) than $x$ if $x > \eta$ (or $x < \eta$):

$$E\{\mathbf{y} \mid x\} = \varphi(x) \begin{cases} < x & \text{if } x > \eta \\ > x & \text{if } x < \eta \end{cases}$$
This shows that the regression line $\varphi(x)$ is below the line $y = x$ for $x > \eta$ and above this line if $x < \eta$ as in Fig. 7-7. If the RVs $\mathbf{x}$ and $\mathbf{y}$ are jointly normal, then [see (7-60) below] the regression line is the straight line $\varphi(x) = \eta + r(x - \eta)$, which reduces to $\varphi(x) = rx$ when $\eta = 0$. For arbitrary RVs, the function $\varphi(x)$ need not obey Galton's law. The term regression is used, however, to identify any conditional mean.

Example 7-14. If the RVs $\mathbf{x}$ and $\mathbf{y}$ are normal as in Example 7-9, then the function

$$E\{\mathbf{y} \mid x\} = \eta_2 + r\sigma_2 \frac{x - \eta_1}{\sigma_1} \tag{7-60}$$

is a straight line with slope $r\sigma_2/\sigma_1$ passing through the point $(\eta_1, \eta_2)$. Since for normal RVs the conditional mean $E\{\mathbf{y} \mid x\}$ coincides with the maximum of $f(y \mid x)$, we conclude that the locus of the maxima of all profiles of $f(x, y)$ is the straight line (7-60).

From theorem (7-2) it follows that

$$E\{g(\mathbf{x}, \mathbf{y}) \mid \mathcal{M}\} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y) f(x, y \mid \mathcal{M})\, dx\, dy \tag{7-61}$$

This expression can be used to determine $E\{g(\mathbf{x}, \mathbf{y}) \mid x\}$; however, the conditional density $f(x, y \mid x)$ consists of line masses on the line $x$ = constant. To avoid dealing with line masses, we shall define $E\{g(\mathbf{x}, \mathbf{y}) \mid x\}$ as a limit:

As we have shown in Example 7-8, the conditional density $f(x, y \mid x < \mathbf{x} \le x + \Delta x)$ is 0 outside the strip $(x, x + \Delta x)$ and in this strip it is given by (7-39) where $x_1 = x$ and $x_2 = x + \Delta x$. It follows, therefore, from (7-61) with $\mathcal{M} = \{x < \mathbf{x} \le x + \Delta x\}$ that

$$E\{g(\mathbf{x}, \mathbf{y}) \mid x < \mathbf{x} \le x + \Delta x\} = \int_{-\infty}^{\infty}\int_{x}^{x + \Delta x} g(\alpha, y) \frac{f(\alpha, y)}{F_x(x + \Delta x) - F_x(x)}\, d\alpha\, dy$$

As $\Delta x \to 0$, the inner integral tends to $g(x, y) f(x, y)/f(x)$. Defining $E\{g(\mathbf{x}, \mathbf{y}) \mid x\}$ as the limit of the above, we obtain

$$E\{g(\mathbf{x}, \mathbf{y}) \mid x\} = \int_{-\infty}^{\infty} g(x, y) f(y \mid x)\, dy \tag{7-62}$$
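We also note that

$$E\{g(x, \mathbf{y}) \mid x\} = \int_{-\infty}^{\infty} g(x, y) f(y \mid x)\, dy \tag{7-63}$$

because $g(x, \mathbf{y})$ is a function of the RV $\mathbf{y}$, with $x$ a parameter; hence its conditional expected value is given by (7-56). Thus

$$E\{g(\mathbf{x}, \mathbf{y}) \mid x\} = E\{g(x, \mathbf{y}) \mid x\} \tag{7-64}$$

One might be tempted from the above to conclude that (7-64) follows directly from (7-56); however, this is not so. The functions $g(\mathbf{x}, \mathbf{y})$ and $g(x, \mathbf{y})$ have the same expected value, assuming $\mathbf{x} = x$, but they are not equal. The first is a function $g(\mathbf{x}, \mathbf{y})$ of the RVs $\mathbf{x}$ and $\mathbf{y}$, and for a specific $\zeta$ it takes the value $g[\mathbf{x}(\zeta), \mathbf{y}(\zeta)]$. The second is a function $g(x, \mathbf{y})$ of the real variable $x$ and the RV $\mathbf{y}$, and for a specific $\zeta$ it takes the value $g[x, \mathbf{y}(\zeta)]$ where $x$ is an arbitrary number.

Conditional Expected Values as RVs

The conditional mean of $\mathbf{y}$, assuming $\mathbf{x} = x$, is a function $\varphi(x) = E\{\mathbf{y} \mid x\}$ of $x$ given by (7-59). Using this function, we can construct the RV $\varphi(\mathbf{x}) = E\{\mathbf{y} \mid \mathbf{x}\}$ as in Sec. 5-1. As we see from (5-29), the mean of this RV equals

$$E\{\varphi(\mathbf{x})\} = \int_{-\infty}^{\infty} \varphi(x) f(x)\, dx = \int_{-\infty}^{\infty} f(x) \int_{-\infty}^{\infty} y f(y \mid x)\, dy\, dx$$

Since $f(x, y) = f(x) f(y \mid x)$, the above yields

$$E\{E\{\mathbf{y} \mid \mathbf{x}\}\} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} y f(x, y)\, dx\, dy = E\{\mathbf{y}\} \tag{7-65}$$

This basic result can be generalized: The conditional mean $E\{g(\mathbf{x}, \mathbf{y}) \mid x\}$ of $g(\mathbf{x}, \mathbf{y})$, assuming $\mathbf{x} = x$, is a function of the real variable $x$. It defines, therefore, the function $E\{g(\mathbf{x}, \mathbf{y}) \mid \mathbf{x}\}$ of the RV $\mathbf{x}$. As we see from (7-2) and (7-61), the mean of $E\{g(\mathbf{x}, \mathbf{y}) \mid \mathbf{x}\}$ equals

$$\int_{-\infty}^{\infty} f(x) \int_{-\infty}^{\infty} g(x, y) f(y \mid x)\, dy\, dx = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y) f(x, y)\, dx\, dy$$

But the last integral equals $E\{g(\mathbf{x}, \mathbf{y})\}$; hence

$$E\{E\{g(\mathbf{x}, \mathbf{y}) \mid \mathbf{x}\}\} = E\{g(\mathbf{x}, \mathbf{y})\} \tag{7-66}$$

We note, finally, that

$$E\{g_1(\mathbf{x}) g_2(\mathbf{y}) \mid x\} = E\{g_1(x) g_2(\mathbf{y}) \mid x\} = g_1(x) E\{g_2(\mathbf{y}) \mid x\}$$

$$E\{g_1(\mathbf{x}) g_2(\mathbf{y})\} = E\{E\{g_1(\mathbf{x}) g_2(\mathbf{y}) \mid \mathbf{x}\}\} = E\{g_1(\mathbf{x}) E\{g_2(\mathbf{y}) \mid \mathbf{x}\}\} \tag{7-67}$$

Example 7-15. Suppose that the RVs $\mathbf{x}$ and $\mathbf{y}$ are $N(0, 0; \sigma_1, \sigma_2; r)$. As we know

$$E\{\mathbf{x}^2\} = \sigma_1^2 \qquad E\{\mathbf{x}^4\} = 3\sigma_1^4$$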
Furthermore, $f(y \mid x)$ is a normal density with mean $r\sigma_2 x/\sigma_1$ and variance $\sigma_2^2(1 - r^2)$. Hence

$$E\{\mathbf{y}^2 \mid x\} = \left(\frac{r\sigma_2 x}{\sigma_1}\right)^2 + \sigma_2^2(1 - r^2)$$

Using (7-67), we shall show that

$$E\{\mathbf{x}\mathbf{y}\} = r\sigma_1\sigma_2 \qquad E\{\mathbf{x}^2\mathbf{y}^2\} = E\{\mathbf{x}^2\}E\{\mathbf{y}^2\} + 2E^2\{\mathbf{x}\mathbf{y}\}$$

Proof.

$$E\{\mathbf{x}\mathbf{y}\} = E\{\mathbf{x} E\{\mathbf{y} \mid \mathbf{x}\}\} = E\left\{r\sigma_2 \frac{\mathbf{x}^2}{\sigma_1}\right\} = r\sigma_1\sigma_2$$

$$E\{\mathbf{x}^2\mathbf{y}^2\} = E\{\mathbf{x}^2 E\{\mathbf{y}^2 \mid \mathbf{x}\}\} = E\left\{\mathbf{x}^2\left[r^2\sigma_2^2\frac{\mathbf{x}^2}{\sigma_1^2} + \sigma_2^2(1 - r^2)\right]\right\} = 3\sigma_1^4 \frac{r^2\sigma_2^2}{\sigma_1^2} + \sigma_1^2\sigma_2^2(1 - r^2) = \sigma_1^2\sigma_2^2 + 2r^2\sigma_1^2\sigma_2^2$$

and the proof is complete [see also (7-36)].

7-5 MEAN SQUARE ESTIMATION

The estimation problem is fundamental in the applications of probability and it will be discussed in detail later (Chap. 14). In this section, we introduce the main ideas using as illustration the estimation of an RV $\mathbf{y}$ in terms of another RV $\mathbf{x}$. Throughout this analysis, the optimality criterion will be the minimization of the mean square value (abbreviation: MS) of the estimation error.

We start with a brief explanation of the underlying concepts in the context of repeated trials, considering first the problem of estimating the RV $\mathbf{y}$ by a constant.

Frequency interpretation. As we know, the distribution function $F(y)$ of the RV $\mathbf{y}$ determines completely its statistics. This does not, of course, mean that if we know $F(y)$ we can predict the value $\mathbf{y}(\zeta)$ of $\mathbf{y}$ at some future trial. Suppose, however, that we wish to estimate the unknown $\mathbf{y}(\zeta)$ by some number $c$. As we shall presently see, knowledge of $F(y)$ can guide us in the selection of $c$.

If $\mathbf{y}$ is estimated by a constant $c$, then, at a particular trial, the error $\mathbf{y}(\zeta) - c$ results and our problem is to select $c$ so as to minimize this error in some sense. A reasonable criterion for selecting $c$ might be the condition that, in a long series of trials, the average error is close to 0:

$$\frac{\mathbf{y}(\zeta_1) - c + \cdots + \mathbf{y}(\zeta_n) - c}{n} \simeq 0$$

As we see from (5-26), this would lead to the conclusion that $c$ should equal the mean of $\mathbf{y}$ (Fig. 7-8a).

Another criterion for selecting $c$ might be the minimization of the average of $|\mathbf{y}(\zeta) - c|$. In this case, the optimum $c$ is the median of $\mathbf{y}$ (see page 68).
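The criteria just described can be tried on a small set of trial values. The sketch below is only an illustration: the five sample values are hypothetical, and the optimum constant is located by a brute-force search over a grid of candidates rather than analytically. It evaluates the average absolute error and also the average squared error, which is the MS criterion used in the rest of this section.

```python
from statistics import mean, median

# Hypothetical observed values y(zeta_i) from five trials.
samples = [1.0, 2.0, 3.0, 4.0, 10.0]

# Grid of candidate constants c = 0.00, 0.01, ..., 12.00.
candidates = [k / 100 for k in range(0, 1201)]

def avg_error(c):
    """Average of the errors y(zeta_i) - c."""
    return sum(y - c for y in samples) / len(samples)

def avg_sq_error(c):
    """Average of the squared errors |y(zeta_i) - c|^2."""
    return sum((y - c) ** 2 for y in samples) / len(samples)

def avg_abs_error(c):
    """Average of the absolute errors |y(zeta_i) - c|."""
    return sum(abs(y - c) for y in samples) / len(samples)

best_ms = min(candidates, key=avg_sq_error)    # grid point with least squared error
best_abs = min(candidates, key=avg_abs_error)  # grid point with least absolute error

print("squared-error optimum:", best_ms, "sample mean:", mean(samples))
print("absolute-error optimum:", best_abs, "sample median:", median(samples))
print("average error at the mean:", avg_error(mean(samples)))
```

With these values the sample mean and the sample median differ, so the two criteria select different constants.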
FIGURE 7-8

In our analysis, we consider only MS estimates. This means that $c$ should be such as to minimize the average of $|\mathbf{y}(\zeta) - c|^2$. This criterion is in general useful but it is selected mainly because it leads to simple results. As we shall soon see, the best $c$ is again the mean of $\mathbf{y}$.

Suppose now that at each trial we observe the value $\mathbf{x}(\zeta)$ of the RV $\mathbf{x}$. On the basis of this observation it might be best to use as the estimate of $\mathbf{y}$ not the same number $c$ at each trial, but a number that depends on the observed $\mathbf{x}(\zeta)$. In other words, we might use as the estimate of $\mathbf{y}$ a function $c(\mathbf{x})$ of the RV $\mathbf{x}$. The resulting problem is the optimum determination of this function.

It might be argued that, if at a certain trial we observe $\mathbf{x}(\zeta)$, then we can determine the outcome $\zeta$ of this trial, and hence also the corresponding value $\mathbf{y}(\zeta)$ of $\mathbf{y}$. This, however, is not so. The same number $\mathbf{x}(\zeta) = x$ is observed for every $\zeta$ in the set $\{\mathbf{x} = x\}$ (Fig. 7-8b). If, therefore, this set has many elements and the values of $\mathbf{y}$ are different for the various elements of this set, then the observed $\mathbf{x}(\zeta)$ does not determine uniquely $\mathbf{y}(\zeta)$. However, we know now that $\zeta$ is an element of the subset $\{\mathbf{x} = x\}$. This information reduces the uncertainty about the value of $\mathbf{y}$. In the subset $\{\mathbf{x} = x\}$, the RV $\mathbf{x}$ equals $x$ and the problem of determining $c(\mathbf{x})$ is reduced to the problem of determining the constant $c(x)$. As we noted, if the optimality criterion is the minimization of the MS error, then $c(x)$ must be the average of $\mathbf{y}$ in this set. In other words, $c(x)$ must equal the conditional mean of $\mathbf{y}$ assuming that $\mathbf{x} = x$.

We shall illustrate with an example. Suppose that the space $\mathcal{S}$ is the set of all children in a community and the RV $\mathbf{y}$ is the height of each child. A particular outcome $\zeta$ is a specific child and $\mathbf{y}(\zeta)$ is the height of this child. From the preceding discussion it follows that if we wish to estimate $\mathbf{y}$ by a number, this number must equal the mean of $\mathbf{y}$.
We now assume that each selected child is weighed. On the basis of this observation, the estimate of the height of the child can be improved. The weight is an RV $\mathbf{x}$; hence the optimum estimate of $\mathbf{y}$ is now the conditional mean $E\{\mathbf{y} \mid x\}$ of $\mathbf{y}$ assuming $\mathbf{x} = x$ where $x$ is the observed weight.

In the context of probability theory, the MS estimation of the RV $\mathbf{y}$ by a constant $c$ can be phrased as follows: Find $c$ such that the second moment (MS error)

$$e = E\{(\mathbf{y} - c)^2\} = \int_{-\infty}^{\infty} (y - c)^2 f(y)\, dy \tag{7-68}$$

of the difference (error) $\mathbf{y} - c$ is minimum. Clearly, $e$ depends on $c$ and it is minimum if

$$\frac{de}{dc} = -\int_{-\infty}^{\infty} 2(y - c) f(y)\, dy = 0$$

that is, if

$$c = \int_{-\infty}^{\infty} y f(y)\, dy$$

Thus

$$c = E\{\mathbf{y}\} = \int_{-\infty}^{\infty} y f(y)\, dy \tag{7-69}$$

This result is well known from mechanics: The moment of inertia with respect to a point $c$ is minimum if $c$ is the center of gravity of the masses.

NONLINEAR MS ESTIMATION. We wish to estimate $\mathbf{y}$ not by a constant but by a function $c(\mathbf{x})$ of the RV $\mathbf{x}$. Our problem now is to find the function $c(x)$ such that the MS error

$$e = E\{[\mathbf{y} - c(\mathbf{x})]^2\} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} [y - c(x)]^2 f(x, y)\, dx\, dy \tag{7-70}$$

is minimum. We maintain that

$$c(x) = E\{\mathbf{y} \mid x\} = \int_{-\infty}^{\infty} y f(y \mid x)\, dy \tag{7-71}$$

Proof. Since $f(x, y) = f(y \mid x) f(x)$, (7-70) yields

$$e = \int_{-\infty}^{\infty} f(x) \int_{-\infty}^{\infty} [y - c(x)]^2 f(y \mid x)\, dy\, dx$$

The integrands above are positive. Hence $e$ is minimum if the inner integral is minimum for every $x$. This integral is of the form (7-68) if $c$ is changed to $c(x)$, and $f(y)$ is changed to $f(y \mid x)$. Hence it is minimum if $c(x)$ equals the integral in (7-69), provided that $f(y)$ is changed to $f(y \mid x)$. The result is (7-71).

Thus the optimum $c(x)$ is the regression line $\varphi(x)$ of Fig. 7-6.

As we noted in the beginning of the section, if $\mathbf{y} = g(\mathbf{x})$, then $E\{\mathbf{y} \mid \mathbf{x}\} = g(\mathbf{x})$; hence $c(x) = g(x)$ and the resulting MS error is 0. This is not surprising because, if $\mathbf{x}$ is observed and $\mathbf{y} = g(\mathbf{x})$, then $\mathbf{y}$ is determined uniquely.

If the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent, then $E\{\mathbf{y} \mid x\} = E\{\mathbf{y}\}$ = constant. In this case, knowledge of $\mathbf{x}$ has no effect on the estimate of $\mathbf{y}$.
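The optimality of the conditional mean can be illustrated with a discrete stand-in for the densities. The sketch below is a minimal example under stated assumptions: the joint probabilities are hypothetical, sums replace the integrals of (7-70) and (7-71), and the function names are illustrative. It compares the MS error of the conditional mean with that of the constant estimate $E\{\mathbf{y}\}$ and of a perturbed estimator, and it also checks that the mean of the conditional mean equals $E\{\mathbf{y}\}$.

```python
from collections import defaultdict

# Hypothetical joint probabilities P{x = xi, y = yj}; a discrete stand-in for f(x, y).
joint = {(0, 0): 0.3, (0, 2): 0.1, (1, 1): 0.2, (1, 3): 0.4}

# Marginal probabilities P{x = xi}.
px = defaultdict(float)
for (x, y), p in joint.items():
    px[x] += p

def cond_mean(x):
    """Discrete analog of (7-71): E{y | x} = sum_y y P{y | x}."""
    return sum(y * p for (xx, y), p in joint.items() if xx == x) / px[x]

def ms_error(estimator):
    """Discrete analog of (7-70): E{[y - c(x)]^2} for a given estimator c."""
    return sum(p * (y - estimator(x)) ** 2 for (x, y), p in joint.items())

mean_y = sum(y * p for (_, y), p in joint.items())

e_const = ms_error(lambda x: mean_y)              # best constant estimate
e_cond = ms_error(cond_mean)                      # conditional-mean estimate
e_other = ms_error(lambda x: cond_mean(x) + 0.1)  # a perturbed estimator

# Mean of the RV E{y | x}, which should equal E{y}.
total_expectation = sum(px[x] * cond_mean(x) for x in px)

print(e_const, e_cond, e_other, total_expectation, mean_y)
```

Any other choice of the perturbation gives the same ordering, since the inner sum is minimized separately for each value of $x$.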
Linear MS Estimation

The solution of the nonlinear MS estimation problem is based on knowledge of the function $\varphi(x)$. An easier problem, using only second-order moments, is the linear MS estimation of $\mathbf{y}$ in terms of $\mathbf{x}$. The resulting estimate is not as good as the nonlinear estimate; however, it is used in many applications because of the simplicity of the solution.

The linear estimation problem is the estimation of the RV $\mathbf{y}$ in terms of a linear function $A\mathbf{x} + B$ of $\mathbf{x}$. The problem now is to find the constants $A$ and $B$ so as to minimize the MS error

$$e = E\{[\mathbf{y} - (A\mathbf{x} + B)]^2\} \tag{7-72}$$

We maintain that $e = e_m$ is minimum if

$$A = \frac{\mu_{11}}{\mu_{20}} = \frac{r\sigma_y}{\sigma_x} \qquad B = \eta_y - A\eta_x \tag{7-73}$$

and

$$e_m = \mu_{02} - \frac{\mu_{11}^2}{\mu_{20}} = \sigma_y^2(1 - r^2) \tag{7-74}$$

Proof. For a given $A$, $e$ is the MS error of the estimation of $\mathbf{y} - A\mathbf{x}$ by the constant $B$. Hence $e$ is minimum if $B = E\{\mathbf{y} - A\mathbf{x}\}$ as in (7-69). With $B$ so determined, (7-72) yields

$$e = E\{[(\mathbf{y} - \eta_y) - A(\mathbf{x} - \eta_x)]^2\} = \sigma_y^2 - 2Ar\sigma_x\sigma_y + A^2\sigma_x^2$$

This is minimum if $A = r\sigma_y/\sigma_x$ and (7-73) results. Inserting into the above quadratic, we obtain (7-74).

Terminology. In the above, the sum $A\mathbf{x} + B$ is the nonhomogeneous linear estimate of $\mathbf{y}$ in terms of $\mathbf{x}$. If $\mathbf{y}$ is estimated by a straight line $a\mathbf{x}$ passing through the origin, the estimate is called homogeneous.

The RV $\mathbf{x}$ is the data of the estimation, the RV $\boldsymbol{\varepsilon} = \mathbf{y} - (A\mathbf{x} + B)$ is the error of the estimation, and the number $e = E\{\boldsymbol{\varepsilon}^2\}$ is the MS error.

Fundamental note. In general, the nonlinear estimate $\varphi(\mathbf{x}) = E\{\mathbf{y} \mid \mathbf{x}\}$ of $\mathbf{y}$ in terms of $\mathbf{x}$ is not a straight line and the resulting MS error $E\{[\mathbf{y} - \varphi(\mathbf{x})]^2\}$ is smaller than the MS error $e_m$ of the linear estimate $A\mathbf{x} + B$. However, if the RVs $\mathbf{x}$ and $\mathbf{y}$ are jointly normal, then [see (7-60)]

$$\varphi(x) = \frac{r\sigma_y}{\sigma_x} x + \eta_y - \frac{r\sigma_y}{\sigma_x}\eta_x$$

is a straight line as in (7-73). In other words:

For normal RVs, nonlinear and linear MS estimates are identical.
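The constants in (7-73) and the minimum error (7-74) depend only on second-order moments, so they are easy to compute once those moments are available. The following sketch assumes a hypothetical discrete joint distribution (sums replace the integrals) and uses illustrative names; it is not a general-purpose estimator.

```python
import math

# Hypothetical joint probabilities P{x = xi, y = yj}.
joint = {(0, 0): 0.2, (1, 1): 0.3, (2, 1): 0.1, (2, 4): 0.4}

def E(g):
    """Expected value of g(x, y) under the joint probabilities."""
    return sum(p * g(x, y) for (x, y), p in joint.items())

eta_x = E(lambda x, y: x)
eta_y = E(lambda x, y: y)
mu20 = E(lambda x, y: (x - eta_x) ** 2)            # variance of x
mu02 = E(lambda x, y: (y - eta_y) ** 2)            # variance of y
mu11 = E(lambda x, y: (x - eta_x) * (y - eta_y))   # covariance
r = mu11 / math.sqrt(mu20 * mu02)                  # correlation coefficient

# Optimum constants from (7-73).
A = mu11 / mu20
B = eta_y - A * eta_x

def ms_error(a, b):
    """MS error (7-72) of the linear estimate a*x + b."""
    return E(lambda x, y: (y - (a * x + b)) ** 2)

e_m = ms_error(A, B)
orthogonality = E(lambda x, y: (y - (A * x + B)) * x)  # error times data

print(A, B, e_m, mu02 * (1 - r ** 2), orthogonality)
```

The last printed quantity is the mean of the product of the estimation error and the data, which vanishes at the optimum constants.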
The Orthogonality Principle

From (7-73) it follows that

$$E\{[\mathbf{y} - (A\mathbf{x} + B)]\mathbf{x}\} = 0 \tag{7-75}$$

This result can be derived directly from (7-72). Indeed, the MS error $e$ is a function of $A$ and $B$ and it is minimum if $\partial e/\partial A = 0$ and $\partial e/\partial B = 0$. The first equation yields

$$\frac{\partial e}{\partial A} = E\{2[\mathbf{y} - (A\mathbf{x} + B)](-\mathbf{x})\} = 0$$

and (7-75) results. The interchange between expected value and differentiation is equivalent to the interchange of integration and differentiation.

Equation (7-75) states that the optimum linear MS estimate $A\mathbf{x} + B$ of $\mathbf{y}$ is such that the estimation error $\mathbf{y} - (A\mathbf{x} + B)$ is orthogonal to the data $\mathbf{x}$. This is known as the orthogonality principle. It is fundamental in MS estimation and will be used extensively. In the following, we reestablish it for the homogeneous case.

HOMOGENEOUS LINEAR MS ESTIMATION. We wish to find a constant $a$ such that, if $\mathbf{y}$ is estimated by $a\mathbf{x}$, the resulting MS error

$$e = E\{(\mathbf{y} - a\mathbf{x})^2\} \tag{7-76}$$

is minimum. We maintain that $a$ must be such that

$$E\{(\mathbf{y} - a\mathbf{x})\mathbf{x}\} = 0 \tag{7-77}$$

Proof. Clearly, $e$ is minimum if $e'(a) = 0$; this yields (7-77). We shall give a second proof: We assume that $a$ satisfies (7-77) and we shall show that $e$ is minimum. With $\bar{a}$ an arbitrary constant,

$$E\{(\mathbf{y} - \bar{a}\mathbf{x})^2\} = E\{[(\mathbf{y} - a\mathbf{x}) + (a - \bar{a})\mathbf{x}]^2\} = E\{(\mathbf{y} - a\mathbf{x})^2\} + (a - \bar{a})^2 E\{\mathbf{x}^2\} + 2(a - \bar{a})E\{(\mathbf{y} - a\mathbf{x})\mathbf{x}\}$$

In the above, the last term is 0 by assumption and the second term is positive. From this it follows that

$$E\{(\mathbf{y} - \bar{a}\mathbf{x})^2\} \ge E\{(\mathbf{y} - a\mathbf{x})^2\}$$

for any $\bar{a}$; hence $e$ is minimum.

The linear MS estimate of $\mathbf{y}$ in terms of $\mathbf{x}$ will be denoted by $\hat{E}\{\mathbf{y} \mid \mathbf{x}\}$. Solving (7-77), we conclude that

$$\hat{E}\{\mathbf{y} \mid \mathbf{x}\} = a\mathbf{x} \qquad a = \frac{E\{\mathbf{x}\mathbf{y}\}}{E\{\mathbf{x}^2\}} \tag{7-78}$$
MS error. Since

$$e = E\{(\mathbf{y} - a\mathbf{x})\mathbf{y}\} - E\{(\mathbf{y} - a\mathbf{x})a\mathbf{x}\} = E\{\mathbf{y}^2\} - E\{(a\mathbf{x})^2\} - 2aE\{(\mathbf{y} - a\mathbf{x})\mathbf{x}\}$$

we conclude with (7-77) that

$$e = E\{(\mathbf{y} - a\mathbf{x})\mathbf{y}\} = E\{\mathbf{y}^2\} - E\{(a\mathbf{x})^2\} \tag{7-79}$$

We note finally that (7-77) is consistent with the orthogonality principle: The error $\mathbf{y} - a\mathbf{x}$ is orthogonal to the data $\mathbf{x}$.

Geometric interpretation of the orthogonality principle. In the vector representation of RVs (see Fig. 7-9), the difference $\mathbf{y} - a\mathbf{x}$ is the vector from the point $a\mathbf{x}$ on the $\mathbf{x}$ line to the point $\mathbf{y}$, and the length of that vector equals $\sqrt{e}$. Clearly, this length is minimum if $\mathbf{y} - a\mathbf{x}$ is perpendicular to $\mathbf{x}$ in agreement with (7-77). The right side of (7-79) follows from the Pythagorean theorem and the middle term states that the square of the length of $\mathbf{y} - a\mathbf{x}$ equals the inner product of $\mathbf{y}$ with the error $\mathbf{y} - a\mathbf{x}$.

FIGURE 7-9

Risk and loss functions. We conclude with a brief comment on other optimality criteria, limiting the discussion to the estimation of an RV $\mathbf{y}$ by a constant $c$. We select a function $L(x)$ and we choose $c$ so as to minimize the mean

$$R = E\{L(\mathbf{y} - c)\} = \int_{-\infty}^{\infty} L(y - c) f(y)\, dy$$

of the RV $L(\mathbf{y} - c)$. The function $L(x)$ is called the loss function and the constant $R$ is called the average risk. The choice of $L(x)$ depends on the applications. If $L(x) = x^2$, then $R = E\{(\mathbf{y} - c)^2\}$ is the MS error and as we have shown, it is minimum if $c = E\{\mathbf{y}\}$.

If $L(x) = |x|$, then $R = E\{|\mathbf{y} - c|\}$. We maintain that in this case, $c$ equals the median $y_{0.5}$ of $\mathbf{y}$ (see also Prob. 5-20).

Proof. The average risk equals

$$R = \int_{-\infty}^{\infty} |y - c| f(y)\, dy = \int_{-\infty}^{c} (c - y) f(y)\, dy + \int_{c}^{\infty} (y - c) f(y)\, dy$$

Differentiating with respect to $c$, we obtain

$$\frac{dR}{dc} = \int_{-\infty}^{c} f(y)\, dy - \int_{c}^{\infty} f(y)\, dy = 2F(c) - 1$$

Thus $R$ is minimum if $F(c) = 1/2$, that is, if $c = y_{0.5}$.
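We note finally that in certain applications, $\mathbf{y}$ is estimated by its mode, that is, the value $y_{\max}$ of $y$ for which $f(y)$ is maximum. This is based on the following: The probability that $\mathbf{y}$ is in an interval $(c, c + dy)$ of specified length $dy$ equals $P\{c < \mathbf{y} < c + dy\} \simeq f(c)\, dy$. This is maximum if $c = y_{\max}$.

PROBLEMS

7-1. The RVs $\mathbf{x}$ and $\mathbf{y}$ are $N(0; \sigma)$ and independent. Show that if $\mathbf{z} = |\mathbf{x} - \mathbf{y}|$, then $E\{\mathbf{z}\} = 2\sigma/\sqrt{\pi}$, $E\{\mathbf{z}^2\} = 2\sigma^2$.

7-2. Show that if $\mathbf{x}$ and $\mathbf{y}$ are two independent RVs with $f_x(x) = e^{-x}U(x)$, $f_y(y) = e^{-y}U(y)$, and $\mathbf{z} = (\mathbf{x} - \mathbf{y})U(\mathbf{x} - \mathbf{y})$, then $E\{\mathbf{z}\} = 1/2$.

7-3. Show that for any $\mathbf{x}$, $\mathbf{y}$ real or complex (a) $|E\{\mathbf{x}\mathbf{y}\}|^2 \le E\{|\mathbf{x}|^2\}E\{|\mathbf{y}|^2\}$; (b) (triangle inequality) $\sqrt{E\{|\mathbf{x} + \mathbf{y}|^2\}} \le \sqrt{E\{|\mathbf{x}|^2\}} + \sqrt{E\{|\mathbf{y}|^2\}}$.

7-4. Show that, if $r_{xy} = 1$, then $\mathbf{y} = a\mathbf{x} + b$.

7-5. Show that, if $E\{\mathbf{x}^2\} = E\{\mathbf{y}^2\} = E\{\mathbf{x}\mathbf{y}\}$, then $\mathbf{x} = \mathbf{y}$.

7-6. Show that, if the RV $\mathbf{x}$ is of discrete type taking the values $x_n$ with $P\{\mathbf{x} = x_n\} = p_n$ and $\mathbf{z} = g(\mathbf{x}, \mathbf{y})$, then

$$E\{\mathbf{z}\} = \sum_n E\{g(x_n, \mathbf{y})\} p_n \qquad f_z(z) = \sum_n f_z(z \mid x_n) p_n$$

7-7. The RV $\mathbf{n}$ is Poisson with parameter $\lambda$ and the RV $\mathbf{x}$ is independent of $\mathbf{n}$. Show that, if $\mathbf{z} = \mathbf{n}\mathbf{x}$ and

$$f_x(x) = \frac{\alpha}{\pi(\alpha^2 + x^2)}$$

then $\Phi_z(\omega) = \exp\{\lambda e^{-\alpha|\omega|} - \lambda\}$.

7-8. Show that, if the RVs $\mathbf{x}$ and $\mathbf{y}$ are $N(0, 0; \sigma, \sigma; r)$, then

$$E\{f_y(\mathbf{y} \mid x)\} = \frac{1}{\sigma\sqrt{2\pi(2 - r^2)}} \exp\left\{-\frac{r^2 x^2}{2\sigma^2(2 - r^2)}\right\}$$

7-9. Show that if the RVs $\mathbf{x}$, $\mathbf{y}$ are $N(0, 0; \sigma_1, \sigma_2; r)$ then

$$E\{|\mathbf{x}\mathbf{y}|\} = \frac{2}{\pi}\int_0^C \arcsin\frac{\mu}{\sigma_1\sigma_2}\, d\mu + \frac{2\sigma_1\sigma_2}{\pi} = \frac{2\sigma_1\sigma_2}{\pi}(\cos\alpha + \alpha\sin\alpha)$$

where $r = \sin\alpha$ and $C = r\sigma_1\sigma_2$. Hint: Use (7-37) with $g(x, y) = |xy|$.

7-10. The RVs $\mathbf{x}$ and $\mathbf{y}$ are uniform in the interval $(-1, 1)$ and independent. Find the conditional density $f_r(r \mid \mathcal{M})$ of the RV $\mathbf{r} = \sqrt{\mathbf{x}^2 + \mathbf{y}^2}$ where $\mathcal{M} = \{\mathbf{r} \le 1\}$.

7-11. We have a pile of $m$ coins. The probability of heads of the $i$th coin equals $p_i$. We select at random one of the coins, we toss it $n$ times and heads shows $k$ times.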
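Show that the probability that we selected the $r$th coin equals

$$\frac{p_r^k(1 - p_r)^{n-k}}{p_1^k(1 - p_1)^{n-k} + \cdots + p_m^k(1 - p_m)^{n-k}}$$

7-12. The RV $\mathbf{x}$ has a Student-$t$ distribution $t(n)$. Show that $E\{\mathbf{x}^2\} = n/(n - 2)$.

7-13. Show that if $\beta_x(t) = f_x(t \mid \mathbf{x} > t)$, $\beta_y(t) = f_y(t \mid \mathbf{y} > t)$ and $\beta_x(t) = k\beta_y(t)$, then $1 - F_x(x) = [1 - F_y(x)]^k$.

7-14. Show that, for any $\mathbf{x}$, $\mathbf{y}$, and $\varepsilon > 0$,

$$P\{|\mathbf{x} - \mathbf{y}| > \varepsilon\} \le \frac{1}{\varepsilon^2} E\{|\mathbf{x} - \mathbf{y}|^2\}$$

7-15. Show that the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent iff for any $a$ and $b$:

$$E\{U(a - \mathbf{x})U(b - \mathbf{y})\} = E\{U(a - \mathbf{x})\}E\{U(b - \mathbf{y})\}$$

7-16. Show that

$$E\{\mathbf{y} \mid \mathbf{x} \le 0\} = \frac{1}{F_x(0)} \int_{-\infty}^{0} E\{\mathbf{y} \mid x\} f_x(x)\, dx$$

7-17. Show that, if the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent and $\mathbf{z} = \mathbf{x} + \mathbf{y}$, then $f_z(z \mid x) = f_y(z - x)$.

7-18. The RVs $\mathbf{x}$, $\mathbf{y}$ are $N(3, 4; 1, 2; 0.5)$. Find $f(y \mid x)$ and $f(x \mid y)$.

7-19. Show that, for any $\mathbf{x}$ and $\mathbf{y}$, the RVs $\mathbf{z} = F_x(\mathbf{x})$ and $\mathbf{w} = F_y(\mathbf{y} \mid \mathbf{x})$ are independent and each is uniform in the interval $(0, 1)$.

7-20. The RVs $\mathbf{x}$ and $\mathbf{y}$ are $N(0, 0; 3, 5; 0.8)$. Find $g(x)$ such that $E\{[\mathbf{y} - g(\mathbf{x})]^2\}$ is minimum.

7-21. In the approximation of $\mathbf{y}$ by $\varphi(\mathbf{x})$, the "mean cost" $E\{g[\mathbf{y} - \varphi(\mathbf{x})]\}$ results, where $g(x)$ is a given function. Show that, if $g(x)$ is an even convex function as in Fig. P5-28, then the "mean cost" is minimum if $\varphi(x) = E\{\mathbf{y} \mid x\}$.

7-22. Show that if $\varphi(x) = E\{\mathbf{y} \mid x\}$ is the nonlinear MS estimate of $\mathbf{y}$ in terms of $\mathbf{x}$, then

$$E\{[\mathbf{y} - \varphi(\mathbf{x})]^2\} = E\{\mathbf{y}^2\} - E\{\varphi^2(\mathbf{x})\}$$

7-23. If $\eta_x = \eta_y = 0$, $\sigma_x = \sigma_y = 4$, and $\hat{\mathbf{y}} = 0.2\mathbf{x}$ (linear MS estimate), find $E\{(\mathbf{y} - \hat{\mathbf{y}})^2\}$.

7-24. Show that if the constants $A$, $B$, and $a$ are such that $E\{[\mathbf{y} - (A\mathbf{x} + B)]^2\}$ and $E\{[(\mathbf{y} - \eta_y) - a(\mathbf{x} - \eta_x)]^2\}$ are minimum, then $a = A$.

7-25. Given $\eta_x = 4$, $\eta_y = 0$, $\sigma_x = 1$, $\sigma_y = 2$, $r_{xy} = 0.5$, find the parameters $A$, $B$, and $a$ that minimize $E\{[\mathbf{y} - (A\mathbf{x} + B)]^2\}$ and $E\{(\mathbf{y} - a\mathbf{x})^2\}$.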
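7-26. The RVs $\mathbf{x}$, $\mathbf{y}$ are independent, integer-valued with $P\{\mathbf{x} = k\} = p_k$, $P\{\mathbf{y} = k\} = q_k$. Show that (a) if $\mathbf{z} = \mathbf{x} + \mathbf{y}$, then (discrete-time convolution)

$$P\{\mathbf{z} = n\} = \sum_{k=-\infty}^{\infty} p_k q_{n-k}$$

(b) if the RVs $\mathbf{x}$, $\mathbf{y}$ are Poisson distributed with parameters $a$ and $b$ respectively and $\mathbf{w} = \mathbf{x} - \mathbf{y}$, then

$$P\{\mathbf{w} = n\} = e^{-(a+b)} \sum_{k=m}^{\infty} \frac{a^{n+k} b^{k}}{(n + k)!\, k!} \qquad m = \begin{cases} 0 & n \ge 0 \\ |n| & n < 0 \end{cases}$$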
7-27. If $\mathbf{y} = \mathbf{x}^3$, find the nonlinear and linear MS estimate of $\mathbf{y}$ in terms of $\mathbf{x}$ and the resulting MS errors.

7-28. The RV $\mathbf{x}$ has a Rayleigh density [see (6-50)]. Find its conditional failure rate $\beta(t)$.

7-29. Find the reliability $R(t)$ of a system if $\beta(t) = ct/(1 + ct)$.

7-30. The RV $\mathbf{x}$ is uniform in the interval $(0, T)$. Find and sketch $\beta(t)$.

7-31. Find and sketch $R(t)$ if $\beta(t) = 4U(t) + 2U(t - T)$. Find the mean time to failure of the system.

7-32. The RVs $\mathbf{x}$ and $\mathbf{y}$ are jointly normal with zero mean, and $\sigma_x = 2$, $\sigma_y = 4$, $r_{xy} = 0.5$. (a) Find the regression line $E\{\mathbf{y} \mid x\} = \varphi(x)$. (b) Show that the RVs $\mathbf{x}$ and $\mathbf{y} - \varphi(\mathbf{x})$ are independent.

7-33. (a) Show that $E\{(\mathbf{y} - c)^2\} = \sigma_y^2 + (c - \eta_y)^2$ for any $c$. (b) Using this, show that $E\{(\mathbf{y} - c)^2\}$ is minimum if $c = \eta_y$ as in (7-69). (c) Reasoning similarly, show that $E\{[\mathbf{y} - g(\mathbf{x})]^2\}$ is minimum if $g(x) = E\{\mathbf{y} \mid x\}$ as in (7-71).
CHAPTER 8

SEQUENCES OF RANDOM VARIABLES

8-1 GENERAL CONCEPTS

A random vector is a vector

$$X = [\mathbf{x}_1, \ldots, \mathbf{x}_n] \tag{8-1}$$

whose components $\mathbf{x}_i$ are RVs.

The probability that $X$ is in a region $D$ of the $n$-dimensional space equals the probability masses in $D$:

$$P\{X \in D\} = \int_D f(X)\, dX \qquad X = [x_1, \ldots, x_n] \tag{8-2}$$

In the above

$$f(X) = f(x_1, \ldots, x_n) = \frac{\partial^n F(x_1, \ldots, x_n)}{\partial x_1 \cdots \partial x_n} \tag{8-3}$$

is the joint (or, multivariate) density of the RVs $\mathbf{x}_i$ and

$$F(X) = F(x_1, \ldots, x_n) = P\{\mathbf{x}_1 \le x_1, \ldots, \mathbf{x}_n \le x_n\} \tag{8-4}$$

is their joint distribution.

If we substitute in $F(x_1, \ldots, x_n)$ certain variables by $\infty$, we obtain the joint distribution of the remaining variables. If we integrate $f(x_1, \ldots, x_n)$ with respect to certain variables, we obtain the joint density of the remaining
variables. For example

$$F(x_1, x_3) = F(x_1, \infty, x_3, \infty) \qquad f(x_1, x_3) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x_1, x_2, x_3, x_4)\, dx_2\, dx_4 \tag{8-5}$$

Note. In the above, we identify various functions in terms of their independent variables. Thus $f(x_1, x_3)$ is the joint density of the RVs $\mathbf{x}_1$ and $\mathbf{x}_3$ and it is in general different from the joint density $f(x_2, x_4)$ of the RVs $\mathbf{x}_2$ and $\mathbf{x}_4$. Similarly, the density $f_i(x_i)$ of the RV $\mathbf{x}_i$ will often be denoted by $f(x_i)$.

TRANSFORMATIONS. Given $k$ functions

$$g_1(X), \ldots, g_k(X) \qquad X = [x_1, \ldots, x_n]$$

we form the RVs

$$\mathbf{y}_1 = g_1(X), \ldots, \mathbf{y}_k = g_k(X) \tag{8-6}$$

The statistics of these RVs can be determined in terms of the statistics of $X$ as in Sec. 6-3. If $k < n$, then we could determine first the joint density of the $n$ RVs $\mathbf{y}_1, \ldots, \mathbf{y}_k, \mathbf{x}_{k+1}, \ldots, \mathbf{x}_n$ and then use the generalization of (8-5) to eliminate the $x$'s. If $k > n$, then the RVs $\mathbf{y}_{n+1}, \ldots, \mathbf{y}_k$ can be expressed in terms of $\mathbf{y}_1, \ldots, \mathbf{y}_n$. In this case, the masses in the $k$ space are singular and can be determined in terms of the joint density of $\mathbf{y}_1, \ldots, \mathbf{y}_n$. It suffices, therefore, to assume that $k = n$.

To find the density $f_y(y_1, \ldots, y_n)$ of the random vector $Y = [\mathbf{y}_1, \ldots, \mathbf{y}_n]$ for a specific set of numbers $y_1, \ldots, y_n$, we solve the system

$$g_1(X) = y_1, \ldots, g_n(X) = y_n \tag{8-7}$$

If this system has no solutions, then $f_y(y_1, \ldots, y_n) = 0$. If it has a single solution $X = [x_1, \ldots, x_n]$, then

$$f_y(y_1, \ldots, y_n) = \frac{f_x(x_1, \ldots, x_n)}{|J(x_1, \ldots, x_n)|} \tag{8-8}$$

where

$$J(x_1, \ldots, x_n) = \begin{vmatrix} \dfrac{\partial g_1}{\partial x_1} & \cdots & \dfrac{\partial g_1}{\partial x_n} \\ \vdots & & \vdots \\ \dfrac{\partial g_n}{\partial x_1} & \cdots & \dfrac{\partial g_n}{\partial x_n} \end{vmatrix} \tag{8-9}$$

is the jacobian of the transformation (8-7). If it has several solutions, then we add the corresponding terms as in (6-63).
Independence

The RVs $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are called (mutually) independent if the events $\{\mathbf{x}_1 \le x_1\}, \ldots, \{\mathbf{x}_n \le x_n\}$ are independent. From this it follows that

$$F(x_1, \ldots, x_n) = F(x_1) \cdots F(x_n) \qquad f(x_1, \ldots, x_n) = f(x_1) \cdots f(x_n) \tag{8-10}$$

Example 8-1. Given $n$ independent RVs $\mathbf{x}_i$ with respective densities $f_i(x_i)$, we form the RVs

$$\mathbf{y}_k = \mathbf{x}_1 + \cdots + \mathbf{x}_k \qquad k = 1, \ldots, n$$

We shall determine the joint density of $\mathbf{y}_k$. The system

$$x_1 = y_1, \quad x_1 + x_2 = y_2, \quad \ldots, \quad x_1 + \cdots + x_n = y_n$$

has a unique solution

$$x_k = y_k - y_{k-1} \qquad 1 \le k \le n$$

(with $y_0 = 0$) and its jacobian equals 1. Hence [see (8-8) and (8-10)]

$$f_y(y_1, \ldots, y_n) = f_1(y_1) f_2(y_2 - y_1) \cdots f_n(y_n - y_{n-1}) \tag{8-11}$$

From (8-10) it follows that any subset of the set $\mathbf{x}_i$ is a set of independent RVs. Suppose, for example, that

$$f(x_1, x_2, x_3) = f(x_1) f(x_2) f(x_3)$$

Integrating with respect to $x_3$, we obtain $f(x_1, x_2) = f(x_1) f(x_2)$. This shows that the RVs $\mathbf{x}_1$ and $\mathbf{x}_2$ are independent. Note, however, that if the RVs $\mathbf{x}_i$ are independent in pairs, they are not necessarily independent. For example, it is possible that

$$f(x_1, x_2) = f(x_1) f(x_2) \qquad f(x_1, x_3) = f(x_1) f(x_3) \qquad f(x_2, x_3) = f(x_2) f(x_3)$$

but $f(x_1, x_2, x_3) \ne f(x_1) f(x_2) f(x_3)$ (see Prob. 8-2).

Reasoning as in (6-29), we can show that if the RVs $\mathbf{x}_i$ are independent, then the RVs

$$\mathbf{y}_1 = g_1(\mathbf{x}_1), \ldots, \mathbf{y}_n = g_n(\mathbf{x}_n)$$

are also independent.

INDEPENDENT EXPERIMENTS AND REPEATED TRIALS. Suppose that

$$\mathcal{S}^n = \mathcal{S}_1 \times \cdots \times \mathcal{S}_n$$

is a combined experiment and the RVs $\mathbf{x}_i$ depend only on the outcomes $\zeta_i$ of $\mathcal{S}_i$:

$$\mathbf{x}_i(\zeta_1 \cdots \zeta_i \cdots \zeta_n) = \mathbf{x}_i(\zeta_i) \qquad i = 1, \ldots, n$$

If the experiments $\mathcal{S}_i$ are independent, then the RVs $\mathbf{x}_i$ are independent [see also (6-30)]. The following special case is of particular interest.
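The remark that pairwise independence does not imply mutual independence can be made concrete with a small discrete example. The construction below is a standard one and is not taken from the text: two fair, independent bits and their exclusive-or, with point probabilities playing the role of the densities in (8-10). All names are illustrative.

```python
from itertools import product

# Hypothetical example: x1 and x2 are fair independent bits, x3 = x1 XOR x2.
# Each of the four outcomes (x1, x2, x3) has probability 1/4.
joint = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}

def marginal(indices):
    """Joint probabilities of the RVs whose positions are listed in `indices`."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in indices)
        out[key] = out.get(key, 0.0) + p
    return out

singles = [marginal((i,)) for i in range(3)]

def pair_factorizes(i, j):
    """True if P{xi = u, xj = v} = P{xi = u} P{xj = v} for all u, v."""
    pair = marginal((i, j))
    return all(
        abs(pair.get((u, v), 0.0) - singles[i][(u,)] * singles[j][(v,)]) < 1e-12
        for u, v in product((0, 1), repeat=2)
    )

pairwise = all(pair_factorizes(i, j) for i, j in ((0, 1), (0, 2), (1, 2)))

# Mutual independence would require the three-way factorization as well.
mutual = all(
    abs(joint.get((u, v, w), 0.0)
        - singles[0][(u,)] * singles[1][(v,)] * singles[2][(w,)]) < 1e-12
    for u, v, w in product((0, 1), repeat=3)
)

print("independent in pairs:", pairwise)
print("mutually independent:", mutual)
```

Every pair factorizes, yet the triple does not: the outcome (0, 0, 0) has probability 1/4 while the product of the three marginals is 1/8.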
Suppose that $\mathbf{x}$ is an RV defined on an experiment $\mathcal{S}$ and the experiment is performed $n$ times generating the experiment $\mathcal{S}^n = \mathcal{S} \times \cdots \times \mathcal{S}$. In this experiment, we define the RVs $\mathbf{x}_i$ such that

$$\mathbf{x}_i(\zeta_1 \cdots \zeta_i \cdots \zeta_n) = \mathbf{x}(\zeta_i) \qquad i = 1, \ldots, n \tag{8-12}$$

From this it follows that the distribution $F_i(x_i)$ of $\mathbf{x}_i$ equals the distribution $F_x(x)$ of the RV $\mathbf{x}$. Thus, if an experiment is performed $n$ times, the RVs $\mathbf{x}_i$ defined as in (8-12) are independent and they have the same distribution $F_x(x)$. These RVs are called i.i.d. (independent, identically distributed).

Example 8-2. Order statistics. The order statistics of the RVs $\mathbf{x}_i$ are $n$ RVs $\mathbf{y}_k$ defined as follows: For a specific outcome $\zeta$, the RVs $\mathbf{x}_i$ take the values $\mathbf{x}_i(\zeta)$. Ordering these numbers, we obtain the sequence

$$\mathbf{x}_{r_1}(\zeta) \le \cdots \le \mathbf{x}_{r_k}(\zeta) \le \cdots \le \mathbf{x}_{r_n}(\zeta)$$

and we define the RV $\mathbf{y}_k$ such that

$$\mathbf{y}_1(\zeta) = \mathbf{x}_{r_1}(\zeta) \le \cdots \le \mathbf{y}_k(\zeta) = \mathbf{x}_{r_k}(\zeta) \le \cdots \le \mathbf{y}_n(\zeta) = \mathbf{x}_{r_n}(\zeta) \tag{8-13}$$

We note that for a specific $i$, the values $\mathbf{x}_i(\zeta)$ of $\mathbf{x}_i$ occupy different locations in the above ordering as $\zeta$ changes.

We maintain that the density $f_k(y)$ of the $k$th statistic $\mathbf{y}_k$ is given by

$$f_k(y) = \frac{n!}{(k - 1)!\,(n - k)!} F_x^{k-1}(y)[1 - F_x(y)]^{n-k} f_x(y) \tag{8-14}$$

where $F_x(x)$ is the distribution of the i.i.d. RVs $\mathbf{x}_i$ and $f_x(x)$ is their density.

FIGURE 8-1

Proof. As we know

$$f_k(y)\, dy = P\{y < \mathbf{y}_k \le y + dy\}$$

The event $\mathcal{B} = \{y < \mathbf{y}_k \le y + dy\}$ occurs iff exactly $k - 1$ of the RVs $\mathbf{x}_i$ are less than $y$ and one is in the interval $(y, y + dy)$ (Fig. 8-1). In the original experiment $\mathcal{S}$, the events

$$\mathcal{A}_1 = \{\mathbf{x} \le y\} \qquad \mathcal{A}_2 = \{y < \mathbf{x} \le y + dy\} \qquad \mathcal{A}_3 = \{\mathbf{x} > y + dy\}$$

form a partition and

$$P(\mathcal{A}_1) = F_x(y) \qquad P(\mathcal{A}_2) = f_x(y)\, dy \qquad P(\mathcal{A}_3) = 1 - F_x(y)$$

In the experiment $\mathcal{S}^n$, the event $\mathcal{B}$ occurs iff $\mathcal{A}_1$ occurs $k - 1$ times, $\mathcal{A}_2$ occurs once, and $\mathcal{A}_3$ occurs $n - k$ times. With $k_1 = k - 1$, $k_2 = 1$, $k_3 = n - k$, it
follows from (3-38) that

$$P(\mathcal{B}) = \frac{n!}{(k - 1)!\, 1!\, (n - k)!} P^{k-1}(\mathcal{A}_1) P(\mathcal{A}_2) P^{n-k}(\mathcal{A}_3)$$

and (8-14) results.

Note that

$$f_1(y) = n[1 - F_x(y)]^{n-1} f_x(y) \qquad f_n(y) = n F_x^{n-1}(y) f_x(y)$$

These are the densities of the minimum $\mathbf{y}_1$ and the maximum $\mathbf{y}_n$ of the RVs $\mathbf{x}_i$.

Special Case. If the RVs $\mathbf{x}_i$ are exponential with parameter $\alpha$:

$$f_x(x) = \alpha e^{-\alpha x} U(x) \qquad F_x(x) = (1 - e^{-\alpha x}) U(x)$$

then

$$f_1(y) = n\alpha e^{-n\alpha y} U(y)$$

that is, their minimum $\mathbf{y}_1$ is also exponential with parameter $n\alpha$.

Example 8-3. A system consists of $m$ components and the time to failure of the $i$th component is an RV $\mathbf{x}_i$ with distribution $F_i(x)$. Thus

$$1 - F_i(t) = P\{\mathbf{x}_i > t\}$$

is the probability that the $i$th component is good at time $t$. We denote by $\mathbf{n}(t)$ the number of components that are good at time $t$. Clearly,

$$\mathbf{n}(t) = \mathbf{n}_1 + \cdots + \mathbf{n}_m$$

where

$$\mathbf{n}_i = \begin{cases} 1 & \mathbf{x}_i > t \\ 0 & \mathbf{x}_i < t \end{cases} \qquad E\{\mathbf{n}_i\} = 1 - F_i(t)$$

Hence the mean $E\{\mathbf{n}(t)\} = \eta(t)$ of $\mathbf{n}(t)$ is given by

$$\eta(t) = 1 - F_1(t) + \cdots + 1 - F_m(t)$$

We shall assume that the RVs $\mathbf{x}_i$ have the same distribution $F(t)$. In this case,

$$\eta(t) = m[1 - F(t)]$$

Failure rate. The difference $\eta(t) - \eta(t + dt)$ is the expected number of failures in the interval $(t, t + dt)$. The derivative $-\eta'(t) = m f(t)$ of $-\eta(t)$ is the rate of failure. The ratio

$$\beta(t) = -\frac{\eta'(t)}{\eta(t)} = \frac{f(t)}{1 - F(t)} \tag{8-15}$$

is called the relative expected failure rate. As we see from (7-52), the function $\beta(t)$ can also be interpreted as the conditional failure rate of each component in the system. Assuming that the system is put into operation at $t = 0$, we have $\mathbf{n}(0) = m$; hence $\eta(0) = E\{\mathbf{n}(0)\} = m$. Solving (8-15) for $\eta(t)$, we obtain

$$\eta(t) = m \exp\left\{-\int_0^t \beta(\tau)\, d\tau\right\}$$
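Returning to the order statistics of Example 8-2, the density (8-14) and its special case for exponential RVs can be checked numerically. The sketch below is an illustration under stated assumptions: the parameter values ($\alpha = 2$, $n = 5$) are hypothetical, the integration uses a simple midpoint rule over a truncated interval, and the function names are not from the text.

```python
import math

def order_stat_density(k, n, F, f):
    """Density (8-14) of the k-th order statistic of n i.i.d. RVs."""
    coef = math.factorial(n) / (math.factorial(k - 1) * math.factorial(n - k))
    return lambda y: coef * F(y) ** (k - 1) * (1.0 - F(y)) ** (n - k) * f(y)

# Hypothetical parameters: exponential RVs with parameter alpha, n of them.
alpha, n = 2.0, 5

def F(x):
    return 1.0 - math.exp(-alpha * x) if x > 0 else 0.0

def f(x):
    return alpha * math.exp(-alpha * x) if x > 0 else 0.0

def integrate(g, a, b, steps=20000):
    """Midpoint-rule approximation of the integral of g over (a, b)."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

# Each f_k should have unit area; the tail beyond 20 is negligible here.
areas = [integrate(order_stat_density(k, n, F, f), 0.0, 20.0) for k in range(1, n + 1)]

# Density of the minimum, to be compared with n*alpha*exp(-n*alpha*y).
f1 = order_stat_density(1, n, F, f)
y = 0.3
print(areas)
print(f1(y), n * alpha * math.exp(-n * alpha * y))
```

The two numbers on the last line agree, which is the statement that the minimum of $n$ exponential RVs with parameter $\alpha$ is exponential with parameter $n\alpha$.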
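Example 8-4. Measurement errors. We measure an object of length $\eta$ with $n$ instruments of varying accuracies. The results of the measurements are $n$ RVs

$$\mathbf{x}_i = \eta + \boldsymbol{\nu}_i \qquad E\{\boldsymbol{\nu}_i\} = 0 \qquad E\{\boldsymbol{\nu}_i^2\} = \sigma_i^2$$

where $\boldsymbol{\nu}_i$ are the measurement errors which we assume independent with zero mean. We shall determine the unbiased, minimum variance, linear estimation of $\eta$. This means the following: We wish to find $n$ constants $\alpha_i$ such that the sum

$$\hat{\boldsymbol{\eta}} = \alpha_1\mathbf{x}_1 + \cdots + \alpha_n\mathbf{x}_n$$

is an RV with mean $E\{\hat{\boldsymbol{\eta}}\} = \alpha_1 E\{\mathbf{x}_1\} + \cdots + \alpha_n E\{\mathbf{x}_n\} = \eta$ and its variance

$$V = \alpha_1^2\sigma_1^2 + \cdots + \alpha_n^2\sigma_n^2$$

is minimum. Thus our problem is to minimize the above sum subject to the constraint

$$\alpha_1 + \cdots + \alpha_n = 1 \tag{8-16}$$

To solve this problem, we note that

$$V = \alpha_1^2\sigma_1^2 + \cdots + \alpha_n^2\sigma_n^2 - \lambda(\alpha_1 + \cdots + \alpha_n - 1)$$

for any $\lambda$ (Lagrange multiplier). Hence $V$ is minimum if

$$\frac{\partial V}{\partial \alpha_i} = 2\alpha_i\sigma_i^2 - \lambda = 0 \qquad \alpha_i = \frac{\lambda}{2\sigma_i^2}$$

Inserting into (8-16) and solving for $\lambda$, we obtain

$$\frac{\lambda}{2} = V = \frac{1}{1/\sigma_1^2 + \cdots + 1/\sigma_n^2}$$

Hence

$$\hat{\boldsymbol{\eta}} = \frac{\mathbf{x}_1/\sigma_1^2 + \cdots + \mathbf{x}_n/\sigma_n^2}{1/\sigma_1^2 + \cdots + 1/\sigma_n^2} \tag{8-17}$$

Illustration. The voltage $E$ of a generator is measured three times. We list below the results $x_i$ of the measurements, the standard deviations $\sigma_i$ of the measurement errors, and the estimate $\hat{E}$ of $E$ obtained from (8-17):

$x_i$ = 98.6, 98.8, 98.9
$\sigma_i$ = 0.20, 0.25, 0.28

$$\hat{E} = \frac{x_1/0.04 + x_2/0.0625 + x_3/0.0784}{1/0.04 + 1/0.0625 + 1/0.0784} \simeq 98.73$$

Group independence. We say that the group $G_x$ of the RVs $\mathbf{x}_1, \ldots, \mathbf{x}_n$ is independent of the group $G_y$ of the RVs $\mathbf{y}_1, \ldots, \mathbf{y}_k$ if

$$f(x_1, \ldots, x_n, y_1, \ldots, y_k) = f(x_1, \ldots, x_n) f(y_1, \ldots, y_k) \tag{8-18}$$

By suitable integration as in (8-5) we conclude from (8-18) that any subgroup of $G_x$ is independent of any subgroup of $G_y$. In particular, the RVs $\mathbf{x}_i$ and $\mathbf{y}_j$ are independent for any $i$ and $j$.

Suppose that $\mathcal{S}$ is a combined experiment $\mathcal{S}_1 \times \mathcal{S}_2$, the RVs $\mathbf{x}_i$ depend only on the outcomes of $\mathcal{S}_1$, and the RVs $\mathbf{y}_j$ depend only on the outcomes of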
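$\mathcal{S}_2$. If the experiments $\mathcal{S}_1$ and $\mathcal{S}_2$ are independent, then the groups $G_x$ and $G_y$ are independent.

We note finally that if the RVs $\mathbf{z}_m$ depend only on the RVs $\mathbf{x}_i$ of $G_x$ and the RVs $\mathbf{w}_r$ depend only on the RVs $\mathbf{y}_j$ of $G_y$, then the groups $G_z$ and $G_w$ are independent.

Complex random variables. The statistics of the RVs

$$\mathbf{z}_1 = \mathbf{x}_1 + j\mathbf{y}_1, \ldots, \mathbf{z}_n = \mathbf{x}_n + j\mathbf{y}_n$$

are determined in terms of the joint density $f(x_1, \ldots, x_n, y_1, \ldots, y_n)$ of the $2n$ RVs $\mathbf{x}_i$ and $\mathbf{y}_i$. We say that the complex RVs $\mathbf{z}_i$ are independent if

$$f(x_1, \ldots, x_n, y_1, \ldots, y_n) = f(x_1, y_1) \cdots f(x_n, y_n) \tag{8-19}$$

Mean and Covariance

Extending (7-2) to $n$ RVs, we conclude that the mean of $g(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ equals

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(x_1, \ldots, x_n) f(x_1, \ldots, x_n)\, dx_1 \cdots dx_n \tag{8-20}$$

If the RVs $\mathbf{z}_i = \mathbf{x}_i + j\mathbf{y}_i$ are complex, then the mean of $g(\mathbf{z}_1, \ldots, \mathbf{z}_n)$ equals

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(z_1, \ldots, z_n) f(x_1, \ldots, x_n, y_1, \ldots, y_n)\, dx_1 \cdots dy_n$$

From the above it follows that (linearity)

$$E\{a_1 g_1(X) + \cdots + a_n g_n(X)\} = a_1 E\{g_1(X)\} + \cdots + a_n E\{g_n(X)\}$$

for any random vector $X$ real or complex.

CORRELATION AND COVARIANCE MATRICES. The covariance $C_{ij}$ of two real RVs $\mathbf{x}_i$ and $\mathbf{x}_j$ is defined as in (7-6). For complex RVs

$$C_{ij} = E\{(\mathbf{x}_i - \eta_i)(\mathbf{x}_j^* - \eta_j^*)\} = E\{\mathbf{x}_i\mathbf{x}_j^*\} - E\{\mathbf{x}_i\}E\{\mathbf{x}_j^*\}$$

by definition. The variance of $\mathbf{x}_i$ is given by

$$\sigma_i^2 = C_{ii} = E\{|\mathbf{x}_i - \eta_i|^2\} = E\{|\mathbf{x}_i|^2\} - |E\{\mathbf{x}_i\}|^2$$

The RVs $\mathbf{x}_i$ are called (mutually) uncorrelated if $C_{ij} = 0$ for every $i \ne j$. In this case, if

$$\mathbf{x} = \mathbf{x}_1 + \cdots + \mathbf{x}_n \qquad \text{then} \qquad \sigma_x^2 = \sigma_1^2 + \cdots + \sigma_n^2 \tag{8-21}$$

Example 8-5. The RVs

$$\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i \qquad \bar{\mathbf{v}} = \frac{1}{n - 1}\sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}})^2$$

are by definition the sample mean and the sample variance respectively of $\mathbf{x}_i$. We shall show that, if the RVs $\mathbf{x}_i$ are uncorrelated with the same mean $E\{\mathbf{x}_i\} = \eta$ and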
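variance $\sigma_i^2 = \sigma^2$, then

$$E\{\bar{\mathbf{x}}\} = \eta \qquad \sigma_{\bar{x}}^2 = \sigma^2/n \tag{8-22}$$

and

$$E\{\bar{\mathbf{v}}\} = \sigma^2 \tag{8-23}$$

Proof. The first equation in (8-22) follows from the linearity of expected values and the second from (8-21):

$$E\{\bar{\mathbf{x}}\} = \frac{1}{n}\sum_{i=1}^{n} E\{\mathbf{x}_i\} = \eta \qquad \sigma_{\bar{x}}^2 = \frac{1}{n^2}\sum_{i=1}^{n} \sigma_i^2 = \frac{\sigma^2}{n}$$

To prove (8-23), we observe that

$$E\{(\mathbf{x}_i - \eta)(\bar{\mathbf{x}} - \eta)\} = \frac{1}{n}E\{(\mathbf{x}_i - \eta)[(\mathbf{x}_1 - \eta) + \cdots + (\mathbf{x}_n - \eta)]\} = \frac{1}{n}E\{(\mathbf{x}_i - \eta)(\mathbf{x}_i - \eta)\} = \frac{\sigma^2}{n}$$

because the RVs $\mathbf{x}_i$ and $\mathbf{x}_j$ are uncorrelated by assumption. Hence

$$E\{(\mathbf{x}_i - \bar{\mathbf{x}})^2\} = E\{[(\mathbf{x}_i - \eta) - (\bar{\mathbf{x}} - \eta)]^2\} = \sigma^2 + \frac{\sigma^2}{n} - 2\frac{\sigma^2}{n} = \frac{n - 1}{n}\sigma^2$$

This yields

$$E\{\bar{\mathbf{v}}\} = \frac{1}{n - 1}\sum_{i=1}^{n} E\{(\mathbf{x}_i - \bar{\mathbf{x}})^2\} = \frac{n}{n - 1}\,\frac{n - 1}{n}\,\sigma^2$$

and (8-23) results.

Note that if the RVs $\mathbf{x}_i$ are i.i.d. with $E\{|\mathbf{x}_i - \eta|^4\} = \mu_4$, then (see Prob. 8-21)

$$\sigma_{\bar{v}}^2 = \frac{1}{n}\left(\mu_4 - \frac{n - 3}{n - 1}\sigma^4\right)$$

If the RVs $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are independent, they are also uncorrelated. This follows as in (7-14) for real RVs. For complex RVs the proof is similar: If the RVs $\mathbf{z}_1 = \mathbf{x}_1 + j\mathbf{y}_1$ and $\mathbf{z}_2 = \mathbf{x}_2 + j\mathbf{y}_2$ are independent, then $f(x_1, x_2, y_1, y_2) = f(x_1, y_1) f(x_2, y_2)$. Hence

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} z_1 z_2^* f(x_1, x_2, y_1, y_2)\, dx_1\, dy_1\, dx_2\, dy_2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} z_1 f(x_1, y_1)\, dx_1\, dy_1 \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} z_2^* f(x_2, y_2)\, dx_2\, dy_2$$

This yields $E\{\mathbf{z}_1\mathbf{z}_2^*\} = E\{\mathbf{z}_1\}E\{\mathbf{z}_2^*\}$; therefore, $\mathbf{z}_1$ and $\mathbf{z}_2$ are uncorrelated.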
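The results (8-22) and (8-23) of Example 8-5 can be verified exactly on a small case by enumerating every possible sample. The sketch below assumes $n = 3$ i.i.d. RVs, which is stronger than the uncorrelatedness the example requires, drawn from a hypothetical three-point distribution; the variable names are illustrative.

```python
from itertools import product

# Hypothetical distribution of each x_i: three values with given probabilities.
values = [0.0, 1.0, 4.0]
probs = [0.5, 0.3, 0.2]
n = 3  # sample size

eta = sum(v * p for v, p in zip(values, probs))               # common mean
var = sum((v - eta) ** 2 * p for v, p in zip(values, probs))  # common variance

mean_of_xbar = 0.0
second_moment_xbar = 0.0
mean_of_v = 0.0

# Enumerate all 3**n samples; independence makes each probability a product.
for idx in product(range(len(values)), repeat=n):
    p = 1.0
    for i in idx:
        p *= probs[i]
    sample = [values[i] for i in idx]
    xbar = sum(sample) / n                                # sample mean
    v = sum((x - xbar) ** 2 for x in sample) / (n - 1)    # sample variance
    mean_of_xbar += p * xbar
    second_moment_xbar += p * xbar ** 2
    mean_of_v += p * v

var_of_xbar = second_moment_xbar - mean_of_xbar ** 2

print(mean_of_xbar, eta)
print(var_of_xbar, var / n)
print(mean_of_v, var)
```

Dividing by $n$ instead of $n - 1$ in the sample variance would give the mean $(n - 1)\sigma^2/n$, which is the bias that the factor $1/(n - 1)$ removes.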
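Note, finally, that if the RVs $\mathbf{x}_i$ are independent, then

$$E\{g_1(\mathbf{x}_1) \cdots g_n(\mathbf{x}_n)\} = E\{g_1(\mathbf{x}_1)\} \cdots E\{g_n(\mathbf{x}_n)\} \tag{8-24}$$

Similarly, if the groups $\mathbf{x}_1, \ldots, \mathbf{x}_n$ and $\mathbf{y}_1, \ldots, \mathbf{y}_k$ are independent, then

$$E\{g(\mathbf{x}_1, \ldots, \mathbf{x}_n)\, h(\mathbf{y}_1, \ldots, \mathbf{y}_k)\} = E\{g(\mathbf{x}_1, \ldots, \mathbf{x}_n)\}\, E\{h(\mathbf{y}_1, \ldots, \mathbf{y}_k)\}$$

The correlation matrix. We introduce the matrices

$$R_n = \begin{bmatrix} R_{11} & \cdots & R_{1n} \\ \vdots & & \vdots \\ R_{n1} & \cdots & R_{nn} \end{bmatrix} \qquad C_n = \begin{bmatrix} C_{11} & \cdots & C_{1n} \\ \vdots & & \vdots \\ C_{n1} & \cdots & C_{nn} \end{bmatrix}$$

where

$$R_{ij} = E\{\mathbf{x}_i \mathbf{x}_j^*\} = R_{ji}^* \qquad C_{ij} = R_{ij} - \eta_i \eta_j^* = C_{ji}^*$$

The first is the correlation matrix of the random vector $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n]$ and the second its covariance matrix. Clearly,

$$R_n = E\{\mathbf{X}^t \mathbf{X}^*\}$$

where $\mathbf{X}^t$ is the transpose of $\mathbf{X}$ (column vector). We shall discuss the properties of the matrix $R_n$ and its determinant $\Delta_n$. The properties of $C_n$ are similar because $C_n$ is the correlation matrix of the "centered" RVs $\mathbf{x}_i - \eta_i$.

THEOREM. The matrix $R_n$ is nonnegative definite. This means that

$$Q = \sum_{i,j} a_i a_j^* R_{ij} = A R_n A^{+} \ge 0 \tag{8-25}$$

where $A^{+}$ is the conjugate transpose of the vector $A = [a_1, \ldots, a_n]$.

Proof. It follows readily from the linearity of expected values:

$$E\{|a_1 \mathbf{x}_1 + \cdots + a_n \mathbf{x}_n|^2\} = \sum_{i,j} a_i a_j^* E\{\mathbf{x}_i \mathbf{x}_j^*\} \tag{8-26}$$

If (8-25) is strictly positive, that is, if $Q > 0$ for any $A \neq 0$, then $R_n$ is called positive definite.† The difference between $Q \ge 0$ and $Q > 0$ is related to the notion of linear dependence.

DEFINITION. The RVs $\mathbf{x}_i$ are called linearly independent if

$$E\{|a_1 \mathbf{x}_1 + \cdots + a_n \mathbf{x}_n|^2\} > 0 \tag{8-27}$$

for any $A \neq 0$. In this case [see (8-26)], their correlation matrix $R_n$ is positive definite.

†We shall use the abbreviation p.d. to indicate that $R_n$ satisfies (8-25). The distinction between $Q \ge 0$ and $Q > 0$ will be understood from the context.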
The RVs $\mathbf{x}_i$ are called linearly dependent if

$$a_1 \mathbf{x}_1 + \cdots + a_n \mathbf{x}_n = 0 \tag{8-28}$$

for some $A \neq 0$. In this case, the corresponding $Q$ equals 0 and the matrix $R_n$ is singular [see also (8-29)].

From the definition it follows that, if the RVs $\mathbf{x}_i$ are linearly independent, then any subset is also linearly independent.

The correlation determinant. The determinant $\Delta_n$ is real because $R_{ij} = R_{ji}^*$. We shall show that it is also nonnegative

$$\Delta_n \ge 0 \tag{8-29}$$

with equality iff the RVs $\mathbf{x}_i$ are linearly dependent. The familiar inequality $\Delta_2 = R_{11} R_{22} - |R_{12}|^2 \ge 0$ is a special case [see (7-12)].

Suppose, first, that the RVs $\mathbf{x}_i$ are linearly independent. We maintain that, in this case, the determinant $\Delta_n$ and all its principal minors are positive:

$$\Delta_k > 0 \qquad k \le n \tag{8-30}$$

Proof. The above is true for $n = 1$ because $\Delta_1 = R_{11} > 0$. Since the RVs of any subset of the set $\{\mathbf{x}_i\}$ are linearly independent, we can assume that (8-30) is true for $k \le n - 1$ and we shall show that $\Delta_n > 0$. For this purpose, we form the system

$$\begin{aligned} R_{11} a_1 + \cdots + R_{1n} a_n &= 1 \\ R_{21} a_1 + \cdots + R_{2n} a_n &= 0 \\ &\;\;\vdots \\ R_{n1} a_1 + \cdots + R_{nn} a_n &= 0 \end{aligned} \tag{8-31}$$

Solving for $a_1$, we obtain $a_1 = \Delta_{n-1}/\Delta_n$ where $\Delta_{n-1}$ is the correlation determinant of the RVs $\mathbf{x}_2, \ldots, \mathbf{x}_n$. Thus $a_1$ is a real number. Multiplying the $j$th equation by $a_j^*$ and adding, we obtain

$$Q = \sum_{i,j} a_i a_j^* R_{ij} = a_1 = \frac{\Delta_{n-1}}{\Delta_n} \tag{8-32}$$

In the above, $Q > 0$ because the RVs $\mathbf{x}_i$ are linearly independent and the left side of (8-27) equals $Q$. Furthermore, $\Delta_{n-1} > 0$ by the induction hypothesis; hence $\Delta_n > 0$.

We shall now show that, if the RVs $\mathbf{x}_i$ are linearly dependent, then

$$\Delta_n = 0 \tag{8-33}$$

Proof. In this case, there exists a vector $A \neq 0$ such that $a_1 \mathbf{x}_1 + \cdots + a_n \mathbf{x}_n = 0$. Multiplying by $\mathbf{x}_i^*$ and taking expected values, we obtain

$$a_1 R_{1i} + \cdots + a_n R_{ni} = 0 \qquad i = 1, \ldots, n$$

This is a homogeneous system satisfied by the nonzero vector $A$; hence $\Delta_n = 0$.
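The statements (8-30) and (8-33) can be illustrated with empirical correlation matrices. The sketch below is an illustration under assumptions, not the author's method: it uses real-valued samples, replaces expected values by sample averages, and the helper names `correlation_matrix` and `determinant` are hypothetical.

```python
import random


def correlation_matrix(samples):
    """Estimate R_ij = E{x_i x_j} from a list of real sample vectors."""
    m = len(samples)
    n = len(samples[0])
    return [[sum(s[i] * s[j] for s in samples) / m for j in range(n)]
            for i in range(n)]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:      # numerically singular
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            det = -det
        det *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return det


def demo(runs=2000, seed=0):
    """Return (determinant for independent data, determinant for dependent data)."""
    rng = random.Random(seed)
    independent, dependent = [], []
    for _ in range(runs):
        x1, x2, x3 = (rng.gauss(0.0, 1.0) for _ in range(3))
        independent.append([x1, x2, x3])
        dependent.append([x1, x2, x1 + 2.0 * x2])   # x3 = x1 + 2 x2, as in (8-28)
    return (determinant(correlation_matrix(independent)),
            determinant(correlation_matrix(dependent)))


if __name__ == "__main__":
    print(demo())
```

The first determinant should be clearly positive and the second zero up to rounding, since the third component of the dependent data is an exact linear combination of the first two.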
Note, finally, that [see (15-161)]

$$\Delta_n \le R_{11} R_{22} \cdots R_{nn} \tag{8-34}$$

with equality iff the RVs $\mathbf{x}_i$ are (mutually) orthogonal, that is, if the matrix $R_n$ is diagonal.

8-2 CONDITIONAL DENSITIES, CHARACTERISTIC FUNCTIONS, AND NORMALITY

Conditional densities can be defined as in Sec. 7-2. We shall discuss various extensions of the equation $f(y|x) = f(x, y)/f(x)$. Reasoning as in (7-41), we conclude that the conditional density of the RVs $\mathbf{x}_n, \ldots, \mathbf{x}_{k+1}$ assuming $\mathbf{x}_k, \ldots, \mathbf{x}_1$ is given by

$$f(x_n, \ldots, x_{k+1} | x_k, \ldots, x_1) = \frac{f(x_1, \ldots, x_k, \ldots, x_n)}{f(x_1, \ldots, x_k)} \tag{8-35}$$

The corresponding distribution function is obtained by integration:

$$F(x_n, \ldots, x_{k+1} | x_k, \ldots, x_1) = \int_{-\infty}^{x_n} \cdots \int_{-\infty}^{x_{k+1}} f(\alpha_n, \ldots, \alpha_{k+1} | x_k, \ldots, x_1)\, d\alpha_{k+1} \cdots d\alpha_n \tag{8-36}$$

For example,

$$f(x_1 | x_2, x_3) = \frac{f(x_1, x_2, x_3)}{f(x_2, x_3)} = \frac{dF(x_1 | x_2, x_3)}{dx_1}$$

Chain rule  From (8-35) it follows that

$$f(x_1, \ldots, x_n) = f(x_n | x_{n-1}, \ldots, x_1) \cdots f(x_2 | x_1)\, f(x_1) \tag{8-37}$$

Example 8-6. We have shown that [see (5-18)] if $\mathbf{x}$ is an RV with distribution $F(x)$, then the RV $\mathbf{y} = F(\mathbf{x})$ is uniform in the interval $(0, 1)$. The following is a generalization.

Given $n$ arbitrary RVs $\mathbf{x}_i$, we form the RVs

$$\mathbf{y}_1 = F(\mathbf{x}_1) \quad \mathbf{y}_2 = F(\mathbf{x}_2 | \mathbf{x}_1), \;\ldots,\; \mathbf{y}_n = F(\mathbf{x}_n | \mathbf{x}_{n-1}, \ldots, \mathbf{x}_1) \tag{8-38}$$

We shall show that these RVs are independent and each is uniform in the interval $(0, 1)$.

Proof. The RVs $\mathbf{y}_i$ are functions of the RVs $\mathbf{x}_i$ obtained with the transformation (8-38). For $0 \le y_i \le 1$, the system

$$y_1 = F(x_1) \quad y_2 = F(x_2 | x_1), \;\ldots,\; y_n = F(x_n | x_{n-1}, \ldots, x_1)$$
has a unique solution $x_1, \ldots, x_n$ and its jacobian equals

$$J = \begin{vmatrix} \dfrac{\partial y_1}{\partial x_1} & 0 & \cdots & 0 \\ \dfrac{\partial y_2}{\partial x_1} & \dfrac{\partial y_2}{\partial x_2} & \cdots & 0 \\ \vdots & & & \vdots \\ \dfrac{\partial y_n}{\partial x_1} & \cdots & \cdots & \dfrac{\partial y_n}{\partial x_n} \end{vmatrix}$$

The above determinant is triangular; hence it equals the product of its diagonal elements

$$\frac{\partial y_k}{\partial x_k} = f(x_k | x_{k-1}, \ldots, x_1)$$

Inserting into (8-8) and using (8-37), we obtain

$$f(y_1, \ldots, y_n) = \frac{f(x_1, \ldots, x_n)}{f(x_1)\, f(x_2 | x_1) \cdots f(x_n | x_{n-1}, \ldots, x_1)} = 1$$

in the $n$-dimensional cube $0 \le y_i \le 1$, and 0 otherwise.

From (8-5) and (8-35) it follows that

$$f(x_1 | x_3) = \int_{-\infty}^{\infty} f(x_1, x_2 | x_3)\, dx_2$$

$$f(x_1 | x_4) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} f(x_1 | x_2, x_3, x_4)\, f(x_2, x_3 | x_4)\, dx_2\, dx_3$$

Generalizing, we obtain the following rule for removing variables on the left or on the right of the conditional line: To remove any number of variables on the left of the conditional line, we integrate with respect to them. To remove any number of variables to the right of the line, we multiply by their conditional density with respect to the remaining variables on the right, and we integrate the product. The following special case is used extensively (Chapman-Kolmogoroff):

$$f(x_1 | x_3) = \int_{-\infty}^{\infty} f(x_1 | x_2, x_3)\, f(x_2 | x_3)\, dx_2 \tag{8-39}$$

Discrete type  The above rule holds also for discrete type RVs provided that all densities are replaced by probabilities and all integrals by sums. We mention as an example the discrete form of (8-39): If the RVs $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ take the values $a_i, b_k, c_r$ respectively, then

$$P\{\mathbf{x}_1 = a_i | \mathbf{x}_3 = c_r\} = \sum_k P\{\mathbf{x}_1 = a_i | b_k, c_r\}\, P\{\mathbf{x}_2 = b_k | c_r\} \tag{8-40}$$
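The discrete Chapman-Kolmogoroff relation (8-40) can be verified exactly on a small example. The joint probabilities below are hypothetical (any positive weights would do), and the function names are illustrative only; exact rational arithmetic is used so that both sides can be compared with `==`.

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint pmf of (x1, x2, x3) on {0, 1}^3, built from arbitrary weights.
_weights = {(a, b, c): 1 + a + 2 * b + 3 * c + a * b * c
            for a, b, c in product((0, 1), repeat=3)}
_total = sum(_weights.values())
PMF = {k: Fraction(w, _total) for k, w in _weights.items()}


def prob(pred):
    """Probability of the event {(x1, x2, x3) satisfies pred}."""
    return sum(p for k, p in PMF.items() if pred(k))


def direct(a, c):
    """Left side of (8-40): P{x1 = a | x3 = c} from the joint pmf."""
    return prob(lambda k: k[0] == a and k[2] == c) / prob(lambda k: k[2] == c)


def chapman_kolmogoroff(a, c):
    """Right side of (8-40): sum over b of P{x1 = a | b, c} P{x2 = b | c}."""
    p_c = prob(lambda k: k[2] == c)
    result = Fraction(0)
    for b in (0, 1):
        p_bc = prob(lambda k: k[1] == b and k[2] == c)
        p_a_given_bc = PMF[(a, b, c)] / p_bc
        p_b_given_c = p_bc / p_c
        result += p_a_given_bc * p_b_given_c
    return result


if __name__ == "__main__":
    for a, c in product((0, 1), repeat=2):
        print(a, c, direct(a, c), chapman_kolmogoroff(a, c))
```

Both columns agree for every pair $(a, c)$, which is the "multiply by the conditional probability of the removed variable and sum" rule stated above.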
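CONDITIONAL EXPECTED VALUES. The conditional mean of the RV $g(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ assuming $\mathcal{M}$ is given by the integral in (8-20) provided that the density $f(x_1, \ldots, x_n)$ is replaced by the conditional density $f(x_1, \ldots, x_n | \mathcal{M})$. Note, in particular, that [see also (7-57)]

$$E\{\mathbf{x}_1 | x_2, \ldots, x_n\} = \int_{-\infty}^{\infty} x_1 f(x_1 | x_2, \ldots, x_n)\, dx_1 \tag{8-41}$$

The above is a function of $x_2, \ldots, x_n$; it defines, therefore, the RV $E\{\mathbf{x}_1 | \mathbf{x}_2, \ldots, \mathbf{x}_n\}$. Multiplying (8-41) by $f(x_2, \ldots, x_n)$ and integrating, we conclude that

$$E\{E\{\mathbf{x}_1 | \mathbf{x}_2, \ldots, \mathbf{x}_n\}\} = E\{\mathbf{x}_1\} \tag{8-42}$$

Reasoning similarly, we obtain

$$E\{\mathbf{x}_1 | x_2, x_3\} = E\{E\{\mathbf{x}_1 | x_2, x_3, \mathbf{x}_4\} \,|\, x_2, x_3\} = \int_{-\infty}^{\infty} E\{\mathbf{x}_1 | x_2, x_3, x_4\}\, f(x_4 | x_2, x_3)\, dx_4 \tag{8-43}$$

This leads to the following generalization: To remove any number of variables on the right of the conditional expected value line, we multiply by their conditional density with respect to the remaining variables on the right and we integrate the product. For example,

$$E\{\mathbf{x}_1 | x_3\} = \int_{-\infty}^{\infty} E\{\mathbf{x}_1 | x_2, x_3\}\, f(x_2 | x_3)\, dx_2 \tag{8-44}$$

and for the discrete case [see (8-40)]

$$E\{\mathbf{x}_1 | c_r\} = \sum_k E\{\mathbf{x}_1 | b_k, c_r\}\, P\{\mathbf{x}_2 = b_k | c_r\} \tag{8-45}$$

Example 8-7. Given a discrete type RV $\mathbf{n}$ taking the values $1, 2, \ldots$ and a sequence of RVs $\mathbf{x}_k$ independent of $\mathbf{n}$, we form the sum

$$\mathbf{s} = \sum_{k=1}^{\mathbf{n}} \mathbf{x}_k \tag{8-46}$$

This sum is an RV specified as follows: For a specific $\zeta$, $\mathbf{n}(\zeta)$ is an integer and $\mathbf{s}(\zeta)$ equals the sum of the numbers $\mathbf{x}_k(\zeta)$ for $k$ from 1 to $\mathbf{n}(\zeta)$. We maintain that if the RVs $\mathbf{x}_k$ have the same mean, then

$$E\{\mathbf{s}\} = \eta E\{\mathbf{n}\} \quad \text{where} \quad E\{\mathbf{x}_k\} = \eta \tag{8-47}$$

Clearly, $E\{\mathbf{x}_k | \mathbf{n} = n\} = E\{\mathbf{x}_k\}$ because $\mathbf{x}_k$ is independent of $\mathbf{n}$. Hence

$$E\{\mathbf{s} | \mathbf{n} = n\} = E\left\{ \sum_{k=1}^{n} \mathbf{x}_k \,\Big|\, \mathbf{n} = n \right\} = \sum_{k=1}^{n} E\{\mathbf{x}_k\} = \eta n$$

From this and (7-65) it follows that

$$E\{\mathbf{s}\} = E\{E\{\mathbf{s} | \mathbf{n}\}\} = E\{\eta \mathbf{n}\}$$

and (8-47) results.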
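The identity (8-47) for a sum with a random number of terms can be confirmed by exact enumeration on a small discrete model. The two probability tables below are hypothetical values chosen only for illustration, and the function names are not from the text.

```python
from fractions import Fraction
from itertools import product

# Hypothetical pmf of the number of terms n and of each term x_k (i.i.d.).
N_PMF = {1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)}
X_PMF = {0: Fraction(1, 4), 1: Fraction(1, 2), 4: Fraction(1, 4)}


def mean(pmf):
    """Mean of a discrete RV given as {value: probability}."""
    return sum(v * p for v, p in pmf.items())


def expected_random_sum():
    """E{s} for s = x_1 + ... + x_n, computed by enumerating every outcome."""
    total = Fraction(0)
    for n, p_n in N_PMF.items():
        for xs in product(X_PMF, repeat=n):     # all value sequences of length n
            p = p_n                             # n is independent of the x_k
            for x in xs:
                p *= X_PMF[x]
            total += p * sum(xs)
    return total


if __name__ == "__main__":
    print(expected_random_sum(), mean(X_PMF) * mean(N_PMF))
```

For these tables $\eta = 3/2$ and $E\{\mathbf{n}\} = 5/3$, so both printed values equal $5/2$.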
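We show next that if the RVs $\mathbf{x}_k$ are uncorrelated with the same variance $\sigma^2$, then

$$E\{\mathbf{s}^2\} = \eta^2 E\{\mathbf{n}^2\} + \sigma^2 E\{\mathbf{n}\} \tag{8-48}$$

Reasoning as above, we have

$$E\{\mathbf{s}^2 | \mathbf{n} = n\} = \sum_{i=1}^{n} \sum_{k=1}^{n} E\{\mathbf{x}_i \mathbf{x}_k\} \tag{8-49}$$

where

$$E\{\mathbf{x}_i \mathbf{x}_k\} = \begin{cases} \sigma^2 + \eta^2 & i = k \\ \eta^2 & i \neq k \end{cases}$$

The double sum in (8-49) contains $n$ terms with $i = k$ and $n^2 - n$ terms with $i \neq k$; hence it equals

$$(\sigma^2 + \eta^2) n + \eta^2 (n^2 - n) = \eta^2 n^2 + \sigma^2 n$$

This yields (8-48) because

$$E\{\mathbf{s}^2\} = E\{E\{\mathbf{s}^2 | \mathbf{n}\}\} = E\{\eta^2 \mathbf{n}^2 + \sigma^2 \mathbf{n}\}$$

Special Case. The number $\mathbf{n}$ of particles emitted from a substance in $t$ seconds is a Poisson RV with parameter $\lambda t$. The energy $\mathbf{x}_k$ of the $k$th particle has a Maxwell distribution with mean $3kT/2$ and variance $3k^2T^2/2$ (see Prob. 8-5). The sum $\mathbf{s}$ in (8-46) is the total emitted energy in $t$ seconds. As we know, $E\{\mathbf{n}\} = \lambda t$ and $E\{\mathbf{n}^2\} = \lambda^2 t^2 + \lambda t$ [see (5-37)]. Inserting into (8-47) and (8-48), we obtain

$$E\{\mathbf{s}\} = \frac{3kT\lambda t}{2} \qquad \sigma_s^2 = \frac{15 k^2 T^2 \lambda t}{4}$$

Characteristic Functions and Normality

The characteristic function of a random vector is by definition the function

$$\Phi(\Omega) = E\{e^{j\Omega \mathbf{X}^t}\} = E\{e^{j(\omega_1 \mathbf{x}_1 + \cdots + \omega_n \mathbf{x}_n)}\} = \boldsymbol{\Phi}(j\Omega) \tag{8-50}$$

where

$$\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n] \qquad \Omega = [\omega_1, \ldots, \omega_n]$$

As an application, we shall show that if the RVs $\mathbf{x}_i$ are independent with respective densities $f_i(x_i)$, then the density $f_z(z)$ of their sum $\mathbf{z} = \mathbf{x}_1 + \cdots + \mathbf{x}_n$ equals the convolution of their densities:

$$f_z(z) = f_1(z) * \cdots * f_n(z) \tag{8-51}$$

Proof. Since the RVs $\mathbf{x}_i$ are independent and $e^{j\omega_i \mathbf{x}_i}$ depends only on $\mathbf{x}_i$, we conclude from (8-24) that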
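$$E\{e^{j(\omega_1 \mathbf{x}_1 + \cdots + \omega_n \mathbf{x}_n)}\} = E\{e^{j\omega_1 \mathbf{x}_1}\} \cdots E\{e^{j\omega_n \mathbf{x}_n}\}$$

Hence

$$\Phi_z(\omega) = E\{e^{j\omega(\mathbf{x}_1 + \cdots + \mathbf{x}_n)}\} = \Phi_1(\omega) \cdots \Phi_n(\omega) \tag{8-52}$$

where $\Phi_i(\omega)$ is the characteristic function of $\mathbf{x}_i$. Applying the convolution theorem for Fourier transforms, we obtain (8-51).

Example 8-8. (a) (Bernoulli trials) Using (8-52) we shall rederive the fundamental equation (3-13). We define the RVs $\mathbf{x}_i$ as follows: $\mathbf{x}_i = 1$ if heads shows at the $i$th trial and $\mathbf{x}_i = 0$ otherwise. Thus

$$P\{\mathbf{x}_i = 1\} = P\{h\} = p \qquad P\{\mathbf{x}_i = 0\} = P\{t\} = q \qquad \Phi_i(\omega) = p e^{j\omega} + q \tag{8-53}$$

The RV $\mathbf{z} = \mathbf{x}_1 + \cdots + \mathbf{x}_n$ takes the values $0, 1, \ldots, n$ and $\{\mathbf{z} = k\}$ is the event {$k$ heads in $n$ tossings}. Furthermore,

$$\Phi_z(\omega) = E\{e^{j\omega \mathbf{z}}\} = \sum_{k=0}^{n} P\{\mathbf{z} = k\}\, e^{jk\omega} \tag{8-54}$$

The RVs $\mathbf{x}_i$ are independent because $\mathbf{x}_i$ depends only on the outcomes of the $i$th trial and the trials are independent. Hence [see (8-52) and (8-53)]

$$\Phi_z(\omega) = (p e^{j\omega} + q)^n = \sum_{k=0}^{n} \binom{n}{k} p^k e^{jk\omega} q^{n-k}$$

Comparing with (8-54), we conclude that

$$P\{\mathbf{z} = k\} = P\{k \text{ heads}\} = \binom{n}{k} p^k q^{n-k} \tag{8-55}$$

(b) (Poisson theorem) We shall show that if $p \ll 1$, then

$$P\{\mathbf{z} = k\} \simeq e^{-np} \frac{(np)^k}{k!}$$

as in (3-41). In fact, we shall establish a more general result. Suppose that the RVs $\mathbf{x}_i$ are independent and each takes the values 1 and 0 with respective probabilities $p_i$ and $q_i = 1 - p_i$. If $p_i \ll 1$, then

$$e^{p_i(e^{j\omega} - 1)} \simeq 1 + p_i(e^{j\omega} - 1) = p_i e^{j\omega} + q_i = \Phi_i(\omega)$$

With $\mathbf{z} = \mathbf{x}_1 + \cdots + \mathbf{x}_n$, it follows from (8-52) that

$$\Phi_z(\omega) \simeq e^{p_1(e^{j\omega} - 1)} \cdots e^{p_n(e^{j\omega} - 1)} = e^{a(e^{j\omega} - 1)}$$

where $a = p_1 + \cdots + p_n$. This leads to the conclusion that [see (5-79)] the RV $\mathbf{z}$ is approximately Poisson distributed with parameter $a$. It can be shown that the result is exact in the limit if

$$p_i \to 0 \quad \text{and} \quad p_1 + \cdots + p_n \to a \quad \text{as} \quad n \to \infty$$

NORMAL VECTORS. Joint normality of $n$ RVs $\mathbf{x}_i$ can be defined as in (6-15): Their joint density is an exponential whose exponent is a negative quadratic. We give next an equivalent definition that expresses the normality of $n$ RVs in terms of the normality of a single RV.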
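Both parts of Example 8-8 have a direct discrete counterpart: the probabilities of a sum of independent 0/1 RVs are obtained by convolving the individual two-point probability sequences, which is the discrete analogue of (8-51). The sketch below is an illustration under that assumption; the function names are hypothetical.

```python
from math import comb, exp, factorial


def convolve(f, g):
    """Discrete convolution of two probability sequences indexed from 0."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out


def sum_pmf(ps):
    """pmf of z = x_1 + ... + x_n for independent 0/1 RVs with P{x_i = 1} = ps[i]."""
    pmf = [1.0]                       # pmf of the empty sum (z = 0 with certainty)
    for p in ps:
        pmf = convolve(pmf, [1.0 - p, p])
    return pmf


def binomial_pmf(n, p):
    """Right side of (8-55) for k = 0, ..., n."""
    return [comb(n, k) * p ** k * (1.0 - p) ** (n - k) for k in range(n + 1)]


def poisson_pmf(a, kmax):
    """Poisson probabilities e^{-a} a^k / k! for k = 0, ..., kmax."""
    return [exp(-a) * a ** k / factorial(k) for k in range(kmax + 1)]


if __name__ == "__main__":
    # (a) equal p: repeated convolution reproduces the binomial law
    print(sum_pmf([0.3] * 4))
    print(binomial_pmf(4, 0.3))
    # (b) small unequal p_i: the sum is close to Poisson with a = p_1 + ... + p_n
    ps = [0.005 + 0.0001 * i for i in range(100)]
    a = sum(ps)
    exact = sum_pmf(ps)
    approx = poisson_pmf(a, 5)
    print([round(abs(exact[k] - approx[k]), 5) for k in range(6)])
```

In part (b) the $p_i$ are deliberately unequal, to show that the Poisson approximation depends only on the sum $a$ when every $p_i$ is small.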
DEFINITION. The RVs $\mathbf{x}_i$ are jointly normal iff the sum

$$a_1 \mathbf{x}_1 + \cdots + a_n \mathbf{x}_n = A \mathbf{X}^t \tag{8-56}$$

is a normal RV for any $A$.

We shall show that this definition leads to the following conclusions: If the RVs $\mathbf{x}_i$ have zero mean and covariance matrix $C$, then their joint characteristic function equals

$$\Phi(\Omega) = \exp\{-\tfrac{1}{2}\, \Omega C \Omega^t\} \tag{8-57}$$

Furthermore, their joint density equals

$$f(X) = \frac{1}{\sqrt{(2\pi)^n \Delta}} \exp\{-\tfrac{1}{2}\, X C^{-1} X^t\} \tag{8-58}$$

where $\Delta$ is the determinant of $C$.

Proof. From the definition of joint normality it follows that the RV

$$\mathbf{w} = \omega_1 \mathbf{x}_1 + \cdots + \omega_n \mathbf{x}_n = \Omega \mathbf{X}^t \tag{8-59}$$

is normal. Since $E\{\mathbf{x}_i\} = 0$ by assumption, the above yields [see (8-26)]

$$E\{\mathbf{w}\} = 0 \qquad E\{\mathbf{w}^2\} = \sum_{i,j} \omega_i \omega_j C_{ij} = \sigma_w^2$$

Setting $\eta = 0$ and $\omega = 1$ in (5-65), we obtain

$$E\{e^{j\mathbf{w}}\} = \exp\left\{ -\frac{\sigma_w^2}{2} \right\}$$

This yields

$$E\{e^{j\Omega \mathbf{X}^t}\} = \exp\left\{ -\frac{1}{2} \sum_{i,j} \omega_i \omega_j C_{ij} \right\} \tag{8-60}$$

as in (8-57). The proof of (8-58) follows from (8-57) and the Fourier inversion theorem.

Note, finally, that if the RVs $\mathbf{x}_i$ are jointly normal and uncorrelated, they are independent. Indeed, in this case, their covariance matrix is diagonal and its diagonal elements equal $\sigma_i^2$. Hence $C^{-1}$ is also diagonal with diagonal elements $1/\sigma_i^2$. Inserting into (8-58), we obtain

$$f(x_1, \ldots, x_n) = \frac{1}{\sigma_1 \cdots \sigma_n \sqrt{(2\pi)^n}} \exp\left\{ -\frac{1}{2} \left( \frac{x_1^2}{\sigma_1^2} + \cdots + \frac{x_n^2}{\sigma_n^2} \right) \right\}$$

Example 8-9. Using characteristic functions, we shall show that if the RVs $\mathbf{x}_i$ are jointly normal with zero mean, and $E\{\mathbf{x}_i \mathbf{x}_j\} = C_{ij}$, then

$$E\{\mathbf{x}_1 \mathbf{x}_2 \mathbf{x}_3 \mathbf{x}_4\} = C_{12} C_{34} + C_{13} C_{24} + C_{14} C_{23} \tag{8-61}$$
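Proof. We expand the exponentials on the left and right side of (8-60) and we show explicitly only the terms containing the factor $\omega_1 \omega_2 \omega_3 \omega_4$:

$$E\{e^{j(\omega_1 \mathbf{x}_1 + \cdots + \omega_4 \mathbf{x}_4)}\} = \cdots + \frac{1}{4!} E\{(\omega_1 \mathbf{x}_1 + \cdots + \omega_4 \mathbf{x}_4)^4\} + \cdots = \cdots + \frac{24}{4!} E\{\mathbf{x}_1 \mathbf{x}_2 \mathbf{x}_3 \mathbf{x}_4\}\, \omega_1 \omega_2 \omega_3 \omega_4 + \cdots$$

$$\exp\left\{ -\frac{1}{2} \sum_{i,j} \omega_i \omega_j C_{ij} \right\} = \cdots + \frac{1}{2} \left( \frac{1}{2} \sum_{i,j} \omega_i \omega_j C_{ij} \right)^2 + \cdots = \cdots + \frac{8}{8} (C_{12} C_{34} + C_{13} C_{24} + C_{14} C_{23})\, \omega_1 \omega_2 \omega_3 \omega_4 + \cdots$$

Equating coefficients, we obtain (8-61).

Complex normal vectors. A complex normal random vector is a vector $\mathbf{Z} = \mathbf{X} + j\mathbf{Y} = [\mathbf{z}_1, \ldots, \mathbf{z}_n]$ the components of which are $n$ jointly normal RVs $\mathbf{z}_i = \mathbf{x}_i + j\mathbf{y}_i$. We shall assume that $E\{\mathbf{z}_i\} = 0$. The statistical properties of the vector $\mathbf{Z}$ are specified in terms of the joint density

$$f_Z(Z) = f(x_1, \ldots, x_n, y_1, \ldots, y_n)$$

of the $2n$ RVs $\mathbf{x}_i$ and $\mathbf{y}_i$. This function is an exponential as in (8-58) determined in terms of the $2n$ by $2n$ matrix

$$D = \begin{bmatrix} C_{XX} & C_{XY} \\ C_{YX} & C_{YY} \end{bmatrix}$$

consisting of the $2n^2 + n$ real parameters $E\{\mathbf{x}_i \mathbf{x}_j\}$, $E\{\mathbf{y}_i \mathbf{y}_j\}$, and $E\{\mathbf{x}_i \mathbf{y}_j\}$. The corresponding characteristic function

$$\Phi_Z(\Omega) = E\{\exp(j(u_1 \mathbf{x}_1 + \cdots + u_n \mathbf{x}_n + v_1 \mathbf{y}_1 + \cdots + v_n \mathbf{y}_n))\}$$

is an exponential as in (8-60):

$$\Phi_Z(\Omega) = \exp\{-\tfrac{1}{2} Q\} \qquad Q = \begin{bmatrix} U & V \end{bmatrix} \begin{bmatrix} C_{XX} & C_{YX} \\ C_{XY} & C_{YY} \end{bmatrix} \begin{bmatrix} U^t \\ V^t \end{bmatrix}$$

where $U = [u_1, \ldots, u_n]$, $V = [v_1, \ldots, v_n]$, and $\Omega = U + jV$.

The covariance matrix of the complex vector $\mathbf{Z}$ is an $n$ by $n$ hermitian matrix

$$C_{ZZ} = E\{\mathbf{Z}^t \mathbf{Z}^*\} = C_{XX} + C_{YY} - j(C_{XY} - C_{YX})$$

with elements $E\{\mathbf{z}_i \mathbf{z}_j^*\}$. Thus, $C_{ZZ}$ is specified in terms of $n^2$ real parameters. From this it follows that, unlike the real case, the density $f_Z(Z)$ of $\mathbf{Z}$ cannot in general be determined in terms of $C_{ZZ}$ because $f_Z(Z)$ is a normal density consisting of $2n^2 + n$ parameters. Suppose, for example, that $n = 1$. In this case, $\mathbf{Z} = \mathbf{z} = \mathbf{x} + j\mathbf{y}$ is a scalar and $C_{ZZ} = E\{|\mathbf{z}|^2\}$. Thus, $C_{ZZ}$ is specified in terms of the single parameter $\sigma_z^2 = E\{\mathbf{x}^2 + \mathbf{y}^2\}$. However, $f_z(z) = f(x, y)$ is a bivariate normal density consisting of the three parameters $\sigma_x$, $\sigma_y$, and $E\{\mathbf{x}\mathbf{y}\}$. In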
the following, we present a special class of normal vectors that are statistically determined in terms of their covariance matrix. This class is important in modulation theory (see Sec. 11-3).

Goodman's Theorem.† If the vectors $\mathbf{X}$ and $\mathbf{Y}$ are such that

$$C_{XX} = C_{YY} \qquad C_{XY} = -C_{YX}$$

and $\mathbf{Z} = \mathbf{X} + j\mathbf{Y}$, then

$$C_{ZZ} = 2(C_{XX} - jC_{XY})$$

$$f_Z(Z) = \frac{1}{\pi^n |C_{ZZ}|} \exp\{-Z C_{ZZ}^{-1} Z^{+}\} \tag{8-62a}$$

$$\Phi_Z(\Omega) = \exp\{-\tfrac{1}{4}\, \Omega C_{ZZ} \Omega^{+}\} \tag{8-62b}$$

Proof. It suffices to prove (8-62b); the proof of (8-62a) follows from (8-62b) and the Fourier inversion formula. Under the stated assumptions,

$$Q = \begin{bmatrix} U & V \end{bmatrix} \begin{bmatrix} C_{XX} & -C_{XY} \\ C_{XY} & C_{XX} \end{bmatrix} \begin{bmatrix} U^t \\ V^t \end{bmatrix} = U C_{XX} U^t + V C_{XY} U^t - U C_{XY} V^t + V C_{XX} V^t$$

Furthermore, $C_{XX}^t = C_{XX}$ and $C_{XY}^t = -C_{XY}$. This leads to the conclusion that

$$V C_{XX} U^t = U C_{XX} V^t \qquad U C_{XY} U^t = V C_{XY} V^t = 0$$

Hence

$$\tfrac{1}{2}\, \Omega C_{ZZ} \Omega^{+} = (U + jV)(C_{XX} - jC_{XY})(U^t - jV^t) = Q$$

and (8-62b) results.

Normal quadratic forms. Given $n$ independent $N(0, 1)$ RVs $\mathbf{z}_i$, we form the sum of their squares

$$\mathbf{x} = \mathbf{z}_1^2 + \cdots + \mathbf{z}_n^2$$

Using characteristic functions, we shall show that the RV $\mathbf{x}$ so formed has a chi-square distribution with $n$ degrees of freedom:

$$f_x(x) = \gamma\, x^{n/2 - 1} e^{-x/2}\, U(x)$$

where $\gamma = 1/[2^{n/2}\, \Gamma(n/2)]$ is the normalizing constant.

†N. R. Goodman, "Statistical Analysis Based on Certain Multivariate Complex Distribution," Annals of Math. Statistics, 1963, pp. 152-177.
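Proof. The RVs $\mathbf{z}_i^2$ have a $\chi^2(1)$ distribution (see page 96); hence their characteristic functions are obtained from (5-71) with $m = 1$. This yields

$$\Phi_i(s) = E\{e^{s \mathbf{z}_i^2}\} = \frac{1}{\sqrt{1 - 2s}}$$

From (8-52) and the independence of the RVs $\mathbf{z}_i^2$, it follows therefore that

$$\Phi_x(s) = \Phi_1(s) \cdots \Phi_n(s) = \frac{1}{\sqrt{(1 - 2s)^n}}$$

Hence [see (5-71)] the RV $\mathbf{x}$ is $\chi^2(n)$.

Note that

$$\frac{1}{\sqrt{(1 - 2s)^m}} \times \frac{1}{\sqrt{(1 - 2s)^n}} = \frac{1}{\sqrt{(1 - 2s)^{m+n}}}$$

This leads to the conclusion that if the RVs $\mathbf{x}$ and $\mathbf{y}$ are independent, $\mathbf{x}$ is $\chi^2(m)$ and $\mathbf{y}$ is $\chi^2(n)$, then the RV

$$\mathbf{z} = \mathbf{x} + \mathbf{y} \quad \text{is} \quad \chi^2(m + n) \tag{8-63}$$

Conversely, if $\mathbf{z}$ is $\chi^2(m + n)$, $\mathbf{x}$ and $\mathbf{y}$ are independent, and $\mathbf{x}$ is $\chi^2(m)$, then $\mathbf{y}$ is $\chi^2(n)$. The following is an important application.

Sample variance. Given $n$ i.i.d. $N(\eta, \sigma)$ RVs $\mathbf{x}_i$, we form their sample variance

$$\mathbf{s}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}})^2 \qquad \bar{\mathbf{x}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i \tag{8-64}$$

as in Example 8-5. We shall show that the RV

$$\frac{(n-1)\mathbf{s}^2}{\sigma^2} = \sum_{i=1}^{n} \left( \frac{\mathbf{x}_i - \bar{\mathbf{x}}}{\sigma} \right)^2 \quad \text{is} \quad \chi^2(n - 1) \tag{8-65}$$

Proof. We sum the identity

$$(\mathbf{x}_i - \eta)^2 = (\mathbf{x}_i - \bar{\mathbf{x}} + \bar{\mathbf{x}} - \eta)^2 = (\mathbf{x}_i - \bar{\mathbf{x}})^2 + (\bar{\mathbf{x}} - \eta)^2 + 2(\mathbf{x}_i - \bar{\mathbf{x}})(\bar{\mathbf{x}} - \eta)$$

from 1 to $n$. Since $\sum_i (\mathbf{x}_i - \bar{\mathbf{x}}) = 0$, this yields

$$\sum_{i=1}^{n} \left( \frac{\mathbf{x}_i - \eta}{\sigma} \right)^2 = \sum_{i=1}^{n} \left( \frac{\mathbf{x}_i - \bar{\mathbf{x}}}{\sigma} \right)^2 + n \left( \frac{\bar{\mathbf{x}} - \eta}{\sigma} \right)^2 \tag{8-66}$$

It can be shown that the RVs $\bar{\mathbf{x}}$ and $\mathbf{s}^2$ are independent (see Prob. 8-17). From this it follows that the two terms on the right of (8-66) are independent. Furthermore, the term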
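$$n \left( \frac{\bar{\mathbf{x}} - \eta}{\sigma} \right)^2 = \left( \frac{\bar{\mathbf{x}} - \eta}{\sigma/\sqrt{n}} \right)^2$$

is $\chi^2(1)$ because the RV $\bar{\mathbf{x}}$ is $N(\eta, \sigma/\sqrt{n})$. Finally, the term on the left side is $\chi^2(n)$; by the converse of (8-63), the remaining term is $\chi^2(n - 1)$ and the proof is complete.

From (8-65) and (5-71) it follows that the mean of the RV $(n-1)\mathbf{s}^2/\sigma^2$ equals $n - 1$ and its variance equals $2(n - 1)$. This leads to the conclusion that

$$E\{\mathbf{s}^2\} = (n - 1)\frac{\sigma^2}{n - 1} = \sigma^2 \qquad \operatorname{Var} \mathbf{s}^2 = 2(n - 1) \frac{\sigma^4}{(n - 1)^2} = \frac{2\sigma^4}{n - 1} \tag{8-67}$$

Example 8-10. We shall verify the above for $n = 2$. In this case,

$$\bar{\mathbf{x}} = \frac{\mathbf{x}_1 + \mathbf{x}_2}{2} \qquad \mathbf{s}^2 = (\mathbf{x}_1 - \bar{\mathbf{x}})^2 + (\mathbf{x}_2 - \bar{\mathbf{x}})^2 = \frac{1}{2} (\mathbf{x}_1 - \mathbf{x}_2)^2$$

The RVs $\mathbf{x}_1 + \mathbf{x}_2$ and $\mathbf{x}_1 - \mathbf{x}_2$ are independent because they are jointly normal and $E\{\mathbf{x}_1 - \mathbf{x}_2\} = 0$, $E\{(\mathbf{x}_1 - \mathbf{x}_2)(\mathbf{x}_1 + \mathbf{x}_2)\} = 0$. From this it follows that the RVs $\bar{\mathbf{x}}$ and $\mathbf{s}^2$ are independent. But the RV $(\mathbf{x}_1 - \mathbf{x}_2)/\sigma\sqrt{2} = \mathbf{s}/\sigma$ is $N(0, 1)$; hence its square $\mathbf{s}^2/\sigma^2$ is $\chi^2(1)$, in agreement with (8-65).

8-3 MEAN SQUARE ESTIMATION

In Sec. 7-5, we considered the problem of estimating an RV $\mathbf{s}$ by a linear and a nonlinear function of another RV $\mathbf{x}$. Generalizing, we consider now the problem of estimating $\mathbf{s}$ in terms of $n$ RVs $\mathbf{x}_1, \ldots, \mathbf{x}_n$ (data). This topic is developed further in Chap. 14 in the context of infinitely many data and stochastic processes.

LINEAR ESTIMATION. The linear MS estimate of $\mathbf{s}$ in terms of the RVs $\mathbf{x}_i$ is the sum

$$\hat{\mathbf{s}} = a_1 \mathbf{x}_1 + \cdots + a_n \mathbf{x}_n \tag{8-68}$$

where $a_1, \ldots, a_n$ are $n$ constants such that the MS value

$$P = E\{(\mathbf{s} - \hat{\mathbf{s}})^2\} = E\{[\mathbf{s} - (a_1 \mathbf{x}_1 + \cdots + a_n \mathbf{x}_n)]^2\} \tag{8-69}$$

of the estimation error $\mathbf{s} - \hat{\mathbf{s}}$ is minimum.

Orthogonality principle. $P$ is minimum if the error $\mathbf{s} - \hat{\mathbf{s}}$ is orthogonal to the data $\mathbf{x}_i$:

$$E\{[\mathbf{s} - (a_1 \mathbf{x}_1 + \cdots + a_n \mathbf{x}_n)]\, \mathbf{x}_i\} = 0 \qquad i = 1, \ldots, n \tag{8-70}$$

Proof. $P$ is a function of the constants $a_i$ and it is minimum if

$$\frac{\partial P}{\partial a_i} = E\{-2[\mathbf{s} - (a_1 \mathbf{x}_1 + \cdots + a_n \mathbf{x}_n)]\, \mathbf{x}_i\} = 0$$

and (8-70) results. This important result is known also as the projection theorem.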
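Setting $i = 1, \ldots, n$ in (8-70), we obtain the system

$$\begin{aligned} R_{11} a_1 + R_{21} a_2 + \cdots + R_{n1} a_n &= R_{01} \\ R_{12} a_1 + R_{22} a_2 + \cdots + R_{n2} a_n &= R_{02} \\ &\;\;\vdots \\ R_{1n} a_1 + R_{2n} a_2 + \cdots + R_{nn} a_n &= R_{0n} \end{aligned} \tag{8-71}$$

where $R_{ij} = E\{\mathbf{x}_i \mathbf{x}_j\}$ and $R_{0j} = E\{\mathbf{s} \mathbf{x}_j\}$.

To solve this system, we introduce the row vectors

$$\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n] \qquad A = [a_1, \ldots, a_n] \qquad R_0 = [R_{01}, \ldots, R_{0n}]$$

and the data correlation matrix $R = E\{\mathbf{X}^t \mathbf{X}\}$ where $\mathbf{X}^t$ is the transpose of $\mathbf{X}$. This yields

$$AR = R_0 \qquad A = R_0 R^{-1} \tag{8-72}$$

Inserting the constants $a_i$ so determined into (8-69), we obtain the LMS error. The resulting expression can be simplified. Since $\mathbf{s} - \hat{\mathbf{s}} \perp \mathbf{x}_i$ for every $i$, we conclude that $\mathbf{s} - \hat{\mathbf{s}} \perp \hat{\mathbf{s}}$; hence

$$P = E\{(\mathbf{s} - \hat{\mathbf{s}})\mathbf{s}\} = E\{\mathbf{s}^2\} - A R_0^t \tag{8-73}$$

Note that if the rank of $R$ is $m < n$, then the data are linearly dependent. In this case, the estimate $\hat{\mathbf{s}}$ can be written as a linear sum involving a subset of $m$ linearly independent components of the data vector $\mathbf{X}$.

Geometric interpretation. In the representation of RVs as vectors in an abstract space, the sum $\hat{\mathbf{s}} = a_1 \mathbf{x}_1 + \cdots + a_n \mathbf{x}_n$ is a vector in the subspace $S_n$ of the data $\mathbf{x}_i$, and the error $\boldsymbol{\varepsilon} = \mathbf{s} - \hat{\mathbf{s}}$ is the vector from $\hat{\mathbf{s}}$ to $\mathbf{s}$ as in Fig. 8-2a. The projection theorem states that the length of $\boldsymbol{\varepsilon}$ is minimum if $\boldsymbol{\varepsilon}$ is orthogonal to $\mathbf{x}_i$, that is, if it is perpendicular to the data subspace $S_n$. The estimate $\hat{\mathbf{s}}$ is thus the "projection" of $\mathbf{s}$ on $S_n$.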
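The system (8-71), its solution (8-72), and the error formula (8-73) can be traced on a small numeric case. The sketch below assumes real RVs and uses hypothetical second-order moments ($R$, $R_0$, $E\{\mathbf{s}^2\}$ are made-up values forming a positive definite joint correlation matrix); the function names are illustrative. Exact fractions are used so that the orthogonality conditions can be checked with equality.

```python
from fractions import Fraction


def solve(matrix, rhs):
    """Solve matrix * y = rhs by Gauss-Jordan elimination (exact arithmetic)."""
    n = len(rhs)
    aug = [[Fraction(v) for v in row] + [Fraction(r)]
           for row, r in zip(matrix, rhs)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)   # pivot row
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n] for row in aug]


def linear_ms_estimate(R, R0, Ess):
    """Return (A, P): coefficients from (8-71)/(8-72) and MS error from (8-73)."""
    Rt = [list(col) for col in zip(*R)]          # row i of (8-71) uses column i of R
    A = solve(Rt, R0)
    P = Fraction(Ess) - sum(a * r for a, r in zip(A, R0))
    return A, P


def error_data_correlations(R, R0, A):
    """E{(s - s_hat) x_i} for each i; all zero by the orthogonality principle (8-70)."""
    n = len(A)
    return [Fraction(R0[i]) - sum(A[k] * R[k][i] for k in range(n))
            for i in range(n)]


if __name__ == "__main__":
    R = [[2, 1], [1, 3]]     # hypothetical E{x_i x_j}
    R0 = [1, 2]              # hypothetical E{s x_j}
    Ess = 2                  # hypothetical E{s^2}
    A, P = linear_ms_estimate(R, R0, Ess)
    print(A, P, error_data_correlations(R, R0, A))
```

For these values the coefficients are $a_1 = 1/5$, $a_2 = 3/5$ and the MS error is $P = 3/5$, which is smaller than $E\{\mathbf{s}^2\} = 2$, the error of the trivial estimate $\hat{\mathbf{s}} = 0$.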
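If $\mathbf{s}$ is a vector in $S_n$, then $\hat{\mathbf{s}} = \mathbf{s}$ and $P = 0$. In this case, the $n + 1$ RVs $\mathbf{s}, \mathbf{x}_1, \ldots, \mathbf{x}_n$ are linearly dependent and the determinant $\Delta_{n+1}$ of their correlation matrix is 0. If $\mathbf{s}$ is perpendicular to $S_n$, then $\hat{\mathbf{s}} = 0$ and $P = E\{|\mathbf{s}|^2\}$. This is the case if $\mathbf{s}$ is orthogonal to all the data $\mathbf{x}_j$, that is, if $R_{0j} = 0$ for $j \neq 0$.

Nonhomogeneous estimation. The estimate (8-68) can be improved if a constant is added to the sum. The problem now is to determine $n + 1$ parameters $a_k$ such that if

$$\hat{\mathbf{s}} = a_0 + a_1 \mathbf{x}_1 + \cdots + a_n \mathbf{x}_n \tag{8-74}$$

then the resulting MS error is minimum. This problem can be reduced to the homogeneous case if we replace the term $a_0$ by the product $a_0 \mathbf{x}_0$ where $\mathbf{x}_0 \equiv 1$. Applying (8-70) to the enlarged data set

$$\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_n \quad \text{where} \quad E\{\mathbf{x}_0 \mathbf{x}_i\} = \begin{cases} \eta_i = E\{\mathbf{x}_i\} & i \neq 0 \\ 1 & i = 0 \end{cases}$$

we obtain

$$\begin{aligned} a_0 + \eta_1 a_1 + \cdots + \eta_n a_n &= \eta_s \\ \eta_1 a_0 + R_{11} a_1 + \cdots + R_{n1} a_n &= R_{01} \\ &\;\;\vdots \\ \eta_n a_0 + R_{1n} a_1 + \cdots + R_{nn} a_n &= R_{0n} \end{aligned} \tag{8-75}$$

Note that, if $\eta_s = \eta_i = 0$, then (8-75) reduces to (8-71). This yields $a_0 = 0$, and the remaining coefficients $a_1, \ldots, a_n$ are the same as in the homogeneous case.

Nonlinear estimation. The nonlinear MS estimation problem involves the determination of a function $g(x_1, \ldots, x_n) = g(X)$ of the data $\mathbf{x}_i$ such as to minimize the MS error

$$P = E\{[\mathbf{s} - g(\mathbf{X})]^2\} \tag{8-76}$$

We maintain that $P$ is minimum if

$$g(X) = E\{\mathbf{s} | X\} = \int_{-\infty}^{\infty} s f_s(s | X)\, ds \tag{8-77}$$

The function $f_s(s | X)$ is the conditional density of the RV $\mathbf{s}$ assuming $\mathbf{X} = X$, and $g(X)$ is the corresponding conditional mean (regression surface).

Proof. The proof is based on the identity [see (8-42)]

$$P = E\{[\mathbf{s} - g(\mathbf{X})]^2\} = E\{E\{[\mathbf{s} - g(\mathbf{X})]^2 | \mathbf{X}\}\} \tag{8-78}$$

Since all quantities are positive, it follows that $P$ is minimum if the conditional MS error

$$E\{[\mathbf{s} - g(\mathbf{X})]^2 | X\} = \int_{-\infty}^{\infty} [s - g(X)]^2 f_s(s | X)\, ds \tag{8-79}$$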
is minimum. In the above integral, $g(X)$ is constant. Hence the integral is minimum if $g(X)$ is given by (8-77) [see also (7-71)].

The general orthogonality principle. From the projection theorem (8-70) it follows that

$$E\{[\mathbf{s} - \hat{\mathbf{s}}](c_1 \mathbf{x}_1 + \cdots + c_n \mathbf{x}_n)\} = 0 \tag{8-80}$$

for any $c_1, \ldots, c_n$. This shows that if $\hat{\mathbf{s}}$ is the linear MS estimator of $\mathbf{s}$, the estimation error $\mathbf{s} - \hat{\mathbf{s}}$ is orthogonal to any linear function $\mathbf{y} = c_1 \mathbf{x}_1 + \cdots + c_n \mathbf{x}_n$ of the data $\mathbf{x}_i$.

We shall now show that if $g(\mathbf{X})$ is the nonlinear MS estimator of $\mathbf{s}$, the estimation error $\mathbf{s} - g(\mathbf{X})$ is orthogonal to any function $w(\mathbf{X})$, linear or nonlinear, of the data $\mathbf{x}_i$:

$$E\{[\mathbf{s} - g(\mathbf{X})]\, w(\mathbf{X})\} = 0 \tag{8-81}$$

Proof. We shall use the following generalization of (7-60):

$$E\{[\mathbf{s} - g(\mathbf{X})]\, w(\mathbf{X})\} = E\{w(\mathbf{X})\, E\{\mathbf{s} - g(\mathbf{X}) | \mathbf{X}\}\} \tag{8-82}$$

From the linearity of expected values and (8-77) it follows that

$$E\{\mathbf{s} - g(\mathbf{X}) | X\} = E\{\mathbf{s} | X\} - E\{g(\mathbf{X}) | X\} = 0$$

and (8-81) results.

Normality. Using the above, we shall show that if the RVs $\mathbf{s}, \mathbf{x}_1, \ldots, \mathbf{x}_n$ are jointly normal with zero mean, the linear and nonlinear estimators of $\mathbf{s}$ are equal:

$$\hat{\mathbf{s}} = a_1 \mathbf{x}_1 + \cdots + a_n \mathbf{x}_n = g(\mathbf{X}) = E\{\mathbf{s} | \mathbf{X}\} \tag{8-83}$$

Proof. To prove (8-83), it suffices to show that $\hat{\mathbf{s}} = E\{\mathbf{s} | \mathbf{X}\}$. The RVs $\mathbf{s} - \hat{\mathbf{s}}$ and $\mathbf{x}_i$ are jointly normal with zero mean and orthogonal; hence they are independent. From this it follows that

$$E\{\mathbf{s} - \hat{\mathbf{s}} | X\} = E\{\mathbf{s} - \hat{\mathbf{s}}\} = 0 = E\{\mathbf{s} | X\} - E\{\hat{\mathbf{s}} | X\}$$

and (8-83) results because $E\{\hat{\mathbf{s}} | X\} = \hat{s}$.

Conditional densities of normal RVs. We shall use the preceding result to simplify the determination of conditional densities involving normal RVs. The conditional density $f_s(s | X)$ of $\mathbf{s}$ assuming $X$ is the ratio of two exponentials the exponents of which are quadratics; hence it is normal. To determine it, it suffices, therefore, to find the conditional mean and variance of $\mathbf{s}$. We maintain that

$$E\{\mathbf{s} | X\} = \hat{s} \qquad E\{(\mathbf{s} - \hat{\mathbf{s}})^2 | X\} = E\{(\mathbf{s} - \hat{\mathbf{s}})^2\} = P \tag{8-84}$$
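The first follows from (8-83). The second follows from the fact that $\mathbf{s} - \hat{\mathbf{s}}$ is orthogonal and, therefore, independent of $\mathbf{X}$. We thus conclude that

$$f(s | x_1, \ldots, x_n) = \frac{1}{\sqrt{2\pi P}} \exp\left\{ -\frac{[s - (a_1 x_1 + \cdots + a_n x_n)]^2}{2P} \right\} \tag{8-85}$$

Example 8-11. The RVs $\mathbf{x}_1$ and $\mathbf{x}_2$ are jointly normal with zero mean. We shall determine their conditional density $f(x_2 | x_1)$. As we know [see (7-78)]

$$E\{\mathbf{x}_2 | x_1\} = a x_1 \qquad a = \frac{R_{12}}{R_{11}}$$

$$\sigma_{x_2 | x_1}^2 = P = E\{(\mathbf{x}_2 - a\mathbf{x}_1)\mathbf{x}_2\} = R_{22} - a R_{12}$$

Inserting into (8-85), we obtain

$$f(x_2 | x_1) = \frac{1}{\sqrt{2\pi P}}\, e^{-(x_2 - a x_1)^2 / 2P}$$

Example 8-12. We now wish to find the conditional density $f(x_3 | x_1, x_2)$. In this case,

$$E\{\mathbf{x}_3 | x_1, x_2\} = a_1 x_1 + a_2 x_2$$

where the constants $a_1$ and $a_2$ are determined from the system

$$R_{11} a_1 + R_{12} a_2 = R_{13} \qquad R_{12} a_1 + R_{22} a_2 = R_{23}$$

Furthermore [see (8-84) and (8-73)]

$$\sigma_{x_3 | x_1, x_2}^2 = P = R_{33} - (R_{13} a_1 + R_{23} a_2)$$

and (8-85) yields

$$f(x_3 | x_1, x_2) = \frac{1}{\sqrt{2\pi P}}\, e^{-(x_3 - a_1 x_1 - a_2 x_2)^2 / 2P}$$

Example 8-13. In this example, we shall find the two-dimensional density $f(x_2, x_3 | x_1)$. This involves the evaluation of five parameters [see (6-15)]: two conditional means, two conditional variances, and the conditional covariance of the RVs $\mathbf{x}_2$ and $\mathbf{x}_3$ assuming $x_1$.

The first four parameters are determined as in Example 8-11:

$$E\{\mathbf{x}_2 | x_1\} = \frac{R_{12}}{R_{11}} x_1 \qquad E\{\mathbf{x}_3 | x_1\} = \frac{R_{13}}{R_{11}} x_1$$

$$\sigma_{x_2 | x_1}^2 = R_{22} - \frac{R_{12}^2}{R_{11}} \qquad \sigma_{x_3 | x_1}^2 = R_{33} - \frac{R_{13}^2}{R_{11}}$$

The conditional covariance

$$C_{x_2 x_3 | x_1} = E\left\{ \left( \mathbf{x}_2 - \frac{R_{12}}{R_{11}} \mathbf{x}_1 \right) \left( \mathbf{x}_3 - \frac{R_{13}}{R_{11}} \mathbf{x}_1 \right) \Big|\, \mathbf{x}_1 = x_1 \right\} \tag{8-86}$$

is found as follows: We know that the errors $\mathbf{x}_2 - R_{12}\mathbf{x}_1/R_{11}$ and $\mathbf{x}_3 - R_{13}\mathbf{x}_1/R_{11}$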
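are independent of $\mathbf{x}_1$. Hence the condition $\mathbf{x}_1 = x_1$ in (8-86) can be removed. Expanding the product, we obtain

$$C_{x_2 x_3 | x_1} = R_{23} - \frac{R_{12} R_{13}}{R_{11}}$$

This completes the specification of $f(x_2, x_3 | x_1)$.

Orthonormal Data Transformation

If the data $\mathbf{x}_i$ are orthogonal, that is, if $R_{ij} = 0$ for $i \neq j$, then $R$ is a diagonal matrix and (8-71) yields

$$a_i = \frac{R_{0i}}{R_{ii}} = \frac{E\{\mathbf{s}\mathbf{x}_i\}}{E\{\mathbf{x}_i^2\}} \tag{8-87}$$

Thus the determination of the projection $\hat{\mathbf{s}}$ of $\mathbf{s}$ is simplified if the data $\mathbf{x}_i$ are expressed in terms of an orthonormal set of vectors. This is done as follows. We wish to find a set $\{\mathbf{i}_k\}$ of $n$ orthonormal RVs $\mathbf{i}_k$ linearly equivalent to the data set $\{\mathbf{x}_k\}$. By this we mean that each $\mathbf{i}_k$ is a linear function of the elements of the set $\{\mathbf{x}_k\}$ and each $\mathbf{x}_k$ is a linear function of the elements of the set $\{\mathbf{i}_k\}$. The set $\{\mathbf{i}_k\}$ is not unique. We shall determine it using the Gram-Schmidt method (Fig. 8-2b). In this method, each $\mathbf{i}_k$ depends only on the first $k$ data $\mathbf{x}_1, \ldots, \mathbf{x}_k$. Thus

$$\begin{aligned} \mathbf{i}_1 &= \gamma_1^1 \mathbf{x}_1 \\ \mathbf{i}_2 &= \gamma_1^2 \mathbf{x}_1 + \gamma_2^2 \mathbf{x}_2 \\ &\;\;\vdots \\ \mathbf{i}_n &= \gamma_1^n \mathbf{x}_1 + \gamma_2^n \mathbf{x}_2 + \cdots + \gamma_n^n \mathbf{x}_n \end{aligned} \tag{8-88}$$

In the notation $\gamma_r^k$, $k$ is a superscript identifying the $k$th equation and $r$ is a subscript taking the values 1 to $k$. The coefficient $\gamma_1^1$ is obtained from the normalization condition

$$E\{\mathbf{i}_1^2\} = (\gamma_1^1)^2 R_{11} = 1$$

To find the coefficients $\gamma_1^2$ and $\gamma_2^2$, we observe that $\mathbf{i}_2 \perp \mathbf{x}_1$ because $\mathbf{i}_2 \perp \mathbf{i}_1$ by assumption. From this it follows that

$$E\{\mathbf{i}_2 \mathbf{x}_1\} = 0 = \gamma_1^2 R_{11} + \gamma_2^2 R_{21}$$

The condition $E\{\mathbf{i}_2^2\} = 1$ yields a second equation. Similarly, since $\mathbf{i}_k \perp \mathbf{i}_r$ for $r < k$, we conclude from (8-88) that $\mathbf{i}_k \perp \mathbf{x}_r$ if $r < k$. Multiplying the $k$th equation in (8-88) by $\mathbf{x}_r$ and using the above, we obtain

$$E\{\mathbf{i}_k \mathbf{x}_r\} = 0 = \gamma_1^k R_{1r} + \cdots + \gamma_k^k R_{kr} \qquad 1 \le r \le k - 1 \tag{8-89}$$

This is a system of $k - 1$ equations for the $k$ unknowns $\gamma_1^k, \ldots, \gamma_k^k$. The condition $E\{\mathbf{i}_k^2\} = 1$ yields one more equation.

The system (8-88) can be written in a vector form

$$\mathbf{I} = \mathbf{X}\Gamma \tag{8-90}$$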
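where $\mathbf{I}$ is a row vector with elements $\mathbf{i}_k$. Solving for $\mathbf{X}$, we obtain

$$\begin{aligned} \mathbf{x}_1 &= l_1^1 \mathbf{i}_1 \\ \mathbf{x}_2 &= l_1^2 \mathbf{i}_1 + l_2^2 \mathbf{i}_2 \\ &\;\;\vdots \\ \mathbf{x}_n &= l_1^n \mathbf{i}_1 + l_2^n \mathbf{i}_2 + \cdots + l_n^n \mathbf{i}_n \end{aligned} \qquad \mathbf{X} = \mathbf{I}\Gamma^{-1} = \mathbf{I}L \tag{8-91}$$

In the above, the matrix $\Gamma$ and its inverse $L$ are upper triangular.

Since $E\{\mathbf{i}_i \mathbf{i}_j\} = \delta[i - j]$ by construction, we conclude that

$$E\{\mathbf{I}^t \mathbf{I}\} = 1_n = E\{\Gamma^t \mathbf{X}^t \mathbf{X} \Gamma\} = \Gamma^t E\{\mathbf{X}^t \mathbf{X}\} \Gamma \tag{8-92}$$

where $1_n$ is the identity matrix. Hence

$$\Gamma^t R \Gamma = 1_n \qquad R = L^t L \qquad R^{-1} = \Gamma \Gamma^t \tag{8-93}$$

We have thus expressed the matrix $R$ and its inverse $R^{-1}$ as products of an upper triangular and a lower triangular matrix [see also Cholesky factorization (14-79)].

The orthonormal base $\{\mathbf{i}_n\}$ in (8-88) is the finite version of the innovations process $\mathbf{i}[n]$ introduced in Sec. 12-1. The matrices $\Gamma$ and $L$ correspond to the whitening filter and to the innovations filter respectively, and the factorization (8-93) corresponds to the spectral factorization (12-3).

From the linear equivalence of the sets $\{\mathbf{i}_k\}$ and $\{\mathbf{x}_k\}$, it follows that the estimate (8-68) of the RV $\mathbf{s}$ can be expressed in terms of the set $\{\mathbf{i}_k\}$:

$$\hat{\mathbf{s}} = b_1 \mathbf{i}_1 + \cdots + b_n \mathbf{i}_n = B\mathbf{I}^t$$

where again the coefficients $b_k$ are such that

$$\mathbf{s} - \hat{\mathbf{s}} \perp \mathbf{i}_k \qquad 1 \le k \le n$$

This yields [see (8-92)]

$$E\{(\mathbf{s} - B\mathbf{I}^t)\mathbf{I}\} = 0 = E\{\mathbf{s}\mathbf{I}\} - B$$

from which it follows that

$$B = E\{\mathbf{s}\mathbf{I}\} = E\{\mathbf{s}\mathbf{X}\Gamma\} = R_0 \Gamma \tag{8-94}$$

Returning to the estimate (8-68) of $\mathbf{s}$, we conclude that

$$\hat{\mathbf{s}} = B\mathbf{I}^t = B\Gamma^t \mathbf{X}^t = A\mathbf{X}^t \qquad A = B\Gamma^t \tag{8-95}$$

This simplifies the determination of the vector $A$ if the matrix $\Gamma$ is known.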
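The Gram-Schmidt construction of (8-88) and the shortcut (8-94)-(8-95) can be sketched numerically. The code below is one possible realization under assumptions, not the text's own procedure: it works with real data, represents each $\mathbf{i}_k$ by its coefficient vector with respect to $\mathbf{x}_1, \ldots, \mathbf{x}_n$ (a column of $\Gamma$), and uses the correlation matrix $R$ as the inner product. The function names and the numeric values of $R$ and $R_0$ are hypothetical.

```python
import math


def inner(u, v, R):
    """E{(sum_i u_i x_i)(sum_j v_j x_j)} = u R v^t for real data."""
    n = len(R)
    return sum(u[i] * R[i][j] * v[j] for i in range(n) for j in range(n))


def gram_schmidt(R):
    """Return the columns of Gamma; column k holds the coefficients of i_k."""
    n = len(R)
    cols = []
    for k in range(n):
        e_k = [1.0 if i == k else 0.0 for i in range(n)]    # represents x_k
        g = e_k[:]
        for prev in cols:
            c = inner(e_k, prev, R)                         # E{x_k i_r}, r < k
            g = [gi - c * pi for gi, pi in zip(g, prev)]    # remove projection
        norm = math.sqrt(inner(g, g, R))
        cols.append([gi / norm for gi in g])                # enforce E{i_k^2} = 1
    return cols


def coefficients_via_gamma(R, R0):
    """A = B Gamma^t with B = R0 Gamma, as in (8-94) and (8-95)."""
    cols = gram_schmidt(R)
    n = len(R0)
    B = [sum(R0[i] * col[i] for i in range(n)) for col in cols]
    return [sum(B[k] * cols[k][i] for k in range(n)) for i in range(n)]


if __name__ == "__main__":
    R = [[2.0, 1.0], [1.0, 3.0]]      # hypothetical data correlation matrix
    R0 = [1.0, 2.0]                   # hypothetical E{s x_j}
    for col in gram_schmidt(R):
        print(col)                    # entries below the diagonal are zero
    print(coefficients_via_gamma(R, R0))
```

Column $k$ has nonzero entries only in its first $k$ positions, which is the upper triangular structure of $\Gamma$; the resulting $A$ should agree with the direct solution $A = R_0 R^{-1}$ of (8-72), here $[0.2, 0.6]$ up to rounding.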
8-4 STOCHASTIC CONVERGENCE AND LIMIT THEOREMS

A fundamental problem in the theory of probability is the determination of the asymptotic properties of random sequences. In this section, we introduce the subject, concentrating on the clarification of the underlying concepts. We start with a simple problem.

Suppose that we wish to measure the length $a$ of an object. Due to measurement inaccuracies, the instrument reading is a sum

$$\mathbf{x} = a + \boldsymbol{\nu}$$

where $\boldsymbol{\nu}$ is the error term. If there are no systematic errors, then $\boldsymbol{\nu}$ is an RV with zero mean. In this case, if the standard deviation $\sigma$ of $\boldsymbol{\nu}$ is small compared to $a$, then the observed value $\mathbf{x}(\zeta)$ of $\mathbf{x}$ at a single measurement is a satisfactory estimate of the unknown length $a$. In the context of probability, this conclusion can be phrased as follows: The mean of the RV $\mathbf{x}$ equals $a$ and its variance equals $\sigma^2$. Applying Tchebycheff's inequality, we conclude that

$$P\{|\mathbf{x} - a| < \varepsilon\} > 1 - \frac{\sigma^2}{\varepsilon^2} \tag{8-96}$$

If, therefore, $\sigma \ll \varepsilon$, then the probability that $|\mathbf{x} - a|$ is less than $\varepsilon$ is close to 1. From this it follows that "almost certainly" the observed $\mathbf{x}(\zeta)$ is between $a - \varepsilon$ and $a + \varepsilon$, or equivalently, that the unknown $a$ is between $\mathbf{x}(\zeta) - \varepsilon$ and $\mathbf{x}(\zeta) + \varepsilon$. In other words, the reading $\mathbf{x}(\zeta)$ of a single measurement is "almost certainly" a satisfactory estimate of the length $a$ as long as $\sigma \ll a$. If $\sigma$ is not small compared to $a$, then a single measurement does not provide an adequate estimate of $a$. To improve the accuracy, we perform the measurement a large number of times and we average the resulting readings. The underlying probabilistic model is now a product space

$$\mathcal{S}^n = \mathcal{S} \times \cdots \times \mathcal{S}$$

formed by repeating $n$ times the experiment $\mathcal{S}$ of a single measurement. If the measurements are independent, then the $i$th reading is a sum

$$\mathbf{x}_i = a + \boldsymbol{\nu}_i$$

where the noise components $\boldsymbol{\nu}_i$ are independent RVs with zero mean and variance $\sigma^2$. This leads to the conclusion that the sample mean

$$\bar{\mathbf{x}} = \frac{\mathbf{x}_1 + \cdots + \mathbf{x}_n}{n} \tag{8-97}$$

of the measurements is an RV with mean $a$ and variance $\sigma^2/n$.
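The effect of averaging can be seen in a short simulation. This is a sketch under assumptions, not part of the original argument: the noise is taken to be normal, and the values of $a$, $\sigma$, $n$, and $\varepsilon$ are arbitrary illustrative choices. It compares the observed fraction of sample means within $\varepsilon$ of $a$ with the Tchebycheff bound (8-96) applied to the sample mean, whose variance is $\sigma^2/n$.

```python
import random
import statistics


def averaged_measurements(a=10.0, sigma=5.0, n=100, eps=1.0, runs=5000, seed=1):
    """Simulate `runs` repetitions of n noisy readings x_i = a + nu_i.

    Returns (fraction of runs with |x_bar - a| < eps,
             empirical variance of x_bar,
             Tchebycheff lower bound 1 - sigma^2 / (n eps^2)).
    """
    rng = random.Random(seed)
    means = []
    for _ in range(runs):
        readings = [a + rng.gauss(0.0, sigma) for _ in range(n)]
        means.append(sum(readings) / n)
    inside = sum(1 for m in means if abs(m - a) < eps) / runs
    bound = 1.0 - sigma ** 2 / (n * eps ** 2)
    return inside, statistics.pvariance(means), bound


if __name__ == "__main__":
    inside, var_mean, bound = averaged_measurements()
    print("fraction within eps :", inside)
    print("variance of x_bar   :", var_mean)   # should be near sigma**2 / n
    print("Tchebycheff bound   :", bound)
```

With the assumed values the bound equals 0.75, while the observed fraction is typically much higher; Tchebycheff's inequality holds for any distribution with the given variance and is therefore conservative for normal noise.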
If, therefore, $n$ is so large that $\sigma^2 \ll n a^2$, then the value $\bar{\mathbf{x}}(\zeta)$ of the sample mean $\bar{\mathbf{x}}$ in a single performance of the experiment $\mathcal{S}^n$ (consisting of $n$ independent measurements) is a satisfactory estimate of the unknown $a$.

To find a bound of the error in the estimate of $a$ by $\bar{\mathbf{x}}$, we apply (8-96). To be concrete, we assume that $n$ is so large that $\sigma^2/na^2 = 10^{-4}$, and we ask for the probability that $\bar{\mathbf{x}}$ is between $0.9a$ and $1.1a$. The answer is given by (8-96)
with $\varepsilon = 0.1a$:

$$P\{0.9a < \bar{\mathbf{x}} < 1.1a\} \ge 1 - \frac{100\sigma^2}{na^2} = 0.99$$

Thus, if the experiment is performed $n = 10^4 \sigma^2/a^2$ times, then "almost certainly" in 99 percent of the cases, the estimate $\bar{\mathbf{x}}$ of $a$ will be between $0.9a$ and $1.1a$. Motivated by the above, we introduce next various convergence modes involving sequences of random variables.

DEFINITION. A random sequence or a discrete-time random process is a sequence of RVs

$$\mathbf{x}_1, \ldots, \mathbf{x}_n, \ldots \tag{8-98}$$

For a specific $\zeta$, $\mathbf{x}_n(\zeta)$ is a sequence of numbers that might or might not converge. This suggests that the notion of convergence of a random sequence might be given several interpretations:

Convergence everywhere (e)  As we recall, a sequence of numbers $x_n$ tends to a limit $x$ if, given $\varepsilon > 0$, we can find a number $n_0$ such that

$$|x_n - x| < \varepsilon \quad \text{for every} \quad n > n_0 \tag{8-99}$$

We say that a random sequence $\mathbf{x}_n$ converges everywhere if the sequence of numbers $\mathbf{x}_n(\zeta)$ converges as above for every $\zeta$. The limit is a number that depends, in general, on $\zeta$. In other words, the limit of the random sequence $\mathbf{x}_n$ is an RV $\mathbf{x}$:

$$\mathbf{x}_n \to \mathbf{x} \quad \text{as} \quad n \to \infty$$

Convergence almost everywhere (a.e.)  If the set of outcomes $\zeta$ such that

$$\lim \mathbf{x}_n(\zeta) = \mathbf{x}(\zeta) \quad \text{as} \quad n \to \infty \tag{8-100}$$

exists and its probability equals 1, then we say that the sequence $\mathbf{x}_n$ converges almost everywhere (or with probability 1). This is written in the form

$$P\{\mathbf{x}_n \to \mathbf{x}\} = 1 \quad \text{as} \quad n \to \infty \tag{8-101}$$

In the above, $\{\mathbf{x}_n \to \mathbf{x}\}$ is an event consisting of all outcomes $\zeta$ such that $\mathbf{x}_n(\zeta) \to \mathbf{x}(\zeta)$.

Convergence in the MS sense (MS)  The sequence $\mathbf{x}_n$ tends to the RV $\mathbf{x}$ in the MS sense if

$$E\{|\mathbf{x}_n - \mathbf{x}|^2\} \to 0 \quad \text{as} \quad n \to \infty \tag{8-102}$$

This is called limit in the mean and it is often written in the form

$$\underset{n \to \infty}{\mathrm{l.i.m.}}\; \mathbf{x}_n = \mathbf{x}$$

Convergence in probability (p)  The probability $P\{|\mathbf{x} - \mathbf{x}_n| > \varepsilon\}$ of the event $\{|\mathbf{x} - \mathbf{x}_n| > \varepsilon\}$ is a sequence of numbers depending on $\varepsilon$. If this sequence tends to 0:

$$P\{|\mathbf{x} - \mathbf{x}_n| > \varepsilon\} \to 0 \qquad n \to \infty \tag{8-103}$$
for any $\varepsilon > 0$, then we say that the sequence $\mathbf{x}_n$ tends to the RV $\mathbf{x}$ in probability (or in measure). This is also called stochastic convergence.

Convergence in distribution (d)  We denote by $F_n(x)$ and $F(x)$ respectively the distribution of the RVs $\mathbf{x}_n$ and $\mathbf{x}$. If

$$F_n(x) \to F(x) \qquad n \to \infty \tag{8-104}$$

for every point $x$ of continuity of $F(x)$, then we say that the sequence $\mathbf{x}_n$ tends to the RV $\mathbf{x}$ in distribution. We note that, in this case, the sequence $\mathbf{x}_n(\zeta)$ need not converge for any $\zeta$.

Cauchy criterion  As we noted, a deterministic sequence $x_n$ converges if it satisfies (8-99). This definition involves the limit $x$ of $x_n$. The following theorem, known as the Cauchy criterion, establishes conditions for the convergence of $x_n$ that avoid the use of $x$: If

$$|x_{n+m} - x_n| \to 0 \quad \text{as} \quad n \to \infty \tag{8-105}$$

for any $m > 0$, then the sequence $x_n$ converges.

The above theorem holds also for random sequences. In this case, the limit must be interpreted accordingly. For example, if

$$E\{|\mathbf{x}_{n+m} - \mathbf{x}_n|^2\} \to 0 \quad \text{as} \quad n \to \infty$$

for every $m > 0$, then the random sequence $\mathbf{x}_n$ converges in the MS sense.

Comparison of convergence modes. In Fig. 8-3, we show the relationship between various convergence modes. Each point in the rectangle represents a random sequence. The letter on each curve indicates that all sequences in the interior of the curve converge in the stated mode. The shaded region consists of all sequences that do not converge in any sense. The letter d on the outer curve shows that if a sequence converges at all, then it converges also in distribution. We comment next on the less obvious comparisons:

FIGURE 8-3

If a sequence converges in the MS sense, then it also converges in probability. Indeed, Tchebycheff's inequality yields

$$P\{|\mathbf{x}_n - \mathbf{x}| > \varepsilon\} \le \frac{E\{|\mathbf{x}_n - \mathbf{x}|^2\}}{\varepsilon^2}$$

If $\mathbf{x}_n \to \mathbf{x}$ in the MS sense, then for a fixed $\varepsilon > 0$ the right side tends to 0; hence the left side also tends to 0 as $n \to \infty$ and (8-103) follows. The converse,
however, is not necessarily true. If $\mathbf{x}_n$ is not bounded, then $P\{|\mathbf{x}_n - \mathbf{x}| > \varepsilon\}$ might tend to 0 but not $E\{|\mathbf{x}_n - \mathbf{x}|^2\}$. If, however, $\mathbf{x}_n$ vanishes outside some interval $(-c, c)$ for every $n > n_0$, then p convergence and MS convergence are equivalent.

It is self-evident that a.e. convergence implies p convergence. We shall show by a heuristic argument that the converse is not true. In Fig. 8-4, we plot the difference $|\mathbf{x}_n - \mathbf{x}|$ as a function of $n$ where, for simplicity, sequences are drawn as curves. Each curve represents, thus, a particular sequence $|\mathbf{x}_n(\zeta) - \mathbf{x}(\zeta)|$. Convergence in probability means that for a specific $n > n_0$, only a small percentage of these curves will have ordinates that exceed $\varepsilon$ (Fig. 8-4a). It is, of course, possible that not even one of these curves will remain less than $\varepsilon$ for every $n > n_0$. Convergence a.e., on the other hand, demands that most curves will be below $\varepsilon$ for every $n > n_0$ (Fig. 8-4b).

The law of large numbers (Bernoulli). In Sec. 3-3 we showed that if the probability of an event $\mathcal{A}$ in a given experiment equals $p$ and the number of successes of $\mathcal{A}$ in $n$ trials equals $k$, then

$$P\left\{\left|\frac{k}{n} - p\right| < \varepsilon\right\} \to 1 \quad \text{as } n \to \infty \tag{8-106}$$

We shall reestablish this result as a limit of a sequence of RVs. For this purpose, we introduce the RVs

$$\mathbf{x}_i = \begin{cases} 1 & \text{if } \mathcal{A} \text{ occurs at the } i\text{th trial} \\ 0 & \text{otherwise} \end{cases}$$

We shall show that the sample mean

$$\bar{\mathbf{x}}_n = \frac{\mathbf{x}_1 + \cdots + \mathbf{x}_n}{n}$$

of these RVs tends to $p$ in probability as $n \to \infty$.

Proof. As we know,

$$E\{\mathbf{x}_i\} = E\{\bar{\mathbf{x}}_n\} = p \qquad \sigma_{x_i}^2 = pq \qquad \sigma_{\bar{x}_n}^2 = \frac{pq}{n}$$
Furthermore, $pq = p(1 - p) \le 1/4$. Hence [see (5-57)]

$$P\{|\bar{\mathbf{x}}_n - p| < \varepsilon\} \ge 1 - \frac{pq}{n\varepsilon^2} \ge 1 - \frac{1}{4n\varepsilon^2} \to 1 \quad \text{as } n \to \infty$$

This reestablishes (8-106) because $\bar{\mathbf{x}}_n(\zeta) = k/n$ if $\mathcal{A}$ occurs $k$ times.

The strong law of large numbers (Borel) It can be shown that $\bar{\mathbf{x}}_n$ tends to $p$ not only in probability, but also with probability 1 (a.e.). This result, due to Borel, is known as the strong law of large numbers. The proof will not be given. We give below only a heuristic explanation of the difference between (8-106) and the strong law of large numbers in terms of relative frequencies.

Frequency interpretation We wish to estimate $p$ within an error $\varepsilon = 0.1$, using as its estimate the sample mean $\bar{\mathbf{x}}_n$. If $n \ge 1000$, then

$$P\{|\bar{\mathbf{x}}_n - p| < 0.1\} \ge 1 - \frac{1}{4n\varepsilon^2} \ge \frac{39}{40}$$

Thus, if we repeat the experiment at least 1000 times, then in 39 out of 40 such runs, our error $|\bar{\mathbf{x}}_n - p|$ will be less than 0.1.

Suppose, now, that we perform the experiment 2000 times and we determine the sample mean $\bar{\mathbf{x}}_n$ not for one $n$ but for every $n$ between 1000 and 2000. The Bernoulli version of the law of large numbers leads to the following conclusion: If our experiment (the toss of the coin 2000 times) is repeated a large number of times, then, for a specific $n$ larger than 1000, the error $|\bar{\mathbf{x}}_n - p|$ will exceed 0.1 only in one run out of 40. In other words, 97.5 percent of the runs will be "good." We cannot draw from this the conclusion that in the good runs the error will be less than 0.1 for every $n$ between 1000 and 2000. This conclusion is in fact correct, but it can be deduced only from the strong law of large numbers.

Ergodicity. Ergodicity is a topic dealing with the relationship between statistical averages and sample averages. This topic is treated in Sec. 12-1. In the following, we discuss certain results phrased in the form of limits of random sequences.

Markoff's theorem.
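We are given a sequence $\mathbf{x}_i$ of RVs and we form their sample mean

$$\bar{\mathbf{x}}_n = \frac{\mathbf{x}_1 + \cdots + \mathbf{x}_n}{n}$$

Clearly, $\bar{\mathbf{x}}_n$ is an RV whose values $\bar{\mathbf{x}}_n(\zeta)$ depend on the experimental outcome $\zeta$. We maintain that, if the RVs $\mathbf{x}_i$ are such that the mean $\bar{\eta}_n$ of $\bar{\mathbf{x}}_n$ tends to a limit $\eta$ and its variance $\bar{\sigma}_n^2$ tends to 0 as $n \to \infty$:

$$E\{\bar{\mathbf{x}}_n\} = \bar{\eta}_n \to \eta \qquad \bar{\sigma}_n^2 = E\{(\bar{\mathbf{x}}_n - \bar{\eta}_n)^2\} \to 0 \quad \text{as } n \to \infty \tag{8-107}$$

then the RV $\bar{\mathbf{x}}_n$ tends to $\eta$ in the MS sense

$$E\{(\bar{\mathbf{x}}_n - \eta)^2\} \to 0 \quad \text{as } n \to \infty \tag{8-108}$$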
Proof. The proof is based on the simple inequality

$$|\bar{\mathbf{x}}_n - \eta|^2 \le 2|\bar{\mathbf{x}}_n - \bar{\eta}_n|^2 + 2|\bar{\eta}_n - \eta|^2$$

Indeed, taking expected values of both sides, we obtain

$$E\{(\bar{\mathbf{x}}_n - \eta)^2\} \le 2E\{(\bar{\mathbf{x}}_n - \bar{\eta}_n)^2\} + 2(\bar{\eta}_n - \eta)^2$$

and (8-108) follows from (8-107).

COROLLARY (Tchebycheff's condition). If the RVs $\mathbf{x}_i$ are uncorrelated and

$$\frac{1}{n^2}\sum_{i=1}^{n}\sigma_i^2 \to 0 \quad \text{as } n \to \infty \tag{8-109}$$

then

$$\bar{\mathbf{x}}_n \to \eta = \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}E\{\mathbf{x}_i\}$$

in the MS sense.

Proof. It follows from the theorem because, for uncorrelated RVs, the left side of (8-109) equals $\bar{\sigma}_n^2$.

We note that Tchebycheff's condition (8-109) is satisfied if $\sigma_i < K < \infty$ for every $i$. This is the case if the RVs $\mathbf{x}_i$ are i.i.d. with finite variance.

Khinchin We mention without proof that if the RVs $\mathbf{x}_i$ are i.i.d., then their sample mean $\bar{\mathbf{x}}_n$ tends to $\eta$ even if nothing is known about their variance. In this case, however, $\bar{\mathbf{x}}_n$ tends to $\eta$ in probability only. The following is an application:

Example 8-14. We wish to determine the distribution $F(x)$ of an RV $\mathbf{x}$ defined in a certain experiment. For this purpose we repeat the experiment $n$ times and form the RVs $\mathbf{x}_i$ as in (8-12). As we know, these RVs are i.i.d. and their common distribution equals $F(x)$. We next form the RVs

$$\mathbf{y}_i(x) = \begin{cases} 1 & \text{if } \mathbf{x}_i \le x \\ 0 & \text{if } \mathbf{x}_i > x \end{cases}$$

where $x$ is a fixed number. The RVs $\mathbf{y}_i(x)$ so formed are also i.i.d. and their mean equals

$$E\{\mathbf{y}_i(x)\} = 1 \times P\{\mathbf{y}_i = 1\} = P\{\mathbf{x}_i \le x\} = F(x)$$

Applying Khinchin's theorem to $\mathbf{y}_i(x)$, we conclude that

$$\frac{\mathbf{y}_1(x) + \cdots + \mathbf{y}_n(x)}{n} \to F(x) \quad \text{as } n \to \infty$$

in probability. Thus, to determine $F(x)$, we repeat the original experiment $n$ times and count the number of times the RV $\mathbf{x}$ does not exceed $x$. If this number equals $k$ and $n$ is sufficiently large, then $F(x) \simeq k/n$. The above is thus a restatement of the relative frequency interpretation (4-3) of $F(x)$ in the form of a limit theorem.
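The counting procedure of Example 8-14 can be tried numerically. The following is a minimal sketch, assuming that the "experiment" is a draw from Python's `random` generator with uniform distribution in (0, 1), so that the true value $F(0.3) = 0.3$ is known; the function name `empirical_cdf`, the seed, and the sample size are illustrative choices, not part of the text.

```python
import random

def empirical_cdf(samples, x):
    """Relative frequency k/n of the event {x_i <= x}."""
    # k is the sum y_1(x) + ... + y_n(x) of the zero-one values of Example 8-14
    k = sum(1 for s in samples if s <= x)
    return k / len(samples)

rng = random.Random(0)                           # fixed seed (assumption, for repeatability)
samples = [rng.random() for _ in range(20000)]   # assumed experiment: uniform in (0, 1)
estimate = empirical_cdf(samples, 0.3)           # estimate of F(0.3); the true value is 0.3
print(estimate)
```

For other values of `x` the same call traces out the whole staircase estimate of $F(x)$.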
The Central Limit Theorem

Given $n$ independent RVs $\mathbf{x}_i$, we form their sum

$$\mathbf{x} = \mathbf{x}_1 + \cdots + \mathbf{x}_n$$

This is an RV with mean $\eta = \eta_1 + \cdots + \eta_n$ and variance $\sigma^2 = \sigma_1^2 + \cdots + \sigma_n^2$. The central limit theorem (CLT) states that under certain general conditions, the distribution $F(x)$ of $\mathbf{x}$ approaches a normal distribution with the same mean and variance:

$$F(x) \simeq G\!\left(\frac{x - \eta}{\sigma}\right) \tag{8-110}$$

as $n$ increases. Furthermore, if the RVs $\mathbf{x}_i$ are of continuous type, the density $f(x)$ of $\mathbf{x}$ approaches a normal density (Fig. 8-5a):

$$f(x) \simeq \frac{1}{\sigma\sqrt{2\pi}}\,e^{-(x-\eta)^2/2\sigma^2} \tag{8-111}$$

This important theorem can be stated as a limit: If $\mathbf{z} = (\mathbf{x} - \eta)/\sigma$ then

$$F_z(z) \to G(z) \qquad f_z(z) \to \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2} \quad \text{as } n \to \infty$$

for the general and for the continuous case respectively. The proof is outlined later.

The CLT can be expressed as a property of convolutions: The convolution of a large number of positive functions is approximately a normal function [see (8-51)].

FIGURE 8-5

The nature of the CLT approximation and the required value of $n$ for a specified error bound depend on the form of the densities $f_i(x)$. If the RVs $\mathbf{x}_i$ are i.i.d., the value $n = 30$ is adequate for most applications. In fact, if the
functions $f_i(x)$ are smooth, values of $n$ as low as 5 can be used. The next example is an illustration.

Example 8-15. The RVs $\mathbf{x}_i$ are i.i.d. and uniformly distributed in the interval $(0, T)$. We shall compare the density $f_x(x)$ of their sum $\mathbf{x}$ with the normal approximation (8-111) for $n = 2$ and $n = 3$. In this problem,

$$\eta_i = \frac{T}{2} \qquad \sigma_i^2 = \frac{T^2}{12} \qquad \eta = n\frac{T}{2} \qquad \sigma^2 = n\frac{T^2}{12}$$

$n = 2$: $f(x)$ is a triangle obtained by convolving a pulse with itself (Fig. 8-6)

$$\eta = T \qquad \sigma^2 = \frac{T^2}{6} \qquad f(x) \simeq \frac{1}{T}\sqrt{\frac{3}{\pi}}\,e^{-3(x-T)^2/T^2}$$

$n = 3$: $f(x)$ consists of three parabolic pieces obtained by convolving a triangle with a pulse

$$\eta = \frac{3T}{2} \qquad \sigma^2 = \frac{T^2}{4} \qquad f(x) \simeq \frac{1}{T}\sqrt{\frac{2}{\pi}}\,e^{-2(x-1.5T)^2/T^2}$$

As we can see from the figure, the approximation error is small even for such small values of $n$.

FIGURE 8-6

For a discrete-type RV, $F(x)$ is a staircase function approaching a normal distribution. The probabilities $p_k$, however, that $\mathbf{x}$ equals specific values $x_k$ are, in general, unrelated to the normal density. Lattice-type RVs are an exception:
If the RVs $\mathbf{x}_i$ take equidistant values $ak_i$, then $\mathbf{x}$ takes the values $ak$ and for large $n$, the discontinuities $p_k = P\{\mathbf{x} = ak\}$ of $F(x)$ at the points $x_k = ak$ equal the samples of the normal density, scaled by the lattice spacing $a$ (Fig. 8-5b):

$$P\{\mathbf{x} = ak\} \simeq \frac{a}{\sigma\sqrt{2\pi}}\,e^{-(ak-\eta)^2/2\sigma^2} \tag{8-112}$$

We give next an illustration in the context of Bernoulli trials. The RVs $\mathbf{x}_i$ of Example 8-7 are i.i.d. taking the values 1 and 0 with probabilities $p$ and $q$ respectively; hence their sum $\mathbf{x}$ is of lattice type (with $a = 1$) taking the values $k = 0, \ldots, n$. In this case,

$$E\{\mathbf{x}\} = nE\{\mathbf{x}_i\} = np \qquad \sigma^2 = n\sigma_i^2 = npq$$

Inserting into (8-112), we obtain the approximation

$$P\{\mathbf{x} = k\} = \binom{n}{k}p^kq^{n-k} \simeq \frac{1}{\sqrt{2\pi npq}}\,e^{-(k-np)^2/2npq} \tag{8-113}$$

This shows that the DeMoivre-Laplace theorem (3-27) is a special case of the lattice-type form (8-112) of the central limit theorem.

Example 8-16. A fair coin is tossed six times and $\mathbf{x}_i$ is the zero-one RV associated with the event {heads at the $i$th toss}. The probability of $k$ heads in six tosses equals

$$P\{\mathbf{x} = k\} = \binom{6}{k}\frac{1}{2^6} = p_k \qquad \mathbf{x} = \mathbf{x}_1 + \cdots + \mathbf{x}_6$$

In the following table we show the above probabilities and the samples of the normal curve $N(\eta, \sigma^2)$ (Fig. 8-7) where

$$\eta = np = 3 \qquad \sigma^2 = npq = 1.5$$

k          0      1      2      3      4      5      6
p_k        0.016  0.094  0.234  0.312  0.234  0.094  0.016
N(η, σ)    0.016  0.086  0.233  0.326  0.233  0.086  0.016
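The two rows of the table in Example 8-16 can be recomputed directly from (8-113). The following is a minimal sketch; the function names `binomial_pmf` and `normal_sample` are illustrative and not part of the text.

```python
import math

n, p = 6, 0.5                      # six tosses of a fair coin, as in Example 8-16
q = 1 - p
eta, var = n * p, n * p * q        # eta = np = 3, sigma^2 = npq = 1.5

def binomial_pmf(k):
    """Exact probability of k heads: left side of (8-113)."""
    return math.comb(n, k) * p**k * q**(n - k)

def normal_sample(k):
    """Sample of the normal density at k: right side of (8-113)."""
    return math.exp(-(k - eta) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Print k, p_k, and the normal sample, rounded to three decimals
for k in range(n + 1):
    print(k, round(binomial_pmf(k), 3), round(normal_sample(k), 3))
```

Changing `n` and `p` shows how quickly the two rows approach each other as $npq$ grows.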
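ERROR CORRECTION. In the approximation of $f(x)$ by the normal curve $N(\eta, \sigma^2)$, the error

$$\varepsilon(x) = f(x) - \frac{1}{\sigma\sqrt{2\pi}}\,e^{-x^2/2\sigma^2}$$

results where we assumed, shifting the origin, that $\eta = 0$. We shall express this error in terms of the moments

$$m_n = E\{\mathbf{x}^n\}$$

of $\mathbf{x}$ and the Hermite polynomials

$$H_k(x) = (-1)^k e^{x^2/2}\frac{d^k e^{-x^2/2}}{dx^k} = x^k - \binom{k}{2}x^{k-2} + 1\cdot 3\binom{k}{4}x^{k-4} + \cdots \tag{8-114}$$

These polynomials form a complete orthogonal set on the real line:

$$\int_{-\infty}^{\infty} e^{-x^2/2}H_n(x)H_m(x)\,dx = \begin{cases} n!\sqrt{2\pi} & n = m \\ 0 & n \ne m \end{cases}$$

Hence $\varepsilon(x)$ can be written as a series

$$\varepsilon(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-x^2/2\sigma^2}\sum_{k=3}^{\infty} C_k H_k\!\left(\frac{x}{\sigma}\right) \tag{8-115}$$

The series starts with $k = 3$ because the moments of $\varepsilon(x)$ of order up to 2 are 0. The coefficients $C_n$ can be expressed in terms of the moments $m_n$ of $\mathbf{x}$. Equating moments of order $n = 3$ and $n = 4$, we obtain [see (5-44)]

$$3!\,\sigma^3 C_3 = m_3 \qquad 4!\,\sigma^4 C_4 = m_4 - 3\sigma^4$$

First-order correction. From (8-114) it follows that

$$H_3(x) = x^3 - 3x \qquad H_4(x) = x^4 - 6x^2 + 3$$

Retaining the first nonzero term of the sum in (8-115), we obtain

$$f(x) \simeq \frac{1}{\sigma\sqrt{2\pi}}\,e^{-x^2/2\sigma^2}\left[1 + \frac{m_3}{6\sigma^3}\left(\frac{x^3}{\sigma^3} - \frac{3x}{\sigma}\right)\right] \tag{8-116}$$

If $f(x)$ is even, then $m_3 = 0$ and (8-115) yields

$$f(x) \simeq \frac{1}{\sigma\sqrt{2\pi}}\,e^{-x^2/2\sigma^2}\left[1 + \frac{1}{24}\left(\frac{m_4}{\sigma^4} - 3\right)\left(\frac{x^4}{\sigma^4} - \frac{6x^2}{\sigma^2} + 3\right)\right] \tag{8-117}$$

Example 8-17. If the RVs $\mathbf{x}_i$ are i.i.d. with density $f_i(x)$ as in Fig. 8-8a, then $f(x)$ consists of three parabolic pieces (see also Example 8-12) and $N(0, 1/4)$ is its normal approximation. Since $f(x)$ is even and $m_4 = 13/80$ (see Prob. 8-4), (8-117)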
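yields

$$f(x) \simeq \sqrt{\frac{2}{\pi}}\,e^{-2x^2}\left(1 - \frac{4x^4}{15} + \frac{2x^2}{5} - \frac{1}{20}\right) \equiv \bar{f}(x)$$

In Fig. 8-8b, we show the error $\varepsilon(x)$ of the normal approximation and the first-order correction error $f(x) - \bar{f}(x)$.

FIGURE 8-8

ON THE PROOF OF THE CENTRAL LIMIT THEOREM. We shall justify the approximation (8-111) using characteristic functions. We assume for simplicity that $\eta_i = 0$. Denoting by $\Phi_i(\omega)$ and $\Phi(\omega)$, respectively, the characteristic functions of the RVs $\mathbf{x}_i$ and $\mathbf{x} = \mathbf{x}_1 + \cdots + \mathbf{x}_n$, we conclude from the independence of $\mathbf{x}_i$ that

$$\Phi(\omega) = \Phi_1(\omega)\cdots\Phi_n(\omega)$$

Near the origin, the functions $\Psi_i(\omega) = \ln\Phi_i(\omega)$ can be approximated by a parabola:

$$\Psi_i(\omega) \simeq -\tfrac{1}{2}\sigma_i^2\omega^2 \qquad \Phi_i(\omega) \simeq e^{-\sigma_i^2\omega^2/2} \quad \text{for } |\omega| < \varepsilon \tag{8-118}$$

If the RVs $\mathbf{x}_i$ are of continuous type, then [see (5-61) and Prob. 5-25]

$$\Phi_i(0) = 1 \qquad |\Phi_i(\omega)| < 1 \quad \text{for } |\omega| \ne 0 \tag{8-119}$$

Equation (8-119) suggests that for small $\varepsilon$ and large $n$, the function $\Phi(\omega)$ is negligible for $|\omega| > \varepsilon$ (Fig. 8-9a). This holds also for the exponential $e^{-\sigma^2\omega^2/2}$ if $\sigma \to \infty$ as in (8-123). From the above it follows that

$$\Phi(\omega) \simeq e^{-\sigma_1^2\omega^2/2}\cdots e^{-\sigma_n^2\omega^2/2} = e^{-\sigma^2\omega^2/2} \quad \text{for all } \omega \tag{8-120}$$

in agreement with (8-111).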
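The approximation (8-120) can be checked numerically in a case where $\Phi_i(\omega)$ is known in closed form. The following is a minimal sketch, assuming (as an illustration only) that the $\mathbf{x}_i$ are i.i.d. and uniform in $(-c, c)$ with $c = 0.5$ and $n = 12$, so that $\Phi_i(\omega) = \sin(c\omega)/(c\omega)$ and $\sigma^2 = 1$; the grid of $\omega$ values is also an arbitrary choice.

```python
import math

n = 12                      # number of terms in the sum (assumption)
c = 0.5                     # each x_i is uniform in (-c, c) (assumption)
var_i = c * c / 3           # sigma_i^2 = (2c)^2 / 12
var = n * var_i             # sigma^2 = sigma_1^2 + ... + sigma_n^2 = 1

def phi_i(w):
    """Characteristic function sin(cw)/(cw) of a uniform RV in (-c, c)."""
    return 1.0 if w == 0 else math.sin(c * w) / (c * w)

def phi_sum(w):
    """Phi(w) = Phi_1(w) ... Phi_n(w) for n identical factors."""
    return phi_i(w) ** n

def phi_normal(w):
    """Right side of (8-120)."""
    return math.exp(-var * w * w / 2)

# Largest gap between the two sides of (8-120) on a grid 0, 0.1, ..., 10
max_gap = max(abs(phi_sum(0.1 * j) - phi_normal(0.1 * j)) for j in range(101))
print(max_gap)
```

Increasing `n` while rescaling `c` so that `var` stays fixed makes the gap shrink, which is the content of the limit argument.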
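The exact form of the theorem states that the normalized RV

$$\mathbf{z} = \frac{\mathbf{x}_1 + \cdots + \mathbf{x}_n}{\sigma} \qquad \sigma^2 = \sigma_1^2 + \cdots + \sigma_n^2$$

tends to an $N(0, 1)$ RV as $n \to \infty$:

$$f_z(z) \to \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2} \quad \text{as } n \to \infty \tag{8-121}$$

A general proof of the theorem is given below. In the following, we sketch a proof under the assumption that the RVs $\mathbf{x}_i$ are i.i.d. In this case

$$\Phi_1(\omega) = \cdots = \Phi_n(\omega) \qquad \sigma = \sigma_i\sqrt{n}$$

Hence,

$$\Phi_z(\omega) = \Phi_i^n\!\left(\frac{\omega}{\sigma_i\sqrt{n}}\right)$$

Expanding the functions $\Psi_i(\omega) = \ln\Phi_i(\omega)$ near the origin, we obtain

$$\Psi_i(\omega) = -\frac{\sigma_i^2\omega^2}{2} + O(\omega^3)$$

Hence,

$$\Psi_z(\omega) = n\Psi_i\!\left(\frac{\omega}{\sigma_i\sqrt{n}}\right) = -\frac{\omega^2}{2} + O\!\left(\frac{1}{\sqrt{n}}\right) \to -\frac{\omega^2}{2} \quad \text{as } n \to \infty \tag{8-122}$$

This shows that $\Phi_z(\omega) \to e^{-\omega^2/2}$ as $n \to \infty$ and (8-121) results.

As we noted, the theorem is not always true. The following is a set of sufficient conditions:

(a) $$\sigma_1^2 + \cdots + \sigma_n^2 \to \infty \quad \text{as } n \to \infty \tag{8-123}$$

(b) There exists a number $\alpha > 2$ and a finite constant $K$ such that

$$\int_{-\infty}^{\infty} |x|^{\alpha} f_i(x)\,dx < K < \infty \quad \text{for all } i \tag{8-124}$$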
These conditions are not the most general. However, they cover a wide range of applications. For example, (8-123) is satisfied if there exists a constant $\varepsilon > 0$ such that $\sigma_i > \varepsilon$ for all $i$. Condition (8-124) is satisfied if all densities $f_i(x)$ are 0 outside a finite interval $(-c, c)$ no matter how large.

Lattice type The preceding reasoning can also be applied to discrete-type RVs. However, in this case the functions $\Phi_i(\omega)$ are periodic (Fig. 8-9b) and their product takes significant values only in a small region near the points $\omega = 2\pi n/a$. Using the approximation (8-112) in each of these regions, we obtain

$$\Phi(\omega) \simeq \sum_{n} e^{-\sigma^2(\omega - n\omega_0)^2/2} \qquad \omega_0 = \frac{2\pi}{a} \tag{8-125}$$

As we can see from (11A-1), the inverse of the above yields (8-112).

The Berry-Esseen theorem† This theorem states that if

$$E\{|\mathbf{x}_i|^3\} \le c\sigma_i^2 \quad \text{all } i \tag{8-126}$$

where $c$ is some constant, then the distribution $\bar{F}(x)$ of the normalized sum

$$\bar{\mathbf{x}} = \frac{\mathbf{x}_1 + \cdots + \mathbf{x}_n}{\sigma} \qquad \sigma_1^2 + \cdots + \sigma_n^2 = \sigma^2$$

is close to the normal distribution $G(x)$ in the following sense

$$|\bar{F}(x) - G(x)| < \frac{4c}{\sigma} \tag{8-127}$$

The central limit theorem is a corollary of (8-127) because (8-127) leads to the conclusion that

$$\bar{F}(x) \to G(x) \quad \text{as } \sigma \to \infty \tag{8-128}$$

This proof is based on condition (8-126). This condition, however, is not too restrictive. It holds, for example, if the RVs $\mathbf{x}_i$ are i.i.d. and their third moment is finite.

We note, finally, that whereas (8-128) establishes merely the convergence in distribution of $\bar{\mathbf{x}}$ to a normal RV, (8-127) gives also a bound of the deviation of $\bar{F}(x)$ from normality.

†A. Papoulis: "Narrow-Band Systems and Gaussianity," IEEE Transactions on Information Theory, January 1972.

The central limit theorem for products. Given $n$ independent positive RVs $\mathbf{x}_i$, we form their product:

$$\mathbf{y} = \mathbf{x}_1\mathbf{x}_2\cdots\mathbf{x}_n \qquad \mathbf{x}_i > 0$$
THEOREM. For large $n$, the density of $\mathbf{y}$ is approximately lognormal:

$$f_y(y) \simeq \frac{1}{y\sigma\sqrt{2\pi}}\exp\left\{-\frac{1}{2\sigma^2}(\ln y - \eta)^2\right\}U(y) \tag{8-129}$$

where

$$\eta = \sum_{i=1}^{n} E\{\ln\mathbf{x}_i\} \qquad \sigma^2 = \sum_{i=1}^{n}\operatorname{Var}(\ln\mathbf{x}_i)$$

Proof. The RV

$$\mathbf{z} = \ln\mathbf{y} = \ln\mathbf{x}_1 + \cdots + \ln\mathbf{x}_n$$

is the sum of the RVs $\ln\mathbf{x}_i$. From the CLT it follows, therefore, that for large $n$, this RV is nearly normal with mean $\eta$ and variance $\sigma^2$. And since $\mathbf{y} = e^{\mathbf{z}}$, we conclude from (5-10) that $\mathbf{y}$ has a lognormal density. The theorem holds if the RVs $\ln\mathbf{x}_i$ satisfy the conditions for the validity of the CLT.

Example 8-18. Suppose that the RVs $\mathbf{x}_i$ are uniform in the interval (0, 1). In this case,

$$E\{\ln\mathbf{x}_i\} = \int_0^1 \ln x\,dx = -1 \qquad E\{(\ln\mathbf{x}_i)^2\} = \int_0^1 (\ln x)^2\,dx = 2$$

Hence $\eta = -n$ and $\sigma^2 = n$. Inserting into (8-129), we conclude that the density of the product $\mathbf{y} = \mathbf{x}_1\cdots\mathbf{x}_n$ equals

$$f_y(y) \simeq \frac{1}{y\sqrt{2\pi n}}\exp\left\{-\frac{1}{2n}(\ln y + n)^2\right\}U(y)$$

8-5 RANDOM NUMBERS: MEANING AND GENERATION

Random numbers (RNs) are used in a variety of applications involving computers and statistics. In this section, we explain the underlying ideas concentrating on the meaning and generation of RNs. We start with a simple illustration of the role of statistics in the numerical solution of deterministic problems.

MONTE CARLO INTEGRATION. We wish to evaluate the integral

$$I = \int_0^1 g(x)\,dx \tag{8-130}$$

For this purpose, we introduce an RV $\mathbf{x}$ with uniform distribution in the interval (0, 1) and we form the RV $\mathbf{y} = g(\mathbf{x})$. As we know,

$$E\{g(\mathbf{x})\} = \int_0^1 g(x)f_x(x)\,dx = \int_0^1 g(x)\,dx \tag{8-131}$$

hence $\eta_y = I$. We have thus expressed the unknown $I$ as the expected value of the RV $\mathbf{y}$. This result involves only concepts; it does not yield a numerical
method for evaluating $I$. Suppose, however, that the RV $\mathbf{x}$ models a physical quantity in a real experiment. We can then estimate $I$ using the relative frequency interpretation of expected values: We repeat the experiment a large number of times and observe the values $x_i$ of $\mathbf{x}$; we compute the corresponding values $y_i = g(x_i)$ of $\mathbf{y}$ and form their average as in (5-26). This yields

$$I = E\{g(\mathbf{x})\} \simeq \frac{1}{n}\sum_{i} g(x_i) \tag{8-132}$$

The above suggests the following method for determining $I$: The data $x_i$, no matter how they are obtained, are random numbers; that is, they are numbers having certain properties. If, therefore, we can numerically generate such numbers, we have a method for determining $I$. To carry out this method, we must reexamine the meaning of RNs and develop computer programs for generating them.

THE DUAL INTERPRETATION OF RNs. "What are RNs? Can they be generated by a computer? Is it possible to generate truly random number sequences?" Such questions do not have a generally accepted answer. The reason is simple. As in the case of probability (see Chap. 1), the term random numbers has two distinctly different meanings. The first is theoretical: RNs are mental constructs defined in terms of an abstract model. The second is empirical: RNs are sequences of real numbers generated either as physical data obtained from a random experiment or as computer output obtained from a deterministic program. The duality of interpretation of RNs is apparent in the following extensively quoted definitions†:

A sequence of numbers is random if it has every property that is shared by all infinite sequences of independent samples of random variables from the uniform distribution. (J. M. Franklin)

A random sequence is a vague notion embodying the ideas of a sequence in which each term is unpredictable to the uninitiated and whose digits pass a certain number of tests, traditional with statisticians and depending somewhat on the uses to which the sequence is to be put. (D. H.
Lehmer)

†D. E. Knuth: The Art of Computer Programming, Addison-Wesley, Reading, MA, 1969.

It is obvious that these definitions cannot have the same meaning. Nevertheless, both are used to define RN sequences. To avoid this confusing ambiguity, we shall give two definitions: one theoretical, the other empirical. For these definitions we shall rely solely on the uses for which RNs are intended: RNs are used to apply statistical techniques to other fields. It is natural, therefore, that they are defined in terms of the corresponding probabilistic concepts and their
properties as physically generated numbers are expressed directly in terms of the properties of real data generated by random experiments.

CONCEPTUAL DEFINITION. A sequence of numbers $x_i$ is called random if it equals the samples $x_i = \mathbf{x}_i(\zeta)$ of a sequence $\mathbf{x}_i$ of i.i.d. RVs defined in the space of repeated trials.

It appears that this definition is the same as Franklin's. There is, however, a subtle but important difference. Franklin says that the sequence $x_i$ has every property shared by i.i.d. RVs; we say that $x_i$ equals the samples of the i.i.d. RVs $\mathbf{x}_i$. In this definition, all theoretical properties of RNs are the same as the corresponding properties of RVs. There is, therefore, no need for a new theory.

EMPIRICAL DEFINITION. A sequence of numbers $x_i$ is called random if its statistical properties are the same as the properties of random data obtained from a random experiment.

Not all experimental data lead to conclusions consistent with the theory of probability. For this to be the case, the experiments must be so designed that data obtained by repeated trials satisfy the i.i.d. condition. This condition is accepted only after the data have been subjected to a variety of tests and in any case, it can be claimed only as an approximation. The same applies to computer-generated RNs. Such uncertainties, however, cannot be avoided no matter how we define physically generated sequences. The advantage of the above definition is that it shifts the problem of establishing the randomness of a sequence of numbers to an area with which we are already familiar. We can, therefore, draw directly on our experience with random experiments and apply the well-established tests of randomness to computer-generated RNs.
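With RNs understood in this empirical sense, the Monte Carlo estimate (8-132) can be tried on a computer. The following is a minimal sketch, assuming that Python's `random` generator supplies the uniform numbers $x_i$ and taking $g(x) = x^2$ (exact integral 1/3) as an illustrative integrand; the function name `monte_carlo_integral`, the seed, and the sample size are assumptions, not part of the text.

```python
import random

def monte_carlo_integral(g, n, rng):
    """Estimate the integral of g over (0, 1) by the average of g(x_i), as in (8-132)."""
    return sum(g(rng.random()) for _ in range(n)) / n

rng = random.Random(1)                     # fixed seed (assumption, for repeatability)
# Illustrative integrand g(x) = x^2; its exact integral over (0, 1) is 1/3
estimate = monte_carlo_integral(lambda x: x * x, 100_000, rng)
print(estimate)
```

The error of such an estimate decreases roughly as $1/\sqrt{n}$, in line with the variance $\sigma^2/n$ of a sample mean.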
Generation of RN Sequences

RNs used in Monte Carlo calculations are generated mainly by computer programs; however, they can also be generated as observations of random data obtained from real experiments: The tosses of a fair coin generate a random sequence of 0's (heads) and 1's (tails); the distance between radioactive emissions generates a random sequence of exponentially distributed samples. We accept number sequences so generated as random because of our long experience with such experiments. RN sequences experimentally generated are not, however, suitable for computer use, for obvious reasons. An efficient source of RNs is a computer program with small memory, involving simple arithmetic operations. We outline next the most commonly used programs.

Our objective is to generate RN sequences with arbitrary distributions. In the present state of the art, however, this cannot be done directly. The available algorithms only generate sequences consisting of integers $z_i$ uniformly distributed in an interval $(0, m)$. As we show later, the generation of a sequence $x_i$ with an arbitrary distribution is obtained indirectly by a variety of methods involving the uniform sequence $z_i$.
The most general algorithm for generating an RN sequence $z_n$ is an equation of the form

$$z_n = f(z_{n-1}, \ldots, z_{n-r}) \bmod m \tag{8-133}$$

where $f(z_{n-1}, \ldots, z_{n-r})$ is a function depending on the $r$ most recent past values of $z_n$. In this notation, $z_n$ is the remainder of the division of the number $f(z_{n-1}, \ldots, z_{n-r})$ by $m$. The above is a nonlinear recursion expressing $z_n$ in terms of the constant $m$, the function $f$, and the initial conditions $z_0, \ldots, z_{r-1}$. The quality of the generator depends on the form of the function $f$. It might appear that good RN sequences result if this function is complicated. Experience has shown, however, that this is not the case. Most algorithms in use are linear recursions of order 1. We shall discuss the homogeneous case.

LEHMER'S ALGORITHM. The simplest and one of the oldest RN generators is the recursion

$$z_n = az_{n-1} \bmod m \qquad z_0 = 1 \qquad n \ge 1 \tag{8-134}$$

where $m$ is a large prime number and $a$ is an integer. Solving, we obtain

$$z_n = a^n \bmod m \tag{8-135}$$

The sequence $z_n$ takes values between 1 and $m - 1$; hence at least two of its first $m$ values are equal. From this it follows that $z_n$ is a periodic sequence for $n > m$ with period $m_0 \le m - 1$. A periodic sequence is not, of course, random. However, if for the applications for which it is intended the required number of samples does not exceed $m_0$, periodicity is irrelevant. For most applications, it is enough to choose for $m$ a number of the order of $10^9$ and to search for a constant $a$ such that $m_0 = m - 1$. A value for $m$ suggested by Lehmer in 1951 is the prime number $2^{31} - 1$.

To complete the specification of (8-134), we must assign a value to the multiplier $a$. Our first condition is that the period $m_0$ of the resulting sequence $z_n$ equal $m - 1$.

DEFINITION. An integer $a$ is called the primitive root of $m$ if the smallest $n$ such that

$$a^n = 1 \bmod m \quad \text{is} \quad n = m - 1 \tag{8-136}$$

From the definition it follows that the sequence $a^n$ is periodic with period $m_0 = m - 1$ iff $a$ is a primitive root of $m$.
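The recursion (8-134) and the primitive-root condition (8-136) are easy to explore for a small modulus. The following is a minimal sketch, assuming the small prime $m = 31$ purely for illustration (a practical generator uses a modulus of the order of $10^9$); the function names `lehmer` and `period` are not part of the text.

```python
def lehmer(a, m, count, z0=1):
    """Return z_1, ..., z_count of the recursion z_n = a * z_{n-1} mod m of (8-134)."""
    z, out = z0, []
    for _ in range(count):
        z = (a * z) % m
        out.append(z)
    return out

def period(a, m):
    """Smallest n >= 1 with a**n = 1 mod m (assumes m prime and 0 < a < m)."""
    z, n = a % m, 1
    while z != 1:
        z = (a * z) % m
        n += 1
    return n

m = 31                                   # small prime, chosen only for illustration
# Multipliers whose period equals m - 1, that is, the primitive roots of (8-136)
primitive_roots = [a for a in range(2, m) if period(a, m) == m - 1]
print(primitive_roots)
print(lehmer(primitive_roots[0], m, m - 1))   # one full period of the sequence
```

For a primitive root, one full period visits every integer from 1 to $m - 1$ exactly once; for any other multiplier the sequence cycles through a proper subset.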
Most primitive roots do not generate good RN sequences. For a final selection, we subject specific choices to a variety of tests based on tests of randomness involving real experiments. Most tests are carried out not in terms of the integers $z_i$ but in terms of the properties of the numbers

$$u_i = \frac{z_i}{m} \tag{8-137}$$

These numbers take essentially all values in the interval (0, 1) and the purpose
of testing is to establish whether they are the values of a sequence $\mathbf{u}_i$ of continuous-type i.i.d. RVs uniformly distributed in the interval (0, 1). The i.i.d. condition leads to the following equations:

For every $u_i$ in the interval (0, 1) and for every $n$,

$$P\{\mathbf{u}_i \le u_i\} = u_i \tag{8-138a}$$

$$P\{\mathbf{u}_1 \le u_1, \ldots, \mathbf{u}_n \le u_n\} = P\{\mathbf{u}_1 \le u_1\}\cdots P\{\mathbf{u}_n \le u_n\} \tag{8-138b}$$

To establish the validity of these equations, we need an infinite number of tests. In real life, however, we can perform only a finite number of tests. Furthermore, all tests involve approximation based on the empirical interpretation of probability. We cannot, therefore, claim with certainty that a sequence of real numbers is truly random. We can claim only that a particular sequence is reasonably random for certain applications or that one sequence is more random than another. In practice, a sequence $u_n$ is accepted as random not only because it passes the standard tests but also because it has been used with satisfactory results in many problems.

Over the years, several algorithms have been proposed for generating "good" RN sequences. Not all, however, have withstood the test of time. An example of a sequence $z_n$ that seems to meet most requirements is obtained from (8-134) with $a = 7^5$ and $m = 2^{31} - 1$:

$$z_n = 16{,}807\,z_{n-1} \bmod 2{,}147{,}483{,}647 \tag{8-139}$$

This sequence meets most standard tests of randomness and has been used effectively in a variety of applications.†

We conclude with the observation that most tests of randomness are applications, direct or indirect, of well-known tests of various statistical hypotheses. For example, to establish the validity of (8-138a), we apply the Kolmogoroff-Smirnov test, page 272, or the chi-square test, page 273. These tests are used to determine whether given experimental data fit a particular distribution. To establish the validity of (8-138b), we apply the chi-square test, page 274.
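The generator (8-139) and the scaling (8-137) take only a few lines of code. The following is a minimal sketch; the function names, the seed $z_0 = 1$, and the sample size are illustrative assumptions, and the sample mean computed at the end is only a crude sanity check, not one of the formal tests mentioned above.

```python
A = 16807            # multiplier a = 7**5 from (8-139)
M = 2147483647       # modulus m = 2**31 - 1 from (8-139)

def lehmer_states(seed, count):
    """Return z_1, ..., z_count of z_n = A * z_{n-1} mod M, starting from z_0 = seed."""
    z, states = seed, []
    for _ in range(count):
        z = (A * z) % M          # Python integers do not overflow, so no special care is needed
        states.append(z)
    return states

def uniform_rns(seed, count):
    """Scale the integers to u_i = z_i / m as in (8-137)."""
    return [z / M for z in lehmer_states(seed, count)]

u = uniform_rns(1, 10000)
sample_mean = sum(u) / len(u)    # should be near 0.5 if the u_i behave as uniform RNs
print(sample_mean)
```

In languages with fixed-width integers, the product `A * z` can exceed 32 bits, so an implementation there needs a wider intermediate type or a factoring trick.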
This chi-square test is used to determine the independence of various events. In addition to direct testing, a variety of special methods have been proposed for testing indirectly the validity of both equations in (8-138). These methods are based on well-known properties of RVs and they are designed for particular applications. The generation of random vector sequences is an application requiring special tests.

†S. K. Park and K. W. Miller: "Random Number Generators: Good Ones Are Hard to Find," Communications of the ACM, vol. 31, no. 10, October 1988.

Random vectors. We shall construct a multidimensional sequence of RNs using the following properties of subsequences. Suppose that $\mathbf{x}$ is an RV with distribution $F(x)$ and $x_i$ is the corresponding RN sequence. It follows from
(8-138) that every subsequence of $x_i$ is an RN sequence with distribution $F(x)$. Furthermore, if two subsequences have no common elements, they are the samples of two independent RVs. From this we conclude that the odd-subscript and even-subscript sequences

$$x_i^o = x_{2i-1} \qquad x_i^e = x_{2i} \qquad i = 1, 2, \ldots$$

are the samples of two i.i.d. RVs $\mathbf{x}^o$ and $\mathbf{x}^e$ with distribution $F(x)$. Thus, starting from a scalar RN sequence $x_i$, we constructed a vector RN sequence $(x_i^o, x_i^e)$. Proceeding similarly, we can construct RN sequences of any dimensionality. Using superscripts to identify various RVs and their samples, we conclude that the RN sequences

$$x_i^k = x_{mi-m+k} \qquad k = 1, \ldots, m \qquad i = 1, 2, \ldots \tag{8-140}$$

are the samples of $m$ i.i.d. RVs $\mathbf{x}^1, \ldots, \mathbf{x}^m$ with distribution $F(x)$.

Note that a sequence of numbers might be sufficiently random for scalar but not for vector applications. If, therefore, an RN sequence $x_i$ is to be used for multidimensional applications, it is desirable to subject it to special tests involving its subsequences.

RN Sequences with Arbitrary Distributions

In the following, the letter $\mathbf{u}$ will identify an RV with uniform distribution in the interval (0, 1); the corresponding RN sequence will be identified by $u_i$. Using the sequence $u_i$, we shall present a variety of methods for generating sequences with arbitrary distributions. In this analysis, we shall make frequent use of the following:

If $x_i$ are the samples of the RV $\mathbf{x}$, then $y_i = g(x_i)$ are the samples of the RV $\mathbf{y} = g(\mathbf{x})$.

For example, if $x_i$ is an RN sequence with distribution $F_x(x)$, then $y_i = a + bx_i$ is an RN sequence with distribution $F_x[(y - a)/b]$ if $b > 0$, and $1 - F_x[(y - a)/b]$ if $b < 0$. From this it follows, for example, that $v_i = 1 - u_i$ is an RN sequence uniform in the interval (0, 1).

PERCENTILE TRANSFORMATION METHOD. Consider an RV $\mathbf{x}$ with distribution $F_x(x)$. We have shown in Sec. 5-2 that the RV $\mathbf{u} = F_x(\mathbf{x})$ is uniform in the interval (0, 1) no matter what the form of $F_x(x)$ is.
Denoting by $F_x^{(-1)}(u)$ the inverse of $F_x(x)$, we conclude that $\mathbf{x} = F_x^{(-1)}(\mathbf{u})$ (see Fig. 8-10). From this it follows that

$$x_i = F_x^{(-1)}(u_i) \tag{8-141}$$

is an RN sequence with distribution $F_x(x)$ [see also (5-19)]. Thus, to find an RN sequence $x_i$ whose distribution is a given function $F_x(x)$, it suffices to determine the inverse of $F_x(x)$ and to compute $F_x^{(-1)}(u_i)$. Note that the numbers $x_i$ are the $u_i$ percentiles of $F_x(x)$.
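Rule (8-141) translates directly into code once the inverse distribution is available. The following is a minimal sketch, assuming for illustration the distribution $F_x(x) = x^2$ on (0, 1), whose inverse is $\sqrt{u}$ and whose mean is 2/3; the function name `percentile_transform`, the seed, and the sample size are assumptions, not part of the text.

```python
import math
import random

def percentile_transform(inverse_cdf, uniforms):
    """Apply x_i = F^(-1)(u_i) of (8-141) to a uniform RN sequence."""
    return [inverse_cdf(u) for u in uniforms]

def inverse_cdf(u):
    """Inverse of the assumed distribution F_x(x) = x**2 on (0, 1)."""
    return math.sqrt(u)

rng = random.Random(2)                         # fixed seed (assumption)
u = [rng.random() for _ in range(50000)]       # uniform RN sequence u_i
x = percentile_transform(inverse_cdf, u)       # RN sequence with distribution F_x
sample_mean = sum(x) / len(x)                  # mean of F_x is 2/3
fraction_below_half = sum(1 for v in x if v <= 0.5) / len(x)   # F_x(0.5) = 0.25
print(sample_mean, fraction_below_half)
```

Only `inverse_cdf` changes from one target distribution to another; the rest of the procedure is the same.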
Example 8-19. We wish to generate an RN sequence $x_i$ with exponential distribution. In this case,

$$F_x(x) = 1 - e^{-x/\lambda} \qquad x = -\lambda\ln(1 - u)$$

Since $1 - \mathbf{u}$ is an RV with uniform distribution, we conclude that the sequence

$$x_i = -\lambda\ln u_i \tag{8-142}$$

has an exponential distribution.

Example 8-20. We wish to generate an RN sequence $x_i$ with Rayleigh distribution. In this case,

$$F_x(x) = 1 - e^{-x^2/2} \qquad F_x^{(-1)}(u) = \sqrt{-2\ln(1 - u)}$$

Replacing $1 - u$ by $u$, we conclude that the sequence

$$x_i = \sqrt{-2\ln u_i} \tag{8-143}$$

has a Rayleigh distribution.

Suppose now that we wish to generate the samples $x_i$ of a discrete-type RV $\mathbf{x}$ taking the values $a_k$ with probability

$$p_k = P\{\mathbf{x} = a_k\} \qquad k = 1, 2, \ldots$$

In this case, $F_x(x)$ is a staircase function (Fig. 8-11) with discontinuities at the points $a_k$, and its inverse is a staircase function with discontinuities at the points $F_x(a_k) = p_1 + \cdots + p_k$. Applying (8-141), we obtain the following rule for generating the RN sequence $x_i$:

$$\text{Set } x_i = a_k \quad \text{iff} \quad p_1 + \cdots + p_{k-1} \le u_i < p_1 + \cdots + p_k \tag{8-144}$$
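Rule (8-144) amounts to locating each $u_i$ among the cumulative sums $p_1 + \cdots + p_k$. The following is a minimal sketch using a binary search from the standard `bisect` module; the function name `discrete_rns` and the three-valued example are illustrative assumptions, not part of the text.

```python
import bisect
import itertools

def discrete_rns(values, probs, uniforms):
    """Map each u_i to the value a_k selected by rule (8-144)."""
    cum = list(itertools.accumulate(probs))      # p_1, p_1 + p_2, ...
    out = []
    for u in uniforms:
        # Number of cumulative sums <= u, i.e. the 0-based index k with
        # cum[k-1] <= u < cum[k]
        k = bisect.bisect_right(cum, u)
        # Guard against rounding in the last cumulative sum
        out.append(values[min(k, len(values) - 1)])
    return out

# Assumed example: values a, b, c with probabilities 0.25, 0.5, 0.25
picks = discrete_rns(["a", "b", "c"], [0.25, 0.5, 0.25],
                     [0.0, 0.24, 0.25, 0.74, 0.75, 0.99])
print(picks)
```

The binary search keeps the cost per sample logarithmic in the number of values, which matters when the list of $a_k$ is long.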
FIGURE 8-11

Example 8-21. The sequence

$$x_i = \begin{cases} 0 & \text{if } 0 < u_i < p \\ 1 & \text{if } p < u_i < 1 \end{cases}$$

takes the values 0 and 1 with probability $p$ and $1 - p$ respectively. It specifies, therefore, a binary RN sequence. The sequence

$$x_i = k \quad \text{iff} \quad 0.1k < u_i < 0.1(k + 1) \qquad k = 0, 1, \ldots, 9$$

takes the values 0, 1, ..., 9 with equal probability. It specifies, therefore, a decimal RN sequence with uniform distribution.

Setting

$$a_k = k \qquad p_k = \binom{n}{k}p^kq^{n-k} \qquad k = 0, 1, \ldots, n$$

into (8-144), we obtain an RN sequence with binomial distribution.

Setting

$$a_k = k \qquad p_k = e^{-\lambda}\frac{\lambda^k}{k!} \qquad k = 0, 1, \ldots$$

into (8-144), we obtain an RN sequence with Poisson distribution.

Suppose now that we are given not a uniform sequence, but a sequence $x_i$ with distribution $F_x(x)$. We wish to find a sequence $y_i$ with distribution $F_y(y)$. As we know, $y_i = F_y^{(-1)}(u_i)$ is an RN sequence with distribution $F_y(y)$. Hence (see Fig. 8-10) the composite function

$$y_i = F_y^{(-1)}(F_x(x_i)) \tag{8-145}$$

generates an RN sequence with distribution $F_y(y)$ [see also (5-20)].
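The composite rule (8-145) can be sketched for a pair of distributions whose formulas appear above. As an illustrative assumption, the given sequence is taken to be exponential with $F_x(x) = 1 - e^{-x}$ and the target is the Rayleigh distribution of Example 8-20; in that case the composite function reduces to $y = \sqrt{2x}$. The function names, seed, and sample size are not part of the text.

```python
import math
import random

def F_x(x):
    """Distribution of the given sequence (assumed exponential with unit mean)."""
    return 1.0 - math.exp(-x)

def F_y_inverse(u):
    """Inverse of the Rayleigh distribution F_y(y) = 1 - exp(-y**2 / 2)."""
    return math.sqrt(-2.0 * math.log(1.0 - u))

def transform(x):
    """Composite function of (8-145); assumes F_x(x) < 1 in floating point."""
    return F_y_inverse(F_x(x))

rng = random.Random(3)                                  # fixed seed (assumption)
x = [rng.expovariate(1.0) for _ in range(50000)]        # given exponential RNs
y = [transform(v) for v in x]                           # RNs with Rayleigh distribution
sample_mean = sum(y) / len(y)                           # Rayleigh mean is sqrt(pi / 2)
print(sample_mean, math.sqrt(math.pi / 2))
```

The intermediate values `F_x(v)` play the role of the uniform sequence $u_i$, so no separate uniform generator is needed.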
Example 8-22. We are given an RN sequence $x_i > 0$ with distribution $F_x(x) = 1 - e^{-x} - xe^{-x}$ and we wish to generate an RN sequence $y_i > 0$ with distribution $F_y(y) = 1 - e^{-y}$. In this example $F_y^{(-1)}(u) = -\ln(1 - u)$; hence

$$F_y^{(-1)}(F_x(x)) = -\ln[1 - F_x(x)] = -\ln(e^{-x} + xe^{-x})$$

Inserting into (8-145), we obtain

$$y_i = -\ln(e^{-x_i} + x_ie^{-x_i})$$

REJECTION METHOD. In the percentile transformation method, we used the inverse of the function $F_x(x)$. However, inverting a function is not a simple task. To overcome this difficulty, we develop next a method that avoids inversion. The problem under consideration is the generation of an RN sequence $y_i$ with distribution $F_y(y)$ in terms of the RN sequence $x_i$ as in (8-145).

The proposed method is based on the relative frequency interpretation of the conditional density

$$f_x(x\,|\,\mathcal{M})\,dx = P\{x < \mathbf{x} \le x + dx\,|\,\mathcal{M}\} = \frac{P\{x < \mathbf{x} \le x + dx,\ \mathcal{M}\}}{P(\mathcal{M})} \tag{8-146}$$

of an RV $\mathbf{x}$ assuming $\mathcal{M}$ (see page 80). In the following method, the event $\mathcal{M}$ is expressed in terms of the RV $\mathbf{x}$ and another RV $\mathbf{u}$, and it is so chosen that the resulting function $f_x(x\,|\,\mathcal{M})$ equals $f_y(x)$. The sequence $y_i$ is generated by setting $y_i = x_i$ if $\mathcal{M}$ occurs, rejecting $x_i$ otherwise.

The problem has a solution only if $f_y(x) = 0$ in every interval in which $f_x(x) = 0$. We can assume, therefore, without essential loss of generality, that the ratio $f_x(x)/f_y(x)$ is bounded from below by some positive constant $a$:

$$\frac{f_x(x)}{f_y(x)} \ge a > 0 \quad \text{for every } x$$

Rejection theorem. If the RVs $\mathbf{x}$ and $\mathbf{u}$ are independent and

$$\mathcal{M} = \{\mathbf{u} \le r(\mathbf{x})\} \quad \text{where} \quad r(x) = a\frac{f_y(x)}{f_x(x)} \le 1 \tag{8-147}$$

then

$$f_x(x\,|\,\mathcal{M}) = f_y(x) \tag{8-148}$$

Proof. The joint density of the RVs $\mathbf{x}$ and $\mathbf{u}$ equals $f_x(x)$ in the strip $0 < u < 1$ of the $xu$ plane, and 0 elsewhere. The event $\mathcal{M}$ consists of all outcomes such that the point $(\mathbf{x}, \mathbf{u})$ is in the shaded area of Fig. 8-12 below the curve $u = r(x)$. Hence

$$P(\mathcal{M}) = \int_{-\infty}^{\infty} r(x)f_x(x)\,dx = a\int_{-\infty}^{\infty} f_y(x)\,dx = a$$

The event $\{x < \mathbf{x} \le x + dx,\ \mathcal{M}\}$ consists of all outcomes such that the point
$(\mathbf{x}, \mathbf{u})$ is in the strip $x < \mathbf{x} \le x + dx$ below the curve $u = r(x)$. The probability masses in this strip equal $f_x(x) r(x)\,dx$. Hence

$$P\{x < \mathbf{x} \le x + dx,\ \mathcal{M}\} = f_x(x) r(x)\,dx$$

Inserting into (8-146), we obtain (8-148).

[Figure 8-12]

From the rejection theorem it follows that the subsequence of $x_i$ such that $u_i \le r(x_i)$ forms a sequence of random numbers that are the samples of an RV $\mathbf{y}$ with density $f_x(y \mid \mathcal{M}) = f_y(y)$. This leads to the following rule for generating the sequence $y_i$: Form the two-dimensional RN sequence $(x_i, u_i)$.

$$\text{Set } y_i = x_i \ \text{ if } \ u_i \le a\,\frac{f_y(x_i)}{f_x(x_i)}; \quad \text{reject } x_i \text{ otherwise} \tag{8-149}$$

Example 8-23. We are given an RN sequence $x_i$ with exponential distribution and we wish to construct an RN sequence $y_i$ with truncated normal distribution:

$$f_x(x) = e^{-x}U(x) \qquad f_y(y) = \frac{2}{\sqrt{2\pi}}\,e^{-y^2/2}\,U(y)$$

For $x > 0$,

$$\frac{f_y(x)}{f_x(x)} = \sqrt{\frac{2e}{\pi}}\;e^{-(x-1)^2/2} \le \sqrt{\frac{2e}{\pi}}$$

Setting $a = \sqrt{\pi/2e}$, we obtain the following rule for generating the sequence $y_i$:

Set $y_i = x_i$ if $u_i < e^{-(x_i - 1)^2/2}$; reject $x_i$ otherwise.

MIXING METHOD. We develop next a method generating an RN sequence $x_i$ with density $f(x)$ under the following assumptions: The function $f(x)$ can be expressed as the weighted sum of $m$ densities $f_k(x)$:

$$f(x) = p_1 f_1(x) + \cdots + p_m f_m(x) \qquad p_k > 0 \tag{8-150}$$

Each component $f_k(x)$ is the density of a known RN sequence $x_i^k$. In the mixing method, we generate the sequence $x_i$ by a mixing process involving certain subsequences of the $m$ sequences $x_i^k$ selected according to the following rule:

$$\text{Set } x_i = x_i^k \ \text{ if } \ p_1 + \cdots + p_{k-1} \le u_i < p_1 + \cdots + p_k \tag{8-151}$$
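The two selection rules stated above, (8-149) and (8-151), can be sketched together as follows. This is a minimal illustration, not a definitive implementation: the function names `rejection_sequence` and `mixing_index` are hypothetical, the uniform samples come from Python's `random` module, the exponential input sequence is produced as $-\ln$ of a uniform sample, and the mixture weights in the second demo are arbitrary.

```python
import math
import random

def rejection_sequence(x_samples, u_samples, r):
    """Rule (8-149): keep x_i whenever u_i <= r(x_i), reject it otherwise."""
    return [x for x, u in zip(x_samples, u_samples) if u <= r(x)]

def mixing_index(u, weights):
    """Rule (8-151): return k (0-based) such that the partial sums of the
    weights satisfy p_1+...+p_{k-1} <= u < p_1+...+p_k."""
    cumulative = 0.0
    for k, p in enumerate(weights):
        cumulative += p
        if u < cumulative:
            return k
    return len(weights) - 1  # round-off guard

rng = random.Random(1)  # fixed seed, only so that the run is repeatable
n = 50000

# Example 8-23: exponential x_i, accepted with r(x) = exp(-(x - 1)^2 / 2).
x = [-math.log(1.0 - rng.random()) for _ in range(n)]  # 1 - u lies in (0, 1], so log is defined
u = [rng.random() for _ in range(n)]
y = rejection_sequence(x, u, lambda t: math.exp(-(t - 1.0) ** 2 / 2.0))

acceptance = len(y) / n      # theory: P(M) = a = sqrt(pi / (2e))
mean_y = sum(y) / len(y)     # theory for the truncated normal density: sqrt(2 / pi)

# Mixing rule with hypothetical weights: the fraction of times each
# component is selected should approach its weight p_k.
weights = [0.2, 0.5, 0.3]
counts = [0, 0, 0]
for _ in range(n):
    counts[mixing_index(rng.random(), weights)] += 1
fractions = [c / n for c in counts]
```

The acceptance fraction illustrates the cost of the rejection method: on average only a fraction $a$ of the input samples survives, so a larger bound $a$ wastes fewer samples.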
Mixing theorem. If the sequences $u_i$ and $x_i^1, \ldots, x_i^m$ are mutually independent, then the density $f_x(x)$ of the sequence $x_i$ specified by (8-151) equals

$$f_x(x) = p_1 f_1(x) + \cdots + p_m f_m(x) \tag{8-152}$$

Proof. The sequence $x_i$ is a mixture of $m$ subsequences. The density of the subsequence of the $k$th sequence $x_i^k$ equals $f_k(x)$. This subsequence is also a subsequence of $x_i$ conditioned on the event

$$\mathcal{A}_k = \{p_1 + \cdots + p_{k-1} \le \mathbf{u} < p_1 + \cdots + p_k\}$$

Hence its density also equals $f_x(x \mid \mathcal{A}_k)$. This leads to the conclusion that

$$f_x(x \mid \mathcal{A}_k) = f_k(x)$$

From the total probability theorem (4-58), it follows that

$$f_x(x) = f_x(x \mid \mathcal{A}_1)P(\mathcal{A}_1) + \cdots + f_x(x \mid \mathcal{A}_m)P(\mathcal{A}_m)$$

And since $P(\mathcal{A}_k) = p_k$, (8-152) results. Comparing with (8-150), we conclude that the density $f_x(x)$ in (8-152) equals the given function $f(x)$.

Example 8-24. The Laplace density $0.5e^{-|x|}$ can be written as a sum

$$f(x) = 0.5e^{-x}U(x) + 0.5e^{x}U(-x)$$

This is a special case of (8-150) with

$$f_1(x) = e^{-x}U(x) \qquad f_2(x) = e^{x}U(-x) \qquad p_1 = p_2 = 0.5$$

A sequence $x_i$ with density $f(x)$ can, therefore, be realized in terms of the samples of two RVs $\mathbf{x}^1$ and $\mathbf{x}^2$ with the above densities. As we have shown in Example 8-19, if the RV $\mathbf{v}$ is uniform in the interval (0, 1), then the density of the RV $\mathbf{x}^1 = -\ln \mathbf{v}$ equals $f_1(x)$; similarly, the density of the RV $\mathbf{x}^2 = \ln \mathbf{v}$ equals $f_2(x)$. This yields the following rule for generating an RN sequence $x_i$ with Laplace distribution: Form two independent uniform sequences $u_i$ and $v_i$:

Set $x_i = -\ln v_i$ if $0 \le u_i < 0.5$
Set $x_i = \ln v_i$ if $0.5 \le u_i < 1$

GENERAL TRANSFORMATIONS. We now give various examples for generating an RN sequence $w_i$ with specified distribution $F_w(w)$ using the transformation

$$\mathbf{w} = g(\mathbf{x}^1, \ldots, \mathbf{x}^m)$$

where $\mathbf{x}^k$ are $m$ RVs with known distributions. To do so, we determine $g$ such that the distribution of $\mathbf{w}$ equals $F_w(w)$. The desired sequence is given by

$$w_i = g(x_i^1, \ldots, x_i^m)$$

Binomial RNs. If $\mathbf{x}^k$ are $m$ i.i.d. RVs taking the values 0 and 1 with probabilities $p$ and $q$ respectively, their sum has a binomial distribution.
From this it follows that if $x_i^k$ are $m$ binary sequences, their sum $w_i = x_i^1 + \cdots + x_i^m$
is an RN sequence with binomial distribution. The $m$ sequences $x_i^k$ can be realized as subsequences of a single binary sequence $x_i$ as in (8-140).

Erlang RNs. The sum $\mathbf{w} = \mathbf{x}^1 + \cdots + \mathbf{x}^m$ of $m$ i.i.d. RVs $\mathbf{x}^k$ with density $ce^{-cx}U(x)$ has an Erlang density [see (4-38)]:

$$f_w(w) \sim w^{m-1}e^{-cw}U(w) \tag{8-153}$$

From this it follows that the sum $w_i = x_i^1 + \cdots + x_i^m$ of $m$ exponentially distributed RN sequences $x_i^k$ is an RN sequence with Erlang distribution. The sequences $x_i^k$ can be generated in terms of $m$ subsequences $u_i^k$ of a single sequence $u_i$ (see Example 8-19):

$$w_i = -\frac{1}{c}\left(\ln u_i^1 + \cdots + \ln u_i^m\right) \tag{8-154}$$

Chi-square RNs. We wish to generate an RN sequence $w_i$ with density

$$f_w(w) \sim w^{n/2-1}e^{-w/2}U(w)$$

For $n = 2m$, this is a special case of (8-153) with $c = 1/2$. Hence $w_i$ is given by (8-154). To find $w_i$ for $n = 2m + 1$, we observe that if $\mathbf{y}$ is $\chi^2(2m)$ and $\mathbf{z}$ is $N(0, 1)$ and independent of $\mathbf{y}$, the sum $\mathbf{w} = \mathbf{y} + \mathbf{z}^2$ is $\chi^2(2m + 1)$ [see (8-63)]; hence the sequence

$$w_i = -2\left(\ln u_i^1 + \cdots + \ln u_i^m\right) + (z_i)^2$$

has a $\chi^2(2m + 1)$ distribution.

Student-t RNs. Given two independent RVs $\mathbf{x}$ and $\mathbf{y}$ with distributions $N(0, 1)$ and $\chi^2(n)$ respectively, we form the RV $\mathbf{w} = \mathbf{x}/\sqrt{\mathbf{y}/n}$. As we know, $\mathbf{w}$ has a $t(n)$ distribution (see Example 6-15). From this it follows that, if $x_i$ and $y_i$ are samples of $\mathbf{x}$ and $\mathbf{y}$, the sequence

$$w_i = \frac{x_i}{\sqrt{y_i/n}}$$

has a $t(n)$ distribution.

Lognormal RNs. If $\mathbf{z}$ is $N(0, 1)$ and $\mathbf{w} = e^{a+b\mathbf{z}}$, then $\mathbf{w}$ has a lognormal distribution [see (5-10)]:

$$f_w(w) = \frac{1}{bw\sqrt{2\pi}}\exp\left\{-\frac{(\ln w - a)^2}{2b^2}\right\}$$

Hence, if $z_i$ is an $N(0, 1)$ sequence, the sequence $w_i = e^{a+bz_i}$ has a lognormal distribution.
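The transformations above translate almost line by line into code. The sketch below is an illustration under assumptions: the function names are hypothetical, uniform samples come from Python's `random` module, and the $N(0, 1)$ samples are taken from the standard library's `gauss` generator rather than from one of the normal generators discussed in the text.

```python
import math
import random

rng = random.Random(2)  # fixed seed, only so that the run is repeatable

def uniform():
    """Uniform sample in (0, 1]; excludes 0 so that log() is always defined."""
    return 1.0 - rng.random()

def erlang_rn(m, c):
    """(8-154): w = -(1/c) * (ln u^1 + ... + ln u^m)."""
    return -sum(math.log(uniform()) for _ in range(m)) / c

def chi_square_rn(n):
    """Chi-square with n degrees of freedom: (8-154) with c = 1/2 for the even
    part, plus one squared N(0,1) sample when n is odd."""
    m, odd = divmod(n, 2)
    w = erlang_rn(m, 0.5) if m > 0 else 0.0
    if odd:
        w += rng.gauss(0.0, 1.0) ** 2  # N(0,1) sample from the standard library (assumption)
    return w

def student_t_rn(n):
    """w = x / sqrt(y / n) with x ~ N(0,1) and y ~ chi-square(n)."""
    return rng.gauss(0.0, 1.0) / math.sqrt(chi_square_rn(n) / n)

def lognormal_rn(a, b):
    """w = exp(a + b z) with z ~ N(0,1)."""
    return math.exp(a + b * rng.gauss(0.0, 1.0))

N = 40000
erlang_mean = sum(erlang_rn(3, 2.0) for _ in range(N)) / N  # theory: m/c = 1.5
chi2_mean = sum(chi_square_rn(5) for _ in range(N)) / N     # theory: n = 5
t_mean = sum(student_t_rn(10) for _ in range(N)) / N        # theory: 0
```

Only sample means are compared with theory here; a fuller check would compare the empirical distribution of each sequence with the target distribution.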
RN sequences with normal distributions. Several methods are available for generating normal RVs. We give next various illustrations. The percentile transformation method is not used because of the difficulty of inverting the normal distribution. The mixing method is used extensively because the normal density is a smooth curve; it can, therefore, be approximated by a sum as in (8-150). The major components (unshaded) of this sum are rectangles (Fig. 8-13) that can be realized by sequences of the form $au_i + b$. The remaining components (shaded) are more complicated; however, since their areas are small, they need not be realized exactly.

Other methods involve known properties of normal RVs. For example, the central limit theorem leads to the following method. Given $m$ independent RVs $\mathbf{u}^k$, we form the sum

$$\mathbf{z} = \mathbf{u}^1 + \cdots + \mathbf{u}^m$$

If $m$ is large, the RV $\mathbf{z}$ is approximately normal [see (8-111)]. From this it follows that if $u_i^k$ are $m$ independent RN sequences, their sum

$$z_i = u_i^1 + \cdots + u_i^m$$

is approximately a normal RN sequence. This method is not very efficient. The following three methods are more efficient and are used extensively.

Rejection and mixing (G. Marsaglia). In Example 8-23, we used the rejection method to generate an RN sequence $y_i$ with a truncated normal density

$$f_y(y) = \frac{2}{\sqrt{2\pi}}\,e^{-y^2/2}\,U(y)$$

The normal density can be written as a sum

$$f_z(z) = \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2} = \frac{1}{2}f_y(z) + \frac{1}{2}f_y(-z) \tag{8-155}$$

The density $f_y(y)$ is realized by the sequence $y_i$ as in Example 8-23 and the density $f_y(-y)$ by the sequence $-y_i$. Applying (8-151), we conclude that the
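following rule generates an $N(0, 1)$ sequence $z_i$:

$$\text{Set } z_i = y_i \ \text{ if } \ 0 \le u_i < 0.5 \qquad \text{Set } z_i = -y_i \ \text{ if } \ 0.5 \le u_i < 1 \tag{8-156}$$

Polar coordinates. We have shown that, if the RVs $\mathbf{r}$ and $\boldsymbol{\varphi}$ are independent, $\mathbf{r}$ has the Rayleigh density $f_r(r) = re^{-r^2/2}U(r)$, and $\boldsymbol{\varphi}$ is uniform in the interval $(-\pi, \pi)$, then (see Example 6-12) the RVs

$$\mathbf{z} = \mathbf{r}\cos\boldsymbol{\varphi} \qquad \mathbf{w} = \mathbf{r}\sin\boldsymbol{\varphi} \tag{8-157}$$

are $N(0, 1)$ and independent. Using this, we shall construct two independent normal RN sequences $z_i$ and $w_i$ as follows: Clearly, $\boldsymbol{\varphi} = \pi(2\mathbf{u} - 1)$, where $\mathbf{u}$ is uniform in the interval (0, 1); hence $\varphi_i = \pi(2u_i - 1)$. As we know, $\mathbf{r} = \sqrt{2\mathbf{x}} = \sqrt{-2\ln\mathbf{v}}$ where $\mathbf{x}$ is an RV with exponential distribution and $\mathbf{v}$ is uniform in the interval (0, 1). Denoting by $x_i$ and $v_i$ the samples of the RVs $\mathbf{x}$ and $\mathbf{v}$, we conclude that $r_i = \sqrt{2x_i} = \sqrt{-2\ln v_i}$ is an RN sequence with Rayleigh distribution. From this and (8-157) it follows that if $u_i$ and $v_i$ are two independent RN sequences uniform in the interval (0, 1), then the sequences

$$z_i = \sqrt{-2\ln v_i}\,\cos\pi(2u_i - 1) \qquad w_i = \sqrt{-2\ln v_i}\,\sin\pi(2u_i - 1) \tag{8-158}$$

are $N(0, 1)$ and independent.

The Box-Muller method. The rejection method was based on the following: If $x_i$ is an RN sequence with distribution $F(x)$, its subsequence $y_i$ conditioned on an event $\mathcal{M}$ is an RN sequence with distribution $F(x \mid \mathcal{M})$. Using this, we shall generate two independent $N(0, 1)$ sequences $z_i$ and $w_i$ in terms of the samples $x_i$, $y_i$ of two independent RVs $\mathbf{x}$, $\mathbf{y}$ uniformly distributed in the interval $(-1, 1)$. We shall use for $\mathcal{M}$ the event

$$\mathcal{M} = \{\mathbf{q} \le 1\} \qquad \mathbf{q} = \sqrt{\mathbf{x}^2 + \mathbf{y}^2}$$

The joint density of $\mathbf{x}$ and $\mathbf{y}$ equals 1/4 in the square $|x| < 1$, $|y| < 1$ of Fig. 8-14 and 0 elsewhere. Hence

$$P(\mathcal{M}) = \frac{\pi}{4} \qquad P\{\mathbf{q} \le q\} = \frac{\pi q^2}{4} \quad \text{for } q < 1$$

But $\{\mathbf{q} \le q,\ \mathcal{M}\} = \{\mathbf{q} \le q\}$ for $q < 1$ because $\{\mathbf{q} \le q\}$ is a subset of $\mathcal{M}$. Hence

$$F_q(q \mid \mathcal{M}) = \frac{P\{\mathbf{q} \le q,\ \mathcal{M}\}}{P(\mathcal{M})} = q^2 \qquad f_q(q \mid \mathcal{M}) = 2q \qquad 0 \le q < 1 \tag{8-159}$$

Writing the RVs $\mathbf{x}$ and $\mathbf{y}$ in polar form:

$$\mathbf{x} = \mathbf{q}\cos\boldsymbol{\varphi} \qquad \mathbf{y} = \mathbf{q}\sin\boldsymbol{\varphi} \qquad \tan\boldsymbol{\varphi} = \mathbf{y}/\mathbf{x} \tag{8-160}$$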
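we conclude as in (8-159) that the joint density of the RVs $\mathbf{q}$ and $\boldsymbol{\varphi}$ is such that

$$f_{q\varphi}(q, \varphi \mid \mathcal{M})\,dq\,d\varphi = \frac{P\{q \le \mathbf{q} < q + dq,\ \varphi \le \boldsymbol{\varphi} < \varphi + d\varphi,\ \mathcal{M}\}}{P(\mathcal{M})} = \frac{q\,dq\,d\varphi/4}{\pi/4}$$

for $0 \le q < 1$ and $|\varphi| < \pi$. From this it follows that the RVs $\mathbf{q}$ and $\boldsymbol{\varphi}$ are conditionally independent and

$$f_q(q \mid \mathcal{M}) = 2q \qquad f_\varphi(\varphi \mid \mathcal{M}) = \frac{1}{2\pi} \qquad 0 \le q \le 1 \quad -\pi < \varphi < \pi$$

[Figure 8-14]

THEOREM. If $\mathbf{x}$ and $\mathbf{y}$ are two independent RVs uniformly distributed in the interval $(-1, 1)$ and $\mathbf{q} = \sqrt{\mathbf{x}^2 + \mathbf{y}^2}$, then the RVs

$$\mathbf{z} = \frac{\mathbf{x}}{\mathbf{q}}\sqrt{-4\ln\mathbf{q}} \qquad \mathbf{w} = \frac{\mathbf{y}}{\mathbf{q}}\sqrt{-4\ln\mathbf{q}} \tag{8-161}$$

are conditionally $N(0, 1)$ and independent:

$$f_{zw}(z, w \mid \mathcal{M}) = f_z(z \mid \mathcal{M})\,f_w(w \mid \mathcal{M}) = \frac{1}{2\pi}\,e^{-(z^2+w^2)/2}$$

Proof. From (8-160) it follows that

$$\mathbf{z} = \sqrt{-4\ln\mathbf{q}}\,\cos\boldsymbol{\varphi} \qquad \mathbf{w} = \sqrt{-4\ln\mathbf{q}}\,\sin\boldsymbol{\varphi}$$

This system is similar to the system (8-157). To prove the theorem, it suffices, therefore, to show that the conditional density of the RV $\mathbf{r} = \sqrt{-4\ln\mathbf{q}}$ assuming $\mathcal{M}$ equals $re^{-r^2/2}$. To show this, we apply (5-5). In our case,

$$q(r) = e^{-r^2/4} \qquad q'(r) = -\frac{r}{2}\,e^{-r^2/4} = \frac{1}{r'(q)}$$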
Hence

$$f_r(r \mid \mathcal{M}) = f_q(q \mid \mathcal{M})\,|q'(r)| = 2e^{-r^2/4}\cdot\frac{r}{2}\,e^{-r^2/4} = re^{-r^2/2}$$

This shows that the conditional density of the RV $\mathbf{r}$ is Rayleigh as in (8-157).

The preceding theorem leads to the following rule for generating the sequences $z_i$ and $w_i$: Form two independent sequences $x_i = 2u_i - 1$, $y_i = 2v_i - 1$.

If $q_i = \sqrt{x_i^2 + y_i^2} < 1$, set

$$z_i = \frac{x_i}{q_i}\sqrt{-4\ln q_i} \qquad w_i = \frac{y_i}{q_i}\sqrt{-4\ln q_i}$$

Reject $(x_i, y_i)$ otherwise.

COMPUTERS AND STATISTICS. In this section, we analyzed the dual meaning of random numbers and their computer generation. We conclude with a brief outline of the general areas of interaction between computers and statistics:

1. Statistical methods are used to solve numerically a variety of deterministic problems. Examples include the following: evaluation of integrals; solution of differential equations; determination of various mathematical constants. The solutions are based on the availability of RN sequences. Such sequences can be obtained from random experiments; in most cases, however, they are computer generated. We shall give a simple illustration of the two approaches in the context of Buffon's needle. The objective in this problem is the statistical estimation of the number $\pi$. The method proposed in Example 6-4 involves the performance of a physical experiment. We introduce the event $\mathcal{A} = \{\mathbf{x} < a\cos\boldsymbol{\theta}\}$ where $\mathbf{x}$ (distance from the nearest line) and $\boldsymbol{\theta}$ (angle of the needle) are two independent RVs uniform in the intervals $(0, b)$ and $(0, \pi/2)$, respectively. This event occurs if the needle intersects one of the lines and its probability equals $2a/\pi b$. From this it follows that

$$\frac{n_{\mathcal{A}}}{n} \simeq \frac{2a}{\pi b} \qquad \pi \simeq \frac{2a}{b}\,\frac{n}{n_{\mathcal{A}}} \tag{8-162}$$

where $n_{\mathcal{A}}$ is the number of intersections in $n$ trials. The above estimate can be obtained without experimentation. We form two independent RN sequences $x_i$ and $\theta_i$ with distributions $F_x(x)$ and $F_\theta(\theta)$, respectively, and we denote by $n_{\mathcal{A}}$ the number of times $x_i < a\cos\theta_i$.
With $n_{\mathcal{A}}$ so determined, the computer generated estimate of $\pi$ is obtained from (8-162).

2. Computers are used to solve a variety of deterministic problems originating in statistics. Examples include the following: evaluation of the mean, the variance, or other averages used in parameter estimation and hypothesis testing; classification and storage of experimental data; use of computers as instructional tools, for example, graphical demonstration of the law of large numbers or the central limit theorem. Such applications involve mostly routine computer
programs unrelated to statistics. There is, however, another class of deterministic problems the solution of which is based on statistical concepts and RN sequences. A simple illustration follows: We are given $m$ RVs $\mathbf{x}_1, \ldots, \mathbf{x}_m$ with known distributions and we wish to estimate the distribution of the RV $\mathbf{y} = g(\mathbf{x}_1, \ldots, \mathbf{x}_m)$. This problem can, in principle, be solved analytically; however, its solution is, in general, complex. See, for example, the problem of determining the exact distribution of the RV $\mathbf{q}$ used in the chi-square test (9-76). As we explain next, the determination of $F_y(y)$ is simplified if we use Monte Carlo techniques. Assuming for simplicity that $m = 1$, we generate an RN sequence $x_i$ of length $n$ with distribution the known function $F_x(x)$ and we form the RN sequence $y_i = g(x_i)$. To determine $F_y(y)$ for a specific $y$, we count the number $n_y$ of samples $y_i$ such that $y_i \le y$. Inserting into (4-3), we obtain the estimate

$$F_y(y) \simeq \frac{n_y}{n} \tag{8-163}$$

A similar approach can be used to determine the $u$ percentile $x_u$ of $\mathbf{x}$ or to decide whether $x_u$ is larger or smaller than a given number (see hypothesis testing, Sec. 9-2).

3. Computers are used to simulate random experiments or to verify a scientific theory. This involves the familiar methods of simulating physical systems where now all inputs and responses are replaced by appropriate RN sequences.

PROBLEMS

8-1. Show that if $F(x, y, z)$ is a joint distribution, then for any $x_1 \le x_2$, $y_1 \le y_2$, $z_1 \le z_2$:

$$F(x_2, y_2, z_2) + F(x_1, y_1, z_2) + F(x_1, y_2, z_1) + F(x_2, y_1, z_1) - F(x_1, y_2, z_2) - F(x_2, y_1, z_2) - F(x_2, y_2, z_1) - F(x_1, y_1, z_1) \ge 0$$

8-2. The events $\mathcal{A}$, $\mathcal{B}$, $\mathcal{C}$ are such that

$$P(\mathcal{A}) = P(\mathcal{B}) = P(\mathcal{C}) = 0.5 \qquad P(\mathcal{AB}) = P(\mathcal{AC}) = P(\mathcal{BC}) = P(\mathcal{ABC}) = 0.25$$

Show that the zero-one RVs associated with these events are not independent; they are, however, independent in pairs.

8-3. Show that if the RVs $\mathbf{x}$, $\mathbf{y}$, $\mathbf{z}$ are jointly normal and independent in pairs, they are independent.

8-4. The RVs $\mathbf{x}_i$ are i.i.d. and uniform in the interval $(-0.5, 0.5)$. Show that

$$E\{(\mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3)^4\} = \frac{13}{80}$$
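8-5. (a) Reasoning as in (6-34), show that if the RVs $\mathbf{x}$, $\mathbf{y}$, $\mathbf{z}$ are independent and their joint density has spherical symmetry:

$$f(x, y, z) = f\left(\sqrt{x^2 + y^2 + z^2}\right)$$

then they are normal with zero mean and equal variance.
(b) The components $\mathbf{v}_x$, $\mathbf{v}_y$, $\mathbf{v}_z$ of the velocity $\mathbf{v} = \sqrt{\mathbf{v}_x^2 + \mathbf{v}_y^2 + \mathbf{v}_z^2}$ of a particle are independent RVs with zero mean and variance $kT/m$. Furthermore, their joint density has spherical symmetry. Show that $\mathbf{v}$ has a Maxwell density and

$$E\{\mathbf{v}\} = 2\sqrt{\frac{2kT}{\pi m}} \qquad E\{\mathbf{v}^2\} = \frac{3kT}{m} \qquad E\{\mathbf{v}^4\} = \frac{15k^2T^2}{m^2}$$

8-6. Show that if the RVs $\mathbf{x}$, $\mathbf{y}$, $\mathbf{z}$ are such that $r_{xy} = r_{yz} = 1$, then $r_{xz} = 1$.

8-7. Show that

$$E\{\mathbf{x}_1\mathbf{x}_2 \mid \mathbf{x}_3\} = E\{E\{\mathbf{x}_1\mathbf{x}_2 \mid \mathbf{x}_2, \mathbf{x}_3\} \mid \mathbf{x}_3\} = E\{\mathbf{x}_2 E\{\mathbf{x}_1 \mid \mathbf{x}_2, \mathbf{x}_3\} \mid \mathbf{x}_3\}$$

8-8. Show that $\hat{E}\{\mathbf{y} \mid \mathbf{x}_1\} = \hat{E}\{\hat{E}\{\mathbf{y} \mid \mathbf{x}_1, \mathbf{x}_2\} \mid \mathbf{x}_1\}$ where $\hat{E}\{\mathbf{y} \mid \mathbf{x}_1, \mathbf{x}_2\} = a_1\mathbf{x}_1 + a_2\mathbf{x}_2$ is the linear MS estimate of $\mathbf{y}$ in terms of $\mathbf{x}_1$ and $\mathbf{x}_2$.

8-9. Show that if

$$\mathbf{x}_i \ge 0 \qquad E\{\mathbf{x}_i^2\} = M \qquad \text{and} \qquad \mathbf{s} = \sum_{i=1}^{\mathbf{n}} \mathbf{x}_i$$

then $E\{\mathbf{s}^2\} \le ME\{\mathbf{n}^2\}$.

8-10. We denote by $\mathbf{x}_m$ an RV equal to the number of tosses of a coin until heads shows for the $m$th time. Show that if $P\{h\} = p$, then $E\{\mathbf{x}_m\} = m/p$.
Hint: $E\{\mathbf{x}_m - \mathbf{x}_{m-1}\} = E\{\mathbf{x}_1\} = p + 2pq + \cdots + npq^{n-1} + \cdots = 1/p$.

8-11. The number of daily accidents is a Poisson RV $\mathbf{n}$ with parameter $a$. The probability that a single accident is fatal equals $p$. Show that the number $\mathbf{m}$ of fatal accidents in one day is a Poisson RV with parameter $ap$.
Hint:

$$E\{e^{j\omega\mathbf{m}} \mid \mathbf{n} = n\} = \sum_{k=0}^{n} e^{j\omega k}\binom{n}{k}p^k q^{n-k} = (pe^{j\omega} + q)^n$$

8-12. The RVs $\mathbf{x}_k$ are independent with densities $f_k(x)$ and the RV $\mathbf{n}$ is independent of $\mathbf{x}_k$ with $P\{\mathbf{n} = k\} = p_k$. Show that if

$$\mathbf{s} = \sum_{k=1}^{\mathbf{n}} \mathbf{x}_k \qquad \text{then} \qquad f_s(s) = \sum_{k=1}^{\infty} p_k\,[f_1(s) * \cdots * f_k(s)]$$

8-13. The RVs $\mathbf{x}_i$ are i.i.d. with moment function $\Phi_x(s) = E\{e^{s\mathbf{x}_i}\}$. The RV $\mathbf{n}$ takes the values 0, 1, ... and its moment function equals $\Gamma_n(z) = E\{z^{\mathbf{n}}\}$. Show that if

$$\mathbf{y} = \sum_{i=1}^{\mathbf{n}} \mathbf{x}_i \qquad \text{then} \qquad \Phi_y(s) = E\{e^{s\mathbf{y}}\} = \Gamma_n[\Phi_x(s)]$$

Hint: $E\{e^{s\mathbf{y}} \mid \mathbf{n} = k\} = E\{e^{s(\mathbf{x}_1 + \cdots + \mathbf{x}_k)}\} = \Phi_x^k(s)$.
Special case: If $\mathbf{n}$ is Poisson with parameter $a$, then $\Phi_y(s) = e^{a\Phi_x(s) - a}$.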
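8-14. The RVs $\mathbf{x}_i$ are i.i.d. and uniform in the interval (0, 1). Show that if $\mathbf{y} = \max \mathbf{x}_i$, then $F(y) = y^n$ for $0 \le y \le 1$.

8-15. Given an RV $\mathbf{x}$ with distribution $F_x(x)$, we form its order statistics $\mathbf{y}_k$ as in Example 8-2, and their extremes

$$\mathbf{z} = \mathbf{y}_n = \mathbf{x}_{\max} \qquad \mathbf{w} = \mathbf{y}_1 = \mathbf{x}_{\min}$$

Show that

$$f_{zw}(z, w) = \begin{cases} n(n - 1) f_x(z) f_x(w)\,[F_x(z) - F_x(w)]^{n-2} & z > w \\ 0 & z < w \end{cases}$$

8-16. Given $n$ independent $N(\eta_i, 1)$ RVs $\mathbf{z}_i$, we form the RV $\mathbf{w} = \mathbf{z}_1^2 + \cdots + \mathbf{z}_n^2$. This RV is called noncentral chi-square with $n$ degrees of freedom and eccentricity $e = \eta_1^2 + \cdots + \eta_n^2$. Show that its moment generating function equals

$$\Phi_w(s) = \frac{1}{\sqrt{(1 - 2s)^n}}\exp\left\{\frac{es}{1 - 2s}\right\}$$

8-17. Show that if the RVs $\mathbf{x}_i$ are i.i.d. and normal, then their sample mean $\bar{\mathbf{x}}$ and sample variance $\mathbf{s}^2$ are two independent RVs.

8-18. Show that, if $a_0 + a_1\mathbf{x}_1 + a_2\mathbf{x}_2$ is the nonhomogeneous linear MS estimate of $\mathbf{s}$ in terms of $\mathbf{x}_1$ and $\mathbf{x}_2$, then

$$\hat{E}\{\mathbf{s} - \eta_s \mid \mathbf{x}_1 - \eta_1, \mathbf{x}_2 - \eta_2\} = a_1(\mathbf{x}_1 - \eta_1) + a_2(\mathbf{x}_2 - \eta_2)$$

8-19. Show that

$$E\{\mathbf{y} \mid \mathbf{x}_1\} = E\{E\{\mathbf{y} \mid \mathbf{x}_1, \mathbf{x}_2\} \mid \mathbf{x}_1\}$$

8-20. We place at random $n$ points in the interval (0, 1) and we denote by $\mathbf{x}$ and $\mathbf{y}$ the distance from the origin to the first and last point respectively. Find $F(x)$, $F(y)$, and $F(x, y)$.

8-21. Show that if the RVs $\mathbf{x}_i$ are i.i.d. with zero mean, variance $\sigma^2$, and sample variance $\mathbf{v}$ (see Example 8-5), then

$$\sigma_v^2 = \frac{1}{n}\left(E\{\mathbf{x}_i^4\} - \frac{n - 3}{n - 1}\,\sigma^4\right)$$

8-22. The RVs $\mathbf{x}_i$ are $N(0; \sigma)$ and independent. Using Prob. 7-1, show that if

$$\mathbf{z} = \frac{\sqrt{\pi}}{2n}\sum_{i=1}^{n} |\mathbf{x}_{2i} - \mathbf{x}_{2i-1}| \qquad \text{then} \qquad E\{\mathbf{z}\} = \sigma \qquad \sigma_z^2 = \frac{\pi - 2}{2n}\,\sigma^2$$

8-23. Show that if $R$ is the correlation matrix of the random vector $\mathbf{X}$: $[\mathbf{x}_1, \ldots, \mathbf{x}_n]$ and $R^{-1}$ is its inverse, then

$$E\{\mathbf{X}R^{-1}\mathbf{X}^t\} = n$$

8-24. Show that if the RVs $\mathbf{x}_i$ are of continuous type and independent, then, for sufficiently large $n$, the density of $\sin(\mathbf{x}_1 + \cdots + \mathbf{x}_n)$ is nearly equal to the density of $\sin\mathbf{x}$ where $\mathbf{x}$ is an RV uniform in the interval $(-\pi, \pi)$.

8-25. Show that if $a_n \to a$ and $E\{|\mathbf{x}_n - a_n|^2\} \to 0$, then $\mathbf{x}_n \to a$ in the MS sense as $n \to \infty$.

8-26. Using the Cauchy criterion, show that a sequence $\mathbf{x}_n$ tends to a limit in the MS sense iff the limit of $E\{\mathbf{x}_n\mathbf{x}_m\}$ as $n, m \to \infty$ exists.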
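8-27. An infinite sum is by definition a limit:

$$\sum_{k=1}^{\infty} \mathbf{x}_k = \lim_{n\to\infty} \mathbf{y}_n \qquad \mathbf{y}_n = \sum_{k=1}^{n} \mathbf{x}_k$$

Show that if the RVs $\mathbf{x}_k$ are independent with zero mean and variance $\sigma_k^2$, then the sum exists in the MS sense iff

$$\sum_{k=1}^{\infty} \sigma_k^2 < \infty$$

Hint:

$$E\{(\mathbf{y}_{n+m} - \mathbf{y}_n)^2\} = \sum_{k=n+1}^{n+m} \sigma_k^2$$

8-28. The RVs $\mathbf{x}_i$ are i.i.d. with density $ce^{-cx}U(x)$. Show that, if $\mathbf{x} = \mathbf{x}_1 + \cdots + \mathbf{x}_n$, then $f_x(x)$ is an Erlang density.

8-29. Using the central limit theorem, show that for large $n$:

$$\frac{c^n}{(n - 1)!}\,x^{n-1}e^{-cx} \simeq \frac{c}{\sqrt{2\pi n}}\,e^{-(cx - n)^2/2n} \qquad x > 0$$

8-30. The resistors $\mathbf{r}_1$, $\mathbf{r}_2$, $\mathbf{r}_3$, $\mathbf{r}_4$ are independent RVs and each is uniform in the interval (450; 550). Using the central limit theorem, find $P\{1900 \le \mathbf{r}_1 + \mathbf{r}_2 + \mathbf{r}_3 + \mathbf{r}_4 \le 2100\}$.

8-31. Show that the central limit theorem does not hold if the RVs $\mathbf{x}_i$ have a Cauchy density.

8-32. The RVs $\mathbf{x}$ and $\mathbf{y}$ are uncorrelated with zero mean and $\sigma_x = \sigma_y = \sigma$. Show that if $\mathbf{z} = \mathbf{x} + j\mathbf{y}$, then

$$f_z(z) = f(x, y) = \frac{1}{2\pi\sigma^2}\,e^{-(x^2 + y^2)/2\sigma^2} = \frac{1}{\pi\sigma_z^2}\,e^{-|z|^2/\sigma_z^2}$$

$$\Phi_z(\Omega) = \exp\left\{-\frac{1}{2}\left(\sigma^2u^2 + \sigma^2v^2\right)\right\} = \exp\left\{-\frac{1}{4}\,\sigma_z^2|\Omega|^2\right\}$$

where $\Omega = u + jv$. This is the scalar form of (8-62).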
CHAPTER 9
STATISTICS

9-1 INTRODUCTION

Probability is a mathematical discipline developed as an abstract model and its conclusions are deductions based on the axioms. Statistics deals with the applications of the theory to real problems and its conclusions are inferences based on observations. Statistics consists of two parts: analysis and design.

Analysis, or mathematical statistics, is part of probability involving mainly repeated trials and events the probability of which is close to 0 or to 1. This leads to inferences that can be accepted as near certainties (see page 12). Design, or applied statistics, deals with data collection and construction of experiments that can be adequately described by probabilistic models. In this chapter, we introduce the basic elements of mathematical statistics.

We start with the observation that the connection between probabilistic concepts and reality is based on the approximation

$$p \simeq \frac{n_{\mathcal{A}}}{n} \tag{9-1}$$

relating the probability $p = P(\mathcal{A})$ of an event $\mathcal{A}$ to the number $n_{\mathcal{A}}$ of successes of $\mathcal{A}$ in $n$ trials of the underlying physical experiment. We used this empirical formula to give the relative frequency interpretation of all probabilistic concepts. For example, we showed that the mean $\eta$ of an RV $\mathbf{x}$ can be approximated by the average

$$\bar{x} = \frac{1}{n}\sum x_i \tag{9-2}$$

of the observed values $x_i$ of $\mathbf{x}$, and its distribution $F(x)$ by the empirical
distribution

$$\hat{F}(x) = \frac{n_x}{n} \tag{9-3}$$

where $n_x$ is the number of $x_i$'s that do not exceed $x$. These relationships are empirical point estimates of the parameters $\eta$ and $F(x)$ and a major objective of statistics is to give them an exact interpretation.

[Figure 9-1: (a) Predict $x$; (b) Estimate $\theta$]

In a statistical investigation, we deal with two general classes of problems. In the first class, we assume that the probabilistic model is known and we wish to make predictions concerning future observations. For example, we know the distribution $F(x)$ of an RV $\mathbf{x}$ and we wish to predict the average $\bar{x}$ of its $n$ future samples, or we know the probability $p$ of an event $\mathcal{A}$ and we wish to predict the number $n_{\mathcal{A}}$ of successes of $\mathcal{A}$ in $n$ future trials. In both cases, we proceed from the model to the observations (Fig. 9-1a). In the second class, one or more parameters $\theta_i$ of the model are unknown and our objective is either to estimate their values (parameter estimation) or to decide whether $\theta_i$ is a set of known constants $\theta_{0i}$ (hypothesis testing). For example, we observe the values $x_i$ of an RV $\mathbf{x}$ and we wish to estimate its mean $\eta$ or to decide whether to accept the hypothesis that $\eta = 5.3$. We toss a coin 1000 times and heads shows 465 times. Using this information, we wish to estimate the probability $p$ of heads or to decide whether the coin is fair. In both cases, we proceed from the observations to the model (Fig. 9-1b).

In this chapter, we concentrate on parameter estimation and hypothesis testing. As a preparation, we comment briefly on the prediction problem.

Prediction. We are given an RV $\mathbf{x}$ with known distribution and we wish to predict its value $x$ at a future trial. A point prediction of $\mathbf{x}$ is the determination of a constant $c$ chosen so as to minimize in some sense the error $\mathbf{x} - c$. At a specific trial, the RV $\mathbf{x}$ can take one of many values. Hence the value that it actually takes cannot be predicted; it can only be estimated.
Thus prediction of an RV $\mathbf{x}$ is the estimation of its next value $x$ by a constant $c$. If we use as the criterion for selecting $c$ the minimization of the MS error $E\{(\mathbf{x} - c)^2\}$, then $c = E\{\mathbf{x}\}$. This problem was considered in Sec. 7-3 and Sec. 8-3.

An interval prediction of $\mathbf{x}$ is the determination of two constants $c_1$ and $c_2$ such that

$$P\{c_1 < \mathbf{x} < c_2\} = \gamma = 1 - \delta \tag{9-4}$$
where $\gamma$ is a given constant called the confidence coefficient. The above equation states that if we predict that the value $x$ of $\mathbf{x}$ at the next trial will be in the interval $(c_1, c_2)$, our prediction will be correct in $100\gamma\%$ of the cases. The problem in interval prediction is to find $c_1$ and $c_2$ so as to minimize the difference $c_2 - c_1$ subject to the constraint (9-4). The selection of $\gamma$ is dictated by two conflicting requirements. If $\gamma$ is close to 1, the prediction that $\mathbf{x}$ will be in the interval $(c_1, c_2)$ is reliable but the difference $c_2 - c_1$ is large; if $\gamma$ is reduced, $c_2 - c_1$ is reduced but the estimate is less reliable. Typical values of $\gamma$ are 0.9, 0.95, and 0.99. For optimum prediction, we assign a value to $\gamma$ and we determine $c_1$ and $c_2$ so as to minimize the difference $c_2 - c_1$ subject to the constraint (9-4). We can show that (see Prob. 9-6) if the density $f(x)$ of $\mathbf{x}$ has a single maximum, $c_2 - c_1$ is minimum if $f(c_1) = f(c_2)$. This yields $c_1$ and $c_2$ by trial and error. A simpler suboptimal solution is easily found if we determine $c_1$ and $c_2$ such that

$$P\{\mathbf{x} < c_1\} = \frac{\delta}{2} \qquad P\{\mathbf{x} > c_2\} = \frac{\delta}{2} \tag{9-5}$$

This yields $c_1 = x_{\delta/2}$ and $c_2 = x_{1-\delta/2}$ where $x_u$ is the $u$ percentile of $\mathbf{x}$ (Fig. 9-2a). This solution is optimum if $f(x)$ is symmetrical about its mean $\eta$ because then $f(c_1) = f(c_2)$. If $\mathbf{x}$ is also normal, then $x_u = \eta + z_u\sigma$ where $z_u$ is the standard normal percentile (Fig. 9-2b).

[Figure 9-2]
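For a normal RV, the suboptimal rule (9-5) reduces to $c_{1,2} = \eta \mp z_{1-\delta/2}\,\sigma$, which can be evaluated directly. The sketch below assumes Python's standard `statistics.NormalDist` for the percentile; the function name `normal_prediction_interval` and the values $\eta = 4$, $\sigma = 0.5$, $\gamma = 0.95$ are illustrative choices.

```python
from statistics import NormalDist

def normal_prediction_interval(eta, sigma, gamma):
    """Return (c1, c2) with P{x < c1} = P{x > c2} = delta/2, delta = 1 - gamma,
    for a normal RV with mean eta and standard deviation sigma."""
    delta = 1.0 - gamma
    z = NormalDist().inv_cdf(1.0 - delta / 2.0)  # standard normal percentile z_{1-delta/2}
    return eta - z * sigma, eta + z * sigma

c1, c2 = normal_prediction_interval(eta=4.0, sigma=0.5, gamma=0.95)

# Check the constraint (9-4): the probability mass between c1 and c2 equals gamma.
model = NormalDist(4.0, 0.5)
coverage = model.cdf(c2) - model.cdf(c1)
```

With the unrounded percentile $z_{0.975} \approx 1.96$ the interval is slightly narrower than the one obtained with the rounded value $z_{0.975} \approx 2$.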
Example 9-1. The life expectancy of batteries of a certain brand is modeled by a normal RV with $\eta = 4$ years and $\sigma = 6$ months. Our car has such a battery. Find the prediction interval of its life expectancy with $\gamma = 0.95$.

In this example, $\delta = 0.05$, $z_{1-\delta/2} = z_{0.975} \simeq 2 = -z_{\delta/2}$. This yields the interval $4 \pm 2 \times 0.5$. We can thus expect with confidence coefficient 0.95 that the life expectancy of our battery will be between 3 and 5 years.

As a second application, we shall estimate the number $n_{\mathcal{A}}$ of successes of an event $\mathcal{A}$ in $n$ trials. The point estimate of $n_{\mathcal{A}}$ is the product $np$. The interval estimate $(k_1, k_2)$ is determined so as to minimize the difference $k_2 - k_1$ subject to the constraint

$$P\{k_1 < n_{\mathcal{A}} < k_2\} = \gamma$$

We shall assume that $n$ is large and $\gamma = 0.997$. To find the constants $k_1$ and $k_2$, we set $k = n_{\mathcal{A}}$ and $\varepsilon = 3\sqrt{pq/n}$ into (3-37). This yields

$$P\{np - 3\sqrt{npq} < n_{\mathcal{A}} < np + 3\sqrt{npq}\} = 0.997 \tag{9-6}$$

because $2G(3) - 1 \simeq 0.997$. Hence we predict with confidence coefficient 0.997 that $n_{\mathcal{A}}$ will be in the interval $np \pm 3\sqrt{npq}$.

Example 9-2. We toss a fair coin 100 times and we wish to estimate the number $n_{\mathcal{A}}$ of heads with $\gamma = 0.997$. In this problem $n = 100$ and $p = 0.5$. Hence

$$k_1 = np - 3\sqrt{npq} = 35 \qquad k_2 = np + 3\sqrt{npq} = 65$$

We predict, therefore, with confidence coefficient 0.997 that the number of heads will be between 35 and 65.

The above example illustrates the role of statistics in the applications of probability to real problems: The event $\mathcal{A} = \{\text{heads}\}$ is defined in the experiment $\mathcal{S}$ of the single toss of a coin. The given information that $P(\mathcal{A}) = 0.5$ cannot be used to make a reliable prediction about the occurrence of $\mathcal{A}$ at a single performance of $\mathcal{S}$. The event

$$\mathcal{B} = \{35 < n_{\mathcal{A}} < 65\}$$

is defined in the experiment $\mathcal{S}_n$ of repeated trials and its probability equals $P(\mathcal{B}) \simeq 0.997$. Since $P(\mathcal{B}) \simeq 1$, we can claim with near certainty that $\mathcal{B}$ will occur at a single performance of the experiment $\mathcal{S}_n$.
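The interval of (9-6) and the claim that $\mathcal{B}$ occurs with near certainty can be checked by simulation. The following is a rough sketch, assuming Python's `random` module as the source of coin tosses; the function name `success_count_interval` and the number of repetitions are arbitrary choices.

```python
import math
import random

def success_count_interval(n, p):
    """Interval np +/- 3*sqrt(npq) of (9-6), confidence coefficient about 0.997."""
    half_width = 3.0 * math.sqrt(n * p * (1.0 - p))
    return n * p - half_width, n * p + half_width

k1, k2 = success_count_interval(100, 0.5)  # Example 9-2

# Simulated check: repeat the 100-toss experiment many times and count how
# often the number of heads falls strictly inside (k1, k2).
rng = random.Random(3)  # fixed seed, only so that the run is repeatable
runs = 20000
inside = 0
for _ in range(runs):
    heads = sum(rng.random() < 0.5 for _ in range(100))
    if k1 < heads < k2:
        inside += 1
fraction_inside = inside / runs
```

The observed fraction should be close to, though not exactly, 0.997, since (9-6) rests on the normal approximation of the binomial distribution.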
We have thus changed the “subjective” knowledge about $\mathcal{A}$ based on the given information that $P(\mathcal{A}) = 0.5$ to the “objective” conclusion that $\mathcal{B}$ will almost certainly occur, based on the derived probability that $P(\mathcal{B}) \simeq 1$. Note, however, that both conclusions are inductive inferences; the difference between them is only quantitative.

9-2 PARAMETER ESTIMATION

Suppose that the distribution of an RV $\mathbf{x}$ is a function $F(x, \theta)$ of known form depending on a parameter $\theta$, scalar or vector. We wish to estimate $\theta$. To do so, we repeat the underlying physical experiment $n$ times and we denote by $x_i$ the
observed values of $\mathbf{x}$. Using these observations, we shall find a point estimate and an interval estimate of $\theta$.

A point estimate is a function $\hat{\theta} = g(X)$ of the observation vector $X = [x_1, \ldots, x_n]$. The corresponding RV $\hat{\boldsymbol{\theta}} = g(\mathbf{X})$ is the point estimator of $\theta$. Any function of the sample vector $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n]$ is called a statistic.† Thus a point estimator is a statistic.

We shall say that $\hat{\boldsymbol{\theta}}$ is an unbiased estimator of the parameter $\theta$ if $E\{\hat{\boldsymbol{\theta}}\} = \theta$. Otherwise, it is called biased with bias $b = E\{\hat{\boldsymbol{\theta}}\} - \theta$. If the function $g(X)$ is properly selected, the estimation error $\hat{\boldsymbol{\theta}} - \theta$ decreases as $n$ increases. If it tends to 0 in probability as $n \to \infty$, then $\hat{\boldsymbol{\theta}}$ is called a consistent estimator. The sample mean $\bar{\mathbf{x}}$ of $\mathbf{x}$ is an unbiased estimator of its mean $\eta$. Furthermore, its variance $\sigma^2/n$ tends to 0 as $n \to \infty$. From this it follows that $\bar{\mathbf{x}}$ tends to $\eta$ in the MS sense, therefore, also in probability. In other words, $\bar{\mathbf{x}}$ is a consistent estimator of $\eta$. Consistency is a desirable property; however, it is a theoretical concept. In reality, the number $n$ of trials might be large but it is finite. The objective of estimation is thus the selection of a function $g(X)$ minimizing in some sense the estimation error $g(\mathbf{X}) - \theta$. If $g(X)$ is chosen so as to minimize the MS error

$$e = E\{[g(\mathbf{X}) - \theta]^2\} = \int [g(X) - \theta]^2 f(X, \theta)\,dX \tag{9-7}$$

then the estimator $\hat{\boldsymbol{\theta}} = g(\mathbf{X})$ is called the best estimator. The determination of best estimators is not, in general, simple because the integrand in (9-7) depends not only on the function $g(X)$ but also on the unknown parameter $\theta$. The corresponding prediction problem involves the same integral but it has a simple solution because in this case, $\theta$ is known (see Sec. 8-3).

In the following, we shall select the function $g(X)$ empirically. In this choice we are guided by the following: Suppose that $\theta$ is the mean $\theta = E\{q(\mathbf{x})\}$ of some function $q(\mathbf{x})$ of $\mathbf{x}$. As we have noted, the sample mean

$$\hat{\boldsymbol{\theta}} = \frac{1}{n}\sum q(\mathbf{x}_i) \tag{9-8}$$

of $q(\mathbf{x})$ is a consistent estimator of $\theta$.
†This interpretation of the term statistic applies only for Chap. 9. In all other chapters, statistics means statistical properties.

If, therefore, we use the sample mean $\hat{\boldsymbol{\theta}}$ of $q(\mathbf{x})$ as the point estimator of $\theta$, our estimate will be satisfactory at least for large $n$. In fact, it turns out that in a number of cases it is the best estimate.

INTERVAL ESTIMATES. We measure the length $\theta$ of an object and the results are the samples $x_i = \theta + \nu_i$ of the RV $\mathbf{x} = \theta + \boldsymbol{\nu}$ where $\boldsymbol{\nu}$ is the measurement error. Can we draw with near certainty a conclusion about the true value of $\theta$? We cannot do so if we claim that $\theta$ equals its point estimate $\hat{\theta}$ or any other
constant. We can, however, conclude with near certainty that $\theta$ equals $\hat{\theta}$ within specified tolerance limits. This leads to the following concept.

An interval estimate of a parameter $\theta$ is an interval $(\theta_1, \theta_2)$, the endpoints of which are functions $\theta_1 = g_1(X)$ and $\theta_2 = g_2(X)$ of the observation vector $X$. The corresponding random interval $(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2)$ is the interval estimator of $\theta$. We shall say that $(\theta_1, \theta_2)$ is a $\gamma$ confidence interval of $\theta$ if

$$P\{\boldsymbol{\theta}_1 < \theta < \boldsymbol{\theta}_2\} = \gamma \tag{9-9}$$

The constant $\gamma$ is the confidence coefficient of the estimate and the difference $\delta = 1 - \gamma$ is the confidence level. Thus $\gamma$ is a subjective measure of our confidence that the unknown $\theta$ is in the interval $(\theta_1, \theta_2)$. If $\gamma$ is close to 1, we can expect with near certainty that this is true. Our estimate is correct in $100\gamma$ percent of the cases. The objective of interval estimation is the determination of the functions $g_1(X)$ and $g_2(X)$ so as to minimize the length $\theta_2 - \theta_1$ of the interval $(\theta_1, \theta_2)$ subject to the constraint (9-9). If $\hat{\boldsymbol{\theta}}$ is an unbiased estimator of the mean $\eta$ of $\mathbf{x}$ and the density of $\mathbf{x}$ is symmetrical about $\eta$, then the optimum interval is of the form $\eta \pm a$ as in (9-10).

In this section, we develop estimates of the commonly used parameters. In the selection of $\hat{\boldsymbol{\theta}}$ we are guided by (9-8) and in all cases we assume that $n$ is large. This assumption is necessary for good estimates and, as we shall see, it simplifies the analysis.

Mean

We wish to estimate the mean $\eta$ of an RV $\mathbf{x}$. We use as the point estimate of $\eta$ the value

$$\bar{x} = \frac{1}{n}\sum x_i$$

of the sample mean $\bar{\mathbf{x}}$ of $\mathbf{x}$. To find an interval estimate, we must determine the distribution of $\bar{\mathbf{x}}$. In general, this is a difficult problem involving multiple convolutions. To simplify it we shall assume that $\bar{\mathbf{x}}$ is normal. This is true if $\mathbf{x}$ is normal and it is approximately true for any $\mathbf{x}$ if $n$ is large (CLT).

Known variance. Suppose first that the variance $\sigma^2$ of $\mathbf{x}$ is known. The normality assumption leads to the conclusion that the point estimator $\bar{\mathbf{x}}$ of $\eta$ is $N(\eta, \sigma/\sqrt{n})$.
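Denoting by $z_u$ the $u$ percentile of the standard normal density, we conclude that

$$P\left\{\eta - z_{1-\delta/2}\frac{\sigma}{\sqrt{n}} < \bar{\mathbf{x}} < \eta + z_{1-\delta/2}\frac{\sigma}{\sqrt{n}}\right\} = G(z_{1-\delta/2}) - G(-z_{1-\delta/2}) = 1 - \frac{\delta}{2} - \frac{\delta}{2} \tag{9-10}$$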
TABLE 9-1

  u     0.90   0.925  0.95   0.975  0.99   0.995  0.999  0.9995
  z_u   1.282  1.440  1.645  1.967  2.326  2.576  3.090  3.291

because $z_u = -z_{1-u}$ and $G(-z_{1-u}) = G(z_u) = u$. This yields

$$P\left\{\bar{\mathbf{x}} - z_{1-\delta/2}\frac{\sigma}{\sqrt{n}} < \eta < \bar{\mathbf{x}} + z_{1-\delta/2}\frac{\sigma}{\sqrt{n}}\right\} = 1 - \delta = \gamma \tag{9-11}$$

We can thus state with confidence coefficient $\gamma$ that $\eta$ is in the interval $\bar{x} \pm z_{1-\delta/2}\,\sigma/\sqrt{n}$. The determination of a confidence interval for $\eta$ thus proceeds as follows:

Observe the samples $x_i$ of $\mathbf{x}$ and form their average $\bar{x}$. Select a number $\gamma = 1 - \delta$ and find the standard normal percentile $z_u$ for $u = 1 - \delta/2$. Form the interval $\bar{x} \pm z_u\sigma/\sqrt{n}$.

This also holds for discrete-type RVs provided that $n$ is large [see (8-110)]. The choice of the confidence coefficient $\gamma$ is dictated by two conflicting requirements: If $\gamma$ is close to 1, the estimate is reliable but the size $2z_u\sigma/\sqrt{n}$ of the confidence interval is large; if $\gamma$ is reduced, $z_u$ is reduced but the estimate is less reliable. The final choice is a compromise based on the applications. In Table 9-1 we list $z_u$ for the commonly used values of $u$. The listed values are determined from Table 3-1 by interpolation.

Tchebycheff inequality. Suppose now that the distribution of $\mathbf{x}$ is not known. To find the confidence interval of $\eta$, we shall use (5-57): We replace $\mathbf{x}$ by $\bar{\mathbf{x}}$ and $\sigma$ by $\sigma/\sqrt{n}$, and we set $\varepsilon = \sigma/\sqrt{n\delta}$. This yields

$$P\left\{\bar{\mathbf{x}} - \frac{\sigma}{\sqrt{n\delta}} < \eta < \bar{\mathbf{x}} + \frac{\sigma}{\sqrt{n\delta}}\right\} > 1 - \delta = \gamma \tag{9-12}$$

The above shows that the exact $\gamma$ confidence interval of $\eta$ is contained in the interval $\bar{x} \pm \sigma/\sqrt{n\delta}$. If, therefore, we claim that $\eta$ is in this interval, the probability that we are correct is larger than $\gamma$. This result holds regardless of the form of $F(x)$ and, surprisingly, it is not very different from the estimate (9-11). Indeed, suppose that $\gamma = 0.95$; in this case, $1/\sqrt{\delta} = 4.47$. Inserting into (9-12), we obtain the interval $\bar{x} \pm 4.47\sigma/\sqrt{n}$. The corresponding interval (9-11), obtained under the normality assumption, is $\bar{x} \pm 2\sigma/\sqrt{n}$ because $z_{0.975} \simeq 2$.

Unknown variance.
If $\sigma$ is unknown, we cannot use (9-11). To estimate $\eta$, we form the sample variance

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \tag{9-13}$$
This is an unbiased estimate of $\sigma^2$ [see (8-23)] and it tends to $\sigma^2$ as $n \to \infty$. Hence, for large $n$, we can use the approximation $s \approx \sigma$ in (9-11). This yields the approximate confidence interval

$$\bar{x} - z_{1-\delta/2}\frac{s}{\sqrt{n}} < \eta < \bar{x} + z_{1-\delta/2}\frac{s}{\sqrt{n}} \tag{9-14}$$

We shall find an exact confidence interval under the assumption that $\mathbf{x}$ is normal. In this case [see (8-65)] the ratio

$$\frac{\bar{x} - \eta}{s/\sqrt{n}} \tag{9-15}$$

has a Student-$t$ distribution with $n - 1$ degrees of freedom. Denoting by $t_u$ its $u$ percentiles, we conclude that

$$P\left\{-t_u < \frac{\bar{x} - \eta}{s/\sqrt{n}} < t_u\right\} = 2u - 1 = \gamma \tag{9-16}$$

This yields the interval

$$\bar{x} - t_{1-\delta/2}\frac{s}{\sqrt{n}} < \eta < \bar{x} + t_{1-\delta/2}\frac{s}{\sqrt{n}} \tag{9-17}$$

In Table 9-2 we list $t_u(n)$ for $n$ from 1 to 20. For $n > 20$, the $t(n)$ distribution is nearly normal with zero mean and variance $n/(n - 2)$ (see Prob. 7-12).

Example 9-3. The voltage $V$ of a voltage source is measured 25 times. The results of the measurement† are the samples $x_i = V + \nu_i$ of the RV $\mathbf{x} = V + \boldsymbol{\nu}$ and their average equals $\bar{x} = 112$ V. Find the 0.95 confidence interval of $V$.

(a) Suppose that the standard deviation of $\mathbf{x}$ due to the error $\boldsymbol{\nu}$ is $\sigma = 0.4$ V. With $\delta = 0.05$, Table 9-1 yields $z_{0.975} \approx 2$. Inserting into (9-11), we obtain the interval

$$\bar{x} \pm z_{0.975}\,\sigma/\sqrt{n} = 112 \pm 2 \times 0.4/\sqrt{25} = 112 \pm 0.16 \text{ V}$$

(b) Suppose now that $\sigma$ is unknown. To estimate it, we compute the sample variance and we find $s^2 = 0.36$. Inserting into (9-14), we obtain the approximate estimate

$$\bar{x} \pm z_{0.975}\,s/\sqrt{n} = 112 \pm 2 \times 0.6/\sqrt{25} = 112 \pm 0.24 \text{ V}$$

Since the ratio (9-15) has $n - 1 = 24$ degrees of freedom and $t_{0.975}(24) = 2.06$, the exact estimate (9-17) yields $112 \pm 0.247$ V.

In the following three estimates the distribution of $\mathbf{x}$ is specified in terms of a single parameter. We cannot, therefore, use (9-11) directly because the constants $\eta$ and $\sigma$ are related.

†In most examples of this chapter, we shall not list all experimental data. To avoid lengthy tables, we shall list only the relevant averages.
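The arithmetic of Example 9-3 can be retraced with a short sketch. It is only an illustration: the helper name `mean_interval` is hypothetical, the standard normal percentile is computed exactly instead of being rounded to 2, and the Student-t percentile 2.06 is simply copied from Table 9-2 (24 degrees of freedom).

```python
from math import sqrt
from statistics import NormalDist


def mean_interval(xbar, scale, n, percentile):
    """Return (low, high) for xbar +/- percentile * scale / sqrt(n)."""
    half = percentile * scale / sqrt(n)
    return xbar - half, xbar + half


n, xbar, delta = 25, 112.0, 0.05

# Exact standard normal percentile z_{1 - delta/2}; the text rounds it to 2.
z = NormalDist().inv_cdf(1 - delta / 2)

# (a) known sigma = 0.4 V, interval (9-11)
known = mean_interval(xbar, 0.4, n, z)

# (b) unknown sigma, s = sqrt(0.36) = 0.6 V, large-n approximation (9-14)
approx = mean_interval(xbar, 0.6, n, z)

# (b) exact interval (9-17); 2.06 is the Table 9-2 value, not computed here
exact = mean_interval(xbar, 0.6, n, 2.06)

print(known, approx, exact)
```

With the unrounded percentile the half-widths in (a) and in the approximate part of (b) come out slightly smaller than the 0.16 V and 0.24 V quoted in the example, which use $z_{0.975} \approx 2$.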
TABLE 9-2
Student-t percentiles $t_u(n)$

 n \ u    .9      .95     .975    .99     .995
  1      3.08    6.31    12.7    31.8    63.7
  2      1.89    2.92    4.30    6.97    9.93
  3      1.64    2.35    3.18    4.54    5.84
  4      1.53    2.13    2.78    3.75    4.60
  5      1.48    2.02    2.57    3.37    4.03
  6      1.44    1.94    2.45    3.14    3.71
  7      1.42    1.90    2.37    3.00    3.50
  8      1.40    1.86    2.31    2.90    3.36
  9      1.38    1.83    2.26    2.82    3.25
 10      1.37    1.81    2.23    2.76    3.17
 11      1.36    1.80    2.20    2.72    3.11
 12      1.36    1.78    2.18    2.68    3.06
 13      1.35    1.77    2.16    2.65    3.01
 14      1.35    1.76    2.15    2.62    2.98
 15      1.34    1.75    2.13    2.60    2.95
 16      1.34    1.75    2.12    2.58    2.92
 17      1.33    1.74    2.11    2.57    2.90
 18      1.33    1.73    2.10    2.55    2.88
 19      1.33    1.73    2.09    2.54    2.86
 20      1.33    1.73    2.09    2.53    2.85
 22      1.32    1.72    2.07    2.51    2.82
 24      1.32    1.71    2.06    2.49    2.80
 26      1.32    1.71    2.06    2.48    2.78
 28      1.31    1.70    2.05    2.47    2.76
 30      1.31    1.70    2.05    2.46    2.75

Exponential distribution. We are given an RV $\mathbf{x}$ with density

$$f(x, \lambda) = \frac{1}{\lambda}e^{-x/\lambda}U(x)$$

and we wish to find the $\gamma$ confidence interval of the parameter $\lambda$. As we know, $\eta = \lambda$ and $\sigma = \lambda$; hence, for large $n$, the sample mean $\bar{x}$ of $\mathbf{x}$ is approximately $N(\lambda, \lambda/\sqrt{n})$. Inserting into (9-11), we obtain

$$P\left\{\lambda - z_u\frac{\lambda}{\sqrt{n}} < \bar{x} < \lambda + z_u\frac{\lambda}{\sqrt{n}}\right\} = \gamma = 2u - 1$$
This yields

$$P\left\{\frac{\bar{x}}{1 + z_u/\sqrt{n}} < \lambda < \frac{\bar{x}}{1 - z_u/\sqrt{n}}\right\} = \gamma \tag{9-18}$$

and the interval $\bar{x}/(1 \pm z_u/\sqrt{n})$ results.

Example 9-4. The time to failure of a light bulb is an RV $\mathbf{x}$ with exponential distribution. We wish to find the 0.95 confidence interval of $\lambda$. To do so, we observe the time to failure of 64 bulbs and we find that their average $\bar{x}$ equals 210 hours. Setting $z_u/\sqrt{n} \approx 2/\sqrt{64} = 0.25$ into (9-18), we obtain the interval

$$168 < \lambda < 280$$

We thus expect with confidence coefficient 0.95 that the mean time to failure $E\{\mathbf{x}\} = \lambda$ of the bulb is between 168 and 280 hours.

Poisson distribution. Suppose that the RV $\mathbf{x}$ is Poisson distributed with parameter $\lambda$:

$$P\{\mathbf{x} = k\} = e^{-\lambda}\frac{\lambda^k}{k!} \qquad k = 0, 1, \ldots$$

In this case, $\eta = \lambda$ and $\sigma^2 = \lambda$; hence, for large $n$, the distribution of $\bar{x}$ is approximately $N(\lambda, \sqrt{\lambda/n})$ [see (8-110)]. This yields

$$P\left\{|\bar{x} - \lambda| < z_u\sqrt{\frac{\lambda}{n}}\right\} = \gamma$$

The points of the $\bar{x}\lambda$ plane that satisfy the inequality $|\bar{x} - \lambda| < z_u\sqrt{\lambda/n}$ are in the interior of the parabola

$$(\lambda - \bar{x})^2 = \frac{z_u^2}{n}\lambda \tag{9-19}$$

From this it follows that the $\gamma$ confidence interval of $\lambda$ is the vertical segment $(\lambda_1, \lambda_2)$ of Fig. 9-3 where $\lambda_1$ and $\lambda_2$ are the roots of the quadratic (9-19).

FIGURE 9-3
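As a rough numerical companion to (9-18) and (9-19), the sketch below evaluates both intervals. The function names are hypothetical; the exponential call reuses the data of Example 9-4, and the Poisson call uses the values $\bar{x} = 6$, $n = 64$ of Example 9-5. The Poisson endpoints follow from writing (9-19) as $\lambda^2 - (2\bar{x} + a)\lambda + \bar{x}^2 = 0$ with $a = z_u^2/n$.

```python
from math import sqrt


def exponential_interval(xbar, n, z_u):
    """Interval (9-18) for the parameter of an exponential density."""
    ratio = z_u / sqrt(n)
    if ratio >= 1:
        raise ValueError("n is too small for the large-sample interval")
    return xbar / (1 + ratio), xbar / (1 - ratio)


def poisson_interval(xbar, n, z_u):
    """Roots of the quadratic (9-19): (lam - xbar)^2 = (z_u^2 / n) * lam."""
    a = z_u ** 2 / n
    center = xbar + a / 2
    half = sqrt(a * xbar + a ** 2 / 4)
    return center - half, center + half


exp_low, exp_high = exponential_interval(xbar=210.0, n=64, z_u=2.0)
poi_low, poi_high = poisson_interval(xbar=6.0, n=64, z_u=2.0)

print(round(exp_low, 2), round(exp_high, 2))
print(round(poi_low, 2), round(poi_high, 2))
```

Note that the Poisson interval is not symmetric about $\bar{x}$: its center is shifted upward by $z_u^2/2n$.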
Example 9-5. The number of particles emitted from a radioactive substance per second is a Poisson RV $\mathbf{x}$ with parameter $\lambda$. We observe the emitted particles $x_i$ in 64 consecutive seconds and we find that $\bar{x} = 6$. Find the 0.95 confidence interval of $\lambda$. With $z_u^2/n = 0.0625$, (9-19) yields the quadratic

$$(\lambda - 6)^2 = 0.0625\lambda$$

Solving, we obtain $\lambda_1 = 5.42$, $\lambda_2 = 6.64$. We can thus claim with confidence coefficient 0.95 that $5.42 < \lambda < 6.64$.

Probability. We wish to estimate the probability $p = P(\mathcal{A})$ of an event $\mathcal{A}$. To do so, we form the zero-one RV $\mathbf{x}$ associated with this event. As we know, $E\{\mathbf{x}\} = p$ and $\sigma_x^2 = pq$. Thus the estimation of $p$ is equivalent to the estimation of the mean of the RV $\mathbf{x}$.

We repeat the experiment $n$ times and we denote by $k$ the number of successes of $\mathcal{A}$. The ratio $\bar{x} = k/n$ is the point estimate of $p$. To find its interval estimate, we form the sample mean $\bar{x}$ of $\mathbf{x}$. For large $n$, the distribution of $\bar{x}$ is approximately $N(p, \sqrt{pq/n})$. Hence

$$P\left\{|\bar{x} - p| < z_u\sqrt{\frac{pq}{n}}\right\} = \gamma = 2u - 1$$

The points of the $\bar{x}p$ plane that satisfy the inequality $|\bar{x} - p| < z_u\sqrt{pq/n}$ are in the interior of the ellipse

$$(p - \bar{x})^2 = \frac{z_u^2}{n}p(1 - p) \qquad \bar{x} = \frac{k}{n} \tag{9-20}$$

From this it follows that the $\gamma$ confidence interval of $p$ is the vertical segment $(p_1, p_2)$ of Fig. 9-4. The endpoints $p_1$ and $p_2$ of this segment are the roots of (9-20). For $n > 100$ the following approximation can be used:

$$p_{1,2} \approx \bar{x} \pm z_u\sqrt{\frac{\bar{x}(1 - \bar{x})}{n}} \qquad p_1 < p < p_2 \tag{9-21}$$

This follows from (9-20) if we replace on the right side the unknown $p$ by its point estimate $\bar{x}$.

FIGURE 9-4
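To see how close the approximation (9-21) is to the roots of (9-20), one can compare the two numerically. The sketch below is only illustrative: the function names are hypothetical, and the data ($k = 240$ successes in $n = 500$ trials, $z_u = 2$) are those of Example 9-6. The exact endpoints come from rewriting (9-20) as $(1 + a)p^2 - (2\bar{x} + a)p + \bar{x}^2 = 0$ with $a = z_u^2/n$.

```python
from math import sqrt


def exact_interval(k, n, z_u):
    """Roots p1 < p2 of (9-20): (p - xbar)^2 = (z_u^2 / n) * p * (1 - p)."""
    xbar = k / n
    a = z_u ** 2 / n
    # Discriminant of (1 + a) p^2 - (2 xbar + a) p + xbar^2 = 0, simplified.
    root = sqrt(a * (a + 4 * xbar * (1 - xbar)))
    denom = 2 * (1 + a)
    return (2 * xbar + a - root) / denom, (2 * xbar + a + root) / denom


def approximate_interval(k, n, z_u):
    """Approximation (9-21): xbar +/- z_u * sqrt(xbar * (1 - xbar) / n)."""
    xbar = k / n
    half = z_u * sqrt(xbar * (1 - xbar) / n)
    return xbar - half, xbar + half


exact = exact_interval(k=240, n=500, z_u=2.0)
approx = approximate_interval(k=240, n=500, z_u=2.0)

print(exact)
print(approx)
```

For a sample of this size the two intervals agree to within a fraction of a percentage point, which is why (9-21) is usually sufficient.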
Example 9-6. In a preelection poll, 500 persons were questioned and 240 responded Republican. Find the 0.95 confidence interval of the probability $p = P\{\text{Republican}\}$. In this example, $z_u \approx 2$, $n = 500$, $\bar{x} = 240/500 = 0.48$, and (9-21) yields the interval $0.48 \pm 0.045$.

In the usual reporting of the results, the following wording is used: We estimate that 48 percent of the voters are Republican. The margin of error is ±4.5 percent. This only specifies the point estimate and the confidence interval of the poll. The confidence coefficient (0.95 in this case) is rarely mentioned.

Variance. We wish to estimate the variance $v = \sigma^2$ of a normal RV $\mathbf{x}$ in terms of the $n$ samples $x_i$ of $\mathbf{x}$.

Known mean. We assume first that the mean $\eta$ of $\mathbf{x}$ is known and we use as the point estimator of $v$ the average

$$\hat{v} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \eta)^2 \tag{9-22}$$

As we know,

$$E\{\hat{v}\} = \sigma^2 \qquad \sigma_{\hat{v}}^2 = \frac{2\sigma^4}{n} \to 0 \quad \text{as } n \to \infty$$

Thus $\hat{v}$ is a consistent estimator of $\sigma^2$. We shall find an interval estimate. The RV $n\hat{v}/\sigma^2$ has a $\chi^2(n)$ density (see page 200). This density is not symmetrical; hence the interval estimate of $\sigma^2$ is not centered at $\sigma^2$. To determine it, we introduce two constants $c_1$ and $c_2$ such that (Fig. 9-5a)

$$P\left\{\frac{n\hat{v}}{\sigma^2} < c_1\right\} = \frac{\delta}{2} \qquad P\left\{\frac{n\hat{v}}{\sigma^2} > c_2\right\} = \frac{\delta}{2}$$

This yields $c_1 = \chi^2_{\delta/2}(n)$, $c_2 = \chi^2_{1-\delta/2}(n)$, and the interval

$$\frac{n\hat{v}}{\chi^2_{1-\delta/2}(n)} < \sigma^2 < \frac{n\hat{v}}{\chi^2_{\delta/2}(n)} \tag{9-23}$$

FIGURE 9-5
results. This interval does not have minimum length. The minimum interval is such that $f_\chi(c_1) = f_\chi(c_2)$ (Fig. 9-5b); however, its determination is not simple. In Table 9-3, we list the percentiles $\chi^2_u(n)$ of the $\chi^2(n)$ distribution.

TABLE 9-3
Chi-square percentiles $\chi^2_u(n)$

 n \ u   .005    .01     .025    .05     .1      .9      .95     .975    .99     .995
  1      0.00    0.00    0.00    0.00    0.02    2.71    3.84    5.02    6.63    7.88
  2      0.01    0.02    0.05    0.10    0.21    4.61    5.99    7.38    9.21   10.60
  3      0.07    0.11    0.22    0.35    0.58    6.25    7.81    9.35   11.34   12.84
  4      0.21    0.30    0.48    0.71    1.06    7.78    9.49   11.14   13.28   14.86
  5      0.41    0.55    0.83    1.15    1.61    9.24   11.07   12.83   15.09   16.75
  6      0.68    0.87    1.24    1.64    2.20   10.64   12.59   14.45   16.81   18.55
  7      0.99    1.24    1.69    2.17    2.83   12.02   14.07   16.01   18.48   20.28
  8      1.34    1.65    2.18    2.73    3.49   13.36   15.51   17.53   20.09   21.96
  9      1.73    2.09    2.70    3.33    4.17   14.68   16.92   19.02   21.67   23.59
 10      2.16    2.56    3.25    3.94    4.87   15.99   18.31   20.48   23.21   25.19
 11      2.60    3.05    3.82    4.57    5.58   17.28   19.68   21.92   24.73   26.76
 12      3.07    3.57    4.40    5.23    6.30   18.55   21.03   23.34   26.22   28.30
 13      3.57    4.11    5.01    5.89    7.04   19.81   22.36   24.74   27.69   29.82
 14      4.07    4.66    5.63    6.57    7.79   21.06   23.68   26.12   29.14   31.32
 15      4.60    5.23    6.26    7.26    8.55   22.31   25.00   27.49   30.58   32.80
 16      5.14    5.81    6.91    7.96    9.31   23.54   26.30   28.85   32.00   34.27
 17      5.70    6.41    7.56    8.67   10.09   24.77   27.59   30.19   33.41   35.72
 18      6.26    7.01    8.23    9.39   10.86   25.99   28.87   31.53   34.81   37.16
 19      6.84    7.63    8.91   10.12   11.65   27.20   30.14   32.85   36.19   38.58
 20      7.43    8.26    9.59   10.85   12.44   28.41   31.41   34.17   37.57   40.00
 22      8.6     9.5    11.0    12.3    14.0    30.8    33.9    36.8    40.3    42.8
 24      9.9    10.9    12.4    13.8    15.7    33.2    36.4    39.4    43.0    45.6
 26     11.2    12.2    13.8    15.4    17.3    35.6    38.9    41.9    45.6    48.3
 28     12.5    13.6    15.3    16.9    18.9    37.9    41.3    44.5    48.3    51.0
 30     13.8    15.0    16.8    18.5    20.6    40.3    43.8    47.0    50.9    53.7
 40     20.7    22.2    24.4    26.5    29.1    51.8    55.8    59.3    63.7    66.8
 50     28.0    29.7    32.4    34.8    37.7    63.2    67.5    71.4    76.2    79.5

For $n \ge 50$: $\chi^2_u(n) \approx \frac{1}{2}\left(z_u + \sqrt{2n - 1}\right)^2$

Unknown mean. If $\eta$ is unknown, we use as the point estimate of $\sigma^2$ the sample variance $s^2$ [see (9-13)].
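The RV $(n - 1)s^2/\sigma^2$ has a $\chi^2(n - 1)$ distribution. Hence

$$P\left\{\chi^2_{\delta/2}(n - 1) < \frac{(n - 1)s^2}{\sigma^2} < \chi^2_{1-\delta/2}(n - 1)\right\} = \gamma$$

This yields the interval

$$\frac{(n - 1)s^2}{\chi^2_{1-\delta/2}(n - 1)} < \sigma^2 < \frac{(n - 1)s^2}{\chi^2_{\delta/2}(n - 1)} \tag{9-24}$$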
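Both variance intervals, (9-23) and (9-24), have the same form: a sum of squares divided by two chi-square percentiles. The following sketch makes that explicit. It is a minimal illustration with a hypothetical helper name; the percentiles are copied from Table 9-3 rather than computed, and the data are those of Example 9-7, which follows.

```python
from math import sqrt


def variance_interval(sum_of_squares, chi2_low, chi2_high):
    """Return (low, high) with low < sigma^2 < high.

    sum_of_squares is n*v_hat in (9-23) or (n - 1)*s^2 in (9-24);
    chi2_low and chi2_high are the delta/2 and 1 - delta/2 percentiles.
    """
    return sum_of_squares / chi2_high, sum_of_squares / chi2_low


# Known mean: n = 6, v_hat = 0.25, chi-square percentiles for 6 degrees of freedom
known = variance_interval(6 * 0.25, chi2_low=1.24, chi2_high=14.45)

# Unknown mean: n = 6, s^2 = 0.30, chi-square percentiles for 5 degrees of freedom
unknown = variance_interval(5 * 0.30, chi2_low=0.83, chi2_high=12.83)

# Corresponding intervals for sigma itself
known_sigma = tuple(sqrt(v) for v in known)
unknown_sigma = tuple(sqrt(v) for v in unknown)

print(known, known_sigma)
print(unknown, unknown_sigma)
```

The interval is wider when the mean is unknown, because one degree of freedom is spent on estimating it.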
Example 9-7. A voltage source $V$ is measured six times. The measurements are modeled by the RV $\mathbf{x} = V + \boldsymbol{\nu}$. We assume that the error $\boldsymbol{\nu}$ is $N(0, \sigma)$. We wish to find the 0.95 interval estimate of $\sigma^2$.

(a) Suppose first that the source is a known standard with $V = 110$ V. We insert the measured values $x_i = 110 + \nu_i$ of $V$ into (9-22) and we find $\hat{v} = 0.25$. From Table 9-3 we obtain

$$\chi^2_{0.025}(6) = 1.24 \qquad \chi^2_{0.975}(6) = 14.45$$

and (9-23) yields $0.104 < \sigma^2 < 1.2$. The corresponding interval for $\sigma$ is $0.322 < \sigma < 1.096$ V.

(b) Suppose now that $V$ is unknown. Using the same data, we compute $s^2$ from (9-13) and we find $s^2 = 0.30$. From Table 9-3 we obtain

$$\chi^2_{0.025}(5) = 0.83 \qquad \chi^2_{0.975}(5) = 12.83$$

and (9-24) yields $0.117 < \sigma^2 < 1.8$. The corresponding interval for $\sigma$ is $0.342 < \sigma < 1.344$ V.

PERCENTILES. The $u$ percentile of an RV $\mathbf{x}$ is by definition a number $x_u$ such that $F(x_u) = u$. Thus $x_u$ is the inverse function $F^{(-1)}(u)$ of the distribution $F(x)$ of $\mathbf{x}$. We shall estimate $x_u$ in terms of the samples $x_i$ of $\mathbf{x}$. To do so, we write the $n$ observations $x_i$ in ascending order and we denote by $y_k$ the $k$th number so obtained. The corresponding RVs $\mathbf{y}_k$ are the order statistics of $\mathbf{x}$ [see (8-13)].

From the definition it follows that $y_k < x_u$ iff at least $k$ of the samples $x_i$ are less than $x_u$; similarly, $y_{k+r} > x_u$ iff at most $k + r - 1$ of the samples $x_i$ are less than $x_u$. Hence $y_k < x_u < y_{k+r}$ iff at least $k$ and at most $k + r - 1$ of the samples $x_i$ are less than $x_u$. This leads to the conclusion that the event $\{\mathbf{y}_k < x_u < \mathbf{y}_{k+r}\}$ occurs iff the number of successes of the event $\{\mathbf{x} \le x_u\}$ in $n$ repetitions of the experiment is at least $k$ and at most $k + r - 1$. And since $P\{\mathbf{x} \le x_u\} = u$, it follows from (3-18) with $p = u$ that

$$P\{\mathbf{y}_k < x_u < \mathbf{y}_{k+r}\} = \sum_{m=k}^{k+r-1}\binom{n}{m}u^m(1 - u)^{n-m} \tag{9-25}$$

Using this basic relationship, we shall find the $\gamma$ confidence interval of $x_u$ for a specific $u$. To do so, we must find an integer $k$ such that the sum in (9-25) equals $\gamma$ for the smallest possible $r$. This is a complicated task involving trial and error.
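The trial-and-error search mentioned above is easy to hand to a computer. The sketch below is one possible brute-force version, not a method given in the text: the function names are hypothetical, and because the binomial sum (9-25) takes only discrete values, the requirement "equals $\gamma$" is relaxed to "is at least $\gamma$".

```python
from math import comb


def coverage(n, u, k, r):
    """The sum in (9-25): probability that y_k < x_u < y_{k+r}."""
    return sum(comb(n, m) * u ** m * (1 - u) ** (n - m) for m in range(k, k + r))


def shortest_interval(n, u, gamma):
    """Return (k, r) with the smallest r whose coverage is at least gamma.

    For each r, the k giving the largest coverage is tried; k ranges over
    1 .. n - r so that both y_k and y_{k+r} are order statistics.
    """
    for r in range(1, n):
        best_k = max(range(1, n - r + 1), key=lambda k: coverage(n, u, k, r))
        if coverage(n, u, best_k, r) >= gamma:
            return best_k, r
    return None


# Median (u = 0.5) of n = 100 samples with confidence coefficient 0.95
k, r = shortest_interval(n=100, u=0.5, gamma=0.95)
print(k, r, coverage(100, 0.5, k, r))
```

For large $n$ the search is slow and unnecessary, since a normal approximation of the binomial sum gives $k$ and $r$ directly.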
A simple solution can be obtained if $n$ is large. Using the normal approximation (3-33) with $p = u$, we obtain

$$P\{\mathbf{y}_k < x_u < \mathbf{y}_{k+r}\} \approx G\left(\frac{k + r - 0.5 - nu}{\sqrt{nu(1 - u)}}\right) - G\left(\frac{k - 0.5 - nu}{\sqrt{nu(1 - u)}}\right) = \gamma$$

For a specific $\gamma$, $r$ is minimum if $nu$ is
near the center of the interval $(k, k + r)$. This yields

$$k \approx nu - z_{1-\delta/2}\sqrt{nu(1 - u)} \qquad k + r \approx nu + z_{1-\delta/2}\sqrt{nu(1 - u)} \tag{9-26}$$

to the nearest integer.

Example 9-8. We observe 100 samples of $\mathbf{x}$ and we wish to find the 0.95 confidence interval of the median $x_{0.5}$ of $\mathbf{x}$. With $u = 0.5$, $nu = 50$, $z_{0.975} \approx 2$, (9-26) yields $k \approx 40$, $k + r \approx 60$. Thus we can claim with confidence coefficient 0.95 that the median of $\mathbf{x}$ is between $y_{40}$ and $y_{60}$.

DISTRIBUTIONS. We wish to estimate the distribution $F(x)$ of an RV $\mathbf{x}$ in terms of the samples $x_i$ of $\mathbf{x}$. For a specific $x$, $F(x)$ equals the probability of the event $\{\mathbf{x} \le x\}$; hence its point estimate is the ratio $n_x/n$ where $n_x$ is the number of $x_i$'s that do not exceed $x$. Repeating this for every $x$, we obtain the empirical estimate

$$\hat{F}(x) = \frac{n_x}{n}$$

of the distribution $F(x)$ [see also (4-3)]. This estimate is a staircase function (Fig. 9-6a) with discontinuities at the points $x_i$.

Interval estimates. For a specific $x$, the interval estimate of $F(x)$ is obtained from (9-20) with $p = F(x)$ and $\bar{x} = \hat{F}(x)$. Inserting into (9-21), we obtain the interval

$$\hat{F}(x) \pm z_u\sqrt{\frac{\hat{F}(x)[1 - \hat{F}(x)]}{n}}$$

We can thus claim with confidence coefficient $\gamma = 2u - 1$ that the unknown $F(x)$ is in the above interval. Note that the length of this interval depends on $x$.
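A small sketch may help fix the idea of the empirical distribution and its pointwise interval. Everything named here is illustrative: the helper functions are hypothetical, the samples are made-up integers chosen so the result is easy to check by hand, and the clipping of the interval to $[0, 1]$ is an added convenience that the text does not discuss.

```python
from bisect import bisect_right
from math import sqrt


def empirical_cdf(samples):
    """Return the staircase function F_hat(x) = n_x / n."""
    ordered = sorted(samples)
    n = len(ordered)

    def F_hat(x):
        # bisect_right counts the samples that do not exceed x
        return bisect_right(ordered, x) / n

    return F_hat


def pointwise_interval(F_hat, x, n, z_u=2.0):
    """Interval F_hat(x) +/- z_u * sqrt(F_hat(x) * (1 - F_hat(x)) / n), clipped to [0, 1]."""
    p = F_hat(x)
    half = z_u * sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)


samples = list(range(1, 101))          # hypothetical data: the integers 1..100
F_hat = empirical_cdf(samples)
low, high = pointwise_interval(F_hat, 50, len(samples))

print(F_hat(50), low, high)
```

The interval is widest where $\hat{F}(x)$ is near 0.5 and shrinks to zero length in the tails, which illustrates the dependence on $x$ noted above.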
We shall now find an interval estimate $\hat{F}(x) \pm c$ of $F(x)$ where $c$ is a constant. The empirical estimate $\hat{F}(x)$ depends on the samples $x_i$ of $\mathbf{x}$. It specifies, therefore, a family of staircase functions $\hat{F}(x)$, one for each set of samples $x_i$. The constant $c$ is such that

$$P\{|\hat{F}(x) - F(x)| \le c\} = \gamma \tag{9-27}$$

for every $x$, and the $\gamma$ confidence region of $F(x)$ is the strip $\hat{F}(x) \pm c$. To find $c$, we form the maximum

$$\mathbf{w} = \max_x|\hat{F}(x) - F(x)| \tag{9-28}$$

(least upper bound) of the distance between $\hat{F}(x)$ and $F(x)$. Suppose that $w = \mathbf{w}(\xi)$ is a specific value of $\mathbf{w}$. From (9-28) it follows that $w < c$ iff $|\hat{F}(x) - F(x)| < c$ for every $x$. Hence

$$\gamma = P\{\mathbf{w} \le c\} = F_w(c)$$

It suffices, therefore, to find the distribution of $\mathbf{w}$. We shall show first that the function $F_w(w)$ does not depend on $F(x)$. As we know [see (5-18)], the RV $\mathbf{y} = F(\mathbf{x})$ is uniform in the interval (0, 1) for any $F(x)$. The function $y = F(x)$ transforms the points $x_i$ to the points $y_i = F(x_i)$ and the RV $\mathbf{w}$ to itself (see Fig. 9-6b). This shows that $F_w(w)$ does not depend on the form of $F(x)$. For its determination it suffices, therefore, to assume that $\mathbf{x}$ is uniform. However, even with this simplification, it is not simple to find $F_w(w)$. We give next an approximate solution due to Kolmogoroff:

For large $n$:

$$F_w(w) \approx 1 - 2e^{-2nw^2} \tag{9-29}$$

From this it follows that $\gamma = F_w(c) \approx 1 - 2e^{-2nc^2}$. We can thus claim with confidence coefficient $\gamma$ that the unknown $F(x)$ is between the curves $\hat{F}(x) + c$ and $\hat{F}(x) - c$ where

$$c = \sqrt{-\frac{1}{2n}\ln\frac{1 - \gamma}{2}} \tag{9-30}$$

This approximation is satisfactory if $w > 1/\sqrt{n}$.

Bayesian Estimation

We return to the problem of estimating the parameter $\theta$ of a distribution $F(x, \theta)$. In our earlier approach, we viewed $\theta$ as an unknown constant and the estimate was based solely on the observed values $x_i$ of the RV $\mathbf{x}$. This approach to estimation is called classical. In certain applications, $\theta$ is not totally unknown.
If, for example, $\theta$ is the probability of six in the die experiment, we expect that its possible values are close to 1/6 because most dice are reasonably fair. In bayesian statistics, the available prior information about $\theta$ is used in the estimation problem. In this approach, the unknown parameter $\theta$ is viewed as the value of an RV $\boldsymbol{\theta}$ and the distribution of $\mathbf{x}$ is interpreted as the conditional
distribution $F_x(x \mid \theta)$ of $\mathbf{x}$ assuming $\boldsymbol{\theta} = \theta$. The prior information is used to assign somehow a density $f_\theta(\theta)$ to the RV $\boldsymbol{\theta}$, and the problem is to estimate the value $\theta$ of $\boldsymbol{\theta}$ in terms of the observed values $x_i$ of $\mathbf{x}$ and the density of $\boldsymbol{\theta}$. The problem of estimating the unknown parameter $\theta$ is thus changed to the problem of estimating the value $\theta$ of the RV $\boldsymbol{\theta}$. Thus, in bayesian statistics, estimation is changed to prediction.

We shall introduce the method in the context of the following problem. We wish to estimate the inductance $\theta$ of a coil. We measure $\theta$ $n$ times and the results are the samples $x_i = \theta + \nu_i$ of the RV $\mathbf{x} = \theta + \boldsymbol{\nu}$. If we interpret $\theta$ as an unknown number, we have a classical estimation problem. Suppose, however, that the coil is selected from a production line. In this case, its inductance $\theta$ can be interpreted as the value of an RV $\boldsymbol{\theta}$ modeling the inductances of all coils. This is a problem in bayesian estimation. To solve it, we assume first that no observations are available, that is, that the specific coil has not been measured. The available information is now the prior density $f_\theta(\theta)$ of $\boldsymbol{\theta}$, which we assume known, and our problem is to find a constant $\hat{\theta}$ close in some sense to the unknown $\theta$, that is, to the true value of the inductance of the particular coil. If we use the LMS criterion for selecting $\hat{\theta}$, then [see (7-62)]

$$\hat{\theta} = E\{\boldsymbol{\theta}\} = \int_{-\infty}^{\infty}\theta f_\theta(\theta)\,d\theta$$

To improve the estimate, we measure the coil $n$ times. The problem now is to estimate $\theta$ in terms of the $n$ samples $x_i$ of $\mathbf{x}$. In the general case, this involves the estimation of the value $\theta$ of an RV $\boldsymbol{\theta}$ in terms of the $n$ samples $x_i$ of $\mathbf{x}$. Using again the MS criterion, we obtain

$$\hat{\theta} = E\{\boldsymbol{\theta} \mid X\} = \int_{-\infty}^{\infty}\theta f_\theta(\theta \mid X)\,d\theta \tag{9-31}$$

[see (8-77)] where $X = [x_1, \ldots, x_n]$ and

$$f_\theta(\theta \mid X) = \frac{f(X \mid \theta)f_\theta(\theta)}{f(X)} \tag{9-32}$$

In the above, $f(X \mid \theta)$ is the conditional density of the $n$ RVs $\mathbf{x}_i$ assuming $\boldsymbol{\theta} = \theta$. If these RVs are conditionally independent, then

$$f(X \mid \theta) = f(x_1 \mid \theta) \cdots f(x_n \mid \theta) \tag{9-33}$$

where $f(x \mid \theta)$ is the conditional density of the RV $\mathbf{x}$ assuming $\boldsymbol{\theta} = \theta$.
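Equations (9-31) to (9-33) can be evaluated numerically when no closed form is at hand. The sketch below is one simple way to do so, under loudly stated assumptions: the integral is replaced by a sum over a uniform grid, the prior and the noise are taken to be normal, and all numerical values (prior mean and spread, noise spread, measurements) are hypothetical. Normalizing constants are omitted because they cancel in the ratio (9-32). For this normal case the standard closed-form posterior mean is available and is used as a check.

```python
from math import exp


def posterior_mean(samples, prior, likelihood, grid):
    """Approximate (9-31) by a sum over a uniform grid of theta values."""
    weights = []
    for theta in grid:
        w = prior(theta)
        for x in samples:
            w *= likelihood(x, theta)      # product (9-33)
        weights.append(w)                  # numerator of (9-32), up to a constant
    return sum(t * w for t, w in zip(grid, weights)) / sum(weights)


theta0, sigma0, sigma = 10.0, 1.0, 0.5     # hypothetical prior and noise parameters
samples = [10.8, 11.1, 10.9, 11.2]         # hypothetical measurements


def prior(theta):
    """Unnormalized normal prior density centered at theta0."""
    return exp(-((theta - theta0) ** 2) / (2 * sigma0 ** 2))


def likelihood(x, theta):
    """Unnormalized normal density of one measurement given theta."""
    return exp(-((x - theta) ** 2) / (2 * sigma ** 2))


grid = [5.0 + 0.001 * i for i in range(10001)]
estimate = posterior_mean(samples, prior, likelihood, grid)

# Closed-form posterior mean for a normal prior and normal noise
n = len(samples)
xbar = sum(samples) / n
closed_form = (theta0 / sigma0 ** 2 + n * xbar / sigma ** 2) / (1 / sigma0 ** 2 + n / sigma ** 2)

print(estimate, closed_form)
```

The estimate lies between the prior mean and the sample average, much closer to the latter because four measurements with small noise outweigh the broad prior.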
These results hold in general. In the measurement problem, $f(x \mid \theta) = f_\nu(x - \theta)$.

We conclude with the clarification of the meaning of the various densities used in bayesian estimation, and of the underlying model, in the context of the measurement problem. The density $f_\theta(\theta)$, called prior (prior to the measurements), models the inductances of all coils. The density $f_\theta(\theta \mid X)$, called posterior (after the measurements), models the inductances of all coils of measured inductance $x$. The conditional density $f_x(x \mid \theta) = f_\nu(x - \theta)$ models all measurements of a particular coil of true inductance $\theta$. This density, considered as a
function of $\theta$, is called the likelihood function. The unconditional density $f_x(x)$ models all measurements of all coils. Equation (9-33) is based on the reasonable assumption that the measurements of a given coil are independent.

The bayesian model is a product space $\mathcal{S} = \mathcal{S}_\theta \times \mathcal{S}_x$ where $\mathcal{S}_\theta$ is the space of the RV $\boldsymbol{\theta}$ and $\mathcal{S}_x$ is the space of the RV $\mathbf{x}$. The space $\mathcal{S}_\theta$ is the space of all coils and $\mathcal{S}_x$ is the space of all measurements of a particular coil. Finally, $\mathcal{S}$ is the space of all measurements of all coils. The number $\theta$ has two meanings: It is the value of the RV $\boldsymbol{\theta}$ in the space $\mathcal{S}_\theta$; it is also a parameter specifying the density $f(x \mid \theta) = f_\nu(x - \theta)$ of the RV $\mathbf{x}$ in the space $\mathcal{S}_x$.

FIGURE 9-7

Example 9-9. Suppose that $\mathbf{x} = \theta + \boldsymbol{\nu}$ where $\boldsymbol{\nu}$ is an $N(0, \sigma)$ RV and $\theta$ is the value of an $N(\theta_0, \sigma_0)$ RV $\boldsymbol{\theta}$ (Fig. 9-7). Find the bayesian estimate $\hat{\theta}$ of $\theta$.

The density $f(x \mid \theta)$ of $\mathbf{x}$ is $N(\theta, \sigma)$. Inserting into (9-32), we conclude that (see Prob. 9-37) the function $f_\theta(\theta \mid X)$ is $N(\theta_1, \sigma_1)$ where

$$\sigma_1^2 = \frac{\sigma^2}{n} \times \frac{\sigma_0^2}{\sigma_0^2 + \sigma^2/n} \qquad \theta_1 = \frac{\sigma_1^2}{\sigma_0^2}\theta_0 + \frac{n\sigma_1^2}{\sigma^2}\bar{x}$$

From the above it follows that $E\{\boldsymbol{\theta} \mid X\} = \theta_1$; in other words, $\hat{\theta} = \theta_1$.

Note that the classical estimate of $\theta$ is the average $\bar{x}$ of $x_i$. Furthermore, its prior estimate is the constant $\theta_0$. Hence $\hat{\theta}$ is the weighted average of the prior estimate $\theta_0$ and the classical estimate $\bar{x}$. Note further that as $n$ tends to $\infty$, $\sigma_1 \to 0$ and $n\sigma_1^2/\sigma^2 \to 1$; hence $\hat{\theta}$ tends to $\bar{x}$. Thus, as the number of measurements increases, the bayesian estimate $\hat{\theta}$ approaches the classical estimate $\bar{x}$; the effect of the prior becomes negligible.

We present next the estimation of the probability $p = P(\mathcal{A})$ of an event $\mathcal{A}$. To be concrete, we assume that $\mathcal{A}$ is the event "heads" in the coin experiment. The result is based on Bayes' formula [see (4-67)]

$$f(x \mid \mathcal{M}) = \frac{P(\mathcal{M} \mid \mathbf{x} = x)f(x)}{\int_{-\infty}^{\infty}P(\mathcal{M} \mid \mathbf{x} = x)f(x)\,dx} \tag{9-34}$$

In bayesian statistics, $p$ is the value of an RV $\mathbf{p}$ with prior density $f(p)$. In the
absence of any observations, the LMS estimate $\hat{p}$ is given by

$$\hat{p} = \int_0^1 pf(p)\,dp \tag{9-35}$$

To improve the estimate, we toss the coin at hand $n$ times and we observe that "heads" shows $k$ times. As we know,

$$P\{\mathcal{M} \mid \mathbf{p} = p\} = p^kq^{n-k} \qquad \mathcal{M} = \{k \text{ heads}\}$$

Inserting into (9-34), we obtain the posterior density

$$f(p \mid \mathcal{M}) = \frac{p^kq^{n-k}f(p)}{\int_0^1 p^kq^{n-k}f(p)\,dp} \tag{9-36}$$

Using this function, we can estimate the probability of heads at the next toss of the coin. Replacing $f(p)$ by $f(p \mid \mathcal{M})$ in (9-35), we conclude that the updated estimate $\hat{p}$ of $p$ is the conditional estimate of $\mathbf{p}$ assuming $\mathcal{M}$:

$$\hat{p} = \int_0^1 pf(p \mid \mathcal{M})\,dp \tag{9-37}$$

Note that for large $n$, the factor $\varphi(p) = p^k(1 - p)^{n-k}$ in (9-36) has a sharp maximum at $p = k/n$. Therefore, if $f(p)$ is smooth, the product $f(p)\varphi(p)$ is concentrated near $k/n$ (Fig. 9-8a). However, if $f(p)$ has a sharp peak at $p = 0.5$ (this is the case for reasonably fair coins), then for moderate values of $n$, the product $f(p)\varphi(p)$ has two maxima: one near $k/n$ and the other near 0.5 (Fig. 9-8b). As $n$ increases, the sharpness of $\varphi(p)$ prevails and $f(p \mid \mathcal{M})$ is maximum near $k/n$ (Fig. 9-8c).

Example 9-10. We toss a coin of unknown quality $n$ times and we observe $k$ heads. Using this information, we wish to find the bayesian estimate $\hat{p}$ of the probability $p$ that at the next toss heads will show.

In the absence of any prior information, we assume that $p$ is the value of an RV $\mathbf{p}$ uniformly distributed in the interval (0, 1). Setting $f(p) = 1$ in (9-36) and
using the identity

$$\int_0^1 p^k(1 - p)^{n-k}\,dp = \frac{k!(n - k)!}{(n + 1)!}$$

we obtain

$$f(p \mid \mathcal{M}) = \frac{(n + 1)!}{k!(n - k)!}p^k(1 - p)^{n-k} \qquad 0 < p < 1$$

This function is known as the beta density. The updated estimate $\hat{p}$ of $p$ is obtained from (9-37):

$$\hat{p} = \frac{(n + 1)!}{k!(n - k)!}\int_0^1 p^{k+1}(1 - p)^{n-k}\,dp = \frac{k + 1}{n + 2}$$

This result is known as the law of succession.

Note. Bayesian estimation is a controversial subject. The controversy has its origin in the dual interpretation of the physical meaning of probability. In the first interpretation, the probability $P(\mathcal{A})$ of an event $\mathcal{A}$ is an "objective" measure of the relative frequency of the occurrence of $\mathcal{A}$ in a large number of trials. In the second interpretation, $P(\mathcal{A})$ is a "subjective" measure of our state of knowledge concerning the occurrence of $\mathcal{A}$ in a single trial. This dualism leads to two different interpretations of the meaning of parameter estimation. In the coin experiment, these interpretations take the following form:

In the classical (objective) approach, $p$ is an unknown number. To estimate its value, we toss the coin $n$ times and use as an estimate of $p$ the ratio $\hat{p} = k/n$. In the bayesian (subjective) approach, $p$ is also an unknown number; however, we interpret it as the value of an RV $\mathbf{p}$, the density of which we determine using whatever knowledge we might have about the coin. The resulting estimate of $p$ is determined from (9-37). If we know nothing about $p$, we set $f(p) = 1$ and we obtain the estimate $\hat{p} = (k + 1)/(n + 2)$. Conceptually, the two approaches are different. However, practically, they lead in most estimates of interest to similar results if the size $n$ of the available sample is large. In the coin problem, for example, if $n$ is large, $k$ is also large with high probability; hence $(k + 1)/(n + 2) \approx k/n$. If $n$ is not large, the results are different but unreliable for either method. The mathematics of bayesian estimation is also used in classical estimation problems if $\theta$ is the value of an RV the density of which can be determined objectively in terms of averages.
This is the case in the problem considered in Example 9-9.

Method of Maximum Likelihood

Up to now, we considered the estimation of particular parameters, and the selection of their estimators was based on the relative frequency interpretation of the mean of some function of $\mathbf{x}$. In the following, we develop a general method of estimation. This method can be used for most applications but it is efficient primarily for large values of $n$. We introduce the method in the context of the following problem:
We have an RV $\mathbf{x}$ with density $f(x, \theta)$ and we wish to estimate $\theta$ in terms of a single observation $x$ of the RV $\mathbf{x}$. To do so, we plot the density $f(x, \theta)$ as a function of $\theta$, assigning to $x$ the observed value of $\mathbf{x}$, and we determine the value $\hat{\theta} = \theta_{\max}$ of $\theta$ that maximizes $f(x, \theta)$. We shall call the curve $f(x, \theta)$ so plotted the likelihood function of $x$ and the number $\hat{\theta}$ the maximum likelihood (ML) estimate of $\theta$. This estimate is the value of $\theta$ for which the probability $f(x, \theta)\,dx$ that the RV $\mathbf{x}$ is in the interval $(x, x + dx)$ is maximum.

Example 9-11. In Fig. 9-9 we plot the Erlang density

$$f(x, \theta) = \theta^2xe^{-\theta x}U(x)$$

as a function of $x$ and the corresponding likelihood function. The likelihood function is maximum for $\theta = 2/x$. Thus the ML estimate of $\theta$ in terms of the observed value $x$ of $\mathbf{x}$ is $\hat{\theta} = 2/x$. The mode $x_{\max} = 1/\theta$ of the density is the predicted value of $\mathbf{x}$ if $\theta$ is known (see page 179).

FIGURE 9-9

We shall now determine the ML estimate of $\theta$ in terms of $n$ observations $x_i$ of $\mathbf{x}$. To do so, we form the joint density

$$f(X, \theta) = f(x_1, \theta) \cdots f(x_n, \theta)$$

of the $n$ samples $x_i$ of $\mathbf{x}$. This density, considered as a function of $\theta$, is called the likelihood function of $X$. The value $\hat{\theta}$ of $\theta$ that maximizes $f(X, \theta)$ is the ML estimate of $\theta$. The logarithm

$$L(X, \theta) = \ln f(X, \theta) = \sum_{i=1}^{n}\ln f(x_i, \theta) \tag{9-38}$$

is the log-likelihood function of $X$. From the monotonicity of the logarithm, it follows that $\hat{\theta}$ also maximizes the function $L(X, \theta)$. If $\hat{\theta}$ is in the interior of the domain $\Theta$ of $\theta$, then $\hat{\theta}$ is a root of the equation

$$\frac{\partial L(X, \theta)}{\partial\theta} = \sum_{i=1}^{n}\frac{1}{f(x_i, \theta)}\frac{\partial f(x_i, \theta)}{\partial\theta} = 0 \tag{9-39}$$
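When (9-39) cannot be solved by hand, the log-likelihood (9-38) can be maximized numerically. The sketch below does this for the Erlang density of Example 9-11 with several observations. It is an illustration only: the golden-section search is a generic technique chosen here for simplicity, not one prescribed by the text, and the samples are hypothetical. For this density, setting the derivative of $L(X, \theta) = 2n\ln\theta + \sum\ln x_i - \theta\sum x_i$ to zero gives $\hat{\theta} = 2/\bar{x}$, which serves as a check.

```python
from math import log, sqrt


def log_likelihood(samples, theta):
    """L(X, theta) of (9-38) for f(x, theta) = theta^2 * x * exp(-theta * x), x > 0."""
    return sum(2 * log(theta) + log(x) - theta * x for x in samples)


def golden_section_max(fn, low, high, tol=1e-9):
    """Locate the maximum of a unimodal function on [low, high]."""
    ratio = (sqrt(5) - 1) / 2
    a, b = low, high
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if fn(c) < fn(d):
            a = c      # the maximum lies in [c, b]
        else:
            b = d      # the maximum lies in [a, d]
    return (a + b) / 2


samples = [0.5, 1.5, 1.0, 2.0, 1.0]        # hypothetical observations
theta_hat = golden_section_max(lambda t: log_likelihood(samples, t), 0.01, 10.0)

xbar = sum(samples) / len(samples)
print(theta_hat, 2 / xbar)
```

The search assumes that the log-likelihood has a single maximum in the bracket, which holds here because $L(X, \theta)$ is concave in $\theta$.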
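Example 9-12. Suppose that $f(x, \theta) = \theta e^{-\theta x}U(x)$. In this case,

$$f(X, \theta) = \theta^ne^{-\theta n\bar{x}} \qquad L(X, \theta) = n\ln\theta - \theta n\bar{x}$$

Hence

$$\frac{\partial L(X, \theta)}{\partial\theta} = \frac{n}{\theta} - n\bar{x} = 0 \qquad \hat{\theta} = \frac{1}{\bar{x}}$$

Thus the ML estimator of $\theta$ equals $1/\bar{x}$. This estimator is biased because $E\{1/\bar{x}\} = n\theta/(n - 1)$.

The ML method can be used to estimate any parameter. However, for moderate values of $n$, the estimate is not efficient. The method is used primarily for large values of $n$. This is based on the following important result.

Asymptotic properties. For large $n$, the distribution of the ML estimator $\hat{\theta}$ approaches a normal curve with mean $\theta$ and variance $1/nI$ where

$$I = E\left\{\left|\frac{\partial L(\mathbf{x}, \theta)}{\partial\theta}\right|^2\right\} = \int_{-\infty}^{\infty}\left|\frac{\partial L(x, \theta)}{\partial\theta}\right|^2f(x, \theta)\,dx \tag{9-40}$$

Thus

$$f_{\hat{\theta}}(\hat{\theta}) \approx \sqrt{\frac{nI}{2\pi}}\exp\left\{-\frac{nI}{2}(\hat{\theta} - \theta)^2\right\} \tag{9-41}$$

The number $I$ is called the information about $\theta$ contained in $\mathbf{x}$. Using integration by parts, we can show that (see Prob. 9-24)

$$I = -E\left\{\frac{\partial^2L(\mathbf{x}, \theta)}{\partial\theta^2}\right\}$$

We show later [see (9-46)] that the variance of any unbiased estimator of $\theta$ cannot be smaller than $1/nI$. From this it follows that the ML estimator is asymptotically normal, unbiased, with minimum variance.

In the next example, we demonstrate the validity of the above theorem. The proof will not be given.

Example 9-13. Suppose that the RV $\mathbf{x}$ is $N(\eta, \sigma)$ where $\eta$ is a known constant. We wish to find the ML estimate $\hat{v}$ of its variance $v = \sigma^2$. In this problem,

$$f(X, v) = \frac{1}{(\sqrt{2\pi v})^n}\exp\left\{-\frac{1}{2v}\sum(x_i - \eta)^2\right\}$$

$$L(X, v) = -\frac{n}{2}\ln(2\pi v) - \frac{1}{2v}\sum(x_i - \eta)^2$$

Inserting into (9-39), we obtain

$$\frac{\partial L(X, v)}{\partial v} = -\frac{n}{2v} + \frac{1}{2v^2}\sum(x_i - \eta)^2 = 0$$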
This yields the estimator

$$\hat{v} = \frac{1}{n}\sum(x_i - \eta)^2$$

As we know [see (8-67)],

$$E\{\hat{v}\} = \sigma^2 \qquad \sigma_{\hat{v}}^2 = \frac{2\sigma^4}{n}$$

Furthermore, for large $n$ the RV $\hat{v}$ is nearly normal (CLT) as in (9-41). To complete the validity of (9-41), it suffices to show that $nI = 1/\sigma_{\hat{v}}^2$, that is, $I = 1/2\sigma^4$. This follows from the identity

$$I = -E\left\{\frac{\partial^2L(\mathbf{x}, v)}{\partial v^2}\right\} = E\left\{\frac{(\mathbf{x} - \eta)^2}{v^3} - \frac{1}{2v^2}\right\} = \frac{1}{2v^2}$$

The Rao-Cramér Bound

A basic problem in estimation is the determination of the best estimator $\hat{\theta}$ of a parameter $\theta$. It is easy to show that if $\hat{\theta}$ exists, it is unique (see Prob. 9-39). However, in general, the problem of determining the best estimator of $\theta$, or even of showing that such an estimator exists, is not simple. In the following, we determine the greatest lower bound of the variance of most estimators. This result can be used to establish whether a particular estimator is the best or that it is close to the best.

We shall assume that the density $f(x, \theta)$ of $\mathbf{x}$ is differentiable with respect to $\theta$ and that the boundary of the domain of $x$ does not depend on $\theta$. Differentiating the area condition $\int f(x, \theta)\,dx = 1$ with respect to $\theta$, we obtain the identity

$$\int_{-\infty}^{\infty}\frac{\partial f(x, \theta)}{\partial\theta}\,dx = 0 \tag{9-42}$$

A density satisfying the conditions leading to this identity will be called regular. We show next that

$$E\left\{\frac{\partial L(X, \theta)}{\partial\theta}\right\} = 0 \qquad E\left\{\left|\frac{\partial L(X, \theta)}{\partial\theta}\right|^2\right\} = nI \tag{9-43}$$

where $L(X, \theta) = \ln f(X, \theta)$ is the log-likelihood of $X$ and $nI$ is the information about $\theta$ contained in $X$ [see also (9-40)].

Proof. From the identity $L(x, \theta) = \ln f(x, \theta)$ and (9-42), it follows that

$$E\left\{\frac{\partial L(\mathbf{x}, \theta)}{\partial\theta}\right\} = \int_{-\infty}^{\infty}\frac{1}{f(x, \theta)}\frac{\partial f(x, \theta)}{\partial\theta}f(x, \theta)\,dx = 0$$

This shows that the mean of the function $\partial L(\mathbf{x}, \theta)/\partial\theta$ is 0; hence its variance equals $E\{|\partial L(\mathbf{x}, \theta)/\partial\theta|^2\}$. Inserting into (9-38), we obtain (9-43) because the RVs $\ln f(\mathbf{x}_i, \theta)$ are independent.
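We shall use (9-43) to determine the greatest lower bound of the variance of an arbitrary estimator $\hat{\theta}$ of $\theta$. Suppose first that $\hat{\theta} = g(X)$ is an unbiased estimator of $\theta$:

$$E\{\hat{\theta}\} = \int_R g(X)f(X, \theta)\,dX = \theta$$

Differentiating with respect to $\theta$, we obtain

$$1 = \int_R g(X)\frac{\partial f(X, \theta)}{\partial\theta}\,dX = \int_R g(X)\frac{\partial L(X, \theta)}{\partial\theta}f(X, \theta)\,dX$$

This yields

$$E\left\{g(X)\frac{\partial L(X, \theta)}{\partial\theta}\right\} = 1 \tag{9-44}$$

Multiplying the first equation of (9-43) by $\theta$ and subtracting from (9-44), we obtain

$$E\left\{[g(X) - \theta]\frac{\partial L(X, \theta)}{\partial\theta}\right\} = 1 \tag{9-45}$$

We shall use this identity to prove the following important result.

THEOREM. The variance $E\{[g(X) - \theta]^2\}$ of any unbiased estimator $\hat{\theta}$ of $\theta$ cannot be smaller than $1/nI$:

$$\sigma_{\hat{\theta}}^2 \ge \frac{1}{nI} \tag{9-46}$$

Proof. The proof is based on Schwarz's inequality

$$E^2\{\mathbf{z}\mathbf{w}\} \le E\{\mathbf{z}^2\}E\{\mathbf{w}^2\} \tag{9-47}$$

Squaring both sides of (9-45) and applying (9-47) to the RVs $\mathbf{z} = g(X) - \theta$ and $\mathbf{w} = \partial L(X, \theta)/\partial\theta$, we obtain

$$1 \le E\{[g(X) - \theta]^2\}\,E\left\{\left|\frac{\partial L(X, \theta)}{\partial\theta}\right|^2\right\} \tag{9-48}$$

and (9-46) results.

We shall now determine the class of functions for which the estimator $\hat{\theta}$ is best, that is, for which (9-46) is an equality. As we know, (9-47) is an equality if $\mathbf{z} = c\mathbf{w}$. Hence (9-48) is an equality if $g(X) - \theta = c\,\partial L(X, \theta)/\partial\theta$. To find $c$, we insert into (9-48) and use (9-43). This yields $c = 1/nI$; hence

$$\frac{\partial L(X, \theta)}{\partial\theta} = nI[g(X) - \theta] \tag{9-49}$$

Thus the estimate $\hat{\theta} = g(X)$ is best if the log-likelihood function $L(X, \theta)$ satisfies (9-49).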
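As an illustration of condition (9-49), consider the familiar case, chosen here as an example and not taken from the text, of estimating the mean $\eta$ of an $N(\eta, \sigma)$ RV with known $\sigma$. The following short derivation suggests that the sample mean satisfies (9-49) and therefore attains the bound (9-46).

```latex
% Log-likelihood of n independent N(\eta, \sigma) samples, \sigma known
L(X, \eta) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \eta)^2

% Its derivative with respect to \eta
\frac{\partial L(X, \eta)}{\partial\eta}
  = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \eta)
  = \frac{n}{\sigma^2}\,(\bar{x} - \eta)

% Information about \eta in a single sample, from (9-40)
I = E\left\{\left|\frac{\mathbf{x} - \eta}{\sigma^2}\right|^2\right\} = \frac{1}{\sigma^2}
\qquad\Longrightarrow\qquad
\frac{\partial L(X, \eta)}{\partial\eta} = nI\,[\bar{x} - \eta]

% This is (9-49) with g(X) = \bar{x}; the variance of \bar{x} equals the bound
\sigma_{\bar{x}}^2 = \frac{\sigma^2}{n} = \frac{1}{nI}
```

In this case the bound is reached for every $n$, not only asymptotically.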
COROLLARY. If $\hat{\theta} = g(X)$ is a biased estimator of $\theta$ with mean $E\{\hat{\theta}\} = \tau(\theta)$, then

$$\sigma_{\hat{\theta}}^2 \ge \frac{[\tau'(\theta)]^2}{nI} \tag{9-50}$$

Proof. The statistic $\hat{\theta} = g(X)$ is an unbiased estimator of the parameter $\tau = \tau(\theta)$. We can, therefore, apply (9-46) provided that we replace $\theta$ by $\tau(\theta)$ and $nI$ by the information about $\tau$ contained in $X$. Since

$$\frac{\partial L[X, \theta(\tau)]}{\partial\tau} = \frac{\partial L(X, \theta)}{\partial\theta}\frac{d\theta}{d\tau} \qquad \text{and} \qquad \theta'(\tau) = \frac{1}{\tau'(\theta)}$$

we obtain

$$E\left\{\left|\frac{\partial L[X, \theta(\tau)]}{\partial\tau}\right|^2\right\} = \frac{nI}{[\tau'(\theta)]^2}$$

and (9-50) results.

Reasoning as in (9-49), we conclude that (9-50) is an equality iff

$$\frac{\partial L[X, \theta(\tau)]}{\partial\tau} = \frac{nI}{[\tau'(\theta)]^2}[g(X) - \tau(\theta)] \tag{9-51}$$

Note. If $f(x, \theta)$ is a density of exponential type, that is, if

$$f(x, \theta) = h(x)\exp\{a(\theta)q(x) + b(\theta)\} \tag{9-52}$$

then the statistic $\hat{\theta} = (1/n)\sum q(x_i)$ is the best estimator of the parameter $\tau(\theta) = -b'(\theta)/a'(\theta)$. This follows readily from (9-51).

9-3 HYPOTHESIS TESTING

A statistical hypothesis is an assumption about the value of one or more parameters of a statistical model. Hypothesis testing is a process of establishing the validity of a hypothesis. This topic is fundamental in a variety of applications: Is Mendel's theory of heredity valid? Is the number of particles emitted from a radioactive substance Poisson distributed? Does the value of a parameter in a scientific investigation equal a specific constant? Are two events independent? Does the mean of an RV change if certain factors of the experiment are modified? Does smoking decrease life expectancy? Do voting patterns depend on sex? Do IQ scores depend on parental education? The list is endless.

We shall introduce the main concepts of hypothesis testing in the context of the following problem: The distribution of an RV $\mathbf{x}$ is a known function $F(x, \theta)$ depending on a parameter $\theta$. We wish to test the assumption $\theta = \theta_0$ against the assumption $\theta \ne \theta_0$. The assumption that $\theta = \theta_0$ is denoted by $H_0$ and is called the null hypothesis. The assumption that $\theta \ne \theta_0$ is denoted
by $H_1$ and is called the alternative hypothesis. The values that $\theta$ might take under the alternative hypothesis form a set $\Theta_1$ in the parameter space. If $\Theta_1$ consists of a single point $\theta = \theta_1$, the hypothesis is called simple; otherwise, it is called composite. The null hypothesis is in most cases simple.

The purpose of hypothesis testing is to establish whether experimental evidence supports the rejection of the null hypothesis. The decision is based on the location of the observed sample $X$ of $\mathbf{x}$. Suppose that under hypothesis $H_0$ the density $f(X, \theta_0)$ of the sample vector $X$ is negligible in a certain region $D_c$ of the sample space, taking significant values only in the complement $\bar{D}_c$ of $D_c$. It is reasonable then to reject $H_0$ if $X$ is in $D_c$ and to accept $H_0$ if $X$ is in $\bar{D}_c$. The set $D_c$ is called the critical region of the test and the set $\bar{D}_c$ is called the region of acceptance of $H_0$. The test is thus specified in terms of the set $D_c$.

We should stress that the purpose of hypothesis testing is not to determine whether $H_0$ or $H_1$ is true. It is to establish whether the evidence supports the rejection of $H_0$. The terms "accept" and "reject" must, therefore, be interpreted accordingly. Suppose, for example, that we wish to establish whether the hypothesis $H_0$ that a coin is fair is true. To do so, we toss the coin 100 times and observe that heads show $k$ times. If $k = 15$, we reject $H_0$, that is, we decide on the basis of the evidence that the fair-coin hypothesis should be rejected. If $k = 49$, we accept $H_0$, that is, we decide that the evidence does not support the rejection of the fair-coin hypothesis. The evidence alone, however, does not lead to the conclusion that the coin is fair. We could have as well concluded that $p = 0.49$.

In hypothesis testing two kinds of errors might occur depending on the location of $X$:

1. Suppose first that $H_0$ is true. If $X \in D_c$, we reject $H_0$ even though it is true. We then say that a Type I error is committed.
The probability for such an error is denoted by $\alpha$ and is called the significance level of the test. Thus

$$\alpha = P\{X \in D_c \mid H_0\} \tag{9-53}$$

The difference $1 - \alpha = P\{X \notin D_c \mid H_0\}$ equals the probability that we accept $H_0$ when true. In this notation, $P\{\cdots \mid H_0\}$ is not a conditional probability. The symbol $H_0$ merely indicates that $H_0$ is true.

2. Suppose next that $H_0$ is false. If $X \notin D_c$, we accept $H_0$ even though it is false. We then say that a Type II error is committed. The probability for such an error is a function $\beta(\theta)$ of $\theta$ called the operating characteristic (OC) of the test. Thus

$$\beta(\theta) = P\{X \notin D_c \mid H_1\} \tag{9-54}$$

The difference $1 - \beta(\theta)$ is the probability that we reject $H_0$ when false. This is denoted by $P(\theta)$ and is called the power of the test. Thus

$$P(\theta) = 1 - \beta(\theta) = P\{X \in D_c \mid H_1\} \tag{9-55}$$
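The two error probabilities can be computed directly for the coin experiment described earlier. The following is a minimal sketch, assuming $n = 100$ tosses and a hypothetical critical region $|k - 50| > c$ with $c = 10$; the threshold and the function names are illustrative choices, not part of the text.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k heads in n tosses with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def in_critical_region(k, n=100, c=10):
    """Hypothetical critical region D_c: the head count is far from n/2."""
    return abs(k - n / 2) > c


def beta(p, n=100, c=10):
    """OC function (9-54): probability that k falls outside D_c when P(heads) = p."""
    return sum(
        binom_pmf(k, n, p)
        for k in range(n + 1)
        if not in_critical_region(k, n, c)
    )


# Significance level (9-53): probability of rejecting a fair coin.
alpha = 1 - beta(0.5)
# Power (9-55) against the alternative p = 0.6.
power_06 = 1 - beta(0.6)

print(f"alpha = {alpha:.4f}, power at p=0.6 = {power_06:.4f}")
```

With this region, $k = 15$ leads to rejection and $k = 49$ to acceptance, in agreement with the discussion of the fair-coin hypothesis. Widening the region (smaller $c$) raises the power but also raises $\alpha$.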
Fundamental note. Hypothesis testing is not a part of statistics. It is part of decision theory based on statistics. Statistical considerations alone cannot lead to a decision. They merely lead to the following probabilistic statements:

$$\begin{aligned} &\text{If } H_0 \text{ is true, then } P\{X \in D_c\} = \alpha \\ &\text{If } H_0 \text{ is false, then } P\{X \notin D_c\} = \beta(\theta) \end{aligned} \tag{9-56}$$

Guided by these statements, we "reject" $H_0$ if $X \in D_c$ and we "accept" $H_0$ if $X \notin D_c$. These decisions are not based on (9-56) alone. They take into consideration other, often subjective, factors, for example, our prior knowledge concerning the truth of $H_0$, or the consequences of a wrong decision.

The test of a hypothesis is specified in terms of its critical region. The region $D_c$ is chosen so as to keep the probabilities of both types of errors small. However, both probabilities cannot be arbitrarily small because a decrease in $\alpha$ results in an increase in $\beta$. In most applications, it is more important to control $\alpha$. The selection of the region $D_c$ proceeds thus as follows:

Assign a value to the Type I error probability $\alpha$ and search for a region $D_c$ of the sample space so as to minimize the Type II error probability for a specific $\theta$. If the resulting $\beta(\theta)$ is too large, increase $\alpha$ to its largest tolerable value; if $\beta(\theta)$ is still too large, increase the number $n$ of samples.

A test is called most powerful if $\beta(\theta)$ is minimum. In general, the critical region of a most powerful test depends on $\theta$. If it is the same for every $\theta \in \Theta_1$, the test is uniformly most powerful. Such a test does not always exist. The determination of the critical region of a most powerful test involves a search in the $n$-dimensional sample space. In the following, we introduce a simpler approach.

TEST STATISTIC. Prior to any experimentation, we select a function $q = g(X)$ of the sample vector $X$. We then find a set $R_c$ of the real line where under hypothesis $H_0$ the density of $q$ is negligible, and we reject $H_0$ if the value $q = g(X)$ of $q$ is in $R_c$.
The set $R_c$ is the critical region of the test; the RV $q$ is the test statistic. In the selection of the function $g(X)$ we are guided by the point estimate of $\theta$.

In a hypothesis test based on a test statistic, the two types of errors are expressed in terms of the region $R_c$ of the real line and the density $f_q(q,\theta)$ of the test statistic $q$:

$$\alpha = P\{q \in R_c \mid H_0\} = \int_{R_c} f_q(q,\theta_0)\,dq \tag{9-57}$$

$$\beta(\theta) = P\{q \notin R_c \mid H_1\} = \int_{\bar R_c} f_q(q,\theta)\,dq \tag{9-58}$$

To carry out the test, we determine first the function $f_q(q,\theta)$. We then assign a value to $\alpha$ and we search for a region $R_c$ minimizing $\beta(\theta)$. The search
is now limited to the real line. We shall assume that the function $f_q(q,\theta)$ has a single maximum. This is the case for most practical tests.

FIGURE 9-10

Our objective is to test the hypothesis $\theta = \theta_0$ against each of the hypotheses $\theta \ne \theta_0$, $\theta > \theta_0$, and $\theta < \theta_0$. To be concrete, we shall assume that the function $f_q(q,\theta)$ is concentrated on the right of $f_q(q,\theta_0)$ for $\theta > \theta_0$ and on its left for $\theta < \theta_0$ as in Fig. 9-10.

$H_1$: $\theta \ne \theta_0$

Under the stated assumptions, the most likely values of $q$ are on the right of $f_q(q,\theta_0)$ if $\theta > \theta_0$ and on its left if $\theta < \theta_0$. It is, therefore, desirable to reject $H_0$ if $q < c_1$ or if $q > c_2$. The resulting critical region consists of the half-lines $q < c_1$ and $q > c_2$. For convenience, we shall select the constants $c_1$ and $c_2$ such that

$$P\{q < c_1 \mid H_0\} = \frac{\alpha}{2} \qquad P\{q > c_2 \mid H_0\} = \frac{\alpha}{2}$$

Denoting by $q_u$ the $u$ percentile of $q$ under hypothesis $H_0$, we conclude that $c_1 = q_{\alpha/2}$, $c_2 = q_{1-\alpha/2}$. This yields the following test:

$$\text{Accept } H_0 \text{ iff } q_{\alpha/2} < q < q_{1-\alpha/2} \tag{9-59a}$$

The resulting OC function equals

$$\beta(\theta) = \int_{q_{\alpha/2}}^{q_{1-\alpha/2}} f_q(q,\theta)\,dq \tag{9-60a}$$

$H_1$: $\theta > \theta_0$

Under hypothesis $H_1$, the most likely values of $q$ are on the right of $f_q(q,\theta_0)$. It is, therefore, desirable to reject $H_0$ if $q > c$. The resulting critical region is now
the half-line $q > c$ where $c$ is such that

$$P\{q > c \mid H_0\} = \alpha \qquad c = q_{1-\alpha}$$

and the following test results:

$$\text{Accept } H_0 \text{ iff } q < q_{1-\alpha} \tag{9-59b}$$

The resulting OC function equals

$$\beta(\theta) = \int_{-\infty}^{q_{1-\alpha}} f_q(q,\theta)\,dq \tag{9-60b}$$

$H_1$: $\theta < \theta_0$

Proceeding similarly, we obtain the critical region $q < c$ where $c$ is such that

$$P\{q < c \mid H_0\} = \alpha \qquad c = q_\alpha$$

This yields the following test:

$$\text{Accept } H_0 \text{ iff } q > q_\alpha \tag{9-59c}$$

The resulting OC function equals

$$\beta(\theta) = \int_{q_\alpha}^{\infty} f_q(q,\theta)\,dq \tag{9-60c}$$

The test of a hypothesis thus involves the following steps: Select a test statistic $q = g(X)$ and determine its density. Observe the sample $X$ and compute the function $q = g(X)$. Assign a value to $\alpha$ and determine the critical region $R_c$. Reject $H_0$ iff $q \in R_c$.

In the following, we give several illustrations of hypothesis testing. The results are based on (9-59) and (9-60). In certain cases, the density of $q$ is known for $\theta = \theta_0$ only. This suffices to determine the critical region. The OC function $\beta(\theta)$, however, cannot be determined.

MEAN. We shall test the hypothesis $H_0$: $\eta = \eta_0$ that the mean $\eta$ of an RV $x$ equals a given constant $\eta_0$.

Known variance. We use as the test statistic the RV

$$q = \frac{\bar x - \eta_0}{\sigma/\sqrt{n}} \tag{9-61}$$

Under the familiar assumptions, $\bar x$ is $N(\eta, \sigma/\sqrt{n})$; hence $q$ is $N(\eta_q, 1)$ where

$$\eta_q = \frac{\eta - \eta_0}{\sigma/\sqrt{n}} \tag{9-62}$$

Under hypothesis $H_0$, $q$ is $N(0,1)$. Replacing in (9-59) and (9-60) the $q_u$
percentile by the standard normal percentile $z_u$, we obtain the following test:

$H_1$: $\eta \ne \eta_0$

$$\text{Accept } H_0 \text{ iff } z_{\alpha/2} < q < z_{1-\alpha/2} \tag{9-63a}$$

$$\beta(\eta) = P\{|q| < z_{1-\alpha/2} \mid H_1\} = G(z_{1-\alpha/2} - \eta_q) - G(z_{\alpha/2} - \eta_q) \tag{9-64a}$$

$H_1$: $\eta > \eta_0$

$$\text{Accept } H_0 \text{ iff } q < z_{1-\alpha} \tag{9-63b}$$

$$\beta(\eta) = P\{q < z_{1-\alpha} \mid H_1\} = G(z_{1-\alpha} - \eta_q) \tag{9-64b}$$

$H_1$: $\eta < \eta_0$

$$\text{Accept } H_0 \text{ iff } q > z_\alpha \tag{9-63c}$$

$$\beta(\eta) = P\{q > z_\alpha \mid H_1\} = 1 - G(z_\alpha - \eta_q) \tag{9-64c}$$

Unknown variance. We assume that $x$ is normal and use as the test statistic the RV

$$q = \frac{\bar x - \eta_0}{s/\sqrt{n}} \tag{9-65}$$

where $s^2$ is the sample variance of $x$. Under hypothesis $H_0$, the RV $q$ has a Student-$t$ distribution with $n - 1$ degrees of freedom. We can, therefore, use (9-59) where we replace $q_u$ by the tabulated $t_u(n - 1)$ percentile. To find $\beta(\eta)$, we must find the distribution of $q$ for $\eta \ne \eta_0$.

Example 9-14. We measure the voltage $V$ of a voltage source 25 times and we find $\bar x = 110.12$ V (see also Example 9-3). Test the hypothesis $V = V_0 = 110$ V against $V \ne 110$ V with $\alpha = 0.05$. Assume that the measurement error $\nu$ is $N(0,\sigma)$.

(a) Suppose that $\sigma = 0.4$ V. In this problem, $z_{1-\alpha/2} = z_{0.975} \approx 2$:

$$q = \frac{110.12 - 110}{0.4/\sqrt{25}} = 1.5$$

Since 1.5 is in the interval $(-2, 2)$, we accept $H_0$.

(b) Suppose that $\sigma$ is unknown. From the measurements we find $s = 0.6$ V. Inserting into (9-65), we obtain

$$q = \frac{110.12 - 110}{0.6/\sqrt{25}} = 1$$

Table 9-3 yields $t_{1-\alpha/2}(n - 1) = t_{0.975}(24) = 2.06 = -t_{0.025}(24)$. Since 1 is in the interval $(-2.06, 2.06)$, we accept $H_0$.

PROBABILITY. We shall test the hypothesis $H_0$: $p = p_0 = 1 - q_0$ that the probability $p = P(\mathcal{A})$ of an event $\mathcal{A}$ equals a given constant $p_0$, using as data the number $k$ of successes of $\mathcal{A}$ in $n$ trials. The RV $k$ has a binomial distribution and for large $n$ it is $N(np, \sqrt{np(1-p)})$. We shall assume that $n$ is large. The test will be based on the test statistic

$$q = \frac{k - np_0}{\sqrt{np_0q_0}} \tag{9-66}$$

Under hypothesis $H_0$, $q$ is $N(0,1)$. The test thus proceeds as in (9-63).
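The arithmetic of Example 9-14 and the OC function (9-64a) can be checked numerically. The sketch below uses only the Python standard library; the function names are illustrative, and the Student-$t$ threshold 2.06 is taken from the example rather than computed.

```python
from math import sqrt
from statistics import NormalDist

G = NormalDist().cdf  # standard normal distribution G(x)


def mean_test_known_sigma(xbar, eta0, sigma, n, alpha=0.05):
    """Two-sided test (9-63a); returns the statistic, the threshold, and the decision."""
    q = (xbar - eta0) / (sigma / sqrt(n))
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{1-alpha/2}
    return q, z, abs(q) < z


def oc_two_sided(eta, eta0, sigma, n, alpha=0.05):
    """OC function (9-64a), using z_{alpha/2} = -z_{1-alpha/2}."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    eta_q = (eta - eta0) / (sigma / sqrt(n))
    return G(z - eta_q) - G(-z - eta_q)


# Part (a): sigma = 0.4 V known.
q_a, z_a, accept_a = mean_test_known_sigma(110.12, 110.0, 0.4, 25)

# Part (b): sigma unknown, s = 0.6 V; threshold quoted in the example.
T_CRIT = 2.06
q_b = (110.12 - 110.0) / (0.6 / sqrt(25))
accept_b = abs(q_b) < T_CRIT

print(q_a, accept_a, q_b, accept_b)
print(oc_two_sided(110.2, 110.0, 0.4, 25))  # Type II error probability if V = 110.2 V
```

At $\eta = \eta_0$ the OC function equals $1 - \alpha$, which is a convenient sanity check on any implementation of (9-64a).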
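To find the OC function $\beta(p)$, we must determine the distribution of $q$ under the alternative hypothesis. Since $k$ is normal, $q$ is also normal with

$$\eta_q = \frac{np - np_0}{\sqrt{np_0q_0}} \qquad \sigma_q^2 = \frac{np(1-p)}{np_0q_0}$$

This yields the following test:

$H_1$: $p \ne p_0$

$$\text{Accept } H_0 \text{ iff } z_{\alpha/2} < q < z_{1-\alpha/2} \tag{9-67a}$$

$$\beta(p) = P\{|q| < z_{1-\alpha/2} \mid H_1\} = G\!\left(\frac{z_{1-\alpha/2} - \eta_q}{\sigma_q}\right) - G\!\left(\frac{z_{\alpha/2} - \eta_q}{\sigma_q}\right) \tag{9-68a}$$

$H_1$: $p > p_0$

$$\text{Accept } H_0 \text{ iff } q < z_{1-\alpha} \tag{9-67b}$$

$$\beta(p) = P\{q < z_{1-\alpha} \mid H_1\} = G\!\left(\frac{z_{1-\alpha} - \eta_q}{\sigma_q}\right) \tag{9-68b}$$

$H_1$: $p < p_0$

$$\text{Accept } H_0 \text{ iff } q > z_\alpha \tag{9-67c}$$

$$\beta(p) = P\{q > z_\alpha \mid H_1\} = 1 - G\!\left(\frac{z_\alpha - \eta_q}{\sigma_q}\right) \tag{9-68c}$$

Here $\sigma_q = \sqrt{p(1-p)/p_0q_0}$.

Example 9-15. We wish to test the hypothesis that a coin is fair against the hypothesis that it is loaded in favor of "heads":

$$H_0\colon p = 0.5 \quad\text{against}\quad H_1\colon p > 0.5$$

We toss the coin 100 times and "heads" shows 62 times. Does the evidence support the rejection of the null hypothesis with significance level $\alpha = 0.05$?

In this example, $z_{1-\alpha} = z_{0.95} = 1.645$. Since

$$q = \frac{62 - 50}{\sqrt{25}} = 2.4 > 1.645$$

the fair-coin hypothesis is rejected.

VARIANCE. The RV $x$ is $N(\eta, \sigma)$. We wish to test the hypothesis $H_0$: $\sigma = \sigma_0$.

Known mean. We use as test statistic the RV

$$q = \sum_i \left(\frac{x_i - \eta}{\sigma_0}\right)^2 \tag{9-69}$$

Under hypothesis $H_0$, this RV is $\chi^2(n)$. We can, therefore, use (9-59) where $q_u$ equals the $\chi_u^2(n)$ percentile.

Unknown mean. We use as the test statistic the RV

$$q = \sum_i \left(\frac{x_i - \bar x}{\sigma_0}\right)^2 \tag{9-70}$$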
Under hypothesis $H_0$, this RV is $\chi^2(n - 1)$. We can, therefore, use (9-59) with $q_u = \chi_u^2(n - 1)$.

Example 9-16. Suppose that in Example 9-14, the variance $\sigma^2$ of the measurement error is unknown. Test the hypothesis $H_0$: $\sigma = 0.4$ against $\sigma > 0.4$ with $\alpha = 0.05$ using 20 measurements $x_i = V + \nu_i$.

(a) Assume that $V = 110$ V. Inserting the measurements $x_i$ into (9-69), we find

$$q = \sum_{i=1}^{20} \left(\frac{x_i - 110}{0.4}\right)^2 = 36.2$$

Since $\chi_{1-\alpha}^2(n) = \chi_{0.95}^2(20) = 31.41 < 36.2$, we reject $H_0$.

(b) If $V$ is unknown, we use (9-70). This yields

$$q = \sum_{i=1}^{20} \left(\frac{x_i - \bar x}{0.4}\right)^2 = 22.5$$

Since $\chi_{1-\alpha}^2(n - 1) = \chi_{0.95}^2(19) = 30.14 > 22.5$, we accept $H_0$.

DISTRIBUTIONS. In this application, $H_0$ does not involve a parameter; it is the hypothesis that the distribution $F(x)$ of an RV $x$ equals a given function $F_0(x)$. Thus

$$H_0\colon F(x) \equiv F_0(x) \quad\text{against}\quad H_1\colon F(x) \not\equiv F_0(x)$$

The Kolmogoroff-Smirnov test. We form the random process $\hat F(x)$ as in the estimation problem (see page 256) and use as the test statistic the RV

$$q = \max_x |\hat F(x) - F_0(x)| \tag{9-71}$$

This choice is based on the following observations: For a specific $\zeta$, the function $\hat F(x)$ is the empirical estimate of $F(x)$ [see (4-3)]; it tends, therefore, to $F(x)$ as $n \to \infty$. From this it follows that

$$E\{\hat F(x)\} = F(x) \qquad \hat F(x) \to F(x) \quad\text{as } n \to \infty$$

This shows that for large $n$, $q$ is close to 0 if $H_0$ is true and it is close to $\max_x |F(x) - F_0(x)|$ if $H_1$ is true. It leads, therefore, to the conclusion that we must reject $H_0$ if $q$ is larger than some constant $c$. This constant is determined in terms of the significance level $\alpha = P\{q > c \mid H_0\}$ and the distribution of $q$. Under hypothesis $H_0$, the test statistic $q$ equals the RV $w$ in (9-28). Using the Kolmogoroff approximation (9-29), we obtain

$$\alpha = P\{q > c \mid H_0\} \simeq 2e^{-2nc^2} \tag{9-72}$$

The test thus proceeds as follows: Form the empirical estimate $\hat F(x)$ of $F(x)$
and determine $q$ from (9-71).

$$\text{Accept } H_0 \text{ iff } q < \sqrt{-\frac{1}{2n}\ln\frac{\alpha}{2}} \tag{9-73}$$

The resulting Type II error probability is reasonably small only if $n$ is large.

Chi-Square Tests

We are given a partition $U = [\mathcal{A}_1, \ldots, \mathcal{A}_m]$ of the space $\mathcal{S}$ and we wish to test the hypothesis that the probabilities $p_i = P(\mathcal{A}_i)$ of the events $\mathcal{A}_i$ equal $m$ given constants $p_{0i}$:

$$H_0\colon p_i = p_{0i}, \text{ all } i \quad\text{against}\quad H_1\colon p_i \ne p_{0i}, \text{ some } i \tag{9-74}$$

using as data the number of successes $k_i$ of each of the events $\mathcal{A}_i$ in $n$ trials. For this purpose, we introduce the sum

$$q = \sum_{i=1}^{m} \frac{(k_i - np_{0i})^2}{np_{0i}} \tag{9-75}$$

known as Pearson's test statistic. As we know, the RVs $k_i$ have a binomial distribution with mean $np_i$ and variance $np_i(1 - p_i)$. Hence the ratio $k_i/n$ tends to $p_i$ as $n \to \infty$. From this it follows that the difference $|k_i - np_{0i}|$ is small if $p_i = p_{0i}$ and it increases as $|p_i - p_{0i}|$ increases. This justifies the use of the RV $q$ as a test statistic and the set $q > c$ as the critical region of the test.

To find $c$, we must determine the distribution of $q$. We shall do so under the assumption that $n$ is large. For moderate values of $n$, we use computer simulation [see (9-85)]. With this assumption, the RVs $k_i$ are nearly normal with mean $np_i$. Under hypothesis $H_0$, the RV $q$ has a $\chi^2(m - 1)$ distribution. This follows from the fact that the constants $p_{0i}$ satisfy the constraint $\sum p_{0i} = 1$. The proof, however, is rather involved.

The above leads to the following test: Observe the numbers $k_i$ and compute the sum $q$ in (9-75); find $\chi_{1-\alpha}^2(m - 1)$ from Table 9-3.

$$\text{Accept } H_0 \text{ iff } q < \chi_{1-\alpha}^2(m - 1) \tag{9-76}$$

We note that the chi-square test is reduced to the test (9-68) involving the probability $p$ of an event $\mathcal{A}$. In this case, the partition $U$ equals $[\mathcal{A}, \bar{\mathcal{A}}]$ and the statistic $q$ in (9-75) equals $(k - np_0)^2/np_0q_0$ where $p_0 = p_{01}$, $q_0 = p_{02}$, $k = k_1$, and $n - k = k_2$ (see Prob. 9-40).

Example 9-17. We roll a die 300 times and we observe that $f_i$ shows $k_i = 55\ 43\ 44\ 61\ 40\ 57$ times. Test the hypothesis that the die is fair with $\alpha = 0.05$.
In this problem, $p_{0i} = 1/6$, $m = 6$, and $np_{0i} = 50$. Inserting into (9-75), we obtain

$$q = \sum_{i=1}^{6} \frac{(k_i - 50)^2}{50} = 7.6$$

Since $\chi_{0.95}^2(5) = 11.07 > 7.6$, we accept the fair-die hypothesis.
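The computation in Example 9-17 is short enough to verify directly. The following sketch evaluates Pearson's sum (9-75) for the observed counts; the function name is illustrative, and the percentile 11.07 is the tabulated value quoted in the example, not something the code derives.

```python
def pearson_statistic(counts, probs):
    """Pearson's sum (9-75) for observed counts k_i and hypothesized probabilities p_0i."""
    n = sum(counts)
    return sum((k - n * p) ** 2 / (n * p) for k, p in zip(counts, probs))


counts = [55, 43, 44, 61, 40, 57]   # observed face counts in 300 rolls
probs = [1 / 6] * 6                 # fair-die hypothesis
CHI2_095_5 = 11.07                  # chi-square 0.95 percentile, 5 degrees of freedom (from the example)

q = pearson_statistic(counts, probs)
accept = q < CHI2_095_5             # test (9-76)
print(round(q, 3), accept)
```

The same function applies unchanged to any partition, since only the counts and the hypothesized probabilities enter (9-75).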
The chi-square test is used in goodness-of-fit tests involving the agreement between experimental data and theoretical models. We next give two illustrations.

TESTS OF INDEPENDENCE. We shall test the hypothesis that two events $\mathcal{B}$ and $\mathcal{C}$ are independent:

$$H_0\colon P(\mathcal{B} \cap \mathcal{C}) = P(\mathcal{B})P(\mathcal{C}) \quad\text{against}\quad H_1\colon P(\mathcal{B} \cap \mathcal{C}) \ne P(\mathcal{B})P(\mathcal{C}) \tag{9-77}$$

under the assumption that the probabilities $b = P(\mathcal{B})$ and $c = P(\mathcal{C})$ of these events are known. To do so, we apply the chi-square test to the partition consisting of the four events

$$\mathcal{A}_1 = \mathcal{B} \cap \mathcal{C} \qquad \mathcal{A}_2 = \mathcal{B} \cap \bar{\mathcal{C}} \qquad \mathcal{A}_3 = \bar{\mathcal{B}} \cap \mathcal{C} \qquad \mathcal{A}_4 = \bar{\mathcal{B}} \cap \bar{\mathcal{C}}$$

Under hypothesis $H_0$, the components of each of the events $\mathcal{A}_i$ are independent. Hence

$$p_{01} = bc \qquad p_{02} = b(1 - c) \qquad p_{03} = (1 - b)c \qquad p_{04} = (1 - b)(1 - c)$$

This yields the following test:

$$\text{Accept } H_0 \text{ iff } \sum_{i=1}^{4} \frac{(k_i - np_{0i})^2}{np_{0i}} < \chi_{1-\alpha}^2(3) \tag{9-78}$$

In the above, $k_i$ is the number of occurrences of the event $\mathcal{A}_i$; for example, $k_2$ is the number of times $\mathcal{B}$ occurs but $\mathcal{C}$ does not occur.

Example 9-18. In a certain university, 60 percent of all first-year students are male and 75 percent of all entering students graduate. We select at random the records of 299 males and 101 females and we find that 168 males and 68 females graduated. Test the hypothesis that the events $\mathcal{B} = \{\text{male}\}$ and $\mathcal{C} = \{\text{graduate}\}$ are independent with $\alpha = 0.05$.

In this problem, $n = 400$, $P(\mathcal{B}) = 0.6$, $P(\mathcal{C}) = 0.75$, $p_{0i} = 0.45\ 0.15\ 0.3\ 0.1$, $k_i = 168\ 68\ 131\ 33$, and (9-75) yields

$$q = \sum_{i=1}^{4} \frac{(k_i - 400p_{0i})^2}{400p_{0i}} = 4.1$$

Since $\chi_{0.95}^2(3) = 7.81 > 4.1$, we accept the independence hypothesis.

TESTS OF DISTRIBUTIONS. We introduced earlier the problem of testing the hypothesis that the distribution $F(x)$ of an RV $x$ equals a given function $F_0(x)$. The resulting test is reliable only if the number of available samples $x_j$ of $x$ is very large. In the following, we test the hypothesis that $F(x) = F_0(x)$ not at every $x$ but only at a set of $m - 1$ points $a_i$ (Fig. 9-11):

$$H_0\colon F(a_i) = F_0(a_i),\ 1 \le i \le m - 1 \quad\text{against}\quad H_1\colon F(a_i) \ne F_0(a_i), \text{ some } i \tag{9-79}$$
FIGURE 9-11

We introduce the $m$ events

$$\mathcal{A}_i = \{a_{i-1} < x \le a_i\} \qquad i = 1, \ldots, m$$

where $a_0 = -\infty$ and $a_m = \infty$. These events form a partition of $\mathcal{S}$. The number $k_i$ of successes of $\mathcal{A}_i$ equals the number of samples $x_j$ in the interval $(a_{i-1}, a_i]$. Under hypothesis $H_0$,

$$P(\mathcal{A}_i) = F_0(a_i) - F_0(a_{i-1}) = p_{0i}$$

Thus, to test the hypothesis (9-79), we form the sum $q$ in (9-75) and apply (9-76). If $H_0$ is rejected, then the hypothesis that $F(x) = F_0(x)$ is also rejected.

Example 9-19. We have a list of 500 computer-generated decimal numbers and we wish to test the hypothesis that they are the samples of an RV $x$ uniformly distributed in the interval $(0, 1)$. We divide this interval into 10 subintervals of length 0.1 and we count the number $k_i$ of samples $x_j$ that are in the $i$th subinterval. The results are

$$k_i = 43\ 56\ 42\ 38\ 59\ 61\ 41\ 57\ 46\ 57$$

In this problem, $n = 500$, $p_{0i} = 0.1$, and

$$q = \sum_{i=1}^{10} \frac{(k_i - 50)^2}{50} = 13.8$$

Since $\chi_{0.95}^2(9) = 16.9 > 13.8$ we accept the uniformity hypothesis.

Likelihood Ratio Test

We conclude with a general method for testing any hypothesis, simple or composite. We are given an RV $x$ with density $f(x, \theta)$, where $\theta$ is an arbitrary parameter, scalar or vector, and we wish to test the hypothesis $H_0$: $\theta \in \Theta_0$
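against $H_1$: $\theta \in \Theta_1$. The sets $\Theta_0$ and $\Theta_1$ are subsets of the parameter space $\Theta = \Theta_0 \cup \Theta_1$.

The density $f(X, \theta)$, considered as a function of $\theta$, is the likelihood function of $X$. We denote by $\theta_m$ the value of $\theta$ for which $f(X, \theta)$ is maximum in the space $\Theta$. Thus $\theta_m$ is the ML estimate of $\theta$. The value of $\theta$ for which $f(X, \theta)$ is maximum in the set $\Theta_0$ will be denoted by $\theta_{m0}$. If $H_0$ is the simple hypothesis $\theta = \theta_0$, then $\theta_{m0} = \theta_0$. The maximum likelihood (ML) test is a test based on the statistic

$$\lambda = \frac{f(X, \theta_{m0})}{f(X, \theta_m)} \tag{9-80}$$

Note that $0 \le \lambda \le 1$ because $f(X, \theta_{m0}) \le f(X, \theta_m)$. We maintain that $\lambda$ is concentrated near 1 if $H_0$ is true. As we know [see (9-41)], the ML estimate $\theta_m$ of $\theta$ tends to its true value $\theta^*$ as $n \to \infty$. Furthermore, under the null hypothesis, $\theta^*$ is in the set $\Theta_0$; hence $\lambda \to 1$ as $n \to \infty$. From this it follows that we must reject $H_0$ if $\lambda < c$. The constant $c$ is determined in terms of the significance level $\alpha$ of the test.

Suppose, first, that $H_0$ is the simple hypothesis $\theta = \theta_0$. In this case,

$$\alpha = P\{\lambda \le c \mid H_0\} = \int_0^c f_\lambda(\lambda, \theta_0)\,d\lambda \tag{9-81}$$

This leads to the following test: Using the samples $x_i$ of $x$, form the likelihood function $f(X, \theta)$. Find $\theta_m$ and $\theta_{m0}$ and form the ratio $\lambda = f(X, \theta_{m0})/f(X, \theta_m)$:

$$\text{Reject } H_0 \text{ iff } \lambda < \lambda_\alpha \tag{9-82}$$

where $\lambda_\alpha$ is the $\alpha$ percentile of the test statistic $\lambda$ under hypothesis $H_0$.

If $H_0$ is a composite hypothesis, $c$ is the largest constant such that $P\{\lambda \le c\} \le \alpha$ for every $\theta \in \Theta_0$.

Example 9-20. Suppose that $f(x, \theta) \sim \theta e^{-\theta x}U(x)$. We shall test the hypothesis

$$H_0\colon 0 < \theta \le \theta_0 \quad\text{against}\quad H_1\colon \theta > \theta_0$$

In this problem, $\Theta_0$ is the segment $0 < \theta \le \theta_0$ of the real line and $\Theta$ is the half-line $\theta > 0$. Thus both hypotheses are composite. The likelihood function

$$f(X, \theta) = \theta^n e^{-n\bar x\theta}$$

is shown in Fig. 9-12a for $\bar x > 1/\theta_0$ and $\bar x < 1/\theta_0$. In the half-line $\theta > 0$ this function is maximum for $\theta = 1/\bar x$. In the interval $0 < \theta \le \theta_0$ it is maximum for $\theta = 1/\bar x$ if $\bar x > 1/\theta_0$ and for $\theta = \theta_0$ if $\bar x < 1/\theta_0$. Hence

$$\theta_m = \frac{1}{\bar x} \qquad \theta_{m0} = \begin{cases} 1/\bar x & \text{for } \bar x > 1/\theta_0 \\ \theta_0 & \text{for } \bar x < 1/\theta_0 \end{cases}$$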
FIGURE 9-12

The likelihood ratio equals (Fig. 9-12b)

$$\lambda = \begin{cases} 1 & \text{for } \bar x > 1/\theta_0 \\ (\bar x\theta_0)^n e^{-n(\theta_0\bar x - 1)} & \text{for } \bar x < 1/\theta_0 \end{cases}$$

We reject $H_0$ if $\lambda < c$ or, equivalently, if $\bar x < c_1$, where $c_1$ equals the $\alpha$ percentile of the RV $\bar x$.

To carry out a likelihood ratio test, we must determine the density of $\lambda$. This is not always a simple task. The following theorem simplifies the problem for large $n$.

ASYMPTOTIC PROPERTIES. We denote by $m$ and $m_0$ the number of free parameters in $\Theta$ and $\Theta_0$ respectively, that is, the number of parameters that take noncountably many values. It can be shown that if $m > m_0$, then the distribution of the RV $w = -2\ln\lambda$ approaches a chi-square distribution with $m - m_0$ degrees of freedom as $n \to \infty$. The function $w = -2\ln\lambda$ is monotone decreasing; hence $\lambda < c$ iff $w > c_1 = -2\ln c$. From this it follows that

$$\alpha = P\{\lambda < c\} = P\{w > c_1\}$$

where $c_1 = \chi_{1-\alpha}^2(m - m_0)$, and (9-82) yields the following test

$$\text{Reject } H_0 \text{ iff } -2\ln\lambda > \chi_{1-\alpha}^2(m - m_0) \tag{9-83}$$

We give next an example illustrating the theorem.

Example 9-21. We are given an $N(\eta, 1)$ RV $x$ and we wish to test the simple hypothesis $\eta = \eta_0$ against $\eta \ne \eta_0$. In this problem $\eta_{m0} = \eta_0$ and

$$f(X, \eta) = \frac{1}{\sqrt{(2\pi)^n}} \exp\left\{-\frac{1}{2}\sum(x_i - \eta)^2\right\}$$

This is maximum if the sum [see (8-66)]

$$\sum(x_i - \eta)^2 = \sum(x_i - \bar x)^2 + n(\bar x - \eta)^2$$
is minimum, that is, if $\eta = \bar x$. Hence $\eta_m = \bar x$ and

$$\lambda = \frac{\exp\{-\frac{1}{2}\sum(x_i - \eta_0)^2\}}{\exp\{-\frac{1}{2}\sum(x_i - \bar x)^2\}} = \exp\left\{-\frac{n}{2}(\bar x - \eta_0)^2\right\}$$

From the above it follows that $\lambda > c$ iff $|\bar x - \eta_0| < c_1$. This shows that the likelihood ratio test of the mean of a normal RV is equivalent to the test (9-63a).

Note that in this problem, $m = 1$ and $m_0 = 0$. Furthermore,

$$w = -2\ln\lambda = n(\bar x - \eta_0)^2 = \left(\frac{\bar x - \eta_0}{1/\sqrt{n}}\right)^2$$

But the right side is an RV with $\chi^2(1)$ distribution. Hence the RV $w$ has a $\chi^2(m - m_0)$ distribution not only asymptotically, but for any $n$.

COMPUTER SIMULATION IN HYPOTHESIS TESTING. As we have seen, the test of a hypothesis $H_0$ involves the following steps: We determine the value $X$ of the random vector $X = [x_1, \ldots, x_m]$ in terms of the observed values $x_k$ of the $m$ RVs and compute the corresponding value $q = g(X)$ of the test statistic. We accept $H_0$ if $q$ is not in the critical region of the test, for example, if $q$ is in the interval $(q_a, q_b)$ where $q_a$ and $q_b$ are appropriately chosen values of the $u$ percentile $q_u$ of $q$ [see (9-59)]. This involves the determination of the distribution $F(q)$ of $q$ and the inverse $q_u = F^{(-1)}(u)$ of $F(q)$. The inversion problem can be avoided if we use the following approach.

The function $F(q)$ is monotone increasing. Hence,

$$q_a < q < q_b \quad\text{iff}\quad a = F(q_a) < F(q) < F(q_b) = b$$

This shows that the test $q_a < q < q_b$ is equivalent to the test

$$\text{Accept } H_0 \text{ iff } a < F(q) < b \tag{9-84}$$

involving the determination of the distribution $F(q)$ of $q$. As we have shown in Sec. 8-3, the function $F(q)$ can be determined by computer simulation [see (8-163)]:

To estimate numerically $F(q)$ we construct the RV vector sequence

$$X_i = [x_{1,i}, \ldots, x_{m,i}] \qquad i = 1, \ldots, n$$

where $x_{k,i}$ are the computer generated samples of the $m$ RVs $x_k$. Using the sequence $X_i$, we form the RN sequence $q_i = g(X_i)$ and we count the number $n_q$ of $q_i$'s that are smaller than the computed $q$. Inserting into (8-163), we obtain the estimate $F(q) \simeq n_q/n$. With $F(q)$ so determined, (9-84) yields the test
$$\text{Accept } H_0 \text{ iff } a < \frac{n_q}{n} < b \tag{9-85}$$

In the above, $q = g(X)$ is a number determined in terms of the experimental data $x_k$. The sequence $q_i$, however, is computer generated.
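The simulation test (9-85) can be sketched for Pearson's statistic (9-75) applied to a fair die. The code below is an illustration under stated assumptions: since the critical region of the chi-square test is $q > c$, it takes $a = 0$ and $b = 1 - \alpha$; the number of simulated runs, the seed, and all function names are arbitrary choices, and `n_runs` plays the role of $n$ in (9-85).

```python
import random


def pearson_statistic(counts, probs):
    """Pearson's sum (9-75)."""
    n = sum(counts)
    return sum((k - n * p) ** 2 / (n * p) for k, p in zip(counts, probs))


def simulate_counts(n_trials, probs, rng):
    """Computer-generated counts k_i for n_trials trials under the hypothesis H0."""
    draws = rng.choices(range(len(probs)), weights=probs, k=n_trials)
    counts = [0] * len(probs)
    for i in draws:
        counts[i] += 1
    return counts


def estimate_F(q_obs, n_trials, probs, n_runs=2000, seed=0):
    """Estimate F(q_obs) as n_q / n_runs, the fraction of simulated q_i below q_obs."""
    rng = random.Random(seed)
    n_q = sum(
        pearson_statistic(simulate_counts(n_trials, probs, rng), probs) < q_obs
        for _ in range(n_runs)
    )
    return n_q / n_runs


probs = [1 / 6] * 6
observed = [55, 43, 44, 61, 40, 57]       # experimental data (300 rolls)
q_obs = pearson_statistic(observed, probs)

alpha = 0.05
F_hat = estimate_F(q_obs, sum(observed), probs)
accept = F_hat < 1 - alpha                # test (9-85) with a = 0, b = 1 - alpha
print(round(q_obs, 3), F_hat, accept)
```

No chi-square table and no inversion of $F(q)$ is needed here; the price is the sampling error of the estimate $n_q/n$, which shrinks as the number of simulated runs grows.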
The above approach is used if it is difficult to determine analytically the function $F(q)$. This is the case for the distribution of Pearson's test statistic (9-75).

PROBLEMS

9-1. The diameter of cylindrical rods coming out of a production line is a normal RV $x$ with $\sigma = 0.1$ mm. We measure $n = 9$ units and find that the average of the measurements is $\bar x = 91$ mm. (a) Find $c$ such that with a 0.95 confidence coefficient, the mean $\eta$ of $x$ is in the interval $\bar x \pm c$. (b) We claim that $\eta$ is in the interval (90.95, 91.05). Find the confidence coefficient of our claim.

9-2. The length of a product is an RV $x$ with $\sigma = 1$ mm and unknown mean. We measure four units and find that $\bar x = 203$ mm. (a) Assuming that $x$ is a normal RV, find the 0.95 confidence interval of $\eta$. (b) The distribution of $x$ is unknown. Using Tchebycheff's inequality, find $c$ such that with confidence coefficient 0.95, $\eta$ is in the interval $203 \pm c$.

9-3. We know from past records that the life length of type A tires is an RV $x$ with $\sigma = 5000$ miles. We test 64 samples and find that their average life length is $\bar x = 25{,}000$ miles. Find the 0.9 confidence interval of the mean of $x$.

9-4. We wish to determine the length $a$ of an object. We use as an estimate of $a$ the average $\bar x$ of $n$ measurements. The measurement error is approximately normal with zero mean and standard deviation 0.1 mm. Find $n$ such that with 95 percent confidence, $\bar x$ is within $\pm 0.2$ mm of $a$.

9-5. The RV $x$ is uniformly distributed in the interval $\theta - 2 < x < \theta + 2$. We observe 100 samples $x_i$ and find that their average equals $\bar x = 30$. Find the 0.95 confidence interval of $\theta$.

9-6. Consider an RV $x$ with density $f(x) = xe^{-x}U(x)$. Predict with 95 percent confidence that the next value of $x$ will be in the interval $(a, b)$. Show that the length $b - a$ of this interval is minimum if $a$ and $b$ are such that

$$f(a) = f(b) \qquad P\{a < x < b\} = 0.95$$

Find $a$ and $b$.

9-7.
(Estimation-prediction) The time to failure of electric bulbs of brand A is a normal RV with $\sigma = 10$ hours and unknown mean. We have used 20 such bulbs and have observed that the average $\bar x$ of their time to failure is 80 hours. We buy a new bulb of the same brand and wish to predict with 95 percent confidence that its time to failure will be in the interval $80 \pm c$. Find $c$.

9-8. Suppose that the time between arrivals of patients in a dentist's office constitutes samples of an RV $x$ with density $\theta e^{-\theta x}U(x)$. The 40th patient arrived 4 hours after the first. Find the 0.95 confidence interval of the mean arrival time $\eta = 1/\theta$.

9-9. The number of particles emitted from a radioactive substance in 1 second is a Poisson distributed RV with mean $\lambda$. It was observed that in 200 seconds, 2550 particles were emitted. Find the 0.95 confidence interval of $\lambda$.

9-10. Among 4000 newborns, 2080 are male. Find the 0.99 confidence interval of the probability $p = P\{\text{male}\}$.
9-11. In an exit poll of 900 voters questioned, 360 responded that they favor a particular proposition. On this basis, it was reported that 40 percent of the voters favor the proposition. (a) Find the margin of error if the confidence coefficient of the results is 0.95. (b) Find the confidence coefficient if the margin of error is $\pm 2$ percent.

9-12. In a market survey, it was reported that 29 percent of respondents favor product A. The poll was conducted with confidence coefficient 0.95, and the margin of error was $\pm 4$ percent. Find the number of respondents.

9-13. We plan a poll for the purpose of estimating the probability $p$ of Republicans in a community. We wish our estimate to be within $\pm 0.02$ of $p$. How large should our sample be if the confidence coefficient of the estimate is 0.95?

9-14. A coin is tossed once, and heads shows. Assuming that the probability $p$ of heads is the value of an RV $p$ uniformly distributed in the interval (0.4, 0.6), find its bayesian estimate.

9-15. The time to failure of a system is an RV $x$ with density $f(x, \theta) = \theta e^{-\theta x}U(x)$. We wish to find the bayesian estimate $\hat\theta$ of $\theta$ in terms of the sample mean $\bar x$ of the $n$ samples $x_i$ of $x$. We assume that $\theta$ is the value of an RV $\theta$ with prior density $f_\theta(\theta) = ce^{-c\theta}U(\theta)$. Show that

$$\hat\theta = \frac{n + 1}{c + n\bar x} \xrightarrow[n \to \infty]{} \frac{1}{\bar x}$$

9-16. The RV $x$ has a Poisson distribution with mean $\theta$. We wish to find the bayesian estimate $\hat\theta$ of $\theta$ under the assumption that $\theta$ is the value of an RV $\theta$ with prior density $f_\theta(\theta) \sim \theta^b e^{-c\theta}U(\theta)$. Show that

$$\hat\theta = \frac{n\bar x + b + 1}{n + c}$$

9-17. Suppose that the IQ scores of children in a certain grade are the samples of an $N(\eta, \sigma)$ RV $x$. We test 10 children and obtain the following averages: $\bar x = 90$, $s = 5$. Find the 0.95 confidence interval of $\eta$ and of $\sigma$.

9-18. The RVs $x_i$ are i.i.d. and $N(0, \sigma)$. We observe that $x_1^2 + \cdots + x_{10}^2 = 4$. Find the 0.95 confidence interval of $\sigma$.

9-19. The readings of a voltmeter introduce an error $\nu$ with mean 0.
We wish to estimate its standard deviation $\sigma$. We measure a calibrated source $V = 3$ V four times and obtain the values 2.90, 3.15, 3.05, and 2.96. Assuming that $\nu$ is normal, find the 0.95 confidence interval of $\sigma$.

9-20. The RV $x$ has the Erlang density $f(x) \sim c^4x^3e^{-cx}U(x)$. We observe the samples $x_i = 3.1, 3.4, 3.3$. Find the ML estimate $\hat c$ of $c$.

9-21. The RV $x$ has the truncated exponential density $f(x) = ce^{-c(x - x_0)}U(x - x_0)$. Find the ML estimate $\hat c$ of $c$ in terms of the $n$ samples $x_i$ of $x$.

9-22. The time to failure of a bulb is an RV $x$ with density $ce^{-cx}U(x)$. We test 80 bulbs and find that 200 hours later, 62 of them are still good. Find the ML estimate of $c$.

9-23. The RV $x$ has a Poisson distribution with mean $\theta$. Show that the ML estimate of $\theta$ equals $\bar x$.

9-24. Show that if $L(x, \theta) = \ln f(x, \theta)$ is the likelihood function of an RV $x$, then
9-25. We are given an RV $x$ with mean $\eta$ and standard deviation $\sigma = 2$, and we wish to test the hypothesis $\eta = 8$ against $\eta = 8.7$ with $\alpha = 0.01$ using as the test statistic the sample mean $\bar x$ of $n$ samples. (a) Find the critical region $R_c$ of the test and the resulting $\beta$ if $n = 64$. (b) Find $n$ and $R_c$ if $\beta = 0.05$.

9-26. A new car is introduced with the claim that its average mileage in highway driving is at least 28 miles per gallon. Seventeen cars are tested, and the following mileage is obtained:

19 20 24 25 26 26.8 27.2 27.5 28 28.2 28.4 29 30 31 32 33.3 35

Can we conclude with significance level at most 0.05 that the claim is true?

9-27. The weights of cereal boxes are the values of an RV $x$ with mean $\eta$. We measure 64 boxes and find that $\bar x = 7.7$ oz. and $s = 1.5$ oz. Test the hypothesis $H_0$: $\eta = 8$ oz. against $H_1$: $\eta \ne 8$ oz. with $\alpha = 0.1$ and $\alpha = 0.01$.

9-28. Brand A batteries cost more than brand B batteries. Their life lengths are two RVs $x$ and $y$. We test 16 batteries of brand A and 26 batteries of brand B and find these values, in hours:

$$\bar x = 4.6 \qquad s_x = 1.1 \qquad \bar y = 4.2 \qquad s_y = 0.9$$

Test the hypothesis $\eta_x = \eta_y$ against $\eta_x > \eta_y$ with $\alpha = 0.05$.

9-29. A coin is tossed 64 times, and heads shows 22 times. Test the hypothesis that the coin is fair with significance level 0.05. We toss a coin 16 times, and heads shows $k$ times. If $k$ is such that $k_1 \le k \le k_2$, we accept the hypothesis that the coin is fair with significance level $\alpha = 0.05$. Find $k_1$ and $k_2$ and the resulting $\beta$ error.

9-30. In a production process, the number of defective units per hour is a Poisson distributed RV $x$ with parameter $\lambda = 5$. A new process is introduced, and it is observed that the hourly defectives in a 22-hour period are

$$x_i = 3\ 0\ 5\ 4\ 2\ 6\ 4\ 1\ 5\ 3\ 7\ 4\ 0\ 8\ 3\ 2\ 4\ 3\ 6\ 5\ 6\ 9$$

Test the hypothesis $\lambda = 5$ against $\lambda < 5$ with $\alpha = 0.05$.

9-31. A die is tossed 102 times, and the $i$th face shows $k_i = 18, 15, 19, 17, 13,$ and $20$ times. Test the hypothesis that the die is fair with $\alpha = 0.05$ using the chi-square test.

9-32.
A computer prints out 1000 numbers consisting of the 10 integers $j = 0, 1, \ldots, 9$. The number $n_j$ of times $j$ appears equals

$$n_j = 85\ 110\ 118\ 91\ 78\ 105\ 122\ 94\ 101\ 96$$

Test the hypothesis that the numbers $j$ are uniformly distributed between 0 and 9, with $\alpha = 0.05$.

9-33. The number $x$ of particles emitted from a radioactive substance in 1 second is a Poisson RV with mean $\theta$. In 50 seconds, 1058 particles are emitted. Test the hypothesis $\theta_0 = 20$ against $\theta \ne 20$ with $\alpha = 0.05$ using the asymptotic approximation.

9-34. The RVs $x$ and $y$ are $N(\eta_x, \sigma_x)$ and $N(\eta_y, \sigma_y)$ respectively and independent. Test the hypothesis $\sigma_x = \sigma_y$ against $\sigma_x \ne \sigma_y$ using as the test statistic the ratio (see Prob. 6-19)

$$q = \frac{1}{m}\sum_{i=1}^{m}(x_i - \eta_x)^2 \bigg/ \frac{1}{n}\sum_{i=1}^{n}(y_i - \eta_y)^2$$
9-35. Show that the variance of an RV with Student-$t$ distribution $t(n)$ equals $n/(n - 2)$.

9-36. Find the probability $p_5$ that in a men's tennis tournament the final match will last five games. (a) Assume that the probability $p$ that a player wins a set equals 0.5. (b) Use bayesian statistics with uniform prior (see law of succession).

9-37. Show that in the measurement problem of Example 9-9, the bayesian estimate $\hat\theta$ of the parameter $\theta$ equals

$$\hat\theta = \frac{\sigma_1^2}{\sigma_0^2}\theta_0 + \frac{n\sigma_1^2}{\sigma^2}\bar x \qquad\text{where}\qquad \sigma_1^2 = \frac{\sigma^2}{n} \times \frac{\sigma_0^2}{\sigma_0^2 + \sigma^2/n}$$

9-38. Using the ML method, find the $\gamma$ confidence interval of the variance $v = \sigma^2$ of an $N(\eta, \sigma)$ RV with known mean.

9-39. Show that if $\hat\theta_1$ and $\hat\theta_2$ are two unbiased minimum variance estimators of a parameter $\theta$, then $\hat\theta_1 = \hat\theta_2$. Hint: Form the RV $\hat\theta = (\hat\theta_1 + \hat\theta_2)/2$. Show that $\sigma_{\hat\theta}^2 = \sigma^2(1 + r)/2 \le \sigma^2$ where $\sigma^2$ is the common variance of $\hat\theta_1$ and $\hat\theta_2$ and $r$ is their correlation coefficient.

9-40. The number of successes of an event $\mathcal{A}$ in $n$ trials equals $k_1$. Show that

$$\frac{(k_1 - np_1)^2}{np_1} + \frac{(k_2 - np_2)^2}{np_2} = \frac{(k_1 - np_1)^2}{np_1p_2}$$

where $k_2 = n - k_1$ and $P(\mathcal{A}) = p_1 = 1 - p_2$.
PART II STOCHASTIC PROCESSES
CHAPTER 10

GENERAL CONCEPTS

10-1 DEFINITIONS

As we recall, an RV $x$ is a rule for assigning to every outcome $\zeta$ of an experiment $\mathcal{S}$ a number $x(\zeta)$. A stochastic process $x(t)$ is a rule for assigning to every $\zeta$ a function $x(t, \zeta)$. Thus a stochastic process is a family of time functions depending on the parameter $\zeta$ or, equivalently, a function of $t$ and $\zeta$. The domain of $\zeta$ is the set of all experimental outcomes and the domain of $t$ is a set $R$ of real numbers.

If $R$ is the real axis, then $x(t)$ is a continuous-time process. If $R$ is the set of integers, then $x(t)$ is a discrete-time process. A discrete-time process is, thus, a sequence of random variables. Such a sequence will be denoted by $x_n$ as in Sec. 8-4, or, to avoid double indices, by $x[n]$.

We shall say that $x(t)$ is a discrete-state process if its values are countable. Otherwise, it is a continuous-state process.

Most results in this investigation will be phrased in terms of continuous-time processes. Topics dealing with discrete-time processes will be introduced either as illustrations of the general theory, or when their discrete-time version is not self-evident.

We shall use the notation $x(t)$ to represent a stochastic process omitting, as in the case of random variables, its dependence on $\zeta$. Thus $x(t)$ has the following interpretations:

1. It is a family (or an ensemble) of functions $x(t, \zeta)$. In this interpretation, $t$ and $\zeta$ are variables.
2. It is a single time function (or a sample of the given process). In this case, $t$ is a variable and $\zeta$ is fixed.
3. If $t$ is fixed and $\zeta$ is variable, then $x(t)$ is a random variable equal to the state of the given process at time $t$.
4. If $t$ and $\zeta$ are fixed, then $x(t)$ is a number.

A physical example of a stochastic process is the motion of microscopic particles in collision with the molecules in a fluid (brownian motion). The resulting process $x(t)$ consists of the motions of all particles (ensemble). A single realization $x(t, \zeta_i)$ of this process (Fig. 10-1a) is the motion of a specific particle (sample). Another example is the voltage

$$x(t) = r\cos(\omega t + \varphi)$$

of an ac generator with random amplitude $r$ and phase $\varphi$. In this case, the process $x(t)$ consists of a family of pure sine waves and a single sample is the function (Fig. 10-1b)

$$x(t, \zeta_i) = r(\zeta_i)\cos[\omega t + \varphi(\zeta_i)]$$

According to our definition, both examples are stochastic processes. There is, however, a fundamental difference between them. The first example (regular) consists of a family of functions that cannot be described in terms of a finite number of parameters. Furthermore, the future of a sample $x(t, \zeta)$ of $x(t)$ cannot be determined in terms of its past. Finally, under certain conditions, the statistics† of a regular process $x(t)$ can be determined in terms of a single sample (see Sec. 13-1).

†Recall that statistics hereafter will mean statistical properties.

The second example (predictable) consists of a family of pure sine waves and it is completely specified in terms of the RVs $r$ and $\varphi$. Furthermore, if $x(t, \zeta)$ is known for $t \le t_0$, then it is determined for $t > t_0$. Finally, a single sample $x(t, \zeta)$ of $x(t)$ does not specify the properties of the
entire process because it depends only on the particular values $\mathbf{r}(\zeta)$ and $\boldsymbol{\varphi}(\zeta)$ of $\mathbf{r}$ and $\boldsymbol{\varphi}$. A formal definition of regular and predictable processes is given in Sec. 12-3.

Equality. We shall say that two stochastic processes $\mathbf{x}(t)$ and $\mathbf{y}(t)$ are equal (everywhere) if their respective samples $\mathbf{x}(t,\zeta)$ and $\mathbf{y}(t,\zeta)$ are identical for every $\zeta$. Similarly, the equality $\mathbf{z}(t) = \mathbf{x}(t) + \mathbf{y}(t)$ means that $\mathbf{z}(t,\zeta) = \mathbf{x}(t,\zeta) + \mathbf{y}(t,\zeta)$ for every $\zeta$. Derivatives, integrals, or any other operations involving stochastic processes are defined similarly in terms of the corresponding operations for each sample.

As in the case of limits, the above definitions can be relaxed. We give below the meaning of MS equality and in App. 10A we define MS derivatives and integrals. Two processes $\mathbf{x}(t)$ and $\mathbf{y}(t)$ are equal in the MS sense iff

$$E\{|\mathbf{x}(t) - \mathbf{y}(t)|^2\} = 0 \tag{10-1}$$

for every $t$. Equality in the MS sense leads to the following conclusions: We denote by $\mathcal{A}_t$ the set of outcomes $\zeta$ such that $\mathbf{x}(t,\zeta) = \mathbf{y}(t,\zeta)$ for a specific $t$, and by $\mathcal{A}_\infty$ the set of outcomes $\zeta$ such that $\mathbf{x}(t,\zeta) = \mathbf{y}(t,\zeta)$ for every $t$. From (10-1) it follows that $\mathbf{x}(t,\zeta) - \mathbf{y}(t,\zeta) = 0$ with probability 1; hence $P(\mathcal{A}_t) = P(\mathcal{S}) = 1$. It does not follow, however, that $P(\mathcal{A}_\infty) = 1$. In fact, since $\mathcal{A}_\infty$ is the intersection of all sets $\mathcal{A}_t$ as $t$ ranges over the entire axis, $P(\mathcal{A}_\infty)$ might even equal 0.

Statistics of Stochastic Processes

A stochastic process is a noncountable infinity of random variables, one for each $t$. For a specific $t$, $\mathbf{x}(t)$ is an RV with distribution

$$F(x,t) = P\{\mathbf{x}(t) \le x\} \tag{10-2}$$

This function depends on $t$, and it equals the probability of the event $\{\mathbf{x}(t) \le x\}$ consisting of all outcomes $\zeta$ such that, at the specific time $t$, the samples $\mathbf{x}(t,\zeta)$ of the given process do not exceed the number $x$. The function $F(x,t)$ will be called the first-order distribution of the process $\mathbf{x}(t)$. Its derivative with respect to $x$:

$$f(x,t) = \frac{\partial F(x,t)}{\partial x} \tag{10-3}$$

is the first-order density of $\mathbf{x}(t)$.
Frequency interpretation. If the experiment is performed $n$ times, then $n$ functions $\mathbf{x}(t,\zeta_i)$ are observed, one for each trial (Fig. 10-2). Denoting by $n_t(x)$ the number of trials such that at time $t$ the ordinates of the observed functions do not exceed $x$ (solid lines), we conclude as in (4-3) that

$$F(x,t) \simeq \frac{n_t(x)}{n} \tag{10-4}$$
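The frequency interpretation can be tried numerically. The following is a minimal sketch, not part of the original text: it assumes a hypothetical process $\cos(t + \boldsymbol{\varphi})$ with $\boldsymbol{\varphi}$ uniform in $(-\pi,\pi)$, for which the first-order distribution is $1 - \arccos(x)/\pi$ for $|x| \le 1$. The function names, the seed, and the number of trials are illustrative choices.

```python
import math
import random

def empirical_cdf(ordinates, x):
    """Fraction n_t(x)/n of observed ordinates that do not exceed x, as in (10-4)."""
    return sum(1 for v in ordinates if v <= x) / len(ordinates)

def observe_at(t, trials, seed=0):
    """Ordinates x(t, zeta_i) of the assumed process cos(t + phi), one per trial."""
    rng = random.Random(seed)
    return [math.cos(t + rng.uniform(-math.pi, math.pi)) for _ in range(trials)]

def exact_cdf(x):
    """Distribution of cos(t + phi) for phi uniform in (-pi, pi), valid for |x| <= 1."""
    return 1.0 - math.acos(x) / math.pi

ordinates = observe_at(t=0.7, trials=20000)
estimate = empirical_cdf(ordinates, 0.5)
print(estimate, exact_cdf(0.5))
```

The two printed numbers should agree to within the sampling error, which shrinks roughly as $1/\sqrt{n}$.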
The second-order distribution of the process $\mathbf{x}(t)$ is the joint distribution

$$F(x_1,x_2;t_1,t_2) = P\{\mathbf{x}(t_1) \le x_1, \mathbf{x}(t_2) \le x_2\} \tag{10-5}$$

of the RVs $\mathbf{x}(t_1)$ and $\mathbf{x}(t_2)$. The corresponding density equals

$$f(x_1,x_2;t_1,t_2) = \frac{\partial^2 F(x_1,x_2;t_1,t_2)}{\partial x_1\,\partial x_2} \tag{10-6}$$

We note that (consistency conditions)

$$F(x_1;t_1) = F(x_1,\infty;t_1,t_2) \qquad f(x_1,t_1) = \int_{-\infty}^{\infty} f(x_1,x_2;t_1,t_2)\,dx_2$$

as in (6-9) and (6-10).

The $n$th-order distribution of $\mathbf{x}(t)$ is the joint distribution $F(x_1,\ldots,x_n;t_1,\ldots,t_n)$ of the RVs $\mathbf{x}(t_1),\ldots,\mathbf{x}(t_n)$.

SECOND-ORDER PROPERTIES. For the determination of the statistical properties of a stochastic process, knowledge of the function $F(x_1,\ldots,x_n;t_1,\ldots,t_n)$ is required for every $x_i$, $t_i$, and $n$. However, for many applications, only certain averages are used, in particular, the expected value of $\mathbf{x}(t)$ and of $\mathbf{x}^2(t)$. These quantities can be expressed in terms of the second-order properties of $\mathbf{x}(t)$ defined as follows:

Mean. The mean $\eta(t)$ of $\mathbf{x}(t)$ is the expected value of the RV $\mathbf{x}(t)$:

$$\eta(t) = E\{\mathbf{x}(t)\} = \int_{-\infty}^{\infty} x f(x,t)\,dx \tag{10-7}$$

Autocorrelation. The autocorrelation $R(t_1,t_2)$ of $\mathbf{x}(t)$ is the expected value of the product $\mathbf{x}(t_1)\mathbf{x}(t_2)$:

$$R(t_1,t_2) = E\{\mathbf{x}(t_1)\mathbf{x}(t_2)\} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x_1 x_2 f(x_1,x_2;t_1,t_2)\,dx_1\,dx_2 \tag{10-8}$$

The value of $R(t_1,t_2)$ on the diagonal $t_1 = t_2 = t$ is the average power of $\mathbf{x}(t)$:

$$E\{\mathbf{x}^2(t)\} = R(t,t)$$
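As a rough illustration of these definitions, the sketch below evaluates the mean, the autocorrelation, and the average power as ensemble averages. It is not from the original text: it assumes the sine-wave process of the ac-generator example with unit amplitude, a hypothetical frequency $\omega = 2$, and a phase uniform in $(-\pi,\pi)$, and it replaces the expectation by a midpoint sum over the phase.

```python
import math

def ensemble_average(g, points=2000):
    """E{g(phi)} for phi uniform in (-pi, pi), approximated by the midpoint rule."""
    step = 2 * math.pi / points
    return sum(g(-math.pi + (k + 0.5) * step) for k in range(points)) / points

def sample(t, phi, omega=2.0):
    """One sample x(t, zeta) of the assumed unit-amplitude sine-wave process."""
    return math.cos(omega * t + phi)

def mean(t):
    """eta(t) = E{x(t)}."""
    return ensemble_average(lambda phi: sample(t, phi))

def autocorrelation(t1, t2):
    """R(t1, t2) = E{x(t1) x(t2)}."""
    return ensemble_average(lambda phi: sample(t1, phi) * sample(t2, phi))

print(mean(0.3))                  # close to 0
print(autocorrelation(0.3, 1.1))  # close to 0.5 * cos(2 * 0.8)
print(autocorrelation(0.4, 0.4))  # average power, close to 0.5
```

For this particular process the autocorrelation depends only on $t_1 - t_2$; that is a property of the assumed example, not of the definitions.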
The autocovariance $C(t_1,t_2)$ of $\mathbf{x}(t)$ is the covariance of the RVs $\mathbf{x}(t_1)$ and $\mathbf{x}(t_2)$:

$$C(t_1,t_2) = R(t_1,t_2) - \eta(t_1)\eta(t_2) \tag{10-9}$$

and its value $C(t,t)$ on the diagonal $t_1 = t_2 = t$ equals the variance of $\mathbf{x}(t)$.

Note. The following is an explanation of the reason for introducing the function $R(t_1,t_2)$ even in problems dealing only with average power: Suppose that $\mathbf{x}(t)$ is the input to a linear system and $\mathbf{y}(t)$ is the resulting output. In Sec. 10-2 we show that the mean of $\mathbf{y}(t)$ can be expressed in terms of the mean of $\mathbf{x}(t)$. However, the average power of $\mathbf{y}(t)$ cannot be found if only $E\{\mathbf{x}^2(t)\}$ is given. For the determination of $E\{\mathbf{y}^2(t)\}$, knowledge of the function $R(t_1,t_2)$ is required, not just on the diagonal $t_1 = t_2$, but for every $t_1$ and $t_2$. The following identity is a simple illustration

$$E\{[\mathbf{x}(t_1) + \mathbf{x}(t_2)]^2\} = R(t_1,t_1) + 2R(t_1,t_2) + R(t_2,t_2)$$

This follows from (10-8) if we expand the square and use the linearity of expected values.

Example 10-1. An extreme example of a stochastic process is a deterministic signal $\mathbf{x}(t) = f(t)$. In this case,

$$\eta(t) = E\{f(t)\} = f(t) \qquad R(t_1,t_2) = E\{f(t_1)f(t_2)\} = f(t_1)f(t_2)$$

Example 10-2. Suppose that $\mathbf{x}(t)$ is a process with

$$\eta(t) = 3 \qquad R(t_1,t_2) = 9 + 4e^{-0.2|t_1-t_2|}$$

We shall determine the mean, the variance, and the covariance of the RVs $\mathbf{z} = \mathbf{x}(5)$ and $\mathbf{w} = \mathbf{x}(8)$.

Clearly, $E\{\mathbf{z}\} = \eta(5) = 3$ and $E\{\mathbf{w}\} = \eta(8) = 3$. Furthermore,

$$E\{\mathbf{z}^2\} = R(5,5) = 13 \qquad E\{\mathbf{w}^2\} = R(8,8) = 13 \qquad E\{\mathbf{z}\mathbf{w}\} = R(5,8) = 9 + 4e^{-0.6} = 11.195$$

Thus $\mathbf{z}$ and $\mathbf{w}$ have the same variance $\sigma^2 = 13 - 9 = 4$ and their covariance equals $C(5,8) = 4e^{-0.6} = 2.195$.

Example 10-3. The integral

$$\mathbf{s} = \int_a^b \mathbf{x}(t)\,dt$$

of a stochastic process $\mathbf{x}(t)$ is an RV $\mathbf{s}$ and its value $\mathbf{s}(\zeta)$ for a specific outcome $\zeta$ is the area under the curve $\mathbf{x}(t,\zeta)$ in the interval $(a,b)$ (see also App. 10A). Interpreting the above as a Riemann integral, we conclude from the linearity of expected values that

$$\eta_s = E\{\mathbf{s}\} = \int_a^b E\{\mathbf{x}(t)\}\,dt = \int_a^b \eta(t)\,dt \tag{10-10}$$

Similarly, since

$$\mathbf{s}^2 = \int_a^b\int_a^b \mathbf{x}(t_1)\mathbf{x}(t_2)\,dt_1\,dt_2$$
we conclude, using again the linearity of expected values, that

$$E\{\mathbf{s}^2\} = \int_a^b\int_a^b E\{\mathbf{x}(t_1)\mathbf{x}(t_2)\}\,dt_1\,dt_2 = \int_a^b\int_a^b R(t_1,t_2)\,dt_1\,dt_2 \tag{10-11}$$

Example 10-4. We shall determine the autocorrelation $R(t_1,t_2)$ of the process

$$\mathbf{x}(t) = \mathbf{r}\cos(\omega t + \boldsymbol{\varphi})$$

where we assume that the RVs $\mathbf{r}$ and $\boldsymbol{\varphi}$ are independent and $\boldsymbol{\varphi}$ is uniform in the interval $(-\pi,\pi)$.

Using simple trigonometric identities, we find

$$E\{\mathbf{x}(t_1)\mathbf{x}(t_2)\} = \tfrac{1}{2}E\{\mathbf{r}^2\}E\{\cos\omega(t_1 - t_2) + \cos(\omega t_1 + \omega t_2 + 2\boldsymbol{\varphi})\}$$

and since

$$E\{\cos(\omega t_1 + \omega t_2 + 2\boldsymbol{\varphi})\} = \frac{1}{2\pi}\int_{-\pi}^{\pi}\cos(\omega t_1 + \omega t_2 + 2\varphi)\,d\varphi = 0$$

we conclude that

$$R(t_1,t_2) = \tfrac{1}{2}E\{\mathbf{r}^2\}\cos\omega(t_1 - t_2) \tag{10-12}$$

Example 10-5 Poisson process. In Sec. 3-4 we introduced the concept of Poisson points and we showed that these points are specified by the following properties:

P1: The number $\mathbf{n}(t_1,t_2)$ of the points $\mathbf{t}_i$ in an interval $(t_1,t_2)$ of length $t = t_2 - t_1$ is a Poisson RV with parameter $\lambda t$:

$$P\{\mathbf{n}(t_1,t_2) = k\} = \frac{e^{-\lambda t}(\lambda t)^k}{k!} \tag{10-13}$$

P2: If the intervals $(t_1,t_2)$ and $(t_3,t_4)$ are nonoverlapping, then the RVs $\mathbf{n}(t_1,t_2)$ and $\mathbf{n}(t_3,t_4)$ are independent.

Using the points $\mathbf{t}_i$, we form the stochastic process

$$\mathbf{x}(t) = \mathbf{n}(0,t)$$

shown in Fig. 10-3a. This is a discrete-state process consisting of a family of increasing staircase functions with discontinuities at the points $\mathbf{t}_i$.

For a specific $t$, $\mathbf{x}(t)$ is a Poisson RV with parameter $\lambda t$; hence

$$E\{\mathbf{x}(t)\} = \eta(t) = \lambda t$$

We shall show that its autocorrelation equals

$$R(t_1,t_2) = \begin{cases} \lambda t_2 + \lambda^2 t_1 t_2 & t_1 \ge t_2 \\ \lambda t_1 + \lambda^2 t_1 t_2 & t_1 \le t_2 \end{cases} \tag{10-14}$$

or equivalently that

$$C(t_1,t_2) = \lambda\min(t_1,t_2) = \lambda t_1 U(t_2 - t_1) + \lambda t_2 U(t_1 - t_2)$$
Proof. The above is true for $t_1 = t_2$ because [see (5-36)]

$$E\{\mathbf{x}^2(t)\} = \lambda t + \lambda^2 t^2 \tag{10-15}$$

Since $R(t_1,t_2) = R(t_2,t_1)$, it suffices to prove (10-14) for $t_1 < t_2$. The RVs $\mathbf{x}(t_1)$ and $\mathbf{x}(t_2) - \mathbf{x}(t_1)$ are independent because the intervals $(0,t_1)$ and $(t_1,t_2)$ are nonoverlapping. Furthermore, they are Poisson distributed with parameters $\lambda t_1$ and $\lambda(t_2 - t_1)$ respectively. Hence

$$E\{\mathbf{x}(t_1)[\mathbf{x}(t_2) - \mathbf{x}(t_1)]\} = E\{\mathbf{x}(t_1)\}E\{\mathbf{x}(t_2) - \mathbf{x}(t_1)\} = \lambda t_1\,\lambda(t_2 - t_1)$$

Using the identity

$$\mathbf{x}(t_1)\mathbf{x}(t_2) = \mathbf{x}(t_1)[\mathbf{x}(t_1) + \mathbf{x}(t_2) - \mathbf{x}(t_1)]$$

we conclude from the above and (10-15) that

$$R(t_1,t_2) = \lambda t_1 + \lambda^2 t_1^2 + \lambda t_1\,\lambda(t_2 - t_1)$$

and (10-14) results.

Nonuniform case. If the points $\mathbf{t}_i$ have a nonuniform density $\lambda(t)$ as in (3-54), then the preceding results still hold provided that the product $\lambda(t_2 - t_1)$ is replaced by the integral of $\lambda(t)$ from $t_1$ to $t_2$. Thus

$$E\{\mathbf{x}(t)\} = \int_0^t \lambda(\alpha)\,d\alpha \tag{10-16}$$

and

$$R(t_1,t_2) = \int_0^{t_1}\lambda(t)\,dt\left[1 + \int_0^{t_2}\lambda(t)\,dt\right] \qquad t_1 \le t_2 \tag{10-17}$$

Example 10-6 Telegraph signal. Using the Poisson points $\mathbf{t}_i$, we form a process $\mathbf{x}(t)$ such that $\mathbf{x}(t) = 1$ if the number of points in the interval $(0,t)$ is even, and $\mathbf{x}(t) = -1$ if this number is odd (Fig. 10-3b).
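Denoting by $p(k)$ the probability that the number of points in the interval $(0,t)$ equals $k$, we conclude that [see (10-13)]

$$P\{\mathbf{x}(t) = 1\} = p(0) + p(2) + \cdots = e^{-\lambda t}\left[1 + \frac{(\lambda t)^2}{2!} + \cdots\right] = e^{-\lambda t}\cosh\lambda t$$

$$P\{\mathbf{x}(t) = -1\} = p(1) + p(3) + \cdots = e^{-\lambda t}\left[\lambda t + \frac{(\lambda t)^3}{3!} + \cdots\right] = e^{-\lambda t}\sinh\lambda t$$

Hence

$$E\{\mathbf{x}(t)\} = e^{-\lambda t}(\cosh\lambda t - \sinh\lambda t) = e^{-2\lambda t} \tag{10-18}$$

To determine $R(t_1,t_2)$, we note that, if $t = t_1 - t_2 > 0$ and $\mathbf{x}(t_2) = 1$, then $\mathbf{x}(t_1) = 1$ if the number of points in the interval $(t_2,t_1)$ is even. Hence

$$P\{\mathbf{x}(t_1) = 1 \mid \mathbf{x}(t_2) = 1\} = e^{-\lambda t}\cosh\lambda t \qquad t = t_1 - t_2$$

Multiplying by $P\{\mathbf{x}(t_2) = 1\}$, we obtain

$$P\{\mathbf{x}(t_1) = 1, \mathbf{x}(t_2) = 1\} = e^{-\lambda t}\cosh\lambda t\; e^{-\lambda t_2}\cosh\lambda t_2$$

Similarly,

$$P\{\mathbf{x}(t_1) = -1, \mathbf{x}(t_2) = -1\} = e^{-\lambda t}\cosh\lambda t\; e^{-\lambda t_2}\sinh\lambda t_2$$

$$P\{\mathbf{x}(t_1) = 1, \mathbf{x}(t_2) = -1\} = e^{-\lambda t}\sinh\lambda t\; e^{-\lambda t_2}\sinh\lambda t_2$$

$$P\{\mathbf{x}(t_1) = -1, \mathbf{x}(t_2) = 1\} = e^{-\lambda t}\sinh\lambda t\; e^{-\lambda t_2}\cosh\lambda t_2$$

Since the product $\mathbf{x}(t_1)\mathbf{x}(t_2)$ equals 1 or $-1$, we conclude omitting details that

$$R(t_1,t_2) = e^{-2\lambda|t_1 - t_2|} \tag{10-19}$$

The above process is called semirandom telegraph signal because its value $\mathbf{x}(0) = 1$ at $t = 0$ is not random. To remove this certainty, we form the product

$$\mathbf{y}(t) = \mathbf{a}\mathbf{x}(t)$$

where $\mathbf{a}$ is an RV taking the values $+1$ and $-1$ with equal probability and is independent of $\mathbf{x}(t)$. The process $\mathbf{y}(t)$ so formed is called random telegraph signal. Since $E\{\mathbf{a}\} = 0$ and $E\{\mathbf{a}^2\} = 1$, the mean of $\mathbf{y}(t)$ equals $E\{\mathbf{a}\}E\{\mathbf{x}(t)\} = 0$ and its autocorrelation is given by

$$E\{\mathbf{y}(t_1)\mathbf{y}(t_2)\} = E\{\mathbf{a}^2\}E\{\mathbf{x}(t_1)\mathbf{x}(t_2)\} = e^{-2\lambda|t_1 - t_2|}$$

We note that as $t \to \infty$ the processes $\mathbf{x}(t)$ and $\mathbf{y}(t)$ have asymptotically equal statistics.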
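The mean (10-18) and the autocorrelation (10-19) of the semirandom telegraph signal can be checked by simulation. The sketch below is an illustration under stated assumptions, not the author's procedure: Poisson points are generated from independent exponential gaps with rate $\lambda$, and the function names, the seed, and the parameter values are hypothetical.

```python
import math
import random

def count_poisson_points(lam, length, rng):
    """Number of Poisson points in an interval of the given length (exponential gaps)."""
    count, clock = 0, rng.expovariate(lam)
    while clock < length:
        count += 1
        clock += rng.expovariate(lam)
    return count

def telegraph_moments(lam, t1, t2, trials=20000, seed=0):
    """Estimate E{x(t1)}, E{x(t2)} and E{x(t1) x(t2)} for t1 <= t2."""
    rng = random.Random(seed)
    s1 = s2 = s12 = 0
    for _ in range(trials):
        n1 = count_poisson_points(lam, t1, rng)            # points in (0, t1)
        n2 = n1 + count_poisson_points(lam, t2 - t1, rng)  # independent increment on (t1, t2)
        x1 = 1 if n1 % 2 == 0 else -1
        x2 = 1 if n2 % 2 == 0 else -1
        s1 += x1
        s2 += x2
        s12 += x1 * x2
    return s1 / trials, s2 / trials, s12 / trials

m1, m2, r12 = telegraph_moments(lam=0.5, t1=1.0, t2=2.0)
print(m1, math.exp(-2 * 0.5 * 1.0))         # estimate versus (10-18) at t = 1
print(r12, math.exp(-2 * 0.5 * (2.0 - 1.0)))  # estimate versus (10-19)
```

Multiplying each sample by an independent random sign would give the random telegraph signal, whose mean estimate should then be close to 0 while the autocorrelation estimate is unchanged.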
General Properties

The statistical properties of a real stochastic process $\mathbf{x}(t)$ are completely determined† in terms of its $n$th-order distribution

$$F(x_1,\ldots,x_n;t_1,\ldots,t_n) = P\{\mathbf{x}(t_1) \le x_1,\ldots,\mathbf{x}(t_n) \le x_n\} \tag{10-20}$$

†There are processes (nonseparable) for which this is not true. However, such processes are mainly of mathematical interest.

The joint statistics of two real processes $\mathbf{x}(t)$ and $\mathbf{y}(t)$ are determined in terms of the joint distribution of the RVs

$$\mathbf{x}(t_1),\ldots,\mathbf{x}(t_n),\mathbf{y}(t_1'),\ldots,\mathbf{y}(t_m')$$

The complex process $\mathbf{z}(t) = \mathbf{x}(t) + j\mathbf{y}(t)$ is specified in terms of the joint statistics of the real processes $\mathbf{x}(t)$ and $\mathbf{y}(t)$.

A vector process ($n$-dimensional process) is a family of $n$ stochastic processes.

Correlation and covariance. The autocorrelation of a process $\mathbf{x}(t)$, real or complex, is by definition the mean of the product $\mathbf{x}(t_1)\mathbf{x}^*(t_2)$. This function will be denoted by $R(t_1,t_2)$ or $R_x(t_1,t_2)$ or $R_{xx}(t_1,t_2)$. Thus

$$R_{xx}(t_1,t_2) = E\{\mathbf{x}(t_1)\mathbf{x}^*(t_2)\} \tag{10-21}$$

where the conjugate term is associated with the second variable in $R_{xx}(t_1,t_2)$. From this it follows that

$$R(t_2,t_1) = E\{\mathbf{x}(t_2)\mathbf{x}^*(t_1)\} = R^*(t_1,t_2) \tag{10-22}$$

We note, further, that

$$R(t,t) = E\{|\mathbf{x}(t)|^2\} \ge 0 \tag{10-23}$$

The last two equations are special cases of the following: The autocorrelation $R(t_1,t_2)$ of a stochastic process $\mathbf{x}(t)$ is a positive definite (p.d.) function, that is, for any $a_i$ and $a_j$:

$$\sum_{i,j} a_i a_j^* R(t_i,t_j) \ge 0 \tag{10-24}$$

This is a consequence of the identity

$$0 \le E\left\{\left|\sum_i a_i\mathbf{x}(t_i)\right|^2\right\} = \sum_{i,j} a_i a_j^* E\{\mathbf{x}(t_i)\mathbf{x}^*(t_j)\}$$

We show later that the converse is also true: Given a p.d. function $R(t_1,t_2)$, we can find a process $\mathbf{x}(t)$ with autocorrelation $R(t_1,t_2)$.
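Condition (10-24) is easy to probe numerically for a given function. The sketch below is illustrative only: the times and coefficients are arbitrary choices, `telegraph_R` is the autocorrelation found in Example 10-6 with an assumed $\lambda = 0.5$, and `rectangle` is a hypothetical candidate that turns out not to be p.d.

```python
import math

def quadratic_form(R, times, coeffs):
    """Sum over i, j of a_i * conj(a_j) * R(t_i, t_j), the left side of (10-24)."""
    total = 0j
    for ti, ai in zip(times, coeffs):
        for tj, aj in zip(times, coeffs):
            total += ai * aj.conjugate() * R(ti, tj)
    return total

def telegraph_R(t1, t2, lam=0.5):
    """Autocorrelation exp(-2*lam*|t1 - t2|) of the random telegraph signal."""
    return math.exp(-2 * lam * abs(t1 - t2))

def rectangle(t1, t2):
    """Hypothetical candidate: 1 if |t1 - t2| < 1, else 0."""
    return 1.0 if abs(t1 - t2) < 1 else 0.0

q_good = quadratic_form(telegraph_R, [0.0, 0.3, 1.0, 2.5], [1, -1, 1j, 0.5 - 2j])
q_bad = quadratic_form(rectangle, [0.0, 0.6, 1.2], [1, -1.4, 1])
print(q_good)  # real part positive, imaginary part negligible
print(q_bad)   # real part negative, so `rectangle` cannot be an autocorrelation
```

A single negative value of the quadratic form is enough to rule a function out; nonnegative values for a few trial coefficients do not by themselves prove positive definiteness.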
Example 10-7. (a) If $\mathbf{x}(t) = \mathbf{a}e^{j\omega t}$, then

$$R(t_1,t_2) = E\{\mathbf{a}e^{j\omega t_1}\mathbf{a}^*e^{-j\omega t_2}\} = E\{|\mathbf{a}|^2\}e^{j\omega(t_1 - t_2)}$$

(b) Suppose that the RVs $\mathbf{a}_i$ are uncorrelated with zero mean and variance $\sigma_i^2$. If

$$\mathbf{x}(t) = \sum_i \mathbf{a}_i e^{j\omega_i t}$$

then (10-21) yields

$$R(t_1,t_2) = \sum_i \sigma_i^2 e^{j\omega_i(t_1 - t_2)}$$

The autocovariance $C(t_1,t_2)$ of a process $\mathbf{x}(t)$ is the covariance of the RVs $\mathbf{x}(t_1)$ and $\mathbf{x}(t_2)$:

$$C(t_1,t_2) = R(t_1,t_2) - \eta(t_1)\eta^*(t_2) \tag{10-25}$$

In the above, $\eta(t) = E\{\mathbf{x}(t)\}$ is the mean of $\mathbf{x}(t)$.

The ratio

$$r(t_1,t_2) = \frac{C(t_1,t_2)}{\sqrt{C(t_1,t_1)C(t_2,t_2)}} \tag{10-26}$$

is the correlation coefficient† of the process $\mathbf{x}(t)$.

†In optics, $C(t_1,t_2)$ is called the coherence function and $r(t_1,t_2)$ is called the complex degree of coherence (see Papoulis, 1968).

Note. The autocovariance $C(t_1,t_2)$ of a process $\mathbf{x}(t)$ is the autocorrelation of the centered process

$$\tilde{\mathbf{x}}(t) = \mathbf{x}(t) - \eta(t)$$

Hence it is p.d.

The correlation coefficient $r(t_1,t_2)$ of $\mathbf{x}(t)$ is the autocovariance of the normalized process $\mathbf{x}(t)/\sqrt{C(t,t)}$; hence it is also p.d. Furthermore [see (7-9)]

$$|r(t_1,t_2)| \le 1 \qquad r(t,t) = 1 \tag{10-27}$$

Example 10-8. If

$$\mathbf{s} = \int_a^b \mathbf{x}(t)\,dt \qquad\text{then}\qquad \mathbf{s} - \eta_s = \int_a^b \tilde{\mathbf{x}}(t)\,dt$$

where $\tilde{\mathbf{x}}(t) = \mathbf{x}(t) - \eta_x(t)$. Using (10-11), we conclude from the above note that

$$\sigma_s^2 = E\{|\mathbf{s} - \eta_s|^2\} = \int_a^b\int_a^b C_x(t_1,t_2)\,dt_1\,dt_2 \tag{10-28}$$

The cross-correlation of two processes $\mathbf{x}(t)$ and $\mathbf{y}(t)$ is the function

$$R_{xy}(t_1,t_2) = E\{\mathbf{x}(t_1)\mathbf{y}^*(t_2)\} = R_{yx}^*(t_2,t_1) \tag{10-29}$$
Similarly,

$$C_{xy}(t_1,t_2) = R_{xy}(t_1,t_2) - \eta_x(t_1)\eta_y^*(t_2) \tag{10-30}$$

is their cross-covariance.

Two processes $\mathbf{x}(t)$ and $\mathbf{y}(t)$ are called (mutually) orthogonal if

$$R_{xy}(t_1,t_2) = 0 \quad\text{for every } t_1 \text{ and } t_2 \tag{10-31}$$

They are called uncorrelated if

$$C_{xy}(t_1,t_2) = 0 \quad\text{for every } t_1 \text{ and } t_2 \tag{10-32}$$

$a$-dependent processes. In general, the values $\mathbf{x}(t_1)$ and $\mathbf{x}(t_2)$ of a stochastic process $\mathbf{x}(t)$ are statistically dependent for any $t_1$ and $t_2$. However, in most cases this dependence decreases as $|t_1 - t_2| \to \infty$. This leads to the following concept: A stochastic process $\mathbf{x}(t)$ is called $a$-dependent if all its values $\mathbf{x}(t)$ for $t < t_0$ and for $t > t_0 + a$ are mutually independent. From this it follows that

$$C(t_1,t_2) = 0 \quad\text{for } |t_1 - t_2| > a \tag{10-33}$$

A process $\mathbf{x}(t)$ is called correlation $a$-dependent if its autocovariance satisfies (10-33). Clearly, if $\mathbf{x}(t)$ is correlation $a$-dependent, then any linear combination of its values for $t < t_0$ is uncorrelated with any linear combination of its values for $t > t_0 + a$.

White noise. We shall say that a process $\boldsymbol{\nu}(t)$ is white noise if its values $\boldsymbol{\nu}(t_i)$ and $\boldsymbol{\nu}(t_j)$ are uncorrelated for every $t_i$ and $t_j \ne t_i$:

$$C(t_i,t_j) = 0 \qquad t_i \ne t_j$$

As we explain later, the autocovariance of a nontrivial white-noise process must be of the form

$$C(t_1,t_2) = q(t_1)\delta(t_1 - t_2) \qquad q(t) \ge 0 \tag{10-34}$$

If the RVs $\boldsymbol{\nu}(t_i)$ and $\boldsymbol{\nu}(t_j)$ are not only uncorrelated but also independent, then $\boldsymbol{\nu}(t)$ will be called strictly white noise. Unless otherwise stated, it will be assumed that the mean of a white-noise process is identically 0.

Example 10-9. Suppose that $\boldsymbol{\nu}(t)$ is white noise and

$$\mathbf{x}(t) = \int_0^t \boldsymbol{\nu}(\alpha)\,d\alpha \tag{10-35}$$

Applying (10-11) to (10-35) and inserting (10-34), we obtain

$$E\{\mathbf{x}^2(t)\} = \int_0^t\int_0^t q(t_1)\delta(t_1 - t_2)\,dt_2\,dt_1 = \int_0^t q(t_1)\,dt_1 \tag{10-36}$$

because

$$\int_0^t \delta(t_1 - t_2)\,dt_2 = 1 \quad\text{for } 0 < t_1 < t$$

Uncorrelated and independent increments. If the increments $\mathbf{x}(t_2) - \mathbf{x}(t_1)$ and $\mathbf{x}(t_4) - \mathbf{x}(t_3)$ of a process $\mathbf{x}(t)$ are uncorrelated (independent) for any
$t_1 < t_2 < t_3 < t_4$, then we say that $\mathbf{x}(t)$ is a process with uncorrelated (independent) increments. The Poisson process is a process with independent increments. The integral (10-35) of white noise is a process with uncorrelated increments.

Independent processes. If two processes $\mathbf{x}(t)$ and $\mathbf{y}(t)$ are such that the RVs $\mathbf{x}(t_1),\ldots,\mathbf{x}(t_n)$ and $\mathbf{y}(t_1'),\ldots,\mathbf{y}(t_n')$ are mutually independent, then these processes are called independent.

Normal processes. A process $\mathbf{x}(t)$ is called normal, if the RVs $\mathbf{x}(t_1),\ldots,\mathbf{x}(t_n)$ are jointly normal for any $n$ and $t_1,\ldots,t_n$.

The statistics of a normal process are completely determined in terms of its mean $\eta(t)$ and autocovariance $C(t_1,t_2)$. Indeed, since

$$E\{\mathbf{x}(t)\} = \eta(t) \qquad \sigma_x^2(t) = C(t,t)$$

we conclude that the first-order density $f(x,t)$ of $\mathbf{x}(t)$ is the normal density $N[\eta(t); \sqrt{C(t,t)}]$.

Similarly, since the function $r(t_1,t_2)$ in (10-26) is the correlation coefficient of the RVs $\mathbf{x}(t_1)$ and $\mathbf{x}(t_2)$, the second-order density $f(x_1,x_2;t_1,t_2)$ of $\mathbf{x}(t)$ is the jointly normal density

$$N[\eta(t_1), \eta(t_2); \sqrt{C(t_1,t_1)}, \sqrt{C(t_2,t_2)}; r(t_1,t_2)]$$

The $n$th-order characteristic function of the process $\mathbf{x}(t)$ is given by [see (8-60)]

$$\exp\left\{j\sum_i \eta(t_i)\omega_i - \frac{1}{2}\sum_{i,k} C(t_i,t_k)\omega_i\omega_k\right\} \tag{10-37}$$

Its inverse $f(x_1,\ldots,x_n;t_1,\ldots,t_n)$ is the $n$th-order density of $\mathbf{x}(t)$.

Existence theorem. Given an arbitrary function $\eta(t)$ and a p.d. function $C(t_1,t_2)$, we can construct a normal process with mean $\eta(t)$ and autocovariance $C(t_1,t_2)$. This follows if we use in (10-37) the given functions $\eta(t)$ and $C(t_1,t_2)$. The inverse of the resulting characteristic function is a density because the function $C(t_1,t_2)$ is p.d. by assumption.

Example 10-10. Suppose that $\mathbf{x}(t)$ is a normal process with

$$\eta(t) = 3 \qquad C(t_1,t_2) = 4e^{-0.2|t_1 - t_2|}$$

(a) Find the probability that $\mathbf{x}(5) \le 2$.

Clearly, $\mathbf{x}(5)$ is a normal RV with mean $\eta(5) = 3$ and variance $C(5,5) = 4$. Hence

$$P\{\mathbf{x}(5) \le 2\} = G(-1/2) = 0.309$$

(b) Find the probability that $|\mathbf{x}(8) - \mathbf{x}(5)| \le 1$.

The difference $\mathbf{s} = \mathbf{x}(8) - \mathbf{x}(5)$ is a normal RV with mean $\eta(8) - \eta(5) = 0$ and variance

$$C(8,8) + C(5,5) - 2C(8,5) = 8(1 - e^{-0.6}) = 3.610$$
Hence

$$P\{|\mathbf{x}(8) - \mathbf{x}(5)| \le 1\} = 2G(1/1.9) - 1 = 0.4$$

Point and renewal processes. A point process is a set of random points $\mathbf{t}_i$ on the time axis. To every point process we can associate a stochastic process $\mathbf{x}(t)$ equal to the number of points $\mathbf{t}_i$ in the interval $(0,t)$. An example is the Poisson process. To every point process $\mathbf{t}_i$ we can associate a sequence of RVs $\mathbf{z}_n$ such that

$$\mathbf{z}_1 = \mathbf{t}_1 \qquad \mathbf{z}_2 = \mathbf{t}_2 - \mathbf{t}_1 \qquad\cdots\qquad \mathbf{z}_n = \mathbf{t}_n - \mathbf{t}_{n-1}$$

where $\mathbf{t}_1$ is the first random point to the right of the origin. This sequence is called a renewal process. An example is the life history of light bulbs that are replaced as soon as they fail. In this case, $\mathbf{z}_i$ is the total time the $i$th bulb is in operation and $\mathbf{t}_i$ is the time of its failure.

We have thus established a correspondence between the following three concepts (Fig. 10-4): (a) a point process $\mathbf{t}_i$, (b) a discrete-state stochastic process $\mathbf{x}(t)$ increasing in unit steps at the points $\mathbf{t}_i$, (c) a renewal process consisting of the RVs $\mathbf{z}_i$ and such that $\mathbf{t}_n = \mathbf{z}_1 + \cdots + \mathbf{z}_n$. This correspondence is developed further in Sec. 16-1.

Stationary Processes

A stochastic process $\mathbf{x}(t)$ is called strict-sense stationary (abbreviated SSS) if its statistical properties are invariant to a shift of the origin. This means that the processes $\mathbf{x}(t)$ and $\mathbf{x}(t + c)$ have the same statistics for any $c$.

Two processes $\mathbf{x}(t)$ and $\mathbf{y}(t)$ are called jointly stationary if the joint statistics of $\mathbf{x}(t)$ and $\mathbf{y}(t)$ are the same as the joint statistics of $\mathbf{x}(t + c)$ and $\mathbf{y}(t + c)$ for any $c$.

A complex process $\mathbf{z}(t) = \mathbf{x}(t) + j\mathbf{y}(t)$ is stationary if the processes $\mathbf{x}(t)$ and $\mathbf{y}(t)$ are jointly stationary.

From the definition it follows that the $n$th-order density of an SSS process must be such that

$$f(x_1,\ldots,x_n;t_1,\ldots,t_n) = f(x_1,\ldots,x_n;t_1 + c,\ldots,t_n + c) \tag{10-38}$$

for any $c$.
From the above it follows that $f(x;t) = f(x;t + c)$ for any $c$. Hence the first-order density of $\mathbf{x}(t)$ is independent of $t$:

$$f(x;t) = f(x) \tag{10-39}$$

Similarly, $f(x_1,x_2;t_1 + c,t_2 + c)$ is independent of $c$ for any $c$. This leads to the conclusion that

$$f(x_1,x_2;t_1,t_2) = f(x_1,x_2;\tau) \qquad \tau = t_1 - t_2 \tag{10-40}$$

Thus the joint density of the RVs $\mathbf{x}(t + \tau)$ and $\mathbf{x}(t)$ is independent of $t$ and it equals $f(x_1,x_2;\tau)$.

WIDE SENSE. A stochastic process $\mathbf{x}(t)$ is called wide-sense stationary (abbreviated WSS) if its mean is constant

$$E\{\mathbf{x}(t)\} = \eta \tag{10-41}$$

and its autocorrelation depends only on $\tau = t_1 - t_2$:

$$E\{\mathbf{x}(t + \tau)\mathbf{x}^*(t)\} = R(\tau) \tag{10-42}$$

Since $\tau$ is the distance from $t$ to $t + \tau$, the function $R(\tau)$ can be written in the symmetrical form

$$R(\tau) = E\left\{\mathbf{x}\left(t + \frac{\tau}{2}\right)\mathbf{x}^*\left(t - \frac{\tau}{2}\right)\right\} \tag{10-43}$$

Note in particular that

$$E\{|\mathbf{x}(t)|^2\} = R(0)$$

Thus the average power of a stationary process is independent of $t$ and it equals $R(0)$.

Example 10-11. Suppose that $\mathbf{x}(t)$ is a WSS process with autocorrelation

$$R(\tau) = Ae^{-\alpha|\tau|}$$

We shall determine the second moment of the RV $\mathbf{x}(8) - \mathbf{x}(5)$. Clearly,

$$E\{[\mathbf{x}(8) - \mathbf{x}(5)]^2\} = E\{\mathbf{x}^2(8)\} + E\{\mathbf{x}^2(5)\} - 2E\{\mathbf{x}(8)\mathbf{x}(5)\} = R(0) + R(0) - 2R(3) = 2A - 2Ae^{-3\alpha}$$

Note. As the above example suggests, the autocorrelation of a stationary process $\mathbf{x}(t)$ can be defined as average power. Assuming for simplicity that $\mathbf{x}(t)$ is real, we conclude from (10-42) that

$$E\{[\mathbf{x}(t + \tau) - \mathbf{x}(t)]^2\} = 2[R(0) - R(\tau)] \tag{10-44}$$

From (10-42) it follows that the autocovariance of a WSS process depends only on $\tau = t_1 - t_2$:

$$C(\tau) = R(\tau) - |\eta|^2 \tag{10-45}$$
and its correlation coefficient [see (10-26)] equals

$$r(\tau) = C(\tau)/C(0) \tag{10-46}$$

Thus $C(\tau)$ is the covariance, and $r(\tau)$ the correlation coefficient of the RVs $\mathbf{x}(t + \tau)$ and $\mathbf{x}(t)$.

Two processes $\mathbf{x}(t)$ and $\mathbf{y}(t)$ are called jointly WSS if each is WSS and their cross-correlation depends only on $\tau = t_1 - t_2$:

$$R_{xy}(\tau) = E\{\mathbf{x}(t + \tau)\mathbf{y}^*(t)\} \qquad C_{xy}(\tau) = R_{xy}(\tau) - \eta_x\eta_y^* \tag{10-47}$$

If $\mathbf{x}(t)$ is WSS white noise, then [see (10-34)]

$$C(\tau) = q\delta(\tau) \tag{10-48}$$

If $\mathbf{x}(t)$ is an $a$-dependent process, then $C(\tau) = 0$ for $|\tau| > a$. In this case, the constant $a$ is called the correlation time of $\mathbf{x}(t)$. This term is also used for arbitrary processes and it is defined as the ratio

$$\tau_c = \frac{1}{C(0)}\int_0^\infty C(\tau)\,d\tau \tag{10-49}$$

In general $C(\tau) \ne 0$ for every $\tau$. However, for most regular processes

$$C(\tau) \to 0 \qquad R(\tau) \to |\eta|^2 \qquad\text{as } |\tau| \to \infty$$

Example 10-12. If $\mathbf{x}(t)$ is WSS and

$$\mathbf{s} = \int_{-T}^{T} \mathbf{x}(t)\,dt$$

then [see (10-28)]

$$\sigma_s^2 = \int_{-T}^{T}\int_{-T}^{T} C(t_1 - t_2)\,dt_1\,dt_2 = \int_{-2T}^{2T}(2T - |\tau|)C(\tau)\,d\tau \tag{10-50}$$

The last equality follows with $\tau = t_1 - t_2$ (see Fig. 10-5); the details, however, are omitted [see also (10-143)].

FIGURE 10-5
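Since the details of (10-50) are omitted, a numerical comparison of its two sides may be reassuring. The sketch below is only an illustration: it assumes the hypothetical autocovariance $C(\tau) = e^{-|\tau|}$, for which the single integral has the closed form $2(2T - 1 + e^{-2T})$, and it uses plain midpoint sums with arbitrarily chosen grid sizes.

```python
import math

def C(tau):
    """Hypothetical autocovariance used only for this check."""
    return math.exp(-abs(tau))

def variance_double(T, n=400):
    """Midpoint-rule value of the double integral of C(t1 - t2) over (-T, T) x (-T, T)."""
    h = 2 * T / n
    grid = [-T + (k + 0.5) * h for k in range(n)]
    return sum(C(t1 - t2) for t1 in grid for t2 in grid) * h * h

def variance_single(T, n=4000):
    """Midpoint-rule value of the integral of (2T - |tau|) C(tau) over (-2T, 2T)."""
    h = 4 * T / n
    taus = (-2 * T + (k + 0.5) * h for k in range(n))
    return sum((2 * T - abs(tau)) * C(tau) for tau in taus) * h

T = 1.0
print(variance_double(T), variance_single(T), 2 * (2 * T - 1 + math.exp(-2 * T)))
```

The three printed values should agree to a few decimal places; the weight $2T - |\tau|$ is the length of the segment of the line $t_1 - t_2 = \tau$ that lies inside the square, measured along $t_2$.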
Special cases. (a) If $C(\tau) = q\delta(\tau)$, then

$$\sigma_s^2 = q\int_{-2T}^{2T}(2T - |\tau|)\delta(\tau)\,d\tau = 2Tq$$

(b) If the process $\mathbf{x}(t)$ is $a$-dependent and $a \ll T$, then (10-50) yields

$$\sigma_s^2 = \int_{-2T}^{2T}(2T - |\tau|)C(\tau)\,d\tau \simeq 2T\int_{-a}^{a} C(\tau)\,d\tau$$

This shows that, in the evaluation of the variance of $\mathbf{s}$, an $a$-dependent process with $a \ll T$ can be replaced by white noise as in (10-48) with

$$q = \int_{-a}^{a} C(\tau)\,d\tau$$

If a process is SSS, then it is also WSS. This follows readily from (10-39) and (10-40). The converse, however, is not in general true. As we show next, normal processes are an important exception.

Indeed, suppose that $\mathbf{x}(t)$ is a normal WSS process with mean $\eta$ and autocovariance $C(\tau)$. As we see from (10-37), its $n$th-order characteristic function equals

$$\exp\left\{j\eta\sum_i \omega_i - \frac{1}{2}\sum_{i,k} C(t_i - t_k)\omega_i\omega_k\right\} \tag{10-51}$$

This function is invariant to a shift of the origin. And since it determines completely the statistics of $\mathbf{x}(t)$, we conclude that $\mathbf{x}(t)$ is SSS.

Example 10-13. We shall establish necessary and sufficient conditions for the stationarity of the process

$$\mathbf{x}(t) = \mathbf{a}\cos\omega t + \mathbf{b}\sin\omega t \tag{10-52}$$

The mean of this process equals

$$E\{\mathbf{x}(t)\} = E\{\mathbf{a}\}\cos\omega t + E\{\mathbf{b}\}\sin\omega t$$

This function must be independent of $t$. Hence the condition

$$E\{\mathbf{a}\} = E\{\mathbf{b}\} = 0 \tag{10-53}$$

is necessary for both forms of stationarity. We shall assume that it holds.

Wide sense. The process $\mathbf{x}(t)$ is WSS iff the RVs $\mathbf{a}$ and $\mathbf{b}$ are uncorrelated with equal variance:

$$E\{\mathbf{a}\mathbf{b}\} = 0 \qquad E\{\mathbf{a}^2\} = E\{\mathbf{b}^2\} = \sigma^2 \tag{10-54}$$

If this holds, then

$$R(\tau) = \sigma^2\cos\omega\tau \tag{10-55}$$

Proof. If $\mathbf{x}(t)$ is WSS, then

$$E\{\mathbf{x}^2(0)\} = E\{\mathbf{x}^2(\pi/2\omega)\} = R(0)$$
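But $\mathbf{x}(0) = \mathbf{a}$ and $\mathbf{x}(\pi/2\omega) = \mathbf{b}$; hence $E\{\mathbf{a}^2\} = E\{\mathbf{b}^2\}$. Using the above, we obtain

$$E\{\mathbf{x}(t + \tau)\mathbf{x}(t)\} = E\{[\mathbf{a}\cos\omega(t + \tau) + \mathbf{b}\sin\omega(t + \tau)][\mathbf{a}\cos\omega t + \mathbf{b}\sin\omega t]\} = \sigma^2\cos\omega\tau + E\{\mathbf{a}\mathbf{b}\}\sin\omega(2t + \tau) \tag{10-56}$$

This is independent of $t$ only if $E\{\mathbf{a}\mathbf{b}\} = 0$ and (10-54) results.

Conversely, if (10-54) holds, then, as we see from (10-56), the autocorrelation of $\mathbf{x}(t)$ equals $\sigma^2\cos\omega\tau$; hence $\mathbf{x}(t)$ is WSS.

Strict sense. The process $\mathbf{x}(t)$ is SSS iff the joint density $f(a,b)$ of the RVs $\mathbf{a}$ and $\mathbf{b}$ has circular symmetry, that is, if

$$f(a,b) = f\left(\sqrt{a^2 + b^2}\right) \tag{10-57}$$

Proof. If $\mathbf{x}(t)$ is SSS, then the RVs

$$\mathbf{x}(0) = \mathbf{a} \qquad \mathbf{x}(\pi/2\omega) = \mathbf{b}$$

and

$$\mathbf{x}(t) = \mathbf{a}\cos\omega t + \mathbf{b}\sin\omega t \qquad \mathbf{x}(t + \pi/2\omega) = \mathbf{b}\cos\omega t - \mathbf{a}\sin\omega t$$

have the same joint density for every $t$. Hence [see (6-70)], $f(a,b)$ must have circular symmetry.

We shall now show that, if $f(a,b)$ has circular symmetry, then $\mathbf{x}(t)$ is SSS. With $\tau$ a given number and

$$\mathbf{a}_1 = \mathbf{a}\cos\omega\tau + \mathbf{b}\sin\omega\tau \qquad \mathbf{b}_1 = \mathbf{b}\cos\omega\tau - \mathbf{a}\sin\omega\tau$$

we form the process

$$\mathbf{x}_1(t) = \mathbf{a}_1\cos\omega t + \mathbf{b}_1\sin\omega t = \mathbf{x}(t + \tau)$$

Clearly, the statistics of $\mathbf{x}(t)$ and $\mathbf{x}_1(t)$ are determined in terms of the joint densities $f(a,b)$ and $f(a_1,b_1)$ of the RVs $\mathbf{a},\mathbf{b}$ and $\mathbf{a}_1,\mathbf{b}_1$. But [see (6-67)] the RVs $\mathbf{a},\mathbf{b}$ and $\mathbf{a}_1,\mathbf{b}_1$ have the same joint density. Hence the processes $\mathbf{x}(t)$ and $\mathbf{x}(t + \tau)$ have the same statistics for every $\tau$.

Corollary. If the process $\mathbf{x}(t)$ is SSS and the RVs $\mathbf{a}$ and $\mathbf{b}$ are independent, then they are normal.

Proof. It follows from (10-57) and (6-34).

Example 10-14. (a) Given an RV $\boldsymbol{\omega}$ with density $f(\omega)$ and an RV $\boldsymbol{\varphi}$ uniform in the interval $(-\pi,\pi)$ and independent of $\boldsymbol{\omega}$, we form the process

$$\mathbf{x}(t) = a\cos(\boldsymbol{\omega}t + \boldsymbol{\varphi}) \tag{10-58}$$

We shall show that $\mathbf{x}(t)$ is WSS with zero mean and autocorrelation

$$R(\tau) = \frac{a^2}{2}E\{\cos\boldsymbol{\omega}\tau\} = \frac{a^2}{2}\operatorname{Re}\Phi_\omega(\tau) \tag{10-59}$$

where

$$\Phi_\omega(\tau) = E\{e^{j\boldsymbol{\omega}\tau}\} = E\{\cos\boldsymbol{\omega}\tau\} + jE\{\sin\boldsymbol{\omega}\tau\} \tag{10-60}$$

is the characteristic function of $\boldsymbol{\omega}$.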
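Proof. Clearly [see (7-59)]

$$E\{\cos(\boldsymbol{\omega}t + \boldsymbol{\varphi})\} = E\{E\{\cos(\boldsymbol{\omega}t + \boldsymbol{\varphi}) \mid \boldsymbol{\omega}\}\}$$

From the independence of $\boldsymbol{\omega}$ and $\boldsymbol{\varphi}$, it follows that

$$E\{\cos(\omega t + \boldsymbol{\varphi}) \mid \boldsymbol{\omega} = \omega\} = \cos\omega t\,E\{\cos\boldsymbol{\varphi}\} - \sin\omega t\,E\{\sin\boldsymbol{\varphi}\}$$

Hence $E\{\mathbf{x}(t)\} = 0$ because

$$E\{\cos\boldsymbol{\varphi}\} = \frac{1}{2\pi}\int_{-\pi}^{\pi}\cos\varphi\,d\varphi = 0 \qquad E\{\sin\boldsymbol{\varphi}\} = \frac{1}{2\pi}\int_{-\pi}^{\pi}\sin\varphi\,d\varphi = 0$$

Reasoning similarly, we obtain $E\{\cos(2\boldsymbol{\omega}t + \boldsymbol{\omega}\tau + 2\boldsymbol{\varphi})\} = 0$. And since

$$2\cos[\omega(t + \tau) + \varphi]\cos(\omega t + \varphi) = \cos\omega\tau + \cos(2\omega t + \omega\tau + 2\varphi)$$

we conclude that

$$R(\tau) = a^2E\{\cos[\boldsymbol{\omega}(t + \tau) + \boldsymbol{\varphi}]\cos(\boldsymbol{\omega}t + \boldsymbol{\varphi})\} = \frac{a^2}{2}E\{\cos\boldsymbol{\omega}\tau\}$$

(b) With $\boldsymbol{\omega}$ and $\boldsymbol{\varphi}$ as above, the process

$$\mathbf{z}(t) = ae^{j(\boldsymbol{\omega}t + \boldsymbol{\varphi})}$$

is WSS with zero mean and autocorrelation

$$E\{\mathbf{z}(t + \tau)\mathbf{z}^*(t)\} = a^2E\{e^{j\boldsymbol{\omega}\tau}\} = a^2\Phi_\omega(\tau)$$

Centering. Given a process $\mathbf{x}(t)$ with mean $\eta(t)$ and autocovariance $C_x(t_1,t_2)$, we form the difference

$$\tilde{\mathbf{x}}(t) = \mathbf{x}(t) - \eta(t) \tag{10-61}$$

This difference is called the centered process associated with the process $\mathbf{x}(t)$. Note that

$$E\{\tilde{\mathbf{x}}(t)\} = 0 \qquad R_{\tilde{x}\tilde{x}}(t_1,t_2) = C_x(t_1,t_2)$$

From this it follows that if the process $\mathbf{x}(t)$ is covariance stationary, that is, if $C_x(t_1,t_2) = C_x(t_1 - t_2)$, then its centered process $\tilde{\mathbf{x}}(t)$ is WSS.

Other forms of stationarity. A process $\mathbf{x}(t)$ is asymptotically stationary if the statistics of the RVs $\mathbf{x}(t_1 + c),\ldots,\mathbf{x}(t_n + c)$ do not depend on $c$ if $c$ is large. More precisely, the function

$$f(x_1,\ldots,x_n;t_1 + c,\ldots,t_n + c)$$

tends to a limit (that does not depend on $c$) as $c \to \infty$. The semirandom telegraph signal is an example.

A process $\mathbf{x}(t)$ is $N$th-order stationary if (10-38) holds not for every $n$, but only for $n \le N$.

A process $\mathbf{x}(t)$ is stationary in an interval if (10-38) holds for every $t_i$ and $t_i + c$ in this interval.

We say that $\mathbf{x}(t)$ is a process with stationary increments if its increments $\mathbf{y}(t) = \mathbf{x}(t + h) - \mathbf{x}(t)$ form a stationary process for every $h$. The Poisson process is an example.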
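Returning to Example 10-14(a), the claim that the mean vanishes and that $E\{\mathbf{x}(t + \tau)\mathbf{x}(t)\}$ does not depend on $t$ can be checked numerically. The sketch below is a hedged illustration: it assumes, purely for the sake of the example, that $\boldsymbol{\omega}$ is uniform in $(1,3)$ and that $a = 2$, so that $E\{\cos\boldsymbol{\omega}\tau\} = (\sin 3\tau - \sin\tau)/2\tau$, and it replaces both expectations by midpoint sums.

```python
import math

A = 2.0                # constant amplitude a in (10-58), assumed value
W_LO, W_HI = 1.0, 3.0  # hypothetical: omega uniform in (1, 3)

def expect(g, n_w=400, n_phi=64):
    """E{g(omega, phi)} by midpoint sums over the two uniform densities."""
    dw = (W_HI - W_LO) / n_w
    dphi = 2 * math.pi / n_phi
    total = 0.0
    for i in range(n_w):
        w = W_LO + (i + 0.5) * dw
        for k in range(n_phi):
            total += g(w, -math.pi + (k + 0.5) * dphi)
    return total / (n_w * n_phi)

def x(t, w, phi):
    """Sample of the process for given values of omega and phi."""
    return A * math.cos(w * t + phi)

def mean(t):
    return expect(lambda w, phi: x(t, w, phi))

def autocorrelation(t, tau):
    return expect(lambda w, phi: x(t + tau, w, phi) * x(t, w, phi))

def predicted(tau):
    """(a^2/2) E{cos(omega*tau)} from (10-59) for the assumed density, tau != 0."""
    return A**2 / 2 * (math.sin(W_HI * tau) - math.sin(W_LO * tau)) / ((W_HI - W_LO) * tau)

for t in (0.0, 1.7, 5.0):
    print(t, mean(t), autocorrelation(t, 0.8), predicted(0.8))
```

The autocorrelation column should be the same for every $t$ and match the prediction, which is what wide-sense stationarity requires.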
MEAN SQUARE PERIODICITY. A process $\mathbf{x}(t)$ is called MS periodic if

$$E\{|\mathbf{x}(t + T) - \mathbf{x}(t)|^2\} = 0 \tag{10-62}$$

for every $t$. From this it follows that, for a specific $t$,

$$\mathbf{x}(t + T) = \mathbf{x}(t) \tag{10-63}$$

with probability 1. It does not, however, follow that the set of outcomes $\zeta$ such that $\mathbf{x}(t + T,\zeta) = \mathbf{x}(t,\zeta)$ for all $t$ has probability 1.

As we see from (10-63) the mean of an MS periodic process is periodic. We shall examine the properties of $R(t_1,t_2)$.

THEOREM. A process $\mathbf{x}(t)$ is MS periodic iff its autocorrelation is doubly periodic, that is, if

$$R(t_1 + mT, t_2 + nT) = R(t_1,t_2) \tag{10-64}$$

for every integer $m$ and $n$.

Proof. As we know [see (7-12)]

$$E^2\{\mathbf{z}\mathbf{w}\} \le E\{\mathbf{z}^2\}E\{\mathbf{w}^2\}$$

With $\mathbf{z} = \mathbf{x}(t_1)$ and $\mathbf{w} = \mathbf{x}(t_2 + T) - \mathbf{x}(t_2)$ the above yields

$$E^2\{\mathbf{x}(t_1)[\mathbf{x}(t_2 + T) - \mathbf{x}(t_2)]\} \le E\{\mathbf{x}^2(t_1)\}E\{[\mathbf{x}(t_2 + T) - \mathbf{x}(t_2)]^2\}$$

If $\mathbf{x}(t)$ is MS periodic, then the last term above is 0. Equating the left side to 0, we obtain

$$R(t_1,t_2 + T) - R(t_1,t_2) = 0$$

Repeated application of this yields (10-64).

Conversely, if (10-64) is true, then

$$R(t + T,t + T) = R(t + T,t) = R(t,t)$$

Hence

$$E\{[\mathbf{x}(t + T) - \mathbf{x}(t)]^2\} = R(t + T,t + T) + R(t,t) - 2R(t + T,t) = 0$$

therefore $\mathbf{x}(t)$ is MS periodic.

10-2 SYSTEMS WITH STOCHASTIC INPUTS

Given a stochastic process $\mathbf{x}(t)$, we assign according to some rule to each of its samples $\mathbf{x}(t,\zeta_i)$ a function $\mathbf{y}(t,\zeta_i)$. We have thus created another process

$$\mathbf{y}(t) = T[\mathbf{x}(t)]$$

whose samples are the functions $\mathbf{y}(t,\zeta_i)$. The process $\mathbf{y}(t)$ so formed can be considered as the output of a system (transformation) with input the process $\mathbf{x}(t)$. The system is completely specified in terms of the operator $T$, that is, the rule of correspondence between the samples of the input $\mathbf{x}(t)$ and the output $\mathbf{y}(t)$.
The system is deterministic if it operates only on the variable $t$ treating $\zeta$ as a parameter. This means that if two samples $\mathbf{x}(t,\zeta_1)$ and $\mathbf{x}(t,\zeta_2)$ of the input are identical in $t$, then the corresponding samples $\mathbf{y}(t,\zeta_1)$ and $\mathbf{y}(t,\zeta_2)$ of the output are also identical in $t$. The system is called stochastic if $T$ operates on both variables $t$ and $\zeta$. This means that there exist two outcomes $\zeta_1$ and $\zeta_2$ such that $\mathbf{x}(t,\zeta_1) = \mathbf{x}(t,\zeta_2)$ identically in $t$ but $\mathbf{y}(t,\zeta_1) \ne \mathbf{y}(t,\zeta_2)$. These classifications are based on the terminal properties of the system. If the system is specified in terms of physical elements or by an equation, then it is deterministic (stochastic) if the elements or the coefficients of the defining equations are deterministic (stochastic). Throughout this book we shall consider only deterministic systems.

In principle, the statistics of the output of a system can be expressed in terms of the statistics of the input. However, in general this is a complicated problem. We consider next two important special cases.

Memoryless Systems

A system is called memoryless if its output is given by

$$\mathbf{y}(t) = g[\mathbf{x}(t)]$$

where $g(x)$ is a function of $x$. Thus, at a given time $t = t_1$, the output $\mathbf{y}(t_1)$ depends only on $\mathbf{x}(t_1)$ and not on any other past or future values of $\mathbf{x}(t)$.

From the above it follows that the first-order density $f_y(y;t)$ of $\mathbf{y}(t)$ can be expressed in terms of the corresponding density $f_x(x;t)$ of $\mathbf{x}(t)$ as in Sec. 5-2. Furthermore,

$$E\{\mathbf{y}(t)\} = \int_{-\infty}^{\infty} g(x)f_x(x;t)\,dx$$

Similarly, since $\mathbf{y}(t_1) = g[\mathbf{x}(t_1)]$ and $\mathbf{y}(t_2) = g[\mathbf{x}(t_2)]$, the second-order density $f_y(y_1,y_2;t_1,t_2)$ of $\mathbf{y}(t)$ can be determined in terms of the corresponding density $f_x(x_1,x_2;t_1,t_2)$ of $\mathbf{x}(t)$ as in Sec. 6-3. Furthermore,

$$E\{\mathbf{y}(t_1)\mathbf{y}(t_2)\} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x_1)g(x_2)f_x(x_1,x_2;t_1,t_2)\,dx_1\,dx_2$$

The $n$th-order density $f_y(y_1,\ldots,y_n;t_1,\ldots,t_n)$ of $\mathbf{y}(t)$ can be determined from the corresponding density of $\mathbf{x}(t)$ as in (8-8) where the underlying transformation is the system

$$\mathbf{y}(t_1) = g[\mathbf{x}(t_1)],\ldots,\mathbf{y}(t_n) = g[\mathbf{x}(t_n)] \tag{10-65}$$

STATIONARITY.
Suppose that the input to a memoryless system is an SSS process $\mathbf{x}(t)$. We shall show that the resulting output $\mathbf{y}(t)$ is also SSS.

Proof. To determine the $n$th-order density of $\mathbf{y}(t)$, we solve the system

$$g(x_1) = y_1,\ldots,g(x_n) = y_n \tag{10-66}$$
If this system has a unique solution, then [see (8-8)]

$$f_y(y_1,\ldots,y_n;t_1,\ldots,t_n) = \frac{f_x(x_1,\ldots,x_n;t_1,\ldots,t_n)}{|g'(x_1)\cdots g'(x_n)|} \tag{10-67}$$

From the stationarity of $\mathbf{x}(t)$ it follows that the numerator in (10-67) is invariant to a shift of the time origin. And since the denominator does not depend on $t$, we conclude that the left side does not change if $t_i$ is replaced by $t_i + c$. Hence $\mathbf{y}(t)$ is SSS. We can similarly show that this is true even if (10-66) has more than one solution.

Notes
1. If $\mathbf{x}(t)$ is stationary of order $N$, then $\mathbf{y}(t)$ is stationary of order $N$.
2. If $\mathbf{x}(t)$ is stationary in an interval, then $\mathbf{y}(t)$ is stationary in the same interval.
3. If $\mathbf{x}(t)$ is WSS, then $\mathbf{y}(t)$ might not be stationary in any sense.

Square-law detector. A square-law detector is a memoryless system whose output equals

$$\mathbf{y}(t) = \mathbf{x}^2(t)$$

We shall determine its first- and second-order densities. If $y > 0$, then the system $y = x^2$ has the two solutions $\pm\sqrt{y}$. Furthermore, $y'(x) = \pm 2\sqrt{y}$; hence

$$f_y(y;t) = \frac{1}{2\sqrt{y}}\left[f_x(\sqrt{y};t) + f_x(-\sqrt{y};t)\right]$$

If $y_1 > 0$ and $y_2 > 0$, then the system

$$y_1 = x_1^2 \qquad y_2 = x_2^2$$

has the four solutions $(\pm\sqrt{y_1}, \pm\sqrt{y_2})$. Furthermore, its jacobian equals $\pm 4\sqrt{y_1y_2}$; hence

$$f_y(y_1,y_2;t_1,t_2) = \frac{1}{4\sqrt{y_1y_2}}\sum f_x(\pm\sqrt{y_1}, \pm\sqrt{y_2};t_1,t_2)$$

where the summation has four terms.

Note that, if $\mathbf{x}(t)$ is SSS, then $f_x(x;t) = f_x(x)$ is independent of $t$ and $f_x(x_1,x_2;t_1,t_2) = f_x(x_1,x_2;\tau)$ depends only on $\tau = t_1 - t_2$. Hence $f_y(y)$ is independent of $t$ and $f_y(y_1,y_2;\tau)$ depends only on $\tau = t_1 - t_2$.

Example 10-15. Suppose that $\mathbf{x}(t)$ is a normal stationary process with zero mean and autocorrelation $R_x(\tau)$. In this case, $f_x(x)$ is normal with variance $R_x(0)$.

If $\mathbf{y}(t) = \mathbf{x}^2(t)$ (Fig. 10-6), then $E\{\mathbf{y}(t)\} = R_x(0)$ and [see (5-8)]

$$f_y(y) = \frac{1}{\sqrt{2\pi R_x(0)y}}\,e^{-y/2R_x(0)}\,U(y)$$
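We shall show that

$$R_y(\tau) = R_x^2(0) + 2R_x^2(\tau) \tag{10-68}$$

Proof. The RVs $\mathbf{x}(t + \tau)$ and $\mathbf{x}(t)$ are jointly normal with zero mean. Hence [see (7-36)]

$$E\{\mathbf{x}^2(t + \tau)\mathbf{x}^2(t)\} = E\{\mathbf{x}^2(t + \tau)\}E\{\mathbf{x}^2(t)\} + 2E^2\{\mathbf{x}(t + \tau)\mathbf{x}(t)\}$$

and (10-68) results.

Note in particular that

$$E\{\mathbf{y}^2(t)\} = R_y(0) = 3R_x^2(0) \qquad \sigma_y^2 = 2R_x^2(0)$$

Hard limiter. Consider a memoryless system with

$$g(x) = \begin{cases} 1 & x > 0 \\ -1 & x < 0 \end{cases} \tag{10-69}$$

(Fig. 10-7). Its output $\mathbf{y}(t)$ takes the values $\pm 1$ and

$$P\{\mathbf{y}(t) = 1\} = P\{\mathbf{x}(t) > 0\} = 1 - F_x(0)$$

$$P\{\mathbf{y}(t) = -1\} = P\{\mathbf{x}(t) < 0\} = F_x(0)$$

Hence

$$E\{\mathbf{y}(t)\} = 1 \times P\{\mathbf{y}(t) = 1\} - 1 \times P\{\mathbf{y}(t) = -1\} = 1 - 2F_x(0)$$

The product $\mathbf{y}(t + \tau)\mathbf{y}(t)$ equals 1 if $\mathbf{x}(t + \tau)\mathbf{x}(t) > 0$ and it equals $-1$ otherwise. Hence

$$R_y(\tau) = P\{\mathbf{x}(t + \tau)\mathbf{x}(t) > 0\} - P\{\mathbf{x}(t + \tau)\mathbf{x}(t) < 0\} \tag{10-70}$$

FIGURE 10-7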
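Thus, in the probability plane of the RVs $\mathbf{x}(t + \tau)$ and $\mathbf{x}(t)$, $R_y(\tau)$ equals the masses in the first and third quadrants minus the masses in the second and fourth quadrants.

Example 10-16. We shall show that if $\mathbf{x}(t)$ is a normal stationary process with zero mean, then the autocorrelation of the output of a hard limiter equals

$$R_y(\tau) = \frac{2}{\pi}\arcsin\frac{R_x(\tau)}{R_x(0)} \tag{10-71}$$

This result is known as the arcsine law.†

Proof. The RVs $\mathbf{x}(t + \tau)$ and $\mathbf{x}(t)$ are jointly normal with zero mean, variance $R_x(0)$, and correlation coefficient $R_x(\tau)/R_x(0)$. Hence [see (6-47)],

$$P\{\mathbf{x}(t + \tau)\mathbf{x}(t) > 0\} = \frac{1}{2} + \frac{\alpha}{\pi} \qquad P\{\mathbf{x}(t + \tau)\mathbf{x}(t) < 0\} = \frac{1}{2} - \frac{\alpha}{\pi} \qquad \sin\alpha = \frac{R_x(\tau)}{R_x(0)}$$

Inserting in (10-70), we obtain

$$R_y(\tau) = \frac{1}{2} + \frac{\alpha}{\pi} - \left(\frac{1}{2} - \frac{\alpha}{\pi}\right) = \frac{2\alpha}{\pi}$$

and (10-71) follows.

†J. L. Lawson and G. E. Uhlenbeck: Threshold Signals, McGraw-Hill Book Company, New York, 1950.

Example 10-17 Bussgang's theorem. Using Price's theorem, we shall show that if the input to a memoryless system $y = g(x)$ is a zero-mean normal process $\mathbf{x}(t)$, the cross-correlation of $\mathbf{x}(t)$ with the resulting output $\mathbf{y}(t) = g[\mathbf{x}(t)]$ is proportional to $R_{xx}(\tau)$:

$$R_{xy}(\tau) = KR_{xx}(\tau) \qquad\text{where}\qquad K = E\{g'[\mathbf{x}(t)]\} \tag{10-72}$$

Proof. For a specific $\tau$, the RVs $\mathbf{x} = \mathbf{x}(t)$ and $\mathbf{z} = \mathbf{x}(t + \tau)$ are jointly normal with zero mean and covariance $\mu = E\{\mathbf{x}\mathbf{z}\} = R_{xx}(\tau)$. With

$$I = E\{\mathbf{z}g(\mathbf{x})\} = E\{\mathbf{x}(t + \tau)\mathbf{y}(t)\} = R_{xy}(\tau)$$

it follows from (7-37) that

$$\frac{\partial I}{\partial\mu} = E\left\{\frac{\partial^2[\mathbf{z}g(\mathbf{x})]}{\partial x\,\partial z}\right\} = E\{g'[\mathbf{x}(t)]\} = K \tag{10-73}$$

If $\mu = 0$, the RVs $\mathbf{x}(t + \tau)$ and $\mathbf{x}(t)$ are independent; hence $I = 0$. Integrating (10-73) with respect to $\mu$, we obtain $I = K\mu$ and (10-72) results.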
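Both the arcsine law (10-71) and Bussgang's theorem (10-72) lend themselves to a quick Monte Carlo check. The sketch below is an illustration under stated assumptions: the pair $\mathbf{x}(t)$, $\mathbf{x}(t + \tau)$ is modeled as two zero-mean, unit-variance normal RVs with a chosen correlation `rho` (standing in for $R_{xx}(\tau)/R_{xx}(0)$), and $g = \tanh$ is a hypothetical smooth nonlinearity used for the Bussgang part.

```python
import math
import random

def correlated_pairs(rho, trials, seed=0):
    """Zero-mean, unit-variance jointly normal pairs (x, z) with E{xz} = rho."""
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - rho * rho)
    for _ in range(trials):
        x = rng.gauss(0.0, 1.0)
        z = rho * x + scale * rng.gauss(0.0, 1.0)
        yield x, z

def sgn(v):
    return 1.0 if v > 0 else -1.0

def check(rho, trials=40000):
    """Return estimates of R_y (hard limiter), R_xy and K (for g = tanh)."""
    limiter = cross = slope = 0.0
    for x, z in correlated_pairs(rho, trials):
        limiter += sgn(x) * sgn(z)        # y(t + tau) y(t) for the hard limiter
        cross += z * math.tanh(x)         # x(t + tau) g[x(t)] with g = tanh
        slope += 1.0 - math.tanh(x) ** 2  # g'[x(t)]
    return limiter / trials, cross / trials, slope / trials

rho = 0.6
r_limiter, r_cross, k = check(rho)
print(r_limiter, 2 / math.pi * math.asin(rho))  # arcsine law (10-71)
print(r_cross, k * rho)                         # Bussgang (10-72) with R_xx(tau) = rho
```

With unit variance, `rho` plays the role of $R_{xx}(\tau)$ itself, so the second line compares $R_{xy}(\tau)$ with $KR_{xx}(\tau)$ directly.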
Special cases.† (a) (Hard limiter) Suppose that $g(x) = \operatorname{sgn} x$ as in (10-69). In this case, $g'(x) = 2\delta(x)$; hence
$$K = E\{2\delta(x)\} = 2\int_{-\infty}^{\infty}\delta(x)f(x)\,dx = 2f(0)$$
where
$$f(x) = \frac{1}{\sqrt{2\pi R_{xx}(0)}}\exp\left\{-\frac{x^2}{2R_{xx}(0)}\right\}$$
is the first-order density of $x(t)$. Inserting into (10-72), we obtain
$$R_{xy}(\tau) = R_{xx}(\tau)\sqrt{\frac{2}{\pi R_{xx}(0)}} \qquad y(t) = \operatorname{sgn} x(t) \tag{10-74}$$

(b) (Limiter) Suppose next that $y(t)$ is the output of a limiter
$$g(x) = \begin{cases} x & |x| < c \\ c\,\operatorname{sgn} x & |x| > c \end{cases} \qquad g'(x) = \begin{cases} 1 & |x| < c \\ 0 & |x| > c \end{cases}$$
In this case,
$$K = \int_{-c}^{c} f(x)\,dx = 2G\!\left(\frac{c}{\sqrt{R_{xx}(0)}}\right) - 1 \tag{10-75}$$

Linear Systems

The notation
$$y(t) = L[x(t)] \tag{10-76}$$
will indicate that $y(t)$ is the output of a linear system with input $x(t)$. This means that
$$L[a_1x_1(t) + a_2x_2(t)] = a_1L[x_1(t)] + a_2L[x_2(t)] \tag{10-77}$$
for any $a_1, a_2, x_1(t), x_2(t)$.

The above is the familiar definition of linearity and it also holds if the coefficients $a_1$ and $a_2$ are random variables because, as we have assumed, the system is deterministic, that is, it operates only on the variable $t$.

Note. If a system is specified by its internal structure or by a differential equation, then (10-77) holds only if $y(t)$ is the zero-state response. The response due to the initial conditions (zero-input response) will not be considered.

A system is called time-invariant if its response to $x(t+c)$ equals $y(t+c)$. We shall assume throughout that all linear systems under consideration are time-invariant.

† H. E. Rowe, "Memoryless Nonlinearities with Gaussian Inputs," BSTJ, vol. 67, no. 7, September 1982.
It is well known that the output of a linear system is a convolution
$$y(t) = x(t) * h(t) = \int_{-\infty}^{\infty} x(t-\alpha)h(\alpha)\,d\alpha \tag{10-78}$$
where $h(t) = L[\delta(t)]$ is its impulse response. In the following, most systems will be specified by (10-78). However, we start our investigation using the operational notation (10-76) to stress the fact that various results based on the next theorem also hold for arbitrary linear operators involving one or more variables.

The following observations are immediate consequences of the linearity and time invariance of the system.

If $x(t)$ is a normal process, then $y(t)$ is also a normal process. This is an extension of the familiar property of linear transformations of normal RVs and can be justified if we approximate the integral in (10-78) by a sum:
$$y(t_i) \simeq \sum_k x(t_i - \alpha_k)h(\alpha_k)\,\Delta\alpha$$

If $x(t)$ is SSS, then $y(t)$ is also SSS. Indeed, since $y(t+c) = L[x(t+c)]$ for every $c$, we conclude that if the processes $x(t)$ and $x(t+c)$ have the same statistical properties, so do the processes $y(t)$ and $y(t+c)$. We show later [see (10-133)] that if $x(t)$ is WSS, the processes $x(t)$ and $y(t)$ are jointly WSS.

Fundamental theorem. For any linear system
$$E\{L[x(t)]\} = L[E\{x(t)\}] \tag{10-79}$$
In other words, the mean $\eta_y(t)$ of the output $y(t)$ equals the response of the system to the mean $\eta_x(t)$ of the input (Fig. 10-8a):
$$\eta_y(t) = L[\eta_x(t)] \tag{10-80}$$
The above is a simple extension of the linearity of expected values to arbitrary linear operators. In the context of (10-78) it can be deduced if we write the integral as a limit of a sum. This yields
$$E\{y(t)\} = \int_{-\infty}^{\infty} E\{x(t-\alpha)\}h(\alpha)\,d\alpha = \eta_x(t) * h(t) \tag{10-81}$$

FIGURE 10-8 (panels a and b)
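The fundamental theorem (10-79) can be illustrated with a small simulation. The sketch below rests on assumptions that are not part of the text: the continuous-time convolution (10-78) is replaced by a discrete-time system with impulse response $h[k] = a^k$, the input is a unit step plus independent unit-variance normal noise, and the expected value is approximated by an average over `n_trials` realizations. The function names are hypothetical.

```python
import random


def first_order_response(x, a):
    """Zero-state response y[n] = a*y[n-1] + x[n] (convolution with h[k] = a**k)."""
    y, prev = [], 0.0
    for sample in x:
        prev = a * prev + sample
        y.append(prev)
    return y


def ensemble_mean(realizations):
    """Average a list of equal-length sequences term by term."""
    n = len(realizations)
    return [sum(column) / n for column in zip(*realizations)]


def simulate(n_trials=2000, n_steps=60, a=0.5, seed=0):
    rng = random.Random(seed)
    f = [1.0] * n_steps                      # assumed mean of the input (unit step)
    inputs = [[f_n + rng.gauss(0.0, 1.0) for f_n in f] for _ in range(n_trials)]
    outputs = [first_order_response(x, a) for x in inputs]
    mean_of_outputs = ensemble_mean(outputs)                          # estimate of E{L[x]}
    output_of_mean = first_order_response(ensemble_mean(inputs), a)   # L[estimate of E{x}]
    response_to_f = first_order_response(f, a)                        # L[eta_x]
    return mean_of_outputs, output_of_mean, response_to_f


if __name__ == "__main__":
    m_out, out_m, resp_f = simulate()
    print(m_out[-1], out_m[-1], resp_f[-1])
```

The first two sequences coincide up to rounding, because averaging and the linear operator commute exactly; both approach the response to the mean as the number of realizations grows.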
Frequency interpretation. At the $i$th trial the input to our system is a function $x(t, \zeta_i)$ yielding as output the function $y(t, \zeta_i) = L[x(t, \zeta_i)]$. For large $n$,
$$E\{y(t)\} \simeq \frac{y(t,\zeta_1) + \cdots + y(t,\zeta_n)}{n} = \frac{L[x(t,\zeta_1)] + \cdots + L[x(t,\zeta_n)]}{n}$$
From the linearity of the system it follows that the last term above equals
$$L\left[\frac{x(t,\zeta_1) + \cdots + x(t,\zeta_n)}{n}\right]$$
This agrees with (10-79) because the fraction is nearly equal to $E\{x(t)\}$.

Notes. 1. From (10-80) it follows that if
$$\tilde{x}(t) = x(t) - \eta_x(t) \qquad \tilde{y}(t) = y(t) - \eta_y(t)$$
then
$$L[\tilde{x}(t)] = L[x(t)] - L[\eta_x(t)] = \tilde{y}(t) \tag{10-82}$$
Thus the response of a linear system to the centered input $\tilde{x}(t)$ equals the centered output $\tilde{y}(t)$.

2. Suppose that
$$x(t) = f(t) + \nu(t) \qquad E\{\nu(t)\} = 0$$
In this case, $E\{x(t)\} = f(t)$; hence
$$\eta_y(t) = f(t) * h(t)$$
Thus, if $x(t)$ is the sum of a deterministic signal $f(t)$ and a random component $\nu(t)$, then for the determination of the mean of the output we can ignore $\nu(t)$ provided that the system is linear and $E\{\nu(t)\} = 0$.

Theorem (10-79) can be used to express the joint moments of any order of the output $y(t)$ of a linear system in terms of the corresponding moments of the input. The following special cases are of fundamental importance in the study of linear systems with stochastic inputs.

OUTPUT AUTOCORRELATION. We wish to express the autocorrelation $R_{yy}(t_1,t_2)$ of the output $y(t)$ of a linear system in terms of the autocorrelation $R_{xx}(t_1,t_2)$ of the input $x(t)$. As we shall presently see, it is easier to find first the cross-correlation $R_{xy}(t_1,t_2)$ between $x(t)$ and $y(t)$.

THEOREM
(a)
$$R_{xy}(t_1,t_2) = L_2[R_{xx}(t_1,t_2)] \tag{10-83}$$
In the above notation, $L_2$ means that the system operates on the variable $t_2$, treating $t_1$ as a parameter. In the context of (10-78) this means that
$$R_{xy}(t_1,t_2) = \int_{-\infty}^{\infty} R_{xx}(t_1, t_2-\alpha)h(\alpha)\,d\alpha \tag{10-84}$$
(b)
$$R_{yy}(t_1,t_2) = L_1[R_{xy}(t_1,t_2)] \tag{10-85}$$
In this case, the system operates on $t_1$:
$$R_{yy}(t_1,t_2) = \int_{-\infty}^{\infty} R_{xy}(t_1-\alpha, t_2)h(\alpha)\,d\alpha \tag{10-86}$$

Proof. Multiplying (10-76) by $x(t_1)$ and using (10-77), we obtain
$$x(t_1)y(t) = L_t[x(t_1)x(t)]$$
where $L_t$ means that the system operates on $t$. Hence [see (10-79)]
$$E\{x(t_1)y(t)\} = L_t[E\{x(t_1)x(t)\}]$$
and (10-83) follows with $t = t_2$. The proof of (10-85) is similar: We multiply (10-76) by $y(t_2)$ and use (10-79). This yields
$$E\{y(t)y(t_2)\} = L_t[E\{x(t)y(t_2)\}]$$
and (10-85) follows with $t = t_1$.

The preceding theorem is illustrated in Fig. 10-8b: If $R_{xx}(t_1,t_2)$ is the input to the given system and the system operates on $t_2$, the output equals $R_{xy}(t_1,t_2)$. If $R_{xy}(t_1,t_2)$ is the input and the system operates on $t_1$, the output equals $R_{yy}(t_1,t_2)$.

Inserting (10-84) into (10-86), we obtain
$$R_{yy}(t_1,t_2) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} R_{xx}(t_1-\alpha, t_2-\beta)h(\alpha)h(\beta)\,d\alpha\,d\beta$$
This expresses $R_{yy}(t_1,t_2)$ directly in terms of $R_{xx}(t_1,t_2)$. However, conceptually and operationally, it is preferable to find first $R_{xy}(t_1,t_2)$.

Example 10-18. A stationary process $\nu(t)$ with autocorrelation $R_{\nu\nu}(\tau) = q\delta(\tau)$ (white noise) is applied at $t = 0$ to a linear system with
$$h(t) = e^{-ct}U(t)$$
We shall show that the autocorrelation of the resulting output $y(t)$ equals
$$R_{yy}(t_1,t_2) = \frac{q}{2c}\left(1 - e^{-2ct_1}\right)e^{-c(t_2-t_1)} \tag{10-87}$$
for $0 < t_1 < t_2$.

Proof. We can use the preceding results if we assume that the input to the system is the process
$$x(t) = \nu(t)U(t)$$
With this assumption, all correlations are 0 if $t_1 < 0$ or $t_2 < 0$. For $t_1 > 0$ and $t_2 > 0$,
$$R_{xx}(t_1,t_2) = q\delta(t_1-t_2)$$
As we see from (10-83), $R_{xy}(t_1,t_2)$ equals the response of the system to $q\delta(t_1-t_2)$ considered as a function of $t_2$. Since $\delta(t_1-t_2) = \delta(t_2-t_1)$ and $L[\delta(t_2-t_1)] = h(t_2-t_1)$ (time invariance), we conclude that
$$R_{xy}(t_1,t_2) = qh(t_2-t_1) = qe^{-c(t_2-t_1)}U(t_2-t_1)$$
In Fig. 10-9, we show $R_{xy}(t_1,t_2)$ as a function of $t_1$ and $t_2$. Inserting into (10-86), we obtain
$$R_{yy}(t_1,t_2) = q\int_0^{t_1} e^{c(t_1-\alpha-t_2)}e^{-c\alpha}\,d\alpha \qquad t_1 < t_2$$
and (10-87) results.

Note that
$$E\{y^2(t)\} = R_{yy}(t,t) = \frac{q}{2c}\left(1 - e^{-2ct}\right)$$

COROLLARY. The autocovariance $C_{yy}(t_1,t_2)$ of $y(t)$ is the autocorrelation of the process $\tilde{y}(t) = y(t) - \eta_y(t)$ and, as we see from (10-82), $\tilde{y}(t)$ equals $L[\tilde{x}(t)]$. Applying (10-84) and (10-86) to the centered processes $\tilde{x}(t)$ and $\tilde{y}(t)$, we obtain
$$C_{xy}(t_1,t_2) = C_{xx}(t_1,t_2) * h(t_2) \qquad C_{yy}(t_1,t_2) = C_{xy}(t_1,t_2) * h(t_1) \tag{10-88}$$
where the convolutions are in $t_2$ and $t_1$, respectively.

Complex processes. The preceding results can be readily extended to complex processes and to systems with complex-valued $h(t)$. Reasoning as in the real case, we obtain
$$R_{xy}(t_1,t_2) = R_{xx}(t_1,t_2) * h^*(t_2) \qquad R_{yy}(t_1,t_2) = R_{xy}(t_1,t_2) * h(t_1) \tag{10-89}$$

Response to white noise. We shall determine the average intensity $E\{|y(t)|^2\}$ of the output of a system driven by white noise. This is a special case of (10-89); however, because of its importance it is stated as a theorem.

THEOREM. If the input to a linear system is white noise with autocorrelation
$$R_{xx}(t_1,t_2) = q(t_1)\delta(t_1-t_2)$$
then
$$E\{|y(t)|^2\} = q(t) * |h(t)|^2 = \int_{-\infty}^{\infty} q(t-\alpha)|h(\alpha)|^2\,d\alpha \tag{10-90}$$
Proof. From (10-89) it follows that
$$R_{xy}(t_1,t_2) = q(t_1)\delta(t_2-t_1) * h^*(t_2) = q(t_1)h^*(t_2-t_1)$$
$$R_{yy}(t_1,t_2) = \int_{-\infty}^{\infty} q(t_1-\alpha)h^*[t_2-(t_1-\alpha)]h(\alpha)\,d\alpha$$
and with $t_1 = t_2 = t$, (10-90) results.

Special cases. (a) If $x(t)$ is stationary white noise, then $q(t) = q$ and (10-90) yields
$$E\{y^2(t)\} = qE \qquad \text{where} \quad E = \int_{-\infty}^{\infty}|h(t)|^2\,dt$$
is the energy of $h(t)$.

(b) If $h(t)$ is of short duration relative to the variations of $q(t)$, then
$$E\{y^2(t)\} \simeq q(t)\int_{-\infty}^{\infty}|h(\alpha)|^2\,d\alpha = Eq(t) \tag{10-91}$$
This relationship justifies the term average intensity used to describe the function $q(t)$.

(c) If $R_{\nu\nu}(\tau) = q\delta(\tau)$ and $\nu(t)$ is applied to the system at $t = 0$, then $q(t) = qU(t)$ and (10-90) yields
$$E\{y^2(t)\} = q\int_{-\infty}^{t}|h(\alpha)|^2\,d\alpha$$

Example 10-19. The integral
$$y(t) = \int_0^t \nu(\alpha)\,d\alpha$$
can be considered as the output of a linear system with input $x(t) = \nu(t)U(t)$ and impulse response $h(t) = U(t)$. If, therefore, $\nu(t)$ is white noise with average intensity $q(t)$, then $x(t)$ is white noise with average intensity $q(t)U(t)$ and (10-90) yields
$$E\{y^2(t)\} = q(t)U(t) * U(t) = \int_0^t q(\alpha)\,d\alpha$$

Differentiators. A differentiator is a linear system whose output is the derivative of the input:
$$L[x(t)] = x'(t)$$
We can, therefore, use the preceding results to find the mean and the autocorrelation of $x'(t)$. From (10-80) it follows that
$$\eta_{x'}(t) = L[\eta_x(t)] = \eta_x'(t) \tag{10-92}$$
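Similarly [see (10-83)]
$$R_{xx'}(t_1,t_2) = L_2[R_{xx}(t_1,t_2)] = \frac{\partial R_{xx}(t_1,t_2)}{\partial t_2} \tag{10-93}$$
because, in this case, $L_2$ means differentiation with respect to $t_2$. Finally,
$$R_{x'x'}(t_1,t_2) = L_1[R_{xx'}(t_1,t_2)] = \frac{\partial R_{xx'}(t_1,t_2)}{\partial t_1} \tag{10-94}$$
Combining, we obtain
$$R_{x'x'}(t_1,t_2) = \frac{\partial^2 R_{xx}(t_1,t_2)}{\partial t_1\,\partial t_2} \tag{10-95}$$

Stationary processes. If $x(t)$ is WSS, then $\eta_x(t)$ is constant; hence
$$E\{x'(t)\} = 0 \tag{10-96}$$
Furthermore, since $R_{xx}(t_1,t_2) = R_{xx}(\tau)$, we conclude with $\tau = t_1 - t_2$ that
$$\frac{\partial R_{xx}(t_1-t_2)}{\partial t_2} = -\frac{dR_{xx}(\tau)}{d\tau} \qquad \frac{\partial^2 R_{xx}(t_1-t_2)}{\partial t_1\,\partial t_2} = -\frac{d^2R_{xx}(\tau)}{d\tau^2}$$
Hence
$$R_{xx'}(\tau) = -R_{xx}'(\tau) \qquad R_{x'x'}(\tau) = -R_{xx}''(\tau) \tag{10-97}$$

Poisson impulses. If the input $x(t)$ to a differentiator is a Poisson process, the resulting output $z(t)$ is a train of impulses (Fig. 10-10)
$$z(t) = \sum_i \delta(t - t_i) \tag{10-98}$$
We maintain that $z(t)$ is a stationary process with mean
$$\eta_z = \lambda \tag{10-99}$$
and autocorrelation
$$R_{zz}(\tau) = \lambda^2 + \lambda\delta(\tau) \tag{10-100}$$

FIGURE 10-10 (panels a and b)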
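The claims (10-99) and (10-100) can be checked numerically before turning to the proof. The sketch below makes several assumptions that are not in the text: the impulses are averaged over bins of width `width`, so that the term $\lambda\delta(\tau)$ shows up as an extra $\lambda/\text{width}$ at lag zero only; expected values are replaced by time averages over one long realization; and the Poisson points are generated from independent exponential gaps. All names are hypothetical.

```python
import random


def poisson_points(rate, t_max, rng):
    """Poisson points in (0, t_max), built from independent exponential gaps."""
    points = []
    t = rng.expovariate(rate)
    while t < t_max:
        points.append(t)
        t += rng.expovariate(rate)
    return points


def smoothed_impulse_train(points, t_max, width):
    """Average z(t) over consecutive bins of the given width (count / width)."""
    n_bins = int(round(t_max / width))
    counts = [0] * n_bins
    for t in points:
        counts[min(int(t / width), n_bins - 1)] += 1
    return [c / width for c in counts]


def lagged_mean_product(z, lag):
    """Time average of z[i] * z[i + lag]."""
    n = len(z) - lag
    return sum(z[i] * z[i + lag] for i in range(n)) / n


def estimate(rate=5.0, t_max=4000.0, width=0.1, seed=0):
    rng = random.Random(seed)
    z = smoothed_impulse_train(poisson_points(rate, t_max, rng), t_max, width)
    mean_z = sum(z) / len(z)
    return mean_z, lagged_mean_product(z, 0), lagged_mean_product(z, 3)


if __name__ == "__main__":
    print(estimate())  # compare with rate, rate**2 + rate/width, rate**2
```

For nonzero lags the estimate should be close to $\lambda^2$, while at lag zero it is larger by about $\lambda/\text{width}$, which is the smoothed counterpart of the impulse in (10-100).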
Proof. The first equation follows from (10-92) because $\eta_x(t) = \lambda t$. To prove the second, we observe that [see (10-14)]
$$R_{xx}(t_1,t_2) = \lambda^2t_1t_2 + \lambda\min(t_1,t_2) \tag{10-101}$$
And since $z(t) = x'(t)$, (10-93) yields
$$R_{xz}(t_1,t_2) = \frac{\partial R_{xx}(t_1,t_2)}{\partial t_2} = \lambda^2t_1 + \lambda U(t_1-t_2)$$
This function is plotted in Fig. 10-10b where the independent variable is $t_1$. As we see, it is discontinuous for $t_1 = t_2$ and its derivative with respect to $t_1$ contains the impulse $\lambda\delta(t_1-t_2)$. This yields [see (10-94)]
$$R_{zz}(t_1,t_2) = \frac{\partial R_{xz}(t_1,t_2)}{\partial t_1} = \lambda^2 + \lambda\delta(t_1-t_2)$$

DIFFERENTIAL EQUATIONS. A deterministic differential equation with random excitation is an equation of the form
$$a_ny^{(n)}(t) + \cdots + a_0y(t) = x(t) \tag{10-102}$$
where the coefficients $a_k$ are given numbers and the driver $x(t)$ is a stochastic process. We shall consider its solution $y(t)$ under the assumption that the initial conditions are 0. With this assumption, $y(t)$ is unique (zero-state response) and it satisfies the linearity condition (10-77). We can, therefore, interpret $y(t)$ as the output of a linear system specified by (10-102).

In general, the determination of the complete statistics of $y(t)$ is complicated. In the following, we evaluate only its second-order moments using the preceding results. The above system is an operator $L$ specified as follows: Its output $y(t)$ is a process with zero initial conditions satisfying (10-102).

Mean. As we know [see (10-80)] the mean $\eta_y(t)$ of $y(t)$ is the output of $L$ with input $\eta_x(t)$. Hence it satisfies the equation
$$a_n\eta_y^{(n)}(t) + \cdots + a_0\eta_y(t) = \eta_x(t) \tag{10-103}$$
and the initial conditions
$$\eta_y(0) = \cdots = \eta_y^{(n-1)}(0) = 0 \tag{10-104}$$
This result can be established directly: Clearly,
$$E\{y^{(k)}(t)\} = \eta_y^{(k)}(t) \tag{10-105}$$
Taking expected values of both sides of (10-102) and using the above, we obtain (10-103). Equation (10-104) follows from (10-105) because $y^{(k)}(0) = 0$ by assumption.
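As an illustration of (10-103) and (10-104), consider the first-order case $y'(t) + cy(t) = x(t)$ with constant input mean $\eta_x$, for which the mean equation has the solution $\eta_y(t) = (\eta_x/c)(1 - e^{-ct})$. The sketch below compares this closed form with an ensemble average of simulated outputs. It rests on assumptions made only for illustration: forward-Euler integration with step `dt`, and an input equal to $\eta_x$ plus independent normal samples at each step (a crude stand-in for a random driver, adequate here because only the mean is examined). The names are hypothetical.

```python
import math
import random


def euler_first_order(x, c, dt):
    """Zero-state Euler solution of y'(t) + c*y(t) = x(t); y[k] approximates y(k*dt)."""
    y = [0.0]
    for x_k in x[:-1]:
        y.append(y[-1] + dt * (x_k - c * y[-1]))
    return y


def mean_response_demo(c=1.0, eta_x=2.0, sigma=1.0, dt=0.01, n_steps=300,
                       n_trials=400, seed=0):
    rng = random.Random(seed)
    sums = [0.0] * n_steps
    for _ in range(n_trials):
        x = [eta_x + rng.gauss(0.0, sigma) for _ in range(n_steps)]
        for k, y_k in enumerate(euler_first_order(x, c, dt)):
            sums[k] += y_k
    ensemble = [s / n_trials for s in sums]            # estimate of E{y(t)}
    # Mean equation (10-103) for n = 1 with zero initial condition (10-104).
    euler_mean = euler_first_order([eta_x] * n_steps, c, dt)
    closed_form = [eta_x / c * (1.0 - math.exp(-c * k * dt)) for k in range(n_steps)]
    return ensemble, euler_mean, closed_form


if __name__ == "__main__":
    ensemble, euler_mean, closed_form = mean_response_demo()
    print(ensemble[-1], euler_mean[-1], closed_form[-1])
```

The three curves should agree to within the Euler discretization error and the sampling error of the ensemble average.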
Correlation. To determine $R_{xy}(t_1,t_2)$, we use (10-83):
$$R_{xy}(t_1,t_2) = L_2[R_{xx}(t_1,t_2)]$$
In this case, $L_2$ means that $R_{xy}(t_1,t_2)$ satisfies the differential equation
$$a_n\frac{\partial^nR_{xy}(t_1,t_2)}{\partial t_2^n} + \cdots + a_0R_{xy}(t_1,t_2) = R_{xx}(t_1,t_2) \tag{10-106}$$
with the initial conditions
$$R_{xy}(t_1,0) = \cdots = \frac{\partial^{n-1}R_{xy}(t_1,0)}{\partial t_2^{n-1}} = 0 \tag{10-107}$$
Similarly, since [see (10-85)]
$$R_{yy}(t_1,t_2) = L_1[R_{xy}(t_1,t_2)]$$
we conclude as above that
$$a_n\frac{\partial^nR_{yy}(t_1,t_2)}{\partial t_1^n} + \cdots + a_0R_{yy}(t_1,t_2) = R_{xy}(t_1,t_2) \tag{10-108}$$
$$R_{yy}(0,t_2) = \cdots = \frac{\partial^{n-1}R_{yy}(0,t_2)}{\partial t_1^{n-1}} = 0 \tag{10-109}$$

The preceding results can be established directly: From (10-102) it follows that
$$x(t_1)\left[a_ny^{(n)}(t_2) + \cdots + a_0y(t_2)\right] = x(t_1)x(t_2)$$
This yields (10-106) because [see (10-119)]
$$E\{x(t_1)y^{(k)}(t_2)\} = \frac{\partial^kR_{xy}(t_1,t_2)}{\partial t_2^k}$$
Similarly, (10-108) is a consequence of the identity
$$\left[a_ny^{(n)}(t_1) + \cdots + a_0y(t_1)\right]y(t_2) = x(t_1)y(t_2)$$
because
$$E\{y^{(k)}(t_1)y(t_2)\} = \frac{\partial^kR_{yy}(t_1,t_2)}{\partial t_1^k}$$
Finally, the expected values of
$$x(t_1)y^{(k)}(0) = 0 \qquad y^{(k)}(0)y(t_2) = 0$$
yield (10-107) and (10-109).

General moments. The moments of any order of the output $y(t)$ of a linear system can be expressed in terms of the corresponding moments of the input $x(t)$. As an illustration, we shall determine the third-order moment
$$R_{yyy}(t_1,t_2,t_3) = E\{y(t_1)y(t_2)y(t_3)\}$$
of $y(t)$ in terms of the third-order moment $R_{xxx}(t_1,t_2,t_3)$ of $x(t)$. Proceeding as in (10-83), we obtain
$$E\{x(t_1)x(t_2)y(t_3)\} = L_3[E\{x(t_1)x(t_2)x(t_3)\}] = \int_{-\infty}^{\infty} R_{xxx}(t_1,t_2,t_3-\gamma)h(\gamma)\,d\gamma \tag{10-110a}$$
$$E\{x(t_1)y(t_2)y(t_3)\} = L_2[E\{x(t_1)x(t_2)y(t_3)\}] = \int_{-\infty}^{\infty} R_{xxy}(t_1,t_2-\beta,t_3)h(\beta)\,d\beta \tag{10-110b}$$
$$E\{y(t_1)y(t_2)y(t_3)\} = L_1[E\{x(t_1)y(t_2)y(t_3)\}] = \int_{-\infty}^{\infty} R_{xyy}(t_1-\alpha,t_2,t_3)h(\alpha)\,d\alpha \tag{10-110c}$$
Note that for the evaluation of $R_{yyy}(t_1,t_2,t_3)$ for specific times $t_1, t_2, t_3$, the function $R_{xxx}(t_1,t_2,t_3)$ must be known for every $t_1, t_2, t_3$.

Vector Processes and Multiterminal Systems

We consider now systems with $m$ inputs $x_i(t)$ and $r$ outputs $y_j(t)$. As a preparation, we introduce the notion of autocorrelation and cross-correlation for vector processes starting with a review of the standard matrix notation.

The expression $A = [a_{ij}]$ will mean a matrix with elements $a_{ij}$. The notation
$$A^t = [a_{ji}] \qquad A^* = [a_{ij}^*] \qquad A^\dagger = [a_{ji}^*]$$
will mean the transpose, the conjugate, and the conjugate transpose of $A$.

A column vector will be identified by $A = [a_i]$. Whether $A$ is a vector or a general matrix will be understood from the context. If $A = [a_i]$ and $B = [b_j]$ are two vectors with $m$ elements each, the product $A^tB = a_1b_1 + \cdots + a_mb_m$ is a number, and the product $AB^t = [a_ib_j]$ is an $m \times m$ matrix with elements $a_ib_j$.

A vector process $X(t) = [x_i(t)]$ is a vector, the components of which are stochastic processes. The mean $\eta(t) = E\{X(t)\} = [\eta_i(t)]$ of $X(t)$ is a vector with components $\eta_i(t) = E\{x_i(t)\}$. The autocorrelation $R(t_1,t_2)$ or $R_{xx}(t_1,t_2)$ of a vector process $X(t)$ is an $m \times m$ matrix
$$R(t_1,t_2) = E\{X(t_1)X^\dagger(t_2)\} \tag{10-111}$$
with elements $E\{x_i(t_1)x_j^*(t_2)\}$. We define similarly the cross-correlation matrix
$$R_{xy}(t_1,t_2) = E\{X(t_1)Y^\dagger(t_2)\} \tag{10-112}$$
of the vector processes
$$X(t) = [x_i(t)] \quad i = 1,\ldots,m \qquad Y(t) = [y_j(t)] \quad j = 1,\ldots,r$$

A multiterminal system with $m$ inputs $x_i(t)$ and $r$ outputs $y_j(t)$ is a rule for assigning to an $m$ vector $X(t)$ an $r$ vector $Y(t)$. If the system is linear and time-invariant, it is specified in terms of its impulse response matrix. This is an $r \times m$ matrix
$$H(t) = [h_{ji}(t)] \qquad i = 1,\ldots,m \quad j = 1,\ldots,r \tag{10-114}$$
defined as follows: Its component $h_{ji}(t)$ is the response of the $j$th output when the $i$th input equals $\delta(t)$ and all other inputs equal 0. From this and the linearity of the system, it follows that the response $y_j(t)$ of the $j$th output to an arbitrary input $X(t) = [x_i(t)]$ equals
$$y_j(t) = \int_{-\infty}^{\infty} h_{j1}(\alpha)x_1(t-\alpha)\,d\alpha + \cdots + \int_{-\infty}^{\infty} h_{jm}(\alpha)x_m(t-\alpha)\,d\alpha$$
Hence
$$Y(t) = \int_{-\infty}^{\infty} H(\alpha)X(t-\alpha)\,d\alpha \tag{10-115}$$
In the above, $X(t)$ and $Y(t)$ are column vectors and $H(t)$ is an $r \times m$ matrix. We shall use this relationship to determine the autocorrelation of $Y(t)$.

Premultiplying the conjugate transpose of (10-115) by $X(t_1)$ and setting $t = t_2$, we obtain
$$X(t_1)Y^\dagger(t_2) = \int_{-\infty}^{\infty} X(t_1)X^\dagger(t_2-\alpha)H^\dagger(\alpha)\,d\alpha$$
Hence
$$R_{xy}(t_1,t_2) = \int_{-\infty}^{\infty} R_{xx}(t_1,t_2-\alpha)H^\dagger(\alpha)\,d\alpha \tag{10-116a}$$
Postmultiplying (10-115) by $Y^\dagger(t_2)$ and setting $t = t_1$, we obtain
$$R_{yy}(t_1,t_2) = \int_{-\infty}^{\infty} H(\alpha)R_{xy}(t_1-\alpha,t_2)\,d\alpha \tag{10-116b}$$
as in (10-89). These results can be used to express the cross-correlation of the outputs of several scalar systems in terms of the cross-correlation of their inputs. The next example is an illustration.

Example 10-20. In Fig. 10-11 we show two systems with inputs $x_1(t)$, $x_2(t)$ and outputs
$$y_1(t) = \int_{-\infty}^{\infty} h_1(\alpha)x_1(t-\alpha)\,d\alpha \qquad y_2(t) = \int_{-\infty}^{\infty} h_2(\alpha)x_2(t-\alpha)\,d\alpha \tag{10-117}$$

FIGURE 10-11 (panels a and b)
These signals can be considered as the components of the output vector $Y^t(t) = [y_1(t), y_2(t)]$ of a $2 \times 2$ system with input vector $X^t(t) = [x_1(t), x_2(t)]$ and impulse response matrix
$$H(t) = \begin{bmatrix} h_1(t) & 0 \\ 0 & h_2(t) \end{bmatrix}$$
Inserting into (10-116), we obtain
$$R_{x_1y_2}(t_1,t_2) = \int_{-\infty}^{\infty} R_{x_1x_2}(t_1,t_2-\alpha)h_2^*(\alpha)\,d\alpha$$
$$R_{y_1y_2}(t_1,t_2) = \int_{-\infty}^{\infty} h_1(\alpha)R_{x_1y_2}(t_1-\alpha,t_2)\,d\alpha \tag{10-118}$$
Thus, to find $R_{x_1y_2}(t_1,t_2)$, we use $R_{x_1x_2}(t_1,t_2)$ as the input to the conjugate $h_2^*(t)$ of $h_2(t)$, operating on the variable $t_2$. To find $R_{y_1y_2}(t_1,t_2)$, we use $R_{x_1y_2}(t_1,t_2)$ as the input to $h_1(t)$ operating on the variable $t_1$ (Fig. 10-11).

Example 10-21. The derivatives $y_1(t) = z^{(m)}(t)$ and $y_2(t) = w^{(n)}(t)$ of two processes $z(t)$ and $w(t)$ can be considered as the responses of two differentiators with inputs $x_1(t) = z(t)$ and $x_2(t) = w(t)$. Applying (10-118) suitably interpreted, we conclude that
$$E\{z^{(m)}(t_1)w^{(n)}(t_2)\} = \frac{\partial^{m+n}R_{zw}(t_1,t_2)}{\partial t_1^m\,\partial t_2^n} \tag{10-119}$$

10-3 THE POWER SPECTRUM

In signal theory, spectra are associated with Fourier transforms. For deterministic signals, they are used to represent a function as a superposition of exponentials. For random signals, the notion of a spectrum has two interpretations. The first involves transforms of averages; it is thus essentially deterministic. The second leads to the representation of the process under consideration as a superposition of exponentials with random coefficients. In this section, we introduce the first interpretation. The second is treated in Sec. 12-4. We shall consider only stationary processes. For nonstationary processes the notion of a spectrum is of limited interest.

DEFINITIONS. The power spectrum (or spectral density) of a WSS process $x(t)$, real or complex, is the Fourier transform $S(\omega)$ of its autocorrelation $R(\tau) = E\{x(t+\tau)x^*(t)\}$:
$$S(\omega) = \int_{-\infty}^{\infty} R(\tau)e^{-j\omega\tau}\,d\tau \tag{10-120}$$
Since $R(-\tau) = R^*(\tau)$ it follows that $S(\omega)$ is a real function of $\omega$.

From the Fourier inversion formula, it follows that
$$R(\tau) = \frac{1}{2\pi}\int_{-\infty}^{\infty} S(\omega)e^{j\omega\tau}\,d\omega \tag{10-121}$$
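The definition (10-120) can be evaluated numerically when the transform is not available in closed form. The sketch below is only an illustration under stated assumptions: the integral is truncated to $(-\tau_{\max}, \tau_{\max})$ and approximated by the trapezoidal rule, and the test autocorrelation $R(\tau) = e^{-\alpha|\tau|}$, whose transform is the standard pair $2\alpha/(\alpha^2 + \omega^2)$, is chosen for convenience. The function names are hypothetical.

```python
import cmath
import math


def power_spectrum(R, omega, tau_max=40.0, n=80_000):
    """Trapezoidal approximation of (10-120) over (-tau_max, tau_max)."""
    d_tau = 2.0 * tau_max / n
    total = 0j
    for k in range(n + 1):
        tau = -tau_max + k * d_tau
        weight = 0.5 if k in (0, n) else 1.0          # end points get half weight
        total += weight * R(tau) * cmath.exp(-1j * omega * tau)
    return total * d_tau


def R_exponential(tau, alpha=1.0):
    """Assumed example autocorrelation R(tau) = exp(-alpha*|tau|)."""
    return math.exp(-alpha * abs(tau))


if __name__ == "__main__":
    for omega in (0.0, 1.0, 3.0):
        S = power_spectrum(R_exponential, omega)
        print(omega, S.real, S.imag, 2.0 / (1.0 + omega ** 2))
```

Because this $R(\tau)$ is real and even, the imaginary part of the computed transform is zero up to rounding, in agreement with the remark that $S(\omega)$ is real.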
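TABLE 10-1
$$R(\tau) = \frac{1}{2\pi}\int_{-\infty}^{\infty} S(\omega)e^{j\omega\tau}\,d\omega \;\leftrightarrow\; S(\omega) = \int_{-\infty}^{\infty} R(\tau)e^{-j\omega\tau}\,d\tau$$
$$\begin{array}{lcl}
\delta(\tau) & \leftrightarrow & 1 \\
1 & \leftrightarrow & 2\pi\delta(\omega) \\
e^{j\beta\tau} & \leftrightarrow & 2\pi\delta(\omega-\beta) \\
\cos\beta\tau & \leftrightarrow & \pi\delta(\omega-\beta) + \pi\delta(\omega+\beta) \\
e^{-\alpha|\tau|}\cos\beta\tau & \leftrightarrow & \dfrac{\alpha}{\alpha^2+(\omega-\beta)^2} + \dfrac{\alpha}{\alpha^2+(\omega+\beta)^2} \\
2e^{-\alpha\tau^2}\cos\beta\tau & \leftrightarrow & \sqrt{\dfrac{\pi}{\alpha}}\left[e^{-(\omega-\beta)^2/4\alpha} + e^{-(\omega+\beta)^2/4\alpha}\right] \\
\begin{cases} 1 - |\tau|/T & |\tau| < T \\ 0 & |\tau| > T \end{cases} & \leftrightarrow & \dfrac{4\sin^2(\omega T/2)}{T\omega^2} \\
\dfrac{\sin\sigma\tau}{\pi\tau} & \leftrightarrow & \begin{cases} 1 & |\omega| < \sigma \\ 0 & |\omega| > \sigma \end{cases}
\end{array}$$

If $x(t)$ is a real process, then $R(\tau)$ is real and even; hence $S(\omega)$ is also real and even. In this case,
$$S(\omega) = \int_{-\infty}^{\infty} R(\tau)\cos\omega\tau\,d\tau = 2\int_0^{\infty} R(\tau)\cos\omega\tau\,d\tau$$
$$R(\tau) = \frac{1}{2\pi}\int_{-\infty}^{\infty} S(\omega)\cos\omega\tau\,d\omega = \frac{1}{\pi}\int_0^{\infty} S(\omega)\cos\omega\tau\,d\omega \tag{10-122}$$

The cross-power spectrum of two processes $x(t)$ and $y(t)$ is the Fourier transform $S_{xy}(\omega)$ of their cross-correlation $R_{xy}(\tau) = E\{x(t+\tau)y^*(t)\}$:
$$S_{xy}(\omega) = \int_{-\infty}^{\infty} R_{xy}(\tau)e^{-j\omega\tau}\,d\tau \qquad R_{xy}(\tau) = \frac{1}{2\pi}\int_{-\infty}^{\infty} S_{xy}(\omega)e^{j\omega\tau}\,d\omega \tag{10-123}$$
The function $S_{xy}(\omega)$ is, in general, complex even when both processes $x(t)$ and $y(t)$ are real. In all cases,
$$S_{xy}(\omega) = S_{yx}^*(\omega) \tag{10-124}$$
because $R_{xy}(-\tau) = E\{x(t-\tau)y^*(t)\} = R_{yx}^*(\tau)$.

In Table 10-1 we list a number of frequently used autocorrelations and the corresponding spectra. Note that in all cases, $S(\omega)$ is positive. As we shall soon show, this is true for every spectrum.

Example 10-22. A random telegraph signal is a process $x(t)$ taking the values $+1$ and $-1$ as in Example 10-6:
$$x(t) = \begin{cases} 1 & t_{2i} < t < t_{2i+1} \\ -1 & t_{2i-1} < t < t_{2i} \end{cases}$$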
where $t_i$ is a set of Poisson points with average density $\lambda$. As we have shown in (10-19), its autocorrelation equals $e^{-2\lambda|\tau|}$. Hence
$$S(\omega) = \frac{4\lambda}{4\lambda^2 + \omega^2}$$

For most processes $R(\tau) \to \eta^2$ as $\tau \to \infty$, where $\eta = E\{x(t)\}$ (see Sec. 12-4). If, therefore, $\eta \neq 0$, then $S(\omega)$ contains an impulse at $\omega = 0$. To avoid this, it is often convenient to express the spectral properties of $x(t)$ in terms of the Fourier transform $S^c(\omega)$ of its autocovariance $C(\tau)$. Since $R(\tau) = C(\tau) + \eta^2$, it follows that
$$S(\omega) = S^c(\omega) + 2\pi\eta^2\delta(\omega) \tag{10-125}$$
The function $S^c(\omega)$ is called the covariance spectrum of $x(t)$.

Example 10-23. We have shown in (10-100) that the autocorrelation of the Poisson impulses
$$z(t) = \frac{d}{dt}\sum_i U(t-t_i) = \sum_i \delta(t-t_i)$$
equals $R_z(\tau) = \lambda^2 + \lambda\delta(\tau)$. From this it follows that
$$S_z(\omega) = \lambda + 2\pi\lambda^2\delta(\omega) \qquad S_z^c(\omega) = \lambda$$

We shall show that given an arbitrary positive function $S(\omega)$, we can find a process $x(t)$ with power spectrum $S(\omega)$.

(a) Consider the process
$$x(t) = ae^{j(\boldsymbol{\omega}t - \boldsymbol{\varphi})} \tag{10-126}$$
where $a$ is a real constant, $\boldsymbol{\omega}$ is an RV with density $f_\omega(\omega)$, and $\boldsymbol{\varphi}$ is an RV independent of $\boldsymbol{\omega}$ and uniform in the interval $(0, 2\pi)$. As we know, this process is WSS with zero mean and autocorrelation
$$R_x(\tau) = a^2E\{e^{j\boldsymbol{\omega}\tau}\} = a^2\int_{-\infty}^{\infty} f_\omega(\omega)e^{j\omega\tau}\,d\omega$$
From this and the uniqueness property of Fourier transforms, it follows that [see (10-121)] the power spectrum of $x(t)$ equals
$$S_x(\omega) = 2\pi a^2f_\omega(\omega) \tag{10-127}$$
If, therefore,
$$f_\omega(\omega) = \frac{S(\omega)}{2\pi a^2} \qquad a^2 = \frac{1}{2\pi}\int_{-\infty}^{\infty} S(\omega)\,d\omega$$
then $f_\omega(\omega)$ is a density and $S_x(\omega) = S(\omega)$. To complete the specification of $x(t)$, it suffices to construct an RV $\boldsymbol{\omega}$ with density $S(\omega)/2\pi a^2$ and insert it into (10-126).
FIGURE 10-12

(b) We show next that if $S(-\omega) = S(\omega)$, we can find a real process with power spectrum $S(\omega)$. To do so, we form the process
$$y(t) = a\cos(\boldsymbol{\omega}t + \boldsymbol{\varphi}) \tag{10-128}$$
In this case (see Example 10-14)
$$R_y(\tau) = \frac{a^2}{2}E\{\cos\boldsymbol{\omega}\tau\} = \frac{a^2}{2}\int_{-\infty}^{\infty} f_\omega(\omega)\cos\omega\tau\,d\omega$$
From this it follows that if $f_\omega(\omega) = S(\omega)/\pi a^2$, then $S_y(\omega) = S(\omega)$.

Example 10-24 Doppler effect. A harmonic oscillator located at point $P$ of the $x$ axis (Fig. 10-12) moves in the $x$ direction with velocity $v$. The emitted signal equals $e^{j\omega_0t}$ and the signal received by an observer located at point $O$ equals
$$s(t) = ae^{j\omega_0(t - r/c)}$$
where $c$ is the velocity of propagation and $r = r_0 + vt$. We assume that $v$ is an RV with density $f_v(v)$. Clearly,
$$s(t) = ae^{j(\boldsymbol{\omega}t - \varphi)} \qquad \boldsymbol{\omega} = \omega_0\left(1 - \frac{v}{c}\right) \qquad \varphi = \frac{r_0\omega_0}{c}$$
hence the spectrum of the received signal is given by (10-127):
$$S(\omega) = 2\pi a^2f_\omega(\omega) = \frac{2\pi a^2c}{\omega_0}f_v\!\left[\left(1 - \frac{\omega}{\omega_0}\right)c\right] \tag{10-129}$$
Note that if $v = 0$, then
$$s(t) = ae^{j(\omega_0t - \varphi)} \qquad R(\tau) = a^2e^{j\omega_0\tau} \qquad S(\omega) = 2\pi a^2\delta(\omega - \omega_0)$$
This is the spectrum of the emitted signal. Thus the motion causes broadening of the spectrum.

The above holds also if the motion forms an angle with the $x$ axis provided that $v$ is replaced by its projection $v_x$ on $OP$. The following case is of special interest. Suppose that the emitter is a particle in a gas of temperature $T$. In this case, the $x$ component of its velocity is a normal RV with zero mean and variance $kT/m$ (see Prob. 8-5). Inserting into (10-129), we conclude that
$$S(\omega) = \frac{2\pi a^2c}{\omega_0\sqrt{2\pi kT/m}}\exp\left\{-\frac{mc^2}{2kT}\left(1 - \frac{\omega}{\omega_0}\right)^2\right\}$$
$$R(\tau) = a^2\exp\left\{-\frac{kT\omega_0^2\tau^2}{2mc^2}\right\}e^{j\omega_0\tau}$$

Line spectra. (a) We have shown in Example 10-7 that the process
$$x(t) = \sum_i c_ie^{j\omega_it}$$
is WSS if the RVs $c_i$ are uncorrelated with zero mean. From this and Table 10-1 it follows that
$$R(\tau) = \sum_i \sigma_i^2e^{j\omega_i\tau} \qquad S(\omega) = 2\pi\sum_i \sigma_i^2\delta(\omega - \omega_i) \tag{10-130}$$
where $\sigma_i^2 = E\{|c_i|^2\}$. Thus $S(\omega)$ consists of lines. In Sec. 14-2 we show that such a process is predictable, that is, its present value is uniquely determined in terms of its past.

(b) Similarly, the process
$$y(t) = \sum_i (a_i\cos\omega_it + b_i\sin\omega_it)$$
is WSS iff the RVs $a_i$ and $b_i$ are uncorrelated with zero mean and $E\{a_i^2\} = E\{b_i^2\} = \sigma_i^2$. In this case,
$$R(\tau) = \sum_i \sigma_i^2\cos\omega_i\tau \qquad S(\omega) = \pi\sum_i \sigma_i^2[\delta(\omega - \omega_i) + \delta(\omega + \omega_i)] \tag{10-131}$$

LINEAR SYSTEMS. We shall express the autocorrelation $R_{yy}(\tau)$ and power spectrum $S_{yy}(\omega)$ of the response
$$y(t) = \int_{-\infty}^{\infty} x(t-\alpha)h(\alpha)\,d\alpha \tag{10-132}$$
of a linear system in terms of the autocorrelation $R_{xx}(\tau)$ and power spectrum $S_{xx}(\omega)$ of the input $x(t)$.

THEOREM
$$R_{xy}(\tau) = R_{xx}(\tau) * h^*(-\tau) \qquad R_{yy}(\tau) = R_{xy}(\tau) * h(\tau) \tag{10-133}$$
$$S_{xy}(\omega) = S_{xx}(\omega)H^*(\omega) \qquad S_{yy}(\omega) = S_{xy}(\omega)H(\omega) \tag{10-134}$$

Proof. The two equations in (10-133) are special cases of (10-184) and (10-185). However, because of their importance they will be proved directly. Multiplying the conjugate of (10-132) by $x(t+\tau)$ and taking expected values, we obtain
$$E\{x(t+\tau)y^*(t)\} = \int_{-\infty}^{\infty} E\{x(t+\tau)x^*(t-\alpha)\}h^*(\alpha)\,d\alpha$$
Since $E\{x(t+\tau)x^*(t-\alpha)\} = R_{xx}(\tau+\alpha)$, this yields
$$R_{xy}(\tau) = \int_{-\infty}^{\infty} R_{xx}(\tau+\alpha)h^*(\alpha)\,d\alpha = \int_{-\infty}^{\infty} R_{xx}(\tau-\beta)h^*(-\beta)\,d\beta$$
Proceeding similarly, we obtain
$$E\{y(t)y^*(t-\tau)\} = \int_{-\infty}^{\infty} E\{x(t-\alpha)y^*(t-\tau)\}h(\alpha)\,d\alpha = \int_{-\infty}^{\infty} R_{xy}(\tau-\alpha)h(\alpha)\,d\alpha$$
Equation (10-134) follows from (10-133) and the convolution theorem.

COROLLARY. Combining the two equations in (10-133) and (10-134), we obtain
$$R_{yy}(\tau) = R_{xx}(\tau) * h(\tau) * h^*(-\tau) = R_{xx}(\tau) * \rho(\tau) \tag{10-135}$$
$$S_{yy}(\omega) = S_{xx}(\omega)H(\omega)H^*(\omega) = S_{xx}(\omega)|H(\omega)|^2 \tag{10-136}$$
where
$$\rho(\tau) = h(\tau) * h^*(-\tau) = \int_{-\infty}^{\infty} h(t+\tau)h^*(t)\,dt \;\leftrightarrow\; |H(\omega)|^2 \tag{10-137}$$
Note, in particular, that if $x(t)$ is white noise with average power $q$, then
$$R_{xx}(\tau) = q\delta(\tau) \qquad S_{xx}(\omega) = q$$
$$S_{yy}(\omega) = q|H(\omega)|^2 \qquad R_{yy}(\tau) = q\rho(\tau) \tag{10-138}$$
From (10-136) and the inversion formula (10-121), it follows that
$$E\{|y(t)|^2\} = R_{yy}(0) = \frac{1}{2\pi}\int_{-\infty}^{\infty} S_{xx}(\omega)|H(\omega)|^2\,d\omega \geq 0 \tag{10-139}$$
This equation describes the filtering properties of a system when the input is a random process. It shows, for example, that if $H(\omega) = 0$ for $|\omega| > \omega_0$ and $S_{xx}(\omega) = 0$ for $|\omega| < \omega_0$, then $E\{y^2(t)\} = 0$.

Note. The preceding results hold if all correlations are replaced by the corresponding covariances and all spectra by the corresponding covariance spectra. This follows from the fact that the response to $x(t) - \eta_x$ equals $y(t) - \eta_y$. For example, (10-136) and (10-139) yield
$$S_{yy}^c(\omega) = S_{xx}^c(\omega)|H(\omega)|^2 \tag{10-140}$$
$$\operatorname{Var} y(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} S_{xx}^c(\omega)|H(\omega)|^2\,d\omega \tag{10-141}$$
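The white-noise relations (10-138) have a direct discrete-time counterpart that is easy to simulate. The following sketch is an analogy rather than the continuous-time statement itself, and it rests on assumptions made for illustration: the input is a sequence of independent normal samples with variance `q` (so that its autocorrelation is $q\delta[m]$), the system is a short FIR filter with arbitrarily chosen real taps, $\rho[m] = \sum_n h[n+m]h[n]$ plays the role of (10-137), and expected values are replaced by time averages. All names are hypothetical.

```python
import random


def fir_filter(x, h):
    """y[n] = sum_k h[k] * x[n-k], with x[n] = 0 for n < 0 (zero-state response)."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, h_k in enumerate(h):
            if n - k >= 0:
                acc += h_k * x[n - k]
        y.append(acc)
    return y


def rho(h, m):
    """Deterministic correlation rho[m] = sum_n h[n+m] * h[n] of a real h."""
    m = abs(m)
    return sum(h[n + m] * h[n] for n in range(len(h) - m))


def sample_autocorrelation(y, m, skip):
    """Time average of y[n+m]*y[n], discarding the first `skip` transient samples."""
    n_terms = len(y) - m - skip
    return sum(y[n + m] * y[n] for n in range(skip, skip + n_terms)) / n_terms


def demo(q=2.0, h=(1.0, 0.5, 0.25), n_samples=200_000, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, q ** 0.5) for _ in range(n_samples)]   # R_xx[m] = q*delta[m]
    y = fir_filter(x, h)
    estimates = [sample_autocorrelation(y, m, skip=len(h)) for m in range(4)]
    predictions = [q * rho(h, m) for m in range(4)]
    return estimates, predictions


if __name__ == "__main__":
    print(demo())
```

The estimated output autocorrelation should track $q\rho[m]$ at every lag, and in particular it should vanish beyond the length of the filter.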
Example 10-25. (a) (Moving average) The integral
$$y(t) = \frac{1}{2T}\int_{t-T}^{t+T} x(\alpha)\,d\alpha$$
is the average of the process $x(t)$ in the interval $(t-T, t+T)$. Clearly, $y(t)$ is the output of a system with input $x(t)$ and impulse response a rectangular pulse as in Fig. 10-13. The corresponding $\rho(\tau)$ is a triangle. In this case,
$$H(\omega) = \frac{1}{2T}\int_{-T}^{T} e^{-j\omega\tau}\,d\tau = \frac{\sin T\omega}{T\omega} \qquad S_{yy}(\omega) = S_{xx}(\omega)\frac{\sin^2T\omega}{T^2\omega^2}$$
Thus $H(\omega)$ takes significant values only in an interval of the order of $1/T$ centered at the origin. Hence the moving average suppresses the high-frequency components of the input. It is thus a simple low-pass filter.

Since $\rho(\tau)$ is a triangle, it follows from (10-135) that
$$R_{yy}(\tau) = \frac{1}{2T}\int_{-2T}^{2T}\left(1 - \frac{|\alpha|}{2T}\right)R_{xx}(\tau-\alpha)\,d\alpha \tag{10-142}$$
We shall use this result to determine the variance of the integral
$$\eta_T = \frac{1}{2T}\int_{-T}^{T} x(t)\,dt$$
Clearly, $\eta_T = y(0)$; hence
$$\operatorname{Var}\eta_T = C_{yy}(0) = \frac{1}{2T}\int_{-2T}^{2T}\left(1 - \frac{|\alpha|}{2T}\right)C_{xx}(\alpha)\,d\alpha \tag{10-143}$$

(b) (High-pass filter) The process $z(t) = x(t) - y(t)$ is the output of a system with input $x(t)$ and system function
$$H(\omega) = 1 - \frac{\sin T\omega}{T\omega}$$
This function is nearly 0 in an interval of the order of $1/T$ centered at the origin, and it approaches 1 for large $\omega$. It acts, therefore, as a high-pass filter suppressing the low frequencies of the input.

FIGURE 10-13

Example 10-26 Derivatives. The derivative $x'(t)$ of a process $x(t)$ can be considered as the output of a linear system with input $x(t)$ and system function $j\omega$.
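From this and (10-134), it follows that
$$S_{xx'}(\omega) = -j\omega S_{xx}(\omega) \qquad S_{x'x'}(\omega) = \omega^2S_{xx}(\omega)$$
Hence
$$R_{xx'}(\tau) = -\frac{dR_{xx}(\tau)}{d\tau} \qquad R_{x'x'}(\tau) = -\frac{d^2R_{xx}(\tau)}{d\tau^2}$$
The $n$th derivative $y(t) = x^{(n)}(t)$ of $x(t)$ is the output of a system with input $x(t)$ and system function $(j\omega)^n$. Hence
$$S_{yy}(\omega) = |j\omega|^{2n}S_{xx}(\omega) \qquad R_{yy}(\tau) = (-1)^nR_{xx}^{(2n)}(\tau) \tag{10-144}$$

Example 10-27. (a) The differential equation
$$y'(t) + cy(t) = x(t) \qquad \text{all } t$$
specifies a linear system with input $x(t)$, output $y(t)$, and system function $1/(j\omega + c)$. We assume that $x(t)$ is white noise with $R_{xx}(\tau) = q\delta(\tau)$. Applying (10-136), we obtain
$$S_{yy}(\omega) = \frac{S_{xx}(\omega)}{\omega^2 + c^2} = \frac{q}{\omega^2 + c^2} \qquad R_{yy}(\tau) = \frac{q}{2c}e^{-c|\tau|}$$
Note that $E\{y^2(t)\} = R_{yy}(0) = q/2c$.

(b) Similarly, if
$$y''(t) + by'(t) + cy(t) = x(t) \qquad R_{xx}(\tau) = q\delta(\tau)$$
then
$$H(\omega) = \frac{1}{-\omega^2 + jb\omega + c} \qquad S_{yy}(\omega) = \frac{q}{(c - \omega^2)^2 + b^2\omega^2}$$
To find $R_{yy}(\tau)$, we shall consider three cases:

$b^2 < 4c$:
$$R_{yy}(\tau) = \frac{q}{2bc}e^{-\alpha|\tau|}\left(\cos\beta\tau + \frac{\alpha}{\beta}\sin\beta|\tau|\right) \qquad \alpha = \frac{b}{2} \quad \alpha^2 + \beta^2 = c$$
$b^2 = 4c$:
$$R_{yy}(\tau) = \frac{q}{2bc}e^{-\alpha|\tau|}(1 + \alpha|\tau|) \qquad \alpha = \frac{b}{2}$$
$b^2 > 4c$:
$$R_{yy}(\tau) = \frac{q}{4\gamma bc}\left[(\alpha + \gamma)e^{-(\alpha-\gamma)|\tau|} - (\alpha - \gamma)e^{-(\alpha+\gamma)|\tau|}\right] \qquad \alpha = \frac{b}{2} \quad \alpha^2 - \gamma^2 = c$$
In all cases, $E\{y^2(t)\} = q/2bc$.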
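The average powers $q/2c$ and $q/2bc$ quoted in Example 10-27 can be confirmed by integrating the spectra numerically, as in (10-139). The sketch below assumes a particular numerical device that is not part of the text: the substitution $\omega = \tan\theta$ maps the infinite frequency axis onto $(-\pi/2, \pi/2)$, where a midpoint rule is applied. The parameter values and function names are hypothetical.

```python
import math


def average_power(S, n=20_000):
    """(1/2pi) * integral of S(w) over all w, using w = tan(theta) and the midpoint rule."""
    d_theta = math.pi / n
    total = 0.0
    for k in range(n):
        theta = -0.5 * math.pi + (k + 0.5) * d_theta
        w = math.tan(theta)
        total += S(w) / math.cos(theta) ** 2     # dw = d(theta) / cos^2(theta)
    return total * d_theta / (2.0 * math.pi)


def S_first_order(w, q, c):
    """Output spectrum of y' + c*y = x driven by white noise of intensity q."""
    return q / (w * w + c * c)


def S_second_order(w, q, b, c):
    """Output spectrum of y'' + b*y' + c*y = x driven by white noise of intensity q."""
    return q / ((c - w * w) ** 2 + b * b * w * w)


if __name__ == "__main__":
    q = 3.0
    print(average_power(lambda w: S_first_order(w, q, 2.0)), q / (2 * 2.0))
    for b, c in ((1.0, 4.0), (4.0, 4.0), (5.0, 4.0)):   # b*b < 4c, = 4c, > 4c
        print(average_power(lambda w: S_second_order(w, q, b, c)), q / (2 * b * c))
```

The three second-order parameter pairs cover the underdamped, critically damped, and overdamped cases, and the computed power matches $q/2bc$ in each of them.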
FIGURE 10-14

Example 10-28 Hilbert transforms. A system with system function (Fig. 10-14)
$$H(\omega) = -j\operatorname{sgn}\omega = \begin{cases} -j & \omega > 0 \\ j & \omega < 0 \end{cases} \tag{10-145}$$
is called a quadrature filter. The corresponding impulse response equals $1/\pi t$ (Papoulis, 1977). Thus $H(\omega)$ is all-pass with $-90°$ phase shift; hence its response to $\cos\omega t$ equals $\cos(\omega t - 90°) = \sin\omega t$ and its response to $\sin\omega t$ equals $\sin(\omega t - 90°) = -\cos\omega t$.

The response of a quadrature filter to a real process $x(t)$ is denoted by $\hat{x}(t)$ and it is called the Hilbert transform of $x(t)$. Thus
$$\hat{x}(t) = x(t) * \frac{1}{\pi t} = \frac{1}{\pi}\int_{-\infty}^{\infty}\frac{x(\alpha)}{t - \alpha}\,d\alpha \tag{10-146}$$
From (10-134) and (10-124) it follows that (Fig. 10-14)
$$S_{x\hat{x}}(\omega) = jS_{xx}(\omega)\operatorname{sgn}\omega = -S_{\hat{x}x}(\omega) \qquad S_{\hat{x}\hat{x}}(\omega) = S_{xx}(\omega) \tag{10-147}$$

The complex process
$$z(t) = x(t) + j\hat{x}(t)$$
is called the analytic signal associated with $x(t)$. Clearly, $z(t)$ is the response of the system
$$1 + j(-j\operatorname{sgn}\omega) = 2U(\omega)$$
with input $x(t)$. Hence [see (10-136)]
$$S_{zz}(\omega) = 4S_{xx}(\omega)U(\omega) = 2S_{xx}(\omega) + 2jS_{\hat{x}x}(\omega) \tag{10-148}$$
$$R_{zz}(\tau) = 2R_{xx}(\tau) + 2jR_{\hat{x}x}(\tau) \tag{10-149}$$

THE WIENER-KHINCHIN THEOREM. From (10-121) it follows that
$$E\{x^2(t)\} = R(0) = \frac{1}{2\pi}\int_{-\infty}^{\infty} S(\omega)\,d\omega \geq 0 \tag{10-150}$$
This shows that the area of the power spectrum of any process is positive. We shall show that
$$S(\omega) \geq 0 \tag{10-151}$$
for every $\omega$.

Proof. We form an ideal bandpass system with system function
$$H(\omega) = \begin{cases} 1 & \omega_1 < \omega < \omega_2 \\ 0 & \text{otherwise} \end{cases}$$
and apply $x(t)$ to its input. From (10-136) it follows that the power spectrum $S_{yy}(\omega)$ of the resulting output $y(t)$ equals
$$S_{yy}(\omega) = \begin{cases} S(\omega) & \omega_1 < \omega < \omega_2 \\ 0 & \text{otherwise} \end{cases}$$
Hence
$$0 \leq E\{y^2(t)\} = \frac{1}{2\pi}\int_{-\infty}^{\infty} S_{yy}(\omega)\,d\omega = \frac{1}{2\pi}\int_{\omega_1}^{\omega_2} S(\omega)\,d\omega \tag{10-152}$$
Thus the area of $S(\omega)$ in any interval is positive. This is possible only if $S(\omega) \geq 0$ everywhere.

We have shown on page 321 that if $S(\omega)$ is a positive function, then we can find a process $x(t)$ such that $S_{xx}(\omega) = S(\omega)$. From this it follows that a function is a power spectrum iff it is positive. In fact, we can find an exponential with random frequency $\boldsymbol{\omega}$ as in (10-127) with power spectrum an arbitrary positive function $S(\omega)$.

We shall use (10-152) to express the power spectrum $S(\omega)$ of a process $x(t)$ as the average power of another process $y(t)$ obtained by filtering $x(t)$. Setting $\omega_1 = \omega_0 - \delta$ and $\omega_2 = \omega_0 + \delta$, we conclude that if $\delta$ is sufficiently small,
$$E\{y^2(t)\} \simeq \frac{\delta}{\pi}S(\omega_0) \tag{10-153}$$
This shows the localization of the average power of $x(t)$ on the frequency axis.

Integrated spectrum. In mathematics, the spectral properties of a process $x(t)$ are expressed in terms of the integrated spectrum $F(\omega)$ defined as the integral of $S(\omega)$:
$$F(\omega) = \int_{-\infty}^{\omega} S(\alpha)\,d\alpha \tag{10-154}$$
From the positivity of $S(\omega)$, it follows that $F(\omega)$ is a nondecreasing function of $\omega$. Integrating the inversion formula (10-121) by parts, we can express the autocorrelation $R(\tau)$ of $x(t)$ as a Riemann-Stieltjes integral:
$$R(\tau) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{j\omega\tau}\,dF(\omega) \tag{10-155}$$
This approach avoids the use of singularity functions in the spectral representation of $R(\tau)$ even when $S(\omega)$ contains impulses. If $S(\omega)$ contains the terms $\beta_i\delta(\omega - \omega_i)$, then $F(\omega)$ is discontinuous at $\omega_i$ and the discontinuity jump equals $\beta_i$.

The integrated covariance spectrum $F^c(\omega)$ is the integral of the covariance spectrum. From (10-125) it follows that $F(\omega) = F^c(\omega) + 2\pi\eta^2U(\omega)$.

Vector spectra. The vector process $X(t) = [x_i(t)]$ is WSS if its components $x_i(t)$ are jointly WSS. In this case, its autocorrelation matrix depends only on $\tau = t_1 - t_2$. From this it follows that [see (10-116)]
$$R_{xy}(\tau) = \int_{-\infty}^{\infty} R_{xx}(\tau + \alpha)H^\dagger(\alpha)\,d\alpha \qquad R_{yy}(\tau) = \int_{-\infty}^{\infty} H(\alpha)R_{xy}(\tau - \alpha)\,d\alpha \tag{10-156}$$
The power spectrum of a WSS vector process $X(t)$ is a square matrix $S_{xx}(\omega) = [S_{ij}(\omega)]$, the elements of which are the Fourier transforms $S_{ij}(\omega)$ of the elements $R_{ij}(\tau)$ of its autocorrelation matrix $R_{xx}(\tau)$. Defining similarly the matrices $S_{xy}(\omega)$ and $S_{yy}(\omega)$, we conclude from (10-156) that
$$S_{xy}(\omega) = S_{xx}(\omega)\bar{H}^\dagger(\omega) \qquad S_{yy}(\omega) = \bar{H}(\omega)S_{xy}(\omega) \tag{10-157}$$
where $\bar{H}(\omega) = [H_{ji}(\omega)]$ is an $r \times m$ matrix with elements the Fourier transforms $H_{ji}(\omega)$ of the elements $h_{ji}(t)$ of the impulse response matrix $H(t)$. Thus
$$S_{yy}(\omega) = \bar{H}(\omega)S_{xx}(\omega)\bar{H}^\dagger(\omega) \tag{10-158}$$
This is the extension of (10-136) to a multiterminal system.

Example 10-29. The derivatives
$$y_1(t) = z^{(m)}(t) \qquad y_2(t) = w^{(n)}(t)$$
of two WSS processes $z(t)$ and $w(t)$ can be considered as the responses of two differentiators with inputs $z(t)$ and $w(t)$ and system functions $H_1(\omega) = (j\omega)^m$ and $H_2(\omega) = (j\omega)^n$. Proceeding as in (10-119), we conclude that the cross-power spectrum of $z^{(m)}(t)$ and $w^{(n)}(t)$ equals $(j\omega)^m(-j\omega)^nS_{zw}(\omega)$. Hence
$$E\{z^{(m)}(t+\tau)w^{(n)}(t)\} = (-1)^n\frac{d^{m+n}R_{zw}(\tau)}{d\tau^{m+n}} \tag{10-159}$$

PROPERTIES OF CORRELATIONS. If a function $R(\tau)$ is the autocorrelation of a WSS process $x(t)$, then [see (10-151)] its Fourier transform $S(\omega)$ is positive.
Furthermore, if RM is a function with positive Fourier transform, we can find a process x(z) as in (10-126) with autocorrelation Л(т). Thus a necessary and sufficient condition for a function RM IO be an autocorrelation is the positivity of Its Fourier transform. The
conditions for a function $R(\tau)$ to be an autocorrelation can be expressed directly in terms of $R(\tau)$. We have shown in (10-84) that the autocorrelation $R(\tau)$ of a process $x(t)$ is p.d., that is,

$$\sum_{i,j} a_i a_j^* R(\tau_i - \tau_j) \ge 0 \tag{10-160}$$

for every $a_i$, $a_j$, $\tau_i$, and $\tau_j$. It can be shown that the converse is also true†: If $R(\tau)$ is a p.d. function, then its Fourier transform is positive. Thus a function $R(\tau)$ has a positive Fourier transform iff it is p.d.

A sufficient condition. To establish whether $R(\tau)$ is p.d., we must show either that it satisfies (10-160) or that its transform is positive. This is not, in general, a simple task. The following is a simple sufficient condition.

Polya's criterion. It can be shown that a function $R(\tau)$ is p.d. if it is convex for $\tau > 0$ and it tends to a finite limit as $\tau \to \infty$.

Consider, for example, the function $w(\tau) = e^{-\alpha|\tau|^c}$. If $0 < c < 1$, then $w(\tau) \to 0$ as $\tau \to \infty$ and $w''(\tau) > 0$ for $\tau > 0$; hence $w(\tau)$ is p.d. because it satisfies Polya's criterion. Note, however, that it is p.d. also for $1 < c < 2$ even though it does not satisfy this criterion.

Necessary conditions. The autocorrelation $R(\tau)$ of any process $x(t)$ is maximum at the origin because [see (10-121)]

$$|R(\tau)| \le \frac{1}{2\pi}\int_{-\infty}^{\infty} S(\omega)\,d\omega = R(0) \tag{10-161}$$

We show next that if $R(\tau)$ is not periodic, it reaches its maximum only at the origin.

THEOREM. If $R(\tau_1) = R(0)$ for some $\tau_1 \ne 0$, then $R(\tau)$ is periodic with period $\tau_1$:

$$R(\tau + \tau_1) = R(\tau) \quad \text{for all } \tau \tag{10-162}$$

Proof. From Schwarz's inequality

$$E^2\{zw\} \le E\{z^2\}E\{w^2\} \tag{10-163}$$

it follows that

$$E^2\{[x(t+\tau+\tau_1) - x(t+\tau)]\,x(t)\} \le E\{[x(t+\tau+\tau_1) - x(t+\tau)]^2\}\,E\{x^2(t)\}$$

Hence

$$[R(\tau+\tau_1) - R(\tau)]^2 \le 2[R(0) - R(\tau_1)]\,R(0) \tag{10-164}$$

If $R(\tau_1) = R(0)$, then the right side is 0; hence the left side is also 0 for every $\tau$. This yields (10-162).

† S. Bochner: Lectures on Fourier Integrals, Princeton Univ. Press, Princeton, NJ, 1959.
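The positive-definiteness condition (10-160) can be probed numerically: on a finite grid of points $\tau_i$ it says that the matrix with entries $R(\tau_i - \tau_j)$ has no negative eigenvalues. The sketch below is an illustration only, not part of the text; it assumes the example function has the form $w(\tau) = e^{-\alpha|\tau|^c}$, and the grid step, grid size, and function names are arbitrary choices. A nonnegative result on one grid is a necessary check, not a proof that the function is p.d.

```python
import numpy as np

def min_toeplitz_eigenvalue(R, step=0.1, size=200):
    """Smallest eigenvalue of the matrix [R(tau_i - tau_j)] on a uniform grid."""
    tau = step * np.arange(size)
    matrix = R(tau[:, None] - tau[None, :])
    return np.linalg.eigvalsh(matrix).min()

def w(tau, c, alpha=1.0):
    # Assumed form of the example function: w(tau) = exp(-alpha * |tau|**c)
    return np.exp(-alpha * np.abs(tau) ** c)

# c = 0.5 satisfies Polya's criterion; c = 1.5 does not, yet w is still p.d.
min_eig_half = min_toeplitz_eigenvalue(lambda t: w(t, 0.5))
min_eig_three_halves = min_toeplitz_eigenvalue(lambda t: w(t, 1.5))
print(min_eig_half, min_eig_three_halves)
```

Both printed values should be nonnegative up to rounding, in agreement with the statement that the function is p.d. in both ranges of $c$.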
COROLLARY. If $R(\tau_1) = R(\tau_2) = R(0)$ and the numbers $\tau_1$ and $\tau_2$ are noncommensurate, that is, their ratio is irrational, then $R(\tau)$ is constant.

Proof. From the theorem it follows that $R(\tau)$ is periodic with periods $\tau_1$ and $\tau_2$. This is possible only if $R(\tau)$ is constant.

Continuity. If $R(\tau)$ is continuous at the origin, it is continuous for every $\tau$.

Proof. From the continuity of $R(\tau)$ at $\tau = 0$ it follows that $R(\tau_1) \to R(0)$; hence the left side of (10-164) also tends to 0 for every $\tau$ as $\tau_1 \to 0$.

Example 10-30. Using the theorem, we shall show that the truncated parabola

$$w(\tau) = \begin{cases} a^2 - \tau^2 & |\tau| < a \\ 0 & |\tau| > a \end{cases}$$

is not an autocorrelation. If $w(\tau)$ is the autocorrelation of some process $x(t)$, then [see (10-144)] the function

$$-w''(\tau) = \begin{cases} 2 & |\tau| < a \\ 0 & |\tau| > a \end{cases}$$

is the autocorrelation of $x'(t)$. This is impossible because $-w''(\tau)$ is continuous for $\tau = 0$ but not for $\tau = a$.

MS continuity and periodicity. We shall say that the process $x(t)$ is MS continuous if

$$E\{[x(t+\varepsilon) - x(t)]^2\} \to 0 \quad \text{as} \quad \varepsilon \to 0 \tag{10-165}$$

Since $E\{[x(t+\varepsilon) - x(t)]^2\} = 2[R(0) - R(\varepsilon)]$, we conclude that if $x(t)$ is MS continuous, $R(0) - R(\varepsilon) \to 0$ as $\varepsilon \to 0$. Thus a WSS process $x(t)$ is MS continuous iff its autocorrelation $R(\tau)$ is continuous for all $\tau$.

We shall say that the process $x(t)$ is MS periodic with period $\tau_1$ if

$$E\{[x(t+\tau_1) - x(t)]^2\} = 0 \tag{10-166}$$

Since the left side equals $2[R(0) - R(\tau_1)]$, we conclude that $R(\tau_1) = R(0)$; hence [see (10-162)] $R(\tau)$ is periodic. This leads to the conclusion that a WSS process $x(t)$ is MS periodic iff its autocorrelation is periodic.

Cross-correlation. Using (10-163), we shall show that the cross-correlation $R_{xy}(\tau)$ of two WSS processes $x(t)$ and $y(t)$ satisfies the inequality

$$|R_{xy}(\tau)|^2 \le R_{xx}(0)R_{yy}(0) \tag{10-167}$$

Proof. From (10-163) it follows that

$$|E\{x(t+\tau)y^*(t)\}|^2 \le E\{|x(t+\tau)|^2\}\,E\{|y(t)|^2\} = R_{xx}(0)R_{yy}(0)$$

and (10-167) results.
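A truncated parabola can also be ruled out by the transform test: an autocorrelation must have a positive Fourier transform. The sketch below takes that alternative route rather than the continuity argument of the example. It assumes the function $a^2 - \tau^2$ for $|\tau| < a$ with $a = 1$, and compares a trapezoidal approximation of the transform with the closed form $4(\sin a\omega - a\omega\cos a\omega)/\omega^3$, obtained by integrating by parts. The function names are illustrative.

```python
import numpy as np

def truncated_parabola(tau, a=1.0):
    # Assumed form: w(tau) = a^2 - tau^2 for |tau| < a, zero otherwise
    return np.where(np.abs(tau) < a, a**2 - tau**2, 0.0)

def transform(omega, a=1.0, points=20001):
    """Trapezoidal approximation of the integral of w(tau) cos(omega tau) over (-a, a)."""
    tau = np.linspace(-a, a, points)
    values = truncated_parabola(tau, a) * np.cos(omega * tau)
    d_tau = tau[1] - tau[0]
    return d_tau * (values.sum() - 0.5 * (values[0] + values[-1]))

def closed_form(omega, a=1.0):
    # Result of integrating (a^2 - tau^2) cos(omega tau) by parts
    return 4.0 * (np.sin(a * omega) - a * omega * np.cos(a * omega)) / omega**3

S_numeric = transform(2 * np.pi)
S_exact = closed_form(2 * np.pi)
print(S_numeric, S_exact)
```

At $\omega = 2\pi$ the closed form reduces to $-1/\pi^2$, so the transform is negative there and the function cannot be an autocorrelation.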
COROLLARY. For any $a$ and $b$,

$$\left|\int_a^b S_{xy}(\omega)\,d\omega\right|^2 \le \int_a^b S_{xx}(\omega)\,d\omega \int_a^b S_{yy}(\omega)\,d\omega \tag{10-168}$$

Proof. Suppose that $x(t)$ and $y(t)$ are the inputs to the ideal filters

$$H_1(\omega) = H_2(\omega) = \begin{cases} 1 & a < \omega < b \\ 0 & \text{otherwise} \end{cases}$$

Denoting by $z(t)$ and $w(t)$ respectively the resulting outputs, we conclude that

$$R_{zz}(0) = \frac{1}{2\pi}\int_a^b S_{xx}(\omega)\,d\omega \qquad R_{ww}(0) = \frac{1}{2\pi}\int_a^b S_{yy}(\omega)\,d\omega$$

$$R_{zw}(0) = \frac{1}{2\pi}\int_a^b S_{xy}(\omega)\,d\omega$$

and (10-168) follows because $|R_{zw}(0)|^2 \le R_{zz}(0)R_{ww}(0)$.

10-4 DIGITAL PROCESSES

A digital (or discrete-time) process is a sequence $x_n$ of RVs. To avoid double subscripts, we shall use also the notation $x[n]$ where the brackets will indicate that $n$ is an integer. Most results involving analog (or continuous-time) processes can be readily extended to digital processes. We outline the main concepts.

The autocorrelation and autocovariance of $x[n]$ are given by

$$R[n_1,n_2] = E\{x[n_1]x^*[n_2]\} \qquad C[n_1,n_2] = R[n_1,n_2] - \eta[n_1]\eta^*[n_2] \tag{10-169}$$

respectively, where $\eta[n] = E\{x[n]\}$ is the mean of $x[n]$.

A process $x[n]$ is SSS if its statistical properties are invariant to a shift of the origin. It is WSS if $\eta[n] = \eta = \text{constant}$ and

$$R[n+m, n] = E\{x[n+m]x^*[n]\} = R[m] \tag{10-170}$$

A process $x[n]$ is strictly white noise if the RVs $x[n_i]$ are independent. It is white noise if the RVs $x[n_i]$ are uncorrelated. The autocorrelation of a white-noise process with zero mean is thus given by

$$R[n_1,n_2] = q[n_1]\,\delta[n_1 - n_2] \quad \text{where} \quad \delta[n] = \begin{cases} 1 & n = 0 \\ 0 & n \ne 0 \end{cases} \tag{10-171}$$

and $q[n] = E\{x^2[n]\}$. If $x[n]$ is also stationary, then $R[m] = q\,\delta[m]$. Thus a WSS white noise is a sequence of i.i.d. RVs with variance $q$.

The delta response $h[n]$ of a linear system is its response to the delta sequence $\delta[n]$. Its system function is the z transform of $h[n]$:

$$H(z) = \sum_{n=-\infty}^{\infty} h[n]z^{-n} \tag{10-172}$$
If $x[n]$ is the input to a digital system, the resulting output is the digital convolution of $x[n]$ with $h[n]$:

$$y[n] = \sum_{k=-\infty}^{\infty} x[n-k]\,h[k] = x[n] * h[n] \tag{10-173}$$

From this it follows that $\eta_y[n] = \eta_x[n] * h[n]$. Furthermore,

$$R_{xy}[n_1,n_2] = \sum_{k=-\infty}^{\infty} R_{xx}[n_1, n_2-k]\,h^*[k] \tag{10-174}$$

$$R_{yy}[n_1,n_2] = \sum_{r=-\infty}^{\infty} R_{xy}[n_1-r, n_2]\,h[r] \tag{10-175}$$

If $x[n]$ is white noise with average intensity $q[n]$ as in (10-171), then [see (10-90)]

$$E\{y^2[n]\} = q[n] * |h[n]|^2 \tag{10-176}$$

If $x[n]$ is WSS, then $y[n]$ is also WSS with $\eta_y = \eta_x H(1)$. Furthermore,

$$R_{xy}[m] = R_{xx}[m] * h^*[-m] \qquad R_{yy}[m] = R_{xy}[m] * h[m]$$

$$R_{yy}[m] = R_{xx}[m] * \rho[m] \qquad \rho[m] = \sum_{k=-\infty}^{\infty} h[m+k]\,h^*[k] \tag{10-177}$$

as in (10-133) and (10-135).

THE POWER SPECTRUM. Given a WSS process $x[n]$, we form the z transform $S(z)$ of its autocorrelation $R[m]$:

$$S(z) = \sum_{m=-\infty}^{\infty} R[m]z^{-m} \tag{10-178}$$

The power spectrum of $x[n]$ is the function

$$S(\omega) = S(e^{j\omega}) = \sum_{m=-\infty}^{\infty} R[m]e^{-jm\omega} \tag{10-179}$$

Thus $S(e^{j\omega})$ is the discrete-time Fourier transform of $R[m]$. The function $S(e^{j\omega})$ is periodic with period $2\pi$ and Fourier series coefficients $R[m]$. Hence

$$R[m] = \frac{1}{2\pi}\int_{-\pi}^{\pi} S(e^{j\omega})e^{jm\omega}\,d\omega \tag{10-180}$$

It suffices, therefore, to specify $S(e^{j\omega})$ for $|\omega| < \pi$ only (see Fig. 10-15).

If $x[n]$ is a real process, then $R[-m] = R[m]$ and (10-179) yields

$$S(e^{j\omega}) = R[0] + 2\sum_{m=1}^{\infty} R[m]\cos m\omega \tag{10-181}$$

This shows that the power spectrum of a real process is a function of $\cos\omega$ because $\cos m\omega$ is a function of $\cos\omega$.
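[Figure 10-15: the analog spectrum $S_a(\omega)$ and the digital spectrum $S(\omega)$.]

Example 10-31. If $R[m] = a^{|m|}$, then

$$S(z) = \sum_{m=-\infty}^{-1} a^{-m}z^{-m} + \sum_{m=0}^{\infty} a^{m}z^{-m} = \frac{az}{1-az} + \frac{z}{z-a} = \frac{a^{-1}-a}{(a^{-1}+a)-(z^{-1}+z)}$$

Hence

$$S(e^{j\omega}) = \frac{a^{-1}-a}{a^{-1}+a-2\cos\omega}$$

Example 10-32. Proceeding as in the analog case, we can show that the process

$$x[n] = \sum_i c_i e^{j\omega_i n}$$

is WSS iff the coefficients $c_i$ are uncorrelated with zero mean. In this case,

$$R[m] = \sum_i \sigma_i^2 e^{j\beta_i m} \qquad S(\omega) = 2\pi\sum_i \sigma_i^2\,\delta(\omega-\beta_i) \quad |\omega| < \pi \tag{10-182}$$

where $\sigma_i^2 = E\{|c_i|^2\}$, $\omega_i = 2\pi k_i + \beta_i$, and $|\beta_i| < \pi$.

From (10-177) and the convolution theorem, it follows that if $y[n]$ is the output of a linear system with input $x[n]$, then

$$S_{xy}(e^{j\omega}) = S_{xx}(e^{j\omega})H^*(e^{j\omega}) \qquad S_{yy}(e^{j\omega}) = S_{xy}(e^{j\omega})H(e^{j\omega})$$

$$S_{yy}(e^{j\omega}) = S_{xx}(e^{j\omega})\,|H(e^{j\omega})|^2 \tag{10-183}$$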
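The closed form in Example 10-31 can be checked against a direct evaluation of the defining sum (10-179). The following sketch is illustrative only; the value $a = 0.6$, the truncation at $|m| \le 200$, and the variable names are assumptions made for the demonstration.

```python
import numpy as np

a = 0.6                                  # illustrative value with |a| < 1
omega = np.linspace(-np.pi, np.pi, 9)    # a few frequencies in [-pi, pi]
m = np.arange(-200, 201)                 # truncation of the infinite sum
R = a ** np.abs(m)                       # R[m] = a^{|m|}

# (10-179): S(e^{jw}) = sum_m R[m] e^{-jmw}, truncated to |m| <= 200
S_sum = (R[None, :] * np.exp(-1j * np.outer(omega, m))).sum(axis=1).real

# Closed form of Example 10-31
S_closed = (1 / a - a) / (1 / a + a - 2 * np.cos(omega))

max_error = np.max(np.abs(S_sum - S_closed))
print(max_error)
```

The neglected tail of the sum is of the order of $a^{201}$, so the two evaluations should agree to within rounding error.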
If $h[n]$ is real, $H^*(e^{j\omega}) = H(e^{-j\omega})$. In this case

$$S_{yy}(z) = S_{xx}(z)H(z)H(1/z) \tag{10-184}$$

Example 10-33. The first difference

$$y[n] = x[n] - x[n-1]$$

of a process $x[n]$ can be considered as the output of a linear system with input $x[n]$ and system function $H(z) = 1 - z^{-1}$. Applying (10-184), we obtain

$$S_{yy}(z) = S_{xx}(z)(1-z^{-1})(1-z) = S_{xx}(z)(2 - z - z^{-1})$$

$$R_{yy}[m] = -R_{xx}[m+1] + 2R_{xx}[m] - R_{xx}[m-1]$$

If $x[n]$ is white noise with $S_{xx}(z) = q$, then

$$S_{yy}(e^{j\omega}) = q(2 - e^{j\omega} - e^{-j\omega}) = 2q(1-\cos\omega)$$

Example 10-34. The recursion equation

$$y[n] - ay[n-1] = x[n]$$

specifies a linear system with input $x[n]$ and system function $H(z) = 1/(1-az^{-1})$. If $S_{xx}(z) = q$, then (see Example 10-31)

$$S_{yy}(z) = \frac{q}{(1-az^{-1})(1-az)} \qquad R_{yy}[m] = \frac{q\,a^{|m|}}{1-a^2}$$

From (10-183) it follows that

$$E\{|y[n]|^2\} = R_{yy}[0] = \frac{1}{2\pi}\int_{-\pi}^{\pi} S_{xx}(e^{j\omega})\,|H(e^{j\omega})|^2\,d\omega \tag{10-185}$$

Using this identity, we shall show that the power spectrum of a process $x[n]$, real or complex, is a positive function:

$$S_{xx}(e^{j\omega}) \ge 0 \tag{10-186}$$

Proof. We form an ideal bandpass filter with center frequency $\omega_0$ and bandwidth $2\Delta$ and apply (10-185). For small $\Delta$,

$$E\{|y[n]|^2\} = \frac{1}{2\pi}\int_{\omega_0-\Delta}^{\omega_0+\Delta} S_{xx}(e^{j\omega})\,d\omega \simeq \frac{\Delta}{\pi}\,S_{xx}(e^{j\omega_0})$$

and (10-186) results because $E\{|y[n]|^2\} \ge 0$ and $\omega_0$ is arbitrary.

SAMPLING. In many applications, the digital processes under consideration are obtained by sampling various analog processes. We relate next the corresponding correlations and spectra.

Given an analog process $x(t)$, we form the digital process

$$x[n] = x(nT)$$
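For the recursion of Example 10-34 driven by white noise, (10-177) gives $R_{yy}[m] = q\,\rho[m]$ with $\rho[m] = \sum_k h[m+k]h[k]$ and delta response $h[n] = a^n$ for $n \ge 0$. The sketch below evaluates that sum on a truncated delta response and compares it with $q\,a^{|m|}/(1-a^2)$. The values of $a$ and $q$ and the truncation length are illustrative assumptions.

```python
import numpy as np

a, q = 0.8, 2.0                      # illustrative parameters, |a| < 1
n = np.arange(400)
h = a ** n                           # delta response of y[n] - a y[n-1] = x[n]

def rho(m):
    """rho[m] = sum_k h[m + k] h[k] for real h, truncated to the stored samples."""
    m = abs(m)                       # rho is even when h is real
    return float(np.dot(h[m:], h[:len(h) - m]))

lags = range(-5, 6)
R_yy = np.array([q * rho(m) for m in lags])
R_closed = np.array([q * a ** abs(m) / (1 - a**2) for m in lags])

max_error = np.max(np.abs(R_yy - R_closed))
print(max_error)
```

With 400 stored samples the neglected terms are far below machine precision, so the printed error should be at rounding level.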
where $T$ is a given constant. From this it follows that

$$\eta[n] = \eta_a(nT) \qquad R[n_1,n_2] = R_a(n_1T, n_2T) \tag{10-187}$$

where $\eta_a(t)$ is the mean and $R_a(t_1,t_2)$ the autocorrelation of $x(t)$. If $x(t)$ is a stationary process, then $x[n]$ is also stationary with mean $\eta = \eta_a$ and autocorrelation

$$R[m] = R_a(mT)$$

From this it follows that the power spectrum of $x[n]$ equals (Fig. 10-15)

$$S(e^{j\omega}) = \sum_{m=-\infty}^{\infty} R_a(mT)e^{-jm\omega} = \frac{1}{T}\sum_{n=-\infty}^{\infty} S_a\!\left(\frac{\omega+2\pi n}{T}\right) \tag{10-188}$$

where $S_a(\omega)$ is the power spectrum of $x(t)$. The above is a consequence of Poisson's sum formula [see (11A-1)].

Example 10-35. Suppose that $x(t)$ is a WSS process consisting of $M$ exponentials as in (10-130):

$$x(t) = \sum_{i=1}^{M} c_i e^{j\omega_i t} \qquad S_a(\omega) = 2\pi\sum_{i=1}^{M} \sigma_i^2\,\delta(\omega-\omega_i)$$

where $\sigma_i^2 = E\{|c_i|^2\}$. We shall determine the power spectrum $S(e^{j\omega})$ of the process $x[n] = x(nT)$. From (10-188) and the identity $\delta(\omega/T) = T\delta(\omega)$ it follows that

$$S(e^{j\omega}) = 2\pi\sum_{n=-\infty}^{\infty}\sum_{i=1}^{M} \sigma_i^2\,\delta(\omega - T\omega_i + 2\pi n)$$

In the interval $(-\pi,\pi)$, this consists of $M$ lines:

$$S(e^{j\omega}) = 2\pi\sum_{i=1}^{M} \sigma_i^2\,\delta(\omega-\beta_i) \qquad |\omega| < \pi \quad |\beta_i| < \pi$$

where the $\beta_i$ are such that $T\omega_i = 2\pi n_i + \beta_i$.

APPENDIX 10A
CONTINUITY, DIFFERENTIATION, INTEGRATION

In the earlier discussion, we routinely used various limiting operations involving stochastic processes, with the tacit assumption that these operations hold for every sample involved. This assumption is, in many cases, unnecessarily restrictive. To give some idea of the notion of limits in a more general case, we discuss next conditions for the existence of MS limits and we show that these conditions can be phrased in terms of second-order moments (see also Sec. 8-4).

STOCHASTIC CONTINUITY. A process $x(t)$ is called MS continuous if

$$E\{[x(t+\varepsilon) - x(t)]^2\} \to 0 \quad \text{as} \quad \varepsilon \to 0 \tag{10A-1}$$
THEOREM. We maintain that $x(t)$ is MS continuous if its autocorrelation is continuous.

Proof. Clearly,

$$E\{[x(t+\varepsilon) - x(t)]^2\} = R(t+\varepsilon, t+\varepsilon) - 2R(t+\varepsilon, t) + R(t,t)$$

If, therefore, $R(t_1,t_2)$ is continuous, then the right side tends to 0 as $\varepsilon \to 0$ and (10A-1) results.

Note. Suppose that (10A-1) holds for every $t$ in an interval $I$. From this it follows that [see (10-1)] almost all samples of $x(t)$ will be continuous at a particular point of $I$. It does not follow, however, that these samples will be continuous for every point in $I$. We mention as illustrations the Poisson process and the Wiener process. As we see from (10-14) and (11-5), both processes are MS continuous. However, the samples of the Poisson process are discontinuous at the points $t_i$, whereas almost all samples of the Wiener process are continuous.

COROLLARY. If $x(t)$ is MS continuous, then its mean is continuous:

$$\eta(t+\varepsilon) \to \eta(t) \quad \text{as} \quad \varepsilon \to 0 \tag{10A-2}$$

Proof. As we know,

$$E\{[x(t+\varepsilon) - x(t)]^2\} \ge E^2\{x(t+\varepsilon) - x(t)\}$$

Hence (10A-2) follows from (10A-1). The above shows that

$$\lim_{\varepsilon\to 0} E\{x(t+\varepsilon)\} = E\Big\{\lim_{\varepsilon\to 0} x(t+\varepsilon)\Big\} \tag{10A-3}$$

STOCHASTIC DIFFERENTIATION. A process $x(t)$ is MS differentiable if

$$\frac{x(t+\varepsilon) - x(t)}{\varepsilon} \to x'(t) \quad \text{as} \quad \varepsilon \to 0 \tag{10A-4}$$

in the MS sense, that is, if

$$E\left\{\left[\frac{x(t+\varepsilon) - x(t)}{\varepsilon} - x'(t)\right]^2\right\} \to 0 \quad \text{as} \quad \varepsilon \to 0 \tag{10A-5}$$

THEOREM. The process $x(t)$ is MS differentiable if $\partial^2 R(t_1,t_2)/\partial t_1\,\partial t_2$ exists.

Proof. It suffices to show that (Cauchy criterion)

$$E\left\{\left[\frac{x(t+\varepsilon_1) - x(t)}{\varepsilon_1} - \frac{x(t+\varepsilon_2) - x(t)}{\varepsilon_2}\right]^2\right\} \to 0 \quad \text{as} \quad \varepsilon_1, \varepsilon_2 \to 0 \tag{10A-6}$$

We use this criterion because, unlike (10A-5), it does not involve the unknown
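$x'(t)$. Clearly,

$$E\{[x(t+\varepsilon_1) - x(t)][x(t+\varepsilon_2) - x(t)]\} = R(t+\varepsilon_1, t+\varepsilon_2) - R(t+\varepsilon_1, t) - R(t, t+\varepsilon_2) + R(t,t)$$

The right side divided by $\varepsilon_1\varepsilon_2$ tends to $\partial^2 R(t,t)/\partial t\,\partial t$ which, by assumption, exists. Expanding the square in (10A-6), we conclude that its left side tends to

$$\frac{\partial^2 R(t,t)}{\partial t\,\partial t} - 2\,\frac{\partial^2 R(t,t)}{\partial t\,\partial t} + \frac{\partial^2 R(t,t)}{\partial t\,\partial t} = 0$$

COROLLARY. The above yields

$$E\{x'(t)\} = E\left\{\lim_{\varepsilon\to 0}\frac{x(t+\varepsilon) - x(t)}{\varepsilon}\right\} = \lim_{\varepsilon\to 0} E\left\{\frac{x(t+\varepsilon) - x(t)}{\varepsilon}\right\}$$

Note. The autocorrelation of a Poisson process $x(t)$ is discontinuous at the points $t_i$; hence $x'(t)$ does not exist at these points. However, as in the case of deterministic signals, it is convenient to introduce random impulses and to interpret $x'(t)$ as in (10-98).

STOCHASTIC INTEGRALS. A process $x(t)$ is MS integrable if the limit

$$\int_a^b x(t)\,dt = \lim_{\Delta t_i \to 0}\sum_i x(t_i)\,\Delta t_i \tag{10A-7}$$

exists in the MS sense.

THEOREM. The process $x(t)$ is MS integrable if

$$\int_a^b\!\!\int_a^b |R(t_1,t_2)|\,dt_1\,dt_2 < \infty \tag{10A-8}$$

Proof. Using again the Cauchy criterion, we must show that

$$E\left\{\Big|\sum_i x(t_i)\,\Delta t_i - \sum_k x(t_k)\,\Delta t_k\Big|^2\right\} \to 0 \quad \text{as} \quad \Delta t_i, \Delta t_k \to 0$$

This follows if we expand the square and use the identity

$$E\left\{\sum_i x(t_i)\,\Delta t_i \sum_k x(t_k)\,\Delta t_k\right\} = \sum_{i,k} R(t_i,t_k)\,\Delta t_i\,\Delta t_k$$

because the right side tends to the integral of $R(t_1,t_2)$ as $\Delta t_i$ and $\Delta t_k$ tend to 0.

COROLLARY. From the above it follows that

$$E\left\{\Big|\int_a^b x(t)\,dt\Big|^2\right\} = \int_a^b\!\!\int_a^b R(t_1,t_2)\,dt_1\,dt_2 \tag{10A-9}$$

as in (10-11).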
APPENDIX 10B
SHIFT OPERATORS AND STATIONARY PROCESSES

An SSS process can be generated by a succession of shifts $Tx$ of a single RV $x$ where $T$ is a one-to-one measure-preserving transformation (mapping) of the probability space $\mathcal{S}$ into itself. This difficult topic is of fundamental importance in mathematics. In the following, we give a brief explanation of the underlying concept, limiting the discussion to the discrete-time case.

A transformation $T$ of $\mathcal{S}$ into itself is a rule for assigning to each element $\zeta_i$ of $\mathcal{S}$ another element of $\mathcal{S}$:

$$\tilde\zeta_i = T\zeta_i \tag{10B-1}$$

called the image of $\zeta_i$. The images $\tilde\zeta_i$ of all elements $\zeta_i$ of a subset $\mathcal{A}$ of $\mathcal{S}$ form another subset

$$\tilde{\mathcal{A}} = T\mathcal{A}$$

of $\mathcal{S}$ called the image of $\mathcal{A}$.

We shall assume that the transformation $T$ has the following properties.

$P_1$: It is one-to-one. This means that if $\zeta_i \ne \zeta_j$ then $\tilde\zeta_i \ne \tilde\zeta_j$.

$P_2$: It is measure preserving. This means that if $\mathcal{A}$ is an event, then its image $\tilde{\mathcal{A}}$ is also an event and

$$P(\tilde{\mathcal{A}}) = P(\mathcal{A}) \tag{10B-2}$$

Suppose that $x$ is an RV and that $T$ is a transformation as above. The expression $Tx$ will mean another RV

$$y = Tx \quad \text{such that} \quad y(\tilde\zeta_i) = x(\zeta_i) \tag{10B-3}$$

where $\zeta_i$ is the unique inverse of $\tilde\zeta_i$. This specifies $y$ for every element of $\mathcal{S}$ because (see $P_1$) the set of elements $\tilde\zeta_i$ equals $\mathcal{S}$.

The expression $z = T^{-1}x$ will mean that $x = Tz$. Thus

$$z = T^{-1}x \quad \text{iff} \quad z(\zeta_i) = x(\tilde\zeta_i)$$

We can define similarly $T^2x = T(Tx) = Ty$ and

$$T^n x = T(T^{n-1}x) = T^{-1}(T^{n+1}x)$$

for any $n$ positive or negative.

From (10B-3) it follows that if, for some $\zeta_i$, $x(\zeta_i) \le w$, then $y(\tilde\zeta_i) = x(\zeta_i) \le w$. Hence the event $\{y \le w\}$ is the image of the event $\{x \le w\}$. This yields [see (10B-2)]

$$P\{x \le w\} = P\{y \le w\} \qquad y = Tx \tag{10B-4}$$

for any $w$. We thus conclude that the RVs $x$ and $Tx$ have the same distribution.
Given an RV $x$ and a transformation $T$ as above, we form the random process

$$x_0 = x \qquad x_n = T^n x \qquad n = -\infty, \ldots, \infty \tag{10B-5}$$

It follows from (10B-4) that the random variables $x_n$ so formed have the same distribution. We can similarly show that their joint distributions of any order are invariant to a shift of the origin. Hence the process $x_n$ so formed is SSS.

It can be shown that the converse is also true: Given an SSS process $x_n$, we can find an RV $x$ and a one-to-one measure-preserving transformation of the space $\mathcal{S}$ into itself such that for all essential purposes, $x_n = T^n x$. The proof of this difficult result will not be given.

PROBLEMS

10-1. In the fair-coin experiment, we define the process $x(t)$ as follows: $x(t) = \sin \pi t$ if heads shows, $x(t) = 2t$ if tails shows. (a) Find $E\{x(t)\}$. (b) Find $F(x,t)$ for $t = 0.25$, $t = 0.5$, and $t = 1$.

10-2. The process $x(t) = e^{at}$ is a family of exponentials depending on the RV $a$. Express the mean $\eta(t)$, the autocorrelation $R(t_1,t_2)$, and the first-order density $f(x,t)$ of $x(t)$ in terms of the density $f_a(a)$ of $a$.

10-3. Suppose that $x(t)$ is a Poisson process as in Fig. 10-3 such that $E\{x(9)\} = 6$. (a) Find the mean and the variance of $x(8)$. (b) Find $P\{x(2) \le 3\}$. (c) Find $P\{x(4) \le 5 \mid x(2) \le 3\}$.

10-4. The RV $c$ is uniform in the interval $(0,T)$. Find $R_x(t_1,t_2)$ if (a) $x(t) = U(t-c)$, (b) $x(t) = \delta(t-c)$.

10-5. The RVs $a$ and $b$ are independent $N(0;\sigma)$ and $p$ is the probability that the process $x(t) = a - bt$ crosses the $t$ axis in the interval $(0,T)$. Show that $\pi p = \arctan T$. Hint: $p = P\{0 \le a/b \le T\}$.

10-6. Show that if

$$R_v(t_1,t_2) = q(t_1)\,\delta(t_1-t_2) \qquad w''(t) = v(t)U(t)$$

and $w(0) = w'(0) = 0$, then

$$E\{w^2(t)\} = \int_0^t (t-\tau)^2 q(\tau)\,d\tau$$

10-7. The process $x(t)$ is real with autocorrelation $R(\tau)$. (a) Show that

$$P\{|x(t+\tau) - x(t)| \ge a\} \le 2[R(0) - R(\tau)]/a^2$$

(b) Express $P\{|x(t+\tau) - x(t)| \ge a\}$ in terms of the second-order density $f(x_1,x_2;\tau)$ of $x(t)$.

10-8. The process $x(t)$ is WSS and normal with $E\{x(t)\} = 0$ and $R(\tau) = 4e^{-2|\tau|}$.
(o) Find P{x(z) £ 3}. (b) Find £([x(z + 1) - x(z - I)]2}. W-9. Show that the process x(z) = cw(z) is WSS iff £(c) - 0 and w(z) = 4 10-10, The process x(z) is WSS and £{x(z)} = 0. Show- that if z(z) •= x2(z), then C..(r)
1’11411! I Ms 341 10-11. Find E{y(/)}. E{yHt)}, and Rvv(r) if y"(r) + 4y(r) + 13y(f) = 26 + v(r) = 103(7) Find P{y(r) £ 3} if v(t) is normal. 10-12. Show that: If x(f) is a process with zero mean and autocorrelation l(t ,Mt t - t2). then the process y(G = x(r)//(r) is WSS with autocorrelation и(т). if x(r) is white noise with autocorrelation <у(Г|).Й(г( then the process z(t> - x(f)/ Jq(t) is WSS white noise with autocorrelation <S(r). 10-13. Show that |/?,v(r)| < |[/?,,(0) ч Rvv(())]. 10-14. Show that if the processes x(r).y(f) are WSS and E(!x(0) - v(.())l*} = (). then Я,,(т) = Яд>.(т) - Я,ч,(т). Hint: Set z = x(r + т). w = x*(r» - y*(f) in (10-163). 10-15. Show that if x(t) is a complex WSS process, then E{|x(f + -) - x( t )|2} = 2 Re[/?((!) - R( r)] 10-16. Show that if <p is an RV with Ф(А) = E{rM*} and <l>( 1) = Ф(2) = 0. then the process x(f) = cosfrot + <p) is WSS. Find E(x(t)} and /?Дт) if <p is uniform in the interval (-тг, it). 10-17. Given a process x(f) with orthogonal increments and such that x(0) = 0. show that (a) R(t|,t:) = /?(г,. г,) for tt <, t2. and (b) if £{[x(f|) - x(r2)]’} = zyl/j - r2| then the process yit) = (x(f + r) - x(f)]/e is WSS and its autocorrelation is a triangle with area q and base 2e. 10-18. Show that if Rsx(f|. f,) = r/frpSfr, - g) and yU) = x(f )* Mr) then £{x(')y(0} -Л(0Мг) 10-19. The process x(f) is normal with tjx = 0 and Rt(r) = 4e"3,T|. Find a memoryless system g(x) such that the first-order density of the resulting output y(f) = g[x(f)] is uniform in the interval (6.9). /Insiver; g(x) = 3G(a/2) + 6. 10-20. Show that if x(f) is an SSS process and e is an RV independent of x(f), then the process y(f) = x(t - e) is SSS. 10-21. Show that if x(t) is a stationary process with derivative x'(f), then for a given t the RVs x(r) and x'(r) are orthogonal and uncorrclated. 10-22. 
Given a normal process x(r) with = 0 and Rx(r) = 4e~2,T|, we form the RVs z = x(l 4- I), w = х(/ - 1), (n) find £{zw} and E{(i + w)2}, (b) find Л(2) P{Z<1) /U-3V) 10-23. Show that if x(f) is normal with autocorrelation R(r), then P{x'(f) <;0} =G a /- R"(0) 10-24. Show that if xG) is a normal process with zero mean and y(/) = sgnx(f). then 2 x 1 Ry(r) - - £ -[J0(htt) - (-l)"]sin 7Гп-|Л й1Ч(0) where J0(a’) is the Bessel function. Hint: Expand the arcsine in (10-71) into a Fourier series.
342 STOCHASTIC PROPERTIES 10-25. Show that if x(t) is a normal process with zero mean and y(t) = then Vy = =/cxp^y/?,((!)J Яу.(т) = /2ехр{а2[Я,(0) + Яж(т)]} 10-26. Show that (a) if y(f)=ax(ct) then Ry(r) = a2Rx(cr) (ft) if Яд(т) -» 0 as т -> oo and z(/) = lim ^x(eZ) then Rz(t) = q8(r) q = J Rx(T)dr 10-27. Show that if x(f) is white noise, h(t) = 0 outside the interval (0, T), and y(r) = х(/)*й(/) then Ryy(tl,t2) = 0 for |f, - f2| > T. 10-28. Show that if -t2) £{y2(O} = Ф) and (а) y(t) = ffA(t,a)x(a) da then l(t) = ( h2(t ,a)q{a) da Ju Jn (b) y'(t) + c(t)y(t) = x(t) then I'(t) + 2c(t)I(t) = q(t) 10-29. Find E{y2(r)} (a) if Rxx(t) = 58(r) and y'(t) + 2y(f) = x(t) all t (i) (ft) if (i) holds for t > 0 only and y(t) = 0 for t < 0. Hint: Use (10-90). 10-30. The input to a linear system with ft(t) = Ae~a,U(t) is a process x(t) with Я/т) = NSM applied at t = 0 and disconnected at t = T. Find and sketch E(y2(/)}. Hint: Use (10-90) with q(t) = N for 0 < t < T and 0 otherwise. 10-31. Show that if s= [,Ox(t)dt then E{s2} = [,a (10 - |т|)Яд(т)^т A) J -in Find the mean and variance of s if E(x(r)} = 8, Ял(т) = 64 + 10e-2,rl. 10-32. The process x(/) is WSS with Rxx(t) = 58(т) and y'(t) + 2y(r) = x(t) (i) Find E{y2(t)}, Ял>,(/|,72), ЯууОи f2) (a) if (i) holds for all t, (ft) if y(0) = 0 and (i) holds for t 0. 10-33. Find S(w) if (а) Я(т) = e~aT', (ft) Я(т) = e“er* cos aipT. 10-34. Show that the power spectrum of an SSS process x(t) equals 5(«) “ f f xlx2G(xltx2',<o)dXi dx, ОС*'—Ж where G(X|, x2; ш) is the Fourier transform in the variable r of the second-order density ДХ|, x2; t) of x(r).
puohi । ms 343 10-35. Show that ifyG) = x(t + a) - x(t - a). then ^у(т) = 2Л,(т) — /?,(7 + la) — Rx(r — la) .S’((w) = 4.S’>(<li)sin* aa> 10-36. Using (10-122), show that K(0) - R(~) > -2_[Z?(O) - /?(2"т)] Hint: ,0 ,0 1 1 — cos0 = 2 sin* - > 2sin* — cos* - = -(1 - cos20) 2 224' ’ 10-37. The process x(t) is normal with zero mean and /?4(т) = le "!Ti cos /3т. Show that if yG) = x2(f). then Cv(t) = Z3<--2"'T|(I + cos2/3t). Find .S\((1>). 10-38. Show that if R(r) is the inverse Fourier transform of a function 5(o>) and 5(w) > 0, then, for any a,, "La^Rir, - -J > 0 i.k Hint: /_* 5(w)|Lo1f/u,T-| dtaZO 10-39. Find R(t) if (а) 5(ш) = 1/(1 + w1), (b) SM = 1/(4 + w2)2. 10-40. Show that, for complex systems. (10-136) and (10-181) yield Sn.(s) = Si4(jr)H(s)H*(-x*) Svl.(z) = S4r(z)H(z)H*(l/z*) 10-41. The process x(f) is normal with zero mean. Show that if y(t) = x2(t), then S/ш) = 2тт/?*(0)3(ш) + 2S/w)« 5Д(ш) Plot Sy(w) if 5r(w) is (a) ideal LP, (b) ideal BP. 10-42. The process x(f) is WSS with E{x(t)) = 5 and /?лг(т) = 25 + ~2|T|. If y(f) = 2x(f) + 3x'(f), find 7)v, Ryy(r), and 5у>.(ш). 10-43. The process x(f) is WSS and 7?1Л(т) = 55(т). (a) Find E{y2(t)} and Syv(w) if y'(f) + 3y(r) = x(f). (b) Find £{y2(/)} and Rxy(th z2) if у'(f) + 3y(f) = x(t)U(t). Sketch the functions Rxy(2, tt) and Rxy(tt,3). 10-44. Given a complex process x(f) with autocorrelation R(r\ show that if |/?(t,)| = 1, then R(t) = e'“Tw(r) x(f) = ey“"y(f) where w(r) is a periodic function with period and y(r) is an MS periodic process with the same period. 10-45. Show that (a) £{x(t)x(f )} = 0, (b) i(f) = -x(f). 10-46. (Stochastic resonance) The input to the system h(i) ,>+L+s
344 S’lnCHASIlC PROPERTIES is a WSS process x(z) with £{x2(z)} = 10. Find Sx(u) such that the average power E(y2(z)} of the resulting output y(z) is maximum. Hint: |Н(/ш)| is maximum for ш = 7з. 10-47. Show that if Rx(r) - Aeju>°Tt then Rxy(r) = Beju,>r for any y(z). Mini: Use (10-167). 10-48. Given a system /7(ш) with input x(z) and output y(z), show that (a) if x(z) is WSS and Rxx(r) = eiaT, then ^x(r) - Ryy(r) = e""|H(«)|2 (6) if /?„(Z„Z2) = then 10-49. Show that if Sxx(<i>)Syy(.(u) a 0, then S(>.(w) « 0. 10-50. Show that if х[л] is WSS and ЯД1] = Я Д0], then ЯДт] = Я ДО] for every m. 10-51. Show that if Я[гл] = Е{х[л + m]x[zi]), then Я[0]Я[2] > 2Я2(1] -Я2[0] 10-52. Given an RV ш with density /(w) such that /(w) = 0 for Iwl > it, wc form the process x[n] = Aejnuir. Show that Sx(w) = 2tt/12/(w) for |ш| < тт. 10-53. (a) Find £{y2(z)) if y(0) = y'(0) = 0 and У"(') + 7y'(z) + 10y(z) = x(z) Rx(r) = 58(t) (h) Find Ely 2[л]} if y[- 1] == y[-2] = 0 and 8у[л] - 6у[л - 1] + у[л - 2] = х[л] Ях[т] = 5фл] 10-54. The process х[л] is WSS with ЯхДл»] = 56[л|] and у[л] - 0.5у[л - 1] = х[л] (i) Find Е(у21л]}, ЯжД/м„т2], Я^Д/л,. гл2] (о) if (i) holds for all л, (h)ify[-l] = 0 and (i) holds for n 0. 10-55. Show that (a) if Ax[m|,m2] = 9lW|]8[»n1 — m2] and N N 5 = E then E{s2} = £ а,М«] л-0 «-0 (b) If Я^Дг,, z2) = ^(z1)6(z1 - z2) and s= f a(t)x(r)di then £{s2} = f a2(t)<i(i)dt Jo Jo
CHAPTER 11
BASIC APPLICATIONS

11-1 RANDOM WALK, BROWNIAN MOTION, AND THERMAL NOISE

We toss a fair coin every $T$ seconds and after each toss we take instantly a step of length $s$, to the right if heads shows, to the left if tails shows. The process starts at $t = 0$ and our location at time $t$ is a staircase function with discontinuities at the points $t = nT$ (Fig. 11-1a). We have thus created a discrete-state stochastic process $x(t)$ whose samples $x(t,\zeta)$ depend on the particular sequence of heads and tails. This process is called the random walk.

Suppose that at the first $n$ tosses we observe $k$ heads and $n-k$ tails. In this case, our walk consists of $k$ steps to the right and $n-k$ steps to the left. Hence our position at time $t = nT$ is

$$x(nT) = ks - (n-k)s = ms \qquad m = 2k - n$$

Thus $x(nT)$ is an RV taking the values $ms$, where $m$ equals $n$, or $n-2$, ..., or $-n$. Furthermore,

$$P\{x(nT) = ms\} = \binom{n}{k}\frac{1}{2^n} \qquad k = \frac{m+n}{2} \tag{11-1}$$

This is the probability of $k$ heads in $n$ tosses.

We note that $x(nT)$ can be written as a sum

$$x(nT) = x_1 + \cdots + x_n$$

where $x_i$ equals the size of the $i$th step. Thus the RVs $x_i$ are independent
taking the values $\pm s$ and $E\{x_i\} = 0$, $E\{x_i^2\} = s^2$. From this it follows that

$$E\{x(nT)\} = 0 \qquad E\{x^2(nT)\} = ns^2 \tag{11-2}$$

Large t. As we know, if $n$ is large and $k$ is in the $\sqrt{npq}$ vicinity of $np$, then [see (3-27)]

$$\binom{n}{k}p^k q^{n-k} \simeq \frac{1}{\sqrt{2\pi npq}}\,e^{-(k-np)^2/2npq}$$

From this and (11-1) it follows with $p = q = 0.5$ and $m = 2k - n$ that

$$P\{x(nT) = ms\} \simeq \frac{1}{\sqrt{n\pi/2}}\,e^{-m^2/2n}$$

for $m$ of the order of $\sqrt{n}$. Hence

$$P\{x(t) \le ms\} \simeq G(m/\sqrt{n}) \qquad nT - T < t \le nT \tag{11-3}$$

where $G(x)$ is the $N(0,1)$ distribution [see (3-34)].

Note that if $n_1 < n_2 \le n_3 < n_4$ then the increments $x(n_4T) - x(n_3T)$ and $x(n_2T) - x(n_1T)$ of $x(t)$ are independent.

The Wiener process. We shall now examine the limiting form of the random walk as $n \to \infty$ or, equivalently, as $T \to 0$. As we have shown,

$$E\{x^2(t)\} = ns^2 = \frac{ts^2}{T} \qquad t = nT$$

Hence, to obtain meaningful results, we shall assume that $s$ tends to 0 as $\sqrt{T}$:

$$s^2 = \alpha T$$

The limit of $x(t)$ as $T \to 0$ is then a continuous-state process (Fig. 11-1b)

$$w(t) = \lim_{T\to 0} x(t)$$

known as the Wiener process.
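The moments (11-2) and the normal approximation (11-3) can be checked directly from the binomial probabilities (11-1). The following sketch is illustrative; the values $n = 400$, $s = 0.5$, the level $m_0 = 20$, and all function names are assumptions chosen for the demonstration.

```python
from math import comb, erf, sqrt

def walk_pmf(n):
    """P{x(nT) = m s} from (11-1) for a fair coin, returned as {m: probability}."""
    return {2 * k - n: comb(n, k) / 2**n for k in range(n + 1)}

def G(x):
    """N(0, 1) distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

n, s = 400, 0.5
pmf = walk_pmf(n)

# Moments of x(nT); (11-2) predicts 0 and n s^2
mean = sum(m * s * p for m, p in pmf.items())
second_moment = sum((m * s) ** 2 * p for m, p in pmf.items())

# Exact distribution versus the approximation (11-3) at a level of order sqrt(n)
m0 = 20
exact_cdf = sum(p for m, p in pmf.items() if m <= m0)
approx_cdf = G(m0 / sqrt(n))
print(mean, second_moment, exact_cdf, approx_cdf)
```

The moments agree with (11-2) to rounding error, while the distribution agrees only approximately, as expected from a limit theorem applied at finite $n$.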
We shall show that the first-order density $f(w,t)$ of $w(t)$ is normal with zero mean and variance $\alpha t$:

$$f(w,t) = \frac{1}{\sqrt{2\pi\alpha t}}\,e^{-w^2/2\alpha t} \tag{11-4}$$

Proof. If $w = ms$ and $t = nT$, then

$$\frac{m}{\sqrt{n}} = \frac{w/s}{\sqrt{t/T}} = \frac{w}{\sqrt{\alpha t}}$$

Inserting into (11-3), we conclude that

$$P\{w(t) \le w\} = G\!\left(\frac{w}{\sqrt{\alpha t}}\right)$$

and (11-4) results.

We show next that the autocorrelation of $w(t)$ equals

$$R(t_1,t_2) = \alpha\min(t_1,t_2) \tag{11-5}$$

Indeed, if $t_1 < t_2$, then the difference $w(t_2) - w(t_1)$ is independent of $w(t_1)$. Hence

$$E\{[w(t_2) - w(t_1)]\,w(t_1)\} = E\{w(t_2) - w(t_1)\}\,E\{w(t_1)\} = 0$$

This yields

$$E\{w(t_1)w(t_2)\} = E\{w^2(t_1)\} = \frac{t_1 s^2}{T} = \alpha t_1$$

as in (11-5). The proof is similar if $t_1 > t_2$.

Note finally that if $t_1 < t_2 < t_3 < t_4$ then the increments $w(t_4) - w(t_3)$ and $w(t_2) - w(t_1)$ of $w(t)$ are independent.

Generalized random walk. The random walk can be written as a sum

$$x(t) = \sum_{k=1}^{n} c_k\,U(t - kT) \qquad (n-1)T < t \le nT \tag{11-6}$$

where $c_k$ is a sequence of i.i.d. RVs taking the values $s$ and $-s$ with equal probability. In the generalized random walk, the RVs $c_k$ take the values $s$ and $-s$ with probability $p$ and $q$ respectively. In this case,

$$E\{c_k\} = (p-q)s \qquad E\{c_k^2\} = s^2 \qquad \sigma_{c_k}^2 = 4pqs^2$$

From this it follows that

$$E\{x(t)\} = n(p-q)s \qquad \operatorname{Var} x(t) = 4npqs^2 \tag{11-7}$$

For large $n$, the process $x(t)$ is nearly normal with

$$E\{x(t)\} \simeq \frac{t}{T}(p-q)s \qquad \operatorname{Var} x(t) \simeq \frac{4t}{T}pqs^2 \tag{11-8}$$
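The autocorrelation (11-5) can be estimated by simulating many random walks with a small step interval and $s^2 = \alpha T$. This is a Monte Carlo sketch under illustrative assumptions: the seed, the number of paths, the values of $\alpha$ and $T$, and the variable names are choices made here, and the estimate carries sampling error.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, T = 1.0, 0.01
s = np.sqrt(alpha * T)               # step size chosen so that s^2 = alpha T
paths, n_steps = 20000, 100

# Each row is one sample of the walk; column n-1 holds the position at t = nT
steps = rng.choice([-s, s], size=(paths, n_steps))
x = np.cumsum(steps, axis=1)

t1, t2 = 0.5, 1.0
i1, i2 = round(t1 / T) - 1, round(t2 / T) - 1
R_estimate = np.mean(x[:, i1] * x[:, i2])
R_theory = alpha * min(t1, t2)
print(R_estimate, R_theory)
```

With these settings the standard error of the estimate is below 0.01, so the two printed values should agree to about two decimal places.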
Brownian Motion

The term brownian motion is used to describe the movement of a particle in a liquid, subjected to collisions and other forces. Macroscopically, the position $x(t)$ of the particle can be modeled as a stochastic process satisfying a second-order differential equation:

$$mx''(t) + fx'(t) + cx(t) = F(t) \qquad c > 0 \tag{11-9}$$

where $F(t)$ is the collision force, $m$ is the mass of the particle, $f$ is the coefficient of friction, and $cx(t)$ is an external force which we assume proportional to $x(t)$. On a macroscopic scale, the process $F(t)$ can be viewed as normal white noise with zero mean and power spectrum

$$S_F(\omega) = 2kTf \tag{11-10}$$

where $T$ is the absolute temperature of the medium and $k = 1.38 \times 10^{-23}$ J/K is the Boltzmann constant. We shall determine the statistical properties of $x(t)$ for various cases.

Bound motion. We assume first that the restoring force $cx(t)$ is different from 0. For sufficiently large $t$, the position $x(t)$ of the particle approaches a stationary state with zero mean and power spectrum

$$S_x(\omega) = \frac{2kTf}{(c - m\omega^2)^2 + f^2\omega^2} \tag{11-11}$$

To determine the statistical properties of $x(t)$, it suffices to find its autocorrelation. We shall do so under the assumption that the roots of the equation $ms^2 + fs + c = 0$ are complex:

$$s_{1,2} = -\alpha \pm j\beta \qquad \alpha = \frac{f}{2m} \qquad \alpha^2 + \beta^2 = \frac{c}{m}$$

Replacing $b$, $c$, and $q$ in Example 10-26b by $f/m$, $c/m$, and $2kTf/m^2$ respectively, we obtain

$$R_x(\tau) = \frac{kT}{c}\,e^{-\alpha|\tau|}\left(\cos\beta\tau + \frac{\alpha}{\beta}\sin\beta|\tau|\right) \tag{11-12}$$

Thus, for a specific $t$, $x(t)$ is a normal RV with mean 0 and variance $R_x(0) = kT/c$. Hence its density equals

$$f_x(x) = \sqrt{\frac{c}{2\pi kT}}\,e^{-cx^2/2kT} \tag{11-13}$$

The conditional density of $x(t)$ assuming $x(t_0) = x_0$ is a normal curve with mean $ax_0$ and variance $P$ where (see Example 8-11)

$$a = \frac{R_x(\tau)}{R_x(0)} \qquad P = R_x(0)(1 - a^2) \qquad \tau = t - t_0$$
FREE MOTION. We say that a particle is in free motion if the restoring force is 0. In this case, (11-9) yields

$$mx''(t) + fx'(t) = F(t) \tag{11-14}$$

The solution of this equation is not a stationary process. We shall express its properties in terms of the properties of the velocity $v(t)$ of the particle. With $v(t) = x'(t)$, it follows from (11-14) that

$$mv'(t) + fv(t) = F(t) \tag{11-15}$$

The steady-state solution of this equation is a stationary process with

$$S_v(\omega) = \frac{2kTf}{m^2\omega^2 + f^2} \qquad R_v(\tau) = \frac{kT}{m}\,e^{-f|\tau|/m} \tag{11-16}$$

From the preceding, it follows that $v(t)$ is a normal process with zero mean, variance $kT/m$, and density

$$f_v(v) = \sqrt{\frac{m}{2\pi kT}}\,e^{-mv^2/2kT} \tag{11-17}$$

The conditional density of $v(t)$ assuming $v(0) = v_0$ is normal with mean $av_0$ and variance $P$ (see Example 8-11) where

$$a = \frac{R_v(t)}{R_v(0)} = e^{-ft/m} \qquad P = \frac{kT}{m}(1 - a^2) = \frac{kT}{m}\left(1 - e^{-2ft/m}\right)$$

In physics, (11-15) is called the Langevin equation, its solution the Ornstein-Uhlenbeck process, and its spectrum Lorentzian.

The position $x(t)$ of the particle is the integral of its velocity:

$$x(t) = \int_0^t v(\alpha)\,d\alpha \tag{11-18}$$

From this and (10-11) it follows that

$$E\{x^2(t)\} = \int_0^t\!\!\int_0^t R_v(\alpha-\beta)\,d\alpha\,d\beta = \frac{kT}{m}\int_0^t\!\!\int_0^t e^{-f|\alpha-\beta|/m}\,d\alpha\,d\beta$$

Hence

$$E\{x^2(t)\} = \frac{2kT}{f}\left(t - \frac{m}{f} + \frac{m}{f}\,e^{-ft/m}\right) \tag{11-19}$$

Thus, the position of a particle in free motion is a nonstationary normal process with zero mean and variance the right side of (11-19).

For $t \gg m/f$, (11-19) yields

$$E\{x^2(t)\} \simeq \frac{2kT}{f}\,t = 2D^2t \qquad D^2 = \frac{kT}{f} \tag{11-20}$$

The parameter $D$ is the diffusion constant. This result will be presently rederived.
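The step from the double integral of $R_v$ to the closed form (11-19) can be checked by numerical integration. The sketch below uses a midpoint rule on a square grid; the parameter values ($kT = 1$, $m = 2$, $f = 0.5$, $t = 10$), the grid size, and the function names are illustrative assumptions, not values from the text.

```python
import numpy as np

kT, m, f = 1.0, 2.0, 0.5             # illustrative values; kT stands for the product k*T

def R_v(tau):
    # (11-16): autocorrelation of the velocity
    return (kT / m) * np.exp(-f * np.abs(tau) / m)

def mean_square_position(t, points=1001):
    """Midpoint-rule approximation of the double integral of R_v(a - b) over (0, t)^2."""
    h = t / points
    grid = (np.arange(points) + 0.5) * h
    return R_v(grid[:, None] - grid[None, :]).sum() * h * h

def closed_form(t):
    # (11-19)
    return (2 * kT / f) * (t - m / f + (m / f) * np.exp(-f * t / m))

t = 10.0
numeric, exact = mean_square_position(t), closed_form(t)
asymptote = 2 * kT * t / f           # (11-20), accurate only for t >> m/f
print(numeric, exact, asymptote)
```

Here $t = 10$ is only 2.5 times $m/f = 4$, so the asymptote (11-20) noticeably overestimates the exact value; the gap shrinks in relative terms as $t$ grows.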
THE WIENER PROCESS. We now assume that the acceleration term $mx''(t)$ of a particle in free motion is small compared to the friction term $fx'(t)$; this is the case if $f \gg m/t$. Neglecting the term $mx''(t)$ in (11-14), we conclude that

$$fx'(t) = F(t) \qquad x(t) = \frac{1}{f}\int_0^t F(\alpha)\,d\alpha$$

Because $F(t)$ is white noise with spectrum $2kTf$, it follows from (10-36) with $v(t) = F(t)/f$ and $q(t) = 2kT/f$ that

$$E\{x^2(t)\} = \frac{2kT}{f}\,t = \alpha t \qquad \alpha = \frac{2kT}{f} = 2D^2$$

Thus, $x(t)$ is a nonstationary normal process with density

$$f_{x(t)}(x) = \frac{1}{\sqrt{2\pi\alpha t}}\,e^{-x^2/2\alpha t}$$

We maintain that it is also a process with independent increments. Because it is normal, it suffices to show that it is a process with orthogonal increments, that is,

$$E\{[x(t_2) - x(t_1)][x(t_4) - x(t_3)]\} = 0 \tag{11-21}$$

for $t_1 < t_2 < t_3 < t_4$. This follows from the fact that $x(t_i) - x(t_j)$ depends only on the values of $F(t)$ in the interval $(t_i, t_j)$ and $F(t)$ is white noise. Using this, we shall show that

$$R_x(t_1,t_2) = \alpha\min(t_1,t_2) \tag{11-22}$$

To do so, we observe from (11-21) that if $t_1 < t_2$, then

$$E\{x(t_1)x(t_2)\} = E\{x(t_1)[x(t_2) - x(t_1) + x(t_1)]\} = E\{x^2(t_1)\} = \alpha t_1$$

and (11-22) results.

Thus the position of a particle in free motion with negligible acceleration has the following properties:

It is normal with zero mean, variance $\alpha t$, and autocorrelation $\alpha\min(t_1,t_2)$. It is a process with independent increments.

A process with these properties is called the Wiener process. As we have seen, it is the limiting form of the position of a particle in free motion as $t \to \infty$; it is also the limiting form of the random walk process as $n \to \infty$.

We note finally that the conditional density of $x(t)$ assuming $x(t_0) = x_0$ is normal with mean $ax_0$ and variance $P$ where (see Example 8-11)

$$a = \frac{R(t,t_0)}{R(t_0,t_0)} = 1 \qquad P = R(t,t) - aR(t,t_0) = \alpha t - \alpha t_0$$

Hence

$$f(x,t \mid x_0,t_0) = \frac{1}{\sqrt{2\pi\alpha(t-t_0)}}\,e^{-(x-x_0)^2/2\alpha(t-t_0)} \tag{11-23}$$
FIGURE 11-2

Diffusion equations. The right side of (11-23) is a function depending on the four parameters $x$, $x_0$, $t$, and $t_0$. Denoting this function by $\pi(x, x_0; t, t_0)$, we conclude by repeated differentiation that

$$ \frac{\partial \pi}{\partial t} = D^2\, \frac{\partial^2 \pi}{\partial x^2} \qquad \frac{\partial \pi}{\partial t_0} = -D^2\, \frac{\partial^2 \pi}{\partial x_0^2} \tag{11-24} $$

where $D^2 = \alpha/2$. These equations are called diffusion equations. They are reestablished in Sec. 16-4 in the context of Markoff processes.

Thermal noise

Thermal noise is the distribution of voltages and currents in a network due to the thermal electron agitation. In the following, we discuss the statistical properties of thermal noise ignoring the underlying physics. The analysis is based on a model consisting of noiseless reactive elements and noisy resistors.

A noisy resistor is modeled by a noiseless resistor $R$ in series with a voltage source $n_e(t)$ or in parallel with a current source $n_i(t) = n_e(t)/R$ as in Fig. 11-2. It is assumed that $n_e(t)$ is a normal process with zero mean and flat spectrum

$$ S_{n_e}(\omega) = 2kTR \qquad S_{n_i}(\omega) = \frac{S_{n_e}(\omega)}{R^2} = 2kTG \tag{11-25} $$

where $k$ is the Boltzmann constant, $T$ is the absolute temperature of the resistor, and $G = 1/R$ is its conductance. Furthermore, the noise sources of the various network resistors are mutually independent processes. Note the similarity between the spectrum (11-25) of thermal noise and the spectrum (11-10) of the collision forces in brownian motion.

Using the above and the properties of linear systems, we shall derive the spectral properties of general network responses starting with an example.

Example 11-1. The circuit of Fig. 11-3 consists of a resistor $R$ and a capacitor $C$. We shall determine the spectrum of the voltage $v(t)$ across the capacitor due to thermal noise.

The voltage $v(t)$ can be considered as the output of a system with input the noise voltage $n_e(t)$ and system function
$$ H(s) = \frac{1}{1 + RCs} $$

FIGURE 11-3

Applying (10-136), we obtain

$$ S_v(\omega) = S_{n_e}(\omega)\,|H(\omega)|^2 = \frac{2kTR}{1 + \omega^2 R^2 C^2} \qquad R_v(\tau) = \frac{kT}{C}\, e^{-|\tau|/RC} \tag{11-26} $$

The following consequences are illustrations of Nyquist's theorem to be discussed presently: We denote by $Z(s)$ the impedance across the terminals $a$, $b$ and by $z(t)$ its inverse transform

$$ Z(s) = \frac{R}{1 + RCs} \qquad z(t) = \frac{1}{C}\, e^{-t/RC}\, U(t) $$

The function $z(t)$ is the voltage across $C$ due to an impulse current $\delta(t)$ (Fig. 11-3). Comparing with (11-26), we obtain

$$ S_v(\omega) = 2kT\, \mathrm{Re}\, Z(j\omega) \qquad \mathrm{Re}\, Z(j\omega) = \frac{R}{1 + \omega^2 R^2 C^2} $$

$$ R_v(\tau) = kT\, z(\tau) \quad \tau > 0 \qquad R_v(0) = kT\, z(0^+) $$

$$ E\{v^2(t)\} = R_v(0) = \frac{kT}{C} \qquad \frac{1}{C} = \lim_{\omega \to \infty} j\omega Z(j\omega) $$

Given a passive, reciprocal network, we denote by $v(t)$ the voltage across two arbitrary terminals $a$, $b$ and by $Z(s)$ the impedance from $a$ to $b$ (Fig. 11-4).

FIGURE 11-4

NYQUIST THEOREM. The power spectrum of $v(t)$ equals

$$ S_v(\omega) = 2kT\, \mathrm{Re}\, Z(j\omega) \tag{11-27} $$

Proof. We shall assume that there is only one resistor in the network. The general case can be established similarly if we use the independence of the noise sources. The resistor is represented by a noiseless resistor in parallel with
a current source $n_i(t)$ and the remaining network contains only reactive elements (Fig. 11-5a). Thus $v(t)$ is the output of a system with input $n_i(t)$ and system function $H(\omega)$. From the reciprocity theorem it follows that $H(\omega) = V(\omega)/I(\omega)$ where $I(\omega)$ is the amplitude of a sine wave from $a$ to $b$ (Fig. 11-5b) and $V(\omega)$ is the amplitude of the voltage across $R$. The input power equals $|I(\omega)|^2\, \mathrm{Re}\, Z(j\omega)$ and the power delivered to the resistance equals $|V(\omega)|^2/R$. Since the connecting network is lossless by assumption, we conclude that

$$ |I(\omega)|^2\, \mathrm{Re}\, Z(j\omega) = \frac{|V(\omega)|^2}{R} $$

Hence

$$ |H(\omega)|^2 = \frac{|V(\omega)|^2}{|I(\omega)|^2} = R\, \mathrm{Re}\, Z(j\omega) $$

and (11-27) results because

$$ S_v(\omega) = S_{n_i}(\omega)\,|H(\omega)|^2 = \frac{2kT}{R}\,|H(\omega)|^2 $$

FIGURE 11-5

COROLLARY 1. The autocorrelation of $v(t)$ equals

$$ R_v(\tau) = kT\, z(\tau) \qquad \tau > 0 \tag{11-28} $$

where $z(t)$ is the inverse transform of $Z(s)$.

Proof. Since $Z(-j\omega) = Z^*(j\omega)$, it follows from (11-27) that

$$ S_v(\omega) = kT\,[Z(j\omega) + Z(-j\omega)] $$

and (11-28) results because the inverse of $Z(-j\omega)$ equals $z(-t)$ and $z(-t) = 0$ for $t > 0$.

COROLLARY 2. The average power of $v(t)$ equals

$$ E\{v^2(t)\} = \frac{kT}{C} \qquad \text{where} \qquad \frac{1}{C} = \lim_{\omega \to \infty} j\omega Z(j\omega) \tag{11-29} $$

where $C$ is the input capacity.
Proof. As we know (initial value theorem)

$$ z(0^+) = \lim_{s \to \infty} sZ(s) $$

and (11-29) follows from (11-28) because

$$ E\{v^2(t)\} = R_v(0) = kT\, z(0^+) $$

Currents. From Thévenin's theorem it follows that, terminally, a noisy network is equivalent to a noiseless network with impedance $Z(s)$ in series with a voltage source $v(t)$. The power spectrum $S_v(\omega)$ of $v(t)$ is the right side of (11-27). This leads to the following version of Nyquist's theorem: The power spectrum of the short-circuit current $i(t)$ from $a$ to $b$ due to thermal noise equals

$$ S_i(\omega) = 2kT\, \mathrm{Re}\, Y(j\omega) \qquad Y(s) = \frac{1}{Z(s)} \tag{11-30} $$

Proof. From Thévenin's theorem it follows that

$$ S_i(\omega) = S_v(\omega)\,|Y(j\omega)|^2 = \frac{2kT\, \mathrm{Re}\, Z(j\omega)}{|Z(j\omega)|^2} $$

and (11-30) results.

The current version of the corollaries is left as an exercise.

11-2 POISSON POINTS AND SHOT NOISE

Given a set of Poisson points $t_i$ and a fixed point $t_0$, we form the RV $\mathbf{z} = t_1 - t_0$ where $t_1$ is the first random point to the right of $t_0$ (Fig. 11-6). We shall show that $\mathbf{z}$ has an exponential distribution:

$$ f_z(z) = \lambda e^{-\lambda z} \qquad F_z(z) = 1 - e^{-\lambda z} \qquad z > 0 \tag{11-31} $$

Proof. For a given $z > 0$, the function $F_z(z)$ equals the probability of the event $\{\mathbf{z} \le z\}$. This event occurs if $t_1 < t_0 + z$, that is, if there is at least one random point in the interval $(t_0, t_0 + z)$. Hence

$$ F_z(z) = P\{\mathbf{z} \le z\} = P\{n(t_0, t_0 + z) > 0\} = 1 - P\{n(t_0, t_0 + z) = 0\} $$

and (11-31) results because the probability that there are no points in the interval $(t_0, t_0 + z)$ equals $e^{-\lambda z}$.

FIGURE 11-6
We can show similarly that if $\mathbf{w} = t_0 - t_{-1}$ is the distance from $t_0$ to the first point $t_{-1}$ to the left of $t_0$, then

$$ f_w(w) = \lambda e^{-\lambda w} \qquad F_w(w) = 1 - e^{-\lambda w} \qquad w > 0 \tag{11-32} $$

We shall now show that the distance $\mathbf{x}_n = t_n - t_0$ from $t_0$ to the $n$th random point $t_n$ to the right of $t_0$ (Fig. 11-6) has an Erlang distribution:

$$ f_n(x) = \frac{\lambda^n x^{n-1}}{(n-1)!}\, e^{-\lambda x} \qquad x > 0 \tag{11-33} $$

Proof. The event $\{\mathbf{x}_n \le x\}$ occurs if there are at least $n$ points in the interval $(t_0, t_0 + x)$. Hence

$$ F_n(x) = P\{\mathbf{x}_n \le x\} = 1 - P\{n(t_0, t_0 + x) < n\} = 1 - \sum_{k=0}^{n-1} \frac{(\lambda x)^k}{k!}\, e^{-\lambda x} $$

Differentiating, we obtain (11-33).

Distance between random points. We show next that the distance

$$ \mathbf{x} = \mathbf{x}_n - \mathbf{x}_{n-1} = t_n - t_{n-1} $$

between two consecutive points $t_{n-1}$ and $t_n$ has an exponential distribution:

$$ f_x(x) = \lambda e^{-\lambda x} \tag{11-34} $$

Proof. From (11-33) and (5-70) it follows that the moment function of $\mathbf{x}_n$ equals

$$ \Phi_n(s) = \frac{\lambda^n}{(\lambda - s)^n} \tag{11-35} $$

Furthermore, the RVs $\mathbf{x}$ and $\mathbf{x}_{n-1}$ are independent and $\mathbf{x}_n = \mathbf{x} + \mathbf{x}_{n-1}$. Hence, if $\Phi_x(s)$ is the moment function of $\mathbf{x}$, then

$$ \Phi_n(s) = \Phi_x(s)\,\Phi_{n-1}(s) $$

Comparing with (11-35), we obtain $\Phi_x(s) = \lambda/(\lambda - s)$ and (11-34) results.

An apparent paradox. We should stress that the notion of the "distance $\mathbf{x}$ between two consecutive points of a point process" is ambiguous. In Fig. 11-6, we interpreted $\mathbf{x}$ as the distance between $t_{n-1}$ and $t_n$ where $t_n$ was the $n$th random point to the right of some fixed point $t_0$. This interpretation led to the conclusion that the density of $\mathbf{x}$ is an exponential as in (11-34). The same density is obtained if we interpret $\mathbf{x}$ as the distance between consecutive points to the left of $t_0$. Suppose, however, that $\mathbf{x}$ is interpreted as follows:

Given a fixed point $t_a$, we denote by $t_l$ and $t_r$ the random points nearest to $t_a$ on its left and right respectively (Fig. 11-7a). We maintain that the density of the distance $\mathbf{x} = t_r - t_l$ between these two points equals

$$ f_x(x) = \lambda^2 x\, e^{-\lambda x} \tag{11-36} $$
FIGURE 11-7

Indeed the RVs $\mathbf{x}_l = t_a - t_l$ and $\mathbf{x}_r = t_r - t_a$ are independent with exponential density as in (11-31); furthermore, $\mathbf{x} = \mathbf{x}_r + \mathbf{x}_l$. This yields (11-36) because the convolution of two exponentials is the density in (11-36).

Thus, although $\mathbf{x}$ is again the "distance between two consecutive points," its density is not an exponential. This apparent paradox is a consequence of the ambiguity in the specification of the identity of random points. Suppose, for example, that we identify the points $t_i$ by their order $i$, where the count starts from some fixed point $t_0$, and we observe that in one particular realization of the point process, the point $t_l$, defined as above, equals $t_n$. In other realizations of the process, the RV $t_l$ might equal some other point in this identification (Fig. 11-7b). The same argument shows that the point $t_r$ does not coincide with the ordered point $t_{n+1}$ for all realizations. Hence we should not expect that the RV $\mathbf{x} = t_r - t_l$ has the same density as the RV $t_{n+1} - t_n$.

CONSTRUCTIVE DEFINITION. Given a sequence $\mathbf{w}_n$ of positive i.i.d. (independent, identically distributed) RVs with density

$$ f(w) = \lambda e^{-\lambda w} \tag{11-37} $$

we form a set of points $t_n$ as in Fig. 11-8a where $t = 0$ is an arbitrary origin and

$$ t_n = \mathbf{w}_1 + \mathbf{w}_2 + \cdots + \mathbf{w}_n \tag{11-38} $$

We maintain that the points so formed are Poisson distributed with parameter $\lambda$.
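The apparent paradox can be probed by simulation. The sketch below builds points from i.i.d. exponential gaps as in (11-37) and (11-38), then compares the mean of the gap that covers a fixed point $t_a$ with the mean of the gap between two ordered points counted from the origin. The first should be close to $2/\lambda$, the mean of the density (11-36), and the second close to $1/\lambda$. All numerical values and function names are illustrative assumptions.

```python
import random
from bisect import bisect_right

def poisson_points(lam, t_end, rng):
    """Points t_n = w_1 + ... + w_n in (0, t_end) built from i.i.d. exponential gaps."""
    points, t = [], 0.0
    while True:
        t += rng.expovariate(lam)
        if t > t_end:
            return points
        points.append(t)

def mean_gaps(lam=1.0, t_a=20.0, t_end=40.0, n_order=5, runs=20000, seed=1):
    """Return (mean gap covering t_a, mean gap between ordered points n_order and n_order + 1)."""
    rng = random.Random(seed)
    covering, ordered = [], []
    for _ in range(runs):
        pts = poisson_points(lam, t_end, rng)
        k = bisect_right(pts, t_a)            # number of points at or to the left of t_a
        if 0 < k < len(pts):                  # need a neighbor on each side of t_a
            covering.append(pts[k] - pts[k - 1])
        if len(pts) > n_order:
            ordered.append(pts[n_order] - pts[n_order - 1])
    return sum(covering) / len(covering), sum(ordered) / len(ordered)

covering_mean, ordered_mean = mean_gaps()
print(covering_mean, ordered_mean)
```

The gap that covers $t_a$ is selected by the position of $t_a$, so long gaps are more likely to be picked; the ordered gap is selected by its index and is not biased in this way.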
FIGURE 11-8

Proof. From the independence of the RVs $\mathbf{w}_n$, it follows that the RVs $t_n$ and $\mathbf{w}_{n+1}$ are independent, the density $f_n(t)$ of $t_n$ is given by (11-33)

$$ f_n(t) = \frac{\lambda^n}{(n-1)!}\, t^{n-1} e^{-\lambda t} \tag{11-39} $$

and the joint density of $t_n$ and $\mathbf{w}_{n+1}$ equals the product $f_n(t)f(w)$. If $t_n < \tau$ and $t_{n+1} = t_n + \mathbf{w}_{n+1} > \tau$, then there are exactly $n$ points in the interval $(0, \tau)$. As we see from Fig. 11-8b, the probability of this event equals

$$ \int_0^{\tau}\!\!\int_{\tau - t}^{\infty} \lambda e^{-\lambda w}\, \frac{\lambda^n}{(n-1)!}\, t^{n-1} e^{-\lambda t}\, dw\, dt = e^{-\lambda\tau}\, \frac{(\lambda\tau)^n}{n!} $$

Thus the points $t_n$ so constructed have property $P_1$. We can show similarly that they have also property $P_2$.

POISSON POINTS REDEFINED. Poisson points are realistic models for a large class of point processes: photon count, electron emission, telephone calls, data communication, visits to a doctor, arrivals at a park. The reason is that in these and other applications, the properties of the underlying points can be derived from certain general conditions that lead to Poisson distributions. As we show next, these conditions can be stated in a variety of forms that are equivalent to the two conditions used in Sec. 3-4 to specify random Poisson points (see page 59).

I. If we place at random $N$ points in an interval of length $T$ where $N \gg 1$, then the resulting point process is nearly Poisson with parameter $N/T$. This is exact in the limit as $N$ and $T$ tend to $\infty$ [see (3-47)].

II. If the distances $\mathbf{w}_n$ between two consecutive points $t_{n-1}$ and $t_n$ of a point process are independent and exponentially distributed, as in (11-37), then this process is Poisson.

The above can be phrased in an equivalent form: If the distance $\mathbf{w}$ from an arbitrary point $t_0$ to the next point of a point process is an RV
whose density does not depend on the choice of $t_0$, then the process is Poisson. The reason for this equivalence is that this assumption leads to the conclusion that

$$ f(w \mid \mathbf{w} \ge t_0) = f(w - t_0) \tag{11-40} $$

and the only function satisfying (11-40) is an exponential (see Example 7-10). In queueing theory, the above is called the Markoff or memoryless property.

III. If the number of points $n(t, t + dt)$ in an interval $(t, t + dt)$ is such that:
(a) $P\{n(t, t + dt) = 1\}$ is of the order of $dt$;
(b) $P\{n(t, t + dt) > 1\}$ is of order higher than $dt$;
(c) the above probabilities do not depend on the state of the point process outside the interval $(t, t + dt)$;
then the process is Poisson (see Sec. 16-4).

IV. Suppose, finally, that:
(a) $P\{n(a, b) = k\}$ depends only on $k$ and on the length of the interval $(a, b)$;
(b) if the intervals $(a_i, b_i)$ are nonoverlapping, then the RVs $n(a_i, b_i)$ are independent;
(c) $P\{n(a, b) = \infty\} = 0$.
These conditions lead again to the conclusion that the probability $p_k(\tau)$ of having $k$ points in any interval of length $\tau$ equals

$$ p_k(\tau) = e^{-\lambda\tau}\, \frac{(\lambda\tau)^k}{k!} \tag{11-41} $$

The proof is omitted.

Linear interpolation. The process

$$ x(t) = t - t_n \qquad t_n \le t < t_{n+1} \tag{11-42} $$

of Fig. 11-9 consists of straight line segments of slope 1 between two consecutive random points $t_n$ and $t_{n+1}$. For a specific $t$, $x(t)$ equals the distance $\mathbf{w} = t - t_n$ from $t$ to the nearest point $t_n$ to the left of $t$; hence the first-order distribution of $x(t)$ is exponential as in (11-32). From this it follows that

$$ E\{x(t)\} = \frac{1}{\lambda} \qquad E\{x^2(t)\} = \frac{2}{\lambda^2} \tag{11-43} $$

FIGURE 11-9
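Condition I can be checked without simulation. If $N$ points are placed independently and uniformly in an interval of length $T$ (the usual reading of "at random", taken here as an assumption), the number of points in a subinterval of length $\tau$ is binomial with $p = \tau/T$. The sketch below compares these probabilities with the Poisson probabilities (11-41) for $\lambda = N/T$; the numbers and function names are illustrative.

```python
import math

def prob_uniform(k, n_points, total_len, tau):
    """P{k of n_points uniformly placed points fall in a subinterval of length tau}."""
    p = tau / total_len
    return math.comb(n_points, k) * p**k * (1.0 - p)**(n_points - k)

def prob_poisson(k, lam, tau):
    """Poisson probability (11-41) of k points in an interval of length tau."""
    return math.exp(-lam * tau) * (lam * tau)**k / math.factorial(k)

n_points, total_len, tau = 10000, 5000.0, 1.5     # illustrative values
lam = n_points / total_len                         # parameter N/T of condition I

max_gap = max(
    abs(prob_uniform(k, n_points, total_len, tau) - prob_poisson(k, lam, tau))
    for k in range(12)
)
print(max_gap)
```

Increasing `n_points` and `total_len` together, with their ratio fixed, makes `max_gap` smaller, in line with the statement that the result is exact in the limit.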
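THEOREM. The autocovariance of $x(t)$ equals

$$ C(\tau) = \frac{1}{\lambda^2}\, e^{-\lambda|\tau|} \tag{11-44} $$

Proof. We assume that $\tau > 0$ and we denote by $t_m$ and $t_n$ the random points nearest to the left of the points $t + \tau$ and $t$ respectively. The RV $x(t) = t - t_n$ depends only on the random points to the left of $t$; hence it is independent of the points in the interval $(t, t + \tau)$.

Suppose, first, that $t_m = t_n$; in this case $x(t + \tau) = t + \tau - t_n$ and $x(t) = t - t_n$. Hence [see (11-42) and (11-43)]

$$ E\{x(t + \tau)x(t) \mid t_m = t_n\} = E\{(t - t_n)^2\} + \tau E\{t - t_n\} = \frac{2}{\lambda^2} + \frac{\tau}{\lambda} $$

Suppose, next, that $t_m \ne t_n$; in this case $x(t + \tau) = t + \tau - t_m$ depends only on the points in the interval $(t, t + \tau)$. Hence

$$ E\{x(t + \tau)x(t) \mid t_m \ne t_n\} = E\{t + \tau - t_m \mid t_m \ne t_n\}\, E\{t - t_n\} = \frac{1}{\lambda}\, E\{t + \tau - t_m \mid t_m \ne t_n\} $$

Clearly, $t_m = t_n$ if there are no random points in the interval $(t, t + \tau)$; hence $P\{t_m = t_n\} = e^{-\lambda\tau}$. Similarly, $t_m \ne t_n$ if there is at least one random point in the interval $(t, t + \tau)$; hence $P\{t_m \ne t_n\} = 1 - e^{-\lambda\tau}$. The distance $t + \tau - t_m$ from $t + \tau$ to the nearest point on its left has the exponential density (11-32), and the event $\{t_m \ne t_n\}$ occurs iff this distance is less than $\tau$. Hence

$$ E\{t + \tau - t_m \mid t_m \ne t_n\}\, P\{t_m \ne t_n\} = \int_0^{\tau} u\, \lambda e^{-\lambda u}\, du = \frac{1}{\lambda} - \left(\tau + \frac{1}{\lambda}\right) e^{-\lambda\tau} $$

And since [see (4-48)]

$$ R(\tau) = E\{x(t + \tau)x(t) \mid t_m = t_n\}\, P\{t_m = t_n\} + E\{x(t + \tau)x(t) \mid t_m \ne t_n\}\, P\{t_m \ne t_n\} $$

we conclude that

$$ R(\tau) = \left(\frac{2}{\lambda^2} + \frac{\tau}{\lambda}\right) e^{-\lambda\tau} + \frac{1}{\lambda}\left[\frac{1}{\lambda} - \left(\tau + \frac{1}{\lambda}\right) e^{-\lambda\tau}\right] = \frac{1}{\lambda^2} + \frac{1}{\lambda^2}\, e^{-\lambda\tau} $$

for $\tau > 0$. Subtracting $1/\lambda^2$, we obtain (11-44).

Shot Noise

Given a set of Poisson points $t_i$ with average density $\lambda$ and a real function $h(t)$, we form the sum

$$ s(t) = \sum_i h(t - t_i) \tag{11-45} $$

This sum is an SSS process known as shot noise. In the following, we discuss its second-order properties. The general statistics are developed in Sec. 16-3.

From the definition it follows that $s(t)$ can be represented as the output of a linear system (Fig. 11-10) with impulse response $h(t)$ and input the Poisson impulses

$$ z(t) = \sum_i \delta(t - t_i) \tag{11-46} $$

This representation agrees with the generation of shot noise in physical problems: The process $s(t)$ is the output of a dynamic system activated by a sequence of impulses (particle emissions, for example) occurring at the random times $t_i$.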
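FIGURE 11-10

As we know, $\eta_z = \lambda$; hence

$$ E\{s(t)\} = \lambda \int_{-\infty}^{\infty} h(t)\, dt = \lambda H(0) \tag{11-47} $$

Furthermore, since (see Example 10-22)

$$ S_{zz}(\omega) = 2\pi\lambda^2\delta(\omega) + \lambda \tag{11-48} $$

it follows from (10-136) that

$$ S_{ss}(\omega) = 2\pi\lambda^2 H^2(0)\,\delta(\omega) + \lambda\,|H(\omega)|^2 \tag{11-49} $$

because $|H(\omega)|^2\delta(\omega) = H^2(0)\,\delta(\omega)$. The inverse of the above yields

$$ R_{ss}(\tau) = \lambda^2 H^2(0) + \lambda\rho(\tau) \qquad C_{ss}(\tau) = \lambda\rho(\tau) \tag{11-50} $$

Campbell's theorem. The mean $\eta_s$ and variance $\sigma_s^2$ of the shot-noise process $s(t)$ equal

$$ \eta_s = \lambda \int_{-\infty}^{\infty} h(t)\, dt \qquad \sigma_s^2 = \lambda\rho(0) = \lambda \int_{-\infty}^{\infty} h^2(t)\, dt \tag{11-51} $$

Proof. It follows from (11-50) because $\sigma_s^2 = C_{ss}(0)$.

Example 11-2. If

$$ h(t) = e^{-\alpha t}\, U(t) \qquad H(\omega) = \frac{1}{\alpha + j\omega} $$

then

$$ \eta_s = \frac{\lambda}{\alpha} \qquad \sigma_s^2 = \frac{\lambda}{2\alpha} $$

$$ S_{ss}(\omega) = \frac{2\pi\lambda^2}{\alpha^2}\,\delta(\omega) + \frac{\lambda}{\alpha^2 + \omega^2} \qquad C_{ss}(\tau) = \frac{\lambda}{2\alpha}\, e^{-\alpha|\tau|} $$

Example 11-3 Electron transit. Suppose that $h(t)$ is a triangle as in Fig. 11-11a. Since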
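$$ \int_0^T kt\, dt = \frac{kT^2}{2} \qquad \int_0^T k^2t^2\, dt = \frac{k^2T^3}{3} $$

it follows from (11-51) that

$$ \eta_s = \frac{\lambda kT^2}{2} \qquad \sigma_s^2 = \frac{\lambda k^2T^3}{3} $$

In this case

$$ H(\omega) = \int_0^T kt\, e^{-j\omega t}\, dt = e^{-j\omega T/2}\, \frac{2k\sin(\omega T/2)}{j\omega^2} - e^{-j\omega T}\, \frac{kT}{j\omega} $$

Inserting into (11-49), we obtain (Fig. 11-11b)

$$ S_{ss}(\omega) = 2\pi\eta_s^2\,\delta(\omega) + \frac{\lambda k^2}{\omega^4}\left(2 - 2\cos\omega T + \omega^2T^2 - 2\omega T\sin\omega T\right) $$

Generalized Poisson process and shot noise. Given a set of Poisson points $t_i$ with average density $\lambda$, we form the process

$$ x(t) = \sum_{i=1}^{n(t)} c_i \tag{11-52} $$

where $c_i$ is a sequence of i.i.d. RVs independent of the points $t_i$ with mean $\eta_c$ and variance $\sigma_c^2$. Thus $x(t)$ is a staircase function as in Fig. 10-3 with jumps at the points $t_i$ equal to $c_i$. The process $n(t)$ is the number of Poisson points in the interval $(0, t)$; hence $E\{n(t)\} = \lambda t$ and $E\{n^2(t)\} = \lambda^2t^2 + \lambda t$. For a specific $t$, $x(t)$ is a sum as in (8-46). From this it follows that

$$ E\{x(t)\} = \eta_c E\{n(t)\} = \eta_c\lambda t $$

$$ E\{x^2(t)\} = \eta_c^2 E\{n^2(t)\} + \sigma_c^2 E\{n(t)\} = \eta_c^2(\lambda t + \lambda^2t^2) + \sigma_c^2\lambda t \tag{11-53} $$

Proceeding as in Example 10-5, we obtain

$$ C_{xx}(t_1, t_2) = (\eta_c^2 + \sigma_c^2)\,\lambda \min(t_1, t_2) \tag{11-54} $$

We next form the impulse train

$$ z(t) = x'(t) = \sum_i c_i\,\delta(t - t_i) \tag{11-55} $$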
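The quantities of Example 11-3 can be checked numerically. The sketch below assumes $h(t) = kt$ for $0 < t < T$ and zero elsewhere (the form implied by the integral for $H(\omega)$), evaluates the Campbell integrals (11-51) and the Fourier transform by the midpoint rule, and compares $|H(\omega)|^2$ with the closed-form factor in the spectrum. Parameter values and function names are illustrative.

```python
import cmath
import math

def h_transform_numeric(w, k, T, n=20000):
    """Midpoint-rule Fourier transform of h(t) = k*t on (0, T)."""
    dt = T / n
    return dt * sum(
        k * (i + 0.5) * dt * cmath.exp(-1j * w * (i + 0.5) * dt) for i in range(n)
    )

def h_transform_sq_closed(w, k, T):
    """Closed form of |H(w)|^2 that multiplies lambda in the spectrum of Example 11-3."""
    return (k**2 / w**4) * (2 - 2 * math.cos(w * T) + (w * T)**2 - 2 * w * T * math.sin(w * T))

def campbell(lam, k, T, n=20000):
    """Mean and variance of shot noise from (11-51), by midpoint-rule integration."""
    dt = T / n
    mean = lam * dt * sum(k * (i + 0.5) * dt for i in range(n))
    var = lam * dt * sum((k * (i + 0.5) * dt) ** 2 for i in range(n))
    return mean, var

lam, k, T = 3.0, 2.0, 1.5                       # illustrative values
eta_s, var_s = campbell(lam, k, T)              # expect lam*k*T^2/2 and lam*k^2*T^3/3
w = 2.0
h_sq_numeric = abs(h_transform_numeric(w, k, T)) ** 2
h_sq_closed = h_transform_sq_closed(w, k, T)
print(eta_s, var_s, h_sq_numeric, h_sq_closed)
```

The closed form involves a difference of nearly equal terms for small $\omega T$, so the numerical transform is the safer choice near $\omega = 0$.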
From (11-54) it follows as in (10-100) that

$$ E\{z(t)\} = \frac{d}{dt}\, E\{x(t)\} = \eta_c\lambda \tag{11-56} $$

$$ C_{zz}(t_1, t_2) = \frac{\partial^2 C_{xx}(t_1, t_2)}{\partial t_1\, \partial t_2} = (\eta_c^2 + \sigma_c^2)\,\lambda\,\delta(\tau) \tag{11-57} $$

where $\tau = t_1 - t_2$. Convolving $z(t)$ with a function $h(t)$, we obtain the generalized shot noise

$$ s(t) = \sum_i c_i\, h(t - t_i) = z(t) * h(t) \tag{11-58} $$

This yields

$$ E\{s(t)\} = E\{z(t)\} * h(t) = \eta_c\lambda \int_{-\infty}^{\infty} h(t)\, dt \tag{11-59} $$

$$ C_{ss}(\tau) = C_{zz}(\tau) * h(\tau) * h(-\tau) = (\eta_c^2 + \sigma_c^2)\,\lambda\rho(\tau) \tag{11-60} $$

$$ \mathrm{Var}\, s(t) = C_{ss}(0) = (\eta_c^2 + \sigma_c^2)\,\lambda \int_{-\infty}^{\infty} h^2(t)\, dt \tag{11-61} $$

The above is the extension of Campbell's theorem to a shot-noise process with random coefficients.

11-3 MODULATION†

Given two real jointly WSS processes $a(t)$ and $b(t)$ with zero mean and a constant $\omega_0$, we form the process

$$ x(t) = a(t)\cos\omega_0 t - b(t)\sin\omega_0 t = r(t)\cos[\omega_0 t + \varphi(t)] \tag{11-62} $$

where

$$ r(t) = \sqrt{a^2(t) + b^2(t)} \qquad \tan\varphi(t) = \frac{b(t)}{a(t)} $$

This process is called modulated with amplitude modulation $r(t)$ and phase modulation $\varphi(t)$.

We shall show that $x(t)$ is WSS iff the processes $a(t)$ and $b(t)$ are such that

$$ R_{aa}(\tau) = R_{bb}(\tau) \qquad R_{ab}(\tau) = -R_{ba}(\tau) \tag{11-63} $$

Proof. Clearly,

$$ E\{x(t)\} = E\{a(t)\}\cos\omega_0 t - E\{b(t)\}\sin\omega_0 t = 0 $$

† A. Papoulis: "Random Modulation: A Review," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-31, 1983.
Furthermore,

$$ x(t + \tau)x(t) = [a(t + \tau)\cos\omega_0(t + \tau) - b(t + \tau)\sin\omega_0(t + \tau)] \times [a(t)\cos\omega_0 t - b(t)\sin\omega_0 t] $$

Multiplying, taking expected values, and using appropriate trigonometric identities, we obtain

$$ \begin{aligned} 2E\{x(t + \tau)x(t)\} = {} & [R_{aa}(\tau) + R_{bb}(\tau)]\cos\omega_0\tau + [R_{ab}(\tau) - R_{ba}(\tau)]\sin\omega_0\tau \\ & + [R_{aa}(\tau) - R_{bb}(\tau)]\cos\omega_0(2t + \tau) \\ & - [R_{ab}(\tau) + R_{ba}(\tau)]\sin\omega_0(2t + \tau) \end{aligned} \tag{11-64} $$

If (11-63) is true, then the above yields

$$ R_{xx}(\tau) = R_{aa}(\tau)\cos\omega_0\tau + R_{ab}(\tau)\sin\omega_0\tau \tag{11-65} $$

Conversely, if $x(t)$ is WSS, then the second and third lines in (11-64) must be independent of $t$. This is possible only if (11-63) is true.

We introduce the "dual" process

$$ y(t) = b(t)\cos\omega_0 t + a(t)\sin\omega_0 t \tag{11-66} $$

This process is also WSS and

$$ R_{yy}(\tau) = R_{xx}(\tau) \qquad R_{xy}(\tau) = -R_{yx}(\tau) \tag{11-67} $$

$$ R_{xy}(\tau) = R_{ab}(\tau)\cos\omega_0\tau - R_{aa}(\tau)\sin\omega_0\tau \tag{11-68} $$

The above follows from (11-64) if we change one or both factors of the product $x(t + \tau)x(t)$ with $y(t + \tau)$ or $y(t)$.

Complex representation. We introduce the processes

$$ w(t) = a(t) + jb(t) = r(t)e^{j\varphi(t)} \qquad z(t) = x(t) + jy(t) = w(t)e^{j\omega_0 t} \tag{11-69} $$

Thus

$$ x(t) = \mathrm{Re}\, z(t) = \mathrm{Re}\,[w(t)e^{j\omega_0 t}] \tag{11-70} $$

and

$$ a(t) + jb(t) = w(t) = z(t)e^{-j\omega_0 t} $$

This yields

$$ a(t) = x(t)\cos\omega_0 t + y(t)\sin\omega_0 t \qquad b(t) = y(t)\cos\omega_0 t - x(t)\sin\omega_0 t \tag{11-71} $$

Correlations and spectra. The autocorrelation of the complex process $w(t)$ equals

$$ R_{ww}(\tau) = E\{[a(t + \tau) + jb(t + \tau)][a(t) - jb(t)]\} $$
Expanding and using (11-63), we obtain

$$ R_{ww}(\tau) = 2R_{aa}(\tau) - 2jR_{ab}(\tau) \tag{11-72} $$

Similarly,

$$ R_{zz}(\tau) = 2R_{xx}(\tau) - 2jR_{xy}(\tau) \tag{11-73} $$

We note, further, that

$$ R_{zz}(\tau) = e^{j\omega_0\tau}\, R_{ww}(\tau) \tag{11-74} $$

FIGURE 11-12
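The role of condition (11-63) can be seen by evaluating the right side of (11-64) at several values of $t$. In the sketch below the correlation functions are illustrative assumptions: $R_{aa}(\tau) = e^{-|\tau|}$, an odd cross-correlation $R_{ab}(\tau) = 0.5\,\tau e^{-|\tau|}$ (so that $R_{ba}(\tau) = R_{ab}(-\tau) = -R_{ab}(\tau)$), and a second choice of $R_{bb}$ that violates (11-63).

```python
import math

def two_rxx(t, tau, w0, Raa, Rbb, Rab, Rba):
    """Right side of (11-64), i.e. 2 E{x(t + tau) x(t)}."""
    return ((Raa(tau) + Rbb(tau)) * math.cos(w0 * tau)
            + (Rab(tau) - Rba(tau)) * math.sin(w0 * tau)
            + (Raa(tau) - Rbb(tau)) * math.cos(w0 * (2 * t + tau))
            - (Rab(tau) + Rba(tau)) * math.sin(w0 * (2 * t + tau)))

def spread_over_t(tau, w0, Raa, Rbb, Rab, Rba):
    """Max minus min of (11-64) over t = 0, 0.1, ..., 1.0; zero means no dependence on t."""
    vals = [two_rxx(0.1 * i, tau, w0, Raa, Rbb, Rab, Rba) for i in range(11)]
    return max(vals) - min(vals)

def Raa(tau):
    return math.exp(-abs(tau))

def Rab(tau):
    return 0.5 * tau * math.exp(-abs(tau))      # odd in tau

def Rba(tau):
    return Rab(-tau)                             # R_ba(tau) = R_ab(-tau) for real processes

def Rbb_unequal(tau):
    return 2.0 * math.exp(-abs(tau))             # violates R_aa = R_bb

tau, w0 = 0.7, 5.0
spread_wss = spread_over_t(tau, w0, Raa, Raa, Rab, Rba)
spread_not_wss = spread_over_t(tau, w0, Raa, Rbb_unequal, Rab, Rba)
rxx_from_11_65 = Raa(tau) * math.cos(w0 * tau) + Rab(tau) * math.sin(w0 * tau)
print(spread_wss, spread_not_wss, rxx_from_11_65)
```

With (11-63) satisfied the spread vanishes and half of (11-64) coincides with (11-65); with unequal autocorrelations the terms at $2\omega_0 t$ survive.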
From the above it follows that

$$ S_{ww}(\omega) = 2S_{aa}(\omega) - 2jS_{ab}(\omega) \qquad S_{zz}(\omega) = 2S_{xx}(\omega) - 2jS_{xy}(\omega) \tag{11-75} $$

$$ S_{zz}(\omega) = S_{ww}(\omega - \omega_0) \tag{11-76} $$

The functions $S_{xx}(\omega)$ and $S_{zz}(\omega)$ are real and positive. Furthermore [see (11-67)]

$$ R_{xy}(-\tau) = -R_{yx}(-\tau) = -R_{xy}(\tau) $$

This leads to the conclusion that the function $-jS_{xy}(\omega) = B_{xy}(\omega)$ is real and (Fig. 11-12a)

$$ |B_{xy}(\omega)| \le S_{xx}(\omega) \qquad B_{xy}(-\omega) = -B_{xy}(\omega) \tag{11-77} $$

And since $S_{xx}(-\omega) = S_{xx}(\omega)$, we conclude from the second equation in (11-75) that

$$ 4S_{xx}(\omega) = S_{zz}(\omega) + S_{zz}(-\omega) \qquad 4jS_{xy}(\omega) = S_{zz}(-\omega) - S_{zz}(\omega) \tag{11-78} $$

Single sideband. If $b(t) = \hat{a}(t)$ is the Hilbert transform of $a(t)$, then [see (10-147)] the constraint (11-63) is satisfied and the first equation in (11-75) yields

$$ S_{ww}(\omega) = 4S_{aa}(\omega)\,U(\omega) $$

because

$$ S_{a\hat{a}}(\omega) = jS_{aa}(\omega)\,\mathrm{sgn}\,\omega $$

The resulting spectra are shown in Fig. 11-12b. We note, in particular, that $S_{xx}(\omega) = 0$ for $|\omega| < \omega_0$.

RICE'S REPRESENTATION. In (11-62) we assumed that the carrier frequency $\omega_0$ and the processes $a(t)$ and $b(t)$ were given. We now consider the converse problem: Given a WSS process $x(t)$ with zero mean, find a constant $\omega_0$ and two processes $a(t)$ and $b(t)$ such that $x(t)$ can be written in the form (11-62). To do so, it suffices to find the constant $\omega_0$ and the dual process $y(t)$ [see (11-71)]. This shows that the representation of $x(t)$ in the form (11-62) is not unique because not only is $\omega_0$ arbitrary, but the process $y(t)$ can also be chosen arbitrarily subject only to the constraint (11-67). The question then arises whether, among all possible representations of $x(t)$, there is one that is optimum. The answer depends, of course, on the optimality criterion. As we shall presently explain, if $y(t)$ equals the Hilbert transform $\hat{x}(t)$ of $x(t)$, then (11-62) is optimum in the sense of minimizing the average rate of variation of the envelope of $x(t)$.

Hilbert transforms. As we know [see (10-147)]

$$ R_{\hat{x}\hat{x}}(\tau) = R_{xx}(\tau) \qquad R_{x\hat{x}}(\tau) = -R_{\hat{x}x}(\tau) \tag{11-79} $$
We can, therefore, use $\hat{x}(t)$ to form the processes

$$ z(t) = x(t) + j\hat{x}(t) = w(t)e^{j\omega_0 t} \qquad w(t) = i(t) + jq(t) = z(t)e^{-j\omega_0 t} \tag{11-80} $$

as in (11-69) where now (Fig. 11-12c)

$$ y(t) = \hat{x}(t) \qquad a(t) = i(t) \qquad b(t) = q(t) $$

Inserting into (11-62), we obtain

$$ x(t) = i(t)\cos\omega_0 t - q(t)\sin\omega_0 t \tag{11-81} $$

This is known as Rice's representation. The process $i(t)$ is called the inphase component and the process $q(t)$ the quadrature component of $x(t)$. Their realization is shown in Fig. 11-13 [see (11-71)]. These processes depend not only on $x(t)$, but also on the choice of the carrier frequency $\omega_0$.

From (10-136) and (11-75) it follows that

$$ S_{zz}(\omega) = 4S_{xx}(\omega)\,U(\omega) \tag{11-82} $$

Bandpass processes. A process $x(t)$ is called bandpass (Fig. 11-12c) if its spectrum $S_{xx}(\omega)$ is 0 outside an interval $(\omega_1, \omega_2)$. It is called narrowband or quasimonochromatic if its bandwidth $\omega_2 - \omega_1$ is small compared with the center frequency. It is called monochromatic if $S_{xx}(\omega)$ is an impulse function. The process $a\cos\omega_0 t + b\sin\omega_0 t$ is monochromatic.

The representations (11-62) or (11-81) hold for an arbitrary $x(t)$. However, they are useful mainly if $x(t)$ is bandpass. In this case, the complex envelope $w(t)$ and the processes $i(t)$ and $q(t)$ are low-pass because

$$ S_{ww}(\omega) = S_{zz}(\omega + \omega_0) \tag{11-83} $$

We shall show that if the process $x(t)$ is bandpass and $\omega_1 + \omega_c < 2\omega_0$, then the inphase component $i(t)$ and the quadrature component $q(t)$ can be obtained as responses of the system of Fig. 11-14a where the LP filters are ideal with cutoff
frequency $\omega_c$ such that

$$ \omega_2 - \omega_0 < \omega_c \qquad \omega_1 - \omega_0 > -\omega_c \tag{11-84} $$

FIGURE 11-14

Proof. It suffices to show that (linearity) the response of the system of Fig. 11-14b equals $w(t)$. Clearly,

$$ 2x(t) = z(t) + z^*(t) \qquad w^*(t) = z^*(t)e^{j\omega_0 t} $$

Hence

$$ 2x(t)e^{-j\omega_0 t} = w(t) + w^*(t)e^{-j2\omega_0 t} $$

The spectra of the processes $w(t)$ and $w^*(t)e^{-j2\omega_0 t}$ equal $S_{ww}(\omega)$ and $S_{ww}(-\omega - 2\omega_0)$ respectively. Under the stated assumptions, the first is in the band of the LP filter $H(\omega)$ and the second outside the band. Therefore, the response of the filter equals $w(t)$.

We note, finally, that if $\omega_0 \le \omega_1$, then $S_{ww}(\omega) = 0$ for $\omega < 0$. In this case, $q(t)$ is the Hilbert transform of $i(t)$. Since $\omega_2 - \omega_1 \le 2\omega_0$, this is possible only if $\omega_2 \le 3\omega_1$. The corresponding spectra are shown in Fig. 11-14c.

Optimum envelope. We are given an arbitrary process $x(t)$ and we wish to determine a constant $\omega_0$ and a process $y(t)$ so that, in the resulting representation (11-62), the complex envelope $w(t)$ of $x(t)$ is smooth in the sense of minimizing $E\{|w'(t)|^2\}$. As we know, the power spectrum of $w'(t)$ equals

$$ \omega^2 S_{ww}(\omega) = \omega^2 S_{zz}(\omega + \omega_0) $$
Our problem, therefore, is to minimize the integral†

$$ M = 2\pi E\{|w'(t)|^2\} = \int_{-\infty}^{\infty} (\omega - \omega_0)^2 S_{zz}(\omega)\, d\omega \tag{11-85} $$

subject to the constraint that $S_{xx}(\omega)$ is specified.

THEOREM. Rice's representation (11-81) is optimum and the optimum carrier frequency $\omega_0$ is the center of gravity $\bar{\omega}_0$ of $S_{xx}(\omega)U(\omega)$.

Proof. Suppose, first, that $S_{zz}(\omega)$ is specified. In this case, $M$ depends only on $\omega_0$. Differentiating the right side of (11-85) with respect to $\omega_0$, we conclude that $M$ is minimum if $\omega_0$ equals the center of gravity

$$ \bar{\omega}_0 = \frac{\displaystyle\int_{-\infty}^{\infty} \omega S_{zz}(\omega)\, d\omega}{\displaystyle\int_{-\infty}^{\infty} S_{zz}(\omega)\, d\omega} = \frac{\displaystyle\int_{-\infty}^{\infty} \omega B_{xy}(\omega)\, d\omega}{\displaystyle\int_{-\infty}^{\infty} S_{xx}(\omega)\, d\omega} \tag{11-86} $$

of $S_{zz}(\omega)$. The second equality above follows from (11-75) and (11-77). Inserting (11-86) into (11-85), we obtain

$$ M = \int_{-\infty}^{\infty} (\omega^2 - \bar{\omega}_0^2)\, S_{zz}(\omega)\, d\omega = 2\int_{-\infty}^{\infty} (\omega^2 - \bar{\omega}_0^2)\, S_{xx}(\omega)\, d\omega \tag{11-87} $$

We wish now to choose $S_{zz}(\omega)$ so as to minimize $M$. Since $S_{xx}(\omega)$ is given, $M$ is minimum if $\bar{\omega}_0$ is maximum. As we see from (11-86), this is the case if $|B_{xy}(\omega)| = S_{xx}(\omega)$ because $|B_{xy}(\omega)| \le S_{xx}(\omega)$. We thus conclude that $-jS_{xy}(\omega) = S_{xx}(\omega)\,\mathrm{sgn}\,\omega$ and (11-75) yields

$$ S_{zz}(\omega) = 4S_{xx}(\omega)\,U(\omega) $$

Instantaneous frequency. With $\varphi(t)$ as in (11-62), the process

$$ \omega_i(t) = \omega_0 + \varphi'(t) \tag{11-88} $$

is called the instantaneous frequency of $x(t)$. Since

$$ z = re^{j(\omega_0 t + \varphi)} = x + jy $$

we have

$$ z'z^* = rr' + jr^2\omega_i = (x' + jy')(x - jy) \tag{11-89} $$

† L. Mandel: "Complex Representation of Optical Fields in Coherence Theory," Journal of the Optical Society of America, vol. 57, 1967. See also N. M. Blachman: Noise and Its Effect on Communication, Krieger Publishing Company, Malabar, FL, 1982.
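This yields $E\{rr'\} = 0$ and

$$ E\{r^2\omega_i\} = \frac{1}{2\pi} \int_{-\infty}^{\infty} \omega S_{zz}(\omega)\, d\omega \tag{11-90} $$

because the cross-power spectrum of $z'$ and $z$ equals $j\omega S_{zz}(\omega)$.

The instantaneous frequency of a process $x(t)$ is not a uniquely defined process because the dual process $y(t)$ is not unique. In Rice's representation $y = \hat{x}$, hence

$$ \omega_i = \frac{x\hat{x}' - x'\hat{x}}{r^2} \qquad r^2 = x^2 + \hat{x}^2 \tag{11-91} $$

In this case [see (11-82) and (11-86)] the optimum carrier frequency $\bar{\omega}_0$ equals the weighted average of $\omega_i$:

$$ \bar{\omega}_0 = \frac{E\{r^2\omega_i\}}{E\{r^2\}} $$

Frequency Modulation

The process

$$ x(t) = \cos[\omega_0 t + \lambda\varphi(t) + \varphi_0] \qquad \varphi(t) = \int_0^t c(\alpha)\, d\alpha \tag{11-92} $$

is FM with instantaneous frequency $\omega_0 + \lambda c(t)$ and modulation index $\lambda$. The corresponding complex processes equal

$$ w(t) = e^{j\lambda\varphi(t)} \qquad z(t) = w(t)e^{j(\omega_0 t + \varphi_0)} \tag{11-93} $$

We shall study their spectral properties.

THEOREM. If the process $c(t)$ is SSS and the RV $\varphi_0$ is independent of $c(t)$ and such that

$$ E\{e^{j\varphi_0}\} = E\{e^{j2\varphi_0}\} = 0 \tag{11-94} $$

then the process $x(t)$ is WSS with zero mean. Furthermore,

$$ R_{xx}(\tau) = \tfrac{1}{2}\,\mathrm{Re}\, R_{zz}(\tau) \qquad R_{ww}(\tau) = E\{w(\tau)\} \tag{11-95} $$

Proof. From (11-94) it follows that $E\{x(t)\} = 0$ because

$$ E\{z(t)\} = e^{j\omega_0 t}\, E\{e^{j\lambda\varphi(t)}\}\, E\{e^{j\varphi_0}\} = 0 $$

Furthermore,

$$ E\{z(t + \tau)z(t)\} = E\{e^{j[\omega_0(2t + \tau) + \lambda\varphi(t + \tau) + \lambda\varphi(t)]}\}\, E\{e^{j2\varphi_0}\} = 0 $$

$$ E\{z(t + \tau)z^*(t)\} = e^{j\omega_0\tau}\, E\left\{\exp\left[j\lambda \int_t^{t + \tau} c(\alpha)\, d\alpha\right]\right\} = e^{j\omega_0\tau}\, E\{w(\tau)\} $$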
The last equality is a consequence of the stationarity of the process $c(t)$. Since $2x(t) = z(t) + z^*(t)$, we conclude from the above that

$$ 4E\{x(t + \tau)x(t)\} = R_{zz}(\tau) + R_{zz}(-\tau) $$

and (11-95) results because $R_{zz}(-\tau) = R_{zz}^*(\tau)$.

Definitions. A process $x(t)$ is phase modulated if the statistics of $\varphi(t)$ are known. In this case, its autocorrelation can simply be found because

$$ E\{w(t)\} = E\{e^{j\lambda\varphi(t)}\} = \Phi_{\varphi}(\lambda, t) \tag{11-96} $$

where $\Phi_{\varphi}(\lambda, t)$ is the characteristic function of $\varphi(t)$.

A process $x(t)$ is frequency modulated if the statistics of $c(t)$ are known. To determine $\Phi_{\varphi}(\lambda, t)$, we must now find the statistics of the integral of $c(t)$. However, in general this is not simple. The normal case is an exception because then $\Phi_{\varphi}(\lambda, t)$ can be expressed in terms of the mean and variance of $\varphi(t)$ and, as we know [see (10-143)]

$$ E\{\varphi(t)\} = \int_0^t E\{c(\alpha)\}\, d\alpha = \eta_c t \qquad E\{\varphi^2(t)\} = 2\int_0^t R_c(\alpha)(t - \alpha)\, d\alpha \tag{11-97} $$

For the determination of the power spectrum $S_{xx}(\omega)$ of $x(t)$, we must find the function $\Phi_{\varphi}(\lambda, \tau)$ and its Fourier transform. In general, this is difficult. However, as the next theorem shows, if $\lambda$ is large, then $S_{xx}(\omega)$ can be expressed directly in terms of the density $f_c(c)$ of $c(t)$.

WOODWARD'S THEOREM.† If the process $c(t)$ is continuous and its density $f_c(c)$ is bounded, then for large $\lambda$:

$$ S_{xx}(\omega) \simeq \frac{\pi}{2\lambda}\left[f_c\!\left(\frac{\omega - \omega_0}{\lambda}\right) + f_c\!\left(\frac{-\omega - \omega_0}{\lambda}\right)\right] \tag{11-98} $$

Proof. If $\tau_0$ is sufficiently small, then $c(t) \simeq c(0)$, and

$$ \varphi(t) = \int_0^t c(\alpha)\, d\alpha \simeq c(0)\,t \qquad |t| < \tau_0 \tag{11-99} $$

Inserting into (11-96), we obtain

$$ E\{w(\tau)\} \simeq E\{e^{j\lambda\tau c(0)}\} = \Phi_c(\lambda\tau) \qquad |\tau| < \tau_0 \tag{11-100} $$

† P. M. Woodward: "The Spectrum of Random Frequency Modulation," Telecommunications Research, Great Malvern, Worcs., England, Memo 666, 1952.
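where

$$ \Phi_c(\mu) = E\{e^{j\mu c(t)}\} $$

is the characteristic function of $c(t)$. From this and (11-95) it follows that

$$ R_{zz}(\tau) \simeq \Phi_c(\lambda\tau)\,e^{j\omega_0\tau} \qquad |\tau| < \tau_0 \tag{11-101} $$

If $\lambda$ is sufficiently large, then $\Phi_c(\lambda\tau) \simeq 0$ for $|\tau| > \tau_0$ because $\Phi_c(\mu) \to 0$ as $\mu \to \infty$. Hence (11-101) is a satisfactory approximation for every $\tau$ in the region where $\Phi_c(\lambda\tau)$ takes significant values. Transforming both sides of (11-101) and using the inversion formula

$$ f_c(c) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \Phi_c(\mu)\,e^{-j\mu c}\, d\mu $$

we obtain

$$ S_{zz}(\omega) = \int_{-\infty}^{\infty} \Phi_c(\lambda\tau)\,e^{j\omega_0\tau}\,e^{-j\omega\tau}\, d\tau = \frac{2\pi}{\lambda}\, f_c\!\left(\frac{\omega - \omega_0}{\lambda}\right) $$

and (11-98) follows from (11-78).

NORMAL PROCESSES. Suppose now that $c(t)$ is normal with zero mean. In this case $\varphi(t)$ is also normal with zero mean. Hence [see (11-97)]

$$ \Phi_{\varphi}(\lambda, \tau) = \exp\left\{-\tfrac{1}{2}\lambda^2\sigma_{\varphi}^2(\tau)\right\} \qquad \sigma_{\varphi}^2(\tau) = 2\int_0^{\tau} R_c(\alpha)(\tau - \alpha)\, d\alpha \tag{11-102} $$

In general, the Fourier transform of $\Phi_{\varphi}(\lambda, \tau)$ is found only numerically. However, as we show next, explicit formulas can be obtained if $\lambda$ is large or small. We introduce the "correlation time" $\tau_c$ of $c(t)$:

$$ \tau_c = \frac{1}{\rho} \int_0^{\infty} R_c(\alpha)\, d\alpha \qquad \rho = R_c(0) \tag{11-103} $$

and we select two constants $\tau_0$ and $\tau_1$ such that

$$ R_c(\tau) \simeq \begin{cases} \rho & |\tau| < \tau_0 \\ 0 & |\tau| > \tau_1 \end{cases} $$

Inserting into (11-102), we obtain (Fig. 11-15)

$$ \sigma_{\varphi}^2(\tau) \simeq \begin{cases} \rho\tau^2 & |\tau| < \tau_0 \\ 2\rho\tau\tau_c & \tau > \tau_1 \end{cases} \qquad R_{ww}(\tau) \simeq \begin{cases} e^{-\rho\lambda^2\tau^2/2} & |\tau| < \tau_0 \\ e^{-\rho\lambda^2\tau_c|\tau|} & |\tau| > \tau_1 \end{cases} \tag{11-104} $$

It is known from the asymptotic properties of Fourier transforms that the behavior of $R_{ww}(\tau)$ for small (large) $\tau$ determines the behaviors of $S_{ww}(\omega)$ for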
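large (small) $\omega$. Since

$$ e^{-\rho\lambda^2\tau^2/2} \leftrightarrow \frac{1}{\lambda}\sqrt{\frac{2\pi}{\rho}}\; e^{-\omega^2/2\rho\lambda^2} \qquad e^{-\rho\lambda^2\tau_c|\tau|} \leftrightarrow \frac{2\rho\tau_c\lambda^2}{\omega^2 + \rho^2\tau_c^2\lambda^4} \tag{11-105} $$

we conclude that $S_{ww}(\omega)$ is lorenzian near the origin and it is asymptotically normal as $\omega \to \infty$. As we show next, these limiting cases give an adequate description of $S_{ww}(\omega)$ for large or small $\lambda$.

FIGURE 11-15

Wideband FM. If $\lambda$ is such that

$$ \rho\lambda^2\tau_0^2 \gg 1 $$

then $R_{ww}(\tau) \simeq 0$ for $|\tau| > \tau_0$. This shows that we can use the upper approximation in (11-104) for every significant value of $\tau$. The resulting spectrum equals

$$ S_{ww}(\omega) \simeq \frac{1}{\lambda}\sqrt{\frac{2\pi}{\rho}}\; e^{-\omega^2/2\rho\lambda^2} = \frac{2\pi}{\lambda}\, f_c\!\left(\frac{\omega}{\lambda}\right) \tag{11-106} $$

in agreement with Woodward's theorem. The last equality in (11-106) follows because $c(t)$ is normal with variance $E\{c^2(t)\} = \rho$.

Narrowband FM. If $\lambda$ is such that

$$ \rho\lambda^2\tau_1\tau_c \ll 1 $$

then $R_{ww}(\tau) \simeq 1$ for $|\tau| < \tau_1$. This shows that we can use the lower approximation in (11-104) for every significant value of $\tau$. Hence

$$ S_{ww}(\omega) \simeq \frac{2\rho\tau_c\lambda^2}{\omega^2 + \rho^2\tau_c^2\lambda^4} \tag{11-107} $$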
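The two limiting forms can be compared with a direct numerical transform of $R_{ww}(\tau) = \exp\{-\tfrac{1}{2}\lambda^2\sigma_\varphi^2(\tau)\}$. The sketch below assumes, purely for illustration, the autocorrelation $R_c(\tau) = \rho\,e^{-|\tau|/\tau_c}$, whose correlation time (11-103) is $\tau_c$ and for which (11-102) gives $\sigma_\varphi^2(\tau) = 2\rho[\tau_c|\tau| - \tau_c^2(1 - e^{-|\tau|/\tau_c})]$. The spectrum at $\omega = 0$ is then evaluated by the trapezoidal rule for a large and a small modulation index; function names and numerical settings are hypothetical.

```python
import math

def sigma_phi_sq(tau, rho, tau_c):
    """(11-102) for the assumed R_c(tau) = rho * exp(-|tau| / tau_c)."""
    tau = abs(tau)
    return 2.0 * rho * (tau_c * tau + tau_c**2 * math.expm1(-tau / tau_c))

def spectrum_ww(omega, lam, rho, tau_c, tau_max, n):
    """Trapezoidal estimate of S_ww(omega) = 2 * int_0^tau_max R_ww(tau) cos(omega tau) dtau."""
    h = tau_max / n

    def integrand(tau):
        return math.exp(-0.5 * lam**2 * sigma_phi_sq(tau, rho, tau_c)) * math.cos(omega * tau)

    total = 0.5 * (integrand(0.0) + integrand(tau_max))
    total += sum(integrand(i * h) for i in range(1, n))
    return 2.0 * h * total

rho, tau_c = 1.0, 1.0

# Wideband case: compare with the normal form (11-106) at omega = 0.
lam_wide = 20.0
wide_numeric = spectrum_ww(0.0, lam_wide, rho, tau_c, tau_max=1.0, n=20000)
wide_normal = math.sqrt(2.0 * math.pi / rho) / lam_wide

# Narrowband case: compare with the lorenzian form (11-107) at omega = 0.
lam_narrow = 0.1
narrow_numeric = spectrum_ww(0.0, lam_narrow, rho, tau_c, tau_max=2000.0, n=200000)
narrow_lorenzian = 2.0 * rho * tau_c * lam_narrow**2 / (rho**2 * tau_c**2 * lam_narrow**4)

print(wide_numeric, wide_normal, narrow_numeric, narrow_lorenzian)
```

In both regimes the numerical value and the limiting formula agree to within a few percent for these settings; the residual difference reflects the approximations in (11-104), not the integration error.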
11-4 CYCLOSTATIONARY PROCESSES†

A process $x(t)$ is called strict-sense cyclostationary (SSCS) with period $T$ if its statistical properties are invariant to a shift of the origin by integral multiples of $T$, or, equivalently, if

$$ F(x_1, \ldots, x_n;\, t_1 + mT, \ldots, t_n + mT) = F(x_1, \ldots, x_n;\, t_1, \ldots, t_n) \tag{11-108} $$

for every integer $m$.

A process $x(t)$ is called wide-sense cyclostationary (WSCS) if

$$ \eta(t + mT) = \eta(t) \qquad R(t_1 + mT,\, t_2 + mT) = R(t_1, t_2) \tag{11-109} $$

for every integer $m$.

It follows from the definition that if $x(t)$ is SSCS, it is also WSCS. The following theorems show the close connection between stationary and cyclostationary processes.

THEOREM 1. If $x(t)$ is an SSCS process and $\boldsymbol{\theta}$ is an RV uniform in the interval $(0, T)$ and independent of $x(t)$, then the process

$$ \bar{x}(t) = x(t - \boldsymbol{\theta}) \tag{11-110} $$

obtained by a random shift of the origin is SSS and its $n$th-order distribution equals

$$ \bar{F}(x_1, \ldots, x_n;\, t_1, \ldots, t_n) = \frac{1}{T} \int_0^T F(x_1, \ldots, x_n;\, t_1 - \alpha, \ldots, t_n - \alpha)\, d\alpha \tag{11-111} $$

Proof. To prove the theorem, it suffices to show that the probability of the event

$$ \mathcal{A} = \{\bar{x}(t_1 + c) \le x_1, \ldots, \bar{x}(t_n + c) \le x_n\} $$

is independent of $c$ and it equals the right side of (11-111). As we know [see (4-62)]

$$ P(\mathcal{A}) = \frac{1}{T} \int_0^T P(\mathcal{A} \mid \boldsymbol{\theta} = \theta)\, d\theta \tag{11-112} $$

Furthermore,

$$ P(\mathcal{A} \mid \boldsymbol{\theta} = \theta) = P\{x(t_1 + c - \theta) \le x_1, \ldots, x(t_n + c - \theta) \le x_n \mid \boldsymbol{\theta} = \theta\} $$

And since $\boldsymbol{\theta}$ is independent of $x(t)$, we conclude that

$$ P(\mathcal{A} \mid \boldsymbol{\theta} = \theta) = F(x_1, \ldots, x_n;\, t_1 + c - \theta, \ldots, t_n + c - \theta) $$

Inserting into (11-112) and using (11-108), we obtain (11-111).

† W. A. Gardner and L. E. Franks: "Characterization of Cyclostationary Random Signal Processes," IEEE Transactions on Information Theory, vol. IT-21, 1975.
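THEOREM 2. If $x(t)$ is a WSCS process, then the shifted process $\bar{x}(t)$ is WSS with mean

$$ \bar{\eta} = \frac{1}{T} \int_0^T \eta(t)\, dt \tag{11-113} $$

and autocorrelation

$$ \bar{R}(\tau) = \frac{1}{T} \int_0^T R(t + \tau, t)\, dt \tag{11-114} $$

Proof. From (7-59) and the independence of $\boldsymbol{\theta}$ from $x(t)$, it follows that

$$ E\{x(t - \boldsymbol{\theta})\} = E\{\eta(t - \boldsymbol{\theta})\} = \frac{1}{T} \int_0^T \eta(t - \theta)\, d\theta $$

and (11-113) results because $\eta(t)$ is periodic. Similarly,

$$ E\{x(t + \tau - \boldsymbol{\theta})\,x(t - \boldsymbol{\theta})\} = E\{R(t + \tau - \boldsymbol{\theta},\, t - \boldsymbol{\theta})\} = \frac{1}{T} \int_0^T R(t + \tau - \theta,\, t - \theta)\, d\theta $$

This yields (11-114) because $R(t + \tau, t)$ is a periodic function of $t$.

Pulse-Amplitude Modulation (PAM)

An important example of a cyclostationary process is the random signal

$$ x(t) = \sum_{n=-\infty}^{\infty} c_n\, h(t - nT) \tag{11-115} $$

where $h(t)$ is a given function with Fourier transform $H(\omega)$ and $c_n$ is a stationary sequence of RVs with autocorrelation $R_c[m] = E\{c_{n+m}c_n\}$ and power spectrum

$$ S_c(e^{j\omega}) = \sum_{m=-\infty}^{\infty} R_c[m]\,e^{-jm\omega} \tag{11-116} $$

THEOREM. The power spectrum $\bar{S}_x(\omega)$ of the shifted process $\bar{x}(t)$ equals

$$ \bar{S}_x(\omega) = \frac{1}{T}\, S_c(e^{j\omega T})\,|H(\omega)|^2 \tag{11-117} $$

Proof. We form the impulse train

$$ z(t) = \sum_{n=-\infty}^{\infty} c_n\,\delta(t - nT) \tag{11-118} $$

Clearly, $z(t)$ is the derivative of the process $w(t)$ of Fig. 11-16:

$$ w(t) = \sum_{n=-\infty}^{\infty} c_n\, U(t - nT) \qquad z(t) = w'(t) \tag{11-119} $$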
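FIGURE 11-16 Pulse-amplitude modulation

The process $w(t)$ is cyclostationary with autocorrelation

$$ R_w(t_1, t_2) = \sum_n \sum_r R_c[n - r]\, U(t_1 - nT)\, U(t_2 - rT) $$

From (10-94) it follows that

$$ R_z(t_1, t_2) = \frac{\partial^2 R_w(t_1, t_2)}{\partial t_1\, \partial t_2} = \sum_n \sum_r R_c[n - r]\,\delta(t_1 - nT)\,\delta(t_2 - rT) $$

This yields

$$ R_z(t + \tau, t) = \sum_{m=-\infty}^{\infty} R_c[m] \sum_{r=-\infty}^{\infty} \delta[t + \tau - (m + r)T]\,\delta(t - rT) \tag{11-120} $$

We shall find first the autocorrelation $\bar{R}_z(\tau)$ and the power spectrum $\bar{S}_z(\omega)$ of the shifted process $\bar{z}(t) = z(t - \boldsymbol{\theta})$. Inserting (11-120) into (11-114) and using the identity

$$ \sum_{r=-\infty}^{\infty} \int_0^T \delta[t + \tau - (m + r)T]\,\delta(t - rT)\, dt = \delta(\tau - mT) $$

we obtain

$$ \bar{R}_z(\tau) = \frac{1}{T} \sum_{m=-\infty}^{\infty} R_c[m]\,\delta(\tau - mT) \tag{11-121} $$

From this it follows that

$$ \bar{S}_z(\omega) = \frac{1}{T} \sum_{m=-\infty}^{\infty} R_c[m]\,e^{-jmT\omega} = \frac{1}{T}\, S_c(e^{j\omega T}) \tag{11-122} $$

The process $x(t)$ is the output of a linear system with input $z(t)$. Thus

$$ x(t) = z(t) * h(t) \qquad \bar{x}(t) = \bar{z}(t) * h(t) $$

Hence [see (11-122) and (10-136)] the power spectrum of the shifted PAM process $\bar{x}(t)$ is given by (11-117).

COROLLARY. If the process $c_n$ is white noise with $S_c(\omega) = q$, then

$$ \bar{S}_x(\omega) = \frac{q}{T}\,|H(\omega)|^2 \qquad \bar{R}_x(\tau) = \frac{q}{T}\, h(\tau) * h(-\tau) \tag{11-123} $$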
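The corollary can be evaluated numerically for a specific pulse. The sketch below assumes a rectangular pulse of unit height on $[0, T)$, an illustrative choice, and computes $(q/T)\,h(\tau) * h(-\tau) = (q/T)\int h(t + \tau)h(t)\,dt$ by the midpoint rule. For this pulse the result should be the triangle $q(1 - |\tau|/T)$ for $|\tau| < T$ and zero otherwise. Function names are hypothetical.

```python
def pulse(t, T):
    """Rectangular pulse: 1 on [0, T), 0 elsewhere (an assumed example of h(t))."""
    return 1.0 if 0.0 <= t < T else 0.0

def averaged_autocorr(tau, T, q=1.0, n=2000):
    """(q/T) * integral of h(t + tau) h(t) dt over the support of h, by the midpoint rule."""
    dt = T / n
    total = sum(pulse((i + 0.5) * dt + tau, T) * pulse((i + 0.5) * dt, T) for i in range(n))
    return (q / T) * total * dt

def triangle(tau, T, q=1.0):
    """Expected closed form q * (1 - |tau|/T) for |tau| < T, and 0 otherwise."""
    return q * max(0.0, 1.0 - abs(tau) / T)

T = 2.0
samples = [(tau, averaged_autocorr(tau, T), triangle(tau, T)) for tau in (-2.5, -1.0, 0.0, 0.5, 1.9)]
for tau, numeric, expected in samples:
    print(tau, numeric, expected)
```

The same function can be reused with any other pulse shape by replacing `pulse`, provided the integration range is widened to cover the support of the new $h(t)$.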
Example 11-4. Suppose that $h(t)$ is a pulse and $c_n$ is a white-noise process taking the values $\pm 1$ with equal probability:

$$h(t) = \begin{cases} 1 & 0 \le t < T \\ 0 & \text{otherwise} \end{cases} \qquad c_n = x(nT) \qquad R_c[m] = \delta[m]$$

The resulting process $x(t)$ is called binary transmission. It is SSCS, taking the values $\pm 1$ in every interval $(nT-T,\,nT)$; the shifted process $\bar x(t) = x(t-\theta)$ is stationary. From (11-117) it follows that

$$\bar S_x(\omega) = \frac{4\sin^2(\omega T/2)}{T\omega^2}$$

because $S_c(z) = 1$. Hence $\bar R_x(\tau)$ is a triangle as in Fig. 11-17.

FIGURE 11-17

11-5 BANDLIMITED PROCESSES AND SAMPLING THEORY

A process $x(t)$ is called bandlimited (abbreviated BL) if it has finite power and its spectrum vanishes for $|\omega| > \sigma$:

$$R(0) < \infty \qquad S(\omega) = 0 \quad |\omega| > \sigma \tag{11-124}$$

In this section we establish various identities involving linear functionals of BL processes. To do so, we express the two sides of each identity as responses of linear systems. The underlying reasoning is based on the following:

THEOREM. Suppose that $w_1(t)$ and $w_2(t)$ are the responses of the systems $T_1(\omega)$ and $T_2(\omega)$ to a BL process $x(t)$ (Fig. 11-18). We shall show that if

$$T_1(\omega) = T_2(\omega) \quad \text{for } |\omega| \le \sigma \tag{11-125}$$

then

$$w_1(t) = w_2(t) \tag{11-126}$$

Proof. The difference $w_1(t) - w_2(t)$ is the response of the system $T_1(\omega) - T_2(\omega)$ to the input $x(t)$. Since $S(\omega) = 0$ for $|\omega| > \sigma$, we conclude from (10-139) and
(11-125) that

$$E\{|w_1(t) - w_2(t)|^2\} = \frac{1}{2\pi}\int_{-\sigma}^{\sigma} S(\omega)\,|T_1(\omega) - T_2(\omega)|^2\,d\omega = 0$$

Hence† $w_1(t) = w_2(t)$.

FIGURE 11-18

Taylor series. If $x(t)$ is BL, then [see (10-121)]

$$R(\tau) = \frac{1}{2\pi}\int_{-\sigma}^{\sigma} S(\omega)\,e^{j\omega\tau}\,d\omega \tag{11-127}$$

In the above, the limits of integration are finite and the area $2\pi R(0)$ of $S(\omega)$ is also finite. We can therefore differentiate under the integral sign:

$$R^{(n)}(\tau) = \frac{1}{2\pi}\int_{-\sigma}^{\sigma} (j\omega)^n S(\omega)\,e^{j\omega\tau}\,d\omega \tag{11-128}$$

This shows that the autocorrelation of a BL process is an entire function; that is, it has derivatives of any order for every $\tau$. From this it follows that $x^{(n)}(t)$ exists for any $n$ (see App. 10A). We maintain that

$$x(t+\tau) = \sum_{n=0}^{\infty} x^{(n)}(t)\,\frac{\tau^n}{n!} \tag{11-129}$$

Proof. We shall prove (11-129) using (11-126). As we know

$$e^{j\omega\tau} = \sum_{n=0}^{\infty} (j\omega)^n\,\frac{\tau^n}{n!} \quad \text{all } \omega \tag{11-130}$$

The processes $x(t+\tau)$ and $x^{(n)}(t)$ are the responses of the systems $e^{j\omega\tau}$ and $(j\omega)^n$ respectively to the input $x(t)$. If, therefore, we use as systems $T_1(\omega)$ and $T_2(\omega)$ in (11-125) the two sides of (11-130), the resulting responses will equal the two sides of (11-129). And since (11-130) is true for all $\omega$, (11-129) follows from (11-126).

† All identities in this section are interpreted in the MS sense.
Bounds. Bandlimitedness is often associated with slow variation. The following is an analytical formulation of this association. If $x(t)$ is BL, then

$$E\{[x(t+\tau) - x(t)]^2\} \le \sigma^2\tau^2 R(0) \tag{11-131}$$

or, equivalently,

$$2[R(0) - R(\tau)] \le \sigma^2\tau^2 R(0) \tag{11-132}$$

Proof. The familiar inequality $|\sin\varphi| \le |\varphi|$ yields

$$1 - \cos\omega\tau = 2\sin^2\frac{\omega\tau}{2} \le \frac{\omega^2\tau^2}{2}$$

Since $S(\omega) \ge 0$, it follows from the above and (10-122) that

$$R(0) - R(\tau) = \frac{1}{2\pi}\int_{-\sigma}^{\sigma} S(\omega)(1 - \cos\omega\tau)\,d\omega \le \frac{1}{2\pi}\int_{-\sigma}^{\sigma} S(\omega)\,\frac{\omega^2\tau^2}{2}\,d\omega \le \frac{\sigma^2\tau^2}{4\pi}\int_{-\sigma}^{\sigma} S(\omega)\,d\omega = \frac{\sigma^2\tau^2}{2}\,R(0)$$

as in (11-132).

Sampling Expansions

The sampling theorem for deterministic signals states that if $f(t) \leftrightarrow F(\omega)$ and $F(\omega) = 0$ for $|\omega| > \sigma$, then the function $f(t)$ can be expressed in terms of its samples $f(nT)$ where $T = \pi/\sigma$ is the Nyquist interval. The resulting expansion applied to the autocorrelation $R(\tau)$ of a BL process $x(t)$ takes the following form:

$$R(\tau) = \sum_{n=-\infty}^{\infty} R(nT)\,\frac{\sin\sigma(\tau-nT)}{\sigma(\tau-nT)} \tag{11-133}$$

We shall establish a similar expansion for the process $x(t)$.

THEOREM. If $x(t)$ is a BL process, then

$$x(t+\tau) = \sum_{n=-\infty}^{\infty} x(t+nT)\,\frac{\sin\sigma(\tau-nT)}{\sigma(\tau-nT)} \qquad T = \frac{\pi}{\sigma} \tag{11-134}$$

for every $t$ and $\tau$. This is a slight extension of (11-133). This extension will permit us to base the proof of the theorem on (11-126).

Proof. We consider the exponential $e^{j\omega\tau}$ as a function of $\omega$, viewing $\tau$ as a parameter, and we expand it into a Fourier series in the interval $(-\sigma \le \omega \le \sigma)$.
The coefficients of this expansion equal

$$a_n = \frac{1}{2\sigma}\int_{-\sigma}^{\sigma} e^{j\omega\tau}\,e^{-jnT\omega}\,d\omega = \frac{\sin\sigma(\tau-nT)}{\sigma(\tau-nT)}$$

Hence

$$e^{j\omega\tau} = \sum_{n=-\infty}^{\infty} e^{jnT\omega}\,\frac{\sin\sigma(\tau-nT)}{\sigma(\tau-nT)} \qquad |\omega| < \sigma \tag{11-135}$$

We denote by $T_1(\omega)$ and $T_2(\omega)$ the left and right side respectively of (11-135). Clearly, $T_1(\omega)$ is a delay line and its response $w_1(t)$ to $x(t)$ equals $x(t+\tau)$. Similarly, the response $w_2(t)$ of $T_2(\omega)$ to $x(t)$ equals the right side of (11-134). Since $T_1(\omega) = T_2(\omega)$ for $|\omega| < \sigma$, (11-134) follows from (11-126).

Past samples. A deterministic BL signal is determined only if all its samples, past and future, are known. This is not necessary for random signals. We show next that a BL process $x(t)$ can be approximated arbitrarily closely by a sum involving only its past samples $x(nT_0)$ provided that $T_0 < T$. We illustrate first with an example.†

Example 11-5. Consider the process

$$\hat x(t) = n\,x(t-T_0) - \binom{n}{2}x(t-2T_0) + \cdots - (-1)^n x(t-nT_0) \tag{11-136}$$

The difference

$$y(t) = x(t) - \hat x(t) = \sum_{k=0}^{n} (-1)^k \binom{n}{k} x(t-kT_0)$$

is the response of the system

$$H(\omega) = \sum_{k=0}^{n} (-1)^k \binom{n}{k} e^{-jkT_0\omega} = (1 - e^{-j\omega T_0})^n$$

with input $x(t)$. Since $|H(\omega)| = |2\sin(\omega T_0/2)|^n$, we conclude from (10-36) that

$$E\{y^2(t)\} = \frac{1}{2\pi}\int_{-\sigma}^{\sigma} S(\omega)\left(2\sin\frac{\omega T_0}{2}\right)^{2n} d\omega \tag{11-137}$$

If $T_0 < \pi/3\sigma$, then $|2\sin(\omega T_0/2)| < 2\sin(\pi/6) = 1$ for $|\omega| < \sigma$. From this it follows that the integrand in (11-137) tends to 0 as $n \to \infty$. Therefore,

$$E\{y^2(t)\} \to 0 \quad \text{and} \quad \hat x(t) \to x(t) \quad \text{as } n \to \infty$$

Note that this holds only if $T_0 < T/3$; furthermore, the coefficients of $\hat x(t)$ tend to $\infty$ as $n \to \infty$.

† L. A. Wainstein and V. Zubakov: Extraction of Signals in Noise, Prentice-Hall, Englewood Cliffs, NJ, 1962.
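The behavior described in Example 11-5 can be sketched numerically. As an illustrative assumption, the code below replaces the BL process by a single complex exponential $e^{j\omega t}$ with $|\omega| < \sigma$, that is, by one spectral component; the function names and the numerical values are hypothetical. For this test signal the prediction error should have magnitude $|2\sin(\omega T_0/2)|^n$.

```python
import cmath
import math


def predict_from_past(x, t, T0, n):
    """x_hat(t) = sum_{k=1}^{n} (-1)^(k+1) C(n, k) x(t - k*T0), cf. (11-136)."""
    return sum((-1) ** (k + 1) * math.comb(n, k) * x(t - k * T0)
               for k in range(1, n + 1))


def prediction_error(omega, T0, n, t=0.3):
    """|x(t) - x_hat(t)| for the test signal x(t) = exp(j*omega*t)."""
    x = lambda u: cmath.exp(1j * omega * u)
    return abs(x(t) - predict_from_past(x, t, T0, n))


sigma = math.pi          # band limit, so the Nyquist interval is T = 1
T0 = 0.3                 # satisfies T0 < pi / (3 * sigma) = 1/3
omega = 0.9 * sigma      # a frequency inside the band
errors = [prediction_error(omega, T0, n) for n in (1, 5, 10, 20)]
```

With these values the entries of `errors` shrink geometrically as $n$ grows, while the binomial weights in `predict_from_past` grow rapidly, consistent with the closing remark of the example. For large $n$ the alternating binomial sum loses floating-point accuracy, so the sketch stops at $n = 20$.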
FIGURE 11-19

We show next that $x(t)$ can be approximated arbitrarily closely by a sum involving only its past samples $x(t - kT_0)$ where $T_0$ is a number smaller than $T$ but otherwise arbitrary.

THEOREM. Given a number $T_0 < T$ and a constant $\varepsilon > 0$, we can find a set of coefficients $a_k$ such that

$$E\{|x(t) - \hat x(t)|^2\} < \varepsilon \qquad \hat x(t) = \sum_{k=1}^{n} a_k\,x(t-kT_0) \tag{11-138}$$

where $n$ is a sufficiently large constant.

Proof. The process $\hat x(t)$ is the response of the system

$$P(\omega) = \sum_{k=1}^{n} a_k\,e^{-jkT_0\omega} \tag{11-139}$$

with input $x(t)$. Hence

$$E\{|x(t) - \hat x(t)|^2\} = \frac{1}{2\pi}\int_{-\sigma}^{\sigma} S(\omega)\,|1 - P(\omega)|^2\,d\omega$$

It suffices, therefore, to find a sum of exponentials with positive exponents only, approximating 1 arbitrarily closely. This cannot be done for every $|\omega| < \sigma_0 = \pi/T_0$ because $P(\omega)$ is periodic with period $2\sigma_0$. We can show, however, that if $\sigma_0 > \sigma$, we can find $P(\omega)$ such that the difference $|1 - P(\omega)|$ can be made arbitrarily small for $|\omega| < \sigma$ as in Fig. 11-19. The proof follows from the Weierstrass approximation theorem and the Fejer-Riesz factorization theorem; the details, however, are not simple.†

† A. Papoulis: "A Note on the Predictability of Band-Limited Processes," Proceedings of the IEEE, vol. 73, no. 8, 1985.

Note that, as in Example 11-5, the coefficients $a_k$ tend to $\infty$ as $\varepsilon \to 0$. This is based on the fact that we cannot find a sum $P(\omega)$ of exponentials as in
(11-139) such that $|1 - P(\omega)| = 0$ for every $\omega$ in an interval. This would violate the Paley-Wiener condition (12-9).

THE PAPOULIS SAMPLING EXPANSION.† The sampling expansion holds only if $T \le \pi/\sigma$. The following theorem states that if we have access to the samples of the outputs $y_1(t), \ldots, y_N(t)$ of $N$ linear systems $H_1(\omega), \ldots, H_N(\omega)$ driven by $x(t)$ (Fig. 11-20), then we can increase the sampling interval from $\pi/\sigma$ to $N\pi/\sigma$.

FIGURE 11-20

We introduce the constants

$$c = \frac{2\sigma}{N} = \frac{2\pi}{\bar T} \qquad \bar T = NT \tag{11-140}$$

and the $N$ functions $P_1(\omega,\tau), \ldots, P_N(\omega,\tau)$ defined as the solutions of the system

$$\begin{aligned}
H_1(\omega)P_1(\omega,\tau) + \cdots + H_N(\omega)P_N(\omega,\tau) &= 1 \\
H_1(\omega+c)P_1(\omega,\tau) + \cdots + H_N(\omega+c)P_N(\omega,\tau) &= e^{jc\tau} \\
&\;\;\vdots \\
H_1(\omega+Nc-c)P_1(\omega,\tau) + \cdots + H_N(\omega+Nc-c)P_N(\omega,\tau) &= e^{j(N-1)c\tau}
\end{aligned} \tag{11-141}$$

In the above, $\omega$ takes all values in the interval $(-\sigma,\,-\sigma+c)$ and $\tau$ is arbitrary.

We next form the $N$ functions

$$p_k(\tau) = \frac{1}{c}\int_{-\sigma}^{-\sigma+c} P_k(\omega,\tau)\,e^{j\omega\tau}\,d\omega \qquad 1 \le k \le N \tag{11-142}$$

† A. Papoulis: "New Results in Sampling Theory," Hawaii Intern. Conf. System Sciences, January. (See also Papoulis, 1968, pp. 132-137.)
THEOREM

$$x(t+\tau) = \sum_{n=-\infty}^{\infty} \left[\,y_1(t+n\bar T)\,p_1(\tau-n\bar T) + \cdots + y_N(t+n\bar T)\,p_N(\tau-n\bar T)\,\right] \tag{11-143}$$

Proof. The process $y_i(t+n\bar T)$ is the response of the system $H_i(\omega)e^{jn\bar T\omega}$ to the input $x(t)$. Therefore, if we use as systems $T_1(\omega)$ and $T_2(\omega)$ in Fig. 11-18 the two sides of the identity

$$e^{j\omega\tau} = H_1(\omega)\sum_{n=-\infty}^{\infty} p_1(\tau-n\bar T)\,e^{jn\omega\bar T} + \cdots + H_N(\omega)\sum_{n=-\infty}^{\infty} p_N(\tau-n\bar T)\,e^{jn\omega\bar T} \tag{11-144}$$

the resulting responses will equal the two sides of (11-143). To prove (11-143), it suffices, therefore, to show that (11-144) is true for every $|\omega| \le \sigma$.

The coefficients $H_k(\omega+kc)$ of the system (11-141) are independent of $\tau$ and the right side consists of periodic functions of $\tau$ with period $\bar T = 2\pi/c$ because $e^{jkc\bar T} = 1$. Hence the solutions $P_k(\omega,\tau)$ are periodic:

$$P_k(\omega,\,\tau-n\bar T) = P_k(\omega,\tau)$$

From the above and (11-142) it follows that

$$p_k(\tau-n\bar T) = \frac{1}{c}\int_{-\sigma}^{-\sigma+c} P_k(\omega,\tau)\,e^{j\omega(\tau-n\bar T)}\,d\omega$$

This shows that if we expand the function $P_k(\omega,\tau)e^{j\omega\tau}$ into a Fourier series in the interval $(-\sigma,\,-\sigma+c)$, the coefficient of the expansion will equal $p_k(\tau-n\bar T)$. Hence

$$P_k(\omega,\tau)\,e^{j\omega\tau} = \sum_{n=-\infty}^{\infty} p_k(\tau-n\bar T)\,e^{jn\omega\bar T} \qquad -\sigma < \omega < -\sigma+c \tag{11-145}$$

Multiplying each of the equations in (11-141) by $e^{j\omega\tau}$ and using (11-145) and the identity

$$e^{jn(\omega+kc)\bar T} = e^{jn\omega\bar T}$$

we conclude that (11-144) is true for every $\omega$ in the interval $(-\sigma,\sigma)$.

Random Sampling

We wish to estimate the Fourier transform $F(\omega)$ of a deterministic signal $f(t)$ in terms of a sum involving the samples of $f(t)$. If we approximate the integral of $f(t)e^{-j\omega t}$ by its Riemann sum, we obtain the estimate

$$F(\omega) \simeq F_*(\omega) \equiv \sum_{n=-\infty}^{\infty} T f(nT)\,e^{-jn\omega T} \tag{11-146}$$

From the Poisson sum formula (11A-1), it follows that $F_*(\omega)$ equals the sum of
$F(\omega)$ and its displacements

$$F_*(\omega) = \sum_{n=-\infty}^{\infty} F(\omega + 2n\sigma) \qquad \sigma = \frac{\pi}{T}$$

Hence $F_*(\omega)$ can be used as the estimate of $F(\omega)$ in the interval $(-\sigma,\sigma)$ only if $F(\omega)$ is negligible outside this interval. The difference $F(\omega) - F_*(\omega)$ is called aliasing error. In the following, we replace in (11-146) the equidistant samples $f(nT)$ of $f(t)$ by its samples $f(t_i)$ at a random set of points $t_i$ and we examine the nature of the resulting error.†

We maintain that if $t_i$ is a Poisson point process with average density $\lambda$, then the sum

$$P(\omega) = \frac{1}{\lambda}\sum_i f(t_i)\,e^{-j\omega t_i} \tag{11-147}$$

is an unbiased estimate of $F(\omega)$. Furthermore, if the energy

$$E = \int_{-\infty}^{\infty} f^2(t)\,dt$$

of $f(t)$ is finite, then $P(\omega) \to F(\omega)$ as $\lambda \to \infty$. To prove the above, it suffices to show that

$$E\{P(\omega)\} = F(\omega) \qquad \sigma^2_{P(\omega)} = \frac{E}{\lambda} \tag{11-148}$$

Proof. Clearly,

$$\int_{-\infty}^{\infty} f(t)\,e^{-j\omega t}\sum_i \delta(t-t_i)\,dt = \sum_i f(t_i)\,e^{-j\omega t_i} \tag{11-149}$$

Comparing with (11-147), we obtain

$$P(\omega) = \frac{1}{\lambda}\int_{-\infty}^{\infty} f(t)\,z(t)\,e^{-j\omega t}\,dt \qquad \text{where} \quad z(t) = \sum_i \delta(t-t_i) \tag{11-150}$$

is a Poisson impulse train as in (10-98) with

$$E\{z(t)\} = \lambda \qquad C_z(t_1,t_2) = \lambda\,\delta(t_1-t_2) \tag{11-151}$$

† E. Masry: "Poisson Sampling and Spectral Estimation of Continuous-Time Processes," IEEE Transactions on Information Theory, vol. IT-24, 1978. See also F. J. Beutler: "Alias Free Randomly Timed Sampling of Stochastic Processes," IEEE Transactions on Information Theory, vol. IT-16, 1970.
Hence

$$E\{P(\omega)\} = \frac{1}{\lambda}\int_{-\infty}^{\infty} f(t)\,E\{z(t)\}\,e^{-j\omega t}\,dt = F(\omega)$$

$$\sigma^2_{P(\omega)} = \frac{1}{\lambda^2}\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} f(t_1)f(t_2)\,e^{-j\omega(t_1-t_2)}\,\lambda\,\delta(t_1-t_2)\,dt_1\,dt_2 = \frac{1}{\lambda}\int_{-\infty}^{\infty} f^2(t)\,dt$$

and (11-148) results.

From (11-148) it follows that, for a satisfactory estimate of $F(\omega)$, $\lambda$ must be such that

$$|F(\omega)| \gg \sqrt{\frac{E}{\lambda}} \tag{11-152}$$

Example 11-6. Suppose that $f(t)$ is a sum of sine waves in the interval $(-a, a)$:

$$f(t) = \sum_k c_k\,e^{j\omega_k t} \qquad |t| < a$$

and it equals 0 for $|t| > a$. In this case,

$$F(\omega) = \sum_k 2c_k\,\frac{\sin a(\omega-\omega_k)}{\omega-\omega_k} \qquad E \simeq 2a\sum_k |c_k|^2 \tag{11-153}$$

where we neglected cross-products in the evaluation of $E$. If $a$ is sufficiently large, then

$$F(\omega_k) \simeq 2a\,c_k$$

This shows that if

$$\sum_i |c_i|^2 \ll 2a\lambda\,|c_k|^2 \quad \text{then} \quad P(\omega_k) \simeq F(\omega_k)$$

Thus with random sampling we can detect line spectra of any frequency even if the average rate $\lambda$ is small, provided that the observation interval $2a$ is large.

11-6 DETERMINISTIC SIGNALS IN NOISE

A central problem in the applications of stochastic processes is the estimation of a signal in the presence of noise. This problem has many aspects (see Chap. 14). In the following, we discuss two cases that lead to simple solutions. In both cases the signal is a deterministic function $f(t)$ and the noise is a random process $\nu(t)$ with zero mean.

The Matched Filter Principle

The following problem is typical in radar: A signal of known form is reflected from a distant target. The received signal is a sum

$$x(t) = f(t) + \nu(t) \qquad E\{\nu(t)\} = 0$$

where $f(t)$ is a shifted and scaled version of the transmitted signal and $\nu(t)$ is a
WSS process with known power spectrum $S(\omega)$. We assume that $f(t)$ is known and we wish to establish its presence and location. To do so, we apply the process $x(t)$ to a linear filter with impulse response $h(t)$ and system function $H(\omega)$. The resulting output $y(t) = x(t) * h(t)$ is a sum

$$y(t) = \int_{-\infty}^{\infty} x(t-\alpha)\,h(\alpha)\,d\alpha = y_f(t) + y_\nu(t) \tag{11-154}$$

where

$$y_f(t) = \int_{-\infty}^{\infty} f(t-\alpha)\,h(\alpha)\,d\alpha = \frac{1}{2\pi}\int_{-\infty}^{\infty} F(\omega)H(\omega)\,e^{j\omega t}\,d\omega \tag{11-155}$$

is the response due to the signal $f(t)$, and $y_\nu(t)$ is a random component with average power

$$E\{y_\nu^2(t)\} = \frac{1}{2\pi}\int_{-\infty}^{\infty} S(\omega)\,|H(\omega)|^2\,d\omega \tag{11-156}$$

Since $y_\nu(t)$ is due to $\nu(t)$ and $E\{\nu(t)\} = 0$, we conclude that $E\{y_\nu(t)\} = 0$ and $E\{y(t)\} = y_f(t)$. Our objective is to find $H(\omega)$ so as to maximize the signal-to-noise ratio

$$r = \frac{|y_f(t_0)|}{\sqrt{E\{y_\nu^2(t_0)\}}} \tag{11-157}$$

at a specific time $t_0$.

White noise. Suppose, first, that $S(\omega) = S_0$. Applying Schwarz's inequality (11B-1) to the second integral in (11-155), we conclude that

$$r^2 \le \frac{\displaystyle\int |F(\omega)e^{j\omega t_0}|^2\,d\omega \int |H(\omega)|^2\,d\omega}{\displaystyle 2\pi S_0\int |H(\omega)|^2\,d\omega} = \frac{E_f}{S_0} \tag{11-158}$$

where

$$E_f = \frac{1}{2\pi}\int |F(\omega)|^2\,d\omega$$

is the energy of $f(t)$. The above is an equality if [see (11B-2)]

$$H(\omega) = kF^*(\omega)\,e^{-j\omega t_0} \qquad h(t) = k f(t_0 - t) \tag{11-159}$$

This determines the optimum $H(\omega)$ within a constant factor $k$. The system so obtained is called the matched filter. The resulting signal-to-noise ratio is maximum and it equals $\sqrt{E_f/S_0}$.

Colored noise. The solution is not so simple if $S(\omega)$ is not a constant. In this case, we use a trick. We first multiply and divide the integrand of (11-155) by
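$\sqrt{S(\omega)}$ and then apply Schwarz's inequality. This yields

$$|2\pi y_f(t_0)|^2 = \left|\int \frac{F(\omega)}{\sqrt{S(\omega)}}\,\sqrt{S(\omega)}\,H(\omega)\,e^{j\omega t_0}\,d\omega\right|^2 \le \int \frac{|F(\omega)|^2}{S(\omega)}\,d\omega \int S(\omega)\,|H(\omega)|^2\,d\omega$$

Inserting into (11-157), we obtain

$$r^2 \le \frac{\displaystyle\int \frac{|F(\omega)|^2}{S(\omega)}\,d\omega \int S(\omega)\,|H(\omega)|^2\,d\omega}{\displaystyle 2\pi\int S(\omega)\,|H(\omega)|^2\,d\omega} = \frac{1}{2\pi}\int \frac{|F(\omega)|^2}{S(\omega)}\,d\omega$$

Equality holds if

$$\sqrt{S(\omega)}\,H(\omega) = k\,\frac{F^*(\omega)\,e^{-j\omega t_0}}{\sqrt{S(\omega)}}$$

Thus the signal-to-noise ratio is maximum if

$$H(\omega) = k\,\frac{F^*(\omega)}{S(\omega)}\,e^{-j\omega t_0} \tag{11-160}$$

Tapped delay line. The matched filter is in general noncausal and difficult to realize. A suboptimal but simpler solution results if $H(\omega)$ is a tapped delay line:

$$H(\omega) = a_0 + a_1 e^{-j\omega T} + \cdots + a_m e^{-jm\omega T} \tag{11-161}$$

In this case,

$$y_f(t_0) = \sum_{i=0}^{m} a_i\,f(t_0 - iT) \qquad y_\nu(t) = \sum_{i=0}^{m} a_i\,\nu(t - iT) \tag{11-162}$$

and our problem is to find the $m+1$ constants $a_i$ so as to maximize the resulting signal-to-noise ratio. It can be shown that (see Prob. 11-26) the unknown constants are the solutions of the system

$$\sum_{i=0}^{m} a_i\,R(nT - iT) = k f(t_0 - nT) \qquad n = 0, \ldots, m \tag{11-163}$$

where $R(\tau)$ is the autocorrelation of $\nu(t)$ and $k$ is an arbitrary constant.

Smoothing

We wish to estimate an unknown signal $f(t)$ in terms of the observed value of the sum $x(t) = f(t) + \nu(t)$. We assume that the noise $\nu(t)$ is white with known autocorrelation $R(t_1,t_2) = q(t_1)\,\delta(t_1 - t_2)$. Our estimator is again the response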
$y(t)$ of the filter $h(t)$:

$$y(t) = \int_{-\infty}^{\infty} x(t-\tau)\,h(\tau)\,d\tau \tag{11-164}$$

The estimator is biased with bias

$$b = y_f(t) - f(t) = \int_{-\infty}^{\infty} f(t-\tau)\,h(\tau)\,d\tau - f(t) \tag{11-165}$$

and variance [see (10-90)]

$$\sigma^2 = E\{y_\nu^2(t)\} = \int_{-\infty}^{\infty} q(t-\tau)\,h^2(\tau)\,d\tau \tag{11-166}$$

Our objective is to find $h(t)$ so as to minimize the MS error

$$e = E\{[y(t) - f(t)]^2\} = b^2 + \sigma^2$$

We shall assume that $h(t)$ is an even positive function of unit area and finite duration:

$$h(-t) = h(t) \qquad \int_{-T}^{T} h(t)\,dt = 1 \qquad h(t) > 0 \tag{11-167}$$

where $T$ is a constant to be determined. If $T$ is small, $y_f(t) \simeq f(t)$, hence the bias is small; however, the variance is large. As $T$ increases, the variance decreases but the bias increases. The determination of the optimum shape and duration of $h(t)$ is in general complicated. We shall develop a simple solution under the assumption that the functions $f(t)$ and $q(t)$ are smooth in the sense that $f(t)$ can be approximated by a parabola and $q(t)$ by a constant in any interval of length $2T$. From this assumption it follows that (Taylor expansion)

$$f(t-\tau) \simeq f(t) - \tau f'(t) + \frac{\tau^2}{2} f''(t) \qquad q(t-\tau) \simeq q(t) \tag{11-168}$$

for $|\tau| < T$. And since the interval of integration in (11-165) and (11-166) is $(-T, T)$, we conclude that

$$b \simeq \frac{f''(t)}{2}\int_{-T}^{T} \tau^2 h(\tau)\,d\tau \qquad \sigma^2 \simeq q(t)\int_{-T}^{T} h^2(\tau)\,d\tau \tag{11-169}$$

because the function $h(t)$ is even and its area equals 1. The resulting MS error equals

$$e \simeq \tfrac{1}{4}M^2[f''(t)]^2 + E\,q(t) \tag{11-170}$$

where $M = \int_{-T}^{T} t^2 h(t)\,dt$ and $E = \int_{-T}^{T} h^2(t)\,dt$.

To separate the effects of the shape and the size of $h(t)$ on the MS error, we introduce the normalized filter

$$w(t) = T\,h(Tt) \tag{11-171}$$
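The trade-off between bias and variance in (11-169) and (11-170) can be made concrete with a small numerical sketch. It assumes a rectangular window $h(t) = 1/2T$ for $|t| < T$, for which $M = T^2/3$ and $E = 1/2T$, and hypothetical values for $f''(t)$ and $q(t)$; the function names are illustrative only. A grid search then locates the duration $T$ that minimizes the MS error.

```python
import math


def ms_error_terms(T, f2, q):
    """Bias, variance and MS error from (11-169)-(11-170) for the
    rectangular window h(t) = 1/(2T), |t| < T (an illustrative choice)."""
    M = T ** 2 / 3.0      # second moment of h: integral of t^2 h(t)
    E = 1.0 / (2.0 * T)   # energy of h: integral of h^2(t)
    bias = 0.5 * f2 * M
    var = q * E
    return bias, var, bias ** 2 + var


def best_duration(f2, q, lo=0.01, hi=5.0, steps=50000):
    """Grid search for the T in [lo, hi] that minimizes the MS error."""
    grid = (lo + (hi - lo) * i / steps for i in range(steps + 1))
    return min(grid, key=lambda T: ms_error_terms(T, f2, q)[2])


f2, q = 2.0, 0.1          # hypothetical values of f''(t) and q(t)
T_best = best_duration(f2, q)
bias, var, e = ms_error_terms(T_best, f2, q)
```

For this window, setting the derivative of the MS error to zero gives $T^5 = 9q/(2[f''(t)]^2)$, which the grid search should reproduce to within the grid spacing. At that $T$ the standard deviation `math.sqrt(var)` comes out close to twice the bias.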
FIGURE 11-21

The function $w(t)$ is of unit area and $w(t) = 0$ for $|t| > 1$. With

$$M_w = \int_{-1}^{1} t^2 w(t)\,dt = \frac{M}{T^2} \qquad E_w = \int_{-1}^{1} w^2(t)\,dt = TE$$

it follows from (11-169) and (11-170) that

$$b \simeq \frac{T^2}{2}\,M_w f''(t) \qquad \sigma^2 \simeq \frac{E_w}{T}\,q(t) \tag{11-172}$$

$$e \simeq \tfrac{1}{4}T^4 M_w^2\,[f''(t)]^2 + \frac{E_w}{T}\,q(t) \tag{11-173}$$

Thus $e$ depends on the shape of $w(t)$ and on the constant $T$.

The two-to-one rule.† We assume first that $w(t)$ is specified. In Fig. 11-21 we plot the bias $b$, the variance $\sigma^2$, and the MS error $e$ as functions of $T$. As $T$ increases, $b$ increases, and $\sigma^2$ decreases. Their sum $e$ is minimum for

$$T = T_m = \left(\frac{E_w\,q(t)}{M_w^2\,[f''(t)]^2}\right)^{1/5} \tag{11-174}$$

Inserting into (11-172), we conclude, omitting the simple algebra, that

$$\sigma = 2b \tag{11-175}$$

Thus if $w(t)$ is of specified shape and $T$ is chosen so as to minimize the MS error $e$, then the standard deviation of the estimation error equals twice its bias.

† A. Papoulis, Two-to-One Rule in Data Smoothing, IEEE Trans. Inf. Theory, September, 1977.

Moving average. A simple estimator of $f(t)$ is the moving average

$$y(t) = \frac{1}{2T}\int_{t-T}^{t+T} x(\tau)\,d\tau$$
of $x(t)$. This is a special case of (11-164) where the normalized filter $w(t)$ equals a pulse of width 2. In this case

$$M_w = \frac{1}{2}\int_{-1}^{1} t^2\,dt = \frac{1}{3} \qquad E_w = \frac{1}{4}\int_{-1}^{1} dt = \frac{1}{2}$$

Inserting into (11-174), we obtain

$$T_m = \left(\frac{9\,q(t)}{2\,[f''(t)]^2}\right)^{1/5} \qquad e = 5b^2 = \frac{5\,q(t)}{8\,T_m} \tag{11-176}$$

The parabolic window. We wish now to determine the shape of $w(t)$ so as to minimize the sum in (11-173). Since $h(t)$ needs to be determined within a scale factor, it suffices to assume that $E_w$ has a constant value. Thus our problem is to find a positive even function $w(t)$ vanishing for $|t| > 1$ and such that its second moment $M_w$ is minimum. It can be shown that (see the footnote on page 388)

$$w(t) = \begin{cases} 0.75\,(1 - t^2) & |t| < 1 \\ 0 & |t| > 1 \end{cases} \tag{11-177}$$

Thus the optimum $w(t)$ is a truncated parabola. With $w(t)$ so determined, the optimum filter is

$$h(t) = \frac{1}{T_m}\,w\!\left(\frac{t}{T_m}\right)$$

where $T_m$ is the constant in (11-174). This filter is, of course, time varying because the scaling factor $T_m$ depends on $t$.

11-7 BISPECTRA AND SYSTEM IDENTIFICATION†

Correlations and spectra are the most extensively used concepts in the applications of stochastic processes. These concepts involve only second-order moments. In certain applications, moments of higher order are also used. In the following, we introduce the transform of the third-order moment

$$R_{xxx}(t_1,t_2,t_3) = E\{x(t_1)\,x(t_2)\,x(t_3)\} \tag{11-178}$$

of a process $x(t)$ and we apply it to the phase problem in system identification. We assume that $x(t)$ is a real SSS process with zero mean.

† D. R. Brillinger: "An Introduction to Polyspectra," Annals of Math. Statistics, vol. 36, 1965. Also C. L. Nikias and M. R. Raghuveer (1987): "Bispectrum Estimation: Digital Processing Framework," IEEE Proceedings, vol. 75.

From the stationarity of $x(t)$ it follows that the function $R_{xxx}(t_1,t_2,t_3)$ depends only on the differences

$$t_1 - t_3 = \mu \qquad t_2 - t_3 = \nu$$
Setting $t_3 = t$ in (11-178) and omitting subscripts, we obtain

$$R(t_1,t_2,t_3) = R(\mu,\nu) = E\{x(t+\mu)\,x(t+\nu)\,x(t)\} \tag{11-179}$$

DEFINITION. The bispectrum $S(u,v)$ of the process $x(t)$ is the two-dimensional Fourier transform of its third-order moment $R(\mu,\nu)$:

$$S(u,v) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} R(\mu,\nu)\,e^{-j(u\mu+v\nu)}\,d\mu\,d\nu \tag{11-180}$$

The function $R(\mu,\nu)$ is real; hence

$$S(-u,-v) = S^*(u,v) \tag{11-181}$$

If $x(t)$ is white noise then

$$R(\mu,\nu) = Q\,\delta(\mu)\,\delta(\nu) \qquad S(u,v) = Q \tag{11-182}$$

Notes

1. The third-order moment of a normal process with zero mean is identically zero. This is a consequence of the fact that the joint density of three jointly normal RVs with zero mean is symmetrical with respect to the origin.

2. The autocorrelation of a white noise process with third-order moment as in (11-182) is an impulse $q\,\delta(\tau)$; in general, however, $q \ne Q$. For example, if $x(t)$ is normal white noise, then $Q = 0$ but $q \ne 0$. Furthermore, whereas $q > 0$ for all nontrivial processes, $Q$ might be negative.

Symmetries. The function $R(t_1,t_2,t_3)$ is invariant to the six permutations of the numbers $t_1$, $t_2$, and $t_3$. For stationary processes,

$$t_1 - t_3 = \mu \qquad t_2 - t_3 = \nu \qquad t_1 - t_2 = \mu - \nu$$

No.   Permutation          Arguments of $R$         Arguments of $S$
1     $t_1, t_2, t_3$      $\mu,\ \nu$              $u,\ v$
2     $t_2, t_1, t_3$      $\nu,\ \mu$              $v,\ u$
3     $t_3, t_1, t_2$      $-\nu,\ \mu-\nu$         $-u-v,\ u$
4     $t_3, t_2, t_1$      $-\mu,\ -\mu+\nu$        $-u-v,\ v$
5     $t_2, t_3, t_1$      $-\mu+\nu,\ -\mu$        $v,\ -u-v$
6     $t_1, t_3, t_2$      $\mu-\nu,\ -\nu$         $u,\ -u-v$

FIGURE 11-22
This yields the identities

$$R(\mu,\nu) = R(\nu,\mu) = R(-\nu,\,\mu-\nu) = R(-\mu,\,-\mu+\nu) = R(-\mu+\nu,\,-\mu) = R(\mu-\nu,\,-\nu) \tag{11-183}$$

Hence if we know the function $R(\mu,\nu)$ in any one of the six regions of Fig. 11-22, we can determine it everywhere.

From (11-180) and (11-183) it follows that

$$S(u,v) = S(v,u) = S(-u-v,\,u) = S(-u-v,\,v) = S(v,\,-u-v) = S(u,\,-u-v) \tag{11-184}$$

Combining with (11-181), we conclude that if we know $S(u,v)$ in any one of the 12 regions of Fig. 11-22, we can determine it everywhere.

Linear Systems

We have shown in (10-110) that if $x(t)$ is the input to a linear system, the third-order moment of the resulting output $y(t)$ equals

$$R_{yyy}(t_1,t_2,t_3) = \int\!\!\int\!\!\int_{-\infty}^{\infty} R_{xxx}(t_1-\alpha,\,t_2-\beta,\,t_3-\gamma)\,h(\alpha)h(\beta)h(\gamma)\,d\alpha\,d\beta\,d\gamma \tag{11-185}$$

For stationary processes, $R_{xxx}(t_1-\alpha,\,t_2-\beta,\,t_3-\gamma) = R_{xxx}(\mu+\gamma-\alpha,\,\nu+\gamma-\beta)$; hence

$$R_{yyy}(\mu,\nu) = \int\!\!\int\!\!\int_{-\infty}^{\infty} R_{xxx}(\mu+\gamma-\alpha,\,\nu+\gamma-\beta)\,h(\alpha)h(\beta)h(\gamma)\,d\alpha\,d\beta\,d\gamma \tag{11-186}$$

Using this relationship, we shall express the bispectrum $S_{yyy}(u,v)$ of $y(t)$ in terms of the bispectrum $S_{xxx}(u,v)$ of $x(t)$.

THEOREM

$$S_{yyy}(u,v) = S_{xxx}(u,v)\,H(u)\,H(v)\,H^*(u+v) \tag{11-187}$$

Proof. Taking transforms of both sides of (11-186) and using the identity

$$\int\!\!\int_{-\infty}^{\infty} R_{xxx}(\mu+\gamma-\alpha,\,\nu+\gamma-\beta)\,e^{-j(u\mu+v\nu)}\,d\mu\,d\nu = S_{xxx}(u,v)\,e^{ju(\gamma-\alpha)}\,e^{jv(\gamma-\beta)}$$

we obtain

$$S_{yyy}(u,v) = S_{xxx}(u,v)\int\!\!\int\!\!\int_{-\infty}^{\infty} e^{j(u+v)\gamma}\,e^{-j(u\alpha+v\beta)}\,h(\alpha)h(\beta)h(\gamma)\,d\alpha\,d\beta\,d\gamma$$

Expressing the above integral as a product of three one-dimensional integrals, we obtain (11-187).

Example 11-7. Using (11-187), we shall determine the bispectrum of the shot noise

$$s(t) = \sum_i h(t - t_i) = z(t) * h(t) \qquad z(t) = \sum_i \delta(t - t_i)$$

where $t_i$ is a Poisson point process with average density $\lambda$.
To do so, we form the centered impulse train $\tilde z(t) = z(t) - \lambda$ and the centered shot noise $\tilde s(t) = \tilde z(t) * h(t)$. As we know (see Prob. 11-28)

$$R_{\tilde z\tilde z\tilde z}(\mu,\nu) = \lambda\,\delta(\mu)\,\delta(\nu) \qquad \text{hence} \qquad S_{\tilde z\tilde z\tilde z}(u,v) = \lambda$$

From this it follows that

$$S_{\tilde s\tilde s\tilde s}(u,v) = \lambda\,H(u)\,H(v)\,H^*(u+v)$$

and since $S_{\tilde s\tilde s}(\omega) = \lambda\,|H(\omega)|^2$, we conclude from Prob. 11-27 with $c = E\{s(t)\} = \lambda H(0)$ that

$$S_{sss}(u,v) = \lambda H(u)H(v)H^*(u+v) + 2\pi\lambda^2 H(0)\left[\,|H(u)|^2\delta(v) + |H(v)|^2\delta(u) + |H(u)|^2\delta(u+v)\,\right] + 4\pi^2\lambda^3 H^3(0)\,\delta(u)\,\delta(v)$$

System Identification

A linear system is specified terminally in terms of its system function

$$H(\omega) = A(\omega)\,e^{j\varphi(\omega)}$$

System identification is the problem of determining $H(\omega)$. This problem is central in system theory and it has been investigated extensively. In the following, we apply the notion of spectra and polyspectra in the determination of $A(\omega)$ and $\varphi(\omega)$.

Spectra. Suppose that the input to the system $H(\omega)$ is a WSS process $x(t)$ with power spectrum $S_{xx}(\omega)$. As we know,

$$S_{xy}(\omega) = S_{xx}(\omega)\,H^*(\omega) \tag{11-188}$$

This relationship expresses $H(\omega)$ in terms of the spectra $S_{xx}(\omega)$ and $S_{xy}(\omega)$ or, equivalently, in terms of the second-order moments $R_{xx}(\tau)$ and $R_{xy}(\tau)$. The problem of estimating these functions is considered in Chap. 13. In a number of applications, we cannot estimate $R_{xy}(\tau)$ either because we do not have access to the input $x(t)$ of the system or because we cannot form the product $x(t+\tau)y(t)$ in real time. In such cases, an alternative method is used based on the assumption that $x(t)$ is white noise. With this assumption (10-136) yields

$$S_{yy}(\omega) = S_{xx}(\omega)\,|H(\omega)|^2 = q\,A^2(\omega) \tag{11-189}$$

This relationship determines the amplitude $A(\omega)$ of $H(\omega)$ in terms of $S_{yy}(\omega)$ within a constant factor. It involves, however, only the estimation of the power spectrum $S_{yy}(\omega)$ of the output of the system. If the system is minimum phase (see page 40), then $H(\omega)$ is completely determined from (11-189) because, then, $\varphi(\omega)$ can be expressed in terms of $A(\omega)$. In general, however, this is not the case.
The phase of an arbitrary system cannot be determined in terms of second-order moment of its input. It can, however, be determined if the third-order moment of $y(t)$ is known.

Phase determination. We assume that $x(t)$ is an SSS white-noise process with $S_{xxx}(u,v) = Q$. Inserting into (11-187), we obtain

$$S_{yyy}(u,v) = Q\,H(u)\,H(v)\,H^*(u+v) \tag{11-190}$$
The function $S_{yyy}(u,v)$ is, in general, complex:

$$S_{yyy}(u,v) = B(u,v)\,e^{j\theta(u,v)} \tag{11-191}$$

Inserting (11-191) into (11-190) and equating amplitudes and phases, we obtain

$$B(u,v) = Q\,A(u)\,A(v)\,A(u+v) \tag{11-192}$$

$$\theta(u,v) = \varphi(u) + \varphi(v) - \varphi(u+v) \tag{11-193}$$

We shall use these equations to express $A(\omega)$ in terms of $B(u,v)$ and $\varphi(\omega)$ in terms of $\theta(u,v)$. Setting $v = 0$ in (11-192), we obtain

$$Q\,A^2(\omega) = \frac{1}{A(0)}\,B(\omega,0) \qquad A^3(0) = \frac{1}{Q}\,B(0,0) \tag{11-194}$$

Since $Q$ is in general unknown, $A(\omega)$ can be determined only within a constant factor. The phase $\varphi(\omega)$ can be determined only within a linear term because if it satisfies (11-193), so does the sum $\varphi(\omega) + c\omega$ for any $c$. We can assume therefore that $\varphi'(0) = 0$. To find $\varphi(\omega)$, we differentiate (11-193) with respect to $v$ and we set $v = 0$. This yields

$$\theta_v(u,0) = -\varphi'(u) \qquad \varphi(\omega) = -\int_0^{\omega} \theta_v(u,0)\,du \tag{11-195}$$

where $\theta_v(u,v) = \partial\theta(u,v)/\partial v$. The above is the solution of (11-193).

In a numerical evaluation of $\varphi(\omega)$, we proceed as follows: Clearly, $\theta(u,0) = \varphi(u) + \varphi(0) - \varphi(u) = \varphi(0) = 0$ for every $u$. From this it follows that

$$\theta_v(u,0) = \lim \frac{1}{\Delta}\,\theta(u,\Delta) \quad \text{as } \Delta \to 0$$

Hence $\theta_v(u,0) \simeq \theta(u,\Delta)/\Delta$ for sufficiently small $\Delta$. Inserting into (11-195), we obtain the approximations

$$\varphi(\omega) \simeq -\frac{1}{\Delta}\int_0^{\omega} \theta(u,\Delta)\,du \qquad \varphi(n\Delta) \simeq -\sum_{k=1}^{n-1} \theta(k\Delta,\Delta) \tag{11-196}$$

This is the solution of the digital version

$$\theta(k\Delta,\,r\Delta) = \varphi(k\Delta) + \varphi(r\Delta) - \varphi(k\Delta + r\Delta) \tag{11-197}$$

of (11-193) where $(k\Delta, r\Delta)$ are points in the sector I of Fig. 11-22. As we see from (11-196), $\varphi(n\Delta)$ is determined in terms of the values $\theta(k\Delta,\Delta)$ of $\theta(u,\Delta)$ on the horizontal line $v = \Delta$. Hence the system (11-197) is overdetermined. This is used to improve the estimate of $\varphi(\omega)$ if $\theta(u,v)$ is not known exactly but it is estimated in terms of a single sample of $y(t)$.† The corresponding problem of spectral estimation is considered in Chap. 13.

† T. Matsuoka and T. J. Ulrych: "Phase Estimation Using the Bispectrum," IEEE Proceedings, vol. 72, 1984.
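The recursion in (11-196) can be tried on a synthetic phase. The sketch below is an illustration under stated assumptions: the "true" phase $\varphi(\omega) = 0.4\,\omega^3$ is a hypothetical odd function with $\varphi'(0) = 0$, the bispectrum phase is generated exactly from (11-193) rather than estimated from data, and the function names are not from the text.

```python
def bispectrum_phase(phi, u, v):
    """theta(u, v) = phi(u) + phi(v) - phi(u + v), cf. (11-193)."""
    return phi(u) + phi(v) - phi(u + v)


def recover_phase(theta, delta, n_max):
    """phi(n*delta) ~ -sum_{k=1}^{n-1} theta(k*delta, delta), cf. (11-196).

    Returns [phi_hat(0), phi_hat(delta), ..., phi_hat(n_max*delta)];
    n_max is assumed to be at least 1.
    """
    estimates = [0.0, 0.0]        # empty sums for n = 0 and n = 1
    running = 0.0
    for n in range(2, n_max + 1):
        running += theta((n - 1) * delta, delta)
        estimates.append(-running)
    return estimates


true_phi = lambda w: 0.4 * w ** 3     # hypothetical phase with phi'(0) = 0
delta = 0.01
theta = lambda u, v: bispectrum_phase(true_phi, u, v)
phi_hat = recover_phase(theta, delta, 100)
```

Summing (11-197) with $r = 1$ over $k = 1, \ldots, n-1$ shows that the recovered values equal $\varphi(n\Delta) - n\varphi(\Delta)$ exactly, so the estimate differs from the true phase only by a linear term. That term is small here because $\varphi(\Delta)$ is of order $\Delta^3$.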
Note. If the bispectrum $S(u,v)$ of a process $x(t)$ equals the right side of (11-190) and $H(\omega) = 0$ for $|\omega| > \sigma$, then

$$S(u,v) = 0 \quad \text{for } |u| > \sigma \ \text{ or } \ |v| > \sigma \ \text{ or } \ |u+v| > \sigma$$

Thus, $S(u,v) = 0$ outside the hexagon of Fig. 11-23a. From this and the symmetries of Fig. 11-22 it follows that $S(u,v)$ is uniquely determined in terms of its values in the triangle $OAB$ of Fig. 11-23a.

Digital processes. The preceding concepts can be readily extended to digital processes. We cite only the definition of bispectra.

Given an SSS digital process $x[n]$, we form its third-order moment

$$R[k,r] = E\{x[n+k]\,x[n+r]\,x[n]\} \tag{11-198}$$

The bispectrum of $x[n]$ is the two-dimensional DFT of $R[k,r]$:

$$S(u,v) = \sum_{k=-\infty}^{\infty}\sum_{r=-\infty}^{\infty} R[k,r]\,e^{-j(uk+vr)} \tag{11-199}$$

This function is doubly periodic with period $2\pi$:

$$S(u+2\pi m,\,v+2\pi n) = S(u,v) \tag{11-200}$$

It is therefore determined in terms of its values in the square $|u| \le \pi$, $|v| \le \pi$ of Fig. 11-23b. Furthermore, it has the 12 symmetries of Fig. 11-22.

Suppose finally that the process $x[n]$ equals the samples $x(nT)$ of an analog process $x(t)$ with bispectrum $S_a(u,v)$. If $S_a(u,v)$ equals the right side of (11-190), then, in the square of Fig. 11-23b,

$$S(u,v) = 0 \quad \text{for } |u| > \pi \ \text{ or } \ |v| > \pi \ \text{ or } \ |u+v| > \pi$$
From this it follows that in this case, $S(u,v)$ is uniquely determined in terms of its values in the triangle $OAB$ of Fig. 11-23b.

APPENDIX 11A THE POISSON SUM FORMULA

If

$$F(u) = \int_{-\infty}^{\infty} f(x)\,e^{-jux}\,dx$$

is the Fourier transform of $f(x)$ then for any $c$

$$\sum_{n=-\infty}^{\infty} f(x+nc) = \frac{1}{c}\sum_{n=-\infty}^{\infty} F(nu_0)\,e^{jnu_0x} \qquad u_0 = \frac{2\pi}{c} \tag{11A-1}$$

Proof. Clearly

$$\sum_{n=-\infty}^{\infty} \delta(x+nc) = \frac{1}{c}\sum_{n=-\infty}^{\infty} e^{jnu_0x} \tag{11A-2}$$

because the left side is periodic and its Fourier series coefficients equal

$$\frac{1}{c}\int_{-c/2}^{c/2} \delta(x)\,e^{-jnu_0x}\,dx = \frac{1}{c}$$

Furthermore, $\delta(x+nc) * f(x) = f(x+nc)$ and

$$e^{jnu_0x} * f(x) = \int_{-\infty}^{\infty} e^{jnu_0(x-\alpha)} f(\alpha)\,d\alpha = e^{jnu_0x}\,F(nu_0)$$

Convolving both sides of (11A-2) with $f(x)$ and using the above, we obtain (11A-1).

APPENDIX 11B THE SCHWARZ INEQUALITY

We shall show that

$$\left|\int_a^b f(x)\,g(x)\,dx\right|^2 \le \int_a^b |f(x)|^2\,dx \int_a^b |g(x)|^2\,dx \tag{11B-1}$$

with equality iff

$$f(x) = k\,g^*(x) \tag{11B-2}$$
Proof. Clearly

$$\left|\int_a^b f(x)\,g(x)\,dx\right| \le \int_a^b |f(x)|\,|g(x)|\,dx$$

Equality holds only if the product $f(x)g(x)$ is real. This is the case if the angles of $f(x)$ and $g(x)$ are opposite as in (11B-2). It suffices, therefore, to assume that the functions $f(x)$ and $g(x)$ are real. The quadratic

$$I(z) = \int_a^b [f(x) - z\,g(x)]^2\,dx = z^2\int_a^b g^2(x)\,dx - 2z\int_a^b f(x)\,g(x)\,dx + \int_a^b f^2(x)\,dx$$

is nonnegative for every real $z$. Hence, its discriminant cannot be positive. This yields (11B-1). If the discriminant of $I(z)$ is zero, then $I(z)$ has a real (double) root $z = k$. This shows that $I(k) = 0$ and (11B-2) follows.

PROBLEMS

11-1. Find the first-order characteristic function (a) of a Poisson process, and (b) of a Wiener process.
Answer: (a) $e^{\lambda t(e^{j\omega}-1)}$; (b) $e^{-\alpha t\omega^2/2}$.

11-2. (Two-dimensional random walk). The coordinates $x(t)$ and $y(t)$ of a moving object are two independent random-walk processes with the same $s$ and $T$ as in Fig. 11-1a. Show that if $z(t) = \sqrt{x^2(t) + y^2(t)}$ is the distance of the object from the origin and $t \gg T$, then for $z$ of the order of $\sqrt{\alpha t}$:

$$f_z(z,t) \simeq \frac{z}{\alpha t}\,e^{-z^2/2\alpha t}\,U(z) \qquad \alpha = \frac{s^2}{T}$$

11-3. In the circuit of Fig. P11-3, $n_e(t)$ is the voltage due to thermal noise. Show that

$$S_v(\omega) = \frac{2kTR}{(1-\omega^2LC)^2 + \omega^2R^2C^2} \qquad S_i(\omega) = \frac{2kTR}{R^2 + \omega^2L^2}$$

and verify Nyquist's theorems (11-27) and (11-30).

FIGURE P11-3

11-4. A particle in free motion satisfies the equation

$$m\,x''(t) + f\,x'(t) = F(t) \qquad S_F(\omega) = 2kTf$$
Show that if $x(0) = x'(0) = 0$, then

$$E\{x^2(t)\} = 2D^2\left(t - \frac{3}{4\alpha} + \frac{1}{\alpha}\,e^{-2\alpha t} - \frac{1}{4\alpha}\,e^{-4\alpha t}\right)$$

where $D^2 = kT/f$ and $\alpha = f/2m$.
Hint: Use (10-90) with

$$h(t) = \frac{1}{f}\left(1 - e^{-2\alpha t}\right)U(t) \qquad q(t) = 2kTf\,U(t)$$

11-5. The position of a particle in underdamped harmonic motion is a normal process with autocorrelation as in (11-12). Show that its conditional density assuming $x(0) = x_0$ and $x'(0) = v(0) = v_0$ equals

$$f_x(x \mid x_0, v_0) = \frac{1}{\sqrt{2\pi P}}\,e^{-(x - ax_0 - bv_0)^2/2P}$$

Find the constants $a$, $b$, and $P$.

11-6. Given a Wiener process $w(t)$ with parameter $\alpha$, we form the processes

$$x(t) = w(t^2) \qquad y(t) = w^2(t) \qquad z(t) = |w(t)|$$

Show that $x(t)$ is normal with zero mean. Furthermore, if $t_1 < t_2$, then

$$R_x(t_1,t_2) = \alpha t_1^2 \qquad R_y(t_1,t_2) = \alpha^2 t_1(2t_1 + t_2)$$

$$R_z(t_1,t_2) = \frac{2\alpha}{\pi}\sqrt{t_1t_2}\,(\cos\theta + \theta\sin\theta) \qquad \sin\theta = \sqrt{\frac{t_1}{t_2}}$$

11-7. The process $s(t)$ is shot noise with $\lambda = 3$ as in (11-45) where $h(t) = 2$ for $0 \le t \le 10$ and $h(t) = 0$ otherwise. Find $E\{s(t)\}$, $E\{s^2(t)\}$, and $P\{s(7) = 0\}$.

11-8. The input to a real system $H(\omega)$ is a WSS process $x(t)$ and the output equals $y(t)$. Show that if

$$R_{xx}(\tau) = R_{yy}(\tau) \qquad R_{xy}(-\tau) = -R_{xy}(\tau)$$

as in (11-67), then $H(\omega) = jB(\omega)$ where $B(\omega)$ is a function taking only the values $+1$ and $-1$.
Special case: If $y(t) = \check x(t)$, then $B(\omega) = -\operatorname{sgn}\omega$.

11-9. Show that if $\check x(t)$ is the Hilbert transform of $x(t)$ and

$$i(t) = x(t)\cos\omega_0t + \check x(t)\sin\omega_0t \qquad q(t) = \check x(t)\cos\omega_0t - x(t)\sin\omega_0t$$

then (Fig. P11-9) … where $S_w(\omega) = 4S_x(\omega + \omega_0)\,U(\omega + \omega_0)$.
11-10. Show that if $w(t)$ and $w_\tau(t)$ are the complex envelopes of the processes $x(t)$ and $x(t-\tau)$ respectively, then $w_\tau(t) = w(t-\tau)\,e^{-j\omega_0\tau}$.

11-11. Show that if $w(t)$ is the optimum complex envelope of $x(t)$ [see (11-85)], then

$$E\{|w'(t)|^2\} = -2\left[R''_{xx}(0) + \omega_0^2 R_{xx}(0)\right]$$

11-12. Show that if the process $x(t)\cos\omega t + y(t)\sin\omega t$ is normal and WSS, then its statistical properties are determined in terms of the variance of the process $z(t) = x(t) + jy(t)$.

11-13. Show that if $\theta$ is an RV uniform in the interval $(0, T)$ and $f(t)$ is a periodic function with period $T$, then the process $x(t) = f(t - \theta)$ is stationary and …

11-14. Show that if

$$\varepsilon_N(t) = x(t) - \sum_{n=-N}^{N} x(nT)\,\frac{\sin\sigma(t-nT)}{\sigma(t-nT)} \qquad \sigma = \frac{\pi}{T}$$

then

$$E\{\varepsilon_N^2(t)\} = \frac{1}{2\pi}\int_{-\infty}^{\infty} S(\omega)\left|\,e^{j\omega t} - \sum_{n=-N}^{N} \frac{\sin\sigma(t-nT)}{\sigma(t-nT)}\,e^{jnT\omega}\right|^2 d\omega$$

and if $S(\omega) = 0$ for $|\omega| > \sigma$, then $E\{\varepsilon_N^2(t)\} \to 0$ as $N \to \infty$.

11-15. Show that if $x(t)$ is BL as in (11-124), then† for $|\tau| < \pi/\sigma$:

$$\frac{2\tau^2}{\pi^2}\,|R''(0)| \le R(0) - R(\tau) \le \frac{\tau^2}{2}\,|R''(0)|$$

$$E\{[x(t+\tau) - x(t)]^2\} \ge \frac{4\tau^2}{\pi^2}\,E\{[x'(t)]^2\}$$

Hint: If $0 < \varphi < \pi/2$ then $2\varphi/\pi < \sin\varphi < \varphi$.

11-16. A WSS process $x(t)$ is BL as in (11-124) and its samples $x(n\pi/\sigma)$ are uncorrelated. Find $S_x(\omega)$ if $E\{x(t)\} = \eta$ and $E\{x^2(t)\} = I$.

11-17. Find the power spectrum $S(\omega)$ of a process $x(t)$ if $S(\omega) = 0$ for $|\omega| > \pi$ and

$$E\{x(n+m)\,x(n)\} = N\delta[m]$$

† A. Papoulis: "An Estimation of the Variation of a Bandlimited Process," IEEE, PGIT, 1984.
11-18. Show that if $S(\omega) = 0$ for $|\omega| > \sigma$, then
$$R(\tau) \ge R(0)\cos\sigma\tau \qquad \text{for } |\tau| < \pi/2\sigma$$

11-19. Show that if $x(t)$ is BL as in (11-124) and $a = 2\pi/\sigma$, then
$$x(t) = 4\sin^2\frac{\sigma t}{2}\sum_{n=-\infty}^{\infty}\left[\frac{x(na)}{(\sigma t - 2n\pi)^2} + \frac{x'(na)}{\sigma(\sigma t - 2n\pi)}\right]$$
Hint: Use (11-143) with $N = 2$, $H_1(\omega) = 1$, $H_2(\omega) = j\omega$.

11-20. Find the mean and the variance of $P(\omega_0)$ if $t_i$ is a Poisson point process and

11-21. Given a WSS process $x(t)$ and a set of Poisson points $t_i$ with average density $\lambda$, we form the sum
$$X_c(\omega) = \sum_{|t_i| < c} x(t_i)e^{-j\omega t_i}$$
Show that if $E\{x(t)\} = 0$ and $R_x(\tau) \to 0$ as $|\tau| \to \infty$, then for large $c$,
$$E\{|X_c(\omega)|^2\} = 2cS_x(\omega) + \frac{2c}{\lambda}R_x(0)$$

11-22. We are given the data $x(t) = f(t) + n(t)$ where $R_n(\tau) = N\delta(\tau)$ and $E\{n(t)\} = 0$. We wish to estimate the integral
$$g(t) = \int_0^t f(\alpha)\,d\alpha$$
knowing that $g(T) = 0$. Show that if we use as the estimate of $g(t)$ the process $w(t) = z(t) - z(T)t/T$ where
$$z(t) = \int_0^t x(\alpha)\,d\alpha$$
then
$$E\{w(t)\} = g(t) \qquad \sigma_w^2 = Nt\left(1 - \frac{t}{T}\right)$$

11-23. (Cauchy inequality) Show that
$$\left|\sum_i a_ib_i\right|^2 \le \sum_i |a_i|^2\sum_i |b_i|^2$$
with equality iff $a_i = kb_i^*$.

11-24. The input to a system $H(z)$ is the sum $x[n] = f[n] + v[n]$ where $f[n]$ is a known sequence with $z$ transform $F(z)$. We wish to find $H(z)$ such that the ratio $y_f^2[0]/E\{y_v^2[n]\}$ of the output $y[n] = y_f[n] + y_v[n]$ is maximum. Show that (a) if $v[n]$ is white noise, then $H(z) = kF(z^{-1})$, and (b) if $H(z)$ is an FIR filter, that is, if $H(z) = a_0 + a_1z^{-1} + \cdots + a_Nz^{-N}$, then its weights $a_m$ are the solutions of the system
$$\sum_{m=0}^{N} R_v[n - m]\,a_m = kf[-n] \qquad n = 0, \ldots, N$$
11-25. If $R_n(\tau) = N\delta(\tau)$ and
$$x(t) = A\cos\omega_0 t + n(t) \qquad H(\omega) = \frac{1}{\alpha + j\omega} \qquad y(t) = B\cos(\omega_0 t + \varphi) + y_n(t)$$
where $y_n(t)$ is the component of the output $y(t)$ due to $n(t)$, find the value of $\alpha$ that maximizes the signal-to-noise ratio
$$\frac{B^2}{E\{y_n^2(t)\}}$$
Answer: $\alpha = \omega_0$.

11-26. In the detection problem of page 386, we apply the process $x(t) = f(t) + v(t)$ to the tapped delay line (11-161). Show that: (a) The S/N ratio $r$ is maximum if the coefficients $a_i$ satisfy (11-162); (b) the maximum $r$ equals $\sqrt{y_f(t_0)/k}$.

11-27. Given an SSS process $x(t)$ with zero mean, power spectrum $S(\omega)$, and bispectrum $S(u, v)$, we form the process $y(t) = x(t) + c$. Show that
$$S_{yyy}(u,v) = S(u,v) + 2\pi c\left[S(u)\delta(v) + S(v)\delta(u) + S(u)\delta(u + v)\right] + 4\pi^2c^3\delta(u)\delta(v)$$

11-28. Given a Poisson process $x(t)$, we form its centered process $\tilde{x}(t) = x(t) - \lambda t$ and the centered Poisson impulses
$$\tilde{z}(t) = \frac{d\tilde{x}(t)}{dt}$$
Show that
$$E\{\tilde{x}(t_1)\tilde{x}(t_2)\tilde{x}(t_3)\} = \lambda\min(t_1, t_2, t_3)$$
$$E\{\tilde{z}(t_1)\tilde{z}(t_2)\tilde{z}(t_3)\} = \lambda\,\delta(t_1 - t_2)\,\delta(t_1 - t_3)$$
Hint: Use (10-94) and the identity
$$\min(t_1,t_2,t_3) = t_1U(t_2 - t_1)U(t_3 - t_1) + t_2U(t_1 - t_2)U(t_3 - t_2) + t_3U(t_1 - t_3)U(t_2 - t_3)$$
CHAPTER 12
SPECTRAL REPRESENTATION

12-1 FACTORIZATION AND INNOVATIONS

In this section, we consider the problem of representing a real WSS process $x(t)$ as the response of a minimum-phase system $L(s)$ with input a white-noise process $i(t)$. The term minimum-phase has the following meaning: The system $L(s)$ is causal and its impulse response $l(t)$ has finite energy; the system $\Gamma(s) = 1/L(s)$ is causal and its impulse response $\gamma(t)$ has finite energy. Thus a system $L(s)$ is minimum-phase if the functions $L(s)$ and $1/L(s)$ are analytic in the right-hand plane $\operatorname{Re} s > 0$. A process $x(t)$ that can be so represented will be called regular. From the definition it follows that $x(t)$ is a regular process if it is linearly equivalent with a white-noise process $i(t)$ in the sense that (see Fig. 12-1)
$$i(t) = \int_0^\infty \gamma(\alpha)x(t - \alpha)\,d\alpha \qquad R_{ii}(\tau) = \delta(\tau) \tag{12-1}$$
$$x(t) = \int_0^\infty l(\alpha)i(t - \alpha)\,d\alpha \qquad E\{x^2(t)\} = \int_0^\infty l^2(t)\,dt < \infty \tag{12-2}$$
The last equality follows from (10-91). The above shows that the power spectrum $S(s)$ of a regular process can be written as a product
$$S(s) = L(s)L(-s) \qquad S(\omega) = |L(j\omega)|^2 \tag{12-3}$$
where $L(s)$ is a minimum-phase function uniquely determined in terms of $S(\omega)$. The function $L(s)$ will be called the innovations filter of $x(t)$ and its inverse $\Gamma(s)$
the whitening filter of $x(t)$. The process $i(t)$ will be called the innovations of $x(t)$. It is the output of the whitening filter $\Gamma(s)$ with input $x(t)$.

FIGURE 12-1

The problem of determining the function $L(s)$ can be phrased as follows: Given a positive even function $S(\omega)$ of finite area, find a minimum-phase function $L(s)$ such that $|L(j\omega)|^2 = S(\omega)$. It can be shown that this problem has a solution if $S(\omega)$ satisfies the Paley-Wiener condition†
$$\int_{-\infty}^{\infty}\frac{|\ln S(\omega)|}{1 + \omega^2}\,d\omega < \infty \tag{12-4}$$
This condition is not satisfied if $S(\omega)$ consists of lines, or, more generally, if it is bandlimited. As we show later, processes with such spectra are predictable. In general, the problem of factoring $S(\omega)$ as in (12-3) is not simple. In the following, we discuss an important special case.

†N. Wiener, R. E. A. C. Paley: Fourier Transforms in the Complex Domain, American Mathematical Society Colloquium Publications, 1934 (see also Papoulis, 1962).

Rational spectra. A rational spectrum is the ratio of two polynomials of $\omega^2$ because $S(-\omega) = S(\omega)$:
$$S(\omega) = \frac{A(\omega^2)}{B(\omega^2)} \qquad S(s) = \frac{A(-s^2)}{B(-s^2)}$$
This shows that if $s_i$ is a root (zero or pole) of $S(s)$, $-s_i$ is also a root. Furthermore, all roots are either real or complex conjugate. From this it follows that the roots of $S(s)$ are symmetrical with respect to the $j\omega$ axis (Fig. 12-2a). Hence they can be separated into two groups: The “left” group consists of all roots $s_i$ with $\operatorname{Re} s_i < 0$, and the “right” group consists of all roots with $\operatorname{Re} s_i > 0$. The minimum-phase factor $L(s)$ of $S(s)$ is a ratio of two polynomials formed with the left roots of $S(s)$:
$$S(s) = \frac{N(s)N(-s)}{D(s)D(-s)} \qquad L(s) = \frac{N(s)}{D(s)}$$

Example 12-1. If $S(\omega) = N/(\alpha^2 + \omega^2)$ then
$$S(s) = \frac{N}{\alpha^2 - s^2} = \frac{N}{(\alpha + s)(\alpha - s)} \qquad L(s) = \frac{\sqrt{N}}{\alpha + s}$$
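The left/right splitting of the roots can be illustrated numerically. The sketch below keeps only the left-half-plane roots, fixes the gain so that $L^2(0) = S(0)$, and checks $|L(j\omega)|^2 = S(\omega)$ for the spectrum of Example 12-1. The function name `min_phase_factor`, the normalization through $S(0)$ (which assumes that $S(s)$ has no root at the origin), and the values $N = 4$, $\alpha = 2$ are assumptions made for this illustration only.

```python
import math

def min_phase_factor(zeros, poles, s_at_zero):
    """Return L(s) built from the left-half-plane roots of S(s).

    zeros, poles: all roots of S(s), left and right groups together.
    s_at_zero:    the value S(0) > 0; the gain is chosen so that L(0)^2 = S(0).
    """
    left_zeros = [z for z in zeros if z.real < 0]
    left_poles = [p for p in poles if p.real < 0]

    def ratio(s):
        num = 1.0 + 0j
        for z in left_zeros:
            num *= (s - z)
        den = 1.0 + 0j
        for p in left_poles:
            den *= (s - p)
        return num / den

    gain = math.sqrt(s_at_zero) / abs(ratio(0j))
    return lambda s: gain * ratio(s)

# Example 12-1 with the assumed values N = 4, alpha = 2: S(w) = 4 / (4 + w^2).
N, ALPHA = 4.0, 2.0
L = min_phase_factor(zeros=[], poles=[-ALPHA, ALPHA], s_at_zero=N / ALPHA**2)

def S(w):
    return N / (ALPHA**2 + w**2)

# Largest deviation of |L(jw)|^2 from S(w) on a small frequency grid.
max_error = max(abs(abs(L(1j * w))**2 - S(w)) for w in (0.0, 0.5, 1.0, 3.0, 10.0))
```

Here `L` reduces to $\sqrt{N}/(\alpha + s)$, so `max_error` is at the level of rounding error.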
FIGURE 12-2

Example 12-2. If $S(\omega) = (49 + 25\omega^2)/(\omega^4 + 10\omega^2 + 9)$ then
$$S(s) = \frac{49 - 25s^2}{(1 - s^2)(9 - s^2)} \qquad L(s) = \frac{7 + 5s}{(1 + s)(3 + s)}$$

Example 12-3. If $S(\omega) = 25/(\omega^4 + 1)$ then
$$S(s) = \frac{25}{s^4 + 1} = \frac{25}{(s^2 + \sqrt{2}s + 1)(s^2 - \sqrt{2}s + 1)} \qquad L(s) = \frac{5}{s^2 + \sqrt{2}s + 1}$$

Digital Processes

A digital system is minimum-phase if its system function $L(z)$ and its inverse $\Gamma(z) = 1/L(z)$ are analytic in the exterior $|z| > 1$ of the unit circle. A real WSS digital process $x[n]$ is regular if its spectrum $S(z)$ can be written as a product
$$S(z) = L(z)L(1/z) \qquad S(e^{j\omega}) = |L(e^{j\omega})|^2 \tag{12-6}$$
Denoting by $l[n]$ and $\gamma[n]$ respectively the delta responses of $L(z)$ and $\Gamma(z)$, we conclude that a regular process $x[n]$ is linearly equivalent with a white-noise process $i[n]$ (see Fig. 12-3):
$$i[n] = \sum_{k=0}^{\infty}\gamma[k]x[n - k] \qquad R_{ii}[m] = \delta[m] \tag{12-7}$$
$$x[n] = \sum_{k=0}^{\infty}l[k]i[n - k] \qquad E\{x^2[n]\} = \sum_{k=0}^{\infty}l^2[k] < \infty \tag{12-8}$$
The process $i[n]$ is the innovations of $x[n]$ and the function $L(z)$ its innovations filter. The whitening filter of $x[n]$ is the function $\Gamma(z) = 1/L(z)$.

It can be shown that the power spectrum $S(e^{j\omega})$ of a process $x[n]$ can be factored as in (12-6) if it satisfies the Paley-Wiener condition
$$\int_{-\pi}^{\pi}|\ln S(\omega)|\,d\omega < \infty \tag{12-9}$$
FIGURE 12-3

Rational spectra. The power spectrum $S(e^{j\omega})$ of a real process is a function of $\cos\omega = (e^{j\omega} + e^{-j\omega})/2$ [see (10-180)]. From this it follows that $S(z)$ is a function of $z + 1/z$. If, therefore, $z_i$ is a root of $S(z)$, $1/z_i$ is also a root. We thus conclude that the roots of $S(z)$ are symmetrical with respect to the unit circle (Fig. 12-3); hence they can be separated into two groups: The “inside” group consists of all roots $z_i$ such that $|z_i| < 1$ and the “outside” group consists of all roots such that $|z_i| > 1$. The minimum-phase factor $L(z)$ of $S(z)$ is a ratio of two polynomials consisting of the inside roots of $S(z)$:
$$S(z) = \frac{N(z)N(1/z)}{D(z)D(1/z)} \qquad L(z) = \frac{N(z)}{D(z)} \qquad L^2(1) = S(1)$$

Example 12-4. If $S(\omega) = (5 - 4\cos\omega)/(10 - 6\cos\omega)$ then
$$S(z) = \frac{5 - 2(z + z^{-1})}{10 - 3(z + z^{-1})} = \frac{2(z - 1/2)(z - 2)}{3(z - 1/3)(z - 3)} \qquad L(z) = \frac{2z - 1}{3z - 1}$$

12-2 FINITE-ORDER SYSTEMS AND STATE VARIABLES

In this section, we consider systems specified in terms of differential equations or recursion equations. As a preparation, we review briefly the meaning of finite-order systems and state variables starting with the analog case. The systems under consideration are multiterminal with $m$ inputs $x_i(t)$ and $r$ outputs $y_j(t)$ forming the column vectors $X(t) = [x_i(t)]$ and $Y(t) = [y_j(t)]$ as in (10-113).

At a particular time $t = t_0$, the output $Y(t)$ of a system is in general specified only if the input $X(t)$ is known for every $t$. Thus, to determine $Y(t)$ for
$t > t_0$, we must know $X(t)$ for $t > t_0$ and for $t < t_0$. For a certain class of systems, this is not necessary. The values of $Y(t)$ for $t > t_0$ are completely specified if we know $X(t)$ for $t > t_0$ and, in addition, the values of a finite number of parameters. These parameters specify the “state” of the system at time $t = t_0$ in the sense that their values determine the effect of the past $t < t_0$ of $X(t)$ on the future $t > t_0$ of $Y(t)$. The values of these parameters depend on $t_0$; they are, therefore, functions $z_i(t)$ of $t$. These functions are called state variables. The number $n$ of state variables is called the order of the system. The vector
$$Z(t) = [z_i(t)] \qquad i = 1, \ldots, n$$
is called the state vector; this vector is not unique. We shall say that the system is in zero state at $t = t_0$ if $Z(t_0) = 0$.

We shall consider here only linear, time-invariant, real, causal systems. Such systems are specified in terms of the following equations:
$$\frac{dZ(t)}{dt} = AZ(t) + BX(t) \tag{12-10a}$$
$$Y(t) = CZ(t) + DX(t) \tag{12-10b}$$
In the above, $A$, $B$, $C$, and $D$ are matrices with real constant elements, of order $n \times n$, $n \times m$, $r \times n$, and $r \times m$ respectively. In Fig. 12-4 we show a block diagram of the system $S$ specified terminally in terms of these equations. It consists of a dynamic system $S_1$ with input $U(t) = BX(t)$ and output $Z(t)$, and of three memoryless systems (multipliers).

FIGURE 12-4

If the input $X(t)$ of the system $S$ is specified for every $t$, or, if $X(t) = 0$ for $t < 0$ and the system is in zero state at $t = 0$, then the response $Y(t)$ of $S$ for $t > 0$ equals
$$Y(t) = \int_0^\infty H(\alpha)X(t - \alpha)\,d\alpha \tag{12-11}$$
where $H(t)$ is the impulse response matrix of $S$. This follows from (10-78) and the fact that $H(t) = 0$ for $t < 0$ (causality assumption).
We shall determine the matrix $H(t)$ starting with the system $S_1$. As we see from (12-10a), the output $Z(t)$ of this system satisfies the equation
$$\frac{dZ(t)}{dt} - AZ(t) = U(t) \tag{12-12}$$
The impulse response of the system $S_1$ is an $n \times n$ matrix $\Phi(t) = [\varphi_{ji}(t)]$ called the transition matrix of $S$. The function $\varphi_{ji}(t)$ equals the value of the $j$th state variable $z_j(t)$ when the $i$th element $u_i(t)$ of the input $U(t)$ of $S_1$ equals $\delta(t)$ and all other elements are 0. From this it follows that [see (10-115)]
$$Z(t) = \int_0^\infty \Phi(\alpha)U(t - \alpha)\,d\alpha = \int_0^\infty \Phi(\alpha)BX(t - \alpha)\,d\alpha \tag{12-13}$$
Inserting into (12-10b), we obtain
$$Y(t) = \int_0^\infty C\Phi(\alpha)BX(t - \alpha)\,d\alpha + DX(t) = \int_0^\infty \left[C\Phi(\alpha)BX(t - \alpha) + \delta(\alpha)DX(t - \alpha)\right]d\alpha \tag{12-14}$$
where $\delta(t)$ is the (scalar) impulse function. Comparing with (12-11), we conclude that the impulse response matrix of the system $S$ equals
$$H(t) = C\Phi(t)B + \delta(t)D \tag{12-15}$$
From the definition of $\Phi(t)$ it follows that
$$\frac{d\Phi(t)}{dt} - A\Phi(t) = \delta(t)1_n \tag{12-16}$$
where $1_n$ is the identity matrix of order $n$. The Laplace transform $\Phi(s)$ of $\Phi(t)$ is the system function of the system $S_1$. Taking transforms of both sides of (12-16), we obtain
$$s\Phi(s) - A\Phi(s) = 1_n \qquad \Phi(s) = (s1_n - A)^{-1} \tag{12-17}$$
Hence
$$\Phi(t) = e^{At} \qquad t > 0 \tag{12-18}$$
This is a direct generalization of the scalar case; however, the determination of the elements $\varphi_{ji}(t)$ of $\Phi(t)$ is not trivial. Each element is a sum of exponentials of the form
$$\varphi_{ji}(t) = \sum_k p_{ji,k}(t)e^{s_kt} \qquad t > 0$$
where $s_k$ are the eigenvalues of the matrix $A$ and $p_{ji,k}(t)$ are polynomials in $t$ of degree less than the multiplicity of $s_k$. There are several methods for determining these polynomials. For small $n$, it is simplest to replace (12-16) by $n$ systems of $n$ scalar equations.
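As a small illustration of (12-18), the sketch below sums the Taylor series of $e^{At}$ for an assumed $2 \times 2$ matrix with simple eigenvalues $s_1 = -1$ and $s_2 = -2$, and compares the result with the sum-of-exponentials form of the elements $\varphi_{ji}(t)$. The matrix, the helper names, and the fixed number of series terms are assumptions chosen for this example; a plain Taylor sum is adequate only when $\|At\|$ is small.

```python
import math

def mat_mul(P, Q):
    """Product of two square matrices stored as lists of rows."""
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(A, t, terms=60):
    """Phi(t) = exp(A t), summed from the Taylor series."""
    n = len(A)
    phi = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in phi]            # holds (A t)^k / k!
    At = [[a * t for a in row] for row in A]
    for k in range(1, terms):
        term = mat_mul(term, At)
        term = [[x / k for x in row] for row in term]
        phi = [[phi[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return phi

# Assumed example: eigenvalues s1 = -1 and s2 = -2, both simple.
A = [[0.0, 1.0], [-2.0, -3.0]]

def phi_closed_form(t):
    """exp(A t) = e^{-t} (A + 2*1) - e^{-2t} (A + 1) for the matrix above."""
    e1, e2 = math.exp(-t), math.exp(-2.0 * t)
    return [[2 * e1 - e2, e1 - e2],
            [-2 * e1 + 2 * e2, -e1 + 2 * e2]]

t0 = 0.7
numeric = transition_matrix(A, t0)
exact = phi_closed_form(t0)
max_error = max(abs(numeric[i][j] - exact[i][j]) for i in range(2) for j in range(2))
```

Because both eigenvalues are simple, every polynomial $p_{ji,k}(t)$ is a constant here, which is why the closed form contains only $e^{-t}$ and $e^{-2t}$.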
Inserting $\Phi(t)$ into (12-15), we obtain
$$H(t) = Ce^{At}B + \delta(t)D \qquad H(s) = C(s1_n - A)^{-1}B + D \tag{12-19}$$
Suppose now that the input to the system $S$ is a WSS process $X(t)$. We shall comment briefly on the spectral properties of the resulting output, limiting the discussion to the state vector $Z(t)$. The system $S_1$ is a special case of $S$ obtained with $B = C = 1_n$ and $D = 0$. In this case, $Z(t) = Y(t)$ and
$$\frac{dY(t)}{dt} - AY(t) = X(t) \qquad H(s) = (s1_n - A)^{-1} \tag{12-20}$$
Inserting into (10-157), we conclude that
$$S_{xy}(s) = S_{xx}(s)(-s1_n - A)^{-1} \qquad S_{yx}(s) = (s1_n - A^t)^{-1}S_{xx}(s)$$
$$S_{yy}(s) = (s1_n - A^t)^{-1}S_{xx}(s)(-s1_n - A)^{-1} \tag{12-21}$$

Differential equations. The equation
$$y^{(n)}(t) + a_1y^{(n-1)}(t) + \cdots + a_ny(t) = x(t) \tag{12-22}$$
specifies a system $S$ with input $x(t)$ and output $y(t)$. This system is of finite order because $y(t)$ is determined for $t > 0$ in terms of the values of $x(t)$ for $t \ge 0$ and the initial conditions
$$y(0), y'(0), \ldots, y^{(n-1)}(0)$$
It is, in fact, a special case of the system of Fig. 12-4 if we set $m = r = 1$:
$$z_1(t) = y(t) \qquad z_2(t) = y'(t) \qquad \cdots \qquad z_n(t) = y^{(n-1)}(t)$$
$$A = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & & & & \vdots \\ -a_n & -a_{n-1} & -a_{n-2} & \cdots & -a_1 \end{bmatrix} \qquad B = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix} \qquad C = \begin{bmatrix} 1 & 0 & \cdots & 0 \end{bmatrix}$$
and $D = 0$. Inserting the above into (12-19), we conclude after some effort that
$$H(s) = \frac{1}{s^n + a_1s^{n-1} + \cdots + a_n}$$
This result can be derived simply from (12-22).

Multiplying both sides of (12-22) by $x(t - \tau)$ and $y(t + \tau)$, we obtain
$$R_{yx}^{(n)}(\tau) + a_1R_{yx}^{(n-1)}(\tau) + \cdots + a_nR_{yx}(\tau) = R_{xx}(\tau) \tag{12-23}$$
$$R_{yy}^{(n)}(\tau) + a_1R_{yy}^{(n-1)}(\tau) + \cdots + a_nR_{yy}(\tau) = R_{xy}(\tau) \tag{12-24}$$
for all $\tau$. This is a special case of (10-133).
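The step from (12-19) to $H(s) = 1/(s^n + a_1s^{n-1} + \cdots + a_n)$ can be checked numerically without carrying out the matrix inversion by hand. The sketch below builds the companion-form matrices of (12-22), solves $(s1_n - A)Z = B$ at one complex frequency, and compares $CZ$ with the direct formula. The coefficients, the test point $s$, and all function names are assumptions made for this illustration.

```python
def solve_linear(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting (complex entries)."""
    n = len(M)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    z = [0j] * n
    for i in reversed(range(n)):
        acc = aug[i][n] - sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = acc / aug[i][i]
    return z

def companion_system(a):
    """A, B, C for y^(n) + a[0] y^(n-1) + ... + a[n-1] y = x, with z_k = y^(k-1)."""
    n = len(a)
    A = [[1.0 if j == i + 1 else 0.0 for j in range(n)] for i in range(n - 1)]
    A.append([-a[n - 1 - j] for j in range(n)])   # last row: -a_n, ..., -a_1
    B = [0.0] * (n - 1) + [1.0]
    C = [1.0] + [0.0] * (n - 1)
    return A, B, C

def system_function(a, s):
    """H(s) = C (s 1_n - A)^(-1) B, with D = 0."""
    A, B, C = companion_system(a)
    n = len(a)
    M = [[(s if i == j else 0.0) - A[i][j] for j in range(n)] for i in range(n)]
    z = solve_linear(M, B)
    return sum(C[i] * z[i] for i in range(n))

# Assumed example: y''' + 6 y'' + 11 y' + 6 y = x, evaluated at s = 1 + 2j.
a = [6.0, 11.0, 6.0]
s = 1.0 + 2.0j
from_state_space = system_function(a, s)
direct = 1.0 / (s**3 + a[0] * s**2 + a[1] * s + a[2])
```

The two values agree to rounding error, as (12-19) requires for this choice of $A$, $B$, $C$, and $D$.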
Finite-order processes. We shall say that a process $x(t)$ is of finite order if its innovations filter $L(s)$ is a rational function of $s$:
$$S(s) = L(s)L(-s) \qquad L(s) = \frac{b_0s^m + b_1s^{m-1} + \cdots + b_m}{s^n + a_1s^{n-1} + \cdots + a_n} = \frac{N(s)}{D(s)} \tag{12-25}$$
where $N(s)$ and $D(s)$ are two Hurwitz polynomials. The process $x(t)$ is the response of the filter $L(s)$ with input the white-noise process $i(t)$:
$$x^{(n)}(t) + a_1x^{(n-1)}(t) + \cdots + a_nx(t) = b_0i^{(m)}(t) + \cdots + b_mi(t) \tag{12-26}$$
The past $x(t - \tau)$ of $x(t)$ depends only on the past of $i(t)$; hence it is orthogonal to the right side of (12-26) for every $\tau > 0$. From this it follows as in (12-24) that
$$R^{(n)}(\tau) + a_1R^{(n-1)}(\tau) + \cdots + a_nR(\tau) = 0 \qquad \tau > 0 \tag{12-27}$$
Assuming that the roots $s_i$ of $D(s)$ are simple, we conclude from (12-27) that
$$R(\tau) = \sum_{i=1}^{n}\alpha_ie^{s_i\tau} \qquad \tau > 0$$
The coefficients $\alpha_i$ can be determined from the initial value theorem. Alternatively, to find $R(\tau)$, we expand $S(s)$ into partial fractions:
$$S(s) = \sum_{i=1}^{n}\frac{\alpha_i}{s - s_i} + \sum_{i=1}^{n}\frac{\alpha_i}{-s - s_i} = S^+(s) + S^-(s) \tag{12-28}$$
The first sum is the transform of the causal part $R^+(\tau) = R(\tau)U(\tau)$ of $R(\tau)$ and the second of its anticausal part $R^-(\tau) = R(\tau)U(-\tau)$. Since $R(-\tau) = R(\tau)$, this yields
$$R(\tau) = R^+(|\tau|) = \sum_{i=1}^{n}\alpha_ie^{s_i|\tau|} \tag{12-29}$$

Example 12-5. If $L(s) = 1/(s + \alpha)$, then
$$S(s) = \frac{1}{(s + \alpha)(-s + \alpha)} = \frac{1/2\alpha}{s + \alpha} + \frac{1/2\alpha}{-s + \alpha}$$
Hence $R(\tau) = (1/2\alpha)e^{-\alpha|\tau|}$.

Example 12-6. The differential equation
$$x''(t) + 3x'(t) + 2x(t) = i(t) \qquad R_{ii}(\tau) = \delta(\tau)$$
specifies a process $x(t)$ with autocorrelation $R(\tau)$. From (12-27) it follows that
$$R''(\tau) + 3R'(\tau) + 2R(\tau) = 0 \qquad \text{hence} \qquad R(\tau) = c_1e^{-\tau} + c_2e^{-2\tau}$$
for $\tau > 0$. To find the constants $c_1$ and $c_2$, we shall determine $R(0)$ and $R'(0)$. Clearly,
$$S(s) = \frac{1}{(s^2 + 3s + 2)(s^2 - 3s + 2)} = \frac{s/12 + 1/4}{s^2 + 3s + 2} + \frac{-s/12 + 1/4}{s^2 - 3s + 2}$$
The first fraction on the right is the transform of $R^+(\tau)$; hence
$$R^+(0^+) = \lim_{s \to \infty}sS^+(s) = \frac{1}{12} = c_1 + c_2 = R(0)$$
Similarly,
$$R'(0^+) = \lim_{s \to \infty}s\left(sS^+(s) - \frac{1}{12}\right) = 0 = -c_1 - 2c_2$$
This yields $R(\tau) = \frac{1}{6}e^{-|\tau|} - \frac{1}{12}e^{-2|\tau|}$.

Note finally that $R(\tau)$ can be expressed in terms of the impulse response $l(t)$ of the innovations filter $L(s)$:
$$R(\tau) = l(\tau) * l(-\tau) = \int_0^\infty l(|\tau| + \alpha)l(\alpha)\,d\alpha \tag{12-30}$$

Digital Systems

The digital version of the system of Fig. 12-4 is a finite-order system $S$ specified by the equations:
$$Z[k + 1] = AZ[k] + BX[k] \tag{12-31a}$$
$$Y[k] = CZ[k] + DX[k] \tag{12-31b}$$
where $k$ is the discrete time, $X[k]$ the input vector, $Y[k]$ the output vector, and $Z[k]$ the state vector. The system is stable if the eigenvalues $z_i$ of the $n \times n$ matrix $A$ are such that $|z_i| < 1$. The preceding results can be readily extended to digital systems. Note, in particular, that the system function of $S$ is the $z$ transform
$$H(z) = C(z1_n - A)^{-1}B + D \tag{12-32}$$
of the delta response matrix
$$H[k] = C\Phi[k]B + \delta[k]D \qquad k \ge 0 \tag{12-33}$$
We shall discuss in some detail scalar systems driven by white noise. This material is used in Sec. 13-3.

Finite-order processes. Consider a real digital process $x[n]$ with innovations filter $L(z)$ and power spectrum $S(z)$:
$$S(z) = L(z)L(1/z) \qquad L(z) = \sum_{n=0}^{\infty}l[n]z^{-n} \tag{12-34}$$
where $n$ is now the discrete time. If we know $L(z)$, we can find the autocorrelation $R[m]$ of $x[n]$ either from the inversion formula (10-179) or from the convolution theorem
$$R[m] = l[m] * l[-m] = \sum_{k=0}^{\infty}l[|m| + k]\,l[k] \tag{12-35}$$
We shall discuss the properties of $R[m]$ for the class of finite-order processes.
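The convolution formulas (12-30) and (12-35) lend themselves to a direct numerical check. The sketch below evaluates (12-30) by the trapezoidal rule for the innovations filter of Example 12-6, whose impulse response is $l(t) = e^{-t} - e^{-2t}$, and evaluates (12-35) for an assumed first-order filter $L(z) = 1/(1 - az^{-1})$ with $l[n] = a^n$, for which the geometric sum gives $R[m] = a^{|m|}/(1 - a^2)$. The integration limits, step count, truncation length, and the value $a = 0.5$ are assumptions made for this illustration.

```python
import math

def l_cont(t):
    """Impulse response of L(s) = 1/((s + 1)(s + 2)) from Example 12-6."""
    return math.exp(-t) - math.exp(-2.0 * t) if t >= 0.0 else 0.0

def autocorr_cont(tau, upper=40.0, steps=40000):
    """R(tau) from (12-30), trapezoidal rule on (0, upper)."""
    h = upper / steps
    total = 0.0
    for k in range(steps + 1):
        alpha = k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * l_cont(abs(tau) + alpha) * l_cont(alpha)
    return total * h

def r_example_12_6(tau):
    """Closed form obtained in Example 12-6."""
    return math.exp(-abs(tau)) / 6.0 - math.exp(-2.0 * abs(tau)) / 12.0

def autocorr_disc(l, m):
    """R[m] from (12-35) for a truncated delta response l[0..K]."""
    m = abs(m)
    return sum(l[m + k] * l[k] for k in range(len(l) - m))

# Assumed digital example: l[n] = a^n U[n], truncated after 200 terms.
a = 0.5
l_disc = [a**n for n in range(200)]

cont_errors = [abs(autocorr_cont(tau) - r_example_12_6(tau)) for tau in (0.0, 0.5, -1.3)]
disc_errors = [abs(autocorr_disc(l_disc, m) - a**abs(m) / (1.0 - a * a)) for m in (0, 1, -3)]
```

Both error lists stay small; the continuous-time error is limited by the quadrature step rather than by (12-30) itself.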
The power spectrum $S(\omega)$ of a finite-order process $x[n]$ is a rational function of $\cos\omega$; hence its innovations filter is a rational function of $z$:
$$L(z) = \frac{N(z)}{D(z)} = \frac{b_0 + b_1z^{-1} + \cdots + b_Mz^{-M}}{1 + a_1z^{-1} + \cdots + a_Nz^{-N}} \tag{12-36}$$
To find its autocorrelation, we determine $l[n]$ and insert the result into (12-35). Assuming that the roots $z_i$ of $D(z)$ are simple and $M < N$, we obtain
$$L(z) = \sum_i\frac{\gamma_i}{1 - z_iz^{-1}} \qquad l[n] = \sum_i\gamma_iz_i^nU[n]$$
Alternatively, we expand $S(z)$:
$$S(z) = \sum_i\frac{\alpha_i}{1 - z_iz^{-1}} + \sum_i\frac{\alpha_i}{1 - z_iz} - \sum_i\alpha_i \qquad R[m] = \sum_i\alpha_iz_i^{|m|} \tag{12-37}$$
The constant $\sum_i\alpha_i = R[0]$ is subtracted because the term $m = 0$ is contained in both sums. Note that $\alpha_i = \gamma_iL(1/z_i)$.

The process $x[n]$ satisfies the recursion equation
$$x[n] + a_1x[n - 1] + \cdots + a_Nx[n - N] = b_0i[n] + \cdots + b_Mi[n - M] \tag{12-38}$$
where $i[n]$ is its innovations. We shall use this equation to relate the coefficients of $L(z)$ to the sequence $R[m]$ starting with two special cases.

Autoregressive processes. The process $x[n]$ is called autoregressive (AR) if
$$L(z) = \frac{b_0}{1 + a_1z^{-1} + \cdots + a_Nz^{-N}} \tag{12-39}$$
In this case, (12-38) yields
$$x[n] + a_1x[n - 1] + \cdots + a_Nx[n - N] = b_0i[n] \tag{12-40}$$
The past $x[n - m]$ of $x[n]$ depends only on the past of $i[n]$; furthermore, $E\{i^2[n]\} = 1$. From this it follows that $E\{x[n]i[n]\} = b_0$ and $E\{x[n - m]i[n]\} = 0$ for $m > 0$. Multiplying (12-40) by $x[n - m]$ and setting $m = 0, 1, \ldots$, we obtain the equations
$$\begin{aligned} R[0] + a_1R[1] + \cdots + a_NR[N] &= b_0^2 \\ R[1] + a_1R[0] + \cdots + a_NR[N - 1] &= 0 \\ &\;\;\vdots \\ R[N] + a_1R[N - 1] + \cdots + a_NR[0] &= 0 \end{aligned} \tag{12-41a}$$
and
$$R[m] + a_1R[m - 1] + \cdots + a_NR[m - N] = 0 \tag{12-41b}$$
for $m > N$. The first $N + 1$ of these are called the Yule-Walker equations. They are used in Sec. 13-3 to express the $N + 1$ parameters $a_k$ and $b_0$ in terms of the first $N + 1$ values of $R[m]$. Conversely, if $L(z)$ is known, we find $R[m]$
for $|m| \le N$ solving the system (12-41a) and we determine $R[m]$ recursively from (12-41b) for $m > N$.

Example 12-7. Suppose that
$$x[n] - ax[n - 1] = v[n] \qquad R_{vv}[m] = b\delta[m]$$
This is a special case of (12-40) with $D(z) = 1 - az^{-1}$ and $z_1 = a$. Hence
$$R[0] - aR[1] = b \qquad R[m] = \alpha a^{|m|} \qquad \alpha = \frac{b}{1 - a^2}$$

Line spectra. Suppose that $x[n]$ satisfies the homogeneous equation
$$x[n] + a_1x[n - 1] + \cdots + a_Nx[n - N] = 0 \tag{12-42}$$
This is a special case of (12-40) if we set $b_0 = 0$. Solving for $x[n]$, we obtain
$$x[n] = c_1z_1^n + \cdots + c_Nz_N^n \qquad D(z_i) = 0 \tag{12-43}$$
If $x[n]$ is a stationary process, only the terms with $z_i = e^{j\omega_i}$ can appear. Furthermore, their coefficients $c_i$ must be uncorrelated with zero mean. From this it follows that if $x[n]$ is a WSS process satisfying (12-42), its autocorrelation must be a sum of exponentials as in Example 10-31:
$$R[m] = \sum_i\alpha_ie^{j\omega_im} \qquad S(\omega) = 2\pi\sum_i\alpha_i\delta(\omega - \beta_i) \qquad |\omega| < \pi \tag{12-44}$$
where $\alpha_i = E\{c_i^2\}$ and $\beta_i = \omega_i - 2\pi k_i$ as in (10-182).

Moving average processes. A process $x[n]$ is a moving average (MA) if
$$x[n] = b_0i[n] + \cdots + b_Mi[n - M] \tag{12-45}$$
In this case, $L(z)$ is a polynomial and its inverse transform $l[n]$ has a finite length (FIR filter):
$$L(z) = b_0 + b_1z^{-1} + \cdots + b_Mz^{-M} \qquad l[n] = b_0\delta[n] + \cdots + b_M\delta[n - M] \tag{12-46}$$
Since $l[n] = 0$ for $n > M$, (12-35) yields
$$R[m] = \sum_{k=0}^{M-m}l[m + k]\,l[k] = \sum_{k=0}^{M-m}b_{k+m}b_k \tag{12-47}$$
for $0 \le m \le M$ and 0 for $m > M$. Explicitly,
$$\begin{aligned} R[0] &= b_0^2 + b_1^2 + \cdots + b_M^2 \\ R[1] &= b_0b_1 + b_1b_2 + \cdots + b_{M-1}b_M \\ &\;\;\vdots \\ R[M] &= b_0b_M \end{aligned}$$
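The Yule-Walker equations (12-41a), the recursion (12-41b), and the MA formula (12-47) can be checked against the convolution theorem (12-35). The sketch below computes the delta response of an assumed AR(2) filter by running its recursion, forms $R[m]$ from (12-35), and evaluates the residuals of (12-41a) and (12-41b); it then repeats the computation for an assumed MA(2) filter. The coefficient values, truncation length, and function names are assumptions made for this illustration.

```python
def delta_response(b, a, n_terms):
    """l[n] of L(z) = (b[0] + ... + b[M] z^-M) / (1 + a[0] z^-1 + ... + a[N-1] z^-N)."""
    l = []
    for n in range(n_terms):
        acc = b[n] if n < len(b) else 0.0
        for k in range(1, len(a) + 1):
            if n - k >= 0:
                acc -= a[k - 1] * l[n - k]
        l.append(acc)
    return l

def autocorr(l, m):
    """R[m] from (12-35), using a truncated delta response."""
    m = abs(m)
    return sum(l[m + k] * l[k] for k in range(len(l) - m))

def yule_walker_residuals(a, b0, n_terms=400):
    """Left side minus right side of (12-41a) for m = 0..N and of (12-41b) for m = N + 1."""
    l = delta_response([b0], a, n_terms)
    coeffs = [1.0] + list(a)
    residuals = []
    for m in range(len(a) + 2):
        lhs = sum(c * autocorr(l, m - k) for k, c in enumerate(coeffs))
        residuals.append(lhs - (b0 * b0 if m == 0 else 0.0))
    return residuals

# Assumed AR(2) example: D(z) = 1 - 0.9 z^-1 + 0.2 z^-2 (roots 0.5 and 0.4), b0 = 2.
ar_residuals = yule_walker_residuals([-0.9, 0.2], 2.0)

# Assumed MA(2) example: b = (1, 2, 3); (12-47) gives R[0..3] = 14, 8, 3, 0.
ma_b = [1.0, 2.0, 3.0]
ma_l = delta_response(ma_b, [], len(ma_b) + 2)
ma_R = [autocorr(ma_l, m) for m in range(4)]
```

The AR residuals vanish to rounding error only with $b_0^2$ on the right side of the first equation, which reflects $E\{x[n]\,b_0i[n]\} = b_0^2$.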
Example 12-8. Suppose that $x[n]$ is the arithmetic average of the $M$ values of $i[n]$:
$$x[n] = \frac{1}{M}\left(i[n] + i[n - 1] + \cdots + i[n - M + 1]\right)$$
In this case,
$$L(z) = \frac{1}{M}\left(1 + z^{-1} + \cdots + z^{-M+1}\right) = \frac{1 - z^{-M}}{M(1 - z^{-1})}$$
$$S(z) = L(z)L(1/z) = \frac{2 - z^M - z^{-M}}{M^2(2 - z - z^{-1})} \qquad S(e^{j\omega}) = \frac{\sin^2(M\omega/2)}{M^2\sin^2(\omega/2)}$$

Autoregressive moving average. We shall say that $x[n]$ is an ARMA process if it satisfies the equation
$$x[n] + a_1x[n - 1] + \cdots + a_Nx[n - N] = b_0i[n] + \cdots + b_Mi[n - M] \tag{12-48}$$
Its innovations filter $L(z)$ is the fraction in (12-36). Again, $i[n]$ is white noise; hence
$$E\{x[n - m]i[n - r]\} = 0 \qquad \text{for } m > r$$
Multiplying (12-48) by $x[n - m]$ and using the above, we conclude that
$$R[m] + a_1R[m - 1] + \cdots + a_NR[m - N] = 0 \qquad m > M \tag{12-49}$$
Note that, unlike the AR case, this is true only for $m > M$.

12-3 FOURIER SERIES AND KARHUNEN-LOÈVE EXPANSIONS

A process $x(t)$ is MS periodic with period $T$ if $E\{|x(t + T) - x(t)|^2\} = 0$ for all $t$. A WSS process is MS periodic if its autocorrelation $R(\tau)$ is periodic with period $T = 2\pi/\omega_0$ [see (10-165)]. Expanding $R(\tau)$ into Fourier series, we obtain
$$R(\tau) = \sum_{n=-\infty}^{\infty}\gamma_ne^{jn\omega_0\tau} \qquad \gamma_n = \frac{1}{T}\int_0^TR(\tau)e^{-jn\omega_0\tau}\,d\tau \tag{12-50}$$
Given a WSS periodic process $x(t)$ with period $T$, we form the sum
$$\hat{x}(t) = \sum_{n=-\infty}^{\infty}c_ne^{jn\omega_0t} \qquad c_n = \frac{1}{T}\int_0^Tx(t)e^{-jn\omega_0t}\,dt \tag{12-51}$$
THEOREM. The above sum equals $x(t)$ in the MS sense:
$$E\{|x(t) - \hat{x}(t)|^2\} = 0 \tag{12-52}$$
Furthermore, the RVs $c_n$ are uncorrelated with zero mean for $n \ne 0$, and their variance equals $\gamma_n$:
$$E\{c_n\} = \begin{cases} \eta_x & n = 0 \\ 0 & n \ne 0 \end{cases} \qquad E\{c_nc_m^*\} = \begin{cases} \gamma_n & n = m \\ 0 & n \ne m \end{cases} \tag{12-53}$$

Proof. We form the products
$$c_nx^*(\alpha) = \frac{1}{T}\int_0^Tx(t)x^*(\alpha)e^{-jn\omega_0t}\,dt \qquad c_nc_m^* = \frac{1}{T}\int_0^Tc_nx^*(t)e^{jm\omega_0t}\,dt$$
and we take expected values. This yields
$$E\{c_nx^*(\alpha)\} = \frac{1}{T}\int_0^TR(t - \alpha)e^{-jn\omega_0t}\,dt = \gamma_ne^{-jn\omega_0\alpha}$$
$$E\{c_nc_m^*\} = \frac{1}{T}\int_0^T\gamma_ne^{-jn\omega_0t}e^{jm\omega_0t}\,dt = \begin{cases} \gamma_n & n = m \\ 0 & n \ne m \end{cases}$$
and (12-53) results.
To prove (12-52), we observe, using the above, that
$$E\{|\hat{x}(t)|^2\} = \sum_nE\{|c_n|^2\} = \sum_n\gamma_n = R(0) = E\{|x(t)|^2\}$$
$$E\{\hat{x}(t)x^*(t)\} = \sum_nE\{c_nx^*(t)\}e^{jn\omega_0t} = \sum_n\gamma_n = E\{\hat{x}^*(t)x(t)\}$$
and (12-52) follows readily.

Suppose now that the WSS process $x(t)$ is not periodic. Selecting an arbitrary constant $T$, we form again the sum $\hat{x}(t)$ as in (12-51). It can be shown that (see Prob. 12-12) $\hat{x}(t)$ equals $x(t)$ not for all $t$, but only in the interval $(0, T)$:
$$E\{|x(t) - \hat{x}(t)|^2\} = 0 \qquad 0 < t < T \tag{12-54}$$
Unlike the periodic case, however, the coefficients $c_n$ of this expansion are not orthogonal (they are nearly orthogonal for large $n$). In the following, we show that an arbitrary process $x(t)$, stationary or not, can be expanded into a series with orthogonal coefficients.

The Karhunen-Loève Expansion

The Fourier series is a special case of the expansion of a process $x(t)$ into a series of the form
$$\hat{x}(t) = \sum_{n=1}^{\infty}c_n\varphi_n(t) \qquad 0 < t < T \tag{12-55}$$
where $\varphi_n(t)$ is a set of orthonormal functions in the interval $(0, T)$:
$$\int_0^T\varphi_n(t)\varphi_m^*(t)\,dt = \delta[n - m] \tag{12-56}$$
and the coefficients $c_n$ are RVs given by
$$c_n = \int_0^Tx(t)\varphi_n^*(t)\,dt \tag{12-57}$$
In the following, we consider the problem of determining a set of orthonormal functions $\varphi_n(t)$ such that: (a) the sum in (12-55) equals $x(t)$; (b) the coefficients $c_n$ are orthogonal.

To solve this problem, we form the integral equation
$$\int_0^TR(t_1,t_2)\varphi(t_2)\,dt_2 = \lambda\varphi(t_1) \qquad 0 < t_1 < T \tag{12-58}$$
where $R(t_1,t_2)$ is the autocorrelation of the process $x(t)$. It is well known from the theory of integral equations that the eigenfunctions $\varphi_n(t)$ of (12-58) are orthonormal as in (12-56) and they satisfy the identity
$$R(t,t) = \sum_{n=1}^{\infty}\lambda_n|\varphi_n(t)|^2 \tag{12-59}$$
where $\lambda_n$ are the corresponding eigenvalues. This is a consequence of the p.d. character of $R(t_1,t_2)$.

Using the above, we shall show that if $\varphi_n(t)$ are the eigenfunctions of (12-58) then
$$E\{|x(t) - \hat{x}(t)|^2\} = 0 \qquad 0 < t < T \tag{12-60}$$
and
$$E\{c_nc_m^*\} = \lambda_n\delta[n - m] \tag{12-61}$$

Proof. From (12-57) and (12-58) it follows that
$$E\{c_nx^*(\alpha)\} = \int_0^TR^*(\alpha,t)\varphi_n^*(t)\,dt = \lambda_n\varphi_n^*(\alpha)$$
$$E\{c_nc_m^*\} = \lambda_n\int_0^T\varphi_n^*(t)\varphi_m(t)\,dt = \lambda_n\delta[n - m] \tag{12-62}$$
Hence
$$E\{c_n\hat{x}^*(t)\} = \sum_{m=1}^{\infty}E\{c_nc_m^*\}\varphi_m^*(t) = \lambda_n\varphi_n^*(t)$$
$$E\{\hat{x}(t)x^*(t)\} = \sum_{n=1}^{\infty}\lambda_n\varphi_n(t)\varphi_n^*(t) = R(t,t) = E\{\hat{x}^*(t)x(t)\} = E\{|x(t)|^2\} = E\{|\hat{x}(t)|^2\}$$
and (12-60) results.
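The three properties used in the proof above, namely the integral equation (12-58), the orthonormality (12-56), and the identity (12-59), can be checked numerically for a kernel whose eigenfunctions are known. The sketch below assumes the kernel $R(t_1,t_2) = \alpha\min(t_1,t_2)$ (the Wiener-process autocorrelation treated in Example 12-10) with the candidate pairs $\varphi_n(t) = \sqrt{2/T}\sin\omega_nt$, $\lambda_n = \alpha/\omega_n^2$, $\omega_n = (2n + 1)\pi/2T$, $n = 0, 1, \ldots$ The values $\alpha = T = 1$, the midpoint quadrature, and all function names are assumptions made for this illustration.

```python
import math

ALPHA, T = 1.0, 1.0                       # assumed kernel scale and interval length

def kernel(t1, t2):
    """Assumed autocorrelation R(t1, t2) = alpha * min(t1, t2)."""
    return ALPHA * min(t1, t2)

def omega(n):
    return (2 * n + 1) * math.pi / (2.0 * T)

def phi(n, t):
    """Candidate eigenfunction sqrt(2/T) sin(omega_n t), n = 0, 1, ..."""
    return math.sqrt(2.0 / T) * math.sin(omega(n) * t)

def eigenvalue(n):
    return ALPHA / omega(n) ** 2

def integrate(f, steps=4000):
    """Midpoint rule on (0, T)."""
    h = T / steps
    return h * sum(f((k + 0.5) * h) for k in range(steps))

def lhs_12_58(n, t1):
    """Left side of the integral equation (12-58) for the nth candidate."""
    return integrate(lambda t2: kernel(t1, t2) * phi(n, t2))

def inner(n, m):
    """Inner product of two candidates, to be compared with (12-56)."""
    return integrate(lambda t: phi(n, t) * phi(m, t))

def mercer_sum(t, terms=5000):
    """Partial sum of the right side of (12-59)."""
    return sum(eigenvalue(n) * phi(n, t) ** 2 for n in range(terms))

eq_errors = [abs(lhs_12_58(n, t1) - eigenvalue(n) * phi(n, t1))
             for n in (0, 1, 4) for t1 in (0.2, 0.75)]
```

The partial sum in `mercer_sum` converges slowly (the eigenvalues decay like $1/n^2$), so it matches $R(t,t) = \alpha t$ only to a few decimal places.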
It is of interest to note that the converse of the above is also true: if $\varphi_n(t)$ is an orthonormal set of functions and
$$x(t) = \sum_{n=1}^{\infty}c_n\varphi_n(t) \qquad E\{c_nc_m^*\} = \begin{cases} \sigma_n^2 & n = m \\ 0 & n \ne m \end{cases}$$
then the functions $\varphi_n(t)$ must satisfy (12-58) with $\lambda = \sigma_n^2$.

Proof. From the assumptions it follows that $c_n$ is given by (12-57). Furthermore,
$$E\{x(t)c_m^*\} = \sum_{n=1}^{\infty}E\{c_nc_m^*\}\varphi_n(t) = \sigma_m^2\varphi_m(t)$$
$$E\{x(t)c_m^*\} = \int_0^TE\{x(t)x^*(\alpha)\}\varphi_m(\alpha)\,d\alpha = \int_0^TR(t,\alpha)\varphi_m(\alpha)\,d\alpha$$
This completes the proof.

The sum in (12-55) is called the Karhunen-Loève (K-L) expansion of the process $x(t)$. In this expansion, $x(t)$ need not be stationary. If it is stationary, then the origin can be chosen arbitrarily. We shall illustrate with two examples.

Example 12-9. Suppose that the process $x(t)$ is ideal low-pass with autocorrelation
$$R(\tau) = \frac{\sin a\tau}{\pi\tau}$$
We shall find its K-L expansion. Shifting the origin appropriately, we conclude from (12-58) that the functions $\varphi_n(t)$ must satisfy the integral equation
$$\int_{-T/2}^{T/2}\frac{\sin a(t - \tau)}{\pi(t - \tau)}\varphi_n(\tau)\,d\tau = \lambda_n\varphi_n(t) \tag{12-63}$$
The solutions of this equation are known as prolate-spheroidal functions.†

†D. Slepian, H. J. Landau, and H. O. Pollak: “Prolate Spheroidal Wave Functions,” Bell System Technical Journal, vol. 40, 1961.

Example 12-10. We shall determine the K-L expansion (12-55) of the Wiener process $w(t)$ introduced in Sec. 11-1. In this case [see (11-5)]
$$R(t_1,t_2) = \alpha\min(t_1,t_2) = \begin{cases} \alpha t_2 & t_2 < t_1 \\ \alpha t_1 & t_2 > t_1 \end{cases}$$
Inserting into (12-58), we obtain
$$\alpha\int_0^{t_1}t_2\varphi(t_2)\,dt_2 + \alpha t_1\int_{t_1}^{T}\varphi(t_2)\,dt_2 = \lambda\varphi(t_1) \tag{12-64}$$
To solve the above integral equation, we evaluate the appropriate endpoint
conditions and differentiate twice. This yields
$$\varphi(0) = 0 \qquad \alpha\int_{t_1}^{T}\varphi(t_2)\,dt_2 = \lambda\varphi'(t_1) \qquad \varphi'(T) = 0 \qquad \lambda\varphi''(t) + \alpha\varphi(t) = 0$$
Solving the last equation, we obtain
$$\varphi_n(t) = \sqrt{\frac{2}{T}}\sin\omega_nt \qquad \omega_n = \sqrt{\frac{\alpha}{\lambda_n}} = \frac{(2n + 1)\pi}{2T}$$
Thus, in the interval $(0, T)$, the Wiener process can be written as a sum of sine waves
$$w(t) = \sqrt{\frac{2}{T}}\sum_nc_n\sin\omega_nt \qquad c_n = \sqrt{\frac{2}{T}}\int_0^Tw(t)\sin\omega_nt\,dt$$
where the coefficients $c_n$ are uncorrelated with variance $E\{c_n^2\} = \lambda_n$.

12-4 SPECTRAL REPRESENTATION OF RANDOM PROCESSES

The Fourier transform of a stochastic process $x(t)$ is a stochastic process $X(\omega)$ given by
$$X(\omega) = \int_{-\infty}^{\infty}x(t)e^{-j\omega t}\,dt \tag{12-65}$$
The integral is interpreted as an MS limit. Reasoning as in (12-52), we can show that (inversion formula)
$$x(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty}X(\omega)e^{j\omega t}\,d\omega \tag{12-66}$$
in the MS sense. The properties of Fourier transforms also hold for random signals. For example, if $y(t)$ is the output of a linear system with input $x(t)$ and system function $H(\omega)$, then $Y(\omega) = X(\omega)H(\omega)$.

The mean of $X(\omega)$ equals the Fourier transform of the mean of $x(t)$. We shall express the autocorrelation of $X(\omega)$ in terms of the two-dimensional Fourier transform:
$$\Gamma(u,v) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}R(t_1,t_2)e^{-j(ut_1 + vt_2)}\,dt_1\,dt_2 \tag{12-67}$$
of the autocorrelation $R(t_1,t_2)$ of $x(t)$. Multiplying (12-65) by its conjugate and taking expected values, we obtain
$$E\{X(u)X^*(v)\} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}E\{x(t_1)x^*(t_2)\}e^{-j(ut_1 - vt_2)}\,dt_1\,dt_2$$
Hence
$$E\{X(u)X^*(v)\} = \Gamma(u,-v) \tag{12-68}$$
Using (12-68), we shall show that, if $x(t)$ is nonstationary white noise with average power $q(t)$, then $X(\omega)$ is a stationary process and its autocorrelation equals the Fourier transform $Q(\omega)$ of $q(t)$:

THEOREM. If $R(t_1,t_2) = q(t_1)\delta(t_1 - t_2)$, then
$$E\{X(\omega + \alpha)X^*(\alpha)\} = Q(\omega) = \int_{-\infty}^{\infty}q(t)e^{-j\omega t}\,dt \tag{12-69}$$

Proof. From the identity
$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}q(t_1)\delta(t_1 - t_2)e^{-j(ut_1 + vt_2)}\,dt_1\,dt_2 = \int_{-\infty}^{\infty}q(t_2)e^{-j(u + v)t_2}\,dt_2$$
it follows that $\Gamma(u,v) = Q(u + v)$. Hence [see (12-68)]
$$E\{X(\omega + \alpha)X^*(\alpha)\} = \Gamma(\omega + \alpha, -\alpha) = Q(\omega)$$

Note that if the process $x(t)$ is real, then
$$E\{X(u)X(v)\} = \Gamma(u,v) \tag{12-70}$$
Furthermore,
$$X(-\omega) = X^*(\omega) \qquad \Gamma(-u,-v) = \Gamma^*(u,v) \tag{12-71}$$

Covariance of energy spectrum. To find the autocovariance of $|X(\omega)|^2$, we must know the fourth-order moments of $X(\omega)$. However, if the process $x(t)$ is normal, the results can be expressed in terms of the function $\Gamma(u,v)$. We shall assume that the process $x(t)$ is real with
$$X(\omega) = A(\omega) + jB(\omega) \qquad \Gamma(u,v) = \Gamma_r(u,v) + j\Gamma_i(u,v) \tag{12-72}$$
From (12-68) and (12-70) it follows that
$$\begin{aligned} 2E\{A(u)A(v)\} &= \Gamma_r(u,v) + \Gamma_r(u,-v) & 2E\{A(v)B(u)\} &= \Gamma_i(u,v) + \Gamma_i(u,-v) \\ 2E\{B(u)B(v)\} &= \Gamma_r(u,-v) - \Gamma_r(u,v) & 2E\{A(u)B(v)\} &= \Gamma_i(u,v) - \Gamma_i(u,-v) \end{aligned} \tag{12-73}$$

THEOREM. If $x(t)$ is a real normal process with zero mean, then
$$\operatorname{Cov}\{|X(u)|^2, |X(v)|^2\} = |\Gamma(u,-v)|^2 + |\Gamma(u,v)|^2 \tag{12-74}$$
Proof. From the normality of $x(t)$ it follows that the processes $A(\omega)$ and $B(\omega)$ are jointly normal with zero mean. Hence [see (7-36)]
$$\begin{aligned} E\{|X(u)|^2|X(v)|^2\} &- E\{|X(u)|^2\}E\{|X(v)|^2\} \\ &= E\{[A^2(u) + B^2(u)][A^2(v) + B^2(v)]\} - E\{A^2(u) + B^2(u)\}E\{A^2(v) + B^2(v)\} \\ &= 2E^2\{A(u)A(v)\} + 2E^2\{B(u)B(v)\} + 2E^2\{A(u)B(v)\} + 2E^2\{A(v)B(u)\} \end{aligned}$$
Inserting (12-73) into the above, we obtain (12-74).

STATIONARY PROCESSES. Suppose that $x(t)$ is a stationary process with autocorrelation $R(t_1,t_2) = R(t_1 - t_2)$ and power spectrum $S(\omega)$. We shall show that
$$\Gamma(u,v) = 2\pi S(u)\delta(u + v) \tag{12-75}$$

Proof. With $t_1 = t_2 + \tau$, it follows from (12-67) that the two-dimensional transform of $R(t_1 - t_2)$ equals
$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}R(t_1 - t_2)e^{-j(ut_1 + vt_2)}\,dt_1\,dt_2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}R(\tau)e^{-ju\tau}e^{-j(u + v)t_2}\,d\tau\,dt_2$$
Hence
$$\Gamma(u,v) = S(u)\int_{-\infty}^{\infty}e^{-j(u + v)t_2}\,dt_2$$
This yields (12-75) because $\int e^{-j\omega t}\,dt = 2\pi\delta(\omega)$.

From (12-75) and (12-68) it follows that
$$E\{X(u)X^*(v)\} = 2\pi S(u)\delta(u - v) \tag{12-76}$$
This shows that the Fourier transform of a stationary process is nonstationary white noise with average power $2\pi S(u)$. It can be shown that the converse is also true (see Prob. 12-12): The process $x(t)$ in (12-66) is WSS iff $E\{X(\omega)\} = 0$ for $\omega \ne 0$, and
$$E\{X(u)X^*(v)\} = Q(u)\delta(u - v) \tag{12-77}$$

Real processes. If $x(t)$ is real, then $A(-\omega) = A(\omega)$, $B(-\omega) = -B(\omega)$, and
$$x(t) = \frac{1}{\pi}\int_0^\infty A(\omega)\cos\omega t\,d\omega - \frac{1}{\pi}\int_0^\infty B(\omega)\sin\omega t\,d\omega \tag{12-78}$$
It suffices, therefore, to specify $A(\omega)$ and $B(\omega)$ for $\omega \ge 0$ only. From (12-68) and (12-70) it follows that
$$E\{[A(u) + jB(u)][A(v) \pm jB(v)]\} = 0 \qquad u \ne \pm v$$
Equating real and imaginary parts, we obtain
$$E\{A(u)A(v)\} = E\{A(u)B(v)\} = E\{B(u)B(v)\} = 0 \qquad \text{for } u \ne v \tag{12-79a}$$
With $u = \omega$ and $v = -\omega$, (12-76) yields $E\{X(\omega)X(\omega)\} = 0$ for $\omega \ne 0$; hence
$$E\{A^2(\omega)\} = E\{B^2(\omega)\} \qquad E\{A(\omega)B(\omega)\} = 0 \tag{12-79b}$$
It can be shown that the converse is also true (see Prob. 12-13). Thus a real process $x(t)$ is WSS if the coefficients $A(\omega)$ and $B(\omega)$ of its expansion (12-78) satisfy (12-79) and $E\{A(\omega)\} = E\{B(\omega)\} = 0$ for $\omega \ne 0$.

Windows. Given a WSS process $x(t)$ and a function $w(t)$ with Fourier transform $W(\omega)$, we form the process $y(t) = w(t)x(t)$. This process is nonstationary with autocorrelation
$$R_{yy}(t_1,t_2) = w(t_1)w^*(t_2)R(t_1 - t_2)$$
The Fourier transform of $R_{yy}(t_1,t_2)$ equals
$$\Gamma_{yy}(u,v) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}w(t_1)w^*(t_2)R(t_1 - t_2)e^{-j(ut_1 + vt_2)}\,dt_1\,dt_2$$
Proceeding as in the proof of (12-75), we obtain
$$\Gamma_{yy}(u,v) = \frac{1}{2\pi}\int_{-\infty}^{\infty}W(u - \beta)W^*(-v - \beta)S(\beta)\,d\beta \tag{12-80}$$
From (12-68) and the above it follows that the autocorrelation of the Fourier transform
$$Y(\omega) = \int_{-\infty}^{\infty}w(t)x(t)e^{-j\omega t}\,dt \tag{12-81}$$
of $y(t)$ equals
$$E\{Y(u)Y^*(v)\} = \Gamma_{yy}(u,-v) = \frac{1}{2\pi}\int_{-\infty}^{\infty}W(u - \beta)W^*(v - \beta)S(\beta)\,d\beta$$
Hence
$$E\{|Y(\omega)|^2\} = \frac{1}{2\pi}\int_{-\infty}^{\infty}|W(\omega - \beta)|^2S(\beta)\,d\beta \tag{12-82}$$

Example 12-11. The integral
$$X_T(\omega) = \int_{-T}^{T}x(t)e^{-j\omega t}\,dt$$
is the transform of the segment $x(t)p_T(t)$ of the process $x(t)$. This is a special case of (12-81) with $w(t) = p_T(t)$ and $W(\omega) = 2\sin T\omega/\omega$. If, therefore, $x(t)$ is a
stationary process, then [see (12-82)]
$$E\{|X_T(\omega)|^2\} = S(\omega) * \frac{2\sin^2T\omega}{\pi\omega^2} \tag{12-83}$$

Fourier-Stieltjes Representation of WSS Processes†

†H. Cramér: Mathematical Methods of Statistics, Princeton Univ. Press, Princeton, N.J., 1946.

We shall express the spectral representation of a WSS process $x(t)$ in terms of the integral
$$Z(\omega) = \int_0^\omega X(\alpha)\,d\alpha \tag{12-84}$$
We have shown that the Fourier transform $X(\omega)$ of $x(t)$ is nonstationary white noise with average power $2\pi S(\omega)$. From (12-76) it follows that $Z(\omega)$ is a process with orthogonal increments: For any $\omega_1 < \omega_2 < \omega_3 < \omega_4$:
$$E\{[Z(\omega_2) - Z(\omega_1)][Z^*(\omega_4) - Z^*(\omega_3)]\} = 0 \tag{12-85a}$$
$$E\{|Z(\omega_2) - Z(\omega_1)|^2\} = 2\pi\int_{\omega_1}^{\omega_2}S(\omega)\,d\omega \tag{12-85b}$$
Clearly,
$$dZ(\omega) = X(\omega)\,d\omega \tag{12-86}$$
hence the inversion formula (12-66) can be written as a Fourier-Stieltjes integral:
$$x(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty}e^{j\omega t}\,dZ(\omega) \tag{12-87}$$
With $\omega_1 = u$, $\omega_2 = u + du$ and $\omega_3 = v$, $\omega_4 = v + dv$, (12-85) yields
$$E\{dZ(u)\,dZ^*(v)\} = 0 \quad u \ne v \qquad E\{|dZ(u)|^2\} = 2\pi S(u)\,du \tag{12-88}$$
The last equation can be used to define the spectrum $S(\omega)$ of a WSS process $x(t)$ in terms of the process $Z(\omega)$.

WOLD’S DECOMPOSITION. Using (12-85), we shall show that an arbitrary WSS process $x(t)$ can be written as a sum:
$$x(t) = x_r(t) + x_p(t) \tag{12-89}$$
where $x_r(t)$ is a regular process and $x_p(t)$ is a predictable process consisting of
exponentials:
$$x_p(t) = c_0 + \sum_ic_ie^{j\omega_it} \qquad E\{c_i\} = 0 \tag{12-90}$$
Furthermore, the two processes are orthogonal:
$$E\{x_r(t + \tau)x_p^*(t)\} = 0 \tag{12-91}$$
This expansion is called Wold’s decomposition. In Sec. 14-2, we determine the processes $x_r(t)$ and $x_p(t)$ as the responses of two linear systems with input $x(t)$. We also show that $x_p(t)$ is predictable in the sense that it is determined in terms of its past; the process $x_r(t)$ is not predictable.

We shall prove (12-89) using the properties of the integrated transform $Z(\omega)$ of $x(t)$. The process $Z(\omega)$ is a family of functions. In general, these functions are discontinuous at a set of points $\omega_i$ for almost every outcome. We expand $Z(\omega)$ as a sum (Fig. 12-5)
$$Z(\omega) = Z_r(\omega) + Z_p(\omega) \tag{12-92}$$
where $Z_r(\omega)$ is a continuous process for $\omega \ne 0$ and $Z_p(\omega)$ is a staircase function with discontinuities at $\omega_i$. We denote by $2\pi c_i$ the discontinuity jumps at $\omega_i \ne 0$. These jumps equal the jumps of $Z_p(\omega)$. We write the jump at $\omega = 0$ as a sum $2\pi(\eta + c_0)$ where $\eta = E\{x(t)\}$, and we associate the term $2\pi\eta$ with $Z_r(\omega)$. Thus at $\omega = 0$ the process $Z_r(\omega)$ is discontinuous with jump equal to $2\pi\eta$. The jump of $Z_p(\omega)$ at $\omega = 0$ equals $2\pi c_0$. Inserting (12-92) into (12-87), we obtain the decomposition (12-89) of $x(t)$ where $x_r(t)$ and $x_p(t)$ are the components due to $Z_r(\omega)$ and $Z_p(\omega)$ respectively.

From (12-85) it follows that $Z_r(\omega)$ and $Z_p(\omega)$ are two processes with orthogonal increments and such that
$$E\{Z_r(u)Z_p^*(v)\} = 0 \qquad E\{c_ic_j^*\} = \begin{cases} k_i & i = j \\ 0 & i \ne j \end{cases} \tag{12-93}$$
The first equation shows that the processes $x_r(t)$ and $x_p(t)$ are orthogonal as in (12-91); the second shows that the coefficients $c_i$ of $x_p(t)$ are orthogonal. This also follows from the stationarity of $x_p(t)$.
We denote by $S_r(\omega)$ and $S_p(\omega)$ the spectra and by $F_r(\omega)$ and $F_p(\omega)$ the integrated spectra of $x_r(t)$ and $x_p(t)$ respectively. From (12-89) and (12-91) it follows that

$$S(\omega) = S_r(\omega) + S_p(\omega) \qquad F(\omega) = F_r(\omega) + F_p(\omega) \tag{12-94}$$

The term $F_r(\omega)$ is continuous for $\omega \ne 0$; for $\omega = 0$ it is discontinuous with a jump equal to $2\pi\eta^2$. The term $F_p(\omega)$ is a staircase function, discontinuous at the points $\omega_i$ with jumps equal to $2\pi k_i$. Hence

$$S_p(\omega) = 2\pi k_0\,\delta(\omega) + 2\pi\sum_i k_i\,\delta(\omega - \omega_i) \tag{12-95}$$

The impulse at the origin of $S(\omega)$ equals $2\pi(k_0 + \eta^2)\delta(\omega)$.

Example 12-12. Consider the process

$$y(t) = a\,x(t) \qquad E\{a\} = 0$$

where $x(t)$ is a regular process independent of $a$. We shall determine its Wold decomposition. From the assumptions it follows that

$$E\{y(t)\} = 0 \qquad R_{yy}(\tau) = E\{a^2 x(t + \tau)x(t)\} = \sigma_a^2 R_{xx}(\tau)$$

The spectrum of $x(t)$ equals $S_{xx}^c(\omega) + 2\pi\eta_x^2\delta(\omega)$. Hence

$$S_{yy}(\omega) = \sigma_a^2 S_{xx}^c(\omega) + 2\pi\sigma_a^2\eta_x^2\,\delta(\omega)$$

From the regularity of $x(t)$ it follows that its covariance spectrum $S_{xx}^c(\omega)$ has no impulses. Since $\eta_y = 0$, we conclude from (12-95) that $S_p(\omega) = 2\pi k_0\delta(\omega)$ where $k_0 = \sigma_a^2\eta_x^2$. This yields

$$y_p(t) = \eta_x a \qquad y_r(t) = a[x(t) - \eta_x]$$

DISCRETE-TIME PROCESSES. Given a discrete-time process $x[n]$, we form its discrete Fourier transform (DFT)

$$X(\omega) = \sum_{n=-\infty}^{\infty} x[n]\,e^{-jn\omega} \tag{12-96}$$

This yields

$$x[n] = \frac{1}{2\pi}\int_{-\pi}^{\pi} X(\omega)\,e^{jn\omega}\,d\omega \tag{12-97}$$

From the definition it follows that the process $X(\omega)$ is periodic with period $2\pi$. It suffices, therefore, to study its properties for $|\omega| < \pi$ only. The preceding results properly modified also hold for discrete-time processes. We shall discuss only the digital version of (12-76):

If $x[n]$ is a WSS process with power spectrum $S(\omega)$, then its DFT $X(\omega)$ is nonstationary white noise with autocovariance

$$E\{X(u)X^*(v)\} = 2\pi S(u)\,\delta(u - v) \qquad -\pi < u, v < \pi \tag{12-98}$$
Proof. The proof is based on the identity

$$\sum_{n=-\infty}^{\infty} e^{-jn\omega} = 2\pi\delta(\omega) \qquad |\omega| < \pi$$

Clearly,

$$E\{X(u)X^*(v)\} = \sum_{n=-\infty}^{\infty}\sum_{m=-\infty}^{\infty} E\{x[n + m]\,x^*[n]\}\exp\{-j[(m + n)u - nv]\} = \sum_{n=-\infty}^{\infty} e^{-jn(u - v)}\sum_{m=-\infty}^{\infty} R[m]\,e^{-jmu}$$

and (12-98) results.

BISPECTRA AND THIRD-ORDER MOMENTS. Consider a real SSS process $x(t)$ with Fourier transform $X(\omega)$ and third-order moment $R(\mu, \nu)$ [see (11-179)]. Generalizing (12-76), we shall express the third-order moment of $X(\omega)$ in terms of the bispectrum $S(u, v)$ of $x(t)$.

THEOREM.

$$E\{X(u)X(v)X^*(w)\} = 2\pi S(u, v)\,\delta(u + v - w) \tag{12-99}$$

Proof. From (12-65) it follows that the left side of (12-99) equals

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} E\{x(t_1)x(t_2)x(t_3)\}\,e^{-j(ut_1 + vt_2 - wt_3)}\,dt_1\,dt_2\,dt_3$$

With $t_1 = t_3 + \mu$ and $t_2 = t_3 + \nu$, the above yields

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} R(\mu, \nu)\,e^{-j(u\mu + v\nu)}\,d\mu\,d\nu\int_{-\infty}^{\infty} e^{-j(u + v - w)t_3}\,dt_3$$

and (12-99) results because the last integral equals $2\pi\delta(u + v - w)$.

We have thus shown that the third-order moment of $X(\omega)$ is 0 everywhere in the $uvw$ space except on the plane $w = u + v$ where it equals a surface singularity with density $2\pi S(u, v)$. Using this result, we shall determine the third-order moment of the increments

$$Z(\omega_2) - Z(\omega_1) = \int_{\omega_1}^{\omega_2} X(\omega)\,d\omega \tag{12-100}$$

of the integrated transforms $Z(\omega)$ of $x(t)$.

THEOREM.

$$E\{[Z(\omega_2) - Z(\omega_1)][Z(\omega_4) - Z(\omega_3)][Z^*(\omega_6) - Z^*(\omega_5)]\} = 2\pi\iint_R S(u, v)\,du\,dv \tag{12-101}$$
where $R$ is the set of points common to the three regions

$$\omega_1 < u < \omega_2 \qquad \omega_3 < v < \omega_4 \qquad \omega_5 < w < \omega_6$$

(shaded in Fig. 12-6a) of the $uv$ plane, with $w = u + v$.

FIGURE 12-6

Proof. From (12-99) and (12-100) it follows that the left side of (12-101) equals

$$\int_{\omega_5}^{\omega_6}\int_{\omega_3}^{\omega_4}\int_{\omega_1}^{\omega_2} 2\pi S(u, v)\,\delta(u + v - w)\,du\,dv\,dw = 2\pi\int_{\omega_3}^{\omega_4}\int_{\omega_1}^{\omega_2} S(u, v)\,du\,dv\int_{\omega_5}^{\omega_6}\delta(u + v - w)\,dw$$

The last integral equals one for $\omega_5 < u + v < \omega_6$ and 0 otherwise. Hence the right side equals the integral of $2\pi S(u, v)$ in the set $R$ as in (12-101).

COROLLARY. Consider the differentials

$$dZ(u_0) = X(u_0)\,du \qquad dZ(v_0) = X(v_0)\,dv \qquad dZ(w_0) = X(w_0)\,dw$$

We maintain that

$$E\{dZ(u_0)\,dZ(v_0)\,dZ^*(w_0)\} = 2\pi S(u_0, v_0)\,du\,dv \tag{12-102}$$

if $w_0 = u_0 + v_0$ and $dw \ge du + dv$; it is zero if $w_0 \ne u_0 + v_0$.

Proof. Setting

$$\omega_1 = u_0 \qquad \omega_3 = v_0 \qquad \omega_5 = w_0 = u_0 + v_0$$

$$\omega_2 = u_0 + du \qquad \omega_4 = v_0 + dv \qquad \omega_6 \ge w_0 + du + dv$$

into (12-101), we obtain (12-102) because the set $R$ is the shaded rectangle of Fig. 12-6b.

We conclude with the observation that equation (12-102) can be used to define the bispectrum of an SSS process $x(t)$ in terms of $Z(\omega)$.
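PROBLEMS

12-1. Find $R_x[m]$ and the whitening filter of $x[n]$ if

$$S_x(\omega) = \frac{\cos 2\omega + 1}{12\cos 2\omega - 70\cos\omega + 62}$$

12-2. Find the innovations filter of the process $x(t)$ if

$$S_x(\omega) = \frac{\omega^4 + 64}{\omega^4 + 10\omega^2 + 9}$$

12-3. Show that if $l_s[n]$ is the delta response of the innovations filter of $s[n]$, then

$$R_s[0] = \sum_{n=0}^{\infty} l_s^2[n]$$

12-4. The process $x(t)$ is WSS and

$$y''(t) + 3y'(t) + 2y(t) = x(t)$$

Show that (a)

$$R_{yx}''(\tau) + 3R_{yx}'(\tau) + 2R_{yx}(\tau) = R_{xx}(\tau)$$

$$R_{yy}''(\tau) + 3R_{yy}'(\tau) + 2R_{yy}(\tau) = R_{xy}(\tau) \qquad \text{all } \tau$$

(b) If $R_{xx}(\tau) = q\delta(\tau)$, then $R_{yx}(\tau) = 0$ for $\tau < 0$ and for $\tau > 0$:

$$R_{yx}''(\tau) + 3R_{yx}'(\tau) + 2R_{yx}(\tau) = 0 \qquad R_{yx}(0) = 0 \qquad R_{yx}'(0^+) = q$$

$$R_{yy}''(\tau) + 3R_{yy}'(\tau) + 2R_{yy}(\tau) = 0 \qquad R_{yy}(0) = \frac{q}{12} \qquad R_{yy}'(0) = 0$$

12-5. Show that if $s[n]$ is AR and $v[n]$ is white noise orthogonal to $s[n]$, then the process $x[n] = s[n] + v[n]$ is ARMA. Find $S_x(z)$ if $R_s[m] = 2^{-|m|}$ and $S_v(z) = 5$.

12-6. Show that if $x(t)$ is a WSS process and

$$s = \frac{1}{n}\sum_{k=1}^{n} x(kT) \qquad \text{then} \qquad E\{s^2\} = \frac{1}{2\pi n^2}\int_{-\infty}^{\infty} S_x(\omega)\,\frac{\sin^2(n\omega T/2)}{\sin^2(\omega T/2)}\,d\omega$$

12-7. Show that if $R_x(\tau) = e^{-c|\tau|}$, then the Karhunen-Loève expansion of $x(t)$ in the interval $(-a, a)$ is the sum

$$\hat{x}(t) = \sum_{n=1}^{\infty}\left(\beta_n b_n\cos\omega_n t + \beta_n' b_n'\sin\omega_n' t\right)$$

where

$$\tan a\omega_n = \frac{c}{\omega_n} \qquad \cot a\omega_n' = -\frac{c}{\omega_n'} \qquad \beta_n = (a + c\lambda_n)^{-1/2} \qquad \beta_n' = (a - c\lambda_n')^{-1/2}$$

$$E\{b_n^2\} = \lambda_n = \frac{2c}{c^2 + \omega_n^2} \qquad E\{b_n'^2\} = \lambda_n' = \frac{2c}{c^2 + \omega_n'^2}$$

12-8. Show that if $x(t)$ is WSS and

$$X_T(\omega) = \int_{-T/2}^{T/2} x(t)\,e^{-j\omega t}\,dt \qquad \text{then} \qquad E\left\{\frac{1}{T}|X_T(\omega)|^2\right\} = \int_{-T}^{T}\left(1 - \frac{|\tau|}{T}\right)R_x(\tau)\,e^{-j\omega\tau}\,d\tau$$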
12-9. Find the mean and the variance of the integral

$$X(\omega) = \int_{-a}^{a}[5\cos 3t + v(t)]\,e^{-j\omega t}\,dt$$

if $E\{v(t)\} = 0$ and $R_v(\tau) = 2\delta(\tau)$.

12-10. Show that if

$$E\{x_n x_k\} = \sigma_n^2\,\delta[n - k] \qquad X(\omega) = \sum_{n=-\infty}^{\infty} x_n e^{-jn\omega T}$$

and $E\{x_n\} = 0$, then $E\{X(\omega)\} = 0$ and

$$E\{X(u)X^*(v)\} = \sum_{n=-\infty}^{\infty}\sigma_n^2\,e^{-jn(u - v)T}$$

12-11. Given a nonperiodic WSS process $x(t)$, we form the sum $\hat{x}(t) = \sum_n c_n e^{jn\omega_0 t}$ as in (12-51). Show that (a) $E\{|x(t) - \hat{x}(t)|^2\} = 0$ for $0 < t < T$. (b) $E\{c_n c_m^*\} = (1/T)\int_0^T \beta_n(\alpha)\,e^{jm\omega_0\alpha}\,d\alpha$ where $\beta_n(\alpha) = (1/T)\int_0^T R(\tau - \alpha)\,e^{-jn\omega_0\tau}\,d\tau$ are the coefficients of the Fourier expansion of $R(\tau - \alpha)$ in the interval $(0, T)$. (c) For large $T$, $E\{c_n c_m^*\} \simeq S(n\omega_0)\,\delta[n - m]$.

12-12. Show that, if the process $X(\omega)$ is white noise with zero mean and autocovariance $Q(u)\delta(u - v)$, then its inverse Fourier transform $x(t)$ is WSS with power spectrum $Q(\omega)/2\pi$.

12-13. Given a real process $x(t)$ with Fourier transform $X(\omega) = A(\omega) + jB(\omega)$, show that if the processes $A(\omega)$ and $B(\omega)$ satisfy (12-79) and $E\{A(\omega)\} = E\{B(\omega)\} = 0$, then $x(t)$ is WSS.

12-14. We use as an estimate of the Fourier transform $F(\omega)$ of a signal $f(t)$ the integral

$$X_T(\omega) = \int_{-T}^{T}[f(t) + v(t)]\,e^{-j\omega t}\,dt$$

where $v(t)$ is the measurement noise. Show that if $S_{vv}(\omega) = q$, then

$$E\{X_T(\omega)\} = \int_{-\infty}^{\infty} F(y)\,\frac{\sin T(\omega - y)}{\pi(\omega - y)}\,dy \qquad \operatorname{Var} X_T(\omega) = 2qT$$
CHAPTER 13
SPECTRAL ESTIMATION

13-1 ERGODICITY

A central problem in the applications of stochastic processes is the estimation of various statistical parameters in terms of real data. Most parameters can be expressed as expected values of some functional of a process $x(t)$. The problem of estimating the mean of a given process $x(t)$ is, therefore, central in this investigation. We start with this problem.

For a specific $t$, $x(t)$ is an RV; its mean $\eta(t) = E\{x(t)\}$ can, therefore, be estimated as in Sec. 9-2: We observe $n$ samples $x(t, \xi_i)$ of $x(t)$ and use as the point estimate of $E\{x(t)\}$ the average

$$\hat{\eta}(t) = \frac{1}{n}\sum_i x(t, \xi_i)$$

As we know, $\hat{\eta}(t)$ is a consistent estimate of $\eta(t)$; however, it can be used only if a large number of realizations $x(t, \xi_i)$ of $x(t)$ are available. In many applications, we know only a single sample of $x(t)$. Can we then estimate $\eta(t)$ in terms of the time average of the given sample? This is not possible if $E\{x(t)\}$ depends on $t$. However, if $x(t)$ is a regular stationary process, its time average tends to $E\{x(t)\}$ as the length of the available sample tends to $\infty$. Ergodicity is a topic dealing with the underlying theory.
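The ensemble average described above can be sketched numerically. The following is a minimal illustration, not part of the original text: the process `make_path` (a linear trend plus unit-variance Gaussian noise, so that the mean depends on $t$), the number of realizations, and all names are assumptions chosen only to show that averaging across realizations recovers $\eta(t)$ at each fixed time.

```python
import random

def ensemble_mean(realizations, k):
    """Point estimate of eta(t_k): average of the k-th sample over all realizations."""
    return sum(path[k] for path in realizations) / len(realizations)

def make_path(rng, length):
    """Hypothetical process: x(t_k) = 0.1*k + unit-variance Gaussian noise."""
    return [0.1 * k + rng.gauss(0.0, 1.0) for k in range(length)]

rng = random.Random(0)
paths = [make_path(rng, 50) for _ in range(4000)]   # n = 4000 realizations

eta_hat_10 = ensemble_mean(paths, 10)   # estimates eta(t_10) = 1.0
eta_hat_40 = ensemble_mean(paths, 40)   # estimates eta(t_40) = 4.0
print(eta_hat_10, eta_hat_40)
```

Because the mean of this hypothetical process changes with time, a time average along a single path would not recover $\eta(t)$; only the average across realizations does.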
Mean-Ergodic Processes

We are given a real stationary process $x(t)$ and we wish to estimate its mean $\eta = E\{x(t)\}$. For this purpose, we form the time average

$$\eta_T = \frac{1}{2T}\int_{-T}^{T} x(t)\,dt \tag{13-1}$$

Clearly, $\eta_T$ is an RV with mean

$$E\{\eta_T\} = \frac{1}{2T}\int_{-T}^{T} E\{x(t)\}\,dt = \eta$$

Thus $\eta_T$ is an unbiased estimator of $\eta$. If its variance $\sigma_T^2 \to 0$ as $T \to \infty$, then $\eta_T \to \eta$ in the MS sense. In this case, the time average $\eta_T(\xi)$ computed from a single realization of $x(t)$ is close to $\eta$ with probability close to 1. If this is true, we shall say that the process $x(t)$ is mean-ergodic. Thus a process $x(t)$ is mean-ergodic if its time average tends to the ensemble average $\eta$ as $T \to \infty$.

To establish the ergodicity of a process, it suffices to find $\sigma_T$ and to examine the conditions under which $\sigma_T \to 0$ as $T \to \infty$. As the following examples show, not all processes are mean-ergodic.

Example 13-1. Suppose that $c$ is an RV with mean $\eta_c$ and

$$x(t) = c \qquad \eta = E\{x(t)\} = E\{c\} = \eta_c$$

In this case, $x(t)$ is a family of straight lines and $\eta_T = c$. For a specific sample, $\eta_T(\xi) = c(\xi)$ is a constant different from $\eta$ if $c(\xi) \ne \eta$. Hence $x(t)$ is not mean-ergodic.

Example 13-2. Given two mean-ergodic processes $x_1(t)$ and $x_2(t)$ with means $\eta_1$ and $\eta_2$, we form the sum

$$x(t) = x_1(t) + c\,x_2(t)$$

where $c$ is an RV independent of $x_2(t)$ taking the values 0 and 1 with probability 0.5. Clearly,

$$E\{x(t)\} = E\{x_1(t)\} + E\{c\}E\{x_2(t)\} = \eta_1 + 0.5\eta_2$$

If $c(\xi) = 0$ for a particular $\xi$, then $x(t) = x_1(t)$ and $\eta_T \to \eta_1$ as $T \to \infty$. If $c(\xi) = 1$ for another $\xi$, then $x(t) = x_1(t) + x_2(t)$ and $\eta_T \to \eta_1 + \eta_2$ as $T \to \infty$. Hence $x(t)$ is not mean-ergodic.

VARIANCE. To determine the variance $\sigma_T^2$ of the time average $\eta_T$ of $x(t)$, we start with the observation that

$$\eta_T = w(0) \qquad \text{where} \qquad w(t) = \frac{1}{2T}\int_{t-T}^{t+T} x(\alpha)\,d\alpha \tag{13-2}$$

is the moving average of $x(t)$. As we know, $w(t)$ is the output of a linear system with input $x(t)$ and with impulse response a pulse centered at $t = 0$. Hence $w(t)$
is stationary and its autocovariance equals

$$C_{ww}(\tau) = \frac{1}{2T}\int_{-2T}^{2T} C(\tau - \alpha)\left(1 - \frac{|\alpha|}{2T}\right)d\alpha \tag{13-3}$$

where $C(\tau)$ is the autocovariance of $x(t)$ [see (10-142)]. Since $\sigma_T^2 = \operatorname{Var} w(0) = C_{ww}(0)$ and $C(-\alpha) = C(\alpha)$, this yields

$$\sigma_T^2 = \frac{1}{2T}\int_{-2T}^{2T} C(\alpha)\left(1 - \frac{|\alpha|}{2T}\right)d\alpha = \frac{1}{T}\int_0^{2T} C(\alpha)\left(1 - \frac{\alpha}{2T}\right)d\alpha \tag{13-4}$$

This fundamental result leads to the following conclusion: A process $x(t)$ with autocovariance $C(\tau)$ is mean-ergodic iff

$$\frac{1}{T}\int_0^{2T} C(\alpha)\left(1 - \frac{\alpha}{2T}\right)d\alpha \xrightarrow[T\to\infty]{} 0 \tag{13-5}$$

The determination of the variance of $\eta_T$ is useful not only in establishing the ergodicity of $x(t)$ but also in determining a confidence interval for the estimate $\eta_T$ of $\eta$. Indeed, from Tchebycheff's inequality it follows that the probability that the unknown $\eta$ is in the interval $\eta_T \pm 10\sigma_T$ is larger than 0.99 [see (5-57)]. Hence $\eta_T$ is a satisfactory estimate of $\eta$ if $T$ is such that $\sigma_T \ll \eta$.

Example 13-3. Suppose that $C(\tau) = qe^{-c|\tau|}$ as in (11-15). In this case,

$$\sigma_T^2 = \frac{q}{T}\int_0^{2T} e^{-c\tau}\left(1 - \frac{\tau}{2T}\right)d\tau = \frac{q}{cT}\left(1 - \frac{1 - e^{-2cT}}{2cT}\right)$$

Clearly, $\sigma_T^2 \to 0$ as $T \to \infty$; hence $x(t)$ is mean-ergodic. If $T \gg 1/c$, then $\sigma_T^2 \simeq q/cT$.

Example 13-4. Suppose that $x(t) = \eta + v(t)$ where $v(t)$ is white noise with $R_{vv}(\tau) = q\delta(\tau)$. In this case, $C(\tau) = R_{vv}(\tau)$ and (13-4) yields

$$\sigma_T^2 = \frac{1}{2T}\int_{-2T}^{2T} q\,\delta(\tau)\left(1 - \frac{|\tau|}{2T}\right)d\tau = \frac{q}{2T}$$

Hence $x(t)$ is mean-ergodic.

It is clear from (13-5) that the ergodicity of a process depends on the behavior of $C(\tau)$ for large $\tau$. If $C(\tau) = 0$ for $\tau > a$, that is, if $x(t)$ is $a$-dependent and $T \gg a$, then

$$\sigma_T^2 = \frac{1}{T}\int_0^{a} C(\tau)\left(1 - \frac{\tau}{2T}\right)d\tau \simeq \frac{1}{T}\int_0^{a} C(\tau)\,d\tau < \frac{a}{T}C(0) \xrightarrow[T\to\infty]{} 0$$

because $|C(\tau)| < C(0)$; hence $x(t)$ is mean-ergodic.

In many applications, the RVs $x(t + \tau)$ and $x(t)$ are nearly uncorrelated for large $\tau$, that is, $C(\tau) \to 0$ as $\tau \to \infty$. The above suggests that if this is the case, then $x(t)$ is mean-ergodic and for large $T$ the variance of $\eta_T$ can be
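approximated by

$$\sigma_T^2 \simeq \frac{1}{T}\int_0^{2T} C(\tau)\,d\tau \simeq \frac{1}{T}\int_0^{\infty} C(\tau)\,d\tau = \frac{\tau_c}{T}C(0) \tag{13-6}$$

where $\tau_c$ is the correlation time of $x(t)$ defined in (10-49). This result will be justified presently.

SLUTSKY'S THEOREM. A process $x(t)$ is mean-ergodic iff

$$\frac{1}{T}\int_0^{T} C(\tau)\,d\tau \xrightarrow[T\to\infty]{} 0 \tag{13-7}$$

Proof. (a) We show first that if $\sigma_T \to 0$ as $T \to \infty$, then (13-7) is true. The covariance of the RVs $\eta_T$ and $x(0)$ equals

$$\operatorname{Cov}[\eta_T, x(0)] = E\left\{\frac{1}{2T}\int_{-T}^{T}[x(t) - \eta]\,dt\,[x(0) - \eta]\right\} = \frac{1}{2T}\int_{-T}^{T} C(t)\,dt$$

But [see (7-9)]

$$\operatorname{Cov}^2[\eta_T, x(0)] \le \operatorname{Var}\eta_T\,\operatorname{Var}x(0) = \sigma_T^2\,C(0)$$

Hence (13-7) holds if $\sigma_T \to 0$.

(b) We show next that if (13-7) is true, then $\sigma_T \to 0$ as $T \to \infty$. From (13-7) it follows that given $\varepsilon > 0$, we can find a constant $c_0$ such that

$$\frac{1}{c}\int_0^{c} C(\tau)\,d\tau < \varepsilon \qquad \text{for every } c > c_0 \tag{13-8}$$

The variance of $\eta_T$ equals [see (13-4)]

$$\sigma_T^2 = \frac{1}{T}\int_0^{2T_0} C(\tau)\left(1 - \frac{\tau}{2T}\right)d\tau + \frac{1}{T}\int_{2T_0}^{2T} C(\tau)\left(1 - \frac{\tau}{2T}\right)d\tau$$

where $T_0$ is any constant such that $2T_0 > c_0$. The integral from 0 to $2T_0$ is less than $2T_0C(0)/T$ because $|C(\tau)| \le C(0)$. Hence

$$\sigma_T^2 < \frac{2T_0}{T}C(0) + \frac{1}{T}\int_{2T_0}^{2T} C(\tau)\left(1 - \frac{\tau}{2T}\right)d\tau$$

But (see Fig. 13-1)

$$\int_{2T_0}^{2T} C(\tau)(2T - \tau)\,d\tau = \int_{2T_0}^{2T} C(\tau)\int_{\tau}^{2T} dt\,d\tau = \int_{2T_0}^{2T}\int_{2T_0}^{t} C(\tau)\,d\tau\,dt$$

From (13-8) it follows that the inner integral on the right is less than $\varepsilon t$; hence

$$\sigma_T^2 < \frac{2T_0}{T}C(0) + \frac{\varepsilon}{T^2}\int_{2T_0}^{2T} t\,dt \xrightarrow[T\to\infty]{} 2\varepsilon$$

and since $\varepsilon$ is arbitrary, we conclude that $\sigma_T \to 0$ as $T \to \infty$.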
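The variance formula (13-4) and the large-$T$ approximation (13-6) can be checked numerically. The sketch below is an illustration under stated assumptions: it uses the exponential autocovariance $C(\tau) = qe^{-c|\tau|}$ of Example 13-3 with arbitrarily chosen values of $q$ and $c$, a simple midpoint rule for the integral, and hypothetical function names. For this autocovariance $\tau_c C(0) = q/c$, so (13-6) predicts $\sigma_T^2 \simeq q/cT$.

```python
import math

def sigma_T_sq(C, T, steps=20000):
    """Midpoint-rule evaluation of (13-4): (1/T) * int_0^{2T} C(a) * (1 - a/2T) da."""
    h = 2.0 * T / steps
    total = 0.0
    for i in range(steps):
        a = (i + 0.5) * h
        total += C(a) * (1.0 - a / (2.0 * T))
    return total * h / T

q, c = 2.0, 0.5                      # illustrative values, not taken from the text

def C(tau):
    """Exponential autocovariance q * exp(-c|tau|) of Example 13-3."""
    return q * math.exp(-c * abs(tau))

def closed_form(T):
    """Closed form of Example 13-3: (q/cT) * (1 - (1 - exp(-2cT)) / (2cT))."""
    return q / (c * T) * (1.0 - (1.0 - math.exp(-2.0 * c * T)) / (2.0 * c * T))

for T in (1.0, 10.0, 100.0):
    # numerical (13-4), closed form, and the approximation (13-6) = q/(cT)
    print(T, sigma_T_sq(C, T), closed_form(T), q / (c * T))
```

The three columns approach each other as $T$ grows, and all tend to 0, which is the mean-ergodicity condition (13-5) for this autocovariance.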
Example 13-5. Consider the process

$$x(t) = a\cos\omega t + b\sin\omega t + c$$

where $a$ and $b$ are two uncorrelated RVs with zero mean and equal variance. As we know [see (10-55)], the process $x(t)$ is WSS with mean $c$ and autocovariance $\sigma^2\cos\omega\tau$. We shall show that it is mean-ergodic. This follows from (13-7) and the fact that

$$\frac{1}{T}\int_0^{T} C(\tau)\,d\tau = \frac{\sigma^2}{T}\int_0^{T}\cos\omega\tau\,d\tau = \frac{\sigma^2}{\omega T}\sin\omega T \xrightarrow[T\to\infty]{} 0$$

Sufficient conditions. (a) If

$$\int_0^{\infty} C(\tau)\,d\tau < \infty \tag{13-9}$$

then (13-7) holds; hence the process $x(t)$ is mean-ergodic.

(b) If $R(\tau) \to \eta^2$ or, equivalently, if

$$C(\tau) \to 0 \qquad \text{as} \qquad \tau \to \infty \tag{13-10}$$

then $x(t)$ is mean-ergodic.

Proof. If (13-10) is true, then given $\varepsilon > 0$, we can find a constant $T_0$ such that $|C(\tau)| < \varepsilon$ for $\tau > T_0$; hence

$$\frac{1}{T}\int_0^{T} C(\tau)\,d\tau = \frac{1}{T}\int_0^{T_0} C(\tau)\,d\tau + \frac{1}{T}\int_{T_0}^{T} C(\tau)\,d\tau < \frac{T_0}{T}C(0) + \varepsilon\,\frac{T - T_0}{T} \xrightarrow[T\to\infty]{} \varepsilon$$

and since $\varepsilon$ is arbitrary, we conclude that (13-7) is true.
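Example 13-5 can also be examined on a single realization. The following sketch is illustrative only: the values of $\omega$ and $c$, the Gaussian draws for $a$ and $b$, and the midpoint-rule integrator are assumptions. For this process the time average (13-1) has the closed form $\eta_T = c + a\sin\omega T/\omega T$ (the sine term integrates to zero over the symmetric interval), so it tends to the mean $c$ for every outcome.

```python
import math
import random

def time_average(x, T, steps=100000):
    """Midpoint-rule approximation of (13-1): (1/2T) * int_{-T}^{T} x(t) dt."""
    h = 2.0 * T / steps
    return sum(x(-T + (i + 0.5) * h) for i in range(steps)) * h / (2.0 * T)

rng = random.Random(1)
omega, c = 2.0, 3.0                       # illustrative constants
a, b = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)   # one outcome of the RVs a and b

def x(t):
    """One realization of x(t) = a cos(wt) + b sin(wt) + c."""
    return a * math.cos(omega * t) + b * math.sin(omega * t) + c

for T in (1.0, 10.0, 100.0, 1000.0):
    exact = c + a * math.sin(omega * T) / (omega * T)
    print(T, time_average(x, T), exact)
```

The time average approaches $c$ even though $C(\tau) = \sigma^2\cos\omega\tau$ does not tend to 0, which shows that condition (13-10) is sufficient but not necessary.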
Condition (13-10) is satisfied if the RVs $x(t + \tau)$ and $x(t)$ are uncorrelated for large $\tau$.

Note. The time average $\eta_T$ is an unbiased estimator of $\eta$; however, it is not best. An estimator with smaller variance results if we use the weighted average

$$\eta_w = \int_{-T}^{T} w(t)\,x(t)\,dt$$

and select the function $w(t)$ appropriately (see also Example 8-4).

DISCRETE-TIME PROCESSES. We outline next, without elaboration, the discrete-time version of the preceding results. We are given a real stationary process $x[n]$ with autocovariance $C[m]$ and we form the time average

$$\eta_M = \frac{1}{N}\sum_{n=-M}^{M} x[n] \qquad N = 2M + 1 \tag{13-11}$$

This is an unbiased estimator of the mean of $x[n]$ and its variance equals

$$\sigma_M^2 = \frac{1}{N}\sum_{m=-2M}^{2M} C[m]\left(1 - \frac{|m|}{N}\right) \tag{13-12}$$

The process $x[n]$ is mean-ergodic if the right side of (13-12) tends to 0 as $M \to \infty$.

SLUTSKY'S THEOREM. The process $x[n]$ is mean-ergodic iff

$$\frac{1}{M}\sum_{m=0}^{M} C[m] \xrightarrow[M\to\infty]{} 0 \tag{13-13}$$

We can show as in (13-10) that if $C[m] \to 0$ as $m \to \infty$, then $x[n]$ is mean-ergodic. For large $M$,

$$\sigma_M^2 \simeq \frac{1}{N}\sum_{m=-M}^{M} C[m] \tag{13-14}$$

Example 13-6. (a) Suppose that the centered process $\tilde{x}[n] = x[n] - \eta$ is white noise with autocovariance $P\delta[m]$. In this case,

$$C[m] = P\delta[m] \qquad \sigma_M^2 = \frac{1}{N}\sum_{m=-M}^{M} P\delta[m] = \frac{P}{N}$$

Thus $x[n]$ is mean-ergodic and the variance of $\eta_M$ equals $P/N$. This agrees with (8-22): The RVs $x[n]$ are i.i.d. with variance $C[0] = P$, and the time average $\eta_M$ is their sample mean.
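The discrete-time variance (13-12) is easy to verify directly. The sketch below is a hedged illustration: the function names, the value of $P$, and the geometric autocovariance used for the cross-check are assumptions introduced here. It compares (13-12) with the defining double sum $\frac{1}{N^2}\sum_n\sum_k C[n - k]$ and confirms the white-noise value $P/N$ of part (a).

```python
def var_time_average(C, M):
    """Variance (13-12) of eta_M for autocovariance C[m], with N = 2M + 1."""
    N = 2 * M + 1
    return sum(C(m) * (1.0 - abs(m) / N) for m in range(-2 * M, 2 * M + 1)) / N

def var_brute_force(C, M):
    """Same variance from the definition: (1/N^2) * sum_n sum_k C[n - k]."""
    N = 2 * M + 1
    return sum(C(n - k) for n in range(N) for k in range(N)) / N ** 2

P = 4.0                                   # illustrative noise power

def white(m):
    """Autocovariance P * delta[m] of white noise."""
    return P if m == 0 else 0.0

def geometric(m):
    """Hypothetical correlated example: C[m] = P * 0.8**|m|."""
    return P * 0.8 ** abs(m)

M = 10
print(var_time_average(white, M), P / (2 * M + 1))
print(var_time_average(geometric, M), var_brute_force(geometric, M))
```

For correlated samples the variance is larger than $P/N$, which reflects the smaller number of effectively independent terms in the average.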
(b) Suppose now that $C[m] = Pa^{|m|}$ as in Example 10-31. In this case (13-14) yields

$$\sigma_M^2 \simeq \frac{1}{N}\sum_{m=-M}^{M} Pa^{|m|} \simeq \frac{P(1 + a)}{N(1 - a)}$$

Note that if we replace $x[n]$ by white noise as in (a) with the same $P$ and use as estimate of $\eta$ the time average of $N_1$ terms, the variance $P/N_1$ of the resulting estimator will equal $\sigma_M^2$ if

$$N_1 = N\,\frac{1 - a}{1 + a}$$

Sampling. In a numerical estimate of the mean of a continuous-time process $x(t)$, the time average $\eta_T$ is replaced by the average

$$\eta_N = \frac{1}{N}\sum_n x(t_n)$$

of the $N$ samples $x(t_n)$ of $x(t)$. This is an unbiased estimate of $\eta$ and its variance equals

$$\sigma_N^2 = \frac{1}{N^2}\sum_n\sum_k C(t_n - t_k)$$

where $C(\tau)$ is the autocovariance of $x(t)$. If the samples are equidistant, then the RVs $x(t_n) = x(nT_0)$ form a discrete-time process with autocovariance $C(mT_0)$. In this case, the variance $\sigma_N^2$ of $\eta_N$ is given by (13-12) if we replace $C[m]$ by $C(mT_0)$.

SPECTRAL INTERPRETATION OF ERGODICITY. We shall express the ergodicity conditions in terms of the properties of the covariance spectrum

$$S^c(\omega) = S(\omega) - 2\pi\eta^2\delta(\omega)$$

of the process $x(t)$.

The variance $\sigma_T^2$ of $\eta_T$ equals the variance of the moving average $w(t)$ of $x(t)$ [see (13-2)]. As we know,

$$S_{ww}^c(\omega) = S^c(\omega)\,\frac{\sin^2 T\omega}{T^2\omega^2}$$

hence

$$\sigma_T^2 = \frac{1}{2\pi}\int_{-\infty}^{\infty} S^c(\omega)\,\frac{\sin^2 T\omega}{T^2\omega^2}\,d\omega \tag{13-16}$$

The fraction in (13-16) takes significant values only in an interval of the order of $1/T$ centered at the origin. The ergodicity conditions of $x(t)$ depend, therefore, only on the behavior of $S^c(\omega)$ near the origin.

Suppose first that the process $x(t)$ is regular. In this case, $S^c(\omega)$ does not have an impulse at $\omega = 0$. If, therefore, $T$ is sufficiently large, we can use the
approximation $S^c(\omega) \simeq S^c(0)$ in (13-16). This yields

$$\sigma_T^2 \simeq \frac{S^c(0)}{2\pi}\int_{-\infty}^{\infty}\frac{\sin^2 T\omega}{T^2\omega^2}\,d\omega = \frac{S^c(0)}{2T} \xrightarrow[T\to\infty]{} 0 \tag{13-17}$$

Hence $x(t)$ is mean-ergodic.

Suppose now that

$$S^c(\omega) = S_1^c(\omega) + 2\pi k_0\,\delta(\omega) \qquad S_1^c(0) < \infty$$

Inserting into (13-16), we conclude as in (13-17) that

$$\sigma_T^2 \simeq \frac{1}{2T}S_1^c(0) + k_0 \xrightarrow[T\to\infty]{} k_0 \tag{13-18}$$

Hence $x(t)$ is not mean-ergodic. This case arises if in Wold's decomposition (12-89) the constant term $c_0$ is different from 0, or, equivalently, if the Fourier transform $X(\omega)$ of $x(t)$ contains the impulse $2\pi c_0\delta(\omega)$.

Example 13-7. Consider the process

$$y(t) = a\,x(t) \qquad E\{a\} = 0$$

where $x(t)$ is a mean-ergodic process independent of the RV $a$. Clearly, $E\{y(t)\} = 0$ and

$$S_{yy}^c(\omega) = \sigma_a^2 S_{xx}^c(\omega) + 2\pi\sigma_a^2\eta_x^2\,\delta(\omega)$$

as in Example 12-12. This shows that the process $y(t)$ is not mean-ergodic.

The preceding discussion leads to the following equivalent conditions for mean ergodicity:

1. $\sigma_T$ must tend to 0 as $T \to \infty$.
2. In Wold's decomposition (12-89) the constant random term $c_0$ must be 0.
3. The integrated power spectrum $F^c(\omega)$ must be continuous at the origin.
4. The integrated Fourier transform $Z(\omega)$ must be continuous at the origin.

Analog estimators. The mean $\eta$ of a process $x(t)$ can be estimated by the response of a physical system with input $x(t)$. A simple example is a normalized integrator of finite integration time. This is a linear device with impulse response the rectangular pulse $p(t)$ of Fig. 13-2. For $t > T_0$ the output of the integrator equals

$$y(t) = \frac{1}{T_0}\int_{t-T_0}^{t} x(\alpha)\,d\alpha$$

If $T_0$ is large compared to the correlation time $\tau_c$ of $x(t)$, then the variance of $y(t)$ equals $2\tau_c C(0)/T_0$. This follows from (13-6) with $T_0 = 2T$.
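The failure of mean-ergodicity in Example 13-7 can be seen in a small discrete-time simulation. This is a sketch under assumptions that are not in the text: $x[n]$ is taken as $\eta_x$ plus unit-variance white Gaussian noise, the RV $a$ takes the values $\pm 1$ with equal probability, and sample means stand in for the time integral. Each realization's time average settles near $a\eta_x$, not near the ensemble mean $E\{y(t)\} = 0$.

```python
import random

def time_average(samples):
    """Sample mean along one realization (discrete stand-in for (13-1))."""
    return sum(samples) / len(samples)

def realization(rng, n, eta_x=2.0):
    """One outcome of y[n] = a * x[n], with x[n] = eta_x + unit white Gaussian noise."""
    a = rng.choice([-1.0, 1.0])     # zero-mean RV a, fixed along the whole path
    y = [a * (eta_x + rng.gauss(0.0, 1.0)) for _ in range(n)]
    return a, y

rng = random.Random(7)
results = []
for _ in range(4):
    a, y = realization(rng, 20000)
    results.append((a, time_average(y)))
    print(a, results[-1][1])        # time average is close to a * eta_x = +/- 2
```

The residual constant $a\eta_x$ corresponds to the impulse $2\pi\sigma_a^2\eta_x^2\delta(\omega)$ in the covariance spectrum of $y(t)$.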
Suppose now that $x(t)$ is the input to a system with impulse response $h(t)$ of unit area and energy $E$:

$$w(t) = \int_0^{t} x(\alpha)\,h(t - \alpha)\,d\alpha \qquad E = \int_0^{\infty} h^2(t)\,dt$$

We assume that $C(\tau) \simeq 0$ for $\tau > T_1$ and $h(t) \simeq 0$ for $t > T_0 > T_1$ as in Fig. 13-2. From these assumptions it follows that $E\{w(t)\} \simeq \eta$ and $\sigma_w^2 \simeq EC(0)\tau_c$ for $t > T_0$. If, therefore, $EC(0)\tau_c \ll \eta^2$, then $w(t) \simeq \eta$ for $t > T_0$. The above conditions are satisfied if the system is low-pass, that is, if $H(\omega) \simeq 0$ for $|\omega| > \omega_c$ and $\omega_c \ll \eta^2/C(0)\tau_c$.

FIGURE 13-2

Covariance-Ergodic Processes

We shall now determine the conditions that an SSS process $x(t)$ must satisfy such that its autocovariance $C(\lambda)$ can be estimated as a time average. The results are essentially the same for the estimates of the autocorrelation $R(\lambda)$ of $x(t)$.

VARIANCE. We start with the estimate of the variance

$$V = C(0) = E\{|x(t) - \eta|^2\} = E\{x^2(t)\} - \eta^2 \tag{13-19}$$

of $x(t)$.

Known mean. Suppose, first, that $\eta$ is known. We can then assume, replacing the process $x(t)$ by its centered process $x(t) - \eta$, that

$$E\{x(t)\} = 0 \qquad V = E\{x^2(t)\}$$

Our problem is thus to estimate the mean $V$ of the process $x^2(t)$. Proceeding as in (13-1), we use as the estimate of $V$ the time average

$$V_T = \frac{1}{2T}\int_{-T}^{T} x^2(t)\,dt \tag{13-20}$$

This estimate is unbiased and its variance is given by (13-4) where we replace
the function $C(\tau)$ by the autocovariance

$$C_{x^2x^2}(\tau) = E\{x^2(t + \tau)\,x^2(t)\} - E^2\{x^2(t)\} \tag{13-21}$$

of the process $x^2(t)$. Applying (13-7) to this process, we conclude that $x(t)$ is variance-ergodic iff

$$\frac{1}{T}\int_0^{T} E\{x^2(t + \tau)\,x^2(t)\}\,d\tau \xrightarrow[T\to\infty]{} C^2(0) \tag{13-22}$$

To test the validity of (13-22), we need the fourth-order moments of $x(t)$. If, however, $x(t)$ is a normal process, then [see (10-68)]

$$C_{x^2x^2}(\tau) = 2C^2(\tau) \tag{13-23}$$

From this and (13-22) it follows that a normal process is variance-ergodic iff

$$\frac{1}{T}\int_0^{T} C^2(\tau)\,d\tau \xrightarrow[T\to\infty]{} 0 \tag{13-24}$$

Using the simple inequality (see Prob. 13-10)

$$\left|\frac{1}{T}\int_0^{T} C(\tau)\,d\tau\right|^2 \le \frac{1}{T}\int_0^{T} C^2(\tau)\,d\tau$$

we conclude with (13-7) and (13-24) that if a normal process is variance-ergodic, it is also mean-ergodic. The converse, however, is not true. This theorem has the following spectral interpretation: The process $x(t)$ is mean-ergodic iff $S^c(\omega)$ has no impulses at the origin; it is variance-ergodic iff $S^c(\omega)$ has no impulses anywhere.

Example 13-8. Suppose that the process

$$x(t) = a\cos\omega t + b\sin\omega t + \eta$$

is normal and stationary. Clearly, $x(t)$ is mean-ergodic because it does not contain a random constant. However, it is not variance-ergodic because the square

$$|x(t) - \eta|^2 = \tfrac{1}{2}(a^2 + b^2) + \tfrac{1}{2}(a^2\cos 2\omega t - b^2\cos 2\omega t) + ab\sin 2\omega t$$

of $x(t) - \eta$ contains the random constant $(a^2 + b^2)/2$.

Unknown mean. If $\eta$ is unknown, we evaluate its estimator $\eta_T$ from (13-1) and form the average

$$\hat{V}_T = \frac{1}{2T}\int_{-T}^{T}[x(t) - \eta_T]^2\,dt = \frac{1}{2T}\int_{-T}^{T} x^2(t)\,dt - \eta_T^2$$

The determination of the statistical properties of $\hat{V}_T$ is difficult. The following observations, however, simplify the problem. In general, $\hat{V}_T$ is a biased estimator of the variance $V$ of $x(t)$. However, if $T$ is large, the bias can be neglected in the determination of the estimation error; furthermore, the variance of $\hat{V}_T$ can be approximated by the variance of the known-mean estimator $V_T$. In many cases,
the MS error $E\{(\hat{V}_T - V)^2\}$ is smaller than $E\{(V_T - V)^2\}$ for moderate values of $T$. It might thus be preferable to use $\hat{V}_T$ as the estimator of $V$ even when $\eta$ is known.

AUTOCOVARIANCE. We shall establish the ergodicity conditions for the autocovariance $C(\lambda)$ of the process $x(t)$ under the assumption that $E\{x(t)\} = 0$. We can do so, replacing $x(t)$ by $x(t) - \eta$ if $\eta$ is known. If it is unknown, we replace $x(t)$ by $x(t) - \eta_T$. In this case, the results are approximately correct if $T$ is large.

For a specific $\lambda$, the product $x(t + \lambda)x(t)$ is an SSS process with mean $C(\lambda)$. We can, therefore, use as the estimate of $C(\lambda)$ the time average

$$C_T(\lambda) = \frac{1}{2T}\int_{-T}^{T} z(t)\,dt \qquad z(t) = x(t + \lambda)\,x(t) \tag{13-25}$$

This is an unbiased estimator of $C(\lambda)$ and its variance is given by (13-4) if we replace the autocovariance of $x(t)$ by the autocovariance

$$C_{zz}(\tau) = E\{x(t + \lambda + \tau)\,x(t + \tau)\,x(t + \lambda)\,x(t)\} - C^2(\lambda)$$

of the process $z(t)$. Applying Slutsky's theorem, we conclude that the process $x(t)$ is covariance-ergodic iff

$$\frac{1}{T}\int_0^{T} C_{zz}(\tau)\,d\tau \xrightarrow[T\to\infty]{} 0 \tag{13-26}$$

If $x(t)$ is a normal process,

$$C_{zz}(\tau) = C(\lambda + \tau)\,C(\lambda - \tau) + C^2(\tau) \tag{13-27}$$

In this case, (13-6) yields

$$\operatorname{Var} C_T(\lambda) \simeq \frac{1}{T}\int_0^{2T}[C(\lambda + \tau)\,C(\lambda - \tau) + C^2(\tau)]\,d\tau \tag{13-28}$$

From (13-27) it follows that if $C(\tau) \to 0$, then $C_{zz}(\tau) \to 0$ as $\tau \to \infty$; hence $x(t)$ is covariance-ergodic.

Cross-covariance. We comment briefly on the estimate of the cross-covariance $C_{xy}(\tau)$ of two zero-mean processes $x(t)$ and $y(t)$. As in (13-25), the time average

$$\hat{C}_{xy}(\tau) = \frac{1}{2T}\int_{-T}^{T} x(t + \tau)\,y(t)\,dt \tag{13-29}$$

is an unbiased estimate of $C_{xy}(\tau)$ and its variance is given by (13-4) if we replace $C(\tau)$ by $C_{xy}(\tau)$. We note, finally, that if both processes are variance-ergodic, they are also cross-covariance-ergodic (see Prob. 13-9).

NONLINEAR ESTIMATORS. The numerical evaluation of the estimate $C_T(\lambda)$ of $C(\lambda)$ involves the evaluation of the integral of the product $x(t + \lambda)x(t)$ for
various values of $\lambda$. We show next that the computations can in certain cases be simplified if we replace one or both factors of this product by some function† of $x(t)$. We shall assume that the process $x(t)$ is normal with zero mean.

The arcsine law. We have shown in (10-71) that if $y(t)$ is the output of a hard limiter with input $x(t)$:

$$y(t) = \operatorname{sgn} x(t) = \begin{cases} 1 & x(t) > 0 \\ -1 & x(t) < 0 \end{cases}$$

then

$$C_{yy}(\tau) = \frac{2}{\pi}\arcsin\frac{C_{xx}(\tau)}{C_{xx}(0)} \tag{13-30}$$

The estimate of $C_{yy}(\tau)$ is given by

$$\hat{C}_{yy}(\tau) = \frac{1}{2T}\int_{-T}^{T}\operatorname{sgn} x(t + \tau)\,\operatorname{sgn} x(t)\,dt \tag{13-31}$$

This integral is simple to determine because the integrand equals $\pm 1$. Thus

$$\hat{C}_{yy}(\tau) = \frac{T^+}{T} - 1$$

where $T^+$ is the total time that $x(t + \tau)x(t) > 0$. This yields the estimate

$$\hat{C}_{xx}(\tau) = C_{xx}(0)\sin\left[\frac{\pi}{2}\hat{C}_{yy}(\tau)\right]$$

of $C_{xx}(\tau)$ within a factor.

Bussgang's theorem. We have shown in (10-72) that the cross-covariance of the processes $x(t)$ and $y(t) = \operatorname{sgn} x(t)$ is proportional to $C_{xx}(\tau)$:

$$C_{xy}(\tau) = K\,C_{xx}(\tau) \qquad K = \sqrt{\frac{2}{\pi C_{xx}(0)}} \tag{13-32}$$

To estimate $C_{xx}(\tau)$, it suffices, therefore, to estimate $C_{xy}(\tau)$. Using (13-29), we obtain

$$\hat{C}_{xx}(\tau) = \frac{1}{K}\hat{C}_{xy}(\tau) = \frac{1}{2KT}\int_{-T}^{T} x(t + \tau)\,\operatorname{sgn} x(t)\,dt \tag{13-33}$$

† S. Cambanis and E. Masry: "On the Reconstruction of the Covariance of Stationary Gaussian Processes Through Zero-Memory Nonlinearities," IEEE Transactions on Information Theory, vol. IT-24, 1978.

CORRELOMETERS AND SPECTROMETERS. A correlometer is a physical device measuring the autocorrelation $R(\lambda)$ of a process $x(t)$. In Fig. 13-3 we show two correlometers. The first consists of a delay element, a multiplier, and a low-pass
(LP) filter. The input to the LP filter is the process $x(t - \lambda)x(t)$; the output $y_1(t)$ is the estimate of the mean $R(\lambda)$ of the input. The second consists of a delay element, an adder, a square-law detector, and an LP filter. The input to the LP filter is the process $[x(t - \lambda) + x(t)]^2$; the output $y_2(t)$ is the estimate of the mean $2[R(0) + R(\lambda)]$ of the input.

FIGURE 13-3

A spectrometer is a physical device measuring the Fourier transform $S(\omega)$ of $R(\lambda)$. This device consists of a bandpass filter $B(\omega)$ with input $x(t)$ and output $y(t)$, in series with a square-law detector and an LP filter (Fig. 13-4). The input to the LP filter is the process $y^2(t)$; its output $z(t)$ is the estimate of the mean $E\{y^2(t)\}$ of the input. Suppose that $B(\omega)$ is a narrow-band filter of unit energy with center frequency $\omega_0$ and bandwidth $2c$. If the function $S(\omega)$ is continuous at $\omega_0$ and $c$ is sufficiently small, then $S(\omega) \simeq S(\omega_0)$ for $|\omega - \omega_0| < c$; hence [see (10-139)]

$$E\{y^2(t)\} = \frac{1}{2\pi}\int_{-\infty}^{\infty} S(\omega)\,B^2(\omega)\,d\omega \simeq \frac{S(\omega_0)}{2\pi}\int_{\omega_0 - c}^{\omega_0 + c} B^2(\omega)\,d\omega = S(\omega_0)$$

as in (10-153). This yields

$$z(t) \simeq E\{y^2(t)\} \simeq S(\omega_0)$$

We give next the optical realization of the correlometer of Fig. 13-3b and the spectrometer of Fig. 13-4.

FIGURE 13-4
The Michelson interferometer. The device of Fig. 13-5 is an optical correlometer. It consists of a light source $S$, a beam-splitting surface $B$, and two mirrors. Mirror $M_1$ is in a fixed position and mirror $M_2$ is movable. The light from the source $S$ is a random signal $x(t)$ traveling with velocity $c$ and it reaches a square-law detector $D$ along paths 1 and 2 as shown. The lengths of these paths equal $l$ and $l + 2d$ respectively, where $d$ is the displacement of mirror $M_2$ from its equilibrium position.

The signal reaching the detector is thus the sum

$$A\,x(t - t_0) + A\,x(t - t_0 - \lambda)$$

where $A$ is the attenuation in each path, $t_0 = l/c$ is the delay along path 1, and $\lambda = 2d/c$ is the additional delay due to the displacement of mirror $M_2$. The detector output is the signal

$$z(t) = A^2[x(t - t_0 - \lambda) + x(t - t_0)]^2$$

Clearly,

$$E\{z(t)\} = 2A^2[R(0) + R(\lambda)]$$

If, therefore, we use $z(t)$ as the input to a low-pass filter, its output $y(t)$ will be proportional to $R(0) + R(\lambda)$ provided that the process $x(t)$ is correlation-ergodic and the band of the filter is sufficiently narrow.

The Fabry-Perot interferometer. The device of Fig. 13-6 is an optical spectrometer. The bandpass filter consists of two highly reflective plates $P_1$ and $P_2$ distance $d$ apart and the input is a light beam $x(t)$ with power spectrum $S(\omega)$.
FIGURE 13-6

The frequency response of the filter is proportional to

$$B(\omega) = \frac{1}{1 - r^2 e^{-j2\omega d/c}}$$

where $r$ is the reflection coefficient of each plate and $c$ is the velocity of light in the medium $M$ between the plates. The function $B(\omega)$ is shown in Fig. 13-6b. It consists of a sequence of bands centered at

$$\omega_n = \frac{\pi n c}{d}$$

whose bandwidth tends to 0 as $r \to 1$. If only the $m$th band of $B(\omega)$ overlaps with $S(\omega)$ and $r \simeq 1$, then the output $z(t)$ of the LP filter is proportional to $S(\omega_m)$. To vary $\omega_m$, we can either vary the distance $d$ between the plates or the dielectric constant of the medium $M$.

Distribution-Ergodic Processes

Any parameter of a probabilistic model that can be expressed as the mean of some function of an SSS process $x(t)$ can be estimated by a time average. For a specific $x$, the distribution of $x(t)$ is the mean of the process $y(t) = U[x - x(t)]$:
$$y(t) = \begin{cases} 1 & x(t) \le x \\ 0 & x(t) > x \end{cases} \qquad E\{y(t)\} = P\{x(t) \le x\} = F(x)$$

Hence $F(x)$ can be estimated by the time average of $y(t)$. Inserting into (13-1), we obtain the estimator

$$F_T(x) = \frac{1}{2T}\int_{-T}^{T} y(t)\,dt = \frac{\tau_1 + \cdots + \tau_n}{2T} \tag{13-34}$$

where $\tau_i$ are the lengths of the time intervals during which $x(t)$ is less than $x$ (Fig. 13-7a).

FIGURE 13-7

To find the variance of $F_T(x)$, we must first find the autocovariance of $y(t)$. The product $y(t + \tau)y(t)$ equals 1 if $x(t + \tau) \le x$ and $x(t) \le x$; otherwise, it equals 0. Hence

$$R_y(\tau) = P\{x(t + \tau) \le x,\ x(t) \le x\} = F(x, x; \tau)$$

where $F(x, x; \tau)$ is the second-order distribution of $x(t)$. The variance of $F_T(x)$ is obtained from (13-4) if we replace $C(\tau)$ by the autocovariance $F(x, x; \tau) - F^2(x)$ of $y(t)$. From (13-7) it follows that a process $x(t)$ is distribution-ergodic iff

$$\frac{1}{T}\int_0^{T} F(x, x; \tau)\,d\tau \xrightarrow[T\to\infty]{} F^2(x) \tag{13-35}$$

A sufficient condition is obtained from (13-10): A process $x(t)$ is distribution-ergodic if $F(x, x; \tau) \to F^2(x)$ as $\tau \to \infty$. This is the case if the RVs $x(t)$ and $x(t + \tau)$ are independent for large $\tau$.

Density. To estimate the density of $x(t)$, we form the time intervals $\Delta\tau_i$ during which $x(t)$ is between $x$ and $x + \Delta x$ (Fig. 13-7b). From (13-34) it follows that

$$f(x)\,\Delta x \simeq F(x + \Delta x) - F(x) \simeq \frac{1}{2T}\sum_i \Delta\tau_i$$

Thus $f(x)\,\Delta x$ equals the percentage of time that a single sample of $x(t)$ is between $x$ and $x + \Delta x$. This can be used to design an analog estimator of $f(x)$.
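The estimator (13-34) amounts to measuring the fraction of time a single sample path spends at or below $x$. The sketch below illustrates this in discrete time under assumptions that are not part of the text: the process is a stationary Gaussian AR(1) sequence with unit variance (so that $F(x)$ is the standard normal distribution and samples far apart are nearly independent), and all function names are hypothetical.

```python
import math
import random

def ar1_path(rng, n, a=0.5):
    """Stationary unit-variance Gaussian AR(1): x[k] = a*x[k-1] + sqrt(1 - a^2)*w[k]."""
    x = rng.gauss(0.0, 1.0)               # start in the stationary distribution
    s = math.sqrt(1.0 - a * a)
    path = []
    for _ in range(n):
        path.append(x)
        x = a * x + s * rng.gauss(0.0, 1.0)
    return path

def distribution_estimate(path, x):
    """Discrete analog of (13-34): fraction of samples with value <= x."""
    return sum(1 for v in path if v <= x) / len(path)

def normal_cdf(x):
    """Standard normal distribution function, the true F(x) of this process."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

rng = random.Random(3)
path = ar1_path(rng, 200000)
for x in (-1.0, 0.0, 1.0):
    print(x, distribution_estimate(path, x), normal_cdf(x))
```

The same counting idea, applied to the interval $(x, x + \Delta x)$ and divided by $\Delta x$, gives a histogram-type estimate of the density $f(x)$.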
13-2 SPECTRAL ESTIMATION

We wish to estimate the power spectrum $S(\omega)$ of a real process $x(t)$ in terms of a single realization of a finite segment

$$x_T(t) = x(t)\,p_T(t) \qquad p_T(t) = \begin{cases} 1 & |t| < T \\ 0 & |t| > T \end{cases} \tag{13-36}$$

of $x(t)$. The spectrum $S(\omega)$ is not the mean of some function of $x(t)$. It cannot, therefore, be estimated directly as a time average. It is, however, the Fourier transform of the autocorrelation

$$R(\tau) = E\left\{x\left(t + \frac{\tau}{2}\right)x\left(t - \frac{\tau}{2}\right)\right\}$$

It will be determined in terms of the estimate of $R(\tau)$. This estimate cannot be computed from (13-25) because the product $x(t + \tau/2)\,x(t - \tau/2)$ is available only for $t$ in the interval $(-T + |\tau|/2,\ T - |\tau|/2)$ (Fig. 13-8). Changing $2T$ to $2T - |\tau|$, we obtain the estimate

$$R^T(\tau) = \frac{1}{2T - |\tau|}\int_{-T + |\tau|/2}^{T - |\tau|/2} x\left(t + \frac{\tau}{2}\right)x\left(t - \frac{\tau}{2}\right)dt \tag{13-37}$$

This integral specifies $R^T(\tau)$ for $|\tau| < 2T$; for $|\tau| > 2T$ we set $R^T(\tau) = 0$. The above estimate is unbiased; however, its variance increases as $|\tau|$ increases because the length $2T - |\tau|$ of the integration interval decreases.

FIGURE 13-8

Instead of $R^T(\tau)$, we shall use the product

$$R_T(\tau) = \left(1 - \frac{|\tau|}{2T}\right)R^T(\tau) \tag{13-38}$$

This estimator is biased; however, its variance is smaller than the variance of
$R^T(\tau)$. The main reason we use it is that its transform is proportional to the energy spectrum of the segment $x_T(t)$ of $x(t)$ [see (13-39)].

The periodogram

The periodogram of a process $x(t)$ is by definition the process

$$S_T(\omega) = \frac{1}{2T}\left|\int_{-T}^{T} x(t)\,e^{-j\omega t}\,dt\right|^2 \tag{13-39}$$

The above integral is the Fourier transform of the known segment $x_T(t)$ of $x(t)$:

$$S_T(\omega) = \frac{1}{2T}|X_T(\omega)|^2 \qquad X_T(\omega) = \int_{-T}^{T} x(t)\,e^{-j\omega t}\,dt$$

We shall express $S_T(\omega)$ in terms of the estimator $R_T(\tau)$ of $R(\tau)$.

THEOREM.

$$S_T(\omega) = \int_{-2T}^{2T} R_T(\tau)\,e^{-j\omega\tau}\,d\tau \tag{13-40}$$

Proof. The integral in (13-37) is the convolution of $x_T(t)$ with $x_T(-t)$ because $x_T(t) = 0$ for $|t| > T$. Hence

$$R_T(\tau) = \frac{1}{2T}\,x_T(\tau) * x_T(-\tau) \tag{13-41}$$

Since $x_T(t)$ is real, the transform of $x_T(-t)$ equals $X_T^*(\omega)$. This shows that (convolution theorem) the transform of $R_T(\tau)$ equals the right side of (13-39).

In the early years of signal analysis, the spectral properties of random processes were expressed in terms of their periodogram. This approach yielded reliable results so long as the integrations were based on analog techniques of limited accuracy. With the introduction of digital processing, the accuracy was improved and, paradoxically, the computed spectra exhibited noisy behavior. This apparent paradox can be readily explained in terms of the properties of the periodogram: The integral in (13-40) depends on all values of $R_T(\tau)$ for $\tau$ large and small. The variance of $R_T(\tau)$ is small for small $\tau$ only, and it increases as $\tau \to 2T$. As a result, $S_T(\omega)$ approaches a white-noise process with mean $S(\omega)$ as $T$ increases [see (13-57)].

To overcome this behavior of $S_T(\omega)$, we can do one of two things: (1) We replace in (13-40) the term $R_T(\tau)$ by the product $w(\tau)R_T(\tau)$ where $w(\tau)$ is a function (window) close to 1 near the origin, approaching 0 as $\tau \to 2T$. This deemphasizes the unreliable parts of $R_T(\tau)$, thus reducing the variance of its transform; (2) We convolve $S_T(\omega)$ with a suitable window as in (11-164).

We continue with the determination of the bias and the variance of $S_T(\omega)$.
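The identity (13-40) has an exact discrete-time counterpart that is easy to verify. The sketch below is an illustration, not the book's procedure: for a finite sequence $x[0], \ldots, x[N-1]$ (with $N$ playing the role of $2T$), it compares the periodogram $\frac{1}{N}|\sum_n x[n]e^{-j\omega n}|^2$ with the transform of the biased autocorrelation estimate $r[m] = \frac{1}{N}\sum_n x[n + |m|]\,x[n]$, the analog of $R_T(\tau)$. The function names and the test data are assumptions.

```python
import cmath
import random

def periodogram(x, w):
    """Discrete analog of (13-39): |sum_n x[n] exp(-jwn)|^2 / N."""
    N = len(x)
    X = sum(x[n] * cmath.exp(-1j * w * n) for n in range(N))
    return abs(X) ** 2 / N

def biased_autocorr(x, m):
    """Biased estimate r[m] = (1/N) * sum_n x[n+|m|] x[n], the analog of R_T(tau)."""
    N = len(x)
    m = abs(m)
    return sum(x[n + m] * x[n] for n in range(N - m)) / N

def transform_of_autocorr(x, w):
    """Discrete analog of the right side of (13-40)."""
    N = len(x)
    s = sum(biased_autocorr(x, m) * cmath.exp(-1j * w * m)
            for m in range(-(N - 1), N))
    return s.real   # imaginary part cancels because r[m] is even

rng = random.Random(5)
x = [rng.gauss(0.0, 1.0) for _ in range(64)]
for w in (0.0, 0.7, 2.5):
    print(w, periodogram(x, w), transform_of_autocorr(x, w))
```

Dividing by $N - |m|$ instead of $N$ would give the unbiased estimate corresponding to $R^T(\tau)$; the identity with the periodogram holds only for the biased version.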
Bias. From (13-38) and (13-40) it follows that
$$E\{S_T(\omega)\} = \int_{-2T}^{2T}\left(1 - \frac{|\tau|}{2T}\right)R(\tau)e^{-j\omega\tau}\,d\tau$$
Since
$$\left(1 - \frac{|\tau|}{2T}\right)p_{2T}(\tau) \leftrightarrow \frac{2\sin^2 T\omega}{T\omega^2}$$
we conclude that [see also (12-83)]
$$E\{S_T(\omega)\} = \int_{-\infty}^{\infty}\frac{\sin^2 T(\omega - y)}{\pi T(\omega - y)^2}\,S(y)\,dy \tag{13-42}$$
The above shows that the mean of the periodogram is a smoothed version of $S(\omega)$; however, the smoothing kernel $\sin^2 T(\omega - y)/\pi T(\omega - y)^2$ takes significant values only in an interval of the order of $1/T$ centered at $y = \omega$. If, therefore, $T$ is sufficiently large, we can set $S(y) \approx S(\omega)$ in (13-42) for every point of continuity of $S(\omega)$. Hence for large $T$,
$$E\{S_T(\omega)\} \approx S(\omega)\int_{-\infty}^{\infty}\frac{\sin^2 T(\omega - y)}{\pi T(\omega - y)^2}\,dy = S(\omega) \tag{13-43}$$
From this it follows that $S_T(\omega)$ is asymptotically an unbiased estimator of $S(\omega)$.

Data window. If $S(\omega)$ is not nearly constant in an interval of the order of $1/T$, the periodogram is a biased estimate of $S(\omega)$. To reduce the bias, we replace in (13-39) the process $x(t)$ by the product $c(t)x(t)$. This yields the modified periodogram
$$S_c(\omega) = \frac{1}{2T}\left|\int_{-T}^{T} c(t)x(t)e^{-j\omega t}\,dt\right|^2 \tag{13-44}$$
The factor $c(t)$ is called the data window. Denoting by $C(\omega)$ its Fourier transform, we conclude that [see (12-82)]
$$E\{S_c(\omega)\} = \frac{1}{4\pi T}\,S(\omega) * C^2(\omega) \tag{13-45}$$

VARIANCE. For the determination of the variance of $S_T(\omega)$, knowledge of the fourth-order moments of $x(t)$ is required. For normal processes, all moments can be expressed in terms of $R(\tau)$. Furthermore, as $T \to \infty$, the fourth-order moments of most processes approach the corresponding moments of a normal process with the same autocorrelation (see Papoulis, 1977). We can assume, therefore, without essential loss of generality, that $x(t)$ is normal with zero mean.
THEOREM. For large $T$:
$$\operatorname{Var} S_T(\omega) \approx \begin{cases} 2S^2(0) & \omega = 0 \\ S^2(\omega) & |\omega| \gg 1/T \end{cases} \tag{13-46}$$
at every point of continuity of $S(\omega)$.

Proof. The Fourier transform of the autocorrelation $R(t_1 - t_2)p_T(t_1)p_T(t_2)$ of the process $x_T(t)$ equals
$$\Gamma(u, v) = \int_{-\infty}^{\infty}\frac{2\sin T\alpha\,\sin T(u + v - \alpha)}{\pi\alpha(u + v - \alpha)}\,S(u - \alpha)\,d\alpha \tag{13-47}$$
This follows from (12-80) with $W(\omega) = 2\sin T\omega/\omega$. The fraction in (13-47) takes significant values only if the terms $\alpha T$ and $(u + v - \alpha)T$ are of the order of 1; hence, the entire fraction is negligible if $|u + v| \gg 1/T$. Setting $u = v = \omega$, we conclude that $\Gamma(\omega, \omega) \approx 0$ and
$$\Gamma(\omega, -\omega) = \int_{-\infty}^{\infty}\frac{2\sin^2 T\alpha}{\pi\alpha^2}\,S(\omega - \alpha)\,d\alpha \approx S(\omega)\int_{-\infty}^{\infty}\frac{2\sin^2 T\alpha}{\pi\alpha^2}\,d\alpha = 2TS(\omega) \tag{13-48}$$
for $|\omega| \gg 1/T$ and since [see (12-74)]
$$\operatorname{Var} S_T(\omega) = \frac{1}{4T^2}\left[\Gamma^2(\omega, -\omega) + \Gamma^2(\omega, \omega)\right]$$
and $\Gamma(0, 0) = 2TS(0)$, (13-46) follows.

Note. For a specific $\tau$, no matter how large, the estimate $R_T(\tau) \to R(\tau)$ as $T \to \infty$. Its transform $S_T(\omega)$, however, does not tend to $S(\omega)$ as $T \to \infty$. The reason is that the convergence of $R_T(\tau)$ to $R(\tau)$ is not uniform in $\tau$, that is, given $\varepsilon > 0$, we cannot find a constant $T_0$ independent of $\tau$ such that $|R_T(\tau) - R(\tau)| < \varepsilon$ for every $\tau$ and every $T > T_0$.

Proceeding similarly, we can show that the variance of the spectrum $S_c(\omega)$ obtained with the data window $c(t)$ is essentially equal to the variance of $S_T(\omega)$. This shows that use of data windows does not reduce the variance of the estimate. To improve the estimation, we must replace in (13-40) the sample autocorrelation $R_T(\tau)$ by the product $w(\tau)R_T(\tau)$, or, equivalently, we must smooth the periodogram $S_T(\omega)$.

Note. Data windows might be useful if we smooth $S_T(\omega)$ by an ensemble average: Suppose that we have access to $N$ independent samples $x(t, \zeta_i)$ of $x(t)$, or, we divide a single long sample into $N$ essentially independent pieces, each of duration $2T$. We form
the periodograms $S_T(\omega, \zeta_i)$ of each sample and their average
$$\bar{S}_T(\omega) = \frac{1}{N}\sum_{i=1}^{N} S_T(\omega, \zeta_i) \tag{13-49}$$
As we know,
$$E\{\bar{S}_T(\omega)\} = S(\omega) * \frac{\sin^2 T\omega}{\pi T\omega^2} \qquad \operatorname{Var}\bar{S}_T(\omega) \approx \frac{1}{N}S^2(\omega) \tag{13-50}$$
If $N$ is large, the variance of $\bar{S}_T(\omega)$ is small. However, its bias might be significant. Use of data windows is in this case desirable.

Smoothed Spectrum

We shall assume as before that $T$ is large and $x(t)$ is normal. To improve the estimate, we form the smoothed spectrum
$$S_w(\omega) = \frac{1}{2\pi}\int_{-\infty}^{\infty} S_T(\omega - y)W(y)\,dy = \int_{-2T}^{2T} w(\tau)R_T(\tau)e^{-j\omega\tau}\,d\tau \tag{13-51}$$
where
$$w(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} W(\omega)e^{j\omega t}\,d\omega$$
The function $w(t)$ is called the lag window and its transform $W(\omega)$ the spectral window. We shall assume that $W(-\omega) = W(\omega)$ and
$$w(0) = 1 = \frac{1}{2\pi}\int_{-\infty}^{\infty} W(\omega)\,d\omega \qquad W(\omega) \ge 0 \tag{13-52}$$

Bias. From (13-42) it follows that
$$E\{S_w(\omega)\} = \frac{1}{2\pi}\,S(\omega) * \frac{\sin^2 T\omega}{\pi T\omega^2} * W(\omega)$$
Assuming that $W(\omega)$ is nearly constant in any interval of length $1/T$, we obtain the large-$T$ approximation
$$E\{S_w(\omega)\} \approx \frac{1}{2\pi}\,S(\omega) * W(\omega) \tag{13-53}$$

Variance. We shall determine the variance of $S_w(\omega)$ using the identity [see (12-74)]
$$\operatorname{Cov}[S_T(u), S_T(v)] = \frac{1}{4T^2}\left[\Gamma^2(u, -v) + \Gamma^2(u, v)\right] \tag{13-54}$$
This problem is in general complicated. We shall outline an approximate solution based on the following assumptions: The constant $T$ is large in the sense that the functions $S(\omega)$ and $W(\omega)$ are nearly constant in any interval of length $1/T$. The width of $W(\omega)$, that is, the constant $\sigma$ such that $W(\omega) = 0$ for
$|\omega| > \sigma$, is small in the sense that $S(\omega)$ is nearly constant in any interval of length $2\sigma$.

Reasoning as in the proof of (13-48), we conclude from (13-47) that $\Gamma(u, v) \approx 0$ for $u + v \gg 1/T$ and
$$\Gamma(u, -v) \approx S(u)\int_{-\infty}^{\infty}\frac{2\sin T(u - v - \alpha)\,\sin T\alpha}{\pi(u - v - \alpha)\alpha}\,d\alpha = S(u)\,\frac{2\sin T(u - v)}{u - v}$$
This is the generalization of (13-48). Inserting into (13-54), we obtain
$$\operatorname{Cov}[S_T(u), S_T(v)] \approx \frac{\sin^2 T(u - v)}{T^2(u - v)^2}\,S^2(u) \tag{13-55}$$
Equation (13-46) is a special case obtained with $u = v = \omega$.

THEOREM. For $|\omega| \gg 1/T$
$$\operatorname{Var} S_w(\omega) \approx \frac{E_w}{2T}\,S^2(\omega) \tag{13-56}$$
where
$$E_w = \frac{1}{2\pi}\int_{-\infty}^{\infty} W^2(\omega)\,d\omega$$
Proof. The smoothed spectrum $S_w(\omega)$ equals the convolution of $S_T(\omega)$ with the spectral window $W(\omega)/2\pi$. From this and (10-87) it follows mutatis mutandis that the variance of $S_w(\omega)$ is a double convolution involving the covariance of $S_T(\omega)$ and the window $W(\omega)$. The fraction in (13-55) is negligible for $|u - v| \gg 1/T$. In any interval of length $1/T$, the function $W(\omega)$ is nearly constant by assumption. This leads to the conclusion that in the evaluation of the variance of $S_w(\omega)$, the covariance of $S_T(\omega)$ can be approximated by an impulse of area equal to the area
$$S^2(u)\int_{-\infty}^{\infty}\frac{\sin^2 T(u - v)}{T^2(u - v)^2}\,dv = \frac{\pi}{T}\,S^2(u)$$
of the right side of (13-55). This yields
$$\operatorname{Cov}[S_T(u), S_T(v)] \approx q(u)\,\delta(u - v) \qquad q(u) = \frac{\pi}{T}\,S^2(u) \tag{13-57}$$
From the above and (10-91) it follows that
$$\operatorname{Var} S_w(\omega) \approx \frac{\pi}{T}\int_{-\infty}^{\infty} S^2(\omega - y)\,\frac{W^2(y)}{4\pi^2}\,dy \approx \frac{S^2(\omega)}{2T}\int_{-\infty}^{\infty}\frac{W^2(y)}{2\pi}\,dy$$
and (13-56) results.

WINDOW SELECTION. The selection of the window pair $w(t) \leftrightarrow W(\omega)$ depends on two conflicting requirements: For the variance of $S_w(\omega)$ to be small, the energy $E_w$ of the lag window $w(t)$ must be small compared to $T$. From this it
follows that $w(t)$ must approach 0 as $t \to 2T$. We can assume, therefore, without essential loss of generality that $w(t) = 0$ for $|t| > M$ where $M$ is a fraction of $2T$. Thus
$$S_w(\omega) = \int_{-M}^{M} w(t)R_T(t)e^{-j\omega t}\,dt \qquad M < 2T$$
The mean of $S_w(\omega)$ is a smoothed version of $S(\omega)$. To reduce the effect of the resulting bias, we must use a spectral window $W(\omega)$ of short duration. This is in conflict with the requirement that $M$ be small (uncertainty principle). The final choice of $M$ is a compromise between bias and variance.

The quality of the estimate depends on $M$ and on the shape of $w(t)$. To separate the shape factor from the size factor, we express $w(t)$ as a scaled version of a normalized window $w_0(t)$ of size 2:
$$w(t) = w_0\!\left(\frac{t}{M}\right) \leftrightarrow W(\omega) = M\,W_0(M\omega) \tag{13-58}$$
where $w_0(t) = 0$ for $|t| > 1$.

The critical parameter in the selection of a window is the scaling factor $M$. In the absence of any prior information, we have no way of determining the optimum size of $M$. The following considerations, however, are useful: A reasonable measure of the reliability of the estimation is the ratio
$$\frac{\operatorname{Var} S_w(\omega)}{S^2(\omega)} \approx \frac{E_w}{2T} = \alpha \tag{13-59}$$
For most windows in use, $E_w$ is between $0.5M$ and $0.8M$ (see Table 13-1). If we set $\alpha = 0.2$ as the largest acceptable $\alpha$, we must set $M < T/2$.

If nothing is known about $S(\omega)$, we estimate it several times using windows of decreasing size. We start with $M = T/2$ and observe the form of the resulting estimate $S_w(\omega)$. This estimate might not be very reliable; however, it gives us some idea of the form of $S(\omega)$. If we see that the estimate is nearly constant in any interval of the order of $1/M$, we conclude that the initial choice $M = T/2$ is too large. A reduction of $M$ will not appreciably affect the bias but it will yield a smaller variance. We repeat this process until we obtain a balance between bias and variance. As we show later, for optimum balance, the standard deviation of the estimate must equal twice its bias. The quality of the estimate depends, of course, on the size of the available sample.
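The lag-window estimate and the role of the window energy can be tried out on sampled data. The following sketch makes several assumptions that are not in the text: the data are uniformly sampled, the integer `size` plays the role of $M$, the biased sample autocorrelation stands in for $R_T(\tau)$, and the function names are hypothetical. It uses the normalized Bartlett and Tukey windows listed in Table 13-1.

```python
import cmath
import math

def sample_autocorrelation(x, m):
    """Biased estimate (1/n) sum_k x[k + |m|] x[k]."""
    n = len(x)
    m = abs(m)
    return sum(x[k + m] * x[k] for k in range(n - m)) / n

def bartlett(t):
    """Normalized Bartlett lag window w0(t) = 1 - |t| for |t| <= 1."""
    return max(0.0, 1.0 - abs(t))

def tukey(t):
    """Normalized Tukey lag window w0(t) = (1 + cos(pi t)) / 2 for |t| <= 1."""
    return 0.5 * (1.0 + math.cos(math.pi * t)) if abs(t) <= 1.0 else 0.0

def smoothed_spectrum(x, omega, size, window=bartlett):
    """Transform of w0(m / size) * R[m] over the lags |m| <= size."""
    total = sum(
        window(m / size) * sample_autocorrelation(x, m) * cmath.exp(-1j * omega * m)
        for m in range(-size, size + 1)
    )
    return total.real

def window_energy(window, steps=20000):
    """Midpoint-rule estimate of the energy of w0(t) over [-1, 1]."""
    h = 2.0 / steps
    return sum(window(-1.0 + (i + 0.5) * h) ** 2 for i in range(steps)) * h
```

With the Bartlett window the estimate stays nonnegative, because both the window and the biased autocorrelation have nonnegative transforms; the Tukey window does not guarantee this. The computed energies of the normalized windows come out near 2/3 and 3/4, so that $E_w$ is about $0.67M$ and $0.75M$ after scaling.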
If, for the given $T$, the resulting $S_w(\omega)$ is not smooth for $M = T/2$, we conclude that $T$ is not large enough for a satisfactory estimate.

To complete the specification of the window, we must select the form of $w_0(t)$. In this selection, we are guided by the following considerations:

1. The window $W(\omega)$ must be positive and its area must equal $2\pi$ as in (13-52). This ensures the positivity and consistency of the estimation.
2. For small bias, the "duration" of $W(\omega)$ must be small. A measure of duration is the second moment
$$m_2 = \frac{1}{2\pi}\int_{-\infty}^{\infty}\omega^2 W(\omega)\,d\omega \tag{13-60}$$

3. The function $W(\omega)$ must go to 0 rapidly as $\omega$ increases (small sidelobes). This reduces the effect of distant peaks in the estimate of $S(\omega)$. As we know, the asymptotic properties of $W(\omega)$ depend on the continuity properties of its inverse $w(t)$. Since $w(t) = 0$ for $|t| > M$, the condition that $W(\omega) \to 0$ as $A/\omega^n$ as $\omega \to \infty$ leads to the requirement that the derivatives of $w(t)$ of order up to $n - 1$ be zero at the end-points $\pm M$ of the lag window $w(t)$:
$$w(\pm M) = w'(\pm M) = \cdots = w^{(n-1)}(\pm M) = 0 \tag{13-61}$$

4. The energy $E_w$ of $w(t)$ must be small. This reduces the variance of the estimate.

TABLE 13-1

| Window | $w(t)$ for $\lvert t\rvert \le 1$ | $W(\omega)$ | $m_2$ | $E_w$ | $n$ |
|---|---|---|---|---|---|
| 1. Bartlett | $1 - \lvert t\rvert$ | $\dfrac{4\sin^2(\omega/2)}{\omega^2}$ | $\infty$ | $2/3$ | 2 |
| 2. Tukey | $\tfrac{1}{2}(1 + \cos\pi t)$ | $\dfrac{\pi^2\sin\omega}{\omega(\pi^2 - \omega^2)}$ | $\pi^2/2$ | $3/4$ | 3 |
| 3. Parzen | $[\sqrt{3}(1 - 2\lvert t\rvert)p_{1/2}(t)] * [\sqrt{3}(1 - 2\lvert t\rvert)p_{1/2}(t)]$ | $\dfrac{3}{4}\left(\dfrac{\sin(\omega/4)}{\omega/4}\right)^4$ | 12 | 0.539 | 4 |
| 4. Papoulis† | $\dfrac{1}{\pi}\lvert\sin\pi t\rvert + (1 - \lvert t\rvert)\cos\pi t$ | $\dfrac{8\pi^2\cos^2(\omega/2)}{(\pi^2 - \omega^2)^2}$ | $\pi^2$ | 0.587 | 4 |

†A. Papoulis: "Minimum Bias Windows for High Resolution Spectral Estimates," IEEE Transactions on Information Theory, vol. IT-19, 1973.

Over the years, a variety of windows have been proposed. They meet more or less the stated requirements but most of them are selected empirically. Optimality criteria leading to windows that do not depend on the form of the
unknown $S(\omega)$ are difficult to generate. However, as we show next, for high-resolution estimates (large $T$) the last example of Table 13-1 minimizes the bias. In this table and in Fig. 13-9, we list the most common window pairs $w(t) \leftrightarrow W(\omega)$. We also show the values of the second moment $m_2$, the energy $E_w$, and the exponent $n$ of the asymptotic attenuation $A/\omega^n$ of $W(\omega)$. In all cases, $w(t) = 0$ for $|t| > 1$.

FIGURE 13-9

OPTIMUM WINDOWS. We introduce next three classes of windows. In all cases, we assume that the data size $T$ and the scaling factor $M$ are large (high-resolution estimates) in the sense that we can use the parabolic approximation of $S(\omega - \alpha)$ in the evaluation of the bias. This yields [see (11-168)]
$$\frac{1}{2\pi}\int_{-\infty}^{\infty} S(\omega - \alpha)W(\alpha)\,d\alpha \approx S(\omega) + \frac{S''(\omega)}{4\pi}\int_{-\infty}^{\infty}\alpha^2 W(\alpha)\,d\alpha \tag{13-62}$$
Note that since $W(\omega) > 0$, the above is an equality if we replace the term $S''(\omega)$ by $S''(\omega + \delta)$ where $\delta$ is a constant in the region where $W(\omega)$ takes significant values.

Minimum bias data window. The modified periodogram $S_c(\omega)$ obtained with the data window $c(t)$ is a biased estimator of $S(\omega)$. Inserting (13-62) into (13-45), we conclude that the bias equals
$$B_c(\omega) = \frac{1}{2\pi}\int_{-\infty}^{\infty} S(\omega - \alpha)C^2(\alpha)\,d\alpha - S(\omega) \approx \frac{S''(\omega)}{4\pi}\int_{-\infty}^{\infty}\alpha^2 C^2(\alpha)\,d\alpha \tag{13-63}$$
We have thus expressed the bias as a product where the first factor depends only on $S(\omega)$ and the second depends only on $C(\omega)$. This separation permits us to find $C(\omega)$ so as to minimize $B_c(\omega)$. To do so, it suffices to minimize the second moment
$$M_2 = \frac{1}{2\pi}\int_{-\infty}^{\infty}\omega^2 C^2(\omega)\,d\omega = \int_{-T}^{T}|c'(t)|^2\,dt \tag{13-64}$$
of $C^2(\omega)$ subject to the constraints
$$\frac{1}{2\pi}\int_{-\infty}^{\infty} C^2(\omega)\,d\omega = 1 \qquad C(-\omega) = C(\omega)$$
It can be shown that† the optimum data window is a truncated cosine (Fig. 13-10):
$$c(t) = \begin{cases}\dfrac{1}{\sqrt{T}}\cos\dfrac{\pi}{2T}t & |t| < T \\ 0 & |t| > T\end{cases} \;\leftrightarrow\; C(\omega) = 4\pi\sqrt{T}\,\frac{\cos T\omega}{\pi^2 - 4T^2\omega^2} \tag{13-65}$$

FIGURE 13-10

The resulting second moment $M_2$ equals 1. Note that if no data window is used, then $c(t) = 1$ and $M_2 = 2$. Thus the optimum data window yields a 50 percent reduction of the bias.

†A. Papoulis: "Apodization for Optimum Imaging of Smooth Objects," J. Opt. Soc. Am., vol. 62, December, 1972.

Minimum bias spectral window. From (13-53) and (13-62) it follows that the bias of $S_w(\omega)$ equals
$$B(\omega) = \frac{1}{2\pi}\int_{-\infty}^{\infty} S(\omega - \alpha)W(\alpha)\,d\alpha - S(\omega) \approx \frac{m_2}{2}\,S''(\omega) \tag{13-66}$$
where $m_2$ is the second moment of $W(\omega)/2\pi$. To minimize $B(\omega)$, it suffices,
therefore, to minimize $m_2$ subject to the constraints
$$W(\omega) \ge 0 \qquad W(-\omega) = W(\omega) \qquad \frac{1}{2\pi}\int_{-\infty}^{\infty} W(\omega)\,d\omega = 1 \tag{13-67}$$
This is the same as the problem just considered if we replace $2T$ by $M$ and we set
$$W(\omega) = C^2(\omega)$$
This yields the pair (Fig. 13-11)
$$w(t) = \begin{cases}\dfrac{1}{\pi}\left|\sin\dfrac{\pi}{M}t\right| + \left(1 - \dfrac{|t|}{M}\right)\cos\dfrac{\pi}{M}t & |t| \le M \\ 0 & |t| > M\end{cases} \tag{13-68}$$
$$W(\omega) = 8\pi^2 M\,\frac{\cos^2(M\omega/2)}{(\pi^2 - M^2\omega^2)^2} \tag{13-69}$$

FIGURE 13-11

Thus the last window in Table 13-1 minimizes the bias in high-resolution spectral estimates.

LMS spectral window. We shall finally select the spectral window $W(\omega)$ so as to minimize the MS estimation error
$$e = B^2(\omega) + \operatorname{Var} S_w(\omega) \tag{13-70}$$
We have shown that for sufficiently large values of $T$, the periodogram $S_T(\omega)$
can be written as a sum $S(\omega) + v(\omega)$ where $v(\omega)$ is a nonstationary white noise process with autocorrelation $\pi S^2(u)\delta(u - v)/T$ as in (13-57). Thus our problem is the estimation of a deterministic function $S(\omega)$ in the presence of additive noise $v(\omega)$. This problem was considered in Sec. 11-6. We shall reestablish the results in the context of spectral estimation.

We start with a rectangular window of size $2\Delta$ and area 1. The resulting estimate of $S(\omega)$ is the moving average
$$S_\Delta(\omega) = \frac{1}{2\Delta}\int_{-\Delta}^{\Delta} S_T(\omega - \alpha)\,d\alpha \tag{13-71}$$
of $S_T(\omega)$. The rectangular window was used first by Daniell† in the early years of spectral estimation. It is a special case of the spectral window $W(\omega)/2\pi$. Note that the corresponding lag window $\sin\Delta t/2\pi\Delta t$ is not time-limited.

†P. J. Daniell: Discussion on "Symposium on Autocorrelation in Time Series," J. Roy. Statist. Soc. Suppl., 8, 1946.

With the familiar large-$T$ assumption, the periodogram $S_T(\omega)$ is an unbiased estimator of $S(\omega)$. Hence the bias of $S_\Delta(\omega)$ equals
$$\frac{1}{2\Delta}\int_{-\Delta}^{\Delta} S(\omega - y)\,dy - S(\omega) \approx \frac{S''(\omega)}{4\Delta}\int_{-\Delta}^{\Delta} y^2\,dy = S''(\omega)\,\frac{\Delta^2}{6}$$
and its variance equals
$$\frac{\pi S^2(\omega)}{4\Delta^2 T}\int_{-\Delta}^{\Delta} d\omega = \frac{\pi S^2(\omega)}{2\Delta T}$$
This follows from (11-172) or directly from (13-56) where we replace the window $W(\omega)/2\pi$ by a rectangular window with energy $\pi/\Delta$. This yields
$$e = \frac{\Delta^4}{36}\left[S''(\omega)\right]^2 + \frac{\pi S^2(\omega)}{2\Delta T} \tag{13-72}$$
Proceeding as in (11-176), we conclude that $e$ is minimum if
$$\Delta = \left(\frac{9\pi}{2T}\right)^{0.2}\left[\frac{S(\omega)}{S''(\omega)}\right]^{0.4}$$
For this $\Delta$ the variance $\pi S^2(\omega)/2\Delta T$ equals $\Delta^4[S''(\omega)]^2/9$, four times the squared bias; the resulting standard deviation of $S_\Delta(\omega)$ therefore equals twice its bias (see two-to-one rule).

Suppose finally that the spectral window is a function of unknown form. We wish to determine its shape so as to minimize the MS error $e$. Proceeding as in (11-177), we can show that $e$ is minimum if the window is a truncated
parabola:
$$S_w(\omega) = \frac{3}{4\Delta}\int_{-\Delta}^{\Delta} S_T(\omega - y)\left(1 - \frac{y^2}{\Delta^2}\right)dy \tag{13-73}$$
This window was first suggested by Priestley.† Note that unlike the earlier windows, it is frequency dependent and its size is a function of the unknown spectrum $S(\omega)$ and its second derivative. To determine $S_w(\omega + \delta)$ we must therefore estimate first not only $S(\omega)$ but also $S''(\omega)$. Using these estimates we determine $\Delta$ for the next step.

13-3 EXTRAPOLATION AND SYSTEM IDENTIFICATION

In the preceding discussion, we computed the estimate $R_T(\tau)$ of $R(\tau)$ for $|\tau| < M$ and used as the estimate of $S(\omega)$ the Fourier transform $S_w(\omega)$ of the product $w(\tau)R_T(\tau)$. The portion of $R_T(\tau)$ for $|\tau| > M$ was not used. In this section, we shall assume that $S(\omega)$ belongs to a class of functions that can be specified in terms of certain parameters, and we shall use the estimated part of $R(\tau)$ to determine these parameters. In our development, we shall not consider the variance problem. We shall assume that the portion of $R(\tau)$ for $|\tau| < M$ is known exactly. This is a realistic assumption if $T \gg M$ because $R_T(\tau) \to R(\tau)$ for $|\tau| < M$ as $T \to \infty$. A physical problem leading to the assumption that $R(\tau)$ is known exactly but only for $|\tau| < M$ is the Michelson interferometer. In this example, the time of observation is arbitrarily large; however, $R(\tau)$ can be determined only for $|\tau| < M$ where $M$ is a constant proportional to the maximum displacement of the moving mirror (Fig. 13-5).

Our problem can thus be phrased as follows: We are given a finite segment
$$R_M(\tau) = \begin{cases} R(\tau) & |\tau| < M \\ 0 & |\tau| > M\end{cases}$$
of the autocorrelation $R(\tau)$ of a process $x(t)$ and we wish to estimate its power spectrum $S(\omega)$. This is essentially a deterministic problem: We wish to find the Fourier transform $S(\omega)$ of a function $R(\tau)$ knowing only the segment $R_M(\tau)$ of $R(\tau)$ and the fact that $S(\omega) \ge 0$. This problem does not have a unique solution. Our task then is to find a particular $S(\omega)$ that is close in some sense to the unknown spectrum.
†M. B. Priestley: "Basic Considerations in the Estimation of Power Spectra," Technometrics, 4, 1962.

In the early years of spectral estimation, the function $S(\omega)$
was estimated with the method of windows (Blackman and Tukey†). In this method, the unknown $R(\tau)$ is replaced by 0 and the known or estimated part is tapered by a suitable factor $w(\tau)$. In recent years, a different approach has been used: It is assumed that $S(\omega)$ can be specified in terms of a finite number of parameters (parametric extrapolation) and the problem is reduced to the estimation of these parameters. In this section we concentrate on the extrapolation method starting with brief coverage of the method of windows.

†R. B. Blackman and J. W. Tukey: The Measurement of Power Spectra, Dover, New York, 1959.

Method of windows. The continuous-time version of this method is treated in the last section in the context of the bias reduction problem: We use as the estimate of $S(\omega)$ the integral
$$S_w(\omega) = \int_{-M}^{M} w(\tau)R(\tau)e^{-j\omega\tau}\,d\tau = \frac{1}{2\pi}\int_{-\infty}^{\infty} S(\omega - \alpha)W(\alpha)\,d\alpha \tag{13-74}$$
and we select $w(t)$ so as to minimize in some sense the estimation error $S_w(\omega) - S(\omega)$. If $M$ is large in the sense that $S(\omega - \alpha) \approx S(\omega)$ for $|\alpha| < 1/M$, we can use the approximation [see (13-62)]
$$S_w(\omega) - S(\omega) \approx \frac{S''(\omega)}{4\pi}\int_{-\infty}^{\infty}\alpha^2 W(\alpha)\,d\alpha$$
This is minimum if
$$w(\tau) = \frac{1}{\pi}\left|\sin\frac{\pi}{M}\tau\right| + \left(1 - \frac{|\tau|}{M}\right)\cos\frac{\pi}{M}\tau \qquad |\tau| < M$$
The discrete-time version of this method is similar: We are given a finite segment
$$R_L[m] = \begin{cases} R[m] & |m| \le L \\ 0 & |m| > L\end{cases} \tag{13-75}$$
of the autocorrelation $R[m] = E\{x[n + m]x[n]\}$ of a process $x[n]$ and we wish to estimate its power spectrum
$$S(\omega) = \sum_{m=-\infty}^{\infty} R[m]e^{-jm\omega}$$
We use as the estimate of $S(\omega)$ the DFT
$$S_w(\omega) = \sum_{m=-L}^{L} w[m]R[m]e^{-jm\omega} = \frac{1}{2\pi}\int_{-\pi}^{\pi} S(\omega - \alpha)W(\alpha)\,d\alpha \tag{13-76}$$
of the product $w[m]R[m]$ where $w[m] \leftrightarrow W(\omega)$ is a DFT pair. The criteria for selecting $w[m]$ are the same as in the continuous-time case. In fact, if $M$ is
large, we can choose for $w[m]$ the samples
$$w[m] = w(Mm/L) \qquad m = 0, \ldots, L \tag{13-77}$$
of an analog window $w(t)$ where $M$ is the size of $w(t)$.

In a real problem, the data $R_L[m]$ are not known exactly. They are estimated in terms of the $J$ samples of $x[n]$:
$$R_L[m] = \frac{1}{J}\sum_{n} x[n + m]x[n] \tag{13-78}$$
The mean and variance of $R_L[m]$ can be determined as in the analog case. The details, however, will not be given. In the following, we assume that $R_L[m]$ is known exactly. This assumption is satisfactory if $J \gg L$.

Extrapolation Method

The spectral estimation problem is essentially numerical. This involves digital data even if the given process is analog. We shall, therefore, carry out the analysis in digital form. In the extrapolation method we assume that $S(z)$ is of known form. We shall assume that it is rational
$$S(z) = L(z)L(1/z) \qquad L(z) = \frac{b_0 + b_1 z^{-1} + \cdots + b_M z^{-M}}{1 + a_1 z^{-1} + \cdots + a_N z^{-N}} = \frac{N(z)}{D(z)} \tag{13-79}$$
We select the rational model for the following reasons: The numerical evaluation of its unknown parameters is relatively simple. An arbitrary spectrum can be closely approximated by a rational model of sufficiently large order. Spectra involving responses of dynamic systems are often rational.

System identification. The rational model leads directly to the solution of the identification problem (see also Sec. 11-7): We wish to determine the system function $H(z)$ of a system driven by white noise in terms of the measurements of its output $x[n]$. As we know, the power spectrum of the output is proportional to $H(z)H(1/z)$. If, therefore, the system is of finite order and minimum phase, then $H(z)$ is proportional to $L(z)$. To determine $H(z)$, it suffices, therefore, to determine the $M + N + 1$ parameters of $L(z)$. We shall do so under the assumption that $R_L[m]$ is known exactly for $|m| \le M + N + 1$.

We should stress that the proposed model is only an abstraction. In a real problem, $R[m]$ is not known exactly.
Furthermore, $S(z)$ might not be rational; even if it is, the constants $M$ and $N$ might not be known. However, the method leads to reasonable approximations if $R[m]$ is replaced by its time-average estimate $R_L[m]$ and $L$ is large.

Autoregressive process. Our objective is to determine the $M + N + 1$ coefficients $b_i$ and $a_k$ specifying the spectrum $S(z)$ in terms of the first $M + N + 1$
values $R_L[m]$ of $R[m]$. We start with the assumption that
$$L(z) = \frac{\sqrt{P_N}}{1 + a_1 z^{-1} + \cdots + a_N z^{-N}} = \frac{\sqrt{P_N}}{D(z)} \tag{13-80}$$
This is a special case of (12-36) with $M = 0$ and $b_0 = \sqrt{P_N}$. As we know, the process $x[n]$ satisfies the equation
$$x[n] + a_1 x[n - 1] + \cdots + a_N x[n - N] = e[n] \tag{13-81}$$
where $e[n]$ is white noise with average power $P_N$. Our problem is to find the $N + 1$ coefficients $a_k$ and $P_N$. To do so, we multiply (13-81) by $x[n - m]$ and take expected values. With $m = 0, \ldots, N$, this yields the Yule-Walker equations
$$\begin{aligned} R[0] + a_1 R[1] + \cdots + a_N R[N] &= P_N \\ R[1] + a_1 R[0] + \cdots + a_N R[N - 1] &= 0 \\ &\;\;\vdots \\ R[N] + a_1 R[N - 1] + \cdots + a_N R[0] &= 0 \end{aligned} \tag{13-82}$$
This is a system of $N + 1$ equations involving the $N + 1$ unknowns $a_k$ and $P_N$, and it has a unique solution if the determinant $\Delta_N$ of the correlation matrix $D_N$ of $x[n]$ is strictly positive. We note, in particular, that
$$P_N = \frac{\Delta_{N+1}}{\Delta_N} > 0 \tag{13-83}$$
If $\Delta_{N+1} = 0$, then $P_N = 0$ and $e_N[n] = 0$. In this case, the unknown $S(\omega)$ consists of lines [see (12-44)].

To find $L(z)$, it suffices, therefore, to solve the system (13-82). This involves the inversion of the matrix $D_N$. The problem of inversion can be simplified because the matrix $D_N$ is Toeplitz; that is, it is symmetrical with respect to its diagonal. We give later a simple method for determining $a_k$ and $P_N$ based on this property (Levinson's algorithm).

Moving average processes. If $x[n]$ is an MA process, then
$$S(z) = L(z)L(1/z) \qquad L(z) = b_0 + b_1 z^{-1} + \cdots + b_M z^{-M} \tag{13-84}$$
In this case, $S(z)$ can be expressed directly in terms of the first $M + 1$ values of $R[m]$:
$$S(z) = \sum_{m=-M}^{M} R[m]z^{-m} \qquad S(e^{j\omega}) = \left|\sum_{m=0}^{M} b_m e^{-jm\omega}\right|^2 \tag{13-85}$$
In the identification problem, our objective is to find not the function $S(z)$, but the $M + 1$ coefficients $b_m$ of $L(z)$. One method for doing so is the factorization $S(z) = L(z)L(1/z)$ of $S(z)$ as in Sec. 12-1. This method involves the determination of the roots of $S(z)$. We discuss later a method that avoids factorization (see page 470).
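The Yule-Walker system (13-82) is small enough to solve directly for modest $N$. The sketch below is one plain way to do it, not the method the text develops later: it uses ordinary Gaussian elimination and ignores the Toeplitz structure. The function names and the representation of $R[0], \ldots, R[N]$ as a Python list `r` are assumptions made for illustration.

```python
def solve_linear_system(matrix, rhs):
    """Gaussian elimination with partial pivoting; returns the solution list."""
    n = len(rhs)
    a = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(a[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (a[r][n] - tail) / a[r][r]
    return solution

def yule_walker(r, order):
    """Solve (13-82) for a_1..a_N and P_N, given R[0..N] in the list r."""
    # Rows m = 1..N of (13-82): sum_k a_k R[|m - k|] = -R[m].
    matrix = [[r[abs(m - k)] for k in range(1, order + 1)]
              for m in range(1, order + 1)]
    rhs = [-r[m] for m in range(1, order + 1)]
    a = solve_linear_system(matrix, rhs)
    # Row m = 0 of (13-82) then gives the noise power P_N.
    power = r[0] + sum(a[k - 1] * r[k] for k in range(1, order + 1))
    return a, power
```

As a check, the AR(1) process $x[n] - 0.5x[n - 1] = e[n]$ with unit noise power has $R[m] = (4/3)(0.5)^{|m|}$; fitting a second-order model to `[4/3, 2/3, 1/3]` returns $a_1 = -0.5$, $a_2 = 0$, and $P_2 = 1$ up to rounding.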
ARMA processes.† We assume now that $x[n]$ is an ARMA process:
$$L(z) = \frac{b_0 + b_1 z^{-1} + \cdots + b_M z^{-M}}{1 + a_1 z^{-1} + \cdots + a_N z^{-N}} = \frac{N(z)}{D(z)} \tag{13-86}$$
In this case, $x[n]$ satisfies the equation
$$x[n] + a_1 x[n - 1] + \cdots + a_N x[n - N] = b_0 i[n] + \cdots + b_M i[n - M] \tag{13-87}$$
where $i[n]$ is its innovations. Multiplying both sides of (13-87) by $x[n - m]$ and taking expected values, we conclude as in (12-49) that
$$R[m] + a_1 R[m - 1] + \cdots + a_N R[m - N] = 0 \qquad m > M \tag{13-88}$$
Setting $m = M + 1, M + 2, \ldots, M + N$ into (13-88), we obtain a system of $N$ equations. The solution of this system yields the $N$ unknowns $a_1, \ldots, a_N$.

To complete the specification of $L(z)$, it suffices to find the $M + 1$ constants $b_0, \ldots, b_M$. To do so, we form a filter with input $x[n]$ and system function (Fig. 13-12)
$$D(z) = 1 + a_1 z^{-1} + \cdots + a_N z^{-N}$$

FIGURE 13-12

The resulting output $y[n]$ is called the residual sequence. Inserting into (10-183), we obtain
$$S_{yy}(z) = S(z)D(z)D(1/z) = N(z)N(1/z)$$
From this it follows that $y[n]$ is an MA process, and its innovations filter equals
$$L_y(z) = N(z) = b_0 + b_1 z^{-1} + \cdots + b_M z^{-M} \tag{13-89}$$

†M. Kaveh: "High Resolution Spectral Estimation for Noisy Signals," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-27, 1979. See also J. A. Cadzow: "Spectral Estimation: An Overdetermined Rational Model Equation Approach," IEEE Proceedings, vol. 70, 1982.
To determine the constants $b_i$, it suffices, therefore, to find the autocorrelation $R_{yy}[m]$ for $|m| \le M$. Since $y[n]$ is the output of the filter $D(z)$ with input $x[n]$, it follows from (12-47) with $a_0 = 1$ that
$$R_{yy}[m] = R[m] * d[m] * d[-m] \qquad d[m] = \sum_{k=0}^{N} a_k\,\delta[m - k]$$
This yields
$$R_{yy}[m] = \sum_{i=-N}^{N} R[m - i]\rho[i] \qquad \rho[m] = \sum_{k=m}^{N} a_{k-m}a_k = \rho[-m] \tag{13-90}$$
for $0 \le m \le M$ and 0 for $m > M$. With $R_{yy}[m]$ so determined, we proceed as in the MA case. The determination of the ARMA model involves thus the following steps:

Find the constants $a_k$ from (13-88); this yields $D(z)$.

Find $R_{yy}[m]$ from (13-90).

Find the roots of the polynomial
$$S_{yy}(z) = \sum_{m=-M}^{M} R_{yy}[m]z^{-m} = N(z)N(1/z)$$

Form the Hurwitz factor $N(z)$ of $S_{yy}(z)$.

Lattice filters and Levinson's algorithm. An MA filter is a polynomial in $z^{-1}$. Such a filter is usually realized by a ladder structure as in Fig. 13-14a. A lattice filter is an alternate realization of an MA filter in the form of Fig. 13-14b. In the context of spectral estimation, lattice filters are used to simplify the solution of the Yule-Walker equations and the factorization of polynomials. Furthermore, as we show later, they are also used to give a convenient description of the properties of extrapolating spectra. Related applications are developed in the next chapter in the solution of the prediction problem.

The polynomial
$$D(z) = 1 - a_1^N z^{-1} - \cdots - a_N^N z^{-N} = 1 - \sum_{k=1}^{N} a_k^N z^{-k}$$
specifies an MA filter with $H(z) = D(z)$. The superscript in $a_k^N$ identifies the order of the filter. If the input to this filter is an AR process $x[n]$ with $L(z)$ as in (13-80) and $a_k^N = -a_k$, then the resulting output
$$e[n] = x[n] - a_1^N x[n - 1] - \cdots - a_N^N x[n - N] \tag{13-91}$$
is white noise as in (13-81). The filter $D(z)$ is usually realized by the ladder structure of Fig. 13-14a. We shall show that the lattice filter of Fig. 13-14b is an equivalent realization. We start with $N = 1$.
In Fig. 13-13a we show the ladder realization of an MA filter of order 1 and its mirror image. The input to both systems is the process $x[n]$; the outputs equal
$$y[n] = x[n] - a_1^1 x[n - 1] \qquad z[n] = -a_1^1 x[n] + x[n - 1]$$
The corresponding system functions equal
$$1 - a_1^1 z^{-1} \qquad -a_1^1 + z^{-1}$$
In Fig. 13-13b we show a lattice filter of order 1. It has a single input $x[n]$ and two outputs
$$\hat{e}_1[n] = x[n] - K_1 x[n - 1] \qquad \check{e}_1[n] = -K_1 x[n] + x[n - 1]$$
The corresponding system functions are
$$\hat{E}_1(z) = 1 - K_1 z^{-1} \qquad \check{E}_1(z) = -K_1 + z^{-1} = z^{-1}\hat{E}_1(1/z)$$
If $K_1 = a_1^1$ then the lattice filter of Fig. 13-13b is equivalent to the two MA filters of Fig. 13-13a.

In Fig. 13-14b we show a lattice filter of order $N$ formed by cascading $N$ first-order filters. The input to this filter is the process $x[n]$. The resulting
outputs are denoted by $\hat{e}_N[n]$ and $\check{e}_N[n]$ and are called forward and backward respectively. As we see from the diagram these signals satisfy the equations
$$\hat{e}_N[n] = \hat{e}_{N-1}[n] - K_N\,\check{e}_{N-1}[n - 1] \tag{13-92a}$$
$$\check{e}_N[n] = \check{e}_{N-1}[n - 1] - K_N\,\hat{e}_{N-1}[n] \tag{13-92b}$$
Denoting by $\hat{E}_N(z)$ and $\check{E}_N(z)$ the system functions from the input A to the upper output B and lower output C respectively, we conclude that
$$\hat{E}_N(z) = \hat{E}_{N-1}(z) - K_N z^{-1}\check{E}_{N-1}(z) \tag{13-93a}$$
$$\check{E}_N(z) = z^{-1}\check{E}_{N-1}(z) - K_N\hat{E}_{N-1}(z) \tag{13-93b}$$
where $\hat{E}_{N-1}(z)$ and $\check{E}_{N-1}(z)$ are the forward and backward system functions of the lattice of the first $N - 1$ sections. From (13-93) it follows by a simple induction that
$$\check{E}_N(z) = z^{-N}\hat{E}_N(1/z) \tag{13-94}$$
The lattice filter is thus specified in terms of the $N$ constants $K_k$. These constants are called reflection coefficients.
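The recursions (13-92a) and (13-92b) translate almost line for line into code. The sketch below is a minimal illustration under stated assumptions: the input is a finite list, samples before the start of the list are taken as zero, and the function name `lattice_filter` is hypothetical.

```python
def lattice_filter(x, reflection):
    """Run x through a lattice with reflection coefficients K_1..K_N.

    Returns the forward and backward outputs of the last section,
    computed section by section from (13-92a) and (13-92b).
    """
    forward = [float(v) for v in x]
    backward = [float(v) for v in x]
    for k in reflection:
        # Backward signal of the previous section delayed by one sample.
        delayed = [0.0] + backward[:-1]
        # Both new signals are built from the previous section's outputs.
        forward, backward = (
            [f - k * b for f, b in zip(forward, delayed)],
            [b - k * f for f, b in zip(forward, delayed)],
        )
    return forward, backward
```

Feeding a unit impulse through the lattice returns the coefficients of the two system functions. With $K_1 = 0.5$ and $K_2 = -0.3$ the forward response is $1, -0.65, 0.3$ and the backward response is the same sequence reversed, as (13-94) requires.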
Since $\hat{E}_1(z) = 1 - K_1 z^{-1}$ and $\check{E}_1(z) = -K_1 + z^{-1}$, we conclude from (13-93) that the functions $\hat{E}_N(z)$ and $\check{E}_N(z)$ are polynomials in $z^{-1}$ of the form
$$\hat{E}_N(z) = 1 - a_1^N z^{-1} - \cdots - a_N^N z^{-N} \tag{13-95}$$
$$\check{E}_N(z) = z^{-N} - a_1^N z^{-N+1} - \cdots - a_N^N \tag{13-96}$$
where $a_k^N$ are $N$ constants that are specified in terms of the reflection coefficients $K_k$.

LEVINSON'S ALGORITHM.† We denote by $a_k^{N-1}$ the coefficients of the lattice filter of the first $N - 1$ sections:
$$\hat{E}_{N-1}(z) = 1 - a_1^{N-1}z^{-1} - \cdots - a_{N-1}^{N-1}z^{-(N-1)}$$
From (13-94) it follows that
$$z^{-1}\check{E}_{N-1}(z) = z^{-N}\hat{E}_{N-1}(1/z)$$
Inserting into (13-93a) and equating coefficients of equal powers of $z$, we obtain
$$a_k^N = a_k^{N-1} - K_N a_{N-k}^{N-1} \quad k = 1, \ldots, N - 1 \qquad a_N^N = K_N \tag{13-97}$$
We have thus expressed the coefficients $a_k^N$ of a lattice of order $N$ in terms of the coefficients $a_k^{N-1}$ and the last reflection coefficient $K_N$. Starting with $a_1^1 = K_1$, we can express recursively the $N$ parameters $a_k^N$ in terms of the $N$ reflection coefficients $K_k$.

Conversely, if we know $a_k^N$, we find $K_k$ using inverse recursion: The coefficient $K_N$ equals $a_N^N$. To find $K_{N-1}$, it suffices to find the polynomial $\hat{E}_{N-1}(z)$. Multiplying (13-93b) by $K_N$ and adding to (13-93a), we obtain
$$(1 - K_N^2)\hat{E}_{N-1}(z) = \hat{E}_N(z) + K_N z^{-N}\hat{E}_N(1/z) \tag{13-98}$$
This expresses $\hat{E}_{N-1}(z)$ in terms of $\hat{E}_N(z)$ because $K_N = a_N^N$. With $\hat{E}_{N-1}(z)$ so determined, we set $K_{N-1} = a_{N-1}^{N-1}$. Continuing this process, we find $\hat{E}_{N-k}(z)$ and $K_{N-k}$ for every $k < N$.

Minimum-phase properties. We shall relate the location of the roots $z_i^N$ of the polynomial $\hat{E}_N(z)$ to the magnitude of the reflection coefficients $K_k$.

THEOREM. If
$$|K_k| < 1 \text{ for all } k \le N \quad\text{then}\quad |z_i^N| < 1 \text{ for all } i \le N \tag{13-99}$$

†N. Levinson: "The Wiener RMS Error Criterion in Filter Design and Prediction," Journal of Mathematics and Physics, vol. 25, 1947. See also J. Durbin: "The Fitting of Time Series Models," Revue de l'Institut International de Statistique, vol. 28, 1960.
Proof. By induction. The theorem is true for $N = 1$ because $\hat{E}_1(z) = 1 - K_1 z^{-1}$; hence $|z_1^1| = |K_1| < 1$. Suppose that $|z_j^{N-1}| < 1$ for all $j \le N - 1$ where $z_j^{N-1}$ are the roots of $\hat{E}_{N-1}(z)$. From this it follows that the function
$$A_{N-1}(z) = \frac{z^{-N}\hat{E}_{N-1}(1/z)}{\hat{E}_{N-1}(z)} \tag{13-100}$$
is all-pass. Since $\hat{E}_N(z_i^N) = 0$ by assumption, we conclude from (13-93a) and (13-94) that
$$\hat{E}_N(z_i^N) = \hat{E}_{N-1}(z_i^N) - K_N (z_i^N)^{-N}\hat{E}_{N-1}(1/z_i^N) = 0$$
Hence
$$|A_{N-1}(z_i^N)| = \frac{1}{|K_N|} > 1$$
This shows that $|z_i^N| < 1$ [see (13B-2)].

CONVERSE THEOREM. If
$$|z_i^N| < 1 \text{ for all } i \le N \quad\text{then}\quad |K_k| < 1 \text{ for all } k \le N \tag{13-101}$$
Proof. The product of the roots of the polynomial $\hat{E}_N(z)$ equals, in magnitude, the last coefficient $a_N^N$. Hence
$$|K_N| = |a_N^N| = |z_1^N \cdots z_N^N| < 1$$
Thus (13-101) is true for $k = N$. To show that it is true for $k = N - 1$, it suffices to show that $|z_j^{N-1}| < 1$ for $j \le N - 1$. To do so, we form the all-pass function
$$A_N(z) = \frac{z^{-N}\hat{E}_N(1/z)}{\hat{E}_N(z)} \tag{13-102}$$
Since $\hat{E}_{N-1}(z_j^{N-1}) = 0$ it follows from (13-98) that
$$|A_N(z_j^{N-1})| = \frac{1}{|K_N|} > 1$$
Hence $|z_j^{N-1}| < 1$ and $|K_{N-1}| = |a_{N-1}^{N-1}| = |z_1^{N-1} \cdots z_{N-1}^{N-1}| < 1$. Proceeding similarly, we conclude that $|K_k| < 1$ for all $k \le N$.

COROLLARY. If $|K_k| < 1$ for $k \le N - 1$ and $|K_N| = 1$, then
$$|z_i^N| = 1 \text{ for all } i \le N \tag{13-103}$$
Proof. From the theorem it follows that $|z_j^{N-1}| < 1$ because $|K_k| < 1$ for all $k \le N - 1$. Hence the function $A_{N-1}(z)$ in (13-100) is all-pass and $|A_{N-1}(z_i^N)| = 1/|K_N| = 1$. This leads to the conclusion that $|z_i^N| = 1$ [see (13B-2)].
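The two directions of the recursion, (13-97) and the inverse obtained from (13-98), can be sketched as follows, together with a small root check for $N = 2$. This is an illustration only: the function names are hypothetical, the coefficients $a_1^N, \ldots, a_N^N$ are held in a plain list, and the inverse recursion assumes $|K_k| \ne 1$ at every step so that the division by $1 - K_k^2$ is defined.

```python
import cmath

def step_up(reflection):
    """(13-97): reflection coefficients K_1..K_N -> a_1^N..a_N^N."""
    a = []
    for k in reflection:
        # a_k^N = a_k^{N-1} - K_N a_{N-k}^{N-1}, then append a_N^N = K_N.
        a = [a[i] - k * a[len(a) - 1 - i] for i in range(len(a))] + [k]
    return a

def step_down(a):
    """Inverse recursion from (13-98): a_1^N..a_N^N -> K_1..K_N."""
    a = list(a)
    reflection = []
    while a:
        k = a[-1]  # K_N equals the last coefficient a_N^N
        reflection.append(k)
        n = len(a)
        # a_k^{N-1} = (a_k^N + K_N a_{N-k}^N) / (1 - K_N^2); needs |K_N| != 1.
        a = [(a[i] + k * a[n - 2 - i]) / (1.0 - k * k) for i in range(n - 1)]
    return reflection[::-1]

def roots_order_two(a):
    """Roots of z^2 - a_1 z - a_2, the zeros of 1 - a_1/z - a_2/z^2."""
    a1, a2 = a
    root = cmath.sqrt(a1 * a1 + 4.0 * a2)
    return (a1 + root) / 2.0, (a1 - root) / 2.0
```

For example, `step_up([0.5, -0.3, 0.2])` gives approximately `[0.71, -0.43, 0.2]`, and `step_down` recovers the reflection coefficients. For $N = 2$ the roots returned by `roots_order_two(step_up([K1, K2]))` lie inside the unit circle whenever $|K_1| < 1$ and $|K_2| < 1$, and on it when $K_2 = -1$, in agreement with (13-99) and (13-103).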
We have thus established the equivalence between a polynomial $\hat{E}_N(z)$ and a set of $N$ constants $K_k$. We have shown further that the polynomial is strictly Hurwitz, iff $|K_k| < 1$ for all $k$.

Inverse lattice realization of AR systems. An inverse lattice is a modification of a lattice as in Fig. 13-15. In this modification, the input is at point B and the outputs are at points A and C. Furthermore, the multipliers from the lower to the upper line are changed from $-K_k$ to $K_k$. Denoting by $\hat{e}_N[n]$ the input at point B and by $\check{e}_N[n]$ the resulting output at C, we observe from the figure that
$$\hat{e}_{N-1}[n] = \hat{e}_N[n] + K_N\,\check{e}_{N-1}[n - 1] \tag{13-104a}$$
$$\check{e}_N[n] = \check{e}_{N-1}[n - 1] - K_N\,\hat{e}_{N-1}[n] \tag{13-104b}$$
These equations are identical with the two equations in (13-92). From this it follows that the system function from B to A equals
$$\frac{1}{\hat{E}_N(z)} = \frac{1}{1 - a_1^N z^{-1} - \cdots - a_N^N z^{-N}}$$
We have thus shown that an AR system can be realized by an inverse lattice. The coefficients $a_k^N$ and $K_k$ satisfy Levinson's algorithm (13-97).

Iterative solution of the Yule-Walker equations. Consider an AR process with innovations filter $L(z) = \sqrt{P_N}/D(z)$ as in (13-80). We form the lattice equivalent of the MA system $D(z)$ with $a_k^N = -a_k$, and use $x[n]$ as its input. As
we know [see (13-95)] the forward and backward responses are given by
$$\hat{e}_N[n] = x[n] - a_1^N x[n-1] - \cdots - a_N^N x[n-N]$$
$$\check{e}_N[n] = x[n-N] - a_1^N x[n-N+1] - \cdots - a_N^N x[n]$$ (13-105)
Denoting by $\hat{S}_N(z)$ and $\check{S}_N(z)$ the spectra of $\hat{e}_N[n]$ and $\check{e}_N[n]$ respectively, we conclude from (13-105) that
$$\hat{S}_N(z) = S(z)\hat{E}_N(z)\hat{E}_N(1/z) = P_N \qquad \check{S}_N(z) = S(z)\check{E}_N(z)\check{E}_N(1/z) = P_N$$
From this it follows that $\hat{e}_N[n]$ and $\check{e}_N[n]$ are two white-noise processes and
$$E\{\hat{e}_N^2[n]\} = E\{\check{e}_N^2[n]\} = P_N$$ (13-106a)
$$E\{x[n-m]\hat{e}_N[n]\} = \begin{cases} P_N & m = 0 \\ 0 & 1 \le m \le N \end{cases}$$ (13-106b)
$$E\{x[n-m]\check{e}_N[n]\} = \begin{cases} 0 & 0 \le m \le N-1 \\ P_N & m = N \end{cases}$$ (13-106c)
These equations also hold for all filters of lower order. We shall use them to express recursively the parameters $a_k^N$, $K_N$, and $P_N$ in terms of the $N + 1$ constants $R[0], \ldots, R[N]$.

For $N = 1$ (13-82) yields
$$R[0] - a_1^1 R[1] = P_1 \qquad R[1] - a_1^1 R[0] = 0$$
Setting $P_0 = R[0]$, we obtain
$$a_1^1 = K_1 = \frac{R[1]}{R[0]} \qquad P_1 = (1 - K_1^2)P_0$$
Suppose now that we know the $N + 1$ parameters $a_k^{N-1}$, $K_{N-1}$, and $P_{N-1}$. From Levinson's algorithm (13-97) it follows that we can determine $a_k^N$ if $K_N$ is known. To complete the iteration, it suffices, therefore, to find $K_N$ and $P_N$. We maintain that
$$P_{N-1}K_N = R[N] - \sum_{k=1}^{N-1} a_k^{N-1} R[N-k]$$ (13-107)
$$P_N = (1 - K_N^2)P_{N-1}$$ (13-108)
The first equation yields $K_N$ in terms of the known parameters $a_k^{N-1}$, $R[m]$, and $P_{N-1}$. With $K_N$ so determined, $P_N$ is determined from the second equation.

Proof. Multiplying (13-92a) by $x[n - N]$ and using the identities
$$E\{\hat{e}_{N-1}[n]x[n-N]\} = R[N] - \sum_{k=1}^{N-1} a_k^{N-1} R[N-k]$$
$$E\{\hat{e}_N[n]x[n-N]\} = 0 \qquad E\{\check{e}_{N-1}[n-1]x[n-N]\} = P_{N-1}$$
we obtain (13-107). From (13-92a) and the identities
$$E\{\hat{e}_N[n]x[n]\} = P_N \qquad E\{\hat{e}_{N-1}[n]x[n]\} = P_{N-1}$$
$$E\{\check{e}_{N-1}[n-1]x[n]\} = R[N] - \sum_{k=1}^{N-1} a_k^{N-1} R[N-k] = P_{N-1}K_N$$
it follows similarly that $P_N = P_{N-1} - K_N^2 P_{N-1}$ and (13-108) results.

Since $P_k \ge 0$ for every $k$, it follows from (13-108) that
$$|K_k| \le 1 \quad \text{and} \quad P_0 \ge P_1 \ge \cdots \ge P_N \ge 0$$ (13-109)
If $|K_N| = 1$ but $|K_k| < 1$ for all $k < N$, then
$$P_0 > P_1 > \cdots > P_N = 0$$ (13-110)
As we show next this is the case if $S(\omega)$ consists of lines.

Line spectra and hidden periodicities. If $P_N = 0$, then $\hat{e}_N[n] = 0$; hence the process $x[n]$ satisfies the homogeneous recursion equation
$$x[n] = a_1^N x[n-1] + \cdots + a_N^N x[n-N]$$ (13-111)
This shows that $x[n]$ is a predictable process, that is, it can be expressed in terms of its $N$ past values. Furthermore,
$$R[m] - a_1^N R[m-1] - \cdots - a_N^N R[m-N] = 0$$ (13-112)
As we know [see (13-103)] the roots $z_i^N$ of the characteristic polynomial $\hat{E}_N(z)$ of this equation are on the unit circle: $z_i^N = e^{j\omega_i}$. From this it follows that
$$R[m] = \sum_{i=1}^{N} \alpha_i e^{j\omega_i m} \qquad S(\omega) = 2\pi\sum_{i=1}^{N} \alpha_i \delta(\omega - \omega_i)$$ (13-113)
And since $S(\omega) \ge 0$, we conclude that $\alpha_i \ge 0$. Solving (13-111), we obtain
$$x[n] = \sum_{i=1}^{N} c_i e^{j\omega_i n} \qquad E\{c_i\} = 0 \qquad E\{c_i c_k^*\} = \begin{cases} \alpha_i & i = k \\ 0 & i \ne k \end{cases}$$ (13-114)

CARATHEODORY'S THEOREM. We show next that if $R[m]$ is a p.d. sequence and its correlation matrix is of rank $N$, that is, if
$$\Delta_N > 0 \qquad \Delta_{N+1} = 0$$ (13-115)
then $R[m]$ is a sum of exponentials with positive coefficients:
$$R[m] = \sum_{i=1}^{N} \alpha_i e^{j\omega_i m} \qquad \alpha_i > 0$$ (13-116)

Proof. Since $R[m]$ is a p.d. sequence, we can construct a process $x[n]$ with autocorrelation $R[m]$. Applying Levinson's algorithm, we obtain a sequence of constants $K_k$ and $P_k$. The iteration stops at the $N$th step because $P_N = \Delta_{N+1}/\Delta_N = 0$. This shows that the process $x[n]$ satisfies the recursion equation (13-111).
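The iteration defined by (13-107), (13-108), and the order update of Levinson's algorithm can be sketched as follows. This is a minimal illustration, assuming the update $a_k^N = a_k^{N-1} - K_N a_{N-k}^{N-1}$, $a_N^N = K_N$ for (13-97); the function name `levinson` and its return layout are hypothetical.

```python
import numpy as np

def levinson(R):
    """Return (a, K, P) for the autocorrelation values R[0..N].

    a : a_1^N..a_N^N of the last completed step
    K : reflection coefficients K_1..K_N
    P : error powers P_0..P_N
    """
    R = np.asarray(R, dtype=float)
    a = np.zeros(0)              # a_k^{N-1}, k = 1..N-1
    P = [R[0]]                   # P_0 = R[0]
    K = []
    for N in range(1, len(R)):
        if P[-1] <= 0.0:         # P_{N-1} = 0: line spectrum, the iteration stops
            break
        # (13-107): P_{N-1} K_N = R[N] - sum_{k=1}^{N-1} a_k^{N-1} R[N-k]
        k_n = (R[N] - np.dot(a, R[N - 1:0:-1])) / P[-1]
        # assumed form of (13-97)
        a = np.concatenate([a - k_n * a[::-1], [k_n]])
        K.append(k_n)
        # (13-108): P_N = (1 - K_N^2) P_{N-1}
        P.append((1.0 - k_n**2) * P[-1])
    return a, np.array(K), np.array(P)

# First-order AR autocorrelation R[m] = 8 * 0.5**m: K_2 vanishes.
a, K, P = levinson([8.0, 4.0, 2.0])
```

For this input the sketch gives $K_1 = 0.5$, $K_2 = 0$, and $P_1 = P_2 = 6$, as expected for a first-order AR sequence.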
Detection of hidden periodicities.† We shall use the above to solve the following problem: We wish to determine the frequencies $\omega_i$ of a process $x[n]$ consisting of at most $N$ exponentials as in (13-114). The available information is the sum
$$y[n] = x[n] + \nu[n] \qquad E\{\nu^2[n]\} = q$$ (13-117)
where $\nu[n]$ is white noise independent of $x[n]$. Using $J$ samples of $y[n]$, we estimate its autocorrelation
$$R_{yy}[m] = R_{xx}[m] + q\delta[m]$$ (13-118)
as in (13-78). The correlation matrix $D_{N+1}$ of $x[n]$ is thus given by
$$D_{N+1} = \begin{bmatrix} R_{yy}[0]-q & R_{yy}[1] & \cdots & R_{yy}[N] \\ R_{yy}[1] & R_{yy}[0]-q & \cdots & R_{yy}[N-1] \\ \vdots & \vdots & & \vdots \\ R_{yy}[N] & R_{yy}[N-1] & \cdots & R_{yy}[0]-q \end{bmatrix}$$ (13-119)
In this expression, $R_{yy}[m]$ is known but $q$ is unknown. We know, however, that $\Delta_{N+1} = 0$ because $x[n]$ consists of $N$ lines. Hence $q$ is an eigenvalue of the matrix obtained by setting $q = 0$ in (13-119), that is, of the correlation matrix of $y[n]$. It is, in fact, the smallest eigenvalue $q_0$ because $D_{N+1} > 0$ for $q < q_0$. With $R_{xx}[m]$ so determined, we proceed as before: Using Levinson's algorithm, we find the coefficients $a_k^N$ and the roots $e^{j\omega_i}$ of the resulting polynomial $\hat{E}_N(z)$. If $q_0$ is a simple eigenvalue, then all roots are distinct and $x[n]$ is a sum of $N$ exponentials. If, however, $q_0$ is a multiple root with multiplicity $N_0$, then $x[n]$ consists of $N - N_0 + 1$ exponentials.

This analysis leads to the following extension of Caratheodory's theorem: The $N + 1$ values $R[0], \ldots, R[N]$ of a strictly p.d. sequence $R[m]$ can be expressed in the form
$$R[m] = q_0\delta[m] + \sum_{i=1}^{N} \alpha_i e^{j\omega_i m}$$ (13-120)
where $q_0$ and $\alpha_i$ are positive constants and $\omega_i$ are real frequencies.

†V. F. Pisarenko: "The Retrieval of Harmonics," Geophysical Journal of the Royal Astronomical Society, 1973.

Burg's iteration.‡ Levinson's algorithm is used to determine recursively the coefficients $a_k^N$ of the innovations filter $L(z)$ of an AR process $x[n]$ in terms of $R[m]$. In a real problem the data $R[m]$ are not known exactly. They are estimated from the $J$ samples of $x[n]$ and these estimates are inserted into (13-107) and (13-108), yielding the estimates of $K_N$ and $P_N$. The results are then used to estimate $a_k^N$ from (13-97). A more direct approach, suggested by Burg, avoids the estimation of $R[m]$. It is based on the observation that Levinson's algorithm expresses recursively the coefficients $a_k^N$ in terms of $K_N$ and $P_N$. The estimates of these coefficients can, therefore, be obtained directly in terms of the estimates of $K_N$ and $P_N$. These estimates are based on the following identities [see (13-106)]:
$$P_{N-1}K_N = E\{\hat{e}_{N-1}[n]\check{e}_{N-1}[n-1]\} \qquad P_N = \tfrac{1}{2}E\{\hat{e}_N^2[n] + \check{e}_N^2[n]\}$$ (13-121)

‡J. P. Burg: Maximum entropy spectral analysis, presented at the International Meeting of the Society for the Exploration of Geophysics, Orlando, FL, 1967.

Replacing expected values by time averages, we obtain the following iteration: Start with
$$P_0 = \frac{1}{J}\sum_{n=1}^{J} x^2[n] \qquad \hat{e}_0[n] = \check{e}_0[n] = x[n]$$
Find $K_{N-1}$, $P_{N-1}$, $a_k^{N-1}$, $\hat{e}_{N-1}[n]$, $\check{e}_{N-1}[n]$. Set
$$K_N = \frac{\sum_{n=N+1}^{J} \hat{e}_{N-1}[n]\check{e}_{N-1}[n-1]}{\tfrac{1}{2}\sum_{n=N+1}^{J}\left(\hat{e}_{N-1}^2[n] + \check{e}_{N-1}^2[n-1]\right)}$$ (13-122)
$$P_N = (1 - K_N^2)P_{N-1}$$ (13-123)
$$a_k^N = a_k^{N-1} - K_N a_{N-k}^{N-1} \quad k = 1, \ldots, N-1 \qquad a_N^N = K_N$$ (13-124)
$$\hat{e}_N[n] = x[n] - \sum_{k=1}^{N} a_k^N x[n-k] \qquad \check{e}_N[n] = x[n-N] - \sum_{k=1}^{N} a_k^N x[n-N+k]$$ (13-125)
This completes the $N$th iteration step. Note that
$$|K_N| \le 1 \qquad P_N \ge 0$$
This follows readily if we apply Cauchy's inequality (see Prob. 11-23) to the numerator of (13-122).

Levinson's algorithm yields the correct spectrum $S(z)$ only if $x[n]$ is an AR process. If it is not, the result is only an approximation. If $R[m]$ is known exactly, the approximation improves as $N$ increases. However, if $R[m]$ is estimated as above, the error might increase because the number of terms in (13-49) equals $J - N - 1$ and it decreases as $N$ increases. The determination of an optimum $N$ is in general difficult.

FEJER-RIESZ THEOREM AND LEVINSON'S ALGORITHM. Given a positive trigonometric polynomial
$$W(e^{j\omega}) = \sum_{n=-N}^{N} w_n e^{-jn\omega} \ge 0$$ (13-126)
we can find a Hurwitz polynomial
$$Y(z) = \sum_{n=0}^{N} \gamma_n z^{-n}$$ (13-127)
such that $W(e^{j\omega}) = |Y(e^{j\omega})|^2$. This theorem has extensive applications. We used it in Sec. 12-1 (spectral factorization) and in the estimation of the spectrum of an MA and an ARMA process. The construction of the polynomial $Y(z)$ involves the determination of the roots of $W(z)$. This is not a simple problem, particularly if $W(e^{j\omega})$ is known only as a function of $\omega$. We discuss next a method for determining $Y(z)$ involving Levinson's algorithm and Fourier series.

We compute, first, the Fourier series coefficients
$$R[m] = \frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{1}{W(e^{j\omega})}e^{jm\omega}\,d\omega \qquad 0 \le m \le N$$ (13-128)
of the function $S(e^{j\omega}) = 1/W(e^{j\omega})$. The numbers $R[m]$ so obtained are the values of a p.d. sequence because $S(e^{j\omega}) > 0$. Applying Levinson's algorithm to the numbers $R[m]$ so computed, we obtain $N + 1$ constants $a_k^N$ and $P_N$. This yields
$$S(e^{j\omega}) = \frac{1}{W(e^{j\omega})} = \frac{P_N}{\left|1 - \sum_{n=1}^{N} a_n^N e^{-jn\omega}\right|^2}$$
Hence
$$Y(z) = \frac{1}{\sqrt{P_N}}\left(1 - \sum_{n=1}^{N} a_n^N z^{-n}\right)$$
as in (13-127). This method thus avoids the factorization problem.

The General Class of Extrapolating Spectra†

†A. Papoulis: "Levinson's Algorithm, Wold's Decomposition, and Spectral Estimation," SIAM Review, vol. 27, 1985.

We consider now the following problem: We are given the $N + 1$ values (data) $R[0], \ldots, R[N]$ of the autocorrelation of a process $x[n]$ and we wish to find all its p.d. extrapolations, that is, we wish to find the family $C_N$ of spectra $S(e^{j\omega}) \ge 0$ such that the first $N + 1$ coefficients of their Fourier series expansion equal the given data. The sequences $R[m]$ of the class $C_N$ and their spectra will be called admissible.

A member of the class $C_N$ is the AR spectrum
$$S(z) = L(z)L(1/z) \qquad L(z) = \frac{\sqrt{P_N}}{E_N(z)}$$
where $E_N(z) = \hat{E}_N(z)$ is the forward filter of order $N$ obtained from an $N$-step Levinson algorithm. The continuation of the corresponding $R[m]$ is obtained from (12-41b):
$$R[m] = \sum_{k=1}^{N} a_k^N R[m-k] \qquad m > N$$
To find all members of the class $C_N$, we continue the algorithm, assigning arbitrary values
$$|K_k| \le 1 \qquad k = N+1, N+2, \ldots$$
to the reflection coefficients. The resulting values of $R[m]$ are determined recursively [see (13-107)]:
$$R[m] = \sum_{k=1}^{m-1} a_k^{m-1} R[m-k] + K_m P_{m-1}$$ (13-129)
This shows that the admissible values of $R[m]$ at the $m$th iteration are in an interval of length $2P_{m-1}$:
$$\sum_{k=1}^{m-1} a_k^{m-1} R[m-k] - P_{m-1} \le R[m] \le \sum_{k=1}^{m-1} a_k^{m-1} R[m-k] + P_{m-1}$$ (13-130)
because $|K_m| \le 1$. At the endpoints of the interval, $|K_m| = 1$; in this case, $P_m = 0$ and $\Delta_{m+1} = 0$. As we have shown, the corresponding spectrum $S(\omega)$ consists of lines. If $|K_{m_0}| < 1$ and $K_m = 0$ for $m > m_0$, then $S(z)$ is an AR spectrum of order $m_0$.

[Figure 13-16: lattice sections labeled "Data" and "Extrapolating"]

In Fig. 13-16, we show the iteration lattice. The first $N$ sections are uniquely determined in terms of the data. The remaining sections form a four-terminal lattice specified in terms of the arbitrarily chosen reflection coefficients $K_{N+1}, K_{N+2}, \ldots$.

Admissible spectra. The DFTs of the sequences generated by the preceding iteration form the class $C_N$ of admissible spectra. We give next a simple characterization of this class starting with regular spectra. Such spectra are the transforms of all admissible sequences obtained with $|K_m| < 1$ for all $m$.
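The extrapolation step (13-129) and the admissible interval (13-130) can be illustrated with a short sketch. It assumes the same order update as Levinson's algorithm for the coefficients $a_k^m$; the function name `extend_autocorrelation` and the sample data are hypothetical.

```python
import numpy as np

def extend_autocorrelation(R, K_extra):
    """Continue the data R[0..N] with assigned reflection coefficients.

    Returns the extended sequence and, for each new lag m, the admissible
    interval (13-130) from which R[m] was chosen.
    """
    R = [float(r) for r in R]
    a, P = np.zeros(0), R[0]
    # Levinson steps on the data: K_m is computed from R[m], (13-107).
    for m in range(1, len(R)):
        k_m = (R[m] - np.dot(a, R[m - 1:0:-1])) / P
        a = np.concatenate([a - k_m * a[::-1], [k_m]])
        P *= 1.0 - k_m**2
    intervals = []
    # Extrapolation steps: R[m] is computed from the assigned K_m, (13-129).
    for k_m in K_extra:
        if P <= 0.0:             # |K| = 1 was reached: the iteration terminates
            break
        m = len(R)
        center = float(np.dot(a, R[m - 1:0:-1]))
        intervals.append((center - P, center + P))
        R.append(center + k_m * P)
        a = np.concatenate([a - k_m * a[::-1], [k_m]])
        P *= 1.0 - k_m**2
    return R, intervals

# K = 0 at every new step gives the AR continuation of the data.
R_ar, iv = extend_autocorrelation([8.0, 4.0], [0.0, 0.0])
# K = 1 picks the right endpoint of the interval: the correlation matrix is singular.
R_line, _ = extend_autocorrelation([8.0, 4.0], [1.0])
```

With the data $R[0] = 8$, $R[1] = 4$, the admissible interval for $R[2]$ is $(-4, 8)$, the choice $K_2 = 0$ yields $R[2] = 2$, and the endpoint $K_2 = 1$ yields $R[2] = 8$.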
We shall show that all regular spectra can be expressed in terms of the forward and backward functions $E_N(z)$ and $z^{-N}E_N(1/z)$ of the first $N$ sections, the constant $P_N$, and a reflection coefficient $\rho(z)$ defined as follows:

A function $\rho(z)$ is called a reflection coefficient if
$$\rho(z) = \frac{b(z)}{a(z)} \qquad |\rho(z)| < 1 \quad \text{for } |z| \ge 1$$ (13-131)
where $a(z)$ and $b(z)$ are two power series in $z^{-1}$ analytic for $|z| > 1$ and such that $a(\infty) = 1$, $b(\infty) = 0$.

It can be shown that the functions
$$S(e^{j\omega}) = P_N\,\frac{1 - |\rho(e^{j\omega})|^2}{\left|E_N(e^{j\omega}) - e^{-jN\omega}E_N(e^{-j\omega})\rho(e^{j\omega})\right|^2}$$ (13-132)
generate the class of all regular spectra of the class $C_N$, where $\rho(z)$ is an arbitrary reflection coefficient. The proof of this is based on the properties of four-terminal lattices. The details, however, are involved (see the footnote to the heading of this subsection).

We shall determine the innovations filter $L(z)$ of a process with the above spectrum. To do so, we must factor $S(z)$. The denominator of $S(z)$ is factored readily. To factor the numerator, we observe that
$$1 - \rho(z)\rho(1/z) = \frac{a(z)a(1/z) - b(z)b(1/z)}{a(z)a(1/z)}$$
It suffices, therefore, to factor the numerator of this fraction. To do so, we determine a Hurwitz polynomial $y(z)$ such that
$$y(z)y(1/z) = a(z)a(1/z) - b(z)b(1/z)$$
With $y(z)$ so determined, (13-132) yields
$$L(z) = \frac{\sqrt{P_N}\,y(z)}{E_N(z)a(z) - z^{-N}E_N(1/z)b(z)}$$ (13-133)
If $\rho(z)$ is a rational function, $S(z)$ is an ARMA spectrum. If $\rho(z) = 0$, it is an AR spectrum of order $N$. If $y(z) = $ constant, $S(z)$ is an AR spectrum of order higher than $N$.

Line spectra. If in the preceding algorithm $|K_m| = 1$ for $m = m_0 > N$, then the iteration terminates and the resulting spectrum consists of $m_0$ lines. We can, therefore, fit the $N + 1$ values $R[m]$ of a p.d. sequence with the sum of $m_0$ exponentials, where $m_0$ is any number larger than $N$. We can do so with $N$ lines only if $\Delta_{N+1} = 0$. If $\Delta_{N+1} > 0$, we obtain a spectrum consisting of $N$ lines and a constant:
$$S(\omega) = q_0 + 2\pi\sum_{i=1}^{N} \alpha_i \delta(\omega - \omega_i)$$ (13-134)
where $q_0$ is the smallest eigenvalue of the matrix $D_{N+1}$. This is a consequence of the modified form (13-120) of Caratheodory's theorem.

Maximum Entropy and Smoothness Conditions

In the preceding discussion we used as the parametric form of $S(z)$ a rational function of $z$. In the following we determine the parametric form of $S(z)$ in terms of certain smoothness conditions leading to the maximization of the integral of some function of $S(\omega)$. The method of maximum entropy is a special case.

We repeat the problem: We are given the first $N + 1$ values of the p.d. sequence $R[m]$ and we wish to determine its spectrum
$$S(\omega) = \sum_{m=-\infty}^{\infty} R[m]e^{-jm\omega}$$
To solve this problem, we introduce a nonlinear function $G(S(\omega))$ of $S(\omega)$ and we determine the unknown values of $R[m]$ so as to maximize the integral
$$H = \int_{-\pi}^{\pi} G(S(\omega))\,d\omega$$ (13-135)
subject to the constraints
$$R[m] = \frac{1}{2\pi}\int_{-\pi}^{\pi} S(\omega)e^{jm\omega}\,d\omega \qquad |m| \le N$$ (13-136)
where $R[m]$ are the given data.

The integral $H$ depends on the unknown values of $R[m]$. It is, therefore, maximum if
$$\frac{\partial H}{\partial R[m]} = \int_{-\pi}^{\pi} \frac{dG}{dS}\,\frac{\partial S(\omega)}{\partial R[m]}\,d\omega = 0 \qquad |m| > N$$
With $F(S(\omega)) = G'(S(\omega))$ this yields
$$\int_{-\pi}^{\pi} F(S(\omega))e^{-jm\omega}\,d\omega = 0 \qquad |m| > N$$
because
$$\frac{\partial S(\omega)}{\partial R[m]} = e^{-jm\omega}$$
From this it follows that the Fourier series coefficients of the function $F(S(\omega))$ must be 0 for $|m| > N$. In other words,
$$F(S(\omega)) = \sum_{k=-N}^{N} c_k e^{-jk\omega}$$ (13-137)
The constants $c_k$ can, in principle, be determined in terms of the data $R[m]$. Indeed, from (13-136) and (13-137) it follows that
$$R[m] = \frac{1}{2\pi}\int_{-\pi}^{\pi} F^{(-1)}\!\left(\sum_{k=-N}^{N} c_k e^{-jk\omega}\right)e^{jm\omega}\,d\omega \qquad |m| \le N$$ (13-138)
where $F^{(-1)}$ is the inverse of $F(S)$. This is a nonlinear system of $2N + 1$ equations involving the $2N + 1$ unknowns $c_k$. Its solution is in general difficult.

The selection of the function $G(S)$ depends on the applications. It might be selected, for example, to emphasize the high or low values of $S(\omega)$. The following special case is of particular interest. It leads to a system that can be simply solved and the result maximizes the uncertainty about the unknown spectrum.

The method of maximum entropy.† We now assume that
$$G(S(\omega)) = \ln S(\omega)$$
In this case,
$$H = \int_{-\pi}^{\pi} \ln S(\omega)\,d\omega$$ (13-139)
If $S(\omega)$ is the power spectrum of a process $x[n]$, then $H$ is the entropy rate of $x[n]$ [see (15-130)]. Since $G(S) = \ln S$, its derivative is $F(S) = 1/S$; hence (13-137) yields
$$F(S(\omega)) = \frac{1}{S(\omega)} = \sum_{k=-N}^{N} c_k e^{-jk\omega} > 0$$ (13-140)
This shows that the spectrum $S(\omega)$ is AR: its reciprocal is a trigonometric polynomial of order $N$. It can, therefore, be written in the form
$$S(\omega) = \frac{P_N}{\left|1 - \sum_{k=1}^{N} a_k^N e^{-jk\omega}\right|^2}$$
Hence its coefficients $a_k$ and $P_N$ can be determined recursively from Levinson's algorithm. We have thus shown that the estimation of $S(\omega)$ based on the principle of maximum entropy rate is equivalent to the assumption that the unknown $S(\omega)$ is AR.

†A. Papoulis: "Maximum Entropy and Spectral Estimation: A Review," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-29, 1981.

APPENDIX 13A
MINIMUM-PHASE FUNCTIONS

A function
$$H(z) = \sum_{n=0}^{\infty} h_n z^{-n}$$
is called minimum-phase if it is analytic and its inverse $1/H(z)$ is also analytic
for $|z| \ge 1$. We shall show that if $H(z)$ is minimum-phase, then
$$\ln h_0^2 = \frac{1}{2\pi}\int_{-\pi}^{\pi} \ln|H(e^{j\varphi})|^2\,d\varphi$$ (13A-1)

Proof. Using the identity $|H(e^{j\varphi})|^2 = H(e^{j\varphi})H(e^{-j\varphi})$, we conclude with $e^{j\varphi} = z$ that
$$\int_{-\pi}^{\pi} \ln|H(e^{j\varphi})|^2\,d\varphi = \frac{1}{j}\oint \frac{1}{z}\ln\left[H(z)H(z^{-1})\right]dz$$
where the path of integration is the unit circle. We note further, changing $z$ to $1/z$, that
$$\oint \frac{1}{z}\ln H(z)\,dz = \oint \frac{1}{z}\ln H(z^{-1})\,dz$$
To prove (13A-1), it suffices, therefore, to show that
$$\ln|h_0| = \frac{1}{2\pi j}\oint \frac{1}{z}\ln H(z)\,dz$$
This follows readily because $H(z)$ tends to $h_0$ as $z \to \infty$ and the function $\ln H(z)$ is analytic for $|z| \ge 1$ by assumption.

APPENDIX 13B
ALL-PASS FUNCTIONS

The unit circle is the locus of points $N$ such that (see Fig. 13-17a)
$$\frac{(NA)}{(NB)} = \frac{|e^{j\varphi} - 1/z_i^*|}{|e^{j\varphi} - z_i|} = \frac{1}{|z_i|} \qquad |z_i| < 1$$
From this it follows that, if
$$F(z) = \frac{z z_i^* - 1}{z - z_i} \qquad |z_i| < 1$$
then $|F(e^{j\varphi})| = 1$. Furthermore, $|F(z)| > 1$ for $|z| < 1$ and $|F(z)| < 1$ for $|z| > 1$ because $F(z)$ is continuous and
$$|F(0)| = \frac{1}{|z_i|} > 1 \qquad |F(\infty)| = |z_i| < 1$$
Multiplying $N$ bilinear fractions of the above form, we conclude that, if
$$H(z) = \prod_{i=1}^{N} \frac{z z_i^* - 1}{z - z_i} \qquad |z_i| < 1$$ (13B-1)
then
$$|H(z)| \begin{cases} > 1 & |z| < 1 \\ = 1 & |z| = 1 \\ < 1 & |z| > 1 \end{cases}$$ (13B-2)

[Figure 13-17: (a) unit-circle construction; (b) all-pass filter $H(z)$ and $H(z^{-1})$]

A system with system function $H(z)$ as in (13B-1) is called all-pass. Thus an all-pass system is stable, causal, and
$$|H(e^{j\omega T})| = 1$$
Furthermore,
$$\frac{1}{H(z)} = \prod_{i=1}^{N} \frac{z - z_i}{z z_i^* - 1} = \prod_{i=1}^{N} \frac{1 - z_i/z}{z_i^* - 1/z} = H\!\left(\frac{1}{z}\right)$$ (13B-3)
because if $z_i$ is a pole of $H(z)$, then $z_i^*$ is also a pole.

From the above it follows that if $h[n]$ is the delta response of an all-pass system, then the delta response of its inverse is $h[-n]$:
$$H(z) = \sum_{n=0}^{\infty} h[n]z^{-n} \qquad \frac{1}{H(z)} = \sum_{n=0}^{\infty} h[n]z^{n}$$ (13B-4)
Both series converge in a ring containing the unit circle.
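Properties (13B-2) and (13B-3) are easy to confirm numerically. The sketch below evaluates the product (13B-1) for an arbitrarily chosen pole set that is closed under conjugation; the function name `allpass` and the pole values are assumptions made only for illustration.

```python
import numpy as np

def allpass(z, poles):
    """H(z) = prod_i (z z_i^* - 1) / (z - z_i) with all |z_i| < 1, as in (13B-1)."""
    z = np.asarray(z, dtype=complex)
    H = np.ones_like(z)
    for p in poles:
        H = H * (z * np.conj(p) - 1.0) / (z - p)
    return H

# One real pole and one conjugate pair, all inside the unit circle.
poles = [0.5, 0.6 * np.exp(1j * 0.8), 0.6 * np.exp(-1j * 0.8)]
phi = np.linspace(0.0, 2.0 * np.pi, 9, endpoint=False)

mag_inside = np.abs(allpass(0.9 * np.exp(1j * phi), poles))    # expected > 1
mag_circle = np.abs(allpass(np.exp(1j * phi), poles))          # expected = 1
mag_outside = np.abs(allpass(1.2 * np.exp(1j * phi), poles))   # expected < 1

# (13B-3): 1/H(z) = H(1/z) when the poles come in conjugate pairs.
z = 1.2 * np.exp(1j * phi)
lhs, rhs = 1.0 / allpass(z, poles), allpass(1.0 / z, poles)
```

Dropping one member of the conjugate pair keeps the magnitude property (13B-2) but breaks the identity (13B-3), which is why the text requires conjugate poles.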
PROBLEMS

13-1. Find the mean and variance of the RV
$$\eta_T = \frac{1}{2T}\int_{-T}^{T} x(t)\,dt \qquad \text{where } x(t) = 10 + \nu(t)$$
for $T = 5$ and for $T = 100$. Assume that $E\{\nu(t)\} = 0$, $R_\nu(\tau) = 2\delta(\tau)$.

13-2. Show that if a process is normal and distribution-ergodic as in (13-35), then it is also mean-ergodic.

13-3. Show that if $x(t)$ is normal with $\eta_x = 0$ and $R_x(\tau) = 0$ for $|\tau| > a$, then it is correlation-ergodic.

13-4. Show that the process $a e^{j(\omega t + \varphi)}$ is not correlation-ergodic.

13-5. Show that
$$R_{xy}(\lambda) = \lim_{T\to\infty} \frac{1}{2T}\int_{-T}^{T} x(t + \lambda)y(t)\,dt$$
iff
$$\lim_{T\to\infty} \frac{1}{2T}\int_{-2T}^{2T}\left(1 - \frac{|\tau|}{2T}\right)E\{x(t + \lambda + \tau)y(t + \tau)x(t + \lambda)y(t)\}\,d\tau = R_{xy}^2(\lambda)$$

13-6. The process $x(t)$ is cyclostationary with period $T$, mean $\eta(t)$, and correlation $R(t_1, t_2)$. Show that if $R(t + \tau, t) \to \eta^2(t)$ as $|\tau| \to \infty$, then
$$\lim_{c\to\infty} \frac{1}{2c}\int_{-c}^{c} x(t)\,dt = \frac{1}{T}\int_{0}^{T} \eta(t)\,dt$$
Hint: The process $\bar{x}(t) = x(t - \theta)$ is mean-ergodic.

13-7. Show that if $C(t + \tau, t) \to 0$ as $\tau \to \infty$ uniformly in $t$, then $x(t)$ is mean-ergodic.

13-8. The process $x(t)$ is normal with 0 mean and WSS. (a) Show that (Fig. P13-8a)
$$E\{x(t + \lambda)\,|\,x(t) = x\} = \frac{R(\lambda)}{R(0)}\,x$$
(b) Show that if $D$ is an arbitrary set of real numbers $x_i$ and $\bar{x} = E\{x(t)\,|\,x(t) \in D\}$, then (Fig. P13-8b)
$$E\{x(t + \lambda)\,|\,x(t) \in D\} = \frac{R(\lambda)}{R(0)}\,\bar{x}$$
(c) Using the above, design an analog correlometer for normal processes.

[Figure P13-8]

13-9. The processes $x(t)$ and $y(t)$ are jointly normal with zero mean. Show that: (a) If $w(t) = x(t + \lambda)y(t)$, then
$$C_{ww}(\tau) = C_{xy}(\lambda + \tau)C_{xy}(\lambda - \tau) + C_{xx}(\tau)C_{yy}(\tau)$$
(b) If the processes $x(t)$ and $y(t)$ are variance-ergodic, they are also cross-variance-ergodic.

13-10. Using Schwarz's inequality (11B-1), show that
$$\left|\int_a^b f(x)\,dx\right|^2 \le (b - a)\int_a^b |f(x)|^2\,dx$$

13-11. We wish to estimate the mean $\eta$ of a process $x(t) = \eta + \nu(t)$, where $R_{\nu\nu}(\tau) = 5\delta(\tau)$. (a) Using (5-57), find the 0.95 confidence interval of $\eta$. (b) Improve the estimate if $\nu(t)$ is a normal process.

13-12. (a) Show that if we use as estimate of the power spectrum $S(\omega)$ of a discrete-time process $x[n]$ the function
$$S_w(\omega) = \sum_{m=-N}^{N} w_m R[m]e^{-jm\omega}$$
then
$$S_w(\omega) = \frac{1}{2\pi}\int_{-\pi}^{\pi} S(y)W(\omega - y)\,dy \qquad W(\omega) = \sum_{n=-N}^{N} w_n e^{-jn\omega}$$
(b) Find $W(\omega)$ if $N = 10$ and $w_n = 1 - |n|/11$.

13-13. Show that if $x(t)$ is a zero-mean normal process with sample spectrum
$$S_T(\omega) = \frac{1}{2T}\left|\int_{-T}^{T} x(t)e^{-j\omega t}\,dt\right|^2$$
and $S(\omega)$ is sufficiently smooth, then
$$E^2\{S_T(\omega)\} \le \operatorname{Var} S_T(\omega) \le 2E^2\{S_T(\omega)\}$$
The right side is an equality if $\omega = 0$. The left side is an approximate equality if $T \gg 1/\omega$. Hint: Use (12-74).

13-14. Show that the weighted sample spectrum
$$S_c(\omega) = \frac{1}{2T}\left|\int_{-T}^{T} c(t)x(t)e^{-j\omega t}\,dt\right|^2$$
of a process $x(t)$ is the Fourier transform of the function
$$R_c(\tau) = \frac{1}{2T}\int_{-T+|\tau|/2}^{T-|\tau|/2} c\!\left(t + \frac{\tau}{2}\right)c\!\left(t - \frac{\tau}{2}\right)x\!\left(t + \frac{\tau}{2}\right)x\!\left(t - \frac{\tau}{2}\right)dt$$

13-15. Given a normal process $x(t)$ with zero mean and power spectrum $S(\omega)$, we form its sample autocorrelation $R_T(\tau)$ as in (13-38). Show that for large $T$,
$$\operatorname{Var} R_T(\lambda) \simeq \frac{1}{4\pi T}\int_{-\infty}^{\infty}\left(1 + e^{j2\lambda\omega}\right)S^2(\omega)\,d\omega$$

13-16. Show that if
$$R_T(\tau) = \frac{1}{2T}\int_{-T+|\tau|/2}^{T-|\tau|/2} x\!\left(t + \frac{\tau}{2}\right)x\!\left(t - \frac{\tau}{2}\right)dt$$
is the estimate of the autocorrelation $R(\tau)$ of a zero-mean normal process, then
$$\sigma_{R_T}^2 = \frac{1}{2T}\int_{-2T+|\tau|}^{2T-|\tau|}\left[R^2(\alpha) + R(\alpha + \tau)R(\alpha - \tau)\right]\left(1 - \frac{|\tau| + |\alpha|}{2T}\right)d\alpha$$

13-17. Show that in Levinson's algorithm,
$$a_k^{N-1} = \frac{a_k^N + K_N a_{N-k}^N}{1 - K_N^2}$$

13-18. Show that if $R[0] = 8$ and $R[1] = 4$, then the MEM estimate of $S(\omega)$ equals
$$S_{\text{MEM}}(\omega) = \frac{6}{|1 - 0.5e^{-j\omega}|^2}$$

13-19. Find the maximum entropy estimate $S_{\text{MEM}}(\omega)$ and the line-spectral estimate (13-111) of a process $x[n]$ if
$$R[0] = 13 \qquad R[1] = 5 \qquad R[2] = 2$$
CHAPTER 14
MEAN SQUARE ESTIMATION

14-1 INTRODUCTION†

†N. Wiener: Extrapolation, Interpolation, and Smoothing of Stationary Time Series, MIT Press, 1950. J. Makhoul: "Linear Prediction: A Tutorial Review," Proceedings of the IEEE, vol. 63, 1975. T. Kailath: "A View of Three Decades of Linear Filtering Theory," IEEE Transactions on Information Theory, vol. IT-20, 1974.

In this chapter, we consider the problem of estimating the value of a stochastic process $s(t)$ at a specific time in terms of the values (data) of another process $x(\xi)$ specified for every $\xi$ in an interval $a \le \xi \le b$ of finite or infinite length. In the digital case, the solution of this problem is a direct application of the orthogonality principle (see Sec. 8-4). In the analog case, the linear estimator $\hat{s}(t)$ of $s(t)$ is not a sum. It is an integral
$$\hat{s}(t) = \hat{E}\{s(t)\,|\,x(\xi),\ a \le \xi \le b\} = \int_a^b h(\alpha)x(\alpha)\,d\alpha$$ (14-1)
and our objective is to find $h(\alpha)$ so as to minimize the MS error
$$P = E\{[s(t) - \hat{s}(t)]^2\} = E\left\{\left[s(t) - \int_a^b h(\alpha)x(\alpha)\,d\alpha\right]^2\right\}$$ (14-2)
The function $h(\alpha)$ involves a noncountable number of unknowns, namely, its
values for every $\alpha$ in the interval $(a, b)$. To determine $h(\alpha)$, we shall use the following extension of the orthogonality principle:

THEOREM. The MS error $P$ of the estimation of a process $s(t)$ by the integral in (14-1) is minimum if the data $x(\xi)$ are orthogonal to the error $s(t) - \hat{s}(t)$:
$$E\left\{\left[s(t) - \int_a^b h(\alpha)x(\alpha)\,d\alpha\right]x(\xi)\right\} = 0 \qquad a \le \xi \le b$$ (14-3)
or, equivalently, if $h(\alpha)$ is the solution of the integral equation
$$R_{sx}(t, \xi) = \int_a^b h(\alpha)R_{xx}(\alpha, \xi)\,d\alpha \qquad a \le \xi \le b$$ (14-4)

Proof. We shall give a formal proof based on the approximation of the integral in (14-1) by its Riemann sum. Dividing the interval $(a, b)$ into $m$ segments $(\alpha_k, \alpha_k + \Delta\alpha)$, we obtain
$$\hat{s}(t) \simeq \sum_{k=1}^{m} h(\alpha_k)x(\alpha_k)\,\Delta\alpha \qquad \Delta\alpha = \frac{b - a}{m}$$
Applying (8-70) with $a_k = h(\alpha_k)\,\Delta\alpha$, we conclude that the resulting MS error $P$ is minimum if
$$E\left\{\left[s(t) - \sum_{k=1}^{m} h(\alpha_k)x(\alpha_k)\,\Delta\alpha\right]x(\xi_j)\right\} = 0 \qquad 1 \le j \le m$$
where $\xi_j$ is a point in the interval $(\alpha_j, \alpha_j + \Delta\alpha)$. This yields the system
$$R_{sx}(t, \xi_j) = \sum_{k=1}^{m} h(\alpha_k)R_{xx}(\alpha_k, \xi_j)\,\Delta\alpha \qquad j = 1, \ldots, m$$ (14-5)
The integral equation (14-4) is the limit of (14-5) as $\Delta\alpha \to 0$.

From (8-73) it follows that the LMS error of the estimation of $s(t)$ by the integral in (14-1) equals
$$P = E\left\{\left[s(t) - \int_a^b h(\alpha)x(\alpha)\,d\alpha\right]s(t)\right\} = R_{ss}(0) - \int_a^b h(\alpha)R_{sx}(t, \alpha)\,d\alpha$$ (14-6)
In general, the integral equation (14-4) can only be solved numerically. In fact, if we assign to the variable $\xi$ the values $\xi_j$ and we approximate the integral by a sum, we obtain the system (14-5). In this chapter, we consider various special cases that lead to explicit solutions. Unless stated otherwise, it will be assumed that all processes are WSS and real.

We shall use the following terminology:

If the time $t$ in (14-1) is in the interior of the data interval $(a, b)$, then the estimate $\hat{s}(t)$ of $s(t)$ will be called smoothing.
If $t$ is outside this interval and $x(t) = s(t)$ (no noise), then $\hat{s}(t)$ is a predictor of $s(t)$. If $t > b$, then $\hat{s}(t)$ is a "forward predictor"; if $t < a$, it is a "backward predictor."

If $t$ is outside the data interval and $x(t) \ne s(t)$, then the estimate is called filtering and prediction.

Simple Illustrations

In this section, we present a number of simple estimation problems involving a finite number of data and we conclude with the smoothing problem when the data $x(\xi)$ are available from $(-\infty, \infty)$. In this case, the solution of the integral equation (14-4) is readily obtained in terms of Fourier transforms.

Prediction. We wish to estimate the future value $s(t + \lambda)$ of a stationary process $s(t)$ in terms of its present value
$$\hat{s}(t + \lambda) = \hat{E}\{s(t + \lambda)\,|\,s(t)\} = a s(t)$$
From (7-71) and (7-72) it follows with $n = 1$ that
$$E\{[s(t + \lambda) - a s(t)]s(t)\} = 0 \qquad a = \frac{R(\lambda)}{R(0)}$$
$$P = E\{[s(t + \lambda) - a s(t)]s(t + \lambda)\} = R(0) - aR(\lambda)$$

Special case. If
$$R(\tau) = Ae^{-\alpha|\tau|} \qquad \text{then} \qquad a = e^{-\alpha\lambda}$$
In this case, the difference $s(t + \lambda) - a s(t)$ is orthogonal to $s(t - \xi)$ for every $\xi \ge 0$:
$$E\{[s(t + \lambda) - a s(t)]s(t - \xi)\} = R(\lambda + \xi) - aR(\xi) = Ae^{-\alpha(\lambda + \xi)} - Ae^{-\alpha\lambda}e^{-\alpha\xi} = 0$$
This shows that $a s(t)$ is the estimate of $s(t + \lambda)$ in terms of its entire past. Such a process is called wide-sense Markoff of order 1.

We shall now find the estimate of $s(t + \lambda)$ in terms of $s(t)$ and $s'(t)$:
$$\hat{s}(t + \lambda) = a_1 s(t) + a_2 s'(t)$$
The orthogonality condition (8-70) yields
$$s(t + \lambda) - \hat{s}(t + \lambda) \perp s(t),\ s'(t)$$
Using the identities
$$R'(0) = 0 \qquad R_{ss'}(\tau) = -R'(\tau) \qquad R_{s's'}(\tau) = -R''(\tau)$$
we obtain
$$a_1 = \frac{R(\lambda)}{R(0)} \qquad a_2 = \frac{R'(\lambda)}{R''(0)}$$
$$P = E\{[s(t + \lambda) - a_1 s(t) - a_2 s'(t)]s(t + \lambda)\} = R(0) - a_1 R(\lambda) + a_2 R'(\lambda)$$
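The two-term predictor can be evaluated for a concrete autocorrelation. The sketch below assumes the differentiable autocorrelation $R(\tau) = e^{-\tau^2/2}$, which is an illustrative choice and not taken from the text; the function name `predictor_with_derivative` is hypothetical. It compares the MS error with that of the predictor using $s(t)$ alone.

```python
import math

def predictor_with_derivative(R, dR, d2R, lam):
    """Coefficients and MS errors for predicting s(t + lam).

    R, dR, d2R : autocorrelation and its first two derivatives (callables)
    Returns (a1, a2, P, P_value_only), where P_value_only is the error of
    the predictor a*s(t) that ignores s'(t).
    """
    a1 = R(lam) / R(0.0)                       # a1 = R(lam)/R(0)
    a2 = dR(lam) / d2R(0.0)                    # a2 = R'(lam)/R''(0)
    P = R(0.0) - a1 * R(lam) + a2 * dR(lam)    # error using s(t) and s'(t)
    P_value_only = R(0.0) - R(lam) ** 2 / R(0.0)
    return a1, a2, P, P_value_only

# Assumed autocorrelation R(tau) = exp(-tau^2 / 2) and its derivatives.
R = lambda t: math.exp(-t * t / 2.0)
dR = lambda t: -t * math.exp(-t * t / 2.0)
d2R = lambda t: (t * t - 1.0) * math.exp(-t * t / 2.0)

lam = 0.1
a1, a2, P, P_value_only = predictor_with_derivative(R, dR, d2R, lam)
```

For this choice, $a_1 = e^{-\lambda^2/2}$ and $a_2 = \lambda e^{-\lambda^2/2}$, so for small $\lambda$ the estimate is close to $s(t) + \lambda s'(t)$, and including the derivative lowers the MS error by $\lambda^2 e^{-\lambda^2}$.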
If $\lambda$ is small, then
$$R(\lambda) \simeq R(0) \qquad R'(\lambda) \simeq R'(0) + R''(0)\lambda = R''(0)\lambda$$
$$a_1 \simeq 1 \qquad a_2 \simeq \lambda \qquad \hat{s}(t + \lambda) \simeq s(t) + \lambda s'(t)$$

Filtering. We shall estimate the present value of a process $s(t)$ in terms of the present value of another process $x(t)$:
$$\hat{s}(t) = \hat{E}\{s(t)\,|\,x(t)\} = a x(t)$$
From (7-71) and (7-72) it follows that
$$E\{[s(t) - a x(t)]x(t)\} = 0 \qquad a = \frac{R_{sx}(0)}{R_{xx}(0)}$$
$$P = E\{[s(t) - a x(t)]s(t)\} = R_{ss}(0) - aR_{sx}(0)$$

[Figure 14-1]

Interpolation. We wish to estimate the value $s(t + \lambda)$ of a process $s(t)$ at a point $t + \lambda$ in the interval $(t, t + T)$, in terms of its $2N + 1$ samples $s(t + kT)$ that are nearest to $t$ (Fig. 14-1):
$$\hat{s}(t + \lambda) = \sum_{k=-N}^{N} a_k s(t + kT) \qquad 0 < \lambda < T$$ (14-7)
The orthogonality principle now yields
$$E\left\{\left[s(t + \lambda) - \sum_{k=-N}^{N} a_k s(t + kT)\right]s(t + nT)\right\} = 0 \qquad |n| \le N$$
from which it follows that
$$\sum_{k=-N}^{N} a_k R(kT - nT) = R(\lambda - nT) \qquad -N \le n \le N$$ (14-8)
This is a system of $2N + 1$ equations and its solution yields the $2N + 1$ unknowns $a_k$.

The MS value $P$ of the estimation error
$$\varepsilon_N(t) = s(t + \lambda) - \sum_{k=-N}^{N} a_k s(t + kT)$$ (14-9)
equals
$$P = E\{\varepsilon_N(t)s(t + \lambda)\} = R(0) - \sum_{k=-N}^{N} a_k R(\lambda - kT)$$ (14-10)

Interpolation as deterministic approximation. The error $\varepsilon_N(t)$ can be considered as the output of the system
$$E_N(\omega) = e^{j\omega\lambda} - \sum_{k=-N}^{N} a_k e^{jkT\omega}$$
(error filter) with input $s(t)$. Denoting by $S(\omega)$ the power spectrum of $s(t)$, we conclude from (10-139) that
$$P = E\{\varepsilon_N^2(t)\} = \frac{1}{2\pi}\int_{-\infty}^{\infty} S(\omega)\left|e^{j\omega\lambda} - \sum_{k=-N}^{N} a_k e^{jkT\omega}\right|^2 d\omega$$ (14-11)
This shows that the minimization of $P$ is equivalent to the deterministic problem of minimizing the weighted mean square error of the approximation of the exponential $e^{j\omega\lambda}$ by a trigonometric polynomial (truncated Fourier series).

Quadrature. We shall estimate the integral
$$z = \int_0^b s(t)\,dt$$
of a process $s(t)$ in terms of its $N + 1$ samples $s(nT)$:
$$\hat{z} = a_0 s(0) + a_1 s(T) + \cdots + a_N s(NT) \qquad T = \frac{b}{N}$$
Applying (8-70), we obtain
$$E\left\{\left[\int_0^b s(t)\,dt - \hat{z}\right]s(kT)\right\} = 0 \qquad 0 \le k \le N$$
Hence
$$\int_0^b R(t - kT)\,dt = a_0 R(kT) + \cdots + a_N R(kT - NT) \qquad 0 \le k \le N$$
This is a system of $N + 1$ equations and its solution yields the coefficients $a_k$.

Smoothing. We wish to estimate the present value of a process $s(t)$ in terms of the values $x(\xi)$ of the sum
$$x(t) = s(t) + \nu(t)$$
available for every $\xi$ from $-\infty$ to $\infty$. The desirable estimate
$$\hat{s}(t) = \hat{E}\{s(t)\,|\,x(\xi),\ -\infty < \xi < \infty\}$$
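will be written in the form
$$\hat{s}(t) = \int_{-\infty}^{\infty} h(\alpha)x(t - \alpha)\,d\alpha$$ (14-12)
In this notation, $h(\alpha)$ is independent of $t$ and $\hat{s}(t)$ can be considered as the output of a linear time-invariant noncausal system with input $x(t)$ and impulse response $h(t)$. Our problem is to find $h(t)$.

Clearly,
$$s(t) - \hat{s}(t) \perp x(\xi) \qquad \text{all } \xi$$
Setting $\xi = t - \tau$, we obtain
$$E\left\{\left[s(t) - \int_{-\infty}^{\infty} h(\alpha)x(t - \alpha)\,d\alpha\right]x(t - \tau)\right\} = 0 \qquad \text{all } \tau$$
This yields
$$R_{sx}(\tau) = \int_{-\infty}^{\infty} h(\alpha)R_{xx}(\tau - \alpha)\,d\alpha \qquad \text{all } \tau$$ (14-13)
Thus, to determine $h(t)$, we must solve the above integral equation. This equation can be solved easily because it holds for all $\tau$ and the integral is a convolution of $h(\tau)$ with $R_{xx}(\tau)$. Taking transforms of both sides, we obtain $S_{sx}(\omega) = H(\omega)S_{xx}(\omega)$. Hence
$$H(\omega) = \frac{S_{sx}(\omega)}{S_{xx}(\omega)}$$ (14-14)
The resulting system is called the noncausal Wiener filter.

The MS estimation error $P$ equals
$$P = E\left\{\left[s(t) - \int_{-\infty}^{\infty} h(\alpha)x(t - \alpha)\,d\alpha\right]s(t)\right\} = R_{ss}(0) - \int_{-\infty}^{\infty} h(\alpha)R_{sx}(\alpha)\,d\alpha = \frac{1}{2\pi}\int_{-\infty}^{\infty}\left[S_{ss}(\omega) - H^*(\omega)S_{sx}(\omega)\right]d\omega$$ (14-15)
If the signal $s(t)$ and the noise $\nu(t)$ are orthogonal, then
$$S_{sx}(\omega) = S_{ss}(\omega) \qquad S_{xx}(\omega) = S_{ss}(\omega) + S_{\nu\nu}(\omega)$$
Hence (Fig. 14-2)
$$H(\omega) = \frac{S_{ss}(\omega)}{S_{ss}(\omega) + S_{\nu\nu}(\omega)} \qquad P = \frac{1}{2\pi}\int_{-\infty}^{\infty} \frac{S_{ss}(\omega)S_{\nu\nu}(\omega)}{S_{ss}(\omega) + S_{\nu\nu}(\omega)}\,d\omega$$ (14-16)
If the spectra $S_{ss}(\omega)$ and $S_{\nu\nu}(\omega)$ do not overlap, then $H(\omega) = 1$ in the band of the signal and $H(\omega) = 0$ in the band of the noise. In this case, $P = 0$.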
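For orthogonal signal and noise, (14-16) can be evaluated numerically for any pair of spectra. The sketch below is one possible way to do so, assuming the substitution $\omega = \tan\theta$ to map the infinite range onto $(-\pi/2, \pi/2)$ and a midpoint rule; the function name `wiener_noncausal` and the sample spectra $S_{ss}(\omega) = N_0/(\alpha^2 + \omega^2)$, $S_{\nu\nu}(\omega) = N$ with $N_0 = N = \alpha = 1$ are illustrative assumptions.

```python
import numpy as np

def wiener_noncausal(S_ss, S_vv, n=20000):
    """Return (H, P) of (14-16) for orthogonal signal and noise.

    S_ss, S_vv : callables giving the signal and noise power spectra
    H          : callable, H(w) = S_ss / (S_ss + S_vv)
    P          : MS error, integrated with w = tan(theta) and the midpoint rule
    """
    theta = (np.arange(n) + 0.5) * np.pi / n - np.pi / 2.0   # midpoints in (-pi/2, pi/2)
    w = np.tan(theta)
    integrand = S_ss(w) * S_vv(w) / (S_ss(w) + S_vv(w))
    # d(omega) = (1 + omega^2) d(theta); the factor 1/(2 pi) comes from (14-16)
    P = np.sum(integrand * (1.0 + w**2)) * (np.pi / n) / (2.0 * np.pi)
    H = lambda omega: S_ss(omega) / (S_ss(omega) + S_vv(omega))
    return H, P

alpha, N0, Nn = 1.0, 1.0, 1.0
S_ss = lambda w: N0 / (alpha**2 + w**2)       # assumed signal spectrum
S_vv = lambda w: Nn * np.ones_like(w)         # assumed white-noise spectrum
H, P = wiener_noncausal(S_ss, S_vv)

beta = np.sqrt(alpha**2 + N0 / Nn)            # closed form for this pair: P = N0 / (2 beta)
```

For these spectra the integrand reduces to $N_0/(\beta^2 + \omega^2)$ with $\beta^2 = \alpha^2 + N_0/N$, so the numerical value of $P$ can be compared with $N_0/2\beta$.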
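[Figure 14-2]

Example 14-1. If
$$S_{ss}(\omega) = \frac{N_0}{\alpha^2 + \omega^2} \qquad S_{\nu\nu}(\omega) = N \qquad S_{s\nu}(\omega) = 0$$
then (14-16) yields
$$H(\omega) = \frac{N_0}{N_0 + N(\alpha^2 + \omega^2)} \qquad h(t) = \frac{N_0}{2\beta N}e^{-\beta|t|} \qquad \beta^2 = \alpha^2 + \frac{N_0}{N}$$
$$P = \frac{1}{2\pi}\int_{-\infty}^{\infty} \frac{N_0}{\beta^2 + \omega^2}\,d\omega = \frac{N_0}{2\beta}$$

DISCRETE-TIME PROCESSES. The noncausal estimate $\hat{s}[n]$ of a discrete-time process in terms of the data
$$x[n] = s[n] + \nu[n]$$
is the output
$$\hat{s}[n] = \sum_{k=-\infty}^{\infty} h[k]x[n - k]$$
of a linear time-invariant noncausal system with input $x[n]$ and delta response $h[n]$. The orthogonality principle yields
$$E\left\{\left(s[n] - \sum_{k=-\infty}^{\infty} h[k]x[n - k]\right)x[n - m]\right\} = 0 \qquad \text{all } m$$
Hence
$$R_{sx}[m] = \sum_{k=-\infty}^{\infty} h[k]R_{xx}[m - k] \qquad \text{all } m$$ (14-17)
Taking transforms of both sides, we obtain
$$H(z) = \frac{S_{sx}(z)}{S_{xx}(z)}$$ (14-18)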
The resulting MS error equals
$$P = E\left\{\left(s[n] - \sum_{k=-\infty}^{\infty} h[k]x[n - k]\right)s[n]\right\} = R_{ss}(0) - \sum_{k=-\infty}^{\infty} h[k]R_{sx}[k] = \frac{1}{2\pi}\int_{-\pi}^{\pi}\left[S_{ss}(\omega) - H(e^{-j\omega})S_{sx}(\omega)\right]d\omega$$

Example 14-2. Suppose that $s[n]$ is a first-order AR process and $\nu[n]$ is white noise orthogonal to $s[n]$:
$$S_{ss}(z) = \frac{N_0}{(1 - az^{-1})(1 - az)} \qquad S_{\nu\nu}(z) = N \qquad S_{s\nu}(z) = 0$$
In this case,
$$S_{xx}(z) = S_{ss}(z) + N = \frac{aN(1 - bz^{-1})(1 - bz)}{b(1 - az^{-1})(1 - az)}$$
where
$$b + b^{-1} = a + a^{-1} + \frac{N_0}{aN} \qquad 0 < b < a < 1$$
Hence
$$H(z) = \frac{bN_0}{aN(1 - bz^{-1})(1 - bz)} \qquad h[n] = cb^{|n|} \qquad c = \frac{bN_0}{aN(1 - b^2)}$$
$$P = \frac{N_0}{1 - a^2}\left[1 - c\sum_{k=-\infty}^{\infty}(ab)^{|k|}\right] = \frac{bN_0}{a(1 - b^2)}$$

14-2 PREDICTION

Prediction is the estimation of the future $s(t + \lambda)$ of a process $s(t)$ in terms of its past $s(t - \tau)$, $\tau > 0$. This problem has three parts: The past (data) is known in the interval $(-\infty, t)$; it is known in the interval $(t - T, t)$ of finite length $T$; it is known in the interval $(0, t)$ of variable length $t$. We shall develop all three parts for digital processes only. The discussion of analog predictors will be limited to the first part. In the digital case, we find it more convenient to predict the present $s[n]$ of the given process in terms of its past $s[n - k]$, $k \ge r$.

Infinite Past

We start with the estimation of a process $s[n]$ in terms of its entire past $s[n - k]$, $k \ge 1$:
$$\hat{s}[n] = \hat{E}\{s[n]\,|\,s[n - k],\ k \ge 1\} = \sum_{k=1}^{\infty} h[k]s[n - k]$$ (14-19)
This estimator will be called the one-step predictor of $s[n]$. Thus $\hat{s}[n]$ is the
response of the predictor filter
$$H(z) = h[1]z^{-1} + \cdots + h[k]z^{-k} + \cdots$$ (14-20)
to the input $s[n]$ and our objective is to find the constants $h[k]$ so as to minimize the MS estimation error. From the orthogonality principle it follows that the error $\varepsilon[n] = s[n] - \hat{s}[n]$ must be orthogonal to the data $s[n - m]$:
$$E\left\{\left(s[n] - \sum_{k=1}^{\infty} h[k]s[n - k]\right)s[n - m]\right\} = 0 \qquad m \ge 1$$ (14-21)
This yields
$$R[m] - \sum_{k=1}^{\infty} h[k]R[m - k] = 0 \qquad m \ge 1$$ (14-22)
We have thus obtained a system of infinitely many equations expressing the unknowns $h[k]$ in terms of the autocorrelation $R[m]$ of $s[n]$. These equations are called Wiener-Hopf (digital form).

The Wiener-Hopf equations cannot be solved directly with $z$ transforms even though the sum in (14-22) is the convolution of $h[m]$ with $R[m]$. The reason is that, unlike (14-17), the two sides of (14-22) are not equal for every $m$. A solution based on the analytic properties of the $z$ transforms of causal and anticausal sequences can be found (see Prob. 14-12); however, the underlying theory is not simple. We shall give presently a very simple solution based on the concept of innovations. We comment first on a basic property of the estimation error $\varepsilon[n]$ and of the error filter
$$E(z) = 1 - H(z) = 1 - \sum_{n=1}^{\infty} h[n]z^{-n}$$ (14-23)
The error $\varepsilon[n]$ is orthogonal to the data $s[n - m]$ for every $m \ge 1$; furthermore, $\varepsilon[n - m]$ is a linear function of $s[n - m]$ and its past because $\varepsilon[n]$ is the response of the causal system $E(z)$ to the input $s[n]$. From this it follows that $\varepsilon[n]$ is orthogonal to $\varepsilon[n - m]$ for every $m \ge 1$ and every $n$. Hence $\varepsilon[n]$ is white noise:
$$R_{\varepsilon\varepsilon}[m] = E\{\varepsilon[n]\varepsilon[n - m]\} = P\delta[m]$$ (14-24)
where
$$P = E\{\varepsilon^2[n]\} = E\{(s[n] - \hat{s}[n])s[n]\} = R[0] - \sum_{k=1}^{\infty} h[k]R[k]$$
is the LMS error. This error can be expressed in terms of the power spectrum $S(\omega)$ of $s[n]$; as we see from (10-139),
$$P = \frac{1}{2\pi}\int_{-\pi}^{\pi} |E(e^{j\omega})|^2 S(\omega)\,d\omega$$ (14-25)
Using the above, we shall show that the function $E(z)$ has no 0's outside the unit circle.
THEOREM. If
$$E(z_i) = 0 \qquad \text{then} \qquad |z_i| \le 1$$ (14-26)

Proof. We form the function
$$E_0(z) = E(z)\,\frac{1 - z^{-1}/z_i^*}{1 - z_i z^{-1}}$$
This function is an error filter because it is causal and $E_0(\infty) = E(\infty) = 1$. Furthermore, if $|z_i| > 1$, then [see (13B-2)]
$$|E_0(e^{j\omega})| = \frac{1}{|z_i|}\,|E(e^{j\omega})| < |E(e^{j\omega})|$$
Inserting into (14-25), we conclude that if we use as the estimator filter the function $1 - E_0(z)$, the resulting MS error will be smaller than $P$. This, however, is impossible because $P$ is minimum; hence $|z_i| \le 1$.

Regular Processes

We shall solve the Wiener-Hopf equations (14-22) under the assumption that the process $s[n]$ is regular. As we have shown in Sec. 12-1, such a process is linearly equivalent to a white-noise process $i[n]$ in the sense that
$$s[n] = \sum_{k=0}^{\infty} l[k]i[n - k]$$ (14-27)
$$i[n] = \sum_{k=0}^{\infty} \gamma[k]s[n - k]$$ (14-28)
From this it follows that the predictor $\hat{s}[n]$ of $s[n]$ can be written as a linear sum involving the past of $i[n]$:
$$\hat{s}[n] = \sum_{k=1}^{\infty} h_i[k]i[n - k]$$ (14-29)
To find $\hat{s}[n]$, it suffices, therefore, to find the constants $h_i[k]$ and to express $i[n]$ in terms of $s[n]$ using (14-28). To do so, we shall determine first the cross-correlation of $s[n]$ and $i[n]$. We maintain that
$$R_{si}[m] = l[m]$$ (14-30)

Proof. We multiply (14-27) by $i[n - m]$ and take expected values. This yields
$$E\{s[n]i[n - m]\} = \sum_{k=0}^{\infty} l[k]E\{i[n - k]i[n - m]\} = \sum_{k=0}^{\infty} l[k]\delta[m - k]$$
because $R_{ii}[m] = \delta[m]$, and (14-30) results.
To find $h_i[k]$, we apply the orthogonality principle:
$$E\Big\{\Big(s[n] - \sum_{k=1}^{\infty} h_i[k]\,i[n-k]\Big)i[n-m]\Big\} = 0 \qquad m \ge 1$$
This yields
$$R_{si}[m] - \sum_{k=1}^{\infty} h_i[k]\,R_{ii}[m-k] = R_{si}[m] - \sum_{k=1}^{\infty} h_i[k]\,\delta[m-k] = 0$$
and since the last sum equals $h_i[m]$, we conclude that $h_i[m] = R_{si}[m]$. From this and (14-30) it follows that the predictor $\hat{s}[n]$, expressed in terms of its innovations, equals
$$\hat{s}[n] = \sum_{k=1}^{\infty} l[k]\,i[n-k] \tag{14-31}$$
We shall rederive this important result using (14-27). To do so, it suffices to show that the difference $s[n] - \hat{s}[n]$ is orthogonal to $i[n-m]$ for every $m \ge 1$. This is indeed the case because
$$\varepsilon[n] = \sum_{k=0}^{\infty} l[k]\,i[n-k] - \sum_{k=1}^{\infty} l[k]\,i[n-k] = l[0]\,i[n] \tag{14-32}$$
and $i[n]$ is white noise.
The sum in (14-31) is the response of the filter
$$\sum_{k=1}^{\infty} l[k]\,z^{-k} = \mathbf{L}(z) - l[0]$$
to the input $i[n]$. To complete the specification of $\hat{s}[n]$, we must express $i[n]$ in terms of $s[n]$. Since $i[n]$ is the response of the filter $1/\mathbf{L}(z)$ to the input $s[n]$, we conclude, cascading as in Fig. 14-3, that the predictor filter of $s[n]$ is the product
$$\mathbf{H}(z) = \frac{1}{\mathbf{L}(z)}\big(\mathbf{L}(z) - l[0]\big) = 1 - \frac{l[0]}{\mathbf{L}(z)} \tag{14-33}$$
shown in Fig. 14-4. Thus, to obtain $\mathbf{H}(z)$, it suffices to factor $\mathbf{S}(z)$ as in (12-6). The constant $l[0]$ is determined from the initial value theorem:
$$l[0] = \lim_{z\to\infty} \mathbf{L}(z)$$
FIGURE 14-3
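The recipe in (14-33) can be sketched numerically. The block below is only an illustration under stated assumptions: the helper names `impulse_response` and `one_step_predictor` are hypothetical, the innovations filter is taken to be rational with known coefficients, and the test factor $\mathbf{L}(z) = (2 - z^{-1})/(3 - z^{-1})$ is the one used in the example that follows. It expands $1 - l[0]/\mathbf{L}(z)$ by long division and checks that the resulting $h[k]$ satisfy the Wiener-Hopf equations (14-22).

```python
def impulse_response(num, den, n):
    """First n samples of num(z)/den(z); coefficients multiply z^0, z^-1, ..."""
    out = []
    for k in range(n):
        acc = num[k] if k < len(num) else 0.0
        for j in range(1, min(k, len(den) - 1) + 1):
            acc -= den[j] * out[k - j]
        out.append(acc / den[0])
    return out


def one_step_predictor(num, den, n):
    """h[0..n-1] of H(z) = 1 - l[0]/L(z) in (14-33), for L(z) = num(z)/den(z)."""
    l0 = num[0] / den[0]                     # initial value theorem: l[0] = L(infinity)
    inverse = impulse_response(den, num, n)  # delta response of 1/L(z)
    h = [-l0 * g for g in inverse]
    h[0] += 1.0                              # h[0] vanishes: only the past is used
    return h


# Assumed test case: L(z) = (2 - z^-1)/(3 - z^-1), which is minimum phase.
num, den = [2.0, -1.0], [3.0, -1.0]
n = 60
l = impulse_response(num, den, n)
h = one_step_predictor(num, den, n)

# Autocorrelation R[m] = sum_k l[k] l[k+m], truncated to n terms.
R = [sum(l[k] * l[k + m] for k in range(n - m)) for m in range(n)]

# Residual of the Wiener-Hopf equations (14-22) for m = 1..9.
residual = max(
    abs(R[m] - sum(h[k] * R[abs(m - k)] for k in range(1, n)))
    for m in range(1, 10)
)
print(h[1], residual)
```

The truncation length `n = 60` is an arbitrary choice; it is adequate here because both $l[k]$ and $h[k]$ decay geometrically.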
FIGURE 14-4 One-step predictor.
Example 14-3. Suppose that
$$S(\omega) = \frac{5 - 4\cos\omega}{10 - 6\cos\omega} \qquad \mathbf{L}(z) = \frac{2z - 1}{3z - 1}$$
as in Example 12-4. In this case, $l[0] = \lim_{z\to\infty}\mathbf{L}(z) = 2/3$ and (14-33) yields
$$\mathbf{H}(z) = 1 - \frac{2}{3}\,\frac{3z - 1}{2z - 1} = \frac{-z^{-1}}{6(1 - z^{-1}/2)}$$
Note that $\hat{s}[n]$ can be determined recursively:
$$\hat{s}[n] - \tfrac{1}{2}\,\hat{s}[n-1] = -\tfrac{1}{6}\,s[n-1]$$
The Kolmogoroff-Szegö MS error formula†
†U. Grenander and G. Szegö: Toeplitz Forms and Their Applications, Berkeley University Press, 1958.
As we have seen from (14-32), the MS estimation error equals
$$P = E\{\varepsilon^2[n]\} = l^2[0]$$
Furthermore [see (13A-1)]
$$\ln l^2[0] = \frac{1}{2\pi}\int_{-\pi}^{\pi} \ln|\mathbf{L}(e^{j\omega})|^2\,d\omega$$
Since $S(\omega) = |\mathbf{L}(e^{j\omega})|^2$, this yields the identity
$$P = \exp\Big\{\frac{1}{2\pi}\int_{-\pi}^{\pi} \ln S(\omega)\,d\omega\Big\} \tag{14-34}$$
expressing $P$ directly in terms of $S(\omega)$.
Autoregressive processes. If $s[n]$ is an AR process as in (12-39), then $l[0] = b_0$ and
$$\mathbf{H}(z) = -a_1 z^{-1} - \cdots - a_N z^{-N}$$
$$\hat{s}[n] = -a_1 s[n-1] - \cdots - a_N s[n-N] \qquad P = b_0^2 \tag{14-35}$$
The above shows that the predictor $\hat{s}[n]$ of $s[n]$ in terms of its entire past is the same as the predictor in terms of the $N$ most recent past values. This result can be established directly: From (12-39) and (14-35) it follows that $s[n] - \hat{s}[n] =$
$b_0\,i[n]$. This is orthogonal to the past of $s[n]$; hence
$$\hat{E}\{s[n]\,|\,s[n-k],\ 1 \le k \le N\} = \hat{E}\{s[n]\,|\,s[n-k],\ k \ge 1\}$$
A process with this property is called wide-sense Markoff of order $N$.
THE r-STEP PREDICTOR. We shall determine the predictor
$$\hat{s}_r[n] = \hat{E}\{s[n]\,|\,s[n-k],\ k \ge r\}$$
of $s[n]$ in terms of $s[n-r]$ and its past using innovations. We maintain that
$$\hat{s}_r[n] = \sum_{k=r}^{\infty} l[k]\,i[n-k] \tag{14-36}$$
Proof. It suffices to show that the difference
$$\hat{\varepsilon}_r[n] = s[n] - \hat{s}_r[n] = \sum_{k=0}^{r-1} l[k]\,i[n-k]$$
is orthogonal to the data $s[n-k]$, $k \ge r$. This is a consequence of the fact that $s[n-k]$ is linearly equivalent to $i[n-k]$ and its past for $k \ge r$; hence it is orthogonal to $i[n-m]$ for $0 \le m \le r-1$.
The prediction error $\hat{\varepsilon}_r[n]$ is the response of the MA filter
$$l[0] + l[1]z^{-1} + \cdots + l[r-1]z^{-r+1}$$
of Fig. 14-5 to the input $i[n]$. Cascading this filter with $1/\mathbf{L}(z)$ as in Fig. 14-5, we conclude that the process $\hat{s}_r[n] = s[n] - \hat{\varepsilon}_r[n]$ is the response of the system
$$\mathbf{H}_r(z) = 1 - \frac{1}{\mathbf{L}(z)}\sum_{k=0}^{r-1} l[k]\,z^{-k} \tag{14-37}$$
to the input $s[n]$. This is the r-step predictor filter of $s[n]$.
FIGURE 14-5
The resulting MS error
equals
$$P_r = E\{\hat{\varepsilon}_r^{\,2}[n]\} = \sum_{k=0}^{r-1} l^2[k] \tag{14-38}$$
Example 14-4. We are given a process $s[n]$ with autocorrelation $R[m] = a^{|m|}$ and we wish to determine its r-step predictor. In this case (see Example 10-30)
$$\mathbf{S}(z) = \frac{a^{-1} - a}{(a^{-1} + a) - (z^{-1} + z)} = \frac{b^2}{(1 - az^{-1})(1 - az)} \qquad b^2 = 1 - a^2$$
$$\mathbf{L}(z) = \frac{b}{1 - az^{-1}} \qquad l[n] = b\,a^n\,U[n]$$
Hence
$$\mathbf{H}_r(z) = 1 - \frac{1 - az^{-1}}{b}\sum_{k=0}^{r-1} b\,a^k z^{-k} = a^r z^{-r}$$
$$\hat{s}_r[n] = a^r s[n-r] \qquad P_r = b^2\sum_{k=0}^{r-1} a^{2k} = 1 - a^{2r}$$
ANALOG PROCESSES. We consider now the problem of predicting the future value $s(t+\lambda)$ of a process $s(t)$ in terms of its entire past $s(t-\tau)$, $\tau \ge 0$. In this problem, our estimator is an integral:
$$\hat{s}(t+\lambda) = \hat{E}\{s(t+\lambda)\,|\,s(t-\tau),\ \tau \ge 0\} = \int_0^{\infty} h(\alpha)\,s(t-\alpha)\,d\alpha \tag{14-39}$$
and the problem is to find the function $h(\alpha)$. From the analog form (14-4) of the orthogonality principle, it follows that
$$E\Big\{\Big[s(t+\lambda) - \int_0^{\infty} h(\alpha)\,s(t-\alpha)\,d\alpha\Big]s(t-\tau)\Big\} = 0 \qquad \tau \ge 0$$
This yields the Wiener-Hopf integral equation
$$R(\tau+\lambda) = \int_0^{\infty} h(\alpha)\,R(\tau-\alpha)\,d\alpha \qquad \tau \ge 0 \tag{14-40}$$
The solution of this equation is the impulse response of the causal Wiener filter
$$\mathbf{H}(s) = \int_0^{\infty} h(t)\,e^{-st}\,dt$$
The corresponding MS error equals
$$P = E\{[s(t+\lambda) - \hat{s}(t+\lambda)]\,s(t+\lambda)\} = R(0) - \int_0^{\infty} h(\alpha)\,R(\lambda+\alpha)\,d\alpha \tag{14-41}$$
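As a numerical aside on the discrete r-step predictor, the sketch below evaluates (14-37) and (14-38) directly from the delta responses of $\mathbf{L}(z)$ and $1/\mathbf{L}(z)$ and compares the outcome with the closed forms of Example 14-4. The function names and the values $a = 0.8$, $r = 3$ are assumptions made for the illustration only.

```python
import math


def convolve(x, y):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[i + j] += xi * yj
    return out


def r_step_predictor(l, gamma, r, n):
    """h[0..n-1] of H_r(z) = 1 - (1/L(z)) * sum_{k<r} l[k] z^-k, as in (14-37).

    l     : delta response of the innovations filter L(z)
    gamma : delta response of the whitening filter 1/L(z)
    """
    product = convolve(gamma[:n], l[:r])
    h = [-p for p in product[:n]]
    h[0] += 1.0
    return h


def r_step_error(l, r):
    """MS error P_r = l[0]^2 + ... + l[r-1]^2 of (14-38)."""
    return sum(v * v for v in l[:r])


# Assumed numbers for Example 14-4: R[m] = a^|m| with a = 0.8.
a, r, n = 0.8, 3, 8
b = math.sqrt(1.0 - a * a)
l = [b * a**k for k in range(n)]             # l[n] = b a^n U[n]
gamma = [1.0 / b, -a / b] + [0.0] * (n - 2)  # 1/L(z) = (1 - a z^-1)/b

h = r_step_predictor(l, gamma, r, n)
P_r = r_step_error(l, r)
print(h, P_r)
```

Up to rounding, the only nonzero tap is $h[r] = a^r$, in agreement with $\hat{s}_r[n] = a^r s[n-r]$.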
Equation (14-40) cannot be solved directly with transforms because the two sides are equal for $\tau \ge 0$ only. A solution based on the analytic properties of Laplace transforms is outlined in Prob. 14-11. We give next a solution using innovations.
As we have shown in (12-8), the process $s(t)$ is the response of its innovations filter $\mathbf{L}(s)$ to the white-noise process $i(t)$. From this it follows that
$$s(t+\lambda) = \int_0^{\infty} l(\alpha)\,i(t+\lambda-\alpha)\,d\alpha \tag{14-42}$$
We maintain that $\hat{s}(t+\lambda)$ is the part of the above integral involving only the past of $i(t)$:
$$\hat{s}(t+\lambda) = \int_{\lambda}^{\infty} l(\alpha)\,i(t+\lambda-\alpha)\,d\alpha = \int_0^{\infty} l(\beta+\lambda)\,i(t-\beta)\,d\beta \tag{14-43}$$
Proof. The difference
$$s(t+\lambda) - \hat{s}(t+\lambda) = \int_0^{\lambda} l(\alpha)\,i(t+\lambda-\alpha)\,d\alpha \tag{14-44}$$
depends only on the values of $i(t)$ in the interval $(t, t+\lambda)$; hence it is orthogonal to the past of $i(t)$ and, therefore, it is also orthogonal to the past of $s(t)$.
The predictor $\hat{s}(t+\lambda)$ of $s(t)$ is the response of the system
$$\mathbf{H}_i(s) = \int_0^{\infty} h_i(t)\,e^{-st}\,dt \qquad h_i(t) = l(t+\lambda)\,U(t) \tag{14-45}$$
(Fig. 14-6) to the input $i(t)$.
FIGURE 14-6
Cascading with $1/\mathbf{L}(s)$, we conclude that $\hat{s}(t+\lambda)$ is
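the response of the system
$$\mathbf{H}(s) = \frac{\mathbf{H}_i(s)}{\mathbf{L}(s)} \tag{14-46}$$
to the input $s(t)$. Thus, to determine the predictor filter $\mathbf{H}(s)$ of $s(t)$, proceed as follows:
Factor the spectrum of $s(t)$ as in (12-3): $\mathbf{S}(s) = \mathbf{L}(s)\,\mathbf{L}(-s)$.
Find the inverse transform $l(t)$ of $\mathbf{L}(s)$ and form the function $h_i(t) = l(t+\lambda)\,U(t)$.
Find the transform $\mathbf{H}_i(s)$ of $h_i(t)$ and determine $\mathbf{H}(s)$ from (14-46).
The MS estimation error is determined from (14-44):
$$P = E\Big\{\Big|\int_0^{\lambda} l(\alpha)\,i(t+\lambda-\alpha)\,d\alpha\Big|^2\Big\} = \int_0^{\lambda} l^2(\alpha)\,d\alpha \tag{14-47}$$
Example 14-5. We are given a process $s(t)$ with autocorrelation $R(\tau) = 2\alpha e^{-\alpha|\tau|}$ and we wish to determine its predictor. In this problem,
$$\mathbf{S}(s) = \frac{4\alpha^2}{\alpha^2 - s^2} \qquad \mathbf{L}(s) = \frac{2\alpha}{\alpha + s} \qquad l(t) = 2\alpha e^{-\alpha t}\,U(t)$$
$$h_i(t) = 2\alpha e^{-\alpha\lambda}e^{-\alpha t}\,U(t) \qquad \mathbf{H}_i(s) = \frac{2\alpha e^{-\alpha\lambda}}{\alpha + s}$$
$$\mathbf{H}(s) = e^{-\alpha\lambda} \qquad \hat{s}(t+\lambda) = e^{-\alpha\lambda}\,s(t)$$
This shows that the predictor of $s(t+\lambda)$ in terms of its entire past is the same as the predictor in terms of its present $s(t)$. In other words, if $s(t)$ is specified, the past has no effect on the linear prediction of the future.
The determination of $\mathbf{H}(s)$ is simple if $s(t)$ has a rational spectrum. Assuming that the poles of $\mathbf{L}(s)$ are simple, we obtain
$$\mathbf{L}(s) = \frac{N(s)}{D(s)} = \sum_i \frac{c_i}{s - s_i} \qquad h_i(t) = \sum_i c_i e^{s_i\lambda}e^{s_i t}\,U(t)$$
$$\mathbf{H}_i(s) = \sum_i \frac{c_i e^{s_i\lambda}}{s - s_i} = \frac{N_1(s)}{D(s)} \tag{14-48}$$
and (14-46) yields $\mathbf{H}(s) = N_1(s)/N(s)$.
If $N(s) = 1$, then $\mathbf{H}(s)$ is a polynomial:
$$\mathbf{H}(s) = N_1(s) = b_0 + b_1 s + \cdots + b_n s^n$$
and $\hat{s}(t+\lambda)$ is a linear sum of $s(t)$ and its first $n$ derivatives:
$$\hat{s}(t+\lambda) = b_0\,s(t) + b_1\,s'(t) + \cdots + b_n\,s^{(n)}(t)$$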
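The rational-spectrum procedure of (14-46) to (14-48) lends itself to a short numerical sketch. The code below assumes that $\mathbf{L}(s)$ is given by the residues $c_i$ and real, simple, negative poles $s_i$ of its partial fraction expansion; the function names are hypothetical. It evaluates $\mathbf{H}(s) = \mathbf{H}_i(s)/\mathbf{L}(s)$ at a point and the MS error (14-47) in closed form, using $\alpha = 0.7$ and $\lambda = 1.3$ as arbitrary test values for Example 14-5.

```python
import math


def predictor_response(residues, poles, lam, s):
    """Evaluate H(s) = H_i(s)/L(s) of (14-46) for L(s) = sum_i c_i/(s - s_i)."""
    L = sum(c / (s - p) for c, p in zip(residues, poles))
    H_i = sum(c * math.exp(p * lam) / (s - p) for c, p in zip(residues, poles))
    return H_i / L


def prediction_error(residues, poles, lam):
    """P = integral of l(t)^2 over (0, lam), cf. (14-47), with l(t) = sum_i c_i exp(s_i t).

    Assumes real negative poles, so that s_i + s_j is never zero.
    """
    total = 0.0
    for ci, pi in zip(residues, poles):
        for cj, pj in zip(residues, poles):
            total += ci * cj * (math.exp((pi + pj) * lam) - 1.0) / (pi + pj)
    return total


# Example 14-5 with assumed values alpha = 0.7, lambda = 1.3: L(s) = 2 alpha/(s + alpha).
alpha, lam = 0.7, 1.3
H_single = predictor_response([2.0 * alpha], [-alpha], lam, 1.0 + 2.0j)
P_single = prediction_error([2.0 * alpha], [-alpha], lam)

# Two-pole case L(s) = 1/(s + 1) + 4/(s + 3) with lambda = ln 2, evaluated at s = 2.
H_double = predictor_response([1.0, 4.0], [-1.0, -3.0], math.log(2.0), 2.0)
print(H_single, P_single, H_double)
```

For a single pole the ratio is the constant $e^{-\alpha\lambda}$ whatever the evaluation point, and the error reduces to $R(0)(1 - e^{-2\alpha\lambda})$ with $R(0) = 2\alpha$. The two-pole case should return the value of $(s+2)/(5s+7)$ at $s = 2$.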
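Example 14-6. We are given a process $s(t)$ with
$$\mathbf{S}(s) = \frac{49 - 25s^2}{(1 - s^2)(9 - s^2)} \qquad \mathbf{L}(s) = \frac{7 + 5s}{(1 + s)(3 + s)}$$
and we wish to estimate its future $s(t+\lambda)$ for $\lambda = \ln 2$. In this problem, $e^{\lambda} = 2$:
$$\mathbf{L}(s) = \frac{1}{s+1} + \frac{4}{s+3} \qquad \mathbf{H}_i(s) = \frac{e^{-\lambda}}{s+1} + \frac{4e^{-3\lambda}}{s+3} = \frac{s+2}{(s+1)(s+3)}$$
$$\mathbf{H}(s) = \frac{s+2}{5s+7} \qquad h(t) = \frac{1}{5}\,\delta(t) + \frac{3}{25}\,e^{-1.4t}\,U(t)$$
Hence
$$\hat{E}\{s(t+\lambda)\,|\,s(t-\tau),\ \tau \ge 0\} = 0.2\,s(t) + \hat{E}\{s(t+\lambda)\,|\,s(t-\tau),\ \tau > 0\}$$
Notes
1. The integral
$$y(\tau) = \int_0^{\infty} h(\alpha)\,R(\tau-\alpha)\,d\alpha$$
in (14-40) is the response of the Wiener filter $\mathbf{H}(s)$ to the input $R(\tau)$. From (14-40) and (14-41) it follows that
$$y(\tau) = R(\tau+\lambda) \quad\text{for}\quad \tau \ge 0 \qquad\text{and}\qquad y(-\lambda) = R(0) - P$$
2. In all MS estimation problems, only second-order moments are used. If, therefore, two processes have the same autocorrelation, then their predictors are identical. This suggests the following derivation of the Wiener-Hopf equation: Suppose that $\omega$ is an RV with density $f(\omega)$ and $z(t) = e^{j\omega t}$. Clearly,
$$R_{zz}(\tau) = E\{e^{j\omega(t+\tau)}e^{-j\omega t}\} = \int_{-\infty}^{\infty} f(\omega)\,e^{j\omega\tau}\,d\omega$$
From this it follows that the power spectrum of $z(t)$ equals $2\pi f(\omega)$ [see also (10-127)]. If, therefore, $s(t)$ is a process with power spectrum $S(\omega) = 2\pi f(\omega)$, then its predictor $h(t)$ will equal the predictor of $z(t)$:
$$\hat{z}(t+\lambda) = \hat{E}\{e^{j\omega(t+\lambda)}\,|\,e^{j\omega(t-\alpha)},\ \alpha \ge 0\} = \int_0^{\infty} h(\alpha)\,e^{j\omega(t-\alpha)}\,d\alpha = e^{j\omega t}\int_0^{\infty} h(\alpha)\,e^{-j\omega\alpha}\,d\alpha = e^{j\omega t}\,\mathbf{H}(\omega)$$
And since $z(t+\lambda) - \hat{z}(t+\lambda) \perp z(t-\tau)$ for $\tau \ge 0$, we conclude from the above that
$$E\{[e^{j\omega(t+\lambda)} - e^{j\omega t}\,\mathbf{H}(\omega)]\,e^{-j\omega(t-\tau)}\} = 0 \qquad \tau \ge 0$$
Hence
$$\int_{-\infty}^{\infty} f(\omega)\,[e^{j\omega(\tau+\lambda)} - e^{j\omega\tau}\,\mathbf{H}(\omega)]\,d\omega = 0 \qquad \tau \ge 0$$
This yields (14-40) because the integral of $f(\omega)\,e^{j\omega(\tau+\lambda)}$ equals $R(\tau+\lambda)$ and the integral of $f(\omega)\,e^{j\omega\tau}\,\mathbf{H}(\omega)$ equals the integral in (14-40).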
Predictable processes. We shall say that a process $s[n]$ is predictable if it equals its predictor:
$$s[n] = \sum_{k=1}^{\infty} h[k]\,s[n-k] \tag{14-49}$$
In this case [see (14-25)]
$$P = \frac{1}{2\pi}\int_{-\pi}^{\pi} |\mathbf{E}(e^{j\omega})|^2\,S(\omega)\,d\omega = 0 \tag{14-50}$$
Since $S(\omega) \ge 0$, the above integral is 0 if $S(\omega) \ne 0$ only in a region $R$ of the $\omega$ axis where $\mathbf{E}(e^{j\omega}) = 0$. It can be shown that this region consists of a countable number of points $\omega_i$; the proof is based on the Paley-Wiener condition (12-9). From this it follows that
$$S(\omega) = 2\pi\sum_{i=1}^{m} \alpha_i\,\delta(\omega - \omega_i) \qquad \mathbf{E}(e^{j\omega_i}) = 0 \tag{14-51}$$
Thus a process $s[n]$ is predictable if it is a sum of exponentials as in (12-9):
$$s[n] = \sum_{i=1}^{m} c_i\,e^{j\omega_i n} \qquad E\{|c_i|^2\} = \alpha_i \tag{14-52}$$
We maintain that the converse is also true: If $s[n]$ is a sum of $m$ exponentials as in (14-52), then it is predictable and its predictor filter equals $1 - \mathbf{D}(z)$ where
$$\mathbf{D}(z) = (1 - e^{j\omega_1}z^{-1}) \cdots (1 - e^{j\omega_m}z^{-1}) \tag{14-53}$$
Proof. In this case, $\mathbf{E}(z) = \mathbf{D}(z)$ and $\mathbf{E}(e^{j\omega_i}) = 0$; hence $\mathbf{E}(e^{j\omega})\,S(\omega) = 0$ because $\mathbf{E}(e^{j\omega})\,\delta(\omega - \omega_i) = \mathbf{E}(e^{j\omega_i})\,\delta(\omega - \omega_i) = 0$. From this it follows that $P = 0$.
Note The preceding result seems to be in conflict with the sampling expansion (11-138) of a BL process $s(t)$: This expansion shows that $s(t)$ is predictable in the sense that it can be approximated within an arbitrary error $\varepsilon$ by a linear sum involving only its past samples $s(nT_0)$. From this it follows that the digital process $s[n] = s(nT_0)$ is predictable in the same sense. Such an expansion, however, does not violate (14-50). It is only an approximation and its coefficients tend to $\infty$ as $\varepsilon \to 0$.
GENERAL PROCESSES AND WOLD'S DECOMPOSITION†
†A. Papoulis: Predictable Processes and Wold's Decomposition: A Review. IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 22, 1985.
We show finally that an arbitrary process $s[n]$ can be written as a sum
$$s[n] = s_1[n] + s_2[n] \tag{14-54}$$
of a regular process $s_1[n]$ and a predictable process $s_2[n]$, that these processes are
orthogonal, and that they have the same predictor filter. We thus reestablish constructively Wold's decomposition (12-89) in the context of MS estimation.
FIGURE 14-7
As we know [see (14-24)], the error $\varepsilon[n]$ of the one-step estimate of $s[n]$ is a white-noise process. We form the estimator $s_1[n]$ of $s[n]$ in terms of $\varepsilon[n]$ and its past:
$$s_1[n] = \hat{E}\{s[n]\,|\,\varepsilon[n-k],\ k \ge 0\} = \sum_{k=0}^{\infty} w_k\,\varepsilon[n-k] \tag{14-55}$$
Thus $s_1[n]$ is the response of the system (Fig. 14-7)
$$\mathbf{W}(z) = \sum_{k=0}^{\infty} w_k\,z^{-k}$$
to the input $\varepsilon[n]$. The difference $s_2[n] = s[n] - s_1[n]$ is the estimation error (Fig. 14-7). Clearly (orthogonality principle)
$$s_2[n] \perp \varepsilon[n-k] \qquad k \ge 0 \tag{14-56}$$
Note that if $s[n]$ is a regular process, then [see (14-32)] $\varepsilon[n] = l[0]\,i[n]$; in this case, $s_1[n] = s[n]$.
THEOREM. (a) The processes $s_1[n]$ and $s_2[n]$ are orthogonal:
$$s_1[n] \perp s_2[n-k] \qquad \text{all } k \tag{14-57}$$
(b) $s_1[n]$ is a regular process.
(c) $s_2[n]$ is a predictable process and its predictor filter is the sum in (14-19):
$$s_2[n] = \sum_{k=1}^{\infty} h[k]\,s_2[n-k] \tag{14-58}$$
Proof. (a) The process $\varepsilon[n]$ is orthogonal to $s[n-k]$ for every $k > 0$. Furthermore, $s_2[n-k]$ is a linear function of $s[n-k]$ and its past; hence $s_2[n-k] \perp \varepsilon[n]$ for $k > 0$. Combining with (14-56), we conclude that
$$s_2[n-k] \perp \varepsilon[n] \qquad \text{all } k \tag{14-59}$$
And since $s_1[n]$ depends linearly on $\varepsilon[n]$ and its past, (14-57) follows.
(b) The process $s_1[n]$ is the response of the system $\mathbf{W}(z)$ to the white noise $\varepsilon[n]$. To prove that it is regular, it suffices to show that
$$\sum_{k=0}^{\infty} w_k^2 < \infty \tag{14-60}$$
From (14-54) and (14-55) and the fact that $\varepsilon[n]$ is white noise with average power $P$ it follows that
$$E\{s^2[n]\} = E\{s_1^2[n]\} + E\{s_2^2[n]\} \ge E\{s_1^2[n]\} = P\sum_{k=0}^{\infty} w_k^2$$
This yields (14-60) because $E\{s^2[n]\} = R(0) < \infty$.
(c) To prove (14-58), it suffices to show that the difference
$$z[n] = s_2[n] - \sum_{k=1}^{\infty} h[k]\,s_2[n-k]$$
equals 0. From (14-59) it follows that $z[n] \perp \varepsilon[n-k]$ for all $k$. But $z[n]$ is the response of the system $1 - \mathbf{H}(z) = \mathbf{E}(z)$ to the input $s_2[n] = s[n] - s_1[n]$; hence (see Fig. 14-8)
$$z[n] = \varepsilon[n] - s_1[n] + \sum_{k=1}^{\infty} h[k]\,s_1[n-k] \tag{14-61}$$
This shows that $z[n]$ is a linear function of $\varepsilon[n]$ and its past. And since it is also orthogonal to $\varepsilon[n-k]$ for all $k$, we conclude that $z[n] = 0$.
Note finally that [see (14-61)]
$$s_1[n] - \sum_{k=1}^{\infty} h[k]\,s_1[n-k] = \varepsilon[n] \perp s_1[n-m] \qquad m \ge 1$$
Hence the above sum is the predictor of $s_1[n]$. We thus conclude that the sum $\mathbf{H}(z)$ in (14-20) is the predictor filter of the processes $s[n]$, $s_1[n]$, and $s_2[n]$.
FIR PREDICTORS. We shall find the estimate of a process $s[n]$ in terms of its $N$ most recent past values:
$$\hat{s}_N[n] = \hat{E}\{s[n]\,|\,s[n-k],\ 1 \le k \le N\} = \sum_{k=1}^{N} a_k^N\,s[n-k] \tag{14-62}$$
This estimate will be called the forward predictor of order $N$. The superscript in $a_k^N$ identifies the order. The process $\hat{s}_N[n]$ is the response of the forward predictor filter
$$\hat{\mathbf{H}}_N(z) = \sum_{k=1}^{N} a_k^N\,z^{-k} \tag{14-63}$$
to the input $s[n]$. Our object is to determine the constants $a_k^N$ so as to minimize the MS value
$$P_N = E\{\hat{\varepsilon}_N^{\,2}[n]\} = E\{(s[n] - \hat{s}_N[n])\,s[n]\} \tag{14-64}$$
of the forward prediction error $\hat{\varepsilon}_N[n] = s[n] - \hat{s}_N[n]$.
The Yule-Walker equations. From the orthogonality principle it follows that
$$E\Big\{\Big(s[n] - \sum_{k=1}^{N} a_k^N\,s[n-k]\Big)s[n-m]\Big\} = 0 \qquad 1 \le m \le N$$
This yields the system
$$R[m] - \sum_{k=1}^{N} a_k^N\,R[m-k] = 0 \qquad 1 \le m \le N \tag{14-65}$$
Solving, we obtain the coefficients $a_k^N$ of the predictor filter $\hat{\mathbf{H}}_N(z)$. The resulting MS error equals [see (13-83)]
$$P_N = R[0] - \sum_{k=1}^{N} a_k^N\,R[k] = \frac{\Delta_{N+1}}{\Delta_N} \tag{14-66}$$
In Fig. 14-8 we show the ladder realization of $\hat{\mathbf{H}}_N(z)$ and the forward error filter $\hat{\mathbf{E}}_N(z) = 1 - \hat{\mathbf{H}}_N(z)$.
As we have shown in Sec. 13-3, the error filter can be realized by the lattice structure of Fig. 14-9. In that figure, the input is $s[n]$ and the upper
output $\hat{\varepsilon}_N[n]$. The lower output $\check{\varepsilon}_N[n]$ is the backward prediction error defined as follows: The processes $s[n]$ and $s[-n]$ have the same autocorrelation; hence their predictor filters are identical. From this it follows that the backward predictor $\check{s}_N[n]$, that is, the predictor of $s[n]$ in terms of its $N$ most recent future values, equals
$$\check{s}_N[n] = \hat{E}\{s[n]\,|\,s[n+k],\ 1 \le k \le N\} = \sum_{k=1}^{N} a_k^N\,s[n+k]$$
The backward error
$$\check{\varepsilon}_N[n] = s[n-N] - \check{s}_N[n-N]$$
is the response of the filter
$$\check{\mathbf{E}}_N(z) = z^{-N}(1 - a_1^N z - \cdots - a_N^N z^N) = z^{-N}\,\hat{\mathbf{E}}_N(1/z)$$
with input $s[n]$. From this and (13-94) it follows that the lower output of the lattice of Fig. 14-8 is $\check{\varepsilon}_N[n]$.
In Sec. 13-3, we used the ladder-lattice equivalence to simplify the solution of the Yule-Walker equations. We summarize next the main results in the context of the prediction problem. We note that the lattice realization also has the following advantage. Suppose that we have a predictor of order $N$ and we wish to find the predictor of order $N+1$. In the ladder realization, we must find a new set of $N+1$ coefficients $a_k^{N+1}$. In the lattice realization, we need only the new reflection coefficient $K_{N+1}$; the first $N$ reflection coefficients $K_k$ do not change.
Levinson's algorithm. We shall determine the constants $a_k^N$, $K_N$, and $P_N$ recursively. This involves the following steps:
Start with
$$a_1^1 = K_1 = R[1]/R[0] \qquad P_1 = (1 - K_1^2)\,R[0]$$
Assume that the $N+1$ constants $a_k^{N-1}$, $K_{N-1}$, and $P_{N-1}$ are known. Find $K_N$ and $P_N$ from (13-107) and (13-108):
$$P_{N-1}K_N = R[N] - \sum_{k=1}^{N-1} a_k^{N-1}\,R[N-k] \qquad P_N = (1 - K_N^2)\,P_{N-1} \tag{14-67}$$
Find $a_k^N$ from (13-97):
$$a_N^N = K_N \qquad a_k^N = a_k^{N-1} - K_N\,a_{N-k}^{N-1} \qquad 1 \le k \le N-1 \tag{14-68}$$
In Levinson's algorithm, the order $N$ of the iteration is finite but it can continue indefinitely. We shall examine the properties of the predictor and of the MS error $P_N$ as $N \to \infty$. It is obvious that $P_N$ is a nonincreasing sequence of nonnegative numbers; hence it tends to a nonnegative limit:
$$P_1 \ge P_2 \ge \cdots \ge P_N \xrightarrow[N\to\infty]{} P \ge 0 \tag{14-69}$$
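The recursion (14-67) and (14-68) translates almost line by line into code. The following is a minimal sketch, not a tuned implementation: the function name `levinson` and its return layout are choices made here, and the test process is assumed to have the innovations filter $\mathbf{L}(z) = (2 - z^{-1})/(3 - z^{-1})$ of Example 14-3, for which (14-32) gives the limiting error $P = l^2[0] = 4/9$. Starting from $P_0 = R[0]$ reproduces the stated initial step.

```python
def levinson(R, order):
    """Levinson recursion (14-67)-(14-68).

    Returns (a, K, P) where a[N] = [a_1^N, ..., a_N^N], K[N] is the reflection
    coefficient and P[N] the MS error of the order-N predictor; P[0] = R[0].
    """
    a, K, P = [[]], [None], [R[0]]
    for N in range(1, order + 1):
        prev = a[N - 1]
        k_N = (R[N] - sum(prev[k - 1] * R[N - k] for k in range(1, N))) / P[N - 1]
        cur = [prev[k - 1] - k_N * prev[N - k - 1] for k in range(1, N)] + [k_N]
        a.append(cur)
        K.append(k_N)
        P.append((1.0 - k_N * k_N) * P[N - 1])
    return a, K, P


# Assumed test process: delta response of L(z) = (2 - z^-1)/(3 - z^-1),
# that is l[0] = 2/3, l[1] = -1/9, l[n] = l[n-1]/3 for n >= 2.
l = [2.0 / 3.0, -1.0 / 9.0]
while len(l) < 80:
    l.append(l[-1] / 3.0)
order = 12
R = [sum(l[k] * l[k + m] for k in range(len(l) - m)) for m in range(order + 1)]

a, K, P = levinson(R, order)

# Residual of the Yule-Walker equations (14-65) for the final predictor.
yw = max(
    abs(R[m] - sum(a[order][k - 1] * R[abs(m - k)] for k in range(1, order + 1)))
    for m in range(1, order + 1)
)
print(P[1], P[order], yw)
```

In this example the errors $P_N$ decrease monotonically from $R[0] = 11/24$ toward $4/9$, which illustrates (14-69); only one new reflection coefficient is computed per order, as noted for the lattice realization.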
As we have shown in Sec. 12-3, the zeros $z_i$ of the error filter
$$\hat{\mathbf{E}}_N(z) = 1 - \sum_{k=1}^{N} a_k^N\,z^{-k}$$
are either all inside the unit circle or they are all on the unit circle:
If $P_N > 0$, then $|K_k| < 1$ for every $k \le N$ and $|z_i| < 1$ for every $i$ [see (13-99)].
If $P_{N-1} > 0$ and $P_N = 0$, then $|K_k| < 1$ for every $k \le N-1$, $|K_N| = 1$, and $|z_i| = 1$ for every $i$ [see (13-101)]. In this case, the process $s[n]$ is predictable and its spectrum consists of lines.
If $P > 0$, then $|z_i| \le 1$ for every $i$ [see (14-26)]. In this case, the predictor $\hat{s}_N[n]$ of $s[n]$ tends to the Wiener predictor $\hat{s}[n]$ as in (14-19). From this and (14-34) it follows that
$$P = \exp\Big\{\frac{1}{2\pi}\int_{-\pi}^{\pi} \ln S(\omega)\,d\omega\Big\} = l^2[0] = \lim_{N\to\infty}\frac{\Delta_{N+1}}{\Delta_N} \tag{14-70}$$
This shows the connection between the LMS error $P$ of the prediction of $s[n]$ in terms of its entire past, the power spectrum $S(\omega)$ of $s[n]$, the initial value $l[0]$ of the delta response $l[n]$ of its innovations filter, and the correlation determinant $\Delta_N$.
Suppose, finally, that
$$P_{M-1} > P_M \quad\text{and}\quad P_M = P_{M+1} = \cdots = P \tag{14-71}$$
In this case, $K_k = 0$ for $k > M$; hence the algorithm terminates at the $M$th step. From this it follows that the $M$th order predictor $\hat{s}_M[n]$ of $s[n]$ equals its Wiener predictor:
$$\hat{s}_M[n] = \hat{E}\{s[n]\,|\,s[n-k],\ 1 \le k \le M\} = \hat{E}\{s[n]\,|\,s[n-k],\ k \ge 1\}$$
In other words, the process $s[n]$ is wide-sense Markoff of order $M$. This leads to the conclusion that the prediction error $\hat{\varepsilon}_M[n] = s[n] - \hat{s}_M[n]$ is white noise with average power $P$ [see (14-24)]:
$$s[n] - \sum_{k=1}^{M} a_k^M\,s[n-k] = \hat{\varepsilon}_M[n] \qquad E\{\hat{\varepsilon}_M^{\,2}[n]\} = P$$
and it shows that $s[n]$ is an AR process. Conversely, if $s[n]$ is AR, then it is also wide-sense Markoff.
Autoregressive processes and maximum entropy. Suppose that $s[n]$ is an AR process of order $M$ with autocorrelation $R[m]$ and $\bar{s}[n]$ is a general process with autocorrelation $\bar{R}[m]$ such that
$$\bar{R}[m] = R[m] \qquad \text{for } |m| \le M$$
The predictors of these processes of order $M$ are identical because they depend on the values of $R[m]$ for $|m| \le M$ only. From this it follows that the corresponding prediction errors $P_M$ and $\bar{P}_M$ are equal. As we have noted, $P_M = P$ for the AR process $s[n]$ and $\bar{P}_M \ge \bar{P}$ for the general process $\bar{s}[n]$.
Consider now the class $C_M$ of processes with identical autocorrelations (data) for $|m| \le M$. Each $\bar{R}[m]$ is a p.d. extrapolation of the given data. We have shown in Sec. 13-3 that the extrapolating sequence obtained with the maximum entropy (ME) method is the autocorrelation of an AR process [see (13-141)]. This leads to the following relationship between MS estimation and maximum entropy: The ME extrapolation is the autocorrelation of a process $s[n]$ in the class $C_M$, the predictor of which maximizes the minimum MS error $P$. In this sense, the ME method maximizes our uncertainty about the values of $R[m]$ for $|m| > M$.
Causal Data
We wish to estimate the present value of a regular process $s[n]$ in terms of its finite past, starting from some origin. The data are now available from 0 to $n-1$ and the desired estimate is given by
$$\hat{s}_n[n] = \hat{E}\{s[n]\,|\,s[n-k],\ 1 \le k \le n\} = \sum_{k=1}^{n} a_k^n\,s[n-k] \tag{14-72}$$
Unlike the fixed length $N$ of the FIR predictor $\hat{s}_N[n]$ considered in (14-62), the length $n$ of this estimate is not constant. Furthermore, the values $a_k^n$ of the coefficients of the filter specified by (14-72) depend on $n$. Thus the estimator of the process $s[n]$ in terms of its causal past is a linear time-varying filter. If it is realized by a tapped-delay line as in Fig. 14-8, the number of the taps increases and the values of the weights change as $n$ increases.
The coefficients $a_k^n$ of $\hat{s}_n[n]$ can be determined recursively from Levinson's algorithm where now $N = n$. Introducing the backward estimate $\check{s}_n[0]$ of $s[0]$ in terms of its $n$ most recent future values, we conclude from (13-92) that
$$\hat{s}_n[n] = \hat{s}_{n-1}[n] + K_n\big(s[0] - \check{s}_{n-1}[0]\big)$$
$$\check{s}_n[0] = \check{s}_{n-1}[0] + K_n\big(s[n] - \hat{s}_{n-1}[n]\big) \tag{14-73}$$
In Fig.
14-10, we show the normalized lattice realization of the error filter $\hat{\mathbf{E}}_n(z)$ where we use as upper output the process
$$i[n] = \frac{1}{\sqrt{P_n}}\,\hat{\varepsilon}_n[n] \qquad E\{i^2[n]\} = 1 \tag{14-74}$$
The filter is formed by switching "on" successively a new lattice section starting from the left. This filter is again time-varying; however, unlike the tapped-delay line realization, the elements of each section remain unchanged as $n$ increases. We should point out that whereas $\hat{\varepsilon}_k[n]$ is the value of the upper response of
the $k$th section at time $n$, the process $i[n]$ does not appear at a fixed position. It is the output of the last section that is switched "on" and as $n$ increases, the point where $i[n]$ is observed changes.
We conclude with the observation that if the process $s[n]$ is AR of order $M$ [see (13-81)], then the lattice stops increasing for $n > M$, realizing, thus, the time invariant system $\hat{\mathbf{E}}_M(z)/\sqrt{P_M}$. The corresponding inverse lattice (see Fig. 13-15) realizes the all-pole system $\sqrt{P_M}/\hat{\mathbf{E}}_M(z)$.
We shall now show that the output $i[n]$ of the normalized lattice is white noise:
$$R_{ii}[m] = \delta[m] \tag{14-75}$$
Indeed, as we know, $\hat{\varepsilon}_n[n] \perp s[n-k]$ for $1 \le k \le n$. Furthermore, $\hat{\varepsilon}_{n-r}[n-r]$ depends linearly only on $s[n-r]$ and its past values. Hence
$$\hat{\varepsilon}_n[n] \perp \hat{\varepsilon}_{n-r}[n-r] \qquad r \ge 1 \tag{14-76}$$
This yields (14-75) because $P_n = E\{\hat{\varepsilon}_n^{\,2}[n]\}$.
Note In a lattice of fixed length, the output $\hat{\varepsilon}_N[n]$ is not white noise and it is not orthogonal to $\hat{\varepsilon}_{N-1}[n]$. However, for a specific $n$, the random variables $\hat{\varepsilon}_N[n]$ and $\hat{\varepsilon}_{N-1}[n-1]$ are orthogonal.
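The whiteness property (14-75) can be checked numerically without simulating any data. The sketch below assumes the autocorrelation $R[m] = 0.9^{|m|} + 0.5\,\delta[m]$ (an arbitrary positive definite choice) and hypothetical helper names. It writes each $i[n]$ of (14-74) as a weighted sum of $s[0], \ldots, s[n]$, with weights obtained from the causal predictor (14-72) through Levinson's algorithm, and then evaluates $E\{i[n]\,i[m]\}$ from the weights and $R[m]$.

```python
import math


def levinson(R, order):
    """Predictor coefficients a[N] and MS errors P[N], N = 0..order, cf. (14-67)-(14-68)."""
    a, P = [[]], [R[0]]
    for N in range(1, order + 1):
        prev = a[N - 1]
        k_N = (R[N] - sum(prev[k - 1] * R[N - k] for k in range(1, N))) / P[N - 1]
        a.append([prev[k - 1] - k_N * prev[N - k - 1] for k in range(1, N)] + [k_N])
        P.append((1.0 - k_N * k_N) * P[N - 1])
    return a, P


def innovation_weights(R, size):
    """Row n: weights of s[0..size-1] in i[n] = (s[n] - sum_k a_k^n s[n-k]) / sqrt(P_n)."""
    a, P = levinson(R, size - 1)
    rows = []
    for n in range(size):
        w = [0.0] * size
        scale = 1.0 / math.sqrt(P[n])
        w[n] = scale
        for k in range(1, n + 1):
            w[n - k] = -a[n][k - 1] * scale
        rows.append(w)
    return rows


def innovation_correlation(R, size):
    """Matrix of E{i[n] i[m]} computed from the weights and the autocorrelation R."""
    W = innovation_weights(R, size)
    corr = [[0.0] * size for _ in range(size)]
    for n in range(size):
        for m in range(size):
            corr[n][m] = sum(
                W[n][p] * R[abs(p - q)] * W[m][q]
                for p in range(size) for q in range(size)
            )
    return corr


# Assumed autocorrelation: R[m] = 0.9^|m| + 0.5 delta[m].
R = [0.9**m + (0.5 if m == 0 else 0.0) for m in range(5)]
C = innovation_correlation(R, 5)
off_diagonal = max(abs(C[n][m]) for n in range(5) for m in range(5) if n != m)
print([C[n][n] for n in range(5)], off_diagonal)
```

Up to rounding the matrix is the identity. The rows of weights are triangular because $i[n]$ depends only on $s[0], \ldots, s[n]$.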
KALMAN INNOVATIONS†
†T. Kailath, A. Vieira, and M. Morf: "Inverses of Toeplitz Operators, Innovations, and Orthogonal Polynomials," SIAM Review, vol. 20, no. 1, 1978.
The output $i[n]$ of the time-varying lattice of Fig. 14-10 is an orthonormal process that depends linearly on $s[n-k]$. Denoting by $\gamma_k^n$ the response of the lattice at time $n$ to the input $s[n] = \delta[n-k]$, we obtain
$$i[0] = \gamma_0^0\,s[0]$$
$$i[1] = \gamma_0^1\,s[0] + \gamma_1^1\,s[1] \tag{14-77}$$
$$i[n] = \gamma_0^n\,s[0] + \cdots + \gamma_k^n\,s[k] + \cdots + \gamma_n^n\,s[n]$$
or in vector form
$$I_{n+1} = S_{n+1}\,\Gamma_{n+1} \qquad \Gamma_{n+1} = \begin{bmatrix} \gamma_0^0 & \gamma_0^1 & \cdots & \gamma_0^n \\ 0 & \gamma_1^1 & \cdots & \gamma_1^n \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \gamma_n^n \end{bmatrix}$$
where $S_{n+1}$ and $I_{n+1}$ are row vectors with components $s[0], \ldots, s[n]$ and $i[0], \ldots, i[n]$ respectively.
From the above it follows that if
$$s[n] = \delta[n-k] \quad\text{then}\quad i[n] = \gamma_k^n \qquad n \ge k$$
This shows that to determine the delta response of the lattice of Fig. 14-10, we use as input the delta sequence $\delta[n-k]$ and we observe the moving output $i[n]$ for $n \ge k$.
The elements $\gamma_k^n$ of the triangular matrix $\Gamma_{n+1}$ can be expressed in terms of the weights $a_k^n$ of the causal predictor $\hat{s}_n[n]$. Since
$$\hat{\varepsilon}_n[n] = s[n] - \hat{s}_n[n] = \sqrt{P_n}\,i[n]$$
it follows from (14-72) that
$$\gamma_n^n = \frac{1}{\sqrt{P_n}} \qquad \gamma_{n-k}^n = \frac{-a_k^n}{\sqrt{P_n}} \qquad k \ge 1$$
The inverse of the lattice of Fig. 14-10 is obtained by reversing the flow direction of the upper line and the sign of the upward weights $-K_n$ as in Fig. 13-15. The turn-on switches close again in succession starting from the left, and the input $i[n]$ is applied at the terminal of the section that is connected last. The
output at $A$ is thus given by
$$s[0] = l_0^0\,i[0]$$
$$s[1] = l_0^1\,i[0] + l_1^1\,i[1] \tag{14-78}$$
$$s[n] = l_0^n\,i[0] + \cdots + l_k^n\,i[k] + \cdots + l_n^n\,i[n]$$
or in vector form $S_n = I_n L_n$ with $L_n = \Gamma_n^{-1}$. From this it follows that if
$$i[n] = \delta[n-k] \quad\text{then}\quad s[n] = l_k^n \qquad n \ge k$$
Thus, to determine the delta response $l_k^n$ of the inverse lattice, we use as moving input the delta sequence $\delta[n-k]$ and we observe the left output $s[n]$ for $n \ge k$.
From the preceding discussion it follows that the random vector $S_n$ is linearly equivalent to the orthonormal vector $I_n$. Thus Eqs. (14-77) and (14-78) correspond to the Gram-Schmidt orthonormalization equations (8-88) and (8-91) of Sec. 8-3. Applying the terminology of Sec. 12-1 to causal signals, we shall call the process $i[n]$ the Kalman innovations of $s[n]$ and the lattice filter and its inverse Kalman whitening and Kalman innovations filters respectively. These filters are time-varying and their transition matrices equal $\Gamma_n$ and $L_n$ respectively. Their elements can be expressed in terms of the parameters $K_n$ and $P_n$ of Levinson's algorithm because these parameters specify completely the filters.
Cholesky factorization We maintain that the correlation matrix $R_n$ and its inverse can be written as products
$$R_n = L_n^t L_n \qquad R_n^{-1} = \Gamma_n \Gamma_n^t \tag{14-79}$$
where $\Gamma_n$ and $L_n$ are the triangular matrices introduced earlier. Indeed, from the orthonormality of $I_n$ and the definition of $R_n$, it follows that
$$E\{I_n^t I_n\} = 1_n \qquad E\{S_n^t S_n\} = R_n$$
where $1_n$ is the identity matrix. Since $I_n = S_n\Gamma_n$ and $S_n = I_n L_n$, the above yields
$$\Gamma_n^t R_n \Gamma_n = 1_n \qquad L_n^t 1_n L_n = R_n$$
and (14-79) results.
Autocorrelation as lattice response. We shall determine the autocorrelation $R[m]$ of the process $s[n]$ in terms of the Levinson parameters $K_N$ and $P_N$. For this purpose, we form a lattice of order $N_0$ and we denote by $\hat{q}_N[m]$ and $\check{q}_N[m]$ respectively its upper and lower responses (Fig. 14-11a) to the input $R[m]$. As we see from the figure
$$\hat{q}_{N-1}[m] = \hat{q}_N[m] + K_N\,\check{q}_{N-1}[m-1] \tag{14-80a}$$
$$\check{q}_N[m] = \check{q}_{N-1}[m-1] - K_N\,\hat{q}_{N-1}[m] \tag{14-80b}$$
$$\hat{q}_0[m] = \check{q}_0[m] = R[m] \tag{14-80c}$$
FIGURE 14-11 (a), (b)
Using the above, we shall show that $R[m]$ can be determined as the response of the inverse lattice† of Fig. 14-11b provided that the following boundary and initial conditions are satisfied:
The input to the system (point $B$) is identically 0:
$$\hat{q}_{N_0}[m] = 0 \qquad \text{all } m \tag{14-81}$$
The initial conditions of all delay elements except the first are 0:
$$\check{q}_N[0] = 0 \qquad N > 0 \tag{14-82}$$
The first delay element is connected to the system at $m = 0$ and its initial condition equals $R[0]$:
$$\check{q}_0[0] = R[0] \tag{14-83}$$
From the above and (14-81) it follows that
$$\check{q}_N[1] = 0 \qquad N > 1$$
We maintain that under the stated conditions, the left output of the inverse lattice (point $A$) equals $R[m]$ and the right output of the $m$th section equals the MS error $P_m$:
$$\hat{q}_0[m] = R[m] \qquad \check{q}_m[m] = P_m \tag{14-84}$$
†E. A. Robinson and S. Treitel: "Maximum Entropy and the Relationship of the Partial Autocorrelation to the Reflection Coefficients of a Layered System," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-28, no. 2, 1980.
Proof. The proof is based on the fact that the responses of the lattice of Fig. 14-11a satisfy the equations (see Prob. 14-24)
$$\hat{q}_N[m] = \check{q}_N[m] = 0 \qquad 1 \le m \le N-1 \tag{14-85}$$
$$\check{q}_N[N] = P_N \tag{14-86}$$
From (14-80) it follows that, if we know $\hat{q}_N[m]$ and $\check{q}_{N-1}[m-1]$, then we can find $\hat{q}_{N-1}[m]$ and $\check{q}_N[m]$. By a simple induction, this leads to the conclusion that if $\hat{q}_{N_0}[m]$ is specified for every $m$ (boundary conditions) and $\check{q}_N[1]$ is specified for every $N$ (initial conditions), then all responses of the lattice are determined uniquely. The two systems of Fig. 14-11 satisfy the same equations (14-80) and, as we noted, they have identical initial and boundary conditions. Hence all their responses are identical. This yields (14-84).
14-3 FILTERING AND PREDICTION
In this section, we consider the problem of estimating the future value $s(t+\lambda)$ of a stochastic process $s(t)$ (signal) in terms of the present and past values of a regular process $x(t)$ (signal plus noise):
$$\hat{s}(t+\lambda) = \hat{E}\{s(t+\lambda)\,|\,x(t-\tau),\ \tau \ge 0\} = \int_0^{\infty} h_x(\alpha)\,x(t-\alpha)\,d\alpha \tag{14-87}$$
Thus $\hat{s}(t+\lambda)$ is the output of a linear time-invariant causal system $\mathbf{H}_x(s)$ with input $x(t)$. To determine $\mathbf{H}_x(s)$, we use the orthogonality principle
$$E\Big\{\Big[s(t+\lambda) - \int_0^{\infty} h_x(\alpha)\,x(t-\alpha)\,d\alpha\Big]x(t-\tau)\Big\} = 0 \qquad \tau \ge 0$$
This yields the Wiener-Hopf equation
$$R_{sx}(\tau+\lambda) = \int_0^{\infty} h_x(\alpha)\,R_{xx}(\tau-\alpha)\,d\alpha \qquad \tau \ge 0 \tag{14-88}$$
The solution $h_x(t)$ of (14-88) is the impulse response of the prediction and filtering system known as the Wiener filter. If $x(t) = s(t)$, then $h_x(t)$ is a pure predictor as in (14-39). If $\lambda = 0$, then $h_x(t)$ is a pure filter.
To solve (14-88), we express $x(t)$ in terms of its innovations $i_x(t)$ (Fig. 14-12):
$$x(t) = \int_0^{\infty} l_x(\alpha)\,i_x(t-\alpha)\,d\alpha \qquad R_{ii}(\tau) = \delta(\tau) \tag{14-89}$$
where $l_x(t)$ is the impulse response of the innovations filter $\mathbf{L}_x(s)$ obtained by factoring the spectrum of $x(t)$ as in (12-3):
$$\mathbf{S}_{xx}(s) = \mathbf{L}_x(s)\,\mathbf{L}_x(-s) \tag{14-90}$$
As we know, the processes $i_x(t)$ and $x(t)$ are linearly equivalent; hence the estimate $\hat{s}(t+\lambda)$ can be expressed as the output of a causal filter $\mathbf{H}_{i_x}(s)$ with
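input $i_x(t)$:
$$\hat{s}(t+\lambda) = \int_0^{\infty} h_{i_x}(\alpha)\,i_x(t-\alpha)\,d\alpha$$
FIGURE 14-12 (labels in the figure: $\mathbf{S}_{si_x}(s) = \mathbf{S}_{sx}(s)\,\Gamma_x(-s)$, $h_{i_x}(\tau) = R_{si_x}(\tau+\lambda)\,U(\tau)$)
To determine $h_{i_x}(t)$, we use the orthogonality principle
$$E\Big\{\Big[s(t+\lambda) - \int_0^{\infty} h_{i_x}(\alpha)\,i_x(t-\alpha)\,d\alpha\Big]i_x(t-\tau)\Big\} = 0 \qquad \tau \ge 0 \tag{14-91}$$
Since $i_x(t)$ is white noise, the above yields
$$R_{si_x}(\tau+\lambda) = \int_0^{\infty} h_{i_x}(\alpha)\,\delta(\tau-\alpha)\,d\alpha = h_{i_x}(\tau) \qquad \tau \ge 0 \tag{14-92}$$
This determines $h_{i_x}(\tau)$ for all $\tau$ because $h_{i_x}(\tau) = 0$ for $\tau < 0$:
$$h_{i_x}(\tau) = R_{si_x}(\tau+\lambda)\,U(\tau) \tag{14-93}$$
In the above, $R_{si_x}(\tau)$ is the cross-correlation between the signal $s(t)$ and the process $i_x(t)$. The function $R_{si_x}(\tau)$ can be expressed in terms of the cross-correlation $R_{sx}(\tau)$ between $s(t)$ and $x(t)$. Indeed, since $i_x(t)$ is the output of the whitening filter $\Gamma_x(s)$ with input $x(t)$, we can show as in (10-118) and (10-157) that
$$\mathbf{S}_{si_x}(s) = \mathbf{S}_{sx}(s)\,\Gamma_x(-s) \tag{14-94}$$
Thus, since $\mathbf{S}_{sx}(s)$ is assumed known, (14-94) yields $R_{si_x}(\tau)$. Shifting to the left and truncating as in (14-93), we obtain $h_{i_x}(\tau)$. To complete the specification of $\mathbf{H}_x(s)$, we multiply the transform $\mathbf{H}_{i_x}(s)$ of the function $h_{i_x}(t)$ so obtained with $\Gamma_x(s)$ (see Fig. 14-12):
$$\mathbf{H}_x(s) = \mathbf{H}_{i_x}(s)\,\Gamma_x(s) \tag{14-95}$$
The function $\mathbf{H}_{i_x}(s)$ can be determined directly from (14-94): As we know (shifting theorem) the transform of $R_{si_x}(\tau+\lambda)$ equals
$$\mathbf{S}_\lambda(s) = \mathbf{S}_{si_x}(s)\,e^{\lambda s} = \mathbf{S}_{sx}(s)\,\Gamma_x(-s)\,e^{\lambda s} \tag{14-96}$$
To find $\mathbf{H}_{i_x}(s)$, it suffices to write $\mathbf{S}_\lambda(s)$ as a sum
$$\mathbf{S}_\lambda(s) = \mathbf{S}_\lambda^+(s) + \mathbf{S}_\lambda^-(s) \tag{14-97}$$
where $\mathbf{S}_\lambda^+(s)$ is analytic in the right-hand $s$ plane and $\mathbf{S}_\lambda^-(s)$ is analytic in the left-hand $s$ plane. Since the inverse transforms of the functions $\mathbf{S}_\lambda^+(s)$ and $\mathbf{S}_\lambda^-(s)$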
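equal $R_{si_x}(\tau+\lambda)\,U(\tau)$ and $R_{si_x}(\tau+\lambda)\,U(-\tau)$ respectively, we conclude from (14-93) that (see also Note on next page)
$$\mathbf{H}_{i_x}(s) = \mathbf{S}_\lambda^+(s) \tag{14-98}$$
To determine the system function $\mathbf{H}_x(s)$ of the Wiener filter, proceed, thus, as follows:
Factor $\mathbf{S}_{xx}(s)$ as in (14-90) and set $\Gamma_x(s) = 1/\mathbf{L}_x(s)$.
Evaluate $\mathbf{S}_{si_x}(s)$ from (14-94) and form the function $\mathbf{S}_\lambda(s)$ using (14-96).
Decompose $\mathbf{S}_\lambda(s)$ as in (14-97) and form the function $\mathbf{H}_{i_x}(s)$ using (14-98).
Determine $\mathbf{H}_x(s)$ from (14-95).
If the function $\mathbf{S}_{si_x}(s)$ is rational, then the decomposition (14-97) can be accomplished by expanding $\mathbf{S}_{si_x}(s)$ into partial fractions. Assuming that $\mathbf{S}_{si_x}(s)$ is a proper fraction with simple poles, we obtain
$$\mathbf{S}_{si_x}(s) = \sum_i \frac{a_i}{s - s_i} + \sum_k \frac{b_k}{s - z_k} \qquad \operatorname{Re} s_i < 0 \quad \operatorname{Re} z_k > 0 \tag{14-99}$$
The inverse of the second sum is 0 for $\tau > 0$. If, therefore, it is shifted to the left, it will remain 0 for $\tau > 0$. This shows that only the first sum will contribute to the term $R_{si_x}(\tau+\lambda)\,U(\tau)$. In other words,
$$R_{si_x}(\tau+\lambda)\,U(\tau) = \big[a_1 e^{s_1(\tau+\lambda)} + \cdots + a_n e^{s_n(\tau+\lambda)}\big]\,U(\tau)$$
The transform of the above yields
$$\mathbf{S}_\lambda^+(s) = \frac{a_1 e^{s_1\lambda}}{s - s_1} + \cdots + \frac{a_n e^{s_n\lambda}}{s - s_n} \tag{14-100}$$
Example 14-7. Suppose that $x(t) = s(t) + \nu(t)$ and
$$\mathbf{S}_{ss}(s) = \frac{N_0}{\alpha^2 - s^2} \qquad \mathbf{S}_{\nu\nu}(s) = N \qquad \mathbf{S}_{s\nu}(s) = 0 \tag{14-101}$$
as in Example 14-1. In this case, $\mathbf{S}_{sx}(s) = \mathbf{S}_{ss}(s)$ and
$$\mathbf{S}_{xx}(s) = \frac{N_0}{\alpha^2 - s^2} + N = N\,\frac{\beta^2 - s^2}{\alpha^2 - s^2} \qquad \beta^2 = \alpha^2 + \frac{N_0}{N}$$
Hence
$$\mathbf{L}_x(s) = \sqrt{N}\,\frac{s + \beta}{s + \alpha} \qquad \Gamma_x(-s) = \frac{1}{\sqrt{N}}\,\frac{\alpha - s}{\beta - s} \tag{14-102}$$
Inserting into (14-94) and expanding into partial fractions, we obtain
$$\mathbf{S}_{si_x}(s) = \frac{N_0}{\alpha^2 - s^2}\,\frac{\alpha - s}{\sqrt{N}(\beta - s)} = \frac{A}{s + \alpha} - \frac{A}{s - \beta} \qquad A = \frac{N_0}{(\alpha + \beta)\sqrt{N}}$$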
and with $s_1 = -\alpha$, (14-100) yields
$$\mathbf{S}_\lambda^+(s) = \frac{A\,e^{-\alpha\lambda}}{s + \alpha}$$
Hence
$$\mathbf{H}_x(s) = \mathbf{S}_\lambda^+(s)\,\Gamma_x(s) = \frac{\beta - \alpha}{s + \beta}\,e^{-\alpha\lambda} \tag{14-103}$$
Note In the decomposition (14-97) of $\mathbf{S}_\lambda(s)$, the functions $\mathbf{S}_\lambda^+(s)$ and $\mathbf{S}_\lambda^-(s)$ are unique within an additive constant. This causes an ambiguity in the determination of $h_{i_x}(t)$. The ambiguity is removed if we impose the condition that
$$\mathbf{S}_\lambda^-(\infty) = 0$$
In the pure filtering case ($\lambda = 0$), the resulting $h_x(t)$ might contain impulses at the origin. This is acceptable because, by assumption, the estimate $\hat{s}(t)$ of $s(t)$ is a functional of the past and the present value of the data $x(t)$.
Filtering white noise. In the pure filtering problem, the determination of the estimator $\mathbf{H}_x(s)$ can be simplified if $R_{ss}(0) < \infty$ and $\nu(t)$ is white noise orthogonal to the signal as in (14-101). We maintain, in fact, that in this case
$$\mathbf{H}_x(s) = 1 - \sqrt{N}\,\Gamma_x(s) \tag{14-104}$$
where $\Gamma_x(s)$ is the whitening filter of $x(t)$.
Proof. From the above assumptions it follows that $\mathbf{S}_{ss}(\infty) = 0$; hence
$$\mathbf{S}_{sx}(s) = \mathbf{S}_{ss}(s) = \mathbf{S}_{xx}(s) - N = \mathbf{L}_x(s)\,\mathbf{L}_x(-s) - N$$
$$\mathbf{S}_{ss}(\infty) = 0 \qquad \mathbf{S}_{xx}(\infty) = N \qquad \mathbf{L}_x(\pm\infty) = \sqrt{N}$$
Inserting into (14-94), we obtain
$$\mathbf{S}_{si_x}(s) = \mathbf{L}_x(s) - N\,\Gamma_x(-s) = \mathbf{L}_x(s) + K - N\,\Gamma_x(-s) - K$$
From the preceding note it follows that the constant $K$ must be such that the noncausal component of $\mathbf{S}_{si_x}(s)$ satisfies the infinity condition $-N\,\Gamma_x(-\infty) - K = 0$. And since $\Gamma_x(-\infty) = 1/\mathbf{L}_x(-\infty) = 1/\sqrt{N}$, (14-104) follows from (14-95).
Example 14-8. We shall determine the pure filter of the process in Example 14-7. From (14-102) and (14-104) it follows that
$$\mathbf{H}_x(s) = 1 - \frac{s + \alpha}{s + \beta} = \frac{\beta - \alpha}{s + \beta}$$
in agreement with (14-103). Note that the resulting MS error equals
$$P = E\Big\{\Big[s(t) - \int_0^{\infty} h_x(\tau)\,x(t-\tau)\,d\tau\Big]s(t)\Big\} = \frac{N_0}{\alpha + \beta}$$
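The value of the MS error quoted in Example 14-8 can be obtained in two lines. The derivation below is a worked check, relying only on the standard transform pair that relates $\mathbf{S}_{ss}(s) = N_0/(\alpha^2 - s^2)$ to $R_{ss}(\tau) = (N_0/2\alpha)\,e^{-\alpha|\tau|}$ and on $R_{sx}(\tau) = R_{ss}(\tau)$, which holds because signal and noise are orthogonal.

```latex
\begin{align*}
h_x(t) &= (\beta-\alpha)\,e^{-\beta t}\,U(t), \qquad
R_{sx}(\tau) = R_{ss}(\tau) = \frac{N_0}{2\alpha}\,e^{-\alpha|\tau|} \\
P &= R_{ss}(0) - \int_0^{\infty} h_x(\tau)\,R_{sx}(\tau)\,d\tau
   = \frac{N_0}{2\alpha} - \frac{N_0(\beta-\alpha)}{2\alpha}\int_0^{\infty} e^{-(\alpha+\beta)\tau}\,d\tau \\
  &= \frac{N_0}{2\alpha}\Big(1 - \frac{\beta-\alpha}{\alpha+\beta}\Big)
   = \frac{N_0}{\alpha+\beta}
\end{align*}
```

As a sanity check on the limits: when $N \to \infty$ we have $\beta \to \alpha$, the filter vanishes and $P \to N_0/2\alpha = R_{ss}(0)$; when $N \to 0$, $\beta \to \infty$ and $P \to 0$.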
Digital Processes
We shall state briefly the discrete-time version of the preceding results. Our problem now is the determination of the future value $s[n+r]$ of a stochastic process in terms of the present and past values of another process $x[n]$:
$$\hat{s}_r[n+r] = \sum_{k=0}^{\infty} h_x^r[k]\,x[n-k] \tag{14-105}$$
In this case,
$$s[n+r] - \hat{s}_r[n+r] \perp x[n-m] \qquad m \ge 0$$
hence
$$R_{sx}[m+r] = \sum_{k=0}^{\infty} h_x^r[k]\,R_{xx}[m-k] \qquad m \ge 0 \tag{14-106}$$
This is the discrete version of the Wiener-Hopf equation (14-88). To determine $h_x^r[n]$, we proceed as in the analog case: We express $\hat{s}_r[n+r]$ in terms of the innovations $i_x[n]$ of $x[n]$ (Fig. 14-13):
$$\hat{s}_r[n+r] = \sum_{k=0}^{\infty} h_{i_x}^r[k]\,i_x[n-k] \tag{14-107}$$
From this and (8-70) it follows that
$$R_{si_x}[m+r] = \sum_{k=0}^{\infty} h_{i_x}^r[k]\,\delta[m-k] = h_{i_x}^r[m] \qquad m \ge 0$$
because $R_{ii}[m] = \delta[m]$. Hence
$$h_{i_x}^r[m] = R_{si_x}[m+r]\,U[m] \qquad \text{all } m \tag{14-108}$$
The function $\mathbf{S}_{si_x}(z)$ can be expressed in terms of $\mathbf{S}_{sx}(z)$ as in (14-94):
$$\mathbf{S}_{si_x}(z) = \mathbf{S}_{sx}(z)\,\Gamma_x(z^{-1}) \tag{14-109}$$
Thus the transform of $R_{si_x}[m+r]$ equals
$$\mathbf{S}_r(z) = z^r\,\mathbf{S}_{si_x}(z) = z^r\,\mathbf{S}_{sx}(z)\,\Gamma_x(z^{-1}) \tag{14-110}$$
FIGURE 14-13
The function $S_r(z)$ is then written as a sum

$$ S_r(z) = S_r^+(z) + S_r^-(z) \tag{14-111} $$

where $S_r^+(z)$ is analytic for $|z| > 1$ and $S_r^-(z)$ is analytic for $|z| < 1$. Furthermore, the inverse of $S_r^-(z)$ at the origin is 0. Thus $S_r^+(z)$ is the transform of the causal function $R_{si_x}[m+r]\,U[m]$. And since $i_x[n]$ is the response of the whitening filter $\Gamma_x(z)$ with input $x[n]$, we conclude from (14-108) that

$$ H_x^r(z) = H_{i_x}^r(z)\,\Gamma_x(z) = S_r^+(z)\,\Gamma_x(z) \tag{14-112} $$

Example 14-9. We shall determine the one-step predictor $\hat{s}_1[n+1]$ of the process $s[n]$ where

$$ S_{ss}(z) = \frac{N_0}{(1 - az^{-1})(1 - az)} \qquad S_{vv}(z) = N \qquad S_{sv}(z) = 0 $$

In this case (see Example 14-2)

$$ L_x(z) = \sqrt{\frac{Na}{b}}\;\frac{z-b}{z-a} $$

From (14-110) it follows with $r = 1$ that

$$ S_1(z) = \frac{zN_0\sqrt{b/Na}}{(1 - az^{-1})(1 - bz)} = \frac{Aaz}{z-a} - \frac{Az/b}{z - 1/b} \qquad A = \frac{N_0}{1-ab}\sqrt{\frac{b}{Na}} $$

Since $0 < a < 1$ and $1/b > 1$, we conclude from the above that $S_1^+(z) = Aaz/(z-a)$ and (14-112) yields

$$ H_x^1(z) = (a-b)\frac{z}{z-b} \qquad h_x^1[n] = (a-b)\,b^n\,U[n] $$

We discuss presently a more direct method for determining $H_x^r(z)$ [see (14-118) below].

White noise. We shall examine the nature of the predictor $H_x^r(z)$ of $s[n+r]$ under the assumption that the noise is white and orthogonal to the signal:

$$ R_{vv}[m] = N\delta[m] \qquad R_{sv}[m] = 0 \tag{14-113} $$

Pure filter. Suppose first that $r = 0$. In this case, $H_x^0(z)$ is a pure filter and $\hat{s}_0[n]$ is the estimate of the signal $s[n]$ in terms of $x[n]$ and its past. We maintain that (Fig. 14-14)

$$ H_x^0(z) = 1 - \frac{N}{l_x[0]}\,\Gamma_x(z) \tag{14-114} $$

Proof. From (14-113) it follows that

$$ S_{sx}(z) = S_{ss}(z) = S_{xx}(z) - N = L_x(z)L_x(z^{-1}) - N $$

Inserting into (14-109), we obtain

$$ S_{si_x}(z) = L_x(z) - N\Gamma_x(z^{-1}) \tag{14-115} $$
FIGURE 14-14

We wish to find the causal part of the above, including the value of its inverse at $n = 0$. Since the inverse z transform of $\Gamma_x(1/z)$ is 0 for $n > 0$ and for $n = 0$ it equals $\Gamma_x(\infty)$, we conclude that

$$ H_{i_x}^0(z) = L_x(z) - N\Gamma_x(\infty) \tag{14-116} $$

Multiplying by $\Gamma_x(z)$, we obtain (14-114) because $\Gamma_x(\infty) = 1/l_x[0]$.

Filtering and prediction. We shall now show that the estimate $\hat{s}_r[n+r]$ of $s[n+r]$ equals the pure predictor $\hat{\hat{s}}_0[n+r]$ of the estimate $\hat{s}_0[n]$ of $s[n]$ (Fig. 14-14):

$$ \hat{s}_r[n+r] = \hat{\hat{s}}_0[n+r] = \hat{E}\{\hat{s}_0[n+r] \mid \hat{s}_0[n-k],\ k \ge 0\} \tag{14-117} $$

Proof. From (14-110) and (14-115) it follows that

$$ S_r(z) = z^r\left[L_x(z) - N\Gamma_x(z^{-1})\right] $$

But, for $r \ge 1$, the inverse of $z^r\Gamma_x(1/z)$ is 0 for $n \ge 0$. Hence $S_r^+(z)$ is the causal part of $z^r L_x(z)$. Inserting into (14-112), we obtain

$$ H_x^r(z) = z^r\left(1 - \frac{1}{L_x(z)}\sum_{k=0}^{r-1} l_x[k]\,z^{-k}\right) \tag{14-118} $$

As we see from Fig. 14-14, the innovations filter of $\hat{s}_0[n]$ equals $L_x(z)H_x^0(z) = L_x(z) - D$, where $D = N/l_x[0]$. To determine the pure predictor $H_r(z)$ of $\hat{s}_0[n+r]$, it suffices, therefore, to multiply (14-37) by $z^r$ (we are predicting now the future) and to replace the function $L(z)$ by $L_x(z)H_x^0(z)$. This yields

$$ H_r(z) = z^r\left(1 - \frac{\sum_{k=0}^{r-1} l_x[k]\,z^{-k} - D}{L_x(z) - D}\right) $$

because the inverse of $L_x(z) - D$ equals $l_x[n] - D\delta[n]$. Comparing with (14-118), we conclude that

$$ H_x^r(z) = H_x^0(z)\,H_r(z) $$
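The pure-filter formula (14-114) can be checked numerically against the Wiener-Hopf equation (14-106). The sketch below is only an illustration under assumed values: a signal with $R_{ss}[m] = 0.8^{|m|}$ in unit-power white noise, and a filter truncated to 40 taps. The helper `solve` and all variable names are ours, not part of the text.

```python
# Sketch: solve a truncated version of the Wiener-Hopf system (14-106) with r = 0
# and compare with the closed form of (14-114). All numbers are assumed values.

def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

a, N, taps = 0.8, 1.0, 40      # assumed signal pole, noise power, filter length

def R_ss(m):
    """Assumed signal autocorrelation a^|m|."""
    return a ** abs(m)

def R_xx(m):
    """Data autocorrelation: white noise adds N at lag 0 only."""
    return R_ss(m) + (N if m == 0 else 0.0)

# (14-106) with r = 0 and R_sx = R_ss:  R_ss[m] = sum_k h[k] R_xx[m - k],  0 <= m < taps
toeplitz = [[R_xx(m - k) for k in range(taps)] for m in range(taps)]
h_numeric = solve(toeplitz, [R_ss(m) for m in range(taps)])

# For this spectrum L_x(z) = sqrt(1.6) (z - 0.5)/(z - 0.8), so (14-114) gives
# H(z) = 1 - (z - 0.8) / (1.6 (z - 0.5)) = 0.375 z / (z - 0.5).
h_closed = [0.375 * 0.5 ** k for k in range(taps)]

max_gap = max(abs(u - v) for u, v in zip(h_numeric, h_closed))
print(f"largest tap difference: {max_gap:.2e}")
```

The truncated system only approximates the infinite one, so the comparison is close rather than exact; the discrepancy shrinks as the number of taps grows.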
FIGURE 14-15

The preceding discussion leads to the following important consequences of the white-noise assumption (14-113):

1. The innovations $i_x[n]$ of $x[n]$ are proportional to the difference $x[n] - \hat{s}_0[n]$:

$$ x[n] - \hat{s}_0[n] = D\,i_x[n] \qquad D = \frac{N}{l_x[0]} \tag{14-119} $$

Indeed, $x[n] - \hat{s}_0[n]$ is the output of the filter $L_x(z) - [L_x(z) - D] = D$ with input $i_x[n]$ (Fig. 14-15a). Thus the process $i_x[n]$ can be realized simply by a feedback system (Fig. 14-15b) involving merely the filter $H_{i_x}^0(z)$.

2. The $r$-step filtering and prediction estimate $\hat{s}_r[n+r]$ can be obtained by cascading the pure filter $H_x^0(z)$ of $s[n]$ with the pure predictor $H_r(z)$ of $\hat{s}_0[n+r]$.

3. If the signal $s[n]$ is an ARMA process, then its estimate $\hat{s}_0[n]$ is also an ARMA process. Indeed, if $L_x(z) = A(z)/B(z)$ is rational, then [see (14-114)] the filter $H_x^0(z)$ is also rational. Furthermore, the denominator $B(z)$ of $L_x(z)$ is the same as the denominator of the forward component $L_x(z) - D$ of the feedback realization of $H_x^0(z)$ shown in Fig. 14-15b.

As we shall presently see, these results are central in the development of Kalman filters.

14-4 KALMAN FILTERS†

†R. E. Kalman: "A New Approach to Linear Filtering and Prediction Problems," ASME Transactions, vol. 82D, 1960.

In this section we extend the preceding results to nonstationary processes with causal data and we show that the results can be simplified if the noise is white and the signal is an ARMA process. The estimate $\hat{s}_r[n+r]$ of $s[n+r]$ in terms
of the data $x[n] = s[n] + v[n]$ takes now the form

$$ \hat{s}_r[n+r] = \hat{E}\{s[n+r] \mid x[k],\ 0 \le k \le n\} = \sum_{k=0}^{n} h_x^r[n,k]\,x[k] \tag{14-120} $$

Thus $\hat{s}_r[n+r]$ is the output of a causal, time-varying system with input $x[n]U[n]$, and our problem is to find its delta response $h_x^r[n,k]$. As we know,

$$ s[n+r] - \hat{s}_r[n+r] \perp x[m] \qquad 0 \le m \le n $$

This yields

$$ R_{sx}[n+r,m] = \sum_{k=0}^{n} h_x^r[n,k]\,R_{xx}[k,m] \qquad 0 \le m \le n \tag{14-121} $$

Thus $h_x^r[n,k]$ must be such that its response to $R_{xx}[n,m]$ (the time variable is $n$) equals $R_{sx}[n+r,m]$ for every $0 \le m \le n$. For a specific $n$, this yields $n+1$ equations for the $n+1$ unknowns $h_x^r[n,k]$.

To simplify the determination of $h_x^r[n,k]$, we shall express the desired estimate $\hat{s}_r[n+r]$ in terms of the Kalman innovations [see (14-77)]

$$ i_x[n] = \sum_{k=0}^{n} \gamma_x[n,k]\,x[k] \tag{14-122} $$

of the process $x[n]U[n]$, where $\gamma_x[n,k]$ is the Kalman whitening filter. The process $i_x[n]$ is orthonormal and, if the data are linearly independent, then the processes $x[n]$ and $i_x[n]$ are linearly equivalent. This leads to the conclusion that $\hat{s}_r[n+r]$ can be expressed in terms of $i_x[n]$ and its past (Fig. 14-16):

$$ \hat{s}_r[n+r] = \sum_{k=0}^{n} h_{i_x}^r[n,k]\,i_x[k] \tag{14-123} $$

FIGURE 14-16
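For a finite block of data, a whitening filter of the form (14-122) can be sketched with a Cholesky factorization: if the correlation matrix of $x[0], \ldots, x[n]$ is written as $R = LL^T$ with $L$ lower triangular, then the rows of $L^{-1}$ act as causal whitening coefficients. The example below is a hedged illustration, not the text's construction; the correlation model ($0.8^{|m-k|}$ plus unit white noise) and all function names are assumptions.

```python
import math

def cholesky(R):
    """Lower-triangular L with R = L L^T (R symmetric positive definite)."""
    n = len(R)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(R[i][i] - s)
            else:
                L[i][j] = (R[i][j] - s) / L[j][j]
    return L

def lower_inverse(L):
    """Inverse of a lower-triangular matrix, one column at a time."""
    n = len(L)
    G = [[0.0] * n for _ in range(n)]
    for col in range(n):
        for row in range(col, n):
            rhs = 1.0 if row == col else 0.0
            acc = sum(L[row][k] * G[k][col] for k in range(col, row))
            G[row][col] = (rhs - acc) / L[row][row]
    return G

def whiten(gamma, x):
    """i[n] = sum_{k <= n} gamma[n][k] x[k], the finite-sum form of (14-122)."""
    return [sum(gamma[n][k] * x[k] for k in range(n + 1)) for n in range(len(x))]

n_samples = 6
# Assumed data correlation: signal 0.8^|m-k| plus unit-power white noise.
R_xx = [[0.8 ** abs(m - k) + (1.0 if m == k else 0.0) for k in range(n_samples)]
        for m in range(n_samples)]
gamma = lower_inverse(cholesky(R_xx))   # gamma[n][k] plays the role of gamma_x[n, k]

# Correlation of the whitened sequence: gamma R_xx gamma^T should be the identity.
R_ii = [[sum(gamma[m][j] * R_xx[j][k] * gamma[n][k]
             for j in range(n_samples) for k in range(n_samples))
         for n in range(n_samples)] for m in range(n_samples)]
```

Each row `gamma[n]` uses only samples up to index `n`, which mirrors the causality of $\gamma_x[n,k]$; the cost of building it grows with `n`, which is the burden the white-noise and ARMA assumptions later remove.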
To determine $h_{i_x}^r[n,k]$, we apply the orthogonality principle. Since $R_{i_x i_x}[m,n] = \delta[m-n]$, this yields

$$ R_{si_x}[n+r,m] = \sum_{k=0}^{n} h_{i_x}^r[n,k]\,\delta[k-m] $$

Hence

$$ h_{i_x}^r[n,m] = R_{si_x}[n+r,m] \qquad 0 \le m \le n \tag{14-124} $$

This function can be expressed in terms of the cross-correlation $R_{sx}[m,n]$. Multiplying (14-122) by $s[m]$, we obtain

$$ R_{si_x}[m,n] = \sum_{k=0}^{n} \gamma_x[n,k]\,R_{sx}[m,k] \tag{14-125} $$

Thus, for a specific $m$, $R_{si_x}[m,n]$ is the response of the Kalman whitening filter of $x[n]$ to the function $R_{sx}[m,n]$ where $n$ is the variable. To complete the specification of $\hat{s}_r[n+r]$, we cascade the filter $h_{i_x}^r[n,m]$ with the whitening filter $\gamma_x[n,k]$ as in Fig. 14-16.

ARMA Signals in White Noise

In the numerical implementation of the above, we are faced with two problems: (1) the realization of the Kalman innovations process $i_x[n]$; (2) the determination of the sum in (14-123). In general, these problems are complex, involving storage capacity and number of computations proportional to $n$. However, as we show next, under certain realistic assumptions the problem can be simplified drastically.

ASSUMPTION 1. The noise is white and orthogonal to the signal:

$$ R_{vv}[m,n] = N_n\,\delta[m-n] \qquad R_{sv}[m,n] = 0 \tag{14-126} $$

This leads to the following conclusions.

Property 1. If $\hat{s}_0[n]$ is the estimate of $s[n]$ in terms of $x[n]$ and its past and $D_n^2$ is the MS estimation error, then the difference $x[n] - \hat{s}_0[n]$ is proportional to the Kalman innovations $i_x[n]$ of the data $x[n]$:

$$ x[n] - \hat{s}_0[n] = D_n\,i_x[n] \tag{14-127a} $$
$$ D_n^2 = E\{|x[n] - \hat{s}_0[n]|^2\} \tag{14-127b} $$

Proof. The difference $x[n] - \hat{s}_0[n]$ depends linearly on $x[n]$ and its past. Furthermore, the processes $v[n]$ and $s[n] - \hat{s}_0[n]$ are orthogonal to the past of $x[n]$. Hence

$$ x[n] - \hat{s}_0[n] = s[n] - \hat{s}_0[n] + v[n] \perp x[k] \qquad k < n $$
FIGURE 14-17

From this it follows that the process $x[n] - \hat{s}_0[n]$ is white noise and

$$ x[n] - \hat{s}_0[n] \perp i_x[k] \qquad 0 \le k \le n-1 $$

because the processes $x[k]$ and $i_x[k]$ are linearly equivalent. And since $x[n] - \hat{s}_0[n]$ depends linearly on $i_x[k]$ for $0 \le k \le n$, (14-127a) results. Equation (14-127b) is a consequence of the requirement that $E\{i_x^2[n]\} = 1$.

Property 1 shows that the process $i_x[n]$ can be realized simply by the feedback system of Fig. 14-17. This eliminates the need for designing the whitening filter $\gamma_x[n,k]$.

Property 2. The estimate $\hat{s}_r[n+r]$ of $s[n+r]$ equals the pure predictor $\hat{\hat{s}}_0[n+r]$ of the estimate $\hat{s}_0[n]$ of $s[n]$ (Fig. 14-17):

$$ \hat{s}_r[n+r] = \hat{\hat{s}}_0[n+r] = \sum_{k=0}^{n} h_r[n,k]\,\hat{s}_0[k] \tag{14-128} $$

provided that, for every $n \ge 0$,

$$ E\{\hat{s}_0[n]\,i_x[n]\} = E\{x[n]\,i_x[n]\} - D_n \ne 0 \tag{14-129} $$

Proof. The process $\hat{s}_0[n]$ is linearly dependent on $x[n]$ and its past. Condition (14-129) means that the component of $\hat{s}_0[n]$ in the $i_x[n]$ direction is not 0. Hence the processes $\hat{s}_0[n]$ and $x[n]$ are linearly equivalent. And since

$$ \hat{s}_0[n+r] - \hat{\hat{s}}_0[n+r] \perp \hat{s}_0[k] \qquad 0 \le k \le n $$

we conclude that

$$ \hat{s}_0[n+r] - \hat{\hat{s}}_0[n+r] \perp x[k] \qquad 0 \le k \le n $$

Furthermore,

$$ s[n+r] - \hat{s}_0[n+r] \perp x[k] \qquad 0 \le k \le n+r $$

because $\hat{s}_0[n+r]$ is the estimate of $s[n+r]$ in terms of $x[k]$ for $0 \le k \le n+r$. Finally,

$$ s[n+r] - \hat{\hat{s}}_0[n+r] = (s[n+r] - \hat{s}_0[n+r]) + (\hat{s}_0[n+r] - \hat{\hat{s}}_0[n+r]) $$

Hence

$$ s[n+r] - \hat{\hat{s}}_0[n+r] \perp x[k] \qquad 0 \le k \le n $$

and (14-128) results.

This property shows that filtering and prediction can be reduced to a cascade of a pure filter and a pure predictor.
FIGURE 14-18

ASSUMPTION 2. The signal $s[n]$ is a time-varying ARMA process (Fig. 14-18a):

$$ s[n] - a_1^n s[n-1] - \cdots - a_M^n s[n-M] = \sum_{k=0}^{M-1} b_k^n\,\zeta[n-k] \tag{14-130} $$
$$ R_{\zeta\zeta}[m,n] = V_n\,\delta[m-n] $$

Property 3. The estimate $\hat{s}_0[n]$ is also an ARMA process:

$$ \hat{s}_0[n] - a_1^n \hat{s}_0[n-1] - \cdots - a_M^n \hat{s}_0[n-M] = \sum_{k=0}^{M-1} c_k^n\,i_x[n-k] \tag{14-131} $$

where the coefficients $a_k^n$ are the same as in (14-130) and the coefficients $c_k^n$ are $M$ constants to be determined.

We assume that the above is true for all past estimates $\hat{s}_0[n-k]$ and we shall prove that if $\hat{s}_0[n]$ is given by (14-131), then it is the estimate of $s[n]$. It suffices to show that if the constants $c_k^n$ are suitably chosen, then the resulting error satisfies the orthogonality principle

$$ e[n] = s[n] - \hat{s}_0[n] \perp x[r] \qquad 0 \le r \le n \tag{14-132} $$
Subtracting (14-131) from (14-130), we obtain

$$ e[n] = \sum_{k=1}^{M} a_k^n\,e[n-k] + \sum_{k=0}^{M-1}\left(b_k^n\,\zeta[n-k] - c_k^n\,i_x[n-k]\right) \tag{14-133} $$

But

$$ \zeta[n-k] \perp x[r] \qquad i_x[n-k] \perp x[r] \qquad \text{for } r < n-k $$

and $e[n-k] \perp x[r]$ for $r \le n-k$ (induction hypothesis). Hence (14-132) is true for $r \le n-M$. It suffices, therefore, to select the $M$ constants $c_k^n$ such that

$$ E\{e[n]\,x[r]\} = 0 \qquad n-M+1 \le r \le n \tag{14-134} $$

We have thus expressed $\hat{s}_0[n]$ in terms of $i_x[n]$. To complete the specification of the filter, we use (14-127a). This yields the feedback system of Fig. 14-18b involving $M+1$ unknown parameters: the constant $D_n$ and the $M$ coefficients $c_k^n$. These parameters can be determined from (14-127b) and the $M$ equations in (14-134).

The recursion equation (14-131) can be written as a system of $M$ first-order equations (state equations) or, equivalently, as a first-order vector equation (see Sec. 12-2). The unknowns are then the scalar $D_n$ and the coefficients $c_k^n$. To simplify the analysis, we shall carry out the determination of the unknown parameters for the first-order scalar case only. The results hold also for the vector case mutatis mutandis.

FIRST-ORDER. If

$$ s[n] - A_n s[n-1] = \zeta[n] \qquad E\{\zeta^2[n]\} = V_n \tag{14-135} $$

then (14-131) yields

$$ \hat{s}_0[n] - A_n\hat{s}_0[n-1] = K_n\,(x[n] - \hat{s}_0[n]) \tag{14-136} $$

where $K_n = c_0^n/D_n$. This is a first-order system as in Fig. 14-19a.

FIGURE 14-19

To complete
its specification, we must find the constant $K_n$. We maintain that

$$ K_n = \frac{P_n}{N_n - P_n} \qquad P_n = E\{e^2[n]\} \tag{14-137} $$

In the above, $N_n$ is the average intensity of $v[n]$, which we assume known. The MS error $P_n$ can be determined recursively:

$$ \frac{N_n P_n}{N_n - P_n} = A_n^2 P_{n-1} + V_n \tag{14-138} $$

Proof. Multiplying the data $x[n] = s[n] + v[n]$ by the error

$$ e[n] = s[n] - \hat{s}_0[n] = x[n] - \hat{s}_0[n] - v[n] $$

and using the orthogonality condition (14-132), we obtain

$$ E\{e[n]\,x[n]\} = 0 = P_n + E\{e[n]\,v[n]\} $$

From (14-135) and (14-136) it follows that

$$ e[n] = A_n e[n-1] + \zeta[n] - K_n(e[n] + v[n]) $$
$$ (1 + K_n)\,e[n] = A_n e[n-1] + \zeta[n] - K_n v[n] \tag{14-139} $$

Hence

$$ (1 + K_n)\,E\{e[n]\,v[n]\} = -K_n E\{v^2[n]\} $$

that is, $(1 + K_n)P_n = K_n N_n$, and (14-137) results. To prove (14-138), we multiply each side of (14-139) by each side of the identity $s[n] = A_n s[n-1] + \zeta[n]$ and take expected values. This yields

$$ (1 + K_n)P_n = A_n^2 P_{n-1} + V_n $$

Since $1 + K_n = N_n/(N_n - P_n)$, the above yields (14-138).

Note. Using (14-136), we can readily show that

$$ \hat{s}_0[n] - A_n\hat{s}_0[n-1] = \bar{K}_n\,(x[n] - A_n\hat{s}_0[n-1]) \tag{14-140} $$

where

$$ \bar{K}_n = \frac{K_n}{1 + K_n} = \frac{P_n}{N_n} = \frac{A_n^2 P_{n-1} + V_n}{A_n^2 P_{n-1} + V_n + N_n} \tag{14-141} $$

The corresponding system is shown in Fig. 14-19b. In the same diagram, we also show the realization of the one-step predictor

$$ \hat{s}_1[n+1] = \hat{\hat{s}}_0[n+1] = A_{n+1}\hat{s}_0[n] $$

of $s[n+1]$. This follows readily from (14-128) because the process $\hat{s}_0[n]$ is AR; hence its pure predictor equals $A_{n+1}\hat{s}_0[n]$.

The iteration. The estimate $\hat{s}_0[n]$ of $s[n]$ is determined recursively: If $\bar{K}_{n-1}$ (equivalently $P_{n-1}$) and $\hat{s}_0[n-1]$ are known, then $\bar{K}_n$ is determined from (14-141) and $\hat{s}_0[n]$ from (14-140). To start the iteration, we must specify the initial conditions of
(14-135). We shall assume that

$$ s[0] = \zeta[0] $$

This leads to the initial estimate $\hat{s}_0[0] = \bar{K}_0\,x[0]$, where $\bar{K}_0$ is the gain of (14-140), from which it follows that

$$ \bar{K}_0 = \frac{E\{s[0]\,x[0]\}}{E\{x^2[0]\}} = \frac{E\{\zeta^2[0]\}}{E\{\zeta^2[0]\} + E\{v^2[0]\}} $$

Hence

$$ P_0 = \frac{V_0 N_0}{V_0 + N_0} \qquad \bar{K}_0 = \frac{V_0}{V_0 + N_0} \tag{14-142} $$

Linearization. Equation (14-138) and its equivalent (14-141) are nonlinear. However, each can be replaced by two linear equations. Indeed, if $F_n$ and $G_n$ are two sequences such that

$$ F_n = A_n^2 F_{n-1} + V_n G_{n-1} \qquad F_0 = V_0 N_0 $$
$$ N_n G_n = A_n^2 F_{n-1} + (V_n + N_n)\,G_{n-1} \qquad G_0 = V_0 + N_0 \tag{14-143} $$

then

$$ P_n = \frac{F_n}{G_n} $$

Example 14-10. We shall determine the noncausal, the causal, and the Kalman estimate of a process $s[n]$ in terms of the data $x[k] = s[k] + v[k]$, and the corresponding MS error $P$. We assume that the process $s[n]$ satisfies the equation

$$ s[n] - 0.8\,s[n-1] = \zeta[n] $$

and that

$$ R_{\zeta\zeta}[m] = 0.36\,\delta[m] \qquad R_{\zeta v}[m] = 0 \qquad R_{vv}[m] = \delta[m] $$

This is a special case of the process considered in Example 14-2 with

$$ a = 0.8 \qquad N = 1 \qquad N_0 = 0.36 \qquad b = 0.5 $$

Hence

$$ S_{ss}(z) = \frac{0.36}{(1 - 0.8z^{-1})(1 - 0.8z)} \qquad R_{ss}[m] = 0.8^{|m|} $$
$$ S_{xx}(z) = L_x(z)L_x(z^{-1}) \qquad L_x(z) = \sqrt{1.6}\;\frac{z - 0.5}{z - 0.8} $$

(a) Smoothing: $x[k]$ is available for all $k$. In this case the solution is obtained from Example 14-2 with $b = 0.5$ and $c = 0.375$:

$$ h[n] = 0.3 \times 0.5^{|n|} \qquad P = 0.3 $$
(b) Causal filter: $x[k]$ is available for $k \le n$. The filter is determined from (14-114) where now $l_x[0] = \sqrt{1.6}$:

$$ H_x(z) = 1 - \frac{z - 0.8}{1.6\,(z - 0.5)} = \frac{0.375\,z}{z - 0.5} \qquad h_x[n] = 0.375 \times 0.5^n\,U[n] $$

This shows that the estimate $\hat{s}[n]$ of $s[n]$ satisfies the recursion equation

$$ \hat{s}[n] - 0.5\,\hat{s}[n-1] = 0.375\,x[n] $$

The resulting MS error equals

$$ P = R_{ss}[0] - \sum_{k=0}^{\infty} h_x[k]\,R_{ss}[k] = 0.375 $$

(c) Kalman filter: $x[k]$ is available for $0 \le k \le n$. Our process is a special case of (14-135) with

$$ A_n = 0.8 \qquad V_n = E\{\zeta^2[n]\} = 0.36 \qquad N_n = E\{v^2[n]\} = 1 $$

Inserting into (14-143), we obtain

$$ F_n = 0.64\,F_{n-1} + 0.36\,G_{n-1} \qquad F_0 = 0.36 $$
$$ G_n = 0.64\,F_{n-1} + 1.36\,G_{n-1} \qquad G_0 = 1.36 $$

This is a system of linear recursion equations and can be readily solved with z transforms. Since $P_n = F_n/G_n$ and $N = 1$, the gain of (14-140) equals $P_n$, and the solution yields

$$ \bar{K}_n = P_n = \frac{0.48\,z_1^n - 0.12\,z_2^n}{1.28\,z_1^n + 0.08\,z_2^n} \qquad z_1 = 1.6 \quad z_2 = 0.4 $$

In particular, evaluating this expression,

n = 0, 1, 2, 3, 4
P_n = 0.265, 0.346, 0.368, 0.373, 0.375

Thus, although the number of the available data increases as $n$ increases, the MS error $P_n$ also increases. The reason is that $s[n]$ is a nonstationary process with initial second moment $V_0 = 0.36$ because $s[0] = \zeta[0]$, and, as $n$ increases, $E\{s^2[n]\}$ approaches the value 1.

We note, finally, that

$$ \bar{K}_n = P_n \to \frac{0.48}{1.28} = 0.375 \qquad \text{as } n \to \infty $$

and (14-140) yields

$$ \hat{s}_0[n] - 0.8\,\hat{s}_0[n-1] = 0.375\,x[n] - 0.3\,\hat{s}_0[n-1] $$

that is, $\hat{s}_0[n] - 0.5\,\hat{s}_0[n-1] = 0.375\,x[n]$, which is the recursion of the causal filter in (b). The above shows that, if the process $s[n]$ is WSS, then its Kalman filter approaches the causal Wiener filter as $n \to \infty$. This is the case for any $P_0$ because the limit of $F_n/G_n$ as $n \to \infty$ equals 0.375 regardless of the initial conditions.
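The recursion of part (c) is easy to reproduce numerically. The following sketch iterates the nonlinear recursion implied by (14-138), written in the explicit form $P_n = MN/(M+N)$ with $M = A^2P_{n-1} + V$, alongside the linear pair (14-143), and also runs the filter (14-140) on a data list. The function names and the explicit form are ours; only the constants $A = 0.8$, $V = 0.36$, $N = 1$ come from the example.

```python
# Constants of Example 14-10; all function and variable names are illustrative.
A, V, N = 0.8, 0.36, 1.0

def riccati_step(P_prev):
    """One step of the MS-error recursion: P_n = M N / (M + N), M = A^2 P_{n-1} + V."""
    M = A * A * P_prev + V
    return M * N / (M + N)

def linear_step(F_prev, G_prev):
    """One step of the linearized pair (14-143)."""
    F = A * A * F_prev + V * G_prev
    G = (A * A * F_prev + (V + N) * G_prev) / N
    return F, G

def kalman_estimates(x):
    """Run (14-140) on the samples x[0], x[1], ... with the start s[0] = zeta[0]."""
    estimates, P_n, s_prev = [], None, 0.0
    for n, sample in enumerate(x):
        P_n = V * N / (V + N) if n == 0 else riccati_step(P_n)
        gain = P_n / N                        # gain of (14-141)
        predicted = A * s_prev if n > 0 else 0.0
        s_prev = predicted + gain * (sample - predicted)
        estimates.append(s_prev)
    return estimates

steps = 30
P = [V * N / (V + N)]                          # P_0 from (14-142)
F, G = V * N, V + N                            # F_0 and G_0
ratios = [F / G]
for n in range(1, steps):
    P.append(riccati_step(P[-1]))
    F, G = linear_step(F, G)
    ratios.append(F / G)

# Closed form quoted in the example, with z1 = 1.6 and z2 = 0.4.
closed = [(0.48 * 1.6 ** n - 0.12 * 0.4 ** n) / (1.28 * 1.6 ** n + 0.08 * 0.4 ** n)
          for n in range(steps)]

print([round(p, 3) for p in P[:5]])
```

The three sequences `P`, `ratios`, and `closed` should agree to rounding error, and all settle near 0.375, the MS error of the causal Wiener filter.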
Example 14-11. We wish to estimate the RV $s$ in terms of the sum

$$ x[n] = s + v[n] \qquad \text{where} \quad E\{s\,v[n]\} = 0 \quad R_{vv}[m,n] = N\delta[m-n] $$

The estimate $\hat{s}_0[n]$ in terms of the data $x[n]$ can be obtained as the output of a Kalman filter if we consider the RV $s$ as a stochastic process satisfying trivially (14-135):

$$ s[n] = s[n-1] + \zeta[n] \qquad s[-1] = 0 $$
$$ \zeta[n] = \begin{cases} s & n = 0 \\ 0 & n > 0 \end{cases} \qquad V_n = \begin{cases} E\{s^2\} = M & n = 0 \\ 0 & n > 0 \end{cases} $$

In this case, $A_n = 1$, $N_n = N$, and (14-143) yields

$$ F_n = F_{n-1} \qquad NG_n = F_{n-1} + NG_{n-1} \qquad F_0 = MN \qquad G_0 = M + N $$

Solving, we obtain

$$ F_n = MN \qquad G_n = M + N + Mn $$

Hence

$$ \hat{s}_0[n] = \frac{N + Mn}{M + N + Mn}\,\hat{s}_0[n-1] + \frac{M}{M + N + Mn}\,x[n] $$

Continuous-Time Processes

We wish, finally, to determine the estimate

$$ \hat{s}_0(t) = \hat{E}\{s(t) \mid x(\tau),\ 0 \le \tau \le t\} \tag{14-144} $$

of a continuous-time process $s(t)$ in terms of the data

$$ x(t) = s(t) + v(t) \tag{14-145} $$

The solution of this problem parallels the discrete-time solution if recursion equations are replaced by differential equations and sums by integrals. It might be instructive, however, to rederive the principal results using a different approach. To avoid repetition, we start directly with the white-noise assumption

$$ R_{vv}(t,\tau) = N(t)\,\delta(t-\tau) \qquad N(t) > 0 \qquad R_{sv}(t,\tau) = 0 \tag{14-146} $$

and we show that the process $w(t) = x(t) - \hat{s}_0(t)$ is white noise with autocorrelation

$$ R_{ww}(t,\tau) = N(t)\,\delta(t-\tau) \tag{14-147} $$

Proof. As we know,

$$ e(t) = s(t) - \hat{s}_0(t) \perp x(\tau) \qquad v(t) \perp x(\tau) \qquad \text{for } \tau < t $$

Furthermore, $w(\tau)$ depends linearly on $x(\tau)$ and its past. Hence

$$ w(t) = e(t) + v(t) \perp w(\tau) \qquad \tau < t \tag{14-148} $$
To complete the proof of (14-147), we shall assume that $\hat{s}_0(t)$ is continuous from the left:

$$ \hat{s}_0(t^-) = \hat{s}_0(t) $$

This is not true at the origin if $s(0) \ne 0$. However, for sufficiently large $t$ the effect of the initial condition can be neglected. From the above it follows that

$$ P(t) = E\{e^2(t)\} < \infty $$

and since $e(t) \perp v(\tau)$ for $\tau > t$, we conclude that

$$ R_{ww}(t,\tau) = R_{wv}(t,\tau) = R_{vv}(t,\tau) = N(\tau)\,\delta(t-\tau) $$

Using a limit argument, we can show that, as in the discrete-time case, the normalized process $w(t)/\sqrt{N(t)}$ is the Kalman innovations of $x(t)$. The details, however, will be omitted. This leads to the conclusion that $\hat{s}_0(t)$ can be expressed in terms of $w(t)$ [see also (14-123)]:

$$ \hat{s}_0(t) = \int_0^t h_w(t,\alpha)\,w(\alpha)\,d\alpha \tag{14-149} $$

Since $s(t) - \hat{s}_0(t) \perp w(\tau)$ for $\tau \le t$, we conclude from the above and (14-147) that

$$ R_{sw}(t,\tau) = \int_0^t h_w(t,\alpha)\,N(\alpha)\,\delta(\tau-\alpha)\,d\alpha = h_w(t,\tau)\,N(\tau) $$

and (14-149) yields

$$ \hat{s}_0(t) = \int_0^t \frac{1}{N(\alpha)}\,R_{sw}(t,\alpha)\,w(\alpha)\,d\alpha \tag{14-150} $$

We note that [see (14-148) and (14-146)]

$$ R_{sw}(t,t) = E\{s(t)[e(t) + v(t)]\} = P(t) $$

WIDE-SENSE MARKOFF PROCESSES. Using the above, we shall show that, if the signal $s(t)$ is WS Markoff, that is, if it satisfies a differential equation driven by white noise, then its estimate $\hat{s}_0(t)$ satisfies a similar equation. For simplicity, we consider the first-order case

$$ s'(t) + A(t)\,s(t) = \zeta(t) \qquad R_{\zeta\zeta}(t,\tau) = V(t)\,\delta(t-\tau) \tag{14-151} $$

The Kalman-Bucy equations†.

†R. E. Kalman and R. S. Bucy: "New Results in Linear Filtering and Prediction Theory," ASME Transactions, vol. 83D, 1961.

We maintain that

$$ \hat{s}_0'(t) + A(t)\,\hat{s}_0(t) = K(t)\,[x(t) - \hat{s}_0(t)] \tag{14-152} $$
where

$$ K(t) = \frac{P(t)}{N(t)} \tag{14-153} $$

Furthermore, the MS error $P(t)$ satisfies the Riccati equation

$$ P'(t) + 2A(t)P(t) = V(t) - \frac{P^2(t)}{N(t)} \tag{14-154} $$

Proof. Multiplying the differential equation in (14-151) by $w(\tau)$, we obtain

$$ \frac{\partial}{\partial t}R_{sw}(t,\tau) + A(t)\,R_{sw}(t,\tau) = 0 \qquad \tau < t \tag{14-155} $$

We next equate the derivatives of both sides of (14-150):

$$ \hat{s}_0'(t) = \frac{1}{N(t)}\,R_{sw}(t,t)\,w(t) + \int_0^t \frac{1}{N(\alpha)}\,\frac{\partial}{\partial t}R_{sw}(t,\alpha)\,w(\alpha)\,d\alpha $$

Finally, we multiply (14-150) by $A(t)$ and add with the above. This yields (14-152) because $R_{sw}(t,t) = P(t)$ and, as we see from (14-155), the sum of the two integrals is 0.

To prove (14-154), we use the following version of (10-90): If $z(t)$ is a process with $E\{z^2(t)\} = I(t)$ and such that

$$ z'(t) + B(t)\,z(t) = \eta(t) \qquad R_{\eta\eta}(t,\tau) = Q(t)\,\delta(t-\tau) \tag{14-156} $$

then (see Prob. 10-28b)

$$ I'(t) + 2B(t)\,I(t) = Q(t) \tag{14-157} $$

Returning to (14-152), we observe, subtracting from (14-151), that the estimation error $e(t)$ satisfies the equation

$$ e'(t) + [A(t) + K(t)]\,e(t) = \zeta(t) - K(t)\,v(t) $$

In the above, the right side $\eta(t) = \zeta(t) - K(t)v(t)$ is white noise as in (14-156) with

$$ Q(t) = V(t) + K^2(t)\,N(t) $$

Hence the function $P(t) = E\{e^2(t)\}$ satisfies (14-157) where $B(t) = A(t) + K(t)$. This yields

$$ P'(t) + 2[A(t) + K(t)]\,P(t) = V(t) + K^2(t)\,N(t) $$

and, with $K(t) = P(t)/N(t)$, (14-154) results.

Linearization. We shall now show that the nonlinear equation (14-154) is equivalent to two linear equations. For this purpose, we introduce the functions $F(t)$ and $G(t)$ such that

$$ P(t) = \frac{F(t)}{G(t)} \tag{14-158} $$
Clearly, $F'(t) = P'(t)G(t) + P(t)G'(t)$ and (14-154) yields

$$ F'(t) + A(t)F(t) - V(t)G(t) = P(t)\left[G'(t) - A(t)G(t) - \frac{F(t)}{N(t)}\right] $$

This is satisfied if

$$ F'(t) = -A(t)F(t) + V(t)G(t) \qquad G'(t) = A(t)G(t) + \frac{F(t)}{N(t)} \tag{14-159} $$

To solve the above system, we must specify $F(0)$ and $G(0)$. Setting arbitrarily $G(0) = 1$, we obtain $F(0) = P(0)$ where $P(0) = E\{s^2(0)\}$ is the initial value of the MS error $P(t)$. The determination of the Kalman filter thus depends on the second moment of $s(0)$.

Example 14-12. We shall determine the noncausal, the causal, and the Kalman estimate of a process $s(t)$ in terms of the data $x(t) = s(t) + v(t)$, and the corresponding MS error $P$. We assume that $s(t)$ satisfies the equation

$$ s'(t) + 2s(t) = \zeta(t) $$

and that

$$ R_{\zeta\zeta}(\tau) = 12\,\delta(\tau) \qquad R_{\zeta v}(\tau) = 0 \qquad R_{vv}(\tau) = \delta(\tau) $$

This is a special case of the process considered in Example 14-7 with

$$ \alpha = 2 \qquad N = 1 \qquad N_0 = 12 \qquad \beta = 4 $$

Hence

$$ S_{ss}(\omega) = \frac{12}{4 + \omega^2} \qquad R_{ss}(\tau) = 3e^{-2|\tau|} $$
$$ S_{xx}(\omega) = \frac{16 + \omega^2}{4 + \omega^2} \qquad L_x(s) = \frac{s + 4}{s + 2} $$

(a) Smoothing: $x(\xi)$ is available for all $\xi$. In this case, (14-16) yields

$$ H(\omega) = \frac{12}{16 + \omega^2} \qquad h(t) = \frac{3}{2}\,e^{-4|t|} $$

The MS error is obtained from (14-15):

$$ P = 3 - \frac{9}{2}\int_{-\infty}^{\infty} e^{-4|\tau|}\,e^{-2|\tau|}\,d\tau = 1.5 $$

(b) Causal filter: $x(\xi)$ is available for $\xi \le t$. The unknown filter is specified in Example 14-8 with

$$ \alpha = 2 \qquad \beta = 4 \qquad N_0 = 12 $$
Thus

$$ H_x(s) = \frac{2}{s + 4} \qquad h_x(t) = 2e^{-4t}\,U(t) \qquad P = 2 $$

This shows that the estimate $\hat{s}(t)$ of $s(t)$ satisfies the differential equation

$$ \hat{s}'(t) + 4\hat{s}(t) = 2x(t) $$

(c) Kalman filter: $x(\xi)$ is available for $0 \le \xi \le t$. Our problem is a special case of (14-151) with

$$ A(t) = 2 \qquad V(t) = 12 \qquad N(t) = 1 $$

Hence [see (14-159)]

$$ F'(t) = -2F(t) + 12G(t) \qquad G'(t) = F(t) + 2G(t) $$

To solve this system, we must know $P(0)$.

Case 1. If $s(0) = 0$, then $P(0) = 0$. In this case, $F(0) = 0$, $G(0) = 1$. Inserting the solution of the above system into (14-153), we obtain

$$ K(t) = P(t) = \frac{6e^{4t} - 6e^{-4t}}{3e^{4t} + e^{-4t}} \to 2 \qquad \text{as } t \to \infty $$

Case 2. We now assume that $s(t)$ is the stationary solution of the differential equation specifying $s(t)$. In this case, $E\{s^2(0)\} = 3$; hence $P(0) = F(0) = 3$ and

$$ K(t) = P(t) = \frac{18e^{4t} + 6e^{-4t}}{9e^{4t} - e^{-4t}} \to 2 \qquad \text{as } t \to \infty $$

Thus, in both cases, the solution $\hat{s}_0(t)$ of the Kalman-Bucy equation (14-152) tends to the solution of the causal Wiener filter

$$ \hat{s}_0'(t) + 2\hat{s}_0(t) = 2x(t) - 2\hat{s}_0(t) $$

as $t \to \infty$.

Example 14-13. We wish to estimate the RV $s$ in terms of the sum

$$ x(t) = s + v(t) \qquad E\{s\,v(t)\} = 0 \qquad R_{vv}(\tau) = N\delta(\tau) $$

This is a special case of (14-151) if

$$ A(t) = 0 \qquad s(t) = s \qquad \zeta(t) = 0 \qquad N(t) = N $$

In this case, $V(t) = 0$, $P(0) = E\{s^2\} = M$, and (14-159) yields

$$ F'(t) = 0 \qquad G'(t) = \frac{F(t)}{N} \qquad F(0) = M \qquad G(0) = 1 $$

Hence

$$ F(t) = M \qquad G(t) = 1 + \frac{Mt}{N} $$

so that $K(t) = P(t)/N = M/(N + Mt)$. Inserting into (14-152), we obtain

$$ \hat{s}_0'(t) + \frac{M}{N + Mt}\,\hat{s}_0(t) = \frac{M}{N + Mt}\,x(t) $$
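The two cases of Example 14-12 can be cross-checked by integrating the Riccati equation (14-154) directly. The sketch below uses a classical fourth-order Runge-Kutta step, which is our choice of integrator rather than anything prescribed by the text, and compares the result with the closed forms quoted for $P(0) = 0$ and $P(0) = 3$. Function names and step counts are assumptions.

```python
import math

A, V, N = 2.0, 12.0, 1.0       # constants of Example 14-12

def riccati_rhs(P):
    """P'(t) from (14-154): P' = V - 2 A P - P^2 / N."""
    return V - 2.0 * A * P - P * P / N

def integrate(P0, t_end, steps):
    """Classical fourth-order Runge-Kutta integration of the Riccati equation."""
    h = t_end / steps
    P = P0
    for _ in range(steps):
        k1 = riccati_rhs(P)
        k2 = riccati_rhs(P + 0.5 * h * k1)
        k3 = riccati_rhs(P + 0.5 * h * k2)
        k4 = riccati_rhs(P + h * k3)
        P += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return P

def closed_case1(t):
    """Closed form for P(0) = 0."""
    return (6.0 * math.exp(4 * t) - 6.0 * math.exp(-4 * t)) / (3.0 * math.exp(4 * t) + math.exp(-4 * t))

def closed_case2(t):
    """Closed form for P(0) = 3 (stationary start)."""
    return (18.0 * math.exp(4 * t) + 6.0 * math.exp(-4 * t)) / (9.0 * math.exp(4 * t) - math.exp(-4 * t))

for t in (0.1, 0.5, 1.0):
    print(t, integrate(0.0, t, 2000), closed_case1(t),
          integrate(3.0, t, 2000), closed_case2(t))
```

Starting below the limit ($P(0) = 0$) the error rises toward 2, and starting above it ($P(0) = 3$) the error falls toward 2, which is the MS error of the causal Wiener filter in part (b).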
PROBLEMS

14-1. If $R_s(\tau) = Ie^{-|\tau|/T}$ and

$$ \hat{E}\{s(t - T/2) \mid s(t), s(t-T)\} = as(t) + bs(t-T) $$

find the constants $a$, $b$ and the MS error.

14-2. Show that if $\hat{z} = as(0) + bs(T)$ is the MS estimate of

14-3. Show that if $x(t) = s(t) + v(t)$, $R_{sv}(\tau) = 0$ and

$$ \hat{E}\{s'(t) \mid x(t), x(t-\tau)\} = ax(t) + bx(t-\tau) $$

then for small $\tau$, $a = -b \simeq R_{ss}''(0)/\tau R_{xx}''(0)$.

14-4. Show that, if $S_x(\omega) = 0$ for $|\omega| > \sigma = \pi/T$, then the linear MS estimate of $x(t)$ in terms of its samples $x(nT)$ equals

$$ \hat{E}\{x(t) \mid x(nT),\ n = -\infty, \ldots, \infty\} = \sum_{n=-\infty}^{\infty} \frac{\sin(\sigma t - n\pi)}{\sigma t - n\pi}\,x(nT) $$

and the MS error equals 0.

14-5. Show that if

$$ \hat{E}\{s(t+\lambda) \mid s(t), s(t-\tau)\} = \hat{E}\{s(t+\lambda) \mid s(t)\} $$

then $R_s(\tau) = Ie^{-\alpha|\tau|}$.

14-6. A random sequence $x_n$ is called a martingale if $E\{x_n\} = 0$ and

$$ E\{x_n \mid x_{n-1}, \ldots, x_1\} = x_{n-1} $$

Show that if the RVs $y_n$ are independent, then their sum $x_n = y_1 + \cdots + y_n$ is a martingale.

14-7. A random sequence $x_n$ is called wide-sense martingale if

$$ \hat{E}\{x_n \mid x_{n-1}, \ldots, x_1\} = x_{n-1} $$

(a) Show that a sequence $x_n$ is WS martingale if it can be written as a sum $x_n = y_1 + \cdots + y_n$ where the RVs $y_n$ are orthogonal. (b) Show that if the sequence $x_n$ is WS martingale, then

$$ E\{x_n^2\} \ge E\{x_{n-1}^2\} \ge \cdots \ge E\{x_1^2\} $$

Hint: $x_n = x_n - x_{n-1} + x_{n-1}$ and $x_n - x_{n-1} \perp x_{n-1}$.

14-8. Find the noncausal estimators $H_1(\omega)$ and $H_2(\omega)$ respectively of a process $s(t)$ and its derivative $s'(t)$ in terms of the data $x(t) = s(t) + v(t)$ where

$$ R_s(\tau) = A\,\frac{\sin^2 a\tau}{\tau^2} \qquad R_v(\tau) = N\delta(\tau) \qquad R_{sv}(\tau) = 0 $$

14-9. We denote by $H_s(\omega)$ and $H_y(\omega)$ respectively the noncausal estimators of the input $s(t)$ and the output $y(t)$ of the system $T(\omega)$ in terms of the data $x(t)$ (Fig. P14-9). Show that $H_y(\omega) = H_s(\omega)T(\omega)$.
FIGURE P14-9

14-10. Show that if $S(\omega) = 1/(1 + \omega^4)$, then the predictor of $s(t)$ in terms of its entire past equals $\hat{s}(t+\lambda) = b_0 s(t) + b_1 s'(t)$ where

$$ b_0 = e^{-\lambda/\sqrt{2}}\left(\cos\frac{\lambda}{\sqrt{2}} + \sin\frac{\lambda}{\sqrt{2}}\right) \qquad b_1 = \sqrt{2}\,e^{-\lambda/\sqrt{2}}\sin\frac{\lambda}{\sqrt{2}} $$

14-11. (a) Find a function $h(t)$ satisfying the integral equation (Wiener-Hopf)

$$ \int_0^\infty h(\alpha)\,R(\tau - \alpha)\,d\alpha = R(\tau + \ln 2) \qquad \tau \ge 0 \qquad R(\tau) = \ldots $$

(b) The function $H(s)$ is rational with poles in the left-hand plane. The function $Y(s)$ is analytic in the left-hand plane. Find $H(s)$ and $Y(s)$ if

$$ [H(s) - \ldots]\,\frac{49 - 25s^2}{\ldots} = Y(s) $$

(c) Discuss the relationship between (a) and (b).

14-12. (a) Find a sequence $h_n$ satisfying the system

$$ \sum_{k=0}^{\infty} h_k R_{m-k} = R_{m+1} \qquad m \ge 0 \qquad R_m = \frac{1}{2^{|m|}} + \frac{1}{3^{|m|}} $$

(b) The function $H(z)$ is rational with poles in the unit circle. The function $Y(z)$ is rational with poles outside the unit circle. Find $H(z)$ and $Y(z)$ if

$$ [H(z) - z]\,\frac{70 - 25(z + z^{-1})}{6(z + z^{-1})^2 - 35(z + z^{-1}) + 50} = Y(z) $$

(c) Discuss the relationship between (a) and (b).

14-13. Show that if $H(z)$ is a predictor of a process $s[n]$ and $H_a(z)$ is an all-pass function such that $|H_a(e^{j\omega})| = 1$, then the function $1 - (1 - H(z))H_a(z)$ is also a predictor with the same MS error $P$.

14-14. We have shown that the one-step predictor $\hat{s}_1[n]$ of an AR process of order $m$ in terms of its entire past equals [see (14-35)]

$$ \hat{E}\{s[n] \mid s[n-k],\ k \ge 1\} = -\sum_{k=1}^{m} a_k\,s[n-k] $$

Show that its two-step predictor $\hat{s}_2[n]$ is given by

$$ \hat{E}\{s[n] \mid s[n-k],\ k \ge 2\} = -a_1\hat{s}_1[n-1] - \sum_{k=2}^{m} a_k\,s[n-k] $$
14-15. Using (14-70) show that

$$ \lim_{N\to\infty}\frac{1}{N}\ln\Delta_N = \frac{1}{2\pi}\int_{-\pi}^{\pi}\ln S(\omega)\,d\omega $$

Hint:

$$ \frac{1}{N}\ln\Delta_N = \frac{1}{N}\sum_{n=1}^{N}\ln\frac{\Delta_n}{\Delta_{n-1}} \to \lim_{N\to\infty}\ln\frac{\Delta_N}{\Delta_{N-1}} $$

14-16. Find the predictor

$$ \hat{s}_N[n] = \hat{E}\{s[n] \mid s[n-k],\ 1 \le k \le N\} $$

of a process $s[n]$ and realize the error filter $E_N(z)$ as an FIR filter (Fig. 14-8) and as a lattice filter (Fig. 13-15) for $N = 1$, 2, and 3 if

$$ R_s[m] = \begin{cases} 5(3 - |m|) & |m| < 3 \\ 0 & |m| \ge 3 \end{cases} $$

14-17. The lattice filter of a process $s[n]$ is shown in Fig. P14-17 for $N = 3$. Find the corresponding FIR filter for $N = 1$, 2, and 3 and the values of $R[m]$ for $|m| \le 3$ if $R[0] = 5$.

FIGURE P14-17

14-18. We wish to find the estimate $\hat{s}(t)$ of the random telegraph signal $s(t)$ in terms of the sum $x(t) = s(t) + v(t)$ and its past, where

$$ R_s(\tau) = e^{-2\lambda|\tau|} \qquad R_v(\tau) = N\delta(\tau) \qquad R_{sv}(\tau) = 0 $$

Show that

$$ \hat{s}(t) = (c - 2\lambda)\int_0^\infty x(t - \alpha)\,e^{-c\alpha}\,d\alpha \qquad c = 2\lambda\sqrt{1 + \frac{1}{\lambda N}} $$

14-19. Show that if $\hat{\varepsilon}_N[n]$ is the forward prediction error and $\check{\varepsilon}_N[n]$ is the backward prediction error of a process $s[n]$, then (a) $\hat{\varepsilon}_N[n] \perp \hat{\varepsilon}_{N+m}[n+m]$, (b) $\check{\varepsilon}_N[n] \perp \check{\varepsilon}_{N+m}[n-m]$, (c) $\hat{\varepsilon}_N[n] \perp \check{\varepsilon}_{N+m}[n-N-m]$.

14-20. If $x(t) = s(t) + v(t)$:

$$ R_s(\tau) = \ldots \qquad R_v(\tau) = 5\delta(\tau) \qquad R_{sv}(\tau) = 0 $$

Find the following MS estimates and the corresponding MS errors: (a) the noncausal filter of $s(t)$; (b) the causal filter of $s(t)$; (c) the estimate of $s(t+2)$ in terms of $s(t)$ and its past; (d) the estimate of $s(t+2)$ in terms of $x(t)$ and its past.
14-21. If $x[n] = s[n] + v[n]$:

$$ R_s[m] = 5 \times 0.8^{|m|} \qquad R_v[m] = 5\delta[m] \qquad R_{sv}[m] = 0 $$

Find the following MS estimates and the corresponding MS errors: (a) the noncausal filter of $s[n]$; (b) the causal filter of $s[n]$; (c) the estimate of $s[n+1]$ in terms of $s[n]$ and its past; (d) the estimate of $s[n+1]$ in terms of $x[n]$ and its past.

14-22. Find the Kalman estimate

$$ \hat{s}_0[n] = \hat{E}\{s[n] \mid s[k] + v[k],\ 0 \le k \le n\} $$

of $s[n]$ and the MS error $P_n = E\{(s[n] - \hat{s}_0[n])^2\}$ if

$$ R_s[m] = 5 \times 0.8^{|m|} \qquad R_v[m] = 5\delta[m] \qquad R_{sv}[m] = 0 $$

14-23. Find the Kalman estimate

$$ \hat{s}_0(t) = \hat{E}\{s(\tau) + v(\tau),\ 0 \le \tau \le t\} $$

understood as the estimate of $s(t)$ in terms of the data $s(\tau) + v(\tau)$ for $0 \le \tau \le t$, and the MS error $P(t) = E\{[s(t) - \hat{s}_0(t)]^2\}$ if

$$ R_s(\tau) = \ldots \qquad R_v(\tau) = \ldots\,\delta(\tau) \qquad R_{sv}(\tau) = 0 $$

14-24. Show that the sequences $\hat{q}_N[m]$ and $\check{q}_N[m]$ of the inverse lattice of Fig. 14-11d satisfy (14-85) and (14-86) (see Note 1, page 496).
CHAPTER 15
ENTROPY

15-1 INTRODUCTION

As we have noted in Chap. 1, the probability $P(\mathcal{A})$ of an event $\mathcal{A}$ can be interpreted as a measure of our uncertainty about the occurrence or nonoccurrence of $\mathcal{A}$ in a single performance of the underlying experiment $\mathcal{S}$. If $P(\mathcal{A}) = 0.999$, then we are almost certain that $\mathcal{A}$ will occur; if $P(\mathcal{A}) = 0.1$, then we are reasonably certain that $\mathcal{A}$ will not occur; our uncertainty is maximum if $P(\mathcal{A}) = 0.5$. In this chapter, we consider the problem of assigning a measure of uncertainty to the occurrence or nonoccurrence not of a single event of $\mathcal{S}$, but of any event $\mathcal{A}_i$ of a partition $\mathfrak{A}$ of $\mathcal{S}$ where, as we recall, a partition is a collection of mutually exclusive events whose union equals $\mathcal{S}$ (Fig. 15-1). The measure of uncertainty about $\mathfrak{A}$ will be denoted by $H(\mathfrak{A})$ and will be called the entropy of the partitioning $\mathfrak{A}$.

Historically, the functional $H(\mathfrak{A})$ was derived from a number of postulates based on our heuristic understanding of uncertainty. The following is a typical set of such postulates†:

1. $H(\mathfrak{A})$ is a continuous function of $p_i = P(\mathcal{A}_i)$.
2. If $p_1 = \cdots = p_N = 1/N$, then $H(\mathfrak{A})$ is an increasing function of $N$.
3. If a new partition $\mathfrak{B}$ is formed by subdividing one of the sets of $\mathfrak{A}$, then $H(\mathfrak{B}) \ge H(\mathfrak{A})$.

†C. E. Shannon and W. Weaver: The Mathematical Theory of Communication, University of Illinois Press, 1949.
FIGURE 15-1

It can be shown that the sum

$$ H(\mathfrak{A}) = -p_1\log p_1 - \cdots - p_N\log p_N \tag{15-1}† $$

satisfies these postulates and it is unique within a constant factor. The proof of this assertion is not difficult but we choose not to reproduce it. We propose, instead, to introduce (15-1) as the definition of entropy and to develop axiomatically all its properties within the framework of probability. It is true that the introduction of entropy in terms of postulates establishes a link between the sum in (15-1) and our heuristic understanding of uncertainty. However, for our purposes, this is only incidental. In the last analysis, the justification of the concept must ultimately rely on the usefulness of the resulting theory.

The applications of entropy can be divided into two categories. The first deals with problems involving the determination of unknown distributions (Sec. 15-4). The available information is in the form of known expected values or other statistical functionals, and the solution is based on the principle of maximum entropy: We determine the unknown distributions so as to maximize the entropy $H(\mathfrak{A})$ of some partition $\mathfrak{A}$ subject to the given constraints (statistical mechanics). In the second category (coding theory), we are given $H(\mathfrak{A})$ (source entropy) and we wish to construct various random variables (code lengths) so as to minimize their expected values (Sec. 15-5). The solution involves the construction of optimum mappings (codes) of the random variables under consideration, into the given probability space.

Uncertainty and information. In the heuristic interpretation of entropy, the number $H(\mathfrak{A})$ is a measure of our uncertainty about the events $\mathcal{A}_i$ of the partition $\mathfrak{A}$ prior to the performance of the underlying experiment. If the experiment is performed and the results concerning $\mathcal{A}_i$ become known, the uncertainty is removed. We can thus say that the experiment provides information about the events $\mathcal{A}_i$ equal to the entropy of their partition. Thus uncertainty equals information and both are measured by the sum in (15-1).

†We shall use as logarithmic base either the number 2 or the number $e$. In the first case, the unit of entropy is the bit.
Example 15-1. (a) We shall determine the entropy of the partition $\mathfrak{A} = [\text{even}, \text{odd}]$ in the fair-die experiment. Clearly, $P\{\text{even}\} = P\{\text{odd}\} = 1/2$. Hence

$$ H(\mathfrak{A}) = -\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{2}\log\tfrac{1}{2} = \log 2 $$

(b) In the same experiment, $\mathfrak{V}$ is the partition consisting of the elementary events $\{f_i\}$. In this case, $P\{f_i\} = 1/6$; hence

$$ H(\mathfrak{V}) = -\tfrac{1}{6}\log\tfrac{1}{6} - \cdots - \tfrac{1}{6}\log\tfrac{1}{6} = \log 6 $$

If the die is rolled and we are told which face showed, then we gain information about the partition $\mathfrak{V}$ equal to its entropy $\log 6$. If we are told merely that "even" or "odd" showed, then we gain information about the partition $\mathfrak{A}$ equal to its entropy $\log 2$. In this case, the information gained about the partition $\mathfrak{V}$ equals again $\log 2$. As we shall see, the difference $\log 6 - \log 2 = \log 3$ is the uncertainty about $\mathfrak{V}$ assuming $\mathfrak{A}$ (conditional entropy).

Example 15-2. We consider now the coin experiment where $P\{h\} = p$. In this case, the entropy of $\mathfrak{V}$ equals

$$ H(\mathfrak{V}) = -p\log p - (1-p)\log(1-p) \equiv r(p) \tag{15-2} $$

The function $r(p)$ is shown in Fig. 15-2 for $0 \le p \le 1$. This function is concave, symmetrical about the point $p = 0.5$, and it reaches its maximum at that point. Furthermore, $r(0) = r(1) = 0$.

Historical note. The term entropy as a scientific concept was first used in thermodynamics (Clausius, 1850). Its probabilistic interpretation in the context of statistical mechanics is attributed to Boltzmann (1877). However, the explicit relationship between entropy and probability was recorded several years later (Planck, 1906). Shannon, in his celebrated paper (1948), used the concept to give an economical description of the properties of long sequences of symbols, and applied the results to a number of basic problems in coding theory and data transmission. His remarkable contributions form the basis of modern information theory. Jaynes† (1957) reexamined the method of maximum entropy and applied it to a variety of problems involving the determination of unknown parameters from incomplete data.
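The entropies in Examples 15-1 and 15-2 follow directly from the sum (15-1) and can be evaluated in a few lines. The sketch below is a minimal illustration: the function names `entropy` and `binary_entropy` are ours, and the convention that terms with $p = 0$ contribute nothing is the usual limit $p\log p \to 0$, stated here as an assumption.

```python
import math

def entropy(probabilities, base=math.e):
    """Entropy (15-1) of a partition; terms with p = 0 are taken as 0."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)

def binary_entropy(p, base=math.e):
    """The function r(p) of (15-2) for the coin experiment."""
    return entropy([p, 1.0 - p], base)

H_even_odd = entropy([0.5, 0.5])        # partition [even, odd] of the fair die
H_faces = entropy([1.0 / 6.0] * 6)      # partition into the six elementary events

print(H_even_odd, math.log(2))          # both equal log 2
print(H_faces, math.log(6))             # both equal log 6
print(H_faces - H_even_odd, math.log(3))
print(entropy([0.5, 0.5], base=2))      # entropy in bits
```

Passing `base=2` reports the result in bits, in line with the footnote on the choice of logarithmic base.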
Maximum entropy and classical definition. An important application of entropy is the determination of the probabilities $p_i$ of the events of a partition $\mathfrak{A}$, subject to various constraints, with the method of maximum entropy (MEM). The method states that the unknown $p_i$'s must be so chosen as to maximize the entropy of $\mathfrak{A}$ subject to the given constraints. This topic is considered in Sec. 15-4. In the following we introduce the main idea and we show the equivalence between the MEM and the classical definition of probability (principle of insufficient reason), using as illustration the die experiment.

†E. T. Jaynes: Physical Review, vols. 106-107, 1957.
Example 15-3. (a) We wish to determine the probabilities $p_i$ of the six faces of a die, having access to no prior information. The MEM states that the $p_i$'s must be such as to maximize the sum

$$H(\mathfrak{S}) = -p_1\log p_1 - \cdots - p_6\log p_6$$

Since $p_1 + \cdots + p_6 = 1$, this yields $p_1 = \cdots = p_6 = \tfrac{1}{6}$ in agreement with the classical definition.

(b) Suppose now that we are given the following information: A player places a bet of one dollar on "odd" and he wins, on the average, 20 cents per game. We wish again to determine the $p_i$'s using the MEM; however, now we must satisfy the constraints

$$p_1 + p_3 + p_5 = 0.6 \qquad p_2 + p_4 + p_6 = 0.4$$

This is a consequence of the available information because an average gain of 20 cents means that $P\{\text{odd}\} - P\{\text{even}\} = 0.2$. Maximizing $H(\mathfrak{S})$ subject to the above constraints, we obtain

$$p_1 = p_3 = p_5 = 0.2 \qquad p_2 = p_4 = p_6 = 0.133\ldots$$

This agrees again with the classical definition if we apply the principle of insufficient reason to the outcomes of the events {odd} and {even} separately.

Although conceptually the ME principle is equivalent to the principle of insufficient reason, operationally the MEM simplifies the analysis drastically when, as is the case in most applications, the constraints are phrased in terms of probabilities in the space $\mathcal{S}^n$ of repeated trials. In such cases the equivalence still holds, although it is less obvious, but the reasoning is involved and rather forced if we derive the unknown probabilities starting from the classical definition. The MEM is thus a valuable tool in the solution of applied problems. It is used, in fact, even in deterministic problems involving the estimation of unknown parameters from insufficient data. The ME principle is then accepted as
a smoothness criterion. We should emphasize, however, that as in the case of the classical definition, the conclusions drawn from the ME principle must be accepted with skepticism, particularly when they involve elaborate constraints. This is evident even in the interpretation of the results in Example 15-3: In the absence of prior constraints, we conclude that all $p_i$'s must be equal. This conclusion we accept readily because it is not in conflict with our experience concerning dice. The second conclusion, however, that $p_2 = p_4 = p_6 = 0.133$ and $p_1 = p_3 = p_5 = 0.2$ is not as convincing, we would think, even though we have no basis for any other conclusion. In our experience, no crooked dice exhibit such symmetries. One might argue that this apparent conflict between the MEM and our experience is due to the fact that we did not make total use of our prior knowledge. Had we included among the constraints everything we know about dice, there would be no conflict. This might be true; however, it is not always clear how such constraints can be phrased analytically and, even if they can, how complex the required computations might be.

Typical Sequences and Relative Frequency

Suppose that $\mathfrak{A} = [\mathcal{A}_1, \ldots, \mathcal{A}_N]$ is an $N$-element partition of an experiment $\mathcal{S}$. In the space $\mathcal{S}^n$ of repeated trials, the elements of $\mathfrak{A}$ form $N^n$ sequences of the form

$$\{\mathcal{A}_i \text{ occurs } n_i \text{ times in a specific order}\} \tag{15-3}$$

and the probability of each sequence equals

$$p_1^{n_1} \cdots p_i^{n_i} \cdots p_N^{n_N} \tag{15-4}$$

where $p_i = P(\mathcal{A}_i)$. The numbers $n_i$ are arbitrary subject only to the constraint $n_1 + \cdots + n_N = n$. However, according to the relative frequency interpretation of probability, if $n$ is "sufficiently large," then "almost certainly"

$$n_i \simeq np_i \tag{15-5}$$

This is, of course, only a heuristic statement; hence the resulting consequences must be interpreted accordingly. However, as we know, the approximation (15-5) can be given a precise interpretation in the form of the law of large numbers.
Following a similar approach, we prove at the end of the section the main consequence [Eq. (15-10)] of (15-5) in the context of entropy.

Guided by (15-5), we shall separate the $N^n$ sequences of the form (15-3) into two groups: (a) typical and (b) rare. We shall say that a sequence is typical if $n_i \simeq np_i$. All other sequences will be called rare. A typical sequence will be identified with the letter $t$:

$$t = \{\mathcal{A}_i \text{ occurs } n_i \simeq np_i \text{ times in a specific order}\} \tag{15-6}$$

From the definition it follows that to each set of numbers $n_1, \ldots, n_N$ "close" to the numbers $np_1, \ldots, np_N$ there corresponds one typical sequence. The union of all typical sequences will be denoted by $T$. Thus $T$ is the totality of all sequences of the form (15-3) where $n_i \simeq np_i$. As we noted, it is almost certain
that for large $n$, each observed sequence is typical. This leads to the conclusion that

$$P(T) \simeq 1 \tag{15-7}$$

The complement $\bar{T}$ of $T$ is the union of all rare sequences and its probability is negligible for large $n$:

$$P(\bar{T}) \simeq 0 \tag{15-8}$$

Since $n_i \simeq np_i$ for all typical sequences, (15-4) yields

$$p_1^{n_1} \cdots p_N^{n_N} \simeq e^{np_1\ln p_1 + \cdots + np_N\ln p_N}$$

Hence the probability of each typical sequence equals

$$P(t) \simeq e^{-nH(\mathfrak{A})} \tag{15-9}$$

where $H(\mathfrak{A})$ is the entropy of the partition $\mathfrak{A}$. Denoting by $n_T$ the number of typical sequences, we conclude from (15-7) and the above that

$$n_T \simeq e^{nH(\mathfrak{A})} \tag{15-10}$$

We have thus expressed the number of typical sequences in terms of the entropy of $\mathfrak{A}$. If all the events of $\mathfrak{A}$ are equally likely, then $H(\mathfrak{A}) = \ln N$ and $n_T \simeq N^n$. In all other cases, $H(\mathfrak{A}) < \ln N$ [see (15-38)]. Hence

$$n_T \ll N^n \quad \text{for large } n \tag{15-11}$$

This leads to the important conclusion that, if $n$ is sufficiently large, then most sequences are rare even though "almost certainly" none will occur.

Notes

1. We should point out that each typical sequence is not more likely than each rare sequence. In fact, the sequence with the largest probability is the rare sequence $\{\mathcal{A}_m \text{ occurs } n \text{ times}\}$, where $\mathcal{A}_m$ is the event with the largest probability. As we presently show, the distinction between typical and rare sequences is best expressed in terms of the events

$$\{\mathcal{A}_i \text{ occurs } n_i \text{ times in any order}\}$$

As we know [see (3-38)], the probability of these events equals

$$\frac{n!}{n_1! \cdots n_N!}\, p_1^{n_1} \cdots p_N^{n_N}$$

and for large $n$, it takes significant values only in a small vicinity of the point $(n_1 = np_1, \ldots, n_N = np_N)$. This follows by repeating the argument leading to (3-17) or from the DeMoivre-Laplace approximation (3-39).

2. On page 1 of Chap. 1 we noted that the theory of probability applied to averages of mass phenomena leads to useful results only if the ratio $k/n$ approaches a constant as $n$ increases and this constant is the same for any subsequence. This apparently mild requirement results in severe restrictions on the properties of the resulting sequences.
It leads to the conclusion that of all possible $N^n$ sequences formed with the $N$ elements of a partition $\mathfrak{A}$, only the typical sequences are likely to occur; all other sequences are nearly impossible.
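The heuristic estimate (15-10) can be illustrated for the coin experiment. The sketch below is not from the original text: it assumes one explicit reading of "$n_i \simeq np_i$", namely that the number of heads $k$ lies within $c$ standard deviations $\sqrt{npq}$ of $np$, and the name `typical_count` and the value $c = 1.96$ are illustrative choices. It counts those sequences exactly and compares $(1/n)\ln n_T$ with $H(\mathfrak{A})$ in natural units.

```python
import math

def typical_count(n, p, c=1.96):
    """Return (n_T, P(T)) for coin sequences whose number of heads k
    satisfies |k - n*p| < c*sqrt(n*p*q) (an assumed notion of 'typical')."""
    q = 1.0 - p
    half_width = c * math.sqrt(n * p * q)
    n_T, P_T = 0, 0.0
    for k in range(n + 1):
        if abs(k - n * p) < half_width:
            binom = math.comb(n, k)          # sequences with exactly k heads
            n_T += binom
            # logarithms keep the product within floating-point range
            P_T += math.exp(math.log(binom) + k * math.log(p)
                            + (n - k) * math.log(q))
    return n_T, P_T

p = 0.2
H = -(p * math.log(p) + (1 - p) * math.log(1 - p))   # entropy in nats
results = {n: typical_count(n, p) for n in (100, 1000, 5000)}
for n, (n_T, P_T) in results.items():
    # columns: n, P(T), (1/n) ln n_T, H, and ln n_T / ln 2^n
    print(n, round(P_T, 3), round(math.log(n_T) / n, 4), round(H, 4),
          round(math.log(n_T) / (n * math.log(2)), 4))
```

The last column stays well below 1, which is the statement that the typical sequences are a vanishing fraction of all $2^n$ sequences even though their total probability is close to 1.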
Typical Sequences and the Law of Large Numbers

We show next that the preceding results can be reestablished rigorously as consequences of the law of large numbers. For simplicity, we consider only two-element partitions and, to be concrete, we assume that $\mathcal{A}$ and $\bar{\mathcal{A}}$ are the events "heads" and "tails" respectively in the coin experiment. In the space $\mathcal{S}^n$, the probability of the elementary event $\{\zeta_k\} = \{k \text{ heads in a specific order}\}$ equals

$$P\{\zeta_k\} = p^k q^{n-k} \tag{15-12}$$

and the probability of the event† $\mathcal{A}_k = \{k \text{ heads in any order}\}$ equals

$$P(\mathcal{A}_k) = \binom{n}{k} p^k q^{n-k} \tag{15-13}$$

In Fig. 15-3 we plot the probability $P(\mathcal{A}_k)$, the geometric progression $p^k q^{n-k}$, and the binomial coefficients $\binom{n}{k}$ as functions of $k$.

†The event $\mathcal{A}_k$ is not, of course, an element of the partition $\mathfrak{A}^n$.
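α-TYPICAL SEQUENCES. Given a number $\alpha$ between 0 and 1, we form the number $\varepsilon$ such that

$$\alpha = 2G\!\left(\varepsilon\sqrt{n/pq}\right) - 1 \tag{15-14}$$

where $G(x)$ is the normal distribution. We shall say that the sequence $\zeta_k$ is α-typical if $k$ is such that $k_1 < k < k_2$ where

$$k_1 = n(p - \varepsilon) \qquad k_2 = n(p + \varepsilon) \tag{15-15}$$

The union of all α-typical sequences is a set $T$ consisting of

$$n_T = \sum_{k=k_1}^{k_2} \binom{n}{k} \tag{15-16}$$

elements and its probability equals $\alpha$ [see (3-37)]

$$P(T) = \sum_{k=k_1}^{k_2} \binom{n}{k} p^k q^{n-k} \simeq 2G\!\left(\varepsilon\sqrt{n/pq}\right) - 1 = \alpha \tag{15-17}$$

FUNDAMENTAL THEOREM. For any $\alpha < 1$, the number $n_T$ of α-typical sequences tends to $e^{nH(\mathfrak{A})}$ in the following sense

$$\frac{\ln n_T}{n} \to H(\mathfrak{A}) \quad \text{as } n \to \infty \tag{15-18}$$

Proof. If $p = q = 0.5$, then the DeMoivre-Laplace approximation yields

$$\binom{n}{k} \simeq \frac{2^n}{\sqrt{\pi n/2}}\, e^{-2(k - n/2)^2/n}$$

for $k$ in the $\sqrt{n}$ vicinity of $n/2$. This approximation cannot be used to evaluate the sum in (15-16) for $p \ne 0.5$ because then the center $np$ of the interval $(k_1, k_2)$ is not $n/2$. We shall bound $n_T$ using (15-13) and (15-16). Clearly,

$$n_T = \sum_{k=k_1}^{k_2} p^{-k} q^{k-n} P(\mathcal{A}_k) \tag{15-19}$$

where we assume that $p < q$. As $k$ increases, the term $p^{-k}q^{k-n}$ increases monotonically. Hence

$$q^{-n}\left(\frac{q}{p}\right)^{k_1} \sum_{k=k_1}^{k_2} P(\mathcal{A}_k) < n_T < q^{-n}\left(\frac{q}{p}\right)^{k_2} \sum_{k=k_1}^{k_2} P(\mathcal{A}_k) \tag{15-20}$$

And since [see (15-17)]

$$\sum_{k=k_1}^{k_2} P(\mathcal{A}_k) = P(T) = \alpha$$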
(15-20) yields

$$\alpha\, q^{-n}\left(\frac{q}{p}\right)^{k_1} < n_T < \alpha\, q^{-n}\left(\frac{q}{p}\right)^{k_2} \tag{15-21}$$

Setting $k_1 = np - n\varepsilon$ and $k_2 = np + n\varepsilon$ in the above and using the identity

$$p^{-np} q^{-nq} = e^{-n(p\ln p + q\ln q)} = e^{nH(\mathfrak{A})}$$

we conclude from (15-21) that

$$\alpha\, e^{nH(\mathfrak{A})}\left(\frac{q}{p}\right)^{-n\varepsilon} < n_T < \alpha\, e^{nH(\mathfrak{A})}\left(\frac{q}{p}\right)^{n\varepsilon}$$

Hence

$$nH(\mathfrak{A}) + \ln\alpha - n\varepsilon\ln\frac{q}{p} < \ln n_T < nH(\mathfrak{A}) + \ln\alpha + n\varepsilon\ln\frac{q}{p}$$

Dividing by $n$, we obtain (15-18) because $\alpha$ is constant and, as we see from (15-14), $\varepsilon \to 0$ as $n \to \infty$.

Important conclusion. Theorem (15-18) holds for any $\alpha < 1$; it will be assumed, however, that $\alpha \simeq 1$ and the corresponding sequences will be called typical. With this assumption

$$P(T) = \alpha \simeq 1 \qquad P(\bar{T}) = 1 - \alpha \simeq 0 \tag{15-22}$$

The probability of an arbitrary event $\mathcal{M}$ equals, therefore, its conditional probability assuming $T$:

$$P(\mathcal{M}) = P(\mathcal{M}|T)P(T) + P(\mathcal{M}|\bar{T})P(\bar{T}) \simeq P(\mathcal{M}|T) \tag{15-23}$$

In other words, in any conclusions concerning probabilities in the space $\mathcal{S}^n$, it suffices to consider the subspace of $\mathcal{S}^n$ consisting of typical sequences only. This is, of course, only approximately true for finite $n$. It is, however, exact in the limit as $n \to \infty$.

CONCLUDING REMARKS. In Chap. 1, we presented the following interpretations of the probability $P(\mathcal{A})$ of an event $\mathcal{A}$.

Axiomatic. $P(\mathcal{A})$ is a number assigned to the event $\mathcal{A}$. This number satisfies three axioms but is otherwise arbitrary.

Empirical. For large $n$, $P(\mathcal{A}) \simeq k/n$ where $k$ is the number of times $\mathcal{A}$ occurs in $n$ repetitions of the underlying experiment $\mathcal{S}$.

Subjective. $P(\mathcal{A})$ is a measure of our uncertainty about the occurrence of $\mathcal{A}$ in a single performance of $\mathcal{S}$.
Principle of insufficient reason. If $\mathcal{A}_i$ are $N$ events of a partition $\mathfrak{A}$ of $\mathcal{S}$ and nothing is known about their probabilities, then $P(\mathcal{A}_i) = 1/N$.

We give next four related interpretations of the entropy $H(\mathfrak{A})$ of $\mathfrak{A}$.

Axiomatic. $H(\mathfrak{A})$ is a number assigned to each partition $\mathfrak{A}$ of $\mathcal{S}$. This number equals the sum $-\sum_i p_i \ln p_i$ where $p_i = P(\mathcal{A}_i)$.

Empirical. This interpretation involves the repeated performance not of the experiment $\mathcal{S}$ but of the experiment $\mathcal{S}^n$ of repeated trials. In this experiment, a specific typical sequence $t_j$ is an event with probability $e^{-nH(\mathfrak{A})}$. Applying the relative frequency interpretation of probability to this event, we conclude that if the experiment $\mathcal{S}^n$ is repeated $m$ times and the event $t_j$ occurs $m_j$ times, then for sufficiently large $m$,

$$P(t_j) = e^{-nH(\mathfrak{A})} \simeq \frac{m_j}{m} \qquad \text{hence} \qquad H(\mathfrak{A}) \simeq \frac{1}{n}\ln\frac{m}{m_j}$$

This relates the theoretical quantity $H(\mathfrak{A})$ to the experimental numbers $m_j$ and $m$.

Subjective. $H(\mathfrak{A})$ is a measure of our uncertainty about the occurrence of the events $\mathcal{A}_i$ of the partition $\mathfrak{A}$ in a single performance of $\mathcal{S}$.

Principle of maximum entropy. The probabilities $p_i = P(\mathcal{A}_i)$ must be such as to maximize $H(\mathfrak{A})$ subject to the given constraints. Since $n_T \simeq e^{nH(\mathfrak{A})}$, the ME principle is equivalent to the principle of maximizing the number of typical sequences. If there are no constraints, that is, if nothing is known about the probabilities $p_i$, then the ME principle leads to the estimates $p_i = 1/N$, $H(\mathfrak{A}) = \ln N$, and $n_T \simeq N^n$.

15-2 BASIC CONCEPTS

In this section, we develop deductively the properties of entropy starting with various notations and set operations. At the end of the section, we reexamine the results in terms of the heuristic notion of entropy as a measure of uncertainty, and we conclude with a typical sequence interpretation of the main theorems.

DEFINITIONS. The notation $\mathfrak{A} = [\mathcal{A}_1, \ldots, \mathcal{A}_N]$ or simply $\mathfrak{A} = [\mathcal{A}_i]$ will mean that $\mathfrak{A}$ is a partition consisting of the events $\mathcal{A}_i$. These events will be called elements† of $\mathfrak{A}$.
†It will be clear from the context whether the word element means an event of a partition $\mathfrak{A}$ or an element of the space $\mathcal{S}$.
I. A partition with only two elements will be called binary. Thus $\mathfrak{A} = [\mathcal{A}, \bar{\mathcal{A}}]$ is a binary partition consisting of the event $\mathcal{A}$ and its complement $\bar{\mathcal{A}}$.

II. A partition whose elements are the elementary events $\{\zeta_i\}$ of the space $\mathcal{S}$ will be denoted by $\mathfrak{S}$ and will be called the element partition.

III. A refinement of a partition $\mathfrak{A}$ is a partition $\mathfrak{B}$ such that each element $\mathcal{B}_j$ of $\mathfrak{B}$ is a subset of some element $\mathcal{A}_i$ of $\mathfrak{A}$ (Fig. 15-4). We shall use the notation $\mathfrak{B} \prec \mathfrak{A}$ to indicate that $\mathfrak{B}$ is a refinement of $\mathfrak{A}$ and we shall say that $\mathfrak{A}$ is larger† than $\mathfrak{B}$. Thus

$$\mathfrak{B} \prec \mathfrak{A} \quad \text{iff} \quad \mathcal{B}_j \subset \mathcal{A}_i \tag{15-24}$$

A common refinement of two partitions is a refinement of both. The partition $\mathfrak{C}$ in Fig. 15-5 is a common refinement of the partitions $\mathfrak{A}$ and $\mathfrak{B}$.

IV. The product‡ of two partitions $\mathfrak{A} = [\mathcal{A}_i]$ and $\mathfrak{B} = [\mathcal{B}_j]$ is a partition whose elements are all intersections $\mathcal{A}_i\mathcal{B}_j$ of the elements of $\mathfrak{A}$ and $\mathfrak{B}$. This partition will be denoted by $\mathfrak{A} \cdot \mathfrak{B}$. Clearly, $\mathfrak{A} \cdot \mathfrak{B}$ is the largest common refinement of $\mathfrak{A}$ and $\mathfrak{B}$.

FIGURE 15-4

FIGURE 15-5

†The symbol $\prec$ is not an ordering of two arbitrary partitions. It has a meaning only if $\mathfrak{B}$ is a refinement of $\mathfrak{A}$.

‡We should emphasize that partition product is not a set operation.
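Definitions III and IV can be made concrete by representing a partition as a list of disjoint sets. This is a minimal sketch, not from the original text; the names `is_refinement` and `partition_product` are illustrative, and empty intersections are discarded on the assumption that a partition contains no empty element.

```python
from itertools import product as pairs

def is_refinement(B, A):
    """True if every element of partition B is a subset of some element of A."""
    return all(any(b <= a for a in A) for b in B)

def partition_product(A, B):
    """All nonempty intersections of an element of A with an element of B."""
    return [a & b for a, b in pairs(A, B) if a & b]

faces = range(1, 7)                                   # die experiment
A = [frozenset({2, 4, 6}), frozenset({1, 3, 5})]      # [even, odd]
B = [frozenset({1, 2, 3}), frozenset({4, 5, 6})]      # [i <= 3, i > 3]
S = [frozenset({i}) for i in faces]                   # element partition
AB = partition_product(A, B)

print(AB)
print(is_refinement(AB, A), is_refinement(AB, B))     # common refinement
print(is_refinement(S, AB))                           # S refines every partition
print(is_refinement(A, B))                            # neither refines the other
```

The same two partitions of the die experiment are used in Example 15-5.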
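Properties. From the definition it follows that

$$\mathfrak{S} \prec \mathfrak{A} \ \text{ for any } \mathfrak{A} \qquad \mathfrak{A}\cdot\mathfrak{B} = \mathfrak{B}\cdot\mathfrak{A} \qquad \mathfrak{A}\cdot(\mathfrak{B}\cdot\mathfrak{C}) = (\mathfrak{A}\cdot\mathfrak{B})\cdot\mathfrak{C}$$

If $\mathfrak{A}_1 \prec \mathfrak{A}_2 \prec \mathfrak{A}_3$ then $\mathfrak{A}_1 \prec \mathfrak{A}_3$. If $\mathfrak{B} \prec \mathfrak{A}$ then $\mathfrak{A}\cdot\mathfrak{B} = \mathfrak{B}$.

ENTROPY. The entropy of a partition $\mathfrak{A}$ is by definition the sum

$$H(\mathfrak{A}) = -(p_1\log p_1 + \cdots + p_N\log p_N) = \sum_{i=1}^{N}\varphi(p_i) \tag{15-25}$$

where $p_i = P(\mathcal{A}_i)$ and $\varphi(p) = -p\log p$. Since $\varphi(p) \ge 0$ for $0 \le p \le 1$, it follows from (15-25) that

$$H(\mathfrak{A}) \ge 0 \tag{15-26}$$

where $H(\mathfrak{A}) = 0$ iff one of the $p_i$'s equals 1; all others are then equal to 0.

Binary partitions. If $\mathfrak{A} = [\mathcal{A}, \bar{\mathcal{A}}]$ and $P(\mathcal{A}) = p$, then (Fig. 15-2)

$$H(\mathfrak{A}) = -p\log p - (1-p)\log(1-p) = r(p) \tag{15-27}$$

Equally likely events. If $p_1 = p_2 = \cdots = p_N = 1/N$ then

$$H(\mathfrak{A}) = -\frac{1}{N}\log\frac{1}{N} - \cdots - \frac{1}{N}\log\frac{1}{N} = \log N \tag{15-28}$$

If, in particular, $N = 2^m$, then $H(\mathfrak{A}) = m$ (logarithmic base 2).

INEQUALITIES. The function $\varphi(p) = -p\log p$ is convex. Therefore (see Fig. 15-6 and Prob. 15-2)

$$\varphi(p_1 + p_2) < \varphi(p_1) + \varphi(p_2) < \varphi(p_1 + \varepsilon) + \varphi(p_2 - \varepsilon) \tag{15-29}$$

where

$$p_1 < p_1 + \varepsilon \le p_2 - \varepsilon < p_2 \tag{15-30}$$

This leads to the following properties of entropy:

1. Given a partition $\mathfrak{A} = [\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_N]$, we form the partition $\mathfrak{B} = [\mathcal{B}_a, \mathcal{B}_b, \mathcal{A}_2, \ldots, \mathcal{A}_N]$ obtained by splitting $\mathcal{A}_1$ into the elements $\mathcal{B}_a$ and $\mathcal{B}_b$ as in Fig. 15-7. We maintain that

$$H(\mathfrak{A}) \le H(\mathfrak{B}) \tag{15-31}$$

Proof. Clearly,

$$H(\mathfrak{A}) - \varphi(p_a + p_b) = H(\mathfrak{B}) - \varphi(p_a) - \varphi(p_b)$$

because each side equals the contribution to $H(\mathfrak{A})$ and $H(\mathfrak{B})$ respectively due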
to the common elements of $\mathfrak{A}$ and $\mathfrak{B}$. Hence (15-31) follows from the first inequality in (15-29).

FIGURE 15-6

Example 15-4. In the next table we list the probabilities of the events of a partition $\mathfrak{A}$ and of its refinement $\mathfrak{B}$ obtained as above.

$\mathfrak{A}$: $p = 0.4$, $0.35$, $0.25$
$\mathfrak{B}$: $p_a = 0.22$, $p_b = 0.18$, $0.35$, $0.25$

In this case,

$$H(\mathfrak{A}) = -(0.4\log 0.4 + 0.35\log 0.35 + 0.25\log 0.25) = 1.559$$

$$H(\mathfrak{B}) = -(0.22\log 0.22 + 0.18\log 0.18 + 0.35\log 0.35 + 0.25\log 0.25) = 1.956$$

FIGURE 15-7
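Thus $H(\mathfrak{A}) = 1.559 < 1.956 = H(\mathfrak{B})$ in agreement with (15-31).

2. If $\mathfrak{B} \prec \mathfrak{A}$ then

$$H(\mathfrak{B}) \ge H(\mathfrak{A}) \tag{15-32}$$

Proof. Repeating the construction of Fig. 15-7, we form a chain of refinements

$$\mathfrak{B} = \mathfrak{A}_n \prec \cdots \prec \mathfrak{A}_m \prec \cdots \prec \mathfrak{A}_1 = \mathfrak{A}$$

where $\mathfrak{A}_m$ is obtained by splitting one of the elements of $\mathfrak{A}_{m-1}$ as in Fig. 15-8. From this and (15-31) it follows that

$$H(\mathfrak{A}) = H(\mathfrak{A}_1) \le \cdots \le H(\mathfrak{A}_n) = H(\mathfrak{B})$$

and (15-32) results.

FIGURE 15-8

3. For any $\mathfrak{A}$:

$$H(\mathfrak{A}) \le H(\mathfrak{S}) \tag{15-33}$$

where $\mathfrak{S}$ is the element partition.

Proof. It follows from (15-32) because $\mathfrak{S}$ is a refinement of $\mathfrak{A}$.

4. For any $\mathfrak{A}$ and $\mathfrak{B}$:

$$H(\mathfrak{A}\cdot\mathfrak{B}) \ge H(\mathfrak{A}) \qquad H(\mathfrak{A}\cdot\mathfrak{B}) \ge H(\mathfrak{B}) \tag{15-34}$$

Proof. It follows from (15-32) because $\mathfrak{A}\cdot\mathfrak{B}$ is a refinement of $\mathfrak{A}$ and of $\mathfrak{B}$.

Example 15-5. In the die experiment, the probabilities of the six events $\{f_i\}$ equal

0.1  0.1  0.15  0.2  0.2  0.25

respectively. The probabilities of the events of the partitions

$$\mathfrak{A} = [\text{even}, \text{odd}] \qquad \mathfrak{B} = [i \le 3,\ i > 3]$$

are given by

$$P\{\text{even}\} = 0.55 \qquad P\{\text{odd}\} = 0.45 \qquad P\{i \le 3\} = 0.35 \qquad P\{i > 3\} = 0.65$$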
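The product $\mathfrak{A}\cdot\mathfrak{B}$ is a partition consisting of the four elements

$$\{f_2\} \qquad \{f_1, f_3\} \qquad \{f_4, f_6\} \qquad \{f_5\}$$

with respective probabilities

0.1  0.25  0.45  0.2

From the above it follows that

$$H(\mathfrak{A}) = 0.993 \qquad H(\mathfrak{B}) = 0.934 \qquad H(\mathfrak{A}\cdot\mathfrak{B}) = 1.815$$

in agreement with (15-34).

5. Suppose that $\mathfrak{A}$ and $\mathfrak{B}$ are two partitions that have the same elements except the first two (Fig. 15-9)

$$\mathfrak{A} = [\mathcal{A}_1, \mathcal{A}_2, \mathcal{A}_3, \ldots, \mathcal{A}_N] \qquad \mathfrak{B} = [\mathcal{B}_1, \mathcal{B}_2, \mathcal{A}_3, \ldots, \mathcal{A}_N]$$

We maintain that if

$$P(\mathcal{A}_1) = p_1 < P(\mathcal{B}_1) = p_1 + \varepsilon \le p_2 - \varepsilon = P(\mathcal{B}_2) < p_2 = P(\mathcal{A}_2)$$

as in (15-30), then

$$H(\mathfrak{A}) < H(\mathfrak{B}) \tag{15-35}$$

FIGURE 15-9

Proof. Clearly,

$$H(\mathfrak{A}) - \varphi(p_1) - \varphi(p_2) = H(\mathfrak{B}) - \varphi(p_1 + \varepsilon) - \varphi(p_2 - \varepsilon)$$

because each side equals the contribution to $H(\mathfrak{A})$ and $H(\mathfrak{B})$ respectively due to the common elements of $\mathfrak{A}$ and $\mathfrak{B}$. Hence (15-35) follows from the second inequality in (15-29).

Example 15-6. In the next table we list the probabilities of the events of the partitions $\mathfrak{A}$ and $\mathfrak{B}$.

$\mathfrak{A}$: $p_1 = 0.1$, $p_2 = 0.3$, $0.35$, $0.25$
$\mathfrak{B}$: $p_1 + \varepsilon = 0.18$, $p_2 - \varepsilon = 0.22$, $0.35$, $0.25$ (with $\varepsilon = 0.08$)

In this case,

$$H(\mathfrak{A}) = 1.883 \qquad H(\mathfrak{B}) = 1.956$$

in agreement with (15-35).

6. If we equalize the probabilities of two elements of a partition, leaving all others unchanged, its entropy increases.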
Proof. It follows from the above with $\varepsilon = (p_2 - p_1)/2$.

7. The entropy of a partition is maximum if all its elements are equally likely as in (15-28).

Proof. Suppose that $\mathfrak{A}$ is a partition such that $H(\mathfrak{A}) = H_m$ is maximum and two of its elements have unequal probabilities. If they are made equal, then (property 6) $H(\mathfrak{A})$ increases. But this is impossible because $H_m$ is maximum by assumption.

A useful inequality. If $a_i$ and $b_i$ are $N$ positive numbers such that

$$a_1 + \cdots + a_N = 1 \qquad b_1 + \cdots + b_N \le 1 \tag{15-36}$$

then

$$-\sum_i a_i\log a_i \le -\sum_i a_i\log b_i \tag{15-37}$$

with equality iff $a_i = b_i$.

FIGURE 15-10

Proof. From the inequality $e^y \ge 1 + y$ it follows that $\ln x \le x - 1$ (Fig. 15-10). With $x = b_i/a_i$, this yields

$$\ln b_i - \ln a_i = \ln\frac{b_i}{a_i} \le \frac{b_i}{a_i} - 1$$

Multiplying by $a_i$ and adding, we obtain

$$\sum_i a_i(\ln b_i - \ln a_i) \le \sum_i a_i\left(\frac{b_i}{a_i} - 1\right) = \sum_i (b_i - a_i) \le 0$$

and (15-37) results.
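Inequality (15-37) is easy to probe numerically. The sketch below is not from the original text; it draws random positive numbers $a_i$ and $b_i$ that each sum to 1 (one admissible case of (15-36)) and checks that the gap between the two sides is never negative. The helper names and the seed are illustrative.

```python
import math
import random

def cross_term(a, b):
    """The sum -sum a_i ln b_i, the right side of (15-37)."""
    return -sum(ai * math.log(bi) for ai, bi in zip(a, b))

def random_distribution(n, rng):
    """n positive numbers that sum to 1."""
    w = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]

rng = random.Random(0)        # fixed seed so that the run is repeatable
gaps = []
for _ in range(1000):
    a = random_distribution(5, rng)
    b = random_distribution(5, rng)
    # right side minus left side of (15-37); cross_term(a, a) is the entropy of a
    gaps.append(cross_term(a, b) - cross_term(a, a))

print(min(gaps))                          # never negative
a = random_distribution(5, rng)
print(cross_term(a, a), math.log(5))      # entropy versus ln N for N = 5
```

Choosing $b_i = 1/N$ in this check gives the bound on the entropy that the text derives next.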
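Maximum entropy. Using (15-37), we shall rederive property 7. It suffices to show that

$$-\sum_i p_i\log p_i \le \log N \tag{15-38}$$

Proof. The numbers $a_i = p_i$ and $b_i = 1/N$ satisfy (15-36). Inserting into (15-37), we conclude that

$$-\sum_i p_i\log p_i \le -\sum_i p_i\log\frac{1}{N} = \log N\sum_i p_i = \log N$$

Conditional Entropy and Mutual Information

The conditional entropy of a partition $\mathfrak{A}$ assuming $\mathcal{M}$ is by definition the sum

$$H(\mathfrak{A}|\mathcal{M}) = -\sum_{i=1}^{N_A} P(\mathcal{A}_i|\mathcal{M})\log P(\mathcal{A}_i|\mathcal{M}) \tag{15-39}$$

where $P(\mathcal{M}) \ne 0$, $N_A$ is the number of elements $\mathcal{A}_i$ of $\mathfrak{A}$, and $P(\mathcal{A}_i|\mathcal{M}) = P(\mathcal{A}_i\mathcal{M})/P(\mathcal{M})$. As we explain later, $H(\mathfrak{A}|\mathcal{M})$ is the uncertainty about $\mathfrak{A}$ in the subsequence of trials in which $\mathcal{M}$ occurs.

Suppose now that $\mathfrak{B}$ is a partition consisting of the elements $\mathcal{B}_j$. Clearly,

$$H(\mathfrak{A}|\mathcal{B}_j) = -\sum_{i=1}^{N_A} P(\mathcal{A}_i|\mathcal{B}_j)\log P(\mathcal{A}_i|\mathcal{B}_j) \tag{15-40}$$

is the conditional entropy of $\mathfrak{A}$ assuming $\mathcal{B}_j$ defined as in (15-39). The conditional entropy of $\mathfrak{A}$ assuming $\mathfrak{B}$ is the weighted average of $H(\mathfrak{A}|\mathcal{B}_j)$:

$$H(\mathfrak{A}|\mathfrak{B}) = \sum_j P(\mathcal{B}_j)H(\mathfrak{A}|\mathcal{B}_j) \tag{15-41}$$

This equals the uncertainty about $\mathfrak{A}$ if at each trial we know which of the events $\mathcal{B}_j$ of $\mathfrak{B}$ has occurred.

Example 15-7. We shall determine the conditional entropy $H(\mathfrak{S}|\mathfrak{B})$ of the element partition $\mathfrak{S}$ in the fair-die experiment where $\mathfrak{B} = [\text{even}, \text{odd}]$. Clearly, $P\{f_i|\text{even}\} = \tfrac{1}{3}$ if $i$ is even and $P\{f_i|\text{even}\} = 0$ if $i$ is odd. Similarly, $P\{f_i|\text{odd}\} = \tfrac{1}{3}$ if $i$ is odd and $P\{f_i|\text{odd}\} = 0$ if $i$ is even. Hence

$$H(\mathfrak{S}|\text{even}) = -\left(\tfrac{1}{3}\log\tfrac{1}{3} + \tfrac{1}{3}\log\tfrac{1}{3} + \tfrac{1}{3}\log\tfrac{1}{3}\right) = \log 3 = H(\mathfrak{S}|\text{odd})$$

And since $P(\text{even}) = P(\text{odd}) = 0.5$, we conclude from (15-41) that

$$H(\mathfrak{S}|\mathfrak{B}) = 0.5\log 3 + 0.5\log 3 = \log 3$$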
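Thus, in the absence of any information, our uncertainty about $\mathfrak{S}$ equals $H(\mathfrak{S}) = \log 6$. If we know, however, whether at each trial "even" or "odd" showed, then our uncertainty is reduced to $H(\mathfrak{S}|\mathfrak{B}) = \log 3$.

THEOREM 1. If $\mathfrak{B} \prec \mathfrak{A}$ then

$$H(\mathfrak{A}|\mathfrak{B}) = 0 \tag{15-42}$$

Proof. Since $\mathfrak{B}$ is a refinement of $\mathfrak{A}$, each element $\mathcal{B}_j$ of $\mathfrak{B}$ is a subset of some element $\mathcal{A}_k$ of $\mathfrak{A}$ and, therefore, it is disjoint with all other elements of $\mathfrak{A}$. Hence $\mathcal{A}_i\mathcal{B}_j = \mathcal{B}_j$ if $i = k$ and $\mathcal{A}_i\mathcal{B}_j = \emptyset$ otherwise. This leads to the conclusion that

$$P(\mathcal{A}_i|\mathcal{B}_j) = \frac{P(\mathcal{A}_i\mathcal{B}_j)}{P(\mathcal{B}_j)} = \begin{cases} 1 & i = k \\ 0 & i \ne k \end{cases}$$

And since $p\log p = 0$ for $p = 0$ and $p = 1$, we conclude that all terms in (15-40) equal 0; hence $H(\mathfrak{A}|\mathcal{B}_j) = 0$ for every $j$. From this and (15-41) it follows that $H(\mathfrak{A}|\mathfrak{B}) = 0$.

Independent partitions. Two partitions $\mathfrak{A} = [\mathcal{A}_i]$ and $\mathfrak{B} = [\mathcal{B}_j]$ are called independent if the events $\mathcal{A}_i$ and $\mathcal{B}_j$ are independent for every $i$ and $j$:

$$P(\mathcal{A}_i\mathcal{B}_j) = P(\mathcal{A}_i)P(\mathcal{B}_j) \tag{15-43}$$

THEOREM 2. If the partitions $\mathfrak{A}$ and $\mathfrak{B}$ are independent, then

$$H(\mathfrak{A}|\mathfrak{B}) = H(\mathfrak{A}) \qquad H(\mathfrak{B}|\mathfrak{A}) = H(\mathfrak{B}) \tag{15-44}$$

Proof. Clearly, $P(\mathcal{A}_i|\mathcal{B}_j) = P(\mathcal{A}_i)$; hence [see (15-40)]

$$H(\mathfrak{A}|\mathcal{B}_j) = -\sum_i P(\mathcal{A}_i)\log P(\mathcal{A}_i) = H(\mathfrak{A})$$

Inserting into (15-41), we obtain

$$H(\mathfrak{A}|\mathfrak{B}) = H(\mathfrak{A})\sum_j P(\mathcal{B}_j) = H(\mathfrak{A})$$

and the first equation in (15-44) results. We can show similarly that $H(\mathfrak{B}|\mathfrak{A}) = H(\mathfrak{B})$.

THEOREM 3. For any $\mathfrak{A}$ and $\mathfrak{B}$:

$$H(\mathfrak{A}\cdot\mathfrak{B}) \le H(\mathfrak{A}) + H(\mathfrak{B}) \tag{15-45}$$

Proof. As we know [see (2-36)]

$$P(\mathcal{A}_i) = \sum_j P(\mathcal{A}_i\mathcal{B}_j)$$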
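Hence

$$H(\mathfrak{A}) = -\sum_i P(\mathcal{A}_i)\log P(\mathcal{A}_i) = -\sum_{i,j} P(\mathcal{A}_i\mathcal{B}_j)\log P(\mathcal{A}_i)$$

Writing a similar equation for $H(\mathfrak{B})$ and adding, we obtain

$$H(\mathfrak{A}) + H(\mathfrak{B}) = -\sum_{i,j} P(\mathcal{A}_i\mathcal{B}_j)\log[P(\mathcal{A}_i)P(\mathcal{B}_j)] \tag{15-46}$$

Clearly, $\mathfrak{A}\cdot\mathfrak{B}$ is a partition with elements $\mathcal{A}_i\mathcal{B}_j$. Hence

$$H(\mathfrak{A}\cdot\mathfrak{B}) = -\sum_{i,j} P(\mathcal{A}_i\mathcal{B}_j)\log P(\mathcal{A}_i\mathcal{B}_j) \tag{15-47}$$

To prove (15-45), we shall apply (15-37) identifying the numbers $a_i$ and $b_i$ with the numbers $P(\mathcal{A}_i\mathcal{B}_j)$ and $P(\mathcal{A}_i)P(\mathcal{B}_j)$ respectively. We can do so because

$$\sum_{i,j} P(\mathcal{A}_i\mathcal{B}_j) = 1 \qquad \sum_{i,j} P(\mathcal{A}_i)P(\mathcal{B}_j) = 1$$

From (15-37) it follows that the sum in (15-47) cannot exceed the sum in (15-46); hence (15-45) must be true.

COROLLARY.

$$H(\mathfrak{A}\cdot\mathfrak{B}) = H(\mathfrak{A}) + H(\mathfrak{B}) \tag{15-48}$$

iff the partitions $\mathfrak{A}$ and $\mathfrak{B}$ are independent.

Proof. This follows from (15-45) because (15-37) is an equality iff $a_i = b_i$ for every $i$. Hence (15-45) is an equality iff $P(\mathcal{A}_i\mathcal{B}_j) = P(\mathcal{A}_i)P(\mathcal{B}_j)$ for every $i$ and $j$.

THEOREM 4. For any $\mathfrak{A}$ and $\mathfrak{B}$:

$$H(\mathfrak{A}\cdot\mathfrak{B}) = H(\mathfrak{B}) + H(\mathfrak{A}|\mathfrak{B}) = H(\mathfrak{A}) + H(\mathfrak{B}|\mathfrak{A}) \tag{15-49}$$

Proof. Since $P(\mathcal{A}_i\mathcal{B}_j) = P(\mathcal{B}_j)P(\mathcal{A}_i|\mathcal{B}_j)$, we conclude from (15-40) that

$$P(\mathcal{B}_j)H(\mathfrak{A}|\mathcal{B}_j) = -\sum_i P(\mathcal{B}_j)P(\mathcal{A}_i|\mathcal{B}_j)\log P(\mathcal{A}_i|\mathcal{B}_j)$$

$$= -\sum_i P(\mathcal{A}_i\mathcal{B}_j)[\log P(\mathcal{A}_i\mathcal{B}_j) - \log P(\mathcal{B}_j)]$$

$$= -\sum_i P(\mathcal{A}_i\mathcal{B}_j)\log P(\mathcal{A}_i\mathcal{B}_j) + P(\mathcal{B}_j)\log P(\mathcal{B}_j)$$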
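Summing over all $j$, we obtain

$$\sum_j P(\mathcal{B}_j)H(\mathfrak{A}|\mathcal{B}_j) = -\sum_{i,j} P(\mathcal{A}_i\mathcal{B}_j)\log P(\mathcal{A}_i\mathcal{B}_j) + \sum_j P(\mathcal{B}_j)\log P(\mathcal{B}_j)$$

and the first equation in (15-49) follows because the above three sums equal $H(\mathfrak{A}|\mathfrak{B})$, $H(\mathfrak{A}\cdot\mathfrak{B})$, and $-H(\mathfrak{B})$ respectively. The second equation follows because $\mathfrak{A}\cdot\mathfrak{B} = \mathfrak{B}\cdot\mathfrak{A}$.

COROLLARIES. The following relationships follow readily from the last two theorems: For any $\mathfrak{A}$ and $\mathfrak{B}$:

$$H(\mathfrak{B}) \le H(\mathfrak{A}\cdot\mathfrak{B}) \le H(\mathfrak{A}) + H(\mathfrak{B}) \tag{15-50}$$

$$H(\mathfrak{A}|\mathfrak{B}) \le H(\mathfrak{A}) \tag{15-51}$$

$$H(\mathfrak{A}) - H(\mathfrak{A}|\mathfrak{B}) = H(\mathfrak{B}) - H(\mathfrak{B}|\mathfrak{A}) \tag{15-52}$$

Mutual information. The function

$$I(\mathfrak{A}, \mathfrak{B}) = H(\mathfrak{A}) + H(\mathfrak{B}) - H(\mathfrak{A}\cdot\mathfrak{B}) \tag{15-53}$$

is called the mutual information of the partitions $\mathfrak{A}$ and $\mathfrak{B}$. From (15-49) it follows that

$$I(\mathfrak{A}, \mathfrak{B}) = H(\mathfrak{A}) - H(\mathfrak{A}|\mathfrak{B}) = H(\mathfrak{B}) - H(\mathfrak{B}|\mathfrak{A}) \tag{15-54}$$

Clearly [see (15-51)]

$$I(\mathfrak{A}, \mathfrak{B}) \ge 0 \tag{15-55}$$

As we shall presently see, $I(\mathfrak{A}, \mathfrak{B})$ can be interpreted as the "information about $\mathfrak{A}$ contained in $\mathfrak{B}$" and it equals the "information about $\mathfrak{B}$ contained in $\mathfrak{A}$."

Example 15-8. In the fair-die experiment of Example 15-7,

$$H(\mathfrak{S}) = \log 6 \qquad H(\mathfrak{S}|\mathfrak{B}) = \log 3 \qquad H(\mathfrak{B}) = \log 2 \qquad H(\mathfrak{B}|\mathfrak{S}) = 0$$

Hence $I(\mathfrak{S}, \mathfrak{B}) = \log 2$. Thus the information about the element partition $\mathfrak{S}$ resulting from the observation of the even-odd partition $\mathfrak{B}$ equals $\log 2$.

Generalizations. The preceding results can be readily generalized to an arbitrary number of partitions. We list below several special cases leaving the simple proofs as problems:

(a) If $\mathfrak{B} \prec \mathfrak{C}$ then

$$H(\mathfrak{A}|\mathfrak{B}) \le H(\mathfrak{A}|\mathfrak{C}) \tag{15-56}$$

(b) If the partitions $\mathfrak{A}$, $\mathfrak{B}$, and $\mathfrak{C}$ are independent, then

$$H(\mathfrak{A}\cdot\mathfrak{B}\cdot\mathfrak{C}) = H(\mathfrak{A}) + H(\mathfrak{B}\cdot\mathfrak{C}) = H(\mathfrak{A}) + H(\mathfrak{B}) + H(\mathfrak{C}) \tag{15-57}$$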
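(c) Chain rule. For any $\mathfrak{A}$, $\mathfrak{B}$, and $\mathfrak{C}$:

$$H(\mathfrak{B}\cdot\mathfrak{C}|\mathfrak{A}) = H(\mathfrak{B}|\mathfrak{A}) + H(\mathfrak{C}|\mathfrak{A}\cdot\mathfrak{B}) \tag{15-58}$$

$$H(\mathfrak{A}\cdot\mathfrak{B}\cdot\mathfrak{C}) = H(\mathfrak{A}) + H(\mathfrak{B}\cdot\mathfrak{C}|\mathfrak{A}) = H(\mathfrak{A}) + H(\mathfrak{B}|\mathfrak{A}) + H(\mathfrak{C}|\mathfrak{A}\cdot\mathfrak{B}) \tag{15-59}$$

Repeated trials. In the space $\mathcal{S}^n$ of repeated trials all outcomes are sequences of the form

$$\zeta_1 \cdots \zeta_k \cdots \zeta_n \tag{15-60}$$

where each $\zeta_k$ is an element of $\mathcal{S}$. Consider a partition $\mathfrak{A}$ of $\mathcal{S}$ consisting of $N$ events $\mathcal{A}_i$. At the $k$th trial, one and only one of these events will occur, namely the event that contains the element $\zeta_k$. The cartesian product

$$\mathcal{A}_{i,k} = \mathcal{S} \times \cdots \times \mathcal{A}_i \times \cdots \times \mathcal{S} \tag{15-61}$$

is an event in $\mathcal{S}^n$ with probability

$$P(\mathcal{A}_{i,k}) = P(\mathcal{A}_i) \tag{15-62}$$

because it occurs iff the event $\mathcal{A}_i$ occurs at the $k$th trial. For specific $k$, the events $\mathcal{A}_{i,k}$ form an $N$-element partition of the space $\mathcal{S}^n$. This partition will be denoted by $\mathfrak{A}_k$. From (15-62) it follows readily that

$$H(\mathfrak{A}_k) = H(\mathfrak{A}) \tag{15-63}$$

We can define similarly the partition $\mathfrak{B}_k$ of $\mathcal{S}^n$ formed with the elements $\mathcal{B}_j$ of another partition $\mathfrak{B}$ of $\mathcal{S}$. Reasoning as in (15-63), we conclude that $H(\mathfrak{B}_k) = H(\mathfrak{B})$ and

$$H(\mathfrak{A}_k|\mathfrak{B}_k) = H(\mathfrak{A}|\mathfrak{B}) \qquad I(\mathfrak{A}_k, \mathfrak{B}_k) = I(\mathfrak{A}, \mathfrak{B}) \tag{15-64}$$

We next form the product of the $n$ partitions $\mathfrak{A}_k$:

$$\mathfrak{A}^n = \mathfrak{A}_1 \cdot \mathfrak{A}_2 \cdots \mathfrak{A}_n \tag{15-65}$$

The elements of this partition are cartesian products of the form

$$\mathcal{A}_{i_1} \times \cdots \times \mathcal{A}_{i_k} \times \cdots \times \mathcal{A}_{i_n} \tag{15-66}$$

If $\mathfrak{A}$ is the element partition of $\mathcal{S}$, then $\mathfrak{A}^n$ is the element partition of $\mathcal{S}^n$. In general, however, the elements of $\mathfrak{A}^n$ are events consisting of a large number of sequences of the form (15-60). If we picture these sequences as wires, then the elements (15-66) of the partition $\mathfrak{A}^n$ can be viewed as cables and their union as a collection of such cables (Fig. 15-11).

From the independence of the trials, it follows that the $n$ partitions $\mathfrak{A}_1, \ldots, \mathfrak{A}_n$ of $\mathcal{S}^n$ are independent. Hence [see (15-57) and (15-63)]

$$H(\mathfrak{A}^n) = H(\mathfrak{A}_1) + \cdots + H(\mathfrak{A}_n) = nH(\mathfrak{A}) \tag{15-67}$$

Defining similarly the partition $\mathfrak{B}^n$, we conclude as in (15-64) that

$$H(\mathfrak{A}^n|\mathfrak{B}^n) = nH(\mathfrak{A}|\mathfrak{B}) \qquad I(\mathfrak{A}^n, \mathfrak{B}^n) = nI(\mathfrak{A}, \mathfrak{B}) \tag{15-68}$$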
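Relation (15-67) can be verified by brute force for small $n$: the elements of the product partition are the ordered $n$-tuples of events, and by independence of the trials each has probability equal to the product of the individual probabilities. A minimal sketch, not from the original text, with a hypothetical three-event partition:

```python
import math
from itertools import product

def entropy(probs):
    """Entropy in nats of a partition with the given event probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def repeated_trials(probs, n):
    """Probabilities of the elements of the n-fold product partition,
    one per ordered n-tuple of events, assuming independent trials."""
    return [math.prod(seq) for seq in product(probs, repeat=n)]

probs = [0.5, 0.3, 0.2]          # hypothetical partition, chosen for illustration
for n in (1, 2, 3, 4):
    # the two printed values agree up to rounding
    print(n, entropy(repeated_trials(probs, n)), n * entropy(probs))
```

The enumeration grows as $N^n$, so this check is practical only for small $n$.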
FIGURE 15-11

Example 15-9. In the coin experiment, the entropy of the element partition $\mathfrak{S}$ equals

$$H(\mathfrak{S}) = -p\log p - q\log q$$

In the space $\mathcal{S}^2$, the element partition $\mathfrak{S}^2$ consists of four events with

$$P\{hh\} = p^2 \qquad P\{ht\} = P\{th\} = pq \qquad P\{tt\} = q^2$$

Hence

$$H(\mathfrak{S}^2) = -p^2\log p^2 - 2pq\log pq - q^2\log q^2 = -2p\log p - 2q\log q$$

Thus $H(\mathfrak{S}^2) = 2H(\mathfrak{S})$ in agreement with (15-67).

CONDITIONAL ENTROPY AND UNCERTAINTY. As we have noted, the entropy $H(\mathfrak{A})$ of a partition $\mathfrak{A} = [\mathcal{A}_i]$ gives us a measure of uncertainty about the occurrence of the events $\mathcal{A}_i$ at a given trial. Once the trial is performed and the events $\mathcal{A}_i$ are observed, the uncertainty is removed. We give next a similar interpretation to the conditional entropy $H(\mathfrak{A}|\mathcal{M})$ of $\mathfrak{A}$ assuming that the event $\mathcal{M}$ has been observed, and of the conditional entropy $H(\mathfrak{A}|\mathfrak{B})$ of $\mathfrak{A}$ assuming that the partition $\mathfrak{B}$ has been observed.†

†The expression a partition $\mathfrak{B}$ is observed will mean that we know which of the events of $\mathfrak{B}$ has occurred.

If in the definition (15-25) of entropy we replace the probabilities $P(\mathcal{A}_i)$ by the conditional probabilities $P(\mathcal{A}_i|\mathcal{M})$, we obtain the conditional entropy $H(\mathfrak{A}|\mathcal{M})$ of $\mathfrak{A}$ assuming $\mathcal{M}$ [see (15-39)]. The relative frequency interpretation of $P(\mathcal{A}_i|\mathcal{M})$ is the same as that of $P(\mathcal{A}_i)$ if we consider not the entire sequence of $n$ trials but only the subsequence of trials in which the event $\mathcal{M}$ occurs. From this it follows that $H(\mathfrak{A}|\mathcal{M})$ is the uncertainty about $\mathfrak{A}$ per trial in that subsequence. In other words, if at a given trial we know that $\mathcal{M}$ occurs, then our uncertainty about $\mathfrak{A}$ equals $H(\mathfrak{A}|\mathcal{M})$; if we know that $\bar{\mathcal{M}}$ occurs, then our
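uncertainty equals $H(\mathfrak{A}|\bar{\mathcal{M}})$. The weighted sum

$$P(\mathcal{M})H(\mathfrak{A}|\mathcal{M}) + P(\bar{\mathcal{M}})H(\mathfrak{A}|\bar{\mathcal{M}})$$

is the uncertainty about $\mathfrak{A}$ assuming that the binary partition $[\mathcal{M}, \bar{\mathcal{M}}]$ is observed.

Suppose now that at each trial we observe the partition $\mathfrak{B} = [\mathcal{B}_j]$. We maintain that, under this assumption, the uncertainty per trial about $\mathfrak{A}$ equals $H(\mathfrak{A}|\mathfrak{B})$. Indeed, in a sequence of $n$ trials, the number of times the event $\mathcal{B}_j$ occurs equals $n_j \simeq nP(\mathcal{B}_j)$. In this subsequence, the uncertainty about $\mathfrak{A}$ equals $H(\mathfrak{A}|\mathcal{B}_j)$ per trial. Hence the total uncertainty about $\mathfrak{A}$ equals

$$\sum_j n_j H(\mathfrak{A}|\mathcal{B}_j) \simeq \sum_j nP(\mathcal{B}_j)H(\mathfrak{A}|\mathcal{B}_j) = nH(\mathfrak{A}|\mathfrak{B})$$

and the uncertainty per trial equals $H(\mathfrak{A}|\mathfrak{B})$.

Thus the observation of $\mathfrak{B}$ reduces the uncertainty about $\mathfrak{A}$ from $H(\mathfrak{A})$ to $H(\mathfrak{A}|\mathfrak{B})$. The difference

$$I(\mathfrak{A}, \mathfrak{B}) = H(\mathfrak{A}) - H(\mathfrak{A}|\mathfrak{B})$$

is the reduction of the uncertainty about $\mathfrak{A}$ resulting from the observation of $\mathfrak{B}$. This justifies the statement that the mutual information $I(\mathfrak{A}, \mathfrak{B})$ equals the information about $\mathfrak{A}$ contained in $\mathfrak{B}$.

We show next the consistency between the properties of entropy developed earlier and the subjective notion of uncertainty.

1. If $\mathfrak{B}$ is a refinement of $\mathfrak{A}$ and $\mathfrak{B}$ is observed, then we know which of the events of $\mathfrak{A}$ occurred. Hence $H(\mathfrak{A}|\mathfrak{B}) = 0$ in agreement with (15-42).

2. If the partitions $\mathfrak{A}$ and $\mathfrak{B}$ are independent and $\mathfrak{B}$ is observed, no information about $\mathfrak{A}$ is gained. Hence $H(\mathfrak{A}|\mathfrak{B}) = H(\mathfrak{A})$ in agreement with (15-44).

3. If we observe $\mathfrak{B}$, our uncertainty about $\mathfrak{A}$ can only decrease. Hence $H(\mathfrak{A}|\mathfrak{B}) \le H(\mathfrak{A})$ in agreement with (15-51).

4. To observe $\mathfrak{A}\cdot\mathfrak{B}$, we must observe $\mathfrak{A}$ and $\mathfrak{B}$. If only $\mathfrak{B}$ is observed, the information gained equals $H(\mathfrak{B})$. The uncertainty $H(\mathfrak{A}|\mathfrak{B})$ about $\mathfrak{A}$ assuming $\mathfrak{B}$ equals, therefore, the remaining uncertainty about $\mathfrak{A}\cdot\mathfrak{B}$. Hence $H(\mathfrak{A}\cdot\mathfrak{B}) - H(\mathfrak{B}) = H(\mathfrak{A}|\mathfrak{B})$ in agreement with (15-49).

5. Combining 3 and 4, we conclude that $H(\mathfrak{A}\cdot\mathfrak{B}) - H(\mathfrak{B}) \le H(\mathfrak{A})$ in agreement with (15-45).

6. If $\mathfrak{B}$ is observed, then the information that is gained about $\mathfrak{A}$ equals $I(\mathfrak{A}, \mathfrak{B})$. If $\mathfrak{B} \prec \mathfrak{C}$ and $\mathfrak{B}$ is observed, then $\mathfrak{C}$ is known.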
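But knowledge of $\mathfrak{C}$ yields information about $\mathfrak{A}$ equal to $I(\mathfrak{A}, \mathfrak{C})$. Hence, if $\mathfrak{B} \prec \mathfrak{C}$, then $I(\mathfrak{A}, \mathfrak{B}) \ge I(\mathfrak{A}, \mathfrak{C})$ or, equivalently, $H(\mathfrak{A}|\mathfrak{B}) \le H(\mathfrak{A}|\mathfrak{C})$ in agreement with (15-56).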
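The relations (15-41), (15-49), (15-51), and (15-53) to (15-55) can be checked together on the loaded die of Example 15-5. The sketch below is not from the original text; it stores the probabilities $P(\mathcal{A}_i\mathcal{B}_j)$ of the product partition in a small table, and the variable names are illustrative. Entropies are in bits.

```python
import math

def H(probs):
    """Entropy in bits of a partition with the given event probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# joint[i][j] = P(A_i B_j) for A = [even, odd] and B = [i <= 3, i > 3]
joint = [[0.10, 0.45],    # even: {f2}, {f4, f6}
         [0.25, 0.20]]    # odd:  {f1, f3}, {f5}

pA = [sum(row) for row in joint]              # P(even), P(odd)
pB = [sum(col) for col in zip(*joint)]        # P(i <= 3), P(i > 3)
H_AB = H([p for row in joint for p in row])   # entropy of the product partition

# (15-41): weighted average of the entropies of the conditional probabilities
H_A_given_B = sum(pB[j] * H([joint[i][j] / pB[j] for i in range(len(pA))])
                  for j in range(len(pB)))
I_AB = H(pA) + H(pB) - H_AB                   # mutual information (15-53)

print(H(pA), H(pB), H_AB)            # about 0.993, 0.934, 1.815
print(H_A_given_B, H_AB - H(pB))     # equal, as in (15-49)
print(I_AB, H(pA) - H_A_given_B)     # equal, as in (15-54)
```

Because the two partitions are not independent here, the mutual information is strictly positive and the conditional entropy is strictly smaller than $H(\mathfrak{A})$.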
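FIGURE 15-12

CONDITIONAL ENTROPY AND TYPICAL SEQUENCES. We give next a typical sequence interpretation of the properties of conditional entropy limiting the discussion to (15-45) and (15-49). The underlying reasoning is used in the proof of the channel capacity theorem (Sec. 15-6).

We denote by $t^{\mathfrak{A}}$, $t^{\mathfrak{B}}$, and $t^{\mathfrak{A}\cdot\mathfrak{B}}$ the typical sequences of the partitions $\mathfrak{A}$, $\mathfrak{B}$, and $\mathfrak{A}\cdot\mathfrak{B}$ respectively, and by $T^{\mathfrak{A}}$, $T^{\mathfrak{B}}$, and $T^{\mathfrak{A}\cdot\mathfrak{B}}$ their unions (Fig. 15-12a). As we know [see (15-7)]

$$P(T^{\mathfrak{A}}) \simeq P(T^{\mathfrak{B}}) \simeq P(T^{\mathfrak{A}\cdot\mathfrak{B}}) \simeq 1$$

Furthermore, the number of typical sequences in each of the above three sets equals [see (15-10)]

$$n_{T^{\mathfrak{A}}} \simeq e^{nH(\mathfrak{A})} \qquad n_{T^{\mathfrak{B}}} \simeq e^{nH(\mathfrak{B})} \qquad n_{T^{\mathfrak{A}\cdot\mathfrak{B}}} \simeq e^{nH(\mathfrak{A}\cdot\mathfrak{B})} \tag{15-69}$$

I. We maintain that

$$H(\mathfrak{A}\cdot\mathfrak{B}) \le H(\mathfrak{A}) + H(\mathfrak{B}) \tag{15-70}$$

Proof. Each $t^{\mathfrak{A}\cdot\mathfrak{B}}$ sequence specifies a pair $(t^{\mathfrak{A}}, t^{\mathfrak{B}})$. The total number of such pairs formed with all the elements of $T^{\mathfrak{A}}$ and $T^{\mathfrak{B}}$ equals $n_{T^{\mathfrak{A}}} \cdot n_{T^{\mathfrak{B}}}$. However, not all such pairs generate $t^{\mathfrak{A}\cdot\mathfrak{B}}$ sequences because, if the partitions $\mathfrak{A}$ and $\mathfrak{B}$ are not independent, then not all pairs can occur. For example, if $\mathcal{A}_i \subset \mathcal{B}_j$ for some $i$ and $j$ and $\mathcal{A}_i$ occurs at the $k$th trial, then $\mathcal{B}_j$ must also occur at this trial. From the above it follows that $n_{T^{\mathfrak{A}\cdot\mathfrak{B}}} \le n_{T^{\mathfrak{A}}} \cdot n_{T^{\mathfrak{B}}}$ and (15-70) follows from (15-69).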
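II. We shall show, finally, that

$$H(\mathfrak{A}\cdot\mathfrak{B}) = H(\mathfrak{B}) + H(\mathfrak{A}|\mathfrak{B}) \tag{15-71}$$

Proof. There are $n_{T^{\mathfrak{B}}}$ sequences in the set $T^{\mathfrak{B}}$ and $n_{T^{\mathfrak{A}\cdot\mathfrak{B}}}$ sequences in the set $T^{\mathfrak{A}\cdot\mathfrak{B}}$. The ratio

$$\frac{n_{T^{\mathfrak{A}\cdot\mathfrak{B}}}}{n_{T^{\mathfrak{B}}}} \simeq e^{n[H(\mathfrak{A}\cdot\mathfrak{B}) - H(\mathfrak{B})]}$$

equals, therefore, the number of $t^{\mathfrak{A}\cdot\mathfrak{B}}$ sequences contained in a single $t^{\mathfrak{B}}$ sequence on the average. To prove (15-71), we must prove, therefore, that this number equals $e^{nH(\mathfrak{A}|\mathfrak{B})}$. We shall prove a stronger statement: The number of $t^{\mathfrak{A}\cdot\mathfrak{B}}$ sequences contained in a single $t^{\mathfrak{B}}$ sequence (Fig. 15-13) equals $e^{nH(\mathfrak{A}|\mathfrak{B})}$.

As we know [see (1-1)], the number $n_j$ of times the event $\mathcal{B}_j$ occurs in a $t^{\mathfrak{B}}$ sequence "almost certainly" equals

$$n_j \simeq nP(\mathcal{B}_j) \tag{15-72}$$

We denote by $t^{\mathcal{B}_j}$ a subsequence (Fig. 15-12b) of $t^{\mathfrak{B}}$ in which the number of occurrences of $\mathcal{B}_j$ satisfies (15-72). In this subsequence, the relative frequency of the occurrence of an event $\mathcal{A}_i$ equals $P(\mathcal{A}_i|\mathcal{B}_j)$ [see (2-32)]. We shall use (15-10) to show that the number of typical sequences formed with the elements of $\mathfrak{A}$ that are included in a $t^{\mathcal{B}_j}$ sequence equals

$$e^{n_j H(\mathfrak{A}|\mathcal{B}_j)} \simeq e^{nP(\mathcal{B}_j)H(\mathfrak{A}|\mathcal{B}_j)} \tag{15-73}$$

Indeed, this follows from (15-10) if we introduce the following changes: We replace the probabilities $P(\mathcal{A}_i)$ by $P(\mathcal{A}_i|\mathcal{B}_j)$, the length $n$ of the original sequences with the length $n_j \simeq nP(\mathcal{B}_j)$, and the entropy $H(\mathfrak{A})$ of $\mathfrak{A}$ with the conditional entropy $H(\mathfrak{A}|\mathcal{B}_j)$.

Returning to the original $t^{\mathfrak{B}}$ sequence, we note that it is formed by combining the $t^{\mathcal{B}_j}$ sequences that are included in $t^{\mathfrak{B}}$. This shows that the total number of $t^{\mathfrak{A}}$ sequences that are included in $t^{\mathfrak{B}}$ equals the product

$$\prod_j e^{nP(\mathcal{B}_j)H(\mathfrak{A}|\mathcal{B}_j)} = e^{nH(\mathfrak{A}|\mathfrak{B})} \tag{15-74}$$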
But each $t_{\mathfrak{A}}$ sequence that is included in $t_{\mathfrak{B}}$ is a $t_{\mathfrak{A}\cdot\mathfrak{B}}$ sequence. Hence the number of $t_{\mathfrak{A}\cdot\mathfrak{B}}$ sequences that are included in $t_{\mathfrak{B}}$ equals $e^{nH(\mathfrak{A}|\mathfrak{B})}$.

15-3 RANDOM VARIABLES AND STOCHASTIC PROCESSES

Entropy is a number assigned to a partition. To define the entropy of an RV we must, therefore, form a suitable partition. This is simple if the RV is of discrete type. However, for continuous-type RVs we can do so only indirectly.

Discrete type. Suppose that $\mathbf{x}$ is an RV taking the values $x_i$ with

$$P\{\mathbf{x} = x_i\} = p_i$$

The events $\{\mathbf{x} = x_i\}$ are mutually exclusive and their union is the certain event; hence they form a partition. This partition will be denoted by $\mathfrak{A}_x$ and will be called the partition of $\mathbf{x}$.

Definition. The entropy $H(\mathbf{x})$ of a discrete-type RV $\mathbf{x}$ is the entropy $H(\mathfrak{A}_x)$ of its partition $\mathfrak{A}_x$:

$$H(\mathbf{x}) = H(\mathfrak{A}_x) = -\sum_i p_i \ln p_i \tag{15-75}$$

Continuous type. The entropy of a continuous-type RV cannot be so defined because the events $\{\mathbf{x} = x_i\}$ do not form a partition (they are not countable). To define $H(\mathbf{x})$, we form, first, the discrete-type RV $\mathbf{x}_\delta$ obtained by rounding off $\mathbf{x}$ as in Fig. 15-14:

$$\mathbf{x}_\delta = n\delta \quad \text{if} \quad n\delta - \delta < \mathbf{x} \le n\delta \tag{15-76}$$

FIGURE 15-14

Clearly,

$$P\{\mathbf{x}_\delta = n\delta\} = P\{n\delta - \delta < \mathbf{x} \le n\delta\} = \int_{n\delta-\delta}^{n\delta} f(x)\,dx = \delta \bar{f}(n\delta)$$
where $\bar{f}(n\delta)$ is a number between the maximum and the minimum of $f(x)$ in the interval $(n\delta - \delta, n\delta)$. Applying (15-75) to the RV $\mathbf{x}_\delta$, we obtain

$$H(\mathbf{x}_\delta) = -\sum_{n=-\infty}^{\infty} \delta \bar{f}(n\delta) \ln[\delta \bar{f}(n\delta)]$$

and since

$$\sum_{n=-\infty}^{\infty} \delta \bar{f}(n\delta) = \int_{-\infty}^{\infty} f(x)\,dx = 1$$

we conclude that

$$H(\mathbf{x}_\delta) = -\ln\delta - \sum_{n=-\infty}^{\infty} \delta \bar{f}(n\delta) \ln \bar{f}(n\delta) \tag{15-77}$$

As $\delta \to 0$, the RV $\mathbf{x}_\delta$ tends to $\mathbf{x}$; however, its entropy $H(\mathbf{x}_\delta)$ tends to $\infty$ because $-\ln\delta \to \infty$. For this reason, we define the entropy $H(\mathbf{x})$ of $\mathbf{x}$ not as the limit of $H(\mathbf{x}_\delta)$ but as the limit of the sum $H(\mathbf{x}_\delta) + \ln\delta$ as $\delta \to 0$. This yields

$$H(\mathbf{x}_\delta) + \ln\delta \xrightarrow[\delta \to 0]{} -\int_{-\infty}^{\infty} f(x) \ln f(x)\,dx \tag{15-78}$$

Definition. The entropy of a continuous-type RV $\mathbf{x}$ is by definition the integral

$$H(\mathbf{x}) = -\int_{-\infty}^{\infty} f(x) \ln f(x)\,dx \tag{15-79}$$

The integration extends only over the region where $f(x) \ne 0$ because $f(x) \ln f(x) = 0$ if $f(x) = 0$.

Example 15-10. If $\mathbf{x}$ is uniform in the interval $(0, a)$, then

$$H(\mathbf{x}) = -\frac{1}{a}\int_0^a \ln\frac{1}{a}\,dx = \ln a \tag{15-80}$$

Notes 1. The entropy $H(\mathbf{x}_\delta)$ of $\mathbf{x}_\delta$ is a measure of our uncertainty about the RV $\mathbf{x}$ rounded off to the nearest $n\delta$. If $\delta$ is small, the resulting uncertainty is large and it tends to $\infty$ as $\delta \to 0$. This conclusion is based on the assumption that $\mathbf{x}$ can be observed perfectly; that is, its various values can be recognized as distinct no matter how close they are. In a physical experiment, however, this assumption is not realistic. Values of $\mathbf{x}$ that differ slightly cannot always be treated as distinct (noise considerations or round-off errors, for example). The presence of the term $\ln\delta$ in (15-78) is, in a sense, a recognition of this ambiguity.

2. As in the case of arbitrary partitions, the entropy of a discrete-type RV $\mathbf{x}$ is positive and it is used as a measure of uncertainty about $\mathbf{x}$. This is not so, however, for continuous-type RVs. Their entropy can take any value from $-\infty$ to $\infty$ and it is used to measure only changes in uncertainty.
The various properties of partitions also apply to continuous-type RVs if, as is generally the case, they involve only differences of entropies.
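The limit in (15-78) can be checked numerically. The sketch below is only an illustration under stated assumptions: the density $f(x) = ce^{-cx}$ for $x > 0$, the value $c = 2$, and the function names `quantized_entropy` and `differential_entropy` are choices made here, not part of the text. It computes $H(\mathbf{x}_\delta) + \ln\delta$ for decreasing $\delta$ and compares it with a midpoint-rule estimate of the integral in (15-79).

```python
import math

def quantized_entropy(c, delta):
    """Entropy of the rounded RV x_delta in (15-76) for f(x) = c*exp(-c*x), x > 0."""
    cell = -math.expm1(-c * delta)           # 1 - exp(-c*delta), computed accurately
    h = 0.0
    n = 1
    while c * (n - 1) * delta < 50.0:        # the remaining tail is negligible
        # P{x_delta = n*delta} = F(n*delta) - F(n*delta - delta)
        p = math.exp(-c * (n - 1) * delta) * cell
        h -= p * math.log(p)
        n += 1
    return h

def differential_entropy(c, steps=200000):
    """Midpoint-rule estimate of -integral f ln f dx as in (15-79)."""
    upper = 40.0 / c                          # truncation point, assumed large enough
    dx = upper / steps
    total = 0.0
    for k in range(steps):
        f = c * math.exp(-c * (k + 0.5) * dx)
        total -= f * math.log(f) * dx
    return total

c = 2.0
target = differential_entropy(c)
for delta in (0.5, 0.1, 0.01):
    shifted = quantized_entropy(c, delta) + math.log(delta)
    print(delta, shifted, target)
```

As $\delta$ decreases, the printed sum $H(\mathbf{x}_\delta) + \ln\delta$ should approach the quadrature value, while $H(\mathbf{x}_\delta)$ itself keeps growing like $-\ln\delta$.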
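Entropy as expected value. The integral in (15-79) is the expected value of the RV $\mathbf{y} = -\ln f(\mathbf{x})$ obtained through the transformation $g(x) = -\ln f(x)$:

$$H(\mathbf{x}) = E\{-\ln f(\mathbf{x})\} = -\int_{-\infty}^{\infty} f(x) \ln f(x)\,dx \tag{15-81}$$

Similarly, the sum in (15-75) can be written as the expected value of the RV $-\ln p(\mathbf{x})$:

$$H(\mathbf{x}) = E\{-\ln p(\mathbf{x})\} = -\sum_i p_i \ln p_i \tag{15-82}$$

where now $p(x)$ is a function defined only for $x = x_i$ and such that $p(x_i) = p_i$.

Example 15-11. If

$$f(x) = ce^{-cx}U(x)$$

then

$$E\{-\ln f(\mathbf{x})\} = E\{c\mathbf{x} - \ln c\}$$

Since $E\{c\mathbf{x}\} = 1$, this yields

$$H(\mathbf{x}) = 1 - \ln c = \ln\frac{e}{c} \tag{15-83}$$

Example 15-12. If

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\eta)^2/2\sigma^2}$$

then

$$E\{-\ln f(\mathbf{x})\} = \ln\sigma\sqrt{2\pi} + E\left\{\frac{(\mathbf{x}-\eta)^2}{2\sigma^2}\right\} = \ln\sigma\sqrt{2\pi} + \frac{\sigma^2}{2\sigma^2}$$

Hence the entropy of a normal RV equals

$$H(\mathbf{x}) = \ln\sigma\sqrt{2\pi e} \tag{15-84}$$

Joint entropy. Suppose that $\mathbf{x}$ and $\mathbf{y}$ are two discrete-type RVs taking the values $x_i$ and $y_j$ respectively with

$$P\{\mathbf{x} = x_i, \mathbf{y} = y_j\} = p_{ij}$$

Their joint entropy, denoted by $H(\mathbf{x}, \mathbf{y})$, is by definition the entropy of the product of their respective partitions. Clearly, the elements of $\mathfrak{A}_x \cdot \mathfrak{A}_y$ are the events $\{\mathbf{x} = x_i, \mathbf{y} = y_j\}$. Hence

$$H(\mathbf{x}, \mathbf{y}) = H(\mathfrak{A}_x \cdot \mathfrak{A}_y) = -\sum_{i,j} p_{ij} \ln p_{ij}$$

The above can be written as an expected value

$$H(\mathbf{x}, \mathbf{y}) = E\{-\ln p(\mathbf{x}, \mathbf{y})\}$$

where $p(x, y)$ is a function defined only for $x = x_i$ and $y = y_j$ and it is such that $p(x_i, y_j) = p_{ij}$.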
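The joint entropy $H(\mathbf{x}, \mathbf{y})$ of two continuous-type RVs $\mathbf{x}$ and $\mathbf{y}$ is defined as the limit of the sum

$$H(\mathbf{x}_\delta, \mathbf{y}_\delta) + 2\ln\delta$$

as $\delta \to 0$, where $\mathbf{x}_\delta$ and $\mathbf{y}_\delta$ are their staircase approximations. Reasoning as in (15-78), we obtain

$$H(\mathbf{x}, \mathbf{y}) = -\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y) \ln f(x, y)\,dx\,dy = E\{-\ln f(\mathbf{x}, \mathbf{y})\} \tag{15-85}$$

Example 15-13. If the RVs $\mathbf{x}$ and $\mathbf{y}$ are jointly normal as in (6-15), then

$$\ln f(x, y) = -\frac{1}{2(1-r^2)}\left[\frac{(x-\eta_1)^2}{\sigma_1^2} - 2r\frac{(x-\eta_1)(y-\eta_2)}{\sigma_1\sigma_2} + \frac{(y-\eta_2)^2}{\sigma_2^2}\right] - \ln 2\pi\sigma_1\sigma_2\sqrt{1-r^2}$$

In this case,

$$E\left\{\frac{(\mathbf{x}-\eta_1)^2}{\sigma_1^2} - 2r\frac{(\mathbf{x}-\eta_1)(\mathbf{y}-\eta_2)}{\sigma_1\sigma_2} + \frac{(\mathbf{y}-\eta_2)^2}{\sigma_2^2}\right\} = 1 - 2r^2 + 1$$

Hence

$$E\{-\ln f(\mathbf{x}, \mathbf{y})\} = 1 + \ln 2\pi\sigma_1\sigma_2\sqrt{1-r^2}$$

From the above and (15-85) it follows that the joint entropy of two jointly normal RVs equals

$$H(\mathbf{x}, \mathbf{y}) = \ln 2\pi e\sqrt{\Delta} \tag{15-86}$$

where

$$\Delta = \mu_{11}\mu_{22} - \mu_{12}^2 \qquad \mu_{11} = \sigma_1^2 \qquad \mu_{22} = \sigma_2^2 \qquad \mu_{12} = r\sigma_1\sigma_2$$

Conditional entropy. Consider two discrete-type RVs $\mathbf{x}$ and $\mathbf{y}$ taking the values $x_i$ and $y_j$ with

$$P\{\mathbf{x} = x_i | \mathbf{y} = y_j\} = \pi_{ij} = p_{ij}/p_j$$

where $p_{ij} = P\{\mathbf{x} = x_i, \mathbf{y} = y_j\}$ and $p_j = P\{\mathbf{y} = y_j\}$. The conditional entropy $H(\mathbf{x}|y_j)$ of $\mathbf{x}$ assuming $\mathbf{y} = y_j$ is by definition the conditional entropy of the partition $\mathfrak{A}_x$ of $\mathbf{x}$ assuming $\{\mathbf{y} = y_j\}$. From the above and (15-39) it follows that

$$H(\mathbf{x}|y_j) = -\sum_i \pi_{ij} \ln \pi_{ij} \tag{15-87}$$

The conditional entropy $H(\mathbf{x}|\mathbf{y})$ of $\mathbf{x}$ assuming $\mathbf{y}$ is the conditional entropy of $\mathfrak{A}_x$ assuming $\mathfrak{A}_y$. Thus [see (15-41)]

$$H(\mathbf{x}|\mathbf{y}) = \sum_j p_j H(\mathbf{x}|y_j) = -\sum_{i,j} p_{ij} \ln \pi_{ij} \tag{15-88}$$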
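For continuous-type RVs the corresponding concepts are defined similarly:

$$H(\mathbf{x}|y) = -\int_{-\infty}^{\infty} f(x|y) \ln f(x|y)\,dx \tag{15-89}$$

$$H(\mathbf{x}|\mathbf{y}) = \int_{-\infty}^{\infty} f(y) H(\mathbf{x}|y)\,dy = -\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y) \ln f(x|y)\,dx\,dy \tag{15-90}$$

The above integrals can be written as expected values [see also (7-66)]

$$H(\mathbf{x}|y) = E\{-\ln f(\mathbf{x}|\mathbf{y}) \,|\, \mathbf{y} = y\} \tag{15-91}$$

$$H(\mathbf{x}|\mathbf{y}) = E\{-\ln f(\mathbf{x}|\mathbf{y})\} = E\{E\{-\ln f(\mathbf{x}|\mathbf{y}) \,|\, \mathbf{y}\}\} \tag{15-92}$$

The discrete case leads to similar expressions.

Mutual information. Guided by (15-53), we shall call the function

$$I(\mathbf{x}, \mathbf{y}) = H(\mathbf{x}) + H(\mathbf{y}) - H(\mathbf{x}, \mathbf{y}) \tag{15-93}$$

the mutual information of the RVs $\mathbf{x}$ and $\mathbf{y}$. From (15-81) and (15-85) it follows that $I(\mathbf{x}, \mathbf{y})$ can be written as an expected value

$$I(\mathbf{x}, \mathbf{y}) = E\left\{\ln\frac{f(\mathbf{x}, \mathbf{y})}{f(\mathbf{x}) f(\mathbf{y})}\right\} \tag{15-94}$$

Since $f(x, y) = f(x|y) f(y)$ it follows from the above and (15-92) that

$$I(\mathbf{x}, \mathbf{y}) = H(\mathbf{x}) - H(\mathbf{x}|\mathbf{y}) = H(\mathbf{y}) - H(\mathbf{y}|\mathbf{x}) \tag{15-95}$$

Example 15-14. If two RVs $\mathbf{x}$ and $\mathbf{y}$ are jointly normal with zero mean, then [see (7-42)] the conditional density $f(x|y)$ is normal with mean $r\sigma_x y/\sigma_y$ and variance $\sigma_x^2(1 - r^2)$. From this and (15-84) it follows that

$$H(\mathbf{x}|y) = E\{-\ln f(\mathbf{x}|\mathbf{y}) \,|\, \mathbf{y} = y\} = \ln\sigma_x\sqrt{2\pi e(1 - r^2)} \tag{15-96}$$

Since this is independent of $y$, it follows from (15-92) that

$$H(\mathbf{x}|\mathbf{y}) = H(\mathbf{x}|y) \tag{15-97}$$

This yields [see (15-95)]

$$I(\mathbf{x}, \mathbf{y}) = H(\mathbf{x}) - H(\mathbf{x}|\mathbf{y}) = -0.5\ln(1 - r^2) \tag{15-98}$$

We note finally that [see (15-86)]

$$H(\mathbf{x}|\mathbf{y}) + H(\mathbf{y}) = \ln 2\pi e\sqrt{\Delta} = H(\mathbf{x}, \mathbf{y})$$

Special Case. Suppose that $\mathbf{y} = \mathbf{x} + \mathbf{n}$ where $\mathbf{n}$ is independent of $\mathbf{x}$ and $E\{\mathbf{n}^2\} = N$. In this case,

$$E\{\mathbf{x}\mathbf{y}\} = \sigma_x^2 \qquad E\{\mathbf{y}^2\} = \sigma_x^2 + N \qquad r^2 = \frac{\sigma_x^2}{\sigma_x^2 + N}$$

Inserting into (15-98), we obtain

$$I(\mathbf{x}, \mathbf{y}) = 0.5\ln\left(1 + \frac{\sigma_x^2}{N}\right) \tag{15-99}$$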
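The special case can be verified by evaluating the definition (15-93) directly from the normal entropies (15-84) and (15-86). The following is a minimal sketch; the function names and the numerical values of $\sigma_x^2$ and $N$ are illustrative assumptions, and both RVs are taken to be zero-mean normal.

```python
import math

def mutual_information_from_entropies(var_x, noise_var):
    """I(x, y) = H(x) + H(y) - H(x, y) for y = x + n, all RVs zero-mean normal."""
    var_y = var_x + noise_var                 # E{y^2}
    cov_xy = var_x                            # E{xy}
    delta = var_x * var_y - cov_xy ** 2       # determinant in (15-86)
    h_x = math.log(math.sqrt(2 * math.pi * math.e * var_x))    # (15-84)
    h_y = math.log(math.sqrt(2 * math.pi * math.e * var_y))    # (15-84)
    h_xy = math.log(2 * math.pi * math.e * math.sqrt(delta))   # (15-86)
    return h_x + h_y - h_xy                                     # (15-93)

def mutual_information_closed_form(var_x, noise_var):
    """Right side of (15-99)."""
    return 0.5 * math.log(1.0 + var_x / noise_var)

print(mutual_information_from_entropies(4.0, 1.0))
print(mutual_information_closed_form(4.0, 1.0))
```

The two printed values should agree up to rounding, since $\Delta = \sigma_x^2 N$ in this case.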
PROPERTIES. The properties of entropy, developed in Sec. 15-2 for arbitrary partitions, are obviously true for the entropy of discrete-type RVs and can be simply established as appropriate limits for continuous-type RVs. It might be of interest, however, to prove directly theorems (15-45) and (15-49) using the representation of entropy as expected value. The proofs are based on the following version of inequality (15-38): If $\mathbf{x}$ and $\mathbf{y}$ are two RVs with respective densities $a(x)$ and $b(y)$, then

$$E\{\ln a(\mathbf{x})\} \ge E\{\ln b(\mathbf{x})\} \tag{15-100}$$

Equality holds iff $a(x) = b(x)$.

Proof. Applying the inequality $\ln z \le z - 1$ to the function $z = b(x)/a(x)$, we obtain

$$\ln b(x) - \ln a(x) = \ln\frac{b(x)}{a(x)} \le \frac{b(x)}{a(x)} - 1$$

Multiplying by $a(x)$ and integrating, we obtain

$$\int_{-\infty}^{\infty} a(x)[\ln b(x) - \ln a(x)]\,dx \le \int_{-\infty}^{\infty} [b(x) - a(x)]\,dx = 0$$

and (15-100) results. The right side is 0 because the functions $a(x)$ and $b(x)$ are densities by assumption.

Inequality (15-100) can be readily extended to $n$-dimensional densities. For example, if $a(x, y)$ and $b(z, w)$ are the joint densities of the RVs $\mathbf{x}, \mathbf{y}$ and $\mathbf{z}, \mathbf{w}$ respectively, then

$$E\{\ln a(\mathbf{x}, \mathbf{y})\} \ge E\{\ln b(\mathbf{x}, \mathbf{y})\} \tag{15-101}$$

THEOREM 1.

$$H(\mathbf{x}, \mathbf{y}) \le H(\mathbf{x}) + H(\mathbf{y}) \tag{15-102}$$

Proof. Suppose that $f_{xy}(x, y)$ is the joint density of the RVs $\mathbf{x}$ and $\mathbf{y}$ and $f_x(x)$ and $f_y(y)$ their marginal densities. Clearly, the product $f_x(z) f_y(w)$ is the joint density of two independent RVs $\mathbf{z}$ and $\mathbf{w}$. Applying (15-101), we conclude that

$$E\{\ln f_{xy}(\mathbf{x}, \mathbf{y})\} \ge E\{\ln[f_x(\mathbf{x}) f_y(\mathbf{y})]\} = E\{\ln f_x(\mathbf{x})\} + E\{\ln f_y(\mathbf{y})\}$$

and (15-102) results.

THEOREM 2.

$$H(\mathbf{x}, \mathbf{y}) = H(\mathbf{x}|\mathbf{y}) + H(\mathbf{y}) = H(\mathbf{y}|\mathbf{x}) + H(\mathbf{x}) \tag{15-103}$$

Proof. Inserting the identity $f(x, y) = f(x|y) f(y)$ into (15-85), we obtain

$$H(\mathbf{x}, \mathbf{y}) = E\{-\ln f(\mathbf{x}, \mathbf{y})\} = E\{-\ln f(\mathbf{x}|\mathbf{y})\} + E\{-\ln f(\mathbf{y})\}$$

and the first equality in (15-103) results. The second follows because $H(\mathbf{x}, \mathbf{y}) = H(\mathbf{y}, \mathbf{x})$.
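COROLLARY. Comparing (15-102) with (15-103), we conclude that

$$H(\mathbf{x}|\mathbf{y}) \le H(\mathbf{x}) \tag{15-104}$$

Note. If the RV $\mathbf{y}$ is of discrete type, then $H(\mathbf{y}|\mathbf{x}) \ge 0$ and (15-103) yields $H(\mathbf{x}) \le H(\mathbf{x}, \mathbf{y})$. This is not, however, true in general if $\mathbf{y}$ is of continuous type.

Generalizations. The preceding results can be readily generalized to an arbitrary number of RVs: Suppose that $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are $n$ RVs with joint density $f(x_1, \ldots, x_n)$. Extending (15-85), we define their joint entropy as an expected value

$$H(\mathbf{x}_1, \ldots, \mathbf{x}_n) = E\{-\ln f(\mathbf{x}_1, \ldots, \mathbf{x}_n)\} \tag{15-105}$$

If the RVs $\mathbf{x}_i$ are independent, then

$$f(x_1, \ldots, x_n) = f(x_1) \cdots f(x_n)$$

and (15-105) yields

$$H(\mathbf{x}_1, \ldots, \mathbf{x}_n) = H(\mathbf{x}_1) + \cdots + H(\mathbf{x}_n) \tag{15-106}$$

Conditional entropies are defined similarly. For example [see (15-92)]

$$H(\mathbf{x}_n|\mathbf{x}_{n-1}, \ldots, \mathbf{x}_1) = E\{-\ln f(\mathbf{x}_n|\mathbf{x}_{n-1}, \ldots, \mathbf{x}_1)\} \tag{15-107}$$

Chain rule. From the identity [see (8-37)]

$$f(x_1, \ldots, x_n) = f(x_n|x_{n-1}, \ldots, x_1) \cdots f(x_2|x_1) f(x_1)$$

and (15-107) it follows that

$$H(\mathbf{x}_1, \ldots, \mathbf{x}_n) = H(\mathbf{x}_n|\mathbf{x}_{n-1}, \ldots, \mathbf{x}_1) + \cdots + H(\mathbf{x}_2|\mathbf{x}_1) + H(\mathbf{x}_1) \tag{15-108}$$

The following relationships are simple extensions of (15-102) and (15-103):

$$H(\mathbf{x}, \mathbf{y}|\mathbf{z}) \le H(\mathbf{x}|\mathbf{z}) + H(\mathbf{y}|\mathbf{z})$$
$$H(\mathbf{x}, \mathbf{y}|\mathbf{z}) = H(\mathbf{x}|\mathbf{z}) + H(\mathbf{y}|\mathbf{x}, \mathbf{z}) \tag{15-109}$$
$$H(\mathbf{x}_1, \ldots, \mathbf{x}_n) \le H(\mathbf{x}_1) + \cdots + H(\mathbf{x}_n)$$

Example 15-15. If the RVs $\mathbf{x}_i$ are jointly normal with covariance matrix $C$ as in (8-58), then

$$E\{-\ln f(\mathbf{x}_1, \ldots, \mathbf{x}_n)\} = \ln\sqrt{(2\pi)^n\Delta} + \tfrac{1}{2}E\{\mathbf{X}C^{-1}\mathbf{X}^t\} \tag{15-110}$$

This yields (see Prob. 8-23)

$$H(\mathbf{x}_1, \ldots, \mathbf{x}_n) = \ln\sqrt{(2\pi e)^n\Delta} \tag{15-111}$$

Transformations of RVs

We shall compare the entropy of the RVs $\mathbf{x}$ and $\mathbf{y} = g(\mathbf{x})$.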
Discrete type. If the RV $\mathbf{x}$ is of discrete type, then

$$H(\mathbf{y}) \le H(\mathbf{x}) \tag{15-112}$$

with equality iff the transformation $y = g(x)$ has a unique inverse.

Proof. Suppose that $\mathbf{x}$ takes the values $x_i$ with probability $p_i$ and $g(x)$ has a unique inverse. In this case,

$$P\{\mathbf{y} = y_i\} = P\{\mathbf{x} = x_i\} = p_i \qquad y_i = g(x_i)$$

hence $H(\mathbf{y}) = H(\mathbf{x})$. If the transformation is not one-to-one, then $\mathbf{y} = y_i$ for more than one value of $\mathbf{x}$. This results in a reduction of $H(\mathbf{x})$ [see (15-31)].

Continuous type. If the RV $\mathbf{x}$ is of continuous type, then

$$H(\mathbf{y}) \le H(\mathbf{x}) + E\{\ln|g'(\mathbf{x})|\} \tag{15-113}$$

with equality iff the transformation $y = g(x)$ has a unique inverse.

Proof. As we know [see (5-5)], if $y = g(x)$ has a unique inverse $x = g^{(-1)}(y)$, then

$$f_y(y) = \frac{f_x(x)}{|g'(x)|} \qquad dy = |g'(x)|\,dx$$

Hence

$$H(\mathbf{y}) = -\int_{-\infty}^{\infty} f_y(y) \ln f_y(y)\,dy = -\int_{-\infty}^{\infty} f_x(x) \ln\frac{f_x(x)}{|g'(x)|}\,dx = -\int_{-\infty}^{\infty} f_x(x) \ln f_x(x)\,dx + \int_{-\infty}^{\infty} f_x(x) \ln|g'(x)|\,dx$$

and (15-113) results.

Several RVs. Reasoning as in (15-113), we can similarly show that if

$$\mathbf{y}_i = g_i(\mathbf{x}_1, \ldots, \mathbf{x}_n) \qquad i = 1, \ldots, n$$

are $n$ functions of the RVs $\mathbf{x}_i$, then

$$H(\mathbf{y}_1, \ldots, \mathbf{y}_n) \le H(\mathbf{x}_1, \ldots, \mathbf{x}_n) + E\{\ln|J(\mathbf{x}_1, \ldots, \mathbf{x}_n)|\} \tag{15-114}$$

where $J(x_1, \ldots, x_n)$ is the jacobian of the above transformation [see (8-9)]. Equality holds iff the transformation has a unique inverse.

Linear transformations. Suppose that

$$\mathbf{y}_i = a_{i1}\mathbf{x}_1 + \cdots + a_{in}\mathbf{x}_n$$

Denoting by $\Delta$ the determinant of the coefficients, we conclude from (15-114) that if $\Delta \ne 0$ then

$$H(\mathbf{y}_1, \ldots, \mathbf{y}_n) = H(\mathbf{x}_1, \ldots, \mathbf{x}_n) + \ln|\Delta| \tag{15-115}$$

because the transformation has a unique inverse and $\Delta$ does not depend on $x_i$.
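Relation (15-115) can be illustrated for two jointly normal RVs, whose joint entropy is given by (15-86). The sketch below assumes the standard fact that a linear transformation $Y = AX$ of a normal vector with covariance $C$ is normal with covariance $ACA^t$; the matrices and helper names are illustrative choices, not taken from the text.

```python
import math

def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def matmul2(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose2(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]

def normal_joint_entropy(cov):
    """Joint entropy of two jointly normal RVs, (15-86) with Delta = det(cov)."""
    return math.log(2 * math.pi * math.e * math.sqrt(det2(cov)))

C_x = [[2.0, 0.6], [0.6, 1.0]]      # covariance of (x1, x2), assumed values
A = [[1.0, 2.0], [-0.5, 3.0]]       # coefficients a_ij of the transformation
C_y = matmul2(matmul2(A, C_x), transpose2(A))

lhs = normal_joint_entropy(C_y)                              # H(y1, y2)
rhs = normal_joint_entropy(C_x) + math.log(abs(det2(A)))     # right side of (15-115)
print(lhs, rhs)
```

The agreement follows from $\det(ACA^t) = (\det A)^2 \det C$, which moves the factor $|\det A|$ outside the square root in (15-86).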
Stochastic Processes and Entropy Rate

As we know, the statistics of most stochastic processes are determined in terms of the joint density $f(x_1, \ldots, x_m)$ of the RVs $\mathbf{x}(t_1), \ldots, \mathbf{x}(t_m)$. The joint entropy

$$H(\mathbf{x}_1, \ldots, \mathbf{x}_m) = E\{-\ln f(\mathbf{x}_1, \ldots, \mathbf{x}_m)\} \tag{15-116}$$

of these RVs is the $m$th-order entropy of the process $\mathbf{x}(t)$. This function equals the uncertainty about the above RVs and it equals the information gained when they are observed.

In general, the uncertainty about the values of $\mathbf{x}(t)$ on the entire $t$ axis or even on a finite interval, no matter how small, is infinite. However, if $\mathbf{x}(t)$ can be expressed in terms of its values on a countable set of points, as is the case for bandlimited processes, then a rate of uncertainty can be introduced. It suffices, therefore, to consider only discrete-time processes.

The $m$th-order entropy of a discrete-time process $\mathbf{x}_n$ is the joint entropy $H(\mathbf{x}_1, \ldots, \mathbf{x}_m)$ of the $m$ RVs

$$\mathbf{x}_n, \mathbf{x}_{n-1}, \ldots, \mathbf{x}_{n-m+1} \tag{15-117}$$

defined as in (15-116). We shall assume throughout that the process $\mathbf{x}_n$ is SSS. In this case, $H(\mathbf{x}_1, \ldots, \mathbf{x}_m)$ is the uncertainty about any $m$ consecutive values of the process $\mathbf{x}_n$. The first-order entropy will be denoted by $H(\mathbf{x})$. Thus $H(\mathbf{x})$ equals the uncertainty about $\mathbf{x}_n$ for a specific $n$. Clearly [see (15-109)]

$$H(\mathbf{x}_1, \ldots, \mathbf{x}_m) \le H(\mathbf{x}_1) + \cdots + H(\mathbf{x}_m) \tag{15-118}$$

Special cases (a) If the process $\mathbf{x}_n$ is strictly white, that is, if the RVs $\mathbf{x}_n, \mathbf{x}_{n-1}, \ldots$ are independent, then [see (15-106)]

$$H(\mathbf{x}_1, \ldots, \mathbf{x}_m) = mH(\mathbf{x}) \tag{15-119}$$

(b) If the process $\mathbf{x}_n$ is Markoff, then [see (16-99)]

$$f(x_1, \ldots, x_m) = f(x_m|x_{m-1}) \cdots f(x_2|x_1) f(x_1) \tag{15-120}$$

This yields

$$H(\mathbf{x}_1, \ldots, \mathbf{x}_m) = H(\mathbf{x}_m|\mathbf{x}_{m-1}) + \cdots + H(\mathbf{x}_2|\mathbf{x}_1) + H(\mathbf{x}_1) \tag{15-121}$$

From (15-103) and the stationarity of $\mathbf{x}_n$ it follows, therefore, that

$$H(\mathbf{x}_1, \ldots, \mathbf{x}_m) = (m - 1)H(\mathbf{x}_1, \mathbf{x}_2) - (m - 2)H(\mathbf{x}) \tag{15-122}$$

We have thus expressed the $m$th-order entropy of a Markoff process in terms of its first- and second-order entropies.
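Relation (15-122) can be checked by brute force on a small discrete example. The sketch below assumes a stationary two-state Markoff chain with transition probabilities chosen arbitrarily for illustration; it enumerates all $2^m$ sequences to obtain the $m$th-order entropy and compares the result with the right side of (15-122). All names are hypothetical.

```python
import itertools
import math

def entropy(probs):
    """Entropy -sum p ln p of a list of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def block_entropy(P, pi, m):
    """m-th order entropy of a stationary chain with transition matrix P
    and stationary distribution pi, by enumerating all sequences."""
    probs = []
    for seq in itertools.product(range(len(pi)), repeat=m):
        p = pi[seq[0]]
        for prev, cur in zip(seq, seq[1:]):
            p *= P[prev][cur]                 # Markoff factorization (15-120)
        probs.append(p)
    return entropy(probs)

a, b = 0.2, 0.4                               # assumed transition probabilities
P = [[1 - a, a], [b, 1 - b]]
pi = [b / (a + b), a / (a + b)]               # stationary distribution of P

m = 6
H1 = entropy(pi)                              # first-order entropy H(x)
H2 = block_entropy(P, pi, 2)                  # second-order entropy H(x1, x2)
Hm = block_entropy(P, pi, m)
formula = (m - 1) * H2 - (m - 2) * H1         # right side of (15-122)
print(Hm, formula)
```

Starting the chain from its stationary distribution is what makes the process SSS, which (15-122) requires.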
CONDITIONAL ENTROPY. The conditional entropy of order $m$, $H(\mathbf{x}_n|\mathbf{x}_{n-1}, \ldots, \mathbf{x}_{n-m})$, of a process $\mathbf{x}_n$ is the uncertainty about its present under the assumption that its $m$ most recent values have been observed. Extending (15-104), we can readily
show that

$$H(\mathbf{x}_n|\mathbf{x}_{n-1}, \ldots, \mathbf{x}_{n-m}) \le H(\mathbf{x}_n|\mathbf{x}_{n-1}, \ldots, \mathbf{x}_{n-m+1}) \tag{15-123}$$

Thus the above conditional entropy is a decreasing function of $m$. If, therefore, it is bounded from below, it tends to a limit. This is certainly the case if the RVs $\mathbf{x}_n$ are of discrete type because then all entropies are positive. The limit will be denoted by $H_c(\mathbf{x})$ and will be called the conditional entropy of the process $\mathbf{x}_n$:

$$H_c(\mathbf{x}) = \lim_{m\to\infty} H(\mathbf{x}_n|\mathbf{x}_{n-1}, \ldots, \mathbf{x}_{n-m}) \tag{15-124}$$

The function $H_c(\mathbf{x})$ is a measure of our uncertainty about the present of $\mathbf{x}_n$ under the assumption that its entire past is observed.

Special cases (a) If $\mathbf{x}_n$ is strictly white, then

$$H_c(\mathbf{x}) = H(\mathbf{x})$$

(b) If $\mathbf{x}_n$ is a Markoff process, then

$$H(\mathbf{x}_n|\mathbf{x}_{n-1}, \ldots, \mathbf{x}_{n-m}) = H(\mathbf{x}_n|\mathbf{x}_{n-1})$$

Since $\mathbf{x}_n$ is a stationary process, the above equals $H(\mathbf{x}_2|\mathbf{x}_1)$. Hence

$$H_c(\mathbf{x}) = H(\mathbf{x}_2|\mathbf{x}_1) = H(\mathbf{x}_1, \mathbf{x}_2) - H(\mathbf{x}) \tag{15-125}$$

This shows that if $\mathbf{x}_{n-1}$ is observed, then the past has no effect on the uncertainty of the present.

ENTROPY RATE. The ratio $H(\mathbf{x}_1, \ldots, \mathbf{x}_m)/m$ is the average uncertainty per sample in a block of $m$ consecutive samples. The limit of this average as $m \to \infty$ will be denoted by $\bar{H}(\mathbf{x})$ and will be called the entropy rate of the process $\mathbf{x}_n$:

$$\bar{H}(\mathbf{x}) = \lim_{m\to\infty} \frac{1}{m} H(\mathbf{x}_1, \ldots, \mathbf{x}_m) \tag{15-126}$$

If $\mathbf{x}_n$ is strictly white, then

$$\bar{H}(\mathbf{x}) = H(\mathbf{x}) = H_c(\mathbf{x})$$

If $\mathbf{x}_n$ is Markoff, then [see (15-122)]

$$\bar{H}(\mathbf{x}) = H(\mathbf{x}_1, \mathbf{x}_2) - H(\mathbf{x}) = H_c(\mathbf{x}) \tag{15-127}$$

Thus, in both cases, the limit in (15-126) exists and it equals $H_c(\mathbf{x})$. We show next that this is true in general.

THEOREM. The entropy rate of a process $\mathbf{x}_n$ equals its conditional entropy:

$$\bar{H}(\mathbf{x}) = H_c(\mathbf{x}) \tag{15-128}$$

Proof. This is a consequence of the following simple property of convergent sequences: If

$$a_k \to a \quad \text{then} \quad \frac{1}{m}\sum_{k=1}^{m} a_k \to a \tag{15-129}$$
Since $\mathbf{x}_n$ is stationary we conclude, as in (15-108), that

$$H(\mathbf{x}_1, \ldots, \mathbf{x}_m) = H(\mathbf{x}) + \sum_{k=1}^{m-1} H(\mathbf{x}_n|\mathbf{x}_{n-1}, \ldots, \mathbf{x}_{n-k})$$

Dividing by $m$ and using (15-129), we obtain (15-128) because $H(\mathbf{x}_n|\mathbf{x}_{n-1}, \ldots, \mathbf{x}_{n-k})$ tends to $H_c(\mathbf{x})$ as $k \to \infty$.

Note. If $\mathbf{x}_n$ equals the samples $\mathbf{x}(nT)$ of $\mathbf{x}(t)$, then the entropy rate is measured in bits per $T$ seconds. If we wish to measure it in bits per second, we must divide by $T$.

Normal processes. We shall show that if $\mathbf{x}_n$ is a normal process with power spectrum $S(\omega)$, then

$$\bar{H}(\mathbf{x}) = \ln\sqrt{2\pi e} + \frac{1}{4\pi}\int_{-\pi}^{\pi} \ln S(\omega)\,d\omega \tag{15-130}$$

Proof. As we know, the function $f(x_{m+1}|x_m, \ldots, x_1)$ is a one-dimensional normal density with variance $\Delta_{m+1}/\Delta_m$ [see (8-85) and (14-66)]. Hence

$$H(\mathbf{x}_n|\mathbf{x}_{n-1}, \ldots, \mathbf{x}_{n-m}) = \ln\sqrt{2\pi e\frac{\Delta_{m+1}}{\Delta_m}} \tag{15-131}$$

as in (15-84). This leads to the conclusion that

$$H_c(\mathbf{x}) = \ln\sqrt{2\pi e} + \frac{1}{2}\lim_{m\to\infty} \ln\frac{\Delta_{m+1}}{\Delta_m} \tag{15-132}$$

and (15-130) follows from (14-70) and Prob. 14-15.

ENTROPY RATE OF SYSTEM RESPONSE. We shall show that the entropy rate $\bar{H}(\mathbf{y})$ of the output $\mathbf{y}_n$ of a linear system $L(z)$ is given by

$$\bar{H}(\mathbf{y}) = \bar{H}(\mathbf{x}) + \frac{1}{2\pi}\int_{-\pi}^{\pi} \ln|L(e^{j\omega T})|\,d\omega \tag{15-133}$$

where $\bar{H}(\mathbf{x})$ is the entropy rate of the input $\mathbf{x}_n$ (Fig. 15-15).

FIGURE 15-15
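As a numerical illustration of (15-130) and (15-133) in the normal case, the sketch below takes $T = 1$, a white normal input with variance $q$, and the first-order recursive system $L(z) = 1/(1 - az^{-1})$ with $|a| < 1$. These choices, the parameter values, and the function names are assumptions made for the example only. The integrals are approximated with the midpoint rule.

```python
import math

def mean_log(fn, n=4096):
    """Midpoint-rule estimate of (1/(2*pi)) * integral of ln fn(w) over (-pi, pi)."""
    dw = 2 * math.pi / n
    total = sum(math.log(fn(-math.pi + (k + 0.5) * dw)) for k in range(n))
    return total * dw / (2 * math.pi)

def entropy_rate_normal(spectrum):
    """Entropy rate of a normal process from (15-130), taking T = 1."""
    return math.log(math.sqrt(2 * math.pi * math.e)) + 0.5 * mean_log(spectrum)

q, a = 2.0, 0.6        # input variance and filter coefficient (illustrative)

def S_x(w):
    return q           # white input: flat spectrum

def gain_sq(w):
    # |L(e^{jw})|^2 for L(z) = 1 / (1 - a z^{-1})
    return 1.0 / (1.0 - 2.0 * a * math.cos(w) + a * a)

def S_y(w):
    return S_x(w) * gain_sq(w)

rate_x = entropy_rate_normal(S_x)
rate_y = entropy_rate_normal(S_y)
gain_term = 0.5 * mean_log(gain_sq)    # (1/(2*pi)) * integral of ln|L| dw
print(rate_x, rate_y, gain_term)
```

For this particular system the gain term comes out numerically close to zero, so the output entropy rate is essentially that of the input; a different $L(z)$ would in general give a nonzero term.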
Suppose, first, that $\mathbf{x}_n$ is a normal process. In this case $\mathbf{y}_n$ is also normal and its entropy rate is given by (15-130) where

$$S(\omega) = S_y(\omega) = S_x(\omega)|L(e^{j\omega T})|^2 \tag{15-134}$$

This yields

$$\bar{H}(\mathbf{y}) = \ln\sqrt{2\pi e} + \frac{1}{4\pi}\int_{-\pi}^{\pi} \left[\ln S_x(\omega) + \ln|L(e^{j\omega T})|^2\right] d\omega \tag{15-135}$$

and (15-133) follows.

The proof for arbitrary processes is involved. We shall sketch a justification based on (15-115): If the RVs $\mathbf{y}_1, \ldots, \mathbf{y}_m$ depend linearly on the RVs $\mathbf{x}_1, \ldots, \mathbf{x}_m$, then

$$H(\mathbf{y}_1, \ldots, \mathbf{y}_m) = H(\mathbf{x}_1, \ldots, \mathbf{x}_m) + K_0 \tag{15-136}$$

where $K_0 = \ln|\Delta|$ is a constant that depends only on the coefficients of the transformation. The process $\mathbf{y}_n$ depends linearly on $\mathbf{x}_n$:

$$\mathbf{y}_n = \sum_{k=0}^{\infty} l_k \mathbf{x}_{n-k} \qquad n = -\infty, \ldots, \infty \tag{15-137}$$

where now the transformation matrix is of infinite order. Extending (15-136) to infinitely many variables, we conclude with (15-126) that

$$\bar{H}(\mathbf{y}) = \bar{H}(\mathbf{x}) + K \tag{15-138}$$

where again $K$ is a constant that depends only on the parameters of the system $L(z)$. As we have seen, if $\mathbf{x}_n$ is normal, then $K$ equals the integral in (15-133). And since $K$ is independent of $\mathbf{x}_n$, it must equal that integral for any $\mathbf{x}_n$.

15-4 THE MAXIMUM ENTROPY METHOD

The MEM is used to determine various parameters of a probability space subject to given constraints. The resulting problem can be solved, in general, only numerically and it involves the evaluation of the maximum of a function of several variables. In a number of important cases, however, the solution can be found analytically or it can be reduced to a system of algebraic equations. In this section, we consider certain special cases, concentrating on constraints in the form of expected values. The results can be obtained with the familiar variational techniques involving Lagrange multipliers or Euler's equations. For most problems under consideration, however, it suffices to use the following form of (15-100). If $f(x)$ and $\varphi(x)$ are two arbitrary densities, then

$$-\int \varphi(x) \ln\varphi(x)\,dx \le -\int \varphi(x) \ln f(x)\,dx \tag{15-139}$$

Example 15-16.
In the coin experiment, the probability of heads is often viewed as an RV $\mathbf{p}$ (see Bayesian estimation, Sec. 9-2). We shall show that if no prior
information about $\mathbf{p}$ is available, then, according to the ME principle, its density $f(p)$ is uniform in the interval $(0, 1)$. In this problem we must maximize $H(\mathbf{p})$ subject to the constraint (dictated by the meaning of $\mathbf{p}$) that $f(p) = 0$ outside the interval $(0, 1)$. The corresponding entropy is, therefore, given by

$$H(\mathbf{p}) = -\int_0^1 f(p) \ln f(p)\,dp$$

and our problem is to find $f(p)$ such as to maximize the above integral. We maintain that $H(\mathbf{p})$ is maximum if

$$f(p) = 1 \qquad H(\mathbf{p}) = 0$$

Indeed, if $\varphi(p)$ is any other density such that $\varphi(p) = 0$ outside the interval $(0, 1)$, then [see (15-139)]

$$-\int_0^1 \varphi(p) \ln\varphi(p)\,dp \le -\int_0^1 \varphi(p) \ln f(p)\,dp = 0 = H(\mathbf{p})$$

Example 15-17. Suppose that $\mathbf{x}$ is an RV vanishing outside the interval $(-\pi, \pi)$. Using the MEM, we shall determine the density $f(x)$ of $\mathbf{x}$ under the assumption that the coefficients $c_n$ of its Fourier series expansion

$$f(x) = \sum_{n=-\infty}^{\infty} c_n e^{jnx} \qquad -\pi \le x \le \pi$$

are known for $|n| \le N$. Our problem now is to maximize the integral

$$H(\mathbf{x}) = -\int_{-\pi}^{\pi} f(x) \ln f(x)\,dx$$

subject to the constraints

$$c_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x) e^{-jnx}\,dx \qquad |n| \le N \tag{15-140}$$

Clearly, $H(\mathbf{x})$ depends on the unknown coefficients $c_n$ and it is maximum iff

$$\frac{\partial H}{\partial c_n} = \frac{\partial H}{\partial f}\frac{\partial f}{\partial c_n} = -\int_{-\pi}^{\pi} [\ln f(x) + 1] e^{jnx}\,dx = 0 \qquad |n| > N$$

This shows that the coefficients $\gamma_n$ of the Fourier series expansion of the function $\ln f(x) + 1$ in the interval $(-\pi, \pi)$ are 0 for $|n| > N$. Hence

$$\ln f(x) + 1 = \sum_{k=-N}^{N} \gamma_k e^{jkx}$$

From the above it follows that

$$f(x) = \exp\left\{-1 + \sum_{k=-N}^{N} \gamma_k e^{jkx}\right\} \qquad -\pi \le x \le \pi \tag{15-141}$$

We have thus shown that the unknown function is given by an exponential involving the parameters $\gamma_k$. These parameters can be determined from (15-140). The resulting system is nonlinear and can only be solved numerically.
Constraints as Expected Values

We shall consider now a class of problems involving constraints in the form of expected values. Such problems are common in statistical mechanics. We start with the one-dimensional case.

We wish to determine the density $f(x)$ of an RV $\mathbf{x}$ subject to the condition that the expected values $\eta_i$ of $n$ known functions $g_i(\mathbf{x})$ of $\mathbf{x}$ are given:

$$E\{g_i(\mathbf{x})\} = \int_{-\infty}^{\infty} g_i(x) f(x)\,dx = \eta_i \qquad i = 1, \ldots, n \tag{15-142}$$

Using (15-139), we shall show that the MEM leads to the conclusion that $f(x)$ must be an exponential

$$f(x) = A\exp\{-\lambda_1 g_1(x) - \cdots - \lambda_n g_n(x)\} \tag{15-143}$$

where $\lambda_i$ are $n$ constants determined from (15-142) and $A$ is such as to satisfy the density condition

$$A\int_{-\infty}^{\infty} \exp\{-\lambda_1 g_1(x) - \cdots - \lambda_n g_n(x)\}\,dx = 1 \tag{15-144}$$

Proof. Suppose that $f(x)$ is given by (15-143). In this case,

$$\int_{-\infty}^{\infty} f(x) \ln f(x)\,dx = \int_{-\infty}^{\infty} f(x)[\ln A - \lambda_1 g_1(x) - \cdots - \lambda_n g_n(x)]\,dx$$

Hence

$$H(\mathbf{x}) = \lambda_1\eta_1 + \cdots + \lambda_n\eta_n - \ln A \tag{15-145}$$

To prove (15-143), it suffices, therefore, to show that, if $\varphi(x)$ is any other density satisfying the constraints (15-142), then its entropy cannot exceed the right side of (15-145). This follows readily from (15-139):

$$-\int_{-\infty}^{\infty} \varphi(x) \ln\varphi(x)\,dx \le -\int_{-\infty}^{\infty} \varphi(x) \ln f(x)\,dx = \int_{-\infty}^{\infty} \varphi(x)[\lambda_1 g_1(x) + \cdots + \lambda_n g_n(x) - \ln A]\,dx = \lambda_1\eta_1 + \cdots + \lambda_n\eta_n - \ln A$$

We note that, if $f(x) = 0$ outside a certain set $R$, then $f(x)$ is again given by (15-143) for every $x$ in $R$ and the region of integration in (15-144) is the set $R$.

Example 15-18. We shall determine $f(x)$ assuming that $\mathbf{x}$ is a positive RV with known mean $\eta$. With $g(x) = x$, it follows from (15-143) that

$$f(x) = \begin{cases} Ae^{-\lambda x} & x > 0 \\ 0 & x < 0 \end{cases}$$

We have thus shown that if an RV is positive with specified mean, then its density, obtained with the MEM, is an exponential.
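The conclusion of Example 15-18 can be illustrated by comparing the entropy of the exponential density with that of other positive densities having the same mean. The sketch below is only a spot check, not a proof: the two competing densities (uniform on $(0, 2\eta)$ and a half-normal scaled to mean $\eta$), the value of $\eta$, and the helper name `entropy_numeric` are assumptions chosen for illustration, and the entropies are midpoint-rule estimates of (15-79).

```python
import math

def entropy_numeric(pdf, upper, steps=200000):
    """Midpoint-rule estimate of -integral f ln f dx over (0, upper)."""
    dx = upper / steps
    total = 0.0
    for k in range(steps):
        f = pdf((k + 0.5) * dx)
        if f > 0.0:                    # f ln f = 0 where f = 0
            total -= f * math.log(f) * dx
    return total

eta = 1.5                              # common mean (assumed value)
sigma = eta * math.sqrt(math.pi / 2)   # half-normal scale giving mean eta

def exponential(x):
    return math.exp(-x / eta) / eta

def uniform(x):
    return 1.0 / (2 * eta) if x < 2 * eta else 0.0

def half_normal(x):
    return math.sqrt(2 / math.pi) / sigma * math.exp(-x * x / (2 * sigma * sigma))

upper = 60 * eta                       # truncation point, assumed large enough
h_exp = entropy_numeric(exponential, upper)
h_uni = entropy_numeric(uniform, upper)
h_half = entropy_numeric(half_normal, upper)
print(h_exp, h_uni, h_half)
```

In this comparison the exponential density should give the largest of the three values, in agreement with (15-143).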
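THE PARTITION FUNCTION. In certain problems, it is more convenient to express the given constraints in terms of the partition function (Zustandssumme)

$$Z(\lambda_1, \ldots, \lambda_n) = \frac{1}{A} = \int_{-\infty}^{\infty} \exp\{-\lambda_1 g_1(x) - \cdots - \lambda_n g_n(x)\}\,dx \tag{15-146}$$

Indeed, differentiating with respect to $\lambda_i$, we obtain

$$-\frac{\partial Z}{\partial\lambda_i} = \int_{-\infty}^{\infty} g_i(x)\exp\left\{-\sum_{k=1}^{n} \lambda_k g_k(x)\right\} dx = Z\int_{-\infty}^{\infty} g_i(x) f(x)\,dx$$

This yields

$$-\frac{1}{Z}\frac{\partial Z}{\partial\lambda_i} = \eta_i \qquad i = 1, \ldots, n \tag{15-147}$$

The above is a system of $n$ equations equivalent to (15-142) and can be used to determine the $n$ parameters $\lambda_i$.

Example 15-19. In the coin experiment of Example 15-16, we assume that $\mathbf{p}$ is an RV with known mean $\eta$. Since $f(p) = 0$ outside the interval $(0, 1)$, (15-143) yields

$$f(p) = \begin{cases} Ae^{-\lambda p} & 0 \le p \le 1 \\ 0 & \text{otherwise} \end{cases} \qquad Z = \int_0^1 e^{-\lambda p}\,dp = \frac{1 - e^{-\lambda}}{\lambda}$$

The constant $\lambda$ is determined from (15-147):

$$-\frac{1}{Z}\frac{\partial Z}{\partial\lambda} = \frac{1 - e^{-\lambda} - \lambda e^{-\lambda}}{\lambda(1 - e^{-\lambda})} = \eta$$

In Fig. 15-16, we plot $\lambda$ and $f(p)$ for various values of $\eta$. Note that if $\eta = 0.5$, then $\lambda = 0$ and $f(p) = 1$.

FIGURE 15-16

Example 15-20. A collection of particles moves in a conservative field whose potential equals $V(x)$. For a specific $t$, the $x$ component of the position of a particle is an RV $\mathbf{x}$ with density $f(x)$ independent of $t$ (stationary state). Thus the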
probability that the particle is between $x$ and $x + dx$ equals $f(x)\,dx$ and the total energy per unit mass of the ensemble equals

$$I = \int_{-\infty}^{\infty} V(x) f(x)\,dx = E\{V(\mathbf{x})\}$$

We shall find $f(x)$ under the assumption that the function $g(x) = V(x)$ and the mean $I$ of $V(\mathbf{x})$ are given. Inserting into (15-143), we obtain

$$f(x) = \frac{1}{Z} e^{-\lambda V(x)} \tag{15-148}$$

where

$$Z = \int_{-\infty}^{\infty} e^{-\lambda V(x)}\,dx \qquad \frac{1}{Z}\int_{-\infty}^{\infty} V(x) e^{-\lambda V(x)}\,dx = I$$

Special Case. In a gravitational field, the potential $V(x) = Mgx$ is proportional to the distance $x$ from the ground. Since $f(x) = 0$ for $x < 0$, it follows from (15-148) that

$$f(x) = \frac{Mg}{I} e^{-Mgx/I}\, U(x)$$

The resulting atmospheric pressure is proportional to $1 - F(x)$.

Example 15-21. We shall find $f(x)$ such that $E\{\mathbf{x}^2\} = m_2$. With $g_1(x) = x^2$, (15-143) yields

$$f(x) = Ae^{-\lambda x^2} \tag{15-149}$$

Thus, if the second moment $m_2$ of an RV $\mathbf{x}$ is specified, then $\mathbf{x}$ is $N(0, m_2)$. We can show similarly that if the variance $\sigma^2$ of $\mathbf{x}$ is specified, then $\mathbf{x}$ is $N(\eta, \sigma^2)$ where $\eta$ is an arbitrary constant.

Special Case. We consider again a collection of particles in stationary motion and we denote by $\mathbf{v}_x$ the $x$ component of their velocity. We shall determine the density $f(v_x)$ of $\mathbf{v}_x$ under the constraint that the corresponding average kinetic energy

$$K_x = E\{M\mathbf{v}_x^2/2\}$$

is specified. This is a special case of (15-149) with $m_2 = 2K_x/M$. Hence

$$f(v_x) = \sqrt{\frac{M}{4\pi K_x}}\, e^{-Mv_x^2/4K_x}$$

Discrete type RVs. Suppose that an RV $\mathbf{x}$ takes the values $x_k$ with probability $p_k$. We shall use the MEM to determine $p_k$ under the assumption that the expected values

$$E\{g_i(\mathbf{x})\} = \sum_k g_i(x_k) p_k = \eta_i \qquad i = 1, \ldots, n \tag{15-150}$$

of the $n$ known functions $g_i(\mathbf{x})$ are given.
Using (15-37), we can show as in (15-143) that the unknown probabilities equal

$$p_k = A\exp\{-\lambda_1 g_1(x_k) - \cdots - \lambda_n g_n(x_k)\} \tag{15-151}$$

where

$$\frac{1}{A} = Z = \sum_k \exp\{-[\lambda_1 g_1(x_k) + \cdots + \lambda_n g_n(x_k)]\} \tag{15-152}$$

The $n$ constants $\lambda_i$ are determined either from (15-150) or from the equivalent system

$$-\frac{1}{Z}\frac{\partial Z}{\partial\lambda_i} = \eta_i \qquad i = 1, \ldots, n \tag{15-153}$$

Example 15-22. A die is rolled a large number of times and the average number of dots up equals $\eta$. Assuming that $\eta$ is known, we shall determine the probabilities $p_k$ of the six faces $f_k$ using the MEM. For this purpose, we form an RV $\mathbf{x}$ such that $\mathbf{x}(f_k) = k$. Clearly,

$$E\{\mathbf{x}\} = p_1 + 2p_2 + \cdots + 6p_6 = \eta$$

With $g(x) = x$, it follows from (15-151) that

$$p_k = \frac{1}{Z} e^{-k\lambda} \qquad Z = w + w^2 + \cdots + w^6$$

where $w = e^{-\lambda}$. Hence

$$p_k = \frac{w^k}{w + w^2 + \cdots + w^6} \qquad \frac{w + 2w^2 + \cdots + 6w^6}{w + w^2 + \cdots + w^6} = \eta$$

as in Fig. 15-17. We note that if $\eta = 3.5$, then $p_k = 1/6$.

FIGURE 15-17

Joint density. The MEM can be used to determine the density $f(X)$ of the random vector $\mathbf{X}: [\mathbf{x}_1, \ldots, \mathbf{x}_M]$ subject to the $n$ constraints

$$E\{g_i(\mathbf{X})\} = \eta_i \qquad i = 1, \ldots, n \tag{15-154}$$
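The equation for $w$ in Example 15-22 has no convenient closed-form solution, but it is easy to solve numerically. The sketch below uses bisection on a logarithmic scale, relying on the fact that the left side increases with $w$; the function names, the bracketing interval, and the iteration count are illustrative assumptions, not part of the text.

```python
import math

def die_mean(w):
    """Left side of the mean constraint: sum(k w^k) / sum(w^k), k = 1..6."""
    num = sum(k * w ** k for k in range(1, 7))
    den = sum(w ** k for k in range(1, 7))
    return num / den

def solve_w(eta, lo=1e-9, hi=1e9, iters=200):
    """Bisection (geometric midpoint) for die_mean(w) = eta, with 1 < eta < 6."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if die_mean(mid) < eta:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

def die_probabilities(eta):
    """Maximum entropy probabilities p_k = w^k / Z of the six faces."""
    w = solve_w(eta)
    z = sum(w ** k for k in range(1, 7))
    return [w ** k / z for k in range(1, 7)]

for eta in (2.5, 3.5, 4.5):
    print(eta, die_probabilities(eta))
```

For $\eta = 3.5$ the solver should return $w$ close to 1 and hence nearly equal probabilities; for $\eta > 3.5$ the probabilities increase with $k$, and for $\eta < 3.5$ they decrease.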
Reasoning as in the scalar case, we conclude that

$$f(X) = A\exp\{-\lambda_1 g_1(X) - \cdots - \lambda_n g_n(X)\} \tag{15-155}$$

Second-Order Moments and Normality

We are given the correlation matrix

$$R = E\{\mathbf{X}^t\mathbf{X}\} \tag{15-156}$$

of the random vector $\mathbf{X}$ and we wish to find its density using the MEM. We maintain that $f(X)$ is normal with zero mean as in (8-58):

$$f(X) = \frac{1}{\sqrt{(2\pi)^M\Delta}}\exp\left\{-\tfrac{1}{2}XR^{-1}X^t\right\} \tag{15-157}$$

Proof. The elements $R_{jk} = E\{\mathbf{x}_j\mathbf{x}_k\}$ of $R$ are the expected values of the $M^2$ RVs $g_{jk}(\mathbf{X}) = \mathbf{x}_j\mathbf{x}_k$. Changing the subscript $i$ in (15-154) to a double subscript, we conclude from (15-155) that

$$f(X) = A\exp\left\{-\sum_{j,k} \lambda_{jk} x_j x_k\right\} \tag{15-158}$$

This shows that $f(X)$ is normal. The $M^2$ coefficients $\lambda_{jk}$ can be determined from the $M^2$ constraints in (15-156). As we know [see (8-58)], these coefficients equal the elements of the matrix $R^{-1}/2$ as in (15-157).

The preceding results are acceptable only if the matrix $R$ is p.d. Otherwise, the function $f(X)$ in (15-157) is not a density. The p.d. condition is, of course, satisfied if the given $R$ is a true correlation matrix. However, even then (15-157) might not be acceptable if only a subset of the elements of $R$ is specified. In such cases, it might be necessary, as we shall presently see, to introduce the unspecified elements of $R$ as auxiliary constraints.

Suppose, first, that we are given only the diagonal elements of $R$:

$$E\{\mathbf{x}_i^2\} = R_{ii} \qquad i = 1, \ldots, M \tag{15-159}$$

Inserting the functions $g_{ii}(X) = x_i^2$ into (15-155), we obtain

$$f(X) = A\exp\{-\lambda_{11}x_1^2 - \cdots - \lambda_{MM}x_M^2\} \tag{15-160}$$

This shows that the RVs $\mathbf{x}_i$ are normal, independent, with zero mean and variance $R_{ii} = 1/(2\lambda_{ii})$. The above solution is acceptable because $R_{ii} > 0$. If, however, we are given $N < M^2$ arbitrary joint moments $R_{jk}$, then the corresponding quadratic in (15-158) will contain only the terms $x_j x_k$ corresponding to the given moments. The resulting $f(X)$ might not then be a density.
To find the ME solution for this case, we proceed as follows: We introduce as constraints the $M^2$ joint moments $R_{jk}$ where now only $N$ of these moments are given and the other $M^2 - N$ moments are unknown parameters. Applying the MEM, we obtain
(15-157). The corresponding entropy equals [see (15-111)]

$$H(\mathbf{x}_1, \ldots, \mathbf{x}_M) = \ln\sqrt{(2\pi e)^M\Delta} \qquad \Delta = |R| \tag{15-161}$$

This entropy depends on the unspecified parameters of $R$ and it is maximum if its determinant $\Delta$ is maximum. Thus the RVs $\mathbf{x}_i$ are again normal with density as in (15-157) where the unspecified parameters of $R$ are such as to maximize $\Delta$.

Note. From the above it follows that the determinant $\Delta$ of a correlation matrix $R$ is such that

$$\Delta \le R_{11} \cdots R_{MM}$$

with equality iff $R$ is diagonal. Indeed, (15-159) is a restricted moment set; hence the ME solution (15-160) maximizes $\Delta$.

Stochastic processes. The MEM can be used to determine the statistics of a stochastic process subject to given constraints. We shall discuss the following case. Suppose that $\mathbf{x}_n$ is a WSS process with autocorrelation

$$R[m] = E\{\mathbf{x}_{n+m}\mathbf{x}_n\}$$

We wish to find its various densities assuming that $R[m]$ is specified either for some or for all values of $m$. As we know [see (15-158)], the MEM leads to the conclusion that, in both cases, $\mathbf{x}_n$ must be a normal process with zero mean. This completes the statistical description of $\mathbf{x}_n$ if $R[m]$ is known for all $m$. If, however, we know $R[m]$ only partially, then we must find its unspecified values. For finite-order densities, this involves the maximization of the corresponding entropy with respect to the unknown values of $R[m]$ and it is equivalent to the maximization of the correlation determinant $\Delta$ [see (15-161)]. An important special case is the MEM solution to the extrapolation problem considered in Sec. 13-3. We shall reexamine this problem in the context of the entropy rate.

We start with the simplest case: Given the average power $E\{\mathbf{x}_n^2\} = R[0]$ of $\mathbf{x}_n$, we wish to find its power spectrum. In this case, the entropy of the RVs $\mathbf{x}_n, \ldots, \mathbf{x}_{n+M}$ is maximum if these RVs are normal and independent for any $M$ [see (15-160)], that is, if the process $\mathbf{x}_n$ is normal white noise with $R[m] = R[0]\delta[m]$.
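The simplest case can be illustrated for three consecutive samples, where the correlation matrix is a $3 \times 3$ symmetric Toeplitz matrix with entries $R[0]$, $R[1]$, $R[2]$. The sketch below fixes $R[0] = 1$ and scans a grid of candidate values for the unspecified $R[1]$ and $R[2]$; the grid, its range, and the function name are assumptions chosen for illustration. The determinant, and hence the entropy (15-161), should be largest at $R[1] = R[2] = 0$.

```python
def det3_toeplitz(r0, r1, r2):
    """Determinant of [[r0, r1, r2], [r1, r0, r1], [r2, r1, r0]]."""
    return r0 ** 3 - 2 * r0 * r1 ** 2 + 2 * r1 ** 2 * r2 - r0 * r2 ** 2

r0 = 1.0                                       # given average power R[0]
grid = [k * 0.05 for k in range(-9, 10)]       # candidate values -0.45 ... 0.45

# Search for the unspecified values that maximize the determinant.
best_det, best_r1, best_r2 = max(
    (det3_toeplitz(r0, r1, r2), r1, r2) for r1 in grid for r2 in grid
)
print(best_det, best_r1, best_r2)
```

The maximum found on the grid equals $R[0]^3$, the product of the diagonal elements, in agreement with the Note above.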
Suppose now that we are given the $N + 1$ values (data) $R[0], \ldots, R[N]$ of $R[m]$ and we wish to find the density $f(X)$ of the $M + 1$ RVs $\mathbf{x}_n, \ldots, \mathbf{x}_{n+M}$. If $M \le N$, then the correlation matrix $R_{M+1}$ of $\mathbf{X}$ is specified in terms of the data and $f(X)$ is given by (15-157). This is not the case, however, if $M > N$ because then only the center diagonal and the $N$ upper and lower diagonals of the correlation matrix are known. To complete the specification of $R_{M+1}$, we maximize the determinant $\Delta_{M+1}$ with respect to the unknown values of $R[m]$.

Example 15-23. Given $R[0]$ and $R[1]$, we shall find $R[2]$ using the maximum determinant method. In this case,

$$\Delta_3 = \begin{vmatrix} R[0] & R[1] & R[2] \\ R[1] & R[0] & R[1] \\ R[2] & R[1] & R[0] \end{vmatrix}$$

Hence

$$\frac{\partial\Delta_3}{\partial R[2]} = -2R[0]R[2] + 2R^2[1] = 0 \qquad R[2] = \frac{R^2[1]}{R[0]}$$

THE MEM IN SPECTRAL ESTIMATION. We are given again $R[m]$ for $|m| \le N$. The power spectrum

$$S(\omega) = R[0] + 2\sum_{m=1}^{\infty} R[m]\cos m\omega T$$

of $\mathbf{x}_n$ involves the values of $R[m]$ for every $m$. To find its unspecified values, we maximize the correlation determinant $\Delta_M$ and examine the form of the resulting $R[m]$ as $M \to \infty$. This is equivalent to the maximization of the entropy rate $\bar{H}(\mathbf{x})$ of the process $\mathbf{x}_n$. Using this equivalence, we shall develop a more direct method for determining $S(\omega)$.

As we know, the MEM leads to the conclusion that under the given constraints (second-order moments), the process $\mathbf{x}_n$ must be normal with zero mean. From this and (15-130) it follows that

$$\bar{H}(\mathbf{x}) = \ln\sqrt{2\pi e} + \frac{1}{4\pi}\int_{-\pi}^{\pi} \ln S(\omega)\,d\omega$$

The entropy rate $\bar{H}(\mathbf{x})$ depends on the unspecified values of $R[m]$ and it is maximum if

$$\frac{\partial\bar{H}}{\partial R[m]} = \frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{1}{S(\omega)} e^{-jm\omega T}\,d\omega = 0 \qquad |m| > N \tag{15-162}$$

This shows that the coefficients of the Fourier series expansion of $1/S(\omega)$ are 0
for $|m| > N$. Hence

$$\frac{1}{S(\omega)} = \sum_{k=-N}^{N} c_k e^{-jk\omega T}$$

Factoring the resulting $S(z)$ as in (12-6), we obtain

$$S(\omega) = \frac{1}{|b_0 + b_1 e^{-j\omega T} + \cdots + b_N e^{-jN\omega T}|^2} \tag{15-163}$$

This is the spectrum obtained in Sec. 13-3 [see (13-141)] and it shows that the MEM leads to an AR model. The coefficients $b_k$ can be obtained either from the Yule-Walker equations or from Levinson's algorithm.

Note. The MEM also has applications in nonprobabilistic problems involving the determination of unknown parameters from insufficient data. In such cases, probabilistic models are created where the unknown parameters take the form of statistical variables that are determined with the MEM. We should point out, however, that the results obtained are not unique because more than one model can be used in the same problem. In the following, we illustrate this approach using as an example the one-dimensional form of an important problem in crystallography.

A deterministic application of the MEM. We wish to find a nonnegative periodic function $f(x)$ with period $2\pi$:

$$0 \le f(x) = \sum_{n=-\infty}^{\infty} c_n e^{jnx}$$

having access only to partial information about its Fourier series coefficients

$$c_n = r_n e^{j\varphi_n}$$

The truncation problem. We assume that $c_n$ is known only for $|n| \le N$.

Solution 1. We create the following probabilistic model: In the interval $(-\pi, \pi)$, the unknown function $f(x)$ is the density of an RV $\mathbf{x}$ taking values between $-\pi$ and $\pi$. We determine $f(x)$ so as to maximize the entropy

$$I = -\int_{-\pi}^{\pi} f(x) \ln f(x)\,dx$$

of $\mathbf{x}$. This yields [see (15-141)]

$$f(x) = \exp\left\{-1 + \sum_{n=-N}^{N} \gamma_n e^{jnx}\right\}$$

The constants $\gamma_n$ are determined in terms of the known values of $c_n$.

Solution 2. We assume that $f(x)$ is the power spectrum of a stochastic process $\mathbf{x}_n$ and we determine $f(x)$ so as to maximize the entropy rate (we omit incidental constants)

$$I = \int_{-\pi}^{\pi} \ln f(x)\,dx$$
of $\mathbf{x}_n$. In this case, $f(x)$ is given by [see (15-163)]

$$f(x) = \frac{1}{\left|d_0 + d_1 e^{-jx} + \cdots + d_N e^{-jNx}\right|^2}$$

The constants $d_n$ are again determined in terms of the known values of $c_n$ (Levinson's algorithm).

The phase problem We assume that we know only the amplitudes $r_n$ of $c_n$ for $|n| \le N$. To solve the problem, we form again the integral $I$, either as the entropy or as the entropy rate, and we maximize it with respect to the unknown parameters, which are now the coefficients $c_n$ (amplitudes and phases) for $|n| > N$, and the phases $\varphi_n$ for $|n| \le N$. An equivalent approach involves the determination of $f(x)$ as in the truncation problem, treating the phases as parameters, and the maximization of the resulting $I$ with respect to these parameters. In either case, the required computations are not simple.

15-5 CODING

Coding belongs to a class of problems involving the efficient search and identification of an object $\zeta$ from a set $\mathcal{S}$ of $N$ objects. This topic is extensive and it has many applications. We shall present here merely certain aspects related to entropy and probability, limiting the discussion to binary instantly decodable codes. The underlying ideas can be readily generalized.

Binary coding can also be described in terms of the familiar game of 20 questions: A person selects an object $\zeta$ from a set $\mathcal{S}$. Another person wants to identify the object by asking "yes" or "no" questions. The purpose of the game is to find $\zeta$ using the smallest possible number of questions.

The various search techniques can be described in three equivalent forms: (a) as chains of dichotomies of the set $\mathcal{S}$; (b) in the form of a binary tree; (c) as binary codes (Fig. 15-18). We start with an explanation of these approaches, ignoring for the moment optimality considerations. The criteria for selecting the "best" search method will be developed later.

Set dichotomies. We subdivide the set $\mathcal{S}$ into two nonempty sets $\mathcal{A}_0$ and $\mathcal{A}_1$ (first-generation sets). We subdivide each of the sets $\mathcal{A}_0$ and $\mathcal{A}_1$ into two nonempty sets $\mathcal{A}_{00}$, $\mathcal{A}_{01}$ and $\mathcal{A}_{10}$, $\mathcal{A}_{11}$ (second-generation sets).
We continue with such dichotomies until the final sets consist of a single element each. The indices of the sets of each generation are binary numbers formed by attaching 0 or 1 to the indices of the preceding generation sets.

In Fig. 15-18, we illustrate the above with a set consisting of nine elements. We shall use the chain of sets so formed to identify the element $\zeta_7$ by a sequence of appropriate questions (set dichotomies):

Is it in $\mathcal{A}_0$? No. Is it in $\mathcal{A}_{10}$? No. Is it in $\mathcal{A}_{110}$? Yes. Is it in $\mathcal{A}_{1100}$? Yes.

Hence the unknown element is $\zeta_7$ because $\mathcal{A}_{1100} = \{\zeta_7\}$.

Binary trees. A tree is a simply connected graph consisting of line segments called branches. In a binary tree, each branch splits into two other branches or
it terminates. The points of termination are the endpoints of the tree and the starting point $R$ is its root (Fig. 15-18). A path is a part of the tree from $R$ to an endpoint. The two branches closest to the root are the first-generation branches. They split into two branches each, forming the second generation. Since each branch splits into two or it terminates, the number of branches in each generation is always even. The length of a path is the total number of its branches.

FIGURE 15-18

There is a one-to-one correspondence between set dichotomies and trees. The $k$th-generation sets correspond to the $k$th-generation branches and each set dichotomy to the splitting of the corresponding branch. The terminal sets $\{\zeta_i\}$ correspond to the terminal branches and the elements $\zeta_i$ to the endpoints of the tree. The indices of the sets are also used to identify the corresponding branches, where we use the following convention: When a branch splits, 0 is assigned to the left new branch and 1 to the right. The index of a terminal branch is also used to identify the corresponding endpoint $\zeta_i$. Thus each element $\zeta_i$ of $\mathcal{S}$ is identified by a binary number $x_i$ (Fig. 15-18). The number of digits of $x_i$ equals the length of the path ending at $\zeta_i$. This number also equals the number of questions (dichotomies) required to identify $\zeta_i$.

Binary codes. A binary code is a one-to-one correspondence between the elements $\zeta_i$ of a set $\mathcal{S}$ and the elements $x_i$ of a set $X = \{x_1, x_2, \ldots\}$ of binary numbers. Encoding is the process of constructing such a correspondence. The set $\mathcal{S}$ will be called the source and its elements $\zeta_i$ the source words. The corresponding binary numbers $x_i$ will be called the code words. The binary digits 0 and 1 form the code alphabet. The length $l_i$ of a code word $x_i$ is the total number of its binary digits.
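The correspondence between trees and code words can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tree is represented as nested pairs (a representation chosen here, not taken from the text), the function name `code_words` is hypothetical, and the five-element tree is invented for the example; it is not the tree of Fig. 15-18.

```python
def code_words(tree, prefix=""):
    """Return {element: code word} for a binary tree given as nested pairs.

    A leaf is any non-tuple label; an internal node is a (left, right) tuple.
    Following the convention of the text, 0 is attached for the left branch
    and 1 for the right branch.
    """
    if not isinstance(tree, tuple):
        return {tree: prefix}          # endpoint: its index is the code word
    left, right = tree
    words = code_words(left, prefix + "0")
    words.update(code_words(right, prefix + "1"))
    return words


# Hypothetical five-element source (not the source of Fig. 15-18).
tree = (("z1", "z2"), ("z3", ("z4", "z5")))
words = code_words(tree)

# The length of each code word equals the length of the path to its endpoint,
# that is, the number of yes/no questions needed to identify the element.
lengths = {element: len(word) for element, word in words.items()}
```

Each recursive call corresponds to one set dichotomy, so the depth of the recursion at a leaf is the number of questions asked for that element.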
A message is a sequence of source words

$$\zeta_{i_1}\,\zeta_{i_2}\cdots\zeta_{i_n}\cdots \tag{15-164}$$

The sequence of the corresponding code words

$$x_{i_1}\,x_{i_2}\cdots x_{i_n}\cdots \tag{15-165}$$

is a coded message.

The indices of the terminal elements of a tree or, equivalently, of a chain of set dichotomies, specify a code. Codes can, of course, be formed in other ways; however, other codes will not be considered here. The term code will mean a binary code specified by a tree as above.

In Fig. 15-18, we show the code words $x_i$ of a source $\mathcal{S}$ consisting of $N = 9$ elements, and the corresponding word lengths $l_i$.

THEOREM. If a source $\mathcal{S}$ has $N$ words and the lengths of the corresponding code words equal $l_i$, then

$$\sum_{i=1}^{N} \frac{1}{2^{l_i}} = 1 \tag{15-166}$$

Proof. The last-generation branches of the tree are terminal and they form pairs. The two branches of one such pair are the ends of two paths of length $l_r$ (Fig. 15-19). If they are removed, the tree contracts into a tree with $N - 1$ endpoints. In this operation, the two paths are replaced with one path of length $l_r - 1$ and the two terms $2^{-l_r}$ in (15-166) are replaced with the term $2^{-(l_r - 1)}$.

FIGURE 15-19 Tree contraction
Since

$$2^{-l_r} + 2^{-l_r} = 2^{-(l_r - 1)} \tag{15-167}$$

the sum does not change. Thus the binary length sum in (15-166) is invariant to a contraction. Repeating the process until we are left with only two first-generation branches, we obtain (15-166) because $2^{-1} + 2^{-1} = 1$.

CONVERSE THEOREM. Given $N$ integers $l_i$ satisfying (15-166), we can construct a code with lengths $l_i$.

Proof. It suffices to construct a binary tree with path lengths $l_i$. From (15-166) it follows that if $l_r$ is the largest of the integers $l_i$, then the number $n$ of lengths that equal $l_r$ is even. Using $n = 2m$ segments, we form the $r$th (last) generation branches of our tree. If each of the $m$ pairs of integers $l_r$ is replaced by a single integer $l_r - 1$ and all others are not changed, the resulting set of numbers will satisfy (15-166) [see (15-167)]. We can, therefore, continue this process until we are left with only two terms. These terms yield the two first-generation branches. The above is illustrated in Fig. 15-20 for $N = 8$.

FIGURE 15-20

Decoding. In the earlier discussion, we presented a method for encoding the words $\zeta_i$ of a source $\mathcal{S}$. Encoding of an entire message of the form (15-164) can be obtained by encoding each word successively. The result is a coded message as in (15-165). Decoding is the reverse process: Given a coded message, find the corresponding source message. Since word coding is a one-to-one correspondence between $\zeta_i$ and $x_i$, the decoding of each word of a message is unique. However, an entire message cannot always be so decoded because there is no space separating the code
words (this would require an additional letter in the code alphabet). The problem of separation does not exist for codes constructed through dichotomies (they are, we repeat, the only codes considered here) because such codes have the following property: No code word is the beginning of another code word. This property is a consequence of the fact that in any tree, each path terminates at an endpoint; therefore, it cannot be part of another path. Codes with this property are called "instantaneous" because they are instantly decodable; that is, if we start from the beginning of a message, we can identify in real time the end of each word without any reference to the future.

Example 15-24. We wish to decode the message

10110100001010001011111000000010

formed with the code shown in Fig. 15-18. Starting from the beginning, we identify the code words by underlining them with the help of the table of Fig. 15-18:

10 1101 000 010 10 0010 111 1100 000 0010

The corresponding source message is the sequence

$\zeta_6\ \zeta_8\ \zeta_1\ \zeta_4\ \zeta_6\ \zeta_2\ \zeta_9\ \zeta_7\ \zeta_1\ \zeta_2$

Note We have identified each source word with a single symbol $\zeta_i$. It is possible, however, that $\zeta_i$ might be a grouping of other symbols. For example, the source $\mathcal{S}$ might consist of: all the letters of the English alphabet, certain frequently used words (for instance, the word the), and even a number of common phrases like happy birthday. Such sources are equivalent to single-symbol sources if each word is viewed as a single element.

Optimum Codes

In the absence of prior information, the two subsets of each set dichotomy are so chosen as to have nearly equal numbers of elements. The resulting code lengths are then nearly equal to $\log N$. If, however, prior information is available, then more efficient codes can be constructed. The information is usually given in terms of relative frequencies and it is used to form codes with minimum average length.
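The instant decoding of Example 15-24 can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `decode` is hypothetical, and the code table is an assumption, namely that the code of Fig. 15-18 is the nine-word "arbitrary code" listed in Example 15-25 (it is consistent with the word boundaries marked in Example 15-24).

```python
# Assumed code table: code word -> index i of the source word zeta_i.
CODE = {
    "000": 1, "0010": 2, "0011": 3, "010": 4, "011": 5,
    "10": 6, "1100": 7, "1101": 8, "111": 9,
}


def decode(bits, code=CODE):
    """Decode a string of 0s and 1s formed with an instantaneous code.

    Because no code word is the beginning of another, a word is recognized
    the moment its last digit arrives; no look-ahead is needed.
    """
    indices, word = [], ""
    for digit in bits:
        word += digit
        if word in code:               # end of a code word reached
            indices.append(code[word])
            word = ""
    if word:
        raise ValueError("message ends in the middle of a code word")
    return indices


message = "10110100001010001011111000000010"
decoded = decode(message)   # indices of the source words, in order
```

The loop never backtracks, which is exactly the "no reference to the future" property of instantaneous codes.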
Since relative frequencies are best described in terms of probabilities, we shall assume from now on that the source $\mathcal{S}$ is a probability space.

DEFINITIONS. A random code is a process of assigning to every source word $\zeta_i$ a binary number $x_i$. Since $\zeta_i$ is an element of the probability space $\mathcal{S}$, a random code defines an RV $\mathbf{x}$ such that $\mathbf{x}(\zeta_i) = x_i$.

The length of a random code is an RV $\mathbf{L}$ such that

$$\mathbf{L}(\zeta_i) = l_i \tag{15-168}$$

where $l_i$ is the length of the code word $x_i$ assigned to the element $\zeta_i$.
The expected value of $\mathbf{L}$ is denoted by $L$ and it is called the average length of the random code $\mathbf{x}$. Thus

$$L = E\{\mathbf{L}\} = \sum_i p_i l_i \tag{15-169}$$

where $p_i = P\{\mathbf{x} = x_i\} = P\{\zeta_i\}$.

Optimum code. An optimum code is a code whose average length does not exceed the average length of any other code. A basic objective of coding theory is the determination of such a code. Optimum codes have the following properties:

1. Suppose that $\zeta_a$ and $\zeta_b$ are two elements of $\mathcal{S}$ such that

$$p_a = P\{\zeta_a\} \qquad p_b = P\{\zeta_b\}$$

We maintain that if the code is optimum and

$$p_a > p_b \quad\text{then}\quad l_a \le l_b \tag{15-170}$$

Proof. Suppose that $l_a > l_b$. Interchanging the codes assigned to the elements $\zeta_a$ and $\zeta_b$, we obtain a new code with average length

$$L_1 = L - (p_a l_a + p_b l_b) + (p_a l_b + p_b l_a) = L - (p_a - p_b)(l_a - l_b)$$

And since $(p_a - p_b)(l_a - l_b) > 0$, we conclude that $L_1 < L$. This, however, is impossible because $L$ is the optimum code length; hence $l_a \le l_b$.

Repeated application of (15-170) leads to the conclusion that if

$$p_1 \ge p_2 \ge \cdots \ge p_N \quad\text{then}\quad l_1 \le l_2 \le \cdots \le l_N \tag{15-171}$$

2. The elements (source words) with the two smallest probabilities $p_{N-1}$ and $p_N$ are in the last generation of the tree; that is, their code lengths are $l_{N-1}$ and $l_N$.

Proof. This is a consequence of (15-171) and the fact that the number of branches in each generation is even.

The following basic theorem shows the relationship between the entropy

$$H(\mathfrak{S}) = -\sum_{i=1}^{N} p_i \log p_i$$

of the source word partition $\mathfrak{S}$ and the average length $L$ of an arbitrary random code $\mathbf{x}$.

THEOREM.

$$H(\mathfrak{S}) \le L \tag{15-172}$$

Proof. As we have seen from (15-166), if $l_i$ are the lengths of the code words of $\mathbf{x}$ and $q_i = 1/2^{l_i}$, then the sum of the $q_i$'s equals 1. With $a_i = p_i$ and $b_i = q_i$ it
follows, therefore, from (15-37) that

$$-\sum_i p_i \log p_i \le -\sum_i p_i \log q_i = \sum_i p_i l_i = L \tag{15-173}$$

and (15-172) results.

In general, $H(\mathfrak{S}) < L$. We maintain, however, that $H(\mathfrak{S}) = L$ iff the probabilities $p_i$ are binary decimals, that is, iff $p_i = 1/2^{n_i}$.

Proof. If $H(\mathfrak{S}) = L$, then (15-173) is an equality; hence $p_i = q_i = 1/2^{l_i}$ [see (15-37)] and our assertion is true because the lengths $l_i$ are integers. Conversely, if $p_i = 1/2^{n_i}$ and $n_i$ are integers, then we can construct a code with lengths $l_i = n_i$ because the sum of the $p_i$'s equals 1. The length $L$ of this code equals $H(\mathfrak{S})$. In other words, if all $p_i$'s are binary decimals, then the code with lengths $l_i = n_i$ is optimum.

Shannon, Fano, and Huffman Codes

The preceding theorem gives us a lower bound for the average code length $L$ but it does not say how close we can come to this bound. At the end of the section we show that, if we encode not each word but an entire message, then we can construct codes with average length per word less than $H(\mathfrak{S}) + \varepsilon$ for any $\varepsilon > 0$.

In the following, we present three well-known codes including the optimum code (Huffman). The description of these codes is clarified in Example 15-25.

The Shannon code. As we noted, if all probabilities $p_i$ are binary decimals, then the code with lengths $l_i = -\log p_i$ is optimum. Guided by this, we shall construct a code for all other cases. Each $p_i$ specifies an integer $n_i$ such that

$$\frac{1}{2^{n_i}} \le p_i < \frac{1}{2^{n_i - 1}} \tag{15-174}$$

where $p_i > 1/2^{n_i}$ for at least one $p_i$ (assumption). With $n_m$ the largest of the integers $n_i$, it follows from the above that

$$\sum_{i=1}^{N} \frac{1}{2^{n_i}} \le 1 - \frac{1}{2^{n_m}} \tag{15-175}$$

because the left side is smaller than 1 and it is an integer multiple of $1/2^{n_m}$. If, therefore, $n_m$ is changed to $n_m - 1$, the resulting value of the sum in (15-175) will not exceed 1. We continue the process of reducing the largest integer by 1 until we reach a set of integers $l_i$ such that

$$\sum_{i=1}^{N} \frac{1}{2^{l_i}} = 1 \qquad l_i \le n_i \tag{15-176}$$
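The reduction procedure leading from (15-174) to (15-176) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `shannon_lengths` is hypothetical, the probabilities are assumed to be listed in descending order, and ties among the largest integers are broken in favor of the most probable word (a choice made here so that the lengths stay ordered as in (15-171); the text does not spell out the tie rule). The probabilities are those of Example 15-25.

```python
from fractions import Fraction
from math import ceil, log2


def shannon_lengths(p):
    """Code lengths of the Shannon code for probabilities p (descending order)."""
    # n_i is the integer with 1/2**n_i <= p_i < 1/2**(n_i - 1), as in (15-174).
    n = [ceil(-log2(pi)) for pi in p]
    # Exact arithmetic avoids rounding problems when testing the sum against 1.
    kraft = sum(Fraction(1, 2 ** ni) for ni in n)
    while kraft < 1:
        m = n.index(max(n))                  # first (most probable) largest n_i
        kraft += Fraction(1, 2 ** n[m])      # 2**-(n-1) - 2**-n = 2**-n
        n[m] -= 1
    return n


p = [0.22, 0.19, 0.15, 0.12, 0.08, 0.07, 0.07, 0.06, 0.04]
lengths = shannon_lengths(p)
average_length = sum(pi * li for pi, li in zip(p, lengths))
```

Since the sum is always a multiple of the smallest term, each step raises it by exactly that term and the loop stops exactly at 1, which is the argument behind (15-175).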
With this set of integers we construct a code and we denote by $L_a$ its average length. Thus

$$L_a = \sum_{i=1}^{N} p_i l_i \le \sum_{i=1}^{N} p_i n_i$$

We maintain that

$$H(\mathfrak{S}) \le L_a < H(\mathfrak{S}) + 1 \tag{15-177}$$

Proof. From (15-174) it follows that $n_i < -\log p_i + 1$. Multiplying by $p_i$ and adding, we obtain

$$\sum_{i=1}^{N} p_i n_i < \sum_{i=1}^{N} p_i(-\log p_i + 1) = H(\mathfrak{S}) + 1$$

and (15-177) results [see (15-172)].

The Fano code. We shall describe this code in terms of set dichotomies based on the following rule of subdivision. We number the probabilities $p_i$ in descending order

$$p_1 \ge p_2 \ge \cdots \ge p_N \tag{15-178}$$

and we select the sets $\mathcal{A}_0$ and $\mathcal{A}_1$ of the first generation so as to have equal or nearly equal probabilities. To do so, we determine the largest $k$ such that

$$p_1 + \cdots + p_k \le 0.5 \le p_{k+1} + \cdots + p_N$$

and we set $\mathcal{A}_0$ equal to $\{\zeta_1, \ldots, \zeta_k\}$ or to $\{\zeta_1, \ldots, \zeta_{k+1}\}$. The same rule is used in all subsequent subdivisions. As we see in Example 15-25, the length $L_b$ of the resulting code is close to the Shannon code length $L_a$. We note that, since there is an ambiguity in the choice of the subsets in each dichotomy, the Fano code is not unique.

The Huffman code. We denote by $\mathbf{x}_N^o$ the optimum $N$-element code and by $L_N^o$ its average length. We shall determine $\mathbf{x}_N^o$ using the following operation: We arrange the probabilities $p_i$ of the elements $\zeta_i$ of $\mathcal{S}$ in descending order as in (15-178) and we number the corresponding elements accordingly. We then replace the last two elements $\zeta_{N-1}$ and $\zeta_N$ with a new element and we assign to this element the probability $p_{N-1} + p_N$. A new source results with $N - 1$ elements. This operation will be called Huffman contraction. In the table of Example 15-25, the new element is identified by a box in which the replaced elements are shown. Rearranging the probabilities of the new source in descending order, we repeat the above operation until we reach a set with only two elements.
To each element $\zeta_i$ of the source $\mathcal{S}$, we shall assign a code word $x_i$ starting from the last digit: We assign the numbers 0 and 1 respectively to the last digits of the code words of the elements $\zeta_{N-1}$ and $\zeta_N$. At each subsequent contraction, we assign the numbers 0 and 1 to the left of the partially completed code words of all elements that are included in the last two boxes.
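The Huffman contraction and the right-to-left growth of the code words can be sketched as follows. This is a minimal sketch, not the book's tabular procedure: instead of re-sorting the list after every contraction it keeps the elements in a priority queue (Python's `heapq`), which selects the same two smallest probabilities at each step. The function name `huffman` and the tie-breaking counter are choices made for the illustration; the probabilities are those of Example 15-25.

```python
import heapq
from itertools import count


def huffman(p):
    """Return the list of Huffman code words for the probabilities p."""
    tie = count()                        # breaks ties so lists are never compared
    heap = [(pi, next(tie), [i]) for i, pi in enumerate(p)]
    heapq.heapify(heap)
    words = [""] * len(p)
    while len(heap) > 1:
        p_last, _, box_last = heapq.heappop(heap)   # smallest probability
        p_prev, _, box_prev = heapq.heappop(heap)   # second smallest
        # Attach 0 and 1 to the LEFT of the partially completed code words
        # of every original element contained in the two boxes.
        for i in box_prev:
            words[i] = "0" + words[i]
        for i in box_last:
            words[i] = "1" + words[i]
        # Huffman contraction: the two elements become one new element (box).
        heapq.heappush(heap, (p_prev + p_last, next(tie), box_prev + box_last))
    return words


p = [0.22, 0.19, 0.15, 0.12, 0.08, 0.07, 0.07, 0.06, 0.04]
words = huffman(p)
lengths = [len(w) for w in words]
average_length = sum(pi * li for pi, li in zip(p, lengths))
```

When several probabilities are equal, different tie-breaking rules can give different code words, but the average length is the same for all of them.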
The code so formed (Huffman) will be denoted by $\mathbf{x}_N^c$ and its average length by $L_N^c$. We shall show that this code is optimal.

Proof. The proof of the optimality is based on the following observation. We can readily see that the last two code words $x_{N-1}$ and $x_N$ have the same length $l_r$. In Example 15-25,

$$N = 9 \qquad x_8 = 00000 \qquad x_9 = 00001 \qquad l_r = 5$$

If we replace these two words with a single word consisting of their common part, we obtain the Huffman code $\mathbf{x}_{N-1}^c$ for the set of $N - 1$ elements and the code length of the new element equals $l_r - 1$. This leads to the conclusion that

$$L_N^c - (p_{N-1} + p_N)\,l_r = L_{N-1}^c - (p_{N-1} + p_N)(l_r - 1)$$

Hence

$$L_N^c = L_{N-1}^c + p_{N-1} + p_N \tag{15-179}$$

In the example

$$L_9^c = \sum_{i=1}^{7} p_i l_i + 5p_8 + 5p_9 \qquad L_8^c = \sum_{i=1}^{7} p_i l_i + 4(p_8 + p_9)$$

Induction The Huffman code is optimum for $N = 2$ because there is only one code with two words. We assume that it is optimum for every source with $k \le N - 1$ elements and we shall show that it is optimum for $k = N$. Suppose that there is an $N$-element source for which this is not true, that is, suppose that

$$L_N^o < L_N^c \tag{15-180}$$

As we know, the two elements $\zeta_{N-1}$ and $\zeta_N$ with the smallest probabilities are in the last-generation branches of the optimum code tree. If they are removed, the contracted tree specifies a new code with length $L_{N-1}$. Reasoning as in (15-179), we conclude with (15-180) that

$$L_{N-1} + p_{N-1} + p_N = L_N^o < L_N^c = L_{N-1}^c + p_{N-1} + p_N$$

hence $L_{N-1} < L_{N-1}^c$. But this is impossible because the Huffman code of order $N - 1$ is optimum by assumption.

Example 15-25. We shall describe the above codes using as source a set $\mathcal{S}$ with nine elements. Their probabilities are shown in the table below:

i      1     2     3     4     5     6     7     8     9
p_i    0.22  0.19  0.15  0.12  0.08  0.07  0.07  0.06  0.04

The resulting entropy equals

$$H(\mathfrak{S}) = -\sum_{i=1}^{9} p_i \log p_i = 2.971$$
Arbitrary code We form a code using a chain of dichotomies chosen arbitrarily as in Fig. 15-19. In the table below we show the code words and their lengths.

i      1     2     3     4     5     6     7     8     9
x_i    000   0010  0011  010   011   10    1100  1101  111
l_i    3     4     4     3     3     2     4     4     3

$$L = \sum_{i=1}^{9} p_i l_i = 3.40$$

Shannon code In the table below we show the integers $n_i$ determined from (15-174) and the required reductions until the final lengths $l_i$ are reached. The last column shows the sum of the terms $1/2^{n_i}$ of each row. The corresponding code tree is shown in Fig. 15-20.

p_i    0.22  0.19  0.15  0.12  0.08  0.07  0.07  0.06  0.04    sum
n_i    3     3     3     4     4     4     4     5     5       11/16
       3     3     3     3     3     4     4     4     4       14/16
l_i    3     3     3     3     3     3     3     4     4       1
x_i    000   001   010   011   100   101   110   1110  1111

$$L_a = 3.1$$

Fano code In the table below we show the subsets obtained with the Fano dichotomies, and their probabilities. The last-generation sets are the elements $\zeta_i$ of $\mathcal{S}$; their probabilities are shown on the first row of the table. The dichotomies start with

$$P(\mathcal{A}_0) = 0.22 + 0.19 + 0.15 = 0.56$$

i            1     2     3     4     5     6     7     8     9
p_i          0.22  0.19  0.15  0.12  0.08  0.07  0.07  0.06  0.04

Generation 1:  A_0 = {1, 2, 3}  0.56      A_1 = {4, ..., 9}  0.44
Generation 2:  A_00 = {1}       A_01 = {2, 3}  0.34      A_10 = {4, 5}  0.20      A_11 = {6, 7, 8, 9}  0.24
Generation 3:  A_010 = {2}      A_011 = {3}    A_100 = {4}    A_101 = {5}    A_110 = {6, 7}  0.14    A_111 = {8, 9}  0.10
Generation 4:  A_1100 = {6}     A_1101 = {7}   A_1110 = {8}   A_1111 = {9}

x_i          00    010   011   100   101   1100  1101  1110  1111
l_i          2     3     3     3     3     4     4     4     4

$$L_b = 3.02$$

Optimum code In the table below we show the original set $\mathcal{S}_9$ consisting of nine elements and the sets $\mathcal{S}_k$ obtained with each Huffman contraction. The elements $\zeta_i$ are identified by their indices and the combined elements by boxes. Each box contains all elements $\zeta_i$ of the original source involved in each contraction, and the evolution of their code words $x_i$ starting with the last digit. The rows below each $\mathcal{S}_k$ line show the probabilities of the various elements of $\mathcal{S}_k$. For example, the number 0.10 in the line below $\mathcal{S}_7$ is the probability of the box (element of $\mathcal{S}_7$) that contains the elements $\zeta_8$ and $\zeta_9$.
The column at the extreme right shows the sum of the two smallest probabilities of the elements in $\mathcal{S}_k$. This number is used to form the row $\mathcal{S}_{k-1}$ by reordering the elements of $\mathcal{S}_k$.

Evolution of Huffman code (boxes are written in brackets; each entry i:w shows element $\zeta_i$ and its partially completed code word w)

S_9    1  2  3  4  5  6  7  8  9
p_i,9  0.22  0.19  0.15  0.12  0.08  0.07  0.07  0.06  0.04        sum 0.10

S_8    1  2  3  4  [8:0 9:1]  5  6  7
p_i,8  0.22  0.19  0.15  0.12  0.10  0.08  0.07  0.07              sum 0.14

S_7    1  2  3  [6:0 7:1]  4  [8:0 9:1]  5
p_i,7  0.22  0.19  0.15  0.14  0.12  0.10  0.08                    sum 0.18

S_6    1  2  [8:00 9:01 5:1]  3  [6:0 7:1]  4
p_i,6  0.22  0.19  0.18  0.15  0.14  0.12                          sum 0.26

S_5    [6:00 7:01 4:1]  1  2  [8:00 9:01 5:1]  3
p_i,5  0.26  0.22  0.19  0.18  0.15                                sum 0.33

S_4    [8:000 9:001 5:01 3:1]  [6:00 7:01 4:1]  1  2
p_i,4  0.33  0.26  0.22  0.19                                      sum 0.41

S_3    [1:0 2:1]  [8:000 9:001 5:01 3:1]  [6:00 7:01 4:1]
p_i,3  0.41  0.33  0.26                                            sum 0.59

S_2    [8:0000 9:0001 5:001 3:01 6:100 7:101 4:11]  [1:0 2:1]
p_i,2  0.59  0.41                                                  sum 1

S_1    [8:00000 9:00001 5:0001 3:001 6:0100 7:0101 4:011 1:10 2:11]

The completed code words $x_i$ taken from the last line of the table and their code lengths $l_i$ are listed below.

i      1     2     3     4     5     6     7     8      9
x_i    10    11    001   011   0001  0100  0101  00000  00001
l_i    2     2     3     3     4     4     4     5      5

$$L^o = 3.01$$

The Shannon Coding Theorem

In the earlier discussion, we considered only codes of the elements $\zeta_i$ of a set $\mathcal{S}$ and we showed that the average length $L^o$ of the optimum code is between $H(\mathfrak{S})$ and $H(\mathfrak{S}) + 1$:

$$H(\mathfrak{S}) \le L^o < H(\mathfrak{S}) + 1 \tag{15-181}$$
This follows from (15-172) and (15-177). We show next that if we encode not merely single words but entire messages, then the code length per word can be reduced to less than $H(\mathfrak{S}) + \varepsilon$ for any $\varepsilon > 0$.

A message of length $n$ is any element of the product space $\mathcal{S}^n$. The number of such messages is $N^n$ and a code of the space $\mathcal{S}^n$ is a correspondence between its elements and a set of $N^n$ binary numbers. This correspondence defines the RV $\mathbf{x}_n$ (random code) on the space $\mathcal{S}^n$ and the lengths of the code words form another RV $\mathbf{L}_n$ (random code length). The expected value $L_n$ of $\mathbf{L}_n$ is the average code length. From the definition it follows that $L_n$ is the average number of digits required to encode the elements of $\mathcal{S}^n$. The ratio

$$\bar{L} = \frac{L_n}{n} \tag{15-182}$$

is the average code length per word. The term word means, of course, an element of $\mathcal{S}$.

We shall assume that $\mathcal{S}^n$ is the space of $n$ independent trials.

THEOREM. We can construct a code of the space $\mathcal{S}^n$ such that

$$H(\mathfrak{S}) \le \bar{L} < H(\mathfrak{S}) + \frac{1}{n} \tag{15-183}$$

Proof. We shall give two proofs. The first is a direct consequence of (15-181). The second is based on the concept of typical sequences.

1. Applying the earlier results to the source $\mathcal{S}^n$, we construct a code with average length $L_n$ such that

$$H(\mathfrak{S}^n) \le L_n < H(\mathfrak{S}^n) + 1 \tag{15-184}$$

This yields (15-183) because $L_n = n\bar{L}$ and $H(\mathfrak{S}^n) = nH(\mathfrak{S})$ [see (15-67)].

2. As we know, the space $\mathcal{S}^n$ can be divided into two sets: the set $T$ of all typical sequences and the set $\bar{T}$ of all rare sequences. To prove (15-183), we construct a code tree consisting of $2^{nH(\mathfrak{S})} - 1$ short paths of length $l_t = nH(\mathfrak{S})$ and $2^{l_r}$ paths of length $l_t + l_r$. The short paths are used as the code words of the typical sequences and the long paths as the code words of the rare sequences (Fig. 15-21). Since $P(T) \simeq 1$ and $P(\bar{T}) \simeq 0$, we conclude that the average length of the resulting code equals

$$L_n \simeq l_t P(T) + (l_t + l_r)P(\bar{T}) \simeq l_t = nH(\mathfrak{S})$$

Thus $\bar{L} \simeq H(\mathfrak{S})$ and (15-183) results.

We note that (15-184) holds even if the trials are not independent.
In this case, the theorem is true if $H(\mathfrak{S})$ is replaced by $H(\mathfrak{S}^n)/n$.
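The effect of encoding entire messages can be checked numerically with a short sketch. The assumptions are the following: the source is a hypothetical two-word source with probabilities 0.9 and 0.1 (not a source from the text), the trials are independent, the code of the product space is the Huffman code, and the function names are invented for the illustration. The average length of a Huffman code is computed as the sum of the probabilities formed by the successive contractions, which follows from repeated application of (15-179).

```python
import heapq
from itertools import product
from math import log2, prod


def huffman_average_length(probs):
    """Average length of the Huffman code: sum of all contracted probabilities."""
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged                 # each contraction adds p_{N-1} + p_N
        heapq.heappush(heap, merged)
    return total


def per_word_length(p, n):
    """Average code length per word when messages of n words are encoded."""
    # Independent trials: the probability of a message is the product
    # of the probabilities of its words.
    message_probs = [prod(message) for message in product(p, repeat=n)]
    return huffman_average_length(message_probs) / n


p = [0.9, 0.1]                          # hypothetical source
H = -sum(pi * log2(pi) for pi in p)     # entropy in bits per word
per_word = {n: per_word_length(p, n) for n in (1, 2, 3, 4)}
```

For this source the per-word length starts at 1 digit for n = 1 and moves toward the entropy (about 0.47 bit) as n grows, always staying below H + 1/n.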
FIGURE 15-21

15-6 CHANNEL CAPACITY

We wish to transmit a message from point A to point B by means of a communications channel (a telephone cable, for example). The message to be transmitted is a stationary process $\mathbf{x}_n$ generating at the receiving end another process $\mathbf{y}_n$. The output $\mathbf{y}_n$ depends not only on the input $\mathbf{x}_n$ but also on the nature of the channel. Our objective is to determine the maximum rate of information that can be transmitted through the channel.

To simplify the discussion, we make the following assumptions:

1. The channel is binary; that is, the input $\mathbf{x}_n$ and the output $\mathbf{y}_n$ take only the values 0 and 1.
2. The channel is memoryless; that is, the present value of $\mathbf{y}_n$ depends only on the present value of $\mathbf{x}_n$.
3. The input $\mathbf{x}_n$ is strictly white noise. From assumptions 2 and 3 it follows that $\mathbf{y}_n$ is also white noise.
4. The messages are transmitted at the rate of one word per second. This is a mere normalization stating that the duration $T$ of each transmitted state equals one second.

Example 15-26. In Fig. 15-22 we show a simple realization of a channel as a system with input $\mathbf{x}_n$ and output $\mathbf{y}_n$. The input to the physical channel is a time signal $\mathbf{x}(t)$ taking the values $E$ and $-E$ (binary transmission). These values correspond to the two states 1 and 0 of $\mathbf{x}_n$. The received signal $\mathbf{y}(t)$ is a distorted version of $\mathbf{x}(t)$ contaminated possibly by noise. The system output $\mathbf{y}_n$ is obtained by some decision rule (detector) translating the time signal $\mathbf{y}(t)$ into a discrete-time signal consisting of 0's and 1's.
FIGURE 15-22

Noiseless Channel

We shall say that a channel is noiseless† if there is a one-to-one correspondence between the input $\mathbf{x}_n$ and the output $\mathbf{y}_n$. For a binary channel this means that if $\mathbf{x}_n = 0$, then $\mathbf{y}_n = 0$; if $\mathbf{x}_n = 1$, then $\mathbf{y}_n = 1$.

In a given channel, the uncertainty per transmitted word equals the entropy rate $\bar{H}(\mathbf{x}) = H(\mathbf{x})$ of the input $\mathbf{x}_n$. If the channel is noiseless, then the observed output $\mathbf{y}_n$ determines $\mathbf{x}_n$ uniquely; hence it removes this uncertainty. Thus the rate of transmitted information equals $H(\mathbf{x})$.

Definition of channel capacity. The maximum value of $H(\mathbf{x})$, as $\mathbf{x}_n$ ranges over all possible inputs, is denoted by $C$ and is called the channel capacity

$$C = \max_{\mathbf{x}_n} H(\mathbf{x}) \tag{15-185}$$

It appears that $C$ does not depend on the channel but that is not so because the channel determines the number of the input states. If it is binary, then $\mathbf{x}_n$ has two possible states with probabilities $p$ and $q = 1 - p$, respectively; hence

$$H(\mathbf{x}) = -p\log p - (1 - p)\log(1 - p) = r(p) \tag{15-186}$$

where $r(p)$ is the function of Fig. 15-2. Since $r(p)$ is maximum for $p = 0.5$ and $r(0.5) = 1$, we conclude that the capacity of a binary noiseless channel equals 1 bit/s. Similarly, if the channel accepts $N$ input states, then its capacity equals $\log N$ bit/s.

Rate of information transmission. We repeat: The channel transmits messages at the rate of 1 word/s. It transmits information at the rate of $H(\mathbf{x})$ bits/s. This rate depends on the source and it is maximum if the two states of the source are equally likely.

†This definition does not lead to any conclusion about the actual presence of noise in the channel.
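The statement that $r(p)$ of (15-186) is maximum for $p = 0.5$ can be checked with a small numerical sketch. The grid search and the names `r`, `grid`, `p_best`, and `capacity` are choices made for the illustration; logarithms are taken to base 2 so that the result is in bits.

```python
from math import log2


def r(p):
    """Entropy (in bits) of a two-state partition with probabilities p, 1 - p."""
    if p in (0.0, 1.0):
        return 0.0                      # limiting value: no uncertainty
    return -p * log2(p) - (1 - p) * log2(1 - p)


# Search over a grid of input probabilities p = P{x_n = 0}.
grid = [k / 1000 for k in range(1001)]
p_best = max(grid, key=r)               # input probability maximizing H(x)
capacity = r(p_best)                    # capacity of the binary noiseless channel
```

The maximum is reached when the two input states are equally likely, in agreement with the remark on the rate of information transmission.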
FIGURE 15-23

THEOREM. The maximum rate of 1 bit/s can be reached even if the input $\mathbf{x}_n$ is arbitrary, provided that it is properly encoded prior to transmission.

Proof. 1. An $m$-word message is a binary number with $m$ digits. There are $2^m$ such messages forming the space $\mathcal{S}^m$ and every realization of the input $\mathbf{x}_n$ is a sequence of such messages. We encode optimally the space $\mathcal{S}^m$ into a set of binary numbers $\bar{\mathbf{x}}_n$ using the techniques of the last section (Fig. 15-23). The number of digits (code length) of each $\bar{\mathbf{x}}_n$ is an RV $\mathbf{L}_m$ with mean $L_m = E\{\mathbf{L}_m\}$. As we know,

$$mH(\mathbf{x}) \le L_m < mH(\mathbf{x}) + 1 \tag{15-187}$$

Hence $L_m \simeq mH(\mathbf{x})$ for large $m$. A code word $\bar{\mathbf{x}}_n$ requires $\mathbf{L}_m$ seconds to be transmitted because it consists of $\mathbf{L}_m$ binary digits. Hence the average time required to transmit the $m$-word messages of $\mathbf{x}_n$ in code form equals $L_m \simeq mH(\mathbf{x})$ seconds. And since the information contained in each message equals $mH(\mathbf{x})$ bits, we conclude that the average rate of information transmission equals $mH(\mathbf{x})/mH(\mathbf{x}) = 1$ bit/s.

Proof. 2. We have $2^m$ messages of length $m$. In a direct transmission (not encoded), each message requires the same transmission time: $m$ seconds. However, of all these messages, only $2^{mH(\mathbf{x})}$ are likely to occur (typical sequences). To reduce the time of transmission, we encode all typical sequences into words of length $l_t$ as in Fig. 15-21. The rare sequences require longer codes; however, the probability of their occurrence is negligible. Hence the average time of transmission of each message is reduced from $m$ seconds to $mH(\mathbf{x})$ seconds.

Noisy Channel

Due to a variety of factors, a physical channel establishes not a functional but a statistical relationship between the input $\mathbf{x}_n$ and the output $\mathbf{y}_n$. For a binary
channel, this relationship is completely specified in terms of the probabilities

$$P\{\mathbf{x}_n = 0\} = p \qquad P\{\mathbf{x}_n = 1\} = q$$

of the two states of the input, and the conditional probabilities

$$P\{\mathbf{y}_n = j \mid \mathbf{x}_n = i\} = \pi_{ij} \qquad i, j = 0, 1 \tag{15-188}$$

The probabilities of the output states are given by

$$P\{\mathbf{y}_n = 0\} = \pi_{00}\,p + \pi_{10}\,q \qquad P\{\mathbf{y}_n = 1\} = \pi_{01}\,p + \pi_{11}\,q \tag{15-189}$$

DEFINITION. A noisy channel is a random system establishing a statistical relationship between the input $\mathbf{x}_n$ and the output $\mathbf{y}_n$. For a memoryless channel, this relationship is completely specified in terms of the channel matrix $\Pi$ whose elements $\pi_{ij}$ are the conditional probabilities between the input states and the output states. For a binary channel

$$\Pi = \begin{bmatrix} \pi_{00} & \pi_{01} \\ \pi_{10} & \pi_{11} \end{bmatrix} \qquad\text{where}\qquad \pi_{00} + \pi_{01} = 1 \quad \pi_{10} + \pi_{11} = 1 \tag{15-190}$$

The channel is called symmetrical if $\pi_{00} = \pi_{11} = 1 - \beta$ and if $\pi_{10} = \pi_{01} = \beta$. In a symmetrical channel,

$$\Pi = \begin{bmatrix} 1 - \beta & \beta \\ \beta & 1 - \beta \end{bmatrix} \tag{15-191}$$

Example 15-27. To give some idea of the nature of the channel matrix, we show in Fig. 15-24 a simple version of a symmetrical channel. The input $\mathbf{x}(t)$ is a time signal as in Example 15-26, and the resulting output $\mathbf{y}(t)$ is the sum

$$\mathbf{y}(t) = \mathbf{x}(t) + \boldsymbol{\nu}_n \qquad nT < t < nT + T \tag{15-192}$$

where $\boldsymbol{\nu}_n$ is a sequence of independent RVs with density the even function $f(\nu)$. The output states are determined as follows:

$$\mathbf{y}_n = \begin{cases} 1 & \text{if } \mathbf{y}(t) \ge 0 \\ 0 & \text{if } \mathbf{y}(t) < 0 \end{cases}$$

FIGURE 15-24
From this we conclude that the channel is symmetrical and

$$\beta = P\{\mathbf{y}_n = 1 \mid \mathbf{x}_n = 0\} = \int_0^{\infty} f(\nu + E)\,d\nu = P\{\boldsymbol{\nu} > E\}$$

Channel capacity. Prior to transmission, the uncertainty about the input $\mathbf{x}_n$ equals $H(\mathbf{x})$ per word. In a noiseless channel, the observed output $\mathbf{y}_n$ reduces the uncertainty to 0. This is not so, however, for a noisy channel because $\mathbf{y}_n$ does not determine $\mathbf{x}_n$ uniquely. Knowledge of $\mathbf{y}_n$ reduces the uncertainty about $\mathbf{x}_n$ from $H(\mathbf{x})$ to $H(\mathbf{x} \mid \mathbf{y})$ and the difference

$$I(\mathbf{x}, \mathbf{y}) = H(\mathbf{x}) - H(\mathbf{x} \mid \mathbf{y}) \tag{15-193}$$

is the rate of information transmission.† If the channel is noiseless, then $H(\mathbf{x} \mid \mathbf{y}) = 0$; hence $I(\mathbf{x}, \mathbf{y}) = H(\mathbf{x})$. If the output $\mathbf{y}_n$ is independent of the input, then $H(\mathbf{x} \mid \mathbf{y}) = H(\mathbf{x})$; hence $I(\mathbf{x}, \mathbf{y}) = 0$. In other words, such a channel is useless (it does not transmit any information).

DEFINITION. The function $I(\mathbf{x}, \mathbf{y})$ depends on the matrix $\Pi$ and on the input $\mathbf{x}_n$. The capacity $C$ of a noisy channel is the maximum value of $I(\mathbf{x}, \mathbf{y})$ as $\mathbf{x}_n$ ranges over all possible inputs

$$C = \max_{\mathbf{x}_n} I(\mathbf{x}, \mathbf{y}) \tag{15-194}$$

This is consistent with (15-185) because, for noiseless channels, $I(\mathbf{x}, \mathbf{y}) = H(\mathbf{x})$.

Example 15-28. We shall show that the capacity of a binary symmetrical channel with channel matrix as in (15-191) (Fig. 15-25) equals

$$C = 1 - r(\beta) \qquad\text{where}\qquad r(p) = -p\log p - q\log q \tag{15-195}$$

Proof. The entropy of a two-state partition equals $r(p)$ where $p$ is the probability of one of the states. Thus the entropy $H(\mathbf{x})$ of the input to the channel equals $r(p)$ and the entropy of the output equals

$$H(\mathbf{y}) = r(\gamma) \qquad \gamma = (1 - 2\beta)p + \beta \tag{15-196}$$

because [see (15-189)]

$$P\{\mathbf{y}_n = 0\} = (1 - \beta)p + \beta(1 - p) = \gamma$$

The above holds also for conditional entropies. Thus, since

$$P\{\mathbf{y}_n = 0 \mid \mathbf{x}_n = 0\} = P\{\mathbf{y}_n = 1 \mid \mathbf{x}_n = 1\} = 1 - \beta$$

we conclude that

$$H(\mathbf{y} \mid \mathbf{x}_n = 0) = H(\mathbf{y} \mid \mathbf{x}_n = 1) = r(1 - \beta)$$

†The conditional entropy $H(\mathbf{x} \mid \mathbf{y})$ is Shannon's equivocation.
FIGURE 15-25 Binary symmetrical channel

Inserting into (15-41) and using the fact that $r(\beta) = r(1 - \beta)$, we obtain

$$H(\mathbf{y} \mid \mathbf{x}) = p\,r(\beta) + q\,r(\beta) = r(\beta)$$

Since $I(\mathbf{x}, \mathbf{y}) = H(\mathbf{x}) - H(\mathbf{x} \mid \mathbf{y}) = H(\mathbf{y}) - H(\mathbf{y} \mid \mathbf{x})$, it follows from the above that $I(\mathbf{x}, \mathbf{y}) = r(\gamma) - r(\beta)$. This yields (15-195) because $r(\beta)$ does not depend on $p$ and $r(\gamma)$ is maximum if $\gamma = 0.5$.

Redundant and random codes Consider a set $\mathcal{A}$ (source) with $N$ elements and a set $\mathcal{B}$ (code) with $M$ elements where $N < M$. A redundant code is a one-to-one correspondence between the elements of $\mathcal{A}$ and the elements of a subset $\mathcal{B}_1$ of $\mathcal{B}$. The subset $\mathcal{B}_1$ consists of $N$ elements that can be selected in many ways. If the elements of $\mathcal{B}_1$ are chosen at random from the $M$ elements of $\mathcal{B}$, the resulting code is called random.‡ From the definition it follows that the probability that a specific element of $\mathcal{B}$ is in the randomly selected set $\mathcal{B}_1$ equals $N/M$.

In the next example we show that redundant encoding can be used to reduce the probability of error in transmission.

Example 15-29. In a symmetrical channel, the probability of error equals $\beta$. To reduce this error, we encode the input set $\mathcal{A} = \{0, 1\}$ into the subset $\mathcal{B}_1 = \{000, 111\}$ of the set $\mathcal{B}$ of all three-digit binary numbers. In the earlier notation, $N = 2$ and $M = 8$. The input $\mathbf{x}_n$ is thus encoded into a signal $\bar{\mathbf{x}}_n$ consisting of triplets of 0's and 1's yielding as output a signal $\bar{\mathbf{y}}_n$ (Fig. 15-26). The decoding scheme is the majority rule: If the received triplet consists of at least two 0's, then $\mathbf{y}_n = 0$; otherwise $\mathbf{y}_n = 1$.

‡This definition of a random code is not the definition given in Sec. 15-5 (page 583).
FIGURE 15-26

It can be readily seen that (Prob. 15-23) the probability that a transmitted word will be detected incorrectly equals $\beta^2(3 - 2\beta)$. This is less than $\beta$ if $\beta < 0.5$. However, the rate of transmission is also reduced from 1 word per second to 1 word per three seconds.

It appears from the above that reduction of the probability of error by redundant encoding must result in transmission rates that tend to 0 as the error tends to 0. This, however, is not so. As the following remarkable theorem shows, it is possible to achieve arbitrarily small error probabilities while maintaining the rate of information transmission close to the channel capacity.

The Channel Capacity Theorem

Information can be transmitted through a noisy channel at a rate nearly equal to the channel capacity $C$ with negligible probability of error.

Proof. Preliminary remarks From the definition of channel capacity, it follows that the maximum of $H(\mathbf{x})$ is at least equal to $C$ because

$$H(\mathbf{x}) = I(\mathbf{x}, \mathbf{y}) + H(\mathbf{x} \mid \mathbf{y}) \ge I(\mathbf{x}, \mathbf{y}) \tag{15-197}$$

This shows that we can find a source with entropy rate as close to $C$ as we want. We shall show that if $\mathbf{x}_n$ is a source with entropy rate

$$H(\mathbf{x}) < C \tag{15-198}$$

then it can be transmitted at the rate of 1 word per second with probability of error less than $\alpha$ for any $\alpha > 0$. This will prove the theorem because the information per word equals $H(\mathbf{x})$.

As in the noiseless case, the proof is based on proper encoding of the space $\mathcal{S}_x^m$ consisting of all possible segments of $\mathbf{x}_n$ of length $m$. However, as the following remarks show, the objectives are different.
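Two of the results for the symmetrical channel can be checked numerically: the error probability $\beta^2(3 - 2\beta)$ of the majority rule of Example 15-29 and the capacity $C = 1 - r(\beta)$ of Example 15-28. The sketch below is an illustration under stated assumptions: the three digits of a triplet are taken to be corrupted independently (memoryless channel), the capacity is found by a simple grid search over the input probability $p$, and all function names are hypothetical.

```python
from itertools import product
from math import log2


def r(p):
    """Entropy (in bits) of a two-state partition with probabilities p, 1 - p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def majority_error(beta):
    """Probability that majority decoding of a triplet gives the wrong digit."""
    total = 0.0
    for flips in product((0, 1), repeat=3):     # 1 marks a digit received in error
        k = sum(flips)
        if k >= 2:                              # two or three errors fool the majority
            total += beta ** k * (1 - beta) ** (3 - k)
    return total


def bsc_capacity(beta, steps=1000):
    """Maximum over p of I(x, y) = r(gamma) - r(beta), gamma = (1 - 2 beta) p + beta."""
    best = 0.0
    for k in range(steps + 1):
        p = k / steps
        gamma = (1 - 2 * beta) * p + beta       # P{y_n = 0}, see (15-196)
        best = max(best, r(gamma) - r(beta))
    return best


beta = 0.1
error_with_code = majority_error(beta)          # compare with beta**2 * (3 - 2*beta)
capacity = bsc_capacity(beta)                   # compare with 1 - r(beta)
```

For $\beta = 0.1$ the triplet code lowers the error probability from 0.1 to 0.028 at one third of the word rate, while the capacity of the channel is about 0.53 bit/s; the theorem states that rates near this capacity are attainable with better codes.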
FIGURE 15-27 (a) Noiseless channel; (b) noisy channel.

Noiseless channel. The code set consists of two groups of binary numbers (Fig. 15-27a). The first group has $2^{m_1}$ elements of length $m_1 = mH(x)$ and it is used to encode the $2^{m_1}$ typical sequences of the input space. The second group is used to encode the rare sequences of the input space. Since the set of all rare sequences has negligible probability, the average length of the code equals $m_1$. Thus, in the noiseless case, the purpose of coding is reduction of the time of transmission of $m$-word messages from $m$ seconds to $m_1$ seconds. This results in an increase of the rate of information transmission from $mH(x)$ bits per $m$ seconds to $mH(x)$ bits per $m_1 = mH(x)$ seconds.

Noisy channel. Reasoning as in (15-197), we conclude that, given $\varepsilon > 0$, we can find a process $z_n$ such that
$$H(z) - H(z|y) > C - \varepsilon \tag{15-199}$$
Choosing $\varepsilon < C - H(x)$, we obtain
$$H(z) > H(x) + H(z|y) > H(x) \tag{15-200}$$
because $H(z|y) > 0$. All sequences of $z_n$ of length $m$ form a space consisting of $2^m$ elements. We can, therefore, encode the input set of all $m$-word sequences of $x_n$ into this space. The resulting code is one-to-one (Fig. 15-27b). The code can, however, be viewed as redundant if we consider only the mapping of the subset $T(x_n)$ of all typical
sequences of the input space into the subset $T(z_n)$ of all typical sequences of the code space. Indeed, $T(x_n)$ has $N = 2^{mH(x)}$ elements and $T(z_n)$ has $M = 2^{mH(z)}$ elements where
$$N \ll M \tag{15-201}$$
because $H(x) < H(z)$ and $m \gg 1$.

We denote by $\bar{z}_n$ the code word of a typical $x_n$ message and by $\bar{T}(z_n)$ the set of all such code words. Clearly, $\bar{T}(z_n)$ is a subset of the set $T(z_n)$ consisting of $N \ll M$ elements. The purpose of the coding is to select the set $\bar{T}(z_n)$ such that its elements are at a "large distance" from each other in the following sense: Since the channel is noisy, the output due to a specific element $\bar{z}_n$ is not unique. We denote by $\mathcal{Y}(\bar{z}_n)$ the set of all output sequences due to this element, and we attempt to design the code such that the probability of the intersection of the output sets $\mathcal{Y}(\bar{z}_n)$, as $\bar{z}_n$ ranges over every element of the set $\bar{T}(z_n)$, is negligible. This will ensure the unique determination of $\bar{z}_n$ in terms of the observed output $y_n$.

Random code. To complete the proof, we shall show that among all $N$-element subsets of the set $T(z_n)$ there exists at least one that meets our requirements. In fact, we shall prove a stronger statement: If we select at random $N$ elements $\bar{z}_n$ from the $M$ elements of $T(z_n)$ and use the resulting set $\bar{T}(z_n)$ to encode the set $T(x_n)$ then, almost certainly, the probability of error in transmission will be negligible.

We note that, once the code set $\bar{T}(z_n)$ has been selected, the probability that an element of $T(z_n)$ is in $\bar{T}(z_n)$ equals $N/M$. From this it follows that, if $\mathcal{W}$ is a randomly selected subset of $T(z_n)$ consisting of $N_w$ elements, then the probability $P_w$ that it will intersect the set $\bar{T}(z_n)$ equals
$$P_w = 1 - \left(1 - \frac{N}{M}\right)^{N_w} \simeq \frac{N N_w}{M} \tag{15-202}$$
because $N N_w \ll M$.

Suppose that we transmit the selected $m$-word message $\bar{z}_n$ through the channel and we observe at the output the $m$-word message $y_n$. Since the channel is noisy, the same $y_n$ might result from many other input messages.
We denote by $\mathcal{W}(y_n)$ the set consisting of all elements of $T(z_n)$ that will produce the same output $y_n$, excluding the actually transmitted message $\bar{z}_n$ (Fig. 15-27b). If the set $\mathcal{W}(y_n)$ does not intersect the code set $\bar{T}(z_n)$, there is no error because the observed signal $y_n$ determines uniquely the transmitted signal $\bar{z}_n$. The error probability equals, therefore, the probability $P_w$ that the sets $\mathcal{W}(y_n)$ and $\bar{T}(z_n)$ intersect. As we know [see (15-74)] the number $N_w$ of typical elements in $\mathcal{W}(y_n)$ equals $2^{mH(z|y)}$. Neglecting all others, we conclude from (15-202) that
$$P_w \simeq \frac{N N_w}{M} = \frac{2^{mH(x)}\,2^{mH(z|y)}}{2^{mH(z)}} = 2^{m[H(x) + H(z|y) - H(z)]}$$
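This shows that $P_w \to 0$ as $m \to \infty$ because $H(z|y) + H(x) - H(z) < 0$, and the proof is complete.

We note, finally, that the maximum rate of information transmission cannot exceed $C$ bits per second. Indeed, to achieve a rate higher than $C$, we would need to transmit a signal $z_n$ such that $H(z) - H(z|y) > C$. This, however, is impossible [see (15-194)].

PROBLEMS

15-1. Show that $H(\mathfrak{A} \cdot \mathfrak{B}|\mathfrak{B}) = H(\mathfrak{A}|\mathfrak{B})$.

15-2. Show that if $\varphi(p) = -p \log p$ and $p_1 < p_1 + \varepsilon < p_2 - \varepsilon < p_2$, then
$$\varphi(p_1 + p_2) < \varphi(p_1) + \varphi(p_2) < \varphi(p_1 + \varepsilon) + \varphi(p_2 - \varepsilon)$$

15-3. In Fig. P15-3a, we give a schematic representation of the identities
$$H(\mathfrak{A} \cdot \mathfrak{B}) = H(\mathfrak{A}) + H(\mathfrak{B}|\mathfrak{A}) = H(\mathfrak{A}) + H(\mathfrak{B}) - I(\mathfrak{A}, \mathfrak{B})$$
where each quantity equals the area of the corresponding region. Extending formally this representation to three partitions (Fig. P15-3b), we obtain the identities
$$H(\mathfrak{A} \cdot \mathfrak{B} \cdot \mathfrak{C}) = H(\mathfrak{A}) + H(\mathfrak{B} \cdot \mathfrak{C}|\mathfrak{A}) = H(\mathfrak{A} \cdot \mathfrak{B}) + H(\mathfrak{C}|\mathfrak{A} \cdot \mathfrak{B})$$
$$H(\mathfrak{A} \cdot \mathfrak{B} \cdot \mathfrak{C}) = H(\mathfrak{A}) + H(\mathfrak{B}|\mathfrak{A}) + H(\mathfrak{C}|\mathfrak{A} \cdot \mathfrak{B})$$
$$H(\mathfrak{B} \cdot \mathfrak{C}|\mathfrak{A}) = H(\mathfrak{B}|\mathfrak{A}) + H(\mathfrak{C}|\mathfrak{A} \cdot \mathfrak{B})$$
Show that these identities are correct.

15-4. Show that
$$I(\mathfrak{A} \cdot \mathfrak{B}, \mathfrak{C}) + I(\mathfrak{A}, \mathfrak{B}) = I(\mathfrak{A} \cdot \mathfrak{C}, \mathfrak{B}) + I(\mathfrak{A}, \mathfrak{C})$$
and identify each quantity in the representation of Fig. P15-3b.

FIGURE P15-3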
15-5. The conditional mutual information of two partitions $\mathfrak{A}$ and $\mathfrak{B}$ assuming $\mathfrak{C}$ is by definition
$$I(\mathfrak{A}, \mathfrak{B}|\mathfrak{C}) = H(\mathfrak{A}|\mathfrak{C}) + H(\mathfrak{B}|\mathfrak{C}) - H(\mathfrak{A} \cdot \mathfrak{B}|\mathfrak{C})$$
(a) Show that
$$I(\mathfrak{A}, \mathfrak{B}|\mathfrak{C}) = I(\mathfrak{A}, \mathfrak{B} \cdot \mathfrak{C}) - I(\mathfrak{A}, \mathfrak{C}) \tag{i}$$
and identify each quantity in the representation of Fig. P15-3b. (b) From (i) it follows that $I(\mathfrak{A}, \mathfrak{B} \cdot \mathfrak{C}) \ge I(\mathfrak{A}, \mathfrak{C})$. Interpret this inequality in terms of the subjective notion of mutual information.

15-6. In an experiment $\mathcal{S}$, the entropy of the binary partition $\mathfrak{A} = [\mathcal{A}, \bar{\mathcal{A}}]$ equals $r(p)$ where $p = P(\mathcal{A})$. Show that in the experiment $\mathcal{S}^3 = \mathcal{S} \times \mathcal{S} \times \mathcal{S}$ the entropy of the eight-element partition $\mathfrak{A}^3 = \mathfrak{A} \cdot \mathfrak{A} \cdot \mathfrak{A}$ equals $3r(p)$ as in (15-67).

15-7. Show that
$$H(x + a) = H(x) \qquad H(x + y|x) = H(y|x)$$
In the above, $H(x + a)$ is the entropy of the RV $x + a$ and $H(x + y|x)$ is the conditional entropy of the RV $x + y$.

15-8. The RVs $x$, $y$ are of discrete type and independent. Show that if $z = x + y$ and the line $x + y = z_i$ contains no more than one mass point, then
$$H(z|x) = H(y) \le H(z)$$
Hint: Show that $\mathfrak{A}_z = \mathfrak{A}_x \cdot \mathfrak{A}_y$.

15-9. The RV $x$ is uniform in the interval $(0, a)$ and the RV $y$ equals the value of $x$ rounded off to the nearest multiple of $\delta$. Show that $I(x, y) = \ln(a/\delta)$.

15-10. Show that, if the transformation $y = g(x)$ is one-to-one and $x$ is of discrete type, then
$$H(x, y) = H(x)$$
Hint: $p_{ij} = P\{x = x_i\}\,\delta[i - j]$.

15-11. Show that for discrete-type RVs
$$H(x, x) = H(x) \qquad H(x|x) = 0 \qquad H(y|x) = H(y, x|x)$$
$$H(y|x_1, \ldots, x_n) = H\left(y + \sum_{k=1}^{n} a_k x_k \,\Big|\, x_1, \ldots, x_n\right)$$
For continuous-type RVs, the relevant densities are singular. The above holds, however, if we set $H(x, x) = H(x)$ and use theorem (15-103) and its extensions to several variables to define recursively all conditional entropies.

15-12. The process $x_n$ is normal white noise with $E\{x_n^2\} = 5$, and
$$y_n = \sum_{k=0}^{\infty} 2^{-k} x_{n-k}$$
(a) Find the mutual information of the RVs $x_n$ and $y_n$. (b) Find the entropy rate of the process $y_n$.
15-13. The RVs $x_n$ are independent and each is uniform in the interval (4, 6). Find the entropy rate of the process
$$y_n = 5 \sum_{k=0}^{\infty} 2^{-k} x_{n-k}$$

15-14. Find the ME density of an RV $x$ if $f(x) = 0$ for $|x| > 1$ and $E\{x\} = 0.31$.

15-15. It is observed that the duration of the telephone calls is a number $x$ between 1 and 5 minutes and its mean is 3 min 37 sec. Find its ME density.

15-16. We are given a die with $P\{\text{even}\} = 0.5$ and are told that the mean of the number $x$ of faces up equals 4.44. Find the ME values of $p_i = P\{x = i\}$.

15-17. Suppose that $x$ is an RV with entropy $H(x)$ and $y = 3x$. Express the entropy $H(y)$ of $y$ in terms of $H(x)$, (a) if $x$ is of discrete type, (b) if $x$ is of continuous type.

15-18. In the experiment of two fair dice, $\mathfrak{A}$ is a partition consisting of the events $\mathcal{A}_1 = \{\text{seven}\}$, $\mathcal{A}_2 = \{\text{eleven}\}$, and $\mathcal{A}_3 = \overline{\mathcal{A}_1 \cup \mathcal{A}_2}$. (a) Find its entropy. (b) The dice were rolled 100 times. Find the number of typical and atypical sequences formed with the events $\mathcal{A}_1$, $\mathcal{A}_2$, and $\mathcal{A}_3$.

15-19. The process $x[n]$ is SSS with entropy rate $H(x)$. Show that, if
$$w_n = \sum_{k=0}^{n} h_k x_{n-k}$$
then
$$\lim_{n \to \infty} \frac{1}{n + 1} H(w_0, \ldots, w_n) = H(x) + \ln h_0$$

15-20. In the coin experiment, the probability of "heads" is an RV $p$ with $E\{p\} = 0.6$. Using the MEM, find its density $f(p)$.

15-21. (The Brandeis dice problem†) In a die experiment, the average number of dots up equals 4.5. Using the MEM, find $p_i = P\{x = i\}$.

15-22. Using the MEM, find the joint density $f(x_1, x_2, x_3)$ of the RVs $x_1, x_2, x_3$ if
$$E\{x_1^2\} = E\{x_2^2\} = E\{x_3^2\} = 4 \qquad E\{x_1 x_2\} = E\{x_1 x_3\} = 1$$

15-23. A source has seven elements with probabilities
0.3 0.2 0.15 0.15 0.1 0.06 0.04
respectively. Construct a Shannon, a Fano, and a Huffman code and find their average code lengths.

15-24. Show that in the redundant coding of Example 15-29, the probability of error equals $\beta^2(3 - 2\beta)$. Hint: $P\{y_n = 1 \,|\, x_n = 0\} = \beta^3 + 3\beta^2(1 - \beta)$.

15-25. Find the channel capacity of a symmetrical binary channel if the received information is always wrong.

†E. T. Jaynes: Brandeis lectures, 1962.
CHAPTER 16 SELECTED TOPICS

16-1 THE LEVEL-CROSSING PROBLEM

Given a random process $x(t)$ and a constant $a$, we denote by $\tau_i$ the time instances when $x(t)$ crosses the line $L_a$ shown in Fig. 16-1. This line is parallel to the time axis and
$$x(\tau_i) = a$$
The level-crossing problem is the determination of the statistical properties of the point process $\tau_i$ so formed. A special case is the zero-crossing problem ($a = 0$) when $L_a$ coincides with the $t$ axis.

The level-crossing problem is, in general, complicated. We discuss next only certain aspects that lead to simple results.

EXPECTED NUMBER OF CROSSINGS.† We assume that the process $x(t)$ is stationary and we denote by $n_a(T)$ the number of points $\tau_i$ in an interval of length $T$. The following basic theorem expresses the mean of $n_a(T)$ in terms of the first-order density $f_x(x)$ of $x(t)$ and the conditional mean of its derivative.

THEOREM. If $x'(t)$ exists, then
$$E\{n_a(T)\} = T f_x(a)\,E\{|x'(t)| \,|\, x(t) = a\} \tag{16-1}$$

†A. Blanc-Lapierre: Modèles Statistiques pour l'étude de phénomènes de Fluctuations, Masson et Cie, Paris, 1963.
Proof. We shall prove the theorem using the following property of the impulse function (see Papoulis, 1968): If $\tau_i$ are all the real 0's of a function $\varphi(t)$, then (Fig. 16-2)
$$\delta[\varphi(t)] = \sum_i \frac{\delta(t - \tau_i)}{|\varphi'(\tau_i)|} \tag{16-2}$$

FIGURE 16-2

The 0's of the function $\varphi(t) = x(t) - a$ are the $L_a$ crossings of $x(t)$. Thus
$$\varphi(\tau_i) = x(\tau_i) - a = 0 \qquad \varphi'(t) = x'(t)$$
Inserting into (16-2), we obtain
$$\sum_i \delta(t - \tau_i) = |x'(t)|\,\delta[x(t) - a] \tag{16-3}$$
The sum
$$\zeta(t) = \sum_i \delta(t - \tau_i)$$
is a stationary process consisting of a sequence of impulses at the points $\tau_i$. The area of each impulse equals 1 and the number of impulses in the interval $(t, t + T)$ equals $n_a(T)$. Hence
$$n_a(T) = \int_t^{t+T} \zeta(\alpha)\,d\alpha \qquad E\{n_a(T)\} = T\,E\{\zeta(t)\}$$
To prove (16-1), it suffices, therefore, to find the mean of $\zeta(t)$. As we see from (16-3), the RV $\zeta(t)$ is a function of the RVs $x(t)$ and $x'(t)$. Denoting by $f(x, x')$ the joint density of $x(t)$ and $x'(t)$, we conclude from (16-3) and (7-2) that
$$E\{\zeta(t)\} = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} |x'|\,\delta(x - a) f(x, x')\,dx\,dx' = \int_{-\infty}^{\infty} |x'| f(a, x')\,dx'$$
This yields (16-1) because $f(a, x') = f_x(a) f(x'|a)$.

Note that the conditional mean $E\{|x'(t)| \,|\, x(t) = a\}$ is the average of the slopes $|x'(t)|$ of all processes that cross the line $L_a$ at time $t$.

LEVEL-CROSSING DENSITY. We denote by $p_a(\tau)$ the probability that in an interval $I_\tau$ of length $\tau$ there is one and only one crossing. If $\tau \to 0$ then $p_a(\tau) \to 0$. Furthermore, with the exception of unusual cases, the probability that there is more than one crossing in a small interval $\tau$ is small compared to $p_a(\tau)$. Assuming that this is the case for the process $x(t)$, we introduce the limit
$$\lambda_a = \lim_{\tau \to 0} \frac{1}{\tau}\,p_a(\tau)$$
If this limit exists, it is the density of the $L_a$ crossings. Thus $p_a(\Delta\tau) = \lambda_a \Delta\tau$ for small $\Delta\tau$. We maintain that†
$$\lambda_a = \frac{1}{T}\,E\{n_a(T)\} \tag{16-4}$$

Proof. If $\Delta\tau$ is small, then
$$P\{n_a(\Delta\tau) = 1\} = p_a(\Delta\tau) \qquad P\{n_a(\Delta\tau) = 0\} = 1 - p_a(\Delta\tau)$$
Hence
$$E\{n_a(\Delta\tau)\} = 1 \times P\{n_a(\Delta\tau) = 1\} = p_a(\Delta\tau) = \lambda_a \Delta\tau$$
From this it follows that (16-4) is true for $T$ small. And since
$$n_a(T_1 + T_2) = n_a(T_1) + n_a(T_2)$$
we conclude that it is true for any $T$.

We have thus shown that if $x(t)$ is differentiable, then the level-crossing density $\lambda_a$ exists and it equals
$$\lambda_a = f_x(a)\,E\{|x'(t)| \,|\, x(t) = a\} \tag{16-5}$$

†S. O. Rice: "Mathematical Analysis of Random Noise," in Selected Papers on Noise and Stochastic Processes, Dover, N.Y., 1954.

Density of maxima. The maxima and minima (extrema) of a process $x(t)$ are the zero crossings of the process $y(t) = x'(t)$. From this it follows that the density of the extrema of $x(t)$ is obtained from (16-5) if we set $a = 0$ and replace
$x(t)$ by $x'(t)$. This yields
$$\lambda_m = f_{x'}(0)\,E\{|x''(t)| \,|\, x'(t) = 0\} \tag{16-6}$$
The density of maxima equals $\lambda_m/2$.

Normal Processes

We shall apply the preceding results to normal processes under the assumption that their mean $\eta_x$ is 0. This assumption is not essential because the $a$-level crossings of the process $x(t)$ are identical to the $(a - \eta_x)$-level crossings of the centered process $x(t) - \eta_x$.

THEOREM. If $x(t)$ is a differentiable process with autocorrelation $R(\tau)$, then
$$\lambda_a = f_x(a)\,E\{|x'(t)|\} = \frac{1}{\pi}\sqrt{\frac{-R''(0)}{R(0)}}\;e^{-a^2/2R(0)} \tag{16-7}$$

Proof. As we know
$$R_{xx'}(\tau) = -R'(\tau) \qquad R_{x'x'}(\tau) = -R''(\tau)$$
From the existence of $x'(t)$ it follows that $R''(\tau)$ exists. Therefore, $R'(\tau)$ also exists and $R'(0) = 0$ because $R(\tau)$ is even. Hence
$$E\{x(t)x'(t)\} = -R'(0) = 0$$
This shows that the RVs $x(t)$ and $x'(t)$ are orthogonal. And since they are normal with zero mean, they are independent. The first equality in (16-7) follows, therefore, from (16-5). To prove the second equality, we observe that the variance of $x(t)$ equals $R(0)$ and the variance of $x'(t)$ equals $-R''(0)$. This yields [see (5-45)]
$$f_x(a) = \frac{1}{\sqrt{2\pi R(0)}}\,e^{-a^2/2R(0)} \qquad E\{|x'(t)|\} = \sqrt{\frac{-2R''(0)}{\pi}}$$
and (16-7) results.

Zero-crossing density. Denoting by $\lambda_0$ the density of the 0's of $x(t)$, we conclude from (16-7) with $a = 0$ that
$$\lambda_0^2 = \frac{-R''(0)}{\pi^2 R(0)} = \frac{\displaystyle\int_{-\infty}^{\infty} \omega^2 S(\omega)\,d\omega}{\displaystyle\pi^2 \int_{-\infty}^{\infty} S(\omega)\,d\omega} \tag{16-8}$$

Example 16-1. If $S(\omega) = 0$ for $|\omega| > \sigma$, then (16-8) yields $\lambda_0 \le \sigma/\pi$ because
$$\int_{-\sigma}^{\sigma} \omega^2 S(\omega)\,d\omega \le \sigma^2 \int_{-\sigma}^{\sigma} S(\omega)\,d\omega$$
If also $S(\omega) = S_0$ for $|\omega| < \sigma$ (ideal low-pass), then
$$\lambda_0 = \frac{\sigma}{\pi\sqrt{3}}$$

Example 16-2. (a) If $a$ and $b$ are two independent RVs with variance $\sigma^2$ and
$$x(t) = a\cos\omega_0 t + b\sin\omega_0 t$$
as in Example 10-13, then
$$R_x(\tau) = \sigma^2 \cos\omega_0\tau \qquad R_x(0) = \sigma^2 \qquad R_x''(0) = -\omega_0^2\sigma^2$$
and (16-8) yields
$$\lambda_0 = \frac{\omega_0}{\pi}$$
(b) If the RVs $a$ and $b$ are independent of the process $v(t)$ and
$$x(t) = a\cos\omega_0 t + b\sin\omega_0 t + v(t)$$
then $R_x(\tau) = \sigma^2\cos\omega_0\tau + R_v(\tau)$ and (16-8) yields
$$\lambda_0 = \frac{1}{\pi}\sqrt{\frac{\omega_0^2\sigma^2 - R_v''(0)}{\sigma^2 + R_v(0)}}$$

Nondifferentiable processes. We have shown that if $x'(t)$ exists, then the probability $p_0(\tau)$ that there is a zero crossing in a small interval $\tau$ equals $\lambda_0\tau$. If $x'(t)$ does not exist, then $p_0(\tau)$ is no longer proportional to $\tau$. In the following, we examine the asymptotic form of $p_0(\tau)$ as $\tau \to 0$ under the assumption that $R(\tau)$ has a corner at the origin as in the Ornstein-Uhlenbeck process [see (11-15)].

Suppose that $R'(\tau)$ is discontinuous at $\tau = 0$ but $R'(0^+)$ exists. In this case
$$R(\tau) = R(0) + R'(0^+)\tau + O(\tau^2) \qquad \tau > 0 \tag{16-9}$$
We shall show that $p_0(\tau)$ is proportional to $\sqrt{\tau}$:
$$p_0(\tau) \simeq \frac{1}{\pi}\sqrt{\frac{-2R'(0^+)\tau}{R(0)}} \tag{16-10}$$

Proof. If $\tau$ is small, then we can neglect more than one zero crossing in the interval $(t, t + \tau)$. With this assumption, we have one crossing iff $x(t + \tau)x(t) < 0$ (Fig. 16-3). Hence
$$p_0(\tau) = P\{x(t + \tau)x(t) < 0\}$$
The RVs $x(t + \tau)$ and $x(t)$ are jointly normal with correlation coefficient $r(\tau) = R(\tau)/R(0)$. Applying (6-46), we conclude that
$$p_0(\tau) = \frac{\beta}{\pi} \qquad \cos\beta = r(\tau)$$
FIGURE 16-3

This yields
$$r(\tau) = \cos\beta = \cos\pi p_0(\tau) \simeq 1 - \frac{\pi^2 p_0^2(\tau)}{2}$$
for $p_0(\tau) \ll 1$. Thus, for small $\tau$,
$$\pi p_0(\tau) \simeq \sqrt{2[1 - r(\tau)]} \qquad r(\tau) = \frac{R(\tau)}{R(0)} \tag{16-11}$$
Inserting (16-9) into the above, we obtain (16-10).

Note. Using (16-11), we can reestablish (16-8). Indeed, if $x(t)$ is differentiable, then $R'(0) = 0$ and $R''(0)$ exists. Hence
$$R(\tau) = R(0) + \tfrac{1}{2}R''(0)\tau^2 + O(\tau^3)$$
Inserting into (16-11), we obtain
$$p_0(\tau) \simeq \frac{\tau}{\pi}\sqrt{\frac{-R''(0)}{R(0)}} \tag{16-12}$$
This yields (16-8) because $p_0(\tau) = \lambda_0\tau$ for small $\tau$.

Example 16-3. If $a(t)$ and $b(t)$ are two normal independent processes and
$$x(t) = a(t)\cos\omega_0 t - b(t)\sin\omega_0 t$$
as in (11-62), then $x(t)$ is normal, stationary, with autocorrelation [see (11-65)]
$$R_x(\tau) = R_a(\tau)\cos\omega_0\tau$$
(a) We shall show that, if $x(t)$ is differentiable, then its zero-crossing density equals
$$\lambda_0 = \sqrt{\tilde{\lambda}_0^2 + \frac{\omega_0^2}{\pi^2}} \qquad \text{where} \qquad \tilde{\lambda}_0 = \frac{1}{\pi}\sqrt{\frac{-R_a''(0)}{R_a(0)}} \tag{16-13}$$
is the zero-crossing density of $a(t)$. Indeed, in this case, $R_a'(0) = 0$; hence
$$R_x(0) = R_a(0) \qquad R_x'(0) = 0 \qquad R_x''(0) = R_a''(0) - \omega_0^2 R_a(0)$$
and (16-13) follows from (16-8).
(b) In the nondifferentiable case, we have
$$R_x(0) = R_a(0) \qquad R_x'(0^+) = R_a'(0^+) \ne 0$$
This shows that the zero-crossing probabilities $p_0(\tau)$ and $\tilde{p}_0(\tau)$ of the processes $x(t)$ and $a(t)$ are equal [see (16-10)].
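The spectral form of (16-8) is easy to evaluate numerically. The sketch below is only an illustration: it integrates a band-limited spectrum with the midpoint rule and returns $\lambda_0$. The function name, the step count, and the triangular test spectrum are assumptions made for the example, not part of the text.

```python
import math


def zero_crossing_density(spectrum, sigma, steps=20_000):
    """Evaluate (16-8): lambda_0 = (1/pi) sqrt( int w^2 S dw / int S dw ).

    The spectrum is assumed to vanish outside (-sigma, sigma); both
    integrals are approximated with the midpoint rule.
    """
    dw = 2 * sigma / steps
    num = den = 0.0
    for k in range(steps):
        w = -sigma + (k + 0.5) * dw   # midpoint of the k-th cell
        s = spectrum(w)
        num += w * w * s * dw
        den += s * dw
    return math.sqrt(num / den) / math.pi


sigma = 2.0
# Ideal low-pass spectrum S(w) = S0 (the constant cancels in the ratio).
flat = zero_crossing_density(lambda w: 1.0, sigma)
# Hypothetical triangular spectrum, used only as a second test case.
triangular = zero_crossing_density(lambda w: 1.0 - abs(w) / sigma, sigma)

print(flat, sigma / (math.pi * math.sqrt(3)))
print(triangular, sigma / (math.pi * math.sqrt(6)))
```

For the flat spectrum the ratio of the integrals is $\sigma^2/3$, and for the triangular one it is $\sigma^2/6$; both values respect the bound $\lambda_0 \le \sigma/\pi$ of Example 16-1.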
Example 16-4. In the preceding discussion we assumed that all processes are stationary. The results can be readily extended to nonstationary processes. We illustrate the nondifferentiable case using as an example the Wiener process $w(t)$. We shall show that for small $\tau$, the probability $p_0(t, t + \tau)$ that $w(t)$ crosses the $t$ axis in the interval $(t, t + \tau)$ equals
$$p_0(t, t + \tau) \simeq \frac{1}{\pi}\sqrt{\frac{\tau}{t}} \tag{16-14}$$

Proof. Reasoning as in the stationary case, we conclude that $p_0(t, t + \tau)$ is given by (16-11) where $r(\tau)$ is the correlation coefficient of the RVs $w(t + \tau)$ and $w(t)$. Thus [see (11-5)]
$$r^2(\tau) = \frac{R^2(t + \tau, t)}{R(t + \tau, t + \tau)R(t, t)} = \frac{\alpha^2 t^2}{\alpha(t + \tau)\,\alpha t} = \frac{t}{t + \tau}$$
Inserting into (16-11), we obtain (16-14) because $\sqrt{t/(t + \tau)} \simeq 1 - \tau/2t$ for small $\tau$.

Density of maxima. The extrema of $x(t)$ are the 0's of $x'(t)$. Hence their density is given by (16-8) if we replace the autocorrelation $R(\tau)$ of $x(t)$ by the autocorrelation $-R''(\tau)$ of $x'(t)$. From this it follows that
$$\lambda_m^2 = \frac{-R^{(4)}(0)}{\pi^2 R''(0)} = \frac{\displaystyle\int_{-\infty}^{\infty} \omega^4 S(\omega)\,d\omega}{\displaystyle\pi^2 \int_{-\infty}^{\infty} \omega^2 S(\omega)\,d\omega} \tag{16-15}$$
provided that $x''(t)$ exists.

Example 16-5. (a) If $S(\omega) = 0$ for $|\omega| > \sigma$, then the above yields $\lambda_m \le \sigma/\pi$ because
$$\int_{-\sigma}^{\sigma} \omega^4 S(\omega)\,d\omega \le \sigma^2 \int_{-\sigma}^{\sigma} \omega^2 S(\omega)\,d\omega$$
If also $S(\omega) = S_0$ for $|\omega| < \sigma$, then
$$\lambda_m = \frac{\sigma}{\pi}\sqrt{\frac{3}{5}}$$
(b) If $S(\omega) = S_0$ for $\omega_1 < |\omega| < \omega_2$ and 0 otherwise (ideal bandpass), then
$$\lambda_m = \frac{1}{\pi}\sqrt{\frac{3(\omega_2^5 - \omega_1^5)}{5(\omega_2^3 - \omega_1^3)}}$$

FIRST-PASSAGE TIME. We denote by $\tau_1$ the first $a$-level crossing to the right of the origin (Fig. 16-4a). The first-passage problem is the determination of the distribution function $F_\tau(t, a)$ of the RV $\tau_1$. We shall solve the problem under the assumption that $x(t)$ is the Wiener process (11-20). We should note, however, that in the solution we make use only of the fact that the increment
$x(t_2) - x(t_1)$ is independent of $x(t_1)$ and its density is even:
$$P\{x(t_2) - x(t_1) \le w\} = P\{x(t_2) - x(t_1) \ge -w\} \tag{16-16}$$

FIGURE 16-4

The reflection principle. We shall show that the samples $x(t, \zeta)$ of $x(t)$ that cross the line $L_a$ continue on symmetrical paths. In other words, if a sample $x(t, \zeta_1)$ crosses the line $L_a$ at $t = \tau_1$, then there exists another sample $x(t, \zeta_2)$ that coincides with $x(t, \zeta_1)$ for $t < \tau_1$ and for $t > \tau_1$ is the reflection of $x(t, \zeta_1)$ on the line $L_a$ (Fig. 16-4b):
$$x(t, \zeta_1) = x(t, \zeta_2) \qquad t < \tau_1$$
$$a - x(t, \zeta_1) = x(t, \zeta_2) - a \qquad t > \tau_1$$
This result, known as the reflection principle, can be stated as follows:

THEOREM. For any $x$ (less than or greater than $a$)
$$P\{x(t) \le x \,|\, \tau_1 \le t\} = P\{x(t) \ge 2a - x \,|\, \tau_1 \le t\} \tag{16-17}$$

Proof. It suffices to show that (see Prob. 4-13)
$$P\{x(t) \le x \,|\, \tau_1 = \tau\} = P\{x(t) \ge 2a - x \,|\, \tau_1 = \tau\} \tag{16-18}$$
for every $\tau \le t$. From (16-16) it follows that
$$P\{x(t) - x(\tau) \le x - a\} = P\{x(t) - x(\tau) \ge a - x\} \tag{16-19}$$
Since $x(t) - x(\tau)$ is independent of $x(\tau)$, we can write (16-19) in the form
$$P\{x(t) - x(\tau) \le x - a \,|\, x(\tau) = a\} = P\{x(t) - x(\tau) \ge a - x \,|\, x(\tau) = a\}$$
This is true for any $t$ and $\tau$; hence it is true for $\tau = \tau_1$. In this case $\{x(\tau) = a\} = \{\tau_1 = \tau\}$ and (16-18) results because $x(\tau_1) = a$.

COROLLARY. If $x \le a$, then
$$P\{x(t) \le x, \tau_1 \le t\} = 1 - F_x(2a - x, t) \tag{16-20}$$
$$P\{x(t) \le x, \tau_1 > t\} = F_x(x, t) + F_x(2a - x, t) - 1 \tag{16-21}$$
where
$$F_x(x, t) = \frac{1}{\sqrt{2\pi\alpha t}} \int_{-\infty}^{x} e^{-\xi^2/2\alpha t}\,d\xi$$
is the first-order distribution of the Wiener process $x(t)$.

Proof. Multiplying both sides of (16-17) by $P\{\tau_1 \le t\}$, we obtain
$$P\{x(t) \le x, \tau_1 \le t\} = P\{x(t) \ge 2a - x, \tau_1 \le t\} \tag{16-22}$$
for every $x$ and $t$. If $x \le a$, then $2a - x \ge a$; hence $x(t) \ge 2a - x$ implies that $\tau_1 \le t$. The right side of (16-22) equals, therefore, $P\{x(t) \ge 2a - x\}$ and (16-20) results. The second equation follows because the sum of the left sides of (16-20) and (16-21) equals $P\{x(t) \le x\}$.

First-passage distribution. The distribution $F_\tau(t, a)$ of the RV $\tau_1$ equals
$$P\{\tau_1 \le t\} = P\{\tau_1 \le t, x(t) \le a\} + P\{\tau_1 \le t, x(t) > a\} \tag{16-23}$$
Clearly, if $x(t) > a$, then there must be a crossing prior to time $t$; hence $\tau_1 < t$. From this it follows that
$$P\{\tau_1 \le t, x(t) > a\} = P\{x(t) > a\}$$
Setting $x = a$ in (16-20), we obtain
$$P\{\tau_1 \le t, x(t) \le a\} = 1 - F_x(a, t) = P\{x(t) > a\}$$
Thus the two terms on the right of (16-23) equal $P\{x(t) > a\}$. Therefore,
$$P\{\tau_1 \le t\} = F_\tau(t, a) = 2P\{x(t) > a\} = 2 - 2F_x(a, t) \tag{16-24}$$

Absorbing wall. We replace the line $L_a$ by an absorbing barrier. This means that the resulting process $y(t)$ equals $a$ for every $t > \tau_1$ (Fig. 16-5). We
shall show that the distribution function of $y(t)$ equals
$$F_y(y, t) = F_x(y, t) + F_x(2a - y, t) - 1 \qquad y < a \tag{16-25}$$
and, of course, $F_y(y, t) = 1$ for $y \ge a$.

Proof. If $y < a$, then $y(t, \zeta) \le y$ for some outcome $\zeta$ iff $x(t, \zeta) \le y$ and $x(t, \zeta)$ does not reach the line $L_a$ prior to time $t$. Hence
$$\{y(t) \le y\} = \{x(t) \le y, \tau_1 > t\}$$
and (16-25) follows from (16-21).

Reflecting wall. We replace now the line $L_a$ by a reflecting barrier. This means that the resulting process $z(t)$ equals $x(t)$ if $x(t) < a$ and it equals $2a - x(t)$ if $x(t) > a$ (Fig. 16-5). We shall show that in this case
$$F_z(z, t) = F_x(z, t) + 1 - F_x(2a - z, t) \qquad z < a \tag{16-26}$$
and $F_z(z, t) = 1$ for $z \ge a$.

Proof. If $z < a$, then $z(t, \zeta) \le z$ iff either $x(t, \zeta) \le z$ or $x(t, \zeta) \ge 2a - z$. Hence
$$\{z(t) \le z\} = \{x(t) \le z\} + \{x(t) \ge 2a - z\}$$
and (16-26) results.

16-2 QUEUEING THEORY

Queueing theory deals with point processes (arrivals and departures) and random intervals (waiting and servicing). This involves the statistical properties of the number of random points in intervals of random length. As a preparation, we introduce the underlying concepts in the context of Poisson points and renewals.

Poisson points. The notation $n(t_1, t_2)$ will mean the number of points of a point process in the interval $(t_1, t_2)$. As we have shown in Example 4-11, if $t_i$ is a set of Poisson points with average density $\lambda$, then the number of points $n_T = n(t, t + T)$ in an interval of length $T$ is a Poisson RV with parameter $\lambda T$. The corresponding moment-generating function equals [see (5-79)]
$$\Gamma_{n_T}(z) = E\{z^{n_T}\} = e^{\lambda T(z - 1)} \tag{16-27}$$

Ergodicity. As we know,
$$E\{n_T\} = \lambda T \qquad \sigma_{n_T}^2 = \lambda T \tag{16-28}$$
This shows that $\lambda$ equals the ensemble average of the number of points in a unit interval. We maintain that $\lambda$ can also be interpreted as time average. For this purpose, we form the time average $\eta_T = n_T/T$. As we see from (16-28), $\eta_T$ is
an RV such that
$$E\{\eta_T\} = \lambda \qquad \sigma_{\eta_T}^2 = \frac{\lambda}{T}$$
Hence $\sigma_{\eta_T} \to 0$ as $T \to \infty$. From this it follows that
$$\eta_T = \frac{n_T}{T} \xrightarrow[T \to \infty]{} \lambda \tag{16-29}$$
and $n_T \simeq \lambda T$ for sufficiently large $T$.

Poisson Points in Random Intervals

Suppose now that $\mathbf{c}$ is a positive RV and $n_c = n(t, t + \mathbf{c})$ is the number of points in an interval of length $\mathbf{c}$. We shall determine its statistics.

If $\mathbf{c} = c$ is a constant, then $n_c$ is a Poisson RV with parameter $\lambda c$. Hence
$$E\{n_c \,|\, \mathbf{c} = c\} = \lambda c$$
From this and (7-66) it follows that
$$E\{n_c\} = E\{E\{n_c \,|\, \mathbf{c}\}\} = E\{\lambda\mathbf{c}\} = \lambda\eta_c \tag{16-30}$$
Denoting by $\rho_c$ the average number of points in the random interval $\mathbf{c}$, we conclude that $\rho_c = \lambda\eta_c$.

Reasoning similarly, we can find all moments of $n_c$. In the following, we determine directly the moment function $\Gamma_n(z) = E\{z^{n_c}\}$ of the discrete-type RV $n_c$ in terms of the moment function
$$\Phi_c(s) = E\{e^{s\mathbf{c}}\} \tag{16-31}$$
of the continuous-type RV $\mathbf{c}$.

THEOREM.
$$\Gamma_n(z) = \Phi_c(\lambda z - \lambda) \tag{16-32}$$

Proof. If $\mathbf{c} = c$, then $n_c$ is a Poisson RV with parameter $\lambda c$. Hence [see (16-27)]
$$E\{z^{n_c} \,|\, \mathbf{c} = c\} = e^{(z - 1)\lambda c}$$
This yields
$$E\{z^{n_c}\} = E\{E\{z^{n_c} \,|\, \mathbf{c}\}\} = E\{e^{(z - 1)\lambda\mathbf{c}}\}$$
and (16-32) follows from (16-31) with $s = (z - 1)\lambda$.
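Relations (16-30) and (16-32) can be illustrated with a small Monte Carlo experiment. The sketch below assumes, purely for illustration, that the interval length is exponential with parameter $\mu$, so that $\Phi_c(s) = \mu/(\mu - s)$. The numerical values, the seed, and the helper names are arbitrary choices and not part of the text.

```python
import random


def count_points(lam, length, rng):
    """Number of Poisson points of density lam in an interval of given length."""
    n = 0
    t = rng.expovariate(lam)          # time of the first point
    while t < length:
        n += 1
        t += rng.expovariate(lam)     # add the next interarrival time
    return n


def simulate(lam, mu, z, trials=50_000, seed=1):
    """Estimate E{n_c} and E{z**n_c} for an exponential interval c (rate mu)."""
    rng = random.Random(seed)
    total_n = 0
    total_zn = 0.0
    for _ in range(trials):
        c = rng.expovariate(mu)       # random interval length
        n = count_points(lam, c, rng)
        total_n += n
        total_zn += z**n
    return total_n / trials, total_zn / trials


lam, mu, z = 2.0, 1.0, 0.5
mean_n, gamma_z = simulate(lam, mu, z)
phi = mu / (mu - lam * (z - 1))       # Phi_c(s) evaluated at s = lam * (z - 1)

print("E{n_c}:   ", mean_n, " vs lam * eta_c =", lam / mu)
print("Gamma(z): ", gamma_z, " vs Phi_c(lam z - lam) =", phi)
```

The sample averages should approach $\lambda\eta_c$ and $\Phi_c(\lambda z - \lambda)$ as the number of trials grows; other distributions of the interval can be tried by replacing the line that draws `c` and the expression for `phi`.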
Using (16-32) and the moment theorems (5-67) and (5-77), we can express the moments of $n_c$ in terms of the moments of $c$. We note, in particular, that
$$E\{n_c\} = \Gamma_n'(1) = \lambda\Phi_c'(0) = \lambda E\{c\}$$
in agreement with (16-30). Similarly,
$$E\{n_c(n_c - 1)\} = \Gamma_n''(1) = \lambda^2\Phi_c''(0) = \lambda^2 E\{c^2\} \tag{16-33}$$
Hence
$$E\{n_c^2\} = \lambda^2 E\{c^2\} + \rho_c \qquad \sigma_n^2 = \lambda^2\sigma_c^2 + \rho_c$$
In the above, $\rho_c = \lambda\eta_c$ is the average number of points in the random interval $c$, and $\sigma_c^2$ is the variance of $c$.

Example 16-6. If†
$$f_c(c) = \mu e^{-\mu c}$$
then $E\{c\} = 1/\mu$, $E\{c^2\} = 2/\mu^2$, and
$$\Phi_c(s) = \frac{\mu}{\mu - s}$$
Hence
$$E\{n_c\} = \frac{\lambda}{\mu} = \rho_c$$
and
$$\Gamma_n(z) = \frac{\mu}{\mu + \lambda - \lambda z}$$
From this it follows that
$$P\{n_c = n\} = \frac{\mu}{\mu + \lambda}\left(\frac{\lambda}{\mu + \lambda}\right)^n \qquad n = 0, 1, \ldots$$
We have thus shown that the number $n_c$ of Poisson points in an exponentially distributed random interval $c$ has a geometric distribution with ratio $\lambda/(\mu + \lambda)$.

Example 16-7. If $\mathbf{c} = c$ is a constant, then
$$\Phi_c(s) = e^{cs} \qquad \Gamma_n(z) = e^{\lambda c(z - 1)}$$
Thus $n_c$ is a Poisson RV with parameter $\lambda c$, as it should.

†Since we deal only with positive RVs, we shall assume tacitly that all densities are 0 on the negative axis.

RENEWAL PROCESSES. Consider a stationary point process $t_i$ such that the RVs
$$c_i = t_i - t_{i-1} \tag{16-34}$$
are i.i.d. with distribution $F_c(c)$. With $t_0$ a fixed point, we form the RV
$$\mathbf{w} = t_1 - t_0$$
where $t_1$ is the first random point to the right of $t_0$ (Fig. 16-6). We shall express the density $f_w(w)$ of this RV in terms of $F_c(c)$.

FIGURE 16-6

THEOREM. We maintain that
$$f_w(w) = \frac{1}{\eta_c}[1 - F_c(w)] \tag{16-35}$$
where
$$\eta_c = E\{c_i\} = \int_0^{\infty} [1 - F_c(c)]\,dc \tag{16-36}$$
is the mean of $c_i$ [see (5-27)].

Proof. Given a number $w$, we define the RV $\mathbf{w}_1$ as the distance from the point $t_w = t_0 + w$ to the next random point to its right (Fig. 16-6). From the stationarity of the points $t_i$, it follows that the RVs $\mathbf{w}$ and $\mathbf{w}_1$ have the same density $f_w(w)$.

Suppose that the first random point $t_1$ to the right of $t_0$ is also to the right of $t_w$. In this case, there is no random point between $t_0$ and $t_w$ and
$$\mathbf{w} = t_1 - t_0 > w \qquad \mathbf{w}_1 = t_1 - t_w \qquad c_1 > \mathbf{w}_1 + w$$
From the above it follows that the events $\{\mathbf{w} > w\}$ and $\{c_1 > \mathbf{w}_1 + w\}$ are equal provided that in the second set we consider only the outcomes such that $c_1 > \mathbf{w}_1$. Hence
$$P\{\mathbf{w} > w\} = P\{c_1 > \mathbf{w}_1 + w \,|\, c_1 > \mathbf{w}_1\}$$
Reasoning similarly, we conclude that
$$P\{w < \mathbf{w} < w + dw\} = P\{0 < \mathbf{w}_1 < dw,\; c_1 > w\}$$
The above yields
$$f_w(w)\,dw = f_w(0)\,dw\,[1 - F_c(w)]$$
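because the RVs $w_1$ and $c_1$ are independent and
$$P\{c_1 > w\} = 1 - F_c(w)$$
This completes the proof because the area of $f_w(w)$ equals one; it also shows that $f_w(0) = 1/\eta_c$.

The theorem is also true if $w$ is the distance from $t_0$ to the nearest random point on its left.

COROLLARY. The moment function $\Phi_w(s)$ of $w$ equals
$$\Phi_w(s) = \frac{1}{\eta_c s}[\Phi_c(s) - 1] \tag{16-37}$$

Proof. Since $F_c'(w) = f_c(w)$ and $F_c(w) = 0$ for $w \le 0$, we conclude, integrating by parts, that for $\operatorname{Re} s < 0$:
$$-\int_0^{\infty} F_c(w)e^{sw}\,dw = \frac{1}{s}\Phi_c(s) \qquad \int_0^{\infty} e^{sw}\,dw = -\frac{1}{s}$$
and (16-37) follows from (16-35).

Differentiating (16-37) written in the form $\eta_c s\,\Phi_w(s) = \Phi_c(s) - 1$, we obtain
$$\eta_c[\Phi_w(s) + s\Phi_w'(s)] = \Phi_c'(s) \qquad \eta_c[2\Phi_w'(s) + s\Phi_w''(s)] = \Phi_c''(s)$$
For $s = 0$, the first equation is consistent with $\Phi_c'(0) = \eta_c$ because $\Phi_w(0) = 1$, and the second yields $2\eta_c\Phi_w'(0) = \Phi_c''(0)$. Hence [see (5-67)]
$$E\{w\} = \frac{E\{c^2\}}{2\eta_c} \tag{16-38}$$

Example 16-8. If $f_c(c) = \lambda e^{-\lambda c}$, then
$$F_c(c) = 1 - e^{-\lambda c} \qquad \eta_c = \frac{1}{\lambda}$$
and (16-35) yields $f_w(w) = \lambda e^{-\lambda w} = f_c(w)$. This is the case iff $t_i$ is a set of Poisson points.

Note. With $c_i$ as in (16-34), we form the point process
$$\tau_i = c_1 + c_2 + \cdots + c_i$$
starting from $t = 0$. In general, this process does not have the same statistics as the original process $t_i$ (it is not even stationary). It is, however, asymptotically equivalent to $t_i$. The process $t_i$ is given by
$$t_i = w + c_2 + \cdots + c_i$$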
It can, therefore, be constructed in terms of the sequence $c_i$ and the RV $w$ specified by (16-35).

Arrivals and Departures

The term queueing is used to describe a large class of phenomena involving arrivals, waiting, servicing, and departures. In Fig. 16-7 we show a typical model (queueing system) whose inputs are certain objects (units) identified in terms of their arrival times $t_i$. Each unit stays in the system $a_i$ seconds (total system time) and it departs at the departure time $\tau_i$:
$$\tau_i = t_i + a_i$$
The number of units in the system (state of the system) at time $t$ will be denoted by $N(t)$. Thus $N(t)$ is a discrete-state process increasing by 1 at $t_i$ and decreasing by 1 at $\tau_i$.

FIGURE 16-7 Queueing system $S$ with arrivals and departures.

Our objective in this section is not to develop this involved topic in detail. We plan merely to introduce the main ideas in the context of earlier concepts. We start with a general theorem that does not rely on any special conditions about the interarrival times, the nature of the system, or the properties of $a_n$. It assumes merely that all processes are SSS with finite second moments.

LITTLE'S THEOREM. Suppose that the processes $t_i$ and $a_i$ are mean-ergodic:
$$\frac{n_T}{T} \xrightarrow[T \to \infty]{} \lambda \qquad \frac{1}{n}\sum_{i=1}^{n} a_i \xrightarrow[n \to \infty]{} E\{a_i\} \tag{16-39}$$
In the above, $n_T$ is the number of points $t_i$ in the interval $(0, T)$ and $\lambda = E\{n_T\}/T$ is the mean density of these points.
We maintain that†
$$E\{N(t)\} = \lambda E\{a_i\} \tag{16-40}$$
In fact, we shall establish the stronger statement that $N(t)$ is also mean-ergodic:
$$\lim_{T \to \infty} \frac{1}{T}\int_0^T N(t)\,dt = \lambda E\{a_i\} = E\{N(t)\} \tag{16-41}$$
Equation (16-40) seems reasonable: The mean $E\{N(t)\}$ of the number of units in the system equals the mean number $\lambda$ of arrivals per second multiplied by the mean time $E\{a_i\}$ that each unit remains in the system. It is not, however, always true, although it holds under general conditions.

Proof. We start with the observation that
$$-\sum_{r=1}^{N(T)} a_r \le \int_0^T N(t)\,dt - \sum_{n=1}^{n_T} a_n \le \sum_{i=1}^{N(0)} a_i \tag{16-42}$$
In the above, the terms $a_n$ of the second sum are due to the $n_T$ units that arrived in the interval $(0, T)$; the terms $a_i$ of the last sum are due to the $N(0)$ units that are in the system at $t = 0$; the terms $a_r$ of the first sum are due to the $N(T)$ units that are still in the system at $t = T$. The details of the reasoning that establishes (16-42) are omitted.

As we know (see Prob. 8-9)
$$E\left\{\left(\sum_{i=1}^{N(t)} a_i\right)^2\right\} \le E\{N^2(t)\}E\{a_i^2\} < \infty \tag{16-43}$$
Dividing (16-42) by $T$, we conclude that, if $T$ is sufficiently large, then
$$\frac{1}{T}\int_0^T N(t)\,dt \simeq \frac{1}{T}\sum_{n=1}^{n_T} a_n \tag{16-44}$$
because the left and right sides of (16-42) tend to 0 after the division by $T$ [see (16-43)]. Furthermore, assumption (16-39) yields $n_T \simeq \lambda T$ and
$$\frac{1}{T}\sum_{n=1}^{n_T} a_n \simeq \frac{\lambda}{n_T}\sum_{n=1}^{n_T} a_n \simeq \lambda E\{a_i\}$$
Inserting into (16-44), we obtain the first equality in (16-41). The second follows because the mean of the left side equals $E\{N(t)\}$.

†F. J. Beutler: "Mean Sojourn Times...," IEEE Transactions Information Theory, March 1983.
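The time-average form (16-41) of Little's theorem can be observed on a simulated sample path. The following sketch makes several assumptions that are not in the text: Poisson arrivals of density $\lambda = 3$, i.i.d. system times uniform in $(0, 2)$, and a system that is empty at $t = 0$ (so it is not started in the stationary state). The integral of $N(t)$ is computed exactly as the sum of the overlaps of each stay with $(0, T)$.

```python
import random


def time_average_state(arrivals, system_times, T):
    """(1/T) * integral of N(t) over (0, T).

    Each unit contributes the length of the overlap of its stay
    (t_i, t_i + a_i) with the observation interval (0, T).
    """
    area = 0.0
    for t, a in zip(arrivals, system_times):
        area += max(0.0, min(t + a, T) - max(t, 0.0))
    return area / T


rng = random.Random(7)      # fixed seed, chosen arbitrarily
lam, T = 3.0, 5_000.0

# Poisson arrival times in (0, T).
arrivals = []
t = rng.expovariate(lam)
while t < T:
    arrivals.append(t)
    t += rng.expovariate(lam)

# Hypothetical system times: uniform in (0, 2), so E{a_i} = 1.
system_times = [rng.uniform(0.0, 2.0) for _ in arrivals]

n_avg = time_average_state(arrivals, system_times, T)
lam_hat = len(arrivals) / T                       # estimate of lambda
a_mean = sum(system_times) / len(system_times)    # estimate of E{a_i}

print("time average of N(t):", n_avg)
print("lam_hat * a_mean:    ", lam_hat * a_mean)
```

The two printed numbers differ only by the boundary term of (16-42) divided by $T$, that is, by the contribution of the few units still in the system at $t = T$.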
Immediate Service (M|G|∞)

In general, the system time $a_n$ is a sum
$$a_n = b_n + c_n$$
where $b_n$ is the waiting time (or queueing time) and $c_n$ is the service time of the $n$th unit. In many applications, $b_n = 0$ and $a_n = c_n$. This is the case if the number of servers is infinite or when no servers are involved (visits to a park, for example). We consider this problem next.

We shall assume that the arrival times $t_n$ are Poisson points with mean density $\lambda$ and the service times $c_n$ are i.i.d. with distribution an arbitrary function $F_c(c)$. In queueing theory this is written in the following form (Kendall)

M|G|∞

The first position in this notation refers to arrivals, the second to service times, and the third to the number of servers in the system. The letter M (for Markoff or memoryless) in the first position means Poisson arrivals; in the second position it means exponentially distributed service times. The letter G (general) indicates that the arrivals or the service times are arbitrary. The letter D (deterministic) indicates that they are constant.

THEOREM. The state $N(t)$ of an M|G|∞ system is Poisson distributed
$$P\{N(t) = k\} = e^{-\rho}\frac{\rho^k}{k!} \tag{16-45}$$
with parameter
$$\rho = \lambda E\{c_n\} = \lambda\eta_c \tag{16-46}$$
This parameter is called traffic intensity or offered load.

FIGURE 16-8

Proof. Using the point $t$ as the origin (Fig. 16-8), we divide the time axis into consecutive intervals $(\alpha_i, \alpha_{i+1})$ of length $\Delta\alpha = \alpha_{i+1} - \alpha_i$. Denoting by $\Delta n_i$ the number of arrivals in the $i$th interval and by $\Delta N(t, \alpha_i)$ the contribution to the state $N(t)$ of the system at time $t$ due to these arrivals, we have
$$N(t) = \sum_i \Delta N(t, \alpha_i) \tag{16-47}$$
If $\Delta\alpha$ is small, then, within probabilities of order $(\Delta\alpha)^2$, the RV $\Delta n_i$ takes the
values 0 or 1 and
$$P\{\Delta n_i = 1\} = \lambda\Delta\alpha \tag{16-48}$$
If $\Delta n_i = 0$, then $\Delta N(t, \alpha_i) = 0$; if $\Delta n_i = 1$, then
$$\Delta N(t, \alpha_i) = \begin{cases} 1 & \text{if } c_i > \alpha_i \\ 0 & \text{if } c_i < \alpha_i \end{cases}$$
where $c_i$ is the service time of the single unit that enters the system in the $i$th interval. Hence
$$P\{\Delta N(t, \alpha_i) = 1 \,|\, \Delta n_i = 1\} = P\{c_i > \alpha_i\} = 1 - F_c(\alpha_i) \tag{16-49}$$
Multiplying (16-48) and (16-49), we obtain
$$P\{\Delta N(t, \alpha_i) = 1\} = [1 - F_c(\alpha_i)]\lambda\Delta\alpha$$
The RVs $\Delta N(t, \alpha_i)$ take the values 0 or 1 and they are independent because the RVs $\Delta n_i$ are independent (Poisson points in nonoverlapping intervals). Hence (see Example 8-8b) the sum in (16-47) tends to a Poisson RV with parameter
$$\lambda\int_0^{\infty} [1 - F_c(\alpha)]\,d\alpha = \lim_{\Delta\alpha \to 0} \lambda\sum_i \Delta\alpha\,[1 - F_c(\alpha_i)]$$
and (16-45) results because the sum in (16-47) equals $N(t)$ and the above integral equals $E\{c_n\}$ [see (16-36)].

COROLLARY. The traffic intensity $\rho$ equals the average number of units in the system
$$E\{N(t)\} = \rho = \lambda\eta_c \tag{16-50}$$

Proof. It follows from (16-45) and (5-36).

Example 16-9. (a) (M|M|∞) If $c$ is an exponential with parameter $\mu$, then $\eta_c = 1/\mu$; hence $\rho = \lambda/\mu$.
(b) (M|D|∞) If $\mathbf{c} = c$ is a constant, then $\eta_c = c$; hence $\rho = \lambda c$.

Single-Server Queue (M|G|1)

We conclude the section with a detailed treatment of the M|G|1 system. In this system, the arrival times $t_i$ are again Poisson distributed and the service times $c_i$ are i.i.d. However, there is only one server in the system.

The model is described in Fig. 16-9: Suppose that a unit enters the system at $t = t_i$. Just prior to its arrival, the system contains $N(t_i^-)$ units; one of these units is being served and the others are waiting in line. The unit entering at $t_i$ occupies the last position in the queue (shaded area). Its position in line remains unchanged as other units arrive; the unit advances when a service is completed. It reaches the server (position 1 in line when service starts) at $t = \tau_{i-1}$ and it leaves the system
at time $t = \tau_i$:

$$\tau_i = \tau_{i-1} + c_i$$

where $c_i$ is the service time. Denoting by $b_i$ the waiting time and by $a_i$ the system time, we conclude that

$$b_i = \tau_{i-1} - t_i \qquad a_i = \tau_i - t_i = b_i + c_i$$

This completes the description of the $M|G|1$ model.

FIGURE 16-9

Our objective is to express the statistics of the various parameters of the system in terms of the distribution $F_c(c)$ of the service times $c_i$ and the average density $\lambda$ of the arrival times $t_i$ under the assumption that $N(t)$ is stationary. As we shall see, the system reaches stationarity only if the mean $\eta_c$ of $c_i$ (mean service time) is less than the average interarrival time $1/\lambda$.

Note If the state $N(t)$ of a system is a stationary $M|G|1$ process, then $N(-t)$ is a $G|M|1$ process obtained from $N(t)$ by interchanging arrivals and departures. This shows that the properties of a $G|M|1$ process follow from the properties of the corresponding $M|G|1$ process.

THE IMBEDDED MARKOFF CHAIN. The $i$th unit arrives at $t = t_i$ and departs at $t = \tau_i$, remaining in the system $a_i = \tau_i - t_i$ seconds. During the interval $(t_i, \tau_i)$, $n_{a_i}$ new units arrive and all other units depart. Hence $n_{a_i}$ equals the state $N(\tau_i^+)$ of the system at $t = \tau_i^+$. This number is denoted by $q_i$ and is called the Markoff chain imbedded in the process $N(t)$ (see Sec. 16-4). Thus

$$q_i = n_{a_i} = N(\tau_i^+) \tag{16-51}$$

We wish to relate $q_i$ to $q_{i-1}$. Suppose, first, that $q_{i-1} \ge 1$ (Fig. 16-10a). In this case, at least one unit is waiting at $t = \tau_{i-1}^+$; hence $c_i = \tau_i - \tau_{i-1}$ is the $i$th service interval. During this interval, $n_{c_i}$ units arrive and none depart. And
since at $t = \tau_i$ the $i$th unit leaves, we conclude that

$$q_i = q_{i-1} + n_{c_i} - 1 \qquad q_{i-1} \ge 1$$

FIGURE 16-10

If $q_{i-1} = 0$, then the $i$th service starts not at $t = \tau_{i-1}$ (the system is then empty) but at the time of the arrival of the next unit (Fig. 16-10b). Hence

$$q_i = n_{c_i} \qquad q_{i-1} = 0$$

where $n_{c_i}$ equals the arrivals in the interior of the service interval. From the above it follows that $q_i$ satisfies the recursion equation

$$q_i = \hat q_{i-1} + n_{c_i} \tag{16-52}$$

where

$$\hat q_{i-1} = \begin{cases} q_{i-1} - 1 & q_{i-1} \ge 1 \\ 0 & q_{i-1} = 0 \end{cases} \tag{16-53}$$

Using (16-52), we shall determine the state probabilities $p_k = P\{q_i = k\}$ at $\tau_i^+$.

Zero state We maintain that

$$p_0 = 1 - \rho \qquad \rho = \lambda\eta_c \tag{16-54}$$

Proof. Clearly, $n_{c_i}$ is the number of Poisson points in the random interval $c_i$; hence [see (16-30)]

$$E\{n_{c_i}\} = \lambda\eta_c = \rho$$

With

$$\eta_q = \sum_{k=0}^{\infty} k p_k = \sum_{k=1}^{\infty} k p_k$$

the mean of $q_i$, it follows from (16-53) that

$$E\{\hat q_{i-1}\} = \sum_{k=1}^{\infty} (k - 1)p_k = \eta_q - (1 - p_0)$$
And since (stationarity assumption) $E\{q_i\} = E\{q_{i-1}\} = \eta_q$, (16-52) yields

$$\eta_q = \eta_q - (1 - p_0) + \lambda\eta_c$$

and (16-54) results.

Proceeding similarly, we can find the moments of $q_i$ (see Prob. 16-11). It is simpler, however, to determine directly the moment function

$$\Gamma_q(z) = E\{z^{q_i}\} = p_0 + \sum_{k=1}^{\infty} p_k z^k \tag{16-55}$$

of the sequence $q_i$. Using (16-52), we shall show that $\Gamma_q(z)$ can be expressed in terms of the moment function

$$\Gamma_n(z) = \Phi_c(\lambda z - \lambda)$$

In the above, $n_{c_i}$ is the number of arrivals in the random interval $c_i$, $\Gamma_n(z)$ is the moment function of $n_{c_i}$, and $\Phi_c(s)$ is the moment function of $c_i$ [see (16-32)].

THEOREM.

$$\Gamma_q(z) = \frac{p_0(1 - z)}{1 - z/\Phi_c(\lambda z - \lambda)} \tag{16-56}$$

Proof. The moment function of $\hat q_i$ equals

$$\Gamma_{\hat q}(z) = p_0 + \sum_{k=1}^{\infty} p_k z^{k-1} = p_0 + z^{-1}[\Gamma_q(z) - p_0] \tag{16-57}$$

From the independence of $q_{i-1}$ and $c_i$ it follows that $\hat q_{i-1}$ and $n_{c_i}$ are also independent. Hence [see (16-52)]

$$\Gamma_q(z) = \Gamma_{\hat q}(z)\Gamma_n(z) \tag{16-58}$$

Inserting (16-57) into (16-58), we obtain

$$z\Gamma_q(z) = [\Gamma_q(z) + p_0 z - p_0]\Gamma_n(z) \tag{16-59}$$

and (16-56) results.

Equation (16-59) can be used to obtain all moments of $q_i$. We shall use it to determine $\eta_q$. As a preparation, however, we reestablish (16-54): Since $\Gamma_q(1) = \Gamma_n(1) = 1$, we conclude, differentiating (16-59) and setting $z = 1$, that

$$1 + \Gamma_q'(1) = \Gamma_q'(1) + p_0 + \Gamma_n'(1)$$

This yields (16-54) because $\Gamma_n'(1) = E\{n_c\} = \lambda\eta_c$ [see (16-30)].

Pollaczek-Khinchin formula We maintain that the mean of $q_i$ equals

$$\eta_q = \rho + \frac{\lambda^2 E\{c^2\}}{2(1 - \rho)} \tag{16-60}$$
Proof. Differentiating (16-59) twice and setting $z = 1$, we obtain

$$2\Gamma_q'(1) = 2[\Gamma_q'(1) + p_0]\Gamma_n'(1) + \Gamma_n''(1)$$

But [see (16-33)] $\Gamma_n''(1) = \lambda^2 E\{c^2\}$; hence

$$2\eta_q = 2(\eta_q + p_0)\lambda\eta_c + \lambda^2 E\{c^2\}$$

and (16-60) results because $\rho = 1 - p_0 = \lambda\eta_c$.

Mean system time and waiting time The process $q_i$ equals the number of arrivals $n_{a_i}$ during the system time $a_i$. Hence [see (16-30)] $E\{q_i\} = E\{n_{a_i}\} = \lambda E\{a_i\}$ and (16-60) yields

$$E\{a_i\} = \eta_c + \frac{\lambda E\{c^2\}}{2(1 - \rho)} \tag{16-61}$$

This is the mean system time. Since $b_i = a_i - c_i$, the mean waiting time equals

$$E\{b_i\} = \frac{\lambda E\{c^2\}}{2(1 - \rho)} \tag{16-62}$$

Other moments can be found using moment functions. Denoting by $\Phi_a(s)$ the moment function of $a_i$, we conclude from (16-32) that $\Gamma_q(z) = \Phi_a(\lambda z - \lambda)$. Hence

$$\Phi_a(s) = \Gamma_q\!\left(1 + \frac{s}{\lambda}\right) \tag{16-63}$$

where $\Gamma_q(z)$ is given by (16-56).

Note As we know (Little's theorem), $E\{N(t)\} = \lambda E\{a_i\}$. Comparing with (16-61) and (16-60), we obtain

$$E\{N(t)\} = E\{N(\tau_i^+)\} = \lambda\eta_c + \frac{\lambda^2(\eta_c^2 + \sigma_c^2)}{2(1 - \lambda\eta_c)} \tag{16-64}$$

Thus the mean of $N(\tau_i^+)$ equals the mean of $N(t)$ for any $t$. At the end of this section we shall prove that the processes $N(t)$ and $N(\tau_i^+)$ have not only equal means but also identical distributions.

Equation (16-64) shows that, if the mean service time $\eta_c$ is specified, then the mean $E\{N(t)\}$ of the number of units in the system is minimum if $\sigma_c = 0$. This is so iff the service time $c_i$ is constant.

Example 16-10. ($M|M|1$) If $F_c(c) = 1 - e^{-\mu c}$, then

$$\Phi_c(s) = \frac{\mu}{\mu - s} \qquad E\{c\} = \frac{1}{\mu} \qquad E\{c^2\} = \frac{2}{\mu^2}$$
Thus $\rho = \lambda/\mu$, $p_0 = 1 - \lambda/\mu$, and (16-56) yields

$$\Gamma_q(z) = \frac{p_0}{1 - \rho z}$$

This shows that $q_i$ has a geometric distribution with ratio the traffic intensity $\rho$:

$$P\{q_i = k\} = p_0\rho^k \qquad E\{q_i\} = \frac{\rho}{1 - \rho} = \frac{\lambda}{\mu - \lambda}$$

We note, finally, that [see (16-63)]

$$\Phi_a(s) = \frac{\mu p_0}{\mu p_0 - s} \qquad F_a(a) = 1 - e^{-\mu p_0 a}$$

Thus the system time $a_i$ is exponential with parameter $\mu p_0$, and the mean system time equals

$$E\{a_i\} = \frac{1}{\mu p_0} = \frac{1}{\mu - \lambda}$$

Example 16-11. ($M|D|1$) If the service time is a constant $c$, then

$$\Phi_c(s) = e^{cs} \qquad E\{c\} = c \qquad E\{c^2\} = c^2$$

$\rho = \lambda c$, $p_0 = 1 - \rho$, and (16-60) yields

$$E\{q_i\} = \frac{\rho(2 - \rho)}{2(1 - \rho)}$$

Finally [see (16-56) and (16-63)]

$$\Gamma_q(z) = \frac{p_0(1 - z)}{1 - z e^{\rho(1 - z)}} \qquad \Phi_a(s) = \frac{p_0 s}{(s + \lambda)e^{-cs} - \lambda}$$

BUSY PERIOD. The state of a queueing system is characterized by a succession of idle periods, when $N(t) = 0$, and busy periods, when there is at least one unit in the system. We denote by $x$ the length of an idle period and by $y$ the length of a busy period. Clearly, $x$ is exponentially distributed as in (11-31) because there are no arrivals during an idle period. Our objective is to determine the properties of $y$.

The busy period starts with the arrival of a unit at $t = t_0$ into an empty system. The service of this unit starts at once and the unit departs at $t = \tau_0$. The difference $c = \tau_0 - t_0$ is its service time. Denoting by $n_c$ the number of arrivals in the first service interval $(t_0, \tau_0)$, we conclude that

$$N(t_0^+) = 1 \qquad N(\tau_0^+) = n_c$$

The busy period $y$ equals, thus, the interval of time from $t = t_0$, when $N(t_0^+) = 1$, until the moment $t = t_0 + y$ when $N(t) = N(t_0^+) - 1 = 0$ for the first time. But the variations of $N(t)$ during that interval do not depend on its initial value $N(t_0^+)$. Hence, a busy period $y_i$ can be characterized statistically as follows:
It is an RV equal to the time interval from $t = t_i$, when a service period begins, to $t = t_i + y_i$, when $N(t) = N(t_i^+) - 1$ for the first time.

FIGURE 16-11

From the above it follows that the RVs $y_1, y_2, y_3$ in Fig. 16-11 are independent and each is a busy period. Furthermore, the total number of these RVs equals $n_c$ because, at the start of $y_1$, the state of the system equals $N(\tau_0^+) = n_c$ and at the end of the total busy period $y$, $N(t)$ equals 0. Hence

$$y = c + y_1 + \cdots + y_{n_c} \tag{16-65}$$

Mean busy period. Using (16-65), we shall show that

$$\eta_y = \frac{\eta_c}{p_0} \qquad p_0 = 1 - \lambda\eta_c \tag{16-66}$$

Proof. Since $E\{n_c\} = \lambda\eta_c$, it follows from (8-47) that

$$E\left\{\sum_{i=1}^{n_c} y_i\right\} = E\{n_c\}E\{y_i\} = \lambda\eta_c\eta_y$$

Inserting into (16-65), we obtain $\eta_y = \eta_c + \lambda\eta_c\eta_y$ and (16-66) results.

Mean number of units served. The number of units served in a busy period, not counting the unit whose arrival starts the period, equals the number of arrivals $n_y$ in the random interval $y$. From (16-30) and (16-66) it follows that the mean of this number equals

$$E\{n_y\} = \lambda\eta_y = \frac{\rho}{p_0} \tag{16-67}$$

Moment function. We shall show that the moment function $\Phi_y(s)$ of the busy period $y$ satisfies the functional equation

$$\Phi_y(s) = \Phi_c[s + \lambda\Phi_y(s) - \lambda] \tag{16-68}$$
Proof. If the first service time takes a given value $c$, then $n_c$ is a Poisson RV with parameter $\lambda c$. And since the RVs $y$ and $y_i$ have the same statistics, we conclude that (see Prob. 8-13)

$$E\{\exp[s(y_1 + \cdots + y_{n_c})]\} = e^{\lambda c\Phi_y(s) - \lambda c}$$

From this and (16-65) it follows that

$$E\{e^{sy}\} = E\{E\{e^{sy} \mid c\}\} = E\{e^{[s + \lambda\Phi_y(s) - \lambda]c}\}$$

and (16-68) results.

Note that (16-68) does not give $\Phi_y(s)$ explicitly. It can, however, be used to determine the moments of $y$. For example, its derivative at $s = 0$ yields

$$\Phi_y'(0) = [1 + \lambda\Phi_y'(0)]\Phi_c'(0)$$

in agreement with (16-66).

Ergodicity. The process $N(t)$ and the sequences $N(t_i^-)$ and $N(\tau_i^+)$ are ergodic. This means that statistical averages can be expressed as time averages. The proof is a consequence of the fact that the state of the system is a succession of idle and busy periods and its behavior in each period does not depend on what happens elsewhere in time. The details, however, are omitted. We note the following consequences:

1. The state of the system $N(t_i^-)$ just before a unit arrives is an ergodic sequence of RVs; hence the probability that $N(t_i^-) = k$ equals the number of times this occurs in a large interval $T$ divided by the number of points $t_i$ in this interval. This number also equals the number of times $N(\tau_i^+)$ equals $k$ because every time $N(t)$ crosses the line $N = k$ increasing, it crosses it also decreasing. Hence

$$P\{N(t_i^-) = k\} = P\{N(\tau_i^+) = k\}$$

In other words, the processes $N(t_i^-)$ and $N(\tau_i^+)$ have the same statistics.

2. The expected values $\eta_x = E\{x_i\}$, $\eta_y = E\{y_i\}$ of the idle and busy periods, and their total number $n$ in the interval $T = T_x + T_y$ are such that

$$\eta_x \simeq \frac{T_x}{n} \qquad \eta_y \simeq \frac{T_y}{n}$$

where $T_x$ and $T_y$ are the total times the system is idle or busy. Furthermore, $P\{N(t) = 0\} \simeq T_x/T$. This leads to the conclusion that

$$P\{N(t) = 0\} \simeq \frac{\eta_x}{\eta_x + \eta_y} = \frac{1/\lambda}{1/\lambda + \eta_c/p_0} = \frac{p_0}{p_0 + \lambda\eta_c}$$

because $\eta_x = 1/\lambda$ and $\eta_y = \eta_c/p_0$ [see (16-66)]. And since $p_0 = 1 - \lambda\eta_c = P\{N(\tau_i^+) = 0\}$, we conclude that

$$P\{N(t) = 0\} = P\{N(\tau_i^+) = 0\} = p_0 \qquad P\{N(t) > 0\} = \rho$$
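The time-average argument above can be checked numerically. The following is a minimal sketch, not part of the original text: it simulates idle/busy cycles of an $M|G|1$ queue and compares the sample means with $\eta_x = 1/\lambda$ and $\eta_y = \eta_c/p_0$. The function names `poisson_count` and `simulate_cycles`, the rates, and the choice of exponential service times in the demo are all illustrative assumptions.

```python
import random


def poisson_count(rng, lam, length):
    """Number of Poisson points of density lam in an interval of the given length."""
    count = 0
    t = rng.expovariate(lam)
    while t < length:
        count += 1
        t += rng.expovariate(lam)
    return count


def simulate_cycles(lam, draw_service, n_cycles, seed=0):
    """Simulate n_cycles idle/busy cycles of an M|G|1 queue.

    draw_service(rng) is a caller-supplied (hypothetical) sampler of one
    service time. Returns the sample means (eta_x, eta_y) of the idle and
    busy periods.
    """
    rng = random.Random(seed)
    total_idle = 0.0
    total_busy = 0.0
    for _ in range(n_cycles):
        # Idle period: time until the next Poisson arrival.
        total_idle += rng.expovariate(lam)
        # Busy period: one unit is present at the start; each completed
        # service removes one unit and admits the units that arrived
        # during that service interval.
        in_system = 1
        while in_system > 0:
            c = draw_service(rng)
            total_busy += c
            in_system += poisson_count(rng, lam, c) - 1
    return total_idle / n_cycles, total_busy / n_cycles


if __name__ == "__main__":
    lam, mu = 0.5, 1.0  # assumed rates, so rho = 0.5 and p0 = 0.5
    eta_x, eta_y = simulate_cycles(lam, lambda rng: rng.expovariate(mu), 20000, seed=1)
    p0 = 1 - lam / mu
    print("mean idle period:", eta_x, "theory:", 1 / lam)
    print("mean busy period:", eta_y, "theory:", (1 / mu) / p0)
    print("idle fraction   :", eta_x / (eta_x + eta_y), "theory:", p0)
```

The busy period is accumulated as the sum of all service times completed before the system empties, which is the decomposition used in (16-65). Any other service-time sampler can be passed in as long as $\lambda\eta_c < 1$.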
State statistics. We shall, finally, show that the process $N(t)$ and the sequence $q_i = N(\tau_i^+)$ have the same distribution

$$P\{N(t) = k\} = P\{q_i = k\} = p_k \qquad k \ge 0 \tag{16-69}$$

Proof. If we eliminate all idle periods and connect all busy periods together, contracting the $t$ axis, we obtain a renewal process specified in terms of the service times $c_i = \tau_i - \tau_{i-1}$. With $t$ an arbitrary constant, we denote by $w$ the distance from $t$ to the nearest point $\tau_{i-1}$ on its left and by $n_w$ the number of arrivals in the interval $(\tau_{i-1}, t)$. From (16-37) and (16-32) it follows that

$$\Phi_w(s) = \frac{\Phi_c(s) - 1}{\eta_c s} \qquad \Gamma_{n_w}(z) = \Phi_w(\lambda z - \lambda) = \frac{\Gamma_n(z) - 1}{\rho(z - 1)} \tag{16-70}$$

This holds for every $t$ in a busy period, that is, when $N = N(t) \ne 0$.

Returning to the original time (Fig. 16-12), we observe that, if $N \ne 0$, then

$$N = \tilde q_{i-1} + n_w \qquad \text{where} \qquad \tilde q_{i-1} = \begin{cases} q_{i-1} & q_{i-1} > 0 \\ 1 & q_{i-1} = 0 \end{cases}$$

Since $n_w$ is independent of $q_{i-1}$, the above yields

$$E\{z^N \mid N \ne 0\} = \Gamma_{\tilde q}(z)\Gamma_{n_w}(z)$$

where

$$\Gamma_{\tilde q}(z) = p_0 z + \sum_{k=1}^{\infty} p_k z^k = p_0 z + \Gamma_q(z) - p_0 \tag{16-71}$$

As we know, $P\{N = 0\} = p_0$, $P\{N > 0\} = \rho$. Hence

$$\Gamma_N(z) = E\{z^N\} = E\{z^N \mid N = 0\}p_0 + E\{z^N \mid N > 0\}\rho$$

From this and (16-59) it follows, after some manipulations, that $\Gamma_N(z) = \Gamma_q(z)$ and (16-69) results.

FIGURE 16-12
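As a numerical companion to the results of this section, the sketch below evaluates the Pollaczek-Khinchin mean (16-60) and simulates the imbedded chain through the recursion (16-52) and (16-53). It is an illustration under stated assumptions, not the author's procedure: the function names `pk_mean_queue` and `simulate_embedded_chain`, the arrival density, and the constant service time in the demo are hypothetical choices.

```python
import random


def pk_mean_queue(lam, eta_c, second_moment_c):
    """Mean of q_i from the Pollaczek-Khinchin formula (16-60)."""
    rho = lam * eta_c
    if not 0 <= rho < 1:
        raise ValueError("stationarity requires rho = lam * eta_c < 1")
    return rho + lam ** 2 * second_moment_c / (2 * (1 - rho))


def simulate_embedded_chain(lam, draw_service, n_steps, seed=0):
    """Iterate q_i = max(q_{i-1} - 1, 0) + n_{c_i} and return (p0_hat, mean_hat).

    draw_service(rng) is a caller-supplied (hypothetical) sampler of one
    service time.
    """
    rng = random.Random(seed)
    q = 0
    zeros = 0
    total = 0
    for _ in range(n_steps):
        c = draw_service(rng)
        # n_c: number of Poisson arrivals inside the service interval c.
        n_c = 0
        t = rng.expovariate(lam)
        while t < c:
            n_c += 1
            t += rng.expovariate(lam)
        q = max(q - 1, 0) + n_c  # recursion (16-52) with (16-53)
        zeros += (q == 0)
        total += q
    return zeros / n_steps, total / n_steps


if __name__ == "__main__":
    lam, c = 0.5, 1.0  # assumed M|D|1 parameters, rho = 0.5
    p0_hat, mean_hat = simulate_embedded_chain(lam, lambda rng: c, 200000, seed=2)
    print("p0  :", p0_hat, "theory:", 1 - lam * c)
    print("mean:", mean_hat, "theory:", pk_mean_queue(lam, c, c * c))
```

For exponential service with parameter $\mu$, calling `pk_mean_queue(lam, 1 / mu, 2 / mu ** 2)` reproduces $\rho/(1 - \rho)$, the $M|M|1$ mean of Example 16-10.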
16-3 SHOT NOISE

Shot noise is the process

$$s(t) = \sum_i h(t - t_i) \tag{16-72}$$

where $h(t)$ is a given function and $t_i$ is a set of Poisson points. The process $s(t)$ is the output of a system with impulse response $h(t)$ and input a Poisson impulse train $z(t)$ (Fig. 16-13). If $h(t) = U(t)$, then $s(t)$ is a Poisson process (Fig. 10-3a). If $h(t)$ is a pulse of width $c$, then $s(t)$ is the queueing process $M|D|\infty$ of Example 16-9b.

FIGURE 16-13

In Sec. 11-2, we determined the second-order properties of the shot noise. In the following, we evaluate its general statistics.

DENSITY FUNCTION. We start with the determination of the density $f_s(x)$ of $s(t)$:

$$f_s(x)\,dx = P\{x < s(t) < x + dx\}$$

under the assumption that $h(t)$ is of finite duration

$$h(t) = 0 \quad \text{for } t < 0 \text{ and } t > T \tag{16-73}$$

Denoting by $n_T$ the number of Poisson points in the interval $(t - T, t)$, we have [see (4-48)]

$$f_s(x) = \sum_{k=0}^{\infty} f_s(x \mid n_T = k)P\{n_T = k\} \tag{16-74}$$

In the above

$$P\{n_T = k\} = e^{-\lambda T}\frac{(\lambda T)^k}{k!}$$

To find $f_s(x)$, it suffices, therefore, to find the conditional density $f_s(x \mid n_T = k)$ assuming that there are $k$ points in the interval $(t - T, t)$. The evaluation of this density is based on the following property of Poisson points [see (3-51)]: If it is known that there are exactly $k$ points in an interval $(t_1, t_2)$, then these points have the same statistics as $k$ arbitrary points placed at random in this interval. In other words, the $k$ points can be assumed to be $k$ independent RVs uniform in the interval $(t_1, t_2)$.
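The conditioning step in (16-74) suggests a direct way to draw samples of $s(t)$: draw the count $n_T$, then place that many delays uniformly in $(0, T)$. The sketch below follows this recipe; it is an illustration only, and the function names `sample_shot_noise` and `estimate`, the density, and the rectangular pulse used in the demo are assumptions. For a rectangular pulse of unit height and width $T$ the sample is simply $n_T$, so its mean should be close to $\lambda T$ and the fraction of zero samples close to $e^{-\lambda T}$.

```python
import math
import random


def sample_shot_noise(lam, T, h, rng):
    """One sample of s(t) for an impulse response h that vanishes outside (0, T)."""
    # Number of Poisson points in (t - T, t): count exponential gaps that fit in T.
    k = 0
    elapsed = rng.expovariate(lam)
    while elapsed < T:
        k += 1
        elapsed += rng.expovariate(lam)
    # Given k, the delays t - t_i are treated as k independent RVs uniform in (0, T).
    return sum(h(rng.uniform(0.0, T)) for _ in range(k))


def estimate(lam, T, h, n_samples, seed=0):
    """Return (sample mean of s(t), fraction of samples equal to zero)."""
    rng = random.Random(seed)
    samples = [sample_shot_noise(lam, T, h, rng) for _ in range(n_samples)]
    mean = sum(samples) / n_samples
    frac_zero = sum(1 for s in samples if s == 0) / n_samples
    return mean, frac_zero


if __name__ == "__main__":
    lam, T = 2.0, 0.5  # assumed values, so lam * T = 1
    mean, frac_zero = estimate(lam, T, lambda tau: 1.0, 50000, seed=3)
    print("mean     :", mean, "theory:", lam * T)
    print("P{s = 0} :", frac_zero, "theory:", math.exp(-lam * T))
```

Passing a different `h`, for instance a linear ramp, gives samples whose histogram approximates the mixture density built from the conditional densities $f_s(x \mid n_T = k)$.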
From the above and (16-73) it follows that the conditional density $f_s(x \mid n_T = 1)$ equals the density $g_1(x)$ of the process

$$x_1(t) = h(t - t_1) = h(\tau_1) \qquad \tau_1 = t - t_1$$

where $t_1$ is uniform in the interval $(t - T, t)$ or, equivalently, $\tau_1$ is uniform in the interval $(0, T)$. The function $g_1(x)$ is independent of $t$ and can be found with the techniques of Sec. 5-2.

Similarly, $f_s(x \mid n_T = 2)$ equals the density of the process

$$x_2(t) = h(t - t_1) + h(t - t_2) = h(\tau_1) + h(\tau_2)$$

where $\tau_1$ and $\tau_2$ are two independent RVs uniform in the interval $(0, T)$. Hence the RVs $h(\tau_1)$ and $h(\tau_2)$ have the same density $g_1(x)$ and they are independent because they are functions of the independent RVs $\tau_1$ and $\tau_2$. From this it follows that [see (6-39)] the density $g_2(x)$ of $x_2(t)$ is the convolution

$$g_2(x) = g_1(x) * g_1(x)$$

Reasoning similarly, we conclude that

$$f_s(x \mid n_T = k) = g_k(x) = g_1(x) * \cdots * g_1(x) \tag{16-75}$$

We note, finally, that if $n_T = 0$, then $s(t) = 0$; therefore,

$$f_s(x \mid n_T = 0) = g_0(x) = \delta(x)$$

Inserting into (16-74), we obtain

$$f_s(x) = e^{-\lambda T}\sum_{k=0}^{\infty} \frac{(\lambda T)^k}{k!}\,g_k(x) \tag{16-76}$$

This formula is useful mainly for "low density" shot noise, that is, when $\lambda T$ is of the order of 1.

Example 16-12. Suppose that $h(t)$ is a trapezoid as in Fig. 16-14. In this case

$$P\{x_1(t) \le x\} = P\{\tau_1 \ge (1.5 - x)T\} = x - 0.5$$
for $0.5 \le x \le 1.5$; the distribution equals 0 below this interval and 1 above it. Hence $g_1(x)$ is uniform in the interval $(0.5, 1.5)$, $g_2(x)$ is a triangle in the interval $(1, 3)$, and $g_3(x)$ consists of three parabola pieces in the interval $(1.5, 4.5)$. Assuming that $\lambda = 1/T$, we have

$$P\{n_T = k\} = \frac{e^{-1}}{k!} \qquad \text{that is,} \qquad P\{n_T = 0\} = P\{n_T = 1\} = \frac{1}{e} \qquad P\{n_T = 2\} = \frac{1}{2e} \qquad P\{n_T = 3\} = \frac{1}{6e}$$

Inserting into (16-76) and neglecting higher-order terms, we obtain the density $f_s(x)$ shown in Fig. 16-14.

Moment function. The moment function of $s(t)$ equals

$$\Phi_s(s) = \int_{-\infty}^{\infty} e^{sx}f_s(x)\,dx = \exp\left\{\lambda\int_0^T [e^{sh(\alpha)} - 1]\,d\alpha\right\} \tag{16-77}$$

Proof. This is a special case of (16-80) but can be established directly: As we have seen, $s(t)$ can be written as a sum

$$s(t) = \sum_{i=1}^{n_T} h(\tau_i) \tag{16-78}$$

where $n_T$ is a Poisson RV with parameter $\lambda T$ and $\tau_i$ are independent RVs uniform in the interval $(0, T)$. Reasoning as in Prob. 8-13, we obtain (16-77) because

$$E\{e^{sh(\tau_i)}\} = \frac{1}{T}\int_0^T e^{sh(\alpha)}\,d\alpha \tag{16-79}$$

General Properties

Suppose now that the points $t_i$ are nonuniform with mean density $\lambda(t)$ and that the function $h(t)$ is arbitrary. We shall show that the second moment function of the resulting nonhomogeneous shot-noise process $s(t)$ equals

$$\Psi(s) = \ln\Phi(s) = \int_{-\infty}^{\infty} \lambda(\alpha)[e^{sh(t - \alpha)} - 1]\,d\alpha \tag{16-80}$$

FIGURE 16-15

Proof. We divide the time axis into consecutive intervals $I_i = (\alpha_i, \alpha_{i+1})$ of length $\Delta\alpha = \alpha_{i+1} - \alpha_i$ as in Fig. 16-15 and we denote by $\Delta n_i$ the number of points $t_i$
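in $I_i$. If $\Delta\alpha$ is sufficiently small, then the contribution $\Delta s_i(t)$ to $s(t)$ due to these points is given by [see (16-72)]

$$\Delta s_i(t) = h(t - \alpha_i)\,\Delta n_i \tag{16-81}$$

As we know, $\Delta n_i$ is a Poisson RV with parameter

$$\int_{\alpha_i}^{\alpha_i + \Delta\alpha} \lambda(\alpha)\,d\alpha \simeq \lambda(\alpha_i)\,\Delta\alpha$$

Hence the moment function of $\Delta s_i(t)$ equals

$$\Delta\Phi_i(s) = E\{e^{sh(t - \alpha_i)\Delta n_i}\} = \exp\{\lambda(\alpha_i)\,\Delta\alpha\,[e^{sh(t - \alpha_i)} - 1]\} \tag{16-82}$$

Furthermore,

$$s(t) = \sum_i \Delta s_i(t) = \sum_i h(t - \alpha_i)\,\Delta n_i$$

And since the RVs $\Delta n_i$ are independent, we conclude that

$$\Psi(s) = \sum_i \ln\Delta\Phi_i(s) = \sum_i \lambda(\alpha_i)\,\Delta\alpha\,[e^{sh(t - \alpha_i)} - 1] \tag{16-83}$$

As $\Delta\alpha \to 0$, the last sum tends to the integral in (16-80).

Generalization of Campbell's theorem. The mean $\eta_s$ and the variance $\sigma_s^2$ of the nonhomogeneous shot-noise process $s(t)$ are given by

$$\eta_s = \int_{-\infty}^{\infty} \lambda(\alpha)h(t - \alpha)\,d\alpha \qquad \sigma_s^2 = \int_{-\infty}^{\infty} \lambda(\alpha)h^2(t - \alpha)\,d\alpha \tag{16-84}$$

Proof. Expanding the exponential in (16-80) into a series and integrating termwise, we obtain (16-84) because [see (5-73)]

$$\eta_s = \Psi'(0) \qquad \sigma_s^2 = \Psi''(0)$$

JOINT CHARACTERISTIC FUNCTIONS. The joint second moment function $\Psi$ of the $n$ RVs $s(t_1), \ldots, s(t_n)$ equals

$$\Psi(s_1, \ldots, s_n) = \int_{-\infty}^{\infty} \lambda(\alpha)(e^{\beta} - 1)\,d\alpha \tag{16-85}$$

where

$$\beta = s_1h(t_1 - \alpha) + \cdots + s_nh(t_n - \alpha)$$

Proof. We assume for simplicity that $n = 2$. Proceeding as in (16-81), we denote by $\Delta n_i$ the number of impulses in the interval $I_i = (\alpha_i, \alpha_{i+1})$ of Fig. 16-15, and by

$$\Delta s_i(t_1) \simeq h(t_1 - \alpha_i)\,\Delta n_i \qquad \Delta s_i(t_2) \simeq h(t_2 - \alpha_i)\,\Delta n_i$$

the response of the system at $t = t_1$ and $t = t_2$, respectively, due to these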
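impulses. The joint moment function $\Delta\Phi_i$ of the RVs $\Delta s_i(t_1)$ and $\Delta s_i(t_2)$ equals

$$E\{\exp([s_1h(t_1 - \alpha_i) + s_2h(t_2 - \alpha_i)]\,\Delta n_i)\}$$

This is of the same form as the second term in (16-82) if we replace the exponent $sh(t - \alpha_i)$ by the sum $s_1h(t_1 - \alpha_i) + s_2h(t_2 - \alpha_i)$. Hence, as in (16-83),

$$\Psi(s_1, s_2) \simeq \sum_i \lambda(\alpha_i)[e^{s_1h(t_1 - \alpha_i) + s_2h(t_2 - \alpha_i)} - 1]\,\Delta\alpha$$

and with $\Delta\alpha \to 0$, (16-85) results.

Covariance. We shall use (16-85) to determine the autocovariance $C(t_1, t_2)$ of $s(t)$. As we know [see page 160], $C(t_1, t_2)$ is the coefficient of $s_1s_2$ in the series expansion of $\Psi(s_1, s_2)$ about the origin. Expanding the term $e^{\beta}$ in (16-85), we conclude that

$$C(t_1, t_2) = \int_{-\infty}^{\infty} \lambda(\alpha)h(t_1 - \alpha)h(t_2 - \alpha)\,d\alpha \tag{16-86}$$

If $\lambda(t) = \lambda = \text{constant}$, then, with $t_1 - t_2 = \tau$, the above yields

$$C(\tau) = \lambda\int_{-\infty}^{\infty} h(\tau + \alpha)h(\alpha)\,d\alpha \tag{16-87}$$

in agreement with (11-50).

High density and normality. We shall show that if the density $\lambda$ of stationary shot noise is large compared with the time constants of $h(t)$, then $s(t)$ is approximately normal. For this purpose, we introduce the normalizations

$$\lambda_0 = \frac{\lambda}{k} \qquad h_0(t) = \sqrt{k}\,h(t) \qquad \beta_0 = \sqrt{k}\,\beta$$

and examine the form of $\Psi(s_1, s_2)$ as $k \to \infty$. Clearly,

$$\lambda(e^{\beta} - 1) = k\lambda_0\left(\frac{\beta_0}{\sqrt{k}} + \frac{\beta_0^2}{2k} + \frac{\beta_0^3}{6k\sqrt{k}} + \cdots\right)$$

Neglecting negative powers of $k$, we conclude from (16-85) that

$$\Psi(s_1, s_2) \simeq \lambda_0\int_{-\infty}^{\infty} \left(\sqrt{k}\,\beta_0 + \frac{\beta_0^2}{2}\right)d\alpha \tag{16-88}$$

This shows that $\Psi$ is a quadratic function of $s_i$ because $\beta_0$ is linear in $s_i$. Hence $s(t)$ is nearly normal. This result is exact in the limit as $k \to \infty$ if the linear term is omitted (centering).

INTENSITY OF SHOT NOISE. The square

$$I(t) = s^2(t)$$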
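of the shot noise $s(t)$ is a stationary process with mean

$$\eta_I = E\{s^2(t)\} = \lambda\int_{-\infty}^{\infty} h^2(t)\,dt + \lambda^2\left[\int_{-\infty}^{\infty} h(t)\,dt\right]^2 \tag{16-89}$$

We shall determine its autocorrelation $R_I(\tau)$ under the assumption that

$$H(0) = \int_{-\infty}^{\infty} h(t)\,dt = 0 \tag{16-90}$$

With this assumption, the mean and autocorrelation of $s(t)$ are given by

$$\eta_s = 0 \qquad R_s(\tau) = \lambda\int_{-\infty}^{\infty} h(\tau + \alpha)h(\alpha)\,d\alpha \tag{16-91}$$

We maintain that

$$R_I(\tau) = \lambda^2E^2 + 2R_s^2(\tau) + \lambda\int_{-\infty}^{\infty} h^2(\tau + \alpha)h^2(\alpha)\,d\alpha \tag{16-92}$$

where $E = \int_{-\infty}^{\infty} h^2(t)\,dt$ is the energy of $h(t)$.

Proof. The second moment function $\Psi(s_1, s_2)$ of the RVs $s(t_1)$ and $s(t_2)$ equals the integral in (16-85) where

$$\beta = s_1h(t_1 - \alpha) + s_2h(t_2 - \alpha) \qquad \lambda(t) = \lambda$$

As we know [see (7-34)] the coefficient of $s_1^2s_2^2$ in the expansion of the moment function $\Phi(s_1, s_2)$ equals

$$\frac{E\{s^2(t_1)s^2(t_2)\}}{4}$$

We introduce the functions

$$\gamma_n = \frac{\lambda}{n!}\int_{-\infty}^{\infty} \beta^n\,d\alpha$$

Since

$$e^{\beta} - 1 = \beta + \frac{\beta^2}{2} + \cdots$$

and $\gamma_1 = 0$ [see (16-90)], we conclude that

$$\Psi(s_1, s_2) = \gamma_2 + \gamma_3 + \gamma_4 + \cdots$$

$$\Phi(s_1, s_2) = e^{\Psi} = 1 + \gamma_2 + \gamma_3 + \gamma_4 + \cdots + \frac{(\gamma_2 + \gamma_3 + \gamma_4 + \cdots)^2}{2} + \cdots$$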
In this expansion, only the sum $\gamma_4 + \gamma_2^2/2$ will have terms in $s_1^2s_2^2$. Furthermore,

$$\gamma_4 = \cdots + \frac{\lambda}{4}\,s_1^2s_2^2\int_{-\infty}^{\infty} h^2(t_1 - \alpha)h^2(t_2 - \alpha)\,d\alpha + \cdots$$

$$\gamma_2^2 = \cdots + \frac{\lambda^2}{2}\,s_1^2s_2^2\int_{-\infty}^{\infty} h^2(t_1 - \alpha)\,d\alpha\int_{-\infty}^{\infty} h^2(t_2 - \alpha)\,d\alpha + \lambda^2s_1^2s_2^2\left[\int_{-\infty}^{\infty} h(t_1 - \alpha)h(t_2 - \alpha)\,d\alpha\right]^2 + \cdots$$

and with $t_1 = t_2 + \tau$, (16-92) results.

Power spectrum. The power spectrum $S_s(\omega)$ of $s(t)$ equals [see (16-91)]

$$S_s(\omega) = \lambda|H(\omega)|^2 \tag{16-93}$$

Furthermore,

$$\lambda^2E^2 \leftrightarrow 2\pi\lambda^2E^2\delta(\omega) \qquad 2\pi R_s^2(\tau) \leftrightarrow S_s(\omega) * S_s(\omega)$$

The integral in (16-92) equals $h^2(\tau) * h^2(-\tau)$. And since

$$2\pi h^2(t) \leftrightarrow H(\omega) * H(\omega)$$

we conclude from (16-92) that the power spectrum of the intensity $I(t)$ of $s(t)$ equals

$$S_I(\omega) = 2\pi\lambda^2E^2\delta(\omega) + \frac{\lambda^2}{\pi}|H(\omega)|^2 * |H(\omega)|^2 + \frac{\lambda}{4\pi^2}|H(\omega) * H(\omega)|^2 \tag{16-94}$$

High density If $\lambda$ is sufficiently large, then the third term above can be neglected. In this case, $s(t)$ is nearly normal and the power spectrum of $s^2(t)$ equals the sum of the first two terms in (16-94) [see also (10-68)].

Low density If $\lambda$ is small, then the first two terms in (16-94) can be neglected. In this case,

$$I(t) = s^2(t) \simeq \sum_i h^2(t - t_i) \tag{16-95}$$

because the probability that the terms $h(t - t_i)$ and $h(t - t_k)$ have a significant overlap is negligible. This means that the square of $s(t)$ is approximately a shot-noise process generated by $h^2(t)$.

16-4 MARKOFF PROCESSES

A Markoff process is a stochastic process whose past has no influence on the future if its present is specified. This means the following: If $t_{n-1} < t_n$, then

$$P\{x(t_n) \le x_n \mid x(t),\ t \le t_{n-1}\} = P\{x(t_n) \le x_n \mid x(t_{n-1})\} \tag{16-96}$$
From this it follows that if $t_1 < t_2 < \cdots < t_n$, then

$$P\{x(t_n) \le x_n \mid x(t_{n-1}), \ldots, x(t_1)\} = P\{x(t_n) \le x_n \mid x(t_{n-1})\} \tag{16-97}$$

The above definition holds also for discrete-time processes if $x(t_n)$ is replaced by $x_n$.

In this section, we develop various properties of Markoff processes, concentrating on three classes: (a) discrete-time, discrete-state; (b) continuous-time, discrete-state; (c) continuous-time, continuous-state. We start with a brief discussion of certain general properties, phrasing the results in terms of discrete-time, continuous-state processes.

1. From (16-97) it follows that

$$f(x_n \mid x_{n-1}, \ldots, x_1) = f(x_n \mid x_{n-1}) \tag{16-98}$$

Applying the chain rule (8-37) to the above, we obtain

$$f(x_1, \ldots, x_n) = f(x_n \mid x_{n-1})f(x_{n-1} \mid x_{n-2}) \cdots f(x_2 \mid x_1)f(x_1) \tag{16-99}$$

Conversely, if (16-99) is true for all $n$, then the process $x_n$ is Markoff because, in this case,

$$f(x_n \mid x_{n-1}, \ldots, x_1) = \frac{f(x_1, \ldots, x_n)}{f(x_1, \ldots, x_{n-1})} = f(x_n \mid x_{n-1}) \tag{16-100}$$

2. From (16-98) it follows that

$$E\{x_n \mid x_{n-1}, \ldots, x_1\} = E\{x_n \mid x_{n-1}\} \tag{16-101}$$

3. A Markoff process is also Markoff if time is reversed:

$$f(x_n \mid x_{n+1}, \ldots, x_{n+k}) = f(x_n \mid x_{n+1}) \tag{16-102}$$

Proof. The left side of (16-102) equals

$$\frac{f(x_n, x_{n+1}, \ldots, x_{n+k})}{f(x_{n+1}, \ldots, x_{n+k})} = \frac{f(x_{n+1} \mid x_n)f(x_n)}{f(x_{n+1})}$$

And since

$$f(x_{n+1} \mid x_n)f(x_n) = f(x_n, x_{n+1}) = f(x_n \mid x_{n+1})f(x_{n+1})$$

(16-102) results.

4. If the present is specified, then the past is independent of the future in the following sense: If $k < m < n$, then

$$f(x_n, x_k \mid x_m) = f(x_n \mid x_m)f(x_k \mid x_m) \tag{16-103}$$
Proof. From (16-99) it follows that

$$f(x_n, x_k \mid x_m) = \frac{f(x_n, x_m, x_k)}{f(x_m)} = \frac{f(x_n \mid x_m)f(x_m \mid x_k)f(x_k)}{f(x_m)}$$

and (16-103) results.

The above relationship can be used to express conditional densities involving the past and the future in terms of the conditional (transition) densities $f(x_k \mid x_{k+1})$. For example, if $k < m < n$, then

$$f(x_m \mid x_n, x_k) = \frac{f(x_n \mid x_m)f(x_m \mid x_k)}{f(x_n \mid x_k)} \tag{16-104}$$

Example 16-13. If $x_n$ satisfies the recursion equation

$$x_{n+1} - \alpha(x_n, n) = v_n \tag{16-105}$$

and $v_n$ is strictly white noise, then $x_n$ is Markoff.

Proof. The process $x_{n+1}$ is determined in terms of $x_n$ and $v_n$; hence, once the value of $x_n$ is specified, it is independent of $x_k$ for $k < n$.

Special case (generalized random walk). If $x_1 = 0$ and $\alpha(x_n, n) = x_n$, then

$$x_n = x_{n-1} + v_{n-1} = v_1 + v_2 + \cdots + v_{n-1}$$

Thus the sum of independent RVs is a Markoff sequence.

Homogeneous processes. From the chain rule (16-99) it follows that the statistics of any order of a Markoff process can be determined in terms of the conditional densities $f(x_n \mid x_{n-1})$ and the first-order density $f(x_n)$. If the process $x_n$ is stationary, then the functions $f(x_n)$ and $f(x_n \mid x_{n-1})$ are invariant to a shift of the origin. In this case, the statistics of $x_n$ are completely determined in terms of the second-order density

$$f(x_1, x_2) = f(x_2 \mid x_1)f(x_1)$$

A Markoff process $x_n$ is called homogeneous if the conditional density $f(x_n \mid x_{n-1})$ is invariant to a shift of the origin but the first-order density $f(x_n)$ might depend on $n$. In general, a homogeneous process is not stationary. However, in many cases, it tends to a stationary process as $n \to \infty$.

The Chapman-Kolmogoroff equation The conditional density $f(x_n \mid x_k)$ can be expressed in terms of $f(x_n \mid x_m)$ and $f(x_m \mid x_k)$ for any $n > m > k$:

$$f(x_n \mid x_k) = \int_{-\infty}^{\infty} f(x_n \mid x_m)f(x_m \mid x_k)\,dx_m \tag{16-106}$$

This follows from (8-39) because $f(x_n \mid x_m, x_k) = f(x_n \mid x_m)$.
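As a worked illustration of (16-106), consider the generalized random walk of Example 16-13 under the added assumption, not made in the text, that the RVs $v_n$ are i.i.d. normal with zero mean and variance $\sigma^2$. For $n > m$, the difference $x_n - x_m$ is then a sum of $n - m$ such terms, so the transition density is

$$f(x_n \mid x_m) = \frac{1}{\sigma\sqrt{2\pi(n - m)}}\exp\left\{-\frac{(x_n - x_m)^2}{2(n - m)\sigma^2}\right\}$$

The integral in (16-106) is in this case the convolution of two zero-mean normal densities with variances $(n - m)\sigma^2$ and $(m - k)\sigma^2$. Since such a convolution is normal and the variances add, we obtain

$$\int_{-\infty}^{\infty} f(x_n \mid x_m)f(x_m \mid x_k)\,dx_m = \frac{1}{\sigma\sqrt{2\pi(n - k)}}\exp\left\{-\frac{(x_n - x_k)^2}{2(n - k)\sigma^2}\right\} = f(x_n \mid x_k)$$

which has the same form as the transition density above with $n - m$ replaced by $n - k$, as the Chapman-Kolmogoroff equation requires.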
Discrete-Time Markoff Chains

A discrete-time Markoff chain is a Markoff process $x_n$ having a countable number of states $a_i$. A Markoff chain is specified in terms of its state probabilities

$$p_i[n] = P\{x_n = a_i\} \tag{16-107}$$

and the transition probabilities

$$\pi_{ij}[n_1, n_2] = P\{x_{n_2} = a_j \mid x_{n_1} = a_i\} \tag{16-108}$$

As we know [see (7-48)]

$$\sum_j \pi_{ij}[n_1, n_2] = 1 \qquad \sum_i p_i[k]\pi_{ij}[k, n] = p_j[n] \tag{16-109}$$

Furthermore, if $n_1 < n_2 < n_3$, then

$$\pi_{ij}[n_1, n_3] = \sum_r \pi_{ir}[n_1, n_2]\pi_{rj}[n_2, n_3] \tag{16-110}$$

This is the discrete form of the Chapman-Kolmogoroff equation and it follows readily from (8-40).

HOMOGENEOUS CHAINS. If the process $x_n$ is homogeneous, then the transition probabilities in (16-108) depend only on the difference $m = n_2 - n_1$. Thus

$$\pi_{ij}[m] = P\{x_{n+m} = a_j \mid x_n = a_i\} \tag{16-111}$$

Setting $n_2 - n_1 = k$, $n_3 - n_2 = n$ in (16-110), we obtain

$$\pi_{ij}[n + k] = \sum_r \pi_{ir}[k]\pi_{rj}[n] \tag{16-112}$$

For a finite-state Markoff chain, the above can be written in matrix form:

$$\Pi[n + k] = \Pi[n]\Pi[k] \tag{16-113}$$

where $\Pi[n]$ is a Markoff matrix with elements $\pi_{ij}[n]$. This yields

$$\Pi[n] = \Pi^n \qquad \text{where} \qquad \Pi = \Pi[1] \tag{16-114}$$

is the one-step transition matrix with elements $\pi_{ij} = \pi_{ij}[1]$. The above is the solution of the first-order recursion equation [see (16-113)]

$$\Pi[n + 1] = \Pi[n]\Pi \tag{16-115}$$

The matrix $\Pi$ is shown schematically in Fig. 16-16. The circles in the diagram represent the states $a_i$ of the process and the number on each segment, from $a_i$ to $a_j$, the transition probabilities $\pi_{ij}$. The number on the loop from $a_i$ to $a_i$ equals $\pi_{ii}$. This loop (dashed line) can be omitted because the sum of the row elements of $\Pi$ (segments leaving a state, including the loop) equals 1 [see (16-109)].
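The relations (16-109) and (16-113) to (16-115) are easy to check numerically for a small chain. The sketch below is illustrative only: the helper names `mat_mul`, `mat_pow`, and `propagate` and the two-state matrix in the demo are assumptions, not taken from the text.

```python
def mat_mul(a, b):
    """Product of two matrices stored as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[r] * b[r][j] for r in range(inner)) for j in range(cols)]
            for row in a]


def mat_pow(pi, n):
    """Pi[n] = Pi^n, built with the recursion Pi[n + 1] = Pi[n] Pi of (16-115)."""
    size = len(pi)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, pi)
    return result


def propagate(p, pi, steps):
    """State probabilities after `steps` transitions: the row vector p times Pi^steps."""
    return mat_mul([list(p)], mat_pow(pi, steps))[0]


if __name__ == "__main__":
    PI = [[0.9, 0.1],
          [0.5, 0.5]]  # hypothetical one-step transition matrix
    print("Pi[2]          :", mat_pow(PI, 2))
    print("Pi[3] Pi[2]    :", mat_mul(mat_pow(PI, 3), mat_pow(PI, 2)))
    print("Pi[5]          :", mat_pow(PI, 5))
    print("p after 3 steps:", propagate([1.0, 0.0], PI, 3))
```

The second and third printed matrices should agree to rounding error, which is the Chapman-Kolmogoroff equation (16-113) with $n = 3$ and $k = 2$. Each row of every power sums to 1.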
FIGURE 16-16

Writing (16-109) in vector form and using (16-114), we conclude that

$$P[n] = P[n - 1]\Pi = \cdots = P[n - k]\Pi^k = \cdots = P[1]\Pi^{n-1} \tag{16-116}$$

where $P[n]$ is a vector whose elements are the state probabilities $p_i[n]$. In general, $P[n]$ depends on $n$. However, if the initial state vector

$$P[1] = P = [p_1, \ldots, p_N] \tag{16-117}$$

is such that $P[2] = P$, then $P[n] = P$ for all $n$. In this case, the homogeneous process $x_n$ is also stationary and its state vector $P$ is the solution of the system

$$P\Pi = P \qquad \sum_i p_i = 1 \tag{16-118}$$

The state probability vector $P$ of a stationary Markoff chain is thus a (left) eigenvector of its transition matrix $\Pi$ and the corresponding eigenvalue equals 1 [see (16-109)].

If the initial state $P[1]$ of a homogeneous chain $x_n$ does not equal $P$, then $x_n$ is not stationary. In this case, certain of its states might never be reached. The details of the underlying theory will not be discussed. We note only that, if $\Pi^n$ tends to a limit as $n \to \infty$, then $x_n$ is asymptotically stationary. This is the case if all the elements $\pi_{ij}$ of $\Pi$ are strictly positive (Prob. 16-19).

FIGURE 16-17

Example 16-14. In the random walk experiment (Sec. 10-1), we place two reflecting walls at $x = 2s$ and $x = -2s$ as in Fig. 16-17. The resulting motion $x(t)$ between the two walls generates a homogeneous Markoff chain $x_n = x(nT)$ taking the
values

$$-2s \qquad -s \qquad 0 \qquad s \qquad 2s$$

The resulting transition matrix and its diagram are shown in the figure. In this case, the system (16-118) yields

$$p_1 = \frac{p_2}{2} \qquad p_2 = p_1 + \frac{p_3}{2} \qquad p_3 = \frac{p_2 + p_4}{2} \qquad p_4 = p_5 + \frac{p_3}{2}$$

$$p_1 + p_2 + p_3 + p_4 + p_5 = 1$$

Solving, we obtain

$$p_1 = p_5 = \tfrac{1}{8} \qquad p_2 = p_3 = p_4 = \tfrac{1}{4}$$

These are the initial state probabilities that generate a stationary process.

Continuous-Time Markoff Chains

A continuous-time Markoff chain is a Markoff process $x(t)$ consisting of a family of staircase functions (discrete states) with discontinuities at the random points $t_n$ (Fig. 16-18a). The values

$$q_n = x(t_n^+) \tag{16-119}$$

of $x(t)$ at these points (Fig. 16-18b) form a discrete-state Markoff sequence called the Markoff chain imbedded in the process $x(t)$.

A discrete-state stochastic process is called semi-Markoff if it is not Markoff but the imbedded sequence $q_n$ is a Markoff chain. An example is the queueing process $N(t)$ of Sec. 16-2.

FIGURE 16-18 (a) (b)

A Markoff chain $x(t)$ is specified in terms of the underlying point process $t_n$ and the imbedded Markoff chain $q_n$. We denote by

$$p_i(t) = P\{x(t) = a_i\} \tag{16-120}$$

the state probabilities of $x(t)$ and by

$$\pi_{ij}(t_1, t_2) = P\{x(t_2) = a_j \mid x(t_1) = a_i\} \tag{16-121}$$
its transition probabilities. These functions are such that

$$\sum_j \pi_{ij}(t_1, t_2) = 1 \qquad \sum_i p_i(t_1)\pi_{ij}(t_1, t_2) = p_j(t_2) \tag{16-122}$$

and they satisfy the Chapman-Kolmogoroff equation

$$\pi_{ij}(t_1, t_3) = \sum_r \pi_{ir}(t_1, t_2)\pi_{rj}(t_2, t_3) \qquad t_1 < t_2 < t_3 \tag{16-123}$$

In specific problems the functions $\pi_{ij}(t_1, t_2)$ are not given directly. As we shall see, however, they can be determined in terms of the transition probability rates defined below. For simplicity, we shall consider only homogeneous processes.

A Markoff process $x(t)$ is homogeneous if its transition probabilities depend on the difference $\tau = t_2 - t_1$:

$$\pi_{ij}(\tau) = P\{x(t + \tau) = a_j \mid x(t) = a_i\} \qquad \tau \ge 0 \tag{16-124}$$

From the above and (16-123) it follows with $\alpha = t_3 - t_2$ that

$$\pi_{ij}(\tau + \alpha) = \sum_r \pi_{ir}(\tau)\pi_{rj}(\alpha) \tag{16-125}$$

This is the Chapman-Kolmogoroff equation for continuous-time Markoff chains and it can be written in matrix form:

$$\Pi(\tau + \alpha) = \Pi(\tau)\Pi(\alpha) \qquad \tau, \alpha \ge 0 \tag{16-126}$$

where $\Pi(\tau)$ is a matrix with elements $\pi_{ij}(\tau)$.

Probability rates. In the discrete-time case, we showed that the matrix $\Pi[n]$ satisfies the recursion equation (16-115) and can be determined in terms of the one-step transition matrix $\Pi$. We show next that the transition matrix $\Pi(\tau)$ of a continuous-time chain $x(t)$ satisfies a differential equation and can be determined in terms of the matrix

$$\Lambda = \Pi'(0^+) \tag{16-127}$$

whose elements $\lambda_{ij} = \pi_{ij}'(0^+)$ are the derivatives from the right of the elements of $\Pi(\tau)$. These derivatives will be called the transition probability rates of $x(t)$. Clearly,

$$\sum_j \lambda_{ij} = 0$$

because $\sum_j \pi_{ij}(\tau) = 1$, and since

$$\pi_{ij}(0) = \begin{cases} 1 & i = j \\ 0 & i \ne j \end{cases} \tag{16-128}$$
we conclude with $\mu_i = -\lambda_{ii}$ that

$$\mu_i = {\sum_j}' \lambda_{ij} \ge 0 \qquad \lambda_{ij} \ge 0 \qquad i \ne j \tag{16-129}$$

The prime indicates summation over every $j \ne i$.

In the above, we have assumed that $\pi_{ij}(\tau)$ is differentiable at $\tau = 0^+$. This is so only if the probability that there is one discontinuity point in the interval $(t, t + \Delta t)$ is of the order of $\Delta t$:

$$P\{x(t + \Delta t) = a_j \mid x(t) = a_i\} \simeq \begin{cases} 1 - \mu_i\,\Delta t & i = j \\ \lambda_{ij}\,\Delta t & i \ne j \end{cases} \tag{16-130}$$

The Kolmogoroff equations. Differentiating (16-126) with respect to $\alpha$ and setting $\alpha = 0$, we obtain

$$\Pi'(\tau) = \Pi(\tau)\Lambda \qquad \Pi(0) = I \tag{16-131}$$

This is a system of linear differential equations with constant coefficients and its initial condition $\Pi(0)$ is the identity matrix [see (16-128)]. Solving, we obtain

$$\Pi(\tau) = e^{\Lambda\tau} \tag{16-132}$$

We have thus expressed $\Pi(\tau)$ in terms of the transition rate matrix $\Lambda$.

The state probabilities $p_i(t)$ satisfy a similar system: Denoting by $P(t)$ a vector with elements $p_i(t)$, we conclude from (16-122) that

$$P(t + \tau) = P(t)\Pi(\tau) \tag{16-133}$$

Differentiating with respect to $\tau$ and setting $\tau = 0$, we obtain

$$P'(t) = P(t)\Lambda \tag{16-134}$$

This is a system of $N$ equations of the form

$$p_i'(t) = -\mu_ip_i(t) + {\sum_j}' \lambda_{ji}p_j(t) \tag{16-135}$$

Its formal solution is a vector exponential

$$P(t) = P(0)e^{\Lambda t} \tag{16-136}$$

We have thus expressed $P(t)$ in terms of $\Lambda$ and the initial state probabilities $p_i(0)$.

If $x(t)$ is stationary, then $p_i(t) = p_i = \text{constant}$. Hence [see (16-135)]

$$\mu_ip_i = {\sum_j}' \lambda_{ji}p_j \qquad \sum_i p_i = 1 \tag{16-137}$$

This is a system expressing the state probabilities of a stationary process in terms of the transition rates $\lambda_{ij}$.
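The solution (16-132) can be evaluated numerically for a small rate matrix. The following sketch computes $e^{\Lambda\tau}$ with a truncated Taylor series combined with scaling and squaring; this particular numerical method, the helper names `mat_mul` and `mat_exp`, and the two-state rate matrix in the demo are assumptions made for illustration, not something the text prescribes.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][r] * b[r][j] for r in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(rate_matrix, tau, terms=30):
    """Pi(tau) = exp(Lambda * tau) by Taylor series with scaling and squaring."""
    n = len(rate_matrix)
    # n * max|entry| * |tau| bounds the largest absolute row sum of Lambda * tau.
    bound = n * max(abs(x) for row in rate_matrix for x in row) * abs(tau)
    squarings = 0
    while bound > 0.5:
        bound /= 2.0
        squarings += 1
    scaled = [[x * tau / (2 ** squarings) for x in row] for row in rate_matrix]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term now holds scaled^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)  # undo the scaling: exp(A) = exp(A/2)^2
    return result


if __name__ == "__main__":
    LAMBDA = [[-1.0, 1.0],
              [2.0, -2.0]]  # hypothetical rates: mu_1 = 1, mu_2 = 2
    pi_half = mat_exp(LAMBDA, 0.5)
    print("Pi(0.5)       :", pi_half)
    print("row sums      :", [sum(row) for row in pi_half])
    print("Pi(0.3)Pi(0.2):", mat_mul(mat_exp(LAMBDA, 0.3), mat_exp(LAMBDA, 0.2)))
```

The row sums should equal 1, and the last product should agree with $\Pi(0.5)$ as required by (16-126). For larger chains a library routine for the matrix exponential would normally be preferred over this hand-written series.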
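Example 16-15 Generalized telegraph signal. Suppose that $x(t)$ takes two values (Fig. 16-19)

$$a_1 = A \qquad a_2 = -A$$

and

$$P\{x(t + \Delta t) = A \mid x(t) = A\} = 1 - \mu_1\,\Delta t = \pi_{11}(\Delta t)$$
$$P\{x(t + \Delta t) = -A \mid x(t) = -A\} = 1 - \mu_2\,\Delta t = \pi_{22}(\Delta t) \tag{16-138}$$

FIGURE 16-19

In this case, $\lambda_{12} = \mu_1$, $\lambda_{21} = \mu_2$. Inserting into (16-135), we obtain

$$p_1'(t) + \mu_1p_1(t) = \mu_2p_2(t)$$

And since $p_2(t) = 1 - p_1(t)$, we conclude that

$$p_1(t) = \frac{\mu_2}{\mu_1 + \mu_2}\left[1 - e^{-(\mu_1 + \mu_2)t}\right] + p_1(0)e^{-(\mu_1 + \mu_2)t}$$

Note that, as $t \to \infty$,

$$p_1(t) \to \frac{\mu_2}{\mu_1 + \mu_2} = p_1 \qquad p_2(t) \to \frac{\mu_1}{\mu_1 + \mu_2} = p_2$$

The transition probabilities are determined from (16-131):

$$\pi_{11}'(\tau) + \mu_1\pi_{11}(\tau) = \mu_2\pi_{12}(\tau) \qquad \pi_{11}(0) = 1$$
$$\pi_{22}'(\tau) + \mu_2\pi_{22}(\tau) = \mu_1\pi_{21}(\tau) \qquad \pi_{22}(0) = 1$$

where

$$\pi_{12}(\tau) = 1 - \pi_{11}(\tau) \qquad \pi_{21}(\tau) = 1 - \pi_{22}(\tau)$$

The above yields

$$\pi_{11}(\tau) = p_1 + p_2e^{-(\mu_1 + \mu_2)\tau} \qquad \pi_{22}(\tau) = p_2 + p_1e^{-(\mu_1 + \mu_2)\tau} \tag{16-139}$$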
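The limiting probabilities of this example can be checked by simulation. The sketch below relies on a standard consequence of the rates in (16-138) that is not spelled out in the text: the time spent in state $A$ before a switch is exponential with parameter $\mu_1$, and the time spent in $-A$ is exponential with parameter $\mu_2$. The function names `p1_of_t` and `fraction_of_time_at_A` and the numerical values are illustrative assumptions.

```python
import math
import random


def p1_of_t(t, mu1, mu2, p1_initial):
    """Closed-form p_1(t) of the generalized telegraph signal."""
    decay = math.exp(-(mu1 + mu2) * t)
    return mu2 / (mu1 + mu2) * (1.0 - decay) + p1_initial * decay


def fraction_of_time_at_A(mu1, mu2, n_cycles, seed=0):
    """Long-run fraction of time that x(t) = A, estimated over n_cycles switches."""
    rng = random.Random(seed)
    time_at_A = 0.0
    time_at_minus_A = 0.0
    for _ in range(n_cycles):
        time_at_A += rng.expovariate(mu1)        # holding time in state +A
        time_at_minus_A += rng.expovariate(mu2)  # holding time in state -A
    return time_at_A / (time_at_A + time_at_minus_A)


if __name__ == "__main__":
    mu1, mu2 = 1.0, 2.0  # assumed rates
    print("simulated fraction at A:", fraction_of_time_at_A(mu1, mu2, 20000, seed=4))
    print("limit of p_1(t)        :", mu2 / (mu1 + mu2))
    print("p_1(t) at t = 0, 1, 5  :", [p1_of_t(t, mu1, mu2, 0.1) for t in (0.0, 1.0, 5.0)])
```

The simulated fraction should be close to $\mu_2/(\mu_1 + \mu_2)$, and the closed-form values start at the chosen initial probability and settle at the same limit.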
Mean and autocorrelation The process $x(t)$ is asymptotically stationary with

$$P\{x(t) = a_i\} = p_i \qquad a_1 = A \qquad a_2 = -A$$

and

$$P\{x(t + \tau) = a_j,\ x(t) = a_i\} = p_i\pi_{ij}(\tau) \qquad i, j = 1, 2$$

From this it follows that

$$E\{x(t)\} = \eta = p_1A - p_2A \qquad R(\tau) = \eta^2 + 4A^2p_1p_2e^{-(\mu_1 + \mu_2)|\tau|}$$

If $\mu_1 = \mu_2 = \lambda$, then $\eta = 0$ and $R(\tau) = A^2e^{-2\lambda|\tau|}$ as in (10-19). In this case only, the discontinuity points $t_i$ of $x(t)$ are Poisson.

SPECTRA OF STOCHASTIC FM SIGNALS†. We shall determine the power spectrum of the FM signal

$$w(t) = e^{j\varphi(t)} \qquad \varphi(t) = \int_0^t x(\alpha)\,d\alpha \tag{16-140}$$

(see also Sec. 11-3) where we assume that the instantaneous frequency $x(t)$ is a stationary Markoff chain as in (16-120). Clearly,

$$R(\tau) = E\{w(\tau)\}$$

We introduce the conditional correlations

$$R_{ik}(\tau) = E\{w(\tau) \mid x(0) = a_i,\ x(\tau) = a_k\}\,\pi_{ik}(\tau) \tag{16-141}$$

where $\pi_{ik}(\tau)$ are the transition probabilities defined in (16-124). Clearly,

$$P\{x(0) = a_i,\ x(\tau) = a_k\} = p_i\pi_{ik}(\tau)$$

hence

$$R(\tau) = \sum_{i,k} p_iR_{ik}(\tau) \tag{16-142}$$

To determine $R(\tau)$ it suffices, therefore, to find $R_{ik}(\tau)$.

THEOREM. For any $\tau > 0$ and $\nu > 0$:

$$R_{ik}(\tau + \nu) = \sum_m R_{im}(\tau)R_{mk}(\nu) \tag{16-143}$$

†R. Kubo: "A Stochastic Theory of Line-Shape and Relaxation," Scottish Universities Summer School, D. ter Haar, ed., Plenum Press, New York, 1961. See also A. Papoulis: "Spectra of Stochastic FM Signals" in Proceedings of Transactions of 9th Prague Conference on Information Theory, 1982.
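Proof. In the following, the conditions $x(0) = a_i$, $x(\tau) = a_m$, $x(\tau+\nu) = a_k$ will be abbreviated as $a_i$, $a_m$, and $a_k$ respectively. Reasoning as in (16-104), we obtain

$$P\{x(\tau) = a_m \mid a_i, a_k\} = \frac{\pi_{im}(\tau)\,\pi_{mk}(\nu)}{\pi_{ik}(\tau+\nu)} \tag{16-144}$$

Furthermore [see (8-43)]

$$E\{w(\tau+\nu) \mid a_i, a_k\} = \sum_m E\{w(\tau+\nu) \mid a_i, a_m, a_k\}\,P\{x(\tau) = a_m \mid a_i, a_k\} \tag{16-145}$$

If $x(\tau) = a_m$ is specified, then the integrals

$$\int_0^{\tau} x(\alpha)\,d\alpha \qquad \int_{\tau}^{\tau+\nu} x(\alpha)\,d\alpha$$

are conditionally independent. Hence

$$E\{w(\tau+\nu) \mid a_i, a_m, a_k\} = E\left\{\exp\left[j\int_0^{\tau} x(\alpha)\,d\alpha\right] \,\Big|\, a_i, a_m\right\} E\left\{\exp\left[j\int_{\tau}^{\tau+\nu} x(\alpha)\,d\alpha\right] \,\Big|\, a_m, a_k\right\}$$

From the stationarity of $x(t)$ it follows that the last term equals

$$E\left\{\exp\left[j\int_0^{\nu} x(\alpha)\,d\alpha\right] \,\Big|\, x(0) = a_m,\ x(\nu) = a_k\right\}$$

Inserting into (16-145) and using (16-144), we obtain (16-143).

The Kolmogoroff equations. Differentiating (16-143) with respect to $\nu$ and setting $\nu = 0$, we obtain

$$R_{ik}'(\tau) = \sum_m R_{im}(\tau)\,R_{mk}'(0^+) \qquad \tau > 0 \tag{16-146}$$

Initial conditions. To determine $R_{ik}(\tau)$ it suffices, therefore, to find the values of $R_{ik}(\tau)$ and its derivative at $\tau = 0^+$. We maintain that

$$R_{ik}(0^+) = \begin{cases} 1 & i = k \\ 0 & i \neq k \end{cases} \qquad R_{ik}'(0^+) = \begin{cases} ja_i - \mu_i & i = k \\ \lambda_{ik} & i \neq k \end{cases} \tag{16-147}$$

Proof. For small $\tau$,

$$\pi_{ik}(\tau) \simeq \begin{cases} 1 - \mu_i\tau & i = k \\ \lambda_{ik}\tau & i \neq k \end{cases}$$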
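Neglecting terms of the order of $\tau^2$, we conclude that

$$R_{ii}(\tau) \simeq E\{\exp[jx(0)\tau] \mid a_i\}\,\pi_{ii}(\tau) \simeq e^{ja_i\tau}(1 - \mu_i\tau) \simeq 1 + ja_i\tau - \mu_i\tau$$

Furthermore, for $i \neq k$:

$$R_{ik}(\tau) \simeq \pi_{ik}(\tau) \simeq \lambda_{ik}\tau$$

and (16-147) follows.

Combining (16-147) and (16-146), we obtain

$$R_{ik}'(\tau) = (ja_k - \mu_k)R_{ik}(\tau) + \sum_{m \neq k} \lambda_{mk}R_{im}(\tau) \tag{16-148}$$

This yields $R_{ik}(\tau)$. The autocorrelation $R(\tau)$ of $w(t)$ is determined from (16-142). Since the coefficients of (16-148) are constant, we conclude that the power spectrum of an FM signal, whose instantaneous frequency is a finite-state Markoff chain, is rational.

Example 16-16. Suppose that $x(t)$ is a symmetrical telegraph signal (Fig. 16-20a). In this case, $a_1 = -a_2 = A$, $\mu_1 = \mu_2 = \lambda_{12} = \lambda_{21} = \lambda$, and (16-148) yields

$$\begin{aligned} R_{11}'(\tau) &= (jA - \lambda)R_{11}(\tau) + \lambda R_{12}(\tau) \qquad R_{11}(0) = 1 \\ R_{12}'(\tau) &= \lambda R_{11}(\tau) - (jA + \lambda)R_{12}(\tau) \qquad R_{12}(0) = 0 \end{aligned}$$

Denoting by $S_{ik}^+(s)$ the Laplace transform of $R_{ik}(\tau)$, we conclude from the above that

$$\begin{aligned} sS_{11}^+(s) - 1 &= (jA - \lambda)S_{11}^+(s) + \lambda S_{12}^+(s) \\ sS_{12}^+(s) &= \lambda S_{11}^+(s) - (jA + \lambda)S_{12}^+(s) \end{aligned}$$

FIGURE 16-20

Hence

$$S_{11}^+(s) = \frac{s + jA + \lambda}{D(s)} \qquad S_{12}^+(s) = \frac{\lambda}{D(s)}$$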
where

$$D(s) = s^2 + 2\lambda s + A^2$$

Reasoning similarly, we find

$$S_{22}^+(s) = \frac{s - jA + \lambda}{D(s)} \qquad S_{21}^+(s) = \frac{\lambda}{D(s)}$$

And since $p_1 = p_2 = 0.5$, (16-142) yields

$$S^+(s) = \frac{s + 2\lambda}{D(s)} \qquad S(\omega) = 2\,\mathrm{Re}\,S^+(j\omega)$$

In Fig. 16-20b, we plot $S(\omega)$ for $\lambda = A$ and $\lambda = 0.1A$. Note that the discontinuity points (zero crossings) of $x(t)$ are Poisson distributed and their average density equals $\lambda$.

BIRTH PROCESSES. A birth process is a Markoff chain $x(t)$ consisting of a family of increasing staircase functions (Fig. 16-21). The process $x(t)$ (population size) takes the values 1, 2, 3, ... and increases by 1 at the discontinuity points $t_i$ (birth times). From the definition it follows that the transition rates $\lambda_{ij}$ are different from 0 only if $i = j$ or $i = j - 1$. Thus

$$-\lambda_{ii} = \mu_i \qquad \lambda_{i(i+1)} = \mu_i \qquad \lambda_{ij} = 0 \ \text{otherwise}$$

The above shows that a birth process is specified in terms of the parameter $\mu_i$. Clearly [see (16-130)]

$$\begin{aligned} P\{x(t+\Delta t) = n \mid x(t) = n\} &= 1 - \mu_n\,\Delta t \qquad n \geq 1 \\ P\{x(t+\Delta t) = n \mid x(t) = n-1\} &= \mu_{n-1}\,\Delta t \qquad n > 1 \end{aligned} \tag{16-149}$$

Hence

$$\begin{aligned} p_1(t+\Delta t) &\simeq p_1(t)(1 - \mu_1\,\Delta t) \\ p_n(t+\Delta t) &\simeq p_n(t)(1 - \mu_n\,\Delta t) + p_{n-1}(t)\,\mu_{n-1}\,\Delta t \qquad n > 1 \end{aligned}$$
This yields

$$\begin{aligned} p_1'(t) &= -\mu_1 p_1(t) \\ p_n'(t) &= -\mu_n p_n(t) + \mu_{n-1}p_{n-1}(t) \qquad n > 1 \end{aligned} \tag{16-150}$$

in agreement with (16-135).

Note. The difference $x(t_2) - x(t_1)$ equals the number of discontinuity points $t_i$ in the interval $(t_1, t_2)$. This shows that a birth process is completely specified in terms of the point process $t_i$.

Example 16-17. If the birthrate is proportional to the population size $n$:

$$\mu_n = nc \tag{16-151}$$

then $x(t)$ is called the simple birth process (the constant $c$ is the birthrate per person). We shall determine $p_n(t)$ under the (unrealistic!) assumption that (16-151) holds for every $n \geq 1$ and that $x(0) = 1$. In this case,

$$p_1(0) = 1 \qquad p_n(0) = 0 \qquad n > 1 \tag{16-152}$$

Setting $\mu_n = nc$ in (16-150), we obtain

$$\begin{aligned} p_1'(t) + cp_1(t) &= 0 \\ p_n'(t) + ncp_n(t) &= (n-1)cp_{n-1}(t) \qquad n > 1 \end{aligned} \tag{16-153}$$

The above yields

$$p_1(t) = p_1(0)e^{-ct} = e^{-ct}$$

and with a simple recursion,

$$p_n(t) = e^{-ct}\left(1 - e^{-ct}\right)^{n-1} \tag{16-154}$$

This function is called the Yule-Furry density. Thus, for a specific $t$, the RV $x(t) - 1$ has a geometric distribution with ratio $(1 - e^{-ct})$. Hence (see Prob. 5-35)

$$E\{x(t)\} = e^{ct} \qquad E\{x^2(t)\} = 2e^{2ct} - e^{ct}$$

Example 16-18. We now assume that the rate of increase of $x(t)$ is independent of its present state:

$$\mu_n = \lambda = \text{constant}$$

As we shall see, the resulting $x(t)$ is a Poisson process. We assume again that $x(0) = 1$. Setting $\mu_n = \lambda$ in (16-150), we obtain

$$\begin{aligned} p_1'(t) + \lambda p_1(t) &= 0 \qquad p_1(0) = 1 \\ p_n'(t) + \lambda p_n(t) &= \lambda p_{n-1}(t) \qquad p_n(0) = 0 \qquad n > 1 \end{aligned}$$

This yields

$$p_{n+1}(t) = e^{-\lambda t}\frac{(\lambda t)^n}{n!}$$

The above is the probability that $x(t) = n + 1$ and it equals the probability that the number of points $x(t) - x(0)$ in the interval $(0, t)$ equals $n$.
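The two closed forms above, the Yule-Furry density (16-154) and the Poisson probabilities of Example 16-18, can be compared with a direct numerical integration of the birth equations (16-150). The following is a minimal sketch under stated assumptions: forward-Euler stepping, a truncation to the first 30 states, and the values $c = 1$, $\lambda = 2$, $t = 1$ are illustrative choices, and the function names are hypothetical.

```python
import math


def integrate_birth(rates, t, steps=20_000):
    """Forward-Euler integration of the birth equations (16-150) with x(0) = 1.

    rates[k] holds mu_{k+1}; the returned list holds p_1(t), ..., p_N(t).
    Truncating to N states is harmless here because p_n depends only on
    p_1, ..., p_n.
    """
    n_states = len(rates)
    p = [1.0] + [0.0] * (n_states - 1)  # p_1(0) = 1, p_n(0) = 0 for n > 1
    dt = t / steps
    for _ in range(steps):
        new = p[:]
        new[0] = p[0] - dt * rates[0] * p[0]
        for k in range(1, n_states):
            new[k] = p[k] + dt * (rates[k - 1] * p[k - 1] - rates[k] * p[k])
        p = new
    return p


def yule_furry(n, t, c):
    """Yule-Furry density (16-154): p_n(t) = e^{-ct} (1 - e^{-ct})^{n-1}."""
    q = math.exp(-c * t)
    return q * (1.0 - q) ** (n - 1)


def poisson_shifted(n, t, lam):
    """p_{n+1}(t) of Example 16-18: e^{-lam t} (lam t)^n / n!."""
    return math.exp(-lam * t) * (lam * t) ** n / math.factorial(n)


if __name__ == "__main__":
    c, lam, t, n_states = 1.0, 2.0, 1.0, 30  # hypothetical values
    simple = integrate_birth([c * n for n in range(1, n_states + 1)], t)
    constant = integrate_birth([lam] * n_states, t)
    for n in range(1, 6):
        print(n, simple[n - 1], yule_furry(n, t, c),
              constant[n - 1], poisson_shifted(n - 1, t, lam))
```

The numerical and closed-form columns should agree to within the Euler discretization error, which shrinks as `steps` grows.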
BIRTH-DEATH PROCESSES. Suppose now that a Markoff chain takes the values 0, 1, 2, ... and its discontinuities equal $+1$ or $-1$ (Fig. 16-22). We then say that $x(t)$ is a birth-death process. In this case, $\lambda_{ij}$ is different from 0 only if $i = j$ or $j - 1$ or $j + 1$. Hence $x(t)$ is specified in terms of the two parameters

$$\alpha_i = \lambda_{i(i+1)} \qquad \beta_i = \lambda_{i(i-1)}$$

Thus $-\lambda_{ii} = \mu_i = \alpha_i + \beta_i$ and

$$\begin{aligned} P\{x(t+\Delta t) = n \mid x(t) = n-1\} &= \alpha_{n-1}\,\Delta t \\ P\{x(t+\Delta t) = n \mid x(t) = n\} &= 1 - (\alpha_n + \beta_n)\,\Delta t \\ P\{x(t+\Delta t) = n \mid x(t) = n+1\} &= \beta_{n+1}\,\Delta t \end{aligned}$$

From (16-135) or, directly from the above, it follows that

$$\begin{aligned} p_0'(t) + \alpha_0p_0(t) &= \beta_1p_1(t) \\ p_n'(t) + (\alpha_n + \beta_n)p_n(t) &= \alpha_{n-1}p_{n-1}(t) + \beta_{n+1}p_{n+1}(t) \qquad n > 0 \end{aligned} \tag{16-155}$$

Example 16-19 M|M|1 queue. A queueing process $N(t)$ is not, in general, Markoff. The M|M|1 case, however, is an exception. In this case, the arrival times $t_i$ are Poisson with average density $\lambda$ and the service time $c_i$ is an RV with density $\mu e^{-\mu c}$ where $\mu > \lambda$. We maintain that the resulting $N(t)$ is a birth-death process.

The probability that a unit will arrive in the interval $(t, t+\Delta t)$ equals $\lambda\,\Delta t$ (property $P_1$ of Poisson points). We shall show that the probability that a unit will depart in the interval $(t, t+\Delta t)$ equals $\mu\,\Delta t$. Indeed, denoting by $t_i$ the first departure point to the right of $t$ and by $c_i$ the corresponding service time, we conclude that (see Example 7-10)

$$P\{t < t_i \leq t + \Delta t\} = f_c(0)\,\Delta t = \mu\,\Delta t$$

no matter when this service started. We have thus shown that

$$\begin{aligned} P\{N(t+\Delta t) = n \mid N(t) = n-1\} &\simeq \lambda\,\Delta t \\ P\{N(t+\Delta t) = n-1 \mid N(t) = n\} &\simeq \mu\,\Delta t \end{aligned} \tag{16-156}$$

Thus $N(t)$ is a birth-death process with $\alpha_n = \lambda$ and $\beta_n = \mu$. We shall determine its state probabilities $p_n$ for the stationary case.
In this case, $p_n'(t) = 0$ and (16-155) yields

$$\lambda p_0 = \mu p_1 \qquad (\lambda + \mu)p_n = \lambda p_{n-1} + \mu p_{n+1}$$

From this it follows readily that

$$p_n = p_0\left(\frac{\lambda}{\mu}\right)^n \qquad p_0 = 1 - \frac{\lambda}{\mu}$$

as in Example 16-10.

Continuous-State Processes

A continuous-state Markoff process $x(t)$ is specified in terms of its first-order density

$$p(x,t) = \frac{\partial P(x,t)}{\partial x} \qquad P(x,t) = P\{x(t) \leq x\}$$

and the conditional (transition) density

$$\pi(x,x_0;t,t_0) = f_{x(t)}(x \mid x(t_0) = x_0) \qquad t > t_0$$

These functions are such that, if $t_0 < t_1 < t$, then

$$\int_{-\infty}^{\infty} p(x,t)\,dx = 1 \qquad p(x,t) = \int_{-\infty}^{\infty} p(x_0,t_0)\,\pi(x,x_0;t,t_0)\,dx_0 \tag{16-157}$$

and

$$\int_{-\infty}^{\infty} \pi(x,x_0;t,t_0)\,dx = 1 \qquad \pi(x,x_0;t,t_0) = \int_{-\infty}^{\infty} \pi(x,x_1;t,t_1)\,\pi(x_1,x_0;t_1,t_0)\,dx_1 \tag{16-158}$$

as in (16-122) and (16-123). Furthermore,

$$\pi(x,x_0;t,t_0) \xrightarrow[t \to t_0]{} \delta(x - x_0) \tag{16-159}$$

We shall show that the function $\pi(x,x_0;t,t_0)$ can be determined in terms of the slopes $\eta$ and $\sigma^2$ of the conditional mean $a(x_0;t,t_0)$ and the conditional variance $b(x_0;t,t_0)$ of $x(t)$ assuming $x(t_0) = x_0$, defined as follows:

$$\begin{aligned} a(x_0;t,t_0) &= \int_{-\infty}^{\infty} x\,\pi(x,x_0;t,t_0)\,dx \\ b(x_0;t,t_0) &= \int_{-\infty}^{\infty} (x - a)^2\,\pi(x,x_0;t,t_0)\,dx \end{aligned} \tag{16-160}$$
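The second equation in (16-158) and the definitions in (16-160) can be illustrated numerically. The sketch below assumes, purely for illustration, a normal transition density with mean $x_0 + \eta(t - t_0)$ and variance $\sigma^2(t - t_0)$ for constant $\eta$ and $\sigma^2$; this particular density, the constants `ETA` and `SIGMA2`, the integration range, and the function names are assumptions and not part of the text above. The integrals are evaluated with a plain trapezoidal rule.

```python
import math

ETA, SIGMA2 = 0.5, 2.0  # assumed constant slopes of the conditional mean and variance


def gauss(x, mean, var):
    """Normal density with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def trans(x, x0, t, t0):
    """Assumed transition density pi(x, x0; t, t0), requires t > t0."""
    tau = t - t0
    return gauss(x, x0 + ETA * tau, SIGMA2 * tau)


def trapz(f, lo, hi, n=4000):
    """Composite trapezoidal rule for the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, n):
        total += f(lo + k * h)
    return total * h


def chapman_kolmogoroff(x, x0, t, t1, t0):
    """Right side of the second equation in (16-158), with t0 < t1 < t."""
    return trapz(lambda x1: trans(x, x1, t, t1) * trans(x1, x0, t1, t0), -40.0, 40.0)


def cond_mean(x0, t, t0):
    """Conditional mean a(x0; t, t0) of (16-160)."""
    return trapz(lambda x: x * trans(x, x0, t, t0), -40.0, 40.0)


def cond_var(x0, t, t0):
    """Conditional variance b(x0; t, t0) of (16-160)."""
    a = cond_mean(x0, t, t0)
    return trapz(lambda x: (x - a) ** 2 * trans(x, x0, t, t0), -40.0, 40.0)


if __name__ == "__main__":
    print(trans(2.0, 1.0, 3.0, 0.0), chapman_kolmogoroff(2.0, 1.0, 3.0, 1.0, 0.0))
    print(cond_mean(1.0, 3.0, 0.0), cond_var(1.0, 3.0, 0.0))
```

For this assumed density the conditional mean and variance grow linearly in $t - t_0$, so their slopes at $t = t_0$ are simply `ETA` and `SIGMA2`.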
We assume that these functions are differentiable from the right and we denote by $\eta(x_0,t_0)$ and $\sigma^2(x_0,t_0)$ respectively their slopes at $t = t_0$ (Fig. 16-23):

$$\eta(x_0,t_0) = \frac{\partial}{\partial t}a(x_0;t,t_0)\Big|_{t=t_0} \qquad \sigma^2(x_0,t_0) = \frac{\partial}{\partial t}b(x_0;t,t_0)\Big|_{t=t_0} \tag{16-161}$$

Clearly [see (16-159) and (16-160)]

$$a(x_0;t_0,t_0) = x_0 \qquad b(x_0;t_0,t_0) = 0$$

Hence, for $\Delta t > 0$,

$$\begin{aligned} a(x_0;t_0+\Delta t,t_0) &\simeq x_0 + \eta(x_0,t_0)\,\Delta t \\ b(x_0;t_0+\Delta t,t_0) &\simeq \sigma^2(x_0,t_0)\,\Delta t \end{aligned} \tag{16-162}$$

If the process $x(t)$ is homogeneous, then the function $\pi(x,x_0;t,t_0)$ depends on $\tau = t - t_0$. In this case, the slopes $\eta(x_0)$ and $\sigma^2(x_0)$ of $a$ and $b$ are independent of $t_0$.

From (16-162) and the definition of the functions $a$ and $b$ it follows that

$$\begin{aligned} E\{dx(t) \mid x(t) = x\} &= \eta(x,t)\,dt \\ E\{[dx(t) - \eta(x,t)\,dt]^2 \mid x\} &= \sigma^2(x,t)\,dt \end{aligned} \tag{16-163}$$

As we show in the next example, these equations can often be used to determine $\eta(x,t)$ and $\sigma^2(x,t)$ directly in terms of the specifications of the process $x(t)$.

Example 16-20. Consider the nonlinear stochastic differential equation

$$\frac{dx(t)}{dt} + \beta(x,t) = \frac{dw(x,t)}{dt}$$

where $w(x,t)$ is a process with independent increments and such that

$$E\{dw(x,t)\} = 0 \qquad E\{[dw(x,t)]^2\} = \gamma(x,t)\,dt$$
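The solution of the above equation is a Markoff process [see also (16-105)]. Clearly,

$$\begin{aligned} E\{dx(t) \mid x\} &= -\beta(x,t)\,dt \\ E\{[dx(t) + \beta(x,t)\,dt]^2 \mid x\} &= E\{[dw(x,t)]^2 \mid x\} = \gamma(x,t)\,dt \end{aligned}$$

Hence $\eta(x,t) = -\beta(x,t)$, $\sigma^2(x,t) = \gamma(x,t)$.

THE DIFFUSION EQUATIONS. We shall show that the conditional density $\pi = \pi(x,x_0;t,t_0)$ satisfies the diffusion equations

$$\begin{aligned} \frac{\partial\pi}{\partial t} + \frac{\partial}{\partial x}[\eta(x,t)\pi] - \frac{1}{2}\frac{\partial^2}{\partial x^2}[\sigma^2(x,t)\pi] &= 0 \\ \frac{\partial\pi}{\partial t_0} + \eta(x_0,t_0)\frac{\partial\pi}{\partial x_0} + \frac{1}{2}\sigma^2(x_0,t_0)\frac{\partial^2\pi}{\partial x_0^2} &= 0 \end{aligned} \tag{16-164}$$

The first is called forward (or Fokker-Planck) and the second backward.

Proof. If $f(\xi)$ is a density with mean $\eta$ and "small" variance $\sigma^2$, then [see (5-55)]

$$\int_{-\infty}^{\infty} g(\xi)f(\xi)\,d\xi \simeq g(\eta) + \frac{\sigma^2}{2}g''(\eta) \tag{16-165}$$

From (16-158) it follows with $x_1 = \xi$ and $t_1 = t_0 + \varepsilon$ that

$$\pi(x,x_0;t,t_0) = \int_{-\infty}^{\infty} \pi(x,\xi;t,t_0+\varepsilon)\,\pi(\xi,x_0;t_0+\varepsilon,t_0)\,d\xi$$

In the above, $\pi(\xi,x_0;t_0+\varepsilon,t_0)$ is a density in the variable $\xi$ and its mean and variance equal [see (16-160) and (16-162)]

$$a(x_0;t_0+\varepsilon,t_0) \simeq x_0 + \varepsilon\eta(x_0,t_0) \qquad b(x_0;t_0+\varepsilon,t_0) \simeq \varepsilon\sigma^2(x_0,t_0)$$

Therefore, with

$$g(\xi) = \pi(x,\xi;t,t_0+\varepsilon) \qquad f(\xi) = \pi(\xi,x_0;t_0+\varepsilon,t_0)$$

(16-165) yields (Fig. 16-24)

$$\pi(x,x_0;t,t_0) \simeq \pi(x,x_0+\varepsilon\eta;t,t_0+\varepsilon) + \frac{\varepsilon\sigma^2}{2}\frac{\partial^2}{\partial x_0^2}\pi(x,x_0+\varepsilon\eta;t,t_0+\varepsilon)$$

within $O(\varepsilon^2)$. Expanding the right side into a power series in $\varepsilon$ and retaining only linear terms, we obtain the second equation in (16-164). The proof of the first is similar.

COROLLARY. The first-order density $p = p(x,t)$ of the process $x(t)$ satisfies the Fokker-Planck equation

$$\frac{\partial p}{\partial t} + \frac{\partial}{\partial x}[\eta(x,t)p] - \frac{1}{2}\frac{\partial^2}{\partial x^2}[\sigma^2(x,t)p] = 0 \tag{16-166}$$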
Proof. It follows if we express the function $p(x,t)$ in terms of the integral in (16-157) and use (16-164).

Example 16-21. The velocity $v(t)$ of a particle in brownian motion satisfies Langevin's equation

$$v'(t) + \beta v(t) = w'(t)$$

where $w(t)$ is a process with orthogonal increments and such that

$$E\{dw(t)\} = 0 \qquad E\{[dw(t)]^2\} = \gamma\,dt$$

Hence (see Example 16-20) $v(t)$ is a Markoff process with $\eta(v,t) = -\beta v$, $\sigma^2(v,t) = \gamma$, and (16-166) yields

$$\frac{\partial p}{\partial t} = \beta\frac{\partial(vp)}{\partial v} + \frac{\gamma}{2}\frac{\partial^2p}{\partial v^2} \tag{16-167}$$

where $p = p(v,t)$ is the density of $v(t)$.

SOLUTION OF THE FOKKER-PLANCK EQUATION. We shall solve the forward equation in (16-164) under the assumption that the conditional density $\pi(x,x_0;t,t_0)$ does not depend explicitly on the state $x(t_0) = x_0$ of $x(t)$ but only on the increment $u = x - x_0$. In this case [see (16-160)]

$$a(x_0;t,t_0) = \int_{-\infty}^{\infty} (u + x_0)\,\pi(u;t,t_0)\,du = a_1(t,t_0) + x_0$$

Inserting into the second integral in (16-160), we conclude that the conditional variance $b = b(t,t_0)$ does not depend on $x_0$.

From the above it follows that the slopes $\eta(t_0)$ and $\sigma^2(t_0)$ of $a$ and $b$ are independent of $x_0$. This simplifies the form of the forward equation:

$$\frac{\partial\pi}{\partial t} + \eta(t)\frac{\partial\pi}{\partial u} - \frac{1}{2}\sigma^2(t)\frac{\partial^2\pi}{\partial u^2} = 0 \tag{16-168}$$

The solution $\pi(u;t,t_0)$ of (16-168) is a density in the variable $u$. In fact, it is the density of the increment $x(t) - x(t_0)$ of $x(t)$ under the assumption that
$x(t_0) = x_0$. We shall show that this density is normal with

$$\text{Mean:}\ \int_{t_0}^{t} \eta(\tau)\,d\tau \qquad \text{Variance:}\ \int_{t_0}^{t} \sigma^2(\tau)\,d\tau \tag{16-169}$$

Proof. With $\Phi(s,t)$ the bilateral Laplace transform in the variable $u$ of the function $\pi(u;t,t_0)$, it follows from (16-168) that

$$\frac{\partial\Phi}{\partial t} = -\eta(t)s\Phi + \frac{\sigma^2(t)s^2}{2}\Phi \tag{16-170}$$

The function $\Phi(s,t)$, evaluated at $t = t_0$, is the transform of the function $\pi(u;t_0,t_0) = \delta(u)$ [see (16-159)]. Hence $\Phi(s,t_0) = 1$ and (16-170) yields

$$\ln\Phi(s,t) = -s\int_{t_0}^{t} \eta(\tau)\,d\tau + \frac{s^2}{2}\int_{t_0}^{t} \sigma^2(\tau)\,d\tau \tag{16-171}$$

This shows that $\Phi(s,t)$ is the moment function of a normal density with mean and variance as in (16-169).

PROBLEMS

16-1. Show that the probability that the Wiener process $w(t)$ does not cross the line $L_a$ in the interval $(0,t)$ equals $2G(a/\sqrt{\alpha t}) - 1$.

16-2. We denote by $P_0(\tau)$ the conditional probability that the number of 0's of a normal process $x(t)$ in the interval $(t, t+\tau)$, assuming $x(t) = 0$, is odd. Show that if $E\{x(t)\} = 0$, then

$$\cos\pi P_0(\tau) = -R'(\tau)\left[-R''(0)R(0) + \frac{R''(0)R^2(\tau)}{R(0)}\right]^{-1/2}$$

16-3. Show that if $f_w(w,t)$ is the first-order density of the Wiener process $w(t)$ and $f_\tau(\tau,a)$ is the density of the first passage time $\tau_1$ (Fig. 16-4a), then†

$$f_\tau(\tau,a) = \frac{a}{\tau}f_w(a,\tau)$$

†A. A. Borovkov: "On the First Passage Time ...," Theory of Probability and Its Applications, vol. X, no. 2, 1965.

16-4. Passengers arrive at a terminal boarding the next bus. The times of their arrival are Poisson with density $\lambda = 1$ per minute. The times of departure of each bus are Poisson with density $\mu = 2$ per hour. (a) Find the mean number of passengers in each bus. (b) Find the mean number of passengers in the first bus that leaves after 9 a.m. Answer: (a) 30; (b) 60.

16-5. Passengers arrive at a terminal after 9 a.m. The times of their arrival are Poisson with mean density $\lambda = 1$ per minute. The time interval from 9 a.m. to the
departure of the next bus is an RV $c$. Find the mean number of passengers in this bus (a) if $c$ has an exponential density with mean $\eta_c = 30$ min, (b) if $c$ is uniform between 0 and 60 min. Answer: (a) 30; (b) 30.

16-6. The point process $t_i$ is stationary and the RVs $c_i = t_i - t_{i-1}$ are uniform in the interval $(0,a)$. Show that if $t_1$ is the first point to the right of a fixed point $t_0$, then $E\{t_1 - t_0\} = a/3$.

16-7. (a) The RVs $c_i$ are i.i.d. and $E\{e^{j\omega c_i}\} = \Phi_c(\omega)$. The process $n(t)$ is Poisson with parameter $\lambda t$ and independent of $c_i$. Show that, if (Fig. P16-7a)

$$x(t) = \sum_{i=1}^{n(t)} c_i \qquad \text{then} \qquad E\{e^{j\omega x(t)}\} = \exp\{\lambda t[\Phi_c(\omega) - 1]\}$$

Special case: If the RV $c_i$ takes the values 1 and 0 (Fig. P16-7b) and $P\{c_i = 1\} = p$, then $x(t)$ is a Poisson process with parameter $\lambda pt$ (see also Prob. 8-11).
Hint: $E\{e^{j\omega x(t)} \mid n(t) = n\} = \Phi_c^n(\omega)$.
(b) Using the above, show that, if $t_i$ is a Poisson point process with mean density $\lambda$ and $\bar{t}_i$ is a process obtained by eliminating at random a subset of $t_i$, then $\bar{t}_i$ is a Poisson point process with mean density $\lambda p$ where $p$ is the probability that a point of $t_i$ is not eliminated.

FIGURE P16-7

16-8. (a) Show that if $t_n$ is a Poisson point process, then the process $t_{2n}$ consisting of every other point of $t_n$ is not Poisson. (b) Show that if $\alpha_n$ and $\beta_n$ are two independent Poisson point processes with densities $\lambda_\alpha$ and $\lambda_\beta$ respectively, then the process $t_n$ consisting of all the points of $\alpha_n$ and $\beta_n$ is Poisson with density $\lambda_\alpha + \lambda_\beta$.

16-9. Visitors enter a park at Poisson times with mean density $\lambda = 2$ per minute. Each visitor stays in the park $c$ minutes where $c$ is an RV uniform between 30 and 90 min. Find the mean and the variance of the number $N(t)$ of visitors in the park.

16-10. In the M|M|1 queue (Example 16-10), $y$ is the busy period and $\Phi_y(s)$ is its moment function. Show that

$$\lambda\Phi_y^2(s) - (\lambda + \mu - s)\Phi_y(s) + \mu = 0$$
16-11. (a) With $\bar{q}_i$ as in (16-53), show that

$$E\{\bar{q}_i^2\} = E\{q_i^2\} - 2\eta_q + \rho \tag{i}$$

(b) Prove the Pollaczek-Khinchin formula (16-60), using (i) and the identity

$$E\{q_i^2\} = E\{\bar{q}_i^2\} + E\{n_c^2\} + 2E\{\bar{q}_i\}E\{n_c\}$$

16-12. In a single-server queueing system, the arrival times $t_i$ are Poisson with mean density $\lambda = 9$ per hour. Find the mean of the following: the service time $c$, the waiting time $b$, the system time $a$, the idle period $x$, the busy period $y$, and the number $n_y$ of units served during a busy period. Consider two cases: the density of the service time $c$ is (a) uniform between 4 and 8 min; (b) it equals $\mu^2ce^{-\mu c}$ where $\mu = 1/3$.

16-13. In an M|M|1 queue, the service time density equals $\mu e^{-\mu c}$. Find the density of the distance from a fixed point $t_0$ to the next departure point. Hint: The probability that $t_0$ is a point of a busy period equals $\lambda/\mu$.

16-14. Find the probability $P\{s(t) \leq 2\}$ of the shot-noise process

$$s(t) = \sum_i h(t - t_i) \qquad h(t) = 4U(t) - 3U(t-1) - U(t-2)$$

where $t_i$ are Poisson points with $\lambda = 2$.

16-15. The shot-noise process $s(t)$ is a train of triangles

$$s(t) = \sum_i h(t - t_i) \qquad h(t) = \begin{cases} 5(2 - |t|) & |t| < 2 \\ 0 & |t| > 2 \end{cases}$$

and the points $t_i$ are Poisson with $\lambda = 0.01$. (a) Find its power spectrum $S_s(\omega)$. (b) Find its first-order density. (Note that $2\lambda \ll 1$.)

16-16. The points $t_i$ are Poisson with density $\lambda$ and

$$s(t) = \sum_i h(t - t_i) \qquad h(t) = e^{-\alpha t}U(t)$$

(a) Find the mean, the variance, and the power spectrum of $s(t)$. (b) Find the power spectrum of the process $y(t) = s^2(t)$ for $\lambda \gg \alpha$ and for $\lambda \ll \alpha$.

16-17. The RVs $x_n$ are i.i.d. taking the values $+1$ and $-1$ with $P\{x_n = 1\} = 0.6$ and $P\{x_n = -1\} = 0.4$. Show that the process $y_n = x_n + x_{n-1} + \cdots + x_1$ is a Markoff chain and find its state probabilities $p_i[n]$ and transition probabilities $\pi_{ij}[m]$.

16-18. Show that if $x(0) = 0$ and $x(t)$ is a process with independent increments, then it is Markoff.

16-19. Given a two-state Markoff chain $x_n$ taking the values 1 and 0 with state probability vector $P[n]$ and transition matrix $\Pi$. Show that, if [...] Find $P[2]$ and $P[3]$ if $x_1 = 0$.
16-20. Show that, if $x(t)$ is a discrete-state Markoff process taking the values $a_i$ and

$$P\{x(t) = a_i\} = p_i(t) \qquad P\{x(t_2) = a_j \mid x(t_1) = a_i\} = \pi_{ij}(t_1,t_2)$$
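then its autocorrelation equals

$$R_x(t_1,t_2) = \sum_{i,j} a_ia_j\,p_i(t_1)\,\pi_{ij}(t_1,t_2)$$

16-21. Show that if $\pi_{ij}(t_1,t_2)$ are the transition probabilities of a Markoff chain $x(t)$ and

$$P\{x(t+\Delta t) = a_i \mid x(t) = a_i\} = 1 - \mu_i(t)\,\Delta t \qquad P\{x(t+\Delta t) = a_k \mid x(t) = a_i\} = \lambda_{ik}(t)\,\Delta t$$

then, with $\lambda_{ii}(t) = -\mu_i(t)$,

$$\begin{aligned} \frac{\partial\pi_{ij}(t,t_0)}{\partial t} &= -\mu_j(t)\,\pi_{ij}(t,t_0) + \sum_{k \neq j} \lambda_{kj}(t)\,\pi_{ik}(t,t_0) \\ \frac{\partial\pi_{ij}(t,t_0)}{\partial t_0} &= -\sum_k \pi_{kj}(t,t_0)\,\lambda_{ik}(t_0) \end{aligned}$$

16-22. The telegraph signal $x(t)$ of Example 16-15 is stationary with $\mu_2 = 3\mu_1 = 6$ and $A = 100$. (a) Find its mean $\eta_x$ and autocorrelation $R_x(\tau)$. (b) Find the power spectrum $S_w(\omega)$ of the FM signal

$$w(t) = e^{j\varphi(t)} \qquad \varphi(t) = \int_0^t x(\alpha)\,d\alpha$$

(c) Show that $w(t)$ satisfies the time-varying stochastic differential equation

$$w'(t) - jx(t)w(t) = 0 \qquad w(0) = 1$$

Find $E\{w(t)\}$ and $R_w(t_1,t_2)$.

16-23. Show that the distribution function

$$F(x,x_0;t,t_0) = P\{x(t) \leq x \mid x(t_0) = x_0\} = \int_{-\infty}^{x} \pi(\xi,x_0;t,t_0)\,d\xi$$

of a Markoff process satisfies the backward diffusion equation

$$\frac{\partial F}{\partial t_0} + \eta(x_0,t_0)\frac{\partial F}{\partial x_0} + \frac{1}{2}\sigma^2(x_0,t_0)\frac{\partial^2F}{\partial x_0^2} = 0$$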