difficult to devise a measure which can quantify the amount of information contained in a piece of music.

In 1948, Bell Telephone Laboratories scientist Claude E. Shannon (1916–2001) published a paper entitled “The Mathematical Theory of Communication” [322] which laid the foundation of an important field now known as information theory. In his paper, the model of a point-to-point communication system depicted in Figure 1.1 is considered. In this model, a message is generated by the information source. The message is converted by the transmitter into a signal which is suitable for transmission. In the course of transmission, the signal may be contaminated by a noise source, so that the received signal may be different from the transmitted signal. Based on the received signal, the receiver then makes an estimate of the message and delivers it to the destination.

Fig. 1.1. Schematic diagram for a general point-to-point communication system.

In this abstract model of a point-to-point communication system, one is concerned only with whether the message generated by the source can be delivered correctly to the receiver, without worrying about how the message is actually used by the receiver. In a way, Shannon’s model does not cover all possible aspects of a communication system. However, in order to develop a precise and useful theory of information, the scope of the theory has to be restricted.

In [322], Shannon introduced two fundamental concepts about “information” from the communication point of view. First, information is uncertainty. More specifically, if a piece of information we are interested in is deterministic, then it has no value at all because it is already known with no uncertainty. From this point of view, for example, the continuous transmission of a still picture on a television broadcast channel is superfluous. Consequently, an information source is naturally modeled as a random variable or a random process, and probability is employed to develop the theory of information. Second, information to be transmitted is digital. This means that the information source should first be converted into a stream of 0s and 1s called bits, and the remaining task is to deliver these bits to the receiver correctly with no reference to their actual meaning. This is the foundation of all modern digital communication systems.
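As a small illustration of these two concepts, the sketch below models an information source as a random variable over a finite alphabet and converts a sampled message into a stream of 0s and 1s. It is only a hypothetical example: the alphabet, the probabilities, and the fixed-length code are assumptions made for concreteness, not taken from the text.

```python
import math
import random

# An assumed source alphabet and probability distribution (illustrative only).
alphabet = ["a", "b", "c", "d"]
probs = [0.5, 0.25, 0.125, 0.125]

# "Information is uncertainty": model the source as a random variable
# and draw a short random message from it.
random.seed(0)
message = random.choices(alphabet, weights=probs, k=8)

# "Information is digital": convert each symbol into bits with a
# fixed-length binary code (2 bits suffice for a 4-symbol alphabet).
bits_per_symbol = math.ceil(math.log2(len(alphabet)))
codebook = {s: format(i, f"0{bits_per_symbol}b") for i, s in enumerate(alphabet)}

bitstream = "".join(codebook[s] for s in message)
print(message)    # a random message, e.g. ['c', 'b', 'a', ...]
print(bitstream)  # the same message as a stream of 0s and 1s
```

A fixed-length code is used here only for concreteness; as discussed next, the source coding theorem shows that it is the entropy of the source, not the size of its alphabet, that determines how few bits per symbol are actually needed.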
In fact, this work of Shannon appears to contain the first published use of the term “bit,” which stands for binary digit.

In the same work, Shannon also proved two important theorems. The first theorem, called the source coding theorem, introduces entropy as the fundamental measure of information which characterizes the minimum rate of a source code representing an information source essentially free of error. The source coding theorem is the theoretical basis for lossless data compression.¹ The second theorem, called the channel coding theorem, concerns communication through a noisy channel. It was shown that associated with every noisy channel is a parameter, called the capacity, which is strictly positive except for very special channels, such that information can be communicated reliably through the channel as long as the information rate is less than the capacity. These two theorems, which give fundamental limits in point-to-point communication, are the two most important results in information theory.

¹ A data compression scheme is lossless if the data can be recovered with an arbitrarily small probability of error.

In science, we study the laws of Nature which must be obeyed by any physical system. These laws are used by engineers to design systems to achieve specific goals. Therefore, science is the foundation of engineering. Without science, engineering can only be done by trial and error.

In information theory, we study the fundamental limits in communication regardless of the technologies involved in the actual implementation of the communication systems. These fundamental limits not only are used as guidelines by communication engineers, but also give insights into what optimal coding schemes are like. Information theory is therefore the science of information.

Since Shannon published his original paper in 1948, information theory has developed into a major research field in both communication theory and applied probability.

For a non-technical introduction to information theory, we refer the reader to Encyclopedia Britannica [103]. In fact, we strongly recommend the reader to first read this excellent introduction before starting this book. For biographies of Claude Shannon, a legend of the twentieth century who made fundamental contributions to the Information Age, we refer the reader to [55] and [340]. The latter is also a complete collection of Shannon’s papers.

Unlike most branches of applied mathematics, in which physical systems are studied, information theory studies abstract systems of communication. In reading this book, it is not unusual for a beginner to be able to understand all the steps in a proof but have no idea what the proof is leading to. The best way to learn information theory is to study the material first and come back to it at a later time. Many results in information theory are rather subtle, to the extent that an expert in the subject may from time to time realize that his/her understanding of certain basic results has been inadequate or even incorrect.
While a novice should expect to raise his/her level of understanding of the subject by reading this book, he/she should not be discouraged to find, after finishing the book, that there are actually more things yet to be understood. In fact, this is exactly the challenge and the beauty of information theory.
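Before moving on, the two coding theorems described above can be made concrete with a small numerical sketch. The source distribution and the channel crossover probability below are assumptions chosen for illustration; the formulas used are the standard ones, namely the entropy H = −Σ p log₂ p and the capacity C = 1 − H(ε) of a binary symmetric channel with crossover probability ε, a special case of the channel coding theorem.

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum p * log2(p) over the positive masses."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Source coding theorem: the entropy of the source is the minimum rate
# (in bits per symbol) of any essentially error-free representation.
# The distribution is an illustrative assumption.
source = [0.5, 0.25, 0.125, 0.125]
print(f"source entropy: {entropy(source):.3f} bits/symbol")       # 1.750

# Channel coding theorem (special case): a binary symmetric channel with
# crossover probability eps has capacity C = 1 - H(eps) bits per use.
eps = 0.1
capacity = 1 - entropy([eps, 1 - eps])
print(f"BSC(0.1) capacity: {capacity:.3f} bits per channel use")  # about 0.531
```

By the source coding theorem, no lossless code for this assumed source can use fewer than 1.75 bits per symbol on average; by the channel coding theorem, reliable communication over this channel is possible at any rate below about 0.531 bits per channel use, and impossible above it.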
Part I Components of Information Theory
2 Information Measures

Shannon’s information measures refer to entropy, conditional entropy, mutual information, and conditional mutual information. They are the most important measures of information in information theory. In this chapter, we introduce these measures and establish some basic properties they possess. The physical meanings of these measures will be discussed in depth in subsequent chapters. We then introduce the informational divergence, which measures the “distance” between two probability distributions, and prove some useful inequalities in information theory. The chapter ends with a section on the entropy rate of a stationary information source.

2.1 Independence and Markov Chains

We begin our discussion in this chapter by reviewing two basic concepts in probability: independence of random variables and Markov chains. All the random variables in this book, except for those in Chapters 10 and 11, are assumed to be discrete unless otherwise specified.

Let $X$ be a random variable taking values in an alphabet $\mathcal{X}$. The probability distribution for $X$ is denoted as $\{p_X(x), x \in \mathcal{X}\}$, with $p_X(x) = \Pr\{X = x\}$. When there is no ambiguity, $p_X(x)$ will be abbreviated as $p(x)$, and $\{p(x)\}$ will be abbreviated as $p(x)$. The support of $X$, denoted by $S_X$, is the set of all $x \in \mathcal{X}$ such that $p(x) > 0$. If $S_X = \mathcal{X}$, we say that $p$ is strictly positive. Otherwise, we say that $p$ is not strictly positive, or that $p$ contains zero probability masses. All the above notations naturally extend to two or more random variables. As we will see, probability distributions with zero probability masses are very delicate, and they need to be handled with great care.

Definition 2.1. Two random variables $X$ and $Y$ are independent, denoted by $X \perp Y$, if
$$p(x, y) = p(x)\, p(y) \qquad (2.1)$$
for all $x$ and $y$ (i.e., for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$).
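As a quick numerical illustration of Definition 2.1, the sketch below checks whether a finite joint distribution factorizes as p(x, y) = p(x)p(y) by comparing it with the product of its marginals. The helper function and the two example joint distributions are assumptions made for illustration, not part of the text.

```python
from itertools import product

def is_independent(joint, tol=1e-12):
    """Check X ⊥ Y for a joint pmf given as a dict {(x, y): p(x, y)}."""
    xs = sorted({x for x, _ in joint})
    ys = sorted({y for _, y in joint})
    # Marginal distributions p(x) and p(y).
    px = {x: sum(joint.get((x, y), 0.0) for y in ys) for x in xs}
    py = {y: sum(joint.get((x, y), 0.0) for x in xs) for y in ys}
    # Definition 2.1: p(x, y) = p(x) p(y) for all (x, y), including zero masses.
    return all(abs(joint.get((x, y), 0.0) - px[x] * py[y]) <= tol
               for x, y in product(xs, ys))

# An assumed joint pmf on {0, 1} x {0, 1} that factorizes ...
independent_joint = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
# ... and one that does not (X = Y with probability 1).
dependent_joint = {(0, 0): 0.5, (1, 1): 0.5}

print(is_independent(independent_joint))  # True
print(is_independent(dependent_joint))    # False
```

Representing the joint pmf as a dictionary keyed by (x, y) pairs keeps zero probability masses explicit, which matters because, as noted above, distributions containing zero masses need to be handled with care.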