A Tutorial on Principal Component Analysis

Jonathon Shlens*

Center for Neural Science, New York University, New York City, NY 10003-6603
and Systems Neurobiology Laboratory, Salk Institute for Biological Studies, La Jolla, CA 92037

*Electronic address: shlens@salk.edu

(Dated: April 22, 2009; Version 3.01)

Principal component analysis (PCA) is a mainstay of modern data analysis - a black box that is widely used but (sometimes) poorly understood. The goal of this paper is to dispel the magic behind this black box. This manuscript focuses on building a solid intuition for how and why principal component analysis works. This manuscript crystallizes this knowledge by deriving, from simple intuitions, the mathematics behind PCA. This tutorial does not shy away from explaining the ideas informally, nor does it shy away from the mathematics. The hope is that by addressing both aspects, readers of all levels will be able to gain a better understanding of PCA as well as the when, the how and the why of applying this technique.

I. INTRODUCTION

Principal component analysis (PCA) is a standard tool in modern data analysis - in diverse fields from neuroscience to computer graphics - because it is a simple, non-parametric method for extracting relevant information from confusing data sets. With minimal effort PCA provides a roadmap for how to reduce a complex data set to a lower dimension to reveal the sometimes hidden, simplified structures that often underlie it.

The goal of this tutorial is to provide both an intuitive feel for PCA, and a thorough discussion of this topic. We will begin with a simple example and provide an intuitive explanation of the goal of PCA. We will continue by adding mathematical rigor to place it within the framework of linear algebra to provide an explicit solution. We will see how and why PCA is intimately related to the mathematical technique of singular value decomposition (SVD). This understanding will lead us to a prescription for how to apply PCA in the real world and an appreciation for the underlying assumptions. My hope is that a thorough understanding of PCA provides a foundation for approaching the fields of machine learning and dimensional reduction.

The discussion and explanations in this paper are informal in the spirit of a tutorial. The goal of this paper is to educate. Occasionally, rigorous mathematical proofs are necessary although relegated to the Appendix. Although not as vital to the tutorial, the proofs are presented for the adventurous reader who desires a more complete understanding of the math. My only assumption is that the reader has a working knowledge of linear algebra. My goal is to provide a thorough discussion by largely building on ideas from linear algebra and avoiding challenging topics in statistics and optimization theory (but see Discussion). Please feel free to contact me with any suggestions, corrections or comments.

II. MOTIVATION: A TOY EXAMPLE

Here is the perspective: we are an experimenter. We are trying to understand some phenomenon by measuring various quantities (e.g. spectra, voltages, velocities, etc.) in our system. Unfortunately, we can not figure out what is happening because the data appears clouded, unclear and even redundant. This is not a trivial problem, but rather a fundamental obstacle in empirical science.
Examples abound from complex systems such as neuroscience, web indexing, meteorology and oceanography - the number of variables to measure can be unwieldy and at times even deceptive, because the underlying relationships can often be quite simple.

Take for example a simple toy problem from physics diagrammed in Figure 1. Pretend we are studying the motion of the physicist's ideal spring. This system consists of a ball of mass m attached to a massless, frictionless spring. The ball is released a small distance away from equilibrium (i.e. the spring is stretched). Because the spring is ideal, it oscillates indefinitely along the x-axis about its equilibrium at a set frequency.

FIG. 1 A toy example. The position of a ball attached to an oscillating spring is recorded using three cameras A, B and C. The position of the ball tracked by each camera is depicted in each panel below.

This is a standard problem in physics in which the motion along the x direction is solved by an explicit function of time. In other words, the underlying dynamics can be expressed as a function of a single variable x.

However, being ignorant experimenters we do not know any of this. We do not know which, let alone how many, axes and dimensions are important to measure. Thus, we decide to measure the ball's position in a three-dimensional space (since we live in a three dimensional world). Specifically, we place three movie cameras around our system of interest. At 120 Hz each movie camera records an image indicating a two dimensional position of the ball (a projection). Unfortunately, because of our ignorance, we do not even know what are the real x, y and z axes, so we choose three camera positions $\vec{a}$, $\vec{b}$ and $\vec{c}$ at some arbitrary angles with respect to the system. The angles between our measurements might not even be 90°! Now, we record with the cameras for several minutes.
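To make the toy setup concrete, here is a minimal simulation sketch in Python/numpy. The oscillation frequency, noise level and camera orientations are arbitrary illustrative assumptions (the text does not specify them); the only structure taken from the text is that three cameras each record a 2-D projection of a 1-D motion at 120 Hz for 10 minutes, giving a 6 × 72000 data set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten minutes sampled at 120 Hz, as in the text: 10 * 60 * 120 = 72000 samples.
fs = 120.0
t = np.arange(0, 10 * 60, 1 / fs)

# The ball truly moves along a single (unknown) axis; 0.5 Hz is an arbitrary choice.
trajectory = np.vstack([np.cos(2 * np.pi * 0.5 * t),
                        np.zeros_like(t),
                        np.zeros_like(t)])          # true 3-D position, shape (3, n)

# Each camera reports a 2-D projection of the 3-D position from an arbitrary angle.
# A random orthonormal 2x3 matrix stands in for each camera's orientation.
def random_camera():
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    return q[:2, :]

cameras = [random_camera() for _ in range(3)]       # cameras A, B and C
noise = 0.05 * rng.normal(size=(6, t.size))         # small measurement noise

# Stack (xA, yA, xB, yB, xC, yC) into one 6 x n data matrix.
X = np.vstack([cam @ trajectory for cam in cameras]) + noise
print(X.shape)                                      # (6, 72000)
```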
The big question remains: how do we get from this data set to a simple equation of x?

We know a-priori that if we were smart experimenters, we would have just measured the position along the x-axis with one camera. But this is not what happens in the real world. We often do not know which measurements best reflect the dynamics of our system in question. Furthermore, we sometimes record more dimensions than we actually need.

Also, we have to deal with that pesky, real-world problem of noise. In the toy example this means that we need to deal with air, imperfect cameras or even friction in a less-than-ideal spring. Noise contaminates our data set only serving to obfuscate the dynamics further. This toy example is the challenge experimenters face every day. Keep this example in mind as we delve further into abstract concepts. Hopefully, by the end of this paper we will have a good understanding of how to systematically extract x using principal component analysis.

III. FRAMEWORK: CHANGE OF BASIS

The goal of principal component analysis is to identify the most meaningful basis to re-express a data set. The hope is that this new basis will filter out the noise and reveal hidden structure. In the example of the spring, the explicit goal of PCA is to determine: "the dynamics are along the x-axis." In other words, the goal of PCA is to determine that $\hat{x}$, i.e. the unit basis vector along the x-axis, is the important dimension. Determining this fact allows an experimenter to discern which dynamics are important, redundant or noise.

A. A Naive Basis

With a more precise definition of our goal, we need a more precise definition of our data as well. We treat every time sample (or experimental trial) as an individual sample in our data set. At each time sample we record a set of data consisting of multiple measurements (e.g. voltage, position, etc.). In our data set, at one point in time, camera A records a corresponding ball position (xA, yA). One sample or trial can then be expressed as a 6-dimensional column vector

$$\vec{X} = \begin{pmatrix} x_A \\ y_A \\ x_B \\ y_B \\ x_C \\ y_C \end{pmatrix}$$

where each camera contributes a 2-dimensional projection of the ball's position to the entire vector $\vec{X}$. If we record the ball's position for 10 minutes at 120 Hz, then we have recorded 10 × 60 × 120 = 72000 of these vectors.

With this concrete example, let us recast this problem in abstract terms. Each sample $\vec{X}$ is an m-dimensional vector, where m is the number of measurement types. Equivalently, every sample is a vector that lies in an m-dimensional vector space spanned by some orthonormal basis. From linear algebra we know that all measurement vectors form a linear combination of this set of unit length basis vectors. What is this orthonormal basis?

This question is usually a tacit assumption often overlooked. Pretend we gathered our toy example data above, but only looked at camera A. What is an orthonormal basis for (xA, yA)? A naive choice would be {(1,0),(0,1)}, but why select this basis over $\{(\frac{\sqrt{2}}{2},\frac{\sqrt{2}}{2}),(\frac{-\sqrt{2}}{2},\frac{\sqrt{2}}{2})\}$ or any other arbitrary rotation? The reason is that the naive basis reflects the method we gathered the data. Pretend we record the position (2,2). We did not record $2\sqrt{2}$ in the $(\frac{\sqrt{2}}{2},\frac{\sqrt{2}}{2})$ direction and 0 in the perpendicular direction.
Rather, we recorded the position (2,2) on our camera meaning 2 units up and 2 units to the left in our camera window. Thus our original basis reflects the method we measured our data.

How do we express this naive basis in linear algebra? In the two dimensional case, {(1,0),(0,1)} can be recast as individual row vectors. A matrix constructed out of these row vectors is the 2×2 identity matrix I. We can generalize this to the m-dimensional case by constructing an m×m identity matrix

$$B = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} = I$$

where each row is an orthonormal basis vector b_i with m components. We can consider our naive basis as the effective starting point. All of our data has been recorded in this basis and thus it can be trivially expressed as a linear combination of {b_i}.
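As a quick sanity check of this point, the sketch below (using a small stand-in data matrix) confirms that expressing data in the naive basis B = I is trivial: the coefficients of each recorded sample are simply its recorded values.

```python
import numpy as np

# A small stand-in data set: m = 6 measurement types, n = 5 samples
# (columns play the role of the recorded vectors X from the text).
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 5))

# The naive basis: each row of the m x m identity is one basis vector b_i.
B = np.eye(6)

# Expressing the data in the naive basis changes nothing - the coefficients
# of each sample are just its recorded values, so B X returns X unchanged.
assert np.allclose(B @ X, X)

# Equivalently, each sample is a linear combination of the basis rows,
# weighted by its own entries.
sample = X[:, 0]
reconstruction = sum(sample[i] * B[i] for i in range(6))
assert np.allclose(reconstruction, sample)
```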
B. Change of Basis

With this rigor we may now state more precisely what PCA asks: Is there another basis, which is a linear combination of the original basis, that best re-expresses our data set?

A close reader might have noticed the conspicuous addition of the word linear. Indeed, PCA makes one stringent but powerful assumption: linearity. Linearity vastly simplifies the problem by restricting the set of potential bases. With this assumption PCA is now limited to re-expressing the data as a linear combination of its basis vectors.

Let X be the original data set, where each column is a single sample (or moment in time) of our data set (i.e. $\vec{X}$). In the toy example X is an m × n matrix where m = 6 and n = 72000. Let Y be another m × n matrix related by a linear transformation P. X is the original recorded data set and Y is a new representation of that data set.

$$PX = Y \qquad (1)$$

Also let us define the following quantities.[1]

• p_i are the rows of P.
• x_i are the columns of X (or individual $\vec{X}$).
• y_i are the columns of Y.

[1] In this section x_i and y_i are column vectors, but be forewarned. In all other sections x_i and y_i are row vectors.

Equation 1 represents a change of basis and thus can have many interpretations.

1. P is a matrix that transforms X into Y.
2. Geometrically, P is a rotation and a stretch which again transforms X into Y.
3. The rows of P, {p_1, ..., p_m}, are a set of new basis vectors for expressing the columns of X.

The latter interpretation is not obvious but can be seen by writing out the explicit dot products of PX.

$$PX = \begin{pmatrix} p_1 \\ \vdots \\ p_m \end{pmatrix} \begin{pmatrix} x_1 & \cdots & x_n \end{pmatrix}$$

$$Y = \begin{pmatrix} p_1 \cdot x_1 & \cdots & p_1 \cdot x_n \\ \vdots & \ddots & \vdots \\ p_m \cdot x_1 & \cdots & p_m \cdot x_n \end{pmatrix}$$

We can note the form of each column of Y.

$$y_i = \begin{pmatrix} p_1 \cdot x_i \\ \vdots \\ p_m \cdot x_i \end{pmatrix}$$

We recognize that each coefficient of y_i is a dot-product of x_i with the corresponding row in P. In other words, the j-th coefficient of y_i is a projection on to the j-th row of P. This is in fact the very form of an equation where y_i is a projection on to the basis of {p_1, ..., p_m}. Therefore, the rows of P are a new set of basis vectors for representing the columns of X.

C. Questions Remaining

By assuming linearity the problem reduces to finding the appropriate change of basis. The row vectors {p_1, ..., p_m} in this transformation will become the principal components of X. Several questions now arise.

• What is the best way to re-express X?
• What is a good choice of basis P?

These questions must be answered by next asking ourselves what features we would like Y to exhibit. Evidently, additional assumptions beyond linearity are required to arrive at a reasonable result. The selection of these assumptions is the subject of the next section.
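Before moving on, here is a minimal numpy illustration of the change of basis in Equation 1. The particular P below is just a randomly chosen orthonormal matrix standing in for whichever basis PCA will eventually select; the point is only the mechanics of PX = Y and the dot-product structure of the columns of Y.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data matrix: m = 6 measurement types (rows), n = 10 samples (columns).
X = rng.normal(size=(6, 10))

# An arbitrary orthonormal P (its rows are the new basis vectors p_i).
# Any orthogonal matrix works here; PCA will later choose a particular one.
P, _ = np.linalg.qr(rng.normal(size=(6, 6)))

# The change of basis of Equation 1.
Y = P @ X

# Each entry of Y is the dot product of a row of P with a column of X,
# i.e. the j-th coefficient of y_i is the projection of x_i onto p_j.
j, i = 2, 5
assert np.isclose(Y[j, i], P[j] @ X[:, i])

# Because P is orthonormal, the original data is recovered by P^T Y.
assert np.allclose(P.T @ Y, X)
```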
IV. VARIANCE AND THE GOAL

Now comes the most important question: what does best express the data mean? This section will build up an intuitive answer to this question and along the way tack on additional assumptions.

A. Noise and Rotation

Measurement noise in any data set must be low or else, no matter the analysis technique, no information about a signal can be extracted. There exists no absolute scale for noise but rather all noise is quantified relative to the signal strength. A common measure is the signal-to-noise ratio (SNR), or a ratio of variances σ²,

$$SNR = \frac{\sigma^2_{signal}}{\sigma^2_{noise}}.$$

A high SNR (≫ 1) indicates a high precision measurement, while a low SNR indicates very noisy data.

FIG. 2 Simulated data of (x, y) for camera A. The signal and noise variances $\sigma^2_{signal}$ and $\sigma^2_{noise}$ are graphically represented by the two lines subtending the cloud of data. Note that the largest direction of variance does not lie along the basis of the recording (xA, yA) but rather along the best-fit line.

Let's take a closer look at the data from camera A in Figure 2. Remembering that the spring travels in a straight line, every individual camera should record motion in a straight line as well. Therefore, any spread deviating from straight-line motion is noise. The variances due to the signal and noise are indicated by each line in the diagram. The ratio of the two lengths measures how skinny the cloud is: possibilities include a thin line (SNR ≫ 1), a circle (SNR = 1) or even worse. By positing reasonably good measurements, quantitatively we assume that directions with largest variances in our measurement space contain the dynamics of interest. In Figure 2 the direction with the largest variance is not $\hat{x}_A = (1,0)$ nor $\hat{y}_A = (0,1)$, but the direction along the long axis of the cloud. Thus, by assumption the dynamics of interest exist along directions with largest variance and presumably highest SNR.

Our assumption suggests that the basis for which we are searching is not the naive basis because these directions (i.e. (xA, yA)) do not correspond to the directions of largest variance. Maximizing the variance (and by assumption the SNR) corresponds to finding the appropriate rotation of the naive basis. This intuition corresponds to finding the direction indicated by the line $\sigma^2_{signal}$ in Figure 2. In the 2-dimensional case of Figure 2 the direction of largest variance corresponds to the best-fit line for the data cloud. Thus, rotating the naive basis to lie parallel to the best-fit line would reveal the direction of motion of the spring for the 2-D case. How do we generalize this notion to an arbitrary number of dimensions? Before we approach this question we need to examine this issue from a second perspective.

FIG. 3 A spectrum of possible redundancies in data from the two separate measurements r1 and r2. The two measurements on the left are uncorrelated because one can not predict one from the other. Conversely, the two measurements on the right are highly correlated indicating highly redundant measurements.

B. Redundancy

Figure 2 hints at an additional confounding factor in our data - redundancy. This issue is particularly evident in the example of the spring. In this case multiple sensors record the same dynamic information. Reexamine Figure 2 and ask whether it was really necessary to record 2 variables. Figure 3 might reflect a range of possible plots between two arbitrary measurement types r1 and r2. The left-hand panel depicts two recordings with no apparent relationship. Because one can not predict r1 from r2, one says that r1 and r2 are uncorrelated.

On the other extreme, the right-hand panel of Figure 3 depicts highly correlated recordings. This extremity might be achieved by several means:

• A plot of (xA, xB) if cameras A and B are very nearby.
• A plot of $(x_A, \tilde{x}_A)$ where $x_A$ is in meters and $\tilde{x}_A$ is in inches.

Clearly in the right panel of Figure 3 it would be more meaningful to just have recorded a single variable, not both. Why? Because one can calculate r1 from r2 (or vice versa) using the best-fit line. Recording solely one response would express the data more concisely and reduce the number of sensor recordings (2 → 1 variables). Indeed, this is the central idea behind dimensional reduction.
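As a numerical illustration of the second scenario above (the same position recorded in meters and in inches), consider the following sketch. The oscillation, noise level and sampling rate are arbitrary assumptions; the point is that a correlation near 1 and a recoverable best-fit slope mean one of the two recordings is redundant.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two redundant recordings of the same underlying motion: x_A in meters, and a
# second (slightly noisy) sensor reporting inches - the right-hand panel of Figure 3.
t = np.arange(0, 10, 1 / 120.0)
x_meters = np.cos(2 * np.pi * 0.5 * t)
x_inches = 39.37 * x_meters + 0.5 * rng.normal(size=x_meters.size)

# The correlation coefficient is close to 1: knowing one variable essentially
# determines the other, so recording both is redundant.
r = np.corrcoef(x_meters, x_inches)[0, 1]

# The best-fit line recovers the conversion factor, so r2 can be computed from r1.
slope, intercept = np.polyfit(x_meters, x_inches, deg=1)
print(round(r, 3), round(slope, 1))   # roughly 1.0 and 39.4
```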
C. Covariance Matrix

In a 2 variable case it is simple to identify redundant cases by finding the slope of the best-fit line and judging the quality of the fit. How do we quantify and generalize these notions to arbitrarily higher dimensions? Consider two sets of measurements with zero means

$$A = \{a_1, a_2, \ldots, a_n\}, \qquad B = \{b_1, b_2, \ldots, b_n\}$$

where the subscript denotes the sample number. The variances of A and B are individually defined as,

$$\sigma_A^2 = \frac{1}{n}\sum_i a_i^2, \qquad \sigma_B^2 = \frac{1}{n}\sum_i b_i^2$$

The covariance between A and B is a straight-forward generalization.

$$\text{covariance of } A \text{ and } B \equiv \sigma_{AB}^2 = \frac{1}{n}\sum_i a_i b_i$$

The covariance measures the degree of the linear relationship between two variables. A large positive value indicates positively correlated data. Likewise, a large negative value denotes negatively correlated data. The absolute magnitude of the covariance measures the degree of redundancy. Some additional facts about the covariance.

• $\sigma_{AB}$ is zero if and only if A and B are uncorrelated (e.g. Figure 3, left panel).
• $\sigma_{AB}^2 = \sigma_A^2$ if A = B.

We can equivalently convert A and B into corresponding row vectors.

$$a = [a_1\ a_2\ \ldots\ a_n], \qquad b = [b_1\ b_2\ \ldots\ b_n]$$

so that we may express the covariance as a dot product matrix computation.[2]

$$\sigma_{ab}^2 \equiv \frac{1}{n} a b^T \qquad (2)$$

[2] Note that in practice, the covariance $\sigma_{AB}^2$ is calculated as $\frac{1}{n-1}\sum_i a_i b_i$. The slight change in normalization constant arises from estimation theory, but that is beyond the scope of this tutorial.

Finally, we can generalize from two vectors to an arbitrary number. Rename the row vectors a and b as x_1 and x_2, respectively, and consider additional indexed row vectors x_3, ..., x_m. Define a new m × n matrix X.

$$X = \begin{pmatrix} x_1 \\ \vdots \\ x_m \end{pmatrix}$$

One interpretation of X is the following. Each row of X corresponds to all measurements of a particular type. Each column of X corresponds to a set of measurements from one particular trial (this is $\vec{X}$ from section 3.1). We now arrive at a definition for the covariance matrix C_X.

$$C_X \equiv \frac{1}{n} X X^T.$$

Consider the matrix $C_X = \frac{1}{n} X X^T$. The ij-th element of C_X is the dot product between the vector of the i-th measurement type with the vector of the j-th measurement type. We can summarize several properties of C_X:

• C_X is a square symmetric m × m matrix (Theorem 2 of Appendix A).
• The diagonal terms of C_X are the variance of particular measurement types.
• The off-diagonal terms of C_X are the covariance between measurement types.

C_X captures the covariance between all possible pairs of measurements. The covariance values reflect the noise and redundancy in our measurements.

• In the diagonal terms, by assumption, large values correspond to interesting structure.
• In the off-diagonal terms large magnitudes correspond to high redundancy.

Pretend we have the option of manipulating C_X. We will suggestively define our manipulated covariance matrix C_Y. What features do we want to optimize in C_Y?

D. Diagonalize the Covariance Matrix

We can summarize the last two sections by stating that our goals are (1) to minimize redundancy, measured by the magnitude of the covariance, and (2) maximize the signal, measured by the variance. What would the optimized covariance matrix C_Y look like?

• All off-diagonal terms in C_Y should be zero. Thus, C_Y must be a diagonal matrix. Or, said another way, Y is decorrelated.
• Each successive dimension in Y should be rank-ordered according to variance.

There are many methods for diagonalizing C_Y. It is curious to note that PCA arguably selects the easiest method: PCA assumes that all basis vectors {p_1, ..., p_m} are orthonormal, i.e. P is an orthonormal matrix. Why is this assumption easiest?

Envision how PCA works. In our simple example in Figure 2, P acts as a generalized rotation to align a basis with the axis of maximal variance. In multiple dimensions this could be performed by a simple algorithm:
1. Select a normalized direction in m-dimensional space along which the variance in X is maximized. Save this vector as p_1.
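One way to carry out this first step numerically is sketched below on simulated 2-D data resembling Figure 2. It jumps ahead slightly by using the fact that the variance along a unit direction p is p C_X p^T and is maximized by the eigenvector of C_X with the largest eigenvalue (the explicit solution the tutorial goes on to derive); the simulated signal direction and noise level are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated 2-D data resembling camera A in Figure 2: motion along a slanted
# line plus isotropic measurement noise (angle and noise level are arbitrary).
n = 2000
angle = np.deg2rad(30.0)
signal = rng.normal(size=n)
X = np.vstack([np.cos(angle) * signal, np.sin(angle) * signal])
X = X + 0.1 * rng.normal(size=X.shape)

# Center each measurement type (row) so the means are zero, as section IV.C assumes.
X = X - X.mean(axis=1, keepdims=True)

# The covariance matrix C_X = (1/n) X X^T: variances on the diagonal,
# covariances off the diagonal.
C_X = (X @ X.T) / n
print(np.round(C_X, 3))

# Step 1: the normalized direction p_1 along which the variance of X is maximized.
# The variance along a unit vector p is p C_X p^T, which is maximized by the
# eigenvector of C_X with the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(C_X)
p1 = eigvecs[:, np.argmax(eigvals)]
print(np.round(p1, 3))   # approximately (cos 30°, sin 30°), up to sign
```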