You will notice that this is simply the standard deviation squared, in both the symbol ($s^2$) and the formula (there is no square root in the formula for variance). $s^2$ is the usual symbol for the variance of a sample. Both of these measurements are measures of the spread of the data. Standard deviation is the most common measure, but variance is also used. The reason I have introduced variance in addition to standard deviation is to provide a solid platform from which the next section, covariance, can launch.

Exercises

Find the mean, standard deviation, and variance for each of these data sets.

· [12 23 34 44 59 70 98]
· [12 15 25 27 32 88 99]
· [15 35 78 82 90 95 97]

2.1.3 Covariance

The last two measures we have looked at are purely 1-dimensional. Data sets like this could be: heights of all the people in the room, marks for the last COMP101 exam, etc.

However, many data sets have more than one dimension, and the aim of the statistical analysis of these data sets is usually to see if there is any relationship between the dimensions. For example, we might have as our data set both the height of all the students in a class and the mark they received for that paper. We could then perform statistical analysis to see if the height of a student has any effect on their mark.

Standard deviation and variance only operate on 1 dimension, so you could only calculate the standard deviation for each dimension of the data set independently of the other dimensions. However, it is useful to have a similar measure to find out how much the dimensions vary from the mean with respect to each other.

Covariance is such a measure. Covariance is always measured between 2 dimensions. If you calculate the covariance between one dimension and itself, you get the variance. So, if you had a 3-dimensional data set (x, y, z), then you could measure the covariance between the x and y dimensions, the x and z dimensions, and the y and z dimensions. Measuring the covariance between x and x, or y and y, or z and z would give you the variance of the x, y and z dimensions respectively.

The formula for covariance is very similar to the formula for variance. The formula for variance could also be written like this:

$$var(X) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(X_i - \bar{X})}{(n-1)}$$

where I have simply expanded the square term to show both parts. So given that knowledge, here is the formula for covariance:

$$cov(X, Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{(n-1)}$$
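To make these formulas concrete, here is a minimal sketch in Python (my illustration, not part of the tutorial); the function names are only illustrative, and every quantity divides by (n-1), exactly as in the sample formulas above.

    import math

    def mean(xs):
        return sum(xs) / len(xs)

    def variance(xs):
        # sample variance s^2: squared deviations from the mean, divided by (n - 1)
        xbar = mean(xs)
        return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)

    def std_dev(xs):
        # the standard deviation is just the square root of the variance
        return math.sqrt(variance(xs))

    def covariance(xs, ys):
        # cov(X, Y): deviations in X paired with deviations in Y, divided by (n - 1)
        xbar, ybar = mean(xs), mean(ys)
        return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (len(xs) - 1)

    data = [3, 7, 7, 19]                               # a made-up sample, not one of the exercises
    print(mean(data), variance(data), std_dev(data))   # 9.0, 48.0, about 6.93
    print(covariance(data, data))                      # 48.0: the covariance of a dimension with itself is its variance

The same functions can of course be applied to the exercise data sets above.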
The covariance formula is exactly the same as the variance formula except that, in the second set of brackets, the X's are replaced by Y's. This says, in English, "For each data item, multiply the difference between the x value and the mean of x by the difference between the y value and the mean of y. Add all these up, and divide by (n-1)."

How does this work? Let's use some example data. Imagine we have gone out into the world and collected some 2-dimensional data; say, we have asked a bunch of students how many hours in total they spent studying COSC241, and the mark that they received. So we have two dimensions: the first is the H dimension, the hours studied, and the second is the M dimension, the mark received. Table 2.2 holds my imaginary data, and the calculation of cov(H, M), the covariance between the hours of study done and the mark received.

So what does it tell us? The exact value is not as important as its sign (i.e. positive or negative). If the value is positive, as it is here, then that indicates that both dimensions increase together, meaning that, in general, as the number of hours of study increased, so did the final mark.

If the value is negative, then as one dimension increases, the other decreases. If we had ended up with a negative covariance here, then that would have said the opposite: that as the number of hours of study increased, the final mark decreased.

In the last case, if the covariance is zero, it indicates that the two dimensions are independent of each other.

The result that the mark given increases as the number of hours studied increases can be easily seen by drawing a graph of the data, as in Figure 2.1. However, the luxury of being able to visualize data is only available at 2 and 3 dimensions. Since the covariance value can be calculated between any 2 dimensions in a data set, this technique is often used to find relationships between dimensions in high-dimensional data sets where visualisation is difficult.

[Figure 2.1: A plot of the covariance data showing the positive relationship between the number of hours studied and the mark received.]

You might ask "is cov(X, Y) equal to cov(Y, X)?" Well, a quick look at the formula for covariance tells us that yes, they are exactly the same, since the only difference between cov(X, Y) and cov(Y, X) is that $(X_i - \bar{X})(Y_i - \bar{Y})$ is replaced by $(Y_i - \bar{Y})(X_i - \bar{X})$. And since multiplication is commutative, which means that it doesn't matter which way around I multiply two numbers, I always get the same number, these two equations give the same answer.

2.1.4 The Covariance Matrix

Recall that covariance is always measured between 2 dimensions. If we have a data set with more than 2 dimensions, there is more than one covariance measurement that can be calculated. For example, from a 3-dimensional data set (dimensions x, y, z) you could calculate cov(x, y), cov(x, z), and cov(y, z). In fact, for an n-dimensional data set, you can calculate $\frac{n!}{(n-2)! \times 2}$ different covariance values.
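That count is just the number of distinct pairs of dimensions. A tiny sketch (mine, not part of the tutorial) evaluating the formula for a few values of n; for the 3-dimensional case it gives the 3 covariances listed above.

    import math

    def num_covariances(n):
        # number of distinct covariance values between n dimensions: n! / ((n - 2)! * 2)
        return math.factorial(n) // (math.factorial(n - 2) * 2)

    for n in (2, 3, 4, 10):
        print(n, num_covariances(n))   # 2 -> 1, 3 -> 3, 4 -> 6, 10 -> 45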
          Hours(H)   Mark(M)
Data           9        39
              15        56
              25        93
              14        61
              10        50
              18        75
               0        32
              16        85
               5        42
              19        70
              16        66
              20        80
Totals       167       749
Averages      13.92     62.42

Covariance:

   H    M   $(H_i - \bar{H})$   $(M_i - \bar{M})$   $(H_i - \bar{H})(M_i - \bar{M})$
   9   39        -4.92              -23.42                 115.23
  15   56         1.08               -6.42                  -6.93
  25   93        11.08               30.58                 338.83
  14   61         0.08               -1.42                  -0.11
  10   50        -3.92              -12.42                  48.69
  18   75         4.08               12.58                  51.33
   0   32       -13.92              -30.42                 423.45
  16   85         2.08               22.58                  46.97
   5   42        -8.92              -20.42                 182.15
  19   70         5.08                7.58                  38.51
  16   66         2.08                3.58                   7.45
  20   80         6.08               17.58                 106.89
                                            Total          1352.46
                                            Average         122.95

Table 2.2: 2-dimensional data set and covariance calculation
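The arithmetic in Table 2.2 can be reproduced in a few lines. This is a sketch of mine (not part of the tutorial) using numpy, whose cov function divides by (n-1) by default, so it should match the table to rounding; it also confirms that swapping the two dimensions gives the same answer.

    import numpy as np

    hours = np.array([9, 15, 25, 14, 10, 18, 0, 16, 5, 19, 16, 20])
    marks = np.array([39, 56, 93, 61, 50, 75, 32, 85, 42, 70, 66, 80])

    h_dev = hours - hours.mean()              # the (H_i - H-bar) column
    m_dev = marks - marks.mean()              # the (M_i - M-bar) column
    products = h_dev * m_dev                  # the last column of the table
    print(products.sum())                     # about 1352.42 (the table rounds each deviation first)
    print(products.sum() / (len(hours) - 1))  # about 122.95: positive, so hours and marks increase together
    print(np.cov(hours, marks)[0, 1])         # the same value straight from numpy
    print(np.cov(marks, hours)[0, 1])         # and with the dimensions swapped: cov(M, H) equals cov(H, M)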
A useful way to get all the possible covariance values between all the different dimensions is to calculate them all and put them in a matrix. I assume in this tutorial that you are familiar with matrices, and how they can be defined. So, the definition for the covariance matrix for a set of data with n dimensions is:

$$C^{n \times n} = (c_{i,j}, \; c_{i,j} = cov(Dim_i, Dim_j)),$$

where $C^{n \times n}$ is a matrix with n rows and n columns, and $Dim_x$ is the xth dimension. All that this ugly-looking formula says is that if you have an n-dimensional data set, then the matrix has n rows and columns (so is square) and each entry in the matrix is the result of calculating the covariance between two separate dimensions. E.g. the entry on row 2, column 3, is the covariance value calculated between the 2nd dimension and the 3rd dimension.

An example. We'll make up the covariance matrix for an imaginary 3-dimensional data set, using the usual dimensions x, y and z. Then, the covariance matrix has 3 rows and 3 columns, and the values are this:

$$C = \begin{pmatrix} cov(x,x) & cov(x,y) & cov(x,z) \\ cov(y,x) & cov(y,y) & cov(y,z) \\ cov(z,x) & cov(z,y) & cov(z,z) \end{pmatrix}$$

Some points to note: Down the main diagonal, you see that the covariance value is between one of the dimensions and itself. These are the variances for that dimension. The other point is that since cov(a, b) = cov(b, a), the matrix is symmetrical about the main diagonal.

Exercises

Work out the covariance between the x and y dimensions in the following 2-dimensional data set, and describe what the result indicates about the data.

Item Number:   1   2   3   4   5
x             10  39  19  23  28
y             43  13  32  21  20

Calculate the covariance matrix for this 3-dimensional set of data.

Item Number:   1   2   3
x              1  -1   4
y              2   1   3
z              1   3  -1

2.2 Matrix Algebra

This section serves to provide a background for the matrix algebra required in PCA. Specifically, I will be looking at the eigenvectors and eigenvalues of a given matrix. Again, I assume a basic knowledge of matrices.
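As a bridge between the covariance matrix just defined and the eigenvectors and eigenvalues about to be introduced, here is a sketch of mine (not part of the tutorial) using numpy. The data values are made up purely for illustration; np.cov expects one row per dimension and divides by (n-1), and np.linalg.eig returns eigenvectors as the columns of its second result.

    import numpy as np

    # made-up 3-dimensional data set: one row per dimension (x, y, z), one column per data item
    data = np.array([
        [ 4.0,  8.0, 13.0,  7.0,  2.0],   # x
        [11.0,  4.0,  5.0, 14.0,  8.0],   # y
        [ 3.0,  9.0, 10.0,  1.0,  6.0],   # z
    ])

    C = np.cov(data)              # 3x3 covariance matrix, entries cov(Dim_i, Dim_j)
    print(np.diag(C))             # the variances of x, y and z sit on the main diagonal
    print(np.allclose(C, C.T))    # True: symmetric, since cov(a, b) = cov(b, a)

    # looking ahead to the next section: the eigenvalues and eigenvectors of this matrix
    values, vectors = np.linalg.eig(C)
    v = vectors[:, 0]                             # first eigenvector (a column of the result)
    print(np.allclose(C @ v, values[0] * v))      # True: C v = lambda v, the defining property

The same np.cov call works for the exercise data above, and in the 2-dimensional case it reduces to the 2 x 2 matrix whose off-diagonal entry was used for the hours and marks example.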