Principal Component Analysis

FIGURE 1.7
Orange juice data: plane representation of the scatterplot of individuals on Dim 1 (67.77%) and Dim 2 (19.05%). The six juices plotted are Pampryl amb., Tropicana amb., Fruvita fr., Joker amb., Tropicana fr. and Pampryl fr.

between the orange juices, separates the two orange juices Tropicana fr. and Pampryl amb. According to the data in Table 1.2, we can see that these orange juices are the most extreme in terms of the descriptors odour typicality and bitterness: Tropicana fr. is the most typical and the least bitter, while Pampryl amb. is the least typical and the most bitter. The second component, that is, the property that separates the orange juices most significantly once the main principal component of variability has been removed, identifies Tropicana amb., which is the least intense in terms of odour, and Pampryl fr., which is among the most intense (see Table 1.2).

Reading this data is tedious when there are many individuals and variables. For practical purposes, we will facilitate the characterisation of the principal components by using the variables more directly.

1.3.3 Representation of the Variables as an Aid for Interpreting the Cloud of Individuals

Let Fs denote the vector of coordinates of the I individuals on component s and Fs(i) its value for individual i. Vector Fs is also called the principal component of rank s. Fs is of dimension I and can thus be considered as a variable. To interpret the relative positions of the individuals on the component of rank s, it may be interesting to calculate the correlation coefficient between vector Fs and the initial variables. Thus, when the correlation coefficient between Fs and a variable k is positive (or negative), an individual with a positive coordinate on component Fs will generally have a high (or low, respectively) value (relative to the average) for variable k.
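The principal components Fs can be computed by diagonalising the correlation matrix of the standardised data table. The book's own analyses are done in R; the following is a minimal numpy sketch on hypothetical data (the dimensions and random values are illustrative, not the orange juice data):

```python
import numpy as np

# Hypothetical data table: I = 6 individuals, K = 4 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

# Standardise: centre each variable and scale it to unit variance
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Diagonalise the correlation matrix C = Z'Z / I
I = Z.shape[0]
C = Z.T @ Z / I
eigval, eigvec = np.linalg.eigh(C)
order = np.argsort(eigval)[::-1]          # sort by decreasing inertia
eigval, eigvec = eigval[order], eigvec[:, order]

# Column s of F holds the coordinates of the I individuals on component s,
# i.e. the principal component of rank s
F = Z @ eigvec

# The variance (inertia) of each component equals its eigenvalue
print(np.allclose(F.var(axis=0), eigval))  # True
```

The percentages of inertia shown on the axes of Figure 1.7 would here be `eigval / eigval.sum()`.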
In the example, F1 is strongly positively correlated with the variables odour typicality and sweetness and strongly negatively correlated with the variables bitter and acidic (see Table 1.4). Thus Tropicana fr., which has the
highest coordinate on component 1, has high values for odour typicality and sweetness and low values for the variables acidic and bitter. Similarly, we can examine the correlations between F2 and the variables. It may be noted that the correlations are generally lower (in absolute value) than those with the first principal component. We will see that this is directly linked to the percentage of inertia associated with F2 which is, by construction, lower than that associated with F1. The second component can be characterised by the variables odour intensity and pulp content (see Table 1.4).

TABLE 1.4
Orange Juice Data: Correlation between Variables and First Two Components

                        F1      F2
Odour intensity       0.46    0.75
Odour typicality      0.99    0.13
Pulp content          0.72    0.62
Intensity of taste   -0.65    0.43
Acidity              -0.91    0.35
Bitterness           -0.93    0.19
Sweetness             0.95   -0.16

To make these results easier to interpret, particularly in cases with a high number of variables, it is possible to represent each variable on a graph, using its correlation coefficients with F1 and F2 as coordinates (see Figure 1.8).

FIGURE 1.8
Orange juice data: visualisation of the correlation coefficients between the variables and the principal components F1 and F2, plotted on Dimension 1 (67.77%) and Dimension 2 (19.05%). For example, pulp content is plotted at coordinates (0.72, 0.62).

We can now interpret the joint representation of the cloud of individuals with this representation of the variables.
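Correlation tables of the kind shown in Table 1.4 can be computed directly by correlating each initial variable with each component. A numpy sketch on hypothetical standardised data (again not the orange juice data), which also checks the identity r(k, Fs) = sqrt(lambda_s) * v_s[k] linking these correlations to the eigendecomposition:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 5))                       # 8 individuals, 5 variables
Z = (X - X.mean(axis=0)) / X.std(axis=0)          # standardised table

# Principal components from the correlation matrix
eigval, V = np.linalg.eigh(Z.T @ Z / len(Z))
eigval, V = eigval[::-1], V[:, ::-1]              # decreasing order
F = Z @ V                                         # components F_s (columns)

# Correlation between each variable k and each of the first two components:
# these are the coordinates used in a correlation circle such as Figure 1.8
corr = np.array([[np.corrcoef(Z[:, k], F[:, s])[0, 1]
                  for s in range(2)] for k in range(Z.shape[1])])

# Equivalently, r(k, F_s) = sqrt(eigval_s) * V[k, s]
print(np.allclose(corr, np.sqrt(eigval[:2]) * V[:, :2]))  # True
```

Plotting the rows of `corr` as points gives the correlation circle of Figure 1.8.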
Remark
A variable is always represented within a circle of radius 1 (circle represented in Figure 1.8): indeed, it must be noted that F1 and F2 are orthogonal (in the sense that their correlation coefficient is equal to 0) and that a variable cannot be strongly related to two orthogonal components simultaneously. In the following section we shall examine why the variable will always be found within the circle of radius 1.

1.4 Studying Variables

1.4.1 The Cloud of Variables

Let us now consider the data table as a set of columns. A variable is one of the columns in the table, that is, a set of I numerical values, which is represented by a point of the vector space with I dimensions, denoted R^I (and known as the "variables' space"). The vector connects the origin of R^I to the point. All of these vectors constitute the cloud of variables, and this ensemble is denoted NK (see Figure 1.9).

FIGURE 1.9
The scatterplot of the variables NK in R^I. In the case of a standardised PCA, the variables k are located within a hypersphere of radius 1.

The scalar product between two variables k and l is expressed as

\sum_{i=1}^{I} x_{ik} \times x_{il} = \|k\| \times \|l\| \times \cos(\theta_{kl})
with \|k\| and \|l\| the norms of variables k and l, and \theta_{kl} the angle produced by the vectors representing variables k and l. Since the variables used here are centred, the norm of a variable is equal to its standard deviation multiplied by the square root of I, and the scalar product is expressed as follows:

\sum_{i=1}^{I} (x_{ik} - \bar{x}_k) \times (x_{il} - \bar{x}_l) = I \times s_k \times s_l \times \cos(\theta_{kl}).

On the right-hand side of the equation we can identify, up to the factor I, the covariance between variables k and l. Similarly, by dividing each term in the equation by I and by the standard deviations s_k and s_l of variables k and l, we obtain the following relationship:

r(k, l) = \cos(\theta_{kl}).

This property is essential in PCA as it provides a geometric interpretation of the correlation. Therefore, in the same way as the representation of cloud NI can be used to visualise the variability between individuals, a representation of the cloud NK can be used to visualise all of the correlations (through the angles between variables) or, in other words, the correlation matrix. To facilitate visualisation of the angles between variables, the variables are represented by vectors rather than points.

Generally speaking, the variables, being centred and reduced (scaled to unit variance), have a length with a value of 1 (hence the term "standardised variable"). The vector extremities are therefore on the sphere of radius 1 (also called a "hypersphere" to highlight the fact that, in general, I > 3), as shown in Figure 1.9.

Comment about the Centring
In R^K, when the variables are centred, the origin of the axes is translated onto the mean point. This property is not true for NK.

1.4.2 Fitting the Cloud of Variables

As is the case for the individuals, the cloud of variables NK is situated in a space R^I with a great number of dimensions, and it is impossible to visualise the cloud in the overall space.
The cloud of variables must therefore be adjusted using the same strategy as for the cloud of individuals. We maximise an equivalent criterion,

\sum_{k=1}^{K} (OH_k)^2,

with H_k the projection of variable k on the subspace with reduced dimensions. Here too, the subspaces are nested, and we can identify a series of orthogonal axes v_s which define the subspaces for dimensions s = 1 to S. Vector v_s therefore belongs to a given subspace and is orthogonal to the vectors v_t which make up the smaller subspaces. It can therefore be shown that the vector v_s maximises

\sum_{k=1}^{K} (OH_k^s)^2,

where H_k^s is the projection of variable k on v_s.
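The identity r(k, l) = cos(\theta_{kl}) established above can be verified numerically: for centred variables, the Pearson correlation coefficient is exactly the cosine of the angle between the two vectors in R^I. A small sketch with two arbitrary (made-up) variables:

```python
import numpy as np

# Two arbitrary variables observed on I = 7 individuals
x = np.array([2.0, 4.1, 3.3, 5.0, 4.4, 2.8, 3.9])
y = np.array([1.0, 2.6, 2.1, 3.5, 2.2, 1.4, 2.7])

# Centre both variables
xc, yc = x - x.mean(), y - y.mean()

# Cosine of the angle between the centred vectors in R^I
cos_theta = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))

# Pearson correlation coefficient
r = np.corrcoef(x, y)[0, 1]

print(np.isclose(cos_theta, r))  # True
```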
Remark
In the individual space R^K, centring the variables causes the origin of the axes to shift to the mean point: the maximised criterion is therefore interpreted as a variance; the projected points must be as dispersed as possible. In R^I, centring has a different effect, as the origin is not the same as the mean point. The projected points should be as far as possible from the origin (although not necessarily dispersed), even if that means being grouped together or merged. This indicates that the position of the cloud NK with respect to the origin is important.

Vectors v_s (s = 1, ..., S) belong to the space R^I and consequently can be considered new variables. The correlation coefficient r(k, v_s) between variable k and v_s is equal to the cosine of the angle \theta_k^s between Ok and v_s when variable k is centred and scaled, and thus standardised. The plane representation constructed by (v_1, v_2) is therefore pleasing, as the coordinates of a variable k correspond to the cosines of the angles \theta_k^1 and \theta_k^2, and thus to the correlation coefficients between variable k and v_1, and between variable k and v_2. In a plane representation such as this, we can therefore immediately visualise whether or not a variable k is related to a dimension of variability (see Figure 1.10).

By their very construction, variables v_s maximise the criterion \sum_{k=1}^{K} (OH_k^s)^2. Since the projection of a variable k on v_s is equal to the cosine of the angle \theta_k^s, the criterion maximised is

\sum_{k=1}^{K} \cos^2 \theta_k^s = \sum_{k=1}^{K} r^2(k, v_s).

The above expression illustrates that v_s is the new variable which is the most strongly correlated with all of the K initial variables (subject to the constraint of being orthogonal to the vectors v_t already found). As a result, v_s can be said to be a synthetic variable. Here, we encounter the second aspect of the study of variables (see Section 1.2.2).

Remark
When a variable is not standardised, its length is equal to its standard deviation.
In an unstandardised PCA, the criterion can be expressed as follows:

\sum_{k=1}^{K} (OH_k^s)^2 = \sum_{k=1}^{K} s_k^2 \, r^2(k, v_s).

This highlights the fact that, in the case of an unstandardised PCA, each variable k is assigned a weight equal to its variance s_k^2.

It can be shown that the axes of the representation of NK are in fact eigenvectors of the matrix of scalar products between individuals. This property is, in practice, only used when the number of variables exceeds the number of individuals. We will see in the following that these eigenvectors can be deduced from those of the correlation matrix.
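The duality mentioned above can be illustrated numerically: the nonzero eigenvalues of the scalar-product matrix between individuals, Z Z'/I, coincide with those of the correlation matrix Z'Z/I. A numpy sketch with more (hypothetical) variables than individuals, the situation in which this property is useful:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 10))                 # I = 4 individuals, K = 10 variables
Z = (X - X.mean(axis=0)) / X.std(axis=0)     # standardised table

C = Z.T @ Z / len(Z)                         # K x K correlation matrix
W = Z @ Z.T / len(Z)                         # I x I scalar-product matrix

lc = np.sort(np.linalg.eigvalsh(C))[::-1]    # eigenvalues, decreasing order
lw = np.sort(np.linalg.eigvalsh(W))[::-1]

# The nonzero eigenvalues coincide; because of the centring, Z has rank
# at most I - 1 = 3, so only three of them are nonzero here
print(np.allclose(lc[:4], lw))  # True
```

Diagonalising the smaller I x I matrix is cheaper when K > I, which is why this route is taken in that case.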