16 Exploratory Multivariate Analysis by Example Using R

FIGURE 1.10
Projection of the scatterplot of the variables on the main plane of variability. On the left: visualisation in space $\mathbb{R}^I$; on the right: visualisation of the projections in the principal plane.

The best plane representation of the cloud of variables corresponds exactly to the graph representing the variables obtained as an aid to interpreting the representation of individuals (see Figure 1.8). This remarkable property is not specific to the example but applies when carrying out any standardised PCA. This point will be developed further in the following section.

1.5 Relationships between the Two Representations $N_I$ and $N_K$

So far we have looked for representations of $N_I$ and $N_K$ according to the same principle and from one single data table. It therefore seems natural for these two analyses ($N_I$ in $\mathbb{R}^K$ and $N_K$ in $\mathbb{R}^I$) to be related.

The relationships between the two clouds $N_I$ and $N_K$ are brought together under the general term of "relations of duality." This term refers to the dual approach of one single data table, by considering either the rows or the columns. This approach is also defined by "transition relations" (calculating the coordinates in one space from those in the other). Where $F_s(i)$ is the coordinate of individual $i$ and $G_s(k)$ the coordinate of variable $k$ of the component of rank $s$, we obtain the following equations:

$$F_s(i) = \frac{1}{\sqrt{\lambda_s}} \sum_{k=1}^{K} x_{ik}\, G_s(k),$$
$$G_s(k) = \frac{1}{\sqrt{\lambda_s}} \sum_{i=1}^{I} \frac{1}{I}\, x_{ik}\, F_s(i).$$

This result is essential for interpreting the data, and makes PCA a rich and reliable experimental tool. This may be expressed as follows: individuals are on the same side as their corresponding variables with high values, and opposite their corresponding variables with low values. It must be noted that $x_{ik}$ are centred and carry both positive and negative values. This is one of the reasons why individuals can be so far from the variable for which they carry low values. $F_s$ is referred to as the principal component of rank $s$; $\lambda_s$ is the variance of $F_s$ and its square root the length of $F_s$ in $\mathbb{R}^I$; $v_s$ is known as the standardised principal component.

The total inertias of both clouds are equal (and equal to $K$ for standardised PCA) and furthermore, when decomposed component by component, they are identical. This property is remarkable: if $S$ dimensions are enough to perfectly represent $N_I$, the same is true for $N_K$. In this case, two dimensions are sufficient. If not, why generate a third synthetic variable which would not differentiate the individuals at all?

1.6 Interpreting the Data

1.6.1 Numerical Indicators

1.6.1.1 Percentage of Inertia Associated with a Component

The first indicators that we shall examine are the ratios between the projected inertias and the total inertia. For component $s$:

$$\frac{\sum_{i=1}^{I} \frac{1}{I}\,(OH_i^s)^2}{\sum_{i=1}^{I} \frac{1}{I}\,(Oi)^2} = \frac{\sum_{k=1}^{K} (OH_k^s)^2}{\sum_{k=1}^{K} (Ok)^2} = \frac{\lambda_s}{\sum_{s=1}^{K} \lambda_s}.$$

In the most usual case, when the PCA is standardised, $\sum_{s=1}^{K} \lambda_s = K$. When multiplied by 100, this indicator gives the percentage of inertia (of $N_I$ in $\mathbb{R}^K$ or of $N_K$ in $\mathbb{R}^I$) expressed by the component of rank $s$. This can be interpreted in two ways:

1. As a measure of the quality of data representation; in the example, we say that the first component expresses 67.77% of data variability (see Table 1.5).
   In a standardised PCA (where $I > K$), we often compare $\lambda_s$ with 1, the value below which the component of rank $s$, representing less data than a stand-alone variable, is not worthy of interest.
2. As a measure of the relative importance of the components; in the
   example, we say that the first component expresses three times more variability than the second; it affects three times more variables, but this formulation is truly precise only when each variable is perfectly correlated with a component.

Because of the orthogonality of the axes (both in $\mathbb{R}^K$ and in $\mathbb{R}^I$), these inertia percentages can be added together for several components; in the example, 86.82% of the data are represented by the first two components (67.77% + 19.05% = 86.82%).

TABLE 1.5
Orange Juice Data: Decomposition of Variability per Component

          Eigenvalue   Percentage    Cumulative
          variance     of variance   percentage
Comp 1    4.74         67.77          67.77
Comp 2    1.33         19.05          86.81
Comp 3    0.82         11.71          98.53
Comp 4    0.08          1.20          99.73
Comp 5    0.02          0.27         100.00

Let us return to Figure 1.5: the pictures of the fruits on the first row correspond approximately to a projection of the fruits on the plane constructed by components 2 and 3 of PCA, whereas the images on the second row correspond to a projection on plane 1-2. This is why the fruits are easier to recognise on the second row: the more variability (i.e., the more information) collected on plane 1-2 when compared to plane 2-3, the easier it is to grasp the overall shape of the cloud. Furthermore, the banana is easier to recognise in plane 1-2 (the second row), as it retrieves greater inertia on plane 1-2. In concrete terms, as the banana is a longer fruit than a melon, this leads to more marked differences in inertia between the components. As the melon is almost spherical, the percentages of inertia associated with each of the three components are around 33% and therefore the inertia retrieved by plane 1-2 is nearly 66% (as is that retrieved by plane 2-3).
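The percentages discussed above can be recomputed by hand. The following is a minimal sketch in plain Python (not the R code used by the book), assuming the rounded eigenvalues printed in Table 1.5; the function name `inertia_percentages` and the "melon" eigenvalues are illustrative choices, not taken from the text. Because the eigenvalues are rounded to two decimals, the results agree with the table only approximately.

```python
# Eigenvalues as printed (rounded) in Table 1.5, orange juice data.
eigenvalues = [4.74, 1.33, 0.82, 0.08, 0.02]


def inertia_percentages(eigs):
    """Return a list of (percentage, cumulative percentage) per component.

    Each percentage is 100 * lambda_s / sum of all eigenvalues.
    """
    total = sum(eigs)
    rows = []
    running = 0.0
    for lam in eigs:
        running += lam
        rows.append((100 * lam / total, 100 * running / total))
    return rows


table = inertia_percentages(eigenvalues)
for s, (pct, cum) in enumerate(table, start=1):
    print(f"Comp {s}: eigenvalue={eigenvalues[s - 1]:.2f} "
          f"pct={pct:.2f} cumulative={cum:.2f}")

# Rule of thumb for a standardised PCA: keep components whose
# eigenvalue exceeds 1 (the inertia of one stand-alone variable).
retained = [s for s, lam in enumerate(eigenvalues, start=1) if lam > 1]
print("Components with eigenvalue > 1:", retained)

# Hypothetical near-spherical cloud (the "melon"): three equal eigenvalues,
# so any plane built from two components retrieves about two thirds.
melon = inertia_percentages([1.0, 1.0, 1.0])
print(f"Melon, plane 1-2: {melon[1][1]:.1f}% of inertia")
```

The percentages add up across components only because the axes are orthogonal, which is why the cumulative column is a simple running sum.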
1.6.1.2 Quality of Representation of an Individual or Variable

The quality of representation of an individual $i$ on the component $s$ can be measured by the distance between the point within the space and the projection on the component. In reality, it is preferable to calculate the percentage of inertia of the individual $i$ projected on the component $s$. Therefore, when $\theta_i^s$ is the angle between $Oi$ and $u_s$, we obtain

$$\mathrm{qlt}_s(i) = \frac{\text{Projected inertia of } i \text{ on } u_s}{\text{Total inertia of } i} = \cos^2 \theta_i^s.$$

Using Pythagoras' theorem, this indicator is combined for multiple components and is most often calculated for a plane.
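To make the $\cos^2$ indicator concrete, here is a small sketch in plain Python under stated assumptions: the data are a hypothetical toy table (5 individuals, 2 variables, not from the book), and the principal axes are written down analytically, which is only valid for a standardised PCA on two positively correlated variables (eigenvalues $1 + r$ and $1 - r$, axes along $(1, 1)/\sqrt{2}$ and $(1, -1)/\sqrt{2}$). The names `standardise` and `cos2` are illustrative.

```python
import math

# Hypothetical toy data: 5 individuals, 2 variables (not from the book).
data = [[1.0, 2.0], [2.0, 1.5], [3.0, 3.5], [4.0, 3.0], [5.0, 5.0]]
n = len(data)


def standardise(column):
    """Centre and scale a column, using weights 1/I (population std dev)."""
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    return [(x - mean) / sd for x in column]


z1 = standardise([row[0] for row in data])
z2 = standardise([row[1] for row in data])

# Correlation between the two standardised variables.
r = sum(a * b for a, b in zip(z1, z2)) / n

# Assumption: with two standardised variables and r > 0, the principal
# axes and eigenvalues are known in closed form.
inv_sqrt2 = 1 / math.sqrt(2)
axes = [(inv_sqrt2, inv_sqrt2), (inv_sqrt2, -inv_sqrt2)]
eigenvalues = [1 + r, 1 - r]

# Coordinates of the individuals on each component (projection on the axis).
coords = [[u[0] * a + u[1] * b for a, b in zip(z1, z2)] for u in axes]


def cos2(i):
    """Quality of representation of individual i on each component."""
    total_inertia = z1[i] ** 2 + z2[i] ** 2  # squared distance to the origin
    return [coords[s][i] ** 2 / total_inertia for s in range(2)]


for i in range(n):
    q1, q2 = cos2(i)
    print(f"individual {i}: cos2 on comp 1 = {q1:.3f}, comp 2 = {q2:.3f}, "
          f"sum = {q1 + q2:.3f}")
```

Since the two axes span the whole space here, the two qualities of each individual sum to 1, which is the Pythagorean additivity mentioned above; with more variables, the sum over the first two components gives the quality of representation on the plane.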
The quality of representation of a variable $k$ on the component of rank $s$ is expressed as

$$\mathrm{qlt}_s(k) = \frac{\text{Projected inertia of } k \text{ on } v_s}{\text{Total inertia of } k} = \cos^2 \theta_k^s.$$

This last quantity is equal to $r^2(k, v_s)$, which is why the quality of representation of a variable is only very rarely provided by software. The representational quality of a variable in a given plane is obtained directly on the graph by visually evaluating its distance from the circle of radius 1.

1.6.1.3 Detecting Outliers

Analysing the shape of the cloud $N_I$ also means detecting unusual or remarkable individuals. An individual is considered remarkable if it has extreme values for multiple variables. In the cloud $N_I$, an individual such as this is far from the cloud's centre of gravity, and its remarkable nature can be evaluated from its distance from the centre of the cloud in the overall space $\mathbb{R}^K$.

In the example, none of the orange juices are particularly extreme (see Table 1.6). The two most extreme individuals are Tropicana fresh and Pampryl ambient.

TABLE 1.6
Orange Juice Data: Distances from the Individuals to the Centre of the Cloud

Pampryl amb.   Tropicana amb.   Fruvita fr.   Joker amb.   Tropicana fr.   Pampryl fr.
3.03           1.98             2.59          2.09         3.51            2.34

1.6.1.4 Contribution of an Individual or Variable to the Construction of a Component

Outliers have an influence on analysis, and it is interesting to know to what extent their influence affects the construction of the components. Furthermore, some individuals can influence the construction of certain components without being remarkable themselves. Detecting those individuals that contribute to the construction of a principal component helps to evaluate the component's stability. It is also interesting to evaluate the contribution of variables in constructing a component (especially in nonstandardised PCA). To do so, we decompose the inertia of a component individual by individual (or variable by variable). The inertia explained by the individual $i$ on the component $s$ is

$$\frac{\frac{1}{I}\,(OH_i^s)^2}{\lambda_s} \times 100.$$

Distances intervene in the components by their squares, which augments the roles of those individuals at a greater distance from the origin. Outlying
individuals are the most extreme on the component, and their contributions are especially useful when the individuals' weights are different.

Remark
These contributions are combined for multiple individuals.

When an individual contributes significantly (i.e., much more than the others) to the construction of a principal component (for example, Tropicana ambient and Pampryl fresh for the second component; see Table 1.7), it is not uncommon for the results of a new PCA constructed without this individual to change substantially: the principal components can change and new oppositions between individuals may appear.

TABLE 1.7
Orange Juice Data: Contribution of Individuals to the Construction of the Components

                  Dim.1   Dim.2
Pampryl amb.      31.29    0.08
Tropicana amb.     2.76   36.77
Fruvita fr.       13.18    0.02
Joker amb.        12.63    8.69
Tropicana fr.     35.66    4.33
Pampryl fr.        4.48   50.10

Similarly, the contribution of variable $k$ to the construction of component $s$ is calculated. An example of this is presented in Table 1.8.

TABLE 1.8
Orange Juice Data: Contribution of Variables to the Construction of the Components

                   Dim.1   Dim.2
Odour intensity     4.45   42.69
Odour typicality   20.47    1.35
Pulp content       10.98   28.52
Taste intensity     8.90   13.80
Acidity            17.56    9.10
Bitterness         18.42    2.65
Sweetness          19.22    1.89

1.6.2 Supplementary Elements

We here define the concept of active and supplementary (or illustrative) elements. By definition, active elements contribute to the construction of the principal components, contrary to supplementary elements. Thus, the inertia of the cloud of individuals is calculated on the basis of active individuals, and similarly, the inertia of the cloud of variables is calculated on the basis of active variables. The supplementary elements make it possible to illustrate