Principal Component Analysis

the principal components, which is why they are referred to as "illustrative elements." Contrary to the active elements, which must be homogeneous, we can make use of as many illustrative elements as possible.

1.6.2.1 Representing Supplementary Quantitative Variables

By definition, a supplementary quantitative variable plays no role in calculating the distances between individuals. Supplementary variables are represented in the same way as active variables: to assist in interpreting the cloud of individuals (Section 1.3.3). The coordinate of the supplementary variable $k'$ on the component $s$ corresponds to the correlation coefficient between $k'$ and the principal component $s$ (i.e., the variable whose values are the coordinates of the individuals projected on the component of rank $s$). $k'$ can therefore be represented on the same graph as the active variables.

More formally, the transition formulae can be used to calculate the coordinate of the supplementary variable $k'$ on the component of rank $s$:

$$G_s(k') = \frac{1}{\sqrt{\lambda_s}} \sum_{i \in \{\text{active}\}} x_{ik'} F_s(i) = r(k', F_s),$$

where {active} refers to the set of active individuals. This coordinate is calculated from the active individuals alone.

In the example, in addition to the sensory descriptors, there are also physicochemical variables at our disposal (see Table 1.9). However, our stance remains unchanged, namely, to describe the orange juices based on their sensory profiles. This problem can be enriched using the supplementary variables since we can now link sensory dimensions to the physicochemical variables.

TABLE 1.9
Orange Juice Data: Supplementary Variables

                 Glucose  Fructose  Saccharose  Sweetening  pH    Citric  Vitamin C
                                                power             acid
Pampryl amb.     25.32    27.36     36.45        89.95      3.59  0.84    43.44
Tropicana amb.   17.33    20.00     44.15        82.55      3.89  0.67    32.70
Fruvita fr.      23.65    25.65     52.12       102.22      3.85  0.69    37.00
Joker amb.       32.42    34.54     22.92        90.71      3.60  0.95    36.60
Tropicana fr.    22.70    25.32     45.80        94.87      3.82  0.71    39.50
Pampryl fr.      27.16    29.48     38.94        96.51      3.68  0.74    27.00

The correlation circle (Figure 1.11) represents both the active and supplementary variables. The main component of variability contrasts the orange juices perceived as acidic/bitter, slightly sweet and somewhat typical with the orange juices perceived as sweet, typical, not very acidic and slightly bitter. The analysis of this sensory perception is reinforced by the variables pH and saccharose. Indeed, these two variables are positively correlated with the first component and lie on the side of the orange juices perceived as sweet and
slightly acidic (a high pH index indicates low acidity). One also finds the reaction known as "saccharose inversion" (or hydrolysis): the saccharose breaks down into glucose and fructose in an acidic environment (the acidic orange juices thus contain more fructose and glucose, and less saccharose than the average).

[Figure 1.11: correlation circle. Horizontal axis: Dim 1 (67.77%); vertical axis: Dim 2 (19.05%). Variables plotted: Odour.intensity, Odour.typicality, Pulpiness, Intensity.of.taste, Acidity, Bitterness, Sweetness, Glucose, Fructose, Saccharose, Sweetening.power, pH, Citric.acid, Vitamin.C.]

FIGURE 1.11
Orange juice data: representation of the active and supplementary variables.

Remark
When using PCA to explore data prior to a multiple regression, it is advisable to choose the explanatory variables for the regression model as active variables for PCA, and to project the variable to be explained (the dependent variable) as a supplementary variable. This gives some idea of the relationships between explanatory variables and thus of the need to select explanatory variables. This also gives us an idea of the quality of the regression: if the dependent variable is well projected, the model is likely to fit well.

1.6.2.2 Representing Supplementary Categorical Variables

In PCA, the active variables are quantitative by nature, but it is possible to use information from categorical variables on a purely illustrative (i.e., supplementary) basis, that is, without using it to calculate the distances between individuals.

The categorical variables cannot be represented in the same way as the supplementary quantitative variables since it is impossible to calculate the correlation between a categorical variable and $F_s$. Information about categorical variables lies within their categories. It is quite natural to represent a category at the barycentre of all the individuals possessing that category.
Thus, following projection on the plane defined by the principal
components, these categories remain at the barycentre of the individuals in their plane representation. A category can thus be regarded as the mean individual obtained from the set of individuals who have it. This is therefore the way in which it will be represented on the graph of individuals.

The information resulting from a supplementary categorical variable can also be represented using a colour code: all of the individuals with the same category are coloured in the same way. This facilitates visualisation of dispersion around the barycentres associated with specific categories.

In the example, we can introduce the variable way of preserving, which has two categories (ambient and fresh), as well as the variable origin of the fruit juice, which also has two categories (Florida and Other); see Table 1.10. It seems that sensory perception of the products differs according to their packaging (despite the fact that they were all tasted at the same temperature). The second bisector separates the products purchased in the chilled section of the supermarket from the others (see Figure 1.12).

TABLE 1.10
Orange Juice Data: Supplementary Categorical Variables

                 Way of preserving  Origin
Pampryl amb.     Ambient            Other
Tropicana amb.   Ambient            Florida
Fruvita fr.      Fresh              Florida
Joker amb.       Ambient            Other
Tropicana fr.    Fresh              Florida
Pampryl fr.      Fresh              Other

[Figure 1.12: graph of individuals. Horizontal axis: Dim 1 (67.77%); vertical axis: Dim 2 (19.05%). Points: Pampryl amb., Tropicana amb., Fruvita fr., Joker amb., Tropicana fr., Pampryl fr.; category barycentres: Ambient, Fresh, Florida, Other.]

FIGURE 1.12
Orange juice data: plane representation of the scatterplot of individuals with a supplementary categorical variable.
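To make the idea of a category as a "mean individual" concrete, the following is a minimal sketch in Python (not the R code used with FactoMineR). The active sensory scores are not reproduced in this section, so the sketch assumes, purely for illustration, that two supplementary variables of Table 1.9 (pH and saccharose) play the role of coordinates; the categories come from Table 1.10. The function name `barycentres` and the dictionary layout are hypothetical, and the resulting points are not the category positions of Figure 1.12.

```python
from collections import defaultdict

# Values copied from Table 1.9 (pH, saccharose) and Table 1.10 (categories).
# Assumption: pH and saccharose stand in for coordinates, for illustration only.
JUICES = {
    "Pampryl amb.":   {"pH": 3.59, "saccharose": 36.45, "way": "Ambient", "origin": "Other"},
    "Tropicana amb.": {"pH": 3.89, "saccharose": 44.15, "way": "Ambient", "origin": "Florida"},
    "Fruvita fr.":    {"pH": 3.85, "saccharose": 52.12, "way": "Fresh",   "origin": "Florida"},
    "Joker amb.":     {"pH": 3.60, "saccharose": 22.92, "way": "Ambient", "origin": "Other"},
    "Tropicana fr.":  {"pH": 3.82, "saccharose": 45.80, "way": "Fresh",   "origin": "Florida"},
    "Pampryl fr.":    {"pH": 3.68, "saccharose": 38.94, "way": "Fresh",   "origin": "Other"},
}


def barycentres(rows, category_key, variables):
    """Return {category: mean point} over the individuals in each category."""
    groups = defaultdict(list)
    for row in rows.values():
        groups[row[category_key]].append([row[v] for v in variables])
    # Average each coordinate (column) within each group of individuals.
    return {
        category: [sum(column) / len(column) for column in zip(*points)]
        for category, points in groups.items()
    }


by_origin = barycentres(JUICES, "origin", ["pH", "saccharose"])
by_way = barycentres(JUICES, "way", ["pH", "saccharose"])

for name, result in (("origin", by_origin), ("way of preserving", by_way)):
    for category, (ph, saccharose) in sorted(result.items()):
        print(f"{name:18s} {category:8s} pH={ph:.2f} saccharose={saccharose:.2f}")
```

With these values, the Florida juices have a higher mean pH (about 3.85 against 3.62) and a higher mean saccharose content (about 47.36 against 32.77) than the other juices. In an actual PCA the same averaging would be applied to the coordinates of the individuals on the principal components.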
Exploratory Multivariate Analysis by Example Using R

1.6.2.3 Representing Supplementary Individuals

Just as for the variables, we can use the transition formula to calculate the coordinate of a supplementary individual $i'$ on the component of rank $s$:

$$F_s(i') = \frac{1}{\sqrt{\lambda_s}} \sum_{k=1}^{K} x_{i'k} G_s(k).$$

Note that centring and standardising (if any) are conducted with respect to the averages and the standard deviations calculated from the active individuals only. Moreover, the coordinate of $i'$ is calculated from the active variables alone. Therefore, it is not necessary to have the values of the supplementary individuals for the supplementary variables.

Comment
A category of a supplementary categorical variable can be regarded as a supplementary individual which, for each active variable, would take the average calculated from all of the individuals in this category.

1.6.3 Automatic Description of the Components

The components provided by the principal component method can be described automatically by all of the variables, whether quantitative or categorical, supplementary or active.

For a quantitative variable, the principle is the same whether the variable is active or supplementary. First, the correlation coefficient between the coordinates of the individuals on the component $s$ and each variable is calculated. We then sort the variables in descending order from the highest coefficient to the lowest and retain the variables with the highest correlation coefficients (in absolute value).

Comment
Let us recall that principal components are linear combinations of the active variables, as are synthetic variables. Testing the significance of the correlation coefficient between a component and a variable is thus a biased procedure by its very construction. However, it is useful to sort and select the active variables in this manner to describe the components.
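The significance test of a correlation coefficient referred to in this comment can be sketched as follows. This is a Python illustration under stated assumptions, not the FactoMineR implementation: we assume the usual statistic $t = r\sqrt{I-2}/\sqrt{1-r^2}$ with $I-2$ degrees of freedom, and we use a closed-form expression of the Student distribution function that is valid only for 4 degrees of freedom, i.e., for the $I = 6$ orange juices. The correlations are those reported in Table 1.11; the function names are hypothetical.

```python
import math


def t_cdf_4df(t):
    """CDF of Student's t distribution with exactly 4 degrees of freedom."""
    u = t / math.sqrt(1.0 + t * t / 4.0)
    return 0.5 + 0.375 * u * (1.0 - u * u / 12.0)


def correlation_p_value(r, n=6):
    """Two-sided p-value for H0: correlation = 0 (closed form, n = 6 only)."""
    if n != 6:
        raise ValueError("this sketch only covers n = 6 (4 degrees of freedom)")
    t = r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)
    return 2.0 * (1.0 - t_cdf_4df(abs(t)))


# Correlations with the first component, copied from Table 1.11.
TABLE_1_11 = {
    "Odour typicality": 0.9854,
    "Sweetness": 0.9549,
    "pH": 0.8797,
    "Acidity": -0.9127,
    "Bitterness": -0.9348,
}

P_VALUES = {name: correlation_p_value(r) for name, r in TABLE_1_11.items()}

# Sort by decreasing correlation, as in the automatic description.
for name, r in sorted(TABLE_1_11.items(), key=lambda item: -item[1]):
    print(f"{name:18s} r={r:+.4f}  p-value={P_VALUES[name]:.4f}")
```

Rounded to four decimal places, the p-values obtained agree with those listed in Table 1.11. For a different number of individuals the closed form no longer applies and a general Student distribution function, for instance from a statistics library, would be needed.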
On the other hand, for the supplementary variables, the test described corresponds to that traditionally used to test the significance of the correlation coefficient between two variables.

For a categorical variable, we conduct a one-way analysis of variance where we seek to explain the coordinates of the individuals (on the component of rank $s$) by the categorical variable; we use the sum-to-zero contrasts $\sum_i \alpha_i = 0$. Then, for each category, a Student t-test is conducted to compare the average of the individuals who possess that category with the general average (we test $\alpha_i = 0$ considering that the variances of the coordinates are
equal for each category). The correlation coefficients are sorted according to the p-values in descending order for the positive coefficients and in ascending order for the negative coefficients.

These tips for interpreting such data are particularly useful for understanding those dimensions with a high number of variables.

The data used is made up of few variables. We shall nonetheless give the outputs of the automatic description procedure for the first component as an example. The variables which best characterise component 1 are odour typicality, sweetness, bitterness, and acidity (see Table 1.11).

TABLE 1.11
Orange Juice Data: Description of the First Dimension by the Quantitative Variables

                   Correlation  p-value
Odour typicality    0.9854      0.0003
Sweetness           0.9549      0.0030
pH                  0.8797      0.0208
Acidity            -0.9127      0.0111
Bitterness         -0.9348      0.0062

The first component is also characterised by the categorical variable Origin as the correlation is significantly different from 0 (p-value = 0.00941; see the result in the object quali in Table 1.12); the coordinates of the orange juices from Florida are significantly higher than average on the first component, whereas the coordinates of the other orange juices are lower than average (see the results in the object category in Table 1.12).

TABLE 1.12
Orange Juice Data: Description of the First Dimension by the Categorical Variables and the Categories of These Categorical Variables

$Dim.1$quali
          R2      p-value
Origin    0.8458  0.0094

$Dim.1$category
          Estimate  p-value
Florida    2.0031   0.0094
Other     -2.0031   0.0094

1.7 Implementation with FactoMineR

In this section, we will explain how to carry out a PCA with FactoMineR and how to find the results obtained with the orange juice data. First, load