1 Principal Component Analysis (PCA)

1.1 Data — Notation — Examples

Principal component analysis (PCA) applies to data tables where rows are considered as individuals and columns as quantitative variables. Let $x_{ik}$ be the value taken by individual $i$ for variable $k$, where $i$ varies from 1 to $I$ and $k$ from 1 to $K$.

Let $\bar{x}_k$ denote the mean of variable $k$ calculated over all $I$ individuals:

$$\bar{x}_k = \frac{1}{I} \sum_{i=1}^{I} x_{ik},$$

and $s_k$ the (uncorrected) sample standard deviation of variable $k$:

$$s_k = \sqrt{\frac{1}{I} \sum_{i=1}^{I} (x_{ik} - \bar{x}_k)^2}.$$

Data subjected to a PCA can be very diverse in nature; some examples are listed in Table 1.1.

This first chapter will be illustrated using the “orange juice” dataset, chosen for its simplicity since it comprises only six statistical individuals or observations. The six orange juices were evaluated by a panel of experts according to seven sensory variables (odour intensity, odour typicality, pulp content, intensity of taste, acidity, bitterness, sweetness). The panel’s evaluations are summarised in Table 1.2.

1.2 Objectives

The data table can be considered either as a set of rows (individuals) or as a set of columns (variables), thus raising a number of questions relating to these different types of objects.
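Returning to the notation of Section 1.1, the mean and the uncorrected standard deviation of a variable can be computed directly from their definitions. The following is a minimal sketch in plain Python, used here purely for illustration. The function names `mean` and `uncorrected_sd` are chosen for this example and are not taken from the book. The values are the pulp content scores of the six juices, copied from Table 1.2.

```python
import math

# Pulp content scores of the six orange juices, copied from Table 1.2.
pulp_content = [1.66, 1.91, 4.00, 1.66, 3.69, 3.34]


def mean(values):
    """Mean of one variable over the I individuals."""
    return sum(values) / len(values)


def uncorrected_sd(values):
    """Standard deviation with divisor I (not I - 1), as defined in Section 1.1."""
    centre = mean(values)
    return math.sqrt(sum((x - centre) ** 2 for x in values) / len(values))


pulp_mean = mean(pulp_content)
pulp_sd = uncorrected_sd(pulp_content)
print(f"mean = {pulp_mean:.2f}, uncorrected sd = {pulp_sd:.2f}")
```

Dividing by I rather than I − 1 gives a slightly smaller value than the corrected standard deviation reported by default in many statistical packages, which is worth keeping in mind when comparing results.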
Exploratory Multivariate Analysis by Example Using R

TABLE 1.1
Some Examples of Datasets

Field       Individuals     Variables                    $x_{ik}$
Ecology     Rivers          Concentration of pollutants  Concentration of pollutant k in river i
Economics   Years           Economic indicators          Indicator value k for year i
Genetics    Patients        Genes                        Expression of gene k for patient i
Marketing   Brands          Measures of satisfaction     Value of measure k for brand i
Pedology    Soils           Granulometric composition    Content of component k in soil i
Biology     Animals         Measurements                 Measure k for animal i
Sociology   Social classes  Time by activity             Time spent on activity k by individuals from social class i

TABLE 1.2
The Orange Juice Data

                Odour      Odour       Pulp     Intensity  Acidity  Bitterness  Sweetness
                intensity  typicality  content  of taste
Pampryl amb.    2.82       2.53        1.66     3.46       3.15     2.97        2.60
Tropicana amb.  2.76       2.82        1.91     3.23       2.55     2.08        3.32
Fruvita fr.     2.83       2.88        4.00     3.45       2.42     1.76        3.38
Joker amb.      2.76       2.59        1.66     3.37       3.05     2.56        2.80
Tropicana fr.   3.20       3.02        3.69     3.12       2.33     1.97        3.34
Pampryl fr.     3.07       2.73        3.34     3.54       3.31     2.63        2.90

1.2.1 Studying Individuals

Figure 1.1 illustrates the types of questions posed during the study of individuals. This diagram represents three different situations where 40 individuals are described in terms of two variables: j and k. In graph A, we can clearly identify two distinct classes of individuals. Graph B illustrates a dimension of variability which opposes extreme individuals, much like graph A, but which also contains less extreme individuals. The cloud of individuals is therefore long and thin. Graph C depicts a more uniform cloud (i.e., with no specific structure).

Interpreting the data depicted in these examples is relatively straightforward as they are two dimensional. However, when individuals are described by a large number of variables, we require a tool to explore the space in which these individuals evolve. Studying individuals means identifying the similarities between individuals from the point of view of all the variables.
In other words, the aim is to provide a typology of the individuals: which individuals are the most similar (and which the most dissimilar)? Are there groups of individuals which
are homogeneous in terms of their similarities? In addition, we should look for common dimensions of variability which oppose extreme and intermediate individuals.

FIGURE 1.1
Representation of 40 individuals described by two variables: j and k. (Three scatterplots, panels A, B, and C; axes: Variable j and Variable k.)

In the example, two orange juices are considered similar if they were evaluated in the same way according to all the sensory descriptors. In such cases, the two orange juices have the same main dimensions of variability and are thus said to have the same sensory “profile.” More generally, we want to know whether or not there are groups of orange juices with similar profiles, that is, sensory dimensions which might oppose extreme juices to more intermediate juices.

1.2.2 Studying Variables

Following the approach taken to study the individuals, might it also be possible to interpret the data from the variables? PCA focuses on the linear relationships between variables. More complex links also exist, such as quadratic, logarithmic, or exponential relationships, but they are not studied in PCA. This may seem restrictive, but in practice many relationships can be considered linear, at least as an initial approximation.

Let us consider the example of the four variables (j, k, l, and m) in Figure 1.2. The clouds of points constructed by working from pairs of variables show that variables j and k (graph A) as well as variables l and m (graph F) are strongly correlated (positively for j and k and negatively for l and m).
However, the other graphs do not show any signs of relationships between variables. The study of these variables also suggests that the four variables are split into two groups of two variables, (j, k) and (l, m), and that, within one group, the variables are strongly correlated, whereas between groups, the variables are uncorrelated. In exactly the same way as for constructing groups of individuals, creating groups of variables may be useful with a view to synthesis. As for the individuals, we identify a continuum with groups of both
very unusual variables and intermediate variables, which are to some extent linked to both groups. In the example, each group can be represented by one single variable as the variables within each group are very strongly correlated. We refer to these variables as synthetic variables.

FIGURE 1.2
Representation of the relationships between four variables: j, k, l, and m, taken two-by-two. (Six scatterplots: A, j and k; B, j and l; C, k and l; D, j and m; E, k and m; F, l and m.)
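The strength of the linear link visible in each panel of a figure such as Figure 1.2 is measured by the linear correlation coefficient of the corresponding pair of variables. The sketch below, in plain Python for illustration only, computes this coefficient from its usual definition and counts how many distinct pairs exist among a given number of variables. The function names `correlation` and `n_pairs` are hypothetical helpers introduced for this example. The data are the bitterness and sweetness scores copied from Table 1.2.

```python
import math


def correlation(x, y):
    """Linear (Pearson) correlation coefficient between two variables."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of centred cross-products and squares; the 1/I factors cancel
    # between numerator and denominator, so they are omitted.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def n_pairs(k):
    """Number of distinct pairs among k variables: k(k - 1)/2."""
    return k * (k - 1) // 2


# Bitterness and sweetness scores of the six juices, copied from Table 1.2.
bitterness = [2.97, 2.08, 1.76, 2.56, 1.97, 2.63]
sweetness = [2.60, 3.32, 3.38, 2.80, 3.34, 2.90]

r = correlation(bitterness, sweetness)
print(f"r(bitterness, sweetness) = {r:.2f}")
print(f"pairs among 7 variables: {n_pairs(7)}; among 20 variables: {n_pairs(20)}")
```

The coefficient obtained for bitterness and sweetness should come out strongly negative, close to the value reported for this pair in the correlation matrix of the orange juice data. The pair count grows quadratically with the number of variables, which is why reading every scatterplot or coefficient quickly becomes impractical.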
When confronted with a very small number of variables, it is possible to draw conclusions from the clouds of points, or from the correlation matrix which groups together all of the linear correlation coefficients $r(j, k)$ between the pairs of variables. However, when working with a large number of variables, the correlation matrix groups together a large quantity of correlation coefficients ($K(K-1)/2$ distinct pairs, that is, 190 coefficients for $K = 20$ variables). It is therefore essential to have a tool capable of summarising the main relationships between the variables in a visual manner. The aim of PCA is to draw conclusions from the linear relationships between variables by detecting the principal dimensions of variability. As you will see, these conclusions will be supplemented by the definition of the synthetic variables offered by PCA. It will therefore be easier to describe the data using a few synthetic variables rather than all of the original variables.

In the example of the orange juice data, the correlation matrix (see Table 1.3) brings together the 21 correlation coefficients between the seven variables. It is possible to group
the strongly correlated variables into sets, but even for this reduced number of variables, grouping them this way is tedious.

TABLE 1.3
Orange Juice Data: Correlation Matrix

                    Odour      Odour       Pulp     Intensity  Acidity  Bitterness  Sweetness
                    intensity  typicality  content  of taste
Odour intensity      1.00       0.58        0.66    −0.27      −0.15    −0.15        0.23
Odour typicality     0.58       1.00        0.77    −0.62      −0.84    −0.88        0.92
Pulp content         0.66       0.77        1.00    −0.02      −0.47    −0.64        0.63
Intensity of taste  −0.27      −0.62       −0.02     1.00       0.73     0.51       −0.57
Acidity             −0.15      −0.84       −0.47     0.73       1.00     0.91       −0.90
Bitterness          −0.15      −0.88       −0.64     0.51       0.91     1.00       −0.98
Sweetness            0.23       0.92        0.63    −0.57      −0.90    −0.98        1.00

1.2.3 Relationships between the Two Studies

The study of individuals and the study of variables are interdependent as they are carried out on the same data table: studying them jointly can only reinforce their respective interpretations.

If the study of individuals led to a distinction between groups of individuals, it is then possible to list the individuals belonging to each group. However, when there are many individuals, it seems more pertinent to characterise them directly by the variables at hand: for example, by specifying that some orange juices are acidic and bitter whereas others have a high pulp content.

Similarly, when there are groups of variables, it may not be easy to interpret the relationships between many variables, and we can make use of specific individuals, that is, individuals that are extreme from the point of view of these relationships. In this case, it must be possible to identify the individuals. For example, the link between acidity and bitterness can be illustrated by the opposition between two extreme orange juices: Fresh Pampryl (orange juice from Spain) versus Fresh Tropicana (orange juice from Florida).

1.3 Studying Individuals

1.3.1 The Cloud of Individuals

An individual is a row of the data table, that is, a set of $K$ numerical values.
The individuals thus evolve within a space $\mathbb{R}^K$ called “the individual’s space.” If we endow this space with the usual Euclidean distance, the distance between