Contents

Preface . . . xi

1 Principal Component Analysis (PCA) . . . 1
  1.1 Data — Notation — Examples . . . 1
  1.2 Objectives . . . 1
    1.2.1 Studying Individuals . . . 2
    1.2.2 Studying Variables . . . 3
    1.2.3 Relationships between the Two Studies . . . 5
  1.3 Studying Individuals . . . 5
    1.3.1 The Cloud of Individuals . . . 5
    1.3.2 Fitting the Cloud of Individuals . . . 7
      1.3.2.1 Best Plane Representation of N_I . . . 7
      1.3.2.2 Sequence of Axes for Representing N_I . . . 10
      1.3.2.3 How Are the Components Obtained? . . . 10
      1.3.2.4 Example . . . 10
    1.3.3 Representation of the Variables as an Aid for Interpreting the Cloud of Individuals . . . 11
  1.4 Studying Variables . . . 13
    1.4.1 The Cloud of Variables . . . 13
    1.4.2 Fitting the Cloud of Variables . . . 14
  1.5 Relationships between the Two Representations N_I and N_K . . . 16
  1.6 Interpreting the Data . . . 17
    1.6.1 Numerical Indicators . . . 17
      1.6.1.1 Percentage of Inertia Associated with a Component . . . 17
      1.6.1.2 Quality of Representation of an Individual or Variable . . . 18
      1.6.1.3 Detecting Outliers . . . 19
      1.6.1.4 Contribution of an Individual or Variable to the Construction of a Component . . . 19
    1.6.2 Supplementary Elements . . . 20
      1.6.2.1 Representing Supplementary Quantitative Variables . . . 21
      1.6.2.2 Representing Supplementary Categorical Variables . . . 22
      1.6.2.3 Representing Supplementary Individuals . . . 24
    1.6.3 Automatic Description of the Components . . . 24
  1.7 Implementation with FactoMineR . . . 25
  1.8 Additional Results . . . 26
    1.8.1 Testing the Significance of the Components . . . 26
    1.8.2 Variables: Loadings versus Correlations . . . 27
    1.8.3 Simultaneous Representation: Biplots . . . 27
    1.8.4 Missing Values . . . 28
    1.8.5 Large Datasets . . . 29
    1.8.6 Varimax Rotation . . . 29
  1.9 Example: The Decathlon Dataset . . . 30
    1.9.1 Data Description — Issues . . . 30
    1.9.2 Analysis Parameters . . . 30
      1.9.2.1 Choice of Active Elements . . . 30
      1.9.2.2 Should the Variables Be Standardised? . . . 32
    1.9.3 Implementation of the Analysis . . . 32
      1.9.3.1 Choosing the Number of Dimensions to Examine . . . 34
      1.9.3.2 Studying the Cloud of Individuals . . . 35
      1.9.3.3 Studying the Cloud of Variables . . . 38
      1.9.3.4 Joint Analysis of the Cloud of Individuals and the Cloud of Variables . . . 40
      1.9.3.5 Comments on the Data . . . 43
  1.10 Example: The Temperature Dataset . . . 45
    1.10.1 Data Description — Issues . . . 45
    1.10.2 Analysis Parameters . . . 45
      1.10.2.1 Choice of Active Elements . . . 45
      1.10.2.2 Should the Variables Be Standardised? . . . 46
    1.10.3 Implementation of the Analysis . . . 47
  1.11 Example of Genomic Data: The Chicken Dataset . . . 53
    1.11.1 Data Description — Issues . . . 53
    1.11.2 Analysis Parameters . . . 54
    1.11.3 Implementation of the Analysis . . . 54

2 Correspondence Analysis (CA) . . . 61
  2.1 Data — Notation — Examples . . . 61
  2.2 Objectives and the Independence Model . . . 63
    2.2.1 Objectives . . . 63
    2.2.2 Independence Model and χ² Test . . . 64
    2.2.3 The Independence Model and CA . . . 66
  2.3 Fitting the Clouds . . . 67
    2.3.1 Clouds of Row Profiles . . . 67
    2.3.2 Clouds of Column Profiles . . . 68
    2.3.3 Fitting Clouds N_I and N_J . . . 70
    2.3.4 Example: Women’s Attitudes to Women’s Work in France in 1970 . . . 71
      2.3.4.1 Column Representation (Mother’s Activity) . . . 72
      2.3.4.2 Row Representation (Partner’s Work) . . . 74
    2.3.5 Superimposed Representation of Both Rows and Columns . . . 74
  2.4 Interpreting the Data . . . 79
    2.4.1 Inertias Associated with the Dimensions (Eigenvalues) . . . 79
    2.4.2 Contribution of Points to a Dimension’s Inertia . . . 82
    2.4.3 Representation Quality of Points on a Dimension or Plane . . . 83
    2.4.4 Distance and Inertia in the Initial Space . . . 84
  2.5 Supplementary Elements (= Illustrative) . . . 85
  2.6 Implementation with FactoMineR . . . 88
  2.7 CA and Textual Data Processing . . . 90
  2.8 Example: The Olympic Games Dataset . . . 94
    2.8.1 Data Description — Issues . . . 94
    2.8.2 Implementation of the Analysis . . . 96
      2.8.2.1 Choosing the Number of Dimensions to Examine . . . 98
      2.8.2.2 Studying the Superimposed Representation . . . 98
      2.8.2.3 Interpreting the Results . . . 101
      2.8.2.4 Comments on the Data . . . 102
  2.9 Example: The White Wines Dataset . . . 104
    2.9.1 Data Description — Issues . . . 104
    2.9.2 Margins . . . 106
    2.9.3 Inertia . . . 107
    2.9.4 Representation on the First Plane . . . 109
  2.10 Example: The Causes of Mortality Dataset . . . 112
    2.10.1 Data Description — Issues . . . 112
    2.10.2 Margins . . . 114
    2.10.3 Inertia . . . 116
    2.10.4 First Dimension . . . 118
    2.10.5 Plane 2-3 . . . 120
    2.10.6 Projecting the Supplementary Elements . . . 124
    2.10.7 Conclusion . . . 127

3 Multiple Correspondence Analysis (MCA) . . . 131
  3.1 Data — Notation — Examples . . . 131
  3.2 Objectives . . . 132
    3.2.1 Studying Individuals . . . 132
    3.2.2 Studying the Variables and Categories . . . 133
  3.3 Defining Distances between Individuals and Distances between Categories . . . 134
    3.3.1 Distances between the Individuals . . . 134
    3.3.2 Distances between the Categories . . . 134
  3.4 CA on the Indicator Matrix . . . 136
    3.4.1 Relationship between MCA and CA . . . 136
    3.4.2 The Cloud of Individuals . . . 137
    3.4.3 The Cloud of Variables . . . 138
    3.4.4 The Cloud of Categories . . . 139
    3.4.5 Transition Relations . . . 142
  3.5 Interpreting the Data . . . 144
    3.5.1 Numerical Indicators . . . 144
      3.5.1.1 Percentage of Inertia Associated with a Component . . . 144
      3.5.1.2 Contribution and Representation Quality of an Individual or Category . . . 145
    3.5.2 Supplementary Elements . . . 146
    3.5.3 Automatic Description of the Components . . . 147
  3.6 Implementation with FactoMineR . . . 149
  3.7 Addendum . . . 152
    3.7.1 Analysing a Survey . . . 152
      3.7.1.1 Designing a Questionnaire: Choice of Format . . . 152
      3.7.1.2 Accounting for Rare Categories . . . 153
    3.7.2 Description of a Categorical Variable or a Subpopulation . . . 154
      3.7.2.1 Description of a Categorical Variable by a Categorical Variable . . . 154
      3.7.2.2 Description of a Subpopulation (or a Category) by a Quantitative Variable . . . 155
      3.7.2.3 Description of a Subpopulation (or a Category) by the Categories of a Categorical Variable . . . 156
    3.7.3 The Burt Table . . . 157
    3.7.4 Missing Values . . . 158
  3.8 Example: The Survey on the Perception of Genetically Modified Organisms . . . 160
    3.8.1 Data Description — Issues . . . 160
    3.8.2 Analysis Parameters and Implementation with FactoMineR . . . 163
    3.8.3 Analysing the First Plane . . . 164
    3.8.4 Projection of Supplementary Variables . . . 165
    3.8.5 Conclusion . . . 167
  3.9 Example: The Sorting Task Dataset . . . 167
    3.9.1 Data Description — Issues . . . 167
    3.9.2 Analysis Parameters . . . 169
    3.9.3 Representation of Individuals on the First Plane . . . 169
    3.9.4 Representation of Categories . . . 170
    3.9.5 Representation of the Variables . . . 171

4 Clustering . . . 173
  4.1 Data — Issues . . . 173
  4.2 Formalising the Notion of Similarity . . . 177
    4.2.1 Similarity between Individuals . . . 177
      4.2.1.1 Distances and Euclidean Distances . . . 177
      4.2.1.2 Example of Non-Euclidean Distance . . . 178
      4.2.1.3 Other Euclidean Distances . . . 179
      4.2.1.4 Similarities and Dissimilarities . . . 179
    4.2.2 Similarity between Groups of Individuals . . . 180
  4.3 Constructing an Indexed Hierarchy . . . 181
    4.3.1 Classic Agglomerative Algorithm . . . 181
    4.3.2 Hierarchy and Partitions . . . 183
  4.4 Ward’s Method . . . 183
    4.4.1 Partition Quality . . . 184
    4.4.2 Agglomeration According to Inertia . . . 185
    4.4.3 Two Properties of the Agglomeration Criterion . . . 187
    4.4.4 Analysing Hierarchies, Choosing Partitions . . . 188
  4.5 Direct Search for Partitions: K-Means Algorithm . . . 189
    4.5.1 Data — Issues . . . 189
    4.5.2 Principle . . . 190
    4.5.3 Methodology . . . 191
  4.6 Partitioning and Hierarchical Clustering . . . 191
    4.6.1 Consolidating Partitions . . . 192
    4.6.2 Mixed Algorithm . . . 192
  4.7 Clustering and Principal Component Methods . . . 192
    4.7.1 Principal Component Methods Prior to AHC . . . 193
    4.7.2 Simultaneous Analysis of a Principal Component Map and Hierarchy . . . 193
  4.8 Clustering and Missing Data . . . 194
  4.9 Example: The Temperature Dataset . . . 194
    4.9.1 Data Description — Issues . . . 194
    4.9.2 Analysis Parameters . . . 195
    4.9.3 Implementation of the Analysis . . . 195
  4.10 Example: The Tea Dataset . . . 199
    4.10.1 Data Description — Issues . . . 199
    4.10.2 Constructing the AHC . . . 201
    4.10.3 Defining the Clusters . . . 202
  4.11 Dividing Quantitative Variables into Classes . . . 204

5 Visualisation . . . 209
  5.1 Data — Issues . . . 209
  5.2 Viewing PCA Data . . . 209
    5.2.1 Selecting a Subset of Objects — Cloud of Individuals . . . 210
    5.2.2 Selecting a Subset of Objects — Cloud of Variables . . . 211
    5.2.3 Adding Supplementary Information . . . 212