Contents

5.3 Viewing Data from a CA  213
    5.3.1 Selecting a Subset of Objects: Cloud of Rows or Columns  213
    5.3.2 Adding Supplementary Information  216
5.4 Viewing MCA Data  216
    5.4.1 Selecting a Subset of Objects: Cloud of Individuals  217
    5.4.2 Selecting a Subset of Objects: Cloud of Categories  217
    5.4.3 Selecting a Subset of Objects: Clouds of Variables  218
    5.4.4 Adding Supplementary Information  218
5.5 Alternatives to the Graphics Function in the FactoMineR Package  219
    5.5.1 The Factoshiny Package  219
    5.5.2 The factoextra Package  221
5.6 Improving Graphs Using Arguments Common to Many FactoMineR Graphical Functions  221

Appendix  225
A.1 Percentage of Inertia Explained by the First Component or by the First Plane  225
A.2 R Software  230
    A.2.1 Introduction  230
    A.2.2 The Rcmdr Package  234
    A.2.3 The FactoMineR Package  236

Bibliography of Software Packages  241
Bibliography  243
Index  245
Preface

Qu’est-ce que l’analyse des données ? (English: What is data analysis?)

As it is usually understood in France, and within the context of this book, the expression analyse des données reflects a set of statistical methods whose main features are to be multidimensional and descriptive.

The term multidimensional itself covers two aspects. First, it implies that observations (or, in other words, individuals) are described by several variables. In this introduction we restrict ourselves to the most common data, those in which a group of individuals is described by one set of variables. But, beyond the fact that we have many values from many variables for each observation, it is the desire to study them simultaneously that is characteristic of a multidimensional approach. Thus, we will use those methods each time the notion of profile is relevant when considering an individual, for example, the response profile of consumers, the biometric profile of plants, the financial profile of businesses, and so forth.

From another point of view, the interest of considering values of individuals for a set of variables in a global manner lies in the fact that these variables are linked. Let us note that studying links between all the variables taken two-by-two does not constitute a multidimensional approach in the strict sense. This approach involves the simultaneous consideration of all the links between variables taken two-by-two. That is what is done, for example, when highlighting a synthetic variable: such a variable represents several others, which implies that it is linked to each of them, which is only possible if they are themselves linked two-by-two. The concept of synthetic variable is intrinsically multidimensional and is a powerful tool for the description of an individuals × variables table. In both respects, it is a key concept within the context of this book.

One last comment about the term analyse des données, since it can have at least two meanings: the one defined previously, and another, broader one that could be translated as “statistical investigation.” This second meaning is from a user’s standpoint; it is defined by an objective (to analyse data) and says nothing about the statistical methods to be used. This is what the English term data analysis covers. The term data analysis, in the sense of a set of descriptive multidimensional methods, is more of a French statistical point of view. It was introduced in France in the 1960s by Jean-Paul Benzécri, and the adoption of this term is probably related to the fact that these multivariate methods are at the heart of many “data analyses.”
To Whom Is This Book Addressed?

This book has been designed for scientists whose aim is not to become statisticians but who feel the need to analyse data themselves. It is therefore addressed to practitioners who are confronted with the analysis of data. From this perspective it is application oriented; formalism and mathematical notation have been reduced as much as possible, while examples and intuition have been emphasised. Specifically, an undergraduate level is quite sufficient to grasp all the concepts introduced.

On the software side, an introduction to the R language is sufficient, at least at first. This software is free and available on the Internet at the following address: http://www.r-project.org/.

Content and Spirit of the Book

This book focuses on four essential and basic methods of multivariate exploratory data analysis, those with the largest potential in terms of applications: principal component analysis (PCA) when variables are quantitative, correspondence analysis (CA) and multiple correspondence analysis (MCA) when variables are categorical, and hierarchical cluster analysis. The geometric point of view used to present all these methods constitutes a unique framework in the sense that it provides a unified vision when exploring multivariate data tables. Within this framework, we will present the principles, the indicators, and the ways of representing and visualising objects (rows and columns of a data table) that are common to all those exploratory methods. From this standpoint, adding supplementary information by simply projecting vectors is commonplace. Thus, we will show how it is possible to use categorical variables within a PCA context where the variables to be analysed are quantitative, to handle more than two categorical variables within a CA context where originally there are two variables, and to add quantitative variables within an MCA context where variables are categorical.

More than the theoretical aspects and the specific indicators induced by our geometrical viewpoint, we will illustrate the methods and the way they can be exploited using examples from various fields, hence the name of the book.

Throughout the text, each result is accompanied by its R command. All these commands are accessible from FactoMineR, an R package developed by the authors. The reader will be able to conduct all the analyses of the book, as all the datasets (as well as all the lines of code) are available at the following website address: http://factominer.free.fr/bookV2. We hope that with this book, the reader will be fully equipped (theory, examples, software) to confront multivariate real-life data.

Note on the Second Edition

There were two main reasons behind the second edition of this work. The first was that we wanted to add a chapter on viewing and improving the graphs produced by the FactoMineR software. The second was to add a section to each chapter on managing missing data, which will enable users to conduct analyses from incomplete tables more easily.

The authors would like to thank Rebecca Clayton for her help in the translation.