Contents

10 CANONICAL CORRELATION ANALYSIS 539
10.1 Introduction 539
10.2 Canonical Variates and Canonical Correlations 539
10.3 Interpreting the Population Canonical Variables 545
    Identifying the Canonical Variables, 545
    Canonical Correlations as Generalizations of Other Correlation Coefficients, 547
    The First r Canonical Variables as a Summary of Variability, 548
    A Geometrical Interpretation of the Population Canonical Correlation Analysis, 549
10.4 The Sample Canonical Variates and Sample Canonical Correlations 550
10.5 Additional Sample Descriptive Measures 558
    Matrices of Errors of Approximations, 558
    Proportions of Explained Sample Variance, 561
10.6 Large Sample Inferences 563
Exercises 567
References 574

11 DISCRIMINATION AND CLASSIFICATION 575
11.1 Introduction 575
11.2 Separation and Classification for Two Populations 576
11.3 Classification with Two Multivariate Normal Populations 584
    Classification of Normal Populations When Σ₁ = Σ₂ = Σ, 584
    Scaling, 589
    Fisher's Approach to Classification with Two Populations, 590
    Is Classification a Good Idea?, 592
    Classification of Normal Populations When Σ₁ ≠ Σ₂, 593
11.4 Evaluating Classification Functions 596
11.5 Classification with Several Populations 606
    The Minimum Expected Cost of Misclassification Method, 606
    Classification with Normal Populations, 609
11.6 Fisher's Method for Discriminating among Several Populations 621
    Using Fisher's Discriminants to Classify Objects, 628
11.7 Logistic Regression and Classification 634
    Introduction, 634
    The Logit Model, 634
    Logistic Regression Analysis, 636
    Classification, 638
    Logistic Regression with Binomial Responses, 640
11.8 Final Comments 644
    Including Qualitative Variables, 644
    Classification Trees, 644
    Neural Networks, 647
    Selection of Variables, 648
    Testing for Group Differences, 648
    Graphics, 649
    Practical Considerations Regarding Multivariate Normality, 649
Exercises 650
References 669

12 CLUSTERING, DISTANCE METHODS, AND ORDINATION 671
12.1 Introduction 671
12.2 Similarity Measures 673
    Distances and Similarity Coefficients for Pairs of Items, 673
    Similarities and Association Measures for Pairs of Variables, 677
    Concluding Comments on Similarity, 678
12.3 Hierarchical Clustering Methods 680
    Single Linkage, 682
    Complete Linkage, 685
    Average Linkage, 690
    Ward's Hierarchical Clustering Method, 692
    Final Comments-Hierarchical Procedures, 695
12.4 Nonhierarchical Clustering Methods 696
    K-means Method, 696
    Final Comments-Nonhierarchical Procedures, 701
12.5 Clustering Based on Statistical Models 703
12.6 Multidimensional Scaling 706
    The Basic Algorithm, 708
12.7 Correspondence Analysis 716
    Algebraic Development of Correspondence Analysis, 718
    Inertia, 725
    Interpretation in Two Dimensions, 726
    Final Comments, 726
12.8 Biplots for Viewing Sampling Units and Variables 726
    Constructing Biplots, 727
12.9 Procrustes Analysis: A Method for Comparing Configurations 732
    Constructing the Procrustes Measure of Agreement, 733
Supplement 12A: Data Mining 740
    Introduction, 740
    The Data Mining Process, 741
    Model Assessment, 742
Exercises 747
References 755

APPENDIX 757
DATA INDEX 764
SUBJECT INDEX 767
Preface

INTENDED AUDIENCE

This book originally grew out of our lecture notes for an "Applied Multivariate Analysis" course offered jointly by the Statistics Department and the School of Business at the University of Wisconsin-Madison. Applied Multivariate Statistical Analysis, Sixth Edition, is concerned with statistical methods for describing and analyzing multivariate data. Data analysis, while interesting with one variable, becomes truly fascinating and challenging when several variables are involved. Researchers in the biological, physical, and social sciences frequently collect measurements on several variables. Modern computer packages readily provide the numerical results to rather complex statistical analyses. We have tried to provide readers with the supporting knowledge necessary for making proper interpretations, selecting appropriate techniques, and understanding their strengths and weaknesses. We hope our discussions will meet the needs of experimental scientists, in a wide variety of subject matter areas, as a readable introduction to the statistical analysis of multivariate observations.

LEVEL

Our aim is to present the concepts and methods of multivariate analysis at a level that is readily understandable by readers who have taken two or more statistics courses. We emphasize the applications of multivariate methods and, consequently, have attempted to make the mathematics as palatable as possible. We avoid the use of calculus. On the other hand, the concepts of a matrix and of matrix manipulations are important. We do not assume the reader is familiar with matrix algebra. Rather, we introduce matrices as they appear naturally in our discussions, and we then show how they simplify the presentation of multivariate models and techniques.

The introductory account of matrix algebra, in Chapter 2, highlights the more important matrix algebra results as they apply to multivariate analysis. The Chapter 2 supplement provides a summary of matrix algebra results for those with little or no previous exposure to the subject. This supplementary material helps make the book self-contained and is used to complete proofs. The proofs may be ignored on the first reading. In this way we hope to make the book accessible to a wide audience.

In our attempt to make the study of multivariate analysis appealing to a large audience of both practitioners and theoreticians, we have had to sacrifice
a consistency of level. Some sections are harder than others. In particular, we have summarized a voluminous amount of material on regression in Chapter 7. The resulting presentation is rather succinct and difficult the first time through. We hope instructors will be able to compensate for the unevenness in level by judiciously choosing those sections, and subsections, appropriate for their students and by toning them down if necessary.

ORGANIZATION AND APPROACH

The methodological "tools" of multivariate analysis are contained in Chapters 5 through 12. These chapters represent the heart of the book, but they cannot be assimilated without much of the material in the introductory Chapters 1 through 4. Even those readers with a good knowledge of matrix algebra or those willing to accept the mathematical results on faith should, at the very least, peruse Chapter 3, "Sample Geometry," and Chapter 4, "Multivariate Normal Distribution."

Our approach in the methodological chapters is to keep the discussion direct and uncluttered. Typically, we start with a formulation of the population models, delineate the corresponding sample results, and liberally illustrate everything with examples. The examples are of two types: those that are simple and whose calculations can be easily done by hand, and those that rely on real-world data and computer software. These will provide an opportunity to (1) duplicate our analyses, (2) carry out the analyses dictated by exercises, or (3) analyze the data using methods other than the ones we have used or suggested.

The division of the methodological chapters (5 through 12) into three units allows instructors some flexibility in tailoring a course to their needs. Possible sequences for a one-semester (two-quarter) course are indicated schematically. Each instructor will undoubtedly omit certain sections from some chapters to cover a broader collection of topics than is indicated by these two choices.

Getting Started (Chapters 1-4) -> Inference About Means (Chapters 5-7) -> Analysis of Covariance Structure (Chapters 8-10)
Getting Started (Chapters 1-4) -> Classification and Grouping (Chapters 11 and 12) -> Analysis of Covariance Structure (Chapters 8-10)
For most students, we would suggest a quick pass through the first four chapters (concentrating primarily on the material in Chapter 1; Sections 2.1, 2.2, 2.3, 2.5, 2.6, and 3.6; and the "assessing normality" material in Chapter 4) followed by a selection of methodological topics. For example, one might discuss the comparison of mean vectors, principal components, factor analysis, discriminant analysis, and clustering. The discussions could feature the many "worked out" examples included in these sections of the text. Instructors may rely on diagrams and verbal descriptions to teach the corresponding theoretical developments. If the students have uniformly strong mathematical backgrounds, much of the book can successfully be covered in one term.

We have found individual data-analysis projects useful for integrating material from several of the methods chapters. Here, our rather complete treatments of multivariate analysis of variance (MANOVA), regression analysis, factor analysis, canonical correlation, discriminant analysis, and so forth are helpful, even though they may not be specifically covered in lectures.

CHANGES TO THE SIXTH EDITION

New material. Users of the previous editions will notice several major changes in the sixth edition.

• Twelve new data sets including national track records for men and women, psychological profile scores, car body assembly measurements, cell phone tower breakdowns, pulp and paper properties measurements, Mali family farm data, stock price rates of return, and Concho water snake data.
• Thirty-seven new exercises and twenty revised exercises, with many of these exercises based on the new data sets.
• Four new data-based examples and fifteen revised examples.
• Six new or expanded sections:
    1. Section 6.6 Testing for Equality of Covariance Matrices
    2. Section 11.7 Logistic Regression and Classification
    3. Section 12.5 Clustering Based on Statistical Models
    4. Expanded Section 6.3 to include "An Approximation to the Distribution of T² for Normal Populations When Sample Sizes are not Large"
    5. Expanded Sections 7.6 and 7.7 to include Akaike's Information Criterion
    6. Consolidated previous Sections 11.3 and 11.5 on two-group discriminant analysis into single Section 11.3

Web Site. To make the methods of multivariate analysis more prominent in the text, we have removed the long proofs of Results 7.2, 7.4, 7.10 and 10.1 and placed them on a web site accessible through www.prenhall.com/statistics. Click on "Multivariate Statistics" and then click on our book.
In addition, all full data sets saved as ASCII files that are used in the book are available on the web site.

Instructors' Solutions Manual. An Instructors' Solutions Manual is available on the author's website accessible through www.prenhall.com/statistics. For information on additional for-sale supplements that may be used with the book or additional titles of interest, please visit the Prentice Hall web site at www.prenhall.com