ACKNOWLEDGMENTS

We thank many of our colleagues who helped improve the applied aspect of the book by contributing their own data sets for examples and exercises. A number of individuals helped guide various revisions of this book, and we are grateful for their suggestions: Christopher Bingham, University of Minnesota; Steve Coad, University of Michigan; Richard Kiltie, University of Florida; Sam Kotz, George Mason University; Hira Koul, Michigan State University; Bruce McCullough, Drexel University; Shyamal Peddada, University of Virginia; K. Sivakumar, University of Illinois at Chicago; Eric Smith, Virginia Tech; and Stanley Wasserman, University of Illinois at Urbana-Champaign. We also acknowledge the feedback of the students we have taught these past 35 years in our applied multivariate analysis courses. Their comments and suggestions are largely responsible for the present iteration of this work. We would also like to give special thanks to Wai Kwong Cheang, Shanhong Guan, Jialiang Li and Zhiguo Xiao for their help with the calculations for many of the examples.

We must thank Dianne Hall for her valuable help with the Solutions Manual, Steve Verrill for computing assistance throughout, and Alison Pollack for implementing a Chernoff faces program. We are indebted to Cliff Gilman for his assistance with the multidimensional scaling examples discussed in Chapter 12. Jacquelyn Forer did most of the typing of the original draft manuscript, and we appreciate her expertise and willingness to endure cajoling of authors faced with publication deadlines. Finally, we would like to thank Petra Recter, Debbie Ryan, Michael Bell, Linda Behrens, Joanne Wendelken and the rest of the Prentice Hall staff for their help with this project.

R. A. Johnson
rich@stat.wisc.edu

D. W. Wichern
dwichern@tamu.edu
Applied Multivariate Statistical Analysis
Chapter 1
ASPECTS OF MULTIVARIATE ANALYSIS

1.1 Introduction

Scientific inquiry is an iterative learning process. Objectives pertaining to the explanation of a social or physical phenomenon must be specified and then tested by gathering and analyzing data. In turn, an analysis of the data gathered by experimentation or observation will usually suggest a modified explanation of the phenomenon. Throughout this iterative learning process, variables are often added or deleted from the study. Thus, the complexities of most phenomena require an investigator to collect observations on many different variables. This book is concerned with statistical methods designed to elicit information from these kinds of data sets. Because the data include simultaneous measurements on many variables, this body of methodology is called multivariate analysis.

The need to understand the relationships between many variables makes multivariate analysis an inherently difficult subject. Often, the human mind is overwhelmed by the sheer bulk of the data. Additionally, more mathematics is required to derive multivariate statistical techniques for making inferences than in a univariate setting. We have chosen to provide explanations based upon algebraic concepts and to avoid the derivations of statistical results that require the calculus of many variables. Our objective is to introduce several useful multivariate techniques in a clear manner, making heavy use of illustrative examples and a minimum of mathematics. Nonetheless, some mathematical sophistication and a desire to think quantitatively will be required.

Most of our emphasis will be on the analysis of measurements obtained without actively controlling or manipulating any of the variables on which the measurements are made. Only in Chapters 6 and 7 shall we treat a few experimental plans (designs) for generating data that prescribe the active manipulation of important variables. Although the experimental design is ordinarily the most important part of a scientific investigation, it is frequently impossible to control the
generation of appropriate data in certain disciplines. (This is true, for example, in business, economics, ecology, geology, and sociology.) You should consult [6] and [7] for detailed accounts of design principles that, fortunately, also apply to multivariate situations.

It will become increasingly clear that many multivariate methods are based upon an underlying probability model known as the multivariate normal distribution. Other methods are ad hoc in nature and are justified by logical or commonsense arguments. Regardless of their origin, multivariate techniques must, invariably, be implemented on a computer. Recent advances in computer technology have been accompanied by the development of rather sophisticated statistical software packages, making the implementation step easier.

Multivariate analysis is a "mixed bag." It is difficult to establish a classification scheme for multivariate techniques that is both widely accepted and indicates the appropriateness of the techniques. One classification distinguishes techniques designed to study interdependent relationships from those designed to study dependent relationships. Another classifies techniques according to the number of populations and the number of sets of variables being studied. Chapters in this text are divided into sections according to inference about treatment means, inference about covariance structure, and techniques for sorting or grouping. This should not, however, be considered an attempt to place each method into a slot. Rather, the choice of methods and the types of analyses employed are largely determined by the objectives of the investigation. In Section 1.2, we list a smaller number of practical problems designed to illustrate the connection between the choice of a statistical method and the objectives of the study. These problems, plus the examples in the text, should provide you with an appreciation of the applicability of multivariate techniques across different fields.

The objectives of scientific investigations to which multivariate methods most naturally lend themselves include the following:

1. Data reduction or structural simplification. The phenomenon being studied is represented as simply as possible without sacrificing valuable information. It is hoped that this will make interpretation easier.
2. Sorting and grouping. Groups of "similar" objects or variables are created, based upon measured characteristics. Alternatively, rules for classifying objects into well-defined groups may be required.
3. Investigation of the dependence among variables. The nature of the relationships among variables is of interest. Are all the variables mutually independent or are one or more variables dependent on the others? If so, how?
4. Prediction. Relationships between variables must be determined for the purpose of predicting the values of one or more variables on the basis of observations on the other variables.
5. Hypothesis construction and testing. Specific statistical hypotheses, formulated in terms of the parameters of multivariate populations, are tested. This may be done to validate assumptions or to reinforce prior convictions.

We conclude this brief overview of multivariate analysis with a quotation from F. H. C. Marriott [19], page 89. The statement was made in a discussion of cluster analysis, but we feel it is appropriate for a broader range of methods. You should keep it in mind whenever you attempt or read about a data analysis. It allows one to
maintain a proper perspective and not be overwhelmed by the elegance of some of the theory:

If the results disagree with informed opinion, do not admit a simple logical interpretation, and do not show up clearly in a graphical presentation, they are probably wrong. There is no magic about numerical methods, and many ways in which they can break down. They are a valuable aid to the interpretation of data, not sausage machines automatically transforming bodies of numbers into packets of scientific fact.

1.2 Applications of Multivariate Techniques

The published applications of multivariate methods have increased tremendously in recent years. It is now difficult to cover the variety of real-world applications of these methods with brief discussions, as we did in earlier editions of this book. However, in order to give some indication of the usefulness of multivariate techniques, we offer the following short descriptions of the results of studies from several disciplines. These descriptions are organized according to the categories of objectives given in the previous section. Of course, many of our examples are multifaceted and could be placed in more than one category.

Data reduction or simplification

• Using data on several variables related to cancer patient responses to radiotherapy, a simple measure of patient response to radiotherapy was constructed. (See Exercise 1.15.)
• Track records from many nations were used to develop an index of performance for both male and female athletes. (See [8] and [22].)
• Multispectral image data collected by a high-altitude scanner were reduced to a form that could be viewed as images (pictures) of a shoreline in two dimensions. (See [23].)
• Data on several variables relating to yield and protein content were used to create an index to select parents of subsequent generations of improved bean plants. (See [13].)
• A matrix of tactic similarities was developed from aggregate data derived from professional mediators. From this matrix the number of dimensions by which professional mediators judge the tactics they use in resolving disputes was determined. (See [21].)

Sorting and grouping

• Data on several variables related to computer use were employed to create clusters of categories of computer jobs that allow a better determination of existing (or planned) computer utilization. (See [2].)
• Measurements of several physiological variables were used to develop a screening procedure that discriminates alcoholics from nonalcoholics. (See [26].)
• Data related to responses to visual stimuli were used to develop a rule for separating people suffering from a multiple-sclerosis-caused visual pathology from those not suffering from the disease. (See Exercise 1.14.)