Chapter 1 Aspects of Multivariate Analysis

• The U.S. Internal Revenue Service uses data collected from tax returns to sort taxpayers into two groups: those that will be audited and those that will not. (See [31].)

Investigation of the dependence among variables

• Data on several variables were used to identify factors that were responsible for client success in hiring external consultants. (See [12].)
• Measurements of variables related to innovation, on the one hand, and variables related to the business environment and business organization, on the other hand, were used to discover why some firms are product innovators and some firms are not. (See [3].)
• Measurements of pulp fiber characteristics and subsequent measurements of characteristics of the paper made from them are used to examine the relations between pulp fiber properties and the resulting paper properties. The goal is to determine those fibers that lead to higher quality paper. (See [17].)
• The associations between measures of risk-taking propensity and measures of socioeconomic characteristics for top-level business executives were used to assess the relation between risk-taking behavior and performance. (See [18].)

Prediction

• The associations between test scores, and several high school performance variables, and several college performance variables were used to develop predictors of success in college. (See [10].)
• Data on several variables related to the size distribution of sediments were used to develop rules for predicting different depositional environments. (See [7] and [20].)
• Measurements on several accounting and financial variables were used to develop a method for identifying potentially insolvent property-liability insurers. (See [28].)
• cDNA microarray experiments (gene expression data) are increasingly used to study the molecular variations among cancer tumors. A reliable classification of tumors is essential for successful diagnosis and treatment of cancer. (See [9].)

Hypotheses testing

• Several pollution-related variables were measured to determine whether levels for a large metropolitan area were roughly constant throughout the week, or whether there was a noticeable difference between weekdays and weekends. (See Exercise 1.6.)
• Experimental data on several variables were used to see whether the nature of the instructions makes any difference in perceived risks, as quantified by test scores. (See [27].)
• Data on many variables were used to investigate the differences in structure of American occupations to determine the support for one of two competing sociological theories. (See [16] and [25].)
• Data on several variables were used to determine whether different types of firms in newly industrialized countries exhibited different patterns of innovation. (See [15].)
The preceding descriptions offer glimpses into the use of multivariate methods in widely diverse fields.

1.3 The Organization of Data

Throughout this text, we are going to be concerned with analyzing measurements made on several variables or characteristics. These measurements (commonly called data) must frequently be arranged and displayed in various ways. For example, graphs and tabular arrangements are important aids in data analysis. Summary numbers, which quantitatively portray certain features of the data, are also necessary to any description.

We now introduce the preliminary concepts underlying these first steps of data organization.

Arrays

Multivariate data arise whenever an investigator, seeking to understand a social or physical phenomenon, selects a number $p \ge 1$ of variables or characters to record. The values of these variables are all recorded for each distinct item, individual, or experimental unit.

We will use the notation $x_{jk}$ to indicate the particular value of the $k$th variable that is observed on the $j$th item, or trial. That is,

$$x_{jk} = \text{measurement of the } k\text{th variable on the } j\text{th item}$$

Consequently, $n$ measurements on $p$ variables can be displayed as follows:

$$
\begin{array}{lcccccc}
 & \text{Variable 1} & \text{Variable 2} & \cdots & \text{Variable } k & \cdots & \text{Variable } p \\
\text{Item 1:} & x_{11} & x_{12} & \cdots & x_{1k} & \cdots & x_{1p} \\
\text{Item 2:} & x_{21} & x_{22} & \cdots & x_{2k} & \cdots & x_{2p} \\
\quad\vdots & \vdots & \vdots & & \vdots & & \vdots \\
\text{Item } j\text{:} & x_{j1} & x_{j2} & \cdots & x_{jk} & \cdots & x_{jp} \\
\quad\vdots & \vdots & \vdots & & \vdots & & \vdots \\
\text{Item } n\text{:} & x_{n1} & x_{n2} & \cdots & x_{nk} & \cdots & x_{np}
\end{array}
$$

Or we can display these data as a rectangular array, called $\mathbf{X}$, of $n$ rows and $p$ columns:

$$
\mathbf{X} =
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1k} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2k} & \cdots & x_{2p} \\
\vdots & \vdots & & \vdots & & \vdots \\
x_{j1} & x_{j2} & \cdots & x_{jk} & \cdots & x_{jp} \\
\vdots & \vdots & & \vdots & & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nk} & \cdots & x_{np}
\end{bmatrix}
$$

The array $\mathbf{X}$, then, contains the data consisting of all of the observations on all of the variables.
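As a rough illustration of this indexing convention, the sketch below stores a small data array in NumPy. The numerical values and the helper function `x` are hypothetical and chosen only for illustration. NumPy counts rows and columns from 0, so the text's $x_{jk}$ corresponds to `X[j - 1, k - 1]`.

```python
import numpy as np

# Hypothetical data: n = 3 items (rows) measured on p = 2 variables (columns).
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
])

n, p = X.shape  # number of items, number of variables


def x(j, k):
    """Return x_jk, the value of the k-th variable on the j-th item (1-based)."""
    return X[j - 1, k - 1]


print(n, p)      # 3 2
print(x(2, 1))   # 2.0
print(X[:, 1])   # all n measurements on variable 2
```

Keeping items in rows and variables in columns matches the layout of the array above, so a column slice such as `X[:, k - 1]` collects every measurement on variable $k$.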
Example 1.1 (A data array) A selection of four receipts from a university bookstore was obtained in order to investigate the nature of book sales. Each receipt provided, among other things, the number of books sold and the total amount of each sale. Let the first variable be total dollar sales and the second variable be number of books sold. Then we can regard the corresponding numbers on the receipts as four measurements on two variables. Suppose the data, in tabular form, are

Variable 1 (dollar sales):      42   52   48   58
Variable 2 (number of books):    4    5    4    3

Using the notation just introduced, we have

$$x_{11} = 42 \quad x_{21} = 52 \quad x_{31} = 48 \quad x_{41} = 58$$
$$x_{12} = 4 \quad x_{22} = 5 \quad x_{32} = 4 \quad x_{42} = 3$$

and the data array $\mathbf{X}$ is

$$
\mathbf{X} =
\begin{bmatrix}
42 & 4 \\
52 & 5 \\
48 & 4 \\
58 & 3
\end{bmatrix}
$$

with four rows and two columns. ■

Considering data in the form of arrays facilitates the exposition of the subject matter and allows numerical calculations to be performed in an orderly and efficient manner. The efficiency is twofold, as gains are attained in both (1) describing numerical calculations as operations on arrays and (2) the implementation of the calculations on computers, which now use many languages and statistical packages to perform array operations. We consider the manipulation of arrays of numbers in Chapter 2. At this point, we are concerned only with their value as devices for displaying data.

Descriptive Statistics

A large data set is bulky, and its very mass poses a serious obstacle to any attempt to visually extract pertinent information. Much of the information contained in the data can be assessed by calculating certain summary numbers, known as descriptive statistics. For example, the arithmetic average, or sample mean, is a descriptive statistic that provides a measure of location, that is, a "central value" for a set of numbers. And the average of the squares of the distances of all of the numbers from the mean provides a measure of the spread, or variation, in the numbers.

We shall rely most heavily on descriptive statistics that measure location, variation, and linear association. The formal definitions of these quantities follow.

Let $x_{11}, x_{21}, \ldots, x_{n1}$ be $n$ measurements on the first variable. Then the arithmetic average of these measurements is

$$\bar{x}_1 = \frac{1}{n} \sum_{j=1}^{n} x_{j1}$$
If the $n$ measurements represent a subset of the full set of measurements that might have been observed, then $\bar{x}_1$ is also called the sample mean for the first variable. We adopt this terminology because the bulk of this book is devoted to procedures designed to analyze samples of measurements from larger collections.

The sample mean can be computed from the $n$ measurements on each of the $p$ variables, so that, in general, there will be $p$ sample means:

$$\bar{x}_k = \frac{1}{n} \sum_{j=1}^{n} x_{jk} \qquad k = 1, 2, \ldots, p \tag{1-1}$$

A measure of spread is provided by the sample variance, defined for $n$ measurements on the first variable as

$$s_1^2 = \frac{1}{n} \sum_{j=1}^{n} (x_{j1} - \bar{x}_1)^2$$

where $\bar{x}_1$ is the sample mean of the $x_{j1}$'s. In general, for $p$ variables, we have

$$s_k^2 = \frac{1}{n} \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2 \qquad k = 1, 2, \ldots, p \tag{1-2}$$

Two comments are in order. First, many authors define the sample variance with a divisor of $n - 1$ rather than $n$. Later we shall see that there are theoretical reasons for doing this, and it is particularly appropriate if the number of measurements, $n$, is small. The two versions of the sample variance will always be differentiated by displaying the appropriate expression.

Second, although the $s^2$ notation is traditionally used to indicate the sample variance, we shall eventually consider an array of quantities in which the sample variances lie along the main diagonal. In this situation, it is convenient to use double subscripts on the variances in order to indicate their positions in the array. Therefore, we introduce the notation $s_{kk}$ to denote the same variance computed from measurements on the $k$th variable, and we have the notational identities

$$s_k^2 = s_{kk} = \frac{1}{n} \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2 \qquad k = 1, 2, \ldots, p \tag{1-3}$$

The square root of the sample variance, $\sqrt{s_{kk}}$, is known as the sample standard deviation. This measure of variation uses the same units as the observations.

Consider $n$ pairs of measurements on each of variables 1 and 2:

$$
\begin{bmatrix} x_{11} \\ x_{12} \end{bmatrix},
\begin{bmatrix} x_{21} \\ x_{22} \end{bmatrix},
\ldots,
\begin{bmatrix} x_{n1} \\ x_{n2} \end{bmatrix}
$$

That is, $x_{j1}$ and $x_{j2}$ are observed on the $j$th experimental item ($j = 1, 2, \ldots, n$). A measure of linear association between the measurements of variables 1 and 2 is provided by the sample covariance

$$s_{12} = \frac{1}{n} \sum_{j=1}^{n} (x_{j1} - \bar{x}_1)(x_{j2} - \bar{x}_2)$$
or the average product of the deviations from their respective means. If large values for one variable are observed in conjunction with large values for the other variable, and the small values also occur together, $s_{12}$ will be positive. If large values from one variable occur with small values for the other variable, $s_{12}$ will be negative. If there is no particular association between the values for the two variables, $s_{12}$ will be approximately zero.

The sample covariance

$$s_{ik} = \frac{1}{n} \sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k) \qquad i = 1, 2, \ldots, p, \quad k = 1, 2, \ldots, p \tag{1-4}$$

measures the association between the $i$th and $k$th variables. We note that the covariance reduces to the sample variance when $i = k$. Moreover, $s_{ik} = s_{ki}$ for all $i$ and $k$.

The final descriptive statistic considered here is the sample correlation coefficient (or Pearson's product-moment correlation coefficient, see [14]). This measure of the linear association between two variables does not depend on the units of measurement. The sample correlation coefficient for the $i$th and $k$th variables is defined as

$$r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}}\,\sqrt{s_{kk}}} = \frac{\displaystyle\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k)}{\sqrt{\displaystyle\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)^2}\,\sqrt{\displaystyle\sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2}} \tag{1-5}$$

for $i = 1, 2, \ldots, p$ and $k = 1, 2, \ldots, p$. Note $r_{ik} = r_{ki}$ for all $i$ and $k$.

The sample correlation coefficient is a standardized version of the sample covariance, where the product of the square roots of the sample variances provides the standardization. Notice that $r_{ik}$ has the same value whether $n$ or $n - 1$ is chosen as the common divisor for $s_{ii}$, $s_{kk}$, and $s_{ik}$.

The sample correlation coefficient $r_{ik}$ can also be viewed as a sample covariance. Suppose the original values $x_{ji}$ and $x_{jk}$ are replaced by standardized values $(x_{ji} - \bar{x}_i)/\sqrt{s_{ii}}$ and $(x_{jk} - \bar{x}_k)/\sqrt{s_{kk}}$. The standardized values are commensurable because both sets are centered at zero and expressed in standard deviation units. The sample correlation coefficient is just the sample covariance of the standardized observations.
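The definitions (1-1) through (1-5) can be sketched numerically. The example below is only an illustration, not a prescribed implementation: it applies the formulas, with divisor $n$, to the bookstore data of Example 1.1 using NumPy. The variable names (`xbar`, `D`, `S`, `R`, and so on) are assumptions introduced here.

```python
import numpy as np

# Data array from Example 1.1: rows are receipts, columns are
# (dollar sales, number of books).
X = np.array([
    [42.0, 4.0],
    [52.0, 5.0],
    [48.0, 4.0],
    [58.0, 3.0],
])
n, p = X.shape

# Sample means, equation (1-1): one mean per variable (column).
xbar = X.sum(axis=0) / n

# Deviations of each measurement from its variable's mean.
D = X - xbar

# Sample variances and covariances with divisor n, equation (1-4).
# Entry S[i, k] is s_ik; the diagonal holds the variances s_kk of (1-3).
S = D.T @ D / n

# Sample correlations, equation (1-5): r_ik = s_ik / (sqrt(s_ii) sqrt(s_kk)).
std = np.sqrt(np.diag(S))
R = S / np.outer(std, std)

# The correlation is also the covariance of the standardized observations.
Z = D / std
R_from_z = Z.T @ Z / n

# Using divisor n - 1 changes S but leaves R unchanged.
S_alt = D.T @ D / (n - 1)
std_alt = np.sqrt(np.diag(S_alt))
R_alt = S_alt / np.outer(std_alt, std_alt)

print(xbar)     # means: 50 and 4
print(S)        # s11 = 34, s22 = 0.5, s12 = s21 = -1.5
print(R[0, 1])  # approximately -0.364
```

Here `R`, `R_from_z`, and `R_alt` agree up to rounding, which illustrates both the standardization interpretation and the remark that the choice of divisor does not affect $r_{ik}$.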
Although the signs of the sample correlation and the sample covariance are the same, the correlation is ordinarily easier to interpret because its magnitude is bounded. To summarize, the sample correlation $r$ has the following properties:

1. The value of $r$ must be between $-1$ and $+1$ inclusive.
2. Here $r$ measures the strength of the linear association. If $r = 0$, this implies a lack of linear association between the components. Otherwise, the sign of $r$ indicates the direction of the association: $r < 0$ implies a tendency for one value in the pair to be larger than its average when the other is smaller than its average; and $r > 0$ implies a tendency for one value of the pair to be large when the other value is large and also for both values to be small together.
3. The value of $r_{ik}$ remains unchanged if the measurements of the $i$th variable are changed to $y_{ji} = a x_{ji} + b$, $j = 1, 2, \ldots, n$, and the values of the $k$th variable are changed to $y_{jk} = c x_{jk} + d$, $j = 1, 2, \ldots, n$, provided that the constants $a$ and $c$ have the same sign.
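Properties 1 and 3 can be checked numerically on the data of Example 1.1. The sketch below is illustrative only: the helper `sample_corr` and the constants $a$, $b$, $c$, $d$ are hypothetical choices made here, not part of the text.

```python
import numpy as np


def sample_corr(u, v):
    """Sample correlation of two equal-length sequences, as in (1-5)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    du, dv = u - u.mean(), v - v.mean()
    return (du @ dv) / np.sqrt((du @ du) * (dv @ dv))


# Example 1.1 data: dollar sales and number of books.
sales = np.array([42.0, 52.0, 48.0, 58.0])
books = np.array([4.0, 5.0, 4.0, 3.0])

r = sample_corr(sales, books)

# Property 3: y_j1 = a*x_j1 + b and y_j2 = c*x_j2 + d with a and c of the
# same sign (here a = 100, b = 7, c = 2, d = -1, chosen arbitrarily).
r_same_sign = sample_corr(100.0 * sales + 7.0, 2.0 * books - 1.0)

# If a and c have opposite signs, the magnitude is kept but the sign flips.
r_opposite_sign = sample_corr(-100.0 * sales + 7.0, 2.0 * books - 1.0)

print(r, r_same_sign, r_opposite_sign)
```

For these data `r` is negative and lies inside $[-1, 1]$; `r_same_sign` equals `r` up to rounding, while `r_opposite_sign` equals `-r`, which shows why the same-sign condition in property 3 is needed.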