The Organization of Data

The quantities $s_{ik}$ and $r_{ik}$ do not, in general, convey all there is to know about the association between two variables. Nonlinear associations can exist that are not revealed by these descriptive statistics. Covariance and correlation provide measures of linear association, or association along a line. Their values are less informative for other kinds of association. On the other hand, these quantities can be very sensitive to "wild" observations ("outliers") and may indicate association when, in fact, little exists. In spite of these shortcomings, covariance and correlation coefficients are routinely calculated and analyzed. They provide cogent numerical summaries of association when the data do not exhibit obvious nonlinear patterns of association and when wild observations are not present.

Suspect observations must be accounted for by correcting obvious recording mistakes and by taking actions consistent with the identified causes. The values of $s_{ik}$ and $r_{ik}$ should be quoted both with and without these observations.

The sum of squares of the deviations from the mean and the sum of cross-product deviations are often of interest themselves. These quantities are

$$w_{kk} = \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2, \qquad k = 1, 2, \ldots, p \tag{1-6}$$

and

$$w_{ik} = \sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k), \qquad i = 1, 2, \ldots, p, \quad k = 1, 2, \ldots, p \tag{1-7}$$

The descriptive statistics computed from $n$ measurements on $p$ variables can also be organized into arrays.

Arrays of Basic Descriptive Statistics

Sample means

$$\bar{\mathbf{x}} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{bmatrix}$$

Sample variances and covariances

$$\mathbf{S}_n = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{21} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{p1} & s_{p2} & \cdots & s_{pp} \end{bmatrix} \tag{1-8}$$

Sample correlations

$$\mathbf{R} = \begin{bmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{bmatrix}$$
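As a rough illustration of (1-6) through (1-8) and of the two shortcomings noted above, the following Python sketch builds the arrays from a list of observations. The function names and both small data sets are hypothetical, invented here for illustration only; the sketch assumes that $s_{ik} = w_{ik}/n$ (the divisor $n$ indicated by the subscript on $\mathbf{S}_n$) and that $r_{ik} = w_{ik}/\sqrt{w_{ii} w_{kk}}$.

```python
from math import sqrt


def cross_products(data):
    """Return the p x p array of w_ik from (1-6) and (1-7).

    data is a list of n rows with p measurements each, so data[j][k]
    plays the role of x_jk.
    """
    n, p = len(data), len(data[0])
    means = [sum(row[k] for row in data) / n for k in range(p)]
    return [[sum((row[i] - means[i]) * (row[k] - means[k]) for row in data)
             for k in range(p)]
            for i in range(p)]


def cov_array(data):
    """S_n: every w_ik divided by n (assumed divisor, see lead-in)."""
    n = len(data)
    return [[w / n for w in row] for row in cross_products(data)]


def corr_array(data):
    """R: r_ik = w_ik / sqrt(w_ii * w_kk); the divisor n cancels."""
    w = cross_products(data)
    p = len(w)
    return [[w[i][k] / sqrt(w[i][i] * w[k][k]) for k in range(p)]
            for i in range(p)]


# Hypothetical data 1: an exact nonlinear relation (second variable is the
# square of the first) that the correlation coefficient does not reveal.
curved = [[x, x * x] for x in (-2, -1, 0, 1, 2)]
r_curved = corr_array(curved)[0][1]

# Hypothetical data 2: four points with no linear association, then the
# same points plus one "wild" observation.
cloud = [[1, 2], [2, 1], [3, 2], [2, 3]]
r_without = corr_array(cloud)[0][1]
r_with = corr_array(cloud + [[10, 10]])[0][1]

print("nonlinear data:         r12 =", r_curved)
print("without wild point:     r12 =", r_without)
print("with wild point (10,10): r12 =", round(r_with, 3))
```

Quoting `r_without` and `r_with` side by side mirrors the recommendation to report $s_{ik}$ and $r_{ik}$ both with and without suspect observations.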
Chapter 1 Aspects of Multivariate Analysis

The sample mean array is denoted by $\bar{\mathbf{x}}$, the sample variance and covariance array by the capital letter $\mathbf{S}_n$, and the sample correlation array by $\mathbf{R}$. The subscript $n$ on the array $\mathbf{S}_n$ is a mnemonic device used to remind you that $n$ is employed as a divisor for the elements $s_{ik}$. The size of all of the arrays is determined by the number of variables, $p$.

The arrays $\mathbf{S}_n$ and $\mathbf{R}$ consist of $p$ rows and $p$ columns. The array $\bar{\mathbf{x}}$ is a single column with $p$ rows. The first subscript on an entry in arrays $\mathbf{S}_n$ and $\mathbf{R}$ indicates the row; the second subscript indicates the column. Since $s_{ik} = s_{ki}$ and $r_{ik} = r_{ki}$ for all $i$ and $k$, the entries in symmetric positions about the main northwest-southeast diagonals in arrays $\mathbf{S}_n$ and $\mathbf{R}$ are the same, and the arrays are said to be symmetric.

Example 1.2 (The arrays $\bar{\mathbf{x}}$, $\mathbf{S}_n$, and $\mathbf{R}$ for bivariate data) Consider the data introduced in Example 1.1. Each receipt yields a pair of measurements, total dollar sales, and number of books sold. Find the arrays $\bar{\mathbf{x}}$, $\mathbf{S}_n$, and $\mathbf{R}$.

Since there are four receipts, we have a total of four measurements (observations) on each variable.

The sample means are

$$\bar{x}_1 = \frac{1}{4} \sum_{j=1}^{4} x_{j1} = \frac{1}{4}(42 + 52 + 48 + 58) = 50$$

$$\bar{x}_2 = \frac{1}{4} \sum_{j=1}^{4} x_{j2} = \frac{1}{4}(4 + 5 + 4 + 3) = 4$$

$$\bar{\mathbf{x}} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \end{bmatrix} = \begin{bmatrix} 50 \\ 4 \end{bmatrix}$$

The sample variances and covariances are

$$s_{11} = \frac{1}{4} \sum_{j=1}^{4} (x_{j1} - \bar{x}_1)^2 = \frac{1}{4}\left((42 - 50)^2 + (52 - 50)^2 + (48 - 50)^2 + (58 - 50)^2\right) = 34$$

$$s_{22} = \frac{1}{4} \sum_{j=1}^{4} (x_{j2} - \bar{x}_2)^2 = \frac{1}{4}\left((4 - 4)^2 + (5 - 4)^2 + (4 - 4)^2 + (3 - 4)^2\right) = .5$$

$$s_{12} = \frac{1}{4} \sum_{j=1}^{4} (x_{j1} - \bar{x}_1)(x_{j2} - \bar{x}_2) = \frac{1}{4}\left((42 - 50)(4 - 4) + (52 - 50)(5 - 4) + (48 - 50)(4 - 4) + (58 - 50)(3 - 4)\right) = -1.5$$

$$s_{21} = s_{12}$$

and

$$\mathbf{S}_n = \begin{bmatrix} 34 & -1.5 \\ -1.5 & .5 \end{bmatrix}$$
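The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the variable names and the helper `s_n` are chosen here for illustration, and the last line uses the standard relation $r_{12} = s_{12}/(\sqrt{s_{11}}\sqrt{s_{22}})$ to anticipate the correlation that the example goes on to compute.

```python
from math import sqrt

# The four receipts of Example 1.1 as used in Example 1.2.
sales = [42, 52, 48, 58]  # total dollar sales, x_j1
books = [4, 5, 4, 3]      # number of books sold, x_j2
n = len(sales)


def s_n(u, v):
    """Sample covariance of u and v with divisor n; s_n(u, u) is a variance."""
    u_bar = sum(u) / n
    v_bar = sum(v) / n
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v)) / n


mean_sales = sum(sales) / n
mean_books = sum(books) / n

s11 = s_n(sales, sales)
s22 = s_n(books, books)
s12 = s_n(sales, books)

# Correlation from the variances and the covariance.
r12 = s12 / (sqrt(s11) * sqrt(s22))

print("x-bar =", [mean_sales, mean_books])
print("S_n   =", [[s11, s12], [s12, s22]])
print("r12   =", round(r12, 2))
```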
The sample correlation is

$$r_{12} = \frac{s_{12}}{\sqrt{s_{11}}\,\sqrt{s_{22}}} = \frac{-1.5}{\sqrt{34}\,\sqrt{.5}} = -.36$$

$$r_{21} = r_{12}$$

so

$$\mathbf{R} = \begin{bmatrix} 1 & -.36 \\ -.36 & 1 \end{bmatrix}$$

■

Graphical Techniques

Plots are important, but frequently neglected, aids in data analysis. Although it is impossible to simultaneously plot all the measurements made on several variables and study the configurations, plots of individual variables and plots of pairs of variables can still be very informative. Sophisticated computer programs and display equipment allow one the luxury of visually examining data in one, two, or three dimensions with relative ease. On the other hand, many valuable insights can be obtained from the data by constructing plots with paper and pencil. Simple, yet elegant and effective, methods for displaying data are available in [29]. It is good statistical practice to plot pairs of variables and visually inspect the pattern of association. Consider, then, the following seven pairs of measurements on two variables:

Variable 1 ($x_1$):   3    4    2    6    8    2    5
Variable 2 ($x_2$):   5    5.5  4    7    10   5    7.5

These data are plotted as seven points in two dimensions (each axis representing a variable) in Figure 1.1. The coordinates of the points are determined by the paired measurements: $(3, 5), (4, 5.5), \ldots, (5, 7.5)$. The resulting two-dimensional plot is known as a scatter diagram or scatter plot.

Figure 1.1 A scatter plot and marginal dot diagrams.
Also shown in Figure 1.1 are separate plots of the observed values of variable 1 and the observed values of variable 2, respectively. These plots are called (marginal) dot diagrams. They can be obtained from the original observations or by projecting the points in the scatter diagram onto each coordinate axis.

The information contained in the single-variable dot diagrams can be used to calculate the sample means $\bar{x}_1$ and $\bar{x}_2$ and the sample variances $s_{11}$ and $s_{22}$. (See Exercise 1.1.) The scatter diagram indicates the orientation of the points, and their coordinates can be used to calculate the sample covariance $s_{12}$. In the scatter diagram of Figure 1.1, large values of $x_1$ occur with large values of $x_2$ and small values of $x_1$ with small values of $x_2$. Hence, $s_{12}$ will be positive.

Dot diagrams and scatter plots contain different kinds of information. The information in the marginal dot diagrams is not sufficient for constructing the scatter plot. As an illustration, suppose the data preceding Figure 1.1 had been paired differently, so that the measurements on the variables $x_1$ and $x_2$ were as follows:

Variable 1 ($x_1$):   5    4    6    2    2    8    3
Variable 2 ($x_2$):   5    5.5  4    7    10   5    7.5

(We have simply rearranged the values of variable 1.) The scatter and dot diagrams for the "new" data are shown in Figure 1.2. Comparing Figures 1.1 and 1.2, we find that the marginal dot diagrams are the same, but that the scatter diagrams are decidedly different. In Figure 1.2, large values of $x_1$ are paired with small values of $x_2$ and small values of $x_1$ with large values of $x_2$. Consequently, the descriptive statistics for the individual variables $\bar{x}_1$, $\bar{x}_2$, $s_{11}$, and $s_{22}$ remain unchanged, but the sample covariance $s_{12}$, which measures the association between pairs of variables, will now be negative.

The different orientations of the data in Figures 1.1 and 1.2 are not discernible from the marginal dot diagrams alone. At the same time, the fact that the marginal dot diagrams are the same in the two cases is not immediately apparent from the scatter plots. The two types of graphical procedures complement one another; they are not competitors.

The next two examples further illustrate the information that can be conveyed by a graphic display.

Figure 1.2 Scatter plot and dot diagrams for rearranged data.
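The comparison of the two pairings can also be checked numerically. The Python sketch below uses the seven pairs of measurements given in the text; the helper names `mean` and `cov_n` are chosen here for illustration, and the covariance is assumed to use the divisor $n$, as in $\mathbf{S}_n$.

```python
def mean(values):
    return sum(values) / len(values)


def cov_n(u, v):
    """Sample covariance with divisor n; cov_n(u, u) is the sample variance."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v)) / len(u)


x2 = [5, 5.5, 4, 7, 10, 5, 7.5]
x1_original = [3, 4, 2, 6, 8, 2, 5]    # pairing plotted in Figure 1.1
x1_rearranged = [5, 4, 6, 2, 2, 8, 3]  # pairing plotted in Figure 1.2

# The marginal dot diagram of x1 depends only on the values, not their order.
same_marginals = sorted(x1_original) == sorted(x1_rearranged)

s11_original = cov_n(x1_original, x1_original)
s11_rearranged = cov_n(x1_rearranged, x1_rearranged)
s12_original = cov_n(x1_original, x2)
s12_rearranged = cov_n(x1_rearranged, x2)

print("same x1 values:", same_marginals)
print("x1 means:", mean(x1_original), mean(x1_rearranged))
print("s11:", round(s11_original, 3), round(s11_rearranged, 3))
print("s12 original pairing:  ", round(s12_original, 3))
print("s12 rearranged pairing:", round(s12_rearranged, 3))
```

The means and variances agree across the two pairings, while the sign of the covariance flips, which is exactly the information the marginal dot diagrams cannot show.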
Example 1.3 (The effect of unusual observations on sample correlations) Some financial data representing jobs and productivity for the 16 largest publishing firms appeared in an article in Forbes magazine on April 30, 1990. The data for the pair of variables $x_1$ = employees (jobs) and $x_2$ = profits per employee (productivity) are graphed in Figure 1.3. We have labeled two "unusual" observations. Dun & Bradstreet is the largest firm in terms of number of employees, but is "typical" in terms of profits per employee. Time Warner has a "typical" number of employees, but comparatively small (negative) profits per employee.

Figure 1.3 Profits per employee and number of employees for 16 publishing firms. (Horizontal axis: employees, in thousands; the points for Dun & Bradstreet and Time Warner are labeled.)

The sample correlation coefficient computed from the values of $x_1$ and $x_2$ is

$$r_{12} = \begin{cases} -.39 & \text{for all 16 firms} \\ -.56 & \text{for all firms but Dun \& Bradstreet} \\ -.39 & \text{for all firms but Time Warner} \\ -.50 & \text{for all firms but Dun \& Bradstreet and Time Warner} \end{cases}$$

It is clear that atypical observations can have a considerable effect on the sample correlation coefficient. ■

Example 1.4 (A scatter plot for baseball data) In a July 17, 1978, article on money in sports, Sports Illustrated magazine provided data on $x_1$ = player payroll for National League East baseball teams.

We have added data on $x_2$ = won-lost percentage for 1977. The results are given in Table 1.1.

The scatter plot in Figure 1.4 supports the claim that a championship team can be bought. Of course, this cause-effect relationship cannot be substantiated, because the experiment did not include a random assignment of payrolls. Thus, statistics cannot answer the question: Could the Mets have won with $4 million to spend on player salaries?