Chapter 1 Aspects of Multivariate Analysis
The Organization of Data

Table 1.1 1977 Salary and Final Record for the National League East

Team                    x1 = player payroll    x2 = won-lost percentage
Philadelphia Phillies   3,497,900              .623
Pittsburgh Pirates      2,485,475              .593
St. Louis Cardinals     1,782,875              .512
Chicago Cubs            1,725,450              .500
Montreal Expos          1,645,575              .463
New York Mets           1,469,800              .395

[Figure 1.4 Salaries and won-lost percentage from Table 1.1. Horizontal axis: player payroll in millions of dollars.]

To construct the scatter plot in Figure 1.4, we have regarded the six paired observations in Table 1.1 as the coordinates of six points in two-dimensional space. The figure allows us to examine visually the grouping of teams with respect to the variables total payroll and won-lost percentage. ■

Example 1.5 (Multiple scatter plots for paper strength measurements) Paper is manufactured in continuous sheets several feet wide. Because of the orientation of fibers within the paper, it has a different strength when measured in the direction produced by the machine than when measured across, or at right angles to, the machine direction. Table 1.2 shows the measured values of

    $x_1$ = density (grams/cubic centimeter)
    $x_2$ = strength (pounds) in the machine direction
    $x_3$ = strength (pounds) in the cross direction

A novel graphic presentation of these data appears in Figure 1.5. The scatter plots are arranged as the off-diagonal elements of a covariance array and box plots as the diagonal elements. The latter are on a different scale with this software, so we use only the overall shape to provide information on symmetry and possible outliers for each individual characteristic. The scatter plots can be inspected for patterns and unusual observations. In Figure 1.5, there is one unusual observation: the density of specimen 25. Some of the scatter plots have patterns suggesting that there are two separate clumps of observations. These scatter plot arrays are further pursued in our discussion of new software graphics in the next section. ■

Table 1.2 Paper-Quality Measurements

                                 Strength (pounds)
Specimen   Density (g/cm^3)   Machine direction   Cross direction
 1         .801               121.41              70.42
 2         .824               127.70              72.47
 3         .841               129.20              78.20
 4         .816               131.80              74.89
 5         .840               135.10              71.21
 6         .842               131.50              78.39
 7         .820               126.70              69.02
 8         .802               115.10              73.10
 9         .828               130.80              79.28
10         .819               124.60              76.48
11         .826               118.31              70.25
12         .802               114.20              72.88
13         .810               120.30              68.23
14         .802               115.70              68.12
15         .832               117.51              71.62
16         .796               109.81              53.10
17         .759               109.10              50.85
18         .770               115.10              51.68
19         .759               118.31              50.60
20         .772               112.60              53.51
21         .806               116.20              56.53
22         .803               118.00              70.70
23         .845               131.00              74.35
24         .822               125.70              68.29
25         .971               126.10              72.10
26         .816               125.80              70.64
27         .836               125.50              76.33
28         .815               127.80              76.75
29         .822               130.50              80.33
30         .822               127.90              75.68
31         .843               123.90              78.54
32         .824               124.10              71.91
33         .788               120.80              68.22
34         .782               107.40              54.42
35         .795               120.70              70.41
36         .805               121.91              73.68
37         .836               122.31              74.93
38         .788               110.60              53.52
39         .772               103.51              48.93
40         .776               110.71              53.67
41         .758               113.80              52.42

Source: Data courtesy of SONOCO Products Company.

[Figure 1.5 Scatter plots and boxplots of paper-quality data from Table 1.2. Panels: Density, Strength (MD), Strength (CD). Box plot labels: Density Max 0.97, Med 0.81, Min 0.76; Strength (MD) Max 135.1, Med 121.4, Min 103.5; Strength (CD) Max 80.33, Med 70.70, Min 48.93.]

In the general multiresponse situation, p variables are simultaneously recorded on n items. Scatter plots should be made for pairs of important variables and, if the task is not too great to warrant the effort, for all pairs.

Limited as we are to a three-dimensional world, we cannot always picture an entire set of data. However, two further geometric representations of the data provide an important conceptual framework for viewing multivariable statistical methods. In cases where it is possible to capture the essence of the data in three dimensions, these representations can actually be graphed.
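As a rough numerical companion to the scatter plot and box plot array for the paper-quality data, the following Python sketch lists every pair of variables that would get its own scatter plot and computes the minimum, median, and maximum of each variable. The variable names, the helper functions `summary` and `flag_unusual`, and the 1.5 x IQR fence used to flag unusual values are assumptions made for illustration; the text identifies specimen 25 by visual inspection, not by this rule.

```python
from itertools import combinations
from statistics import median, quantiles

# Table 1.2, specimens 1 to 41, in order.
density = [
    .801, .824, .841, .816, .840, .842, .820, .802, .828, .819, .826,
    .802, .810, .802, .832, .796, .759, .770, .759, .772, .806, .803,
    .845, .822, .971, .816, .836, .815, .822, .822, .843, .824, .788,
    .782, .795, .805, .836, .788, .772, .776, .758,
]
strength_md = [
    121.41, 127.70, 129.20, 131.80, 135.10, 131.50, 126.70, 115.10,
    130.80, 124.60, 118.31, 114.20, 120.30, 115.70, 117.51, 109.81,
    109.10, 115.10, 118.31, 112.60, 116.20, 118.00, 131.00, 125.70,
    126.10, 125.80, 125.50, 127.80, 130.50, 127.90, 123.90, 124.10,
    120.80, 107.40, 120.70, 121.91, 122.31, 110.60, 103.51, 110.71,
    113.80,
]
strength_cd = [
    70.42, 72.47, 78.20, 74.89, 71.21, 78.39, 69.02, 73.10, 79.28,
    76.48, 70.25, 72.88, 68.23, 68.12, 71.62, 53.10, 50.85, 51.68,
    50.60, 53.51, 56.53, 70.70, 74.35, 68.29, 72.10, 70.64, 76.33,
    76.75, 80.33, 75.68, 78.54, 71.91, 68.22, 54.42, 70.41, 73.68,
    74.93, 53.52, 48.93, 53.67, 52.42,
]
variables = {
    "density": density,
    "strength_md": strength_md,
    "strength_cd": strength_cd,
}

# With p variables there are p(p-1)/2 distinct pairs to plot.
pairs = list(combinations(variables, 2))


def summary(values):
    """Return (min, median, max), the three labels on each box plot."""
    return min(values), median(values), max(values)


def flag_unusual(values, k=1.5):
    """Return 1-based specimen numbers outside the k*IQR fences.

    The fence rule is an illustrative assumption, not the book's method.
    """
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [j for j, v in enumerate(values, start=1) if v < low or v > high]


summaries = {name: summary(v) for name, v in variables.items()}
flagged = {name: flag_unusual(v) for name, v in variables.items()}

print("pairs to plot:", pairs)
for name in variables:
    print(name, summaries[name], "unusual specimens:", flagged[name])
```

Under this particular fence rule, only the density of specimen 25 is flagged, which agrees with the unusual observation noted for Figure 1.5. A different rule or quartile convention could flag other points.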
n Points in p Dimensions (p-Dimensional Scatter Plot). Consider the natural extension of the scatter plot to p dimensions, where the p measurements

    $(x_{j1}, x_{j2}, \ldots, x_{jp})$

on the jth item represent the coordinates of a point in p-dimensional space. The coordinate axes are taken to correspond to the variables, so that the jth point is $x_{j1}$ units along the first axis, $x_{j2}$ units along the second, ..., $x_{jp}$ units along the pth axis. The resulting plot with n points not only will exhibit the overall pattern of variability, but also will show similarities (and differences) among the n items. Groupings of items will manifest themselves in this representation.

The next example illustrates a three-dimensional scatter plot.

Example 1.6 (Looking for lower-dimensional structure) A zoologist obtained measurements on n = 25 lizards known scientifically as Cophosaurus texanus. The weight, or mass, is given in grams while the snout-vent length (SVL) and hind limb span (HLS) are given in millimeters. The data are displayed in Table 1.3.

Although there are three size measurements, we can ask whether or not most of the variation is primarily restricted to two dimensions or even to one dimension.

To help answer questions regarding reduced dimensionality, we construct the three-dimensional scatter plot in Figure 1.6. Clearly most of the variation is scatter about a one-dimensional straight line. Knowing the position on a line along the major axes of the cloud of points would be almost as good as knowing the three measurements Mass, SVL, and HLS.

However, this kind of analysis can be misleading if one variable has a much larger variance than the others. Consequently, we first calculate the standardized values, $z_{jk} = (x_{jk} - \bar{x}_k)/\sqrt{s_{kk}}$, so the variables contribute equally to the variation in the scatter plot. Figure 1.7 gives the three-dimensional scatter plot for the standardized variables. Most of the variation can be explained by a single variable determined by a line through the cloud of points. ■

Table 1.3 Lizard Size Data

Lizard   Mass (g)   SVL (mm)   HLS (mm)     Lizard   Mass (g)   SVL (mm)   HLS (mm)
 1        5.526     59.0       113.5        14       10.067     73.0       136.5
 2       10.401     75.0       142.0        15       10.091     73.0       135.5
 3        9.213     69.0       124.0        16       10.888     77.0       139.0
 4        8.953     67.5       125.0        17        7.610     61.5       118.0
 5        7.063     62.0       129.5        18        7.733     66.5       133.5
 6        6.610     62.0       123.0        19       12.015     79.5       150.0
 7       11.273     74.0       140.0        20       10.049     74.0       137.0
 8        2.447     47.0        97.0        21        5.149     59.5       116.0
 9       15.493     86.5       162.0        22        9.158     68.0       123.0
10        9.004     69.0       126.5        23       12.132     75.0       141.0
11        8.199     70.5       136.0        24        6.978     66.5       117.0
12        6.601     64.5       116.0        25        6.890     63.0       117.0
13        7.622     67.5       135.0

Source: Data courtesy of Kevin E. Bonine.

[Figure 1.6 3D scatter plot of lizard data from Table 1.3. Axes: Mass, SVL, HLS.]

[Figure 1.7 3D scatter plot of standardized lizard data. Axes: standardized Mass, SVL, and HLS.]

A three-dimensional scatter plot can often reveal group structure.

Example 1.7 (Looking for group structure in three dimensions) Referring to Example 1.6, it is interesting to see if male and female lizards occupy different parts of the three-dimensional space containing the size data. The gender, by row, for the lizard data in Table 1.3 are

    f m f f m f m f m f m f m
    m m m f m m m f f m f f
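The standardization step of Example 1.6 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' computation: it assumes that $s_{kk}$ is the sample variance with divisor n (a divisor of n - 1 would rescale every z value by the same constant), and the names `lizards`, `z`, and `corr` are hypothetical. The gender labels are not used here, because their assignment to individual lizards is not spelled out beyond "by row".

```python
from math import sqrt
from statistics import fmean

# Table 1.3, lizards 1 to 25: (mass in g, SVL in mm, HLS in mm).
lizards = [
    (5.526, 59.0, 113.5), (10.401, 75.0, 142.0), (9.213, 69.0, 124.0),
    (8.953, 67.5, 125.0), (7.063, 62.0, 129.5), (6.610, 62.0, 123.0),
    (11.273, 74.0, 140.0), (2.447, 47.0, 97.0), (15.493, 86.5, 162.0),
    (9.004, 69.0, 126.5), (8.199, 70.5, 136.0), (6.601, 64.5, 116.0),
    (7.622, 67.5, 135.0), (10.067, 73.0, 136.5), (10.091, 73.0, 135.5),
    (10.888, 77.0, 139.0), (7.610, 61.5, 118.0), (7.733, 66.5, 133.5),
    (12.015, 79.5, 150.0), (10.049, 74.0, 137.0), (5.149, 59.5, 116.0),
    (9.158, 68.0, 123.0), (12.132, 75.0, 141.0), (6.978, 66.5, 117.0),
    (6.890, 63.0, 117.0),
]
names = ("Mass", "SVL", "HLS")
n = len(lizards)

# Column-wise means and variances (assumption: divisor n for s_kk).
columns = list(zip(*lizards))
means = [fmean(col) for col in columns]
variances = [
    sum((x - m) ** 2 for x in col) / n for col, m in zip(columns, means)
]

# z_jk = (x_jk - mean_k) / sqrt(s_kk) for every lizard j and variable k.
z = [
    [(x - m) / sqrt(s) for x, m, s in zip(row, means, variances)]
    for row in lizards
]


def corr(i, k):
    """Sample correlation of variables i and k as the mean product of z values."""
    return sum(row[i] * row[k] for row in z) / n


for k, name in enumerate(names):
    print(f"{name}: mean={means[k]:.3f}, variance={variances[k]:.3f}")
for i in range(3):
    for k in range(i + 1, 3):
        print(f"corr({names[i]}, {names[k]}) = {corr(i, k):.3f}")
```

After standardization each column has mean 0 and variance 1, so no single measurement dominates the scatter simply because of its units. With the divisor-n convention, the average product of two standardized columns is their sample correlation, which gives a quick numerical check on how closely the cloud of points follows a line.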