CHAPTER 3. DATA DESCRIPTION

Example 3.9. U.S. State Facts and Features. The U.S. Census Bureau, part of the U.S. Department of Commerce, releases all sorts of information in the Statistical Abstract of the United States, and the state.region data lists each of the 50 states and the region to which it belongs, be it Northeast, South, North Central, or West. See ?state.region.

> str(state.region)
 Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
> state.region[1:5]
[1] South West  West  South West
Levels: Northeast South North Central West

The str output shows that state.region is already stored internally as a factor, and it lists a couple of the factor levels. To see all of the levels we printed the first five entries of the vector with the second command.

Displaying Qualitative Data

Tables

One of the best ways to summarize qualitative data is with a table of the data values. We may count frequencies with the table function or list proportions with the prop.table function (whose input is a frequency table). In the R Commander you can do it with Statistics ⊲ Frequency Distribution.... Alternatively, to look at tables for all factors in the Active data set you can do Statistics ⊲ Summaries ⊲ Active Dataset.

> Tbl <- table(state.division)
> Tbl                    # frequencies
state.division
       New England    Middle Atlantic     South Atlantic
                 6                  3                  8
East South Central West South Central East North Central
                 4                  4                  5
West North Central           Mountain            Pacific
                 7                  8                  5
> Tbl/sum(Tbl)           # relative frequencies
state.division
       New England    Middle Atlantic     South Atlantic
              0.12               0.06               0.16
East South Central West South Central East North Central
              0.08               0.08               0.10
West North Central           Mountain            Pacific
              0.14               0.16               0.10
> prop.table(Tbl)        # same thing
state.division
       New England    Middle Atlantic     South Atlantic
              0.12               0.06               0.16
East South Central West South Central East North Central
              0.08               0.08               0.10
West North Central           Mountain            Pacific
              0.14               0.16               0.10
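As a small supplementary sketch (not part of the original example), the levels of a factor can also be listed directly with levels, and the relationship between table and prop.table can be checked numerically. The object name props is chosen here only for illustration.

```r
# state.region and state.division ship with base R (package 'datasets').
levels(state.region)     # all four level names, in their stored order
nlevels(state.region)    # how many levels there are

Tbl <- table(state.division)
props <- prop.table(Tbl)   # 'props' is an illustrative name

# prop.table(Tbl) divides each count by the total, so it agrees with
# Tbl / sum(Tbl) and the proportions add up to 1.
all.equal(as.vector(props), as.vector(Tbl / sum(Tbl)))
sum(props)

# The same table expressed as percentages.
round(100 * props, 1)
```

Printing a slice of the vector, as in the example above, shows the levels as a side effect; calling levels asks for them explicitly.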
3.1. TYPES OF DATA

Figure 3.1.5: Bar graphs of the state.region data. The left graph is a frequency barplot made with table and the right is a relative frequency barplot made with prop.table.

Bar Graphs

A bar graph is the analogue of a histogram for categorical data. A bar is displayed for each level of a factor, with the heights of the bars proportional to the frequencies of observations falling in the respective categories. A disadvantage of bar graphs is that the bars appear in the order of the factor levels, which is alphabetical by default when a factor is created from character data, and this may sometimes obscure patterns in the display.

Example 3.10. U.S. State Facts and Features. The state.region data lists each of the 50 states and the region to which it belongs, be it Northeast, South, North Central, or West. See ?state.region. It is already stored internally as a factor. We make a bar graph with the barplot function:

> barplot(table(state.region), cex.names = 0.5)
> barplot(prop.table(table(state.region)), cex.names = 0.5)

See Figure 3.1.5. The display on the left is a frequency bar graph because the y axis shows counts, while the display on the right is a relative frequency bar graph. The only difference between the two is the scale. Looking at the graph we see that the largest share of the fifty states is in the South, followed by West, North Central, and finally Northeast. Over 30% of the states are in the South.

Notice the cex.names argument that we used above. It shrinks the names on the x axis to half their default size, which makes them easier to read. See ?par for a detailed list of additional plot parameters.
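The ordering issue mentioned above can be worked around by sorting the frequency table before plotting. The following is a minimal sketch, assuming we want the bars from most to least frequent; the names counts and ordered_counts are illustrative and do not appear in the original example.

```r
counts <- table(state.region)
names(counts)   # bars would be drawn in this (factor level) order

# Reorder the table from most to least frequent before plotting.
ordered_counts <- sort(counts, decreasing = TRUE)
ordered_counts

# Same barplot call as in the example, applied to the sorted table.
barplot(ordered_counts, cex.names = 0.5)

# Proportion of states in the most common region.
max(prop.table(counts))
```

The sorted table makes the ranking South, West, North Central, Northeast visible at a glance, which is the same idea that the Pareto diagram builds on.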
Pareto Diagrams

A Pareto diagram is a lot like a bar graph except the bars are rearranged such that they decrease in height going from left to right. The rearrangement is handy because it can visually reveal structure (if any) in how fast the bars decrease; this is much more difficult when the bars are jumbled.

Example 3.11. U.S. State Facts and Features. The state.division data record the division (New England, Middle Atlantic, South Atlantic, East South Central, West South Central, East North Central, West North Central, Mountain, and Pacific) of the fifty states. We can make a Pareto diagram with either the RcmdrPlugin.IPSUR package or with the pareto.chart function from the qcc package [77]. See Figure 3.1.6. The code follows.

> library(qcc)
> pareto.chart(table(state.division), ylab = "Frequency")

Dot Charts

These are a lot like a bar graph that has been turned on its side with the bars replaced by dots on horizontal lines. They do not convey any more (or less) information than the associated bar graph, but the strength lies in the economy of the display. Dot charts are so compact that it is easy to graph very complicated multi-variable interactions together in one graph. See Section 3.6. We will give an example here using the same data as above for comparison. The graph was produced by the following code.

> x <- table(state.region)
> dotchart(as.vector(x), labels = names(x))

See Figure 3.1.7. Compare it to Figure 3.1.5.

Pie Graphs

These can be done with R and the R Commander, but they have fallen out of favor in recent years because researchers have determined that while the human eye is good at judging linear measures, it is notoriously bad at judging relative areas (such as those displayed by a pie graph). Pie charts are consequently a very bad way of displaying information. A bar chart or dot chart is a preferable way of displaying qualitative data. See ?pie for more information.
We are not going to do any examples of a pie graph and discourage their use elsewhere.

3.1.3 Logical Data

There is another type of information recognized by R which does not fall into the above categories. The value is either TRUE or FALSE (note that R treats TRUE as 1 and FALSE as 0 in arithmetic). Here is an example of a logical vector:

> x <- 5:9
> y <- (x < 7.3)
> y
[1]  TRUE  TRUE  TRUE FALSE FALSE

Many functions in R have options that the user may or may not want to activate in the function call. For example, the stem.leaf function has the depths argument, which is TRUE by default. We saw in Section 3.1.1 how to turn the option off: simply enter stem.leaf(x, depths = FALSE) and the depths will not be shown on the display.

We can swap TRUE with FALSE with the exclamation point (!).
Package 'qcc', version 2.0.1
Type 'citation("qcc")' for citing this R package in publications.

Pareto chart analysis for table(state.division)
                     Frequency Cum.Freq. Percentage Cum.Percent.
  Mountain                   8         8         16           16
  South Atlantic             8        16         16           32
  West North Central         7        23         14           46
  New England                6        29         12           58
  Pacific                    5        34         10           68
  East North Central         5        39         10           78
  West South Central         4        43          8           86
  East South Central         4        47          8           94
  Middle Atlantic            3        50          6          100

Figure 3.1.6: Pareto chart of the state.division data (plot title "Pareto Chart for table(state.division)"; left axis Frequency, right axis Cumulative Percentage)
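The numbers in the Pareto table can be reproduced with base R alone, which may help clarify what the chart displays. This is a sketch under the assumption that we only want the summary columns, not the plot; it does not use qcc, the data frame name pareto and its column names are illustrative, and divisions with tied counts may be listed in a different order than in the qcc output.

```r
# Sort the division counts from largest to smallest.
freq <- sort(table(state.division), decreasing = TRUE)

# Percentage of the 50 states in each division.
pct <- 100 * freq / sum(freq)

# Assemble the four columns of the Pareto summary.
pareto <- data.frame(
  Frequency   = as.vector(freq),
  Cum.Freq    = cumsum(as.vector(freq)),
  Percentage  = as.vector(pct),
  Cum.Percent = cumsum(as.vector(pct)),
  row.names   = names(freq)
)
pareto
```

The cumulative columns are running totals of the sorted counts and percentages, which is what the line drawn over the bars in a Pareto chart traces out.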
Figure 3.1.7: Dot chart of the state.region data

> !y
[1] FALSE FALSE FALSE  TRUE  TRUE

3.1.4 Missing Data

Missing data are a persistent and prevalent problem in many statistical analyses, especially those associated with the social sciences. R reserves the special symbol NA to represent missing data.

Ordinary arithmetic with NA values gives NA (addition, subtraction, etc.), and applying a function to a vector that has an NA in it will usually give an NA.

> x <- c(3, 7, NA, 4, 7)
> y <- c(5, NA, 1, 2, 2)
> x + y
[1]  8 NA NA  6  9

Some functions have an na.rm argument which, when TRUE, will ignore missing data as if it were not there (such as mean, var, sd, IQR, mad, ...).

> sum(x)
[1] NA
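To illustrate the na.rm argument described above, here is a brief sketch using the same vector x. It also uses is.na, which is not introduced in the text above but is a standard base R function that returns a logical vector marking the missing entries.

```r
x <- c(3, 7, NA, 4, 7)

sum(x)                   # NA, because one entry is missing
sum(x, na.rm = TRUE)     # 21: the NA is dropped before summing
mean(x, na.rm = TRUE)    # 5.25: average of the four observed values

# is.na() flags the missing entries; because TRUE counts as 1,
# summing the flags counts how many values are missing.
is.na(x)
sum(is.na(x))            # 1
```

Note that na.rm = TRUE silently discards the missing entries, so it is worth checking how many there are before relying on the result.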