3.1. TYPES OF DATA

[Figure 3.1.2: (Relative) frequency histograms of the precip data. Left panel: Frequency against precip; right panel: Density against precip.]

Histogram. These are typically used for continuous data. A histogram is constructed by first deciding on a set of classes, or bins, which partition the real line into a set of boxes into which the data values fall. Then vertical bars are drawn over the bins with height proportional to the number of observations that fell into the bin. These are one of the most common summary displays, and they are often misidentified as “Bar Graphs” (see below). The scale on the y axis can be frequency, percentage, or density (relative frequency). The term histogram was coined by Karl Pearson in 1891, see [66].

Example 3.4. Annual Precipitation in US Cities. We are going to take another look at the precip data that we investigated earlier. The strip chart in Figure 3.1.1 suggested a loosely balanced distribution; let us now look to see what a histogram says.

There are many ways to plot histograms in R, and one of the easiest is with the hist function. The following code produces the plots in Figure 3.1.2.

> hist(precip, main = "")
> hist(precip, freq = FALSE, main = "")

Notice the argument main = "", which suppresses the main title from being displayed; it would have said “Histogram of precip” otherwise. The plot on the left is a frequency histogram (the default), and the plot on the right is a relative frequency histogram (freq = FALSE), in which the bar heights are densities, so that the total area of the bars equals one.

Please be careful regarding the biggest weakness of histograms: the graph obtained strongly depends on the bins chosen. Choose another set of bins, and you will get a different histogram.
[Figure 3.1.3: More histograms of the precip data. Both panels show Frequency against precip.]

Moreover, there are no definitive criteria by which bins should be defined; the best choice for a given data set is the one which illuminates the data set’s underlying structure (if any). Luckily for us there are algorithms to automatically choose bins that are likely to display well, and more often than not the default bins do a good job. This is not always the case, however, and a responsible statistician will investigate many bin choices to test the stability of the display.

Example 3.5. Recall that the strip chart in Figure 3.1.1 suggested a relatively balanced shape to the precip data distribution. Watch what happens when we change the bins slightly (with the breaks argument to hist). See Figure 3.1.3, which was produced by the following code.

> hist(precip, breaks = 10, main = "")
> hist(precip, breaks = 200, main = "")

The leftmost graph (with breaks = 10) shows that the distribution is not balanced at all. There are two humps: a big one in the middle and a smaller one to the left. Graphs like this often indicate some underlying group structure to the data; we could now investigate whether the cities for which rainfall was measured were similar in some way, with respect to geographic region, for example.

The rightmost graph in Figure 3.1.3 shows what happens when the number of bins is too large: the histogram is too grainy and hides the rounded appearance of the earlier histograms. If we were to continue increasing the number of bins we would eventually get all observed bins to have exactly one element, which is nothing more than a glorified strip chart.
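One way to probe the stability of the display is sketched below. It compares the number of bins suggested by three automatic rules available in base R (nclass.Sturges, nclass.scott, and nclass.FD from the grDevices package) with the number of bins hist actually uses for several requested values of breaks. The particular requested values are an arbitrary choice made for illustration. With plot = FALSE, hist returns the computed bins and counts without drawing anything.

```r
# Numbers of bins suggested for precip by three automatic rules
# (all three functions live in base R's grDevices package).
suggested <- c(Sturges = nclass.Sturges(precip),
               Scott   = nclass.scott(precip),
               FD      = nclass.FD(precip))
print(suggested)

# Compute histograms without plotting and record how many bins were used.
# A single number given to 'breaks' is only a suggestion: hist() hands it
# to pretty(), so the realized count can differ from the requested one.
requested <- c(5, 10, 20, 50, 200)   # arbitrary illustrative choices
realized <- sapply(requested, function(b) {
  h <- hist(precip, breaks = b, plot = FALSE)
  length(h$counts)
})
print(data.frame(requested, realized))

# Sanity check: every observation falls into exactly one bin.
h10 <- hist(precip, breaks = 10, plot = FALSE)
stopifnot(sum(h10$counts) == length(precip))
```

Comparing the requested and realized columns shows how much control a single number for breaks really gives; to force exact bin edges one can instead pass a vector of break points to breaks.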
Stemplots (more to be said in Section 3.4). Stemplots have two basic parts: stems and leaves. The final digit of the data values is taken to be a leaf, and the leading digit(s) is (are) taken to be stems. We draw a vertical line, and to the left of the line we list the stems. To the right of the line, we list the leaves beside their corresponding stem. There will typically be several leaves for each stem, in which case the leaves accumulate to the right. It is sometimes necessary to round the data values, especially for larger data sets.

Example 3.6. UKDriverDeaths is a time series that contains the total number of car drivers killed or seriously injured in Great Britain monthly from Jan 1969 to Dec 1984. See ?UKDriverDeaths. Compulsory seat belt use was introduced on January 31, 1983. We construct a stem and leaf diagram in R with the stem.leaf function from the aplpack package [92].

> library(aplpack)
> stem.leaf(UKDriverDeaths, depth = FALSE)
1 | 2: represents 120
 leaf unit: 10
            n: 192
   10 | 57
   11 | 136678
   12 | 123889
   13 | 0255666888899
   14 | 00001222344444555556667788889
   15 | 0000111112222223444455555566677779
   16 | 01222333444445555555678888889
   17 | 11233344566667799
   18 | 00011235568
   19 | 01234455667799
   20 | 0000113557788899
   21 | 145599
   22 | 013467
   23 | 9
   24 | 7
HI: 2654

The display shows a more or less balanced mound-shaped distribution, with one or maybe two humps, a big one and a smaller one just to its right. Note that the data have been rounded to the tens place so that each datum gets only one leaf to the right of the dividing line.

Notice that the depths have been suppressed. To learn more about this option and many others, see Section 3.4. Unlike a histogram, the original data values may be recovered from the stemplot display (modulo the rounding); that is, starting from the top and working down we can read off the data values 1050, 1070, 1110, 1130, etc.

Index plot. Index plots are produced with the plot function. They are good for plotting data which are ordered, for example, when the data are measured over time. That is, the first observation was measured at time 1, the second at time 2, etc. It is a two-dimensional plot, in which the index (or time) is the x variable and the measured value is the y variable. There are several plotting methods for index plots, and we discuss two of them:
spikes: draws a vertical line from the x-axis to the observation height (type = "h").
points: plots a simple point at the observation height (type = "p").

Example 3.7. Level of Lake Huron 1875-1972. Brockwell and Davis [11] give the annual measurements of the level (in feet) of Lake Huron from 1875–1972. The data are stored in the time series LakeHuron. See ?LakeHuron. Figure 3.1.4 was produced with the following code:

> plot(LakeHuron, type = "h")
> plot(LakeHuron, type = "p")

The plots show an overall decreasing trend to the observations, and there appears to be some seasonal variation that increases over time.

3.1.2 Qualitative Data, Categorical Data, and Factors

Qualitative data are simply any type of data that are not numerical, or do not represent numerical quantities. Examples of qualitative variables include a subject’s name, gender, race/ethnicity, political party, socioeconomic status, class rank, driver’s license number, and social security number (SSN).

Please bear in mind that some data look to be quantitative but are not, because they do not represent numerical quantities and do not obey mathematical rules. For example, a person’s shoe size is typically written with numbers: 8, or 9, or 12, or 12½. Shoe size is not quantitative, however, because if we take a size 8 and combine with a size 9 we do not get a size 17.

Some qualitative data serve merely to identify the observation (such as a subject’s name, driver’s license number, or SSN). This type of data does not usually play much of a role in statistics. But other qualitative variables serve to subdivide the data set into categories; we call these factors. In the above examples, gender, race, political party, and socioeconomic status would be considered factors (shoe size would be another one). The possible values of a factor are called its levels. For instance, the factor gender would have two levels, namely, male and female. Socioeconomic status typically has three levels: high, middle, and low.

Factors may be of two types: nominal and ordinal. Nominal factors have levels that correspond to names of the categories, with no implied ordering. Examples of nominal factors would be hair color, gender, race, or political party. There is no natural ordering to “Democrat” and “Republican”; the categories are just names associated with different groups of people.

In contrast, ordinal factors have some sort of ordered structure to the underlying factor levels. For instance, socioeconomic status would be an ordinal categorical variable because the levels correspond to ranks associated with income, education, and occupation. Another example of ordinal categorical data would be class rank.

Factors have special status in R. They are represented internally by numbers, but even when they are written numerically their values do not convey any numeric meaning or obey any mathematical rules (that is, Stage III cancer is not Stage I cancer + Stage II cancer).

Example 3.8. The state.abb vector gives the two-letter postal abbreviations for all 50 states.

> str(state.abb)
 chr [1:50] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" ...

These would be ID data. The state.name vector lists all of the complete names and those data would also be ID.
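The distinction between nominal and ordinal factors, and the remark that factors are stored internally as numbers, can be sketched in R as follows. The vectors party and ses are made-up values used only for illustration and do not come from any data set in the text; factor, levels, is.ordered, and is.factor are base R functions.

```r
# Hypothetical data for illustration only (not from the text).
party <- c("Democrat", "Republican", "Republican", "Democrat")
ses   <- c("low", "high", "middle", "low", "middle")

# A nominal factor: the levels are just names (sorted alphabetically
# by default) and carry no ordering.
party_f <- factor(party)
print(levels(party_f))
print(is.ordered(party_f))

# An ordinal factor: list the levels from lowest to highest and
# set ordered = TRUE so that comparisons such as < make sense.
ses_f <- factor(ses, levels = c("low", "middle", "high"), ordered = TRUE)
print(ses_f[1] < ses_f[2])   # compares "low" with "high"

# Internally a factor is a vector of integer codes that index its levels.
print(as.integer(ses_f))

# ID data such as state.abb are stored as plain character strings,
# not as a factor.
print(is.factor(state.abb))
```

Attempting a comparison such as party_f[1] < party_f[2] on the nominal factor yields NA with a warning, which is R's way of saying that the ordering is not meaningful.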
[Figure 3.1.4: Index plots of the LakeHuron data. Both panels plot LakeHuron against Time.]