Course teaching resource for "Probability and Mathematical Statistics" (e-book): Introduction to Probability and Statistics with R (G. Jay Kerns, First Edition)

Document contents (about 365 pages)

Chapter 3. Data Description

3.1 Types of Data

Stemplots (more to be said in Section 3.4). Stemplots have two basic parts: stems and leaves. The final digit of the data values is taken to be a leaf, and the leading digit(s) is (are) taken to be stems. We draw a vertical line, and to the left of the line we list the stems. To the right of the line, we list the leaves beside their corresponding stem. There will typically be several leaves for each stem, in which case the leaves accumulate to the right. It is sometimes necessary to round the data values, especially for larger data sets.

Example 3.6. UKDriverDeaths is a time series that contains the total car drivers killed or seriously injured in Great Britain monthly from Jan 1969 to Dec 1984. See ?UKDriverDeaths. Compulsory seat belt use was introduced on January 31, 1983. We construct a stem and leaf diagram in R with the stem.leaf function from the aplpack package [92].

    > library(aplpack)
    > stem.leaf(UKDriverDeaths, depth = FALSE)
    1 | 2: represents 120
     leaf unit: 10
                n: 192
       10 | 57
       11 | 136678
       12 | 123889
       13 | 0255666888899
       14 | 00001222344444555556667788889
       15 | 0000111112222223444455555566677779
       16 | 01222333444445555555678888889
       17 | 11233344566667799
       18 | 00011235568
       19 | 01234455667799
       20 | 0000113557788899
       21 | 145599
       22 | 013467
       23 | 9
       24 | 7
    HI: 2654

The display shows a more or less balanced mound-shaped distribution, with one or maybe two humps, a big one and a smaller one just to its right. Note that the data have been rounded to the tens place so that each datum gets only one leaf to the right of the dividing line.

Notice that the depths have been suppressed. To learn more about this option and many others, see Section 3.4. Unlike a histogram, the original data values may be recovered from the stemplot display (modulo the rounding); that is, starting from the top and working down we can read off the data values 1050, 1070, 1110, 1130, etc.

Index plot. Done with the plot function. These are good for plotting data which are ordered, for example, when the data are measured over time. That is, the first observation was measured at time 1, the second at time 2, etc. It is a two-dimensional plot, in which the index (or time) is the x variable and the measured value is the y variable. There are several plotting methods for index plots, and we discuss two of them:

    spikes: draws a vertical line from the x-axis to the observation height (type = "h").
    points: plots a simple point at the observation height (type = "p").

Example 3.7. Level of Lake Huron 1875-1972. Brockwell and Davis [11] give the annual measurements of the level (in feet) of Lake Huron from 1875 to 1972. The data are stored in the time series LakeHuron. See ?LakeHuron. Figure 3.1.4 was produced with the following code:

    > plot(LakeHuron, type = "h")
    > plot(LakeHuron, type = "p")

The plots show an overall decreasing trend to the observations, and there appears to be some seasonal variation that increases over time.

3.1.2 Qualitative Data, Categorical Data, and Factors

Qualitative data are simply any type of data that are not numerical, or do not represent numerical quantities. Examples of qualitative variables include a subject's name, gender, race/ethnicity, political party, socioeconomic status, class rank, driver's license number, and social security number (SSN).

Please bear in mind that some data look to be quantitative but are not, because they do not represent numerical quantities and do not obey mathematical rules. For example, a person's shoe size is typically written with numbers: 8, or 9, or 12, or 12½. Shoe size is not quantitative, however, because if we take a size 8 and combine with a size 9 we do not get a size 17.

Some qualitative data serve merely to identify the observation (such as a subject's name, driver's license number, or SSN). This type of data does not usually play much of a role in statistics. But other qualitative variables serve to subdivide the data set into categories; we call these factors. In the above examples, gender, race, political party, and socioeconomic status would be considered factors (shoe size would be another one). The possible values of a factor are called its levels. For instance, the factor gender would have two levels, namely, male and female. Socioeconomic status typically has three levels: high, middle, and low.

Factors may be of two types: nominal and ordinal. Nominal factors have levels that correspond to names of the categories, with no implied ordering. Examples of nominal factors would be hair color, gender, race, or political party. There is no natural ordering to "Democrat" and "Republican"; the categories are just names associated with different groups of people.

In contrast, ordinal factors have some sort of ordered structure to the underlying factor levels. For instance, socioeconomic status would be an ordinal categorical variable because the levels correspond to ranks associated with income, education, and occupation. Another example of ordinal categorical data would be class rank.

Factors have special status in R. They are represented internally by numbers, but even when they are written numerically their values do not convey any numeric meaning or obey any mathematical rules (that is, Stage III cancer is not Stage I cancer + Stage II cancer).

Example 3.8. The state.abb vector gives the two-letter postal abbreviations for all 50 states.

    > str(state.abb)
     chr [1:50] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" ...

These would be ID data. The state.name vector lists all of the complete names and those data would also be ID
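
The distinction between nominal and ordinal factors drawn above can be sketched in base R. This is only an illustration, not code from the text: the vectors party and ses and their values are made up for the example, and only base R functions (factor, levels, as.integer, is.factor) are used.

```r
# Hypothetical example data; the names `party` and `ses` are assumptions.

# Nominal factor: the levels are just names, with no implied ordering.
party <- factor(c("Democrat", "Republican", "Democrat", "Independent"))
levels(party)       # the distinct category names
as.integer(party)   # the internal integer codes behind each value

# Ordinal factor: the levels are given an explicit order, low < middle < high.
ses <- factor(c("low", "high", "middle", "low"),
              levels = c("low", "middle", "high"),
              ordered = TRUE)
ses[1] < ses[2]     # order comparisons are defined for ordered factors

# Arithmetic is not meaningful for factors: R returns NA with a warning.
suppressWarnings(party[1] + party[2])

# ID data such as state.abb are stored as character vectors, not factors.
is.factor(state.abb)
```

The internal codes returned by as.integer are only labels for the levels, which is why adding two factor values yields NA rather than a number.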
