CHAPTER 2. AN INTRODUCTION TO R

R-Forge: (http://r-forge.r-project.org/) This is another location where R packages are stored. Here you can find development code which has not yet been released to CRAN.

R Wiki: (http://wiki.r-project.org/rwiki/doku.php) There are many tips and tricks listed here. If you find a trick of your own, log in and share it with the world.

Other: the R Graph Gallery (http://addictedtor.free.fr/graphiques/) and R Graphical Manual (http://bm2.genes.nig.ac.jp/RGM2/index.php) have literally thousands of graphs to peruse. RSeek (http://www.rseek.org) is a search engine based on Google specifically tailored for R queries.

2.6 Other Tips

It is unnecessary to retype commands repeatedly, since R remembers what you have recently entered on the command line. On the Microsoft® Windows RGui, to cycle through the previous commands just push the ↑ (up arrow) key. On Emacs/ESS the command is M-p (which means hold down the Alt button and press “p”). More generally, the command history() will show a whole list of recently entered commands.

• To find out which variables are in the current workspace, use the commands objects() or ls(). These list all available objects in the workspace. If you wish to remove one or more variables, use remove(var1, var2, var3), or more simply use rm(var1, var2, var3), and to remove all objects use rm(list = ls()).

• Another use of scan is when you have a long list of numbers (separated by spaces or on different lines) already typed somewhere else, say in a text file. To enter all the data in one fell swoop, first highlight and copy the list of numbers to the Clipboard with Edit ⊲ Copy (or by right-clicking and selecting Copy). Next type the x <- scan() command in the R console, and paste the numbers at the 1: prompt with Edit ⊲ Paste. All of the numbers will automatically be entered into the vector x.

• The command Ctrl+l clears the screen in the Microsoft® Windows RGui. The comparable command for Emacs/ESS is

• Once you use R for a while there may be some commands that you wish to run automatically whenever R starts. These commands may be saved in a file called Rprofile.site which is usually in the etc folder, which lives in the R home directory (which on Microsoft® Windows usually is C:\Program Files\R). Alternatively, you can make a file .Rprofile to be stored in the user’s home directory, or anywhere R is invoked. This allows for multiple configurations for different projects or users. See “Customizing the Environment” of An Introduction to R for more details.

• When exiting R the user is given the option to “save the workspace”. I recommend that beginners DO NOT save the workspace when quitting. If Yes is selected, then all of the objects and data currently in R’s memory are saved in a file located in the working directory called .RData. This file is then automatically loaded the next time R starts (in which case R will say [previously saved workspace restored]). This is a valuable feature for experienced users of R, but I find that it causes more trouble than it saves with beginners.
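The workspace and scan tips above can be tried out with a short sketch. It rests on two assumptions that are not part of the text: the objects live in a scratch environment called env (a hypothetical name) so that nothing in the real workspace is deleted, and the text argument of scan stands in for numbers pasted at the 1: prompt. The variable names and values are made up for illustration.

```r
# Scratch environment (hypothetical) so the real workspace is left alone;
# at the console one would call ls() and rm() without the envir argument.
env <- new.env()
assign("var1", 1:3, envir = env)
assign("var2", "a", envir = env)
assign("var3", TRUE, envir = env)

all_names <- ls(envir = env)             # names of all three objects
rm(var1, var2, envir = env)              # remove selected objects by name
remaining <- ls(envir = env)             # only var3 is left
rm(list = ls(envir = env), envir = env)  # remove everything that remains

# scan() reading numbers separated by spaces or line breaks; the text
# argument replaces the interactive paste described above.
nums <- scan(text = "3 5 7\n9 11", quiet = TRUE)
nums
```

In an interactive session the last step would instead be x <- scan() followed by pasting the numbers and pressing Enter on an empty line to finish.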
Chapter 3

Data Description

In this chapter we introduce the different types of data that a statistician is likely to encounter, and in each subsection we give some examples of how to display the data of that particular type. Once we see how to display data distributions, we next introduce the basic properties of data distributions. We qualitatively explore several data sets. Once we have an intuitive feel for the properties of data sets, we next discuss how we may numerically measure and describe those properties with descriptive statistics.

What do I want them to know?

• different data types, such as quantitative versus qualitative, nominal versus ordinal, and discrete versus continuous
• basic graphical displays for assorted data types, and some of their (dis)advantages
• fundamental properties of data distributions, including center, spread, shape, and crazy observations
• methods to describe data (visually/numerically) with respect to the properties, and how the methods differ depending on the data type
• all of the above in the context of grouped data, and in particular, the concept of a factor

3.1 Types of Data

Loosely speaking, a datum is any piece of collected information, and a data set is a collection of data related to each other in some way. We will categorize data into five types and describe each in turn:

Quantitative: data associated with a measurement of some quantity on an observational unit,
Qualitative: data associated with some quality or property of the observational unit,
Logical: data to represent true or false and which play an important role later,
Missing: data that should be there but are not, and
Other types: everything else under the sun.

In each subsection we look at some examples of the type in question and introduce methods to display them.
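As a rough preview of how these five types tend to appear in R, the sketch below builds one small object of each kind. All object names and values are made up for illustration and do not come from any data set used in this chapter; the calendar date is just one possible example of the "other" category.

```r
# Hypothetical values, for illustration only.
quantitative <- c(67.0, 54.7, 7.0)                # numeric measurements
qualitative  <- factor(c("low", "high", "low"))   # categories stored as a factor
logical_data <- c(TRUE, FALSE, TRUE)              # true/false values
with_missing <- c(1.5, NA, 3.2)                   # NA marks a missing value
other_data   <- as.Date("2010-01-01")             # e.g. a calendar date

class(quantitative)       # "numeric"
class(qualitative)        # "factor"
class(logical_data)       # "logical"
sum(is.na(with_missing))  # 1
class(other_data)         # "Date"
```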
3.1.1 Quantitative data

Quantitative data are any data that measure or are associated with a measurement of the quantity of something. They invariably assume numerical values. Quantitative data can be further subdivided into two categories.

• Discrete data take values in a finite or countably infinite set of numbers, that is, all possible values could (at least in principle) be written down in an ordered list. Examples include: counts, number of arrivals, or number of successes. They are often represented by integers, say, 0, 1, 2, etc.

• Continuous data take values in an interval of numbers. These are also known as scale data, interval data, or measurement data. Examples include: height, weight, length, time, etc. Continuous data are often characterized by fractions or decimals: 3.82, 7.0001, 4 5/8, etc.

Note that the distinction between discrete and continuous data is not always clear-cut. Sometimes it is convenient to treat data as if they were continuous, even though strictly speaking they are not continuous. See the examples.

Example 3.1. Annual Precipitation in US Cities. The vector precip contains the average amount of rainfall (in inches) for each of 70 cities in the United States and Puerto Rico. Let us take a look at the data:

    > str(precip)
     Named num [1:70] 67 54.7 7 48.5 14 17.2 20.7 13 43.4 40.2 ...
     - attr(*, "names")= chr [1:70] "Mobile" "Juneau" "Phoenix" "Little Rock" ...
    > precip[1:4]
         Mobile      Juneau     Phoenix Little Rock
           67.0        54.7         7.0        48.5

The output shows that precip is a numeric vector which has been named, that is, each value has a name associated with it (which can be set with the names function). These are quantitative continuous data.

Example 3.2. Lengths of Major North American Rivers. The U.S. Geological Survey recorded the lengths (in miles) of several rivers in North America. They are stored in the vector rivers in the datasets package (which ships with base R). See ?rivers. Let us take a look at the data with the str function.

    > str(rivers)
     num [1:141] 735 320 325 392 524 ...

The output says that rivers is a numeric vector of length 141, and the first few values are 735, 320, 325, etc. These data are definitely quantitative and it appears that the measurements have been rounded to the nearest mile. Thus, strictly speaking, these are discrete data. But we will find it convenient later to take data like these to be continuous for some of our statistical procedures.
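One way to check the rounding remark is to test whether every value in a vector is a whole number. The helper is_whole below is a hypothetical name introduced only for this sketch; it is not a function from base R or from the text.

```r
# Hypothetical helper: TRUE when every value in x is a whole number.
is_whole <- function(x) all(x == round(x))

is_whole(rivers)   # are all river lengths recorded in whole miles?
is_whole(precip)   # FALSE, since precip contains values such as 54.7
length(rivers)     # 141
length(precip)     # 70
```

A result of TRUE for rivers would agree with the observation that the lengths appear to be rounded to the nearest mile. It does not by itself settle whether the data should be treated as discrete or continuous in a given analysis.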
Example 3.3. Yearly Numbers of Important Discoveries. The vector discoveries contains numbers of “great” inventions/discoveries in each year from 1860 to 1959, as reported by the 1975 World Almanac. Let us take a look at the data:

    > str(discoveries)
     Time-Series [1:100] from 1860 to 1959: 5 3 0 2 0 3 2 3 6 1 ...
    > discoveries[1:4]
    [1] 5 3 0 2

The output is telling us that discoveries is a time series (see Section 3.1.5 for more) of length 100. The entries are integers, and since they represent counts this is a good example of discrete quantitative data. We will take a closer look in the following sections.

Displaying Quantitative Data

One of the first things to do when confronted by quantitative data (or any data, for that matter) is to make some sort of visual display to gain some insight into the data’s structure. There are almost as many display types from which to choose as there are data sets to plot. We describe some of the more popular alternatives.

Strip charts (also known as Dot plots) These can be used for discrete or continuous data, and usually look best when the data set is not too large. Along the horizontal axis is a numerical scale above which the data values are plotted. We can do it in R with a call to the stripchart function. There are three available methods.

overplot: plots ties covering each other. This method is good to display only the distinct values assumed by the data set.
jitter: adds some noise to the data in the y direction, in which case the data values are not covered up by ties.
stack: plots repeated values stacked on top of one another. This method is best used for discrete data with a lot of ties; if there are no repeats then this method is identical to overplot.

See Figure 3.1.1, which is produced by the following code.

    > stripchart(precip, xlab = "rainfall")
    > stripchart(rivers, method = "jitter", xlab = "length")
    > stripchart(discoveries, method = "stack", xlab = "number")

The leftmost graph is a strip chart of the precip data. The graph shows tightly clustered values in the middle with some others falling balanced on either side, with perhaps slightly more falling to the left. Later we will call this a symmetric distribution, see Section 3.2.3. The middle graph is of the rivers data, a vector of length 141. There are several repeated values in the rivers data, and if we were to use the overplot method we would lose some of them in the display. This plot shows what we will later call a right-skewed shape, with perhaps some extreme values on the far right of the display. The third graph strip charts the discoveries data, which are literally a textbook example of a right-skewed distribution.

The DOTplot function in the UsingR package [86] is another alternative.
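To see how the three methods differ on the same data, one option is to draw discoveries three times side by side. This is only a sketch and not the code behind Figure 3.1.1: the par(mfrow = ...) layout, the loop, and the panel titles are additions made for illustration.

```r
# Count the distinct values: with many ties, overplot shows only these.
n_distinct <- length(unique(as.vector(discoveries)))
n_distinct

# Draw the same data with each method in a 1 x 3 layout.
old_par <- par(mfrow = c(1, 3))
for (m in c("overplot", "jitter", "stack")) {
  stripchart(discoveries, method = m, xlab = "number")
  title(main = m)   # label each panel with the method used
}
par(old_par)        # restore the previous graphics settings
```

Because discoveries holds 100 counts drawn from only a handful of distinct values, the overplot panel should show far fewer points than the other two, which is the behavior the method descriptions lead one to expect.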
[Figure: three strip charts side by side, with horizontal axes labeled rainfall, length, and number.]

Figure 3.1.1: Strip charts of the precip, rivers, and discoveries data. The first graph uses the overplot method, the second the jitter method, and the third the stack method.