2 CHAPTER 1.INTRODUCTION How much data are we talking about?A few examples:Google grew from pro- cessing 100 TB of data a day in 2004 [31]to processing 20 PB a day in 2008 [32].In April 2009,a blog post'was written about eBay's two enormous data warehouses:one with 2 petabytes of user data,and the other with 6.5 petabytes of user data spanning 170 trillion records and growing by 150 billion new records per day.Shortly there- after,Facebook revealed2 similarly impressive numbers,boasting of 2.5 petabytes of user data,growing at about 15 terabytes per day.Petabyte datasets are rapidly becom- ing the norm,and the trends are clear:our ability to store data is fast overwhelming our ability to process what we store.More distressing,increases in capacity are out- pacing improvements in bandwidth such that our ability to even read back what we store is deteriorating [63].Disk capacities have grown from tens of megabytes in the mid-1980s to about a couple of terabytes today (several orders of magnitude);on the other hand,latency and bandwidth have improved relatively little in comparison (in the case of latency,perhaps 2x improvement during the last quarter century,and in the case of bandwidth,perhaps 50x).Given the tendency for individuals and organizations to continuously fill up whatever capacity is available,large-data problems are growing increasingly severe. Moving beyond the commercial sphere,many have recognized the importance of data management in many scientific disciplines,where petabyte-scale datasets are also becoming increasingly common [13.For example: The high-energy physics community was already describing experiences with petabyte-scale databases back in 2005 [12].Today,the Large Hadron Collider (LHC)near Geneva is the world's largest particle accelerator,designed to probe the mysteries of the universe,including the fundamental nature of matter,by recreating conditions shortly following the Big Bang.When it becomes fully op- erational,the LHC will produce roughly 15 petabytes of data a year.3 Astronomers have long recognized the importance of a "digital observatory"that would support the data needs of researchers across the globe-the Sloan Digital Sky Survey [102]is perhaps the most well known of these projects.Looking into the future,the Large Synoptic Survey Telescope(LSST)is a wide-field instrument that is capable of observing the entire sky every few days.When the telescope comes online around 2015 in Chile,its 3.2 gigapixel primary camera will produce approximately half a petabyte of archive images every month [11]. The advent of next-generation DNA sequencing technology has created a deluge of sequence data that needs to be stored,organized,and delivered to scientists for 1http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/ 2http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/ 3http://public.web.cern.ch/public/en/LHC/Computing-en.html
2 CHAPTER 1. INTRODUCTION How much data are we talking about? A few examples: Google grew from processing 100 TB of data a day in 2004 [31] to processing 20 PB a day in 2008 [32]. In April 2009, a blog post1 was written about eBay’s two enormous data warehouses: one with 2 petabytes of user data, and the other with 6.5 petabytes of user data spanning 170 trillion records and growing by 150 billion new records per day. Shortly thereafter, Facebook revealed2 similarly impressive numbers, boasting of 2.5 petabytes of user data, growing at about 15 terabytes per day. Petabyte datasets are rapidly becoming the norm, and the trends are clear: our ability to store data is fast overwhelming our ability to process what we store. More distressing, increases in capacity are outpacing improvements in bandwidth such that our ability to even read back what we store is deteriorating [63]. Disk capacities have grown from tens of megabytes in the mid-1980s to about a couple of terabytes today (several orders of magnitude); on the other hand, latency and bandwidth have improved relatively little in comparison (in the case of latency, perhaps 2× improvement during the last quarter century, and in the case of bandwidth, perhaps 50×). Given the tendency for individuals and organizations to continuously fill up whatever capacity is available, large-data problems are growing increasingly severe. Moving beyond the commercial sphere, many have recognized the importance of data management in many scientific disciplines, where petabyte-scale datasets are also becoming increasingly common [13]. For example: • The high-energy physics community was already describing experiences with petabyte-scale databases back in 2005 [12]. Today, the Large Hadron Collider (LHC) near Geneva is the world’s largest particle accelerator, designed to probe the mysteries of the universe, including the fundamental nature of matter, by recreating conditions shortly following the Big Bang. When it becomes fully operational, the LHC will produce roughly 15 petabytes of data a year.3 • Astronomers have long recognized the importance of a “digital observatory” that would support the data needs of researchers across the globe—the Sloan Digital Sky Survey [102] is perhaps the most well known of these projects. Looking into the future, the Large Synoptic Survey Telescope (LSST) is a wide-field instrument that is capable of observing the entire sky every few days. When the telescope comes online around 2015 in Chile, its 3.2 gigapixel primary camera will produce approximately half a petabyte of archive images every month [11]. • The advent of next-generation DNA sequencing technology has created a deluge of sequence data that needs to be stored, organized, and delivered to scientists for 1http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/ 2http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/ 3http://public.web.cern.ch/public/en/LHC/Computing-en.html
3 further study.Given the fundamental tenant in modern genetics that genotypes explain phenotypes,the impact of this technology is nothing less than transfor- mative [74].The European Bioinformatics Institute (EBI),which hosts a central repository of sequence data called EMBL-bank,has increased storage capacity from 2.5 petabytes in 2008 to 5 petabytes in 2009 [99].Scientists are predicting that,in the not-so-distant future,sequencing an individual's genome will be no more complex than getting a blood test today-ushering a new era of personalized medicine,where interventions can be specifically targeted for an individual. Increasingly,scientific breakthroughs will be powered by advanced computing capabil- ities that help researchers manipulate,explore,and mine massive datasets [50]-this has been hailed as the emerging "fourth paradigm"of science [51](complementing the- ory,experiments,and simulations).In other areas of academia,particularly computer science,systems and algorithms incapable of scaling to massive real-world datasets run the danger of being dismissed as "toy systems"with limited utility.Large data is a fact of today's world and data-intensive processing is fast becoming a necessity,not merely a luxury or curiosity. Although large data comes in a variety of forms,this book is primarily concerned with processing large amounts of text,but touches on other types of data as well (e.g., relational and graph data).The problems and solutions we discuss mostly fall into the disciplinary boundaries of natural language processing(NLP)and information retrieval (IR).Recent work in these fields is dominated by a data-driven,empirical approach, typically involving algorithms that attempt to capture statistical regularities in data for the purposes of some task or application.There are three components to this ap- proach:data,representations of the data,and some method for capturing regularities in the data.The first is called corpora(singular,corpus)by NLP researchers and col- lections by those from the IR community.Aspects of the representations of the data are called features,which may be "superficial"and easy to extract,such as the words and sequences of words themselves,or "deep"and more difficult to extract,such as the grammatical relationship between words.Finally,algorithms or models are applied to capture regularities in the data in terms of the extracted features for some application One common application,classification,is to sort text into categories.Examples in- clude:Is this email spam or not spam?Is this word a part of an address?The first task is easy to understand,while the second task is an instance of what NLP researchers call named-entity detection,which is useful for local search and pinpointing locations on maps.Another common application is to rank texts according to some criteria-search is a good example,which involves ranking documents by relevance to the user's query. Another example is to automatically situate texts along a scale of "happiness",a task known as sentiment analysis,which has been applied to everything from understanding political discourse in the blogosphere to predicting the movement of stock prices
3 further study. Given the fundamental tenant in modern genetics that genotypes explain phenotypes, the impact of this technology is nothing less than transformative [74]. The European Bioinformatics Institute (EBI), which hosts a central repository of sequence data called EMBL-bank, has increased storage capacity from 2.5 petabytes in 2008 to 5 petabytes in 2009 [99]. Scientists are predicting that, in the not-so-distant future, sequencing an individual’s genome will be no more complex than getting a blood test today—ushering a new era of personalized medicine, where interventions can be specifically targeted for an individual. Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate, explore, and mine massive datasets [50]—this has been hailed as the emerging “fourth paradigm” of science [51] (complementing theory, experiments, and simulations). In other areas of academia, particularly computer science, systems and algorithms incapable of scaling to massive real-world datasets run the danger of being dismissed as “toy systems” with limited utility. Large data is a fact of today’s world and data-intensive processing is fast becoming a necessity, not merely a luxury or curiosity. Although large data comes in a variety of forms, this book is primarily concerned with processing large amounts of text, but touches on other types of data as well (e.g., relational and graph data). The problems and solutions we discuss mostly fall into the disciplinary boundaries of natural language processing (NLP) and information retrieval (IR). Recent work in these fields is dominated by a data-driven, empirical approach, typically involving algorithms that attempt to capture statistical regularities in data for the purposes of some task or application. There are three components to this approach: data, representations of the data, and some method for capturing regularities in the data. The first is called corpora (singular, corpus) by NLP researchers and collections by those from the IR community. Aspects of the representations of the data are called features, which may be “superficial” and easy to extract, such as the words and sequences of words themselves, or “deep” and more difficult to extract, such as the grammatical relationship between words. Finally, algorithms or models are applied to capture regularities in the data in terms of the extracted features for some application. One common application, classification, is to sort text into categories. Examples include: Is this email spam or not spam? Is this word a part of an address? The first task is easy to understand, while the second task is an instance of what NLP researchers call named-entity detection, which is useful for local search and pinpointing locations on maps. Another common application is to rank texts according to some criteria—search is a good example, which involves ranking documents by relevance to the user’s query. Another example is to automatically situate texts along a scale of “happiness”, a task known as sentiment analysis, which has been applied to everything from understanding political discourse in the blogosphere to predicting the movement of stock prices
4 CHAPTER 1.INTRODUCTION There is a growing body of evidence,at least in text processing,that of the three components discussed above (data,features,algorithms),data probably matters the most.Superficial word-level features coupled with simple models in most cases trump sophisticated models over deeper features and less data.But why can't we have our cake and eat it too?Why not both sophisticated models and deep features applied to lots of data?Because inference over sophisticated models and extraction of deep features are often computationally intensive,they often don't scale. Consider a simple task such as determining the correct usage of easily confusable words such as“than”and“then”in English.One can view this as a supervised machine learning problem:we can train a classifier to disambiguate between the options,and then apply the classifier to new instances of the problem(say,as part of a grammar checker) Training data is fairly easy to come by-we can just gather a large corpus of texts and assume that writers make correct choices(the training data may be noisy,since people make mistakes,but no matter).In 2001,Banko and Brill [8]published what has become a classic paper in natural language processing exploring the effects of training data size on classification accuracy,using this task as the specific example.They explored several classification algorithms(the exact ones aren't important),and not surprisingly, found that more data led to better accuracy.Across many different algorithms,the increase in accuracy was approximately linear in the log of the size of the training data.Furthermore,with increasing amounts of training data,the accuracy of different algorithms converged,such that pronounced differences in effectiveness observed on smaller datasets basically disappeared at scale.This led to a somewhat controversial conclusion (at least at the time):machine learning algorithms really don't matter,all that matters is the amount of data you have.This led to an even more controversial conclusion,delivered somewhat tongue-in-cheek:we should just give up working on algorithms and simply spend our time gathering data. As another example,consider the problem of answering short,fact-based questions such as "Who shot Abraham Lincoln?"Instead of returning a list of documents that the user would then have to sort through,a question answering (QA)system would directly return the answer:John Wilkes Booth.This problem gained interest in the late 1990s,when natural language processing researchers approached the challenge with sophisticated linguistic processing techniques such as syntactic and semantic analysis. Around 2001,researchers discovered a far simpler approach to answering such questions based on pattern matching [18,37,64].Suppose you wanted the answer to the above question.As it turns out,you can simply search for the phrase "shot Abraham Lincoln" on the web and look for what appears to its left.Or better yet,look through multiple instances of this phrase and tally up the words that appear to the left.This simple approach works surprisingly well,and has become known as the redundancy-based approach to question answering.It capitalizes on the insight that in a very large text
4 CHAPTER 1. INTRODUCTION There is a growing body of evidence, at least in text processing, that of the three components discussed above (data, features, algorithms), data probably matters the most. Superficial word-level features coupled with simple models in most cases trump sophisticated models over deeper features and less data. But why can’t we have our cake and eat it too? Why not both sophisticated models and deep features applied to lots of data? Because inference over sophisticated models and extraction of deep features are often computationally intensive, they often don’t scale. Consider a simple task such as determining the correct usage of easily confusable words such as “than” and “then” in English. One can view this as a supervised machine learning problem: we can train a classifier to disambiguate between the options, and then apply the classifier to new instances of the problem (say, as part of a grammar checker). Training data is fairly easy to come by—we can just gather a large corpus of texts and assume that writers make correct choices (the training data may be noisy, since people make mistakes, but no matter). In 2001, Banko and Brill [8] published what has become a classic paper in natural language processing exploring the effects of training data size on classification accuracy, using this task as the specific example. They explored several classification algorithms (the exact ones aren’t important), and not surprisingly, found that more data led to better accuracy. Across many different algorithms, the increase in accuracy was approximately linear in the log of the size of the training data. Furthermore, with increasing amounts of training data, the accuracy of different algorithms converged, such that pronounced differences in effectiveness observed on smaller datasets basically disappeared at scale. This led to a somewhat controversial conclusion (at least at the time): machine learning algorithms really don’t matter, all that matters is the amount of data you have. This led to an even more controversial conclusion, delivered somewhat tongue-in-cheek: we should just give up working on algorithms and simply spend our time gathering data. As another example, consider the problem of answering short, fact-based questions such as “Who shot Abraham Lincoln?” Instead of returning a list of documents that the user would then have to sort through, a question answering (QA) system would directly return the answer: John Wilkes Booth. This problem gained interest in the late 1990s, when natural language processing researchers approached the challenge with sophisticated linguistic processing techniques such as syntactic and semantic analysis. Around 2001, researchers discovered a far simpler approach to answering such questions based on pattern matching [18, 37, 64]. Suppose you wanted the answer to the above question. As it turns out, you can simply search for the phrase “shot Abraham Lincoln” on the web and look for what appears to its left. Or better yet, look through multiple instances of this phrase and tally up the words that appear to the left. This simple approach works surprisingly well, and has become known as the redundancy-based approach to question answering. It capitalizes on the insight that in a very large text
6 collection (i.e.,the web),answers to commonly-asked questions will be stated in obvious ways,such that pattern-matching techniques suffice to extract answers accurately. Yet another example concerns smoothing in web-scale language models [16].A language model is a probability distribution that characterizes the likelihood of ob- serving a particular sequence of words,estimated from a large corpus of texts.They are useful in a variety of applications,such as speech recognition (to determine what the speaker is more likely to have said)and machine translation (to determine which of possible translations is the most fluent,as we will discuss in Chapter 6.4).Since there are infinitely many possible strings,and probabilities must be assigned to all of them,language modeling is a more challenging task than simply keeping track of which strings were seen how many times:some number of likely strings will never have been seen at all,even with lots and lots of training data!Most modern language models make the Markov assumption:in a n-gram language model,the conditional probability of a word is given by the n-1 previous words.Thus,by the chain rule,the probability of a sequence of words can be decomposed into the product of n-gram probabilities.Nev- ertheless,an enormous number of parameters must still be estimated from a training corpus:potentially Vn parameters,where V is the number of words in the vocabulary. Even if we treat every word on the web as the training corpus to estimate the n-gram probabilities from,most n-grams-in any language,even English-will never have been seen.To cope with this sparseness,researchers have developed a number of smooth- ing techniques [24,73],which all share the basic idea of moving probability mass from observed to unseen events in a principled manner.Smoothing approaches vary in ef- fectiveness,both in terms of intrinsic and application-specific metrics.In 2007,Brants et al.[16]described language models trained on up to two trillion words.4 Their ex- periments compared a state-of-the-art approach known as Kneser-Ney smoothing [24] with another technique the authors affectionately referred to as "stupid backoff".5 Not surprisingly,stupid backoff didn't work as well as Kneser-Ney smoothing on smaller corpora.However,it was simpler and could be trained on more data,which ultimately yielded better language models.That is,a simpler technique on more data beat a more sophisticated technique on less data. Recently,three Google researchers summarized this data-driven philosophy in an essay titled The Unreasonable Effectiveness of Data [45].6 Why is this so?It boils down to the fact that language in the wild,just like human behavior in general,is messy.Unlike,say,the interaction of subatomic particles,human use of language is not constrained by succinct,universal "laws of grammar".There are of course rules 4As an side,it is interesting to observe the evolving definition of large over the years.Banko and Brill's paper in 2001 was titled Scaling to Very Very Large Corpora for Natural Language Disambiguation,and dealt with a corpus containing a billion words. 5As in,so stupid it couldn't possibly work. 6This title was inspired by a classic article titled The Unreasonable Effectiveness of Mathematics in the Natural Sciences [108]
5 collection (i.e., the web), answers to commonly-asked questions will be stated in obvious ways, such that pattern-matching techniques suffice to extract answers accurately. Yet another example concerns smoothing in web-scale language models [16]. A language model is a probability distribution that characterizes the likelihood of observing a particular sequence of words, estimated from a large corpus of texts. They are useful in a variety of applications, such as speech recognition (to determine what the speaker is more likely to have said) and machine translation (to determine which of possible translations is the most fluent, as we will discuss in Chapter 6.4). Since there are infinitely many possible strings, and probabilities must be assigned to all of them, language modeling is a more challenging task than simply keeping track of which strings were seen how many times: some number of likely strings will never have been seen at all, even with lots and lots of training data! Most modern language models make the Markov assumption: in a n-gram language model, the conditional probability of a word is given by the n − 1 previous words. Thus, by the chain rule, the probability of a sequence of words can be decomposed into the product of n-gram probabilities. Nevertheless, an enormous number of parameters must still be estimated from a training corpus: potentially V n parameters, where V is the number of words in the vocabulary. Even if we treat every word on the web as the training corpus to estimate the n-gram probabilities from, most n-grams—in any language, even English—will never have been seen. To cope with this sparseness, researchers have developed a number of smoothing techniques [24, 73], which all share the basic idea of moving probability mass from observed to unseen events in a principled manner. Smoothing approaches vary in effectiveness, both in terms of intrinsic and application-specific metrics. In 2007, Brants et al. [16] described language models trained on up to two trillion words.4 Their experiments compared a state-of-the-art approach known as Kneser-Ney smoothing [24] with another technique the authors affectionately referred to as “stupid backoff”.5 Not surprisingly, stupid backoff didn’t work as well as Kneser-Ney smoothing on smaller corpora. However, it was simpler and could be trained on more data, which ultimately yielded better language models. That is, a simpler technique on more data beat a more sophisticated technique on less data. Recently, three Google researchers summarized this data-driven philosophy in an essay titled The Unreasonable Effectiveness of Data [45].6 Why is this so? It boils down to the fact that language in the wild, just like human behavior in general, is messy. Unlike, say, the interaction of subatomic particles, human use of language is not constrained by succinct, universal “laws of grammar”. There are of course rules 4As an side, it is interesting to observe the evolving definition of large over the years. Banko and Brill’s paper in 2001 was titled Scaling to Very Very Large Corpora for Natural Language Disambiguation, and dealt with a corpus containing a billion words. 5As in, so stupid it couldn’t possibly work. 6This title was inspired by a classic article titled The Unreasonable Effectiveness of Mathematics in the Natural Sciences [108]
6 CHAPTER 1.INTRODUCTION that govern the formation of words and sentences-for example,that verbs appear before objects in English,and that subjects and verbs must agree in number-but real-world language is affected by a multitude of other factors as well:people invent new words and phrases all the time,authors occasionally make mistakes,groups of individuals write within a shared context,etc.The Argentine writer Jorge Luis Borges wrote a famous allegorical one-paragraph story about fictional society in which the art of cartography had gotten so advanced that their maps were as big as the lands they were describing.7 The world,he would say,is the best description of itself.In the same way,the more observations we gather about language use,the more accurate a description we have about language itself.This,in turn,translates into more effective algorithms and systems. So,in summary,why large data?In some ways,the first answer is similar to the reason people climb mountains:because they're there.But the second answer is even more compelling.Data is the rising tide that lifts all boats-more data leads to better algorithms and systems for solving real-world problems.Now that we've addressed the why,let's tackle the how.Let's start with the obvious observation:data-intensive pro- cessing is beyond the capabilities of any individual machine and requires clusters-which means that large-data problems are fundamentally about organizing computations on dozens,hundreds,or even thousands of machines.This is exactly what MapReduce does,and the rest of this book is about the how. 1.1 COMPUTING IN THE CLOUDS For better or for worse,MapReduce cannot be untangled from the broader discourse on cloud computing.True,there is substantial promise in this new paradigm of computing, but unwarranted hype by the media and popular sources threatens its credibility in the long run.In some ways,cloud computing is simply brilliant marketing.Before clouds, there were grids,s and before grids,there were vector supercomputers,each having claimed to be the best thing since sliced bread. So what exactly is cloud computing?This is one of those questions where ten experts will give eleven different answers;in fact,countless papers have been written simply to attempt to define the term (e.g.,[4,21,105],just to name a few examples). 7 On Eractitude in Science [14. sWhat is the difference between cloud computing and grid computing?Although both tackle the fundamental problem of how best to bring computational resources to bear on large and difficult problems,they start with different assumptions.Whereas clouds are assumed to be relatively homogeneous servers that reside in a datacenter or are distributed across a relatively small number of datacenters controlled by a single organization, grids are assumed to be a less tightly-coupled federation of heterogeneous resources under the control of distinct but cooperative organizations.As a result,grid computing tends to deal with tasks that are coarser-grained, and must deal with the practicalities of a federated environment,e.g.,verifying credentials across multiple administrative domains.Grid computing has adopted a middleware-based approach for tackling many of these issues
6 CHAPTER 1. INTRODUCTION that govern the formation of words and sentences—for example, that verbs appear before objects in English, and that subjects and verbs must agree in number—but real-world language is affected by a multitude of other factors as well: people invent new words and phrases all the time, authors occasionally make mistakes, groups of individuals write within a shared context, etc. The Argentine writer Jorge Luis Borges wrote a famous allegorical one-paragraph story about fictional society in which the art of cartography had gotten so advanced that their maps were as big as the lands they were describing.7 The world, he would say, is the best description of itself. In the same way, the more observations we gather about language use, the more accurate a description we have about language itself. This, in turn, translates into more effective algorithms and systems. So, in summary, why large data? In some ways, the first answer is similar to the reason people climb mountains: because they’re there. But the second answer is even more compelling. Data is the rising tide that lifts all boats—more data leads to better algorithms and systems for solving real-world problems. Now that we’ve addressed the why, let’s tackle the how. Let’s start with the obvious observation: data-intensive processing is beyond the capabilities of any individual machine and requires clusters—which means that large-data problems are fundamentally about organizing computations on dozens, hundreds, or even thousands of machines. This is exactly what MapReduce does, and the rest of this book is about the how. 1.1 COMPUTING IN THE CLOUDS For better or for worse, MapReduce cannot be untangled from the broader discourse on cloud computing. True, there is substantial promise in this new paradigm of computing, but unwarranted hype by the media and popular sources threatens its credibility in the long run. In some ways, cloud computing is simply brilliant marketing. Before clouds, there were grids,8 and before grids, there were vector supercomputers, each having claimed to be the best thing since sliced bread. So what exactly is cloud computing? This is one of those questions where ten experts will give eleven different answers; in fact, countless papers have been written simply to attempt to define the term (e.g., [4, 21, 105], just to name a few examples). 7On Exactitude in Science [14]. 8What is the difference between cloud computing and grid computing? Although both tackle the fundamental problem of how best to bring computational resources to bear on large and difficult problems, they start with different assumptions. Whereas clouds are assumed to be relatively homogeneous servers that reside in a datacenter or are distributed across a relatively small number of datacenters controlled by a single organization, grids are assumed to be a less tightly-coupled federation of heterogeneous resources under the control of distinct but cooperative organizations. As a result, grid computing tends to deal with tasks that are coarser-grained, and must deal with the practicalities of a federated environment, e.g., verifying credentials across multiple administrative domains. Grid computing has adopted a middleware-based approach for tackling many of these issues