Copyright © 2018 Pearson Education, Inc.

CHAPTER 7
Big Data Concepts and Tools

Learning Objectives for Chapter 7
▪ Learn what Big Data is and how it is changing the world of analytics
▪ Understand the motivation for and business drivers of Big Data analytics
▪ Become familiar with the wide range of enabling technologies for Big Data analytics
▪ Learn about Hadoop, MapReduce, and NoSQL as they relate to Big Data analytics
▪ Compare and contrast the complementary uses of data warehousing and Big Data
▪ Become familiar with the vendors of Big Data tools and services
▪ Understand the need for and appreciate the capabilities of stream analytics
▪ Learn about the applications of stream analytics

CHAPTER OVERVIEW
Big Data, which means many things to many people, is not a new technological fad. It is a business priority that has the potential to profoundly change the competitive landscape in today’s globally integrated economy. In addition to providing innovative solutions to enduring business challenges, Big Data and analytics instigate new ways to transform processes, organizations, entire industries, and even society altogether. Yet extensive media coverage makes it hard to distinguish hype from reality. This chapter aims to provide comprehensive coverage of Big Data, its enabling technologies, and related analytics concepts to help readers understand the capabilities and limitations of this emerging technology. The chapter starts with a definition and related concepts of Big Data, followed by the technical details of the enabling technologies, including Hadoop,
MapReduce, and NoSQL. After describing data scientist as a new fashionable organizational role/job, we provide a comparative analysis between data warehousing and Big Data analytics. The last part of the chapter is dedicated to stream analytics, which is one of the most promising value propositions of Big Data analytics.

CHAPTER OUTLINE
7.1 Opening Vignette: Analyzing Customer Churn in a Telecom Company Using Big Data Methods
7.2 Definition of Big Data
7.3 Fundamentals of Big Data Analytics
7.4 Big Data Technologies
7.5 Big Data and Data Warehousing
7.6 Big Data Vendors and Platforms
7.7 Big Data and Stream Analytics
7.8 Applications of Stream Analytics

ANSWERS TO END OF SECTION REVIEW QUESTIONS

Section 7.1 Review Questions

1. What problem did customer service cancellation pose to AT’s business survival?
The company identified that it was losing an alarming number of customers and that many of these customer losses happened as a result of customer service interactions. If the company continued to lose customers at this rate, it would no longer be economically viable.

2. Identify and explain the technical hurdles presented by the nature and characteristics of AT’s data.
The company needed to analyze data from a variety of sources, as well as data formats. Data was stored in text as well as audio. The data needed to be combined into a single location and format before analysis could occur.

3. What is sessionizing? Why was it necessary for AT to sessionize its data?
While not addressed directly in this case, sessionizing means aggregating customer interactions that concern a single issue across multiple contact methods. In this case, sessionizing is important because it reflects the true number of issues being addressed, and it also provides information and context about the events that will need to be analyzed.

4. Research other studies where customer churn models have been employed. What types of variables were used in those studies? How is this vignette different?
Student insights will vary based on the research completed.

5. Besides Teradata Aster, identify other popular Big Data analytics platforms that could handle the analysis described in the preceding case.
Student insights will vary based on the research completed.

Section 7.2 Review Questions

1. Why is Big Data important? What has changed to put it in the center of the analytics world?
As more and more data becomes available in various forms and fashions, timely processing of the data with traditional means becomes impractical. The exponential growth, availability, and use of information, both structured and unstructured, bring Big Data to the center of the analytics world. Pushing the boundaries of data analytics uncovers new insights and opportunities for the use of Big Data.

2. How do you define Big Data? Why is it difficult to define?
Big Data means different things to people with different backgrounds and interests, which is one reason it is hard to define. Traditionally, the term “Big Data” has been used to describe the massive volumes of data analyzed by huge organizations such as Google or research science projects at NASA. Big Data includes both structured and unstructured data, and it comes from everywhere: data sources include Web logs, RFID, GPS systems, sensor networks, social networks, Internet-based text documents, Internet search indexes, and detailed call records, to name just a few. Big Data is not just about volume, but also variety, velocity, veracity, and value proposition.

3. Out of the Vs that are used to define Big Data, in your opinion, which one is the most important? Why?
Although all of the Vs are important characteristics, value proposition is probably the most important for decision makers: “big” data contains (or has a greater potential to contain) more patterns and interesting anomalies than “small”
data. Thus, by analyzing large and feature-rich data, organizations can gain greater business value than they otherwise might. While users can detect the patterns in small data sets using simple statistical and machine-learning methods or ad hoc query and reporting tools, Big Data means “big” analytics. Big analytics means greater insight and better decisions, something that every organization needs nowadays. (Different students may have different answers.)

4. What do you think the future of Big Data will be like? Will it lose its popularity to something else? If so, what will it be?
Big Data could evolve at a rapid pace. The buzzword “Big Data” might change to something else, but the trend toward increased computing capabilities, analytics methodologies, and data management of high-volume heterogeneous information will continue. (Different students may have different answers.)

Section 7.3 Review Questions

1. What is Big Data analytics? How does it differ from regular analytics?
Big Data analytics is analytics applied to Big Data architectures. This is a new paradigm; in order to keep up with the computational needs of Big Data, a number of new and innovative analytics computational techniques and platforms have been developed. These techniques are collectively called high-performance computing, and include in-memory analytics, in-database analytics, grid computing, and appliances. They differ from regular analytics, which tends to focus on relational database technologies.

2. What are the critical success factors for Big Data analytics?
Critical factors include a clear business need, strong and committed sponsorship, alignment between the business and IT strategies, a fact-based decision culture, a strong data infrastructure, the right analytics tools, and personnel with advanced analytic skills.

3. What are the big challenges that one should be mindful of when considering implementation of Big Data analytics?
Traditional ways of capturing, storing, and analyzing data are not sufficient for Big Data. Major challenges are the sheer volume of data, the need for data integration to combine data of different structures in a cost-effective manner, the need to process data quickly, data governance issues, skill availability, and solution costs.

4. What are the common business problems addressed by Big Data analytics?
Here is a list of problems that can be addressed using Big Data analytics:
• Process efficiency and cost reduction
• Brand management
• Revenue maximization, cross-selling, and up-selling
• Enhanced customer experience
• Churn identification, customer recruiting
• Improved customer service
• Identifying new products and market opportunities
• Risk management
• Regulatory compliance
• Enhanced security capabilities

Section 7.4 Review Questions

1. What are the common characteristics of emerging Big Data technologies?
They take advantage of commodity hardware to enable scale-out, parallel processing techniques; employ nonrelational data storage capabilities in order to process unstructured and semistructured data; and apply advanced analytics and data visualization technology to Big Data to convey insights to end users.

2. What is MapReduce? What does it do? How does it do it?
MapReduce is a programming model that allows the processing of large-scale data analysis problems to be distributed and parallelized. The MapReduce technique, popularized by Google, distributes the processing of very large multistructured data files across a large cluster of machines. High performance is achieved by breaking the processing into small units of work that can be run in parallel across the hundreds, potentially thousands, of nodes in the cluster. The map function in MapReduce breaks a problem into sub-problems, each of which can be processed by a single node in parallel. The reduce function merges (sorts, organizes, aggregates) the results from each of these nodes into the final result.

3. What is Hadoop? How does it work?
Hadoop is an open source framework for processing, storing, and analyzing massive amounts of distributed, unstructured data. It is designed to handle petabytes and exabytes of data distributed over multiple nodes in parallel, typically commodity machines connected over a network. It utilizes the MapReduce framework to implement distributed parallelism. The file organization is implemented in the Hadoop Distributed File System (HDFS), which is adept at storing large volumes of unstructured and semistructured data. This is an alternative to the traditional tables/rows/columns structure of a relational database. Data is replicated across multiple nodes, allowing for fault tolerance in the system.
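
The map and reduce steps described in the MapReduce answer above can be illustrated with a small word-count sketch. This is a minimal, single-machine illustration in plain Python, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample text chunks are assumptions made for the example, and the chunks are processed one after another, whereas a real cluster would hand each chunk to a different node.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(chunk: str) -> List[Tuple[str, int]]:
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group intermediate values by key so each key is reduced once."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: aggregate all values collected for one key."""
    return key, sum(values)


def word_count(chunks: List[str]) -> Dict[str, int]:
    # Hypothetical driver: on a cluster each chunk would be mapped on a
    # separate node; here they are mapped sequentially for illustration.
    intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Two made-up "file splits" standing in for blocks of a large file.
    sample_chunks = ["big data big analytics", "data streams and big data"]
    print(word_count(sample_chunks))
```

Because each call to `map_phase` depends only on its own chunk, the map work can be spread across as many machines as there are chunks; only the grouping step needs to bring together results that share a key.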