4. What are the main Hadoop components? What functions do they perform?

Major components of Hadoop are HDFS, a Job Tracker operating on the master node, Name Nodes, Secondary Nodes, and Slave Nodes. HDFS is the default storage layer in any given Hadoop cluster. A Name Node is a node in a Hadoop cluster that tells the client where in the cluster particular data is stored and whether any nodes have failed. Secondary Nodes are backup Name Nodes. The Job Tracker is the node of a Hadoop cluster that initiates and coordinates MapReduce jobs, that is, the processing of the data. Slave Nodes store data and take direction to process it from the Job Tracker.

Querying for data in the distributed system is accomplished via MapReduce. The client query is handled in a Map job, which is submitted to the Job Tracker. The Job Tracker consults the Name Node to determine which data it needs to access to complete the job and where in the cluster that data is located, then submits the query to the relevant nodes, which operate in parallel. The Name Node acts as a facilitator, communicating back to the client information such as which nodes are available, where in the cluster certain data resides, and which nodes have failed. When each node completes its task, it stores its result. The client then submits a Reduce job to the Job Tracker, which collects and aggregates the results from each of the nodes. (A minimal code sketch of this Map/Reduce flow appears after these answers.)

5. What is NoSQL? How does it fit into the Big Data analytics picture?

NoSQL, also known as "Not Only SQL," is a new style of database for processing large volumes of multi-structured data. Whereas Hadoop is adept at supporting large-scale, batch-style historical analysis, NoSQL databases are mostly aimed at serving up discrete data, stored among large volumes of multi-structured data, to end-user and automated Big Data applications. NoSQL databases trade ACID (atomicity, consistency, isolation, durability) compliance for performance and scalability. (A toy illustration of this access pattern follows the MapReduce sketch below.)

Section 7.5 Review Questions

1. What are the challenges facing data warehousing and Big Data? Are we witnessing the end of the data warehousing era? Why or why not?

What has changed the landscape in recent years is the variety and complexity of data, which data warehouses have been unable to keep up with. It is not the volume of structured data but the variety and the velocity that forced the world of IT to develop a new paradigm, which we now call "Big Data." But this does not mean the end of data warehousing. Data warehousing and RDBMS still bring many strengths that make them relevant for BI and that Big Data techniques do not currently provide.
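To make the Map/Reduce flow described in Question 4 of Section 7.4 concrete, here is the canonical word-count job written against Hadoop's MapReduce Java API. It is a minimal sketch, not the textbook's own example; note that in Hadoop 2 and later, YARN replaces the Job Tracker as the coordinator, but the logical Map, shuffle, and Reduce flow is the same as described above.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: each mapper runs in parallel on a block of the input,
  // emitting a (word, 1) pair for every token it sees.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: the partial results from all mappers are collected
  // per key and aggregated, as the coordinator (Job Tracker/YARN) directs.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output are HDFS paths supplied on the command line.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}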
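The contrast drawn in Question 5 above, between batch analysis and serving up discrete records, can be illustrated with a toy key-value store. This in-memory sketch is purely hypothetical and stands in for a real NoSQL client (e.g., for Cassandra or MongoDB); real systems add partitioning, replication, and tunable consistency.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy key-value store illustrating the NoSQL access pattern: fetch one
// discrete record by key, rather than scanning a whole data set the way
// a batch MapReduce job does.
public class ToyKeyValueStore {
  private final Map<String, String> store = new ConcurrentHashMap<>();

  // Writes are cheap because no cross-record ACID transaction is
  // enforced; this is the trade-off for performance and scalability
  // mentioned in the answer above.
  public void put(String key, String value) {
    store.put(key, value);
  }

  public String get(String key) {
    return store.get(key);
  }

  public static void main(String[] args) {
    ToyKeyValueStore db = new ToyKeyValueStore();
    db.put("user:42:profile", "{\"name\":\"Ada\",\"segment\":\"premium\"}");
    // An end-user Big Data application serves one record on demand:
    System.out.println(db.get("user:42:profile"));
  }
}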
2. What are the use cases for Big Data and Hadoop?

In terms of its use cases, Hadoop is differentiated in two ways: first, as the repository and refinery of raw data, and second, as an active archive of historical data. Hadoop, with its distributed file system and flexibility of data formats (allowing both structured and unstructured data), is advantageous when working with information commonly found on the Web, including social media, multimedia, and text. Also, because it can handle such huge volumes of data (and because storage costs are minimized due to the distributed nature of the file system), historical (archive) data can be managed easily with this approach.

3. What are the use cases for data warehousing and RDBMS?

Three main use cases for data warehousing are performance, integration, and the availability of a wide variety of BI tools. The relational data warehouse approach is quite mature, and database vendors are constantly adding new index types, partitioning, statistics, and optimizer features. This enables complex queries to be executed quickly, a must for any BI application. Data warehousing, and the ETL process, provide a robust mechanism for collecting, cleaning, and integrating data. And it is increasingly easy for end users to create reports, graphs, and visualizations of the data.

4. In what scenarios can Hadoop and RDBMS coexist?

There are several possible scenarios under which using a combination of Hadoop and relational DBMS-based data warehousing technologies makes sense. For example, you can use Hadoop for storing and archiving multi-structured data, with a connector to a relational DBMS that extracts required data from Hadoop for analysis by the relational DBMS. Hadoop can also be used to filter and transform multi-structured data for transport to a data warehouse, and to analyze multi-structured data for publishing into the data warehouse environment. Combining SQL and MapReduce query functions enables data scientists to analyze both structured and unstructured data. Also, front-end query tools are available for both platforms. (A sketch of such a connector appears at the end of these answers.)

Section 7.6 Review Questions

1. What is special about the Big Data vendor landscape? Who are the big players?

The Big Data vendor landscape is developing very rapidly. It is in a special period of evolution where entrepreneurial startup firms bring innovative solutions to the marketplace. Cloudera is a market leader in the Hadoop space. MapR and Hortonworks are two other Hadoop startups. DataStax is an example of a NoSQL vendor. Informatica, Pervasive Software, Syncsort, and MicroStrategy are also players. Most of the growth in the industry is with Hadoop and NoSQL distributors and analytics providers. There is still very little in terms of Big Data
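The coexistence scenario from Question 4 of Section 7.5 can be sketched as a small hand-rolled "connector": Hadoop has already filtered multi-structured data into a delimited file on HDFS, and this program publishes it into a relational data warehouse over JDBC. The HDFS path, JDBC URL, credentials, and table schema below are all hypothetical placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsToWarehouseLoader {
  public static void main(String[] args) throws Exception {
    // Read the Hadoop-refined output from HDFS.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path refined = new Path("/refined/social_mentions.csv"); // hypothetical path

    try (BufferedReader reader = new BufferedReader(
             new InputStreamReader(fs.open(refined), StandardCharsets.UTF_8));
         // Connect to the relational warehouse (hypothetical URL/credentials).
         Connection dw = DriverManager.getConnection(
             "jdbc:postgresql://warehouse:5432/dw", "etl", "secret");
         PreparedStatement insert = dw.prepareStatement(
             "INSERT INTO social_mentions (product, mention_count) VALUES (?, ?)")) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] fields = line.split(",");
        insert.setString(1, fields[0]);
        insert.setInt(2, Integer.parseInt(fields[1]));
        insert.addBatch();
      }
      // The warehouse now serves this refined data to standard BI tools.
      insert.executeBatch();
    }
  }
}

In practice, a dedicated transfer tool such as Apache Sqoop, or a vendor-supplied Hadoop connector, would typically handle this movement rather than custom code; the sketch simply makes the division of labor visible: Hadoop refines, the RDBMS serves.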