Lecture 7 Hadoop/Spark
Lecture 7 Hadoop/Spark
What is Hadoop Hadoop is a software framework for distributed processing of large datasets across large clusters of computers Large datasets>Terabytes or petabytes of data Large clusters>hundreds or thousands of nodes Hadoop is open-source implementation for Google MapReduce Hadoop is based on a simple programming model called MapReduce Hadoop is based on a simple data model,any data will fit 3
What is Hadoop • Hadoop is a software framework for distributed processing of large datasets across large clusters of computers • Large datasets Æ Terabytes or petabytes of data • Large clusters Æ hundreds or thousands of nodes • Hadoop is open-source implementation for Google MapReduce • Hadoop is based on a simple programming model called MapReduce • Hadoop is based on a simple data model, any data will fit 3
Design Principles of Hadoop Need to process big data Need to parallelize computation across thousands of nodes Commodity hardware Large number of low-end cheap machines working in parallel to solve a computing problem This is in contrast to Parallel DBs Small number of high-end expensive machines 1
Design Principles of Hadoop • Need to process big data • Need to parallelize computation across thousands of nodes • Commodity hardware • Large number of low-end cheap machines working in parallel to solve a computing problem • This is in contrast to Parallel DBs • Small number of high-end expensive machines 4
Divide and Conquer "Work" Partition W1 W2 W3 ”h0tkr worker workeE "Result Combine
Divide and Conquer “worker ” “worker ” “worker ” Partition Combine
It's a bit more complex... Fundamental issues scheduling,data distribution,synchronization, Different programming models inter-process communication,robustness,fault Message Passing Shared Memory tolerance,... P P2 P3 P4 Ps P:P2 P3 P4 Ps Architectural issues Flynn's taxonomy (SIMD,MIMD,etc.), network typology,bisection bandwidth UMA vs.NUMA,cache coherence Different programming constructs mutexes,conditional variables,barriers,... masters/slaves,producers/consumers,work queues,... Common problems livelock,deadlock,data starvation,priority inversion... dining philosophers,sleeping barbers,cigarette smokers,... The reality:programmer shoulders the burden of managing concurrency
It’s a bit more complex… Message Passing P1 P2 P3 P4 P5 Shared Memory P1 P2 P3 P4 P5 Memory Different programming models Different programming constructs mutexes, conditional variables, barriers, … masters/slaves, producers/consumers, work queues, … Fundamental issues scheduling, data distribution, synchronization, inter-process communication, robustness, fault tolerance, … Common problems livelock, deadlock, data starvation, priority inversion… dining philosophers, sleeping barbers, cigarette smokers, … Architectural issues Flynn’s taxonomy (SIMD, MIMD, etc.), network typology, bisection bandwidth UMA vs. NUMA, cache coherence The reality: programmer shoulders the burden of managing concurrency…