Word Count, Illustrated

map(key=url, val=contents):
  For each word w in contents, emit (w, "1")

reduce(key=word, values=uniq_counts):
  Sum all "1"s in values list
  Emit result "(word, sum)"

Input:           map() output:    reduce() output:
see bob throw    see 1            bob 1
see spot run     bob 1            run 1
                 throw 1          see 2
                 see 1            spot 1
                 spot 1           throw 1
                 run 1
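The word-count pseudocode above can be sketched in runnable form. This is a minimal single-process sketch, not the distributed implementation; the function names `map_fn` and `reduce_fn` and the two example documents are illustrative assumptions.

```python
from collections import defaultdict

def map_fn(key, contents):
    # map(key=url, val=contents): for each word w, emit (w, "1")
    for w in contents.split():
        yield (w, "1")

def reduce_fn(key, values):
    # reduce(key=word, values=uniq_counts): sum all "1"s in values list
    return (key, sum(int(v) for v in values))

# Drive the two phases over the example documents from the slide.
docs = {"doc1": "see bob throw", "doc2": "see spot run"}

intermediate = defaultdict(list)          # word -> list of "1"s
for url, contents in docs.items():
    for w, one in map_fn(url, contents):
        intermediate[w].append(one)

result = dict(reduce_fn(w, vals) for w, vals in intermediate.items())
# result == {"see": 2, "bob": 1, "throw": 1, "spot": 1, "run": 1}
```

The in-memory `intermediate` dict stands in for the shuffle step that groups all values for a key before reduce runs.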
Reverse Web-Link 12 • Map – For each URL linking to target, … – Output <target, source> pairs • Reduce – Concatenate list of all source URLs – Outputs: <target, list (source)> pairs
G) Model is Widely Used 1000 800 600 400 200 Mar May Jul Sep Nov Jan Mar May Jul Sep 2003 2004 Exam ple uses: distributed grep distributed sort web link-graph reversal term-vector /host web access log stats inverted index construction statistical machine document clustering machine learning tr anslation
Model is Widely Used

[Chart: number of MapReduce instances over time, Mar 2003 – Sep 2004]

Example uses:
• distributed grep
• distributed sort
• web link-graph reversal
• term-vector / host
• web access log stats
• inverted index construction
• document clustering
• machine learning
• statistical machine translation
• ...
Implementation

Typical cluster:
• 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
• Limited bisection bandwidth
• Storage is on local IDE disks
• GFS: distributed file system manages data (SOSP'03)
• Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines

Implementation is a C++ library linked into user programs
Execution

• How is this distributed?
  ➢ Partition input key/value pairs into chunks, run map() tasks in parallel
  ➢ After all map()s are complete, consolidate all emitted values for each unique emitted key
  ➢ Now partition space of output map keys, and run reduce() in parallel
• If map() or reduce() fails, reexecute!
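The distribution steps above can be sketched end to end: chunk the input, run map tasks in parallel, hash-partition the intermediate keys, and run a reduce task per partition. A toy sketch using threads in one process, standing in for the cluster scheduler; the task-function names, the chunking, and `R = 2` partitions are illustrative assumptions, and fault handling (re-execution) is omitted.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_task(chunk):
    # One map() task: word-count map over its chunk of (key, contents) pairs.
    return [(w, 1) for _, contents in chunk for w in contents.split()]

def reduce_task(partition):
    # One reduce() task: sum the values for each key in its partition.
    return {k: sum(vs) for k, vs in partition.items()}

inputs = [("d1", "see bob throw"), ("d2", "see spot run")]
chunks = [inputs[:1], inputs[1:]]   # partition input pairs into chunks
R = 2                               # number of reduce partitions

with ThreadPoolExecutor() as pool:
    # Run map() tasks in parallel, one per chunk.
    map_outputs = list(pool.map(map_task, chunks))

    # Consolidate all emitted values for each unique key,
    # partitioned across the reduce tasks by hash(key) % R.
    partitions = [defaultdict(list) for _ in range(R)]
    for out in map_outputs:
        for k, v in out:
            partitions[hash(k) % R][k].append(v)

    # Run reduce() tasks in parallel, one per partition.
    counts = {}
    for part in pool.map(reduce_task, partitions):
        counts.update(part)

# counts == {"see": 2, "bob": 1, "throw": 1, "spot": 1, "run": 1}
```

In the real system each map and reduce task would run on a different machine, and a failed task would simply be re-executed on another worker.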