What will we learn?
◼ We will learn to solve real-world problems:
  ◆ Recommender systems
  ◆ Market Basket Analysis
  ◆ Spam detection
  ◆ Duplicate document detection
◼ We will learn various “tools”:
  ◆ Optimization (stochastic gradient descent)
  ◆ Dynamic programming (frequent itemsets)
  ◆ Hashing (LSH, Bloom filters)
*From Dr Jure Leskovec’s slides
2021/1/30 School of Software Engineering, Tongji University
The course landscape
◼ Data
  ◆ High dim. data: locality-sensitive hashing, clustering, dimensionality reduction
  ◆ Graph data: PageRank, SimRank, network analysis, spam detection
  ◆ Infinite data: filtering data streams, web advertising, queries on streams
◼ ML alg.: decision trees, perceptron, KNN
◼ Apps: recommender systems, association rules, duplicate document detection
◼ Matlab + Hadoop + Apache Spark
Platforms for Big Data Mining
◼ Parallel DBMS technologies
  ◆ Proposed in the late eighties
  ◆ Matured over the last two decades
  ◆ Multi-billion dollar industry: proprietary DBMS engines
  ◆ Intended as data warehousing solutions for very large enterprises
◼ Hadoop
◼ Spark
  ◆ UC Berkeley
Parallel DBMS (PDBMS) technologies
◼ Popularly used for more than two decades
  ◆ Research projects: Gamma, Grace, …
  ◆ Commercial: multi-billion dollar industry, but access to only a privileged few
◼ Relational data model
  ◆ Indexing
  ◆ Familiar SQL interface
  ◆ Advanced query optimization
  ◆ Well understood and studied
  ◆ Very reliable!
MapReduce
◼ Overview:
  ◆ Data-parallel programming model
  ◆ An associated parallel and distributed implementation for commodity clusters
◼ Pioneered by Google
  ◆ Processes 20 PB of data per day (circa 2008)
  ◆ Popularized by open-source Hadoop project
  ◆ Used by Yahoo!, Facebook, Amazon, and the list is growing …
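
To make the data-parallel programming model concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps, using word counting as the task. It is plain Python, not Hadoop or Google's implementation; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample documents are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as a framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a cluster each document could be mapped on a different machine;
    # this sketch simply loops over them in one process.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data mining", "mining big graphs", "data streams"]
    print(word_count(docs))
```

Because map_phase looks at one record at a time and reduce_phase looks at one key at a time, both steps can in principle be spread across many commodity machines; only the shuffle needs to move data between them.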