Big Data Analysis and Mining
Weixiong Rao (饶卫雄)
School of Software Engineering, Tongji University (同济大学软件学院)
2015 Fall
wxrao@tongji.edu.cn
2021/1/30
*Some of the slides are from Dr. Jure Leskovec and Prof. Zachary G. Ives.
Traditional DAM
◼ Data sources: Oracle DB (operational system), SAP ERP, Salesforce CRM, flat files from legacy systems
◼ ETL (Extraction, Transformation, Loading) moves the source data into the data warehouse
◼ Data warehouse: IBM DW product on very powerful servers
◼ DAM tools: OLAP analysis, reporting, data mining
[Figure: data warehouse architecture, from sources through ETL to the data warehouse and DAM tools. (C) 2008 datawarehouse4u.info]
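The ETL flow on this slide can be sketched in a few lines. This is only an illustrative sketch, not how any of the named products work: the sample rows, the `sales` table, and every function name are made up for the example, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file export from a legacy system (made-up sample rows).
RAW_CSV = """customer,amount,region
alice, 120.50 ,north
bob,80,south
alice,30.25,north
carol,,south
"""


def extract(text):
    """Extraction: read raw rows from the flat file."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: trim whitespace, drop rows with no amount, cast types."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append(
            (row["customer"].strip(), float(amount), row["region"].strip())
        )
    return cleaned


def load(rows, conn):
    """Loading: write the cleaned rows into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, region TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl():
    """Run extract -> transform -> load and return the warehouse connection."""
    conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse
    load(transform(extract(RAW_CSV)), conn)
    return conn


def sales_by_region(conn):
    """A simple OLAP-style aggregate over the loaded data."""
    cur = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return cur.fetchall()


if __name__ == "__main__":
    print(sales_by_region(run_etl()))
```

The three functions mirror the three ETL stages, and the final query plays the role of a reporting tool reading from the warehouse.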
Big Data
◼ Typical large enterprise:
◆ 5,000-50,000 servers, terabytes of data, millions of transactions per day
◼ In contrast, many Internet companies:
◆ Millions of servers, petabytes of data
◆ Google: lots and lots of Web pages; billions of Google queries per day
◆ Facebook: a billion Facebook users; a billion+ Facebook pages
◆ Twitter: hundreds of millions of Twitter accounts; hundreds of millions of Tweets per day
Today's DAM solutions
◼ Google, Facebook, LinkedIn, eBay, Amazon... did not use traditional data warehouse products for DAM.
◼ Why? CAP theorem
◆ Different assumptions lead to different solutions
◼ What?
◆ Massive parallelism
◆ Hadoop MapReduce paradigm
◆ UC Berkeley Shark/Spark
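To make the MapReduce paradigm concrete, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) applied to word counting. It is not the Hadoop API: the function names are hypothetical, and a real cluster would run the map and reduce calls in parallel on many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return (key, sum(values))


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big servers"]))
```

Each call to `map_phase` touches only one document and each call to `reduce_phase` touches only one key, so both phases can be spread across servers without coordination. That independence is what makes massive parallelism possible.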
What's DAM?
◼ Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision making.
◼ Data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes.
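The difference between descriptive and predictive purposes can be illustrated with a small sketch. The data points and function names below are made up for the example, and ordinary least squares is used only as one simple instance of a predictive model.

```python
from statistics import mean

# Made-up sample: (advertising spend, sales) pairs.
DATA = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]


def describe(points):
    """Descriptive analysis: summarize what the observed data looks like."""
    ys = [y for _, y in points]
    return {"count": len(ys), "mean": mean(ys), "min": min(ys), "max": max(ys)}


def fit_line(points):
    """Predictive modeling: fit y = slope * x + intercept by least squares."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in points) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Use the fitted model to estimate y for an unseen x."""
    slope, intercept = model
    return slope * x + intercept


if __name__ == "__main__":
    print(describe(DATA))
    model = fit_line(DATA)
    print(predict(model, 5.0))
```

`describe` only reports on data that was already observed, while `fit_line` and `predict` produce an estimate for an input that was never seen, which is the predictive use the slide attributes to data mining.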