
Tongji University (同济大学): Big Data Analysis and Mining, course teaching resources (PPT lecture slides): Platforms for Big Data Mining (Lecturer: Rao Weixiong, 饶卫雄)

◼ Parallel DBMS technologies
◆ Proposed in the late eighties
◆ Matured over the last two decades
◆ Multi-billion dollar industry: proprietary DBMS engines intended as data warehousing solutions for very large enterprises
◼ Hadoop
◼ Spark (UC Berkeley)


Distributed File System
◼ Chunk servers
◆ File is split into contiguous chunks
◆ Typically each chunk is 16-64MB
◆ Each chunk replicated (usually 2x or 3x)
◆ Try to keep replicas in different racks
◼ Master node
◆ a.k.a. Name Node in Hadoop's HDFS
◆ Stores metadata about where files are stored
◆ Might be replicated
◼ Client library for file access
◆ Talks to master to find chunk servers
◆ Connects directly to chunk servers to access data
2021/1/30, School of Software, Tongji University (同济大学软件学院)
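The division of labour on this slide (chunk servers hold data, the master holds only metadata, the client asks the master where to go) can be illustrated with a small sketch. It is a toy model, not GFS or HDFS code: the `Master` class, the rack and server names, and the round-robin placement rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 2**20   # 64 MB, the upper end of the range on the slide
REPLICATION = 3           # "usually 2x or 3x"


@dataclass
class Master:
    """Toy master node (hypothetical): holds only metadata, never file contents."""
    servers_by_rack: dict                                 # rack -> [chunk server names]
    chunk_locations: dict = field(default_factory=dict)   # (file, chunk index) -> [servers]

    def add_file(self, name, size_bytes):
        """Split a file into contiguous chunks; put each replica on a different rack."""
        n_chunks = max(1, -(-size_bytes // CHUNK_SIZE))   # ceiling division
        racks = sorted(self.servers_by_rack)
        copies = min(REPLICATION, len(racks))
        for i in range(n_chunks):
            replicas = []
            for r in range(copies):
                rack = racks[(i + r) % len(racks)]        # rotate the starting rack per chunk
                servers = self.servers_by_rack[rack]
                replicas.append(servers[i % len(servers)])
            self.chunk_locations[(name, i)] = replicas
        return n_chunks

    def locate(self, name, offset):
        """What a client library would ask: which servers hold the chunk at this offset?"""
        return self.chunk_locations[(name, offset // CHUNK_SIZE)]


master = Master({"rack-a": ["a1", "a2"], "rack-b": ["b1", "b2"], "rack-c": ["c1", "c2"]})
n = master.add_file("logs.txt", 200 * 2**20)        # a 200 MB file
replicas = master.locate("logs.txt", 70 * 2**20)    # byte offset 70 MB falls in chunk 1
```

After the lookup, a real client would contact one of the returned chunk servers directly, so file data never passes through the master.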

Distributed File System
◼ Reliable distributed file system
◆ Data kept in "chunks" spread across machines
◆ Each chunk replicated on different machines
◼ Seamless recovery from disk or machine failure
[Figure: replicated chunks spread across Chunk server 1, Chunk server 2, Chunk server 3, …, Chunk server N]
◼ Bring computation directly to the data
◼ Chunk servers also serve as compute servers
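One way to picture "seamless recovery" is re-replication: when a chunk server disappears, every chunk it held is copied to another live server until the replica count is restored. The sketch below is an assumption-laden illustration; `handle_failure`, the server names, and the "least loaded server first" rule are hypothetical and not taken from the slides.

```python
def handle_failure(chunk_locations, live_servers, failed, replication=3):
    """Return new replica lists with `failed` removed and lost copies re-created."""
    live = [s for s in live_servers if s != failed]

    # Count how many chunks each surviving server already stores.
    load = {s: 0 for s in live}
    for replicas in chunk_locations.values():
        for s in replicas:
            if s in load:
                load[s] += 1

    repaired = {}
    for chunk, replicas in chunk_locations.items():
        survivors = [s for s in replicas if s != failed]
        while len(survivors) < replication:
            # A new copy must go to a server that does not already hold this chunk.
            candidates = [s for s in live if s not in survivors]
            if not candidates:
                break                                   # not enough machines left
            target = min(candidates, key=lambda s: (load[s], s))
            survivors.append(target)
            load[target] += 1
        repaired[chunk] = survivors
    return repaired


locations = {"c0": ["s1", "s2"], "c1": ["s2", "s3"], "c2": ["s1", "s3"]}
repaired = handle_failure(locations, ["s1", "s2", "s3", "s4"], failed="s2", replication=2)
```

In this example the two chunks that lost a copy on `s2` are both re-created on the idle server `s4`, while `c2` is left untouched.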

Basic Idea
◼ Issue: Copying data over a network takes time
◼ Idea:
◆ Bring computation to data
◆ Store files multiple times for reliability
◼ MapReduce addresses these problems
◆ Storage Infrastructure: File system
▪ Google: GFS
▪ Hadoop: HDFS
◆ Programming model
▪ MapReduce
[Slide callout: NEXT]
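"Bring computation to data" is often realised as locality-aware scheduling: a task that processes a chunk is placed on a machine that already stores that chunk, and data only crosses the network when no such machine has a free slot. The following is a minimal sketch under that assumption; `schedule`, the slot counts, and the tie-breaking rule are hypothetical and do not describe Hadoop's actual scheduler.

```python
def schedule(chunk_locations, free_slots):
    """Assign one task per chunk, preferring a server that already stores the chunk."""
    slots = dict(free_slots)            # copy so the caller's dict is not modified
    plan = {}
    for chunk, replicas in chunk_locations.items():
        local = [s for s in replicas if slots.get(s, 0) > 0]
        if local:
            # Data is already on this machine: no network copy needed.
            server = min(local, key=lambda s: (-slots[s], s))
            plan[chunk] = (server, "local")
        else:
            anywhere = [s for s, free in slots.items() if free > 0]
            if not anywhere:
                plan[chunk] = (None, "waiting")
                continue
            # Fall back to any free machine; the chunk must be read over the network.
            server = min(anywhere, key=lambda s: (-slots[s], s))
            plan[chunk] = (server, "remote")
        slots[server] -= 1
    return plan


locations = {
    "c0": ["s1", "s2"],
    "c1": ["s1", "s3"],
    "c2": ["s1", "s2"],
    "c3": ["s1", "s2"],
    "c4": ["s1"],
}
plan = schedule(locations, {"s1": 1, "s2": 1, "s3": 1, "s4": 1})
```

Because every chunk is stored on several machines, the scheduler usually has more than one "local" choice, which is a side benefit of the replication that was introduced for reliability.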

What is HDFS (Hadoop Distributed File System)?
◼ HDFS is a distributed file system
◆ Makes some unique tradeoffs that are good for MapReduce
◼ What HDFS does well:
◆ Very large read-only or append-only files (individual files may contain Gigabytes/Terabytes of data)
◆ Sequential access patterns
◼ What HDFS does not do well:
◆ Storing lots of small files
◆ Low-latency access
◆ Multiple writers
◆ Writing to arbitrary offsets in the file
(Slide footer: University of Pennsylvania)
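The slide does not say why many small files are a problem. One commonly cited reason is that the master has to track every file and every chunk, so metadata grows with the number of objects rather than with the number of bytes. The arithmetic below is a rough sketch of that effect; the model of one metadata entry per file plus one per chunk, and the 64 MB chunk size, are simplifying assumptions.

```python
CHUNK_SIZE = 64 * 2**20   # assumed chunk size: 64 MB


def metadata_entries(file_sizes, chunk_size=CHUNK_SIZE):
    """Entries the master must track: one per file plus one per chunk (assumed model)."""
    chunks = sum(max(1, -(-size // chunk_size)) for size in file_sizes)
    return len(file_sizes) + chunks


one_big = metadata_entries([2**40])               # 1 TB stored as a single file
many_small = metadata_entries([2**20] * 2**20)    # the same 1 TB as 1 MB files

print(one_big)      # 16385
print(many_small)   # 2097152
```

Under these assumptions the same amount of data costs 128 times as many metadata entries when it is stored as 1 MB files.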

HDFS versus NFS
◼ Network File System (NFS)
◆ Single machine makes part of its file system available to other machines
◆ Sequential or random access
◆ PRO: Simplicity, generality, transparency
◆ CON: Storage capacity and throughput limited by single server
◼ Hadoop Distributed File System (HDFS)
◆ Single virtual file system spread over many machines
◆ Optimized for sequential read and local accesses
◆ PRO: High throughput, high capacity
◆ CON: Specialized for particular types of applications
