Naive Bayes: Classifying
▪ positions ← all word positions in the current document which contain tokens found in Vocabulary
▪ Return c_NB, where
  $c_{NB} = \arg\max_{c_j \in C} \; P(c_j) \prod_{i \in \text{positions}} P(x_i \mid c_j)$
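As a concrete reference, here is a minimal Python sketch of this classification rule, assuming the parameters P(c) and P(w | c) have already been estimated and live in hypothetical dictionaries prior[c] and cond_prob[c][w]; positions whose tokens are not in the vocabulary are simply skipped.

```python
def classify_nb(doc_tokens, classes, vocab, prior, cond_prob):
    # prior[c] = P(c); cond_prob[c][w] = P(w | c)  (hypothetical, pre-estimated)
    best_class, best_score = None, -1.0
    for c in classes:
        score = prior[c]
        for w in doc_tokens:            # iterate over word positions
            if w in vocab:              # keep only positions with in-vocabulary tokens
                score *= cond_prob[c][w]
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Note that this direct product form underflows on long documents; the log-space fix appears two slides later.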
Naive Bayes: Time Complexity
For document classification:
▪ Training Time: O(|D| L_ave + |C||V|), where L_ave is the average length of a document in D.
  ▪ Assumes all counts are pre-computed in O(|D| L_ave) time during one pass through all of the data.
  ▪ Generally just O(|D| L_ave), since usually |C||V| < |D| L_ave.
▪ Test Time: O(|C| L_t), where L_t is the average length of a test document.
▪ Very efficient overall: linearly proportional to the time needed to just read in all the data.
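A sketch of how that single counting pass might look, using the same hypothetical parameter names as above: one loop over all (tokens, label) pairs accumulates the counts in O(|D| L_ave), and a separate O(|C||V|) step turns them into probabilities. The add-one smoothing in the normalization step is an assumption, not something stated on this slide.

```python
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    # labeled_docs: iterable of (tokens, class) pairs (hypothetical input format)
    class_count = Counter()
    token_count = defaultdict(Counter)          # token_count[c][w]
    vocab = set()
    for tokens, c in labeled_docs:              # one pass: O(|D| * L_ave)
        class_count[c] += 1
        for w in tokens:
            token_count[c][w] += 1
            vocab.add(w)
    n_docs = sum(class_count.values())
    prior = {c: class_count[c] / n_docs for c in class_count}
    cond_prob = {}
    for c in class_count:                       # normalization: O(|C| * |V|)
        denom = sum(token_count[c].values()) + len(vocab)   # add-one smoothing (assumed)
        cond_prob[c] = {w: (token_count[c][w] + 1) / denom for w in vocab}
    return prior, cond_prob, vocab
```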
Underflow Prevention: Using Logs
▪ Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
▪ Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
▪ Class with highest final un-normalized log probability score is still the most probable:
  $c_{NB} = \arg\max_{c_j \in C} \left[\, \log P(c_j) + \sum_{i \in \text{positions}} \log P(x_i \mid c_j) \right]$
▪ Note that the model is now just a max of a sum of weights…
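The earlier classification sketch, rewritten in log space under the same assumptions: only the accumulator changes, and the argmax decision is unchanged because log is monotonic.

```python
import math

def classify_nb_log(doc_tokens, classes, vocab, prior, cond_prob):
    best_class, best_score = None, float("-inf")
    for c in classes:
        score = math.log(prior[c])                   # start from the log prior
        for w in doc_tokens:
            if w in vocab:
                score += math.log(cond_prob[c][w])   # log(xy) = log x + log y
        if score > best_score:                       # un-normalized log score; argmax unchanged
            best_class, best_score = c, score
    return best_class
```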
Naive Bayes Classifier
$c_{NB} = \arg\max_{c_j \in C} \left[\, \log P(c_j) + \sum_{i \in \text{positions}} \log P(x_i \mid c_j) \right]$
▪ Simple interpretation: each conditional parameter log P(x_i | c_j) is a weight that indicates how good an indicator x_i is for c_j.
▪ The prior log P(c_j) is a weight that indicates the relative frequency of c_j.
▪ The sum is then a measure of how much evidence there is for the document being in the class.
▪ We select the class with the most evidence for it.
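A toy two-class illustration of this sum-of-weights view, with made-up numbers: each log P(w | c) acts as a per-word weight, log P(c) as a bias, and the class score is their sum.

```python
import math

# Hypothetical parameters for a two-class spam/ham example.
weights = {                                    # weights[c][w] = log P(w | c)
    "spam": {"free": math.log(0.10), "meeting": math.log(0.01)},
    "ham":  {"free": math.log(0.01), "meeting": math.log(0.10)},
}
bias = {"spam": math.log(0.4), "ham": math.log(0.6)}   # bias[c] = log P(c)

doc = ["free", "free", "meeting"]
scores = {c: bias[c] + sum(weights[c][w] for w in doc) for c in bias}
print(max(scores, key=scores.get))             # -> spam (two "free"s outweigh one "meeting")
```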
Classification Methods
▪ Perceptrons
▪ Naïve Bayes
▪ kNN
▪ Support vector machine (SVM)