
University of Electronic Science and Technology of China: Big Data Analysis and Mining, course teaching resources (lecture slides), Lecture 2 Basic Concepts (Foundations of Data Mining)


2. Naïve Bayes

Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$$

The classifier predicts that X belongs to class $C_i$ if and only if $P(C_i \mid X)$ is the highest among the posteriors $P(C_k \mid X)$ of all classes.

Practical difficulty: this requires initial knowledge of many probabilities and incurs significant computational cost.
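As a quick numerical check of Bayes' theorem, the following minimal sketch computes a posterior from a prior and two likelihoods. The function name `posterior` and all numbers are hypothetical, chosen only for illustration; the evidence P(X) is obtained from the law of total probability over H and its complement.

```python
def posterior(prior_h, likelihood_x_given_h, evidence_x):
    """Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    if evidence_x <= 0:
        raise ValueError("P(X) must be positive")
    return likelihood_x_given_h * prior_h / evidence_x


# Hypothetical values, not taken from the lecture.
p_h = 0.3              # P(H)
p_x_given_h = 0.8      # P(X | H)
p_x_given_not_h = 0.1  # P(X | not H)

# Law of total probability: P(X) = P(X|H)P(H) + P(X|not H)P(not H).
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

p_h_given_x = posterior(p_h, p_x_given_h, p_x)
print(round(p_h_given_x, 4))
```

Even in this two-hypothesis case, three probabilities have to be known up front, which hints at the practical difficulty noted on the slide.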

Class conditional independence

With $X = (x_1, \dots, x_n)$:

$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$

$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)} = \frac{P(C_i)\prod_{k=1}^{n} P(x_k \mid C_i)}{P(X)}$$

$$\arg\max_i P(C_i \mid X) = \arg\max_i P(C_i)\prod_{k=1}^{n} P(x_k \mid C_i)$$
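The decision rule under class conditional independence can be sketched in a few lines of Python. This is an illustrative sketch, not the lecture's implementation: the function `naive_bayes_predict`, the attribute values, and every probability in the tables are hypothetical.

```python
from math import prod


def naive_bayes_predict(x, priors, cond_probs):
    """Return (argmax_i P(C_i) * prod_k P(x_k | C_i), all class scores).

    priors:     {class: P(C_i)}
    cond_probs: {class: [ {value: P(x_k = value | C_i)} for each attribute k ]}
    """
    scores = {}
    for c, prior in priors.items():
        # Product of per-attribute conditionals (the independence assumption).
        likelihood = prod(cond_probs[c][k].get(v, 0.0) for k, v in enumerate(x))
        scores[c] = prior * likelihood
    return max(scores, key=scores.get), scores


# Hypothetical probability tables for two classes and two attributes.
priors = {"yes": 0.6, "no": 0.4}
cond_probs = {
    "yes": [{"sunny": 0.2, "rainy": 0.8}, {"hot": 0.5, "cold": 0.5}],
    "no": [{"sunny": 0.7, "rainy": 0.3}, {"hot": 0.6, "cold": 0.4}],
}

label, scores = naive_bayes_predict(("sunny", "hot"), priors, cond_probs)
print(label, scores)
```

Here the score for "yes" is 0.6 × 0.2 × 0.5 = 0.06 and for "no" is 0.4 × 0.7 × 0.6 = 0.168, so the sketch picks "no". Only n small tables per class are needed, rather than a full joint distribution over all attribute combinations.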

Case Study: Spam Email

Problem: classify documents by their content. Is a document a spam email or a non-spam email? Namely, what is the probability that a given document D belongs to a given class C? In other words, what is $\Pr(C \mid D)$?

For spam email investigation, by Bayes' theorem, we have

$$\Pr(S \mid D) = \frac{\Pr(S)\,\Pr(D \mid S)}{\Pr(D)}$$

$$\Pr(S^c \mid D) = \frac{\Pr(S^c)\,\Pr(D \mid S^c)}{\Pr(D)}$$

where $S$ is the class of spam email and $S^c$ is the class of normal email.

The problem is thus reduced to determining which posterior probability is higher:

$$\arg\max_j \Pr(S_j \mid D) = \arg\max_j \frac{\Pr(S_j)\,\Pr(D \mid S_j)}{\Pr(D)}$$

Since $\Pr(D)$ is a constant that does not depend on $S_j$, the equation can be further written as

$$\arg\max_j \Pr(S_j)\,\Pr(D \mid S_j)$$

This is the most common form. Given a document D, we can then use this formula to determine whether it is a spam email or not.
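To illustrate why the denominator Pr(D) can be dropped, the sketch below compares the class chosen from the unnormalized products with the class chosen from the normalized posteriors. The function name `classify` and the numbers are hypothetical.

```python
def classify(prior, likelihood):
    """Compare the argmax of Pr(S_j) * Pr(D | S_j) with the argmax of Pr(S_j | D)."""
    joint = {c: prior[c] * likelihood[c] for c in prior}
    evidence = sum(joint.values())  # Pr(D), identical for every class
    post = {c: joint[c] / evidence for c in joint}
    by_joint = max(joint, key=joint.get)
    by_posterior = max(post, key=post.get)
    return by_joint, by_posterior, post


# Hypothetical values for a single document D.
prior = {"spam": 0.4, "normal": 0.6}            # Pr(S_j)
likelihood = {"spam": 0.003, "normal": 0.0005}  # Pr(D | S_j)

by_joint, by_posterior, post = classify(prior, likelihood)
print(by_joint, by_posterior, post)
```

Dividing every score by the same positive constant rescales them without changing their order, so both selections agree.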

To compute the posterior probability, we must first compute the prior probability $\Pr(S_j)$ and the conditional probability $\Pr(D \mid S_j)$.

Suppose we already know the class (spam or non-spam) of some emails, which are called the "training data". The prior $\Pr(S_j)$ can then be obtained easily from the training data:

$$\Pr(S) = \frac{\#\text{spam}}{\#\text{total}}, \qquad \Pr(S^c) = 1 - \frac{\#\text{spam}}{\#\text{total}}$$

$\Pr(D \mid S_j)$ can be computed as follows. As each document can be modelled as a set of words, the probability that a given document D occurs in class $S_j$ can be written in terms of its words.
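The prior estimate above, together with one possible word-based likelihood, can be sketched as follows. The priors follow the #spam / #total formula from the slide. The likelihood model is an assumption, since the source does not show its formula: each word is treated as conditionally independent given the class (as in the class conditional independence slide), only words present in the document are multiplied, and add-one smoothing is applied so unseen words do not yield zero. All names and the tiny training set are hypothetical.

```python
from collections import Counter
from math import log


def train(docs):
    """docs: list of (set_of_words, label), label in {"spam", "normal"}."""
    n_total = len(docs)
    n_class = Counter(label for _, label in docs)
    # Priors as on the slide: Pr(S) = #spam / #total.
    priors = {c: n / n_total for c, n in n_class.items()}
    # ASSUMPTION: per class, count the documents that contain each word.
    word_docs = {c: Counter() for c in n_class}
    for words, label in docs:
        word_docs[label].update(set(words))
    return priors, word_docs, n_class


def log_score(words, c, priors, word_docs, n_class):
    """log Pr(S_c) + sum over words of log Pr(w | S_c), add-one smoothed (assumption)."""
    score = log(priors[c])
    for w in set(words):
        score += log((word_docs[c][w] + 1) / (n_class[c] + 2))
    return score


def classify(words, priors, word_docs, n_class):
    """Pick the class with the highest log score (argmax_j Pr(S_j) Pr(D | S_j))."""
    return max(priors, key=lambda c: log_score(words, c, priors, word_docs, n_class))


# Hypothetical training data.
docs = [
    ({"win", "money", "now"}, "spam"),
    ({"cheap", "money"}, "spam"),
    ({"meeting", "tomorrow"}, "normal"),
    ({"project", "meeting", "notes"}, "normal"),
    ({"lunch", "tomorrow"}, "normal"),
]
priors, word_docs, n_class = train(docs)
pred_a = classify({"cheap", "money"}, priors, word_docs, n_class)
pred_b = classify({"meeting", "tomorrow"}, priors, word_docs, n_class)
print(priors, pred_a, pred_b)
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied; it does not change which class attains the maximum.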
