Similarity Metrics ◼ Nearest neighbor method depends on a similarity (or distance) metric. ◼ Simplest for continuous m-dimensional instance space is Euclidean distance. ◼ Simplest for m-dimensional binary instance space is Hamming distance (number of feature values that differ). ◼ For text, cosine similarity of tf.idf-weighted vectors is typically most effective.
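These three metrics are simple enough to write out directly. The following is a minimal, illustrative Python sketch (function names and the plain-list vector representation are assumptions for illustration, not part of any particular library):

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two continuous m-dimensional vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def hamming_distance(x, y):
    """Number of positions at which two binary feature vectors differ."""
    return sum(xi != yi for xi, yi in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between two (e.g. tf.idf-weighted) vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return dot / norm if norm else 0.0
```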
Illustration of 3 Nearest Neighbor for Text Vector Space
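Since the original figure is not reproduced here, the following hypothetical Python sketch shows what the illustrated 3-nearest-neighbor decision amounts to: score every training vector against the test vector (reusing cosine_similarity from the sketch above) and take a majority vote over the top 3. The function and variable names are assumptions for illustration only.

```python
from collections import Counter

def knn_classify(test_vec, train_vecs, train_labels, k=3, sim=cosine_similarity):
    """Label a test document by majority vote over its k most similar training documents."""
    # Score every training document against the test document.
    scored = sorted(
        ((sim(test_vec, v), label) for v, label in zip(train_vecs, train_labels)),
        reverse=True,
    )
    # Majority vote over the k nearest neighbors.
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]
```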
Nearest Neighbor with Inverted Index ◼ Naively finding nearest neighbors requires a linear search through |D| documents in the collection. ◼ But determining the k nearest neighbors is the same as determining the top-k best retrievals using the test document as a query to a database of training documents. ◼ Use standard vector space inverted index methods to find the k nearest neighbors. ◼ Testing time: O(B·|Vt|), where Vt is the set of distinct words in the test document and B is the average number of training documents in which a test-document word appears. ◼ Typically B << |D|.
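A minimal sketch of this idea, assuming tf.idf-weighted postings of the form (doc_id, weight) and length-normalized vectors (so accumulated dot products equal cosine similarities); the data structures and names are assumptions for illustration, not a specific library API:

```python
from collections import defaultdict
import heapq

def knn_via_inverted_index(test_doc_weights, index, k=3):
    """Find the k training documents most similar to the test document.

    test_doc_weights: dict mapping term -> tf.idf weight in the test document.
    index: inverted index mapping term -> list of (training_doc_id, tf.idf weight).
    Only training documents sharing at least one term with the test document are
    ever touched, so the cost is O(B * |Vt|) rather than a scan over all |D| documents.
    """
    scores = defaultdict(float)
    for term, w_test in test_doc_weights.items():
        for doc_id, w_train in index.get(term, []):
            scores[doc_id] += w_test * w_train  # accumulate dot-product contributions
    # Top-k retrieval by accumulated score (equals cosine similarity for normalized vectors).
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```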
kNN: Discussion ◼ No training necessary ◼ No feature selection necessary ◼ Scales well with large number of classes ◼ Don’t need to train n classifiers for n classes ◼ Classes can influence each other ◼ Small changes to one class can have ripple effect ◼ Scores can be hard to convert to probabilities
Naïve Bayes