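University code: 10246
Student ID: 033053558

Master's Thesis

Research and Design of an RSS-Based Enterprise Web Search Engine

School: School of Software
Major: Software Engineering
Name: He Jun (何俊)
Advisor: Dai Weihui (戴伟辉)
Completion date: October 10, 2005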
Contents

Chapter 1 Introduction .... 1
1.1 Significance of the Topic .... 1
1.2 Current State of Search Engine Technology in China and Abroad .... 1
1.2.1 State of the Technology Abroad .... 2
1.2.2 State of the Technology in China .... 2
1.2.3 State of Research on Enterprise Search Engines .... 2
1.3 Search Engine Design Model .... 4
1.4 Research Content of This Thesis .... 4
1.4.1 Problem Statement .... 4
1.4.2 Research Approach .... 5
1.4.3 Organization of the Thesis .... 6
Chapter 2 RSS Technology .... 7
2.1 The Concept of RSS .... 7
2.2 RSS Versions and Technical Standards .... 7
2.2.1 Differences Between RSS Versions .... 7
2.2.2 Comparison of RSS 1.0 and RSS 0.9x/2.0 .... 8
2.2.3 The Dispute over Technical Standards .... 9
2.3 RSS Element Definitions and Usage .... 10
2.4 RSS Feed .... 11
2.5 Current State of RSS Technology in China and Abroad .... 12
2.6 Technical Reasons for Adopting RSS in This Model .... 14
Chapter 3 Design of Automatic Data Collection .... 15
3.1 Automatic Collection of RSS Feeds .... 15
3.1.1 RSS Feed Auto-Discovery Algorithm .... 15
3.1.2 Meta-Search-Based Automatic Collection of RSS Feeds .... 16
3.1.3 Feature Analysis of RSS Feeds .... 16
3.1.4 Design of Automatic RSS Feed Collection .... 17
3.2 Extraction of Plain Text from Web Pages .... 18
3.2.1 Text Extraction .... 18
3.2.2 Conversion of Special Characters .... 19
3.3 Parsing of RSS Information .... 20
Chapter 4 Chinese Word Segmentation and Index Design .... 22
4.1 Automatic Chinese Word Segmentation .... 22
4.1.1 Research on Chinese Word Segmentation Techniques .... 22
4.1.2 Difficulties in Automatic Chinese Word Segmentation .... 23
4.1.3 Design of Automatic Word Segmentation .... 24
4.2 Indexing Technology .... 27
4.2.1 Full-Text Retrieval Technology .... 28
4.2.2 Selection of Index Terms .... 28
4.2.3 Organizational Structure of the Index .... 28
4.2.4 Design and Implementation of the Index .... 29
Chapter 5 Design of Data Retrieval .... 31
5.1 Comparative Study of Retrieval Models .... 31
5.1.1 Boolean Logic Model .... 31
5.1.2 Fuzzy Logic Model .... 32
5.1.3 Vector Space Model .... 32
5.1.4 Probabilistic Retrieval Model .... 33
5.2 Techniques for Improving Retrieval Efficiency and Quality .... 34
5.2.1 Relevance Ranking .... 34
5.2.2 User Interface Technology .... 35
5.3 Implementation of Data Retrieval .... 36
Chapter 6 Implementation of the RSS-Based Enterprise Web Search Engine .... 38
6.1 Overall Workflow of the RSS-Based Enterprise Web Search Engine .... 38
6.2 System Module Division .... 39
6.3 Main Data Structures .... 42
6.4 System Testing .... 43
Chapter 7 Conclusions and Outlook .... 44
7.1 Conclusions .... 44
7.2 Outlook .... 44
References .... 46
Acknowledgments .... 47
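Chinese Abstract

As content on the Internet becomes vastly more abundant, the value of information itself is being diminished by the very network that creates it in such volume. Search engines have become an indispensable tool for more and more enterprises seeking intelligence and knowledge amid this mass of information, and how to turn search engine technology toward enterprise applications has become a focus of research in the field.

This thesis analyzes the problems that enterprise Web search engines run into when they apply existing public search engine technology. Starting from the actual needs of enterprise search services, and with the goals of improving the effectiveness of information collection and lowering deployment and operating costs, it collects information using push-mode RSS technology. This addresses the long refresh cycles and the high deployment and operating costs of traditional pull-mode search engines. The thesis also proposes a technique for automatically discovering RSS Feeds based on meta-search.

The thesis analyzes the technical specifications and characteristics of RSS. Taking the components of a search engine as its main thread, it discusses in turn the working principles, workflows, and design methods of the key Web search engine technologies of Chinese word segmentation, data indexing, and data retrieval, and it studies and improves some of these technologies in light of the characteristics of enterprise search engines. On this basis, it describes the overall design and implementation of an RSS-based enterprise Web search engine, focusing on module design, the overall workflow, and the main data structures.

Keywords: search engine, RSS, Chinese word segmentation, information retrieval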
ABSTRACT

As Internet content grows ever more abundant, the tremendous amount of information released on the Internet conversely reduces the value of information itself. Consequently, the search engine has become an extremely important tool for more and more enterprises to obtain intelligence and knowledge from the sea of information. How to apply search engine techniques in enterprises has also become an active topic in this research area.

This thesis analyzes the problems that an enterprise Web search engine encounters when applying existing public search engine techniques. Starting from the actual needs of enterprise search services, it aims to improve the effectiveness of the search engine's information collection and to reduce the cost of deployment and operation. It adopts RSS technology based on push mode to solve the problems of typical pull-mode search engines, such as long refresh periods and high deployment and operating costs, and it presents a new RSS Feed auto-discovery technique based on meta-search.

The thesis analyzes the technical specifications and features of RSS. It then studies in detail the key techniques of a Web search engine, including Chinese word segmentation, data indexing, and information retrieval, considering mainly their working mechanisms, workflows, and design methods. Taking the characteristics of the enterprise search engine into account, it also investigates and improves some of these techniques. Building on these studies, it describes the design and implementation of the RSS-based enterprise Web search engine, with emphasis on module design, workflow, and main data structures.

Keywords: Search Engine, RSS, Chinese Word Segmentation, Information Retrieval