Peking University: Pattern Recognition course teaching resources (reference material): Algorithms for Clustering Data

Document contents (approximately 326 pages)

Chap. 1 Introduction

devoted to theoretical and practical issues in exploratory data analysis itself. We have tried to provide an understandable but complete exposition of cluster analysis that uses only enough mathematical detail to make the material precise. The space limitation makes it difficult for us to cover every aspect of cluster analysis in great detail. We emphasize informal algorithms for clustering methods, and analysis of results. However, a reader may have difficulty in implementing the algorithms from the description given in Chapter 3. Most of the well-known clustering algorithms have been implemented and are available as part of clustering and statistical software packages (see Section 3.4). We see cluster analysis as a tool to be used, not as a theory to be developed.

In Chapter 2 we present our idea of data with emphasis on ways of viewing data, such as projections based on eigenvectors and multidimensional scaling. Four ways in which data are analyzed that are related to cluster analysis are reviewed so as to clarify the role of cluster analysis. Clustering methods and algorithms themselves are described in Chapter 3. The primary division among clustering methods is between hierarchical and partitional methods. Both approaches are carefully developed and several examples are provided. The availability of clustering software, methodology by which cluster analysis can be applied, and comparative studies of various clustering techniques are also summarized in Chapter 3. A comparative analysis of clustering methods is useful since empirical evidence seems to be the only practical guide to the selection of clustering methods.
The crucial step in applications of cluster analysis is the interpretation of the results. In Chapter 4 we present a comprehensive summary of procedures for quantitatively verifying the results of cluster analysis. Monte Carlo techniques along with the method of bootstrapping are also introduced in Chapter 4, because they are useful for estimating the distributions of various cluster statistics. Applications of clustering to an engineering domain (image processing and computer vision) are discussed in Chapter 5. The book also contains eight appendices to review briefly related topics of pattern recognition, commonly used Gaussian and hypergeometric distributions, linear algebra, scatter matrices, factor analysis, multivariate analysis of variance, and graph theory. An algorithm to generate clustered data is also given in one of the appendices. We hasten to add that these appendices contain only elementary material provided for the convenience of the reader. The reader should consult standard textbooks for detailed coverage of these topics.

This is not the first book on cluster analysis. Anderberg (1973) has written the most comprehensive book for those who want to use cluster analysis. We refer frequently to Anderberg's excellent exposition. Everitt (1974) explains cluster analysis in a very readable way but contains fewer details than we feel are necessary. Tryon and Bailey (1970) wrote one of the first books on cluster analysis, but it is restricted to a single approach. Jardine and Sibson (1971) concentrate on mathematical foundations. Other early books are those of Duran and Odell (1974) and Clifford and Stephenson (1975). Sneath and Sokal (1973) include an excellent chapter on hierarchical clustering. Hartigan (1975) provides a number of interesting projects, and Lorr (1983) presents cluster analysis especially for social scientists

Chap. 2 Data Representation

2.1 DATA TYPES AND DATA SCALES

Clustering algorithms group objects, or data items, based on indices of proximity between pairs of objects. The objects themselves have been called individuals, cases, subjects, and OTUs (operational taxonomic units) in various applications. This book uses pattern recognition terminology (Appendix A). A set of objects comprises the raw data for a cluster analysis and can be described by two standard formats: a pattern matrix and a proximity matrix.

2.1.1 Pattern Matrix

If each object in a set of n objects is represented by a set of d measurements (or attributes or scores), each object is represented by a pattern, or d-place vector. The set itself is viewed as an n × d pattern matrix. Each row of this matrix defines a pattern and each column denotes a feature, or measurement. For example, when clustering time functions such as biological signals or radar echoes, a feature could be a sample value taken at a particular time; the average value of the signal could also be a feature. The set of feature values for a signal is a pattern. We require that the same features be measured for all patterns. If patients in a hospital are to be clustered, each row in the pattern matrix would represent one individual. The features, or columns in the pattern matrix, could represent responses to questions on an admission form or the results of diagnostic tests. The same questions must be asked of every patient and the same diagnostic tests must be performed on all patients in a particular experiment. Categorical, or extrinsic, information, such as age, sex, religion, or hair color, is normally used to interpret the results of a cluster analysis but is not part of the pattern matrix.
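As a minimal sketch of the two formats just described, the Python fragment below stores a small pattern matrix as a list of rows and derives a proximity matrix from it. The patient data, the feature names, the helper names shape and proximity_matrix, and the choice of Euclidean distance as the proximity index are all assumptions made for illustration; none of them comes from the text, and the features are deliberately left unscaled.

```python
import math

# Hypothetical data: n = 4 patients (patterns), d = 3 features each.
feature_names = ["temperature_c", "heart_rate_bpm", "systolic_mmhg"]
pattern_matrix = [
    [36.8, 72.0, 118.0],
    [38.9, 95.0, 130.0],
    [36.6, 70.0, 121.0],
    [39.1, 99.0, 135.0],
]


def shape(matrix):
    """Return (n, d), checking that every pattern has the same d features."""
    n = len(matrix)
    d = len(matrix[0])
    if any(len(row) != d for row in matrix):
        raise ValueError("the same features must be measured for all patterns")
    return n, d


def proximity_matrix(matrix):
    """Build an n x n matrix of pairwise Euclidean distances.

    Euclidean distance is an assumed proximity index, used here only
    to show how one format can be derived from the other.
    """
    n, _ = shape(matrix)
    return [
        [math.dist(matrix[i], matrix[j]) for j in range(n)]
        for i in range(n)
    ]


n, d = shape(pattern_matrix)
prox = proximity_matrix(pattern_matrix)

print(f"pattern matrix: {n} patterns x {d} features")
for row in prox:
    print("  ".join(f"{value:6.2f}" for value in row))
```

Each row of pattern_matrix is one pattern and each column is one feature, so the check in shape mirrors the requirement that the same features be measured for every pattern. Because the features are in different units, the unscaled distances here are dominated by the columns with the largest numeric ranges.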
The d features are usually pictured as a set of orthogonal axes. The n patterns are then points embedded in a d-dimensional space called a pattern space. We use the word "pattern" in the technical sense as a point in a pattern space, not to describe the topological arrangement of objects. A cluster can be visualized as a collection of patterns which are close to one another or which satisfy some spatial relationships. The task of a clustering algorithm is to identify such natural groupings in spaces of many dimensions. Although visual perception is limited to three dimensions, one must be careful not to think automatically of clustering problems as two- or three-dimensional. The real benefit of cluster analysis is to organize multidimensional data where visual perception fails.

Example 2.1

This example shows the pattern matrix representation of a data set, called the 80X data, that will be used to demonstrate several projection and clustering methods. This data set was derived from the Munson handprinted FORTRAN character set, which has been used extensively in pattern recognition studies and consists of handwritten characters from several authors, each of whom wrote three alphabets of 46 characters. The handwritten characters
