4. $d(i,k) = 0$ only if $\mathbf{x}_i = \mathbf{x}_k$

5. $d(i,k) \le d(i,m) + d(m,k)$ for all $(i,k,m)$

Gower and Legendre (1986) show that for a metric dissimilarity matrix $[d(i,k)]$ only properties 1 and 4 are required; the other properties can be derived from these two.

The three most common Minkowski metrics are defined below and are illustrated in Figure 2.3.

[Figure 2.3 Minkowski metrics. Two patterns separated by 4 units on one feature and 2 units on the other give: Euclidean distance $\sqrt{4^2 + 2^2} \approx 4.472$; Manhattan distance $4 + 2 = 6$; sup distance $\max(4,2) = 4$.]

1. $r = 2$ (Euclidean distance)

$$d(i,k) = \left[\sum_{j=1}^{d}(x_{ij} - x_{kj})^2\right]^{1/2}$$

2. $r = 1$ (Manhattan, or taxicab, or city block distance)

$$d(i,k) = \sum_{j=1}^{d}\left|x_{ij} - x_{kj}\right|$$

3. $r \to \infty$ ("sup" distance)

$$d(i,k) = \max_{1 \le j \le d}\left|x_{ij} - x_{kj}\right|$$

Euclidean distance is the most common of the Minkowski metrics. The familiar geometric notions of invariance to translations and rotations of the pattern space are valid only for Euclidean distance. Accepted practice in the application area strongly affects the choice of proximity index; Euclidean distance seems to be preferred in engineering work. When all features are binary, the Manhattan metric is called the Hamming distance, or the number of features in which two patterns differ. Not all proximities encountered in applications are metrics. Tversky (1977) gives several examples to illustrate why a similarity is not always symmetric or transitive.
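The three cases can be computed with one short function. The sketch below is illustrative (the function name is not from the text) and assumes patterns given as plain sequences of numeric feature values:

    import math

    def minkowski(x, y, r):
        """Minkowski distance of order r between two patterns.

        r = 2 gives the Euclidean distance, r = 1 the Manhattan (city
        block) distance, and r = math.inf the "sup" distance.
        """
        diffs = [abs(a - b) for a, b in zip(x, y)]
        if math.isinf(r):
            return max(diffs)                      # max_j |x_ij - x_kj|
        return sum(d ** r for d in diffs) ** (1.0 / r)

    # The worked values of Figure 2.3: two patterns separated by 4 units
    # on one feature and 2 units on the other.
    x, y = (0.0, 0.0), (4.0, 2.0)
    print(minkowski(x, y, 2))          # Euclidean: sqrt(20) = 4.472...
    print(minkowski(x, y, 1))          # Manhattan: 4 + 2 = 6
    print(minkowski(x, y, math.inf))   # sup: max(4, 2) = 4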
The squared Mahalanobis distance has also been used as a distance measure in cluster analysis (Everitt, 1974). The expression for the squared Mahalanobis distance between patterns $\mathbf{x}_i$ and $\mathbf{x}_k$ is

$$d(i,k) = (\mathbf{x}_i - \mathbf{x}_k)^T S^{-1} (\mathbf{x}_i - \mathbf{x}_k)$$

where the matrix $S$ is the pooled sample covariance matrix, defined in Appendix D. The Mahalanobis distance incorporates the correlation between features and standardizes each feature to zero mean and unit variance. If $S$ is the identity matrix, the squared Mahalanobis distance is the same as the squared Euclidean distance.

The sample correlation coefficient defined below is an index of similarity for continuous, ratio data that can be used with patterns but is more frequently used to measure the degree of linear dependency between two features.

$$d(j,r) = \left| \frac{(1/n)\sum_{i=1}^{n}(x_{ij} - m_j)(x_{ir} - m_r)}{s_j s_r} \right|$$

where $m_j$ and $s_j^2$ are the sample mean and sample variance, respectively, for feature $j$ and are defined in Section 2.3. The absolute value is required because a negative and a positive correlation that differ in sign but not in absolute value have the same significance when measuring similarity. If $d(j,r) = 0$, then features $j$ and $r$ are linearly independent. One of the features is usually discarded if $d(j,r)$ is close to 1. When data are on an ordinal scale, measures of rank correlation (Conover, 1971; Anderberg, 1973; Goodman and Kruskal, 1954) can be applied.

2.2.2 Nominal Types

If continuous, ratio-scaled data are considered to be the "strongest" type of data, then binary, nominal-scaled data are the "weakest" type. Many actual measurements, especially data collected from human subjects, are binary and nominal. Matching coefficients are proximity indices for such data. For convenience, all feature values are taken to be either 0 or 1. These symbols should be assigned consistently; if "1" means "large" for the first feature and "0" means "small," "1" must also denote "large" for all other features measuring size. Proximity indices between the $i$th and $k$th patterns are derived from the following contingency table. For example, $a_{11}$ is the number of features that are 1 for both patterns, and $a_{10}$ is the number of features that are 1 for pattern $\mathbf{x}_i$ and 0 for pattern $\mathbf{x}_k$. The four entries sum to $d$, the number of features.

                 x_k
               1      0
    x_i   1   a11    a10
          0   a01    a00
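Before turning to coefficients built from this table, the two measures for continuous data defined above can be sketched in a few lines of Python with NumPy. The pooled covariance matrix $S$ is taken as given, the function names are illustrative, and np.std uses the $(1/n)$ convention, matching the formula for $d(j,r)$:

    import numpy as np

    def sq_mahalanobis(x_i, x_k, S):
        """Squared Mahalanobis distance (x_i - x_k)^T S^{-1} (x_i - x_k)."""
        diff = np.asarray(x_i, dtype=float) - np.asarray(x_k, dtype=float)
        # Solve S z = diff rather than forming S^{-1} explicitly.
        return float(diff @ np.linalg.solve(S, diff))

    def corr_similarity(col_j, col_r):
        """Absolute sample correlation |d(j, r)| between two feature columns."""
        j = np.asarray(col_j, dtype=float)
        r = np.asarray(col_r, dtype=float)
        covariance = np.mean((j - j.mean()) * (r - r.mean()))
        return abs(covariance / (j.std() * r.std()))

When $S$ is the identity matrix, sq_mahalanobis reduces to the squared Euclidean distance, in line with the remark above.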
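The four entries of the table themselves are simple tallies over the features. A minimal sketch, assuming two equal-length sequences of 0's and 1's (the function name is illustrative):

    def match_counts(x_i, x_k):
        """Tally (a11, a10, a01, a00) for two binary patterns."""
        a11 = a10 = a01 = a00 = 0
        for a, b in zip(x_i, x_k):
            if a == 1 and b == 1:
                a11 += 1
            elif a == 1 and b == 0:
                a10 += 1
            elif a == 0 and b == 1:
                a01 += 1
            else:
                a00 += 1
        return a11, a10, a01, a00   # the four counts sum to d

The matching coefficients defined next are ratios of these counts; applied to the two patterns of Example 2.3 below, this tally reproduces the contingency table shown there.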
Several measures of proximity can be defined from the four numbers $a_{00}$, $a_{01}$, $a_{10}$, $a_{11}$ in the contingency table for two binary vectors. Anderberg (1973) reviews most of them and puts them into context. Gower (1971) discusses the properties of general coefficients based on weighted combinations of these four numbers and shows the conditions under which proximity matrices formed from them are positive-definite matrices. Gower's index can also be used with a mixture of binary, qualitative, and quantitative features. Measures of proximity for discrete data have been proposed by Hall (1967), who described a heterogeneity function, and Bartels et al. (1970), who introduced the Calhoun distance as the percentage of patterns "between" two given patterns. Many other proximity measures have been defined for particular problems. Hubalek (1982) summarizes and evaluates proximity measures for binary vectors.

Two common matching coefficients between $\mathbf{x}_i$ and $\mathbf{x}_k$ are defined below:

1. Simple matching coefficient

$$d(i,k) = \frac{a_{00} + a_{11}}{a_{00} + a_{11} + a_{01} + a_{10}} = \frac{a_{00} + a_{11}}{d}$$

2. Jaccard coefficient

$$d(i,k) = \frac{a_{11}}{a_{11} + a_{01} + a_{10}} = \frac{a_{11}}{d - a_{00}}$$

The simple matching coefficient weights matches of 0's the same as matches of 1's, whereas the Jaccard coefficient ignores matches of 0's. The value 1 means "presence of effect" in some applications, so 1-1 matches are much more important than 0-0 matches. One example is that of questionnaire data. These two matching coefficients take different values for the same data, and their meanings and interpretations are not obvious. Accepted practice in the area of application seems to be the best guide to a choice of proximity index.

Example 2.3

Suppose that two individuals are given psychological tests consisting of lists of 20 questions to which "yes" (1) and "no" (0) responses are required. Assuming that the questions are phrased so that "yes" and "no" have consistent interpretations, meaningful matching coefficients can be computed from the two patterns.

    Feature Number    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
    Pattern 1 (x_1)   0 1 1 0 0 1 0 0 1 0  0  1  1  1  0  0  1  0  1  0
    Pattern 2 (x_2)   0 1 1 1 0 0 0 0 1 1  1  1  1  1  0  1  1  0  1  0
The matching coefficients are derived from the following table:

                 x_2
               1     0
    x_1   1    8     1
          0    4     7

Simple matching coefficient: 15/20 = 0.75
Jaccard coefficient: 8/13 = 0.615

A value of 1 for either coefficient would mean identical patterns. However, other values are not as easily interpreted.

Example 2.4

Suppose that two partitions of nine numerals are given and a measure of their proximity is desired.

$$\mathcal{C}_1 = \{(1,3,4,5),\ (2,6),\ (7),\ (8,9)\}$$

$$\mathcal{C}_2 = \{(1,2,3,4),\ (5,6),\ (7,8,9)\}$$

The characteristic function for a partition assigns the number 1 or 0 to a pair of numerals as follows:

$$T(i,j) = \begin{cases} 1 & \text{if numerals } i \text{ and } j \text{ are in the same subset in the partition} \\ 0 & \text{if not} \end{cases}$$

The characteristic functions $T_1$ and $T_2$, for partitions $\mathcal{C}_1$ and $\mathcal{C}_2$, respectively, are listed below in matrix form; $T_1$ is shown above the diagonal and $T_2$ is shown below the diagonal.

         1  2  3  4  5  6  7  8  9
    1    -  0  1  1  1  0  0  0  0
    2    1  -  0  0  0  1  0  0  0
    3    1  1  -  1  1  0  0  0  0
    4    1  1  1  -  1  0  0  0  0
    5    0  0  0  0  -  0  0  0  0
    6    0  0  0  0  1  -  0  0  0
    7    0  0  0  0  0  0  -  0  0
    8    0  0  0  0  0  0  1  -  1
    9    0  0  0  0  0  0  1  1  -

The two characteristic functions are matched term by term to obtain the following table and coefficients. The relative significance of these values is discussed in the next section.
                 T_2
               1     0
    T_1   1    4     4
          0    6    22

Simple matching coefficient: 26/36 = 0.722
Jaccard coefficient: 4/14 = 0.286

2.2.3 Missing Data

The problem of missing observations occurs often in practical applications. Suppose that some of the pattern vectors have missing feature values, as in

$$\mathbf{x}_i = (x_{i1}\ \ x_{i2}\ \ ?\ \ x_{i4}\ \ ?\ \ x_{i6})^T$$

where the third and fifth features have not been recorded for the $i$th pattern. Missing values occur because of recording error, equipment failure, the reluctance of subjects to provide information, carelessness, and unavailability of information. Should incomplete pattern vectors be discarded? Should missing values be replaced by averages or nominal values? Answers to these questions depend on the size of the data set and the type of analysis. Sneath and Sokal (1973), Kittler (1978), Dixon (1979), and Zagoruiko and Yolkina (1982) all treat the problem of missing data.

Dixon (1979) describes several simple, inexpensive, easy-to-implement, and general techniques for handling missing values. These techniques either eliminate part of the data, estimate the missing values, or compute an estimated distance between two vectors with missing values. We summarize some of these techniques here.

1. Simply delete the pattern vectors or features that contain missing values. This technique does not lead to the most efficient utilization of the data and should be used only in situations where the number of missing values is very small.

2. Suppose that the $j$th feature value in the $i$th pattern vector is missing. Find the $K$ nearest neighbors of $\mathbf{x}_i$ and replace the missing value $x_{ij}$ by the average of the $j$th feature of the $K$ nearest neighbors. The value of $K$ should be a function of the size of the pattern matrix.

3. The distance between two vectors $\mathbf{x}_i$ and $\mathbf{x}_k$ containing missing values is computed as follows. First define the distance $d_j$ between the two patterns along the $j$th feature:

$$d_j = \begin{cases} 0 & \text{if } x_{ij} \text{ or } x_{kj} \text{ is missing} \\ \left|x_{ij} - x_{kj}\right| & \text{otherwise} \end{cases}$$
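A minimal sketch of techniques 2 and 3, assuming None marks a missing value. The function names are illustrative, and the final rescaling of the summed distance by $d/(d - d_0)$, where $d_0$ counts the features missing in either pattern, is an assumption of this sketch (so that missing features do not shrink the distance) rather than a formula quoted from the text:

    import math

    def knn_impute(X, i, j, K):
        """Technique 2: replace a missing x_ij by the average of feature j
        over the K nearest neighbors of pattern i, where distances are
        computed on the features observed in both patterns."""
        def partial_dist(a, b):
            sq = [(p - q) ** 2 for p, q in zip(a, b)
                  if p is not None and q is not None]
            return math.sqrt(sum(sq)) if sq else math.inf
        # Candidate neighbors must have feature j observed.
        candidates = [k for k in range(len(X))
                      if k != i and X[k][j] is not None]
        candidates.sort(key=lambda k: partial_dist(X[i], X[k]))
        neighbors = candidates[:K]
        return sum(X[k][j] for k in neighbors) / len(neighbors)

    def dist_with_missing(x_i, x_k):
        """Technique 3: d_j = 0 when either value is missing and
        |x_ij - x_kj| otherwise; the sum is rescaled by d / (d - d_0)
        (an assumption of this sketch)."""
        d = len(x_i)
        d0 = sum(1 for p, q in zip(x_i, x_k) if p is None or q is None)
        total = sum(abs(p - q) for p, q in zip(x_i, x_k)
                    if p is not None and q is not None)
        return total * d / (d - d0)   # undefined if every feature is missing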