Peking University: "Pattern Recognition" course teaching resources (reference material): Algorithms for Clustering Data


Document contents (about 326 pages)

20 Data Representation Chap. 2

Then the distance between $\mathbf{x}_i$ and $\mathbf{x}_k$ is written as

$$d(i,k) = \frac{d}{d - d_0} \sum_{j=1}^{d} d_j^2$$

where $d_0$ is the number of features missing in $\mathbf{x}_i$ or $\mathbf{x}_k$ or both. Note that if there are no missing values, then $d(i,k)$ defined above is the squared Euclidean distance.

4. Let $\bar{d}_j$ denote the average distance between all pairs of patterns along the jth feature, defined as follows:

$$\bar{d}_j = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{k=i+1}^{n} |x_{ij} - x_{kj}|$$

where $n$ is the number of patterns. Now define the distance between two patterns along the jth feature as

$$d_j = \begin{cases} \bar{d}_j & \text{if } x_{ij} \text{ or } x_{kj} \text{ is missing} \\ x_{ij} - x_{kj} & \text{otherwise} \end{cases}$$

Finally, the distance between patterns $\mathbf{x}_i$ and $\mathbf{x}_k$ is written as

$$d(i,k) = \sum_{j=1}^{d} d_j^2$$

Based on experimental results, Dixon (1979) recommends method 3 as the best overall method.

2.2.4 Probabilistic Indices

Goodall (1966) proposed an index of similarity that has a uniform distribution when the data are "random." The idea of using a probability scale to assess the significance of a proximity measure appears in Hamdan and Tsokos (1971), who define an information measure for a contingency table, and Brockett et al. (1981), who used the asymptotic distribution of an information-theoretic measure on questionnaire data. Li (1984) provided the most recent example of this type of measure. Before explaining the proximity measure, we reexamine the simple matching and Jaccard coefficients in light of their distributions under "random" data.

Matching coefficients measure the degree of similarity between objects. We know that their value is between 0 and 1 but do not know how large a value is required before two objects can be called "close." We now examine baseline distributions for the simple matching coefficient and the Jaccard coefficient. A baseline distribution describes a state of "randomness," or the absence of structure, for gauging the magnitude of a matching coefficient. Baseline distributions are used extensively in Chapter 4. Two vectors will be called "close" if a similarity as large as the one observed is unlikely under a baseline distribution.
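As an aside on the missing-value distances of methods 3 and 4 above, the following is a minimal Python sketch, not the authors' implementation. It rests on several assumptions that are not stated in the text: missing values are encoded as `None`; in method 3 a missing feature contributes zero to the sum before the $d/(d - d_0)$ rescaling; and in method 4 the average $\bar{d}_j$ is taken over only those pairs in which feature j is observed for both patterns. The function names are hypothetical.

```python
from itertools import combinations


def distance_method3(x, y):
    """Rescaled squared distance that skips features missing in x or y.

    Assumption: a missing value is encoded as None and contributes 0.
    """
    d = len(x)
    d0 = sum(1 for a, b in zip(x, y) if a is None or b is None)
    if d0 == d:
        raise ValueError("no feature is observed in both patterns")
    observed = sum((a - b) ** 2 for a, b in zip(x, y)
                   if a is not None and b is not None)
    return d / (d - d0) * observed


def average_feature_distances(patterns):
    """Mean |x_ij - x_kj| per feature over pairs with both values observed.

    Assumption: pairs with a missing value in feature j are left out of
    the average for that feature.
    """
    averages = []
    for j in range(len(patterns[0])):
        diffs = [abs(p[j] - q[j]) for p, q in combinations(patterns, 2)
                 if p[j] is not None and q[j] is not None]
        if not diffs:
            raise ValueError(f"feature {j} has no fully observed pair")
        averages.append(sum(diffs) / len(diffs))
    return averages


def distance_method4(x, y, averages):
    """Squared distance that substitutes the average feature distance
    wherever either pattern is missing that feature."""
    total = 0.0
    for a, b, avg in zip(x, y, averages):
        dj = avg if (a is None or b is None) else a - b
        total += dj ** 2
    return total


patterns = [
    [1.0, None, 3.0],
    [2.0, 5.0, 1.0],
    [4.0, 1.0, 2.0],
]
averages = average_feature_distances(patterns)
print(distance_method3(patterns[0], patterns[1]))
print(distance_method4(patterns[0], patterns[1], averages))
```

With no missing values, `distance_method3` reduces to the squared Euclidean distance, which matches the remark in the text.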
The simple matching coefficient between two d-position binary vectors a and b can be expressed as

are not as important as 1-1 matches. Consider the population $\Omega_2$ of all $d!$ pairs of vectors that can be obtained by permuting the entries of one of the vectors. Not all pairs of vectors are distinct. Probability function $P_{\Omega_2}$ assigns each pair of vectors probability mass $1/d!$, thus establishing a new baseline distribution.

Let $A_{11}$ be the number of 1-1 matches in a randomly selected pair of vectors from population $\Omega_2$. Let $N_a$ be the number of 1's in $\mathbf{a}$ and let $N_b$ be the number of 1's in $\mathbf{b}$. All pairs of vectors in $\Omega_2$ have $N_a$ and $N_b$ 1's. For example, there are six 1's in [10111011]. The probability that $A_{11} = k$ can be obtained from the hypergeometric distribution (Appendix B) under $P_{\Omega_2}$. In the notation of Appendix B, we have a population of size $d$ with $N_a$ defectives and we take a sample of size $N_b$. Of course, the roles of $N_a$ and $N_b$ can be reversed. The probability of exactly $k$ matches between pairs of 1's is

$$P_{\Omega_2}(A_{11} = k) = H(d, N_a, N_b, k)$$

This probability expression requires that

$$\max\{0,\, N_a + N_b - d\} \le k \le \min\{N_a, N_b\}$$

The S-statistic defined below is essentially the inverse of the hypergeometric cumulative distribution function. Such statistics have been used elsewhere (Kempthorne, 1952). The additive factor ensures that $S$ has a (continuous) uniform distribution over the unit interval since $U$ is a continuous uniform random variable over the unit interval. If $t$ is the number of 1-1 matches observed between d-vectors $\mathbf{a}$ and $\mathbf{b}$, the S-measure of proximity is

$$S(\mathbf{a}, \mathbf{b}) = \sum_{k < t} H(d, N_a, N_b, k) + H(d, N_a, N_b, t)\, U$$

Since the distribution of $S$ is uniform under $P_{\Omega_2}$, the value of $S$ is implicitly meaningful. For example, the probability that $S$ is $z$ or more is $1 - z$ for $z$ between 0 and 1, as shown in Figure 2.4. This proximity has been used in the analysis of questionnaire data (Li and Dubes, 1984) and in a template-matching problem (Li and Dubes, 1985). The additive factor does not contribute much to the value of $S$ except when $d$ is small.
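The S-measure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: $H$ is taken to be the standard hypergeometric probability $\binom{N_a}{k}\binom{d-N_a}{N_b-k}/\binom{d}{N_b}$, the vectors are lists of 0/1 integers, and the function names and the optional `u` argument (which fixes the uniform variate $U$ for reproducibility) are hypothetical.

```python
import math
import random


def hypergeom_pmf(d, n_a, n_b, k):
    """Probability of exactly k 1-1 matches: population of size d with
    n_a "defectives", sample of size n_b (standard hypergeometric pmf)."""
    lo = max(0, n_a + n_b - d)
    hi = min(n_a, n_b)
    if k < lo or k > hi:
        return 0.0
    return (math.comb(n_a, k) * math.comb(d - n_a, n_b - k)
            / math.comb(d, n_b))


def s_measure(a, b, u=None):
    """S(a, b) = sum_{k<t} H(d, Na, Nb, k) + H(d, Na, Nb, t) * U,
    where t is the observed number of 1-1 matches."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    d = len(a)
    n_a, n_b = sum(a), sum(b)
    t = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    if u is None:
        u = random.random()  # U is uniform on the unit interval
    below = sum(hypergeom_pmf(d, n_a, n_b, k) for k in range(t))
    return below + hypergeom_pmf(d, n_a, n_b, t) * u


a = [1, 0, 1, 1, 1, 0, 1, 1]  # the six-ones vector from the text
b = [1, 1, 1, 0, 1, 0, 1, 1]  # hypothetical second vector, also six ones
print(s_measure(a, b, u=0.5))  # 15/28 + (12/28) * 0.5 = 0.75
```

Here $d = 8$, $N_a = N_b = 6$ and $t = 5$, so the feasible values of $k$ are 4, 5 and 6 with probabilities 15/28, 12/28 and 1/28. Fixing `u` is only a convenience for checking the arithmetic; leaving it unset draws a fresh $U$ on each call, as the definition requires.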
2.3 NORMALIZATION

Suppose that the raw data consist of an $n \times d$ pattern matrix in which all features are continuous and on a ratio scale. Raw data, or the actual measurements, are seldom used just as they are recorded unless a probabilistic model for pattern generation is available. Some normalization is usually employed based on the requirements of the analysis. Preparing the data for a cluster analysis requires some sort of normalization that takes into account the measure of proximity. For example, Euclidean distance is a popular and familiar index of dissimilarity, but it implicitly assigns more weighting to features with large ranges than to those with small ranges. Scaling one feature in miles and a second feature in inches makes the second feature numerically overpower the first. We present a normalization scheme that remedies some of these problems.

As explained earlier in this section, the basic unit of data is called a pattern, denoted by a d-vector, whose components are scalars called features. The ith pattern is denoted by the (column) vector $\mathbf{x}_i^*$ in this section and the jth feature value for the ith pattern is denoted by $x_{ij}^*$. The asterisk denotes "raw" or unnormalized data. If $n$ is the number of patterns in the analysis, the pattern matrix is the $n \times d$ matrix $\mathcal{A}^*$:

$$\mathcal{A}^* = [\mathbf{x}_1^* \; \mathbf{x}_2^* \; \cdots \; \mathbf{x}_n^*]^T = \begin{bmatrix} x_{11}^* & x_{12}^* & \cdots & x_{1d}^* \\ x_{21}^* & x_{22}^* & \cdots & x_{2d}^* \\ \vdots & \vdots & & \vdots \\ x_{n1}^* & x_{n2}^* & \cdots & x_{nd}^* \end{bmatrix}$$

Each row of $\mathcal{A}^*$ is a pattern. Each point in the pattern space is a potential pattern. We treat the case when $n > d$, so the patterns are visualized as a number of points scattered around the pattern space.

The jth feature average, $m_j$, and jth feature variance, $s_j^2$, are defined as the sample mean and the sample variance for the jth feature.

$$m_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}^*$$

$$s_j^2 = \frac{1}{n} \sum_{i=1}^{n} (x_{ij}^* - m_j)^2$$

The simplest type of normalization subtracts the feature means:

$$x_{ij} = x_{ij}^* - m_j \qquad (2.1)$$

This normalization makes feature values invariant to rigid displacements of the coordinates. The second type of normalization translates and scales the axes so that all the features have zero mean and unit variance:

$$x_{ij} = \frac{x_{ij}^* - m_j}{s_j} \qquad (2.2)$$

Removing the asterisk indicates that the pattern has been normalized, but the type of normalization must be clear from the context. Other types of normalization include scaling by the range (Carmichael et al., 1968) and a heterogeneity measure (Hall, 1969). Lumelsky (1982) incorporates the normalization into the clustering procedure. Normalization or scaling is not always desirable. For example, if the spread among the patterns is due to the presence of clusters, the normalization in Eq. (2.2) can change the interpoint distances and can alter the separation between natural clusters as demonstrated in Figure 2.5.
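The two normalizations of Eqs. (2.1) and (2.2) can be illustrated with a short Python sketch. It assumes the pattern matrix is a list of rows (one pattern per row), uses the divide-by-$n$ form of the feature variance, and refuses to scale a constant feature. The function names and the three-pattern miles/inches data set are hypothetical and chosen only to show how the inches feature dominates the raw Euclidean distance.

```python
import math


def feature_stats(patterns):
    """Per-feature mean m_j and variance s_j^2 (divide-by-n form)."""
    n, d = len(patterns), len(patterns[0])
    means = [sum(row[j] for row in patterns) / n for j in range(d)]
    variances = [sum((row[j] - means[j]) ** 2 for row in patterns) / n
                 for j in range(d)]
    return means, variances


def center(patterns):
    """Eq. (2.1): subtract the feature means."""
    means, _ = feature_stats(patterns)
    return [[v - m for v, m in zip(row, means)] for row in patterns]


def standardize(patterns):
    """Eq. (2.2): zero mean and unit variance for every feature."""
    means, variances = feature_stats(patterns)
    if any(v == 0.0 for v in variances):
        raise ValueError("a constant feature cannot be scaled to unit variance")
    stds = [math.sqrt(v) for v in variances]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)]
            for row in patterns]


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


# Hypothetical raw data: feature 1 in miles, feature 2 in inches.
raw = [
    [1.0, 63360.0],
    [2.0, 190080.0],
    [3.0, 126720.0],
]
normalized = standardize(raw)
print(squared_euclidean(raw[0], raw[1]))                # dominated by inches
print(squared_euclidean(normalized[0], normalized[1]))  # both features count
```

As the text cautions, Eq. (2.2) is not always appropriate: when the spread of a feature is itself caused by cluster structure, forcing unit variance can shrink the separation between clusters.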
