
Peking University: "Pattern Recognition" course teaching resources (reference material), Algorithms for Clustering Data

Document contents (about 326 pages)

Sec. 2.1 Data Types and Data Scales

2.1.2 Proximity Matrix

Clustering methods require that an index of proximity, or alikeness, or affinity, or association be established between pairs of patterns. This index can be computed from a pattern matrix, as discussed in Section 2.2, or can be formed from raw data. The data in some psychometric applications are collected as proximities. For example, several individuals can be asked to rank their preference for brands of soap, and the proximity between two brands can be computed by averaging over individuals. An individual can also be asked to provide proximities directly by judging similarity between brands on a scale from 1 to 10.

A proximity matrix $[d(i, j)]$ accumulates the pairwise indices of proximity in a matrix in which each row and column represents a pattern. We ignore the diagonal entries of a proximity matrix since all patterns are assumed to have the same degree of proximity with themselves. We also assume that all proximity matrices are symmetric, so all pairs of objects have the same proximity index, independent of the order in which they are written. Hubert (1973) and Gower (1977) consider nonsymmetric proximity matrices.

A proximity index is either a similarity or a dissimilarity. The more the ith and jth objects resemble one another, the larger a similarity index and the smaller a dissimilarity index. For example, Euclidean distance between two patterns in a pattern space is a dissimilarity index, whereas the correlation coefficient is a similarity index. Several proximity indices are described in Section 2.2. Note that a pattern matrix can easily be converted to a proximity matrix with proximity indices, but projection algorithms (Sections 2.4 and 2.5) or multidimensional scaling techniques (Section 2.7) are needed to convert a proximity matrix into a pattern matrix.
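The conversion from a pattern matrix to a proximity matrix can be illustrated with a short sketch. The pattern values below are hypothetical, and the helper names (`euclidean`, `correlation`, `proximity_matrix`) are chosen for illustration only; they do not come from the text. Euclidean distance plays the role of the dissimilarity index and the correlation coefficient the role of the similarity index.

```python
import math

def euclidean(a, b):
    """Dissimilarity index: Euclidean distance between two patterns."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation(a, b):
    """Similarity index: correlation coefficient between two patterns."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def proximity_matrix(patterns, index):
    """Apply a pairwise index to every pair of rows of a pattern matrix."""
    n = len(patterns)
    return [[index(patterns[i], patterns[k]) for k in range(n)]
            for i in range(n)]

# Hypothetical pattern matrix: 4 patterns (rows) with 3 features (columns).
patterns = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.5],
    [3.0, 1.0, 0.0],
    [1.0, 2.0, 3.0],   # identical to the first pattern
]

D = proximity_matrix(patterns, euclidean)    # dissimilarities
S = proximity_matrix(patterns, correlation)  # similarities
```

For the two identical patterns the dissimilarity is zero while the similarity takes its largest value, which matches the convention that resemblance raises a similarity index and lowers a dissimilarity index.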
Example 2.2

We present an example of a proximity matrix that was used by Levine (1977a) to study the perceived similarity of numerical digits by subjects. The subjects were eight graduate students who observed a single numerical digit (0-9) as a 7 x 9 dot matrix character for variable time on a CRT display system. A noise field was immediately displayed on the CRT so that the digit was not clearly visible. The subjects had to respond what digit was present in the noisy image. Each student looked at only 50 stimuli, so Table 2.2 shows the aggregate response of all eight students. The table shows the confusion for each possible stimulus-response combination. The diagonal entry in the second row of the matrix in Table 2.2 indicates that of the 400 stimuli presented for digit 1, 269 correct responses were made by the subjects. Levine (1977a) defined the frequency of confusion between stimuli to be the measure of similarity. Thus digit pair 9 and 3 are considered more similar than digit pair 9 and 1. Notice that this similarity matrix is nonsymmetric. Multidimensional scaling and hierarchical clustering algorithms were applied to this matrix by Levine to study the evidence of hierarchical structure in the organization of visual stimuli.

TABLE 2.2 Confusion Matrix for Stimulus-Response Combination (rows: stimulus; columns: response)

Response 0 2 3 4 5 6 8 9 0 45 10 68 9 39 42 27 59 32 16 269 26 9 40 10 6 5 330 13 10 9 56 3 40 232 11 33 17 3 3 6 Stimulus 7 19 13 290 7 18 6 14 12 280 20 3 10 6 6 213 4 46 152 21 5 9 270 2 3 71 10 46 37 11 120 30 9 2 24 6 56 18 34 196

Data Representation, Chap. 2

2.1.3 Data Types and Scales

Now that the two primary formats for representing data (the pattern matrix and the proximity matrix) have been established, we turn to the characteristics of the data themselves. Anderberg (1973) outlines a categorization of data types and data scales appropriate for cluster analysis that is summarized below. Recognizing the type and scale of data will help in selecting a clustering algorithm.
Data type refers to the degree of quantization in the data. A single feature can be typed as binary, discrete, or continuous. Binary features have exactly two values and occur, for example, in "yes-no" responses on a questionnaire. A discrete feature has a finite, usually small, number of possible values. For example, samples of a speech signal can be quantized to 16, or $2^4$, levels, so a feature representing the sample can be coded into 4 bits. All measurements and all numbers stored in computers have a finite number of significant digits, so, strictly speaking, all features are discrete. However, it is often convenient to think of a feature value as a point on the real line that can take on any real value in a fixed range of values. Such a feature is called continuous.

Proximity indices can also be binary, discrete, or continuous. For example, suppose that a set of objects is partitioned into mutually exclusive, all-inclusive subsets. One binary index of similarity assigns zero to a pair of objects that fall in different subsets and one to a pair in the same subset. A rank order proximity index is an integer from 1 to n(n-1)/2, where n is the number of objects. The integers represent the relative order of the proximities. Such an index is discrete. The Euclidean distance proximity index, defined for patterns in a pattern space, is typed continuous.

The second trait of a feature and of a proximity index is the data scale, which indicates the relative significance of numbers. Data scales can be dichotomized into qualitative (nominal and ordinal) scales and quantitative (interval and ratio) scales. A nominal scale is not really a scale at all because numbers are simply used as names. For example, a (yes, no) response could be coded as (0, 1) or (1, 0) or (50, 100); the numbers themselves are meaningless in any quantitative sense. The other qualitative scale, and the weakest numerical scale, is the ordinal scale; the numbers have meaning only in relation to one another. For example

Data type and scale are not always of one's choosing. Recognizing type and scale is important in both forming proximity indices and interpreting the results of a cluster analysis. For example, one should realize that human subjects are good at generating binary, qualitative data but that instruments are required to produce continuous, quantitative data. A human subject required to generate discrete, interval data will be under greater stress than one asked to provide binary, ordinal data, so the reliability of data can depend on type and scale. Anderberg (1973) explains conversions from one scale to another. Clustering methods (Chapter 3) use quantitative indices of proximity to assign a cluster label, or name, to each object, so a nominal scale can be generated from a quantitative scale. Multidimensional scaling (Section 2.7) changes ordinal scales into ratio scales. The various formats, types, and scales for data are summarized in Figure 2.2.

2.2 PROXIMITY INDICES

This section explains some of the more common proximity indices. Anderberg (1973) provides a thorough review of measures of association and their interrelationships. A proximity index between the ith and kth patterns is denoted $d(i, k)$ and must satisfy the following three properties:

1. (a) For a dissimilarity: $d(i, i) = 0$, all $i$
   (b) For a similarity: $d(i, i) \ge \max_k d(i, k)$, all $i$
2. $d(i, k) = d(k, i)$, all $(i, k)$
3. $d(i, k) \ge 0$, all $(i, k)$

Ratio and nominal proximity indices are discussed in separate sections.

2.2.1 Ratio Types

A proximity index can be determined in several ways. Suppose that we begin with a pattern matrix $[x_{ij}]$, where $x_{ij}$ is the jth feature for the ith pattern. All features are continuous and measured on a ratio scale. The most common proximity index for such patterns is the Minkowski metric, which measures dissimilarity. The ith pattern, which is the ith row of the pattern matrix, is denoted by the column vector $\mathbf{x}_i$.
$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{id})^T, \quad i = 1, 2, \ldots, n$

Here d is the number of features, n the number of patterns, and T denotes vector transpose. The Minkowski metric is defined by

$d(i, k) = \left( \sum_{j=1}^{d} |x_{ij} - x_{kj}|^r \right)^{1/r}$, where $r \ge 1$.

All Minkowski metrics satisfy the additional metric properties stated below. Property 5 is called the triangle inequality.
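As a rough illustration, the Minkowski metric and the three proximity-index properties from Section 2.2 can be sketched in a few lines. The function names and the sample patterns are hypothetical, not taken from the text. The triangle inequality is checked in its usual form, $d(i, k) \le d(i, m) + d(m, k)$, which is assumed here because the property list itself is not reproduced in this excerpt.

```python
def minkowski(x_i, x_k, r=2.0):
    """Minkowski dissimilarity between two equal-length feature vectors."""
    if r < 1:
        raise ValueError("r must be at least 1")
    return sum(abs(a - b) ** r for a, b in zip(x_i, x_k)) ** (1.0 / r)

def dissimilarity_matrix(patterns, r=2.0):
    """Proximity matrix [d(i, k)] for the rows of a pattern matrix."""
    n = len(patterns)
    return [[minkowski(patterns[i], patterns[k], r) for k in range(n)]
            for i in range(n)]

def satisfies_basic_properties(d, tol=1e-12):
    """Properties 1(a), 2 and 3: zero diagonal, symmetry, nonnegativity."""
    n = len(d)
    zero_diagonal = all(abs(d[i][i]) <= tol for i in range(n))
    symmetric = all(abs(d[i][k] - d[k][i]) <= tol
                    for i in range(n) for k in range(n))
    nonnegative = all(d[i][k] >= 0 for i in range(n) for k in range(n))
    return zero_diagonal and symmetric and nonnegative

def satisfies_triangle_inequality(d, tol=1e-12):
    """Assumed form: d(i, k) <= d(i, m) + d(m, k) for all i, k, m."""
    n = len(d)
    return all(d[i][k] <= d[i][m] + d[m][k] + tol
               for i in range(n) for k in range(n) for m in range(n))

# Hypothetical pattern matrix: n = 3 patterns, d = 2 features.
patterns = [[0.0, 0.0], [3.0, 4.0], [6.0, 1.0]]

city_block = dissimilarity_matrix(patterns, r=1)  # r = 1
euclid = dissimilarity_matrix(patterns, r=2)      # r = 2, Euclidean distance
```

With these patterns the first two rows are 7 apart for r = 1 and 5 apart for r = 2, so the choice of r changes the numerical values of the index while both matrices still pass the property checks.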
