Ground Truth Khmer Cultural Center
Data Explosion
• The digital universe was ~281 exabytes (281 billion gigabytes) in 2007; it would grow 10 times by 2011
• Images and video, captured by over one billion devices (camera phones), are the major source
• To archive and effectively use this data, we need tools for data categorization
http://eon.businesswire.com/releases/information/digital/prweb509640.htm
http://www.emc.com/collateral/analyst-reports/diverse-exploding-digital-universe.pdf
Data Clustering
• Grouping of objects into meaningful categories
• Classification vs. clustering
• Unsupervised learning, exploratory data analysis, grouping, clumping, taxonomy, typology, Q-analysis
• Given a representation of n objects, find K clusters based on a measure of similarity
• Partitional vs. hierarchical
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data, Prentice Hall, 1988. (available for download at: http://dataclustering.cse.msu.edu/)
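The problem statement on this slide (given n objects, find K clusters under a similarity measure) can be illustrated with a small partitional example. The sketch below is a minimal K-means-style procedure, assuming points are numeric tuples and similarity is squared Euclidean distance; the names `kmeans`, `squared_distance`, and `initial_centers` are illustrative and not taken from the slides. The starting centers are supplied by the caller, which is also an assumption: results of this kind of procedure generally depend on initialization.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centers, max_iterations=100):
    """Partition `points` into len(initial_centers) clusters.

    Alternates two steps until the labels stop changing:
      1. assign each point to its nearest center,
      2. move each center to the mean of its assigned points.
    """
    centers = [tuple(c) for c in initial_centers]
    k = len(centers)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: index of the closest center for every point.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the partition is stable
        labels = new_labels
        # Update step: recompute each center as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return labels, centers


# Two well-separated groups of 2-D points (made-up data for illustration).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, initial_centers=[points[0], points[-1]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Here n = 6 and K = 2. A hierarchical method would instead produce a nested sequence of partitions rather than a single flat one.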
Why Clustering?
• Natural classification: degree of similarity among forms (phylogenetic relationship or taxonomy)
• Data exploration: discover underlying structure, generate hypotheses, detect anomalies
• Compression: method for organizing data
• Applications: any scientific field that collects data! Astronomy, biology, marketing, engineering, ...
Google Scholar: ~1500 clustering papers in 2007 alone!
Historical Developments
• Cluster analysis first appeared in the title of a 1954 article analyzing anthropological data (JSTOR)
• Hierarchical Clustering: Sneath (1957), Sorensen (1957)
• K-Means: independently discovered by Steinhaus¹ (1956), Lloyd² (1957), Cox³ (1957), Ball & Hall⁴ (1967), MacQueen⁵ (1967)
• Mixture models (Wolfe, 1970)
• Graph-theoretic methods (Zahn, 1971)
• K Nearest neighbors (Jarvis & Patrick, 1973)
• Fuzzy clustering (Bezdek, 1973)
• Self Organizing Map (Kohonen, 1982)
• Vector Quantization (Gersho and Gray, 1992)
¹Acad. Polon. Sci., ²Bell Tel. Report, ³JASA, ⁴Behavioral Sci., ⁵Berkeley Symp. Math Stat & Prob.