Big Data Analysis and Mining
Lecture 2: Getting to Know Your Data
Weixiong Rao (饶卫雄)
School of Software Engineering, Tongji University (同济大学软件学院)
2015 Fall
wxrao@tongji.edu.cn
2021/2/9
Getting to Know Your Data
◼ Data Objects and Attribute Types
◼ Basic Statistical Descriptions of Data
◼ Data Visualization
◼ Measuring Data Similarity and Dissimilarity
◼ Summary
Types of Data Sets
◼ Record
◆ Relational records
◆ Data matrix, e.g., numerical matrix, crosstabs
◆ Document data: text documents represented as term-frequency vectors
◆ Transaction data
◼ Graph and network
◆ World Wide Web
◆ Social or information networks
◆ Molecular structures
◼ Ordered
◆ Video data: sequence of images
◆ Temporal data: time series
◆ Sequential data: transaction sequences
◆ Genetic sequence data
◼ Spatial, image and multimedia
◆ Spatial data: maps
◆ Image data
◆ Video data

Example of document data (term-frequency vectors):
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

Example of transaction data:
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
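The term-frequency representation of document data can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and lowercase matching; the function name `term_frequency_vectors` and the three toy documents are hypothetical and are not taken from the slide.

```python
from collections import Counter


def term_frequency_vectors(docs):
    """Return (vocabulary, vectors) with one term-count vector per document."""
    # Assumption: terms are separated by whitespace and compared in lowercase.
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary fixes the column order shared by all vectors.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


# Hypothetical toy documents, used only for illustration.
docs = [
    "team play team score",
    "coach ball coach",
    "game win game score",
]
vocabulary, vectors = term_frequency_vectors(docs)
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes one row and each distinct term one column, which is the same layout as the document-term matrix on the slide.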
Important Characteristics of Structured Data
◼ Dimensionality
◆ Curse of dimensionality
◼ Sparsity
◆ Only presence counts
◼ Resolution
◆ Patterns depend on the scale
◼ Distribution
◆ Centrality and dispersion
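As a rough illustration of dimensionality and sparsity, the five transactions from the Types of Data Sets slide can be encoded as a binary presence matrix. This is only a sketch: the variable names are hypothetical, and defining sparsity as the fraction of zero cells is one common convention assumed here, not a definition given in the lecture.

```python
# The five transactions listed on the "Types of Data Sets" slide.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Dimensionality: one column per distinct item.
items = sorted(set().union(*transactions.values()))

# "Only presence counts": record 1 if the item occurs in the basket, else 0.
matrix = [
    [1 if item in basket else 0 for item in items]
    for basket in transactions.values()
]

# Assumed definition: sparsity is the fraction of cells that are zero.
nonzero = sum(sum(row) for row in matrix)
cells = len(matrix) * len(items)
sparsity = 1 - nonzero / cells

print(items)
print(f"dimensionality={len(items)}, sparsity={sparsity:.2f}")
```

With only five items the matrix is fairly dense; in a real store catalogue with thousands of items, almost every cell would be zero.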
Data Objects
◼ Data sets are made up of data objects.
◼ A data object represents an entity.
◼ Examples:
◆ sales database: customers, store items, sales
◆ medical database: patients, treatments
◆ university database: students, professors, courses
◼ Also called samples, examples, instances, data points, objects, tuples.
◼ Data objects are described by attributes.
◼ Database rows -> data objects; columns -> attributes
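The mapping of rows to data objects and columns to attributes can be sketched with a small Python dataclass. The `Student` class, its fields, and the two records below are hypothetical examples loosely modeled on the university database bullet; they are not part of the lecture.

```python
from dataclasses import dataclass, fields


@dataclass
class Student:
    """One data object (entity); each field is one attribute."""
    student_id: int
    name: str
    major: str


# Hypothetical records: each instance corresponds to one database row.
students = [
    Student(1, "Alice", "Software Engineering"),
    Student(2, "Bob", "Statistics"),
]

# Column names of the table are the attribute names of the data object.
attribute_names = [f.name for f in fields(Student)]

# Each row (tuple) holds the attribute values of one data object.
rows = [tuple(getattr(s, name) for name in attribute_names) for s in students]

print(attribute_names)
for row in rows:
    print(row)
```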