Knowledge Discovery (KDD) Process
◼ This is a view from typical database systems and data warehousing communities
◼ Data mining plays an essential role in the knowledge discovery process
[Figure: KDD pipeline, from bottom to top: Databases; Data Cleaning and Data Integration; Data Warehouse; Selection; Task-relevant Data; Data Mining; Pattern Evaluation]
The KDD process
◼ Problem formulation
◼ Data collection
  ◼ subset data: sampling might hurt if the data is highly skewed
  ◼ feature selection: principal component analysis, heuristic search
◼ Pre-processing: cleaning
  ◼ name/address cleaning, inconsistent terms for the same meaning (annual, yearly), duplicate removal, supplying missing values
◼ Transformation:
  ◼ map complex objects, e.g. time series data, to features, e.g. frequency
◼ Choosing mining task and mining method
◼ Result evaluation and visualization
Knowledge discovery is an iterative process
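The pre-processing and transformation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the `SYNONYMS` table, mean imputation as the way to supply missing values, and a naive discrete Fourier transform as the frequency feature are all hypothetical choices, not something the slides prescribe.

```python
import cmath
from statistics import mean

# Hypothetical raw records: (name, billing period, amount); None marks a missing value.
raw = [
    ("Alice Smith ", "annual", 120.0),
    ("alice smith", "yearly", 120.0),
    ("Bob Jones", "monthly", None),
    ("Carol Diaz", "monthly", 30.0),
]

# Assumed mapping of different terms with the same meaning onto one canonical value.
SYNONYMS = {"yearly": "annual"}

def clean(records):
    """Normalize names and synonyms, drop duplicates, fill missing amounts."""
    seen, rows = set(), []
    for name, period, amount in records:
        name = " ".join(name.lower().split())   # name cleaning
        period = SYNONYMS.get(period, period)   # unify terms
        if (name, period) in seen:              # duplicate removal
            continue
        seen.add((name, period))
        rows.append([name, period, amount])
    known = [r[2] for r in rows if r[2] is not None]
    fill = mean(known)                          # supply missing values with the mean
    for r in rows:
        if r[2] is None:
            r[2] = fill
    return [tuple(r) for r in rows]

def dominant_frequency(series):
    """Transformation: index of the strongest non-zero DFT component (cycles per window)."""
    n = len(series)
    avg = mean(series)
    centered = [x - avg for x in series]

    def magnitude(k):
        return abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                       for t, x in enumerate(centered)))

    return max(range(1, n // 2 + 1), key=magnitude)

cleaned = clean(raw)
signal = [1, 0, -1, 0] * 4   # toy series that repeats every 4 samples
freq = dominant_frequency(signal)
```

Here `cleaned` keeps one record per person and `freq` turns a 16-sample series into a single number that a mining method can use as a feature. In practice each step would be revisited after evaluating the results, in line with the iterative nature of the process.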
Relationship with other fields
◼ Overlaps with machine learning, statistics, artificial intelligence, databases, visualization, but with more stress on
  ◼ scalability of number of features and instances
  ◼ algorithms and architectures, whereas the foundations of methods and formulations are provided by statistics and machine learning
  ◼ automation for handling large, heterogeneous data
Some basic operations
◼ Predictive:
  ◼ Regression
  ◼ Classification
  ◼ Collaborative Filtering
◼ Descriptive:
  ◼ Clustering / similarity matching
  ◼ Association rules and variants
  ◼ Deviation detection
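To make the predictive/descriptive split concrete, the sketch below shows one operation of each kind: 1-nearest-neighbour classification (predictive) and the support and confidence of an association rule (descriptive). The training points, baskets, and function names are invented for illustration, and nearest-neighbour is only one of many possible classifiers.

```python
import math

# --- Predictive: 1-nearest-neighbour classification on toy labelled points ---
train = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((8.0, 9.0), "high"),
    ((9.0, 8.5), "high"),
]

def classify(point):
    """Return the label of the closest training point (Euclidean distance)."""
    _, label = min(train, key=lambda pair: math.dist(pair[0], point))
    return label

# --- Descriptive: support and confidence of an association rule ---
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(items):
    """Fraction of baskets that contain every item in `items`."""
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(lhs, rhs):
    """Estimated P(rhs in basket | lhs in basket)."""
    return support(lhs | rhs) / support(lhs)

label = classify((2.0, 1.0))
rule_support = support({"bread", "milk"})
rule_confidence = confidence({"bread"}, {"milk"})
```

The classifier predicts a label for an unseen point, whereas the rule bread → milk only describes a regularity in the data that is already there.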
Example: A Web Mining Framework
◼ Web mining usually involves
  ◼ Data cleaning
  ◼ Data integration from multiple sources
  ◼ Warehousing the data
  ◼ Data cube construction
  ◼ Data selection for data mining
  ◼ Data mining
  ◼ Presentation of the mining results
  ◼ Patterns and knowledge to be used or stored into knowledge-base
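The data cube construction step in the list above can be illustrated with a small sketch that sums a measure over every subset of dimensions (every cuboid). The page-view records, the dimension names, and the `build_cube` helper are hypothetical. A real warehouse would compute this far more efficiently than this brute-force loop.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical page-view records, assumed to be already cleaned and integrated.
DIMENSIONS = ("page", "country", "day")
views = [
    {"page": "/home", "country": "IN", "day": "Mon", "hits": 3},
    {"page": "/home", "country": "US", "day": "Mon", "hits": 2},
    {"page": "/docs", "country": "IN", "day": "Tue", "hits": 5},
    {"page": "/home", "country": "IN", "day": "Tue", "hits": 1},
]

def build_cube(records, dimensions, measure):
    """Sum the measure for every subset of dimensions (one cuboid per subset)."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for group in combinations(dimensions, size):
            totals = defaultdict(int)
            for rec in records:
                key = tuple(rec[d] for d in group)
                totals[key] += rec[measure]
            cube[group] = dict(totals)
    return cube

cube = build_cube(views, DIMENSIONS, "hits")
hits_per_page = cube[("page",)]   # roll-up over country and day
grand_total = cube[()][()]        # apex cuboid: all dimensions aggregated away
```

Data selection for mining then amounts to picking the cuboid, or the slice of it, that is relevant to the task at hand.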