Copyright © 2018, 2014, 2011 Pearson Education, Inc. All Rights Reserved

The Art and Science of Data Preprocessing (1 of 2)
• The real-world data is dirty, misaligned, overly complex, and inaccurate
  – Not ready for analytics!
• Readying the data for analytics is needed
  – Data preprocessing
    ▪ Data consolidation
    ▪ Data cleaning
    ▪ Data transformation
    ▪ Data reduction
• Art – it develops and improves with experience

The Art and Science of Data Preprocessing (2 of 2)
• Data reduction
  1. Variables
    – Dimensional reduction
    – Variable selection
  2. Cases/samples
    – Sampling
    – Balancing / stratification
[Figure: data preprocessing flow diagram. Recoverable labels: Raw Data Sources; Integrate data; Data Cleaning; Eliminate duplicates; Data Transformation; Create attributes; Data reduction; Well-Formed Data]
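
The case-reduction idea on this slide (sampling with stratification) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `stratified_sample`, the record layout, and the 20% fraction are hypothetical and do not come from the slides.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction of records from every stratum.

    `key` maps a record to its stratum label (hypothetical helper,
    not part of the slides).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)

    sample = []
    for group in strata.values():
        # Keep at least one record per stratum so small classes survive.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))  # sampling without replacement
    return sample


# Illustrative data: 20 "yes" records and 80 "no" records.
records = [{"id": i, "label": "yes" if i % 5 == 0 else "no"} for i in range(100)]
sample = stratified_sample(records, key=lambda r: r["label"], fraction=0.2)
```

Because each stratum is sampled separately, the reduced data set keeps the original class proportions, which a plain random sample of the same size does not guarantee.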

Data Preprocessing Tasks and Methods (1 of 3)
Table 2.1 A Summary of Data Preprocessing Tasks and Potential Methods

| Main Task | Subtasks | Popular Methods |
| --- | --- | --- |
| Data consolidation | Access and collect the data | SQL queries, software agents, Web services. |
| Data consolidation | Select and filter the data | Domain expertise, SQL queries, statistical tests. |
| Data consolidation | Integrate and unify the data | SQL queries, domain expertise, ontology-driven data mapping. |
| Data cleaning | Handle missing values in the data | Fill in missing values (imputations) with most appropriate values (mean, median, min/max, mode, etc.); recode the missing values with a constant such as “ML”; remove the record of the missing value; do nothing. |
| Data cleaning | Identify and reduce noise in the data | Identify the outliers in data with simple statistical techniques (such as averages and standard deviations) or with cluster analysis; once identified, either remove the outliers or smooth them by using binning, regression, or simple averages. |
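
Two of the data cleaning methods in the table, imputation and outlier identification with averages and standard deviations, might look roughly like the sketch below. The function names, the sample `ages` list, and the two-standard-deviation threshold are assumptions made for illustration, not values given in the slides.

```python
from statistics import mean, median, pstdev


def impute(values, strategy="median"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if strategy == "median" else mean(observed)
    return [fill if v is None else v for v in values]


def flag_outliers(values, threshold=2.0):
    """Return indices of values more than `threshold` standard deviations
    away from the mean (threshold is an assumed, tunable cutoff)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing to flag
    return [i for i, v in enumerate(values) if abs(v - mu) > threshold * sigma]


# Illustrative data: two missing ages and one implausible value (95).
ages = [23, 25, None, 24, 26, 22, 95, None, 24]
filled = impute(ages, strategy="median")
outliers = flag_outliers(filled)
```

The median is used for the fill here because a single extreme value such as 95 would pull the mean upward; whether a flagged record is then removed or smoothed remains a judgment call for the analyst.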

Data Preprocessing Tasks and Methods (2 of 3)

| Main Task | Subtasks | Popular Methods |
| --- | --- | --- |
| Data cleaning | Find and eliminate erroneous data | Identify the erroneous values in data (other than outliers), such as odd values, inconsistent class labels, odd distributions; once identified, use domain expertise to correct the values or remove the records holding the erroneous values. |
| Data transformation | Normalize the data | Reduce the range of values in each numerically valued variable to a standard range (e.g., 0 to 1 or -1 to +1) by using a variety of normalization or scaling techniques. |
| Data transformation | Discretize or aggregate the data | If needed, convert the numeric variables into discrete representations using range- or frequency-based binning techniques; for categorical variables, reduce the number of values by applying proper concept hierarchies. |
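
As a rough illustration of the two transformation rows above, the sketch below rescales a numeric variable to a standard range with min-max scaling and discretizes it with range-based (equal-width) binning. Min-max scaling is only one of the "variety of normalization or scaling techniques" the table mentions; the function names and the `incomes` values are hypothetical.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]  # constant variable, no spread
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 using equal-width ranges."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Illustrative data (assumed, e.g. incomes in thousands).
incomes = [20, 30, 40, 60, 100]
scaled = min_max_scale(incomes)                 # 0 to 1 range
centered = min_max_scale(incomes, -1.0, 1.0)    # -1 to +1 range
bins = equal_width_bins(incomes, n_bins=4)
```

Frequency-based binning would instead choose the cut points so that each bin holds roughly the same number of records, which tends to behave better on skewed variables.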

Data Preprocessing Tasks and Methods (3 of 3)

| Main Task | Subtasks | Popular Methods |
| --- | --- | --- |
| Data transformation | Construct new attributes | Derive new and more informative variables from the existing ones using a wide range of mathematical functions (as simple as addition and multiplication or as complex as a hybrid combination of log transformations). |
| Data reduction | Reduce number of attributes | Principal component analysis, independent component analysis, chi-square testing, correlation analysis, and decision tree induction. |
| Data reduction | Reduce number of records | Random sampling, stratified sampling, expert-knowledge-driven purposeful sampling. |
| Data reduction | Balance skewed data | Oversample the less represented or undersample the more represented classes. |
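
The last row, balancing skewed data by oversampling the less represented classes, could be sketched as simple random oversampling with replacement. The function name `oversample_minority`, the `churn` label, and the 10/90 class split are assumptions for illustration only.

```python
import random
from collections import Counter


def oversample_minority(records, label_key, seed=0):
    """Duplicate randomly chosen records of smaller classes until every
    class has as many records as the largest one."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    by_class = {}
    for rec in records:
        by_class.setdefault(rec[label_key], []).append(rec)

    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw with replacement until the class reaches the majority size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Illustrative data: 10 "yes" records against 90 "no" records.
records = [{"id": i, "churn": "yes" if i < 10 else "no"} for i in range(100)]
balanced = oversample_minority(records, label_key="churn")
counts = Counter(rec["churn"] for rec in balanced)
```

Undersampling works the other way around, keeping all minority records and drawing a random subset of the majority class; it discards data but avoids the duplicated records that oversampling introduces.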