Data Mining and Model Choice in Supervised Learning Gilbert Saporta Chaire de Statistique Appliquée & CEDRIC, CNAM, 292 rue Saint Martin, F-75003 Paris gilbert.saporta@cnam.fr http://cedric.cnam.fr/~saporta
Beijing, 2008

Outline
1. What is data mining?
2. Association rule discovery
3. Statistical models
4. Predictive modelling
5. A scoring case study
6. Discussion
1. What is data mining?
◼ Data mining is a new field at the frontiers of statistics and information technologies (database management, artificial intelligence, machine learning, etc.) which aims at discovering structures and patterns in large data sets.
1.1 Definitions
◼ U. M. Fayyad, G. Piatetsky-Shapiro: “Data Mining is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.”
◼ D. J. Hand: “I shall define Data Mining as the discovery of interesting, unexpected, or valuable structures in large data sets.”
◼ The metaphor of Data Mining means that there are treasures (or nuggets) hidden under mountains of data, which may be discovered by specific tools.
◼ Data Mining is concerned with data which were collected for another purpose: it is a secondary analysis of databases that are collected Not Primarily For Analysis, but for the management of individual cases (Kardaun, T. Alanko, 1998).
◼ Data Mining is not concerned with efficient methods for collecting data, such as surveys and experimental designs (Hand, 2000).