PSA PEUGEOT CITROEN / CEDRIC
3rd IASC world conference on Computational Statistics & Data Analysis, Limassol, Cyprus, 28-31 October, 2005

DATA FEATURE
⚫ Data size:
   ⚫ More than 80 000 vehicles (≈ transactions) ➔ 4 months of manufacturing
   ⚫ More than 3000 attributes (≈ items)
⚫ Sparse data:
   [Figure: count and percent of vehicles for the 100 most frequent attributes. Vertical axis: count of vehicles, from 1621 (2 %) to 9727 (12 %).]

DATA FEATURE
⚫ Count of co-occurrences per vehicle:
   [Figure: percent of vehicles (0 % to 100 %) against the count of attributes owned by a vehicle.]

OUTPUT: ASSOCIATION RULES

Minimum support   Minimum confidence   Count of rules   Maximum size of rules
500               50 %                 16               3
400               50 %                 29               3
300               50 %                 194              5
250               50 %                 1299             6
200               50 %                 102 981          10
100               50 %                 1 623 555        13

(Minimum support = minimum count of vehicles that support the rule.)

⚫ Aims:
   ⚫ Reduce count of rules
   ⚫ Reduce size of rules
⚫ A first reduction is obtained by manual grouping:

Minimum support   Minimum confidence   Count of rules   Maximum size of rules
100               50 %                 600 636          12
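
The two thresholds in the tables above can be illustrated with a minimal Python sketch. The toy vehicles, the attribute names "A", "B", "C" and the helper names are hypothetical and are not taken from the PSA data; support is counted in vehicles, as on the slide.

```python
# Toy data (hypothetical): each vehicle is the set of attributes it carries.
vehicles = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent, transactions):
    """Share of transactions holding the antecedent that also hold the consequent."""
    base = support_count(antecedent, transactions)
    if base == 0:
        return 0.0
    return support_count(antecedent | consequent, transactions) / base

def keep_rule(antecedent, consequent, transactions, min_support, min_confidence):
    """A rule is kept when it passes both the support and the confidence threshold."""
    both = support_count(antecedent | consequent, transactions)
    conf = confidence(antecedent, consequent, transactions)
    return both >= min_support and conf >= min_confidence

# Rule {A, B} -> {C}: 2 vehicles carry A, B and C; 3 vehicles carry A and B.
print(support_count({"A", "B", "C"}, vehicles))
print(confidence({"A", "B"}, {"C"}, vehicles))
```

Lowering the minimum support lets more itemsets through, which is consistent with the rapid growth in the count of rules shown in the table.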

COMBINING CLUSTER ANALYSIS AND ASSOCIATION RULES
⚫ 10-cluster partition with hierarchical clustering and the Russell-Rao coefficient

Cluster   Number of variables in the cluster   Number of rules found in the cluster   Maximum size of rules
1         2                                    0                                      0
2         12                                   481 170                                12
3         2                                    0                                      0
4         5                                    24                                     4
5         117                                  55                                     4
6         4                                    22                                     4
7         10                                   33                                     4
8         5                                    22                                     4
9         16                                   1                                      2
10        2928                                 61                                     4

⚫ Cluster 2 is atypical and produces many complex rules
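
As a reminder of what the Russell-Rao coefficient measures, here is a minimal Python sketch for two binary attributes observed on the same vehicles. The function name and the toy vectors are hypothetical; converting the similarity to a dissimilarity as 1 - s before clustering is an assumption, since the slides do not state how the coefficient was passed to the hierarchical procedure.

```python
def russell_rao(x, y):
    """Russell-Rao similarity: co-presences divided by the number of observations."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    both_present = sum(1 for xi, yi in zip(x, y) if xi and yi)
    return both_present / len(x)

# Two hypothetical attributes over five vehicles (1 = the vehicle has the attribute).
attr_1 = [1, 1, 0, 0, 1]
attr_2 = [1, 0, 0, 1, 1]

similarity = russell_rao(attr_1, attr_2)  # 2 co-presences out of 5 vehicles
dissimilarity = 1 - similarity            # assumed conversion for clustering
print(similarity, dissimilarity)
```

Joint absences do not increase the similarity, so two attributes count as close only when they frequently occur on the same vehicles.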

⚫ Mining association rules inside each cluster except the atypical cluster:

                        Count of rules   Maximum size of rules   Reduction of the count of rules
Without clustering      600 636          12                      .
Ward / Russell-Rao      218              4                       More than 99 %

⚫ The number of rules to analyse has decreased significantly
⚫ The output rules are simpler to analyse
⚫ Clustering has detected an atypical cluster of attributes to treat separately
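
The figure of 218 rules and the "more than 99 %" reduction can be checked from the per-cluster counts reported earlier. The short Python sketch below only redoes that arithmetic; the variable names are illustrative.

```python
# Rules found in each of the 10 clusters, as reported in the per-cluster table.
rules_per_cluster = {
    1: 0, 2: 481170, 3: 0, 4: 24, 5: 55,
    6: 22, 7: 33, 8: 22, 9: 1, 10: 61,
}
ATYPICAL_CLUSTER = 2            # treated separately
RULES_WITHOUT_CLUSTERING = 600636

# Sum the rules over every cluster except the atypical one.
rules_kept = sum(
    count for cluster, count in rules_per_cluster.items()
    if cluster != ATYPICAL_CLUSTER
)
reduction = 1 - rules_kept / RULES_WITHOUT_CLUSTERING

print(rules_kept)               # 218
print(f"{reduction:.2%}")       # 99.96%
```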