Chapter 9. Outlier Analysis Outlier and outlier Analysis Outlier Detection Methods Statistical Approaches Proximity-Base Approaches Clustering-Base Approaches Classification Approaches Summary
1 Chapter 9. Outlier Analysis ◼ Outlier and Outlier Analysis ◼ Outlier Detection Methods ◼ Statistical Approaches ◼ Proximity-Base Approaches ◼ Clustering-Base Approaches ◼ Classification Approaches ◼ Summary
What Is Outlier Discovery? What are outliers? The set of objects are considerably dissimilar from the remainder of the data EXample: Sports: Michael Jordon, Wayne Gretzky, Problem: Define and find outliers in large data sets Applications Credit card fraud detection Telecom fraud detection Customer segmentation ■ Medical analysis network intrusion detection fault detection
2 What Is Outlier Discovery? ◼ What are outliers? ◼ The set of objects are considerably dissimilar from the remainder of the data ◼ Example: Sports: Michael Jordon, Wayne Gretzky, ... ◼ Problem: Define and find outliers in large data sets ◼ Applications: ◼ Credit card fraud detection ◼ Telecom fraud detection ◼ Customer segmentation ◼ Medical analysis ◼ network intrusion detection ◼ fault detection
What Are outliers? Outlier: A data object that deviates significantly from the normal objects as if it were generated by a different mechanism EX: Unusual credit card purchase, sports: Michael Jordon, Wayne Gretzky,… Outliers are different from the noise data Noise is random error or variance in a measured variable Noise should be removed before outlier detection Outliers are interesting: It violates the mechanism that generates the normal data R Outlier detection Vs novelty detection: early stage outlier; but later merged into the model
3 What Are Outliers? ◼ Outlier: A data object that deviates significantly from the normal objects as if it were generated by a different mechanism ◼ Ex.: Unusual credit card purchase, sports: Michael Jordon, Wayne Gretzky, ... ◼ Outliers are different from the noise data ◼ Noise is random error or variance in a measured variable ◼ Noise should be removed before outlier detection ◼ Outliers are interesting: It violates the mechanism that generates the normal data ◼ Outlier detection vs. novelty detection: early stage , outlier; but later merged into the model
Anomaly Detection Challenges a How many outliers are there in the data? Method is unsupervised Validation can be quite challenging just like for clustering) Finding needle in a haystack Working assumption a There are considerably more normal observations than abnormal observations (outliers/anomalies )in the data
4 Anomaly Detection ◼ Challenges ◼ How many outliers are there in the data? ◼ Method is unsupervised ◼ Validation can be quite challenging (just like for clustering) ◼ Finding needle in a haystack ◼ Working assumption: ◼ There are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data
Anomaly Detection Schemes General steps Build a profile of the"normal behavior Profile can be patterns or summary statistics for overall population Use the normal profile to detect anomalies Anomalies are observations whose characteristics differ significantly from the normal profile Types of anomaly detection schemes Graphical Statistical-based Distance-based Model-based 5
5 Anomaly Detection Schemes ◼ General Steps ◼ Build a profile of the “normal” behavior ◼ Profile can be patterns or summary statistics for overall population ◼ Use the “normal” profile to detect anomalies ◼ Anomalies are observations whose characteristics differ significantly from the normal profile ◼ Types of anomaly detection schemes ◼ Graphical & Statistical-based ◼ Distance-based ◼ Model-based