Outlier Detection III: Semi-Supervised Methods
◼ Situation: In many applications, only a small amount of labeled data is available. Labels may cover outliers only, normal objects only, or both
◼ Semi-supervised outlier detection: regarded as an application of semi-supervised learning
◼ If some labeled normal objects are available
  ◼ Use the labeled examples and the proximate unlabeled objects to train a model for normal objects
  ◼ Objects that do not fit the model of normal objects are detected as outliers
◼ If only some labeled outliers are available, the small number of labeled outliers may not cover the possible outliers well
  ◼ To improve the quality of outlier detection, one can get help from models for normal objects learned by unsupervised methods
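The case with labeled normal objects can be sketched roughly as follows. This is a minimal illustration, not the method of any particular paper: the function names, the `radius` used to decide which unlabeled objects count as "proximate", the centroid-plus-reach model, the `slack` factor, and the sample points are all assumptions made for the example.

```python
import math

def fit_normal_model(labeled_normal, unlabeled, radius):
    """Grow the normal set with nearby unlabeled points, then summarize it.

    Assumption: an unlabeled point is "proximate" if it lies within
    `radius` of any point already considered normal.
    """
    normal = list(labeled_normal)
    pool = list(unlabeled)
    grew = True
    while grew:  # repeat until no more unlabeled points can be absorbed
        grew = False
        for p in list(pool):
            if any(math.dist(p, q) <= radius for q in normal):
                normal.append(p)
                pool.remove(p)
                grew = True
    dims = len(normal[0])
    centroid = tuple(sum(p[d] for p in normal) / len(normal) for d in range(dims))
    # "reach" is the largest distance from the centroid to a normal point
    reach = max(math.dist(p, centroid) for p in normal)
    return centroid, reach

def find_outliers(points, centroid, reach, slack=1.5):
    """Flag points that fall well outside the region covered by the normal model."""
    return [p for p in points if math.dist(p, centroid) > slack * reach]

# Hypothetical data: two labeled normal objects and four unlabeled objects
labeled_normal = [(0.0, 0.0), (1.0, 0.0)]
unlabeled = [(0.5, 0.5), (1.5, 0.5), (2.0, 1.0), (9.0, 9.0)]

centroid, reach = fit_normal_model(labeled_normal, unlabeled, radius=1.0)
outliers = find_outliers(unlabeled, centroid, reach)
print(outliers)
```

Here the first three unlabeled points are absorbed into the normal set because each lies close to a point already treated as normal, while the remaining point does not fit the resulting model and is reported as an outlier.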
Outlier Detection (1): Statistical Methods
◼ Statistical methods (also known as model-based methods) assume that the normal data follow some statistical model (a stochastic model)
  ◼ Data that do not follow the model are outliers
◼ Effectiveness of statistical methods: depends heavily on whether the assumed statistical model holds for the real data
◼ There are many alternative statistical models to choose from
  ◼ E.g., parametric vs. non-parametric
◼ Example (right figure): First use a Gaussian distribution to model the normal data
  ◼ For each object y in region R, estimate g_D(y), the probability that y fits the Gaussian distribution
  ◼ If g_D(y) is very low, y is unlikely to have been generated by the Gaussian model, and is thus an outlier
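The Gaussian example can be sketched in one dimension with Python's standard library. The data values, the variable names, and the density cutoff `threshold` are assumptions chosen for illustration; the density returned by `pdf` plays the role of g_D(y).

```python
from statistics import NormalDist

# Hypothetical sample assumed to contain only normal objects
normal_data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]

# Fit a Gaussian (mean and sample standard deviation) to the normal data
model = NormalDist.from_samples(normal_data)

# Objects to be checked against the model (hypothetical values)
candidates = [10.1, 9.6, 12.5]

# Assumed cutoff: densities below this value count as "very low"
threshold = 0.01

outliers = [y for y in candidates if model.pdf(y) < threshold]
print(outliers)
```

The cutoff has to be chosen per application. If the Gaussian assumption does not hold for the real data, the densities are not meaningful and neither is the cutoff.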
Outlier Detection (2): Proximity-Based Methods
◼ An object is an outlier if its nearest neighbors are far away, i.e., the proximity of the object deviates significantly from the proximity of most of the other objects in the same data set
◼ The effectiveness of proximity-based methods relies heavily on the proximity measure
◼ In some applications, proximity or distance measures cannot be obtained easily
◼ These methods often have difficulty finding a group of outliers that stay close to each other
◼ Two major types of proximity-based outlier detection
  ◼ Distance-based vs. density-based
◼ Example (right figure): Model the proximity of an object using its 3 nearest neighbors
  ◼ Objects in region R are substantially different from other objects in the data set
  ◼ Thus the objects in R are outliers
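A distance-based version of the 3-nearest-neighbor example might look like the sketch below. The scoring rule (distance to the k-th nearest neighbor), the cutoff of three times the median score, and the sample points are assumptions for illustration only.

```python
import math
import statistics

def knn_distance(points, k):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical data: six points close together and two points far away
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (2.0, 1.0),
    (10.0, 10.0), (10.5, 10.0),
]

scores = knn_distance(points, k=3)

# Assumed rule: a score far above the typical (median) score marks an outlier
cutoff = 3 * statistics.median(scores)
outliers = [p for p, s in zip(points, scores) if s > cutoff]
print(outliers)
```

With k=3 the two distant points are flagged even though they sit next to each other, because their third nearest neighbor is far away. With k=1 each would find the other as a close neighbor and neither would be flagged, which illustrates the difficulty with groups of outliers that stay close to each other.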
Outlier Detection (3): Clustering-Based Methods
◼ Normal data belong to large and dense clusters, whereas outliers belong to small or sparse clusters, or do not belong to any cluster
◼ Since there are many clustering methods, there are many clustering-based outlier detection methods as well
◼ Clustering is expensive: a straightforward adaptation of a clustering method for outlier detection can be costly and does not scale well to large data sets
◼ Example (right figure): two clusters
  ◼ All points not in R form a large cluster
  ◼ The two points in R form a tiny cluster, and thus are outliers
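The two-cluster example can be sketched with a very simple clustering step. The clustering used here (connected components of points within distance `eps` of each other) is a stand-in chosen for brevity, not a method named on the slide; `eps`, `min_size`, and the data are assumptions.

```python
import math

def cluster_by_linkage(points, eps):
    """Label connected components in which neighbors lie within eps of each other."""
    labels = [None] * len(points)
    cluster_id = 0
    for start in range(len(points)):
        if labels[start] is not None:
            continue
        labels[start] = cluster_id
        stack = [start]
        while stack:  # expand the current cluster from each newly added point
            i = stack.pop()
            for j in range(len(points)):
                if labels[j] is None and math.dist(points[i], points[j]) <= eps:
                    labels[j] = cluster_id
                    stack.append(j)
        cluster_id += 1
    return labels

def small_cluster_outliers(points, labels, min_size):
    """Report members of clusters that have fewer than min_size points."""
    sizes = {}
    for lab in labels:
        sizes[lab] = sizes.get(lab, 0) + 1
    return [p for p, lab in zip(points, labels) if sizes[lab] < min_size]

# Hypothetical data: one larger cluster and one tiny cluster of two points
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (2.0, 1.0),
    (10.0, 10.0), (10.5, 10.0),
]

labels = cluster_by_linkage(points, eps=1.5)
outliers = small_cluster_outliers(points, labels, min_size=3)
print(outliers)
```

This naive clustering compares every pair of points, so its cost grows quadratically with the number of objects, which is one concrete form of the scalability concern noted above.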
Statistical Approaches
◼ Statistical approaches assume that the objects in a data set are generated by a stochastic process (a generative model)
◼ Idea: learn a generative model fitting the given data set, and then identify the objects in low-probability regions of the model as outliers
◼ Methods are divided into two categories: parametric vs. non-parametric
◼ Parametric method
  ◼ Assumes that the normal data is generated by a parametric distribution with parameter θ
  ◼ The probability density function of the parametric distribution, f(x, θ), gives the relative likelihood that object x is generated by the distribution
  ◼ The smaller this value, the more likely x is an outlier
◼ Non-parametric method
  ◼ Does not assume an a priori statistical model; instead determines the model from the input data
  ◼ Not completely parameter-free, but the number and nature of the parameters are flexible and not fixed in advance
  ◼ Examples: histogram and kernel density estimation
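As a sketch of the non-parametric case, kernel density estimation can be written in a few lines. The Gaussian kernel, the `bandwidth` value, the density cutoff `threshold`, and the sample are assumptions for illustration; the bandwidth is an example of a parameter that still has to be chosen even though no distribution is fixed in advance.

```python
import math

def kde(x, sample, bandwidth):
    """Gaussian kernel density estimate at x, built from the sample itself."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)

# Hypothetical one-dimensional data set containing one unusual value
sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 12.5]

bandwidth = 0.3   # assumed smoothing parameter
threshold = 0.3   # assumed cutoff for a "low probability region"

densities = [kde(x, sample, bandwidth) for x in sample]
outliers = [x for x, d in zip(sample, densities) if d < threshold]
print(outliers)
```

Unlike the parametric case, the estimated density follows whatever shape the data has, so the unusual value is flagged because few other objects lie near it rather than because it is far from a fitted mean.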