Background and Foreground Modeling Using Nonparametric Kernel Density Estimation for Visual Surveillance AHMED ELGAMMAL, RAMANI DURAISWAMI, MEMBER, IEEE, DAVID HARWOOD, AND LARRY S. DAVIS, FELLOW, IEEE Invited Paper Automatic understanding of events happening at a site is the ultimate goal for many visual surveillance systems. Higher level understanding of events requires that certain lower level computer vision tasks be performed. These may include detection of unusual motion, tracking targets, labeling body parts, and understanding the interactions between people. To achieve many of these tasks, it is necessary to build representations of the appearance of objects in the scene. This paper focuses on two issues related to this problem. First, we construct a statistical representation of the scene background that supports sensitive detection of moving objects in the scene, but is robust to clutter arising out of natural scene variations. Second, we build statistical representations of the foreground regions (moving objects) that support their tracking and support occlusion reasoning. The probability density functions (pdfs) associated with the background and foreground are likely to vary from image to image and will not in general have a known parametric form. We accordingly utilize general nonparametric kernel density estimation techniques for building these statistical representations of the background and the foreground. These techniques estimate the pdf directly from the data without any assumptions about the underlying distributions. Example results from applications are presented. Keywords—Background subtraction, color modeling, kernel density estimation, occlusion modeling, tracking, visual surveillance. Manuscript received May 31, 2001; revised February 15, 2002. This work was supported in part by the ARDA Video Analysis and Content Exploitation project under Contract MDA 90 400C2110 and in part by Philips Research. A. 
Elgammal is with the Computer Vision Laboratory, University of Maryland Institute for Advanced Computer Studies, Department of Computer Science, University of Maryland, College Park, MD 20742 USA (e-mail: elgammal@cs.umd.edu). R. Duraiswami, D. Harwood, and L. S. Davis are with the Computer Vision Laboratory, University of Maryland Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742 USA (e-mail: ramani@umiacs.umd.edu; harwood@umiacs.umd.edu; lsd@cs.umd.edu). Publisher Item Identifier 10.1109/JPROC.2002.801448. I. INTRODUCTION In automated surveillance systems, cameras and other sensors are typically used to monitor activities at a site with the goal of automatically understanding events happening at the site. Automatic event understanding would enable functionalities such as detection of suspicious activities and site security. Current systems archive huge volumes of video for eventual off-line human inspection. The automatic detection of events in videos would facilitate efficient archiving and automatic annotation. It could be used to direct the attention of human operators to potential problems. The automatic detection of events would also dramatically reduce the bandwidth required for video transmission and storage as only interesting pieces would need to be transmitted or stored. Higher level understanding of events requires certain lower level computer vision tasks to be performed such as detection of unusual motion, tracking targets, labeling body parts, and understanding the interactions between people. For many of these tasks, it is necessary to build representations of the appearance of objects in the scene. For example, the detection of unusual motions can be achieved by building a representation of the scene background and comparing new frames with this representation. This process is called background subtraction. 
Building representations for foreground objects (targets) is essential for tracking them and maintaining their identities. This paper focuses on two issues: how to construct a statistical representation of the scene background that supports sensitive detection of moving objects in the scene and how to build statistical representations of the foreground (moving objects) that support their tracking.

0018-9219/02$17.00 © 2002 IEEE

PROCEEDINGS OF THE IEEE, VOL. 90, NO. 7, JULY 2002 1151

One useful tool for building such representations is statistical modeling, where a process is modeled as a random variable in a feature space with an associated probability density function (pdf). The density function could be represented parametrically using a specified statistical distribution that
is assumed to approximate the actual distribution, with the associated parameters estimated from training data. Alternatively, nonparametric approaches could be used. These estimate the density function directly from the data without any assumptions about the underlying distribution. This avoids having to choose a model and estimate its distribution parameters.

A particular nonparametric technique that estimates the underlying density, avoids having to store the complete data, and is quite general is the kernel density estimation technique. In this technique, the underlying pdf is estimated as

$$\hat{f}(x) = \sum_{i=1}^{n} \alpha_i K(x - x_i) \qquad (1)$$

where $K$ is a "kernel function" (typically a Gaussian) centered at the data points in feature space, $x_i$, $i = 1, \ldots, n$, and $\alpha_i$ are weighting coefficients (typically uniform weights are used, i.e., $\alpha_i = 1/n$). Kernel density estimators asymptotically converge to any density function [1], [2]. This property makes these techniques quite general and applicable to many vision problems where the underlying density is not known.

In this paper, kernel density estimation techniques are utilized for building representations for both the background and the foreground. We present an adaptive background modeling and background subtraction technique that is able to detect moving targets in challenging outdoor environments with moving trees and changing illumination. We also present a technique for modeling foreground regions and show how it can be used for segmenting major body parts of a person and for segmenting groups of people.

II. KERNEL DENSITY ESTIMATION TECHNIQUES

Given a sample $S = \{x_i\}$, $i = 1, \ldots, N$, from a distribution with density function $p(x)$, an estimate $\hat{p}(x)$ of the density at $x$ can be calculated using

$$\hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} K_\sigma(x - x_i) \qquad (2)$$

where $K_\sigma$ is a kernel function (sometimes called a "window" function) with a bandwidth (scale) $\sigma$ such that $K_\sigma(t) = (1/\sigma)K(t/\sigma)$. The kernel function $K$ should satisfy $K(t) \ge 0$ and $\int K(t)\,dt = 1$. We can think of (2) as estimating the pdf by averaging the effect of a set of kernel functions centered at each data point.
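As a concrete illustration, the estimator in (2) can be written in a few lines of Python. This is a minimal sketch using NumPy with a Gaussian kernel; the function names, the toy sample, and the bandwidth value are illustrative choices, not taken from the cited references:

```python
import numpy as np

def gaussian_kernel(t):
    """Gaussian kernel K(t): nonnegative and integrates to one."""
    return np.exp(-0.5 * t ** 2) / np.sqrt(2.0 * np.pi)

def kde(x, samples, sigma):
    """Estimate the density at x as in (2): the average of scaled
    kernels K_sigma(t) = (1/sigma) K(t/sigma) centered at each sample."""
    samples = np.asarray(samples, dtype=float)
    return float(np.mean(gaussian_kernel((x - samples) / sigma) / sigma))

# Toy 1-D sample with two modes; the estimate is high near the data
# and low in the gap between the modes.
samples = [0.0, 0.1, -0.1, 5.0, 5.2, 4.9]
p_near = kde(0.0, samples, sigma=0.5)
p_far = kde(2.5, samples, sigma=0.5)
```

Because the estimate is just a sum of kernels placed on the data, no parametric form is assumed; with enough samples the estimate follows the underlying density, whatever its shape.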
Alternatively, since the kernel function is symmetric, we can also regard this computation as averaging the effect of a kernel function centered at the estimation point and evaluated at each data point. Kernel density estimators asymptotically converge to any density function with sufficient samples [1], [2]. This property makes the technique quite general for estimating the density of any distribution. In fact, all other nonparametric density estimation methods, e.g., histograms, can be shown to be asymptotically kernel methods [1].

For higher dimensions, products of one-dimensional (1-D) kernels [1] can be used as

$$\hat{p}(\mathbf{x}) = \frac{1}{N} \sum_{i=1}^{N} \prod_{j=1}^{d} K_{\sigma_j}(x_j - x_{ij}) \qquad (3)$$

where the same kernel function is used in each dimension with a suitable bandwidth $\sigma_j$ for each dimension. We can avoid having to store the complete data set by weighting the samples as

$$\hat{p}(x) = \sum_{i=1}^{N} \alpha_i K_\sigma(x - x_i)$$

where the $\alpha_i$'s are weighting coefficients that sum up to one.

A variety of kernel functions with different properties have been used in the literature. Typically the Gaussian kernel is used for its continuity, differentiability, and locality properties. Note that choosing the Gaussian as a kernel function is different from fitting the distribution to a Gaussian model (normal distribution). Here, the Gaussian is only used as a function to weight the data points. Unlike parametric fitting of a mixture of Gaussians, kernel density estimation is a more general approach that does not assume any specific shape for the density function. A good discussion of kernel estimation techniques can be found in [1]. The major drawback of using the nonparametric kernel density estimator is its computational cost. This becomes less of a problem as the available computational power increases and as efficient computational methods have become available recently [3], [4].

III. MODELING THE BACKGROUND

A. Background Subtraction: A Review

1) The Concept: In video surveillance systems, stationary cameras are typically used to monitor activities at outdoor or indoor sites.
Since the cameras are stationary, the detection of moving objects can be achieved by comparing each new frame with a representation of the scene background. This process is called background subtraction and the scene representation is called the background model. Typically, background subtraction forms the first stage in an automated visual surveillance system. Results from background subtraction are used for further processing, such as tracking targets and understanding events.

A central issue in building a representation for the scene background is what features to use for this representation or, in other words, what to model in the background. In the literature, a variety of features have been used for background modeling, including pixel-based features (pixel intensity, edges, disparity) and region-based features (e.g., block correlation). The choice of features affects how the background model tolerates changes in the scene and the granularity of the detected foreground objects.

In any indoor or outdoor scene, there are changes that occur over time and may be classified as changes to the scene background. It is important that the background model tolerates these kinds of changes, either by being invariant to them or by adapting to them. These changes can be local, affecting only part of the background, or global, affecting the entire background. The study of these changes is essential to understand the motivations behind different background subtraction techniques. We classify these changes according to their source.

Illumination changes:
• gradual change in illumination, as might occur in outdoor scenes due to the change in the location of the sun;
• sudden change in illumination, as might occur in an indoor environment by switching the lights on or off, or in an outdoor environment by a change between cloudy and sunny conditions;
• shadows cast on the background by objects in the background itself (e.g., buildings and trees) or by moving foreground objects.

Motion changes:
• image changes due to small camera displacements (these are common in outdoor situations due to wind load or other sources of motion which cause global motion in the images);
• motion in parts of the background, for example, tree branches moving with the wind or rippling water.

Changes introduced to the background: These include any change in the geometry or the appearance of the background of the scene introduced by targets. Such changes typically occur when something relatively permanent is introduced into the scene background (for example, if somebody moves (introduces) something from (to) the background, or if a car is parked in the scene or moves out of the scene, or if a person stays stationary in the scene for an extended period).

2) Practice: Many researchers have proposed methods to address some of the issues regarding background modeling, and we provide a brief review of the relevant work here.

Pixel intensity is the most commonly used feature in background modeling. If we monitor the intensity value of a pixel over time in a completely static scene, then the pixel intensity can be reasonably modeled with a Gaussian distribution $N(\mu, \sigma^2)$, given that the image noise over time can be modeled by a zero-mean Gaussian distribution $N(0, \sigma^2)$. This Gaussian distribution model for the intensity value of a pixel is the underlying model for many background subtraction techniques. For example, one of the simplest background subtraction techniques is to calculate an average image of the scene, subtract each new frame from this image, and threshold the result.
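This simplest scheme (average image, subtract, threshold) can be sketched as follows. This is an illustrative NumPy sketch; the threshold value and the toy data are our own choices, not values from any of the cited systems:

```python
import numpy as np

def average_background(frames):
    """Estimate the background as the pixelwise mean of a stack of frames."""
    return np.mean(np.asarray(frames, dtype=float), axis=0)

def detect_foreground(frame, background, threshold=30.0):
    """Mark pixels whose absolute difference from the background
    exceeds a fixed threshold (an assumed value here)."""
    return np.abs(np.asarray(frame, dtype=float) - background) > threshold

# Toy example: a static 4x4 scene with sensor noise, then a bright
# "object" appears at one pixel of a new frame.
rng = np.random.default_rng(0)
frames = [100.0 + rng.normal(0.0, 2.0, (4, 4)) for _ in range(20)]
bg = average_background(frames)
new_frame = 100.0 + rng.normal(0.0, 2.0, (4, 4))
new_frame[1, 1] = 200.0   # moving object covers one pixel
mask = detect_foreground(new_frame, bg)
```

As the text notes, such a fixed model only works for a completely static scene; the adaptive and multimodal models reviewed next address its limitations.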
This basic Gaussian model can adapt to slow changes in the scene (for example, gradual illumination changes) by recursively updating the model using a simple adaptive filter. This basic adaptive model is used in [5]; also, Kalman filtering for adaptation is used in [6]–[8].

Typically, in outdoor environments with moving trees and bushes, the scene background is not completely static. For example, one pixel can be the image of the sky in one frame, a tree leaf in another frame, a tree branch in a third frame, and some mixture subsequently. In each situation, the pixel will have a different intensity (color), so a single Gaussian assumption for the pdf of the pixel intensity will not hold. Instead, a generalization based on a mixture of Gaussians has been used in [9]–[11] to model such variations. In [9] and [10], the pixel intensity was modeled by a mixture of $K$ Gaussian distributions ($K$ is a small number, from 3 to 5). The mixture is weighted by the frequency with which each of the Gaussians explains the background. In [11], a mixture of three Gaussian distributions was used to model the pixel value for traffic surveillance applications. The pixel intensity was modeled as a weighted mixture of three Gaussian distributions corresponding to road, shadow, and vehicle distributions. Adaptation of the Gaussian mixture models can be achieved using an incremental version of the EM algorithm.

In [12], linear prediction using the Wiener filter is used to predict pixel intensity given a recent history of values. The prediction coefficients are recomputed each frame from the sample covariance to achieve adaptivity. Linear prediction using the Kalman filter was also used in [6]–[8].

All of the previously mentioned models are based on statistical modeling of pixel intensity with the ability to adapt the model. While pixel intensity is not invariant to illumination changes, model adaptation makes it possible for such techniques to adapt to gradual changes in illumination.
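A simple adaptive filter of the kind mentioned above can be sketched as an exponentially weighted running average. The learning rate value below is an illustrative assumption, not one taken from the cited works:

```python
import numpy as np

def update_background(background, frame, alpha=0.05):
    """Recursive adaptive filter: blend each new frame into the model.
    A small alpha tracks gradual illumination change while damping
    short-lived transients."""
    return (1.0 - alpha) * background + alpha * np.asarray(frame, dtype=float)

# Gradual illumination change: a 2x2 scene brightens slowly by 0.1
# gray level per frame, and the background model follows with a lag.
background = np.full((2, 2), 100.0)
for t in range(200):
    frame = np.full((2, 2), 100.0 + 0.1 * t)
    background = update_background(background, frame)
```

For a steadily ramping input, this filter settles to a small constant lag behind the true background level, which is why it copes with gradual but not sudden illumination changes.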
On the other hand, a sudden change in illumination presents a challenge to such models.

Another approach to model a wide range of variations in the pixel intensity is to represent these variations as discrete states corresponding to modes of the environment, e.g., lights on/off or cloudy/sunny skies. Hidden Markov models (HMMs) have been used for this purpose in [13] and [14]. In [13], a three-state HMM has been used to model the intensity of a pixel for a traffic-monitoring application, where the three states correspond to the background, shadow, and foreground. The use of HMMs imposes a temporal continuity constraint on the pixel intensity, i.e., if the pixel is detected as a part of the foreground, then it is expected to remain part of the foreground for a period of time before switching back to be part of the background. In [14], the topology of the HMM representing global image intensity is learned while learning the background. At each global intensity state, the pixel intensity is modeled using a single Gaussian. It was shown that the model is able to learn simple scenarios like switching the lights on and off.

Alternatively, edge features have also been used to model the background. The use of edge features to model the background is motivated by the desire to have a representation of the scene background that is invariant to illumination changes. In [15], foreground edges are detected by comparing the edges in each new frame with an edge map of the background, which is called the background "primal sketch." The major drawback of using edge features to model the background is that it would only be possible to detect edges of foreground objects instead of the dense connected regions that result from pixel-intensity-based approaches. A fusion of intensity and edge information was used in [16].

Block-based approaches have also been used for modeling the background. Block matching has been extensively used for change detection between consecutive frames.
In [17], each image block is fit to a second-order bivariate polynomial and the remaining variations are assumed to be noise. A statistical likelihood test is then used to detect blocks with significant change. In [18], each block was represented by its median template over the background learning period and its block standard deviation. Subsequently, at each new frame, each block is correlated with its corresponding template, and blocks with too much deviation relative to the measured standard deviation are considered to be foreground. The major drawback of block-based approaches is that the detection unit is a whole image block, so they are suitable only for coarse detection. ELGAMMAL et al.: MODELING USING NONPARAMETRIC KERNEL DENSITY ESTIMATION FOR VISUAL SURVEILLANCE 1153
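The block-template scheme of [18] can be sketched roughly as follows. This is a hedged reconstruction from the description above, not the cited authors' code; the block size, the per-block pooling of the standard deviation, and the deviation factor `k` are our assumptions.

```python
import numpy as np

def learn_blocks(frames, bs=8):
    """Learn a median template and per-block standard deviation.

    frames: (T, H, W) stack of frames from the background learning
    period. Returns the per-pixel median template and one standard
    deviation per bs x bs block (pooled over the block's pixels).
    """
    stack = np.asarray(frames, dtype=np.float64)
    template = np.median(stack, axis=0)
    T, H, W = stack.shape
    sigma = np.zeros((H // bs, W // bs))
    for i in range(H // bs):
        for j in range(W // bs):
            blk = stack[:, i * bs:(i + 1) * bs, j * bs:(j + 1) * bs]
            sigma[i, j] = blk.std()
    return template, sigma

def detect_blocks(frame, template, sigma, bs=8, k=3.0):
    """Flag blocks whose mean absolute deviation from the median
    template is large relative to the learned standard deviation."""
    H, W = frame.shape
    out = np.zeros((H // bs, W // bs), dtype=bool)
    for i in range(H // bs):
        for j in range(W // bs):
            dev = np.abs(frame[i * bs:(i + 1) * bs, j * bs:(j + 1) * bs]
                         - template[i * bs:(i + 1) * bs, j * bs:(j + 1) * bs]).mean()
            out[i, j] = dev > k * sigma[i, j]
    return out
```

Note that the output is one decision per block, which illustrates the coarse-detection limitation discussed above.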
In order to monitor wide areas with sufficient resolution, cameras with zoom lenses are often mounted on pan-tilt platforms. This enables high-resolution imagery to be obtained from any viewing angle available at the location where the camera is mounted. The use of background subtraction in such situations requires a representation of the scene background for any pan-tilt-zoom combination, which is an extension of the original background subtraction concept with a stationary camera. In [19], image mosaicing techniques are used to build panoramic representations of the scene background. Alternatively, in [20], a representation of the scene background as a finite set of images on a virtual polyhedron is used to construct images of the scene background at any arbitrary pan-tilt-zoom setting. Both techniques assume that the camera rotates around its optical axis, so that there is no significant motion parallax.

B. Nonparametric Background Modeling

In this section, we describe a background model and a background subtraction process that we have developed based on nonparametric kernel density estimation. The model uses pixel intensity (color) as the basic feature for modeling the background. The model keeps a sample of intensity values for each pixel in the image and uses this sample to estimate the density function of the pixel intensity distribution. Therefore, the model is able to estimate the probability of any newly observed intensity value. The model can handle situations where the background of the scene is cluttered and not completely static but contains small motions due to moving tree branches and bushes. The model is updated continuously and therefore adapts to changes in the scene background.

1) Background Subtraction: Let x_1, x_2, ..., x_N be a sample of intensity values for a pixel. Given this sample, we can obtain an estimate of the pixel intensity pdf at any intensity value using kernel density estimation.
Given the observed intensity x_t at time t, we can estimate the probability of this observation as

Pr(x_t) = (1/N) * sum_{i=1..N} K_sigma(x_t - x_i)    (4)

where K_sigma is a kernel function with bandwidth sigma. This estimate can be generalized to use color features by using kernel products as

Pr(x_t) = (1/N) * sum_{i=1..N} prod_{j=1..d} K_{sigma_j}(x_{t,j} - x_{i,j})    (5)

where x_t is a d-dimensional color feature and K_{sigma_j} is a kernel function with bandwidth sigma_j in the jth color space dimension. If we choose the kernel function K to be Gaussian, then the density can be estimated as

Pr(x_t) = (1/N) * sum_{i=1..N} prod_{j=1..d} (1 / sqrt(2*pi*sigma_j^2)) * exp(-(x_{t,j} - x_{i,j})^2 / (2*sigma_j^2)).    (6)

Fig. 1. Background subtraction. (a) Original image. (b) Estimated probability image.

Using this probability estimate, the pixel is considered to be a foreground pixel if Pr(x_t) < th, where the threshold th is a global threshold over all the images that can be adjusted to achieve a desired percentage of false positives. Practically, the probability estimation in (6) can be calculated very quickly using precalculated lookup tables for the kernel function values, indexed by the intensity difference (x_t - x_i) and the kernel bandwidth. Moreover, a partial evaluation of the sum in (6) is usually sufficient to surpass the threshold at most image pixels, since most of the image is typically background. This allows a very fast implementation. Since kernel density estimation is a general approach, the estimate of (4) can converge to any pixel intensity density function. Here, the estimate is based on the most recent N samples used in the computation. Therefore, adaptation of the model can be achieved simply by adding new samples and ignoring older ones [21]. Fig. 1(b) shows the estimated background probability, where brighter pixels represent lower background probability. One major issue that needs to be addressed when using a kernel density estimation technique is the choice of a suitable kernel bandwidth (scale). Theoretically, as the number of samples approaches infinity, the choice of bandwidth is insignificant and the estimate will approach the actual density.
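A minimal sketch of the estimator in (4)-(6), together with the median-absolute-deviation bandwidth rule sigma = m / (0.68 * sqrt(2)) derived later in this section. Function names are ours; the lookup-table and partial-evaluation speedups are omitted, and the clamp on the median is our simplification of the paper's linear-interpolation refinement.

```python
import numpy as np

def estimate_sigma(samples):
    """Per-channel bandwidth from the median absolute deviation m of
    consecutive sample differences: sigma = m / (0.68 * sqrt(2)).
    samples: (N, d) array of per-pixel color samples over time."""
    m = np.median(np.abs(np.diff(samples, axis=0)), axis=0)
    # Clamp to at least one gray level so the bandwidth is never zero
    # (a simplification of the paper's interpolation step).
    return np.maximum(m, 1.0) / (0.68 * np.sqrt(2.0))

def background_probability(x_t, samples, sigma):
    """Eq. (6): average over the N samples of a product of per-channel
    Gaussian kernels centered at each stored sample."""
    d2 = (x_t - samples) ** 2 / (2.0 * sigma ** 2)            # (N, d)
    kernels = np.exp(-d2) / np.sqrt(2.0 * np.pi * sigma ** 2)  # (N, d)
    return np.prod(kernels, axis=1).mean()

def is_foreground(x_t, samples, sigma, th=1e-6):
    # Foreground when the probability under the background model
    # falls below the global threshold th (an illustrative value).
    return background_probability(np.asarray(x_t, float),
                                  np.asarray(samples, float), sigma) < th
```

Model adaptation then amounts to maintaining `samples` as a sliding window: push the newest observation and drop the oldest.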
1154 PROCEEDINGS OF THE IEEE, VOL. 90, NO. 7, JULY 2002

Practically, since only a finite number of samples are used and the computation must be performed in real time, the choice of a suitable bandwidth is essential. Too small a bandwidth will lead to a ragged density estimate,
while too wide a bandwidth will lead to an over-smoothed density estimate [2]. Since the expected variations in pixel intensity over time differ from one location in the image to another, a different kernel bandwidth is used for each pixel. Also, a different kernel bandwidth is used for each color channel.

To estimate the kernel bandwidth sigma_j for the jth color channel of a given pixel, we compute the median absolute deviation over the sample for consecutive intensity values of the pixel. That is, the median m of |x_i - x_{i+1}| for each consecutive pair (x_i, x_{i+1}) in the sample is calculated independently for each color channel. The motivation behind the use of the median absolute deviation is that pixel intensities over time are expected to have jumps, because different objects (e.g., sky, branch, leaf, and mixtures when an edge passes through the pixel) are projected onto the same pixel at different times. Since we are measuring deviations between two consecutive intensity values, the pair (x_i, x_{i+1}) usually comes from the same local-in-time distribution, and only a few pairs are expected to come from cross distributions (intensity jumps). The median is a robust estimate and should not be affected by a few jumps.

If we assume that this local-in-time distribution is Gaussian N(mu, sigma^2), then the distribution of the deviation (x_i - x_{i+1}) is also Gaussian, N(0, 2*sigma^2). Since this distribution is symmetric, the median m of the absolute deviations is equivalent to the quarter percentile of the deviation distribution. That is,

Pr(N(0, 2*sigma^2) > m) = 0.25

and therefore the standard deviation of the first distribution can be estimated as

sigma = m / (0.68 * sqrt(2)).

Since the deviations are integer gray-scale (color) values, linear interpolation is used to obtain more accurate median values.

2) Probabilistic Suppression of False Detection: In outdoor environments with fluctuating backgrounds, there are two sources of false detections. First, there are false detections due to random noise, which are expected to be homogeneous over the entire image.
Second, there are false detections due to small movements in the scene background that are not represented by the background model. This can occur locally, for example, if a tree branch moves farther than it did during model generation. It can also occur globally in the image as a result of small camera displacements caused by wind load, which is common in outdoor surveillance and causes many false detections. These kinds of false detections are usually spatially clustered in the image, and they are not easy to eliminate using morphological techniques or noise filtering, because such operations might also affect the detection of small and/or occluded targets.

If a part of the background (a tree branch, for example) moves to occupy a new pixel, but it was not part of the model for that pixel, then it will be detected as a foreground object. However, this object will have a high probability of belonging to the background distribution corresponding to its original pixel. Assuming that only a small displacement can occur between consecutive frames, we decide whether a detected pixel is caused by a background object that has moved by considering the background distributions of a small neighborhood of the detection location.

Let x_t be the observed value of a pixel x detected as a foreground pixel at time t. We define the pixel displacement probability P_N(x_t) to be the maximum probability that the observed value x_t belongs to the background distribution of some point in the neighborhood N(x) of x:

P_N(x_t) = max_{y in N(x)} Pr(x_t | B_y)

where B_y is the background sample for pixel y, and the probability Pr(x_t | B_y) is calculated using the kernel density estimate as in (6). By thresholding P_N for detected pixels, we can eliminate many false detections due to small motions in the background scene.
To avoid losing true detections that might accidentally be similar to the background of some nearby pixel (e.g., camouflaged targets), a constraint is added that the whole detected foreground object, and not only some of its pixels, must have moved from a nearby location. The component displacement probability P_C is defined to be the probability that a detected connected component C has been displaced from a nearby location. This probability is estimated by

P_C = prod_{x in C} P_N(x).

For a connected component corresponding to a real target, the probability that the component has been displaced from the background will be very small. So, a detected pixel x will be considered to be part of the background only if (P_N(x) > th1) AND (P_C(x) > th2).

Fig. 2 illustrates the effect of the second stage of detection. The result after the first stage is shown in Fig. 2(b). In this example, the background had not been updated for several seconds, and the camera was slightly displaced during this time interval, so we see many false detections along high-contrast edges. Fig. 2(c) shows the result after suppressing the detected pixels with high displacement probability. Most false detections due to displacement were eliminated, and only random noise that is uncorrelated with the scene remains as false detections. However, some true detected pixels were also lost. The final result of the second detection stage is shown in Fig. 2(d), where the component displacement probability constraint was added. Fig. 3(b) shows results for a case where, as a result of wind load, the camera is shaking slightly, producing many clustered false detections, especially along edges. After probabilistic suppression of false detections [Fig. 3(c)], most of these clustered false detections are suppressed, while the small target on the left side of the image remains.
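The second-stage suppression can be sketched as follows, with the kernel density estimate of (6) abstracted as a callable `prob(x, samples)`. The neighborhood radius, the thresholds `th1` and `th2`, and all function names are illustrative, not values from the paper.

```python
import numpy as np

def pixel_displacement_prob(x_t, pixel, bg_samples, prob, radius=1):
    """P_N(x_t): maximum probability that observation x_t belongs to the
    background distribution of some pixel in the neighborhood of `pixel`.
    bg_samples maps pixel coordinates to their background sample history."""
    i, j = pixel
    best = 0.0
    for di in range(-radius, radius + 1):
        for dj in range(-radius, radius + 1):
            neighbor = (i + di, j + dj)
            if neighbor in bg_samples:
                best = max(best, prob(x_t, bg_samples[neighbor]))
    return best

def suppress(component, values, bg_samples, prob, th1=1e-4, th2=1e-8):
    """Decide, for each pixel of one detected connected component C,
    whether it should be re-labeled as background. A pixel is re-labeled
    only if P_N(x) > th1 AND P_C > th2, where P_C is the product of the
    per-pixel displacement probabilities over the whole component."""
    pn = {p: pixel_displacement_prob(values[p], p, bg_samples, prob)
          for p in component}
    pc = float(np.prod([pn[p] for p in component]))
    return {p: (pn[p] > th1 and pc > th2) for p in component}
```

Because P_C multiplies the per-pixel probabilities, a component containing even a few pixels that match no nearby background distribution (a real target) yields a very small P_C, so the whole component survives suppression.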