[26]; aerial surveillance [35], [34]; video segmentation [1]; vehicles and driver assistance [15], [24]; just to mention a few. As mentioned above, the underlying strategy in the solutions proposed in the literature essentially relies on the compensation of the camera motion. The difference between them lies in the sensor (i.e., monocular/stereoscopic) or in the use of prior knowledge of the scene together with visual cues. For instance, [26] uses a stereo system and predicts the depth image for the current time by using ego-motion information and the depth image obtained at the previous time. Then, moving objects are easily detected by comparing the predicted depth image with the one obtained at the current time. Prior knowledge of the scene is also used in [35] and [34]. In these cases the authors assume that the scene is far from the camera (monocular) and that the depth variation of the objects of interest is small compared to the distance (e.g., airborne image sequences). In this context the camera motion can be approximately compensated by a 2D parametric transformation (a 3×3 homography). Hence, motion compensation is achieved by warping a sequence of frames to a reference frame, where moving objects are easily detected by image subtraction, as in the stationary camera case (a minimal sketch of this warp-and-subtract scheme is given below).

A more general approach has been proposed in [1] for segmenting videos captured with a freely moving camera, addressing recordings with a complex background and large moving non-rigid foreground objects. The authors propose a region-based motion compensation, which estimates the motion of the camera by finding correspondences between sets of salient regions obtained by segmenting successive frames. In the vehicle on-board vision systems and driver assistance fields, the compensation of camera motion has also attracted researchers' attention in recent years. For instance, in [15] the authors present a simple but effective approach based on the use of GPS information to roughly align frames from video sequences. A local appearance comparison between the aligned frames is then used to detect objects. In the driver assistance context, but using an on-board stereo rig, [24] introduces a 3D data registration based approach to compensate the camera motion between two consecutive frames. In that work, consecutive stereo frames are aligned into the same coordinate system; then moving objects are obtained from a 3D frame subtraction, similar to [26]. The current chapter proposes an extension of [24], detecting misregistration regions according to an adaptive threshold derived from the depth information.

The remainder of this chapter is organized as follows. Section 2 introduces related work on the 3D data registration problem. Then, Section 3 presents the proposed approach for moving object detection. It consists of three stages: i) 2D feature point detection and tracking; ii) robust 3D data registration; and iii) moving object detection through consecutive stereo frame subtraction. Experimental results in real environments are presented in Section 4. Finally, conclusions and future work are given in Section 5.
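To make the warp-and-subtract idea concrete, the following is a minimal sketch, not the implementation of [35] or [34]: it assumes OpenCV and grayscale frames, estimates the homography from sparse tracked corners, and the detector parameters and difference threshold are illustrative values.

```python
import cv2
import numpy as np

def moving_object_mask(ref_frame, cur_frame, diff_thresh=30):
    """Compensate camera motion with a 3x3 homography and subtract frames.

    Valid only when the scene is far from the camera and the depth
    variation is small (e.g., airborne sequences), as discussed above.
    """
    # 1) Sparse correspondences between the two frames (corners tracked
    #    with pyramidal Lucas-Kanade).
    pts_ref = cv2.goodFeaturesToTrack(ref_frame, maxCorners=500,
                                      qualityLevel=0.01, minDistance=7)
    pts_cur, status, _ = cv2.calcOpticalFlowPyrLK(ref_frame, cur_frame,
                                                  pts_ref, None)
    good = status.ravel() == 1

    # 2) Robust 2D parametric transformation (3x3 homography).
    H, _ = cv2.findHomography(pts_cur[good], pts_ref[good], cv2.RANSAC, 3.0)

    # 3) Warp the current frame onto the reference frame and subtract.
    h, w = ref_frame.shape
    warped = cv2.warpPerspective(cur_frame, H, (w, h))
    diff = cv2.absdiff(ref_frame, warped)

    # 4) Threshold the residual; remaining blobs are moving object candidates.
    _, mask = cv2.threshold(diff, diff_thresh, 255, cv2.THRESH_BINARY)
    return mask
```

This scheme is only valid under the far-scene assumption stated above; for close-range scenes a single homography cannot absorb the parallax, and static structures would also appear in the difference image.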
2 Related Work

A large number of approaches have been proposed in the computer vision community for 3D point registration during the last two decades (e.g., [3], [4], [22]). 3D data point registration aims at finding the best transformation that places both the given data set and the corresponding model set into the same reference system. The different approaches proposed in the literature can be broadly classified into two categories, depending on whether initial information is required (fine registration) or not (coarse registration); a comprehensive survey of registration methods can be found in [23]. The approach followed in the current work for moving object detection lies within the fine rigid registration category.

Typically, the fine registration process consists of iterating two stages. Firstly, the correspondence between every point from the current data set and the model set is found; these correspondences are used to define the residual of the registration. Secondly, the best set of parameters that minimizes the accumulated residual is computed. These two stages are applied iteratively until convergence is reached. The Iterative Closest Point (ICP) algorithm, originally introduced by [3] and [4], is one of the most widely used registration techniques following this two-stage scheme. Since then, several variations and improvements have been proposed in order to increase its efficiency and robustness (e.g., [25], [8], [5]).

In order to avoid the point-wise nature of ICP, which makes the problem discrete and non-smooth, different techniques have been proposed: i) probabilistic representations are used to describe both the data and model sets (e.g., [31], [13]); ii) in [8] the point-wise problem is avoided by using a distance field of the model set; iii) an implicit polynomial (IP) is used in [36] to fit the distance field, which later defines a gradient field leading the data points towards the model set; iv) implicit polynomials have also been used in [28] to represent both the data set and the model set, in which case an accurate pose estimation is computed from the polynomial coefficients.

Probabilistic approaches avoid the point-wise correspondence problem by representing each set by a mixture of Gaussians (e.g., [13], [6]); hence, registration becomes a problem of aligning two mixtures. In [13] a closed-form expression for the L2 distance between two Gaussian mixtures is proposed. Instead of Gaussian mixture models, [31] proposes an approach based on multivariate t-distributions, which is robust to a large number of missing values. Both approaches, like all mixture models, are highly dependent on the number of mixture components used for modelling the sets. This problem is generally solved by assuming a user-defined number of components, or as many components as points. The former requires the points to be clustered, while the latter results in a very expensive optimization problem that cannot handle large data sets or may get trapped in a local minimum when complex sets are considered.

The non-differentiable nature of ICP is overcome in [8] by using a derivable distance transform (the Chamfer distance). A non-linear minimization (the Levenberg-Marquardt algorithm) of the error function, based on that distance transform, is used to find the optimal registration parameters. The main disadvantage of [8] is that the precision depends on the resolution of the grid on which the Chamfer distance transform and the discrete derivatives are evaluated. Hence, this technique cannot be directly applied when the point set is sparse or unorganized.
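As a concrete illustration of the two-stage scheme, the sketch below implements a bare-bones point-to-point ICP with NumPy and SciPy; it is a didactic approximation rather than the algorithm of [3] or [4]. It pairs nearest-neighbour correspondence search (a k-d tree) with the closed-form SVD-based estimate of the rigid motion; the iteration count and tolerance are arbitrary choices.

```python
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(src, dst):
    """Closed-form least-squares rigid transform (R, t) mapping src to dst."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)           # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                            # proper rotation (det = +1)
    t = c_dst - R @ c_src
    return R, t

def icp(data, model, iters=50, tol=1e-6):
    """Iterate correspondence search and rigid-motion estimation."""
    tree = cKDTree(model)                         # the model set is fixed
    R, t = np.eye(3), np.zeros(3)
    prev_err = np.inf
    for _ in range(iters):
        moved = data @ R.T + t
        # Stage 1: closest-point correspondences define the residual.
        dist, idx = tree.query(moved)
        # Stage 2: parameters minimizing the accumulated residual.
        R, t = best_rigid_transform(data, model[idx])
        err = np.mean(dist ** 2)
        if abs(prev_err - err) < tol:             # convergence reached
            break
        prev_err = err
    return R, t
```

The best_rigid_transform helper is the standard least-squares solution used in most ICP variants; note that the correspondence search, not the minimization, dominates the running time, which is why the variants cited above focus on it.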
In contrast to the previous approaches, [36] proposes a fast registration method based on solving an energy minimization problem derived from an implicit polynomial fitted to the given model set [37]. This IP is used to define a gradient flow that drives the data set to the model set without using point-wise correspondences. The energy functional is minimized by means of a heuristic two-step process. Firstly, every point in the given data set moves freely along the gradient vectors defined by the IP. Secondly, the outcome of the first step is used to define a single transformation that represents this movement in a rigid way. These two steps are repeated alternately until convergence is reached. The weak point of this approach is the first step of the minimization, which lets the points move independently in the proposed gradient flow. Furthermore, the proposed gradient flow is not smooth, especially close to the boundaries.

Most of the algorithms presented above were originally proposed for registering overlapped sets of points corresponding to the 3D surface of a single rigid object. Extensions to a more general framework, where the 3D surfaces to be registered correspond to different views of a given scene, have been presented in the robotics field (e.g., [30], [18]). Actually, in all these extensions, the registration is used for the simultaneous localization and mapping (SLAM) of the mobile platform (i.e., the robot). Although some approaches differentiate static and dynamic parts of the environment before registration (e.g., [30], [33]), most of them assume that the environment is static, containing only rigid, non-moving objects. Therefore, if moving objects are present in the scene, the least squares formulation of the problem will provide a rigid transformation biased by the motions in the scene.

Independently of the kind of scenario to be tackled (partial view of a single object or whole scene), 3D registration algorithms are computationally expensive, which prevents their use in real-time applications. In the current work a robust strategy that reduces the CPU time by focusing only on feature points is proposed. It is intended to be used in ADAS (Advanced Driver Assistance Systems) applications, in which an on-board camera explores the current scene in real time. Usually, an exhaustive window scanning approach is adopted to extract the regions of interest (ROIs) needed in pedestrian or vehicle detection systems. The concept of consecutive frame registration for moving object detection has been explored in [11], in which an active frame subtraction for pedestrian detection from images of moving cameras is proposed. In that work, consecutive frames were not registered by a vision-based approach but by estimating the relative camera motion using the vehicle speed and a gyrosensor. A similar solution has been proposed in [15], but using GPS information.
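To illustrate the distance-field alternative of [8] discussed above, the sketch below registers a 2D data set against a precomputed distance transform of the model set using a Levenberg-Marquardt minimization. The 2D rigid parameterization, the grid size, the bilinear sampling, and the assumption that all points lie within the grid are simplifications; the code does not reproduce the cited works.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt, map_coordinates
from scipy.optimize import least_squares

def build_distance_field(model, grid_shape=(256, 256)):
    """Rasterize the model set and compute its (Chamfer-like) distance field."""
    occ = np.ones(grid_shape, dtype=bool)
    ij = np.round(model).astype(int)
    occ[ij[:, 1], ij[:, 0]] = False               # model cells become zeros
    return distance_transform_edt(occ)            # distance to nearest model point

def register_2d(data, dist_field, x0=(0.0, 0.0, 0.0)):
    """Find (theta, tx, ty) minimizing the sampled distances (LM optimizer)."""
    def residuals(p):
        th, tx, ty = p
        c, s = np.cos(th), np.sin(th)
        x = c * data[:, 0] - s * data[:, 1] + tx
        y = s * data[:, 0] + c * data[:, 1] + ty
        # Bilinear interpolation of the distance field at the moved points;
        # precision depends on the grid resolution, as noted in the text.
        return map_coordinates(dist_field, [y, x], order=1, mode='nearest')
    return least_squares(residuals, x0, method='lm').x
```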
3 Proposed Approach

The proposed approach combines the 2D detection of key points with 3D registration. The first stage consists in extracting a set of 2D feature points at a given frame and tracking them through the next frame; the 3D coordinates corresponding to each of these 2D feature points are later used during the registration process, where the rigid displacement (six degrees of freedom) that maps the 3D scene associated with frame (n) into the 3D scene associated with frame (n+1) is computed (see Figure 1). This rigid transform represents the 3D motion of the camera between frame (n) and frame (n+1). Finally, moving objects are detected by computing the difference between the 3D coordinates of points represented in the same coordinate system. Before going into the details of the stages of the proposed approach, a brief description of the stereo vision system used is given.

3.1 System Setup

A commercial stereo vision system (Bumblebee from Point Grey¹) is used to acquire the 3D information of the scene in front of the host vehicle. It consists of two Sony ICX084 Bayer pattern CCDs with 6 mm focal length lenses. Bumblebee is a pre-calibrated system that does not require in-field calibration. The baseline of the stereo head is 12 cm and it is connected to the computer by an IEEE-1394 interface. Right and left color images (Bayer pattern) were captured at a resolution of 640×480 pixels. After capturing each right-left pair of images, a dense cloud of 3D data points P^n is computed by using 3D reconstruction software at each frame n. The right intensity image I^n is used during the feature point detection and tracking stage.

3.2 Feature Detection and Tracking

As previously mentioned, the proposed approach is intended to be used on on-board vision systems for driver assistance applications. Hence, due to real-time constraints, it is clear that the whole cloud of points cannot be used to find the rigid transformation that maps two consecutive frames to the same reference system. In order to tackle this problem, an efficient approach that relies only on a reduced set of points from the given image I^n is proposed. Feature points f^n_i(u,v) ⊂ I^n lying far away from the camera position (P^n_i(x,y,z) > δ) are discarded in order to increase the registration accuracy² (δ = 15 m in the current implementation).

The proposed approach does not depend on the technique used for detecting feature points; actually, two different approaches have been tested: one based on Harris corner points [10] and another on SIFT features [16]. In the first case, once the feature points have been selected, a tracking window W_T of (9×9) pixels is set. Feature points are tracked by minimizing the sum of squared differences between two consecutive frames using an iterative approach [17]. In the second case, SIFT features [16] are detected at the extrema of differences of Gaussians in a scale-space representation and described as histograms of gradient orientations. In this case, following [16], a function based on the distance between the corresponding histograms is used to match the features in consecutive frames (the public implementation of SIFT in [29] has been used).

¹ www.ptgrey.com
² Stereo head data uncertainty grows quadratically with depth [19].
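A sketch of this first stage is given below, using OpenCV as a stand-in for the Harris detector [10] and the iterative SSD-based tracker [17]; the function name, the corner detector parameters, and the use of a per-pixel depth map are illustrative assumptions, while the 9×9 tracking window and δ = 15 m follow the text.

```python
import cv2
import numpy as np

DELTA = 15.0  # depth threshold in meters (delta = 15 m in the chapter)

def detect_and_track(img_n, img_n1, depth_n):
    """Harris-based detection in frame n, SSD/KLT tracking into frame n+1.

    depth_n gives the z coordinate (meters) per pixel from the stereo
    reconstruction; far-away points are discarded before tracking.
    """
    # Harris corner detection in frame n (grayscale image).
    pts = cv2.goodFeaturesToTrack(img_n, maxCorners=300, qualityLevel=0.01,
                                  minDistance=10, useHarrisDetector=True)
    # Discard features whose 3D point lies beyond the depth threshold.
    uv = pts.reshape(-1, 2).astype(int)
    near = depth_n[uv[:, 1], uv[:, 0]] < DELTA
    pts = pts[near]

    # Iterative tracking minimizing the sum of squared differences,
    # with a 9x9 tracking window as in the chapter.
    tracked, status, _ = cv2.calcOpticalFlowPyrLK(img_n, img_n1, pts, None,
                                                  winSize=(9, 9))
    ok = status.ravel() == 1
    return pts[ok], tracked[ok]
```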
Fig. 1 Feature points detected and tracked through consecutive frames: (top) using the Harris corner detector; (bottom) using the SIFT detector and descriptor. [Figure: two pairs of 640×480 images, frame (n) and frame (n+1), with the tracked feature points overlaid.]

Fig. 2 Illustration of feature points represented in the 3D space, together with three couples of points used for computing the 3D rigid displacement [R|t] (RANSAC-like technique). [Figure: two camera coordinate systems (X, Y, Z) related by [R|t], with corresponding points P^n_i ↔ P^{n+1}_i, i = 1, 2, 3.]
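The RANSAC-like estimation of [R|t] referred to in Fig. 2 can be sketched as follows, reusing the best_rigid_transform helper from the ICP sketch in Section 2; the sample count and inlier tolerance are illustrative values, and the chapter's actual registration stage is detailed in the following sections. Minimal samples of three point couples define candidate rigid displacements, so feature points lying on moving objects are rejected as outliers of the dominant (static-scene) motion.

```python
import numpy as np

def ransac_rigid_transform(P_n, P_n1, iters=500, inlier_thresh=0.05):
    """Estimate [R|t] from corresponding 3D points with a RANSAC-like scheme.

    P_n, P_n1: (N, 3) arrays of 3D feature points in frames n and n+1.
    best_rigid_transform is the SVD-based helper from the ICP sketch.
    """
    rng = np.random.default_rng(0)
    best_R, best_t, best_inliers = np.eye(3), np.zeros(3), 0
    for _ in range(iters):
        # Minimal sample: three couples of points (assumed non-collinear).
        idx = rng.choice(len(P_n), size=3, replace=False)
        R, t = best_rigid_transform(P_n[idx], P_n1[idx])
        residual = np.linalg.norm(P_n @ R.T + t - P_n1, axis=1)
        inliers = residual < inlier_thresh        # e.g., 5 cm tolerance
        if inliers.sum() > best_inliers:
            best_inliers = inliers.sum()
            # Refine with all inliers for the final estimate.
            best_R, best_t = best_rigid_transform(P_n[inliers], P_n1[inliers])
    return best_R, best_t
```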