Object tracking using learned feature manifolds

Yanwen Guo a,*, Ye Chen a, Feng Tang b, Ang Li c, Weitao Luo a, Mingming Liu a

a National Key Lab for Novel Software Technology, Nanjing University, Nanjing 210023, PR China
b Hewlett-Packard Laboratories, Palo Alto, CA 94304, USA
c University of Maryland, College Park, MD 20740, USA

Article history: Received 10 August 2012; Accepted 30 September 2013; Available online 17 October 2013
Keywords: Feature manifold; SIFT; Tracking

Abstract. Local feature based object tracking approaches have been promising in solving tracking problems such as occlusions and illumination variations. However, existing approaches typically model feature variations using prototypes, and this discrete representation cannot capture the gradually changing property of local appearance. In this paper, we propose to model each local feature as a feature manifold to characterize the smooth changing behavior of the feature descriptor. The manifold is constructed from a series of transformed images simulating possible variations of the feature being tracked. We propose to build a collection of linear subspaces which approximate the original manifold as a low-dimensional representation. This representation is used for object tracking: the object location is determined by a feature-to-manifold matching process. Our tracking method can update the manifold status, add new feature manifolds, and remove expiring ones adaptively according to the object appearance. We show both qualitatively and quantitatively that this representation significantly improves tracking performance under occlusions and appearance variations on standard tracking datasets.

© 2013 Elsevier Inc. All rights reserved.

1. Introduction

Object tracking is a central problem in computer vision with many applications, such as activity analysis, automated surveillance, traffic monitoring, and human-computer interaction.
It is essentially the problem of finding the most likely estimate of the object state given a sequence of observations. Object tracking is challenging because of:

- Complex object appearance. The object may have a complicated appearance which is hard to model. Furthermore, it may undergo significant changes due to pose and scale variations as well as non-rigid object motions.
- Occlusions. The object may be occluded by the background or other moving objects, making it difficult to localize.
- Complex object motion. This is caused either by the moving pattern of the object or by camera motion accompanied by object motion.

There are two key components in an object tracking algorithm: object representation and dynamics. Object representation tries to model the object as accurately as possible so that the tracking algorithm can accurately describe the complex object appearance. Object dynamics model how the object appearance evolves over time so as to handle appearance variations. The two problems are usually coupled: the object representation should be designed to be easily updated to model appearance variations, while the object dynamics should be able to take advantage of the characteristics of the object representation for model updates.

Traditional methods for representing the object, such as the global histogram based approach in mean-shift tracking [1] and the PCA subspace based approach in EigenTracking [2], are global approaches which describe the object to be tracked as a whole. Such methods work well in many practical applications, but have several intrinsic limitations. First, it is usually very difficult for a global representation to capture local details, and as a result it is unable to model complex appearances. Second, global representations are not robust to partial occlusion: once the object is occluded, the whole feature vector of the object representation is affected. Third, global representations are hard to update.
Recently, local representations have opened a promising direction for solving these problems by representing an object as a set of local parts or sparse local features. Part-based trackers generally use sets of connected local parts or components with incorporated visual properties [3–6]. The parts used for object representation are updated during tracking by removing old parts that exhibit signs of drifting and adding new ones to easily accommodate appearance changes. Feature-based trackers often represent the target by a set of sparse local features such as SIFT [7] and affine invariant point detectors [8], which are often invariant to changes

Computer Vision and Image Understanding 118 (2014) 128–139
1077-3142/$ - see front matter © 2013 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.cviu.2013.09.007
This paper has been recommended for acceptance by J.-O. Eklundh.
* Corresponding author. E-mail address: ywguo@nju.edu.cn (Y. Guo).
in rotation, scale, illumination and viewpoint. These approaches first localize the features at a sparse set of distinctive image points by feature detectors. Then the feature vectors, usually named descriptors, are computed based on the local image statistics centered at these locations. Two major advantages of sparse local features are the invariance to image changes and the robustness to occlusions. Existing local feature based approaches typically model how the local features vary using prototypes. However, this discrete representation cannot capture the gradually changing property of local appearance.

In this paper, we propose a local feature based manifold representation for object tracking. The object is represented by a set of sparse local feature manifolds. Each local feature manifold is computed from a series of SIFT feature descriptors [7] that correspond to different appearances of the same object feature under simulated variations of practical situations. To build it, we first detect a set of interest points on the object using state-of-the-art feature detectors. For each feature point, we transform the image regions surrounding it to simulate real object changes. A feature manifold is thus obtained by exploring the ensemble of descriptors extracted from the transformed image regions. Such a manifold is an informative yet robust representation in that it captures the local appearance variations of a part of the object over time, making the local representation more robust against object changes. The local feature variation is complicated and nonlinear in practice, as illustrated by the example in Fig. 1, which shows a feature on a walking man. As can be observed, the feature appearance changes dramatically as the man walks. As a result, the feature manifold is a highly nonlinear appearance manifold. For computational efficiency, we apply incremental principal component analysis to it, yielding a collection of linear subspace approximations.
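The idea of approximating the nonlinear descriptor manifold by a collection of linear subspaces can be sketched as follows. This is a simplified batch variant for illustration only (the paper uses incremental PCA); the function names, the chunking heuristic, and the toy data are our own assumptions, not the authors' implementation:

```python
import numpy as np

def fit_local_subspaces(descriptors, n_subspaces=4, dim=4):
    """Approximate a nonlinear feature manifold by a set of linear
    (PCA) subspaces, one per contiguous chunk of descriptor samples."""
    subspaces = []
    for chunk in np.array_split(descriptors, n_subspaces):
        mean = chunk.mean(axis=0)
        # Principal directions of this local neighbourhood via SVD.
        _, _, vt = np.linalg.svd(chunk - mean, full_matrices=False)
        subspaces.append((mean, vt[:dim]))  # top-`dim` basis vectors
    return subspaces

def distance_to_manifold(f, subspaces):
    """Feature-to-manifold distance: reconstruction error against the
    closest local linear subspace."""
    errs = []
    for mean, basis in subspaces:
        centered = f - mean
        recon = basis.T @ (basis @ centered)  # project onto subspace
        errs.append(np.linalg.norm(centered - recon))
    return min(errs)

# Toy usage: 128-D "SIFT-like" descriptors drifting along a smooth path.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 200)[:, None]
base = rng.standard_normal((1, 128))
drift = rng.standard_normal((1, 128))
descs = base + t * drift + 0.01 * rng.standard_normal((200, 128))

manifold = fit_local_subspaces(descs, n_subspaces=4, dim=4)
on_manifold = distance_to_manifold(descs[100], manifold)
off_manifold = distance_to_manifold(rng.standard_normal(128), manifold)
print(on_manifold < off_manifold)  # prints True
```

A descriptor sampled from the simulated variations reconstructs well from at least one local subspace, while an unrelated vector does not, which is what makes the representation usable for feature-to-manifold matching.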
To model geometric relations among local features, the feature manifolds are organized into a feature manifold graph which is used to represent the target object to be tracked. Each local feature manifold describes object appearance details, and the relationships among the manifolds encode the object structure. Such geometric relationships are elastic and have the flexibility to handle objects with coherent motion and a certain amount of variation caused by viewpoint changes and articulated motions. An advantage of the feature manifold graph is that locally it reinforces the power of feature description and characterizes variations of object appearance by learning a series of descriptors, while globally it encodes the object structure with the geometric relations among those manifolds. Such characteristics make it suitable for many vision tasks.

We apply the feature manifold graph to object tracking as an application. With the feature manifold graph representation, the target object is tracked by graph-based feature-to-manifold tracking. During tracking, features are extracted in a candidate region of the current frame and then matched with the manifold. The object position is located by integrating all matches in the manifold graph. Since features may appear and disappear due to viewpoint changes and occlusions, our dynamic model is designed to add new feature manifolds and remove expiring ones adaptively and dynamically. To the best of our knowledge, this is the first paper that applies manifold learning to local features for object tracking.

The rest of this paper is organized as follows. Section 2 describes related work on object tracking. We present our feature manifold model in Section 3. Section 4 shows our main tracking paradigm. Experiments and analysis are given in Section 5, and finally the paper is concluded.

2. Related work

Object tracking using local features has been explored by previous researchers.
In [9], Shi and Tomasi proposed a method to select corner-based features that are most reliable for tracking. Collins et al. developed an algorithm for unsupervised learning of object models as constellations of features, and proposed a discriminative feature tracker [10]. A simultaneous modeling and tracking method is proposed in [11] to learn the object model during tracking. The object features are selected manually and tracked individually. The posterior distribution of appearance and shape is built up incrementally using an exemplar-based approach. In [12], the object is represented as view-dependent object appearance models corresponding to different viewing angles. This collection of acquired models is indexed with respect to the view sphere. These models are matched to each frame to estimate object motion. In [13], the authors proposed a "feature harvesting" approach that has a training phase to learn the object geometry and appearance using a randomized tree classifier. Online tracking then becomes a detection problem using the learned classifier. Liu et al. proposed to jointly track different types of features by representing the objects of interest with hybrid templates [14]. A generative model is developed to learn the template and to estimate the object location and scale.

It is noted that state-of-the-art local features such as SIFT [7] and SURF [15] have recently been used for object tracking. In [16], an attributed relational feature graph, which represents the object using SIFT features with geometric relations, is proposed for object tracking. Zhou et al. presented a SIFT based mean shift tracking algorithm [17]. The similarity between two neighboring frames is measured in terms of color and SIFT correspondence using an expectation–maximization algorithm. In [18], He et al. proposed to represent the object by a set of SURF features of interest. Object motion is estimated in terms of maximum likelihood feature motion observations.
In [19], Sun and Liu proposed an object tracking method based on the combination of local SIFT description and global PCA representation. The method is constructed within the particle filter framework.

Fig. 1. Appearance variations of a feature during tracking. Feature patches of different frames are shown on the left and a feature manifold is visualized on the right.

In fact, the change of feature appearance is smooth and highly nonlinear in nature, which is hard to model using discrete prototypes. In this
130 Y.Guo et aL/Computer Vision and Image Understanding 118(2014)128-139 paper,we propose to model the feature appearance variations as a To simulate possible viewpoint situations,transformations are feature manifold approximated by several linear subspaces.This applied to the original image.Each transformation involves three significantly enhances the distinctiveness of object representation. atomic transformations:scaling,rotation and shearing.The mix- Avidan's ensemble tracker [20]trains an ensemble of weak clas- ture of these transformations can accurately simulate all possible sifiers and combines them into a strong one using AdaBoost to dis- viewpoint changes.To produce a simulated image /'a combination tinguish between the object and background by pixel labeling.To of these three transformations is applied to the original image lo. adapt to object appearance changes and maintain temporal coher- ence,a new weak classifier is trained per frame and the strong clas- I=Tsh*Tr*Tsc*l0 (1) sifier is updated dynamically.Appearance modeling is important where Tsh,Tr,and Tse mean shearing,rotation,and scaling transfor- for object tracking.In21,the covariance matrices of image fea- mations separately. tures in five modes are used to represent object appearance.Visual We express them using the homogeneous matrices.Specifically, tracking is led by Bayesian inference using the learned Log-Euclid- /a00 ean Riemannian subspace.In [22].Hu et al.presented an incremen- we represent Tsc as0 a 0 and set three levels for a to (0.5.1. tal learning algorithm which represents object appearance with 0 0 low dimensional tensor subspace.During tracking.the learning -sin0 0 algorithm is used to capture appearance of the dynamic object. 
2).Tr is expressed as sin0 0 and we rotate the im- In [23,Ross et al.presented a tracking method that incrementally 0 0 learns a low-dimensional subspace representation,efficiently age around its center for a lap and set the interval to 30.We de- adapting online to changes in appearance of the target.A sampling 1 b 0 algorithm with likelihood estimate is used to estimate object loca- note Tsh by d 1 0 and set three levels {-1,0,1)for both b tions during tracking.Originated from [24,251.TLD [26]is a real- 00 time algorithm for tracking of unknown objects in video streams. and d to execute the shearing transformation. It simultaneously tracks the object,learns its appearance and de- In implementation,we use the OpcnCV function cvWarpPerspec- tects it whenever it appears in the video.In [27].a solution for par- tive()to generate the transformed image.This function applies a ticle filtering on general graphs is developed.As the applications,a perspective transformation to an image,and transforms the source high-order Markov chains based object tracking method and a image using the specified matrix multi-object graphical interaction models based distributed multi- It is noted that the interest object is assumed to be nearly planar ple object tracking method are developed. or far away from the camera such that the above combined trans- Our work is also inspired by the face tracking methods using formations can be applied to it for simulating the possible varia- probabilistic appearance manifolds [28].The task of tracking and tions of practical situations.We do not intend to handle large recognition is formulated as a maximum posteriori estimation out-of-plane rotations which may result in self-occlusions of the problem under the manifold representation which models face object.Our experiments show that such scheme works well on a appearance.Learning image manifolds from collections of local broad range of video sequences. 
features has been addressed in [29].A feature embedding repre- Without loss of generality,we assume that the image is placed sentation that preserves local appearance similarity and spatial with its center at the world origin.The above combined transfor- structure of the features is learned by solving an eigenvalue prob- mations are applied to every point in the image.This yields a series lem.A difference between the above representation and our fea- of transformed images.We then first detect the feature points in ture manifold is that they generally represent the image with a the original image and use corresponding transformation to find manifold.However,we use the manifold to characterize variations their corresponding locations in each transformed image.Specifi- of each individual feature,which is more flexible and able to de- cally,we use the SIFT detector,Harris-Affine detector,as well as scribe object details. Hessian-Affine detector to detect feature points,and describe the local appearance of each point using SIFT descriptor.For each point detected by SIFT.Harris-Affine,or Hessian-Affine detectors,the scale used in computing SIFT descriptor is determined by the cor- 3.The feature manifold model responding detector.For each SIFT feature descriptor fi in the origi- nal image,its corresponding features are different versions of the In traditional local feature based representations,each local fea- original one under viewpoint variations.The manifold M is ture is represented as a single feature vector describing local learned from M(fi.fh.f2,...},in which fn.fi2....are simulated fea- appearance.During tracking.it is matched to features extracted ture descriptors of fi under different viewpoints in the new frame.The assumption is that although object appear- In general,the appearance manifold of a local feature is highly ance may change due to viewpoint and scale changes,the invari- nonlinear.Many samples are needed to learn an accurate represen- ance property 
of local feature descriptors is able to accommodate tation.Although learning such a manifold globally would be an appearance variations.However,most existing local feature ideal solution,this is mostly infeasible in practice especially for descriptors are only partially invariant to these transformations. tracking which is hard to capture enough training samples.We For example,SIFT is only invariant to scale changes,brightness assume that local linearity property holds everywhere on a global changes,and to some extent robust to viewpoint changes.This is nonlinear manifold.Therefore,a local feature manifold can be illustrated in the results in the performance evaluation of local fea- approximated by a collection of linear subspaces 32.Note that ture descriptors [30.As can be observed,the matching perfor- the similar idea of modeling image variations due to changing illu- mance of SIFT degrades significantly with more significant minations by low-dimensional linear subspaces has been exploited viewpoint changes.To solve this problem,we propose to synthet- by [33].There have been many methods for manifold learning such ically generate additional training samples simulating possible as LLE [32]and ISOMAP [34],but most original algorithms are variations of the features under viewpoint and scale changes.This unsuitable for tracking which requires incremental learning for enriched dataset is used to learn the initial feature manifold.It updating.Although the incremental versions of such algorithms should be noted that the idea of simulating images under different have been developed in the literature [35,36],we choose to types of transformations has been used in[31]to learn a patch approximate the nonlinear feature manifold with several PCA sub- classifier. spaces due to computational efficiency
paper, we propose to model the feature appearance variations as a feature manifold approximated by several linear subspaces. This significantly enhances the distinctiveness of object representation. Avidan’s ensemble tracker [20] trains an ensemble of weak classifiers and combines them into a strong one using AdaBoost to distinguish between the object and background by pixel labeling. To adapt to object appearance changes and maintain temporal coherence, a new weak classifier is trained per frame and the strong classifier is updated dynamically. Appearance modeling is important for object tracking. In [21], the covariance matrices of image features in five modes are used to represent object appearance. Visual tracking is led by Bayesian inference using the learned Log-Euclidean Riemannian subspace. In [22], Hu et al. presented an incremental learning algorithm which represents object appearance with low dimensional tensor subspace. During tracking, the learning algorithm is used to capture appearance of the dynamic object. In [23], Ross et al. presented a tracking method that incrementally learns a low-dimensional subspace representation, efficiently adapting online to changes in appearance of the target. A sampling algorithm with likelihood estimate is used to estimate object locations during tracking. Originated from [24,25], TLD [26] is a realtime algorithm for tracking of unknown objects in video streams. It simultaneously tracks the object, learns its appearance and detects it whenever it appears in the video. In [27], a solution for particle filtering on general graphs is developed. As the applications, a high-order Markov chains based object tracking method and a multi-object graphical interaction models based distributed multiple object tracking method are developed. Our work is also inspired by the face tracking methods using probabilistic appearance manifolds [28]. 
The task of tracking and recognition is formulated as a maximum a posteriori estimation problem under the manifold representation which models face appearance. Learning image manifolds from collections of local features has been addressed in [29]. A feature embedding representation that preserves local appearance similarity and the spatial structure of the features is learned by solving an eigenvalue problem. A difference between the above representations and our feature manifold is that they generally represent the whole image with a manifold, whereas we use a manifold to characterize the variations of each individual feature, which is more flexible and better captures object details.

3. The feature manifold model

In traditional local feature based representations, each local feature is represented as a single feature vector describing local appearance. During tracking, it is matched to features extracted in the new frame. The assumption is that although object appearance may change due to viewpoint and scale changes, the invariance property of local feature descriptors is able to accommodate these variations. However, most existing local feature descriptors are only partially invariant to such transformations. For example, SIFT is only invariant to scale changes and brightness changes, and to some extent robust to viewpoint changes. This is illustrated by the performance evaluation of local feature descriptors in [30]: the matching performance of SIFT degrades significantly under larger viewpoint changes. To address this problem, we propose to synthetically generate additional training samples simulating possible variations of the features under viewpoint and scale changes. This enriched dataset is used to learn the initial feature manifold. It should be noted that the idea of simulating images under different types of transformations has been used in [31] to learn a patch classifier.
To simulate possible viewpoint situations, transformations are applied to the original image. Each transformation is composed of three atomic transformations: scaling, rotation, and shearing. The combination of these transformations can closely simulate all possible viewpoint changes. To produce a simulated image $I'$, a combination of the three transformations is applied to the original image $I_0$,

$$I' = T_{sh}\, T_r\, T_{sc}\, I_0, \qquad (1)$$

where $T_{sh}$, $T_r$, and $T_{sc}$ denote the shearing, rotation, and scaling transformations, respectively. We express them as homogeneous matrices. Specifically, we represent $T_{sc}$ as $\begin{pmatrix} a & 0 & 0 \\ 0 & a & 0 \\ 0 & 0 & 1 \end{pmatrix}$ and set three levels for $a$: $\{0.5, 1, 2\}$. $T_r$ is expressed as $\begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix}$, and we rotate the image around its center for a full lap with an interval of $30^\circ$. We denote $T_{sh}$ by $\begin{pmatrix} 1 & b & 0 \\ d & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$ and set three levels $\{-1, 0, 1\}$ for both $b$ and $d$ to execute the shearing transformation. In the implementation, we use the OpenCV function cvWarpPerspective() to generate each transformed image; this function applies the specified perspective transformation matrix to the source image. Note that the object of interest is assumed to be nearly planar or far from the camera, so that the combined transformations above can simulate the variations arising in practical situations. We do not intend to handle large out-of-plane rotations, which may result in self-occlusions of the object. Our experiments show that this scheme works well on a broad range of video sequences. Without loss of generality, we assume that the image is placed with its center at the world origin. The combined transformations are applied to every point in the image, yielding a series of transformed images. We first detect the feature points in the original image and use the corresponding transformation to find their locations in each transformed image.
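As an illustration, the combined transformations of Eq. (1) with the parameter levels above (three scales, twelve rotations, nine shear combinations) can be enumerated in a few lines of NumPy. This is our own sketch, not the authors' code; the helper names are ours. The same homogeneous matrices also map each detected feature location into the corresponding transformed image.

```python
import numpy as np

def transform_matrices():
    """Enumerate every combined matrix T_sh @ T_r @ T_sc of Eq. (1)."""
    mats = []
    for a in (0.5, 1.0, 2.0):                      # three scale levels
        T_sc = np.diag([a, a, 1.0])
        for deg in range(0, 360, 30):              # a full lap, 30-degree steps
            th = np.deg2rad(deg)
            T_r = np.array([[np.cos(th), -np.sin(th), 0.0],
                            [np.sin(th),  np.cos(th), 0.0],
                            [0.0,         0.0,        1.0]])
            for b in (-1, 0, 1):                   # shear levels for b and d
                for d in (-1, 0, 1):
                    T_sh = np.array([[1.0, b, 0.0],
                                     [d, 1.0, 0.0],
                                     [0.0, 0.0, 1.0]])
                    mats.append(T_sh @ T_r @ T_sc)
    return mats                                    # 3 x 12 x 9 = 324 matrices

def map_point(T, x, y):
    """Map an image point (relative to the image center) through a
    homogeneous 3x3 transform, as used to locate features in each
    transformed image."""
    p = T @ np.array([x, y, 1.0])
    return p[0] / p[2], p[1] / p[2]
```

The count of 324 matrices matches the number of transformations reported in the experiments (Section 5.1).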
Specifically, we use the SIFT detector, the Harris-Affine detector, as well as the Hessian-Affine detector to detect feature points, and describe the local appearance of each point using the SIFT descriptor. For each point detected by the SIFT, Harris-Affine, or Hessian-Affine detectors, the scale used in computing the SIFT descriptor is determined by the corresponding detector. For each SIFT feature descriptor $f_i$ in the original image, its corresponding features are different versions of the original one under viewpoint variations. The manifold $M_i$ is learned from $\{f_i, f_{i1}, f_{i2}, \ldots\}$, in which $f_{i1}, f_{i2}, \ldots$ are simulated feature descriptors of $f_i$ under different viewpoints. In general, the appearance manifold of a local feature is highly nonlinear, and many samples are needed to learn an accurate representation. Although learning such a manifold globally would be an ideal solution, this is mostly infeasible in practice, especially for tracking, where it is hard to capture enough training samples. We assume that the local linearity property holds everywhere on a globally nonlinear manifold. Therefore, a local feature manifold can be approximated by a collection of linear subspaces [32]. Note that a similar idea of modeling image variations due to changing illuminations by low-dimensional linear subspaces has been exploited in [33]. There have been many methods for manifold learning, such as LLE [32] and ISOMAP [34], but most original algorithms are unsuitable for tracking, which requires incremental learning for updating. Although incremental versions of such algorithms have been developed in the literature [35,36], we choose to approximate the nonlinear feature manifold with several PCA subspaces due to computational efficiency.

130 Y. Guo et al. / Computer Vision and Image Understanding 118 (2014) 128–139
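A minimal sketch of this piecewise-linear approximation, using plain NumPy (our own illustration; the function names and the subspace dimension are assumptions, not from the paper): cluster the simulated descriptors of one feature with K-means, then fit each cluster with a PCA subspace, i.e. a mean plus a few leading eigenvectors.

```python
import numpy as np

def fit_subspace(X, n_components=5):
    """Fit a linear PCA subspace (mean + top eigenvectors) to descriptors X (n x d)."""
    mu = X.mean(axis=0)
    # SVD of the centered data gives the principal directions.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:n_components].T            # d-vector mean, d x k basis U

def learn_manifold(F, K=4, iters=20, seed=0):
    """Approximate one feature manifold by up to K PCA subspaces:
    plain Lloyd-style K-means, then per-cluster PCA."""
    rng = np.random.default_rng(seed)
    centers = F[rng.choice(len(F), K, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((F[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = F[labels == k].mean(axis=0)
    # keep only clusters with enough samples to fit a subspace
    return [fit_subspace(F[labels == k]) for k in range(K) if np.sum(labels == k) > 1]
```

In practice F would hold the 128-dimensional SIFT descriptors $\{f_i, f_{i1}, f_{i2}, \ldots\}$ of one feature; a production implementation would also choose K and the subspace dimension by validation rather than fixed defaults.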
Given a set of feature descriptors obtained through the above simulation process, we first use K-means clustering to cluster them into K clusters. For each cluster we fit a PCA subspace for dimension reduction. We denote the subspaces as $M_i = \{C_{i1}, C_{i2}, \ldots, C_{im}\}$ to approximate the feature manifold. Here $C_{ij}$ is the $j$th linear subspace in feature manifold $M_i$.

4. Object tracking

Our tracking module includes two stages. The first stage is training, in which we generate the initial feature manifold set using features extracted from the first frame of a video. The other is tracking, which locates the object in each new frame and updates the object feature model dynamically.

In the training stage, SIFT descriptors are extracted for each detected feature point. We note that for some objects the SIFT detector cannot find enough features for tracking, so we employ several complementary feature detectors, including the Harris-Affine detector and the Hessian-Affine detector. The set of features is represented as $F = \{f_1, f_2, \ldots, f_n\}$. Then the approach described in the previous section is used to generate samples for feature manifold learning. We use affine transforms to simulate those transformations. With these transformed images, we extract features at the corresponding positions for every original feature $f_i$ in $F$. After all the transformed images are processed, we obtain the feature manifolds for every feature. We further approximate each manifold $M_i$ with a set of PCA linear subspaces $\{C_{i1}, C_{i2}, \ldots, C_{im}\}$ by using the aforementioned approach.

In the tracking stage, we maintain two sets. One is the feature manifold set $MS$ generated in the training stage and updated according to the current object state. The other is a set that contains the features denoted by $CF = \{cf_1, cf_2, \ldots, cf_k\}$, which are the candidates for creating new manifolds. When processing a new frame, we do not extract the SIFT features directly on the original frame.
Instead, we choose a region centered at the predicted location of the object, with twice the object size at the previous frame. The extracted features in this enlarged region form a feature set $F' = \{f'_1, f'_2, \ldots, f'_L\}$. For every feature manifold $M_i$ and every $cf_i$ in $CF$, we try to find the matched feature in $F'$ within a search window centered at the feature location of $M_i$ or $cf_i$, respectively. This window filters out unrelated features in $F'$: the locations of $M_i$ and $cf_i$ will not change dramatically between two consecutive frames, so restricting the search to this region saves considerable computation. The matching probability for $f'_l$ given $M_i$ is

$$p(f'_l \mid M_i) \propto \exp\left(-\frac{1}{\sigma^2}\, d^2(M_i, f'_l)\right), \qquad (2)$$

where $l$ denotes the feature index in $F'$ extracted in the current frame. For a given feature manifold, we thus have

$$l^* = \arg\max_l\, p(f'_l \mid M_i). \qquad (3)$$

The feature indexed by $l^*$ is the matched feature of $M_i$. Similarly, the matching judgment for $cf_i$ is

$$p(f'_l \mid cf_i) \propto \exp\left(-\frac{1}{\sigma^2}\, d^2(cf_i, f'_l)\right), \qquad (4)$$

and

$$l^* = \arg\max_l\, p(f'_l \mid cf_i). \qquad (5)$$

We calculate the $L_2$ distance between $cf_i$'s descriptor vector and $f'_l$'s descriptor as the value of $d(cf_i, f'_l)$. Computing $d(M_i, f'_l)$ is however more difficult, as $M_i$ is a manifold. Since $M_i$ is approximated with a set of linear PCA subspaces, we calculate the distance between the feature descriptor and each subspace and take the minimum as $d(M_i, f'_l)$,

$$d(M_i, f'_l) = \min_j\, d(C_{ij}, f'_l). \qquad (6)$$

To compute $d(C_{ij}, f'_l)$, we denote the eigenvectors of a PCA subspace $C_{ij}$ as $U_{ij}$ and its mean as $\mu_{ij}$. We then project $f'_l$ onto $C_{ij}$,

$$\mathrm{proj} f'_l = U_{ij}^{T}\,(f'_l - \mu_{ij}). \qquad (7)$$

The reconstruction of the feature $f'_l$ on the subspace is therefore

$$\hat{f}'_l = U_{ij}\, \mathrm{proj} f'_l + \mu_{ij}. \qquad (8)$$

The distance $d(C_{ij}, f'_l)$ between the feature and the subspace is finally formulated as

$$d(C_{ij}, f'_l) = \sqrt{\sum_{k=1}^{K_f} \left(f'_{lk} - \hat{f}'_{lk}\right)^2}, \qquad (9)$$

with $K_f$ the feature dimension; $K_f$ is set to 128 for SIFT.
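Equations (2)–(9) can be sketched in a few lines of NumPy (our own illustration, not the authors' code; a manifold is taken as a list of (mean, eigenvector-basis) pairs as produced by a per-cluster PCA):

```python
import numpy as np

def subspace_distance(mu, U, f):
    """Eqs. (7)-(9): project f onto the subspace, reconstruct, return the residual norm."""
    proj = U.T @ (f - mu)              # Eq. (7)
    recon = U @ proj + mu              # Eq. (8)
    return np.linalg.norm(f - recon)   # Eq. (9)

def manifold_distance(manifold, f):
    """Eq. (6): minimum distance over the manifold's linear subspaces."""
    return min(subspace_distance(mu, U, f) for mu, U in manifold)

def match(manifold, candidates, sigma=1.0):
    """Eqs. (2)-(3): index of the candidate feature with the highest matching probability."""
    probs = [np.exp(-manifold_distance(manifold, f) ** 2 / sigma ** 2)
             for f in candidates]
    return int(np.argmax(probs))
```

Since $\exp(-d^2/\sigma^2)$ is monotone in $d$, maximizing the probability is equivalent to minimizing the manifold distance; the probabilistic form matters when the matching scores are reused, e.g. for thresholding weak matches.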
Based on the feature-manifold matching discussed above, we can further enhance the matching performance by leveraging the geometric structure of local feature manifolds. The idea is that the matching of individual features can be made more reliable by considering the neighborhood features. We organize the feature manifolds as an attributed manifold graph similar to [16]. Feature matching then becomes a graph matching problem which can be solved using relaxation labeling. In this process, the Hungarian algorithm is used to enforce one-to-one matching.

After feature-manifold matching, we run a manifold update to incorporate the new data into the model. The update includes two parts: one is manifold self-update, which handles feature appearance variations; the other is adding newly appeared manifolds and deleting expiring ones, which handles pose changes and occlusions.

Manifold self-update. After obtaining the matched feature for $M_i$ in the current frame, we calculate the error between this matched feature and its reconstructed representation using the original eigenvectors of the matched subspace of $M_i$. If the error exceeds a threshold, an incremental PCA (IPCA) procedure is used to update the subspace. There have been many IPCA algorithms, for example the algorithm introduced by Skocaj and Leonardis [37], which requires computation of the covariance matrix, and the covariance-free IPCA (CCIPCA) proposed by Weng et al. [38]. We implemented both algorithms, and the results show that IPCA is more accurate while CCIPCA requires less computation. As the precision of CCIPCA is only slightly lower than that of IPCA, we utilize CCIPCA for tracking. For more details about CCIPCA, please refer to [38].

Adding newly appeared manifolds and deleting expiring ones. In a time window such as 5 frames, if the frequency with which a manifold $M_i$ obtains matched features is lower than a threshold, the manifold is deleted from $MS$.
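The covariance-free self-update can be sketched as follows. This is a simplified update in the spirit of CCIPCA [38] (without the amnesic averaging of the full algorithm), written as our own illustration rather than the authors' implementation: each eigenvector estimate is blended with the new sample's projection onto it, and the sample is deflated before estimating the next component.

```python
import numpy as np

def ccipca_update(V, n, u):
    """One covariance-free incremental PCA step (simplified sketch).
    V: list of current eigenvector estimates (updated in place),
    n: number of samples seen so far, u: new mean-subtracted descriptor."""
    u = u.astype(float).copy()
    for i, v in enumerate(V):
        norm = np.linalg.norm(v)
        if norm < 1e-12:
            V[i] = u.copy()                 # initialize an empty component
        else:
            # blend the old estimate with the new sample's contribution u (u . v)/|v|
            V[i] = (n / (n + 1.0)) * v + (1.0 / (n + 1.0)) * u * (u @ v) / norm
        vhat = V[i] / (np.linalg.norm(V[i]) + 1e-12)
        u = u - (u @ vhat) * vhat           # deflate before the next component
    return V
```

The appeal for tracking is that no covariance matrix is formed or decomposed: each matched descriptor updates the subspace in O(kd) time for k components of dimension d.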
If the frequency with which a candidate $cf_i$ obtains matched features is higher than a threshold, we generate its manifold representation, add it into $MS$, and meanwhile remove $cf_i$ from $CF$.

4.1. Object localization

Once the features have been matched to all feature manifolds representing the object, the correspondences are used to estimate object motion between two successive frames. We compute a homography based on the correspondences between the features
in the current frame and the manifolds on the object in the previous frame. RANSAC is used to obtain the homography matrix. We then transform the box indicating the object in the previous frame using this matrix and thus get a parallelogram. The bounding box of the parallelogram is taken as the box indicating the target object in the current frame.

The above tracking process is summarized below.

Algorithm 1. Feature manifold tracking
Input: The video sequence and an initial object box.
Output: Tracked object positions.
Training: For the first frame.
1 Extract the local features of the initial object to form a feature set $F = \{f_1, f_2, \ldots, f_n\}$.
2 Transform the first frame, or more precisely the object region, and generate a feature manifold $M_i$ for each feature $f_i$.
3 Represent $M_i$ as the PCA linear subspaces $\{C_{i1}, C_{i2}, \ldots, C_{im}\}$.
4 Build the feature manifold graph.
Tracking: For each of the successive frames.
1 Extract local features $F' = \{f'_1, f'_2, \ldots, f'_L\}$ in a candidate region centered at the object.
2 Perform feature-to-manifold matching. Find the matched feature $f'_{l^*}$ for each manifold $M_i$.
3 Estimate object motion and transform the object box.
4 Update manifold subspaces using IPCA, add new manifolds, and remove expiring ones.

5. Experiments and discussion

5.1. Matching performance

We first compare our feature-to-manifold matching with feature-to-feature matching. A standard dataset from the VGG datasets is used for comparison.

The VGG datasets have six groups of images. Every group includes six images for simulating variations under different imaging conditions, including viewpoint changes, scale changes, illumination variations, and so on. For viewpoint changes, the ground truth homography is provided. We choose this group as the dataset for comparison. We take the Graf dataset as an example. In this dataset, we select the reference image as our original image and carry out the transformations to simulate possible viewpoints of practical situations.
The number of transformation matrices is 324. Then we choose an image from the other five images as the query. The original and query images are shown in Fig. 2. The query image has 2806 features. We use the ground truth homography to evaluate our matching performance. The comparison result is shown in Fig. 2, right. From the precision and recall curves, it is obvious that the performance of our feature-to-manifold matching is better than that of traditional feature-to-feature matching.

5.2. Tracking results

5.2.1. Parameter setting
Several parameters are used in our tracker. For each feature manifold in the current frame, we use a fixed window around its position in the next frame as the search window for matching. In the experiments, we set this window to 30 × 30 for most videos. The size of the search window for the sequence named Pets09_2 is 60 × 60 to accommodate fast object movement. The distance threshold for determining whether or not a feature is matched with a manifold is 0.8, which admits more feature-to-manifold correspondences. In a time window in the following frames, if the number of matched features for a manifold in the current frame is less than a threshold τ, this manifold will be removed. For a newly detected feature in the current frame, if the number of matched features in the time window is greater than or equal to τ, we will generate its manifold. We set the time window to 5 frames and set τ to 4 in our implementation. We would like to emphasize that all the parameters, except for the window size set to 60 × 60 in the Pets09_2 sequence, were kept constant in our experiments.

5.2.2. Qualitative results
We apply our tracker to a set of very diverse and challenging video sequences that contain background clutter, camera motion, rapid changes of object appearance, illumination changes, in-plane/out-of-plane pose changes, scale changes of the targets, and so on.
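The add/delete bookkeeping with the parameters above (a 5-frame window and τ = 4) can be sketched as follows; the class and method names are our own illustration, not from the paper.

```python
from collections import deque

class MatchWindow:
    """Track which of the last `window` frames produced a match for one
    manifold (or one candidate feature), as described in Section 4."""
    def __init__(self, window=5, tau=4):
        self.window, self.tau = window, tau
        self.hits = deque(maxlen=window)   # 1 = matched in that frame, 0 = not

    def record(self, matched):
        self.hits.append(1 if matched else 0)

    def keep_manifold(self):
        """An existing manifold is kept while its match count stays >= tau
        (no decision is made before the window fills)."""
        return len(self.hits) < self.window or sum(self.hits) >= self.tau

    def promote_candidate(self):
        """A candidate feature is promoted to a manifold once a full window
        contains at least tau matches."""
        return len(self.hits) == self.window and sum(self.hits) >= self.tau
```

The `deque(maxlen=...)` discards the oldest frame automatically, so the counters always reflect exactly the last five frames.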
Some of these videos are widely employed by previous tracking approaches for evaluating the performance of their trackers. The resolution and number of frames of each video are listed in Table 1.

In the tracking results, the tracked objects are marked as red boxes. All the results are shown on our project webpage.1 We visually inspect the tracking results and classify them as ''keep track'' or ''lose track'', as shown in the last column of Table 1. In the following,

Fig. 2. Two images from Graf and comparison result. The first image is the reference, and the second one is the query. The red curve in the right figure represents the feature-to-manifold matching and the blue one denotes feature-to-feature matching. The decimals on the curves are the distance ratio thresholds. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

1 http://cs.nju.edu.cn/ywguo/tracking2013/result.html.