IEEE TRANSACTIONS ON MOBILE COMPUTING, 2019. DOI 10.1109/TMC.2019.2961313.

Video Stabilization for Camera Shoot in Mobile Devices via Inertial-Visual State Tracking

Fei Han, Student Member, IEEE, Lei Xie, Member, IEEE, Yafeng Yin, Member, IEEE, Hao Zhang, Student Member, IEEE, Guihai Chen, Member, IEEE, and Sanglu Lu, Member, IEEE

Fei Han, Lei Xie, Yafeng Yin, Hao Zhang, Guihai Chen and Sanglu Lu are with the State Key Laboratory for Novel Software Technology, Nanjing University, China. E-mail: feihan@smail.nju.edu.cn, {lxie,yafeng}@nju.edu.cn, H.Zhang@smail.nju.edu.cn, {gchen,sanglu}@nju.edu.cn. Lei Xie is the corresponding author.

Abstract—Due to the sudden movement during the camera shoot, the videos retrieved from hand-held mobile devices often suffer from undesired frame jitters, leading to the loss of video quality. In this paper, we present a video stabilization solution for mobile devices via inertial-visual state tracking. Specifically, during the video shoot, we use the gyroscope to estimate the rotation of the camera, and use the structure-from-motion among the image frames to estimate the translation of the camera. We build a camera projection model that considers both the rotation and translation of the camera, and a camera motion model that depicts the relationship between the inertial-visual state and the camera's 3D motion. By fusing the inertial measurement unit (IMU)-based method and the computer vision (CV)-based method, our solution is robust to fast movement and violent jitters; moreover, it greatly reduces the computation overhead in video stabilization. In comparison to the IMU-based solution, our solution estimates the translation more accurately, since we use the feature point pairs in adjacent image frames, rather than the error-prone accelerometers, to estimate the translation. In comparison to the CV-based solution, our solution estimates the translation with fewer feature point pairs, since the number of undetermined degrees of freedom in the 3D motion is directly reduced from 6 to 3. We implemented a prototype system on smart glasses and smart phones, and evaluated the performance under real scenarios, i.e., the human subjects used mobile devices to shoot videos while they were walking, climbing or riding. The experiment results show that our solution achieves 32% better performance than the state-of-the-art solutions in regard to video stabilization. Moreover, the average processing latency is 32.6ms, which is lower than the conventional inter-frame time interval, i.e., 33ms, and thus meets the real-time requirement for online processing.

Index Terms—Video Stabilization, Mobile Device, 3D Motion Sensing, Inertial-Visual State Tracking

1 INTRODUCTION

Due to the proliferation of mobile devices, nowadays more and more people tend to use their mobile devices to take videos. Such devices can be smart phones and smart glasses. However, due to the sudden movement from the users during the camera shoot, the videos retrieved from such mobile devices often suffer from undesired frame jitters. This usually leads to the loss of video quality. Therefore, a number of video stabilization techniques have been proposed to remove the undesired jitters and obtain stable videos [1], [2], [3], [4], [5], [6], [7]. Recently, by leveraging the embedded sensors, new opportunities have been raised to perform video stabilization in the mobile devices.
For the mobile devices, conventional video stabilization schemes involve estimating the motion of the camera, smoothing the camera's motion to remove the undesired jitters, and warping the frames to stabilize the videos. Among these procedures, it is especially important to accurately estimate the camera's motion during the camera shoot, since it is a key precondition for the subsequent jitter removal and frame warping.

Conventionally, the motion estimation of the camera in 3D space is based either on inertial measurement-based techniques [8], [9], [10] or on computer vision-based techniques [3], [4]. The inertial measurement-based approaches mainly use the built-in inertial measurement unit (IMU) to continuously track the 3D motion of the mobile device. However, they mainly focus on the rotation while ignoring the translation of the camera. The reason is twofold. First, the gyroscope in the IMU is usually able to accurately track the rotation, whereas the accelerometer in the IMU usually fails to accurately track the translation due to the large cumulative tracking errors. The computer vision (CV)-based approaches mainly use the structure-from-motion [11] among the image frames to estimate both the rotation and translation of the camera. Although they achieve enough accuracy for the camera motion estimation, they require plenty of feature point pairs and long feature point tracks. The requirement of massive feature points for motion estimation increases the computational overhead in the resource-constrained mobile devices, which makes real-time processing impractical. Hence, to achieve a tradeoff between performance and computation overhead, only rotation estimation is considered in the state-of-the-art solutions. Second, according to our empirical studies, when the target is at a distance greater than 100cm, the rotation usually brings greater pixel jitters than the translation; hence, most previous work considers that the rotation has a greater impact on performance than the translation. However, when the target is within a close range, e.g., at a distance less than 100cm, the translation usually brings greater pixel jitters than the rotation, thus translation tracking is also essential for real applications of camera shooting. Therefore, to efficiently perform video
stabilization in mobile devices, it is essential to fuse the CV-based and IMU-based approaches to accurately estimate the camera's 3D motion, including the rotation and translation.

Fig. 1. Video Stabilization in Mobile Devices. Videos captured with mobile devices often suffer from undesired frame jitters due to the sudden movement from the users. We first estimate the original camera path (red) via inertial-visual state tracking, then smooth the original camera path to obtain the smoothed camera path (blue), and finally obtain the stabilized frames by warping the original frames.

In this paper, we propose a video stabilization scheme for camera shoot in mobile devices, based on visual and inertial state tracking. Our approach is able to accurately estimate the camera's 3D motion by sufficiently fusing both the CV-based and IMU-based methods. Specifically, during the process of video shoot, we use the gyroscope to estimate the rotation of the camera, and use the structure-from-motion among the image frames to estimate the translation of the camera. Different from the pure CV-based approaches, which estimate the rotation and translation simultaneously according to the camera projection model, our solution first estimates the rotation based on the gyroscope measurement, plugs the estimated rotation into the camera projection model, and then estimates the translation according to the camera projection model. In comparison to the CV-based solution, our solution can estimate the translation more accurately with fewer feature point pairs, since the number of undetermined degrees of freedom in the 3D motion is directly reduced from 6 to 3. After that, we further smooth the camera's motion to remove the undesired jitters during the moving process. As shown in Fig. 1, according to the mapping relationship between the original moving path and the smoothed moving path, we warp each pixel from the original frame into a corresponding pixel in the stabilized frame. In this way, the stabilized video appears to have been captured along the smoothed moving path of the camera. In comparison to recent visual-inertial video stabilization methods [12], [13], our solution is able to estimate the translation and rotation more accurately, and meets the real-time requirement for online processing, by directly reducing the number of undetermined degrees of freedom from 6 to 3 for the CV-based processing.

There are two key challenges to address in this paper. The first challenge is to accurately estimate and effectively smooth the camera's 3D motion in the situation of fast movement and violent jitters, due to the sudden movement during the video shoot. To address this challenge, firstly, we use the gyroscope to perform the rotation estimation and figure out a 3×3 rotation matrix, since the gyroscope can accurately estimate the rotation even if fast movement and violent jitters occur. Then, to smooth the rotation, instead of smoothing the 9 dependent parameters separately, we further transform the 3×3 rotation matrix into the 1×3 Euler angles, and apply a low-pass filter over the 3 independent Euler angles separately. In this way, we are able to effectively smooth the rotation while maintaining the consistency among multiple parameters.
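To make these two steps concrete, the sketch below integrates gyroscope samples into a 3×3 rotation matrix with the axis-angle (Rodrigues) formula, converts the matrix into Euler angles, and low-pass filters each angle independently. This is only an illustrative sketch, not the authors' implementation: the sample layout, the ZYX Euler convention, and the simple moving-average filter are our own assumptions.

```python
import numpy as np

def rodrigues(rotvec):
    """Rotation matrix for a rotation vector (axis * angle, rad), via Rodrigues' formula."""
    theta = np.linalg.norm(rotvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rotvec / theta
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def integrate_gyro(omega_samples, dt):
    """Accumulate angular-rate samples (rad/s, shape N x 3) into a single 3x3 rotation."""
    R = np.eye(3)
    for omega in omega_samples:
        R = R @ rodrigues(omega * dt)        # compose the small per-sample rotation
    return R

def rotation_to_euler_zyx(R):
    """3x3 rotation matrix -> [yaw, pitch, roll] under the (assumed) ZYX convention."""
    yaw = np.arctan2(R[1, 0], R[0, 0])
    pitch = np.arcsin(np.clip(-R[2, 0], -1.0, 1.0))
    roll = np.arctan2(R[2, 1], R[2, 2])
    return np.array([yaw, pitch, roll])

def smooth_euler(euler_series, window=9):
    """Low-pass filter each of the 3 Euler-angle sequences (shape N x 3) independently."""
    unwrapped = np.unwrap(euler_series, axis=0)          # avoid +/- pi jumps before filtering
    kernel = np.ones(window) / window                    # simple moving-average low-pass
    return np.stack([np.convolve(unwrapped[:, i], kernel, mode='same')
                     for i in range(3)], axis=1)
```

Filtering the three Euler angles, rather than the nine matrix entries, keeps the smoothed result a valid rotation once it is converted back into a matrix.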
Secondly, we build a camera projection model by considering the rotation and translation of the camera. Then, by substituting the estimated rotation into the camera projection model, we directly estimate the translation according to the matched feature point pairs in adjacent image frames. In the situation of fast movement and violent jitters, it is usually difficult to find enough feature point pairs between adjacent image frames to estimate the camera's 3D motion. In comparison to the traditional CV-based approaches, our solution requires fewer feature point pairs, as we directly reduce the number of undetermined degrees of freedom in the 3D motion from 6 to 3. The second challenge is to sufficiently reduce the computation overhead of video stabilization, so as to make real-time processing practical in the resource-constrained mobile devices. Traditional CV-based approaches usually require at least 5~8 pairs of feature points to estimate the rotation and translation. They involve 6 degrees of freedom, thus they usually incur large computation overhead, failing to perform the video stabilization in a real-time manner. To address this challenge, our solution reduces the computation overhead by directly reducing the undetermined degrees of freedom from 6 to 3. Specifically, we use the inertial measurements to estimate the rotation. Our solution then requires only at least 3 pairs of feature points to estimate the translation, which reduces over 50% of the burden in the CV-based processing. This makes real-time processing possible in the mobile devices.

We make three key contributions in this paper. 1) We investigate video stabilization for camera shoot in mobile devices. By fusing the IMU-based method and the CV-based method, our solution is robust to fast movement and violent jitters, and greatly reduces the computation overhead in video stabilization. 2) We conduct empirical studies to investigate the impact of movement jitters and the measurement errors of IMU-based approaches. We build a camera projection model by considering the rotation and translation of the camera. We further build the camera motion model to depict the relationship between the inertial-visual state and the camera's 3D motion. 3) We implemented a prototype system on smart glasses and smart phones, and evaluated the performance under real scenarios, i.e., the human subjects used mobile devices to shoot videos while they were walking, climbing or riding. The experiment results show that our solution achieves 32% better performance than the state-of-the-art solutions in regard to video stabilization. Moreover, the average processing latency is 32.6ms, which is lower than the conventional inter-frame time interval, i.e., 33ms, and thus meets the real-time requirement for online processing.

2 RELATED WORK

CV-based Solution: Traditional CV-based solutions for video stabilization can be roughly divided into 2D stabilization and 3D stabilization. 2D video stabilization solutions use a series of 2D transformations between adjacent frames to represent the camera motion, and smooth these transformations to stabilize the video [1], [2], [14].
However, these methods cannot figure out the camera's 3D motion, thus
they usually fail to compute the changes of projection for the target scene when there exist significant depth changes. Recent 3D video stabilization solutions [3], [4], [15] all seek to stabilize the videos based on the 3D camera motion model. They use the structure-from-motion among the image frames to estimate the 3D camera motion, thus they can deal with parallax distortions caused by depth variations. Hence, they are usually more effective and robust in video stabilization, at the cost of large computation overhead. Therefore, they are usually performed in an offline manner for video stabilization. Moreover, when the camera moves fast or experiences violent jitters, they may not find a sufficient number of feature points to estimate the motion.

IMU-based Solution: For mobile devices, since the built-in gyroscopes and accelerometers can be directly used to estimate the camera's motion, IMU-based solutions [8], [9], [16], [17] have been proposed for video stabilization recently. Karpenko et al. calculate the camera's rotation by integrating the gyroscope readings directly [8], whereas Hanning et al. take into account the noise of the gyroscope readings and estimate the camera's rotation with an extended Kalman filter that fuses the readings of the gyroscope and the accelerometer [9]. These IMU-based solutions are much faster than the CV-based solutions, but they only consider the rotation in modeling the camera motion without the translation, since the gyroscope can accurately track the rotation, whereas the accelerometer usually fails to accurately track the translation due to large cumulative tracking errors.

Hybrid Solution: Recent works seek to fuse the inertial and visual-based methods to track the camera's motion [18], [19], [20]. Yang et al. fuse the visual and inertial measurements to track the camera state for augmented reality [19]. In video stabilization, Jia et al. propose an EKF-based method to estimate the 3D camera rotation by using both the video and inertial measurements [20]. Still, they only use the pure rotation to depict the camera motion and ignore the camera's translation. In this paper, we investigate video stabilization in mobile devices by accurately estimating and smoothing the camera's 3D motion, i.e., camera rotation and translation. By fusing the IMU-based method and the CV-based method, our solution is robust to fast movement and violent jitters; moreover, it greatly reduces the computation overhead in video stabilization. In comparison to recent visual-inertial video stabilization methods [12], [13], our solution is able to estimate the translation and rotation more accurately, and meets the real-time requirement for online processing, by directly reducing the number of undetermined degrees of freedom from 6 to 3 for the CV-based processing.

3 PRELIMINARY

To illustrate the principle of camera shoot in the mobile devices, we use the pinhole camera model [11] to depict the camera projection.
As illustrated in Fig. 2, for an arbitrary point P from the specified object in the scene, a ray from this 3D point P to the camera optical center Oc intersects the image plane at a point P'. Then, the relationship between the point P = [X, Y, Z]^T in the 3D camera coordinate system and its image projection pixel P' = [u, v]^T in the 2D image plane can be represented as:

$$ Z \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} \alpha f & 0 & c_x \\ 0 & \beta f & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = KP, \qquad (1) $$

where K is the camera intrinsic matrix [11], which contains the camera's intrinsic parameters [c_x, c_y]^T, α, β and f. Here, [c_x, c_y]^T is the pixel coordinate of the principal point C in the image plane. f is the camera focal length, which is represented in physical measurements, i.e., meters, and is equal to the distance from the camera center Oc to the image plane, i.e., OcC. Considering that the projected points in the image plane are described in pixels, while 3D points in the camera coordinate system are represented in physical measurements, i.e., meters, we introduce the parameters α and β to correlate the same points in different coordinate systems using different units. Thus the parameters α and β are the numbers of pixels per meter (i.e., per unit distance in physical measurements) along the x_i-axis and y_i-axis, as shown in Fig. 2. Note that α and β may be different because the aspect ratio of the unit pixel is not guaranteed to be one. We can obtain these camera intrinsic parameters in advance from prior calibration [21]. Then, the coordinate of the projection P' in the 2D image plane, i.e., [u, v]^T, can be computed according to Eq. (1).

Fig. 2. Pinhole camera model.
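For illustration, the following snippet evaluates Eq. (1): it assembles the intrinsic matrix K from calibrated parameters and projects a 3D point expressed in the camera coordinate system to its pixel coordinate. The numeric intrinsics are placeholders chosen for the example, not the calibration of the device used in this paper.

```python
import numpy as np

def intrinsic_matrix(alpha_f, beta_f, cx, cy):
    """Camera intrinsic matrix K of Eq. (1): alpha_f = alpha*f, beta_f = beta*f (in pixels)."""
    return np.array([[alpha_f, 0.0, cx],
                     [0.0, beta_f, cy],
                     [0.0, 0.0, 1.0]])

def project(K, P):
    """Project a 3D point P = [X, Y, Z] (camera coordinates, meters) to a pixel [u, v]."""
    uvw = K @ P                 # equals Z * [u, v, 1]^T
    return uvw[:2] / uvw[2]     # divide by Z to recover the pixel coordinate

# Placeholder intrinsics, for illustration only.
K = intrinsic_matrix(alpha_f=1300.0, beta_f=1300.0, cx=960.0, cy=540.0)
print(project(K, np.array([0.02, 0.0, 0.5])))   # a point 2cm off-axis at 50cm depth
```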
4 EMPIRICAL STUDY

During the camera shoot, it is known that the camera usually experiences back-and-forth movement jitters of fairly high frequency, fast speed, and small rotations and translations. In this section, we perform empirical studies on a real-world testbed in regard to the movement jitters and measurement errors, so as to investigate the following issues: 1) To what level do the movement jitters in the 3D space affect the pixel jitters in the image plane of the camera? 2) What are the average measurement errors in measuring the rotation and translation of the camera with the inertial sensors?

Without loss of generality, we use the smart phone Lenovo PHAB2 Pro as the testing platform. This platform has a 16-megapixel camera, which we use to capture 1080p videos at 30 frames per second. Moreover, this platform has an inertial measurement unit (BOSCH BMI160) consisting of a 3-axis accelerometer and a 3-axis gyroscope, which we use to capture the linear acceleration and the angular rate of the body frame at a frequency of 200Hz, respectively. To capture the ground truth of the 3D motion of the mobile device, including the rotation and translation, we use the OptiTrack system [22] to collect the experiment data.

Fig. 3. The experiment results of the empirical study. (a) The jitter of pixels due to rotation-based jitter, δθ = 10°. (b) The jitter of pixels due to translation-based jitter, δt = 5cm. (c) The rotation measurement error from the gyroscope. (d) The translation measurement error from the accelerometer.

4.1 Observations

Observation 1. When the camera is subject to the same rotation-based jitters, the stationary target points with closer distance to the image plane suffer from stronger pixel jitters in the image plane. To evaluate how the rotation-based jitters affect the pixel jitters in the image plane, we deployed the stationary target points in a line parallel to the optical axis of the camera and with different distances to the image plane. Without loss of generality, we performed the rotation-based jitters around the y-axis of the camera coordinate system, which leads to the coordinate change of the projection along the x-axis. The maximum rotation angle δθ is set to 10 degrees by default. Then, we measured the pixel jitter for the target points with different depths, i.e., the coordinate difference in pixels between the projections before and after the rotation-based jitter. As shown in Fig. 3(a), we use the pinhole camera model to predict the pixel jitter of an object at a given distance and plot it as the green curve, and we then plot the corresponding experiment results. The comparison between the theoretical results and the experiment results shows that the observations from the experiments are consistent with the theoretical hypothesis from the pinhole camera model. According to the experiment results, we found that as the depth of the target point increases from 10cm to 50cm, the pixel jitter decreases rapidly from 314 pixels to 235 pixels. Then, as the depth further increases from 50cm to 150cm, the pixel jitter decreases very slowly from 235 pixels to 230 pixels.

Observation 2. When the camera is subject to the same translation-based jitters, the stationary target points with closer distance to the image plane suffer from stronger pixel jitters in the image plane. To evaluate how the translation-based jitters affect the pixel jitters in the image plane of the camera, we deployed the target points on the optical axis of the camera and with different distances to the image plane. Without loss of generality, we performed the translation-based jitters along the x-axis of the camera coordinate system, and the maximum displacement δt is set to 5cm by default. Then, we also measured the pixel jitter for the target points with different depths. As shown in Fig. 3(b), we use the pinhole camera model to predict the pixel jitter of an object at a given distance and plot it as the green curve, and we then plot the corresponding experiment results. We found that as the depth of the target point increases from 10cm to 50cm, the pixel jitter decreases rapidly from 650 pixels to 130 pixels. Then, as the depth further increases from 50cm to 150cm, the pixel jitter decreases very slowly from 130 pixels to 43 pixels.
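The depth dependence in Observations 1 and 2 can be read directly off the pinhole model. For a target point on the optical axis at depth Z, a lateral translation jitter δt shifts its projection by roughly αf·δt/Z pixels, while the shift caused by a rotation jitter δθ approaches αf·tan(δθ) pixels as the depth grows. The short check below evaluates these expressions; the value αf ≈ 1300 pixels is an assumption we make purely for illustration (the paper does not report its calibration), under which the predicted translation jitters at 10cm, 50cm and 150cm come out close to the reported 650, 130 and 43 pixels, and the far-field rotation jitter is close to the reported plateau of about 230 pixels.

```python
import numpy as np

ALPHA_F = 1300.0   # assumed alpha * f in pixels, for illustration only (not from the paper)

def translation_jitter(delta_t_m, depth_m):
    """Pixel shift of an on-axis point at depth Z caused by a lateral translation delta_t."""
    return ALPHA_F * delta_t_m / depth_m

def rotation_jitter_far_field(delta_theta_rad):
    """Large-depth limit of the pixel shift caused by a rotation about the y-axis."""
    return ALPHA_F * np.tan(delta_theta_rad)

for depth in (0.10, 0.50, 1.50):                                         # 10cm, 50cm, 150cm
    print(f"{depth:.2f}m: {translation_jitter(0.05, depth):.0f} px")     # delta_t = 5cm, cf. Fig. 3(b)
print(f"far field: {rotation_jitter_far_field(np.radians(10.0)):.0f} px")  # delta_theta = 10 deg, cf. Fig. 3(a)
```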
Observation 3. When the mobile device is rotating, the gyroscope is able to accurately measure the rotation in low, medium and high speed modes. To evaluate the average measurement errors in measuring the rotation with the gyroscope, without loss of generality, we rotated the mobile device around the z-axis of the local coordinate system by angles of 45°, 90° and 180°, respectively. Besides, for each rotation angle, we evaluated the measurement errors in the low speed (10°/s), medium speed (40°/s) and high speed (100°/s) modes, respectively. Specifically, the measurement errors are calculated by comparing the gyroscope measurement with the ground truth. According to the experiment results in Fig. 3(c), we found that, as the rotation angle increases from 45° to 180°, the measurement error increases slightly, and it is always less than 2° in all cases.

Observation 4. When the mobile device is moving back and forth, the accelerometer usually fails to accurately measure the translation in low, medium and high speed modes. To evaluate the average measurement errors in measuring the translation with the accelerometer, without loss of generality, we moved the mobile device back and forth in the range of [-5cm, +5cm] along the z-axis of the local coordinate system, varying the overall moving distance from 10~15cm to 40~45cm, respectively. Besides, for each moving distance, we evaluated the measurement errors in the low speed (3cm/s), medium speed (30cm/s) and high speed (100cm/s) modes, respectively. Specifically, the measurement errors are calculated by comparing the accelerometer measurement with the ground truth. As shown in Fig. 3(d), we found that, for all three speed modes, as the moving distance increases from 10~15cm to 40~45cm, the corresponding measurement errors increase linearly. Nevertheless, the measurement errors of all speed modes with all moving distances are greater than 10cm. Since the actual translation ranges within [-5cm, +5cm], and the maximum moving distance is less than 45cm, an average measurement error (whether displacement error or distance error) greater than 10cm is not acceptable at all.

4.2 Summary

Both the rotation-based jitters and the translation-based jitters cause non-negligible pixel jitters in the image plane during the video shoot. With the inertial measurement units, the rotation can usually be accurately measured by the gyroscope, whereas the translation fails to be accurately measured by the accelerometer. Therefore, it is essential to estimate the translation in an accurate and lightweight manner, such that the video stabilization can be effectively performed.
5 PROBLEM FORMULATION AND MODELING

5.1 Problem Formulation

According to the observations in the empirical study, in order to achieve video stabilization, we need to accurately track the rotation and translation during the video shoot, so as to effectively remove the jitters caused by both. Meanwhile, we need to perform the rotation/translation estimation in a lightweight manner, so that the computation overhead is suitable for real-time processing. Therefore, it is essential to statistically minimize the expectation of both the rotation estimation error and the translation estimation error during the video shoot, while limiting the expected computation overhead to a certain threshold τ. Specifically, let the rotation estimation error and the translation estimation error be δ_r and δ_t, respectively, and let the computation overhead for rotation estimation and translation estimation be c_r and c_t, respectively. We use exp(·) to denote the expectation. Then, the objective of our solution is

$$\min\ \exp(\delta_r) + \exp(\delta_t), \qquad (2)$$

subject to

$$\exp(c_r) + \exp(c_t) \le \tau.$$

To achieve this objective, we first analyze the pros and cons of the IMU-based and CV-based approaches, as summarized in Table 1. For translation tracking, only the CV-based approach achieves high accuracy, so we use the CV-based approach to estimate the translation. For rotation tracking, both the IMU-based and CV-based approaches achieve high accuracy, but the computational complexity of the CV-based approach is relatively high, especially when all 6 degrees of freedom (DoF) are undetermined. Hence, we use the IMU-based approach to estimate the rotation, due to its low computational complexity. In this way, the computation overhead of the CV-based approach is greatly reduced, since the number of undetermined DoF for CV-based processing drops from 6 to 3.

TABLE 1
Pros and cons of the IMU-based and CV-based approaches for video stabilization.

              Rotation Tracking        Translation Tracking      Compute Complexity
  IMU-based   High accuracy (3 DoF)    Low accuracy (3 DoF)      Low
  CV-based    High accuracy (3 DoF)    High accuracy (3 DoF)     High

Therefore, after formulating the video stabilization problem in an expectation-minimization framework, we decompose this optimization problem into two subproblems: using the IMU-based approach to estimate the rotation, and using the CV-based approach to estimate the translation.
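As an illustration of this decomposition (a minimal sketch, not the paper's exact implementation), the following Python/NumPy code first integrates gyroscope readings into a rotation matrix and then, with that rotation fixed, recovers the remaining 3-DoF translation from feature point pairs via the projection model introduced in Section 5.2. It assumes the 3D coordinates of the feature points at time t0 and the camera intrinsic matrix are already available; all function and variable names are illustrative.

```python
import numpy as np

def integrate_gyro(omega_samples, dt):
    """Integrate gyroscope angular-velocity samples (rad/s) over fixed
    sampling intervals dt into one rotation matrix, applying Rodrigues'
    rotation formula to each small step."""
    R = np.eye(3)
    for omega in omega_samples:
        omega = np.asarray(omega, dtype=float)
        rate = np.linalg.norm(omega)
        angle = rate * dt                        # rotation angle of this step
        if angle < 1e-12:
            continue
        k = omega / rate                         # unit rotation axis
        K = np.array([[0.0, -k[2],  k[1]],
                      [k[2],  0.0, -k[0]],
                      [-k[1], k[0],  0.0]])      # skew-symmetric matrix of k
        dR = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)
        R = dR @ R                               # accumulate incremental rotation
    return R

def estimate_translation(points_t0, pixels_t, K_cam, R):
    """With the rotation R fixed by the IMU, recover the 3-DoF translation T
    (plus one depth Z_i per point) from feature point pairs, by stacking the
    linear equations K*T - Z_i*[u_i, v_i, 1]^T = -K*R*P_{i,t0} implied by the
    projection model of Section 5.2. Two or more pairs suffice in principle."""
    n = len(points_t0)
    A = np.zeros((3 * n, 3 + n))
    b = np.zeros(3 * n)
    for i, (P0, (u, v)) in enumerate(zip(points_t0, pixels_t)):
        rows = slice(3 * i, 3 * i + 3)
        A[rows, 0:3] = K_cam                     # coefficients of the shared T
        A[rows, 3 + i] = -np.array([u, v, 1.0])  # coefficients of this point's depth
        b[rows] = -K_cam @ (R @ np.asarray(P0, dtype=float))
    x, *_ = np.linalg.lstsq(A, b, rcond=None)    # least-squares solve
    return x[:3]                                 # translation T; x[3:] are the depths
```

Because the rotation is no longer an unknown, each feature point pair contributes three linear equations with only one extra unknown (its depth), which is why far fewer point pairs are needed than in a full 6-DoF estimation.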
5.2 Camera Projection Model

According to the pinhole camera model, for an arbitrary 3D point P on a stationary object in the scene, the corresponding 2D projection P' in the image plane remains unchanged as long as the camera is stationary. However, when the body frame of the camera moves in 3D space, the camera coordinate system, and thus the image plane, moves continuously as well, involving both rotation and translation. In this way, even if the point P keeps still in 3D space, its projection P' changes dynamically in the 2D image plane, which leads to video shaking.

Since any 3D motion can be decomposed into a combination of rotation and translation, we use a rotation matrix R_{t0,t} and a translation vector T_{t0,t} to represent the rotation and translation of the camera coordinate system, respectively, from time t0 to time t. Then, for a target point P_i in the camera coordinate system whose coordinate at time t0 is denoted as P_{i,t0}, its coordinate P_{i,t} at time t, after the rotation and translation of the camera coordinate system, is

$$P_{i,t} = R_{t_0,t} P_{i,t_0} + T_{t_0,t}. \qquad (3)$$

Therefore, according to Eq. (1), the corresponding projection of P_{i,t} in the image plane, i.e., P'_{i,t} = [u_{i,t}, v_{i,t}]^T, can be computed by

$$Z_{i,t} \cdot [u_{i,t}, v_{i,t}, 1]^T = K P_{i,t} = K (R_{t_0,t} P_{i,t_0} + T_{t_0,t}), \qquad (4)$$

where Z_{i,t} is the z-axis coordinate of P_{i,t} in the camera coordinate system at time t, and K is the camera intrinsic matrix.

5.3 Camera Motion Model

5.3.1 Coordinate Transformation

Since mobile devices are usually equipped with an Inertial Measurement Unit (IMU), the motion of the camera can be measured by the IMU in the local coordinate system of the body frame, as shown in Fig. 4. As mentioned in Section 5.2, the camera projection is defined in the camera coordinate system; therefore, once we derive the camera's motion from the inertial measurements in the local coordinate system, it is essential to transform this motion into the camera coordinate system.

Fig. 4. The local coordinate system and the camera coordinate system of the rear camera.

For the embedded camera of the mobile device, we take the most commonly used rear camera as an example. Fig. 4 shows the camera coordinate system and the local coordinate system. According to the relationship between the camera coordinate system and the local coordinate system, we use the 3×3 rotation matrix

$$M = \begin{bmatrix} 0 & -1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & -1 \end{bmatrix}$$

to denote the coordinate transformation between the two coordinate systems. For any other camera, a similar rotation matrix M' denotes the corresponding coordinate transformation.
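To make the transformation concrete, the sketch below is an illustration under our own conventions rather than the paper's exact procedure: assuming M maps vectors from the local frame to the camera frame, a rotation measured in the local frame is conjugated by M and a translation is multiplied by M. The pinhole projection of Eq. (4) is included for completeness; function names are illustrative.

```python
import numpy as np

# Rotation matrix mapping vectors from the local (IMU body) coordinate system
# to the rear-camera coordinate system, as given in Section 5.3.1.
M = np.array([[ 0.0, -1.0,  0.0],
              [-1.0,  0.0,  0.0],
              [ 0.0,  0.0, -1.0]])

def motion_to_camera_frame(R_local, T_local):
    """Re-express a rotation/translation measured in the local coordinate
    system in the camera coordinate system, assuming p_cam = M @ p_local:
    the rotation transforms by conjugation, the translation by M."""
    R_cam = M @ R_local @ M.T
    T_cam = M @ np.asarray(T_local, dtype=float)
    return R_cam, T_cam

def project_point(P_cam, K_cam):
    """Pinhole projection of a 3D point expressed in the camera frame,
    following Eq. (4): Z * [u, v, 1]^T = K * P."""
    p = K_cam @ np.asarray(P_cam, dtype=float)
    return p[:2] / p[2]          # divide by the depth Z to obtain pixel coordinates
```

Under this convention, for instance, the device's local x-axis maps to the camera's negative y-axis, which is exactly the axis swap and flip encoded in M.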