5.3.2 Rotation Estimation

According to Eq. (4), it is essential to accurately estimate the rotation matrix $R_{t_0,t}$ and the translation vector $T_{t_0,t}$, such that the projection of $P_i$ at time $t$ in the image plane, i.e., $P'_{i,t} = [u_{i,t}, v_{i,t}]^T$, can be determined. To estimate the camera's rotation $R_{t_0,t}$ from time $t_0$ to time $t$, we first use the gyroscope to measure the angular speed around each axis of the local coordinate system. Then, according to the small angle approximation [23], we compute the rotation matrix $A_{t,t+\delta t}$ relating the local coordinate system at time $t$ to that at time $t+\delta t$. After obtaining $A_{t,t+\delta t}$, we can further update the rotation matrix $R'_{t_0,t}$ for the local coordinate system as follows:

$R'_{t_0,t+\delta t} = A_{t,t+\delta t} R'_{t_0,t}.$   (5)

Hence, considering the coordinate transformation between the local coordinate system and the camera coordinate system, we further compute the camera's rotation in the camera coordinate system as follows:

$R_{t_0,t} = M R'_{t_0,t} M^{-1}.$   (6)
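As an illustrative sketch (not the paper's implementation), Eqs. (5) and (6) can be realized by integrating gyroscope samples with the small-angle approximation; the sampling interval, the local-to-camera alignment matrix M, and all identifiers below are assumptions.

```python
import numpy as np

def small_angle_rotation(omega, dt):
    """Rotation A_{t,t+dt} for angular rate omega = (wx, wy, wz) in rad/s over a
    short interval dt, using the small-angle approximation A ~ I + [omega*dt]_x."""
    wx, wy, wz = np.asarray(omega) * dt
    return np.array([[1.0, -wz,  wy],
                     [ wz, 1.0, -wx],
                     [-wy,  wx, 1.0]])

def integrate_gyro(gyro_samples, dt, M):
    """Accumulate R'_{t0,t} from gyroscope samples (Eq. 5) and map it into the
    camera coordinate system via R = M R' M^{-1} (Eq. 6).
    gyro_samples: (N, 3) angular rates; M: 3x3 local-to-camera alignment matrix."""
    R_local = np.eye(3)                                       # R'_{t0,t0} = I
    for omega in gyro_samples:
        R_local = small_angle_rotation(omega, dt) @ R_local   # Eq. (5)
    return M @ R_local @ np.linalg.inv(M)                     # Eq. (6)
```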
translation,we introduce the initialization to define the However,there could be multiple solutions for [Tto.x, unified translation unit and represent the fixed 3D points as we can multiply a nonzero coefficient on both sides of in the unified unit to estimate the following translation in a Eq.(9),whose right side is 0.For convenience,we figure unified unit. out one of the solutions for TTand T in camera Feature Point Extraction.According to each image frame coordinate system,based on Eq.(9).It is noteworthy that the of the video,we first utilize the FAST(Features from Accel- calculated translation Tto.t from Eq.(9)is represented in a relative manner,instead of in a absolute unit.Therefore, erated Segment Test)keypoint detector to detect the feature point,and then calculate the binary BRIEF(Binary Robust there is a scale factor a between the calculated translation Independent Elementary Features)descriptor [24]of the and the actual translation Tin the absolute unit,as feature point.Both the feature point and the descriptor form shown in Eq.(10),where TT and T mean the an ORB feature [25].Specifically,for the feature point Pito actual translation along x-axis,y-axis,z-axis. in the image frame Ito and the feature point Pt in image T0t=T1a,T%=T%1·a,T01=T0a.(0) frame It,we use Di and Di to represent their descriptors, respectively.Then,the similarity between Pand P However,it is actually difficult to transform the calculated can be measured by the hamming distance [26]between their translation Tto.t in the absolute unit,since the values of o, descriptors Di and Di.Given Pito we choose the feature TTand T are all unknown.To tackle the above issue,we define point with the nearest hamming distance in the image It to be the matching feature point for Pi.to and they form a Tto|=V(T%t)2+(T%ta)2+(T%6,)2 (11) feature point pair.The coordinate difference between the feature point pair can be used to estimate the camera's as the translation unit.In the following frames,when we motion. calculate a new translation,we represent it in the above Initialization for Translation Unit.As shown in Fig.5, translation unit.Consequently,we can represent the calcu- for each feature point pair(Pi.to Pit),given the projection lated translation over all frames with a unified unit. 1536-1233(c)2019 IEEE Personal use is permitted,but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. Authorized licensed use limited to:Nanjing University.Downloaded on October 08,2020 at 13:50:43 UTC from IEEE Xplore.Restrictions apply
Feature Point Extraction. For each image frame of the video, we first utilize the FAST (Features from Accelerated Segment Test) keypoint detector to detect feature points, and then calculate the binary BRIEF (Binary Robust Independent Elementary Features) descriptor [24] of each feature point. The feature point and its descriptor together form an ORB feature [25]. Specifically, for the feature point $P'_{i,t_0}$ in image frame $I_{t_0}$ and the feature point $P'_{j,t_1}$ in image frame $I_{t_1}$, we use $D_i$ and $D_j$ to represent their descriptors, respectively. Then, the similarity between $P'_{i,t_0}$ and $P'_{j,t_1}$ can be measured by the Hamming distance [26] between their descriptors $D_i$ and $D_j$. Given $P'_{i,t_0}$, we choose the feature point with the nearest Hamming distance in image $I_{t_1}$ as the matching feature point for $P'_{i,t_0}$, and the two form a feature point pair. The coordinate difference within each feature point pair can be used to estimate the camera's motion.
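For illustration only, the following sketch shows how such feature point pairs could be obtained with OpenCV's ORB detector (FAST keypoints plus BRIEF descriptors) and brute-force Hamming matching; the function and parameter names are ours, not those of the paper's implementation.

```python
import cv2

def match_feature_pairs(frame_t0, frame_t1, max_pairs=200):
    """Detect ORB features in two frames and pair each keypoint in frame_t0
    with the keypoint in frame_t1 at the nearest Hamming distance."""
    orb = cv2.ORB_create()                        # FAST keypoints + BRIEF descriptors
    kp0, des0 = orb.detectAndCompute(frame_t0, None)
    kp1, des1 = orb.detectAndCompute(frame_t1, None)

    # Brute-force matcher with Hamming distance, suited to binary descriptors.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des0, des1), key=lambda m: m.distance)

    # Return the pixel coordinates (u, v) of the matched feature point pairs.
    return [(kp0[m.queryIdx].pt, kp1[m.trainIdx].pt) for m in matches[:max_pairs]]
```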
Fig. 5. Epipolar geometry.

Initialization for Translation Unit. As shown in Fig. 5, for each feature point pair $(P'_{i,t_0}, P'_{i,t_1})$, given the projection point $P'_{i,t_0}$ and the camera optical center $O_{t_0}$, the target 3D point $P_i$ must lie on the ray $O_{t_0}P'_{i,t_0}$. Similarly, the target point $P_i$ must also lie on the ray $O_{t_1}P'_{i,t_1}$. Thus $P_i$ is the intersection point of $O_{t_0}P'_{i,t_0}$ and $O_{t_1}P'_{i,t_1}$. In computer vision, this is referred to as the epipolar constraint [11]. Then, we can use the fundamental matrix $F_{t_0,t_1}$ [11] to describe the epipolar constraint, i.e.,

$P'^{T}_{i,t_1} F_{t_0,t_1} P'_{i,t_0} = 0.$   (7)

Here, the fundamental matrix $F_{t_0,t_1}$ can be generated from the relative rotation $R_{t_0,t_1}$ and translation $T_{t_0,t_1}$ from image $I_{t_0}$ to image $I_{t_1}$, i.e.,

$F_{t_0,t_1} = K^{-T} [T_{t_0,t_1}]_{\times} R_{t_0,t_1} K^{-1},$   (8)

where $K$ is the camera intrinsic matrix and $[T_{t_0,t_1}]_{\times}$ is a $3 \times 3$ skew-symmetric matrix. Specifically, let $T_{t_0,t_1} = [T^{x}_{t_0,t_1}, T^{y}_{t_0,t_1}, T^{z}_{t_0,t_1}]^T$; then

$[T_{t_0,t_1}]_{\times} = \begin{bmatrix} 0 & -T^{z}_{t_0,t_1} & T^{y}_{t_0,t_1} \\ T^{z}_{t_0,t_1} & 0 & -T^{x}_{t_0,t_1} \\ -T^{y}_{t_0,t_1} & T^{x}_{t_0,t_1} & 0 \end{bmatrix}.$

Therefore, by substituting Eq. (8) into Eq. (7), we have

$(P'^{T}_{i,t_1} K^{-T}) [T_{t_0,t_1}]_{\times} (R_{t_0,t_1} K^{-1} P'_{i,t_0}) = 0.$   (9)

Here, the rotation matrix $R_{t_0,t_1}$ can be estimated from the gyroscope measurements. Then, the only unknown factor in Eq. (9) is $[T_{t_0,t_1}]_{\times}$, which has three unknown parameters $T^{x}_{t_0,t_1}$, $T^{y}_{t_0,t_1}$, $T^{z}_{t_0,t_1}$. Therefore, as long as we can obtain more than three pairs of matching feature points, we can solve $[T_{t_0,t_1}]_{\times}$ based on the Least Square Error (LSE) method. However, there could be multiple solutions for $[T_{t_0,t_1}]_{\times}$, since both sides of Eq. (9), whose right side is 0, can be multiplied by any nonzero coefficient. For convenience, we figure out one of the solutions for $T^{x}_{t_0,t_1}$, $T^{y}_{t_0,t_1}$, and $T^{z}_{t_0,t_1}$ in the camera coordinate system based on Eq. (9). It is noteworthy that the translation $T_{t_0,t_1}$ calculated from Eq. (9) is represented in a relative manner, instead of in an absolute unit. Therefore, there is a scale factor $\alpha$ between the calculated translation and the actual translation $T^{*}_{t_0,t_1}$ in the absolute unit, as shown in Eq. (10), where $T^{x*}_{t_0,t_1}$, $T^{y*}_{t_0,t_1}$, and $T^{z*}_{t_0,t_1}$ denote the actual translation along the x-axis, y-axis, and z-axis:

$T^{x}_{t_0,t_1} = T^{x*}_{t_0,t_1} \cdot \alpha, \quad T^{y}_{t_0,t_1} = T^{y*}_{t_0,t_1} \cdot \alpha, \quad T^{z}_{t_0,t_1} = T^{z*}_{t_0,t_1} \cdot \alpha.$   (10)

However, it is actually difficult to transform the calculated translation $T_{t_0,t_1}$ into the absolute unit, since the values of $\alpha$, $T^{x*}_{t_0,t_1}$, $T^{y*}_{t_0,t_1}$, and $T^{z*}_{t_0,t_1}$ are all unknown. To tackle this issue, we define

$|T_{t_0,t_1}| = \sqrt{(T^{x}_{t_0,t_1})^2 + (T^{y}_{t_0,t_1})^2 + (T^{z}_{t_0,t_1})^2}$   (11)

as the translation unit. In the following frames, whenever we calculate a new translation, we represent it in this translation unit. Consequently, we can represent the calculated translation over all frames with a unified unit.
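To make the above linear estimation concrete, the following sketch stacks one equation per feature point pair in the three unknowns of Eq. (9), solves the homogeneous system in the least-squares sense via SVD (the solution is defined only up to scale, consistent with Eq. (10)), and returns a unit-norm translation, matching the translation unit of Eq. (11). It is a minimal sketch under our own formulation, and the identifiers are illustrative.

```python
import numpy as np

def estimate_translation_direction(pairs, R, K):
    """Solve T_{t0,t1} up to scale from the epipolar constraint of Eq. (9).

    pairs: list of ((u0, v0), (u1, v1)) matched pixel coordinates,
    R: 3x3 rotation from the gyroscope, K: 3x3 camera intrinsic matrix.
    Since b^T [T]x c = T . (c x b), each pair contributes one linear equation
    ((R K^-1 p0) x (K^-1 p1)) . T = 0, so T is the null vector of the stacked
    coefficient matrix, found by SVD in the least-squares (LSE) sense.
    """
    K_inv = np.linalg.inv(K)
    rows = []
    for (u0, v0), (u1, v1) in pairs:
        c = R @ K_inv @ np.array([u0, v0, 1.0])   # rotated ray through P'_{i,t0}
        b = K_inv @ np.array([u1, v1, 1.0])       # ray through P'_{i,t1}
        rows.append(np.cross(c, b))               # coefficient row for T
    _, _, Vt = np.linalg.svd(np.array(rows))
    T = Vt[-1]
    return T / np.linalg.norm(T)                  # |T| = 1: the translation unit
```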
Compute Coordinates of Fixed 3D Points in Unified Unit. According to Eq. (4), representing the translation $T_{t_0,t}$ in the unified unit also means representing the coordinate of the 3D point $P_{i,t}$ in the unified unit. Since the 3D points are stationary, we can use the coordinate of the 3D point $P_{i,t_0}$ at time $t_0$ to represent the coordinate of $P_i$ at any time. In regard to the coordinate of any fixed 3D point at time $t_0$, according to Eq. (4), we can use the rotation $R_{t_0,t_1}$ and the translation $T_{t_0,t_1}$ from $t_0$ to $t_1$ to calculate it in the unified unit, since $T_{t_0,t_1}$ is expressed in the unified unit. Specifically, for an arbitrary target point $P_i$, suppose the 3D coordinates of $P_i$ in the camera coordinate system are $P_{i,t_0} = [X_{i,t_0}, Y_{i,t_0}, Z_{i,t_0}]^T$ and $P_{i,t_1} = [X_{i,t_1}, Y_{i,t_1}, Z_{i,t_1}]^T$ at times $t_0$ and $t_1$, respectively. Then, the corresponding 2D projections in the image plane are $P'_{i,t_0}$ and $P'_{i,t_1}$, respectively. Hence, based on the camera projection model in Eq. (4), we have

$Z_{i,t_0} P'_{i,t_0} = K P_{i,t_0} \Rightarrow Z_{i,t_0} K^{-1} P'_{i,t_0} = P_{i,t_0},$
$Z_{i,t_1} P'_{i,t_1} = K P_{i,t_1} \Rightarrow Z_{i,t_1} K^{-1} P'_{i,t_1} = P_{i,t_1}.$   (12)

After we calculate the rotation $R_{t_0,t_1}$ and translation $T_{t_0,t_1}$ of the camera coordinate system between the two time points $t_0$ and $t_1$, we have $P_{i,t_1} = R_{t_0,t_1} P_{i,t_0} + T_{t_0,t_1}$, based on Eq. (3). Thus, according to Eq. (12), we further have

$P_{i,t_1} = Z_{i,t_0} R_{t_0,t_1} K^{-1} P'_{i,t_0} + T_{t_0,t_1}.$   (13)

If we let $X_{i,t_0} = K^{-1} P'_{i,t_0}$ and $X_{i,t_1} = K^{-1} P'_{i,t_1}$, then, according to Eq. (12) and Eq. (13), we have

$P_{i,t_1} = Z_{i,t_1} X_{i,t_1} = Z_{i,t_0} R_{t_0,t_1} X_{i,t_0} + T_{t_0,t_1}.$   (14)

Thus, to compute the coordinate of $P_{i,t_1}$, we only need to solve $Z_{i,t_0}$ or $Z_{i,t_1}$. By taking the cross product of both sides of Eq. (14) with the vector $X_{i,t_1}$, we can eliminate the unknown parameter $Z_{i,t_1}$ and then calculate the unknown parameter $Z_{i,t_0}$. Specifically, the left side of Eq. (14) becomes $Z_{i,t_1} (X_{i,t_1} \times X_{i,t_1})$, which equals 0 since the cross product of any vector with itself is 0; the right side then yields

$Z_{i,t_0} (R_{t_0,t_1} X_{i,t_0}) \times X_{i,t_1} + T_{t_0,t_1} \times X_{i,t_1} = 0.$   (15)

According to Eq. (15), we are able to solve $Z_{i,t_0}$. Then, based on Eq. (14), we can further calculate $P_{i,t_1}$. Similarly, we can also calculate $P_{i,t_0}$, as well as the 3D coordinates of other target points.
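The depth recovery in Eqs. (12)-(15) amounts to a small least-squares solve per point, as in the sketch below; the identifiers are illustrative, and R, T denote the relative rotation and the unit-scaled translation from $t_0$ to $t_1$.

```python
import numpy as np

def triangulate_point(p0, p1, R, T, K):
    """Recover the 3D coordinates of a fixed point in the unified unit.

    p0, p1: (u, v) pixel projections in frames t0 and t1,
    R, T: relative rotation and unit-scaled translation from t0 to t1,
    K: camera intrinsic matrix.
    Eq. (15) reads  Z0 * ((R x0) x x1) + (T x x1) = 0  with x = K^-1 [u, v, 1]^T,
    which is solved for the depth Z0 in the least-squares sense.
    """
    K_inv = np.linalg.inv(K)
    x0 = K_inv @ np.array([p0[0], p0[1], 1.0])
    x1 = K_inv @ np.array([p1[0], p1[1], 1.0])

    a = np.cross(R @ x0, x1)            # coefficient of Z0 in Eq. (15)
    w = np.cross(T, x1)                 # constant term in Eq. (15)
    Z0 = -np.dot(a, w) / np.dot(a, a)   # least-squares solution of Z0*a + w = 0

    P_t0 = Z0 * x0                      # P_{i,t0}, from Eq. (12)
    P_t1 = R @ P_t0 + T                 # P_{i,t1}, from Eq. (14)
    return P_t0, P_t1
```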
Translation Estimation in Unified Unit. According to the projection model in Eq. (4), at any time $t$ during the camera shoot, we can depict the relationship between a 3D point and its corresponding projection in the image plane. Here, $K$ is a known parameter; as aforementioned, the rotation $R_{t_0,t}$ can be calculated with the gyroscope-based method, and the 3D coordinates of $P_{i,t_0}$ can be calculated with the CV-based method. Thus, the only unknown parameters are $T_{t_0,t} = [T^{x}_{t_0,t}, T^{y}_{t_0,t}, T^{z}_{t_0,t}]^T$ and $Z_{i,t}$. To solve these four parameters, we need at least two pairs of feature points to set up four equations. We can use the Least Square Error (LSE) method to solve the overdetermined equation system. After that, we are able to depict the translation with the unified translation unit. Specifically, let $u = |T_{t_0,t_1}|$; then we can denote $T^{x}_{t_0,t} = \gamma_x \cdot u$, $T^{y}_{t_0,t} = \gamma_y \cdot u$, $T^{z}_{t_0,t} = \gamma_z \cdot u$. In this way, we can estimate the translation of the camera in the unified unit.
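A minimal sketch of this least-squares step is given below, assuming the 3D points recovered above (in the unified unit) and the gyroscope rotation are available; it solves jointly for $T_{t_0,t}$ and one depth per point, and the identifiers are illustrative.

```python
import numpy as np

def estimate_translation_unified(points_3d, projections, R, K):
    """Least-squares estimate of the translation T_{t0,t} in the unified unit.

    points_3d: (N, 3) coordinates P_{i,t0} in the unified unit,
    projections: (N, 2) pixel coordinates P'_{i,t} observed at time t,
    R: 3x3 rotation R_{t0,t} from the gyroscope, K: intrinsic matrix.
    For each point, the projection model gives  Z_i * x_i - T = R P_i  with
    x_i = K^-1 [u, v, 1]^T, so we stack three equations per point in the
    unknowns [T, Z_1, ..., Z_N] and solve the overdetermined system.
    """
    K_inv = np.linalg.inv(K)
    N = len(points_3d)
    A = np.zeros((3 * N, 3 + N))
    b = np.zeros(3 * N)
    for i, (P, (u, v)) in enumerate(zip(points_3d, projections)):
        x = K_inv @ np.array([u, v, 1.0])
        A[3 * i:3 * i + 3, 0:3] = -np.eye(3)    # coefficient of T
        A[3 * i:3 * i + 3, 3 + i] = x           # coefficient of the depth Z_i
        b[3 * i:3 * i + 3] = R @ np.asarray(P)  # right-hand side R_{t0,t} P_{i,t0}
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w[:3]                                # T_{t0,t} in the unified unit
```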
6 SYSTEM DESIGN

6.1 System Overview

Fig. 6. System framework (inputs: original video and motion sensor; Preprocessing: rotation estimation and feature extraction; Camera Calibration: feature tracking and camera intrinsic parameters; Video Stabilization: translation estimation, motion smoothing, and frame warping; output: stabilized video).

The system architecture is shown in Fig. 6. We take as input the frames of the original video and the sensor readings from the motion sensor. We first perform Preprocessing to estimate the 3D rotation of the camera based on the solution described in Section 5.3.2, and extract features from the video frames. The estimated rotation and the video frames with feature points then serve two tasks, Camera Calibration and Video Stabilization. Camera Calibration performs feature tracking between consecutive video frames to obtain feature point pairs, and then uses the feature point pairs to calculate the camera intrinsic parameters. Video Stabilization stabilizes the video in three major steps. First, the 3D translation of the camera is estimated based on the solution described in Section 5.3.3. Second, the 3D motion of the camera is smoothed to remove the undesired jitters, so that a smoothed moving path of the camera is generated. Finally, given the smoothed moving path of the camera in 3D space, the stabilized video is created by frame warping, i.e., warping each pixel in the original frame to the corresponding stabilized frame, according to the mapping relationship between the original moving path and the smoothed moving path. After that, each frame of the stabilized video appears to be captured along the smoothed moving path.
6.2 Camera Calibration

According to the pinhole camera model described in Section 3, in order to depict the camera projection, we need to know the camera's intrinsic parameters, i.e., $[c_x, c_y]^T$, $\alpha$, $\beta$, and $f$. Here, $[c_x, c_y]^T$ is the pixel coordinate of the principal point in the image plane. Without loss of generality, if we set the image size to $(w, h)$ in pixels, then $[c_x, c_y]^T$ is ideally equal to $[\frac{w}{2}, \frac{h}{2}]^T$. However, due to sensor manufacturing errors, the principal point, which is the intersection of the optical axis and the image plane, is slightly offset from the center of the image, i.e., $[c_x, c_y]^T$ does not equal $[\frac{w}{2}, \frac{h}{2}]^T$ and needs to be estimated. $f$ is the camera focal length, which is represented in physical units, i.e., meters. $\alpha$ and $\beta$ are the numbers of pixels per unit of physical distance (i.e., per meter) along the $x_i$-axis and $y_i$-axis of the image plane, and they are used to correlate the image plane, measured in pixels, with the camera coordinate system, measured in meters. Given an arbitrary camera, we may not have access to these parameters. However, we do have access to the images the camera takes. Thus, we can find a way to deduce these parameters from the images, which is referred to as camera calibration.
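For reference, under the standard pinhole model these parameters combine into the intrinsic matrix $K$ used in Eqs. (4) and (8)-(9), with pixel focal lengths $f\alpha$ and $f\beta$; the sketch below (with hypothetical example values) assembles $K$ accordingly. The exact form of $K$ is defined in Section 3 of the paper, so this block is an assumption consistent with the parameter definitions above.

```python
import numpy as np

def intrinsic_matrix(f, alpha, beta, cx, cy):
    """Assemble the pinhole intrinsic matrix K from the calibration parameters.

    f: focal length in meters; alpha, beta: pixels per meter along the image
    plane's x and y axes; (cx, cy): principal point in pixels.
    f*alpha and f*beta are the focal lengths expressed in pixels.
    """
    return np.array([[f * alpha, 0.0,      cx],
                     [0.0,       f * beta, cy],
                     [0.0,       0.0,      1.0]])

# Hypothetical example: a 4 mm lens, 4 um pixel pitch (250000 px/m), and a
# principal point slightly off the center of a 1920 x 1080 image.
K = intrinsic_matrix(f=0.004, alpha=250000.0, beta=250000.0, cx=962.0, cy=538.0)
```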