DynaKey: Dynamic Keystroke Tracking using a Head-Mounted Camera Device

Hao Zhang, Student Member, IEEE, Yafeng Yin, Member, IEEE, Lei Xie, Member, IEEE, Tao Gu, Senior Member, IEEE, Minghui You, and Sanglu Lu, Member, IEEE

Abstract—Mobile and wearable devices have become more and more popular. However, the tiny touch screen leads to inefficient interaction with these devices, especially for text input. In this paper, we propose DynaKey, which allows people to type on a virtual keyboard printed on a piece of paper or drawn on a desk, for inputting text into a head-mounted camera device (e.g., smart glasses). By using the built-in camera and gyroscope, we capture image frames during typing and detect possible head movements, then track keys, detect fingertips and locate keystrokes. To track the changes of keys' coordinates in images caused by natural head (i.e., camera) movements, we introduce perspective transformation to transform keys' coordinates among different frames. To detect and locate keystrokes, we utilize the variation of the fingertip's coordinates across multiple frames to detect possible keystrokes for localization. To reduce the time cost, we combine the gyroscope and camera to adaptively track the keys, and introduce a series of optimizations such as keypoint detection, frame skipping, and multi-thread processing. Finally, we implement DynaKey on Android powered devices. Extensive experimental results show that our system can efficiently track and locate keystrokes in real time. Specifically, the average tracking deviation of the keyboard layout is less than 3 pixels and the intersection over union (IoU) of a key in two consecutive images is above 93%. The average keystroke localization accuracy reaches 95.5%.

Index Terms—Dynamic keystroke tracking, Camera, Inertial sensor, Head-mounted device

I. INTRODUCTION

Recent years have witnessed an ever-growing popularity of mobile and wearable devices such as smartphones, smart watches and smart glasses. These devices usually adopt a small form factor design so that they can be carried by users everywhere conveniently. The portable design brings much mobility to these devices, but on the other hand it creates many challenges for human-computer interaction, especially for text input. Some of these devices adopt an on-screen virtual keyboard [1], [2] for text input, but others may require intelligent methods due to the tiny screen or even no screen.

Manuscript received XX XX, 2021. This work is supported by National Key R&D Program of China under Grant No. 2018AAA0102302, National Natural Science Foundation of China under Grant Nos. 61802169, 61872174, 61832008, 61906085; JiangSu Natural Science Foundation under Grant No. BK20180325. This work is partially supported by Collaborative Innovation Center of Novel Software Technology and Industrialization. (Corresponding author: Yafeng Yin.)
Hao Zhang, Yafeng Yin, Lei Xie, Minghui You, and Sanglu Lu are with the State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing 210023, China (e-mail: yafeng@nju.edu.cn). Tao Gu is with the Department of Computing at Macquarie University, Sydney, Australia.

Fig. 1. Typing on a virtual keyboard in dynamic scenarios. Camera movements change the keys' coordinates in image frames and lead to the mismatch between fingertip and key.

Based on the observation that each finger's typing movement is associated with a unique keystroke, recognizing finger movements has been proposed as a novel text input method, which is achieved by additional wearable sensors (e.g., finger-mounted sensors [3]-[7]) and incurs an additional cost. Considering users' habits of typing on a common QWERTY keyboard layout, a projection keyboard [24], [27], generated by casting the standard keyboard layout onto a surface via a projector, has been proposed; it recognizes keystrokes based on light reflection and depends on dedicated projection equipment. Recently, with the advance of contactless sensing, recognition of keystrokes can be done via WiFi signals or acoustic signals. For example, WiFi CSI signals have been explored in [8], [19] to capture keystrokes' typing patterns, and the built-in microphone of a smartphone has been used in [16], [20] to infer keystrokes on a solid surface. However, contactless sensing is usually vulnerable to environmental noises, hence limiting its performance in real-world applications. Therefore, camera-based approaches [26], [29], [30] have also been proposed to recognize keystrokes on a predefined keyboard layout using image processing.

However, existing camera-based text input methods assume a fixed camera, so that the coordinates of a keyboard layout keep unchanged in the fixed camera view. In reality, the camera of a head-mounted device can hardly keep still. Existing methods may not work in such dynamic moving scenes where the camera suffers from unavoidable movements. Specifically, as shown in Fig. 1, head movements cause camera jitters which lead to changes of the keyboard's coordinates in image frames, and eventually cause the mismatch between fingertip and key. The limitation of existing camera-based text input methods strongly motivates the work in this paper.

In this paper, we propose a novel scheme named DynaKey using camera and gyroscope for text input on a virtual keyboard in dynamic moving scenes. DynaKey does not impose a fixed camera, hence it works in more realistic scenarios.
Fig. 1 illustrates a typical scenario where a user wears a head-mounted camera device (e.g., smart glasses), while a standard keyboard layout can be printed on a piece of paper or drawn on a desk surface. DynaKey combines the embedded camera and gyroscope to track finger movements and recognize keystrokes in real time. Specifically, while the user types on a virtual keyboard, DynaKey utilizes the camera to capture image frames continuously, then detects fingertips and locates keystrokes using image processing techniques. During the typing process, when a head movement is detected by the gyroscope, DynaKey needs to track the changes of keyboard coordinates caused by camera movements. This keyboard tracking is crucial due to natural head movements in real application scenarios.

The design of DynaKey creates three key challenges that we aim to address in this paper.

The first challenge is how to track changes of the keyboard's coordinates accurately so that DynaKey is able to adapt to dynamic moving scenes. In reality, the camera moves naturally along with the head. Such movements will cause dynamic changes of the camera coordinate system. The different camera views and unavoidable image distortion eventually result in changes of the keyboard coordinates in image frames. An intuitive solution is to re-extract the keyboard layout from each image, but it is costly. In addition, we may not be able to obtain the keyboard layout from each image properly due to unavoidable occlusion by hands. Our intuitive idea asks a fundamental question: can we build a fixed coordinate system no matter how the keyboard coordinate changes? In DynaKey, we propose a Perspective Transformation-based technique that converts any previous coordinate to the current coordinate system. To obtain appropriate feature point pairs for facilitating transformation, we propose a keypoint selection method to dynamically select appropriate cross point pairs from the keyboard layout, while tolerating the occlusion of the keyboard.

The second challenge is how to detect and locate keystrokes efficiently and accurately from a single camera view. This is a non-trivial task due to the lack of depth information of fingertips from a single camera view. In the setting of a head-mounted camera and a keyboard located in front of and below the camera, the camera view from above and behind can hardly capture the perpendicular distance between the fingertip and the keyboard plane, i.e., it is difficult to determine whether a finger is typing and which finger is typing. To address this challenge, we utilize the variation of a fingertip's coordinates across multiple frames to detect a keystroke, i.e., whether a finger is typing. In addition to the fingertip movement, we further match a key's coordinates with the fingertip's coordinates to locate which finger is typing.
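As a rough illustration of this idea, the sketch below detects a keystroke from the variation of one fingertip's image coordinates across consecutive frames and then matches the pressed position against key bounding boxes. It is only a minimal sketch, not DynaKey's actual detector: the descend-then-pause pattern, the window size, the stillness threshold, and the toy key layout are all assumptions made for the example.

```python
# Minimal sketch (not DynaKey's implementation): detect a keystroke from the
# variation of one fingertip's image coordinates across frames, then match the
# pressed position against key bounding boxes. Thresholds and the key layout
# below are illustrative assumptions.
import numpy as np

def detect_keystroke(fingertip_track, keys, still_thresh=2.0, window=4):
    """fingertip_track: list of (x, y) fingertip coordinates, one per frame.
    keys: dict mapping a character to its (x_min, y_min, x_max, y_max) box.
    Returns the pressed character, or None if no keystroke is detected."""
    track = np.asarray(fingertip_track, dtype=float)
    if len(track) < 2 * window:
        return None
    # Assumed pattern: the fingertip's y coordinate increases (it moves toward
    # the keyboard in the image), then stays almost still for `window` frames.
    recent = track[-window:]
    earlier = track[-2 * window:-window]
    moved_down = recent[:, 1].mean() > earlier[:, 1].mean() + still_thresh
    still = np.linalg.norm(np.diff(recent, axis=0), axis=1).max() < still_thresh
    if not (moved_down and still):
        return None
    x, y = recent[-1]
    for ch, (x0, y0, x1, y1) in keys.items():
        if x0 <= x <= x1 and y0 <= y <= y1:  # fingertip lies inside this key
            return ch
    return None

# Toy usage with a fabricated track and two key boxes.
keys = {"f": (100, 200, 140, 240), "g": (145, 200, 185, 240)}
track = [(120, 150), (121, 170), (122, 200), (122, 221),
         (123, 222), (122, 222), (123, 223), (122, 222)]
print(detect_keystroke(track, keys))  # -> 'f'
```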
The third challenge is how to trade off between dynamic tracking of the keyboard and the tracking cost for resource-constrained devices. If the camera does not move or has negligible movements, tracking the keyboard's coordinates is unnecessary. To achieve the best trade-off for resource-constrained head-mounted devices, we introduce a gyroscope-based lightweight method to detect non-negligible camera movements, including short-time sharp movements and long-time accumulated micro movements. Only the detected non-negligible camera movements will trigger the keyboard tracking module, ensuring that DynaKey works dynamically in real time.

In summary, we make three main contributions in this paper. 1) To the best of our knowledge, this paper presents the first work focusing on efficient text input using the built-in camera of a head-mounted device (e.g., smart glasses) in dynamic moving scenes. To adapt to the dynamic camera views, we propose a Perspective Transformation-based technique to track the changes of the keyboard's coordinates. Besides, without the depth information of fingertips in a single camera view, we utilize the variation of the fingertip's coordinates across multiple frames for keystroke detection. 2) To ensure a real-time response, DynaKey adopts a lightweight gyroscope-based design to adaptively detect camera movements and remove unnecessary image processing for keyboard tracking. Besides, we introduce a series of optimizations such as keypoint selection, frame skipping and multi-thread processing for image processing. 3) We implement DynaKey on off-the-shelf Android devices, and conduct comprehensive experiments to evaluate the performance of DynaKey. Results show that the average tracking deviation of the keyboard layout is less than 3 pixels and the intersection over union (IoU) [25] of a key in two consecutive images is above 93%. The accuracy of keystroke localization reaches 95.5% on average. The response time is 63 ms, which is below the human response time [23].

II. RELATED WORK

Virtual keyboards have been used as an alternative to on-screen keyboards [1], [2] to support text input for mobile or wearable devices with small or no screens. These virtual keyboards can be mainly classified into five categories, i.e., wearable sensor-based, projection-based, WiFi-based, acoustic-based, and camera-based keyboards.

Wearable sensor-based keyboards: Wearable sensors have been used to capture the movements of fingers for text input. iKey [4] utilizes a wrist-worn piezoelectric ceramic sensor to recognize keystrokes on the back of the hand. DigiTouch [5] introduces a glove-based input device which enables thumb-to-finger touch interaction by sensing touch position and pressure. MagBoard [3] leverages the triaxial magnetometer embedded in mobile phones to locate a magnet on a printed keyboard. FingerSound [6] utilizes a thumb-mounted ring, which consists of a microphone and a gyroscope, to recognize unistroke thumb gestures for text input. These approaches introduce additional hardware to capture typing behaviors.

Projection-based keyboards: Projection keyboards [24], [27] have been proposed for mobile devices, by adopting a conventional QWERTY keyboard layout. They usually require a light projector to cast a keyboard layout onto a flat surface, and then recognize keystrokes based on light reflection. This approach requires dedicated equipment. Microsoft Hololens [13] provides a projection keyboard in front of a user using a pair of mixed-reality smart glasses. During text input, the user needs to move her/his head to pick a key and then make a specific 'tap' gesture to select the character.
This tedious process may slow down text input and affect user experience.

WiFi-based keyboards: By utilizing the unique pattern of channel state information (CSI) in time series, WiFinger [19]
is designed to recognize a set of finger-grained gestures to input text for off-the-shelf WiFi devices. Similarly, when a user types on a keyboard, WiKey [8] recognizes the typed keys based on how the CSI value changes at the WiFi signal receiver. However, WiFi-based approaches can be easily affected by the environment, such as changes in the transceiver's orientation or location, and unexpected human motions in surrounding areas. They are often expected to work in controlled environments, rather than real-world scenarios.

Acoustic-based keyboards: By utilizing the built-in microphones of mobile and wearable devices, acoustic-based keyboards have been recently proposed. UbiTap [16] presents an input method that turns a solid surface into a touch input space, based on the sound collected by the microphones. To infer the keystroke's position, it requires three phones to estimate the arrival time of acoustic signals. KeyListener [20] infers keystrokes on the QWERTY keyboard of a touch screen by leveraging the microphones of a smartphone; however, it is designed for indirect eavesdropping attacks, and the accuracy of keystroke inference is usually not sufficient for text input. UbiK [28] leverages the microphone of a mobile device to locate keystrokes, but it requires the user to click a key with the fingertip and nail margin, which may not be typical. Some Automatic Speech Recognition (ASR) tools [31] are also designed for text input by decoding the speaker's voice, but they can be vulnerable to environmental sounds and are not suitable for public spaces where quiet must be kept.

Camera-based keyboards: By using a built-in camera, TiPoint [18] detects keystrokes for interactions with smart glasses; it requires a finger to move and click on a mini-trackball to input a character. However, its input speed and user experience need further improvement for real applications. Chang et al.
[11] design a text input system for HMDs by cutting a keyboard into two parts. Its performance is almost comparable to that of single-hand text input on tablet computers. K. Sun et al. [26] propose a depth-aware tapping scheme for VR/AR devices by combining a microphone with a COTS mono-camera. It enables tracking of the user's fingers based on ultrasound and image frames. Yin et al. [29] leverage the built-in camera of a mobile device to recognize keystrokes by comparing the fingertip's location with a key's location in image frames. However, these methods assume that the text input space has a fixed location in the camera view, i.e., the coordinates of the keyboard or keys keep unchanged.

Our work is motivated by the recent advance of camera-based text input methods. We move an important step towards dynamic scenarios where the camera moves naturally with the user's head. In our work, the keyboard coordinates in the camera's view change dynamically, creating more challenges in achieving high accuracy in keystroke localization and low latency for resource-limited head-mounted devices.

III. OBSERVATIONS

We first conduct preliminary experiments to study how the changes of keyboard coordinates affect key tracking and keystroke localization in a dynamic scenario. In our experiments, we use a Samsung Galaxy S9 smartphone as a head-mounted camera device, as shown in Fig. 2(a). We use an A4-sized paper keyboard with the Microsoft Hololens [13] keyboard layout and keep its location unchanged. Unless otherwise specified, the frame rate of the camera is set to 30 fps and the sampling rate of the gyroscope is set to 200 Hz.

Fig. 2. Observations about coordinate changes of keys, captured frames for keystrokes, and time cost in image processing. (a) Experimental setup. (b) Unconscious head movements can lead to large coordinate deviations of keys. (c) Extracting all keys from each image leads to unacceptable time cost for real-time systems. (d) Head movements occur occasionally and last for several frames instead of all frames. (e) A frame from the camera view can hardly detect the depth information of fingertips.

Observation 1. Unconscious head movements can lead to large coordinate deviations of the keyboard. As shown in Fig. 2(a), the head-mounted camera moves along with the head. The head movements will lead to dynamic changes of the camera view. When the location of the keyboard keeps unchanged, the camera view changes will lead to the changes
of keyboard coordinates in the image frames. As shown in Fig. 2(b), the paper keyboard is represented as K, and the captured keyboard from the camera view is K1. We take the case of rotating around the y-axis (marked in Fig. 2(a)) as an example of head (i.e., camera) movements. When the camera slightly rotates ∆θ = 5° around the y-axis anticlockwise, the image frame changes from the x-y plane to the x′-y′ plane, and the captured keyboard in the image frame changes to K2. Correspondingly, the location offset of the keyboard reaches (∆dx, ∆dy) = (78, 27) pixels, which can lead to the mismatch between coordinates and keys. As shown in the right part of Fig. 2(b), due to the camera movement, the captured keyboard in the current image is shown in blue, while that in the original image is shown in black. In the current frame, i.e., the blue keyboard, the user types the letter 'y'. When using the coordinates of keys in the original frame, the keystroke may be mismatched to the letter 'h'.

Observation 2. Extracting all keys from each image suffers from unavoidable occlusion by hands and has an unacceptable processing cost. To track the coordinate changes of keys, an intuitive solution is to extract keys from each image frame. However, considering the hand occlusion, which is unavoidable, as shown in Fig. 2(b), it is difficult to extract each key from the image frame accurately. Besides, considering the limited resources of a head-mounted device and the real-time requirement of text input, the processing cost of extracting keys from each image frame is expensive. Specifically, we use ∆t to represent the processing cost of key extraction from an image frame, i.e., processing an input image and extracting all keys from the image. In Fig. 2(c), we show the cost of key extraction in 100 different frames. The result shows that the processing cost ∆t ranges from 40 ms to 60 ms, while the average cost is 49 ms, which is larger than the inter-frame duration (i.e., 33 ms). Therefore, extracting all keys from each image frame to track the coordinates of keys may be unacceptable for real applications. More time-efficient key tracking methods are expected.

Observation 3. Head movements occur occasionally and last for several frames instead of all frames. According to Observation 2, extracting all keys in each image can hardly work. In fact, we find that performing key extraction in each frame is unnecessary. Although the user's head moves during typing, the ratio of head movement duration to the whole typing duration is small. Fig. 2(d) shows that the head movements cause peaks in the gyroscope data during a typing process (i.e., 3 minutes); the total duration of the three head movements is less than 1 minute. This implies that during the typing process, the coordinates of keys in the image frames keep unchanged for more than 67% of the time. Consequently, we only need to re-extract the coordinates of keys when head movements are detected, rather than performing key extraction in each frame.
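To make this observation concrete, the following is a minimal sketch of a gyroscope-based trigger for key re-tracking, in the spirit of the adaptive tracking design described later. It is an illustrative approximation rather than DynaKey's actual detector: the thresholds for short-time sharp movements and for long-time accumulated micro movements are assumptions chosen for the example.

```python
# Illustrative sketch of a gyroscope-based trigger for key re-tracking
# (not the authors' exact detector). The thresholds are assumptions; the
# 200 Hz sampling rate follows the setup in Section III.
import numpy as np

SAMPLE_RATE = 200.0           # Hz, gyroscope sampling rate
SHARP_THRESH = 0.2            # rad/s, magnitude indicating a sharp head movement
ACCUM_THRESH = np.radians(2)  # rad, accumulated rotation indicating slow drift

def needs_retracking(gyro_samples, accumulated_angle):
    """gyro_samples: (N, 3) angular velocities (rad/s) since the last check.
    accumulated_angle: rotation (rad) integrated since the last re-tracking.
    Returns (trigger, new_accumulated_angle)."""
    mag = np.linalg.norm(gyro_samples, axis=1)
    # Short-time sharp movement: any sample exceeds the magnitude threshold.
    if mag.max() > SHARP_THRESH:
        return True, 0.0
    # Long-time accumulated micro movement: integrate |omega| over time.
    accumulated_angle += mag.sum() / SAMPLE_RATE
    if accumulated_angle > ACCUM_THRESH:
        return True, 0.0
    return False, accumulated_angle

# Toy usage: small jitter does not trigger re-tracking, a sharp turn does.
rng = np.random.default_rng(0)
jitter = rng.normal(0, 0.01, size=(20, 3))
print(needs_retracking(jitter, 0.0))                    # (False, small angle)
print(needs_retracking(jitter + [0.3, 0.0, 0.0], 0.0))  # (True, 0.0)
```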
Observation 4. A frame from the camera view is insufficient to detect the depth information of fingertips. To decide whether a keystroke is occurring or not, it is critical to determine whether the fingertip is pressing on a key. However, different from a front camera view, the camera view from above and behind can hardly detect the depth of an object, i.e., the perpendicular distance between the fingertip and the keyboard plane. As shown in Fig. 2(e), all fingers hover above the keyboard. Some fingertips appear above the keys from the camera view, hence it is easy to mistake non-keystrokes for keystrokes. To address this confusion, we may dynamically track the moving patterns of a fingertip and detect a keystroke from several frames instead of a single frame. We also need an efficient way to distinguish the fingertip pressing a key from other fingertips.

IV. SYSTEM DESIGN

We now present the design of DynaKey, which provides a text-input scheme for a head-mounted camera device in dynamic scenes, as shown in Fig. 1. DynaKey works in realistic scenarios where a user types on a virtual keyboard with natural head movements. The keyboard layout can be printed on a piece of paper or drawn on a desk surface. Unless otherwise specified, we use an Android smartphone as the head-mounted camera device, where the embedded camera is used to capture the user's typing behaviors and then track and locate keystrokes, and the embedded gyroscope is used to detect head movements. The keyboard layout is printed on a piece of paper, as shown in Fig. 2(a).

A. System Overview

Fig. 3 shows the framework of DynaKey. The inputs are the image frames captured by the camera and the angular velocity collected by the gyroscope, while the output is the character of the pressed key. Initially, the user keeps the head still and moves the hands out of the camera view for about 3 seconds, while Key Tracking detects the keyboard and extracts each key from the initial image frame. When the screen shows "Please TYPE", the user begins typing. During the typing process, we use Key Tracking to select keypoints of images to transform the coordinates of keys among different frames. At the same time, we use Adaptive Tracking to analyze the angular velocity from the gyroscope to detect head (i.e., camera) movements, and then determine whether to update the coordinates of keys or not. In addition, DynaKey uses Fingertip Detection to segment the hand region from the frame and detect the fingertips. After that, we use Keystroke Detection and Localization to detect whether a keystroke has occurred and to locate it. To ensure that DynaKey works in real time, we adopt three threads to implement image capturing, image processing (i.e., key tracking, fingertip detection, keystroke detection and localization), and adaptive tracking in parallel.

Fig. 3. Architecture of DynaKey.

B. Key Tracking

Before typing, we first need to extract keys from the image. With possible head movements, i.e., camera view changes, we then need to track the coordinates of keys in the following frames, as mentioned in Observation 1 of Section III. Key tracking in DynaKey consists of key extraction and coordinate transformation, as described below.

1) Key Extraction: We adopt a common QWERTY keyboard layout, which is printed in black and white on a piece of paper, as shown in Fig. 5(a). Given an input image as in Fig. 5(a), we use the Canny edge detection algorithm [10], [29] to obtain
all edges, and then find all possible contours from the detected edges, as shown in Fig. 5(b) and Fig. 5(c) respectively. The largest contour (i.e., the green contour shown in Fig. 5(c)) with four corners corresponds to the keyboard, where the corners are detected based on the angles formed by consecutive contour segments, as the red points shown in Fig. 5(d). When the keyboard location is fixed, i.e., the four corner points are fixed, as shown in Fig. 5(e), we can detect the keys from the keyboard. Specifically, among the small contours (i.e., the red contours shown in Fig. 5(c)) located within the keyboard, we utilize the area of a key to eliminate pitfall contours and then extract each key from the keyboard, as shown in Fig. 5(f). Finally, we map the extracted keys to characters based on the relative locations among keys, i.e., the known keyboard layout.

Fig. 5. Process of extracting keys. (a) An input frame. (b) Edge detection result. (c) All detected contours. (d) Corner point detection. (e) Keyboard with corner points. (f) Key extraction result.
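To make the extraction pipeline concrete, the following is a minimal OpenCV sketch of the steps described above (Canny edges, contour detection, picking the largest four-corner contour as the keyboard, and filtering key-sized contours by area). It is an illustrative approximation rather than DynaKey's exact implementation: the Canny thresholds, the area bounds, and the input path are assumptions.

```python
# Illustrative OpenCV sketch of the key-extraction steps described above.
# Thresholds, area bounds, and the image path are assumptions, not the
# authors' parameters. Uses the OpenCV 4.x findContours return signature.
import cv2
import numpy as np

def extract_keyboard_and_keys(frame, min_key_area=500, max_key_area=5000):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)                  # Canny edge detection
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST,
                                   cv2.CHAIN_APPROX_SIMPLE)
    keyboard_corners, keys = None, []
    for cnt in sorted(contours, key=cv2.contourArea, reverse=True):
        # Approximate each contour by a polygon; the first (largest) contour
        # that reduces to four corners is taken as the keyboard boundary.
        approx = cv2.approxPolyDP(cnt, 0.02 * cv2.arcLength(cnt, True), True)
        if keyboard_corners is None and len(approx) == 4:
            keyboard_corners = approx.reshape(4, 2)
            continue
        # Contours whose area falls in a key-sized range are kept as candidate
        # keys (a full implementation would also check they lie inside the
        # keyboard and map them to characters via the known layout).
        area = cv2.contourArea(cnt)
        if min_key_area < area < max_key_area:
            keys.append(cv2.boundingRect(cnt))        # (x, y, w, h) per key
    return keyboard_corners, keys

# Usage (assumes a photo of the printed keyboard at this hypothetical path):
# frame = cv2.imread("keyboard_frame.jpg")
# corners, keys = extract_keyboard_and_keys(frame)
```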
2) Coordinate Transformation: Due to head movements, it is essential to track the coordinates of keys among different frames. Besides, the camera view changes also introduce distortion of the keyboard in images, as the two captured quadrilaterals P0P1P3P2 and Q0Q1Q3Q2 shown in Fig. 4. To tolerate the camera movement and image distortion, we propose a Perspective Transformation-based method to track the coordinates of keys.

Fig. 4. Principle of perspective transformation.

Perspective Transformation: As shown in Fig. 4, for a fixed point $G_i$ in the physical space, when we obtain its projection point $(X_i, Y_i)$ in the $j$th frame, perspective transformation [21] can use a transformation matrix $C = (C_{00}, C_{01}, C_{02}; C_{10}, C_{11}, C_{12}; C_{20}, C_{21}, C_{22})$ to calculate its projection $(U'_i, V'_i)$ in the $k$th frame. Therefore, when the paper keyboard is fixed, we can use the known keyboard/key locations in the previous frames to infer the keyboard/key locations in the following frames, without keyboard detection and key extraction. Specifically, with the known projection point $(X_i, Y_i)$ in the $j$th frame, we first use $C$ to calculate the 3D coordinate $(U_i, V_i, W_i)$ related to $(X_i, Y_i)$ in the physical space, as described in Eq. (1). We then introduce a division operation to obtain its corresponding projection point $(U'_i, V'_i)$ in the $k$th frame, as described in Eq. (2).

$$
\begin{bmatrix} U_i \\ V_i \\ W_i \end{bmatrix} =
\begin{bmatrix} C_{00} & C_{01} & C_{02} \\ C_{10} & C_{11} & C_{12} \\ C_{20} & C_{21} & C_{22} \end{bmatrix} \cdot
\begin{bmatrix} X_i \\ Y_i \\ 1 \end{bmatrix}
\quad (1)
$$

$$
U'_i = \frac{U_i}{W_i} = \frac{C_{00} X_i + C_{01} Y_i + C_{02}}{C_{20} X_i + C_{21} Y_i + C_{22}}, \qquad
V'_i = \frac{V_i}{W_i} = \frac{C_{10} X_i + C_{11} Y_i + C_{12}}{C_{20} X_i + C_{21} Y_i + C_{22}}
\quad (2)
$$

Here, the projection points of the keyboard or keys in the previous frame can be obtained through key extraction, as mentioned in Section IV-B1. Thus the main challenge lies in the calculation of the transformation matrix $C$, which is described below.

Keypoint Selection: In the transformation matrix $C$, $C_{22}$ is a scale factor and is usually set to $C_{22} = 1$; thus we only need to calculate the other eight variables, which can be solved by selecting four non-collinear feature point pairs (e.g., $P_i(X_i, Y_i)$ and $Q_i(U'_i, V'_i)$, $i \in [0, 3]$, shown in Fig. 4). The specific formula for calculating $C$ with four feature point pairs is shown in Eq. (3).

$$
\begin{bmatrix}
X_0 & Y_0 & 1 & 0 & 0 & 0 & -X_0 U'_0 & -Y_0 U'_0 \\
X_1 & Y_1 & 1 & 0 & 0 & 0 & -X_1 U'_1 & -Y_1 U'_1 \\
X_2 & Y_2 & 1 & 0 & 0 & 0 & -X_2 U'_2 & -Y_2 U'_2 \\
X_3 & Y_3 & 1 & 0 & 0 & 0 & -X_3 U'_3 & -Y_3 U'_3 \\
0 & 0 & 0 & X_0 & Y_0 & 1 & -X_0 V'_0 & -Y_0 V'_0 \\
0 & 0 & 0 & X_1 & Y_1 & 1 & -X_1 V'_1 & -Y_1 V'_1 \\
0 & 0 & 0 & X_2 & Y_2 & 1 & -X_2 V'_2 & -Y_2 V'_2 \\
0 & 0 & 0 & X_3 & Y_3 & 1 & -X_3 V'_3 & -Y_3 V'_3
\end{bmatrix} \cdot
\begin{bmatrix}
C_{00} \\ C_{01} \\ C_{02} \\ C_{10} \\ C_{11} \\ C_{12} \\ C_{20} \\ C_{21}
\end{bmatrix} =
C_{22} \cdot
\begin{bmatrix}
U'_0 \\ U'_1 \\ U'_2 \\ U'_3 \\ V'_0 \\ V'_1 \\ V'_2 \\ V'_3
\end{bmatrix}
\quad (3)
$$

To get the feature point pairs, a FLANN-based matcher [22] is often adopted, which finds an approximate (may be not

Fig. 6. Feature point selection and time cost of the FLANN-based matcher. (a) Keypoint selection by the FLANN-based matcher. (b) The time cost of two keypoint selection methods.
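As a concrete illustration of Eqs. (1)-(3), the sketch below estimates the transformation matrix C from four matched point pairs and then re-projects key coordinates from a previous frame into the current one. It is only an example: the coordinates are made up, and OpenCV's getPerspectiveTransform/perspectiveTransform are used as stand-ins that solve the same linear system as Eq. (3) with C22 = 1 and apply Eqs. (1)-(2).

```python
# Illustrative use of perspective transformation (Eqs. (1)-(3)) to re-project
# key coordinates from a previous frame into the current frame. The point
# coordinates below are made up for the example.
import cv2
import numpy as np

# Four matched feature points (e.g., cross points of the keyboard grid):
# P_i in the previous (j-th) frame and Q_i in the current (k-th) frame.
prev_pts = np.float32([[100, 100], [500, 100], [100, 300], [500, 300]])
curr_pts = np.float32([[120, 110], [515, 95], [110, 315], [520, 290]])

# C solves the 8-unknown linear system of Eq. (3) with C22 fixed to 1.
C = cv2.getPerspectiveTransform(prev_pts, curr_pts)

# Corner coordinates of one key extracted in the previous frame.
key_corners_prev = np.float32([[[200, 150], [240, 150], [240, 190], [200, 190]]])

# Apply Eqs. (1)-(2): multiply by C, then divide by the third component.
key_corners_curr = cv2.perspectiveTransform(key_corners_prev, C)
print(np.round(key_corners_curr, 1))
```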