IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. X, NO. X, XXXX 2018

CamK: Camera-based Keystroke Detection and Localization for Small Mobile Devices

Yafeng Yin, Member, IEEE, Qun Li, Fellow, IEEE, Lei Xie, Member, IEEE, Shanhe Yi, Ed Novak, and Sanglu Lu, Member, IEEE

Abstract—Because of the small size of mobile devices, text entry with on-screen keyboards becomes inefficient. Therefore, we present CamK, a camera-based text-entry method that uses a panel (e.g., a piece of paper) with a keyboard layout to input text into small devices. With the built-in camera of the mobile device, CamK captures images during the typing process and utilizes image processing techniques to recognize the typing behavior, i.e., extracting the keys, tracking the user's fingertips, and detecting and locating keystrokes. To achieve high accuracy of keystroke localization and a low false positive rate of keystroke detection, CamK introduces initial training and online calibration. To reduce the time latency, CamK optimizes the computation-intensive modules by changing image sizes, focusing on target areas, introducing multiple threads, and removing the operations of writing or reading images. Finally, we implement CamK on mobile devices running Android. Our experimental results show that CamK can achieve above 95% accuracy in keystroke localization, with only a 4.8% false positive rate. When compared with on-screen keyboards, CamK can achieve a 1.25X typing speedup for regular text input and 2.5X for random character input. In addition, we introduce word prediction to further improve the input speed for regular text by 13.4%.

Index Terms—Mobile text-entry, camera, keystroke detection and localization, small mobile devices.

1 INTRODUCTION

In recent years, we have witnessed a rapid development of electronic devices and mobile technology. Mobile devices (e.g., smartphones, the Apple Watch) have become smaller and smaller so that they can be carried everywhere easily, freeing users from carrying bulky laptops. However, the small size of mobile devices brings many new challenges; a typical example is inputting text into a small mobile device without a physical keyboard.

In order to get rid of the constraint of bulky physical keyboards, many virtual keyboards have been proposed, e.g., wearable keyboards, on-screen keyboards, projection keyboards, etc. However, wearable keyboards introduce additional equipment such as rings [1] and gloves [2]. On-screen keyboards [3], [4] usually take up a large area of the screen and support only a single finger for text entry, so typing on a small screen becomes inefficient. Projection keyboards [5], [6] often need a visible light projector or lasers to display the keyboard. To remove the additional hardware, virtual keyboards based on audio signals [7] and cameras [8], [9] have been proposed. However, UbiK [7] requires the user to click keys with their fingertips and nails, while existing camera-based keyboards either slow down typing [8] or must be used in controlled environments [9]. The existing schemes thus struggle to provide a user experience similar to that of a physical keyboard.

• Y. Yin, L. Xie and S. Lu are with the State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China. E-mail: {yafeng, lxie, sanglu}@nju.edu.cn
• Q. Li and S. Yi are with the Department of Computer Science, College of William and Mary, Williamsburg, Virginia 23187. E-mail: {liqun, syi}@cs.wm.edu.
• E. Novak is with the Computer Science Department, Franklin and Marshall College, Lancaster, PA 17604. E-mail: enovak@fandm.edu.
• Lei Xie is the corresponding author.

Manuscript received 0.0000; revised 0.0000.

Fig. 1. A typical use case of CamK.

To provide a PC-like text-entry experience, we propose CamK, a camera-based keyboard offering a more natural and intuitive text-entry method. As shown in Fig. 1, CamK works with the front-facing camera of the mobile device and a paper keyboard. CamK takes pictures as the user types on the paper keyboard, and uses image processing techniques to detect and locate keystrokes. Then, CamK outputs the corresponding character of the pressed key. CamK can be used in a wide variety of scenarios, e.g., the office, coffee shops, outdoors, etc. However, to make CamK work well, we need to solve the following key technical challenges.

(1) Location deviation: On a paper keyboard, the inter-key distance is only about two centimeters [7]. Image processing techniques may introduce a position deviation between the real fingertip and the detected fingertip, and this deviation may lead to keystroke localization errors. To address this challenge, CamK introduces initial training to get the optimal parameters for image processing. Then, CamK uses an extended region to represent the detected fingertip, to tolerate the position deviation. Besides, CamK utilizes the features of a keystroke (e.g., the fingertip is located in the key for a certain duration, the pressed key is partially obstructed by the fingertip, etc.) to verify the validity of a keystroke.
(2) False positives: A false positive occurs when a non-keystroke (i.e., a period in which no fingertip is pressing any key) is recognized as a keystroke. Without the assistance of other resources like audio signals, CamK has to detect keystrokes with images alone. To address this challenge, CamK combines keystroke detection with keystroke localization. For a potential keystroke, if there is no valid key pressed by the fingertip, CamK will discard the keystroke and recognize it as a non-keystroke. Additionally, CamK introduces online calibration, i.e., using the movement features of the fingertip after a keystroke, to further decrease the false positive rate.

(3) Processing latency: To serve as a text-entry method, when the user presses a key on the paper keyboard, CamK should output the character of the key without any noticeable latency. However, due to the limited computing resources of small mobile devices, the heavy computation overhead of image processing will lead to a large latency. To address this challenge, CamK optimizes the computation-intensive modules by adaptively changing image sizes, focusing on the target area in the large-size image, adopting multiple threads, and removing the operations of writing/reading images.

We make the following contributions in this paper (a preliminary version of this work appeared in [10]).

• We design a practical framework for CamK, which operates using a smart mobile device camera and a portable paper keyboard. Based on image processing, CamK can detect and locate keystrokes with high accuracy and a low false positive rate.
• We realize real-time text entry for small mobile devices with limited resources, by optimizing the computation-intensive modules. Additionally, we introduce word prediction to further improve the input speed and reduce the error rate.
• We implement CamK on smartphones running Android. We first evaluate each module in CamK. Then, we conduct extensive experiments to test the performance of CamK. After that, we compare CamK with other methods in input speed and error rate.

2 RELATED WORK

Considering the small sizes of mobile devices, many virtual keyboards have been proposed for text entry, e.g., wearable keyboards, on-screen keyboards, projection keyboards, camera based keyboards, etc.

Wearable keyboards: Wearable keyboards sense and recognize the typing behavior based on the sensors built into rings [1], [11], gloves [12], and so on. TypingRing [13] utilizes the embedded sensors of the ring to input text. Finger-Joint keypad [14] works with a glove equipped with pressure sensors. The Senseboard [2] consists of two rubber pads and senses the movements in the palm to get keystrokes. Funk et al. [15] utilize a touch sensitive wristband to enter text based on the location of the touch. These wearable keyboards often need the user to wear devices around the hands or fingers, which degrades the user experience.

On-screen keyboards: On-screen keyboards allow the user to enter characters on a touch screen. Considering the limited area of the keyboard on the screen, BigKey [3] and ZoomBoard [4] adaptively change the size of keys. ContextType [16] leverages hand postures to improve mobile touch screen text entry. Kwon et al. [17] introduce a regional error correction method to reduce the number of necessary touches. ShapeWriter [18] recognizes a word based on the trace over successive letters in the word.
Sandwich keyboard [19] affords ten-finger touch typing by utilizing a touch sensor on the back side of a device. Usually, on-screen keyboards occupy the screen area and support only one finger for typing. Besides, the user often needs to switch between different screens to type letters, digits, punctuation, etc.

Projection keyboards: Projection keyboards usually need a visible light projector or lasers to cast a keyboard, and then utilize image processing methods [5] or infrared light [6] to detect typing events. Hu et al. use a pico-projector to project the keyboard on the table, and then detect the touch interaction by the distortion of the keyboard projection [20]. Roeber et al. utilize a pattern projector to display the keyboard layout on a flat surface, and then detect keyboard events based on the intersection of fingers and infrared light [21]. Projection keyboards often require extra equipment, e.g., a visible light projector, infrared light modules, etc., which increases the cost and makes text entry less convenient.

Camera based keyboards: Camera based virtual keyboards use the captured image [22] or video [23] to infer the keystroke. Gesture keyboard [22] gets the input by recognizing the finger's gesture. It works without a keyboard layout, so the user needs to remember the mapping between the keys and the finger's gestures. Visual Panel [8] works with a printed keyboard on a piece of paper. It requires the user to use only one finger and to wait for one second before each keystroke. Malik et al. present the Visual Touchpad [24] to track the 3D positions of the fingertips based on two downward-pointing stereo cameras. Adajania et al. [9] detect the keystroke based on shadow analysis with a standard web camera. Hagara et al. estimate the finger positions and detect clicking events based on edge detection, fingertip localization, etc. [25]. The iPhone app paper keyboard [26] allows the user to use only one finger to input letters. The above research work usually focuses on detecting and tracking the fingertips, instead of locating the fingertip in a key's area of the keyboard, which is what our paper studies.

In addition to the above text-entry solutions, MacKenzie et al. [27] describe text entry for mobile computing. Zhang et al. [28] propose Okuli to locate the user's finger based on visible light communication modules, LEDs, and light sensors. Wang et al. [7] propose UbiK to locate the keystroke based on audio signals. The existing work usually needs extra equipment, allows only one finger to type, or needs to change the user's typing behavior, and it is therefore difficult to provide a PC-like text-entry experience. In this paper, we propose a text-entry method based on the built-in camera of the mobile device and a paper keyboard, to provide a user experience similar to that of a physical keyboard.

3 FEASIBILITY STUDY AND OVERVIEW OF CAMK

In order to show the feasibility of locating keystrokes based on image processing techniques, we first present the observations of a keystroke from the camera's view. After that, we describe the system overview of CamK.
Fig. 2. Frames (a)-(e) captured during two consecutive keystrokes.

3.1 Observations of a Keystroke

In Fig. 2, we show the frames/images captured by the camera during two consecutive keystrokes. The origin of axes is located in the top left corner of the image, as shown in Fig. 2(a). The hand located in the left area of the image is called the left hand, while the other is called the right hand, as shown in Fig. 2(b). From left to right, the fingers are called finger i in sequence, i ∈ [1, 10], as shown in Fig. 2(c). The fingertip pressing the key is called the StrokeTip, while the pressed key is called the StrokeKey, as shown in Fig. 2(d). When the user presses a key, i.e., a keystroke occurs, the StrokeTip and StrokeKey often have the following features, which can be used to track, detect and locate the keystroke.

(1) Coordinate position: The StrokeTip usually has the largest vertical coordinate among the fingers on the same hand, because the user tends to stretch out one finger when typing a key. An example is finger 9 in Fig. 2(a). However, considering the particularity of thumbs, this feature may not be suitable for them. Therefore, we detect the StrokeTip separately for thumbs and other fingertips.

(2) Moving state: The StrokeTip stays on the StrokeKey for a certain duration in a typing operation, as finger 2 shows in Fig. 2(c)-2(d). If the position of the fingertip stays unchanged, a keystroke may have happened.

(3) Correlated location: The StrokeTip is located in the StrokeKey, in order to press that key, such as finger 9 in Fig. 2(a) and finger 2 in Fig. 2(d).

(4) Obstructed view: The StrokeTip obstructs the StrokeKey from the view of the camera, as shown in Fig. 2(d). The ratio of the visually obstructed area to the whole area of the key can be used to verify whether the key is really pressed.

(5) Relative distance: The StrokeTip usually has the largest vertical distance from the remaining fingertips of the same hand, because the user usually stretches out the finger to press a key. Thus this feature can be used to infer which hand generates the keystroke. In Fig. 2(a), the vertical distance dr between the StrokeTip (i.e., finger 9) and the remaining fingertips of the right hand is larger than the corresponding distance dl of the left hand. Thus we choose finger 9, rather than finger 2, as the StrokeTip.
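Taken together, these five features suggest a simple verification procedure. The sketch below (in Python) is a minimal illustration of how they might be combined, not CamK's actual implementation; the function names, the rectangle representation of keys, and the thresholds max_move and min_obstructed are hypothetical, and coordinates follow Fig. 2(a), with y growing downward from the top-left origin.

import math

def is_static(history, max_move=5.0):
    # Feature (2): the fingertip position is nearly unchanged between
    # consecutive frames.
    (x0, y0), (x1, y1) = history[-2], history[-1]
    return math.hypot(x1 - x0, y1 - y0) <= max_move

def candidate_tip(hand_tips):
    # Features (1)/(5): within one hand, the pressing fingertip tends
    # to be the lowest point in the image, i.e., largest y coordinate.
    return max(hand_tips, key=lambda tip: tip[1])

def inside_key(tip, key_rect):
    # Feature (3): the fingertip lies within the pressed key's area.
    x, y = tip
    x0, y0, x1, y1 = key_rect
    return x0 <= x <= x1 and y0 <= y <= y1

def verify_keystroke(history, key_rect, obstructed_ratio,
                     min_obstructed=0.25):
    # Feature (4): the fingertip should visually cover a large enough
    # fraction of the key, combined with features (2) and (3).
    tip = history[-1]
    return (is_static(history) and inside_key(tip, key_rect)
            and obstructed_ratio >= min_obstructed)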
3.2 System Overview

As shown in Fig. 1, CamK works with a mobile device and a paper keyboard. The device uses the front-facing camera to capture the typing process, while the paper keyboard is placed on a flat surface and located in the camera's view. We take Fig. 1 as an example to describe the deployment. In Fig. 1, the mobile device is a Samsung N9109W smartphone, l denotes the distance between the device and the printed keyboard, and α denotes the angle between the plane of the device's screen and that of the keyboard. In Fig. 1, we set l = 13.5 cm and α = 90°, to make the letter keys large enough in the camera's view. In fact, there are no strict requirements on the values of these parameters, especially since the position of the camera varies across devices. In Fig. 1, with a fixed A4-sized paper keyboard, l can range within [13.5 cm, 18.0 cm], while α can range within [78.8°, 90.0°]. Even if some part of the keyboard is out of the camera's view, CamK still works.

The architecture of CamK is shown in Fig. 3. The input is the image taken by the camera and the output is the character of the pressed key. Before a user begins typing, CamK uses Key Extraction to detect the keyboard and extract each key from the image. When the user types, CamK uses Fingertip Detection to extract the user's hands and detect their fingertips. Based on the movements of the fingertips, CamK uses Keystroke Detection and Localization to detect a possible keystroke and locate it. Finally, CamK uses Text Output to output the character of the pressed key.

Fig. 3. Architecture of CamK.

4 SYSTEM DESIGN

According to Fig. 3, CamK consists of four components: key extraction, fingertip detection, keystroke detection and localization, and text output. Text output is straightforward to implement; therefore, we mainly describe the first three components.
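As a structural sketch of this architecture, the skeleton below separates the one-time key extraction from the per-frame processing loop. It is hypothetical rather than CamK's actual code: the helper functions are stubs standing in for Sections 4.1-4.3, and the camera object with a read() method is an assumed abstraction.

from typing import List, Optional, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates, origin at top-left

def extract_keys(frame) -> dict:
    return {}  # stub for Key Extraction (Section 4.1), run once

def detect_fingertips(frame) -> List[Point]:
    return []  # stub for Fingertip Detection (Section 4.2), per frame

def locate_keystroke(tips, prev_tips, keys) -> Optional[str]:
    return None  # stub for Keystroke Detection and Localization (4.3)

def camk_loop(camera):
    keys = extract_keys(camera.read())   # before the user begins typing
    prev_tips: List[Point] = []
    while True:
        frame = camera.read()
        tips = detect_fingertips(frame)
        ch = locate_keystroke(tips, prev_tips, keys)
        if ch is not None:
            print(ch, end="", flush=True)  # Text Output
        prev_tips = tips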
4.1 Key Extraction

Without loss of generality, CamK adopts the common QWERTY keyboard layout, which is printed in black and white on a piece of paper, as shown in Fig. 1. In order to eliminate the effects of the background, we first detect the boundary of the keyboard. Then, we extract each key from the keyboard. Therefore, key extraction contains three parts: keyboard detection, key segmentation, and mapping the characters to the keys, as shown in Fig. 3.

Fig. 4. Keyboard detection and key extraction: (a) an input image, (b) edge detection, (c) edge optimization, (d) keyboard range, (e) keyboard boundary, (f) key segmentation.

4.1.1 Keyboard detection

We use the Canny edge detection algorithm [29] to obtain the edges of the keyboard. Fig. 4(b) shows the edge detection result of Fig. 4(a). However, the interference edges (e.g., the paper's edge, the longest edge in Fig. 4(b)) should be removed. Based on Fig. 4(b), the edges of the keyboard should be close to the edges of keys. We use this feature to remove pitfall edges; the result is shown in Fig. 4(c). Additionally, we adopt the dilation operation [30] to join dispersed edge points that are close to each other, to get better edges/boundaries of the keyboard. After that, we use the Hough transform [8] to detect the lines in Fig. 4(c). Then, we use the uppermost line and the bottom line to describe the position range of the keyboard, as shown in Fig. 4(d). Similarly, we can use the Hough transform [8] to detect the left/right edge of the keyboard. If there are no suitable edges detected by the Hough transform, it is usually because the keyboard is not perfectly located in the camera's view. In this case, we simply use the left/right boundary of the image to represent the left/right edge of the keyboard. As shown in Fig. 4(e), we extend the four edges (lines) to get four intersections B1(x1, y1), B2(x2, y2), B3(x3, y3), B4(x4, y4), which describe the boundary of the keyboard.

4.1.2 Key segmentation

Considering the short interference edges generated by the edge detection algorithm, it is difficult to accurately segment each key from the keyboard with the detected edges. Consequently, we utilize the color difference between the white keys and the black background, together with the area of a key, for key segmentation, to reduce the effect of pitfall areas.

Firstly, we introduce color segmentation to distinguish the white keys from the black background. For convenience of image processing, we represent the color in YCrCb space, where the color coordinate (Y, Cr, Cb) of a white pixel is (255, 128, 128), while that of a black pixel is (0, 128, 128). Thus, we only compute the difference in the Y value between pixels to distinguish the white keys from the black background. If a pixel located within the keyboard satisfies 255 − εy ≤ Y ≤ 255, the pixel belongs to a key. The offset εy ∈ N of Y is mainly caused by lighting conditions. εy can be estimated in the initial training (see Section 5.1); its initial/default value is 50.

When we obtain the white pixels, we need to get the contours of the keys and separate the keys from one another. To avoid pitfall areas, such as small white areas which do not belong to any key, we introduce the area of a key. Based on Fig. 4(e), we first use B1, B2, B3, B4 to calculate the area Sb of the keyboard as Sb = (1/2) · (|B1B2 × B1B4| + |B3B4 × B3B2|), where B1B2, etc., denote the vectors between the corner points. Then, we calculate the area of each key. We use N to represent the number of keys in the keyboard. Considering the size difference between keys, we treat larger keys (e.g., the space key) as multiple regular keys (e.g., A-Z, 0-9); for example, the space key is treated as five regular keys. In this way, N is converted into the equivalent number of regular keys, Navg. Then, we can estimate the average area of a regular key as Sb/Navg. In addition to the size difference between keys, the camera's view can also affect the area of a key in the image. Therefore, we introduce αl, αh to describe the range of a valid key area Sk as Sk ∈ [αl · Sb/Navg, αh · Sb/Navg]. We set αl = 0.15, αh = 5 in CamK, based on extensive experiments. The key segmentation result of Fig. 4(e) is shown in Fig. 4(f). Then, we use the location of the space key (the biggest key) to locate the other keys, based on the relative locations between keys.
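These two steps map naturally onto standard OpenCV primitives. The sketch below shows one possible realization in Python; the Canny and Hough thresholds are illustrative, the selection of the four boundary lines (and the fallback to image borders) is omitted, and the corner points B1..B4 are assumed to be given.

import cv2
import numpy as np

def quad_area(b1, b2, b3, b4):
    # S_b = 1/2 * (|B1B2 x B1B4| + |B3B4 x B3B2|), the keyboard area.
    b1, b2, b3, b4 = (np.array(p, dtype=float) for p in (b1, b2, b3, b4))
    return 0.5 * (abs(np.cross(b2 - b1, b4 - b1))
                  + abs(np.cross(b4 - b3, b2 - b3)))

def detect_boundary_lines(bgr):
    # 4.1.1: Canny edges, dilation, then Hough lines; picking the
    # uppermost/bottom/left/right lines (or image borders) is omitted.
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.dilate(cv2.Canny(gray, 50, 150),
                       np.ones((3, 3), np.uint8))
    return cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                           minLineLength=100, maxLineGap=10)

def segment_keys(bgr, corners, n_avg, eps_y=50, a_l=0.15, a_h=5.0):
    # 4.1.2: white pixels (255 - eps_y <= Y <= 255) belong to keys;
    # keep contours whose area lies in [a_l, a_h] * (S_b / N_avg).
    y_ch = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)[:, :, 0]
    white = cv2.inRange(y_ch, 255 - eps_y, 255)
    contours, _ = cv2.findContours(white, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    s_key = quad_area(*corners) / n_avg
    return [c for c in contours
            if a_l * s_key <= cv2.contourArea(c) <= a_h * s_key]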
4.2 Fingertip Detection

After extracting the keys, we need to track the fingertips to detect and locate keystrokes. To achieve this goal, we first detect the fingertips with hand segmentation and fingertip discovery, as shown below.

4.2.1 Hand segmentation

Skin segmentation [30] is often used for hand segmentation. In the YCrCb color space, a pixel (Y, Cr, Cb) is determined to be a skin pixel if it satisfies Cr ∈ [133, 173] and Cb ∈ [77, 127]. However, the threshold values of Cr and Cb can be affected by the surroundings, such as lighting conditions, which makes it difficult to choose suitable thresholds. Therefore, we combine Otsu's method [31] with the red channel of the YCrCb color space for skin segmentation.

In the YCrCb color space, the red channel Cr is essential to human skin color. Therefore, given a captured image, we use the grayscale image split from the Cr channel as the input to Otsu's method [31]. Otsu's method automatically performs clustering-based image thresholding, i.e., it calculates the optimal threshold to separate the foreground from the background. The hand segmentation result of Fig. 5(a) is shown in Fig. 5(b), where the white regions represent the hand regions with high values in the Cr channel, while the black regions represent the background. However, around the hands there exist some interference regions, which may change the contours of fingers and result in wrongly detected fingertips. Thus, CamK introduces the following erosion and dilation operations [32]. We first use the erosion operation to isolate the hands from the keys and separate the fingers from each other. Then, we use the dilation operation to smooth the edges of the fingers. Fig. 5(c) shows the optimized result of hand segmentation. After that, we select the two largest segmented areas as the hand regions, i.e., the left hand and the right hand, to further reduce the effect of interference regions, such as the red areas in Fig. 5(c).

Fig. 5. Fingertip detection: (a) an input image, (b) hand segmentation, (c) optimization, (d) fingers' contour, (e) fingertip discovery (angle and vertical coordinate along the point sequence of the contour), (f) fingertips.
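A minimal sketch of this hand segmentation step, again with OpenCV: Otsu's threshold on the Cr channel, followed by erosion and dilation, keeping the two largest regions. The kernel size and the single erode/dilate pass are assumptions rather than CamK's exact settings.

import cv2
import numpy as np

def segment_hands(bgr):
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
    cr = ycrcb[:, :, 1]
    # Otsu's method picks the threshold separating skin from background.
    _, mask = cv2.threshold(cr, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.erode(mask, kernel)    # isolate hands, separate fingers
    mask = cv2.dilate(mask, kernel)   # smooth the finger edges
    # Keep the two largest connected regions as the left/right hands.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return sorted(contours, key=cv2.contourArea, reverse=True)[:2]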
4.2.2 Fingertip discovery

After we extract the fingers, we detect the fingertips. We differentiate between the thumbs (i.e., fingers 5-6 in Fig. 2(c)) and non-thumbs (i.e., fingers 1-4 and 7-10 in Fig. 2(c)) in shape and typing movement, as shown in Fig. 6.

In a non-thumb, the fingertip is usually a convex vertex, as shown in Fig. 6(a). For a point Pi(xi, yi) located on the contour of a hand, by tracing the contour, we can select the point Pi−q(xi−q, yi−q) before Pi and the point Pi+q(xi+q, yi+q) after Pi. Here, i, q ∈ N. We calculate the angle θi between the two vectors PiPi−q and PiPi+q according to Eq. (1). To simplify the calculation, we map θi into the range [0°, 180°]. If θi ∈ [θl, θh], θl < θh, we call Pi a candidate vertex. Considering the relative locations of the points, Pi should also satisfy yi > yi−q and yi > yi+q; otherwise, Pi will not be a candidate vertex. If there are multiple candidate vertices, such as P′i in Fig. 6(a), we choose the vertex with the largest vertical coordinate, because it has the highest probability of being a fingertip, such as Pi in Fig. 6(a). Here, the largest vertical coordinate means the local maximum along a finger's contour, such as the red circle shown in Fig. 5(e). The range of a finger's contour can be limited by Eq. (1), i.e., the angle feature of a finger. Based on extensive experiments, we set θl = 60°, θh = 150°, and q = 20 in this paper.

θi = arccos((PiPi−q · PiPi+q) / (|PiPi−q| · |PiPi+q|))  (1)

Fig. 6. Features of a fingertip: (a) fingertips (non-thumbs), (b) a thumb.

In a thumb, the "fingertip" also means a convex vertex of the finger. Thus we still use Eq. (1) to represent the shape of the fingertip of a thumb. However, the position of the convex vertex can be different from that of a non-thumb. As shown in Fig. 6(b), the relative positions of Pi−q, Pi, Pi+q are different from those in Fig. 6(a). In Fig. 6(b), we show the thumb of the left hand. Obviously, Pi−q, Pi, Pi+q do not satisfy yi > yi−q and yi > yi+q. Therefore, we use (xi − xi−q) · (xi − xi+q) > 0 to describe the relative locations of Pi−q, Pi, Pi+q for thumbs. Then, we choose the vertex with the largest vertical coordinate along a finger's contour as the fingertip, as mentioned in the last paragraph.

In fingertip detection, we only need to examine the points located on the bottom edge (from the leftmost point to the rightmost point) of the hand, such as the blue contour of the right hand in Fig. 5(d). The shape feature θi and the vertical coordinates yi along the bottom edge are shown in Fig. 5(e). If we can detect five fingertips in a hand with θi and yi−q, yi, yi+q, we assume that we have also found the thumb; in this case, the thumb presses a key like a non-thumb. Otherwise, we detect the fingertip of the thumb in the rightmost area of the left hand or the leftmost area of the right hand according to θi and xi−q, xi, xi+q. The detected fingertips of Fig. 5(a) are marked in Fig. 5(f).
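The sketch below implements Eq. (1) and the non-thumb candidate-vertex test on a hand contour, using the paper's parameter values (θl = 60°, θh = 150°, q = 20); the closed-contour representation (a list of (x, y) points) and the wrap-around indexing are assumptions.

import math

def angle_at(contour, i, q=20):
    # Eq. (1): angle between vectors P_i->P_{i-q} and P_i->P_{i+q},
    # mapped into [0, 180] degrees.
    xi, yi = contour[i]
    xa, ya = contour[i - q]                   # P_{i-q} (wraps around)
    xb, yb = contour[(i + q) % len(contour)]  # P_{i+q}
    v1 = (xa - xi, ya - yi)
    v2 = (xb - xi, yb - yi)
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0:
        return 180.0
    cos_t = (v1[0] * v2[0] + v1[1] * v2[1]) / norm
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

def candidate_vertices(contour, q=20, th_l=60.0, th_h=150.0):
    # Non-thumb test: theta_i in [th_l, th_h] and P_i below both
    # neighbors (y grows downward), i.e., a convex vertex.
    tips = []
    for i in range(len(contour)):
        xi, yi = contour[i]
        ya = contour[i - q][1]
        yb = contour[(i + q) % len(contour)][1]
        if th_l <= angle_at(contour, i, q) <= th_h and yi > ya and yi > yb:
            tips.append((xi, yi))
    return tips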
4.3 Keystroke Detection and Localization

After detecting the fingertips, we track them to detect a possible keystroke and locate it for text entry. The keystroke is usually correlated with one or two fingertips; therefore, we first select the candidate fingertip that has a high probability of pressing a key, instead of detecting all fingertips, to reduce the computation overhead. Then, we track the candidate fingertip to detect the possible keystroke. Finally, we correlate the candidate fingertip with the pressed key to locate the keystroke.

4.3.1 Candidate fingertip selection in each hand

CamK allows the user to use all of their fingers for text entry; thus, the keystroke may come from the left or right hand. Based on the observations (see Section 3.1), the fingertip (i.e., StrokeTip) pressing the key usually has the largest vertical coordinate in that hand, such as finger 9 in Fig. 2(a). Therefore, we first select the candidate fingertip with the largest vertical coordinate in each hand. We use Cl and Cr to represent the points located on the contours of the left hand and the right hand, respectively. For a point Pl(xl, yl) ∈ Cl, if Pl satisfies yl ≥ yj (∀Pj(xj, yj) ∈ Cl, j ≠ l), then Pl is selected as the candidate fingertip of the left hand. Similarly, we can get the candidate fingertip Pr(xr, yr) of the right hand. In this step, we only need to get Pl and Pr, instead of detecting all fingertips.

4.3.2 Keystroke detection based on fingertip tracking

As described in the observations, when the user presses a key, the fingertip stays on that key for a certain duration. Therefore, we can use the location variation of the candidate fingertip to detect a possible keystroke. In frame i, we use Pli(xli, yli) and Pri(xri, yri) to represent the candidate fingertips of the left hand and right hand, respectively. If the candidate fingertips in frames [i − 1, i] satisfy Eq. (2) for the left hand or Eq. (3) for the right hand, the corresponding fingertip is treated as static, i.e., a keystroke has probably happened. Based on extensive experiments, we set ∆r = 5 empirically.

√((xli − xli−1)² + (yli − yli−1)²) ≤ ∆r,  (2)

√((xri − xri−1)² + (yri − yri−1)²) ≤ ∆r.  (3)

4.3.3 Keystroke localization by correlating the fingertip with the pressed key

After detecting a possible keystroke, we correlate the candidate fingertip and the pressed key to locate the keystroke, based on the observations of Section 3.1. In regard to the candidate fingertips, we treat the thumb as a special case, and also select it as a candidate fingertip at