CamK: Camera-Based Keystroke Detection and Localization for Small Mobile Devices
Yafeng Yin, Member, IEEE, Qun Li, Fellow, IEEE, Lei Xie, Member, IEEE, Shanhe Yi, Ed Novak, and Sanglu Lu, Member, IEEE
Abstract—Because of the smaller size of mobile devices, text entry with on-screen keyboards becomes inefficient. Therefore, we present CamK, a camera-based text-entry method, which can use a panel (e.g., a piece of paper) with a keyboard layout to input text into small devices. With the built-in camera of the mobile device, CamK captures images during the typing process and utilizes image processing techniques to recognize the typing behavior, i.e., extract the keys, track the user's fingertips, and detect and locate keystrokes. To achieve high accuracy of keystroke localization and a low false positive rate of keystroke detection, CamK introduces initial training and online calibration. To reduce the time latency, CamK optimizes computation-intensive modules by changing image sizes, focusing on target areas, introducing multiple threads, and removing the operations of writing or reading images. Finally, we implement CamK on mobile devices running Android. Our experimental results show that CamK can achieve above 95 percent accuracy in keystroke localization, with only a 4.8 percent false positive rate. When compared with on-screen keyboards, CamK can achieve a 1.25X typing speedup for regular text input and 2.5X for random character input. In addition, we introduce word prediction to further improve the input speed for regular text by 13.4 percent.
Index Terms—Mobile text-entry, camera, keystroke detection and localization, small mobile devices
1 INTRODUCTION
In recent years, we have witnessed a rapid development of electronic devices and mobile technology. Mobile devices (e.g., smartphones, the Apple Watch) have become smaller and smaller, in order to be carried everywhere easily, while avoiding carrying bulky laptops all the time.
However, the small size of the mobile device brings many new challenges; a typical example is inputting text into the small mobile device without a physical keyboard. In order to get rid of the constraint of bulky physical keyboards, many virtual keyboards have been proposed, e.g., wearable keyboards, on-screen keyboards, projection keyboards, etc. However, wearable keyboards introduce additional equipment like rings [1] and gloves [2]. On-screen keyboards [3], [4] usually take up a large area on the screen and support only a single finger for text entry. Typing on a small screen becomes inefficient. Projection keyboards [5], [6] often need a visible light projector or lasers to display the keyboard. To remove the additional hardware, audio-signal [7] and camera-based virtual keyboards [8], [9] have been proposed. However, UbiK [7] requires the user to click keys with their fingertips and nails, while the existing camera-based keyboards either slow the typing speed [8] or must be used in controlled environments [9]. The existing schemes thus fail to provide a user experience similar to that of physical keyboards. To provide a PC-like text-entry experience, we propose CamK, a camera-based keyboard and a more natural and intuitive text-entry method. As shown in Fig. 1, CamK works with the front-facing camera of the mobile device and a paper keyboard. CamK takes pictures as the user types on the paper keyboard, and uses image processing techniques to detect and locate keystrokes. Then, CamK outputs the corresponding character of the pressed key. CamK can be used in a wide variety of scenarios, e.g., the office, coffee shops, outdoors, etc. However, to make CamK work well, we need to solve the following key technical challenges. (1) Location Deviation: On a paper keyboard, the inter-key distance is only about two centimeters [7]. With image processing techniques, there may be a position deviation between the real fingertip and the detected fingertip.
This deviation may lead to localization errors of keystrokes. To address this challenge, CamK introduces initial training to get the optimal parameters for image processing. Then, CamK uses an extended region to represent the detected fingertip, to tolerate the position deviation. Besides, CamK utilizes the features of a keystroke (e.g., the fingertip stays in the key for a certain duration, the pressed key is partially obstructed by the fingertip, etc.) to verify the validity of a keystroke. (2) False Positives: A false positive occurs when a non-keystroke (i.e., a period in which no fingertip is pressing any key) is recognized as a keystroke. Without the assistance of
Y. Yin, L. Xie, and S. Lu are with the State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China. E-mail: {yafeng, lxie, sanglu}@nju.edu.cn. Q. Li and S. Yi are with the Department of Computer Science, College of William and Mary, Williamsburg, VA 23187. E-mail: {liqun, syi}@cs.wm.edu. E. Novak is with the Computer Science Department, Franklin and Marshall College, Lancaster, PA 17604. E-mail: enovak@fandm.edu. Manuscript received 3 Feb. 2017; revised 24 Dec. 2017; accepted 15 Jan. 2018. Date of publication 25 Jan. 2018; date of current version 29 Aug. 2018. (Corresponding author: Lei Xie.) For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TMC.2018.2798635. IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 17, NO. 10, OCTOBER 2018. 1536-1233 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
other resources like audio signals, CamK should detect keystrokes using only images. To address this challenge, CamK combines keystroke detection with keystroke localization. For a potential keystroke, if there is no valid key pressed by the fingertip, CamK will discard the keystroke and recognize it as a non-keystroke. Additionally, CamK introduces online calibration, i.e., using the movement features of the fingertip after a keystroke, to further decrease the false positive rate.
(3) Processing Latency: To serve as a text-entry method, when the user presses a key on the paper keyboard, CamK should output the character of the key without any noticeable latency. However, due to the limited computing resources of small mobile devices, the heavy computation overhead of image processing will lead to a large latency. To address this challenge, CamK optimizes the computation-intensive modules by adaptively changing image sizes, focusing on the target area in the large-size image, adopting multiple threads, and removing the operations of writing/reading images.
We make the following contributions in this paper (a preliminary version of this work appeared in [10]).
- We design a practical framework for CamK, which operates using a smart mobile device camera and a portable paper keyboard. Based on image processing, CamK can detect and locate the keystroke with high accuracy and a low false positive rate.
- We realize real-time text entry for small mobile devices with limited resources by optimizing the computation-intensive modules. Additionally, we introduce word prediction to further improve the input speed and reduce the error rate.
- We implement CamK on smartphones running Android. We first evaluate each module in CamK. Then, we conduct extensive experiments to test the performance of CamK. After that, we compare CamK with other methods in input speed and error rate.
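The latency optimizations in challenge (3), shrinking image sizes and focusing on the target area of the large image, can be sketched as follows. This is a minimal illustration rather than CamK's implementation: the frame contents, the downsampling factor, and the ROI size are all hypothetical.

```python
import numpy as np

def downsample(frame, factor):
    """Cheaply shrink a grayscale frame by keeping every factor-th pixel."""
    return frame[::factor, ::factor]

def crop_roi(frame, center_yx, half):
    """Crop a square region of interest around a point, clamped to the frame."""
    y, x = center_yx
    h, w = frame.shape
    return frame[max(0, y - half):min(h, y + half),
                 max(0, x - half):min(w, x + half)]

# Hypothetical 480x640 grayscale frame with one bright blob (a stand-in fingertip).
frame = np.zeros((480, 640), dtype=np.uint8)
frame[300:310, 200:210] = 255

# Step 1: locate the blob coarsely on a 4x-downsampled copy (16x fewer pixels).
small = downsample(frame, 4)
cy, cx = np.unravel_index(np.argmax(small), small.shape)

# Step 2: map the coarse hit back to full resolution and crop a small ROI,
# so the expensive processing only touches the target area of the large image.
roi = crop_roi(frame, (cy * 4, cx * 4), half=40)
print(small.shape, roi.shape, int(roi.max()))
```

The cheap pass over the small image bounds where the costly per-pixel work must run, which is the essence of the size-changing and target-area optimizations.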
2 RELATED WORK
Considering the small sizes of mobile devices, a lot of virtual keyboards have been proposed for text entry, e.g., wearable keyboards, on-screen keyboards, projection keyboards, camera-based keyboards, etc.
Wearable Keyboards. Wearable keyboards sense and recognize the typing behavior based on the sensors built into rings [1], [11], gloves [12], and so on. TypingRing [13] utilizes the embedded sensors of the ring to input text. Finger-Joint keypad [14] works with a glove equipped with pressure sensors. The Senseboard [2] consists of two rubber pads and senses the movements in the palm to get keystrokes. Funk et al. [15] utilize a touch-sensitive wristband to enter text based on the location of the touch. These wearable keyboards often need the user to wear devices around the hands or fingers, thus degrading the user experience.
On-Screen Keyboards. On-screen keyboards allow the user to enter characters on a touch screen. Considering the limited area of the keyboard on the screen, BigKey [3] and ZoomBoard [4] adaptively change the size of keys. ContextType [16] leverages hand postures to improve mobile touch screen text entry. Kwon et al. [17] introduce a regional error correction method to reduce the number of necessary touches. ShapeWriter [18] recognizes a word based on the trace over successive letters in the word. Sandwich keyboard [19] affords ten-finger touch typing by utilizing a touch sensor on the back side of a device. Usually, on-screen keyboards occupy the screen area and support only one finger for typing. Besides, the user often needs to switch between different screens to type letters, digits, punctuation, etc.
Projection Keyboards. Projection keyboards usually need a visible light projector or lasers to cast a keyboard, and then utilize image processing methods [5] or infrared light [6] to detect the typing events. Hu et al.
use a pico-projector to project the keyboard on the table, and then detect the touch interaction through the distortion of the keyboard projection [20]. Roeber et al. utilize a pattern projector to display the keyboard layout on a flat surface, and then detect keyboard events based on the intersection of fingers and infrared light [21]. Projection keyboards often require extra equipment, e.g., a visible light projector, infrared light modules, etc. The extra equipment increases the cost and the inconvenience of text entry.
Camera-Based Keyboards. Camera-based virtual keyboards use captured images [22] or video [23] to infer the keystroke. Gesture keyboard [22] gets the input by recognizing the finger's gesture. It works without a keyboard layout, thus the user needs to remember the mapping between the keys and the finger's gestures. Visual Panel [8] works with a keyboard printed on a piece of paper. It requires the user to use only one finger and to wait for one second before each keystroke. Malik et al. present the Visual Touchpad [24] to track the 3D positions of the fingertips based on two downward-pointing cameras and stereo vision. Adajania et al. [9] detect the keystroke based on shadow analysis with a standard web camera. Hagara et al. estimate the finger positions and detect clicking events based on edge detection, fingertip localization, etc. [25]. The iPhone app paper keyboard [26] only allows the user to use one finger to input letters. The above research work usually focuses on detecting and tracking the fingertips, instead of locating the fingertip in a key's area of the keyboard, which is what we study in this paper. In addition to the above text-entry solutions, MacKenzie et al. [27] describe text entry for mobile computing. Zhang et al. [28] propose Okuli to locate the user's finger based on visible light communication modules, LEDs, and light sensors. Wang et al.
[7] propose UbiK to locate the keystroke based on audio signals. The existing work usually needs extra equipment, allows only one finger to type, or requires changing the user's typing behavior, making it difficult to provide a PC-like text-entry experience. In this paper, we propose a text-entry method based on the built-in camera of the mobile device and a paper keyboard, to provide a user experience similar to that of physical keyboards. Fig. 1. A typical use case of CamK.
3 FEASIBILITY STUDY AND OVERVIEW OF CAMK
In order to show the feasibility of locating keystrokes based on image processing techniques, we first describe the observations of a keystroke from the camera's view. After that, we describe the system overview of CamK.
3.1 Observations of a Keystroke
In Fig. 2, we show the frames/images captured by the camera during two consecutive keystrokes. The origin of the axes is located in the top left corner of the image, as shown in Fig. 2a. The hand located in the left area of the image is called the left hand, while the other is called the right hand, as shown in Fig. 2b. From left to right, the fingers are called finger i in sequence, i ∈ [1, 10], as shown in Fig. 2c. The fingertip pressing the key is called the StrokeTip, while the pressed key is called the StrokeKey, as shown in Fig. 2d. When the user presses a key, i.e., a keystroke occurs, the StrokeTip and StrokeKey often have the following features, which can be used to track, detect, and locate the keystroke.
(1) Coordinate position: The StrokeTip usually has the largest vertical coordinate among the fingers on the same hand, because the user tends to stretch out one finger when typing a key. An example is finger 9 in Fig. 2a. Considering the particularity of thumbs, this feature may not be suitable for them. Therefore, we detect the StrokeTip separately for thumbs and the other fingertips.
(2) Moving state: The StrokeTip stays on the StrokeKey for a certain duration in a typing operation, as with finger 2 in Figs. 2c and 2d. If the position of the fingertip remains unchanged, a keystroke may have happened.
(3) Correlated location: The StrokeTip is located in the StrokeKey, in order to press that key, such as finger 9 in Fig. 2a and finger 2 in Fig. 2d.
(4) Obstructed view: The StrokeTip obstructs the StrokeKey from the view of the camera, as shown in Fig. 2d.
The ratio of the visually obstructed area to the whole area of the key can be used to verify whether the key is really pressed.
(5) Relative distance: The StrokeTip usually achieves the largest vertical distance between the fingertip and the remaining fingertips of the same hand, because the user usually stretches out the finger to press a key. Thus this feature can be used to infer which hand generates the keystroke. In Fig. 2a, the vertical distance d_r between the StrokeTip (i.e., finger 9) and the remaining fingertips of the right hand is larger than that (d_l) of the left hand. Thus we choose finger 9 as the StrokeTip from the two hands, instead of finger 2.
3.2 System Overview
As shown in Fig. 1, CamK works with a mobile device and a paper keyboard. The device uses the front-facing camera to capture the typing process, while the paper keyboard is placed on a flat surface and located in the camera's view. We take Fig. 1 as an example to describe the deployment. In Fig. 1, the mobile device is a Samsung N9109W smartphone, l denotes the distance between the device and the printed keyboard, and α denotes the angle between the plane of the device's screen and that of the keyboard. In Fig. 1, we set l = 13.5 cm and α = 90° to make the letter keys large enough in the camera's view. In fact, there are no strict requirements on the values of these parameters, especially since the position of the camera varies across devices. In Fig. 1, when we fix the A4-sized paper keyboard, l can range in [13.5 cm, 18.0 cm], while α can range in [78.8°, 90.0°]. Even if some part of the keyboard is out of the camera's view, CamK still works.
The architecture of CamK is shown in Fig. 3. The input is the image taken by the camera and the output is the character of the pressed key. Before a user begins typing, CamK uses Key Extraction to detect the keyboard and extract each key from the image. When the user types, CamK uses Fingertip Detection to extract the user's hands and detect their fingertips.
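Features (1) and (5) of Section 3.1 can be combined into a simple rule for selecting the StrokeTip across both hands. The sketch below is illustrative only: the fingertip coordinates are hypothetical, thumbs (handled separately in CamK) are ignored, and the vertical gap is simplified to the distance between the candidate tip and the nearest remaining fingertip.

```python
def candidate_tip(hand):
    """Feature (1): the fingertip with the largest vertical coordinate.
    The image origin is at the top-left, so a larger y means the fingertip
    is stretched farther down toward the pressed key."""
    return max(hand, key=lambda p: p[1])

def vertical_gap(hand):
    """Feature (5), simplified: vertical distance between the candidate tip
    and the nearest of the remaining fingertips of the same hand."""
    ys = sorted((p[1] for p in hand), reverse=True)
    return ys[0] - ys[1] if len(ys) > 1 else ys[0]

def select_stroke_tip(left_hand, right_hand):
    """Pick the hand with the larger vertical gap, then its candidate tip."""
    if vertical_gap(left_hand) >= vertical_gap(right_hand):
        return candidate_tip(left_hand)
    return candidate_tip(right_hand)

# Hypothetical (x, y) fingertip positions: fingers 1-5 (left), 6-10 (right);
# the fourth right-hand fingertip (finger 9) is stretched out to press a key.
left = [(50, 200), (80, 215), (110, 205), (140, 210), (170, 190)]
right = [(300, 205), (330, 210), (360, 208), (390, 260), (420, 200)]

print(select_stroke_tip(left, right))   # (390, 260), i.e., finger 9
```

In CamK, such a candidate would still have to pass the correlated-location and obstructed-view checks (features (3) and (4)) before being accepted as a keystroke.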
Based on the movements of the fingertips, CamK uses Keystroke Detection and Localization to detect a possible keystroke and locate it. Finally, CamK uses Text Output to output the character of the pressed key.

4 SYSTEM DESIGN

According to Fig. 3, CamK consists of four components: key extraction, fingertip detection, keystroke detection and localization, and text output. Since text output is straightforward to implement, we mainly describe the first three components.

Fig. 2. Frames during two consecutive keystrokes. Fig. 3. Architecture of CamK.
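The four-component flow of Fig. 3 can be summarized with the following skeleton. This is a structural sketch only: every helper name here is a hypothetical placeholder (stubbed with toy behavior), not CamK's actual API.

```python
# Structural sketch of the Fig. 3 pipeline. All helpers are hypothetical
# placeholders with toy stub behavior, not CamK's actual implementation.

def extract_keys(frame):
    # Key Extraction: keyboard detection + key segmentation (stub).
    # Maps each character to a bounding box (x0, y0, x1, y1).
    return {"A": (0, 0, 10, 10)}

def detect_fingertips(frame):
    # Fingertip Detection: hand segmentation + fingertip discovery (stub).
    return [(5, 5)]

def locate_keystroke(tips, keys):
    # Keystroke Detection and Localization: which key, if any, is pressed (stub).
    for (x, y) in tips:
        for ch, (x0, y0, x1, y1) in keys.items():
            if x0 <= x <= x1 and y0 <= y <= y1:
                return ch
    return None

def camk_step(frame, keys):
    # Text Output: emit the character of the pressed key (or None).
    return locate_keystroke(detect_fingertips(frame), keys)

keys = extract_keys(None)     # done once, before the user starts typing
char = camk_step(None, keys)  # then run per captured frame
```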
4.1 Key Extraction

Without loss of generality, CamK adopts the common QWERTY keyboard layout, which is printed in black and white on a piece of paper, as shown in Fig. 1. In order to eliminate the effects of the background, we first detect the boundary of the keyboard. Then, we extract each key from the keyboard. Therefore, key extraction contains three parts: keyboard detection, key segmentation, and mapping the characters to the keys, as shown in Fig. 3.

4.1.1 Keyboard Detection

We use the Canny edge detection algorithm [29] to obtain the edges of the keyboard. Fig. 4b shows the edge detection result for Fig. 4a. However, interference edges (e.g., the paper's edge, the longest edge in Fig. 4b) should be removed. Based on Fig. 4b, the edges of the keyboard should be close to the edges of keys. We use this feature to remove pitfall edges; the result is shown in Fig. 4c. Additionally, we adopt the dilation operation [30] to join dispersed edge points which are close to each other, to get better edges/boundaries of the keyboard. After that, we use the Hough transform [8] to detect the lines in Fig. 4c. Then, we use the uppermost line and the bottom line to describe the position range of the keyboard, as shown in Fig. 4d. Similarly, we can use the Hough transform [8] to detect the left/right edge of the keyboard. If no suitable edges are detected by the Hough transform, it is usually because the keyboard is not perfectly located in the camera's view. In this case, we simply use the left/right boundary of the image to represent the left/right edge of the keyboard. As shown in Fig. 4e, we extend the four edges (lines) to get four intersections B_1(x_1, y_1), B_2(x_2, y_2), B_3(x_3, y_3), B_4(x_4, y_4), which are used to describe the boundary of the keyboard.

4.1.2 Key Segmentation

Considering the short interference edges generated by the edge detection algorithm, it is difficult to accurately segment each key from the keyboard with the detected edges.
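The corner computation above, i.e., extending the four detected boundary lines and intersecting them to obtain B_1, ..., B_4, reduces to solving a 2x2 linear system per corner. Below is a minimal pure-Python sketch; the sample coordinates are illustrative, and real inputs would be the line endpoints returned by the Hough transform.

```python
def line_through(p, q):
    """Coefficients (a, b, c) of the line a*x + b*y = c through points p and q."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    return a, b, a * x1 + b * y1

def intersection(l1, l2):
    """Intersection point of two lines given as (a, b, c); None if parallel."""
    a1, b1, c1 = l1
    a2, b2, c2 = l2
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None  # parallel lines: no unique intersection
    return ((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det)

# Corner B_1 as the intersection of the (extended) top and left boundary lines.
top = line_through((0, 10), (100, 12))  # nearly horizontal upper edge (sample)
left = line_through((5, 0), (4, 80))    # nearly vertical left edge (sample)
B1 = intersection(top, left)
```

Because the lines are represented implicitly, the intersection lies on the infinite extension of each detected segment, which is exactly the "extend the four edges" step.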
Consequently, we utilize the color difference between the white keys and the black background, together with the area of a key, for key segmentation, to reduce the effect of pitfall areas.

First, we introduce color segmentation to distinguish the white keys from the black background. For convenience of image processing, we represent the color in YCrCb space. In YCrCb space, the color coordinate (Y, Cr, Cb) of a white pixel is (255, 128, 128), while that of a black pixel is (0, 128, 128). Thus, we only compute the difference in the Y value between pixels to distinguish the white keys from the black background. If a pixel is located in the keyboard and satisfies 255 − ε_y ≤ Y ≤ 255, the pixel belongs to a key. The offset ε_y ∈ N of Y is mainly caused by light conditions. ε_y can be estimated in the initial training (see Section 5.1). The initial/default value of ε_y is 50.

When we obtain the white pixels, we need to get the contours of keys and separate the keys from one another. To avoid pitfall areas, such as small white areas which do not belong to any key, we introduce the area of a key. Based on Fig. 4e, we first use B_1, B_2, B_3, B_4 to calculate the area S_b of the keyboard as S_b = (1/2) · (|B_1B_2 × B_1B_4| + |B_3B_4 × B_3B_2|), where B_iB_j denotes the vector from B_i to B_j. Then, we calculate the area of each key. We use N to represent the number of keys in the keyboard. Considering the size difference between keys, we treat larger keys (e.g., the space key) as multiple regular keys (e.g., A-Z, 0-9). For example, the space key is treated as five regular keys. In this way, we change N to N_avg. Then, we can estimate the average area of a regular key as S_b/N_avg. In addition to the size difference between keys, the camera's view can also affect the area of a key in the image. Therefore, we introduce a_l, a_h to describe the range of a valid area S_k of a key as S_k ∈ [a_l · S_b/N_avg, a_h · S_b/N_avg]. We set a_l = 0.15, a_h = 5 in CamK, based on extensive experiments. The key segmentation result for Fig. 4e is shown in Fig. 4f.
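The area formula and the validity range above can be sketched directly in code. This is a minimal sketch assuming the corners B_1, ..., B_4 are given in order around the quadrilateral; the sample values below are illustrative, not from the paper.

```python
def cross(o, a, b):
    """Signed 2D cross product of vectors OA and OB (parallelogram area)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def keyboard_area(b1, b2, b3, b4):
    """S_b = 1/2 (|B1B2 x B1B4| + |B3B4 x B3B2|): area of the keyboard
    quadrilateral, split into two triangles at the diagonal B2-B4."""
    return 0.5 * (abs(cross(b1, b2, b4)) + abs(cross(b3, b4, b2)))

def is_valid_key_area(s_k, s_b, n_avg, a_l=0.15, a_h=5.0):
    """Accept a candidate key region only if its area lies within
    [a_l * S_b / N_avg, a_h * S_b / N_avg]."""
    avg = s_b / n_avg
    return a_l * avg <= s_k <= a_h * avg
```

Tiny white blobs fall below the a_l bound and merged multi-key regions exceed the a_h bound, so both kinds of pitfall areas are rejected by the same test.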
Then, we use the location of the space key (the biggest key) to locate the other keys, based on the relative locations between keys.

4.2 Fingertip Detection

After extracting the keys, we need to track the fingertips to detect and locate the keystrokes. To achieve this goal, we first detect the fingertips with hand segmentation and fingertip discovery, as described below.

4.2.1 Hand Segmentation

Skin segmentation [30] is often used for hand segmentation. In the YCrCb color space, a pixel (Y, Cr, Cb) is determined to be a skin pixel if it satisfies Cr ∈ [133, 173] and Cb ∈ [77, 127]. However, the threshold values of Cr and Cb can be affected by the surroundings, such as lighting conditions. It is difficult to choose suitable threshold values for Cr and Cb. Therefore, we combine Otsu's method [31] with the red channel of the YCrCb color space for skin segmentation. In the YCrCb color space, the red channel Cr is essential to human skin color. Therefore, given a captured image, we use the grayscale image split from the Cr channel as the input for Otsu's method [31]. Otsu's method automatically performs clustering-based image thresholding, i.e., it calculates the optimal threshold to separate the foreground and background. The hand segmentation result of Fig. 5a is shown in Fig. 5b, where the white regions represent the hand regions with high values in the Cr channel, while the black regions represent the background. However, around the hands there exist some interference regions, which may change the contours of fingers, resulting in detecting wrong

Fig. 4. Keyboard detection and key extraction: (a) an input image; (b) edge detection; (c) edge optimization; (d) keyboard range; (e) keyboard boundary; (f) key segmentation.
fingertips. Thus, CamK introduces the following erosion and dilation operations [32]. We first use the erosion operation to isolate the hands from the keys and separate each finger. Then, we use the dilation operation to smooth the edges of the fingers. Fig. 5c shows the optimized result of hand segmentation. After that, we select the top two segmented areas as the hand regions, i.e., the left hand and the right hand, to further reduce the effect of interference regions, such as the red areas in Fig. 5c.

4.2.2 Fingertip Discovery

After we extract the fingers, we detect the fingertips. We can differentiate between the thumbs (i.e., fingers 5-6 in Fig. 2c) and non-thumbs (i.e., fingers 1-4 and 7-10 in Fig. 2c) by shape and typing movement, as shown in Fig. 6.

In a non-thumb, the fingertip is usually a convex vertex, as shown in Fig. 6a. For a point P_i(x_i, y_i) located on the contour of a hand, by tracing the contour, we can select the point P_{i-q}(x_{i-q}, y_{i-q}) before P_i and the point P_{i+q}(x_{i+q}, y_{i+q}) after P_i. Here, i, q ∈ N. We calculate the angle θ_i between the two vectors P_iP_{i-q} and P_iP_{i+q}, according to Eq. (1). In order to simplify the calculation of θ_i, we map θ_i into the range θ_i ∈ [0°, 180°]. If θ_i ∈ [θ_l, θ_h], θ_l < θ_h, we call P_i a candidate vertex. Considering the relative locations of the points, P_i should also satisfy y_i > y_{i-q} and y_i > y_{i+q}. Otherwise, P_i will not be a candidate vertex. If there are multiple candidate vertexes, such as P'_i in Fig. 6a, we choose the vertex having the largest vertical coordinate, because it has a higher probability of being a fingertip, as P_i shown in Fig. 6a. Here, the largest vertical coordinate means the local maximum of a finger's contour, such as the red circle shown in Fig. 5e. The range of a finger's contour can be limited by Eq. (1), i.e., the angle feature of a finger. Based on extensive experiments, we set θ_l = 60°, θ_h = 150°, q = 20 in this paper:

θ_i = arccos( (P_iP_{i-q} · P_iP_{i+q}) / (|P_iP_{i-q}| · |P_iP_{i+q}|) ).   (1)

In a thumb, the "fingertip" also means a convex vertex of the finger. Thus we still use Eq. (1) to represent the shape of the fingertip of a thumb. However, the position of the convex vertex can be different from that of a non-thumb. As shown in Fig. 6b, the relative positions of P_{i-q}, P_i, P_{i+q} are different from those in Fig. 6a. In Fig. 6b, we show the thumb of the left hand. Obviously, P_{i-q}, P_i, P_{i+q} do not satisfy y_i > y_{i-q} and y_i > y_{i+q}. Therefore, we use (x_i − x_{i-q}) · (x_i − x_{i+q}) > 0 to describe the relative locations of P_{i-q}, P_i, P_{i+q} for thumbs. Then, we choose the vertex with the largest vertical coordinate in a finger's contour as the fingertip, as mentioned in the last paragraph.

In fingertip detection, we only need to detect the points located on the bottom edge (from the leftmost point to the rightmost point) of the hand, such as the blue contour of the right hand in Fig. 5d. The shape feature θ_i and the vertical coordinates y_i along the bottom edge are shown in Fig. 5e. If we can detect five fingertips in a hand with θ_i and y_{i-q}, y_i, y_{i+q}, we assume that we have also found the thumb. At this time, the thumb presses a key like a non-thumb. Otherwise, we detect the fingertip of the thumb in the rightmost area of the left hand or the leftmost area of the right hand according to θ_i and x_{i-q}, x_i, x_{i+q}. The detected fingertips of Fig. 5a are marked in Fig. 5f.

4.3 Keystroke Detection and Localization

After detecting the fingertips, we track them to detect a possible keystroke and locate it for text entry. The keystroke is usually correlated with one or two fingertips; therefore we first select the candidate fingertip having a high probability of pressing a key, instead of detecting all fingertips, to reduce the computation overhead. Then, we track the candidate fingertip to detect the possible keystroke. Finally, we correlate the candidate fingertip with the pressed key to locate the keystroke.
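The candidate-vertex test built on Eq. (1) can be sketched as follows. This is a minimal pure-Python sketch; the sample contour points are illustrative, and it assumes image coordinates in which y grows downward, so a pressed fingertip has the larger y.

```python
import math

def angle(p_prev, p, p_next):
    """Angle theta_i in [0, 180] degrees between the vectors from P_i to
    P_{i-q} and from P_i to P_{i+q}, as in Eq. (1)."""
    v1 = (p_prev[0] - p[0], p_prev[1] - p[1])
    v2 = (p_next[0] - p[0], p_next[1] - p[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    return math.degrees(math.acos(dot / norm))

def is_candidate_vertex(p_prev, p, p_next, theta_l=60.0, theta_h=150.0):
    """Non-thumb test: theta_i within [theta_l, theta_h] and P_i below both
    neighbors (y grows downward in image coordinates)."""
    th = angle(p_prev, p, p_next)
    return theta_l <= th <= theta_h and p[1] > p_prev[1] and p[1] > p_next[1]
```

For thumbs, the y-condition would be replaced by the sign test (x_i − x_{i-q}) · (x_i − x_{i+q}) > 0 described above, with the angle test unchanged.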
4.3.1 Candidate Fingertip Selection in Each Hand

CamK allows the user to use all of their fingers for text entry, thus a keystroke may come from the left or the right hand. Based on the observations (see Section 3.1), the fingertip pressing the key (i.e., the StrokeTip) usually has the largest vertical coordinate in that hand, such as finger 9 shown in Fig. 2a. Therefore, we first select the candidate fingertip with the largest vertical coordinate in each hand. We use C_l and C_r to represent the points located on the contour of the left hand and the right hand, respectively. For a point P_l(x_l, y_l) ∈ C_l, if P_l satisfies y_l ≥ y_j (∀ P_j(x_j, y_j) ∈ C_l, j ≠ l), then P_l is selected as the candidate fingertip of the left hand. Similarly, we can get the candidate fingertip P_r(x_r, y_r) of the right hand. In this step, we only need to get P_l and P_r, instead of detecting all fingertips.

4.3.2 Keystroke Detection Based on Fingertip Tracking

As described in the observations, when the user presses a key, the fingertip stays at that key for a certain duration.

Fig. 5. Fingertip detection: (a) an input image; (b) hand segmentation; (c) optimization; (d) fingers' contour; (e) fingertip discovery; (f) fingertips. Fig. 6. Features of a fingertip: (a) fingertips (non-thumbs); (b) a thumb.
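The per-hand selection of Section 4.3.1, combined with the relative-distance heuristic of observation (5), can be sketched as follows. This is a minimal sketch: the contour values are illustrative, and comparing against the mean of the remaining fingertips is our assumption, not a detail specified by CamK.

```python
def candidate_fingertip(contour):
    """Contour point with the largest vertical coordinate (image y grows
    downward, so this is the lowest point: the likely StrokeTip of a hand)."""
    return max(contour, key=lambda p: p[1])

def hand_relative_distance(tips):
    """Vertical distance between a hand's candidate fingertip and the mean of
    the remaining fingertips (the averaging choice is an assumption)."""
    ys = sorted((y for _, y in tips), reverse=True)
    return ys[0] - sum(ys[1:]) / len(ys[1:])

def choose_stroke_hand(left_tips, right_tips):
    """Pick the hand whose candidate fingertip stands out most, per
    observation (5): the stretched-out finger generates the keystroke."""
    d_l = hand_relative_distance(left_tips)
    d_r = hand_relative_distance(right_tips)
    return "left" if d_l > d_r else "right"
```

Note that `candidate_fingertip` only scans the contour for a maximum, which is why P_l and P_r can be found without running the full fingertip-discovery step on every frame.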