Depth Aware Finger Tapping on Virtual Displays

Ke Sun†, Wei Wang†, Alex X. Liu†‡, Haipeng Dai†
†State Key Laboratory for Novel Software Technology, Nanjing University, China
‡Dept. of Computer Science and Engineering, Michigan State University, U.S.A.
kesun@smail.nju.edu.cn, ww@nju.edu.cn, alexliu@cse.msu.edu, haipengdai@nju.edu.cn

ABSTRACT
For AR/VR systems, tapping-in-the-air is a user-friendly solution for interactions. Most prior in-air tapping schemes use customized depth-cameras and therefore have the limitations of low accuracy and high latency. In this paper, we propose a fine-grained depth-aware tapping scheme that provides high-accuracy tapping detection. Our basic idea is to use light-weight ultrasound based sensing, along with one COTS mono-camera, to enable 3D tracking of the user's fingers. The mono-camera tracks the user's fingers in the 2D space, and ultrasound based sensing provides the depth information of the user's fingers in the 3D space. Using speakers and microphones that already exist on most AR/VR devices, we emit ultrasound, which is inaudible to humans, and capture the signal reflected by the finger with the microphone. From the phase changes of the ultrasound signal, we accurately measure small finger movements in the depth direction. With fast and light-weight ultrasound signal processing algorithms, our scheme can accurately track finger movements and measure the bending angle of the finger between two video frames. In our experiments on eight users, our scheme achieves a 98.4% finger tapping detection accuracy with an FPR of 1.6% and an FNR of 1.4%, and a detection latency of 17.69ms, which is 57.7ms less than video-only schemes. The power consumption overhead of our scheme is 48.4% more than video-only schemes.

CCS CONCEPTS
• Human-centered computing → Interface design prototyping; Gestural input;

KEYWORDS
Depth aware, Finger tapping, Ultrasound, Computer Vision

ACM Reference Format:
Ke Sun†, Wei Wang†, Alex X. Liu†‡, Haipeng Dai†. 2018. Depth Aware Finger Tapping on Virtual Displays. In Proceedings of MobiSys'18. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3210240.3210315

(a) Virtual keypad (b) Cardboard VR setup
Figure 1: Tapping in the air on virtual displays

1 INTRODUCTION
In this paper, we consider measuring the movement depth of in-air tapping gestures on virtual displays. Tapping, which means selecting an object or confirming an action, is a basic Human Computer Interaction (HCI) mechanism for computing devices. Traditional tapping-based interaction schemes require physical devices such as keyboards, joysticks, mice, and touch screens.
These physical devices are inconvenient for interacting on virtual displays because users need to hold them during the interaction with the AR/VR system, which limits the freedom of user hands in interacting with other virtual objects on the display. For AR/VR systems, tapping-in-the-air is a user-friendly solution for interactions. In such schemes, users can input text, open apps, select and size items, and drag and drop holograms on virtual displays, as shown in Figure 1. Tapping-in-the-air mechanisms enrich the user experience in AR/VR as user hands are free to interact with other real and virtual objects. Furthermore, fine-grained bending angle measurements of in-air tapping gestures provide different levels of feedback, which compensates for the lack of haptic feedback.

Most prior in-air tapping based schemes on virtual displays use customized depth-cameras and therefore have the limitations of low accuracy and high latency. First, most depth-cameras provide depth measurement with centimeter-level accuracy [17, 41], which is inadequate for tapping-in-the-air because tapping gestures often involve small finger movements in the depth direction, depending on the finger length and the bending angle of the fingers [12]. This explains why such schemes often require users to perform finger movements of several inches, such as touching the index finger with the thumb, to perform a click [22], which leads to much lower tapping speed and low key localization accuracy. Second, the latency of camera based gesture schemes is limited by their frame rate and their high computational requirements. Due to the lack of haptic feedback, interactions with virtual objects are different from interactions with physical keypads, and they solely rely on visual feedback [24]. Visual feedback with a latency of more than 100ms is noticeable to
users and degrades user experience [23]; however, it is challenging to provide visual feedback within 100ms, because vision-based schemes require a series of high-latency operations, such as capturing the video signal, recognizing gestures using computer vision algorithms, and rendering the virtual object on the display. While high-end cameras on smartphones can now provide high speed video capture at more than 120 fps, the high computational costs still limit the processing to a low frame rate in real time, e.g., 15 fps [43]. This explains why commercial AR systems such as Leap Motion [25] rely on the computational power of a desktop and cannot be easily implemented on low-end mobile devices.

Figure 2: Comparison between video and audio streams (normalized I/Q amplitude over time in milliseconds)

In this paper, we propose a fine-grained depth-aware tapping scheme for AR/VR systems that allows users to tap in the air, as shown in Figure 1. Our basic idea is to use light-weight ultrasound based sensing, along with one Commercial Off-The-Shelf (COTS) mono-camera, to enable 3D tracking of users' fingers. To track fingers in the 2D space, the mono-camera is sufficient when combined with light-weight computer vision algorithms. To capture the depth information in the 3D space, however, the mono-camera is no longer sufficient. Prior vision-based schemes require extra cameras and complex computer vision algorithms to obtain the depth information [17, 41]. In this paper, we propose to use light-weight ultrasound based sensing to get the depth information. Using the speakers and microphones that already exist on most AR/VR devices, we emit an inaudible sound wave from the speaker and capture the signal reflected by the finger with the microphone. We first use the ultrasound information to detect that there exists a finger that performs the tapping down motion, and then use the vision information to distinguish which finger performs the tapping down motion. By measuring the phase changes of the ultrasound signals, we accurately measure fine-grained finger movements in the depth direction and estimate the bending angles of finger tappings. With fast and light-weight ultrasound signal processing algorithms, we can track finger movements within the gap between two video frames. Therefore, both detecting the finger tapping motion and updating the virtual objects on the virtual display can be achieved within one video-frame latency. This fast feedback is crucial for tapping-in-the-air as the system can immediately highlight the object that is being pressed on the user's display right after detecting a user tapping motion.

There are three challenges in implementing a fine-grained depth-aware tapping scheme. The first challenge is to achieve high recognition accuracy and fine-grained depth measurements for finger tappings. Using either the video or the ultrasound alone is not enough to achieve the desired detection accuracy. For the camera-based approach, the detection accuracy is limited by the low frame rate, where the tapping gesture is only captured in a few video frames. For the ultrasound-based approach, the detection accuracy is limited by the interference of finger movements, because it is difficult to tell whether the ultrasound phase change is caused by finger tapping or by lateral finger movements.
To address this challenge, we combine the ultrasound and the camera data to achieve higher tapping detection accuracy. We first detect the finger movements using the ultrasound signal. We then look back at the results of previously captured video frames to determine which finger is moving and the movement direction of that finger. Our joint finger tapping detection algorithm improves the detection accuracy for gentle finger tappings from 58.2% (camera-only) to 97.6%.

The second challenge is to achieve low-latency finger tapping detection. In our experiments, the average duration of finger tapping gestures is 354ms, where the tapping down (from the initial movement to "touching" the virtual key) lasts 152ms and the tapping up (moving back from the virtual key to the normal position) lasts 202ms. Therefore, a 30-fps camera captures fewer than 4 frames for the tapping down gesture in the worst case. However, the feedback should be provided to the user as soon as the finger "touches" the virtual key; otherwise, the user tends to move an extra distance on each tapping, which slows down the tapping process and worsens the user experience. To provide fast feedback, a system should detect finger movements during the tapping down stage. Accurately recognizing such detailed movements in just four video frames is challenging, while waiting for more video frames leads to higher feedback latency. To address this challenge, we use the ultrasound to capture the detailed movement information, as shown in Figure 2. We design a state machine to capture the different movement states of the user's fingers. As soon as the state machine enters the "tapping state", we analyze both the ultrasound signal and the captured video frames to provide a robust and prompt decision on the tapping event. Thus, our scheme can provide feedback at the precise moment of "touching", rather than waiting for more frames to see that the finger starts moving back.

The third challenge is to achieve affordable hardware and computational cost on mobile devices. Traditional depth-camera based approaches need dual cameras or extra time-of-flight depth sensors [2, 10]. Furthermore, the computer vision algorithms for 3D fingertip localization incur high computational costs. It is challenging to achieve 30 fps 3D finger localization, especially on mobile devices such as Head-Mounted Displays (HMDs) or mobile phones. To address this challenge, we use the speakers/microphones as the depth sensor and combine them with the 2D position information obtained from an ordinary mono-camera with light-weight computer vision algorithms. Thus, the 3D finger location can be measured using existing sensors on mobile devices with affordable computational costs.

We implemented and evaluated our scheme using commercial smartphones without any hardware modification. Compared to the video-only scheme, our scheme improves the detection accuracy for gentle finger tappings from 58.2% to 97.6% and reduces the detection latency by 57.7ms. Our scheme achieves 98.4% detection accuracy with an FPR of 1.6% and an FNR of 1.4%. Furthermore, the fine-grained bending angle measurements provided by our scheme enable new dimensions for 3D interaction, as shown by our case study. However, compared to a video-only solution, our system incurs a significant power consumption overhead of 48.4% on a Samsung Galaxy S5.
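The joint detection flow described above can be summarized as a small state machine: ultrasound phase changes trigger a candidate tap between video frames, and the recent frames are then consulted to confirm which finger moved. The following Python sketch is purely illustrative; the state names, the 3mm threshold, and the `moving_finger_in_recent_frames` helper are assumptions, not the actual design detailed in Section 6.

```python
from enum import Enum, auto

class TapState(Enum):
    IDLE = auto()
    TAPPING_DOWN = auto()
    TAPPING_UP = auto()

class JointTapDetector:
    """Illustrative sketch: ultrasound depth changes drive the state machine,
    recent video frames confirm which finger moved."""
    DOWN_MM = 3.0   # assumed minimum depth change treated as a tap-down

    def __init__(self, video_buffer):
        self.state = TapState.IDLE
        self.video_buffer = video_buffer   # ring buffer of recent fingertip positions

    def on_ultrasound_interval(self, depth_change_mm):
        """Called between video frames with the depth change estimated from phase."""
        if self.state == TapState.IDLE and depth_change_mm < -self.DOWN_MM:
            # finger moved toward the virtual key: confirm with recent video frames
            finger = self.video_buffer.moving_finger_in_recent_frames()
            if finger is not None:
                self.state = TapState.TAPPING_DOWN
                return ("tap", finger)     # fire feedback at "touch" time
        elif self.state == TapState.TAPPING_DOWN and depth_change_mm > self.DOWN_MM:
            self.state = TapState.TAPPING_UP
        elif self.state == TapState.TAPPING_UP and abs(depth_change_mm) < 0.5:
            self.state = TapState.IDLE
        return None
```

The point of the sketch is the division of labor: the fast ultrasound path decides when a tap happens, while the slower video path decides which finger and key it belongs to.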
System | Sensing methods | Sensors | Range | Depth accuracy | Interaction
Kinect v1 [21, 34] | Light Coding | IR projector & IR camera | 0.8 ∼ 4m | about 4cm | Human pose
Kinect v2 [21, 34] | Time of Flight | IR projector & IR camera | 0.5 ∼ 4.5m | about 1cm | Human pose
Leap Motion [25, 41] | Binocular camera | IR cameras & IR LEDs | 2.5 ∼ 60cm | about 0.7mm | Hand track and gesture
Hololens [22] | Time of Flight | IR projector & IR camera | 10 ∼ 60cm | about 1cm | Hand gesture and gaze
RealSense [15] | Light Coding | IR projector & IR camera | 20 ∼ 120cm | about 1cm | Hand track and gesture
Air+Touch [8] | Infrared image | IR projector & IR camera | 5 ∼ 20cm | about 1cm | Single finger gesture
Our scheme | Phase change | Microphone & mono-camera | 5 ∼ 60cm | 4.32mm | Hand track and gesture
Table 1: Existing interface schemes for augmented reality systems

2 RELATED WORK
Related work can be categorized into four classes: AR/VR gesture recognition, in-air tapping-based interaction on virtual displays, tapping based interaction for mobile devices, and device-free gesture recognition and tracking.

AR/VR Gesture Recognition: Most existing AR/VR devices use IR projectors/IR cameras to capture the depth information for gesture recognition based on structured light [2] or time of flight [10], as shown in Table 1. Structured light has been widely used for 3D scene reconstruction [2]. Its accuracy depends on the width of the stripes used and their optical quality. A time-of-flight camera (ToF camera) [10] is a range imaging camera system that resolves distance based on the time-of-flight measurements of a light signal between the camera and the subject for each point of the image. However, neither of them focuses on moving object detection, and they often incur high computational cost. There are other interaction schemes, including gaze-based interactions [33], voice-based interactions [4, 46], and brain-computer interfaces [32]. However, tapping on virtual buttons is one of the most natural ways for users to input text on AR/VR devices.

In-air Tapping-based Interaction on Virtual Displays: Existing interaction schemes for VR/AR environments are usually based on in-air tapping [14, 15, 21, 22, 25, 42]. Due to the high computational cost and low frame rate, commercial schemes are inconvenient for users [15, 21, 22, 25]. Higuchi et al. used 120 fps video cameras to capture the gesture and enable a multi-finger AR typing interface [14]. However, due to the high computational cost, the video frames are processed on a PC instead of the mobile device. Compared with such systems, our scheme uses a light-weight approach that achieves high tapping speed and low latency on widely available mobile devices.

Tapping Based Interaction for Mobile Devices: Recently, various novel tapping based approaches for mobile devices have been proposed, such as camera-based schemes [26, 43], acoustic signal based schemes [18, 37], and Wi-Fi based schemes [3, 6]. These approaches focus on exploring alternatives for tapping on physical materials in the 2D space [3, 6, 18, 26, 37, 43]. In comparison, our approach is an in-air tapping scheme addressing the 3D localization problem, which is more challenging and provides more flexibility for AR/VR.

Device-free Gesture Recognition and Tracking: Device-free gesture recognition is widely used for human-computer interaction, and mainly includes vision-based [8, 21, 22, 25, 35, 45], RF-based [1, 11, 16, 20, 31, 36, 39, 40], and sound-based [7, 13, 27, 38, 44] approaches.
Vision based systems have been widely used in AR/VR systems that have enough computational resources [8, 21, 22, 25, 35]. However, they incur high computational cost and have limited frame rates, so they cannot be easily ported to mobile devices. RF based systems use the radio waves reflected by hands to recognize predefined gestures [1, 7, 13, 16, 20]. However, they cannot provide the high accuracy tracking capability that is crucial for in-air tapping. In comparison, our scheme provides fine-grained localization for fingertips and can measure the bending angle of the moving finger. Sound-based systems, such as LLAP [38] and Strata [44], use phase changes to track hands and achieve cm-level accuracy for 1D and 2D tracking, respectively. FingerIO [27] proposes an OFDM based hand tracking system, achieves a hand location accuracy of 8mm, and allows 2D drawing in the air using COTS mobile devices. However, both schemes treat the hand as a single object and only provide tracking in the 2D space. The key advantage of our scheme is achieving fine-grained multi-finger tracking in the 3D space, as we fuse information from both ultrasound and vision.

3 SYSTEM OVERVIEW
Our system is a tapping-in-the-air scheme on virtual displays. It uses a mono-camera, a speaker, and two microphones to sense the in-air tapping. The camera captures the video of users' fingers at a speed of 30 fps, without the depth information. The speaker emits human-inaudible ultrasound at a frequency in the range of 18 ∼ 22 kHz. The microphones capture the ultrasound signals reflected by users' fingers to detect finger movements. The system architecture consists of four components, as shown in Figure 3.

Fingertip Localization (Section 4): Our system uses a light-weight fingertip localization algorithm in video processing. We first use skin color to separate the hand from the background and detect the contour of the hand, which is a commonly used technique for hand recognition [30]. Then, we use a light-weight algorithm to locate all the fingertips captured in the video frame.

Ultrasound Signal Phase Extraction (Section 5): First, we down-convert the ultrasound signal. Second, we extract the phase of the reflected ultrasound signal. The ultrasound phase change corresponds to the movement distance of the fingers in the depth direction.

Tapping Detection and Tapping Depth Measurement (Section 6): We use a finite state machine based algorithm to detect the start of the finger tapping action using the ultrasound phase information. Once the finger tapping action is detected, we trace back the last few video frames to confirm the tapping motion. To measure the strength of tapping, we combine the depth acquired from the ultrasound phase change with the depth acquired from the video frames to get the bending angle of the finger.

Keystroke Localization (Section 7): When the user tries to press a key, both the finger that presses the key and the neighboring fingers move at the same time. Therefore, we combine the tapping depth measurement with the videos to determine the finger that has the largest bending angle and thereby recognize the pressed key.
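As a concrete illustration of the Ultrasound Signal Phase Extraction component, the following sketch shows the standard continuous-wave I/Q down-conversion and phase-to-distance step. It is a minimal sketch under assumptions: the single 19 kHz tone, 48 kHz sampling rate, and filter parameters are placeholders for illustration, and the actual implementation in Section 5 may differ.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 48000          # assumed audio sampling rate
F_TONE = 19000      # assumed CW tone inside the 18~22 kHz band
C_SOUND = 343.0     # speed of sound, m/s

def baseband_iq(rx, fs=FS, f=F_TONE, cutoff=200.0):
    """Coherently down-convert the received tone to complex baseband I/Q.
    `rx` is the recorded microphone signal as a NumPy array."""
    t = np.arange(len(rx)) / fs
    i = rx * np.cos(2 * np.pi * f * t)
    q = rx * -np.sin(2 * np.pi * f * t)
    b, a = butter(4, cutoff / (fs / 2))          # low-pass keeps the slowly varying reflection
    return filtfilt(b, a, i) + 1j * filtfilt(b, a, q)

def phase_to_path_change_mm(iq, f=F_TONE):
    """Convert the unwrapped baseband phase change into reflection path-length change (mm)."""
    phase = np.unwrap(np.angle(iq))
    wavelength = C_SOUND / f
    # a 2*pi phase change corresponds to one wavelength of path-length change;
    # the finger itself moves about half that, since the path is speaker -> finger -> mic
    return (phase - phase[0]) / (2 * np.pi) * wavelength * 1000.0
```

The low cutoff frequency is the reason this works at finger speeds: only reflections whose phase varies slowly survive the filter, while the carrier and audible content are discarded.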
Figure 3: System architecture

4 FINGERTIPS LOCALIZATION
In this section, we present fingertip localization, the first step of video processing. We use a light-weight computer vision algorithm to locate the fingertips in the horizontal 2D space of the camera.

4.1 Adaptive Skin Segmentation
Given a video frame, skin segmentation categorizes each pixel as either a skin-color pixel or a non-skin-color pixel. Traditional skin segmentation methods are based on the YUV or the YCrCb color space. However, surrounding lighting conditions affect the thresholds for Cr and Cb. We use an adaptive color-based skin segmentation approach to improve the robustness of the skin segmentation scheme. Our scheme is based on Otsu's method for pixel clustering [29]. In the YCrCb color space, we first isolate the red channel Cr, which is vital to human skin color detection. Otsu's method calculates the optimal threshold to separate the skin from the background, using the grayscale image in the Cr channel. However, the computational cost of Otsu's method is high: it takes 25ms for a 352 × 288 video frame when implemented on our smartphone platform. To reduce the computational cost, we use Otsu's method to get the threshold only on a small number of frames, e.g., when the background changes. For the other frames, we use the color histogram of the hand region learned from the previous frame instead of Otsu's method. Note that although our color-based skin segmentation method can work under different lighting conditions, it is still sensitive to the background color. When the background color is close to the skin color, our method may not be able to segment the hand successfully.

4.2 Hand Detection
We perform hand detection using the skin segmentation results, as shown in Figure 4(b). We first reduce the noise in the skin segmentation results using the erode and dilate operations. After that, we use a simplified hand detection scheme to find the hand contour.

Our simplified detection scheme is based on the following observations. First, in the AR scenario, we can predict the size of the hand in the camera view. As the camera is normally mounted on the head, the distance between the hand and the camera is smaller than the length of the arm. Once the full hand is in the view, the size of the hand contour should be larger than a given threshold. This threshold can be calculated from the statistics of human arm length [12] and the area of the palm. Therefore, we only need to perform hand contour detection when there are skin areas larger than the given threshold. Second, the hand movement has a limited speed, so we can use the centroid of the hand to track the movement at a 30 fps frame rate. After determining that one of the large contours in the view is the hand, we retrieve the point that has the maximum distance value in the Distance Transform [5] of the segmentation image to find the centroid of the palm, as shown in Figure 4(c).
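The following OpenCV sketch illustrates the Cr-channel Otsu segmentation and the distance-transform palm centroid described above. It is a minimal sketch under assumptions: the kernel size, the minimum contour area, and the choice of the largest skin blob as the hand are illustrative placeholders rather than the paper's actual parameters.

```python
import cv2
import numpy as np

def palm_centroid(frame_bgr):
    """Segment skin on the Cr channel with Otsu's method, keep the largest blob as the
    hand, and return the hand contour plus the palm centroid and radius."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    cr = ycrcb[:, :, 1]                                    # red-difference channel
    _, mask = cv2.threshold(cr, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    kernel = np.ones((5, 5), np.uint8)                     # illustrative kernel size
    mask = cv2.dilate(cv2.erode(mask, kernel), kernel)     # erode/dilate to remove noise
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    hand = max(contours, key=cv2.contourArea)              # assume the largest skin blob is the hand
    if cv2.contourArea(hand) < 5000:                       # illustrative minimum hand size
        return None
    hand_mask = np.zeros_like(mask)
    cv2.drawContours(hand_mask, [hand], -1, 255, cv2.FILLED)
    dist = cv2.distanceTransform(hand_mask, cv2.DIST_L2, 5)
    _, radius, _, center = cv2.minMaxLoc(dist)             # farthest interior point = palm centroid
    return hand, center, radius
```

On subsequent frames, the Otsu threshold can be replaced by the color histogram of the detected hand region, as described above, so the expensive step runs only when the background changes.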
We trace the centroid of the hand rather than the entire contour. This significantly simplifies the tracing scheme, because the centroid normally remains within the hand contour captured in the last frame: the hand movement distance between two consecutive frames is smaller than the palm size.

4.3 Fingertip Detection
We then detect the fingertips using the hand contour when the user makes a tapping gesture. Our model robustly detects fingertip locations for different numbers of extended fingers. As shown in Figure 5, we present the most complex situation, a tapping gesture with five fingertips. Traditional fingertip detection algorithms have high computational cost, as they detect fingertips by finding the convex vertices of the contour. Consider the case where the points on the contour are represented by P_i with coordinates (x_i, y_i). The curvature at a given point P_i can be calculated as:

\theta_i = \arccos\frac{\overrightarrow{P_iP_{i-q}} \cdot \overrightarrow{P_iP_{i+q}}}{\lVert\overrightarrow{P_iP_{i-q}}\rVert\,\lVert\overrightarrow{P_iP_{i+q}}\rVert}    (1)

where P_{i-q} and P_{i+q} are the q-th points before and after point P_i on the contour, and \overrightarrow{P_iP_{i-q}} and \overrightarrow{P_iP_{i+q}} are the vectors from P_i to P_{i-q} and P_{i+q}, respectively. The limitation of this approach is that we have to go through all possible points on the hand contour. Scanning through all points on the contour takes 42ms on smartphones on average in our implementation. Thus, it cannot achieve a 30 fps rate.

To reduce the computational cost of fingertip detection, we first compress the contour into segments and then use a heuristic scheme to detect fingertips. Our approach is based on the observation that, while tapping, people usually put their hand in front of the camera with the fingers above the palm, as shown in Figure 5. This gesture can serve as an initial gesture to reduce the effort of locating the fingertips. Under this gesture, we can segment the contour by finding the extreme points on the Y axis, as shown in Figure 5. The four maximum points, R2, R4, R5, and R6, correspond to the roots of the fingers. Using this segmentation method, we only need to consider these extreme points while ignoring the contour points in between, which reduces the computational costs.

Although the extreme-points-based scheme is efficient, it might lead to errors as the hand contour could be noisy.
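A minimal sketch of the Y-axis extreme-point heuristic described above: in image coordinates (y growing downward), local minima of the contour's y coordinate give fingertip candidates and local maxima give finger-root candidates. The neighborhood size is an illustrative assumption; the candidates are then filtered by the geometric constraints in Eqs. (2)-(5) below.

```python
import numpy as np

def contour_extrema(points, win=15):
    """Return indices of local extreme points of the contour along the Y axis.
    Local minima (smallest y in their neighborhood) are fingertip candidates and
    local maxima are finger-root candidates; `win` is an illustrative neighborhood size."""
    pts = np.asarray(points, dtype=float)
    y = pts[:, 1]
    n = len(y)
    tip_candidates, root_candidates = [], []
    for i in range(n):
        # neighborhood indices on the closed contour (wraps around)
        nbr = np.array([(i + k) % n for k in range(-win, win + 1) if k != 0])
        if y[i] < y[nbr].min():        # image y grows downward: minima are the highest points
            tip_candidates.append(i)
        elif y[i] > y[nbr].max():
            root_candidates.append(i)
    return tip_candidates, root_candidates
```

Compared with scanning every contour point with Eq. (1), only a handful of extreme points need further checks, which is what keeps the fingertip step within the per-frame budget.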
Figure 4: Adaptive fingertip 2D localization. (a) Input frame; (b) Binary image; (c) Hand contour distance transform image; (d) Fingertips image.

We use the geometric features of the hand and the fingers to remove these noisy points on the hand contour. First, the fingertips should be above the palm, shown as the black circle in Figure 4(d). Suppose that C(x'', y'') is the centroid of the palm calculated from the Distance Transform image and r is the radius of the inscribed palm circle given by the maximum distance-transform value. We check that all fingertip points F_i, with coordinates (x_i, y_i), satisfy:

y_i < y'' - r, \quad \forall i \in \{1, 2, 3, 4, 5\}.    (2)

Second, the length of a finger, including the thumb, is three times its width [48]. We calculate the width of finger i as:

w_i =
\begin{cases}
\lVert\overrightarrow{R_1R_2}\rVert, & \text{if } i \in \{1\} \\
\lVert\overrightarrow{R_{i+1}R_{i+2}}\rVert, & \text{if } i \in \{2, 3, 4, 5\}.
\end{cases}    (3)

The lengths of the fingers are

l_i =
\begin{cases}
\dfrac{\left\lVert \overrightarrow{R_1F_1}\,\lVert\overrightarrow{R_1R_2}\rVert^2 - \left(\overrightarrow{R_1R_2}\cdot\overrightarrow{R_1F_1}\right)\overrightarrow{R_1R_2} \right\rVert}{\lVert\overrightarrow{R_1R_2}\rVert^2}, & \text{if } i \in \{1\} \\
\dfrac{\left\lVert \overrightarrow{R_{i+1}F_i}\,\lVert\overrightarrow{R_{i+1}R_{i+2}}\rVert^2 - \left(\overrightarrow{R_{i+1}R_{i+2}}\cdot\overrightarrow{R_{i+1}F_i}\right)\overrightarrow{R_{i+1}R_{i+2}} \right\rVert}{\lVert\overrightarrow{R_{i+1}R_{i+2}}\rVert^2}, & \text{if } i \in \{2, 3, 4, 5\},
\end{cases}    (4)

i.e., l_i is the perpendicular distance from fingertip F_i to the line through its two adjacent finger-root points. We check that every detected fingertip satisfies:

\frac{l_i}{w_i} > threshold, \quad \forall i \in \{1, 2, 3, 4, 5\}.    (5)

In our implementation, we set the threshold to 2.5. The extreme points on the contour that satisfy both Eq. (2) and Eq. (5) are identified as fingertips.

As tapping gestures like the one in Figure 5 recur frequently during tapping, we calibrate the number and locations of the fingertips whenever we detect such a gesture with a different number of fingers. In the case that two fingers are close to each other or a finger is bent, we use the coordinates of the fingertips on the x axis to interpolate the fingertip locations. Note that our finger detection algorithm focuses on the tapping case. It might not detect all fingers when some fingers are occluded by other parts of the hand.

Figure 5: Hand geometric model

5 DEPTH MEASUREMENT
We use the phase of the ultrasound reflected by the fingers to measure finger movements. This phase-based depth measurement has several key advantages. First, ultrasound based movement detection has low latency. It can provide an instantaneous decision on finger movement between two video frames. Second, ultrasound based movement detection gives accurate depth information, which helps us detect finger tappings with a short movement distance.

Existing ultrasound phase measurement algorithms, such as LLAP [38] and Strata [44], cannot be directly applied to our system. This is because they treat the hand as a single object, whereas we detect finger movements. The ultrasound signal changes caused by hand movements are much larger than those caused by finger movements, and the multipath interference during finger movements is much more significant than during hand movements. As illustrated in Figure 6, the user first pushes the whole hand towards the speaker/microphones and then taps the index finger. The magnitude of the signal change caused by the hand movement is 10 times larger than that of tapping a single finger. Furthermore, we can see clear, regular phase changes when moving the hand in Figure 6.
However, for finger tapping, the phase change is irregular and there are large direct-current (DC) trends during the finger movements, caused by multipath interference. This makes the depth measurement for finger tapping challenging.

To rule out the multipath interference and measure the finger tapping depth under large DC trends, we use a heuristic algorithm called Peak and Valley Estimation (PVE). The key difference between PVE and the existing LEVD algorithm [38] is that PVE specifically focuses on tapping detection and avoids the error-prone step of static vector estimation in LEVD. As shown in Figure 6, it is difficult to estimate the static vector for finger tapping, because the phase change of finger tapping is not obvious and is easily influenced by multipath interference. To handle this problem, we rely on the peaks and valleys of the signal to get the movement distance. Each time the phase changes by 2π, there are two peaks and two valleys in the received signal. We can therefore measure phase changes in steps of π/2 by counting the peaks and valleys. For example, when the phase changes from 0 to π/2, the signal changes from a peak in the I component to a peak in the Q component in the time domain.
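The quarter-cycle counting idea can be sketched as follows. This is a rough illustration only: it assumes a single 19 kHz tone, ignores DC trends, and omits the ordering of the I/Q extrema that PVE uses to determine the movement direction.

```python
import numpy as np
from scipy.signal import find_peaks

F_TONE = 19000.0                      # assumed CW tone inside the 18~22 kHz band
WAVELENGTH_M = 343.0 / F_TONE         # speed of sound / tone frequency

def pve_depth_change_mm(i_baseband, q_baseband, prominence=None):
    """Count peaks and valleys of the I and Q baseband components; each extremum
    corresponds to roughly a pi/2 phase change of the reflected path."""
    extrema = 0
    for comp in (np.asarray(i_baseband), np.asarray(q_baseband)):
        peaks, _ = find_peaks(comp, prominence=prominence)
        valleys, _ = find_peaks(-comp, prominence=prominence)
        extrema += len(peaks) + len(valleys)
    phase_change = extrema * (np.pi / 2)                  # total unsigned phase change
    path_change_m = phase_change / (2 * np.pi) * WAVELENGTH_M
    return path_change_m / 2.0 * 1000.0                   # finger depth ~ half the path change, in mm
```

For instance, with the complex baseband from the earlier down-conversion sketch, `i_baseband = iq.real` and `q_baseband = iq.imag`; counting extrema rather than estimating a static vector is what makes this step robust to the DC trends shown in Figure 6.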