VSkin: Sensing Touch Gestures on Surfaces of Mobile Devices Using Acoustic Signals

Ke Sun, State Key Laboratory for Novel Software Technology, Nanjing University, China, kesun@smail.nju.edu.cn
Ting Zhao, State Key Laboratory for Novel Software Technology, Nanjing University, zhaoting@smail.nju.edu.cn
Wei Wang, State Key Laboratory for Novel Software Technology, Nanjing University, China, ww@nju.edu.cn
Lei Xie, State Key Laboratory for Novel Software Technology, Nanjing University, China, lxie@nju.edu.cn

ABSTRACT
Enabling touch gesture sensing on all surfaces of the mobile device, not limited to the touchscreen area, leads to new user interaction experiences. In this paper, we propose VSkin, a system that supports fine-grained gesture-sensing on the back of mobile devices based on acoustic signals. VSkin utilizes both the structure-borne sounds, i.e., sounds propagating through the structure of the device, and the air-borne sounds, i.e., sounds propagating through the air, to sense finger tapping and movements. By measuring both the amplitude and the phase of each path of sound signals, VSkin detects tapping events with an accuracy of 99.65% and captures finger movements with an accuracy of 3.59 mm.

CCS CONCEPTS
• Human-centered computing → Interface design prototyping; Gestural input;

KEYWORDS
Touch gestures; Ultrasound

ACM Reference Format:
Ke Sun, Ting Zhao, Wei Wang, and Lei Xie. 2018. VSkin: Sensing Touch Gestures on Surfaces of Mobile Devices Using Acoustic Signals. In MobiCom '18: 24th Annual International Conference on Mobile Computing and Networking, October 29–November 2, 2018, New Delhi, India. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3241539.3241568

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
MobiCom '18, October 29–November 2, 2018, New Delhi, India
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5903-0/18/10...$15.00
https://doi.org/10.1145/3241539.3241568

Figure 1: Back-of-Device interactions. (a) Back-Swiping; (b) Back-Tapping; (c) Back-Scrolling.

1 INTRODUCTION
Touch gestures are one of the most important ways for users to interact with mobile devices. With the wide deployment of touchscreens, a set of user-friendly touch gestures, such as swiping, tapping, and scrolling, have become the de facto standard user interface for mobile devices. However, due to the high cost of touchscreen hardware, gesture-sensing is usually limited to the front surface of the device. Furthermore, touchscreens combine the function of gesture-sensing with the function of displaying. This leads to the occlusion problem [30], i.e., user fingers often block the content displayed on the screen during the interaction process.
Enabling gesture-sensing on all surfaces of the mobile device, not limited to the touchscreen area, leads to new user interaction experiences. First, new touch gestures solve the occlusion problem of the touchscreen.
For example, Back-of-Device (BoD) gestures use tapping or swiping on the back of a smartphone as a supplementary input interface [22, 35]. As shown in Figure 1, the screen is no longer blocked when the back-scrolling gesture is used for scrolling the content. BoD gestures also enrich the user experience of mobile games by allowing players to use the back surface as a touchpad. Second, defining new touch gestures on different surfaces helps the system better understand user intentions. On traditional touchscreens, touching a webpage on the screen could mean that the user wishes to click a hyperlink or the user just wants to scroll down the page. Existing touchscreen schemes
often confuse these two intentions, due to the overloaded actions on gestures that are similar to each other. With the new types of touch gestures performed on different surfaces of the device, these actions can be assigned to distinct gestures, e.g., selecting an item should be performed on the screen while scrolling or switching should be performed on the back or the side of the device. Third, touch sensing on the side of the phone enables virtual side-buttons that could replace physical buttons and improve the waterproof performance of the device. Compared to in-air gestures, which also enrich the gesture semantics, touch gestures provide a better user experience, due to their accurate touch detection (for confirmation) coupled with useful haptic feedback.

Fine-grained measurements of gesture movement distance and speed are vital for enabling touch gestures that users are already familiar with, including scrolling and swiping. However, existing accelerometer-based or structural-vibration-based touch sensing schemes only recognize coarse-grained activities, such as tapping events [5, 35]. Extra information on the tapping position or the tapping force level usually requires intensive training and calibration processes [12, 13, 25] or additional hardware, such as a mirror on the back of the smartphone [31].

In this paper, we propose VSkin, a system that supports fine-grained gesture-sensing on the surfaces of mobile devices based on acoustic signals. Similar to a layer of skin on the surfaces of the mobile device, VSkin can sense both finger tapping and the distance and direction of finger movements on the surface of the device. Without modifying the hardware, VSkin utilizes the built-in speakers and microphones to send and receive sound signals for touch-sensing. More specifically, VSkin captures both the structure-borne sounds, i.e., sounds propagating through the structure of the device, and the air-borne sounds, i.e., sounds propagating through the air. As touching the surface can significantly change the structural vibration pattern of the device, the characteristics of structure-borne sounds are reliable features for touch detection, i.e., whether the finger contacts the surface or not [12, 13, 25]. While it is difficult to use the structure-borne sounds to sense finger movements, air-borne sounds can measure the movement with mm-level accuracy [14, 28, 34]. Therefore, by analyzing both the structure-borne and the air-borne sounds, it is possible to reliably recognize a rich set of touch gestures as if there were another touchscreen on the back of the phone. Moreover, VSkin does not require intensive training, as it uses the physical properties of sound propagation to detect touch and measure finger movements.

The key challenge faced by VSkin is to measure both the structure-borne and the air-borne signals with high fidelity while the hand is very close to the mobile device. Given the small form factor of mobile devices, sounds traveling through different mediums and paths arrive at the microphone within a short time interval of 0.13∼0.34 ms, which is just 6∼16 sample points at a sampling rate of 48 kHz. With the limited inaudible sound bandwidth (around 6 kHz) available on commercial mobile devices, it is challenging to separate these paths. Moreover, to achieve accurate movement measurement and location-independent touch detection, we need to measure both the phase and the magnitude of each path.
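To make these numbers concrete, the following back-of-envelope sketch computes the arrival-time spread between a structure-borne copy and an air-borne copy of the same signal. It is only an illustration: the 5 cm and 12 cm path lengths are assumed values, and the sound speeds are the typical figures quoted in Section 3.

```python
# Illustrative arrival-time spread between the structure-borne and
# air-borne copies of the signal; the path lengths are assumptions.
FS = 48_000        # audio sampling rate (Hz)
C_AIR = 343.0      # speed of sound in air (m/s)
C_SOLID = 2_000.0  # lower bound for structure-borne propagation (m/s)

for d in (0.05, 0.12):              # assumed path lengths (m)
    spread = d / C_AIR - d / C_SOLID
    print(f"{d*100:.0f} cm path: {spread*1e3:.2f} ms "
          f"({spread*FS:.0f} samples at 48 kHz)")
# 5 cm path: 0.12 ms (6 samples at 48 kHz)
# 12 cm path: 0.29 ms (14 samples at 48 kHz)
```

With these assumed geometries, the spread is roughly 6 to 14 samples, consistent with the 6∼16 sample range quoted above.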
To address this challenge, we design a system that uses the Zadoff-Chu (ZC) sequence to measure different sound paths. With the near-optimal auto-correlation function of the ZC sequence, which has a peak width of 6 samples, we can separate the structure-borne and the air-borne signals even when the distance between the speaker and microphone is just 12 cm. Furthermore, we develop a new algorithm that measures the phase of each sound path at a rate of 3,000 samples per second. Compared to traditional impulsive signal systems that measure sound paths in a frame-by-frame manner (with frame rate < 170 Hz [14, 34]), the higher sampling rate helps VSkin capture fast swiping and tapping events.

We implement VSkin on commercial smartphones as real-time Android applications. Experimental results show that VSkin achieves a touch detection accuracy of 99.65% and an accuracy of 3.59 mm for finger movement distances. Our user study shows that VSkin only slightly increases the movement time used for interaction tasks, e.g., scrolling and swiping, by 34% and 10%, respectively, when compared to touchscreens.

We make the following contributions in this work:
• We introduce a new approach for touch-sensing on mobile devices by separating the structure-borne and the air-borne sound signals.
• We design an algorithm that performs the phase and magnitude measurement of multiple sound paths at a high sampling rate of 3 kHz.
• We implement our system on the Android platform and perform real-world user studies to verify our design.

2 RELATED WORK
We categorize research related to VSkin into three classes: Back-of-Device interactions, tapping and force sensing, and sound-based gesture sensing.

Back-of-Device Interactions: Back-of-Device interaction is a popular way to extend the user interface of mobile devices [5, 11, 31, 32, 35]. Gestures performed on the back of the device can be detected by the built-in camera [31, 32] or sensors [5, 35] on the mobile device. LensGesture [32] uses the rear camera to detect finger movements that are performed just above the camera. Back-Mirror [31] uses an additional mirror attached to the rear camera to capture BoD gestures in a larger region. However, due to the limited viewing angle of cameras, these approaches either have limited sensing area or need extra hardware for extending the sensing
range. BackTap [35] and βTap [5] use built-in sensors, such as the accelerometer, to sense coarse-grained gestures. However, sensor readings only provide limited information about the gesture, and they cannot quantify the movement speed and distance. Furthermore, accelerometers are sensitive to vibrations caused by hand movements while the user is holding the device. Compared to camera-based and sensor-based schemes, VSkin incurs no additional hardware costs and can perform fine-grained gesture measurements.

Tapping and Force Sensing: Tapping and force applied to the surface can be sensed by different types of sensors [4, 7, 9, 10, 12, 13, 15, 19, 25]. TapSense [7] leverages the tapping sound to recognize whether the user touches the screen with a fingertip or a fist. ForceTap [9] measures the tapping force using the built-in accelerometer. VibWrite [13] and VibSense [12] use the vibration signal instead of the sound signal to sense the tapping position so that the interference in air-borne propagation can be avoided. However, they require pre-trained vibration profiles for tapping localization. ForcePhone [25] uses linear chirp sounds to sense force and touch based on changes in the magnitude of the structure-borne signal. However, fine-grained phase information cannot be measured through chirps, and chirps only capture the magnitude of the structure-borne signal at a low sampling rate. In comparison, our system measures both the phase and the magnitude of multiple sound paths at a high sampling rate of 3 kHz so that we can perform robust tap sensing without intensive training.

Sound-based Gesture Sensing: Several sound-based gesture recognition systems have been proposed to recognize in-air gestures [1, 3, 6, 16, 17, 21, 23, 33, 37]. Soundwave [6], Multiwave [17], and AudioGest [21] use the Doppler effect to recognize predefined gestures. However, the Doppler effect only gives coarse-grained movement speeds. Thus, these schemes only recognize a small set of gestures that have distinctive speed characteristics. Recently, three state-of-the-art schemes (i.e., FingerIO [14], LLAP [28], and Strata [34]) use ultrasound to track fine-grained finger gestures. FingerIO [14] transmits OFDM modulated sound frames and locates the moving finger based on the change of the echo profiles of two consecutive frames. LLAP [28] uses a Continuous Wave (CW) signal to track the moving target based on the phase information, which is susceptible to the dynamic multipath caused by other moving objects. Strata [34] combines the frame-based approach and the phase-based approach. Using the 26-bit GSM training sequence, which has nice autocorrelation properties, Strata can track phase changes at different time delays so that objects that are more than 8.5 cm apart can be resolved. However, these schemes mainly focus on tracking in-air gestures that are performed at more than 20 cm away from the mobile device [14, 23, 28, 34]. In comparison, our system uses both the structure-borne and the air-borne sound signals to sense gestures performed on the surfaces of the mobile device, which are very close (e.g., less than 12 cm) to both the speakers and the microphones.

Figure 2: Sound propagation paths on a smartphone (structure path, LOS air path, and reflection air paths from the rear speaker to the bottom and top microphones).
As the sound reflections at a short distance are often submerged by the Line-of-Sight (LOS) signals, sensing gestures with an SNR ≈ 2 dB at 5 cm is considerably harder than sensing in-air gestures with an SNR ≈ 12 dB at 30 cm.

3 SYSTEM OVERVIEW
VSkin uses both the structure-borne and the air-borne sound signals to capture gestures performed on the surface of the mobile device. We transmit and record inaudible sounds using the built-in speakers and microphones on commodity mobile devices. As an example illustrated in Figure 2, sound signals transmitted by the rear speaker travel through multiple paths on the back of the phone to the top and bottom microphones. At both microphones, the structure-borne sound that travels through the body structure of the smartphone arrives first. This is because sound waves propagate much faster in solids (> 2,000 m/s) than in the air (around 343 m/s) [24]. There may be multiple copies of air-borne sounds arriving within a short interval following the structure-borne sound. The air-borne sounds include the LOS sound and the sounds reflected by surrounding objects, e.g., the finger or the table. All these sound signals are mixed at the recording microphones.

VSkin performs gesture-sensing based on the mixture of sound signals recorded by the microphones. The design of VSkin consists of the following four components:

Transmission signal design: We choose to use the Zadoff-Chu (ZC) sequence modulated by a sinusoid carrier as our transmitted sound signal. This design meets three key goals. First, the auto-correlation of the ZC sequence has a narrow peak width of 6 samples, so that we can separate sound paths that arrive with a small time difference by locating the peaks corresponding to their different delays (see Figure 3). Second, we use interpolation schemes to reduce the bandwidth of the ZC sequence to less than 6 kHz so that it can fit into the narrow inaudible range of 17∼23 kHz
provided by commodity speakers and microphones. Third, we choose to modulate the ZC sequence so that we can extract the phase information, which cannot be measured by traditional chirp-like sequences such as FMCW sequences.

Figure 3: IR estimation of dual microphones. (a) Bottom microphone (Mic 1); (b) Top microphone (Mic 2).

Sound path separation and measurement: To separate different sound paths at the receiving end, we first use cross-correlation to estimate the Impulse Response (IR) of the mixed sound. Second, we locate the candidate sound paths using the amplitude of the IR estimation. Third, we identify the structure-borne path, the LOS path, and the reflection path by aligning candidate paths on different microphones based on the known microphone positions. Finally, we use an efficient algorithm to calculate the phase and amplitude of each sound path at a high sampling rate of 48 kHz.

Finger movement measurement: The finger movement measurement is based on the phase of the air-borne path reflected by the finger. To detect the weak reflections of the finger, we first calculate the differential IR estimations so that changes caused by finger movements are amplified. Second, we use an adaptive algorithm to determine the delay of the reflection path so that the phase and amplitude can be measured with a high SNR. Third, we use an Extended Kalman Filter to further amplify the sound signal based on the finger movement model. Finally, the finger movement distance is calculated by measuring the phase change of the corresponding reflection path.

Touch measurement: We use the structure-borne path to detect touch events, since the structure-borne path is mainly determined by whether the user's finger is pressing on the surface or not. To detect touch events, we first calculate the differential IR estimations of the structure-borne path. We then use a threshold-based scheme to detect the touch and release events. To locate the touch position, we found that the delay of the changes in the structure-borne sound is closely related to the distance from the touch position to the speaker. Using this observation, we classify the touch event into three different regions with an accuracy of 87.8%.

Note that finger movement measurement and touch measurement can use the signal captured by the top microphone, the bottom microphone, or both. How these measurements are used in specific gestures, such as scrolling and swiping, depends on both the type of the gesture and the placement of the microphones on the given device (see Section 6.5).

4 TRANSMISSION SIGNAL DESIGN
4.1 Baseband Sequence Selection
Sound signals propagating through the structure path, the LOS path, and the reflection path arrive within a very small time interval of less than 0.34 ms, due to the small size of a smartphone (< 20 cm). One way to separate these paths is to transmit short impulses of sound so that the reflected impulses do not overlap with each other. However, impulses with short time durations have very low energy, so the received signals, especially those reflected by the finger, are too weak to be reliably measured.

In VSkin, we choose to transmit a periodical high-energy signal and rely on the auto-correlation properties of the signal to separate the sound paths.
A continuous periodical signal has higher energy than impulses so that the weak reflections can be reliably measured. The cyclic auto-correlation function of the signal s[n] is defined as $R(\tau) = \frac{1}{N}\sum_{n=1}^{N} s[n]\,s^*[(n-\tau) \bmod N]$, where N is the length of the signal, τ is the delay, and $s^*[n]$ is the conjugate of the signal. The cyclic auto-correlation function is maximized around τ = 0, and we define the peak at τ = 0 as the main lobe of the auto-correlation function, see Figure 5(b). When the cyclic auto-correlation function has a single narrow peak, i.e., R(τ) ≈ 0 for τ ≠ 0, we can separate multiple copies of s[n] that arrive at different delays τ by performing cross-correlation of the mixed signal with the cyclically shifted s[n]. For the cross-correlation results shown in Figure 3, each delayed copy of s[n] in the mixed signal leads to a peak at its corresponding delay value of τ.

The transmitted sound signal needs to satisfy the following extra requirements to ensure both the resolution and the signal-to-noise ratio of the path estimation:
• Narrow auto-correlation main lobe width: The width of the main lobe is the number of points on each side of the lobe where the power has fallen to half (−3 dB) of its maximum value. A narrow main lobe leads to better time resolution of sound propagation paths.
• Low baseband crest factor: The baseband crest factor is the ratio of the peak value to the effective value of the baseband signal. A signal with a low crest factor has higher energy than a high-crest-factor signal with the same peak power [2]. Therefore, it produces cross-correlation results with a higher signal-to-noise ratio while the peak power is still below the audible power threshold.
• High auto-correlation gain: The auto-correlation gain is the peak power of the main lobe divided by the average power of the auto-correlation function. A higher auto-correlation gain leads to a higher signal-to-noise ratio in the correlation result. Usually, a longer code sequence has a higher auto-correlation gain.
• Low auto-correlation side lobe level: Side lobes are the small peaks (local maxima) other than the main lobe in the auto-correlation function. A large side lobe level will cause interference in the impulse response estimation.

We compare the performance of transmission signals with different code sequence designs and interpolation methods. For the code sequence design, we compare commonly used pseudo-noise (PN) sequences (i.e., the GSM training sequence, the Barker sequence, and the M-sequence) with a chirp-like polyphase sequence (the ZC sequence [18]) in Table 1. Note that the longest Barker sequence and GSM training sequence are 13 bits and 26 bits, respectively. For the M-sequence and the ZC sequence, we use a sequence length of 127 bits.

We interpolate the raw code sequences before transmitting them. The purpose of the interpolation is to reduce the bandwidth of the code sequence so that it can fit into a narrow transmission band that is inaudible to humans. There are two methods to interpolate the sequence: the time domain method and the frequency domain method. For the time domain method [34], we first upsample the sequences by repeating each sample k times (usually k = 6∼8) and then use a low-pass filter to ensure that the signal occupies the desired bandwidth. For the frequency domain method, we first perform a Fast Fourier Transform (FFT) of the raw sequence, perform zero padding in the frequency domain to increase the length of the signal, and then use an Inverse Fast Fourier Transform (IFFT) to convert the signal back into the time domain. For both methods, we reduce the bandwidth of all sequences to 6 kHz at a sampling rate of 48 kHz so that the modulated signal can fit into the 17∼23 kHz inaudible range supported by commercial devices.

The performance of the different sound signals is summarized in Table 1. The ZC sequence has the best baseband crest factor and auto-correlation gain. Although the raw M-sequence has ideal auto-correlation performance and crest factor, the sharp transitions between "0" and "1" in the M-sequence make its interpolated version worse than chirp-like polyphase sequences [2]. In general, frequency domain interpolation is better than time domain interpolation, due to its narrower main lobe width. While the side lobe level of frequency domain interpolation is higher than that of time domain interpolation, the side lobe level of −6.82 dB provided by the ZC sequence gives enough attenuation on side lobes for our system.

Table 1: Performance of different types of sequences

Sequence              | Interpolation method | Main lobe width | Baseband crest factor | Auto-correlation gain | Side lobe level
GSM (26 bits)         | Time domain          | 14 samples      | 8.10 dB               | 11.80 dB              | -4.64 dB
GSM (26 bits)         | Frequency domain     | 8 samples       | 6.17 dB               | 11.43 dB              | -3.60 dB
Barker (13 bits)      | Time domain          | 16 samples      | 10.50 dB              | 11.81 dB              | -9.57 dB
Barker (13 bits)      | Frequency domain     | 8 samples       | 5.12 dB               | 13.46 dB              | -6.50 dB
M-sequence (127 bits) | Time domain          | 16 samples      | 5.04 dB               | 12.04 dB              | -11.63 dB
M-sequence (127 bits) | Frequency domain     | 8 samples       | 6.68 dB               | 13.90 dB              | -6.58 dB
ZC (127 bits)         | Time domain          | 16 samples      | 3.85 dB               | 12.14 dB              | -12.45 dB
ZC (127 bits)         | Frequency domain     | 6 samples       | 2.56 dB               | 13.93 dB              | -6.82 dB

Figure 4: Sound signal modulation structure (FFT, zero-padding/upsampling, IFFT, and I/Q mixing onto the carrier).
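To make the path-separation mechanism concrete, the sketch below generates a frequency-domain interpolated ZC sequence and recovers two synthetic path delays from the circular cross-correlation, as in Figure 3. It is a minimal illustration: the root parameters u = 1 and q = 0 (see Eq. (1) below), the delays, the amplitudes, and the noise level are assumptions chosen for the example, not values from the paper.

```python
# Minimal sketch: separate two delayed copies of a frequency-domain
# interpolated ZC sequence via circular cross-correlation.
import numpy as np

N_ZC, u, q, K = 127, 1, 0, 8       # K = fs/B = 48 kHz / 6 kHz
n = np.arange(N_ZC)
zc = np.exp(-1j * np.pi * u * n * (n + 1 + 2 * q) / N_ZC)  # Eq. (1)

# Frequency-domain interpolation: zero-pad the spectrum to N_ZC * K
# points so the baseband occupies 6 kHz at a 48 kHz sampling rate.
spec = np.fft.fft(zc)
padded = np.zeros(N_ZC * K, dtype=complex)
padded[: (N_ZC + 1) // 2] = spec[: (N_ZC + 1) // 2]   # positive freqs
padded[-(N_ZC // 2):] = spec[-(N_ZC // 2):]           # negative freqs
s = np.fft.ifft(padded) * K                           # baseband signal

# Synthetic received mixture: a strong copy at delay 10 (e.g., the
# structure-borne path) and a weaker copy at delay 40 (an air-borne
# path), plus noise; the signal is periodic, so shifts are cyclic.
rng = np.random.default_rng(0)
mix = np.roll(s, 10) + 0.4 * np.roll(s, 40)
mix += 0.05 * (rng.standard_normal(len(s)) + 1j * rng.standard_normal(len(s)))

# Circular cross-correlation with s, i.e., correlation against every
# cyclic shift at once; each path produces a peak at its delay.
corr = np.abs(np.fft.ifft(np.fft.fft(mix) * np.conj(np.fft.fft(s))))

def strongest_delays(corr, count, guard=8):
    corr = corr.copy()
    delays = []
    for _ in range(count):
        d = int(np.argmax(corr))
        delays.append(d)
        idx = np.arange(d - guard, d + guard + 1) % len(corr)
        corr[idx] = 0                 # suppress this path's main lobe
    return sorted(delays)

# Typically prints [10, 40]; overlapping lobes and noise can shift a
# detected delay by about one sample.
print(strongest_delays(corr, 2))
```

In the real system, the correlation is computed against the recorded microphone signal, and the amplitude and phase at each recovered delay provide the per-path measurements described in Section 3.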
Based on the above considerations, we choose to use the frequency domain interpolated ZC sequence as our transmitted signal. The root ZC sequence parametrized by u is given by:

$$ZC[n] = e^{-j\frac{\pi u n(n+1+2q)}{N_{ZC}}}, \qquad (1)$$

where $0 \leq n < N_{ZC}$, q is a constant integer, and $N_{ZC}$ is the length of the sequence. The parameter u is an integer with $0 < u < N_{ZC}$ and $\gcd(N_{ZC}, u) = 1$. The ZC sequence has several nice properties [18] that are useful for sound signal modulation. For example, ZC sequences have constant magnitudes. Therefore, the power of the transmitted sound is constant, so we can measure its phase at high sampling rates, as shown in later sections. Note that, compared to the single frequency scheme [28], the disadvantage of modulated signals, including the ZC sequence, is that they occupy a larger bandwidth and therefore require a stable frequency response from the microphone.

4.2 Modulation and Demodulation
We use a two-step modulation scheme to convert the raw ZC sequence into an inaudible sound signal, as illustrated in Figure 4. The first step is to use the frequency domain interpolation to reduce the bandwidth of the sequence. We first perform an $N_{ZC}$-point FFT on the raw complex-valued ZC sequence, where $N_{ZC}$ is the length of the sequence. We then zero-pad the FFT result into $N'_{ZC} = N_{ZC} f_s / B$ points by