Device-Free Gesture Tracking Using Acoustic Signals

Wei Wang†, Alex X. Liu†‡, Ke Sun†
†State Key Laboratory for Novel Software Technology, Nanjing University, China
‡Dept. of Computer Science and Engineering, Michigan State University, USA
ww@nju.edu.cn, alexliu@cse.msu.edu, samsonsunke@gmail.com

ABSTRACT
Device-free gesture tracking is an enabling HCI mechanism for small wearable devices because fingers are too big to control the GUI elements on such small screens, and it is also an important HCI mechanism for medium-to-large size mobile devices because it allows users to provide input without blocking the screen view. In this paper, we propose LLAP, a device-free gesture tracking scheme that can be deployed on existing mobile devices as software, without any hardware modification. We use the speakers and microphones that already exist on most mobile devices to perform device-free tracking of a hand/finger. The key idea is to use the acoustic phase to get fine-grained movement direction and movement distance measurements. LLAP first extracts the sound signal reflected by the moving hand/finger after removing the background sound signals that are relatively consistent over time. LLAP then measures the phase changes of the sound signals caused by hand/finger movements and converts the phase changes into the distance of the movement. We implemented and evaluated LLAP using commercial-off-the-shelf mobile phones. For 1-D hand movement and 2-D drawing in the air, LLAP has a tracking accuracy of 3.5 mm and 4.6 mm, respectively. Using gesture traces tracked by LLAP, we can recognize characters and short words drawn in the air with an accuracy of 92.3% and 91.2%, respectively.

CCS Concepts
•Human-centered computing → Gestural input;

Keywords
Gesture Tracking; Ultrasound; Device-free

1. INTRODUCTION

1.1 Motivation
Gestures are natural and user-friendly Human Computer Interaction (HCI) mechanisms for users to control their devices. Gesture tracking allows devices to get fine-grained user input by quantitatively measuring the movement of their hands/fingers in the air. Device-free gesture tracking means that the user's hands/fingers are not attached to any device. Imagine that a smart watch has device-free gesture tracking capability; the user can then adjust the time in a touch-less manner as shown in Figure 1, where the clock hand follows the movement of the finger. Device-free gesture tracking is an enabling HCI mechanism for small wearable devices (such as smart watches) because fingers are too big to control the GUI elements on such small screens. In contrast, device-free gesture tracking allows users to provide input by performing gestures near a device rather than on a device.
Device-free gesture tracking is also an important HCI mechanism for medium-to-large size mobile devices (such as smartphones and tablets), complementing touch screens, because it allows users to provide input without blocking the screen view, which gives users a better visual experience. Furthermore, device-free gesture tracking can work in scenarios where touch screens cannot, e.g., when users wear gloves or when the device is in the pocket.

Figure 1: Device-free gesture tracking

Practical device-free gesture tracking systems need to satisfy three requirements. First, such systems need to have high accuracy so that they can capture delicate movements of a hand/finger. Due to the small operational space around the mobile device, e.g., within tens of centimeters of the device, we need millimeter (mm) level tracking accuracy to fully exploit the control capability of human hands. Second, such systems need to have low latency, i.e., they must respond to hand/finger movement within tens of milliseconds, so that users do not perceive lagging responsiveness. Third, they need to have low computational cost so that they can be implemented on resource-constrained mobile devices.

1.2 Limitations of Prior Art
Most existing device-free gesture tracking solutions use customized hardware [1–4]. Based on the fact that the wireless signal changes as a hand/finger moves, Google made a customized chip in their Soli system that uses 60 GHz wireless signals with mm-level wavelength to track small movements of a hand/finger [1], and Teng et al. made customized directional 60 GHz transceivers
in their mTrack system to track the movement of a pen or a finger using steerable directional beams [2]. Based on the fact that light reflection strength changes as a hand/finger moves, Zhang et al. made customized LED/light sensors in their Okuli system to use visible light to track hand/finger movement [3]. Based on vision processing algorithms, Leap Motion made customized infrared cameras to track hand/finger movements [4]. Recently, Nandakumar et al. explored the feasibility of using commercial mobile devices to track fingers/hands within a short distance. They proposed fingerIO, which uses OFDM modulated sound to locate fingers with an accuracy of 8 mm [5].

1.3 Proposed Approach
In this paper, we propose a device-free gesture tracking scheme, called Low-Latency Acoustic Phase (LLAP), that can be deployed on existing mobile devices as software (such as an app) without any hardware modification. We use the speakers and microphones that already exist on most mobile devices to perform device-free tracking of a hand/finger. Commercial-Off-The-Shelf (COTS) mobile devices can emit and record sound waves with frequencies higher than 17 kHz, which are inaudible to most people [6]. The wavelength of sound waves in this frequency range is less than 2 cm. Therefore, a small movement of a few millimeters will significantly change the phase of the received sound wave. Our key idea is to use the acoustic phase to get fine-grained movement direction and movement distance measurements. LLAP first extracts the sound signal reflected by the moving hand/finger after removing the background sound signals that are relatively consistent over time. Second, LLAP measures the phase changes of the sound signals caused by hand/finger movements and then converts the phase changes into the distance of the movement. LLAP achieves a tracking accuracy of 3.5 mm and a latency of 15 ms on COTS mobile phones with limited computing power. For mobile devices with two or more microphones, LLAP is capable of 2-D gesture tracking that allows users to draw in the air with their hands/fingers.

1.4 Technical Challenges and Solutions
The first challenge is to achieve mm-level accuracy for the measurement of hand/finger movement distance. Existing sound based ranging systems use either Time-Of-Arrival/Time-Difference-Of-Arrival (TOA/TDOA) measurements [7, 8] or Doppler shift measurements [9, 10]. Traditional TOA/TDOA based systems require the device to emit bursty sound signals, such as pulses or chirps, which are often audible to humans because these signals change abruptly [7, 8]. Furthermore, their distance measurement accuracy is often on the scale of cm, except for the recent OFDM phase based approach [5]. Doppler shift based device-free systems do not have tracking capability and can only recognize predefined gestures, because the Doppler shift can only provide a coarse-grained measurement of the speed or direction of hand/finger movements due to the limited frequency measurement precision [9, 11, 12]. In contrast, to achieve mm-level hand/finger tracking accuracy, we leverage the fact that the sound reflected by a human hand is coherent with the sound emitted by the mobile device. Two signals are coherent if they have a constant phase difference and the same frequency. This coherency allows us to use a coherent detector to convert the received sound signal into a complex-valued baseband signal.
Our approach is to first measure the phase change of the reflected signal, rather than using the noise-prone integration of the Doppler shift as AAMouse [13] does, and then convert the phase change into the movement distance of the hand/finger. Compared with traditional TOA/TDOA, our approach has two advantages: (1) human inaudibility, and (2) mm-level tracking accuracy. Compared with the Doppler shift, our approach has three advantages: (1) tracking capability, (2) low latency, and (3) the ability to track slow or small movements of a hand/finger. We have lower latency than Doppler shift based systems because measuring the Doppler shift requires a Fast Fourier Transform (FFT), which needs to accumulate at least 2,048 samples (42.7 ms at a 48 kHz sampling rate), whereas we only need to accumulate 16 samples (0.3 ms). In other words, Doppler shift based systems can respond to hand/finger movement only every 42.7 ms, whereas our LLAP system can respond every 0.3 ms. Note that in practice we may need to accumulate more samples due to the hardware limitations of mobile devices, e.g., 512 samples (10.7 ms) on smartphones. We can deal with slow hand/finger movement because LLAP precisely measures the slow phase changes accumulated over time. We can deal with small hand/finger movement because LLAP precisely measures small phase changes of less than a full phase cycle. In contrast, Doppler-based approaches cannot detect slow or small movements due to their limited frequency resolution, as we show in Section 3.

The second challenge is to achieve two-dimensional gesture tracking. Although LLAP can precisely measure the relative movement distance of a hand, it cannot directly measure the absolute distance between the hand and the speaker/microphones, and therefore it is hard to determine the initial hand location that is essential for two-dimensional tracking. To address this challenge, we use multiple Continuous Waves (CW) with linearly spaced frequencies to measure the path length. We observe that sound waves with different frequencies have different wavelengths, which leads to different phase shifts even when they travel through the same path. To determine the path length of the reflected sound wave, we first isolate the phase changes caused by hand/finger movement and then apply an Inverse Discrete Fourier Transform (IDFT) to the phases at the different sound frequencies to get the Time-Of-Arrival (TOA) of the path. By identifying the TOA that has the strongest energy in the IDFT result, we can determine the path length for the sound reflected by the moving hand/finger. This serves as a coarse-grained initial position estimate. Combining the fine-grained relative distance measurement with the coarse-grained initial position estimate, we achieve relatively accurate 2-D hand/finger tracking.
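To make the IDFT step concrete, here is a minimal sketch in Python. It is our illustration rather than the paper's implementation: the tone count, base frequency, and the premise that the per-tone dynamic components h[k] have already been isolated are assumptions (the 350 Hz tone spacing matches Section 3.4).

```python
import numpy as np

c = 343.0                  # speed of sound (m/s)
n_tones = 16               # number of CW tones (illustrative assumption)
f0, df = 17000.0, 350.0    # base frequency and tone spacing (Hz)

# h[k]: complex dynamic component isolated at tone k. Here we synthesize
# a single reflection path with a 1.5 ms round-trip delay for testing.
tau = 1.5e-3
k = np.arange(n_tones)
h = np.exp(-2j * np.pi * (f0 + k * df) * tau)

# The IDFT across tones converts the per-frequency phase slope into a
# delay profile; the strongest bin gives the TOA of the reflected path.
profile = np.fft.ifft(h)
n_peak = np.argmax(np.abs(profile))
tau_hat = n_peak / (n_tones * df)      # bin n maps to delay n/(N*df)
print(f"round-trip path length ~ {c * tau_hat:.2f} m")  # ~0.49 m here
```

The delay resolution of this profile is 1/(N·Δf), about 6 cm of path length for the parameters above, which is why this result serves only as a coarse-grained initial position estimate.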
1.5 Summary of Experimental Results
We implemented and evaluated LLAP using commercial mobile phones without any hardware modification. Under normal indoor noise levels, for 1-D hand movement and 2-D drawing in the air, LLAP has a tracking accuracy of 3.5 mm and 4.57 mm, respectively. Under loud indoor noise levels, such as when music is playing, for 1-D hand movement and 2-D drawing in the air, LLAP has a tracking accuracy of 5.81 mm and 4.89 mm, respectively. Experimental results also show that LLAP can detect small hand/finger movements. For example, for a small single-finger movement of 5 mm, LLAP has a detection accuracy of 94% within a distance of 30 cm. Using gesture traces tracked by LLAP, we can recognize characters and short words drawn in the air with an accuracy of 92.3% and 91.2%, respectively.

2. RELATED WORK
Sound Based Localization and Tracking: TOA and TDOA ranging systems using sound waves have a good ranging accuracy of a few centimeters because sound propagates much more slowly than radio waves [7, 8, 14–16]. However, such systems often either require specially designed ultrasound transceivers [14] or emit audible probing sounds, such as short bursty sound pulses or chirps [7, 8, 15]. Furthermore, most existing sound based tracking systems are not device-free, as they can only track a device that
transmits or receives sound signals [7, 8, 10, 13–15, 17]. For example, AAMouse measures the Doppler shifts of the sound waves transmitted by a smartphone to track the phone itself with an accuracy of 1.4 cm [13]. In comparison, our approach is device-free, as we use the sound signals reflected by a hand/finger. The problems that we face are more challenging because the signal reflected by the object has much weaker energy than the signal travelling through the Line-Of-Sight (LOS) path.

Sound Based Device-Free Gesture Recognition: Most sound based device-free gesture recognition systems use the Doppler effect of the sound reflected by hands [9, 11, 12]. Such systems do not have tracking capability and can only recognize predefined gestures, because the Doppler shift can only provide a coarse-grained measurement of the speed or direction of hand/finger movements due to the limited frequency measurement precision [9, 11, 12]. Another system, ApneaApp, uses chirp signals to detect the changes in reflected sound caused by human breathing [18]. ApneaApp applies FFT over sound signals of a long duration to achieve better distance resolution at the cost of reducing the time resolution. Thus, ApneaApp's approach can only be used for long-term monitoring of periodic movements (such as human breathing) that have a frequency lower than 1 Hz. There are keystroke recognition systems that use the sound emitted by gestures, such as typing on a keyboard or tapping on a table, to recognize keystrokes [19–21] or handwriting [22]. Compared with such systems, we use inaudible, rather than audible, sound reflected by hands/fingers.

In recent pioneering work done in parallel with ours, Nandakumar et al. proposed an OFDM based finger tracking system, called fingerIO [5]. FingerIO achieves a finger location accuracy of 8 mm and also allows 2-D drawing in the air using COTS mobile devices. The key difference between LLAP and fingerIO is that LLAP uses CW signals rather than OFDM pulses. The phase measured from CW signals is less noisy due to the narrower bandwidth compared to OFDM pulses. This allows LLAP to achieve better tracking accuracy. Furthermore, the complex-valued baseband signal extracted by LLAP can potentially give more information about hand/finger movements than the TOA measurements from fingerIO. However, the CW signal approach used by LLAP is more susceptible to interference from background movements than the OFDM approach.

RF Based Gesture Recognition: Radio Frequency (RF) signals, such as Wi-Fi signals, reflected by human bodies can be used for human gesture and activity recognition [23–28]. However, as the propagation speed of light is almost one million times faster than the speed of sound, it is very difficult to achieve fine-grained distance measurements with RF signals. Therefore, existing Wi-Fi signal based gesture recognition systems cannot perform fine-grained quantification of gesture movement. Instead, they recognize predefined gestures, such as punch, push, or sweep [27, 29, 30]. When using narrowband RF signals below 5 GHz, state-of-the-art tracking systems have a measurement accuracy of several cm [31, 32]. To the best of our knowledge, the only RF based gesture recognition systems that achieve mm-level tracking accuracy are mTrack [2] and Soli [1], which use 60 GHz RF signals. The key advantage of our system over mTrack and Soli is that we use the speakers and microphones that already exist on most mobile devices to perform device-free tracking of a hand/finger.
Vision Based Gesture Recognition: Vision based gesture recognition systems use cameras or light sensors to capture fine-grained gesture movements [3, 4, 33–35]. For example, Okuli achieves a localization accuracy of 7 mm using an LED and light sensors [3]. However, such systems have a limited viewing angle and are susceptible to lighting condition changes [3]. In contrast, LLAP can operate even while the device is in the pocket.

3. MEASURE 1-D RELATIVE DISTANCE
In this section, we present our approach to measuring the one-dimensional relative movement distance of a hand/finger, which consists of three steps. First, we use a coherent detector to down-convert the received sound signal into a complex-valued baseband signal. Second, we measure the path length change based on the phase changes of the baseband signal. Third, we combine the phase changes at different frequencies to mitigate the multipath effect. Before we introduce these three steps, we analyze the limitations of the Doppler shift based approach, which is used by most existing sound-based gesture recognition systems [8, 9, 11–13], and present the advantages of our phase based approach over it.

3.1 Limitations of Doppler Shift Based Distance Measurement
A moving object changes the frequency of the sound waves it reflects; by measuring this frequency change in the received sound signal, known as the Doppler shift, we can calculate the movement speed of the object. The traditional Doppler shift measurement approach, which uses the Short-Time Fourier Transform (STFT) to get the Doppler shift, is not suitable for device-free gesture recognition due to its low resolution and highly noisy results.

First, the resolution of STFT is limited by the fundamental constraints of time-frequency analysis [36]. The STFT approach first divides the received sound data into segments, where each segment has an equal number (say 2,048) of signal samples, and then performs a Fast Fourier Transform (FFT) on each segment to get its spectrum. With a small segment size, the frequency resolution is very low. For example, when the segment size is 2,048 samples and the sampling rate is 48 kHz, the frequency resolution of STFT is 23.4 Hz. This corresponds to a movement speed of 0.2 meters per second (m/s) when the sound wave has a frequency of 20 kHz. In other words, the hand must move at a speed of at least 20 cm per second to be detectable by the STFT approach. Note that improving the frequency resolution always comes at the cost of reducing the time resolution [36]. For example, if we use a larger segment size of 48,000 samples to get a frequency resolution of 1 Hz, this inevitably reduces the time resolution of STFT to one second, as it takes one second to collect 48,000 samples at a sampling rate of 48 kHz. Distance measuring schemes with such a low time resolution are unacceptable for interactive input because they can only measure the moving distance of a hand/finger at one-second intervals. Note that the resolution of STFT cannot be improved by padding short data segments with zeros and performing an FFT with a larger size, as done in [13], because zero padding is equivalent to convolution with a sinc function in the frequency domain. Figure 2 shows the STFT result for a hand that first moves toward and then moves away from the microphone, where each sample segment contains 2,048 samples and is padded with zeros to perform an FFT of size 48,000.
Although the frequency resolution seems to improve to 1 Hz when we perform the FFT with a larger size, the high energy band in the frequency domain (the red part in the spectrogram) still spans a range of about 80 Hz, instead of being around 1 Hz. Most of the small frequency variations are buried in this wide band, and we can only roughly recognize a positive frequency shift from 4 to 5.2 seconds and a negative frequency shift from 6 to 7.5 seconds.

Figure 2: Doppler shift of hand movements
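To make the tradeoff concrete, the following quick check reproduces the numbers above. It is a sketch under the stated 48 kHz sampling rate and a 20 kHz tone, not code from the paper.

```python
fs = 48000        # sampling rate (Hz)
c = 343.0         # speed of sound (m/s)
f = 20000.0       # frequency of the emitted tone (Hz)

for n_fft in (2048, 48000):
    delta_f = fs / n_fft            # STFT frequency resolution (Hz)
    # A reflector moving at speed v shifts the echo by about 2*v*f/c,
    # so the smallest resolvable speed is:
    v_min = delta_f * c / (2 * f)
    print(f"segment {n_fft:>5}: {delta_f:5.1f} Hz resolution, "
          f"minimum speed {v_min * 100:4.1f} cm/s, "
          f"segment duration {1000 * n_fft / fs:6.1f} ms")
# segment  2048:  23.4 Hz resolution, minimum speed 20.1 cm/s,   42.7 ms
# segment 48000:   1.0 Hz resolution, minimum speed  0.9 cm/s, 1000.0 ms
```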
Second, Doppler shift measurements are subject to high noise, as shown in Figure 2. In device-based tracking systems, such as AAMouse [13], where the sound source or sound receiver is moving, it is possible to use the frequency that has the maximal energy to determine the Doppler shift. In device-free tracking systems, however, the frequency with the highest energy, plotted as the white line around 18 kHz in Figure 2, does not closely follow the hand movement, because the sound waves reflected by the moving hand are mixed with the sound waves traveling through the Line-Of-Sight (LOS) path as well as those reflected by static objects. Furthermore, there are impulses in the Doppler shift measurements due to frequency selective fading caused by the hand movement, i.e., the sound waves traveling along different paths may cancel each other out at the target frequency when the hand is at certain positions.

3.2 Phase Based Distance Measurement
Because of the above limitations of Doppler shift based distance measurement, we propose a phase based distance measurement approach for sound signals. As the Doppler shift in the reflected signal is caused by the increase/decrease in the phase of the signal when the hand moves closer/farther away, the idea is to treat the reflected signal as a phase modulated signal whose phase changes with the movement distance. Except for fingerIO, which uses OFDM phase [5], no prior work has used phase changes of sound signals to measure movement distance, although the phase of the RF baseband signal has been used for measuring the movement distance of objects [2, 23].

Compared to the Doppler shift, the phase change of the baseband signal can easily be measured in the time domain. Figure 3 shows the In-phase (I) and Quadrature (Q) components of the baseband signal obtained from the same sound recording that produced the spectrogram in Figure 2. From Figure 3(a), we observe that the I/Q waveforms remain static when the hand is not moving and vary like sinusoids when the hand moves. Combining the in-phase (as the real part) and quadrature (as the imaginary part) components into a complex signal, we can clearly observe patterns caused by hand movement. Figure 3(b) shows how the complex signal changes during a short time period from 4.04 to 4.64 seconds while the hand moves towards the microphone. We observe that the traces of the complex signal are close to circles in the complex plane.

Figure 3: Baseband signal of sound waves. (a) I/Q waveforms; (b) complex I/Q traces.

In essence, the complex signal is a combination of two vectors in the complex plane, which we call the static vector and the dynamic vector. The static vector corresponds to the sound waves traveling through the LOS path or reflected by static objects, such as walls and tables. This vector remains quasi-static during a short time period. The dynamic vector corresponds to the reflection caused by the moving hand. When the hand moves towards the microphone, we observe an increase in the phase of the dynamic vector, which is caused by the decrease in the length of the reflected path. As the phase of the signal increases by 2π when the path length decreases by one wavelength of the sound wave, we can calculate the distance that the hand moves from the phase change of the dynamic vector. Assuming that the speed of sound is c = 343 m/s, the wavelength of a sound signal with frequency f = 18 kHz is 1.9 cm. In Figure 3(b), we observe that the complex signal moves through about 4.25 circles, which corresponds to an 8.5π increase in phase. Thus, the path length changes by 1.9 × 4.25 = 8.08 cm during the 0.6 seconds shown in Figure 3(b). This is equivalent to a hand movement distance of 4.04 cm, considering the two-way path length change. Furthermore, we can determine whether the hand is moving toward or away from the microphone by the sign of the phase change. Note that it is important to use both the I and Q components, because the movement direction information is lost when we use only a single component or the magnitude [23].
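As a minimal numerical sketch of this phase-to-distance conversion, assume the static vector has already been removed so that `dynamic` holds only the hand's reflection; the synthetic trajectory and all variable names are our own.

```python
import numpy as np

c, f = 343.0, 18000.0       # speed of sound (m/s), CW frequency (Hz)
lam = c / f                 # wavelength, about 1.9 cm

# Synthetic dynamic vector: a hand starting 30 cm away and moving
# 4 cm toward the microphone over 0.6 s. The reflection travels a
# two-way path, so the phase is -2*pi*f*(2*d)/c.
t = np.linspace(0.0, 0.6, 1800)
d = 0.30 - 0.04 * (t / 0.6)                      # one-way distance (m)
dynamic = np.exp(-2j * np.pi * f * (2 * d) / c)

# Unwrap the phase, then convert: a 2*pi phase increase means the
# round-trip path shortened by one wavelength, i.e., the hand moved
# lam/2 toward the microphone.
phase = np.unwrap(np.angle(dynamic))
movement = (phase - phase[0]) * lam / (4 * np.pi)
print(f"hand moved {movement[-1] * 100:.2f} cm toward the microphone")  # ~4.00
```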
This phase based distance measurement approach has three advantages over the Doppler shift based approach. First, the accuracy is much higher, because by directly measuring the phase changes we eliminate the noise-prone steps of first measuring the Doppler shift and then integrating it to get the distance change. Second, the latency is much lower, because the phase measurement can be conducted on a short data segment with only hundreds of samples. Third, the speed resolution is much higher, because the phase measurement can track small phase changes and slow phase shifts. For example, phase based measurement can easily achieve a 2.4 mm distance resolution, which corresponds to a phase change of π/4 when the wavelength is 1.9 cm, as the calculation below shows. Furthermore, phase measurements provide richer information than STFT. For example, the phase differences at different frequencies can be used to localize the hand, as discussed in Section 4.
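To spell out the resolution figure with the quantities already defined (our arithmetic): a phase change of Δφ corresponds to a path length change of (Δφ/2π)λ, so

$$\Delta d = \frac{\Delta\phi}{2\pi}\,\lambda = \frac{\pi/4}{2\pi} \times 19\,\text{mm} \approx 2.4\,\text{mm},$$

which, by the two-way argument above, corresponds to a hand movement of about 1.2 mm.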
3.3 LLAP Overview
We now give an overview of LLAP when operating on a single sound frequency. Without loss of generality, we assume that the sampling frequency of the device is 48 kHz. We have tested our implementation under other sampling frequencies, e.g., 44.1 kHz, and obtained similar results as with 48 kHz. LLAP uses a Continuous Wave (CW) signal of A cos 2πft, where A is the amplitude and f is the frequency of the sound, which is in the range of 17–23 kHz. CW sound signals in this range can be generated by many COTS devices without introducing audible noise [6].

We use the microphones on the same device to record the sound wave using the same sampling rate of 48 kHz. As the received sound waves are transmitted by the same device, there is no Carrier Frequency Offset (CFO) between the sender and receiver. Therefore, we can use the traditional coherent detector structure shown in Figure 4 to down-convert the received sound signal to a baseband signal [37]. The received signal is first split into two identical copies and multiplied with the transmitted signal cos 2πft and its phase shifted version −sin 2πft. We then use a Cascaded Integrator Comb (CIC) filter to remove high frequency components and decimate the signal to get the corresponding In-phase and Quadrature signals.

Figure 4: System structure

3.4 Sound Signal Down Conversion
Our CIC filter is a three-section filter with a decimation ratio of 16 and a differential delay of 17. Figure 5 shows the frequency response of the CIC filter. We select the parameters so that the first and second zeros of the filter appear at 175 Hz and 350 Hz. The pass-band of the CIC filter is 0–100 Hz, which corresponds to movements with a speed lower than 0.95 m/s when the wavelength is 1.9 cm. The second zero of the filter appears at 350 Hz so that signals at (f ± 350) Hz are attenuated by more than 120 dB. Thus, to minimize interference from adjacent frequencies, we use a frequency interval of 350 Hz when the speaker transmits multiple frequencies simultaneously. To achieve better computational efficiency, we do not use a frequency-compensating FIR filter after the CIC.

Figure 5: Frequency response of the CIC filter

The CIC filter incurs low computational overhead, as it involves only additions and subtractions. Therefore, we only need two multiplications per sample point for the down conversion, i.e., multiplying cos 2πft and −sin 2πft with each received sample. For a 48 kHz sampling rate, this involves only 96,000 multiplications per second and can easily be carried out by mobile devices. After the down conversion, the sampling rate is decreased to 3 kHz to make the subsequent signal processing more efficient.

To understand the digital down conversion process, we consider the sound signal that travels through a path p with time-varying path length $d_p(t)$. The received sound signal from path p can be represented as $R_p(t) = 2A'_p \cos(2\pi ft - 2\pi f d_p(t)/c - \theta_p)$, where $2A'_p$ is the amplitude of the received signal, the term $2\pi f d_p(t)/c$ comes from the phase lag caused by the propagation delay $\tau_p = d_p(t)/c$, and c is the speed of sound. There is also an initial phase $\theta_p$, caused by the hardware delay and the phase inversion due to reflection. Based on the system structure shown in Figure 4, when we multiply this received signal with $\cos(2\pi ft)$, we have:

$$2A'_p \cos(2\pi ft - 2\pi f d_p(t)/c - \theta_p) \times \cos(2\pi ft) = A'_p \left[\cos(-2\pi f d_p(t)/c - \theta_p) + \cos(4\pi ft - 2\pi f d_p(t)/c - \theta_p)\right].$$

Note that the second term has a high frequency of 2f and will be removed by the low-pass CIC filter. Therefore, the I-component of the baseband signal is $I_p(t) = A'_p \cos(-2\pi f d_p(t)/c - \theta_p)$. Similarly, the Q-component is $Q_p(t) = A'_p \sin(-2\pi f d_p(t)/c - \theta_p)$. Combining these two components as the real and imaginary parts of a complex signal, we have the complex baseband signal, where $j^2 = -1$:

$$B_p(t) = A'_p e^{-j(2\pi f d_p(t)/c + \theta_p)}. \quad (1)$$

Note that the phase for path p is $\phi_p(t) = -(2\pi f d_p(t)/c + \theta_p)$, which changes by 2π when $d_p(t)$ changes by one sound wavelength $\lambda = c/f$.
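The sketch below illustrates this coherent detection chain end to end. For brevity it substitutes a plain moving-average low-pass for the three-section CIC filter; that substitution, the synthetic input, and all names are our assumptions, not the paper's code.

```python
import numpy as np

def downconvert(received, f, fs=48000, decim=16):
    """Mix with cos/-sin, low-pass, and decimate to get I/Q baseband."""
    n = np.arange(len(received))
    carrier = 2 * np.pi * f * n / fs
    baseband = received * np.cos(carrier) - 1j * received * np.sin(carrier)
    # A moving average stands in for the CIC here: it keeps the near-DC
    # movement signal and suppresses the 2f image left by the mixing.
    taps = decim * 17                       # echoes the differential delay
    kernel = np.ones(taps) / taps
    baseband = np.convolve(baseband, kernel, mode="same")
    return baseband[::decim]                # 48 kHz / 16 = 3 kHz output

# Example: LOS tone plus a weak echo from a reflector 30 cm away.
fs, f = 48000, 18000.0
n = np.arange(fs // 2)                      # half a second of samples
los = np.cos(2 * np.pi * f * n / fs)
echo = 0.1 * np.cos(2 * np.pi * f * (n / fs - 2 * 0.30 / 343.0))
iq = downconvert(los + echo, f)
```

Because the mixing products at 2f lie far outside the filter's pass-band, what survives decimation is the near-DC baseband whose phase follows Equation (1).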
3.5 Phase Based Path Length Measurement
As the received signal is a combination of the signals traveling through many paths, we first need to extract the baseband signal component that corresponds to the sound reflected by the moving hand, so that we can infer the movement distance from the phase change of that component, as we show next. Thus, we need to decompose the baseband signal into the static and dynamic vectors. Recall that the static vector comes from sound waves traveling through the LOS path or reflected by static surrounding objects, which can be much stronger than the sound waves reflected by the hand. In practice, this static vector may also vary slowly with the movement of the hand. Such changes in the static vector are caused by the moving hand occluding other objects or by slow movements of the arm. It is therefore challenging to separate the slowly changing static vector from the dynamic vector caused by a slow hand movement. Existing work in 60 GHz technology uses two methods, Dual-Differential Background Removal (DDBR) and Phase Counting and Reconstruction (PCR), to remove the static vector [2]. However, the DDBR algorithm is susceptible to noise and cannot reliably detect slow movements, while PCR has long latency and requires strong periodicity in the baseband signal. Thus, neither of these algorithms is suitable for our purpose.

We use a heuristic algorithm called Local Extreme Value Detection (LEVD) to estimate the static vector. This algorithm operates on the I and Q components separately to estimate the real and imaginary parts of the static vector. The basic idea of LEVD is inspired by the well-known Empirical Mode Decomposition (EMD) algorithm [38]. We first find alternating local maximum and minimum points that differ by more than an empirical threshold Thr, which is set to three times the standard deviation of the baseband signal in a static environment. These large variations in the waveform indicate the movements of surrounding objects. We then use the average of two neighboring local maxima and minima as the estimated value of the static vector. Since the trace of the dynamic vector is close to a circle, the average of two such extremes is close to its center. Figure 6 shows the LEVD result for a short piece of the waveform in Figure 3(a). The LEVD pseudocode is given in Algorithm 1.
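Algorithm 1 is not reproduced in this excerpt, so the function below is only a hedged reconstruction of the LEVD idea from the description above; it runs separately on the I and Q components, and the exact extremum bookkeeping is our guess.

```python
import numpy as np

def levd(x, thr):
    """Estimate the static component of one baseband component (I or Q).

    Scans for alternating local extrema that differ by more than thr and
    takes the midpoint of each neighboring max/min pair as the current
    static-vector estimate, per the description in Section 3.5.
    """
    static = np.full(len(x), float(x[0]))
    last_ext = x[0]            # value of the last accepted extremum
    have_pair = False
    est = x[0]
    for i in range(1, len(x) - 1):
        is_max = x[i - 1] <= x[i] >= x[i + 1]
        is_min = x[i - 1] >= x[i] <= x[i + 1]
        if (is_max or is_min) and abs(x[i] - last_ext) > thr:
            if have_pair:
                est = (x[i] + last_ext) / 2.0   # midpoint ~ circle center
            last_ext, have_pair = x[i], True
        static[i] = est
    static[-1] = est
    return static

# thr is set to three times the standard deviation of the baseband
# signal measured in a static environment (Section 3.5).
```

Subtracting the static estimate from the baseband leaves the dynamic vector, whose phase is then unwrapped as in Section 3.2.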