Dynamic Speed Warping:Similarity-Based One-shot Learning for Device-free Gesture Signals Xun Wang,Ke Sun,Ting Zhao,Wei Wang,and Qing Gu State Key Laboratory for Novel Software Technology,Nanjing University [xunwang,kesun,tingzhao)@smail.nju.edu.cn,kesun@eng.ucsd.edu,[ww.guq}@nju.edu.cn Abstract-In this paper,we propose a Dynamic Speed Warping that learns information about object categories from one,or (DSW)algorithm to enable one-shot learning for device-free only a few,training samples.Thus,it reduces both the data gesture signals performed by different users.The design of DSW collection and training cost.The design of the DSW algorithm is based on the observation that the gesture type is determined by the trajectory of hand components rather than the movement is based on the critical observation that the gesture type is speed.By dynamically scaling the speed distribution and tracking determined by the trajectory of hand components,e.g.,fingers the movement distance along the trajectory,DSW can effectively and the palm,rather than the movement speed.We show that match gesture signals from different domains that have a ten-fold the similarity in trajectory leads to the similarity in the total difference in speeds.Our experimental results show that DSW movement distances and the scaled speed distributions.The can achieve a recognition accuracy of 97%for gestures performed by unknown users,while only use one training sample of each total movement distances are similar because considering the gesture type from four training users. specific trajectory for a gesture,e.g.,click,the starting and the ending postures of the hand remain the same,no matter I.INTRODUCTION how fast the user performs the gesture.The scaled speed Device-free gesture recognition systems use the Radio Fre- distributions are similar because when the user changes the quency (RF)[1]-[9]or sound signals [10]-[15]to detect movement speed,the speeds of different parts of the hand, and recognize human movements.By analyzing the signal such as the fingers and the palm,changes proportionally reflection of the hand,device-free sensing allows users to Therefore,the speed distribution of different components of a interact with their devices freely without wearing any sensor. fast gesture movement can be matched to the distribution of a Such natural and unconstrained interaction paradigm would slow gesture movement of the same type,when we scale down become a vital component for the next generation Human- speeds of all components by the same factor.Based on this Computer Interaction (HCD)solutions. observation,we design a dynamical programming algorithm, One of the key challenges for device-free sensing is to which is inspired by Dynamic Time Warping(DTW)[20],to robustly recognize gesture signals for different users and in calculate the similarity of gesture signals in terms of the total different environments.Traditional machine learning methods movement distance and the scaled speed distribution. use large datasets and intensive training process to extract domain-independent features from gesture signals.For ex- The DSW similarity measure leads to new ways to explore ample,one can collect gesture samples in different domains the gesture recognition problem.First,the robust gesture and use Generative Adversarial Networks (GANs)to reduce matching algorithm can be combined with kNN to serve as the impact of domain-specific features [16],[17].However, a similarity-based one-shot learning scheme that only requires due to the insufficient understanding of the machine-generated a small number of training samples.As the DSW algorithm models,the performance of these domain-independent models can adapt to different gesture speeds,it dramatically reduces under a new environment cannot be guaranteed.Fine-tuning the data collection/labeling cost and can incrementally tune the model in a new domain may require a large number of the system without retraining.Second,the DSW similarity new samples to be collected and labeled by the end-user in measure can serve as the basis for unsupervised or semi- the new environment.Even if virtual samples can be produced supervised learning systems.The DSW algorithm can auto- via geometric models using a small number of gestures in matically derive the type of gestures of unlabeled samples by the target domain [18],the retraining process still incurs clustering them using the speed-independent measure. formidable costs for mobile systems. We perform extensive evaluations of DSW using ultrasound- In this paper,rather than extracting domain-independent fea- based gesture signals.Our experimental results show that ture,we use Dynamic Speed Warping(DSW)to derive a sim- DSW can achieve a recognition accuracy of 97%for gestures ilarity measure between device-free gesture signals.As users performed by unknown users,while using only one training may perform the gesture with different speeds and the Doppler sample of each gesture type from four training users.DSW shift largely depends on the environment [19],speed variations also outperforms the DTW algorithm in all three external lead to severe robustness issues in the widely-used speed-based indicators for clustering performance.Therefore,DSW sim- gesture features [1]-[3].By removing the speed variation,the ilarity can serve as a powerful tool for both supervised and DSW similarity enables domain-independent one-shot learning unsupervised learning tasks
Dynamic Speed Warping: Similarity-Based One-shot Learning for Device-free Gesture Signals Xun Wang, Ke Sun, Ting Zhao, Wei Wang, and Qing Gu State Key Laboratory for Novel Software Technology, Nanjing University {xunwang,kesun,tingzhao}@smail.nju.edu.cn, kesun@eng.ucsd.edu, {ww,guq}@nju.edu.cn Abstract—In this paper, we propose a Dynamic Speed Warping (DSW) algorithm to enable one-shot learning for device-free gesture signals performed by different users. The design of DSW is based on the observation that the gesture type is determined by the trajectory of hand components rather than the movement speed. By dynamically scaling the speed distribution and tracking the movement distance along the trajectory, DSW can effectively match gesture signals from different domains that have a ten-fold difference in speeds. Our experimental results show that DSW can achieve a recognition accuracy of 97% for gestures performed by unknown users, while only use one training sample of each gesture type from four training users. I. INTRODUCTION Device-free gesture recognition systems use the Radio Frequency (RF) [1]–[9] or sound signals [10]–[15] to detect and recognize human movements. By analyzing the signal reflection of the hand, device-free sensing allows users to interact with their devices freely without wearing any sensor. Such natural and unconstrained interaction paradigm would become a vital component for the next generation HumanComputer Interaction (HCI) solutions. One of the key challenges for device-free sensing is to robustly recognize gesture signals for different users and in different environments. Traditional machine learning methods use large datasets and intensive training process to extract domain-independent features from gesture signals. For example, one can collect gesture samples in different domains and use Generative Adversarial Networks (GANs) to reduce the impact of domain-specific features [16], [17]. However, due to the insufficient understanding of the machine-generated models, the performance of these domain-independent models under a new environment cannot be guaranteed. Fine-tuning the model in a new domain may require a large number of new samples to be collected and labeled by the end-user in the new environment. Even if virtual samples can be produced via geometric models using a small number of gestures in the target domain [18], the retraining process still incurs formidable costs for mobile systems. In this paper, rather than extracting domain-independent feature, we use Dynamic Speed Warping (DSW) to derive a similarity measure between device-free gesture signals. As users may perform the gesture with different speeds and the Doppler shift largely depends on the environment [19], speed variations lead to severe robustness issues in the widely-used speed-based gesture features [1]–[3]. By removing the speed variation, the DSW similarity enables domain-independent one-shot learning that learns information about object categories from one, or only a few, training samples. Thus, it reduces both the data collection and training cost. The design of the DSW algorithm is based on the critical observation that the gesture type is determined by the trajectory of hand components, e.g., fingers and the palm, rather than the movement speed. We show that the similarity in trajectory leads to the similarity in the total movement distances and the scaled speed distributions. The total movement distances are similar because considering the specific trajectory for a gesture, e.g., click, the starting and the ending postures of the hand remain the same, no matter how fast the user performs the gesture. The scaled speed distributions are similar because when the user changes the movement speed, the speeds of different parts of the hand, such as the fingers and the palm, changes proportionally. Therefore, the speed distribution of different components of a fast gesture movement can be matched to the distribution of a slow gesture movement of the same type, when we scale down speeds of all components by the same factor. Based on this observation, we design a dynamical programming algorithm, which is inspired by Dynamic Time Warping (DTW) [20], to calculate the similarity of gesture signals in terms of the total movement distance and the scaled speed distribution. The DSW similarity measure leads to new ways to explore the gesture recognition problem. First, the robust gesture matching algorithm can be combined with kNN to serve as a similarity-based one-shot learning scheme that only requires a small number of training samples. As the DSW algorithm can adapt to different gesture speeds, it dramatically reduces the data collection/labeling cost and can incrementally tune the system without retraining. Second, the DSW similarity measure can serve as the basis for unsupervised or semisupervised learning systems. The DSW algorithm can automatically derive the type of gestures of unlabeled samples by clustering them using the speed-independent measure. We perform extensive evaluations of DSW using ultrasoundbased gesture signals. Our experimental results show that DSW can achieve a recognition accuracy of 97% for gestures performed by unknown users, while using only one training sample of each gesture type from four training users. DSW also outperforms the DTW algorithm in all three external indicators for clustering performance. Therefore, DSW similarity can serve as a powerful tool for both supervised and unsupervised learning tasks
The main contributions of our work are as follows: III.DEVICE-FREE GESTURE MATCHING We propose a new similarity measure that can adapt to the In this section,we first summarize the state-of-the-art ges- speed variations in gesture signals of different domains. ture matching methods and their limitations.We then describe We formally prove the properties of the speed adaptive signal our insight on the characteristics of device-free gesture signals. matching scheme and show that the result of DSW is a valid Finally,we demonstrate the benefits of using such speed- similarity measure. adaptive characteristics for gesture similarity calculation. Using real-world ultrasound gesture signals,we show that A.Gesture Matching Methods the DSW algorithm can serve as a solution for both one-shot Device-free gesture recognition systems collect radio/sound learning in supervised gesture recognition and unsupervised signals reflected by the hand to perform gesture recognition. gesture clustering tasks. We call these radio/sound signals gesture signals.The most- widely used gesture signals are complex-valued baseband signals that have Doppler frequencies corresponding to the II.RELATED WORKS hand movement speeds [1].[3],[25].For instance,Figure 1(a) Existing works that are closely related to our approach can and l(e)show the ultrasonic baseband signals of two samples be categorized into three areas:domain-independent feature of the writing "W"gesture,where the user writes the letter "W"in-the-air by moving hand back-and-forth twice.The I- extraction,cross-domain adaptation,and DTW-based schemes. component and the Q-component are the real and imaginary Domain-independent feature extraction.Early device-free parts of the gesture signal.Meanwhile,Figure 1(b)and 1(f) gesture recognition systems use statistical values (mean,vari- show the corresponding spectrogram calculated through Short- ance,etc.)of the signals [21]-[23]or Doppler speeds [24],[25] time Fourier transform (STFT).We can observe that the as the gesture features.However,it's well known that these gesture signal has a negative Doppler frequency when the hand features are dependent on the user,the location of devices. is moving away and has a positive frequency when the hand is and multi-path conditions introduced by the environment. moving back.Thus,the gesture signals show specific patterns There are two major approaches to extract domain-independent that can be matched with gesture movements and actions features for device-free gesture signals.The first approach is to At early stages of device-free gesture recognition,research- use an adversarial network as domain discriminator to help the ers collect statistical parameters from gesture signals as fea- feature-extracting network in generating domain-independent tures,such as mean,variance,and standard deviation.Such features [16.However,the training process requires huge statistical features cannot adapt to speed variations and often datasets from multiple domains,which leads to high data lead to models that are not robust enough for real-world collection costs.The second approach is to use geometric applications.Some other unsuccessful efforts indicate that models to recombine signals measured through multiple links linear transformation is inherently insufficient for handling the into a domain-independent body-coordinate velocity profile complicated nonlinearity fluctuation in patterns [25]. [19].However,this domain-adaption method uses multiple The DTW algorithm is a pattern matching algorithm with devices and assumes that accurate user locations are known. a nonlinear time-normalization effect.While directly applying Cross-domain adaptation.Instead of using domain- DTW on the gesture waveforms have been applied in device- independent features,we can also transfer a domain-specific free keystroke matching,it only works when gestures start gesture recognition model into the target domain.One ap- from a fixed point and is not robust enough for daily gesture proach is to use transfer learning schemes to retrain the model recognition tasks [32].As an example,consider the gesture using a small number of samples in the target domain [26]- samples in Figure 1(a)and 1(e),which have durations of [28].Another way is to use neural networks or geometric 1.5 and 3 seconds,respectively.Besides the differences in models to transfer the samples in the source domain to the frequencies caused by different movement speeds,the two target domain in order to boost the number of training samples waveforms also have different initial phases and different low- in the target domain [18],[29].Compared to these approaches frequency components.The initial phase and low-frequency that need samples in the target domain for bootstrapping,the components of the gesture signal depend on the starting DSW scheme can evaluate the similarity of gestures from position and small body movements [3],which are noise for unknown target domains gesture recognition.However,these noisy factors dominate the Dynamic Time Warping schemes.The DTW algorithm similarity calculated by DTW so that DTW could not find a is originally designed for matching speech signals that have suitable matching for these two samples in the time domain. different lengths in time [20].As human activities also have Fortunately,STFT and other time-frequency analysis allow variable durations,DTW has been adopted for various types of us to focus on the frequency domain without being distracted activity recognition systems [30],[31].In device-free gesture by phases and low-frequency components.The spectrograms recognition,DTW has been applied for matching either the raw in Figure 1(b)and Figure 1(f)clearly show how the dis- gesture signals [32].[33]or the extracted features [21].[34].tribution of Doppler frequency changes over time,which However,these DTW applications only consider the scaling is directly connected to changes in the movement speed. in time rather than the scaling of speed distribution and the However,matching spectrograms is challenging since gesture consistency of movement distance. speed changes introduce both frequency shifts and gesture
The main contributions of our work are as follows: • We propose a new similarity measure that can adapt to the speed variations in gesture signals of different domains. • We formally prove the properties of the speed adaptive signal matching scheme and show that the result of DSW is a valid similarity measure. • Using real-world ultrasound gesture signals, we show that the DSW algorithm can serve as a solution for both one-shot learning in supervised gesture recognition and unsupervised gesture clustering tasks. II. RELATED WORKS Existing works that are closely related to our approach can be categorized into three areas: domain-independent feature extraction, cross-domain adaptation, and DTW-based schemes. Domain-independent feature extraction. Early device-free gesture recognition systems use statistical values (mean, variance, etc.) of the signals [21]–[23] or Doppler speeds [24], [25] as the gesture features. However, it’s well known that these features are dependent on the user, the location of devices, and multi-path conditions introduced by the environment. There are two major approaches to extract domain-independent features for device-free gesture signals. The first approach is to use an adversarial network as domain discriminator to help the feature-extracting network in generating domain-independent features [16]. However, the training process requires huge datasets from multiple domains, which leads to high data collection costs. The second approach is to use geometric models to recombine signals measured through multiple links into a domain-independent body-coordinate velocity profile [19]. However, this domain-adaption method uses multiple devices and assumes that accurate user locations are known. Cross-domain adaptation. Instead of using domainindependent features, we can also transfer a domain-specific gesture recognition model into the target domain. One approach is to use transfer learning schemes to retrain the model using a small number of samples in the target domain [26]– [28]. Another way is to use neural networks or geometric models to transfer the samples in the source domain to the target domain in order to boost the number of training samples in the target domain [18], [29]. Compared to these approaches that need samples in the target domain for bootstrapping, the DSW scheme can evaluate the similarity of gestures from unknown target domains. Dynamic Time Warping schemes. The DTW algorithm is originally designed for matching speech signals that have different lengths in time [20]. As human activities also have variable durations, DTW has been adopted for various types of activity recognition systems [30], [31]. In device-free gesture recognition, DTW has been applied for matching either the raw gesture signals [32], [33] or the extracted features [21], [34]. However, these DTW applications only consider the scaling in time rather than the scaling of speed distribution and the consistency of movement distance. III. DEVICE-FREE GESTURE MATCHING In this section, we first summarize the state-of-the-art gesture matching methods and their limitations. We then describe our insight on the characteristics of device-free gesture signals. Finally, we demonstrate the benefits of using such speedadaptive characteristics for gesture similarity calculation. A. Gesture Matching Methods Device-free gesture recognition systems collect radio/sound signals reflected by the hand to perform gesture recognition. We call these radio/sound signals gesture signals. The mostwidely used gesture signals are complex-valued baseband signals that have Doppler frequencies corresponding to the hand movement speeds [1], [3], [25]. For instance, Figure 1(a) and 1(e) show the ultrasonic baseband signals of two samples of the writing “W” gesture, where the user writes the letter “W” in-the-air by moving hand back-and-forth twice. The Icomponent and the Q-component are the real and imaginary parts of the gesture signal. Meanwhile, Figure 1(b) and 1(f) show the corresponding spectrogram calculated through Shorttime Fourier transform (STFT). We can observe that the gesture signal has a negative Doppler frequency when the hand is moving away and has a positive frequency when the hand is moving back. Thus, the gesture signals show specific patterns that can be matched with gesture movements and actions. At early stages of device-free gesture recognition, researchers collect statistical parameters from gesture signals as features, such as mean, variance, and standard deviation. Such statistical features cannot adapt to speed variations and often lead to models that are not robust enough for real-world applications. Some other unsuccessful efforts indicate that linear transformation is inherently insufficient for handling the complicated nonlinearity fluctuation in patterns [25]. The DTW algorithm is a pattern matching algorithm with a nonlinear time-normalization effect. While directly applying DTW on the gesture waveforms have been applied in devicefree keystroke matching, it only works when gestures start from a fixed point and is not robust enough for daily gesture recognition tasks [32]. As an example, consider the gesture samples in Figure 1(a) and 1(e), which have durations of 1.5 and 3 seconds, respectively. Besides the differences in frequencies caused by different movement speeds, the two waveforms also have different initial phases and different lowfrequency components. The initial phase and low-frequency components of the gesture signal depend on the starting position and small body movements [3], which are noise for gesture recognition. However, these noisy factors dominate the similarity calculated by DTW so that DTW could not find a suitable matching for these two samples in the time domain. Fortunately, STFT and other time-frequency analysis allow us to focus on the frequency domain without being distracted by phases and low-frequency components. The spectrograms in Figure 1(b) and Figure 1(f) clearly show how the distribution of Doppler frequency changes over time, which is directly connected to changes in the movement speed. However, matching spectrograms is challenging since gesture speed changes introduce both frequency shifts and gesture
400 200 Time (s) Time(s) Index after Scaling (a)Sample A (writing W.1.5 seconds). (b)STFT result for sample A. (c)DTW over STFT for sample A. (d)DSW over STFT for sample A. 50 60 2 20 40 0 10 Time (s) Time(e倒 Index after Scaling Index after Scaling (e)Sample B(writing W,3 seconds). (f)STFT result for sample B. (g)DTW over STFT for sample B.(h)DSW over STFT for sample B. Figure 1.Two samples of writing W and corresponding matching results with different methods. duration changes.For example,the slower gesture sample B total movement distance along the trajectory.In this way,we in Figure 1(f)has a smaller frequency variation that lasts for a can stretch and match the movement stages of gestures with long time in the spectrogram.It is challenging to find the right different speeds,as shown in Figure 1(d)and Figure 1(h). scaling factor in both time and frequency domain to match the Definitions: spectrogram of sample A with sample B.Figure 1(c)and 1(g) Lets(t)= >pepsp(t)be the baseband gesture signal, show the stretched results when we directly apply DTW on where P is the set of signal propagation paths and sp(t)is the STFT spectrogram.We observe that the DTW algorithm the complex-valued signal along path p.We define the function does not scale and match the right stages for sample A and B. D(f,t),which is the square-root of the power spectral density Furthermore,the speed distributions in each stage of the two of the gesture signal at time t: results are still quite different. et+△t Neural networks,such as CNNs,can be used in classifying D(f,)= s(T)e-jwr (1) spectrograms with different scaling factors when trained by a large number of samples [7],[35].However,the CNN does not where At is the length for a time frame.We define the consider underlying physical models of gesture spectrogram so trajectory that the hand moves through within At as a micro- that the training samples should exhaustively cover all speed unit.Consider two gesture signal samples sA(t)and sB(t)of and duration combinations of the same gesture.This incurs the same gesture type.We are looking for a mapping function formidable costs in the data collection and training process. 6:R×Rt→R×R that maps the time and frequency of the In summary,we need to find a new nonlinear pattern match- gesture samples,so that map(DA(f,t),6)/DB(f,t)=a(t). ing algorithm that can compare gesture signals with different where a(t)is a constant factor that only depends on time. durations without the labelling information.Additionally,the Assumptions: algorithm needs to accommodate different speed distributions We use the generalized coordinates r to denote the location while keeping track of the movement distance. of different parts of the hand,e.g.,r is the coordinates of each finger and the palm concatenated into a single vector.Thus, B.Gesture speed adaptation different parts of the hand may go along different trajectories Before we formally define and prove the properties for in the same gesture.In this section,we first consider the ideal gesture speed adaptation,we first use an example to show the case where the trajectories of the same gesture are exactly intuition of our matching algorithm design.Our key insight is same so that the trace of each part is fixed.This assumption that the type of the gesture is determined by the movement is relaxed in Section IV. trajectory of the hand rather than the movement speed.Given We assume that At is small so that the hand moves for a a certain movement stage on the trajectory,the user may move short distance during one micro-unit.We further assumes that the hand at different speeds,but the different parts of the the signal amplitude and the movement speed is not changed hand speed up/slow down proportionally.For example,the for one micro-unit.In practice,we set At to 40 milliseconds so thumb and the index fingers move at different speeds towards that the hand only moves for less than 4 cm in one micro-unit. each other in the click gesture.However,when the user clicks So,this assumption is reasonable for real-world applications. slowly,both fingers slow down by the same factor.Therefore, All of the gesture signals are treated as continuous signals we can scale the speed distribution of a slow gesture to match during the following problem description and proofs. with a fast gesture at the same stage.Furthermore,the stages Speed Adaptation Properties: of the gesture movement are determined by the position of hand on the trajectory.Therefore,we can track the movement Property 1.For two signal samples sA(t)and sB(t)of stages of gestures with different speeds by cumulating the the same gesture type,consider the single micro-unit from
01234 Time (s) 200 400 600 800 1000 I\Q (normalized) I Q (a) Sample A (writing W, 1.5 seconds). Energy 1234 Time (s) 50 25 0 -25 -50 Frequency Shift (Hz) 0 1 2 3 4 5 6 104 (b) STFT result for sample A. Energy (normalized) 20 40 60 80 Index after Scaling 50 25 0 -25 50 Frequency Shift (Hz) 0.1 0.2 0.3 0.4 (c) DTW over STFT for sample A. Energy (normalized) 5 10 15 20 25 30 Index after Scaling 50 25 0 -25 -50 Frequency Shift (Hz) 0.1 0.2 0.3 0.4 (d) DSW over STFT for sample A. 01234 Time (s) 200 400 600 800 1000 I\Q (normalized) I Q (e) Sample B (writing W, 3 seconds). Energy 1234 Time (s) 50 25 0 -25 -50 Frequency Shift (Hz) 0 5 10 15 104 (f) STFT result for sample B. Energy (normalized) 20 40 60 80 Index after Scaling 50 25 0 -25 50 Frequency Shift (Hz) 0 0.1 0.2 0.3 0.4 (g) DTW over STFT for sample B. Energy (normalized) 5 10 15 20 25 Index after Scaling 50 25 0 -25 -50 Frequency Shift (Hz) 0 0.1 0.2 0.3 0.4 (h) DSW over STFT for sample B. Figure 1. Two samples of writing W and corresponding matching results with different methods. duration changes. For example, the slower gesture sample B in Figure 1(f) has a smaller frequency variation that lasts for a long time in the spectrogram. It is challenging to find the right scaling factor in both time and frequency domain to match the spectrogram of sample A with sample B. Figure 1(c) and 1(g) show the stretched results when we directly apply DTW on the STFT spectrogram. We observe that the DTW algorithm does not scale and match the right stages for sample A and B. Furthermore, the speed distributions in each stage of the two results are still quite different. Neural networks, such as CNNs, can be used in classifying spectrograms with different scaling factors when trained by a large number of samples [7], [35]. However, the CNN does not consider underlying physical models of gesture spectrogram so that the training samples should exhaustively cover all speed and duration combinations of the same gesture. This incurs formidable costs in the data collection and training process. In summary, we need to find a new nonlinear pattern matching algorithm that can compare gesture signals with different durations without the labelling information. Additionally, the algorithm needs to accommodate different speed distributions while keeping track of the movement distance. B. Gesture speed adaptation Before we formally define and prove the properties for gesture speed adaptation, we first use an example to show the intuition of our matching algorithm design. Our key insight is that the type of the gesture is determined by the movement trajectory of the hand rather than the movement speed. Given a certain movement stage on the trajectory, the user may move the hand at different speeds, but the different parts of the hand speed up/slow down proportionally. For example, the thumb and the index fingers move at different speeds towards each other in the click gesture. However, when the user clicks slowly, both fingers slow down by the same factor. Therefore, we can scale the speed distribution of a slow gesture to match with a fast gesture at the same stage. Furthermore, the stages of the gesture movement are determined by the position of hand on the trajectory. Therefore, we can track the movement stages of gestures with different speeds by cumulating the total movement distance along the trajectory. In this way, we can stretch and match the movement stages of gestures with different speeds, as shown in Figure 1(d) and Figure 1(h). Definitions: Let s(t) = P p2P sp(t) be the baseband gesture signal, where P is the set of signal propagation paths and sp(t) is the complex-valued signal along path p. We define the function D(f, t), which is the square-root of the power spectral density of the gesture signal at time t: D(f, t) = Z t+t t s(⌧ )e j!⌧ d⌧ , (1) where t is the length for a time frame. We define the trajectory that the hand moves through within t as a microunit. Consider two gesture signal samples sA(t) and sB(t) of the same gesture type. We are looking for a mapping function : R⇥R+ 0 ! R⇥R+ 0 that maps the time and frequency of the gesture samples, so that map (DA(f, t), ) /DB(f, t) = ↵(t), where ↵(t) is a constant factor that only depends on time. Assumptions: • We use the generalized coordinates r to denote the location of different parts of the hand, e.g., r is the coordinates of each finger and the palm concatenated into a single vector. Thus, different parts of the hand may go along different trajectories in the same gesture. In this section, we first consider the ideal case where the trajectories of the same gesture are exactly same so that the trace of each part is fixed. This assumption is relaxed in Section IV. • We assume that t is small so that the hand moves for a short distance during one micro-unit. We further assumes that the signal amplitude and the movement speed is not changed for one micro-unit. In practice, we set t to 40 milliseconds so that the hand only moves for less than 4 cm in one micro-unit. So, this assumption is reasonable for real-world applications. All of the gesture signals are treated as continuous signals during the following problem description and proofs. Speed Adaptation Properties: Property 1. For two signal samples sA(t) and sB(t) of the same gesture type, consider the single micro-unit from
location rs to rt,where the movement period for sample A is map(DA(f,tA),6)=DA(f/a(tA),tA),where a(t)is the [tA:tA+TA]and for sample B is [tB,tB+TBl.Without loss scaling factor function.Then the mapping function satisfies of generality,we assume TA TB. the following relationship: Then,the two spectrograms DA(f,t)and DB(f,t)satisfy N,f-1t≤tA3a=-二 the following relationship: map(DA(f,t),5)=DA(f/a(t),t)=a(t)DB(f,t),(6) DA(f,tA)=aDB(af,tB), (2) where o(1)is called scaling ratio of the current where t is in (ti,ti. speed distribution. Proof.Consider the ith micro-unit ui of the trajectory.Ac- cording to property 1,there exists oo =(tAi-tAi-1)/(tBi- Proof.We first consider a single part h of the hand,e.g.,the tBi-1)and we can scale the distribution of sample A at oo to index finger,that is moved by a small distance of dh from get the same distribution as sample B.So, location rs to rt.By our assumption that the movement speed DA(f/ao,t)=aoDB(f,t) (7) is constant for the short time duration,we have the movement speed of h,vh.A dh/TA,in sample A and vh,B=dh/TB Note that every micro-unit is small enough that the distribution in sample B.Therefore,we have Uh.A =Un.B/a,where a= in frequency domain is unchanged during this interval.From TA/TB. the above,we can get the complete expression of the scaling In the gesture signal,suppose there is a path p corresponding ratio,which is to the reflection from part h of the hand.Using the signal models in [3],we have: a(t)=a0= tAi-tA-1,tE(tAi-1t,iEN.(8) tBi-tBi-1 Sp(t)=Ape-j(wunt/c+0p) (3) ▣ where w is the carrier frequency of the passband signal.As Property 1 and 2 show that we can achieve speed alignment sample A and sample B start from the same location,we can between different samples by dynamically scaling the frequen- use the same Ap and p in Eq.(3). cies while ensuring that the aligned segments represent the Now consider the Fourier transform of gesture signal in same trajectory.Note that we can find the exact match between sample A.Sp.A(f,t)=F[sp.A(t)}: frequency distributions when the hand precisely follows the Sp.A(f,tA)=FApe-i(wh.A()/e0) trajectory r(t).In reality,the hand may deviate from the trajectory r(t)so that our goal is to minimize the difference FApe-joh((-i)/a)/e) between the matched frequency distributions. aSp.B(af,tB), (4) IV.DYNAMIC SPEED WARPING In this section,we design an optimization algorithm for where t1 E [tA,tA+TA]and t2 E [tB,tB+TB].In the last measuring similarity of gesture signals based on speed adapt- step,we use the time scaling property of the Fourier transform, ation properties derived in Section III.Our optimization prob- F()(f),where the capitalized function X(f) lem considers small variations in gesture trajectories so that is the Fourier transforms of the non-capitalized function x(t).our objective is to minimize the difference between gesture As Eq.(4)holds for all different parts of the hand,we spectrograms instead of finding the exact mapping functions can use the linearity of continuous-time Fourier transform as in Section III.We then present a dynamic programming Ffax(t)+by(t)}ax(f,t)+bY(f,t).to get: algorithm,which is similar to the DTW algorithm.to find the optimal solution that satisfies both the speed and the cumulative movement distance constraints.Finally,we show Da(f.ta) 下{图-图 the micro-benchmark of this similarity measure on gesture signals and discuss the impact of parameters for the algorithm. DB(af,tB) (5) A.Problem Formulation ◇ In reality,the gesture spectrogram is discretized in both the time and the frequency domain.Given two spectrograms Property 2.Consider that two gesture signal samples A X[n],n 1...N and Y[m];m 1...M,where[n]is the and B of the same gesture type.We can divide the entire spectral of time-frame n.Consider a discrete warping function trajectory into micro-units u1,u2,....Assume that the time F[k=c().k 1...K,where c(k)=(i(k),j()),such that sample A passes through these micro-units in turn is that maps the spectral of X[i(k)]onto that of Y[j(k)]at the tA1,tA2,...while the time for sample B is tB1,tB2,.... kth warping.Here,i()and j(k)represents the frame index There exists a mapping function:R×Rd→R×Rd that of X and Y,respectively.The warping function should be: IIf there are more than one path corresponds to a single hand component. 2Note that (t)can be greater than 1 here,which means that in the current we can treat each path as a separate component with a different path speed micro-unit,the sample B is mapped to the sample A at the scaling ratio of and the above result still holds. 1/a(t)
location rs to rt, where the movement period for sample A is [tA, tA + TA] and for sample B is [tB, tB + TB]. Without loss of generality, we assume TA TB. Then, the two spectrograms DA(f, t) and DB(f, t) satisfy the following relationship: DA(f, tA) = ↵DB(↵f, tB), (2) where ↵ (↵ = TA TB 1) is called scaling ratio of the current speed distribution. Proof. We first consider a single part h of the hand, e.g., the index finger, that is moved by a small distance of dh from location rs to rt. By our assumption that the movement speed is constant for the short time duration, we have the movement speed of h, vh,A = dh/TA, in sample A and vh,B = dh/TB in sample B. Therefore, we have vh,A = vh,B/↵, where ↵ = TA/TB. In the gesture signal, suppose there is a path p corresponding to the reflection from part h of the hand1. Using the signal models in [3], we have: sp(t) = Apej(!vht/c+✓p) , (3) where ! is the carrier frequency of the passband signal. As sample A and sample B start from the same location, we can use the same Ap and ✓p in Eq. (3). Now consider the Fourier transform of gesture signal in sample A, Sp,A(f, t) = F{sp,A(t)}: Sp,A(f, tA) = F n Apej(!vh,A(t1tA)/c+✓p) o = F n Apej(!vh,B((t2tB)/↵)/c+✓p) o = ↵Sp,B(↵f, tB), (4) where t1 2 [tA, tA + TA] and t2 2 [tB, tB + TB]. In the last step, we use the time scaling property of the Fourier transform, F{x(kt)} = 1 |k|X(f /k), where the capitalized function X(f) is the Fourier transforms of the non-capitalized function x(t). As Eq. (4) holds for all different parts of the hand, we can use the linearity of continuous-time Fourier transform F{ax(t) + by(t)} = aX(f, t) + bY (f, t), to get: DA(f, tA) = F (X p2P sp,A(tA) ) = X p2P Sp,A(f, tA) = X p2P ↵Sp,B(↵f, tB) = ↵DB(↵f, tB) (5) Property 2. Consider that two gesture signal samples A and B of the same gesture type. We can divide the entire trajectory into micro-units u1, u2, ··· . Assume that the time that sample A passes through these micro-units in turn is tA1, tA2, ··· while the time for sample B is tB1, tB2, ··· . There exists a mapping function : R ⇥ R+ 0 ! R ⇥ R+ 0 that 1If there are more than one path corresponds to a single hand component, we can treat each path as a separate component with a different path speed and the above result still holds. map(DA(f, tA), ) = DA(f /↵(tA), tA), where ↵(t) is the scaling factor function. Then the mapping function satisfies the following relationship: 8i 2 N⇤, if tAi1 t tAi, 9 ↵(t) = tAitAi1 tBitBi1 , map(DA(f, t), ) = DA(f /↵(t), t) = ↵(t)DB(f, t0 ), (6) where t 0 is in (tBi1, tBi] 2. Proof. Consider the ith micro-unit ui of the trajectory. According to property 1, there exists ↵0 = (tAi tAi1)/(tBi tBi1) and we can scale the distribution of sample A at ↵0 to get the same distribution as sample B. So, DA(f /↵0, t) = ↵0DB(f, t0 ) (7) Note that every micro-unit is small enough that the distribution in frequency domain is unchanged during this interval. From the above, we can get the complete expression of the scaling ratio, which is ↵(t) = ↵0 = tAi tAi1 tBi tBi1 , t 2 (tAi1 , tAi] , i 2 N⇤ . (8) Property 1 and 2 show that we can achieve speed alignment between different samples by dynamically scaling the frequencies while ensuring that the aligned segments represent the same trajectory. Note that we can find the exact match between frequency distributions when the hand precisely follows the trajectory r(t). In reality, the hand may deviate from the trajectory r(t) so that our goal is to minimize the difference between the matched frequency distributions. IV. DYNAMIC SPEED WARPING In this section, we design an optimization algorithm for measuring similarity of gesture signals based on speed adaptation properties derived in Section III. Our optimization problem considers small variations in gesture trajectories so that our objective is to minimize the difference between gesture spectrograms instead of finding the exact mapping functions as in Section III. We then present a dynamic programming algorithm, which is similar to the DTW algorithm, to find the optimal solution that satisfies both the speed and the cumulative movement distance constraints. Finally, we show the micro-benchmark of this similarity measure on gesture signals and discuss the impact of parameters for the algorithm. A. Problem Formulation In reality, the gesture spectrogram is discretized in both the time and the frequency domain. Given two spectrograms X[n], n = 1 ...N and Y [m], m = 1 ...M, where X[n] is the spectral of time-frame n. Consider a discrete warping function F[k] = c(k),k = 1 ...K, where c(k)=(i(k), j(k)), such that maps the spectral of X[i(k)] onto that of Y [j(k)] at the kth warping. Here, i(k) and j(k) represents the frame index of X and Y , respectively. The warping function should be: 2Note that ↵(t) can be greater than 1 here, which means that in the current micro-unit, the sample B is mapped to the sample A at the scaling ratio of 1/↵(t)
monotonic,i.e.,i(k-1)<i(k),j(k-1)<j(k),continuous, Algorithm 1:Basic Dynamic Speed Warping i.e.,i()-i(k-1)≤1,j(k)-j(k-1)≤1,and meeting Input:Two spectrograms Xnl,n 1,...,N and the boundary conditions,i.e.,i(1)=1,j(1)=1,i(K)= Y[m,m=1,·,M,scaling ratio list ou,v=l,·,V N,j(K)=M as in traditional DTW [20]. 0 utput::min(dDsw[N,M,u,4l,v=1,…,V,μ=0,1. /dpswn,m,v,u:4d array recording distance We define the scaling operation a(k)based on Property 1 between two spectrograms at each middle state. to perform speed adaption.For example,performing an a(k)- //Isn,m,v,ul:4d array recording differences of times scaling on X[i(k)]means linearly stretching its spectral duration for scaled X and Y from initial to current state. in the frequency domain by 1/a(k)times while shortening /Initialization. * duration of this frame to a(k)in the time domain.In the 1 for n 1,...,N,m =1,...,M.v =1,...,V do following discussion,we always have a(k)<1 and only scale 2 dpsw[n,m,u,l←o 3 dDsw0,0,u,4←0 one of the X or Y.We define the indicator function I(k)to 4 lsn,m,,4←0 represent the object of the scaling operation: 5 end 6 for n 1,...,N,m 1,...,M.v =1,...,V do 0 Scale x[i(k)], /scale Xn */ I()= (9) 7 1 Scale Y[i(k)] if=0 then //91,92:candidate index sets /d1,d2:distance of alternative paths. Then,we define the distance between two spectral vectors s dist =1-cos(Xalu]In],Y[m]) X[i(k)】andY[i(ke】after scaling as: /case 1:(n-1,m)(n,m) 9 9←{(n-1,m,u,0)|llsn-1,m,u,0l≤Q: d(I,c,a,k)=I(k){1-cos(x[i()],Ya(k() d←min({dpsw(n-1,m,u,0]I(n-1,m,u,0)∈91} /+case2:(n-1,m-1)→(n,m) ★ +(1-I(k)){1-cos(Xa(k)[i()],Y[j()] (10) 1 92←{(n-1,m-1,v,4)1lln-1,m-1,v,μ]l≤Q: where cos(X,Y)=(X.Y)/(represents the 12 d2←min({dpsw[n-1,m-1,u,μ]|(m-1,m- 1,0,4)∈92}): cosine distance between vector X and Y.The objective of dpsw [n,m,v,]+min(d1 dist,d2 +2dist) DSW is to find the optimal F.a and I that minimize the 14 Update ls according to equation 16. average distance after the warping operation: 15 end /Scale Y[ml */ else DSW(X,Y)=min ∑KdL,ca,k)u(因 17 dpsw[n,m,v,1]can be calculated in a similar way as ∑太1w(k (*) doswn,m,v,0. 18 end The weight w(k)is a normalizer for different time durations, 19 end 20 for v =1,...,V and u=0,1 do where we use the symmetric form w(k)=(i()-i(k-1))+ dpsw [N,M,v,udpsw [N,M,v,u]/(N+M) (j(k)-j(k-1))so that w(k)=N+M. 22 end Our optimization needs to satisfy the timing constraint so 23 return min(dpsw [N,M,v,u).v=1;...,V,=0,1. that gesture stages can be aligned.Since the scaling operation not only changes frequency distributions but also changes the time of frames,we define the frame time of X[i(k)]and Yk]]after the kth warping: HereT()represents sum of the scaled duration from X[i(1)]to X[i(k)]and Q is the threshold of duration (1-I()·a()I()=0,w()=2 differences that decides suitable candidate warping functions. 1 T)= I(k)=1,w()=2 (1-I()·a(k)I(k)=0,w()=1 (11) B.Basic Dynamic Speed Warping 0 I()=1,w(k)=1 In this section we first present a dynamical programming When we choose to scale Xi(k)]by a(k),whatever the c(k- algorithm to solve the above optimization,which serves as a 1)is,the time duration of X[i(k)]is scaled to a(k).However, basic version of our final solution.We use a four-dimensional if we don't scale X[i()],there are two cases.If X [i(k)]has array dpsw to maintain the smallest dissimilarity between the been matched by Y[j(k-1)],which is equivalent to i(k)= matching part of X and Y at each intermediate state.In the i(k-1).then we set Ti(k)=0.Otherwise,we set Ti(k)=1. array dpswn,m,v,u,the first and the second index,n and The above piecewise function of Ti(k)can be rewritten as: m,are the current time-frame index for X and Y.The third index v indicates the scaling factor for the matching and the T=I(k)·(w(E)-1)+(1-I()·a(E) (12) fourth index u=0,1 is the indication function for whether By symmetry,we have Ti(k)as shown in Eq.(13). scaling X or Y.Thus,dpsw[n,m,v,0]is the shortest distance between the matched part of two spectrograms when T)=(1-I(k)·(w()-1)+I()·a() (13) Xn is scaled by av]times and then matched with Y[m]. Therefore,the timing constraint can be expressed as: We use a searchable scaling ratio array av],v =1,...,V for convenience.We also maintain another four-dimensional array T-sQ. ls,which tracks the difference of duration for the matched part **) of X and Y to meet the constraint in Eq.(**)
monotonic, i.e., i(k 1) i(k), j(k 1) j(k), continuous, i.e., i(k) i(k 1) 1, j(k) j(k 1) 1, and meeting the boundary conditions, i.e., i(1) = 1, j(1) = 1, i(K) = N, j(K) = M as in traditional DTW [20]. We define the scaling operation ↵(k) based on Property 1 to perform speed adaption. For example, performing an ↵(k)- times scaling on X[i(k)] means linearly stretching its spectral in the frequency domain by 1/↵(k) times while shortening duration of this frame to ↵(k) in the time domain. In the following discussion, we always have ↵(k) < 1 and only scale one of the X or Y . We define the indicator function I(k) to represent the object of the scaling operation: I(k) = ( 0 Scale X[i(k)], 1 Scale Y [j(k)]. (9) Then, we define the distance between two spectral vectors X[i(k)] and Y [j(k)] after scaling as: d(I, c, ↵, k) = I(k) 1 cos ⌦ X[i(k)],Y↵(k)[j(k)]↵ +(1 I(k)) 1 cos ⌦ X↵(k)[i(k)],Y [j(k)]↵ (10) where cos hX,Y i = (X · Y )/ (kXk kY k) represents the cosine distance between vector X and Y . The objective of DSW is to find the optimal F, ↵ and I that minimize the average distance after the warping operation: DSW(X,Y ) = min F,↵,I "PK k=1 d(I, c, ↵, k)w(k) PK k=1 w(k) # . (⇤) The weight w(k) is a normalizer for different time durations, where we use the symmetric form w(k)=(i(k) i(k 1))+ (j(k) j(k 1)) so that PK k=1 w(k) = N + M. Our optimization needs to satisfy the timing constraint so that gesture stages can be aligned. Since the scaling operation not only changes frequency distributions but also changes the time of frames, we define the frame time of X[i(k)] and Y [j[k]] after the kth warping: Ti(k) = 8 >>< >>: (1 I(k)) · ↵(k) I(k)=0, w(k)=2 1 I(k)=1, w(k)=2 (1 I(k)) · ↵(k) I(k)=0, w(k)=1 0 I(k)=1, w(k)=1 (11) When we choose to scale X[i(k)] by ↵(k), whatever the c(k 1) is, the time duration of X[i(k)] is scaled to ↵(k). However, if we don’t scale X[i(k)], there are two cases. If X[i(k)] has been matched by Y [j(k 1)], which is equivalent to i(k) = i(k 1), then we set Ti(k) = 0. Otherwise, we set Ti(k) = 1. The above piecewise function of Ti(k) can be rewritten as: Ti(k) = I(k) · (w(k) 1) + (1 I(k)) · ↵(k) (12) By symmetry, we have Tj(k) as shown in Eq. (13). Tj(k) = (1 I(k)) · (w(k) 1) + I(k) · ↵(k) (13) Therefore, the timing constraint can be expressed as: XK k=1 Ti(k) XK k=1 Tj(k) Q, (⇤⇤) Algorithm 1: Basic Dynamic Speed Warping Input: Two spectrograms X[n], n = 1, ··· , N and Y [m], m = 1, ··· , M, scaling ratio list ↵[v], v = 1, ··· , V . Output: min(dDSW [N,M, v, µ]), v = 1, ··· ,V,µ = 0, 1. // dDSW [n, m, v, µ]: 4d array recording distance between two spectrograms at each middle state. // ls[n, m, v, µ]: 4d array recording differences of duration for scaled X and Y from initial to current state. /* Initialization. */ 1 for n = 1,...,N, m = 1,...,M, v = 1,...,V do 2 dDSW [n, m, v, µ] 1 3 dDSW [0, 0, v, µ] 0 4 ls[n, m, v, µ] 0 5 end 6 for n = 1,...,N, m = 1,...,M, v = 1,...,V do /* Scale X[n] */ 7 if µ = 0 then // q1, q2: candidate index sets // d1, d2: distance of alternative paths. 8 dist = 1 coshX↵[v][n], Y [m] i /* Case 1: (n 1, m) ! (n, m) */ 9 q1 {(n 1, m, v, 0) | |ls[n 1, m, v, 0]| Q}; 10 d1 min({dDSW [n1, m, v, 0] | (n1, m, v, 0) 2 q1}); /* Case 2: (n 1, m 1) ! (n, m) */ 11 q2 {(n1, m1, v, µ0 ) | |ls[n1, m1, v, µ0 ]| Q}; 12 d2 min({dDSW [n 1, m 1, v, µ0 ] | (n 1, m 1, v, µ0 ) 2 q2}); 13 dDSW [n, m, v, 0] min(d1 + dist, d2 + 2 dist) 14 Update ls according to equation 16. 15 end /* Scale Y [m] */ 16 else 17 dDSW [n, m, v, 1] can be calculated in a similar way as dDSW [n, m, v, 0]. 18 end 19 end 20 for v = 1,...,V and µ = 0, 1 do 21 dDSW [N,M, v, µ] dDSW [N,M, v, µ]/(N + M) 22 end 23 return min(dDSW [N,M, v, µ]), v = 1, ··· ,V, µ = 0, 1. Here PK k=1 Ti(k) represents sum of the scaled duration from X[i(1)] to X[i(k)] and Q is the threshold of duration differences that decides suitable candidate warping functions. B. Basic Dynamic Speed Warping In this section we first present a dynamical programming algorithm to solve the above optimization, which serves as a basic version of our final solution. We use a four-dimensional array dDSW to maintain the smallest dissimilarity between the matching part of X and Y at each intermediate state. In the array dDSW [n, m, v, µ], the first and the second index, n and m, are the current time-frame index for X and Y . The third index v indicates the scaling factor for the matching and the fourth index µ = 0 , 1 is the indication function for whether scaling X or Y . Thus, dDSW [n, m, v, 0] is the shortest distance between the matched part of two spectrograms when X[n] is scaled by ↵[v] times and then matched with Y [m]. We use a searchable scaling ratio array ↵[v], v = 1, ··· , V for convenience. We also maintain another four-dimensional array ls, which tracks the difference of duration for the matched part of X and Y to meet the constraint in Eq. (**)