Skeleton-Aware Neural Sign Language Translation

Shiwei Gan, Yafeng Yin*, Zhiwei Jiang, Lei Xie, Sanglu Lu
State Key Laboratory for Novel Software Technology, Nanjing University, China
gsw@smail.nju.edu.cn, yafeng@nju.edu.cn, jzw@nju.edu.cn, lxie@nju.edu.cn, sanglu@nju.edu.cn

ABSTRACT
As an essential communication way for deaf-mutes, sign languages are expressed by human actions. To distinguish human actions for sign language understanding, the skeleton, which contains position information of the human pose, can provide an important cue, since different actions usually correspond to different poses/skeletons. However, the skeleton has not been fully studied for Sign Language Translation (SLT), especially for end-to-end SLT. Therefore, in this paper, we propose a novel end-to-end Skeleton-Aware neural Network (SANet) for video-based SLT. Specifically, to achieve end-to-end SLT, we design a self-contained branch for skeleton extraction. To efficiently guide the feature extraction from video with skeletons, we concatenate the skeleton channel and RGB channels of each frame for feature extraction. To distinguish the importance of clips, we construct a skeleton-based Graph Convolutional Network (GCN) for feature scaling, i.e., giving an importance weight to each clip. The scaled features of each clip are then sent to a decoder module to generate spoken language. In our SANet, a joint training strategy is designed to optimize skeleton extraction and sign language translation jointly. Experimental results on two large-scale SLT datasets demonstrate the effectiveness of our approach, which outperforms the state-of-the-art methods. Our code is available at https://github.com/SignLanguageCode/SANet.

CCS CONCEPTS
• Computing methodologies → Activity recognition and understanding.

KEYWORDS
Sign Language Translation; Skeleton; Neural Network

ACM Reference Format:
Shiwei Gan, Yafeng Yin, Zhiwei Jiang, Lei Xie, Sanglu Lu. 2021. Skeleton-Aware Neural Sign Language Translation. In Proceedings of the 29th ACM International Conference on Multimedia (MM '21), October 20–24, 2021, Virtual Event, China. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3474085.3475577

* Yafeng Yin is the corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
MM '21, October 20–24, 2021, Virtual Event, China
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8651-7/21/10...$15.00
https://doi.org/10.1145/3474085.3475577

Figure 1: In sign languages, the same hand gesture in different positions can have different meanings. The blue points and lines represent the skeleton. (a) Sign: doing. (b) Sign: liver.

1 INTRODUCTION
Sign language has been widely adopted as a communication way for deaf-mutes. To build the bridge between deaf-mutes and hearing people, research work on sign language understanding emerged, and the existing work can mainly be categorized as Sign Language Recognition (SLR) [21, 29, 45] and Sign Language Translation (SLT) [5, 15, 25].
Earlier, the sign language-related work usually focused on SLR, which aims at recognizing an isolated sign as a word or expression [12, 24, 37], or recognizing continuous signs as the corresponding word sequence [4, 10, 13, 21]. However, the SLR work neglected the difference between sign language and spoken language in grammatical rules, i.e., the recognized word sequence may not be grammatically correct, thus hindering the understanding of sign language. Recently, due to the advancement of annotated datasets and deep learning technology, SLT has attracted people's attention. SLT is a more challenging task, and its objective is to translate sign language into spoken language, while requiring that the translation results conform to the grammatical rules and linguistic characteristics of the target spoken language.

In regard to SLT, the prior work tended to decompose SLT into two stages, i.e., recognizing continuous signs as a word sequence and then utilizing language models to construct sentences from the words [3, 9]. However, the two-stage methods usually required gloss¹ annotation, which was a labor-intensive task and needed specialists. Recently, due to the development of deep learning technology, Camgoz et al. approached SLT as a neural machine translation task [5], and introduced the encoder-decoder network and attention mechanism for end-to-end SLT for the first time. After that, Camgoz et al. introduced transformer networks for end-to-end SLT from videos [7]. To handle the modality difference between video and language in SLT, feature representation was adopted, i.e., the video is represented as features which are later translated to language.

¹ Here, 'gloss' means a gesture with its closest meaning in natural languages [10].
However, in the existing neural-based methods, the feature representation of video mainly consisted of full-frame [5, 15, 22] or local-area [6] features, while the skeleton information, which reflects the important spatial structure of human pose in sign languages, has not been fully studied. In fact, the skeleton can be used to distinguish signs with different human poses (i.e., different relative positions of hands, arms, etc.), especially for the signs which use the same hand gesture in different positions to represent different meanings, as shown in Figure 1. Therefore, it is meaningful to introduce skeleton information into SLT.

To utilize skeletons for SLT, some work emerged recently [6, 14], which advanced the research of skeleton-assisted SLT. However, the existing research often had the following problems. First, obtaining the skeletons often required an external device [14] or extra offline preprocessing [6], which hindered end-to-end SLT from videos. Second, the videos and skeletons were used as two independent data sources for feature extraction, i.e., they were not fused at the initial stage of feature extraction, thus the video-based feature extraction may not be efficiently guided/enhanced with the skeleton information. Third, each clip (i.e., a short segmented video) was usually treated equally, neglecting the different importance of meaningful (e.g., sign-related) clips and unmeaningful (e.g., end-state) clips. Among these problems, the third one exists not only in skeleton-assisted SLT, but also in much SLT work.

To address the above three problems, we propose a Skeleton-Aware neural Network (SANet). Firstly, to achieve end-to-end SLT, SANet designs a self-contained branch for skeleton extraction. Secondly, to guide the video-based feature extraction with skeletons, SANet concatenates the skeleton channel and RGB channels for each frame, thus the features extracted from images/videos will be affected by the skeleton. Thirdly, to distinguish the importance of clips, SANet constructs a skeleton-based Graph Convolutional Network (GCN) for feature scaling, i.e., giving an importance weight to each clip. Specifically, SANet consists of four components, i.e., FrmSke, ClipRep, ClipScl and LangGen. At first, FrmSke is used to extract the skeleton from each frame and frame-level features for a clip by convolutions and deconvolutions. Then, ClipRep is used to enhance the clip representation by adding a skeleton channel. After that, ClipScl is used to scale the clip representation with a skeleton-based Graph Convolutional Network (GCN). Finally, with the scaled features of clips, LangGen is used to generate spoken language with sequence-to-sequence learning. In addition, we design a joint optimization strategy for model training and achieve end-to-end SLT.

We make the following contributions in this paper.
• We propose a Skeleton-Aware neural Network (SANet) for end-to-end SLT, where a self-contained branch is designed for skeleton extraction and a joint training strategy is designed to optimize skeleton extraction and sign language translation jointly.
• We concatenate the extracted skeleton channel and RGB channels at the source data level, which can highlight human pose-related features and enhance the clip representation.
• We construct skeleton-based graphs and use a graph convolutional network to scale the clip representation, i.e., weight the importance of each clip, which can highlight meaningful clips while weakening unmeaningful clips.
• We conduct extensive experiments on two large-scale public SLT datasets. The experimental results demonstrate that our SANet outperforms the state-of-the-art methods.
2 RELATED WORK
The existing research work on sign languages can be mainly categorized into SLR and SLT, where SLR can be further classified into isolated SLR and continuous SLR. In this section, we review the related work on isolated SLR, continuous SLR, and SLT.

Isolated Sign Language Recognition (ISLR): Isolated SLR aims at recognizing one sign as a word or expression [2, 12, 24], which is similar to gesture recognition [27, 42] and action recognition [32, 40]. The early methods tended to select features from videos manually, and introduced Hidden Markov Models (HMM) [12, 16] to analyze the gesture sequence of a sign (i.e., a human action) for recognition. However, the manually-selected features may limit the recognition performance. Therefore, in recent years, deep learning-based approaches were introduced for isolated SLR. These approaches utilized neural networks to automatically extract features from videos [17], Kinect's sensor data [36], or moving trajectories of skeleton joints [24] for isolated SLR, and often achieved a better performance.

Continuous Sign Language Recognition (CSLR): Continuous SLR aims at recognizing a sequence of signs as the corresponding word sequence [21, 28, 45], thus continuous SLR is more challenging than isolated SLR. To realize CSLR, traditional methods like DTW-HMM [44] and CNN-HMM [21] introduced temporal segmentation and Hidden Markov Models (HMM) to transform continuous SLR into isolated SLR. Considering the possible errors and annotation burden in temporal segmentation, recent deep learning-based methods [18] applied sequence-to-sequence learning for continuous SLR. They learned the correspondence between two sequences from weakly annotated data in an end-to-end manner. However, many approaches tended to adopt the Connectionist Temporal Classification (CTC) loss [13, 20, 43, 45], which requires that source and target sequences have the same order. In fact, the sign sequence in sign language and the word sequence in spoken language can be ordered differently [5], thus the approaches for continuous SLR are not suitable for SLT.

Sign Language Translation (SLT): SLT aims to translate sign languages into spoken languages. Traditional methods [3, 9] usually decomposed SLT into two stages, i.e., continuous SLR and text-to-text translation. The two-stage methods had both gloss annotations and sentence annotations, and thus could be optimized in two stages for better performance [5]. However, annotating glosses requires specialists and is a labor-intensive task. Recently, due to the advancement of public datasets with sentence-level annotations [5, 11, 18] and deep learning technology, a few end-to-end SLT approaches emerged. Camgoz et al. [5] introduced the encoder-decoder framework to realize end-to-end SLT. Guo et al. [14, 15] proposed the hierarchical-LSTM model for end-to-end SLT. Camgoz et al. utilized transformer networks [38] to jointly solve SLR and SLT [7]. Li et al. developed a temporal semantic pyramid encoder and a transformer decoder for SLT [22]. These neural-based approaches often adopted an encoder and a decoder for SLT.

To represent the sign languages, the existing neural-based SLT methods mainly focused on extracting full-frame [5, 15, 22] or local-area features [6, 46] from the video. There was only a little work paying attention to skeleton information (i.e., human pose) for SLT. Specifically, HRF [14] collected skeletons with a depth camera,
while Camgoz et al. [6] extracted skeletons from videos with an existing tool in an offline stage. Then, they fed the RGB videos and skeletons in parallel into a neural network for feature extraction. The external device or offline preprocessing for skeleton extraction hindered end-to-end SLT from videos. Besides, the existing approaches tended to fuse videos and skeletons at the level of extracted features and gave the same importance to each clip, which may limit the performance of feature representation. Differently, we extract the skeleton with a self-contained branch to achieve end-to-end SLT. Besides, we fuse videos and skeletons in the source data by concatenating the skeleton channel and RGB channels, and construct a skeleton-based GCN to weight the importance of each clip.

3 PROPOSED APPROACH
In SLT, when given a sign language video X = (f_1, f_2, ..., f_u) with u frames, our objective is to learn the conditional probability p(Y|X) of generating a spoken language sentence Y = (w_1, w_2, ..., w_v) with v words. The sentence with the highest probability p(Y|X) is chosen as the translated spoken language.

To realize end-to-end SLT, we propose a Skeleton-Aware neural Network (SANet), which consists of FrmSke, ClipRep, ClipScl and LangGen. As shown in Figure 2, a sign language video is segmented into consecutive equal-length clips with 50% overlap. For a clip, we first use FrmSke to extract the skeleton from each frame and frame-level clip features. Then, we concatenate the skeleton channel and RGB channels of each frame in the clip, and adopt ClipRep to extract clip-level features. The frame-level features and clip-level features are added to get the fused features of the clip. Meanwhile, we utilize the skeletons in the clip to construct a spatial-temporal graph and adopt ClipScl to calculate a scale factor, which is multiplied with the fused features to get the scaled feature vector of the clip. Finally, the scaled features of all clips are sent to LangGen for generating the spoken language.

Figure 2: The SANet consists of FrmSke, ClipRep, ClipScl and LangGen, which are used for extracting skeletons and frame-level features, enhancing clip representation, scaling features and generating sentences.
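To make the pipeline above concrete, the following is a minimal PyTorch-style sketch of the clip segmentation (equal-length clips with 50% overlap) and the per-clip data flow through the four components. The clip length of 16 frames, the tensor shapes, the tail-padding scheme, and the four sub-module interfaces are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

def segment_into_clips(frames: torch.Tensor, clip_len: int = 16) -> torch.Tensor:
    """Split a video (u, 3, H, W) into equal-length clips with 50% overlap.

    clip_len = 16 is an assumption; the paper only states that clips are
    equal-length and overlap by 50%.
    """
    stride = clip_len // 2
    u = frames.shape[0]
    # pad the tail (by repeating the last frame) so the last clip is complete
    pad = (-(u - clip_len)) % stride if u > clip_len else clip_len - u
    if pad:
        frames = torch.cat([frames, frames[-1:].repeat(pad, 1, 1, 1)], dim=0)
    clips = frames.unfold(0, clip_len, stride)          # (m, 3, H, W, clip_len)
    return clips.permute(0, 4, 1, 2, 3).contiguous()    # (m, clip_len, 3, H, W)

class SANet(nn.Module):
    """Composition of the four components described above (placeholders)."""
    def __init__(self, frm_ske, clip_rep, clip_scl, lang_gen):
        super().__init__()
        self.frm_ske, self.clip_rep = frm_ske, clip_rep
        self.clip_scl, self.lang_gen = clip_scl, lang_gen

    def forward(self, frames: torch.Tensor):
        clips = segment_into_clips(frames)               # (m, c, 3, H, W)
        scaled = []
        for clip in clips:
            skeleton, f_frame = self.frm_ske(clip)       # (c, H, W), (4096,)
            rgbs = torch.cat([clip, skeleton.unsqueeze(1)], dim=1)  # (c, 4, H, W)
            f_clip = self.clip_rep(rgbs)                 # (4096,)
            sf = self.clip_scl(skeleton)                 # scale factor in [0, 1]
            scaled.append((f_frame + f_clip) * sf)       # fuse, then scale
        return self.lang_gen(torch.stack(scaled))        # spoken-language tokens
```

The paper only states that clips are equal-length with 50% overlap and that padded clips may appear at the end of a sequence (see Figure 7); repeating the last frame is just one plausible padding choice.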
3.1 Frame-Level Skeleton Extraction
A frame can capture a specific gesture in sign language, thus containing the spatial structure of the human pose and detailed information on the face, hands, fingers, etc. Therefore, we split the clip into frames, and propose the FrmSke module to extract the skeleton and frame-level features. As shown in Figure 3, we select a compressed variant of the VGG model [33] as the backbone network for FrmSke. In the compressed VGG model, the number of channels in the convolutional layers is reduced to one fourth of the original, to reduce the memory requirement and make the model work on our platform.

Figure 3: FrmSke extracts the skeleton from each image and frame-level features of each clip.

Skeleton extraction: As shown in Figure 3, to extract the skeleton map from a frame, two parallel deconvolutional networks are used to upsample high-to-low resolution representations [34] after the Conv3 and Conv4 layers. Specifically, the Deconv1 layer adopts one 3×3 deconvolution with stride 2 for 2× upsampling, while the Deconv2 layer adopts two consecutive 3×3 deconvolutions with stride 2 for 4× upsampling. Then, a pointwise convolutional layer and an element-sum operation are added after the deconvolutional layers to generate K heatmaps, where each heatmap M^H_k, k ∈ [1, K], contains one keypoint (with the highest heat value) of the skeleton. After that, we generate the skeleton (i.e., a 2D matrix) M^S by adding the corresponding elements of the K heatmaps. Here, K is set to 14, i.e., the number of keypoints: nose, neck, both eyes, both ears, both shoulders, both elbows, both wrists and both hips.

Frame-level clip representation: As shown in Figure 3, the convolutions Conv1 to Conv5 are first used to extract feature maps from each frame of a clip. Then, the feature maps are concatenated, flattened and sent to a fully connected layer to get a feature vector F_m with N_m = 4096 elements for the clip.
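The skeleton branch described above can be sketched as follows in PyTorch: two parallel deconvolutional heads attached after Conv3 and Conv4 bring both feature maps to the same resolution, pointwise (1×1) convolutions map them to K = 14 heatmaps, an element-wise sum merges the two paths, and the skeleton map is obtained by adding the K heatmaps. The channel widths (64/128, i.e., a quarter of the usual VGG widths) and the exact output resolution are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SkeletonHead(nn.Module):
    """Sketch of FrmSke's skeleton branch (channel widths are assumptions)."""
    def __init__(self, c3: int = 64, c4: int = 128, k: int = 14):
        super().__init__()
        # Deconv1: one 3x3 deconvolution, stride 2 (2x upsampling of Conv3's output)
        self.deconv1 = nn.ConvTranspose2d(c3, 64, kernel_size=3, stride=2,
                                          padding=1, output_padding=1)
        # Deconv2: two consecutive 3x3 deconvolutions, stride 2 (4x upsampling of Conv4's output)
        self.deconv2 = nn.Sequential(
            nn.ConvTranspose2d(c4, 64, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 64, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
        )
        # Pointwise (1x1) convolutions mapping each path to K heatmaps
        self.point1 = nn.Conv2d(64, k, kernel_size=1)
        self.point2 = nn.Conv2d(64, k, kernel_size=1)

    def forward(self, feat_conv3: torch.Tensor, feat_conv4: torch.Tensor):
        # feat_conv3: (B, c3, H/4, W/4), feat_conv4: (B, c4, H/8, W/8)
        h1 = self.point1(self.deconv1(feat_conv3))   # (B, K, H/2, W/2)
        h2 = self.point2(self.deconv2(feat_conv4))   # (B, K, H/2, W/2)
        heatmaps = h1 + h2                           # element-wise sum -> K heatmaps
        skeleton = heatmaps.sum(dim=1)               # add the K maps -> (B, H/2, W/2)
        return heatmaps, skeleton
```

In the full model, the K heatmaps would be supervised with the skeleton extraction loss of Section 3.5, while the summed skeleton map is concatenated with the RGB channels (Section 3.2) and used to build the graph in Section 3.3.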
3.2 Channel Extended Clip Representation
A video clip with several consecutive frames can capture a short action (i.e., continuous/dynamic gestures) in sign language. Therefore, we propose the ClipRep module to track the dynamic changes of human pose and extract the clip representation. As shown in Figure 4, we first extend the channels of each frame by concatenating the extracted skeleton channel and the original RGB channels, and then adopt Pseudo-3D Residual Networks (P3D) [30] to extract clip-level features.

Figure 4: ClipRep extracts clip-level features, where each frame of the clip is extended to four channels by concatenating the skeleton channel and RGB channels.

Channel extension with skeleton: We use the skeleton map (i.e., a 2D matrix) as the fourth channel, and concatenate it with the original RGB channels of each frame to get an RGBS frame with four channels, as shown in Figure 4. After that, the clip with RGBS frames is used for clip-level feature extraction.

Enhanced clip representation: Based on the RGBS frame sequence of a clip, we introduce the P3D block [30] to extract the features of the clip, where P3D is adopted for SLT for the first time. A P3D block consists of one (2D) spatial filter (1×3×3), one (1D) temporal filter (3×1×1), and two pointwise filters (1×1×1). Combining the filters in different ways yields different modules (i.e., P3D-A, P3D-B, P3D-C). In ClipRep (shown in Figure 4), after the 3D convolution and P3D blocks, residual units, average pooling and a fully connected layer are used to get the feature vector F_c with N_c = 4096 elements for the clip.

To verify whether the added skeleton channel can enhance feature representation, we visualize the intermediate feature maps (i.e., after the first P3D block) in ClipRep without and with the skeleton channel in Figure 5(a) and Figure 5(b), respectively. The areas with brighter colors in Figure 5(b) indicate that the added skeleton channel can highlight the features related to sign language, e.g., gesture changes and human pose, thus enhancing the clip representation.

Figure 5: Intermediate feature maps after the first P3D block (a) without and (b) with the skeleton channel. In each case, we show 4 examples selected from the 16 frames in a clip.
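The channel extension is a single tensor concatenation; the sketch below shows it explicitly, together with a minimal P3D-A-style residual block (spatial 1×3×3 filter followed by a temporal 3×1×1 filter, wrapped by pointwise filters) in the spirit of [30]. The channel counts, frame resolution, and block layout are illustrative assumptions rather than the exact ClipRep architecture.

```python
import torch
import torch.nn as nn

def to_rgbs(clip_rgb: torch.Tensor, skeleton: torch.Tensor) -> torch.Tensor:
    """Concatenate the skeleton map as a fourth channel.

    clip_rgb: (c, 3, H, W) RGB frames; skeleton: (c, H, W) skeleton maps.
    Returns an RGBS clip of shape (c, 4, H, W).
    """
    return torch.cat([clip_rgb, skeleton.unsqueeze(1)], dim=1)

class P3DABlock(nn.Module):
    """Minimal P3D-A-style residual block (spatial 1x3x3 then temporal 3x1x1)."""
    def __init__(self, channels: int = 64, mid: int = 16):
        super().__init__()
        self.reduce = nn.Conv3d(channels, mid, kernel_size=1)          # pointwise
        self.spatial = nn.Conv3d(mid, mid, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(mid, mid, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.restore = nn.Conv3d(mid, channels, kernel_size=1)         # pointwise
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, channels, c, H, W) -- a batch of RGBS clips after a stem 3D conv
        y = self.relu(self.reduce(x))
        y = self.relu(self.spatial(y))      # (2D) spatial filter
        y = self.relu(self.temporal(y))     # (1D) temporal filter
        y = self.restore(y)
        return self.relu(x + y)             # residual connection

# Example: one clip of 16 RGBS frames passed through a stem 3D conv and one block
clip = to_rgbs(torch.rand(16, 3, 112, 112), torch.rand(16, 112, 112))
x = clip.permute(1, 0, 2, 3).unsqueeze(0)           # (1, 4, 16, 112, 112)
stem = nn.Conv3d(4, 64, kernel_size=3, padding=1)   # accepts the 4 RGBS channels
print(P3DABlock()(stem(x)).shape)                   # torch.Size([1, 64, 16, 112, 112])
```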
3.3 Skeleton-Aware Clip Scaling
In a short-time clip, the human action can correspond to a meaningful sign, a less important transition action, an unmeaningful end state, etc. Thus the importance of each clip for SLT can differ. To track the human action in a clip and weight the importance of each clip, we propose the ClipScl module, which first constructs a skeleton-based graph and applies a Graph Convolutional Network (GCN) [41] to generate a scale factor, and then scales the feature vector of each clip with this factor.

Skeleton-based GCN: To track the dynamic changes of human action in a short clip, we construct a skeleton-based graph, which describes the moving trajectories of the keypoints in the skeleton [31, 41]. Specifically, for a clip with c frames, we first construct a skeleton-based graph G = (V, E) with node set V and edge set E. Suppose the keypoints of the i-th skeleton in a clip are V_i = (v_{i_1}, v_{i_2}, ..., v_{i_K}), i ∈ [1, c]. Here, v_{i_j}, j ∈ [1, K], denotes the j-th keypoint/node in the i-th skeleton, and K = 14 is the number of keypoints in a skeleton. Then the node set is V = {v_{i_j} | i ∈ [1, c], j ∈ [1, K]}. The edge set includes the intra-skeleton edge set E_a = {v_{i_p} v_{i_q} | (p, q) ∈ S}, where S is the set of naturally connected body joints in a skeleton, and the inter-skeleton edge set E_e = {v_{i_p} v_{j_p} | i, j ∈ [1, c], |i − j| = 1} (i.e., edges between the corresponding nodes of two adjacent skeletons), as shown in Figure 6. For each node in the constructed skeleton-based graph, its coordinate vector (x, y) in the frame is used as its initial feature vector v^f.

With the skeleton-based graph, we then adopt a Graph Convolutional Network (GCN) to calculate the scale factor (i.e., importance weight) of a clip. Specifically, we design ClipScl, which consists of 7 layers of spatial-temporal graph convolution (ST-GCN) units [41], while reducing the channel number of the ST-GCN units to 0.25× of the original to lower the memory requirement of the model, as shown in Figure 6. Then, we use max pooling and a fully connected layer to get the feature vector, which is passed to a sigmoid function to get the scale factor sf, i.e., a value in [0, 1]:

sf = \mathrm{Sigmoid}(\mathrm{ST\text{-}GCN}^{+}(V^f, E))    (1)

Here, V^f is the set of feature vectors of the node set V, ST-GCN^{+}(·) denotes the combination of the 7 ST-GCN layers, max pooling and a fully connected layer, and Sigmoid(·) denotes the sigmoid function.

Figure 6: ClipScl constructs a skeleton-based graph and uses a GCN to calculate the scale factor, which is used for weighting the importance of each clip.
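As an illustration of the graph construction and scaling, the sketch below builds the adjacency over the c×K nodes from the intra-skeleton edges E_a and the inter-skeleton edges E_e, and maps the (x, y) node features to a scale factor in [0, 1]. A plain normalized-adjacency graph convolution stands in for the full ST-GCN unit of [41], and the joint-connectivity list S and layer widths are assumptions.

```python
import torch
import torch.nn as nn

# Assumed connectivity S of the 14 keypoints (nose, neck, eyes, ears, shoulders,
# elbows, wrists, hips); the exact pairing is an assumption for illustration.
S = [(0, 1), (0, 2), (0, 3), (2, 4), (3, 5), (1, 6), (1, 7),
     (6, 8), (7, 9), (8, 10), (9, 11), (1, 12), (1, 13)]
K = 14

def build_adjacency(c: int) -> torch.Tensor:
    """Adjacency over the c*K nodes of a clip: intra-skeleton edges (E_a)
    plus edges between corresponding joints of adjacent frames (E_e)."""
    n = c * K
    A = torch.eye(n)                                  # self-loops
    for i in range(c):
        for p, q in S:                                # E_a: body-joint edges
            A[i * K + p, i * K + q] = A[i * K + q, i * K + p] = 1.0
        if i + 1 < c:
            for j in range(K):                        # E_e: temporal edges
                A[i * K + j, (i + 1) * K + j] = A[(i + 1) * K + j, i * K + j] = 1.0
    deg = A.sum(dim=1, keepdim=True)
    return A / deg                                    # row-normalized adjacency

class ClipScl(nn.Module):
    """Simplified ClipScl: stacked graph convolutions -> max pool -> FC -> sigmoid."""
    def __init__(self, dims=(2, 16, 32, 64)):
        super().__init__()
        self.weights = nn.ModuleList(
            [nn.Linear(dims[i], dims[i + 1]) for i in range(len(dims) - 1)])
        self.fc = nn.Linear(dims[-1], 1)

    def forward(self, coords: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        # coords: (c*K, 2) initial node features, i.e., (x, y) joint coordinates
        h = coords
        for w in self.weights:
            h = torch.relu(w(A @ h))                  # A X W graph convolution
        pooled = h.max(dim=0).values                  # max pooling over nodes
        return torch.sigmoid(self.fc(pooled))         # scale factor in [0, 1]

# Example: a clip of c = 16 skeletons with (x, y) keypoint coordinates
c = 16
coords = torch.rand(c * K, 2)
print(ClipScl()(coords, build_adjacency(c)))          # a value in (0, 1)
```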
Fused feature scaling: For each clip, we get the frame-level feature vector F_m, the clip-level feature vector F_c and the scale factor sf. First, we fuse F_m and F_c with element-wise addition ⊕ to get the fused feature vector. Then, we scale the fused feature vector with the multiplication operation ⊗ to get the scaled feature vector F_f, as shown below:

F_f = (F_m \oplus F_c) \otimes sf    (2)

In Figure 7, we show the calculated scale factor sf for each clip, where clips with meaningful signs (i.e., clips in the green rectangle) have larger sf. This means that the designed ClipScl module can efficiently track the dynamic changes of human pose with skeletons and distinguish the importance of different clips, i.e., ClipScl can highlight meaningful clips while weakening unmeaningful clips.

Figure 7: Visualization of the scale factor for each clip (clips are labeled as initial state, transition clips, meaningful clips with signs, end state, and padded clips). Meaningful clips with signs have larger scale factors, while unmeaningful clips have lower scale factors.

3.4 Spoken Language Generation
After getting the scaled feature vector of each clip, we propose LangGen, which adopts the encoder-decoder framework [35] and the attention mechanism [1] to generate the spoken language, as shown in Figure 2.

BiLSTM Encoder with MemoryCell: We use three-layered BiLSTMs and propose a novel MemoryCell to connect adjacent BiLSTM layers for encoding. Specifically, given a sequence of scaled clip feature vectors z_{1:n}, we first get the hidden states H^l = (h^l_1, h^l_2, ..., h^l_n) after the l-th BiLSTM layer. Then, we design the MemoryCell to change the dimensions of the hidden states and provide the appropriate input for the following layer, as shown below:

h_0^{l+1} = \tanh(W \cdot h_n^l + b)    (3)

where W and b are the weight and bias of a fully connected layer, h_n^l is the final output hidden state of the l-th layer, and h_0^{l+1} is used as the initial hidden state of the (l+1)-th layer.

LSTM Decoder: We use one LSTM layer as the decoder to decode the words step by step. Specifically, the decoder utilizes LSTM cells, a fully connected layer and a softmax layer to output the prediction probability p_{t,j}, i.e., the probability that the predicted word ŷ_t at the t-th time step is the j-th word in the vocabulary. At the beginning of decoding, ŷ_0 is initialized with the start symbol "[SOS]". At the t-th time step, the decoder predicts the word ŷ_t. The decoder stops decoding upon the occurrence of the symbol "[EOS]".
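A minimal sketch of the encoder side of LangGen, focusing on how Eq. (3) bridges adjacent BiLSTM layers: the last hidden state of layer l is mapped through a fully connected layer with tanh and used to initialize layer l+1. The hidden sizes, the reuse of the bridged vector for both directions and for the cell state, and the omission of attention and the LSTM decoder are simplifying assumptions.

```python
import torch
import torch.nn as nn

class MemoryCellEncoder(nn.Module):
    """Three BiLSTM layers whose adjacent layers are bridged by Eq. (3):
    h0^{l+1} = tanh(W · hn^l + b). Sizes, and using the same bridged vector
    for both directions and for the cell state, are assumptions."""
    def __init__(self, feat_dim: int = 4096, hidden: int = 512, layers: int = 3):
        super().__init__()
        self.lstms = nn.ModuleList()
        self.memory_cells = nn.ModuleList()
        in_dim = feat_dim
        for _ in range(layers):
            self.lstms.append(nn.LSTM(in_dim, hidden, batch_first=True,
                                      bidirectional=True))
            # W, b of Eq. (3): map the last hidden state to the next layer's
            # initial hidden state
            self.memory_cells.append(nn.Linear(2 * hidden, hidden))
            in_dim = 2 * hidden

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, n, feat_dim) scaled clip feature vectors z_{1:n}
        state = None
        for lstm, cell in zip(self.lstms, self.memory_cells):
            out, _ = lstm(z, state)                   # (batch, n, 2*hidden)
            h_n = out[:, -1, :]                       # last time step (simplified h_n^l)
            bridge = torch.tanh(cell(h_n))            # Eq. (3): tanh(W·h + b)
            # use the bridged vector as the initial hidden/cell state of the next layer
            h0 = bridge.unsqueeze(0).repeat(2, 1, 1)  # (2 directions, batch, hidden)
            state = (h0, torch.zeros_like(h0))
            z = out                                   # input to the next layer
        return z                                      # encoder memory for the decoder

# Example: a batch of 2 videos, each with 10 clips of 4096-dim scaled features
enc = MemoryCellEncoder()
print(enc(torch.rand(2, 10, 4096)).shape)             # torch.Size([2, 10, 1024])
```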
3.5 Joint Loss Optimization
To optimize skeleton extraction and sign language translation jointly, we design a joint loss L, which consists of the skeleton extraction loss L_ske and the SLT loss L_slt(y, ŷ).

The skeleton extraction loss L_ske is calculated as follows, where M^H and M^G denote the predicted heatmaps and the ground-truth heatmaps (3D tensors of size K × h × w), respectively, and K, h, w are the number of keypoints, the height and the width of a heatmap. Each ground-truth heatmap contains a heating area, which is generated by applying a 2D Gaussian function with 1-pixel standard deviation [34] on the keypoint estimated by OpenPose [8].

L_{ske} = \frac{1}{K} \sum_{k=1}^{K} \sum_{i=1}^{h} \sum_{j=1}^{w} \left( M^{H}_{k,i,j} - M^{G}_{k,i,j} \right)^{2}    (4)

The SLT loss L_slt(y, ŷ) is the cross-entropy loss, where ŷ is the predicted word sequence and y is the ground-truth word sequence (i.e., the labels). Its calculation is shown below, where T is the maximum number of time steps in the decoder and V is the number of words in the vocabulary. y_{t,j} is an indicator: when the ground-truth word at the t-th time step is the j-th word in the vocabulary, y_{t,j} = 1; otherwise, y_{t,j} = 0. p_{t,j} is the probability that the predicted word ŷ_t at the t-th time step is the j-th word in the vocabulary.

L_{slt}(y, \hat{y}) = -\sum_{t=1}^{T} \sum_{j=1}^{V} y_{t,j} \log(p_{t,j})    (5)

Based on L_ske and L_slt(y, ŷ), we calculate the joint loss L as follows, where α is a hyper-parameter used to balance the ratio of L_ske and L_slt. We set α to 1 at the beginning of training and change it to 0.5 in the middle of training.

L = \alpha L_{ske} + L_{slt}(y, \hat{y})    (6)

4 EXPERIMENT
4.1 Datasets
Two public SLT datasets are often used: one is the CSL dataset [18], which contains 25K labeled videos covering 100 Chinese sentences filmed by 50 signers; the other is a German sign language dataset, RWTH-PHOENIX-Weather 2014T [5], which contains 8257 weather forecast samples from 9 signers.