Hashing based Answer Selection

Dong Xu and Wu-Jun Li∗
National Key Laboratory for Novel Software Technology
Collaborative Innovation Center of Novel Software Technology and Industrialization
Department of Computer Science and Technology, Nanjing University, China
dc.swind@gmail.com, liwujun@nju.edu.cn

Abstract

Answer selection is an important subtask of question answering (QA), in which deep models usually achieve better performance than non-deep models. Most deep models adopt question-answer interaction mechanisms, such as attention, to get vector representations for answers. When these interaction based deep models are deployed for online prediction, the representations of all answers need to be recalculated for each question. This procedure is time-consuming for deep models with complex encoders like BERT which usually have better accuracy than simple encoders. One possible solution is to store the matrix representation (encoder output) of each answer in memory to avoid recalculation. But this will bring large memory cost. In this paper, we propose a novel method, called hashing based answer selection (HAS), to tackle this problem. HAS adopts a hashing strategy to learn a binary matrix representation for each answer, which can dramatically reduce the memory cost for storing the matrix representations of answers. Hence, HAS can adopt complex encoders like BERT in the model, but the online prediction of HAS is still fast with a low memory cost. Experimental results on three popular answer selection datasets show that HAS can outperform existing models to achieve state-of-the-art performance.

Introduction

Question answering (QA) is an important but challenging task in the natural language processing (NLP) area. Answer selection (answer ranking), which aims to select the corresponding answer from a pool of candidate answers for a given question, is one of the key components in many kinds of QA applications. For example, in community-based question answering (CQA) tasks, all answers need to be ranked according to their quality. In frequently asked questions (FAQ) tasks, the most related answers need to be returned for answering the users' questions.

One main challenge of answer selection is that both questions and answers are not long enough in most cases. As a result, questions and answers usually lack background information and knowledge about the context (Deng et al. 2018).

∗Wu-Jun Li is the corresponding author.
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

This phenomenon limits the performance of answer selection models. Deep neural network (DNN) based models, also simply called deep models, can partly tackle this problem by using pre-trained word embeddings. Word embeddings pre-trained on a language corpus contain some common knowledge and linguistic phenomena, which are helpful for selecting answers. Deep models have achieved promising performance for answer selection in recent years (Tan et al. 2016b; Santos et al. 2016; Tay, Tuan, and Hui 2018a; Tran and Niederée 2018; Deng et al. 2018).

Most deep models for answer selection are constructed with similar frameworks which contain an encoding layer (also called encoder) and a composition layer (also called composition module). Traditional models usually adopt convolutional neural networks (CNN) (Feng et al. 2015) or recurrent neural networks (RNN) (Tan et al. 2016b; Tran and Niederée 2018) as encoders.
Recently, complex pre-trained models such as BERT (Devlin et al. 2018) and GPT-2 (Radford et al. 2019) have been proposed for NLP tasks. BERT and GPT-2 adopt Transformer (Vaswani et al. 2017) as the key building block, which discards CNN and RNN entirely. BERT and GPT-2 are typically pre-trained on a large-scale language corpus, which can encode abundant common knowledge into model parameters. This common knowledge is helpful when BERT or GPT-2 is fine-tuned on other tasks.

The output of the encoder for each sentence, of either question or answer, is usually represented as a matrix, and each column or row of the matrix corresponds to a vector representation for a word in the sentence. Composition modules are used to generate vector representations for sentences from the corresponding matrices. Composition modules mainly include pooling and question-answer interaction mechanisms. Question-answer interaction mechanisms include attention (Tan et al. 2016b), attentive pooling (Santos et al. 2016), multihop attention (Tran and Niederée 2018) and so on. In general, question-answer interaction mechanisms have better performance than pooling. However, interaction mechanisms bring a problem: the vector representations of an answer are different with respect to different questions. When deep models with interaction mechanisms are deployed for online prediction, the representations of all answers need to be recalculated for each question.
This procedure is time-consuming for deep models with complex encoders like BERT which usually have better accuracy than simple encoders. One possible solution is to store the matrix representation (with float or double values) of each answer in memory to avoid recalculation. But this will bring large memory cost.

In this paper, we propose a novel method, called hashing based answer selection (HAS), to tackle this problem. The main contributions of HAS are briefly outlined as follows:

• HAS adopts a hashing strategy to learn a binary matrix representation for each answer, which can dramatically reduce the memory cost for storing the matrix representations of answers. To the best of our knowledge, this is the first work to use hashing for memory reduction in answer selection.

• By storing the (binary) matrix representations of answers in memory, HAS can avoid recalculation of answer representations during online prediction. Consequently, HAS can adopt complex encoders like BERT in the model, but the online prediction of HAS is still fast with a low memory cost.

• Experimental results on three popular answer selection datasets show that HAS can outperform existing models to achieve state-of-the-art performance.

Related Work

Answer Selection

Most early models for answer selection are shallow (non-deep) models, which usually use bag-of-words (BOW) (Yih et al. 2013), term frequency (Robertson et al. 1994), manually designed rules (Téllez-Valero et al. 2011), or syntactic trees (Wang and Manning 2010; Cui et al. 2005) as features. Different upper structures are designed for modeling the similarity of questions and answers based on these features. The main drawback of shallow models is the lack of semantic information caused by using only surface features. Deep models can capture more semantic information through distributed representations, which leads to better results than shallow models. Early deep models use pooling (Feng et al. 2015) as the composition module to get vector representations for sentences from the encoder outputs which are represented as matrices. Pooling cannot model the interaction between questions and answers, and has been outperformed by new composition modules with question-answer interaction mechanisms. Attention (Bahdanau, Cho, and Bengio 2015) can generate a better representation of answers (Tan et al. 2016b) than pooling, by introducing the information flow between questions and answers into models. (Santos et al. 2016) proposes attentive pooling for bidirectional attention. (Tran and Niederée 2018) proposes a strategy of multihop attention which captures the complex relations between question-answer pairs. (Wan et al. 2016) focuses on the word-by-word similarity between questions and answers. (Wang, Liu, and Zhao 2016) and (Chen et al. 2018) propose inner attention which introduces the representation of the question into the answer encoder through gates. (Tay, Tuan, and Hui 2018a) designs a cross temporal recurrent cell to model the interaction between questions and answers.
BERT and Transfer Learning

To tackle the problem of insufficient background information and knowledge in answer selection, some methods introduce extra knowledge from other data. (Deng et al. 2018; Min, Seo, and Hajishirzi 2017; Wiese, Weissenborn, and Neves 2017) employ supervised transfer learning frameworks to pre-train a model on a source dataset. There are also some unsupervised transfer learning techniques (Yu et al. 2018; Chung, Lee, and Glass 2018). BERT (Devlin et al. 2018) is a recently proposed model for language understanding. By training on a large language corpus, abundant common knowledge and linguistic phenomena can be encoded into the parameters. As a result, BERT can be transferred to a wide range of NLP tasks and has shown promising results.

Hashing

Hashing (Li, Wang, and Kang 2016) tries to learn binary codes for data representations. Based on the binary codes, hashing can be used to speed up retrieval and to reduce memory cost. In this paper, we use hashing to reduce memory cost by learning binary matrix representations for answers. Many hashing techniques for learning binary representations have been proposed (Li, Wang, and Kang 2016; Cao et al. 2017; Hubara et al. 2016; Fan et al. 2019). To the best of our knowledge, no existing work uses hashing for memory reduction in answer selection.

Hashing based Answer Selection

In this section, we present the details of hashing based answer selection (HAS), which can be used to solve the problem faced by existing deep models with question-answer interaction mechanisms.

The framework of most existing deep models is shown in Figure 1(a). Compared with this framework, HAS has an additional hashing layer, which is shown in Figure 1(b). More specifically, HAS consists of an embedding layer, an encoding layer, a hashing layer, a composition layer and a similarity layer. With different choices of encoders (encoding layer) and composition modules (composition layer) in HAS, several different models can be constructed. Hence, HAS provides a flexible framework for modeling.

[Figure 1: (a) Framework of traditional deep models for answer selection; (b) Framework of HAS. Both stack an embedding layer, an encoding layer, a composition layer and a similarity layer over the question text and the answer text; HAS additionally inserts a hashing layer between the encoding layer and the composition layer on the answer side.]

Embedding Layer and Encoding Layer

HAS is designed for modeling the similarity of question-answer pairs. Hence, the inputs to HAS are two sequences of words, corresponding to the question text and the answer text respectively. Firstly, these sequences of words are represented by word embeddings through a word embedding layer. Suppose the dimension of word embedding is E and the sequence length is L. The embeddings of question q and answer a are represented by matrices Q_q ∈ R^{E×L} and A_a ∈ R^{E×L} respectively. We use the same sequence length L for simplicity. Then, these two embedding matrices Q_q and A_a are fed into an encoding layer to get the contextual word representations. Different choices of embedding layers and encoders can be adopted in HAS. Here, we directly use the embedding layer and encoding layer of BERT to utilize the common knowledge and linguistic phenomena encoded in BERT. Hence, the formulation of the encoding layer is as follows:
U_q = BERT(Q_q),
V_a = BERT(A_a),

where U_q, V_a ∈ R^{D×L} are the contextual semantic features of words extracted by BERT for question q and answer a respectively, and D is the output dimension of BERT.
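The paper does not tie HAS to a particular BERT implementation. As an illustration only, the following sketch uses the HuggingFace transformers library (our assumption, not the authors' code) to obtain the D×L contextual feature matrix for one sentence:

```python
import torch
from transformers import BertModel, BertTokenizer

# Illustrative only: the paper does not specify a BERT implementation.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

question = "does my insurance cover a rental car?"
inputs = tokenizer(question, return_tensors="pt", padding="max_length",
                   truncation=True, max_length=200)     # L = 200
with torch.no_grad():
    outputs = encoder(**inputs)

# Token-level features: one D-dimensional vector per word position.
U_q = outputs.last_hidden_state.squeeze(0).T             # shape (D=768, L=200)
```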
Hashing Layer

The outputs of the encoding layer for question q and answer a are U_q and V_a, which are two real-valued (float or double) matrices. When deep models with question-answer interaction mechanisms store the output of the encoding layer (V_a) in memory to avoid recalculation, they will meet the high memory cost problem. For example, if we take float values for V_a, the memory cost for only one answer is over 600 KB when L = 200 and D = 768. Here, D = 768 is the output dimension of BERT. If the number of answers in the candidate set is large, excessive memory cost will lead to impracticability, especially on mobile or embedded devices.

In this paper, we adopt hashing to reduce memory cost by learning binary matrix representations for answers. More specifically, we take the sign function y = sgn(x) to binarize the output of the encoding layer. But the gradient of the sign function is zero for all nonzero inputs, which leads to a problem that the gradients cannot back-propagate correctly. y = tanh(x) is a commonly used approximation of y = sgn(x), which can make the training process end-to-end with back-propagation (BP). Here, we use a more flexible variant y = tanh(βx) with a hyper-parameter β ≥ 1, whose derivative is

∂y/∂x = β(1 − y²).

By using this function, the formulation of the hashing layer is as follows:

B̃_a = tanh(βV_a),    (1)

where B̃_a ∈ R^{D×L} is the output of the hashing layer.

To make sure that the elements in B̃_a concentrate to binary values B = {±1}, we add an extra constraint for this layer, as in (Li, Wang, and Kang 2016):

J^c(a) = ||B̃_a − B_a||²_F,    (2)

where B_a ∈ B^{D×L} is the binary matrix representation for answer a, and ||·||_F is the Frobenius norm of a matrix. Here, B_a is also a parameter to learn in the HAS model.

When the learned model is deployed for online prediction, the learned binary matrices for answers will be stored in memory to avoid recalculation. With binary representations, each element in the matrices only costs one bit of memory. Hence, the memory cost can be dramatically reduced.
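To make Eqs. (1) and (2) and the memory arithmetic concrete, here is a minimal NumPy sketch; β, the random V_a and all variable names are illustrative assumptions, not the authors' released code:

```python
import numpy as np

# Minimal sketch of the hashing layer for one answer, assuming the
# D x L encoder output V_a (e.g., D = 768, L = 200) and beta = 5.
D, L, beta = 768, 200, 5.0
rng = np.random.default_rng(0)
V_a = rng.standard_normal((D, L)).astype(np.float32)   # stand-in encoder output

# Training-time hashing layer, Eq. (1): B_tilde = tanh(beta * V_a).
B_tilde = np.tanh(beta * V_a)

# Gradient of tanh(beta * x) w.r.t. x, used during back-propagation.
grad = beta * (1.0 - B_tilde ** 2)

# Binary matrix B_a = sgn(B_tilde) and constraint J^c(a), Eq. (2).
B_a = np.where(B_tilde >= 0, 1.0, -1.0)                # entries in {-1, +1}
J_c = np.sum((B_tilde - B_a) ** 2)

# Memory arithmetic from the text: float32 storage vs. 1 bit per element.
float_bytes = V_a.size * 4              # 768 * 200 * 4 = 614,400 B = 600 KB
packed = np.packbits(B_a > 0)           # pack each {-1,+1} entry into 1 bit
binary_bytes = packed.size              # 768 * 200 / 8 = 19,200 B ≈ 19 KB
print(float_bytes, binary_bytes)        # a 32x reduction per answer
```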
Composition Layer

The outputs of the encoding layer and the hashing layer are matrices of size D × L. Composition layers are used to compose these matrix representations into vectors. Pooling, attention (Tan et al. 2016b), attentive pooling (Santos et al. 2016) and other interaction mechanisms (Tran and Niederée 2018; Wan et al. 2016) can be adopted in HAS. Interaction based modules usually have better performance than pooling based modules, which have no question-answer interaction. Here, we take attention as an example to illustrate the advantage of HAS. More specifically, we adopt pooling for composing matrix representations of questions into question vectors, and adopt attention for composing matrix representations of answers into answer vectors. The formulation of the composition layer is as follows:

u_q = max-pooling(U_q),

v_a^{(q)} = attention(B̃_a, u_q) = Σ_{i=1}^{L} α_i · b_i^{(a)},

α_i ∝ exp(m^T · tanh(W_1 · b_i^{(a)} + W_2 · u_q)),

where u_q, v_a^{(q)} ∈ R^D are the composed vectors of questions and answers respectively, b_i^{(a)} is the i-th word representation in B̃_a = [b_1^{(a)}, ..., b_L^{(a)}], α_i is the attention weight for the i-th word which is calculated by a softmax function, and W_1, W_2 ∈ R^{M×D}, m ∈ R^M are attention parameters with M being the hidden size of attention.

The above formulation is for training. During the test procedure, we just need to replace B̃_a by B_a.
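Continuing the NumPy sketch above, the composition step can be written as follows; W1, W2, m and their initialization are illustrative assumptions:

```python
import numpy as np

# Compose U_q and B_tilde (or the binary B_a at test time) into vectors.
D, L, M = 768, 200, 768
rng = np.random.default_rng(1)
U_q = rng.standard_normal((D, L)).astype(np.float32)           # question features
B_tilde = np.tanh(5.0 * rng.standard_normal((D, L)))           # answer features
W1 = rng.standard_normal((M, D)) * 0.01                         # attention params
W2 = rng.standard_normal((M, D)) * 0.01
m = rng.standard_normal(M) * 0.01

# Question vector: element-wise max over the L word positions.
u_q = U_q.max(axis=1)                                           # shape (D,)

# Attention scores over the L answer words, conditioned on u_q.
scores = m @ np.tanh(W1 @ B_tilde + (W2 @ u_q)[:, None])        # shape (L,)
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                                            # softmax weights

# Answer vector: attention-weighted sum of word representations.
v_a = B_tilde @ alpha                                           # shape (D,)
```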
Similarity Layer and Loss Function

The similarity layer measures the similarity between question-answer pairs based on their vector representations u_q and v_a^{(q)}. Here, we choose the cosine function as the similarity function, which is commonly adopted in answer selection tasks:

s(q, a) = cos(u_q, v_a^{(q)}),

where s(q, a) ∈ R is the similarity between question q and answer a.

Based on the similarity between questions and answers, we can define the loss function. The most commonly used loss function for ranking is the triplet-based hinge loss (Tan et al. 2016b; Tran and Niederée 2018). Combining the hinge loss and the binary constraint in hashing together, we get the following optimization problem:

min_{θ,B⋆} J = Σ_{(q,p,n)} [J^m(q, p, n) + δ·J^c(p) + δ·J^c(n)]
            = Σ_{(q,p,n)} [max(0, 0.1 − s(q, p) + s(q, n)) + δ·||B̃_p − B_p||²_F + δ·||B̃_n − B_n||²_F],

where J^m(q, p, n) = max(0, 0.1 − s(q, p) + s(q, n)) is the hinge loss for a triplet (q, p, n) from the training set, p is a positive answer corresponding to q, n is a randomly selected negative answer, and δ is the coefficient of the binary constraints J^c(p) and J^c(n) for the positive answer p and the negative answer n respectively. B⋆ denotes the set of binary matrix representations for all answers. θ denotes the parameters in HAS except B⋆.

These two sets of parameters θ and B⋆ can be optimized alternately (Li, Wang, and Kang 2016). More specifically, B_a ∈ B⋆ corresponding to answer a can be optimized as follows when θ is fixed:

B_a = sgn(B̃_a).

And θ can be updated by back-propagation (BP) when B⋆ is fixed.
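As a sketch of one alternating round under these definitions (cos_sim, the argument names and the default δ are our own; the margin 0.1 and the tuning grid for δ follow the text):

```python
import numpy as np

def cos_sim(x, y):
    # s(q, a) = cos(u_q, v_a); small epsilon guards against zero vectors.
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-8))

def triplet_loss(u_q, v_p, v_n, B_tilde_p, B_p, B_tilde_n, B_n, delta=1e-6):
    # J = max(0, 0.1 - s(q,p) + s(q,n)) + delta * J^c(p) + delta * J^c(n);
    # delta is illustrative here (tuned among {0, 1e-7, ..., 1e-4} in the paper).
    hinge = max(0.0, 0.1 - cos_sim(u_q, v_p) + cos_sim(u_q, v_n))
    j_c_p = np.sum((B_tilde_p - B_p) ** 2)
    j_c_n = np.sum((B_tilde_n - B_n) ** 2)
    return hinge + delta * (j_c_p + j_c_n)

def update_binary(B_tilde_a):
    # Step 1 (theta fixed): closed-form update B_a = sgn(B_tilde_a).
    return np.where(B_tilde_a >= 0, 1.0, -1.0)

# Step 2 (B_star fixed): update theta by back-propagating triplet_loss
# through the encoder, hashing and composition layers (omitted here).
```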
Experiment

Datasets

We evaluate HAS on three popular answer selection datasets. The statistics of the datasets are presented in Table 1.

Table 1: Statistics of the datasets. "#questions" and "#C.A." denote the number of questions and candidate answers respectively.

                      insuranceQA   yahooQA   wikiQA
#questions (Train)    12887         50112     873
#questions (Dev)      1000          6289      126
#questions (Test1)    1800          6283      243
#questions (Test2)    1800          —         —
#C.A. per question    500           5         9

insuranceQA (Feng et al. 2015) is a FAQ dataset from the insurance domain. We use the first version of this dataset, which has been widely used in existing works (Tan et al. 2016b; Wang, Liu, and Zhao 2016; Tan et al. 2016a; Deng et al. 2018; Tran and Niederée 2018). This dataset has already been partitioned into four subsets: Train, Dev, Test1 and Test2. The total number of candidate answers is 24981. To reduce the complexity, the dataset provides a candidate set of 500 answers for each question, including positive and negative answers. There is more than one positive answer to some questions. As in existing works (Feng et al. 2015; Tran and Niederée 2018; Deng et al. 2018), we adopt Precision@1 (P@1) as the evaluation metric.

yahooQA¹ is a large CQA corpus collected from Yahoo! Answers. We adopt the same dataset splits as (Tay et al. 2017; Tay, Tuan, and Hui 2018a; Deng et al. 2018) for fair comparison. Questions and answers are filtered by their length, and only sentences with length in the range of 5-50 are preserved. The number of candidate answers for each question is five, in which only one answer is positive. The other four negative answers are sampled from the top 1000 hits obtained with Lucene search for each question. As in existing works (Tay et al. 2017; Tay, Tuan, and Hui 2018a; Deng et al. 2018), P@1 and Mean Reciprocal Rank (MRR) are adopted as evaluation metrics.

wikiQA (Yang, Yih, and Meek 2015) is a benchmark for open-domain answer selection. The questions of wikiQA are factual questions which are collected from Bing search logs. Each question is linked to a Wikipedia page, and the sentences in the summary section are collected as the candidate answers. The size of the candidate answer set differs across questions, and there may be more than one positive answer to some questions. We filter out the questions which have no positive answers, as in previous works (Yang, Yih, and Meek 2015; Deng et al. 2018; Wang, Liu, and Zhao 2016). Mean Average Precision (MAP) and MRR are adopted as evaluation metrics as in existing works.

¹https://webscope.sandbox.yahoo.com/catalog.php?datatype=l&guccounter=1
Hyperparameters and Baselines

We use base BERT as the encoder in our experiments. Large BERT may have better performance, but the encoding layer is not the focus of this paper. More specifically, the embedding size E and the output dimension D of BERT are 768. The dropout probability is 0.1. The weight decay coefficient is 0.01. The batch size is 64 for yahooQA, and 32 for insuranceQA and wikiQA. The attention hidden size M is 768 for insuranceQA, and 128 for yahooQA and wikiQA. The learning rate is 5e-6 for all models. The numbers of training epochs are 60 for insuranceQA, 18 for wikiQA and 9 for yahooQA; more epochs cannot bring apparent performance gain on the validation set. We evaluate all models on the validation set after each epoch and choose the parameters which achieve the best results on the validation set for the final test. All reported results are the average of five runs.

There are also two other important hyper-parameters: β in tanh(βx) and the coefficient δ of the binary constraint. β is tuned among {1, 2, 5, 10, 20}, and δ is tuned among {0, 1e-7, 1e-6, 1e-5, 1e-4}.
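For reference, the settings above can be collected into an illustrative configuration; the key names are ours, not from a released implementation:

```python
# Hyperparameter summary from the text; key names are illustrative.
COMMON = {"encoder": "BERT-base", "E": 768, "D": 768,
          "dropout": 0.1, "weight_decay": 0.01, "lr": 5e-6,
          "beta_grid": [1, 2, 5, 10, 20],
          "delta_grid": [0, 1e-7, 1e-6, 1e-5, 1e-4]}
PER_DATASET = {
    "insuranceQA": {"batch_size": 32, "M": 768, "epochs": 60},
    "yahooQA":     {"batch_size": 64, "M": 128, "epochs": 9},
    "wikiQA":      {"batch_size": 32, "M": 128, "epochs": 18},
}
```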
The state-of-the-art baselines on the three datasets are different. Hence, we adopt different baselines for comparison on different datasets according to previous works. Baselines using a single model without extra knowledge include: CNN, CNN with GESD (Feng et al. 2015), QA-LSTM (Tan et al. 2016b), AP-LSTM (Tran and Niederée 2018), Multihop-Sequential-LSTM (Tran and Niederée 2018), IARNN-GATE (Wang, Liu, and Zhao 2016), NTN-LSTM, HD-LSTM (Tay et al. 2017), HyperQA (Tay, Tuan, and Hui 2018b), AP-CNN (Santos et al. 2016), AP-BiLSTM (Santos et al. 2016), CTRN (Tay, Tuan, and Hui 2018a), CA-RNN (Chen et al. 2018), RNN-POA (Chen et al. 2017), MULT (Wang and Jiang 2017), and MV-FNN (Sha et al. 2018). Single models with external knowledge include: KAN (Deng et al. 2018). Ensemble models include: LRXNET (Narayan et al. 2018) and SUM_{BASE,PTK} (Tymoshenko and Moschitti 2018).

Because HAS adopts BERT as the encoder, we also construct two BERT-based baselines for comparison. BERT-pooling is a model in which both questions and answers are composed into vectors by pooling. BERT-attention is a model which adopts attention as the composition module. Both BERT-pooling and BERT-attention use BERT as the encoder, and hashing is not adopted in them.

Experimental Results

Results on insuranceQA We compare HAS with baselines on the insuranceQA dataset. The results are shown in Table 2. MULT (Wang and Jiang 2017) and KAN (Deng et al. 2018) are two strong baselines which represent the state-of-the-art results on this dataset. Here, KAN adopts external knowledge for performance improvement. KAN (Tgt-Only) denotes the KAN variant without external knowledge. We can find that HAS outperforms all the baselines, which proves the effectiveness of HAS.

Table 2: Results on insuranceQA. The results of models marked with ★ are reported from (Tran and Niederée 2018). Other results marked with ◇ are reported from their original papers. P@1 is adopted as the evaluation metric by following previous works. 'our impl.' denotes our implementation.

Model                           P@1 (Test1)   P@1 (Test2)
CNN ★                           62.80         59.20
CNN with GESD ★                 65.30         61.00
QA-LSTM (our impl.)             66.08         62.63
AP-LSTM ★                       69.00         64.80
IARNN-GATE ★                    70.10         62.80
Multihop-Sequential-LSTM ★      70.50         66.90
AP-CNN ◇                        69.80         66.30
AP-BiLSTM ◇                     71.70         66.40
MULT ◇                          75.20         73.40
KAN (Tgt-Only) ◇                71.50         68.80
KAN ◇                           75.20         72.50
HAS                             76.38         73.71

Results on yahooQA We also evaluate HAS and baselines on yahooQA. Table 3 shows the results. KAN (Deng et al. 2018), which utilizes external knowledge, is the state-of-the-art model on this dataset. HAS outperforms all baselines except KAN.

Table 3: Results on yahooQA. The results of models marked with ★ are reported from (Tay, Tuan, and Hui 2018a). Other results marked with ◇ are reported from their original papers. P@1 and MRR are adopted as evaluation metrics by following previous works.

Model              P@1     MRR
Random Guess       20.00   45.86
NTN-LSTM ★         54.50   73.10
HD-LSTM ★          55.70   73.50
AP-CNN ★           56.00   72.60
AP-BiLSTM ★        56.80   73.10
CTRN ★             60.10   75.50
HyperQA ◇          68.30   80.10
KAN (Tgt-Only) ◇   67.20   80.30
KAN ◇              74.40   84.00
HAS                73.89   82.10
The performance gain of KAN mainly comes from the external knowledge obtained by pre-training on a source QA dataset, SQuAD-T. Please note that HAS does not adopt an external QA dataset for pre-training. HAS can outperform the target-only version of KAN, denoted as KAN (Tgt-Only), which is only trained on yahooQA without SQuAD-T. Once again, the results on yahooQA verify the effectiveness of HAS.

Results on wikiQA Table 4 shows the results on the wikiQA dataset. SUM_{BASE,PTK} (Tymoshenko and Moschitti 2018) and LRXNET (Narayan et al. 2018) are two ensemble models which represent the state-of-the-art results on this dataset. HAS outperforms all the baselines again, which further proves the effectiveness of HAS.

Table 4: Results on wikiQA. The results marked with ◇ are reported from their original papers. MAP and MRR are adopted as evaluation metrics by following previous works.

Model                         MAP     MRR
AP-CNN ◇                      68.86   69.57
AP-BiLSTM ◇                   67.05   68.42
RNN-POA ◇                     72.12   73.12
Multihop-Sequential-LSTM ◇    72.20   73.80
IARNN-GATE ◇                  72.58   73.94
CA-RNN ◇                      73.58   74.50
MULT ◇                        74.33   75.45
MV-FNN ◇                      74.62   75.76
SUM_{BASE,PTK} ◇              75.59   77.00
LRXNET ◇                      76.57   75.10
HAS                           81.01   82.22

Comparison with BERT-based Models We compare HAS with BERT-pooling and BERT-attention on the three datasets. As shown in Table 5, BERT-attention and HAS outperform BERT-pooling on all three datasets, which verifies that question-answer interaction mechanisms have better performance than pooling. Furthermore, we can find that