Deep Cross-Modal Hashing

Qing-Yuan Jiang and Wu-Jun Li
National Key Laboratory for Novel Software Technology
Collaborative Innovation Center of Novel Software Technology and Industrialization
Department of Computer Science and Technology, Nanjing University, P. R. China
jiangqy@lamda.nju.edu.cn, liwujun@nju.edu.cn

Abstract

Due to its low storage cost and fast query speed, cross-modal hashing (CMH) has been widely used for similarity search in multimedia retrieval applications. However, most existing CMH methods are based on hand-crafted features, which might not be optimally compatible with the hash-code learning procedure. As a result, existing CMH methods with hand-crafted features may not achieve satisfactory performance. In this paper, we propose a novel CMH method, called deep cross-modal hashing (DCMH), by integrating feature learning and hash-code learning into the same framework. DCMH is an end-to-end learning framework with deep neural networks, one for each modality, to perform feature learning from scratch. Experiments on three real datasets with image-text modalities show that DCMH can outperform other baselines to achieve the state-of-the-art performance in cross-modal retrieval applications.

1. Introduction

Approximate nearest neighbor (ANN) search [1] plays a fundamental role in machine learning and related applications like information retrieval. Due to its low storage cost and fast retrieval speed, hashing has recently attracted much attention from the ANN research community [17, 34, 9, 15, 26, 10, 29, 21, 28, 4]. The goal of hashing is to map the data points from the original space into a Hamming space of binary codes, where the similarity in the original space is preserved in the Hamming space. By using binary hash codes to represent the original data, the storage cost can be dramatically reduced. Furthermore, we can achieve constant or sub-linear time complexity for search by using the hash codes to construct an index [15]. Hence, hashing has become more and more popular for ANN search on large-scale datasets.

In many applications, the data can have multiple modalities. For example, besides the image content, there also exists text information like tags for the images on Flickr and many other social websites. This kind of data is often called multi-modal data. With the rapid growth of multi-modal data in real applications, especially multimedia applications, multi-modal hashing (MMH) has recently been widely used for ANN search (retrieval) on multi-modal datasets.

Existing MMH methods can be divided into two main categories: multi-source hashing (MSH) [30, 36, 32, 14] and cross-modal hashing (CMH) [18, 35, 7, 22, 3]. The goal of MSH is to learn hash codes by utilizing all the information from multiple modalities. Hence, MSH requires that all the modalities be observed for all data points, including query points and those in the database. In practice, the application of MSH is limited because in many cases it is difficult to acquire all modalities for all data points. On the contrary, the application scenarios of CMH are more flexible than those of MSH. In CMH, the modality of a query point is different from the modality of the points in the database. Furthermore, the query point typically has only one modality, while the points in the database can have one or more modalities. For example, we can use text queries to retrieve images in the database, and we can also use image queries to retrieve texts in the database. Due to its wide applicability, CMH has gained more attention than MSH.
Many CMH methods have recently been proposed. Existing representative methods include cross-modality similarity-sensitive hashing (CMSSH) [2], cross-view hashing (CVH) [18], multi-modal latent binary embedding (MLBE) [39], co-regularized hashing (CRH) [38], semantic correlation maximization (SCM) [35], collective matrix factorization hashing (CMFH) [7], semantic topic multi-modal hashing (STMH) [33] and semantics-preserving hashing (SePH) [22]. Almost all these existing CMH methods are based on hand-crafted features. One shortcoming of these hand-crafted feature based methods is that the feature extraction procedure is independent of the hash-code learning procedure, which means that the hand-crafted features might not be optimally compatible
with the hash-code learning procedure. Hence, these existing CMH methods with hand-crafted features may not achieve satisfactory performance in real applications.

Recently, deep learning with neural networks [19, 16] has been widely used to perform feature learning from scratch with promising performance. There also exist some methods which adopt deep learning for uni-modal hashing [37, 23, 20, 40, 24]. These methods show that an end-to-end deep learning architecture is more compatible with hash-code learning. For the CMH setting, there is also one method, called deep visual-semantic hashing (DVSH) [3], with deep neural networks for feature learning¹. However, DVSH can only be used for a special CMH case where one of the modalities has to be temporal dynamics.

¹ The first version of our DCMH method was submitted to arXiv [13] before this CVPR submission, and hence appeared earlier than DVSH in the public literature.

In this paper, we propose a novel CMH method, called deep cross-modal hashing (DCMH), for cross-modal retrieval applications. The main contributions of DCMH are outlined as follows:

∙ DCMH is an end-to-end learning framework with deep neural networks, one for each modality, to perform feature learning from scratch.

∙ The hash-code learning problem is essentially a discrete learning problem, which is difficult to solve. Hence, most existing CMH methods handle this problem by relaxing the original discrete learning problem into a continuous learning problem. This relaxation procedure may deteriorate the accuracy of the learned hash codes [25]. Unlike these relaxation-based methods, DCMH directly learns the discrete hash codes without relaxation.

∙ Experiments on real datasets with image-text modalities show that DCMH can outperform other baselines to achieve the state-of-the-art performance in cross-modal retrieval applications.

The rest of this paper is organized as follows. Section 2 introduces the problem definition of this paper. We present our DCMH method in Section 3, including the model formulation and the learning algorithm. Experiments are shown in Section 4. Finally, we conclude our work in Section 5.

2. Problem Definition

2.1. Notation

Boldface lowercase letters like $\mathbf{w}$ are used to denote vectors. Boldface uppercase letters like $\mathbf{W}$ are used to denote matrices, and the element in the $i$th row and $j$th column of $\mathbf{W}$ is denoted as $W_{ij}$. The $i$th row of $\mathbf{W}$ is denoted as $\mathbf{W}_{i*}$, and the $j$th column of $\mathbf{W}$ is denoted as $\mathbf{W}_{*j}$. $\mathbf{W}^{T}$ is the transpose of $\mathbf{W}$. We use $\mathbf{1}$ to denote a vector with all elements being 1. $\mathrm{tr}(\cdot)$ and $\|\cdot\|_{F}$ denote the trace of a matrix and the Frobenius norm of a matrix, respectively. $\mathrm{sign}(\cdot)$ is an element-wise sign function defined as follows:

$$\mathrm{sign}(x) = \begin{cases} 1 & x \ge 0, \\ -1 & x < 0. \end{cases}$$

2.2. Cross-Modal Hashing

Although the method proposed in this paper can be easily adapted to cases with more than two modalities, we only focus on the case with two modalities here.

Assume that we have $n$ training entities (data points), each of which has two modalities of features. Without loss of generality, we use image-text datasets for illustration in this paper, which means that each training point has both a text modality and an image modality. We use $\mathbf{X} = \{\mathbf{x}_i\}_{i=1}^{n}$ to denote the image modality, where $\mathbf{x}_i$ can be the hand-crafted features or the raw pixels of image $i$. Moreover, we use $\mathbf{Y} = \{\mathbf{y}_i\}_{i=1}^{n}$ to denote the text modality, where $\mathbf{y}_i$ is typically the tag information related to image $i$. In addition, we are also given a cross-modal similarity matrix $\mathbf{S}$.
$S_{ij} = 1$ if image $\mathbf{x}_i$ and text $\mathbf{y}_j$ are similar, and $S_{ij} = 0$ otherwise. Here, the similarity is typically defined by some semantic information such as class labels. For example, we can say that image $\mathbf{x}_i$ and text $\mathbf{y}_j$ are similar if they share the same class label, and that they are dissimilar if they are from different classes.

Given the above training information $\mathbf{X}$, $\mathbf{Y}$ and $\mathbf{S}$, the goal of cross-modal hashing is to learn two hash functions for the two modalities: $h^{(x)}(\mathbf{x}) \in \{-1,+1\}^{c}$ for the image modality and $h^{(y)}(\mathbf{y}) \in \{-1,+1\}^{c}$ for the text modality, where $c$ is the length of the binary code. These two hash functions should preserve the cross-modal similarity in $\mathbf{S}$. More specifically, if $S_{ij} = 1$, the Hamming distance between the binary codes $\mathbf{b}^{(x)}_i = h^{(x)}(\mathbf{x}_i)$ and $\mathbf{b}^{(y)}_j = h^{(y)}(\mathbf{y}_j)$ should be small. Otherwise, if $S_{ij} = 0$, the corresponding Hamming distance should be large.

Here, we assume that both modalities of features are observed for each point in the training set, although our method can also be easily adapted to settings where some training points have only one modality of features observed. Please note that we only make this assumption for training points. After we have trained the model, we can use the learned model to generate hash codes for query and database points that have either one modality or two modalities, which exactly matches the setting of cross-modal retrieval applications.
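To make the retrieval setting above concrete, here is a minimal sketch (not part of the paper; the multi-hot label format, the variable names, and the helper functions are assumptions) of one common way to build the cross-modal similarity matrix $\mathbf{S}$ from class labels and to compare two $\{-1,+1\}^{c}$ codes by Hamming distance:

```python
import numpy as np

def build_similarity(image_labels, text_labels):
    """S[i, j] = 1 if image i and text j share at least one class label, else 0.
    Labels are assumed to be n x L one-hot / multi-hot matrices."""
    return (image_labels @ text_labels.T > 0).astype(np.int8)

def hamming_distance(b1, b2):
    """Hamming distance between two {-1, +1}^c codes: (c - <b1, b2>) / 2."""
    c = b1.shape[0]
    return int((c - b1 @ b2) // 2)

# Toy usage: two classes, three image-text pairs.
img_lab = np.array([[1, 0], [0, 1], [1, 0]])
txt_lab = np.array([[1, 0], [1, 0], [0, 1]])
S = build_similarity(img_lab, txt_lab)       # 3 x 3 cross-modal similarity matrix

bx = np.array([1, -1,  1, 1])
by = np.array([1, -1, -1, 1])
print(S)
print(hamming_distance(bx, by))              # 1
```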
Figure 1. The end-to-end deep learning framework of our DCMH model.

Table 1. Configuration of the CNN for image modality.
Layer   Configuration
conv1   f. 64 × 11 × 11; st. 4 × 4, pad 0, LRN, ×2 pool
conv2   f. 256 × 5 × 5; st. 1 × 1, pad 2, LRN, ×2 pool
conv3   f. 256 × 3 × 3; st. 1 × 1, pad 1
conv4   f. 256 × 3 × 3; st. 1 × 1, pad 1
conv5   f. 256 × 3 × 3; st. 1 × 1, pad 1, ×2 pool
full6   4096
full7   4096
full8   Hash code length $c$

3. Deep Cross-Modal Hashing

In this section, we present the details of our deep CMH (DCMH) method, including the model formulation and the learning algorithm.

3.1. Model

The whole DCMH model is shown in Figure 1. It is an end-to-end learning framework that seamlessly integrates two parts: the feature learning part and the hash-code learning part. During learning, each part can give feedback to the other part.

3.1.1 Feature Learning Part

The feature learning part contains two deep neural networks, one for the image modality and the other for the text modality.

The deep neural network for the image modality is a convolutional neural network (CNN) adapted from [5]. There are eight layers in this CNN model. The first seven layers are the same as those in CNN-F of [5]. The eighth layer is a fully-connected layer whose output is the learned image features.

Table 1 shows the detailed configuration of the CNN for the image modality. More specifically, the eight layers are divided into five convolutional layers and three fully-connected layers, which are denoted as "conv1 - conv5" and "full6 - full8" in Table 1, respectively. Each convolutional layer is described by several aspects:

∙ "f. num × size × size" denotes the number of convolution filters and their receptive field size.
∙ "st" denotes the convolution stride.
∙ "pad" denotes the number of pixels to add to each side of the input.
∙ "LRN" denotes whether Local Response Normalization (LRN) [16] is applied or not.
∙ "pool" denotes the down-sampling factor.
∙ The number in the fully-connected layers, such as "4096", denotes the number of nodes in that layer. It is also the dimensionality of the output at that layer.

All of the first seven layers use the Rectified Linear Unit (ReLU) [16] as the activation function. For the eighth layer, we choose the identity function as the activation function.

To perform feature learning from text, we first represent each text $\mathbf{y}_j$ as a vector with a bag-of-words (BOW) representation. Then the bag-of-words vectors are used as the input to a deep neural network with two fully-connected layers, denoted as "full1 - full2". The detailed configuration of the deep neural network for text is shown in Table 2, where the configuration shows the number of nodes in each layer. The activation function for the first layer is ReLU, and that for the second layer is the identity function.

Table 2. Configuration of the deep neural network for text modality.
Layer   Configuration
full1   8192
full2   Hash code length $c$
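As an illustration only, the following PyTorch-style sketch mirrors the layer configurations in Tables 1 and 2. The layer sizes are taken from the tables, but the input image size, the LRN parameters, and the class names (ImageNet, TextNet) are assumptions; the paper's actual implementation is built with MatConvNet rather than PyTorch.

```python
import torch
import torch.nn as nn

class ImageNet(nn.Module):
    """CNN-F-style image network following Table 1; full8 outputs c real values
    (identity activation), which are later binarized with sign()."""
    def __init__(self, c):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=0),   # conv1
            nn.ReLU(), nn.LocalResponseNorm(5), nn.MaxPool2d(2),
            nn.Conv2d(64, 256, kernel_size=5, stride=1, padding=2),  # conv2
            nn.ReLU(), nn.LocalResponseNorm(5), nn.MaxPool2d(2),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),  # conv3
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),  # conv4
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),  # conv5
            nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(),      # full6 (input size inferred)
            nn.Linear(4096, 4096), nn.ReLU(),    # full7
            nn.Linear(4096, c),                  # full8, identity activation
        )

    def forward(self, x):
        return self.fc(self.features(x))

class TextNet(nn.Module):
    """Two fully-connected layers on the bag-of-words input, following Table 2."""
    def __init__(self, vocab_size, c):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vocab_size, 8192), nn.ReLU(),  # full1
            nn.Linear(8192, c),                      # full2, identity activation
        )

    def forward(self, y):
        return self.net(y)
```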
Please note that the main goal of this paper is to show that it is possible to design an end-to-end learning framework for cross-modal hashing by using deep neural networks for feature learning from scratch. How to design different neural networks is not the focus of this paper. Other deep neural networks might also be used to perform feature learning for our DCMH model, which we leave for future study.

3.1.2 Hash-Code Learning Part

Let $f(\mathbf{x}_i; \theta_x) \in \mathbb{R}^{c}$ denote the learned image feature for point $i$, which corresponds to the output of the CNN for the image modality. Furthermore, let $g(\mathbf{y}_j; \theta_y) \in \mathbb{R}^{c}$ denote the learned text feature for point $j$, which corresponds to the output of the deep neural network for the text modality. Here, $\theta_x$ is the network parameter of the CNN for the image modality, and $\theta_y$ is the network parameter of the deep neural network for the text modality.

The objective function of DCMH is defined as follows:

$$\min_{\mathbf{B}^{(x)}, \mathbf{B}^{(y)}, \theta_x, \theta_y} \mathcal{J} = -\sum_{i,j=1}^{n} \left( S_{ij}\Theta_{ij} - \log(1 + e^{\Theta_{ij}}) \right) + \gamma\left( \|\mathbf{B}^{(x)} - \mathbf{F}\|_F^2 + \|\mathbf{B}^{(y)} - \mathbf{G}\|_F^2 \right) + \eta\left( \|\mathbf{F}\mathbf{1}\|_F^2 + \|\mathbf{G}\mathbf{1}\|_F^2 \right) \qquad (1)$$
$$s.t. \quad \mathbf{B}^{(x)} \in \{-1, +1\}^{c \times n}, \quad \mathbf{B}^{(y)} \in \{-1, +1\}^{c \times n},$$

where $\mathbf{F} \in \mathbb{R}^{c \times n}$ with $\mathbf{F}_{*i} = f(\mathbf{x}_i; \theta_x)$, $\mathbf{G} \in \mathbb{R}^{c \times n}$ with $\mathbf{G}_{*j} = g(\mathbf{y}_j; \theta_y)$, $\Theta_{ij} = \frac{1}{2}\mathbf{F}_{*i}^{T}\mathbf{G}_{*j}$, $\mathbf{B}^{(x)}_{*i}$ is the binary hash code for image $\mathbf{x}_i$, $\mathbf{B}^{(y)}_{*j}$ is the binary hash code for text $\mathbf{y}_j$, and $\gamma$ and $\eta$ are hyper-parameters.

The first term $-\sum_{i,j=1}^{n}(S_{ij}\Theta_{ij} - \log(1 + e^{\Theta_{ij}}))$ in (1) is the negative log likelihood of the cross-modal similarities, with the likelihood function defined as follows:

$$p(S_{ij} \mid \mathbf{F}_{*i}, \mathbf{G}_{*j}) = \begin{cases} \sigma(\Theta_{ij}) & S_{ij} = 1 \\ 1 - \sigma(\Theta_{ij}) & S_{ij} = 0 \end{cases}$$

where $\Theta_{ij} = \frac{1}{2}\mathbf{F}_{*i}^{T}\mathbf{G}_{*j}$ and $\sigma(\Theta_{ij}) = \frac{1}{1 + e^{-\Theta_{ij}}}$.

It is easy to find that minimizing this negative log likelihood, which is equivalent to maximizing the likelihood, makes the similarity (inner product) between $\mathbf{F}_{*i}$ and $\mathbf{G}_{*j}$ large when $S_{ij} = 1$ and small when $S_{ij} = 0$. Hence, optimizing the first term in (1) can preserve the cross-modal similarity in $\mathbf{S}$ with the image feature representation $\mathbf{F}$ and the text feature representation $\mathbf{G}$.

By optimizing the second term $\gamma(\|\mathbf{B}^{(x)} - \mathbf{F}\|_F^2 + \|\mathbf{B}^{(y)} - \mathbf{G}\|_F^2)$ in (1), we can get $\mathbf{B}^{(x)} = \mathrm{sign}(\mathbf{F})$ and $\mathbf{B}^{(y)} = \mathrm{sign}(\mathbf{G})$. Hence, we can consider $\mathbf{F}$ and $\mathbf{G}$ to be the continuous surrogates of $\mathbf{B}^{(x)}$ and $\mathbf{B}^{(y)}$, respectively. Because $\mathbf{F}$ and $\mathbf{G}$ can preserve the cross-modal similarity in $\mathbf{S}$, the binary hash codes $\mathbf{B}^{(x)}$ and $\mathbf{B}^{(y)}$ can also be expected to preserve the cross-modal similarity in $\mathbf{S}$, which exactly matches the goal of cross-modal hashing.

The third term $\eta(\|\mathbf{F}\mathbf{1}\|_F^2 + \|\mathbf{G}\mathbf{1}\|_F^2)$ in (1) is used to make each bit of the hash code balanced over all the training points. More specifically, the number of $+1$s and the number of $-1$s for each bit over all the training points should be almost the same. This constraint can be used to maximize the information provided by each bit.

In our experiments, we find that better performance can be achieved if the binary codes from the two modalities are set to be the same for the same training points. Hence, we set $\mathbf{B}^{(x)} = \mathbf{B}^{(y)} = \mathbf{B}$. Then, the problem in (1) can be transformed into the following formulation:

$$\min_{\mathbf{B}, \theta_x, \theta_y} \mathcal{J} = -\sum_{i,j=1}^{n} \left( S_{ij}\Theta_{ij} - \log(1 + e^{\Theta_{ij}}) \right) + \gamma\left( \|\mathbf{B} - \mathbf{F}\|_F^2 + \|\mathbf{B} - \mathbf{G}\|_F^2 \right) + \eta\left( \|\mathbf{F}\mathbf{1}\|_F^2 + \|\mathbf{G}\mathbf{1}\|_F^2 \right) \qquad (2)$$
$$s.t. \quad \mathbf{B} \in \{-1, +1\}^{c \times n}.$$

This is the final objective function of our DCMH for learning.

From (2), we can find that the parameters of the deep neural networks ($\theta_x$ and $\theta_y$) and the binary hash code ($\mathbf{B}$) are learned from the same objective function. That is to say, DCMH integrates both feature learning and hash-code learning into the same deep learning framework.
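For concreteness, the following NumPy sketch evaluates the objective in (2) for given network outputs. It is an illustration under the notation above, not the paper's MatConvNet implementation; the function and variable names are my own.

```python
import numpy as np

def dcmh_objective(F, G, B, S, gamma, eta):
    """Objective (2). F, G: c x n real-valued network outputs; B: c x n code in
    {-1, +1}; S: n x n cross-modal similarity matrix; gamma, eta: hyper-parameters."""
    Theta = 0.5 * F.T @ G                                         # Theta_ij = F_i^T G_j / 2
    neg_log_lik = -np.sum(S * Theta - np.logaddexp(0.0, Theta))   # log(1 + e^Theta), stable
    quantization = np.sum((B - F) ** 2) + np.sum((B - G) ** 2)    # ||B - F||_F^2 + ||B - G||_F^2
    ones = np.ones((F.shape[1], 1))
    balance = np.sum((F @ ones) ** 2) + np.sum((G @ ones) ** 2)   # ||F1||_F^2 + ||G1||_F^2
    return neg_log_lik + gamma * quantization + eta * balance
```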
Please note that we only make $\mathbf{B}^{(x)} = \mathbf{B}^{(y)}$ for the training points. After solving the problem in (2), we still need to generate different binary codes $\mathbf{b}^{(x)}_i = h^{(x)}(\mathbf{x}_i)$ and $\mathbf{b}^{(y)}_i = h^{(y)}(\mathbf{y}_i)$ for the two different modalities of the same point $i$ if point $i$ is a query point or a point from the database rather than a training point. This will be further illustrated in Section 3.3.

3.2. Learning

We adopt an alternating learning strategy to learn $\theta_x$, $\theta_y$ and $\mathbf{B}$. Each time we learn one parameter with the other parameters fixed. The whole alternating learning algorithm for DCMH is briefly outlined in Algorithm 1, and the detailed derivation is given in the rest of this subsection.

3.2.1 Learn $\theta_x$, with $\theta_y$ and $\mathbf{B}$ Fixed

When $\theta_y$ and $\mathbf{B}$ are fixed, we learn the CNN parameter $\theta_x$ of the image modality by using a back-propagation (BP) algorithm. As in most existing deep learning methods [16], we utilize stochastic gradient descent (SGD) to learn $\theta_x$ with the BP algorithm. More specifically, in each iteration we sample a mini-batch of points from the training set and then carry out our learning algorithm based on the sampled data. In particular, for each sampled point $\mathbf{x}_i$, we first compute the following gradient:
$$\frac{\partial \mathcal{J}}{\partial \mathbf{F}_{*i}} = \frac{1}{2}\sum_{j=1}^{n}\left( \sigma(\Theta_{ij})\mathbf{G}_{*j} - S_{ij}\mathbf{G}_{*j} \right) + 2\gamma(\mathbf{F}_{*i} - \mathbf{B}_{*i}) + 2\eta\mathbf{F}\mathbf{1}. \qquad (3)$$

Then we can compute $\frac{\partial \mathcal{J}}{\partial \theta_x}$ from $\frac{\partial \mathcal{J}}{\partial \mathbf{F}_{*i}}$ by using the chain rule, based on which BP can be used to update the parameter $\theta_x$.

3.2.2 Learn $\theta_y$, with $\theta_x$ and $\mathbf{B}$ Fixed

When $\theta_x$ and $\mathbf{B}$ are fixed, we also learn the neural network parameter $\theta_y$ of the text modality by using SGD with a BP algorithm. More specifically, for each sampled point $\mathbf{y}_j$, we first compute the following gradient:

$$\frac{\partial \mathcal{J}}{\partial \mathbf{G}_{*j}} = \frac{1}{2}\sum_{i=1}^{n}\left( \sigma(\Theta_{ij})\mathbf{F}_{*i} - S_{ij}\mathbf{F}_{*i} \right) + 2\gamma(\mathbf{G}_{*j} - \mathbf{B}_{*j}) + 2\eta\mathbf{G}\mathbf{1}. \qquad (4)$$

Then we can compute $\frac{\partial \mathcal{J}}{\partial \theta_y}$ from $\frac{\partial \mathcal{J}}{\partial \mathbf{G}_{*j}}$ by using the chain rule, based on which BP can be used to update the parameter $\theta_y$.

3.2.3 Learn $\mathbf{B}$, with $\theta_x$ and $\theta_y$ Fixed

When $\theta_x$ and $\theta_y$ are fixed, the problem in (2) can be reformulated as follows:

$$\max_{\mathbf{B}} \ \mathrm{tr}\left( \mathbf{B}^{T}(\gamma(\mathbf{F} + \mathbf{G})) \right) = \mathrm{tr}(\mathbf{B}^{T}\mathbf{V}) = \sum_{i,j} B_{ij}V_{ij}$$
$$s.t. \quad \mathbf{B} \in \{-1, +1\}^{c \times n},$$

where $\mathbf{V} = \gamma(\mathbf{F} + \mathbf{G})$.

It is easy to find that the binary code $B_{ij}$ should keep the same sign as $V_{ij}$. Therefore, we have:

$$\mathbf{B} = \mathrm{sign}(\mathbf{V}) = \mathrm{sign}(\gamma(\mathbf{F} + \mathbf{G})). \qquad (5)$$

3.3. Out-of-Sample Extension

For any point which is not in the training set, we can obtain its hash code as long as one of its modalities (image or text) is observed. In particular, given the image modality $\mathbf{x}_q$ of point $q$, we can adopt forward propagation to generate the hash code as follows:

$$\mathbf{b}^{(x)}_q = h^{(x)}(\mathbf{x}_q) = \mathrm{sign}(f(\mathbf{x}_q; \theta_x)).$$

Similarly, if point $q$ only has the text modality $\mathbf{y}_q$, we can also generate the hash code $\mathbf{b}^{(y)}_q$ as follows:

$$\mathbf{b}^{(y)}_q = h^{(y)}(\mathbf{y}_q) = \mathrm{sign}(g(\mathbf{y}_q; \theta_y)).$$

Hence, our DCMH model can be used for cross-modal search where the query points have one modality and the points in the database have the other modality.

Algorithm 1 The learning algorithm for DCMH.
Input: Image set $\mathbf{X}$, text set $\mathbf{Y}$, and cross-modal similarity matrix $\mathbf{S}$.
Output: Parameters $\theta_x$ and $\theta_y$ of the deep neural networks, and binary code matrix $\mathbf{B}$.
Initialization: Initialize neural network parameters $\theta_x$ and $\theta_y$, mini-batch size $N_x = N_y = 128$, and iteration numbers $t_x = \lceil n/N_x \rceil$, $t_y = \lceil n/N_y \rceil$.
repeat
    for $iter = 1, 2, \cdots, t_x$ do
        Randomly sample $N_x$ points from $\mathbf{X}$ to construct a mini-batch.
        For each sampled point $\mathbf{x}_i$ in the mini-batch, calculate $\mathbf{F}_{*i} = f(\mathbf{x}_i; \theta_x)$ by forward propagation.
        Calculate the derivative according to (3).
        Update the parameter $\theta_x$ by using back propagation.
    end for
    for $iter = 1, 2, \cdots, t_y$ do
        Randomly sample $N_y$ points from $\mathbf{Y}$ to construct a mini-batch.
        For each sampled point $\mathbf{y}_j$ in the mini-batch, calculate $\mathbf{G}_{*j} = g(\mathbf{y}_j; \theta_y)$ by forward propagation.
        Calculate the derivative according to (4).
        Update the parameter $\theta_y$ by using back propagation.
    end for
    Learn $\mathbf{B}$ according to (5).
until a fixed number of iterations is reached
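Below is a minimal NumPy sketch of the quantities used in one round of Algorithm 1: the gradients (3) and (4), the closed-form $\mathbf{B}$-step (5), and the out-of-sample binarization of Section 3.3. It is a hedged illustration: the vectorized full-batch form (instead of the paper's mini-batch SGD) and the helper names are my assumptions, and the back-propagation of these gradients into $\theta_x$ and $\theta_y$ is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grad_F(F, G, B, S, gamma, eta):
    """Gradient (3) w.r.t. F, computed for all columns at once."""
    Theta = 0.5 * F.T @ G                       # n x n
    ones = np.ones((F.shape[1], 1))
    return (0.5 * G @ (sigmoid(Theta) - S).T
            + 2 * gamma * (F - B)
            + 2 * eta * (F @ ones))             # F1 broadcast over all columns

def grad_G(F, G, B, S, gamma, eta):
    """Gradient (4) w.r.t. G, computed for all columns at once."""
    Theta = 0.5 * F.T @ G
    ones = np.ones((G.shape[1], 1))
    return (0.5 * F @ (sigmoid(Theta) - S)
            + 2 * gamma * (G - B)
            + 2 * eta * (G @ ones))

def update_B(F, G, gamma):
    """Closed-form B-step (5): B = sign(gamma * (F + G)).
    Note np.sign maps 0 to 0, whereas the paper's sign maps 0 to +1."""
    return np.sign(gamma * (F + G))

def encode(net_output):
    """Out-of-sample extension: binarize a real-valued network output."""
    return np.sign(net_output)
```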
4. Experiment

We carry out experiments on image-text datasets to verify the effectiveness of DCMH. DCMH is implemented with the open source deep learning toolbox MatConvNet [31] on an NVIDIA K80 GPU server.

4.1. Datasets

Three datasets, MIRFLICKR-25K [12], IAPR TC-12 [8] and NUS-WIDE [6], are used for evaluation.

The original MIRFLICKR-25K dataset [12] consists of 25,000 images collected from the Flickr website. Each image is associated with several textual tags. Hence, each point is an image-text pair. We select those points which have at least 20 textual tags for our experiment. The text for each point is represented as a 1386-dimensional bag-of-words vector. For the hand-crafted feature based methods, each image is represented by a 512-dimensional GIST feature vector. Furthermore, each point is manually annotated with one of the 24 unique labels.

The IAPR TC-12 dataset [8] consists of 20,000 image-text pairs which are annotated using 255 labels. We use the entire dataset for our experiment. The text for each point is represented as a 2912-dimensional bag-of-words vector. For the hand-crafted feature based methods, each image is represented by a 512-dimensional GIST feature vector