Deep Discrete Supervised Hashing

Qing-Yuan Jiang, Xue Cui, and Wu-Jun Li, Member, IEEE

Abstract— Hashing has been widely used for large-scale search due to its low storage cost and fast query speed. By using supervised information, supervised hashing can significantly outperform unsupervised hashing. Recently, discrete supervised hashing and feature learning based deep hashing have been two representative advances in supervised hashing. On one hand, hashing is essentially a discrete optimization problem. Hence, utilizing supervised information to directly guide the discrete (binary) coding procedure can avoid sub-optimal solutions and improve accuracy. On the other hand, feature learning based deep hashing, which integrates deep feature learning and hash-code learning into an end-to-end architecture, can enhance the feedback between feature learning and hash-code learning. The key in discrete supervised hashing is to adopt supervised information to directly guide the discrete coding procedure. The key in deep hashing is to adopt the supervised information to directly guide the deep feature learning procedure. However, most deep supervised hashing methods cannot use the supervised information to directly guide both the discrete (binary) coding procedure and the deep feature learning procedure in the same framework. In this paper, we propose a novel deep hashing method, called deep discrete supervised hashing (DDSH). DDSH is the first deep hashing method which can utilize pairwise supervised information to directly guide both the discrete coding procedure and the deep feature learning procedure, and thus enhance the feedback between these two important procedures. Experiments on four real datasets show that DDSH can outperform other state-of-the-art baselines, including both discrete hashing and deep hashing baselines, for image retrieval.

Index Terms— Image retrieval, deep learning, deep supervised hashing.

I. INTRODUCTION

DUE to the explosive increase of data in real applications, nearest neighbor (NN) [1] search plays a fundamental role in many areas including image retrieval, computer vision, machine learning and data mining. In many real applications, there is no need to return the exact nearest neighbors for every given query, and approximate nearest neighbor (ANN) search is enough to achieve satisfactory search (retrieval) performance. Hence, ANN search has attracted much attention in recent years [2]–[9].

Manuscript received July 31, 2017; revised April 5, 2018 and May 30, 2018; accepted July 17, 2018. Date of publication August 10, 2018; date of current version September 4, 2018. This work was supported in part by the NSFC under Grant 61472182, in part by the Key Research and Development Program of Jiangsu under Grant BE2016121, and in part by a fund from Tencent. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Gang Hua. (Corresponding author: Wu-Jun Li.) The authors are with the National Key Laboratory for Novel Software Technology, Collaborative Innovation Center of Novel Software Technology and Industrialization, Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China (e-mail: jiangqy@lamda.nju.edu.cn; cuix@lamda.nju.edu.cn; liwujun@nju.edu.cn). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2018.2864894

Over the last decades, hashing has become an active sub-area of ANN search [5], [10], [11]. The goal of hashing is to map the data points to binary codes with hash functions that preserve the similarity of the data points in the original space. With the binary hash code representation, the storage cost for the data points can be dramatically reduced. Furthermore, hashing based ANN search is able to achieve constant or sub-linear time complexity [12]. Hence, hashing has become a promising choice for efficient ANN search in large-scale datasets because of its low storage cost and fast query speed [2]–[4], [6], [12]–[18].
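As an illustration of why binary codes give low storage cost and fast query speed, the following NumPy sketch (ours, not taken from the paper; all names are hypothetical) packs ±1 codes into bytes and computes Hamming distances with XOR plus a popcount table.

```python
import numpy as np

def pack_codes(codes_pm1):
    """Pack {-1,+1} codes of shape (n, c) into uint8 rows of c/8 bytes each."""
    bits = (codes_pm1 > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)

# 256-entry popcount lookup table for byte values.
POPCOUNT = np.array([bin(v).count("1") for v in range(256)], dtype=np.uint8)

def hamming_distances(query_packed, db_packed):
    """Hamming distances between one packed query and all packed database codes."""
    xor = np.bitwise_xor(db_packed, query_packed[None, :])
    return POPCOUNT[xor].sum(axis=1)

# Example: c = 64 bits per image instead of a high-dimensional float feature.
rng = np.random.default_rng(0)
db = np.sign(rng.standard_normal((10000, 64)))      # database codes in {-1,+1}
q = np.sign(rng.standard_normal(64))                # query code
db_p, q_p = pack_codes(db), pack_codes(q[None, :])[0]
top10 = np.argsort(hamming_distances(q_p, db_p))[:10]  # 10 nearest codes
```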
Existing hashing methods can be divided into two main categories: data-independent methods and data-dependent methods. Data-independent hashing methods usually adopt random projections as hash functions to map the data points from the original space into a Hamming space of binary codes. That is to say, these methods do not use any training data to learn hash functions and binary codes. Representative data-independent hashing methods include locality-sensitive hashing (LSH) [2], [19] and kernelized locality-sensitive hashing (KLSH) [14]. Typically, data-independent hashing methods need long binary codes to achieve satisfactory retrieval performance. Data-dependent hashing methods, which are also called learning to hash methods, try to learn the hash functions from data. Recent works [12], [16], [20]–[25] have shown that data-dependent methods can achieve comparable or even better accuracy with shorter binary hash codes, compared with data-independent methods. Therefore, data-dependent methods have received more and more attention.

Existing data-dependent hashing methods can be further divided into unsupervised hashing methods and supervised hashing methods, based on whether supervised information is used for learning or not. Unsupervised hashing methods aim to preserve the metric (Euclidean neighbor) structure among the training data. Representative unsupervised hashing methods include spectral hashing (SH) [26], iterative quantization (ITQ) [20], isotropic hashing (IsoHash) [10], spherical hashing (SPH) [16], inductive manifold hashing (IMH) [21], anchor graph hashing (AGH) [27], discrete graph hashing (DGH) [28], latent semantic minimal hashing (LSMH) [29] and global hashing system (GHS) [30]. Due to the semantic gap [31], unsupervised hashing methods usually cannot achieve satisfactory retrieval performance in real applications. Unlike unsupervised hashing methods, supervised hashing methods aim to embed the data points from the original space into the Hamming space by utilizing supervised information to facilitate hash function learning or hash-code learning. Representative supervised hashing methods include semantic hashing [32], self-taught hashing (STH) [3], supervised hashing with kernels (KSH) [12], latent factor hashing (LFH) [33], fast supervised hashing (FastH) [23], supervised discrete hashing (SDH) [25], column sampling based discrete supervised hashing (COSDISH) [34] and nonlinear discrete hashing (NDH) [17]. By using supervised information for learning, supervised hashing can significantly outperform unsupervised hashing in real applications such as image retrieval. Detailed surveys on learning to hash can be found in [7]–[9].

Hashing is essentially a discrete optimization problem. Because it is difficult to directly solve the discrete optimization problem, early hashing methods [15], [20], [22] adopt relaxation strategies to solve an approximate continuous problem, which might result in a sub-optimal solution. Specifically, relaxation based hashing methods utilize supervised information to guide continuous hash code learning at the first stage. Then they convert the continuous hash codes into binary codes by using rounding techniques at the second stage. Recently, several discrete hashing methods, e.g., DGH [28], FastH [23], SDH [25] and COSDISH [34], which try to directly learn the discrete binary hash codes, have been proposed with promising performance. In particular, several discrete supervised hashing methods, including FastH [23], SDH [25], and COSDISH [34], have shown better performance than traditional relaxation-based continuous hashing methods. The key in discrete supervised hashing is to adopt supervised information to directly guide the discrete coding procedure, i.e., the discrete binary code learning procedure.

Most existing supervised hashing methods, including those introduced above and some deep hashing methods [17], [32], [35], are based on hand-crafted features. One major shortcoming of these methods is that they cannot perform feature learning. That is, the hand-crafted features might not be optimally compatible with the hash code learning procedure. Hence, besides the progress on discrete hashing, there has been another line of progress in supervised hashing, which is called feature learning based deep hashing [11], [36]–[43]. Representative feature learning based deep hashing methods include convolutional neural network hashing (CNNH) [36], network in network hashing (NINH) [37], deep semantic ranking hashing (DSRH) [38], deep similarity comparison hashing (DSCH) [11], deep pairwise-supervised hashing (DPSH) [42], deep hashing network (DHN) [41], deep supervised hashing (DSH) [40], deep quantization network (DQN) [41] and deep supervised discrete hashing (DSDH) [44]. Deep hashing adopts deep learning [45], [46] for supervised hashing. In particular, most deep hashing methods adopt deep feature learning to learn a feature representation for hashing. Many deep hashing methods integrate deep feature representation learning and hash-code learning into an end-to-end architecture. Under this architecture, the feature learning procedure and the hash-code learning procedure can give feedback to each other during learning. Many works [40], [42] have shown that using the supervised information to directly guide the deep feature learning procedure can achieve better performance than other strategies [36] which do not use supervised information to directly guide the deep feature learning procedure. Hence, the key in deep hashing is to adopt the supervised information to directly guide the deep feature learning procedure.

Both discrete supervised hashing and feature learning based deep hashing have achieved promising performance in many real applications. However, most deep supervised hashing methods cannot use the supervised information to directly guide both the discrete (binary) coding procedure and the deep feature learning procedure in the same framework. In this paper, we propose a novel deep hashing method, called deep discrete supervised hashing (DDSH). The main contributions of DDSH are outlined as follows:

• DDSH is the first deep hashing method which can utilize pairwise supervised information to directly guide both the discrete coding procedure and the deep feature learning procedure.
• By integrating the discrete coding procedure and the deep feature learning procedure into the same end-to-end framework, these two important procedures can give feedback to each other to make the hash codes and the feature representation more compatible.
• Experiments on four real datasets show that our proposed DDSH can outperform other state-of-the-art baselines, including both discrete hashing and deep hashing baselines.

The rest of this paper is organized as follows. In Section II, we briefly introduce the notation and problem definition used in this paper. Then we describe DDSH in Section III. We discuss the difference between DDSH and existing deep hashing methods with pairwise labels in Section IV. In Section V, we evaluate DDSH on four datasets by carrying out the Hamming ranking task and the hash lookup task. Finally, we conclude the paper in Section VI.

II. NOTATION AND PROBLEM DEFINITION

A. Notation

Some representative notations used in this paper are outlined in Table I. More specifically, we use boldface uppercase letters like B to denote matrices and boldface lowercase letters like b to denote vectors. The (i, j)th element in matrix B is denoted as B_ij. B^T is the transpose of B and ||b||_2 is the Euclidean norm of vector b. We use capital Greek letters like Ω to denote sets of indices. We use the symbol • to denote the Hadamard product (i.e., element-wise product). The square of a vector (or a matrix) like b^2 indicates the element-wise square, i.e., b^2 = b • b.

TABLE I
NOTATION

Notation    Meaning
B           boldface uppercase, matrix
b           boldface lowercase, vector
B_ij        the (i, j)th element in matrix B
B^T         transpose of matrix B
||b||_2     Euclidean norm of vector b
Ω           capital Greek letter, set of indices
•           Hadamard product (i.e., element-wise product)
b^2         element-wise square of a vector, i.e., b^2 = b • b
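As a quick illustration of the notation in Table I, the following NumPy snippet (ours, not part of the paper) shows the transpose, Euclidean norm, Hadamard product and element-wise square used throughout the paper.

```python
import numpy as np

B = np.array([[1.0, -1.0], [2.0, 0.5]])   # boldface uppercase: matrix
b = np.array([3.0, -4.0])                 # boldface lowercase: vector

B_T = B.T                                 # B^T: transpose of B
norm_b = np.linalg.norm(b)                # ||b||_2: Euclidean norm, here 5.0
hadamard = b * b                          # b . b: Hadamard (element-wise) product
b_squared = b ** 2                        # b^2 = b . b, i.e. [9.0, 16.0]
assert np.allclose(hadamard, b_squared)
```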
Fig. 1. The model architecture of DDSH. DDSH is an end-to-end deep learning framework which consists of two main components: the loss part and the feature learning part. The loss part contains the discrete coding procedure (to learn the binary codes B^Ω), and the feature learning part contains the deep feature learning procedure (to learn F(x; Θ) for the points x indexed by Γ). During each iteration, we adopt an alternating strategy to learn the binary codes and the feature representation alternately, both of which are directly guided by supervised information. Hence, DDSH can enhance the feedback between the discrete coding procedure and the deep feature learning procedure.

B. Problem Definition

Although supervised information can also be triplet labels [11], [37]–[39] or pointwise labels [25], in this paper we only focus on the setting with pairwise labels [36], [40]–[43], which is a popular setting in supervised hashing. The technique in this paper can also be adapted to settings with triplet labels, which will be pursued in our future work.

We use X = {x_i}_{i=1}^n to denote a set of training points. In supervised hashing with pairwise labels, the supervised information S ∈ {−1, 1}^{n×n} between data points is also available for the training procedure, where S_ij is defined as follows:

\[
S_{ij} =
\begin{cases}
1, & \text{if } \mathbf{x}_i \text{ and } \mathbf{x}_j \text{ are similar}, \\
-1, & \text{otherwise}.
\end{cases}
\]

Supervised hashing aims at learning a hash function to map the data points from the original space into the binary code space (also called Hamming space), with the semantic (supervised) similarity in S preserved in the binary code space. We define the hash function as h(x) ∈ {−1, +1}^c, ∀x ∈ X, where c is the binary code length. The Hamming distance between binary codes b_i = h(x_i) and b_j = h(x_j) is defined as follows:

\[
\mathrm{dist}_H(\mathbf{b}_i, \mathbf{b}_j) = \frac{1}{2}\big(c - \mathbf{b}_i^T \mathbf{b}_j\big).
\]

To preserve the similarity between data points, the Hamming distance between the binary codes b_i = h(x_i) and b_j = h(x_j) should be relatively small if the data points x_i and x_j are similar, i.e., S_ij = 1. On the contrary, the Hamming distance between the binary codes b_i = h(x_i) and b_j = h(x_j) should be relatively large if the data points x_i and x_j are dissimilar, i.e., S_ij = −1. In other words, the goal of supervised hashing is to solve the following problem:

\[
\min_{h}\; \mathcal{L}(h) = \sum_{i,j=1}^{n} L\big(h(\mathbf{x}_i), h(\mathbf{x}_j); S_{ij}\big) = \sum_{i,j=1}^{n} L\big(\mathbf{b}_i, \mathbf{b}_j; S_{ij}\big), \qquad (1)
\]

where L(·) is a loss function.

There have appeared various loss functions in supervised hashing. For example, KSH [12] uses the L2 loss function, which is defined as follows:

\[
L\big(\mathbf{b}_i, \mathbf{b}_j; S_{ij}\big) = \Big(S_{ij} - \frac{1}{c}\sum_{m=1}^{c} b_i^m b_j^m\Big)^2,
\]

where b_i^m is the mth element of vector b_i. Please note that our DDSH is able to adopt many different kinds of loss functions. In this paper, we only use the L2 loss function as an example, and leave other loss functions for further study in future work.
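The following NumPy sketch (ours) implements the quantities defined above: the pairwise supervised information S_ij, the Hamming distance, and the KSH-style L2 loss. Building S from class labels (same label means similar) is one common construction and an assumption here; the paper only requires a pairwise similarity matrix in {−1, +1}^{n×n}.

```python
import numpy as np

def pairwise_similarity(labels):
    """S_ij = 1 if x_i and x_j share a label, -1 otherwise (one common construction)."""
    same = labels[:, None] == labels[None, :]
    return np.where(same, 1, -1)

def hamming_distance(b_i, b_j):
    """dist_H(b_i, b_j) = (c - b_i^T b_j) / 2 for codes in {-1,+1}^c."""
    c = len(b_i)
    return 0.5 * (c - b_i @ b_j)

def l2_pair_loss(b_i, b_j, s_ij):
    """KSH-style L2 loss: (S_ij - (1/c) * b_i^T b_j)^2."""
    c = len(b_i)
    return (s_ij - (b_i @ b_j) / c) ** 2

labels = np.array([0, 0, 1])
S = pairwise_similarity(labels)        # 3 x 3 matrix in {-1, +1}
b1 = np.array([1, -1, 1, 1])
b2 = np.array([1, -1, -1, 1])
print(hamming_distance(b1, b2))        # 1.0
print(l2_pair_loss(b1, b2, S[0, 1]))   # (1 - 2/4)^2 = 0.25
```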
III. DEEP DISCRETE SUPERVISED HASHING

In this section, we describe the details of DDSH, including the model architecture and the learning algorithm.

A. Model Architecture

DDSH is an end-to-end deep hashing method which is able to simultaneously perform feature learning and hash code learning in the same framework. The model architecture of DDSH is shown in Figure 1, which contains two important components: the loss part and the feature learning part. The loss part contains the discrete coding procedure, which aims to learn optimal binary codes to preserve the semantic pairwise similarity. The feature learning part contains the deep feature learning procedure, which tries to learn a compatible deep neural network to extract deep representations from scratch. For DDSH, the discrete coding procedure and the deep feature learning procedure are integrated into an end-to-end framework. More importantly, both procedures are directly guided by supervised information.

1) Loss Part: Inspired by COSDISH [34], we use a column-sampling method to split the whole training set into two parts. More specifically, we randomly sample a subset Ω of Φ = {1, 2, ..., n} and generate Γ = Φ \ Ω (|Γ| > |Ω|). Then we split the whole training set X into two subsets X^Ω and X^Γ, where X^Ω and X^Γ denote the training data points indexed by Ω and Γ, respectively. Similarly, we sample the |Ω| columns of S whose indices are in Ω.
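A minimal sketch of this column-sampling step might look as follows; the variable names and the toy data shapes are ours, not from the paper.

```python
import numpy as np

def sample_split(n, omega_size, rng):
    """Sample an index set Omega and its complement Gamma = Phi \\ Omega."""
    phi = np.arange(n)                               # Phi = {0, 1, ..., n-1}
    omega = rng.choice(n, size=omega_size, replace=False)
    gamma = np.setdiff1d(phi, omega)                 # Gamma = Phi \ Omega, |Gamma| > |Omega|
    return omega, gamma

rng = np.random.default_rng(0)
n, omega_size = 1000, 64
X = rng.standard_normal((n, 3, 32, 32))              # raw training images (toy shapes)
S = rng.choice([-1, 1], size=(n, n))                 # pairwise supervised information

omega, gamma = sample_split(n, omega_size, rng)
X_omega, X_gamma = X[omega], X[gamma]                # X^Omega and X^Gamma
S_cols = S[:, omega]                                 # the |Omega| sampled columns of S
```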
Then, we approximate the original problem in (1) by only using the sampled columns of S:

\[
\min_{h}\; \mathcal{L}(h) = \sum_{i\in\Omega}\sum_{j=1}^{n} L\big(h(\mathbf{x}_i), h(\mathbf{x}_j); S_{ij}\big)
= \sum_{\mathbf{x}_i\in X^{\Omega}}\sum_{\mathbf{x}_j\in X^{\Gamma}} L\big(h(\mathbf{x}_i), h(\mathbf{x}_j); S_{ij}\big)
+ \sum_{\mathbf{x}_i,\mathbf{x}_j\in X^{\Omega}} L\big(h(\mathbf{x}_i), h(\mathbf{x}_j); S_{ij}\big). \qquad (2)
\]

Then we introduce auxiliary variables to solve problem (2). More specifically, we utilize auxiliary variables B^Ω = {b_i | i ∈ Ω} with b_i ∈ {−1, +1}^c to replace part of the binary codes generated by the hash function, i.e., h(X^Ω). Here, h(X^Ω) = {h(x_i) | x_i ∈ X^Ω}. Then we rewrite problem (2) as follows:

\[
\min_{h,\, B^{\Omega}}\; \mathcal{L}(h, B^{\Omega}) = \sum_{i\in\Omega}\sum_{\mathbf{x}_j\in X^{\Gamma}} L\big(\mathbf{b}_i, h(\mathbf{x}_j); S_{ij}\big)
+ \sum_{i,j\in\Omega} L\big(\mathbf{b}_i, \mathbf{b}_j; S_{ij}\big)
\quad \text{s.t. } \mathbf{b}_i \in \{-1,+1\}^c,\; \forall i \in \Omega. \qquad (3)
\]

The problem in (3) is the final loss function (objective) learned by DDSH. We can see that the whole training set is divided into two subsets X^Ω and X^Γ. The binary codes of X^Ω, i.e., B^Ω, are directly learned from the objective function in (3), but the binary codes of X^Γ are generated by the hash function h(·). h(·) is defined based on the output of the deep feature learning part, which will be introduced in the following subsection.

The learning of B^Ω contains the discrete coding procedure, which is directly guided by the supervised information. The learning of h(·) contains the deep feature learning procedure, which is also directly guided by the supervised information. Hence, our DDSH can utilize supervised information to directly guide both the discrete coding procedure and the deep feature learning procedure in the same end-to-end deep framework. This is different from existing deep hashing methods, which either use a relaxation strategy without discrete coding or do not use the supervised information to directly guide the discrete coding procedure.

Please note that “directly guided” in this paper means that the supervised information is directly included in the corresponding terms of the loss function. For example, the supervised information S_ij is directly included in all terms about the discrete codes B^Ω in (3), which means that the discrete coding procedure is directly guided by the supervised information. Furthermore, the supervised information S_ij is also directly included in the term about the deep feature learning function h(x_j) in (3), which means that the deep feature learning procedure is also directly guided by the supervised information. To the best of our knowledge, DDSH is the first deep hashing method which can utilize pairwise supervised information to directly guide both the discrete coding procedure and the deep feature learning procedure, and thus enhance the feedback between these two important procedures.

2) Feature Learning Part: The binary codes of X^Γ are generated by the hash function h(·), which is defined based on the output of the deep feature learning part. More specifically, we define our hash function as h(x) = sign(F(x; Θ)), where sign(·) is the element-wise sign function, F(x; Θ) denotes the output of the feature learning part and Θ denotes all parameters of the deep neural network.

We adopt a convolutional neural network (CNN) from [47], i.e., CNN-F, as our deep feature learning part. We replace the last layer of CNN-F with a fully-connected layer that projects the output of the second-to-last layer into the R^c space. More specifically, the feature learning part contains 5 convolutional layers (“conv1”–“conv5”) and 3 fully-connected layers (“full6”–“full8”). The detailed configuration of the 5 convolutional layers is shown in Table II. In Table II, “filter size” denotes the number of convolutional filters and their receptive field size, “stride” specifies the convolutional stride, “pad” indicates the number of pixels to add to each side of the input, “LRN” denotes whether Local Response Normalization (LRN) [45] is applied or not, and “pool” denotes the down-sampling factor.

TABLE II
CONFIGURATION OF THE CONVOLUTIONAL LAYERS IN DDSH

Layer    filter size      stride    pad    LRN    pool
conv1    64 x 11 x 11     4 x 4     0      yes    2 x 2
conv2    256 x 5 x 5      1 x 1     2      yes    2 x 2
conv3    256 x 3 x 3      1 x 1     1      no     -
conv4    256 x 3 x 3      1 x 1     1      no     -
conv5    256 x 3 x 3      1 x 1     1      no     2 x 2

The detailed configuration of the 3 fully-connected layers is shown in Table III, where the “Configuration” column shows the number of nodes in each layer.

TABLE III
CONFIGURATION OF THE FULLY-CONNECTED LAYERS IN DDSH

Layer    Configuration
full6    4096
full7    4096
full8    hash code length c

We adopt the Rectified Linear Unit (ReLU) [45] as the activation function for the first seven layers. For the last layer, we utilize the identity function as the activation function.
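A sketch of the feature learning part under stated assumptions: CNN-F is not shipped with torchvision, so the PyTorch snippet below uses torchvision's AlexNet (also a 5-convolutional / 3-fully-connected architecture) as a stand-in backbone, replaces its last layer with a c-dimensional fully-connected layer ("full8"), and applies tanh during training and sign(·) when encoding. All names here are ours.

```python
import torch
import torch.nn as nn
from torchvision import models

class HashNet(nn.Module):
    """Feature learning part: a CNN backbone whose last layer outputs R^c."""

    def __init__(self, code_length: int):
        super().__init__()
        backbone = models.alexnet()                            # stand-in for CNN-F
        backbone.classifier[6] = nn.Linear(4096, code_length)  # "full8": project to R^c
        self.F = backbone                                      # F(x; Theta)

    def forward(self, x):
        # Relaxed hash function used during training (see Section III-B).
        return torch.tanh(self.F(x))

    @torch.no_grad()
    def encode(self, x):
        # h(x) = sign(F(x; Theta)), the actual hash function.
        return torch.sign(self.F(x))

net = HashNet(code_length=32)
images = torch.randn(4, 3, 224, 224)     # a toy mini-batch
codes = net.encode(images)               # binary codes in {-1, 0, +1}^{4 x 32} (0 only at exact zeros)
```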
B. Learning

After randomly sampling Ω at each iteration, we utilize an alternating learning strategy to solve problem (3). More specifically, each time we learn one of the variables B^Ω and h(F(x; Θ)) with the other fixed. When h(F(x; Θ)) is fixed, we directly learn the discrete hash codes B^Ω over the binary variables. When B^Ω is fixed, we update the parameter Θ of the deep neural network.

1) Learn B^Ω With h(F(x; Θ)) Fixed: When h(F(x; Θ)) is fixed, it is easy to transform problem (3) into a binary quadratic programming (BQP) problem, as in TSH [22]. Each time we optimize one bit for all points. Then, the optimization of the kth bit of all points in B^Ω is given by:

\[
\min_{\mathbf{b}^k}\; [\mathbf{b}^k]^T \mathbf{Q}^k \mathbf{b}^k + [\mathbf{b}^k]^T \mathbf{p}^k
\quad \text{s.t. } \mathbf{b}^k \in \{-1,+1\}^{|\Omega|}, \qquad (4)
\]

where b^k denotes the kth column of B^Ω, and

\[
Q^k_{ij}\big|_{i\neq j} = -2\Big(c\, S_{ij} - \sum_{m=1}^{k-1} b_i^m b_j^m\Big), \qquad Q^k_{ii} = 0,
\]
\[
p^k_i = -2 \sum_{l=1}^{|\Gamma|} B^{\Gamma}_{lk}\Big(c\, S_{li} - \sum_{m=1}^{k-1} B^{\Gamma}_{lm} b_i^m\Big).
\]

Here, b_i^m denotes the mth bit of b_i, B^Γ denotes the binary codes of X^Γ generated by the fixed hash function, and p_i^k denotes the ith element of p^k. Following COSDISH, we can easily solve problem (4) by transforming the BQP problem into an equivalent clustering problem [48].
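The following NumPy sketch (variable names are ours) constructs the BQP coefficients Q^k and p^k of problem (4) for the kth bit; solving the resulting BQP, e.g., via the clustering reformulation used by COSDISH [48], is not shown.

```python
import numpy as np

def build_bqp(k, B_omega, B_gamma, S_oo, S_go, c):
    """Build Q^k and p^k of problem (4) for the k-th bit (k is 0-based here).

    B_omega: (|Omega|, c) codes of X^Omega, bits < k already updated.
    B_gamma: (|Gamma|, c) fixed codes of X^Gamma from the current network.
    S_oo:    (|Omega|, |Omega|) block of S;  S_go: (|Gamma|, |Omega|) block of S.
    """
    prev_o = B_omega[:, :k]                    # first k-1 bits of B^Omega
    prev_g = B_gamma[:, :k]                    # first k-1 bits of B^Gamma

    # Q^k_{ij} = -2 * (c * S_ij - sum_{m<k} b_i^m b_j^m), with zero diagonal.
    Q = -2.0 * (c * S_oo - prev_o @ prev_o.T)
    np.fill_diagonal(Q, 0.0)

    # p^k_i = -2 * sum_l B^Gamma_{lk} * (c * S_li - sum_{m<k} B^Gamma_{lm} b_i^m)
    residual = c * S_go - prev_g @ prev_o.T    # shape (|Gamma|, |Omega|)
    p = -2.0 * (B_gamma[:, k][:, None] * residual).sum(axis=0)
    return Q, p

def bqp_objective(b_k, Q, p):
    """Objective of (4) for a candidate bit vector b^k in {-1,+1}^{|Omega|}."""
    return b_k @ Q @ b_k + b_k @ p
```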
2) Learn h(F(x; Θ)) With B^Ω Fixed: Because the derivative of the hash function h(x) = sign(F(x; Θ)) is 0 everywhere except at 0, we cannot use back-propagation (BP) methods to update the neural network parameters. So we relax sign(·) as h(x) = tanh(F(x; Θ)), inspired by Song et al. [24]. Then we optimize the following problem:

\[
\min_{h}\; \mathcal{L}(h) = \sum_{i\in\Omega}\sum_{\mathbf{x}_j\in X^{\Gamma}} L\big(\mathbf{b}_i, h(\mathbf{x}_j); S_{ij}\big)
\quad \text{s.t. } h(\mathbf{x}_j) = \tanh\big(F(\mathbf{x}_j; \Theta)\big). \qquad (5)
\]

To learn the CNN parameter Θ, we utilize a back-propagation algorithm. That is, each time we sample a mini-batch of data points and then use the BP algorithm based on the sampled data.

We define the output of the CNN as z_j = F(x_j; Θ) and a_j = tanh(z_j). Then we can compute the gradients with respect to a_j and z_j as follows:

\[
\frac{\partial \mathcal{L}}{\partial \mathbf{a}_j} = \sum_{i\in\Omega} \frac{\partial L(\mathbf{b}_i, \mathbf{a}_j; S_{ij})}{\partial \mathbf{a}_j}
= \sum_{i\in\Omega} 2\big(\mathbf{a}_j^T \mathbf{b}_i - S_{ij}\big)\mathbf{b}_i \qquad (6)
\]

and

\[
\frac{\partial \mathcal{L}}{\partial \mathbf{z}_j} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}_j} \bullet \big(1 - \mathbf{a}_j^2\big)
= \sum_{i\in\Omega} 2\big(\mathbf{a}_j^T \mathbf{b}_i - S_{ij}\big)\mathbf{b}_i \bullet \big(1 - \mathbf{a}_j^2\big), \qquad (7)
\]

where 1 − a_j^2 is the derivative of tanh(z_j) with respect to z_j. Then, we can use the chain rule to compute ∂L/∂Θ based on ∂L/∂a_j and ∂L/∂z_j.

We summarize the whole learning algorithm for DDSH in Algorithm 1.

Algorithm 1 The Learning Algorithm for DDSH
Input: training set X; code length c; supervised information S.
Output: parameter Θ of the deep neural network.
Initialization: initialize the neural network parameter Θ, the mini-batch size M and the iteration numbers T_out, T_in; initialize B = {b_i | i = 1, 2, ..., n}.
for iter = 1, 2, ..., T_out do
    Randomly sample Ω and set Γ = Φ \ Ω.
    Split the training set X into X^Ω and X^Γ.
    Split B into B^Ω and B^Γ.
    for epoch = 1, 2, ..., T_in do
        for k = 1, 2, ..., c do
            Construct the BQP problem for the kth bit using (4).
            Construct the clustering problem to solve the BQP problem for the kth bit.
        end for
        for t = 1, 2, ..., |Γ|/M do
            Randomly sample M data points from X^Γ to construct a mini-batch.
            Calculate h(x_j) for each data point x_j in the mini-batch by forward propagation.
            Calculate the gradient according to (7).
            Update the parameter Θ by using back propagation.
            Update b_j = sign(h(x_j)) for each data point x_j in the mini-batch.
        end for
    end for
end for

C. Out-of-Sample Extension for Unseen Data Points

After training our DDSH model, we can adopt the learned deep hashing framework to predict the binary code for any data point unseen during training. More specifically, given any point x_q ∉ X, we use the following formula to predict its binary code:

\[
\mathbf{b}_q = h(\mathbf{x}_q) = \mathrm{sign}\big(F(\mathbf{x}_q; \Theta)\big),
\]

where Θ is the deep neural network parameter learned by the DDSH model.

IV. COMPARISON TO RELATED WORK

Existing feature learning based deep hashing methods with pairwise labels either use a relaxation strategy without discrete coding or do not use the supervised information to directly guide the discrete coding procedure. For example, CNNH [36] is a two-step method which adopts a relaxation strategy to learn continuous codes in the first stage and performs feature learning in the second stage. The feature learning procedure in CNNH is not directly guided by supervised information. NINH [37], DHN [41] and DSH [40] adopt a relaxation strategy to learn continuous codes. DPSH [42] and DQN [41] can learn binary codes in the training procedure. However, DPSH and DQN do not utilize the supervised information to directly guide the