IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. XX, NO. X, XXX 2019

Discrete Latent Factor Model for Cross-Modal Hashing

Qing-Yuan Jiang, Wu-Jun Li, Member, IEEE

Manuscript received July 20, 2018; revised December 30, 2018; accepted January 22, 2019. This work was supported in part by the NSFC-NRF Joint Research Project under Grant 61861146001, and in part by the NSFC under Grant 61472182. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Xiaochun Cao. (Corresponding author: Wu-Jun Li)
All authors are with the National Key Laboratory for Novel Software Technology, Collaborative Innovation Center of Novel Software Technology and Industrialization, Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China (E-mail: jiangqy@lamda.nju.edu.cn; liwujun@nju.edu.cn).

Abstract—Due to its storage and retrieval efficiency, cross-modal hashing (CMH) has been widely used for cross-modal similarity search in many multimedia applications. According to the training strategy, existing CMH methods can be mainly divided into two categories: relaxation-based continuous methods and discrete methods. In general, the training of relaxation-based continuous methods is faster than that of discrete methods, but the accuracy of relaxation-based continuous methods is not satisfactory. On the contrary, the accuracy of discrete methods is typically better than that of relaxation-based continuous methods, but the training of discrete methods is very time-consuming. In this paper, we propose a novel CMH method, called discrete latent factor model based cross-modal hashing (DLFH), for cross-modal similarity search. DLFH is a discrete method which can directly learn the binary hash codes for CMH. At the same time, the training of DLFH is efficient. Experiments show that DLFH can achieve significantly better accuracy than existing methods, and the training time of DLFH is comparable to that of relaxation-based continuous methods, which are much faster than existing discrete methods.

Index Terms—Approximate nearest neighbor, cross-modal retrieval, hashing, multimedia.

I. INTRODUCTION

NEAREST neighbor (NN) search plays a fundamental role in many areas including machine learning, information retrieval, computer vision and so on. In many real applications, there is no need to return exact nearest neighbors for every given query, and approximate nearest neighbor (ANN) search is enough to achieve satisfactory performance [1]–[3]. Because ANN search might be much faster than exact NN search, ANN search has become an active research topic with a wide range of applications, especially for large-scale problems [1]–[3].

Among existing ANN search methods, hashing methods have attracted much more attention due to their storage and retrieval efficiency in real applications [4]–[23]. The goal of hashing is to embed data points from the original space into a Hamming space where the similarity is preserved. More specifically, in the Hamming space, each data point will be represented as a binary code. Based on the binary code representation, the storage cost can be dramatically reduced, and furthermore we can achieve constant or sub-linear search speed which is much faster than the search speed in the original space [12], [13], [24], [25].

Early hashing methods are mainly proposed for uni-modal data to perform uni-modal similarity search.
In recent years, with the explosive growth of multimedia data in real applications, multi-modal similarity search has attracted a lot of attention. For example, given a text query, a multi-modal similarity search system can return the nearest images or videos in the database. To achieve efficient search performance for large-scale problems, multi-modal hashing (MMH) has been proposed for multi-modal search [26]–[28].

Existing MMH methods can be divided into two major categories: multi-source hashing (MSH) [15], [29]–[31] and cross-modal hashing (CMH) [26], [27], [32]–[34]. MSH methods aim to learn binary hash codes by utilizing information from multiple modalities for each point. In other words, all these multiple modalities should be observed for all data points, including the query points and those in the database, under MSH settings. Because it is usually difficult to observe all the modalities in many real applications, the application scenarios for MSH methods are limited. Unlike MSH methods, CMH methods usually require only one modality for a query point to perform search in a database with other modalities. The application scenarios for CMH are more flexible than those for MSH. For example, CMH can perform text-to-image or image-to-text retrieval tasks in real applications. Hence, CMH has gained more attention than MSH [26], [27].

A lot of CMH methods have been proposed in recent years. Based on whether supervised information is used during the training procedure, existing CMH methods can be further divided into two categories: unsupervised CMH and supervised CMH. Unsupervised CMH methods directly explore data features without supervised information to learn binary codes (or hash functions). Representative unsupervised CMH methods include canonical correlation analysis iterative quantization (CCA-ITQ) [6], collective matrix factorization hashing (CMFH) [27], alternating co-quantization (ACQ) [35] and unsupervised generative adversarial cross-modal hashing (UGACH) [36]. Supervised CMH tries to learn the hash function by utilizing supervised information. As supervised CMH methods can incorporate semantic labels to mitigate the semantic gap, they can achieve better accuracy than unsupervised CMH methods. Representative supervised CMH methods include multi-modal latent binary embedding (MLBE) [37], semantic correlation maximization (SCM) [26], semantics preserving
hashing (SePH) [38], supervised matrix factorization hashing (SMFH) [39], binary set hashing (BSH) [40], deep cross-modal hashing (DCMH) [34], cross-modal Hamming hashing (CMHH) [41], and generalized semantic preserving hashing (GSPH) [42].

According to the training strategy, existing CMH methods can be divided into two categories: relaxation-based continuous methods and discrete methods. Hashing is essentially a discrete learning problem. To avoid the difficulty caused by discrete learning, relaxation-based continuous methods try to solve a relaxed continuous problem with some relaxation strategy. Representative continuous methods include CMFH [27], cross-view hashing (CVH) [43], SCM [26], SMFH [39] and GSPH [42]. Discrete methods try to directly solve the discrete problem without continuous relaxation. Representative discrete methods include CCA-ITQ [6], ACQ [35], MLBE [37], predictable dual-view hashing (PDH) [44] and SePH [38]. In general, the training of relaxation-based continuous methods is faster than that of discrete methods, but the accuracy of relaxation-based continuous methods is not satisfactory. On the contrary, the accuracy of discrete methods is typically better than that of relaxation-based continuous methods, but the training of discrete methods is time-consuming.

In this paper, we propose a novel CMH method, called discrete latent factor model based cross-modal hashing (DLFH), for cross-modal similarity search. The contributions of DLFH are outlined as follows:

• DLFH is a supervised CMH method, and in DLFH a novel discrete latent factor model is proposed to model the supervised information.
• DLFH is a discrete method which can directly learn the binary hash codes without continuous relaxation.
• A novel discrete learning algorithm is proposed for DLFH, which can be proved to be convergent. Furthermore, the implementation of DLFH is simple.
• The training (learning) of DLFH is still efficient although DLFH is a discrete method.
• Experiments on real datasets show that DLFH can achieve significantly better accuracy than existing methods, including both relaxation-based continuous methods and existing discrete methods. Experimental results also show that the training speed of DLFH is comparable to that of relaxation-based continuous methods, and is much faster than that of existing discrete methods.

The rest of this paper is organized as follows. In Section II, we briefly review the related works. In Section III, we describe the notations and problem definition. In Section IV, we present our DLFH in detail, including model formulation and learning algorithm. In Section V, we carry out experiments for evaluation on three widely used datasets. At last, we conclude the paper in Section VI.

II. RELATED WORK

In this section, we briefly review the related works of cross-modal hashing, including continuous cross-modal hashing and discrete cross-modal hashing.

A. Continuous Cross-Modal Hashing

Continuous CMH methods usually adopt a relaxation strategy for learning. More specifically, these methods adopt a relaxation strategy to learn continuous representations at the first stage, and then utilize some rounding techniques to generate discrete binary codes at the second stage. Representative continuous CMH methods include CMFH [27], BSH [40], SCM [26], SMFH [39] and GSPH [42]. CMFH is an unsupervised CMH method which adopts collective matrix factorization to learn cross-view hash functions.
BSH, SCM, SMFH and GSPH are supervised CMH methods. BSH tries to preserve the inter-modal and intra-modal similarity by learning two projections and taking the geometric structures of each modality into account. SCM learns two hash functions by integrating semantic labels into the learning procedure. To perform efficient training and fully utilize supervised information, SCM proposes an approximation method to avoid explicit computation of the pairwise similarity matrix. SMFH is the supervised version of CMFH, which integrates supervised information into the learning procedure to further improve retrieval performance. GSPH tries to design a generalized hashing framework to handle a wide range of scenarios.

One shortcoming of the continuous methods is that the relaxation procedure might result in a sub-optimal solution [45].

B. Discrete Cross-Modal Hashing

Discrete CMH methods try to directly learn binary codes without discarding the discrete constraints. Representative discrete CMH methods include CCA-ITQ [46], ACQ [35], MLBE [37], PDH [44], SePH [38] and DCMH [34]. CCA-ITQ, ACQ and PDH are unsupervised CMH methods. CCA-ITQ and ACQ utilize dimension reduction techniques, e.g., canonical correlation analysis (CCA) and neighborhood preserving embedding (NPE) [47], to embed multimodal data into one common subspace. Then, they minimize the quantization error to learn binary codes. PDH adopts max-margin theory to learn binary codes. By using an iterative optimization method based on block coordinate descent, PDH tries to maintain the predictability of the binary codes. MLBE, SePH and DCMH are supervised CMH methods. MLBE leverages a probabilistic latent factor model to learn binary codes, which are devised as binary latent factors, and designs an alternating learning algorithm to learn the binary latent factors (i.e., binary codes). Although MLBE is a supervised CMH method, the complexity of MLBE is extremely high. Hence, the training of MLBE is extremely time-consuming and it cannot scale to large-scale datasets. SePH aims to transform the semantic affinities between training data into a probability distribution by minimizing the Kullback-Leibler divergence. Furthermore, SePH utilizes a kernel logistic regression method as an out-of-sample strategy to learn binary codes for unseen data. SePH relaxes the binary codes to real-valued continuous codes and imposes a quantization error term to learn the binary codes. Moreover, the time complexity of SePH is also high. DCMH is a deep learning based cross-modal hashing method which integrates feature learning and binary code
learning simultaneously with deep neural networks. The time complexity of DCMH is also high.

Typically, discrete methods can outperform continuous methods in terms of retrieval accuracy. However, the high complexity makes the training of discrete supervised CMH methods time-consuming. To make the training practicable, discrete supervised CMH methods usually sample a subset of the database during the training procedure. Hence, existing discrete supervised CMH methods cannot fully utilize the supervised information, which deteriorates the accuracy. In summary, to fully utilize the supervised information, we need to design a scalable discrete CMH method to further improve retrieval accuracy and scalability for large-scale datasets.

III. NOTATIONS AND PROBLEM DEFINITION

We briefly introduce the notations and problem definition in this section.

A. Notations

We use bold uppercase letters like $U$ and bold lowercase letters like $u$ to denote matrices and vectors, respectively. The element at the position $(i, j)$ of matrix $U$ is denoted as $U_{ij}$. The $i$th row of matrix $U$ is denoted as $U_{i*}$, and the $j$th column of matrix $U$ is denoted as $U_{*j}$. $\|\cdot\|_F$ and $U^T$ denote the Frobenius norm of a matrix and the transpose of matrix $U$, respectively. $\mathrm{sign}(\cdot)$ is an element-wise sign function.

B. Problem Definition

Without loss of generality, we assume there exist only two modalities in the data, although our DLFH can be easily adapted to more modalities. We use $X = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^{n \times d_x}$ and $Y = [y_1, y_2, \ldots, y_n]^T \in \mathbb{R}^{n \times d_y}$ to denote the feature vectors of the two modalities (modality $x$ and modality $y$), where $d_x$ and $d_y$ respectively denote the dimensionality of the feature spaces for the two modalities and $n$ is the number of training points. In particular, $x_i$ and $y_i$ denote the feature vectors of the two modalities for training point $i$, respectively. Without loss of generality, the data are assumed to be zero-centered, which means $\sum_{i=1}^{n} x_i = 0$ and $\sum_{i=1}^{n} y_i = 0$. Here, we assume that both modalities are observed for all training points. However, we do not require that both modalities are observed for query (test) points. Hence, the setting is cross-modal. Actually, DLFH can be easily adapted to cases where some training points have missing modalities, which will be left for future study. In this paper, we focus on supervised CMH, which has shown better accuracy than unsupervised CMH [16], [17], [26], [38], [44]. In supervised CMH, besides the feature vectors $X$ and $Y$, we are also given a cross-modal supervised similarity matrix $S \in \{0, 1\}^{n \times n}$. If $S_{ij} = 1$, it means that point $x_i$ and point $y_j$ are similar. Otherwise, $x_i$ and $y_j$ are dissimilar. Here, we assume all elements of $S$ are observed. But our DLFH can also be adapted for cases with missing elements in $S$. $S_{ij}$ can be manually labeled by users, or constructed from the labels of point $i$ and point $j$.

We use $U, V \in \{-1, +1\}^{n \times c}$ to respectively denote the binary codes for modality $x$ and modality $y$, where $U_{i*}$ and $V_{i*}$ respectively denote the binary hash codes of the two modalities for point $i$, and $c$ is the length of the binary code. The goal of supervised CMH is to learn the binary codes $U$ and $V$, which try to preserve the similarity information in $S$. In other words, if $S_{ij} = 1$, the Hamming distance between $U_{i*}$ and $V_{j*}$ should be as small as possible, and vice versa.
Furthermore, we also need to learn two hash functions $h_x(x_q) \in \{-1, +1\}^c$ and $h_y(y_q) \in \{-1, +1\}^c$, respectively for modality $x$ and modality $y$, which can compute binary hash codes for any new query point ($x_q$ or $y_q$) unseen in the training set.

IV. DISCRETE LATENT FACTOR MODEL BASED CROSS-MODAL HASHING

In this section, we introduce the details of DLFH, including model formulation and learning algorithm.

A. Model Formulation

Given a binary code pair $\{U_{i*}, V_{j*}\}$, we define $\Theta_{ij}$ as $\Theta_{ij} = \frac{\lambda}{c} U_{i*} V_{j*}^T$, where $c$ is the code length which is pre-specified, and $\lambda > 0$ is a hyper-parameter denoting a scale factor for tuning. By using a logistic function, we define $A_{ij}$ as $A_{ij} = \sigma(\Theta_{ij}) = \frac{1}{1 + e^{-\Theta_{ij}}}$. Based on $A_{ij}$, we define the likelihood of the cross-modal similarity $S$ as:
$$p(S \mid U, V) = \prod_{i,j=1}^{n} p(S_{ij} \mid U, V),$$
where $p(S_{ij} \mid U, V)$ is defined as follows:
$$p(S_{ij} \mid U, V) = \begin{cases} A_{ij}, & \text{if } S_{ij} = 1, \\ 1 - A_{ij}, & \text{otherwise.} \end{cases}$$
Then the log-likelihood of $U$ and $V$ can be derived as follows:
$$L = \log p(S \mid U, V) = \sum_{i,j=1}^{n} \left[ S_{ij} \Theta_{ij} - \log(1 + e^{\Theta_{ij}}) \right].$$
The model of DLFH tries to maximize the log-likelihood of $U$ and $V$. That is, DLFH tries to solve the following problem:
$$\max_{U, V} \; L = \log p(S \mid U, V) = \sum_{i,j=1}^{n} \left[ S_{ij} \Theta_{ij} - \log(1 + e^{\Theta_{ij}}) \right], \qquad (1)$$
$$\text{s.t.} \quad U, V \in \{-1, +1\}^{n \times c},$$
where $U$ and $V$ are constrained to be in a binary space, i.e., $\{-1, +1\}^{n \times c}$, because they are binary hash codes for learning.

We can find that maximizing the objective function in (1) exactly matches the goal of hashing. More specifically, the learned binary hash codes try to preserve the similarity information in $S$.

Please note that in (1), DLFH adopts the maximum likelihood loss function. Although there exist some CMH methods [34], [41] which also use a maximum likelihood loss function, none of them can utilize a discrete latent factor model to
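To make the formulation concrete, the following is a minimal NumPy sketch of the objective in (1). The function name dlfh_log_likelihood and the use of NumPy are our own illustrative choices, not code from the paper; it simply evaluates the log-likelihood for given binary codes.

```python
import numpy as np

def dlfh_log_likelihood(U, V, S, lam):
    """Objective L in Eq. (1): sum_ij [ S_ij * Theta_ij - log(1 + exp(Theta_ij)) ].

    U, V : (n, c) matrices with entries in {-1, +1} (binary codes of the two modalities).
    S    : (n, n) cross-modal similarity matrix with entries in {0, 1}.
    lam  : scale hyper-parameter lambda (> 0).
    """
    n, c = U.shape
    Theta = (lam / c) * U @ V.T          # Theta_ij = (lambda / c) * U_i* V_j*^T
    log_term = np.logaddexp(0.0, Theta)  # numerically stable log(1 + e^Theta)
    return float(np.sum(S * Theta - log_term))
```

Because $S_{ij} \in \{0, 1\}$, the element-wise product `S * Theta` reproduces the term $S_{ij}\Theta_{ij}$ in the sum, so the returned value increases whenever similar pairs receive closer codes.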
model the supervised information. On the contrary, DLFH can directly utilize a discrete latent factor model to model the supervised information and learn the binary hash codes without relaxation.

B. Learning Algorithm

Problem (1) is a discrete (binary) learning problem, which is difficult to solve. One possible solution is to relax the discrete problem to a continuous problem by discarding the discrete (binary) constraints. Similar relaxation strategies have been adopted by many existing relaxation-based continuous methods like GSPH [42], CMFH [27] and SMFH [39]. However, this relaxation may cause the solution to be sub-optimal. Hence, the search accuracy might be unsatisfactory.

In this paper, we propose a novel method to directly learn the binary codes without continuous relaxation. The two parameters $U$ and $V$ are learned in an alternating way. Specifically, we design an iterative learning algorithm, and in each iteration we learn one parameter with the other parameter fixed.

1) Learning U with V Fixed: We try to learn $U$ with $V$ fixed. Even if $V$ is fixed, it is still difficult to optimize (learn) the whole $U$ at one time. For example, if we simply enumerate the signs of all elements to learn $U$, the total time complexity will be $O(2^{n \times c})$, which is very high. Here, we adopt a column-wise learning strategy to optimize one column (corresponding to one bit for all data points) of $U$ each time with the other columns fixed. The time complexity to directly learn one column is $O(2^n)$, which is still high. Hence, we adopt a surrogate strategy [48] to learn each column, which results in a lower time complexity of $O(n^2)$. More specifically, to optimize the $k$th column $U_{*k}$, we construct a lower bound of $L(U_{*k})$ and then optimize the lower bound, which gives a closed form solution and makes the learning procedure simple and efficient. Moreover, the lower-bound based learning strategy can guarantee the solution to converge. The following content introduces the details of the column-wise learning strategy.

The gradient and Hessian of the objective function $L$ with respect to $U_{*k}$ can be computed as follows (please note that the objective function $L$ is defined on the whole real space although the variables $U$ and $V$ are constrained to be discrete; hence, we can still compute the gradient and Hessian for any discrete points $U$ and $V$):
$$\frac{\partial L}{\partial U_{*k}} = \frac{\lambda}{c} \sum_{j=1}^{n} (S_{*j} - A_{*j}) V_{jk}, \qquad \frac{\partial^2 L}{\partial U_{*k} \partial U_{*k}^T} = -\frac{\lambda^2}{c^2}\, \mathrm{diag}(a_1, a_2, \cdots, a_n),$$
where $A = [A_{ij}]_{i,j=1}^{n}$, $a_i = \sum_{j=1}^{n} A_{ij}(1 - A_{ij})$, and $\mathrm{diag}(a_1, a_2, \cdots, a_n)$ denotes a diagonal matrix with the $i$th diagonal element being $a_i$.

Let $U_{*k}(t)$ denote the value of $U_{*k}$ at the $t$th iteration, $\frac{\partial L}{\partial U_{*k}}(t)$ denote the gradient with respect to $U_{*k}(t)$, and $H = -\frac{n\lambda^2}{4c^2} I$ where $I$ is an identity matrix. We define $\widetilde{L}(U_{*k})$ as follows:
$$\begin{aligned} \widetilde{L}(U_{*k}) &= L(U_{*k}(t)) + [U_{*k} - U_{*k}(t)]^T \frac{\partial L}{\partial U_{*k}}(t) + \frac{1}{2} [U_{*k} - U_{*k}(t)]^T H [U_{*k} - U_{*k}(t)] \\ &= U_{*k}^T \Big[ \frac{\partial L}{\partial U_{*k}}(t) - H U_{*k}(t) \Big] + \mathrm{const}, \end{aligned} \qquad (2)$$
where $\mathrm{const}$ denotes the constant term $L(U_{*k}(t)) - \frac{\lambda^2 n^2}{8c^2} + \frac{1}{2} [U_{*k}(t)]^T H U_{*k}(t) - [U_{*k}(t)]^T \frac{\partial L}{\partial U_{*k}}(t)$, which is independent of the variable $U_{*k}$. Note that the quadratic term $\frac{1}{2} U_{*k}^T H U_{*k} = -\frac{\lambda^2 n^2}{8c^2}$ is itself a constant, because $U_{*k}^T U_{*k} = n$ for any $U_{*k} \in \{-1, +1\}^n$, which is why it can be folded into $\mathrm{const}$.

Theorem 1. $\widetilde{L}(U_{*k})$ is a lower bound of $L(U_{*k})$.

To prove Theorem 1, we first introduce the following lemma.
Lemma 1. For a concave function $f(x)$ with bounded curvature, if the Hessian $\frac{\partial^2 f(x)}{\partial x \partial x^T}$ satisfies $\frac{\partial^2 f(x)}{\partial x \partial x^T} \succeq D$ for some matrix $D \prec 0$, we have:
$$f(y) \geq f(x) + (y - x)^T \frac{\partial f(x)}{\partial x} + \frac{1}{2} (y - x)^T D (y - x).$$
Here, $D \prec 0$ denotes that $D$ is a negative definite matrix, and $B \succeq C$ indicates that $B - C$ is a positive semi-definite matrix.

The proof of Lemma 1 can be found in [48], [49]. Based on Lemma 1, we can prove Theorem 1.

Proof of Theorem 1. It is easy to see that $L(U_{*k})$ is a concave function with bounded curvature. Furthermore, because $0 < A_{ij} < 1$, we have $0 < A_{ij}(1 - A_{ij}) < \frac{1}{4}$. Then we can get: $\frac{\partial^2 L}{\partial U_{*k} \partial U_{*k}^T} \succeq H$. According to Lemma 1, we can get: $L(U_{*k}) \geq \widetilde{L}(U_{*k})$. This means $\widetilde{L}(U_{*k})$ is a lower bound of $L(U_{*k})$.

Then, we learn the column $U_{*k}$ by maximizing the lower bound $\widetilde{L}(U_{*k})$:
$$\max_{U_{*k}} \; \widetilde{L}(U_{*k}) = U_{*k}^T \Big[ \frac{\partial L}{\partial U_{*k}}(t) - H U_{*k}(t) \Big] + \mathrm{const}, \quad \text{s.t.} \; U_{*k} \in \{-1, +1\}^n. \qquad (3)$$
For all $l$, $U_{lk}$ is defined over $\{-1, +1\}$. Hence, to maximize $\widetilde{L}(U_{*k})$, we only need to take $U_{lk} = 1$ whenever the $l$th element of $\frac{\partial L}{\partial U_{*k}}(t) - H U_{*k}(t)$ is greater than 0, and $U_{lk} = -1$ otherwise. That is to say, the optimal solution for Problem (3) is $U_{*k} = \mathrm{sign}\big[\frac{\partial L}{\partial U_{*k}}(t) - H U_{*k}(t)\big]$, and we use this solution to get $U_{*k}(t + 1)$:
$$U_{*k}(t + 1) = \mathrm{sign}\Big[\frac{\partial L}{\partial U_{*k}}(t) - H U_{*k}(t)\Big]. \qquad (4)$$
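The closed-form column update in (4) is simple to implement. Below is a minimal NumPy sketch of one surrogate update of $U_{*k}$; the function name and array layout are illustrative assumptions rather than code from the paper, and the tie-breaking at exactly zero is our own choice.

```python
import numpy as np

def update_U_column(U, V, S, k, lam):
    """One surrogate (lower-bound) update of column U[:, k] via Eq. (4).

    U, V: current {-1, +1} code matrices of shape (n, c); S: {0, 1} similarity matrix.
    Returns the updated column as a {-1, +1} vector.
    """
    n, c = U.shape
    A = 1.0 / (1.0 + np.exp(-(lam / c) * U @ V.T))   # A_ij = sigma(Theta_ij)
    grad = (lam / c) * (S - A) @ V[:, k]             # dL/dU_{*k}
    H_diag = -(n * lam**2) / (4 * c**2)              # H = -(n lambda^2 / 4c^2) I
    col = np.sign(grad - H_diag * U[:, k])           # Eq. (4): sign[grad - H U_{*k}(t)]
    col[col == 0] = 1                                # break ties (sign(0)) toward +1
    return col
```

Since $-H$ is a positive multiple of the identity, the update keeps each bit at its current value unless the sampled gradient evidence is strong enough to flip it.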
Algorithm 1 Learning algorithm for DLFH
Input: $S \in \{0, 1\}^{n \times n}$: supervised similarity matrix; $c$: code length.
Output: $U$ and $V$: binary codes for the two modalities.
1: Procedure: initialize $U$ and $V$.
2: for $t = 1 \to T$ do
3:   for $k = 1 \to c$ do
4:     Update $U_{*k}$ according to (4).
5:   end for
6:   for $k = 1 \to c$ do
7:     Update $V_{*k}$ according to (5).
8:   end for
9: end for

2) Learning V with U Fixed: When $U$ is fixed, we adopt a similar strategy as that for $U$ to learn $V$. Specifically, we can get the following closed form solution to update $V_{*k}$:
$$V_{*k}(t + 1) = \mathrm{sign}\Big[\frac{\partial L}{\partial V_{*k}}(t) - H V_{*k}(t)\Big], \qquad (5)$$
where $\frac{\partial L}{\partial V_{*k}} = \frac{\lambda}{c} \sum_{i=1}^{n} (S_{i*}^T - A_{i*}^T) U_{ik}$ and $H = -\frac{n\lambda^2}{4c^2} I$.

The learning algorithm for DLFH is summarized in Algorithm 1, which can be easily implemented.

Please note that LFH [50] has adopted a latent factor model for hashing. However, DLFH is different from LFH and is novel due to the following reasons. Firstly, DLFH is for cross-modal supervised hashing but LFH is for uni-modal supervised hashing. Secondly, DLFH is a discrete method which directly learns discrete (binary) codes without relaxation, but LFH is a relaxation-based continuous method which cannot directly learn the discrete codes. Last but not least, LFH learns all bits of a point each time, but DLFH learns one bit of all points each time, which can lead to highly uncorrelated hash codes. This will be empirically verified in Section V-H.

Theorem 2. The learning algorithm for DLFH which is shown in Algorithm 1 is convergent.

Proof of Theorem 2. According to Theorem 1, we have:
$$L(U_{*k}) \geq \widetilde{L}(U_{*k}). \qquad (6)$$
Furthermore, since we use the optimal solution of Problem (3) to get $U_{*k}(t + 1)$, we have:
$$\widetilde{L}(U_{*k}(t + 1)) \geq \widetilde{L}(U_{*k}(t)). \qquad (7)$$
We can also find that $\widetilde{L}(U_{*k}(t)) = L(U_{*k}(t))$. Then, we have:
$$L(U_{*k}(t + 1)) \geq \widetilde{L}(U_{*k}(t + 1)) \geq \widetilde{L}(U_{*k}(t)) = L(U_{*k}(t)).$$
Hence, during the learning procedure, we can always guarantee that $L(U_{*k}(t+1)) \geq L(U_{*k}(t))$. Similarly, we can also guarantee that $L(V_{*k}(t + 1)) \geq L(V_{*k}(t))$. This means that we can always guarantee that the objective function will not decrease during the whole learning procedure in Algorithm 1. Together with the fact that $L(U, V)$ is upper-bounded by 0, we can guarantee that Algorithm 1 will converge. Because $L(U, V)$ is non-concave in both $U$ and $V$ (note that $L(\cdot)$ is concave in $U_{*k}$ or $V_{*k}$, but non-concave in both $U$ and $V$), the learned solution will converge to a local optimum.

3) Stochastic Learning Strategy: We can find that the computational cost for learning $U_{*k}$ and $V_{*k}$ is $O(n^2)$. DLFH will become intractable when the size of the training set is large. Here, we design a stochastic learning strategy to avoid the high computational cost.

It is easy to see that the high computational cost mainly comes from the gradient computation for both $U_{*k}$ and $V_{*k}$. In our stochastic learning strategy, we randomly sample $m$ columns (rows) of $S$ to compute $\frac{\partial L}{\partial U_{*k}}$ ($\frac{\partial L}{\partial V_{*k}}$) during each iteration. Then, we get the following formulas for updating $U_{*k}$ and $V_{*k}$:
$$U_{*k}(t + 1) = \mathrm{sign}\Big[\frac{\lambda}{c} \sum_{q=1}^{m} (S_{*j_q} - A_{*j_q}) V_{j_q k}(t) + \frac{m\lambda^2}{4c^2} U_{*k}(t)\Big], \qquad (8)$$
$$V_{*k}(t + 1) = \mathrm{sign}\Big[\frac{\lambda}{c} \sum_{q=1}^{m} (S_{i_q *}^T - A_{i_q *}^T) U_{i_q k}(t) + \frac{m\lambda^2}{4c^2} V_{*k}(t)\Big], \qquad (9)$$
where $\{j_q\}_{q=1}^{m}$ and $\{i_q\}_{q=1}^{m}$ are the $m$ sampled column and row indices, respectively.

To get the stochastic learning algorithm for DLFH, we only need to substitute (4) and (5) in Algorithm 1 by (8) and (9), respectively. Then the computational cost will decrease from $O(n^2)$ to $O(nm)$, where $m \ll n$.

C. Out-of-Sample Extension

For any unseen query point $x_q \notin X$ or $y_q \notin Y$, we need to perform out-of-sample extension to generate the binary codes.
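To show how Algorithm 1 and the stochastic updates (8)-(9) fit together, here is an illustrative NumPy sketch of the whole training loop. The function name, the default values for $\lambda$, $T$ and $m$, and the random initialization are our own assumptions for the sketch, not settings from the paper; choosing $m = n$ (with all indices sampled) recovers the batch updates (4)-(5).

```python
import numpy as np

def train_dlfh(S, c, lam=8.0, T=30, m=None, rng=None):
    """Sketch of Algorithm 1 with the stochastic column/row sampling of (8)-(9).

    S: (n, n) {0, 1} cross-modal similarity matrix; c: code length;
    lam: scale hyper-parameter lambda; T: outer iterations;
    m: number of sampled columns/rows per bit update (defaults to n, i.e., batch).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = S.shape[0]
    m = n if m is None else m
    U = np.where(rng.standard_normal((n, c)) >= 0, 1.0, -1.0)
    V = np.where(rng.standard_normal((n, c)) >= 0, 1.0, -1.0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for _ in range(T):
        for k in range(c):                                    # update U bit by bit, Eq. (8)
            j = rng.choice(n, size=m, replace=False)          # sampled columns of S
            A = sigmoid((lam / c) * U @ V[j].T)               # (n, m) block of A
            grad = (lam / c) * (S[:, j] - A) @ V[j, k]
            U[:, k] = np.where(grad + (m * lam**2) / (4 * c**2) * U[:, k] >= 0, 1.0, -1.0)
        for k in range(c):                                    # update V bit by bit, Eq. (9)
            i = rng.choice(n, size=m, replace=False)          # sampled rows of S
            A = sigmoid((lam / c) * U[i] @ V.T)               # (m, n) block of A
            grad = (lam / c) * (S[i, :] - A).T @ U[i, k]
            V[:, k] = np.where(grad + (m * lam**2) / (4 * c**2) * V[:, k] >= 0, 1.0, -1.0)
    return U, V
```

Each bit update only touches an $n \times m$ block of $A$, which is how the per-iteration cost drops from $O(n^2)$ to $O(nm)$.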
Many existing hashing methods learn classifiers to predict binary codes for out-of-sample extension. These classifiers include SVM [24], boosted decision trees [8], kernel logistic regression [38], and so on. In this paper, we adopt two classifiers for out-of-sample extension, one being linear and the other being non-linear. We use the modality $x$ as an example to demonstrate these two strategies.

For the linear classifier, we minimize the following square loss:
$$L_{\mathrm{square}}(W^{(x)}) = \|U - X W^{(x)}\|_F^2 + \gamma_x \|W^{(x)}\|_F^2,$$
where $\gamma_x$ is the hyper-parameter for the regularization term. We can get: $W^{(x)} = (X^T X + \gamma_x I)^{-1} X^T U$. Then we can get the hash function for out-of-sample extension: $h_x(x_q) = \mathrm{sign}([W^{(x)}]^T x_q)$.

For the non-linear classifier, we adopt kernel logistic regression (KLR) as that in SePH [38]. Specifically, we learn $c$ classifiers, one classifier per bit, to predict binary codes for $x_q$. For the $k$th bit, we minimize the following KLR loss:
$$L_{\mathrm{KLR}}(M^{(x)}_{*k}) = \sum_{i=1}^{n} \log\big(1 + e^{-U_{ik}\, \phi(x_i)^T M^{(x)}_{*k}}\big) + \eta_x \|M^{(x)}_{*k}\|_F^2,$$
where $\eta_x$ is the hyper-parameter for the regularization term and $\phi(x_i)$ is an RBF kernel feature representation for $x_i$. Then we can get the hash function for out-of-sample extension: $h_x(x_q) = \mathrm{sign}([M^{(x)}]^T \phi(x_q))$.
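As a concrete illustration of the linear out-of-sample extension, the sketch below computes the closed-form solution $W^{(x)} = (X^T X + \gamma_x I)^{-1} X^T U$ and applies $h_x(\cdot)$ to a query. The function names are ours, and mapping ties at exactly zero to $+1$ is an implementation assumption.

```python
import numpy as np

def fit_linear_hash_function(X, U, gamma=1.0):
    """Closed-form ridge solution W^(x) = (X^T X + gamma I)^{-1} X^T U.

    X: (n, d_x) zero-centered training features; U: (n, c) learned binary codes.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T @ U)

def hash_query(W, x_q):
    """h_x(x_q) = sign(W^T x_q); entries that are exactly zero are mapped to +1."""
    code = np.sign(W.T @ x_q)
    code[code == 0] = 1
    return code
```

The same two-step recipe (fit a per-modality predictor on the learned codes, then take the sign of its output at query time) applies to the KLR variant, with the linear map replaced by the kernelized classifiers.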