Column Sampling based Discrete Supervised Hashing

Wang-Cheng Kang, Wu-Jun Li and Zhi-Hua Zhou
National Key Laboratory for Novel Software Technology
Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210023
Department of Computer Science and Technology, Nanjing University, China
kwc.oliver@gmail.com, {liwujun,zhouzh}@nju.edu.cn

Copyright © 2016, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

By leveraging semantic (label) information, supervised hashing has demonstrated better accuracy than unsupervised hashing in many real applications. Because the hashing-code learning problem is essentially a discrete optimization problem which is hard to solve, most existing supervised hashing methods try to solve a relaxed continuous optimization problem by dropping the discrete constraints. However, these methods typically suffer from poor performance due to the errors caused by the relaxation. Some other methods try to directly solve the discrete optimization problem. However, they are typically time-consuming and unscalable. In this paper, we propose a novel method, called column sampling based discrete supervised hashing (COSDISH), to directly learn the discrete hashing code from semantic information. COSDISH is an iterative method, in each iteration of which several columns are sampled from the semantic similarity matrix and then the hashing code is decomposed into two parts which can be alternately optimized in a discrete way. Theoretical analysis shows that the learning (optimization) algorithm of COSDISH has a constant-approximation bound in each step of the alternating optimization procedure. Empirical results on datasets with semantic labels illustrate that COSDISH can outperform the state-of-the-art methods in real applications like image retrieval.

Introduction

Although different kinds of methods have been proposed for approximate nearest neighbor (ANN) search (Indyk and Motwani 1998), hashing has become one of the most popular candidates for ANN search because it can achieve better performance than other methods in real applications (Weiss, Torralba, and Fergus 2008; Kulis and Grauman 2009; Zhang et al. 2010; Zhang, Wang, and Si 2011; Zhang et al. 2012; Strecha et al. 2012; Zhen and Yeung 2012; Rastegari et al. 2013; Lin et al. 2013b; Xu et al. 2013; Jin et al. 2013; Zhu et al. 2013; Wang, Zhang, and Si 2013; Zhou, Ding, and Guo 2014; Yu et al. 2014).

There have appeared two main categories of hashing methods (Kong and Li 2012; Liu et al. 2012; Zhang et al. 2014): data-independent methods and data-dependent methods. Representative data-independent methods include locality-sensitive hashing (LSH) (Andoni and Indyk 2006) and its variants. Data-dependent hashing can be further divided into unsupervised hashing and supervised hashing methods (Liu et al. 2012; Zhang et al. 2014). Unsupervised hashing methods, such as iterative quantization (ITQ) (Gong and Lazebnik 2011), isotropic hashing (IsoHash) (Kong and Li 2012), discrete graph hashing (DGH) (Liu et al. 2014) and scalable graph hashing (SGH) (Jiang and Li 2015), only use the feature information of the data points for learning, without using any semantic (label) information. On the contrary, supervised hashing methods try to leverage semantic (label) information for hashing function learning.
Representative supervised hashing methods include sequential projection learning for hashing (SPLH) (Wang, Kumar, and Chang 2010b), minimal loss hashing (MLH) (Norouzi and Fleet 2011), supervised hashing with kernels (KSH) (Liu et al. 2012), two-step hashing (TSH) (Lin et al. 2013a), latent factor hashing (LFH) (Zhang et al. 2014), FastH (Lin et al. 2014), graph cuts coding (GCC) (Ge, He, and Sun 2014) and supervised discrete hashing (SDH) (Shen et al. 2015). Supervised hashing has attracted more and more attention in recent years because it has demonstrated better accuracy than unsupervised hashing in many real applications.

Because the hashing-code learning problem is essentially a discrete optimization problem which is hard to solve, most existing supervised hashing methods, such as KSH (Liu et al. 2012), try to solve a relaxed continuous optimization problem by dropping the discrete constraints. However, these methods typically suffer from poor performance due to the errors caused by the relaxation, which has been verified by the experiments in (Lin et al. 2014; Shen et al. 2015). Some other methods, such as FastH (Lin et al. 2014), try to directly solve the discrete optimization problem. However, they are typically time-consuming and unscalable. Hence, they have to sample only a small subset of the entire dataset for training even if a large-scale training set is given, and hence cannot achieve satisfactory performance in real applications.

In this paper, we propose a novel method, called column sampling based discrete supervised hashing (COSDISH), to directly learn the discrete hashing code from semantic information. The main contributions of COSDISH are listed as follows:
• COSDISH is iterative, and in each iteration column sampling (Zhang et al. 2014) is adopted to sample several columns from the semantic similarity matrix. Different from traditional sampling methods which try to sample only a small subset of the entire dataset for training, our column sampling method can exploit all the available data points for training.

• Based on the sampled columns, the hashing-code learning procedure can be decomposed into two parts which can be alternately optimized in a discrete way. The discrete optimization strategy can avoid the errors caused by relaxation in traditional continuous optimization methods.

• Theoretical analysis shows that the learning (optimization) algorithm has a constant-approximation bound in each step of the alternating optimization procedure.

• Empirical results on datasets with semantic labels illustrate that COSDISH can outperform the state-of-the-art methods in real applications, such as image retrieval.

Notation and Problem Definition

Notation

Boldface lowercase letters like $\mathbf{a}$ denote vectors, and the $i$th element of $\mathbf{a}$ is denoted as $a_i$. Boldface uppercase letters like $\mathbf{A}$ denote matrices. $\mathbf{I}$ denotes the identity matrix. $\mathbf{A}_{i*}$ and $\mathbf{A}_{*j}$ denote the $i$th row and the $j$th column of $\mathbf{A}$, respectively. $A_{ij}$ denotes the element at the $i$th row and $j$th column of $\mathbf{A}$. $\mathbf{A}^{-1}$ denotes the inverse of $\mathbf{A}$, and $\mathbf{A}^T$ denotes the transpose of $\mathbf{A}$. $|\cdot|$ denotes the cardinality of a set, i.e., the number of elements in the set. $\|\cdot\|_F$ denotes the Frobenius norm of a vector or matrix, and $\|\cdot\|_1$ denotes the $L_1$ norm of a vector or matrix. $\mathrm{sgn}(\cdot)$ is the element-wise sign function which returns 1 if the element is a positive number and returns -1 otherwise.

Problem Definition

Suppose we have $n$ points $\{\mathbf{x}_i \in \mathbb{R}^d\}_{i=1}^{n}$, where $\mathbf{x}_i$ is the feature vector of point $i$. We can denote the feature vectors of the $n$ points in a compact matrix form $\mathbf{X} \in \mathbb{R}^{n \times d}$, where $\mathbf{X}_{i*}^T = \mathbf{x}_i$. Besides the feature vectors, the training set of supervised hashing also contains a semantic similarity matrix $\mathbf{S} \in \{-1, 0, 1\}^{n \times n}$, where $S_{ij} = 1$ means that point $i$ and point $j$ are semantically similar, $S_{ij} = -1$ means that point $i$ and point $j$ are semantically dissimilar, and $S_{ij} = 0$ means that whether point $i$ and point $j$ are semantically similar or not is unknown. Here, the semantic information typically refers to semantic labels provided with manual effort. In this paper, we assume that $\mathbf{S}$ is fully observed without missing entries, i.e., $\mathbf{S} \in \{-1, 1\}^{n \times n}$. This assumption is reasonable because in many cases we can always get the semantic label information between two points. Furthermore, our model in this paper can be easily adapted to the cases with missing entries.

The goal of supervised hashing is to learn a binary code matrix $\mathbf{B} \in \{-1, 1\}^{n \times q}$, where $\mathbf{B}_{i*}$ denotes the $q$-bit code for training point $i$. Furthermore, the learned binary codes should preserve the semantic similarity in $\mathbf{S}$. Although the goal of supervised hashing can be formulated by different optimization problems, the commonly used one is as follows:

$$\min_{\mathbf{B} \in \{-1,1\}^{n \times q}} \|q\mathbf{S} - \mathbf{B}\mathbf{B}^T\|_F^2, \qquad (1)$$

where $\|q\mathbf{S} - \mathbf{B}\mathbf{B}^T\|_F^2 = \sum_{i=1}^{n}\sum_{j=1}^{n}(qS_{ij} - \mathbf{B}_{i*}\mathbf{B}_{j*}^T)^2$.

The main idea of (1) is to adopt the inner product of two binary codes, which reflects the opposite of the Hamming distance, to approximate the similarity label with the square loss. This model has been widely used in many supervised hashing methods (Liu et al. 2012; Lin et al. 2013a; Zhang and Li 2014; Lin et al. 2014; Xia et al. 2014; Leng et al. 2014). LFH (Zhang et al. 2014) also uses the inner product to approximate the similarity label, but it uses the logistic loss rather than the square loss in (1).
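To make the objective in (1) concrete, the following is a minimal NumPy sketch (not from the paper; all function and variable names are illustrative) that evaluates the square loss for a given similarity matrix and code matrix:

```python
import numpy as np

def pairwise_square_loss(S, B):
    """Evaluate ||q*S - B B^T||_F^2 for S in {-1,1}^{n x n} and B in {-1,1}^{n x q}."""
    q = B.shape[1]
    residual = q * S - B @ B.T   # the code inner product approximates q * S_ij
    return np.sum(residual ** 2)

# toy usage with random labels and random codes
rng = np.random.default_rng(0)
n, q = 6, 4
S = np.where(rng.random((n, n)) > 0.5, 1, -1)
S = np.triu(S, 1) + np.triu(S, 1).T + np.eye(n, dtype=int)  # symmetric, +1 on the diagonal
B = np.where(rng.random((n, q)) > 0.5, 1, -1)
print(pairwise_square_loss(S, B))
```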
Problem (1) is a discrete optimization problem which is hard to solve. Most existing methods optimize it by dropping the discrete constraint (Liu et al. 2012; Lin et al. 2013a). To the best of our knowledge, only one method, called FastH (Lin et al. 2014), has been proposed to directly solve the discrete optimization problem in (1). Due to the difficulty of discrete optimization, FastH adopts a bit-wise learning strategy which uses a Block Graph-Cut method to get the local optima and learns one bit at a time. The experiments of FastH show that FastH can achieve better accuracy than other supervised methods with continuous relaxation.

It is easy to see that both the time complexity and the storage complexity are $O(n^2)$ if all the supervised information in $\mathbf{S}$ is used for training. Hence, all the existing methods, including relaxation-based continuous optimization methods and discrete optimization methods, sample only a small subset with $m$ ($m \ll n$) points for training, where $m$ is typically several thousand, even if we are given a large-scale training set. Because some training points are discarded, all these existing methods cannot achieve satisfactory accuracy.

Therefore, to get satisfactory accuracy, we need to solve the problem in (1) from two aspects. On one hand, we need to adopt a proper sampling strategy to effectively exploit all the $n$ available points for training. On the other hand, we need to design strategies for discrete optimization. This motivates the work in this paper.

COSDISH

This section presents the details of our proposed method called COSDISH. More specifically, we try to solve the two aspects stated above, sampling and discrete optimization, for the supervised hashing problem.

Column Sampling

As stated above, both the time complexity and the storage complexity are $O(n^2)$ if all the supervised information in $\mathbf{S}$ is used for training. Hence, we have to perform sampling for training, which is actually adopted by almost all existing methods, such as KSH, TSH, FastH and LFH. However, all existing methods except LFH try to sample only a small subset with $m$ ($m \ll n$) points for training and discard the rest of the training points, which leads to unsatisfactory accuracy. The special case is LFH, which proposes to sample several columns from $\mathbf{S}$ in each iteration, with several iterations performed for training.

In this paper, we adopt the same column sampling method as LFH (Zhang et al. 2014) for our COSDISH. Unlike LFH, which adopts continuous relaxation for learning, in COSDISH we propose a novel discrete optimization (learning) method based on column sampling.

More specifically, in each iteration we randomly sample a subset $\Omega$ of $N = \{1, 2, \ldots, n\}$ and then choose the semantic similarity between all $n$ points and those points indexed by $\Omega$. That is to say, we sample $|\Omega|$ columns of $\mathbf{S}$, with the column numbers being indexed by $\Omega$. We use $\widetilde{\mathbf{S}} \in \{-1,1\}^{n \times |\Omega|}$ to denote the sampled sub-matrix of similarity.

We can find that there exist two different kinds of points in each iteration, one being those indexed by $\Omega$ and the other being those indexed by $\Gamma = N - \Omega$. We use $\widetilde{\mathbf{S}}^{\Omega} \in \{-1,1\}^{|\Omega| \times |\Omega|}$ to denote the sub-matrix formed by the rows of $\widetilde{\mathbf{S}}$ indexed by $\Omega$. $\widetilde{\mathbf{S}}^{\Gamma} \in \{-1,1\}^{|\Gamma| \times |\Omega|}$, $\mathbf{B}^{\Omega} \in \{-1,1\}^{|\Omega| \times q}$ and $\mathbf{B}^{\Gamma} \in \{-1,1\}^{|\Gamma| \times q}$ are defined in a similar way.
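The notation above maps directly onto array slicing. Below is a small NumPy sketch (illustrative only; the function and variable names are not from the paper) of one round of column sampling and the corresponding splits:

```python
import numpy as np

def sample_columns(S, B, num_cols, rng=None):
    """Sample |Omega| columns of S and split S and B into the Omega/Gamma blocks."""
    rng = rng or np.random.default_rng()
    n = S.shape[0]
    omega = rng.choice(n, size=num_cols, replace=False)   # sampled column indices
    gamma = np.setdiff1d(np.arange(n), omega)              # remaining indices, Gamma = N - Omega
    S_tilde = S[:, omega]                                  # n x |Omega| sampled similarity columns
    S_omega = S_tilde[omega, :]                            # |Omega| x |Omega|
    S_gamma = S_tilde[gamma, :]                            # |Gamma| x |Omega|
    B_omega, B_gamma = B[omega, :], B[gamma, :]            # corresponding code blocks
    return omega, gamma, S_omega, S_gamma, B_omega, B_gamma
```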
According to the problem in (1), the problem associated with the sampled columns in each iteration can be reformulated as follows:

$$\min_{\mathbf{B}^{\Omega}, \mathbf{B}^{\Gamma}} \|q\widetilde{\mathbf{S}}^{\Gamma} - \mathbf{B}^{\Gamma}[\mathbf{B}^{\Omega}]^T\|_F^2 + \|q\widetilde{\mathbf{S}}^{\Omega} - \mathbf{B}^{\Omega}[\mathbf{B}^{\Omega}]^T\|_F^2. \qquad (2)$$

Alternating Optimization

We propose an alternating optimization strategy, which contains several iterations with each iteration divided into two alternating steps, to solve the problem in (2). More specifically, we update $\mathbf{B}^{\Gamma}$ with $\mathbf{B}^{\Omega}$ fixed, and then update $\mathbf{B}^{\Omega}$ with $\mathbf{B}^{\Gamma}$ fixed. This two-step alternating optimization procedure is repeated several times.

Update $\mathbf{B}^{\Gamma}$ with $\mathbf{B}^{\Omega}$ Fixed   When $\mathbf{B}^{\Omega}$ is fixed, the objective function of $\mathbf{B}^{\Gamma}$ is given by:

$$f_2(\mathbf{B}^{\Gamma}) = \|q\widetilde{\mathbf{S}}^{\Gamma} - \mathbf{B}^{\Gamma}[\mathbf{B}^{\Omega}]^T\|_F^2, \quad \mathbf{B}^{\Gamma} \in \{-1,1\}^{|\Gamma| \times q}.$$

Minimizing $f_2(\cdot)$ can be viewed as a discrete least squares problem. By relaxing $\mathbf{B}^{\Gamma}$ into continuous values, it is easy to get the optimal continuous solution. However, after we quantize the continuous solution into a discrete solution, the error caused by quantization cannot be bounded. Hence, even if the continuous solution is optimal for the relaxed continuous problem, the discrete solution might be far away from the optimal solution of the original problem.

Here, we design a method to solve it in a discrete way with a constant-approximation bound. Let us consider the following problem $f_1(\cdot)$, which changes the loss from the Frobenius norm in $f_2(\cdot)$ to the $L_1$ norm:

$$f_1(\mathbf{B}^{\Gamma}) = \|q\widetilde{\mathbf{S}}^{\Gamma} - \mathbf{B}^{\Gamma}[\mathbf{B}^{\Omega}]^T\|_1, \quad \mathbf{B}^{\Gamma} \in \{-1,1\}^{|\Gamma| \times q}. \qquad (3)$$

It is easy to find that $f_1(\cdot)$ reaches its minimum when $\mathbf{B}^{\Gamma} = \mathrm{sgn}(\widetilde{\mathbf{S}}^{\Gamma}\mathbf{B}^{\Omega})$. Furthermore, we have the following theorem.

Theorem 1. Suppose that $f_1(F_1^*)$ and $f_2(F_2^*)$ reach their minima at the points $F_1^*$ and $F_2^*$, respectively. We have $f_2(F_1^*) \leq 2q\, f_2(F_2^*)$.

The proof of Theorem 1 can be found in the supplementary material (http://cs.nju.edu.cn/lwj/paper/COSDISH_sup.pdf).

That is to say, if we use the solution of $f_1(\cdot)$, i.e., $\mathbf{B}^{\Gamma} = F_1^* = \mathrm{sgn}(\widetilde{\mathbf{S}}^{\Gamma}\mathbf{B}^{\Omega})$, as the solution of $f_2(\cdot)$, the solution is a $2q$-approximation solution. Because $q$ is usually small, we can say that it is a constant-approximation solution, which means that we can get an error bound for the original problem.

Because the elements of $\widetilde{\mathbf{S}}^{\Gamma}\mathbf{B}^{\Omega}$ can be zero, in practice we set $\mathbf{B}^{\Gamma}_{(t)} = \mathrm{e\_sgn}(\widetilde{\mathbf{S}}^{\Gamma}\mathbf{B}^{\Omega}, \mathbf{B}^{\Gamma}_{(t-1)})$, where

$$\mathrm{e\_sgn}(b_1, b_2) = \begin{cases} 1 & b_1 > 0 \\ b_2 & b_1 = 0 \\ -1 & b_1 < 0 \end{cases}$$

$t$ is the iteration number, and $\mathrm{e\_sgn}(\cdot,\cdot)$ is applied in an element-wise manner.
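In code, this closed-form update is a single matrix multiplication followed by a sign with a tie-breaking rule. A hedged NumPy sketch (names are illustrative, not from the paper):

```python
import numpy as np

def update_B_gamma(S_gamma, B_omega, B_gamma_prev):
    """B_Gamma update: sign of S_Gamma @ B_Omega, keeping the previous bit where the score is zero (e_sgn)."""
    scores = S_gamma @ B_omega               # |Gamma| x q
    B_gamma = np.sign(scores)
    ties = (scores == 0)
    B_gamma[ties] = B_gamma_prev[ties]       # e_sgn: reuse the code from the previous iteration on ties
    return B_gamma
```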
Update $\mathbf{B}^{\Omega}$ with $\mathbf{B}^{\Gamma}$ Fixed   When $\mathbf{B}^{\Gamma}$ is fixed, the sub-problem of $\mathbf{B}^{\Omega}$ is given by:

$$\min_{\mathbf{B}^{\Omega} \in \{-1,1\}^{|\Omega| \times q}} \|q\widetilde{\mathbf{S}}^{\Gamma} - \mathbf{B}^{\Gamma}[\mathbf{B}^{\Omega}]^T\|_F^2 + \|q\widetilde{\mathbf{S}}^{\Omega} - \mathbf{B}^{\Omega}[\mathbf{B}^{\Omega}]^T\|_F^2.$$

Inspired by TSH (Lin et al. 2013a), we can transform the above problem into $q$ binary quadratic programming (BQP) problems. The optimization of the $k$th bit of $\mathbf{B}^{\Omega}$ is given by:

$$\min_{\mathbf{b}^k \in \{-1,1\}^{|\Omega|}} [\mathbf{b}^k]^T \mathbf{Q}^{(k)} \mathbf{b}^k + [\mathbf{b}^k]^T \mathbf{p}^{(k)}, \qquad (4)$$

where $\mathbf{b}^k$ denotes the $k$th column of $\mathbf{B}^{\Omega}$, and

$$Q^{(k)}_{i,j}\Big|_{i \neq j} = -2\Big(q\widetilde{S}^{\Omega}_{i,j} - \sum_{m=1}^{k-1} b^m_i b^m_j\Big), \quad Q^{(k)}_{i,i} = 0,$$

$$p^{(k)}_i = -2\sum_{l=1}^{|\Gamma|} B^{\Gamma}_{l,k}\Big(q\widetilde{S}^{\Gamma}_{l,i} - \sum_{m=1}^{k-1} B^{\Gamma}_{l,m} B^{\Omega}_{i,m}\Big).$$

Note that the formulation in (4) is not the same as that in TSH due to the additional linear term. More details about the above derivation can be found in the supplementary material.

Then, we need to turn the problem in (4) into a standard BQP form in which the domain of the binary variables is $\{0,1\}$ and there are no linear terms. First, we transform the domain of $\mathbf{b}^k$ from $\{-1,1\}^{|\Omega|}$ to $\{0,1\}^{|\Omega|}$. Let $\hat{\mathbf{b}}^k = \frac{1}{2}(\mathbf{b}^k + \mathbf{1})$; we have:

$$[\mathbf{b}^k]^T \mathbf{Q}^{(k)} \mathbf{b}^k + [\mathbf{b}^k]^T \mathbf{p}^{(k)} = 4\sum_{m=1}^{|\Omega|}\sum_{l=1}^{|\Omega|} \hat{b}^k_m \hat{b}^k_l Q^{(k)}_{m,l} + 2\sum_{m=1}^{|\Omega|} \hat{b}^k_m\Big(p^{(k)}_m - \sum_{l=1}^{|\Omega|}\big(Q^{(k)}_{m,l} + Q^{(k)}_{l,m}\big)\Big) + \mathrm{const},$$

where $\mathrm{const}$ is a constant. Hence, problem (4) can be reformulated as follows:

$$\min_{\hat{\mathbf{b}}^k \in \{0,1\}^{|\Omega|}} [\hat{\mathbf{b}}^k]^T \hat{\mathbf{Q}}^{(k)} \hat{\mathbf{b}}^k + [\hat{\mathbf{b}}^k]^T \hat{\mathbf{p}}^{(k)}, \qquad (5)$$

where $\hat{\mathbf{Q}}^{(k)} = 4\mathbf{Q}^{(k)}$ and $\hat{p}^{(k)}_i = 2\big[p^{(k)}_i - \sum_{l=1}^{|\Omega|}\big(Q^{(k)}_{i,l} + Q^{(k)}_{l,i}\big)\big]$. More details about the above derivation can also be found in the supplementary material.
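As a concrete illustration, here is a NumPy sketch of how $\hat{\mathbf{Q}}^{(k)}$ and $\hat{\mathbf{p}}^{(k)}$ in (5) could be assembled from the sampled blocks. It is an assumption-laden sketch: the function name, the 0-based bit index k, and the array layout are my own, not the paper's.

```python
import numpy as np

def build_bqp(S_omega, S_gamma, B_omega, B_gamma, k, q):
    """Assemble the {0,1}-domain BQP (5) for bit k (0-indexed), with bits 0..k-1 of B_Omega already fixed."""
    prev_o = B_omega[:, :k]                         # |Omega| x k block of already-fixed bits
    prev_g = B_gamma[:, :k]
    Q = -2.0 * (q * S_omega - prev_o @ prev_o.T)    # quadratic coefficients from the Omega block
    np.fill_diagonal(Q, 0.0)
    R = q * S_gamma - prev_g @ prev_o.T             # |Gamma| x |Omega| residual from the Gamma block
    p = -2.0 * (B_gamma[:, k] @ R)                  # linear coefficients, length |Omega|
    # move from {-1,1} to {0,1}: Q_hat = 4Q, p_hat_i = 2*(p_i - sum_l(Q_il + Q_li))
    Q_hat = 4.0 * Q
    p_hat = 2.0 * (p - (Q.sum(axis=1) + Q.sum(axis=0)))
    return Q_hat, p_hat
```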
Furthermore, a BQP can be turned into an equivalent form without linear terms (Yang 2013). That is to say, the problem in (5) can be rewritten as follows:

$$\min_{\tilde{\mathbf{b}}^k} [\tilde{\mathbf{b}}^k]^T \widetilde{\mathbf{Q}}^{(k)} \tilde{\mathbf{b}}^k \quad \text{s.t.} \quad \tilde{\mathbf{b}}^k \in \{0,1\}^{|\Omega|+1}, \; \tilde{b}^k_{|\Omega|+1} = 1, \qquad (6)$$

where

$$\tilde{\mathbf{b}}^k = \begin{bmatrix} \hat{\mathbf{b}}^k \\ 1 \end{bmatrix}, \quad \widetilde{\mathbf{Q}}^{(k)} = \begin{pmatrix} \hat{\mathbf{Q}}^{(k)} & \frac{1}{2}\hat{\mathbf{p}}^{(k)} \\ \frac{1}{2}[\hat{\mathbf{p}}^{(k)}]^T & 0 \end{pmatrix}.$$

A method proposed by (Yang 2013) can solve the problem in (6) with an additional constraint $\sum_{i=1}^{|\Omega|+1} \tilde{b}^k_i = \lceil (|\Omega|+1)/2 \rceil$. We also add this constraint to our problem to get a balanced result for each bit, which has been widely used in hashing (Liu et al. 2014). So we reformulate problem (6) with the constraint as follows:

$$\min_{\tilde{\mathbf{b}}^k} [\tilde{\mathbf{b}}^k]^T \widetilde{\mathbf{Q}}^{(k)} \tilde{\mathbf{b}}^k \quad \text{s.t.} \quad \tilde{\mathbf{b}}^k \in \{0,1\}^{M}, \; \tilde{b}^k_{M} = 1, \; \sum_{i=1}^{M} \tilde{b}^k_i = H, \qquad (7)$$

where $M = |\Omega| + 1$ and $H = \lceil M/2 \rceil$.

As in (Yang 2013), we transform the problem in (7) into an equivalent clustering problem: given a dataset $\mathcal{U} = \{\mathbf{u}_i \in \mathbb{R}^{M}\}_{i=1}^{M}$, we want to find a subset $\mathcal{U}'$ of size $H$ such that the sum of the squared distances within the subset $\mathcal{U}'$ is minimized. It can be formulated as:

$$\min \sum_{\mathbf{u} \in \mathcal{U}'} \Big\|\mathbf{u} - \frac{1}{H}\sum_{\mathbf{v} \in \mathcal{U}'} \mathbf{v}\Big\|^2 \quad \text{s.t.} \quad \mathcal{U}' \subseteq \mathcal{U}, \; |\mathcal{U}'| = H, \; \mathbf{u}_{M} \in \mathcal{U}'. \qquad (8)$$

Then, we have the following theorems.

Theorem 2. Let us use a matrix $\mathbf{U}$ of size $M \times M$ to denote the dataset $\mathcal{U}$ with $\mathbf{U}_{*i} = \mathbf{u}_i$. If $\mathbf{U}^T\mathbf{U} = \lambda\mathbf{I} - \widetilde{\mathbf{Q}}^{(k)} \succeq 0$, then (7) and (8) are equivalent.

Proof. We can use a similar method to that in (Yang 2013) to prove it.

According to (Yang 2013), we can always find such $\mathbf{U}$ and $\lambda$ satisfying $\lambda\mathbf{I} - \widetilde{\mathbf{Q}}^{(k)} \succeq 0$ in Theorem 2. In practice, we can take a sufficiently large number as $\lambda$, and perform Cholesky decomposition on $\lambda\mathbf{I} - \widetilde{\mathbf{Q}}^{(k)}$ to get $\mathbf{U}$.

Theorem 3. Assume $\mathcal{U}'_*$ is the global solution of (8) and $f(\mathcal{U}') = \sum_{\mathbf{u} \in \mathcal{U}'} \|\mathbf{u} - \frac{1}{H}\sum_{\mathbf{v} \in \mathcal{U}'} \mathbf{v}\|^2$ is the objective function of (8). There exists an algorithm which can find a solution $\mathcal{U}'_1$ with $f(\mathcal{U}'_1) \leq 2 f(\mathcal{U}'_*)$. That is to say, there exists a 2-approximation algorithm for (8).

Proof. Please refer to (Yang 2013) for the proof and algorithm.
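To illustrate the reduction, the following is a hedged NumPy sketch that augments the BQP (5) into the linear-term-free form (6) and produces the matrix $\mathbf{U}$ of Theorem 2. The choice of $\lambda$ via a Gershgorin-type bound is my own assumption; the paper only requires a sufficiently large $\lambda$. The 2-approximation clustering algorithm of (Yang 2013) would then be run on the columns of $\mathbf{U}$; it is not reproduced here.

```python
import numpy as np

def bqp_to_clustering(Q_hat, p_hat, eps=1e-6):
    """Build Q_tilde from (6) and a factor U with U^T U = lambda*I - Q_tilde, as used in Theorem 2."""
    m = Q_hat.shape[0]
    Q_tilde = np.zeros((m + 1, m + 1))
    Q_tilde[:m, :m] = Q_hat
    Q_tilde[:m, m] = 0.5 * p_hat        # the linear term is absorbed via an extra variable fixed to 1
    Q_tilde[m, :m] = 0.5 * p_hat
    # a Gershgorin-style bound keeps lambda*I - Q_tilde positive definite (assumption, not from the paper)
    lam = np.max(np.abs(Q_tilde).sum(axis=1)) + eps
    L = np.linalg.cholesky(lam * np.eye(m + 1) - Q_tilde)   # lower-triangular, L @ L.T = lam*I - Q_tilde
    U = L.T                                                 # columns of U are the points u_i of problem (8)
    return Q_tilde, U, lam
```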
The Whole Algorithm   The whole learning (optimization) algorithm is summarized in Algorithm 1.

Algorithm 1  Discrete optimization in COSDISH
Input: S ∈ {−1,1}^{n×n}, q, T_sto, T_alt, |Ω|
  Initialize the binary code B by randomization.
  for iter = 1 → T_sto do
    Sample |Ω| columns of S to get S̃.
    Let Ω be the set of sampled column indices and Γ = N − Ω, with N = {1, 2, ..., n}.
    Split S̃ into S̃^Ω and S̃^Γ.
    Split B into B^Ω_(0) and B^Γ_(0).
    for t = 1 → T_alt do
      for k = 1 → q do
        Construct problem (4) from B^Γ_(t−1), S̃^Ω, S̃^Γ and the first k − 1 columns of B^Ω_(t).
        Construct problem (5) from problem (4).
        Construct problem (7) from problems (5) and (6).
        Construct problem (8) from problem (7) by performing Cholesky decomposition.
        Use the 2-approximation algorithm (Yang 2013) to solve problem (8) and acquire the kth column of B^Ω_(t).
      end for
      B^Γ_(t) = e_sgn(S̃^Γ B^Ω_(t), B^Γ_(t−1)).
    end for
    Recover B by combining B^Ω_(t) and B^Γ_(t).
  end for
Output: B ∈ {−1,1}^{n×q}

In general, $10 \leq T_{sto} \leq 20$, $3 \leq T_{alt} \leq 10$ and $|\Omega| \geq q$ are enough to get satisfactory performance. Unless otherwise stated, we set $T_{sto} = 10$, $T_{alt} = 3$ and $|\Omega| = q$ in our experiments. Furthermore, in our experiments we find that our algorithm is not sensitive to the initialization. Hence, we adopt random initialization in this paper.

Remark 1. Unlike other hashing methods such as TSH (Lin et al. 2013a), which try to solve the whole problem as $q$ BQP problems, our alternating optimization strategy decomposes the whole problem into two sub-problems. Typically, $|\Omega| \ll |\Gamma|$, i.e., the number of variables in $\mathbf{B}^{\Omega}$ is far less than that in $\mathbf{B}^{\Gamma}$. Furthermore, the cost to get a solution for problem (3) is much lower than that to get a solution for a BQP. Hence, the key idea of our alternating optimization strategy is to adopt a faster solution for the larger sub-problem, which makes our strategy much faster than TSH. Moreover, the faster solution of our strategy can also guarantee accuracy, which will be verified in the experiments.

TSH adopts LBFGSB to solve the BQP problem in a continuous-relaxation way. We found that if LBFGSB is used to solve the BQP problem in our method (i.e., problem (6)), the accuracy of our method deteriorates dramatically. The reason is that the solution of LBFGSB hurts the quality of the solution of $\mathbf{B}^{\Gamma}$.

In addition, the graph-cut method used in FastH cannot be adapted to solve our problem (6), because problem (6) does not satisfy the sub-modular property required by the graph-cut method.
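Putting the pieces together, a hedged Python outline of Algorithm 1 follows. It reuses the illustrative helpers sketched earlier (sample_columns, build_bqp, update_B_gamma) and leaves the balanced 2-approximation BQP solver of (Yang 2013) as an injected placeholder; it is an outline under those assumptions, not the authors' implementation.

```python
import numpy as np

def cosdish_train(S, q, solve_bqp, T_sto=10, T_alt=3, num_cols=None, rng=None):
    """Outline of Algorithm 1; solve_bqp(Q_hat, p_hat) must return a vector in {-1,1}^{|Omega|}."""
    rng = rng or np.random.default_rng()
    n = S.shape[0]
    num_cols = num_cols or q
    B = np.where(rng.random((n, q)) > 0.5, 1, -1)            # random initialization
    for _ in range(T_sto):
        omega, gamma, S_o, S_g, B_o, B_g = sample_columns(S, B, num_cols, rng)
        for _ in range(T_alt):
            for k in range(q):                               # one BQP per bit of B_Omega
                Q_hat, p_hat = build_bqp(S_o, S_g, B_o, B_g, k, q)
                B_o[:, k] = solve_bqp(Q_hat, p_hat)
            B_g = update_B_gamma(S_g, B_o, B_g)              # closed-form 2q-approximation step
        B[omega, :], B[gamma, :] = B_o, B_g                  # write the blocks back into B
    return B
```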
Soft Constraints

As mentioned in (Leng et al. 2014), when we can only get a subset of the semantic information, pushing two dissimilar points to have the maximum Hamming distance may lead to over-fitting and unexpected results. Moreover, the number of dissimilar labels is typically far larger than the number of similar labels. Hence, we can also view it as a class-imbalance problem between positive and negative labels.

Inspired by (Leng et al. 2014), we change the element $-1$ in our similarity matrix $\widetilde{\mathbf{S}}$ to a real value $0 < \beta < 1$. More specifically, we empirically take

$$\beta = \frac{\text{the number of 1s in } \widetilde{\mathbf{S}}}{\text{the number of $-1$s in } \widetilde{\mathbf{S}}}.$$

Please note that although soft constraints can further improve performance, the superior performance of our method mainly comes from the learning procedure rather than the soft constraints. Empirical verification of this can be found in the supplementary material.

Out-of-Sample Extension

Many supervised hashing methods can be viewed as two-step methods (Lin et al. 2013a): learn the binary code in the first step, and then train $q$ binary classifiers based on the feature matrix $\mathbf{X}$ and the learned code matrix $\mathbf{B}$ in the second step, with each bit corresponding to one classifier (Zhang et al. 2014; Lin et al. 2013a; 2014; Xia et al. 2014). Besides those methods using linear classifiers (Wang, Kumar, and Chang 2010a; Zhang et al. 2014), some other methods use more powerful nonlinear classifiers, such as SVM with RBF kernel (Lin et al. 2013a), deep convolutional networks (Xia et al. 2014) and boosted decision trees (Lin et al. 2014). In general, the more powerful the classifiers we use for out-of-sample extension, the better the accuracy we can achieve and the more training time will be consumed (Lin et al. 2013a; 2014). FastH (Lin et al. 2014) adopts an efficient implementation of boosted decision trees for out-of-sample extension, which shows better accuracy and less training time than other methods like KSH and TSH with nonlinear classifiers.

Our COSDISH is also a two-step method. For out-of-sample extension, COSDISH chooses either the linear classifier or the boosted decision trees of FastH, yielding two different variants. We will empirically evaluate these two variants in our experiments.
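For the linear-classifier variant, one common and simple instantiation (an assumption on my part; the paper does not prescribe a specific linear model) is ridge regression from features to codes, thresholded by the sign function at query time:

```python
import numpy as np

def fit_linear_hash(X, B, lam=1.0):
    """Least-squares hash functions: a d x q projection W such that sgn(x W) approximates the learned codes."""
    d = X.shape[1]
    W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ B)
    return W

def hash_queries(W, X_query):
    """Out-of-sample extension: predict q-bit codes for unseen points."""
    codes = np.sign(X_query @ W)
    codes[codes == 0] = 1          # break ties arbitrarily toward +1
    return codes
```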
Complexity Analysis

The time complexity to construct problem (4) is $O(|\Gamma| \times |\Omega| + |\Omega|^2)$. The time complexities to construct problem (5) and problem (7) are both $O(|\Omega|^2)$. Performing Cholesky decomposition on $\lambda\mathbf{I} - \widetilde{\mathbf{Q}}^{(k)}$ needs $O(|\Omega|^3)$. The 2-approximation algorithm to solve the clustering problem needs $O(|\Omega|^3 + |\Omega|^2 \log |\Omega|)$. For the $\mathbf{B}^{\Omega}$-subproblem, we need to solve $q$ BQP problems. Hence, the complexity of the $\mathbf{B}^{\Omega}$-subproblem is $O(q \times (|\Gamma| \times |\Omega| + |\Omega|^3))$. In addition, the time complexity of the $\mathbf{B}^{\Gamma}$-subproblem is $O(q \times |\Gamma| \times |\Omega|)$. Therefore, the total time complexity is $O(T_{sto} \times T_{alt} \times q \times (|\Gamma| \times |\Omega| + |\Omega|^3))$, and the space complexity is $O(|\Gamma| \times |\Omega| + |\Omega|^2)$. If we take $|\Omega| = q$, then the time complexity is $O(T_{sto} \times T_{alt} \times (nq^2 + q^4))$, which is linear in $n$. Typically, $q$ is very small, e.g., less than 64. Hence, our method is scalable.

Experiment

We use real datasets to evaluate the effectiveness of our method. All the experiments are conducted on a workstation with 6 Intel Xeon CPU cores and 48GB RAM.

Dataset

Two image datasets with semantic labels are used to evaluate our method and the other baselines. They are CIFAR-10 (Krizhevsky 2009) and NUS-WIDE (Chua et al. 2009). Both of them have been widely used for hashing evaluation (Lin et al. 2013a; Zhang et al. 2014). Each instance in CIFAR-10 has a single label, and each instance in NUS-WIDE might have multiple labels.

CIFAR-10 contains 60,000 images. Each image is represented by a 512-dimensional GIST feature vector extracted from the original color image of size 32 × 32. Each image is manually labeled as one of the ten classes. Two images are considered to be semantically similar if they share the same class label. Otherwise, they are treated as semantically dissimilar.

NUS-WIDE includes 269,648 images crawled from Flickr with 81 ground-truth labels (tags). Each image is represented by a 1134-dimensional feature vector after feature extraction. Each image might be associated with multiple labels. There also exist some images without any label, which are not suitable for our evaluation. After removing those images without any label, we get 209,347 images for our experiment. We consider two images to be semantically similar if they share at least one common label. Otherwise, they are semantically dissimilar.

For all the datasets, we perform normalization on the feature vectors to make each dimension have zero mean and equal variance.

Experimental Settings and Baselines

As in LFH (Zhang et al. 2014), for all datasets we randomly choose 1000 points as the validation set and 1000 points as the query (test) set, with the rest of the points as the training set. All experimental results are the average values over 10 independent random partitions.

Unless otherwise stated, COSDISH refers to the variant with soft constraints because in most cases it will outperform the variant without soft constraints (refer to the supplementary material). We use COSDISH to denote our method with the linear classifier for out-of-sample extension, and COSDISH_BT to denote our method with boosted decision trees for out-of-sample extension.

Because existing methods (Lin et al. 2013a; Zhang et al. 2014) have shown that supervised methods can outperform unsupervised methods, we only compare our method with some representative supervised hashing methods, including SPLH (Wang, Kumar, and Chang 2010b), KSH (Liu et al. 2012), TSH (Lin et al. 2013a), LFH (Zhang et al. 2014), FastH (Lin et al. 2014) and SDH (Shen et al. 2015).
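As a final illustrative sketch (assumptions: {0,1} label-indicator matrices and hypothetical function names), the ground-truth similarity described above and the soft-constraint weight β from the Soft Constraints subsection can be computed as follows:

```python
import numpy as np

def build_similarity(labels_a, labels_b):
    """S_ij = 1 if the two images share at least one label, else -1 (labels are {0,1} indicator matrices)."""
    return np.where(labels_a @ labels_b.T > 0, 1, -1)

def soft_negative_weight(S_tilde):
    """beta = (#entries equal to 1) / (#entries equal to -1) in the sampled columns S_tilde."""
    return np.sum(S_tilde == 1) / max(np.sum(S_tilde == -1), 1)
```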