Isotropic Hashing

Weihao Kong, Wu-Jun Li
Shanghai Key Laboratory of Scalable Computing and Systems
Department of Computer Science and Engineering, Shanghai Jiao Tong University, China
{kongweihao,liwujun}@cs.sjtu.edu.cn

Abstract

Most existing hashing methods adopt some projection functions to project the original data into several dimensions of real values, and then each of these projected dimensions is quantized into one bit (zero or one) by thresholding. Typically, the variances of different projected dimensions are different for existing projection functions such as principal component analysis (PCA). Using the same number of bits for different projected dimensions is unreasonable because larger-variance dimensions will carry more information. Although this viewpoint has been widely accepted by many researchers, it has still not been verified by either theory or experiment because no methods have been proposed to find a projection with equal variances for different dimensions. In this paper, we propose a novel method, called isotropic hashing (IsoHash), to learn projection functions which can produce projected dimensions with isotropic variances (equal variances). Experimental results on real data sets show that IsoHash can outperform its counterpart with different variances for different dimensions, which verifies the viewpoint that projections with isotropic variances will be better than those with anisotropic variances.

1 Introduction

Due to its fast query speed and low storage cost, hashing [1, 5] has been successfully used for approximate nearest neighbor (ANN) search [28]. The basic idea of hashing is to learn similarity-preserving binary codes for data representation. More specifically, each data point will be hashed into a compact binary string, and similar points in the original feature space should be hashed into close points in the hashcode space. Compared with the original feature representation, hashing has two advantages. One is the reduced storage cost, and the other is the constant or sub-linear query time complexity [28]. These two advantages make hashing a promising choice for efficient ANN search in massive data sets [1, 5, 6, 9, 10, 14, 15, 17, 20, 21, 23, 26, 29, 30, 31, 32, 33, 34].

Most existing hashing methods adopt some projection functions to project the original data into several dimensions of real values, and then each of these projected dimensions is quantized into one bit (zero or one) by thresholding. Locality-sensitive hashing (LSH) [1, 5] and its extensions [4, 18, 19, 22, 25] use simple random projections for hash functions. These methods are called data-independent methods because the projection functions are independent of training data. Another class of methods are called data-dependent methods, whose projection functions are learned from training data. Representative data-dependent methods include spectral hashing (SH) [31], anchor graph hashing (AGH) [21], sequential projection learning (SPL) [29], principal component analysis [13] based hashing (PCAH) [7], and iterative quantization (ITQ) [7, 8]. SH learns the hashing functions based on spectral graph partitioning. AGH adopts anchor graphs to speed up the computation of graph Laplacian eigenvectors, based on which the Nyström method is used to compute projection functions. SPL learns the projection functions in a sequential way such that each function is designed to correct the errors caused by the previous one. PCAH adopts principal component analysis (PCA) to learn the projection functions.
ITQ tries to learn an orthogonal rotation matrix to refine the initial projection matrix learned by PCA so that the quantization error of mapping the data
to the vertices of the binary hypercube is minimized. Compared to the data-dependent methods, the data-independent methods need longer codes to achieve satisfactory performance [7].

For most existing projection functions such as those mentioned above, the variances of different projected dimensions are different. Many researchers [7, 12, 21] have argued that using the same number of bits for different projected dimensions with unequal variances is unreasonable because larger-variance dimensions will carry more information. Some methods [7, 12] apply an orthogonal transformation to the PCA-projected data with the expectation of balancing the variances of different PCA dimensions, and achieve better performance than the original PCA based hashing. However, to the best of our knowledge, there exist no methods which can guarantee to learn a projection with equal variances for different dimensions. Hence, the viewpoint that using the same number of bits for different projected dimensions is unreasonable has still not been verified by either theory or experiment.

In this paper, a novel hashing method, called isotropic hashing (IsoHash), is proposed to learn a projection function which can produce projected dimensions with isotropic variances (equal variances). To the best of our knowledge, this is the first work which can learn projections with isotropic variances for hashing. Experimental results on real data sets show that IsoHash can outperform its counterpart with anisotropic variances for different dimensions, which verifies the intuitive viewpoint that projections with isotropic variances will be better than those with anisotropic variances. Furthermore, the performance of IsoHash is also comparable, if not superior, to the state-of-the-art methods.

2 Isotropic Hashing

2.1 Problem Statement

Assume we are given $n$ data points $\{x_1, x_2, \cdots, x_n\}$ with $x_i \in \mathbb{R}^d$, which form the columns of the data matrix $X \in \mathbb{R}^{d \times n}$. Without loss of generality, in this paper the data are assumed to be zero centered, which means $\sum_{i=1}^{n} x_i = 0$. The basic idea of hashing is to map each point $x_i$ into a binary string $y_i \in \{0, 1\}^m$ with $m$ denoting the code size. Furthermore, close points in the original space $\mathbb{R}^d$ should be hashed into similar binary codes in the code space $\{0, 1\}^m$ to preserve the similarity structure in the original space. In general, we compute the binary code of $x_i$ as $y_i = [h_1(x_i), h_2(x_i), \cdots, h_m(x_i)]^T$ with $m$ binary hash functions $\{h_k(\cdot)\}_{k=1}^{m}$.

Because it is NP-hard to directly compute the best binary functions $h_k(\cdot)$ for a given data set [31], most hashing methods adopt a two-stage strategy to learn $h_k(\cdot)$. In the projection stage, $m$ real-valued projection functions $\{f_k(x)\}_{k=1}^{m}$ are learned, and each function can generate one real value. Hence, we have $m$ projected dimensions, each of which corresponds to one projection function. In the quantization stage, the real values are quantized into a binary string by thresholding.

Currently, most methods use one bit to quantize each projected dimension. More specifically, $h_k(x_i) = \mathrm{sgn}(f_k(x_i))$, where $\mathrm{sgn}(x) = 1$ if $x \geq 0$ and $0$ otherwise. The only exceptions among the quantization methods are AGH [21], DBQ [14] and MH [15], which use two bits to quantize each dimension. In sum, all of these methods adopt the same number (either one or two) of bits for different projected dimensions. However, the variances of different projected dimensions are unequal, and larger-variance dimensions typically carry more information. Hence, using the same number of bits for different projected dimensions with unequal variances is unreasonable, which has also been argued by many researchers [7, 12, 21]. Unfortunately, there exist no methods which can learn projection functions with equal variances for different dimensions. In the remainder of this section, we present a novel model to learn projections with isotropic variances.
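To make the two-stage strategy concrete, here is a minimal NumPy sketch of the generic projection-plus-thresholding pipeline described above. The linear projection matrix W in the toy driver is a placeholder assumption (an LSH-style random projection), not the IsoHash projection derived later; the sketch only illustrates the "project, then take signs" structure shared by the methods discussed in this section.

```python
import numpy as np

def binary_codes(X, W):
    """Two-stage hashing: project the columns of X (d x n) with W (d x m),
    then quantize each projected dimension into one bit by thresholding at zero."""
    projected = W.T @ X                        # f_k(x_i): m real values per point
    return (projected >= 0).astype(np.uint8)   # h_k(x_i) = sgn(f_k(x_i))

# toy usage with a random (LSH-style) projection, purely for illustration
rng = np.random.default_rng(0)
d, n, m = 16, 100, 8
X = rng.standard_normal((d, n))
X -= X.mean(axis=1, keepdims=True)             # zero-center the data, as assumed above
W = rng.standard_normal((d, m))                # placeholder projection matrix
Y = binary_codes(X, W)
print(Y.shape)                                  # (8, 100): one m-bit code per data point
```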
2.2 Model Formulation

The idea of our IsoHash method is to learn an orthogonal matrix to rotate the PCA projection matrix. To generate a code of $m$ bits, PCAH performs PCA on $X$, and then uses the top $m$ eigenvectors of the covariance matrix $XX^T$ as the columns of the projection matrix $W \in \mathbb{R}^{d \times m}$. Here, the top $m$ eigenvectors are those corresponding to the $m$ largest eigenvalues $\{\lambda_k\}_{k=1}^{m}$, generally arranged in the non-increasing order $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m$. Hence, the projection functions of PCAH are defined as follows: $f_k(x) = w_k^T x$, where $w_k$ is the $k$th column of $W$.
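As a sanity check on this construction, the following sketch computes the PCAH projection from the top-m eigenvectors of $XX^T$ and confirms that the per-dimension spreads of the projected data are exactly the eigenvalues, i.e., anisotropic. It is an illustrative reimplementation with synthetic data, not the authors' code.

```python
import numpy as np

def pcah_projection(X, m):
    """Return the PCAH projection matrix W (d x m) whose columns are the top-m
    eigenvectors of X X^T, together with the corresponding eigenvalues."""
    C = X @ X.T
    eigvals, eigvecs = np.linalg.eigh(C)       # ascending order
    idx = np.argsort(eigvals)[::-1][:m]        # indices of the m largest eigenvalues
    return eigvecs[:, idx], eigvals[idx]

rng = np.random.default_rng(0)
d, n, m = 16, 1000, 8
X = rng.standard_normal((d, n)) * np.linspace(3.0, 0.5, d)[:, None]
X -= X.mean(axis=1, keepdims=True)             # zero-centered, as assumed in Section 2.1
W, lam = pcah_projection(X, m)

# diag(W^T X X^T W) recovers the eigenvalues lambda_k, so the projected
# dimensions have unequal (anisotropic) spreads
print(np.allclose(np.diag(W.T @ X @ X.T @ W), lam))   # True
```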
Let $\lambda = [\lambda_1, \lambda_2, \cdots, \lambda_m]^T$ and $\Lambda = \mathrm{diag}(\lambda)$, where $\mathrm{diag}(\lambda)$ denotes the diagonal matrix whose diagonal entries are formed from the vector $\lambda$. It is easy to prove that $W^T X X^T W = \Lambda$. Hence, the variance of the values $\{f_k(x_i)\}_{i=1}^{n}$ on the $k$th projected dimension, which corresponds to the $k$th row of $W^T X$, is $\lambda_k$. Obviously, the variances for different PCA dimensions are anisotropic.

To get isotropic projection functions, the idea of our IsoHash method is to learn an orthogonal matrix $Q \in \mathbb{R}^{m \times m}$ which makes $Q^T W^T X X^T W Q$ become a matrix with equal diagonal values, i.e., $[Q^T W^T X X^T W Q]_{11} = [Q^T W^T X X^T W Q]_{22} = \cdots = [Q^T W^T X X^T W Q]_{mm}$. Here, $A_{ii}$ denotes the $i$th diagonal entry of a square matrix $A$, and a matrix $Q$ is said to be orthogonal if $Q^T Q = I$, where $I$ is an identity matrix whose dimensionality depends on the context. The effect of the orthogonal matrix $Q$ is to rotate the coordinate axes while keeping the Euclidean distances between any two points unchanged. It is easy to prove that the new projection functions of IsoHash are $f_k(x) = (WQ)_k^T x$, which have the same (isotropic) variance. Here $(WQ)_k$ denotes the $k$th column of $WQ$.

If we use $\mathrm{tr}(A)$ to denote the trace of a symmetric matrix $A$, we have the following Lemma 1.

Lemma 1. If $Q^T Q = I$, $\mathrm{tr}(Q^T A Q) = \mathrm{tr}(A)$.

Based on Lemma 1, we have $\mathrm{tr}(Q^T W^T X X^T W Q) = \mathrm{tr}(W^T X X^T W) = \mathrm{tr}(\Lambda) = \sum_{i=1}^{m} \lambda_i$ if $Q^T Q = I$. Hence, to make $Q^T W^T X X^T W Q$ become a matrix with equal diagonal values, we should set this diagonal value $a = \frac{\sum_{i=1}^{m} \lambda_i}{m}$. Let

$$a = [a_1, a_2, \cdots, a_m] \quad \text{with} \quad a_i = a = \frac{\sum_{i=1}^{m} \lambda_i}{m}, \qquad (1)$$

and

$$\mathcal{T}(z) = \{T \in \mathbb{R}^{m \times m} \mid \mathrm{diag}(T) = \mathrm{diag}(z)\},$$

where $z$ is a vector of length $m$, and $\mathrm{diag}(T)$ is overloaded to denote the diagonal matrix with the same diagonal entries as the matrix $T$.

Based on our motivation of IsoHash, we can define the problem of IsoHash as follows:

Problem 1. The problem of IsoHash is to find an orthogonal matrix $Q$ making $Q^T W^T X X^T W Q \in \mathcal{T}(a)$, where $a$ is defined in (1).

Then, we have the following Theorem 1:

Theorem 1. Assume $Q^T Q = I$ and $T \in \mathcal{T}(a)$. If $Q^T \Lambda Q = T$, $Q$ will be a solution to the problem of IsoHash.

Proof. Because $W^T X X^T W = \Lambda$, we have $Q^T \Lambda Q = Q^T [W^T X X^T W] Q$. It is obvious that $Q$ will be a solution to the problem of IsoHash.

As in [2], we define

$$\mathcal{M}(\Lambda) = \{Q^T \Lambda Q \mid Q \in \mathcal{O}(m)\}, \qquad (2)$$

where $\mathcal{O}(m)$ is the set of all orthogonal matrices in $\mathbb{R}^{m \times m}$, i.e., those satisfying $Q^T Q = I$.

According to Theorem 1, the problem of IsoHash is equivalent to finding an orthogonal matrix $Q$ for the following equation [2]:

$$||T - Z||_F = 0, \qquad (3)$$

where $T \in \mathcal{T}(a)$, $Z \in \mathcal{M}(\Lambda)$, and $|| \cdot ||_F$ denotes the Frobenius norm. Please note that, for ease of understanding, we use the same notations as those in [2].

In the following, we will use the Schur-Horn lemma [11] to prove that we can always find a solution to problem (3).
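The choice of the common diagonal value $a$ follows directly from Lemma 1. The small NumPy check below, with toy eigenvalues and purely for illustration, confirms that the trace of $Q^T \Lambda Q$ is invariant under any orthogonal $Q$, so equal diagonal entries are only possible with $a = \mathrm{tr}(\Lambda)/m$ as in equation (1).

```python
import numpy as np

# toy eigenvalues lambda_1 >= ... >= lambda_m from a hypothetical PCA step
lam = np.array([4.0, 2.0, 1.0, 0.5])
Lam = np.diag(lam)
m = len(lam)

# a random orthogonal matrix Q (QR decomposition of a Gaussian matrix)
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))

# Lemma 1: the trace is invariant under the rotation, so if all diagonal
# entries of Q^T Lambda Q are equal they must equal a = tr(Lambda) / m
a = np.trace(Lam) / m
print(np.isclose(np.trace(Q.T @ Lam @ Q), np.trace(Lam)))   # True
print(a)                                                    # 1.875
```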
Lemma 2. [Schur-Horn Lemma] Let $c = \{c_i\} \in \mathbb{R}^m$ and $b = \{b_i\} \in \mathbb{R}^m$ be real vectors in non-increasing order respectively¹, i.e., $c_1 \geq c_2 \geq \cdots \geq c_m$, $b_1 \geq b_2 \geq \cdots \geq b_m$. There exists a Hermitian matrix $H$ with eigenvalues $c$ and diagonal values $b$ if and only if

$$\sum_{i=1}^{k} b_i \leq \sum_{i=1}^{k} c_i, \quad \text{for any } k = 1, 2, \ldots, m,$$
$$\sum_{i=1}^{m} b_i = \sum_{i=1}^{m} c_i.$$

Proof. Please refer to Horn's article [11].

Based on Lemma 2, we have the following Theorem 2.

Theorem 2. There exists a solution to the IsoHash problem in (3). And this solution is in the intersection of $\mathcal{T}(a)$ and $\mathcal{M}(\Lambda)$.

Proof. Because $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m$ and $a_1 = a_2 = \cdots = a_m = \frac{\sum_{i=1}^{m} \lambda_i}{m}$, it is easy to prove that $\frac{\sum_{i=1}^{k} \lambda_i}{k} \geq \frac{\sum_{i=1}^{m} \lambda_i}{m}$ for any $k$. Hence, $\sum_{i=1}^{k} \lambda_i = k \times \frac{\sum_{i=1}^{k} \lambda_i}{k} \geq k \times \frac{\sum_{i=1}^{m} \lambda_i}{m} = \sum_{i=1}^{k} a_i$. Furthermore, we can prove that $\sum_{i=1}^{m} \lambda_i = \sum_{i=1}^{m} a_i$. According to Lemma 2, there exists a Hermitian matrix $H$ with eigenvalues $\lambda$ and diagonal values $a$. Moreover, we can prove that $H$ is in the intersection of $\mathcal{T}(a)$ and $\mathcal{M}(\Lambda)$, i.e., $H \in \mathcal{T}(a)$ and $H \in \mathcal{M}(\Lambda)$.

According to Theorem 2, finding a $Q$ solving the problem in (3) is equivalent to finding the intersection point of $\mathcal{T}(a)$ and $\mathcal{M}(\Lambda)$, which is just an inverse eigenvalue problem, called SHIEP in [2].

2.3 Learning

The problem in (3) can be reformulated as the following optimization problem:

$$\operatorname*{argmin}_{Q:\, T \in \mathcal{T}(a),\, Z \in \mathcal{M}(\Lambda)} ||T - Z||_F. \qquad (4)$$

As in [2], we propose two algorithms to learn $Q$: one is called lift and projection (LP), and the other is called gradient flow (GF). For ease of understanding, we use the same notations as those in [2], and some proofs of theorems are omitted. The readers can refer to [2] for the details.

2.3.1 Lift and Projection

The main idea of the lift and projection (LP) algorithm is to alternate between the following two steps (a short code sketch of one alternation is given after this list):

• Lift step: Given a $T^{(k)} \in \mathcal{T}(a)$, we find the point $Z^{(k)} \in \mathcal{M}(\Lambda)$ such that $||T^{(k)} - Z^{(k)}||_F = \mathrm{dist}(T^{(k)}, \mathcal{M}(\Lambda))$, where $\mathrm{dist}(T^{(k)}, \mathcal{M}(\Lambda))$ denotes the minimum distance between $T^{(k)}$ and the points in $\mathcal{M}(\Lambda)$.

• Projection step: Given a $Z^{(k)}$, we find $T^{(k+1)} \in \mathcal{T}(a)$ such that $||T^{(k+1)} - Z^{(k)}||_F = \mathrm{dist}(\mathcal{T}(a), Z^{(k)})$, where $\mathrm{dist}(\mathcal{T}(a), Z^{(k)})$ denotes the minimum distance between $Z^{(k)}$ and the points in $\mathcal{T}(a)$.

¹ Please note that in [2] the values are in increasing order. It is easy to prove that our presentation of the Schur-Horn lemma is equivalent to that in [2]. The non-increasing order is chosen here just because it facilitates our following presentation due to the non-increasing order of the eigenvalues in $\Lambda$.
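The following NumPy sketch illustrates one such alternation; it uses the closed forms that are derived immediately below (equation (5) for the projection step and Theorem 3 / equation (6) for the lift step), and the toy driver mirrors the random orthogonal initialization of Algorithm 1. It is an illustrative reimplementation under these assumptions, not the authors' released code.

```python
import numpy as np

def projection_step(Z, a):
    """Project Z onto T(a): keep the off-diagonal entries and replace the
    diagonal with the common target value a (the closed form of equation (5))."""
    T = Z.copy()
    np.fill_diagonal(T, a)
    return T

def lift_step(T, Lam):
    """Lift T onto M(Lambda): eigendecompose T, keep its eigenvectors, and
    swap in the eigenvalues of Lambda, matched in non-increasing order
    (the closed form of Theorem 3 / equation (6))."""
    w, V = np.linalg.eigh(T)       # ascending eigenvalues
    V = V[:, ::-1]                 # columns reordered by non-increasing eigenvalue
    return V @ Lam @ V.T           # Z = Q^T Lambda Q with Q = V^T

# toy run mirroring Algorithm 1: start from a randomly rotated Lambda and alternate
lam = np.array([4.0, 2.0, 1.0, 0.5])           # toy PCA eigenvalues, non-increasing
Lam, a = np.diag(lam), lam.mean()
rng = np.random.default_rng(0)
Q0, _ = np.linalg.qr(rng.standard_normal((4, 4)))
Z = Q0.T @ Lam @ Q0
for _ in range(100):
    T = projection_step(Z, a)
    Z = lift_step(T, Lam)
print(np.round(np.diag(Z), 3))                  # diagonal should move toward a = 1.875
```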
We call $Z^{(k)}$ a lift of $T^{(k)}$ onto $\mathcal{M}(\Lambda)$ and $T^{(k+1)}$ a projection of $Z^{(k)}$ onto $\mathcal{T}(a)$. The projection operation is easy to complete. Suppose $Z^{(k)} = [z_{ij}]$; then $T^{(k+1)} = [t_{ij}]$ must be given by

$$t_{ij} = \begin{cases} z_{ij}, & \text{if } i \neq j \\ a_i, & \text{if } i = j \end{cases} \qquad (5)$$

For the lift operation, we have the following Theorem 3.

Theorem 3. Suppose $T = Q^T D Q$ is an eigen-decomposition of $T$, where $D = \mathrm{diag}(d)$ with $d = [d_1, d_2, \ldots, d_m]^T$ being $T$'s eigenvalues, which are ordered as $d_1 \geq d_2 \geq \cdots \geq d_m$. Then the nearest neighbor of $T$ in $\mathcal{M}(\Lambda)$ is given by

$$Z = Q^T \Lambda Q. \qquad (6)$$

Proof. See Theorem 4.1 in [3].

Since in each step we minimize the distance between $T$ and $Z$, we have

$$||T^{(k)} - Z^{(k)}||_F \geq ||T^{(k+1)} - Z^{(k)}||_F \geq ||T^{(k+1)} - Z^{(k+1)}||_F.$$

It is easy to see that $(T^{(k)}, Z^{(k)})$ will converge to a stationary point. The whole IsoHash algorithm based on LP, abbreviated as IsoHash-LP, is briefly summarized in Algorithm 1.

Algorithm 1 Lift and projection based IsoHash (IsoHash-LP)
Input: $X \in \mathbb{R}^{d \times n}$, $m \in \mathbb{N}^+$, $t \in \mathbb{N}^+$
• $[\Lambda, W] = PCA(X, m)$, as stated in Section 2.2.
• Generate a random orthogonal matrix $Q_0 \in \mathbb{R}^{m \times m}$.
• $Z^{(0)} \leftarrow Q_0^T \Lambda Q_0$.
• for $k = 1 \to t$ do
    Calculate $T^{(k)}$ from $Z^{(k-1)}$ by equation (5).
    Perform eigen-decomposition of $T^{(k)}$ to get $Q_k^T D Q_k = T^{(k)}$.
    Calculate $Z^{(k)}$ from $Q_k$ and $\Lambda$ by equation (6).
• end for
• $Y = \mathrm{sgn}(Q_t^T W^T X)$.
Output: $Y$

Because $\mathcal{M}(\Lambda)$ is not a convex set, the stationary point we find is not necessarily inside the intersection of $\mathcal{T}(a)$ and $\mathcal{M}(\Lambda)$. For example, if we set $Z^{(0)} = \Lambda$, the lift and projection learning algorithm would make no progress because $Z$ and $T$ are already at a stationary point. To avoid such degenerate solutions, we initialize $Z$ as $\Lambda$ transformed by some random orthogonal matrix $Q_0$, as illustrated in Algorithm 1.

2.3.2 Gradient Flow

Another learning algorithm is a continuous one based on the construction of a gradient flow (GF) on the surface $\mathcal{M}(\Lambda)$ that moves towards the desired intersection point. Because there always exists a solution to the problem in (3) according to Theorem 2, the objective function in (4) can be reformulated as follows [2]:

$$\min_{Q \in \mathcal{O}(m)} F(Q) = \frac{1}{2} ||\mathrm{diag}(Q^T \Lambda Q) - \mathrm{diag}(a)||_F^2. \qquad (7)$$

The details about how to optimize (7) can be found in [2]. We just show some key steps of the learning algorithm in the following content.

The gradient $\nabla F$ at $Q$ can be calculated as

$$\nabla F(Q) = 2\Lambda\beta(Q), \qquad (8)$$

where $\beta(Q) = \mathrm{diag}(Q^T \Lambda Q) - \mathrm{diag}(a)$. Once we have computed the gradient of $F$, it can be projected onto the manifold $\mathcal{O}(m)$ according to the following Theorem 4.
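To connect these formulas to something executable, the following NumPy sketch (illustrative only, with toy eigenvalues) evaluates the objective $F(Q)$ of equation (7) and the residual matrix $\beta(Q)$ that appears in the gradient (8); $F(Q)$ is exactly half the squared Frobenius norm of $\beta(Q)$, and it vanishes precisely when the rotated variances are isotropic.

```python
import numpy as np

def beta(Q, Lam, a):
    """beta(Q) = diag(Q^T Lambda Q) - diag(a): the deviation of the rotated
    diagonal from the common target value a, as a diagonal matrix."""
    return np.diag(np.diag(Q.T @ Lam @ Q)) - np.diag(np.full(Lam.shape[0], a))

def objective(Q, Lam, a):
    """F(Q) = 0.5 * ||diag(Q^T Lambda Q) - diag(a)||_F^2, i.e. equation (7)."""
    return 0.5 * np.linalg.norm(beta(Q, Lam, a), 'fro') ** 2

# toy check: the identity rotation keeps the anisotropic diagonal, so F > 0
lam = np.array([4.0, 2.0, 1.0, 0.5])
Lam, a = np.diag(lam), lam.mean()
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
print(objective(np.eye(4), Lam, a) > 0)   # True: diagonal still equals lambda_k
print(objective(Q, Lam, a))               # generally > 0 for a random rotation
```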