Support Matrix Machines LuoLuo RICKY @SJTU.EDU.CN Yubo Xie YUBOTSE@SJTU.EDU.CN Department of Computer Science and Engineering.Shanghai Jiao Tong University,Shanghai.China Zhihua Zhang ZHIHUA @SJTU.EDU.CN Institute of Data Science,Department of Computer Science and Engineering,Shanghai Jiao Tong University,China Wu-Jun Li LIWUJUN@NJU.EDU.CN National Key Laboratory for Novel Software Technology,Collaborative Innovation Center of Novel Software Technology and Industrialization,Department of Computer Science and Technology,Nanjing University,China Abstract the case that input samples are represented as vectors or s- In many classification problems such as elec- calars.However,it is also often met that input samples are troencephalogram(EEG)classification and im- naturally represented as two-dimensional matrices or ten- age classification,the input features are naturally sors.When using classical classification methods to data of represented as matrices rather than vectors or s- matrix representation,we have to reshape the input matri- calars.In general,the structure information of ces into vectors.However,this would destroy the structure the original feature matrix is useful and informa- information of the data matrix,e.g.,the correlation of dif- tive for data analysis tasks such as classification. ferent channels of electroencephalogram(EEG)data(Zhou One typical structure information is the correla- Li,2014)and the spatial relationship of the nearby pixels tion between columns or rows in the feature ma- of image data (Wolf et al.,2007).Moreover,if the data ma- trix.To leverage this kind of structure informa- trix is stacked (reshaped)into a vector,the dimensionality tion,we propose a new classification method that of the resulting vector typically becomes very high,which we call support matrix machine(SMM).Specif- in turn leads to the curse of dimensionality. ically,SMM is defined as a hinge loss plus a There has been some work on classification methods which so-called spectral elastic net penalty which is a attempt to exploit the correlation between the columns or spectral extension of the conventional elastic net rows of the data matrix.Usually,such a classification over a matrix.The spectral elastic net enjoys a method introduces a matrix of regression coefficients to property of grouping effect,i.e.,strongly corre- leverage the correlation within the data matrix.For ex- lated columns or rows tend to be selected alto- ample,Wolf et al.(2007)proposed a rank-k SVM,which gether or not.Since the optimization problem models the regression matrix as the sum of k rank-one or- for SMM is convex,this encourages us to de- thogonal matrices.Pirsiavash et al.(2009)devised a bilin- vise an alternating direction method of multipli- ear SVM by factorizing the regression matrix into two low- ers(ADMM)algorithm for solving the problem. rank matrices.Cai et al.(2006)proposed a similar bilinear Experimental results on EEG and image classifi- framework called support tensor machines for text catego- cation data show that our model is more robust rization.These methods essentially take advantage of the and efficient than the state-of-the-art methods. low-rank assumption,which can be used for describing the correlation within a matrix.However,their treatments re- sult in non-convex optimization problems. 1.Introduction In this paper we are also concerned with the classification Classical classification methods such as support vector ma- problems on a set of data matrices.Our work is motivat- chines (SVMs)(Cortes Vapnik.1995)and logistic re- ed by the use of the nuclear norm (a.k.a.,the trace nor- gression (Hastie et al.,2001)have been originally built on m)in low-rank matrix approximation (Srebro Shraib- Proceedings of the 32n4 International Conference on Machine man,2005),matrix completion(Candes Recht,2009:Li- Learning,Lille,France,2015.JMLR:W&CP volume 37.Copy- u et al.,2013;Salakhutdinov Srebro,2010;Huang et al., right 2015 by the author(s). 2013),and multi-task learning problems(Pong et al.,2010;
Support Matrix Machines Luo Luo RICKY@SJTU.EDU.CN Yubo Xie YUBOTSE@SJTU.EDU.CN Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China Zhihua Zhang ZHIHUA@SJTU.EDU.CN Institute of Data Science, Department of Computer Science and Engineering, Shanghai Jiao Tong University, China Wu-Jun Li LIWUJUN@NJU.EDU.CN National Key Laboratory for Novel Software Technology, Collaborative Innovation Center of Novel Software Technology and Industrialization, Department of Computer Science and Technology, Nanjing University, China Abstract In many classification problems such as electroencephalogram (EEG) classification and image classification, the input features are naturally represented as matrices rather than vectors or scalars. In general, the structure information of the original feature matrix is useful and informative for data analysis tasks such as classification. One typical structure information is the correlation between columns or rows in the feature matrix. To leverage this kind of structure information, we propose a new classification method that we call support matrix machine (SMM). Specifically, SMM is defined as a hinge loss plus a so-called spectral elastic net penalty which is a spectral extension of the conventional elastic net over a matrix. The spectral elastic net enjoys a property of grouping effect, i.e., strongly correlated columns or rows tend to be selected altogether or not. Since the optimization problem for SMM is convex, this encourages us to devise an alternating direction method of multipliers (ADMM) algorithm for solving the problem. Experimental results on EEG and image classifi- cation data show that our model is more robust and efficient than the state-of-the-art methods. 1. Introduction Classical classification methods such as support vector machines (SVMs) (Cortes & Vapnik, 1995) and logistic regression (Hastie et al., 2001) have been originally built on Proceedings of the 32 nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s). the case that input samples are represented as vectors or scalars. However, it is also often met that input samples are naturally represented as two-dimensional matrices or tensors. When using classical classification methods to data of matrix representation, we have to reshape the input matrices into vectors. However, this would destroy the structure information of the data matrix, e.g., the correlation of different channels of electroencephalogram (EEG) data (Zhou & Li, 2014) and the spatial relationship of the nearby pixels of image data (Wolf et al., 2007). Moreover, if the data matrix is stacked (reshaped) into a vector, the dimensionality of the resulting vector typically becomes very high, which in turn leads to the curse of dimensionality. There has been some work on classification methods which attempt to exploit the correlation between the columns or rows of the data matrix. Usually, such a classification method introduces a matrix of regression coefficients to leverage the correlation within the data matrix. For example, Wolf et al. (2007) proposed a rank-k SVM, which models the regression matrix as the sum of k rank-one orthogonal matrices. Pirsiavash et al. (2009) devised a bilinear SVM by factorizing the regression matrix into two lowrank matrices. Cai et al. (2006) proposed a similar bilinear framework called support tensor machines for text categorization. These methods essentially take advantage of the low-rank assumption, which can be used for describing the correlation within a matrix. However, their treatments result in non-convex optimization problems. In this paper we are also concerned with the classification problems on a set of data matrices. Our work is motivated by the use of the nuclear norm (a.k.a., the trace norm) in low-rank matrix approximation (Srebro & Shraibman, 2005), matrix completion (Candes & Recht ` , 2009; Liu et al., 2013; Salakhutdinov & Srebro, 2010; Huang et al., 2013), and multi-task learning problems (Pong et al., 2010;
Support Matrix Machines Harchaoui et al..2012:Kang et al..2011).The cornerstone show that our proposed training algorithm for SMM is effi- of these methods is to use the nuclear norm of a matrix as cient. a convex alternative of the matrix rank.Since the nuclear The remainder of the paper is organized as follows.In Sec- norm is the best convex approximation of the matrix rank tion 2,we give the notation and preliminaries.In Section 3, over the unit ball of matrices.this makes it more tractable to solve the resulting optimization problem.Moreover,some we review our concerned problem.In Section 4 we present nice properties such as the consistency results of the nucle- our model and the learning algorithm.In Section 5,we con- duct experimental analysis to justify our methods.Finally. ar norm minimization have been studied by Bach(2008) we conclude our work in Section 6. There has been some work which applies the nuclear norm penalization with least square loss function to matrix re- gression problems(Hunyadi et al.,2012;Signoretto et al., 2.Notation and Preliminaries 2014).Recently,Zhou Li(2014)applied the nuclear nor- In this section we give the notation and preliminaries which m penalization to matrix regression problems based on gen- will be used in this paper.We let Ip denote the pxp identity eralized linear models (GLMs). matrix.For a vector a E RP,the Euclidean norm is denot- In this paper we propose a new model to address the matrix ed asa=V∑-1a.For a matrix A∈Rpxa of rank classification problem.Our model includes two principal r where r min(p,q),we let the condensed singular val- ingredients.First,we consider the hinge loss due to its ue decomposition (SVD)of A be A =UADAV where widely deployed ability in sparseness and robustness mod- UA∈Rpxr and VA∈Raxr satisfy UTUA=I,and eling.Second,we employ a spectral elastic net penalty for VVA=Ir,and∑A=diag(o1(A),…,o,(A)with the regression matrix.The spectral elastic net is a spec- o1(A)≥·≥o(A)>0.Obviously,the rank of A is tral extension of the conventional elastic net over a matrix. equal to the number of nonzero singular values of A.Ad- In parallel to the conventional elastic net (Zou Hastie, ditionally,.we let AlF=√∑jA号=V∑=1oa(AP 2005)which is the combination of the ridge penalty and lasso penalty,our spectral elastic net is the combination of be the Frobenius norm,IAll.==(A)be the nu- the squared Frobenius matrix norm and nuclear norm.We clear norm,and All2 =1(A)be the spectral norm. prove that the spectral elastic net enjoys the property of For any T 0,we let D[A]=UAS,[EA]V where grouping effect which is similar to the conventional elastic S[>]=diag([1(A)]+;..,[o(A)]+)and net,while keeping a low-rank representation.We show that [+max(2,0).In the literature (Candes Recht,2009; the regression matrix in our model is indeed a combination Cai et al.,2010),D [A]is called the singular value thresh- of a set of support matrices.We thus refer to our model as olding (SVT)operator. a support matrix machine (SMM). It is well known that the nuclear norm A,as a function The optimization problem for SMM is convex but the hinge from RPx4 to R,is not differentiable.Alternatively,one loss function is not smooth.Fortunately,we can resort to considers the subdifferential of All,which is the set of an alternating direction method of multipliers (ADMM)s- subgradients and denoted by lAl.It follows from the tudied in (Goldstein et al.,2012).Specifically,we develop literature (Candes Recht,2009;Lewis,2003;Watson, an iteration algorithm,which is mainly built on ADMM 1992)that for a p x g real matrix A of rank r, and a singular value thresholding(SVT)operator(Candes Recht.2009:Cai et al.,2010).The algorithm converges OAll.=UAVA+Z:ZE RPX9,UTZ =0. quickly and promises to get the global optimal solution.It is worth pointing out that the algorithm requires repeated- ZVA=0,IZl2≤1}() ly computing singular value decomposition(SVD)of ma- trices with the same size as the matrix of input features 3.Problem Formulation and Related Work However,when represented as a matrix,the size of an in- put matrix is usually not too large. In this paper we study a regularized matrix classifier.We are given a set of training samplesT=[i where Finally,we apply our SMM to EEG and image classifica- Xi E RPX9 is the ith input sample and yi E{-1,1}is its tion problems.We see that classification methods direct- corresponding class label.As we have seen,Xi is repre- ly working on data matrices outperform those on vectors sented in matrix form.To fit a classifier,a commonly used such as the conventional SVM.When data are contaminat- approach is to stack Xi into a vector.Let xi vec(X)= ed with non-Gaussian noise or outliers,our SMM has sig- ((XXhg[X21,[Xlg)T∈R四.The soft nificant improvements over baselines.This implies that S- margin SVM is defined as MM is robust and has potential applications in matrix clas- sification problems with noises.Moreover,the experiments 2w2w+C∑1-h(w7x+b+, (2) i=1
Support Matrix Machines Harchaoui et al., 2012; Kang et al., 2011). The cornerstone of these methods is to use the nuclear norm of a matrix as a convex alternative of the matrix rank. Since the nuclear norm is the best convex approximation of the matrix rank over the unit ball of matrices, this makes it more tractable to solve the resulting optimization problem. Moreover, some nice properties such as the consistency results of the nuclear norm minimization have been studied by Bach (2008). There has been some work which applies the nuclear norm penalization with least square loss function to matrix regression problems (Hunyadi et al., 2012; Signoretto et al., 2014). Recently, Zhou & Li (2014) applied the nuclear norm penalization to matrix regression problems based on generalized linear models (GLMs). In this paper we propose a new model to address the matrix classification problem. Our model includes two principal ingredients. First, we consider the hinge loss due to its widely deployed ability in sparseness and robustness modeling. Second, we employ a spectral elastic net penalty for the regression matrix. The spectral elastic net is a spectral extension of the conventional elastic net over a matrix. In parallel to the conventional elastic net (Zou & Hastie, 2005) which is the combination of the ridge penalty and lasso penalty, our spectral elastic net is the combination of the squared Frobenius matrix norm and nuclear norm. We prove that the spectral elastic net enjoys the property of grouping effect which is similar to the conventional elastic net, while keeping a low-rank representation. We show that the regression matrix in our model is indeed a combination of a set of support matrices. We thus refer to our model as a support matrix machine (SMM). The optimization problem for SMM is convex but the hinge loss function is not smooth. Fortunately, we can resort to an alternating direction method of multipliers (ADMM) studied in (Goldstein et al., 2012). Specifically, we develop an iteration algorithm, which is mainly built on ADMM and a singular value thresholding (SVT) operator (Candes` & Recht, 2009; Cai et al., 2010). The algorithm converges quickly and promises to get the global optimal solution. It is worth pointing out that the algorithm requires repeatedly computing singular value decomposition (SVD) of matrices with the same size as the matrix of input features. However, when represented as a matrix, the size of an input matrix is usually not too large. Finally, we apply our SMM to EEG and image classification problems. We see that classification methods directly working on data matrices outperform those on vectors such as the conventional SVM. When data are contaminated with non-Gaussian noise or outliers, our SMM has significant improvements over baselines. This implies that SMM is robust and has potential applications in matrix classification problems with noises. Moreover, the experiments show that our proposed training algorithm for SMM is effi- cient. The remainder of the paper is organized as follows. In Section 2, we give the notation and preliminaries. In Section 3, we review our concerned problem. In Section 4 we present our model and the learning algorithm. In Section 5, we conduct experimental analysis to justify our methods. Finally, we conclude our work in Section 6. 2. Notation and Preliminaries In this section we give the notation and preliminaries which will be used in this paper. We let Ip denote the p×p identity matrix. For a vector a ∈ R p , the Euclidean norm is denoted as ||a|| = pPp i=1 a 2 i . For a matrix A ∈ R p×q of rank r where r ≤ min(p, q), we let the condensed singular value decomposition (SVD) of A be A = UAΣAVT A where UA ∈ R p×r and VA ∈ R q×r satisfy UT AUA = Ir and VT AVA = Ir, and ΣA = diag(σ1(A), · · · , σr(A)) with σ1(A) ≥ · · · ≥ σr(A) > 0. Obviously, the rank of A is equal to the number of nonzero singular values of A. Additionally, we let kAkF = qP i,j A2 ij = pPr i=1 σi(A) 2 be the Frobenius norm, kAk∗ = Pr i=1 σi(A) be the nuclear norm, and kAk2 = σ1(A) be the spectral norm. For any τ > 0, we let Dτ [A] = UASτ [ΣA]VT A where Sτ [Σ] = diag([σ1(A) − τ ]+, · · · , [σr(A) − τ ]+) and [z]+ = max(z, 0). In the literature (Candes & Recht ` , 2009; Cai et al., 2010), Dτ [A] is called the singular value thresholding (SVT) operator. It is well known that the nuclear norm kAk∗, as a function from R p×q to R, is not differentiable. Alternatively, one considers the subdifferential of kAk∗, which is the set of subgradients and denoted by ∂kAk∗. It follows from the literature (Candes & Recht ` , 2009; Lewis, 2003; Watson, 1992) that for a p × q real matrix A of rank r, ∂kAk∗ = n UAVT A + Z : Z ∈ R p×q , UT AZ = 0, ZVA = 0, kZk2 ≤ 1 o . (1) 3. Problem Formulation and Related Work In this paper we study a regularized matrix classifier. We are given a set of training samples T = Xi , yi} n i=1, where Xi ∈ R p×q is the ith input sample and yi ∈ {−1, 1} is its corresponding class label. As we have seen, Xi is represented in matrix form. To fit a classifier, a commonly used approach is to stack Xi into a vector. Let xi , vec(XT i ) = ([Xi ]11, . . . , [Xi ] 1q , [Xi ] 21, . . . , [Xi ]pq) T ∈ R pq. The soft margin SVM is defined as min w,b 1 2 wT w + C Xn i=1 [1 − yi(wT xi + b)]+, (2)
Support Matrix Machines where [1-ul+is called the hinge loss function,w E RPa is the singular values of W.Typically,P(W)=Wl.for a vector of regression coefficients,b ER is an offset term, A>0 because the nuclear norm Wl,is the best convex and C is a regularization parameter. approximation of rank(W)over the unit ball of matrices. When reshaped into vector vec(XT),the correlation a- Assuming that the loss function.is smooth and its deriva- mong columns or rows in the matrix is ignored.Howev- tive has the Lipschitz continuity,Zhou Li (2014)devised er,it would be more reasonable to exploit the correlation the Nesterov method (Nesterov,1983)to solve (5). information in developing a classifier,because the corre- lation is helpful and useful in improving the classification 4.The Support Matrix Machine performance It is well known that the hinge loss enjoys the large margin Intuitively,we consider the following formulation: principle.Moreover,it embodies sparseness and robust- ness,which are two desirable properties for a good classifi- tr(wW)+C∑1-hr(wx,)+}+,(3) er.This thus motivates us to employ the hinge loss function in(5)instead.In particular,we present the following for- i✉1 mulation: where W Rpxa is the matrix of regression co- efficients.However.this formulation is essentially e- argmin w.b 2tr(w"w)+rllwll. quivalent to Problem (2)when w vec(WT),be- cause tr(WTXi)=vec(WT)Tvec(XT)=wTxi and tr(WTw)=vec(WT)Tvec(WT)=wTw.This im- +C {1-[tr(WrX)+}+,(6 plies that the formulation in(3)cannot directly address our concern. which defines a matrix classification model that we cal- To capture the correlation,a natural approach is to consider I the support matrix machine (SMM).Recall that the hinge the dependency of the regression matrix W.In particular, loss is not smooth.so our model is not a trivial variant of one can impose a low-rank constraint on W to leverage the regularized GLM.On the other hand,SMM is in fact the structure information within Xi.For example,in the based on a penalty function,which is the combination of bilinear SVM(Pirsiavash et al.,2009),the authors assumed the squared Frobenius norm W and the nuclear nor- that W=WWT where W∈Rqxa,Wy∈Rpxd and m Wll..We call this penalty the spectral elastic net d<min(p,q).Accordingly,they defined the following because tr(WTW)=W((W)and problem without the offset term Wll.-(W).As we see.the spectral elas- tic net is parallel to the elastic net of Zou Hastie(2005). argmin Again recall that tr(WTW)=vec(WT)Tvec(WT)and Wz.Wv 2tr(W:Wgw,w图) tr(WTXi)=vec(WT)Tvec(XT).so SMM degenerates CI1-tr(wTXW.】+· (4) to the conventional linear SVM when T=0.However,the i1 nuclear norm can not be equivalently defined as a vector norm.This implies that we cannot formulate Problem(6) However,the resulting problem is nonconvex in both W in an equivalent of the vector form.Thus,our SMM is able and W.Thus,the authors resorted to an alternately itera- to capture the correlation within the input data matrix. tive update scheme for W and Wy;that is,they updated either of Wr and Wy while keeping the other fixed. 4.1.Theoretical Justification Since the dependency of W can be revealed by its rank We now show that SMM possesses some elegant benefits rank(W),it is also natural to directly impose the rank con- from the conventional SVM(Cortes Vapnik,1995)as straint to W.However,the resulting matrix rank minimiza- well as the conventional elastic net (Zou Hastie,2005). tion is usually NP-hard (Vandenberghe Boyd,1996). Without loss of generality,we suppose that each feature of Zhou Li (2014)suggested a function of the singular val- the training data is normalized to have unit length.That is ues of W as an alternative penalization technique.Based it holds that lfl 1 where f ([Xilkt,...,[Xnlk)T on this idea,they proposed a regularized GLM(R-GLM): for k 1,...,p and l=1,...,q. Theorem 1.Suppose the minimizer of Problem (6)is argmin J(W)+P(W), (5) (W,b).Then W where /(W)is a loss function obtained from the negative log-likelihood and P(W)is a penalty function defined on
Support Matrix Machines where [1−u]+ is called the hinge loss function, w ∈ R pq is a vector of regression coefficients, b ∈ R is an offset term, and C is a regularization parameter. When reshaped into vector vec(XT i ), the correlation among columns or rows in the matrix is ignored. However, it would be more reasonable to exploit the correlation information in developing a classifier, because the correlation is helpful and useful in improving the classification performance. Intuitively, we consider the following formulation: min W,b 1 2 tr(WTW)+C Xn i=1 {1−yi [tr(WT Xi)+b]}+, (3) where W ∈ R p×q is the matrix of regression coefficients. However, this formulation is essentially equivalent to Problem (2) when w = vec(WT ), because tr(WT Xi) = vec(WT ) T vec(XT i ) = wT xi and tr(WTW) = vec(WT ) T vec(WT ) = wT w. This implies that the formulation in (3) cannot directly address our concern. To capture the correlation, a natural approach is to consider the dependency of the regression matrix W. In particular, one can impose a low-rank constraint on W to leverage the structure information within Xi . For example, in the bilinear SVM (Pirsiavash et al., 2009), the authors assumed that W = WyWT x where Wx ∈ R q×d , Wy ∈ R p×d and d < min(p, q). Accordingly, they defined the following problem without the offset term argmin Wx,Wy 1 2 tr(WxWT y WyWT x ) +C Xn i=1 [1 − yitr(WT y XiWx)]+. (4) However, the resulting problem is nonconvex in both Wx and Wy. Thus, the authors resorted to an alternately iterative update scheme for Wx and Wy; that is, they updated either of Wx and Wy while keeping the other fixed. Since the dependency of W can be revealed by its rank rank(W), it is also natural to directly impose the rank constraint to W. However, the resulting matrix rank minimization is usually NP-hard (Vandenberghe & Boyd, 1996). Zhou & Li (2014) suggested a function of the singular values of W as an alternative penalization technique. Based on this idea, they proposed a regularized GLM (R-GLM): argmin W J(W) + P(W), (5) where J(W) is a loss function obtained from the negative log-likelihood and P(W) is a penalty function defined on the singular values of W. Typically, P(W) = λkWk∗ for λ > 0 because the nuclear norm kWk∗ is the best convex approximation of rank(W) over the unit ball of matrices. Assuming that the loss function J is smooth and its derivative has the Lipschitz continuity, Zhou & Li (2014) devised the Nesterov method (Nesterov, 1983) to solve (5). 4. The Support Matrix Machine It is well known that the hinge loss enjoys the large margin principle. Moreover, it embodies sparseness and robustness, which are two desirable properties for a good classifi- er. This thus motivates us to employ the hinge loss function in (5) instead. In particular, we present the following formulation: argmin W,b 1 2 tr(WTW) + τ ||W||∗ +C Xn i=1 {1−yi [tr(WT Xi)+b]}+, (6) which defines a matrix classification model that we call the support matrix machine (SMM). Recall that the hinge loss is not smooth, so our model is not a trivial variant of the regularized GLM. On the other hand, SMM is in fact based on a penalty function, which is the combination of the squared Frobenius norm kWk 2 F and the nuclear norm kWk∗. We call this penalty the spectral elastic net because tr(WTW) = kWk 2 F = Pmin(p,q) i=1 σ 2 i (W) and kWk∗ = Pmin(q,p) i=1 σi(W). As we see, the spectral elastic net is parallel to the elastic net of Zou & Hastie (2005). Again recall that tr(WTW) = vec(WT ) T vec(WT ) and tr(WT Xi) = vec(WT ) T vec(XT i ), so SMM degenerates to the conventional linear SVM when τ = 0. However, the nuclear norm can not be equivalently defined as a vector norm. This implies that we cannot formulate Problem (6) in an equivalent of the vector form. Thus, our SMM is able to capture the correlation within the input data matrix. 4.1. Theoretical Justification We now show that SMM possesses some elegant benefits from the conventional SVM (Cortes & Vapnik, 1995) as well as the conventional elastic net (Zou & Hastie, 2005). Without loss of generality, we suppose that each feature of the training data is normalized to have unit length. That is, it holds that kfklk = 1 where fkl , ([X1]kl, . . . , [Xn]kl) T for k = 1, . . . , p and l = 1, . . . , q. Theorem 1. Suppose the minimizer of Problem (6) is (W˜ , ˜b). Then W˜ = Dτ Xn i=1 β˜ iyiXi
Support Matrix Machines whee0≤B:≤C. and G(S)=TlISIl.. Denote=∑是1aaXi.We see that is the combi-- ADMM solves (7)by using the augmented Lagrangian nation of those Xi associated with nonzero B,while W is function: the SVT of 1.In fact,we will see from Eqn.(12)in The- L1(W,b,S,A)=H(W,b)+G(S)+tr[AT(S-W)] orem 5 that W is indeed the linear combination of a set of support matrices {Xi}. +1s-wIe。 Lemma 1.The difference between [k and [Slkalz meets the following inequality where p>0 is a hyperparameter. (]kl4-k22)2≤2nC2(1-ff,f2l2): The ADMM learning procedure for our SMM is summa- rized in Algorithm 1.The key steps of Algorithm 1 are the Recall thatf=1 and fa=1,so f fkal computations of S(k)and (W(k),()).the derivation of [-1,1]is the correlation coefficient of input features at po- which is based on Theorems 3 and 4 below. sitions (1,1)and (k2,2)over the n training samples. Theorem 3.For positive numbers T and p,let the matrix Lemma 1 says that n has the element-wise grouping ef- pW-A have SVD of the form: fect.Specifically,higher(lower)correlation between two pW-A=Uo∑oV6+U11Vf, (8) elements of leads to smaller (larger)difference.Based on this lemma,we also have the following theorem. where So is the diagonal matrix whose diagonal entries Theorem 2.Let [W]:be the lth column of W.Then are greater than T,Uo and Vo are matrices of the corre- sponding left and right singular vectors:U and V lw:h-w:2≤2nc2(p-∑项,f) correspond the rest parts of the SVD whose singular values are less than or equal to T.Define Especially.if fk1=fkl2 for any k=l,,卫,then S*D-(pW-A)=Uo(Zo-rI)V6. W].=[W].2 0 Theorem 2 reflects the relationship of W with the train- Then we have 0 E 8 G1(S*).where G1(S)=G(S)+ ing input matrices Xi.Interestingly,the columns of the t(As)+Iw-s服. regression matrix W have grouping effect in our model if the corresponding features have strong correlation.The Since Gi is convex with respect to S,we obtain the update similar conclusion also applies to the rows of W.Note equation of S from Theorem 3.That is, that Theorem 2 can not be extended to the element-wise case,because even if ft=f,we cannot obtain that S(k+1)=argminG(W(),S,()=-D-(pW()-A()). [W=[W We will present an empirical illustra- tion for the grouping effect problem in Section 5.1. Theorem 4.One of the solution of the following problem 4.2.Learning Algorithm argmin H(W,b)-tr(ATW)+W-sl (W,b) The Nesterov method for R-GLM(Zhou Li,2014)re- is quires the derivative of the loss function in question to be Lipschitz-continuous.However,both the hinge loss and the w-(+inx) (10 nuclear norm are not smooth.Thus,it is hard to develop the =1 Nesterov method for finding the SMM solution.Since the objective function of SMM is convex in both W and b,we 网{-(wPx} here derive a learning algorithm based on ADMM with the restart rule(Goldstein et al.,2012).The problem in (6)can where S*=i:0<a;<C),and a*E Rm is the solu- be equivalently written as follows: tion of the following box constraint quadratic programming argmin H(W,b)+G(S), (7) problem: W.b.S s.t. S-W=0, argmax -aTKa+qTa. (11) a where s.t. 0≤a≤C1n, H(W.6)=ztr(WTW)+C>(1-(WTX)+) ∑ah=0 =
Support Matrix Machines where 0 ≤ β˜ i ≤ C. Denote Ω = Pn i=1 β˜ iyiXi . We see that Ω is the combination of those Xi associated with nonzero β˜ i , while W˜ is the SVT of Ω. In fact, we will see from Eqn. (12) in Theorem 5 that W˜ is indeed the linear combination of a set of support matrices {Xi}. Lemma 1. The difference between [Ω]k1l1 and [Ω]k2l2 meets the following inequality ([Ω]k1l1 − [Ω]k2l2 ) 2 ≤ 2nC2 (1 − f T k1l1 fk2l2 ). Recall that kfk1l1 k = 1 and kfk2l2 k = 1, so f T k1l1 fk2l2 ∈ [−1, 1] is the correlation coefficient of input features at positions (k1, l1) and (k2, l2) over the n training samples. Lemma 1 says that Ω has the element-wise grouping effect. Specifically, higher (lower) correlation between two elements of Ω leads to smaller (larger) difference. Based on this lemma, we also have the following theorem. Theorem 2. Let [W˜ ]:,l be the lth column of W˜ . Then [W˜ ]:,l1 − [W˜ ]:,l2 2 ≤ 2nC2 p − Xp k=1 f T kl1 fkl2 . Especially, if fkl1 = fkl2 for any k = 1, . . . , p, then [W˜ ]:,l1 = [W˜ ]:,l2 . Theorem 2 reflects the relationship of W˜ with the training input matrices Xi . Interestingly, the columns of the regression matrix W˜ have grouping effect in our model if the corresponding features have strong correlation. The similar conclusion also applies to the rows of W˜ . Note that Theorem 2 can not be extended to the element-wise case, because even if fk1l1 = fk2l2 , we cannot obtain that [W˜ ]k1l1 = [W˜ ]k2l2 . We will present an empirical illustration for the grouping effect problem in Section 5.1. 4.2. Learning Algorithm The Nesterov method for R-GLM (Zhou & Li, 2014) requires the derivative of the loss function in question to be Lipschitz-continuous. However, both the hinge loss and the nuclear norm are not smooth. Thus, it is hard to develop the Nesterov method for finding the SMM solution. Since the objective function of SMM is convex in both W and b, we here derive a learning algorithm based on ADMM with the restart rule (Goldstein et al., 2012). The problem in (6) can be equivalently written as follows: argmin W,b,S H(W, b) + G(S), (7) s.t. S − W = 0, where H(W, b) = 1 2 tr(WTW) + C Xn i=1 {1−yi [tr(WT Xi)+b]}+ and G(S) = τ ||S||∗. ADMM solves (7) by using the augmented Lagrangian function: L1(W, b, S, Λ) =H(W, b) + G(S) + tr[Λ T (S − W)] + ρ 2 ||S − W||2 F , where ρ > 0 is a hyperparameter. The ADMM learning procedure for our SMM is summarized in Algorithm 1. The key steps of Algorithm 1 are the computations of S (k) and (Wc(k) , b(k) ), the derivation of which is based on Theorems 3 and 4 below. Theorem 3. For positive numbers τ and ρ, let the matrix ρW − Λ have SVD of the form: ρW − Λ = U0Σ0VT 0 + U1Σ1VT 1 , (8) where Σ0 is the diagonal matrix whose diagonal entries are greater than τ , U0 and V0 are matrices of the corresponding left and right singular vectors; Σ1, U1 and V1 correspond the rest parts of the SVD whose singular values are less than or equal to τ . Define S ∗ , 1 ρ Dτ (ρW − Λ) = 1 ρ U0(Σ0 − τ I)VT 0 . (9) Then we have 0 ∈ ∂ G1(S ∗ ), where G1(S) = G(S) + tr(ΛT S) + ρ 2 ||W − S||2 F . Since G1 is convex with respect to S, we obtain the update equation of S from Theorem 3. That is, S (k+1) = argmin S G(W(k) , S, Λb(k) ) = 1 ρ Dτ (ρW(k)−Λb(k) ). Theorem 4. One of the solution of the following problem argmin (W,b) H(W, b) − tr(Λ TW) + ρ 2 ||W−S||2 F is W∗ = 1 ρ + 1 Λ + ρS + Xn i=1 α ∗ i yiXi , (10) b ∗ = 1 |S∗| X i∈S∗ n yi − tr[(W∗ ) T Xi ] o , where S ∗ = {i : 0 < α∗ i < C}, and α∗ ∈ R n is the solution of the following box constraint quadratic programming problem: argmax α − 1 2 α T Kα + q T α, (11) s.t. 0 ≤ α ≤ C1n, Xn i=1 αiyi = 0
Support Matrix Machines Algorithm 1 ADMM for SMM 5.Experiments Initialize S(-1)=S(0)E RPX4,A(-1)=(0)E RPX9,p> 0,t)=1,n∈(0,1) In this section we conduct the experimental analysis of our for =1,2,3...do proposed SMM.We first analyze the group effect proper- (W(k),b())=argmin H(W,6)-tr(A(k)TW)+ ty of SMM.Then we study the classification performance (W,b) on synthetic and real-world data sets.All experiments are w-sI恨 implemented in Matlab R2011b on a workstation with Intel s)=argmin G(S)+r(Ars)+号Iw因-s到房 Xeon CPU X5675 3.06GHz(2 x 12 cores).24GB RAM. and 64bit Windows Server 2008 system. A(=)-pW)-S) c(k)=p-lA(k)(+pllS(e)-(k) 5.1.Group Effect Property on Synthetic Data if c(k)<nc(k-1)then tk+1)=1+V1+4)2 To intuitively demonstrate the grouping effect property de- s+)=s肉+品(S因-Sk-) scribed in Theorem 2.we design a simulated experimen- t to visualize it.We generate a synthetic data set of n +)=A内)+号(4--) samples as follows.First,we generate V orthogonal n- else k+1)=1 dimensional basis vectors v,v2,...,vv with the unit Eu- k+1)=sk-1) clidean length,respectively.Second,we construct pq fea- 人k+)=Ak-1) ture vectors of length n by the following process: e)=)1ck-1) end if fkl v[i/(0.2q)1+Ekl, end for ∈1aN0,1n, for k 1,...,p.and I 1,...,q.Here [l/(0.2g)]de- Here K=[Kl∈Rnxn and q∈R"are independent of notes the smallest integer no smaller than 1/(0.2g).In specifically, other words,the elements in each sample matrix Xi(i= 1,...,n)can be roughly partitioned into four groups(ver- K tr(XTX)】 p+1 tical blocks).The features within the same group have high correlation,while the features between different groups 9=1-U:tr[(A+pS)Tx;] have low correlation.Then we generate a p x q matrix W p+1 of rank 0.2g,and the label for each sample is generated by y=sign[tr(WTXi)].We set n 1000,V=4,p =80, By Theorem 4,updating W(k)and b(k)can be done by g=100 and 6=10-3 in this simulation. solving Problem (11).Several methods can be used,such as the sequential minimization optimization algorithm. The values of the regression matrix obtained respectively (Platt et al.,1998;Keerthi Gilbert,2002) from the bilinear SVM(B-SVM)(Pirsiavash et al.,2009), regularized GLM(R-GLM)(Zhou Li,2014)and SMM Theorem 5.Suppose the optimal solution of Problem (7) are shown in Figure 1.It is easy to see that the regression is (W,b,S).Then matrix of SMM is clearly partitioned into four pure color W=S=i+∑X (12) blocks,while the blocks of B-SVM have higher noise.R- 64>0 GLM fails to obtain the groups structure totally.The sim- ulation results show that SMM has a better grouping effect Theorem 5 can be obtained directly by Algorithm 1 through property than other baselines,which also implies that SM- Eqn.(10).Ifi>0,we call the corresponding Xi a sup- M is able to capture the structure information in the feature port matrix.Theorem 5 shows that the solution W of our matrices. SMM can be written as the linear combination of support matrices plus an offset.This is the reason that we call our 5.2.Classification Accuracy on Synthetic Data model the support matrix machine. We now conduct the performance of SMM on synthetic da- Since the hinge loss and nuclear norm are weakly convex, ta sets.We use the same data generation process as in the the convergence property of Algorithm 1 can be proved im- previous subsection,but we set V 10 and 6=10-2 to mediately based on the result in(Goldstein et al.,2012:He obtain more complicated data.We use 1000 samples for Yuan,2012).That is,we have training,and other 500 samples for testing.All the hyper- Theorem 6.For any p >0 and n E (0,1),the iteration parameters involved are selected via cross validation. sequence given by Algorithm I converges to the optimal IThe code is available in http://bcmi.s jtu.edu.cn/ solution of Problem(7). ~luoluo/code/smm.zip
Support Matrix Machines Algorithm 1 ADMM for SMM Initialize S (−1) = Sb(0) ∈ R p×q , Λ (−1) = Λb(0) ∈ R p×q , ρ > 0, t(1) = 1, η ∈ (0, 1). for k = 1, 2, 3 . . . do (W(k) , b(k) ) = argmin (W,b) H(W, b) − tr(Λb(k)TW) + ρ 2 ||W−Sb(k) ||2 F S (k) = argmin S G(S) + tr(Λb(k)T S) + ρ 2 ||W(k) − S||2 F Λ (k) = Λb(k) − ρ(W(k) − S (k) ) c (k) = ρ −1 ||Λ (k) − Λb(k) ||2 F + ρ||S (k) − Sb(k) ||2 F if c (k) < ηc(k−1) then t (k+1) = 1+√ 1+4t (k)2 2 Sb(k+1) = S (k) + t (k)−1 t (k+1) (S (k) − S (k−1)) Λb(k+1) = Λ (k) + t (k)−1 t (k+1) (Λ (k) − Λ (k−1)) else t (k+1) = 1 Sb(k+1) = S (k−1) Λb(k+1) = Λ (k−1) c (k) = η −1 c (k−1) end if end for Here K = [Kij ] ∈ R n×n and q ∈ R n are independent of α; specifically, Kij = yiyj tr(XT i Xj ) ρ + 1 , qi = 1 − yitr[(Λ + ρS) T Xi ] ρ + 1 . By Theorem 4, updating W(k) and b (k) can be done by solving Problem (11). Several methods can be used, such as the sequential minimization optimization algorithm. (Platt et al., 1998; Keerthi & Gilbert, 2002) Theorem 5. Suppose the optimal solution of Problem (7) is (W˜ , ˜b, S˜). Then W˜ = S˜ = Λ˜ + X α˜i>0 α˜iyiXi . (12) Theorem 5 can be obtained directly by Algorithm 1 through Eqn. (10). If α˜i > 0, we call the corresponding Xi a support matrix. Theorem 5 shows that the solution W˜ of our SMM can be written as the linear combination of support matrices plus an offset. This is the reason that we call our model the support matrix machine. Since the hinge loss and nuclear norm are weakly convex, the convergence property of Algorithm 1 can be proved immediately based on the result in (Goldstein et al., 2012; He & Yuan, 2012). That is, we have Theorem 6. For any ρ > 0 and η ∈ (0, 1), the iteration sequence given by Algorithm 1 converges to the optimal solution of Problem (7). 5. Experiments In this section we conduct the experimental analysis of our proposed SMM 1 . We first analyze the group effect property of SMM. Then we study the classification performance on synthetic and real-world data sets. All experiments are implemented in Matlab R2011b on a workstation with Intel Xeon CPU X5675 3.06GHz (2 × 12 cores), 24GB RAM, and 64bit Windows Server 2008 system. 5.1. Group Effect Property on Synthetic Data To intuitively demonstrate the grouping effect property described in Theorem 2, we design a simulated experiment to visualize it. We generate a synthetic data set of n samples as follows. First, we generate V orthogonal ndimensional basis vectors ν1, ν2, · · · , νV with the unit Euclidean length, respectively. Second, we construct pq feature vectors of length n by the following process: ˜fkl = νdl/(0.2q)e + kl, kl i.i.d ∼ N (0, δ2 In), for k = 1, . . . , p, and l = 1, . . . , q. Here dl/(0.2q)e denotes the smallest integer no smaller than l/(0.2q). In other words, the elements in each sample matrix Xi (i = 1, . . . , n) can be roughly partitioned into four groups (vertical blocks). The features within the same group have high correlation, while the features between different groups have low correlation. Then we generate a p × q matrix W of rank 0.2q, and the label for each sample is generated by yi = sign[tr(WT Xi)]. We set n = 1000, V = 4, p = 80, q = 100 and δ = 10−3 in this simulation. The values of the regression matrix obtained respectively from the bilinear SVM (B-SVM) (Pirsiavash et al., 2009), regularized GLM (R-GLM) (Zhou & Li, 2014) and SMM are shown in Figure 1. It is easy to see that the regression matrix of SMM is clearly partitioned into four pure color blocks, while the blocks of B-SVM have higher noise. RGLM fails to obtain the groups structure totally. The simulation results show that SMM has a better grouping effect property than other baselines, which also implies that SMM is able to capture the structure information in the feature matrices. 5.2. Classification Accuracy on Synthetic Data We now conduct the performance of SMM on synthetic data sets. We use the same data generation process as in the previous subsection, but we set V = 10 and δ = 10−2 to obtain more complicated data. We use 1000 samples for training, and other 500 samples for testing. All the hyperparameters involved are selected via cross validation. 1The code is available in http://bcmi.sjtu.edu.cn/ ˜luoluo/code/smm.zip