Kernel nonnegative matrix factorization for spectral EEG feature extraction

Hyekyoung Lee (a), Andrzej Cichocki (b), Seungjin Choi (a,*)

(a) Department of Computer Science, Pohang University of Science and Technology, San 31 Hyoja-dong, Nam-gu, Pohang 790-784, Republic of Korea
(b) Laboratory for Advanced Brain Signal Processing, Brain Science Institute, RIKEN, 2-1 Hirosawa, Wako-shi, Saitama 351-0198, Japan
(*) Corresponding author. Tel.: +82 54 279 2259; fax: +82 54 279 2299. E-mail address: seungjin@postech.ac.kr (S. Choi). URL: http://www.postech.ac.kr/~seungjin

Neurocomputing 72 (2009) 3182-3190. doi:10.1016/j.neucom.2009.03.005

Article history: Received 25 June 2008; received in revised form 18 February 2009; accepted 2 March 2009. Communicated by T. Heskes. Available online 1 April 2009.

Keywords: EEG classification; Feature extraction; Kernel methods; Multiplicative updates; Nonnegative matrix factorization

Abstract

Nonnegative matrix factorization (NMF) seeks a decomposition of a nonnegative matrix $X \geq 0$ into a product of two nonnegative factor matrices $U \geq 0$ and $V \geq 0$, such that a discrepancy between $X$ and $UV^\top$ is minimized. Assuming $U = XW$ in the decomposition (for $W \geq 0$), kernel NMF (KNMF) is easily derived in the framework of least squares optimization. In this paper we make use of KNMF to extract discriminative spectral features from the time-frequency representation of electroencephalogram (EEG) data, which is an important task in EEG classification. In particular, when KNMF with a linear kernel is used, spectral features are computed by a single matrix multiplication, whereas in standard NMF a multiplicative update must be performed repeatedly with the other factor matrix fixed, or the pseudo-inverse of a matrix is required. Moreover, in KNMF with a linear kernel one can easily perform feature selection or data selection because of its natural sparsity. Experiments on two EEG datasets from brain computer interface (BCI) competitions indicate the useful behavior of the proposed methods. (c) 2009 Elsevier B.V. All rights reserved.

1. Introduction

Nonnegative matrix factorization (NMF) is a multivariate analysis method that has proven useful in learning a faithful representation of nonnegative data such as images, spectrograms, and documents [16,9]. NMF seeks a decomposition of a nonnegative data matrix $X \geq 0$ into a product of two nonnegative factor matrices such that $X \approx UV^\top$ for $U \geq 0$ and $V \geq 0$. NMF allows only nonsubtractive combinations of nonnegative basis vectors to approximate the original nonnegative data, possibly providing a parts-based representation [9].

Electroencephalogram (EEG) is the most popular sensory signal used for brain computer interface (BCI). NMF was shown to be useful in determining discriminative basis vectors that reflect meaningful spectral characteristics, without cross-validation, in a motor imagery task [11]. Exemplary spectral characteristics of motor-related EEG are the $\mu$ rhythm (8-12 Hz) and the $\beta$ rhythm (18-25 Hz), which decrease during movement or in preparation for movement (event-related desynchronization, ERD) and increase after movement and in relaxation (event-related synchronization, ERS) [17]. ERD and ERS can serve as relevant features for motor imagery EEG classification. However, these phenomena may occur in different frequency bands depending on the subject, for instance in 16-20 Hz rather than in 8-12 Hz [8]. Thus, it is desirable to develop a data-driven feature extraction method that automatically determines discriminative spectral bands. To this end, NMF was proposed as a promising method and was proven to be a good solution [11,3].

Several interesting variations of NMF were recently proposed in [15], including convex-NMF, semi-NMF, and kernel NMF (KNMF).
These extensions were mainly discussed in the context of clustering, emphasizing their applicability to data matrices without sign restriction. Here, by contrast, we turn our attention to KNMF in the case of a nonnegative data matrix, which is easily developed by assuming $U = XW$ for $W \geq 0$ in the decomposition $X = UV^\top$. We derive a multiplicative updating algorithm for KNMF from the KKT conditions, as in our recent work [3], where the $\alpha$-divergence was used as an error measure for NMF.

In this paper we make use of KNMF to extract discriminative spectral features from the time-frequency representation of EEG data, which is an important task in EEG classification. The main contribution of this paper is to show how KNMF is applied to extract spectral EEG features, emphasizing its useful aspects, summarized below:

- When KNMF with a linear kernel is used, spectral features are computed by a single matrix multiplication, whereas in standard NMF a multiplicative update must be performed repeatedly with the other factor matrix fixed, or the pseudo-inverse of a matrix is required.
- KNMF with a linear kernel is a special case of convex-NMF [4], provided that each column of $W$ satisfies the sum-to-one constraint. We show that KNMF yields a sparser
representation than standard NMF in the case of spectral EEG data, confirming the result first studied in [4]. In addition, we show that feature selection or data selection can be easily performed in the case of KNMF with a linear kernel, because of this sparsity.

All these useful behaviors of KNMF are verified through experiments on two EEG datasets from BCI competitions.

The rest of this paper is organized as follows. The next section provides a brief overview of NMF, illustrating how spectral features are extracted from EEG data using NMF. Section 3 explains KNMF and its application to spectral EEG feature extraction. Experiments on two EEG datasets from BCI competitions are presented in Section 4. Finally, conclusions are drawn in Section 5.

2. NMF for spectral EEG feature extraction

We begin by illustrating how to construct a data matrix from EEG data.

2.1. Data matrix construction

We construct the data matrix $X \in \mathbb{R}^{n \times m}$ from the time-domain EEG signal such that each column vector of $X$ is a frequency profile over trials.

We denote by $\psi_t^{(k)}$ the time-domain EEG signal measured at the $k$th channel ($k = 1, \ldots, K$). We also denote by $P_{t,f}^{(k)}$ (time and frequency indices run over $t = 1, \ldots, n$ and $f = 1, \ldots, F$, respectively) the corresponding time-frequency representation of the EEG signal $\psi_t^{(k)}$, which is typically computed by the short-time Fourier transform (STFT) or a wavelet transform. The spectral matrix is then given by $P^{(k)} = [P_{t,f}^{(k)}] \in \mathbb{R}^{n \times F}$. In the case of EEG data with $N$ trials, the time index runs to $n = TN$, where $T$ is the number of samples in each trial.

We provide a brief illustration of how to compute $P_{t,f}^{(k)}$ using the complex Morlet wavelet transform, which is very popular in handling EEG data. The time-frequency representation of EEG data is computed by filtering it with complex Morlet wavelets, where the mother wavelet is given by

$$\Psi_0(\eta) = \pi^{-1/4} e^{i w_0 \eta} e^{-\eta^2/2}, \tag{1}$$

where $w_0$ is the characteristic eigenfrequency (generally taken to be 6). Scaling and temporal shifting of the mother wavelet leads to $\Psi_{t,d(f)}$, controlled by the factor $\eta = (\tau - t)/d(f)$, where

$$d(f) = \frac{w_0 + \sqrt{2 + w_0^2}}{4 \pi f}, \tag{2}$$

and $f$ is the main receptive frequency. The wavelet transform of $\psi_t^{(k)}$ at time $t$ and frequency $f$ is its convolution with the scaled and shifted wavelets. The amplitude of the wavelet transform, $P_{t,f}^{(k)}$, is given by

$$P_{t,f}^{(k)} = \left\| \psi_t^{(k)} * \Psi_{t,d(f)}(\tau) \right\|, \tag{3}$$

for $k = 1, \ldots, K$.

We then construct the data matrix $X$ by collecting the $n \times F$ spectral matrices computed at the $K$ different channels,

$$X = [P^{(1)}, P^{(2)}, \ldots, P^{(K)}] \in \mathbb{R}^{n \times m}, \tag{4}$$

where $m = KF$. Fig. 3 shows time-domain EEG signals (upper panel) measured at eight different channels ($C_3$, $C_z$, $C_4$, $CP_1$, $CP_2$, $P_3$, $P_z$, $P_4$) and their corresponding time-frequency representations (lower panel), where the horizontal axis represents frequency and the vertical axis is associated with time.
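To make the construction in (1)-(4) concrete, here is a minimal NumPy sketch for a toy signal. The function name `morlet_amplitude`, the sampling rate, the 4-standard-deviation wavelet support, and the channel count are illustrative assumptions, not taken from the paper's preprocessing pipeline.

```python
import numpy as np

def morlet_amplitude(signal, freqs, fs, w0=6.0):
    """Amplitude of the complex Morlet wavelet transform, Eqs. (1)-(3).

    signal : 1-D array, one channel's time-domain EEG
    freqs  : receptive frequencies f (Hz)
    fs     : sampling rate (Hz); returns the (n, F) spectral matrix P^(k).
    """
    n = len(signal)
    P = np.empty((n, len(freqs)))
    for j, f in enumerate(freqs):
        d = (w0 + np.sqrt(2.0 + w0 ** 2)) / (4.0 * np.pi * f)  # Eq. (2)
        half = int(np.ceil(4.0 * d * fs))   # truncate wavelet at +-4 std (our choice)
        tau = np.arange(-half, half + 1) / fs
        eta = tau / d                       # eta = (tau - t)/d(f); shift done by convolution
        psi = np.pi ** -0.25 * np.exp(1j * w0 * eta) * np.exp(-eta ** 2 / 2.0)  # Eq. (1)
        P[:, j] = np.abs(np.convolve(signal, psi, mode="same"))  # Eq. (3)
    return P

# toy example: K = 2 channels, F = 12 bands at 8, 10, ..., 30 Hz (as in Fig. 3)
fs, n = 128, 512
rng = np.random.default_rng(0)
channels = [rng.standard_normal(n) for _ in range(2)]
freqs = np.arange(8, 31, 2)

# Eq. (4): concatenate the K spectral matrices column-wise, m = K * F
X = np.hstack([morlet_amplitude(c, freqs, fs) for c in channels])
print(X.shape)  # (512, 24); nonnegative by construction
```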
[Fig. 1. The classification accuracy averaged over 140 trials is plotted for three methods (NMF, KNMF, and FS). All three methods work well, while KNMF is slightly better than the other two.]
2.2. NMF and feature extraction

NMF seeks a decomposition of $X \in \mathbb{R}^{n \times m}$ of the form

$$X \approx UV^\top, \tag{5}$$

where $U \in \mathbb{R}^{n \times r}$ and $V \in \mathbb{R}^{m \times r}$ are also restricted to be nonnegative matrices. In our problem setting, the matrices $U$ and $V$ are interpreted as follows (see also Fig. 4):

- Column vectors of $V$ are basis vectors which reflect $r$ representative spectral characteristics learned from an ensemble of EEG data samples.
- Row vectors of $U$ are features, reflecting how the learned spectral patterns are encoded for the spectral data vectors (associated with row vectors of $X$).

We provide a simple alternative derivation of the multiplicative updates for least squares (LS) NMF, originally developed by Lee and Seung [10]. We consider the LS objective function

$$J = \tfrac{1}{2} \| X - UV^\top \|^2, \tag{6}$$

where $\|X\|$ denotes the Frobenius norm of a matrix $X$. Taking the nonnegativity constraints $U \geq 0$ and $V \geq 0$ into account, the Lagrangian is given by

$$L = \tfrac{1}{2} \| X - UV^\top \|^2 - \operatorname{tr}\{\Lambda U^\top\} - \operatorname{tr}\{\Omega V^\top\}, \tag{7}$$

where $\Lambda \geq 0$ and $\Omega \geq 0$ are Lagrange multipliers. The KKT optimality conditions require

$$UV^\top V - XV = \Lambda, \tag{8}$$

$$VU^\top U - X^\top U = \Omega, \tag{9}$$

which result from $\partial L / \partial U = 0$ and $\partial L / \partial V = 0$, respectively. The KKT complementary slackness conditions require

$$[UV^\top V - XV] \odot U = 0, \tag{10}$$

$$[VU^\top U - X^\top U] \odot V = 0, \tag{11}$$

where $\odot$ denotes the Hadamard (element-wise) product. These suggest the following multiplicative updates:

$$U \leftarrow U \odot \frac{XV}{UV^\top V}, \tag{12}$$

$$V \leftarrow V \odot \frac{X^\top U}{VU^\top U}, \tag{13}$$

where the division is performed element-wise. These rules are exactly the same as the ones proposed by Lee and Seung [10].

When a test data matrix $X_{\mathrm{test}}$ is given, its associated feature matrix $U_{\mathrm{test}}$ can be computed in two different ways:

- The feature matrix $U_{\mathrm{test}}$ is determined by LS projection,
$$U_{\mathrm{test}} = X_{\mathrm{test}} [V^\top]^{\dagger}, \tag{14}$$
where $\dagger$ denotes the pseudo-inverse. In this case $U_{\mathrm{test}}$ might have negative elements but works well in practice [11].
- We iterate the update rule (12) until convergence, with $V$ (learned in the training phase) fixed.

Sparseness or orthogonality constraints have been suggested for a more fruitful representation in the framework of NMF [14,6,2].

[Fig. 2. Basis vectors computed by NMF and KNMF, as well as representative patterns selected by FS, are shown. In each plot, the vertical axis represents frequency components between 4 and 30 Hz; the upper half is for $C_3$ and the lower half for $C_4$. The number of factors was chosen as $r = 6$ for both NMF and KNMF. FS chose 10 features, where blue is associated with "not selected". Its selected pattern is closely related to the NMF and KNMF factors. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)]
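The updates (12) and (13) and the two test-time options around (14) translate directly into NumPy. Below is a minimal sketch; the small constant guarding the denominators, the iteration count, and the random initialization are our own choices, not part of the original derivation.

```python
import numpy as np

def nmf(X, r, n_iter=500, eps=1e-9, seed=None):
    """Least squares NMF, X ~ U V^T, via the multiplicative updates (12), (13)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = rng.random((n, r))
    V = rng.random((m, r))
    for _ in range(n_iter):
        U *= (X @ V) / (U @ (V.T @ V) + eps)    # Eq. (12)
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)  # Eq. (13)
    return U, V

def features_ls(X_test, V):
    """Option 1, Eq. (14): LS projection; may contain negative entries."""
    return X_test @ np.linalg.pinv(V.T)

def features_mult(X_test, V, n_iter=200, eps=1e-9, seed=None):
    """Option 2: iterate the update (12) with V (learned on training data) fixed."""
    rng = np.random.default_rng(seed)
    U = rng.random((X_test.shape[0], V.shape[1]))
    for _ in range(n_iter):
        U *= (X_test @ V) / (U @ (V.T @ V) + eps)
    return U

# toy nonnegative data: 40 samples, 24 spectral columns
X = np.abs(np.random.default_rng(1).standard_normal((40, 24)))
U, V = nmf(X, r=6, seed=1)
print(np.linalg.norm(X - U @ V.T) / np.linalg.norm(X))  # relative reconstruction error
```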
3. Kernel NMF

3.1. Algorithm

A simple trick to develop kernel NMF is to assume that the feature profiles (column vectors of $U$) are convex combinations of the frequency profiles (column vectors of $X$) of the spectral data in the NMF decomposition, i.e.,

$$U = XW, \tag{15}$$

where each column of $W$ satisfies the sum-to-one constraint. Incorporating the assumption (15) into the LS objective function (6) yields

$$J_K = \tfrac{1}{2} \| X - XWV^\top \|^2. \tag{16}$$

In order to handle the inequality constraints $W \geq 0$ and $V \geq 0$, we consider the Lagrangian, as in (7),

$$L_K = \tfrac{1}{2} \| X - XWV^\top \|^2 - \operatorname{tr}\{\Lambda W^\top\} - \operatorname{tr}\{\Omega V^\top\}, \tag{17}$$

where $\Lambda$ and $\Omega$ are Lagrange multipliers. As in (8) and (9), the KKT optimality conditions require

$$X^\top X W V^\top V - X^\top X V = \Lambda, \tag{18}$$

$$V W^\top X^\top X W - X^\top X W = \Omega, \tag{19}$$

which are the consequences of $\partial L_K / \partial W = 0$ and $\partial L_K / \partial V = 0$, respectively.

Define a kernel matrix (Gram matrix) as $K = X^\top X$, whose $(i,j)$-entry represents the similarity between frequency profiles $i$ and $j$. Then the KKT complementary slackness conditions yield

$$[KWV^\top V - KV] \odot W = 0, \tag{20}$$

$$[VW^\top KW - KW] \odot V = 0. \tag{21}$$

These relations lead to the following multiplicative updates:

$$V \leftarrow V \odot \frac{KW}{VW^\top KW}, \tag{22}$$

$$W \leftarrow W \odot \frac{KV}{KWV^\top V}. \tag{23}$$

The first idea on KNMF was proposed in [4]. We elaborate it further, developing the multiplicative updates in a different way and applying the method to a task of EEG classification. As in [4], one can easily extend KNMF to the case where the data matrix is free of sign restrictions. Throughout this paper we use only the linear kernel $K = X^\top X$. However, the method holds for $K = \Phi^\top \Phi$, where $\Phi = [\phi(X_{ij})]$ is the transformed data matrix in a feature space.

We also provide an alternative derivation of the rules (22) and (23). Suppose that the gradient of the objective function (16) has a decomposition of the form

$$\nabla J_K = [\nabla J_K]^{+} - [\nabla J_K]^{-}, \tag{24}$$

where $[\nabla J_K]^{+} > 0$ and $[\nabla J_K]^{-} > 0$. In such a case, the multiplicative update for the parameters $\Theta$ has the form

$$\Theta \leftarrow \Theta \odot \left( \frac{[\nabla J_K]^{-}}{[\nabla J_K]^{+}} \right)^{.\eta}, \tag{25}$$

where $(\cdot)^{.\eta}$ denotes the element-wise power and $\eta$ is a learning rate ($0 < \eta \leq 1$). One can easily see that the multiplicative update (25) preserves the nonnegativity of the parameter $\Theta$, while $\nabla J_K = 0$ when convergence is achieved [2]. Note that the derivatives with respect to $W$ and $V$ are computed as

$$\frac{\partial J_K}{\partial W} = \left[\frac{\partial J_K}{\partial W}\right]^{+} - \left[\frac{\partial J_K}{\partial W}\right]^{-} = KWV^\top V - KV,$$

$$\frac{\partial J_K}{\partial V} = \left[\frac{\partial J_K}{\partial V}\right]^{+} - \left[\frac{\partial J_K}{\partial V}\right]^{-} = VW^\top KW - KW.$$

Thus, invoking the relation (25) with $\eta = 1$, one can easily derive the multiplicative updates (22) and (23).

[Fig. 3. Exemplary EEG data (IDIAP dataset, detailed in Section 4.1) shown in the time domain (upper panel) and in the time-frequency domain (lower panel). The time-domain waveforms are measured at eight different channels ($C_3$, $C_z$, $C_4$, $CP_1$, $CP_2$, $P_3$, $P_z$, $P_4$). In the corresponding time-frequency representations, frequency (horizontal axis in each plot) ranges over [8, 10, 12, ..., 28, 30] Hz (i.e., the number of frequency bands is 12). In this case, the data matrix $X \in \mathbb{R}^{n \times 96}$ is constructed by collecting 12 frequency profiles at each channel ($96 = 12 \times 8$).]
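A sketch of KNMF with the linear kernel $K = X^\top X$, implementing the updates (22) and (23); setting the exponent $\eta = 1$ recovers them from the generic rule (25). As before, the epsilon guard, iteration count, and initialization are illustrative assumptions, and the sum-to-one constraint on the columns of $W$ is not explicitly enforced here.

```python
import numpy as np

def knmf(X, r, n_iter=500, eta=1.0, eps=1e-9, seed=None):
    """KNMF with linear kernel: X ~ X W V^T, via Eqs. (22) and (23).

    eta is the learning rate of the generic rule (25); eta = 1
    recovers the updates (22) and (23) exactly.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    K = X.T @ X                  # Gram matrix of frequency profiles
    W = rng.random((m, r))
    V = rng.random((m, r))
    for _ in range(n_iter):
        V *= ((K @ W) / (V @ (W.T @ K @ W) + eps)) ** eta       # Eq. (22)
        W *= ((K @ V) / (K @ W @ (V.T @ V) + eps)) ** eta       # Eq. (23)
    U = X @ W                    # Eq. (15): features as combinations of data columns
    return W, V, U

X = np.abs(np.random.default_rng(2).standard_normal((40, 24)))
W, V, U = knmf(X, r=6, seed=2)
print(np.linalg.norm(X - X @ W @ V.T) / np.linalg.norm(X))
```

Note that, as claimed in the introduction, test features under the linear kernel require only a matrix multiplication with the learned $W$, rather than iterative updates or a pseudo-inverse.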
3.2. Sparsity of KNMF

Our KNMF with a linear kernel is a special case of convex-NMF [15]. It therefore follows from the result in [15] that the factor matrices $W$ and $V$ in KNMF are naturally sparse. We briefly review this sparsity result [15]. We then provide an illustrative example showing a useful behavior of KNMF due to the sparsity property, and confirm the sparsity of the KNMF factor matrices when applied to spectral EEG data.

Suppose that the singular value decomposition (SVD) of the data matrix $X$ is given by $X = P \Sigma Q^\top$. Then the objective value in the optimization problem (16) can be rewritten as

$$\| X - XWV^\top \|^2 = \operatorname{tr}\{ (I - VW^\top) X^\top X (I - WV^\top) \} = \sum_{j=1}^{m} \sigma_j^2 \, \| q_j^\top (I - WV^\top) \|^2, \tag{26}$$

where $\sigma_j$ is the $j$th diagonal entry of $\Sigma$, $q_j$ is the $j$th column vector of $Q$, and $X^\top X = \sum_{j=1}^{m} \sigma_j^2 q_j q_j^\top$ for $m < n$.

The term $q_j^\top (I - WV^\top)$ in (26) is the projection of $(I - WV^\top)$ onto a principal direction, so the optimization problem (16) can be considered as minimizing the sum of the differences between the data and its projected values. One can easily see in (26) that the projection of $(I - WV^\top)$ onto principal directions carries larger weights, while the projection onto nonprincipal directions carries smaller weights. It was pointed out that sparsity is thus enforced heavily in the principal subspace and lightly in the nonprincipal subspace; it was shown in [15] that KNMF with a linear kernel tends to provide sparse solutions.

Fig. 5 shows bases computed by NMF and KNMF from the Graz data (described in Section 4.2), where the $C_3$ and $C_4$ channels of motor imagery EEG data were used. The bases are associated with the distributions of frequency components in the range 4-30 Hz. One can easily see that the KNMF bases are sparser than the NMF bases.

We now provide an illustrative example showing that the sparsity of the KNMF factor matrices is useful in determining discriminative basis vectors, compared to standard NMF. To this end, we first generate a data matrix

$$X = \begin{pmatrix} 0.3692 & 0.1320 & 0.8212 & 0.4509 & 1.3685 \\ 0.1112 & 0.9421 & 0.0154 & 0.5470 & 1.6256 \\ 0.7803 & 0.9561 & 0.0430 & 0.2963 & 1.7802 \\ 0.3897 & 1.5752 & 1.1690 & 0.7447 & 0.0811 \\ 0.2417 & 1.0598 & 1.6491 & 0.1890 & 0.9294 \\ 0.4039 & 1.2348 & 1.7317 & 0.6868 & 0.7757 \\ 0.0965 & 1.3532 & 1.6477 & 0.1835 & 0.4868 \end{pmatrix}.$$

The first three rows were generated by adding uniformly distributed ($\mathcal{U}[0,1]$) random numbers to $[0, 0, 0, 0, 1]$. The last four rows were constructed by adding uniformly distributed ($\mathcal{U}[0,1]$) random numbers to $[0, 1, 1, 0, 0]$.

In the ideal case, NMF or KNMF is expected to compute two representative basis vectors, $[0, 0, 0, 0, 1]^\top$ and $[0, 1, 1, 0, 0]^\top$, and to identify the two clusters (the first three rows and the last four rows of $X$). The basis matrices computed by NMF and KNMF are given by

$$V_{\mathrm{NMF}} = \begin{pmatrix} 0.3402 & 0.1865 \\ 0.4411 & 1.4140 \\ 0.0000 & 2.0089 \\ 0.2842 & 0.4462 \\ 1.3386 & 0.0441 \end{pmatrix}, \qquad V_{\mathrm{KNMF}} = \begin{pmatrix} 0.3435 & 0.0442 \\ 0.2616 & 0.7895 \\ 0.0000 & 1.0587 \\ 0.2274 & 0.2324 \\ 1.2439 & 0.0000 \end{pmatrix}.$$

Each column vector of $V_{\mathrm{NMF}}$ and $V_{\mathrm{KNMF}}$ resembles an ideal basis vector, since the last entry of the first column is much larger than the remaining entries, and the second and third entries of the second column are much larger than the remaining entries. We compute the sparseness of each column vector of $V_{\mathrm{NMF}}$ and $V_{\mathrm{KNMF}}$ using the measure in [6],

$$\xi(x) = \frac{\sqrt{m} - \left( \sum_i |x_i| \right) \Big/ \sqrt{\sum_i x_i^2}}{\sqrt{m} - 1}, \tag{27}$$

where $x_i$ is the $i$th element of the $m$-dimensional vector $x$. The sparseness is computed as $(0.4846, 0.4926)$ for $V_{\mathrm{NMF}}$ and $(0.5518, 0.5278)$ for $V_{\mathrm{KNMF}}$. The difference in sparseness between $V_{\mathrm{NMF}}$ and $V_{\mathrm{KNMF}}$ is not big; in both cases, the basis vectors are close to the ideal ones.
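The sparseness measure (27) is straightforward to evaluate. The sketch below applies it column-wise to the basis matrices of the example above, with values copied from the text; it is a minimal sketch, and the column order of the reported values may differ from the printout.

```python
import numpy as np

def sparseness(x):
    """Hoyer's sparseness measure, Eq. (27): 0 for a dense vector, 1 for one nonzero."""
    m = x.size
    return (np.sqrt(m) - np.abs(x).sum() / np.sqrt((x ** 2).sum())) / (np.sqrt(m) - 1)

# basis matrices from the first illustrative example in the text
V_nmf = np.array([[0.3402, 0.1865],
                  [0.4411, 1.4140],
                  [0.0000, 2.0089],
                  [0.2842, 0.4462],
                  [1.3386, 0.0441]])
V_knmf = np.array([[0.3435, 0.0442],
                   [0.2616, 0.7895],
                   [0.0000, 1.0587],
                   [0.2274, 0.2324],
                   [1.2439, 0.0000]])

for name, V in [("NMF", V_nmf), ("KNMF", V_knmf)]:
    print(name, [round(sparseness(V[:, j]), 4) for j in range(V.shape[1])])
# the text reports (0.4846, 0.4926) and (0.5518, 0.5278); column order may differ
```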
We compute the encoding matrices by NMF and KNMF, which are given by

$$U_{\mathrm{NMF}} = \begin{pmatrix} 0.8937 & 0.2211 \\ 1.2362 & 0.0989 \\ 1.4090 & 0.0900 \\ 0.2201 & 0.7642 \\ 0.5598 & 0.7165 \\ 0.5484 & 0.8232 \\ 0.2859 & 0.8125 \end{pmatrix},$$

$$U_{\mathrm{KNMF}} = XW = X \begin{pmatrix} 0.2019 & 0.0000 \\ 0.0230 & 0.3730 \\ 0.0000 & 0.6144 \\ 0.0708 & 0.0126 \\ 0.7043 & 0.0000 \end{pmatrix} = \begin{pmatrix} 1.0733 & 0.5595 \\ 1.2277 & 0.3677 \\ 1.4543 & 0.3868 \\ 0.2248 & 1.3152 \\ 0.7411 & 1.4109 \\ 0.7049 & 1.5332 \\ 0.4064 & 1.5194 \end{pmatrix}.$$

The row vectors of the encoding matrices $U_{\mathrm{NMF}}$ and $U_{\mathrm{KNMF}}$ act as indicator variables for clusters. For example, the first row of $U_{\mathrm{NMF}}$ has two entries, 0.8937 and 0.2211; the first element is bigger than the second, which implies that the associated data point (the first row vector of $X$) belongs to cluster 1. Investigating $U_{\mathrm{NMF}}$ and $U_{\mathrm{KNMF}}$ in this way, one can see that both encoding matrices identify the first three row vectors of $X$ as cluster 1 and the last four row vectors as cluster 2.

We provide another example, with a data matrix at a lower signal-to-noise ratio. We consider the data matrix

$$X = \begin{pmatrix} 0.3371 & 0.2630 & 0.9133 & 0.1067 & 0.8998 \\ 0.1622 & 0.6541 & 0.1524 & 0.9619 & 0.7599 \\ 0.7943 & 0.6892 & 0.8258 & 0.0046 & 1.3001 \\ 0.3112 & 1.2482 & 1.0383 & 0.7749 & 0.4314 \\ 0.5285 & 0.9505 & 1.4961 & 0.8173 & 0.9106 \\ 0.1656 & 0.5838 & 0.5782 & 0.8687 & 0.1818 \\ 0.6020 & 0.7290 & 0.9427 & 0.0844 & 0.2638 \end{pmatrix}.$$

The first three rows were generated by adding uniformly distributed ($\mathcal{U}[0,1]$) random numbers to $[0, 0, 0, 0, 0.5]$. The last four rows were constructed by adding uniformly distributed ($\mathcal{U}[0,1]$) random numbers to $[0, 0.5, 0.5, 0, 0]$. The data matrix was thus generated in a way similar to the previous example; the distinction is a lower signal-to-noise ratio (0.5 instead of 1) with the same pattern of zeros and nonzeros. The basis matrices computed by NMF and KNMF are given by

$$V_{\mathrm{NMF}} = \begin{pmatrix} 0.7356 & 0.0534 \\ 0.6297 & 1.0885 \\ 1.1661 & 0.6776 \\ 0.0000 & 1.3856 \\ 1.1416 & 0.1694 \end{pmatrix}, \qquad V_{\mathrm{KNMF}} = \begin{pmatrix} 0.4694 & 0.1771 \\ 0.0000 & 1.0268 \\ 0.0020 & 1.1943 \\ 0.0000 & 0.7247 \\ 1.1252 & 0.0000 \end{pmatrix}.$$

We used (27) to compute the sparseness of each column vector of $V_{\mathrm{NMF}}$ and $V_{\mathrm{KNMF}}$, which is given by $(0.2430, 0.3693)$ and $(0.7495, 0.3593)$, respectively. The sparseness difference is larger compared to the previous example. Moreover, the basis vectors in KNMF still look close to $[0, 0, 0, 0, 0.5]^\top$ and $[0, 0.5, 0.5, 0, 0]^\top$, since the last entry of the first column vector is much larger than the remaining entries and the second and third entries of the second column vector are much larger than the remaining entries.
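As a check on the cluster-indicator interpretation above, here is a minimal sketch that assigns each row of $X$ to the cluster with the largest encoding, using the $U_{\mathrm{NMF}}$ printed in the first example.

```python
import numpy as np

# encoding matrix U_NMF from the first example; each row encodes one data point
U_nmf = np.array([[0.8937, 0.2211],
                  [1.2362, 0.0989],
                  [1.4090, 0.0900],
                  [0.2201, 0.7642],
                  [0.5598, 0.7165],
                  [0.5484, 0.8232],
                  [0.2859, 0.8125]])

# the largest entry in each row indicates the cluster of the corresponding row of X
labels = U_nmf.argmax(axis=1)
print(labels)  # [0 0 0 1 1 1 1]: rows 1-3 form cluster 1, rows 4-7 form cluster 2
```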