Latent Wishart Processes for Relational Kernel Learning

Wu-Jun Li
Dept. of Comp. Sci. and Eng.
Hong Kong Univ. of Sci. and Tech.
Hong Kong, China
liwujun@cse.ust.hk

Zhihua Zhang
College of Comp. Sci. and Tech.
Zhejiang University
Zhejiang 310027, China
zhzhang@cs.zju.edu.cn

Dit-Yan Yeung
Dept. of Comp. Sci. and Eng.
Hong Kong Univ. of Sci. and Tech.
Hong Kong, China
dyyeung@cse.ust.hk

Appearing in Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS) 2009, Clearwater Beach, Florida, USA. Volume 5 of JMLR: W&CP 5. Copyright 2009 by the authors.

Abstract

One main concern with kernel classifiers is their sensitivity to the choice of the kernel function or kernel matrix, which characterizes the similarity between instances. Many real-world data, such as web pages and protein-protein interaction data, are relational in nature in the sense that different instances are correlated (linked) with each other. The relational information available in such data often provides strong hints on the correlation (or similarity) between instances. In this paper, we propose a novel relational kernel learning model based on latent Wishart processes (LWP) to learn the kernel function for relational data. This is done by seamlessly integrating the relational information and the input attributes into the kernel learning process. Through extensive experiments on real-world applications, we demonstrate that our LWP model can give very promising performance in practice.

1 Introduction

Kernel methods, such as support vector machines (SVM) and Gaussian processes (GP) (Rasmussen and Williams, 2006), have been widely used in many applications, giving very promising performance. In kernel methods, the similarity between instances is represented by a kernel function defined over the input attributes. In general, the choice of an appropriate kernel function and its corresponding parameters is difficult in practice. Poorly chosen kernel functions can impair the performance significantly. Hence, kernel learning (Lanckriet et al., 2004; Zhang et al., 2006), which tries to find a good kernel matrix for the training data, is very important for kernel-based classifier design.

In many real-world applications, relationships or "links" between (some) instances may also be available in the data in addition to the input attributes. Data of this sort, referred to as relational data (Getoor and Taskar, 2007), can be found in such diverse application areas as web mining, social network analysis, bioinformatics, marketing, and so on. In relational data, the attributes of connected (linked) instances are often correlated, and the class label of one instance may have an influence on that of a linked instance. This means that the relationships (or links) between instances are very informative for instance classification (Getoor and Taskar, 2007), sometimes even much more informative than the input attributes. For example, two hyperlinked web pages are very likely to be related to the same topic, even when their attributes may look quite different when represented as bags of words. In biology, interacting proteins are more likely to have the same biological function than those without interaction. In marketing, knowing one's shopping habits provides useful information about his/her close friends' shopping inclinations. Hence, in such data, relational information often provides very strong hints to refine the correlation (or similarity) between instances.
This motivates our work on relational kernel learning (RKL), which refers to learning a kernel matrix (or a kernel function) for relational data by incorporating relational information into the learning process.

Thanks to its promising potential in many application areas, relational learning (Getoor and Taskar, 2007), which tries to model relational data, has become a hot topic in the machine learning community. Many methods have been proposed over the past few years. For example, probabilistic relational models
(PRM) (Getoor and Taskar, 2007), such as relational Bayesian networks (RBN) (Getoor et al., 2002), relational Markov networks (RMN) (Taskar et al., 2002) and relational dependency networks (RDN) (Neville and Jensen, 2007), try to model relational data by adapting conventional graphical models to the relational scenario. There are also some methods that handle relational data in heuristic ways (Macskassy and Provost, 2007). These methods are not based on the kernel approach.

More recently, the relational Gaussian process (RGP) (Chu et al., 2007) and the mixed graph Gaussian process (XGP) (Silva et al., 2008) were proposed to model relational data from a kernel point of view. Both of them utilize relational information to learn the covariance (or kernel) matrix for a Gaussian process. Hence, RGP and XGP are relational kernel learning methods, and we follow their path in this paper.

We propose a novel model, called LWP hereafter, based on latent Wishart processes to learn the kernel for relational data by seamlessly integrating relational information with input attributes. Several promising properties of LWP are highlighted here:

• LWP adopts the kernel defined on the input attributes to define the prior for the target kernel, and uses the link information to define the likelihood. Finally, MAP estimation is applied to learn the target kernel. Hence, LWP seamlessly integrates the two views into a principled framework.

• To the best of our knowledge, LWP is the first model that employs Wishart processes for relational learning.

• Unlike many other existing models, such as XGP (Silva et al., 2008), which are only suitable for the transductive setting, LWP is naturally applicable to inductive inference over unseen test data.

• During the kernel learning phase, no label information for the instances is needed, which makes it easy to extend LWP for visualization and clustering of relational data.

The rest of this paper is organized as follows. Section 2 introduces some basic background on Wishart processes. Section 3 presents our LWP model and Section 4 compares it with related work. Experimental results are presented in Section 5, and Section 6 concludes the paper.

2 Wishart Processes

Definition 1 (Gupta and Nagar, 2000) An n×n random symmetric positive definite matrix A is said to have a Wishart distribution with parameters n, q, and n×n scale matrix Σ ≻ 0, written as A ∼ W_n(q, Σ), if its p.d.f. is given by

    \frac{|A|^{(q-n-1)/2}}{2^{qn/2}\,\Gamma_n(q/2)\,|\Sigma|^{q/2}} \exp\left(-\frac{1}{2}\operatorname{tr}(\Sigma^{-1}A)\right), \qquad q \ge n.   (1)

Here Σ ≻ 0 means that Σ is positive definite.
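For concreteness, the density in (1) can be checked numerically. The following minimal sketch (ours, not from the paper) evaluates the formula directly and compares it against SciPy's Wishart implementation; the function name is illustrative.

```python
import numpy as np
from scipy.stats import wishart
from scipy.special import multigammaln

def wishart_logpdf(A, q, Sigma):
    """Log of the density in Eq. (1) for an n x n positive definite A, with q >= n."""
    n = A.shape[0]
    _, logdet_A = np.linalg.slogdet(A)
    _, logdet_S = np.linalg.slogdet(Sigma)
    return ((q - n - 1) / 2.0 * logdet_A
            - q * n / 2.0 * np.log(2.0)
            - multigammaln(q / 2.0, n)          # log of the multivariate gamma Gamma_n(q/2)
            - q / 2.0 * logdet_S
            - 0.5 * np.trace(np.linalg.solve(Sigma, A)))

# Quick numerical check against SciPy.
rng = np.random.default_rng(0)
n, q = 4, 7
Sigma = np.cov(rng.standard_normal((n, 100)))       # a random positive definite scale matrix
A = wishart.rvs(df=q, scale=Sigma, random_state=0)  # one Wishart draw
print(np.allclose(wishart_logpdf(A, q, Sigma), wishart.logpdf(A, df=q, scale=Sigma)))
```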
Assume that we are given an input space X = {x_1, x_2, . . .}.

Definition 2 (Zhang et al., 2006) The kernel function {A(x_i, x_j); x_i, x_j ∈ X} is said to be a Wishart process (WP) if for any n ∈ N and {x_1, . . . , x_n} ⊆ X, the n×n random matrix A = [A(x_i, x_j)]_{i,j=1}^n follows a Wishart distribution.

For A : X × X → R, there exists a corresponding mapping (say B) from the input space X to a latent (feature) space (say F ⊂ R^q), i.e., a vector-valued function B(x) = (B_1(x), . . . , B_q(x))' such that A(x_i, x_j) = B(x_i)'B(x_j). Zhang et al. (2006) proved that A(x_i, x_j) is a Wishart process if and only if the B_j(x) (j = 1, . . . , q) are q mutually independent Gaussian processes.

Although q is possibly infinite, we assume that it is finite in this paper for simplicity. Denote A = [A(x_i, x_j)]_{i,j=1}^n (n×n) and B = [B(x_1), . . . , B(x_n)]' = [b_1, . . . , b_n]' (n×q). Then the b_i are the latent vectors, and A = BB' is the linear kernel in the latent space but is a nonlinear kernel w.r.t. the input space.

Let Σ ⊗ I_q denote the Kronecker product of Σ and I_q (the q×q identity matrix). Using the notation N_{n,q}(0, Σ ⊗ I_q) in (Gupta and Nagar, 2000, page 55) for matrix-variate normal distributions, we have

Theorem 1 Let Σ be an n×n positive definite matrix. Then A is distributed according to the Wishart distribution W_n(q, Σ) if and only if B is distributed according to the matrix-variate normal distribution N_{n,q}(0, Σ ⊗ I_q).

Proof: (Sketch) If q ≥ n, the theorem can be proven by combining Theorem 3.2.2 and Theorem 3.3.3 in (Gupta and Nagar, 2000). If q < n, A follows a singular Wishart distribution (Srivastava, 2003), and the corresponding B is distributed according to the singular matrix-variate normal distribution (Definition 2.4.1 in (Gupta and Nagar, 2000)). The proof is similar to the case of q ≥ n. We omit the details here due to the page limit constraint. □

Theorem 1 shows an interesting connection; namely, the degree of freedom q in the Wishart process is the dimensionality of the latent space F. Theorem 1 extends the results given in (Gupta and Nagar, 2000) and (Zhang et al., 2006), where the condition q ≥ n is required. Moreover, the asymptotic distribution of q^{1/2}(\bar{A} − Σ), with \bar{A} = q^{-1}A, as q → ∞ is N_{n,n}(0, (I_{n^2} + C)(Σ ⊗ Σ)), where C is the commutation matrix (Muirhead, 1982).
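To make the equivalence in Theorem 1 concrete, the following minimal NumPy sketch (ours) draws B with i.i.d. N_n(0, Σ) columns, i.e., B ∼ N_{n,q}(0, Σ ⊗ I_q), and checks that A = BB' behaves like a W_n(q, Σ) sample, whose mean is qΣ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 5, 8                                   # q >= n here, so A = BB' is a full-rank Wishart draw
Sigma = np.eye(n) + 0.3 * np.ones((n, n))     # an arbitrary positive definite scale matrix

def sample_B(rng, n, q, Sigma):
    """B ~ N_{n,q}(0, Sigma (x) I_q): the q columns of B are i.i.d. N_n(0, Sigma)."""
    return rng.multivariate_normal(np.zeros(n), Sigma, size=q).T   # n x q

# A = B B' then follows W_n(q, Sigma); its mean is q * Sigma.
draws = [sample_B(rng, n, q, Sigma) for _ in range(20000)]
A_mean = np.mean([B @ B.T for B in draws], axis=0)
print(np.max(np.abs(A_mean / q - Sigma)))     # small for a large number of draws
```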
3 Methodology

One common representation format for relational data is a graph, in which a node corresponds to an instance and an edge corresponds to a pairwise relationship between the two connected nodes. Although directed edges are common in some data sets, they can be converted into undirected edges in many cases. In this paper, we focus only on modelling undirected edges, which represent symmetric (or reciprocal) relationships. Furthermore, in the real world the relationship between two nodes can be either "positive" or "negative", which means that the attributes of the connected nodes have positive or negative correlation, respectively. Due to the page limit constraint, here we only consider positive relationships, although it is straightforward to extend our proposed model to negative relationships. The extension of our model to directed graphs or hypergraphs with negative relationships will be pursued in our future work.

Suppose we are given a set of data {(x_i, y_i, z_ik) : i, k = 1, . . . , n}, where x_i = (x_{i1}, . . . , x_{ip})' and y_i are respectively the input feature vector and the label of instance i, and z_ik is a binary variable indicating the existence of a relationship (link) between instances i and k, namely,

    z_{ik} = \begin{cases} 1 & \text{if there exists a link between } x_i \text{ and } x_k, \\ 0 & \text{otherwise.} \end{cases}

Rather than considering the design of a kernel classifier, we focus our attention on learning the kernel function for any kernel classifier by utilizing the relational information, i.e., relational kernel learning.

Unlike conventional kernel learning methods (Lanckriet et al., 2004; Zhang et al., 2006), which use the class labels to learn the kernel matrix, the relational kernel learning setting in LWP does not use any instance label. LWP only uses the input feature vectors and the binary variables {z_ik} to learn the kernel. Thus, in essence, LWP is unsupervised in nature.

3.1 Model

The available information for RKL includes both the input attributes and the binary variables {z_ik}. The goal of RKL is to learn a target kernel function A(x_i, x_k) which takes both the input attributes and the relational information into consideration. Let a_ik = A(x_i, x_k). Then A = [a_{ik}]_{i,k=1}^n will be a positive semidefinite matrix. We now model A by a (singular) Wishart distribution W_n(q, Σ). This implies that A(x_i, x_k) follows a Wishart process.

Let K(x_i, x_k) be a kernel function defined just on the input attributes. For example, the linear kernel K(x_i, x_k) = x_i'x_k is such a kernel function. Similarly, K = [K(x_i, x_k)]_{i,k=1}^n is a positive semidefinite matrix. If we set Σ = β(K + λI_n), with β > 0 and λ being a typically very small number that makes Σ ≻ 0, we have

    A ∼ W_n(q, β(K + λI)).   (2)

Consequently, the input attributes are successfully integrated into the target kernel function.
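As a small illustration of (2), the sketch below (ours) builds the scale matrix Σ = β(K + λI_n) from a feature matrix, assuming the centered linear kernel and the values β = 1000 and λ = 10^{-4} reported later in the experiments; the function name is illustrative.

```python
import numpy as np

def prior_scale_matrix(X, beta=1000.0, lam=1e-4):
    """Scale matrix Sigma = beta * (K + lam * I_n) of the Wishart prior in Eq. (2).
    X is the n x p matrix of input attributes; K is the centered linear kernel."""
    Xc = X - X.mean(axis=0)              # center the input attributes
    K = Xc @ Xc.T                        # linear kernel on the centered attributes
    n = K.shape[0]
    return beta * (K + lam * np.eye(n))
```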
To incorporate the relational information for the learning of A(·, ·), we regard each z_ik as a Bernoulli variable, which is determined by the latent variable a_ik via a logistic link function s_ik. Given the a_ik, we further assume that the z_ik are independent. Moreover, we assume that the links are symmetric with no self-loops, i.e., z_ik = z_ki and z_ii = 0. Letting Z = [z_{ik}]_{i,k=1}^n, we have

    p(Z \mid A) = \prod_{i=1}^{n} \prod_{k=i+1}^{n} s_{ik}^{z_{ik}} (1 - s_{ik})^{1 - z_{ik}}, \qquad \text{with } s_{ik} = \frac{\exp(a_{ik}/2)}{1 + \exp(a_{ik}/2)}.   (3)

To this end, both the input attributes and the relational information are seamlessly integrated into the same framework, in which the input attributes define the scale matrix of the prior for the distribution of the target kernel function, and the relational information defines the likelihood computed based on the target kernel function. Then, learning algorithms such as maximum a posteriori (MAP) estimation can be used to learn the latent variables a_ik. After learning the model, the target kernel function A(x_i, x_k) will take both the input attributes and the relational information into consideration. We can directly use this new kernel function to implement kernel classification methods such as SVM and GP-based classifiers. In our experiments, we apply A(x_i, x_k) as a kernel function for a GP-based classifier.

In this model, the Wishart process A(x_i, x_k) is used to define the latent variables {a_{ik}}_{i,k=1}^n. We thus call this model the latent Wishart process (LWP) model.
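A minimal sketch (ours) of the link likelihood in (3), evaluated for a latent matrix B and hence A = BB'; the function name is illustrative.

```python
import numpy as np

def log_likelihood_links(Z, B):
    """Log of p(Z | A) in Eq. (3) with A = B B'. Z is a symmetric 0/1 link matrix with
    zero diagonal; only pairs i < k contribute, with s_ik = logistic(a_ik / 2)."""
    A = B @ B.T
    iu = np.triu_indices(A.shape[0], k=1)       # indices of pairs with i < k
    logits = A[iu] / 2.0
    # z * log(s) + (1 - z) * log(1 - s) simplifies to z * logit - log(1 + exp(logit))
    return np.sum(Z[iu] * logits - np.logaddexp(0.0, logits))
```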
3.2 Learning

It is natural to consider the MAP estimate of A, namely,

    \operatorname*{argmax}_{A} \; \log\{p(Z \mid A)\, p(A)\}.

Theorem 1 shows that finding the MAP estimate of A is equivalent to finding the MAP estimate of B. In particular, we note that for the applications presented in Section 5, small values of q, typically no larger than 50, are sufficient for delivering good performance. This motivates us to alternatively find the MAP estimate of B, which dramatically reduces the computation cost. Consequently, we attempt to maximize the following log posterior probability:

    L(B) = \log\{p(Z \mid B)\, p(B)\}
         = \sum_{i \neq k} \log p(z_{ik} \mid b_i, b_k) + \log p(B)
         = \sum_{i \neq k} \Bigl[ z_{ik} b_i'b_k/2 - \log\bigl(1 + \exp(b_i'b_k/2)\bigr) \Bigr] - \frac{1}{2}\operatorname{tr}\!\Bigl(\frac{(K + \lambda I)^{-1}}{\beta}\, BB'\Bigr) + C
         = \sum_{i \neq k} \Bigl[ z_{ik} b_i'b_k/2 - \log\bigl(1 + \exp(b_i'b_k/2)\bigr) \Bigr] - \frac{1}{2}\sum_{i,k} \sigma_{ik}\, b_i'b_k + C,

where [\sigma_{ik}]_{i,k=1}^n = (K + \lambda I)^{-1}/\beta and C is a constant independent of B.

We employ a block quasi-Newton method to solve the maximization of L(B) w.r.t. B. The basic idea is to approximate the Hessian matrix ∂²L/∂B∂B' by the block diagonal matrix Blockdiag(∂²L/∂b_1∂b_1', . . . , ∂²L/∂b_n∂b_n'). The Fisher score vector and Hessian matrix of L w.r.t. b_i are given by

    \frac{\partial L}{\partial b_i} = \sum_{j \neq i} (z_{ij} - s_{ij} - \sigma_{ij})\, b_j - \sigma_{ii} b_i,
    \frac{\partial^2 L}{\partial b_i \partial b_i'} = -\frac{1}{2}\sum_{j \neq i} s_{ij}(1 - s_{ij})\, b_j b_j' - \sigma_{ii} I_q \;\triangleq\; -H_i.

Given the initial values b_i(0), the update equations for the b_i are

    b_i(t+1) = b_i(t) + \gamma\, H_i(t)^{-1} \left.\frac{\partial L}{\partial b_i}\right|_{B = B(t)}, \qquad i = 1, \ldots, n,   (4)

where γ is the step size. A strategy to make the objective function increase monotonically is to learn γ in each update step, but searching for a suitable γ in each update step might incur a high computation cost. In our experiments, we find that if γ is simply set to a value less than 0.1, our algorithm works very well in all the experiments. Although in this case the objective function does not necessarily increase monotonically, its long-term trend is increasing. This makes our algorithm not only very fast but also very accurate.

Note that this iterative procedure works in a parallel manner. It may also work sequentially. However, our experiments show that the parallel scheme is more stable than the sequential scheme. As we have mentioned, the size of H_i is q×q with q always being a small number in our experiments, so the iterative procedure is very efficient. In this paper, we adopt KPCA (Schölkopf et al., 1998) to initialize the values b_i(0).
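A compact sketch (ours) of one parallel sweep of the block quasi-Newton update in (4), using the score and Hessian blocks above; the function and argument names are illustrative.

```python
import numpy as np

def lwp_update_sweep(B, Z, Sigma_inv, gamma=0.01):
    """One parallel sweep of the update in Eq. (4).
    B: n x q latent matrix; Z: symmetric 0/1 link matrix;
    Sigma_inv: [sigma_ik] = (K + lam * I)^{-1} / beta."""
    n, q = B.shape
    A = B @ B.T
    S = 1.0 / (1.0 + np.exp(-A / 2.0))                 # s_ik = logistic(a_ik / 2)
    B_new = B.copy()
    for i in range(n):
        mask = np.arange(n) != i
        # Fisher score: sum_{j != i} (z_ij - s_ij - sigma_ij) b_j - sigma_ii b_i
        grad = (Z[i, mask] - S[i, mask] - Sigma_inv[i, mask]) @ B[mask] - Sigma_inv[i, i] * B[i]
        # H_i = (1/2) * sum_{j != i} s_ij (1 - s_ij) b_j b_j' + sigma_ii I_q
        w = S[i, mask] * (1.0 - S[i, mask])
        H = 0.5 * (B[mask].T * w) @ B[mask] + Sigma_inv[i, i] * np.eye(q)
        B_new[i] = B[i] + gamma * np.linalg.solve(H, grad)
    return B_new
```

Starting from a KPCA-based B(0) and repeating this sweep for a small fixed number of iterations with a small constant step size (Section 5.1 reports γ ≤ 0.01 and about 10 iterations working well) mirrors the intended usage.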
3.3 Out-of-Sample Extension

It is easy for LWP to perform out-of-sample extension (or induction), which means that we can use the learned LWP to predict the b_i for new test data.

Let

    Z = \begin{bmatrix} Z_{11} & Z_{12} \\ Z_{21} & Z_{22} \end{bmatrix} \quad \text{and} \quad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix},

where Z_{11} (Σ_{11}) is an n_1×n_1 matrix and Z_{22} (Σ_{22}) is an n_2×n_2 matrix. Suppose the n_1 instances corresponding to Z_{11} and Σ_{11} are training data, and Z_{22} and Σ_{22} correspond to new test data. Similarly, we partition

    B = \begin{bmatrix} B_1 \\ B_2 \end{bmatrix}, \qquad A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} B_1 B_1' & B_1 B_2' \\ B_2 B_1' & B_2 B_2' \end{bmatrix}.

Because B ∼ N_{n,q}(0, Σ ⊗ I_q), we have B_1 ∼ N_{n_1,q}(0, Σ_{11} ⊗ I_q) and

    B_2 \mid B_1 \sim N_{n_2,q}\bigl(\Sigma_{21}\Sigma_{11}^{-1} B_1,\; \Sigma_{22\cdot 1} \otimes I_q\bigr),   (5)

where Σ_{22·1} = Σ_{22} − Σ_{21}Σ_{11}^{-1}Σ_{12}.

For inductive inference, we first find the MAP estimate of B_1 based on argmax_{B_1} log p(Z_{11} | B_1) p(B_1), and then adopt the conditional expectation (or conditional mean) of B_2 given B_1 to estimate B_2, which is given by Σ_{21}Σ_{11}^{-1}B_1 based on (5).

Owing to the page limit, out-of-sample extension will not be evaluated via experiments in this paper. We will report such experiments in our future work.
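A short sketch (ours) of this inductive step: given the MAP estimate B_1 for the training instances and the full scale matrix Σ over training and test instances, the test latent vectors are estimated by the conditional mean in (5); the function name is illustrative.

```python
import numpy as np

def out_of_sample_latents(B1, Sigma, n1):
    """Estimate B2 for new test instances by the conditional mean Sigma_21 Sigma_11^{-1} B1 of Eq. (5).
    Sigma: (n1 + n2) x (n1 + n2) scale matrix over training and test instances; B1: n1 x q."""
    Sigma11 = Sigma[:n1, :n1]
    Sigma21 = Sigma[n1:, :n1]
    return Sigma21 @ np.linalg.solve(Sigma11, B1)
```

The learned kernel values between test and training instances are then A_21 = B_2 B_1', and among test instances A_22 = B_2 B_2'.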
4 Related Work

The methods that are most related to our LWP model are RGP (Chu et al., 2007) and XGP (Silva et al., 2008). Both RGP and XGP also try to learn the covariance (kernel) matrix for Gaussian processes by exploiting the relationships between instances. Unlike LWP, which learns multiple (q) GPs, RGP and XGP only learn one GP. In fact, when q = 1, B (n×1) in LWP degenerates to a single Gaussian process. Hence, our model can be regarded as a generalization of RGP and XGP. The key difference between them is that LWP treats A = BB' as the learned kernel matrix, which can be further used to train all kinds of kernel classifiers. In RGP and XGP, however, p(B | Z) is itself a prediction function, with B being a vector of function values for all input points. The learned kernel, which is the covariance matrix of the posterior distribution p(B | Z), is (K^{-1} + Π^{-1})^{-1} in RGP and (K + Π) in XGP, where Π is a kernel matrix capturing the link information. Since there is no closed-form solution for p(B | Z), RGP and XGP adopt different approximation strategies to compute the posterior covariance matrix.

Another related work is the stochastic relational model in (Yu and Chu, 2008; Yu et al., 2006), where A is modeled as a tensor GP rather than a WP. Since A in (Yu and Chu, 2008; Yu et al., 2006) is not guaranteed to be positive semidefinite, it cannot be regarded as a kernel matrix. Furthermore, the focus of (Yu and Chu, 2008; Yu et al., 2006) is on linkage prediction rather than instance classification as in our LWP model.

Our work has also been motivated by the latent space approaches in social network analysis (Hoff et al., 2002). Let d_{ij}^2 = ‖b_i − b_j‖^2 be the squared distance between b_i and b_j, and let D = [d_{ij}^2] be the n×n distance matrix. Then −(1/2) EDE = EAE, where E is an n×n centering matrix. This reveals a connection between our model and the latent distance model in (Hoff, 2008). In addition, we also note that in the latent eigenmodel (Hoff, 2008) for symmetric relational data, Hoff (2008) defined A = UΛU', where U is an orthogonal matrix and Λ is a diagonal matrix whose diagonal elements can be negative. Thus, A does not play the role of a kernel matrix.

5 Experiments

We compare our LWP method with several related methods, such as the standard Gaussian process classifier (GPC) (Rasmussen and Williams, 2006), RGP and XGP, on three real-world data sets: WebKB (Craven et al., 1998), Cora (McCallum et al., 2000) and the political books data set, which are also used in (Silva et al., 2008). The centralized linear kernel K(x_i, x_j) = x_i'x_j (Chu et al., 2007) is used to define the covariance matrix K in the Wishart distribution defined in (2) for all these data sets. The λ in (2) is set to a small number, 10^{-4}. All the b_i are initialized to the feature vectors obtained by kernel principal component analysis (KPCA) [1] (Schölkopf et al., 1998) based on K + λI.

Although our method can be applied under the inductive setting, for fair comparison we run our method under the transductive setting, because both RGP and XGP were only tested under this setting (Silva et al., 2008) [2]. More specifically, for our LWP method, we first perform kernel learning based on all the points, including training and test points, and their links, but without using any label information. Then, based on the learned kernel, we learn a GP with the training points only and evaluate the learned model on the test points. Hence, the main difference between our method and other methods is just in the kernel learning part.
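The transductive protocol above can be summarized in code. The sketch below (ours) assumes the kernel matrix A = BB' has already been learned from all instances and their links without labels; the paper uses a GP-based classifier on the learned kernel, while an SVM with a precomputed kernel stands in here purely for brevity.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

def transductive_auc(A, y, train_idx, test_idx):
    """A: learned kernel matrix over ALL instances (learned without any label information).
    Train a kernel classifier on the training instances only and score the held-out test ones."""
    clf = SVC(kernel="precomputed", probability=True)
    clf.fit(A[np.ix_(train_idx, train_idx)], y[train_idx])          # n_train x n_train kernel block
    scores = clf.predict_proba(A[np.ix_(test_idx, train_idx)])[:, 1]  # n_test x n_train kernel block
    return roc_auc_score(y[test_idx], scores)
```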
5.1 Sensitivity to Parameters

There are four parameters in total that affect the training of LWP: the dimensionality of the latent space q, the β in (2), the γ in (4), and the number of iterations (T) used to optimize L(B). We find that the performance is very stable by setting 1000 ≤ β ≤ 10000. Hence, in all the following experiments, β = 1000.

We first study the effect of γ and T. Here we use the Texas subset of the WebKB data set to illustrate this. The description of this subset is given in Section 5.3. We find that when γ ≤ 0.01, our algorithm is very stable. The performance, measured by the area under the ROC curve (AUC), and the objective function against the change in T are illustrated in Figure 1, in which the X-axis denotes T, "AUC01" is the performance with γ = 0.01, "AUC001" is the performance with γ = 0.001, "obj01" gives the objective function values with γ = 0.01, and "obj001" gives the objective function values with γ = 0.001. Note that the objective function values in the figure are transformed to L(B)/10^5 + 4 for the convenience of demonstration. We can see that the long-term trend of the objective function is increasing, and the performance of our algorithm is very stable. For γ = 0.01, 10 iterations are enough to give good performance.

We also test the sensitivity of our method to the change in q on the Texas subset and the political books data set. Figure 2 shows the average AUC with standard deviation over 100 trials of our method when q is set to different values. Note that we use KPCA to initialize the b_i's. Hence, KPCA actually refers to a GPC with the kernel matrix computed based on the initial values of b_i, i.e., b_i(0), in LWP. This means KPCA corresponds to the case that the iteration number T in LWP is zero.

[1] The KPCA in our paper is actually PCA, because for text processing the linear kernel always outperforms other kernels. Calling it KPCA is just for the consistency of our algorithm, because during the derivation of our algorithm K can be any kind of kernel.
[2] In fact, XGP can only work under the transductive setting.