Probabilistic Relational PCA

Wu-Jun Li, Dit-Yan Yeung
Dept. of Comp. Sci. and Eng.
Hong Kong University of Science and Technology
Hong Kong, China
{liwujun,dyyeung}@cse.ust.hk

Zhihua Zhang
School of Comp. Sci. and Tech.
Zhejiang University
Zhejiang 310027, China
zhzhang@cs.zju.edu.cn

Abstract

One crucial assumption made by both principal component analysis (PCA) and probabilistic PCA (PPCA) is that the instances are independent and identically distributed (i.i.d.). However, this common i.i.d. assumption is unreasonable for relational data. In this paper, by explicitly modeling the covariance between instances as derived from the relational information, we propose a novel probabilistic dimensionality reduction method, called probabilistic relational PCA (PRPCA), for relational data analysis. Although the i.i.d. assumption is no longer adopted in PRPCA, the learning algorithms for PRPCA can still be devised as easily as those for PPCA, which makes explicit use of the i.i.d. assumption. Experiments on real-world data sets show that PRPCA can effectively utilize the relational information to dramatically outperform PCA and achieve state-of-the-art performance.

1 Introduction

Using a low-dimensional embedding to summarize a high-dimensional data set has been widely used for exploring the structure in the data. The methods for discovering such low-dimensional embeddings are often referred to as dimensionality reduction (DR) methods. Principal component analysis (PCA) [13] is one of the most popular DR methods, with great success in many applications.

As a more recent development, probabilistic PCA (PPCA) [21] provides a probabilistic formulation of PCA [13] based on a Gaussian latent variable model [1]. Compared with the original non-probabilistic derivation of PCA in [12], PPCA possesses a number of practical advantages. For example, PPCA can naturally deal with missing values in the data; the expectation-maximization (EM) algorithm [9] used to learn the parameters in PPCA may be more efficient for high-dimensional data; it is easy to generalize the single model in PPCA to the mixture model case; furthermore, PPCA as a probabilistic model can naturally exploit Bayesian methods [2].

Like many existing DR methods, both PCA and PPCA are based on some assumptions about the data. One assumption is that the data should be represented as feature vectors all of the same dimensionality. Data represented in this form are sometimes referred to as flat data [10]. Another is the so-called i.i.d. assumption, which means that the instances are assumed to be independent and identically distributed (i.i.d.).

However, the data in many real-world applications, such as web pages and research papers, contain relations or links between (some) instances in addition to the textual content information that is represented in the form of feature vectors. Data of this sort, referred to as relational data¹ [10, 20], can be found in such diverse application areas as web mining [3, 17, 23, 24], bioinformatics [22], social network analysis [4], and so on.

¹ In this paper, we use document classification as a running example for relational data analysis. Hence, for convenience of illustration, the specific term 'textual content information' is used in the paper to refer to the feature vectors describing the instances. However, the algorithms derived in this paper can be applied to any relational data in which the instance feature vectors can represent any attribute information.
On one hand, the link structure among instances cannot be exploited easily when traditional DR methods such as PCA are applied to relational data. Very often, the useful relational information is simply discarded. For example, a citation/reference relation between two papers provides very strong evidence for them to belong to the same topic even though they may bear low similarity in their content due to the sparse nature of the bag-of-words representation, but this relational information is not exploited at all when applying PCA or PPCA. One possible use of the relational information in PCA or PPCA is to first convert the link structure into the format of flat data by extracting some additional features from the links. However, as argued in [10], this approach fails to capture some important structural information in the data. On the other hand, the i.i.d. assumption underlying PCA and PPCA is unreasonable for relational data. In relational data, the attributes of the connected (linked) instances are often correlated, and the class label of one instance may have an influence on that of a linked instance. For example, in biology, interacting proteins are more likely to have the same biological functions than those without interaction. Therefore, PCA and PPCA, or more generally most existing DR methods based on the i.i.d. assumption, are not suitable for relational data analysis.

In this paper, a novel probabilistic DR method called probabilistic relational PCA (PRPCA) is proposed for relational data analysis. By explicitly modeling the covariance between instances as derived from the relational information, PRPCA seamlessly integrates relational information and textual content information into a unified probabilistic framework. Two learning algorithms, one based on a closed-form solution and the other based on an EM algorithm [9], are proposed to learn the parameters of PRPCA. Although the i.i.d. assumption is no longer adopted in PRPCA, the learning algorithms for PRPCA can still be devised as easily as those for PPCA, which makes explicit use of the i.i.d. assumption. Extensive experiments on real-world data sets show that PRPCA can effectively utilize the relational information to dramatically outperform PCA and achieve state-of-the-art performance.

2 Notation

We use boldface uppercase letters, such as K, to denote matrices, and boldface lowercase letters, such as z, to denote vectors. The $i$th row and the $j$th column of a matrix K are denoted by $K_{i*}$ and $K_{*j}$, respectively. $K_{ij}$ denotes the element at the $i$th row and $j$th column of K, and $z_i$ denotes the $i$th element of z. $K^T$ is the transpose of K, and $K^{-1}$ is the inverse of K. $K \succeq 0$ means that K is positive semi-definite (psd) and $K \succ 0$ means that K is positive definite (pd). $\mathrm{tr}(\cdot)$ denotes the trace of a matrix and $\mathrm{etr}(\cdot) \triangleq \exp(\mathrm{tr}(\cdot))$. $P \otimes Q$ denotes the Kronecker product [11] of P and Q. $|\cdot|$ denotes the determinant of a matrix. $I_n$ is the identity matrix of size $n \times n$. e is a vector of 1s, the dimensionality of which depends on the context. We overload $\mathcal{N}(\cdot)$ for both multivariate normal distributions and matrix variate normal distributions [11]. $\langle\cdot\rangle$ denotes the expectation operation and $\mathrm{cov}(\cdot)$ denotes the covariance operation.

Note that in relational data, there exist both content and link observations. As in [21], $\{t_n\}_{n=1}^N$ denotes a set of observed $d$-dimensional data (content) vectors, the $d \times q$ matrix W denotes the $q$ principal axes (also called factor loadings), $\mu$ denotes the data sample mean, and $x_n = W^T(t_n - \mu)$ denotes the corresponding $q$ principal components (also called latent variables) of $t_n$. We further use the $d \times N$ matrix T to denote the content matrix with $T_{*n} = t_n$, and the $q \times N$ matrix X to denote the latent variables of T with $X_{*n} = W^T(t_n - \mu)$. For relational data, the $N \times N$ matrix A denotes the adjacency (link) matrix of the N instances. In this paper, we assume that the links are undirected. For data with directed links, we convert the directed links into undirected links in a way that preserves the original physical meaning of the links; this conversion will be described in detail in Section 4.1.1, and an example will be given in Section 5. Hence, $A_{ij} = 1$ if there exists a relation between instances $i$ and $j$, and otherwise $A_{ij} = 0$. Moreover, we always assume that there are no self-links, i.e., $A_{ii} = 0$.

3 Probabilistic PCA

To set the stage for the next section, which introduces our PRPCA model, we first briefly present the derivation of PPCA [21], which was originally based on (vector-based) multivariate normal distributions, from the perspective of matrix variate normal distributions [11].
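As an illustration that is not part of the original paper, the following NumPy sketch shows one way to draw a sample from the matrix variate normal $\mathcal{N}_{d,N}(M, \Sigma \otimes \Psi)$ used below; the function name and implementation are our own, and it assumes the convention of [11] in which $\Sigma$ is the $d \times d$ row (feature) covariance and $\Psi$ is the $N \times N$ column (instance) covariance.

```python
import numpy as np

def sample_matrix_normal(M, Sigma, Psi, rng):
    """Draw one sample from the matrix variate normal N_{d,N}(M, Sigma (x) Psi).

    M     : d x N mean matrix
    Sigma : d x d row (feature) covariance
    Psi   : N x N column (instance) covariance
    """
    d, N = M.shape
    A = np.linalg.cholesky(Sigma)      # Sigma = A A^T
    B = np.linalg.cholesky(Psi)        # Psi   = B B^T
    Z = rng.standard_normal((d, N))    # entries are i.i.d. standard normal
    # Each column of A Z B^T has covariance Psi_jj * Sigma, and each row has
    # covariance Sigma_ii * Psi, which is the coupling PRPCA exploits later.
    return M + A @ Z @ B.T

# The PPCA noise process in (1) below corresponds to Sigma = sigma^2 * I_d, Psi = I_N.
rng = np.random.default_rng(0)
d, N, sigma2 = 5, 8, 0.1
Upsilon = sample_matrix_normal(np.zeros((d, N)), sigma2 * np.eye(d), np.eye(N), rng)
```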
If we use $\Upsilon$ to denote the Gaussian noise process and assume that $\Upsilon$ and the latent variable matrix X follow these distributions:

$$\Upsilon \sim \mathcal{N}_{d,N}(0,\ \sigma^2 I_d \otimes I_N), \qquad X \sim \mathcal{N}_{q,N}(0,\ I_q \otimes I_N), \qquad (1)$$

we can express a generative model as follows: $T = WX + \mu e^T + \Upsilon$. Based on some properties of matrix variate normal distributions in [11], we get the following results:

$$T \mid X \sim \mathcal{N}_{d,N}(WX + \mu e^T,\ \sigma^2 I_d \otimes I_N), \qquad T \sim \mathcal{N}_{d,N}\big(\mu e^T,\ (WW^T + \sigma^2 I_d) \otimes I_N\big). \qquad (2)$$

Let $C = WW^T + \sigma^2 I_d$. The corresponding log-likelihood of the observation matrix T is then

$$\mathcal{L} = \ln p(T) = -\frac{N}{2}\Big[d\ln(2\pi) + \ln|C| + \mathrm{tr}(C^{-1}S)\Big], \qquad (3)$$

where $S = \frac{(T - \mu e^T)(T - \mu e^T)^T}{N} = \frac{1}{N}\sum_{n=1}^{N}(T_{*n} - \mu)(T_{*n} - \mu)^T$. We can see that S is just the sample covariance matrix of the content observations. It is easy to see that this log-likelihood form is the same as that in [21]. Using matrix notation, the graphical model of PPCA based on matrix variate normal distributions is shown in Figure 1(a).

[Figure 1 here: (a) Model of PPCA. (b) Model of PRPCA.]
Figure 1: Graphical models of PPCA and PRPCA, in which T is the observation matrix, X is the latent variable matrix, $\mu$, W and $\sigma^2$ are the parameters to learn, and the other quantities are kept constant.

4 Probabilistic Relational PCA

PPCA assumes that all the observations are independent and identically distributed. Although this i.i.d. assumption can make the modeling process much simpler and has achieved great success in many traditional applications, this assumption is, however, very unreasonable for relational data [10]. In relational data, the attributes of connected (linked) instances are often correlated.

In this section, a probabilistic relational PCA model, called PRPCA, is proposed to integrate both the relational information and the content information seamlessly into a unified framework by eliminating the i.i.d. assumption. Based on our reformulation of PPCA using matrix variate notation as presented in the previous section, we can obtain PRPCA just by introducing some relatively simple (but very effective) modifications. A promising property is that the computation needed for PRPCA is as simple as that for PPCA even though we have eliminated the restrictive i.i.d. assumption.

4.1 Model Formulation

Assume that the latent variable matrix X has the following distribution:

$$X \sim \mathcal{N}_{q,N}(0,\ I_q \otimes \Phi). \qquad (4)$$

According to Corollary 2.3.3.1 in [11], we get $\mathrm{cov}(X_{i*}) = \Phi$ for $i \in \{1, \dots, q\}$, which means that $\Phi$ actually reflects the covariance between the instances. From (1), we can see that $\mathrm{cov}(X_{i*}) = I_N$ for PPCA, which also coincides with the i.i.d. assumption of PPCA.

Hence, to eliminate the i.i.d. assumption for relational data, one direct way is to use a non-identity covariance matrix $\Phi$ for the distribution of X in (4). This $\Phi$ should reflect the physical meaning (semantics) of the relations between instances, which will be discussed in detail later. Similarly, we can also change the $I_N$ in (1) to $\Phi$ for $\Upsilon$ to eliminate the i.i.d. assumption for the noise process.

4.1.1 Relational Covariance Construction

Because the covariance matrix $\Phi$ in PRPCA is constructed from the relational information in the data, we refer to it as the relational covariance here.

The goal of PCA and PPCA is to find those principal axes onto which the retained variance under projection is maximal [13, 21]. For one specific X, the retained variance is $\mathrm{tr}[XX^T]$. If we rewrite $p(X)$ in (1) as $p(X) = \frac{\exp\{\mathrm{tr}[-\frac{1}{2}XX^T]\}}{(2\pi)^{qN/2}} = \frac{\exp\{-\frac{1}{2}\mathrm{tr}[XX^T]\}}{(2\pi)^{qN/2}}$, we have the following observation:
Observation 1 For PPCA, the larger the retained variance of X, i.e., the more X approaches the destination point, the lower is the probability density at X given by the prior.

Here, the destination point refers to the point where the goal of PPCA is achieved, i.e., where the retained variance is maximal. Moreover, we use the retained variance as a measure to define the gap between two different points: the smaller the gap between the retained variances of two points, the more they approach each other.

Because the design principle of PRPCA is similar to that of PPCA, our working hypothesis here is that Observation 1 can also guide us in designing the relational covariance of PRPCA. Its effectiveness will be empirically verified in Section 5.

In PRPCA, we assume that the attributes of two linked instances are positively correlated.² Under this assumption, the ideal goal of PRPCA should be to make the latent representations of two instances as close as possible if there exists a relation (link) between them. Hence, the measure defining the gap between two points is the closeness of the linked instances, i.e., the sum of the Euclidean distances between the linked instances. Based on Observation 1, the more X approaches the destination point, the lower should be the probability density at X given by the prior. Hence, under the latent space representation X, the closer the linked instances are, the lower should be the probability density at X given by the prior. We will prove that if we set $\Phi = \Delta^{-1}$, where $\Delta \triangleq \gamma I_N + (I_N + A)^T(I_N + A)$ with $\gamma$ being typically a very small positive number to make $\Delta \succ 0$, we can get an appropriate prior for PRPCA. Note that $A_{ij} = 1$ if there exists a relation between instances $i$ and $j$, and otherwise $A_{ij} = 0$. Because $A^T = A$, we can also express $\Delta$ as $\Delta = \gamma I_N + (I_N + A)(I_N + A)$.

Let $\tilde{D}$ denote a diagonal matrix whose diagonal elements are $\tilde{D}_{ii} = \sum_j A_{ij}$. It is easy to prove that $(AA)_{ii} = \tilde{D}_{ii}$. Let $B = AA - \tilde{D}$, which means that $B_{ij} = (AA)_{ij}$ if $i \neq j$ and $B_{ii} = 0$. We can get $\Delta = (1+\gamma)I_N + 2A + AA = (1+\gamma)I_N + \tilde{D} + (2A + B)$. Because $B_{ij} = \sum_{k=1}^{N} A_{ik}A_{kj}$ for $i \neq j$, we can see that $B_{ij}$ is the number of paths, each with path length 2, from instance $i$ to instance $j$ in the original adjacency graph A. Because the attributes of two linked instances are positively correlated, $B_{ij}$ actually reflects the degree of correlation between instance $i$ and instance $j$. Let us take the paper citation graph as an example to illustrate this. The existence of a citation relation between two papers often implies that they are about the same topic. If paper $i$ cites paper $k$ and paper $k$ cites paper $j$, it is highly likely that paper $i$ and paper $j$ are about the same topic. If there exists another paper $a \neq k$ linking both paper $i$ and paper $j$ as well, the confidence that paper $i$ and paper $j$ are about the same topic will increase. Hence, the larger $B_{ij}$ is, the stronger is the correlation between instance $i$ and instance $j$. Because $B_{ij} = \sum_{k=1}^{N} A_{ik}A_{kj} = A_{*i}^T A_{*j}$, $B_{ij}$ can also be seen as the similarity between the link vectors of instance $i$ and instance $j$. Therefore, B can be seen as a weight matrix (corresponding to a weight graph) derived from the original adjacency matrix A, and B is also consistent with the physical meaning underlying A.

Letting $G = 2A + B$,³ we can find that G actually combines the original graph reflected by A and the derived graph reflected by B to get a new graph, and puts a weight $2A_{ij} + B_{ij}$ on the edge between instance $i$ and instance $j$ in the new graph. The new weight graph reflected by G is also consistent with the physical meaning underlying A. Letting $L \triangleq D - G$, where D is a diagonal matrix whose diagonal elements are $D_{ii} = \sum_j G_{ij}$ and L is called the Laplacian matrix [6] of G, we can get $\Delta = (1+\gamma)I_N + \tilde{D} + D - L$. If we define another diagonal matrix $\hat{D} \triangleq (1+\gamma)I_N + \tilde{D} + D$, we can get $\Delta = \hat{D} - L$. Then we have

$$\mathrm{tr}[X \Delta X^T] = \sum_{i=1}^{N} \hat{D}_{ii}\,\|X_{*i}\|^2 - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{ij}\,\|X_{*i} - X_{*j}\|^2. \qquad (5)$$

² Links with other physical meanings, such as the directed links in web graphs [25], can be transformed into links satisfying the assumption in PRPCA via some preprocessing strategies. One such strategy, used to preprocess the WebKB data set [8], will be given as an example in Section 5.

³ This means that we put a 2:1 ratio between A and B. Other ratios can be obtained by setting $\Delta = \gamma I_N + (\alpha I_N + A)(\alpha I_N + A) = (\gamma + \alpha^2) I_N + \tilde{D} + (2\alpha A + B)$. Preliminary results show that PRPCA is not sensitive to $\alpha$ as long as $\alpha$ is not too large, but we omit the detailed results here because they are beyond the scope of this paper.
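As an illustrative aside (not part of the original paper, with variable names of our own choosing), the following NumPy sketch builds $\Delta$ from a toy adjacency matrix following the construction above, and numerically checks both the decomposition $\Delta = \hat{D} - L$ and the identity (5).

```python
import numpy as np

def relational_covariance(A, gamma=1e-6):
    """Delta = gamma*I_N + (I_N + A)^T (I_N + A) for a symmetric 0/1 adjacency
    matrix A with zero diagonal, as constructed in Section 4.1.1."""
    N = A.shape[0]
    I = np.eye(N)
    return gamma * I + (I + A).T @ (I + A)

# Toy undirected graph on 4 instances (no self-links).
A = np.array([[0., 1., 1., 0.],
              [1., 0., 0., 0.],
              [1., 0., 0., 1.],
              [0., 0., 1., 0.]])
gamma = 1e-6
N = A.shape[0]
Delta = relational_covariance(A, gamma)

Dtilde = np.diag(A.sum(axis=1))            # Dtilde_ii = sum_j A_ij = (A A)_ii
B = A @ A - Dtilde                         # B_ij = number of length-2 paths from i to j
G = 2 * A + B                              # combined weight graph
D = np.diag(G.sum(axis=1))
L = D - G                                  # Laplacian matrix of G
Dhat = (1 + gamma) * np.eye(N) + Dtilde + D
assert np.allclose(Delta, Dhat - L)        # Delta = Dhat - L

# Numerical check of Eq. (5) for a random latent matrix X (q x N).
q = 3
X = np.random.default_rng(1).standard_normal((q, N))
lhs = np.trace(X @ Delta @ X.T)
rhs = sum(Dhat[i, i] * X[:, i] @ X[:, i] for i in range(N)) \
      - 0.5 * sum(G[i, j] * np.sum((X[:, i] - X[:, j]) ** 2)
                  for i in range(N) for j in range(N))
assert np.isclose(lhs, rhs)
```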
Letting $\Phi = \Delta^{-1}$, we can get $p(X) = \frac{\exp\{\mathrm{tr}[-\frac{1}{2}X\Delta X^T]\}}{(2\pi)^{qN/2}\,|\Delta|^{-q/2}} = \frac{\exp\{-\frac{1}{2}\mathrm{tr}[X\Delta X^T]\}}{(2\pi)^{qN/2}\,|\Delta|^{-q/2}}$.

The first term $\sum_{i=1}^{N}\hat{D}_{ii}\|X_{*i}\|^2$ in (5) can be treated as a measure of the weighted variance of all the instances in the latent space. We can see that the larger $\hat{D}_{ii}$ is, the more weight is put on instance $i$, which is reasonable because $\hat{D}_{ii}$ mainly reflects the degree of instance $i$ in the graph.

It is easy to see that, for those latent representations having a fixed value of the weighted variance $\sum_{i=1}^{N}\hat{D}_{ii}\|X_{*i}\|^2$, the closer the latent representations of two linked instances are, the larger is their contribution to $\mathrm{tr}[X\Delta X^T]$, and subsequently the smaller is their contribution to $p(X)$. This means that under the latent space representation X, the closer the linked instances are, the lower is the probability density at X given by the prior. Hence, we can get an appropriate prior for X by setting $\Phi = \Delta^{-1}$ in (4).

4.1.2 Model

With the constructed relational covariance $\Phi$, the generative model of PRPCA is defined as follows:

$$\Upsilon \sim \mathcal{N}_{d,N}(0,\ \sigma^2 I_d \otimes \Phi), \qquad X \sim \mathcal{N}_{q,N}(0,\ I_q \otimes \Phi), \qquad T = WX + \mu e^T + \Upsilon,$$

where $\Phi = \Delta^{-1}$. We can further obtain the following results:

$$T \mid X \sim \mathcal{N}_{d,N}(WX + \mu e^T,\ \sigma^2 I_d \otimes \Phi), \qquad T \sim \mathcal{N}_{d,N}\big(\mu e^T,\ (WW^T + \sigma^2 I_d) \otimes \Phi\big). \qquad (6)$$

The graphical model of PRPCA is illustrated in Figure 1(b), from which we can see that the difference between PRPCA and PPCA lies solely in the difference between $\Phi$ and $I_N$. Comparing (6) to (2), we can find that the observations of PPCA are sampled independently while those of PRPCA are sampled with correlation. In fact, PPCA may be seen as a degenerate case of PRPCA, as detailed below in Remark 1:

Remark 1 When the i.i.d. assumption holds, i.e., all $A_{ij} = 0$, PRPCA degenerates to PPCA by setting $\gamma = 0$. Note that the only role that $\gamma$ plays is to make $\Delta \succ 0$. Hence, in our implementation, we always set $\gamma$ to a very small positive value, such as $10^{-6}$. Actually, we may even set $\gamma$ to 0, because $\Delta$ does not have to be pd. When $\Delta \succeq 0$, we say T follows a singular matrix variate normal distribution [11], and all the derivations for PRPCA are still correct. In our experiments, we find that the performance under $\gamma = 0$ is almost the same as that under $\gamma = 10^{-6}$. Further deliberation is beyond the scope of this paper.

As in PPCA, we set $C = WW^T + \sigma^2 I_d$. Then the log-likelihood of the observation matrix T in PRPCA is

$$\mathcal{L}_1 = \ln p(T) = -\frac{N}{2}\Big[d\ln(2\pi) + \ln|C| + \mathrm{tr}(C^{-1}H)\Big] + c, \qquad (7)$$

where $c = -\frac{d}{2}\ln|\Phi|$ can be seen as a constant independent of the parameters $\mu$, W and $\sigma^2$, and $H = \frac{(T - \mu e^T)\Delta(T - \mu e^T)^T}{N}$. It is interesting to compare (7) with (3). We can find that to learn the parameters W and $\sigma^2$, the only difference between PRPCA and PPCA lies in the difference between H and S. Hence, all the learning techniques derived previously for PPCA are also potentially applicable to PRPCA simply by substituting S with H.

4.2 Learning

By setting the gradient of $\mathcal{L}_1$ with respect to $\mu$ to 0, we can get the maximum-likelihood estimator (MLE) for $\mu$ as follows: $\mu = \frac{T\Delta e}{e^T\Delta e}$. As in PPCA [21], we devise two methods to learn W and $\sigma^2$ in PRPCA, one based on a closed-form solution and the other based on EM.
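As a final illustrative sketch (again not from the original paper, with assumed names), the snippet below computes the MLE $\mu = T\Delta e / (e^T\Delta e)$ and the matrix H from (7), and verifies that with $\Delta = I_N$ they reduce to the sample mean and the sample covariance S used in the PPCA log-likelihood (3).

```python
import numpy as np

def prpca_mu_H(T, Delta):
    """Return the MLE mu = T Delta e / (e^T Delta e) and
    H = (T - mu e^T) Delta (T - mu e^T)^T / N from Section 4.2 and Eq. (7).

    T     : d x N content matrix
    Delta : N x N relational covariance matrix (Phi^{-1})
    """
    d, N = T.shape
    e = np.ones((N, 1))
    mu = (T @ Delta @ e) / (e.T @ Delta @ e)   # d x 1 column vector
    Tc = T - mu @ e.T                          # T - mu e^T
    H = Tc @ Delta @ Tc.T / N
    return mu, H

# Sanity check: with Delta = I_N, mu is the sample mean and H equals the
# sample covariance S used in the PPCA log-likelihood (3).
rng = np.random.default_rng(0)
d, N = 4, 6
T = rng.standard_normal((d, N))
mu, H = prpca_mu_H(T, np.eye(N))
Tc = T - T.mean(axis=1, keepdims=True)
assert np.allclose(mu.ravel(), T.mean(axis=1))
assert np.allclose(H, Tc @ Tc.T / N)
```

In line with the remark above, any PPCA learning routine for W and $\sigma^2$ can then be reused for PRPCA by passing H in place of S.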