Blocking-based Neighbor Sampling for Large-scale Graph Neural Networks

Kai-Lang Yao and Wu-Jun Li∗
National Key Laboratory for Novel Software Technology
Collaborative Innovation Center of Novel Software Technology and Industrialization
Department of Computer Science and Technology, Nanjing University, China
yaokl@lamda.nju.edu.cn, liwujun@nju.edu.cn

Abstract

The exponential increase in computation and memory complexity with the depth of the network has become the main impediment to the successful application of graph neural networks (GNNs) on large-scale graphs, such as graphs with hundreds of millions of nodes. In this paper, we propose a novel neighbor sampling strategy, dubbed blocking-based neighbor sampling (BNS), for efficient training of GNNs on large-scale graphs. Specifically, BNS adopts a policy to stochastically block the ongoing expansion of neighboring nodes, which can reduce the rate of the exponential increase in computation and memory complexity of GNNs. Furthermore, a reweighted policy is applied to graph convolution to adjust the contribution of blocked and non-blocked neighbors to central nodes. We theoretically prove that BNS provides an unbiased estimation of the original graph convolution operation. Extensive experiments on three benchmark datasets show that, on large-scale graphs, BNS is 2×∼5× faster than state-of-the-art methods when achieving the same accuracy. Moreover, even on small-scale graphs, BNS also demonstrates the advantage of low time cost.

1 Introduction

Graphs have been widely used for describing unstructured data in real applications such as social networks, brain networks, molecular graphs, and knowledge graphs. Edges in graphs depict the complex relationships between samples, and rich relational information between samples is contained in graphs.
Making good use of the rich relational information between samples in graphs has great potential for boosting the performance of traditional machine learning methods, which are mainly designed for modeling independent and identically distributed (i.i.d.) data. In addition, graph data is now widely available in many applications. Therefore, developing advanced graph learning algorithms is a topic of great interest.

∗Corresponding author

Among various algorithms of representation learning for graphs, graph neural networks (GNNs) [Gori et al., 2005; Bruna et al., 2014] have recently become the most successful and popular ones, due to their powerful ability to model complex relationships between samples. Although many advanced GNN models [Kipf and Welling, 2017; Hamilton et al., 2017; Velickovic et al., 2018] have been proposed, most of them are limited to successful application on small-scale graphs (e.g., graphs with hundreds of thousands of nodes). There are significant challenges in applying existing GNN methods to applications with large-scale graphs (e.g., graphs with hundreds of millions of nodes) because of the expensive computation and memory cost during the training process. Due to the iteratively dependent nature of nodes in GNNs, the number of nodes supporting the computation of the output layer increases exponentially with the depth of the network. Hence, the computation and memory complexity grow exponentially. Moreover, recent works [Li et al., 2019; Verma and Zhang, 2020; Chen et al., 2020c] show the potential to improve the performance of GNN models as the network becomes deeper, which will undoubtedly exacerbate the problem of expensive cost on large-scale graphs. Nowadays, in order to speed up the training process, it is a dominant trend to perform training on GPUs. However, many GPUs have limited graphics memory, which hinders GNN models from training with a large batch size and consequently leads to a sharp increase in training time.
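The blow-up described above is easy to see numerically. The following is a minimal sketch (our own illustration, not from the paper): with an average of s neighbors per node, a single output node of an L-layer GNN can depend on up to 1 + s + ··· + s^L supporting nodes, ignoring overlaps between neighborhoods.

```python
# Our own toy illustration of the receptive-field blow-up with depth:
# with s neighbors per node, an L-layer GNN may touch 1 + s + ... + s^L
# nodes to compute a single output (ignoring overlaps between neighborhoods).
def supporting_nodes(s: int, L: int) -> int:
    return sum(s ** l for l in range(L + 1))

for L in (1, 2, 3, 4, 5):
    print(L, supporting_nodes(10, L))
```

With s = 10, a depth of 5 already touches over a hundred thousand nodes per output node, which is why full-neighborhood mini-batch training quickly becomes infeasible.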
Solutions for the above problem mainly include model simplification methods and sampling-based methods. For model simplification methods [Wu et al., 2019; Klicpera et al., 2019; Chen et al., 2020b], the main idea is to remove the nonlinear transformation between graph convolution layers such that the graph convolution on node features can be preprocessed before training. Although model simplification methods are efficient in training, as stated in [Chen et al., 2020a], it is still an open question whether simplified GNNs’ expressive power can match that of the original GNNs. For sampling-based methods, existing works can be broadly categorized into node-wise sampling [Hamilton et al., 2017; Chen et al., 2018a; Cong et al., 2020], layer-wise sampling [Chen et al., 2018b; Huang et al., 2018; Zou et al., 2019], and subgraph sampling [Chiang et al., 2019; Zeng et al., 2020]. For node-wise sampling, the main idea is to sample a number of neighbors for each node of each layer in a top-down
manner. For layer-wise sampling, the main idea is to independently sample a number of nodes from a candidate set for each layer based on the importance probabilities of nodes. All connections between the nodes of two adjacent layers are used to perform approximate graph convolution. For subgraph sampling, the main idea is to sample a subgraph and feed it to GNN models before each round of mini-batch training. Although the above sampling strategies are applicable to large-scale GNNs, they have some deficiencies or limitations in terms of accuracy, total time cost, or memory cost. For example, existing node-wise sampling strategies need to sample a large number of neighbors for high accuracy, which leads to a sharp increase in time cost. Layer-wise sampling strategies have a high time cost of preparing data (including sampling) and may suffer from sparse connections between two adjacent layers. Subgraph sampling strategies may also suffer from sparse connections in subgraphs.

In this paper, we propose a novel node-wise sampling strategy, called blocking-based neighbor sampling (BNS), for large-scale training of GNNs. The contributions of this paper are listed as follows:

• We propose a novel blocking mechanism in BNS to stochastically block the ongoing expansion of neighboring nodes, dramatically reducing the computation and memory complexity.

• We further propose a reweighted policy to adjust the contribution of blocked and non-blocked neighboring nodes to central nodes.

• We theoretically prove that BNS provides an unbiased estimation of the original graph convolution operation.

• Extensive experiments on large-scale graphs show that BNS is 2×∼5× faster than existing state-of-the-art methods when achieving the same accuracy. Even on small-scale graphs, BNS also demonstrates the advantage of low time cost.

2 Notations and Problem Definition

2.1 Notations

We use boldface uppercase letters, such as B, to denote matrices.
The $i$th row and the $j$th column of a matrix $\mathbf{B}$ are denoted as $\mathbf{B}_{i*}$ and $\mathbf{B}_{*j}$, respectively. $\mathbf{B}_{ij}$ denotes the element at the $i$th row and $j$th column of $\mathbf{B}$. $\|\mathbf{B}\|_0$ denotes the number of non-zero entries in $\mathbf{B}$, and $\|\mathbf{B}\|_F$ denotes the Frobenius norm of $\mathbf{B}$.

2.2 Problem Definition

Suppose we have a graph with $N$ nodes. Let $\mathbf{A} \in \{0,1\}^{N \times N}$ denote the adjacency matrix of the graph: $\mathbf{A}_{ij} = 1$ denotes that there exists an edge between node $i$ and node $j$, and $\mathbf{A}_{ij} = 0$ denotes that there is no edge between them. Let $\mathbf{X} \in \mathbb{R}^{N \times u}$ denote the node feature matrix, where $u$ denotes the dimension of node features. Suppose the average number of neighbors per node in the graph is $s$, and the mini-batch size of nodes at the output layer is $B$. We use $L$ to denote the number of layers of GNNs.

We take GCN [Kipf and Welling, 2017] as an example to describe the problem of the exponential increase in computation and memory complexity. Let $\mathbf{A}' = \mathbf{A} + \mathbf{I}$ and $\hat{\mathbf{A}} = \mathbf{D}^{-\frac{1}{2}} \mathbf{A}' \mathbf{D}^{-\frac{1}{2}}$, where $\mathbf{D}$ denotes the diagonal degree matrix of $\mathbf{A}'$ with $\mathbf{D}_{ii} = \sum_{j=1}^{N} \mathbf{A}'_{ij}$. Then GCN can be formulated as follows:

$$\mathbf{Z}^{(\ell)}_{i*} = \sum_{j \in \mathcal{N}(i)} \hat{\mathbf{A}}_{ij} \mathbf{H}^{(\ell-1)}_{j*}, \qquad \mathbf{H}^{(\ell)}_{i*} = f(\mathbf{Z}^{(\ell)}_{i*} \mathbf{W}^{(\ell)}), \qquad (1)$$

where $\mathbf{H}^{(0)} = \mathbf{X}$, $f(\cdot)$ is the activation function, $\mathcal{N}(i)$ denotes the set of neighbors of node $i$, and $\mathbf{W}^{(\ell)} \in \mathbb{R}^{r \times r}$ is a learnable parameter.

From (1), we can see that the output of a node at the $L$th layer iteratively depends on the information of its $1, \cdots, L$-hop neighbors. Such an iteratively dependent nature of nodes leads to the exponential increase in computation and memory complexity with the depth of the network. Let $r$ denote the feature dimension of hidden layers. Then, the computation and memory complexity during a mini-batch training step are $O(s^{L-1} \cdot (sBr + Br^2))$ and $O(Lr^2 + s^L \cdot Br)$, respectively.

3 Blocking-based Neighbor Sampling

In this section, we present the details of BNS. Firstly, we sample a fixed number of neighbors for each node at the current layer $\ell$.
Secondly, we adopt a policy to stochastically block the ongoing expansion of neighboring nodes at the preceding layers $\{1, \cdots, \ell-1\}$. Note that once a node is blocked, all its ways out to all other nodes are blocked, and it is trapped at its current position. Thirdly, after sampling finishes, reweighted graph convolution is performed to obtain the outputs, in which a reweighted policy is adopted to adjust the contribution of blocked and non-blocked neighbors to central nodes. A visual illustration of BNS is presented in Figure 1.

[Figure 1: A visual illustration of BNS. Solid circles refer to nodes. The node within the inner dashed circle refers to the node of the output layer. (a) Non-sampling: we assume each node has 5 neighbors. (b) BNS: black solid circles refer to blocked neighbors. The thickness of the solid lines connecting two nodes indicates the magnitude of the weights of nodes in reweighted graph convolution.]

3.1 Sampling Algorithm

The entire sampling process is performed in a top-down manner and is summarized in Algorithm 1. Let $\mathcal{V}_{in}$ denote a mini-batch of nodes in the output layer. Firstly, we sample $s_n$ neighbors for each node $i$ at layer $\ell$ in line 5. $\mathrm{Sample}(\mathcal{N}(i), s_n)$ in line 5 is an operation that uniformly
samples $s_n$ elements from $\mathcal{N}(i)$. Then, we randomly select $s_n \times \delta$ ($0 \le \delta \le 1$) nodes from $\mathcal{N}^{\ell}(i)$ in line 6 and stop sampling neighbors for them at the preceding layers $\{1, \cdots, \ell-1\}$ via the operations in line 7 and line 13. $\mathrm{Block}(\mathcal{N}^{\ell}(i), \delta)$ in line 6 is an operation that uniformly samples $|\mathcal{N}^{\ell}(i)| \times \delta$ elements from $\mathcal{N}^{\ell}(i)$ as blocked neighbors. $\mathcal{V}^{\ell}_{b}$ records the blocked nodes at the $\ell$th layer, which are used in subsequent processing steps. The operations in line 10 and line 11 ensure that the blocked nodes are mapped to the same feature space as non-blocked nodes.

3.2 Reweighted Graph Convolution

We first reformulate $\mathbf{Z}^{(\ell)}_{i*}$ in Equation (1) in an expectation form:

$$\mathbf{Z}^{(\ell)}_{i*} = |\mathcal{N}(i)| \cdot \mathbb{E}_{j \sim p(j \in \mathcal{N}(i)|i)}\,\hat{\mathbf{A}}_{ij} \mathbf{H}^{(\ell-1)}_{j*}, \qquad (2)$$

where $p(j \in \mathcal{N}(i)|i)$ is a uniform distribution over the neighbors of node $i$, and $|\mathcal{N}(i)|$ denotes the number of elements in $\mathcal{N}(i)$.

For blocked nodes, since their neighbor expansions are blocked, their estimation of Equation (2) is less precise (having larger variance) than that of non-blocked nodes. We can see that representations of blocked nodes carry little information about the input graph. Therefore, it is reasonable to increase the contribution of non-blocked nodes to the central nodes. We perform reweighted graph convolution to achieve this goal.

After Algorithm 1 is performed, reweighted graph convolution is formulated as follows. For readability, we denote $n^{\ell}_{i,1} = |\mathcal{N}^{\ell}_{b}(i)|$, $n^{\ell}_{i,2} = |\mathcal{N}^{\ell}_{nb}(i)|$ and $n_i = |\mathcal{N}(i)|$:

$$\rho^{\ell}_{i,1} = \rho \cdot \frac{n^{\ell}_{i,1} + n^{\ell}_{i,2}}{n^{\ell}_{i,2}}, \qquad \rho^{\ell}_{i,2} = (1-\rho) \cdot \frac{n^{\ell}_{i,1} + n^{\ell}_{i,2}}{n^{\ell}_{i,1}}, \qquad (3)$$

$$\tilde{\mathbf{A}}^{\ell}_{ij} = \rho^{\ell}_{i,1} \cdot \frac{n_i}{n^{\ell}_{i,1} + n^{\ell}_{i,2}} \cdot \hat{\mathbf{A}}_{ij}, \quad \forall j \in \mathcal{N}^{\ell}_{nb}(i) \ \text{and} \ i \in \mathcal{V}^{\ell}_{nb},$$

$$\tilde{\mathbf{A}}^{\ell}_{ij} = \rho^{\ell}_{i,2} \cdot \frac{n_i}{n^{\ell}_{i,1} + n^{\ell}_{i,2}} \cdot \hat{\mathbf{A}}_{ij}, \quad \forall j \in \mathcal{N}^{\ell}_{b}(i) \ \text{and} \ i \in \mathcal{V}^{\ell}_{nb},$$

$$\tilde{\mathbf{A}}^{\ell}_{ii} = n_i \cdot \hat{\mathbf{A}}_{ii}, \quad i \in \mathcal{V}^{\ell}_{b} \backslash \mathcal{V}^{\ell}_{nb},$$

$$\mathbf{Z}^{(\ell)}_{i*} \approx \sum_{j \in \mathcal{N}^{\ell}(i)} \tilde{\mathbf{A}}^{\ell}_{ij} \mathbf{H}^{(\ell-1)}_{j*} := \tilde{\mathbf{Z}}^{(\ell)}_{i*}, \quad \forall i \in \mathcal{V}^{\ell}_{nb} \cup \mathcal{V}^{\ell}_{b}, \qquad (4)$$

$$\mathbf{H}^{(\ell)}_{i*} = f(\tilde{\mathbf{Z}}^{(\ell)}_{i*} \mathbf{W}^{(\ell)}), \qquad (5)$$

where $\rho \in [0,1]$.
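The unbiasedness of this reweighting can be checked numerically. The sketch below is our own code (not the authors'), using toy scalar features for a single node $i$: per Equations (3)–(4), each sampled non-blocked neighbor is weighted by $\rho \cdot n_i / n^{\ell}_{i,2}$ and each blocked one by $(1-\rho) \cdot n_i / n^{\ell}_{i,1}$, and the Monte-Carlo average of the estimate should approach the exact sum from Equation (1).

```python
import random

# Monte-Carlo sanity check (our own sketch) of the reweighted estimator:
# a sampled non-blocked neighbor j gets weight rho * n_i / n_nb,
# a blocked one gets weight (1 - rho) * n_i / n_b.
random.seed(0)

n_i = 10
A_hat = [0.1 * (j + 1) for j in range(n_i)]   # toy normalized weights A_hat[i][j]
H = [float(j) for j in range(n_i)]            # toy scalar features H[j]
exact = sum(a * h for a, h in zip(A_hat, H))  # exact Z_i from Equation (1)

s_n, delta, rho = 4, 0.75, 0.5
n_b = int(s_n * delta)   # blocked neighbors per node, n_{i,1}
n_nb = s_n - n_b         # non-blocked neighbors per node, n_{i,2}

trials, acc = 100000, 0.0
for _ in range(trials):
    sampled = random.sample(range(n_i), s_n)    # uniform N^l(i)
    blocked = set(random.sample(sampled, n_b))  # uniform N_b^l(i)
    z = 0.0
    for j in sampled:
        if j in blocked:
            z += (1 - rho) * (n_i / n_b) * A_hat[j] * H[j]
        else:
            z += rho * (n_i / n_nb) * A_hat[j] * H[j]
    acc += z

print(round(exact, 2), round(acc / trials, 2))  # the two values should be close
```

Note that for each sampled neighbor, the expected weight over the blocking assignment is $(n_{i,2}/s_n)\,\rho\, n_i/n_{i,2} + (n_{i,1}/s_n)(1-\rho)\, n_i/n_{i,1} = n_i/s_n$, which recovers the plain uniform estimator in expectation regardless of $\rho$.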
Compared with the weight $(|\mathcal{N}(i)|/|\mathcal{N}^{\ell}(i)|) \cdot \hat{\mathbf{A}}_{ij}$ obtained by directly approximating the expectation in Equation (2), $\tilde{\mathbf{A}}^{\ell}_{ij}$ adopts different weights, $\rho^{\ell}_{i,1}$ and $\rho^{\ell}_{i,2}$, to adjust the contribution of non-blocked and blocked nodes to node $i$. In the following proposition, we prove that $\tilde{\mathbf{Z}}^{(\ell)}_{i*}$ is an unbiased estimation of $\mathbf{Z}^{(\ell)}_{i*}$, which makes our proposed reweighted graph convolution theoretically sound. In experiments, $\rho$ is set to 0.5 for convenience.

Proposition 1. Suppose $\mathbf{H}^{(\ell-1)}$ is given. If $\mathcal{N}^{\ell}(i)$ is uniformly sampled from $\mathcal{N}(i)$, $\mathcal{N}^{\ell}_{b}(i)$ is uniformly sampled from $\mathcal{N}^{\ell}(i)$, and $\rho \in [0,1]$, then $\tilde{\mathbf{Z}}^{(\ell)}_{i*}$ defined in Equation (4) is an unbiased estimation of $\mathbf{Z}^{(\ell)}_{i*}$.

Proof. The proof can be found in the Appendix.¹

¹The Appendix can be found in https://cs.nju.edu.cn/lwj/.

Algorithm 1 Sampling Algorithm
Require: mini-batch of nodes $\mathcal{V}_{in}$; the number of neighbors sampled for each node, $s_n$; the ratio of blocked neighbors per node, $\delta$.
Ensure: $\{(\mathcal{V}^{\ell}_{nb}, \mathcal{V}^{\ell}_{b}, \{(\mathcal{N}^{\ell}_{nb}(i), \mathcal{N}^{\ell}_{b}(i))\}_{i=1}^{N})\}_{\ell=1}^{L}$
1: $\mathcal{V}^{L}_{nb} = \mathcal{V}_{in}$, $\mathcal{V}_{b} = \emptyset$
2: Sample in a top-down manner:
3: for $\ell = L : 1$ do
4:   for $i \in \mathcal{V}^{\ell}_{nb}$ do
5:     $\mathcal{N}^{\ell}(i) = \mathrm{Sample}(\mathcal{N}(i), s_n)$
6:     $\mathcal{N}^{\ell}_{b}(i) = \mathrm{Block}(\mathcal{N}^{\ell}(i), \delta)$
7:     $\mathcal{N}^{\ell}_{nb}(i) = \mathcal{N}^{\ell}(i) \backslash \mathcal{N}^{\ell}_{b}(i)$
8:   end for
9:   for $i \in \mathcal{V}_{b}$ do
10:    $\mathcal{N}^{\ell}(i) = \mathcal{N}^{\ell}(i) \cup \{i\}$
11:    $\mathcal{N}^{\ell}_{b}(i) = \mathcal{N}^{\ell}_{b}(i) \cup \{i\}$
12:  end for
13:  $\mathcal{V}^{\ell-1}_{nb} = \bigcup_{i \in \mathcal{V}^{\ell}_{nb}} \mathcal{N}^{\ell}_{nb}(i)$
14:  $\mathcal{V}^{\ell-1}_{b} = \bigcup_{i \in \mathcal{V}^{\ell}_{nb}} \mathcal{N}^{\ell}_{b}(i)$
15:  $\mathcal{V}_{b} = \mathcal{V}_{b} \cup \mathcal{V}^{\ell-1}_{b}$
16: end for

3.3 Objective Function

Let $\mathcal{W} = \{\mathbf{W}^{(1)}, \cdots, \mathbf{W}^{(L)}\}$ denote the learnable parameters defined in Equation (5), and let $\hat{\mathbf{Y}} = \mathbf{H}^{(L)}$ denote the output of the GNN model. For multi-class classification, $f(\cdot)$ in the last layer denotes the softmax function, while for multi-label classification it denotes the sigmoid function. The objective function of BNS is formulated as follows:

$$\min_{\mathcal{W}} \sum_{i \in \mathcal{V}'} \sum_{c} -\mathbf{Y}_{ic} \log \hat{\mathbf{Y}}_{ic} + \frac{\lambda}{2} \cdot \sum_{\ell} \|\mathbf{W}^{(\ell)}\|^2_F, \qquad (6)$$

where $\lambda$ is a hyper-parameter for the regularization term on the parameters $\mathcal{W}$, and $\mathcal{V}'$ denotes the set of nodes in the training set.
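The top-down sampling in Algorithm 1 can be sketched compactly in Python. The code below is our own simplification (the function name `bns_sample` and the dict-of-lists graph representation are ours, not the paper's; the actual implementation uses PyTorch Geometric): it returns, per layer, the non-blocked frontier and each node's sampled non-blocked/blocked neighbor sets, mirroring lines 3–16.

```python
import random

# Our own compact sketch of Algorithm 1 (top-down blocking-based sampling).
# `graph` maps node -> list of neighbors; `bns_sample` is a hypothetical name.
def bns_sample(graph, batch, L, s_n, delta, seed=0):
    rng = random.Random(seed)
    V_nb, V_blocked = set(batch), set()   # line 1: V^L_nb = V_in, V_b = {}
    plan = []                             # per layer: (V^l_nb, {i: (N_nb, N_b)})
    for _ in range(L):                    # line 3: for l = L : 1
        nbrs = {}
        for i in V_nb:                    # lines 4-8: sample, then block a fraction
            cand = graph[i] if len(graph[i]) <= s_n else rng.sample(graph[i], s_n)
            blocked = set(rng.sample(cand, int(len(cand) * delta)))
            nbrs[i] = (set(cand) - blocked, blocked)
        for i in V_blocked:               # lines 9-12: blocked nodes get a self-loop
            nb, b = nbrs.get(i, (set(), set()))
            nbrs[i] = (nb, b | {i})
        plan.append((set(V_nb), nbrs))
        # lines 13-15: the next frontier comes from non-blocked parents only
        V_next_nb, V_next_b = set(), set()
        for i in V_nb:
            V_next_nb |= nbrs[i][0]
            V_next_b |= nbrs[i][1]
        V_blocked |= V_next_b
        V_nb = V_next_nb
    return plan
```

A node may be blocked via one parent and non-blocked via another; as in the paper, the two roles coexist and are reconciled by the reweighted convolution rather than by the sampler.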
3.4 Complexity Analysis

In this subsection, we compare the computation and memory complexity of different methods with those of BNS in a mini-batch training step, as summarized in Table 1. Existing node-wise sampling methods, NS [Hamilton et al., 2017], VRGCN [Chen et al., 2018a] and MVS-GNN [Cong et al., 2020], reduce the growth rate from $s$ to $s_n$, where $s_n$ is much smaller than $s$. In particular, VRGCN and MVS-GNN show that they can achieve accuracy comparable to NS with a smaller $s_n$. The layer-wise sampling method LADIES [Zou et al., 2019] and the subgraph sampling method GraphSAINT [Zeng et al., 2020] reduce the computation and memory complexity to a level that is linear in the depth of the network.

Although the above methods can achieve good performance in terms of accuracy, time cost, and memory cost on small-scale graphs (e.g., graphs with hundreds of thousands of nodes), they are not efficient or even not applicable for large-scale graphs (e.g., graphs with millions to hundreds of millions of nodes). Some problems and drawbacks of these methods are overlooked due to the
Method | Computation complexity | Memory complexity
Non-sampling [Kipf and Welling, 2017] | O(s^{L-1} · (sBr + Br^2)) | O(Lr^2 + s^L · Br)
NS [Hamilton et al., 2017] | O(s_n^{L-1} · (s_n Br + Br^2)) | O(Lr^2 + s_n^L · Br)
VRGCN [Chen et al., 2018a] | O(s_n^{L-1} · ((s_n + s) · Br + Br^2)) | O(Lr^2 + s_n^{L-1} · (s + s_n) Br)
MVS-GNN [Cong et al., 2020] | O(s_n^{L-1} · ((s_n + s) · Br + Br^2)) | O(Lr^2 + s_n^{L-1} · (s + s_n) Br)
LADIES [Zou et al., 2019] | O(L · (s_l/N)^2 · ‖A‖_0 + L s_l · r^2) | O(Lr^2 + L s_l · r)
GraphSAINT [Zeng et al., 2020] | O(L · (s_g/N)^2 · ‖A‖_0 + L s_g · r^2) | O(Lr^2 + L s_g · r)
BNS (ours) | O(s̃_n^{L-1} · (s_n Br + (δ/(1-δ) + 1) · Br^2)) | O(Lr^2 + s̃_n^{L-1} · s_n Br)

Table 1: Computation and memory complexity. s denotes the average number of neighbors per node in A. s_n denotes the average number of neighbors sampled for each node. s̃_n = s_n × (1 − δ), where δ denotes the ratio of blocked nodes in BNS. s_l denotes the average number of nodes per layer in layer-wise sampling. s_g denotes the average number of nodes per subgraph in subgraph sampling. B = |V_in| denotes the mini-batch size of the output layer. L is the number of layers in GNN models. r is the hidden dimension of the networks.

lack of systematic experimental analysis on large-scale graphs. For example, even with low computation complexity, VRGCN, MVS-GNN and LADIES have a high time cost of preparing data (including sampling) before each round of mini-batch training. In addition, VRGCN places a heavy burden on memory by storing historical representations of all nodes at each layer. MVS-GNN has the same complexity as the non-sampling method at the outer iteration, which might make training infeasible on large-scale graphs because of running out of graphics memory. GraphSAINT suffers from sparse connectivity in the sampled subgraphs. Moreover, GraphSAINT adopts the non-sampling strategy at the evaluation and testing stage, which is also inefficient on large-scale graphs.
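To make the asymptotic comparison in Table 1 concrete, the following rough calculator evaluates the NS and BNS computation-cost formulas. The parameter values below are illustrative assumptions for a sketch only, not measurements or settings from the paper.

```python
def ns_cost(s_n, L, B, r):
    """NS computation cost from Table 1: O(s_n^{L-1} * (s_n*B*r + B*r^2))."""
    return s_n ** (L - 1) * (s_n * B * r + B * r ** 2)

def bns_cost(s_n, delta, L, B, r):
    """BNS computation cost from Table 1:
    O(s~_n^{L-1} * (s_n*B*r + (delta/(1-delta) + 1) * B*r^2)),
    with s~_n = s_n * (1 - delta) non-blocked neighbors per node."""
    s_tilde = s_n * (1 - delta)
    return s_tilde ** (L - 1) * (s_n * B * r + (delta / (1 - delta) + 1) * B * r ** 2)

# Illustrative values: 5-layer model, hidden dim r = 128, batch size 1000,
# s_n = 10 sampled neighbors, half of them blocked (delta = 0.5).
ratio = ns_cost(10, 5, 1000, 128) / bns_cost(10, 0.5, 5, 1000, 128)
print(f"NS costs about {ratio:.1f}x more than BNS in this configuration")
```

Note that with δ = 0 no neighbors are blocked and the BNS formula reduces exactly to the NS formula; a larger δ shrinks the s̃_n^{L-1} growth factor at the price of the (δ/(1−δ) + 1) reweighting term.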
Similar to existing node-wise sampling methods, BNS reduces the growth rate from s to a small s̃_n, where s̃_n denotes the number of non-blocked neighbors per node. We will show that with a small s̃_n, BNS can achieve accuracy comparable to that of NS with a large s_n, while BNS has lower computation and memory complexity. Moreover, BNS has a low time cost of preparing data before each round of mini-batch training.

4 Experiments

In this section, we compare BNS with other baselines on five node-classification datasets. BNS is implemented on the PyTorch platform [Paszke et al., 2019] with the PyTorch Geometric library [Fey and Lenssen, 2019]. All experiments are run on an NVIDIA TitanXP GPU server with 12 GB of graphics memory.

4.1 Datasets

Ogbn-products, ogbn-papers100M and ogbn-proteins² are publicly available [Hu et al., 2020]. Ogbn-products is a large-scale dataset with millions of nodes. Ogbn-papers100M is a large-scale dataset with hundreds of millions of nodes. Ogbn-proteins is a small-scale dataset with hundreds of thousands of nodes. Amazon and Yelp, the datasets used in GraphSAINT, are also used for evaluation. Due to space limitation, the information and results on Amazon and Yelp are moved to the Appendix. The statistics of all datasets can be found in the Appendix.

² https://ogb.stanford.edu/docs/nodeprop/

4.2 Baselines and Settings

We compare BNS with VRGCN [Chen et al., 2018a], LADIES [Zou et al., 2019] and GraphSAINT [Zeng et al., 2020], which are the state-of-the-art methods with node-wise sampling, layer-wise sampling and subgraph sampling, respectively. Additionally, we compare BNS with the classical node-wise sampling method NS [Hamilton et al., 2017]. We do not compare BNS with MVS-GNN, since MVS-GNN adopts the non-sampling strategy for training at the outer iteration, which leads to the problem of running out of graphics memory. Besides, comparisons with model simplification methods are moved to the Appendix due to space limitation.
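The blocking policy behind the s̃_n growth rate discussed above can be illustrated with a minimal sketch. This is our own hypothetical code, not the authors' implementation: `adj` maps each node id to its neighbor list, and blocked neighbors are returned separately because they still enter the reweighted convolution but are not expanded at the next layer.

```python
import random

def sample_with_blocking(adj, batch_nodes, s_n, delta, rng=random):
    """Sample up to s_n neighbors per node, then randomly block a
    fraction delta of them so that only the remaining (non-blocked)
    neighbors expand the next layer's frontier."""
    expand, blocked = [], []
    for v in batch_nodes:
        sampled = rng.sample(adj[v], min(s_n, len(adj[v])))
        n_blocked = int(round(delta * len(sampled)))
        # rng.sample returns a random order, so the first n_blocked
        # entries form a uniform random subset to block.
        blocked.extend(sampled[:n_blocked])
        expand.extend(sampled[n_blocked:])
    return expand, blocked
```

Applying this recursively to `expand` for L − 1 hops grows the frontier as s̃_n^{L-1} = (s_n(1 − δ))^{L-1} rather than s_n^{L-1}, which is where the savings in Table 1 come from.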
Since the original implementations of the above baselines cannot directly scale to the benchmark datasets in this paper, we re-implement them according to the corresponding authors' codes. For a fair comparison, the implementations of all methods, including BNS, differ only in the sampling process. For all methods, the GNN model is instantiated with GraphSAGE [Hamilton et al., 2017], since it can achieve good performance on the benchmark datasets. Note that sampling strategies and settings during inference are the same as those in the training stage for all methods except GraphSAINT.

The hyper-parameters r, L, T (maximum number of epochs), λ and p (dropout probability) are independent of sampling strategies, and hence they are set to be the same for different sampling strategies on a specific dataset. Empirically, r is set to 128 on all datasets; L is set to 5 on both ogbn-proteins and ogbn-products, and to 3 on ogbn-papers100M. T is set to 100 on both ogbn-products and ogbn-papers100M, and to 1,000 on ogbn-proteins. The values of λ and p are obtained by tuning with NS on the benchmark datasets: on ogbn-products, λ = 5 × 10⁻⁶ and p = 0.1; on ogbn-papers100M, λ = 5 × 10⁻⁷ and p = 0.1; on ogbn-proteins, λ = 0 and p = 0. In BNS, we set ρ to 0.5 for convenience and do not tune it. Adam [Kingma and Ba, 2015] is used to optimize the model, and the learning rate η is set to 0.01. For all settings, experiments are run 10 times with different initializations, and the mean results of the 10 runs are reported.

4.3 Evaluation Criteria

The ultimate goal of sampling strategies for GNNs is to obtain high accuracy with a low time cost, not just to reduce time and memory cost to extreme cases at the expense of sac-
[Figure 2: eight panels of test-accuracy (%) versus training-time (s) curves comparing BNS (ours), NS, VRGCN, LADIES and GraphSAINT; row (a) ogbn-products and row (b) ogbn-papers100M, with panels for b′ = 25, b′ = 35, b′ = 45 and the minimum b′.]

Figure 2: Test accuracy curves on ogbn-products and ogbn-papers100M. Methods that need more than one day to obtain the curves are omitted in the figures. b′ = |V′|/B, where |V′| denotes the number of nodes in the training data and B is the batch size. In each row, the first three figures present the results of the first experimental setting in Section 4.3. The last figure presents the results of the second experimental setting.
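Before discussing Table 2 below, a quick arithmetic sanity check on its timing breakdown: the reported total training time should roughly equal T1 (data preparation) plus T2 (forward and backward propagation). The numbers here are copied from Table 2's ogbn-products columns.

```python
# method: (total_s, t1_s, t2_s) on ogbn-products, taken from Table 2.
timings = {
    "NS":         (5.5e3, 4.5e3, 9.6e2),
    "VRGCN":      (1.2e4, 1.1e4, 1.5e3),
    "LADIES":     (4.7e3, 4.5e3, 2.5e2),
    "GraphSAINT": (7.1e3, 4.5e3, 2.6e3),
    "BNS":        (9.1e2, 7.3e2, 1.8e2),
}
for name, (total, t1, t2) in timings.items():
    # Entries are rounded to two significant digits, so allow ~10% slack.
    assert abs(total - (t1 + t2)) / total < 0.10, name
speedup = timings["NS"][0] / timings["BNS"][0]
print(f"BNS total training time is about {speedup:.1f}x lower than NS")
```

The check passes for every row, and the NS/BNS ratio of total times is consistent with the speedups reported in the text.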
Methods | ogbn-products: Accuracy (%) ↑ | Time (s) ↓ | T1+T2 (s) | ogbn-papers100M: Accuracy (%) ↑ | Time (s) ↓ | T1+T2 (s)
NS | 78.64 ± 0.17 | 5.5 × 10³ | 4.5 × 10³ + 9.6 × 10² | 63.61 ± 0.13 | 2.5 × 10⁴ | 8.0 × 10³ + 1.7 × 10⁴
VRGCN | 77.07 ± 0.49 | 1.2 × 10⁴ | 1.1 × 10⁴ + 1.5 × 10³ | 63.34 ± 0.12 | 2.2 × 10⁴ | 7.0 × 10³ + 1.5 × 10⁴
LADIES | 78.96 ± 0.50 | 4.7 × 10³ | 4.5 × 10³ + 2.5 × 10² | 63.25 ± 0.21 | 2.5 × 10⁴ | 1.2 × 10⁴ + 1.3 × 10⁴
GraphSAINT | 78.95 ± 0.41 | 7.1 × 10³ | 4.5 × 10³ + 2.6 × 10³ | 61.60 ± 0.12 | 2.1 × 10⁴ | 8.0 × 10³ + 1.3 × 10⁴
BNS (ours) | 80.14 ± 0.27 | 9.1 × 10² | 7.3 × 10² + 1.8 × 10² | 63.88 ± 0.12 | 1.2 × 10⁴ | 4.3 × 10³ + 7.7 × 10³

Table 2: Results on ogbn-products and ogbn-papers100M. Boldface letters denote the best results. Time presented in tables denotes the total training time of one run. "T1" refers to the time cost of preparing data. "T2" refers to the time cost of performing forward and backward propagation. The results in tables are obtained under the second experimental setting in Section 4.3.

rificing accuracy. In most cases, reducing memory can also reduce the time cost, since the GNN model can perform training with a larger batch size when the graphics memory cost is lower. Hence, we omit the comparison of memory cost in the experiments. In a nutshell, the accuracy of the GNN model and the time cost during training are used to evaluate the performance of different methods.

One reasonable way to evaluate the performance of different methods is to compare the time cost when achieving the same accuracy. Since batch size has an important impact on time cost and accuracy, we design two kinds of experiments for fair comparison:

• The first experimental setting: On each dataset, all methods train the GNN model with the same batch size. All methods are run with the best setting that can achieve the best accuracy in this case.

• The second experimental setting: On each dataset, each method trains the GNN model with the maximum batch size that can achieve the best accuracy.
All methods are run with the best setting that can achieve the best accuracy.

Detailed settings of each method can be found in the Appendix.

4.4 Results

Results on ogbn-products and ogbn-papers100M are summarized in Figure 2 and Table 2, from which we can draw the following conclusions. Firstly, when achieving the same accuracy under different settings, BNS is faster than all other methods. For example, from Figure 2(a), we can see that BNS is approximately 4× ∼ 5× faster than GraphSAINT (second-best) when achieving an accuracy of 80% on ogbn-products. From Figure 2(b), we can see that BNS is approximately 2× faster than NS (second-best) when achieving an accuracy of 63.5% on ogbn-papers100M. Secondly, compared with other methods, BNS achieves the best accuracy with the minimum time cost. This can be drawn from Table 2 and is consistent with the results in Figure 2. Thirdly, VRGCN and LADIES have a high time cost for preparing data, which is even higher than the time cost of performing forward and backward propagation. Finally, from Table 2, we observe an interesting phenomenon: the accuracy of BNS does not decrease compared to that of NS, and is even higher. This phenomenon can be explained by the observations in JK-Net [Xu et al., 2018], i.e., it is important to enhance the influence of local neighborhoods on the central nodes; otherwise, the local information of the central nodes in the input graphs will be washed out in a few