Distributed Stochastic ADMM for Matrix Factorization Zhi-Qin Yu Xing-Jian Shi Ling Yan Shanghai Key Laboratory of Department of Computer Shanghai Key Laboratory of Scalable Computing and Science and Engineering Scalable Computing and Systems Hong Kong University of Systems Department of Computer Science and Technology, Department of Computer Science and Engineering China Science and Engineering Shanghai Jiao Tong University, xshiab@connect.ust.hk Shanghai Jiao Tong University, China China yzqfqyd@sjtu.edu.cn yling0718@sjtu.edu.cn Wu-Jun Li National Key Laboratory for Novel Software Technology Department of Computer Science and Technology Nanjing University,China liwujun@nju.edu.cn ABSTRACT models for recommender systems due to their promising per- Matrix factorization (MF)has become the most popular formance 8.In this big data era,more and more large-scale technique for recommender systems due to its promising data sets have emerged in many real-world recommender performance. Recently,distributed (parallel)MF models systems.Hence,parallel or distributed MF models with have received much attention from researchers of big da- the potential of high scalability have recently captured much ta community.In this paper,we propose a novel model, attention from researchers. called distributed stochastic alternating direction methods The basic idea of MF is to use the multiplication of two of multipliers (DS-ADMM),for large-scale MF problems latent matrices,the user matrix and the item matrix,to ap- DS-ADMM is a distributed stochastic variant of ADMM.In proximate the original rating matrix.Least square method particular,we first devise a new data split strategy to make is usually used to find a solution.In recent years,several the distributed MF problem fit for the ADMM framework. parallel models have been proposed for MF.These existing Then.a stochastic ADMM scheme is designed for learning models can be roughly divided into two main categories:al- Finally,we implement DS-ADMM based on message passing ternating least square (ALS)[23]based models and stochas- interface (MPD),which can run on clusters with multiple ma- tic gradient descent (SGD)based models. chines (nodes).Experiments on several data sets from rec- ALS [23]adopts the alternating learning strategy to up- ommendation applications show that our DS-ADMM model date one matrix with the other one fixed.With one of the can outperform other state-of-the-art distributed MF mod- matrices fixed,the optimization problem of MF can be re- els in terms of both efficiency and accuracy. duced to a least square problem on the other matrix,which can be further decomposed into several independent least Keywords square problems on the latent feature vector of each user or item.Hence,it is easy to design parallel strategies for ALS. Matrix Factorization,Recommender Systems,ADMM,Dis- which has been implemented in [23.However,the time tributed Computing,Stochastic Learning complexity for each iteration in ALS is cubic in k,where k is the number of latent features for each user or item. 1. INTRODUCTION The cyclic coordinate descent (CCD)method [13]adopt- Recommender systems try to recommend products (item- s coordinate descent strategy to improve the ALS method s)to customers (users)by utilizing the customers'historic by decreasing the time complexity for each iteration to be preferences.Matrix factorization (MF)[8]and its exten- linear in k.The CCD++21]further improves the efficien- sions [9,22,16,14,10,18 have become the most popular cy of CCD by using similar coordinate descent strategy but different updating sequence of the variables.Because both CCD and CCD++are based on ALS,they can also be easily Permission to make digital or hard copies of all or part of this work for personal or parallelized [21] classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full cita tion on the hrst page.Copyrights for components of this work owned by others than ACM must be honored.Abstracting with credit is permitted.To copy otherwise,or re- 1In existing literatures,distributed models refer to those im- publish,to post on servers or to redistribute to lists,requires prior specific permission plemented on clusters with multiple machines(nodes),while and/or a fee.Request permissions from permissions@acm.org. parallel models refer to those implemented either on multi- CM4.November 3-7.2014.Shanghai China core systems with a single node or on clusters.We will alsc Copyright2014ACM978-1-4503-2598-1/14/11$15.00. follow this tradition in this paper.The specific meaning of http:/k.doi.org/10.1145/2661829.2661996. parallel can be determined from the context in the paper
Distributed Stochastic ADMM for Matrix Factorization Zhi-Qin Yu Shanghai Key Laboratory of Scalable Computing and Systems Department of Computer Science and Engineering Shanghai Jiao Tong University, China yzqfqyd@sjtu.edu.cn Xing-Jian Shi Department of Computer Science and Engineering Hong Kong University of Science and Technology, China xshiab@connect.ust.hk Ling Yan Shanghai Key Laboratory of Scalable Computing and Systems Department of Computer Science and Engineering Shanghai Jiao Tong University, China yling0718@sjtu.edu.cn Wu-Jun Li National Key Laboratory for Novel Software Technology Department of Computer Science and Technology Nanjing University, China liwujun@nju.edu.cn ABSTRACT Matrix factorization (MF) has become the most popular technique for recommender systems due to its promising performance. Recently, distributed (parallel) MF models have received much attention from researchers of big data community. In this paper, we propose a novel model, called distributed stochastic alternating direction methods of multipliers (DS-ADMM), for large-scale MF problems. DS-ADMM is a distributed stochastic variant of ADMM. In particular, we first devise a new data split strategy to make the distributed MF problem fit for the ADMM framework. Then, a stochastic ADMM scheme is designed for learning. Finally, we implement DS-ADMM based on message passing interface (MPI), which can run on clusters with multiple machines (nodes). Experiments on several data sets from recommendation applications show that our DS-ADMM model can outperform other state-of-the-art distributed MF models in terms of both efficiency and accuracy. Keywords Matrix Factorization, Recommender Systems, ADMM, Distributed Computing, Stochastic Learning 1. INTRODUCTION Recommender systems try to recommend products (items) to customers (users) by utilizing the customers’ historic preferences. Matrix factorization (MF) [8] and its extensions [9, 22, 16, 14, 10, 18] have become the most popular Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. CIKM’14, November 3–7, 2014, Shanghai, China. Copyright 2014 ACM 978-1-4503-2598-1/14/11 ...$15.00. http://dx.doi.org/10.1145/2661829.2661996 . models for recommender systems due to their promising performance [8]. In this big data era, more and more large-scale data sets have emerged in many real-world recommender systems. Hence, parallel or distributed1 MF models with the potential of high scalability have recently captured much attention from researchers. The basic idea of MF is to use the multiplication of two latent matrices, the user matrix and the item matrix, to approximate the original rating matrix. Least square method is usually used to find a solution. In recent years, several parallel models have been proposed for MF. These existing models can be roughly divided into two main categories: alternating least square (ALS) [23] based models and stochastic gradient descent (SGD) based models. ALS [23] adopts the alternating learning strategy to update one matrix with the other one fixed. With one of the matrices fixed, the optimization problem of MF can be reduced to a least square problem on the other matrix, which can be further decomposed into several independent least square problems on the latent feature vector of each user or item. Hence, it is easy to design parallel strategies for ALS, which has been implemented in [23]. However, the time complexity for each iteration in ALS is cubic in k, where k is the number of latent features for each user or item. The cyclic coordinate descent (CCD) method [13] adopts coordinate descent strategy to improve the ALS method by decreasing the time complexity for each iteration to be linear in k. The CCD++ [21] further improves the efficiency of CCD by using similar coordinate descent strategy but different updating sequence of the variables. Because both CCD and CCD++ are based on ALS, they can also be easily parallelized [21]. 1 In existing literatures, distributed models refer to those implemented on clusters with multiple machines (nodes), while parallel models refer to those implemented either on multicore systems with a single node or on clusters. We will also follow this tradition in this paper. The specific meaning of parallel can be determined from the context in the paper
Due to its efficiency and ease of implementation,SGD has ple machines (nodes).Hence,DS-ADMM is scalable become one of the most popular optimization strategies for to handle large-scale data sets MF in recommender systems 8.The basic idea of SGD is to randomly select one rating each time from the rating matrix Experiments on several data sets from recommenda- and then use the gradient based on this selected rating to update the latent features.It is easy to see that SGD is tion applications show that not only can DS-ADMM outperform other SGD-based models,but it can also essentially a sequential method,which makes it difficult to outperform ALS-based models like CCD++in terms be parallelized. of both efficiency and accuracy. The main reason that SGD can not be directly parallelized is that two randomly selected ratings may share the same la- tent features corresponding to the same user or item.Hence, 2. BACKGROUND there exist conflicts between two processes or nodes which simultaneously update the same latent features.Recently In this section,we introduce the background of this pa several strategies have been proposed to parallelize SGD for per,including notations,MF formulation,ALS-based mod- MF.The Hogwild![11]model just ignores the conflicts by els.SGD-based models and ADMM. assuming that the probability of conflict is small when two ratings are randomly selected from a sparse rating matrix. 2.1 Notations However,the conflicts do exist in the learning procedure We use boldface uppercase letters like M to denote matri- which makes Hogwild!not effective enough [21.24.More- ces and boldface lowercase letters like m to denote vectors over,Hogwild!requires all the processes share the whole Mi.and M.;denote the ith row and the jth column of M training set which is hard to be satisfied in distributed sys- respectively.Mi;denotes the element at the ith row and tems.Hence,Hogwild!cannot be directly used in distribut- jth column in M.MT denotes the transpose of M,and ed systems. M denotes the inverse of M.tr()denotes the trace of a Distributed SGD (DSGD)4 utilizes the property that matrix.Ig is an identity matrix of size k x k.Assume there there exist several sub-blocks without overlapping rows and are m users and n items in the data set.We use RERmxm columns in the rating matrix.These sub-blocks are mutually to denote the rating matrix.Please note that there exist independent of each other,thus can be processed in parallel many missing entries in R.All the missing entries are filled by different processes or nodes at the same time.Exper- with0.Ve use2C{l,2,·,m}×{1,2,·,n}to denote iments in 21,24 have shown that DSGD can outperform the set of indices for the observed ratings.denotes the Hogwild!in terms of both efficiency and accuracy.However column indices of the observed ratings in the ith row of R. after a set of independent sub-blocks have been processed. and denotes the row indices of the observed ratings in the updated variables from all processes or nodes should be the jth column of R.URxm denotes the users'laten- synchronized before processing the other sets of independent t factors (matrix)with each column U.representing the sub-blocks.It is these frequent synchronization operations latent feature vector for user i,where k is the number of that make DSGD inefficient because the slowest node will latent factors for each user or item.Vx denotes the become the bottleneck of the whole system.Things go even items'latent factors (matrix)with each column V.i repre- worse if data skew exists,which is not rare in real applica- senting the latent feature vector for item j.P denotes the tions.Very recently,fast parallel SGD (FPSGD)[24]tries to total number of nodes in the cluster,and we use the letter p solve the issues in DSGD by changing the scheduler into an on the superscript like MP to denote the computer node id. asynchronous one,which has achieved better performance .llF denotes the Frobenius norm of a matrix or a vector. than DSGD.However,FPSGD can only be used in shared- memory systems with a single node.Hence,FPSGD is still 2.2 Matrix Factorization not scalable to handle large-scale problems. In this paper,a novel model,called distributed stochastic Matrix factorization (MF)can be formulated as the fol- alternating direction methods of multipliers (DS-ADMM). lowing optimization problem is proposed for large-scale MF problems.DS-ADMM is a distributed stochastic variant of ADMM [3.The main con- (Rij -UTV.j)2+AUTU.:+A2VT V.j tributions of DS-ADMM are briefly outlined as follows: i,1E2 (1) In DS-ADMM,a new data split (partition)strategy where AI and A2 are hyper-parameters for regularization. called LocalMFSplit is proposed to assign subsets of There are two categories of parallel models to solve the the whole set of ratings to different nodes in a cluster above MF problem,i.e.,the ALS-based models and SGD- and consequently divide the large-scale problem into based models,which will be briefly reviewed in the following several sub-problems.Our split strategy can make the subsections. distributed MF problem fit for the ADMM framework Furthermore,compared with existing split strategies in DSGD and CCD++,our split strategy can reduce syn- 2.3 ALS-based Parallel MF Models chronization and scheduling cost to improve efficiency. By adopting the alternating learning strategy,ALS 23 alternatively switches between updating U and updating V .A stochastic ADMM method is designed to perform with the other latent matrix fixed.With U fixed,the MF efficient learning for parameters problem can be decomposed into n independent least square problems,each of which corresponds to a column of the ma- DS-ADMM is implemented with message passing in- trix V.Similar m independent least square problems can terface (MPI),which can run on clusters with multi- be got by fixing V.Furthermore,each of these independent
Due to its efficiency and ease of implementation, SGD has become one of the most popular optimization strategies for MF in recommender systems [8]. The basic idea of SGD is to randomly select one rating each time from the rating matrix and then use the gradient based on this selected rating to update the latent features. It is easy to see that SGD is essentially a sequential method, which makes it difficult to be parallelized. The main reason that SGD can not be directly parallelized is that two randomly selected ratings may share the same latent features corresponding to the same user or item. Hence, there exist conflicts between two processes or nodes which simultaneously update the same latent features. Recently, several strategies have been proposed to parallelize SGD for MF. The Hogwild! [11] model just ignores the conflicts by assuming that the probability of conflict is small when two ratings are randomly selected from a sparse rating matrix. However, the conflicts do exist in the learning procedure, which makes Hogwild! not effective enough [21, 24]. Moreover, Hogwild! requires all the processes share the whole training set which is hard to be satisfied in distributed systems. Hence, Hogwild! cannot be directly used in distributed systems. Distributed SGD (DSGD) [4] utilizes the property that there exist several sub-blocks without overlapping rows and columns in the rating matrix. These sub-blocks are mutually independent of each other, thus can be processed in parallel by different processes or nodes at the same time. Experiments in [21, 24] have shown that DSGD can outperform Hogwild! in terms of both efficiency and accuracy. However, after a set of independent sub-blocks have been processed, the updated variables from all processes or nodes should be synchronized before processing the other sets of independent sub-blocks. It is these frequent synchronization operations that make DSGD inefficient because the slowest node will become the bottleneck of the whole system. Things go even worse if data skew exists, which is not rare in real applications. Very recently, fast parallel SGD (FPSGD) [24] tries to solve the issues in DSGD by changing the scheduler into an asynchronous one, which has achieved better performance than DSGD. However, FPSGD can only be used in sharedmemory systems with a single node. Hence, FPSGD is still not scalable to handle large-scale problems. In this paper, a novel model, called distributed stochastic alternating direction methods of multipliers (DS-ADMM), is proposed for large-scale MF problems. DS-ADMM is a distributed stochastic variant of ADMM [3]. The main contributions of DS-ADMM are briefly outlined as follows: • In DS-ADMM, a new data split (partition) strategy called LocalMFSplit is proposed to assign subsets of the whole set of ratings to different nodes in a cluster and consequently divide the large-scale problem into several sub-problems. Our split strategy can make the distributed MF problem fit for the ADMM framework. Furthermore, compared with existing split strategies in DSGD and CCD++, our split strategy can reduce synchronization and scheduling cost to improve efficiency. • A stochastic ADMM method is designed to perform efficient learning for parameters. • DS-ADMM is implemented with message passing interface (MPI), which can run on clusters with multiple machines (nodes). Hence, DS-ADMM is scalable to handle large-scale data sets. • Experiments on several data sets from recommendation applications show that not only can DS-ADMM outperform other SGD-based models, but it can also outperform ALS-based models like CCD++ in terms of both efficiency and accuracy. 2. BACKGROUND In this section, we introduce the background of this paper, including notations, MF formulation, ALS-based models, SGD-based models and ADMM. 2.1 Notations We use boldface uppercase letters like M to denote matrices and boldface lowercase letters like m to denote vectors. Mi∗ and M∗j denote the ith row and the jth column of M, respectively. Mij denotes the element at the ith row and jth column in M. MT denotes the transpose of M, and M−1 denotes the inverse of M. tr(·) denotes the trace of a matrix. Ik is an identity matrix of size k × k. Assume there are m users and n items in the data set. We use R ∈ R m×n to denote the rating matrix. Please note that there exist many missing entries in R. All the missing entries are filled with 0. We use Ω ⊂ {1, 2, · · · , m} × {1, 2, · · · , n} to denote the set of indices for the observed ratings. Ωi denotes the column indices of the observed ratings in the ith row of R, and Ω˜j denotes the row indices of the observed ratings in the jth column of R. U ∈ R k×m denotes the users’ latent factors (matrix) with each column U∗i representing the latent feature vector for user i, where k is the number of latent factors for each user or item. V ∈ R k×n denotes the items’ latent factors (matrix) with each column V∗j representing the latent feature vector for item j. P denotes the total number of nodes in the cluster, and we use the letter p on the superscript like Mp to denote the computer node id. k · kF denotes the Frobenius norm of a matrix or a vector. 2.2 Matrix Factorization Matrix factorization (MF) can be formulated as the following optimization problem: min U,V 1 2 X (i,j)∈Ω (Ri,j − U T ∗iV∗j ) 2 + λ1U T ∗iU∗i + λ2V T ∗jV∗j , (1) where λ1 and λ2 are hyper-parameters for regularization. There are two categories of parallel models to solve the above MF problem, i.e., the ALS-based models and SGDbased models, which will be briefly reviewed in the following subsections. 2.3 ALS-based Parallel MF Models By adopting the alternating learning strategy, ALS [23] alternatively switches between updating U and updating V with the other latent matrix fixed. With U fixed, the MF problem can be decomposed into n independent least square problems, each of which corresponds to a column of the matrix V. Similar m independent least square problems can be got by fixing V. Furthermore, each of these independent
problems has a closed-form solution in ALS: scheduler of DSGD into an asynchronous one.Its experi- ments show that FPSGD can achieve better efficiency and U.i(AimiIk Vo,Va)VR (2) accuracy than DSGD. V←(a2nLk+U,U,)-1UR, Both Hogwild!and FPSGD are only for shared memory systems with one single node and thus their scalability is where Vo,denotes a sub-matrix formed by the columns in limited.DSGD can be used for distributed systems while it V indexed by i,U,is similarly defined,mi=and costs too much on synchronization. n;=.Please note that all the missing entries in R 2.5 ADMM have been filled by zeros.The columns in both U and V can be independently updated by following (2).Hence,it is ADMM [3 is used to solve the constrained problems as easy to design the parallel strategy for ALS,which has been follows: implemented in 23. min f(x)+g(2) (4) Instead of optimizing the whole vector U.:or V.;at one X, time,CCD 13 adopts the coordinate descent method to s.t.: Ax+Bz=c. optimize each element of U.or V.separately,which can avoid the matrix inverse operation in(2).CCD++[21]fur- where f()and g()are functions,x and z are variables,A. B and c are known values. ther improves CCD's performance by changing the updating sequence in CCD.It rewrites UTV=U4 Vd.,and To solve the problem in (4),ADMM first gets the aug- mented Lagrangian as follows: updates one element in Ud.or Vd.each time by using simi- lar coordinate descent method in CCD.Changing the updat- L(x,z,y)=f(x)+g(z)+y"(Ax+Bz-c) ing sequence may improve the convergence rate,which has been verified by the experimental results in CCD++21. +lAx+Bz-cll2, (5) 2.4 SGD-based Parallel MF Models where y is the Lagrangian multiplier and p is the penalty parameter.The ADMM solution can be got be repeating The idea of SGD is to randomly select one rating index the following three steps: (i,j)from n each time,and then update the corresponding variables U.i an V.i as follows: xt+1←←argmin L(X,zt,ye): U对←Ui+(eV与-A1Ui), (3) zt+1←argmin L(Xt+1,z,y:)月 V对←-V*j+(U*i-入2V*), y+1y+p(Axt+1+BZt+1-c) where ej Rij-UnV.j,and n is the learning rate. where xt denotes the value of x at the tth iteration,ye Due to the demand of many large-scale problems,several and z are similarly defined.If f(x)or g(z)are separable, parallel SGD models have been proposed.Some of them. the corresponding steps of ADMM can be done in parallel. such as those described in 25 and 20,are not for MF Hence,ADMM can be used to design distributed learning problems.Here,we just focus on those parallel SGD models algorithms for large-scale problems 3. for MF,including Hogwild!11,DSGD 4 and FPSGD 24 In recent years,ADMM has captured more and more From (3),it is easy to find that conflicts exist between t- attention with wide applications,such as matrix comple- wo processes or nodes when their randomly selected ratings tion 5,compressive sensing 19,image restoration 6 and share either the same user index or the same item index response prediction [1].Moreover,many variants of ADMM Hogwild![11]allows overwriting each other's work when con- are also devised,including the stochastic and online exten- flicts happen.It also shows that if the optimization problem sions [15,12,17.However,to the best of our knowledge, is sparse enough,the Hogwild!will get a nearly optimal rate few works have been proposed to use stochastic ADMM for of convergence. distributed MF problems. DSGD [4]divides the whole rating matrix into P stra- ta and each stratum contains P mutually independent sub- 3. DISTRIBUTED STOCHASTIC ADMM blocks without sharing any column or row indices.Conse- quently,sub-blocks in the same stratum can be processed in FOR MATRIX FACTORIZATION parallel since they don't share any U.or V..One iteration In this section,we present the details of our DS-ADMM of DSGD is divided into P steps,in each of which DSGD model.We first introduce our data split strategy to divide picks a stratum containing P independent sub-blocks and the whole problem into several sub-problems.Then we pro- then processes these sub-blocks in parallel in a cluster of P pose a distributed ADMM framework to handle these sub- nodes with each node responsible for one sub-block.After problems in parallel.After that,a stochastic learning algo- all the P sub-blocks in each step are processed,the whole U rithm is designed to speed up the distributed ADMM frame- and V have been updated separately.They should be syn- work.Subsequently,we compare the scheduler of DS-ADMM chronized in order to let all nodes get the latest U and V with those of DSGD and CCD++.Finally,the complexity It is obvious that during one iteration of processing all the analysis of DS-ADMM will be provided. ratings in the whole matrix,P synchronization operations should be performed for DSGD.This frequent synchroniza- 3.1 Data Split Strategy tion will make DSGD inefficient because the slowest node In our data split strategy,we divide R and U into P sub- will become the bottleneck of the whole system. blocks according to users.More specifically,each sub-block FPSGD 24,which is proposed for shared-memory sys- will contain rows of R and columns of U.From (1),we tems,tries to improve the performance by changing the find that U and V are coupled together in the loss function
problems has a closed-form solution in ALS: U∗i ← (λ1miIk + VΩiV T Ωi ) −1VRT i∗, (2) V∗j ← (λ2nj Ik + UΩ˜ jU T Ω˜ j ) −1UR∗j , where VΩi denotes a sub-matrix formed by the columns in V indexed by Ωi, UΩ˜ j is similarly defined, mi = |Ωi| and nj = |Ω˜j |. Please note that all the missing entries in R have been filled by zeros. The columns in both U and V can be independently updated by following (2). Hence, it is easy to design the parallel strategy for ALS, which has been implemented in [23]. Instead of optimizing the whole vector U∗i or V∗j at one time, CCD [13] adopts the coordinate descent method to optimize each element of U∗i or V∗j separately, which can avoid the matrix inverse operation in (2). CCD++ [21] further improves CCD’s performance by changing the updating sequence in CCD. It rewrites UT V = Pk d=1 UT d∗Vd∗, and updates one element in Ud∗ or Vd∗ each time by using similar coordinate descent method in CCD. Changing the updating sequence may improve the convergence rate, which has been verified by the experimental results in CCD++ [21]. 2.4 SGD-based Parallel MF Models The idea of SGD is to randomly select one rating index (i, j) from Ω each time, and then update the corresponding variables U∗i an V∗j as follows: U∗i ← U∗i + η(ijV∗j − λ1U∗i), (3) V∗j ← V∗j + η(ijU∗i − λ2V∗j ), where ij = Rij − UT ∗iV∗j , and η is the learning rate. Due to the demand of many large-scale problems, several parallel SGD models have been proposed. Some of them, such as those described in [25] and [20], are not for MF problems. Here, we just focus on those parallel SGD models for MF, including Hogwild! [11], DSGD [4] and FPSGD [24]. From (3), it is easy to find that conflicts exist between two processes or nodes when their randomly selected ratings share either the same user index or the same item index. Hogwild! [11] allows overwriting each other’s work when con- flicts happen. It also shows that if the optimization problem is sparse enough, the Hogwild! will get a nearly optimal rate of convergence. DSGD [4] divides the whole rating matrix into P strata and each stratum contains P mutually independent subblocks without sharing any column or row indices. Consequently, sub-blocks in the same stratum can be processed in parallel since they don’t share any U∗i or V∗j . One iteration of DSGD is divided into P steps, in each of which DSGD picks a stratum containing P independent sub-blocks and then processes these sub-blocks in parallel in a cluster of P nodes with each node responsible for one sub-block. After all the P sub-blocks in each step are processed, the whole U and V have been updated separately. They should be synchronized in order to let all nodes get the latest U and V. It is obvious that during one iteration of processing all the ratings in the whole matrix, P synchronization operations should be performed for DSGD. This frequent synchronization will make DSGD inefficient because the slowest node will become the bottleneck of the whole system. FPSGD [24], which is proposed for shared-memory systems, tries to improve the performance by changing the scheduler of DSGD into an asynchronous one. Its experiments show that FPSGD can achieve better efficiency and accuracy than DSGD. Both Hogwild! and FPSGD are only for shared memory systems with one single node and thus their scalability is limited. DSGD can be used for distributed systems while it costs too much on synchronization. 2.5 ADMM ADMM [3] is used to solve the constrained problems as follows: min x,z f(x) + g(z) (4) s.t. : Ax + Bz = c, where f(·) and g(·) are functions, x and z are variables, A, B and c are known values. To solve the problem in (4), ADMM first gets the augmented Lagrangian as follows: L(x, z, y) = f(x) + g(z) + y T (Ax + Bz − c) + ρ 2 kAx + Bz − ck 2 , (5) where y is the Lagrangian multiplier and ρ is the penalty parameter. The ADMM solution can be got be repeating the following three steps: xt+1 ← argmin x L(x, zt, yt); zt+1 ← argmin z L(xt+1, z, yt); yt+1 ← yt + ρ(Axt+1 + Bzt+1 − c), where xt denotes the value of x at the tth iteration, yt and zt are similarly defined. If f(x) or g(z) are separable, the corresponding steps of ADMM can be done in parallel. Hence, ADMM can be used to design distributed learning algorithms for large-scale problems [3]. In recent years, ADMM has captured more and more attention with wide applications, such as matrix completion [5], compressive sensing [19], image restoration [6] and response prediction [1]. Moreover, many variants of ADMM are also devised, including the stochastic and online extensions [15, 12, 17]. However, to the best of our knowledge, few works have been proposed to use stochastic ADMM for distributed MF problems. 3. DISTRIBUTED STOCHASTIC ADMM FOR MATRIX FACTORIZATION In this section, we present the details of our DS-ADMM model. We first introduce our data split strategy to divide the whole problem into several sub-problems. Then we propose a distributed ADMM framework to handle these subproblems in parallel. After that, a stochastic learning algorithm is designed to speed up the distributed ADMM framework. Subsequently, we compare the scheduler of DS-ADMM with those of DSGD and CCD++. Finally, the complexity analysis of DS-ADMM will be provided. 3.1 Data Split Strategy In our data split strategy, we divide R and U into P subblocks according to users. More specifically, each sub-block will contain m P rows of R and m P columns of U. From (1), we find that U and V are coupled together in the loss function
Updating one of them needs the other's latest value,which where makes the problem hardly separable.To decouple U and V we keep a local item latent matrix for all items in each node, which is denoted as VP.Please note that VP is not a sub- y,V,O)=∑1P(vP,V,) block of V,but it has the same size with V.We also have a global item latent matrix which is denoted as V.Because P(VP,7,ΘP)= lv"-V+tr(e""(v -v))] only the local VP couples with U,we can independently up- date U and VP for each node.This split strategy can make Here,p is a hyper-parameter and O={OP=denotes the the MF problem fit for the distributed ADMM framework. Lagrangian multiplier. which will be introduced in the following subsection. If we define Our split strategy is called LocalMFSplit,which is briefly summarized in Algorithm 1.Note that the size of VP is LP(U,VP,ΘP,)=fP(U,VP)+P(VP,7,Θ) k x n,but that of UP is k x mp with mp being the number of columns (about assigned to node p. = ∑i(U,V) (i.j)ERP Algorithm 1 LocalMFSplit +Iv-VIB+(ePvP-V列 1:Input:R,P 2:for i=1:m do we can get 3: Generate a random number p from 1,2,..,Pl,and distribute row i of R to node p. L(U,V,O,V)=>LP(U,VP,OP,V). 4:end for 5:for p=1:P parallel do p=1 6:Allocate memory for Ur,VP andV The ADMM will solve this problem by repeating the fol- 7:end for lowing steps: U+,V+1←ainU,V",o,V,peL,2…Py 3.2 Distributed ADMM (10a) Based on our split strategy LocalMFSplit,the MF problem V+1←argmin L(U+1,e+1,O,可), (10b) in (1)can be reformulated as follows: Θ+1←Θ+p(V8+1-V+1),p∈{1,2,,P (RJ -UTVE,)2 (10c) p=1(ij)EOP It is easy to see that U,VP and OP can be locally updated on each node.Because the whole MF problem has been +AUTU.+A[VE,]TVE, (6) divided into P sub-problems which can be solved in parallel. our method is actually a distributed ADMM framework. st.:Vp-V=0:p∈{1,2,,Py 3.3 Stochastic Learning for Distributed ADMM where v={VP)P1,denotes the (i,j)indices of the To learn the parameters in (6),we just need to find the ratings located in node p.Note that here we omit the p in solutions in (10a),(10b)and (10c).After getting the op- UP for simplicity.It is not hard to determine whether U refers to the whole latent matrix or a sub-block UP located timal U+and [V,it is easy to solve the problem in (10b).More specifically,if we set =0,we can prove in node p from the context. If we define that =0.Hence,the update rule for V is: P 1 f(U,)=∑P(U,VP) (7) V+1=p∑V+ (11) p=】 P=i where The problem in(10c)directly shows the update rule,which can be computed locally and efficiently.Therefore,the key fP(U,VP)= ∑i(U,V) (8) learning part lies in how to efficiently solve the problem (i,j)EOp in (10a).In the following content of this subsection,we first (.V)(-v design a batch learning algorithm for the problem in(10a), and then a stochastic learning algorithm inspired by the batch learning is also designed to further improve the ef- +UTU..+Aa[V,lV ficiency. 3.3.1 Batch Learning we can transform the constrained problem in (6)to an un With oP and V:fixed,(10a)is an MF problem.How- constrained problem with augmented Lagrangian method. ever,we can not easily get the solution because U and VP and get the following objective function: are coupled together and the objective function of the MF problem is non-convex.To get an efficient solution,we use L(U,y,O,)=f(U,)+1y,7,O), (9) a technique similar to that in [12 to construct a surrogate
Updating one of them needs the other’s latest value, which makes the problem hardly separable. To decouple U and V, we keep a local item latent matrix for all items in each node, which is denoted as Vp . Please note that Vp is not a subblock of V, but it has the same size with V. We also have a global item latent matrix which is denoted as V. Because only the local Vp couples with U, we can independently update U and Vp for each node. This split strategy can make the MF problem fit for the distributed ADMM framework, which will be introduced in the following subsection. Our split strategy is called LocalMFSplit, which is briefly summarized in Algorithm 1. Note that the size of Vp is k × n, but that of Up is k × mp with mp being the number of columns (about m P ) assigned to node p. Algorithm 1 LocalMFSplit 1: Input: R, P 2: for i = 1 : m do 3: Generate a random number p from {1, 2, · · · , P}, and distribute row i of R to node p. 4: end for 5: for p = 1 : P parallel do 6: Allocate memory for Up , Vp and V 7: end for 3.2 Distributed ADMM Based on our split strategy LocalMFSplit, the MF problem in (1) can be reformulated as follows: min U,V,V 1 2 XP p=1 X (i,j)∈Ωp (Ri,j − U T ∗iV p ∗j ) 2 + λ1U T ∗iU∗i + λ2[V p ∗j ] T V p ∗j (6) s.t. : V p − V = 0; ∀p ∈ {1, 2, ..., P} where V = {Vp } P p=1, Ωp denotes the (i, j) indices of the ratings located in node p. Note that here we omit the p in Up for simplicity. It is not hard to determine whether U refers to the whole latent matrix or a sub-block Up located in node p from the context. If we define f(U, V) = XP p=1 f p (U, V p ), (7) where f p (U, V p ) = X (i,j)∈Ωp ˆfi,j (U∗i, V p ∗j ), (8) ˆfi,j (U∗i, V p ∗j ) = 1 2 (Ri,j − U T ∗iV p ∗j ) 2 + λ1U T ∗iU∗i + λ2[V p ∗j ] T V p ∗j , we can transform the constrained problem in (6) to an unconstrained problem with augmented Lagrangian method, and get the following objective function: L(U, V, O, V) = f(U, V) + l(V, V, O), (9) where l(V, V, O) = XP p=1 l p (V p , V, Θ p ), l p (V p , V, Θ p ) = ρ 2 kV p − Vk 2 F + tr [Θ p ] T (V p − V) . Here, ρ is a hyper-parameter and O = {Θp } P p=1 denotes the Lagrangian multiplier. If we define L p (U, V p , Θ p , V) = f p (U, V p ) + l p (V p , V, Θ p ) = X (i,j)∈Ωp ˆfi,j (U∗i, V p ∗j ) + ρ 2 kV p − Vk 2 F + tr [Θ p ] T (V p − V) , we can get L(U, V, O, V) = XP p=1 L p (U, V p , Θ p , V). The ADMM will solve this problem by repeating the following steps: Ut+1, V p t+1 ← argmin U,Vp L p (U, V p , Θ p t , Vt), ∀p ∈ {1, 2, ..., P} (10a) Vt+1 ← argmin V L(Ut+1, Vt+1, Ot, V), (10b) Θ p t+1 ← Θ p t + ρ(V p t+1 − Vt+1), ∀p ∈ {1, 2, ..., P}. (10c) It is easy to see that U, Vp and Θp can be locally updated on each node. Because the whole MF problem has been divided into P sub-problems which can be solved in parallel, our method is actually a distributed ADMM framework. 3.3 Stochastic Learning for Distributed ADMM To learn the parameters in (6), we just need to find the solutions in (10a), (10b) and (10c). After getting the optimal Ut+1 and {V p t+1}, it is easy to solve the problem in (10b). More specifically, if we set Θ p 0 = 0, we can prove that PP p=1 Θ p t = 0. Hence, the update rule for V is: Vt+1 = 1 P XP p=1 V p t+1. (11) The problem in (10c) directly shows the update rule, which can be computed locally and efficiently. Therefore, the key learning part lies in how to efficiently solve the problem in (10a). In the following content of this subsection, we first design a batch learning algorithm for the problem in (10a), and then a stochastic learning algorithm inspired by the batch learning is also designed to further improve the ef- ficiency. 3.3.1 Batch Learning With Θ p t and Vt fixed, (10a) is an MF problem. However, we can not easily get the solution because U and Vp are coupled together and the objective function of the MF problem is non-convex. To get an efficient solution, we use a technique similar to that in [12] to construct a surrogate
objective function,which is convex and can make U and V 3.3.2 Stochastic Learning decouple from each other.For each iteration of minimizing From (15),we can find that it will access all ratings relat- the constructed function,we can easily get the closed form ed to U.i to update each U.i,and the same also goes for solution of U and VP by setting their gradients to zero. updating each Vr;in (16).Hence,the batch learning algo- The surrogate objective function is defined as follows rithm presented above is not efficient,especially when the G(U,VP,⊙,V,lUt,V) number of ratings becomes very large.To further improve the efficiency,we propose a stochastic learning strategy for =g(U,VP,U,V)+P(VP,V,Θ),(12) the distributed ADMM,which is called DS-ADMM.In par- where ticular,the update rules for DS-ADMM is as follows: gP(U.VP,T|U:.VP) (U.i)+1=(U.i)+(eij(VP;)-Ai(U.i)), (18) fP(U:,VP)+tr[VtfP(U:,VP)(U-U:)] Tt 1-2(V) +tr[VVPfP(U:,VP)(VP-VP)] (V)+1=1+pm +e(Ui):+p(T-(Θ), (19) 二(U-U,怡+IvP-V), (13) where j =Rij-[(U)]T(VP:)t.It is easy to see that with re being a value which will be useful for specifying the the stochastic learning algorithm is derived from the batch step-size in the stochastic learning method introduced later, learning algorithm by treating only U.:and V,as variables and the function fP(U,VP)being defined in (8). in(14a). LEMMA 1.For an arbitrary positive value 82,we can al- By combining the split strategy and the update rules stat- ed above,we can get our DS-ADMM algorithm.The whole ways find a T:that makes GP()satisfy the followring two procedure of DS-ADMM is briefly listed in Algorithm 2. properties writhin the domain D={U,VPU.-U.< 62,IV-[v屋≤6} Algorithm 2 DS-ADMM G(U,VP,⊙,V,U,V)≥LP(U,VP,⊙,V), 1:Input:R,P,p,Marlter,入1,入2,To 2:Use Algorithm 1 to distribute R to P different nodes G(U,V,⊙,V,U,V)=LP(U,VR,⊙,V) 3:Randomly initialize Uo,Vo; The proof of Lemma 1 can be found in Appendix A. 4:Calculate Vo by (11) From Lemma 1,we can find that G()is an upper bound 5:Set=0. of LP(),and GP()=IP()at the point (U:,VP).Compared 6:for t=1:Marlter do with IP(),U and VP are decoupled in GP(),and GP()is 7: for p=1:P parallel do convex in (U,VP).Hence,it is much easier to optimize 8: for each Ri;in node p do G()than LP(). 9: Update U.:and V;by (18)and (19) Instead of optimizing the original function LP(),we op- 10: end for timize the surrogate function GP()in the first step of the 11: end for ADMM: 12: Update V by (11) 13: U1,Vargmin G(U,VVU,V?)(14a) for p=1:P parallel do 14: Update⊙Pby(10c) 15: end for The objective function in (14a)is convex in both U and 16: Update Tt VP.Hence,we can easily get the solution by setting the 17:end for gradients to be zero.The optimal solution is computed as follows: U+1=U:-*7fP(U,V), (15) 3.4 Scheduler Comparison n+V.-o:-v"(U:Vi)L (6) CCD++and DSGD are two state-of-the-art distributed Viti=I+pn T MF models.We compare the scheduler of our DS-ADMM with those of CCD++and DSGD to illustrate the synchro- LEMMA 2.By following the update rules in (15)and (16), nization cost. the original objective function LP()will not increase in each Figure 1 (a),(b)and (c)show the number of synchro- step.That is to say, nization operations in one iteration of CCD++,DSGD and LP(U+1,V+1,Θ,7)≤LP(U,VR,⊙,V). (17)】 DS-ADMM,respectively.Here,one iteration means all the training ratings are processed for one time.We can find The proof of Lemma 2 can be found in Appendix B. that CCD++needs 2k times of synchronization and DSGD By combining the update rules in (15),(16),(11)and needs P times.From Algorithm 2.we can easily find that (10c),we can get a batch learning algorithm for the problem DS-ADMM needs only one synchronization for each itera- in (6)with the distributed ADMM framework. tion,which is shown in line 12.This synchronization step THEOREM 1.Our batch learning algorithm will converge is used to gather all VP.Hence,it is obvious that the syn- PROOF.Based on Lemma 2,we can prove that the objec- chronization cost of our DS-ADMM is much less than those tive function L()in (9)will decrease in each iteration of AD- of CCD++and DSGD. MM.Furthermore,L()is lower bounded by 3.5 Complexity Analysis 20 Hence,our batch learning algorithm will converge.Because DS-ADMM updates all variables once by three steps.Step L()is not convex,it might converge to a local minimum. one updates U and VP.For each rating Rij,the time com-
objective function, which is convex and can make U and V decouple from each other. For each iteration of minimizing the constructed function, we can easily get the closed form solution of U and Vp by setting their gradients to zero. The surrogate objective function is defined as follows: G p (U, V p ,Θ p t , Vt, τt|Ut, V p t ) = g p (U, V p , τt|Ut, V p t ) + l p (V p , Vt, Θ p t ), (12) where g p (U,V p , τt|Ut, V p t ) = f p (Ut, V p t ) + tr[∇ T Uf p (Ut, V p t )(U − Ut)] + tr[∇ T Vp f p (Ut, V p t )(V p − V p t )] + 1 2τt (kU − Utk 2 F + kV p − V p t k 2 F ), (13) with τt being a value which will be useful for specifying the step-size in the stochastic learning method introduced later, and the function f p (U, Vp ) being defined in (8). Lemma 1. For an arbitrary positive value δ 2 , we can always find a τt that makes G p (·) satisfy the following two properties within the domain D = {U, Vp | kU∗i−[U∗i]tk 2 F ≤ δ 2 , kV p ∗j − [V p ∗j ]tk 2 F ≤ δ 2 }: G p (U, V p , Θ p t , Vt, τt|Ut, V p t ) ≥ L p (U, V p , Θ p t , Vt), G p (Ut, V p t , Θ p t , Vt, τt|Ut, V p t ) = L p (Ut, V p t , Θ p t , Vt). The proof of Lemma 1 can be found in Appendix A. From Lemma 1 , we can find that G p (·) is an upper bound of L p (·), and G p (·) = L p (·) at the point (Ut, V p t ). Compared with L p (·), U and Vp are decoupled in G p (·), and G p (·) is convex in (U, Vp ). Hence, it is much easier to optimize G p (·) than L p (·). Instead of optimizing the original function L p (·), we optimize the surrogate function G p (·) in the first step of the ADMM: Ut+1, V p t+1 ← argmin U,Vp G p (U, V p , Θ p t , Vt, τt|Ut, V p t ) (14a) The objective function in (14a) is convex in both U and Vp . Hence, we can easily get the solution by setting the gradients to be zero. The optimal solution is computed as follows: Ut+1 = Ut − τt ∗ ∇T Uf p (Ut, V p t ), (15) V p t+1 = τt 1 + ρτt [ V p t τt + ρVt − Θ p t − ∇T Vp f p (Ut, V p t )]. (16) Lemma 2. By following the update rules in (15) and (16), the original objective function L p (·) will not increase in each step. That is to say, L p (Ut+1, V p t+1, Θ p t , Vt) ≤ L p (Ut, V p t , Θ p t , Vt). (17) The proof of Lemma 2 can be found in Appendix B. By combining the update rules in (15), (16), (11) and (10c), we can get a batch learning algorithm for the problem in (6) with the distributed ADMM framework. Theorem 1. Our batch learning algorithm will converge. Proof. Based on Lemma 2, we can prove that the objective function L(·) in (9) will decrease in each iteration of ADMM. Furthermore, L(·) is lower bounded by − PP p=1 ||Θp||2 F 2ρ . Hence, our batch learning algorithm will converge. Because L(·) is not convex, it might converge to a local minimum. 3.3.2 Stochastic Learning From (15), we can find that it will access all ratings related to U∗i to update each U∗i, and the same also goes for updating each V p ∗j in (16). Hence, the batch learning algorithm presented above is not efficient, especially when the number of ratings becomes very large. To further improve the efficiency, we propose a stochastic learning strategy for the distributed ADMM, which is called DS-ADMM. In particular, the update rules for DS-ADMM is as follows: (U∗i)t+1 =(U∗i)t + τt(ij (V p ∗j )t − λ1(U∗i)t), (18) (V p ∗j )t+1 = τt 1 + ρτt [ 1 − λ2τt τt (V p ∗j )t +ij (U∗i)t + ρ(V∗j )t − (Θ p ∗j )t], (19) where ij = Rij − [(U∗i)t] T (V p ∗j )t. It is easy to see that the stochastic learning algorithm is derived from the batch learning algorithm by treating only U∗i and V p ∗j as variables in (14a). By combining the split strategy and the update rules stated above, we can get our DS-ADMM algorithm. The whole procedure of DS-ADMM is briefly listed in Algorithm 2. Algorithm 2 DS-ADMM 1: Input: R, P, ρ, MaxIter, λ1, λ2, τ0; 2: Use Algorithm 1 to distribute R to P different nodes. 3: Randomly initialize U0,V p 0 ; 4: Calculate V0 by (11) 5: Set Θ p 0 = 0. 6: for t = 1 : MaxIter do 7: for p = 1 : P parallel do 8: for each Ri,j in node p do 9: Update U∗i and V p ∗j by (18) and (19) 10: end for 11: end for 12: Update V by (11) 13: for p = 1 : P parallel do 14: Update Θp by (10c) 15: end for 16: Update τt 17: end for 3.4 Scheduler Comparison CCD++ and DSGD are two state-of-the-art distributed MF models. We compare the scheduler of our DS-ADMM with those of CCD++ and DSGD to illustrate the synchronization cost. Figure 1 (a), (b) and (c) show the number of synchronization operations in one iteration of CCD++, DSGD and DS-ADMM, respectively. Here, one iteration means all the training ratings are processed for one time. We can find that CCD++ needs 2k times of synchronization and DSGD needs P times. From Algorithm 2, we can easily find that DS-ADMM needs only one synchronization for each iteration, which is shown in line 12. This synchronization step is used to gather all Vp . Hence, it is obvious that the synchronization cost of our DS-ADMM is much less than those of CCD++ and DSGD. 3.5 Complexity Analysis DS-ADMM updates all variables once by three steps. Step one updates U and Vp . For each rating Rij , the time com-