Parallel and Distributed Stochastic Learning - Towards Scalable Learning for Big Data Intelligence

Wu-Jun Li (http://cs.nju.edu.cn/lwj)
LAMDA Group
Department of Computer Science and Technology, Nanjing University
National Key Laboratory for Novel Software Technology
liwujun@nju.edu.cn

Dec 10, 2016
Outline
1 Introduction
2 AsySVRG
3 SCOPE
4 Conclusion
Machine Learning

Supervised Learning: Given a set of training examples $\{(x_i, y_i)\}_{i=1}^{n}$, supervised learning tries to solve the following regularized empirical risk minimization problem:
$$\min_{w} f(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w),$$
where $f_i(w)$ is the loss function (plus some regularization term) defined on example $i$, and $w$ is the parameter to learn.

Examples (a code sketch follows this slide):
- Logistic regression: $f(w) = \frac{1}{n} \sum_{i=1}^{n} \big[ \log(1 + e^{-y_i x_i^T w}) + \frac{\lambda}{2} \|w\|^2 \big]$
- SVM: $f(w) = \frac{1}{n} \sum_{i=1}^{n} \big[ \max\{0, 1 - y_i x_i^T w\} + \frac{\lambda}{2} \|w\|^2 \big]$
- Deep learning models

Unsupervised Learning: Many unsupervised learning models, such as PCA and matrix factorization, can also be reformulated as similar problems.
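To make the notation above concrete, here is a minimal NumPy sketch (not part of the original slides) of the regularized logistic-regression objective $f(w)$ and of a single per-example gradient $\nabla f_i(w)$, which the stochastic methods on the following slides rely on. The function and variable names (logistic_objective, logistic_grad_i, X, y, lam) are hypothetical choices for this illustration.

```python
import numpy as np

def logistic_objective(w, X, y, lam):
    # f(w) = (1/n) * sum_i [ log(1 + exp(-y_i x_i^T w)) + (lam/2) * ||w||^2 ]
    margins = y * (X @ w)                         # y_i * x_i^T w for every example
    loss = np.logaddexp(0.0, -margins).mean()     # numerically stable log(1 + exp(-margin))
    return loss + 0.5 * lam * np.dot(w, w)

def logistic_grad_i(w, X, y, lam, i):
    # Gradient of f_i(w) = log(1 + exp(-y_i x_i^T w)) + (lam/2) * ||w||^2
    xi, yi = X[i], y[i]
    sigma = 1.0 / (1.0 + np.exp(yi * (xi @ w)))   # weight on the loss part of the gradient
    return -yi * sigma * xi + lam * w
```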
Machine Learning for Big Data

For big data applications, first-order methods have become much more popular than higher-order methods for learning (optimization). Gradient descent methods are the most representative first-order methods.

(Deterministic) gradient descent (GD):
$$w_{t+1} \leftarrow w_t - \eta_t \Big[ \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(w_t) \Big],$$
where $t$ is the iteration number.
- Linear convergence rate: $O(\rho^t)$
- Iteration cost is $O(n)$

Stochastic gradient descent (SGD): In the $t$-th iteration, randomly choose an example $i_t \in \{1, 2, \ldots, n\}$ and then update
$$w_{t+1} \leftarrow w_t - \eta_t \nabla f_{i_t}(w_t)$$
- Iteration cost is $O(1)$
- The convergence rate is sublinear: $O(1/t)$
(A sketch contrasting the two update rules follows this slide.)
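The sketch below contrasts the two update rules above, assuming a callable grad_fi(i, w) that returns $\nabla f_i(w)$ (for example, logistic_grad_i from the previous sketch with X, y, lam fixed). The 1/(1+t) step-size decay used for SGD is one common choice and is an assumption, not something stated on the slide.

```python
import numpy as np

def gd(grad_fi, w0, n, eta, T):
    # Deterministic GD: each iteration averages all n per-example gradients, so the cost is O(n).
    w = w0.copy()
    for t in range(T):
        full_grad = np.mean([grad_fi(i, w) for i in range(n)], axis=0)
        w = w - eta * full_grad
    return w

def sgd(grad_fi, w0, n, eta0, T):
    # SGD: each iteration touches a single randomly chosen example, so the cost is O(1).
    w = w0.copy()
    for t in range(T):
        i = np.random.randint(n)      # i_t drawn uniformly from {0, ..., n-1}
        eta_t = eta0 / (1.0 + t)      # decaying step size (a common choice; an assumption here)
        w = w - eta_t * grad_fi(i, w)
    return w
```

GD pays $O(n)$ per iteration but converges linearly, while SGD pays $O(1)$ per iteration but converges only sublinearly because the single-example gradient is a noisy estimate of the full gradient; this trade-off is what the variance-reduced methods on the next slide address.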
Stochastic Learning for Big Data

Researchers have recently proposed improved versions of SGD: SAG [Roux et al., NIPS 2012], SDCA [Shalev-Shwartz and Zhang, JMLR 2013], and SVRG [Johnson and Zhang, NIPS 2013].

Number of gradient ($\nabla f_i$) evaluations to reach accuracy $\epsilon$ for smooth and strongly convex problems:
- GD: $O(n\kappa \log(\frac{1}{\epsilon}))$
- SGD: $O(\kappa \cdot \frac{1}{\epsilon})$
- SAG: $O(n \log(\frac{1}{\epsilon}))$ when $n \geq 8\kappa$
- SDCA: $O((n + \kappa) \log(\frac{1}{\epsilon}))$
- SVRG: $O((n + \kappa) \log(\frac{1}{\epsilon}))$ (sketched after this slide)

where $\kappa = \frac{L}{\mu} > 1$ is the condition number of the objective function.

Stochastic Learning:
- Stochastic GD
- Stochastic coordinate descent
- Stochastic dual coordinate ascent
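Since the AsySVRG and SCOPE methods in the following sections build on SVRG, here is a brief sketch of the serial SVRG procedure of Johnson and Zhang (NIPS 2013). As before, grad_fi(i, w) is an assumed callable returning $\nabla f_i(w)$, and the epoch length m and the constant step size eta are illustrative parameters.

```python
import numpy as np

def svrg(grad_fi, w0, n, eta, num_epochs, m):
    # Serial SVRG: an outer loop maintains a snapshot w_tilde, and an inner loop takes
    # variance-reduced stochastic steps with a constant step size.
    w_tilde = w0.copy()
    for s in range(num_epochs):
        # Full gradient at the snapshot (one pass over all n examples per epoch).
        mu = np.mean([grad_fi(i, w_tilde) for i in range(n)], axis=0)
        w = w_tilde.copy()
        for t in range(m):
            i = np.random.randint(n)
            # Unbiased gradient estimate whose variance vanishes as w and w_tilde approach the optimum.
            v = grad_fi(i, w) - grad_fi(i, w_tilde) + mu
            w = w - eta * v
        w_tilde = w   # use the last inner iterate as the next snapshot (one standard option)
    return w_tilde
```

The correction term $-\nabla f_i(\tilde{w}) + \mu$ is what removes the variance of the plain stochastic gradient, which is why SVRG can keep a constant step size and attain the $O((n + \kappa) \log(\frac{1}{\epsilon}))$ complexity listed above.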