Coupled Group Lasso for Web-Scale CTR Prediction in Display Advertising

Ling Yan  YLING0718@SJTU.EDU.CN
Shanghai Key Laboratory of Scalable Computing and Systems, Department of Computer Science and Engineering, Shanghai Jiao Tong University, China

Wu-Jun Li  LIWUJUN@NJU.EDU.CN
National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, China

Gui-Rong Xue  GRXUE@ALIBABA-INC.COM
Alibaba Group, China

Dingyi Han  DINGYI.HAN@ALIBABA-INC.COM
Alibaba Group, China

Abstract

In display advertising, click-through rate (CTR) prediction is the problem of estimating the probability that an advertisement (ad) is clicked when displayed to a user in a specific context. Due to its easy implementation and promising performance, the logistic regression (LR) model has been widely used for CTR prediction, especially in industrial systems. However, it is not easy for LR to capture nonlinear information, such as the conjunction information, from user features and ad features. In this paper, we propose a novel model, called coupled group lasso (CGL), for CTR prediction in display advertising. CGL can seamlessly integrate the conjunction information from user features and ad features for modeling. Furthermore, CGL can automatically eliminate useless features for both users and ads, which may facilitate fast online prediction. Scalability of CGL is ensured through feature hashing and a distributed implementation. Experimental results on real-world data sets show that our CGL model can achieve state-of-the-art performance on web-scale CTR prediction tasks.

1. Introduction

Recently, online advertising has become the most popular and effective approach to brand promotion and product marketing. It is a multi-billion-dollar business on the web and accounts for the majority of the income of major internet companies, such as Google, Yahoo and Alibaba. Display advertising is a big part of online advertising, in which advertisers pay publishers for placing graphical advertisements (ads) on the publishers' web pages (Chapelle et al., 2013). The publishers allocate some positions on their web pages and sell them to different advertisers, and users who visit the web pages can view the published ads. Other roles, such as ad agencies and publisher networks, also take part in the complex advertising system (Muthukrishnan, 2009), but they are not the focus of this paper. We focus on the scenario of a user-advertiser-publisher tripartite business, in which the three parties have separate goals that can in the end be reduced to a unified task. Advertisers pay most attention to desired user actions, such as clicks on the ads, subscriptions to a mailing list, or purchases of products. Different advertisers target different kinds of users. For example, a basketball company will be interested in users who have bought a lot of sports equipment recently, and a hotel would prefer to display its ads to people who travel frequently. There are different payment options for advertisers, such as cost-per-click (CPC), cost-per-mille (CPM), and cost-per-conversion (CPA) (Mahdian & Tomak, 2007). The publishers' goal is to maximize the revenue from the advertisers and to attract more users to their web pages, so they had better precisely display suitable ads to a specific user and avoid hurting the user
experience of their web pages. From the users' part, they want to find useful information on the web pages, including ads that they are really interested in.

To satisfy the desires of all three parties, accurate ad targeting is of great importance, and the click-through rate (CTR) prediction of a user for a specific ad plays the key role in it (Chapelle et al., 2013). CTR prediction is the problem of estimating the probability that the display of an ad to a specific user will lead to a click. This challenging problem is at the heart of display advertising and has to deal with several hard issues, such as very large scale data sets, frequently updated users and ads, and the inherently obscure connection between user profiles (features) and ad features.

Recently, many models have been proposed for CTR prediction in display advertising. Some models train standard classifiers, such as logistic regression (LR) (Neter et al., 1996) or generalized linear models, on a simple concatenation of user and ad features (Richardson et al., 2007; Graepel et al., 2010). Some other models use prior knowledge, like the inherent hierarchical information, for statistical smoothing in log-linear models (Agarwal et al., 2010) or LR models (Kuang-chih et al., 2012). In (Menon et al., 2011), a matrix factorization method is proposed, but it does not make use of user features. In (Stern et al., 2009), a probabilistic model is proposed to use user and item metadata together with collaborative filtering information, in which user and item feature vectors are mapped into a lower-dimensional space and the inner product is used to measure similarity. However, it cannot automatically select features from the user and item metadata. In addition, inference in that model is too complicated for large-scale scenarios. In (Chapelle et al., 2013), a highly scalable framework based on LR is proposed, and terabytes of data from real applications are used for evaluation. Due to its easy implementation and state-of-the-art performance, the LR model has become the most popular one for CTR prediction, especially in industrial systems (Chapelle et al., 2013). However, LR is a linear model, in which the features contribute to the final prediction independently. Hence, LR cannot capture the nonlinear information, such as the conjunction (Cartesian product) information, between user features and ad features. In real applications, the conjunction information is very important for CTR prediction. For example, people with high buying power may have more interest in luxury products than those with low buying power, and college students may be more likely to buy machine learning books than high-school students. Better performance can be expected by exploiting the hybrid user-ad features through feature conjunction.

In this paper, we propose a novel model, called coupled group lasso (CGL), for CTR prediction in display advertising. The main contributions are outlined as follows:

• CGL can seamlessly integrate the conjunction information from user features and ad features for modeling, which makes it better capture the underlying connection between users and ads than LR.

• CGL can automatically eliminate useless features for both users and ads, which may facilitate fast online prediction.

• CGL is scalable by exploiting feature hashing and a distributed implementation.

2. Background

In this section, we introduce the background of our model, including the description of the CTR prediction task, the LR model, and the group lasso (Yuan & Lin, 2006; Meier et al., 2008).

2.1. Notation and Task

We use boldface lowercase letters, such as v, to denote column vectors and v_i to denote the ith element of v. Boldface uppercase letters, such as M, are used to denote matrices, with the ith row and the jth column of M denoted by M_{i*} and M_{*j}, respectively. M_{ij} is the element at the ith row and jth column of M. M^T is the transpose of M, and v^T is the transpose of v.

While some display advertising systems have access to only some ID information for users or ads, in this paper we focus on the scenarios where we can collect both user and ad features. In practice, publishers can often collect user actions on the web pages, such as clicking on an ad, buying a product, or typing in some query keywords. They can analyze these historical behaviors and then construct user profiles (features). On the other hand, when advertisers submit ads to the publishers, they often choose some description words, the groups of people to whom the ads should be displayed, or some other useful features.

We refer to a display of an ad to a particular user in a particular page view as an ad impression. Each impression is a case in which a user meets an ad in a specific context, such as the time of day, the day of the week, and the publishing position. Hence, each impression contains information of three aspects: the user, the ad, and the context. We use x_u of length l to denote the feature vector of user u and x_a of length s to denote the feature vector of ad a. The context information, together with some advertiser ID or ad ID information, is composed into a feature vector x_o of length d. We use x to denote the feature vector of an impression, with x^T = (x_u^T, x_a^T, x_o^T). Hence, if we use z to denote the length of vector x, we have z = l + s + d. The result of an impression is either a click or a non-click, and each impression together with its result makes an instance in the data set.
Given a training set {(x^(i), y^(i)) | i = 1, ..., N}, in which x^T = (x_u^T, x_a^T, x_o^T) and y ∈ {0, 1}, with y = 1 denoting a click and y = 0 denoting a non-click in an impression, the CTR prediction problem is to learn a function h(x) = h(x_u, x_a, x_o) that predicts the probability that user u clicks on ad a in a specific context o.

2.2. Logistic Regression

The likelihood of LR is defined as

    h_1(x) = \Pr(y = 1 \mid x, w) = \frac{1}{1 + \exp(-w^T x)},

where w is the parameter (weight) vector to learn. Please note that the bias term of LR has been integrated into w by adding an extra feature with constant value 1 to the feature vector. Given a training set {(x^(i), y^(i)) | i = 1, ..., N}, the weight vector w is found by minimizing the following regularized loss function:

    \min_{w} \; \lambda \Omega_1(w) + \sum_{i=1}^{N} \xi_1(w; x^{(i)}, y^{(i)}),    (1)

    \xi_1(w; x^{(i)}, y^{(i)}) = -\log\left( [h_1(x^{(i)})]^{y^{(i)}} \, [1 - h_1(x^{(i)})]^{1 - y^{(i)}} \right),

where Ω_1(w) is the regularization term.

In real applications, we can use the L2-norm for regularization (Golub et al., 1999): \Omega_1(w) = \frac{1}{2}\|w\|_2^2 = \frac{w^T w}{2}. The resulting model is the standard LR model. We can also use the L1-norm for regularization: \Omega_1(w) = \|w\|_1 = \sum_{i=1}^{z} |w_i|, where z is the length of vector w. The resulting model is the Lasso, which can be used for feature selection or elimination (Tibshirani, 1996).

The optimization problem in (1) is easy to implement and achieves promising performance, which makes LR very popular in industry. Please note that in the following content, LR refers to the LR model with L2-norm regularization, while LR with L1-norm regularization is called Lasso, as in much of the literature (Tibshirani, 1996).

2.3. Group Lasso

The group lasso is a technique for variable selection on (predefined) groups of variables (Yuan & Lin, 2006; Meier et al., 2008). For a parameter vector β ∈ R^z, the regularization term of the group lasso is defined as follows:

    \sum_{g=1}^{G} \|\beta_{I_g}\|_2,    (2)

where I_g is the index set of the predefined gth group of variables, g = 1, 2, ..., G. The group lasso can be used as a penalty together with linear regression (Yuan & Lin, 2006) or logistic regression (Meier et al., 2008). It is attractive for its property of doing variable selection at the group level: all the variables in some groups will be exactly zero after learning.

3. Coupled Group Lasso

Although LR has been widely used for CTR prediction, it cannot capture the conjunction information between user features and ad features. One possible solution is to manually construct conjunction features from the original input features as the input of LR. However, as stated in (Chapelle et al., 2013), manual feature conjunction results in a quadratic number of new features, which makes it extraordinarily difficult to learn the parameters. Hence, the modeling ability of LR is too weak to capture the complex relationships in the data.

In this section, we introduce our coupled group lasso (CGL) model, which can easily model the conjunction information between users and ads and thereby achieve better performance than LR.

3.1. Model

The likelihood of CGL is formulated as follows:

    h(x) = \Pr(y = 1 \mid x, W, V, b) = \sigma\left( (x_u^T W)(x_a^T V)^T + b^T x_o \right),    (3)

where W is a matrix of size l × k, V is a matrix of size s × k, b is a vector of length d, and σ(·) is the sigmoid function \sigma(t) = \frac{1}{1 + \exp(-t)}. Here, W, V and b are the parameters to learn, and k is a hyper-parameter.
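To make (3) concrete, here is a minimal NumPy sketch of the CGL likelihood for a single impression. This is an illustration rather than the authors' implementation: the names (cgl_likelihood, x_u, x_a, x_o) are ours, and dense vectors are assumed for readability even though real CTR features are sparse.

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def cgl_likelihood(x_u, x_a, x_o, W, V, b):
        # Eq. (3): Pr(y = 1 | x, W, V, b) for one impression.
        # x_u: (l,) user features; x_a: (s,) ad features; x_o: (d,) context.
        # (x_u^T W)(x_a^T V)^T is a scalar coupling every user feature
        # with every ad feature through the rank-k matrix W V^T.
        conjunction = (x_u @ W) @ (x_a @ V)
        return sigmoid(conjunction + b @ x_o)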
Furthermore, we put regularization on the negative log-likelihood to get the following optimization problem of CGL:

    \min_{W, V, b} \; \sum_{i=1}^{N} \xi\left(W, V, b; x^{(i)}, y^{(i)}\right) + \lambda \Omega(W, V),    (4)

with

    \xi(W, V, b; x^{(i)}, y^{(i)}) = -\log\left( [h(x^{(i)})]^{y^{(i)}} \, [1 - h(x^{(i)})]^{1 - y^{(i)}} \right),    (5)

    \Omega(W, V) = \|W\|_{2,1} + \|V\|_{2,1}.    (6)

Here, \|W\|_{2,1} = \sum_{i=1}^{l} \sqrt{\sum_{j=1}^{k} W_{ij}^2} = \sum_{i=1}^{l} \|W_{i*}\|_2 is the L_{2,1}-norm of the matrix W. Similarly, \|V\|_{2,1} is the L_{2,1}-norm of the matrix V. From (2), it is easy to see that the L_{2,1}-norm is actually a group lasso regularization with each row being a group. Please note that we do not put regularization on b because our experiments show that such regularization does not affect the performance.
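Under the same assumptions as the previous snippet, the objective in (4)-(6) can be sketched directly, reusing cgl_likelihood from above; here instances stands for an iterable of (x_u, x_a, x_o, y) tuples:

    import numpy as np

    def l21_norm(M):
        # L2,1-norm of Eq. (6): the sum of the L2 norms of the rows, so
        # each row (i.e., each user or ad feature) forms one lasso group.
        return np.sqrt((M ** 2).sum(axis=1)).sum()

    def cgl_objective(instances, W, V, b, lam):
        # Eq. (4): negative log-likelihood of Eq. (5) summed over the
        # instances, plus the coupled group-lasso penalty of Eq. (6);
        # b is left unregularized, as in the paper.
        nll = 0.0
        for x_u, x_a, x_o, y in instances:
            h = cgl_likelihood(x_u, x_a, x_o, W, V, b)
            h = np.clip(h, 1e-12, 1 - 1e-12)  # guard the logarithms
            nll -= y * np.log(h) + (1 - y) * np.log(1 - h)
        return nll + lam * (l21_norm(W) + l21_norm(V))

Because the penalty groups parameters row-wise, minimizing it tends to zero out entire rows of W and V, which is exactly the feature-elimination behavior discussed below.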
We can see that there are two group lassos in (4), one for the user features and the other for the ad features. Furthermore, the two group lassos are coupled together to determine the CTR of an impression. Hence, our model is called coupled group lasso (CGL).

Because (x_u^T W)(x_a^T V)^T = x_u^T W V^T x_a = x_u^T (W V^T) x_a, it is very interesting to see that the term (x_u^T W)(x_a^T V)^T in (3) can effectively model the conjunction information between user features and ad features. The number of parameters in W and V is only (l + s)k, where k is typically a small number (less than 50 in our experiments). On the contrary, if we choose to manually construct the conjunction features for LR (Chapelle et al., 2013), the number of manual conjunction features is l × s, with both l and s typically being tens of thousands in CTR prediction. Hence, the number of parameters of CGL is much smaller than that of LR with manual conjunction features, which makes CGL more scalable than LR for modeling conjunction features. Furthermore, the smaller number of parameters results in a lower risk of overfitting.

Another nice property of CGL comes from the regularization term of the group lasso. It is easy to see that some rows of W and V will be all-zero after learning. Because each row corresponds to a feature of users or ads, we can eliminate the features corresponding to all-zero parameter values. Hence, when we use the learned model for online prediction, it is not necessary to collect those eliminated features for users and ads. This not only saves memory, but also speeds up the online prediction procedure.

3.2. Learning

The goal of our learning algorithm is to find the optimal values b* ∈ R^d, W* ∈ R^{l×k}, V* ∈ R^{s×k} that minimize the objective function in (4). The coupled part (x_u^T W)(x_a^T V)^T makes the objective function non-convex. We adopt an alternating learning method to learn the parameters: each time we optimize one parameter with the other parameters fixed, and this procedure is repeated for several iterations until some termination condition is satisfied. More specifically, we first fix the ad parameter V and use limited-memory BFGS (L-BFGS) (Malouf, 2002; Andrew & Gao, 2007) to optimize the objective function with respect to (w.r.t.) W and b until convergence. Then we fix the user parameter W and optimize w.r.t. V and b until convergence. Obviously, the objective function is convex in either one of its parameter matrices W or V when the other is fixed.

The L-BFGS algorithm belongs to the family of quasi-Newton methods (Broyden, 1970). L-BFGS stores only a few gradient vectors to approximate the Hessian matrix. Hence, it is well suited for optimization problems with a large number of variables (Nocedal, 1980; Byrd et al., 1994). To use the L-BFGS algorithm, we only need to compute the gradient of the parameters.

For ease of presentation, we use ξ(x, y) to denote ξ(W, V, b; x, y) in (5), omitting the parameters. For each instance (x, y), the gradient contributed by this instance can be derived as follows:

    \frac{\partial \xi(x, y)}{\partial b_i} = (h(x) - y)\, x_{o_i},    (7)

    \frac{\partial \xi(x, y)}{\partial W_{ij}} = x_{u_i} (h(x) - y)\, x_a^T V_{*j},    (8)

    \frac{\partial \xi(x, y)}{\partial V_{ij}} = x_{a_i} (h(x) - y)\, x_u^T W_{*j},    (9)

where x_{u_i}, x_{a_i}, and x_{o_i} denote the ith elements of the vectors x_u, x_a, and x_o, respectively.
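The per-instance gradients (7)-(9) can be computed for all indices i, j at once. Below is a vectorized sketch under the same naming assumptions as the earlier snippets:

    import numpy as np

    def cgl_instance_gradients(x_u, x_a, x_o, y, W, V, b):
        # Gradients of Eqs. (7)-(9) for a single instance (x, y),
        # reusing cgl_likelihood from the earlier sketch.
        h = cgl_likelihood(x_u, x_a, x_o, W, V, b)
        r = h - y                            # shared residual h(x) - y
        grad_W = r * np.outer(x_u, x_a @ V)  # Eq. (8)
        grad_V = r * np.outer(x_a, x_u @ W)  # Eq. (9)
        grad_b = r * x_o                     # Eq. (7)
        return grad_W, grad_V, grad_b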
The regularization part in (6) can be expanded as follows:

    \Omega(W, V) = \|W\|_{2,1} + \|V\|_{2,1}
                 = \sum_{i=1}^{l} \sqrt{\sum_{j=1}^{k} W_{ij}^2} + \sum_{i=1}^{s} \sqrt{\sum_{j=1}^{k} V_{ij}^2}
                 \approx \sum_{i=1}^{l} \sqrt{\sum_{j=1}^{k} W_{ij}^2 + \epsilon} + \sum_{i=1}^{s} \sqrt{\sum_{j=1}^{k} V_{ij}^2 + \epsilon},

where ε is a very small positive number that makes the regularization term differentiable. In practice, this approximation works well in our application.

The gradient of Ω(W, V) can be derived as follows:

    \frac{\partial \Omega(W, V)}{\partial W_{ij}} = \frac{W_{ij}}{\sqrt{\sum_{j'=1}^{k} W_{ij'}^2 + \epsilon}},    (10)

    \frac{\partial \Omega(W, V)}{\partial V_{ij}} = \frac{V_{ij}}{\sqrt{\sum_{j'=1}^{k} V_{ij'}^2 + \epsilon}}.    (11)

We can concatenate the parameter group (W, b) into a parameter vector and then compute its gradient vector g(W, b). Similarly, we can compute the gradient vector g(V, b) of the parameter group (V, b). Assuming t is the τ-th parameter in a parameter group, the τ-th element of the gradient vector g(·) has the following form:

    g_\tau(\cdot) = \sum_{i=1}^{N} \frac{\partial \xi(x^{(i)}, y^{(i)})}{\partial t} + \lambda \frac{\partial \Omega(W, V)}{\partial t},    (12)

where ∂ξ(x^(i), y^(i))/∂t is computed using (7), (8), or (9), and ∂Ω(W, V)/∂t is computed using (10) or (11), depending on the value of t. We can then construct the approximate Hessian matrix H̃ for L-BFGS.
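Putting the pieces together, the sketch below assembles the full gradient of (12) and runs the alternating scheme just described (stated formally as Algorithm 1 below), with SciPy's L-BFGS-B solver standing in for the L-BFGS implementation used by the authors. It reuses the helpers from the earlier snippets, replaces the authors' relative-decrease stopping test with a fixed outer-iteration budget, and is a single-machine illustration rather than the distributed implementation of Section 4.

    import numpy as np
    from scipy.optimize import minimize

    EPS = 1e-8  # the small positive epsilon of the smoothed norm above

    def smoothed_l21(M):
        # Smoothed L2,1-norm: sum_i sqrt(sum_j M_ij^2 + eps).
        return np.sqrt((M ** 2).sum(axis=1) + EPS).sum()

    def l21_grad(M):
        # Entrywise gradient of the smoothed norm, Eqs. (10)-(11).
        return M / np.sqrt((M ** 2).sum(axis=1, keepdims=True) + EPS)

    def cgl_full_gradient(instances, W, V, b, lam):
        # Eq. (12): per-instance gradients of Eqs. (7)-(9) summed over
        # the data, plus the lambda-weighted regularizer gradients of
        # Eqs. (10)-(11); b carries no regularization term.
        gW, gV, gb = np.zeros_like(W), np.zeros_like(V), np.zeros_like(b)
        for x_u, x_a, x_o, y in instances:
            dW, dV, db = cgl_instance_gradients(x_u, x_a, x_o, y, W, V, b)
            gW += dW
            gV += dV
            gb += db
        return gW + lam * l21_grad(W), gV + lam * l21_grad(V), gb

    def fit_cgl(instances, l, s, d, k, lam, n_outer=5, seed=0):
        # Alternating scheme of Section 3.2: run L-BFGS on (W, b) with V
        # fixed, then on (V, b) with W fixed, and repeat.
        # instances: a list of (x_u, x_a, x_o, y) tuples.
        rng = np.random.default_rng(seed)
        W = 0.01 * rng.standard_normal((l, k))
        V = 0.01 * rng.standard_normal((s, k))
        b = np.zeros(d)

        def fun(theta, fixed, user_side):
            # Unpack the block being optimized; return (value, gradient).
            n = l if user_side else s
            A, bb = theta[:n * k].reshape(n, k), theta[n * k:]
            Wc, Vc = (A, fixed) if user_side else (fixed, A)
            nll = cgl_objective(instances, Wc, Vc, bb, 0.0)  # data term only
            obj = nll + lam * (smoothed_l21(Wc) + smoothed_l21(Vc))
            gW, gV, gb = cgl_full_gradient(instances, Wc, Vc, bb, lam)
            gA = gW if user_side else gV
            return obj, np.concatenate([gA.ravel(), gb])

        for _ in range(n_outer):
            res = minimize(fun, np.concatenate([W.ravel(), b]),
                           args=(V, True), jac=True, method='L-BFGS-B')
            W, b = res.x[:l * k].reshape(l, k), res.x[l * k:]
            res = minimize(fun, np.concatenate([V.ravel(), b]),
                           args=(W, False), jac=True, method='L-BFGS-B')
            V, b = res.x[:s * k].reshape(s, k), res.x[s * k:]
        return W, V, b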
The whole learning procedure for CGL is summarized in Algorithm 1. In each iteration, the alternating learning algorithm ensures that the objective function value always decreases. Furthermore, the objective function is bounded below by 0. In practice, when the decrease of the loss function flattens out, we can regard the algorithm as converged. Because the objective function is not jointly convex in both W and V, the solution is a local optimum. In our implementation, the convergence condition of the algorithm is that the relative decrease of the objective function value becomes less than a threshold.

Algorithm 1 Alternating Learning for CGL
  Input: Data set {(x^(i), y^(i)) | i = 1, ..., N}, and hyper-parameters k ∈ N+ and λ ∈ R+.
  Output: W*, V*, b*
  Initialize b = 0.
  Initialize W = random(R^{l×k}), V = random(R^{s×k}).
  repeat
    Fix V.
    repeat
      Compute gradient g(W, b) using (12).
      Compute the approximate Hessian H̃_{W,b} w.r.t. (W, b).
      d(W, b) = -H̃_{W,b} * g(W, b).
      Perform a line search in the direction of d(W, b) and update W, b.
    until convergence on W, b
    Fix W.
    repeat
      Compute gradient g(V, b) using (12).
      Compute the approximate Hessian H̃_{V,b} w.r.t. (V, b).
      d(V, b) = -H̃_{V,b} * g(V, b).
      Perform a line search in the direction of d(V, b) and update V, b.
    until convergence on V, b
  until convergence

3.3. Complexity Analysis

Let q = (l + s)k + d denote the total number of parameters in W, V and b. To train the model, in each iteration we need O(qN) time to compute the gradient g(·), O(q^2) time to compute the approximate Hessian matrix, and O(q^2) time for the matrix multiplication and parameter update. Hence, the total time complexity is O((qN + q^2)µ) for µ iterations.

4. Web-Scale Implementation

Web-scale applications typically involve a huge number of users and ads, with billions of impression instances. Hence, we need a scalable learning framework. In this section, we first introduce the hashing and sub-sampling techniques used for saving memory and handling class unbalance. Furthermore, we propose a distributed learning framework based on the message passing interface (MPI), which can run on clusters with hundreds of computing nodes. By combining these techniques, our learning algorithm is highly scalable for web-scale applications.

4.1. Hashing and Sub-Sampling

We use the hashing technique (Weinberger et al., 2009) for efficient feature mapping and instance generating. The original features used in real CTR prediction systems are mainly categorical, and their number is typically very large. In order to make the feature mapping (coding) results uniform and to enable fast instance generation, we hash the user, ad and other (context) features into three separate subspaces of bit vectors. The structure of the hashing framework is illustrated in Figure 1, in which the raw representation of an impression is hashed to the instance representation with bit vectors. The raw representation of each impression is composed of several triples, with each triple being (domain, feature name, feature value). The domain can be user, ad, or other, and refers to the type of the feature. For example, the impression in Figure 1 contains k triples: (d1, f1, v1), (d2, f2, v2), ..., (dk, fk, vk). The instance representation of each impression is composed of three subspaces of bit vectors, which are denoted as User features, Ad features and Other features in Figure 1. Given a triple, we can quickly get the index (also called position or key) of its coding in the feature space with a hash function. For example, if an impression has a user (domain) feature 'buypower' (feature name) with value '5' (feature value), the hash function maps the triple (user, buypower, 5) to one position in the subspace of User features and sets the corresponding position to 1. The <position (key), triple> pairs are stored in the feature map for later searching and checking. The hashing technique enables efficient feature engineering and fast prediction together with a straightforward implementation. According to (Weinberger et al., 2009), hashing can not only contract the feature space through collisions, but also bring a regularization effect.

[Figure 1. The hashing framework for fast feature mapping and instance generating: (domain, feature, value) triples from the raw representation are hashed into the User, Ad, and Other feature subspaces of the instance bit vector, and the <key, triple> pairs are kept in a feature map.]
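As an illustration of the hashing step, the sketch below maps a (domain, feature name, feature value) triple to a position in its domain's subspace. The paper does not specify a concrete hash function or subspace sizes, so MD5 and the sizes in SUBSPACE_SIZE are our assumptions, chosen only to make the example deterministic and runnable.

    import hashlib

    # Assumed subspace sizes; the paper does not report concrete values.
    SUBSPACE_SIZE = {'user': 2 ** 18, 'ad': 2 ** 18, 'other': 2 ** 16}

    def hash_triple(domain, feature_name, feature_value):
        # Map a (domain, feature name, feature value) triple to a bit
        # position inside its domain's subspace, as in Figure 1.
        key = '{}|{}|{}'.format(domain, feature_name, feature_value)
        digest = hashlib.md5(key.encode('utf-8')).digest()
        index = int.from_bytes(digest[:8], 'big')
        return index % SUBSPACE_SIZE[domain]

    # The 'buypower = 5' example from the text: the returned position is
    # the bit to set to 1 in the User-features subspace.
    pos = hash_triple('user', 'buypower', 5)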
The data sets are typically highly unbalanced, with only a very small proportion of positive instances. In order to reduce the learning complexity with a minimal influence