Expert Systems with Applications 38 (2011) 1777–1788
journal homepage: www.elsevier.com/locate/eswa
0957-4174/$ - see front matter Crown Copyright 2010 Published by Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2010.07.105

Blogger-Centric Contextual Advertising

Teng-Kai Fan, Chia-Hui Chang *
Department of Computer Science & Information Engineering, National Central University, No. 300, Jung-da Rd., Chung-li, Tao-yuan 320, Taiwan, ROC
* Corresponding author. E-mail address: chia@csie.ncu.edu.tw (C.-H. Chang).

Keywords: Online advertising; Text mining; Machine learning; Marketing; Information retrieval; Language model

Abstract

Web advertising (online advertising), a form of advertising that uses the World Wide Web to attract customers, has become one of the most commonly used marketing channels. This paper addresses the concept of Blogger-Centric Contextual Advertising, which refers to the assignment of personal ads to any blog page, chosen according to the blogger's interests. As blogs become a platform for expressing personal opinions, they naturally contain various kinds of statements, including facts, comments and statements about personal interests, of both a positive and negative nature. To extend the concept behind the Long Tail theory in contextual advertising, we argue that web bloggers, as the constant visitors of their own blog-sites, could be potential consumers who will respond to ads on their own blogs. Hence, in this paper, we propose using text mining techniques to discover bloggers' immediate personal interests in order to improve online contextual advertising. The proposed Blogger-Centric Contextual Advertising (BCCA) framework aims to combine contextual advertising matching with text mining in order to select ads that are related to personal interests as revealed in a blog and rank them according to their relevance. We validate our approach experimentally using a set of data that includes both real ads and actual blog pages. The results indicate that our proposed method could effectively identify those ads that are positively correlated with a blogger's personal interests.

Crown Copyright 2010 Published by Elsevier Ltd. All rights reserved.

1. Introduction

Blogosphere is a collective term comprising all blogs and their interconnections. A blog, short for weblog, is a type of web site that is usually maintained by a blogger who publishes serial journal posts containing news, comments, opinions, diaries, and interesting articles. As of December 2007, the blog search engine Technorati¹ was tracking more than 112 million blogs. Reports also indicate that about 1.2 million new blogs are being created worldwide each day. According to Technorati's reports in April 2007, the number of blogs in the top 100 most popular sites has risen substantially. Hence, blogs continue to become more and more viable news and information outlets.

¹ http://Technorati.com.

Blogs are also an increasingly attractive platform for advertisers. The majority of bloggers have advertising on their blogs. Marketers realize that bloggers are creating high-quality content and attracting growing and loyal audiences (Technorati, 2008). Hence, it is common for blogs to feature advertisements that either financially benefit the blogger or promote the blogger's favorite causes.

Bloggers can be classified into three types (Technorati, 2008). Personal bloggers blog about topics of personal interest not associated with their work; professional bloggers mainly blog about their industries and professions but not in an official capacity for their companies; and corporate bloggers usually blog for their companies in an official capacity. Statistics show that four out of five bloggers (about 79%) are personal bloggers. The majority of bloggers have advertising or another method of revenue generation on their blogs. Among bloggers who have advertising on their blogs, two out of three have contextual ads and one-third have affiliate advertising on their blogs (Technorati, 2008). On average, professional and corporate bloggers are more likely to include search ads, display ads and affiliate marketing, because they understand what kinds of ads are suitable for their blogs. However, the majority of personal bloggers, who have no specific idea of which ads suit their web sites, rely on the matching mechanisms used in contextual advertising. Hence, in this paper we propose a contextual advertising mechanism that could increase click rates on personal blogs.

Contextual advertising is based on studies that show that 80% of internet users are interested in receiving personalized content on sites they visit (ChoiceStream, 2005). Since the topic of a page somehow reflects the interest of visitors, ads delivered to visitors should depend upon page content rather than upon stereotypes created according to their geographical locations or upon other demographic features, such as gender or age (Kazienko & Adamski, 2007). As shown in previous studies, strong relevance increases the number of click-throughs (Chatterjee, Hoffman, & Novak, 2003; OneUpWeb, 2005; Wang, Zhang, Choi, & D'Eredita, 2002). Some studies (Fan & Chang, 2009; Zhang, Surendran, Platt, & Narasimhan, 2008) have also demonstrated that focusing on relevant topics
written with positive sentiment produces high click-through rates. Although a page-relevant topic is a way to capture visitors' interest, there is no other way to determine their personal interests. However, since bloggers are constant visitors to their own blogs, and their interests or intentions are well expressed in the weblogs, an ad agency could use those expressed intentions to place interest-oriented ads.

For example, Fig. 1 shows a weblog with five ads related to traveling placed on the right. Since the content of this page describes reasons for cancelling a trip, these traveling ads are unlikely to be clicked. Instead, what the blogger needs is medical information or information on doctors. The point here is that an ad agency system could assign relevant ads according to bloggers' own interests, especially their immediate interests or intentions, thus treating bloggers as the main visitors of their own blogs. To this end, even if an ad is related to the content of a linking page, an ad agency should preferentially consider the immediate interests expressed on the page (i.e., intentions) when placing ads.

In this paper, we propose an ad matching mechanism, which we refer to as Blogger-Centric Contextual Advertising (BCCA), and which is based on latent interest detection to associate ads with blog pages. Instead of the traditional placement of relevant ads, BCCA emphasizes that the ad agency's system should provide relevant ads that are related to different levels of personal preferences in order to increase clicks. To evaluate our proposed method, we used a real-world collection comprised of ads and blog pages from Google AdSense and Google's Blog-Search Engine,² respectively. Our results show that the proposed approach, based on text mining, can effectively recognize the latent interests (e.g., intentions) in a blog page, or the personal interests of the blogger.
In addition, we further investigated the effects of ad-page matching using an ad Click-Through-Rate (CTR) experiment, and our results suggest that our proposed method can effectively match relevant ads to a given blog page.

The rest of this paper is organized as follows: Section 2 provides background information on current online advertising. Section 3 introduces our methodology. The experimental results are presented in Section 4. Section 5 outlines some related work. Finally, we present conclusions and future directions in Section 6.

2. Background

There are two main categories of text-based advertising: sponsored search (or keyword targeted marketing) and contextual advertising (or content targeted advertising) (Anagnostopoulous et al., 2007; Broder, Fontoura, Josifovski, & Riedel, 2007). Sponsored search, which delivers ads to users based on the users' input query, can be used on sites with a search interface (e.g., search engines). Contextual advertising, on the other hand, is displayed on general web sites. The two techniques differ in that sponsored search analyzes only the user's query keywords, while content-based advertising parses the contents of a web page to decide which ads to show. However, the goals of the two approaches are consistent. The intent is to create a triple-win commercial platform. In other words, an advertiser pays a low price to purchase valuable advertisements, the ad agency system shares advertising profits with the web site owner (the publisher), and consumers can easily respond to ads to purchase products or services.

Contextual advertising involves an interaction between four players (Anagnostopoulous et al., 2007; Broder et al., 2007). The publisher, or the owner of a web site, usually provides interesting pages on which ads are shown. Publishers typically aim to engage viewers, encouraging them to stay on the web page and, furthermore, attracting sponsors to place their ads on the page.
The advertiser (the second player) supplies a series of ads to market or promote their products. Advertisers register certain characteristic keywords to describe their products or services. The ad agency system (the third player) is a mediator between the advertisers and the publishers; that is, it is in charge of matching ads to pages. The end user (the fourth player), who browses web pages, might interact with the ads to engage in commercial activities.

In the pricing model of Cost Per Click (CPC), also known as Pay Per Click (PPC) (Feng, Bhargava, & Pennock, 2003), advertisers pay every time a user clicks on their ads. They do not pay for placing the ads; they pay only when the ads are clicked. This approach allows advertisers to refine search keywords and gain information about their market. Generally, user clicks generate profits for both web site publishers and the ad agency system. A number of studies have suggested that strong relevance increases the number of ad clicks (Chatterjee et al., 2003; OneUpWeb, 2005; Wang et al., 2002). Hence, in this study, we similarly assume that the probability of a click for a given ad on a given page is determined by the ad's relevance score with respect to the page. For simplicity, we ignore the positional effect of ad placement and pricing models, as in (Anagnostopoulous et al., 2007; Broder et al., 2007; Lacerda et al., 2006; Ribeiro-Neto, 2005).

For many Web 2.0 services, where publishers are responsible for content creation but do not host their own web sites, service providers sometimes play the role of ad agency. For example, many blog service or portal service providers (such as Facebook, MySpace, and Twitter) also have their own ad agency systems, which aim to generate profits while providing their services.
Since blog owners are constant visitors of their own web pages, bringing personal ads to bloggers becomes promising, because bloggers' profiles, opinions, and short-term interests are expressed in their blogs. That is, the ad agency system of blog service providers could gain an advantage by providing the right message in the right context at the right time to bloggers. Note that this does not preclude targeting ads to visitors, but rather highlights a chance to target advertising to bloggers themselves. After all, advertising is about delivering the right message at the right time and in the right context to the right person (Adams, 2004; Kazienko & Adamski, 2007).

Fig. 1. Example of a blog page with correlation ads.

² http://blogsearch.google.com.

To explore the possibility of targeting advertising to bloggers, we conducted a simple survey of 62 bloggers about their experience in clicking ads and what types of advertisements on their blogs they prefer. Among the 50 participants who had experience clicking ads on their blogs, 40 indicated that they tend to click ads which are related to their interests and immediate requirements, while the other 10 participants click ads at random without consideration of the correlation between ads and their
interests. From this survey, we could say that about 80% of bloggers (50 of 62) tend to click the ads on their blogs, and that about 80% of those who click (40 of 50) choose ads related to their personal interests and requirements. Thus, in this paper, we propose the idea of user-centric contextual advertising and use the blogosphere as an example for realizing this idea. Compared with the traditional ad agency, which targets general visitors, blogger-centric advertising considers bloggers themselves as its ad targets and displays ads based on bloggers' interests and intentions, as described below.

3. BCCA framework

Traditional contextual advertising processes a given page a user visits to find related topics for matching ads, while Blogger-Centric Contextual Advertising assigns ads to a given blog page in accordance with the blogger's interests. Before demonstrating the proposed framework, we explain how bloggers' personal interests could be obtained. Generally speaking, an individual blog often contains profile information, tags and posts, which we classify as indicating different levels of interest.

3.1. Profile-level

Blog service providers (e.g., BlogSpot.com and Technorati.com) usually ask bloggers to enter interests to build their profiles (e.g., music, movies, reading, and other leisure pursuits) when they register as a member of the service. In addition to generic interests collected at registration time, the tags on posts or the archive of past posts can be used to construct bloggers' profiles, showing their specific interests. Since these kinds of interest continue for a period of time, we view them as long-term interests. In this paper, we assume that each blog-site has an interest profile containing either generic or specific long-term interests.

3.2. Triggering-level

Blog posts are media in which bloggers express their opinions and interests, as well as their intentions.
For example, the sentence, “The Nokia N95 is a good cell phone for several reasons,” expresses the sentiments of the author toward the object in question, a Nokia N95. The target is not necessarily a named entity (e.g., the name of a person, location, or organization); it can also be a concept (such as a type of technology), a product name, or an event.³ Finding such targets is one of the key components of traditional contextual advertising. For Blogger-Centric Contextual Advertising, we argue that recognizing the intentions of authors could be even more effective. For instance, the sentence, “We're going to the doctor right now,” in Fig. 1 indicates that the author has an immediate intention to see a doctor. As another example, the sentence, “I am looking for a new laptop,” implies that the author will probably purchase a laptop. Such targets are immediate interests; ads centered around them might increase click-through.

Another consideration for blogger-centric advertising is whether the sentence presents negative sentiments. For example, the phrase, “canceling trip to Europe,” in Fig. 1 shows a negatively connotated target, which has a lower priority. As demonstrated in (Fan & Chang, 2009), avoiding negative targets provides better contextual ads. Thus, we could say that their work is actually a special case of Blogger-Centric Contextual Advertising that aims at providing ads to the bloggers.

In our Blogger-Centric Contextual Advertising framework (BCCA), the advertising system analyzes the content of the page to recognize intention and detect sentiment for triggering-level interests. If no such targets are found, the system uses targets from the blogger's profile and searches the ad database to find the best matching ads. Four modules (intention recognition, sentiment detection, term expansion and target-ad matching) are the main components in our BCCA framework.
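The priority order described above (intention first, then positive sentiment, then the profile) can be illustrated with a minimal Python sketch. The function `select_targets`, the keyword lists, and the two toy detectors are hypothetical stand-ins introduced only for illustration; the paper's actual modules are trained classifiers, not keyword rules.

```python
from typing import Callable, List

# Hypothetical toy vocabularies; not taken from the paper.
KNOWN_TARGETS = ("doctor", "laptop", "phone")
INTENT_CUES = ("looking for", "going to")
POSITIVE_WORDS = ("good", "great", "love")


def _targets_in_sentences(post: str, cues) -> List[str]:
    """Return known target nouns found in sentences that contain any cue."""
    targets: List[str] = []
    for sentence in post.lower().split("."):
        if any(cue in sentence for cue in cues):
            for noun in KNOWN_TARGETS:
                if noun in sentence and noun not in targets:
                    targets.append(noun)
    return targets


def toy_intention_targets(post: str) -> List[str]:
    """Stand-in for intention recognition (keyword rule, assumption)."""
    return _targets_in_sentences(post, INTENT_CUES)


def toy_sentiment_targets(post: str) -> List[str]:
    """Stand-in for positive sentiment detection (keyword rule, assumption)."""
    return _targets_in_sentences(post, POSITIVE_WORDS)


def select_targets(
    post: str,
    profile: List[str],
    intention_targets: Callable[[str], List[str]],
    sentiment_targets: Callable[[str], List[str]],
) -> List[str]:
    """Pick targets by priority: intention, then sentiment, then profile."""
    targets = intention_targets(post)
    if not targets:  # no positive intention found
        targets = sentiment_targets(post)
    if not targets:  # no positive sentiment found either
        targets = list(profile)
    return targets
```

With these toy rules, a post containing "We're going to the doctor right now." yields the target "doctor", while a post with neither intention nor positive sentiment falls back to the profile interests. Term expansion and target-ad matching would then operate on the returned targets.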
The first two components analyze triggering-level interests, while the last two components enhance the target-ad matching procedure. The connections between modules are depicted in Fig. 2. Note that we prioritize ad assignment according to the level of interest. That is, triggering-level interests (i.e., short-term interests) have higher priority than profile-level interests (long-term interests). In addition, the sentiment detection module is invoked only when no positive intention is detected. If neither module detects intention or positive sentences, the system then uses targets from the blogger's profile. Next, the system proceeds to term expansion and target-ad matching. Because ads and targets are short, we designed a term expansion component to increase the likelihood of overlap between targets and available ads. Finally, a retrieval function based on a query likelihood language model is deployed for the target-ad matching strategy to rank the ads. The pseudo code for our ad assignment strategies is shown in Fig. 3.

Fig. 2. The BCCA framework. (Flowchart elements: a blog post; Intention Recognition, with the decision “Does this post contain any sentences of positive intention?”; Sentiment Detection, with the decision “Does this post contain any sentences of positive sentiment?”; Profile; Term Expansion; Page-Ad Matching; Ad Collection; a list of personal ads.)

    Algorithm: Ad assignment strategies
    Input: a blog post P, a profile set I, an ads collection
    Output: blogger-centric ads BCA that are related to personal interests

    Positive Intention PI = Recognition()
    Target(s) T = extract target(s) from PI    // where target(s) are depicted as noun(s)
    if T = ∅ then
        Positive Sentiment PS = Detection()
        T = extract target(s) from PS
        if T = ∅ then
            T = target(s) from I
        end if
    end if
    T is expanded by Term Expansion()
    Assign ads which are related to T

Fig. 3. Pseudo code of ads assignment strategies.

³ http://trec.nist.gov.

3.3. Intention recognition

Given a triggering page, our aim in this section is to explain how to recognize whether it contains any intention-bearing sentences. By modeling this problem as one of classification, our job here is to prepare training sentences, each labeled as intentional or non-intentional.

3.3.1. Collecting data for classifier training

Labeling each sentence as intentional or non-intentional is a time-consuming and costly task. In this study, we propose a novel
technique to semi-automatically label training data for this task. Since many auction web sites (e.g., ebay.com and yahoo.com) provide special forums for buyers to post their needs, these forums are a natural source of intention-filled sentences. Fig. 4 shows an example of such a post. In this study, we collected a large set of posts containing buyers' requirements from ebay.com.⁴ For each buyer's post BP, we extracted the content of the Description field with a simple program based mainly on regular expressions. However, since many buyers describe their requirements (e.g., product name, product type) in a concise sentence, the labeling system filters out simple sentences that do not contain intentions, using the following criteria.

Part-of-Speech (POS) tag: Since an intentional sentence usually contains a verb that is not a form of “to be”, we keep candidate sentences that contain terms tagged as verbs (e.g., VB and VBG), and remove sentences that contain only noun phrases (NN and NNS). For example, the second and fourth sentences shown in Fig. 4 are two useful sentences for training data, while the sentence, “It's the one with the ‘C’ on it,” would be discarded.

The length of the sentence: Short polite words are usually used in the forums. Hence, we simply disregard sentences shorter than three words. For example, the first and the last sentences presented in Fig. 4 would be neglected.

All the candidate sentences that conform to the above rules are regarded as intentional data (i.e., positive instances). As for non-intentional training data (i.e., negative instances), we manually constructed some queries (e.g., product names, people's names, and proper nouns) and used them to collect entry pages from Wikipedia.⁵ We chose Wikipedia as our non-intentional training data source because it usually describes facts about a specific object and avoids individual subjective intentions and opinions (Zhang, Yu, & Meng, 2007).

3.3.2. Feature selection and feature value

For feature selection, we considered a subset of word unigrams chosen via the Pearson chi-square test (Chernoff & Lehmann, 1954). Yang and Pedersen (1997) suggest that the chi-square test is an effective approach to feature selection. To find out how dependent a feature $f$ is on the intention set or the non-intention set, we set up a null hypothesis that $f$ is independent of the two categories with respect to its occurrences in the two sets. A Pearson chi-square test compares the observed frequencies of $f$ with its expected frequencies to test this hypothesis according to a contingency table, as shown in Table 1. Each $N_{ij}$ in Table 1 is the number of sentences containing (or not containing) $f$ in the intentional (or non-intentional) dataset. The independence of $f$ is tested by calculating its chi-square value

$$\chi^2(f) = \sum_{i \in \{0,1\}} \sum_{j \in \{0,1\}} \frac{(N_{ij} - E_{ij})^2}{E_{ij}}$$

where $E_{ij}$ is the expected frequency of case $ij$, calculated by

$$E_{ij} = \frac{row_i \times column_j}{N}, \quad i, j \in \{0, 1\}$$

A high chi-square value indicates that the hypothesis of independence, which implies that expected and observed counts are similar, is incorrect. In other words, the larger the chi-square value, the more class-dependent $f$ is with respect to the intentional set or the non-intentional set. In this study, we selected the top-K features with the highest chi-square values as input features.

To apply machine learning algorithms to our dataset, we used the standard bag-of-features framework. Let $\{f_1, \ldots, f_m\}$ be a predefined set of $m$ features that can appear in a document. Let $w_i(d)$ be the weight of $f_i$ as it occurs in document $d$. Then each document $d$ is represented by the document vector $\vec{d} = (w_1(d), w_2(d), \ldots, w_m(d))$. The weighting value can be either a boolean value or a tf–idf (term frequency–inverse document frequency) value. Here we used tf–idf, which is a statistical measure that evaluates how important a word is to a document.
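The chi-square scoring of unigram features described above can be sketched in plain Python. This is a minimal sketch rather than the authors' code: each sentence is assumed to be given as a set of lower-cased tokens, and the function names `chi_square_scores` and `top_k_features` are hypothetical. The indices follow Table 1, where the first index marks the set (1 = intentional, 0 = non-intentional) and the second marks whether the feature occurs (1) or not (0).

```python
from collections import Counter
from typing import Dict, List, Sequence, Set


def chi_square_scores(intentional: Sequence[Set[str]],
                      non_intentional: Sequence[Set[str]]) -> Dict[str, float]:
    """Pearson chi-square value of every unigram over its 2x2 contingency table."""
    # Number of sentences in each set that contain the feature.
    pos_df = Counter(tok for sent in intentional for tok in sent)      # N11
    neg_df = Counter(tok for sent in non_intentional for tok in sent)  # N01
    n_pos, n_neg = len(intentional), len(non_intentional)
    n = n_pos + n_neg
    scores: Dict[str, float] = {}
    for f in set(pos_df) | set(neg_df):
        n11, n01 = pos_df[f], neg_df[f]
        n10, n00 = n_pos - n11, n_neg - n01
        observed = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
        row = {1: n11 + n10, 0: n01 + n00}      # row totals of Table 1
        column = {1: n11 + n01, 0: n10 + n00}   # column totals of Table 1
        chi2 = 0.0
        for (i, j), n_ij in observed.items():
            e_ij = row[i] * column[j] / n       # expected frequency E_ij
            if e_ij > 0:                        # skip empty rows or columns
                chi2 += (n_ij - e_ij) ** 2 / e_ij
        scores[f] = chi2
    return scores


def top_k_features(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring features, ties broken alphabetically."""
    return sorted(scores, key=lambda f: (-scores[f], f))[:k]
```

A feature that occurs equally often in both sets (for example a product noun such as “phone”) gets a value near zero, while a feature confined to one set (for example “looking” in buyers' posts) gets a high value and survives the top-K cut.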
Table 1
Contingency table for chi-square.

                         f            ¬f           Row
    Intentional set      N11          N10          N11 + N10
    Non-intentional set  N01          N00          N01 + N00
    Column               N11 + N01    N10 + N00    N = N11 + N01 + N10 + N00

    Hi, I am looking for a Coach style faceplate to fit a Motorola RazrV3xx phone. It's the one with the "C" on it. I am wanting the brown color if possible or the black. thanks!

Fig. 4. Example of a post with buyer's requirements.

⁴ http://pages.ebay.co.uk/wantitnow/.
⁵ http://en.wikipedia.org.

The tf–idf function assumes that the more frequently a certain term $t_i$ occurs in a document $d_j$, the more important it is for $d_j$, and furthermore, that the more documents $t_i$ occurs in, the smaller its contribution is in characterizing the semantics of a document in which it occurs. In addition, weights computed by tf–idf techniques are often normalized so as to counter the tendency of tf–idf to emphasize long documents. The type of tf–idf that we used to generate normalized weights for data representations in this study is

$$\mathrm{tfidf}(t_i, d_j) = \mathrm{tf}(t_i, d_j) \cdot \log \frac{|D|}{\#_D(t_i)}$$
where the factor $\mathrm{tf}(t_i, d_j)$ is called the term frequency, the factor $\log \frac{|D|}{\#_D(t_i)}$ is called the inverse document frequency, $\#_D(t_i)$ denotes the number of documents in the document collection $D$ in which term $t_i$ occurs at least once, and

$$\mathrm{tf}(t_i, d_j) = \begin{cases} 1 + \log \#(t_i, d_j), & \text{if } \#(t_i, d_j) > 0 \\ 0, & \text{otherwise} \end{cases}$$

where $\#(t_i, d_j)$ denotes the frequency of $t_i$ in $d_j$. Weights obtained from the tf–idf function are then normalized by means of cosine normalization, finally yielding

$$w_{i,j} = \frac{\mathrm{tfidf}(t_i, d_j)}{\sqrt{\sum_{s=1}^{|T|} \mathrm{tfidf}(t_s, d_j)^2}}$$

3.4. Sentiment detection

Our aim in this section is to apply a contextual sentiment technique for recognizing the sentiment of a blog page. Generally, researchers study opinions at three different levels: the word level, sentence level, and document level (Esuli & Sebastiani, 2006; Kim & Hovy, 2006; Ku, Ho, & Chen, 2009; Yu & Hatzivassiloglou, 2003). For this study, we need to identify whether a sentence is neutral, positive or negative. To build an efficient learning model, we divided the task of detecting the sentiment of a sentence into two steps, each of which is a binary classification problem. The first is an identification step that aims to identify whether the sentence is subjective or objective. The second is a classification step that classifies the subjective sentences as positive or negative.

3.4.1. Collecting data for classifier training

For sentiment classification, a good Web resource is epinions.com, where pros and cons of products or other topics are discussed by reviewers. Kim and Hovy (2006) used this information to prepare their training data. The pro and con fields in epinions.com contain comma-delimited phrases that describe the features of the products. Thus, Kim and Hovy use these two sets of pro and con phrases to label the orientation of sentences in the review document.
A sentence is annotated as positive if it contains pro phrases and as negative if it contains con phrases. Otherwise, the sentence is labeled as neutral. They use these data and a learning algorithm with different feature categories (such as the unigram and the opinion-bearing word) to train a pro and con sentence recognition system. Although this method is fully automatic, an approach based on pro and con phrases does not perform well (F-measure: 0.65). As indicated in Esuli and Sebastiani (2006), opinionated content is most often carried by parts of speech used as modifiers (i.e., adverbs and adjectives) rather than parts of speech used as heads (i.e., verbs, nouns), as exemplified by expressions such as “a funny movie” or “a fabulous game”. Thus, we combined the concept proposed in (Esuli & Sebastiani, 2006) with Kim and Hovy's (2006) method, modified as follows. For each term tagged as an adjective or adverb (e.g., JJ and RB) in the pro and con sets, our labeling system checks each sentence to find sentences that contain those adjectives or adverbs. Then the system annotates each sentence with the appropriate sentiment label.

As for feature selection, we similarly adopted the chi-square statistic method to select the top-K features for both the sentiment identification and sentiment classification steps. For the identification step, all the subjective and objective sentences are represented as feature-presence vectors, where the presence or absence of each feature is recorded. For the classification step, all the positive and negative sentences are similarly represented as feature-presence vectors to train the classifiers.

3.4.2. Term expansion

In general, a blog page can be about any theme, while the advertisements are concise in nature. Hence, the term overlap between ads and pages is very low.
If we only consider the existing terms included in a triggering page, an ad agency may not accurately retrieve relevant ads, even when an ad is related to a page. In this paper, we followed three term expansion methods proposed in (Fan & Chang, 2009) to expand the specific terms in a blog page. According to Anagnostopoulous, Broder, Gabrilovich, Josifovski, and Riedei (2007) and Ribeiro-Neto, Cristo, Golgher, and Moura (2005), considering the ads' abstracts and titles is not sufficient to perform page-ad matching. Thus, term expansion of the keywords in the triggering page, as well as in the ads, is conducted to increase overlap. For a triggering page, because not all the words included in a page are useful for carrying out term expansion, Fan and Chang (2009) simply take the terms tagged as nouns (NN and NNS) as candidate terms, from which they generate a set of seed terms according to the following rules:

$T_{Capitalization}$: whether a candidate term is capitalized, which determines whether it is a proper noun or an important word.

$T_{hypertext}$: whether a candidate term is part of the anchor text of a hypertext link.

$T_{title}$: whether a candidate term is part of the post's title.

$T_{frequency}$: in line with term frequency, the three most frequently occurring candidate terms are taken as a subset of the seed terms.

Subsequently, the set of seed terms ($SeedTerm = T_{Capitalization} \cup T_{hypertext} \cup T_{title} \cup T_{frequency}$) undergoes three term expansion methods. Two methods are dictionary-based operations that utilize the WordNet⁶ and Wikipedia⁷ thesauruses, respectively. The third method is a web-based search that identifies pages related to a triggering page to construct a co-occurrence list using the specific terms on a triggering page. For more details, readers are referred to (Fan & Chang, 2009).

⁶ http://wordnet.princeton.edu/.
⁷ http://en.wikipedia.org.

3.4.3. Page-ad matching

We can regard the Blogger-Centric Contextual Advertising issue as a traditional information retrieval problem: given a user's query $q$, the IR (Information Retrieval) system returns relevant documents $d$ according to the query content. Hence, we intuitively model a triggering page $p$ and relevant ads $a$ as a user's query $q$ and corresponding documents $d$, respectively. In this paper, the query likelihood language model is adopted as our ad retrieval model. The language modeling approach to IR models the following idea: a document $d$ is a good match to a query $q$ if the document model is likely to generate query $q$, which will in turn happen if the document contains the query words often (Ponte & Croft, 1998). Hence, we constructed a language model $M_a$ from each ad $a$ in the collection. Our goal is to rank ads by $P(a \mid q)$, where the probability of an ad $a$ is interpreted as the likelihood that it is relevant to the query $q$. By Bayes' rule, the ranking function $P(a \mid q)$ can be converted to

$$P(a \mid q) = \frac{P(a)\,P(q \mid a)}{P(q)}$$

Since $P(q)$ is the same for all ads and the prior probability of an ad $P(a)$ is usually treated as uniform across all $a$, both of them can be ignored.
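Once $P(q)$ and $P(a)$ are dropped, ads are ranked by $P(q \mid a)$ alone. The sketch below illustrates one common way to compute such a score and should not be read as the authors' exact retrieval function. It makes two assumptions that the text above does not state: query terms are treated as independent draws from a unigram model $M_a$, and each ad model is smoothed by linear interpolation with a collection-wide model (the weight `lam` is an arbitrary illustrative value). The function name `rank_ads` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def rank_ads(query_terms: Sequence[str],
             ads: Dict[str, Sequence[str]],
             lam: float = 0.5) -> List[Tuple[str, float]]:
    """Rank ads by log P(q | M_a) under a smoothed unigram language model."""
    # Collection model used for smoothing (assumption: all ads pooled together).
    collection = Counter(term for terms in ads.values() for term in terms)
    collection_len = sum(collection.values())
    ranked = []
    for ad_id, terms in ads.items():
        tf = Counter(terms)
        ad_len = len(terms)
        log_p = 0.0
        for term in query_terms:
            p_ad = tf[term] / ad_len if ad_len else 0.0
            p_coll = collection[term] / collection_len if collection_len else 0.0
            p = lam * p_ad + (1.0 - lam) * p_coll
            if p == 0.0:
                # The term occurs in no ad at all, so it would penalize every
                # ad identically; skipping it leaves the ranking unchanged.
                continue
            log_p += math.log(p)
        ranked.append((ad_id, log_p))
    # Highest log-likelihood first; ties broken by ad identifier.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked
```

Working in log space avoids numerical underflow when the expanded target list is long, and the smoothing keeps an ad from being scored as impossible merely because it lacks one query term.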