Int. J. Human-Computer Studies 68 (2010) 496–507

Personalized blog content recommender system for mobile phone users

Po-Huan Chiu (a), Gloria Yi-Ming Kao (b,*), Chi-Chun Lo (a)

(a) Institute of Information Management, National Chiao Tung University, HsinChu 300, Taiwan, ROC
(b) Graduate School of Technological and Vocational Education, National Taiwan University of Science and Technology, Taipei 106, Taiwan, ROC

Received 30 December 2008; received in revised form 28 February 2010; accepted 25 March 2010
Available online 7 April 2010

Abstract

Compared to newspaper columnists and broadcast media commentators, bloggers do not have organizations actively promoting their content to users; instead, they rely on word-of-mouth or casual visits by web surfers. We believe the WAP Push service feature of mobile phones can help bridge the gap between internet and mobile services, and expand the number of potential blog readers. Since mobile phone screen size is very limited, content providers must be familiar with individual user preferences in order to recommend content that matches narrowly defined personal interests. To help identify popular blog topics, we have created (a) an information retrieval process that clusters blogs into groups based on keyword analyses, and (b) a mobile content recommender system (M-CRS) for calculating user preferences for new blog documents. Here we describe results from a case study involving 20,000 mobile phone users in which we examined the effects of personalized content recommendations. Browsing habits and user histories were recorded and analyzed to determine individual preferences for making content recommendations via the WAP Push feature. The evaluation results of our recommender system indicate significant increases in both blog-related push service click rates and user time spent reading personalized web pages. The process used in this study supports accurate recommendations of personalized mobile content according to user interests.
This approach can be applied to other embedded systems with device limitations, since document subject lines are elaborated and more attractive to intended users.

© 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.ijhcs.2010.03.005 www.elsevier.com/locate/ijhcs

Keywords: Recommender system; Mobile services; Blog; Content push; Recommendation

* Corresponding author. E-mail addresses: gloriakao@cis.nctu.edu.tw, gloriakao@gmail.com (G.Y. Kao).

1. Introduction

In mid-2008, Chunghwa Telecom (Taiwan's largest mobile carrier) started offering a service that allows customers to use their mobile phones to track blog documents. Users can now subscribe to their favorite blogs and have new blog documents delivered to their mobile phones via the WAP Push message feature (Leu et al., 2006). Blog documents are presented in XHTML (to support both text and image displays) and reformatted to fit small screens (Baluja, 2006). Three months after the service was introduced, click rates went flat. Chunghwa identified the reason as too many blogs in the system (over 3000 and rising). Large numbers of new documents are generated every day, and users cannot read them all (Luther et al., 2008). Filtering out less important or interesting content is clearly a requirement for such a service to succeed.

Current blog document recommendation mechanisms rely on human input. Employees are hired to choose a limited number of documents and to deliver them to all mobile phone users. In the absence of personalized recommendations, most Chunghwa users stop clicking on recommended content. Here we describe a process for (a) classifying blog documents according to multiple themes, (b) analyzing user mobile phone reading behaviors to determine personal theme preferences (Eirinaki and Vazirgiannis, 2003), and (c) making recommendations for blog documents suitable for individual users. The overall goal is to increase click rates for this service.

1.1. Comparison of personal computer and mobile phone features associated with blog reading

The large majority of web content is aimed at PC surfers rather than mobile phone users, and most blog document
recommendation systems are customized for PC access (Hayes et al., 2006). A comparison of access methods is presented in Table 1. In this section we review their respective characteristics.

1.1.1. Blog-reading behaviors

Push or pull: PC browsing behavior is almost completely user-initiated. Thus, when new blog documents are posted, users need to make a focused effort to go online and visit the blogger's web page. Mobile phones differ in that content providers can actively push updated documents to user devices. In Taiwan, mobile phone WAP Push technology can deliver approximately 100 Chinese characters of summary, plus a URL link. If users want to read the content, all they need to do is click on the link, thus saving them the effort of visiting the website on a regular basis.

Document subject descriptions: PC users are more likely to read blog posts, even on topics that they have little interest in. Web page designers usually place blog subject titles and document summaries on one side of the screen area, making it easy for users to decide whether or not they want to read specific entries. On mobile phones in Taiwan, WAP Push messages are generally limited to 100 Chinese characters; they must be shorter if the carrier wants to present a list of multiple blog document subjects. The writing and rewriting of summaries requires human effort.

Ranking documents after reading: PC users are accustomed to simple and useful blog document ranking systems involving stars or other marks. Ajax technology takes advantage of keyboard and mouse functions to speed up page refreshing actions for ranking. Doing this on mobile phones requires GUI design decisions that can produce more problems than benefits. A possible GUI compromise is to offer two close buttons: "close + I liked it" and "close + I did not like it".
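The push message limit described above (roughly 100 Chinese characters of summary plus a URL link) can be illustrated with a minimal sketch. The function name, the dictionary layout, and the use of an ellipsis to mark truncation are assumptions made for illustration only; the source does not describe how summaries were actually shortened, and in the study that work was done by human editors.

```python
# Hypothetical helper: the 100-character limit and the "summary plus URL"
# layout come from the text; everything else here is an assumption.
MAX_SUMMARY_CHARS = 100  # approximate WAP Push summary limit cited for Taiwan


def build_push_message(summary: str, url: str, limit: int = MAX_SUMMARY_CHARS) -> dict:
    """Trim a blog summary to the character limit and pair it with its link."""
    text = " ".join(summary.split())  # collapse whitespace and line breaks
    if len(text) > limit:
        # Reserve one position for the ellipsis so the result never exceeds the limit.
        text = text[: limit - 1].rstrip() + "…"
    return {"summary": text, "url": url}
```

Python's `len` counts Unicode code points, so each Chinese character counts as one position, which matches the character-based limit quoted in the text.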
Location-based service: Users occasionally want content related to their immediate locations, for example, reading comments about a feature film when they are approaching a theater. An advanced mobile recommender system can make recommendations based on user location information.

1.1.2. Device limitations

Screen size: Unlike PC screens, mobile phones have very small browsing areas. When blog document subjects are shown in list form on a mobile phone screen, items near the top get the best click rates: the more a user has to scroll down, the smaller the chances of an item being selected. PC users are more likely to browse lists and to skip over individual items to locate content they are most interested in.

Browser compatibilities: Internet Explorer and Firefox currently dominate the web browser market (over 90% combined market share). In contrast, the list of mobile phone browsers includes Nokia, Sony Ericsson, Motorola, and Windows Mobile, among others. To match mobile phone capabilities, web servers need to identify browser type and version in order to provide suitable content (Chen and Kotz, 2000). It is possible to read browser type information from HTTP headers. We calculated click counts for different browser types over one day; our results are shown in Table 2.

Table 1
A comparison of personal computer and mobile phone features associated with blog-reading behaviors.

| Comparison items | Personal computers | Mobile phones |
|---|---|---|
| Push or pull | Pull mode, user-initiated online browsing | Push mode, users receive WAP Push as content is updated |
| Document subject descriptions | Allows for greater ambiguity | Must be short and attractive |
| Ranking documents after reading | Easy ranking using star or point system | Difficult to input, ranking is not easy |
| Screen size | 1024 × 768 pixels or more | 320 × 240 pixels or less |
| Browser compatibilities | Internet Explorer and Firefox have over 90% market share | Multiple vendors and phone models have different browser features |
| Cost | Free | Cost per packet |
| Anonymous browsing | Mostly anonymous | Phone numbers can be read |
| Suitable content | Ajax, Flash, video, rich content | Simple text or small images |
| Browsing duration | Mostly exceeding 10 min | Mostly less than 10 min |

Table 2
Single day click counts for different device vendors.

| Device vendor | Click times (users could click multiple times) |
|---|---|
| Sony Ericsson | 15,449 |
| Samsung | 1175 |
| Nokia | 1276 |
| PocketPC (Windows Mobile) | 547 |
| LG | 395 |
| Motorola | 392 |

The users of Sony Ericsson mobile phones, which have NetFront browsers developed by the ACCESS Company of Japan, were clearly the most active in this research. In many respects, NetFront browsers are incompatible with PC web browsers; for example, they have limited JavaScript capabilities, HTML support, and content presentation styles. We were therefore required to convert various blog content formats into plain text and small images. Character sets were also challenging: mobile phones can
only accept UTF-8 encoding, and blog documents may contain mixes of Big5, GB2312, and ISO-2022-JP characters. Blog documents may also contain illegal characters or coding mixes that an Internet Explorer (IE) browser can resolve; all of these must be deleted before sending documents to mobile phones.

1.1.3. General considerations

Cost: In Taiwan, most PC users access the web via ADSL, with Internet service providers charging monthly fees according to bandwidth. In contrast, mobile phone companies usually charge by traffic amount, with users paying for GPRS/3G wireless bandwidth according to transmission packets.

Anonymous browsing: Using a PC to visit web pages is mostly an anonymous activity. Although web servers can record the IP addresses of incoming connections (Srivastava et al., 2000), user identities are protected by firewalls and proxies (Reed et al., 1998). Since most blogs do not require logins, user preferences are not easy to record. Conversely, a mobile phone carrier's backend system can be used to retrieve user phone numbers from the HTTP header while they are browsing, making it easy to store personal browsing histories in databases. This provides sufficient information for a blog content recommender system based on analyses of user preferences. In a backend system, carriers use the dynamic IP addresses of mobile phones to retrieve phone numbers, which are inserted into HTTP headers via WAP gateways. Service providers can request permission from mobile carriers to read phone numbers from HTTP headers: when users browse web pages via a WAP gateway, the service providers can extract the phone numbers on their web servers.

Suitable content: Internet bloggers use various content formats, including HTML tags, Ajax, Flash, and video. The formats of most original blog documents and web page layouts are not suitable for mobile phones, which cannot parse complex web pages.
Although browsers on advanced mobile phones are becoming more powerful, extracting text and images from blogs and reformatting them for small screens is required to achieve maximum compatibility.

Browsing duration: Whereas PC users can surf the web for many hours at a time, mobile phone users are accustomed to reading very short documents. Reading long documents on small mobile phone screens is unusual, since it causes physical eyestrain and conflicts with the typical short-term tasks that mobile phones are used to perform.

1.2. Mobile phone blog recommendations by human experts

Prior to implementing a personalized recommender system for mobile phones, blog recommendations for Chunghwa Telecom customers were made entirely by human experts, who chose blog documents in blocks of twelve and used the WAP Push function to deliver messages to users. The experts were responsible for finding the latest blog documents of high interest and suitability for mobile phone users, rewriting their subject lines to make them shorter and more attractive, and distributing their recommendations without personalization. The subject appearance order they chose was important because only the first five list items could be viewed. Users then decided whether or not they wanted to click the URL link contained in the WAP Push message, activate their mobile phone browsers and go online, browse the entire list of twelve blog topics in XHTML format, and get more information about one or more of the listed documents.

Human experts obviously do not have enough time to make personalized document recommendations for every mobile user. In some cases, mobile phone companies or independent researchers have experimented with systems focused on specific content recommendations, for example, news media headlines (Lee and Park, 2007). We believe a similar system can be used to recommend specific blog content.

2. Design

2.1. System goals

We suggest using a mobile content recommendation system (M-CRS) to achieve our stated objectives. Our proposed system consists of four elements:

1. Creating groups of users with similar preferences, and pushing blog content according to those preferences and user interests.
2. Adjusting recommendation accuracy according to user feedback and browsing histories (Chakrabarti, 2002; Smyth and Cotter, 2004). The larger the number of users involved in the recommendation process, the easier it will be to use collaborative filtering to make more accurate recommendations (Goldberg et al., 1992; Morita and Shinoda, 1994; Resnick et al., 1994).
3. Using the mobile carrier's backend system to determine user preferences for blog documents.
4. Limiting push messages to those users most likely to respond. This is especially important because the WAP Push system is considered a limited carrier resource that is also used for other purposes. Our M-CRS can be used to identify users with no interest in reading blog documents on their mobile phones; these can be deleted from WAP Push message delivery lists.

For successful implementation, M-CRS must be capable of analyzing approximately 3000 blog sites and calculating the preferences of 20,000 users within a 2 h limit. Requirements for this task are a solid load-balancing architecture, fast algorithms, and appropriate cache technologies to match mobile carrier environment needs. Finally, our proposed system requires periodic click rate comparisons with human expert recommendations. A high-level M-CRS workflow diagram is shown in Fig. 1.
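The fourth goal (limiting push messages to users likely to respond) can be sketched as a simple filter over per-user delivery statistics. This is only an illustration under stated assumptions: the `PushStats` record, the `select_recipients` function, and both thresholds are hypothetical, and the source does not say which criterion M-CRS actually uses to remove users from delivery lists.

```python
from dataclasses import dataclass


@dataclass
class PushStats:
    """Per-user delivery statistics (hypothetical record layout)."""
    user_id: str
    pushes_sent: int
    clicks: int


def select_recipients(stats, min_pushes=10, min_click_rate=0.05):
    """Return the user ids that stay on the WAP Push delivery list.

    Users with too little history are kept, since their interest is unknown.
    Both thresholds are illustrative assumptions, not values from the study.
    """
    keep = []
    for s in stats:
        if s.pushes_sent < min_pushes:
            keep.append(s.user_id)  # not enough evidence to drop this user
        elif s.clicks / s.pushes_sent >= min_click_rate:
            keep.append(s.user_id)  # responsive enough to keep
    return keep
```

Keeping users with short histories is a deliberate design choice in this sketch: dropping them early would prevent the system from ever learning their preferences.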
Fig. 1. M-CRS high-level workflow.

2.2. Document classification

Making suitable recommendations for mobile phone users requires categorizing thousands of new documents into themes that can be matched with user preferences. Blog sites generally contain multiple documents on various themes, which makes a single document the smallest possible theme unit. Our proposed system entails three document classification steps: detecting new documents and saving them in a database, splitting words, and matrix factorization (Fig. 2).

Fig. 2. Document classification steps.

2.2.1. Detecting new documents and saving them in a database

New blog posts can be detected using RSS feeds. Since feeds are structured in XML data format, our proposed M-CRS can easily parse them to identify new items. M-CRS can be set to scan all registered blog URLs once per hour, with new posts launching a retrieval process. RSS feeds usually contain partial content or basic summaries, so M-CRS must fetch complete documents in unstructured HTML format and purge irrelevant tags, JavaScript, network ads, and so on before parsing usable fields such as subject, author, content, and posting date. The parsing task is made more difficult by the various HTML layout formats used by bloggers. Parsed and structured information is stored in a database for later use. To improve server efficiency, many blog websites block software robots from fetching HTML content. For this reason, M-CRS must present itself as a normal IE browser and simulate all IE protocols in order to gain website access.

2.2.2. Split words

Since Chinese words in sentences are not separated by white spaces, they are more difficult to split than English words (Salton and Buckley, 1988). A Taiwanese word-splitting project called Chinese Knowledge Information Processing (CKIP) (Chien, 1997) provides HTTP web service interfaces to developers. After registering on the CKIP website, developers can send documents to the system via HTTP for the purpose of splitting words. We feel that network overhead is a very serious CKIP performance issue: at normal network speeds, the system requires approximately 8 s to process a single document. This means that 3000 documents would require more than 6 h of processing (3000 × 8 s = 24,000 s, or about 6.7 h), an unrealistic amount of time for a mobile carrier environment. We therefore decided to use our own Chinese word-splitting algorithm on a local computer to reduce these processing costs.

Chinese words can consist of two or more characters. To train computers to find meaningful Chinese words, we assume that certain combinations of characters occur more frequently and have more potential to represent meaning. Traditionally, Chinese dictionaries have been used to split phrases into meaningful word combinations. We decided against using this approach because of the large number of new words that have yet to appear in Chinese dictionaries. Instead, we propose using a hash table-based algorithm to split words on a large scale for multiple blog documents.

Our proposed process involves three algorithms: gram count, log likelihood ratio (LLR), and term frequency-inverse document frequency (TF-IDF) (Jurafsky and Martin, 2008). Gram count is based on finding character combinations. For example, a Chinese phrase may be split into 1-ideograph, 2-ideograph, etc. combinations. The frequency of each combination is calculated in terms of its number of appearances in a document. The split character combinations are inserted into hash tables as keys, with their frequencies as values. Next, the LLR algorithm is used to determine which 2-ideograph character combinations are meaningful, based on whether their LLR values exceed a threshold.
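The gram count and LLR steps can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the source does not give the LLR formula, so the sketch assumes Dunning's log-likelihood ratio over a 2x2 contingency table, and the function names, the 3.84 threshold (the 95% chi-square critical value for one degree of freedom), and the `min_count` filter are all hypothetical choices.

```python
import math
from collections import Counter


def gram_counts(text, max_n=2):
    """Count every 1..max_n character combination; the Counter is the hash table."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def _xlogx(k):
    return k * math.log(k) if k > 0 else 0.0


def llr(k11, k12, k21, k22):
    """Dunning-style log-likelihood ratio for a 2x2 contingency table (assumed form)."""
    total = k11 + k12 + k21 + k22
    cells = _xlogx(k11) + _xlogx(k12) + _xlogx(k21) + _xlogx(k22)
    rows = _xlogx(k11 + k12) + _xlogx(k21 + k22)
    cols = _xlogx(k11 + k21) + _xlogx(k12 + k22)
    return 2.0 * (cells + _xlogx(total) - rows - cols)


def meaningful_bigrams(text, threshold=3.84, min_count=2):
    """Return {bigram: LLR} for 2-character combinations scoring above threshold."""
    bigrams = {g: c for g, c in gram_counts(text, 2).items() if len(g) == 2}
    total = sum(bigrams.values())
    first, second = Counter(), Counter()
    for gram, count in bigrams.items():
        first[gram[0]] += count   # how often this character starts a bigram
        second[gram[1]] += count  # how often this character ends a bigram
    kept = {}
    for gram, k11 in bigrams.items():
        k12 = first[gram[0]] - k11    # same first character, different second
        k21 = second[gram[1]] - k11   # different first character, same second
        k22 = total - k11 - k12 - k21
        score = llr(k11, k12, k21, k22)
        # LLR is two-sided, so also require a positive association.
        if k11 >= min_count and k11 * k22 > k12 * k21 and score > threshold:
            kept[gram] = score
    return kept


sample = "我愛台北。你去台北。他在台北。她到台北。"
print(sorted(meaningful_bigrams(sample)))
```

Punctuation is not stripped in this sketch, so a character-plus-punctuation pair may pass the threshold alongside genuine words; a production pipeline would presumably filter such pairs before scoring.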
Finally, TF-IDF is used to evaluate the importance of a word to a document in a corpus (Salton and McGill, 1983; Shardanand and Maes, 1995).

2.2.3. Non-negative matrix factorization algorithm

Lee and Seung's (1999) non-negative matrix factorization (NMF) algorithm has a strong performance reputation for problems such as determining facial features from photograph collections. An article matrix consists of one row for each article and one column for each keyword. To factorize such a matrix, the algorithm finds two smaller matrices that can be multiplied to reconstruct the original. It attempts to reconstruct the original as accurately as possible by calculating its features and weights (Shahnaz et al., 2006). Our proposed system uses the NMF algorithm to categorize blog documents according to theme and to generate theme-based keyword lists. It is capable of identifying multiple themes across all blog documents. Higher relation scores are given to documents belonging to the same theme.

The goal of the NMF algorithm is to find two smaller matrices, a features matrix and a weights matrix, that can be multiplied together to reconstruct a large article matrix. The features matrix shown in Table 3 has one row for each feature and one column for each word; values indicate how important each word is to a feature. Each feature can represent a theme emerging from a set of articles. The weights matrix (Table 4) maps features to the article matrix: each row is one article and each column one feature, and values represent how relevant a feature is to an article. Since each row of the features matrix is a feature consisting of a combination of words, reconstructing the article matrix is a matter of combining those rows in different amounts, as given by the weights.
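As a worked illustration of this reconstruction, the sketch below multiplies the sample weights (Table 4) by the sample features (Table 3) to obtain an article matrix, then runs a small NMF on the result. The update rule is the multiplicative form for squared error commonly attributed to Lee and Seung; the source does not state which variant, iteration count, or initialization M-CRS uses, so `nmf`, `iterations`, `seed`, and `eps` are assumptions made purely for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def sq_error(v, approx):
    """Sum of squared differences between two equally sized matrices."""
    return sum((x - y) ** 2 for rv, ra in zip(v, approx) for x, y in zip(rv, ra))


def nmf(v, n_features, iterations=200, seed=0, eps=1e-9):
    """Factor v (articles x words) into weights w and features h, both non-negative."""
    rng = random.Random(seed)
    n_articles, n_words = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(n_features)] for _ in range(n_articles)]
    h = [[rng.random() + 0.1 for _ in range(n_words)] for _ in range(n_features)]
    for _ in range(iterations):
        # Features update: h <- h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_words)]
             for i in range(n_features)]
        # Weights update: w <- w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_features)]
             for i in range(n_articles)]
    return w, h


# Sample values copied from Table 4 (weights) and Table 3 (features).
weights = [[12, 0, 0], [0, 10, 3], [2, 5, 0]]
features = [[5, 1, 7, 0], [1, 0, 0, 3], [0, 2, 0, 1]]

# Each article row is a weighted combination of the feature rows.
articles = matmul(weights, features)
print(articles)  # [[60, 12, 84, 0], [10, 6, 0, 33], [15, 2, 14, 15]]

w, h = nmf(articles, n_features=3)
print(sq_error(articles, matmul(w, h)))  # remaining reconstruction error
```

Because the factorization is approximate and depends on the random starting point, the recovered features are not expected to match Table 3 entry for entry; only the product is driven toward the article matrix.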
If the number of features equals the number of articles, the best outcome is one feature that perfectly matches each article. However, the purpose of matrix factorization is to reduce large sets to smaller sets that capture their most common features. Ideally, a smaller set can be combined with different weights to reproduce the original dataset. This is very unlikely in practice, so the algorithm aims to reproduce the original dataset as closely as possible.

Table 3
Features matrix sample.

            Keyword 1  Keyword 2  Keyword 3  Keyword 4
Feature 1   5          1          7          0
Feature 2   1          0          0          3
Feature 3   0          2          0          1

Table 4
Weights matrix sample.

            Feature 1  Feature 2  Feature 3
Article 1   12         0          0
Article 2   0          10         3
Article 3   2          5          0

2.3. User preference analysis

Our proposed system uses browsing histories as raw data to determine user interest scores for individual themes, and a collaborative filtering algorithm to predict interest scores for new themes. Collaborative filtering algorithms examine large groups of individuals, identify sets of people with similar tastes, and create ranked lists of suggestions. The process consists of the four steps shown in Fig. 3. In the first step, when a mobile phone user reads a blog document, M-CRS records the event in a database table with three fields: mobile phone number, browsing time, and blog document URL. To determine user interest in a keyword, it is necessary to count the number of times the word appears in one or more documents.
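The first step can be pictured with a short sketch that stores browsing events with the three fields named above and tallies, per user, how often each keyword appears in the documents that user has read. This is only one plausible reading of the counting step; `BrowseEvent`, `keyword_interest`, the example URLs, and the `doc_keywords` lookup (document URL to extracted keywords) are hypothetical names introduced for illustration, not part of M-CRS as described.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class BrowseEvent:
    """One browsing record: the three fields described in the text."""
    phone_number: str
    browsed_at: datetime
    url: str


def keyword_interest(events, doc_keywords):
    """Tally keyword occurrences across the documents each user has read."""
    interest = defaultdict(Counter)
    for event in events:
        # Counter.update adds one count per keyword occurrence in the document.
        interest[event.phone_number].update(doc_keywords.get(event.url, []))
    return interest


# Hypothetical keyword lists, as if produced by the word-splitting step.
doc_keywords = {
    "http://blog.example/a": ["travel", "food", "travel"],
    "http://blog.example/b": ["food"],
}
events = [
    BrowseEvent("0900000001", datetime(2008, 9, 1, 8, 30), "http://blog.example/a"),
    BrowseEvent("0900000001", datetime(2008, 9, 1, 9, 0), "http://blog.example/b"),
    BrowseEvent("0900000002", datetime(2008, 9, 2, 12, 15), "http://blog.example/b"),
]

interest = keyword_interest(events, doc_keywords)
print(interest["0900000001"].most_common())
```

Raw counts like these would still need to be mapped onto themes and normalized before they could serve as interest scores for collaborative filtering; that mapping is not covered by this sketch.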