D.-R. Liu et al. / Information Sciences 181 (2011) 1552–1572
Since the viewable content on mobile device screens is limited, designing a personalized service for filtering articles is particularly desirable. The m-CCS monitors the click rates of articles daily and logs user viewing records to infer the implicit preferences of mobile users. Without requiring explicit user ratings, the implicit interest of a user in an article is inferred by comparing the time spent reading the article with the average time spent on articles of the same size. The browsing records of users are analyzed to discover their behavior patterns, and their personal preferences are then deduced through a personal favorite analysis (PFA) module. Moreover, the m-CCS predicts a user’s preferred topics by deriving his/her customized popularity degree of topic clusters from the predicted popularity of topic clusters and his/her preferences. Section 5 presents the details of the PFA module.

Finally, the system recommends blog articles based on the customized popularity degree of topic clusters and the preferences of mobile users. The recommended articles are then sent to the user’s mobile device via a WAP Push service. This allows users to instantly receive personalized and relevant blog articles. The proposed recommendation process of the m-CCS mainly integrates content analysis and collaborative filtering to mitigate the shortcomings of pure collaborative filtering (CF), including the sparsity and cold start issues. It draws on aspects such as: (1) the prediction of popular topic clusters of concern to bloggers and readers on the Internet, (2) the prediction of users’ preference scores by item-based collaborative filtering, and (3) the attention degree (click times) of blog articles obtained from Internet users. The detailed descriptions of the recommendation process are presented in Section 6.

In general, the effectiveness of the CF recommendation approach depends mostly on the set of historical data.
There are still potential limitations, such as the sparsity and cold start issues [2,39]. Low-quality recommendations may be delivered due to the sparsity issue, namely when the system has only very few user rating records with which to measure the similarity between users or items. For the cold start issue of new items or new users, the system performs poorly in recommendation because of the lack of records of items viewed by users.

In our research, we focus on mobile users and blog articles. We apply clustering techniques to first group the articles into topic clusters and then form neighborhoods of items from the topic clusters, which can reduce the sparsity problem and improve the scalability of recommender systems. Additionally, many blog articles have not been viewed by any mobile user in our system due to the limitations of mobile devices. This means that most articles that are popular on the Internet and attractive to the masses of Internet users may be ignored in the process of recommendation. Thus, our proposed recommendation approach considers not only mobile users’ preferences concerning the articles that have been pushed to their mobile devices, but also the perspectives of Internet readers to identify the popularity of articles, in order to improve the quality of recommendation.

4. Time-sensitive popularity tracking

In this section, we present a novel approach to predict the trend of time-sensitive popularity of blog topics. We identify the blog topic clusters and their popularity according to the perspectives of writers and readers on the Internet, and then trace the trend of popularity over time. In the following subsections, we illustrate the details of the tracking process shown in Fig. 2.

Fig. 2. Time-sensitive popularity tracking process.

4.1. Forming topic clusters of blog articles

Articles in blogs are free-form and usually contain different opinions, so it is difficult to categorize them into the appropriate categories defined by bloggers. That is to say, the existing categories in a blog website are insufficient to fully represent
the blog. In our research, we use article features, i.e., term-weight vectors, derived from pre-processing to handle blog articles published within a given time window on the Internet. We collect blog articles from blog websites as the training corpus and construct the dictionary by applying a statistical method, the log likelihood ratio, to extract meaningful phrases and terms. In addition, blog articles are trawled every day from blog websites according to the crowed-RSS feeds. Note that the blog training data is periodically updated and retrained to update the dictionary. Significant terms/phrases are extracted from the content of an article according to the dictionary derived from the blog training data. In addition, each article is represented as a term vector by using the tf-idf approach [33] to calculate the weight of term $i$ in an article $j$, as defined in Eq. (5):

$$w_{i,j} = f_{i,j} \times \log\frac{N}{n_i}, \qquad f_{i,j} = \frac{\mathrm{freq}_{i,j}}{\max_l(\mathrm{freq}_{l,j})}, \tag{5}$$

where $N$ is the number of articles; $n_i$ is the number of articles that contain term $i$; $f_{i,j}$ is the normalized frequency of term $i$ in article $j$; $\mathrm{freq}_{i,j}$ is the frequency of term $i$ in article $j$; and $\max_l(\mathrm{freq}_{l,j})$ is the frequency of the term $l$ that has the maximum frequency in article $j$.

The size of the time window is set as seven days. That is, all the articles posted in the past seven days will be categorized and recommended to individual users.

A hierarchical agglomerative algorithm with the group-average clustering approach [16] is applied to implement the clustering step. It first treats each article as a cluster and then successively merges the pair of clusters with the highest cluster similarity. The similarity between two articles can be calculated by means of the cosine similarity measure, as shown in Eq. (6):

$$\mathrm{sim}(d_i, d_j) = \cos(\vec{d_i}, \vec{d_j}) = \frac{\vec{d_i} \cdot \vec{d_j}}{\|\vec{d_i}\|\,\|\vec{d_j}\|}. \tag{6}$$

The cluster similarity between two clusters is defined as the average pairwise similarity of all pairs of articles from different clusters.
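As a minimal sketch of Eqs. (5) and (6), the following Python code computes the term weights and the cosine similarity for a few toy articles. It assumes that each article has already been reduced to a list of dictionary terms, and it uses the natural logarithm, since the base of the logarithm in Eq. (5) is not stated. The function names `tfidf_vectors` and `cosine` and the sample articles are hypothetical and serve only as an illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Weight terms as in Eq. (5): w[i,j] = f[i,j] * log(N / n_i)."""
    n_docs = len(docs)
    # n_i: number of articles that contain term i
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Frequency of the most frequent term in this article
        max_freq = max(counts.values()) if counts else 1
        vectors.append({
            term: (freq / max_freq) * math.log(n_docs / doc_freq[term])
            for term, freq in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity as in Eq. (6), for sparse dict vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)


# Toy articles, already tokenized into dictionary terms (illustrative data)
articles = [
    ["mobile", "blog", "mobile", "push"],
    ["mobile", "blog", "recommend"],
    ["travel", "food", "food"],
]
vecs = tfidf_vectors(articles)
print(cosine(vecs[0], vecs[1]))  # shared terms: positive similarity
print(cosine(vecs[0], vecs[2]))  # no shared terms: 0.0
```

Note that a term occurring in every article receives a weight of zero under Eq. (5), so it does not contribute to the similarity.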
The cluster similarity between two clusters $r_i$ and $r_j$ is calculated by Eq. (7), where $d_i$/$d_j$ is a blog article belonging to the set of blog articles $S_{r_i}$/$S_{r_j}$ in Cluster $r_i$/$r_j$; $|S_{r_i}|$/$|S_{r_j}|$ is the number of blog articles in $S_{r_i}$/$S_{r_j}$; and $\mathrm{sim}(d_i, d_j)$ denotes the cosine similarity between the articles $d_i$ and $d_j$:

$$\mathrm{similarity}(r_i, r_j) = \frac{\sum_{d_i \in S_{r_i}} \sum_{d_j \in S_{r_j}} \mathrm{sim}(d_i, d_j)}{|S_{r_i}|\,|S_{r_j}|}. \tag{7}$$

We stop merging pairs of clusters when the highest cluster similarity falls below a threshold during the merge process. The number of clusters each day is not constant; it depends on the density of the discussed topics. If the density of the topics which people discuss is high, the diversity of the articles is low and the number of clusters decreases.

4.2. Constructing the trend path between clusters belonging to adjacent days

To reveal the trend path, which is used to predict the popularity degree of current clusters, we measure the cluster similarity between the target Cluster $r$ and all the Clusters $pr$ belonging to the preceding period, and then select the preceding cluster with the maximum similarity to construct the link. As blog articles are usually composed of unstructured words, to obtain the similarity between two clusters from two different days, we average the cosine similarity over pairs of articles across the clusters. The similarity between two clusters $(r, pr)$ in adjacent days is calculated by Eq. (7). After establishing the linkages, the trend of each current cluster can be derived from the preceding related cluster. As shown in Fig. 3, all of the clusters receive a trend path from the preceding cluster. The topic of Cluster1 in day 3 evolved from Cluster1 in day 2, and so on, and we can use the relationship and similarity between them to calculate the popularity degree.

[Figure: topic clusters (Cluster1, Cluster2, ...) on day 1 through day 4, linked by trend paths.]
Fig. 3. The trend path of topic clusters.
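The clustering step of Section 4.1 and the linking step of Section 4.2 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: articles are given as sparse term-weight dictionaries, the threshold value 0.5 is arbitrary, and the names `cluster_similarity`, `group_average_clustering`, and `link_to_preceding` are hypothetical. The merge criterion follows Eq. (7) literally, i.e., the average similarity over pairs of articles drawn from two different clusters.

```python
import math


def cosine(u, v):
    """Cosine similarity (Eq. (6)) for sparse dict vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_similarity(a, b, sim=cosine):
    """Eq. (7): average pairwise similarity between two clusters."""
    return sum(sim(x, y) for x in a for y in b) / (len(a) * len(b))


def group_average_clustering(vectors, threshold, sim=cosine):
    """Merge the most similar pair of clusters until the best
    cluster similarity falls below the threshold."""
    clusters = [[v] for v in vectors]  # each article starts as a cluster
    while len(clusters) > 1:
        best, pair = float("-inf"), None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cluster_similarity(clusters[i], clusters[j], sim)
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:
            break
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def link_to_preceding(current, preceding, sim=cosine):
    """For each current cluster, return (index, similarity) of the
    most similar cluster in the preceding period."""
    links = []
    for r in current:
        scores = [cluster_similarity(r, pr, sim) for pr in preceding]
        best = max(range(len(scores)), key=scores.__getitem__)
        links.append((best, scores[best]))
    return links


# Illustrative term-weight vectors for two adjacent days
day2 = [{"phone": 1.0, "app": 0.5}, {"phone": 0.8, "app": 0.7},
        {"food": 1.0, "travel": 0.4}]
day3 = [{"phone": 0.9, "app": 0.6}, {"food": 0.7, "travel": 0.9}]

clusters_day2 = group_average_clustering(day2, threshold=0.5)
clusters_day3 = group_average_clustering(day3, threshold=0.5)
links = link_to_preceding(clusters_day3, clusters_day2)
print(len(clusters_day2), [index for index, _ in links])
```

In this toy data, the two phone-related articles of day 2 merge into one cluster, and each day-3 cluster is linked to the day-2 cluster on the same topic. The pairwise search is quadratic in the number of clusters per merge, which is acceptable for a sketch but would likely need a cached similarity matrix for a seven-day window of real blog articles.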