Expert Systems with Applications 36 (2009) 10388–10396
Journal homepage: www.elsevier.com/locate/eswa
doi:10.1016/j.eswa.2009.01.039

A blog article recommendation generating mechanism using an SBACPSO algorithm

Tien-Chi Huang a, Shu-Chen Cheng b, Yueh-Min Huang a,*

a Department of Engineering Science, National Cheng Kung University, No. 1, Ta-Hsueh Road, Tainan 701, Taiwan, ROC
b Department of Computer Science and Information Engineering, Southern Taiwan University of Technology, No. 1, Nantai St., Yung-Kang City, Tainan 710, Taiwan, ROC

* Corresponding author. Tel.: +886 6 2757575x63336; fax: +886 6 2766549.
E-mail addresses: kylin@easylearn.org (T.-C. Huang), kittyc@mail.stut.edu.tw (S.-C. Cheng), huang@mail.ncku.edu.tw (Y.-M. Huang).

Keywords: Blog article recommendation; Information retrieval; Forgetting rate; SBACPSO; Factor analysis

Abstract

In recent years blog-assisted learning has been used widely in higher education for improving writing and collaboratively sharing work online. However, methods for gathering useful information to be used as auxiliary-learning materials from the multitude of blog articles in the blogosphere have seldom been investigated. This paper proposes an individualized blog article recommendation mechanism to provide quality blog articles that accord with users' learning topics. First, an IR-based technique was applied to extract and score index terms. The top three index terms were then entered into Google's blog search engine to find the raw recommended blog articles. To avoid the situation where frequent topic-changing leads to a deficiency of article data on a specific learning topic, a forgetting rate was employed to simulate the phenomenon of changing learning topics. Subsequently, an extended Serial Blog Article Composition Particle Swarm Optimization (SBACPSO) algorithm was employed to provide optimal recommended materials to users. We evaluated the system's performance to find the appropriate article population size. Finally, user satisfaction regarding both the system and the recommended content was gauged to find the system's limitations and possible improvements. This study is of importance in that it provides users with dynamic blog article recommendation, improved online information discovery skills and opportunities to socialize with other bloggers.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

Weblogs, also known as blogs, have been around for many years and the number of bloggers who post blog articles is growing rapidly. The term 'blog' is both a noun and verb and all blogs are encompassed in the 'blogosphere'. According to Technorati statistics, over 112.8 million blogs have been recorded and over 250 million pieces of tagged social media exist. Over 1.6 million blog entries are updated every day. Additionally, in July 2004, The Register, a British technology news and opinion website, reported that a new blog is created nearly every 5.8 s and more than three blogs are updated per second on average. There has been a gradual increase in awareness that blogs are a connected community and social network. In some cases blogs serve as online journals and, as such, are used more and more frequently as viable educational resources at the postsecondary level (Dron, 2003; Lin, Yueh, & Liu, 2006; Smith, 2007).

Over the last few years there has been a dramatic increase in the number of publications on educational blog research. Most educational blog studies regard blogs as a reflective learning tool that assists students in developing insight and critical thinking skills (Brooks, Nichols, & Priebe, 2003; Fernheimer & Nelson, 2005; Ganley, 2004; Hall & Davison, 2007; Instone, 2005; Stiler & Philleo, 2003; Williams & Jacobs, 2004). Previous research has used educational blog entries published by students to generate auxiliary materials using a PSO-based algorithm, Serial Blog Article Composition Particle Swarm Optimization (SBACPSO). The results showed that this approach can produce quality materials efficiently. The materials generated offered high satisfaction, which was evaluated for interaction, assistance, usability, and flexibility aspects (Huang, Huang, & Cheng, 2008).

Many researchers have also studied methods of extracting useful information from blogs using various machine learning techniques. The PLSA-based (probabilistic latent semantic analysis) approach has been used to determine common themes of blogs, and then generate spatiotemporal life cycle patterns of blogs via time and location information (Mei, Liu, Su, & Zhai, 2006). Results indicated that this method effectively uncovered patterns in blog themes related to the time and location they were created. A similar PLSA-based approach has also been adopted to separate business blogs by topic, which is useful for keyword detection (Chen, Tsai, & Chan, 2008). These contributions focused on high quality information extraction from business and corporate blog entries and on combining blog search engine technology with keyword extraction related to specific topics. Additionally, early research into trend discovery for blogs can be traced back to Glance, Hurst, and Tomokiyo (2004). They used natural language processing (NLP)
algorithms to analyze trends across nearly 100,000 blog entries. Trend searching has been conducted to draw the normalized trend line over time and estimate the buzz of words for given time-related topics. Moreover, in order to evaluate methods for ranking term significance in an RSS feed corpus, three statistical feature selection methods were proposed: $\chi^2$, Mutual Information (MI) and Information Gain (I) (Prabowo & Thelwall, 2006). Although they concluded that $\chi^2$ seems to be the best method of the three, a full human classification might be required for further evaluation of these methods.

The use of blogs in higher education can be traced back to 1999 in Australia (Brisbane Graduate School of Business at the Queensland University of Technology). Williams and Jacobs (2004) found that blogging had the potential to transform teaching and learning. In the United States, many universities have employed blogs as teaching and learning devices as shown in Table 1.

Table 1. Blog applications in American higher education settings.
University | Description
North Dakota State University | Researchers examined whether a motivated blogger could become a better writer in other writing genres and employed personal weblogs to investigate the relationship between motivations and remediate writing genres (Brooks et al., 2003). They concluded that although blogging is not a replacement for writing instruction, it is a worthy writing activity for college courses.
The University of Virginia | A blogging platform was devised that provided teachers with a collaborative social setting in which they were guided to corresponding digital reading archives and could further share their thoughts concerning media and literacy (Bull, Bull, & Kajder, 2003). The students' blogs were used as a response journal. Blog posts triggered reflections, while comments and feedback were delivered via e-mail.
Middlebury College | A weblog was used as a course management tool that delivered all necessary information to students, including the syllabus, links to supplementary online materials, student discussions, and all student work (Ganley, 2004).
East Carolina University | Weblogs were incorporated into an educational leadership graduate course in 2003 with the goal of using them as the main tools for improving the quality of student work with respect to professional conference style presentations and academic article writing (Martindale & Wiley, 2004).
The University of Texas at Austin | Researchers explored the use of weblogs in their classrooms and investigated whether weblogs were able to create an agonistic, deliberative, and collaborative community. The students were encouraged to freely express their opinions and ideas (Fernheimer & Nelson, 2005).

The literature is full of discussions of the uses of blogs in higher education and most of the research comes from Western countries. In Asia it is an area that is comparatively under-researched and under-discussed. Therefore, we will build on previous studies to further investigate blogs in this paper.

As mentioned earlier, numerous studies have examined the educational and technological aspects of blogs. However, to date, there has been little research regarding the most frequently asked questions about blogs: "How can people find my blog articles after I've posted them in my blog?" and "How can I bring the public to my blog?" Since the volume of existing blog posts is so large, and their format is so diverse, answering these questions is very difficult. Therefore, a mechanism that recommends similar blog articles to readers who read blogs with specific topics would be timely and useful. In this paper we present a systematic process for blog article recommendation. First, we employ an information retrieval technique to extract the relevant terms associated with individuals. Subsequently, the memory strength of each relevant term is measured by the forgetting rate. Then, Google's blog search engine is employed to carry out blog information discovery. Finally, the extended SBACPSO algorithm is applied to find the best combination of blog articles for individual recommendations.

This paper is organized as follows. Section 2 outlines the blog article recommendation system architecture. Research methodology is presented in Section 3. Section 4 describes the procedures used in the experiment, the results, and a discussion of the results. Finally, in Section 5, conclusions are drawn and discussed.

2. Individual blog article recommendation architecture

This section outlines the architecture of a mechanism that has the ability to intelligently and automatically recommend blog articles (also called blog entries) according to users' reading and writing behaviors, as depicted in Fig. 1.

Fig. 1. Blog article recommendation system architecture.
In this system, bloggers are regarded as learners or students who receiving recommendations. In the first step, all blog articles are collected in the blog knowledge base, which is a database that records the author, time of posting, title, content, and associated category of each blog article. The commentaries made in each blog article are stored in the blog knowledge base as well. Second, using information retrieval (IR) techniques, the extraction agent extracts individual keywords from blog articles, which consists of the articles read and written by each blogger. Hence, these keywords are highly associated with the individual. The third step employs Google’s blog search engine to search for blog articles associated with of the keywords. The articles are considered raw recommendation materials. Fourth, using the SBACPSO algorithm proposed in our previous study (Huang et al., 2008), the best combination of blog articles is generated. These are the elaborated recommendation materials. Finally, individuals receive the recommendation. This process is asynchronous; when a user logs out of the learning platform, the process begins in the background. The next time he/she accesses the learning platform the system automatically provides recommendations according to the individual’s blog reading and writing habits. It is worth mentioning that contrary to our previous study, which generated an RSS feed for learner subscription, learners in this system obtain the recommended blog articles without subscribing to any feed. The RSS aggregator collects the recommended blog articles and then automatically pushes them into the user interface as an entry without user manipulation. With regards to the execution procedure of the extraction agent, we addressed two factors. The first one was the type of human behavior, and other was forgetting rate of specific keywords. There are three types of human interaction with blog articles: reading, posting (writing), and commenting. 
Although the keywords are likely to appear in articles associated with these three behaviors, we argue that the value of a keyword is more significant to an individual when it appears blog articles written by the individual, as opposed to those read by the individual. Hence, the keywords that appear in the blog articles written were given more weight than those in the blog articles read. We also applied the concept of strength of memory versus the decline of memory retention to model the rate of forgetting with respect to keywords. This concept was proposed by Ebbinghaus (1913) to describe the durability of memory traces in the brain, which have an exponential nature of forgetting. The blog articles explored by Google’s search engine needed to be screened for quality, so the SBACPSO algorithm was applied and revised to find out the quality articles for recommendation. 3. Methodology This section describes the mathematical model used in our blog recommendation mechanism. 3.1. Keyword processing The first step is keyword processing, which includes keyword extraction and scoring. Keyword extraction uses an IR-based technique to extract keywords from individual blog entries in the blog knowledge base. The forgetting rate of specific keywords was considered when modeling the significance of keywords. 3.1.1. Extraction and scoring Blog articles contain a title, content, and comments. This is similar to discussion forums, so we expanded the formal definitions of our previous forum study (Huang, Chen, Kuo, & Jeng, 2008) and added behavior parameters for analyzing different types of blog articles. The formal definitions are as follows: Definition 1 (Blog entry set). E = {e1,e2,...,en} is a blog entry set of n blog entries, where 8ei 2 E; i > 0; n P 0. Definition 2 (Index term set). Let t be the index term. Then a blog entry contains n index terms and can be represented as e ¼ ft1;t2; ... ;tng; where8ti 2 e; i > 0. 
Each index term can be regarded as a specific domain of knowledge in this study. Definition 3 (TFIDF). In the IR vector model, assuming that there are t index terms represented a blog entry ej, and then ej can be expressed as a vector ~ej ¼ ðw1; j; wj; ... ; wt; jÞ, where wi,j is the ith term weight of e ~j in t dimension space. wi,j is a significance of ith index term to the blog entry ej, and it is usually measured by the best known term-weighting schemes as shown in Wi;j ¼ fi;j idfi bi ð1Þ where fi,j is the normalized frequency of index term ti in blog entry ej, and is represented as fi;j ¼ freqi;j maxlfreql;j p ð1 þ rÞ. The parameters contained in the normalized frequency are described as follows: freqi,j, the raw frequency of index term ti appearing in blog entry ej, maxl freql,j, the highest frequency of index term ti in blog entry ej, p, the highest weight of a index term’s position in a blog entry. The value is represented as title area (p = 1.5), main posting area (p = 1.3), and comment area (p = 1.0), r, the reply factor indicates the additional significance of an index term in a blog entry. Detailed descriptions of the parameters p and r can be found in the previous study (Huang et al., 2008). The idf factor means that index terms that appear in many blog entries are not very meaningful for distinguishing a relevant entry from a non-relevant one. In addition to the famous tf*idf equation, each index term is given a behavior parameter (bi) according to the type of a blog entry. The value of behavior parameter is set as follows: bi ¼ 1 if ith index term appears in a read posting 1:25 if ith index term appears in a self-written posting 1:25 if ith index term appears in a comment posting 8 >< >: Behavior parameter b gauges the significance of each index term according to posting type. In read blog entries learners read and learn the index terms, while in written blog entries and comments they apply specific index terms. 
The behavior parameter value of index terms from written postings or comments is therefore higher than that of index terms from read postings.

3.1.2. Rating extracted keywords by forgetting rate

An individual may read blog articles on many different topics, which results in frequent topic changes. The forgetting rate simulates this topic-changing phenomenon in recommendations. According to the formula defined by Ebbinghaus, memory retention is represented as $e^{-t/S}$, where S is the relative strength of memory and t is time. Let $\varphi_{i,k}(t_0)$ be the initial memory strength of the kth learner with regard to the ith index term (domain knowledge). Then $\varphi_{i,k}(t_0 + \Delta t)$ is defined as the remaining memory strength after time $\Delta t$, which is formulated as Eq. (2). Fig. 2 shows the forgetting curves, which illustrate the variation of the remaining memory strength under different parameter sets.

$$\varphi_{i,k}(t_0 + \Delta t) = e^{-\Delta t / S}\, \varphi_{i,k}(t_0) \qquad (2)$$
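The decay in Eq. (2) can be sketched in a few lines of Python. The function name `retention` is hypothetical; the values S = 20, 50, and 100 are the ones named in the caption of Fig. 2, and the initial strength of 1 and the 30-day horizon are illustrative assumptions.

```python
import math


def retention(initial_strength, elapsed_days, S):
    """Remaining memory strength after elapsed_days, following Eq. (2)."""
    return math.exp(-elapsed_days / S) * initial_strength


# A larger relative memory strength S gives a slower decay.
for S in (20, 50, 100):
    print(S, round(retention(1.0, 30, S), 3))
```

At zero elapsed time the function returns the initial strength unchanged, and for any positive elapsed time the remaining strength grows with S.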
Memory strength decreases exponentially with time, and the stronger the relative memory strength is, the slower the decay. If an individual learner recalls the material immediately after learning or studying, he/she will have the highest remaining memory strength. Although the remaining memory strength regarding a specific domain of knowledge gradually decreases with time, it will be strengthened if the learner studies the same or related learning materials again. As mentioned before, learning behaviors include blog entry reading, writing, and commenting. Each form of learning strengthens the memory of a domain of knowledge. However, the intensity of memory strengthening is not equal for each learning method. We defined an intensity factor, a, to describe the intensity difference among these three learning methods. The values of our intensity factor sets and corresponding descriptions are outlined in Table 2. The enhanced effect of memory strength using the intensity factor is proposed on the basis of the behavioral factors suggested by Huang et al. (2008).

With the enhanced effect of memory strength, Eq. (2) can be reformulated to represent how memory strength is maintained (see the following equation):

$$\varphi_{i,k}(t_0 + \Delta t) = e^{-\Delta t / S} \prod_{l=1}^{n} e^{a_l}\, \varphi_{i,k}(t_0) \qquad (3)$$

where n is the number of blog entries in which the ith index term appears and $a_l$ is the intensity factor (Table 2) of the lth such entry. Each index term for a learner is given a score calculated as the product of the results of Eqs. (1) and (3). In order to avoid an overwhelming information inflow, only the top three index terms are selected as query terms. They are provided to the blog search engine introduced in the next subsection.

3.2. Blog search query

The extracted keywords are regarded as learning elements associated with the learner. Hence, conducting a keyword-based search on the web can retrieve all content containing the keywords. Similar to common search engines, blog search engines are given keywords and then retrieve keyword-related content.
The main difference between the two kinds of search engines is that blog search engines mainly index blogs and ignore the rest of the web. According to the results of Thelwall and Hasler's (2007) study, Google's blog search engine (http://blogsearch.google.com) not only covered a great number of blog articles associated with various topics, but also returned the least-spammed blog articles. In addition, other studies have investigated tourism blogs and collected blog documents using Google's blog search engine (Li, Liu, & Yu, 2006; Nalin & Mohan, 2008).

In this study, Google's blog search engine was used to retrieve the related blog content as raw recommended materials. The three highest-scoring index terms for the learner are entered into the search engine. Thus, the search range is small in scope so that the results do not diverge. The retrieved blog articles are stored in our blog knowledge base, which is organized by author, date, title, content, comment count, and trackback count. Furthermore, the number of external sources that link to an article is considered and calculated as one of the categorized factors.

3.3. Model design for blog article composition with SBACPSO algorithm

The raw recommendation materials returned by the blog search engine are processed according to several criteria. In our previous study, we demonstrated that a PSO-based algorithm, SBACPSO, can evaluate blog entries according to the quality of their auxiliary learning materials (Huang et al., 2008). In this study, we propose an extended model based on the four indicators of the SBACPSO algorithm. The linked factor, which refers to the number of external sources that link to the blog article, has been included in the extended model. This factor is used to determine whether a blog article has significant reference value according to other online resources.
Additionally, the linking factor, which evaluates the number of links in a blog article that point to other online resources, is also used to estimate the information value of the blog article. The difference between the two factors is this: the linked factor examines the number of incoming links, while the linking factor examines the number of outgoing links. Since the blog entries are collected from the web, it is difficult to analyze their difficulty automatically, so we are not concerned here with the difficulty of blog entries.

Fig. 2. Ebbinghaus' forgetting curve with different parameter sets (t = 1, S = 20, 50, and 100).

Table 2. Learning behavior and corresponding intensity factor value.
Learning behavior description | Value of intensity factor (a)
Reading a blog entry related to a specific domain of knowledge | 0.08
Writing (posting) a blog entry related to a specific domain of knowledge | 0.1
A comment mentioning a specific domain of knowledge | 0.1

The formal definitions of the variables used in the objective function are as follows:

N: the number of blog articles in a search result,
$ol_i$, $1 \leq i \leq N$: the number of outgoing links in the ith blog article,
$il_i$, $1 \leq i \leq N$: the number of incoming links to the ith blog article,
$r_i$, $1 \leq i \leq N$: the association degree between the ith blog article and the searched keywords,
$c_i$, $1 \leq i \leq N$: the number of comments posted in reply to the ith blog article,
$t_i$, $1 \leq i \leq N$: the number of trackbacks to the ith blog article,
$x_i$, $1 \leq i \leq N$: the decision variable, set to one if the ith blog article is selected in a composition and to zero otherwise,
l, u: respectively the lower and upper bounds of the expected comment level for each set of blog articles searched,
h: the lower bound of the expected relevance of the searched topic,
p: the lower bound of the expected total number of outgoing links in the search result,
q: the lower bound of the expected total number of incoming links in the search result,
M(x): a double sigmoid form function mapping an integer number x into the real unit interval [0, 1],
C(x): a membership function mapping the number of outgoing links into a degree.

We apply a double sigmoid form function M(x) to map the number of incoming links x into the real unit interval [0, 1]. The closed form is

$$M(x) = \mathrm{sign}(x)\left(1 - \exp\left(-\left(\frac{x}{s}\right)^{2}\right)\right).$$

The parameter s is a control variable designed to control the increasing rate of the curve. The value of this parameter can be set according to the recommendation quality. Fig. 3 depicts two different results using two control variables. It shows that the smaller the control variable is, the steeper the slope will be.

The membership function C(x) is used to map the number of outgoing links of a blog article into the real unit interval [0, 1]. The form of this function, as defined by Chen, Huang, and Chu (2005), is as follows:

$$C(x) = \begin{cases} 0, & x \leq x_1 \\ \dfrac{x - x_1}{x_2 - x_1}, & x_1 < x < x_2 \\ 1, & x \geq x_2 \end{cases} \qquad (4)$$

where $x_1$ and $x_2$ respectively indicate the lower and upper bounds of the number of outgoing links.

In addition, in order to transform numeric comment data into a symbol, we apply a fuzzy concept to fuzzify the comment data. The comment number of each blog article is fuzzified into one of three levels using the three membership functions shown in Fig. 4, where z denotes the ratio of a blog article's comment number to the maximum comment number in the search result. The three membership functions are listed as Eqs. (5)–(7). The results L, M, and H respectively denote "Low comment number", "Moderate comment number", and "High comment number". A value is assigned to each result (L = 0, M = 0.5, and H = 1). For example, if the comment number of blog article b is 45 and the maximum comment number in the search result is 150, the ratio for blog article b is 0.3 (= 45/150).
Hence, blog article b gets the fuzzy value (0.5, 0.5, 0). Afterwards, a maximum function MAX(L, M, H) is used to determine which comment level a blog article is assigned. If the fuzzy values of two levels are equal, the higher comment level is assigned. In the example, blog article b is tagged with comment level M.

$$f_1(z) = \begin{cases} 1, & z \leq 0.2 \\ \dfrac{0.4 - z}{0.2}, & 0.2 < z < 0.4 \\ 0, & z \geq 0.4 \end{cases} \qquad (5)$$

$$f_2(z) = \begin{cases} 0, & z \leq 0.2 \text{ or } z \geq 0.6 \\ \dfrac{z - 0.2}{0.2}, & 0.2 < z \leq 0.4 \\ \dfrac{0.7 - z}{0.3}, & 0.4 \leq z \leq 0.7 \end{cases} \qquad (6)$$

$$f_3(z) = \begin{cases} 0, & z \leq 0.4 \\ \dfrac{z - 0.4}{0.2}, & 0.4 < z < 0.7 \\ 1, & z \geq 0.7 \end{cases} \qquad (7)$$

The trackback number depends on how many people manually refer to a blog article, which is regarded as active behavior. Therefore, the larger the number of trackbacks, the more useful the article is. The main goal of the objective function is to maximize the average trackback number, and the objective function is defined as

$$\text{Maximize } Z(x = x_1, x_2, \ldots, x_N) = \frac{\sum_{i=1}^{N} t_i x_i}{\sum_{i=1}^{N} x_i}$$

which is subject to the following constraints.

Constraint 1. $\sum_{i=1}^{N} r_i x_i \geq h$
Constraint 2. $\sum_{i=1}^{N} C(ol_i)\, x_i \geq p$
Constraint 3. $\sum_{i=1}^{N} M(il_i)\, x_i \geq q$
Constraint 4. $l \leq \sum_{i=1}^{N} MAX(L_i, M_i, H_i)\, x_i \leq u$

These constraints are used to revise the objective function so that the candidate solutions meet all constraints. This idea originated from a multiple criteria study (Hwang, Yin, & Yeh, 2006). Each solution that violates the constraints is given a penalty as presented in Table 3. With the penalty terms, the fitness function is given as

$$O(x) = \frac{\sum_{i=1}^{N} t_i x_i}{\sum_{i=1}^{N} x_i} - \omega_1 \alpha - \omega_2 \beta - \omega_3 \gamma - \omega_4 \lambda$$

where $\omega_1$–$\omega_4$ indicate the relative weights of the penalty terms $\alpha$, $\beta$, $\gamma$, and $\lambda$. After this, the SBACPSO algorithm, which consists of five steps, is employed to generate the best set of blog articles.

Input: N searched blog articles of a personal interest topic.
Output: The best set of blog articles for personal recommendation.

3.3.1. Step 1. Initial swarm generation

The first step is to encode the presence of blog articles in a search result as a particle.
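A particle encoded this way can be scored with the penalty-based fitness function defined above. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the names `incoming_map`, `outgoing_map`, and `fitness`, the dictionary keys, and the default values of s, x1, and x2 are all hypothetical. Each article's comment level (0, 0.5, or 1) is taken as a precomputed input, and the first three penalty terms are clamped at zero so that a satisfied constraint adds no reward, which is an assumption on top of the paper's penalty definitions.

```python
import math


def incoming_map(x, s):
    """M(x) = sign(x) * (1 - exp(-(x / s)^2)) for an incoming-link count x."""
    sign = (x > 0) - (x < 0)
    return sign * (1.0 - math.exp(-((x / s) ** 2)))


def outgoing_map(x, x1, x2):
    """C(x) of Eq. (4): 0 up to x1, linear ramp, then 1 from x2 onward."""
    if x <= x1:
        return 0.0
    if x >= x2:
        return 1.0
    return (x - x1) / (x2 - x1)


def fitness(x, articles, bounds, weights, s=30, x1=0, x2=10):
    """Average trackback of the selected articles minus weighted penalties."""
    chosen = [a for a, xi in zip(articles, x) if xi == 1]
    if not chosen:
        # Assumption: an empty composition is never an acceptable solution.
        return float("-inf")
    avg_trackback = sum(a["t"] for a in chosen) / len(chosen)
    relevance = sum(a["r"] for a in chosen)
    outgoing = sum(outgoing_map(a["ol"], x1, x2) for a in chosen)
    incoming = sum(incoming_map(a["il"], s) for a in chosen)
    level = sum(a["level"] for a in chosen)
    # Penalty terms; the clamp on the first three is an assumption.
    alpha = max(0.0, bounds["h"] - relevance)
    beta = max(0.0, bounds["p"] - outgoing)
    gamma = max(0.0, bounds["q"] - incoming)
    lam = max(bounds["l"] - level, 0.0) + max(0.0, level - bounds["u"])
    w1, w2, w3, w4 = weights
    return avg_trackback - w1 * alpha - w2 * beta - w3 * gamma - w4 * lam


# Illustrative data: two articles and one particle selecting only the first.
articles = [
    {"t": 4, "r": 0.9, "ol": 5, "il": 40, "level": 0.5},
    {"t": 0, "r": 0.2, "ol": 0, "il": 0, "level": 0.0},
]
bounds = {"h": 0.5, "p": 0.3, "q": 0.5, "l": 0.0, "u": 1.0}
print(fitness([1, 0], articles, bounds, (1.0, 1.0, 1.0, 1.0)))
```

In this example the first article alone satisfies all four constraints, so its fitness equals its trackback count, whereas selecting only the second article incurs penalties on relevance, outgoing links, and incoming links.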
Fig. 3. The incoming link mapping function M(x) using two different control variables.

Fig. 4. The three membership functions of each blog article's comments.

Table 3. Formula expression of each penalty term.
Penalty term 1 | $\alpha = h - \sum_{i=1}^{N} r_i x_i$
Penalty term 2 | $\beta = p - \sum_{i=1}^{N} C(ol_i)\, x_i$
Penalty term 3 | $\gamma = q - \sum_{i=1}^{N} M(il_i)\, x_i$
Penalty term 4 | $\lambda = \max\left(l - \sum_{i=1}^{N} MAX(L_i, M_i, H_i)\, x_i,\ 0\right) + \max\left(0,\ \sum_{i=1}^{N} MAX(L_i, M_i, H_i)\, x_i - u\right)$

The particle is represented by an N-dimensional vector, $[x_1, x_2, \ldots, x_N]$. As mentioned before, if the