Development of a recommender system based on navigational and behavioral patterns of customers in e-commerce sites

Yong Soo Kim, Bong-Jin Yum*, Junehwa Song, Su Myeon Kim

Korea Advanced Institute of Science and Technology, 373-1 Gusung-Dong, Yusung-Gu, Daejon 305-701, South Korea

* Corresponding author. Address: Department of Industrial Engineering, Korea Advanced Institute of Science and Technology, 373-1 Gusung-Dong, Yusung-Gu, Daejon 305-701, South Korea. Tel.: +82 42869 3116; fax: +82 42869 3110.
E-mail addresses: yskim95@kaist.ac.kr (Y.S. Kim), bjyum@kaist.ac.kr (B.-J. Yum), junesong@kaist.ac.kr (J. Song), sumyeon@kaist.ac.kr (S.M. Kim)

Expert Systems with Applications 28 (2005) 381–393
www.elsevier.com/locate/eswa
0957-4174/$ - see front matter © 2004 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2004.10.017

Abstract

In this article, a novel CF (collaborative filtering)-based recommender system is developed for e-commerce sites. Unlike the conventional approach in which only binary purchase data are used, the proposed approach analyzes the data captured from the navigational and behavioral patterns of customers, estimates the preference levels of a customer for the products which are clicked but not purchased, and CF is conducted using the preference levels for making recommendations. This also compares with the existing works on clickstream data analysis in which the navigational and behavioral patterns of customers are analyzed for simple relationships with the target variable. The effectiveness of the proposed approach is assessed using an experimental e-commerce site. It is found among other things that the proposed approach outperforms the conventional approach in almost all cases considered. The proposed approach is versatile and can be applied to a variety of e-commerce sites as long as the navigational and behavioral patterns of customers can be captured.
© 2004 Elsevier Ltd. All rights reserved.

Keywords: Recommender system; Collaborative filtering; E-commerce; Preference level

1. Introduction

Personalized services for individual customers are now popular in e-commerce sites. Properly designed and well-executed personalized services enable e-commerce companies to capture the unique needs and preferences of individual customers, help them build customer loyalty, and thereby, strengthen their competitiveness in the marketplace.

A recommender system is a typical software solution used in e-commerce for personalized services (Berson, Smith, & Thearing, 2000; Lawrence, Almasi, Korlyar, Viveros, & Duri, 2001; Sarwar, Karypis, Konstan, & Riedl, 2000; Yuan & Chang, 2001). It helps customers find the products they would like to purchase by providing recommendations based on their preferences, and is particularly useful in e-commerce sites that offer millions of products for sale.

There are two paradigms for recommender systems, namely, collaborative filtering (CF) and content-based filtering (CBF). CF recommends products based on the similarity of the preferences of a group of customers known as a neighbor (Hill, Stead, Rosenstein, & Furnas, 1995; Resnick, Iacovou, Suchak, Bergstrom, & Riedle, 1994; Shardanand & Maes, 1995). On the other hand, CBF recommends products to a customer based on the products’ similarity to the customer’s past or historical preferences (Basu, Hirsh, & Cohen, 1998; Krulwich & Burkey, 1996; Lang, 1995). Therefore, CBF may not be suitable for recommending such products as music, art, movie, audio, photograph, video, etc. which are frequently sold in e-commerce sites since these products may not be easily analyzed for relevant attributive information (Balabanovic & Shoham, 1997; Shardanand & Maes, 1995). For this reason, CF is adopted in the present study which deals with recommendations in e-commerce sites.

Conventional CF is known to work well for the case where customers show their preferences for specific products in an explicit manner (e.g. rating movies). However, CF usually does not work well with binary data
(e.g. ‘purchase’ or ‘no purchase’ data) which are typical of e-commerce data (Hayes, Cunningham, & Smyth, 2001). To overcome this problem, recent studies proposed methods that relate the customers’ navigational and behavioral patterns with their preferences (Claypool, Le, Wased, & Brown, 2001; Kelly & Belkin, 2001; Lee, Podlaeck, Schonberg, & Hoch, 2001; Lee, Podlaeck, Schonberg, Hoch, & Gomory, 2000; Morita & Shinoda, 1994; Nichols, 1997; Rafter & Smyth, 2001). Instead of explicitly acquiring the customers’ ratings for specific products, these ‘implicit ratings’ methods passively monitor the navigational and behavioral patterns of customers (Nichols, 1997) and derive their preference levels (i.e. implicit ratings) by analyzing the clickstream data which represent the navigational and behavioral patterns of the customers (Claypool et al., 2001; Kelly & Belkin, 2001; Rafter & Smyth, 2001). In addition, several authors presented detailed case studies of the clickstream data analysis from various e-commerce sites (Lee et al., 2000, 2001). In their studies, customers’ shopping patterns (e.g. product impression, click-through, basket placement, and purchase) are analyzed, and the so-called micro-conversion rate for each adjacent pair of parameters is computed to assess the effectiveness of web merchandising. For a review and classification of various implicit measures of customer interests, the reader is referred to Kelly and Teevan (2003) or Oard and Kim (2001).

The existing works on implicit ratings mainly consider a simple correlation between a behavioral or a navigational parameter (e.g. length of reading time, number of visits, bookmarking variable, etc.) and the target variable (e.g. purchase/no-purchase variable). They are limited in predicting the target variable in that the observed implicit parameters are not considered in a simultaneous manner.

In this article, we extend the existing methods of implicit ratings and further develop a recommender system.
The system provides a framework to analyze the inter-relationship between different behavioral and/or navigational parameters and to numerically determine customers’ preference levels from their behavioral and navigational patterns. Moreover, it can quantitatively predict the target variable from those parameters.

The proposed method consists of the following four phases. First, the data related to a customer’s purchase, navigational, and behavioral patterns are collected. Second, the customer’s preference for a certain product is numerically determined. If the product is purchased, the corresponding preference level is set to 1. If the product is clicked but not purchased, then the preference level is determined by estimating the probability of reaching the point of purchase using the data gathered in the first phase. This estimation is carried out using decision tree (DT) analysis, logistic regression (LR) analysis, or an artificial neural network (ANN). Third, CF is performed using the preference levels calculated in the second phase as the input values, and the preference levels of a customer for the products not clicked are predicted. Finally, a Top-N list of products is generated as a recommendation to the customer.

To illustrate and assess the effectiveness of the proposed approach, an empirical study was conducted by constructing an experimental e-commerce site for compact disc (CD) albums. It was opened to the students of Korea Advanced Institute of Science and Technology (KAIST) for a period of 50 days. The relative performance (i.e. prediction accuracy) of the proposed recommender system was then compared with that of the conventional system in which only the binary purchase data are used. The results of this experimental study clearly show that the proposed method using the preference data is superior to the conventional method using only the binary purchase data.
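The third and fourth phases described above can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: the similarity is a constrained Pearson correlation taken around an assumed midpoint of 0.5 on the [0, 1] preference scale, the prediction is an assumed similarity-weighted average over positively correlated customers, and the names `cpc`, `top_n`, and all data values are hypothetical.

```python
import math

MIDPOINT = 0.5  # assumption: centre of the [0, 1] preference scale


def cpc(u, v):
    """Constrained Pearson correlation over products both customers clicked."""
    common = [k for k in u if k in v]
    num = sum((u[k] - MIDPOINT) * (v[k] - MIDPOINT) for k in common)
    den = math.sqrt(
        sum((u[k] - MIDPOINT) ** 2 for k in common)
        * sum((v[k] - MIDPOINT) ** 2 for k in common)
    )
    return num / den if den else 0.0


def top_n(target, others, n):
    """Rank products the target has not clicked by predicted preference."""
    scores, weights = {}, {}
    for prefs in others:
        sim = cpc(target, prefs)
        if sim <= 0:
            continue  # assumption: only positively correlated customers count
        for product, level in prefs.items():
            if product not in target:
                scores[product] = scores.get(product, 0.0) + sim * level
                weights[product] = weights.get(product, 0.0) + sim
    predicted = {k: scores[k] / weights[k] for k in scores}
    return sorted(predicted, key=predicted.get, reverse=True)[:n]


# Hypothetical preference levels (product -> level in [0, 1])
target = {"CD1": 1.0, "CD2": 0.15}
others = [
    {"CD1": 1.0, "CD2": 0.2, "CD3": 0.82},
    {"CD1": 0.1, "CD2": 1.0, "CD4": 1.0},
]
print(top_n(target, others, 2))
```

In this toy data only the first of the other customers is positively correlated with the target, so CD3 is the only product that receives a predicted preference level.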
In the performance study, we use the F1 value (Sarwar et al., 2000) as the metric and come up with some additional findings: (i) constrained Pearson correlation coefficient (CPC) as a similarity measure performs consistently better than Pearson correlation coefficient and/or Jaccard coefficient for both approaches; (ii) if CPC is used, then the proposed approach outperforms the conventional approach in almost all cases considered; and (iii) the proposed approach performs best when LR is used for predicting the preference levels, CPC is used as a similarity measure, and the size of recommendation is ‘small’.

The rest of this article is organized as follows. In Section 2, details of the proposed method are presented, and the results of the experimental study are described in Section 3. Finally, Section 4 presents the conclusion and future research directions.

2. Proposed recommender system

2.1. Captured data from e-commerce sites

The proposed recommender system is developed based on the customers’ navigational and behavioral patterns in e-commerce sites. Navigational patterns include browsing, searching, product click, basket placement, and actual purchase, while behavioral patterns consist of the click ratio for a certain type of product, length of reading time spent on a specific product, number of visits to a specific product, printing, and bookmarking. Although the proposed system is developed using an experimental e-commerce site as an example, it can be applied to a variety of e-commerce sites as long as the above navigational and behavioral patterns can be captured.

The product taxonomy in an e-commerce site generally has a hierarchical structure. For instance, Fig. 1 shows such a hierarchical structure for the experimental e-commerce site used in the present study. More specifically, there are seven genres at Level 1, and each genre has 3–8 different types of CDs at Level 2. Finally, each type at Level 2 has about 20–1000 different CDs.
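The click ratio listed among the behavioral patterns is defined in Table 1 as the number of products clicked by a customer in the same category as the product in question, divided by the total number of products clicked by that customer. A minimal sketch of that computation for one customer follows; the function name `click_ratios` and the genre labels are illustrative assumptions.

```python
from collections import Counter


def click_ratios(clicked_categories):
    """Click ratio of each clicked product for a single customer.

    clicked_categories holds one category label per clicked product,
    taken at a fixed taxonomy level (Level 1 genre or Level 2 type).
    """
    total = len(clicked_categories)
    counts = Counter(clicked_categories)
    # Share of the customer's clicks that fall in the product's own category
    return [counts[c] / total for c in clicked_categories]


# Hypothetical customer who clicked three CDs: two Rock and one Classic
genres = ["Rock", "Rock", "Classic"]
print([round(r, 2) for r in click_ratios(genres)])  # [0.67, 0.67, 0.33]
```

The same function serves both taxonomy levels; only the labels passed in change.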
Fig. 2 illustrates possible actions and steps that customers can take in an e-commerce site, ranging from the point of logging-in to the web site to the point of actual purchase of a product. It also indicates the possible data that can be gathered from these actions.

After logging-in to the web site, a customer can either browse through the site just to check whether there are interesting products or intentionally search for a specific product to purchase. When the customer clicks a product, he or she will be provided with specific information. Then, the customer can either print or bookmark the page as a reference for a future purchase or compare the details of the product with other available goods. Other important information that can be obtained from the customer’s actions within the site includes: (i) the time it takes for the customer to read about a specific product (length of reading time); (ii) the number of visits to a specific product (number of visits); and (iii) the category to which the product belongs. A product that is frequently viewed and read can be surmised to be a popular product. Furthermore, products in a certain category with a high click ratio can also be considered popular. For instance, if the click ratio for the Classic CDs is higher than that for the Rock CDs at Level 1 in Fig. 1, this could mean that the customer enjoys classic music more than rock.

Table 1 shows the parameters which describe the behavioral and navigational patterns of a customer in the experimental e-commerce site. Then, for each customer who visits the site and clicks at least one product, the corresponding parameter values are captured and summarized as shown in Table 2. In Table 2, a ‘case’ corresponds to a product clicked. Note that several cases may exist for a customer. Hereafter, the term ‘customer’ is used to represent a customer who visits the site and clicks at least one product.

Fig. 1. Product taxonomy of experimental CD e-commerce site.

Fig. 2. Possible actions that can be taken by customers in e-commerce sites and possible data that can be obtained from such actions.

Table 1
Data collected from the experimental e-commerce site

Parameters: Descriptions
Click type: Binary variable: searching = 1; browsing = 0
Number of visits: Discrete variable
Length of reading time: Continuous variable (s)
Print status: Binary variable: print = 1; no print = 0
Bookmarking status: Binary variable: bookmarking = 1; no bookmarking = 0
Level 1 click ratio (genre): Continuous variable defined for each product k clicked by customer i. Let j be the category (at Level 1) to which product k belongs. Then, Level 1 click ratio for product k = (Total number of products clicked by customer i that belong to category j at Level 1)/(Total number of products clicked by customer i)
Level 2 click ratio (specific type): Continuous variable defined for each product k clicked by customer i. Let j be the category (at Level 2) to which product k belongs. Then, Level 2 click ratio for product k = (Total number of products clicked by customer i that belong to category j at Level 2)/(Total number of products clicked by customer i)
Basket placement status: Binary variable: basket placement = 1; no basket placement = 0
Purchase status: Binary variable: purchase = 1; no purchase = 0

2.2. Proposed methodology

The proposed methodology consists of the following four phases.

Phase I All the data related to the purchase, navigational, and behavioral patterns are gathered as shown in Tables 1 and 2. Descriptive statistics are also calculated and analyzed.
Phase II For each customer, the preference level of a product which is clicked but not purchased is estimated (the preference level of a purchased product is set to 1).
Phase III CF is performed using the preference levels in Phase II as input values, and the preference levels of a customer for the products not clicked are predicted.
Phase IV After making a Top-N list, recommendations are made to each customer.

In Phase II, the preference level of a product which is clicked but not purchased is estimated according to the following three steps.

(1) Estimation of the probability of purchase after basket placement (p):

\[
p = \frac{\text{Total number of cases in which product is purchased}}{\text{Total number of cases in which product is placed in basket}}
\]

(2) Estimation of the probability of basket placement for a product which is clicked but not placed in the basket (b): In the case where a clicked product is not placed in the basket, the probability that the product would be purchased is difficult to estimate by simply using the parameters in Table 1. In this case, the probability that the product would be placed in the basket after being clicked (b) is first estimated. This is done using DT analysis, ANN, or LR analysis. In these analyses, basket placement status is considered as the target variable, while all the other variables, excluding the purchase status, are regarded as input variables. In DT analysis, the probabilities of reaching basket placement are estimated by following the paths of the constructed tree. In ANN or LR analysis, the probabilities of reaching basket placement are determined as the predicted values.

(3) Determination of the preference level of a product which is clicked but not purchased for each customer: The preference level of a product which is placed in the basket but not purchased is set to p. On the other hand, the preference level of the product which is clicked but not placed in the basket is set to (b × p).

In Phase III, CF is conducted using the preference levels determined in Phase II as input values. In a conventional recommender system, only the purchase status is used for CF. In other words, only 0’s (no purchase) and 1’s (purchase) are used as input data (refer to Fig. 3(a)).
In the proposed approach, however, the probability of reaching the point of purchase is estimated for a product clicked by a customer. Therefore, values between 0 and 1 are used as input data for the proposed CF (refer to Fig. 3(b)). In Fig. 3, blank cells indicate that the corresponding products are not clicked.

Fig. 3. 'Customer–product preference level matrix' for CF: conventional vs. proposed recommender systems.

3. Experimental evaluation

3.1. Data sets

The experimental e-commerce site was opened to the students of KAIST for a period of about 50 days. Among the 2465 albums that were actually clicked by the customers (i.e. among the 2465 cases observed), 338 albums were purchased. An example data set is shown in Table 2.

Table 2
Structure of collected data (example)

| Case | Customer | CD | Click type | Length of reading time | No. of visits | Level 1 ratio | Level 2 ratio | Basket placement | Purchase |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | A | 1 | 49 | 2 | 0.67 | 0.33 | 1 | 1 |
| 2 | 1 | B | 1 | 15 | 1 | 0.67 | 0.33 | 1 | 0 |
| 3 | 1 | C | 0 | 4 | 1 | 0.33 | 0.33 | 0 | 0 |
| 4 | 2 | A | 0 | 6 | 1 | 0.75 | 0.50 | 0 | 0 |
| 5 | 2 | C | 0 | 8 | 1 | 0.75 | 0.50 | 0 | 0 |
| 6 | 2 | D | 1 | 12 | 1 | 0.25 | 0.25 | 1 | 1 |
| 7 | 2 | E | 0 | 6 | 1 | 0.25 | 0.25 | 0 | 0 |
| ⋮ | | | | | | | | | |
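The three steps of Phase II and the resulting CF input matrix can be illustrated on the seven example cases of Table 2. The following is only a minimal sketch under stated assumptions: the basket-placement probabilities in `ASSUMED_B` are invented for illustration (in the study they would be the outputs of a DT, ANN, or LR model), purchased products are assumed to take the preference level 1, and the names `purchase_given_basket` and `preference_matrix` are hypothetical.

```python
from collections import defaultdict

# Rows of the Table 2 example: (customer, cd, basket_placement, purchase).
CASES = [
    (1, "A", 1, 1),
    (1, "B", 1, 0),
    (1, "C", 0, 0),
    (2, "A", 0, 0),
    (2, "C", 0, 0),
    (2, "D", 1, 1),
    (2, "E", 0, 0),
]

# ASSUMPTION: made-up values of b for the clicked-only cases. In the paper
# these are predicted by a DT, ANN, or LR model, which is not reproduced here.
ASSUMED_B = {(1, "C"): 0.10, (2, "A"): 0.20, (2, "C"): 0.15, (2, "E"): 0.05}


def purchase_given_basket(cases):
    """Step 1: p = (# cases purchased) / (# cases placed in the basket)."""
    in_basket = sum(1 for _, _, basket, _ in cases if basket)
    purchased = sum(1 for _, _, basket, bought in cases if basket and bought)
    return purchased / in_basket


def preference_matrix(cases, b_estimates):
    """Step 3: build the customer-product preference levels used as CF input."""
    p = purchase_given_basket(cases)
    matrix = defaultdict(dict)
    for customer, cd, basket, bought in cases:
        if bought:
            level = 1.0  # ASSUMPTION: purchased products keep the value 1
        elif basket:
            level = p  # placed in the basket but not purchased
        else:
            level = b_estimates[(customer, cd)] * p  # clicked only: b * p
        matrix[customer][cd] = level
    return p, dict(matrix)


p, prefs = preference_matrix(CASES, ASSUMED_B)
for customer, row in prefs.items():
    print(customer, {cd: round(level, 3) for cd, level in row.items()})
```

Products a customer never clicked are simply absent from that customer's row, which corresponds to the blank cells of Fig. 3. On these seven cases p is 2/3, which differs from the value obtained from the full data set.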
Since there were very few cases of printing or bookmarking, these parameters were excluded in the subsequent analyses.

3.2. Descriptive statistics: Phase I

The influence of the navigational and behavioral patterns of customers on the product purchase is first analyzed. That is, the relationship between the purchase status and each of the other parameters is evaluated as shown in Tables 3–8.

The probability of purchase after basket placement (i.e. p) is calculated as 0.82 (= 338/412) (see Table 3). This probability is relatively high, which confirms the results of the previous studies (Lee et al., 2000, 2001). As described in Phase II of the proposed approach, the preference level of a product which is placed in the basket but not purchased is set to 0.82.

Table 4 shows the relationship between the product click type and product purchase status. When a product is clicked after being searched by a customer, the probability of its being purchased is estimated as 0.316 (= 218/689). However, when a product is clicked after browsing through the site, it is only 0.068 (= 120/1776). Based on these results, we may conclude that the products clicked after searching have higher preference levels than the ones clicked after browsing.

Table 5 presents the relationship between the number of visits and product purchase status. The probability of purchasing a product after the first click is 0.076 (= 136/1800). After the second click, it becomes 0.295 (= 113/383). For the case where the web page for a certain CD is clicked more than twice, the probability of purchase turns out to be 0.316 (= 89/282). These results also confirm our intuition that the more frequently a product is visited, the higher the probability of its being purchased.

Table 6 compares the average reading times of the purchased and not purchased products.
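As a quick check on the purchase probabilities quoted for Tables 3 to 5, the sketch below recomputes them from the published cell counts. The dictionary layout and the function name `purchase_rates` are illustrative choices and are not part of the original study.

```python
# Cell counts copied from Tables 3-5: (purchases, total cases) per column.
TABLE_3_BASKET = {
    "basket placement": (338, 412),
    "no basket placement": (0, 2053),
}
TABLE_4_CLICK_TYPE = {
    "searching": (218, 689),
    "browsing": (120, 1776),
}
TABLE_5_VISITS = {
    "1 visit": (136, 1800),
    "2 visits": (113, 383),
    "3 or more visits": (89, 282),
}


def purchase_rates(table):
    """Return the share of purchased cases within each column of a table."""
    return {column: bought / total for column, (bought, total) in table.items()}


rates_basket = purchase_rates(TABLE_3_BASKET)
rates_click = purchase_rates(TABLE_4_CLICK_TYPE)
rates_visits = purchase_rates(TABLE_5_VISITS)

for name, rates in [
    ("Table 3", rates_basket),
    ("Table 4", rates_click),
    ("Table 5", rates_visits),
]:
    print(name, {column: round(rate, 3) for column, rate in rates.items()})
```

Each table's column totals add up to the 2465 observed cases and its purchase counts add up to 338, so the three tables are different breakdowns of the same data.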
If a product is visited more than once, the reading times for all visits are summed up for the product, and therefore, the total length of reading time for a product increases as more visits are made. This was done to verify the hypothesis that a customer would take his or her time to carefully read the detailed description of a product before purchasing. A t-test with unequal variances is performed since the hypothesis of equal variances in the samples 'purchase' and 'no purchase' is rejected at the 5% significance level. The result of the t-test shows that the difference between the average reading times of the purchased and not purchased products is statistically significant at the 5% significance level (see the value of the least significance probability, Pr > |t|), from which we may infer that a longer reading time may indicate a higher probability of purchase.

Table 7 shows the hypothesis test results on the difference between the average Level 1 (genre) click ratios for the purchased and not purchased CDs. Similarly, Table 8 shows the hypothesis test results for the Level 2 (specific type) click ratios.

In the case of Level 1 click ratios, a t-test with equal variances is used since the hypothesis of equal variances is not rejected at the 5% significance level. However, in the case of Level 2 click ratios, a t-test with unequal variances is used since the hypothesis of equal variances is rejected at the 5% significance level.

Table 3
Basket placement vs. purchase status

| | Basket placement | No basket placement | Total |
|---|---|---|---|
| Purchase | 338 | 0 | 338 |
| No purchase | 74 | 2053 | 2127 |
| Total | 412 | 2053 | 2465 |

Table 4
Click type vs. purchase status

| | Product clicked through searching | Product clicked through browsing | Total |
|---|---|---|---|
| Purchase | 218 | 120 | 338 |
| No purchase | 471 | 1656 | 2127 |
| Total | 689 | 1776 | 2465 |

Table 5
Number of visits vs. purchase status

| | 1 visit | 2 visits | 3 or more visits | Total |
|---|---|---|---|---|
| Purchase | 136 | 113 | 89 | 338 |
| No purchase | 1664 | 270 | 193 | 2127 |
| Total | 1800 | 383 | 282 | 2465 |

Table 6
Length of reading time: results of t-test (significance level = 0.05)

| | N | Mean | Std dev. | Std err. | Pr > \|t\| |
|---|---|---|---|---|---|
| Purchase | 338 | 61.35 | 144.59 | 7.86 | <0.0001 |
| No purchase | 2127 | 27.31 | 67.54 | 1.46 | |

Table 7
Level 1 (genre) click ratio: result of t-test (significance level = 0.05)

| | N | Mean | Std dev. | Std err. | Pr > \|t\| |
|---|---|---|---|---|---|
| Purchase | 338 | 0.5750 | 0.3011 | 0.0164 | 0.0303 |
| No purchase | 2127 | 0.5378 | 0.2917 | 0.0063 | |

Table 8
Level 2 (specific type) click ratio: result of t-test (significance level = 0.05)

| | N | Mean | Std dev. | Std err. | Pr > \|t\| |
|---|---|---|---|---|---|
| Purchase | 338 | 0.3666 | 0.2953 | 0.0161 | <0.0001 |
| No purchase | 2127 | 0.2790 | 0.2421 | 0.0052 | |

The results in Tables 7 and 8 indicate that the means of the Level 1 and Level 2 click ratios for the purchased and not purchased products are statistically different at the 5% significance level. It is also noticed from the least