A recommender system using GA K-means clustering in an online shopping market

Kyoung-jae Kim a,*, Hyunchul Ahn b

a Department of Management Information Systems, Dongguk University, 3-26 Pil-Dong, Jung-Gu, Seoul 100-715, South Korea
b Graduate School of Management, Korea Advanced Institute of Science and Technology, 207-43 Cheongrangri-Dong, Dongdaemun-Gu, Seoul 130-722, South Korea

Abstract

The Internet is emerging as a new marketing channel, so understanding the characteristics of online customers' needs and expectations is considered a prerequisite for activating the consumer-oriented electronic commerce market. In this study, we propose a novel clustering algorithm based on genetic algorithms (GAs) to effectively segment the online shopping market. In general, GAs are believed to be effective on NP-complete global optimization problems, and they can provide good near-optimal solutions in reasonable time. Thus, we believe that a clustering technique with GA can provide a way of finding the relevant clusters more effectively. The research in this paper applied K-means clustering whose initial seeds are optimized by GA, which is called GA K-means, to a real-world online shopping market segmentation case. In this study, we compared the results of GA K-means to those of a simple K-means algorithm and self-organizing maps (SOM). The results showed that GA K-means clustering may improve segmentation performance in comparison to other typical clustering algorithms. In addition, our study validated the usefulness of the proposed model as a preprocessing tool for recommendation systems.
© 2007 Elsevier Ltd. All rights reserved.

Keywords: Recommender system; Genetic algorithms; Self-organizing maps; Market segmentation; Case-based reasoning

1. Introduction

Since the Internet has become popular, the consumer-oriented electronic commerce market has grown so huge that now companies are convinced of the importance of understanding this new emerging market. It is becoming more important for the companies to analyze and understand the needs and expectations of their online users or customers because the Internet is one of the most effective media to gather, disseminate and utilize the information about its users or customers. Thus, it is easier to extract knowledge out of the shopping process to create new business opportunities under the Internet environment.

Market segmentation is one of the ways in which such knowledge can be represented. It attempts to discover the classes in which the consumers can be naturally grouped, according to the information available (Velido, Lisboa, & Meehan, 1999). It can be the basis for effective targeting and predicting prospects through the identification of the proper segments. Although much of the marketing literature has proposed various market segmentation techniques, clustering techniques are frequently used in practice (Wedel & Kamakura, 1998). In addition, K-means clustering is the most frequently used market segmentation technique among the clustering techniques (Gehrt & Shim, 1998; Kuo, Chang, & Chien, 2004). However, the major drawback of K-means clustering is that it often falls into local optima and the result largely depends on the initial cluster centers. Prior studies pointed out this limitation and tried to integrate K-means clustering with global search techniques including genetic algorithms (see Babu & Murty, 1993; Kuo, Liao, & Tu, 2005; Maulik & Bandyopadhyay, 2000; Murthy & Chowdhury, 1996; Pena, Lozano, & Larranaga, 1999).
In this paper, we try to apply a hybrid of K-means clustering and genetic algorithms to carry out an exploratory segmentation of an online shopping market. To find the most effective clustering method for this kind of data, we adopt a number of clustering methods and compare the performance of each clustering method using our suggested performance criteria. In addition, we validate the usefulness of our proposed model in a real-world application.

The rest of this paper is organized as follows: the next section reviews two traditional clustering algorithms, K-means and the self-organizing map (SOM), along with the performance criteria. Section 3 proposes the GA approach to optimize K-means clustering, and Section 4 describes the data and the experiments. In this section, the empirical results are also summarized and discussed. In the final section, conclusions and the limitations of this study are presented.

2. Clustering algorithms

Cluster analysis is an effective tool in scientific or managerial inquiry. It groups a set of data in d-dimensional feature space so as to maximize the similarity within clusters and minimize the similarity between two different clusters. There are various clustering methods and they are currently widely used. Among them, we apply two popular methods, K-means and SOM, and a novel hybrid method to market segmentation. Before providing a brief description of each method, the following assumptions are necessary. A given population consists of n elements described by m attributes and it is partitioned into K clusters. $X_i = (x_{i1}, x_{i2}, \ldots, x_{im})$ represents the vector of the m attributes of element i.

2.1. K-means clustering algorithm

The K-means method is a widely used clustering procedure that searches for a nearly optimal partition with a fixed number of clusters. It uses an iterative hill-climbing algorithm. The process of K-means clustering is as follows:

(1) The initial seeds with the chosen number of clusters, K, are selected and an initial partition is built by using the seeds as the centroids of the initial clusters.
(2) Each record is assigned to the nearest centroid, thus forming a cluster.
(3) Keeping the same number of clusters, the new centroid of each cluster is calculated.
(4) Iterate Steps (2) and (3) until the clusters stop changing or the stopping conditions are satisfied.

The K-means algorithm has been popular because of its ease of application and simplicity. However, it also has some drawbacks. First, it does not deal well with overlapping clusters, and the clusters can be pulled off center by outliers. Also, the clustering result may depend on the initial seeds, but there is no mechanism to optimize them. In addition, it may converge to a local minimum under certain conditions because it works as a hill-climbing strategy. As described in the next section, GA can find optimal clusters by minimizing an extrinsic clustering metric.
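For illustration, the following Python sketch implements the four-step K-means procedure described above. The function name, the use of Euclidean distance, and the random choice of initial seeds are our own assumptions rather than the implementation used in this study.

```python
# Illustrative sketch of the basic K-means procedure described above
# (not the implementation used in the study). Assumes numeric data in a
# NumPy array of shape (n, m) and Euclidean distance.
import numpy as np

def k_means(X, K, max_iter=100, seeds=None, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # Step (1): choose initial seeds; by default they are picked at random,
    # which is exactly the weakness that GA K-means addresses later.
    if seeds is None:
        seeds = X[rng.choice(len(X), size=K, replace=False)]
    centroids = np.asarray(seeds, dtype=float)

    for _ in range(max_iter):
        # Step (2): assign each record to the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step (3): recompute each centroid as the mean of its cluster.
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        # Step (4): stop when the clusters no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

The random seed selection in Step (1) is the part that GA K-means later replaces with a GA-optimized choice of initial seeds.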
2.2. Self-organizing map (SOM)

SOM is a clustering algorithm based on the unsupervised neural network model. Since it was suggested by Kohonen (1982), it has been applied in many studies because of its good performance. The basic SOM model consists of two layers, an input layer and an output layer. When the training set is presented to the network, the values flow forward through the network to the units in the output layer. The neurons in the output layer are arranged in a grid, the units in the output layer compete with each other, and the one with the highest value wins. The process of SOM is as follows:

(1) Initialize the weights as small random numbers.
(2) Put an input sample, $X_i$, into the SOM network, and calculate the distances between the weight vectors, $W_j = (w_{j1}, w_{j2}, \ldots, w_{jm})$, and the input sample, $X_i$. Then, select the neuron whose distance to $X_i$ is the shortest, as in Eq. (1). The selected neuron is called the 'winner':

    $\mathrm{Min}\ D(j) = \sum_i (w_{ji} - x_i)^2$    (1)

(3) Update the weights of the winner as well as its neighbors by the following equation, where $\alpha$ is the learning rate:

    $W_j^{\mathrm{NEW}} = W_j + \alpha \lVert X_i - W_j \rVert$    (2)

(4) Iterate Steps (2) and (3) until the stopping criterion is satisfied.

In spite of several excellent applications, SOM has some limitations that hinder its performance. In particular, just like other neural network based algorithms, SOM has no mechanism to determine the number of clusters, the initial weights, or the stopping conditions.

2.3. Criteria for performance comparison of clustering algorithms

We compare the performance of the clustering algorithms using 'intraclass inertia'. Intraclass inertia is a measure of how compact each cluster is in the m-dimensional space of numeric attributes. It is the average of the distances between the means and the observations in each cluster. Eq. (3) represents the intraclass inertia for the given K clusters (Michaud, 1997):

    $F(K) = \frac{1}{n} \sum_{K} n_K I_K = \frac{1}{n} \sum_{K} \sum_{i \in C_K} \sum_{P=1}^{m} (X_{iP} - \bar{X}_{KP})^2$    (3)

where n is the number of total observations, $C_K$ is the set of the Kth cluster, $X_{iP}$ is the value of attribute P for observation i, and $\bar{X}_{KP}$ is the mean of attribute P in the Kth cluster, that is, $\bar{X}_{KP} = \frac{1}{n_K} \sum_{i \in C_K} X_{iP}$.
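To make Eq. (3) concrete, here is a minimal Python sketch that computes intraclass inertia for a given partition. The function name and the integer label encoding are our own assumptions.

```python
# Illustrative computation of intraclass inertia, Eq. (3), for a partition of
# n observations (rows of X) into K clusters given by integer labels 0..K-1.
import numpy as np

def intraclass_inertia(X, labels, K):
    n = len(X)
    total = 0.0
    for k in range(K):
        members = X[labels == k]          # observations in cluster C_K
        if len(members) == 0:
            continue
        centroid = members.mean(axis=0)   # mean of each attribute P in cluster K
        # sum of squared deviations of cluster members from their centroid
        total += ((members - centroid) ** 2).sum()
    return total / n                      # average over all n observations
```

A lower value indicates more compact clusters; GA K-means uses this quantity as the fitness to be minimized.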
3. GA K-means clustering algorithm

As indicated in Section 2.1, the K-means algorithm does not have any mechanism for choosing appropriate initial seeds. However, selecting different initial seeds may generate huge differences in clustering results, especially when the target sample contains many outliers. In addition, random selection of initial seeds often causes the clustering quality to fall into local optima (Bradley & Fayyad, 1998). So, it is very important to select appropriate initial seeds in the traditional K-means clustering method. To overcome this critical limitation, we propose GA as the optimization tool for the initial seeds in the K-means algorithm. We call our proposed model the GA K-means clustering method.

Genetic algorithms are stochastic search techniques that can search large and complicated spaces. They are based on biology, including natural genetics and evolutionary principles. In particular, GAs are suitable for parameter optimization problems with an objective function subject to various hard and soft constraints (Shin & Han, 1999). The GA basically explores a complex space in an adaptive way, guided by the biological mechanisms of selection, crossover, and mutation. This algorithm uses natural selection – survival of the fittest – to solve optimization problems (Kim, 2004).

The first step of the GA is problem representation. The problem must be represented in a suitable form to be handled by the GA. Thus, the problem is described in terms of genetic code, like DNA chromosomes. The GA often works with a form of binary coding. Once the problem is coded as chromosomes, a population – a set of chromosomes – is initialized. Each chromosome within a population gradually evolves through biological operations. There are no general rules for determining the population size, but population sizes of 100–200 are commonly used in GA research. Once the population size is chosen, the initial population is randomly generated (Bauer, 1994). After the initialization step, each chromosome is evaluated by a fitness function. According to the value of the fitness function, the chromosomes associated with the fittest individuals will be reproduced more often than those associated with unfit individuals (Davis, 1994).

The GA works with three operators that are iteratively used. The selection operator determines which individuals may survive (Hertz & Kobler, 2000). The crossover operator allows the search to fan out in diverse directions looking for attractive solutions, and permits chromosomal material from different parents to be combined in a single child. There are three popular crossover methods: single-point, two-point, and uniform. The single-point crossover makes only one cut in each chromosome and selects two adjacent genes on the chromosome of a parent. The two-point crossover involves two cuts in each chromosome. The uniform crossover allows two parent strings to produce two children, and permits great flexibility in the way strings are combined. In addition, the mutation operator arbitrarily alters one or more components of a selected chromosome. Mutation randomly changes a gene on a chromosome, providing the means for introducing new information into the population. Finally, the GA tends to converge on an optimal or near-optimal solution through these operators (Wong & Tan, 1994).
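As a purely illustrative view of these operators, the sketch below applies uniform crossover and bit-flip mutation to binary-string chromosomes. The helper names are hypothetical, and the rates shown simply echo the settings reported later in Section 4.1.

```python
# Illustrative uniform crossover and bit-flip mutation on binary chromosomes.
# Rates and helper names are our own; Section 4.1 gives the settings actually
# used in the study (crossover rate 0.7, mutation rate 0.1).
import numpy as np

rng = np.random.default_rng(42)

def uniform_crossover(parent1, parent2, crossover_rate=0.7):
    """Swap each gene between the two parents with 50% probability."""
    child1, child2 = parent1.copy(), parent2.copy()
    if rng.random() < crossover_rate:
        swap_mask = rng.random(len(parent1)) < 0.5
        child1[swap_mask], child2[swap_mask] = parent2[swap_mask], parent1[swap_mask]
    return child1, child2

def mutate(chromosome, mutation_rate=0.1):
    """Flip each bit whose random draw falls at or below the mutation rate."""
    flip_mask = rng.random(len(chromosome)) <= mutation_rate
    chromosome = chromosome.copy()
    chromosome[flip_mask] = 1 - chromosome[flip_mask]
    return chromosome

# Example: two random 20-bit parents produce two children, then one is mutated.
p1 = rng.integers(0, 2, size=20)
p2 = rng.integers(0, 2, size=20)
c1, c2 = uniform_crossover(p1, p2)
c1 = mutate(c1)
```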
The GA is popularly used to improve the performance of artificial intelligence techniques. Recently, the GA has been employed to mitigate the limitations of typical clustering algorithms, including the K-means algorithm. Lletí, Ortiz, Sarabia, and Sánchez (2004) proposed the use of an integrated genetic algorithm and K-means clustering to select relevant input variables. The method proposed in their study was based on the use of GA to select those variables important for clarifying the presence of groups in data. Other studies tried to integrate GA and the K-means algorithm to overcome the drawback of possible convergence to a local minimum solution (see Babu & Murty, 1993; Kuo, An, Wang, & Chung, 2006; Maulik & Bandyopadhyay, 2000; Murthy & Chowdhury, 1996; Pena et al., 1999).

3.1. The process of the GA K-means algorithm

GA K-means uses GA to select optimal or sub-optimal initial seeds in K-means clustering. Fig. 1 shows the overall framework of GA K-means clustering. The detailed explanation of the process of GA K-means clustering is as follows:

(1) In the first step, the system generates the initial population that will be used to find the globally optimal initial seeds. The values of the chromosomes for the population are initialized to random values before the search process. To enable GA to find the optimal initial seeds, we should design the structure of a chromosome properly as a form of binary strings. A detailed explanation of the chromosome design is presented in Section 3.2.
(2) The second step is the process of K-means clustering, which is described in Section 2.1. In our study, the maximum number of iterations for centroid adjustment in K-means is set at 10. After the process of K-means, the value of the fitness function is updated. To find optimal initial seeds, we applied 'intraclass inertia' as the fitness function. That is, the objective of our GA K-means is to find the initial seeds for which intraclass inertia is minimized after K-means clustering finishes.
(3) In this stage, GA performs genetic operations, such as selection, crossover, and mutation, on the current population. By doing this, a new generation is produced. After that, Steps (2) and (3) are repeated until the stopping conditions are satisfied, as sketched below.
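The sketch below ties these steps together in the spirit of Fig. 1: a population of candidate seed sets is evolved with selection, crossover and mutation, and each candidate is scored by the intraclass inertia reached after a short K-means run. It is a simplified sketch under our own assumptions (the chromosome is treated as real-valued and the data are assumed min–max normalized to [0, 1]), whereas the paper encodes continuous features as 14-bit binary strings (Section 3.2).

```python
# Illustrative GA K-means loop: a population of candidate initial-seed sets is
# evolved, and each candidate is scored by the intraclass inertia reached after
# a short K-means run started from those seeds. All names and the real-valued
# chromosome representation are our own simplification of Sections 3.1-3.2.
import numpy as np

rng = np.random.default_rng(0)

def kmeans_inertia(X, seeds, max_iter=10):
    """Run K-means from the given seeds; return (intraclass inertia, centroids)."""
    centroids = seeds.copy()
    for _ in range(max_iter):
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centroids[k] for k in range(len(centroids))])
        if np.allclose(new, centroids):
            break
        centroids = new
    inertia = sum(((X[labels == k] - centroids[k]) ** 2).sum()
                  for k in range(len(centroids))) / len(X)
    return inertia, centroids

def ga_kmeans(X, K=5, pop_size=200, generations=20, cx_rate=0.7, mut_rate=0.1):
    """X is assumed min-max normalized to [0, 1] on every attribute."""
    n_features = X.shape[1]
    # Each chromosome encodes K candidate seeds (here directly as real values).
    pop = rng.random((pop_size, K, n_features))
    for _ in range(generations):
        fitness = np.array([kmeans_inertia(X, ind)[0] for ind in pop])
        order = fitness.argsort()                  # lower inertia = fitter
        parents = pop[order[:pop_size // 2]]       # truncation selection
        children = []
        while len(children) < pop_size:
            p1, p2 = parents[rng.integers(len(parents), size=2)]
            c = p1.copy()
            if rng.random() < cx_rate:             # uniform crossover
                mask = rng.random(c.shape) < 0.5
                c[mask] = p2[mask]
            mut = rng.random(c.shape) <= mut_rate  # mutation: re-randomize genes
            c[mut] = rng.random(np.count_nonzero(mut))
            children.append(c)
        pop = np.array(children)
    best = min(pop, key=lambda ind: kmeans_inertia(X, ind)[0])
    return kmeans_inertia(X, best)                 # final inertia and centroids
```

With a population of 200 and 20 generations, this loop performs 4000 fitness evaluations, matching the stopping condition reported in Section 4.1.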
[Fig. 1. The overall framework of GA K-means: design the structure of chromosomes and define the fitness function; generate the initial population; perform the K-means process according to the given initial seeds; calculate the intraclass inertia as the fitness value of each chromosome; apply genetic operators to produce a new generation; and repeat until the stopping criteria are satisfied.]

3.2. Chromosome encoding

The system searches the space to find optimal or near-optimal values of the features that constitute the centroids of each cluster. To apply GA to search for these optimal values, they have to be coded on a chromosome, a form of binary string. The structure of the chromosomes for GA K-means is presented in Fig. 2.

[Fig. 2. The structures of the chromosomes for GA K-means: for each cluster, binary selection features (Type 1, values 0 or 1) are encoded with 1 bit each, and continuous features ranging from 0 to 1 (Type 2) are encoded with 14 bits each.]

As shown in Fig. 2, the length of each chromosome for GA K-means depends on the types of the features. In the case of the binary selection variables whose value is 0 or 1, the length of each feature is just 1 bit. But the features that have continuous values require more bits to express them in a chromosome. Here, we set the continuous feature values as precise as 1/10,000. When we apply min–max normalization to the continuous features, they become values ranging from 0 to 1. In this case, 14 binary bits are required to express them with 1/10,000th precision because $8192 = 2^{13} < 10000 \le 2^{14} = 16384$. These 14-bit binary numbers are transformed into decimal floating-point numbers ranging from 0 to 1 by applying Eq. (4):

    $x' = \frac{x}{2^{14} - 1} = \frac{x}{16383}$    (4)

where x is the decimal value of the binary code for each feature weight (Michalewicz, 1996).

For example, the binary code for the value of the 37th feature of the first sample chromosome – the chromosome representing Cluster 1 – in Fig. 2 is $(10110010010110)_2$. Its decimal value is $(11414)_{10}$, and it is interpreted as $\frac{11414}{16383} = 0.696697796 \approx 0.6967$.

In this study, we use a real-world data set from an Internet shopping mall that consists of 36 selection variables and 6 continuous variables. For the design of the chromosome for this data set, $(36 \times 1 + 6 \times 14) \times K$ bits are required, where K is the number of clusters.
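A minimal sketch of the decoding step follows, assuming the layout just described (36 one-bit flags followed by 6 fourteen-bit continuous features per cluster); the function names are our own. It reproduces the worked example above.

```python
# Illustrative decoding of a GA K-means chromosome segment using Eq. (4):
# binary selection features take 1 bit each, continuous features take 14 bits
# each and are mapped to [0, 1] by dividing by 2**14 - 1 = 16383.
def decode_continuous(bits):
    """Map a 14-bit string to a floating-point value in [0, 1] via Eq. (4)."""
    x = int(bits, 2)               # decimal value of the binary code
    return x / (2 ** 14 - 1)       # x' = x / 16383

# The worked example from the text: the 37th feature of the Cluster 1 chromosome.
value = decode_continuous("10110010010110")
print(round(value, 4))             # 0.6967  (= 11414 / 16383)

def decode_cluster_segment(bits, n_binary=36, n_continuous=6):
    """Split one cluster's segment into 36 binary flags and 6 continuous values."""
    flags = [int(b) for b in bits[:n_binary]]
    rest = bits[n_binary:]
    values = [decode_continuous(rest[i * 14:(i + 1) * 14]) for i in range(n_continuous)]
    return flags, values
```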
4. Experimental design and results

4.1. Experimental design

We apply three clustering algorithms – simple K-means, SOM and GA K-means – to our data. We try to segment the Internet users into 5 clusters (that is, K = 5). In the case of SOM, we set the learning rate ($\alpha$) at 0.5.

For the controlling parameters of the GA search, the population size is set at 200 organisms. The crossover rate is set at 0.7, while the mutation rate is set at 0.1. This study performs the crossover using a uniform crossover routine. The uniform crossover method is considered better at preserving the schema, and can generate any schema from the two parents, while single-point and two-point crossover methods may bias the search with the irrelevant position of the features. For the mutation method, this study generates a random number between 0 and 1 for each of the features in the organism. If a gene gets a number that is less than or equal to the mutation rate, then that gene is mutated. As the stopping condition, only 4000 trials (20 generations) are permitted.

Simple K-means was conducted using SPSS for Windows 11.0 and SOM using Neuroshell 2 R4.0. GA K-means was developed using Microsoft Excel 2002 and Palisade Software's Evolver Version 4.06, and the K-means algorithm for GA K-means was implemented in VBA (Visual Basic for Applications) in Microsoft Excel 2002.

4.2. Experimental results

In this study, we used real-world Internet shopping mall data to assess the performance of the proposed model. The research data was collected from an online diet portal site in Korea which provides all kinds of services for online diets, such as information, community services and a shopping mall. In the case of a diet shopping mall, customers generally have a clear objective, and they have a strong demand for personalized care and services. Thus, they are usually open-minded about providing their personal information to get appropriate service. As a result, the target company of our research possesses detailed and accurate customer information, and it has a desire to use the information as a source of direct marketing for selling its products. Consequently, we tried to build a recommendation model for the users of this web site.

[Fig. 3. The system architecture of the recommendation system: user data is clustered by GA K-means; the nearest cluster for the target customer is determined; the most similar neighbors are searched in the corresponding cluster using case-based reasoning; and recommendation results are generated by referencing the nearest neighbors' purchased items.]
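For convenience, the settings reported in Section 4.1 can be collected into a single configuration block. The key names below are our own; only the values come from the text.

```python
# Summary of the experimental settings reported in Section 4.1; the key names
# are our own, the values are those stated in the text.
EXPERIMENT_CONFIG = {
    "n_clusters": 5,                  # K = 5 segments of Internet users
    "som_learning_rate": 0.5,         # alpha for the SOM runs
    "kmeans_max_iterations": 10,      # centroid adjustments inside GA K-means
    "ga_population_size": 200,        # organisms per generation
    "ga_crossover_rate": 0.7,         # uniform crossover
    "ga_mutation_rate": 0.1,          # per-gene mutation probability
    "ga_generations": 20,             # 4000 trials in total
    "chromosome_bits_per_cluster": 36 * 1 + 6 * 14,  # 36 binary + 6 continuous features
}
```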