Using a Clustering Genetic Algorithm to Support Customer Segmentation for Personalized Recommender Systems

Kyoung-jae Kim¹ and Hyunchul Ahn²

1 Department of Information Systems, Dongguk University, 3-26, Pil-Dong, Chung-Gu, Seoul 100-715, Korea
kjkim@dongguk.edu
2 Graduate School of Management, Korea Advanced Institute of Science and Technology, 207-43 Cheongrangri-Dong, Dongdaemun-Gu, Seoul 130-722, Korea
hcahn@kaist.ac.kr

Abstract. This study proposes a novel clustering algorithm based on genetic algorithms (GAs) to carry out a segmentation of the online shopping market effectively. In general, GAs are believed to be effective on NP-complete global optimization problems, and they can provide good sub-optimal solutions in reasonable time. Thus, we believe that a clustering technique using a GA can provide a way of finding the relevant clusters. This paper applies GA-based K-means clustering to a real-world online shopping market segmentation case for personalized recommender systems. In this study, we compare the results of GA-based K-means with those of the traditional K-means algorithm and self-organizing maps. The results show that GA-based K-means clustering may improve segmentation performance in comparison with other typical clustering algorithms.

1 Introduction

As the Internet emerges as a new marketing channel, analyzing and understanding the needs and expectations of online users or customers is considered a prerequisite for activating consumer-oriented electronic commerce. Thus, the mass marketing strategy cannot satisfy the needs and expectations of online customers. On the other hand, it is easier to extract knowledge from the shopping process in the Internet environment. Market segmentation is one of the ways in which such knowledge can be represented and turned into new business opportunities. Market segmentation divides the market into customer clusters with similar needs and/or characteristics that are likely to exhibit similar purchasing behaviors [2]. With proper market segmentation, companies can offer the right products and services to the right customer cluster and build a close relationship with it.

The aim of this study is to carry out an exploratory segmentation of the online shopping market and to find the most effective clustering method for this kind of data. We adopt several clustering methods for that purpose. In addition, we compare the performance of each clustering method using our suggested performance criteria.
The rest of this paper is organized as follows. The next section reviews two traditional clustering algorithms, K-means and the self-organizing map (SOM), along with the performance criteria. Section 3 proposes the GA approach to optimizing K-means clustering, and Section 4 describes the data and the experiments; the empirical results are also summarized and discussed there. In the final section, conclusions and the limitations of this study are presented.

2 Clustering Algorithms

Cluster analysis is an effective tool in scientific or managerial inquiry. It groups a set of data points in a d-dimensional feature space so as to maximize the similarity within clusters and minimize the similarity between different clusters. Various clustering methods exist and are used in a wide range of areas. Among them, we apply two popular methods, K-means and SOM, and a novel hybrid method to the market segmentation task. First, we introduce K-means and SOM in this section. For a brief description of each method, we assume that the given population consists of n elements described by m attributes and is to be partitioned into k clusters, and that $X_i = (x_{i1}, x_{i2}, \ldots, x_{im})$ represents the vector of the m attributes of element i.

2.1 K-Means Clustering Algorithm

The K-means method is a widely used clustering procedure that searches for a nearly optimal partition with a fixed number of clusters. The process of K-means clustering is as follows:

1) The initial seeds for the chosen number of clusters, k, are selected, and an initial partition is built by using the seeds as the centroids of the initial clusters.
2) Each record is assigned to the nearest centroid, thus forming clusters.
3) Keeping the same number of clusters, the new centroid of each cluster is calculated.
4) Steps 2) and 3) are iterated until the clusters stop changing or the stopping conditions are satisfied.

The K-means algorithm has been popular because it is easy and simple to apply. However, it also has some shortcomings. First, it does not deal well with overlapping clusters, and the clusters can be pulled off center by outliers. In addition, the clustering result may depend on the initial seeds, yet there is no mechanism to optimize them.
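To make the four steps concrete, here is a minimal NumPy sketch of the procedure (illustrative only; the plain K-means runs in Section 4 were carried out with SPSS, not with this code). The optional seeds argument is our addition: when it is omitted the seeds are drawn at random, which is exactly the weakness noted above, and the GA of Section 3 can be viewed as a way of supplying better values for it.

```python
import numpy as np

def kmeans(X, k, seeds=None, max_iter=100, rng=None):
    """Basic K-means following steps 1)-4) of Section 2.1.

    X     : (n, m) array of n observations described by m attributes
    k     : number of clusters
    seeds : optional (k, m) array of initial centroids; drawn at random
            from X when omitted
    """
    rng = np.random.default_rng(rng)
    # 1) choose initial seeds and use them as centroids of the initial partition
    if seeds is None:
        seeds = X[rng.choice(len(X), size=k, replace=False)]
    centroids = np.asarray(seeds, dtype=float)
    for _ in range(max_iter):
        # 2) assign each record to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) recompute the centroid of each cluster (keep the old one if a cluster empties)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                        for j in range(k)])
        # 4) iterate until the clusters stop changing
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```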
2.2 Self-organizing Map

The SOM is a clustering algorithm based on an unsupervised neural network model. Since it was suggested by Kohonen [7], it has been applied in many studies because of its good performance. The basic SOM model consists of two layers, an input layer and an output layer. When the training set is presented to the network, the values flow forward through the network to the units in the output layer. The neurons in the output layer are arranged in a grid; the units in the output layer compete with each other, and the one with the highest value wins. The process of SOM is as follows:

1) Initialize the weights with small random numbers.
2) An input sample, $X_i$, is presented to the SOM network, and the distances between the weight vectors, $W_j = (w_{j1}, w_{j2}, \ldots, w_{jm})$, and the input sample, $X_i$, are calculated. Then, the neuron whose distance to $X_i$ is shortest is selected. The selected neuron is called the 'winner':

$\min_j D(j) = \sum_i (w_{ji} - x_i)^2$   (1)

3) With $\alpha$ denoting the learning rate, the weights of the winner as well as of its neighbors are updated by the following equation:

$W_j^{NEW} = W_j + \alpha (X_i - W_j)$   (2)

4) Steps 2) and 3) are iterated until the stopping criterion is satisfied.

In spite of several excellent applications, SOM has some limitations that hinder its performance. In particular, just like other neural-network-based algorithms, SOM has no mechanism to determine the number of clusters, the initial weights, or the stopping conditions.
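For comparison with the K-means sketch, the following is a bare-bones sketch of the winner-selection and update rules of Eqs. (1) and (2). It assumes a one-dimensional grid of output units, a fixed learning rate, and a fixed neighborhood radius; the SOM experiments in Section 4 were actually run with Neuroshell 2, so all names and defaults here are our assumptions.

```python
import numpy as np

def train_som(X, n_units, alpha=0.5, radius=1, epochs=10, rng=None):
    """Toy SOM with output units on a line, following Eqs. (1)-(2).

    X       : (n, m) array of input samples
    n_units : number of output neurons
    returns : (n_units, m) array of trained weight vectors, one per unit
    """
    rng = np.random.default_rng(rng)
    # 1) initialize the weights with small random numbers
    W = rng.normal(scale=0.01, size=(n_units, X.shape[1]))
    for _ in range(epochs):
        for x in X:
            # 2) the winner is the unit whose weights are closest to the sample (Eq. 1)
            winner = int(np.argmin(((W - x) ** 2).sum(axis=1)))
            # 3) move the winner and its grid neighbors toward the sample (Eq. 2)
            lo, hi = max(0, winner - radius), min(n_units, winner + radius + 1)
            W[lo:hi] += alpha * (x - W[lo:hi])
    return W
```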
2.3 Criteria for Performance Comparison of Clustering Algorithms

We compare the performance of the clustering algorithms using 'intraclass inertia'. Intraclass inertia is a measure of how compact each cluster is in the m-dimensional space of numeric attributes; it is computed from the squared distances between the observations in each cluster and the cluster mean. Eq. (3) gives the intraclass inertia for a given set of k clusters [8]:

$F(k) = \frac{1}{n} \sum_{K} n_K I_K = \frac{1}{n} \sum_{K} \sum_{i \in C_K} \sum_{P} \left( x_{iP} - \overline{X}_{KP} \right)^2$   (3)

where n is the total number of observations, $C_K$ is the set of observations in the Kth cluster, and $\overline{X}_{KP} = \frac{1}{n_K} \sum_{i \in C_K} x_{iP}$ is the mean of the Kth cluster on attribute P.
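Given a finished partition, Eq. (3) is a one-liner; the small helper below (naming is ours) computes it from the labels and centroids returned by the K-means sketch above, and it also serves later as the fitness function of the GA.

```python
import numpy as np

def intraclass_inertia(X, labels, centroids):
    """Eq. (3): average squared distance between each observation and the
    centroid of the cluster it was assigned to (smaller = more compact)."""
    return float(np.sum((X - centroids[labels]) ** 2) / len(X))
```

Lower values indicate more compact clusters, which is the sense in which Table 1 later compares the three methods.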
3 GA-Based K-Means Clustering Algorithm

As indicated in Section 2.1, the K-means algorithm has no mechanism for choosing appropriate initial seeds. However, selecting different initial seeds may produce large differences in the clustering results, especially when the target sample contains many outliers. In addition, random selection of initial seeds often causes the clustering quality to fall into a local optimum [1]. It is therefore very important to select appropriate initial seeds in the traditional K-means clustering method. To overcome this critical limitation, we propose the GA as the optimization tool for the initial seeds of the K-means algorithm.

3.1 Genetic Algorithms

Genetic algorithms are stochastic search techniques that can search large and complicated spaces. They are based on biological ideas including natural genetics and evolutionary principles. In particular, GAs are suitable for parameter optimization problems with an objective function subject to various hard and soft constraints [10].

The GA works with three operators that are applied iteratively. The selection operator determines which individuals may survive [4]. The crossover operator allows the search to fan out in diverse directions looking for attractive solutions and permits chromosomal material from different parents to be combined in a single child. There are three popular crossover methods: single-point, two-point, and uniform. Single-point crossover makes only one cut in each chromosome and selects two adjacent genes on the chromosome of a parent. Two-point crossover involves two cuts in each chromosome. Uniform crossover allows two parent strings to produce two children, and it permits great flexibility in the way strings are combined [6]. In addition, the mutation operator arbitrarily alters one or more components of a selected chromosome: it randomly changes a gene on a chromosome and provides the means for introducing new information into the population. Through these operators, the GA tends to converge on an optimal or near-optimal solution [12].
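To make the two variation operators concrete for the chromosome used in this study (a list of k candidate seed IDs, described in Section 3.3), a possible sketch of uniform crossover and gene-wise mutation is shown below; the actual operator internals of the GA package used in the experiments (Evolver) may differ.

```python
import numpy as np

def uniform_crossover(parent_a, parent_b, rng):
    """Uniform crossover: each gene of a child is taken from either parent
    with equal probability; the second child gets the complementary genes."""
    mask = rng.random(len(parent_a)) < 0.5
    return np.where(mask, parent_a, parent_b), np.where(mask, parent_b, parent_a)

def mutate(chromosome, n_observations, mutation_rate, rng):
    """Gene-wise mutation: a gene that draws a random number less than or
    equal to the mutation rate is replaced by a random observation ID."""
    chromosome = chromosome.copy()
    for g in range(len(chromosome)):
        if rng.random() <= mutation_rate:
            chromosome[g] = rng.integers(n_observations)
    return chromosome
```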
3.2 The Process of the GA K-Means Algorithm

GA K-means uses the GA to select optimal or sub-optimal initial seeds for K-means clustering. The process of GA K-means clustering is as follows:

1) In the first step, we explore the search space to find optimal or near-optimal initial seeds. To conduct GA K-means, each observation of the target data should have its own ID number, and the search space consists of the ID numbers of all observations. If we want to find k clusters using GA K-means, there are k parameters (IDs) to optimize. These parameters are encoded on a chromosome.
2) The second step is the K-means clustering process described in Section 2.1. After K-means finishes, the value of the fitness function is updated. To find optimal initial seeds, we apply 'intraclass inertia' as the fitness function; that is, the objective of our GA K-means is to find the initial seeds for which the intraclass inertia is minimized after K-means clustering finishes.
3) In this stage, the GA applies crossover and mutation to the chromosomes and iterates Steps 1) and 2) until the stopping conditions are satisfied.

3.3 Chromosome Encoding

In general, a chromosome can be used to represent a candidate solution to a problem, where each gene in the chromosome represents a parameter of the candidate solution. In this study, a chromosome is regarded as a set of k initial seeds and each gene is a cluster center. A real-value encoding scheme is used because each customer has a unique customer ID.
For the controlling parameters of the GA search, the population size is set to 100 organisms, the crossover rate is set to 0.7, and the mutation rate is set to 0.1. This study performs crossover using a uniform crossover routine. The uniform crossover method is considered better at preserving schemata and can generate any schema from the two parents, while single-point and two-point crossover may bias the search with the irrelevant positions of the features. For mutation, this study generates a random number between 0 and 1 for each of the features in the organism; if a gene draws a number that is less than or equal to the mutation rate, that gene is mutated. As the stopping condition, only 5000 trials are permitted.
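Putting the pieces together, one way the loop of Section 3.2 could be wired up is sketched below, reusing kmeans, intraclass_inertia, uniform_crossover, and mutate from the earlier sketches. It is only a simplified stand-in for the authors' implementation (which combined Evolver with K-means written in Excel VBA): the truncation-style selection and the fixed number of generations are our assumptions, while the population size of 100, crossover rate of 0.7, and mutation rate of 0.1 follow the settings above.

```python
import numpy as np

def ga_kmeans(X, k, pop_size=100, crossover_rate=0.7, mutation_rate=0.1,
              generations=50, rng=None):
    """GA K-means sketch: a chromosome is a set of k observation IDs used as
    initial seeds; its fitness is the intraclass inertia (Eq. 3) of the
    K-means partition grown from those seeds (lower is better)."""
    rng = np.random.default_rng(rng)
    n = len(X)

    def fitness(chrom):
        labels, centroids = kmeans(X, k, seeds=X[chrom])
        return intraclass_inertia(X, labels, centroids)

    # initial population: random, duplicate-free seed-ID chromosomes
    population = [rng.choice(n, size=k, replace=False) for _ in range(pop_size)]
    for _ in range(generations):                      # stand-in for the 5000-trial limit
        ranked = sorted(population, key=fitness)
        survivors = ranked[: pop_size // 2]           # truncation selection (assumption)
        children = []
        while len(survivors) + len(children) < pop_size:
            ia, ib = rng.choice(len(survivors), size=2, replace=False)
            a, b = survivors[ia].copy(), survivors[ib].copy()
            if rng.random() < crossover_rate:
                a, b = uniform_crossover(a, b, rng)
            for child in (a, b):
                if len(survivors) + len(children) < pop_size:
                    children.append(mutate(child, n, mutation_rate, rng))
        population = survivors + children
    best = min(population, key=fitness)
    labels, centroids = kmeans(X, k, seeds=X[best])
    return best, labels, centroids
```

For the experiments of Section 4, such a routine would be called with k = 5 on the 44-attribute survey data.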
4 Experimental Design and Results

4.1 Research Data

This study makes use of the public data from the 9th GVU's WWW Users Survey, part of which concerns Internet users' opinions on on-line shopping [5]. Like Velido, Lisboa and Meehan [11], we selected 44 items about general opinions of using the WWW for shopping and opinions of Web-based vendors. In total, 2180 Internet users were observed.

4.2 Experimental Design

We apply three clustering algorithms (simple K-means, SOM, and GA K-means) to our data and try to segment the Internet users into 5 clusters. In the case of SOM, we set the learning rate ($\alpha$) to 0.5.

Simple K-means was conducted with SPSS for Windows 11.0 and SOM with Neuroshell 2 R4.0. GA K-means was developed using Microsoft Excel 2002 and Palisade Software's Evolver Version 4.06, and the K-means algorithm for GA K-means was implemented in VBA (Visual Basic for Applications) in Microsoft Excel 2002.

4.3 Experimental Results

As mentioned in Section 2.3, intraclass inertia can be applied to compare the performance of the clustering algorithms. The intraclass inertia of the three clustering algorithms is summarized in Table 1, which shows that the performance of GA K-means is the best among the three comparative methods.

Table 1. Intraclass inertia of each clustering method

Clustering method    K-means     SOM         GA K-means
Intraclass inertia   10850.46    10847.09    10843.70

In addition, a Chi-square procedure provides preliminary information that helps to make the Internet users associated with the segments a bit more identifiable [3]. In this study, Chi-square analysis was used to demographically profile the individuals associated with each of the segments. Table 2 presents the results of the Chi-square analysis. The results reveal that the five segments produced by all three clustering methods differed significantly with respect to all of the demographic variables. Considering the above results, we conclude that GA K-means may offer a viable alternative approach