Effort-Aware Just-in-Time Defect Prediction: Simple Unsupervised Models Could Be Better Than Supervised Models

Yibiao Yang1, Yuming Zhou1,*, Jinping Liu1, Yangyang Zhao1, Hongmin Lu1, Lei Xu1, Baowen Xu1, and Hareton Leung2
1 Department of Computer Science and Technology, Nanjing University, China
2 Department of Computing, Hong Kong Polytechnic University, Hong Kong, China
* Corresponding author: zhouyuming@nju.edu.cn

FSE'16, November 13-18, 2016, Seattle, WA, USA. © 2016 ACM. ISBN 978-1-4503-4218-6/16/11...$15.00. DOI: http://dx.doi.org/10.1145/2950290.2950353

ABSTRACT
Unsupervised models do not require the defect data to build the prediction models and hence incur a low building cost and gain a wide application range. Consequently, it would be more desirable for practitioners to apply unsupervised models in effort-aware just-in-time (JIT) defect prediction if they can predict defect-inducing changes well. However, little is currently known about their prediction effectiveness in this context. We aim to investigate the predictive power of simple unsupervised models in effort-aware JIT defect prediction, especially compared with the state-of-the-art supervised models in the recent literature. We first use the most commonly used change metrics to build simple unsupervised models. Then, we compare these unsupervised models with the state-of-the-art supervised models under cross-validation, time-wise-cross-validation, and across-project prediction settings to determine whether they are of practical value. The experimental results, from open-source software systems, show that many simple unsupervised models perform better than the state-of-the-art supervised models in effort-aware JIT defect prediction.

CCS Concepts
• Software and its engineering → Risk management; Software development process management;

Keywords
Defect, prediction, changes, just-in-time, effort-aware

1. INTRODUCTION
Recent years have seen an increasing interest in just-in-time (JIT) defect prediction, as it enables developers to identify defect-inducing changes at check-in time [7, 13]. A defect-inducing change is a software change (i.e. a single or several consecutive commits in a given period of time) that introduces one or several defects into the source code in a software system [37]. Compared with traditional defect prediction at the module (e.g. package, file, or class) level, JIT defect prediction is a fine-granularity defect prediction. As stated by Kamei et al. [13], it allows developers to inspect an order of magnitude smaller number of SLOC (source lines of code) to find latent defects. This could provide large savings in effort over traditional coarser-granularity defect predictions. In particular, JIT defect prediction can be performed at check-in time [13]. This allows developers to inspect the code changes for finding the latent defects when the change details are still fresh in their minds.
As a result, it is possible to find the latent defects faster. Furthermore, compared with conventional non-effort-aware defect prediction, effort-aware JIT defect prediction takes into account the effort required to inspect the modified code for a change [13]. Consequently, effort-aware JIT defect prediction would be more practical for practitioners, as it enables them to find more latent defects per unit code inspection effort. Currently, there is a significant strand of interest in developing effective effort-aware JIT defect prediction models [7, 13].

Kamei et al. [13] leveraged a supervised method (i.e. the linear regression method) to build an effort-aware JIT defect prediction model. To the best of our knowledge, this is the first work to introduce the effort-aware concept into JIT defect prediction. Their results showed that the proposed supervised model was effective under effort-aware performance evaluation compared with the random model. This work is significant, as it could help find more defect-inducing changes per unit code inspection effort. In practice, however, it is often time-consuming and expensive to collect the defect data (used as the dependent variable) to build supervised models. Furthermore, for many new projects, the defect data are unavailable, in which case supervised models are not applicable. Different from supervised models, unsupervised models do not need the defect data to build the defect prediction models. Therefore, for practitioners, it would be more desirable to apply unsupervised models if they can predict defects well. According to recent studies [16, 17, 18, 19, 26, 42], simple unsupervised models, such as the ManualUp model in which modules are prioritized in ascending order according to code size, are effective in the context of effort-aware defect prediction at coarser granularity. Up to now, however, little is known about the practical value of simple unsupervised models in the context of effort-aware JIT defect prediction.

The main contributions of this paper are as follows:
(1) We investigate the predictive effectiveness of unsupervised models in effort-aware JIT defect prediction, which is an important topic in the area of defect prediction.
(2) We perform an in-depth evaluation of the simple unsupervised techniques (i.e. the unsupervised models in Section 3.2) under three prediction settings (i.e. cross-validation, time-wise-cross-validation, and across-project prediction).
(3) We compare simple unsupervised models with the state-of-the-art supervised models in the recent literature [8, 13]. The experimental results show that many simple unsupervised models perform significantly better than the state-of-the-art supervised models.

The rest of this paper is organized as follows. Section 2 introduces the background on defect prediction. Section 3 describes the employed experimental methodology. Section 4 provides the experimental setup in our study, including the subject projects and the data sets. Section 5 reports in detail the experimental results. Section 6 examines the threats to validity of our study. Section 7 concludes the paper and outlines directions for future work.

2. BACKGROUND
In this section, we first introduce the background on just-in-time defect prediction and then effort-aware defect prediction. Finally, we describe the existing work on the application of unsupervised models in traditional defect prediction.

2.1 Just-in-time Defect Prediction
The origin of JIT (just-in-time) defect prediction can be traced back to Mockus and Weiss [28], who used a number of change metrics to predict the probability of changes being defect-inducing changes. For practitioners, JIT defect prediction is of more practical value compared with traditional defect predictions at the module (e.g. file or class) level. As stated by Kamei et al. [13], the reason is not only because the predictions can considerably narrow down the code to be inspected for finding the latent defects, but also because the predictions can be made at check-in time when the change details are still fresh in the minds of the developers. In particular, in traditional defect predictions, after a module is predicted as defect-prone, it may be difficult to find the specific developer, who is most familiar with the code, to inspect the module to find the latent defects. However, in JIT defect predictions, it is easy to find such a developer to inspect the predicted defect-prone change, as each change is associated with a particular developer.

In the last decade, Mockus and Weiss's study led to an increasing interest in JIT defect predictions. Śliwerski et al. [37] studied defect-inducing changes in two open-source software systems and found that the changes committed on Friday had a higher probability of being defect-inducing changes. Eyolfson et al. [6] studied the influence of commit time and developer experience on the existence of defects in a software change. Their results showed that changes committed between midnight and 4AM were more likely to be defect-inducing than the changes committed between 7AM and noon. Yin et al. [40] studied the bug-fixing changes in several open-source software systems. They found that around 14.2% to 24.8% of bug-fixing changes for post-release bugs were defect-inducing and the concurrency defects were the most difficult to fix. Kim et al. [15] used numerous features extracted from various sources such as change metadata, source code, and change log messages to build prediction models to predict defect-inducing changes.
Their results showed that defect-inducing changes can be predicted at 60% recall and 61% precision on average.

2.2 Effort-aware Defect Prediction
Although the above-mentioned results were encouraging, they did not take into account the effort required for quality assurance when applying the JIT defect prediction models in practice. Arisholm et al. [1] pointed out that, when locating defects, it was important to take into account the cost-effectiveness of using defect prediction models to focus verification and validation activities. This viewpoint has recently been taken by many module-level defect prediction studies [12, 25, 34, 33, 32, 39, 42]. Inspired by this viewpoint, Kamei et al. applied effort-aware evaluation to JIT defect predictions [13]. In their study, Kamei et al. used the total number of lines modified by a change as a measure of the effort required to inspect the change. In particular, they leveraged the linear regression method to build the effort-aware JIT defect prediction model (called the EALR model). In their model, the dependent variable was Y(x)/Effort(x), where Effort(x) was the effort required for inspecting the change x and Y(x) was 1 if x was defect-inducing and 0 otherwise. The results showed that, on average, the EALR model could detect 35% of all defect-inducing changes when using 20% of the effort required to inspect all changes. As such, Kamei et al. believed that effort-aware JIT defect prediction was able to focus on the most risky changes and hence could reduce the costs of developing high-quality software [13].

2.3 Unsupervised Models in Traditional Defect Prediction
Over the last decades, supervised models have been the dominant defect prediction paradigm in traditional defect prediction at the module (e.g. file or class) level [1, 11, 13, 15, 25, 26, 28, 33, 32, 42, 43, 44]. In order to build a supervised model, we need to collect the defect data, such as the number of defects or the labeled information (buggy or not-buggy), for each module. For practitioners, it may be expensive to apply such supervised models in practice. The reason for this is that it is generally time-consuming and costly to collect the defect data. Furthermore, supervised defect prediction models cannot be built if the defect data are unavailable. This is especially true when a new type of project is developed or when historical defect data have not been collected.

Compared with supervised models, unsupervised models do not need the defect data. Due to this advantage, recent years have seen an increasing effort devoted to applying unsupervised modeling techniques to build defect prediction models [3, 26, 41, 42]. In practice, however, it is more important to know the effort-aware prediction performance of unsupervised defect prediction models. To this end, many researchers have investigated whether unsupervised models are still effective when taking into account the effort to inspect the modules that are predicted as "defect-prone". Koru et al. [16, 17, 18, 19] suggested that smaller modules should be inspected first, as more defects would be detected per unit code inspection effort. The reason was that the relationship between module size and the number of defects was found to be not linear but logarithmic [16, 18], indicating that defect-proneness increases at a slower rate as module size increases. Menzies et al. [26] used the ManualUp model to name the "smaller modules inspected first" strategy by Koru et al. In their experiments,
Menzies et al. did find that the ManualUp model had a good effort-aware prediction performance. Their results were further confirmed by Zhou et al.'s study [42], in which the ManualUp model was found to be even competitive with the regular supervised logistic regression model. All these studies show that, in traditional defect prediction, unsupervised models perform well under effort-aware evaluation.

3. RESEARCH METHODOLOGY
In this section, we first introduce the investigated independent and dependent variables. Then, we describe the simple unsupervised models under study and present the supervised models that will be used as the baseline models. Next, we give the research questions. After that, we provide the performance indicators for evaluating the effectiveness of defect prediction models in effort-aware JIT defect prediction. Finally, we give the data analysis method used in this study.

3.1 Dependent and Independent Variables
The dependent variable in this study is a binary variable. If a code change is a defect-inducing change, the dependent variable is set to 1 and 0 otherwise.

Table 1: Summarization of change metrics
Metric | Description
NS | Number of subsystems touched by the current change
ND | Number of directories touched by the current change
NF | Number of files touched by the current change
Entropy | Distribution across the touched files, i.e. $-\sum_{k=1}^{n} p_k \log_2 p_k$, where n is the number of files touched by the change and $p_k$ is the ratio of the touched code in the k-th file to the total touched code
LA | Lines of code added by the current change
LD | Lines of code deleted by the current change
LT | Lines of code in a file before the current change
FIX | Whether or not the current change is a defect fix
NDEV | The number of developers that changed the files
AGE | The average time interval (in days) between the last and the current change over the files that are touched
NUC | The number of unique last changes to the files
EXP | Developer experience, i.e. the number of changes
REXP | Recent developer experience, i.e. the total experience of the developer in terms of changes, weighted by their age
SEXP | Developer experience on a subsystem, i.e. the number of changes the developer made in the past to the subsystems

The independent variables used in this study consist of fourteen change metrics. Table 1 summarizes these change metrics, including the metric name, the description, and the source. These fourteen metrics can be classified into the following five dimensions: diffusion, size, purpose, history, and experience. The diffusion dimension consists of NS, ND, NF, and Entropy, which characterize the distribution of a change. As stated by Kamei et al. [13], it is believed that a highly distributed change is more likely to be a defect-inducing change.
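To make the Entropy metric in Table 1 concrete, the following is a minimal sketch (our own illustration, not code from the study) of how such a change-entropy value could be computed from the number of lines touched in each file of a change; the function name and input format are assumptions made here for illustration.

```python
import math

def change_entropy(touched_lines_per_file):
    """Illustrative sketch: entropy of a change, -sum(p_k * log2(p_k)),
    where p_k is the fraction of the change's touched code in file k."""
    total = sum(touched_lines_per_file)
    if total == 0:
        return 0.0
    entropy = 0.0
    for lines in touched_lines_per_file:
        if lines > 0:
            p_k = lines / total
            entropy -= p_k * math.log2(p_k)
    return entropy

# Example: a change touching 3 files with 10, 30, and 60 modified lines.
print(change_entropy([10, 30, 60]))  # about 1.30 bits; a one-file change gives 0
```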
The size dimension leverages LA, LD, and LT to characterize the size of a change, in which a larger change is expected to have a higher likelihood of being a defect-inducing change [30, 36]. The purpose dimension consists of only FIX. In the literature, there is a belief that a defect-fixing change is more likely to introduce a new defect [40]. The history dimension consists of NDEV, AGE, and NUC. It is believed that a defect is more likely to be introduced by a change if the touched files have been modified by more developers, by more recent changes, or by more unique last changes [4, 9, 11, 22]. The experience dimension consists of EXP, REXP, and SEXP, in which the experience of the developer of a change is expected to have a negative correlation with the likelihood of introducing a defect into the code by the change. In other words, if the current change is made by a more experienced developer, it is less likely that a defect will be introduced. Note that all these change metrics are the same as those used in Kamei et al.'s study.

3.2 Simple Unsupervised Models
In this study, we leverage change metrics to build simple unsupervised models. As stated by Monden et al. [29], to adopt defect prediction models, one needs to consider not only their prediction effectiveness but also the significant cost required for metrics collection and modeling themselves. A recent investigation from Google developers further shows that a prerequisite for deploying a defect prediction model in a large company such as Google is that it must be able to scale to large source repositories [21]. Therefore, we only take into account those unsupervised defect prediction models that have a low application cost (including metrics collection cost and modeling cost) and a good scalability. More specifically, our study will investigate the following unsupervised defect prediction models. For each of the change metrics (except LA and LD), we build an unsupervised model that ranks changes in descending order according to the reciprocal of their corresponding raw metric values. This idea is inspired by the finding of Koru et al. and Menzies et al. that smaller modules are proportionally more defect-prone and hence should be inspected first [19, 25]. In our study, we expect that "smaller" changes tend to be proportionally more defect-prone. More formally, for each change metric M, the corresponding model is R(c) = 1/M(c). Here, c represents a change and R is the predicted risk value. For a given system, the changes will be ranked in descending order according to the predicted risk value R. In this context, changes with smaller change metric values will be ranked higher.

Note that, under each of the above-mentioned simple unsupervised models, it is possible that two changes have the same predicted risk value, i.e. they have a tied rank. In our study, if there is a tied rank according to the predicted risk values, the change with a lower defect density will be ranked higher. Furthermore, if there is still a tied rank according to the defect densities, the change with a larger change size will be ranked higher. In this way, we will obtain simple unsupervised models that theoretically have the "worst" predictive performance in effort-aware just-in-time defect prediction [14]. In our study, we investigate the predictive power of these "worst" simple unsupervised models. If our experimental results show that these "worst" simple unsupervised models are competitive with the supervised models, we will have confidence that simple unsupervised models are of practical value for practitioners in effort-aware just-in-time defect prediction.

As can be seen, there are 12 simple unsupervised models, which involve a low application cost and can be efficiently applied to large source repositories.
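As an illustration only (not code from the study), the following sketch shows how one such simple unsupervised model could rank changes by R(c) = 1/M(c), i.e. by ascending raw metric value, with ties broken pessimistically as described above; the data layout and field names are our own assumptions.

```python
def rank_changes_unsupervised(changes, metric):
    """Rank changes for inspection under one simple unsupervised model.

    changes: list of dicts holding raw change metrics plus, for the pessimistic
    tie-breaking described above (an evaluation-time device), 'defect_density'
    and 'churn'. metric: name of the change metric M used by this model.
    Changes with smaller M(c) get higher risk R(c) = 1/M(c) and are ranked first.
    """
    def risk(c):
        value = c[metric]
        return 1.0 / value if value > 0 else float('inf')  # zero metric => highest risk

    # Primary key: descending R(c). Tie-breaking ("worst" ordering): lower actual
    # defect density first, then larger change size (churn) first.
    return sorted(changes,
                  key=lambda c: (-risk(c), c['defect_density'], -c['churn']))

# Hypothetical usage: rank by the LT model (lines of code in the file before the change).
changes = [
    {'LT': 1200, 'defect_density': 0.0, 'churn': 40},
    {'LT': 35,   'defect_density': 0.1, 'churn': 10},
    {'LT': 35,   'defect_density': 0.0, 'churn': 25},
]
for c in rank_changes_unsupervised(changes, 'LT'):
    print(c)
```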
3.3 The Supervised Models
The supervised models are summarized in Table 2. These supervised models are categorized into six groups: "Function", "Lazy", "Rule", "Bayes", "Tree", and "Ensemble". The supervised models in the "Function" group are the regression models and the neural networks. "Lazy" are the supervised models based on lazy learning. "Rule" and "Tree" respectively represent the rule-based and the decision-tree-based supervised models. "Ensemble" are those supervised ensemble models which are built with multiple base learners. Naive Bayes is a probability-based technique. In [13], Kamei et al. used the linear regression model to build the effort-aware JIT defect prediction model (i.e. the EALR model). The EALR model is the state-of-the-art supervised model in effort-aware JIT defect prediction. Besides, we also include other supervised techniques (i.e. the models in Table 2 except the EALR model) as the baseline models. The reasons are twofold. First, they are the most commonly used supervised techniques in defect prediction studies [8, 10, 20, 23, 25]. Second, a recent study [8] used most of them (except for Random Forest) to revisit their impact on the performance of defect prediction.

Table 2: Overview of the supervised models
Family | Model | Abbreviation
Function | Linear Regression | EALR
Function | Simple Logistic | SL
Function | Radial basis functions network | RBFNet
Function | Sequential Minimal Optimization | SMO
Lazy | K-Nearest Neighbour | IBk
Rule | Propositional rule | JRip
Rule | Ripple down rules | Ridor
Bayes | Naive Bayes | NB
Tree | J48 | J48
Tree | Logistic Model Tree | LMT
Tree | Random Forest | RF
Ensemble | Bagging | BG+LMT, BG+NB, BG+SL, BG+SMO, and BG+J48
Ensemble | Adaboost | AB+LMT, AB+NB, AB+SL, AB+SMO, and AB+J48
Ensemble | Rotation Forest | RF+LMT, RF+NB, RF+SL, RF+SMO, and RF+J48
Ensemble | Random Subspace | RS+LMT, RS+NB, RS+SL, RS+SMO, and RS+J48

In this study, we use the same method as Kamei et al. [13] to build the EALR model. As stated in Section 2.2, Y(x)/Effort(x) was used as the dependent variable in the EALR model. For the other supervised models, we use the same method as Ghotra et al. [8]. More specifically, Y(x) was used as the dependent variable for these supervised models. Consistent with Ghotra et al. [8], we use the same parameters to build these supervised models. For example, the K-Nearest Neighbor requires the K most similar training examples for classifying an instance. In [8], Ghotra et al. found that K = 8 performed better than the other options (i.e. 2, 4, 6, and 16). As such, we also use K = 8 to build the K-Nearest Neighbor model. For the EALR model, Kamei et al. used the under-sampling method to deal with the imbalanced data set and then removed the most highly correlated factors to deal with collinearity. Consistent with Kamei et al.'s study [13], we use exactly the same method to deal with the imbalanced data set and collinearity.
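To make the modeling setup concrete, here is a minimal sketch (our own illustration, not the tooling used in the study) of how the EALR-style target Y(x)/Effort(x) differs from the plain classification target Y(x) used for the other supervised models; it assumes scikit-learn, a toy feature matrix, and a logistic regression as a stand-in classifier, and it omits the under-sampling and collinearity handling mentioned above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical data: X holds change metrics (rows = changes), y marks
# defect-inducing changes, and effort is the churn (lines added + deleted).
X = np.array([[2.0, 1.0, 120.0],
              [1.0, 0.0, 15.0],
              [3.0, 1.0, 300.0]])
y = np.array([1, 0, 1])
effort = np.array([90.0, 12.0, 250.0])

# EALR-style model: the dependent variable is Y(x)/Effort(x); changes are
# then ranked by the predicted defect density (higher first).
ealr = LinearRegression().fit(X, y / effort)
ealr_ranking = np.argsort(-ealr.predict(X))

# The other supervised models use Y(x) itself as the dependent variable,
# e.g. a classifier producing a defect-proneness score per change.
clf = LogisticRegression(max_iter=1000).fit(X, y)
defect_proneness = clf.predict_proba(X)[:, 1]

print(ealr_ranking, defect_proneness)
```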
3.4 Research Questions
We investigate the following three research questions to determine the practical value of simple unsupervised models:
RQ1: How well do simple unsupervised models predict defect-inducing changes when compared with the state-of-the-art supervised models in cross-validation?
RQ2: How well do simple unsupervised models predict defect-inducing changes when compared with the state-of-the-art supervised models in time-wise-cross-validation?
RQ3: How well do simple unsupervised models predict defect-inducing changes when compared with the state-of-the-art supervised models in across-project prediction?
The purposes of RQ1, RQ2, and RQ3 are to compare simple unsupervised models with the state-of-the-art supervised models with respect to three different prediction settings (i.e. cross-validation, time-wise-cross-validation, and across-project prediction) to determine how well they predict defect-inducing changes. Since unsupervised models do not leverage the buggy or not-buggy label information to build the prediction models, they are not expected to perform better than the supervised models. However, if unsupervised models are not much worse than the supervised models, it is still a good choice for practitioners to apply them, because they have a lower building cost, a wider application range, and a higher efficiency. To the best of our knowledge, little is currently known about these research questions from the viewpoint of unsupervised models in the literature. Our study attempts to fill this gap with an in-depth investigation of simple unsupervised models in the context of effort-aware JIT defect prediction.

3.5 Performance Indicators
When evaluating the predictive effectiveness of a JIT defect prediction model, we take into account the effort required to inspect those changes predicted as defect-prone to find whether they are defect-inducing changes. Consistent with Kamei et al. [13], we use the code churn (i.e. the total number of lines added and deleted) of a change as a proxy of the effort required to inspect the change. In [13], Kamei et al. used ACC and Popt to evaluate the effort-aware performance of the EALR model. ACC denotes the recall of defect-inducing changes when using 20% of the entire effort required to inspect all changes to inspect the top-ranked changes. Popt is the normalized version of the effort-aware performance indicator originally introduced by Mende and Koschke [24]. Popt is based on the concept of the "code-churn-based" Alberg diagram.

[Figure 1: Code-churn-based Alberg diagram (x-axis: % code churn; curves: prediction model m, optimal model, worst model, random model)]

Figure 1 is an example "code-churn-based" Alberg diagram showing the performance of a prediction model m. In this diagram, the x-axis and y-axis are respectively the cumulative percentage of code churn of the changes (i.e. the percentage of effort) and the cumulative percentage of defect-inducing changes found in the selected changes. To compute Popt, two additional curves are included: the "optimal" model and the "worst" model. In the "optimal" model and the "worst" model, changes are respectively sorted in decreasing and ascending order according to their actual defect densities. According to [13], Popt can be formally defined as:

$P_{opt}(m) = 1 - \frac{Area(optimal) - Area(m)}{Area(optimal) - Area(worst)}$

Here, Area(optimal) and Area(worst) are the areas under the curves corresponding to the best and the worst model, respectively. Note that both ACC and Popt are applicable to supervised models as well as unsupervised models.
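As an unofficial illustration of these two indicators (not the evaluation scripts used in the study), the sketch below computes ACC (recall of defect-inducing changes within the first 20% of cumulative churn) and an area-based Popt from a ranked list of changes; the trapezoidal-area details and the guard values are our own simplifications.

```python
import numpy as np

def acc_at_20(churn_ranked, defective_ranked):
    """Recall of defect-inducing changes found when inspecting the top-ranked
    changes that together account for 20% of the total code churn."""
    churn = np.asarray(churn_ranked, dtype=float)
    defective = np.asarray(defective_ranked, dtype=int)
    cutoff = 0.2 * churn.sum()
    within_budget = np.cumsum(churn) <= cutoff
    return defective[within_budget].sum() / max(defective.sum(), 1)

def area(churn_ranked, defective_ranked):
    """Area under the code-churn-based Alberg curve (both axes normalized)."""
    churn = np.asarray(churn_ranked, dtype=float)
    defective = np.asarray(defective_ranked, dtype=float)
    x = np.concatenate(([0.0], np.cumsum(churn) / churn.sum()))
    y = np.concatenate(([0.0], np.cumsum(defective) / max(defective.sum(), 1.0)))
    return np.trapz(y, x)

def p_opt(churn, defective, model_order):
    """Popt(m) = 1 - (Area(optimal) - Area(m)) / (Area(optimal) - Area(worst))."""
    churn = np.maximum(np.asarray(churn, dtype=float), 1e-9)
    defective = np.asarray(defective, dtype=float)
    density = defective / churn
    optimal = np.argsort(-density)   # decreasing actual defect density
    worst = np.argsort(density)      # ascending actual defect density
    area_m = area(churn[model_order], defective[model_order])
    area_opt = area(churn[optimal], defective[optimal])
    area_worst = area(churn[worst], defective[worst])
    return 1 - (area_opt - area_m) / (area_opt - area_worst)

# Hypothetical usage: churn, defect labels, and a model-imposed inspection order.
churn = [10, 200, 50, 5]
defects = [1, 0, 1, 0]
order = np.argsort(churn)  # e.g. a "smaller changes first" unsupervised ordering
print(acc_at_20(np.array(churn)[order], np.array(defects)[order]),
      p_opt(churn, defects, order))
```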
Here, Area(optimal) and Area(worst) are the areas under the curves corresponding to the best and the worst model, respectively. Note that both ACC and Popt are applicable to supervised as well as unsupervised models.

3.6 Data Analysis Method

Figure 2 provides an overview of our data analysis method. As can be seen, in order to obtain an adequate and realistic assessment, we examine the three RQs under the following three prediction settings: 10 times 10-fold cross-validation, time-wise-cross-validation, and across-project prediction.

10 times 10-fold cross-validation is performed within the same project. At each 10-fold cross-validation, we first randomize the data set. Then, we divide the data set into 10 parts of approximately equal size. After that, each part is used in turn as the testing data set to evaluate the effectiveness of the prediction model built on the remainder of the data set (i.e. the training data set). The entire process is repeated 10 times to alleviate possible sampling bias in the random splits. Consequently, each model has 10 × 10 = 100 prediction effectiveness values.

Time-wise-cross-validation is also performed within the same project, but the chronological order of the changes is taken into account. This is the method followed in [37]. For each project, we first rank the changes in chronological order according to the commit date. Then, all changes committed within the same month are grouped into the same part. Assuming the changes in a project are grouped into n parts, we use the following approach for the time-wise prediction: we build a prediction model m on the combination of part i and part i+1 and then apply m to predict the changes in parts i+4 and i+5 (1 ≤ i ≤ n − 5). As such, each training set and each test set contains the changes committed within two consecutive months. The reasons for this setting are four-fold. First, the release cycle of most projects is typically 6~8 weeks [5]. Second, it ensures a gap of two months between each training set and its corresponding test set. Third, using two consecutive months ensures that each training set has enough instances, which is important for supervised models. Fourth, it allows us to have enough runs for each project. If a project has changes spanning n months, this method produces n − 5 prediction effectiveness values for each model.

Across-project prediction is performed across different projects. We use a model trained on one project (i.e. the training data set) to predict defect-proneness in another project (i.e. the testing data set) [33, 43]. Given n projects, this method produces n × (n − 1) prediction effectiveness values for each model. In this study, we use six projects as the subject projects. Therefore, each prediction model produces 6 × (6 − 1) = 30 prediction effectiveness values.

Note that the unsupervised models use only the change metrics in the testing data to build the prediction models. In this study, we also apply the cross-validation, time-wise-cross-validation, and across-project prediction settings to the unsupervised models. This allows the unsupervised models to use the same testing data as the supervised models, thus making a fair comparison of their prediction performance.
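To make the time-wise and across-project splitting schemes described above concrete, the following Python sketch enumerates the corresponding train/test splits. It is only an illustration under stated assumptions: the paper does not provide code, the changes are assumed to be pre-grouped by commit month into a chronologically ordered list of lists, and the names (timewise_splits, crossproject_pairs, parts, projects) are hypothetical.

from itertools import permutations

def timewise_splits(parts):
    # parts: per-month change groups in chronological order; each group is a
    # list of changes (this layout is an assumption, not from the paper).
    # Train on months i and i+1, test on months i+4 and i+5, which leaves a
    # two-month gap between training and test data.
    n = len(parts)
    for i in range(n - 5):                  # yields n - 5 runs per project
        train = parts[i] + parts[i + 1]     # two consecutive training months
        test = parts[i + 4] + parts[i + 5]  # two consecutive test months
        yield train, test

def crossproject_pairs(projects):
    # All ordered (training project, testing project) pairs: n * (n - 1) runs.
    return list(permutations(projects, 2))

# With the six subject projects, this gives 6 * 5 = 30 across-project runs.
print(len(crossproject_pairs(["BUG", "COL", "JDT", "PLA", "MOZ", "POS"])))

For the 10 times 10-fold setting, an off-the-shelf splitter such as scikit-learn's RepeatedKFold(n_splits=10, n_repeats=10) could likewise be used to generate the 100 runs per project.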
When investigating RQ1, RQ2, and RQ3, we use the Benjamini-Hochberg (BH) corrected p-values from the Wilcoxon signed-rank test to examine whether there is a significant difference in prediction effectiveness between the unsupervised and supervised models, at the significance level of 0.05 [2]. If the statistical test shows a significant difference, we then use Cliff's δ to examine whether the magnitude of the difference is practically important from the viewpoint of practical application [1]. By convention, the magnitude of the difference is considered trivial (|δ| < 0.147), small (0.147 ≤ |δ| < 0.33), moderate (0.33 ≤ |δ| < 0.474), or large (|δ| ≥ 0.474) [35]. Furthermore, similar to Ghotra et al. [8], we use the Scott-Knott test [14, 27] to group the supervised and unsupervised prediction models and to examine whether some models outperform others. The Scott-Knott test uses a hierarchical cluster analysis method to divide the prediction models into two groups according to their mean performance (i.e. the Popt and the ACC values over the different runs for each model). If the difference between the two groups is statistically significant, Scott-Knott recursively divides each group into two further groups. The test terminates when the groups can no longer be divided into statistically distinct groups.

4. EXPERIMENTAL SETUP

In this section, we first introduce the subject projects and then describe the data sets collected from these projects.

4.1 Subject Projects

In this study, we use the same open-source subject projects as used in Kamei et al.'s study [13]. More specifically, we use the following six projects to investigate the predictive power of simple unsupervised models in effort-aware JIT defect prediction: Bugzilla (BUG), Columba (COL), Eclipse JDT (JDT), Eclipse Platform (PLA), Mozilla (MOZ), and PostgreSQL (POS). Bugzilla is a well-known web-based bug tracking system. Columba is a powerful mail management tool. Eclipse JDT is the Eclipse Java Development Tools, a set of plug-ins that add the capabilities of a full-featured Java IDE to the Eclipse platform. Mozilla is a well-known and widely used open-source web browser. PostgreSQL is a powerful, open-source object-relational database system. As stated by Kamei et al. [13], these six projects are large, well-known, and long-lived projects that cover a wide range of domains and sizes. In this sense, it is appropriate to use these projects to investigate simple unsupervised models in JIT defect prediction.

4.2 Data Sets

The data sets from these six projects used in this study are shared by Kamei et al. and are available online. As mentioned by Kamei et al. [13], these data were gathered by combining the change information mined from the CVS repositories of these projects with the corresponding bug reports. More specifically, the data for Bugzilla and Mozilla were gathered from the data provided by the MSR 2007 Mining Challenge. The data for Eclipse JDT and Platform were gathered from the data provided by the MSR 2008 Mining Challenge. The data for Columba and PostgreSQL were gathered from the official CVS repositories.

Table 3 summarizes the six data sets used in this study. The first column and the second column are, respectively, the subject data set name and the period of time during which the changes were collected. The third to the sixth columns respectively report the total number of changes, the percentage of defect-inducing changes, the average LOC per change, and the number of files modified in a code change. As can be seen, for each data set, defects are concentrated in a small percentage of the changes.
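As an illustration of how the per-project figures in Table 3 could be reproduced from the shared change data, the following sketch computes the same kind of summary statistics. It assumes the data are loaded into a pandas DataFrame; the file path and column names ("bug", "la", "ld", "nf") are assumptions made for illustration, not the schema documented in the paper.

import pandas as pd

def summarize(df):
    # Per-project summary in the spirit of Table 3; assumed columns:
    # "bug" = 1 if the change is defect-inducing, "la"/"ld" = lines added/deleted,
    # "nf" = number of files modified by the change.
    return {
        "total_changes": len(df),
        "pct_defect_inducing": 100.0 * df["bug"].mean(),
        "avg_loc_per_change": (df["la"] + df["ld"]).mean(),
        "avg_files_per_change": df["nf"].mean(),
    }

# Hypothetical usage:
# df = pd.read_csv("bugzilla_changes.csv")
# print(summarize(df))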