the target release and the 1st, ..., (k − 1)th releases are the historical releases. The source projects consist of n projects, each having a number of releases. Each release in the target and source projects has an associated dataset, which consists of the metric data and the defect label data at the module level. The modules can be packages, files, classes, or functions, depending on the defect prediction context. The metric sets are assumed to be the same for all releases within a single project but can be different for different projects. In each dataset, one instance represents a module and one feature represents a software metric extracted from the module.

From Figure 1, on the one hand, we can see that the training data consist of the external training data and the internal training data. The external training data are the labeled source project data. The internal training data consist of two parts: the historical training data (i.e., the labeled historical release data) and the target training data (i.e., a small amount of the labeled target release data of the target project). On the other hand, we can see that the test data consist of the labeled target release data excluding the target training data. In other words, the labeled target release data are divided into two parts: a small amount of data used as the target training data and the remaining data used as the test data. For simplicity of presentation, we use the terms test metric data and defect oracle data to denote, respectively, the metric data and the defect label data in the test data. The test metric data are fed into the model built with the training data to compute the predicted defect-proneness. Once the predicted defect-proneness of each module in the test data is obtained, it is compared against the defect oracle data to compute the prediction performance.

At a high level, the application of a supervised CPDP model to a target project in practice involves the following three phases: data preparation, model training, and model testing. At the data preparation phase, the training data collected from different sources are preprocessed to make them appropriate for building a CPDP model for the target project. On the one hand, there is a need to address the privacy concerns of (source project) data owners, i.e., preventing the disclosure of specific sensitive metric values in the original source project data [84, 85]. On the other hand, there is a need to address the utility of the privatized source project data in cross-project defect prediction. This includes dealing with heterogeneous feature sets (i.e., metric sets) between the source and target project data [27, 41, 79], filtering out irrelevant training data [39, 86, 94, 106], handling class-imbalanced training data [42, 93], making the source and target projects have similar data distributions [13, 41, 42, 61, 80, 123], and removing irrelevant/redundant features [5, 9, 26, 44]. At the model training phase, a supervised modeling technique is used to build a CPDP model [59, 96, 108, 116, 126]. In other words, this phase uses a specific supervised learning algorithm to build a model that captures the relationship between the metrics (i.e., the independent variables) and defect-proneness (i.e., the dependent variable) in the training data. At the model testing phase, the CPDP model is applied to predict defects in the target release, and its prediction performance is evaluated under the classification or ranking scenario. In the former scenario, the modules in the test data are classified as defective or not defective. In the latter scenario, the modules in the test data are ranked from the highest to the lowest predicted defect-proneness. Under each scenario, the performance report is generated by comparing the predicted defect-proneness with the defect oracle data. Through these phases, the supervised model is expected to learn the effect of the metrics on defects from the source projects and to use this knowledge to predict defects in the target release of the target project.
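To make the three phases concrete, the following minimal sketch (in Python, using pandas and scikit-learn) illustrates one possible instantiation: labeled source-release data are pooled as external training data, a model is fit at the model training phase, and the model is then applied to the target-release metric data and checked against the defect oracle data. This is only an illustration under simplifying assumptions: the source and target releases are assumed to share the same metric set, the file names and the "defective" column name are hypothetical placeholders, and logistic regression is just one example learner, not the method of any particular study surveyed here.

```python
# A minimal, illustrative CPDP pipeline (not the method of any surveyed study).
# Assumptions: each release is a CSV whose columns are module-level metrics plus
# a binary "defective" label; file names, column name, and learner are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# --- Data preparation phase: pool labeled source-release data as external training data.
source_releases = ["source_proj_A_r1.csv", "source_proj_B_r2.csv"]  # hypothetical paths
train = pd.concat([pd.read_csv(path) for path in source_releases], ignore_index=True)

# Target release: metric data (test metric data) plus defect labels (defect oracle data).
target = pd.read_csv("target_proj_rk.csv")  # hypothetical path

metric_cols = [c for c in train.columns if c != "defective"]  # same metric set assumed

# --- Model training phase: learn the metrics -> defect-proneness relationship.
model = LogisticRegression(max_iter=1000)
model.fit(train[metric_cols], train["defective"])

# --- Model testing phase: predict defect-proneness for the target release and
# compare against the defect oracle data (here via AUC; see Section 2.3).
predicted = model.predict_proba(target[metric_cols])[:, 1]
print("AUC on target release:", roc_auc_score(target["defective"], predicted))
```

Real CPDP approaches differ precisely in how each phase is carried out (e.g., how training data are filtered or how data distributions are transformed), which is what the key modeling components summarized in Table 2 capture.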
2.3 Performance Evaluation

Table 1 summarizes the prediction performance indicators involved in the existing cross-project defect prediction literature. The first column reports the scenario in which a specific indicator is used. The second, third, and fourth columns, respectively, show the name, the definition, and the informal interpretation for each specific indicator. The last column indicates the expected direction for a better prediction performance. As can be seen, many performance indicators have a similar or even identical meaning. For example, PD has the same meaning as Recall, while Completeness is a variant of Recall. In the next section, we will compare the existing CPDP studies, including the performance indicators used. For the performance indicators, we use the exact same names as used in the original studies. This will help understand the comparison, especially for those readers who are familiar with the existing CPDP studies. Therefore, we include these indicators in Table 1, even if they express a similar or even identical meaning.
Table 1. The Prediction Performance Indicators Involved in the Existing Cross-Project Defect Prediction Studies

Scenario | Name | Definition | Interpretation | Better
Classification | Recall | TP / (TP + FN) | The fraction of defective modules that are predicted as defective | High
 | Precision | TP / (TP + FP) | The fraction of predicted defective modules that are defective | High
 | PD | TP / (TP + FN) | Probability of Detection. Same as Recall | High
 | PF | FP / (FP + TN) | Probability of False alarm. The fraction of not defective modules that are predicted as defective | Low
 | Correctness | TP / (TP + FP) [9] | Same as Precision | High
 | Completeness | Defects(TP) / Defects(TP + FN) [9] | A variant of Recall. The percentage of defects found | High
 | Fβ | (1 + β²) × Recall × Precision / (Recall + β² × Precision) | The harmonic mean of precision and recall (β = 1 or 2) | High
 | G1 | [2 × PD × (1 − PF)] / [PD + (1 − PF)] | The harmonic mean of PD and 1 − PF | High
 | G2 | √(Recall × Precision) | The geometric mean of recall and precision | High
 | G3 | √(Recall × (1 − PF)) | The geometric mean of recall and 1 − PF | High
 | Balance | 1 − √((0 − PF)² + (1 − PD)²) / √2 [107] | The balance between PF and PD | High
 | ED | √(θ × (1 − PD)² + (1 − θ) × PF²) [52] | The distance between (PD, PF) and the ideal point (1, 0) on the ROC space, weighted by the cost function θ (θ = 0.6 by default) | Low
 | MCC | (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)) [123] | A correlation coefficient between the actual and predicted binary classifications | High
 | AUC | The area under the ROC curve [24] | The probability that a model will rank a randomly chosen defective module higher than a randomly chosen not defective one | High
 | NECM | (FP + (C_II / C_I) × FN) / (TP + FP + TN + FN) [48] | The normalized expected cost of misclassification | Low
 | Z* | (o − e) / √(e(n − e)/n) [10] | A statistic for assessing the statistical significance of the overall classification accuracy. Here, o is the total number of correct classifications (i.e., o = TP + TN), e is the expected number of correct classifications due to chance, and n is the total number of instances (i.e., n = TP + FP + TN + FN) | High
Ranking | AUCEC | Area(m): the area under the cost-effectiveness curve corresponding to the model m [90] | The cost-effectiveness of the overall ranking | High
 | NofB20 | The number of defective modules in the top 20% SLOC | The cost-effectiveness of the top ranking | High
 | PofB20 | The percentage of defects in the top 20% SLOC | The cost-effectiveness of the top ranking | High
 | FPA | The average, over all values of k, of the percentage of defects contained in the top k modules [111] | The effectiveness of the overall ranking | High
 | E1(R) | The percentage of modules having R% defects | The effectiveness of the top ranking | Low
 | E2(R) | The percentage of code having R% defects | The cost-effectiveness of the top ranking | Low
 | Prec@k | Top k precision | The fraction of the top k ranked modules that are defective | High
Classification performance indicators. When applying a prediction model to classify modules in practice, we first need to determine a classification threshold for the model. A module will be classified as defective if its predicted defect-proneness is larger than the threshold; otherwise, it will be classified as not defective. The four outcomes of the classification using the threshold are as follows: TP (the number of modules correctly classified as defective), TN (the number of modules correctly classified as not defective), FP (the number of modules incorrectly classified as defective), and FN (the number of modules incorrectly classified as not defective). As shown in Table 1, all classification performance indicators except AUC (Area Under ROC Curve) are based on TP, TN, FP, and FN, i.e., they depend on the threshold. AUC is a threshold-independent indicator, which denotes the area under a ROC (Relative Operating Characteristic) curve. A ROC curve is a graphical plot of PD (the y-axis) vs. PF (the x-axis) for a binary classification model as its decision threshold is varied. In particular, all classification performance indicators except NECM do not explicitly take into account the cost of misclassification caused by the prediction model.
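As a concrete illustration of the threshold-based indicators, the sketch below simply restates several of the definitions from Table 1 in Python, computed from the four classification outcomes. The confusion-matrix counts are made-up example values, and θ = 0.6 for ED follows the default noted in the table.

```python
# Illustrative computation of threshold-based classification indicators from
# TP, FP, TN, FN, following the definitions in Table 1 (example counts are made up).
import math

def classification_indicators(tp, fp, tn, fn, beta=1.0, theta=0.6):
    recall = tp / (tp + fn)                      # Recall / PD / Completeness
    precision = tp / (tp + fp)                   # Precision / Correctness
    pf = fp / (fp + tn)                          # PF (probability of false alarm)
    f_beta = ((1 + beta**2) * recall * precision /
              (recall + beta**2 * precision))    # F1 when beta = 1
    g1 = 2 * recall * (1 - pf) / (recall + (1 - pf))   # harmonic mean of PD and 1-PF
    g2 = math.sqrt(recall * precision)                 # geometric mean of recall and precision
    balance = 1 - math.sqrt((0 - pf)**2 + (1 - recall)**2) / math.sqrt(2)
    ed = math.sqrt(theta * (1 - recall)**2 + (1 - theta) * pf**2)
    mcc = ((tp * tn - fp * fn) /
           math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"Recall": recall, "Precision": precision, "PF": pf, "F": f_beta,
            "G1": g1, "G2": g2, "Balance": balance, "ED": ed, "MCC": mcc}

# Example: 30 true positives, 10 false positives, 140 true negatives, 20 false negatives.
print(classification_indicators(tp=30, fp=10, tn=140, fn=20))
```

AUC is omitted here because it is threshold independent: it is computed from the full ranking of predicted defect-proneness rather than from a single confusion matrix.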
Ranking performance indicators. In the ranking scenario, the performance indicators can be classified into two categories: effort-aware and non-effort-aware. The former includes AUCEC, NofB20, PofB20, and E2(R), while the latter includes FPA, E1(R), and Prec@k. The effort-aware indicators take into account the effort required to inspect or test those modules predicted as defective to find whether they contain defects. In these indicators, the size of a module measured in SLOC (source lines of code) is used as a proxy for the effort required to inspect or test the module. In particular, AUCEC is based on the SLOC-based Alberg diagram [6]. In such an Alberg diagram, each defect-proneness prediction model corresponds to a curve constructed as follows. First, the modules in the target release are sorted in decreasing order of the defect-proneness predicted by the model. Then, the cumulative percentage of SLOC of the top modules selected from the module ranking (the x-axis) is plotted against the cumulative percentage of defects found in the selected top modules (the y-axis). AUCEC is the area under the curve corresponding to the model. Note that NofB20, PofB20, and E2(R) quantify the performance of the top ranking, while AUCEC quantifies the performance of the overall ranking. The non-effort-aware performance indicators do not take into account the inspection or testing effort. Of the non-effort-aware indicators, E1(R) and Prec@k quantify the performance of the top ranking, while FPA quantifies the performance of the overall ranking.
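The following sketch illustrates how the SLOC-based cost-effectiveness curve and the associated ranking indicators can be computed from a predicted ranking, following the Alberg-diagram construction described above (sort modules by predicted defect-proneness, then accumulate SLOC and defects). The module data are made up, and the trapezoidal integration and the handling of the module that straddles the 20% SLOC cutoff are simplifying choices rather than conventions fixed by the surveyed studies.

```python
# Illustrative computation of effort-aware (PofB20, AUCEC) and non-effort-aware (FPA)
# ranking indicators from a predicted ranking; the example module data are made up.
import numpy as np

def ranking_indicators(sloc, defects, predicted):
    order = np.argsort(-np.asarray(predicted, dtype=float))  # highest predicted defect-proneness first
    sloc = np.asarray(sloc, dtype=float)[order]
    defects = np.asarray(defects, dtype=float)[order]

    cum_sloc = np.cumsum(sloc) / sloc.sum()           # cumulative fraction of SLOC inspected
    cum_defects = np.cumsum(defects) / defects.sum()  # cumulative fraction of defects found

    # PofB20: percentage of defects found in the top 20% of SLOC (one simple convention
    # for the module that straddles the 20% boundary).
    pofb20 = cum_defects[np.searchsorted(cum_sloc, 0.20)]

    # AUCEC: area under the cost-effectiveness curve (%defects vs. %SLOC), trapezoidal rule.
    x = np.concatenate(([0.0], cum_sloc))
    y = np.concatenate(([0.0], cum_defects))
    aucec = float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))

    # FPA: average, over all k, of the percentage of defects contained in the top k modules.
    fpa = float(cum_defects.mean())
    return {"PofB20": float(pofb20), "AUCEC": aucec, "FPA": fpa}

# Hypothetical modules: size in SLOC, number of defects, and predicted defect-proneness.
print(ranking_indicators(sloc=[120, 300, 80, 500, 60],
                         defects=[2, 1, 3, 0, 1],
                         predicted=[0.9, 0.4, 0.8, 0.2, 0.6]))
```

NofB20 could be obtained analogously by accumulating the number, rather than the percentage, of defective modules found within the top 20% of SLOC.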
Fig. 2. The process to select supervised CPDP studies.

2.4 State of Progress

To understand the progress in supervised cross-project defect prediction, we conducted a search of the literature published between 2002 and 2017.¹ The starting year of the search was set to 2002, as it is the year in which the first CPDP article was published [9] (abbreviated as the “BMW 2002 TSE paper” in the following). Before the search, we set up the following inclusion criteria: (1) the study was a supervised CPDP study; (2) the study was written in English; (3) the full text was available; (4) only the journal version was included if the study had both conference and journal versions; and (5) the prediction scenario was classification or ranking.

As shown in Figure 2, we searched for articles from three sources: Google Scholar, the existing systematic literature reviews, and literature reading. We used Google Scholar as the main search source, as it “provides a simple way to broadly search for scholarly literature.”² First, we did a forward snowballing search [114] by recursively examining the citations of the “BMW 2002 TSE paper”. Consequently, 46 relevant articles were identified [1, 3, 5, 7, 9, 10, 12, 13, 15, 17, 26–30, 37, 39, 41, 42, 44, 46, 47, 52, 59, 66, 79, 80, 82, 85–87, 89, 90, 92–96, 107, 109, 116, 118, 120, 122, 123, 126]. Then, we used “cross project” + “defect prediction” as the search terms. As a result, 13 additional relevant articles [14, 31, 32, 43, 45, 61, 71, 84, 88, 99, 101, 108, 129] were found with respect to the identified 46 articles. Next, we used “cross company” + “defect prediction” as the search terms, identifying 3 additional relevant articles [104, 106, 110] with respect to the 59 (= 46 + 13) articles. After that, we used “cross project” + “fault prediction” as the search terms, identifying 1 additional relevant article [100] with respect to the 62 (= 59 + 3) articles. Finally, we used the terms “cross company” + “fault prediction” and did not find any additional relevant articles. Through the above steps, we identified a total of 63 (= 46 + 13 + 3 + 1 + 0) supervised CPDP articles from Google Scholar. In addition to Google Scholar, we identified 6 additional relevant articles from a systematic literature review by Hosseini et al. [40]. Hosseini et al. identified 46 primary CPDP studies from three electronic databases (the ACM Digital Library, IEEE Xplore, and the ISI Web of Science) and two search engines (Google Scholar and Scopus). After applying our inclusion criteria to these 46 CPDP studies, we found that 6 relevant articles [60, 73, 74, 103, 105, 119] were not among the 63 CPDP articles identified from Google Scholar. The last source was literature reading, through which we found 3 additional relevant articles [48, 72, 121] with respect to the 69 (= 63 + 6) CPDP articles. Therefore, the overall search process resulted in 72 (= 63 + 6 + 3) supervised CPDP articles found in the literature.

¹The literature search was conducted on April 21, 2017.
²https://scholar.google.com/intl/en/scholar/about.html.
Table 2 provides an overview of the main literature concerning supervised cross-project defect prediction. For each study, the 2nd and 3rd columns, respectively, list the year published and the topic involved. The 4th to 6th columns list the source and target project characteristics, including the number of source/target projects (releases) and the programming languages involved. For simplicity of presentation, we do not report the number of releases in parentheses if it is equal to the number of projects. Note that the entries with a gray background under the 4th and 5th columns indicate that the source and target projects are different projects. The 7th to 12th columns list the key modeling components (challenges) covered. A “Yes” entry indicates that the study explicitly considers the corresponding component, and a blank entry indicates that it does not. The 13th to 16th columns list the performance evaluation context, including the use (or not) of the target training data (i.e., a small amount of the labeled target release data used as the training data), the application scenarios investigated, the main performance indicators employed, and the public availability of the test data. An entry with a gray background under the 15th column indicates that the study only graphically visualizes the performance and does not report numerical results. The 17th column reports the use (or not) of simple module size models, which classify or rank modules in the target release by their size, as the baseline models. The last column indicates whether the CPDP model proposed in the study will be compared against the simple module size models described in Section 3 of our study.

From Table 2, we have the following observations. First, the earliest CPDP study appears to be Briand et al.'s work [9]. In that work, Briand et al. investigated the applicability of CPDP to object-oriented software projects. Their results showed that a model built on one project produced a poor classification performance but a good ranking performance on another project. This indicates that, if used appropriately, CPDP would be helpful for practitioners. Second, CPDP has attracted rapidly increasing interest over the past few years. After Briand et al.'s pioneering work, more than 70 CPDP studies have been published. The overall results show that CPDP has a promising prediction performance, comparable to or even better than WPDP (within-project defect prediction). Third, the existing CPDP studies cover a wide range of topics. These topics include validating CPDP on different development environments (open-source and proprietary), different development stages (design and implementation), different programming languages (C, C++, C#, Java, JS, Pascal, Perl, and Ruby), different module granularities (change, function, class, and file), and different defect predictors (semantic, text, and structural). These studies make a significant contribution to our knowledge about the wide application range of CPDP. Fourth, the number of studies shows a highly unbalanced distribution over the key CPDP components. Of the key components, “filter instances” and “transform distributions” are the two most studied, while “privatize data” and “homogenize features” are the two least studied.
In particular, none of the existing CPDP studies covers all of the key components, and most of them take into account fewer than three key components. Fifth, most studies evaluate CPDP models on the complete target release data, which are publicly available, under the classification scenario. In a few studies, only part of the target release data is used as the test data. The reason is either that the target training data are needed when building a CPDP model or that a CPDP model is evaluated on the same test data as the WPDP models. Furthermore, most studies explicitly report the prediction results in numerical form. This allows us to directly compare the prediction performance of simple module size models against most of the existing CPDP models without a re-implementation. Finally, of the 72 studies, only two (i.e., Briand et al.'s study [9] and Canfora et al.'s study [12]) used simple module size models in the target release as the baseline models. Both studies reported that the proposed