Are Slice-Based Cohesion Metrics Actually Useful in Effort-Aware Post-Release Fault-Proneness Prediction? An Empirical Study

Yibiao Yang, Yuming Zhou, Hongmin Lu, Lin Chen, Zhenyu Chen, Member, IEEE, Baowen Xu, Hareton Leung, Member, IEEE, and Zhenyu Zhang, Member, IEEE

Y. Yang, Y. Zhou, H. Lu, L. Chen, and B. Xu are with the State Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China. E-mail: yangyibiao@smail.nju.edu.cn, {zhouyuming, hmlu, lchen, bwxu}@nju.edu.cn.
Z. Chen is with the State Key Laboratory for Novel Software Technology, School of Software, Nanjing University, Nanjing 210023, China. E-mail: zychen@software.nju.edu.cn.
H. Leung is with the Department of Computing, Hong Kong Polytechnic University, Hung Hom, Hong Kong, China. E-mail: cshleung@inet.polyu.edu.hk.
Z. Zhang is with the State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China. E-mail: zhangzy@ios.ac.cn.
Manuscript received 16 Feb. 2013; revised 24 Oct. 2014; accepted 29 Oct. 2014. Date of publication 11 Nov. 2014; date of current version 17 Apr. 2015. Recommended for acceptance by T. Menzies. Digital Object Identifier no. 10.1109/TSE.2014.2370048

Abstract—Background. Slice-based cohesion metrics leverage program slices with respect to the output variables of a module to quantify the strength of functional relatedness of the elements within the module. Although slice-based cohesion metrics have been proposed for many years, few empirical studies have been conducted to examine their actual usefulness in predicting fault-proneness. Objective. We aim to provide an in-depth understanding of the ability of slice-based cohesion metrics in effort-aware post-release fault-proneness prediction, i.e. their effectiveness in helping practitioners find post-release faults when taking into account the effort needed to test or inspect the code. Method. We use the most commonly used code and process metrics, including size, structural complexity, Halstead's software science, and code churn metrics, as the baseline metrics. First, we employ principal component analysis to analyze the relationships between slice-based cohesion metrics and the baseline metrics. Then, we use univariate prediction models to investigate the correlations between slice-based cohesion metrics and post-release fault-proneness. Finally, we build multivariate prediction models to examine the effectiveness of slice-based cohesion metrics in effort-aware post-release fault-proneness prediction when used alone or used together with the baseline code and process metrics. Results. Based on open-source software systems, our results show that: 1) slice-based cohesion metrics are not redundant with respect to the baseline code and process metrics; 2) most slice-based cohesion metrics are significantly negatively related to post-release fault-proneness; 3) slice-based cohesion metrics in general do not outperform the baseline metrics when predicting post-release fault-proneness; and 4) when used with the baseline metrics together, however, slice-based cohesion metrics can produce a statistically significant and practically important improvement of the effectiveness in effort-aware post-release fault-proneness prediction. Conclusion. Slice-based cohesion metrics are complementary to the most commonly used code and process metrics and are of practical value in the context of effort-aware post-release fault-proneness prediction.

Index Terms—Cohesion, metrics, slice-based, fault-proneness, prediction, effort-aware

1 INTRODUCTION

COHESION refers to the relatedness of the elements within a module [1], [2]. A highly cohesive module is one in which all elements work together towards a single function. Highly cohesive modules are desirable in a system as they are easier to develop, maintain, and reuse, and hence are less fault-prone [1], [2]. For software developers, it is desirable to be able to automatically identify modules with low cohesion as targets for software quality enhancement. However, cohesion is a subjective concept and hence is difficult to use in practice [14]. In order to attack this problem, program slicing is applied to develop quantitative cohesion metrics, as it provides a means of accurately quantifying the interactions between the elements within a module [12].
In the last three decades, many slice-based cohesion metrics have been developed to quantify the degree of cohesion in a module at the function level of granularity [3], [4], [5], [6], [7], [8], [9], [10]. For a given function, the computation of a slice-based cohesion metric consists of the following two steps. At the first step, a program reduction technique called program slicing is employed to obtain the set of program statements (i.e. the program slice) that may affect each output variable of the function [9], [11]. The output variables include the function return value, modified global variables, modified reference parameters, and variables printed or otherwise output by the function [12]. At the second step, cohesion is computed by leveraging the commonality among the slices with respect to the different output variables. Previous studies showed that slice-based cohesion metrics provided an excellent quantitative measure of cohesion [3], [13], [14]. Hence, there is reason to believe that they should be useful predictors for fault-proneness. However, few empirical studies have so far been conducted to examine the actual usefulness of slice-based cohesion metrics for predicting fault-proneness, especially compared with
the most commonly used code and process metrics [5], [15], [16], [17], [18].

In this paper, we perform a thorough empirical investigation into the ability of slice-based cohesion metrics in the context of effort-aware post-release fault-proneness prediction, i.e. their effectiveness in helping practitioners find post-release faults when taking into account the effort needed to test or inspect the code [35]. In our study, we use the most commonly used code and process metrics, including size, structural complexity, Halstead's software science, and code churn metrics, as the baseline metrics. We first employ principal component analysis (PCA) to analyze the relationships between slice-based cohesion metrics and the baseline code and process metrics. Then, we build univariate prediction models to investigate the correlations between slice-based cohesion metrics and post-release fault-proneness. Finally, we build multivariate prediction models to examine the effectiveness of slice-based cohesion metrics in effort-aware post-release fault-proneness prediction when used alone or used together with the baseline code and process metrics. In order to obtain comprehensive performance evaluations, we evaluate the effectiveness of effort-aware post-release fault-proneness prediction under the following three prediction settings: cross-validation, across-version prediction, and across-project prediction. More specifically, cross-validation is performed within the same version of a project, i.e. predicting faults in one subset using a model trained on the other complementary subsets. Across-version prediction uses a model trained on earlier versions to predict faults in later versions within the same project, while across-project prediction uses a model trained on one project to predict faults in another project. The subject projects in our study consist of five well-known open-source C projects: Bash, Gcc-core, Gimp, Subversion, and Vim. We use a mature commercial tool called Understand (www.scitools.com) to collect the baseline code and process metrics and use a powerful source code analysis tool called Frama-C to collect slice-based cohesion metrics [57]. Based on the data collected from these five projects, we attempt to answer the following four research questions:

RQ1. Are slice-based cohesion metrics redundant with respect to the most commonly used code and process metrics?
RQ2. Are slice-based cohesion metrics statistically significantly correlated to post-release fault-proneness?
RQ3. Are slice-based cohesion metrics more effective than the most commonly used code and process metrics in effort-aware post-release fault-proneness prediction?
RQ4. When used together with the most commonly used code and process metrics, can slice-based cohesion metrics significantly improve the effectiveness of effort-aware post-release fault-proneness prediction?

RQ1 and RQ2 investigate whether slice-based cohesion metrics are potentially useful post-release fault-proneness predictors. RQ3 and RQ4 investigate whether slice-based cohesion metrics can lead to significant improvements in effort-aware post-release fault-proneness prediction. These research questions are critically important to both software researchers and practitioners, as they help to answer whether slice-based cohesion metrics are of practical value in view of the extra cost involved in data collection. However, little is currently known on this subject.
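To make the effort-aware setting concrete, the sketch below ranks functions by predicted risk per line of code and reports how many post-release faults fall within a fixed inspection budget of 20 percent of the code. This is only an illustration of the general idea under assumed data: the concrete effort-aware performance indicators used in the study are defined later in the paper, and all names and numbers here are hypothetical.

/*
 * Illustrative sketch of effort-aware evaluation in the ranking scenario
 * (assumed setup, not the paper's exact indicator): functions are ranked
 * by predicted risk per unit of inspection effort (here, SLOC), and we
 * count the post-release faults located in the top 20% of total SLOC.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    const char *name;   /* function identifier (hypothetical)     */
    int sloc;           /* inspection effort proxy: lines of code */
    double risk;        /* predicted fault-proneness              */
    int faults;         /* actual post-release faults (oracle)    */
} FuncRecord;

/* Sort by predicted risk density (risk / SLOC), highest first. */
static int by_risk_density(const void *a, const void *b)
{
    const FuncRecord *fa = a, *fb = b;
    double da = fa->risk / fa->sloc, db = fb->risk / fb->sloc;
    return (da < db) - (da > db);
}

int main(void)
{
    FuncRecord funcs[] = {
        {"parse_cmd",  120, 0.81, 2},
        {"eval_expr",  300, 0.65, 1},
        {"init_table",  40, 0.10, 0},
        {"copy_buf",    60, 0.55, 1},
        {"print_help", 200, 0.05, 0},
    };
    int n = sizeof funcs / sizeof funcs[0];
    int total_sloc = 0, total_faults = 0;
    for (int i = 0; i < n; i++) {
        total_sloc   += funcs[i].sloc;
        total_faults += funcs[i].faults;
    }

    qsort(funcs, n, sizeof funcs[0], by_risk_density);

    /* Inspect functions in ranked order until 20% of the code is spent. */
    int budget = (int)(0.20 * total_sloc), spent = 0, found = 0;
    for (int i = 0; i < n && spent < budget; i++) {
        spent += funcs[i].sloc;
        found += funcs[i].faults;
    }
    printf("faults found with 20%% of SLOC: %d of %d\n", found, total_faults);
    return 0;
}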
Our study attempts to fill this gap with a comprehensive investigation into the actual usefulness of slice-based cohesion metrics in the context of effort-aware post-release fault-proneness prediction.

The contributions of this paper are listed as follows. First, we compare slice-based cohesion metrics with the most commonly used code and process metrics, including size, structural complexity, Halstead's software science metrics, and code churn metrics. The results show that slice-based cohesion metrics measure essentially different quality information than the baseline code and process metrics do. This indicates that slice-based cohesion metrics are not redundant with respect to the most commonly used code and process metrics. Second, we validate the correlations between slice-based cohesion metrics and fault-proneness. The results show that most slice-based cohesion metrics are statistically related to fault-proneness in the expected direction. Third, we analyze the effectiveness of slice-based cohesion metrics in effort-aware post-release fault-proneness prediction compared with the most commonly used code and process metrics. The results, somewhat surprisingly, show that slice-based cohesion metrics in general do not outperform the most commonly used code and process metrics. Fourth, we investigate whether the combination of slice-based cohesion metrics with the most commonly used code and process metrics provides better results in predicting fault-proneness. The results show that the inclusion of slice-based cohesion metrics can produce a statistically significant improvement of the effectiveness in effort-aware post-release fault-proneness prediction under any of the three prediction settings. In particular, in the ranking scenario, when testing or inspecting 20 percent of the code of a system, slice-based cohesion metrics lead to a moderate to large improvement (Cliff's δ: 0.33-1.00), regardless of which prediction setting is considered. In the classification scenario, they lead to a moderate to large improvement (Cliff's δ: 0.31-0.77) in most systems under cross-validation and lead to a large improvement (Cliff's δ: 0.55-0.72) under across-version prediction. In summary, these results reveal that the improvement is practically important for practitioners and is worth the relatively high time cost of collecting slice-based cohesion metrics. In other words, for practitioners, slice-based cohesion metrics are of practical value in the context of effort-aware post-release fault-proneness prediction. Our study provides valuable data in an important area for which otherwise there is limited experimental data available.

The rest of this paper is organized as follows. Section 2 introduces slice-based cohesion metrics and the most commonly used code and process metrics that we will investigate. Section 3 gives the research hypotheses on slice-based cohesion metrics, introduces the investigated dependent and independent variables, presents the employed modeling technique, and describes the data analysis methods. Section 4 describes the experimental setup in our study, including the data sources and the method we used to
collect the experimental data sets. Section 5 reports in detail the experimental results. Section 6 examines the threats to validity of our study. Section 7 discusses the related work. Section 8 concludes the paper and outlines directions for future work.

2 THE METRICS

In this section, we first describe the slice-based cohesion metrics investigated in this study. Then, we describe the most commonly used code and process metrics that will be compared against when analyzing the actual usefulness of slice-based cohesion metrics in effort-aware post-release fault-proneness prediction.

2.1 Slice-Based Cohesion Metrics

The origin of slice-based cohesion metrics can be traced back to Weiser, who used backward slicing to describe the concepts of coverage, overlap, and tightness [9], [19]. A backward slice of a module at statement n with respect to variable v is the sequence of all statements and predicates that might affect the value of v at n [9], [19]. For a given module, Weiser first sliced on every variable where it occurred in the module. Then, Weiser computed Coverage as the ratio of average slice size to program size, Overlap as the average ratio of non-unique to unique statements in each slice, and Tightness as the percentage of statements common in all slices. As stated by Ott and Bieman [10], however, Weiser "did not identify actual software attributes these metrics might meaningfully measure", although such metrics were helpful for observing the structuring of a module.

Longworth [7] demonstrated that Coverage, a modified definition of Overlap (i.e. the average ratio of the size of non-unique statements to slice size), and Tightness could be used as cohesion metrics of a module. In particular, Longworth sliced on every variable once at the end point of the module to obtain end slices (i.e. backward slices computed from the end of a module) and then used them to compute these metrics.

Later, Ott and Thuss [3] improved the behavior of slice-based cohesion metrics through the use of metric slices on output variables. A metric slice takes into account both the uses and used by data relationships [3]. More specifically, a metric slice with respect to variable v is the union of the backward slice with respect to v at the end point of the module and the forward slice computed from the definitions of v in the backward slice. A forward slice of a module at statement n with respect to variable v is the sequence of all statements and predicates that might be affected by the value of v at n. Ott and Thuss argued that the purpose of executing a module was indicated by its output variables, including function return values, modified global variables, printed variables, and modified reference parameters. Furthermore, the slices on the output variables of a module capture the specific computations for the tasks that the module performs. Therefore, we could use the relationships among the slices on output variables to investigate whether the module's tasks are related, i.e. whether the module is cohesive. They redefined Overlap as the average ratio of the slice interaction size to slice size and added MinCoverage and MaxCoverage to the metrics suite. MinCoverage and MaxCoverage are respectively the ratio of the size of the smallest slice to the module size and the ratio of the size of the largest slice to the module size. Consequently, the slice-based cohesion metrics suite proposed by Ott and Thuss consists of five metrics: Coverage, Overlap, Tightness, MinCoverage, and MaxCoverage.
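For illustration, the following minimal C sketch computes the five statement-level metrics of Ott and Thuss from slices represented as boolean membership vectors over a module's statements. The representation and the toy data are assumptions made for this example; only the formulas follow the definitions summarized in Table 1.

/*
 * Minimal sketch (assumed representation): each metric slice is a boolean
 * membership vector over the module's statements, and the five
 * statement-level metrics of Ott and Thuss (Coverage, MaxCoverage,
 * MinCoverage, Overlap, Tightness) are computed from these vectors
 * following the definitions summarized in Table 1.
 */
#include <stdio.h>

#define MAX_STMTS 64
#define MAX_SLICES 8

typedef struct {
    int n_stmts;                        /* length(M): statements in module  */
    int n_slices;                       /* |Vo|: one slice per output var   */
    int in[MAX_SLICES][MAX_STMTS];      /* in[i][s] = 1 if stmt s is in SLi */
} SliceProfile;

static int slice_size(const SliceProfile *p, int i)
{
    int size = 0;
    for (int s = 0; s < p->n_stmts; s++) size += p->in[i][s];
    return size;
}

/* |SLint|: statements common to all slices (the "cohesive section"). */
static int intersection_size(const SliceProfile *p)
{
    int size = 0;
    for (int s = 0; s < p->n_stmts; s++) {
        int in_all = 1;
        for (int i = 0; i < p->n_slices; i++) in_all &= p->in[i][s];
        size += in_all;
    }
    return size;
}

static void slice_metrics(const SliceProfile *p)
{
    int sl_int = intersection_size(p);
    int min = p->n_stmts, max = 0;
    double sum_cov = 0.0, sum_ovl = 0.0;

    for (int i = 0; i < p->n_slices; i++) {
        int sz = slice_size(p, i);
        if (sz < min) min = sz;
        if (sz > max) max = sz;
        sum_cov += (double)sz / p->n_stmts;       /* |SLi| / length(M) */
        sum_ovl += (double)sl_int / sz;           /* |SLint| / |SLi|   */
    }
    printf("Coverage    = %.3f\n", sum_cov / p->n_slices);
    printf("MaxCoverage = %.3f\n", (double)max / p->n_stmts);
    printf("MinCoverage = %.3f\n", (double)min / p->n_stmts);
    printf("Overlap     = %.3f\n", sum_ovl / p->n_slices);
    printf("Tightness   = %.3f\n", (double)sl_int / p->n_stmts);
}

int main(void)
{
    /* Toy module with 6 statements and two output variables. */
    SliceProfile p = {
        .n_stmts = 6, .n_slices = 2,
        .in = { {1, 1, 1, 0, 1, 1},
                {1, 1, 0, 1, 1, 1} }
    };
    slice_metrics(&p);
    return 0;
}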
Note that these metrics are computed at the statement level, i.e. statements are the basic unit of metric slices.

Ott and Bieman [20] refined the concept of metric slices to use data tokens (i.e. the definitions of and references to variables and constants) rather than statements as the basic unit of which slices are composed. They called such slices data slices. More specifically, a data slice for a variable v is the sequence of all data tokens in the statements that comprise the metric slice of v. This leads to five slice-based data-token-level cohesion metrics.

Bieman and Ott [4] used data slices to develop three cohesion metrics: SFC (strong functional cohesion), WFC (weak functional cohesion), and A (Adhesiveness). They defined the slice abstraction of a module as the set of data slices with respect to its output variables. In particular, a data token is called a "glue token" if it lies on more than one data slice, and is called a "super-glue token" if it lies on all data slices in the slice abstraction. As such, SFC is defined as the ratio of the number of super-glue tokens to the total number of data tokens in the module. WFC is defined as the ratio of the number of glue tokens to the total number of data tokens in the module. A is defined as the average adhesiveness over all the data tokens in the module. The adhesiveness of a data token is the relative number of slices that it glues together. If a data token is a glue token, its adhesiveness is the ratio of the number of slices that it appears in to the total number of slices. Otherwise, its adhesiveness is zero. Indeed, SFC is equivalent to the data-token-level Tightness metric and A is equivalent to the data-token-level Coverage metric proposed by Ott and Bieman [20].

Counsell et al. [5] proposed a cohesion metric called normalized Hamming distance (NHD) based on the concept of a slice occurrence matrix. For a given module, the slice occurrence matrix has columns indexed by its output variables and rows indexed by its statements. The (i, j)th entry of the matrix has a value of 1 if the ith statement is in the end slice with respect to the jth output variable, and 0 otherwise. In this matrix, each row is called a slice occurrence vector. NHD is defined as the ratio of the total actual slice agreement between rows to the total possible agreement between rows in the matrix. The slice agreement between two rows is the number of places in which the slice occurrence vectors of the two rows are equal.

Dallal [8] used a data-token-level slice occurrence matrix to develop a cohesion metric called the similarity-based functional cohesion metric (SBFC). For a given module, the data-token-level slice occurrence matrix has columns indexed by its output variables and rows indexed by its data tokens. The (i, j)th entry of the matrix has a value of 1 if the ith data token is in the end slice with respect to the jth output variable, and 0 otherwise. SBFC is defined as the average degree of the normalized similarity between columns. The normalized similarity between a pair of columns is the ratio of the number of entries where both columns have a value of 1 to the total number of rows in the matrix.

Table 1 summarizes the formal definitions, descriptions, and sources of the slice-based cohesion metrics that will be investigated in this study. In this table, for a given module
M, Vo denotes the set of its output variables, length(M) denotes its size, SA(M) denotes its slice abstraction, and tokens(M) denotes the set of its data tokens. SLi is the slice obtained for vi ∈ Vo, and SLint (called the "cohesive section" by Harman et al. [6]) is the intersection of SLi over all vi ∈ Vo. In particular, G(SA(M)) and SG(SA(M)) are respectively the set of glue tokens and the set of super-glue tokens. In the definition of NHD, k is the number of statements, l is the number of output variables, and cj is the number of 1s in the jth column of the statement-level slice occurrence matrix. In the definition of SBFC, xi is the number of 1s in the ith row of the data-token-level slice occurrence matrix.

Note that all the slice-based cohesion metrics can be computed at either the statement or the data-token level, although some of them are originally defined at only one of the two levels. The data-token level is at a finer granularity than the statement level, since a statement might contain a number of data tokens. We next use the example function fun shown in Table 2 to illustrate the computations of the slice-based cohesion metrics at the data-token level.

In Table 2: (1) the first column lists the statement number (excluding non-executable statements such as blank statements, "{", and "}"); (2) the second column lists the code of the example function; (3) the third to fifth columns respectively list the end slices for the largest, smallest, and range variables; (4) the sixth to eighth columns respectively list the forward slices from the definitions of the largest, smallest, and range variables in the backward slices; and (5) the ninth to eleventh columns list the metric slices for the largest, smallest, and range variables. Here, a vertical bar "|" in the last nine columns denotes that the indicated statement is part of the corresponding slice for the named output variable. This example function determines the smallest, the largest, and the range of an array, and is a modified version of the example module used by Longworth [7]. For this example, Vo consists of largest, smallest, and range. The former two variables are modified reference parameters and the latter is the function return value.

Table 3 shows the data-token-level slice occurrence matrix of the fun function under end slices and metric slices, where Ti indicates the ith data token for T in the function. Table 4 shows the computations of the twenty data-token-level slice-based cohesion metrics. In this table, the second to eleventh rows show the computations for end-slice-based cohesion metrics and the 12th to 21st rows show the computations for metric-slice-based cohesion metrics. As can be seen, end-slice-based metrics take typical values around 0.5 or 0.6, while metric-slice-based metrics take typical values around 0.7 or 0.8. In particular, for each cohesion metric (except MaxCoverage), the metric-slice-based version has a considerably larger value than the corresponding end-slice-based version.
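As a concrete illustration of the data-token-level computations, the sketch below derives WFC, SFC, A, NHD, and SBFC from a slice occurrence matrix like the one described for Table 3. The matrix values are made up for this example; NHD, which Table 1 states over the statement-level matrix, is applied here to the token-level matrix, in line with the study's choice to compute all metrics at the data-token level.

/*
 * Minimal sketch (assumed matrix representation): the data-token-level
 * slice occurrence matrix occ[t][j] is 1 if data token t belongs to the
 * slice for output variable j. The token-level metrics of Table 1 are
 * then simple row and column counts over this matrix.
 */
#include <stdio.h>

#define MAX_TOKENS 128
#define MAX_VARS 8

typedef struct {
    int n_tokens;                      /* |tokens(M)|                    */
    int n_vars;                        /* |Vo|, one slice per output var */
    int occ[MAX_TOKENS][MAX_VARS];     /* slice occurrence matrix        */
} TokenProfile;

static void token_metrics(const TokenProfile *p)
{
    int glue = 0, superglue = 0, adhesion = 0;
    long sim = 0;                      /* sum of x_t * (x_t - 1)         */

    for (int t = 0; t < p->n_tokens; t++) {
        int x = 0;                     /* slices containing token t      */
        for (int j = 0; j < p->n_vars; j++) x += p->occ[t][j];
        if (x > 1) { glue++; adhesion += x; }   /* glue tokens           */
        if (x == p->n_vars) superglue++;        /* super-glue tokens     */
        sim += (long)x * (x - 1);
    }

    /* NHD needs the column sums c_j over the same occurrence matrix.    */
    double agree = 0.0;
    for (int j = 0; j < p->n_vars; j++) {
        int c = 0;
        for (int t = 0; t < p->n_tokens; t++) c += p->occ[t][j];
        agree += (double)c * (p->n_tokens - c);
    }

    int T = p->n_tokens, V = p->n_vars;   /* assumes T > 1 and V >= 1    */
    printf("WFC  = %.3f\n", (double)glue / T);
    printf("SFC  = %.3f  (data-token-level Tightness)\n", (double)superglue / T);
    printf("A    = %.3f  (data-token-level Coverage)\n", (double)adhesion / (T * V));
    printf("NHD  = %.3f\n", 1.0 - 2.0 * agree / ((double)V * T * (T - 1)));
    printf("SBFC = %.3f\n", V > 1 ? (double)sim / ((double)T * V * (V - 1)) : 1.0);
}

int main(void)
{
    /* Toy module with 8 data tokens and 2 output variables. */
    TokenProfile p = {
        .n_tokens = 8, .n_vars = 2,
        .occ = { {1,1}, {1,1}, {1,0}, {0,1}, {1,1}, {1,1}, {1,0}, {1,1} }
    };
    token_metrics(&p);
    return 0;
}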
TABLE 1
Definitions of Slice-Based Cohesion Metrics

Coverage [3], [20]. Definition: $\mathit{Coverage} = \frac{1}{|V_o|}\sum_{i=1}^{|V_o|}\frac{|SL_i|}{\mathit{length}(M)}$. Description: the extent to which the slices cover the module (measured as the ratio of the mean slice size to the module size).

MaxCoverage [3], [20]. Definition: $\mathit{MaxCoverage} = \frac{1}{\mathit{length}(M)}\max_i |SL_i|$. Description: the extent to which the largest slice covers the module (measured as the ratio of the size of the largest slice to the module size).

MinCoverage [3], [20]. Definition: $\mathit{MinCoverage} = \frac{1}{\mathit{length}(M)}\min_i |SL_i|$. Description: the extent to which the smallest slice covers the module (measured as the ratio of the size of the smallest slice to the module size).

Overlap [3], [20]. Definition: $\mathit{Overlap} = \frac{1}{|V_o|}\sum_{i=1}^{|V_o|}\frac{|SL_{int}|}{|SL_i|}$. Description: the extent to which slices are interdependent (measured as the average ratio of the size of the "cohesive section" to the size of each slice).

Tightness [3], [20]. Definition: $\mathit{Tightness} = \frac{|SL_{int}|}{\mathit{length}(M)}$. Description: the extent to which all the slices in the module belong together (measured as the ratio of the size of the "cohesive section" to the module size).

SFC [4]. Definition: $\mathit{SFC} = \frac{|SG(SA(M))|}{|\mathit{tokens}(M)|}$. Description: the extent to which all the slices in the module belong together (measured as the ratio of the number of super-glue tokens to the total number of data tokens of the module).

WFC [4]. Definition: $\mathit{WFC} = \frac{|G(SA(M))|}{|\mathit{tokens}(M)|}$. Description: the extent to which the slices in the module belong together (measured as the ratio of the number of glue tokens to the total number of data tokens of the module).

A [4]. Definition: $A = \frac{\sum_{t \in G(SA(M))} |\{\text{slices containing } t\}|}{|\mathit{tokens}(M)| \cdot |SA(M)|}$. Description: the extent to which the glue tokens in the module are adhesive (measured as the ratio of the amount of adhesiveness to the total possible adhesiveness).

NHD [5]. Definition: $\mathit{NHD} = 1 - \frac{2}{lk(k-1)}\sum_{j=1}^{l} c_j (k - c_j)$. Description: the extent to which the statements in the slices are the same (measured as the ratio of the total slice agreement between rows to the total possible agreement between rows in the statement-level slice occurrence matrix of the module).

SBFC [8]. Definition: $\mathit{SBFC} = 1$ if $|V_o| = 1$; otherwise $\mathit{SBFC} = \frac{\sum_{i=1}^{|\mathit{tokens}(M)|} x_i (x_i - 1)}{|\mathit{tokens}(M)| \cdot |V_o| \cdot (|V_o| - 1)}$. Description: the extent to which the slices are similar (measured as the average degree of the normalized similarity between columns in the data-token-level slice occurrence matrix of the module).

When looking at the
example function fun shown in Table 2, we find that, except for an unnecessary initialization statement (statement 8 in Table 2: range = 0;), all the remaining statements are related to the computation of the final outputs. In other words, intuitively, this function has a high cohesion. In this sense, when measuring its cohesion, it appears that metric-slice-based cohesion metrics are more accurate than end-slice-based cohesion metrics.

As mentioned above, Coverage, MaxCoverage, MinCoverage, Overlap, Tightness, SFC, WFC, and A are originally based on metric slices [4], [6], [13], [15], [16], [17], [18], [21]. However, NHD and SBFC are originally based on end slices. In this study, we will use metric slices to compute all the cohesion metrics. In particular, we will use metric-slice-based cohesion metrics at the data-token level to investigate the actual usefulness of slice-based cohesion metrics in effort-aware post-release fault-proneness prediction. The reason for choosing the data-token level rather than the statement level is that the former is at a finer granularity. Previous studies suggested that software metrics at a finer granularity would accordingly have a higher discriminative power and hence may be more useful for fault-proneness prediction [62], [63]. Note that, at the data-token level, SFC is equivalent to Tightness and A is equivalent to Coverage. Therefore, in the subsequent analysis, only the following eight metric-slice-based cohesion metrics will be examined: Coverage, MaxCoverage, MinCoverage, Overlap, Tightness, WFC, NHD, and SBFC. During our analysis, a function is regarded as a module, and the output variables of a function consist of the function return value, modified global variables, modified reference parameters, and standard outputs by the function.

2.2 The Most Commonly Used Code and Process Metrics

In this study, we employ the most commonly used code and process metrics as the baseline metrics to analyze the actual usefulness of slice-based cohesion metrics in effort-aware post-release fault-proneness prediction. As shown in Table 5, the baseline code and process metrics cover 16 product metrics and three process metrics. The 16 product metrics consist of 1 size metric, 11 structural complexity metrics, and 4 software science metrics. The size metric SLOC simply counts the non-blank, non-commentary source lines of code in a function. There is a common belief that a function with a larger size tends to be more fault-prone [22], [23], [24], [25]. The structural complexity metrics, including the well-known McCabe's Cyclomatic complexity metrics, assume that a function with a complex control flow structure is likely to be fault-prone [26], [27], [28], [29]. Halstead's software science metrics estimate reading complexity based on the counts of operators and operands, in which a function that is hard to read is assumed to be fault-prone [30].
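As a reminder of how such software science measures are derived (the exact set listed in Table 5 is not reproduced in this excerpt), the classic Halstead definitions and a small worked example with hypothetical counts are given below.

% Classic Halstead definitions (a reminder, not the paper's Table 5):
% n1, n2 = distinct operators / operands; N1, N2 = total occurrences.
\begin{align*}
  n &= n_1 + n_2                          & \text{(vocabulary)} \\
  N &= N_1 + N_2                          & \text{(length)} \\
  V &= N \log_2 n                         & \text{(volume)} \\
  D &= \frac{n_1}{2} \cdot \frac{N_2}{n_2} & \text{(difficulty)} \\
  E &= D \cdot V                          & \text{(effort)}
\end{align*}
% Hypothetical example: a function with n1 = 10, n2 = 8, N1 = 40, N2 = 30
% gives n = 18, N = 70, V = 70 log2(18) = 291.9, D = 5 * 3.75 = 18.75,
% and E = 18.75 * 291.9 = 5473 (rounded).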
TABLE 2
End Slice Profile and Metric Slice Profile for Function fun
(Original columns: Line, Code, and slice-membership marks, indicated by a vertical bar "|", for the end slice, forward slice, and metric slice of each output variable largest, smallest, and range. Only the Line and Code columns are reproduced here.)

Line  Code
      int fun(
 1      int A[],
 2      int size,
 3      int *largest,
 4      int *smallest)
      {
 5      int i;
 6      int range;
 7      i = 1;
 8      range = 0;
 9      *smallest = A[0];
10      *largest = *smallest;
11      while (i < size) {
12        if (*smallest > A[i])
13          *smallest = A[i];
14        if (*largest < A[i])
15          *largest = A[i];
16        i++;
        }
17      range = *largest - *smallest;
18      return range;
      }

Data tokens included in the end slice for the variable smallest are indicated by the underline.

Note that we