do not include the other Halstead’s software science metrics such as N, n, V, D, and E [30]. The reason is that these metrics are fully based on n1, n2, N1, and N2 (for example, n = n1 + n2). Consequently, they are highly correlated with n1, n2, N1, and N2. When building multivariate prediction models, such highly correlated predictors lead to high multicollinearity and hence might yield inaccurate coefficient estimates [61]. Therefore, our study only takes into account n1, n2, N1, and N2. The process metrics consist of three relative code churn metrics, i.e. the normalized numbers of added, deleted, and modified source lines of code. These code churn metrics assume that a function with more added, deleted, or modified code has a higher possibility of being fault-prone.

The reasons for choosing these baseline code and process metrics in this study are three-fold. First, they are widely used product and process metrics in both industry and academic research [22], [23], [24], [25], [26], [27], [28], [29], [50], [52], [54], [55], [64]. Second, they can be automatically and cheaply collected from source code even for very large software systems. Third, many studies show that they are useful indicators for fault-proneness prediction [26], [27], [28], [50], [52], [54], [55], [64]. In the context of effort-aware post-release fault-proneness prediction, we believe that slice-based cohesion metrics are of practical value only if: (1) they have a significantly better fault-proneness prediction ability than the baseline code and process metrics; or (2) they can significantly improve the performance of fault-proneness prediction when used together with the baseline code and process metrics. This is especially true when considering the expense of collecting slice-based cohesion metrics. As Meyers and Binkley stated [13], slicing techniques and tools are now mature enough to allow an intensive empirical investigation. In our study, we use Frama-C, a well-known open-source static analysis tool for C programs [57], to collect slice-based cohesion metrics. Frama-C provides scalable and sound software analyses for C programs, thus allowing accurate collection of slice-based cohesion metrics on industrial-size systems [57].

3 RESEARCH METHODOLOGY

In this section, we first give the research hypotheses relating slice-based cohesion metrics to the most commonly used code and process metrics and to fault-proneness. Then, we describe the investigated dependent and independent variables, the employed modeling technique, and the data analysis methods.

3.1 Research Hypotheses

The first research question (RQ1) of this study investigates whether slice-based cohesion metrics are redundant when compared with the most commonly used code and process metrics. It is widely believed that software quality cannot be measured along only a single dimension [29]. As stated in Section 2.2, the most commonly used code and process metrics measure software quality from size, control flow structure, and cognitive psychology perspectives. In contrast, slice-based cohesion metrics measure software quality from the perspective of cohesion, based on control- and data-flow dependence information among statements. Our conjecture is that, given the nature of the information and the counting mechanism employed by slice-based cohesion metrics, they should capture underlying dimensions of software quality different from those captured by the most commonly used code and process metrics.
From this reasoning, we set up the following null hypothesis H10 and alternative hypothesis H1A for RQ1:

H10. Slice-based cohesion metrics do not capture additional dimensions of software quality compared with the most commonly used code and process metrics.

H1A. Slice-based cohesion metrics capture additional dimensions of software quality compared with the most commonly used code and process metrics.

The second research question (RQ2) of this study investigates whether slice-based cohesion metrics are statistically related to post-release fault-proneness. In the software engineering literature, there is a common belief that low cohesion indicates an inappropriate design [1], [2]. Consequently, a function with low cohesion is more likely to be fault-prone than a function with high cohesion [1], [2]. From Section 2.1, we can see that slice-based cohesion metrics leverage the commonality among the slices with respect to different output variables of a function to quantify its cohesion. Existing studies showed that they provided an excellent quantitative measure of function cohesion [5], [13]. In particular, for each of the investigated slice-based cohesion metrics, a large value indicates high cohesion. From this reasoning, we set up the following null hypothesis H20 and alternative hypothesis H2A for RQ2:

H20. There is no significant correlation between slice-based cohesion metrics and post-release fault-proneness.

H2A. There is a significant correlation between slice-based cohesion metrics and post-release fault-proneness.

TABLE 3
Data-Token Level Slice Occurrence Matrix with Respect to End Slice Profile and Metric Slice Profile

                          End slice                 Metric slice
Line  Token         largest smallest range    largest smallest range
 1    A1               1       1       1         1       1       1
 2    size1            1       1       1         1       1       1
 3    largest1         1       0       1         1       0       1
 4    smallest1        1       1       1         1       1       1
 5    i1               1       1       1         1       1       1
 6    range1           0       0       1         0       0       1
 7    i2               1       1       1         1       1       1
 7    11               1       1       1         1       1       1
 8    range2           0       0       0         0       0       0
 8    01               0       0       0         0       0       0
 9    smallest2        1       1       1         1       1       1
 9    A2               1       1       1         1       1       1
 9    02               1       1       1         1       1       1
10    largest2         1       0       1         1       1       1
10    smallest3        1       0       1         1       1       1
11    i3               1       1       1         1       1       1
11    size2            1       1       1         1       1       1
12    smallest4        0       1       1         0       1       1
12    A3               0       1       1         0       1       1
12    i4               0       1       1         0       1       1
13    smallest5        0       1       1         0       1       1
13    A4               0       1       1         0       1       1
13    i5               0       1       1         0       1       1
14    largest3         1       0       1         1       1       1
14    A5               1       0       1         1       1       1
14    i6               1       0       1         1       1       1
15    largest4         1       0       1         1       1       1
15    A6               1       0       1         1       1       1
15    i7               1       0       1         1       1       1
16    i8               1       1       1         1       1       1
17    range3           0       0       1         1       1       1
17    largest5         0       0       1         1       1       1
17    smallest6        0       0       1         1       1       1
18    range4           0       0       1         1       1       1

TABLE 4
Example Metrics Computations at the Data-Token Level

Type          Metric        Computation                                            Value
End slice     Coverage      = 1/3 × (21/34 + 18/34 + 32/34)                        = 0.696
              MaxCoverage   = 32/34                                                = 0.941
              MinCoverage   = 18/34                                                = 0.529
              Overlap       = 1/3 × (12/21 + 12/18 + 12/32)                        = 0.538
              Tightness     = 12/34                                                = 0.353
              SFC           = 12/34                                                = 0.353
              WFC           = 27/34                                                = 0.794
              A             = (21 + 18 + 32)/(3 × 34)                              = 0.696
              SBFC          = (12 × 3 × 2 + 15 × 2 × 1)/(34 × 3 × 2)               = 0.500
              NHD           = 1 − 2/(3 × 34 × 33) × (21 × 13 + 18 × 16 + 32 × 2)   = 0.629
Metric slice  Coverage      = 1/3 × (25/34 + 30/34 + 32/34)                        = 0.853
              MaxCoverage   = 32/34                                                = 0.941
              MinCoverage   = 25/34                                                = 0.735
              Overlap       = 1/3 × (24/25 + 24/30 + 24/32)                        = 0.837
              Tightness     = 24/34                                                = 0.706
              SFC           = 24/34                                                = 0.706
              WFC           = 31/34                                                = 0.912
              A             = (25 + 30 + 32)/(3 × 34)                              = 0.853
              SBFC          = (24 × 3 × 2 + 7 × 2 × 1)/(34 × 3 × 2)                = 0.775
              NHD           = 1 − 2/(3 × 34 × 33) × (25 × 9 + 30 × 4 + 32 × 2)     = 0.757

TABLE 5
The Most Commonly Used Code and Process Metrics (i.e. the Baseline Metrics in This Study)

Category  Characteristic          Metric               Description
Product   Size                    SLOC                 Source lines of code in a function (excluding blank lines and comment lines)
          Structural complexity   FANIN                Number of calling functions plus global variables read
                                  FANOUT               Number of called functions plus global variables set
                                  NPATH                Number of possible paths, not counting abnormal exits or gotos
                                  Cyclomatic           Cyclomatic complexity
                                  CyclomaticModified   Modified cyclomatic complexity
                                  CyclomaticStrict     Strict cyclomatic complexity
                                  Essential            Essential complexity
                                  Knots                Measure of overlapping jumps
                                  Nesting              Maximum nesting level of control constructs
                                  MaxEssentialKnots    Maximum Knots after structured programming constructs have been removed
                                  MinEssentialKnots    Minimum Knots after structured programming constructs have been removed
          Software science        n1                   Total number of distinct operators of a function
                                  n2                   Total number of distinct operands of a function
                                  N1                   Total number of operators of a function
                                  N2                   Total number of operands of a function
Process   Code churn              Added                Added source lines of code, normalized by function size
                                  Deleted              Deleted source lines of code, normalized by function size
                                  Modified             Modified source lines of code, normalized by function size
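To make the computations in Tables 3 and 4 concrete, the following sketch recomputes several end-slice values from the occurrence matrix of Table 3. It is a minimal illustration rather than the tool chain used in the study: the metric formulas are read directly off the worked computations in Table 4 (e.g., 27 of the 34 data tokens appear in more than one end slice, giving WFC = 27/34), not from their formal definitions in Section 2.1, and the three bit strings simply encode the largest, smallest, and range columns of Table 3 from top to bottom.

```python
# Illustrative only: reproduces the end-slice values of Table 4 from the
# occurrence matrix in Table 3, using the formulas implied by Table 4.
largest  = ("1111101100" "1111111000" "0001111111" "0000")  # 21 ones
smallest = ("1101101100" "1110011111" "1110000001" "0000")  # 18 ones
range_   = ("1111111100" "1111111111" "1111111111" "1111")  # 32 ones (trailing _ avoids the built-in name)

slices = [[int(c) for c in col] for col in (largest, smallest, range_)]
n_tokens = len(slices[0])                      # 34 data tokens in Table 3
k = len(slices)                                # 3 end slices (one per output variable)
sizes = [sum(s) for s in slices]               # [21, 18, 32]

in_all  = sum(all(s[t] for s in slices) for t in range(n_tokens))      # 12 tokens in every slice
in_many = sum(sum(s[t] for s in slices) > 1 for t in range(n_tokens))  # 27 tokens in more than one slice

coverage     = sum(size / n_tokens for size in sizes) / k   # 0.696
max_coverage = max(sizes) / n_tokens                        # 0.941
min_coverage = min(sizes) / n_tokens                        # 0.529
overlap      = sum(in_all / size for size in sizes) / k     # 0.538
tightness    = in_all / n_tokens                            # 0.353 (equals SFC in Table 4)
wfc          = in_many / n_tokens                           # 0.794

print(coverage, max_coverage, min_coverage, overlap, tightness, wfc)
```

Feeding the metric-slice columns of Table 3 through the same arithmetic reproduces the metric-slice rows of Table 4; SBFC and NHD are omitted here for brevity.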
The third research question (RQ3) of this study investigates whether slice-based cohesion metrics predict post-release fault-prone functions more accurately than the most commonly used code and process metrics do. From Table 5, we can see that the most commonly used code and process metrics are based on either simple syntactic information or control flow structure information among statements in a function. In contrast, slice-based cohesion metrics make use of the semantic dependence information among the statements in a function. In other words, they are based on program behaviors as captured by program slices. In this sense, slice-based cohesion metrics provide a higher-level quantification of software quality than the most commonly used code and process metrics. Consequently, it is reasonable to expect
that slice-based cohesion metrics are more closely related to fault-proneness than the most commonly used code and process metrics. From this expectation, we set up the following null hypothesis H30 and alternative hypothesis H3A for RQ3:

H30. Slice-based cohesion metrics are not more effective in effort-aware post-release fault-proneness prediction than the most commonly used code and process metrics.

H3A. Slice-based cohesion metrics are more effective in effort-aware post-release fault-proneness prediction than the most commonly used code and process metrics.

The fourth research question (RQ4) of this study investigates whether the model built with slice-based cohesion metrics and the most commonly used code and process metrics together has a better ability to predict post-release fault-proneness than the model built with the most commonly used code and process metrics alone. This issue is indeed raised by the null hypothesis H10. If the null hypothesis H10 is rejected, it means that slice-based cohesion metrics capture underlying dimensions of software quality that are not captured by the most commonly used code and process metrics. In this case, we naturally conjecture that combining slice-based cohesion metrics with the most commonly used code and process metrics should give a more complete indication of software quality. Consequently, the combination of slice-based cohesion metrics with the most commonly used code and process metrics should be a better indicator of post-release fault-proneness than the most commonly used code and process metrics alone. From this reasoning, we set up the following null hypothesis H40 and alternative hypothesis H4A for RQ4:

H40. The combination of slice-based cohesion metrics with the most commonly used code and process metrics is not more effective in effort-aware post-release fault-proneness prediction than the most commonly used code and process metrics alone.

H4A. The combination of slice-based cohesion metrics with the most commonly used code and process metrics is more effective in effort-aware post-release fault-proneness prediction than the most commonly used code and process metrics alone.

3.2 Variable Description

The dependent variable in this study is a binary variable Y that can take only one of two values, 0 and 1. Here, Y = 1 represents that the corresponding function has at least one post-release fault and Y = 0 represents that the corresponding function has no post-release fault. In this paper, we use a modeling technique called logistic regression (described in Section 3.3) to predict the probability of Y = 1. The probability of Y = 1 indeed indicates post-release fault-proneness, i.e. the extent to which a function is post-release faulty. As stated by Nagappan et al. [66], for the users, only post-release failures matter. It is hence essential in practice to predict the post-release fault-proneness of the functions in a system, as it enables developers to take focused preventive actions to improve quality in a cost-effective way. Indeed, much effort has been devoted to post-release fault-proneness prediction [27], [34], [36], [42], [54], [60], [64], [65], [66].

The independent variables in this study consist of two categories of metrics: (i) the 19 most commonly used code and process metrics, and (ii) eight slice-based cohesion metrics. All these metrics are collected at the function level.
The objective of this study is to empirically investigate the actual usefulness of slice-based cohesion metrics in the context of effort-aware post-release fault-proneness prediction, especially when compared with the most commonly used code and process metrics. With these independent variables, we are able to test the four null hypotheses described in Section 3.1.

3.3 Modeling Technique

Logistic regression is a standard statistical modeling technique in which the dependent variable can take two different values [28]. It is suitable for building fault-proneness prediction models because the functions under consideration are divided into two categories: faulty and not faulty. Let Pr(Y = 1 | X1, X2, ..., Xn) represent the probability that the dependent variable Y = 1 given the independent variables X1, X2, ..., and Xn (i.e. the metrics in this study). Then, a multivariate logistic regression model assumes that Pr(Y = 1 | X1, X2, ..., Xn) is related to X1, X2, ..., Xn by the following equation:

  Pr(Y = 1 | X_1, X_2, \ldots, X_n) = \frac{e^{\alpha + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n}}{1 + e^{\alpha + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n}},

where α and the βi are the regression coefficients, which can be estimated through the maximization of a log-likelihood.

The odds ratio is the most commonly used measure to quantify the magnitude of the correlation between the independent and dependent variables in a logistic regression model. For a given independent variable Xi, the odds that Y = 1 at Xi = x is defined as the ratio of the probability that Y = 1 to the probability that Y = 0 at Xi = x, i.e.

  Odds(Y = 1 | X_i = x) = \frac{Pr(Y = 1 | \ldots, X_i = x, \ldots)}{1 - Pr(Y = 1 | \ldots, X_i = x, \ldots)}.

In this study, similar to [33], we use ΔOR, the odds ratio associated with a one-standard-deviation increase, to provide an intuitive insight into the impact of the independent variable Xi:

  \Delta OR(X_i) = \frac{Odds(Y = 1 | X_i = x + \sigma_i)}{Odds(Y = 1 | X_i = x)} = e^{\beta_i \sigma_i},

where βi and σi are, respectively, the regression coefficient and the standard deviation of the variable Xi. ΔOR(Xi) can be used to compare the relative magnitude of the effects of different independent variables, as the same unit is used [42]. ΔOR(Xi) > 1 indicates that the independent variable is positively associated with the dependent variable, ΔOR(Xi) = 1 indicates that there is no such correlation, and ΔOR(Xi) < 1 indicates a negative correlation. The univariate logistic regression model is a special case of the multivariate logistic regression model in which there is only one independent variable.
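For concreteness, the sketch below shows how a ΔOR value could be obtained with off-the-shelf tooling. It is a minimal illustration, not the study's actual pipeline: the dataframe `df`, the metric column `Tightness`, and the label column `fault` are hypothetical names, and the data are random stand-ins.

```python
# Minimal sketch: univariate logistic regression and the ΔOR of one metric.
# 'df', 'Tightness', and 'fault' are hypothetical names used for illustration.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Tightness": rng.random(200),                 # stand-in metric values
    "fault": rng.integers(0, 2, size=200),        # 1 = has a post-release fault
})

X = sm.add_constant(df[["Tightness"]])            # adds the intercept term alpha
model = sm.Logit(df["fault"], X).fit(disp=0)      # maximum-likelihood estimation

beta = model.params["Tightness"]                  # regression coefficient beta_i
sigma = df["Tightness"].std()                     # standard deviation sigma_i
delta_or = np.exp(beta * sigma)                   # odds ratio per one-sigma increase

print(f"ΔOR(Tightness) = {delta_or:.3f}")         # >1 would indicate a positive association
```

The multivariate case works the same way, with one ΔOR computed per coefficient of the fitted model.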
3.4 Data Analysis Method

In the following, we describe the data analysis methods used to test the four null research hypotheses.

3.4.1 Principal Component Analysis for RQ1

In order to answer RQ1, we use principal component analysis (PCA) to determine whether slice-based cohesion metrics capture different underlying dimensions of software quality than the most commonly used code and process metrics. PCA is a powerful statistical technique used to identify the underlying, orthogonal dimensions that explain the relations among the independent variables in a data set. These dimensions are called principal components (PCs), which are linear combinations of the standardized independent variables. In our study, for each data set, we use the following method to determine the corresponding number of PCs. First, the stopping criterion for PCA is that all the eigenvalues for each new component are greater than zero. Second, we apply the varimax rotation to the PCs to make the mapping of the independent variables to components clearer, so that each variable has either a very low or a very high loading on a component. This helps identify the variables that are strongly correlated and indeed measure the same property, even though they may purport to capture different properties. Third, after obtaining the rotated component matrix, we map each independent variable to the component on which it has the maximum loading. Fourth, we only retain the components to which at least one independent variable is mapped. In our context, the null hypothesis H10 corresponding to RQ1 will be rejected if the result of PCA shows that slice-based cohesion metrics define new PCs of their own compared with the most commonly used code and process metrics.
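As a concrete illustration of this procedure, the sketch below runs PCA with a varimax rotation and maps each metric to the component on which it loads most heavily. It is a simplified sketch, not the study's actual analysis: `metrics_df` is a hypothetical dataframe (one row per function, one column per metric), the varimax routine is the standard Kaiser algorithm written out by hand, and the loading convention (eigenvector scaled by the square root of its eigenvalue) is our assumption.

```python
# Simplified sketch of the PCA procedure in Section 3.4.1 (assumptions noted above).
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Kaiser's varimax rotation of a (variables x components) loading matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated ** 3
                          - (gamma / p) * rotated @ np.diag(np.sum(rotated ** 2, axis=0)))
        )
        rotation = u @ vt
        d_new = np.sum(s)
        if d_new < d * (1 + tol):
            break
        d = d_new
    return loadings @ rotation

def pc_assignment(metrics_df: pd.DataFrame) -> pd.Series:
    X = StandardScaler().fit_transform(metrics_df)      # PCA on standardized metrics
    pca = PCA().fit(X)
    keep = pca.explained_variance_ > 0                  # stopping criterion: eigenvalue > 0
    loadings = pca.components_[keep].T * np.sqrt(pca.explained_variance_[keep])
    rotated = varimax(loadings)                         # varimax-rotated component matrix
    # Map each metric to the component with the maximum absolute loading; H10 would be
    # rejected if the slice-based metrics end up mapped to components of their own.
    return pd.Series(np.abs(rotated).argmax(axis=1), index=metrics_df.columns, name="PC")
```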
3.4.2 Univariate Logistic Regression Analysis for RQ2

In order to answer RQ2, we use univariate logistic regression to examine whether each slice-based cohesion metric is negatively related to post-release fault-proneness at the significance level α of 0.10. From a scientific perspective, it is often suggested to work at the α level of 0.05 or 0.01. However, the choice of a particular level of significance is ultimately a subjective decision, and other levels such as α = 0.10 are also common [51]. In this paper, the minimum significance level for rejecting a null hypothesis is set at α = 0.10, as we are aggressively interested in revealing undiscovered correlations between metrics and fault-proneness. When performing the univariate analysis, we employ Cook's distance to identify influential observations. For an observation, its Cook's distance is a measure of how far apart the regression coefficients are with and without this observation included. If an observation has a Cook's distance equal to or larger than 1, it is regarded as an influential observation and is hence excluded from the analysis [32]. Furthermore, for each metric, we use ΔOR, the odds ratio associated with a one-standard-deviation increase in the metric, to quantify its effect on fault-proneness [33]. This allows us to compare the relative magnitude of the effects of individual metrics on post-release fault-proneness. Note that previous studies reported that module size (i.e. function size in this study) might have a potential confounding effect on the relationships between software metrics and fault-proneness [43], [53]. In other words, module size may falsely obscure or accentuate the true correlations between software metrics and fault-proneness. Therefore, there is a need to remove the potentially confounding effect of module size in order to understand the essence of what a metric measures [53].

In this study, we first apply the linear regression method proposed by Zhou et al. [53] to remove the potentially confounding effect of function size. After that, we use univariate logistic regression to examine the correlations between the cleaned metrics and fault-proneness. For each metric, the null hypothesis H20 corresponding to RQ2 will be rejected if the result of univariate logistic regression is statistically significant at the significance level of 0.10.

3.4.3 Multivariate Logistic Regression Analysis for RQ3 and RQ4

In order to answer RQ3 and RQ4, we perform a stepwise variable selection procedure to build three types of multivariate logistic regression models: (1) the "B" model (using only the most commonly used code and process metrics); (2) the "S" model (using only slice-based cohesion metrics); and (3) the "B+S" model (using all the metrics). As suggested by Zhou et al. [53], before building the multivariate logistic regression models, we remove the confounding effect of function size (measured by SLOC). In addition, many metrics used in this study are defined similarly to each other. For example, CyclomaticModified and CyclomaticStrict are revised versions of cyclomatic complexity. Such highly correlated predictors may lead to high multicollinearity and hence inaccurate coefficient estimates in a logistic regression model [61]. The variance inflation factor (VIF) is a widely used indicator of multicollinearity. In this study, we use the recommended cut-off value of 10 to deal with multicollinearity in a regression model [59]. If an independent variable has a VIF value larger than 10, it will be removed from the multivariate regression model. More specifically, we use the following algorithm, BUILD-MODEL, to build the multivariate logistic regression models. As can be seen, when building a multivariate model, our algorithm takes into account: (1) the confounding effect of function size; (2) the multicollinearity among the independent variables; and (3) the influential observations.

Algorithm 1. BUILD-MODEL
Input: data set D (X: set of independent variables, Y: dependent variable)
Steps:
1: Remove the confounding effect of function size from each independent variable in X for D [53].
2: Use the backward stepwise variable selection method to build the logistic regression model M on D.
3: Calculate the variance inflation factors (VIFs) for all independent variables in the model M.
4: If all the VIFs are less than or equal to 10, go to step 6; otherwise, go to step 5.
5: Remove the variable xi with the largest VIF from X, and go to step 2.
6: Calculate the Cook's distance for all the observations in D. If the maximum Cook's distance is less than or equal to 1, go to step 8; otherwise, go to step 7.
7: Update D by removing the observations whose Cook's distances are equal to or larger than 1. Go to step 2.
8: Return the model M.
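A compact sketch of how BUILD-MODEL could be realized is given below. It is an interpretation under stated assumptions rather than the authors' implementation: the size-confound removal is approximated by regressing each metric on function size and keeping the residuals (one possible reading of the cleaning step in [53]), the backward stepwise selection is p-value driven, Cook's distances are taken from statsmodels' influence diagnostics for logit fits (an assumption about the installed version), and the names `data`, `fault`, and `sloc` are hypothetical.

```python
# Hedged sketch of Algorithm 1 (BUILD-MODEL); assumptions are listed above.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def remove_size_confound(data, metrics, size_col="sloc"):
    """Step 1 (assumed): replace each metric by its residual after regressing on size."""
    cleaned = data.copy()
    size = sm.add_constant(data[[size_col]])
    for m in metrics:
        cleaned[m] = sm.OLS(data[m], size).fit().resid
    return cleaned

def backward_stepwise(data, metrics, y_col="fault", alpha=0.10):
    """Step 2 (assumed): drop the least significant metric until all remaining ones are significant."""
    kept = list(metrics)
    while kept:
        model = sm.Logit(data[y_col], sm.add_constant(data[kept])).fit(disp=0)
        pvals = model.pvalues.drop("const")
        if pvals.max() <= alpha:
            return model, kept
        kept.remove(pvals.idxmax())
    return None, []

def vif(data, metrics):
    """Step 3: VIF of each metric = 1 / (1 - R^2) of that metric regressed on the others."""
    if len(metrics) < 2:
        return pd.Series(1.0, index=metrics)
    out = {}
    for m in metrics:
        others = sm.add_constant(data[[c for c in metrics if c != m]])
        r2 = sm.OLS(data[m], others).fit().rsquared
        out[m] = np.inf if r2 >= 1 else 1.0 / (1.0 - r2)
    return pd.Series(out)

def build_model(data, metrics, y_col="fault"):
    data = remove_size_confound(data, metrics)                  # step 1
    metrics = list(metrics)
    while True:
        model, kept = backward_stepwise(data, metrics, y_col)   # step 2
        if model is None:
            return None
        vifs = vif(data, kept)                                  # step 3
        if (vifs > 10).any():                                   # steps 4-5
            metrics.remove(vifs.idxmax())
            continue
        # Steps 6-7: assumes recent statsmodels exposes Cook's distance for Logit fits.
        cooks = model.get_influence().cooks_distance
        cooks = np.asarray(cooks[0] if isinstance(cooks, tuple) else cooks)
        if cooks.max() <= 1:
            return model                                        # step 8
        data = data.loc[cooks < 1]                              # step 7, then rebuild
```

Under this sketch, the B, S, and B+S models would differ only in the metric set passed to build_model.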