An Empirical Study on Dependence Clusters for Effort-Aware Fault-Proneness Prediction Yibiao Yang',Mark Harman',Jens Krinke',Syed Islam,David Binkley, Yuming Zhou and Baowen Xu' Department of Computer Science and Technology,Nanjing University,China Department of Computer Science,University College London,UK School of Architecture,Computing and Engineering,University of East London,UK Department of Computer Science,Loyola University Maryland,USA ABSTRACT CCS Concepts A dependence cluster is a set of mutually inter-dependent .Software and its engineering-Abstraction,model- program elements.Prior studies have found that large de- ing and modularity;Software development process man- pendence clusters are prevalent in software systems.It has agement: been suggested that dependence clusters have potentially harmful effects on software quality.However,little empirical Keywords evidence has been provided to support this claim.The study Dependence clusters,fault-proneness,fault prediction,net- presented in this paper investigates the relationship between dependence clusters and software quality at the function-level work analysis with a focus on effort-aware fault-proneness prediction.The investigation first analyzes whether or not larger dependence 1.INTRODUCTION clusters tend to be more fault-prone.Second,it investigates A dependence cluster is a set of program elements that whether the proportion of faulty functions inside dependence all directly or transitively depend upon one another 8,18. clusters is significantly different from the proportion of faulty Prior empirical studies found that large dependence clusters functions outside dependence clusters.Third,it examines are highly prevalent in software systems and further compli- whether or not functions inside dependence clusters playing cate many software activities such as software maintenance, a more important role than others are more fault-prone.Fi- testing,and comprehension 8,18.In the presence of a nally,based on two groups of functions(i.e.,functions inside (large)dependence cluster,an issue or a code change in one and outside dependence clusters),the investigation considers element likely has significant ripple effects involving the other a segmented fault-proneness prediction model.Our experi- elements of the cluster 8,18.Hence,there is a reason to mental results,based on five well-known open-source systems, believe that dependence clusters have potentially harmful show that (1)larger dependence clusters tend to be more effects on software quality.This suggests that the elements fault-prone;(2)the proportion of faulty functions inside de- inside dependence clusters have relatively lower quality when pendence clusters is significantly larger than the proportion compared to elements outside any dependence cluster.Given of faulty functions outside dependence clusters;(3)functions this observation,dependence clusters should be useful in inside dependence clusters that play more important roles fault-prediction.However,few empirical studies have inves- are more fault-prone;(4)our segmented prediction model tigated the effect of dependence clusters on fault-proneness can significantly improve the effectiveness of effort-aware prediction. fault-proneness prediction in both ranking and classification This paper presents an empirical study of the relationships scenarios.These findings help us better understand how between dependence clusters and fault-proneness.The con- dependence clusters influence software quality. cept of a dependence cluster was originally introduced by Binkley and Harman [8].They treat program statements as basic units,however,they note that dependence clusters Corresponding author:zhouyuming@nju.edu.cn can be also defined at coarser granularities,such as at the function-level 7.For a given program,the identification of function-level dependence clusters consists of two steps. The first step generates a function-level System Dependence Permission to make digital or hard copies of all or part of this work for personal or Graph for all functions of the program.In general,these classroom use is granted without fee provided that copies are not made or distributed graphs involve two types of dependencies between functions: tor commercial advanag oce and the omp ents of this work owned by others than ACM call dependency (i.e.,one function calls another function) must be honored.Abstracting with credit is permitted.To copy otherwise.or republish to post on servers or to redistribute to lists,requires prior specific permission and/or a and data dependency (e.g.,a global variable defined in one fee.Request permissions from Permissions@acm.org. function is used in another function).In the System De- ASE'/6,September 3-7,2016,Singapore,Singapore pendence Graphs used in our study,nodes denote functions ©2016ACM.978-1.4503-3845-5716/09.S15.00 and directed edges denote the dependencies between these http:/dx.doi.org/10.1145/2970276.2970353 functions.In the second step,a clustering algorithm is used 296
An Empirical Study on Dependence Clusters for Effort-Aware Fault-Proneness Prediction Yibiao Yang1 , Mark Harman2 , Jens Krinke2 , Syed Islam3 , David Binkley4 , Yuming Zhou1 ∗ , and Baowen Xu1 1 Department of Computer Science and Technology, Nanjing University, China 2 Department of Computer Science, University College London, UK 3 School of Architecture, Computing and Engineering, University of East London, UK 4 Department of Computer Science, Loyola University Maryland, USA ABSTRACT A dependence cluster is a set of mutually inter-dependent program elements. Prior studies have found that large dependence clusters are prevalent in software systems. It has been suggested that dependence clusters have potentially harmful effects on software quality. However, little empirical evidence has been provided to support this claim. The study presented in this paper investigates the relationship between dependence clusters and software quality at the function-level with a focus on effort-aware fault-proneness prediction. The investigation first analyzes whether or not larger dependence clusters tend to be more fault-prone. Second, it investigates whether the proportion of faulty functions inside dependence clusters is significantly different from the proportion of faulty functions outside dependence clusters. Third, it examines whether or not functions inside dependence clusters playing a more important role than others are more fault-prone. Finally, based on two groups of functions (i.e., functions inside and outside dependence clusters), the investigation considers a segmented fault-proneness prediction model. Our experimental results, based on five well-known open-source systems, show that (1) larger dependence clusters tend to be more fault-prone; (2) the proportion of faulty functions inside dependence clusters is significantly larger than the proportion of faulty functions outside dependence clusters; (3) functions inside dependence clusters that play more important roles are more fault-prone; (4) our segmented prediction model can significantly improve the effectiveness of effort-aware fault-proneness prediction in both ranking and classification scenarios. These findings help us better understand how dependence clusters influence software quality. ∗Corresponding author: zhouyuming@nju.edu.cn. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. ASE’16, September 03-07, 2016, Singapore, Singapore c 2016 ACM. ISBN 978-1-4503-3845-5/16/09. . . $15.00 DOI: http://dx.doi.org/10.1145/2970276.2970353 CCS Concepts •Software and its engineering → Abstraction, modeling and modularity; Software development process management; Keywords Dependence clusters, fault-proneness, fault prediction, network analysis 1. INTRODUCTION A dependence cluster is a set of program elements that all directly or transitively depend upon one another [8, 18]. Prior empirical studies found that large dependence clusters are highly prevalent in software systems and further complicate many software activities such as software maintenance, testing, and comprehension [8, 18]. In the presence of a (large) dependence cluster, an issue or a code change in one element likely has significant ripple effects involving the other elements of the cluster [8, 18]. Hence, there is a reason to believe that dependence clusters have potentially harmful effects on software quality. This suggests that the elements inside dependence clusters have relatively lower quality when compared to elements outside any dependence cluster. Given this observation, dependence clusters should be useful in fault-prediction. However, few empirical studies have investigated the effect of dependence clusters on fault-proneness prediction. This paper presents an empirical study of the relationships between dependence clusters and fault-proneness. The concept of a dependence cluster was originally introduced by Binkley and Harman [8]. They treat program statements as basic units, however, they note that dependence clusters can be also defined at coarser granularities, such as at the function-level [7]. For a given program, the identification of function-level dependence clusters consists of two steps. The first step generates a function-level System Dependence Graph for all functions of the program. In general, these graphs involve two types of dependencies between functions: call dependency (i.e., one function calls another function) and data dependency (e.g., a global variable defined in one function is used in another function). In the System Dependence Graphs used in our study, nodes denote functions and directed edges denote the dependencies between these functions. In the second step, a clustering algorithm is used Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org. ASE’16, September 3–7, 2016, Singapore, Singapore c 2016 ACM. 978-1-4503-3845-5/16/09...$15.00 http://dx.doi.org/10.1145/2970276.2970353 296
to calculate all the maximal strongly connected components of dependence clusters and then used this method to identify found in the System Dependence Graph (SDG).In function- linchpins.which effectively hold a dependence cluster togeth- level dependence clusters,functions are regarded as the basic er.Their results showed that only a few vertices and edges units and each cluster consists of at least two functions.Ac- act as linchpins.After that,Binkley et al.10 introduced a cording to Binkley et al.[7],the function-level dependence simple transformation-based analysis algorithm to identify clusters can offer an effective proxy for the more expensive the impact of global variables on the presence of dependence statement-level dependence clusters. clusters.Their results showed that over half of the studied Based on this observation.we investigate dependence clus- programs include a global variable that was responsible for ters at the function-level.Our main contributions are the the formation of a dependence cluster. following: Beszedes et al.5,6 conducted an empirical study into the 1)We investigate whether the qualities of dependence properties of SEA-based dependence clusters.Such cluster clusters are influenced by their size.Our results show that are defined at the function-level and are based on the static larger dependence clusters tend to be more fault-prone. execute after (SEA)relation.Their empirical results showed 2)We examine whether functions inside dependence clus- that SEA-based dependence clusters occur frequently in pro- ters are more fault-prone than functions outside dependence grams regardless of their domain and size.However,the clusters.The results show that the proportion of faulty func- SEA-based relation only considers call structure information. tions inside dependence clusters is significantly greater than In other words,data dependencies are not considered in their that of functions outside all dependence clusters. study.In contrast,we take the data dependency between 3)We examine whether functions playing more important functions into account in our study. roles inside dependence clusters are more fault-prone.Our Binkley et al.7 compared the following two types of empirical results show that importance metrics are positively dependence clusters:slice-based dependence clusters at the correlated with fault-proneness. statement-level and SEA-based dependence clusters at the 4)Finally,we propose a segmented prediction model for function-level.They found that the less expensive SEA-based fault-proneness prediction.More specifically,we build two dependence clusters could be used as an effective proxy for different fault-proneness prediction models respectively for the more expensive slice-based dependence clusters.Unlike functions inside and functions outside dependence clusters. the above studies,we investigate dependence clusters from The empirical results show that our segmented prediction the perspective of software quality.More specifically,we model can significantly improve the prediction performance investigate whether dependence clusters have practical value in effort-aware evaluations in effort-aware fault-proneness prediction The rest of this paper is organized as follows.In Section 2. we summarize related work.We present our research ques- 2.2 Dependence Analysis in Fault-Proneness tions in Section 3.In Section 4,we describe the experimental Prediction setup,including the subject systems and the data collection method used.In Section 5,we describe the research method Zimmermann and Nagappan [41]calculated network met- and report the detailed experimental results with respect rics based on a dependence graph and used them to predict faults.More specifically,they first generate a SDG at the to each of the research questions.Section 6 discusses our function level.Two kinds of dependencies between functions findings.In Section 7,we examine threats to validity.Final- are then taken into account:call dependencies and data ly,Section 8 concludes the paper and outlines directions for dependencies.They then lift this graph up to binary level future work. since the defects were at the binary level.They considered 2.RELATED WORK the presence of dependencies without considering the multi- plicity of dependencies.After that,they compute network This section summarizes related work on dependence clus- measures on the dependence graph and then evaluated their ters and dependence analysis in fault-proneness prediction. fault-proneness prediction performance on Windows Server 2003.Their results show that the recall of the model built 2.1 Dependence Clusters from network measures was 10%higher than the model built Binkley and Harman [8]originally introduced the concept from complexity measures. of dependence clusters based on program slicing at the s- Ma et al.[23 conducted an empirical study to examine tatement level.They proposed a "same slice size"approach the effectiveness of network measures in the context of effort- to identifying dependence clusters using the SDG.Later, aware fault-proneness prediction,taking into account the Harman et al.18 extended this initial study to include a effort required to inspect predicted faulty module.They larger set of programs.Their empirical results showed that investigated dependence graphs at the file-level and did not the "same slice size"approach was extremely accurate.In consider the multiplicity of dependencies between files.They addition,they found that large dependence clusters were found that most network measures were of practical value surprisingly commonplace and consumed from more than in the context of effort-aware evaluations.Unlike these two 10%of the program to,in some cases,80%of the whole studies,our SDGs are finer grained (i.e.,function-level vs program.Islam et al.20,21 introduced the concept of binary/file-level).In addition,we take into account the coherent dependence clusters.In a coherent dependence clus- multiplicity of dependencies between functions ter,all elements depend upon the same set of elements and Cataldo et al.13 compared the relative impact of the also affect a common set of elements.They used coherent syntactic,logical,and work dependencies on fault-proneness dependence clusters to identify logical functionality within prediction.Syntactic dependencies are code dependencies programs. (e.g.,control and data dependencies).Logical dependencies Binkley and Harman [10,9 introduced a method to mea- focus on deducing dependencies between source code files sure the effect of an SDG vertex or an edge on the formation that are changed together [16].Finally,work dependencies 297
to calculate all the maximal strongly connected components found in the System Dependence Graph (SDG). In functionlevel dependence clusters, functions are regarded as the basic units and each cluster consists of at least two functions. According to Binkley et al. [7], the function-level dependence clusters can offer an effective proxy for the more expensive statement-level dependence clusters. Based on this observation, we investigate dependence clusters at the function-level. Our main contributions are the following: 1) We investigate whether the qualities of dependence clusters are influenced by their size. Our results show that larger dependence clusters tend to be more fault-prone. 2) We examine whether functions inside dependence clusters are more fault-prone than functions outside dependence clusters. The results show that the proportion of faulty functions inside dependence clusters is significantly greater than that of functions outside all dependence clusters. 3) We examine whether functions playing more important roles inside dependence clusters are more fault-prone. Our empirical results show that importance metrics are positively correlated with fault-proneness. 4) Finally, we propose a segmented prediction model for fault-proneness prediction. More specifically, we build two different fault-proneness prediction models respectively for functions inside and functions outside dependence clusters. The empirical results show that our segmented prediction model can significantly improve the prediction performance in effort-aware evaluations. The rest of this paper is organized as follows. In Section 2, we summarize related work. We present our research questions in Section 3. In Section 4, we describe the experimental setup, including the subject systems and the data collection method used. In Section 5, we describe the research method and report the detailed experimental results with respect to each of the research questions. Section 6 discusses our findings. In Section 7, we examine threats to validity. Finally, Section 8 concludes the paper and outlines directions for future work. 2. RELATED WORK This section summarizes related work on dependence clusters and dependence analysis in fault-proneness prediction. 2.1 Dependence Clusters Binkley and Harman [8] originally introduced the concept of dependence clusters based on program slicing at the statement level. They proposed a “same slice size” approach to identifying dependence clusters using the SDG. Later, Harman et al. [18] extended this initial study to include a larger set of programs. Their empirical results showed that the “same slice size” approach was extremely accurate. In addition, they found that large dependence clusters were surprisingly commonplace and consumed from more than 10% of the program to, in some cases, 80% of the whole program. Islam et al. [20, 21] introduced the concept of coherent dependence clusters. In a coherent dependence cluster, all elements depend upon the same set of elements and also affect a common set of elements. They used coherent dependence clusters to identify logical functionality within programs. Binkley and Harman [10, 9] introduced a method to measure the effect of an SDG vertex or an edge on the formation of dependence clusters and then used this method to identify linchpins, which effectively hold a dependence cluster together. Their results showed that only a few vertices and edges act as linchpins. After that, Binkley et al. [10] introduced a simple transformation-based analysis algorithm to identify the impact of global variables on the presence of dependence clusters. Their results showed that over half of the studied programs include a global variable that was responsible for the formation of a dependence cluster. Besz´edes et al. [5, 6] conducted an empirical study into the properties of SEA-based dependence clusters. Such cluster are defined at the function-level and are based on the static execute after (SEA) relation. Their empirical results showed that SEA-based dependence clusters occur frequently in programs regardless of their domain and size. However, the SEA-based relation only considers call structure information. In other words, data dependencies are not considered in their study. In contrast, we take the data dependency between functions into account in our study. Binkley et al. [7] compared the following two types of dependence clusters: slice-based dependence clusters at the statement-level and SEA-based dependence clusters at the function-level. They found that the less expensive SEA-based dependence clusters could be used as an effective proxy for the more expensive slice-based dependence clusters. Unlike the above studies, we investigate dependence clusters from the perspective of software quality. More specifically, we investigate whether dependence clusters have practical value in effort-aware fault-proneness prediction. 2.2 Dependence Analysis in Fault-Proneness Prediction Zimmermann and Nagappan [41] calculated network metrics based on a dependence graph and used them to predict faults. More specifically, they first generate a SDG at the function level. Two kinds of dependencies between functions are then taken into account: call dependencies and data dependencies. They then lift this graph up to binary level since the defects were at the binary level. They considered the presence of dependencies without considering the multiplicity of dependencies. After that, they compute network measures on the dependence graph and then evaluated their fault-proneness prediction performance on Windows Server 2003. Their results show that the recall of the model built from network measures was 10% higher than the model built from complexity measures. Ma et al. [23] conducted an empirical study to examine the effectiveness of network measures in the context of effortaware fault-proneness prediction, taking into account the effort required to inspect predicted faulty module. They investigated dependence graphs at the file-level and did not consider the multiplicity of dependencies between files. They found that most network measures were of practical value in the context of effort-aware evaluations. Unlike these two studies, our SDGs are finer grained (i.e., function-level vs binary/file-level). In addition, we take into account the multiplicity of dependencies between functions. Cataldo et al. [13] compared the relative impact of the syntactic, logical, and work dependencies on fault-proneness prediction. Syntactic dependencies are code dependencies (e.g., control and data dependencies). Logical dependencies focus on deducing dependencies between source code files that are changed together [16]. Finally, work dependencies 297
account the human and organization of information 17,19. System dependence graph at function-level 27,29,32].Their work showed that the logical dependencies explained most of the variance in fault proneness while work SubGin Functions inside dependence clusters flow dependencies had more impact than code dependencies In our study,we only investigate syntactic dependencies,but do so at a finer granularity. Oyetoyan et al.[31]studied the impact of cyclic depen- dencies on fault-proneness prediction.They found that most defects and defective components were concentrated in cyclic dependent components.The cyclic dependent components are those in call cycles in call dependence graph at the class level.These structures can also be viewed as dependence clusters.Our study is different from their study mainly with respect to the following aspects:1)our dependence clustering is at a finer granularity (i.e.,function level)while their study is at the file/class level;2)we take into account more types of dependencies,including both call and data dependencies: 3)we study the fault-proneness prediction model in effort- aware evaluations with respect to ranking and classification scenarios;4)we propose a segmented fault-proneness pre- 02 diction model and compare our models with the traditional SubGom Functions outside dependence cluster fault-proneness prediction models. 二二二二二 d:data dependency:c:control/call dependency: 3.RESEARCH QUESTIONS dc :the i-th dependence cluster In this section we discuss our research questions and use Figure 1:An SDG with dependence clusters the example SDG shown in Figure 1 to illustrate the ques- tions.In Figure 1,the nodes (e.g.,fI and f2)are functions stand the effects of dependence clusters on software quality. and the directed edges are dependencies between functions, Little is currently known on this subject.Our study attempts depicting data dependencies (labeled"d")and function call to fill this gap. dependencies (labeled"c"),respectively.In this dependence graph,there are 15 functions and 3 dependence clusters(i.e., dci,dc2,and des).In Figure 1,dci,de2,and des are separate 4. EXPERIMENTAL SETUP clusters since they are maximal strongly connected subgraph- This section first introduces the systems studied before s.The functions are divided into two groups:functions describing the procedure used to collect the experimental inside dependence clusters and functions outside dependence data. clusters.Functions inside dependence clusters and functions outside dependence clusters form the subgraphs SubGin and 4.1 Studied Projects SubGout,respectively. Table 1 summarizes the subjects used in the study.The First,because a code change to one element of a depen- first column is the system name.We use five well-known dence cluster likely ripples to the others elements of the open-source projects as subject systems:Bash(BASH),gcc- cluster,our first research question (RQ1)investigates the core (GCC).GIMP (GIMP).glibc (GLIB).and GStreamer relationship between the size of dependence clusters and (GSTR).Bash is a command language interpreter,gcc-core fault-proneness: is the GNU compiler collection.GIMP is the GNU Image RQ1.Are larger dependence clusters more fault-prone? Manipulation Program,glibc is the GNU Project's imple- Second,functions in Figure 1 are classified into two groups: mentation of the C standard library and GStreamer is a functions inside and outside dependence clusters.Our second multimedia framework.We chose these five projects as sub- research question (RQ2)focuses on the quality of functions jects for two reasons.First,they are well-known open source in these two groups. projects with a publicly available bug-fix history.In particu- RQ2.Are functions inside dependence clusters more fault- lar,the bug-fixing releases do not add any new features to the prone than functions outside dependence clusters? corresponding systems,thus allowing us to collect accurate Third,functions inside dependence clusters form a sub- fault data at the function level.For instance,gcc distribu- dependence graph (e.g.,SubGin of Figure 1).Different tion website states "Note that starting with version 3.3.4,we functions play different roles in this sub-graph.Thus,we set provide bug releases for older release branches for those users up RQ3 for functions inside dependence clusters as follows: who require a very high degree of stability".Second,they are RQ3.Are functions playing more important roles inside non-trivial software systems belonging to several different dependence clusters more fault-prone? domains.In Table 1.the second to the seventh columns are Finally,we aim to examine the usefulness of dependence respectively the version number,the release date,the total clusters for fault-proneness prediction.Therefore,our last source lines of code in the subject release,the number of research question (RQ4)is set up as follows: functions,the number of faulty functions,and the percentage RQ4.Are dependence clusters useful in fault-proneness of faulty functions.The eighth and the ninth columns are the prediction? version number and the release date of the previous version These research questions are important to both software used for computing the process metrics in Section 5.4.The researchers and practitioners,as they help us better under- last two columns are the version number and the release 298
account the human and organization of information [17, 19, 27, 29, 32]. Their work showed that the logical dependencies explained most of the variance in fault proneness while work flow dependencies had more impact than code dependencies. In our study, we only investigate syntactic dependencies, but do so at a finer granularity. Oyetoyan et al. [31] studied the impact of cyclic dependencies on fault-proneness prediction. They found that most defects and defective components were concentrated in cyclic dependent components. The cyclic dependent components are those in call cycles in call dependence graph at the class level. These structures can also be viewed as dependence clusters. Our study is different from their study mainly with respect to the following aspects: 1) our dependence clustering is at a finer granularity (i.e., function level) while their study is at the file/class level; 2) we take into account more types of dependencies, including both call and data dependencies; 3) we study the fault-proneness prediction model in effortaware evaluations with respect to ranking and classification scenarios; 4) we propose a segmented fault-proneness prediction model and compare our models with the traditional fault-proneness prediction models. 3. RESEARCH QUESTIONS In this section we discuss our research questions and use the example SDG shown in Figure 1 to illustrate the questions. In Figure 1, the nodes (e.g., f1 and f2 ) are functions and the directed edges are dependencies between functions, depicting data dependencies (labeled “d”) and function call dependencies (labeled “c”), respectively. In this dependence graph, there are 15 functions and 3 dependence clusters (i.e., dc1, dc2, and dc3). In Figure 1, dc1, dc2, and dc3 are separate clusters since they are maximal strongly connected subgraphs. The functions are divided into two groups: functions inside dependence clusters and functions outside dependence clusters. Functions inside dependence clusters and functions outside dependence clusters form the subgraphs SubGin and SubGout, respectively. First, because a code change to one element of a dependence cluster likely ripples to the others elements of the cluster, our first research question (RQ1) investigates the relationship between the size of dependence clusters and fault-proneness: RQ1. Are larger dependence clusters more fault-prone? Second, functions in Figure 1 are classified into two groups: functions inside and outside dependence clusters. Our second research question (RQ2) focuses on the quality of functions in these two groups. RQ2. Are functions inside dependence clusters more faultprone than functions outside dependence clusters? Third, functions inside dependence clusters form a subdependence graph (e.g., SubGin of Figure 1). Different functions play different roles in this sub-graph. Thus, we set up RQ3 for functions inside dependence clusters as follows: RQ3. Are functions playing more important roles inside dependence clusters more fault-prone? Finally, we aim to examine the usefulness of dependence clusters for fault-proneness prediction. Therefore, our last research question (RQ4) is set up as follows: RQ4. Are dependence clusters useful in fault-proneness prediction? These research questions are important to both software researchers and practitioners, as they help us better underFigure 1: An SDG with dependence clusters stand the effects of dependence clusters on software quality. Little is currently known on this subject. Our study attempts to fill this gap. 4. EXPERIMENTAL SETUP This section first introduces the systems studied before describing the procedure used to collect the experimental data. 4.1 Studied Projects Table 1 summarizes the subjects used in the study. The first column is the system name. We use five well-known open-source projects as subject systems: Bash (BASH), gcccore (GCC), GIMP (GIMP), glibc (GLIB), and GStreamer (GSTR). Bash is a command language interpreter, gcc-core is the GNU compiler collection, GIMP is the GNU Image Manipulation Program, glibc is the GNU Project’s implementation of the C standard library and GStreamer is a multimedia framework. We chose these five projects as subjects for two reasons. First, they are well-known open source projects with a publicly available bug-fix history. In particular, the bug-fixing releases do not add any new features to the corresponding systems, thus allowing us to collect accurate fault data at the function level. For instance, gcc distribution website states “Note that starting with version 3.3.4, we provide bug releases for older release branches for those users who require a very high degree of stability”. Second, they are non-trivial software systems belonging to several different domains. In Table 1, the second to the seventh columns are respectively the version number, the release date, the total source lines of code in the subject release, the number of functions, the number of faulty functions, and the percentage of faulty functions. The eighth and the ninth columns are the version number and the release date of the previous version used for computing the process metrics in Section 5.4. The last two columns are the version number and the release 298
Table 1:The subject systems Subject release Previous release Fixing release System Version Release Total functions faulty faulty Version Release Version Release date SLoC functions functions date date Bash 3.2 2006-10-11 49608 1947 68 3.49% 3.1 2005-12-08 3.2.57 2014-11-07 Gcc-core 4.0.0 2005-04-21 422182 13612 430 3.16% 3.4.0 2004-04-20 4.0.4 2007-01-31 Gimp 2.8.0 2012-05-12 557436 19978 818 4.10% 2.7.0 2009-08.-15 2.8.16 2015-11-21 Glbc 2.1.1 1999-05-24 172559 5923 417 7.04% 201 1997-02-04 213 2000-02-25 Gstreamer 1.0.0 2012-09-24 75985 3946 146 3.70% 0.11.90 2011-08-02 1.0.10 2013-08-30 date of the fixing release.The subject projects are moderate Size Metric to large-scale software systems (from 49 to 557 KSLOC). ROI They have only a small number of faulty functions (from approximately 3%to 7%of all functions).Furthermore,on Clusters:del.de2.de3.. Speamman rank correlation average,the fixing release comes out approximately 3 years after the subject version is released.We believe 3 years is Fault density sufficiently long for the majority of faulty functions to be Figure 2:Overview of the analysis method for RQi identified and fixed. projects varied from 158 to 4083.Of these five projects 4.2 Data Collection procedure GCC has the largest dependence cluster(that includes 4083 We collected data from the above mentioned five projects. functions).This,to a certain extent,indicates that GCC is For each subject system,we obtained the fault data and more complex than the other systems. identified dependence clusters for further analysis using the following steps.At the first step,we determined the faulty or not faulty label for each function.As mentioned before,any 5.METHODOLOGY AND RESULTS of the bug-fixing releases did not add any new features to the In the section.we describe the research method and report corresponding system.For each of the subject systems,we the experimental results in detail with respect to each of the compared these versions with the latest bug-fixing releases research questions. (identified by the last two columns of Table 1)and determined 5.1 which functions were changed.If a function was changed. RQ1.Are larger dependence clusters more it was marked as a faulty.Otherwise,it was marked as fault-prone? not-faulty.This method has been used to determine faulty In the following,we describe the research method used functions before 42. and report the experimental result to address RQ1. Our second step,collected the dependence clusters for each system using the Understand tool and an R package 5.1.I Research method igraph2.For each subject system,we first generated an Figure 2 provides an overview of the analysis method Understand database.Then,we extracted the call and data used to address RQ1.As can be seen,in order to answer dependencies for all functions from the generated database. RQ1,we use Spearman's rank correlation to investigate the In this way we obtained the SDG of the subject system. relationship between the size of dependence clusters and the After that,we used the function cluster in igraph package fault density of dependence clusters.Here,fault density to identify all dependence clusters.Each system's functions refers to the percentage of faulty functions in the dependence are divided into two groups:functions inside and functions clusters.There are two basic metrics to measure the size outside dependence clusters. of a graph:Size and Ties.Size is the number of functions Table 2:The dependence clusters in subject systems within dependence clusters while Ties is the number of edges between functions in dependence clusters.In this study, functions Size of we first use igraph to compute these two metrics for all System functions clusters inside clusters largest cluster dependence clusters in each subject system.We choose BASH 1947 46.2 483 Spearman's rank correlation rather than Pearson's linear GCC 13612 139 34.9 4083 GIMP 19978 363 14.2 158 correlation since the former is a non-parametric method and GLIB 5923 105 11.6 277 makes no normality assumptions on variables 30.According GSTR 3946 59 15.2 170 to Ott and Longnecker [30,for correlation coefficient rho,the correlation is considered either weak (rho 0.5),moderate Table 2 describes the clusters in the subject projects.The (0.5<rhol<0.8),or strong(0.8≤rhol≤1.0) third to the fifth columns respectively show the number of clusters,the percentage of functions inside clusters,and the Table 3:Spearman correlation for dependence clus- size of the largest cluster in each subject project.From Table ters size and fault density (RQ1) 2,we can see that there exist many dependence clusters Size System clusters Ties (from 41 to 363)in these projects.Furthermore,from 11.6% rho to 46.2%of the total functions are found inside dependence BASH 41 0.230 0.148 0.315 0.045 GCC 0.299 <0.001 0.233 0.006 clusters.Additionally,the size of the largest cluster in these 139 GIMP 363 0.150 0.004 0.195 <0.001 GLIB 105 0.092 0350 0.113 0.249 https://scitools.com GSTR 59 0.345 0.007 0.295 0023 http://igraph.org/r/ 299
Table 1: The subject systems System Subject release Previous release Fixing release Version Release Total # functions # faulty % faulty Version Release Version Release date SLoC functions functions date date Bash 3.2 2006-10-11 49 608 1 947 68 3.49% 3.1 2005-12-08 3.2.57 2014-11-07 Gcc-core 4.0.0 2005-04-21 422 182 13 612 430 3.16% 3.4.0 2004-04-20 4.0.4 2007-01-31 Gimp 2.8.0 2012-05-12 557 436 19 978 818 4.10% 2.7.0 2009-08-15 2.8.16 2015-11-21 Glibc 2.1.1 1999-05-24 172 559 5 923 417 7.04% 2.0.1 1997-02-04 2.1.3 2000-02-25 Gstreamer 1.0.0 2012-09-24 75 985 3 946 146 3.70% 0.11.90 2011-08-02 1.0.10 2013-08-30 date of the fixing release. The subject projects are moderate to large-scale software systems (from 49 to 557 KSLOC). They have only a small number of faulty functions (from approximately 3% to 7% of all functions). Furthermore, on average, the fixing release comes out approximately 3 years after the subject version is released. We believe 3 years is sufficiently long for the majority of faulty functions to be identified and fixed. 4.2 Data Collection Procedure We collected data from the above mentioned five projects. For each subject system, we obtained the fault data and identified dependence clusters for further analysis using the following steps. At the first step, we determined the faulty or not faulty label for each function. As mentioned before, any of the bug-fixing releases did not add any new features to the corresponding system. For each of the subject systems, we compared these versions with the latest bug-fixing releases (identified by the last two columns of Table 1) and determined which functions were changed. If a function was changed, it was marked as a faulty. Otherwise, it was marked as not-faulty. This method has been used to determine faulty functions before [42]. Our second step, collected the dependence clusters for each system using the Understand1 tool and an R package igraph2 . For each subject system, we first generated an Understand database. Then, we extracted the call and data dependencies for all functions from the generated database. In this way we obtained the SDG of the subject system. After that, we used the function cluster in igraph package to identify all dependence clusters. Each system’s functions are divided into two groups: functions inside and functions outside dependence clusters. Table 2: The dependence clusters in subject systems % functions Size of System # functions # clusters inside clusters largest cluster BASH 1 947 41 46.2 483 GCC 13 612 139 34.9 4083 GIMP 19 978 363 14.2 158 GLIB 5 923 105 11.6 277 GSTR 3 946 59 15.2 170 Table 2 describes the clusters in the subject projects. The third to the fifth columns respectively show the number of clusters, the percentage of functions inside clusters, and the size of the largest cluster in each subject project. From Table 2, we can see that there exist many dependence clusters (from 41 to 363) in these projects. Furthermore, from 11.6% to 46.2% of the total functions are found inside dependence clusters. Additionally, the size of the largest cluster in these 1https://scitools.com 2http://igraph.org/r/ Clusters: dc1, dc2, dc3, ... Spearman rank correlation Size Metric Fault density RQ1 Figure 2: Overview of the analysis method for RQ1 projects varied from 158 to 4083. Of these five projects, GCC has the largest dependence cluster (that includes 4083 functions). This, to a certain extent, indicates that GCC is more complex than the other systems. 5. METHODOLOGY AND RESULTS In the section, we describe the research method and report the experimental results in detail with respect to each of the research questions. 5.1 RQ1. Are larger dependence clusters more fault-prone? In the following, we describe the research method used and report the experimental result to address RQ1. 5.1.1 Research method Figure 2 provides an overview of the analysis method used to address RQ1. As can be seen, in order to answer RQ1, we use Spearman’s rank correlation to investigate the relationship between the size of dependence clusters and the fault density of dependence clusters. Here, fault density refers to the percentage of faulty functions in the dependence clusters. There are two basic metrics to measure the size of a graph: Size and Ties. Size is the number of functions within dependence clusters while Ties is the number of edges between functions in dependence clusters. In this study, we first use igraph to compute these two metrics for all dependence clusters in each subject system. We choose Spearman’s rank correlation rather than Pearson’s linear correlation since the former is a non-parametric method and makes no normality assumptions on variables [30]. According to Ott and Longnecker [30], for correlation coefficient rho, the correlation is considered either weak (|rho| ≤ 0.5), moderate (0.5 < |rho| < 0.8), or strong (0.8 ≤ |rho| ≤ 1.0). Table 3: Spearman correlation for dependence clusters size and fault density (RQ1) System # clusters Size Ties rho p rho p BASH 41 0.230 0.148 0.315 0.045 GCC 139 0.299 < 0.001 0.233 0.006 GIMP 363 0.150 0.004 0.195 < 0.001 GLIB 105 0.092 0.350 0.113 0.249 GSTR 59 0.345 0.007 0.295 0.023 299
proportions of faulty functions inside and outside dependence @回®⊙! Consistency table RO2 clusters.In Table 4.the second and the third columns re- Fisher's exact test and OR spectively represent the proportion of faulty functions inside @⊙©⊙月 and outside dependence clusters.The fourth and the fifth columns respectively show the Bonferroni adjusted p-value from Fisher's exact test and OR. Figure 3:Overview of the analysis method for RQ2 Table 4:The proportion of faulty functions inside 5.1.2 Experimental result vs.outside dependence clusters(RQ2) In the following,we describe the empirical results used to functions is faulty Fisher's System answer RQ1.Table 3 summarizes the Spearman correlation inside outside exact test OR coefficients relating the size metrics with fault density of BASH 5.44% 1.82% <0.001 3.115 dependence clusters.In Table 3.the second column is the GCC 6.08% 1.59% <0.001 4.004 GIMP 4.47% 4.03% 1.000 1.115 number of dependence clusters in each subject system.The GLIB 14.549% 6.06% <0.001 2.638 third and the fifth columns respectively present the corre- GSTR 10.20a 2.54% <0.001 4.361 lation coefficients for the Size and the Ties metrics from Spearman's singed-rank correlation.The correlation coeffi- From Table 4,we can see that the proportion of faulty cients which are not statistically significant at the significance functions inside dependence clusters is larger than the pro- level of a =0.05 are marked in gray. portion of faulty functions outside dependence clusters in all In Table 3,we see that all the absolute values of the cases,and significantly larger in all but one case.All the correlation coefficients are less than 0.5.This indicates that p-values are less than 0.05 except in GIMP which indicates there is only a weak correlation between these two size metrics statistically significant at the significance level of a =0.05. (i.e.,Size and Ties)with fault density of dependence clusters. This indicates that the proportions of faulty functions be- However.all the correlation coefficients are larger than 0 and tween these two groups are significantly different.Meanwhile. most of them are statistically significant at the significance all the ORs are substantially greater than 1,two are even level of a =0.05.This indicates that these size metrics are greater than 4,which confirms the results from Fisher's exact positively correlated with fault density.In other words,larger test. dependence clusters tend to be more fault-prone.Thus,large Overall,Fisher's exact test and the ORs consistently in- dependence clusters are likely more harmful and hence should dicate that functions inside dependence clusters are more be avoided,advice that is consistent with prior studies 8, fault-prone than functions outside dependence clusters 18. 5.3 RQ3.Are functions playing more impor- 5.2 RQ2.Are functions inside dependence clus- tant roles inside dependence clusters more ters more fault-prone than functions out- fault-prone? side dependence clusters? In the following,we describe the corresponding research In the following,we describe the research method and the method and the experimental results that address RQ3. experimental result answering RQ2. 5.3.I Research method 5.2.1 Research method Figure 4 provides an overview of the analysis method for Figure 3 provides an overview of the data analysis method RQ3.Functions inside dependence clusters form an inde- for addressing RQ2.As can be seen,in order to answer pendent dependence graph (e.g.,SubGin in Figure 1).In RQ2,we use Fisher's exact test and the odds ratio (OR)to order to answer RQ3,we first use this graph to compute the examine whether the proportion of faulty functions inside importance metrics as described in Table 5 for the functions dependence clusters is statistically significantly different from inside dependence clusters in the sub-dependence graph.The the proportion of faulty functions outside dependence clusters metrics in Table 5 are widely used networks metrics [37]that Fisher's exact test is a statistical significance test used in measure the extent to which these functions contribute to the analysis of contingency tables [36].The contingency the sub-dependence graph.For example,the Betweenness table is a matrix that displays the frequency distribution of metric for a vertex measures how many shortest paths pass variables.In our study,the contingency table has four types through the vertex for all pairs of vertices of the subgraph. of functions:(1)functions inside dependence clusters that Thus,vertices with large Betweenness indicates a large im- have faults;(2)functions inside dependence clusters that portance.Note that some of these importance metrics can be have no faults;(3)functions outside dependence clusters that computed by one the following three methods:"IN","OUT", have faults;and (4)functions outside dependence clusters and "ALL".The "IN"method concerns all incoming edges. that have no faults.The OR indicates the likelihood that The "OUT"method concerns all outgoing edges.While the an event (e.g.,that a function is faulty)occurs 36.Assume "ALL"method treats the graph as an undirected graph.In p is the proportion of faulty functions inside dependence this study,we only compute the metrics using the "OUT" clusters and g is the proportion of faulty functions outside method. dependence clusters.Then,OR is defined as.Thus After that,we build univariate logistic regression models OR>1 indicates that faults are more likely to occur inside for each of these metrics with fault-proneness.Similar to dependence clusters.OR =1 indicates an equal probability. prior studies 11,38,we use AOR,the odds ratio associated with one standard deviation increase,to quantify the effect of 5.2.2 Experimental result these metrics on fault-proneness.AOR is defined as follows: Table 4 summarizes the results of the comparison of the AOR=ex.Here,B and o are respectively the regression 300
Consistency table Fisher s exact test and OR f1 f2 f3 ... RQ2 f4 f5 f6 ... Figure 3: Overview of the analysis method for RQ2 5.1.2 Experimental result In the following, we describe the empirical results used to answer RQ1. Table 3 summarizes the Spearman correlation coefficients relating the size metrics with fault density of dependence clusters. In Table 3, the second column is the number of dependence clusters in each subject system. The third and the fifth columns respectively present the correlation coefficients for the Size and the Ties metrics from Spearman’s singed-rank correlation. The correlation coeffi- cients which are not statistically significant at the significance level of α = 0.05 are marked in gray. In Table 3, we see that all the absolute values of the correlation coefficients are less than 0.5. This indicates that there is only a weak correlation between these two size metrics (i.e., Size and Ties) with fault density of dependence clusters. However, all the correlation coefficients are larger than 0 and most of them are statistically significant at the significance level of α = 0.05. This indicates that these size metrics are positively correlated with fault density. In other words, larger dependence clusters tend to be more fault-prone. Thus, large dependence clusters are likely more harmful and hence should be avoided, advice that is consistent with prior studies [8, 18]. 5.2 RQ2. Are functions inside dependence clusters more fault-prone than functions outside dependence clusters? In the following, we describe the research method and the experimental result answering RQ2. 5.2.1 Research method Figure 3 provides an overview of the data analysis method for addressing RQ2. As can be seen, in order to answer RQ2, we use Fisher’s exact test and the odds ratio (OR) to examine whether the proportion of faulty functions inside dependence clusters is statistically significantly different from the proportion of faulty functions outside dependence clusters. Fisher’s exact test is a statistical significance test used in the analysis of contingency tables [36]. The contingency table is a matrix that displays the frequency distribution of variables. In our study, the contingency table has four types of functions: (1) functions inside dependence clusters that have faults; (2) functions inside dependence clusters that have no faults; (3) functions outside dependence clusters that have faults; and (4) functions outside dependence clusters that have no faults. The OR indicates the likelihood that an event (e.g., that a function is faulty) occurs [36]. Assume p is the proportion of faulty functions inside dependence clusters and q is the proportion of faulty functions outside dependence clusters. Then, OR is defined as p(1−p) q(1−q) . Thus OR > 1 indicates that faults are more likely to occur inside dependence clusters. OR = 1 indicates an equal probability. 5.2.2 Experimental result Table 4 summarizes the results of the comparison of the proportions of faulty functions inside and outside dependence clusters. In Table 4, the second and the third columns respectively represent the proportion of faulty functions inside and outside dependence clusters. The fourth and the fifth columns respectively show the Bonferroni adjusted p-value from Fisher’s exact test and OR. Table 4: The proportion of faulty functions inside vs. outside dependence clusters (RQ2) System % functions is faulty Fisher’s OR inside outside exact test BASH 5.44% 1.82% < 0.001 3.115 GCC 6.08% 1.59% < 0.001 4.004 GIMP 4.47% 4.03% 1.000 1.115 GLIB 14.54% 6.06% < 0.001 2.638 GSTR 10.20% 2.54% < 0.001 4.361 From Table 4, we can see that the proportion of faulty functions inside dependence clusters is larger than the proportion of faulty functions outside dependence clusters in all cases, and significantly larger in all but one case. All the p-values are less than 0.05 except in GIMP which indicates statistically significant at the significance level of α = 0.05. This indicates that the proportions of faulty functions between these two groups are significantly different. Meanwhile, all the ORs are substantially greater than 1, two are even greater than 4, which confirms the results from Fisher’s exact test. Overall, Fisher’s exact test and the ORs consistently indicate that functions inside dependence clusters are more fault-prone than functions outside dependence clusters. 5.3 RQ3. Are functions playing more important roles inside dependence clusters more fault-prone? In the following, we describe the corresponding research method and the experimental results that address RQ3. 5.3.1 Research method Figure 4 provides an overview of the analysis method for RQ3. Functions inside dependence clusters form an independent dependence graph (e.g., SubGin in Figure 1). In order to answer RQ3, we first use this graph to compute the importance metrics as described in Table 5 for the functions inside dependence clusters in the sub-dependence graph. The metrics in Table 5 are widely used networks metrics [37] that measure the extent to which these functions contribute to the sub-dependence graph. For example, the Betweenness metric for a vertex measures how many shortest paths pass through the vertex for all pairs of vertices of the subgraph. Thus, vertices with large Betweenness indicates a large importance. Note that some of these importance metrics can be computed by one the following three methods: “IN”, “OUT”, and “ALL”. The “IN” method concerns all incoming edges. The “OUT” method concerns all outgoing edges. While the “ALL” method treats the graph as an undirected graph. In this study, we only compute the metrics using the “OUT” method. After that, we build univariate logistic regression models for each of these metrics with fault-proneness. Similar to prior studies [11, 38], we use ∆OR, the odds ratio associated with one standard deviation increase, to quantify the effect of these metrics on fault-proneness. ∆OR is defined as follows: ∆OR = e β×σ . Here, β and σ are respectively the regression 300