Hunting for Bugs in Code Coverage Tools via Randomized Differential Testing

Yibiao Yang∗, Yuming Zhou∗, Hao Sun†, Zhendong Su‡§, Zhiqiang Zuo∗, Lei Xu∗, and Baowen Xu∗
∗State Key Lab. for Novel Software Technology, Nanjing University, Nanjing, China
†Unaffiliated
‡Department of Computer Science, ETH Zurich, Switzerland
§Computer Science Department, UC Davis, USA

Abstract—Reliable code coverage tools are critically important because they are heavily used to facilitate many quality assurance activities, such as software testing, fuzzing, and debugging. However, little attention has been devoted to assessing the reliability of code coverage tools. In this study, we propose a randomized differential testing approach to hunting for bugs in the most widely used C code coverage tools. Specifically, by generating random input programs, our approach looks for inconsistencies in the code coverage reports produced by different code coverage tools, and then identifies those inconsistencies as potential code coverage bugs. To report code coverage bugs effectively, we addressed three specific challenges: (1) how to filter out duplicate test programs, as many of them trigger the same bugs in code coverage tools; (2) how to automatically reduce large test programs to much smaller ones that have the same properties; and (3) how to determine which code coverage tools have bugs. The extensive evaluations validate the effectiveness of our approach, resulting in 42 and 28 confirmed/fixed bugs for gcov and llvm-cov, respectively. This case study indicates that code coverage tools are not as reliable as might have been envisaged. It not only demonstrates the effectiveness of our approach, but also highlights the need to continue improving the reliability of code coverage tools. This work opens up a new direction in code coverage validation which calls for more attention in this area.

Index Terms—Code Coverage; Differential Testing; Coverage Tools; Bug Detection.

I. INTRODUCTION

Code coverage [1] refers to which code in a program is executed, and how many times, when running particular test cases. The code coverage information produced by code coverage tools is widely used to facilitate many quality assurance activities, such as software testing, fuzzing, and debugging [1]–[15]. For example, researchers recently introduced an EMI ("Equivalence Modulo Inputs") based compiler testing technique [7]. The equivalent programs are obtained by stochastically pruning the unexecuted code of a given program according to the code coverage information given by code coverage tools (e.g., llvm-cov, gcov). Therefore, the correctness of "equivalence" relies on the reliability of code coverage tools.

In spite of the prevalent adoption in practice and extensive testing of code coverage tools, a variety of defects still remain. Fig. 1(a) shows a buggy code coverage report produced by llvm-cov [16], a C code coverage tool of Clang [17]. Note that all the test cases have been reformatted for presentation in this study. The coverage report is an annotated version of the source code, where the first and second columns list the line number and the execution frequency, respectively. We can see that the code at line 5 is marked incorrectly as unexecuted by llvm-cov.

    1 |   | #include <stdio.h>          #include <stdio.h>
    2 |   |
    3 | 1 | int main() {                 int main() {
    4 | 1 |   int g = 0, v = 1;            int g = 0, v = 1;
    5 | 0 |   g = v || !v;                 // g = v || !v;
    6 | 1 |   printf("%d\n", g);           printf("%d\n", g);
    7 | 1 | }                            }
              (a)                         (b)

Fig. 1.
(a) Bug #33465 of llvm-cov and (b) The "equivalent" program obtained by pruning the unexecuted code (Line #4) of the program in (a).

Given the program p and its corresponding code coverage as shown in Fig. 1(a), EMI compiler testing [7] generates its "equivalent" program p′ as shown in Fig. 1(b) by removing unexecuted code (statement 5). The programs p and p′ will be compiled by a compiler under test and then executed, yielding two different outputs, i.e., 1 and 0, and thus a bug reported by the EMI approach. However, this is obviously not a real compiler bug. The incorrect code coverage report leads to the false positive in compiler testing. As code coverage tools offer fundamental information needed throughout software development, it is essential to validate the correctness of code coverage. Unfortunately, to the best of our knowledge, little attention has been devoted to assessing the reliability of code coverage tools.

This work makes the first attempt in this direction. We devise a practical randomized differential testing approach to discovering bugs in code coverage tools. Our approach first leverages programs generated by a random generator to look for inconsistencies between the code coverage reports produced by different code coverage tools, and identifies those inconsistencies as potential code coverage bugs. Second, because very many inconsistency-triggering test programs are reported and a large portion of the code within these test programs is irrelevant, reporting these inconsistency-triggering tests directly is hardly beneficial to debugging. Before reporting them, those test programs must be reduced [18]. However, it is usually very costly and thus unrealistic to reduce each and every test program [18]. We observe that many test programs trigger the same coverage bugs, so we can filter out many duplicate test programs. Note that 'duplicate test programs' in this study means multiple test programs triggering the same code coverage bug. Overall, to effectively report coverage bugs, we need to address the following key challenges:

Challenge 1: Filtering Out Test Programs. To filter out potential test programs triggering the same code coverage bugs, the most intuitive way is to calculate similarities between
programs using the whole text. However, we use Csmith [19] as the random program generator, and two Csmith-generated programs are not meaningfully comparable as they diverge in many ways [20]. In addition, calculating similarities between programs using the whole text is expensive. To tackle this challenge, only the lines of code triggering inconsistencies are used for computing similarities between programs.

Challenge 2: Reducing Test Programs. Reducing test programs for code coverage bugs is much more complex than reducing test programs for compiler bugs, as the latter only requires testing the behavior of the compiled executables or the exit code of compilers [18]. Reducing test programs for code coverage bugs, however, involves processing textual code coverage reports and identifying inconsistencies. After each iteration of reduction, we need to specify the inconsistency of interest we would like to preserve. In this study, we design a set of inconsistency types over coverage reports as the interest.

Challenge 3: Inspecting Coverage Bugs. With the reduced test program, we need to inspect which code coverage tools have bugs before reporting a bug. In practice, this is usually done manually [21]. In other words, developers manually inspect the coverage reports to determine which coverage tools are buggy. To relieve the burden of manual intervention, we summarize a number of rules that code coverage reports must follow. With those rules, we develop a tool to examine part of the inconsistent coverage reports and determine automatically which tools have bugs.

We implemented a differential testing prototype called C2V ("Code Coverage Validation") for code coverage tools. In order to evaluate the effectiveness of our approach, we have applied C2V to gcov and llvm-cov, two widely used C code coverage tools integrated in the production compilers GCC [22] and Clang [17], respectively. Our evaluation confirms that C2V is very effective in finding code coverage bugs: 46 bugs were found (42 bugs confirmed/fixed) for gcov, while 37 bugs were found (28 bugs confirmed/fixed) for llvm-cov.

Contributions. We made the following contributions:
• We introduce an effective testing approach to validating code coverage tools, and have implemented it as a practical tool C2V for testing C coverage tools. C2V mainly consists of a random program generator, a comparer to identify inconsistencies between coverage reports, a filter to remove test programs triggering the same coverage bugs, a test program reducer, and an inspector to automatically determine which coverage tools have bugs for bug reporting.
• We used C2V to uncover 46 and 37 bugs for gcov and llvm-cov, respectively, both of which are widely used and extensively tested C code coverage tools. Specifically, for gcov, 42 bugs have already been confirmed/fixed; for llvm-cov, 28 bugs have been confirmed/fixed.
• Our evaluation indicates that code coverage tools are not as reliable as might have been envisaged. It opens up a new research direction to improve the reliability of code coverage tools, which calls for more attention in this area. Besides, there is a need to examine the influence of those bugs on other techniques that depend on code coverage.

Organization. The rest of this paper is structured as follows. Section II introduces the background on code coverage. Section III describes our approach for code coverage validation. Section IV reports the experimental results in detail. Section V surveys related work.
Section VI concludes the paper and outlines directions for future work.

II. CODE COVERAGE

In this section, we introduce the preliminary knowledge, the importance, and the bug categories of code coverage.

A. Preliminaries

Code coverage is a quality assurance metric used to describe the degree to which the code of a given program is executed when a particular test suite runs [1]. It is suggested that a program with high test coverage has a lower chance of containing undetected software bugs than a program with low test coverage.

The code coverage analysis process is generally divided into three tasks: code instrumentation, data gathering, and coverage analysis. Specifically, code instrumentation inserts additional code instructions or probes to monitor whether a specific program chunk is executed at runtime. The instrumentation can be done at the source level in a separate pre-processing phase or at runtime by instrumenting byte code. Data gathering aims to collect the coverage data produced at test runtime. Finally, coverage analysis aims to analyze the collected results and to provide test strategy recommendations in order to reduce, feed, or modify the relevant test suite.

Currently, many code coverage tools are available [16], [23]–[31], which support different languages (e.g., C/C++ and Java), instrumentation levels (e.g., source code and byte code level), or coverage criteria. Coverage criteria are the rules or requirements that a test suite needs to satisfy [32]. In the literature, many coverage criteria have been proposed, including statement coverage, branch coverage, path coverage, condition coverage, decision coverage, and data-flow coverage [33]. These criteria can be used to guide the generation of a test suite or to evaluate the effectiveness of a test suite [33].

B. Importance of Code Coverage

Code coverage is widely used in, but not limited to, the following software techniques:
• Coverage-based regression testing. In the context of regression testing, test case prioritization and test suite augmentation are the two widely used techniques [2]–[4], [10], [15], [34]–[36]. The former aims to improve the ability of test cases to find bugs by scheduling test cases in a specific order [10], [15], [37]. One common practice is to achieve high code coverage as fast as possible [38]. The latter is to generate new test cases to strengthen the ability of a test suite to find bugs [6], [14], [36], [39]. In practice, new test cases are often generated to cover the source code affected by code changes.
• Coverage-based compiler testing. Recent years have seen an increasing interest in compiler testing, which aims to validate the correctness of compilers.
One of the most attractive compiler testing techniques is based on the code coverage of a program's execution to generate equivalence modulo inputs by stochastically pruning its unexecuted code [7], [40], [41]. With the equivalence modulo inputs, we can differentially test compilers.
• Coverage-based fuzzing. Fuzzing is one of the most effective testing techniques for finding vulnerabilities in software. In practice, coverage-based fuzzing has been widely used [8], [42]. Based on the code coverage information of test cases, it determines which test cases should be retained for fuzzing. In general, a test case is retained if a new basic block is covered.
• Coverage-based debugging. Debugging is a common activity in software development which aims to find the root cause of a fault. Spectrum-Based Fault Localization (SBFL) is one of the most extensively studied debugging techniques and is heavily based on code coverage [5], [43]–[47]. Under a specific test suite, SBFL leverages the code coverage and the corresponding failed/passed information to infer which code is the root cause of a fault.

As can be seen, the above-mentioned software techniques heavily depend on the code coverage information produced by code coverage tools. Therefore, it is of great importance to guarantee the correctness of the code coverage reports. Otherwise, they might inevitably produce incorrect results or lead to suboptimal decisions.

C. Categories of Code Coverage Bugs

    1 |   | #include <stdlib.h>
    2 |   |
    3 |   | int main(void)
    4 | 1 | {
    5 | 1 |   static int *p;
    6 | 1 |
    7 | 1 |   p = malloc(sizeof(int));
    8 | 1 |   if (p == NULL)
    9 | 1 |     return 0;
   10 | 1 |
   11 | 1 |   *p = 7;
   12 | 1 |   return 1;
   13 | 1 | }

Fig. 2. Spurious marking (Bug #37092 of llvm-cov)

        1:    1: int func(int i)
        -:    2: {
        1:    3:   if (i > 0)
        1:    4:     return 1;
    #####:    5:   return 0;
        -:    6: }
        -:    7:
        1:    8: int main()
        -:    9: {
        1:   10:   int i = 1, j = 2;
        2:   11:   func((i==0) || (i&&j, 1));
        1:   12:   return 0;
        -:   13: }

Fig. 3. Wrong frequency (Bug #85163 of gcov)

Code coverage bugs can be categorized into the following three general classes:
• Spurious Marking. Spurious marking denotes that a program chunk is unexecuted at runtime but is wrongly marked as executed by a code coverage tool. Figure 2 gives such an example. In the main function, the two return statements at line 9 and line 12 cannot both be executed in the same run. Thus, one of them must be wrongly marked by llvm-cov. At runtime, function main returns "1", indicating that line 9 was wrongly marked. A code coverage tool with spurious marking bugs causes non-executed code to be wrongly marked as executed. Consequently, for a program under test, some elements are actually untested and hence may have latent bugs undetected.
• Missing Marking. Missing marking denotes that a program chunk is actually executed at runtime but is wrongly marked as unexecuted by a code coverage tool. The coverage bug shown in Figure 1 belongs to this class. A code coverage tool with missing marking bugs can cause a predefined coverage goal to never be achieved, regardless of how much testing time and resource are allocated.
• Wrong Frequency. Wrong frequency denotes that a program chunk is executed m times at runtime but is wrongly marked as executed n times by a code coverage tool, where (m ≠ n) ∧ (m > 0) ∧ (n > 0). Figure 3 shows a gcov coverage report, in which the first column lists the execution frequency and the second column lists the line number. As can be seen, the code at line 11 is wrongly marked as executed twice but was actually executed only once. A code coverage tool with wrong frequency bugs might lead to suboptimal decisions in many coverage-based software engineering activities. (A minimal sketch classifying a line into these categories is given after this list.)
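To make the three definitions concrete, the following is a minimal sketch (ours, not part of C2V) that classifies a single line given its true execution count m and the count n reported by a coverage tool; the helper name and the example values are illustrative only.

```c
#include <stdio.h>

/* Hypothetical helper: classify one line's coverage result, given the
 * line's true execution count m and the count n reported by a tool.
 * Mirrors the three bug classes defined above. */
typedef enum { CORRECT, SPURIOUS_MARKING, MISSING_MARKING, WRONG_FREQUENCY } bug_class;

static bug_class classify(long m, long n)
{
    if (m == n)          return CORRECT;          /* report matches reality     */
    if (m == 0 && n > 0) return SPURIOUS_MARKING; /* unexecuted, marked as run  */
    if (m > 0 && n == 0) return MISSING_MARKING;  /* executed, marked as unrun  */
    return WRONG_FREQUENCY;                       /* m != n, both positive      */
}

int main(void)
{
    /* Line 11 of Fig. 3: executed once (m = 1) but reported twice (n = 2). */
    printf("%d\n", classify(1, 2) == WRONG_FREQUENCY);  /* prints 1 */
    return 0;
}
```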
III. METHODOLOGY

Figure 4 presents our framework for code coverage validation. In the following, we describe the key steps in this framework. In particular, we use gcov and llvm-cov as the subject code coverage tools to illustrate our approach to hunting for code coverage bugs where necessary.

A. Generating Test Programs

Our approach starts with a random program generator. The Generator randomly generates a large program set P. Each program p ∈ P will be used as a test program for differentially testing the two code coverage tools under examination. In other words, each program will be fed to the two code coverage tools to obtain two raw coverage reports.

In our study, test programs refer to a collection of compilable C programs with the corresponding inputs. We choose Csmith to generate the test programs for the following reasons: (1) it supports many C language features and can avoid generating programs with undefined behaviors [48], thus outperforming some random C program generators [49], [50]; (2) each generated program is a single, self-contained file that does not require extra inputs; and (3) it is fast enough to generate programs with tens of thousands of lines within a few seconds.
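As an illustration of this step, the sketch below shows how one generated program (here called test.c) might be fed to the two tool chains to obtain the raw coverage reports compared later. The command lines are the standard gcov and llvm-cov workflows; the file names and output redirections are illustrative rather than C2V's actual ones, and the shell invocations are wrapped in a tiny C driver only to keep the sketch in the paper's language.

```c
#include <stdlib.h>

/* Minimal sketch: obtain raw report r1 (gcov) and r2 (llvm-cov) for test.c. */
int main(void)
{
    /* gcov: instrument with GCC, run the program, then emit test.c.gcov (r1). */
    system("gcc --coverage -o test_gcc test.c && ./test_gcc > /dev/null");
    system("gcov test.c > /dev/null");

    /* llvm-cov: instrument with Clang, run, index the profile, render the report (r2). */
    system("clang -fprofile-instr-generate -fcoverage-mapping -o test_clang test.c");
    system("LLVM_PROFILE_FILE=test.profraw ./test_clang > /dev/null");
    system("llvm-profdata merge -o test.profdata test.profraw");
    system("llvm-cov show ./test_clang -instr-profile=test.profdata test.c > llvm_report.txt");
    return 0;
}
```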
Fig. 4. The framework for code coverage validation (flowchart: the Generator feeds each program p ∈ P to coverage tools t1 and t2; the Parser, Comparer, Filter, Reducer, and Inspector process the resulting reports; IPS: Inconsistency-triggering Program Set; R-IPS: Reduced IPS).

B. Parsing Code Coverage Reports

For each test program p in the generated program set P, we obtain the raw code coverage reports r1 and r2 emitted by the coverage tools t1 and t2, respectively. However, the raw coverage reports cannot be directly compared since they are presented in different formats. We thus develop a parser to transform ri into uri in a unified format (1 ≤ i ≤ 2). Specifically, uri is a sequence of two-tuples, each consisting of a monitored program chunk and its execution frequency. Given a program p with N lines of source code, uri is a sequence of N two-tuples (nj, fj) in ascending order of nj, 1 ≤ j ≤ N. Here, nj is the line number and fj is the corresponding execution frequency. If line nj is not an instrumentation site for a particular coverage tool, fj is assigned −1.

In our study, gcov and llvm-cov are selected as the subject code coverage tools, which are integrated in the mature production compilers GCC and Clang, respectively. In the following, we give an example to illustrate how the parser works in practice. Figure 5(a) shows the code coverage report r1 produced by gcov for a program, and Figure 5(b) shows the code coverage report r2 produced by llvm-cov for the same program. r1 and r2 are two annotated versions of the source file. However, there are three differences between r1 and r2. First, they have different instrumentation sites. On the one hand, lines 4, 6∼8, 12, 19, and 22 are treated as instrumentation sites by llvm-cov but not by gcov. On the other hand, lines 3 and 18 are treated as instrumentation sites by gcov but not by llvm-cov. Hence, only the common instrumentation sites used by gcov and llvm-cov need to be compared later. Second, gcov and llvm-cov have different annotation formats for execution frequency. A non-instrumentation site is noted as a hyphen "-" in r1 but as a null string in r2 (e.g., line 3). Besides, an unexecuted line is noted as hashes "#####" in r1 (e.g., line 15), while it is noted as "0" in r2 (e.g., line 9). Third, coverage statistics are included in r1 but are not available in r2. The unified coverage reports ur1 and ur2 produced by our parser for gcov and llvm-cov are shown alongside the raw reports in Figure 5. As can be seen, there are 9 common instrumentation sites between them: lines 5, 9, 10, 11, 13, 14, 15, 20, and 21.
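The following is a minimal sketch (ours, not C2V's actual parser) of the unified-report representation and of how one annotated line of a gcov report could be turned into a (line number, frequency) tuple; the llvm-cov side would differ only in the annotation format, and the struct and function names are illustrative.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One (line number, execution frequency) tuple of a unified report;
 * freq == -1 marks a non-instrumentation site. */
typedef struct { int line; long freq; } site;

/* Parse one annotated line of a .gcov report: "    count:  lineno: source". */
static int parse_gcov_line(const char *text, site *out)
{
    char count[32];
    int lineno;
    if (sscanf(text, " %31[^:]: %d:", count, &lineno) != 2)
        return 0;                 /* not an annotated source line */
    out->line = lineno;
    if (strchr(count, '-'))
        out->freq = -1;           /* "-"     : not an instrumentation site   */
    else if (strchr(count, '#'))
        out->freq = 0;            /* "#####" : instrumented but never run    */
    else
        out->freq = atol(count);  /* plain execution count                   */
    return 1;
}

int main(void)
{
    site s;
    if (parse_gcov_line("    #####:   15:    return 0;", &s))
        printf("(%d, %ld)\n", s.line, s.freq);   /* prints (15, 0) */
    return 0;
}
```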
C. Comparing Unified Coverage Reports

After obtaining the unified coverage reports ur1 and ur2, we use a tool, Comparer, to determine whether they are consistent. If not, the corresponding p and the associated coverage reports will be added to a set named IPS (Inconsistency-triggering test Program Set). When comparing the unified coverage reports, Comparer only takes into account the execution frequencies of the instrumentation sites common to the n coverage tools. During the comparison of ur1 and ur2, the following types of cases might occur:
• Type A: a line of code is marked as executed by t1 but as unexecuted by t2;
• Type B: a line of code is marked as unexecuted by t1 but as executed by t2;
• Type C: a line of code is marked as executed k times by t1 and as executed l times by t2, where k ≠ l.

Consequently, the inconsistent unified reports can be divided into seven categories (C001: Type C; C010: Type B; C100: Type A; C011: Type B + C; C101: Type A + C; C110: Type A + B; C111: Type A + B + C). If the unified coverage reports corresponding to the test program p are found to have an inconsistency category ic, the program will be handled by the filter.

Considering the two unified coverage reports shown in Fig. 5, Comparer compares the execution frequencies of the 9 common instrumentation sites to determine whether ur1 and ur2 are consistent. As can be seen, there are four inconsistent execution frequencies in the unified coverage reports between gcov and llvm-cov: Type C at line 5, Type C at line 13, Type A at line 9, and Type B at line 15. Consequently, the inconsistency category of ur1 and ur2 is C111. In other words, the inconsistency introduced by the test program p belongs to the C111 category.
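A minimal sketch of this comparison step (ours, not C2V's implementation) is shown below: it walks the common instrumentation sites of two unified reports and accumulates category bits for Types A, B, and C; the bit encoding, array layout, and example values are illustrative only.

```c
#include <stdio.h>

/* Compare two unified reports given as per-line frequencies (index = line
 * number, -1 = not an instrumentation site for that tool). Bit 2 = Type A,
 * bit 1 = Type B, bit 0 = Type C, so a result of 7 corresponds to C111 and
 * 0 means the reports are consistent. */
static int compare_reports(const long *freq1, const long *freq2, int nlines)
{
    int category = 0;
    for (int i = 1; i <= nlines; ++i) {
        long k = freq1[i], l = freq2[i];
        if (k < 0 || l < 0)
            continue;                                   /* not a common site        */
        if (k > 0 && l == 0)           category |= 4;   /* Type A: run by t1 only   */
        else if (k == 0 && l > 0)      category |= 2;   /* Type B: run by t2 only   */
        else if (k > 0 && l > 0 && k != l) category |= 1; /* Type C: counts differ  */
    }
    return category;
}

int main(void)
{
    /* Frequencies of the four inconsistent sites of Fig. 5 (lines 5, 13, 9, 15),
     * packed into positions 1..4 for brevity; index 0 is unused. */
    long ur1[] = { 0, 6, 4, 1, 0 };   /* gcov     */
    long ur2[] = { 0, 3, 2, 0, 2 };   /* llvm-cov */
    printf("category bits: %d\n", compare_reports(ur1, ur2, 4));  /* prints 7 (C111) */
    return 0;
}
```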
Fig. 5. An illustrating example. (a) Raw coverage report by gcov 7.3.0 and unified report ur1; (b) raw coverage report by llvm-cov 6.0.0 and unified report ur2.

D. Filtering out Test Programs

Intuitively, a test program set P with a larger size has a higher chance of uncovering code coverage bugs. Therefore, in practice, we prefer to generate a very large number of test programs, which in turn results in a large number of inconsistency-triggering test programs. This leads to the following two problems. On the one hand, it may be unrealistic to inspect all the inconsistency-triggering test programs, as the reduction is very costly [18] and inspection resources are often limited. On the other hand, we observe that many test programs trigger the same code coverage bugs. Therefore, we can filter out duplicate test programs.

In Fig. 4, we use Filter to determine whether an inconsistency-triggering test program is "duplicate", i.e., triggers the same code coverage bug as other test programs. To filter out potential "duplicate" test programs, the most intuitive way is to calculate similarities between programs using the whole text. However, we use Csmith [19] as the random program generator, and two Csmith-generated programs are not meaningfully comparable as they diverge in many ways [20]. In addition, calculating similarities between programs using the whole text is expensive. To tackle this challenge, only the lines of code triggering inconsistencies are used for computing similarities between programs. More specifically, we first transform the inconsistency-triggering lines of code in each program into a token-based string that is insensitive to variable names, and then use a token-based similarity algorithm to calculate similarities between programs. We compute program similarity insensitively to variable names because the variable names in Csmith-generated programs have no specific meaning. For example, assume "int i = 0; i = 1;" are the inconsistency-triggering lines of code in program p. We first transform these two statements into "int identifier equal numeric_constant semi identifier equal numeric_constant semi" with the help of the clang compiler. After that, we calculate its similarities with other programs. In this study, we use the Jaccard index as it is the most widely used token-based similarity measure [51]. If all the similarity coefficients between p and the programs already in IPS are less than 0.8, a tuple (p, ic, <ur1, ur2>) will be added to the set IPS. Otherwise, p will be filtered out.
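The following is a minimal sketch of the similarity step (ours, not C2V's implementation), computing the Jaccard index over the token strings of two programs' inconsistency-triggering lines; the tokenisation itself is delegated to clang in our approach and is omitted here, tokens are treated as sets for simplicity, and all names and values are illustrative.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if token tok occurs in set (an array of n token strings). */
static int contains(const char **set, int n, const char *tok)
{
    for (int i = 0; i < n; ++i)
        if (strcmp(set[i], tok) == 0) return 1;
    return 0;
}

/* Jaccard index |A ∩ B| / |A ∪ B| over two token sets (duplicates ignored). */
static double jaccard(const char **a, int na, const char **b, int nb)
{
    int inter = 0, uni = nb;                 /* start the union with all of b */
    for (int i = 0; i < na; ++i) {
        if (contains(b, nb, a[i])) ++inter;  /* token present in both sets    */
        else ++uni;                          /* token only in a               */
    }
    return uni ? (double)inter / uni : 1.0;
}

int main(void)
{
    /* Token strings of two inconsistency-triggering snippets with identifiers
     * and constants abstracted away, as described above. */
    const char *p1[] = { "int", "identifier", "equal", "numeric_constant", "semi" };
    const char *p2[] = { "int", "identifier", "equal", "numeric_constant", "semi" };
    printf("%.2f\n", jaccard(p1, 5, p2, 5));  /* 1.00 >= 0.8: treat as duplicate */
    return 0;
}
```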
E. Reducing Test Programs

An inconsistency-triggering test program only indicates that at least one of the two code coverage tools is buggy; it does not tell which one. Therefore, we need to inspect the inconsistency-triggering test program to obtain the correct coverage report (as the coverage oracle). After that, for each code coverage tool, we can easily determine whether it is buggy by comparing its coverage report with the coverage oracle. Since a test program may have a large code size, it is time-consuming and error-prone to determine the corresponding coverage oracle. Furthermore, if a large test program is submitted as a bug report, it is also difficult for developers to locate the bugs in the code coverage tool(s). To tackle this problem, we use Reducer to reduce each inconsistency-triggering test program in IPS by removing the code irrelevant to the inconsistency.

For each inconsistency-triggering test program p in IPS, Reducer takes the following steps to obtain the reduced test program. At step 1, the tool C-Reduce [18] is applied to p to obtain a reduced version p′. If p′ has a smaller size than p, go to step 2; otherwise, stop the reduction. At step 2, feed p′ to the n coverage tools, collect the coverage reports, and determine the inconsistency category. If the inconsistency category triggered by p′ is the same as the inconsistency category triggered by p, assign p′ to p. The above process is repeated until p cannot be further reduced. Consequently, we reduce IPS to obtain R-IPS, the set of reduced test programs.
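A reduction loop like this is typically driven through C-Reduce's interestingness test, a small program that succeeds only when the candidate still triggers the property of interest. The sketch below is our illustration of such a test, not C2V's actual script: it assumes a hypothetical helper ./c2v_category that reruns both tool chains on test.c and prints the inconsistency category (e.g., "C111"), and it would be invoked roughly as: TARGET_CATEGORY=C111 creduce ./interesting test.c.

```c
#include <stdio.h>
#include <stdlib.h>

/* Interestingness test for C-Reduce: exit 0 only if the current (partially
 * reduced) test.c still triggers the same inconsistency category as the
 * original program. "./c2v_category" is a hypothetical helper, not a real
 * C2V command. */
int main(void)
{
    /* TARGET_CATEGORY holds the category observed on the unreduced program. */
    const char *target = getenv("TARGET_CATEGORY");
    if (!target) return 1;

    /* Recompute the category for the candidate; fail if it no longer compiles/runs. */
    if (system("./c2v_category test.c > category.txt") != 0)
        return 1;

    /* Keep the candidate only if the same inconsistency category is reproduced. */
    char cmd[256];
    snprintf(cmd, sizeof cmd, "grep -qx %s category.txt", target);
    return system(cmd) == 0 ? 0 : 1;
}
```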