Automatic Self-Validation for Code Coverage Profilers

Yibiao Yang∗†, Yanyan Jiang∗, Zhiqiang Zuo∗, Yang Wang∗, Hao Sun∗, Hongmin Lu∗, Yuming Zhou∗, and Baowen Xu∗
∗State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
†School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, China
{yangyibiao, jyy, zqzuo}@nju.edu.cn, dz1933028@smail.nju.edu.cn, shqking@gmail.com, {hmlu, zhouyuming, bwxu}@nju.edu.cn

Abstract—Code coverage, as the primitive form of dynamic program behavior information, is widely adopted to facilitate a rich spectrum of software engineering tasks, such as testing, fuzzing, debugging, fault detection, reverse engineering, and program understanding. Given these widespread applications, it is crucial to ensure the reliability of code coverage profilers. Unfortunately, due to the lack of research attention and the existence of the test oracle problem, coverage profilers are far from being sufficiently tested. Bugs are still regularly seen in widely deployed profilers such as gcov and llvm-cov, shipped with gcc and llvm, respectively.

This paper proposes Cod, an automated self-validator for effectively uncovering bugs in coverage profilers. Starting from a test program (either from a compiler's test suite or generated randomly), Cod detects profiler bugs with zero false positives using a metamorphic relation that bridges the coverage statistics of that program and a mutated variant. We evaluated Cod on two of the most well-known code coverage profilers, namely gcov and llvm-cov. Within a four-month testing period, a total of 196 potential bugs (123 for gcov, 73 for llvm-cov) were found, among which 23 were confirmed by the developers.

Index Terms—Code coverage, Metamorphic testing, Coverage profilers, Bug detection.

I. INTRODUCTION

Profiling the code coverage data [1] (e.g., executed branches, paths, functions, etc.) of instrumented subject programs is the cornerstone of a rich spectrum of software engineering practices, such as testing [2], fuzzing [3], debugging [4]–[6], specification mining [7], [8], fault detection [9], reverse engineering, and program understanding [10]. Incorrect coverage information would severely mislead developers in their software engineering practices.

Unfortunately, coverage profilers themselves (e.g., gcov and llvm-cov) are prone to errors. Even a simple randomized differential testing technique exposed more than 70 bugs in coverage profilers [11]. The reasons are two-fold. Firstly, neither application-end developers nor academic researchers have paid sufficient attention to testing code coverage profilers. Secondly, automatic testing of coverage profilers is still challenging due to the lack of test oracles. In code coverage testing, the oracle is supposed to constitute rich execution information, e.g., the execution frequency of each code statement in the program under a given test case. Different from a functional oracle, which usually can be obtained from the given specification, obtaining complete code coverage oracles turns out to be extremely challenging. Even though programming experts could specify the oracle precisely, doing so requires enormous human intervention, making it impractical.

A simple differential testing approach, C2V, tried to uncover coverage bugs by comparing the coverage profiling results of the same input program over two different profiler implementations (e.g., gcov and llvm-cov) [11].
For instance, if gcov and llvm-cov provide different coverage information for the same statement of the profiled program, a bug is reported. Due to the inconsistent coverage semantics defined by different profiler implementations, it is rather common that independently implemented coverage profilers exhibit different opinions on the line-based statistics (e.g., the case in Figure 1). This essentially contradicts the fundamental assumption of differential testing that distinct coverage profilers should output identical coverage statistics for the same input program.

Approach. To tackle the flaws of the existing approach, this paper presents Cod, a fully automated self-validator for coverage profilers based on the metamorphic testing formulation [12]. Instead of comparing outputs from two independent profilers, Cod takes a single profiler and a program P (either from a compiler's test suite or generated randomly) as input, and uncovers bugs by identifying inconsistencies between the coverage results of P and those of its equivalent mutated variants, whose coverage statistics are expected to be identical. The equivalent program variants are generated based on the assumption that modifying unexecuted code blocks should not affect the coverage statistics of executed blocks under the identical profiler, which should generally hold in a non-optimized setting¹. This idea originates from EMI [2], a metamorphic testing approach targeted at compiler optimization bugs.

Specifically, assuming that the compiler is correct² and given a deterministic program P under profiling (either from a compiler's test suite or generated randomly) with its input fixed, Cod obtains a reference program P′ by removing the unexecuted statements in P. P′ should strictly follow the same execution path as long as the coverage profiling data of P is correct. Therefore, Cod asserts that the coverage statistics should be exactly the same over all unchanged statements in P and P′.

¹According to the developers [13], coverage statistics are only stable under the zero optimization level.
²We assume this because mis-compilations are rare.
However, consider a line of code s in P and P′ for which the same profiler reports different coverage results, i.e., C_P(s) ≠ C_P′(s), where C_P(s) refers to the profiled runtime execution count of statement s in program P. The execution count C_P(s) is usually a nonnegative number, except for a special value −1 indicating unknown coverage information. This can happen when the coverage profiler fails to obtain the coverage information of statement s due to the information loss caused by the abstraction gap between the source code and the intermediate code produced during compilation. Given C_P(s) ≠ C_P′(s), either of the two cases applies:

1) (Strong Inconsistency) C_P(s) ≥ 0 ∧ C_P′(s) ≥ 0, meaning that the coverage profiler reports inconsistent coverage information. This is definitely a bug in the profiler, because P′ should follow exactly the same execution path as P, assuming that the coverage statistics of program P are correct.

2) (Weak Inconsistency) C_P(s) = −1 ∨ C_P′(s) = −1, indicating inaccurate statistics, because a non-instrumented line is actually executed in its equivalent counterpart. This is also for sure a bug, because non-optimized coverage statistics should faithfully reflect the program's execution path.

The self-validator Cod fully exploits the inconsistencies between path-equivalent programs with zero false positives. Cod addresses the limitations of C2V discussed in Section II-C and handles weak inconsistencies, whereas C2V has to ignore all weak inconsistencies between independent profiler implementations to avoid being flooded by false positives. It is worth noting that the technique of obtaining path-equivalent programs, first known as EMI, was originally proposed to validate the correctness of compiler optimizations [2]. We found that this idea is also powerful for validating coverage profilers. Nevertheless, Cod differs from EMI in that we propose specialized mutation strategies to acquire the program variants, and adopt a results-verification criterion tailored to testing coverage profilers. We defer the detailed comparison to Section V.

Results. We implemented Cod as a prototype and evaluated it on two popular coverage profilers, namely gcov and llvm-cov, integrated with the compilers gcc and llvm, respectively. As of the submission deadline, a total of 196 potential bugs (123 for gcov, 73 for llvm-cov) were uncovered within 4 months, among which 23 bugs have already been confirmed by the developers. Promisingly, all the detected bugs are new bugs according to the developers' feedback.

Outline. The rest of the paper is organized as follows. We introduce the necessary background and a brief motivation in Section II. Section III elaborates on the detailed approach, followed by the evaluation in Section IV. We discuss related work in Section V and conclude the paper in Section VI.

II. BACKGROUND AND MOTIVATION

A. Coverage Profilers

Code coverage profiling data (each line of code's execution count in a program execution) is the foundation of a broad spectrum of software engineering practices. Code coverage is the most widely adopted criterion for measuring testing thoroughness, and is also widely used in the automation of software engineering tasks. For example, test input generation techniques leverage code coverage to guide search iterations [3]; fault localization techniques use code coverage to isolate potentially faulty branches [4]–[6].
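As a concrete illustration of such per-line statistics, consider the small C program below, annotated with the execution counts a line-coverage profiler such as gcov would typically report under zero optimization. The counts are our expectations for illustration only; as discussed below, profilers differ on exactly which lines receive a count.

    int main(void) {                /* count 1: entered once              */
        int sum = 0;                /* count 1                            */
        for (int i = 0; i < 3; i++) /* count 4: three iterations plus the
                                       final, failing loop test           */
            sum += i;               /* count 3: loop body                 */
        if (sum > 100)              /* count 1                            */
            return 1;               /* count 0: branch never taken        */
        return 0;                   /* count 1                            */
    }

Whether the brace-only lines receive a count, and whether the for line is charged 1, 3, or 4 executions, varies between profilers and versions; this ambiguity is central to the discussion that follows.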
To obtain code coverage statistics, a code coverage profiler maintains an execution counter for each source code line and updates the counters along with the program execution. Specifically, given a program P, a coverage profiler runs P and outputs, for each line of code s ∈ P, a number C_P(s) = n, indicating that s was executed n times. A special value n = −1 indicates that the profiler provides no coverage information for this line.

Whether a profiler should report coverage statistics for a line of code s (i.e., whether C_P(s) = −1) is not well defined. Code transformations (e.g., expansion of macros, or compilation from source code to intermediate code) and optimizations may lead to C_P(s) = −1 for a line, and different profilers generally have different opinions on which lines should have C_P(s) ≥ 0. As we will see later, this is a major limitation of existing techniques for validating coverage profilers.

B. Validating Coverage Profilers

Validating the correctness of a coverage profiler is challenging because it is labor-intensive to obtain the ground truth of coverage statistics. Although a large number of test inputs is available (any program used to test a compiler also works for testing a profiler), the lack of a test oracle remains the problem.

The only known technique for uncovering coverage profiler bugs is C2V, which is based on differential testing [11]. Given a program P under profiling, C2V profiles it using two independently implemented profilers to obtain, for each statement s, the coverage statistics C_P(s) and C′_P(s). When C_P(s) ≠ C′_P(s), an inconsistency is found. When C_P(s) ≥ 0 ∧ C′_P(s) ≥ 0 (a strong inconsistency), C2V reports it as a bug candidate and uses clustering to filter out potential false positives and duplicates.

C. Limitations of Differential Testing

Though effective in uncovering coverage profiler bugs, differential testing has the following major limitations:

First, differential testing cannot be applied when there is only a single coverage profiler. This is the case for many mainstream programming languages (e.g., Python and Perl).

Second, differential testing requires heavy human effort in analyzing the reports, because it is hard to determine which profiler is faulty when the two disagree on the statistics of a line of code.

Third, differential testing misses many potential bugs on weak inconsistencies. Two profilers can have inconsistent (but both
 1:  1: int main()
-1:  2: {
-1:  3:   switch (8)
-1:  4:   {
-1:  5:     case 8:
 1:  6:       break;
-1:  7:     default:
-1:  8:       abort ();
-1:  9:       break;
-1: 10:   }
 1: 11:   return 0;
-1: 12: }

(a) C_P (gcov)

 1| -1| int main()
 2|  1| {
 3|  1|   switch (8)
 4|  1|   {
 5|  1|     case 8:
 6|  1|       break;
 7|  1|     default:
 8| ×0|       abort ();
 9|  1|       break;
10|  1|   }
11|  1|   return 0;
12|  1| }

(b) C_P (llvm-cov)

 1| -1| int main()
 2|  1| {
 3|  1|   switch (8)
 4|  1|   {
 5|  1|     case 8:
 6|  1|       break;
 7|  1|     default:
 8| ×0|       ; // abort ();
 9| ×0|       break;
10|  1|   }
11|  1|   return 0;
12|  1| }

(c) C_{P\{s8}∪{s′8}} (llvm-cov)

Fig. 1. The bug case of LLVM #41821. llvm-cov incorrectly reported that the break in Line 9 is executed. This bug cannot be detected by differential testing [11] because gcov does not provide coverage statistics for Line 9. Visual conventions of coverage statistics: for a line of code s, gcov and llvm-cov output the coverage statistics C_P(s) in the first and second column, respectively. A -1 denotes that the profiler provides no coverage information for s. A check mark or cross mark followed by a number n denotes that C_P(s) = n.

correct) interpretations of the coverage statistics. Section IV reveals that 92.9% of the inconsistencies found by differential testing are weak, i.e., C_P(s) ≠ C′_P(s) with C_P(s) = −1 ∨ C′_P(s) = −1. Our motivating example in Figures 1 (a)–(b) shows that for 10 of the 12 lines, exactly one of gcov and llvm-cov provides no coverage information. Unfortunately, C2V has to ignore all such weak inconsistencies (i.e., not report any of them as a bug), because the vast majority of them are attributable to features of the independently implemented profilers.

Finally, differential testing reports false positives even when two profilers report strongly inconsistent coverage statistics, because the two profilers may disagree over the definition of the execution count of a line, e.g., whether the initialization code of a for loop counts as one execution.

[Fig. 2. The framework of Cod: a program pruner derives the variant P′ from P; both are compiled and profiled, and a bug is reported if the outputs (O vs. O′) are unequal or the coverage reports (C vs. C′) are inconsistent.]

D. Motivation

The key observation leading to automatic self-validation of a single profiler is that changing unexecuted statements in a program should not affect the coverage statistics. Take the program in Figure 1 (a)–(b) as an example. Suppose that we comment out the function call in Line 8 of P and obtain P′ = P\{s8} ∪ {s′8} as shown in Figure 1 (c). We assert that unchanged statements should have identical coverage statistics, i.e.,

∀s ∈ P ∩ P′. C_P(s) = C_P′(s),

reasonably assuming that:
1) P is deterministic, contains no undefined behavior, and does not depend on the external environment;
2) the coverage statistics of P are correct; and
3) the executions of P and P′ are consistent with their semantics.

Since we only remove "unexecuted" statements reported by the coverage profiler, P and P′ should be semantically equivalent. Furthermore, a profiler (particularly under minimal optimization) should be self-consistent in terms of which statements have coverage statistics. Therefore, if there is an inconsistency C_P(s) ≠ C_P′(s), no matter whether it is a strong or weak inconsistency, one of the above assumptions is violated. It turns out that we should blame the profiler, because we have full control over P (and thus can easily guarantee that it is deterministic) and there is little chance that the compiler or hardware is defective.
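The strong/weak case analysis maps directly onto the −1 convention for unknown coverage. The sketch below makes it executable; the enum and function names are ours, for illustration only.

    enum inconsistency { CONSISTENT, WEAK, STRONG };

    /* Classify the counts reported for one line s: c in P, c_prime in
     * the variant P'; -1 means no coverage information for the line. */
    enum inconsistency classify(long c, long c_prime) {
        if (c == c_prime)
            return CONSISTENT;   /* counts agree: no bug signal        */
        if (c == -1 || c_prime == -1)
            return WEAK;         /* a line lacking statistics on one
                                    side was executed on the other     */
        return STRONG;           /* both sides counted, but differ     */
    }

Under the assumptions above, both WEAK and STRONG outcomes expose a profiler bug, which is why Cod, unlike C2V, does not need to discard the weak cases.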
In the motivating example, C_P(s9) ≠ C_P′(s9) revealed a previously unknown profiler bug in which llvm-cov incorrectly reported an unexecuted break as being executed once. This bug case is missed by differential testing (particularly, C2V) because all inconsistencies between C_P (gcov) and C_P (llvm-cov) are weak (Lines 1–5, 7–10, and 12). Generally, weak inconsistencies between different compiler implementations indicate different compilation strategies (and thus do not indicate a profiler bug) and should not be reported by C2V.

III. APPROACH

A. Metamorphic Testing

A test oracle is a mechanism for determining whether a test has passed or failed. Under certain circumstances, however, the oracle is not available or is too expensive to obtain.
This is known as the oracle problem [14]. For example, in compiler testing, it is not easy to verify whether the executable code generated by a compiler is functionally equivalent to the given source code. Even if the oracle is available, manually checking the oracle results is tedious and error-prone [15], [16]. As a matter of fact, the oracle problem has been "one of the most difficult tasks in software testing" [16].

Metamorphic testing (MT) was coined by T.Y. Chen in 1998 [12] and can be exploited to alleviate the oracle problem. Based on existing successful test cases (those that have not revealed any failure, such as running too long or returning abnormal values), MT generates follow-up test cases by making reference to metamorphic relations (MRs), which are necessary properties of the target function or algorithm in terms of multiple inputs and their expected outputs. Let us consider a program P implementing function F on domain D. Let t be an initial successful test case, i.e., t ∈ D and the execution output P(t) equals the expected value F(t). MT can then generate a follow-up test case t′ ∈ D based on t and a pre-defined MR (i.e., metamorphic relation). For program P, an MR is a property of its target function F. For instance, suppose F(x) = sin(x); then the property sin(π − x) = sin(x) is a typical MR with respect to F. Hence, given a successful test case, say t = 1.2, MT generates its follow-up test case t′ = π − 1.2 and then runs the program on t′. Finally, the two outputs (i.e., P(t) and P(t′)) are checked to see whether they satisfy the expected relation P(t) = P(t′). If the identity does not hold, a failure manifests.
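The sin MR above translates directly into an executable check. Below is a minimal sketch treating the C library's sin as the program under test; the tolerance and the initial test case are our own choices (link with -lm).

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        const double pi = acos(-1.0);
        double t  = 1.2;          /* initial successful test case t      */
        double tf = pi - t;       /* follow-up test case t' from the MR  */
        /* MR check: sin(pi - x) should equal sin(x) */
        if (fabs(sin(t) - sin(tf)) > 1e-9)
            printf("MR violated at t = %f\n", t);
        else
            printf("MR holds for t = %f\n", t);
        return 0;
    }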
In our work, we apply MT to the validation of code coverage profilers. Each program becomes a test case fed to the profiler. Given a program P and the code coverage profiler under test, running P with an input i produces the execution output O and a coverage report C from the coverage profiler. The coverage report C records which lines of code are executed (or unexecuted) and exactly how many times they are executed. Note that this program P is the initial test case according to the notions of MT. A follow-up test program P′ can be generated based on the following MR:

Given a program, the unexecuted code can be eliminated, since this code has no impact on the execution output or, specifically for the executed part, on the coverage report.

In other words, by taking advantage of the coverage information C, we generate P′, a functionally equivalent variant of P, by removing the unexecuted code statements of P. We run P′ on the same input i and obtain the output O′ and the coverage results C′, accordingly. A bug is then discovered if 1) the execution output O is not equal to the new output O′, or 2) there exists an inconsistency between the coverage information for executed code in C and C′. Figure 2 shows our framework for the self-validation of code coverage profilers.

B. Our Algorithm

Based on our formulation, we implemented a tool Cod for detecting bugs in C code coverage profilers. Cod consists of three main steps: (1) extracting the output and coverage information of a given test program, (2) generating equivalent variants based on the code coverage report, and (3) comparing the outputs and coverage results to uncover bugs.

Algorithm 1: Cod's process for coverage tool validation
  Data: The profiler T under test, the program P, the input i
  Result: reported bugs
   1 begin
        /* Step 1: Extract output and coverage information */
   2    P_exe ← compile(P)
   3    O ← getOutput(execute(P_exe, i))
   4    C ← T.extractCoverage(execute(P_exe, i))
        /* Step 2: Generate variants via transformation */
   5    P′ ← genVariant(P, C)
   6    P′_exe ← compile(P′)
   7    O′ ← getOutput(execute(P′_exe, i))
   8    C′ ← T.extractCoverage(execute(P′_exe, i))
        /* Step 3: Compare outputs and reports */
        // First stage
   9    if O ≠ O′ then
  10       reportBug()
        // Second stage
  11    else if inconsistent(C, C′) then
  12       reportBug()

  /* Generate a variant for program P under coverage C */
  13 Function genVariant(Program P, Coverage C)
  14    P′ ← P
  15    foreach s ∈ getStmts(P) ∧ C_P(s) = 0 do
  16       P′.delete(s)        // each removable s is deleted by a random choice (see text)
  17    if isCompilable(P′) then
  18       return P′
  19    else
  20       return genVariant(P, C)

  /* Check whether the coverage reports are inconsistent */
  21 Function inconsistent(Coverage C, Coverage C′)
  22    foreach s ∈ C ∧ s ∈ C′ do
  23       if C_P(s) ≠ C_P′(s) then
  24          return True
  25    return False

Algorithm 1 shows the main process of Cod. First, Cod compiles the program P and profiles its execution with input i to collect (1) the output O (which may correspond to a return value or an exit code) and (2) the code coverage information C (Lines 2–4). It then generates the variant P′ of program P (Line 5) and collects the respective output O′ and code coverage C′ (Lines 6–8). Finally, it compares the outputs together with the code coverage reports to validate the coverage profiler. A potential bug in T is reported if either comparison is inconsistent (Lines 9–12). We discuss each step in detail as follows.

Extracting Coverage Information. For each test program P, we first compile it with particular options to generate the executable binary P_exe. These options enable the compiler to integrate the necessary instrumentation code into the executable. While executing P_exe with input i, we obtain the output O of the program. Meanwhile, the code coverage report C can be readily extracted by the coverage profiler T. Each code coverage report contains the lines of code executed and unexecuted in the test program P under the input i. Statements marked as unexecuted will be randomly pruned to generate P's equivalent variants, as discussed shortly. Cod implements support for both gcov and llvm-cov. Taking gcov as an example, Cod extracts coverage information by compiling the program P with the flags "-O0 --coverage" under gcc. This tells the compiler to instrument additional code in the object files for generating the extra profiling information at runtime. Cod then runs the executable binary under input i to produce the coverage report for the program P.

Generating Variants via Transformation. Based on the coverage report C for the original program P, its variants are generated (Line 5). Cod produces the variants by stochastically removing unexecuted program statements from the original program P. Specifically, for each of these removable lines of code, we make a random choice. As such, we obtain a number of variants P′ that should be equivalent to the original program. The function genVariant (Lines 13–20) describes Cod's process for generating equivalent mutants via transformation; a toy sketch of its core idea follows below. Note that stochastically removing unexecuted program statements can lead to many uncompilable mutants. Only the compilable ones are returned by genVariant (Line 17).
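To make the pruning concrete, here is a toy, line-granularity sketch of genVariant's core idea: drop source lines whose reported count is exactly 0, and keep lines with positive counts or with no coverage information (-1). Unlike the real genVariant, it prunes every unexecuted line rather than a random subset and does not retry on uncompilable variants; all names are ours.

    #include <stdio.h>

    /* Print a variant of the program, skipping lines whose reported
     * execution count is exactly 0 (i.e., reported as unexecuted). */
    static void gen_variant(const char *lines[], const long counts[],
                            int n, FILE *out) {
        for (int i = 0; i < n; i++) {
            if (counts[i] == 0)
                continue;             /* prune the unexecuted line */
            fputs(lines[i], out);
        }
    }

    int main(void) {
        const char *lines[] = {
            "int main() {\n",         /* -1: no information */
            "  int x = 1;\n",         /*  1                 */
            "  if (x) {\n",           /*  1                 */
            "    x = 2;\n",           /*  1                 */
            "  } else {\n",           /* -1                 */
            "    x = 3;\n",           /*  0: pruned         */
            "  }\n",                  /* -1                 */
            "  return x;\n",          /*  1                 */
            "}\n"                     /* -1                 */
        };
        const long counts[] = { -1, 1, 1, 1, -1, 0, -1, 1, -1 };
        gen_variant(lines, counts, 9, stdout);   /* emits the variant P' */
        return 0;
    }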
Comparing the Outputs and Coverage Reports. Having the outputs and the coverage reports for the original program P and its variant P′, we detect bugs in the code coverage tool T by checking for inconsistencies. More specifically, we first compare the outputs of P and P′. If they are not identical, a potential bug in the code coverage tool is reported. Otherwise, the code coverage reports are further compared to seek inconsistencies. Note that only the code coverage of the lines of code common to programs P and P′ (i.e., those lines of code left in the variant program) is considered for comparison. If the code coverage reports are not consistent over the common lines, a potential bug is reported as well (Lines 9–12).

C. Illustrative Examples

In the following, we take three concrete bug examples we reported to illustrate how Cod works. All three bugs were newly discovered by Cod and confirmed by the GCC developers.

Bug Example Exposed by Different Outputs. Figure 3 shows a real bug example exposed via the different outputs of two "equivalent" programs in gcov [17], a C code coverage tool integrated in GCC [18]. Figures 3 (a) and (b) are the code coverage reports produced by gcov for the original program P and its equivalent program P′ (obtained by removing an unexecuted Line 8), respectively. Note that all the test programs are reformatted for presentation. As can be seen, a code coverage report is an annotated version of the source code augmented with the execution frequency of each line. The first and second columns list the execution frequency and the line number, respectively. The frequency "-1" in the first column indicates that the coverage information is unknown.

In this example, we first use gcc to compile the program P and then execute it to produce the output and the coverage report (shown in Figure 3 (a)). Note that the output in this case is 0.
According to the original code coverage report of P, Cod decides to remove the 6th statement from the original program, resulting in an equivalent program P′ shown in Figure 3 (b). Next, we compile and execute P′ to get the new output and coverage report. Here, the output turns out to be 1.

Since the outputs of these two programs are not equal, P and P′ are in fact not equivalent, meaning that we actually deleted some executed code: the code coverage tool wrongly marked some executed statements as not executed. A potential bug is identified. We reported this bug to Bugzilla, and the gcov developers quickly confirmed and fixed it.

Bug Example Exposed by Strongly Inconsistent Coverage. Figure 4 illustrates another real bug example, uncovered by strongly inconsistent code coverage reports between the program and its "equivalent" variant. Figure 4 (a) shows the coverage report for P. We can read from it that Line 10 is not executed at all (i.e., its execution count is 0). Cod prunes Line 10 to generate the equivalent program P′. After compiling and executing P′, another coverage report, shown in Figure 4 (b), is produced. As can be seen, there exists a strong inconsistency in terms of the execution frequency of Line 6, indicating a potential bug. This bug has been submitted and already confirmed by the gcov developers.

Bug Example Exposed by Weakly Inconsistent Coverage. Figure 5 presents another confirmed real bug example, found via weakly inconsistent code coverage reports between the program and its equivalent variant. In Figure 5 (a), Line 6 in P is not executed (i.e., its execution count is 0). Cod removes Line 6 to generate the equivalent program P′. Upon compiling and executing P′, another coverage report, shown in Figure 5 (b), is generated. A weak inconsistency with respect to the execution frequency of Line 5 appears, indicating a potential bug.

IV. EVALUATION

This section presents our evaluation of Cod. We evaluated Cod using the most popular practical code coverage profilers, gcov and llvm-cov, together with a set of programs used for testing compilers, and compared the results with the existing differential testing technique C2V [11].

A. Evaluation Setup

Profilers for Validation. We evaluated Cod using the latest versions of gcov and llvm-cov, the two most popular code coverage profilers for C programs, as our experimental subjects. Both profilers are:
1) popular in the software engineering community;