4.4 The mapping of a field points-to graph rooted at an object to a sequential automaton.......................... 59 4.5 Overview of MAHJONG. 61 4.6 Illustrating the null-field problem......·.············ 65 4.7 Number of abstract objects created by the allocation-site abstraction and MAHJONG.......·...···:··········· 76 4.8 Object merging in checkstyle.·.··· 76 4.9 Precision gains of M-k-type over k-type..·.··..········. 84 X
4.4 The mapping of a field points-to graph rooted at an object to a sequential automaton. . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.5 Overview of Mahjong. . . . . . . . . . . . . . . . . . . . . . . . . 61 4.6 Illustrating the null-field problem. . . . . . . . . . . . . . . . . . . . 65 4.7 Number of abstract objects created by the allocation-site abstraction and Mahjong. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.8 Object merging in checkstyle. . . . . . . . . . . . . . . . . . . . . 76 4.9 Precision gains of M-k-type over k-type. . . . . . . . . . . . . . . . . 84 x
List of Tables 2.1 Five kinds of core statements in points-to analysis.··.,····· 8 3.1 Precision and performance results for all the five analyses.The two precision metrics shown are the number of variable pairs that may be aliases generated by may-alias ("may-alias pairs")and the number of casts that cannot be statically proved to be safe by may-fail-cast ("may-fail casts").In both cases,smaller is better.One performance metric used is the analysis time for a program............41 3.2 Pre-analysis times of BEAN (secs).For a program,its pre-analysis time comes from three components:(1)a context-insensitive points- to analysis ("CI"),(2)OAG construction per Figure 3.6 (OAG), and (3)object-sensitive context computation per Figure 3.8("CTX- 43 4.1 Some equivalence classes in checkstyle....·.·....···· 77 4.2 Efficiency and precision metrics for all programs and analyses with and without MAHJONG.In all cases (except speedup),lower is bet- ter.Symbol oo is used in speedup when a baseline analysis is not scalable but MAHJONG is scalable................... 81 xi
List of Tables 2.1 Five kinds of core statements in points-to analysis. . . . . . . . . . 8 3.1 Precision and performance results for all the five analyses. The two precision metrics shown are the number of variable pairs that may be aliases generated by may-alias (“may-alias pairs”) and the number of casts that cannot be statically proved to be safe by may-fail-cast (“may-fail casts”). In both cases, smaller is better. One performance metric used is the analysis time for a program. . . . . . . . . . . . 41 3.2 Pre-analysis times of Bean (secs). For a program, its pre-analysis time comes from three components: (1) a context-insensitive pointsto analysis (“CI”), (2) OAG construction per Figure 3.6 (OAG), and (3) object-sensitive context computation per Figure 3.8 (“CTXCOMP”). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.1 Some equivalence classes in checkstyle. . . . . . . . . . . . . . . . 77 4.2 Efficiency and precision metrics for all programs and analyses with and without Mahjong. In all cases (except speedup), lower is better. Symbol ∞ is used in speedup when a baseline analysis is not scalable but Mahjong is scalable. . . . . . . . . . . . . . . . . . . 81 xi
Chapter 1 Introduction Points-to analysis is a static analysis technique that addresses a fundamental prob- lem in program analysis:at compile-time,it determines which memory locations a pointer can point to at runtime.For object-oriented languages,such as Java, points-to analysis focuses on heap locations,i.e.,it statically determines which heap objects a variable or reference can point to dynamically. The results of points-to analysis are required by a wide range of client ap- plications,including bug detection [57,56,11,90,91,103,9],security analy- sis [47,6,31,27,program understanding 26,29,62,79,compiler optimiza- tion (66,19,18,88 and program verification [79,12],as well as other program analyses,such as program slicing [44,81,43],call graph construction [54,38,3], reflection analysis [41,42,74,107]and escape analysis [15,100,97,68].Hence effective points-to analyses are highly demanded as they can benefit many client applications and other fundamental analysis techniques. Two metrics are usually considered to measure the effectiveness of points-to analysis:precision and efficiency.In terms of a client,a more precise points-to analysis enables less false bugs to be reported or more program properties to be 1
Chapter 1 Introduction Points-to analysis is a static analysis technique that addresses a fundamental problem in program analysis: at compile-time, it determines which memory locations a pointer can point to at runtime. For object-oriented languages, such as Java, points-to analysis focuses on heap locations, i.e., it statically determines which heap objects a variable or reference can point to dynamically. The results of points-to analysis are required by a wide range of client applications, including bug detection [57, 56, 11, 90, 91, 103, 9], security analysis [47, 6, 31, 27], program understanding [26, 29, 62, 79], compiler optimization [66, 19, 18, 88] and program verification [79, 12], as well as other program analyses, such as program slicing [44, 81, 43], call graph construction [54, 38, 3], reflection analysis [41, 42, 74, 107] and escape analysis [15, 100, 97, 68]. Hence effective points-to analyses are highly demanded as they can benefit many client applications and other fundamental analysis techniques. Two metrics are usually considered to measure the effectiveness of points-to analysis: precision and efficiency. In terms of a client, a more precise points-to analysis enables less false bugs to be reported or more program properties to be 1
Chapter 1.Introduction 2 proved safely.A more efficient points-to analysis enables more time-critical clients (e.g.,compiler optimizations)to be applied or less turnaround time to wait. In this thesis,we aim to improve the precision and efficiency of points-to analysis respectively,in a significant way.However,this is not trivial as discussed below. 1.1 Challenges Although many techniques for improving the precision and efficiency of points-to analysis have been proposed in the last decades [5,53,7,82,101,75,92,69,34, 48,83,84,49,93,21,4,78,87,89,85],how to make a good balance between precision and efficiency remains a long-standing problem:the techniques which achieve higher precision usually sacrifice efficiency,and vice versa,which limits the effectiveness of points-to analysis. To make a good precision and efficiency trade-off,existing points-to analy- ses mainly focus on two dimensions:modeling the control-flow and modeling the heap.For object-oriented programs,context-sensitivity is known to model control- flow with tractable and useful precision [7,39,72,75,82,79.Compared to the storeless heap abstraction (e.g.,access paths),store-based heap abstraction (e.g., allocation-site-based and allocation-type-based heap modeling)is the one predomi- nately adopted by points-to analysis [32,72,80,45,94.However,both techniques suffer from the balance problem,especially for large object-oriented programs. Context-sensitive points-to analysis separately analyzes the methods under dif- ferent calling contexts to improve precision by preventing the merging of the points- to information in different contexts.To address the problem of infinite calling con- texts due to the recursive calls and also to further improve the analysis scalability in practice,traditional approaches usually limit the length of contexts by a given
Chapter 1. Introduction 2 proved safely. A more efficient points-to analysis enables more time-critical clients (e.g., compiler optimizations) to be applied or less turnaround time to wait. In this thesis, we aim to improve the precision and efficiency of points-to analysis respectively, in a significant way. However, this is not trivial as discussed below. 1.1 Challenges Although many techniques for improving the precision and efficiency of points-to analysis have been proposed in the last decades [5, 53, 7, 82, 101, 75, 92, 69, 34, 48, 83, 84, 49, 93, 21, 4, 78, 87, 89, 85], how to make a good balance between precision and efficiency remains a long-standing problem: the techniques which achieve higher precision usually sacrifice efficiency, and vice versa, which limits the effectiveness of points-to analysis. To make a good precision and efficiency trade-off, existing points-to analyses mainly focus on two dimensions: modeling the control-flow and modeling the heap. For object-oriented programs, context-sensitivity is known to model control- flow with tractable and useful precision [7, 39, 72, 75, 82, 79]. Compared to the storeless heap abstraction (e.g., access paths), store-based heap abstraction (e.g., allocation-site-based and allocation-type-based heap modeling) is the one predominately adopted by points-to analysis [32, 72, 80, 45, 94]. However, both techniques suffer from the balance problem, especially for large object-oriented programs. Context-sensitive points-to analysis separately analyzes the methods under different calling contexts to improve precision by preventing the merging of the pointsto information in different contexts. To address the problem of infinite calling contexts due to the recursive calls and also to further improve the analysis scalability in practice, traditional approaches usually limit the length of contexts by a given
Chapter 1.Introduction 3 parameter,k [71,53,55,75].As a result,if the length of the contexts is no larger than k,the contexts can be distinguished;otherwise,the contexts ending with the same k suffixes cannot be further distinguished and the points-to information un- der these contexts would be merged.Given a larger k-limiting,the analysis can partition the context space with finer grain and achieve better precision,but it may also cause the analysis to dramatically slow down or even become unscalable. Heap abstraction [32],which partitions the infinitely-sized heap into a finite number of(abstract)objects,is particularly important to the analysis for object- oriented languages such as Java.Generally,heap abstractions for points-to anal- ysis for Java may use one abstract object per class (allocation-type heap abstrac- tion)[99],or one abstract object per allocation site (allocation-site heap abstrac- tion)[38],which can be further separated context-sensitively in an orthogonal man- ner [53,75,34].Points-to analyses with different heap abstractions may exhibit significant differences in precision and efficiency [99,75,32].Allocation-site heap abstraction,as a finer abstraction,makes points-to analysis usually notably more precise than the ones using allocation-type heap abstraction.However,allocation- site-based points-to analysis usually runs significantly slower and its performance will dramatically become worse if context-sensitivity is applied to further distin- guish different heap objects for better precision. 1.2 Contributions In this thesis,we present a new context-sensitivity approach called BEAN [95]and a new heap abstraction approach called MAHJONG [96],to improve the precision and efficiency of points-to analysis respectively.Specifically,this thesis makes the following contributions
Chapter 1. Introduction 3 parameter, k [71, 53, 55, 75]. As a result, if the length of the contexts is no larger than k, the contexts can be distinguished; otherwise, the contexts ending with the same k suffixes cannot be further distinguished and the points-to information under these contexts would be merged. Given a larger k-limiting, the analysis can partition the context space with finer grain and achieve better precision, but it may also cause the analysis to dramatically slow down or even become unscalable. Heap abstraction [32], which partitions the infinitely-sized heap into a finite number of (abstract) objects, is particularly important to the analysis for objectoriented languages such as Java. Generally, heap abstractions for points-to analysis for Java may use one abstract object per class (allocation-type heap abstraction) [99], or one abstract object per allocation site (allocation-site heap abstraction) [38], which can be further separated context-sensitively in an orthogonal manner [53, 75, 34]. Points-to analyses with different heap abstractions may exhibit significant differences in precision and efficiency [99, 75, 32]. Allocation-site heap abstraction, as a finer abstraction, makes points-to analysis usually notably more precise than the ones using allocation-type heap abstraction. However, allocationsite-based points-to analysis usually runs significantly slower and its performance will dramatically become worse if context-sensitivity is applied to further distinguish different heap objects for better precision. 1.2 Contributions In this thesis, we present a new context-sensitivity approach called Bean [95] and a new heap abstraction approach called Mahjong [96], to improve the precision and efficiency of points-to analysis respectively. Specifically, this thesis makes the following contributions