Program Optimization Space Pruning for a Multithreaded GPU

Shane Ryoo, Christopher I. Rodrigues, Sam S. Stone, Sara S. Baghsorkhi, Sain-Zee Ueng, John A. Stratton, and Wen-mei W. Hwu
Center for Reliable and High-Performance Computing
University of Illinois at Urbana-Champaign
{sryoo, cirodrig, ssstone2, bsadeghi, ueng, stratton, hwu}@crhc.uiuc.edu

ABSTRACT

Program optimization for highly-parallel systems has historically been considered an art, with experts doing much of the performance tuning by hand. With the introduction of inexpensive, single-chip, massively parallel platforms, more developers who lack the substantial experience and knowledge needed to maximize performance will be creating highly-parallel applications for these platforms. This creates a need for more structured optimization methods with means to estimate their performance effects. Furthermore, these methods need to be understandable by most programmers. This paper shows the complexity involved in optimizing applications for one such system and presents a relatively simple methodology for reducing the workload involved in the optimization process.

This work is based on one such highly-parallel system, the GeForce 8800 GTX using CUDA. Its flexible allocation of resources to threads allows it to extract performance from a range of applications with varying resource requirements, but places new demands on developers who seek to maximize an application's performance. We show how optimizations interact with the architecture in complex ways, initially prompting an inspection of the entire configuration space to find the optimal configuration. Even for a seemingly simple application such as matrix multiplication, the optimal configuration can be unexpected. We then present metrics derived from static code that capture the first-order factors of performance. We demonstrate how these metrics can be used to prune many optimization configurations, down to those that lie on a Pareto-optimal curve. This reduces the optimization space by as much as 98% and still finds the optimal configuration for each of the studied applications.

Categories and Subject Descriptors

D.1.3 [Software]: Programming Techniques—Concurrent Programming

General Terms

Performance, Languages

Keywords

GPGPU, parallel computing, optimization

© ACM, 2008. This is the author's version of the work. It is posted by permission of ACM for your personal use. Not for redistribution. The definitive version was published in CGO '08, http://doi.acm.org/10.1145/1356058.1356084. CGO'08, April 5–10, 2008, Boston, Massachusetts, USA. Copyright 2008 ACM 978-1-59593-978-4/08/04 ...$5.00.

1. INTRODUCTION

Programming for highly-parallel systems has historically been the domain of relatively few experts, with performance tuning done primarily by hand. Because of the relative scarcity of highly parallel applications and the expense of highly parallel systems, there was limited opportunity for exhaustive performance experimentation. Today, however, single-chip, massively parallel systems such as the NVIDIA GeForce 8 Series are available for about a dollar per GFLOP, several orders of magnitude less expensive than supercomputers a decade ago. Already, developers are using these desktop systems to perform work that would otherwise take a large compute cluster to accomplish. Unfortunately, the level of effort and expertise required to maximize application performance on these kinds of systems is still quite high.
The resource restrictions of these systems also present unforeseen difficulties for optimization. Finally, it is often the case that successive generations of architectures require a complete reapplication of the optimization process to achieve the maximum performance for the new system.

Optimizing an application for maximum performance on the GeForce 8 Series is not a trivial task. At first glance, it appears to be a multi-variable optimization problem of applying a set of optimization techniques like tiling and loop unrolling to the code. However, the underlying hardware and threading model contain hard usage restrictions that affect performance and make the optimization space discontinuous. Consequently, the final performance of an optimization configuration is not always readily apparent. Furthermore, the difference in performance across optimization configurations is significant. For an MRI reconstruction application with a space of 175 configurations, the difference in performance between a hand-optimized implementation and the optimal configuration was 17%, and the difference between the worst and optimal configurations was 235%. Since the intent is to run large, compute-intensive applications on these systems, a full exploration of the optimization space based on wall-clock performance is generally not feasible.

In response to these challenges, we have developed
metrics to help prune the search space of optimization configurations. Our intent is to form these metrics into the underpinnings for an automated optimizing compiler for this platform. We developed our metrics after performing full explorations of the optimization spaces for some of our applications. The metrics take into account the observed first-order effects on performance, under the assumption that memory bandwidth is not the primary limiting factor on performance. Performance prediction is less critical for bandwidth-limited kernels since they are less sensitive to other optimization effects. The search space is pruned down to only those configurations that lie on a Pareto-optimal curve generated from a plot of the metrics. In contrast to a full exploration of the optimization space, this methodology eliminates the need to test as much as 98% of the optimization search space. The optimal configuration was found to be on the curve for each of the applications studied. Consequently, much faster performance optimization is possible.

We begin by discussing the execution hardware, threading model, and available tools in Section 2. This sets up the various factors that affect performance. Section 3 discusses the most effective optimizations for this architecture and how one needs to consider them to attain the optimal configuration. Section 4 discusses the metrics we have developed from our experience, while Section 5 shows how our metrics can be used to help find the optimal optimization configuration. We discuss related work in Section 6 before finishing with our conclusion.

2. ARCHITECTURE

This work uses the GeForce 8800 GTX GPU as the basis for its study. (There are several versions of the GeForce 8800 GPU; references to the GeForce 8800 in this paper mean the GTX model.) The GeForce 8800 has a large set of processor cores that can directly address a global memory. This allows for a more general and flexible programming model than previous generations of GPUs, and allows developers to easily implement data-parallel kernels. In this section we discuss NVIDIA's Compute Unified Device Architecture (CUDA), with emphasis on the features that significantly affect performance. A more complete description can be found in [1, 20]. It should be noted that this architecture, although more fully disclosed than previous GPU architectures, still has details that have not been publicly revealed.

Before discussing the hardware, it is useful to describe the programming and compilation process. The CUDA programming model is ANSI C extended with several keywords and constructs. The GPU is treated as a coprocessor that executes data-parallel kernel functions. The user supplies a single source program encompassing both host (CPU) and kernel (GPU) code. These are separated and compiled by NVIDIA's compiler. The host code transfers data to and from the GPU's global memory via API calls. It initiates the kernel code by performing a function call.

2.1 Microarchitecture

Figure 1 depicts the microarchitecture of the GeForce 8800. The GPU consists of 16 streaming multiprocessors (SMs), each containing eight streaming processors (SPs), or processor cores, running at 1.35GHz. Each SP has one 32-bit, single-precision floating-point, multiply-add arithmetic unit that can also perform 32-bit integer arithmetic.
" !#$% & # ' ( ))) ))) &( &( * Figure 1: Organization of the GeForce 8800 tionally, each SM has two special functional units (SFUs), which execute more complex FP operations such as reciprocal square root, sine, and cosine with low latency. The arithmetic units and the SFUs are fully pipelined, yielding 388.8 GFLOPS (16 SM ∗ 18 FLOP/SM ∗ 1.35GHz) of peak theoretical performance for the GPU. The GeForce 8800 has 86.4 GB/s of bandwidth to its off- chip, global memory. Nevertheless, with computational resources supporting nearly 400 GFLOPS of performance and each FP instruction operating on up to 12 bytes of source data, applications can easily saturate that bandwidth. Therefore, as described in Table 1 and depicted in Figure 1, the GeForce 8800 has several on-chip memories that can exploit data locality and data sharing to reduce an application’s demands for off-chip memory bandwidth. For example, each SM has a 16KB shared memory that is useful for data that is either written and reused or shared among threads. For read-only data that is accessed simultaneously by many threads, the constant and texture memories provide dramatic reduction in memory latency via caching. Threads executing on the GeForce 8800 are organized into a three-level hierarchy. At the highest level, each kernel creates a single grid, which consists of many thread blocks. The maximum number of threads per block is 512. Each thread block is assigned to a single SM for the duration of its execution. Threads in the same thread block can share data through the on-chip shared memory and can perform barrier synchronization by invoking the __syncthreads primitive. Threads are otherwise independent, and synchronization across thread blocks can only be safely accomplished by terminating the kernel. Finally, threads within a block are organized into warps of 32 threads. Each warp executes in SIMD (single-instruction, multiple-data) fashion, issuing in four cycles on the eight SPs of an SM. SMs can perform zero-overhead scheduling to interleave warps on an instruction-by-instruction basis to hide the latency of global memory accesses and long-latency arithmetic operations. When one warp stalls, the SM can quickly switch to a ready warp in the same thread block or a ready warp in some other thread block assigned to the SM. The SM stalls only if there are no warps with ready operands available. 2.2 Architectural Interactions Accurately predicting the effects of one or more compiler optimizations on the performance of a CUDA kernel is often quite difficult, largely because of interactions among the architectural constraints listed in Table 2. Many optimizations that improve the performance of an individual
Table 1: Properties of GeForce 8800 Memories

Global (off-chip, 768MB total, 200-300 cycle latency, read/write): Large DRAM. All data reside here at the beginning of kernel execution. Directly addressable from a kernel using pointers. Backing store for constant and texture memories. Used more efficiently when multiple threads simultaneously access contiguous elements of memory, enabling the hardware to coalesce memory accesses to the same DRAM page.

Shared (on-chip, 16KB per SM, ~register latency, read/write): Local scratchpad that can be shared among threads in a thread block. Organized into 16 banks. It is often possible to organize both threads and data such that bank conflicts seldom or never occur.

Constant (on-chip cache, 64KB total, ~register latency, read-only): 8KB cache per SM, with data originally residing in global memory. The 64KB limit is set by the programming model. Often used for lookup tables. The cache is single-ported, so simultaneous requests within an SM must be to the same address or delays will occur.

Texture (on-chip cache, size up to global, >100 cycle latency, read-only): 16KB cache per two SMs, with data originally residing in global memory. Capitalizes on 2D locality. Can perform hardware interpolation and has configurable returned-value behavior at the edges of textures, both of which are useful in certain applications such as video encoders.

Local (off-chip, size up to global, latency same as global, read/write): Space for register spilling, etc.

Table 2: Constraints of GeForce 8800 and CUDA

Threads per SM: 768 threads
Thread blocks per SM: 8 blocks
32-bit registers per SM: 8,192 registers
Shared memory per SM: 16,384 bytes
Threads per thread block: 512 threads

2.2 Architectural Interactions

Accurately predicting the effects of one or more compiler optimizations on the performance of a CUDA kernel is often quite difficult, largely because of interactions among the architectural constraints listed in Table 2. Many optimizations that improve the performance of an individual thread tend to increase a thread's resource usage. However, as each thread's resource usage increases, the total number of threads that can occupy the SM decreases. Occasionally this decrease in thread count occurs in a dramatic fashion because threads are assigned to an SM at the granularity of thread blocks. In short, there is often a tradeoff between the performance of individual threads and the thread-level parallelism (TLP) among all threads.

For example, consider an application that uses 256 threads per block, 10 registers per thread, and 4KB of shared memory per thread block. This application can schedule 3 thread blocks and 768 threads on each SM. However, an optimization that increases each thread's register usage from 10 to 11 (an increase of only 10%) will decrease the number of blocks per SM from three to two, which decreases the number of threads on an SM by 33%. The GeForce 8800 can only assign two thread blocks (512 threads) to an SM because a third block would increase the number of threads to 768 and register usage to 8,448, above the limit of 8,192 registers per SM. By contrast, an optimization that increases each thread block's shared memory usage by 1KB (an increase of 25%) does not decrease the number of blocks per SM. Clearly, the optimization space is inherently non-linear.
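The thread-block residency argument above is mechanical given the limits in Table 2, so it can be captured in a few lines of host-side code. The helper below is our illustrative sketch, not part of the paper's toolchain: it takes the per-thread register count and per-block shared memory usage (the figures reported by the -cubin flag discussed in Section 2.3) and returns how many thread blocks of a given configuration fit on one SM, ignoring any allocation granularity the hardware may impose.

/* Illustrative sketch (not from the paper): how many thread blocks of a given
 * configuration can reside on one GeForce 8800 SM, using the Table 2 limits.
 * Register and shared memory figures would come from nvcc -cubin output. */
#include <stdio.h>

#define MAX_THREADS_PER_SM 768
#define MAX_BLOCKS_PER_SM  8
#define MAX_REGS_PER_SM    8192
#define MAX_SMEM_PER_SM    16384   /* bytes */

static int min_int(int a, int b) { return a < b ? a : b; }

int blocks_per_sm(int threads_per_block, int regs_per_thread, int smem_per_block)
{
    int by_threads = MAX_THREADS_PER_SM / threads_per_block;
    int by_regs    = MAX_REGS_PER_SM / (regs_per_thread * threads_per_block);
    int by_smem    = smem_per_block ? MAX_SMEM_PER_SM / smem_per_block
                                    : MAX_BLOCKS_PER_SM;
    return min_int(min_int(by_threads, by_regs),
                   min_int(by_smem, MAX_BLOCKS_PER_SM));
}

int main(void)
{
    /* The example from the text: 256 threads/block, 10 regs/thread, 4KB smem/block. */
    printf("%d blocks/SM\n", blocks_per_sm(256, 10, 4096));   /* prints 3 */
    /* One more register per thread drops residency to 2 blocks (512 threads). */
    printf("%d blocks/SM\n", blocks_per_sm(256, 11, 4096));   /* prints 2 */
    return 0;
}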
2.3 Software Tool Support

For CUDA compilation, NVIDIA provides a compiler wrapper called nvcc that handles all parts of the compilation flow, including linking host and kernel binaries. The compiler also supports several options that programmers can use to debug kernels and to gain intuition on their performance. Two flags are especially useful: -ptx and -cubin. Much of the information needed to compute the optimization metrics described in Section 4 is based upon the outputs of these flags. The amount of time it takes to run nvcc with these flags is much shorter than a full compilation because only the kernel code is processed.

nvcc compiles kernel code to an assembly-like representation termed PTX. This is normally placed in an object file for consumption by the CUDA runtime, which processes this code, performs further optimization such as scheduling, and generates hardware-specific code for execution. The -ptx flag outputs the PTX in a developer-readable format. Although PTX is not the exact code that is executed on the hardware, it often gives insights into why performance degrades or improves after an optimization is applied. In particular, information such as instruction count, instruction mix, and a rough idea of scheduling can be utilized reliably. Detailed instruction-level scheduling, however, is the domain of the runtime. For example, unrolling a loop with strided memory accesses creates successive operations that operate at different offsets from a base address. PTX shows that the group of memory operations needs only a single base address calculation, using constant offsets to avoid additional address calculations.

The CUDA runtime that generates executable machine code appears to reschedule code and allocate registers. This introduces an uncontrollable element during program optimization and makes the effects of optimizations on local resource usage less predictable. The -cubin flag outputs the resource usage of GPU kernel code, including the shared memory used per thread block and the registers used per thread. This is critical to understanding the performance of the code because an SM runs the number of thread blocks that fit given their local resource usage. A small change in code can result in resource usage that changes the number of thread blocks executing on an SM, which can significantly impact performance. We use the information provided by -cubin to calculate the number of thread blocks that can simultaneously reside on each SM.

3. OPTIMIZATION SPACE

The basic strategy for achieving good kernel code performance on the GeForce 8800 is to maintain high SP occupancy and reduce dynamic instruction count. High occupancy, where warps are always available for execution, is accomplished in three ways. First, one can have sequences
of independent instructions within a warp so that the same warp can make forward progress. Second, a developer can place many threads in a thread block so that at least one warp can execute while others are stalled on long-latency operations, such as memory loads. Third, the hardware can assign up to eight independent thread blocks to an SM. Under high-occupancy conditions, a reduction of executed instructions improves performance by reducing the time to process each thread and thus increasing system throughput. This gives us five categories of machine-level behavior to optimize: independent thread count, thread-level work redistribution, instruction count reduction, intra-thread parallelism, and resource balancing.

Unfortunately, optimizations rarely improve an aspect of machine-level behavior in an isolated manner. Many optimizations affect several aspects, producing a give-and-take situation between different categories. Moreover, many optimizations increase resource usage and thus compete for a limited budget of registers, threads, and shared memory. The most common way in which optimizations interact and interfere is by their effects on register usage. For example, an optimization that increases the number of independent instructions after a long-latency instruction generally uses additional registers. This causes register usage of each thread and thread block to increase, which in turn can cause the number of thread blocks assigned to each SM to decrease.

In this section we first discuss the major optimizations for performance. Using matrix multiplication as an example, we show how these optimizations can be applied to an application to find the optimal configuration.

3.1 Optimizations

The optimizations we consider can be grouped into five categories based on the primary mechanism by which they affect machine-level performance. The mechanisms are bolded for emphasis. We show examples of the optimizations using a matrix multiplication kernel, with baseline code shown in Figure 2(a). Information on CUDA syntax can be found in [21]. The code shown is executed by each thread. Variables tx and ty are initialized to each thread's coordinates in the thread block. Variables indexA, indexB, and indexC are initialized, according to thread coordinates, to positions in the two flattened input matrices and single output matrix prior to the code shown.

The goal of the first category of optimization is to provide enough warps to hide the stalling effects of long-latency and blocking operations. Loads from A[indexA] and B[indexB] are examples of long-latency operations in matrix multiplication. Blocking operations include barrier synchronization, which stops a warp until all warps in the same block have reached the barrier. A common optimization in this category is to decrease the thread block size and increase the number of thread blocks. This can increase the number of thread blocks assigned to each SM and provide more independent warps from other blocks when one block reaches a barrier. This is discussed in more detail in Section 3.2.

The second category involves redistribution of work across threads and thread blocks. For matrix multiplication, each element of the output matrix can be computed independently and the work is grouped into threads and thread blocks for the sake of data efficiency. The kernel of Figure 2(a) is tiled [19] so that each thread block computes a square 16x16 tile of the output matrix.
Threads in a block cooperatively load parts of the input matrices into shared memory, amortizing the cost of global load latency and reducing the pressure on global memory bandwidth. Using larger tiles enhances the benefit of data sharing, but reduces scheduling flexibility since a greater fraction of the threads on an SM must wait at barrier synchronizations. Redistribution can also be applied at the thread level: in Figure 2(b), the kernel has been further tiled at the thread level so that each thread now computes two matrix elements instead of one. In other words, for every tile in the first input matrix, two tiles in the second input matrix are consumed at a time, for a 1x2 rectangular tiling dimension. This presents opportunities for eliminating redundant instructions that were previously distributed across threads, such as the loads of values from As in the loop body. A third, occasionally useful technique is to distribute work across multiple invocations of a kernel, but we did not find this useful in applications with good cache behavior.

The third category is to reduce the dynamic instruction count per thread. Optimizations for this include traditional compiler optimizations such as common subexpression elimination, loop-invariant code removal, and loop unrolling. However, these optimizations frequently need to be balanced against increased resource usage. The unrolled matrix multiplication kernel in Figure 2(c) eliminates address calculation instructions by replacing variable array indices with constants after unrolling. Register usage is actually reduced in this example, though it can increase in other situations.

The fourth category of optimization, intra-thread parallelism, ensures the availability of independent instructions within a thread. A developer can unroll loops to facilitate code scheduling in the compiler or explicitly insert prefetching code. Prefetching for matrix multiplication (Figure 2(d)) is achieved by initiating long-latency global loads into an additional local variable (register) long before the variable is used. This optimization category is primarily the jurisdiction of the instruction schedulers of the compiler and runtime. The CUDA runtime appears to reschedule operations to hide intra-thread stalls. However, it sometimes does this to the detriment of inter-thread parallelism. As with optimizations to reduce instruction count, scheduling to reduce intra-thread stalls may increase register usage and potentially reduce the number of thread blocks on each SM.

The last category is best termed resource balancing. The purpose of these optimizations is to shift the use of resources, sometimes in counterintuitive ways, to produce better overall performance. One example is proactive, explicit register spilling by the programmer. By reducing register usage, often a critical resource, more thread blocks may be assigned to each SM. The resulting application may have much better performance, despite the added latency from memory accesses and additional instructions, because the additional thread blocks improve overall resource utilization.

One optimization that was useful for all studied applications is the use of shared memory and caches to improve data locality for reused values; without this, performance was generally limited by global memory bandwidth and insensitive to other optimizations. For the experiments in this work, we apply this optimization unconditionally.
We also do not constrain the optimizations or scheduling performed by NVIDIA's compiler and runtime.
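Because Figure 2 does not reproduce legibly here, the sketch below reconstructs a baseline 16x16-tiled kernel in the style of Figure 2(a), reusing the names from the text (tx, ty, indexA, indexB, indexC, As, Bs, Ctemp). The explicit index setup, the matmul_tiled name, and the assumption of square, row-major matrices whose width is a multiple of 16 are ours, not the paper's exact code.

// Illustrative reconstruction (our sketch, not the paper's exact Figure 2(a)):
// a 16x16-tiled matrix multiplication kernel as described in Section 3.1.
// In the paper, the index setup happens before the code excerpted in the figure.
#define TILE 16

__global__ void matmul_tiled(const float *A, const float *B, float *C, int width)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int tx = threadIdx.x;                  // thread coordinates in the block
    int ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;      // output element this thread computes
    int col = blockIdx.x * TILE + tx;

    int indexA = row * width;              // walks along a row of A
    int indexB = col;                      // walks down a column of B
    int indexC = row * width + col;

    float Ctemp = 0.0f;
    for (int t = 0; t < width / TILE; t++) {
        // Threads cooperatively load one tile of A and one tile of B.
        As[ty][tx] = A[indexA + tx];
        Bs[ty][tx] = B[indexB + ty * width];
        indexA += TILE;
        indexB += TILE * width;
        __syncthreads();                   // wait until both tiles are loaded

        for (int i = 0; i < TILE; i++)     // loop that Figure 2(c) unrolls
            Ctemp += As[ty][i] * Bs[i][tx];
        __syncthreads();                   // wait before overwriting the tiles
    }
    C[indexC] = Ctemp;
}

A matching launch uses dim3 block(16, 16) and dim3 grid(width/16, width/16), which yields the 256-thread blocks discussed in Section 3.2.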
! ! " # $ % !! ! " & $ & ! ! " # $ $ & ! ! " # $ % !! ! " & $ & ! ! " # $ % !! ! " & $ & Figure 2: Matrix Multiplication Optimization Examples Code differences from base version are shown in bold. 0 1 2 3 4 5 6 7 8 normal prefetch normal prefetch normal prefetch normal prefetch normal prefetch normal prefetch 1x1 1x2 1x4 1x1 1x2 1x4 8x8 tiles 16x16 tiles unroll 1 unroll 2 unroll 4 complete unroll Figure 3: Matrix Multiplication Performance 3.2 Applying Optimizations In this section we use matrix multiplication to show how to apply optimizations when searching for optimal performance. Our intent is to convey the complexity of the interaction between optimizations and resource limits. Figure 3 shows the run time of matrix multiplication across an abbreviated optimization space. One of the first questions facing the developer is the granularity at which to spawn threads, since each SM can host up to 768 threads. Eight thread blocks at a tiling factor of 8x8 (64 threads per thread block) can be assigned to an SM, whereas only three thread blocks at a tiling factor of 16x16 (256 threads per thread block) can be assigned. Since matrix multiplication contains intra-thread block synchronization, it may be tempting to keep the number of thread blocks high. However, 16x16 consistently outperforms 8x8 because configurations with the latter tile size run into a memory bandwidth bottleneck. When we consider the second category of optimizations, work redistribution, we see that allocating more work per thread by adjusting the tiling dimensions is generally good for matrix multiplication. This is true for both 8x8 and 16x16 tiles. It is interesting to note that for 1x4 tiling of 16x16 tiles, each SM only runs one thread block of 256 threads at a time due to heavy register usage, yet this con- figuration has the highest performance. Despite having only eight warps that must synchronize at a regular interval, the elimination of redundant instructions and enhanced instruction-level parallelism offset that downside. Next, the performance effects of loop unrolling become more muddled. As shown in Figure 2(c), loop unrolling re- 2 3 4 5 6 7 8 9 32 64 96 128 160 192 224 256 288 320 352 384 Time (ms) Threads per Thread Block Each line varies threads/block with other parameters constant. Figure 4: SAD Optimization Space moves redundant instructions, reducing the instruction count while increasing register usage. When the loop is completely unrolled, the register usage sometimes drops back down to the same level as no unrolling. Since the mechanism by which the CUDA runtime performs scheduling and register allocation is not visible to the application developer, we do not have a clear explanation for this non-uniform behavior. Finally, prefetching generally increases register usage but does not always reduce the number of thread blocks running on an SM. When it does, the reduction in exposed global latency often makes up for the loss of a thread block. Thus, there are only a handful of cases where there is a significant difference in performance. The exception is the configuration at the far right, where prefetching increased register usage beyond what is available, producing an invalid executable. In summary, there is a significant number of optimization configurations to be considered for an application as simple as matrix multiplication. 
In summary, there is a significant number of optimization configurations to be considered for an application as simple as matrix multiplication. An expert with in-depth understanding of both the algorithm, including its behavior and usage patterns, and the hardware, including its memory bandwidth and resource availability, may have been able to bypass some of the pitfalls we present here. However, a developer who is not intimately familiar with the application, hardware, and the CUDA runtime generally cannot determine whether the upside of an optimization will more than compensate for potential downsides without experimentation. As