Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA

Shane Ryoo†  Christopher I. Rodrigues†  Sara S. Baghsorkhi†  Sam S. Stone†  David B. Kirk∗  Wen-mei W. Hwu†
†Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign
∗NVIDIA Corporation
{sryoo, cirodrig, bsadeghi, ssstone2, hwu}@crhc.uiuc.edu, dk@nvidia.com

Abstract

GPUs have recently attracted the attention of many application developers as commodity data-parallel coprocessors. The newest generations of GPU architecture provide easier programmability and increased generality while maintaining the tremendous memory bandwidth and computational power of traditional GPUs. This opportunity should redirect efforts in GPGPU research from ad hoc porting of applications to establishing principles and strategies that allow efficient mapping of computation to graphics hardware. In this work we discuss the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies. Key to performance on this platform is using massive multithreading to utilize the large number of cores and hide global memory latency. To achieve this, developers face the challenge of striking the right balance between each thread's resource usage and the number of simultaneously active threads. The resources to manage include the number of registers and the amount of on-chip memory used per thread, the number of threads per multiprocessor, and global memory bandwidth. We also obtain increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations, and by applying classical optimizations to reduce the number of executed operations. We apply these strategies across a variety of applications and domains and achieve between 10.5X and 457X kernel speedup and between 1.16X and 431X total application speedup.

Categories and Subject Descriptors  D.1.3 [Software]: Programming Techniques—Concurrent Programming

General Terms  Design, Performance, Languages

Keywords  parallel computing, GPU computing

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PPoPP '08, February 20-23, 2008, Salt Lake City, Utah, USA. Copyright © 2008 ACM 978-1-59593-960-9/08/0002...$5.00

1. Introduction

As a result of continued demand for programmability, modern graphics processing units (GPUs) such as the NVIDIA GeForce 8 Series are designed as programmable processors employing a large number of processor cores [20]. With the addition of new hardware interfaces, programming them does not require specialized programming languages or execution through graphics application programming interfaces (APIs), as with previous GPU generations. This makes an inexpensive, highly parallel system available to a broader community of application developers.

The NVIDIA CUDA programming model [3] was created for developing applications for this platform. In this model, the system consists of a host that is a traditional CPU and one or more compute devices that are massively data-parallel coprocessors.
Each CUDA device processor supports the Single-Program Multiple-Data (SPMD) model [8], widely available in parallel processing systems, where all concurrent threads are based on the same code, although they may not follow exactly the same path of execution. All threads share the same global address space.

CUDA programming is done with standard ANSI C extended with keywords that designate data-parallel functions, called kernels, and their associated data structures to the compute devices. These kernels describe the work of a single thread and typically are invoked on thousands of threads. These threads can, within developer-defined bundles termed thread blocks, share their data and synchronize their actions through built-in primitives. The CUDA runtime also provides library functions for device memory management and data transfers between the host and the compute devices. One can view CUDA as a programming environment that enables software developers to isolate program components that are rich in data parallelism for execution on a coprocessor specialized for exploiting massive data parallelism. An overview of the CUDA programming model can be found in [5].

The first version of CUDA programming tools and runtime for the NVIDIA GeForce 8 Series GPUs has been available through beta testing since February 2007. To CUDA, the GeForce 8800 GTX¹ consists of 16 streaming multiprocessors (SMs), each with eight processing units, 8192 registers, and 16KB of on-chip memory. It has a peak attainable multiply-add performance of 345.6 single-precision GFLOPS², features 86.4 GB/s memory bandwidth, contains 768MB of main memory, and incurs little cost in creating thousands of threads. The architecture allows efficient data sharing and synchronization among threads in the same thread block [18].

¹ There are several versions of the GeForce 8800 GPU. References to the GeForce 8800 are implied to be the GTX model.
² Particular mixes of instructions can achieve higher throughput, as will be explained in Section 3.
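For reference, the quoted peak multiply-add rate is consistent with the organization detailed in Section 3.2 (16 SMs with eight SPs each, clocked at 1.35 GHz, with a multiply-add counted as two floating-point operations):

    345.6 GFLOPS = 16 SMs * 8 SPs/SM * 2 FLOP per multiply-add * 1.35 GHz

The 388.8 GFLOPS figure given in Section 3.2 additionally counts the SFUs, presumably contributing the extra throughput that raises the rate from 16 to 18 operations per SM per cycle.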
A unique aspect of this architecture relative to other parallel platforms is the flexibility in the assignment of local resources, such as registers or local memory, to threads. Each SM can run a variable number of threads, and the local resources are divided among threads as specified by the programmer. This flexibility allows more tuning of application performance but changes the assumptions developers can make when performing optimizations. We discuss these issues in further detail.

Another question we address is how well applications can execute on the GeForce 8800 and what design features contribute to or limit performance. As a collaborative effort between industry and academia, a set of complete numerical applications was ported and evaluated on the CUDA platform. Several application research groups in the areas of medical imaging, molecular dynamics, computational chemistry, electromagnetic analysis, and scientific visualization contributed to this effort. The following are the major principles when choosing code to be executed on this platform:

1. Leverage zero-overhead thread scheduling to hide memory latency. On the GeForce 8800 there are 128 execution units available for use, requiring hundreds of threads to completely occupy them. In addition, threads can be starved of data due to the long latency to global memory. The general philosophy of CUDA for tolerating this latency is to generate and maintain thousands of threads in flight. This is in contrast with the use of large caches to hide memory latencies in CPU designs. Developers used to traditional multicore systems may need to define threads at a finer granularity in order to generate enough threads. In addition, a high compute-to-memory-access ratio is necessary to avoid saturation of memory channels.

2. Optimize use of on-chip memory to reduce bandwidth usage and redundant execution. Working memory within a group of cores consists primarily of a register file and a software-managed on-chip memory called shared memory. These are high fan-out, low latency, limited-capacity memories which are partitioned among thread blocks that are assigned to the same SM at runtime. The data in shared memory can be shared among threads in a thread block, enabling interthread data reuse. An incremental increase in the usage of registers or shared memory per thread can result in a substantial decrease in the number of threads that can be simultaneously executed.

3. Group threads to avoid SIMD penalties and memory port/bank conflicts. CUDA is based on the SPMD model, but its current implementation on the GeForce 8800 imposes Single-Instruction, Multiple-Data (SIMD) mode among subsets of threads. The latter differs from the short-vector SIMD present in most contemporary processors. This is a cost-effective hardware model for exploiting data parallelism and allows the GeForce 8800 to share one instruction issue unit among eight execution units. However, it can be ineffective for algorithms that require diverging control flow decisions in data-parallel sections. In some algorithms, threads can be reorganized to avoid divergent control flow, as illustrated in the sketch after this list. Appropriate thread grouping can also preserve performance by avoiding port and bank conflicts in memories.

4. Threads within a thread block can communicate via synchronization, but there is no built-in global communication mechanism for all threads. This avoids the need for virtualization of hardware resources, enables the execution of the same CUDA program across processor family members with a varying number of cores, and makes the hardware scalable. However, it also limits the kinds of parallelism that can be utilized within a single kernel call.
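As an illustration of the third principle, the following fragment is a hypothetical sketch of our own, not code from the paper. It contrasts a branch whose outcome varies within a 32-thread warp against a data layout in which the branch outcome is uniform across each warp; processEven, processOdd, and the array layout are placeholders.

```cuda
// Hypothetical sketch (not from the paper): grouping threads so that all 32
// threads of a warp follow the same control path.
__device__ float processEven(float x) { return x * 2.0f; }
__device__ float processOdd (float x) { return x + 1.0f; }

// Divergent version: adjacent threads in a warp take different branches, so the
// SM must make a pass over each path for every warp, suppressing the threads
// that did not take that path.
__global__ void interleavedKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0) data[i] = processEven(data[i]);
        else            data[i] = processOdd(data[i]);
    }
}

// Warp-uniform version: if even- and odd-indexed elements are stored in separate
// arrays and nHalf is a multiple of the 32-thread warp size, every thread of a
// given warp evaluates the branch the same way, so only one path runs per warp.
__global__ void groupedKernel(float *evenData, float *oddData, int nHalf)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nHalf)
        evenData[i] = processEven(evenData[i]);
    else if (i < 2 * nHalf)
        oddData[i - nHalf] = processOdd(oddData[i - nHalf]);
}
```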
We first discuss related work in Section 2. Section 3 introduces the threading model and execution hardware. Section 4 demonstrates the optimization process with in-depth performance analysis, using matrix multiplication kernels. Section 5 presents several studied applications with performance and optimization information. We conclude with some final statements and suggestions for future work.

2. Related Work

Data parallel programming languages are considered an intermediate approach between automatic parallelization efforts [7, 28] and explicit parallel programming models such as OpenMP [19] to support parallel computing. Fortran 90 [6] was the first widely used such language and influenced the data parallel languages that followed by introducing array assignment statements. The lock-step execution of a single instruction by the threads running simultaneously on a streaming multiprocessor in the CUDA programming model is similar to array assignments in Fortran 90. Later, High Performance Fortran (HPF) [15] was introduced as a standard data parallel language to support SPMD programs. However, the complexity of data distribution and communication optimization techniques, as discussed in the final two chapters of [13], was a hard-to-solve challenge. As a result, application developers became more involved in explicitly handling data distribution and communication; message passing libraries such as [23] became a popular programming model for scalable parallel systems. Similarly, in CUDA the developer explicitly manages data layout in DRAM memory spaces, data caching, thread communication within thread blocks, and other resources.

The interest in GPGPU programming has been driven by relatively recent improvements in the programmability of graphics hardware. The release of Cg [16] signified the recognition that GPUs were programmable processors and that a higher-level language was needed to develop applications on them. Others felt that the abstractions provided by Cg and other shading languages were insufficient and built higher-level language constructs. Brook [9] enables the usage of the GPU as a streaming coprocessor. Accelerator [26] is another system that uses data-parallel arrays to perform general-purpose computation on the GPU. A Microsoft C# library provides data types and functions to operate on data-parallel arrays. Data-parallel array computation is transparently compiled to shader programs by the runtime. Other efforts to provide a more productive stream processing programming environment for developing multi-threaded applications include the RapidMind Streaming Execution Manager [17] and the PeakStream Virtual Machine [4]. These mainly target HPC applications that are amenable to stream processing. The achieved performance may be behind customized GPU/CPU code due to virtual machine and dynamic compilation overhead. We refer the reader to a review of the main body of work done to map general purpose computation to GPUs by Owens et al. in [21].

In general, previous GPU programming systems limit the size and complexity of GPU code due to their underlying graphics API-based implementations. CUDA supports kernels with much larger code sizes with a new hardware interface and instruction caching. Previous GPU generations and their APIs also restricted the allowed memory access patterns, usually allowing only sequential writes to a linear array. This is due primarily to limits in graphics APIs and corresponding limits in the specialized pixel and vertex processors. Accelerator does not allow access to an individual element in parallel arrays: operations are performed on all array elements.
Brook also executes its kernel for every element in the stream, with some exceptions. The GeForce 8800 allows for general addressing of memory via a unified processor model, which enables CUDA to perform unrestricted scatter-gather operations.

Traditional GPUs also provided limited cache bandwidth. Fatahalian et al. discuss in [11] how low-bandwidth cache designs on GPUs limit the kinds of applications that can benefit from the computational power available on these architectures. Work discussed in [12] uses an analytical cache performance prediction model for GPU-based algorithms.
Their results indicate that memory optimization techniques designed for CPU-based algorithms may not be directly applicable to GPUs. With the introduction of reasonably sized, low-latency, on-chip memory in new generations of GPUs, this issue and its optimizations have become less critical.

A programming interface alternative to CUDA is available for the AMD Stream Processor, using the R580 GPU, in the form of the Close to Metal (CTM) compute runtime driver [1]. Like CUDA, CTM can maintain the usage of the GPU as a graphics engine; however, instead of abstracting away architecture-level instructions, CTM completely exposes the ISA to the programmer for fine-grained control. Furthermore, the R580 continues to resemble previous generation GPUs with their divided architecture for vertex and pixel processing, whereas the GeForce 8800 has a more general, unified model. This is presented in the next section.

Intel's C for Heterogeneous Integration (CHI) programming environment [27] is a different approach to tightly integrate accelerators such as GPUs and general purpose CPU cores together, based on the proposed EXOCHI [27] model. EXOCHI supports a shared virtual memory heterogeneous multi-threaded programming model with minimal OS intrusion. In the CUDA execution model, the GPU is a device with a separate address space from the CPU. As a result, all data communication and synchronization between CPU and GPU is explicitly performed through the GPU device driver.

3. Architecture Overview

The GeForce 8800 GPU is effectively a large set of processor cores with the ability to directly address into a global memory. This allows for a more general and flexible programming model than previous generations of GPUs, making it easier for developers to implement data-parallel kernels. In this section we discuss NVIDIA's Compute Unified Device Architecture (CUDA) and the major microarchitectural features of the GeForce 8800. A more complete description can be found in [3, 18]. It should be noted that this architecture, although more exposed than previous GPU architectures, still has details which have not been publicly revealed.

3.1 Threading Model

The CUDA programming model is ANSI C extended by several keywords and constructs. The GPU is treated as a coprocessor that executes data-parallel kernel code. The user supplies a single source program encompassing both host (CPU) and kernel (GPU) code. These are separated and compiled as shown in Figure 1. Each CUDA program consists of multiple phases that are executed on either the CPU or the GPU. The phases that exhibit little or no data parallelism are implemented in host (CPU) code, which is expressed in ANSI C and compiled with the host C compiler as shown in Figure 1. The phases that exhibit rich data parallelism are implemented as kernel functions in the device (GPU) code. A kernel function defines the code to be executed by each of the massive number of threads to be invoked for a data-parallel phase. These kernel functions are compiled by the NVIDIA CUDA C compiler and the kernel GPU object code generator. There are several restrictions on kernel functions: there must be no recursion, no static variable declarations, and a fixed number of arguments. The host code transfers data to and from the GPU's global memory using API calls. Kernel code is initiated by performing a function call.
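As a concrete illustration of this host/kernel split, the fragment below is a minimal sketch of our own rather than code from the paper: a data-parallel kernel that scales a vector, with the host allocating device memory, copying data, and launching the kernel through the CUDA runtime API.

```cuda
#include <cuda_runtime.h>

// Kernel: the work of one thread.  Each thread handles one element, exposing
// fine-grained data parallelism to the hardware.
__global__ void scaleKernel(float *data, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique element index
    if (i < n)
        data[i] *= alpha;
}

// Host code: allocate device memory, copy input, launch the kernel on a grid of
// thread blocks, and copy the result back.
void scaleOnDevice(float *hostData, float alpha, int n)
{
    float *devData;
    size_t bytes = (size_t)n * sizeof(float);

    cudaMalloc((void **)&devData, bytes);
    cudaMemcpy(devData, hostData, bytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;   // one thread block of 256 threads
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<blocks, threadsPerBlock>>>(devData, alpha, n);

    cudaMemcpy(hostData, devData, bytes, cudaMemcpyDeviceToHost);
    cudaFree(devData);
}
```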
Threads executing on the GeForce 8800 are organized into a three-level hierarchy. At the highest level, all threads in a data-parallel execution phase form a grid; they all execute the same kernel function. Each grid consists of many thread blocks. A grid can be at most 2^16 - 1 blocks in either of two dimensions, and each block has unique coordinates. In turn, each thread block is a three-dimensional array of threads, explicitly defined by the application developer, that is assigned to an SM. The invocation parameters of a kernel function call define the organization of the sizes and dimensions of the thread blocks in the grid thus generated. Threads also have unique coordinates, and up to 512 threads can exist in a block. Threads in a block can share data through a low-latency, on-chip shared memory and can perform barrier synchronization by invoking the __syncthreads primitive. Threads are otherwise independent; synchronization across thread blocks can only be safely accomplished by terminating a kernel. Finally, the hardware groups threads in a way that affects performance, which is discussed in Section 3.2.

[Figure 1. CUDA Compilation Flow: the integrated source (foo.c, bar.cu) is split by the cudacc front end and global optimizer into GPU assembly/kernel code (bar.s) and CPU host code (foo.c, bar.c); the kernel object code generator and the host compiler produce kernel object code (bar.gpu) and the host binary (foo.o, bar.o), which are combined into the executable.]

An application developer for this platform can compile CUDA code to an assembly-like representation of the code called PTX. PTX is not natively executed, but is processed by a run-time environment, making it uncertain what instructions are actually executed on a cycle-by-cycle basis. Two examples we have observed are simple cases of loop-invariant code that can be easily moved and branches which are split into condition evaluations and predicated jump instructions. However, PTX is generally sufficient in the initial stages of estimating resource requirements of an application and optimizing it.

3.2 Base Microarchitecture

Figure 2 depicts the microarchitecture of the GeForce 8800. It consists of 16 streaming multiprocessors (SMs), each containing eight streaming processors (SPs), or processor cores, running at 1.35GHz. Each core executes a single thread's instruction in SIMD (single-instruction, multiple-data) fashion, with the instruction unit broadcasting the current instruction to the cores. Each core has one 32-bit, single-precision floating-point, multiply-add arithmetic unit that can also perform 32-bit integer arithmetic. Additionally, each SM has two special functional units (SFUs), which execute more complex FP operations such as reciprocal square root, sine, and cosine with low multi-cycle latency. The arithmetic units and the SFUs are fully pipelined, yielding 388.8 GFLOPS (16 SMs * 18 FLOPS/SM * 1.35GHz) of peak theoretical performance for the GPU.

Each SM has 8192 registers which are dynamically partitioned among the threads running on it. Non-register memories with distinctive capabilities and uses are described in Table 1 and depicted in Figure 2. Variables in the source code can be declared to reside in global, shared, local, or constant memory. Texture memory is accessed through API calls which compile to special instructions. Bandwidth to off-chip memory is very high at 86.4 GB/s, but memory bandwidth can saturate if many threads request access within a short period of time. In addition, this bandwidth can be obtained only when accesses are contiguous 16-word lines; in other cases the achievable bandwidth is a fraction of the maximum. Optimizations to coalesce accesses into 16-word lines and reuse data are generally necessary to achieve good performance.
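The coalescing requirement can be illustrated with a small sketch of our own (array names, the stride constant, and the copy kernels are hypothetical). When consecutive threads read consecutive 4-byte words, the accesses fall within one contiguous 16-word line and can be combined; a strided pattern scatters them across lines and forfeits most of the bandwidth.

```cuda
// Hypothetical sketch of coalesced vs. uncoalesced global memory access patterns.
#define STRIDE 16   // illustrative stride, in elements

// Coalesced: consecutive threads read consecutive words, which lie within the
// same contiguous 16-word line, so the hardware can service them together.
__global__ void coalescedCopy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Uncoalesced: consecutive threads touch words STRIDE elements apart, so the
// accesses land in different 16-word lines and are serviced separately,
// delivering only a fraction of the 86.4 GB/s peak.
__global__ void stridedCopy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * STRIDE < n)
        out[i * STRIDE] = in[i * STRIDE];
}
```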
There are several non-storage limits to the number of threads that can be executed on the system. First, a maximum of 768 simultaneously active thread contexts is supported per SM. Second, an integral number of up to eight thread blocks can be run per SM at one time. The number of thread blocks that are simultaneously resident on an SM is limited by whichever limit of registers, shared memory, threads, or thread blocks is reached first. This has two consequences.
First, optimization may have negative effects in some cases because small changes have multiplicative resource usage effects (due to the large number of threads) that cause fewer thread blocks, and thus fewer threads, to be simultaneously executed. Second, it is relatively easy to be "trapped" in a local maximum when hand-optimizing code. Developers may need to try widely varying configurations to find one with satisfactory performance.

Table 1. Properties of GeForce 8800 Memories

Global (off-chip; 768MB total; 200-300 cycle latency; read/write; global scope): Large DRAM. All data reside here at the beginning of execution. Directly addressable from a kernel using pointers. Backing store for constant and texture memories. Used more efficiently when multiple threads simultaneously access contiguous elements of memory, enabling the hardware to coalesce memory accesses to the same DRAM page.

Local (off-chip; up to global size; same latency as global; read/write; function scope): Space for register spilling, etc.

Shared (on-chip; 16KB per SM; register latency; read/write; function scope): Local scratchpad that can be shared between threads in a thread block. Organized into 16 banks. Does not appear to have error detection. If instructions issued in the same cycle access different locations in the same bank, a bank conflict stall occurs. It is possible to organize both threads and data such that bank conflicts seldom or never occur.

Constant (on-chip cache; 64KB total; register latency; read-only; global scope): 8KB cache per SM, with data originally residing in global memory. The 64KB limit is set by the programming model. Often used for lookup tables. The cache is single-ported, so simultaneous requests within an SM must be to the same address or delays will occur.

Texture (on-chip cache; up to global size; >100 cycle latency; read-only; global scope): 16KB cache per two SMs, with data originally residing in global memory. Capitalizes on 2D locality. Can perform hardware interpolation and has configurable returned-value behavior at the edges of textures, both of which are useful in certain applications such as video encoders.

[Figure 2. Basic Organization of the GeForce 8800: the device comprises 16 SMs sharing the off-chip global, constant, and texture memories; each SM contains an instruction unit, eight processors with a register file and shared memory, two SFUs, a constant cache, and a texture cache.]

During execution, threads within a block are grouped into warps of 32 parallel threads, which are the granular multi-threading scheduling unit. Warps are formed from continuous sections of threads in a thread block: the first 32 threads in a block form the first warp, etc. Although warps are not explicitly declared in CUDA code, knowledge of them can enable useful code and data optimizations on the GeForce 8800. A scoreboard indicates when all of a warp's operands are ready for execution. The SM then executes the same instruction for the 32 threads in the warp. An SM issues only one instruction at a time for all threads in a warp; when threads in a warp take different control paths, it is assumed that multiple passes with suppression of threads on divergent paths are required to complete execution. It is generally desirable to group threads to avoid this situation. If a thread block is not evenly divisible by the warp size, any remaining issue slots are wasted.
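Because shared memory is divided into 16 banks (Table 1), the way threads index it determines whether bank conflict stalls occur. The fragment below is an illustrative sketch of ours, not from the paper; it contrasts a conflict-free access pattern with one in which threads issued together hit the same bank.

```cuda
#define TILE 16

__global__ void bankAccessSketch(float *out)
{
    // Hypothetical illustration of shared memory bank behavior: with 16 banks
    // of 32-bit words, consecutive words map to consecutive banks.
    __shared__ float tile[TILE][TILE];

    int tx = threadIdx.x;   // 0..15
    int ty = threadIdx.y;   // 0..15

    tile[ty][tx] = (float)(ty * TILE + tx);
    __syncthreads();

    // Conflict-free: threads with consecutive tx read consecutive words of a
    // row, which land in 16 different banks.
    float rowValue = tile[ty][tx];

    // Conflicting: threads with consecutive tx read down a column.  Successive
    // column elements are 16 words apart, so they all map to the same bank and
    // the accesses are serialized.
    float colValue = tile[tx][ty];

    out[ty * TILE + tx] = rowValue + colValue;
}
```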
An SM can perform zero-overhead scheduling to interleave warps and hide the latency of global memory accesses and long-latency arithmetic operations. When one warp stalls, the SM can quickly switch to a ready warp resident in the SM. The SM stalls only if there are no warps with ready operands available. Scheduling freedom is high in many applications because threads in different warps are independent, with the exception of explicit barrier synchronizations among threads in the same thread block.

In summary, there are hard limits to the memories, threads, and total bandwidth available to an application running on the GeForce 8800. Managing these limits is critical when optimizing applications, but strategies for avoiding one limit can cause other limits to be hit. They can also reduce the number of thread blocks that can run simultaneously. In addition, managing the behavior of threads so that those in the same warp follow the same control paths and load contiguous values from global memory can also improve performance.

4. Performance and Optimization

This section uses a microbenchmark to demonstrate how the proper balancing of shared resource usage is critical to achieving efficient execution resource utilization and thus high performance on the GeForce 8800. There are three basic principles to consider when optimizing an application for the platform. First, the floating point throughput of an application depends on the percentage of its instructions that are floating point operations. The GPU is capable of issuing 172.8 billion operations per second on the SPs. These include fused multiply-add operations, which we count as two operations for throughput calculations. If 1/4 of an application's instruction mix are fused multiply-adds, then its performance can be at most 2 * 1/4 FP * 172.8 billion ops per second = 86.4 GFLOPS. This performance is reached when the SPs are fully occupied, which is achievable in an application that has many threads, does not have many synchronizations, and does not stress global memory bandwidth. In this situation, reducing the number of instructions that do not contribute to data computation generally results in kernel speedup. However, maximizing computational efficiency can be challenging, due to discontinuities in the optimization space [22].

Second, when attempting to achieve an application's maximum performance, the primary concern often is managing global memory latency. This is done by creating enough threads to keep SPs occupied while many threads are waiting on global memory accesses. As previously stated, threads may need to be of a finer granularity than those for traditional multicore execution to generate enough threads. The required number of threads depends on the percentage of global accesses and other long-latency operations in an application: applications consisting of a small percentage of these operations require fewer threads to achieve full SP occupancy.
The limit on registers and shared memory available per SM can constrain the number of active threads, sometimes exposing memory latency. We show one example where the use of additional registers in an attempted optimization allows one fewer thread block to be scheduled per SM, reducing performance.

Finally, global memory bandwidth can limit the throughput of the system. Increasing the number of threads does not help performance in this situation. Alleviating the pressure on global memory bandwidth generally involves using additional registers and shared memory to reuse data, which in turn can limit the number of simultaneously executing threads. Balancing the usage of these resources is often non-intuitive, and some applications will run into resource limits other than instruction issue on this architecture.

The example we use to illustrate these principles is a matrix multiplication kernel. In matrix multiplication, the value of an element in the result matrix is calculated by computing the dot product of the corresponding row of the first matrix and column of the second matrix. For this example we assume densely populated input matrices. We analyze several code versions and their sustained performance when multiplying two square matrices with a height and width of 4096 elements. The stated resource usage is for CUDA version 0.8; later versions of CUDA may have different usages.

4.1 Initial Code Version

We begin with a simple version of matrix multiplication. The matrix multiplication kernel creates a thread for each result element, for a total of 4K*4K threads. Many threads are created in an attempt to hide the latency of global memory by overlapping execution. These threads loop through a sequence that loads two values from global memory, multiplies them, and accumulates the value. Figure 3(a) shows the core loops of the dot-product computation kernel; starting values for indexA, indexB, and indexC are determined by block and thread coordinates, which the hardware supports. This code uses ten registers per thread, allowing the maximum of 768 threads to be scheduled per SM. For convenience, we group them as three thread blocks of 256 threads each.

Performance for this code is 10.58 GFLOPS, which is lower than highly optimized libraries executing on a CPU using SIMD extensions. By examining the PTX for this code, we find that there is approximately³ one fused multiply-add out of eight operations in the inner loop, for an estimated potential throughput of 43.2 GFLOPS. Because we have the maximum number of threads scheduled per SM, the bottleneck appears to be global memory bandwidth. 1/4 of the operations executed during the loop are loads from off-chip memory, which would require a bandwidth of 173 GB/s (128 SPs * 1/4 instructions * 4 B/instruction * 1.35GHz) to fully utilize the SPs.⁴ Thus, the strategy for optimizing this kernel is to improve data reuse and reduce global memory access.

³ As previously mentioned, PTX code does not necessarily translate to executed instructions, so instruction counts are estimates.
⁴ This is also an estimate. Threads can simultaneously load the same value from memory and the memory system may be able to combine these into a single request.
Finally, global memory bandwidth can limit the throughput of the system. Increasing the number of threads does not help performance in this situation. Alleviating the pressure on global memory bandwidth generally involves using additional registers and shared memory to reuse data, which in turn can limit the number of simultaneously executing threads. Balancing the usage of these resources is often non-intuitive, and some applications will run into resource limits other than instruction issue on this architecture.

The example we use to illustrate these principles is a matrix multiplication kernel. In matrix multiplication, the value of an element in the result matrix is calculated by computing the dot product of the corresponding row of the first matrix and column of the second matrix. For this example we assume densely populated input matrices. We analyze several code versions and their sustained performance when multiplying two square matrices with a height and width of 4096 elements. The stated resource usage is for CUDA version 0.8; later versions of CUDA may have different usages.

4.1 Initial Code Version

We begin with a simple version of matrix multiplication. The kernel creates a thread for each result element of the multiplication, for a total of 4K*4K threads. Many threads are created in an attempt to hide the latency of global memory by overlapping execution. These threads loop through a sequence that loads two values from global memory, multiplies them, and accumulates the product. Figure 3(a) shows the core loops of the dot-product computation kernel; starting values for indexA, indexB, and indexC are determined by block and thread coordinates, which the hardware supports. This code uses ten registers per thread, allowing the maximum of 768 threads to be scheduled per SM. For convenience, we group them as three thread blocks of 256 threads each.
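For reference, the following kernel is a minimal sketch of a non-tiled matrix multiplication with the structure described above. The names indexA, indexB, and indexC follow the text; the width parameter (4096 for the matrices used here) and the exact index initialization are our assumptions for illustration and are not taken from Figure 3(a) itself.

    // Sketch of a non-tiled matrix multiplication kernel: one thread per
    // result element, looping over a dot product with two global loads,
    // one multiply, and one accumulate per iteration.
    __global__ void matmul_simple(const float* A, const float* B, float* C,
                                  int width)
    {
        // Block and thread coordinates identify the result element.
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;

        int indexA = row * width;        // walks along a row of A
        int indexB = col;                // walks down a column of B
        int indexC = row * width + col;  // the result element

        float sum = 0.0f;
        for (int i = 0; i < width; ++i) {
            sum += A[indexA] * B[indexB];
            indexA += 1;
            indexB += width;
        }
        C[indexC] = sum;
    }

Launched with 256-thread blocks over a 4096x4096 result, this creates the 4K*4K threads described above; with ten registers per thread, three such blocks fit on each SM.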
Performance for this code is 10.58 GFLOPS, which is lower than that of highly optimized libraries executing on a CPU using SIMD extensions. By examining the PTX for this code, we find that approximately³ one operation out of eight in the inner loop is a fused multiply-add, for an estimated potential throughput of 43.2 GFLOPS. Because we already have the maximum number of threads scheduled per SM, the bottleneck appears to be global memory bandwidth: 1/4 of the operations executed during the loop are loads from off-chip memory, which would require a bandwidth of 173 GB/s (128 SPs * 1/4 instructions * 4 B/instruction * 1.35 GHz) to fully utilize the SPs.⁴ Thus, the strategy for optimizing this kernel is to improve data reuse and reduce global memory accesses.

³ As previously mentioned, PTX code does not necessarily translate to executed instructions, so instruction counts are estimates.
⁴ This is also an estimate. Threads can simultaneously load the same value from memory, and the memory system may be able to combine these loads into a single request.

4.2 Use of Local Storage

In Figure 3(a), although the computations of two result elements in the same row or column share half their input data (the same indexA or indexB values), the previous code accesses global memory for each datum in every thread. A common optimization for this type of access pattern is to enhance data sharing via tiling [14]. On the GeForce 8800, developers can utilize shared memory to amortize the global latency cost when values are reused. Using low-overhead block synchronization, values can be shared between threads: one thread loads a datum and then synchronizes so that other threads in the same block can use it. Finally, we can also take advantage of contiguity in main memory accesses when loading values as a block, reducing the cycles needed to access the data.

[Figure 4. Performance of Matrix Multiplication Kernels: sustained GFLOPS (0-100) for the non-tiled kernel and for 4x4, 8x8, 12x12, and 16x16 tiles, each in "tiled only" and "tiled & unrolled" configurations.]

Figure 3(b) shows the code for a tiled matrix multiplication with a tile size of 16x16, or 256 result elements and threads. During execution, the threads work within two input tiles that stride across 16 contiguous rows or columns of the input matrices. Each of the 256 threads is tied to a specific coordinate in a tile. It loads the element at that coordinate from each of the two input tiles into shared memory, so cooperatively the threads load the complete tiles. These loads are organized to take advantage of global access coalescing. The threads then synchronize to establish consistency, which enables each thread to load all of its inputs contained in the tiles from shared memory. Finally, the threads calculate the partial dot product for the inputs in shared memory within a loop.
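The following kernel is a sketch in the style of Figure 3(b), not the figure's code itself: the 16x16 tile size follows the text, while the shared-memory array names, the index arithmetic, and the assumption that the matrix width is a multiple of the tile size are ours. It shows the cooperative, coalesced tile loads, the barrier synchronization, and the per-tile partial dot product described above.

    #define TILE 16

    // Sketch of a 16x16 tiled matrix multiplication kernel.
    __global__ void matmul_tiled(const float* A, const float* B, float* C,
                                 int width)
    {
        __shared__ float As[TILE][TILE];   // tile of A staged in shared memory
        __shared__ float Bs[TILE][TILE];   // tile of B staged in shared memory

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float sum = 0.0f;

        // The two input tiles stride across 16 contiguous rows/columns.
        for (int t = 0; t < width / TILE; ++t) {
            // Each thread loads one element of each tile; consecutive
            // threadIdx.x values touch consecutive addresses, so the
            // loads can be coalesced.
            As[threadIdx.y][threadIdx.x] = A[row * width + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * width + col];
            __syncthreads();   // establish consistency of the loaded tiles

            // Partial dot product served entirely from shared memory.
            for (int k = 0; k < TILE; ++k)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();   // finish reads before the tiles are overwritten
        }
        C[row * width + col] = sum;
    }

Each global value loaded into shared memory is then read by 16 threads of the block, which is the source of the 16x reduction in global loads discussed below.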
The choice of tile shape and size is a key decision. For a given size, square tiles generally improve the computation-to-memory-access ratio by improving data locality among threads. Larger tile sizes increase data sharing and thus global memory efficiency; the tiling support code adds several overhead instructions per tile, which also makes larger sizes more efficient. On this architecture, developers also need to consider whether a tile size provides enough threads for good occupancy. Figure 4 shows the results of experimenting with different tile sizes. 4x4 tiles use only 16 threads, so half of the issue slots in a warp would go unused; this inefficiency, coupled with the 8 thread block limit per SM, causes performance to be worse than that of the non-tiled code. 8x8 tiles create thread blocks that occupy two warps, but would still need 12 thread blocks to fully occupy an SM, 50% more than the supported limit. 12x12 tiles use 144 threads, which is not an integral number of warps, and also require padding of the arrays to prevent overrun. 16x16 is the largest convenient size for this platform: we can schedule three thread blocks of 8 warps each, for the maximum of 768 threads, and global memory coalescing also happens naturally with this configuration. Other applications may have higher performance with smaller tile sizes when those sizes allow a larger number of threads to be scheduled.

The use of 16x16 tiles reduces global loads by a factor of 16 over the non-tiled configuration, so instead of the bandwidth requirement being roughly twice what is available, it is now approximately an eighth of it, removing bandwidth as the bottleneck to performance. The additional instructions reduce the potential throughput slightly below that of the original code. The 16x16 tiled version of matrix multiplication achieves 46.49 GFLOPS, or approximately 4.5X the execution throughput of the initial version. This is slightly higher than the estimated potential throughput of the original code, so it appears that the application achieves full usage of the SPs.
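As a rough check on this bandwidth argument, take the estimated 173 GB/s requirement of the non-tiled kernel and assume 86.4 GB/s of available global memory bandwidth (the "twice what is available" figure above):

\[
173\ \mathrm{GB/s} \approx 2 \times 86.4\ \mathrm{GB/s},
\qquad
\frac{173\ \mathrm{GB/s}}{16} \approx 10.8\ \mathrm{GB/s} \approx \frac{86.4\ \mathrm{GB/s}}{8}.
\]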