illustrates how to scale applications across a GPU-accelerated compute cluster by using CUDA-aware MPI with GPUDirect RDMA to realize near-linear performance scalability.

Chapter 10: Implementation Considerations discusses the CUDA development process and a variety of profile-driven optimization strategies. It demonstrates how to use CUDA debugging tools to debug kernel and memory errors. It also provides a case study in porting a legacy C application to CUDA C using step-by-step instructions to help solidify your understanding of the methodology, visualize the process, and demonstrate the tools.

WHAT YOU NEED TO USE THIS BOOK

This book does not require either GPU or parallel programming experience. Before you jump in, it would be best if you have basic experience working with Linux. To run all examples in the book, the ideal environment is:

➤ A Linux system
➤ A C/C++ compiler
➤ The CUDA 6.0 Toolkit
➤ An NVIDIA Kepler GPU

However, most examples will run on Fermi devices, though some examples using CUDA 6 features might require Kepler GPUs. Most of these examples can be compiled with CUDA 5.5.

CUDA TOOLKIT DOWNLOAD

You can download the CUDA 6.0 Toolkit from https://developer.nvidia.com/cuda-toolkit. The CUDA Toolkit includes a compiler for NVIDIA GPUs, CUDA math libraries, and tools for debugging and optimizing the performance of your applications. You will also find programming guides, user manuals, an API reference, and other documentation to help you start accelerating your application with GPUs.
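If you want to confirm which GPU generation your system has before you begin, the following minimal sketch can help. It is illustrative only and not part of the book's downloadable code; the file name check_device.cu is just an example. It queries the first CUDA device and prints its compute capability (Fermi devices report 2.x, Kepler devices report 3.x):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int dev = 0;
    cudaDeviceProp prop;

    // cudaGetDeviceProperties fails if no CUDA-capable device is present
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
    {
        printf("No CUDA-capable device detected\n");
        return 1;
    }

    // prop.major/prop.minor encode the compute capability
    printf("Device %d: %s (compute capability %d.%d)\n",
           dev, prop.name, prop.major, prop.minor);
    return 0;
}

Compile and run it with the toolkit's compiler:

$ nvcc check_device.cu -o check_device
$ ./check_device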
CONVENTIONS

To help you get the most from the text, we have used a number of conventions throughout the book.

We highlight new terms and important words when they are introduced.

We show file names, URLs, and code within the text like so: this_is_a_kernel_file.cu.

We present code in the following way:

// distributing jobs among devices
for (int i = 0; i < ngpus; i++) {
    cudaSetDevice(i);
    cudaMemcpyAsync(d_A[i], h_A[i], iBytes, cudaMemcpyDefault, stream[i]);
    cudaMemcpyAsync(d_B[i], h_B[i], iBytes, cudaMemcpyDefault, stream[i]);
    iKernel<<<grid, block, 0, stream[i]>>>(d_A[i], d_B[i], d_C[i], iSize);
    cudaMemcpyAsync(gpuRef[i], d_C[i], iBytes, cudaMemcpyDefault, stream[i]);
}

We introduce CUDA runtime functions in the following way:

cudaError_t cudaDeviceSynchronize(void);

We present the output of programs as follows:

./reduce starting reduction at device 0: Tesla M2070
    with array size 16777216  grid 32768 block 512
cpu reduce      elapsed 0.029138 sec cpu_sum: 2139353471
gpu Warmup      elapsed 0.011745 sec gpu_sum: 2139353471 <<<grid 32768 block 512>>>
gpu Neighbored  elapsed 0.011722 sec gpu_sum: 2139353471 <<<grid 32768 block 512>>>

We give command-line instructions as follows:

$ nvprof --devices 0 --metrics branch_efficiency ./reduce

SOURCE CODE

As you work through the examples in this book, you might choose either to type in all the code manually or to use the source code files that accompany the book. All of the source code used in this book is available for download at www.wrox.com/go/procudac. Once at the site, simply locate the book's title (either by using the Search box or by using one of the title lists) and click the Download Code link on the book's detail page to obtain all the source code for the book.

When you work on the exercises at the end of each chapter, we highly encourage you to try to write them yourself by referencing the example codes. All the exercise code files are also downloadable from the Wrox website.

ERRATA

We make every effort to ensure that there are no errors in the text or in the code. However, no one is perfect, and mistakes do occur. If you find an error in one of our books, like a spelling mistake or a faulty piece of code, we would be very grateful for your feedback. By sending in errata, you might save another reader hours of frustration, and at the same time you will be helping us provide even higher-quality information.
To find the errata page for this book, go to www.wrox.com/go/procudac. Then, on the book's details page, click the Book Errata link. On this page you can view all errata that have been submitted for this book and posted by Wrox editors.

P2P.WROX.COM

For author and peer discussion, join the P2P forums at p2p.wrox.com. The forums are a web-based system for you to post messages relating to Wrox books and related technologies and to interact with other readers and technology users. The forums offer a subscription feature that e-mails you when new posts are made to topics of your choosing. Wrox authors, editors, other industry experts, and your fellow readers are present on these forums.

At http://p2p.wrox.com you will find a number of different forums that will help you not only as you read this book, but also as you develop your own applications. To join the forums, just follow these steps:

1. Go to p2p.wrox.com and click the Register link.
2. Read the terms of use and click Agree.
3. Complete the required information to join, as well as any optional information you wish to provide, and click Submit.
4. You will receive an e-mail with information describing how to verify your account and complete the joining process.

You can read messages in the forums without joining P2P, but in order to post your own messages, you must join. Once you join, you can post new messages and respond to messages other users post. You can read messages at any time on the web. If you would like to have new messages from a particular forum sent to your e-mail address, click the "Subscribe to this Forum" icon by the forum name in the forum listing.

For more information about how to use the Wrox P2P, be sure to read the P2P FAQs for answers to questions about how the forum software works, as well as many common questions specific to P2P and Wrox books. To read the FAQs, click the FAQ link on any P2P page.

USEFUL LINKS

GTC On-Demand: http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php
GTC Express Webinar Program: https://developer.nvidia.com/gpu-computing-webinars
Developer Zone: www.gputechconf.com/resources/developer-zone
NVIDIA Parallel Programming Blog: http://devblogs.nvidia.com/parallelforall
NVIDIA Developer Zone Forums: devtalk.nvidia.com
NVIDIA support e-mail: devtools-support@nvidia.com
1
Heterogeneous Parallel Computing with CUDA

WHAT'S IN THIS CHAPTER?

➤ Understanding heterogeneous computing architectures
➤ Recognizing the paradigm shift of parallel programming
➤ Grasping the basic elements of GPU programming
➤ Knowing the differences between CPU and GPU programming

CODE DOWNLOAD

The wrox.com code downloads for this chapter are found at www.wrox.com/go/procudac on the Download Code tab. The code is in the Chapter 1 download and individually named according to the names throughout the chapter.

The high-performance computing (HPC) landscape is always changing as new technologies and processes become commonplace, and the definition of HPC changes accordingly. In general, it pertains to the use of multiple processors or computers to accomplish a complex task concurrently with high throughput and efficiency. It is common to consider HPC as not only a computing architecture but also as a set of elements, including hardware systems, software tools, programming platforms, and parallel programming paradigms.

Over the last decade, high-performance computing has evolved significantly, particularly because of the emergence of GPU-CPU heterogeneous architectures, which have led to a fundamental paradigm shift in parallel programming. This chapter begins your understanding of heterogeneous parallel programming.
PARALLEL COMPUTING

During the past several decades, there has been ever-increasing interest in parallel computation. The primary goal of parallel computing is to improve the speed of computation.

From a pure calculation perspective, parallel computing can be defined as a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently.

From the programmer's perspective, a natural question is how to map the concurrent calculations onto computers. Suppose you have multiple computing resources. Parallel computing can then be defined as the simultaneous use of multiple computing resources (cores or computers) to perform the concurrent calculations. A large problem is broken down into smaller ones, and each smaller one is then solved concurrently on different computing resources (see the short code sketch following Figure 1-1). The software and hardware aspects of parallel computing are closely intertwined. In fact, parallel computing usually involves two distinct areas of computing technologies:

➤ Computer architecture (hardware aspect)
➤ Parallel programming (software aspect)

Computer architecture focuses on supporting parallelism at an architectural level, while parallel programming focuses on solving a problem concurrently by fully using the computational power of the computer architecture. In order to achieve parallel execution in software, the hardware must provide a platform that supports concurrent execution of multiple processes or multiple threads.

Most modern processors implement the Harvard architecture, as shown in Figure 1-1, which is composed of three main components:

➤ Memory (instruction memory and data memory)
➤ Central processing unit (control unit and arithmetic logic unit)
➤ Input/Output interfaces

FIGURE 1-1: The Harvard architecture. The CPU (control unit and arithmetic logic unit) connects to separate instruction and data memories and to input/output interfaces.
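To make the decomposition idea concrete, here is a minimal CUDA C sketch. It is illustrative only, not one of the book's downloadable examples; the kernel name vecAdd and the array size are arbitrary choices. One large problem, adding two vectors of 1,048,576 elements, is broken into independent one-element additions, each solved concurrently by its own GPU thread:

#include <stdio.h>
#include <cuda_runtime.h>

// Each thread handles one element: the large problem (n additions)
// is decomposed into n small, independent calculations.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // allocate and initialize host arrays
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // allocate device arrays and copy the inputs to the device
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // launch enough 512-thread blocks to cover all n elements
    int block = 512;
    int grid = (n + block - 1) / block;
    vecAdd<<<grid, block>>>(d_a, d_b, d_c, n);

    // copy the result back and check one element (expect 3.0)
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("h_c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

Compile and run it with the CUDA toolkit compiler:

$ nvcc vecAdd.cu -o vecAdd
$ ./vecAdd

Each element's result is computed by the thread whose global index i matches that element, which is exactly the principle described above: a large problem divided into smaller ones that are solved concurrently.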