CONTENTS

Coalescing Global Memory Accesses 239
Baseline Transpose Kernel 240
Matrix Transpose with Shared Memory 241
Matrix Transpose with Padded Shared Memory 245
Matrix Transpose with Unrolling 246
Exposing More Parallelism 249
Constant Memory 250
Implementing a 1D Stencil with Constant Memory 250
Comparing with the Read-Only Cache 253
The Warp Shuffle Instruction 255
Variants of the Warp Shuffle Instruction 256
Sharing Data within a Warp 258
Parallel Reduction Using the Warp Shuffle Instruction 262
Summary 264

CHAPTER 6: STREAMS AND CONCURRENCY 267

Introducing Streams and Events 268
CUDA Streams 269
Stream Scheduling 271
Stream Priorities 273
CUDA Events 273
Stream Synchronization 275
Concurrent Kernel Execution 279
Concurrent Kernels in Non-NULL Streams 279
False Dependencies on Fermi GPUs 281
Dispatching Operations with OpenMP 283
Adjusting Stream Behavior Using Environment Variables 284
Concurrency-Limiting GPU Resources 286
Blocking Behavior of the Default Stream 287
Creating Inter-Stream Dependencies 288
Overlapping Kernel Execution and Data Transfer 289
Overlap Using Depth-First Scheduling 289
Overlap Using Breadth-First Scheduling 293
Overlapping GPU and CPU Execution 294
Stream Callbacks 295
Summary 297
CHAPTER 7: TUNING INSTRUCTION-LEVEL PRIMITIVES 299

Introducing CUDA Instructions 300
Floating-Point Instructions 301
Intrinsic and Standard Functions 303
Atomic Instructions 304
Optimizing Instructions for Your Application 306
Single-Precision vs. Double-Precision 306
Standard vs. Intrinsic Functions 309
Understanding Atomic Instructions 315
Bringing It All Together 322
Summary 324

CHAPTER 8: GPU-ACCELERATED CUDA LIBRARIES AND OPENACC 327

Introducing the CUDA Libraries 328
Supported Domains for CUDA Libraries 329
A Common Library Workflow 330
The cuSPARSE Library 332
cuSPARSE Data Storage Formats 333
Formatting Conversion with cuSPARSE 337
Demonstrating cuSPARSE 338
Important Topics in cuSPARSE Development 340
cuSPARSE Summary 341
The cuBLAS Library 341
Managing cuBLAS Data 342
Demonstrating cuBLAS 343
Important Topics in cuBLAS Development 345
cuBLAS Summary 346
The cuFFT Library 346
Using the cuFFT API 347
Demonstrating cuFFT 348
cuFFT Summary 349
The cuRAND Library 349
Choosing Pseudo- or Quasi-Random Numbers 349
Overview of the cuRAND Library 350
Demonstrating cuRAND 354
Important Topics in cuRAND Development 357
CUDA Library Features Introduced in CUDA 6 358
Drop-In CUDA Libraries 358
Multi-GPU Libraries 359
A Survey of CUDA Library Performance 361
cuSPARSE versus MKL 361
cuBLAS versus MKL BLAS 362
cuFFT versus FFTW versus MKL 363
CUDA Library Performance Summary 364
Using OpenACC 365
Using OpenACC Compute Directives 367
Using OpenACC Data Directives 375
The OpenACC Runtime API 380
Combining OpenACC and the CUDA Libraries 382
Summary of OpenACC 384
Summary 384

CHAPTER 9: MULTI-GPU PROGRAMMING 387

Moving to Multiple GPUs 388
Executing on Multiple GPUs 389
Peer-to-Peer Communication 391
Synchronizing across Multi-GPUs 392
Subdividing Computation across Multiple GPUs 393
Allocating Memory on Multiple Devices 393
Distributing Work from a Single Host Thread 394
Compiling and Executing 395
Peer-to-Peer Communication on Multiple GPUs 396
Enabling Peer-to-Peer Access 396
Peer-to-Peer Memory Copy 396
Peer-to-Peer Memory Access with Unified Virtual Addressing 398
Finite Difference on Multi-GPU 400
Stencil Calculation for 2D Wave Equation 400
Typical Patterns for Multi-GPU Programs 401
2D Stencil Computation with Multiple GPUs 403
Overlapping Computation and Communication 405
Compiling and Executing 406
Scaling Applications across GPU Clusters 409
CPU-to-CPU Data Transfer 410
GPU-to-GPU Data Transfer Using Traditional MPI 413
GPU-to-GPU Data Transfer with CUDA-Aware MPI 416
Intra-Node GPU-to-GPU Data Transfer with CUDA-Aware MPI 417
Adjusting Message Chunk Size 418
GPU-to-GPU Data Transfer with GPUDirect RDMA 419
Summary 422

CHAPTER 10: IMPLEMENTATION CONSIDERATIONS 425

The CUDA C Development Process 426
APOD Development Cycle 426
Optimization Opportunities 429
CUDA Code Compilation 432
CUDA Error Handling 437
Profile-Driven Optimization 438
Finding Optimization Opportunities Using nvprof 439
Guiding Optimization Using nvvp 443
NVIDIA Tools Extension 446
CUDA Debugging 448
Kernel Debugging 448
Memory Debugging 456
Debugging Summary 462
A Case Study in Porting C Programs to CUDA C 462
Assessing crypt 463
Parallelizing crypt 464
Optimizing crypt 465
Deploying crypt 472
Summary of Porting crypt 475
Summary 476

APPENDIX: SUGGESTED READINGS 477

INDEX 481
FOREWORD

GPUs have come a long way. From their origins as specialized graphics processors that could rapidly produce images for output to a display unit, they have become a go-to technology when ultrafast processing is needed. In the past few years, GPUs have increasingly been attached to CPUs to accelerate a broad array of computations in so-called heterogeneous computing. Today, GPUs are configured on many desktop systems, on compute clusters, and even on many of the largest supercomputers in the world. In their extended role as a provider of large amounts of compute power for technical computing, GPUs have enabled advances in science and engineering in a broad variety of disciplines. They have done so by making it possible for huge numbers of compute cores to work in parallel while keeping the power budgets very reasonable.

Fortunately, the interfaces for programming GPUs have kept up with this rapid change. In the past, a major effort was required to use them for anything outside the narrow range of applications they were intended for, and the GPU programmer needed to be familiar with many concepts that made good sense only to the graphics programmer. Today's systems provide a much more convenient means to create application software that will run on them. In short, we have CUDA.

CUDA is one of the most popular application programming interfaces for accelerating a range of compute kernels on the GPU. It can enable code written in C or C++ to run efficiently on a GPU with very reasonable programming effort. It strikes a balance between the need to know about the architecture in order to exploit it well, and the need to have a programming interface that is easy to use and results in readable programs.
This book will be a valuable resource for anyone who wants to use GPUs for scientific and technical programming. It provides a comprehensive introduction to the CUDA programming interface and its usage. For a start, it describes the basics of parallel computing on heterogeneous architectures and introduces the features of CUDA. It then explains how CUDA programs are executed. CUDA exposes the execution and memory model to the programmer; as a result, the CUDA programmer has direct control of the massively parallel environment. In addition to giving details of the CUDA memory model, the text provides a wealth of information on how it can be utilized. The following chapter discusses streams, as well as how to execute concurrent and overlapping kernels. Next comes information on tuning, on using CUDA libraries, and on using OpenACC directives to program GPUs. After a chapter on multi-GPU programming, the book concludes by discussing some implementation considerations. Moreover, a variety of examples are given to help the reader get started, many of which can be downloaded and executed.

CUDA provides a nice balance between expressivity and programmability that has proven itself in practice. However, those of us who have made it their mission to simplify application development know that this is an ongoing story. For the past few years, CUDA researchers have worked to improve heterogeneous programming tools. CUDA 6 introduces many new features, including unified memory and plug-in libraries, to make GPU programming even easier. They have also provided a set of directives called OpenACC, which is introduced in this book. OpenACC promises to