FOREWORD

complement CUDA by offering an even simpler means to exploit GPU programming power when less direct control over the execution is needed. Results so far are very promising. OpenACC, CUDA 6, and other topics covered in this book will allow CUDA developers to accelerate their applications for more performance than ever. This book will need to have a permanent place on your bookshelf.

Happy programming!

BARBARA CHAPMAN
CACDS and Department of Computer Science
University of Houston
PREFACE

Years ago, when we were porting our production code from legacy C programs to CUDA C, we encountered many of the troubles any beginner does: problems whose solutions were far beyond what you could dig out of a simple web search. At that time, we thought that it would be great if there were a book written by programmers, for programmers, that focused on what programmers need for production CUDA development. Fulfilling that need with lessons from our own experiences in CUDA is the motivation for this book. This book is specially designed to address the needs of the high-performance and scientific computing communities.

When learning a new framework or programming language, most programmers grab a piece of code from anywhere, test it, and then build up their own code based on that trial. Learning by example with a trial-and-error approach is a quintessential learning technique for many software developers. This book is designed to fit these habits. Each chapter focuses on one topic, using concise explanations to provide foundational knowledge and illustrating concepts with simple, fully workable code samples. Learning concepts and code side by side empowers you to quickly start experimenting with these topics. This book uses a profile-driven approach to guide you deeper and deeper into each topic.

The major difference between parallel programming in C and parallel programming in CUDA C is that CUDA architectural features, such as the memory and execution models, are exposed directly to programmers. This enables you to have more control over the massively parallel GPU environment. Even though some still consider CUDA concepts to be low-level, having some knowledge of the underlying architecture is a necessity for harnessing the power of GPUs. That said, the CUDA platform can perform well even if you have limited knowledge of the architecture.

Parallel programming is always motivated by performance and driven by profiling. CUDA programming is unique in that the exposed architectural features enable you, the programmer, to extract every iota of performance from this powerful hardware platform, if you so choose. After you have mastered the skills taught through the exercises provided in this book, you will find that programming in CUDA C is easy, enjoyable, and rewarding.
INTRODUCTION

WELCOME TO THE WONDERFUL WORLD of heterogeneous parallel programming with CUDA C!

Modern heterogeneous systems are evolving toward a future of intriguing computational possibilities. Heterogeneous computing is constantly being applied to new fields of computation: everything from science to databases to machine learning. The future of programming is heterogeneous parallel programming!

This book gets you started quickly with GPU (Graphics Processing Unit) computing using the CUDA platform, CUDA Toolkit, and CUDA C language. The examples and exercises in this book are designed to jump-start your CUDA expertise to a professional level!

WHO THIS BOOK IS FOR

This book is for anyone who wants to leverage the power of GPU computing to accelerate applications. It covers the most up-to-date technologies in CUDA C programming, with a focus on:

➤ Concise style
➤ Straightforward approach
➤ Illustrative description
➤ Extensive examples
➤ Deliberately designed exercises
➤ Comprehensive coverage
➤ Content well-focused for the needs of high-performance computing

If you are an experienced C programmer who wants to add high-performance computing to your repertoire by learning CUDA C, the examples and exercises in the book will build on your existing knowledge so as to simplify mastering CUDA C programming. Using just a handful of CUDA extensions to C, you can benefit from the power of massively parallel hardware, as the short sketch at the end of the next section illustrates. The CUDA platform, programming models, tools, and libraries make programming heterogeneous architectures straightforward and immediately rewarding.

If you are a professional with domain expertise outside of computer science who wants to quickly get up to speed with parallel programming on GPUs, maximize your productivity, and enhance the performance of your applications, you have picked the right book. The clear and concise explanations in this book, supported by well-designed examples and guided by a profile-driven approach, will help you gain insight into GPU programming and quickly become proficient with CUDA.
If you are a professor or a researcher in any discipline and wish to accelerate discovery and innovation through GPU computing, this book will improve your time-to-solution. Even with minimal past programming experience and little background in parallel computing or computer science, you can quickly dive into the exciting world of parallel programming with heterogeneous architectures.

If you are new to C but are interested in exploring heterogeneous programming, this book does not assume copious amounts of experience in C programming. While the CUDA C and C programming languages obviously share some syntax, the abstractions and underlying hardware for each are different enough that experience with one does not make the other significantly easier to learn. As long as you have an interest in heterogeneous programming, are excited about new topics and new ways of thinking, and have a passion for a deep understanding of technical topics, this book is a great fit for you.

Even if you have experience with CUDA C, this book can still be a useful tool to refresh your knowledge, discover new tools, and gain insight into the latest CUDA features. While this book is designed to create CUDA professionals from scratch, it also provides a comprehensive overview of many advanced CUDA concepts, tools, and frameworks that will benefit existing CUDA developers.

WHAT THIS BOOK COVERS

This book provides foundational concepts and techniques of CUDA C programming for people who need to drastically accelerate the performance of their applications. This book covers the newest features released with CUDA Toolkit 6.0 and NVIDIA Kepler GPUs. After briefly introducing the paradigm shift in parallel programming from homogeneous architectures to heterogeneous architectures, this book guides you through essential programming skills and best practices in CUDA, including but not limited to the CUDA programming model, the GPU execution model, the GPU memory model, CUDA streams and events, techniques for programming multiple GPUs, CUDA-aware MPI programming, and NVIDIA development tools.

This book takes a unique approach to teaching CUDA by mingling foundational descriptions of concepts with illustrative examples that use a profile-driven approach to guide you toward optimal performance. Each topic is thoroughly covered in a step-by-step process based heavily on code examples. This book will help you quickly master the CUDA development process by teaching you not only how to use CUDA-based tools, but also how to interpret results in each step of the development process based on insights and intuitions from the abstract programming model.

Each chapter handles one main topic with workable code examples to demonstrate the basic features and techniques of GPU programming, followed by well-designed exercises that facilitate your exploration of each topic to deepen your understanding.

All examples are developed on a Linux system with CUDA 5.0 or higher and a Kepler or Fermi GPU. Since CUDA C is a cross-platform language, examples in the book are also applicable to other platforms, such as embedded systems, tablets, notebooks, PCs, workstations, and high-performance computing servers. Many OEM suppliers support NVIDIA GPUs in a variety of form factors.
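To give a concrete taste of the workable examples to come, and of what "just a handful of CUDA extensions to C" means in practice, the following is a minimal sketch of a complete CUDA C program that adds two vectors on the GPU. It is not one of the book's listings; the kernel name, array size, and launch configuration are chosen purely for illustration, and error checking is omitted for brevity. The only non-C elements are the __global__ qualifier, the <<<...>>> launch syntax, the built-in thread-index variables, and a few cuda* runtime calls.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Illustrative kernel: each thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void)
{
    // Report the GPU in use; Fermi devices are compute capability 2.x, Kepler 3.x.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Using %s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);

    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Allocate and initialize host arrays.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Allocate device memory and copy the inputs to the GPU.
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Copy the result back and spot-check one element.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

On the Linux setup described above, a source file like this compiles with a single command (for example, nvcc vecadd.cu -o vecadd); everything outside the kernel and the cuda* calls is ordinary C.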
HOW THIS BOOK IS STRUCTURED

This book consists of ten chapters, and covers the following topics:

Chapter 1: Heterogeneous Parallel Computing with CUDA begins with a brief introduction to the heterogeneous architecture that complements CPUs with GPUs, as well as the paradigm shift towards heterogeneous parallel programming.

Chapter 2: CUDA Programming Model introduces the CUDA programming model and the general structure of a CUDA program. It explains the logical view of massively parallel computing in CUDA: two levels of thread hierarchy exposed intuitively through the programming model. It also discusses thread configuration heuristics and their impact on performance.

Chapter 3: CUDA Execution Model inspects kernel execution from the hardware point of view by studying how thousands of threads are scheduled on a GPU. It explains how compute resources are partitioned among threads at multiple granularities. It also shows how the hardware view can be used to guide kernel design, and guides you in developing and optimizing a kernel using a profile-driven approach. Then, CUDA dynamic parallelism and nested execution are illustrated with examples.

Chapter 4: Global Memory introduces the CUDA memory model, probes the global memory data layout, and analyzes access patterns to global memory. This chapter explains the performance implications of various memory access patterns and demonstrates how a new feature in CUDA 6, Unified Memory, can simplify CUDA programming and improve your productivity.

Chapter 5: Shared Memory and Constant Memory explains how shared memory, a program-managed low-latency cache, can be used to improve kernel performance. It describes the optimal data layout for shared memory and illustrates how to avoid poor performance. Last, it illustrates how to perform low-latency communication between neighboring threads.

Chapter 6: Streams and Concurrency describes how multi-kernel concurrency can be implemented with CUDA streams, how to overlap communication and computation, and how different job dispatching strategies affect inter-kernel concurrency.

Chapter 7: Tuning Instruction-Level Primitives explains the nature of floating-point operations, standard and intrinsic mathematical functions, and CUDA atomic operations. It shows how to use relatively low-level CUDA primitives and compiler flags to tune the performance, accuracy, and correctness of an application.

Chapter 8: GPU-Accelerated CUDA Libraries and OpenACC introduces a new level of parallelism with CUDA domain-specific libraries, including specific examples in linear algebra, Fourier transforms, and random number generation. It explains how OpenACC, a compiler-directive-based GPU programming model, complements CUDA by offering a simpler means to exploit GPU computational power.

Chapter 9: Multi-GPU Programming introduces GPUDirect technology for peer-to-peer GPU memory access. It explains how to manage and execute computation across multiple GPUs. It also