Chapters 4 through 7 are designed to give the readers a more in-depth understanding of the CUDA programming model. Chapter 4 covers the thread organization and execution model required to fully understand the execution behavior of threads and basic performance concepts. Chapter 5 is dedicated to the special memories that can be used to hold CUDA variables for improved program execution speed. Chapter 6 introduces the major factors that contribute to the performance of a CUDA kernel function. Chapter 7 introduces the floating-point representation and concepts such as precision and accuracy. Although these chapters are based on CUDA, they help the readers build a foundation for parallel programming in general. We believe that humans understand best when we learn from the bottom up; that is, we must first learn the concepts in the context of a particular programming model, which provides us with a solid footing to generalize our knowledge to other programming models. As we do so, we can draw on our concrete experience from the CUDA model. In-depth experience with the CUDA model also enables us to gain maturity, which will help us learn concepts that may not even be pertinent to the CUDA model.

Chapters 8 and 9 are case studies of two real applications, which take the readers through the thought processes of parallelizing and optimizing their applications for significant speedups. For each application, we begin by identifying alternative ways of formulating the basic structure of the parallel execution and follow up with reasoning about the advantages and disadvantages of each alternative. We then go through the steps of code transformation necessary to achieve high performance. These two chapters help the readers put all the materials from the previous chapters together and prepare for their own application development projects.

Chapter 10 generalizes the parallel programming techniques into problem decomposition principles, algorithm strategies, and computational thinking. It does so by covering the concept of organizing the computation tasks of a program so that they can be done in parallel. We begin by discussing the translational process of organizing abstract scientific concepts into computational tasks, an important first step in producing quality application software, serial or parallel. The chapter then addresses parallel algorithm structures and their effects on application performance, grounded in our performance tuning experience with CUDA. The chapter concludes with a treatment of parallel programming styles and models, allowing the readers to place their knowledge in a wider context. With this chapter, the readers can begin to generalize from the SPMD programming style to other styles of parallel programming, such as loop parallelism in OpenMP
and fork–join in Pthreads programming. Although we do not go into these alternative parallel programming styles, we expect that the readers will be able to learn to program in any of them with the foundation gained in this book.
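As a brief illustration of one such alternative style (our own hedged sketch, not an example from the book's chapters), the following C program applies OpenMP loop parallelism to a simple vector addition. The pragma asks an OpenMP-enabled compiler (e.g., gcc -fopenmp) to divide the loop iterations among a team of threads; the array size and values are arbitrary.

#include <stdio.h>

#define N 1000000

static float a[N], b[N], c[N];

int main(void) {
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* Loop parallelism: the OpenMP runtime forks a team of threads,
       splits the iterations of the following loop among them, and
       joins the threads when the loop completes. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[42] = %f\n", c[42]);  /* expect 126.0 */
    return 0;
}

Note how the parallelism is expressed as an annotation on a loop over data, rather than through the explicit per-thread index computation of the SPMD style used throughout this book.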
Chapter 11 introduces the OpenCL programming model from a CUDA programmer's perspective. The reader will find OpenCL to be extremely similar to CUDA. The most important difference arises from OpenCL's use of API functions to implement functionalities such as kernel launching and thread identification. The use of API functions makes OpenCL more tedious to use; nevertheless, a CUDA programmer has all the knowledge and skills necessary to understand and write OpenCL programs. In fact, we believe that the best way to teach OpenCL programming is to teach CUDA first. We demonstrate this with a chapter that relates all major OpenCL features to their corresponding CUDA features. We also illustrate the use of these features by adapting our simple CUDA examples into OpenCL.
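To make the contrast concrete, here is a hedged sketch of the same vector addition in both models; it is our illustration, not code from Chapter 11. In CUDA, thread identity comes from built-in variables and the kernel is launched with the <<< >>> language extension:

#include <stdio.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* built-in variables */
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1024;
    float h_a[1024], h_b[1024], h_c[1024];
    for (int i = 0; i < n; i++) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, n * sizeof(float));
    cudaMalloc((void **)&d_b, n * sizeof(float));
    cudaMalloc((void **)&d_c, n * sizeof(float));
    cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);

    /* Kernel launch is a language extension, not an API call. */
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h_c[42] = %f\n", h_c[42]);  /* expect 126.0 */
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

In OpenCL, the equivalent kernel obtains its identity through an API function, and the host launches it through API calls such as clSetKernelArg() and clEnqueueNDRangeKernel() (host code omitted for brevity):

__kernel void vecAdd(__global const float *a, __global const float *b,
                     __global float *c, int n) {
    /* An API function replaces the CUDA built-in variables. */
    int i = get_global_id(0);
    if (i < n) c[i] = a[i] + b[i];
}

The extra host-side API calls are precisely the tedium the chapter refers to.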
Chapter 12 offers some concluding remarks and an outlook for the future of massively parallel programming. We revisit our goals and summarize how the chapters fit together to help achieve them. We then present a brief survey of the major trends in the architecture of massively parallel processors and how these trends will likely impact parallel programming in the future. We conclude with a prediction that these fast advances in massively parallel computing will make it one of the most exciting areas in the coming decade.

References and Further Reading

Hwu, W. W., Keutzer, K., & Mattson, T. (2008). The concurrency challenge. IEEE Design and Test of Computers, July/August, 312–320.

Khronos Group. (2009). The OpenCL Specification Version 1.0. Beaverton, OR: Khronos Group. (http://www.khronos.org/registry/cl/specs/opencl-1.0.29.pdf).

Mattson, T. G., Sanders, B. A., & Massingill, B. L. (2004). Patterns for parallel programming. Upper Saddle River, NJ: Addison-Wesley.

Message Passing Interface Forum. (2009). MPI: A Message-Passing Interface Standard, Version 2.2. Knoxville: University of Tennessee. (http://www.mpi-forum.org/docs/mpi-2.2/mpi22-report.pdf).

NVIDIA. (2007). CUDA programming guide. Santa Clara, CA: NVIDIA Corp.

OpenMP Architecture Review Board. (2005). OpenMP Application Program Interface. (http://www.openmp.org/mp-documents/spec25.pdf).

Sutter, H., & Larus, J. (2005). Software and the concurrency revolution. ACM Queue, 3(7), 54–62.

von Neumann, J. (1945). First draft of a report on the EDVAC. Contract No. W-670-ORD-4926, U.S. Army Ordnance Department and University of Pennsylvania (reproduced in Goldstine, H. H. (Ed.), (1972). The computer: From Pascal to von Neumann. Princeton, NJ: Princeton University Press).

Wing, J. (2006). Computational thinking. Communications of the ACM, 49(3), 33–35.
CHAPTER 2

History of GPU Computing

CHAPTER CONTENTS
2.1 Evolution of Graphics Pipelines 21
    2.1.1 The Era of Fixed-Function Graphics Pipelines 22
    2.1.2 Evolution of Programmable Real-Time Graphics 26
    2.1.3 Unified Graphics and Computing Processors 29
    2.1.4 GPGPU: An Intermediate Step 31
2.2 GPU Computing 32
    2.2.1 Scalable GPUs 33
    2.2.2 Recent Developments 34
2.3 Future Trends 34
References and Further Reading 35

INTRODUCTION

To CUDA and OpenCL programmers, graphics processing units (GPUs) are massively parallel numeric computing processors programmed in C with extensions. One need not understand graphics algorithms or terminology in order to program these processors. However, understanding the graphics heritage of these processors illuminates their strengths and weaknesses with respect to major computational patterns. In particular, the history helps to clarify the rationale behind major architectural design decisions of modern programmable GPUs: massive multithreading, relatively small cache memories compared to those of central processing units (CPUs), and bandwidth-centric memory interface design. Insights into the historical developments will also likely give the reader the context needed to project the future evolution of GPUs as computing devices.

2.1 EVOLUTION OF GRAPHICS PIPELINES

Three-dimensional (3D) graphics pipeline hardware evolved from the large, expensive systems of the early 1980s to small workstations and then PC accelerators in the mid- to late 1990s. During this period, the performance-