Contents

Preface .... xi
Acknowledgments .... xvii
Dedication .... xix

CHAPTER 1  INTRODUCTION .... 1
1.1 GPUs as Parallel Computers .... 2
1.2 Architecture of a Modern GPU .... 8
1.3 Why More Speed or Parallelism? .... 10
1.4 Parallel Programming Languages and Models .... 13
1.5 Overarching Goals .... 15
1.6 Organization of the Book .... 16

CHAPTER 2  HISTORY OF GPU COMPUTING .... 21
2.1 Evolution of Graphics Pipelines .... 21
    2.1.1 The Era of Fixed-Function Graphics Pipelines .... 22
    2.1.2 Evolution of Programmable Real-Time Graphics .... 26
    2.1.3 Unified Graphics and Computing Processors .... 29
    2.1.4 GPGPU: An Intermediate Step .... 31
2.2 GPU Computing .... 32
    2.2.1 Scalable GPUs .... 33
    2.2.2 Recent Developments .... 34
2.3 Future Trends .... 34

CHAPTER 3  INTRODUCTION TO CUDA .... 39
3.1 Data Parallelism .... 39
3.2 CUDA Program Structure .... 41
3.3 A Matrix–Matrix Multiplication Example .... 42
3.4 Device Memories and Data Transfer .... 46
3.5 Kernel Functions and Threading .... 51
3.6 Summary .... 56
    3.6.1 Function declarations .... 56
    3.6.2 Kernel launch .... 56
    3.6.3 Predefined variables .... 56
    3.6.4 Runtime API .... 57

CHAPTER 4  CUDA THREADS .... 59
4.1 CUDA Thread Organization .... 59
4.2 Using blockIdx and threadIdx .... 64
4.3 Synchronization and Transparent Scalability .... 68
4.4 Thread Assignment .... 70
4.5 Thread Scheduling and Latency Tolerance .... 71
4.6 Summary .... 74
4.7 Exercises .... 74

CHAPTER 5  CUDA MEMORIES .... 77
5.1 Importance of Memory Access Efficiency .... 78
5.2 CUDA Device Memory Types .... 79
5.3 A Strategy for Reducing Global Memory Traffic .... 83
5.4 Memory as a Limiting Factor to Parallelism .... 90
5.5 Summary .... 92
5.6 Exercises .... 93

CHAPTER 6  PERFORMANCE CONSIDERATIONS .... 95
6.1 More on Thread Execution .... 96
6.2 Global Memory Bandwidth .... 103
6.3 Dynamic Partitioning of SM Resources .... 111
6.4 Data Prefetching .... 113
6.5 Instruction Mix .... 115
6.6 Thread Granularity .... 116
6.7 Measured Performance and Summary .... 118
6.8 Exercises .... 120

CHAPTER 7  FLOATING POINT CONSIDERATIONS .... 125
7.1 Floating-Point Format .... 126
    7.1.1 Normalized Representation of M .... 126
    7.1.2 Excess Encoding of E .... 127
7.2 Representable Numbers .... 129
7.3 Special Bit Patterns and Precision .... 134
7.4 Arithmetic Accuracy and Rounding .... 135
7.5 Algorithm Considerations .... 136
7.6 Summary .... 138
7.7 Exercises .... 138

CHAPTER 8  APPLICATION CASE STUDY: ADVANCED MRI RECONSTRUCTION .... 141
8.1 Application Background .... 142
8.2 Iterative Reconstruction .... 144
8.3 Computing F^H d .... 148
    Step 1. Determine the Kernel Parallelism Structure .... 149
    Step 2. Getting Around the Memory Bandwidth Limitation .... 156
    Step 3. Using Hardware Trigonometry Functions .... 163
    Step 4. Experimental Performance Tuning .... 166
8.4 Final Evaluation .... 167
8.5 Exercises .... 170

CHAPTER 9  APPLICATION CASE STUDY: MOLECULAR VISUALIZATION AND ANALYSIS .... 173
9.1 Application Background .... 174
9.2 A Simple Kernel Implementation .... 176
9.3 Instruction Execution Efficiency .... 180
9.4 Memory Coalescing .... 182
9.5 Additional Performance Comparisons .... 185
9.6 Using Multiple GPUs .... 187
9.7 Exercises .... 188

CHAPTER 10  PARALLEL PROGRAMMING AND COMPUTATIONAL THINKING .... 191
10.1 Goals of Parallel Programming .... 192
10.2 Problem Decomposition .... 193
10.3 Algorithm Selection .... 196
10.4 Computational Thinking .... 202
10.5 Exercises .... 204

CHAPTER 11  A BRIEF INTRODUCTION TO OPENCL .... 205
11.1 Background .... 205
11.2 Data Parallelism Model .... 207
11.3 Device Architecture .... 209
11.4 Kernel Functions .... 211
11.5 Device Management and Kernel Launch .... 212
11.6 Electrostatic Potential Map in OpenCL .... 214
11.7 Summary .... 219
11.8 Exercises .... 220

CHAPTER 12  CONCLUSION AND FUTURE OUTLOOK .... 221
12.1 Goals Revisited .... 221
12.2 Memory Architecture Evolution .... 223
    12.2.1 Large Virtual and Physical Address Spaces .... 223
    12.2.2 Unified Device Memory Space .... 224
    12.2.3 Configurable Caching and Scratch Pad .... 225
    12.2.4 Enhanced Atomic Operations .... 226
    12.2.5 Enhanced Global Memory Access .... 226
12.3 Kernel Execution Control Evolution .... 227
    12.3.1 Function Calls within Kernel Functions .... 227
    12.3.2 Exception Handling in Kernel Functions .... 227
    12.3.3 Simultaneous Execution of Multiple Kernels .... 228
    12.3.4 Interruptible Kernels .... 228
12.4 Core Performance .... 229
    12.4.1 Double-Precision Speed .... 229
    12.4.2 Better Control Flow Efficiency .... 229
12.5 Programming Environment .... 230
12.6 A Bright Outlook .... 230

APPENDIX A  MATRIX MULTIPLICATION HOST-ONLY VERSION SOURCE CODE .... 233
A.1 matrixmul.cu .... 233
A.2 matrixmul_gold.cpp .... 237
A.3 matrixmul.h .... 238
A.4 assist.h .... 239
A.5 Expected Output .... 243

APPENDIX B  GPU COMPUTE CAPABILITIES .... 245
B.1 GPU Compute Capability Tables .... 245
B.2 Memory Coalescing Variations .... 246

Index .... 251
Preface

WHY WE WROTE THIS BOOK

Mass-market computing systems that combine multicore CPUs and manycore GPUs have brought terascale computing to the laptop and petascale computing to clusters. Armed with such computing power, we are at the dawn of pervasive use of computational experiments for science, engineering, health, and business disciplines. Many will be able to achieve breakthroughs in their disciplines using computational experiments of an unprecedented level of scale, controllability, and observability. This book provides a critical ingredient for that vision: teaching parallel programming to millions of graduate and undergraduate students so that computational thinking and parallel programming skills will be as pervasive as calculus.

We started with a course now known as ECE498AL. During the Christmas holiday of 2006, we were frantically working on the lecture slides and lab assignments. David was working the system trying to pull the early GeForce 8800 GTX GPU cards from customer shipments to Illinois, which would not succeed until a few weeks after the semester began. It also became clear that CUDA would not become public until a few weeks after the start of the semester. We had to work out the legal agreements so that we could offer the course to students under NDA for the first few weeks. We also needed to get the word out so that students would sign up, since the course was not announced until after the preenrollment period.

We gave our first lecture on January 16, 2007. Everything fell into place. David commuted weekly to Urbana for the class. We had 52 students, a couple more than our capacity. We had draft slides for most of the first 10 lectures. Wen-mei's graduate student, John Stratton, graciously volunteered as the teaching assistant and set up the lab. All students signed NDAs so that we could proceed with the first several lectures until CUDA became public. We recorded the lectures but did not release them on the Web until February. We had graduate students from physics, astronomy, chemistry, electrical engineering, and mechanical engineering, as well as computer science and computer engineering. The enthusiasm in the room made it all worthwhile.

Since then, we have taught the course three times in one-semester format and twice in one-week intensive format. The ECE498AL course has become a permanent course, known as ECE408, at the University of Illinois at Urbana-Champaign. We started to write up some early chapters of this book when we offered ECE498AL the second time. We tested these