ACKNOWLEDGMENTS

IT WOULD BE HARD TO IMAGINE this project making it to the finish line without the suggestions, constructive criticisms, help, and resources of our colleagues and friends.

We would like to express our thanks to NVIDIA for granting access to many GTC conference presentations and CUDA technical documents that add both great value and authority to this book. In particular, we owe much gratitude to Dr. Paulius Micikevicius and Dr. Peng Wang, Developer Technology Engineers at NVIDIA, for their kind advice and help during the writing of this book. Special thanks to Mark Ebersole, NVIDIA Chief CUDA Educator, for his guidance and feedback during the review process. We would like to thank Mr. Will Ramey, Sr. Product Manager at NVIDIA, and Mr. Nadeem Mohammad, Product Marketing at NVIDIA, for their support and encouragement during the entire project. We would also like to thank Mr. Paul Holzhauer, Director of Oil & Gas at NVIDIA, for his support during the initial phase of this project. Above all, we owe an enormous debt of gratitude to the many presenters and speakers at past GTC conferences for their inspiring and creative work on GPU computing technologies. We have recorded all your credits in our suggested reading lists.

After years of using GPUs in real production projects, John is very grateful to the people who helped him become a GPU computing enthusiast. In particular, John would like to thank Dr. Nanxun Dai and Dr. Bao Zhao for their encouragement, support, and guidance on seismic imaging projects at BGP. John also would like to thank his colleagues Dr. Zhengzhen Zhou, Dr. Wei Zhang, Mrs. Grace Zhang, and Mr. Kai Yang. They are truly brilliant and very pleasant to work with; John loves the team and feels very privileged to be one of them. John would like to extend a special thanks to Dr. Mitsuo Gen, an internationally renowned professor and the supervisor of John's doctoral program, for giving John the opportunity to teach at universities in Japan and co-author academic books, and especially for his full support during the years when John was running a startup based on evolutionary computation technologies in Tokyo. John was very happy to work on this project with Ty and Max as a team, and he learned a lot from them during the writing of this book. John owes a debt of gratitude to his wife, Joly, and his son, Rick, for their love, support, and considerable patience during evenings and weekends over the past year while Dad was yet again "doing his own book work."

For over 25 years, Ty has been helping software developers solve HPC grand challenges. Ty is delighted to work at NVIDIA, where he helps clients extend their current knowledge to unlock the potential of massively parallel GPUs. There are so many NVIDIANs to thank, but Ty would like to specifically recognize Dr. Paulius Micikevicius for his gifted insights and strong desire to always improve while doing the heavy lifting for numerous projects. When John asked Ty to help share
CUDA knowledge in a book project, he welcomed the challenge. Dave Jones, a senior director at NVIDIA, approved Ty's participation in this project; sadly, last year Dave lost his courageous battle against cancer. Our hearts go out to Dave and his family. His memory serves to inspire us to press on and to pursue our passions. The encouragement of Shanker Trivedi and Marc Hamilton has been especially helpful. Wanting to maintain his work/life balance, Ty recruited Max to join this project. It was truly a pleasure to learn from John and Max as they developed the book content that Ty helped review. Finally, Ty's wife, Judy, and his four children deserve recognition for their unconditional support and love; it is a blessing to receive encouragement and motivation while pursuing those things that bring joy to your life.

Max has been fortunate to collaborate with and be guided by a number of brilliant and talented engineers, researchers, and mentors. First, thanks must go to Professor Vivek Sarkar and the whole Habanero Research Group at Rice University, where Max got his first taste of HPC and CUDA. The mentorship of Vivek and others in the group was invaluable in enabling him to explore the exciting world of research. Max would also like to thank Mauricio Araya-Polo and Gladys Gonzalez at Repsol. The experience gained under their mentorship was incredibly valuable in writing a book that would be truly useful for real-world work in science and engineering. Finally, Max would like to thank John and Ty for inviting him along on this writing adventure in CUDA and for the lessons this experience has provided in CUDA, writing, and life.

It would not be possible to make a quality professional book without input from technical editors, development editors, and reviewers. We would like to extend our sincere appreciation to Mary E. James, our acquisitions editor; Martin V. Minner, our project editor; Katherine Burt, our copy editor; and Wei Zhang and Chao Zhao, our technical editors. You are an insightful and professional editorial team, and this book would not be what it is without you. It was a great pleasure to work with you on this project.
CONTENTS

FOREWORD xvii
PREFACE xix
INTRODUCTION xxi

CHAPTER 1: HETEROGENEOUS PARALLEL COMPUTING WITH CUDA 1
Parallel Computing 2
Sequential and Parallel Programming 3
Parallelism 4
Computer Architecture 6
Heterogeneous Computing 8
Heterogeneous Architecture 9
Paradigm of Heterogeneous Computing 12
CUDA: A Platform for Heterogeneous Computing 14
Hello World from GPU 17
Is CUDA C Programming Difficult? 20
Summary 21

CHAPTER 2: CUDA PROGRAMMING MODEL 23
Introducing the CUDA Programming Model 23
CUDA Programming Structure 25
Managing Memory 26
Organizing Threads 30
Launching a CUDA Kernel 36
Writing Your Kernel 37
Verifying Your Kernel 39
Handling Errors 40
Compiling and Executing 40
Timing Your Kernel 43
Timing with CPU Timer 44
Timing with nvprof 47
Organizing Parallel Threads 49
Indexing Matrices with Blocks and Threads 49
Summing Matrices with a 2D Grid and 2D Blocks 53
Summing Matrices with a 1D Grid and 1D Blocks 57
Summing Matrices with a 2D Grid and 1D Blocks 58
Managing Devices 60
Using the Runtime API to Query GPU Information 61
Determining the Best GPU 63
Using nvidia-smi to Query GPU Information 63
Setting Devices at Runtime 64
Summary 65

CHAPTER 3: CUDA EXECUTION MODEL 67
Introducing the CUDA Execution Model 67
GPU Architecture Overview 68
The Fermi Architecture 71
The Kepler Architecture 73
Profile-Driven Optimization 78
Understanding the Nature of Warp Execution 80
Warps and Thread Blocks 80
Warp Divergence 82
Resource Partitioning 87
Latency Hiding 90
Occupancy 93
Synchronization 97
Scalability 98
Exposing Parallelism 98
Checking Active Warps with nvprof 100
Checking Memory Operations with nvprof 100
Exposing More Parallelism 101
Avoiding Branch Divergence 104
The Parallel Reduction Problem 104
Divergence in Parallel Reduction 106
Improving Divergence in Parallel Reduction 110
Reducing with Interleaved Pairs 112
Unrolling Loops 114
Reducing with Unrolling 115
Reducing with Unrolled Warps 117
Reducing with Complete Unrolling 119
Reducing with Template Functions 120
Dynamic Parallelism 122
Nested Execution 123
Nested Hello World on the GPU 124
Nested Reduction 128
Summary 132
CHAPTER 4: GLOBAL MEMORY 135
Introducing the CUDA Memory Model 136
Benefits of a Memory Hierarchy 136
CUDA Memory Model 137
Memory Management 145
Memory Allocation and Deallocation 146
Memory Transfer 146
Pinned Memory 148
Zero-Copy Memory 150
Unified Virtual Addressing 156
Unified Memory 157
Memory Access Patterns 158
Aligned and Coalesced Access 158
Global Memory Reads 160
Global Memory Writes 169
Array of Structures versus Structure of Arrays 171
Performance Tuning 176
What Bandwidth Can a Kernel Achieve? 179
Memory Bandwidth 179
Matrix Transpose Problem 180
Matrix Addition with Unified Memory 195
Summary 199

CHAPTER 5: SHARED MEMORY AND CONSTANT MEMORY 203
Introducing CUDA Shared Memory 204
Shared Memory 204
Shared Memory Allocation 206
Shared Memory Banks and Access Mode 206
Configuring the Amount of Shared Memory 212
Synchronization 214
Checking the Data Layout of Shared Memory 216
Square Shared Memory 217
Rectangular Shared Memory 225
Reducing Global Memory Access 232
Parallel Reduction with Shared Memory 232
Parallel Reduction with Unrolling 236
Parallel Reduction with Dynamic Shared Memory 238
Effective Bandwidth 239