Acknowledgments

We especially acknowledge Ian Buck, the father of CUDA, and John Nickolls, the lead architect of the Tesla GPU computing architecture. Their teams created an excellent infrastructure for this course. Ashutosh Rege and the NVIDIA DevTech team contributed to the original slides and content used in the ECE498AL course. Bill Bean, Simon Green, Mark Harris, Manju Hedge, Nadeem Mohammad, Brent Oster, Peter Shirley, Eric Young, and Cyril Zeller provided review comments and corrections to the manuscripts. Nadeem Mohammad organized the NVIDIA review efforts and also helped to plan Chapter 11 and Appendix B. Calisa Cole helped with the cover. Nadeem's heroic efforts have been critical to the completion of this book.

We also thank Jensen Huang for providing a great amount of financial and human resources for developing the course. Tony Tamasi's team contributed heavily to the review and revision of the book chapters. Jensen also took the time to read the early drafts of the chapters and gave us valuable feedback. David Luebke has facilitated the GPU computing resources for the course. Jonah Alben has provided valuable insight. Michael Shebanow and Michael Garland have given guest lectures and contributed materials.

John Stone and Sam Stone in Illinois contributed much of the base material for the case study and OpenCL chapters. John Stratton and Chris Rodrigues contributed some of the base material for the computational thinking chapter. I-Jui "Ray" Sung, John Stratton, Xiao-Long Wu, and Nady Obeid contributed to the lab material and helped to revise the course material as they volunteered to serve as teaching assistants on top of their research. Laurie Talkington and James Hutchinson helped to dictate early lectures that served as the base for the first five chapters. Mike Showerman helped build two generations of GPU computing clusters for the course.
Jeremy Enos worked tirelessly to ensure that students have a stable, user-friendly GPU computing cluster to work on their lab assignments and projects.

We acknowledge Dick Blahut, who challenged us to create the course in Illinois. His constant reminder that we needed to write the book helped keep us going. Beth Katsinas arranged a meeting between Dick Blahut and NVIDIA Vice President Dan Vivoli. Through that gathering, Blahut was introduced to David and challenged him to come to Illinois and create the course with Wen-mei.

We also thank Thom Dunning of the University of Illinois and Sharon Glotzer of the University of Michigan, Co-Directors of the multi-university Virtual School of Computational Science and Engineering, for graciously
hosting the summer school version of the course. Trish Barker, Scott Lathrop, Umesh Thakkar, Tom Scavo, Andrew Schuh, and Beth McKown all helped organize the summer school. Robert Brunner, Klaus Schulten, Pratap Vanka, Brad Sutton, John Stone, Keith Thulborn, Michael Garland, Vlad Kindratenko, Naga Govindaraj, Yan Xu, Arron Shinn, and Justin Haldar contributed to the lectures and panel discussions at the summer school.

Nicolas Pinto tested the early versions of the first chapters in his MIT class and assembled an excellent set of feedback comments and corrections. Steve Lumetta and Sanjay Patel both taught versions of the course and gave us valuable feedback. John Owens graciously allowed us to use some of his slides. Tor Aamodt, Dan Connors, Tom Conte, Michael Giles, Nacho Navarro, and numerous other instructors and their students worldwide have provided us with valuable feedback. Michael Giles reviewed the semi-final draft chapters in detail and identified many typos and inconsistencies.

We especially thank our colleagues Kurt Akeley, Al Aho, Arvind, Dick Blahut, Randy Bryant, Bob Colwell, Ed Davidson, Mike Flynn, John Hennessy, Pat Hanrahan, Nick Holonyak, Dick Karp, Kurt Keutzer, Dave Liu, Dave Kuck, Yale Patt, David Patterson, Bob Rao, Burton Smith, Jim Smith, and Mateo Valero, who have taken the time to share their insight with us over the years.

We are humbled by the generosity and enthusiasm of all the great people who contributed to the course and the book.

David B. Kirk and Wen-mei W. Hwu
To Caroline, Rose, and Leo
To Sabrina, Amanda, Bryan, and Carissa
For enduring our absence while working on the course and the book
CHAPTER 1
Introduction

CHAPTER CONTENTS
1.1 GPUs as Parallel Computers
1.2 Architecture of a Modern GPU
1.3 Why More Speed or Parallelism?
1.4 Parallel Programming Languages and Models
1.5 Overarching Goals
1.6 Organization of the Book
References and Further Reading

INTRODUCTION

Microprocessors based on a single central processing unit (CPU), such as those in the Intel Pentium family and the AMD Opteron family, drove rapid performance increases and cost reductions in computer applications for more than two decades. These microprocessors brought giga (billion) floating-point operations per second (GFLOPS) to the desktop and hundreds of GFLOPS to cluster servers. This relentless drive of performance improvement has allowed application software to provide more functionality, have better user interfaces, and generate more useful results. The users, in turn, demand even more improvements once they become accustomed to them, creating a positive cycle for the computer industry.

During this drive, most software developers have relied on advances in hardware to increase the speed of their applications under the hood; the same software simply runs faster as each new generation of processors is introduced.
This drive, however, has slowed since 2003 due to energy-consumption and heat-dissipation issues that have limited the increase of the clock frequency and the level of productive activities that can be performed in each clock period within a single CPU. Virtually all microprocessor vendors have switched to models where multiple processing units, referred to as processor cores, are used in each chip to increase the