Preface
chapters in our spring 2009 class and our 2009 Summer School. The first four chapters were also tested in an MIT class taught by Nicolas Pinto in spring 2009. We also shared these early chapters on the web and received valuable feedback from numerous individuals. Encouraged by this feedback, we decided to go for a full book. Here, we humbly present our first edition to you.

TARGET AUDIENCE

The target audience of this book is graduate and undergraduate students from all science and engineering disciplines where computational thinking and parallel programming skills are needed to use pervasive terascale computing hardware to achieve breakthroughs. We assume that readers have at least some basic C programming experience and are thus relatively advanced programmers, both within and outside the field of Computer Science. We especially target computational scientists in fields such as mechanical engineering, civil engineering, electrical engineering, bioengineering, physics, and chemistry, who use computation to further their field of research. As such, these scientists are both experts in their domains and advanced programmers. The book builds on basic C programming skills to teach parallel programming in C. We use C for CUDA, a parallel programming environment that is supported on NVIDIA GPUs and emulated on less parallel CPUs. There are approximately 200 million of these processors in the hands of consumers and professionals, and more than 40,000 programmers actively using CUDA. The applications that you develop as part of the learning experience can therefore be run by a very large user community.

HOW TO USE THE BOOK

We would like to offer some of our experience in teaching ECE498AL using the material detailed in this book.

A Three-Phased Approach

In ECE498AL the lectures and programming assignments are balanced with each other and organized into three phases:

Phase 1: One lecture based on Chapter 3 is dedicated to teaching the basic CUDA memory/threading model, the CUDA extensions to the C
language, and the basic programming/debugging tools. After the lecture, students can write a naïve parallel matrix multiplication code in a couple of hours.

Phase 2: The next phase is a series of 10 lectures that give students the conceptual understanding of the CUDA memory model, the CUDA threading model, GPU hardware performance features, modern computer system architecture, and the common data-parallel programming patterns needed to develop a high-performance parallel application. These lectures are based on Chapters 4 through 7. The performance of the students' matrix multiplication codes increases by about 10 times over this period. The students also complete assignments on convolution, vector reduction, and prefix scan during this period.

Phase 3: Once the students have established solid CUDA programming skills, the remaining lectures cover computational thinking, a broader range of parallel execution models, and parallel programming principles. These lectures are based on Chapters 8 through 11. (The voice and video recordings of these lectures are available online at http://courses.ece.illinois.edu/ece498/al.)

Tying It All Together: The Final Project

While the lectures, labs, and chapters of this book help lay the intellectual foundation for the students, what brings the learning experience together is the final project. The final project is so important that it is prominently positioned in the course and commands nearly 2 months’ focus. It incorporates five innovative aspects: mentoring, workshop, clinic, final report, and symposium. (While much of the information about the final project is available at the ECE498AL web site, http://courses.ece.illinois.edu/ece498/al, we would like to offer the thinking behind the design of these aspects.)

Students are encouraged to base their final projects on problems that represent current challenges in the research community. To seed the process, the instructors recruit several major computational science research groups to propose problems and serve as mentors. The mentors are asked to contribute a one-to-two-page project specification sheet that briefly describes the significance of the application, what the mentor would like to accomplish with the student teams on the application, the technical skills (particular types of math, physics, or chemistry courses) required to understand and work on the application, and a list of web and traditional resources that students can draw upon for technical background, general
information, and building blocks, along with specific URLs or ftp paths to particular implementations and coding examples. These project specification sheets also give students experience that will help them define their own research projects later in their careers. (Several examples are available at the ECE498AL course web site.)

Students are also encouraged to contact their potential mentors during their project selection process. Once the students and the mentors agree on a project, they enter into a close relationship, featuring frequent consultation and project reporting. We, the instructors, attempt to facilitate the collaborative relationship between students and their mentors, making it a very valuable experience for both mentors and students.

The Project Workshop

The main vehicle for the whole class to contribute to each other’s final project ideas is the project workshop. We usually dedicate six of the lecture slots to project workshops. The workshops are designed for students’ benefit. For example, if a student has identified a project, the workshop serves as a venue to present preliminary thinking, get feedback, and recruit teammates. If a student has not identified a project, he/she can simply attend the presentations, participate in the discussions, and join one of the project teams. Students are not graded during the workshops, in order to keep the atmosphere nonthreatening and enable them to focus on a meaningful dialog with the instructor(s), teaching assistants, and the rest of the class.

The workshop schedule is designed so that the instructor(s) and teaching assistants can take some time to provide feedback to the project teams and so that students can ask questions. Presentations are limited to 10 min so there is time for feedback and questions during the class period. This limits the class size to about 36 presenters, assuming 90-min lecture slots. All presentations are preloaded into a PC in order to control the schedule strictly and maximize feedback time. Since not all students present at the workshop, we have been able to accommodate up to 50 students in each class, with extra workshop time available as needed.

The instructor(s) and TAs must make a commitment to attend all the presentations and to give useful feedback. Students typically need the most help in answering the following questions. First, are the projects too big or too small for the amount of time available? Second, is there existing work in the field that the project can benefit from? Third, are the computations being targeted for parallel execution appropriate for the CUDA programming model?
The Design Document

Once the students decide on a project and form a team, they are required to submit a design document for the project. This helps them think through the project steps before they jump into it. The ability to do such planning will be important to their later career success. The design document should discuss the background and motivation for the project, application-level objectives and potential impact, main features of the end application, an overview of their design, an implementation plan, their performance goals, a verification plan and acceptance test, and a project schedule.

The teaching assistants hold a project clinic for final project teams during the week before the class symposium. This clinic helps ensure that students are on track and that they have identified the potential roadblocks early in the process. Student teams are asked to come to the clinic with an initial draft of the following three versions of their application: (1) the best CPU sequential code in terms of performance, with SSE2 and other optimizations that establish a strong serial base for their speedup comparisons; (2) the best CUDA parallel code in terms of performance, which is the main output of the project; and (3) a version of the CPU sequential code that is based on the same algorithm as version 2, using single precision. The third version is used by the students to characterize the parallel algorithm overhead in terms of the extra computations involved.

Student teams are asked to be prepared to discuss the key ideas used in each version of the code, any floating-point precision issues, any comparison against previous results on the application, and the potential impact on the field if they achieve tremendous speedup. From our experience, the optimal schedule for the clinic is 1 week before the class symposium. An earlier time typically results in less mature projects and less meaningful sessions. A later time will not give students sufficient time to revise their projects according to the feedback.

The Project Report

Students are required to submit a project report on their team’s key findings. Six lecture slots are combined into a whole-day class symposium. During the symposium, students use presentation slots proportional to the size of the teams. During the presentation, the students highlight the best parts of their project report for the benefit of the whole class. The presentation accounts for a significant part of students’ grades. Each student must answer questions directed to him/her individually, so that different grades can be assigned to individuals in the same team. The symposium is a major opportunity for students to learn to produce a concise presentation that
motivates their peers to read a full paper. After their presentation, the students also submit a full report on their final project.

ONLINE SUPPLEMENTS

The lab assignments, final project guidelines, and sample project specifications are available to instructors who use this book for their classes. While this book provides the intellectual content for these classes, the additional material will be crucial in achieving the overall educational goals. We would like to invite you to take advantage of the online material that accompanies this book, which is available at the publisher’s Web site, www.elsevierdirect.com/9780123814722.

Finally, we encourage you to submit your feedback. We would like to hear from you if you have any ideas for improving this book and the supplementary online material. Of course, we would also like to know what you liked about the book.

David B. Kirk and Wen-mei W. Hwu