Chapters 4 through 7 are designed to give the readers a more in-depth understanding of the CUDA programming model. Chapter 4 covers the thread organization and execution model required to fully understand the execution behavior of threads and basic performance concepts. Chapter 5 is dedicated to the special memories that can be used to hold CUDA variables for improved program execution speed. Chapter 6 introduces the major factors that contribute to the performance of a CUDA kernel function. Chapter 7 introduces the floating-point representation and concepts such as precision and accuracy. Although these chapters are based on CUDA, they help the readers build a foundation for parallel programming in general. We believe that humans understand best when we learn from the bottom up; that is, we must first learn the concepts in the context of a particular programming model, which provides us with a solid footing to generalize our knowledge to other programming models. As we do so, we can draw on our concrete experience from the CUDA model. In-depth experience with the CUDA model also enables us to gain maturity, which will help us learn concepts that may not even be pertinent to the CUDA model.

Chapters 8 and 9 are case studies of two real applications, which take the readers through the thought processes of parallelizing and optimizing their applications for significant speedups. For each application, we begin by identifying alternative ways of formulating the basic structure of the parallel execution and follow up with reasoning about the advantages and disadvantages of each alternative. We then go through the steps of code transformation necessary to achieve high performance. These two chapters help the readers put all the materials from the previous chapters together and prepare for their own application development projects.

Chapter 10 generalizes the parallel programming techniques into problem decomposition principles, algorithm strategies, and computational thinking. It does so by covering the concept of organizing the computation tasks of a program so that they can be done in parallel. We begin by discussing the translational process of organizing abstract scientific concepts into computational tasks, an important first step in producing quality application software, serial or parallel. The chapter then addresses parallel algorithm structures and their effects on application performance, grounded in our performance tuning experience with CUDA. The chapter concludes with a treatment of parallel programming styles and models, allowing the readers to place their knowledge in a wider context. With this chapter, the readers can begin to generalize from the SPMD programming style to other styles of parallel programming, such as loop parallelism in OpenMP
and fork–join in Pthreads programming. Although we do not go into these alternative parallel programming styles, we expect that the readers will be able to learn to program in any of them with the foundation gained in this book.
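As a brief illustration of one such alternative style (our own hedged sketch, not an example from the book's chapters), the following C program applies OpenMP loop parallelism to a simple vector addition. The pragma asks an OpenMP-enabled compiler (e.g., gcc -fopenmp) to divide the loop iterations among a team of threads; the array size and values are arbitrary.

#include <stdio.h>

#define N 1000000

static float a[N], b[N], c[N];

int main(void) {
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* Loop parallelism: the OpenMP runtime forks a team of threads,
       splits the iterations of the following loop among them, and
       joins the threads when the loop completes. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[42] = %f\n", c[42]);  /* expect 126.0 */
    return 0;
}

Note how the parallelism is expressed as an annotation on a loop over data, rather than through the explicit per-thread index computation of the SPMD style used throughout this book.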
Chapter 11 introduces the OpenCL programming model from a CUDA programmer's perspective. The reader will find OpenCL to be extremely similar to CUDA. The most important difference arises from OpenCL's use of API functions to implement functionalities such as kernel launching and thread identification. The use of API functions makes OpenCL more tedious to use; nevertheless, a CUDA programmer has all the knowledge and skills necessary to understand and write OpenCL programs. In fact, we believe that the best way to teach OpenCL programming is to teach CUDA first. We demonstrate this with a chapter that relates all major OpenCL features to their corresponding CUDA features. We also illustrate the use of these features by adapting our simple CUDA examples into OpenCL.
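To make the contrast concrete, here is a hedged sketch of the same vector addition in both models; it is our illustration, not code from Chapter 11. In CUDA, thread identity comes from built-in variables and the kernel is launched with the <<< >>> language extension:

#include <stdio.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* built-in variables */
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1024;
    float h_a[1024], h_b[1024], h_c[1024];
    for (int i = 0; i < n; i++) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, n * sizeof(float));
    cudaMalloc((void **)&d_b, n * sizeof(float));
    cudaMalloc((void **)&d_c, n * sizeof(float));
    cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);

    /* Kernel launch is a language extension, not an API call. */
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h_c[42] = %f\n", h_c[42]);  /* expect 126.0 */
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

In OpenCL, the equivalent kernel obtains its identity through an API function, and the host launches it through API calls such as clSetKernelArg() and clEnqueueNDRangeKernel() (host code omitted for brevity):

__kernel void vecAdd(__global const float *a, __global const float *b,
                     __global float *c, int n) {
    /* An API function replaces the CUDA built-in variables. */
    int i = get_global_id(0);
    if (i < n) c[i] = a[i] + b[i];
}

The extra host-side API calls are precisely the tedium the chapter refers to.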
Chapter 12 offers some concluding remarks and an outlook for the future of massively parallel programming. We revisit our goals and summarize how the chapters fit together to help achieve them. We then present a brief survey of the major trends in the architecture of massively parallel processors and how these trends will likely impact parallel programming in the future. We conclude with a prediction that these fast advances in massively parallel computing will make it one of the most exciting areas in the coming decade.

References and Further Reading

Hwu, W. W., Keutzer, K., & Mattson, T. (2008). The concurrency challenge. IEEE Design and Test of Computers, July/August, 312–320.

Khronos Group. (2009). The OpenCL Specification Version 1.0. Beaverton, OR: Khronos Group. (http://www.khronos.org/registry/cl/specs/opencl-1.0.29.pdf).

Mattson, T. G., Sanders, B. A., & Massingill, B. L. (2004). Patterns for parallel programming. Upper Saddle River, NJ: Addison-Wesley.

Message Passing Interface Forum. (2009). MPI: A Message-Passing Interface Standard, Version 2.2. Knoxville: University of Tennessee. (http://www.mpi-forum.org/docs/mpi-2.2/mpi22-report.pdf).

NVIDIA. (2007). CUDA programming guide. Santa Clara, CA: NVIDIA Corp.

OpenMP Architecture Review Board. (2005). OpenMP Application Program Interface. (http://www.openmp.org/mp-documents/spec25.pdf).

Sutter, H., & Larus, J. (2005). Software and the concurrency revolution. ACM Queue, 3(7), 54–62.

von Neumann, J. (1945). First draft of a report on the EDVAC. Contract No. W-670-ORD-4926, U.S. Army Ordnance Department and University of Pennsylvania (reproduced in Goldstine, H. H. (Ed.), (1972). The computer: From Pascal to von Neumann. Princeton, NJ: Princeton University Press).

Wing, J. (2006). Computational thinking. Communications of the ACM, 49(3), 33–35.
CHAPTER 2

History of GPU Computing

CHAPTER CONTENTS
2.1 Evolution of Graphics Pipelines 21
    2.1.1 The Era of Fixed-Function Graphics Pipelines 22
    2.1.2 Evolution of Programmable Real-Time Graphics 26
    2.1.3 Unified Graphics and Computing Processors 29
    2.1.4 GPGPU: An Intermediate Step 31
2.2 GPU Computing 32
    2.2.1 Scalable GPUs 33
    2.2.2 Recent Developments 34
2.3 Future Trends 34
References and Further Reading 35

INTRODUCTION

To CUDA and OpenCL programmers, graphics processing units (GPUs) are massively parallel numeric computing processors programmed in C with extensions. One need not understand graphics algorithms or terminology in order to program these processors. However, understanding the graphics heritage of these processors illuminates their strengths and weaknesses with respect to major computational patterns. In particular, the history helps to clarify the rationale behind major architectural design decisions of modern programmable GPUs: massive multithreading, relatively small cache memories compared to those of central processing units (CPUs), and bandwidth-centric memory interface design. Insights into the historical developments will also likely give the reader the context needed to project the future evolution of GPUs as computing devices.

2.1 EVOLUTION OF GRAPHICS PIPELINES

Three-dimensional (3D) graphics pipeline hardware evolved from the large, expensive systems of the early 1980s to small workstations and then PC accelerators in the mid- to late 1990s. During this period, the performance-