
Teaching reference for the course "Parallel and Distributed Programming": CUDA, "Programming Massively Parallel Processors" (Chinese edition: 大规模并行处理器编程实战), by David B. Kirk and Wen-mei W. Hwu (USA)


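Chapter 2: History of GPU Computing

2.2.1 Scalable GPUs

Scalability has been an attractive feature of graphics systems from the start. Early workstation graphics systems let customers choose their pixel-processing power by varying the number of pixel-processor circuit boards installed. Before the mid-1990s, PC graphics offered no scaling at all: the only option was the VGA controller. When 3D-capable accelerators appeared, the market had room for a range of products. In 1998, 3dfx introduced multiboard scaling with the original SLI (Scan Line Interleave) on the Voodoo2, the best-performing product of its time. In the same year, NVIDIA released distinct products built on a single architecture, the Riva TNT Ultra (high performance) and the Vanta (low cost), differentiated first by speed binning and packaging and later by separate chip designs (GeForce2 GTS and GeForce2 MX). Today, covering the range of desktop PC performance and price points typically takes four or five separate chip designs per architecture generation. Notebook and workstation systems form separate segments as well. After acquiring 3dfx's technology in 2001, NVIDIA carried the multi-GPU SLI concept forward, for example with the GeForce 6800, whose multi-GPU scalability is transparent to both the programmer and the user. Functional behavior is essentially the same across the whole scaling range: an application runs unmodified on any implementation within an architecture family.

With the arrival of the multicore era, a CPU can hold more transistors, and CPUs now raise performance by adding cores to the die rather than by speeding up a single core. At the time of writing, CPUs were moving from four cores to six and then eight. To use these processors fully, programmers have to find fourfold to eightfold parallelism. Many of them adopt a coarse-grained strategy in which different tasks of an application run in parallel. Because core counts keep doubling, such applications must be rewritten again and again to expose more parallel tasks. In contrast, the GPU's highly multithreaded design encourages massive, fine-grained data parallelism in CUDA. Efficient thread support in GPUs lets an application expose far more parallelism than the available hardware execution resources, with little or no penalty. Each doubling of the GPU core count adds hardware execution resources, which exploit more of the already exposed parallelism to deliver higher performance. In other words, the GPU parallel programming model for graphics and parallel computing is designed for transparent and portable scalability. Once written, a graphics program or CUDA program runs on any GPU, regardless of the number of processor cores.

2.2.2 Recent Developments

Academic and industrial work on CUDA applications has produced hundreds of successful CUDA programs. Most of these programs run tens to hundreds of times faster than they do on multicore CPUs. With the arrival of tools such as MCUDA [Stratton, 2008], the parallel threads of a CUDA program can also run efficiently on a multicore CPU, although more slowly than on a GPU because the CPU has fewer floating-point execution resources. Typical examples of these applications include n-body simulation, molecular modeling, computational finance, and oil/gas reservoir simulation. Some problems require double-precision floating-point arithmetic, but most of these applications still use single precision. Advances in double-precision floating-point support will allow an even broader range of applications to benefit from GPU acceleration.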

For a comprehensive list of GPU-accelerated applications and their latest developments, visit CUDA Zone at http://www.nvidia.com/CUDA. For resources on developing research applications, see CUDA Research at http://www.cuda-research.org.

2.3 Future Trends

As silicon processes improve, transistor counts will keep rising, and the number of processor cores will rise with them. GPU architectures will also continue to evolve. Although many cases have demonstrated high performance on data-parallel applications, the processor cores in a GPU are still relatively simple in design. Each new architecture will introduce more advanced techniques to raise the actual utilization of the computing units. Because scalable parallel computing on GPUs is still a young field, new applications will keep appearing. By studying them, GPU designers will discover and implement new machine optimizations. Chapter 12 discusses these future trends in more detail.

References and Further Reading

1. Akeley, K., & Jermoluk, T. (1988). High-Performance Polygon Rendering. Computer Graphics (SIGGRAPH 88), 22(4), 239-246.
2. Akeley, K. (1993). RealityEngine Graphics. Computer Graphics (SIGGRAPH 93), 27, 109-116.
3. Blelloch, G. B. (1990). Prefix Sums and Their Applications. In J. H. Reif (Ed.), Synthesis of Parallel Algorithms. San Francisco, CA: Morgan Kaufmann.
4. Blythe, D. (2006). The Direct3D 10 System. ACM Transactions on Graphics, 25(3), 724-734.
5. Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M., et al. (2004). Brook for GPUs: Stream Computing on Graphics Hardware. ACM Transactions on Graphics, 23(3), 777-786 (http://doi.acm.org/10.1145/1186562.1015800).
6. Elder, G. (2002). Radeon 9700. In Proceedings of the ACM Eurographics/SIGGRAPH Workshop on Graphics Hardware 2002 (http://www.graphicshardware.org/previous/www_2002/presentations/Hot3D-RADEON9700.ppt).
7. Fernando, R. (Ed.). GPU Gems: Programming Techniques, Tips, and Tricks for Real-Time Graphics. Reading, MA: Addison-Wesley (http://developer.nvidia.com/object/gpu_gems_home.html).
8. Fernando, R., & Kilgard, M. J. (2003). The Cg Tutorial: The Definitive Guide to Programmable Real-Time Graphics. Reading, MA: Addison-Wesley.
9. Foley, J., van Dam, A., Feiner, S., & Hughes, J. Interactive Computer Graphics: Principles and Practice, C Edition (2nd ed.). Reading, MA: Addison-Wesley.
10. Hillis, W. D., & Steele, G. L. (1986). Data Parallel Algorithms. Communications of the ACM, 29(12), 1170-1183 (http://doi.acm.org/10.1145/7902.7903).
11. IEEE 754R Working Group. (2006). Standard for Floating-Point Arithmetic P754 (Draft). Piscataway, NJ: Institute of Electrical and Electronics Engineers (http://www.validlab.com/754R/drafts/archive/2006-10-04.pdf).
12. Industrial Light and Magic. (2003). OpenEXR. San Mateo, CA: Industrial Light and Magic (www.openexr.com).
13. Intel. (2007). Intel 64 and IA-32 Architectures Optimization Reference Manual. Order No. 248966-016. Santa Clara, CA: Intel Corp. (http://www3.intel.com/design/processor/manuals/248966.pdf).
14. Kessenich, J., Baldwin, D., & Rost, R. (2006). The OpenGL Shading Language, Language Version 1.20. Madison, AL: 3Dlabs, Inc. (http://www.opengl.org/documentation/specs/).
15. Kirk, D., & Voorhies, D. (1990). The Rendering Architecture of the DN10000VS. Computer Graphics (SIGGRAPH 1990), 24(4), 299-307.
16. Lindholm, E., Kilgard, M. J., & Moreton, H. (2001). A User-Programmable Vertex Engine. In Proceedings of the 28th Annual ACM Conference on Computer Graphics and Interactive Techniques (pp. 149-158). Reading, MA: ACM Press/Addison-Wesley.
17. Lindholm, E., Nickolls, J., Oberman, S., & Montrym, J. (2008). NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, 28(2), 39-55.
18. Microsoft. (2003). Microsoft DirectX 9 Programmable Graphics Pipeline. Redmond, WA: Microsoft Press.
19. Microsoft. (2009). Microsoft DirectX Specification. Redmond, WA: Microsoft Press (http://msdn.microsoft.com/directx/).
20. Montrym, J., Baum, D., Dignam, D., & Migdal, C. (1997). InfiniteReality: A Real-Time Graphics System. In G. O. Owen, T. Whitted, & B. Mones-Hattal (Eds.), Proceedings of the 24th Annual ACM Conference on Computer Graphics and Interactive Techniques (pp. 293-301). Reading, MA: ACM Press/Addison-Wesley.
21. Montrym, J., & Moreton, H. (2005). The GeForce 6800. IEEE Micro, 25(2), 41-51.
22. Moore, G. E. (1965). Cramming More Components onto Integrated Circuits. Electronics, 38(8), 114-117.
23. Nguyen, H. (Ed.). (2008). GPU Gems 3. Reading, MA: Addison-Wesley.
24. Nickolls, J., Buck, I., Garland, M., & Skadron, K. (2008). Scalable Parallel Programming with CUDA. ACM Queue, 6(2), 40-53.
25. NVIDIA. (2007a). NVIDIA CUDA - Compute Unified Device Architecture, Programming Guide, Version 1.1 (http://developer.download.nvidia.com/compute/cuda/1_1/NVIDIA_CUDA_Programming_Guide_1.1.pdf).
26. NVIDIA. (2007b). NVIDIA Compute - PTX: Parallel Thread Execution, ISA Version 1.1 (http://www.nvidia.com/object/io_1195170102263.html).
27. NVIDIA. (2009). CUDA Zone (http://www.nvidia.com/CUDA).
28. Nyland, L., Harris, M., & Prins, J. (2007). Fast N-Body Simulation with CUDA. In H. Nguyen (Ed.), GPU Gems 3. Reading, MA: Addison-Wesley.
29. Oberman, S. F., & Siu, M. Y. (2005). A High-Performance Area-Efficient Multifunction Interpolator. In Proceedings of the 17th IEEE Symposium on Computer Arithmetic (pp. 272-279). Cape Cod, MA.
30. Patterson, D. A., & Hennessy, J. L. (2004). Computer Organization and Design: The Hardware/Software Interface (3rd ed.). San Francisco, CA: Morgan Kaufmann.
31. Pharr, M. (Ed.). (2005). GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation. Reading, MA: Addison-Wesley.
32. Satish, N., Harris, M., & Garland, M. (2008). Designing Efficient Sorting Algorithms for Manycore GPUs. Proc. 23rd IEEE International Parallel and Distributed Processing Symposium, May 2009.
33. Segal, M., & Akeley, K. (2006). The OpenGL Graphics System: A Specification, Version 2.1. Mountain View, CA: Silicon Graphics (http://www.opengl.org/documentation/specs/).
34. Sengupta, S., Harris, M., Zhang, Y., & Owens, J. D. (2007). Scan Primitives for GPU Computing. In T. Aila & M. Segal (Eds.), Graphics Hardware (pp. 97-106). San Diego, CA: ACM Press.
35. Stratton, J. A., Stone, S. S., & Hwu, W. W. (2008). MCUDA: An Efficient Implementation of CUDA Kernels for Multi-Core CPUs. In Proceedings of the 21st International Workshop on Languages and Compilers for Parallel Computing (LCPC). Edmonton, Canada.
36. Volkov, V., & Demmel, J. (2008). LU, QR and Cholesky Factorizations Using Vector Capabilities of GPUs. Technical Report No. UCB/EECS-2008-49. Berkeley: EECS Department, University of California (http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-49.html).
37. Williams, S., Oliker, L., Vuduc, R., Shalf, J., Yelick, K., & Demmel, J. (2008). Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms. In Parallel Computing - Special Issue on Revolutionary Technologies for Acceleration of Emerging Petascale Applications
