and returns the array [I, a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an−2)].

Example: If ⊕ is addition, then the exclusive scan operation on the array [3 1 7 0 4 1 6 3] returns [0 3 4 11 11 15 16 22].

An exclusive scan can be generated from an inclusive scan by shifting the resulting array right by one element and inserting the identity. Likewise, an inclusive scan can be generated from an exclusive scan by shifting the resulting array left and inserting at the end the sum of the last element of the scan and the last element of the input array [1]. For the remainder of this paper we will focus on the implementation of exclusive scan and refer to it simply as scan unless otherwise specified.
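These conversions are mechanical. As a minimal sketch (the routines below are our own illustration, not code from the paper), the following C functions perform both conversions for the + operator, whose identity is 0:

void exclusive_from_inclusive(float *excl, const float *incl, int n)
{
    excl[0] = 0; // insert the identity of + at the front
    for (int j = 1; j < n; ++j)
        excl[j] = incl[j-1]; // shift right by one element
}

void inclusive_from_exclusive(float *incl, const float *excl,
                              const float *input, int n)
{
    for (int j = 0; j < n-1; ++j)
        incl[j] = excl[j+1]; // shift left by one element
    // last element: sum of the last element of the scan
    // and the last element of the input array
    incl[n-1] = excl[n-1] + input[n-1];
}

For the example above, exclusive_from_inclusive turns the inclusive result [3 4 11 11 15 16 22 25] into the exclusive result [0 3 4 11 11 15 16 22].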
Sequential Scan

Implementing a sequential version of scan (that could be run in a single thread on a CPU, for example) is trivial. We simply loop over all the elements in the input array, add the value of the previous element of the input array to the sum computed for the previous element of the output array, and write the sum to the current element of the output array.

void scan(float *output, float *input, int length)
{
    output[0] = 0; // since this is a prescan, not a scan
    for (int j = 1; j < length; ++j)
    {
        output[j] = input[j-1] + output[j-1];
    }
}

This code performs exactly n − 1 adds for an array of length n; this is the minimum number of adds required to produce the scanned array. When we develop our parallel version of scan, we would like it to be work-efficient: it should do no more addition operations (or work) than the sequential version. In other words, the two implementations should have the same work complexity, O(n).

A Naïve Parallel Scan

for d := 1 to log2(n) do
    forall k in parallel do
        if k ≥ 2^(d−1) then
            x[k] := x[k − 2^(d−1)] + x[k]

Algorithm 1: A sum scan algorithm that is not work-efficient.
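To make the data flow of Algorithm 1 concrete before analyzing it, here is a host-side C sketch (the function name and the add counter are ours, added purely for illustration) that simulates each parallel step sequentially. The snapshot copy stands in for the lockstep semantics of forall: every "thread" of step d reads values produced in step d − 1.

#include <stdio.h>
#include <string.h>

#define N 8 // illustration only; Algorithm 1 assumes n is a power of two

// Simulate Algorithm 1, counting additions. Note that the result
// is an inclusive scan of the input.
void naive_scan_simulated(float *x, int n, int *adds)
{
    float prev[N];
    for (int offset = 1; offset < n; offset *= 2) // offset = 2^(d-1)
    {
        memcpy(prev, x, n * sizeof(float)); // snapshot of step d-1
        for (int k = 0; k < n; ++k)         // "forall k in parallel"
        {
            if (k >= offset)
            {
                x[k] = prev[k - offset] + prev[k];
                ++*adds;
            }
        }
    }
}

int main(void)
{
    float x[N] = {3, 1, 7, 0, 4, 1, 6, 3}; // the example array from the text
    int adds = 0;
    naive_scan_simulated(x, N, &adds);
    for (int k = 0; k < N; ++k)
        printf("%g ", x[k]);        // prints 3 4 11 11 15 16 22 25
    printf("\nadds = %d\n", adds);  // prints 17 = n*log2(n) - (n - 1)
    return 0;
}

For n = 8 the simulation performs 17 additions, versus the 7 of the sequential loop; this gap is the subject of the work-complexity analysis that follows.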
The pseudocode in Algorithm 1 shows a naïve parallel scan implementation. This algorithm is based on the scan algorithm presented by Hillis and Steele¹ [4], and demonstrated for GPUs by Horn [5]. The problem with Algorithm 1 is apparent if we examine its work complexity: the algorithm performs

    ∑_{d=1}^{log2 n} (n − 2^(d−1)) = O(n log2 n)

addition operations. Remember that a sequential scan performs O(n) adds. Therefore, this naïve implementation is not work-efficient. The factor of log2 n can have a large effect on performance. In the case of a scan of 1 million elements (n = 2^20, so log2 n = 20), the performance difference between this naïve implementation and a theoretical work-efficient parallel implementation would be almost a factor of 20.

¹ Note that while we call this a naïve scan in the context of CUDA and NVIDIA GPUs, it was not necessarily naïve for a Connection Machine [3], which is the machine Hillis and Steele were focused on. Related to work complexity is the concept of step complexity, which is the number of steps that the algorithm executes. The Connection Machine was a SIMD machine with many thousands of processors. In the limit where the number of processors equals the number of elements to be scanned, execution time is dominated by step complexity rather than work complexity. Algorithm 1 has a step complexity of O(log n), compared to the O(n) step complexity of the sequential algorithm, and is therefore step-efficient.
Algorithm 1 assumes that there are as many processors as data elements. On a GPU running CUDA, this is not usually the case. Instead, the forall is automatically divided into small parallel batches (called warps) that are executed sequentially on a multiprocessor. A G80 GPU executes warps of 32 threads in parallel. Because not all threads run simultaneously for arrays larger than the warp size, the algorithm above will not work, because it performs the scan in place on the array: the results of one warp will be overwritten by threads in another warp.

To solve this problem, we need to double-buffer the array we are scanning. We use two temporary arrays (temp[2][n]) to do this. Pseudocode for this is given in Algorithm 2, and CUDA C code for the naïve scan is given in Listing 1. Note that this code will run on only a single thread block of the GPU, and so the size of the arrays it can process is limited (to 512 elements on G80 GPUs). Extension of scan to large arrays is discussed later.

for d := 1 to log2(n) do
    forall k in parallel do
        if k ≥ 2^(d−1) then
            x[out][k] := x[in][k − 2^(d−1)] + x[in][k]
        else
            x[out][k] := x[in][k]
    swap(in, out)

Algorithm 2: A double-buffered version of the sum scan from Algorithm 1.
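Listing 1 itself does not appear in this excerpt. As a minimal sketch (our own illustration, not necessarily identical to the paper's Listing 1), a CUDA C kernel implementing Algorithm 2 might ping-pong between the two halves of a dynamically allocated shared-memory array, assuming n is a power of two, one thread per element, and a single thread block:

__global__ void naive_scan(float *g_odata, float *g_idata, int n)
{
    // Double-buffered shared memory: 2*n floats, allocated at launch.
    extern __shared__ float temp[];
    int thid = threadIdx.x;
    int pout = 0, pin = 1;

    // Load input into shared memory. This is an exclusive scan,
    // so shift right by one element and insert the identity (0).
    temp[pout*n + thid] = (thid > 0) ? g_idata[thid - 1] : 0;
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2) // offset = 2^(d-1)
    {
        // Swap the double-buffer indices; all reads come from the
        // buffer written in the previous step.
        pout = 1 - pout;
        pin  = 1 - pout;
        if (thid >= offset)
            temp[pout*n + thid] = temp[pin*n + thid]
                                + temp[pin*n + thid - offset];
        else
            temp[pout*n + thid] = temp[pin*n + thid];
        __syncthreads();
    }
    g_odata[thid] = temp[pout*n + thid]; // write results to device memory
}

A hypothetical launch for a 512-element array on G80 would be naive_scan<<<1, 512, 2*512*sizeof(float)>>>(d_out, d_in, 512). The __syncthreads() barrier after every step, together with the buffer swap, is what prevents one warp from overwriting values that another warp has yet to read.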