nVIDIA Parallel Prefix Sum (Scan)with CUDA Mark Harris mharris@nvidia.com April 2007
April 2007 Parallel Prefix Sum (Scan) with CUDA Mark Harris mharris@nvidia.com
Document Change History Version Date Responsible Reason for Change February 14, Mark Harris Initial release 2007 April 2007
April 2007 ii Document Change History Version Date Responsible Reason for Change February 14, 2007 Mark Harris Initial release
Abstract Parallel prefix sum,also known as parallel Scan,is a useful building block for many parallel algorithms including sorting and building data structures.In this document we introduce Scan and describe step-by-step how it can be implemented efficiently in NVIDIA CUDA.We start with a basic naive algorithm and proceed through more advanced techniques to obtain best performance.We then explain how to scan arrays of arbitrary size that cannot be processed with a single block of threads. Month 2007 1
Month 2007 1 Abstract Parallel prefix sum, also known as parallel Scan, is a useful building block for many parallel algorithms including sorting and building data structures. In this document we introduce Scan and describe step-by-step how it can be implemented efficiently in NVIDIA CUDA. We start with a basic naïve algorithm and proceed through more advanced techniques to obtain best performance. We then explain how to scan arrays of arbitrary size that cannot be processed with a single block of threads
Parallel Prefix Sum (Scan)with CUDA Table of Contents Abstract....... .1 Table of Contents....... .2 Introduction...... .3 Inclusive and Exclusive Scan............ 3 Sequential Scan... 4 A Naive Parallel Scan............ 4 A Work-Efficient Parallel Scan............... .7 Avoiding Bank Conflicts............. .11 Arrays of Arbitrary Size......... 14 Performance.a… .16 Conclusion... .17 Bibliography........ .18 April 2007 2
Parallel Prefix Sum (Scan) with CUDA April 2007 2 Table of Contents Abstract..............................................................................................................1 Table of Contents...............................................................................................2 Introduction.......................................................................................................3 Inclusive and Exclusive Scan .........................................................................................................................3 Sequential Scan.................................................................................................................................................4 A Naïve Parallel Scan .........................................................................................4 A Work-Efficient Parallel Scan...........................................................................7 Avoiding Bank Conflicts ...................................................................................11 Arrays of Arbitrary Size....................................................................................14 Performance.....................................................................................................16 Conclusion........................................................................................................17 Bibliography.....................................................................................................18
Parallel Prefix Sum(Scan)with CUDA Introduction A simple and common parallel algorithm building block is the allprefix-sums operation.In this paper we will define and illustrate the operation,and discuss in detail its efficient implementation on NVIDIA CUDA.As mentioned by Blelloch [1],all-prefix-sums is a good example of a computation that seems inherently sequential,but for which there is an efficient parallel algorithm.The all-prefix-sums operation is defined as follows in [1]: Definition:The all-prefix-sums operation takes a binary associative operator and an array of n elements [ao,a,...,a and returns the array [a,(⊕a),,(⊕4©.⊕a-l Example:If is addition,then the all-prefix-sums operation on the array [317041631, would return [34111114162225. There are many uses for all-prefix-sums,including,but not limited to sorting,lexical analysis, string comparison,polynomial evaluation,stream compaction,and building histograms and data structures(graphs,trees,etc.)in parallel.For example applications,we refer the reader to the survey by Blelloch [1]. In general,all-prefix-sums can be used to convert some sequential computations into equivalent,but parallel,computations as shown in Figure 1. out[0]=0 forall j in parallel do forall j from 1 to n do temp[j]f(in[j]); out[j]outlj-1]f(in[j-1)); all prefix sums (out,temp); Figure 1:A sequential computation and its parallel equivalent. Inclusive and Exclusive Scan All-prefix-sums on an array of data is commonly known as scan.We will use this simpler terminology (which comes from theAPL programming language [11)for the remainder of this paper.As shown in the last section,a scan of an array generates a new array where each element jis the sum of all elements up to and including j.This is an inclsire scan.It is often useful for each element /in the results of a scan to contain the sum of all previous elements, but not jitself.This operation is commonly known as an exclusire scan(or prescan)[1]. Definition:The exclusive scan operation takes a binary associative operator with identity I,and an array of n elements [,a,,as-1 April 2007 3
Parallel Prefix Sum (Scan) with CUDA April 2007 3 Introduction A simple and common parallel algorithm building block is the all-prefix-sums operation. In this paper we will define and illustrate the operation, and discuss in detail its efficient implementation on NVIDIA CUDA. As mentioned by Blelloch [1], all-prefix-sums is a good example of a computation that seems inherently sequential, but for which there is an efficient parallel algorithm. The all-prefix-sums operation is defined as follows in [1]: Definition: The all-prefix-sums operation takes a binary associative operator ⊕, and an array of n elements [a0, a1, …, an-1], and returns the array [a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an-1)]. Example: If ⊕ is addition, then the all-prefix-sums operation on the array [3 1 7 0 4 1 6 3], would return [3 4 11 11 14 16 22 25]. There are many uses for all-prefix-sums, including, but not limited to sorting, lexical analysis, string comparison, polynomial evaluation, stream compaction, and building histograms and data structures (graphs, trees, etc.) in parallel. For example applications, we refer the reader to the survey by Blelloch [1]. In general, all-prefix-sums can be used to convert some sequential computations into equivalent, but parallel, computations as shown in Figure 1. out[0] = 0; forall j from 1 to n do out[j] = out[j-1] + f(in[j-1]); forall j in parallel do temp[j] = f(in[j]); all_prefix_sums(out, temp); Figure 1: A sequential computation and its parallel equivalent. Inclusive and Exclusive Scan All-prefix-sums on an array of data is commonly known as scan. We will use this simpler terminology (which comes from the APL programming language [1]) for the remainder of this paper. As shown in the last section, a scan of an array generates a new array where each element j is the sum of all elements up to and including j. This is an inclusive scan. It is often useful for each element j in the results of a scan to contain the sum of all previous elements, but not j itself. This operation is commonly known as an exclusive scan (or prescan) [1]. Definition: The exclusive scan operation takes a binary associative operator ⊕ with identity I, and an array of n elements [a0, a1, …, an-1]