The key component in high-performance computing is the central processing unit (CPU), usually called the core. In the early days of the computer, there was only one core on a chip. This architecture is referred to as a uniprocessor. Nowadays, the trend in chip design is to integrate multiple cores onto a single processor, usually termed multicore, to support parallelism at the architecture level. Therefore, programming can be viewed as the process of mapping the computation of a problem to available cores such that parallel execution is obtained.

When implementing a sequential algorithm, you may not need to understand the details of the computer architecture to write a correct program. However, when implementing algorithms for multicore machines, it is much more important for programmers to be aware of the characteristics of the underlying computer architecture. Writing both correct and efficient parallel programs requires a fundamental knowledge of multicore architectures.

The following sections cover some basic concepts of parallel computing and how these concepts relate to CUDA programming.

Sequential and Parallel Programming

When solving a problem with a computer program, it is natural to divide the problem into a discrete series of calculations; each calculation performs a specified task, as shown in Figure 1-2. Such a program is called a sequential program.

[Figure 1-2: A problem divided into small pieces of calculations, arranged along the execution order.]

There are two ways to classify the relationship between two pieces of computation: Some are related by a precedence constraint and therefore must be calculated sequentially; others have no such constraints and can be calculated concurrently. Any program containing tasks that are performed concurrently is a parallel program. As shown in Figure 1-3, a parallel program may, and most likely will, have some sequential parts.

From the perspective of a programmer, a program consists of two basic ingredients: instructions and data. When a computational problem is broken down into many small pieces of computation, each piece is called a task. In a task, individual instructions consume inputs, apply a function, and produce outputs. A data dependency occurs when an instruction consumes data produced by a preceding instruction. Therefore, you can classify the relationship between any two tasks as either dependent, if one consumes the output of the other, or independent.

Analyzing data dependencies is a fundamental skill in implementing parallel algorithms because dependencies are one of the primary inhibitors to parallelism, and understanding them is necessary to obtain application speedup in the modern programming world. In most cases, multiple independent chains of dependent tasks offer the best opportunity for parallelization.
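To make the distinction concrete, consider a minimal sketch in plain C (the variable names and values are illustrative, not from the text):

    #include <stdio.h>

    int main(void)
    {
        float in[2] = {3.0f, 5.0f};

        float a = in[0] * 2.0f;   /* task 1                                  */
        float b = a + 1.0f;       /* task 2: consumes a, so it must follow 1 */

        float c = in[1] * 2.0f;   /* task 3: independent of tasks 1 and 2    */
        float d = c + 1.0f;       /* task 4: consumes c, so it must follow 3 */

        printf("%f %f\n", b, d);
        return 0;
    }

Within each chain (task 1 then task 2, task 3 then task 4) the execution order is fixed by data dependencies, but the two chains share no data, so they could run concurrently, one per core.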
[Figure 1-3: A program with both sequential execution and parallel execution portions, arranged along the execution order.]

Parallelism

Nowadays, parallelism is becoming ubiquitous, and parallel programming is becoming mainstream in the programming world. Parallelism at multiple levels is the driving force of architecture design. There are two fundamental types of parallelism in applications:

➤ Task parallelism
➤ Data parallelism

Task parallelism arises when there are many tasks or functions that can be executed independently and largely in parallel. Task parallelism focuses on distributing functions across multiple cores.

Data parallelism arises when there are many data items that can be operated on at the same time. Data parallelism focuses on distributing the data across multiple cores.

CUDA programming is especially well-suited to address problems that can be expressed as data-parallel computations. The major focus of this book is how to solve a data-parallel problem with CUDA programming. Many applications that process large data sets can use a data-parallel model to speed up the computations. Data-parallel processing maps data elements to parallel threads.

The first step in designing a data-parallel program is to partition data across threads, with each thread working on a portion of the data. In general, there are two approaches to partitioning data: block partitioning and cyclic partitioning. In block partitioning, many consecutive elements of data are chunked together. Each chunk is assigned to a single thread in any order, and threads generally process only one chunk at a time. In cyclic partitioning, fewer data elements are chunked together. Neighboring threads receive neighboring chunks, and each thread can handle more than one chunk. Selecting a new chunk for a thread to process implies jumping ahead as many chunks as there are threads; the sketch below illustrates both schemes.
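As a rough CUDA-style sketch of the two schemes (the kernel names and the placeholder computation are illustrative assumptions, not from the text):

    #include <cuda_runtime.h>

    /* Block partitioning: each thread owns one contiguous chunk of data. */
    __global__ void blockPartition(float *data, int n)
    {
        int nthreads = gridDim.x * blockDim.x;
        int tid      = blockIdx.x * blockDim.x + threadIdx.x;
        int chunk    = (n + nthreads - 1) / nthreads;  /* elements per thread */
        int begin    = tid * chunk;
        int end      = min(begin + chunk, n);
        for (int i = begin; i < end; i++)
            data[i] *= 2.0f;                           /* placeholder work */
    }

    /* Cyclic partitioning: after finishing one element, a thread jumps
       ahead by the total number of threads to reach its next element. */
    __global__ void cyclicPartition(float *data, int n)
    {
        int nthreads = gridDim.x * blockDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += nthreads)
            data[i] *= 2.0f;
    }

The cyclic kernel uses a chunk size of one element, so neighboring threads touch neighboring elements on every pass, matching the pattern the text describes.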
Figure 1-4 shows two simple examples of 1D data partitioning. In the block partition, each thread takes only one portion of the data to process, and in the cyclic partition, each thread takes more than one portion of the data to process. Figure 1-5 shows three simple examples of 2D data partitioning: block partitioning along the y dimension, block partitioning on both dimensions, and cyclic partitioning along the x dimension. The remaining patterns (block partitioning along the x dimension, cyclic partitioning on both dimensions, and cyclic partitioning along the y dimension) are left as an exercise.

Usually, data is stored one-dimensionally. Even when a logical multi-dimensional view of data is used, it still maps to one-dimensional physical storage. Determining how to distribute data among threads is closely related both to how that data is stored physically and to how the execution of each thread is ordered. The way you organize threads has a significant effect on the program's performance; the sketch after the sidebar below shows the index arithmetic involved.

[Figure 1-4: 1D partitioning. Block partition: each thread takes one data block. Cyclic partition: each thread takes two data blocks.]

[Figure 1-5: 2D partitioning. Block partition on one dimension; block partition on both dimensions; cyclic partition on one dimension.]

DATA PARTITIONS

There are two basic approaches to partitioning data:

➤ Block: Each thread takes one portion of the data, usually an equal portion.
➤ Cyclic: Each thread takes more than one portion of the data.

The performance of a program is usually sensitive to the block size. Determining an optimal partition for both block and cyclic partitioning is closely related to the computer architecture. You will learn more about this through the examples in this book.
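Because a logical 2D view maps to 1D physical storage, a 2D partition comes down to index arithmetic. A minimal sketch of block partitioning on both dimensions, assuming row-major storage (the kernel name and computation are illustrative):

    #include <cuda_runtime.h>

    /* Each thread handles one element of a logical nx-by-ny grid that is
       physically stored as a 1D row-major array. */
    __global__ void scale2D(float *data, int nx, int ny)
    {
        int ix = blockIdx.x * blockDim.x + threadIdx.x;  /* logical column */
        int iy = blockIdx.y * blockDim.y + threadIdx.y;  /* logical row    */
        if (ix < nx && iy < ny)
        {
            int idx = iy * nx + ix;   /* map the 2D view to 1D storage */
            data[idx] *= 2.0f;
        }
    }

Threads with consecutive threadIdx.x values access consecutive memory locations here, so thread order lines up with physical storage order, which is exactly the relationship the preceding paragraph points to.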
Computer Architecture

There are several different ways to classify computer architectures. One widely used classification scheme is Flynn's Taxonomy, which classifies architectures into four different types according to how instructions and data flow through cores (see Figure 1-6):

➤ Single Instruction Single Data (SISD)
➤ Single Instruction Multiple Data (SIMD)
➤ Multiple Instruction Single Data (MISD)
➤ Multiple Instruction Multiple Data (MIMD)

[Figure 1-6: The four types of Flynn's Taxonomy (SISD, SIMD, MISD, MIMD) arranged on instruction and data axes.]

Single Instruction Single Data refers to the traditional computer: a serial architecture. There is only one core in the computer. At any time only one instruction stream is executed, and operations are performed on one data stream.

Single Instruction Multiple Data refers to a type of parallel architecture. There are multiple cores in the computer. All cores execute the same instruction stream at any time, each operating on different data streams. Vector computers are typically characterized as SIMD, and most modern computers employ a SIMD architecture. Perhaps the biggest advantage of SIMD is that, while writing code on the CPU, programmers can continue to think sequentially yet achieve parallel speed-up from parallel data operations, because the compiler takes care of the details; the sketch after the list below illustrates this.

Multiple Instruction Single Data refers to an uncommon architecture, where each core operates on the same data stream via separate instruction streams.

Multiple Instruction Multiple Data refers to a type of parallel architecture in which multiple cores operate on multiple data streams, each executing independent instructions. Many MIMD architectures also include SIMD execution sub-components.

At the architectural level, many advances have been made to achieve the following objectives:

➤ Decrease latency
➤ Increase bandwidth
➤ Increase throughput
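To illustrate the SIMD point above: a plain scalar loop such as the following can be auto-vectorized by common compilers (for example, at -O2 or -O3), so that a single SIMD instruction processes several elements at once while the programmer still reasons one element at a time (a minimal sketch; the function name is illustrative):

    #include <stddef.h>

    /* Written sequentially; a vectorizing compiler can apply the same
       addition to several elements per instruction. */
    void vecAdd(const float *a, const float *b, float *c, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }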
Latency is the time it takes for an operation to start and complete, and is commonly expressed in microseconds. Bandwidth is the amount of data that can be processed per unit of time, commonly expressed as megabytes/sec or gigabytes/sec. Throughput is the number of operations that can be processed per unit of time, commonly expressed as gflops (billions of floating-point operations per second), especially in fields of scientific computation that make heavy use of floating-point calculations. Latency measures the time to complete an operation, while throughput measures the number of operations processed in a given unit of time; the sketch at the end of this section shows how both can be measured.

Computer architectures can also be subdivided by their memory organization, which is generally classified into the following two types:

➤ Multi-node with distributed memory
➤ Multiprocessor with shared memory

In a multi-node system, large-scale computational engines are constructed from many processors connected by a network. Each processor has its own local memory, and processors can communicate the contents of their local memory over the network. Figure 1-7 shows a typical multi-node system with distributed memory. These systems are often referred to as clusters.

[Figure 1-7: A multi-node system with distributed memory: processors, each with its own cache and local memory, connected by an interconnection network.]

Multiprocessor architectures typically range in size from dual-processor to dozens or hundreds of processors. These processors are either physically connected to the same memory (as shown in Figure 1-8) or share a low-latency link (such as PCI-Express, or PCIe). Although sharing memory implies a shared address space, it does not necessarily mean there is a single physical memory. Such multiprocessors include both single-chip systems with multiple cores, known as multicore, and computers consisting of multiple chips, each of which might have a multicore design. Multicore architectures have displaced single-core architectures permanently.

The term many-core is usually used to describe multicore architectures with an especially high number of cores (tens or hundreds). Recently, computer architectures have been transitioning from multicore to many-core.
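Latency and bandwidth can be measured directly. A minimal sketch that times a device-to-device copy with the CUDA event API and derives an effective bandwidth (the buffer size is an illustrative assumption, and error checking is omitted for brevity):

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        size_t bytes = 1 << 26;   /* 64 MB, an arbitrary example size */
        float *src, *dst;
        cudaMalloc((void **)&src, bytes);
        cudaMalloc((void **)&dst, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;          /* elapsed time: the latency of the copy */
        cudaEventElapsedTime(&ms, start, stop);

        /* Effective bandwidth: bytes read plus bytes written, per second. */
        double gbps = 2.0 * (double)bytes / (ms * 1.0e6);
        printf("latency: %.3f ms, bandwidth: %.2f GB/s\n", ms, gbps);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(src);
        cudaFree(dst);
        return 0;
    }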