An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems

Isaac Gelado, Javier Cabezas, Nacho Navarro
Universitat Politecnica de Catalunya
{igelado, jcabezas, nacho}@ac.upc.edu

John E. Stone, Sanjay Patel, Wen-mei W. Hwu
University of Illinois
{jestone, sjp, hwu}@illinois.edu

Abstract

Heterogeneous computing combines general purpose CPUs with accelerators to efficiently execute both sequential control-intensive and data-parallel phases of applications. Existing programming models for heterogeneous computing rely on programmers to explicitly manage data transfers between the CPU system memory and accelerator memory.

This paper presents a new programming model for heterogeneous computing, called Asymmetric Distributed Shared Memory (ADSM), that maintains a shared logical memory space for CPUs to access objects in the accelerator physical memory but not vice versa. The asymmetry allows light-weight implementations that avoid common pitfalls of symmetrical distributed shared memory systems. ADSM allows programmers to assign data objects to performance critical methods. When a method is selected for accelerator execution, its associated data objects are allocated within the shared logical memory space, which is hosted in the accelerator physical memory and transparently accessible by the methods executed on CPUs.

We argue that ADSM reduces programming efforts for heterogeneous computing systems and enhances application portability. We present a software implementation of ADSM, called GMAC, on top of CUDA in a GNU/Linux environment. We show that applications written in ADSM and running on top of GMAC achieve performance comparable to their counterparts using programmer-managed data transfers. This paper presents the GMAC system and evaluates different design choices. We further suggest additional architectural support that will likely allow GMAC to achieve higher application performance than the current CUDA model.

Categories and Subject Descriptors D.4.2 [Operating Systems]: Storage Management—Distributed Memories

General Terms Design, Experimentation, Performance

Keywords Heterogeneous Systems, Data-centric Programming Models, Asymmetric Distributed Shared Memory

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ASPLOS'10, March 13–17, 2010, Pittsburgh, Pennsylvania, USA.
Copyright © 2010 ACM 978-1-60558-839-1/10/03...$10.00

1. Introduction

Maximizing multi-thread throughput and minimizing single-thread latency are two design goals that impose very different and often conflicting requirements on processor design. For example, the Intel Xeon E7450 [28] processor consists of six processor cores, each of which is a high-frequency, out-of-order, multi-instruction-issue processor with a sophisticated branch prediction mechanism to achieve short single-thread execution latency. This is in contrast to the NVIDIA Tesla GT200 Graphics Processing Unit (GPU) [35] design, which achieves high multi-thread throughput with many cores, each of which is a moderate-frequency, multi-threaded, in-order processor that shares its control unit and instruction cache with seven other cores.
For control-intensive code, the E7450 design can easily outperform the NVIDIA Tesla. For massively data-parallel applications, the NVIDIA Tesla design can easily achieve higher performance than the E7450.

Data-parallel code has the property that multiple instances of the code can be executed concurrently on different data. Data parallelism exists in many applications such as physics simulation, weather prediction, financial analysis, medical imaging, and media processing. Most of these applications also have control-intensive phases that are often interleaved between data-parallel phases. Hence, general purpose CPUs and accelerators can be combined to form heterogeneous parallel computing systems that efficiently execute all application phases [39]. There are many examples of successful heterogeneous systems. For example, the RoadRunner supercomputer couples AMD Opteron processors with IBM PowerXCell accelerators. If the RoadRunner supercomputer were benchmarked using only its general purpose CPUs, rather than being the top-ranked system in the Top500 List (June 2008), it would drop to approximately the 50th position [6].

Current commercial programming models for processor-accelerator data transfer are based on Direct Memory Access (DMA) hardware, which is typically exposed to application programmers through memory copy routines. For example, in the CUDA programming model [38], an application programmer can transfer data from the processor to the accelerator device by calling a memory copy routine whose parameters include a source data pointer to the processor memory space, a destination pointer to the accelerator memory space, and the number of bytes to be copied. The memory copy interface ensures that the accelerator can only access the part of the application data that is explicitly requested by the memory copy parameters.

In this paper we argue that heterogeneous systems benefit from a data-centric programming model where programmers assign data objects to performance critical methods. This model provides the run-time system with enough information to automatically transfer data between general purpose CPUs and accelerators. Such a
run-time system improves programmability and compatibility of heterogeneous systems. This paper introduces the Asymmetric Distributed Shared Memory (ADSM) model, a data-centric programming model that maintains a shared logical memory space for CPUs to access objects in the accelerator physical memory but not vice versa. This asymmetry allows all coherence and consistency actions to be executed on the CPU, allowing the use of simple accelerators. This paper also presents GMAC, a user-level ADSM library, and discusses design and implementation details of such a system. Experimental results using GMAC show that an ADSM system makes heterogeneous systems easier to program without introducing performance penalties.

The main contributions of this paper are: (1) the introduction of ADSM as a data-centric programming model for heterogeneous systems, whose benefits are architecture independence, legacy support, and efficient I/O support; (2) a detailed discussion of the design of an ADSM system, which includes the definition of the necessary API calls and the description of the memory coherence and consistency required by an ADSM system; (3) a description of the software techniques required to build an ADSM system for current accelerators on top of existing operating systems; (4) an analysis of different coherence protocols that can be implemented in an ADSM system.

This paper is organized as follows. Section 2 presents the necessary background and motivates this work. ADSM is presented as a data-centric programming model in Section 3, which also discusses the benefits of ADSM for heterogeneous systems and presents the API, consistency model, and different coherence protocols for ADSM systems. The design and implementation of an ADSM system, GMAC, is presented in Section 4. Section 5 presents experimental results. ADSM is compared to other work in Section 6. Finally, Section 7 concludes this paper.

2. Background and Motivation

2.1 Background

General purpose CPUs and accelerators can be coupled in many different ways. Fine-grained accelerators are usually attached as functional units inside the processor pipeline [21, 22, 41, 43]. The Xilinx Virtex 5 FXT FPGAs include a PowerPC 440 connected to reconfigurable logic by a crossbar [46]. In the Cell BE chip, the Synergistic Processing Units, the L2 cache controller, the memory interface controller, and the bus interface controller are connected through an Element Interconnect Bus [30]. The Intel Graphics Media Accelerator is integrated inside the Graphics and Memory Controller Hub, which manages the flow of information between the processor, the system memory interface, the graphics interface, and the I/O controller [27]. AMD Fusion chips will integrate the CPU, memory controller, GPU, and PCIe controller into a single chip. A common characteristic among Virtex 5, Cell BE, Graphics Media Accelerator, and AMD Fusion is that general purpose CPUs and accelerators share access to system memory. In these systems, the system memory controller deals with memory requests coming from both general purpose CPUs and accelerators.

Accelerators and general purpose CPUs impose very different requirements on the system memory controller. General purpose CPUs are designed to minimize instruction latency and typically implement some form of strong memory consistency (e.g., sequential consistency in MIPS processors). Accelerators are designed to maximize data throughput and implement weak forms of memory consistency (e.g., Rigel implements weak consistency [32]).
Memory controllers for general purpose CPUs tend to implement narrow memory buses (e.g., 192 bits for the Intel Core i7) compared to data-parallel accelerators (e.g., 512 bits for the NVIDIA GTX280) to minimize the memory access time. Relaxed consistency models implemented by accelerators allow memory controllers to serve several requests in a single memory access. Strong consistency models required by general purpose CPUs do not offer the same freedom to rearrange accesses to system memory. Memory access scheduling in the memory controller thus has different requirements for general purpose CPUs and accelerators (i.e., latency vs. throughput). Virtual memory management also tends to be quite different on CPUs and accelerators (e.g., GPUs tend to benefit more from large page sizes than CPUs), which makes the design of TLBs and MMUs quite different (e.g., incompatible memory page sizes). Hence, general purpose CPUs and accelerators are connected to separate memories in most heterogeneous systems, as shown in Figure 1. Many such examples of heterogeneous systems currently exist. The NVIDIA GeForce graphics card [35] includes its own GDDR memory (up to 4GB) and is attached to the CPU through a PCIe bus. Future graphics cards based on the Intel Larrabee [40] chip will have a similar configuration. The Roadrunner supercomputer is composed of nodes that include two AMD Opteron CPUs (IBM BladeCenter LS21) and four PowerXCell chips (2x IBM BladeCenter QS22). Each LS21 BladeCenter is connected to two QS22 BladeCenters through a PCIe bus, constraining processors to access only on-board memory [6]. In this paper we assume the base heterogeneous system shown in Figure 1. However, the concepts developed in this paper are equally applicable to systems where general purpose CPUs and accelerators share the same physical memory.

Figure 1. Reference Architecture, similar to desktop GPUs and RoadRunner blades.

2.2 Motivation

Heterogeneous parallel computing improves application performance by executing computationally intensive data-parallel kernels on accelerators designed to maximize data throughput, while executing the control-intensive code on general purpose CPUs. Hence, some data structures are likely to be accessed primarily by the code executed by accelerators. For instance, execution traces show that about 99% of read and write accesses to the main data structures in the NAS Parallel Benchmarks (NPB) occur inside computationally intensive kernels that are amenable to parallelization.

Figure 2. Estimated bandwidth requirements for computationally intensive kernels of the bt, ep, lu, mg, and ua benchmarks, assuming a 800MHz clock frequency (required memory bandwidth, from 100 MBps to 100 GBps, plotted against IPC, with reference lines for PCIe, QPI, HyperTransport, and the NVIDIA GTX295 memory bandwidth).

Figure 2 shows our estimation of the average memory bandwidth requirements of the computationally intensive kernels of some NPB benchmarks, assuming an 800MHz clock frequency, for different values of IPC; it illustrates the need to store the data structures required by accelerators in their own memories. For instance, if all data accesses are done through a PCIe bus, the maximum achievable value of IPC is 50 for bt and 5 for ua, which is a very small fraction of the peak execution rate of the NVIDIA GTX295 GPU. (If the accelerator and the CPU shared the same memory controller, the available accelerator bandwidth would be similar to the HyperTransport curve in Figure 2, which also limits the maximum achievable IPC.) In all cases, the level of IPC that can be supported by the GTX295 memory bandwidth is much higher than what can be supported by PCIe or similar interconnect schemes. In order to achieve optimal system performance, it is crucial to host the data structures accessed by computationally intensive kernels in on-board accelerator memories.
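The text does not state the exact estimation formula behind Figure 2. One plausible reading, assuming each kernel generates on average B bytes of off-chip data traffic per executed instruction (B is an assumed per-benchmark constant, not a value given here), is that the required bandwidth grows linearly with IPC, so an interconnect of bandwidth BW_link caps the achievable IPC:

\[
BW_{req}(\mathrm{IPC}) = \mathrm{IPC} \times f_{clk} \times B
\quad\Longrightarrow\quad
\mathrm{IPC}_{max} = \frac{BW_{link}}{f_{clk} \times B},
\qquad f_{clk} = 800\,\mathrm{MHz}.
\]

Under this reading, the bt and ua figures quoted above simply correspond to different values of B.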
These results discourage the implementation of a fully coherent heterogeneous system due to the high number of coherence requests produced by accelerators during kernel execution (e.g., invalidation requests the first time data initialized by the CPU is accessed by accelerators). Moreover, a fully coherent heterogeneous system would require both CPUs and accelerators to implement the very same coherence protocol. Hence, it would be difficult, if not infeasible, to use the same accelerator (e.g., a GPU) in systems based on different CPU architectures, which would impose a significant economic penalty on accelerator manufacturers. Finally, the logic required to implement the coherence protocol in the accelerator would consume a large silicon area, currently devoted to processing units, which would decrease the benefit of using accelerators.

The capacity of on-board accelerator memories is growing and currently allows many data structures to be hosted in accelerator memories. Current GPUs use 32-bit physical memory addresses and include up to 4GB of memory, and GPU architectures will soon move to larger physical addresses (e.g., 40-bit in NVIDIA Fermi) to support larger memory capacities. IBM QS20 and QS21, the first systems based on the Cell BE, included 512MB and 1GB of main memory per chip, respectively. IBM QS22, the latest Cell-based system, supports up to 16GB of main memory per chip. IBM QS22 is based on the PowerXCell 8i chip, which is an evolution of the original Cell BE chip, modified to support a larger main memory capacity [6]. These two examples illustrate the current trend that allows increasingly larger data structures to be hosted by accelerators and justifies our ADSM design.

Programming models for current heterogeneous parallel systems, such as NVIDIA CUDA [11] and OpenCL [1], present the different memories in the system as distinct memory spaces to the programmer. Applications explicitly request memory from a given memory space (e.g., cudaMalloc()) and perform data transfers between different memory spaces (e.g., cudaMemcpy()). The example in Figure 3 illustrates this situation. First, system memory is allocated (malloc()) and initialized (fread()). Then, accelerator memory is allocated (cudaMalloc()) and the data structure is copied to the accelerator memory (cudaMemcpy()), before code is executed on the accelerator. OpenCL and Roadrunner codes do not significantly differ from the code in Figure 3. Such programming models ensure that data structures reside in the memory of the processor (CPU or accelerator) that performs subsequent computations.

Figure 3. Example code with duplicated pointers and explicit consistency management.
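Figure 3 itself is not reproduced here; the following is a minimal sketch of the pattern it illustrates, written against the CUDA runtime API (the kernel body, launch configuration, and sizes are illustrative assumptions):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    // Illustrative data-parallel method; treats the buffer as size/4 floats.
    __global__ void kernel(float *data, int size) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < size / (int)sizeof(float)) data[i] *= 2.0f;
    }

    void compute(FILE *file, int size) {
        float *foo, *dev_foo;                    // duplicated pointers
        foo = (float *)malloc(size);             // CPU copy in system memory
        fread(foo, size, 1, file);               // initialized by CPU code
        cudaMalloc((void **)&dev_foo, size);     // separate accelerator copy
        cudaMemcpy(dev_foo, foo, size,
                   cudaMemcpyHostToDevice);      // explicit consistency management
        dim3 Dg(64), Db(256);                    // illustrative launch configuration
        kernel<<<Dg, Db>>>(dev_foo, size);       // the kernel may only use dev_foo
        cudaFree(dev_foo);
        free(foo);
    }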
These models also imply that programmers must explicitly request memory on different processors and, thus, a data structure (foo in Figure 3) is referenced by two different memory addresses: foo, a virtual address in system memory, and dev_foo, a physical address in the accelerator memory. Programmers must explicitly manage memory coherence (e.g., with a call to cudaMemcpy()) before executing kernels on the accelerator. This approach also prevents parameters from being passed by reference to accelerator kernels [19], and prevents computationally critical methods from returning pointers to their output data structures instead of the whole output data structure, which would save bandwidth whenever the CPU code only needs to access a small portion of the returned data. These approaches harm portability because they expose data transfer details of the underlying hardware. Offering a programming interface that requires a single allocation call and removes the need for explicit data transfers would increase the programmability and portability of heterogeneous systems; this is the first motivation of this paper.

The cost of data transfers between general purpose CPUs and accelerators might eliminate the benefit of using accelerators. Double buffering can help to alleviate this situation by transferring parts of the data structure while other parts are still in use. In the example of Figure 3, the input data would be read iteratively using a call to fread() followed by an asynchronous DMA transfer to the accelerator memory, as sketched below. Synchronization code is necessary to prevent overwriting system memory that is still in use by an ongoing DMA transfer [19]. The coding effort required to reduce the cost of data transfers harms the programmability of heterogeneous systems. Automatically overlapping data transfers and CPU computation without code modifications is the second motivation of this paper.
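The double-buffering pattern described above is not shown in the text; a minimal sketch of it, using CUDA streams and pinned staging buffers (buffer count, chunk size, names, and the absence of error handling are illustrative choices, not taken from the paper), is:

    #include <stdio.h>
    #include <cuda_runtime.h>

    #define CHUNK (1 << 20)   // illustrative transfer granularity (1MB)

    // Read a file into accelerator memory while overlapping file reads with
    // DMA transfers, alternating between two pinned staging buffers.
    void load_double_buffered(FILE *file, char *dev_dst, size_t total) {
        char *host[2];
        cudaStream_t stream[2];
        for (int i = 0; i < 2; ++i) {
            cudaMallocHost((void **)&host[i], CHUNK);  // pinned, DMA-able memory
            cudaStreamCreate(&stream[i]);
        }
        size_t offset = 0;
        int buf = 0;
        while (offset < total) {
            size_t bytes = (total - offset < CHUNK) ? (total - offset) : CHUNK;
            // Synchronization code: do not overwrite a staging buffer that is
            // still the source of an in-flight DMA transfer.
            cudaStreamSynchronize(stream[buf]);
            fread(host[buf], 1, bytes, file);
            cudaMemcpyAsync(dev_dst + offset, host[buf], bytes,
                            cudaMemcpyHostToDevice, stream[buf]);
            offset += bytes;
            buf ^= 1;
        }
        for (int i = 0; i < 2; ++i) {
            cudaStreamSynchronize(stream[i]);
            cudaStreamDestroy(stream[i]);
            cudaFreeHost(host[i]);
        }
    }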
3. Asymmetric Distributed Shared Memory

Asymmetric Distributed Shared Memory (ADSM) maintains a shared logical memory space for CPUs to access objects in the accelerator physical memory but not vice versa. This section presents ADSM as a data-centric programming model and the benefits of an asymmetric shared address space.

3.1 ADSM as a Data-Centric Programming Model

In a data-centric programming model, programmers allocate or declare data objects that are processed by methods, and annotate performance-critical methods (kernels) that are executed by accelerators. When such methods are assigned to an accelerator, their corresponding data objects are migrated to on-board accelerator memory. Figure 4 shows an example using such a model, based on the code from Figure 3, and illustrates the main benefits of a data-centric programming model with ADSM for heterogeneous systems: architecture independence, legacy-code support, and peer DMA.

Figure 4. Usage example of ADSM.

ADSM removes the need to explicitly request memory in different memory spaces. Programmers assign data objects to methods that might or might not be executed by accelerators. Run-time systems can be easily built under this programming model to automatically assign methods and their data objects to accelerators if they are present in the system. High performance systems are likely to continue having separate physical memories and access paths for CPUs and accelerators. However, CPUs and accelerators are likely to share the same physical memory and access path in low-cost systems. An application written following a data-centric programming model will target both kinds of systems efficiently. When the application is run on a high performance system, accelerator memory is allocated and the run-time system transfers data between system memory and accelerator memory when necessary. In the low-cost case, system memory (shared by the CPU and the accelerator) is allocated and no transfer is done. Independence from the underlying hardware is the first benefit provided by ADSM.

ADSM offers a convenient software migration path for existing applications. Performance critical libraries and application code are moved to accelerator engines, leaving less critical code porting for later stages [6]. CPUs and accelerators do not need to be ISA-compatible to interoperate, as long as data formats and calling conventions are consistent. This programming model provides the run-time system with the information necessary to make data structures accessible to the code that remains for execution by the CPU. This is the second gain offered by a data-centric programming model.

Data objects used by kernels are often read from and written to I/O devices (e.g., a disk or network interface). ADSM enables data structures used by accelerators to be passed as parameters to the corresponding system calls that read or write data for I/O devices (e.g., read() or write()). If supported by the hardware, the run-time system performs DMA transfers directly to and from accelerator memory (peer DMA); otherwise an intermediate buffer in system memory might be used. Applications benefit from peer DMA without any source code modifications, which is the third advantage of this programming model.

3.2 ADSM Run-time Design Rationale

As discussed in Sec. 2.2, distributed memories are necessary to extract all the computational power from heterogeneous systems. However, a data-centric programming model hides the complexity of this distributed memory architecture from programmers.

A DSM run-time system reconciles physically distributed memory and logical shared memory. Traditional DSM systems are prone to thrashing, which is a performance-limiting factor. Thrashing in DSM systems typically occurs when two nodes compete for write access to a single data item, causing data to be transferred back and forth at such a high rate that no work can be done.
Access to synchronization variables, such as locks and semaphores, tends to be a primary cause of thrashing in DSM. This effect is unlikely to occur in heterogeneous systems because hardware interrupts are typically used for synchronization between accelerators and CPUs and, therefore, there are no shared synchronization variables. If synchronization variables are used (e.g., in polling mode), they typically rely on special hardware mechanisms. These special synchronization variables are outside the scope of our ADSM design. False sharing, another source of thrashing in DSM, is not likely to occur either, because sharing is done at the data object granularity.

A major issue in DSM is the memory coherence and consistency model. DSM typically implements a relaxed memory consistency model to minimize coherence traffic. Relaxed consistency models (e.g., release consistency) reduce coherence traffic at the cost of requiring programmers to explicitly use synchronization primitives (e.g., acquire and release). In a data-centric programming model, memory consistency is only relevant from the CPU perspective because all consistency actions are driven by the CPU at method call and return boundaries. Shared data structures are released by the CPU when methods using them are invoked, and data items are acquired by the CPU when methods using them return.

DSM pays a performance penalty for the detection of memory accesses to shared locations that are marked as invalid or dirty. Memory pages that contain invalid or dirty items are marked as not present, so that a page fault is raised whenever the program accesses any of these items. This penalty is especially important for faults produced by performance-critical code whose execution is distributed among the nodes in the cluster. Data cannot be eagerly transferred to each node to avoid this performance penalty, because there is no information about which part of the data each node will access. This limitation is not present in ADSM, since the accelerator executing a method will access only the shared data objects explicitly assigned to the method.

The lack of synchronization variables hosted by shared data structures, the implicit consistency primitives at call/return boundaries, and the knowledge of the data structures accessed by performance-critical methods are the three reasons that lead us to design ADSM as an efficient way to support a data-centric programming model on heterogeneous parallel systems.

3.3 Application Programming Interface and Consistency Model

We identify four fundamental functions an ADSM system must implement: shared-data allocation, shared-data release, method invocation, and return synchronization. Table 1 summarizes these necessary API calls.

Table 1. Compulsory API calls implemented by an ADSM run-time.
  adsmAlloc(size): Allocates size bytes of shared memory and returns the shared memory address where the allocated memory begins.
  adsmFree(addr): Releases a shared memory region that was previously allocated using adsmAlloc().
  adsmCall(kernel): Launches the execution of method kernel in an accelerator.
  adsmSync(): Yields the CPU to other processes until a previous accelerator call finishes.

Shared-data allocation and release calls are used by programmers to declare data objects that will be used by kernels. In its simplest form, the allocation call only requires the size of the data structure, and the release call only requires the starting memory address of the data structure to be released.
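As a usage illustration, the Table 1 calls might be combined as follows, in the spirit of the Figure 4 example (the prototypes and the way kernel arguments are bound to adsmCall() are assumptions made for this sketch, not part of the API definition):

    #include <stdio.h>

    // Assumed prototypes, following Table 1; exact signatures are illustrative.
    void *adsmAlloc(size_t size);
    void  adsmFree(void *addr);
    void  adsmCall(const char *kernel);   // kernel-argument binding not specified here
    void  adsmSync(void);

    void compute(FILE *file, size_t size) {
        float *foo = (float *)adsmAlloc(size); // one pointer, shared logical space
        fread(foo, size, 1, file);             // CPU code initializes the object
        adsmCall("kernel");                    // release: the accelerator kernel now uses foo
        adsmSync();                            // acquire: wait for the accelerator to finish
        // foo is consistent again; legacy CPU code (e.g., a library call or
        // write()) can use it directly, with no explicit copy.
        adsmFree(foo);
    }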
API will be used by all accelerator kernels. A more elaborate scheme would require programmers to pass one or more method identifiers to effectively assign the allocated or released data object to one or more accelerator kernels.

Method invocation and return synchronization are already found in many heterogeneous programming APIs, such as CUDA. The former triggers the execution of a given kernel on an accelerator, while the latter yields the CPU until execution on the accelerator is complete. ADSM employs a release consistency model in which shared data objects are released by the CPU on accelerator invocation (adsmCall()) and acquired by the CPU on accelerator return (adsmSync()). These semantics ensure that accelerators always have access to the objects hosted in their physical memory by the time a kernel operates on them. Implicit acquire/release semantics increase programmability because they require fewer source code lines to be written, and they are quite natural in programming environments such as CUDA, where programmers currently implement this consistency model manually through calls to cudaMemcpy().
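To illustrate the consistency model, the following sketch contrasts explicit CUDA-style data management with the corresponding ADSM-style code. It shows only the calling pattern: the vecScale kernel, the init() and use() helpers, the element count N, and the exact signatures of adsmAlloc(), adsmCall(), and adsmSync() are assumptions made for illustration, not the definitive GMAC API.

    /* Explicit CUDA management: the programmer manually implements the
     * acquire/release pattern with cudaMemcpy() around the kernel call. */
    float *h_v = (float *)malloc(N * sizeof(float));
    float *d_v = NULL;
    cudaMalloc((void **)&d_v, N * sizeof(float));
    init(h_v, N);                                   /* CPU produces the data      */
    cudaMemcpy(d_v, h_v, N * sizeof(float), cudaMemcpyHostToDevice);
    vecScale<<<blocks, threads>>>(d_v, N);          /* hypothetical kernel launch */
    cudaMemcpy(h_v, d_v, N * sizeof(float), cudaMemcpyDeviceToHost);
    use(h_v, N);                                    /* CPU consumes the result    */

    /* ADSM-style equivalent: one pointer is valid on both sides; adsmCall()
     * releases the shared object to the accelerator and adsmSync() re-acquires
     * it for the CPU, so no explicit copies appear in application code. */
    float *v = (float *)adsmAlloc(N * sizeof(float));   /* assumed signature     */
    init(v, N);
    adsmCall(vecScale, v, N);                           /* release on invocation */
    adsmSync();                                         /* acquire on return     */
    use(v, N);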
4. Design and Implementation

This section describes the design and implementation of Global Memory for ACcelerators (GMAC), a user-level ADSM run-time system. The design of GMAC is general enough to be applicable to a broad range of heterogeneous systems, such as NVIDIA GPUs or the IBM PowerXCell 8i. We implement GMAC for GNU/Linux based systems that include CUDA-capable NVIDIA GPUs. The implementation techniques presented here are specific to our target platform, but they can be easily ported to other operating systems (e.g. Microsoft Windows) and accelerator interfaces (e.g. OpenCL).

ADSM might be implemented using a hybrid approach, where low-level functionality is implemented in the operating system kernel and high-level API calls are implemented in a user-level library. Implementing the low-level layers within the operating system kernel allows I/O operations involving shared data structures to be fully supported without performance penalties (see Section 4.4). We implement all GMAC code in a user-level library because there is currently no operating system kernel-level API for interacting with the proprietary NVIDIA driver required by CUDA. By using a software implementation of ADSM, GMAC allows all cores in a cache-coherent multi-core CPU to maintain and access memories shared with accelerators.

4.1 Overall Design

Figure 5 shows the overall design of the GMAC library. The lower-level layers (OS Abstraction Layer and Accelerator Abstraction Layer) are operating system and accelerator dependent. They offer an interface to the upper-level layers for allocating system and accelerator memory, setting memory page permission bits, transferring data between system and accelerator memory, invoking kernels, waiting for completion of accelerator execution, performing I/O functions, and converting data types, if necessary. In our reference implementation the OS Abstraction Layer interacts with POSIX-compatible operating systems such as GNU/Linux. We implement two different Accelerator Abstraction Layers to interact with CUDA-capable accelerators: the CUDA Run-Time Layer and the CUDA Driver Layer. The former interacts with the CUDA run-time, which offers limited control over the accelerators attached to the system. The latter interacts with the CUDA driver, which offers a low-level API and thus allows full control over accelerators, at the cost of more complex programming of the GMAC code base. At application load-time, the user can select which of the two Accelerator Abstraction Layers to use. In Section 5 we use the CUDA Run-Time Layer to compare the performance of GMAC against CUDA, and the CUDA Driver Layer to extract a detailed execution-time breakdown of applications.

Figure 5. Software layers that make up the GMAC library.

The top-level layer implements the GMAC API to be used by applications and libraries. The Shared Memory Manager in Figure 5 manages shared memory areas and creates the shared address space between host CPUs and accelerators. An independent memory management module allows testing of different management policies and coherence protocols with minor modifications to the GMAC code base. The kernel scheduler selects the most appropriate accelerator for execution of a given kernel, and implements different scheduling policies depending on the execution environment. A detailed analysis of kernel scheduling is beyond the scope of this paper; we refer the reader to Jimenez et al. [29].

4.2 Shared Address Space

GMAC builds a shared address space between the CPUs and the accelerator. When the application requests shared memory (adsmAlloc()), accelerator memory is allocated on the accelerator, returning a memory address (virtual or physical, depending on the accelerator) that can only be used by the accelerator. Then, we request the operating system to allocate system memory over the same range of virtual memory addresses. In a POSIX-compliant operating system this is done through the mmap system call, which accepts a virtual address as a parameter and maps it to an allocated range of system memory (anonymous memory mapping). At this point, two identical memory address ranges have been allocated, one in the accelerator memory and the other in system memory. Hence, a single pointer can be returned to the application to be used by both CPU code and accelerator code.

The operating system memory mapping request might fail if the requested virtual address range is already in use. In our single-GPU target system this is unlikely to happen because the address range typically returned by calls to cudaMalloc() is outside the ELF program sections. However, this implementation technique might fail when using other accelerators (e.g. ATI/AMD GPUs) or on multi-GPU systems, where calls to cudaMalloc() for different GPUs are likely to return overlapping memory address ranges. A software-based solution for this situation requires two new API calls: adsmSafeAlloc(size) and adsmSafe(address). The former allocates a shared memory region, but returns a pointer that is only valid in the CPU code. The latter takes a CPU address and returns the associated address for the target GPU. GMAC maintains the mapping between these associated addresses so that any CPU changes to the shared address region will be reflected in the accelerator memory. Although this software technique works in all cases where shared data structures do not contain embedded pointers, it requires programmers to explicitly call adsmSafe() when passing
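The single-GPU allocation path described in this section can be summarized in a short sketch. This is not the GMAC source: the helper name, the choice of mmap() flags, and the error-handling policy are assumptions made for illustration, and the multi-GPU case would instead rely on the adsmSafeAlloc()/adsmSafe() pair discussed above.

    /* Minimal sketch of the shared-address-space allocation, assuming a
     * single CUDA GPU. cudaMalloc() picks the address; mmap() is then asked
     * to place anonymous system memory at that same virtual address. */
    #include <sys/mman.h>
    #include <cuda_runtime.h>

    static void *adsm_alloc_sketch(size_t size)
    {
        void *dev = NULL;
        if (cudaMalloc(&dev, size) != cudaSuccess)
            return NULL;                        /* accelerator allocation failed */

        /* Without MAP_FIXED the address is only a hint, so the result must be
         * checked: if the range is already in use the mapping lands elsewhere
         * and the allocation is rejected, mirroring the failure case above. */
        void *host = mmap(dev, size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (host == MAP_FAILED || host != dev) {
            if (host != MAP_FAILED)
                munmap(host, size);
            cudaFree(dev);
            return NULL;
        }
        return dev;    /* one pointer, usable by both CPU and accelerator code */
    }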