OpenMP API • Version 4.0 - July 2013

1.3 Execution Model

The OpenMP API uses the fork-join model of parallel execution. Multiple threads of execution perform tasks defined implicitly or explicitly by OpenMP directives. The OpenMP API is intended to support programs that will execute correctly both as parallel programs (multiple threads of execution and a full OpenMP support library) and as sequential programs (directives ignored and a simple OpenMP stubs library). However, it is possible and permitted to develop a program that executes correctly as a parallel program but not as a sequential program, or that produces different results when executed as a parallel program compared to when it is executed as a sequential program. Furthermore, using different numbers of threads may result in different numeric results because of changes in the association of numeric operations. For example, a serial addition reduction may have a different pattern of addition associations than a parallel reduction. These different associations may change the results of floating-point addition.

An OpenMP program begins as a single thread of execution, called an initial thread. An initial thread executes sequentially, as if enclosed in an implicit task region, called an initial task region, that is defined by the implicit parallel region surrounding the whole program.

The thread that executes the implicit parallel region that surrounds the whole program executes on the host device. An implementation may support other target devices. If supported, one or more devices are available to the host device for offloading code and data. Each device has its own threads that are distinct from threads that execute on another device. Threads cannot migrate from one device to another device. The execution model is host-centric such that the host device offloads target regions to target devices.
The initial thread that executes the implicit parallel region that surrounds the target region may execute on a target device. An initial thread executes sequentially, as if enclosed in an implicit task region, called an initial task region, that is defined by an implicit inactive parallel region that surrounds the entire target region.

When a target construct is encountered, the target region is executed by the implicit device task. The task that encounters the target construct waits at the end of the construct until execution of the region completes. If a target device does not exist, or the target device is not supported by the implementation, or the target device cannot execute the target construct, then the target region is executed by the host device.

The teams construct creates a league of thread teams where the master thread of each team executes the region. Each of these master threads is an initial thread, and executes sequentially, as if enclosed in an implicit task region that is defined by an implicit parallel region that surrounds the entire teams region.
If a construct creates a data environment, the data environment is created at the time the construct is encountered. Whether a construct creates a data environment is defined in the description of the construct.

When any thread encounters a parallel construct, the thread creates a team of itself and zero or more additional threads and becomes the master of the new team. A set of implicit tasks, one per thread, is generated. The code for each task is defined by the code inside the parallel construct. Each task is assigned to a different thread in the team and becomes tied; that is, it is always executed by the thread to which it is initially assigned. The task region of the task being executed by the encountering thread is suspended, and each member of the new team executes its implicit task. There is an implicit barrier at the end of the parallel construct. Only the master thread resumes execution beyond the end of the parallel construct, resuming the task region that was suspended upon encountering the parallel construct. Any number of parallel constructs can be specified in a single program.

parallel regions may be arbitrarily nested inside each other. If nested parallelism is disabled, or is not supported by the OpenMP implementation, then the new team that is created by a thread encountering a parallel construct inside a parallel region will consist only of the encountering thread. However, if nested parallelism is supported and enabled, then the new team can consist of more than one thread. A parallel construct may include a proc_bind clause to specify the places to use for the threads in the team within the parallel region.

When any team encounters a worksharing construct, the work inside the construct is divided among the members of the team, and executed cooperatively instead of being executed by every thread. There is a default barrier at the end of each worksharing construct unless the nowait clause is present.
Redundant execution of code by every thread in the team resumes after the end of the worksharing construct.

When any thread encounters a task construct, a new explicit task is generated. Execution of explicitly generated tasks is assigned to one of the threads in the current team, subject to the thread's availability to execute work. Thus, execution of the new task could be immediate, or deferred until later according to task scheduling constraints and thread availability. Threads are allowed to suspend the current task region at a task scheduling point in order to execute a different task. If the suspended task region is for a tied task, the initially assigned thread later resumes execution of the suspended task region. If the suspended task region is for an untied task, then any thread may resume its execution. Completion of all explicit tasks bound to a given parallel region is guaranteed before the master thread leaves the implicit barrier at the end of the region. Completion of a subset of all explicit tasks bound to a given parallel region may be specified through the use of task synchronization constructs. Completion of all explicit tasks bound to the implicit parallel region is guaranteed by the time the program exits.

When any thread encounters a simd construct, the iterations of the loop associated with the construct may be executed concurrently using the SIMD lanes that are available to the thread.
The cancel construct can alter the previously described flow of execution in an OpenMP region. The effect of the cancel construct depends on its construct-type-clause. If a task encounters a cancel construct with a taskgroup construct-type-clause, then the task activates cancellation and continues execution at the end of its task region, which implies completion of that task. Any other task in that taskgroup that has begun executing completes execution unless it encounters a cancellation point construct, in which case it continues execution at the end of its task region, which implies its completion. Other tasks in that taskgroup region that have not begun execution are aborted, which implies their completion.

For all other construct-type-clause values, if a thread encounters a cancel construct, it activates cancellation of the innermost enclosing region of the type specified and the thread continues execution at the end of that region. Threads check if cancellation has been activated for their region at cancellation points and, if so, also resume execution at the end of the canceled region.

If cancellation has been activated regardless of construct-type-clause, threads that are waiting inside a barrier other than an implicit barrier at the end of the canceled region exit the barrier and resume execution at the end of the canceled region. This action can occur before the other threads reach that barrier.

Synchronization constructs and library routines are available in the OpenMP API to coordinate tasks and data access in parallel regions. In addition, library routines and environment variables are available to control or to query the runtime environment of OpenMP programs.

The OpenMP specification makes no guarantee that input or output to the same file is synchronous when executed in parallel.
In this case, the programmer is responsible for synchronizing input and output statements (or routines) using the provided synchronization constructs or library routines. For the case where each thread accesses a different file, no synchronization by the programmer is necessary.
1.4 Memory Model

1.4.1 Structure of the OpenMP Memory Model

The OpenMP API provides a relaxed-consistency, shared-memory model. All OpenMP threads have access to a place to store and to retrieve variables, called the memory. In addition, each thread is allowed to have its own temporary view of the memory. The temporary view of memory for each thread is not a required part of the OpenMP memory model, but can represent any kind of intervening structure, such as machine registers, cache, or other local storage, between the thread and the memory. The temporary view of memory allows the thread to cache variables and thereby to avoid going to memory for every reference to a variable. Each thread also has access to another type of memory that must not be accessed by other threads, called threadprivate memory.

A directive that accepts data-sharing attribute clauses determines two kinds of access to variables used in the directive's associated structured block: shared and private. Each variable referenced in the structured block has an original variable, which is the variable by the same name that exists in the program immediately outside the construct. Each reference to a shared variable in the structured block becomes a reference to the original variable. For each private variable referenced in the structured block, a new version of the original variable (of the same type and size) is created in memory for each task or SIMD lane that contains code associated with the directive. Creation of the new version does not alter the value of the original variable. However, the impact of attempts to access the original variable during the region associated with the directive is unspecified; see Section 2.14.3.3 on page 159 for additional details.

References to a private variable in the structured block refer to the private version of the original variable for the current task or SIMD lane.
The relationship between the value of the original variable and the initial or final value of the private version depends on the exact clause that specifies it. Details of this issue, as well as other issues with privatization, are provided in Section 2.14 on page 146.

The minimum size at which a memory update may also read and write back adjacent variables that are part of another variable (as array or structure elements) is implementation defined but is no larger than required by the base language.

A single access to a variable may be implemented with multiple load or store instructions, and hence is not guaranteed to be atomic with respect to other accesses to the same variable. Accesses to variables smaller than the implementation defined minimum size or to C or C++ bit-fields may be implemented by reading, modifying, and rewriting a larger unit of memory, and may thus interfere with updates of variables or fields in the same unit of memory.
If multiple threads write without synchronization to the same memory unit, including cases due to atomicity considerations as described above, then a data race occurs. Similarly, if at least one thread reads from a memory unit and at least one thread writes without synchronization to that same memory unit, including cases due to atomicity considerations as described above, then a data race occurs. If a data race occurs then the result of the program is unspecified.

A private variable in a task region that eventually generates an inner nested parallel region is permitted to be made shared by implicit tasks in the inner parallel region. A private variable in a task region can be shared by an explicit task region generated during its execution. However, it is the programmer's responsibility to ensure through synchronization that the lifetime of the variable does not end before completion of the explicit task region sharing it. Any other access by one task to the private variables of another task results in unspecified behavior.

1.4.2 Device Data Environments

When an OpenMP program begins, each device has an initial device data environment. The initial device data environment for the host device is the data environment associated with the initial task region. Directives that accept data-mapping attribute clauses determine how an original variable is mapped to a corresponding variable in a device data environment. The original variable is the variable with the same name that exists in the data environment of the task that encounters the directive.

If a corresponding variable is present in the enclosing device data environment, the new device data environment inherits the corresponding variable from the enclosing device data environment. If a corresponding variable is not present in the enclosing device data environment, a new corresponding variable (of the same type and size) is created in the new device data environment.
In the latter case, the initial value of the new corresponding variable is determined from the clauses and the data environment of the encountering thread.

The corresponding variable in the device data environment may share storage with the original variable. Writes to the corresponding variable may alter the value of the original variable. The impact of this on memory consistency is discussed in Section 1.4.4 on page 20. When a task executes in the context of a device data environment, references to the original variable refer to the corresponding variable in the device data environment.

The relationship between the value of the original variable and the initial or final value of the corresponding variable depends on the map-type. Details of this issue, as well as other issues with mapping a variable, are provided in Section 2.14.5 on page 177.

The original variable in a data environment and the corresponding variable(s) in one or more device data environments may share storage. Without intervening synchronization data races can occur.