Synchronization
– Synchronization == Control Sharing
– Barriers make threads wait until all threads catch up
– Waiting is lost opportunity for work
– Atomic operations may reduce waiting
– Watch out for serialization
– Important: be aware of which items of work are truly independent
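To make the barrier and atomic points concrete, here is a minimal CUDA sketch (not from the slides; blockSum and the one-atomic-per-block layout are illustrative choices): __syncthreads() is the block-wide barrier, and atomicAdd avoids a lock at the cost of serializing contending threads.

```cuda
// Block-level sum: __syncthreads() is the barrier, atomicAdd the
// atomic operation. Contending atomics serialize, so most work is
// done independently and only one global atomic is issued per block.
// Assumes *out was zero-initialized by the host.
__global__ void blockSum(const int *in, int *out, int n) {
    __shared__ int partial;
    if (threadIdx.x == 0) partial = 0;
    __syncthreads();                       // barrier: wait for init

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&partial, in[i]); // shared-memory atomic
    __syncthreads();                       // barrier: all adds done

    if (threadIdx.x == 0) atomicAdd(out, partial); // one global atomic per block
}
```

Accumulating in shared memory and issuing a single global atomicAdd per block keeps the serialized portion small, which is the point of "watch out for serialization".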
Parallel Programming Coding Styles – Program and Data Models
– Program Models
  – SPMD
  – Master/Worker
  – Loop Parallelism
  – Fork/Join
– Data Models
  – Shared Data
  – Shared Queue
  – Distributed Array
– These are not necessarily mutually exclusive.
Program Models
– SPMD (Single Program, Multiple Data)
  – All PEs (Processing Elements) execute the same program in parallel, but each has its own data
  – Each PE uses a unique ID to access its portion of data
  – Different PEs can follow different paths through the same code
  – This is essentially the CUDA Grid model (also OpenCL, MPI)
  – SIMD is a special case – warps are used for efficiency
– Master/Worker
– Loop Parallelism
– Fork/Join
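The "unique ID" idea is exactly how a CUDA kernel is written. A minimal sketch (the vector-add computation and the names vecAdd, a, b, c are illustrative, not from the slides):

```cuda
// SPMD in CUDA: every thread runs the same program (this kernel),
// but its unique global ID selects which element of the data it owns.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique ID
    if (i < n)          // threads past the end follow a different path
        c[i] = a[i] + b[i];
}

// Host-side launch: N threads in blocks of 256 (a typical choice).
// vecAdd<<<(N + 255) / 256, 256>>>(d_a, d_b, d_c, N);
```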
SPMD
– 1. Initialize: establish localized data structures and communication channels.
– 2. Uniquify: each thread acquires a unique identifier, typically ranging from 0 to N-1, where N is the number of threads. Both OpenMP and CUDA have built-in support for this.
– 3. Distribute data: decompose global data into chunks and localize them, or share/replicate major data structures, using thread IDs to associate subsets of the data with threads.
– 4. Compute: run the core computation! Thread IDs differentiate the behavior of individual threads. Use the thread ID in loop index calculations to split loop iterations among threads (beware of the potential for memory/data divergence). Use the thread ID, or conditions based on it, to branch to thread-specific actions (beware of the potential for instruction/execution divergence).
– 5. Finalize: reconcile global data structures and prepare for the next major iteration or group of program phases.
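A skeleton CUDA kernel annotated with the five phases, assuming a launch with exactly 256 threads per block (the block-local reduction is a made-up placeholder computation):

```cuda
// Assumes blockDim.x == 256 (a power of two) to match the shared array.
__global__ void spmdPhases(const float *in, float *out, int n) {
    // 1. Initialize: set up per-block local storage.
    __shared__ float local[256];

    // 2. Uniquify: built-in indices give each thread a unique ID.
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    // 3. Distribute data: the ID picks this thread's chunk.
    local[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();

    // 4. Compute: IDs split the work; here, a block-local reduction.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)                        // execution-divergence risk
            local[tid] += local[tid + s];
        __syncthreads();
    }

    // 5. Finalize: reconcile into the global data structure.
    if (tid == 0) out[blockIdx.x] = local[0];
}
```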
Program Models
– SPMD (Single Program, Multiple Data)
– Master/Worker (OpenMP, OpenACC, TBB)
  – A master thread sets up a pool of worker threads and a bag of tasks
  – Workers execute concurrently, removing tasks until done
– Loop Parallelism (OpenMP, OpenACC, C++ AMP)
  – Loop iterations execute in parallel
  – FORTRAN do-all (truly parallel), do-across (with dependences)
– Fork/Join (POSIX pthreads)
  – The most general, generic way of creating threads
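As a hedged sketch, the Master/Worker "bag of tasks" idea can even be expressed in CUDA itself: the host acts as master, and persistent worker threads pull task indices from a global atomic counter. The names taskCounter, processTask, and numTasks are hypothetical, not from the slides.

```cuda
__device__ int taskCounter = 0;   // the "bag": next unclaimed task index
                                  // (reset between kernel launches)

__device__ void processTask(int task) { /* placeholder for real work */ }

// Workers (persistent threads) repeatedly grab tasks until the bag is empty.
__global__ void workers(int numTasks) {
    while (true) {
        int task = atomicAdd(&taskCounter, 1);  // remove one task
        if (task >= numTasks) break;            // bag is empty: done
        processTask(task);
    }
}

// Host (master): set up the pool of workers and hand them the bag of tasks.
// workers<<<numBlocks, threadsPerBlock>>>(numTasks);
```

Because workers claim tasks dynamically, this pattern load-balances uneven task sizes, which a static split of loop iterations cannot.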