GPGPU
➢ Hardware
  ❑ Many-core
  ❑ SIMD model
➢ For massive data-parallel computation
  ❑ High throughput
  ❑ Low cost
Outline
➢ Background and Motivation
➢ Platform and Algorithms
➢ Results
➢ Conclusion and Future Work
Platform Overview
http://parabna.weebly.com/
[Figure: processing pipeline — Acquisition/Preprocessing (functional MRI → BOLD time series) → Network Construction (correlation matrix → average/binary adjacency matrix) → Further Analysis (degree distribution, clustering coefficient, efficiency, modular structure)]
➢ Our focus: the GPU-accelerated parts of the pipeline (network construction and further analysis)
Network Construction
➢ Temporal Pearson correlation:

   \hat{r}_{ij} = \frac{\sum_t (v_{it} - \bar{v}_i)(v_{jt} - \bar{v}_j)}{\sqrt{\sum_t (v_{it} - \bar{v}_i)^2 \sum_t (v_{jt} - \bar{v}_j)^2}}

➢ v_i = (v_{i1}, v_{i2}, …, v_{iL})^T, i = 1, 2, …, N: BOLD signal of region i
➢ [Gembris 2010]: straightforward implementation
➢ Computing the numerator \sum_t (v_{it} - \bar{v}_i)(v_{jt} - \bar{v}_j) for all pairs at once:
  ❑ After centering each signal, this is a matrix multiplication: R = V^T V, with V = (v_1, v_2, …, v_N)
  ❑ One thread computes a 16×16 tile → data reuse in registers
  ❑ 1400 Gflop/s on AMD 5870
  ❑ Computation is no longer the bottleneck (data transfer over PCIe is)
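A minimal CPU-side sketch in Python (not the authors' GPU kernel) of the reformulation above: once each BOLD signal is centered and scaled to unit norm, the full pairwise Pearson correlation matrix is exactly the matrix product V^T V. The function name and the random test signals are illustrative placeholders.

```python
import math
import random

def pearson_matrix(signals):
    """Pairwise Pearson correlations of N signals of length L,
    computed as R = V^T V after centering each signal and scaling
    it to unit norm (the matrix-multiplication formulation)."""
    normed = []
    for v in signals:
        mean = sum(v) / len(v)
        centered = [x - mean for x in v]
        norm = math.sqrt(sum(x * x for x in centered))
        normed.append([x / norm for x in centered])
    n = len(normed)
    # R[i][j] = dot(v_i, v_j): one entry of the V^T V product.
    return [[sum(a * b for a, b in zip(normed[i], normed[j]))
             for j in range(n)] for i in range(n)]

random.seed(0)
signals = [[random.gauss(0.0, 1.0) for _ in range(128)] for _ in range(4)]
R = pearson_matrix(signals)
print(round(R[0][0], 6))  # diagonal entries are exactly 1
```

On the GPU this dot-product loop becomes a tiled matrix multiply, which is what enables the register reuse and throughput quoted on the slide.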
Network Construction — Scalability
➢ R = V^T V, but R exceeds graphics memory.
➢ Blocked matrix multiplication: partition V = (V_1, V_2, …, V_D) by columns, then

   R = V^T V =
   [ V_1^T V_1   V_1^T V_2   ⋯   V_1^T V_D ]
   [ V_2^T V_1   V_2^T V_2   ⋯   V_2^T V_D ]
   [     ⋮           ⋮        ⋱      ⋮     ]
   [ V_D^T V_1   V_D^T V_2   ⋯   V_D^T V_D ]

   and compute one block V_a^T V_b at a time.

   CPU time (s)   GPU time (s)   Speedup
   245.8          2.0            123x
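The blocking scheme above can be sketched in plain Python (a CPU illustration, not the actual GPU code; the block size and matrix layout are assumptions). Each V_a^T V_b tile is computed independently, which is what lets the GPU version stream tiles through device memory instead of holding all of R at once.

```python
import random

def blocked_gram(V, block):
    """Compute R = V^T V tile by tile. V is L x N, row-major:
    V[t][i] is sample t of signal i. The two nested block loops
    visit R in (block x block) tiles, mirroring how each
    V_a^T V_b block can be computed (and, on a GPU, transferred)
    independently of the others."""
    L, N = len(V), len(V[0])
    R = [[0.0] * N for _ in range(N)]
    for a in range(0, N, block):          # selects row block V_a^T
        for b in range(0, N, block):      # selects column block V_b
            for i in range(a, min(a + block, N)):
                for j in range(b, min(b + block, N)):
                    R[i][j] = sum(V[t][i] * V[t][j] for t in range(L))
    return R

random.seed(1)
L, N = 16, 6
V = [[random.random() for _ in range(N)] for _ in range(L)]
R_blocked = blocked_gram(V, block=2)
# Reference: the unblocked Gram matrix, computed directly.
R_full = [[sum(V[t][i] * V[t][j] for t in range(L)) for j in range(N)]
          for i in range(N)]
print(all(abs(R_blocked[i][j] - R_full[i][j]) < 1e-12
          for i in range(N) for j in range(N)))
```

Since R is symmetric (V_b^T V_a = (V_a^T V_b)^T), a real implementation would compute only the upper-triangular blocks, roughly halving the work.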