Practical vectorization
S. Ponce - CERN

Side note: what to look at?

What you should look at
  - Specific, CPU-intensive pieces of code
  - The most time-consuming functions
  - A very small subset of your code (often < 5%)

Where you should not waste your time
  - Trying to get an overall picture of vectorization in your application
  - Most of the code won't use vectors anyway

Crash course in SIMD assembly

Register names
  - SSE: xmm0 to xmm15 (128 bits)
  - AVX2: ymm0 to ymm15 (256 bits)
  - AVX512: zmm0 to zmm31 (512 bits)
  - In scalar mode, SSE registers are used

Floating point instruction names
  - <op><simd or not><raw type>, where
      - <op> is something like vmul, vadd, vmov or vfmadd
      - <simd or not> is either 's' for scalar or 'p' for packed (i.e. vector)
      - <raw type> is either 's' for single precision or 'd' for double precision
  - Typically: vmulss, vmovaps, vaddpd, vfmaddpd
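
The naming scheme above can be sketched as a small Python helper. This is only an illustration: the function names `decode_fp_mnemonic` and `lanes` are hypothetical, and the check looks at the last two letters only, so it will also accept unrelated mnemonics that happen to end the same way (the integer instruction vpabsd, for instance). Note that for vmovaps the extra 'a' (aligned) stays in the `<op>` part.

```python
# Illustrative helper (not part of any real tool): splits a floating point
# mnemonic into the three fields <op><simd or not><raw type>.
MODE = {"s": "scalar", "p": "packed"}
PRECISION = {"s": "single", "d": "double"}
REGISTER_BITS = {"xmm": 128, "ymm": 256, "zmm": 512}
TYPE_BITS = {"single": 32, "double": 64}


def decode_fp_mnemonic(mnemonic):
    """Return (op, 'scalar'|'packed', 'single'|'double'), or None if no match."""
    m = mnemonic.lower()
    # Naive assumption: only the last two letters are inspected.
    if len(m) < 3 or m[-2] not in MODE or m[-1] not in PRECISION:
        return None
    return m[:-2], MODE[m[-2]], PRECISION[m[-1]]


def lanes(register_family, precision):
    """Number of values of the given precision that fit in one register."""
    return REGISTER_BITS[register_family] // TYPE_BITS[precision]


for name in ("vmulss", "vmovaps", "vaddpd", "vfmaddpd"):
    print(name, decode_fp_mnemonic(name))
print("floats per ymm register:", lanes("ymm", "single"))
```

The `lanes` helper only restates the register widths listed above: a 256-bit ymm register holds 8 single precision or 4 double precision values.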

Practical look at assembly

Extract assembly code
  - Run objdump -d -C on your executable or library
  - Search for your function name

Check for vectorization
  - For avx2, look for ymm
  - For avx512, look for zmm
  - Otherwise look for instructions with ps or pd at the end
      - but ignore mov operations
      - only concentrate on arithmetic ones
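
The checks above can be automated with a few lines of Python. The sketch below is one possible reading of the rules, not a definitive tool: `vector_width` is a hypothetical name, the input is assumed to be AT&T-syntax text as printed by objdump -d (address, hex bytes, mnemonic, operands), and the two sample listings are short made-up excerpts in that format. It skips mov operations, keeps only instructions ending in ps or pd, and reports the widest register they touch.

```python
import re

# One disassembly line: "addr: <hex bytes> mnemonic operands" (assumed layout).
LINE = re.compile(r"^\s*[0-9a-f]+:\s+(?:[0-9a-f]{2}\s+)+([a-z][a-z0-9]*)\s*(.*)$")
WIDTH = {"xmm": 128, "ymm": 256, "zmm": 512}


def vector_width(disassembly):
    """Widest register (in bits) used by packed FP arithmetic, 0 if none."""
    widest = 0
    for line in disassembly.splitlines():
        match = LINE.match(line)
        if not match:
            continue
        mnemonic, operands = match.groups()
        if "mov" in mnemonic:
            continue  # ignore mov operations
        if not mnemonic.endswith(("ps", "pd")):
            continue  # keep only packed single/double instructions
        for family in re.findall(r"%([xyz]mm)\d+", operands):
            widest = max(widest, WIDTH[family])
    return widest


# Hypothetical sample inputs, written in objdump's output format.
PACKED = """\
 d18: c5 fc 59 d8 vmulps %ymm0,%ymm0,%ymm3
 d1c: c5 fc 58 c0 vaddps %ymm0,%ymm0,%ymm0
"""
SCALAR = """\
 b97: 0f 28 e5 movaps %xmm5,%xmm4
 b9a: f3 0f 59 e5 mulss %xmm5,%xmm4
"""

print(vector_width(PACKED))  # 256: packed arithmetic on ymm
print(vector_width(SCALAR))  # 0: ps appears only in a mov
```

A result of 0 means no packed arithmetic was found, 128 points to SSE-width vectors, 256 to ymm registers and 512 to zmm registers.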

Exercise 1

Code
    d18: c5 fc 59 d8       vmulps %ymm0,%ymm0,%ymm3
    d1c: c5 fc 58 c0       vaddps %ymm0,%ymm0,%ymm0
    d20: c5 e4 5c de       vsubps %ymm6,%ymm3,%ymm3
    d24: c4 c1 7c 59 c0    vmulps %ymm8,%ymm0,%ymm0
    d29: c4 c1 64 58 da    vaddps %ymm10,%ymm3,%ymm3
    d2e: c4 41 7c 58 c3    vaddps %ymm11,%ymm0,%ymm8
    d33: c5 e4 59 d3       vmulps %ymm3,%ymm3,%ymm2
    d37: c4 c1 3c 59 f0    vmulps %ymm8,%ymm8,%ymm6
    d3c: c5 ec 58 d6       vaddps %ymm6,%ymm2,%ymm2

Solution
  - Presence of ymm
  - Vectorized, AVX level

Exercise 2

Code
    b97: 0f 28 e5          movaps %xmm5,%xmm4
    b9a: f3 0f 59 e5       mulss %xmm5,%xmm4
    b9e: f3 0f 58 ed       addss %xmm5,%xmm5
    ba2: f3 0f 59 ee       mulss %xmm6,%xmm5
    ba6: f3 0f 5c e7       subss %xmm7,%xmm4
    baa: 0f 28 f5          movaps %xmm5,%xmm6
    bad: f3 41 0f 58 e0    addss %xmm8,%xmm4
    bb2: f3 0f 58 f2       addss %xmm2,%xmm6
    bb6: 0f 28 ec          movaps %xmm4,%xmm5

Solution
  - Presence of xmm, but ps only in mov
  - Not vectorized