Practical vectorization (S. Ponce, CERN)

Promises of vectorization

Theoretical gains
  - Computation speed-up corresponding to the vector width
  - Note that it depends on the type of data
      - float vs double
      - shorts vs ints

Various units for various vector widths

  Name             Arch   nb bits    nb floats/ints   nb doubles/longs
  SSE4 [1]         x86    128        4                2
  AVX [2]          x86    256        8                4
  AVX2 [2] (FMA)   x86    256        8                4
  AVX-512 [2]      x86    512        16               8
  SVE [3]          ARM    128-2048   4-64             2-32

  [1] Streaming SIMD Extensions
  [2] Advanced Vector eXtensions
  [3] Scalable Vector Extension
How to know what you can use

Manually
  - Look for sse, avx, etc. in your processor flags

    lscpu | egrep 'mmx|sse|avx'

    Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
    pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
    rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
    nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx
    smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt
    tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb
    pti ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase
    smep erms xsaveopt dtherm ida arat pln pts
Situation for Intel processors

[Figure: instruction sets supported by each Intel processor generation]
  - Nehalem (2009), Westmere (2010): Intel Xeon processors (legacy). SSE* (128-bit)
  - Sandy Bridge (2012), Ivy Bridge (2013): Intel Xeon E3/E5 family, E3 v2/E5 v2/E7 v2. AVX (256-bit), SSE*
  - Haswell (2014), Broadwell (2015): Intel Xeon E3 v3/E5 v3/E7 v3, E3 v4/E5 v4/E7 v4. AVX2 (256-bit), AVX, SSE*
  - Knights Corner (2012): Intel Xeon Phi coprocessor x100. IMCI (512-bit)
  - Knights Landing (2016): Intel Xeon Phi processor x200. AVX-512F, AVX-512CD, AVX-512ER, AVX-512PF (512-bit), AVX2, AVX, SSE*
  - Skylake (2017): Intel Xeon Scalable processor family. AVX-512F, AVX-512CD, AVX-512BW, AVX-512DQ, AVX-512VL (512-bit), AVX2, AVX, SSE*
Measuring vectorization

  1 Introduction
  2 Measuring vectorization
  3 Vectorization Prerequisite
  4 Vectorizing techniques in C++
  5 What to expect?
Am I using vector registers?
  - Yes, you are
  - Vector registers are also used for scalar operations
  - Remember Andrzej's picture [figure with parts labelled "Wasted" and "Used"]

Am I efficiently using vector registers?
  - Here we have to look at the generated assembly code
  - Looking for specific instructions
  - Or for the use of specific register names