Interleaved Vector Memory System
1/27/2021 中国科学技术大学
• Cray-1: 16 banks, 4-cycle bank busy time, 12-cycle latency
– Bank busy time: the time before a bank is ready to accept the next request
– If stride = 1, consecutive elements are interleaved across banks, and number of banks >= bank latency (that is, the bank busy time), then memory can sustain a throughput of 1 element/cycle
[Figure: an address generator combines Base and Stride to address 16 memory banks (0 to F), which feed the vector registers.]
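The throughput condition above can be checked with a small simulation. This is a minimal sketch, not a model of the real Cray-1 memory pipeline: the function name `sustained_cycles` is hypothetical, banks are assumed to be word-interleaved (bank = word address mod number of banks), and the address generator is assumed to issue at most one request per cycle.

```python
def sustained_cycles(num_banks, busy_time, stride, n_elements):
    """Cycles needed to issue n_elements requests, at most one per cycle.

    Assumption: word-interleaved banks, and a bank rejects new requests
    for busy_time cycles after it accepts one.
    """
    free_at = [0] * num_banks  # first cycle at which each bank accepts a request
    cycle = 0
    for i in range(n_elements):
        bank = (i * stride) % num_banks
        cycle = max(cycle, free_at[bank])  # stall until the bank is ready
        free_at[bank] = cycle + busy_time
        cycle += 1                         # one new address per cycle at most
    return cycle


# Cray-1 parameters from the slide: 16 banks, 4-cycle bank busy time.
print(sustained_cycles(16, 4, stride=1, n_elements=64))   # 64: one element/cycle
print(sustained_cycles(16, 4, stride=16, n_elements=64))  # 253: every access hits bank 0
```

With stride 1 a bank is revisited only every 16 cycles, well after its 4-cycle busy time has elapsed, so no request stalls. With stride 16 every request maps to the same bank and throughput drops to one element per busy time.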
Example (App. F, F-15)
Suppose we want to fetch a vector of 64 elements starting at byte address 136, and a memory access takes 6 clocks. How many memory banks must we have to support one fetch per clock cycle? With what addresses are the banks accessed? When will the various elements arrive at the CPU?
Cycle   Bank
no.     0     1     2     3     4     5     6     7
0             136
1             busy  144
2             busy  busy  152
3             busy  busy  busy  160
4             busy  busy  busy  busy  168
5             busy  busy  busy  busy  busy  176
6                   busy  busy  busy  busy  busy  184
7       192               busy  busy  busy  busy  busy
8       busy  200               busy  busy  busy  busy
9       busy  busy  208               busy  busy  busy
10      busy  busy  busy  216               busy  busy
11      busy  busy  busy  busy  224               busy
12      busy  busy  busy  busy  busy  232
13            busy  busy  busy  busy  busy  240
14                  busy  busy  busy  busy  busy  248
15      256               busy  busy  busy  busy  busy
16      busy  264               busy  busy  busy  busy

Figure F.7 Memory addresses (in bytes) by bank number and time slot at which access begins. Each memory bank latches the element address at the start of an access and is then busy for 6 clock cycles before returning a value to the CPU. Note that the CPU cannot keep all eight banks busy all the time because it is limited to supplying one new address and receiving one data item each cycle.
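The example can be worked through in a few lines. This is a sketch under stated assumptions, not the textbook's own solution: the function name `bank_schedule` is hypothetical, elements are assumed to be 8 bytes wide with unit stride (consistent with the addresses 144, 160, 168, ... in the figure), the bank count is rounded up to a power of two (the caption mentions eight banks), one address is issued per cycle starting at cycle 0, and a value is assumed to reach the CPU 6 clocks after its access begins.

```python
def bank_schedule(start_addr, n_elements, access_clocks, elem_bytes=8):
    """Return (num_banks, rows); each row is (element, address, bank, start, arrival)."""
    # Assumption: smallest power of two that is at least the access time.
    num_banks = 1
    while num_banks < access_clocks:
        num_banks *= 2

    rows = []
    for i in range(n_elements):
        addr = start_addr + i * elem_bytes
        bank = (addr // elem_bytes) % num_banks  # word-interleaved banks
        start = i                                # one new address per cycle
        arrival = start + access_clocks          # value returned after the access
        rows.append((i, addr, bank, start, arrival))
    return num_banks, rows


num_banks, rows = bank_schedule(start_addr=136, n_elements=64, access_clocks=6)
print(num_banks)   # 8
print(rows[0])     # (0, 136, 1, 0, 6)
print(rows[7])     # (7, 192, 0, 7, 13)
print(rows[-1])    # (63, 640, 0, 63, 69)
```

Under these assumptions the first element (address 136, bank 1) arrives at cycle 6 and one element arrives every cycle after that, with the last of the 64 elements arriving at cycle 69.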
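Vector Opt#1: Vector Chaining
• The vector-machine version of register bypassing (forwarding)
• First used on the Cray-1
LV v1
MULV v3, v1, v2
ADDV v5, v3, v4
[Figure: Memory feeds the Load Unit, which writes V1. V1 and V2 feed the multiplier, which writes V3. V3 and V4 feed the adder, which writes V5. Chain paths forward the load result into the multiplier and the multiply result into the adder.]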
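Vector Chaining Advantage
• Without chaining, the last element of the preceding instruction must be processed before the next dependent instruction can start.
[Figure: timeline in which Load, Mul, and Add execute one after another.]
• With chaining, the next dependent instruction can start as soon as the first result of the preceding instruction is available.
[Figure: timeline in which Load, Mul, and Add overlap in time.]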
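The difference between the two timelines can be estimated with a simple timing model. This is a rough sketch, not a cycle-accurate account of the Cray-1: the function names are hypothetical, each unit is assumed to have a fixed start-up latency and then produce one element per cycle, and the multiply and add latencies below are illustrative values rather than figures from the slides.

```python
def unchained_time(startups, n):
    """Each dependent instruction waits for the previous one's last element."""
    return sum(s + n for s in startups)


def chained_time(startups, n):
    """Each dependent instruction starts once the previous one's first result appears."""
    return sum(startups) + n


# Illustrative start-up latencies (cycles) for LV, MULV, ADDV; vector length 64.
startups = [12, 7, 6]
print(unchained_time(startups, 64))  # 217
print(chained_time(startups, 64))    # 89
```

In this model the vector length is paid once per instruction without chaining but only once for the whole dependent sequence with chaining, which is why the benefit grows with the number of chained instructions.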