2.2.1 The basics of caching

Caching is one of the most widely used methods of addressing the von Neumann bottleneck. To understand the ideas behind caching, recall our example. A company has a factory (CPU) in one town and a warehouse (main memory) in another, and there is a single, two-lane road joining the factory and the warehouse. There are a number of possible solutions to the problem of transporting raw materials and finished products between the warehouse and the factory. One is to widen the road. Another is to move the factory and/or the warehouse, or to build a unified factory and warehouse. Caching exploits both of these ideas. Rather than transporting a single instruction or data item, we can use an effectively wider interconnection, one that can transport more data or more instructions in a single memory access. Also, rather than storing all data and instructions exclusively in main memory, we can store blocks of data and instructions in special memory that is effectively closer to the registers in the CPU.

In general a cache is a collection of memory locations that can be accessed in less time than some other memory locations. In our setting, when we talk about caches we'll usually mean a CPU cache, which is a collection of memory locations that the CPU can access more quickly than it can access main memory. A CPU cache can either be located on the same chip as the CPU, or it can be located on a separate chip that can be accessed much faster than an ordinary memory chip.

Once we have a cache, an obvious problem is deciding which data and instructions should be stored in the cache. The universally used principle is based on the idea that programs tend to use data and instructions that are physically close to recently used data and instructions. After executing an instruction, programs typically execute the next instruction; branching tends to be relatively rare. Similarly, after a program has accessed one memory location, it often accesses a memory location that is physically nearby. An extreme example of this is in the use of arrays. Consider the loop

   float z[1000];
   . . .
   sum = 0.0;
   for (i = 0; i < 1000; i++)
      sum += z[i];

Arrays are allocated as blocks of contiguous memory locations. So, for example, the location storing z[1] immediately follows the location storing z[0]. Thus as long as i < 999, the read of z[i] is immediately followed by a read of z[i+1].

The principle that an access of one location is followed by an access of a nearby location is often called locality. After accessing one memory location (instruction or data), a program will typically access a nearby location (spatial locality) or the same location again (temporal locality) in the near future.

In order to exploit the principle of locality, the system uses an effectively wider interconnect to access data and instructions. That is, a memory access will effectively operate on blocks of data and instructions instead of individual instructions and individual data items. These blocks are called cache blocks or cache lines.
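To see what locality buys us, we can do the arithmetic for the loop over z. The numbers below are a minimal sketch under assumed parameters (not specified in the text): a 64-byte cache line holding 16 four-byte floats, z[0] starting at a line boundary, and every line staying in the cache once fetched. Under those assumptions, the 1000 reads of z fetch only 63 lines from memory; the remaining 937 reads are served from the cache.

   #include <stdio.h>

   int main(void) {
      /* Hypothetical parameters: a 64-byte line holds 16 4-byte floats, */
      /* z[0] begins a line, and fetched lines are never evicted.        */
      int n = 1000;             /* number of elements of z               */
      int floats_per_line = 16;

      /* Each line of z is fetched once (a miss); every other read hits. */
      int misses = (n + floats_per_line - 1) / floats_per_line;  /* 63  */
      int hits   = n - misses;                                   /* 937 */

      printf("line fetches (misses): %d, cache hits: %d\n", misses, hits);
      return 0;
   }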
A typical cache line stores 8 to 16 times as much information as a single memory location. In our example, if a cache line stores 16 floats, then when we first go to add sum += z[0], the system might read the first 16 elements of z, z[0], z[1], . . . , z[15], from memory into the cache. So the next 15 additions will use elements of z that are already in the cache.

Conceptually, it's often convenient to think of a CPU cache as a single monolithic structure. In practice, though, the cache is usually divided into levels: the first level (L1) is the smallest and the fastest, and higher levels (L2, L3, . . . ) are larger and slower. Most systems currently, in 2010, have at least two levels, and having three levels is quite common. Caches usually store copies of information that also resides in slower memory, and if we think of a lower-level (faster, smaller) cache as a cache for a higher level, this usually applies as well. So, for example, a variable stored in a level 1 cache will also be stored in level 2. However, some multilevel caches don't duplicate information that's available in another level. For these caches, a variable in a level 1 cache might not be stored in any other level of the cache, but it would be stored in main memory.

When the CPU needs to access an instruction or data, it works its way down the cache hierarchy: first it checks the level 1 cache, then the level 2, and so on. Finally, if the information needed isn't in any of the caches, it accesses main memory. When a cache is checked for information and the information is available, it's called a cache hit or just a hit. If the information isn't available, it's called a cache miss or a miss. Hit or miss is often modified by the level. For example, when the CPU attempts to access a variable, it might have an L1 miss and an L2 hit.

Note that the memory access terms read and write are also used for caches. For example, we might read an instruction from an L2 cache, and we might write data to an L1 cache.

When the CPU attempts to read data or instructions and there's a cache read miss, it will read from memory the cache line that contains the needed information and store it in the cache. This may stall the processor while it waits for the slower memory: the processor may stop executing statements from the current program until the required data or instructions have been fetched from memory. So in our example, when we read z[0], the processor may stall while the cache line containing z[0] is transferred from memory into the cache.

When the CPU writes data to a cache, the value in the cache and the value in main memory become different or inconsistent. There are two basic approaches to dealing with the inconsistency. In write-through caches, the line is written to main memory when it is written to the cache. In write-back caches, the data isn't written immediately. Rather, the updated data in the cache is marked dirty, and when the cache line is replaced by a new cache line from memory, the dirty line is written to memory.
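To make the difference between the two write policies concrete, here is a minimal sketch of the bookkeeping a write-back cache implies. Everything here is invented for illustration: the type cache_line_t and the functions write_hit and evict are hypothetical names, not any real hardware or library interface. A write-through cache would instead update main memory inside write_hit and would have no need for a dirty flag.

   #include <stdbool.h>
   #include <stdio.h>

   /* A much-simplified model of one write-back cache line. */
   typedef struct {
      int  tag;        /* which memory line is currently stored here */
      bool dirty;      /* has the cached copy been modified?         */
      int  data[16];   /* the cached values                          */
   } cache_line_t;

   /* On a write hit, a write-back cache only updates the cached copy. */
   void write_hit(cache_line_t *line, int offset, int value) {
      line->data[offset] = value;
      line->dirty = true;   /* main memory is now stale */
   }

   /* Only when the line is replaced does main memory get updated. */
   void evict(cache_line_t *line) {
      if (line->dirty) {
         /* ... transfer line->data back to main memory here ... */
         printf("writing dirty line %d back to memory\n", line->tag);
         line->dirty = false;
      }
   }

The trade-off is that write-back turns many small writes to the same line into a single transfer at eviction time, at the cost of the dirty-bit bookkeeping and a delayed update of main memory.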
2.2.2 Cache mappings

Another issue in cache design is deciding where lines should be stored. That is, if we fetch a cache line from main memory, where in the cache should it be placed? The answer to this question varies from system to system. At one extreme is a fully associative cache, in which a new line can be placed at any location in the cache. At the other extreme is a direct mapped cache, in which each cache line has a unique location in the cache to which it will be assigned. Intermediate schemes are called n-way set associative. In these schemes, each cache line can be placed in one of n different locations in the cache. For example, in a two-way set associative cache, each line can be mapped to one of two locations.

As an example, suppose our main memory consists of 16 lines with indexes 0-15, and our cache consists of 4 lines with indexes 0-3. In a fully associative cache, line 0 can be assigned to cache location 0, 1, 2, or 3. In a direct mapped cache, we might assign lines by looking at their remainder after division by 4. So lines 0, 4, 8, and 12 would be mapped to cache index 0; lines 1, 5, 9, and 13 would be mapped to cache index 1; and so on. In a two-way set associative cache, we might group the cache into two sets: indexes 0 and 1 form one set (set 0), and indexes 2 and 3 form another (set 1). So we could use the remainder of the main memory index modulo 2, and cache line 0 would be mapped to either cache index 0 or cache index 1. See Table 2.1.

Table 2.1  Assignments of a 16-line Main Memory to a 4-line Cache

                               Cache Location
   Memory Index    Fully Assoc       Direct Mapped    2-way
        0          0, 1, 2, or 3           0          0 or 1
        1          0, 1, 2, or 3           1          2 or 3
        2          0, 1, 2, or 3           2          0 or 1
        3          0, 1, 2, or 3           3          2 or 3
        4          0, 1, 2, or 3           0          0 or 1
        5          0, 1, 2, or 3           1          2 or 3
        6          0, 1, 2, or 3           2          0 or 1
        7          0, 1, 2, or 3           3          2 or 3
        8          0, 1, 2, or 3           0          0 or 1
        9          0, 1, 2, or 3           1          2 or 3
       10          0, 1, 2, or 3           2          0 or 1
       11          0, 1, 2, or 3           3          2 or 3
       12          0, 1, 2, or 3           0          0 or 1
       13          0, 1, 2, or 3           1          2 or 3
       14          0, 1, 2, or 3           2          0 or 1
       15          0, 1, 2, or 3           3          2 or 3
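The direct mapped and 2-way columns of Table 2.1 are just modular arithmetic on the memory line index, and we can reproduce them with a few lines of C. This is a sketch of the particular mapping described above (index mod 4 for direct mapped, index mod 2 to choose a set of two locations for 2-way), not of any specific machine's hardware.

   #include <stdio.h>

   /* Reproduce the direct mapped and 2-way columns of Table 2.1 for a */
   /* 16-line memory and a 4-line cache, using the mappings in the text. */
   int main(void) {
      for (int mem = 0; mem < 16; mem++) {
         int direct = mem % 4;   /* direct mapped: one fixed location      */
         int set    = mem % 2;   /* 2-way: memory index mod 2 picks a set  */
         /* set 0 holds cache locations 0 and 1; set 1 holds 2 and 3 */
         printf("line %2d -> direct: %d, 2-way: %d or %d\n",
                mem, direct, 2*set, 2*set + 1);
      }
      return 0;
   }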
When a line from memory can be mapped to several different cache locations (as in fully associative and n-way set associative caches), we also need a way to decide which line should be replaced or evicted. In our preceding example, if line 0 is in location 0 and line 2 is in location 1, where would we store line 4? The most commonly used scheme is called least recently used. As the name suggests, the cache keeps a record of the relative order in which the blocks have been used, and if line 0 were used more recently than line 2, then line 2 would be evicted and replaced by line 4.

2.2.3 Caches and programs: an example

It's important to remember that the workings of the CPU cache are controlled by the system hardware, and we, the programmers, don't directly determine which data and which instructions are in the cache. However, knowing the principle of spatial and temporal locality allows us to have some indirect control over caching. As an example, C stores two-dimensional arrays in "row-major" order. That is, although we think of a two-dimensional array as a rectangular block, memory is effectively a huge one-dimensional array. So in row-major storage, we store row 0 first, then row 1, and so on. In the following two code segments, we would expect the first pair of nested loops to have much better performance than the second, since it's accessing the data in the two-dimensional array in contiguous blocks.

   double A[MAX][MAX], x[MAX], y[MAX];
   . . .
   /* Initialize A and x, assign y = 0 */
   . . .
   /* First pair of loops */
   for (i = 0; i < MAX; i++)
      for (j = 0; j < MAX; j++)
         y[i] += A[i][j]*x[j];
   . . .
   /* Assign y = 0 */
   . . .
   /* Second pair of loops */
   for (j = 0; j < MAX; j++)
      for (i = 0; i < MAX; i++)
         y[i] += A[i][j]*x[j];

To better understand this, suppose MAX is four, and the elements of A are stored in memory as follows:

   Cache Line   Elements of A
       0        A[0][0]  A[0][1]  A[0][2]  A[0][3]
       1        A[1][0]  A[1][1]  A[1][2]  A[1][3]
       2        A[2][0]  A[2][1]  A[2][2]  A[2][3]
       3        A[3][0]  A[3][1]  A[3][2]  A[3][3]

So, for example, A[0][1] is stored immediately after A[0][0], and A[1][0] is stored immediately after A[0][3].
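One way to check the row-major layout for yourself is to compare addresses directly. The following small test is our own illustration, not part of the original example: it verifies that A[1][0] sits immediately after A[0][3], and that A[i][j] lives i*MAX + j doubles past A[0][0].

   #include <stdio.h>

   #define MAX 4

   int main(void) {
      double A[MAX][MAX];

      /* One past the last element of row 0 should be the start of row 1. */
      printf("&A[0][3] + 1 == &A[1][0]?  %s\n",
             (&A[0][3] + 1 == &A[1][0]) ? "yes" : "no");

      /* A[i][j] is i*MAX + j doubles from A[0][0]; for A[2][1] that's 9. */
      long off = ((char *)&A[2][1] - (char *)&A[0][0]) / (long)sizeof(double);
      printf("offset of A[2][1] from A[0][0]: %ld doubles\n", off);
      return 0;
   }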
Let's suppose that none of the elements of A are in the cache when each pair of loops starts executing. Let's also suppose that a cache line consists of four elements of A and that A[0][0] is the first element of a cache line. Finally, we'll suppose that the cache is direct mapped and that it can only store eight elements of A, or two cache lines. (We won't worry about x and y.)

Both pairs of loops attempt to first access A[0][0]. Since it's not in the cache, this will result in a cache miss, and the system will read the line consisting of the first row of A, that is, A[0][0], A[0][1], A[0][2], A[0][3], into the cache. The first pair of loops then accesses A[0][1], A[0][2], A[0][3], all of which are in the cache, and the next miss in the first pair of loops will occur when the code accesses A[1][0]. Continuing in this fashion, we see that the first pair of loops will result in a total of four misses when it accesses elements of A, one for each row. Note that since our hypothetical cache can only store two lines, or eight elements of A, when we read the first element of row two and the first element of row three, one of the lines that's already in the cache will have to be evicted. But once a line is evicted, the first pair of loops won't need to access the elements of that line again.

After reading the first row into the cache, the second pair of loops needs to access A[1][0], A[2][0], A[3][0], none of which are in the cache. So the next three accesses of A will also result in misses. Furthermore, because the cache is small, the reads of A[2][0] and A[3][0] will require that lines already in the cache be evicted. Since A[2][0] is stored in cache line 2, reading its line will evict line 0, and reading A[3][0] will evict line 1. After finishing the first pass through the outer loop, we'll next need to access A[0][1], which was evicted with the rest of the first row. So we see that every time we read an element of A, we'll have a miss, and the second pair of loops results in 16 misses.

Thus, we'd expect the first pair of nested loops to be much faster than the second. In fact, if we run the code on one of our systems with MAX = 1000, the first pair of nested loops is approximately three times faster than the second pair.

2.2.4 Virtual memory

Caches make it possible for the CPU to quickly access instructions and data that are in main memory. However, if we run a very large program or a program that accesses very large data sets, all of the instructions and data may not fit into main memory. This is especially true with multitasking operating systems; in order to switch between programs and create the illusion that multiple programs are running simultaneously, the instructions and data that will be used during the next time slice should be in main memory. Thus, in a multitasking system, even if the main memory is very large, many running programs must share the available main memory. Furthermore, this sharing must be done in such a way that each program's data and instructions are protected from corruption by other programs.

Virtual memory was developed so that main memory can function as a cache for secondary storage. It exploits the principle of spatial and temporal locality by keeping in main memory only the active parts of the many running programs; those parts that are idle are kept in secondary storage.