Burger, D., Goodman, J.R., and Sohi, G.S. "Memory Systems." The Electrical Engineering Handbook. Ed. Richard C. Dorf. Boca Raton: CRC Press LLC, 2000.
88 Memory Systems

Doug Burger, University of Wisconsin-Madison
James R. Goodman, University of Wisconsin-Madison
Gurindar S. Sohi, University of Wisconsin-Madison

88.1 Introduction
88.2 Memory Hierarchies
88.3 Cache Memories
88.4 Parallel and Interleaved Memories
88.5 Virtual Memory
88.6 Research Issues

88.1 Introduction

A memory system serves as a repository of information (data) in a computer system. The processor [also called the central processing unit (CPU)] accesses (reads or loads) data from the memory system, performs computations on them, and stores (writes) them back to memory. The memory system is a collection of storage locations. Each storage location, or memory word, has a numerical address. A collection of storage locations forms an address space. Figure 88.1 shows the essentials of how a processor is connected to a memory system via address, data, and control lines.

FIGURE 88.1 The memory interface.

When a processor attempts to load the contents of a memory location, the request is very urgent. In virtually all computers, the work soon comes to a halt (in other words, the processor stalls) if the memory request does not return quickly. Modern computers are generally able to continue briefly by overlapping memory requests, but even the most sophisticated computers will frequently exhaust their ability to process data and stall momentarily in the face of long memory delays. Thus, a key performance parameter in the design of any computer, fast or slow, is the effective speed of its memory.

Ideally, the memory system must be both infinitely large so that it can contain an arbitrarily large amount of information and infinitely fast so that it does not limit the processing unit. Practically, however, this is not possible. There are three properties of memory that are inherently in conflict: speed, capacity, and cost. In general, technology tradeoffs can be employed to optimize any two of the three factors at the expense of the third. Thus it is possible to have memories that are (1) large and cheap, but not fast; (2) cheap and fast, but small; or (3) large and fast, but expensive. The last of the three is further limited by physical constraints. A large-capacity memory that is very fast is also physically large, and speed-of-light delays place a limit on the speed of such a memory system.

The latency (L) of the memory is the delay from when the processor first requests a word from memory until that word arrives and is available for use by the processor. The latency of a memory system is one attribute of performance. The other is bandwidth (BW), which is the rate at which information can be transferred from the memory system. The bandwidth and the latency are related. If R is the number of requests that the memory can service simultaneously, then:

    BW = R/L    (88.1)
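To make Eq. (88.1) concrete, the following minimal C sketch computes the bandwidth of the 20-ns, 32-bit-word memory used as the running example below, first with one request in service (R = 1) and then with four overlapped requests (R = 4). The program structure and variable names are illustrative assumptions, not part of the chapter.

```c
/* Minimal sketch of Eq. (88.1), BW = R/L, for a memory that takes
 * 20 ns to service one 32-bit (4-byte) word. Names are illustrative. */
#include <stdio.h>

int main(void) {
    const double L = 20e-9;        /* latency: 20 ns per access */
    const double word_bytes = 4.0; /* one 32-bit word           */

    /* R = 1: one request at a time */
    printf("R = 1: %.0f Mbytes/s\n", 1 * word_bytes / L / 1e6);

    /* R = 4: a new request accepted every 5 ns, so four requests
     * are in service at once within the 20-ns latency window    */
    printf("R = 4: %.0f Mbytes/s\n", 4 * word_bytes / L / 1e6);
    return 0;
}
```

The two printed values, 200 and 800 Mbytes/s, match the worked example that follows.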
From Eq. (88.1) we see that a decrease in the latency will result in an increase in bandwidth, and vice versa, if R is unchanged. We can also see that the bandwidth can be increased by increasing R, if L does not increase proportionately. For example, we can build a memory system that takes 20 ns to service the access of a single 32-bit word. Its latency is 20 ns per 32-bit word, and its bandwidth is 32 bits/(20 × 10⁻⁹ s), or 200 Mbytes/s. If the memory system is modified to accept a new (still 20 ns) request for a 32-bit word every 5 ns by overlapping results, then its bandwidth is 32 bits/(5 × 10⁻⁹ s), or 800 Mbytes/s. This memory system must be able to handle four requests at a given time.

Building an ideal memory system (infinite capacity, zero latency, and infinite bandwidth, with affordable cost) is not feasible. The challenge is, given a set of cost and technology constraints, to engineer a memory system whose abilities match the abilities that the processor demands of it. That is, engineering a memory system that performs as close to an ideal memory system (for the given processing unit) as is possible. For a processor that stalls when it makes a memory request (some current microprocessors are in this category), it is important to engineer a memory system with the lowest possible latency. For those processors that can handle multiple outstanding memory requests (vector processors and high-end CPUs), it is important not only to reduce latency, but also to increase bandwidth (over what is possible by latency reduction alone) by designing a memory system that is capable of servicing multiple requests simultaneously.

Memory hierarchies provide decreased average latency and reduced bandwidth requirements, whereas parallel or interleaved memories provide higher bandwidth.

88.2 Memory Hierarchies

Technology does not permit memories that are cheap, large, and fast. By recognizing the nonrandom nature of memory requests, and emphasizing the average rather than worst-case latency, it is possible to implement a hierarchical memory system that performs well. A small amount of very fast memory, placed in front of a large, slow memory, can be designed to satisfy most requests at the speed of the small memory. This, in fact, is the primary motivation for the use of registers in the CPU: in this case, the programmer or compiler makes sure that the most commonly accessed variables are allocated to registers.

A variety of techniques, employing either hardware, software, or a combination of the two, can be employed to assure that most memory references are satisfied by the faster memory. The foremost of these techniques is the exploitation of the locality of reference principle. This principle captures the fact that some memory locations are referenced much more frequently than others. Spatial locality is the property that an access to a given memory location greatly increases the probability that neighboring locations will soon be accessed. This is largely, but not exclusively, a result of the tendency to access memory locations sequentially. Temporal locality is the property that an access to a given memory location greatly increases the probability that the same location will be accessed again soon. This is largely, but not exclusively, a result of the frequency of looping behavior of programs. Particularly for temporal locality, a good predictor of the future is the past: the longer a variable has gone unreferenced, the less likely it is to be accessed soon.
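Both kinds of locality are easy to see in code. In the short C function below (an illustration of the principle, with hypothetical names), the scalar accumulator is reused on every iteration, exhibiting temporal locality, while the sequential sweep over the array exhibits spatial locality.

```c
/* Illustrative only: a loop exhibiting both kinds of locality. */
#include <stddef.h>

double sum_array(const double *a, size_t n) {
    double sum = 0.0;      /* touched every iteration: temporal locality */
    for (size_t i = 0; i < n; i++) {
        sum += a[i];       /* a[0], a[1], a[2], ... are adjacent in
                              memory: spatial locality, so one miss that
                              fetches a whole cache line services several
                              of the following reads                      */
    }
    return sum;
}
```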
Figure 88.2 depicts a common construction of a memory hierarchy. At the top of the hierarchy are the CPU registers, which are small and extremely fast. The next level down in the hierarchy is a special, high-speed semiconductor memory known as a cache memory. The cache can actually be divided into multiple distinct levels; most current systems have between one and three levels of cache. Some of the levels of cache may be on the CPU chip itself, they may be on the same module as the CPU, or they may all be entirely distinct. Below the cache is the conventional memory, referred to as main memory, or backing storage. Like a cache, main memory is semiconductor memory, but it is slower, cheaper, and denser than a cache. Below the main memory is the virtual memory, which is generally stored on magnetic or optical disk. Accessing the virtual memory can be tens of thousands of times slower than accessing the main memory because it involves moving mechanical parts.

FIGURE 88.2 A memory hierarchy, ranging from small, fast registers (128 bytes to 4 Kbytes) through cache (8 Kbytes to 8 Mbytes) and main memory (4 Mbytes to 512 Mbytes) to disk (40 Mbytes to 32 Gbytes).

As requests go deeper into the memory hierarchy, they encounter levels that are larger (in terms of capacity) and slower than the higher levels (moving left to right in Fig. 88.2). In addition to size and speed, the bandwidth between adjacent levels in the memory hierarchy is smaller for the lower levels. The bandwidth between the registers and the top cache level, for example, is higher than that between cache and main memory or between main memory and virtual memory. Since each level presumably intercepts a fraction of the requests, the bandwidth to the level below need not be as great as that to the intercepting level.

A useful performance parameter is the effective latency. If the needed word is found in a level of the hierarchy, it is a hit; if a request must be sent to the next lower level, the request is said to miss. If the latency LHIT is known in the case of a hit and the latency in the case of a miss is LMISS, the effective latency for that level in the hierarchy can be determined from the hit ratio (H), the fraction of memory accesses that are hits:

    Laverage = LHIT · H + LMISS · (1 – H)    (88.2)

The portion of memory accesses that miss is called the miss ratio (M = 1 – H). The hit ratio is strongly influenced by the program being executed, but is largely independent of the ratio of cache size to memory size. It is not uncommon for a cache with a capacity of a few thousand bytes to exhibit a hit ratio greater than 90%.
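Equation (88.2) translates directly into code. The C sketch below tabulates the effective latency over a range of hit ratios; the 5-ns hit and 60-ns miss latencies are assumed values chosen for illustration and do not come from the chapter.

```c
/* Effective latency per Eq. (88.2): Laverage = LHIT*H + LMISS*(1 - H).
 * The latency values below are illustrative assumptions only. */
#include <stdio.h>

double effective_latency(double l_hit, double l_miss, double hit_ratio) {
    return l_hit * hit_ratio + l_miss * (1.0 - hit_ratio);
}

int main(void) {
    const double l_hit  = 5e-9;   /* assumed 5-ns cache hit            */
    const double l_miss = 60e-9;  /* assumed 60-ns miss to main memory */

    /* With the >90% hit ratios the text mentions, the effective
     * latency stays close to the hit latency.                     */
    for (double h = 0.80; h <= 0.99; h += 0.05)
        printf("H = %.2f -> %.1f ns\n", h,
               effective_latency(l_hit, l_miss, h) * 1e9);
    return 0;
}
```

At H = 0.90, for instance, the effective latency is 5 × 0.9 + 60 × 0.1 = 10.5 ns, roughly twice the hit latency rather than anywhere near the miss latency.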
88.3 Cache Memories

The basic unit of construction of a semiconductor memory system is a module or bank. A memory bank, constructed from several memory chips, can service a single request at a time. The time that a bank is busy servicing a request is called the bank busy time. The bank busy time limits the bandwidth of a memory bank. Both caches and main memories are constructed in this fashion, although caches have significantly shorter bank busy times than do main memory banks.

The hardware can dynamically allocate parts of the cache memory for addresses deemed most likely to be accessed soon. The cache contains only redundant copies of the address space. The cache memory is associative, or content-addressable. In an associative memory, the address of a memory location is stored, along with its content. Rather than reading data directly from a memory location, the cache is given an address and responds by providing data which may or may not be the data requested. When a cache miss occurs, the memory access is then performed with respect to the backing storage, and the cache is updated to include the new data.

The cache is intended to hold the most active portions of the memory, and the hardware dynamically selects portions of main memory to store in the cache. When the cache is full, bringing in new data must be matched by deleting old data. Thus a strategy for cache management is necessary. Cache management strategies exploit the principle of locality. Spatial locality is exploited by the choice of what is brought into the cache. Temporal locality is exploited in the choice of which block is removed. When a cache miss occurs, hardware copies a large, contiguous block of memory into the cache, which includes the word requested. This fixed-size region of memory, known as a cache line or block, may be as small as a single word, or up to several hundred bytes.

A block is a set of contiguous memory locations, the number of which is usually a power of two. A block is said to be aligned if the lowest address in the block is exactly divisible by the block size. That is to say, for a block of size B beginning at location A, the block is aligned if

    A modulo B = 0    (88.3)

Conventional caches require that all blocks be aligned.

When a block is brought into the cache, it is likely that another block must be evicted. The selection of the evicted block is based on some attempt to capture temporal locality. Since prescience is so difficult to achieve, other methods are generally used to predict future memory accesses. A least-recently-used (LRU) policy is often the basis for the choice. Other replacement policies are sometimes used, particularly because true LRU replacement requires extensive logic and bookkeeping.

The cache often comprises two conventional memories: the data memory and the tag memory, shown in Fig. 88.3. The address of each cache line contained in the data memory is stored in the tag memory, along with other information (state information), particularly the fact that a valid cache line is present. The state also keeps track of which cache lines the processor has modified. Each line contained in the data memory is allocated a corresponding entry in the tag memory to indicate the full address of the cache line.

FIGURE 88.3 Components of a cache memory. (Source: Adapted from M. D. Hill, "A case for direct-mapped caches," IEEE Computer, 21(12), 27, 1988.)
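The alignment rule of Eq. (88.3) and the tag/data split of Fig. 88.3 can be sketched in C as follows. The direct-mapped organization and the size parameters are assumptions chosen for illustration (the figure is adapted from a paper on direct-mapped caches); real designs vary.

```c
/* Sketch of block alignment (Eq. 88.3) and a tag-memory lookup.
 * Assumptions for illustration: a direct-mapped cache with 256 lines
 * of 64-byte blocks and 32-bit addresses; the chapter does not fix
 * these parameters. */
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE 64u   /* B: block size in bytes (a power of two) */
#define NUM_LINES  256u  /* number of cache lines                   */

struct tag_entry {
    uint32_t tag;    /* high-order address bits of the cached block */
    bool     valid;  /* a valid cache line is present               */
    bool     dirty;  /* processor has modified this line            */
};

static struct tag_entry tag_memory[NUM_LINES];

/* Eq. (88.3): a block at address A is aligned if A modulo B == 0. */
bool block_is_aligned(uint32_t a) {
    return a % BLOCK_SIZE == 0;
}

/* A hit requires the indexed entry to be valid and its tag to match. */
bool cache_hit(uint32_t addr) {
    uint32_t block = addr / BLOCK_SIZE;   /* drop the in-block offset */
    uint32_t index = block % NUM_LINES;   /* which tag entry to check */
    uint32_t tag   = block / NUM_LINES;   /* remaining high-order bits */
    return tag_memory[index].valid && tag_memory[index].tag == tag;
}
```

Because BLOCK_SIZE and NUM_LINES are powers of two, the divisions and moduli above reduce to simple shifts and masks, which is one reason block sizes are usually powers of two.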