
Memory Hierarchy

Summary of chapter contents

A modern microprocessor can execute instructions at a very high rate. To exploit this potential performance fully the processor must be connected to a memory system which is both very large and very fast. If the memory is too small, it will not be able to hold enough programs to keep the processor busy. If it is too slow, the memory will not be able to supply instructions as fast as the processor can execute them.

Unfortunately, the larger a memory is the slower it is. It is therefore not possible to design a single memory which is both large enough and fast enough to keep a high-performance processor busy.

It is, however, possible to build a composite memory system which combines a small, fast memory and a large, slow main memory to present an external behaviour which, with typical program statistics, approximates that of a large, fast memory much of the time. The small, fast memory component is the cache, which automatically retains copies of instructions and data that the processor is using most frequently. The effectiveness of the cache depends on the spatial locality and temporal locality properties of the program.

This two-level memory principle can be extended into a memory hierarchy of many levels, and the computer backup (disk) store can be viewed as part of this hierarchy. With suitable memory management support, the size of a program is limited not by the computer's main memory but by the size of the hard disk, which may be very much larger than the main memory.


10.1 Memory size and speed

A typical computer memory hierarchy comprises several levels, each level having a characteristic size and speed.

The processor registers can be viewed as the top of the memory hierarchy. A RISC processor will typically have around thirty-two 32-bit registers making a total of 128 bytes, with an access time of a few nanoseconds.

On-chip cache memory will have a capacity of eight to 32 Kbytes with an access time around ten nanoseconds.

High-performance desktop systems may have a second-level off-chip cache with a capacity of a few hundred Kbytes and an access time of a few tens of nanoseconds.

Main memory will be megabytes to tens of megabytes of dynamic RAM with an access time around 100 nanoseconds.

Backup store, usually on a hard disk, will be hundreds of Mbytes up to a few Gbytes with an access time of a few tens of milliseconds.

Note that the performance difference between the main memory and the backup store is very much greater than the difference between any other adjacent levels, even when there is no secondary cache in the system.

The data which is held in the registers is under the direct control of the compiler or assembly language programmer, but the contents of the remaining levels of the hierarchy are usually managed automatically. The caches are effectively invisible to the application program, with blocks or 'pages' of instructions and data migrating up and down the hierarchy under hardware control. Paging between the main memory and the backup store is controlled by the operating system, and remains transparent to the application program. Since the performance difference between the main memory and the backup store is so great, much more sophisticated algorithms are required here to determine when to migrate data between the levels.

An embedded system will not usually have a backing store and will therefore not exploit paging. However, many embedded systems incorporate caches, and ARM CPU chips employ a range of cache organizations. We will therefore look at cache organizational issues in some detail.

Memory cost

Fast memory is more expensive per bit than slow memory, so a memory hierarchy also aims to give a performance close to the fastest memory with an average cost per bit approaching that of the slowest memory.


10.2 On-chip memory

Some form of on-chip memory is essential if a microprocessor is to deliver its best performance. With today's clock rates, only on-chip memory can support zero wait state access speeds, and it will also give better power-efficiency and reduced electromagnetic interference than off-chip memory.

On-chip RAM benefits

In many embedded systems simple on-chip RAM is preferred to cache for a number of reasons:

• It is simpler, cheaper, and uses less power.

We will see in the following sections that cache memory carries a significant overhead in terms of the logic that is required to enable it to operate effectively. It also incurs a significant design cost if a suitable off-the-shelf cache is unavailable.

• It has more deterministic behaviour.

Cache memories have complex behaviours which can make it difficult to predict how well they will operate under particular circumstances. In particular, it can be hard to guarantee interrupt response time.

The drawback with on-chip RAM vis-à-vis cache is that it requires explicit management by the programmer, whereas a cache is usually transparent to the programmer.

Where the program mix is well-defined and under the control of the programmer, on-chip RAM can effectively be used as a software-controlled cache. Where the application mix cannot be predicted this control task becomes very difficult. Hence a cache is usually preferred in any general-purpose system where the application mix is unknown.

One important advantage of on-chip RAM is that it enables the programmer to allocate space in it using knowledge of the future processing load. A cache left to its own devices has knowledge only of past program behaviour, and it can therefore never prepare in advance for critical future tasks. Again, this is a difference which is most likely to be significant when critical tasks must meet strict real-time constraints.

The system designer must decide which is the right approach for a particular system, taking all these factors into account. Whatever form of on-chip memory is chosen, it must be specified with great care. It must be fast enough to keep the processor busy and large enough to contain critical routines, but neither too fast (or it will consume too much power) nor too large (or it will occupy too much chip area).
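Where on-chip RAM is used as a software-controlled cache, the critical code and data must be placed there explicitly. The fragment below is a minimal sketch of one common way to do this, assuming a GNU-style toolchain and a hypothetical linker section named .fast_ram that the system's linker script maps onto the on-chip RAM; the names and the section are illustrative, not part of any particular ARM tool flow.

    /* Sketch only: assumes a GNU-style compiler and a linker script that
     * maps the hypothetical section ".fast_ram" onto the on-chip RAM.    */

    /* Keep a time-critical handler in on-chip RAM so that its fetch time
     * does not depend on external memory or on what a cache happens to hold. */
    __attribute__((section(".fast_ram")))
    void critical_handler(void)
    {
        /* ... time-critical work ... */
    }

    /* A frequently accessed buffer can be placed in the same way. */
    __attribute__((section(".fast_ram")))
    unsigned char scratch_buffer[256];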


10.3 Caches

The first RISC processors were introduced at a time when standard memory parts were faster than their contemporary microprocessors, but this situation did not persist for long. Subsequent advances in semiconductor process technology which have been exploited to make microprocessors faster have been applied differently to improve memory chips. Standard DRAM parts have got a little faster, but mostly they have been developed to have a much higher capacity.

Processor and memory speeds

In 1980 a typical DRAM part could hold 4 Kbits of data, with 16 Kbit chips arriving in 1981 and 1982. These parts would cycle at 3 or 4 MHz for random accesses, and at about twice this rate for local accesses (in page mode). Microprocessors at that time could request around two million memory accesses per second.

In 2000 DRAM parts have a capacity of 256 Mbits per chip, with random accesses operating at around 30 MHz. Microprocessors can request several hundred million memory accesses per second. If the processor is so much faster than the memory, it can only deliver its full performance potential with the help of a cache memory.

A cache memory is a small, very fast memory that retains copies of recently used memory values. It operates transparently to the programmer, automatically deciding which values to keep and which to overwrite. These days it is usually implemented on the same chip as the processor. Caches work because programs normally display the property of locality, which means that at any particular time they tend to execute the same instructions many times (for instance in a loop) on the same areas of data (for instance a stack).

Unified and Harvard caches

Caches can be built in many ways. At the highest level a processor can have one of the following two organizations:

• A unified cache.

This is a single cache for both instructions and data, as illustrated in Figure 10.1 on page 273.

• Separate instruction and data caches.

This organization is sometimes called a modified Harvard architecture as shown in Figure 10.2 on page 274.

Both these organizations have their merits. The unified cache automatically adjusts the proportion of the cache memory used by instructions according to the current program requirements, giving a better performance than a fixed partitioning. On the other hand the separate caches allow load and store instructions to execute in a single clock cycle.


Figure 10.1 A unified instruction and data cache.

Cache performance metrics

Since the processor can operate at its high clock rate only when the memory items it requires are held in the cache, the overall system performance depends strongly on the proportion of memory accesses which cannot be satisfied by the cache. An access to an item which is in the cache is called a hit, and an access to an item which is not in the cache is a miss. The proportion of all the memory accesses that are satisfied by the cache is the hit rate, usually expressed as a percentage, and the proportion that are not is the miss rate.

The miss rate of a well-designed cache should be only a few per cent if a modern processor is to fulfil its potential. The miss rate depends on a number of cache parameters, including its size (the number of bytes of memory in the cache) and its organization.
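To see why the miss rate must be so low, consider a rough calculation using only the representative access times from Section 10.1 (around 10 ns for an on-chip cache and around 100 ns for main memory). The figures below are illustrative assumptions, not measurements of any particular system, and a miss is assumed to cost a cache probe plus a main memory access.

    #include <stdio.h>

    /* Rough illustration of how the miss rate sets the average access time. */
    int main(void)
    {
        const double t_cache = 10.0;    /* on-chip cache access time, ns */
        const double t_main  = 100.0;   /* main memory access time, ns   */

        for (int m = 1; m <= 10; m++) {
            double miss_rate = m / 100.0;
            double t_avg = (1.0 - miss_rate) * t_cache
                         + miss_rate * (t_cache + t_main);
            printf("miss rate %2d%%  ->  average access %5.1f ns\n", m, t_avg);
        }
        return 0;
    }

With these numbers even a 2% miss rate raises the average access time from 10 ns to 12 ns, and a 10% miss rate doubles it.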

Cache organization

Since a cache holds a dynamically varying selection of items from main memory, it must have storage for both the data and the address at which the data is stored in main memory.

The direct-mapped cache

The simplest organization of these components is the direct-mapped cache which is illustrated in Figure 10.3 on page 275. In the direct-mapped cache a line of data is stored along with an address tag in a memory which is addressed by some portion of the memory address (the index).

Figure 10.2 Separate data and instruction caches.

To check whether or not a particular memory item is stored in the cache, the index address bits are used to access the cache entry. The top address bits are then compared with the stored tag; if they are equal, the item is in the cache. The lowest address bits can be used to access the desired item within the line.
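The address decomposition just described can be modelled in a few lines of C. The sketch below is purely illustrative: the line size, number of lines and valid bit are assumptions chosen to match the worked example that follows, not a description of any particular ARM cache.

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative parameters: 16-byte lines and 512 lines (8 Kbytes of data). */
    #define OFFSET_BITS 4
    #define INDEX_BITS  9
    #define NUM_LINES   (1u << INDEX_BITS)
    #define LINE_BYTES  (1u << OFFSET_BITS)

    typedef struct {
        bool     valid;                 /* line holds valid data  */
        uint32_t tag;                   /* top address bits       */
        uint8_t  data[LINE_BYTES];      /* one line of cached data */
    } cache_line_t;

    static cache_line_t cache[NUM_LINES];

    /* Return true on a hit, copying the addressed byte into *value. */
    bool cache_read_byte(uint32_t address, uint8_t *value)
    {
        uint32_t offset = address & (LINE_BYTES - 1);                 /* low bits    */
        uint32_t index  = (address >> OFFSET_BITS) & (NUM_LINES - 1); /* middle bits */
        uint32_t tag    = address >> (OFFSET_BITS + INDEX_BITS);      /* top bits    */

        cache_line_t *line = &cache[index];       /* index selects a unique line */
        if (line->valid && line->tag == tag) {    /* tag comparison decides hit  */
            *value = line->data[offset];
            return true;
        }
        return false;                             /* miss: line must be fetched  */
    }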

This, simplest, cache organization has a number of properties that can be contrasted with those of more complex organizations:

• A particular memory item is stored in a unique location in the cache; two items with the same cache address field will contend for use of that location.

• Only those bits of the address that are not used to select within the line or to address the cache RAM need be stored in the tag field.

• The tag and data access can be performed at the same time, giving the fastest cache access time of any organization.

• Since the tag RAM is typically a lot smaller than the data RAM, its access time is shorter, allowing the tag comparison to be completed within the data access time.


Figure 10.3 Direct-mapped cache organization.

A typical direct-mapped cache might store 8 Kbytes of data in 16-byte lines. There would therefore be 512 lines. A 32-bit address would have four bits to address bytes within the line and nine bits to select the line, leaving a 19-bit tag which requires just over one Kbyte of tag store.
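The field widths quoted above can be checked mechanically; the short program below simply reproduces the arithmetic of the previous paragraph (the extra storage for valid bits is ignored).

    #include <stdio.h>

    /* Field widths for an 8 Kbyte direct-mapped cache with 16-byte lines
     * and 32-bit addresses.                                               */
    int main(void)
    {
        const unsigned lines       = (8 * 1024) / 16;                /* 512        */
        const unsigned offset_bits = 4;                              /* 16  = 2^4  */
        const unsigned index_bits  = 9;                              /* 512 = 2^9  */
        const unsigned tag_bits    = 32 - offset_bits - index_bits;  /* 19         */
        const unsigned tag_store   = (lines * tag_bits + 7) / 8;     /* in bytes   */

        printf("%u lines, %u-bit tag, %u bytes of tag store\n",
               lines, tag_bits, tag_store);   /* 512, 19, 1216 (just over 1 Kbyte) */
        return 0;
    }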

When data is loaded into the cache, a block of data is fetched from memory. There is little point in having the line size smaller than the block size. If the block size is smaller than the line size, the tag store must be extended to include a valid bit for each block within the line. Choosing the line and block sizes to be equal results in the simplest organization.

The set-associative cache

Moving up in complexity, the set-associative cache aims to reduce the problems due to contention by enabling a particular memory item to be stored in more than one cache location. A 2-way set-associative cache is illustrated in Figure 10.4 on page 276. As the figure suggests, this form of cache is effectively two direct-mapped caches operating in parallel. An address presented to the cache may find its data in either half, so each memory address may be stored in either of two places. Each of two items which were in contention for a single location in the direct-mapped cache may now occupy one of these places, allowing the cache to hit on both.

Figure 10.4 Two-way set-associative cache organization.

The 8 Kbyte cache with 16-byte lines will have 256 lines in each half of the cache, so four bits of the 32-bit address select a byte from the line and eight bits select one line from each half of the cache. The address tag must therefore be one bit longer, at 20 bits. The access time is only very slightly longer than that of the direct-mapped cache, the increase being due to the need to multiplex the data from the two halves.

When a new data item is to be placed in the cache, a decision must be taken as to which half to place it in. There are several options here, the most common being:


• Random allocation.

The decision is based on a random or pseudo-random value.

• Least recently used (LRU).

The cache keeps a record of which location of a pair was last accessed and allocates the new data to the other one.

• Round-robin (also known as 'cyclic').

The cache keeps a record of which location of a pair was last allocated and allocates the new data to the other one.
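For a 2-way cache both the LRU and the round-robin choices can be recorded with a single bit per set. The sketch below shows round-robin allocation in the same illustrative style as the earlier direct-mapped example; a 2-way LRU policy would update the same bit on every access rather than on every allocation. None of this structure is taken from a real ARM cache.

    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    /* Illustrative 2-way set-associative parameters: 16-byte lines, 256 sets. */
    #define OFFSET_BITS 4
    #define INDEX_BITS  8
    #define NUM_SETS    (1u << INDEX_BITS)
    #define LINE_BYTES  (1u << OFFSET_BITS)

    typedef struct {
        bool     valid;
        uint32_t tag;
        uint8_t  data[LINE_BYTES];
    } way_t;

    static way_t   cache[NUM_SETS][2];
    static uint8_t next_way[NUM_SETS];   /* which way of each set to allocate next */

    /* Allocate a line for 'address' on a miss, filling it from 'line_from_memory'.
     * Round-robin: the victim alternates between the two ways of the set.         */
    void cache_allocate(uint32_t address, const uint8_t line_from_memory[LINE_BYTES])
    {
        uint32_t index = (address >> OFFSET_BITS) & (NUM_SETS - 1);
        uint32_t tag   = address >> (OFFSET_BITS + INDEX_BITS);

        way_t *victim = &cache[index][next_way[index]];
        next_way[index] ^= 1;            /* point at the other way for next time */

        victim->valid = true;
        victim->tag   = tag;
        memcpy(victim->data, line_from_memory, LINE_BYTES);
    }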

The set-associative approach extends beyond 2-way up to any degree of associativity, but in practice the benefits of going beyond 4-way associativity are small and do not warrant the extra complexity incurred.

The fully associative cache

At the other extreme of associativity, it is possible to design a fully associative cache in VLSI technology. Rather than continuing to divide the direct-mapped cache into ever smaller components, the tag store is designed differently using content addressed memory (CAM). A CAM cell is a RAM cell with an inbuilt comparator, so a CAM-based tag store can perform a parallel search to locate an address in any location. The organization of a fully associative cache is illustrated in Figure 10.5.


Since there are no address bits implicit in the position of data in the cache, the tag must store all the address bits apart from those used to address bytes within the line.
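Software can only model the CAM's parallel search as a loop over every line, but the sketch below shows the essential point made above: with no index field, everything above the line offset is held in the tag and compared against every entry. The sizes are again illustrative assumptions.

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative fully associative parameters: 16-byte lines, 64 lines. */
    #define OFFSET_BITS 4
    #define LINE_BYTES  (1u << OFFSET_BITS)
    #define NUM_LINES   64

    typedef struct {
        bool     valid;
        uint32_t tag;                  /* all address bits above the line offset */
        uint8_t  data[LINE_BYTES];
    } cache_line_t;

    static cache_line_t cache[NUM_LINES];

    bool cache_read_byte(uint32_t address, uint8_t *value)
    {
        uint32_t offset = address & (LINE_BYTES - 1);
        uint32_t tag    = address >> OFFSET_BITS;     /* no index field at all */

        /* This loop stands in for the CAM's parallel compare of every tag. */
        for (unsigned i = 0; i < NUM_LINES; i++) {
            if (cache[i].valid && cache[i].tag == tag) {
                *value = cache[i].data[offset];
                return true;
            }
        }
        return false;
    }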

Write strategies

The above schemes operate in an obvious way for read accesses: when presented with a new read address the cache checks to see whether it holds the addressed data; if it does, it supplies the data; if it does not, it fetches a block of data from main memory, stores it in the cache in some suitable location and supplies the requested data to the processor.

There are more choices to make when the processor executes a write cycle. In increasing order of complexity, the commonly used write strategies are:

• Write-through.

All write operations are passed to main memory; if the addressed location is currently held in the cache, the cache is updated to hold the new value. The processor must slow down to main memory speed while the write takes place.

• Write-through with buffered write.

Here all write operations are still passed to main memory and the cache updated as appropriate, but instead of slowing the processor down to main memory speed the write address and data are stored in a write buffer which can accept the write information at high speed. The write buffer then transfers the data to main memory, at main memory speed, while the processor continues with its next task.

• Copy-back (also known as write-back).

A copy-back cache is not kept coherent with main memory. Write operations update only the cache, so cache lines must remember when they have been modified (usually using a dirty bit on each line or block). If a dirty cache line is allocated to new data it must be copied back to memory before the line is reused.

The write-through cache is the simplest to implement and has the merit that the memory is kept up to date; the drawback is that the processor must slow to memory speeds on every write transfer. The addition of a write buffer allows the processor to continue until the write traffic exceeds the external write bandwidth. The copy-back cache reduces the external write bandwidth requirement since a location may be written many times before the final value gets written back to memory, but the implementation is considerably more complex and the loss of coherency is hard to manage.
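The dirty-bit mechanism of a copy-back cache can be sketched in the same illustrative style as the earlier examples. The code below assumes a direct-mapped data store and a hypothetical pair of routines, memory_read_line() and memory_write_line(), standing in for the main memory interface; a write-through cache would instead forward every write to memory (directly or via a write buffer) and would not need the dirty bit.

    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    /* Illustrative copy-back, direct-mapped cache with 16-byte lines. */
    #define OFFSET_BITS 4
    #define INDEX_BITS  9
    #define NUM_LINES   (1u << INDEX_BITS)
    #define LINE_BYTES  (1u << OFFSET_BITS)

    typedef struct {
        bool     valid;
        bool     dirty;                 /* line holds data newer than main memory */
        uint32_t tag;
        uint8_t  data[LINE_BYTES];
    } cache_line_t;

    static cache_line_t cache[NUM_LINES];

    /* Stubs standing in for the (hypothetical) main memory interface. */
    static void memory_write_line(uint32_t addr, const uint8_t d[LINE_BYTES])
    { (void)addr; (void)d; }
    static void memory_read_line(uint32_t addr, uint8_t d[LINE_BYTES])
    { (void)addr; memset(d, 0, LINE_BYTES); }

    /* Copy-back write: update only the cache and mark the line dirty; a dirty
     * victim is written back to memory before its line is reused.             */
    void cache_write_byte(uint32_t address, uint8_t value)
    {
        uint32_t offset = address & (LINE_BYTES - 1);
        uint32_t index  = (address >> OFFSET_BITS) & (NUM_LINES - 1);
        uint32_t tag    = address >> (OFFSET_BITS + INDEX_BITS);
        cache_line_t *line = &cache[index];

        if (!line->valid || line->tag != tag) {          /* miss: reuse this line */
            if (line->valid && line->dirty) {
                uint32_t old_addr = (line->tag << (OFFSET_BITS + INDEX_BITS))
                                  | (index << OFFSET_BITS);
                memory_write_line(old_addr, line->data); /* copy old data back    */
            }
            memory_read_line(address & ~(uint32_t)(LINE_BYTES - 1), line->data);
            line->valid = true;
            line->tag   = tag;
            line->dirty = false;
        }

        line->data[offset] = value;     /* the write itself touches only the cache */
        line->dirty = true;
    }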

Cache feature summary

The various parameters that define the organization of a cache are summarized in Table 10.1 on page 279. The first of these is the relationship between the cache and the memory management unit (MMU) which will be discussed further in 'Virtual and physical caches' on page 287; the others have been covered in this section.