Furber S., ARM System-on-Chip Architecture (2000)
The ARM946E-S and ARM966E-S
Table 12.6 ARM940T characteristics.

  Process:       0.25 µm
  Transistors:   802,000
  MIPS:          220
  Metal layers:  3
  Core area:     8.1 mm²
  Power:         385 mW
  Vdd:           2.5 V
  Clock:         0-200 MHz
  MIPS/W:        570
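The power-efficiency entry in the table follows directly from the MIPS and power figures; a quick Python check (220 MIPS at 385 mW gives roughly 571 MIPS/W, which the table quotes as 570):

```python
def mips_per_watt(mips, power_mw):
    """Power-efficiency figure of merit: native MIPS divided by power in watts."""
    return mips / (power_mw / 1000.0)

# ARM940T (Table 12.6): 220 MIPS at 385 mW
print(round(mips_per_watt(220, 385)))  # 571, quoted as 570 in the table
```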
As the ARM9TDMI operates at high clock rates, the data cache is designed to support write-back operation. The simpler write-through option is also available, and the memory protection unit is used to define which mode is selected for a particular address. Cache lines are allocated only on read misses.
The ARM940T supports cache 'cleaning' through a software mechanism that checks every cache line and flushes any that are dirty. This clearly takes some time. Unlike the ARM920T, it has no mechanism to clean lines by their memory address.
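The walk-every-line clean can be sketched in Python. On the real ARM940T this is driven by CP15 operations indexed by cache segment and line; the toy model below (all names are illustrative) just captures the idea of visiting each line and writing back the dirty ones:

```python
def clean_cache(lines, write_back):
    """Software cache clean: visit every cache line, write back any dirty one."""
    cleaned = 0
    for line in lines:
        if line['dirty']:
            write_back(line['addr'], line['data'])  # flush line to main memory
            line['dirty'] = False
            cleaned += 1
    return cleaned

memory = {}  # stand-in for main memory
cache = [{'addr': 0x100, 'data': 'A', 'dirty': True},
         {'addr': 0x120, 'data': 'B', 'dirty': False}]
print(clean_cache(cache, lambda a, d: memory.__setitem__(a, d)))  # 1
```

The cost the text alludes to is visible in the model: the loop runs over every line whether or not it is dirty, which is why a clean-by-address mechanism (as on the ARM920T) can be cheaper for small regions.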
ARM940T write buffer
The write buffer can hold up to eight words of data and four addresses. The memory protection unit defines which addresses are bufferable. Dirty lines flushed from the data cache also pass through the write buffer.
ARM940T silicon
The characteristics of an ARM940T implemented on a 0.25 µm CMOS process are summarized in Table 12.6, and a plot of the layout is shown in Figure 12.13 on page 340. It can be seen that the complete CPU core occupies a relatively small proportion of the area of a low-cost system-on-chip design.
12.5 The ARM946E-S and ARM966E-S
The ARM946E-S and ARM966E-S are synthesizable CPU cores based upon the ARM9E-S integer core. (See 'ARM9E-S' on page 263.)
Both CPU cores have AMBA AHB interfaces and can be synthesized with an embedded trace macrocell as described in Section 8.8 on page 237. They are intended for embedded applications and do not have address translation hardware.
ARM946E-S cache
The ARM946E-S has a 4-way set-associative cache. The set-associative organization was chosen over the CAM-RAM organization used in the ARM920T and ARM940T because it is easier to construct a set-associative cache with the synthesizable RAM structures available in standard ASIC libraries. Synthesizing CAM-RAM cache structures is still difficult for most design systems.
Figure 12.13 ARM940T CPU core plot.
The instruction and data caches can each be from 4 Kbytes to 64 Kbytes in size, and the two caches need not be the same size as each other. Both use an 8-word line, support lock-down, and employ a software-selectable replacement algorithm which may be pseudo-random or round-robin. The write strategy is also software-selectable between write-through and copy-back.

The ARM946E-S incorporates memory protection units and has an overall organization similar to the ARM940T (shown in Figure 12.12 on page 338).
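The two selectable replacement policies can be sketched as victim selectors for a single 4-way set. This Python model is illustrative only, not the hardware algorithm: in silicon the pseudo-random source is typically a small LFSR, modelled here with the standard library generator:

```python
import random

def make_victim_selector(policy, ways=4):
    """Return a function that picks the way to evict in one cache set."""
    counter = 0
    def round_robin():
        nonlocal counter            # cycle 0, 1, ..., ways-1 and wrap
        victim = counter
        counter = (counter + 1) % ways
        return victim
    def pseudo_random():
        return random.randrange(ways)  # any way, uniformly
    return round_robin if policy == 'round-robin' else pseudo_random

select = make_victim_selector('round-robin')
print([select() for _ in range(6)])  # [0, 1, 2, 3, 0, 1]
```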
ARM966E-S memory
The ARM966E-S does not employ caches. Instead, the synthesized macrocell incorporates tightly coupled SRAM mapped to fixed memory addresses. These memories can be a range of sizes. One memory is connected only to the data port, whereas the second memory is connected to both ports. Generally the second memory is used by the instruction port, but data port access is important for two reasons:

• Constants (such as addresses) embedded in code must be read through the data port.
• There must be a mechanism for initializing the instruction memory before it can be used. The instruction port is read-only and therefore cannot be used to initialize the instruction memory.

The ARM966E-S also incorporates a write buffer to optimize the use of the AMBA AHB bandwidth and supports the connection of on-chip coprocessors.
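The second point above is why boot code must copy the instruction image into the tightly coupled SRAM through the data port. A minimal sketch, with hypothetical names for the port and the image:

```python
def load_instruction_memory(data_port_write, image, base_addr):
    """Copy a code image word-by-word into the instruction SRAM.
    Writes go through the data port; the instruction port is read-only."""
    for offset, word in enumerate(image):
        data_port_write(base_addr + 4 * offset, word)  # 4 bytes per word

sram = {}  # toy instruction SRAM keyed by byte address
load_instruction_memory(lambda a, w: sram.__setitem__(a, w),
                        image=[0xE3A00000, 0xE3A01001],  # two ARM instructions
                        base_addr=0x0)
print(sorted(sram))  # [0, 4]
```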
ARM1020E write buffer
The write buffer comprises eight slots, each capable of holding an address and a 64-bit data double-word. A separate four-slot write buffer is used to handle cache cast-outs to ensure that there is no conflict in the main write buffer between cast-outs and hit-under-miss write traffic.
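The split can be modelled as two independent queues using the slot counts above. This toy Python model only shows why cast-outs cannot crowd ordinary buffered writes out of the main buffer; it says nothing about the drain logic of the real hardware:

```python
from collections import deque

MAIN_SLOTS = 8      # each slot: address + 64-bit double-word
CASTOUT_SLOTS = 4   # separate buffer for dirty lines evicted from the cache

main_wb, castout_wb = deque(), deque()

def post_write(addr, data):
    """Ordinary buffered write; stalls (returns False) only when the main buffer is full."""
    if len(main_wb) >= MAIN_SLOTS:
        return False
    main_wb.append((addr, data))
    return True

def post_castout(addr, line):
    """Cache cast-out; uses its own slots, so it never occupies a main-buffer slot."""
    if len(castout_wb) >= CASTOUT_SLOTS:
        return False
    castout_wb.append((addr, line))
    return True

# four cast-outs in flight consume no main write-buffer slots
for i in range(4):
    post_castout(0x1000 + 32 * i, b'\x00' * 32)
print(len(main_wb), len(castout_wb))  # 0 4
```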
ARM1020E MMU
The memory management system is based upon two 64-entry translation look-aside buffers, one for each cache. The TLBs also support selective lock-down to ensure that translation entries for critical real-time code sections do not get ejected.

ARM1020E bus interface
The external memory bus interface is AMBA AHB compliant (see Section 8.2 on page 216). The ARM1020E has separate AMBA AHB interfaces for the instruction and data memories. Each interface makes its own requests to the bus arbiter, though they share the 32-bit address and 64-bit unidirectional read and write data bus interfaces.

ARM1020E silicon
The target characteristics for an ARM1020E running at 1.5 V on a 0.18 µm CMOS process are summarized in Table 12.7. The CPU will run at supply voltages down to 1.2 V for improved power-efficiency.
Table 12.7 ARM1020E target characteristics.

  Process:       0.18 µm
  Transistors:   7,000,000
  MIPS:          500
  Metal layers:  5
  Core area:     12 mm²
  Power:         400 mW
  Vdd:           1.5 V
  Clock:         0-400 MHz
  MIPS/W:        1,250
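The 1.2 V option matters because CMOS dynamic power scales roughly with the square of the supply voltage at a fixed clock rate. This is a standard first-order approximation, not a figure from the table:

```python
def dynamic_power_scale(v_new, v_old):
    """First-order CMOS model: dynamic power ~ Vdd^2 at a fixed clock frequency."""
    return (v_new / v_old) ** 2

scale = dynamic_power_scale(1.2, 1.5)
print(f"{scale:.2f}")  # 0.64 -> roughly a 36% dynamic-power saving
```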
ARM10200
The ARM10200 is a reference chip design based upon the ARM1020E CPU core, to which the following additions have been made:

• a VFP10 vector floating-point unit;
• a high-performance synchronous DRAM interface;
• a phase-locked loop circuit to generate the high-speed CPU clock.

The ARM10200 is intended to be used for the evaluation of the ARM1020E CPU core and to support benchmarking and system prototyping activities.
VFP10
High-performance vector floating-point support is provided for the ARM10TDMI through the accompanying VFP10 floating-point coprocessor. The VFP10 incorporates a 5-stage load/store pipeline and a 7-stage execution pipeline and supports single- and double-precision IEEE 754 floating-point arithmetic (see Section 6.3 on page 158). It is capable of issuing a floating-point multiply-accumulate operation at a rate of one per clock cycle. It exploits the 64-bit data memory interface of the ARM1020E.
12.8 Example and exercises
Example 12.1
Why is a highly associative CAM-RAM cache better suited to support lock-down than a set-associative cache?

Early ARM CPUs, such as the ARM3 whose cache was studied in Section 10.4 on page 279, used highly associative caches with a CAM-RAM structure. The ARM700 series of CPUs abandoned CAM-RAM caches in favour of RAM-RAM set-associative caches because the RAM tag store uses less die area than the equivalent CAM tag store. Later ARM CPUs, from the ARM810 onwards, have reverted to a CAM-RAM cache, at least in part because of the need to support partial cache lock-down.

The reason why a CAM-RAM cache provides better support for lock-down is simply to do with its level of associativity. The cache locations that a particular memory location can map into are determined by decoding some of its address bits. In a direct-mapped cache there is only one cache location, a 2-way set-associative cache offers a choice of two possible locations, 4-way offers four, and so on.

In the ARM CPUs that support it, lock-down operates by reducing the range of this choice. With a 4-way set-associative cache it is practical to lock down a quarter of the cache (leaving a 3-way cache), half of the cache (2-way), or three-quarters of the cache (leaving a direct-mapped cache). This is a coarse granularity and is inefficient if, for example, all that need be locked down is memory to hold a small interrupt handler.

The CAM-RAM caches typically offer 64-way associativity, enabling the cache to be locked down in units of 1/64 of the total cache, which is a much finer granularity.
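The granularity argument can be made concrete by counting the bytes wasted when a small handler is locked down; the 8 Kbyte cache size and 200-byte handler below are illustrative values only:

```python
import math

def lockdown_waste(cache_bytes, ways, code_bytes):
    """Bytes wasted locking down 'code_bytes' of code: lock-down works in
    whole-way units, so round the locked region up to a multiple of one way."""
    way_bytes = cache_bytes // ways
    locked = math.ceil(code_bytes / way_bytes) * way_bytes
    return locked - code_bytes

# 8 Kbyte cache, 200-byte interrupt handler
print(lockdown_waste(8192, 4, 200))   # 1848 bytes wasted (one whole 2 KB way)
print(lockdown_waste(8192, 64, 200))  # 56 bytes wasted (two 128-byte units)
```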
Exercise 12.1.1
Discuss the impact of the following cache features on performance and power dissipation:

1. Increasing associativity by splitting the tag and data RAMs into multiple blocks which perform parallel look-ups.
2. Serializing the tag and data RAM accesses.
3. Exploiting sequential access patterns to bypass the tag look-up and to optimize the data RAM access mechanism.
4. Including separate data and instruction caches (as on StrongARM).
Exercise 12.1.2
Explain why an on-chip write buffer cannot be used with an off-chip memory management unit.

Exercise 12.1.3
Why, as processor speeds rise relative to memory speeds, does it become increasingly important to use a copy-back rather than a write-through cache write strategy?
Embedded ARM
Applications
Summary of chapter contents
Increasingly the trend in embedded system design is to integrate all the major system functions apart from some memory components into a single chip. The benefits in terms of component costs, reliability and power-efficiency are considerable. The development that makes this possible is the advance in semiconductor process technology which now allows chips incorporating millions of transistors to be built cheaply, and within a few years will allow tens of millions of transistors on a chip.
So, the era of complex systems on a single chip is upon us. The ARM has played a leading role in the opening of this era since its very small core size leaves more silicon resources available for the rest of the system functions. In this chapter we look at several examples of ARM-based 'systems on chips', but we are, in fact, only scratching the surface of this domain. The next few years will see a flood of embedded applications, many of which will be based around ARM processor cores.
Designing a 32-bit computer system is a complex undertaking. Designing one on a single chip where it must be 'right first time' is still very challenging. There is no formula for success, but there are many very powerful design tools available to assist the designer. As in most engineering endeavours, one can only get so far by studying the problem and the experiences of others. Understanding an existing design is far easier than creating a new one. Developing a new system on a chip is one of today's greatest challenges in digital electronics.