Furber S., ARM System-on-Chip Architecture (2000)


Figure 9.6 ARM8 integer unit organization.

The core layout can be seen in the ARM810 die photograph in Figure 12.6 on page 326 in the upper left-hand area of the die.


ARM Processor Cores

9.3 ARM9TDMI

 

The ARM9TDMI core takes the functionality of the ARM7TDMI up to a significantly higher performance level. Like the ARM7TDMI (and unlike the ARM8) it includes support for the Thumb instruction set and an EmbeddedICE module for on-chip debug support. The performance improvement is achieved by adopting a 5-stage pipeline to increase the maximum clock rate and by using separate instruction and data memory ports to allow an improved CPI (Clocks Per Instruction - a measure of how much work a processor does in a clock cycle).

Improved performance

The rationale which leads from a requirement for higher performance to the need to increase the number of pipeline stages from three (as in the ARM7TDMI) to five, and to change the memory interface to employ separate instruction and data memories, was discussed in Section 4.2 on page 78.

ARM9TDMI organization

The 5-stage ARM9TDMI pipeline owes a lot to the StrongARM pipeline described in Section 12.3 on page 327. (The StrongARM has not been included as a core in this chapter as it has limited applicability as a stand-alone core.)

The ARM9TDMI core organization was illustrated in Figure 4.4 on page 81. The principal difference between the ARM9TDMI and the StrongARM core (illustrated in Figure 12.8 on page 330) is that while StrongARM has a dedicated branch adder which operates in parallel with the register read stage, ARM9TDMI uses the main ALU for branch target calculations. This gives ARM9TDMI an additional clock cycle penalty for a taken branch, but results in a smaller and simpler core, and avoids a very critical timing path (illustrated in Figure 12.9 on page 331) which is present in the StrongARM design. The StrongARM was designed for a particular process technology where this timing path could be carefully managed, whereas the ARM9TDMI is required to be readily portable to new processes where such a critical path could easily compromise the maximum usable clock rate.

Pipeline operation

The operation of the 5-stage ARM9TDMI pipeline is illustrated in Figure 9.7 on page 261, where it is compared with the 3-stage ARM7TDMI pipeline. The figure shows how the major processing functions of the processor are redistributed across the additional pipeline stages in order to allow the clock frequency to be doubled (approximately) on the same process technology.

Redistributing the execution functions (register read, shift, ALU, register write) is not all that is required to achieve this higher clock rate. The processor must also be able to access the instruction memory in half the time that the ARM7TDMI takes, and the instruction decode logic must also be restructured to allow the register read to take place concurrently with a substantial part of the decoding.


Figure 9.7 ARM7TDMI and ARM9TDMI pipeline comparison.

Thumb decoding

The ARM7TDMI implements the Thumb instruction set by 'decompressing' Thumb instructions into ARM instructions using slack time in the ARM7 pipeline. The ARM9TDMI pipeline is much tighter and does not have sufficient slack time to allow Thumb instructions to be first translated into ARM instructions and then decoded; instead it has hardware to decode both ARM and Thumb instructions directly.
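As a rough illustration of what 'decompressing' means here, the sketch below expands one simple Thumb format (MOV Rd, #imm8) into its 32-bit ARM equivalent (MOVS Rd, #imm8). It is an illustrative model of the mapping, not ARM's decoder logic, and it covers only this one instruction format:

```python
def thumb_mov_imm_to_arm(thumb):
    """Expand Thumb 'MOV Rd, #imm8' (bits: 001 00 ddd iiiiiiii) into the
    equivalent ARM 'MOVS Rd, #imm8' data-processing encoding.
    Illustrative sketch only -- a real decoder handles every format."""
    assert (thumb >> 11) == 0b00100, "not a Thumb format-3 MOV"
    rd   = (thumb >> 8) & 0x7      # destination register
    imm8 = thumb & 0xFF            # 8-bit immediate
    # cond=AL (1110), immediate data-processing, opcode=MOV, S=1, rotate=0
    return 0xE3B00000 | (rd << 12) | imm8

# MOV r1, #5 in Thumb is 0x2105; the ARM equivalent is MOVS r1, #5
print(hex(thumb_mov_imm_to_arm(0x2105)))  # -> 0xe3b01005
```

Because the mapping is this direct, the ARM7TDMI could afford to do it on the fly; the ARM9TDMI's tighter timing instead motivates decoding Thumb natively.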

 

The extra 'Memory' stage in the ARM9TDMI pipeline does not have any direct equivalent in the ARM7TDMI. Its function is performed by additional 'Execute' cycles that interrupt the pipeline flow. This interruption is an inevitable consequence of the single memory port used by the ARM7TDMI for both instruction and data accesses. During a data access an instruction fetch cannot take place. The ARM9TDMI avoids this pipeline interruption through the provision of separate instruction and data memories.
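The cost of the shared port can be made concrete with a toy cycle count. The model below is a deliberate simplification (one cycle per instruction, one stolen fetch slot per data access), not a cycle-accurate simulation of either core:

```python
def cycle_count(instructions, unified_memory):
    """Count cycles for an instruction stream, where each entry gives the
    number of data accesses that instruction makes (0 for ALU ops,
    1 for a single LDR/STR, and so on). With a unified memory port
    (ARM7TDMI-style) every data access blocks the next instruction fetch
    for one cycle; with separate ports (ARM9TDMI-style) data accesses
    overlap instruction fetches."""
    cycles = 0
    for data_accesses in instructions:
        cycles += 1                  # one cycle to issue the instruction
        if unified_memory:
            cycles += data_accesses  # each data access steals a fetch slot
    return cycles

stream = [0, 1, 0, 0, 1, 1, 0]       # mix of ALU ops and single loads/stores
print(cycle_count(stream, unified_memory=True))   # -> 10
print(cycle_count(stream, unified_memory=False))  # -> 7
```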

Coprocessor support

The ARM9TDMI has a coprocessor interface which allows on-chip coprocessors for floating-point, digital signal processing or other special-purpose hardware acceleration requirements to be supported. (At the clock speeds it supports there is little possibility of off-chip coprocessors being useful.)

On-chip debug

The EmbeddedICE functionality in the ARM9TDMI core gives the same system-level debug features as that on the ARM7TDMI core (see Section 8.7 on page 232), with the following additional features:

• Hardware single-stepping is supported.

• Breakpoints can be set on exceptions in addition to the address/data/control conditions supported by ARM7TDMI.


Table 9.3 ARM9TDMI characteristics.

Process        0.25 µm
Metal layers   3
Transistors    111,000
Core area      2.1 mm²
MIPS           220
Power          150 mW
Vdd            2.5 V
Clock          0-200 MHz
MIPS/W         1500
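The table's MIPS/W entry can be cross-checked against its MIPS and power rows; the quoted 1500 is evidently a rounded figure:

```python
# Figures from Table 9.3 (ARM9TDMI on a 0.25 µm process)
mips, power_w = 220, 0.150
print(round(mips / power_w))   # -> 1467, quoted as 1500 in the table
```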

 

 

 

 

 

 

Low voltage operation

Although the first ARM9TDMI core was implemented on a 0.35 µm 3.3 V technology, the design has been ported onto 0.25 µm and 0.18 µm processes using power supplies down to 1.2 V.

ARM9TDMI core

The characteristics of the 0.25 µm ARM9TDMI core when executing 32-bit ARM code are summarized in Table 9.3. A plot of the core is shown in Figure 9.8.

 

Figure 9.8 The ARM9TDMI processor core.

ARM9TDMI applications

An ARM9TDMI core has separate instruction and data memory ports. While in principle it may be possible to connect these ports to a single unified memory, in practice doing so would negate many of the reasons for choosing the ARM9TDMI core over the smaller and cheaper ARM7TDMI core in the first place. Similarly, although it is not necessary to exploit the higher clock rate supported by the ARM9TDMI's 5-stage pipeline in comparison to the ARM7TDMI's 3-stage pipeline, not to do so would negate the rationale for using the ARM9TDMI. Therefore, any application that justifies the use of an ARM9TDMI core is going to have to cope with a complex high-speed memory subsystem.


The most common way of handling this memory requirement will be to incorporate separate instruction and data cache memories as exemplified by the various standard ARM CPU cores based around the ARM9TDMI. The ARM920T and ARM940T CPU cores are described in Section 12.4 on page 335. The caches in these CPU cores satisfy most of the memory bandwidth requirements of the ARM9TDMI and reduce the external bandwidth requirement to something that can be satisfied by conventional unified memories connected via a single AMBA bus.

An alternative solution, particularly applicable in embedded systems where the performance-critical code is relatively well contained, is to use an appropriate amount of separate directly addressed local instruction and data memory instead of caches.

ARM9E-S

The ARM9E-S is a synthesizable version of the ARM9TDMI core. It implements an extended version of the ARM instruction set compared with the 'hard' core. In addition to the ARM architecture v4T instructions supported by the ARM9TDMI, the ARM9E-S supports the full ARM architecture version v5TE (see Section 5.23 on page 147), including the signal processing instruction set extensions described in Section 8.9 on page 239.

The ARM9E-S is 30% larger than the ARM9TDMI on the same process. It occupies 2.7 mm² on a 0.25 µm CMOS process.
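As a flavour of what the v5TE signal-processing extensions add, the sketch below models the semantics of a QADD-style saturating 32-bit add in Python. It is an illustrative model of the arithmetic only; the real instruction also sets the Q (saturation) flag in the CPSR:

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def qadd(a, b):
    """Saturating signed 32-bit add, modelling the v5TE QADD instruction:
    on overflow the result clamps to the nearest representable limit
    instead of wrapping around."""
    s = a + b
    return max(INT32_MIN, min(INT32_MAX, s))

print(qadd(INT32_MAX, 1))   # -> 2147483647 (clamped, not wrapped)
print(qadd(-100, 40))       # -> -60 (no saturation needed)
```

Saturation is the behaviour DSP algorithms usually want, since a wrapped overflow in an audio sample is far more audible than a clipped one.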

9.4 ARM10TDMI

The ARM10TDMI is the current high-end ARM processor core and is still under development at the time of writing. Just as the ARM9TDMI delivers approximately twice the performance of the ARM7TDMI on the same process, the ARM10TDMI is positioned to operate at twice the performance of the ARM9TDMI. It is intended to deliver 400 Dhrystone 2.1 MIPS at 300 MHz on a 0.25 µm CMOS technology.

In order to achieve this level of performance, starting from the ARM9TDMI, two approaches have been combined (again, see the discussion in Section 4.2 on page 78):

1. The maximum clock rate has been increased.

2. The CPI (average number of Clocks Per Instruction) has been reduced.

Since the ARM9TDMI pipeline is already fairly optimal, how can these improvements be achieved without resorting to a very complex organization, such as superscalar execution, which would compromise the low power and small core size that are the hallmarks of an ARM core?

Increased clock rate

The maximum clock rate that an ARM core can support is determined by the slowest logic path in any of the pipeline stages.


The 5-stage ARM9TDMI is already well balanced (see Figure 9.7 on page 261); four of the five stages are heavily loaded. The pipeline could be extended to spread the logic over many more stages, but the benefits of such a 'super-pipelined' organization tend to be offset by the worsened CPI that results from the increased pipeline dependencies unless very complex mechanisms are employed to minimize these effects.

Instead, the ARM10TDMI approach is to retain a very similar pipeline to the ARM9TDMI but to support a higher clock rate by optimizing each stage in a particular way (see Figure 9.9):

• The fetch and memory stages are effectively increased from one to one-and-a-half clock cycles by providing the address for the next cycle early. To achieve this in the memory stage, memory addresses are computed in a separate adder that can produce its result faster than the main ALU (because it implements only a subset of the ALU's functionality).

• The execute stage uses a combination of improved circuit techniques and restructuring to reduce its critical path. For example, the multiplier does not feed into the main ALU to resolve its partial sum and product terms; instead it has its own adder in the memory stage (multiplications never access memory, so this stage is free).

• The instruction decode stage is the only part of the processor logic that could not be streamlined sufficiently to support the higher clock rate, so here an additional 'Issue' pipeline stage was inserted.

The result is a 6-stage pipeline that can operate faster than the 5-stage ARM9TDMI pipeline, but requires its supporting memories to be little faster than the ARM9TDMI's memories. This is of significance since very fast memories tend to be power-hungry. The extra pipeline stage, inserted to allow more time for instruction decode, only incurs pipeline dependency costs when an unpredicted branch is executed. Since the extra stage comes before the register read takes place it introduces no new operand dependencies and requires no new forwarding paths. With the inclusion of a branch prediction mechanism this pipeline will give a very similar CPI to the ARM9TDMI pipeline while supporting the higher clock rate.

Figure 9.9 The ARM10TDMI pipeline.
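The benefit of branch prediction in a deeper pipeline can be put in rough numbers. The figures below (branch frequency, prediction accuracy, refill cost) are hypothetical, chosen only to show the shape of the trade-off:

```python
def effective_cpi(base_cpi, branch_freq, flush_rate, refill_cycles):
    """Toy CPI model: each branch that causes a pipeline flush adds
    'refill_cycles' of dead time. All figures used here are hypothetical,
    not measured ARM9TDMI/ARM10TDMI data."""
    return base_cpi + branch_freq * flush_rate * refill_cycles

# Hypothetical: 1 in 6 instructions is a branch, a refill costs 3 cycles.
no_prediction   = effective_cpi(1.0, 1/6, 1.0, 3)  # every branch flushes
with_prediction = effective_cpi(1.0, 1/6, 0.2, 3)  # 80% predicted correctly
print(round(no_prediction, 2), round(with_prediction, 2))  # -> 1.5 1.1
```

The point of the model is simply that a modest prediction accuracy recovers most of the CPI that the extra stage would otherwise cost.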


Reduced CPI

The pipeline enhancements described above support a 50% higher clock rate without compromising the CPI. This is a good start, but will not yield the required 100% performance improvement. For this an improved CPI is needed on top of the increased clock rate.

 

Any plan to improve the CPI must start from a consideration of memory bandwidth. The ARM7TDMI uses its single 32-bit memory on (almost) every clock cycle, so the ARM9TDMI moved to a Harvard memory organization to release more bandwidth. The ARM9TDMI uses its instruction memory on (almost) every clock cycle. Although its data memory is only around 50% loaded, it is hard to exploit this to improve its CPI. The instruction bandwidth must be increased somehow.

The approach adopted in the ARM10TDMI is to employ 64-bit memories. This effectively removes the instruction bandwidth bottleneck and enables a number of CPI-improving features to be added to the processor organization:

 

• Branch prediction: the ARMIOTDMI branch prediction logic goes beyond what

 

is required simply to maintain the pipeline efficiency as discussed above.

 

Because instructions are fetched at a rate of two per clock cycle, the branch pre

 

diction unit (which is in the fetch pipeline stage) can often recognize branches

 

before they are issued and effectively remove them from the instruction stream,

 

reducing the cycle cost of the branch to zero.

 

The ARMIOTDMI employs a static branch prediction mechanism: conditional

 

branches that branch backwards are predicted to be taken; those that branch for-

 

wards are predicted not to be taken.

 

• Non-blocking load and store execution: a load or store instruction that cannot

 

complete in a single memory cycle, either because it is referencing slow memory

 

or because it is transferring multiple registers, does not stall the execution pipe

 

line until an operand dependency arises.

 

• The 64-bit data memory enables load and store multiple instructions to transfer

 

two registers in each clock cycle.

 

The non-blocking load and store logic requires independent register file read and

 

write ports, and the 64-bit load and store multiple instructions require two of each.

 

The ARMIOTDMI register bank therefore has four read ports and three write ports.
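The static prediction rule described above ('backwards taken, forwards not taken') is simple enough to state in a couple of lines:

```python
def predict_taken(branch_pc, target_pc):
    """ARM10TDMI-style static branch prediction: backward branches
    (typically loop-closing branches) are predicted taken, forward
    branches (typically skips) are predicted not taken."""
    return target_pc < branch_pc

print(predict_taken(0x8020, 0x8000))  # loop-closing branch -> True
print(predict_taken(0x8020, 0x8040))  # forward skip        -> False
```

The rule needs no history storage at all, which is what makes it cheap enough to sit in the fetch stage.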

 

Taken together these features enable the ARM10TDMI to achieve a Dhrystone 2.1 MIPS per MHz figure of 1.25, which may be compared with 0.9 for the ARM7TDMI and 1.1 for the ARM9TDMI. These figures are a direct reflection of the respective CPI performances when running the Dhrystone benchmark; other programs may give rather different CPI results, and the 64-bit data buses enable the ARM10TDMI to deliver a significantly better effective CPI than the ARM9TDMI on complex tasks such as booting an operating system.
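These MIPS/MHz figures combine multiplicatively with clock rate. Using the numbers quoted in this chapter (200 MHz for the ARM9TDMI, 300 MHz for the ARM10TDMI target):

```python
def dhrystone_mips(clock_mhz, mips_per_mhz):
    """Native performance is clock rate times work done per clock;
    MIPS/MHz is, up to the benchmark's scaling, the reciprocal of CPI."""
    return clock_mhz * mips_per_mhz

arm9  = dhrystone_mips(200, 1.1)    # ARM9TDMI figures from this chapter
arm10 = dhrystone_mips(300, 1.25)   # ARM10TDMI target clock and MIPS/MHz
print(round(arm9), round(arm10), round(arm10 / arm9, 2))  # -> 220 375 1.7
```

Note that the 1.7x ratio falls somewhat short of the 'twice the performance' positioning quoted earlier; the 400 MIPS target at 300 MHz implies a figure nearer 1.33 MIPS/MHz.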


ARM10TDMI applications

It was noted in the discussion of 'ARM9TDMI applications' on page 262 that at least some local high-speed memory was required to release the performance potential of the core. The same is true of the ARM10TDMI: without separate local 64-bit instruction and data memories the core will not be able to deliver its full performance, and it will go no faster than a smaller and cheaper ARM core.

Again, the usual (though not the only) way to resolve this problem is through the provision of local cache memories as exemplified by the ARM1020E described in Section 12.6 on page 341. Since the performance of the ARM10TDMI core is critically dependent on the availability of fast 64-bit local memory, discussion of its performance characteristics will be presented later in the context of the ARM1020E.

9.5 Discussion

All early ARM processor cores, up to and including the ARM7TDMI, were based on a simple 3-stage fetch-decode-execute pipeline. From the first ARM1, developed at Acorn Computers in the early 1980s, through to the ARM7TDMI cores in most of today's mobile telephone handsets, the basic principles of operation have barely changed. The development work carried out in the ARM's first decade focused on the following aspects of the design:

• performance improvement through critical path optimization and process shrinkage;

• low-power applications through static CMOS logic, power supply voltage reduction and code compression (the Thumb instruction set);

• support for system development through the addition of on-chip debug facilities, on-chip buses and software tools.

The ARM7TDMI represents the pinnacle of this development process, and its commercial success demonstrates the viability of the original very simple 3-stage pipeline in a world dominated by PCs with ever-increasingly complex superscalar, superpipelined, high-performance (and very power-hungry) microprocessors.

The second decade of ARM development has seen a careful diversification of the ARM organization in the quest for higher performance levels:

• The first step to a 5-stage pipeline yields a doubling of performance (all other factors being equal) at the cost of some forwarding logic in the core and either a double-bandwidth memory (as in the ARM8) or separate instruction and data memories (as in the ARM9TDMI and StrongARM).

• The next doubling of performance, achieved in the ARM10TDMI, is rather harder-won. The 6-stage pipeline is quite similar to the 5-stage pipeline used before, but the time slots allocated to memory access have been extended to enable the memories to support higher clock rates without burning excessive power. The processor core also incorporates more decoupling: in the prefetch unit to allow branches to be predicted and removed from the instruction stream, and in the data memory interface to allow the processor to continue executing when a data access takes some time to resolve (for example, due to a cache miss).

Performance improvement is achieved through a combination of increased clock rate and reduced CPI - the average number of clocks per instruction. The increased clock rate will usually require a deeper pipeline that will tend to worsen the CPI, so remedial measures are required to recover the CPI loss and then improve it further.

To date, all ARM processors have been based on organizations that issue at most one instruction per clock cycle, and always in program order. The ARM10TDMI and the AMULET3 processor (described in Section 14.5 on page 387) handle out-of-order completion in order to be able to keep instructions flowing during a slow data access, and both of these processors also include branch prediction logic to reduce the cost of refilling their pipelines on branch instructions. AMULET3 suppresses the fetching of predicted branch instructions but still executes them; ARM10TDMI fetches branch instructions but suppresses their execution. But by the standards of today's high-end PC and workstation processors, these are still very simple machines. This simplicity has direct benefits in system-on-chip applications in that simple processors require fewer transistors than complex processors and therefore occupy less die area and consume less power.

9.6 Example and exercises

Example 9.1

How should the ARM7TDMI address bus be retimed to interface to static RAM or ROM devices?

Normally the ARM7TDMI outputs new addresses as soon as they are available, which is towards the end of the preceding clock cycle. To interface to static memory devices the address must be held stable until after the end of the cycle, so this pipelining must be removed. This is most simply achieved by using ape, the address pipeline enable signal. In systems where some memory devices benefit from early addresses and some are static, either an external latch should be used to retime the addresses to the static devices or ape should be controlled to suit the currently addressed device.

Exercise 9.1.1

Review the processor cores described in this chapter and discuss the basic techniques used to increase the core performance by a factor of eight in going from the ARM7TDMI to the ARM10TDMI.


Exercise 9.1.2

In a system where the designer is free to vary the supply voltage to the processor it is possible to trade off performance (which scales in proportion to Vdd) against power-efficiency (which scales as 1/Vdd²). A measure of architectural power-efficiency that factors out the effect of power supply variations is therefore MIPS³/W. Compare each of the processor cores presented here on the basis of this measure.
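To see why MIPS³/W factors out Vdd: under the idealized scaling stated above, MIPS scales with Vdd, and power scales with Vdd³ (Vdd² from the switching energy, and one more factor of Vdd because clock rate itself scales with Vdd). The numerical sketch below uses the ARM9TDMI figures from Table 9.3 purely as a starting point:

```python
def scale_voltage(mips, power_w, vdd_old, vdd_new):
    """Idealized voltage scaling: MIPS tracks Vdd, power tracks Vdd**3
    (Vdd**2 from CV^2f switching energy, one more Vdd from clock rate)."""
    k = vdd_new / vdd_old
    return mips * k, power_w * k**3

m0, p0 = 220, 0.150                       # ARM9TDMI at 2.5 V (Table 9.3)
m1, p1 = scale_voltage(m0, p0, 2.5, 1.2)  # same core scaled down to 1.2 V
# Performance and power both change, but MIPS**3/W does not:
print(round(m0**3 / p0) == round(m1**3 / p1))  # -> True
```

Because the metric is invariant under this (idealized) scaling, differences in MIPS³/W between cores reflect architecture and circuit design rather than the chosen operating voltage.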

Exercise 9.1.3

Following on from the results of the previous exercise, why might the designer of a low-power system not simply select the most architecturally efficient processor core and then scale the supply voltage to give the required system performance?