Furber S., ARM System-on-Chip Architecture (2000)


Figure 9.6 ARM8 integer unit organization.

The core layout can be seen in the ARM810 die photograph in Figure 12.6 on page 326 in the upper left-hand area of the die.


ARM Processor Cores

9.3 ARM9TDMI

 

The ARM9TDMI core takes the functionality of the ARM7TDMI up to a significantly higher performance level. Like the ARM7TDMI (and unlike the ARM8) it includes support for the Thumb instruction set and an EmbeddedICE module for on-chip debug support. The performance improvement is achieved by adopting a 5-stage pipeline to increase the maximum clock rate and by using separate instruction and data memory ports to allow an improved CPI (Clocks Per Instruction - a measure of how much work a processor does in a clock cycle).

Improved performance

The rationale which leads from a requirement for higher performance to the need to increase the number of pipeline stages from three (as in the ARM7TDMI) to five, and to change the memory interface to employ separate instruction and data memories, was discussed in Section 4.2 on page 78.

ARM9TDMI organization

The 5-stage ARM9TDMI pipeline owes a lot to the StrongARM pipeline described in Section 12.3 on page 327. (The StrongARM has not been included as a core in this chapter as it has limited applicability as a stand-alone core.)

The ARM9TDMI core organization was illustrated in Figure 4.4 on page 81. The principal difference between the ARM9TDMI and the StrongARM core (illustrated in Figure 12.8 on page 330) is that while StrongARM has a dedicated branch adder which operates in parallel with the register read stage, ARM9TDMI uses the main ALU for branch target calculations. This gives ARM9TDMI an additional clock cycle penalty for a taken branch, but results in a smaller and simpler core, and avoids a very critical timing path (illustrated in Figure 12.9 on page 331) which is present in the StrongARM design. The StrongARM was designed for a particular process technology where this timing path could be carefully managed, whereas the ARM9TDMI is required to be readily portable to new processes where such a critical path could easily compromise the maximum usable clock rate.

Pipeline operation

The operation of the 5-stage ARM9TDMI pipeline is illustrated in Figure 9.7 on page 261, where it is compared with the 3-stage ARM7TDMI pipeline. The figure shows how the major processing functions of the processor are redistributed across the additional pipeline stages in order to allow the clock frequency to be doubled (approximately) on the same process technology.

Redistributing the execution functions (register read, shift, ALU, register write) is not all that is required to achieve this higher clock rate. The processor must also be able to access the instruction memory in half the time that the ARM7TDMI takes, and the instruction decode logic must also be restructured to allow the register read to take place concurrently with a substantial part of the decoding.


Figure 9.7 ARM7TDMI and ARM9TDMI pipeline comparison.

Thumb decoding

The ARM7TDMI implements the Thumb instruction set by 'decompressing' Thumb instructions into ARM instructions using slack time in the ARM7 pipeline. The ARM9TDMI pipeline is much tighter and does not have sufficient slack time to allow Thumb instructions to be first translated into ARM instructions and then decoded; instead it has hardware to decode both ARM and Thumb instructions directly.
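As a rough illustration of what 'decompressing' means here, the sketch below expands one simple Thumb format (MOV Rd, #imm8) into its 32-bit ARM equivalent (MOVS Rd, #imm8). It is an illustrative model of the mapping, not ARM's decoder logic, and it covers only this one instruction format:

```python
def thumb_mov_imm_to_arm(thumb):
    """Expand Thumb 'MOV Rd, #imm8' (bits: 001 00 ddd iiiiiiii) into the
    equivalent ARM 'MOVS Rd, #imm8' data-processing encoding.
    Illustrative sketch only -- a real decoder handles every format."""
    assert (thumb >> 11) == 0b00100, "not a Thumb format-3 MOV"
    rd   = (thumb >> 8) & 0x7      # destination register
    imm8 = thumb & 0xFF            # 8-bit immediate
    # cond=AL (1110), immediate data-processing, opcode=MOV, S=1, rotate=0
    return 0xE3B00000 | (rd << 12) | imm8

# MOV r1, #5 in Thumb is 0x2105; the ARM equivalent is MOVS r1, #5
print(hex(thumb_mov_imm_to_arm(0x2105)))  # -> 0xe3b01005
```

Because the mapping is this direct, the ARM7TDMI could afford to do it on the fly; the ARM9TDMI's tighter timing instead motivates decoding Thumb natively.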

 

The extra 'Memory' stage in the ARM9TDMI pipeline does not have any direct equivalent in the ARM7TDMI. Its function is performed by additional 'Execute' cycles that interrupt the pipeline flow. This interruption is an inevitable consequence of the single memory port used by the ARM7TDMI for both instruction and data accesses. During a data access an instruction fetch cannot take place. The ARM9TDMI avoids this pipeline interruption through the provision of separate instruction and data memories.
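The cost of the shared port can be made concrete with a toy cycle count. The model below is a deliberate simplification (one cycle per instruction, one stolen fetch slot per data access), not a cycle-accurate simulation of either core:

```python
def cycle_count(instructions, unified_memory):
    """Count cycles for an instruction stream, where each entry gives the
    number of data accesses that instruction makes (0 for ALU ops,
    1 for a single LDR/STR, and so on). With a unified memory port
    (ARM7TDMI-style) every data access blocks the next instruction fetch
    for one cycle; with separate ports (ARM9TDMI-style) data accesses
    overlap instruction fetches."""
    cycles = 0
    for data_accesses in instructions:
        cycles += 1                  # one cycle to issue the instruction
        if unified_memory:
            cycles += data_accesses  # each data access steals a fetch slot
    return cycles

stream = [0, 1, 0, 0, 1, 1, 0]       # mix of ALU ops and single loads/stores
print(cycle_count(stream, unified_memory=True))   # -> 10
print(cycle_count(stream, unified_memory=False))  # -> 7
```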

Coprocessor support

The ARM9TDMI has a coprocessor interface which allows on-chip coprocessors for floating-point, digital signal processing or other special-purpose hardware acceleration requirements to be supported. (At the clock speeds it supports there is little possibility of off-chip coprocessors being useful.)

On-chip debug

The EmbeddedICE functionality in the ARM9TDMI core gives the same system-level debug features as that on the ARM7TDMI core (see Section 8.7 on page 232), with the following additional features:

• Hardware single-stepping is supported.

• Breakpoints can be set on exceptions in addition to the address/data/control conditions supported by ARM7TDMI.


Table 9.3 ARM9TDMI characteristics.

Process        0.25 µm
Metal layers   3
Transistors    111,000
Core area      2.1 mm²
MIPS           220
Power          150 mW
Vdd            2.5 V
Clock          0-200 MHz
MIPS/W         1500
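The table's MIPS/W entry can be cross-checked against its MIPS and power rows; the quoted 1500 is evidently a rounded figure:

```python
# Figures from Table 9.3 (ARM9TDMI on a 0.25 µm process)
mips, power_w = 220, 0.150
print(round(mips / power_w))   # -> 1467, quoted as 1500 in the table
```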

 

 

 

 

 

 

Low voltage operation

Although the first ARM9TDMI core was implemented on a 0.35 µm 3.3 V technology, the design has been ported onto 0.25 µm and 0.18 µm processes using power supplies down to 1.2 V.

ARM9TDMI core

The characteristics of the 0.25 µm ARM9TDMI core when executing 32-bit ARM code are summarized in Table 9.3. A plot of the core is shown in Figure 9.8.

 

Figure 9.8 The ARM9TDMI processor core.

ARM9TDMI applications

An ARM9TDMI core has separate instruction and data memory ports. While in principle it may be possible to connect these ports to a single unified memory, in practice doing so would negate many of the reasons for choosing the ARM9TDMI core over the smaller and cheaper ARM7TDMI core in the first place. Similarly, although it is not necessary to exploit the higher clock rate supported by the ARM9TDMI's 5-stage pipeline in comparison to the ARM7TDMI's 3-stage pipeline, not to do so would negate the rationale for using the ARM9TDMI. Therefore, any application that justifies the use of an ARM9TDMI core is going to have to cope with a complex high-speed memory subsystem.


The most common way of handling this memory requirement will be to incorporate separate instruction and data cache memories as exemplified by the various standard ARM CPU cores based around the ARM9TDMI. The ARM920T and ARM940T CPU cores are described in Section 12.4 on page 335. The caches in these CPU cores satisfy most of the memory bandwidth requirements of the ARM9TDMI and reduce the external bandwidth requirement to something that can be satisfied by conventional unified memories connected via a single AMBA bus.

An alternative solution, particularly applicable in embedded systems where the performance-critical code is relatively well contained, is to use an appropriate amount of separate directly addressed local instruction and data memory instead of caches.

ARM9E-S

The ARM9E-S is a synthesizable version of the ARM9TDMI core. It implements an extended version of the ARM instruction set compared with the 'hard' core. In addition to the ARM architecture v4T instructions supported by the ARM9TDMI, the ARM9E-S supports the full ARM architecture version v5TE (see Section 5.23 on page 147), including the signal processing instruction set extensions described in Section 8.9 on page 239.

The ARM9E-S is 30% larger than the ARM9TDMI on the same process. It occupies 2.7 mm² on a 0.25 µm CMOS process.
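As a flavour of what the v5TE signal-processing extensions add, the sketch below models the semantics of a QADD-style saturating 32-bit add in Python. It is an illustrative model of the arithmetic only; the real instruction also sets the Q (saturation) flag in the CPSR:

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def qadd(a, b):
    """Saturating signed 32-bit add, modelling the v5TE QADD instruction:
    on overflow the result clamps to the nearest representable limit
    instead of wrapping around."""
    s = a + b
    return max(INT32_MIN, min(INT32_MAX, s))

print(qadd(INT32_MAX, 1))   # -> 2147483647 (clamped, not wrapped)
print(qadd(-100, 40))       # -> -60 (no saturation needed)
```

Saturation is the behaviour DSP algorithms usually want, since a wrapped overflow in an audio sample is far more audible than a clipped one.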

9.4 ARM10TDMI

The ARM10TDMI is the current high-end ARM processor core and is still under development at the time of writing. Just as the ARM9TDMI delivers approximately twice the performance of the ARM7TDMI on the same process, the ARM10TDMI is positioned to operate at twice the performance of the ARM9TDMI. It is intended to deliver 400 Dhrystone 2.1 MIPS at 300 MHz on a 0.25 µm CMOS technology.

In order to achieve this level of performance, starting from the ARM9TDMI, two approaches have been combined (again, see the discussion in Section 4.2 on page 78):

1. The maximum clock rate has been increased.

2. The CPI (average number of Clocks Per Instruction) has been reduced.

Since the ARM9TDMI pipeline is already fairly optimal, how can these improvements be achieved without resorting to a very complex organization, such as superscalar execution, which would compromise the low power and small core size that are the hallmarks of an ARM core?

Increased clock rate

The maximum clock rate that an ARM core can support is determined by the slowest logic path in any of the pipeline stages.


The 5-stage ARM9TDMI is already well balanced (see Figure 9.7 on page 261); four of the five stages are heavily loaded. The pipeline could be extended to spread the logic over many more stages, but the benefits of such a 'super-pipelined' organization tend to be offset by the worsened CPI that results from the increased pipeline dependencies unless very complex mechanisms are employed to minimize these effects.

Instead, the ARM10TDMI approach is to retain a very similar pipeline to the ARM9TDMI but to support a higher clock rate by optimizing each stage in a particular way (see Figure 9.9):

• The fetch and memory stages are effectively increased from one to one-and-a-half clock cycles by providing the address for the next cycle early. To achieve this in the memory stage, memory addresses are computed in a separate adder that can produce its result faster than the main ALU (because it implements only a subset of the ALU's functionality).

• The execute stage uses a combination of improved circuit techniques and restructuring to reduce its critical path. For example, the multiplier does not feed into the main ALU to resolve its partial sum and product terms; instead it has its own adder in the memory stage (multiplications never access memory, so this stage is free).

• The instruction decode stage is the only part of the processor logic that could not be streamlined sufficiently to support the higher clock rate, so here an additional 'Issue' pipeline stage was inserted.

The result is a 6-stage pipeline that can operate faster than the 5-stage ARM9TDMI pipeline, but requires its supporting memories to be little faster than the ARM9TDMI's memories. This is of significance since very fast memories tend to be power-hungry. The extra pipeline stage, inserted to allow more time for instruction decode, only incurs pipeline dependency costs when an unpredicted branch is executed. Since the extra stage comes before the register read takes place it introduces no new operand dependencies and requires no new forwarding paths. With the inclusion of a branch prediction mechanism this pipeline will give a very similar CPI to the ARM9TDMI pipeline while supporting the higher clock rate.

Figure 9.9 The ARM10TDMI pipeline.
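The benefit of branch prediction in a deeper pipeline can be put in rough numbers. The figures below (branch frequency, prediction accuracy, refill cost) are hypothetical, chosen only to show the shape of the trade-off:

```python
def effective_cpi(base_cpi, branch_freq, flush_rate, refill_cycles):
    """Toy CPI model: each branch that causes a pipeline flush adds
    'refill_cycles' of dead time. All figures used here are hypothetical,
    not measured ARM9TDMI/ARM10TDMI data."""
    return base_cpi + branch_freq * flush_rate * refill_cycles

# Hypothetical: 1 in 6 instructions is a branch, a refill costs 3 cycles.
no_prediction   = effective_cpi(1.0, 1/6, 1.0, 3)  # every branch flushes
with_prediction = effective_cpi(1.0, 1/6, 0.2, 3)  # 80% predicted correctly
print(round(no_prediction, 2), round(with_prediction, 2))  # -> 1.5 1.1
```

The point of the model is simply that a modest prediction accuracy recovers most of the CPI that the extra stage would otherwise cost.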


Reduced CPI

The pipeline enhancements described above support a 50% higher clock rate without compromising the CPI. This is a good start, but will not yield the required 100% performance improvement. For this an improved CPI is needed on top of the increased clock rate.

 

Any plan to improve the CPI must start from a consideration of memory bandwidth. The ARM7TDMI uses its single 32-bit memory on (almost) every clock cycle, so the ARM9TDMI moved to a Harvard memory organization to release more bandwidth. The ARM9TDMI uses its instruction memory on (almost) every clock cycle. Although its data memory is only around 50% loaded, it is hard to exploit this to improve its CPI. The instruction bandwidth must be increased somehow.

The approach adopted in the ARM10TDMI is to employ 64-bit memories. This effectively removes the instruction bandwidth bottleneck and enables a number of CPI-improving features to be added to the processor organization:

 

• Branch prediction: the ARMIOTDMI branch prediction logic goes beyond what

 

is required simply to maintain the pipeline efficiency as discussed above.

 

Because instructions are fetched at a rate of two per clock cycle, the branch pre

 

diction unit (which is in the fetch pipeline stage) can often recognize branches

 

before they are issued and effectively remove them from the instruction stream,

 

reducing the cycle cost of the branch to zero.

 

The ARMIOTDMI employs a static branch prediction mechanism: conditional

 

branches that branch backwards are predicted to be taken; those that branch for-

 

wards are predicted not to be taken.

 

• Non-blocking load and store execution: a load or store instruction that cannot

 

complete in a single memory cycle, either because it is referencing slow memory

 

or because it is transferring multiple registers, does not stall the execution pipe

 

line until an operand dependency arises.

 

• The 64-bit data memory enables load and store multiple instructions to transfer

 

two registers in each clock cycle.

 

The non-blocking load and store logic requires independent register file read and

 

write ports, and the 64-bit load and store multiple instructions require two of each.

 

The ARMIOTDMI register bank therefore has four read ports and three write ports.
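The static prediction rule described above ('backwards taken, forwards not taken') is simple enough to state in a couple of lines:

```python
def predict_taken(branch_pc, target_pc):
    """ARM10TDMI-style static branch prediction: backward branches
    (typically loop-closing branches) are predicted taken, forward
    branches (typically skips) are predicted not taken."""
    return target_pc < branch_pc

print(predict_taken(0x8020, 0x8000))  # loop-closing branch -> True
print(predict_taken(0x8020, 0x8040))  # forward skip        -> False
```

The rule needs no history storage at all, which is what makes it cheap enough to sit in the fetch stage.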

 

Taken together these features enable the ARM10TDMI to achieve a Dhrystone 2.1 MIPS per MHz figure of 1.25, which may be compared with 0.9 for the ARM7TDMI and 1.1 for the ARM9TDMI. These figures are a direct reflection of the respective CPI performances when running the Dhrystone benchmark; other programs may give rather different CPI results, and the 64-bit data buses enable the ARM10TDMI to deliver a significantly better effective CPI than the ARM9TDMI on complex tasks such as booting an operating system.
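These MIPS/MHz figures combine multiplicatively with clock rate. Using the numbers quoted in this chapter (200 MHz for the ARM9TDMI, 300 MHz for the ARM10TDMI target):

```python
def dhrystone_mips(clock_mhz, mips_per_mhz):
    """Native performance is clock rate times work done per clock;
    MIPS/MHz is, up to the benchmark's scaling, the reciprocal of CPI."""
    return clock_mhz * mips_per_mhz

arm9  = dhrystone_mips(200, 1.1)    # ARM9TDMI figures from this chapter
arm10 = dhrystone_mips(300, 1.25)   # ARM10TDMI target clock and MIPS/MHz
print(round(arm9), round(arm10), round(arm10 / arm9, 2))  # -> 220 375 1.7
```

Note that the 1.7x ratio falls somewhat short of the 'twice the performance' positioning quoted earlier; the 400 MIPS target at 300 MHz implies a figure nearer 1.33 MIPS/MHz.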


ARM10TDMI applications

It was noted in the discussion of 'ARM9TDMI applications' on page 262 that at least some local high-speed memory was required to release the performance potential of the core. The same is true of the ARM10TDMI: without separate local 64-bit instruction and data memories the core will not be able to deliver its full performance, and it will go no faster than a smaller and cheaper ARM core.

Again, the usual (though not the only) way to resolve this problem is through the provision of local cache memories as exemplified by the ARM1020E described in Section 12.6 on page 341. Since the performance of the ARM10TDMI core is critically dependent on the availability of fast 64-bit local memory, discussion of its performance characteristics will be presented later in the context of the ARM1020E.

9.5 Discussion

All early ARM processor cores, up to and including the ARM7TDMI, were based on a simple 3-stage fetch-decode-execute pipeline. From the first ARM1, developed at Acorn Computers in the early 1980s, through to the ARM7TDMI cores in most of today's mobile telephone handsets, the basic principles of operation have barely changed. The development work carried out in the ARM's first decade focused on the following aspects of the design:

• performance improvement through critical path optimization and process shrinkage;

• low-power applications through static CMOS logic, power supply voltage reduction and code compression (the Thumb instruction set);

• support for system development through the addition of on-chip debug facilities, on-chip buses and software tools.

The ARM7TDMI represents the pinnacle of this development process, and its commercial success demonstrates the viability of the original very simple 3-stage pipeline in a world dominated by PCs with ever-increasingly complex superscalar, superpipelined, high-performance (and very power-hungry) microprocessors.

The second decade of ARM development has seen a careful diversification of the ARM organization in the quest for higher performance levels:

• The first step to a 5-stage pipeline yields a doubling of performance (all other factors being equal) at the cost of some forwarding logic in the core and either a double-bandwidth memory (as in the ARM8) or separate instruction and data memories (as in the ARM9TDMI and StrongARM).

• The next doubling of performance, achieved in the ARM10TDMI, is rather harder-won. The 6-stage pipeline is quite similar to the 5-stage pipeline used before, but the time slots allocated to memory access have been extended to enable the memories to support higher clock rates without burning excessive power. The processor core also incorporates more decoupling: in the prefetch unit to allow branches to be predicted and removed from the instruction stream, and in the data memory interface to allow the processor to continue executing when a data access takes some time to resolve (for example, due to a cache miss).

Performance improvement is achieved through a combination of increased clock rate and reduced CPI - the average number of clocks per instruction. The increased clock rate will usually require a deeper pipeline that will tend to worsen the CPI, so remedial measures are required to recover the CPI loss and then improve it further.

To date, all ARM processors have been based on organizations that issue at most one instruction per clock cycle, and always in program order. The ARM10TDMI and the AMULET3 processor (described in Section 14.5 on page 387) handle out-of-order completion in order to be able to keep instructions flowing during a slow data access, and both of these processors also include branch prediction logic to reduce the cost of refilling their pipelines on branch instructions. AMULET3 suppresses the fetching of predicted branch instructions but still executes them; ARM10TDMI fetches branch instructions but suppresses their execution. But by the standards of today's high-end PC and workstation processors, these are still very simple machines. This simplicity has direct benefits in system-on-chip applications in that simple processors require fewer transistors than complex processors and therefore occupy less die area and consume less power.

9.6 Example and exercises

Example 9.1

How should the ARM7TDMI address bus be retimed to interface to static RAM or ROM devices?

Normally the ARM7TDMI outputs new addresses as soon as they are available, which is towards the end of the preceding clock cycle. To interface to static memory devices the address must be held stable until after the end of the cycle, so this pipelining must be removed. This is most simply achieved by using ape, the address pipeline enable signal. In systems where some memory devices benefit from early addresses and some are static, either an external latch should be used to retime the addresses to the static devices or ape should be controlled to suit the currently addressed device.

Exercise 9.1.1

Review the processor cores described in this chapter and discuss the basic techniques used to increase the core performance by a factor of eight in going from the ARM7TDMI to the ARM10TDMI.


Exercise 9.1.2

In a system where the designer is free to vary the supply voltage to the processor it is possible to trade off performance (which scales in proportion to Vdd) against power-efficiency (which scales as 1/Vdd²). A measure of architectural power-efficiency that factors out the effect of power supply variations is therefore MIPS³/W. Compare each of the processor cores presented here on the basis of this measure.
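To see why MIPS³/W factors out Vdd: under the idealized scaling stated above, MIPS scales with Vdd, and power scales with Vdd³ (Vdd² from the switching energy, and one more factor of Vdd because clock rate itself scales with Vdd). The numerical sketch below uses the ARM9TDMI figures from Table 9.3 purely as a starting point:

```python
def scale_voltage(mips, power_w, vdd_old, vdd_new):
    """Idealized voltage scaling: MIPS tracks Vdd, power tracks Vdd**3
    (Vdd**2 from CV^2f switching energy, one more Vdd from clock rate)."""
    k = vdd_new / vdd_old
    return mips * k, power_w * k**3

m0, p0 = 220, 0.150                       # ARM9TDMI at 2.5 V (Table 9.3)
m1, p1 = scale_voltage(m0, p0, 2.5, 1.2)  # same core scaled down to 1.2 V
# Performance and power both change, but MIPS**3/W does not:
print(round(m0**3 / p0) == round(m1**3 / p1))  # -> True
```

Because the metric is invariant under this (idealized) scaling, differences in MIPS³/W between cores reflect architecture and circuit design rather than the chosen operating voltage.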

Exercise 9.1.3

Following on from the results of the previous exercise, why might the designer of a low-power system not simply select the most architecturally efficient processor core and then scale the supply voltage to give the required system performance?