

11.7 Synchronization

A standard problem in a system which runs multiple processes that share data structures is to control accesses to the shared data to ensure correct behaviour.

For example, consider a system where a set of sensor values is sampled and stored in memory by one process and used at arbitrary times by another. If it is important that the second process always sees a single snapshot of the values, care must be taken to ensure that the first process does not get swapped out and the second swapped in when the values are only partially updated. The mechanisms used to achieve this are called process synchronization. What is required is mutually exclusive access to the data structure.

Mutual exclusion

A process which is about to perform an operation on a shared data structure, where the operation requires that no other process is accessing the structure, must wait until no other process is accessing the data and then set some sort of lock to prevent another process from accessing it until it has finished the operation.

One way to achieve mutual exclusion is to use a particular memory location to control access to a data structure. For example, the location could contain a Boolean value indicating whether or not the data structure is currently in use. A process which wishes to use the data structure must wait until it is free, then mark it as busy while it uses the data, then mark it as free again when it has finished using it. The problem is that an interrupt can arise between the structure becoming free and it being marked as busy. The interrupt causes a process switch; the new process sees the structure is free, marks it as busy, changes it a bit, and then another interrupt returns control to the first process, which is in a state where it believes, now incorrectly, that the structure is free.

A standard solution to this problem is to disable interrupts while the Boolean is tested and set. This works, but on a processor with a protected supervisor mode (such as the ARM) user-level code cannot disable interrupts, so a system call is required, which takes several clock cycles to complete and return control to the user process.

SWAP

A more efficient solution is to use an atomic (that is, uninterruptable) 'test and set' instruction. The ARM 'SWAP' instruction (see Section 5.13 on page 132) is just such an instruction, included in the instruction set for exactly this purpose. A register is set to the 'busy' value, then this register is swapped with the memory location containing the Boolean. If the loaded value is 'free' the process can continue; if it is 'busy' the process must wait, often by spinning (repeating the test until it gets the 'free' result) on the lock.
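In C terms, the SWAP-based lock can be sketched with the C11 atomic exchange operation, which plays the same role as SWP: write the 'busy' value and receive the previous contents in a single atomic step. This is an illustrative sketch, not the ARM instruction sequence itself.

```c
#include <stdatomic.h>

/* A lock word: 0 = free, 1 = busy. atomic_exchange corresponds to
 * SWP r, r, [lock]: store 'busy', get back the old value atomically. */
typedef atomic_int spinlock_t;

void spin_lock(spinlock_t *lock)
{
    /* Keep swapping 'busy' in until the value we get back is 'free'. */
    while (atomic_exchange(lock, 1) != 0)
        ;                        /* spin: the lock was already busy */
}

void spin_unlock(spinlock_t *lock)
{
    atomic_store(lock, 0);       /* mark the structure free again */
}
```

Note that spinning is only sensible when the lock is expected to be held briefly; otherwise the waiting process should yield to the operating system.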

 

Note that this is the only reason for including SWAP in the ARM instruction set. It does not contribute to the processor's performance and its dynamic frequency of use is negligible. It is there just to provide this functionality.

Architectural Support for Operating Systems

11.8 Context switching

A process runs in a context, which is all the system state that must be established for the process to run correctly. This state includes:

the values of all of the processor's registers, including the program counter, stack pointer, and so on;

the values in the floating-point registers, if the process uses them;

the translation tables in memory (but not the contents of the TLB, since it is just a cache of the values in memory and will automatically reload active values as they are used);

data values used by the process in memory (but not the values in the cache, since they will automatically be reloaded when required).

When a process switch takes place, the context of the old process must be saved and that of the new process restored (if it is resuming rather than starting for the first time).

When to switch

Context switching may occur as a result of an external interrupt, for example:

A timer interrupt causes the operating system to make a new process active according to a time-slicing algorithm.

A high-priority process which is waiting for a particular event is reactivated in response to that event.

Alternatively, a process may run out of useful work and call the operating system to be made dormant until an external event occurs.

In all cases, the operating system is given or takes control and is responsible for saving the old and restoring the new context. In an ARM-based system, this will normally take place while the processor is in supervisor mode.

Register state

If all context switches take place in response to IRQs, internal faults or supervisor calls, and the supervisor code does not re-enable interrupts, the process register state may be restricted to the user-mode registers. If context switches may take place in response to FIQs, or supervisor code does re-enable interrupts, it may be necessary to save and restore some of the privileged-mode registers as well.

The 'architectural support' for register saving and restoring offered on the ARM recognizes the difficulty of saving and restoring user registers from a privileged mode and provides special instructions to assist in this task. These instructions are the special forms of the load and store multiple instructions (see Section 5.12 on page 130) which allow code running in a non-user mode to save and restore the user registers from an area of memory addressed by a non-user mode register.
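A hypothetical layout for the saved user-mode state might look as follows in C; the structure and field names are illustrative, since each operating system defines its own frame format.

```c
#include <stdint.h>

/* Illustrative saved-context frame for one process. A real kernel
 * chooses its own layout; this one simply lists the user-visible state. */
struct user_context {
    uint32_t r[13];   /* r0-r12 */
    uint32_t sp;      /* banked user r13, reachable from supervisor mode */
    uint32_t lr;      /* banked user r14 */
    uint32_t pc;      /* address at which the process resumes */
    uint32_t cpsr;    /* saved from the SPSR on exception entry */
};

/* In supervisor mode, something like
 *     STMIA  r0, {r0-r14}^    ; '^' selects the *user* registers
 * fills the first fifteen words of such a frame without ever leaving
 * the privileged mode. */
```

The frame is sixteen words of register state plus one word of status, seventeen 32-bit words in all.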


Without these instructions, an operating system would have to switch into user mode to save or restore the banked user registers and then get back through the protection barrier into supervisor mode. Though possible, this solution is inefficient.

Floating-point state

The floating-point registers, whether held in a hardware coprocessor or maintained in memory by a software emulator, represent part of the state of any process that uses them. Rather than add to the context switching overhead by saving and restoring them on every process swap, the operating system simply disables user-level use of the floating-point system when a process that uses floating-point is swapped out. If the new process attempts to use the floating-point system, the first use will trap. At that point the operating system will save the old process state and restore the new, then it will re-enable the floating-point system and the new process can use it freely.

Thus the floating-point context switch overhead is incurred only when strictly necessary.
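The lazy scheme can be sketched as follows; the types and function names are invented for illustration, and the actual register save and restore are elided.

```c
#include <stddef.h>

/* Simplified per-process floating-point state (illustrative). */
struct process { double fp_regs[8]; };

struct process *fp_owner = NULL;  /* whose state is live in the FP unit */
int fp_enabled = 0;               /* may user code touch the FP system? */

/* Called on every process swap: no saving, just disable user FP access. */
void context_switch_fp(void)
{
    fp_enabled = 0;
}

/* Called when a process traps on its first FP instruction after a swap. */
void fp_trap(struct process *current)
{
    if (fp_owner != current) {
        /* save_fp_state(fp_owner); restore_fp_state(current); -- elided */
        fp_owner = current;
    }
    fp_enabled = 1;               /* current now owns the FP unit */
}
```

If a process never touches the floating-point system, it never pays for it: the trap, and hence the save/restore, simply does not happen.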

Translation state

Where the old and new processes have independent translation tables, a heavy-weight process switch is required. The complete translation table structure can be switched simply by changing the base address of the first-level page table in CP15 register 2, but since this will invalidate existing TLB and (virtually addressed) cache entries, these must be flushed. The TLB and an instruction or write-through data cache can be flushed simply by marking all entries as invalid, which on an ARM processor chip requires a single CP15 instruction for each TLB or cache, but a copy-back cache must be purged of all dirty lines, which may take many instructions.

(Note that a physically addressed cache avoids this problem, but to date all ARM CPUs have used virtually addressed caches.)

Where the old and new processes share the same translation tables a light-weight process switch is required. The 'domain' mechanism in the ARM MMU architecture allows the protection state of 16 different subsets of the virtual address space to be reconfigured with a single update of CP15 register 3.

In order to ensure that the cache does not represent a leak in the protection system, a cache access must be accompanied by a permission check. This could be achieved by storing the domain and access permission information along with the data in each cache line, but current ARM processors check permissions using information in the MMU concurrently with the cache access.


11.9 Input/Output

The input/output (I/O) functions are implemented in an ARM system using a combination of memory-mapped addressable peripheral registers and the interrupt inputs. Some ARM systems may also include direct memory access (DMA) hardware.

Memory-mapped peripherals

A peripheral device, such as a serial line controller, contains a number of registers. In a memory-mapped system, each of these registers appears like a memory location at a particular address. (An alternative system organization might have I/O functions in a separate address space from memory devices.) A serial line controller may have a set of registers as follows:

A transmit data register (write only); data written to this location gets sent down the serial line.

A receive data register (read only); data arriving along the serial line is presented here.

A control register (read/write); this register sets the data rate and manages the RTS (request to send) and similar signals.

An interrupt enable register (read/write); this register controls which hardware events will generate an interrupt.

A status register (read only); this register indicates whether read data is available, whether the write buffer is full, and so on.

To receive data, the software must set up the device appropriately, usually to generate an interrupt when data is available or an error condition is detected. The interrupt routine must then copy the data into a buffer and check for error conditions.
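Such a register bank is conventionally described in C as a structure of volatile words. The layout and bit assignments below are invented for illustration and do not correspond to any particular device.

```c
#include <stdint.h>

/* Hypothetical memory-mapped serial controller. 'volatile' stops the
 * compiler caching or reordering accesses to these registers. */
struct uart_regs {
    volatile uint32_t txdata;   /* write only: byte to transmit          */
    volatile uint32_t rxdata;   /* read only: received byte (read-sensitive) */
    volatile uint32_t control;  /* data rate, RTS and similar signals    */
    volatile uint32_t int_en;   /* which events raise an interrupt       */
    volatile uint32_t status;   /* bit 0: rx ready; bit 1: tx full       */
};

#define RX_READY  (1u << 0)
#define TX_FULL   (1u << 1)

/* Polled receive: wait until a byte is available, then read it exactly
 * once -- reading rxdata consumes the value. */
uint8_t uart_getc(struct uart_regs *u)
{
    while ((u->status & RX_READY) == 0)
        ;                          /* spin until data arrives */
    return (uint8_t)u->rxdata;     /* one read, one byte */
}
```

In a real system the structure would be placed over the device's physical address by a cast of a known base address, and the interrupt-driven path would replace the polling loop.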

Memory-mapped issues

Note that a memory-mapped peripheral register behaves differently from memory. Two consecutive reads to the read data register will probably deliver different results even though no write to that location has taken place. Whereas reads to true memory are idempotent (the read can be repeated many times, with identical results) a read to a peripheral may clear the current value and the next value may be different. Such locations are termed read-sensitive.

Programs must be written very carefully where read-sensitive locations are involved, and, in particular, such locations must not be copied into a cache memory.

In many ARM systems I/O locations are made inaccessible to user code, so the only way the devices can be accessed is through supervisor calls (SWIs) or through C library functions written to use those calls.

Direct Memory Access

Where I/O functions have a high data bandwidth, a considerable share of the processor's performance may be consumed handling interrupts from the I/O system. Many systems employ DMA hardware to handle the lowest-level I/O data transfers without


processor assistance. Typically, the DMA hardware will handle the transfer of blocks of data from the peripheral into a buffer area in memory, interrupting the processor only if an error occurs or when the buffer becomes full. Thus the processor sees an interrupt once per buffer rather than once per byte.

Note, however, that the DMA data traffic will occupy some of the memory bus bandwidth, so the processor performance will still be reduced by the I/O activity (though far less than it would be if it were handling the data traffic on interrupts).
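The saving is easy to quantify. For a hypothetical 1 Mbyte/s input stream (the figure is illustrative, not from the text), per-byte interrupts and per-buffer interrupts differ by several orders of magnitude:

```c
/* Interrupts per second for a given data rate, as a function of how
 * many bytes are transferred per interrupt. */
unsigned long irqs_per_second(unsigned long bytes_per_second,
                              unsigned long bytes_per_irq)
{
    return bytes_per_second / bytes_per_irq;
}
```

At 1 Mbyte/s, interrupting per byte gives a million interrupts a second, while a 4 Kbyte DMA buffer reduces this to a few hundred, leaving the processor free for useful work between buffers.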

Fast Interrupt Request

The ARM fast interrupt (FIQ) architecture includes more banked registers than the other exception modes (see Figure 2.1 on page 39) in order to minimize the register save and restore overhead associated with handling one of these interrupts. The number of registers was chosen to be the number required to implement a software emulation of a DMA channel.

If an ARM system with no DMA support has one source of I/O data traffic that has a significantly higher bandwidth requirement than the others, it is worth considering allocating the FIQ interrupt to this source and using IRQ to support all the other sources. It is far less effective to use FIQ for several different data sources at the same time, though switching it on a coarse granularity between sources may be appropriate.

Interrupt latency

An important parameter of a processor is its interrupt latency. This is a measure of how long it takes to respond to an interrupt in the worst case. For the ARM6 the worst-case FIQ latency is determined by the following components:

1. The time for the request signal to pass through the FIQ synchronizing latches; this is three clock cycles (worst case).

2. The time for the longest instruction (which is a load multiple of 16 registers) to complete; this is 20 clock cycles.

3. The time for the data abort entry sequence; this is three clock cycles. (Remember that data abort has a higher priority than FIQ but does not mask FIQs out; see 'Exception priorities' on page 111.)

4. The time for the FIQ entry sequence; this is two clock cycles.

The total worst-case latency is therefore 28 clock cycles. After this time the ARM6 is executing the instruction at 0x1C, the FIQ entry point. These cycles may be sequential or non-sequential, and memory accesses may be further delayed if they address slow memory devices. The best-case latency is four clock cycles.

The IRQ latency calculation is similar but must include an arbitrary delay for the longest FIQ routine to complete (since FIQ is higher priority than IRQ).

Cache-I/O interactions

The usual assumption is that a cache makes a processor go faster. Normally this is true, if the performance is averaged over a reasonable period. But in many cases interrupts are used where worst-case real-time response is critical; in these cases a cache can make the performance significantly worse. An MMU can make things worse still!

Here is the worst-case interrupt latency sum for the ARM710, which we will meet in the next chapter. The latency is the sum of:

1. The time for the request signal to pass through the FIQ synchronizing latches; this is three clock cycles (worst case) as before.

2. The time for the longest instruction (which is a load multiple of 16 registers) to complete; this is 20 clock cycles as before, but...

...this could cause the write buffer to flush, which can take up to 12 cycles, then incur three MMU TLB misses, adding 18 clock cycles, and six cache misses, adding a further 24 cycles. The original 20 cycles overlap the line fetches, but the total cost for this stage can still reach 66 cycles.

3. The time for the data abort entry sequence; this is three clock cycles as before, but...

...the fetches from the vector space could add an MMU miss and a cache miss, increasing the cost to 12 cycles.

4. The time for the FIQ entry sequence; this is two clock cycles as before, but...

...it could incur another cache miss, costing six cycles.

The total is now 87 clock cycles, many of which are non-sequential memory accesses. So note that automatic mechanisms which support a memory hierarchy to speed up general-purpose programs on average often have the opposite effect on worst-case calculations for critical code segments.
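The two worst-case sums can be checked directly; the cycle counts are those quoted above.

```c
/* Worst-case FIQ latency components as given in the text. */
int arm6_worst_case(void)
{
    return 3      /* FIQ synchronizing latches                    */
         + 20     /* longest instruction: LDM of 16 registers     */
         + 3      /* data abort entry sequence                    */
         + 2;     /* FIQ entry sequence                           */
}

int arm710_worst_case(void)
{
    return 3      /* synchronizing latches, as before             */
         + 66     /* LDM + write buffer flush + TLB/cache misses  */
         + 12     /* abort entry + vector-space MMU/cache misses  */
         + 6;     /* FIQ entry + one more cache miss              */
}
```

The cache and MMU thus triple the worst-case figure, from 28 to 87 cycles, even though they improve the average case.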

Reducing latency

How can the latency be reduced when real-time constraints simply must be met?

A fixed area of fast on-chip RAM (for example, containing the vector space at the bottom of memory) will speed up exception entry.

Sections of the TLB and cache can be locked down to ensure that critical code segments never incur the miss penalty.

Note that even in general-purpose systems where the cache and MMU are generally beneficial there are often critical real-time constraints, for example for servicing disk data traffic or for managing the local area network. This is especially true in low-cost systems with little DMA hardware support.

Other cache issues

There are other things to watch out for when a cache is present, for example:

Caching assumes that an address will return the same data value each time it is read until a new value is written. I/O devices do not behave like this; each time you read them they give the next piece of data.


A cache fetches a block (which is typically around four words) of data at a time from sequential addresses. I/O devices often have different register functions at consecutive addresses; reading them all can give unpredictable results.

Therefore the I/O area of memory is normally marked as uncacheable, and accesses bypass the cache. In general, caches interact badly with any read-sensitive devices. Display frame buffers also need careful consideration and are often made uncacheable.

Operating system issues

Normally, all the low-level detail of the I/O device registers and the handling of interrupts is the responsibility of the operating system. A typical process will send data to the serial port by loading the next byte into r0 and then making the appropriate supervisor call; the operating system will call a subroutine called a device driver to check for the transmit buffer being empty, that the line is active, that no transmission errors occur, and so on. There may even be a call which allows the process to pass a pointer to the operating system, which will then output a complete buffer of values.

Since it takes some time to send a buffer full of data down a serial line, the operating system may return control to the process until the transmit buffer has space for more data. An interrupt from the serial line hardware device returns control to the operating system, which refills the transmit buffer before returning control to the interrupted process. Further interrupts result in further transfers until the whole buffer has been sent.

It may be the case that the process which requested the serial line activity runs out of useful work, or an interrupt from a timer or another source causes a different process to become active. The operating system must be careful, when modifying the translation tables, to ensure that it does not make the data buffer inaccessible to itself. It must also treat any requests from the second process to output data down the serial line with caution; they must not interfere with the ongoing transfer from the first process. Resource allocation is used to ensure that there are no conflicts in the use of shared resources.

A process may request an output function and then go inactive until the output has completed, or it may go inactive until a particular input arrives. It can lodge a request with the operating system to be reactivated when the input/output event occurs.
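The transmit path just described can be sketched as a ring buffer shared between the supervisor call and the interrupt handler; the names and the buffer size are illustrative.

```c
#include <stdint.h>

/* Transmit ring buffer: the SWI handler queues bytes on behalf of the
 * user process; the serial transmit interrupt drains them. */
#define TXBUF_SIZE 256

uint8_t  txbuf[TXBUF_SIZE];
unsigned tx_head, tx_tail;          /* head: next to send; tail: next free */

/* Called from the supervisor call handler. */
int tx_enqueue(uint8_t byte)
{
    unsigned next = (tx_tail + 1) % TXBUF_SIZE;
    if (next == tx_head)
        return -1;                  /* buffer full: caller must wait */
    txbuf[tx_tail] = byte;
    tx_tail = next;
    return 0;
}

/* Called from the transmit interrupt; returns the next byte to hand to
 * the hardware, or -1 when the buffer has drained. */
int tx_dequeue(void)
{
    if (tx_head == tx_tail)
        return -1;                  /* nothing left: disable tx interrupt */
    uint8_t byte = txbuf[tx_head];
    tx_head = (tx_head + 1) % TXBUF_SIZE;
    return byte;
}
```

Since the interrupt handler only moves the head and the supervisor call only moves the tail, the two sides do not corrupt each other's index, which is why this structure is the standard choice for interrupt-driven drivers.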


11.10 Example and exercises

Example 11.1

Why, on the ARM, can user-level code not disable interrupts?

To allow a user to disable interrupts would make building a protected operating system impossible. The following code illustrates how a malicious user could destroy all the currently active programs:

 

 

 

        MSR     CPSR_c, #&C0    ; disable IRQ and FIQ
HERE    B       HERE            ; loop forever

Once interrupts are disabled there is no way for the operating system to regain control, so the program will loop forever. The only way out is a hard reset, which will destroy all currently active programs.

If the user cannot disable interrupts the operating system can establish a regular periodic interrupt from a timer, so the infinite loop will be interrupted and the operating system can schedule other programs. This program will either time-out, if the operating system has an upper limit on the amount of CPU time it is allowed to consume, or it will continue to loop whenever it gets switched in, running up a large bill on a system with accounting.

Exercise 11.1.1

What minimum level of protection must be applied to the bottom of memory (where the exception vectors are located) in a secure operating system?

Exercise 11.1.2

If the ARM had no SWAP instruction, devise a hardware peripheral that could be used to support synchronization. (Hint: standard memory will not work; the location must be read-sensitive.)

ARM CPU Cores

Summary of chapter contents

Although some ARM applications use a simple integer processor core as the basic processing component, others require tightly coupled functions such as cache memory and memory management hardware. ARM Limited offers a range of such 'CPU' configurations based around its integer cores.

The ARM CPU cores described here include the ARM710T, 720T and 740T, the ARM810 (now superseded by the ARM9 series), the StrongARM, the ARM920T and 940T, and the ARM1020E. These CPUs encompass a range of pipeline and cache organizations and form a useful illustration of the issues which arise when designing high-performance processors for low-power applications.

The primary role of a cache memory is to satisfy the processor core's instruction and data bandwidth requirements, so the cache organization is tightly coupled to the particular processor core that it is to serve. In the context of system-on-chip designs the goal is for the cache to reduce the external memory bandwidth requirements of the CPU core to a level that can be handled by an on-chip bus. The higher-performance ARM processor cores would run little faster than the ARM7TDMI if they were connected directly to an AMBA bus, so they will always be used with fast local memory or cache.

Memory management is another complex system function that must be tightly coupled to the processor core, whether it is a full translation-based system or a simpler protection unit. The ARM CPU cores integrate the processor core, cache(s), MMU(s) and (usually) an AMBA interface in a single macrocell.


12.1 The ARM710T, ARM720T and ARM740T

The ARM710T, ARM720T and ARM740T are based upon the ARM7TDMI processor core (see Section 9.1 on page 248), to which an 8 Kbyte mixed instruction and data cache has been added. External memory and peripherals are accessed via an AMBA bus master unit, and a write buffer and a memory management (ARM710T and 720T) or memory protection (ARM740T) unit are also incorporated.

The organization of the ARM710T and ARM720T CPUs is similar and is illustrated in Figure 12.1.

ARM710T cache

Since the ARM7TDMI processor core has a single memory port, it is logical for it to be paired with a unified instruction and data cache. The ARM710T incorporates such a cache, with a capacity of 8 Kbytes. The cache is organized with 16-byte lines and is 4-way set associative. A random replacement algorithm selects which of the

Figure 12.1 ARM710T and ARM720T organization.
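As a cross-check on this organization, the number of sets in such a cache follows directly from the capacity, line length and associativity:

```c
/* Number of sets in a set-associative cache:
 * capacity divided by (line length x associativity). */
unsigned cache_sets(unsigned capacity_bytes, unsigned line_bytes,
                    unsigned ways)
{
    return capacity_bytes / (line_bytes * ways);
}
```

For the ARM710T this gives 8192 / (16 x 4) = 128 sets, so four address bits select the byte within a line and seven bits select the set, with the remaining address bits forming the tag.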