Furber S. ARM System-on-Chip Architecture. 2000.
11.7 Synchronization
A standard problem in a system which runs multiple processes that share data structures is to control accesses to the shared data to ensure correct behaviour. For example, consider a system where a set of sensor values is sampled and stored in memory by one process and used at arbitrary times by another. If it is important that the second process always sees a single snapshot of the values, care must be taken to ensure that the first process does not get swapped out and the second swapped in when the values are only partially updated. The mechanisms used to achieve this are called process synchronization. What is required is mutually exclusive access to the data structure.
Mutual exclusion
A process which is about to perform an operation on a shared data structure, where the operation requires that no other process is accessing the structure, must wait until no other process is accessing the data and then set some sort of lock to prevent another process from accessing it until it has finished the operation.

One way to achieve mutual exclusion is to use a particular memory location to control access to a data structure. For example, the location could contain a Boolean value indicating whether or not the data structure is currently in use. A process which wishes to use the data structure must wait until it is free, then mark it as busy while it uses the data, then mark it as free again when it has finished using it. The problem is that an interrupt can arise between the structure becoming free and it being marked as busy. The interrupt causes a process switch, the new process sees the structure is free, marks it as busy, changes it a bit and then another interrupt returns control to the first process, which is in a state where it believes, now incorrectly, that the structure is free.
A standard solution to this problem is to disable interrupts while the Boolean is tested and set. This works, but on a processor with a protected supervisor mode (such as the ARM) user-level code cannot disable interrupts, so a system call is required, which takes several clock cycles to complete and return control to the user process.
SWAP |
A more efficient solution is to use an atomic (that is, uninterruptable) 'test and set' instruction. The ARM 'SWAP' instruction (see Section 5.13 on page 132) is just such an instruction, included in the instruction set for exactly this purpose. A register is set to the 'busy' value, then this register is swapped with the memory location containing the Boolean. If the loaded value is 'free' the process can continue; if it is 'busy' the process must wait, often by spinning (repeating the test until it gets the 'free' result) on the lock.

Note that this is the only reason for including SWAP in the ARM instruction set. It does not contribute to the processor's performance and its dynamic frequency of use is negligible. It is there just to provide this functionality.
Context switching |
Without these instructions, an operating system would have to switch into user mode to save or restore the banked user registers and then get back through the protection barrier into supervisor mode. Though possible, this solution is inefficient.
Floating-point state
The floating-point registers, whether held in a hardware coprocessor or maintained in memory by a software emulator, represent part of the state of any process that uses them. Rather than add to the context switching overhead by saving and restoring them on every process swap, the operating system simply disables user-level use of the floating-point system when a process that uses floating-point is swapped out. If the new process attempts to use the floating-point system, the first use will trap. At that point the operating system will save the old process state and restore the new, then it will re-enable the floating-point system and the new process can use it freely.
Thus the floating-point context switch overhead is incurred only when strictly necessary.
Translation state
Where the old and new processes have independent translation tables a heavy-weight process switch is required. The complete translation table structure can be switched simply by changing the base address of the first-level page table in CP15 register 2, but since this will invalidate existing TLB and (virtually addressed) cache entries, these must be flushed. The TLB and an instruction or write-through data cache can be flushed simply by marking all entries as invalid, which on an ARM processor chip requires a single CP15 instruction for each TLB or cache, but a copy-back cache must be purged of all dirty lines, which may take many instructions.
(Note that a physically addressed cache avoids this problem, but to date all ARM CPUs have used virtually addressed caches.)
Where the old and new processes share the same translation tables a light-weight process switch is required. The 'domain' mechanism in the ARM MMU architecture allows the protection state of 16 different subsets of the virtual address space to be reconfigured with a single update of CP15 register 3.
In order to ensure that the cache does not represent a leak in the protection system, a cache access must be accompanied by a permission check. This could be achieved by storing the domain and access permission information along with the data in each cache line, but current ARM processors check permissions using information in the MMU concurrently with the cache access.
Input/Output
• A cache fetches a block (which is typically around four words) of data at a time from sequential addresses. I/O devices often have different register functions at consecutive addresses; reading them all can give unpredictable results.
Therefore the I/O area of memory is normally marked as uncacheable, and accesses bypass the cache. In general, caches interact badly with any read-sensitive devices. Display frame buffers also need careful consideration and are often made uncacheable.
Operating system issues
Normally, all the low-level detail of the I/O device registers and the handling of interrupts is the responsibility of the operating system. A typical process will send data to the serial port by loading the next byte into r0 and then making the appropriate supervisor call; the operating system will call a subroutine called a device driver to check for the transmit buffer being empty, that the line is active, that no transmission errors occur, and so on. There may even be a call which allows the process to pass a pointer to the operating system, which will then output a complete buffer of values.
Since it takes some time to send a buffer full of data down a serial line, the operating system may return control to the process until the transmit buffer has space for more data. An interrupt from the serial line hardware device returns control to the operating system, which refills the transmit buffer before returning control to the interrupted process. Further interrupts result in further transfers until the whole buffer has been sent.
It may be the case that the process which requested the serial line activity runs out of useful work, or an interrupt from a timer or another source causes a different process to become active. The operating system must be careful, when modifying the translation tables, to ensure that it does not make the data buffer inaccessible to itself. It must also treat any requests from the second process to output data down the serial line with caution; they must not interfere with the ongoing transfer from the first process. Resource allocation is used to ensure that there are no conflicts in the use of shared resources.

A process may request an output function and then go inactive until the output has completed, or it may go inactive until a particular input arrives. It can lodge a request with the operating system to be reactivated when the input/output event occurs.
Architectural Support for Operating Systems |
11.10 Example and exercises
Example 11.1
Why, on the ARM, can user-level code not disable interrupts?

To allow a user to disable interrupts would make building a protected operating system impossible. The following code illustrates how a malicious user could destroy all the currently active programs:

        MSR     CPSR_c, #&C0    ; disable IRQ and FIQ
HERE    B       HERE            ; loop forever
Once interrupts are disabled there is no way for the operating system to regain control, so the program will loop forever. The only way out is a hard reset, which will destroy all currently active programs.
If the user cannot disable interrupts the operating system can establish a regular periodic interrupt from a timer, so the infinite loop will be interrupted and the operating system can schedule other programs. This program will either time-out, if the operating system has an upper limit on the amount of CPU time it is allowed to consume, or it will continue to loop whenever it gets switched in, running up a large bill on a system with accounting.
Exercise 11.1.1
What minimum level of protection must be applied to the bottom of memory (where the exception vectors are located) in a secure operating system?
Exercise 11.1.2
If the ARM had no SWAP instruction, devise a hardware peripheral that could be used to support synchronization. (Hint: standard memory will not work; the location must be read-sensitive.)
ARM CPU Cores
Summary of chapter contents
Although some ARM applications use a simple integer processor core as the basic processing component, others require tightly coupled functions such as cache memory and memory management hardware. ARM Limited offers a range of such 'CPU' configurations based around its integer cores.
The ARM CPU cores described here include the ARM710T, 720T and 740T, the ARM810 (now superseded by the ARM9 series), the StrongARM, the ARM920T and 940T, and the ARM1020E. These CPUs encompass a range of pipeline and cache organizations and form a useful illustration of the issues which arise when designing high-performance processors for low-power applications.
The primary role of a cache memory is to satisfy the processor core's instruction and data bandwidth requirements, so the cache organization is tightly coupled to the particular processor core that it is to serve. In the context of system-on-chip designs the goal is for the cache to reduce the external memory bandwidth requirements of the CPU core to a level that can be handled by an on-chip bus. The higher-performance ARM processor cores would run little faster than the ARM7TDMI if they were connected directly to an AMBA bus, so they will always be used with fast local memory or cache.
Memory management is another complex system function that must be tightly coupled to the processor core, whether it is a full translation-based system or a simpler protection unit. The ARM CPU cores integrate the processor core, cache(s), MMU(s) and (usually) an AMBA interface in a single macrocell.
12.1 The ARM710T, ARM720T and ARM740T
The ARM710T, ARM720T and ARM740T are based upon the ARM7TDMI processor core (see Section 9.1 on page 248), to which an 8 Kbyte mixed instruction and data cache has been added. External memory and peripherals are accessed via an AMBA bus master unit, and a write buffer and a memory management (ARM710T and 720T) or memory protection (ARM740T) unit are also incorporated.
The organization of the ARM710T and ARM720T CPUs is similar and is illustrated in Figure 12.1.
ARM710T cache
Since the ARM7TDMI processor core has a single memory port it is logical for it to be paired with a unified instruction and data cache. The ARM710T incorporates such a cache, with a capacity of 8 Kbytes. The cache is organized with 16-byte lines and is 4-way set associative. A random replacement algorithm selects which of the
Figure 12.1 ARM710T and ARM720T organization.