Добавил:
Опубликованный материал нарушает ваши авторские права? Сообщите нам.
Вуз: Предмет: Файл:

(ARM).ARM6 in DSP applications.Use of the MUL instruction

.pdf
Скачиваний:
30
Добавлен:
23.08.2013
Размер:
43.98 Кб
Скачать

Application Note 19

ARM6 in DSP Applications :

Use of the MUL Instruction

Document Number: ARM DAI 0019D

Issued: December 1994

Copyright Advanced RISC Machines Ltd (ARM) 1994

All rights reserved

ARM

Advanced RISC Machines

Proprietary Notice

ARM, the ARM Powered logo, BlackICE and ICEbreaker are trademarks of Advanced RISC Machines Ltd.

Neither the whole nor any part of the information contained in, or the product described in, this application note may be adapted or reproduced in any material form except with the prior written permission of the copyright holder.

The product described in this application note is subject to continuous developments and improvements. All particulars of the product and its use contained in this application note are given by ARM in good faith. However, all warranties implied or expressed, including but not limited to implied warranties or merchantability, or fitness for purpose, are excluded.

This application note is intended only to assist the reader in the use of the product. ARM Ltd shall not be liable for any loss or damage arising from the use of any information in this application note, or any error or omission in such information, or any incorrect use of the product.

Change Log

Issue

Date

By

Change

A

Jan. 93

IR

Doc creation in Frame

B

Jul. 93

IR

Formally signed off

C

Nov. 94

AW

Change of ARM Ltd address

D

Dec. 94

AW

Reformatted into new style

ii

Application Note 19

ARM DAI 0019D

ARM6 in DSP Applications : Use of the MUL Instruction

1 Introduction

The performance of the MUL instruction is important for many DSP applications where multiply is used extensively (eg. digital filters, correlation, Fast Fourier Transforms etc.)

The ARM6 family of RISC processors is based on a core architecture which has features that can be used for DSP applications. One of these architectural features is a hardware Booth’s Multiplier which performs multiply operations on data. The design of the multiplier is based on an algorithm to reduce the number of stages needed to perform a multiplication.

This document explains how to predict the timing of a MUL instruction, and suggests some ideas to improve the performance of speed-critical programs, when using the ARM6 family of processors from Advanced RISC Machines (ARM).

2 Overview

The multiply instructions have the following syntax:

MUL

Rd,

Rm,

Rs

MLA

Rd,

Rm,

Rs, Ra

 

 

 

^^

The MUL calculates the least-significant 32 bits of the 64-bit product of Rm and Rs, and places the result in Rd (the destination register). MLA specifies an additional register which is added to the product (with zero overhead) before the result is stored in Rd.

The speed of the multiply depends on the value in Rs. It is important to place the smallest value in Rs so that the multiply takes the least number of cycles. See Appendix A for a detailed explanation of Booth’s multiplication.

The following table shows how the speed of the multiply depends on the value in Rs:

Min

Max

Speed

 

 

 

0

1

1S + 1I

 

 

 

2

7

1S + 2I

 

 

 

8

31

1S + 3I

 

 

 

32

127

1S + 4I

 

 

 

128

511

1S + 5I

 

 

 

512

2047

1S + 6I

 

 

 

2048

8191

1S + 7I

 

 

 

8192

32767

1S + 8I

 

 

 

Table 2-1: Multiply Speed Variations with Differing values of Rs

 

 

Application Note 19

1

 

 

ARM DAI 0019D

 

 

 

 

 

ARM6 in DSP Applications : Use of the MUL Instruction

Min

Max

Speed

 

 

 

 

32768

131071

1S

+ 9I

 

 

 

 

131072

524287

1S

+ 10I

 

 

 

 

524288

2097151

1S

+ 11I

 

 

 

 

2097152

8388607

1S

+ 12I

 

 

 

 

8388608

33554431

1S

+ 13I

 

 

 

 

33554432

134217727

1S

+ 14I

 

 

 

 

134217728

536870911

1S

+ 15I

 

 

 

 

536870912

2147483648

1S

+ 16I

 

 

 

 

Table 2-1: Multiply Speed Variations with Differing values of Rs

The general formula is that multiplication by values between 2(2n-3) and 2(2n-1)-1 inclusive takes 1S + nI-cycles (n>1, multiplication by 0 or 1 takes 1S + 1I-cycle).

The cycle timings used throughout this document refer to the cycle types as seen by the ARM6 core. It is important to convert these numbers into actual times within your system in order to correctly estimate the performance. For an ARM6 or ARM60, this document specifies the precise cycle types visible in your system. The cached processors (ARM600 and ARM610) contain the ARM6 macrocell linked to a cache, so both S- and I-cycles may run at the fast core clock (FCLK) rate since they are internal to the processor. In this case, both S-cycles and I-cycles are F-cycles, so simply add together the number of S- and I-cycles to calculate the number of processor cycles.

This document explains several methods which may be used to improve the speed of multiplication:

Place smallest operand in Rs

Ensure that Rs is positive so that early termination occurs

If one operand is a constant, use the multiply-by-constant technique

3 Overflow issues

It is common to use the current multiply instruction in, for instance a 16-bit x 16-bit -> 32-bit mode to avoid overflow. For DSP work, a 32-bit result is all that is required, but it is important to ensure that overflow does not cause an incorrect answer.

This situation can be generalised, as the number of result bits is just the sum of the operand bits. Thus, the MUL can perform 16-bit x 16-bit -> 32-bit, 8-bit x 24-bit -> 32-bit etc. all without overflow. For MLA, the total is accumulated (overflow of the total must be avoided). Hence, MLA would be used as, for example 12x12->24, leaving 8 bits to accumulate up to 256 values without the possibility of overflow.

2

Application Note 19

 

 

 

ARM DAI 0019D

 

 

 

 

 

 

 

 

 

ARM6 in DSP Applications : Use of the MUL Instruction

4 Use the smallest operand as Rs

In practice it is possible to arrange for a worst case which is at most 1S + 9I-cycles for unsigned operands (by putting the operand with the least number of bits into Rs, so that Rs <= 16bits). In most cases it is known that one of the operands is at most 16bits, as the sum of the number of operand bit must be no greater than 32.

5 Negative operands

The multiplier yields correct results for negative operand values, since it calculates the low 32-bits of the full 64-bit product. For positive values of Rs, a 16x16->32 MUL takes at most 1S + 9I-cycles (the average should be better than this). But, the MUL always takes 1S + 16I-cycles if Rs is negative. Early termination cannot take place because the top bit of Rs is a 1, so the Booth's multiplier register never contains all 0s (the maximum-of-16 limit is reached instead). A negative value for Rs is considered to be a very large positive number from the point of view of the Booth’s multiplier.

The example below shows how to guard the multiply instruction, but this does not really improve things unless Rs is very small so that the gain exceeds the (3- instruction) overhead.

;

 

Cycles

 

CMP

Rs, #0

; 1S

 

RSBMI

Rs, Rs, #0

; 1S

make Rs positive

MUL

Rd, Rm, Rs

; 1S+nI

 

RSBMI

Rd, Rd, #0

; 1S

make result negative if Rs

 

 

;

was negative

It is sometimes possible to do away with the CMP by incorporating this into another instruction.

In the special case when squaring, the result does not need to be negated after the multiplication as it will always be positive (thus the second RSB instruction can be eliminated).

For example, consider this critical bit of code which uses MUL to square the difference between pairs of signed 5-bit input values. This demonstrates the importance of ensuring that Rs is positive, as the worst-case performance is improved to just 1S + 3I-cycles for the MUL.

; on entry,

r8 & r9 contain packed binary data

;

 

 

Cycles

 

AND

r1, r8, #31 ;

1S

extract 5-bit field of interest

AND

r2, r9, #31 ;

1S

extract 5-bit field of interest

SUBS

r1, r1, r2

;

1S

 

 

RSBMI

r1, r1, #0

;

1S

 

 

MUL

r0, r1, r1

;

1S+nI

n<=3

; on exit, r0 contains result

As you can see, it has been arranged to cost only 1 S-cycle (for the RSBMI instruction) to ensure the multiply is fast.

 

 

Application Note 19

3

 

 

ARM DAI 0019D

 

 

 

 

 

ARM6 in DSP Applications : Use of the MUL Instruction

The issue of negating the value in Rs is more complicated if MLA is used, as it is not possible to negate the product before the accumulate. There are two possible solutions to this:

1 Negating the total, then negating it again after the MLA

;

 

 

Cycles

CMP

Rs, #0

 

; 1S

RSBMI

Rs, Rs,

#0

; 1S

MLA

Ra, Rm,

Rs, Ra

; 1S+nI

RSBMI

Ra, Ra,

#0

; 1S

; side-effect: Rs is

now positive

 

2 Negating the other operand (Rm)

;

 

Cycles

 

CMP

Rs, #0

; 1S

 

RSBMI

Rs, Rs, #0

; 1S

 

RSBMI

Rm, Rm, #0

; 1S

negate Rm to

 

 

; make product correct

MLA

Ra, Rm, Rs, Ra

; 1S+nI

 

;side-effect: Rs is now positive

;side-effect: Rm may have been negated

Method 2 uses fewer instructions which is preferable (although it has more sideeffects). It is possible to incorporate the CMP #0 into a previous instruction as with the MUL example above.

It is questionable whether these techniques are useful in general for MUL or MLA, as there must be a performance increase on average to make it worth the effort. However, these methods can help for some specific applications, for example, squaring small signed values where the overhead can be reduced and the saving is greatest.

6 Multiplication by constant

This final section shows you:

how to construct a sequence of ARM instructions to multiply by a constant

how to create ARM assembly output from the C-compiler (using 'armcc -S')

In many cases where a multiply is used, one of the values is a constant (eg. weeks*7). In some applications (eg. 8-point DCT algorithm by Loeffler, Ligtenberg & Moschytz, ICASSP '89 p988 used for JPEG coding) multiplication-by-constant is used extensively.

A programmer might assume that the only way to calculate weeks*7 would be to use the MUL instruction. However, it is possible to express the multiplication as a sequence of arithmetic instructions, which in many circumstances can be faster.

When multiplying by a constant value, it is possible to replace the general multiply with a fixed sequence of adds and subtracts which have the same effect. For instance, multiply by 5 could be achieved using a single instruction:

; this instruction takes 1S-cycle

4

Application Note 19

 

 

 

ARM DAI 0019D

 

 

 

 

 

 

 

 

 

ARM6 in DSP Applications : Use of the MUL Instruction

ADD

Rd, Rm, Rm, LSL #2

; Rd = Rm + (Rm * 4) = Rm * 5

Compare this with the MUL version below:

 

; this sequence takes a total of

1S+2I-cycles

MOV

Rs,

#5

 

MUL

Rd,

Rm, Rs

 

The cost of the general multiply includes the instructions needed to load the constant into a register (up to 4 may be needed, or an LDR from a literal pool) as well as the multiply itself. In some cases, such as multiply-by-5 above, the multiply-by-constant uses fewer instructions as well!

7 The problem of finding the optimum sequence

The difficulty in using a sequence of arithmetic instructions is that the constant must be decomposed into a set of operations which can be done by one instruction each.

Consider multiply by 105. This could be achieved by decomposing as shown below:

105 == 128 - 13

==128 - 16 + 3

==128 - 16 + (2 + 1)

;this sequence takes 3S-cycles

ADD

Rd, Rm,

Rm, LSL

#1 ;

Rd =

Rm +

Rm *2

 

SUB

Rd,

Rd,

Rm,

LSL

#4

;

Rd

=

Rm*3

- Rm*16

ADD

Rd,

Rd,

Rm,

LSL

#7

;

Rd

=

Rm*(-13) +

Rm*128

Or, decomposing differently:

105 == 15 * 7

==(16 - 1) * (8 - 1)

;this sequence takes 2S-cycles

;uses temporary register Rt to store intermediate value

RSB

Rt,

Rm,

Rm,

LSL

#4

;

Rt

=

-Rm + Rm*16 = Rm*15

RSB

Rd,

Rt,

Rt,

LSL

#3

;

Rd

=

Rt*7 = Rm*105

The second method is the optimal solution which is not difficult to find by hand for small values such as 105. However, the problem of finding the optimum becomes much more difficult for larger constant values. There are no known algorithms which solve this problem quickly.

Temporary registers can be used to store intermediate results to help achieve the shortest sequence. For a very large constant, more than one temporary register may be needed to find the shortest sequence.

Looking carefully at the 2-instruction optimal code sequence, it is can be seen that Rt could be the same register as Rd. Rd is usually the first-choice for a temporary register since it will be used to store the final result in the last instruction of the sequence.

 

 

Application Note 19

5

 

 

ARM DAI 0019D

 

 

 

 

 

ARM6 in DSP Applications : Use of the MUL Instruction

8 Experimenting with ARMCC assembly output

When writing a speed-critical ARM assembler program, it is a good idea to code it in C first (to check the algorithm) before converting it to handtuned assembler. It is helpful to see the ARM code which the compiler generates as a starting point for your work. armcc -S will generate an assembly file instead of an object file. For example, consider the following simple C code:

int mulby105( int num )

{

return num * 105;

}

Compile this using:

armcc -S mulby105.c

Now, examine the file "mulby105.s" which has been created:

; generated by Norcroft ARM C vsn 4.41 [Aug 26 1992] AREA |C$$code|, CODE, READONLY

|x$codeseg|

EXPORT

mulby105

mulby105

 

RSB

a1,a1,a1,LSL #4

RSB

a1,a1,a1,LSL #3

MOV

pc,lr

AREA |C$$data|,DATA |x$dataseg|

END

Notice that the compiler has found the short multiply-by-constant sequence.

The C-compiler restricts the amount of searching it performs in order to minimise the impact on compilation time. The current version of armcc has a cut-off so that it uses a normal MUL if the number of instructions used in the multiply-by-constant sequence exceeds some number N. This is to avoid the sequence becoming too long.

9 Discussion of speed improvement

To evaluate the speed gains from using multiply-by-constant, consider multiplying by 11585 as an example.

A normal multiply consists of:

;first load the constant 11585 (0x2D41 in hex) into Rs

;these 2 instructions take 1S-cycle each

MOV

Rs, #0x2D

<< 8

ORR

Rs,

Rs,

#0x41

; now do the multiply,

which takes 1S+8I-cycles for Rs=11585

MUL

Rd,

Rm,

Rs

; do the multiply

6

Application Note 19

 

 

 

ARM DAI 0019D

 

 

 

 

 

 

 

 

 

ARM6 in DSP Applications : Use of the MUL Instruction

The load-a-constant stage may take up to four 4 instructions (in this case 2) or an LDR, and the multiply takes 1 instruction fetch plus a number of internal cycles to calculate the multiply.

The optimal multiply-by-constant sequence consists of:

11585 == ((Rm*3)*15)*256 + Rm + Rm*64

; this sequence takes 4S-cycles

ADD Rd, Rm, Rm, LSL #1 ; Rd = Rm*3

RSB Rd, Rd, Rd, LSL #4 ; Rd = Rd*15 = Rm*45

ADD Rd, Rm, Rd, LSL #8 ; Rd = Rm + Rd*256 = Rm*11521 ADD Rd, Rd, Rm, LSL #6 ; Rd = Rd + Rm*64 = Rm*11585

This is just 4 data processing instructions.

Method

Cycles

 

 

Normal multiply:

3 instructions + MUL internal cycles

 

 

Multiply-by-constant:

4 instructions

 

 

Table 9-2: Comparison between MUL and multiply-by-constan

With slow memory systems and non-cached processors, I-cycles can be much faster than other cycles because they are internal to the ARM core. This means that the general multiply can sometimes be the fastest option (for large constants where an efficient solution cannot be found). It should also use less memory. If the load-a- constant stage could be moved outside a loop, the balance would tip further in favour of the general multiply as there is only the MUL to execute.

Method

Cycles on ARM60

Cycles on ARM610

 

 

 

Normal multiply:

3S + 8I

11F

 

 

 

Multiply-by-constant:

4S

4F

 

 

 

Table 9-3: MUL Cycle Timing for ARM6 Processors

10 Conclusions

This application note explains how the speed of the multiply depends on the value in Rs. It is important to ensure that the smallest positive number is used for Rs so that MUL is fastest.

If MUL is used to generate a single result of at most 32-bits, it is usually possible to ensure that Rs contains at most 16 bits so that the MUL takes at most 1S + 9I-cycles, providing that the operands are unsigned.

 

 

Application Note 19

7

 

 

ARM DAI 0019D

 

 

 

 

 

ARM6 in DSP Applications : Use of the MUL Instruction

MUL can be used with negative operands; the result is the bottom 32-bits of the 64-bit product as expected. If Rs is negative, the MUL will take the full 1S + 16I-cycles as early termination does not occur. It is possible to ensure that Rs is always positive by putting the negative operand in Rm, or by negating Rs to make it positive.

When multiplying by a constant, a technique can be used to decompose the multiply into a sequence of arithmetic instruction which achieves the same as a MUL. The performance of this method depends on the precise value of the constant (smaller values are faster). The sequence of arithmetic operations is likely to be faster than the equivalent MUL, but it depends on the memory system and variant of ARM processor used.

This document can offer no universal solution to speed up multiplication. It has been shown that there are several techniques which can be applied, but the speed gains are always depend on the precise values involved (which are application-specific). Even then, it is not always clear which method will be faster, as the actual timings depend on whether the processor is cached or uncached, and on the performance of the memory system.

8

Application Note 19

 

 

 

ARM DAI 0019D