Добавил:

Andrey Опубликованный материал нарушает ваши авторские права? Сообщите нам.

Вуз:

Санкт-Петербургский государственный электротехнический университет "ЛЭТИ"

Предмет:

Электротехника

Файл:

(ARM).ARM6 in DSP applications.Use of the MUL instruction

.pdf

Скачиваний:

Добавлен:

23.08.2013

Размер:

43.98 Кб

Скачать

☆

1 / 21 2 > Следующая >>>

Application Note 19

ARM6 in DSP Applications :

Use of the MUL Instruction

Document Number: ARM DAI 0019D

Issued: December 1994

ARM

Advanced RISC Machines

Proprietary Notice

ARM, the ARM Powered logo, BlackICE and ICEbreaker are trademarks of Advanced RISC Machines Ltd.

Neither the whole nor any part of the information contained in, or the product described in, this application note may be adapted or reproduced in any material form except with the prior written permission of the copyright holder.

The product described in this application note is subject to continuous developments and improvements. All particulars of the product and its use contained in this application note are given by ARM in good faith. However, all warranties implied or expressed, including but not limited to implied warranties or merchantability, or fitness for purpose, are excluded.

This application note is intended only to assist the reader in the use of the product. ARM Ltd shall not be liable for any loss or damage arising from the use of any information in this application note, or any error or omission in such information, or any incorrect use of the product.

Change Log

Issue	Date	By	Change
A	Jan. 93	IR	Doc creation in Frame
B	Jul. 93	IR	Formally signed off
C	Nov. 94	AW	Change of ARM Ltd address
D	Dec. 94	AW	Reformatted into new style

Application Note 19

ARM DAI 0019D

ARM6 in DSP Applications : Use of the MUL Instruction

1 Introduction

The performance of the MUL instruction is important for many DSP applications where multiply is used extensively (eg. digital ﬁlters, correlation, Fast Fourier Transforms etc.)

The ARM6 family of RISC processors is based on a core architecture which has features that can be used for DSP applications. One of these architectural features is a hardware Booth’s Multiplier which performs multiply operations on data. The design of the multiplier is based on an algorithm to reduce the number of stages needed to perform a multiplication.

This document explains how to predict the timing of a MUL instruction, and suggests some ideas to improve the performance of speed-critical programs, when using the ARM6 family of processors from Advanced RISC Machines (ARM).

2 Overview

The multiply instructions have the following syntax:

MUL	Rd,	Rm,	Rs
MLA	Rd,	Rm,	Rs, Ra
			^^

The MUL calculates the least-signiﬁcant 32 bits of the 64-bit product of Rm and Rs, and places the result in Rd (the destination register). MLA speciﬁes an additional register which is added to the product (with zero overhead) before the result is stored in Rd.

The speed of the multiply depends on the value in Rs. It is important to place the smallest value in Rs so that the multiply takes the least number of cycles. See Appendix A for a detailed explanation of Booth’s multiplication.

The following table shows how the speed of the multiply depends on the value in Rs:

Min	Max	Speed

0	1	1S + 1I

2	7	1S + 2I

8	31	1S + 3I

32	127	1S + 4I

128	511	1S + 5I

512	2047	1S + 6I

2048	8191	1S + 7I

8192	32767	1S + 8I

Table 2-1: Multiply Speed Variations with Differing values of Rs

		Application Note 19	1
		ARM DAI 0019D

ARM6 in DSP Applications : Use of the MUL Instruction

Min	Max	Speed

32768	131071	1S	+ 9I

131072	524287	1S	+ 10I

524288	2097151	1S	+ 11I

2097152	8388607	1S	+ 12I

8388608	33554431	1S	+ 13I

33554432	134217727	1S	+ 14I

134217728	536870911	1S	+ 15I

536870912	2147483648	1S	+ 16I

Table 2-1: Multiply Speed Variations with Differing values of Rs

The general formula is that multiplication by values between 2(2n-3) and 2(2n-1)-1 inclusive takes 1S + nI-cycles (n>1, multiplication by 0 or 1 takes 1S + 1I-cycle).

The cycle timings used throughout this document refer to the cycle types as seen by the ARM6 core. It is important to convert these numbers into actual times within your system in order to correctly estimate the performance. For an ARM6 or ARM60, this document speciﬁes the precise cycle types visible in your system. The cached processors (ARM600 and ARM610) contain the ARM6 macrocell linked to a cache, so both S- and I-cycles may run at the fast core clock (FCLK) rate since they are internal to the processor. In this case, both S-cycles and I-cycles are F-cycles, so simply add together the number of S- and I-cycles to calculate the number of processor cycles.

This document explains several methods which may be used to improve the speed of multiplication:

•Place smallest operand in Rs

•Ensure that Rs is positive so that early termination occurs

•If one operand is a constant, use the multiply-by-constant technique

3 Overﬂow issues

It is common to use the current multiply instruction in, for instance a 16-bit x 16-bit -> 32-bit mode to avoid overﬂow. For DSP work, a 32-bit result is all that is required, but it is important to ensure that overﬂow does not cause an incorrect answer.

This situation can be generalised, as the number of result bits is just the sum of the operand bits. Thus, the MUL can perform 16-bit x 16-bit -> 32-bit, 8-bit x 24-bit -> 32-bit etc. all without overﬂow. For MLA, the total is accumulated (overﬂow of the total must be avoided). Hence, MLA would be used as, for example 12x12->24, leaving 8 bits to accumulate up to 256 values without the possibility of overﬂow.

2	Application Note 19
	ARM DAI 0019D
	ARM DAI 0019D

ARM6 in DSP Applications : Use of the MUL Instruction

4 Use the smallest operand as Rs

In practice it is possible to arrange for a worst case which is at most 1S + 9I-cycles for unsigned operands (by putting the operand with the least number of bits into Rs, so that Rs <= 16bits). In most cases it is known that one of the operands is at most 16bits, as the sum of the number of operand bit must be no greater than 32.

5 Negative operands

The multiplier yields correct results for negative operand values, since it calculates the low 32-bits of the full 64-bit product. For positive values of Rs, a 16x16->32 MUL takes at most 1S + 9I-cycles (the average should be better than this). But, the MUL always takes 1S + 16I-cycles if Rs is negative. Early termination cannot take place because the top bit of Rs is a 1, so the Booth's multiplier register never contains all 0s (the maximum-of-16 limit is reached instead). A negative value for Rs is considered to be a very large positive number from the point of view of the Booth’s multiplier.

The example below shows how to guard the multiply instruction, but this does not really improve things unless Rs is very small so that the gain exceeds the (3- instruction) overhead.

;		Cycles
CMP	Rs, #0	; 1S
RSBMI	Rs, Rs, #0	; 1S	make Rs positive
MUL	Rd, Rm, Rs	; 1S+nI
RSBMI	Rd, Rd, #0	; 1S	make result negative if Rs
		;	was negative

It is sometimes possible to do away with the CMP by incorporating this into another instruction.

In the special case when squaring, the result does not need to be negated after the multiplication as it will always be positive (thus the second RSB instruction can be eliminated).

For example, consider this critical bit of code which uses MUL to square the difference between pairs of signed 5-bit input values. This demonstrates the importance of ensuring that Rs is positive, as the worst-case performance is improved to just 1S + 3I-cycles for the MUL.

; on entry,	r8 & r9 contain packed binary data
;			Cycles
AND	r1, r8, #31 ;		1S	extract 5-bit field of interest
AND	r2, r9, #31 ;		1S	extract 5-bit field of interest
SUBS	r1, r1, r2	;	1S
RSBMI	r1, r1, #0	;	1S
MUL	r0, r1, r1	;	1S+nI		n<=3

; on exit, r0 contains result

As you can see, it has been arranged to cost only 1 S-cycle (for the RSBMI instruction) to ensure the multiply is fast.

		Application Note 19	3
		ARM DAI 0019D

ARM6 in DSP Applications : Use of the MUL Instruction

The issue of negating the value in Rs is more complicated if MLA is used, as it is not possible to negate the product before the accumulate. There are two possible solutions to this:

1 Negating the total, then negating it again after the MLA

;			Cycles
CMP	Rs, #0		; 1S
RSBMI	Rs, Rs,	#0	; 1S
MLA	Ra, Rm,	Rs, Ra	; 1S+nI
RSBMI	Ra, Ra,	#0	; 1S
; side-effect: Rs is		now positive

2 Negating the other operand (Rm)

;		Cycles
CMP	Rs, #0	; 1S
RSBMI	Rs, Rs, #0	; 1S
RSBMI	Rm, Rm, #0	; 1S	negate Rm to
		; make product correct
MLA	Ra, Rm, Rs, Ra	; 1S+nI

;side-effect: Rs is now positive

;side-effect: Rm may have been negated

Method 2 uses fewer instructions which is preferable (although it has more sideeffects). It is possible to incorporate the CMP #0 into a previous instruction as with the MUL example above.

It is questionable whether these techniques are useful in general for MUL or MLA, as there must be a performance increase on average to make it worth the effort. However, these methods can help for some speciﬁc applications, for example, squaring small signed values where the overhead can be reduced and the saving is greatest.

6 Multiplication by constant

This ﬁnal section shows you:

•how to construct a sequence of ARM instructions to multiply by a constant

•how to create ARM assembly output from the C-compiler (using 'armcc -S')

In many cases where a multiply is used, one of the values is a constant (eg. weeks*7). In some applications (eg. 8-point DCT algorithm by Loefﬂer, Ligtenberg & Moschytz, ICASSP '89 p988 used for JPEG coding) multiplication-by-constant is used extensively.

A programmer might assume that the only way to calculate weeks*7 would be to use the MUL instruction. However, it is possible to express the multiplication as a sequence of arithmetic instructions, which in many circumstances can be faster.

When multiplying by a constant value, it is possible to replace the general multiply with a ﬁxed sequence of adds and subtracts which have the same effect. For instance, multiply by 5 could be achieved using a single instruction:

; this instruction takes 1S-cycle

4	Application Note 19
	ARM DAI 0019D
	ARM DAI 0019D

ARM6 in DSP Applications : Use of the MUL Instruction

ADD	Rd, Rm, Rm, LSL #2		; Rd = Rm + (Rm * 4) = Rm * 5
Compare this with the MUL version below:
; this sequence takes a total of			1S+2I-cycles
MOV	Rs,	#5
MUL	Rd,	Rm, Rs

The cost of the general multiply includes the instructions needed to load the constant into a register (up to 4 may be needed, or an LDR from a literal pool) as well as the multiply itself. In some cases, such as multiply-by-5 above, the multiply-by-constant uses fewer instructions as well!

7 The problem of ﬁnding the optimum sequence

The difﬁculty in using a sequence of arithmetic instructions is that the constant must be decomposed into a set of operations which can be done by one instruction each.

Consider multiply by 105. This could be achieved by decomposing as shown below:

105 == 128 - 13

==128 - 16 + 3

==128 - 16 + (2 + 1)

;this sequence takes 3S-cycles

ADD	Rd, Rm,		Rm, LSL		#1 ;		Rd =		Rm +	Rm *2
SUB	Rd,	Rd,	Rm,	LSL	#4	;	Rd	=	Rm*3	- Rm*16
ADD	Rd,	Rd,	Rm,	LSL	#7	;	Rd	=	Rm*(-13) +		Rm*128

Or, decomposing differently:

105 == 15 * 7

==(16 - 1) * (8 - 1)

;this sequence takes 2S-cycles

;uses temporary register Rt to store intermediate value

RSB	Rt,	Rm,	Rm,	LSL	#4	;	Rt	=	-Rm + Rm16 = Rm15
RSB	Rd,	Rt,	Rt,	LSL	#3	;	Rd	=	Rt7 = Rm105

The second method is the optimal solution which is not difﬁcult to ﬁnd by hand for small values such as 105. However, the problem of ﬁnding the optimum becomes much more difﬁcult for larger constant values. There are no known algorithms which solve this problem quickly.

Temporary registers can be used to store intermediate results to help achieve the shortest sequence. For a very large constant, more than one temporary register may be needed to ﬁnd the shortest sequence.

Looking carefully at the 2-instruction optimal code sequence, it is can be seen that Rt could be the same register as Rd. Rd is usually the ﬁrst-choice for a temporary register since it will be used to store the ﬁnal result in the last instruction of the sequence.

		Application Note 19	5
		ARM DAI 0019D

ARM6 in DSP Applications : Use of the MUL Instruction

8 Experimenting with ARMCC assembly output

When writing a speed-critical ARM assembler program, it is a good idea to code it in C ﬁrst (to check the algorithm) before converting it to handtuned assembler. It is helpful to see the ARM code which the compiler generates as a starting point for your work. armcc -S will generate an assembly ﬁle instead of an object ﬁle. For example, consider the following simple C code:

int mulby105( int num )

{

return num * 105;

}

Compile this using:

armcc -S mulby105.c

Now, examine the ﬁle "mulby105.s" which has been created:

; generated by Norcroft ARM C vsn 4.41 [Aug 26 1992] AREA |C$$code|, CODE, READONLY

|x$codeseg|

EXPORT	mulby105
mulby105
RSB	a1,a1,a1,LSL #4
RSB	a1,a1,a1,LSL #3
MOV	pc,lr

AREA |C$$data|,DATA |x$dataseg|

END

Notice that the compiler has found the short multiply-by-constant sequence.

The C-compiler restricts the amount of searching it performs in order to minimise the impact on compilation time. The current version of armcc has a cut-off so that it uses a normal MUL if the number of instructions used in the multiply-by-constant sequence exceeds some number N. This is to avoid the sequence becoming too long.

9 Discussion of speed improvement

To evaluate the speed gains from using multiply-by-constant, consider multiplying by 11585 as an example.

A normal multiply consists of:

;first load the constant 11585 (0x2D41 in hex) into Rs

;these 2 instructions take 1S-cycle each

MOV	Rs, #0x2D			<< 8
ORR	Rs,	Rs,	#0x41
; now do the multiply,				which takes 1S+8I-cycles for Rs=11585
MUL	Rd,	Rm,	Rs	; do the multiply

6	Application Note 19
	ARM DAI 0019D
	ARM DAI 0019D

ARM6 in DSP Applications : Use of the MUL Instruction

The load-a-constant stage may take up to four 4 instructions (in this case 2) or an LDR, and the multiply takes 1 instruction fetch plus a number of internal cycles to calculate the multiply.

The optimal multiply-by-constant sequence consists of:

11585 == ((Rm*3)*15)*256 + Rm + Rm*64

; this sequence takes 4S-cycles

ADD Rd, Rm, Rm, LSL #1 ; Rd = Rm*3

RSB Rd, Rd, Rd, LSL #4 ; Rd = Rd*15 = Rm*45

ADD Rd, Rm, Rd, LSL #8 ; Rd = Rm + Rd*256 = Rm*11521 ADD Rd, Rd, Rm, LSL #6 ; Rd = Rd + Rm*64 = Rm*11585

This is just 4 data processing instructions.

Method	Cycles

Normal multiply:	3 instructions + MUL internal cycles

Multiply-by-constant:	4 instructions

Table 9-2: Comparison between MUL and multiply-by-constan

With slow memory systems and non-cached processors, I-cycles can be much faster than other cycles because they are internal to the ARM core. This means that the general multiply can sometimes be the fastest option (for large constants where an efﬁcient solution cannot be found). It should also use less memory. If the load-a- constant stage could be moved outside a loop, the balance would tip further in favour of the general multiply as there is only the MUL to execute.

Method	Cycles on ARM60	Cycles on ARM610

Normal multiply:	3S + 8I	11F

Multiply-by-constant:	4S	4F

Table 9-3: MUL Cycle Timing for ARM6 Processors

10 Conclusions

This application note explains how the speed of the multiply depends on the value in Rs. It is important to ensure that the smallest positive number is used for Rs so that MUL is fastest.

If MUL is used to generate a single result of at most 32-bits, it is usually possible to ensure that Rs contains at most 16 bits so that the MUL takes at most 1S + 9I-cycles, providing that the operands are unsigned.

		Application Note 19	7
		ARM DAI 0019D

ARM6 in DSP Applications : Use of the MUL Instruction

MUL can be used with negative operands; the result is the bottom 32-bits of the 64-bit product as expected. If Rs is negative, the MUL will take the full 1S + 16I-cycles as early termination does not occur. It is possible to ensure that Rs is always positive by putting the negative operand in Rm, or by negating Rs to make it positive.

When multiplying by a constant, a technique can be used to decompose the multiply into a sequence of arithmetic instruction which achieves the same as a MUL. The performance of this method depends on the precise value of the constant (smaller values are faster). The sequence of arithmetic operations is likely to be faster than the equivalent MUL, but it depends on the memory system and variant of ARM processor used.

This document can offer no universal solution to speed up multiplication. It has been shown that there are several techniques which can be applied, but the speed gains are always depend on the precise values involved (which are application-speciﬁc). Even then, it is not always clear which method will be faster, as the actual timings depend on whether the processor is cached or uncached, and on the performance of the memory system.

8	Application Note 19
	ARM DAI 0019D
	ARM DAI 0019D

1 / 21 2 > Следующая >>>

Соседние файлы в предмете Электротехника

#
23.08.2013295.69 Кб34(4Front).Open sound system (OSS) programmer's guide.v1.11.2000.pdf
#
23.08.2013465.76 Кб41(ABE).Broadcast engineer's handbook.1999.pdf
#
23.08.201343.98 Кб30(ARM).ARM6 in DSP applications.Use of the MUL instruction.pdf
#
23.08.201351.65 Кб29(ARM).Big and little endian byte addressing.pdf
#
23.08.2013144.45 Кб43(ARM).Fixed point arithmetic on the ARM.pdf
#
23.08.201374.92 Кб33(ARM).Floating-point performance.pdf
#
23.08.20132.79 Mб31(ARM).Porting TCP-IP programmer's guide.Ver 1.4.pdf
#
23.08.20132.64 Mб43(ARM).Porting TCP-IP programmer's guide.Ver 1.6.pdf