Добавил:
Upload Опубликованный материал нарушает ваши авторские права? Сообщите нам.
Вуз: Предмет: Файл:
Guide to Elliptic Curve Cryptography.pdf
Скачиваний:
58
Добавлен:
15.03.2015
Размер:
4.58 Mб
Скачать

5.1. Software implementation

219

for inline assembler to cooperate with surrounding code is a design weakness compared with that of GNU C or the Intel compilers, which have the additional advantage that they can be used on Unix-like systems.

5.1.5Timings

Selected field operation timings are presented for Intel Pentium family processors and the Sun UltraSPARC, commonly used in workstations. The NIST recommended binary and prime fields (§A.2) are the focus, although some data for an OEF (§2.4) is presented for comparison.

It is acknowledged that timings can be misleading, and are heavily influenced by the programmer’s talent and effort (or lack thereof), compiler selection, and the precise method of obtaining the data. The timings presented here should be viewed with the same healthy dose of skepticism prescribed for all such data. Nonetheless, timings are essential for algorithm analysis, since rough operation counts are often insufficient to capture platform characteristics. For the particular timings presented here, there has generally been independent “sanity check” data available from other implementations.

Tables 5.3–5.5 give basic comparisons for the NIST recommended binary and prime fields, along with a selected OEF. Inversion and multiplication times for binary fields on two platforms appear in Table 5.6, comparing compilers, inversion algorithms, and 32-bit vs 64-bit code. The 64-bit code on the Intel Pentium III is via special-purpose registers. These capabilities were extended in the Pentium 4, and Table 5.7 includes timings for prime field multiplication via these registers along with an approach using floating-point registers.

Field arithmetic comparisons

Timings for the smallest of the NIST recommended binary and prime fields, along with an OEF, are presented in Table 5.3. Specifically, these are the binary field F2163

with reduction polynomial f (z) = z163 + z7 + z6 + z3 + 1, the prime field F p192 with p192 = 2192 264 1, and the OEF F p6 with prime p = 231 1 and reduction poly-

nomial f (z) = z6 7. Realistic branch misprediction penalties are obtained using a sequence of pseudo randomly generated field elements, and the timings include framework overhead such as function calls. The Intel compiler version 6 along with the Netwide Assembler (NASM) were used on an Intel Pentium III running the Linux 2.2 operating system.

Algorithms for binary fields were coded entirely in C except for a one-line assembler fragment used in polynomial degree calculations in inversion. Assembly coding may be required in prime fields and OEFs in order to use hardware multipliers producing a 64-bit product from 32-bit input, and to directly access the carry bit, both of which are essential to performance in conventional methods. The first of the F p192 columns in Table 5.3 gives timings for code written primarily in C. For most entries, a signifi-

220

5. Implementation Issues

 

 

 

 

 

 

 

 

 

 

 

 

F2163

a

F p192

F(2311)6

 

 

F p192

 

 

 

 

 

 

 

Addition

0.04

0.18

0.07

0.06

 

Reduction

0.11b

0.25c

0.11c

 

 

Fast reduction

N/A

 

Barrett reduction (Algorithm 2.14)

N/A

1.55d

0.49

N/A

 

Multiplication (including fast reduction)

1.30e

0.57d,f

0.42f

0.40g

 

Squaring (including fast reduction)

0.20h

0.36i

0.32g

 

Inversion

10.5j

58.3k

25.2k

2.9l

 

I/M

8.1

102.3

60.0

7.3

 

 

 

aCoded primarily in C. bAlgorithm 2.41. cAlgorithm 2.27. dUses a 32×32 multiply-and-add.

 

eAlgorithm 2.36. fAlgorithm 2.10. gExample 2.56. hAlgorithm 2.39.

 

 

 

iAlgorithm 2.13. jAlgorithm 2.48. kAlgorithm 2.22. lAlgorithm 2.59.

 

 

Table 5.3. Timings (in µs) for field arithmetic on an 800 MHz Intel Pentium III. The binary field F2163 = F2[z]/(z163 + z7 + z6 + z3 + 1) and the prime field F p192 for p192 = 2192 264 1 are from the NIST recommendations (§A.2). The rightmost column is the optimal extension field F p6 = F p[z]/(z6 7) for prime p = 231 1.

cant penalty is seen relative to the timings with assembly. However, the multiplication routine uses an in-line assembly fragment for a 32×32 multiply with a three-word accumulation. If reduction is excluded, the time is very close to that obtained with the assembly language version, an indication that the Intel compiler handles insertion of short in-line assembly fragments well.

Reduction Barrett reduction does not exploit the special form of the NIST prime, and the entries can be interpreted as rough cost estimates of reduction with a random 192-bit prime. In contrast to special primes, this estimate shows that reduction is now a very significant part of field multiplication timings, encouraging the use of Montgomery (§2.2.4) and other multiplication methods. Significant performance degradation in the C version of the fast reduction algorithm is largely explained by the many conditionals in the clumsy handling of carry.

OEF The OEF F(2311)6 in the rightmost column of Table 5.3 is roughly the same size as F p192 . The multiplication is accomplished with an accumulation method (Example 2.56) resembling the method used in F p192 , and the resulting times are comparable. As expected, inversion is significantly faster for the OEF.

NIST fields Tables 5.4 and 5.5 provide timings for the NIST recommended binary and prime fields. Note that optimizations in the larger fields were limited to techniques employed for F2163 and F p192 . In particular, Karatsuba-Ofman methods were not competitive in our tests on this platform for the smaller fields, but were not examined carefully in the larger fields.

 

 

5.1.

Software implementation

221

 

 

 

 

 

 

 

 

 

F2163

F2233

F2283

F2409

F2571

 

 

 

 

 

 

 

 

 

Addition

0.04

0.04

0.04

0.06

0.07

 

 

Reduction (Algorithms 2.41–2.45)

0.11

0.13

0.19

0.14

0.33

 

 

Multiplication (Algorithm 2.36)

1.30

2.27

2.92

5.53

10.23

 

 

Squaring (Algorithm 2.39)

0.20

0.23

0.32

0.31

0.56

 

 

 

 

 

 

 

 

 

 

Inversion (Algorithm 2.48)

10.5

18.6

28.2

53.9

96.4

 

 

I/M

8.1

8.2

9.7

9.8

9.4

 

 

 

 

 

 

 

 

 

 

Table 5.4. Timings (in µs) for binary field arithmetic on an 800 MHz Intel Pentium III, including

reduction to canonical form. The fields are from the NIST recommendations (§A.2) with reduction polynomials z163 + z7 + z6 + z3 + 1, z233 + z74 + 1, z283 + z12 + z7 + z5 + 1, z409 + z87 + 1, and z571 + z10 + z5 + z2 + 1, respectively.

 

F p192

F p224

F p256

F p384

F p521

 

 

 

 

 

 

 

Addition

0.07

0.07

0.08

0.10

0.10

 

Reduction (Algorithms 2.27–2.31)

0.11

0.12

0.30

0.38

0.20

 

 

 

 

 

 

 

 

Multiplication (Algorithm 2.10)

0.42

0.52

0.81

1.47

2.32

 

Squaring (Algorithm 2.13)

0.36

0.44

0.71

1.23

1.87

 

 

 

 

 

 

 

 

Inversion (Algorithm 2.22)

25.2

34.3

44.3

96.3

163.8

 

I/M

60.0

70.0

54.7

65.5

70.6

 

 

 

 

 

 

 

 

Table 5.5. Timings (in µs) for prime field arithmetic on an 800 MHz Intel Pentium III, in-

cluding reduction to canonical form. The fields are from the NIST recommendations (§A.2)

with p192 = 2192 264 1, p224 = 2224 296 + 1, p256 = 2256 2224 + 2192 + 296 1, p384 = 2384 2128 296 + 232 1, and p521 = 2521 1.

Multiplication and inversion in binary fields

In point multiplication on elliptic curves (§3.3), the cost of field inversion relative to field multiplication is of particular interest. This section presents estimates of the ratio for the NIST binary fields (where the ratio is expected to be relatively small) for two platforms. The three inversion methods discussed in §2.3.6 are compared, along with timings for 32-bit and 64-bit code. The results also show significant differences among the compilers used.

Table 5.6 gives comparative timings on two popular platforms, the Intel Pentium III and Sun UltraSPARC IIe. Both processors are capable of 32and 64-bit operations, although only the UltraSPARC is 64-bit. The 64-bit operations on the Pentium III are via the single-instruction multiple-data (SIMD) registers, introduced on the Pentium MMX (see Table 5.1). The inversion methods are the extended Euclidean algorithm (EEA) in Algorithm 2.48, the binary Euclidean algorithm (BEA) in Algorithm 2.49, and the almost inverse algorithm (AIA) in Algorithm 2.50. The example fields are

taken from the NIST recommendations, with reduction polynomials f (z) = z163 + z7 + z6 + z3 + 1 and f (z) = z233 + z74 + 1. Both allow fast reduction, but only the latter

is favourable to the almost inverse algorithm. Field multiplication based on the comb

222

5. Implementation Issues

 

 

 

 

 

 

 

 

 

 

 

 

Pentium III (800 MHz)

SPARC (500 MHz)

 

 

32-bit

64-bit

32-bit

64-bit

 

Algorithm

gcc

icc

mmx

gcc

cc

cc

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Arithmetic in F2163

 

 

 

 

 

 

 

multiplication

1.8

1.3

.7

1.9

1.8

.9

 

Euclidean algorithm

10.9

10.5

7.1

21.4

14.8

 

binary Euclidean algorithm

20.7

16.0

16.8

14.9

10.6

 

almost inverse

16.4

15.2

22.6

15.2

 

I/M

6.1

8.1

9.8

8.8

8.2

12.1

 

 

 

 

 

 

 

 

 

Arithmetic in F2233

 

 

 

 

 

 

 

multiplication

3.0

2.3

4.0

2.9

1.7

 

Euclidean algorithm

18.3

18.8

45.5

25.7

 

binary Euclidean algorithm

36.2

28.9

42.0

34.0

16.9

 

almost inverse

22.7

20.1

36.8

24.7

 

I/M

6.1

8.2

9.2

8.5

9.9

 

 

 

 

 

 

 

 

Table 5.6. Multiplication and inversion times for the Intel Pentium III and Sun UltraSPARC IIe. The compilers are GNU C 2.95 (gcc), Intel 6 (icc), and Sun Workshop 6U2 (cc). The 64-bit “multimedia” registers were employed for the entries under “mmx.” Inversion to multiplication (I/M) uses the best inversion time.

method (Algorithm 2.36) appears to be fastest on these platforms. A width-4 comb was used, and the times include reduction. Other than the MMX code and a one-line assembler fragment for EEA, algorithms were coded entirely in C.

Some table entries are as expected, for example, the relatively good times for almost inverse in F2233 . Other entries illustrate the significant differences between platforms or compilers on a single platform. Apparent inconsistencies remain in Table 5.6, but we believe that the fastest times provide meaningful estimates of inversion and multiplication costs on these platforms.

Division The timings do not make a very strong case for division using a modification of the BEA (§2.3.6). For the 32-bit code, unless EEA or AIA can be converted to efficiently perform division, then only the entry for F2163 on the SPARC supports use of BEA-like division. Furthermore, the ratio I/M is at least 8 in most cases, and hence the savings from use of a division algorithm would be less than 10%. With such a ratio, elliptic curve methods will be chosen to reduce the number of inversions, so the savings on a point multiplication k P would be significantly less than 10%.

On the other hand, if affine-only arithmetic is in use in a point multiplication method based on double-and-add, then a fast division would be especially welcomed even if I/M is significantly larger than 5. If BEA is the algorithm of choice, then division has essentially the same cost as inversion.

Implementation notes General programming considerations for the implementations used here are covered in §5.1.4. In particular, to obtain acceptable multiplication times

5.1. Software implementation

223

with gcc on the Sun SPARC, code was tuned to be more “gcc-friendly.” Limited tuning for gcc was also performed on the inversion code. Optimizing the inversion code is tedious, in part because rough operation counts at this level often fail to capture processor or compiler characteristics adequately.

Multimedia registers The Intel Pentium family (all but the original and the Pentium Pro) and AMD processors possess eight 64-bit “multimedia” registers that were employed for the timings in the column marked “mmx.” Use of these capabilities for field arithmetic is discussed in §5.1.3.

EEA Algorithm 2.48 requires polynomial degree calculations. On the SPARC, degree was found by binary search and table lookup, once the nonzero word of interest is located. On the Pentium, a bit scan instruction (bsr) that finds the position of the most significant bit in a word was employed via in-line assembly, resulting in an improvement of approximately 15% in inversion times.

The code tracks the lengths of u and v using t fragments of similar code, each fragment corresponding to the current “top” of u and v. Here, t was chosen to be the number of words required to represent field elements.

BEA Algorithm 2.49 was implemented with a t-fragment split to track the lengths of u and v efficiently. Rather than the degree calculation indicated in step 3.3, a simpler comparison on the appropriate words was used.

AIA Algorithm 2.50 allows efficient tracking of the lengths of g1 and g2 (in addition to the lengths of u and v). A total of t2 similar fragments of code were used, a significant amount of code expansion unless t is small. As with BEA, a simple comparison replaces the degree calculations. Note that only the reduction polynomial for F2233 is favourable to the almost inverse algorithm.

Prime field multiplication methods

For prime fields, traditional approaches for field multiplication are often throttled by limitations of hardware integer multipliers and carry propagation. Both the UltraSPARC and the Pentium family processors suffer from such limitations. The Intel Pentium 4 is in fact much slower (in terms of processor cycles) in some operations than the preceding generation of Pentium processors. As an example, field multiplication in F p224 using Algorithm 2.10 with code targeted at the Pentium II/III appears in Table 5.5 (from a Pentium III) and Table 5.7 (from a Pentium 4). Despite a factor 2 clock speed advantage for the Pentium 4, the timing is in fact slower than obtained on the Pentium III.

Karatsuba-Ofman Methods based on Karatsuba-Ofman do not appear to be competitive with classical methods on the Pentium II/III for fields of this size. Table 5.7 includes times on the Pentium 4 using a depth-2 approach outlined in Example 2.12.

Соседние файлы в предмете Профессионально-ориентированный английский язык