Добавил:
Опубликованный материал нарушает ваши авторские права? Сообщите нам.
Вуз: Предмет: Файл:
Constantinides G.A., Cheung P.Y.K., Luk W. - Synthesis and optimization of DSP algorithms (2004)(en).pdf
Скачиваний:
20
Добавлен:
15.08.2013
Размер:
1.54 Mб
Скачать

6.5 Some Results

141

3.Extract the subset Vb V of nodes on the bound critical path which could complete within the latency constraint (6.31)

4.if Vb = do

return failure case

5.Find Vp Vb of nodes maximizing the measure (6.32)

6.if v Vp : L(π1(R(v))) < max(v) do

Select one such node v Vp else do

Select an arbitrary node v Vp end if

7. H ← H \ {{v, r} H : L(r) = max(v)} end

Example 6.19. Fig. 6.10 illustrates an example refinement phase corresponding to the data flow graph introduced in Fig. 6.2 and reproduced in Fig. 6.10(a) for convenience. Nodes that are on the computation critical path with respect to the data flow graph P (V, D) are highlighted in Fig. 6.10(a), Vc = {a2, m3, m4, m5, a1, a3}. This data flow graph has been scheduled and resourcebound in Fig. 6.10(b). The portions of node execution time between min(v) and max(v) have been shaded in the figure.

The augmented data flow graph P (V, D Db) obtained from timeabutment edges Db is illustrated in Fig. 6.10(c) and consists of a single extra edge from operation m2 to operation m5. The resulting change in critical path is significant. The bound critical path is given by Vb = {m1, m2, m5, a3}.

For λ = 18, the lowest achievable latency constraint, the subset Vb is given by Vb = {m1, m2}. The heuristic measures described above may then be applied to decide which of these two nodes is to have its upper bound latency reduced.

6.5 Some Results

Before considering in detail the performance of the methods discussed in this chapter, it is instructive to follow through the sequencing graph of Fig. 6.2 which has been used as an example throughout this chapter where appropriate. Both optimal and heuristic schedules, resource allocations, bindings and wordlength selections have been performed for this example in order to explore the area / latency tradeo achievable. Fig. 6.11 plots these results. Not all ILP results are shown, due to excessive ILP solver execution time. The ILP and heuristic solutions are identical for all cases except λ = 26, where there is a slight di erence caused by the presence of an extra adder in the heuristic solution.

Figure 6.12 illustrates the results for the benchmark circuits introduced in Section 4.6. For comparison, not only are the optimal (ILP) results and

142

6 Scheduling and Resource Binding

 

 

 

 

 

 

m1

 

 

 

ADD

MULT

MULT

 

a

 

 

25

(40,17)

(33,21)

 

MULT

2

 

 

 

 

 

 

ADD

 

0

a2

 

 

 

(18,15)

 

 

 

 

20

 

2

 

 

 

 

 

 

 

 

 

 

 

 

 

m 1

 

 

 

 

 

4

 

m3

 

 

 

 

6

 

 

m2

 

m3

 

8

 

 

 

 

 

10

 

 

 

MULT

 

MULT

 

a4

 

 

(19,17)

(33,21)

 

12

m 2

 

 

 

m4

 

 

 

 

14

 

 

 

m4

m5

 

 

 

 

 

a4

16

 

 

 

 

MULT

MULT

 

 

 

 

ADD

 

 

 

 

 

(22,16)

(40,12)

18

 

 

 

 

22

a1

m5

 

 

 

 

20

 

 

 

 

 

 

 

 

a1

a3

 

22

 

 

 

 

 

24

 

 

 

 

ADD

ADD

 

a3

 

 

 

19

25

 

26

 

 

 

 

 

 

 

 

 

(a) sequencing graph

 

(b) initial schedule

m1

 

a

 

 

MULT

 

2

 

 

 

ADD

 

 

(18,15)

 

 

 

 

20

 

 

 

 

 

 

 

 

 

 

critical

m2

m3

 

 

node

 

 

 

MULT

MULT

 

 

 

(19,17)

(33,21)

 

 

 

 

 

 

 

min

m4

m5

a4

max

operation

operation

latency

MULT

MULT

ADD

latency

 

(22,16)

(40,12)

 

22

 

 

 

 

 

 

a1

 

a3

 

 

ADD

 

ADD

 

 

19

 

25

 

 

(c) augmented sequencing graph

Fig. 6.10. Example use of the bound critical path for word-length refinement

the heuristic results provided, but also the solutions corresponding to word- length-blind scheduling followed by an optimal resource binding [CCL00b]. All three sets of results are only provided for the IIR filter, the polyphase filter bank and the RGB to YCrYb convertor, due to the excessive execution time of both the ILP solver and the optimal binding.

These results illustrate that for the benchmark circuits, the heuristic presented in this chapter provides a significant improvement in area (between 16% and 46%, average 15%) over a two-stage approach of scheduling and then binding, even when the binding is optimal. The heuristic results have between 0% and 62% (average 14%) worse area than the optimum combined solution, for cases where the optimum is known.

In order to fully characterize the heuristic performance, further results have been obtained using artificially generated examples. For statistically significant data on solution quality, 200 random sequencing graphs have been

6.5 Some Results

143

 

1600

 

 

 

 

 

 

 

1500

 

 

 

heuristic synthesis

 

 

 

 

 

 

optimal (ILP) synthesis

 

 

 

1400

 

 

 

 

 

 

 

1300

 

 

 

 

 

 

area

 

 

 

 

 

 

 

resulting

1200

 

 

 

 

 

 

 

 

 

 

 

 

 

 

1100

 

 

 

 

 

 

 

1000

 

 

 

 

 

 

 

900

 

 

 

 

 

 

 

800

20

25

30

35

40

45

 

15

 

 

 

 

latency constraint λ

 

 

Fig. 6.11. Design-space exploration for the simple sequencing graph of Fig. 6.2

generated for each (problem size |V |, latency constraint λ) pair, using a variant of the TGFF algorithm [DRW98]. The first set of detailed quality results that has been collected, measures the variation of the heuristic solution quality with both the problem size and the tightness of the supplied latency constraint. For each sequencing graph, the minimum possible latency constraint λmin has been found using ASAP scheduling, from which latency constraints have been generated corresponding to a 0% to 30% relaxation of λmin. Algorithm ArchSynth has then been executed on the graph / latency constraint combination. The resulting area has been normalized with respect to the optimal solution resulting from [CCL00b] where operations may only share resources when the implemented resource has latency no longer than the minimum required to implement the given operations. These results are plotted in Fig. 6.13, as a percentage area penalty for using the approach of [CCL00b] over the heuristic presented in this chapter. Each point represents the mean of two hundred representative designs.

The results illustrate that for designs with even a small ‘slack’ in terms of latency constraints, significant area improvements of up to 30% can be made by performing the scheduling, binding, and word-length selection in the intertwined manner proposed. The area improvements come from increased resource sharing due to implementing small word-length operations in larger

144 6 Scheduling and Resource Binding

700

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

optimal (ILP)

 

 

 

 

 

 

 

 

 

heuristic

 

 

 

650

 

 

 

 

 

 

optimal binding from [CCL00b]

 

 

 

 

 

 

 

 

 

 

 

600

 

 

 

 

 

 

 

 

 

 

550

 

 

 

 

 

 

 

 

 

 

500

 

 

 

 

 

 

 

 

 

 

area

 

 

 

 

 

 

 

 

 

area

450

 

 

 

 

 

 

 

 

 

 

400

 

 

 

 

 

 

 

 

 

 

350

 

 

 

 

 

 

 

 

 

 

300

 

 

 

 

 

 

 

 

 

 

250

13.5

14

14.5

15

15.5

16

16.5

17

17.5

18

13

 

 

 

 

 

λ

 

 

 

 

 

7500

 

 

 

 

 

 

 

 

 

 

 

 

heuristic

 

 

 

 

 

 

 

optimal binding from [CCL00b]

 

7000

 

 

 

 

 

 

 

6500

 

 

 

 

 

 

 

6000

 

 

 

 

 

 

 

5500

 

 

 

 

 

 

 

5000

 

 

 

 

 

 

 

4500

 

 

 

 

 

 

 

4000

 

 

 

 

 

 

 

3500

 

 

 

 

 

 

 

3000

8

9

10

11

12

13

14

7

 

 

 

 

λ

 

 

 

(a) IIR

(b) FIR

 

850

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

heuristic

 

800

 

 

 

 

 

 

 

 

750

 

 

 

 

 

 

 

 

700

 

 

 

 

 

 

 

area

650

 

 

 

 

 

 

area

 

600

 

 

 

 

 

 

 

 

550

 

 

 

 

 

 

 

 

500

 

 

 

 

 

 

 

 

450

15

16

17

18

19

20

21

 

14

 

 

 

 

 

λ

 

 

 

750

 

 

 

 

 

 

 

 

 

 

 

 

 

 

heuristic

700

 

 

 

 

 

 

 

650

 

 

 

 

 

 

 

600

 

 

 

 

 

 

 

550

 

 

 

 

 

 

 

500

 

 

 

 

 

 

 

450

 

 

 

 

 

 

 

400

15

16

17

18

19

20

21

14

 

 

 

 

λ

 

 

 

(c) DCT1

(d) DCT2

 

200

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

optimal (ILP)

 

 

 

 

190

 

 

 

 

 

 

heuristic

 

 

 

 

 

 

 

 

 

 

 

optimal binding from [CCL00b]

 

 

180

 

 

 

 

 

 

 

 

 

 

 

170

 

 

 

 

 

 

 

 

 

 

 

160

 

 

 

 

 

 

 

 

 

 

area

150

 

 

 

 

 

 

 

 

 

area

 

 

 

 

 

 

 

 

 

 

 

140

 

 

 

 

 

 

 

 

 

 

 

130

 

 

 

 

 

 

 

 

 

 

 

120

 

 

 

 

 

 

 

 

 

 

 

110

 

 

 

 

 

 

 

 

 

 

 

100

5.5

6

6.5

7

7.5

8

8.5

9

9.5

10

 

5

 

 

 

 

 

 

λ

 

 

 

 

 

170

 

 

 

 

 

 

 

 

 

 

160

 

 

 

 

 

 

optimal (ILP)

 

 

 

 

 

 

 

 

 

 

heuristic

 

 

 

 

 

 

 

 

 

 

optimal binding from [CCL00b]

 

150

 

 

 

 

 

 

 

 

 

 

140

 

 

 

 

 

 

 

 

 

 

130

 

 

 

 

 

 

 

 

 

 

120

 

 

 

 

 

 

 

 

 

 

110

 

 

 

 

 

 

 

 

 

 

100

 

 

 

 

 

 

 

 

 

 

90

 

 

 

 

 

 

 

 

 

 

80

13.5

14

14.5

15

15.5

16

16.5

17

17.5

18

13

 

 

 

 

 

λ

 

 

 

 

 

(e) PFB

(f) RGB–YCrCb

Fig. 6.12. Design-space exploration for the benchmark circuits of Section 4.6

6.5 Some Results

145

Area penalty (%)

30

25

20

15

10

5

0

−5 30

25

25

20

 

 

20

15

 

 

 

 

15

Latency constraint

10

 

10

 

 

No. operations

 

5

 

5

 

 

00

Fig. 6.13. Variation with number of operations and latency constraint of area penalty for [CCL00b] over the proposed heuristic

word-length resources with longer latency. Even for relatively small graphs, area improvements of tens of percent are possible.

Fig. 6.14 illustrates the increase in implementation area from using the heuristic presented in Section 6.4 over the optimum combined problem presented in Section 6.3. These results can only be provided for modest problem size and a minimum latency constraint λ = λmin, as the ILP solution execution time scales rapidly with problem size and as the latency constraint is relaxed.

The variation of minimum-latency execution time with problem size for 200 graphs using the ILP model (solved with lp solve [Sch97] on a Pentium III 450MHz) and the heuristic algorithm (implemented in C on the same machine) is shown in Fig. 6.15, illustrating the polynomial complexity of the heuristic against the exponential complexity of the ILP. Over the range of 1 to 10 operations, the relative increase in area ranges from 0% to 16% whereas the ILP solution takes between one and three orders of magnitude greater time to execute. An important point not brought out by these results is the scaling of execution time with overall latency constraint. The number of variables in the ILP model scales with the latency constraint, making the execution time highly dependent on this parameter. This is illustrated in Table 6.1 for 200 9-operation sequencing graphs. By contrast, the execution time of Algorithm ArchSynth does not scale with the latency constraint. Thus the

146 6 Scheduling and Resource Binding

 

18

 

 

 

 

 

 

 

 

 

 

16

 

 

 

 

 

 

 

 

 

 

14

 

 

 

 

 

 

 

 

 

(%)

12

 

 

 

 

 

 

 

 

 

optimum

 

 

 

 

 

 

 

 

 

10

 

 

 

 

 

 

 

 

 

over

 

 

 

 

 

 

 

 

 

 

premium

8

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Area

6

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

4

 

 

 

 

 

 

 

 

 

 

2

 

 

 

 

 

 

 

 

 

 

0

2

3

4

5

6

7

8

9

10

 

1

 

 

 

 

 

Number of operations

 

 

 

 

Fig. 6.14. Variation of area premium for Algorithm ArchSynth over the optimum solution

one to three orders of magnitude illustrated in Fig. 6.15 are under conditions most favourable to the ILP-based solution.

Table 6.1. Variation of execution time for 200 graphs with λ/λmin for heuristic and ILP solution

λ/λmin

 

heuristic (secs)

ILP (mins:secs)

1.00

 

3.02

2:07.09

1.05

 

3.51

4:05.21

1.10

 

3.73

15:55.56

 

 

1.15

 

3.52

>30:00.00

 

The results presented clearly indicate the practical nature of the heuristic presented in this chapter, whose execution results in a speedup of up to three orders of magnitude over ILP solution, even for modest graph sizes, at an area-penalty of between 0% and 16%.