- •Contents
- •1 Introduction
- •1.1 Objectives
- •1.2 Overview
- •2 Background
- •2.1 Digital Design for DSP Engineers
- •2.1.2 The Field-Programmable Gate Array
- •2.1.3 Arithmetic on FPGAs
- •2.2 DSP for Digital Designers
- •2.3 Computation Graphs
- •2.4 The Multiple Word-Length Paradigm
- •2.5 Summary
- •3 Peak Value Estimation
- •3.1 Analytic Peak Estimation
- •3.1.1 Linear Time-Invariant Systems
- •3.1.2 Data-range Propagation
- •3.2 Simulation-based Peak Estimation
- •3.3 Hybrid Techniques
- •3.4 Summary
- •4 Word-Length Optimization
- •4.1 Error Estimation
- •4.1.1 Word-Length Propagation and Conditioning
- •4.1.2 Linear Time-Invariant Systems
- •4.1.3 Extending to Nonlinear Systems
- •4.2 Area Models
- •4.3.1 Convexity and Monotonicity
- •4.4 Optimization Strategy 1: Heuristic Search
- •4.5 Optimization Strategy 2: Optimum Solutions
- •4.5.1 Word-Length Bounds
- •4.5.2 Adders
- •4.5.3 Forks
- •4.5.4 Gains and Delays
- •4.5.5 MILP Summary
- •4.6 Some Results
- •4.6.1 Linear Time-Invariant Systems
- •4.6.2 Nonlinear Systems
- •4.6.3 Limit-cycles in Multiple Word-Length Implementations
- •4.7 Summary
- •5 Saturation Arithmetic
- •5.1 Overview
- •5.2 Saturation Arithmetic Overheads
- •5.3 Preliminaries
- •5.4 Noise Model
- •5.4.1 Conditioning an Annotated Computation Graph
- •5.4.2 The Saturated Gaussian Distribution
- •5.4.3 Addition of Saturated Gaussians
- •5.4.4 Error Propagation
- •5.4.5 Reducing Bound Slackness
- •5.4.6 Error estimation results
- •5.5 Combined Optimization
- •5.6 Results and Discussion
- •5.6.1 Area Results
- •5.6.2 Clock frequency results
- •5.7 Summary
- •6 Scheduling and Resource Binding
- •6.1 Overview
- •6.2 Motivation and Problem Formulation
- •6.3 Optimum Solutions
- •6.3.1 Resources, Instances and Control Steps
- •6.3.2 ILP Formulation
- •6.4 A Heuristic Approach
- •6.4.1 Overview
- •6.4.2 Word-Length Compatibility Graph
- •6.4.3 Resource Bounds
- •6.4.4 Latency Bounds
- •6.4.5 Scheduling with Incomplete Word-Length Information
- •6.4.6 Combined Binding and Word-Length Selection
- •6.5 Some Results
- •6.6 Summary
- •7 Conclusion
- •7.1 Summary
- •7.2 Future Work
- •A.1 Sets and functions
- •A.2 Vectors and Matrices
- •A.3 Graphs
- •A.4 Miscellaneous
- •A.5 Pseudo-Code
- •References
- •Index
6.5 Some Results |
141 |
3.Extract the subset Vb V of nodes on the bound critical path which could complete within the latency constraint (6.31)
4.if Vb = do
return failure case
5.Find Vp Vb of nodes maximizing the measure (6.32)
6.if v Vp : L(π1(R(v))) < max(v) do
Select one such node v Vp else do
Select an arbitrary node v Vp end if
7. H ← H \ {{v, r} H : L(r) = max(v)} end
Example 6.19. Fig. 6.10 illustrates an example refinement phase corresponding to the data flow graph introduced in Fig. 6.2 and reproduced in Fig. 6.10(a) for convenience. Nodes that are on the computation critical path with respect to the data flow graph P (V, D) are highlighted in Fig. 6.10(a), Vc = {a2, m3, m4, m5, a1, a3}. This data flow graph has been scheduled and resourcebound in Fig. 6.10(b). The portions of node execution time between min(v) and max(v) have been shaded in the figure.
The augmented data flow graph P (V, D Db) obtained from timeabutment edges Db is illustrated in Fig. 6.10(c) and consists of a single extra edge from operation m2 to operation m5. The resulting change in critical path is significant. The bound critical path is given by Vb = {m1, m2, m5, a3}.
For λ = 18, the lowest achievable latency constraint, the subset Vb is given by Vb = {m1, m2}. The heuristic measures described above may then be applied to decide which of these two nodes is to have its upper bound latency reduced.
6.5 Some Results
Before considering in detail the performance of the methods discussed in this chapter, it is instructive to follow through the sequencing graph of Fig. 6.2 which has been used as an example throughout this chapter where appropriate. Both optimal and heuristic schedules, resource allocations, bindings and wordlength selections have been performed for this example in order to explore the area / latency tradeo achievable. Fig. 6.11 plots these results. Not all ILP results are shown, due to excessive ILP solver execution time. The ILP and heuristic solutions are identical for all cases except λ = 26, where there is a slight di erence caused by the presence of an extra adder in the heuristic solution.
Figure 6.12 illustrates the results for the benchmark circuits introduced in Section 4.6. For comparison, not only are the optimal (ILP) results and
142 |
6 Scheduling and Resource Binding |
|
|
|
|
|
|
|
m1 |
|
|
|
ADD |
MULT |
MULT |
|
a |
|
|
25 |
(40,17) |
(33,21) |
|
|
MULT |
2 |
|
|
|
|
|
|
ADD |
|
0 |
a2 |
|
|
|
|
(18,15) |
|
|
|
|||
|
20 |
|
2 |
|
|
||
|
|
|
|
|
|
||
|
|
|
|
|
m 1 |
|
|
|
|
|
|
4 |
|
m3 |
|
|
|
|
|
6 |
|
|
|
m2 |
|
m3 |
|
8 |
|
|
|
|
|
10 |
|
|
|
||
MULT |
|
MULT |
|
a4 |
|
|
|
(19,17) |
(33,21) |
|
12 |
m 2 |
|
||
|
|
m4 |
|||||
|
|
|
|
14 |
|
|
|
|
m4 |
m5 |
|
|
|
|
|
|
a4 |
16 |
|
|
|
||
|
MULT |
MULT |
|
|
|
||
|
ADD |
|
|
|
|
||
|
(22,16) |
(40,12) |
18 |
|
|
|
|
|
22 |
a1 |
m5 |
|
|||
|
|
|
20 |
|
|||
|
|
|
|
|
|
||
|
a1 |
a3 |
|
22 |
|
|
|
|
|
24 |
|
|
|
||
|
ADD |
ADD |
|
a3 |
|
|
|
|
19 |
25 |
|
26 |
|
|
|
|
|
|
|
|
|
|
(a) sequencing graph |
|
(b) initial schedule |
|||
m1 |
|
a |
|
|
|
MULT |
|
2 |
|
|
|
|
ADD |
|
|
||
(18,15) |
|
|
|
||
|
20 |
|
|
||
|
|
|
|
||
|
|
|
|
critical |
|
m2 |
m3 |
|
|
node |
|
|
|
|
|||
MULT |
MULT |
|
|
|
|
(19,17) |
(33,21) |
|
|
|
|
|
|
|
|
min |
|
m4 |
m5 |
a4 |
max |
operation |
|
operation |
latency |
||||
MULT |
MULT |
||||
ADD |
latency |
|
|||
(22,16) |
(40,12) |
|
|||
22 |
|
|
|||
|
|
|
|
||
a1 |
|
a3 |
|
|
|
ADD |
|
ADD |
|
|
|
19 |
|
25 |
|
|
(c) augmented sequencing graph
Fig. 6.10. Example use of the bound critical path for word-length refinement
the heuristic results provided, but also the solutions corresponding to word- length-blind scheduling followed by an optimal resource binding [CCL00b]. All three sets of results are only provided for the IIR filter, the polyphase filter bank and the RGB to YCrYb convertor, due to the excessive execution time of both the ILP solver and the optimal binding.
These results illustrate that for the benchmark circuits, the heuristic presented in this chapter provides a significant improvement in area (between −16% and 46%, average 15%) over a two-stage approach of scheduling and then binding, even when the binding is optimal. The heuristic results have between 0% and 62% (average 14%) worse area than the optimum combined solution, for cases where the optimum is known.
In order to fully characterize the heuristic performance, further results have been obtained using artificially generated examples. For statistically significant data on solution quality, 200 random sequencing graphs have been
6.5 Some Results |
143 |
|
1600 |
|
|
|
|
|
|
|
1500 |
|
|
|
heuristic synthesis |
|
|
|
|
|
|
optimal (ILP) synthesis |
|
|
|
|
1400 |
|
|
|
|
|
|
|
1300 |
|
|
|
|
|
|
area |
|
|
|
|
|
|
|
resulting |
1200 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1100 |
|
|
|
|
|
|
|
1000 |
|
|
|
|
|
|
|
900 |
|
|
|
|
|
|
|
800 |
20 |
25 |
30 |
35 |
40 |
45 |
|
15 |
||||||
|
|
|
|
latency constraint λ |
|
|
Fig. 6.11. Design-space exploration for the simple sequencing graph of Fig. 6.2
generated for each (problem size |V |, latency constraint λ) pair, using a variant of the TGFF algorithm [DRW98]. The first set of detailed quality results that has been collected, measures the variation of the heuristic solution quality with both the problem size and the tightness of the supplied latency constraint. For each sequencing graph, the minimum possible latency constraint λmin has been found using ASAP scheduling, from which latency constraints have been generated corresponding to a 0% to 30% relaxation of λmin. Algorithm ArchSynth has then been executed on the graph / latency constraint combination. The resulting area has been normalized with respect to the optimal solution resulting from [CCL00b] where operations may only share resources when the implemented resource has latency no longer than the minimum required to implement the given operations. These results are plotted in Fig. 6.13, as a percentage area penalty for using the approach of [CCL00b] over the heuristic presented in this chapter. Each point represents the mean of two hundred representative designs.
The results illustrate that for designs with even a small ‘slack’ in terms of latency constraints, significant area improvements of up to 30% can be made by performing the scheduling, binding, and word-length selection in the intertwined manner proposed. The area improvements come from increased resource sharing due to implementing small word-length operations in larger
144 6 Scheduling and Resource Binding
700 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
optimal (ILP) |
|
|
|
|
|
|
|
|
|
|
heuristic |
|
|
|
650 |
|
|
|
|
|
|
optimal binding from [CCL00b] |
|
||
|
|
|
|
|
|
|
|
|
|
|
600 |
|
|
|
|
|
|
|
|
|
|
550 |
|
|
|
|
|
|
|
|
|
|
500 |
|
|
|
|
|
|
|
|
|
|
area |
|
|
|
|
|
|
|
|
|
area |
450 |
|
|
|
|
|
|
|
|
|
|
400 |
|
|
|
|
|
|
|
|
|
|
350 |
|
|
|
|
|
|
|
|
|
|
300 |
|
|
|
|
|
|
|
|
|
|
250 |
13.5 |
14 |
14.5 |
15 |
15.5 |
16 |
16.5 |
17 |
17.5 |
18 |
13 |
||||||||||
|
|
|
|
|
λ |
|
|
|
|
|
7500 |
|
|
|
|
|
|
|
|
|
|
|
|
heuristic |
|
|
|
|
|
|
|
optimal binding from [CCL00b] |
|
|
7000 |
|
|
|
|
|
|
|
6500 |
|
|
|
|
|
|
|
6000 |
|
|
|
|
|
|
|
5500 |
|
|
|
|
|
|
|
5000 |
|
|
|
|
|
|
|
4500 |
|
|
|
|
|
|
|
4000 |
|
|
|
|
|
|
|
3500 |
|
|
|
|
|
|
|
3000 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
7 |
|||||||
|
|
|
|
λ |
|
|
|
(a) IIR |
(b) FIR |
|
850 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
heuristic |
|
800 |
|
|
|
|
|
|
|
|
750 |
|
|
|
|
|
|
|
|
700 |
|
|
|
|
|
|
|
area |
650 |
|
|
|
|
|
|
area |
|
600 |
|
|
|
|
|
|
|
|
550 |
|
|
|
|
|
|
|
|
500 |
|
|
|
|
|
|
|
|
450 |
15 |
16 |
17 |
18 |
19 |
20 |
21 |
|
14 |
|||||||
|
|
|
|
|
λ |
|
|
|
750 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
heuristic |
700 |
|
|
|
|
|
|
|
650 |
|
|
|
|
|
|
|
600 |
|
|
|
|
|
|
|
550 |
|
|
|
|
|
|
|
500 |
|
|
|
|
|
|
|
450 |
|
|
|
|
|
|
|
400 |
15 |
16 |
17 |
18 |
19 |
20 |
21 |
14 |
|||||||
|
|
|
|
λ |
|
|
|
(c) DCT1 |
(d) DCT2 |
|
200 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
optimal (ILP) |
|
|
|
|
190 |
|
|
|
|
|
|
heuristic |
|
|
|
|
|
|
|
|
|
|
|
optimal binding from [CCL00b] |
|
||
|
180 |
|
|
|
|
|
|
|
|
|
|
|
170 |
|
|
|
|
|
|
|
|
|
|
|
160 |
|
|
|
|
|
|
|
|
|
|
area |
150 |
|
|
|
|
|
|
|
|
|
area |
|
|
|
|
|
|
|
|
|
|
||
|
140 |
|
|
|
|
|
|
|
|
|
|
|
130 |
|
|
|
|
|
|
|
|
|
|
|
120 |
|
|
|
|
|
|
|
|
|
|
|
110 |
|
|
|
|
|
|
|
|
|
|
|
100 |
5.5 |
6 |
6.5 |
7 |
7.5 |
8 |
8.5 |
9 |
9.5 |
10 |
|
5 |
||||||||||
|
|
|
|
|
|
λ |
|
|
|
|
|
170 |
|
|
|
|
|
|
|
|
|
|
160 |
|
|
|
|
|
|
optimal (ILP) |
|
|
|
|
|
|
|
|
|
|
heuristic |
|
|
|
|
|
|
|
|
|
|
optimal binding from [CCL00b] |
|
||
150 |
|
|
|
|
|
|
|
|
|
|
140 |
|
|
|
|
|
|
|
|
|
|
130 |
|
|
|
|
|
|
|
|
|
|
120 |
|
|
|
|
|
|
|
|
|
|
110 |
|
|
|
|
|
|
|
|
|
|
100 |
|
|
|
|
|
|
|
|
|
|
90 |
|
|
|
|
|
|
|
|
|
|
80 |
13.5 |
14 |
14.5 |
15 |
15.5 |
16 |
16.5 |
17 |
17.5 |
18 |
13 |
||||||||||
|
|
|
|
|
λ |
|
|
|
|
|
(e) PFB |
(f) RGB–YCrCb |
Fig. 6.12. Design-space exploration for the benchmark circuits of Section 4.6
6.5 Some Results |
145 |
Area penalty (%)
30
25
20
15
10
5
0
−5 30
25
25
20 |
|
|
20 |
15 |
|
|
|
|
|
15 |
|
Latency constraint |
10 |
|
|
10 |
|
||
|
No. operations |
||
|
5 |
||
|
5 |
||
|
|
00
Fig. 6.13. Variation with number of operations and latency constraint of area penalty for [CCL00b] over the proposed heuristic
word-length resources with longer latency. Even for relatively small graphs, area improvements of tens of percent are possible.
Fig. 6.14 illustrates the increase in implementation area from using the heuristic presented in Section 6.4 over the optimum combined problem presented in Section 6.3. These results can only be provided for modest problem size and a minimum latency constraint λ = λmin, as the ILP solution execution time scales rapidly with problem size and as the latency constraint is relaxed.
The variation of minimum-latency execution time with problem size for 200 graphs using the ILP model (solved with lp solve [Sch97] on a Pentium III 450MHz) and the heuristic algorithm (implemented in C on the same machine) is shown in Fig. 6.15, illustrating the polynomial complexity of the heuristic against the exponential complexity of the ILP. Over the range of 1 to 10 operations, the relative increase in area ranges from 0% to 16% whereas the ILP solution takes between one and three orders of magnitude greater time to execute. An important point not brought out by these results is the scaling of execution time with overall latency constraint. The number of variables in the ILP model scales with the latency constraint, making the execution time highly dependent on this parameter. This is illustrated in Table 6.1 for 200 9-operation sequencing graphs. By contrast, the execution time of Algorithm ArchSynth does not scale with the latency constraint. Thus the
146 6 Scheduling and Resource Binding
|
18 |
|
|
|
|
|
|
|
|
|
|
16 |
|
|
|
|
|
|
|
|
|
|
14 |
|
|
|
|
|
|
|
|
|
(%) |
12 |
|
|
|
|
|
|
|
|
|
optimum |
|
|
|
|
|
|
|
|
|
|
10 |
|
|
|
|
|
|
|
|
|
|
over |
|
|
|
|
|
|
|
|
|
|
premium |
8 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Area |
6 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
4 |
|
|
|
|
|
|
|
|
|
|
2 |
|
|
|
|
|
|
|
|
|
|
0 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
|
1 |
|||||||||
|
|
|
|
|
Number of operations |
|
|
|
|
Fig. 6.14. Variation of area premium for Algorithm ArchSynth over the optimum solution
one to three orders of magnitude illustrated in Fig. 6.15 are under conditions most favourable to the ILP-based solution.
Table 6.1. Variation of execution time for 200 graphs with λ/λmin for heuristic and ILP solution
λ/λmin |
|
heuristic (secs) |
ILP (mins:secs) |
1.00 |
|
3.02 |
2:07.09 |
1.05 |
|
3.51 |
4:05.21 |
1.10 |
|
3.73 |
15:55.56 |
|
|||
|
|||
1.15 |
|
3.52 |
>30:00.00 |
|
The results presented clearly indicate the practical nature of the heuristic presented in this chapter, whose execution results in a speedup of up to three orders of magnitude over ILP solution, even for modest graph sizes, at an area-penalty of between 0% and 16%.