AMF-Placer
2.0
An Open-Source Timing-driven Analytical Mixed-size FPGA Placer
|
Apart from the fundamental mechanisms to support macro placement, some optional optimization techniques are evaluated:
Since existing open-source analytical FPGA placers do not support mixed-size FPGA placement of aforementioned macros on Ultrascale devices, for comprehensive comparison, according to some state-of-the-art solutions, we implement baseline placement solution with the following features:
Below are the comparison data which are normalized. We will keep improving our implementation.
faceDetect | halfsqueezenet | optimsoc | minimap_GENE | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
HPWL | Rhpwl | time(s) | Rtime | HPWL | Rhpwl | time(s) | Rtime | HPWL | Rhpwl | time(s) | Rtime | HPWL | Rhpwl | time(s) | Rtime | |
proposed | 446908 | 1.000 | 105 | 1.054 | 339764 | 1.000 | 129 | 1.000 | 1771538 | 1.000 | 466 | 1.000 | 1275346 | 1.000 | 558 | 1.000 |
w/o tech1 | 479221 | 1.072 | 109 | 1.088 | 360567 | 1.061 | 132 | 1.023 | 9034564 | 5.100 | 871 | 1.870 | 5132073 | 4.024 | 1265 | 2.266 |
w/o tech2 | 487060 | 1.090 | 124 | 1.238 | 431386 | 1.270 | 240 | 1.857 | 1825819 | 1.031 | 543 | 1.164 | 1307249 | 1.025 | 626 | 1.121 |
w/o tech3 | 465208 | 1.041 | 131 | 1.315 | 491385 | 1.446 | 517 | 4.007 | 1841786 | 1.040 | 577 | 1.239 | 1790566 | 1.404 | 1019 | 1.826 |
w/o tech4 | 450199 | 1.007 | 100 | 1.000 | 351649 | 1.035 | 131 | 1.015 | 2658863 | 1.501 | 525 | 1.126 | 1291565 | 1.013 | 561 | 1.005 |
w/o tech5 | 511514 | 1.145 | 134 | 1.344 | 443765 | 1.306 | 138 | 1.067 | 1880431 | 1.061 | 546 | 1.172 | 1430330 | 1.122 | 708 | 1.268 |
baseline | 794201 | 1.777 | 234 | 2.338 | 1865746 | 5.491 | 329 | 2.549 | 2231978 | 1.260 | 559 | 1.199 | 11886747 | 9.320 | 3539 | 6.339 |
OpenPiton | digitRecognition | MemN2N | BLSTM_midDensity | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
HPWL | Rhpwl | time(s) | Rtime | HPWL | Rhpwl | time(s) | Rtime | HPWL | Rhpwl | time(s) | Rtime | HPWL | Rhpwl | time(s) | Rtime | |
proposed | 1139189 | 1.000 | 283 | 1.035 | 726929 | 1.000 | 254 | 1.097 | 801403 | 1.000 | 318 | 1.130 | 484650 | 1.000 | 326 | 1.241 |
w/o tech1 | 6218009 | 5.458 | 350 | 1.280 | 768796 | 1.058 | 256 | 1.105 | 963416 | 1.202 | 363 | 1.287 | 492194 | 1.016 | 285 | 1.083 |
w/o tech2 | 1157479 | 1.016 | 296 | 1.079 | 773957 | 1.065 | 272 | 1.174 | 862212 | 1.076 | 343 | 1.216 | 511861 | 1.056 | 296 | 1.126 |
w/o tech3 | 1159003 | 1.017 | 326 | 1.190 | 730071 | 1.004 | 321 | 1.389 | 916176 | 1.143 | 411 | 1.460 | 496540 | 1.025 | 364 | 1.383 |
w/o tech4 | 1212149 | 1.064 | 274 | 1.000 | 728094 | 1.002 | 231 | 1.000 | 822742 | 1.027 | 282 | 1.000 | 653010 | 1.347 | 307 | 1.169 |
w/o tech5 | 1142927 | 1.003 | 301 | 1.098 | 997207 | 1.372 | 244 | 1.053 | 923670 | 1.153 | 353 | 1.252 | 722495 | 1.491 | 263 | 1.000 |
baseline | 1432535 | 1.258 | 297 | 1.086 | 1047885 | 1.442 | 298 | 1.288 | 1610425 | 2.010 | 547 | 1.942 | 907383 | 1.872 | 325 | 1.237 |
The dominant algorithm for each stage in the proposed placement flow can be parallelized and in the table below, acceleration ratios are demonstrated by changing the number of threads and evaluating placement runtime.
#threads | Rosetta FaceDetect | SpooNN | OptimSoC | MiniMap2 | OpenPiton | Rosetta DigitRecog | MemN2N | BLSTM |
---|---|---|---|---|---|---|---|---|
8 threads | 2.17x | 2.07x | 2.50x | 2.86x | 2.63x | 2.22x | 2.61x | 2.23x |
4 threads | 2.01x | 1.96x | 2.29x | 2.57x | 2.37x | 2.04x | 2.35x | 1.97x |
2 threads | 1.56x | 1.52x | 1.64x | 1.81x | 1.65x | 1.54x | 1.68x | 1.77x |
1 threads | 1.00x | 1.00x | 1.00x | 1.00x | 1.00x | 1.00x | 1.00x | 1.00x |
Below is the comparison of AMF-Placer Placement (upper ones) and Vivado Placement (lower ones): yellow for CARRY macros, red for MUX macros, green for BRAM macros, purple for DSP macros, blue for LUTRAM macros. The view of device is rotated left by 90 degree.
The placement congestion level of AMF-Placer is similar to Vivado and below is a figure for the most congested benchmarks (MiniMap2 with PCIE banks and OpenPiton with DDR Interface):
The Runtime Log of ICCAD-2021 Benchmarks
Since we keep developing AMF-Placer to meet more expectation from reviewers and users (especially for timing optimization, multi-SLR/die optimization and more applications), we provide the runtime log files of the old version of placer which generated the result in the ICCAD-2021 paper. We suggest open the log files with VSCode since it seems that VSCode can highlight different parts of the log files for review convenience.
The paper only considered the improvement of wirelength, while now AMF-Placer considers the timing and the clock region impact during global placement and the resultant placements are more practical (e.g., the delays of critical paths are reduced by 20~40%). We are happy to pave the way for people who targets at the practical improvements to gain better results so please feel free to let us know your expectation or demands in the GitHub Issue.
References in this page:
[1] C.-W. Pui, G. Chen, W.-K. Chow, K.-C. Lam, J. Kuang, P. Tu, H. Zhang, E. F. Young, and B. Yu, “Ripplefpga: A routability-driven placement for large-scale heterogeneous fpgas,” in 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2016, pp. 1–8.
[2] J. Chen, Z. Lin, Y.-C. Kuo, C.-C. Huang, Y.-W. Chang, S.-C. Chen, C.-H. Chiang, and S.-Y. Kuo, “Clock-aware placement for large-scale heterogeneous fpgas,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 39, no. 12, pp. 5042–5055, 2020.
[3] W. Li, S. Dhar, and D. Z. Pan, “Utplacef: A routability-driven fpga placer with physical and congestion aware packing,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 4, pp. 869–882, 2017.