
DNNBuilder: An Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

Publication Date: 8 November 2018


Abstract

Building a high-performance FPGA accelerator for Deep Neural Networks (DNNs) often requires RTL programming, hardware verification, and precise resource allocation, all of which can be time-consuming and challenging even for seasoned FPGA developers. To bridge the gap between fast DNN construction in software (e.g., Caffe, TensorFlow) and slow hardware implementation, we propose DNNBuilder for automatically building high-performance DNN hardware accelerators on FPGAs. Novel techniques, including high-quality RTL neural network components, a fine-grained layer-based pipeline architecture, and a column-based cache scheme, are developed to meet the throughput and latency requirements of both cloud and edge devices while saving FPGA on-chip memory. To address the limited-resource challenge, we design an automatic design space exploration tool that generates optimized parallelism guidelines by considering external memory access bandwidth, data reuse behavior, FPGA resource availability, and DNN complexity. DNNBuilder is demonstrated on four DNNs (AlexNet, ZF, VGG16, and YOLO) and two FPGAs (XC7Z045 and KU115), corresponding to edge and cloud computing, respectively. The fine-grained layer-based pipeline architecture and the column-based cache scheme contribute 7.7x and 43x reductions in latency and BRAM utilization, respectively, compared to conventional designs. We achieve the best performance (up to 5.15x faster) and efficiency (up to 5.88x more efficient) compared to published FPGA-based classification-oriented DNN accelerators for both the edge- and cloud-computing cases. We reach 4218 GOPS when running an object detection DNN, which is, to the best of our knowledge, the highest throughput reported. DNNBuilder provides millisecond-scale real-time performance for processing HD video input and delivers higher efficiency (up to 4.35x) than GPU-based solutions.
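The column-based cache scheme keeps only the few input feature-map columns needed for the current convolution window on chip, instead of buffering whole frames, which is what drives the reported BRAM savings. The following is a minimal software sketch of that idea, assuming a single-channel KxK convolution with illustrative toy sizes (H, W, K); it is not the paper's RTL implementation, only an approximation of the caching behavior.

```cpp
// Minimal sketch (illustrative, not the paper's RTL): a column-based cache for a
// single-channel KxK convolution. Instead of buffering the whole input frame
// on chip, only K columns are kept in a circular buffer; each newly streamed-in
// column evicts the oldest one.
#include <cstdio>
#include <vector>

constexpr int H = 8, W = 8, K = 3;           // toy frame and kernel sizes

int main() {
    std::vector<float> frame(H * W);
    for (int i = 0; i < H * W; ++i) frame[i] = static_cast<float>(i);
    float kernel[K][K];
    for (int r = 0; r < K; ++r)
        for (int c = 0; c < K; ++c) kernel[r][c] = 1.0f / (K * K);  // box filter

    float cache[K][H];                        // column cache: K columns x H rows
    std::vector<float> out;                   // (H-K+1) x (W-K+1) valid outputs

    for (int col = 0; col < W; ++col) {
        // Stream in one new column, overwriting the oldest cached column.
        for (int row = 0; row < H; ++row)
            cache[col % K][row] = frame[row * W + col];

        if (col < K - 1) continue;            // not enough columns cached yet
        // Compute one output column using only the K cached columns.
        for (int row = 0; row + K <= H; ++row) {
            float acc = 0.0f;
            for (int kc = 0; kc < K; ++kc)
                for (int kr = 0; kr < K; ++kr)
                    acc += kernel[kr][kc] * cache[(col - K + 1 + kc) % K][row + kr];
            out.push_back(acc);
        }
    }
    std::printf("produced %zu outputs with a %d-column cache (vs. a %d-column frame)\n",
                out.size(), K, W);
    return 0;
}
```

In this sketch the on-chip buffer grows with K x H (the cached columns) rather than H x W (the full frame), which mirrors the memory-saving argument behind the column-based cache scheme.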

Author Affiliations
Cornell University, USA (IEEE Region 01, Northeastern U.S.)
Northwestern University, USA (IEEE Region 04, Central U.S.)
IBM Research, China (IEEE Region 10, Asia and Pacific)
University at Buffalo, USA (IEEE Region 01, Northeastern U.S.)
University of Illinois at Urbana-Champaign, USA (IEEE Region 04, Central U.S.)
University of Illinois at Urbana-Champaign, USA (IEEE Region 04, Central U.S.)