Deep learning accelerators: a case study with MAESTRO

Journal of Big Data

Table 2 Simulation Results For NLR and NVDLA

Data Flow	NLR	NVDLA
Buffer analysis
L1 Buffer Requirement (Byte)	18.00	66.00
L2 Buffer Requirement (KB)	1.12	4.12
L1RdSum	7,225,344	451,584
L1WrSum	7,225,344	451,584
L2RdSum	462,422,016	28,901,376
L2WrSum	462,422,016	28,901,376
L1 weight reuse	1	16
L1 input reuse	4	16
L2 weight reuse	448	190.26
L2 input reuse	2633	4473
NoC analysis
L1 to L2 NoC BW	128	32
L2 to L1 NoC BW	160	1024
Performance analysis
L1 to L2 Sum	56	32
L1 to L2 Delay	4.43	4.25
L2 to L1 Delay	0	0
Roofline Throughput (GFLOPS with 1 GHz clock)	896	128
Compute Runtime	169	421
Total Runtime (cycles)	1,428,553,728	384,072,192
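
The two columns are easier to compare as ratios. The following is a minimal sketch that takes a few rows of Table 2 and divides the NLR value by the NVDLA value. The dictionary `table2` and the helper `nlr_over_nvdla` are hypothetical names introduced only for this illustration; the numbers are copied from the table and nothing is re-simulated.

```python
# Selected rows of Table 2 as (NLR, NVDLA) pairs, copied verbatim from the table.
# `table2` and `nlr_over_nvdla` are illustrative names, not part of MAESTRO.
table2 = {
    "L1RdSum": (7_225_344, 451_584),
    "L2RdSum": (462_422_016, 28_901_376),
    "Total Runtime (cycles)": (1_428_553_728, 384_072_192),
}


def nlr_over_nvdla(metric):
    """Return the NLR value divided by the NVDLA value for one table row."""
    nlr, nvdla = table2[metric]
    return nlr / nvdla


for metric in table2:
    print(f"{metric}: NLR/NVDLA = {nlr_over_nvdla(metric):.2f}")
```

With these figures, NLR issues 16 times as many L1 and L2 reads as NVDLA and takes roughly 3.7 times as many total cycles.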