Table 1 Detailed summary of related studies compared with the proposed precision-limitation technique using an asymmetric exponent

From: Neural network training with limited precision and asymmetric exponent

| Paper | Variable type | Technique | Category | Dataset | Topology | 32-bit baseline accuracy | Accuracy after limitation | AI framework |
|---|---|---|---|---|---|---|---|---|
| Gupta et al. [31] | 12-bit fixed point; 14-bit fixed point | Stochastic rounding | Software limitation; hardware design | MNIST | Custom LeNet | 99.23% | 99.17% (14-bit fixed point); 99.11% (12-bit fixed point) | Not defined |
| | | | | CIFAR10 | 3-layer CNN | 75.4% | 74.6% (14-bit fixed point); 71.2% (12-bit fixed point) | |
| Ortiz et al. [34] | 12-bit floating point; 12-bit fixed point | Stochastic rounding; context representation | Software limitation | CIFAR10 | 3-layer CNN | 75.6% | 63.03% (12-bit fixed point); 74.20% (12-bit floating point); 78.02% (12-bit context-float); 76.32% (12-bit context-fixed) | Caffe |
| Na and Mukhopadhyay [35] | 16-bit fixed point; 32-bit fixed point | Dynamic precision scaling (DPS); flexible multiplier-accumulator (MAC) | Hardware design | MNIST | LeNet | Not given (only loss charts presented) | 32-bit fixed-point accuracy achieved with 16-bit fixed point and DPS | Caffe |
| | | | | Flickr images | AlexNet (pre-trained) | | 64-bit fixed-point accuracy achieved with 32-bit fixed point and DPS | |
| Taras and Stuart [36] | 14-bit fixed point (weights); 16-bit fixed point (activations) | Stochastic rounding; dynamic precision scaling (DPS) | Software limitation | MNIST | LeNet | 98.80% | 98.80% | Caffe |
| Park et al. [37] | Combination of 8-bit and 16-bit integers | Stochastic gradient descent with Kahan summation; lazy update | Software limitation | MNIST | LeNet-like CNN | 99.10% | 99.24% | Caffe; TensorFlow |
| | | | | SVHN | 4-layer CNN | 97.06% | 96.99% | |
| | | | | CIFAR10 | 3-layer CNN | 81.56% | 81.17% | |
| | | | | | ResNet-20 | 90.16% | 90.23% | |
| | | | | ImageNet | AlexNet | 80.81% | 80.62% | |
| Fuketa et al. [39] | 9-bit floating-point format with hidden most significant bit and sign bit | Custom float representation; custom MAC unit | Software limitation; hardware design | ILSVRC | AlexNet | 48.27% | 46.18% | Not defined |
| | | | | | ResNet-50 | 68.84% | 67.55% | |
| Onishi et al. [42] | No strict parameter limitation; LUT-based factorization limits memory consumption and multiply-add operations | Lookup-table (LUT) based quantization; cluster swap | Software limitation; hardware design | MNIST | LeNet | 99.28% | 97.87%; memory consumption reduced by 22.2% (forward pass) and 60% (backward pass) | PyTorch |
| Lee et al. [43] | Fully variable weight bit-precision from 1 to 16 bits | Original hardware accelerator for CNN-RNN networks | Hardware design | Not applicable | AlexNet; VGG-16 | Not applicable | Operation-based power savings presented | Not applicable |
| Our proposal | 8-bit floating point; 12-bit floating point; 14-bit floating point | Asymmetric exponent; no additional rounding | Software limitation | MNIST | LeNet | 96.04% | 75.89% (8-bit floating point); 95.01% (12-bit floating point); 97.13% (14-bit floating point) | PyTorch |
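
To make the "Our proposal" row more concrete, the following is a minimal sketch of how a reduced-precision float with an asymmetric exponent window could be simulated in PyTorch (the framework listed for the proposal): each value's exponent is clamped to a range shifted toward negative exponents, and the mantissa is truncated without any additional rounding, mirroring the table entry. The function name `limit_float`, the example window [-11, 4], and the 3-bit mantissa are illustrative assumptions, not the exact format used in the paper.

```python
import torch

def limit_float(x: torch.Tensor,
                mantissa_bits: int = 3,
                exp_min: int = -11,
                exp_max: int = 4) -> torch.Tensor:
    """Simulate a reduced-precision float with an asymmetric exponent window.

    The window [exp_min, exp_max] is an assumed example, shifted toward
    negative exponents because weights, activations and gradients are
    typically smaller than 1 in magnitude. The mantissa is truncated to
    `mantissa_bits` fractional bits, with no additional rounding.
    """
    sign = x.sign()
    mag = x.abs()
    # Per-element exponent, clamped to the asymmetric range.
    exp = torch.floor(torch.log2(mag.clamp_min(1e-38))).clamp(exp_min, exp_max)
    # Quantization step for each value: 2^(exp - mantissa_bits).
    step = torch.pow(2.0, exp - mantissa_bits)
    # Truncate the mantissa; values far below 2^exp_min flush to zero.
    quant = torch.floor(mag / step) * step
    # Saturate at the largest representable magnitude.
    max_val = (2.0 - 2.0 ** (-mantissa_bits)) * 2.0 ** exp_max
    return sign * quant.clamp(max=max_val)

if __name__ == "__main__":
    w = torch.randn(4) * 0.05
    print(w)
    print(limit_float(w))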