Table: comparison with related work, from *Neural network training with limited precision and asymmetric exponent*
Paper | Variable type | Technique | Category | Dataset | Topology | 32-bit baseline accuracy | Accuracy after limitation | AI framework |
---|---|---|---|---|---|---|---|---|
Gupta et al. [31] | 12-bit fixed point; 14-bit fixed point | Stochastic rounding (sketch below) | Software limitation; hardware design | MNIST | Custom LeNet | 99.23% | 99.17% (14-bit fixed point); 99.11% (12-bit fixed point) | Not defined |
| | | | | CIFAR10 | 3-layer CNN | 75.4% | 74.6% (14-bit fixed point); 71.2% (12-bit fixed point) | |
Ortiz et al. [34] | 12-bit floating point; 12-bit fixed point | Stochastic rounding; context representation | Software limitation | CIFAR10 | 3-layer CNN | 75.6% | 63.03% (12-bit fixed point); 74.20% (12-bit floating point); 78.02% (12-bit context-float); 76.32% (12-bit context-fixed) | Caffe |
Na and Mukhopadhyay [35] | 16-bit fixed point; 32-bit fixed point | Dynamic precision scaling (DPS, sketch below); flexible multiplier-accumulator (MAC) | Hardware design | MNIST | LeNet | Not given (only loss charts presented) | 32-bit fixed-point accuracy achieved with 16-bit fixed point and DPS | Caffe |
| | | | | Flickr images | AlexNet (pre-trained) | Not given (only loss charts presented) | 64-bit fixed-point accuracy achieved with 32-bit fixed point and DPS | |
Taras and Stuart [36] | 14-bit fixed point (weights); 16-bit fixed point (activations) | Stochastic rounding; dynamic precision scaling (DPS) | Software limitation | MNIST | LeNet | 98.80% | 98.80% | Caffe |
Park et al. [37] | Combination of 8-bit and 16-bit integers | Stochastic gradient descent with Kahan summation (sketch below); lazy update | Software limitation | MNIST | LeNet-like CNN | 99.10% | 99.24% | Caffe; TensorFlow |
| | | | | SVHN | 4-layer CNN | 97.06% | 96.99% | |
| | | | | CIFAR10 | 3-layer CNN | 81.56% | 81.17% | |
| | | | | CIFAR10 | ResNet-20 | 90.16% | 90.23% | |
| | | | | ImageNet | AlexNet | 80.81% | 80.62% | |
Fuketa et al. [39] | 9-bit floating-point format with hidden most significant bit and sign bit | Custom float representation; custom MAC unit | Software limitation; hardware design | ILSVRC | AlexNet | 48.27% | 46.18% | Not defined |
| | | | | ILSVRC | ResNet-50 | 68.84% | 67.55% | |
Onishi et al. [42] | No strict parameter limitation; LUT-based factorization is used to limit memory consumption and multiply-add operations | Lookup-table (LUT) based quantization (sketch below); cluster swap | Software limitation; hardware design | MNIST | LeNet | 99.28% | 97.87%; memory consumption reduced by 22.2% (forward pass) and 60% (backward pass) | PyTorch |
Lee et al. [43] | Fully variable weight bit precision from 1 to 16 bits | Dedicated hardware accelerator for CNN-RNN networks | Hardware design | Not applicable | AlexNet; VGG-16 | Not applicable | Operation-based power savings presented | Not applicable |
Our proposal | 8-bit floating point; 12-bit floating point; 14-bit floating point | Asymmetric exponent (sketch below); no additional rounding | Software limitation | MNIST | LeNet | 96.04% | 75.89% (8-bit floating point); 95.01% (12-bit floating point); 97.13% (14-bit floating point) | PyTorch |
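
Several techniques in the table can be illustrated with short sketches. The first is stochastic rounding, used by Gupta et al. [31], Ortiz et al. [34], and Taras and Stuart [36]: a value is rounded up with probability equal to its fractional remainder, so the quantization is unbiased in expectation. A minimal PyTorch sketch, assuming a plain fixed-point grid (the `frac_bits` parameter is illustrative, not the exact formats used in those papers):

```python
import torch

def stochastic_round_fixed(x: torch.Tensor, frac_bits: int = 10) -> torch.Tensor:
    """Quantize x onto a fixed-point grid with `frac_bits` fractional bits,
    rounding up with probability equal to the fractional remainder."""
    scale = 2.0 ** frac_bits
    scaled = x * scale
    floor = torch.floor(scaled)
    # P(round up) = remainder, so E[quantized x] = x (unbiased rounding).
    round_up = (torch.rand_like(scaled) < (scaled - floor)).to(x.dtype)
    return (floor + round_up) / scale
```

Unbiasedness is what lets tiny gradient updates survive: a single update smaller than the grid step is usually lost under round-to-nearest, but under stochastic rounding its expected contribution over many steps is preserved.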
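Dynamic precision scaling (Na and Mukhopadhyay [35], Taras and Stuart [36]) moves the fixed-point radix point during training instead of fixing it up front. A rough sketch under assumed heuristics; the overflow/headroom thresholds and the function name are placeholders, not the published algorithms:

```python
import torch

def adjust_frac_bits(x: torch.Tensor, frac_bits: int, total_bits: int = 16) -> int:
    """Pick the number of fractional bits for the next iteration: fewer when
    values overflow the current range, more when there is headroom."""
    int_bits = total_bits - 1 - frac_bits          # one bit reserved for the sign
    max_val = 2.0 ** int_bits
    peak = x.abs().max().item()
    if peak >= max_val:                            # overflow: widen the integer part
        return max(0, frac_bits - 1)
    if peak < max_val / 4:                         # headroom: reclaim fractional precision
        return min(total_bits - 1, frac_bits + 1)
    return frac_bits
```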
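Park et al. [37] keep low-precision weights trainable by accumulating updates with Kahan (compensated) summation, so low-order bits lost in one step are re-injected in the next. A self-contained sketch of the idea; the class name and the plain-SGD setting are assumptions, since the paper combines this with 8/16-bit integer storage and lazy updates:

```python
import torch

class KahanSGD:
    """Plain SGD whose weight additions use Kahan summation: a per-parameter
    compensation buffer carries the rounding error of each update forward."""

    def __init__(self, params, lr: float = 0.01):
        self.params = [p for p in params]
        self.lr = lr
        self.comp = [torch.zeros_like(p) for p in self.params]

    @torch.no_grad()
    def step(self):
        for p, c in zip(self.params, self.comp):
            if p.grad is None:
                continue
            y = -self.lr * p.grad - c      # re-apply the previously lost bits
            t = p + y                      # low-order bits of y may be lost here
            c.copy_((t - p) - y)           # recover exactly what was lost
            p.copy_(t)
```

It is used like a regular optimizer (construct with `model.parameters()`, call `step()` after `backward()`); the compensation only becomes significant when the parameters are stored more coarsely than the updates.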
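Onishi et al. [42] replace weights with indices into a small lookup table. As a generic illustration of LUT-based quantization (the k-means clustering here is a stand-in, not the authors' factorization or cluster-swap procedure):

```python
import torch

def lut_quantize(w: torch.Tensor, k: int = 16, iters: int = 10):
    """Cluster weights into k centroids (the LUT) and return per-weight
    indices, so each weight is stored as a small integer index."""
    flat = w.flatten()
    lut = torch.quantile(flat, torch.linspace(0, 1, k))   # spread initial centroids
    for _ in range(iters):
        idx = (flat[:, None] - lut[None, :]).abs().argmin(dim=1)
        for j in range(k):
            members = flat[idx == j]
            if members.numel() > 0:        # leave empty clusters in place
                lut[j] = members.mean()
    idx = (flat[:, None] - lut[None, :]).abs().argmin(dim=1)
    return lut, idx.reshape(w.shape).to(torch.uint8)
```

Reconstruction is a table lookup, `lut[idx.long()]`, so only the k-entry table is kept in full precision while each weight costs log2(k) bits.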
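The last row is the paper's own approach: low-bit floating-point formats whose mantissa is truncated with no additional rounding and whose exponent range is asymmetric, which we read here as skewed toward negative exponents, where training-time weights, activations, and gradients mostly live. A minimal simulation sketch; `mant_bits`, `exp_min`, and `exp_max` are illustrative placeholders, not the exact 8/12/14-bit layouts:

```python
import torch

def limit_float(x: torch.Tensor, mant_bits: int = 5,
                exp_min: int = -12, exp_max: int = 3) -> torch.Tensor:
    """Simulate a reduced-precision float: truncate the mantissa to
    `mant_bits` bits (no rounding) and confine the exponent to the
    asymmetric range [exp_min, exp_max]."""
    mant, exp = torch.frexp(x)                 # x = mant * 2**exp, |mant| in [0.5, 1)
    step = 2.0 ** (-mant_bits)
    mant = torch.trunc(mant / step) * step     # drop low-order mantissa bits
    y = torch.ldexp(mant, exp)
    y = torch.where(exp < exp_min, torch.zeros_like(y), y)   # underflow flushes to zero
    max_mag = (1.0 - step) * 2.0 ** exp_max                  # largest representable value
    return y.clamp(-max_mag, max_mag)                        # overflow saturates
```

Applying such a function to tensors after each optimizer step is one way to emulate the limited format purely in software, which matches the "software limitation" category of the proposal.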