
Table 9 Benchmarking various methods for training models on CIFAR-100, STL-10, Imagenette, and for augmenting a batch of 32 RGB images of \(224 \times 224\) spatial resolution

From: Multi-sample \(\zeta \)-mixup: richer, more realistic synthetic samples from a p-series interpolant

| Method | CIFAR-100 (200 epochs) | STL-10 (200 epochs) | Imagenette (80 epochs) | Wall time for a [32, 3, 224, 224] torch.Tensor |
| --- | --- | --- | --- | --- |
| mixup [36] | 1h 20m ± 23s | 24m 59s ± 16.9s | 45m 39s ± 8.5s | 745\(\mu \)s ± 9.55\(\mu \)s |
| \(\zeta \)-mixup | 1h 20m ± 17s | 24m 58s ± 4.6s | 45m 34s ± 14.1s | 345\(\mu \)s ± 2.53\(\mu \)s |
| CutMix [39] | 1h 22m ± 13s | † | † | 176\(\mu \)s ± 1.4\(\mu \)s |
| CutMix [39] + \(\zeta \)-mixup | 1h 22m ± 9s | † | † | 169\(\mu \)s ± 757ns |
| Manifold Mixup [41] | 16h 15m (2000 epochs) | ‡ | ‡ | ‡‡ |
| Co-Mixup [96] | 16h 35m (300 epochs) | ‡ | ‡ | ‡‡ |
| Local synthetic instances [57] | § | § | § | 38.7ms ± 1.33ms |

  1. Note that mixup, \(\zeta \)-mixup, CutMix, and CutMix + \(\zeta \)-mixup require 200 epochs of training on CIFAR-100, whereas Manifold Mixup and Co-Mixup require 2000 and 300 epochs, respectively. CutMix experiments were performed only on CIFAR-10 and CIFAR-100, and training times on STL-10 and Imagenette are not available from the original paper (†). Similarly, given the large computational cost of Manifold Mixup and Co-Mixup, we did not train them on STL-10 and Imagenette, and their training times are not reported in the respective papers (‡). We were also unable to benchmark these two methods on a batch of 32 images (last column; ‡‡), since they require a DNN forward pass and gradient computation, respectively, for augmentation. Finally, the local synthetic instances method [57] is not optimized for training DNNs (§): it is two orders of magnitude slower than \(\zeta \)-mixup (see last column)
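As context for how the last-column timings could be measured, the sketch below times a single augmentation call on a random [32, 3, 224, 224] torch.Tensor. The two-sample mixup follows the standard \(\tilde{x} = \lambda x_i + (1-\lambda) x_j\), \(\lambda \sim \mathrm{Beta}(\alpha, \alpha)\) formulation; the p-series-weighted multi-sample mix is only a simplified illustration of the \(\zeta \)-mixup idea, and its helper names, the choice \(\gamma = 2.4\), and the per-sample permutation scheme are assumptions rather than the authors' implementation.

```python
# Hypothetical timing sketch (not the authors' benchmark code).
import timeit

import torch


def mixup(x: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Standard two-sample mixup: convex combination of the batch with a
    shuffled copy of itself, using lam ~ Beta(alpha, alpha)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    return lam * x + (1.0 - lam) * x[perm]


def p_series_mix(x: torch.Tensor, gamma: float = 2.4) -> torch.Tensor:
    """Simplified multi-sample mix: each output is a weighted sum of the
    whole batch, with weights from a normalized p-series k**(-gamma)
    scattered along an independent random permutation per output sample.
    This only illustrates the zeta-mixup idea; it is not the paper's code."""
    n = x.size(0)
    w = torch.arange(1, n + 1, dtype=x.dtype, device=x.device) ** (-gamma)
    w = w / w.sum()  # weights sum to 1; the heaviest weight dominates
    mix = torch.zeros(n, n, dtype=x.dtype, device=x.device)
    for i in range(n):
        mix[i, torch.randperm(n, device=x.device)] = w
    # A single matrix multiply mixes the flattened batch for all outputs.
    return (mix @ x.reshape(n, -1)).reshape_as(x)


if __name__ == "__main__":
    x = torch.rand(32, 3, 224, 224)  # batch matching the table's last column
    for name, fn in [("mixup", mixup), ("p-series mix", p_series_mix)]:
        sec = timeit.timeit(lambda: fn(x), number=100) / 100
        print(f"{name}: {sec * 1e6:.1f} us per batch")
```

Absolute numbers from such a script depend on hardware and implementation details, so it indicates only how per-batch augmentation cost can be isolated from training time, not the exact values reported in the table.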