A survey on deep learning tools dealing with data scarcity: definitions, challenges, solutions, tips, and applications

Abstract

Data scarcity is a major challenge when training deep learning (DL) models. DL demands a large amount of data to achieve exceptional performance. Unfortunately, many applications have small or inadequate datasets for training DL frameworks. Labeled data are usually produced by manual labeling, which typically involves human annotators with extensive background knowledge. This annotation process is costly, time-consuming, and error-prone. Every DL framework is usually fed a significant amount of labeled data to automatically learn representations. In general, a larger amount of data yields a better DL model, although performance is also application dependent. This issue is the main barrier that leads many applications to dismiss the use of DL. Having sufficient data is the first step toward any successful and trustworthy DL application. This paper presents a holistic survey of state-of-the-art techniques for training DL models to overcome three challenges: small datasets, imbalanced datasets, and lack of generalization. The survey starts by listing the learning techniques. Next, the types of DL architectures are introduced. After that, state-of-the-art solutions to address the lack of training data are listed, such as Transfer Learning (TL), Self-Supervised Learning (SSL), Generative Adversarial Networks (GANs), Model Architecture (MA), Physics-Informed Neural Network (PINN), and Deep Synthetic Minority Oversampling Technique (DeepSMOTE). These solutions are followed by related tips on the data acquisition needed prior to training, as well as recommendations for ensuring the trustworthiness of the training dataset. The survey ends with a list of applications that suffer from data scarcity, and several alternatives are proposed to generate more data in each application, including Electromagnetic Imaging (EMI), Civil Structural Health Monitoring, Medical Imaging, Meteorology, Wireless Communications, Fluid Mechanics, Microelectromechanical Systems, and Cybersecurity. To the best of the authors’ knowledge, this is the first review that offers a comprehensive overview of strategies to tackle data scarcity in DL.

Introduction

Deep learning (DL) is a subset of machine learning (ML) that offers great flexibility and learning power by representing the world as a nested hierarchy of concepts, in which each concept is defined in terms of simpler ones and more abstract representations are computed from less abstract ones [1,2,3,4,5,6]. Specifically, DL learns categories incrementally through its hidden-layer architecture. In text, for example, low-, medium-, and high-level categories correspond to letters, words, and sentences, respectively. In face recognition, dark or light regions are determined first, before geometric primitives such as lines and shapes are identified. Every node represents one aspect of the whole, and together the nodes provide a full representation of the image. Every node has a weight that reflects the strength of its link with the output, and the weights are adjusted as the model is trained. The popularity and major benefit of DL stem from its being powered by massive amounts of data. The emergence of Big Data has created more opportunities for DL innovation [7]. Andrew Ng, one of the leaders of the Google Brain Project and China’s Baidu chief scientist, asserted that “The analogy to DL is: rocket engine is DL models, while fuel is massive data needed to feed the algorithms.”

In contrast to conventional ML algorithms, DL demands high-end hardware: graphics processing units (GPUs) and Tensor Processing Units (TPUs) are integral to achieving high performance [8]. The hand-crafted features used by ML tools must be determined by domain experts in order to reduce the complexity of the data and make patterns visible enough for learning algorithms to work. DL algorithms, however, learn data features automatically, so laborious feature extraction is avoided and less effort is required from domain experts. While DL addresses a problem end-to-end, ML breaks the problem down into several parts and combines the outcomes at the final stage. For instance, DL tools such as YOLO (You Only Look Once) detect multiple objects in an input image in one run and generate a composite output containing the class name and location [9, 10]; the same holds for image classification [11, 12]. In ML, approaches such as Support Vector Machines (SVMs) detect objects in several steps: (a) extracting features, e.g., histogram of oriented gradients (HOG), (b) training a classifier on the extracted features, and (c) detecting objects in the image with the classifier. The performance of such ML algorithms relies on the selected features, which may not be the right ones to discriminate between classes [1]. DL is therefore a good way to replace the lengthy ML pipeline with a more automated one. Figure 1 shows the difference between the DL and ML approaches.

Fig. 1. The difference between DL and traditional ML

Large training data (e.g., the ImageNet dataset) ensures suitable performance of DL, while inadequate training data yields poor outcomes [1, 13,14,15]. Meanwhile, the more elaborate design of DL gives it an inherent advantage in handling intricate data. However, extracting sufficiently complex patterns demands a copious amount of data to give meaningful output; convolutional neural networks (CNNs) [16,17,18] are a clear example of this.

The challenge of data scarcity in training DL models presents a significant obstacle for many applications, leading to the dismissal of the use of DL. To achieve reliable and accurate outcomes in DL, it is essential to initiate the training process with a significant and varied dataset. Utilizing a large dataset helps to enhance the model’s ability to learn and identify patterns, while diversity in the dataset ensures that the model can generalize to new and unseen instances. This initial step plays a pivotal role in ensuring that the model produces reliable results and can be trusted for real-world applications. As a result, researchers and practitioners have been working to develop state-of-the-art techniques to overcome the data scarcity issue in DL. This has motivated us to provide an overview of the latest techniques for addressing the data scarcity issue, including Transfer Learning (TL), Self-Supervised Learning (SSL), Generative Adversarial Networks (GANs), Model Architecture (MA), Physics-Informed Neural Networks (PINN), and Deep Synthetic Minority Oversampling Technique (DeepSMOTE).

This paper presents a comprehensive survey of these techniques, which can be used to address three main challenges in training DL models, namely small datasets, imbalanced datasets, and lack of generalization. To this end, we have formulated seven main questions that are addressed in this review.

  • What are the various types of learning techniques utilized in DL, and how do they differ in their effectiveness in addressing the challenges of data scarcity?

  • What are various DL architectures?

  • What are the most effective solutions to address the issue of data scarcity in DL, and how do these solutions perform in comparison to traditional data augmentation techniques, such as transfer learning and generative models, in various applications such as image classification, natural language processing, and speech recognition?

  • How can the use of the listed solutions to address limited training data in DL be applied to various sub-applications, and what are the challenges and potential solutions for collecting new data in these areas?

  • What are the most effective pre-training and testing tips for utilizing datasets in DL, and how do they impact the accuracy and efficiency of DL models?

  • What are the best practices and guidelines for reporting datasets used in DL, and how can they improve the reproducibility, transparency, and reliability of DL research?

  • How can trustworthy training datasets be defined, identified, and evaluated for use in DL, and what are the implications of using such datasets on the accuracy, fairness, and ethical considerations of DL models?

This review is aimed at presenting the most significant aspects of training data and how it is related to achieving high-quality outcomes when using DL. Specifically, optimal performance of DL requires a large amount of data [1] but many real-world applications suffer from insufficient training data. Therefore, our contributions are as follows:

  • To the best of our knowledge, this is the first comprehensive review that studies the importance and the main aspects of training data for DL.

  • Learning techniques and DL architectures are explained in detail.

  • Several approaches dealing with data scarcity are accordingly introduced including Transfer Learning (TL), Self-supervised learning (SSL), Generative Adversarial Networks (GANs), and model architecture. Furthermore, alternatives that help to deal with the lack of training data are reviewed, including the concepts of a Physics Informed Neural Network (PINN) and DeepSMOTE.

  • Several tips about the data are provided for use before training DL models. These tips help researchers achieve a full understanding of what they need to know before progressing to any further training stage.

  • A list is provided of typical applications in which dealing with data scarcity in DL has been less explored. An analysis of why those applications did not carry out a suitable study of the training data is also given. Typical applications include electromagnetic imaging, civil structural health monitoring, meteorology, medical imaging, wireless communications, fluid mechanics, microelectromechanical systems, and cybersecurity. Moreover, different alternatives are provided to tackle the data scarcity issue in a more suitable manner.

  • This review offers suggestions regarding how to properly report the dataset when using DL.

  • Finally, the key requirements for a trustworthy training dataset for DL have been discussed.

The rest of the paper is structured as follows: “Survey methodology” section describes the survey methodology, followed by “Types of learning” section which presents the state-of-the-art learning techniques. DL architectures are introduced in “Deep learning architectures” section, while “Lack of training data: issues and solutions” section details the current approaches to dealing with data scarcity. “Pre-training and testing tips of using dataset” section provides pre-training and testing tips for specific datasets. “Applications” section introduces DL applications that are less utilized due to the lack of training data. “Tips for reporting the dataset” section focuses on the usage of new designs for reporting datasets when using DL. Trustworthy requirements for training data in DL are listed in “Trustworthy training datasets” section. “Discussion” section presents the discussion with future open lines, and finally, “Conclusion” section concludes the paper.

Survey methodology

We have reviewed the significant research papers in the field published during 2019–2022, mainly from 2021 and 2022. Our search was mainly conducted across six reputable publishers: IEEE, Elsevier, Nature, ACM, Wiley, and Springer. Some papers were chosen from arXiv. We have reviewed more than 630 papers on the topics of the review. Of these, 227 papers were published in 2022–2023, 205 in 2021, 39 in 2020, and 45 in 2019. These statistics show that this review focuses on recent publications on the topic. The selected papers have been categorized into five groups: (1) learning techniques, (2) DL architectures, (3) tips and trustworthiness requirements for training datasets, (4) solutions to the lack of training data, and (5) applications. The categorization aims to help readers efficiently navigate the complex landscape of DL research and applications by grouping related papers together based on their primary focus. Additionally, by emphasizing the issue of data scarcity across several categories, readers can gain a better understanding of the challenges and potential solutions associated with this problem in DL.

The search criteria for this review used the following queries, which were chosen by experts in the field: (“Deep Learning”), (“Data scarcity”), (“Convolutional Neural Network”), (“Deep Learning” AND “Architectures”), (“Deep Learning” AND “learning techniques”), (“Deep Learning” AND “detection” OR “classification” OR “segmentation” OR “Localization”), (“Deep Learning” AND “lack of training data”), (“Deep Learning” AND “Transfer Learning”), (“Deep Learning” AND “Generative Adversarial Networks”), (“Generative Adversarial Networks”), (“Generative Adversarial Networks types”), (“Generative Adversarial Networks applications”), (“Deep Learning” AND “small dataset”), (“Deep Learning” AND “Electromagnetic Imaging”), (“Deep Learning” AND “Civil Structural Health Monitoring”), (“Deep Learning” AND “Meteorology”), (“Deep Learning” AND “Wireless Communications”), (“Deep Learning” AND “Fluid Mechanics”), (“Physics-Informed Neural Network”), (“Deep Learning” AND “vulnerabilities”), (“Industrial Automation” AND “Transfer Learning”), (“Medical Imaging” AND “Transfer Learning”), (“Deep Learning” AND “Cybersecurity”), (“Wireless Communication” AND “Transfer Learning”), (“Plant Diseases” AND “Transfer Learning”), (“Natural Language Processing” AND “Transfer Learning”), (“Machinery Fault” AND “Transfer Learning”), (“Software Defect” AND “Transfer Learning”), (“Activity Recognition” AND “Transfer Learning”), (“Object Detection” AND “Transfer Learning”), (“Internet of Things” AND “Transfer Learning”), (“Trustworthy data” AND “Deep Learning”). Figure 2 depicts the search structure of this review.

Fig. 2. Search framework

Types of learning

This section presents various learning types which will help the readers to know what type suits their task. Figure 3 illustrates 14 learning types commonly deployed by artificial intelligence (AI) specialists.

Fig. 3. Learning types

Learning problems

  1. Supervised learning

    A model is used to learn a mapping between input instances and a target variable [19, 20]. Systems in which the training data comprise instances of input vectors together with their target vectors are known as supervised learning problems. The two problem types are classification and regression [21,22,23]. Classification is a supervised learning problem that predicts a class label, whereas regression is a supervised learning problem that predicts a numerical label [24]. Regression and classification problems can have one or more input variables, and the input may be of any data type (e.g., categorical or numerical) [24]. MNIST, a handwritten digit dataset whose inputs are digit images (pixel data), is an example of a classification problem [25]. Several ML algorithms are called ‘supervised ML algorithms’ because they address supervised learning problems, e.g., SVMs and decision trees [26, 27]. The term supervised applies to these algorithms because they learn by making predictions from the input data and having those predictions corrected against the known targets, so that the model can yield useful output [28]. Some techniques suit only classification (logistic regression) or regression (linear regression), whereas some suit both problem types with slight alteration [artificial neural network (ANN)] [29,30,31,32,33].

  2. Unsupervised learning

    This type of learning covers problems in which a model is used to describe or extract relationships in the data. In contrast to supervised learning, unsupervised learning uses only input data, without any target or output variable [34, 35]. Hence, this learning type has no instructor to correct the model. The two main types of unsupervised learning are clustering and density estimation. In clustering, groups are sought in the data [34,35,36], while density estimation summarises the distribution of the data [37, 38]. An example of clustering is K-Means, where k denotes the number of cluster centres to find in the dataset [36, 39]. An example of density estimation is Kernel Density Estimation, which uses small groups of closely related data samples to estimate the distribution of new points in the problem space [37, 38]. Both density estimation and clustering can be deployed to learn patterns in the data. Other unsupervised approaches are visualization (plotting or graphing the data) and projection (lowering the dimensionality of the data) [39]. Visualization helps one make sense of large quantities of data using interactive and standardized visuals in a given context [40, 41], presenting the data as a narrative with linkages, patterns, and trends [42]. Projection, on the other hand, involves building a lower-dimensional representation of the data [43]. Principal component analysis is a widely used projection method; reducing dimensionality in this way lowers the computational cost of working with data that has many dimensions [43,44,45,46].

  3. Reinforcement learning

    This learning type is a class of problems in which an agent must learn to act in a specific environment using feedback [47,48,49]. Despite its similarity to supervised learning, the feedback in reinforcement learning may be delayed and noisy, which makes it challenging for the agent or model to connect cause and effect [50, 51]. Examples of reinforcement learning algorithms are temporal difference learning, deep reinforcement learning, and Q-learning [52,53,54].
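
To make the idea of learning from delayed feedback concrete, the following is a minimal sketch of tabular Q-learning, one of the reinforcement learning algorithms named above. The chain environment, the function names, and the values of `alpha`, `gamma`, and `epsilon` are illustrative assumptions and do not come from the surveyed works.

```python
import random


def q_learning_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value."""
    best_next = max(q[next_state].values())
    td_target = reward + gamma * best_next
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


def train_chain(n_states=4, episodes=200, epsilon=0.2, seed=0):
    """Hypothetical toy task: walk left/right on a chain; reward 1 at the right end."""
    rng = random.Random(seed)
    q = {s: {"left": 0.0, "right": 0.0} for s in range(n_states)}
    for _ in range(episodes):
        state = 0
        while state != n_states - 1:
            # Epsilon-greedy: explore with probability epsilon, else act greedily.
            if rng.random() < epsilon:
                action = rng.choice(["left", "right"])
            else:
                action = max(q[state], key=q[state].get)
            next_state = max(0, state - 1) if action == "left" else state + 1
            # The reward is delayed: it only arrives on reaching the last state.
            reward = 1.0 if next_state == n_states - 1 else 0.0
            q_learning_update(q, state, action, reward, next_state)
            state = next_state
    return q
```

In this sketch the reward is only observed at the end of the chain, yet the update gradually propagates its value back to earlier states, which is one simple way an agent can connect an action to a delayed effect.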

Hybrid learning problems

  1. Semi-supervised learning

    This learning type uses training data that contain many unlabeled instances and only a few labeled ones [55, 56]. It aims to make efficient use of all the data, not just the labeled data as in supervised learning [57, 58]. It can borrow unsupervised methods such as clustering and density estimation to exploit the unlabeled data [59, 60]. After patterns or groups have been identified, supervised techniques are used to assign labels to the unlabeled instances, and the newly labeled data are then used to arrive at more precise predictions [61,62,63]. The method is used for image, audio (automated speech recognition), and text [natural language processing (NLP)] data, for which fully supervised learning is often unviable because labeled examples are scarce [64,65,66,67].

  2. Self-supervised learning

    In this technique, only unlabeled data are used to construct a pretext learning task (e.g., image rotation, context prediction, etc.) whose targets can be computed without supervision [68,69,70,71]. An example of this learning type is the autoencoder, an NN that learns a compact representation of an input sample [72, 73]. The model has an encoder and a decoder separated by a bottleneck that holds the internal compact representation of the input [74]. An autoencoder is trained by supplying the input as both the input and the target output: the model encodes the input into a compact representation and then decodes it to reproduce the original [75]. After training, the decoder is discarded and the encoder is deployed to yield the desired compressed input representations. Traditionally, autoencoders were applied to feature learning or dimensionality reduction [76, 77]. Another example of this learning type is GANs, which are commonly used to generate synthetic images using only unlabeled data from the target domain [78,79,80].

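  3. Multi-instance learning

    In this learning type, labels are attached to collections (bags) of examples rather than to individual examples: a bag is labeled as containing or not containing an example of a class, while the individual members of the collection are unlabeled [81,82,83,84].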
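
The image-rotation pretext task mentioned under self-supervised learning can be sketched as follows. This is only an illustration of how targets are computed from unlabeled data; the function names and the representation of an image as a list of pixel rows are assumptions made for the example.

```python
def rotate90(image):
    """Rotate a 2D image (a list of pixel rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def make_rotation_pretext(images):
    """Turn unlabeled images into (rotated_image, label) pairs.

    The label k means the image was rotated by k * 90 degrees, so the
    targets are derived from the data itself, with no human annotation.
    """
    pairs = []
    for image in images:
        rotated = [list(row) for row in image]
        for k in range(4):
            pairs.append((rotated, k))
            rotated = rotate90(rotated)
    return pairs
```

A network trained to predict the label `k` from the rotated image has to learn features of the image content, and those features can later be reused for a downstream task with few labels.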

Statistical inference

Inference is the process of reaching a conclusion or decision. In DL, both fitting a model and making a prediction are types of inference [85]. The inference approaches that describe how DL algorithms solve learning problems are induction, deduction, and transduction. Induction is learning a general model from specific examples, deduction is using a model to make predictions, and transduction is making predictions for specific instances directly from other specific instances [86, 87].

  1. Inductive learning

    This learning type uses evidence to determine outcomes. In inductive learning, the algorithm learns from specific historical examples, so that general rules (the model) are learned from the data [88, 89]. Applied to a DL model, induction is a generalization from the specific instances that serve as training data to a hypothesis or model that is presumed to hold for new, unseen data [90, 91].

  2. Deductive inference

    In this approach, general concepts are used to determine specific outcomes. Deduction is the reverse of induction [92]: while induction moves from the specific to the general, deduction progresses from the general to the specific [92]. Induction is bottom-up reasoning that uses the available evidence to support a result, whereas deduction is top-down reasoning that requires all premises to be satisfied before a conclusion is reached [93]. In DL, induction is used to fit a model to a training dataset, and the fitted model is then used deductively to make predictions [94].

  3. Transductive learning

    In statistical learning theory, transduction describes predicting specific examples given specific examples from a domain [95]. It reasons from concrete instances to concrete instances rather than learning universal rules as in induction [96]. A new notion of inference arises when the model estimates the value of a function only at the given points of interest [97], the principle being to obtain the best result from a restricted amount of information [95, 98]. The k-nearest neighbor algorithm is a typical transductive algorithm: it does not model the training data but uses it directly each time a prediction is made [99, 100].
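
The k-nearest neighbor behaviour described above, predicting directly from stored examples without fitting a model, can be sketched in a few lines. The function name, the Euclidean distance, and the majority vote are assumptions chosen for the illustration.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    No model is fitted: the stored training data are consulted directly
    every time a prediction is requested.
    """
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]
```

For example, with three points near the origin labeled `"a"` and three points near `(5, 5)` labeled `"b"`, a query at `(0.2, 0.1)` is assigned to `"a"` purely from its neighbours.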

Learning techniques

  1. Multi-task learning

    In this method, generalization is improved by combining information from several tasks (which acts as a soft constraint on the parameters) [101, 102]. The method is useful when plenty of labeled input data is available for one task and can be shared with another task that has far less labeled data [103, 104]. For example, the same input patterns may be used for several different outputs or supervised learning problems [105]. Each output is predicted by a different part of the model, which allows the core of the model to generalize across the tasks for the same inputs [106, 107].

    The study in [108] presents a common framework for evaluating multi-task learning methods for 2D/3D city modeling using fixed-wing Unmanned Aerial Vehicle (UAV) images [109, 110]. Single-task learning may perform well, but as the number of tasks increases, the benefits of knowledge transfer become limited. Multi-task learning improves generalization by utilizing domain-specific information from related tasks, and it has emerged as a solution to knowledge transfer issues. The study highlights the importance of automated multi-task data analysis for scene understanding in urban management applications, such as infrastructure development, traffic monitoring, smart 3D cities, and change detection, which require precise urban models based on semantic, instance, and panoptic annotation, as well as monocular depth estimation.

  2. Active learning

    During learning, the model may pose queries to a human operator in order to resolve ambiguity [111,112,113]. Active learning is a form of supervised learning that aims to achieve similar or better outcomes than passive supervised learning while being more data-efficient [114, 115]. Its main principle is to let the DL algorithm select the data it learns from, so that accurate predictions can be obtained with fewer training labels [114]. When a query is raised, the selected unlabeled examples are labeled by the human annotator [112, 116]. This method is crucial when labeling or gathering new data is costly and little data is available [117]. Active learning thus improves model efficacy while reducing the number of samples required [118].

  3. Online learning

    While DL is usually performed offline [119, 120], online learning uses streaming data to update the model and its predictions as new data arrive, instead of waiting until the end of the stream, which might never come [121]. The data may change rapidly during online learning [119]. This method suits applications in which observations arrive incrementally and the supply of data is effectively limitless [119]. Online learning seeks to minimize regret, i.e., the gap between the performance of the model and the performance it could have achieved had all the information been available as a batch [119]. Stochastic or online gradient descent, which is used to fit ANNs, is an example of online learning [119]: it reduces the generalization error during training by drawing single instances or mini-batches from the dataset [119, 122].

  4. Transfer learning

    In this learning type, a model is first trained on one problem and then used as the starting point for a related task [123,124,125]. This method is useful when the task of interest is close to the primary problem and that related problem has plenty of data [23, 126]. Unlike multi-task learning, which seeks good performance on all tasks concurrently from a single model, tasks in TL are learned sequentially. In image classification, for example, a predictive model (e.g., an ANN) is trained on a huge set of general images, and its weights are then used as the starting point when training on a smaller, more specific dataset, which makes that training simpler [127,128,129]. Features already learned by the model on the broader task (e.g., extracting lines and patterns) aid the new task. More details about this technique are given in a later section.

  5. Ensemble learning

    In this technique, two or more models are fitted to the same data and their predictions are then combined [130,131,132,133]. The aim is for the combined models to perform better than any single model [134]. Emphasis is placed on how the group of models is built and how their predictions are combined [135]. Apart from improving predictive ability, ensemble learning reduces the variance of stochastic learning algorithms such as ANNs. Examples of ensemble learning algorithms include stacking (stacked generalization), weighted averaging, and bootstrap aggregation (bagging) [136, 137].

  6. Federated learning

    Federated learning is a distributed DL-based approach that allows institutions or hospitals to train a DL model on their data without sharing it. This is particularly useful where data sharing is restricted by privacy and regulatory concerns. Each institution or hospital trains a model locally and then shares the learned model parameters with a central server. The central server aggregates the model parameters from all institutions to create a global model. This process is repeated until the global model converges [138]. Federated learning can help overcome data scarcity by drawing on data from multiple institutions to train the model, which improves the performance of the model and increases its generalizability [139].
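
The train-locally, aggregate-centrally loop described for federated learning can be sketched as below. The text does not say how the server aggregates parameters; a data-size-weighted average (as in the FedAvg algorithm) is assumed here, and the one-parameter model, the function names, and the learning rate are hypothetical choices made only to keep the example small.

```python
def local_update(weight, data, lr=0.1):
    """One local gradient step on mean squared error for a one-parameter
    model that predicts a constant (a hypothetical toy model)."""
    grad = sum(2.0 * (weight - x) for x in data) / len(data)
    return weight - lr * grad


def federated_average(client_weights, client_sizes):
    """Aggregate client parameters, weighting each by its dataset size."""
    total = sum(client_sizes)
    return sum(w * n for w, n in zip(client_weights, client_sizes)) / total


def federated_training(client_datasets, rounds=100, lr=0.1):
    """Repeat: broadcast the global weight, update locally, aggregate."""
    global_weight = 0.0
    sizes = [len(data) for data in client_datasets]
    for _ in range(rounds):
        # Raw data never leave the clients; only the updated weights do.
        local_weights = [local_update(global_weight, data, lr)
                         for data in client_datasets]
        global_weight = federated_average(local_weights, sizes)
    return global_weight
```

In this toy setting the global parameter approaches the mean of all the data across clients, even though no client ever shares its raw data with the server.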

Deep learning architectures

Once the type of learning that suits the target task has been identified, the next step is to choose an architecture that fits the task, which is the subject of this section.

Over the past two decades, DL models have been enhanced to address more types of problems via NNs [140, 141]. DL employs a variety of topologies and algorithms for a vast range of problems [142, 143]. DL has garnered more attention in recent years thanks to accelerated execution on GPUs and deeper NN layers [141]. This paper compares the various architectures of DL models [1, 144, 145]. A DL model is generally composed of the following: input layers; convolutional and fully connected layers; sequence layers; activation layers; normalization, dropout, and cropping layers; pooling and unpooling layers; combination layers; object detection layers; GAN layers; and output layers [1, 33, 145,146,147,148,149,150,151,152]. The hidden layer is important in a network because its nodes enable the modeling of intricate data. It is called hidden because the actual values of its nodes are not observed in the training dataset; one only has access to the inputs and outputs. An NN should have at least one hidden layer, and the ideal number of hidden units may be lower than the number of inputs. Two hidden units can be adequate for limited data, while more hidden units can be used when plenty of training data is available [153,154,155].

Deep neural network (DNN)

This MA has more than two layers, which allows it to model complex non-linear relationships. It is viable for both regression and classification and offers great accuracy [156]. Its drawbacks are that training is difficult, because the error propagated back to earlier layers becomes very small, and that the model can be slow to learn [157, 158].
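
The training difficulty noted above, where the error signal shrinks as it is passed back through the layers, can be illustrated numerically. The sketch below assumes a chain of one-unit layers with sigmoid activations; the text does not name an activation function, so this choice and the function names are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def backprop_factor(depth, weight=1.0, x=0.5):
    """Factor by which a gradient is scaled after passing back through
    `depth` one-unit sigmoid layers (forward pass, then chain rule)."""
    factor = 1.0
    activation = x
    for _ in range(depth):
        activation = sigmoid(weight * activation)
        # Chain rule: each layer contributes weight * sigmoid'(z),
        # where sigmoid'(z) = a * (1 - a) is never larger than 0.25.
        factor *= weight * activation * (1.0 - activation)
    return factor
```

Because each sigmoid layer scales the gradient by at most 0.25 when the weight is 1, the factor shrinks geometrically with depth, which is why the earliest layers of such a network receive very little learning signal.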

Convolutional neural network (CNN)

This MA is the most popular one and a major reason that DL is the trend nowadays. It is well suited to 2D data. It consists of convolutional filters that transform 2D input into 3D, which enables fast learning and good performance (Fig. 4). However, a large amount of labeled data is needed for classification tasks such as image, video, and voice classification [83, 159,160,161,162]. The drawbacks of CNN are intense human interference, local minima, and slow convergence rate. The great success of models trained on ImageNet has led CNNs to be adopted with improved efficacy in several domains [71, 163,164,165,166,167].

Fig. 4. CNN architecture
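
As a rough illustration of how convolutional filters turn a 2D input into a 3D volume, the sketch below applies several small kernels to one image and stacks the resulting feature maps. It assumes no padding and a stride of 1; the function names are illustrative and not taken from any library.

```python
def conv2d(image, kernel):
    """2D cross-correlation with no padding and stride 1, the operation
    commonly called convolution in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def conv_layer(image, kernels):
    """Apply several filters to one 2D image; stacking the resulting 2D
    feature maps gives a 3D volume (filters x height x width)."""
    return [conv2d(image, kernel) for kernel in kernels]
```

Each filter is slid over the whole image with the same weights, so the number of parameters depends on the kernel size and the number of filters rather than on the image size.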

Recurrent neural network (RNN)

RNNs are able to learn sequences, with the weights shared across all time steps and neurons. Multiple variants exist, e.g., long short-term memory (LSTM), Bidirectional LSTM (B-LSTM), Multi-Dimensional LSTM (MD-LSTM), and Hierarchical Deep LSTM (HD-LSTM) [168,169,170,171,172]. RNNs offer great accuracy for speech and character recognition, as well as other NLP problems. Although temporal dependencies can be modeled via RNN [173], this approach has setbacks: it suffers from vanishing gradients and requires large datasets [174, 175].
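
Weight sharing across time steps can be seen in a minimal forward pass. The sketch assumes a single recurrent unit with a tanh activation and scalar weights; the parameter names `w_x`, `w_h`, and `b` are hypothetical.

```python
import math


def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a one-unit tanh RNN over a sequence and return all hidden states.

    The same parameters (w_x, w_h, b) are reused at every time step, and
    each state depends on the current input and the previous state.
    """
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states
```

Since the hidden state carries information forward, feeding the same values in a different order generally gives a different final state, which is what lets the network model sequences.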

Deep autoencoder network (DAN)

Applicable to unsupervised learning, this MA extracts features and reduces dimensionality. The number of inputs is equal to the number of outputs [176, 177], and the MA does not require labeled data. Several variants, e.g., denoising, sparse, and conventional autoencoders, have been proposed to improve robustness [178,179,180,181]. A pre-training step is required, and training may still suffer from vanishing gradients [182]. The autoencoder [183, 184] has an encoder and a decoder defined as \(\Phi\) and \(\Psi\), respectively, as expressed in Eq. (1).

$$\Phi : X\rightarrow F, \quad \Psi : F\rightarrow X, \quad \Phi , \Psi = \underset{\Phi , \Psi }{\arg \min }\, \left\| X-(\Psi \circ \Phi )X\right\| ^{2}$$
(1)
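
The objective in Eq. (1) can be illustrated with a very small linear autoencoder, assuming the standard reading of the equation as the squared reconstruction error between an input and its decoded encoding. The 2D-to-1D setup and all function names below are assumptions made for the example.

```python
import math


def make_linear_autoencoder(direction):
    """Build an encoder Phi: R^2 -> R and a decoder Psi: R -> R^2 that
    project onto, and map back along, a given 2D direction."""
    norm = math.hypot(*direction)
    u = (direction[0] / norm, direction[1] / norm)

    def phi(x):
        # Encoder: compress the 2D point to a single code value.
        return x[0] * u[0] + x[1] * u[1]

    def psi(code):
        # Decoder: map the code back to the 2D input space.
        return (code * u[0], code * u[1])

    return phi, psi


def reconstruction_error(points, phi, psi):
    """Sum of squared errors between each x and Psi(Phi(x))."""
    total = 0.0
    for x in points:
        x_hat = psi(phi(x))
        total += (x[0] - x_hat[0]) ** 2 + (x[1] - x_hat[1]) ** 2
    return total
```

For points lying on the line y = x, an encoder aligned with the direction (1, 1) reconstructs them almost exactly, while one aligned with (1, 0) does not; training an autoencoder amounts to searching for the encoder and decoder that make this error small.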

Deep belief network (DBN)

The DBN is a graphical model that is fundamentally generative: it produces all the possible values that can be generated for the case at hand. It combines probability and statistics with ML and NNs [185, 186]. The DBN consists of several layers of values, with connections between the layers but not between the values within a layer. The main aim is to help the deep network classify data into categories. The shortcoming of this MA is costly training due to the initialization process [187, 188].

Deep Boltzmann machine (DBM)

This three-layer generative MA is similar to the deep belief network (DBN) [189], except that it permits bidirectional linkages at the bottom layers. Its energy function, an extension of that of the restricted Boltzmann machine (RBM), is given in Eq. (2).

$$E=-\left( \sum _{i<j} w_{ij}s_{i}s_{j}+\sum _{i}\theta _{i}s_{i}\right)$$
(2)

The connections between the hidden layers of the DBM are undirected. Inference on ambiguous inputs becomes more robust when top-down feedback is incorporated [190, 191]. However, optimizing the parameters is hard for large datasets.

Deep convolutional extreme learning machine (DC-ELM)

This MA combines the fast training of ELM with the strength of CNN. It applies multiple alternating convolution and pooling layers to abstract crucial input features [192, 193]. The ELM classifier then enhances the prediction via rapid learning [194, 195]. This MA deploys stochastic pooling at the final hidden layer to lower the feature dimensionality, thus saving computational resources and time [196].

Deep stacking networks (DSN)

The DSN MA is also called the deep convex network [197]. The DSN differs from conventional DL systems because, although it forms a deep network, it is actually a collection of individual networks, each with its own hidden layers. This MA addresses an issue faced by DL: the complexity of training [198]. Training is complex in conventional DL design because it is treated as a single problem, whereas the DSN treats it as a set of individual training problems [199].

Long short-term memory/gated recurrent unit networks (LSTM/GRU)

Introduced by Hochreiter and Schmidhuber in 1997, LSTM has gained popularity as an RNN architecture for varied usages only in recent years [200]. LSTM departed from the typical neuron-based NN model by introducing the concept of a memory cell [201, 202]. Because the memory cell can retain its value for a short or long time as a function of its inputs, it can remember what is important and is not bound to its last computed value [203]. In 2014, LSTM was simplified into the GRU, which has two gates, a reset gate and an update gate, and eliminates the output gate of LSTM [202]. The GRU performs similarly to LSTM in many applications, but with fewer weights, a simpler structure, and faster execution [204, 205]. The reset gate defines how the new input is combined with the previous cell contents, whereas the update gate indicates how much of the previous cell contents to keep [202]. A GRU reduces to a standard RNN when the reset gate is set to 1 and the update gate to 0.
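
The gating described above can be sketched in a few lines. The example below is a simplified GRU step, assuming no bias terms and a convention in which the update gate is the fraction of the previous state that is kept. The parameter names, shapes, and the `reset`/`update` override arguments are hypothetical and exist only to show that forcing the reset gate to 1 and the update gate to 0 recovers a plain RNN step.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, params, reset=None, update=None):
    """One simplified GRU time step (bias terms omitted for brevity).

    `reset` and `update` optionally override the learned gate values.
    """
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h_prev) if update is None else update  # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev) if reset is None else reset    # reset gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h_prev))   # candidate state
    return z * h_prev + (1.0 - z) * h_cand         # keep a fraction z of the old state

def rnn_step(x, h_prev, Wh, Uh):
    """One step of a plain RNN with tanh activation."""
    return np.tanh(Wh @ x + Uh @ h_prev)

rng = np.random.default_rng(0)
shapes = [(4, 3), (4, 4)] * 3                      # Wz, Uz, Wr, Ur, Wh, Uh
params = tuple(rng.normal(scale=0.5, size=s) for s in shapes)
x, h_prev = rng.normal(size=3), rng.normal(size=4)

h_gru = gru_step(x, h_prev, params)                            # learned gates
h_forced = gru_step(x, h_prev, params, reset=1.0, update=0.0)  # gates forced
h_rnn = rnn_step(x, h_prev, params[4], params[5])              # same as h_forced
```

Some libraries define the update gate with the opposite sign convention, in which case the roles of 0 and 1 are swapped.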

Graph convolutional network (GCN)

The GCN is used for semi-supervised learning on graph-structured data and is based on an efficient variant of CNN [206,207,208,209]. The choice of convolutional MA is motivated by a localized first-order approximation of spectral graph convolutions. The model scales linearly in the number of graph edges and learns hidden layer representations that encode both node features and graph structure [210,211,212].
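
A minimal sketch of a single graph-convolution layer is given below, assuming the commonly used propagation rule ReLU(D^{-1/2}(A + I)D^{-1/2} H W), where A is the adjacency matrix, H the node features, and W a weight matrix. The function name `gcn_layer` and the toy three-node graph are hypothetical, and dense matrices are used for readability, whereas the linear scaling in the number of edges relies on sparse operations.

```python
import numpy as np

def gcn_layer(adjacency, features, weights):
    """One graph-convolution layer: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    a_hat = adjacency + np.eye(adjacency.shape[0])     # add self-loops
    deg = a_hat.sum(axis=1)                            # node degrees incl. self-loop
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ features @ weights, 0.0)

# Toy graph: nodes 0 and 1 are connected, node 2 is isolated.
adjacency = np.array([[0.0, 1.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [0.0, 0.0, 0.0]])
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [2.0, 2.0]])
weights = np.eye(2)                                    # identity, to expose the averaging

hidden = gcn_layer(adjacency, features, weights)
```

With identity weights, each connected node ends up with the average of its own and its neighbour's features, while the isolated node keeps its own features. This is the sense in which the hidden representation encodes both node features and graph structure.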

Lack of training data: issues and solutions

DL models require a massive volume of data to display exceptional performance [1], as portrayed in Fig. 5. Consequently, inadequate training data hinders the use of DL in multiple applications. There are two main scenarios in which a dataset can be considered small. The first occurs when performance is low because the model has not been trained on a sufficiently large dataset. The second occurs when the model performs well on classification or prediction for data included in the training set but performs worse on data it was not trained on. In this case, the model is overfitting.

Fig. 5 The importance of large training data for DL models

This section presents the most popular solutions to address the lack of training data to overcome three challenges including small datasets, imbalanced datasets, and lack of generalization.

Transfer learning (TL)

TL is used when elements of a pre-trained model are reused in a new DL model [23, 124, 213]. The concept of TL is portrayed in Fig. 6. Generalized knowledge may be shared if the two models perform similar tasks. This reduces the amount of labeled data and resources needed to train new models.

Fig. 6 General concept of TL

DL algorithms are widely used to execute intricate tasks in many applications, including enhancing network efficiency, attaining a better return on investment by scaling up marketing campaigns, and improving speech recognition. As such, TL plays a crucial role in continuous model advancement [214,215,216,217,218]. Supervised DL has been widely applied to train models using labeled data. However, this time-consuming and resource-intensive approach needs an expert to label the dataset correctly. Because TL alleviates these problems, it has become a prominent method in the DL field. The following sections describe the details of TL.

  • What is transfer learning?

    When applied in DL, TL denotes the reuse of existing models to address a new problem. Rather than being a distinct type of DL algorithm, TL is a technique in which knowledge from prior training is recycled to train a model for a new task. The new task is related in some way to the previously trained task, for example, categorizing objects in a certain file type. The initially trained model needs a high level of generalization so that it can adapt to new data [128, 129, 219]. With TL, training does not begin from scratch for each new task. Labeling massive datasets is time-consuming, especially for DL algorithms. Thus, a DL model trained on an available labeled dataset can be reused through TL for a similar task involving unlabeled data. Everyday analogies of such knowledge transfer include:

    • Riding a motorcycle \(\Rightarrow\) Driving a motorcar.

    • Playing a classic guitar \(\Rightarrow\) Playing the bass guitar.

    • Learning mathematics and ML \(\Rightarrow\) Learning DL.

  • What is transfer learning used for?

    Training a DL system from scratch to solve a new task requires massive resources. With TL, relevant parts of an existing DL model are reused to address a new but similar problem. Generalization is integral in TL: only knowledge that remains useful to another model in other settings is transferred. Because models with TL are more general and are not rigidly tied to a training dataset, they may be applied to varying datasets and scenarios [220]. Take image categorization as an example: a DL model can be trained to identify and categorize images. With TL, the model may then be used to detect other specific objects within images. Resources are saved because the primary aspects, such as detecting object edges in images, are retained. This knowledge transfer removes the need to re-train the model to obtain a similar output. Hence, TL is mostly applied for the following:

    • Saving resources and time, as DL models need not be trained from scratch for a similar task.

    • Overcoming inadequate training data, as TL permits the use of a pre-trained model.

  • How does transfer learning work?

    When TL is used in DL, parts of a pre-trained DL model are reused for a new but similar problem, or new elements are added to the model to address a specific task. The programmer determines which parts of the model are relevant to the new task and retains them. If the task of the new model is object detection, a model pre-trained for a very similar task may be applied [221, 222]. Supervised DL models are trained to execute certain tasks from labeled data. Only after the input and desired output data are fed to the algorithm can the model recognize the patterns and trends in the dataset. Such a model yields accurate output within a similar setting, but its accuracy may suffer if the setting changes beyond the training dataset. TL addresses this issue by transferring the relevant knowledge from an existing model to a new model with a similar task. Transferring the general aspects of the model is crucial for task completion so that the desired output is identified. Tasks can be performed optimally in a new setting when additional layers of task-specific knowledge are included in the new model [223,224,225].

  • Benefits of transfer learning for DL

    Notably, TL offers many advantages when training new DL models [23, 127]. TL facilitates model training when little labeled data is available, as a pre-trained model is reused. Some of the benefits are:

    • Removing the need for a huge set of labeled training data for each new model

    • Enhancing the efficiency of developing and deploying multiple DL models

    • Leveraging existing algorithms to resolve new problems, offering a more general approach to problem solving

    • Allowing models to be trained in simulation rather than on real-world data

    The details of the benefits are:

    1. Saving on training data

      A massive amount of data is needed to train a DL algorithm accurately. Creating labeled training data consumes much time, expertise, and effort. In TL, pre-trained models are reused, which minimizes the amount of data needed for new DL models. In other words, a model trained on existing labeled data is later applied to similar but unlabeled data.

    2. Efficient training of multiple models

      Properly training DL models to execute intricate tasks can be time-consuming. With TL, there is no need to start from scratch whenever a similar model is needed, so the time, effort, and resources spent on training one DL algorithm can be shared across other models. Reusing similar aspects and transferring knowledge from a prior model makes the training process more efficient.

    3. Leverage knowledge to solve new challenges

      Supervised DL, a popular approach, offers high accuracy after adequate training on labeled data. Because performance may degrade when the data deviate from the training set, TL is used to adapt existing models to a similar task instead of developing a whole new model. TL also permits a blended approach in which several different models contribute to the solution of a problem. Knowledge sharing among models yields a more powerful model that generates accurate output. Such an approach permits an iterative way of developing a functional model.

    4. Simulated training to prepare for real-world tasks

      TL is essential for DL models that are trained in simulation. Digital simulations save both time and cost when models are trained to resolve real-world problems. Because simulations can be built to reflect reality, models can be adequately trained to detect the desired objects in the simulation. Reinforcement learning models can also be trained effectively using simulations, in any desired setting or condition. For instance, simulation is an integral step in developing self-driving systems for cars. Because initial training in the real world may not yield the expected results, models are first trained in simulation before the knowledge is transferred to reality.

  • Transfer learning strategies

    Various TL techniques can be employed based on data availability, domain application, and specific tasks [226, 227] (Fig. 7).

    Fig. 7 Transfer learning strategies

    The following describes TL techniques categorized based on conventional DL algorithms:

    1. Inductive TL: the source and target domains are the same, but the tasks differ. The algorithms use the inductive bias of the source domain to improve the target task. Depending on whether the source domain contains labeled data, this approach is divided into two categories, multitask learning and self-taught learning [228].

    2. Unsupervised TL: similar to inductive TL, but focused on unsupervised tasks in the target domain. The source and target domains are similar, while the tasks differ. Labeled data are absent in both domains [229].

    3. Transductive TL: the source and target tasks are the same, but the domains differ. The source domain has plenty of labeled data, whereas the target domain has none. This setting is further subdivided according to whether the feature spaces or the marginal probabilities differ [230].

    The three categories above describe the settings in which TL can be applied. The following approaches describe what is transferred across these categories:

    1. Instance transfer: ideally, knowledge from the source domain is reused for the target task. Although the source domain data usually cannot be reused directly, certain instances may be reused together with the target data to improve the output [231].

    2. Feature-representation transfer: this method minimizes domain divergence and error rates by identifying good feature representations that carry over from the source to the target domain. Depending on the availability of labeled data, supervised or unsupervised techniques can be deployed for this type of transfer [232].

    3. Parameter transfer: this method assumes that models for related tasks share some parameters or a prior distribution of hyper-parameters. Unlike multitask learning, in which the source and target tasks are learned concurrently, TL may apply extra weight to the target domain loss to enhance performance [233].

    4. Relational-knowledge transfer: this method handles data that are not independent and identically distributed (non-IID), i.e., data in which each data point is related to others, such as social network data [234].

  • Types of deep transfer learning

    At times, it is difficult to distinguish TL from multitask learning and domain adaptation, mainly because these methods attempt to resolve similar problems. Therefore, TL is treated here as a general concept in which a target task is solved by applying knowledge from the source task and domain.

    1. Domain adaptation

      Domain adaptation refers to cases in which the marginal probabilities of the source and target domains differ, i.e., \(P(X_{s})\ne P(X_{t})\). This inherent shift in the data distribution between the source and target domains requires adjustments when transferring the learning. For example, a corpus of movie reviews labeled negative or positive differs from a corpus of product reviews, so a classifier trained on movie reviews will encounter a different distribution when classifying product reviews. Domain adaptation techniques are therefore used in TL in such scenarios [235,236,237,238,239].

    2. Domain confusion

      Different layers of a DL network capture different sets of features. This fact, which also underlies the efficacy of feature-representation transfer, can be used to learn domain-invariant features and improve transferability across domains. Rather than allowing the model to learn any representation, the representations of both domains are pushed to be as similar as possible. To do so, some pre-processing steps are applied to the representations, as elaborated by Sun et al. [240] and Ganin et al. [241]. Essentially, an additional objective is added to the source model to encourage similarity by confusing the domains, hence the name domain confusion.

    3. Multitask learning

      In multitask learning, several tasks are learned concurrently without a distinction between source and target, and the learner receives information about all tasks at once. This differs from TL, in which the learner initially has no knowledge of the target task. Hence, multitask learning differs slightly from TL [242, 243].

    4. Zero-shot learning

      An extreme variant of TL, zero-shot learning relies on no labeled examples to learn a task; adjustments are made at the training phase to exploit additional information so that unseen data can be understood. In the book Deep Learning, Goodfellow and co-authors describe zero-shot learning in terms of three variables: the conventional input and output variables (x and y, respectively) and an additional random variable that describes the task (T). The model is trained to learn the conditional probability distribution P(y | x, T). This learning type is suitable for tasks such as machine translation, where labels may be absent in the target language [244,245,246].

    5. One-shot learning

      DL models need plenty of training data to learn their weights, which is a limiting aspect of Deep Neural Networks (DNNs). Humans do not share this limitation: a child shown a single apple can identify many varieties of apples, which is not the case for DL and ML approaches. A variant of TL, one-shot learning infers the output from a single training instance. It is thus suitable for real-world settings in which labeled data are unavailable for every class (in a classification task) and for conditions in which new classes are added over time. Fei-Fei et al. [247] coined the term ‘one-shot learning’ to describe a variation of a Bayesian framework for representation learning in object categorization. Since its emergence, this approach has been enhanced and applied in DL models [248].

    6. Few-shot learning

      This type involves training models to recognize new objects or classes with only a few examples, typically ranging from 1 to 10 examples per class. In other words, the goal of few-shot learning is to enable machines to learn quickly and efficiently with limited data. One-shot learning, on the other hand, is the specific case of few-shot learning in which the model is trained on only one instance per class. It is considered more challenging than general few-shot learning because the model must generalize well from a single instance, whereas few-shot learning allows a small number of examples to be used for training. The challenges of interpreting multimodal time-series data from drone and quadruped robot platforms for remote sensing and photogrammetry, which stem from the expensive and time-consuming nature of data annotation in the training stage, have been discussed in [249, 250]. The authors proposed a few-shot learning architecture based on a squeeze-and-attention structure that is computationally low-cost and accurate enough to meet certainty measures. The proposed architecture was tested on three datasets with multiple modalities and achieved competitive results. This study demonstrated the importance of developing robust algorithms for target detection in remote sensing applications using limited training data.

  • Transfer learning approaches

    The two TL methods are feature-extraction and fine-tuning [251,252,253].

    1. Feature-extraction

      Here, a CNN model pre-trained on a massive dataset, such as ImageNet, is used to extract features for the target domain. All fully connected layers of the CNN model are discarded and all convolution layers are frozen. The frozen convolution layers serve as the feature extractor for the new task. The extracted features are fed to a new classifier, which can be a supervised ML model or new fully connected layers. Lastly, only the new classifier is trained, rather than the whole network [254, 255].

    2. Fine-tuning

      This method is similar to feature extraction, except that the convolution layers of the pre-trained CNN are not frozen; instead, their weights are updated during the training phase. Thus, the weights of the convolution layers are initialized with the CNN’s pre-trained weights, while the classifier layers are initialized with random weights. Here, the whole network undergoes training [164, 256].
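
The difference between the two approaches comes down to which weights receive gradient updates. The sketch below illustrates this with a deliberately tiny stand-in, assuming a single dense `backbone` layer in place of the pre-trained convolution layers and a logistic-regression `head` in place of the new classifier. The random "pre-trained" weights, the toy data, and all names are hypothetical, and a real pipeline would load actual pre-trained CNN weights.

```python
import numpy as np

def train(X, y, backbone, head, freeze_backbone, lr=0.5, steps=100):
    """Train features = tanh(X @ backbone), logit = features @ head.

    freeze_backbone=True  -> feature extraction (only the new head is updated)
    freeze_backbone=False -> fine-tuning (the pre-trained backbone is updated too)
    """
    backbone, head = backbone.copy(), head.copy()
    for _ in range(steps):
        feats = np.tanh(X @ backbone)
        prob = 1.0 / (1.0 + np.exp(-(feats @ head)))
        grad_logit = (prob - y) / len(y)           # d(cross-entropy)/d(logit)
        grad_head = feats.T @ grad_logit
        if not freeze_backbone:
            # Backpropagate through the head and the tanh non-linearity.
            grad_feats = np.outer(grad_logit, head) * (1.0 - feats ** 2)
            backbone -= lr * (X.T @ grad_feats)
        head -= lr * grad_head
    return backbone, head

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))                       # toy target-task inputs
y = (X[:, 0] > 0).astype(float)                    # toy binary labels
pretrained_backbone = rng.normal(size=(5, 8))      # stand-in for pre-trained weights
new_head = rng.normal(scale=0.01, size=8)          # randomly initialized classifier

fe_backbone, fe_head = train(X, y, pretrained_backbone, new_head, freeze_backbone=True)
ft_backbone, ft_head = train(X, y, pretrained_backbone, new_head, freeze_backbone=False)
```

After training, `fe_backbone` is identical to the pre-trained weights, whereas `ft_backbone` has moved away from them. In both cases the head starts from random values and is trained on the target data.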

  • Research problem in transfer learning for medical imaging

    One of the solutions to address the lack of training data is to employ models pre-trained on ImageNet for the target task. For some applications, this type of TL from ImageNet has significantly improved the results compared with training from scratch [257, 258]. However, for other applications, such as medical imaging, this type of TL from ImageNet does not help to address the lack of training data. This is due to the mismatch in learned features between natural images, e.g., ImageNet (color images), and medical images (gray-scale images such as MRI, CT, and X-ray) (see Fig. 8) [213, 259].

    Fig. 8 Comparison between TL from ImageNet to natural images and medical images

    These ImageNet models were designed to classify 1000 classes. However, medical imaging tasks typically involve between 2 and 10 classes. Using these models therefore results in unnecessarily deep and heavy architectures.

    It has been shown that TL from a different domain (such as ImageNet) does not significantly affect performance on medical imaging tasks, with lightweight models trained from scratch performing nearly as well as standard ImageNet models [260]. To this end, Alzubaidi et al. proposed two novel types of TL that showed excellent results in several medical applications [23, 124]. One of the solutions is based on training the DL model on a large number of unlabelled images of a specific task and then training it on a small, labeled dataset for that same task. This approach helps the model to learn the relevant features and reduces the effort of the labeling process. It also offers the chance to use a shallow model with the desired input size. Using the same approach, several published articles have demonstrated the effectiveness of these solutions for medical images and other domains [22, 123, 164, 261,262,263,264,265].

    Another solution was proposed by Azizi et al. [70] to improve the learned features of DL models by training them on a large number of unlabelled images of a specific task and then training them on a small, labeled dataset for that same task.

    Figure 9 compares two models trained to detect shoulder abnormalities, taken from our ongoing work. The first column shows the original images, with a red circle marking the region of interest identified by an expert. The second column shows a model trained after TL from ImageNet, while the third shows a model trained after TL from the same domain as the target dataset. In the first row, both models predicted the image correctly according to their confidence values. However, the heatmap reveals that the first model is biased and inaccurate, failing to detect the region of interest indicated by the red circle. In contrast, the second model accurately identified the region of interest with a high confidence value. In the second row, the first model misclassified the sample, while the second classified it correctly. This example highlights the importance of the source of TL, as even a model with a correct, confident prediction may not be trustworthy.

    Fig. 9 Comparison between two different TL

  • Instances of transfer learning for deep learning

    The TL has been applied in many areas within the DL field and real world applications, e.g., enhancing computer vision and NLP. The following describes some instances of TL used in DL.

    1. Transfer learning in NLP

      NLP is the capability of a system to analyze and comprehend human language (text/audio files) in order to enhance human-system interaction. NLP underpins many everyday tools, including language contextualization tools, voice assistants, translation, speech recognition, and automated captions. Many DL models for NLP can be enhanced with TL, for example by adding pre-trained layers that identify a vocabulary or dialect, or by training models concurrently to identify different language aspects. TL can also be used to adapt models across languages: models trained and refined in one language may be adapted for other similar languages. Given the vast digitized resources in English, models may be trained on a massive English dataset before their aspects are transferred to another language [266,267,268,269,270,271,272].

    2. Transfer learning in computer vision

      Computer vision is the capability of a system to derive meaning from visual formats (images/videos). DL models are trained on a massive volume of images to recognize and categorize them. Here, TL reuses elements of an existing computer vision algorithm in a new model. The accurate models produced by training on massive data can be applied effectively to smaller image sets, or to more general aspects of the task (e.g., detecting object edges). Likewise, a specific model layer that detects objects or shapes can be reused. TL thus sets the main functionality of the model, while the model parameters are subsequently refined and optimized [273,274,275].

    3. Transfer learning in neural network

      The ANN is a crucial element in DL, designed to simulate and replicate functions of the human brain. Notably, NN training consumes plenty of resources owing to model complexity. TL helps to minimize resource use and make the process more efficient. When new models are developed, features or knowledge are transferred across networks. Being able to apply knowledge in varied settings is a vital aspect of network building. Essentially, TL is typically limited to general tasks or processes that stay relevant in an assortment of scenarios [214, 215, 276].

    4. Transfer learning for Audio/Speech

      As with computer vision and NLP, DL models can be applied to audio data. Automatic Speech Recognition (ASR) models developed for the English language are broadly reused to enhance speech recognition performance in other languages. Another instance of TL application is automated speaker identification [177, 277, 278].

    There are more domains that used TL to address the issue of lack of training data as listed in Table 1.

    Table 1 Some examples of TL from the literature
  • The future of transfer learning

    Widespread access to more powerful models developed by large companies and related organizations will shape the future of DL models. For DL to revolutionize processes and businesses, it must be adaptable and accessible to organizational demands and goals. However, only a handful of organizations possess the resources and expertise to train models and label data. One challenge faced by supervised DL is obtaining a massive amount of labeled data. Labeling large volumes of data is labor-intensive, and limited access to data is a barrier to developing powerful models. Organizations with access to large amounts of labeled data and resources can develop algorithms effectively. However, when these models are used in other organizations, their performance may differ because the environment differs from the one used for training. Even the most accurate models may suffer performance degradation in a different setting, which hinders the shift of DL to mainstream application. TL has a significant role in overcoming this barrier. By integrating TL, DL models become more powerful because they can be adapted to specific tasks and settings. Hence, TL is regarded as an important driver for distributing DL models across new fields and areas.

Self-supervised learning

Self-supervised learning (SSL) is a technique of training DL models using large amounts of unannotated data and a small amount of annotated data, or using a pretext task to generate labels for the data. It is often used to pre-train models on large datasets and then fine-tune them on a smaller dataset with a different task in mind. SSL can be a useful solution for data scarcity, as it allows models to learn useful features from large amounts of unannotated data, which can then be fine-tuned on a smaller dataset for the target task [68,69,70,71].

One of the main benefits of SSL is that it allows models to learn useful features from large amounts of unannotated data, which can be useful in situations where annotated data is scarce or expensive to obtain. It can also be used to learn more robust and generalizable features, as the model is exposed to a larger variety of data during training [339, 340].

There are several types of SSL, including:

  • Pretext tasks: these are tasks that are designed to generate labels for the data, which can then be utilized to train a DL model. Examples of pretext tasks include predicting the rotation of an image, predicting the next frame in a video, and predicting the mask for an image.

    One example of using a pretext task for SSL is the work done by Doersch et al. [341]. The authors trained a CNN to predict the relative location of randomly selected patches within an image. The CNN learned useful features from the images that could then be utilized for other tasks.

  • Autoencoders: these are neural networks that are trained to reconstruct their input data. They are often utilized as a way to learn useful features from the data, which can then be utilized for other tasks.

    An example of using autoencoders for SSL is the work done by Masci et al. [342] where the authors trained a stacked autoencoder to learn features from images of faces. The learned features were then used to train a classifier to recognize the identities of the faces.

  • Generative models: these models are trained to generate new data that is similar to the training data. Examples include Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) which will be explained in the next section.

    An example of using generative models for SSL is the work done by Goodfellow et al. [343]. They trained a GAN to generate synthetic images that were similar to a dataset of real images. The generated images were used to train a classifier to recognize objects in real images.

  • Contrastive learning: this SSL technique involves training a model to distinguish between different types of data. The model is then fine-tuned on a downstream task using the learned features.

    An example of using contrastive learning for SSL is the work done by He et al. [344] where they trained a CNN to distinguish between different types of images and used the learned features to train a classifier on a downstream task.

  • Self-supervised multitask learning: this technique is based on training a single model on multiple tasks simultaneously, using a combination of supervised and unsupervised learning. The model learns to solve multiple tasks using the shared features learned from the unsupervised tasks.

    An example of using self-supervised multitask learning is the work done by Caruana et al. [345]. The authors trained a single neural network to perform multiple tasks simultaneously, using both supervised and unsupervised learning. The network learned to solve the tasks using the shared features learned from the unsupervised tasks.
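
To make the idea of a pretext task concrete, the sketch below generates labels for the rotation-prediction task mentioned above without any human annotation: each image is rotated by 0, 90, 180, and 270 degrees, and the rotation index serves as the label. Images are represented as plain nested lists for simplicity, and the function names are hypothetical.

```python
def rotate90(image):
    """Rotate a 2-D list-of-lists image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def rotation_pretext_samples(image):
    """Return [(rotated_image, label)], where label k means a k*90 degree rotation."""
    samples = []
    current = [list(row) for row in image]   # copy so the input is not modified
    for k in range(4):
        samples.append((current, k))
        current = rotate90(current)
    return samples

image = [[1, 2],
         [3, 4]]
samples = rotation_pretext_samples(image)    # four (image, label) training pairs
```

A network trained to predict these labels has to pick up on object orientation and shape, and the resulting features can then be fine-tuned on a small annotated dataset.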

Generative adversarial networks (GANs)

GANs are a type of DL network that generates data with features similar to those of the real input data. GANs learn representations without extensively annotated training data by deriving training signals from a competitive process between a pair of networks. The representations that GANs learn can be applied in, for example, image synthesis, classification, super-resolution, style transfer, and semantic image editing [346, 347]. GANs thereby help to overcome insufficient training data. Goodfellow et al. [348] introduced the adversarial method for learning GAN models. GAN training is a two-player, zero-sum minimax game (the loss of one player is the gain of the other). The GAN consists of the generator (G) and the discriminator (D). G tries to deceive D by producing samples that mimic the real data distribution, while D distinguishes real from fake samples. The higher the probability value output by D, the more likely the sample is real (0 = fake sample, 0.5 = optimal solution). Upon nearing the optimal solution, D is no longer able to distinguish real from fake samples [349,350,351,352]. Figure 10 illustrates the general GAN architecture.

Fig. 10 The general GAN architecture

  1. Generator (G): a network that generates images G(z) from random noise z. Gaussian noise, i.e., a random point in the latent space, is typically selected as the input. The parameters of G and D are updated iteratively during GAN training.

  2. Discriminator (D): this network determines whether an image comes from the real or the generated distribution. Given an input image x, it outputs D(x), the probability that x is real. D(x) = 1 indicates a real image, while D(x) = 0 indicates a fake one.
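
The interplay between G and D can be sketched through the value function of the original GAN formulation, V(D, G) = E[log D(x)] + E[log(1 − D(G(z)))], which D tries to maximize and G tries to minimize. The example below only evaluates this quantity from given discriminator outputs and does not train any network. The function name `gan_value` and the sample probabilities are hypothetical.

```python
import math

def gan_value(d_real, d_fake):
    """Estimate V(D, G) from discriminator outputs.

    d_real: D(x) for real samples; d_fake: D(G(z)) for generated samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that still separates real from fake samples well.
confident = gan_value(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])

# At the optimal solution D outputs 0.5 everywhere and cannot tell them apart.
equilibrium = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

When D outputs 0.5 for every sample, the value equals −log 4, which is lower than the value obtained by a discriminator that separates the two sets confidently.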

  • Variants of GAN

    Enhancements made to GAN architecture (Fig. 11) are explained in the following:

    Fig. 11 Variants of GAN

    1.

      Fully connected GANs

      The initial GAN MA used fully connected NNs for both D and G [348]. This MA was applied to relatively simple image datasets, e.g., the Toronto Face dataset (TFD), MNIST, and CIFAR10 (natural images).

    2.

      Conditional GANs (CGAN)

      In this extension, the D and G networks are conditioned on additional data (y) to overcome the reliance of the original model on random variables alone [353]. y denotes class labels or auxiliary data from other modalities. Conditioning is performed by feeding y into both the G and D networks as an extra input layer (see Fig. 12). In the G network, the prior input noise \(p_{z}(z)\) and y are combined in a joint hidden representation, and the adversarial training framework permits considerable flexibility in how this hidden representation is composed [353]. In the D network, both x and y are presented as inputs to the discriminative function.

      Fig. 12

      Conditional GAN’s architecture
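A minimal sketch of how the condition y can be fed to both networks is given below. It assumes that y is a class label encoded as a one-hot vector and that it is simply concatenated with the noise vector (for G) or with the sample (for D). Concatenation is only one common choice, and the helper names are hypothetical.

```python
import random


def one_hot(label, num_classes):
    """Encode an integer class label y as a one-hot vector."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec


def generator_input(noise_dim, label, num_classes, rng):
    """Build the G input: noise z ~ N(0, 1) concatenated with the condition y."""
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return z + one_hot(label, num_classes)


def discriminator_input(x, label, num_classes):
    """Build the D input: a (flattened) sample x concatenated with the same y."""
    return list(x) + one_hot(label, num_classes)


rng = random.Random(0)
g_in = generator_input(noise_dim=8, label=3, num_classes=10, rng=rng)
d_in = discriminator_input([0.2, 0.7, 0.1, 0.9], label=3, num_classes=10)
```

Because D sees the same y as G, a generated sample is penalized both when it looks unrealistic and when it does not match its condition.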

    3.

      Laplacian pyramid of adversarial network (LAPGAN)

      Using a cascade of convolutional networks within the LAPGAN model, Denton et al. [354] introduced image generation in a coarse-to-fine manner. Hence, the multiscale structure of natural images can be exploited by building a series of GAN models, each of which captures the image structure at one level of the Laplacian pyramid. Built from the Gaussian pyramid, the Laplacian pyramid uses two functions: downsampling d(.) and upsampling u(.). Let G(I) = \([I_{0};I_{1}; \ldots ; I_{K}]\) be the Gaussian pyramid, where \(I_{0}\) = I and \(I_{k}\) denotes k repeated applications of d(.) to I. The coefficient \(h_{k}\) at level k of the Laplacian pyramid is the difference between adjacent levels of the Gaussian pyramid, in which the smaller level is upsampled with u(.) (Eq. 3).

      $$h_{k}=L_{k}(I) = G_{k}(I)- u(G_{k+1}(I))= I_{k}-u(I_{k+1})$$
      (3)

      The image is reconstructed from the Laplacian pyramid coefficients \([h_{1}; \ldots ; h_{K}]\) via backward recurrence, as in Eq. (4):

      $$I_{k}=u(I_{k+1})+h_{k}$$
      (4)

      LAPGAN trains a set of convolutional generative models, each of which captures the distribution of the coefficients \(h_{k}\) for a different level of the Laplacian pyramid. During reconstruction, these generative models are used to produce \(h_{k}\). Hence, the modification of Eq. (4) is expressed in Eq. (5):

      $$\bar{I}_{k}=u(\bar{I}_{k+1})+\bar{h}_{k}= u(\bar{I}_{k+1})+ G_{k}(z_{k}, u(\bar{I}_{k+1}))$$
      (5)

      A training image I is used to construct the Laplacian pyramid. At every level, a stochastic choice is made to construct the coefficient \(h_{k}\) either via the standard procedure or by generating it with \(G_{k}\). LAPGAN uses the CGAN model by feeding the low-pass image \(l_{k}\) to both G and D. The LAPGAN performance was assessed using three datasets: LSUN, CIFAR10, and STL10. The assessment compared log-likelihood, the quality of generated image samples, and human evaluation of the samples.
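The pyramid construction and its backward reconstruction can be checked with a short sketch. It assumes a 1-D signal instead of an image, pairwise averaging for d(.), and value repetition for u(.). These are simplifications chosen for brevity and are not the filters used in [354]. Each detail band is the difference between a level and the upsampled coarser level, and reconstruction adds each band back to the upsampled coarser level.

```python
def downsample(signal):
    """d(.): halve the length by averaging neighbouring pairs."""
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal), 2)]


def upsample(signal):
    """u(.): double the length by repeating every value."""
    return [v for v in signal for _ in range(2)]


def laplacian_pyramid(image, levels):
    """Return the detail bands (finest first) followed by the coarsest level.

    The signal length must be divisible by 2**levels.
    """
    coeffs = []
    current = list(image)
    for _ in range(levels):
        coarser = downsample(current)
        # Detail band: current level minus the upsampled coarser level.
        coeffs.append([a - b for a, b in zip(current, upsample(coarser))])
        current = coarser
    coeffs.append(current)  # coarsest Gaussian level
    return coeffs


def reconstruct(coeffs):
    """Backward recurrence: upsample the coarser level and add the detail band."""
    current = coeffs[-1]
    for band in reversed(coeffs[:-1]):
        current = [u + h for u, h in zip(upsample(current), band)]
    return current


image = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0, 0.0, 4.0]
pyramid = laplacian_pyramid(image, levels=2)
restored = reconstruct(pyramid)
```

In LAPGAN, the detail bands are not computed from a real image at generation time; each one is instead produced by a generator conditioned on the upsampled coarser level.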

    4.

      Deep convolutional GAN (DCGAN)

      Radford et al. [355] introduced a new class of CNNs called DCGANs, which applies the following architectural constraints to the CNN MA:

      • Fully connected hidden layers are discarded, while pooling layers are replaced with fractionally-strided convolutions in G and strided convolutions in D.

      • Batch normalization is applied for both G and D models.

      • ReLU and LeakyReLU activation is used in G (except the final layer) and D layers, respectively.

      The G in DCGAN used for LSUN scene modeling is portrayed in Fig. 13. DCGAN was evaluated on the SVHN, LSUN, CIFAR10, and ImageNet-1K datasets. First, DCGAN was used as a feature extractor to determine the quality of unsupervised representation learning, followed by the determination of accuracy by fitting a linear model on top of the features. Notably, G displayed the ability to forget some elements of the scene, e.g., furniture and windows. Good outcomes were also noted when vector arithmetic was performed on face samples.

      Fig. 13

      DCGAN’s architecture
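To show how strided layers can replace pooling, the sketch below computes the spatial size after each layer using the standard output-size formulas for convolutions (no dilation, no output padding). The kernel size 4, stride 2, and padding 1, as well as the 4 x 4 starting size and the 64 x 64 image size, are assumptions taken from common DCGAN implementations and are not stated in the survey.

```python
def transposed_conv_output_size(size, kernel, stride, padding):
    """Spatial size after a fractionally-strided (transposed) convolution."""
    return (size - 1) * stride - 2 * padding + kernel


def strided_conv_output_size(size, kernel, stride, padding):
    """Spatial size after an ordinary strided convolution."""
    return (size + 2 * padding - kernel) // stride + 1


# Generator: each layer doubles the resolution, starting from a 4 x 4 map.
g_sizes = [4]
for _ in range(4):
    g_sizes.append(
        transposed_conv_output_size(g_sizes[-1], kernel=4, stride=2, padding=1)
    )

# Discriminator: each layer halves the resolution, so no pooling is needed.
d_sizes = [64]
for _ in range(4):
    d_sizes.append(
        strided_conv_output_size(d_sizes[-1], kernel=4, stride=2, padding=1)
    )
```

With these settings G upsamples 4, 8, 16, 32, 64 and D mirrors the same sequence in reverse, which lets both networks learn their own spatial resampling.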

    5.

      Adversarial autoencoders (AAE)

      The AAE, which was proposed by Makhzani et al. [356], is a probabilistic autoencoder that uses a GAN to carry out variational inference. This is done by matching the aggregated posterior of the hidden code vector of the autoencoder with an arbitrary prior distribution. The autoencoder in AAE is trained with two objectives: a conventional reconstruction error criterion and an adversarial training criterion. After training, the encoder has learned to convert the data distribution to the prior distribution. The decoder, on the other hand, learns a deep generative model that maps the imposed prior to the data distribution (Fig. 14). The MA of AAE is described as follows. Let x and z be the input and latent code vectors of the autoencoder. p(z), q(z|x), and p(x|z) denote the imposed prior, encoding, and decoding distributions, respectively. Next, \(p_{d}(x)\) and p(x) signify the data and model distributions, respectively. The encoding function q(z|x) of the autoencoder defines the aggregated posterior distribution q(z) on the hidden code vector, as expressed in Eq. (6):

      $$q(z)=\int _{x}q(z|x)p_{d}(x)dx$$
      (6)

      Regularisation of the autoencoder in AAE is performed by matching the aggregated posterior q(z) with the arbitrary prior p(z). The encoder of the autoencoder, q(z|x), serves as the G of the adversarial network. Both the autoencoder and the adversarial network are jointly trained with gradient descent in two stages: reconstruction and regularisation. In the reconstruction stage, the autoencoder updates both the encoder and decoder to minimize the reconstruction error of the inputs. In the regularisation stage, the adversarial network first updates D to distinguish true samples from fake ones, and then updates the generative model to confuse D. During adversarial training, AAE can also include labels to better shape the distribution of the hidden code. A one-hot vector, which is added to the input of the discriminative network to link each distribution mode with a label, acts as a switch that chooses the decision boundary of the discriminative network based on the class label. The vector has an extra class for unlabeled data. When unlabeled data are presented, this extra class is switched on so that the decision boundary for the full Gaussian distribution is chosen.

      Fig. 14

      AAE’s architecture
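The aggregated posterior in Eq. (6) can be approximated by ancestral sampling: draw x from the data distribution, then draw z from q(z|x). The sketch below illustrates this with a toy data set of two points and a hypothetical Gaussian encoder q(z|x) = N(x, 0.1^2). Both are assumptions made only for illustration.

```python
import random


def sample_aggregated_posterior(data, encode, n_samples, rng):
    """Draw samples from q(z) = integral of q(z|x) p_d(x) dx by sampling x, then z."""
    samples = []
    for _ in range(n_samples):
        x = rng.choice(data)            # x ~ p_d(x), the empirical data distribution
        samples.append(encode(x, rng))  # z ~ q(z|x)
    return samples


def toy_encoder(x, rng):
    """Hypothetical stochastic encoder: q(z|x) is Gaussian with mean x, std 0.1."""
    return rng.gauss(x, 0.1)


rng = random.Random(0)
data = [-1.0, 1.0]
z_samples = sample_aggregated_posterior(data, toy_encoder, 20000, rng)

z_mean = sum(z_samples) / len(z_samples)
z_var = sum((z - z_mean) ** 2 for z in z_samples) / len(z_samples)
```

In an AAE, samples such as `z_samples` are the "fake" inputs of D, while draws from the imposed prior p(z) are the "true" inputs. Training the encoder to confuse D pushes q(z) toward p(z).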

    6.

      Generative recurrent adversarial networks (GRAN)

      The GRAN, introduced by Im et al. [357], uses recurrent computation, derived from unrolled gradient-based optimization, which incrementally builds up an image on a visual canvas (see Fig. 15). A convolutional network encoder extracts a code from the current canvas image. The decoder is fed with the codes of the generated and reference images to decide on canvas updates. Functions f and g are the GRAN decoder and encoder, respectively. The G in GRAN has a recurrent feedback loop that receives a sequence of noise samples drawn from the prior distribution \(z \sim p(z)\) and draws the outputs \(C_{1}; C_{2}; \ldots ; C_{T}\) at successive time steps. At time step t, a sample z from the prior distribution is passed to function f(.) together with the hidden state \(h_{c,t}\), where \(h_{c,t}\) is the current encoding of the previous drawing \(C_{t-1}\). \(C_{t}\) denotes what is drawn on the canvas at time t and is the output of function f(.). Function g(.) acts as the inverse of function f(.). Accumulating the samples from every time step produces the final sample drawn on the canvas, C. Function f(.) is the decoder that accepts a noise sample z and the previous hidden state \(h_{c,t}\) as input, while function g(.) is the encoder that provides the hidden representation of the output \(C_{t-1}\) for time step t. Unlike other models, GRAN begins with the decoder.

      Fig. 15

      GRAN’s architecture

    7.

      Bidirectional GAN (BiGAN)

      The BiGAN (see Fig. 16) was proposed by Donahue et al. [358] to learn the inverse mapping of the data distribution, in which data are projected back into the latent space so that the learned feature representations capture semantics. As shown in Fig. 16, in addition to the G of the standard GAN, BiGAN has an encoder E that maps data x to a latent representation z. The BiGAN D discriminates not only in the data space [x versus G(z)] but jointly in the data and latent spaces [tuples (x, E(x)) versus (G(z), z)], where the latent component is either the encoder output E(x) or the G input z. Under the GAN objective, the BiGAN encoder E learns to invert G.

      Fig. 16

      BiGAN’s architecture

  • GAN applications

    A GAN generates realistic samples from an arbitrary latent vector z without requiring explicit knowledge of the real data distribution. Thus, GANs have been used in many academic and engineering fields. This section presents the applications of GANs in terms of generating new data to enhance the training set [359,360,361].

    1.

      Generation of high-quality images

      Recent studies on GANs have enhanced both the usability and quality of image generation, such as the LAPGAN model [354] discussed earlier. Several publications have addressed the issue of lack of training data using GANs [350, 362,363,364].

      The Self-Attention GAN (SAGAN) was introduced by Zhang et al. [365] to enable attention-driven modeling of long-range dependencies in image generation. This differs from a conventional convolutional GAN, which generates high-resolution details only as a function of spatially local points in lower-resolution feature maps. The SAGAN, which generates details using cues from all feature locations, lowered the Frechet Inception Distance (FID) from 27.62 to 18.65 and raised the Inception Score (IS) from 36.8 to 52.52 on the ImageNet dataset.

      BigGAN was introduced by Brock et al. [366] to generate diverse, high-resolution samples from complex datasets (ImageNet) by training GANs at a very large scale. Orthogonal regularisation was applied to G to make it amenable to a ‘truncation trick’ that enables control of the trade-off between sample variety and fidelity by reducing the variance of the G input. Further modifications enabled the model to synthesize class-conditional images. The model, upon being trained using ImageNet (resolution: 128 \(\times\) 128), scored 166.5 and 7.4 for IS and FID, respectively, which was better than the model described above.
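The truncation trick can be sketched as sampling the latent vector from a truncated normal distribution: any component whose magnitude exceeds a threshold is resampled. The function name and the thresholds below are illustrative assumptions. A smaller threshold reduces the variance of the G input, trading sample variety for fidelity.

```python
import random


def truncated_normal(dim, threshold, rng):
    """Sample a latent vector whose components are N(0, 1) values
    resampled until their magnitude is at most `threshold`."""
    z = []
    while len(z) < dim:
        value = rng.gauss(0.0, 1.0)
        if abs(value) <= threshold:
            z.append(value)
    return z


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


rng = random.Random(0)
z_diverse = truncated_normal(128, threshold=2.0, rng=rng)   # more variety
z_faithful = truncated_normal(128, threshold=0.5, rng=rng)  # higher fidelity
```

Feeding `z_faithful` instead of `z_diverse` to a trained G keeps the latent input close to the mode of the prior, which is where the variance reduction mentioned above comes from.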

      A style-based G architecture for GANs was introduced by drawing on the style transfer literature [367, 368]. The architecture enabled intuitive, scale-specific control of the synthesis and automatically learned, without supervision, to separate high-level attributes (e.g., identity and pose when trained on human faces) from stochastic variation in the generated images (e.g., hair and freckles). Meanwhile, Huang et al. [369] introduced GANs that operate on intermediate representations rather than on low-resolution images. This model is similar to LAPGAN and extends CGAN, as the D and G networks can accept extra labeled data as input, a popular method to date that enhances image quality. In another instance, Reed et al. [370] applied GANs for image synthesis from text (reverse captioning). For example, a trained GAN may produce images that match a description such as ‘white with some black on its head and wings and a long orange beak’. Along with text, the image location can be conditioned using a Generative Adversarial What-Where Network (GAWWN), which incrementally builds large images with the support of an interactive interface and bounding boxes supplied by the user [371]. As for CGAN, besides synthesizing new samples with certain features, it permits users to create tools to edit images [372].

      To maximize the activation of one or many neurons in a separate classifier network, Nguyen et al. [373] introduced an approach that synthesizes new images via gradient ascent in the latent space of the G network. An extension of this method incorporated an extra prior on the latent code, which enhanced sample diversity and quality and yielded high-quality images (resolution: 227 \(\times\) 227) for all ImageNet categories [374]. Additionally, Plug and Play Generative Networks (PPGNs) were introduced, comprising (1) a G network that can draw multiple image types and (2) a replaceable condition network C that tells G what to draw. As a result, images could be conditioned on a class (C = ImageNet/MIT Places classification network) or on a caption (C = image captioning network).

      Next, Salimans et al. [375] trained GAN models with novel features, focusing on two aspects: semi-supervised learning and the generation of images that humans find visually realistic. This model yielded accurate outputs in semi-supervised classification on SVHN, MNIST, and CIFAR10. A visual Turing test confirmed the high quality of the generated images: the CIFAR10 samples yielded a 21.3% human error rate, and the MNIST samples were almost indistinguishable from real data.

      Wasserstein GAN (WGAN) was used by Huang et al. [376] for density reconstruction in dynamic topography. WGAN was proposed by Arjovsky et al. [377] to enable stable training, but it can still fail to converge or produce poor samples. These issues, according to Gulrajani et al. [378], were due to the use of weight clipping to enforce the Lipschitz constraint on the critic. As an alternative to weight clipping, they penalized the norm of the critic gradient with respect to its input. This resulted in more stable training for multiple GAN MAs with almost no hyperparameter tuning, including 101-layer ResNets and language models with continuous G, as well as high-quality generation on LSUN and CIFAR10. Based on what was discussed above, we believe GAN is an effective solution to generate more data to address both lack of data and imbalanced data [359,360,361, 379, 380].
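The gradient penalty of Gulrajani et al. [378] can be illustrated with a toy critic whose input gradient is known in closed form. The sketch assumes a linear critic f(x) = w . x, whose gradient with respect to x is w everywhere, and a penalty weight of 10. In practice the critic is a deep network and the gradient is obtained by automatic differentiation, and all names below are hypothetical.

```python
import math


def linear_critic_gradient(weights, x):
    """Gradient of the toy critic f(x) = w . x with respect to x.

    For a linear critic the gradient equals w at every point x.
    """
    return list(weights)


def gradient_penalty(weights, real, fake, eps, lam=10.0):
    """Penalty lam * (||grad f(x_hat)|| - 1)^2 at a point x_hat that
    interpolates between a real and a generated sample."""
    x_hat = [eps * r + (1.0 - eps) * f for r, f in zip(real, fake)]
    grad = linear_critic_gradient(weights, x_hat)
    norm = math.sqrt(sum(g * g for g in grad))
    return lam * (norm - 1.0) ** 2


real, fake = [1.0, 2.0], [0.0, 0.0]
penalty_unit = gradient_penalty([0.6, 0.8], real, fake, eps=0.3)   # gradient norm 1
penalty_steep = gradient_penalty([3.0, 4.0], real, fake, eps=0.3)  # gradient norm 5
```

A critic whose gradient norm is 1 incurs no penalty, while steeper critics are penalized quadratically. This soft constraint replaces hard weight clipping.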

    2.

      Image inpainting

      Image inpainting reconstructs the missing parts of an image in such a way that the reconstructed areas are undetectable. Hence, damaged areas are restored and undesired objects are removed from images. GANs have been applied to address this issue [381,382,383,384].

      Recent DL approaches are able to fill in missing parts of images via image inpainting, yielding plausible image textures and structures. Inferring arbitrarily large missing image regions from image semantics is called ‘semantic inpainting’ [385, 386]. Because it demands high-level context prediction, this task is more difficult than image completion or classical inpainting methods, which remove whole objects or correct spurious data corruption.

      A method based on a deep generative model was introduced by Yu et al. [387] to exploit surrounding image features and synthesize image structures for better predictions. This feed-forward CNN model can process images with multiple holes of varied sizes at arbitrary locations during the testing phase. Experimental work involving natural images (Places2 and ImageNet), textures (DTD), and face samples (CelebA and CelebA-HQ) revealed that the introduced model yielded higher-quality inpainting outcomes. Another study introduced a DL-based inpainting system to complete images using free-form masks and inputs [388]. The system uses gated convolutions, learned from millions of unlabelled images, to address the problem that vanilla convolutions treat all input pixels as valid; gated convolutions generalize partial convolutions by offering a mechanism to learn dynamic features for each channel at each spatial location across all layers.

      A GAN loss (SN-PatchGAN), which applies a spectral-normalized D to dense image patches [388], is fast, simple, and offers stable training. Results on automatic image inpainting and its extended version revealed more flexible and higher-quality outputs. Nazeri et al. [389] built a two-stage adversarial model comprising an edge G followed by an image completion network. The edge G hallucinates the edges of the missing regions in an image, and the image completion network fills in the missing regions using the hallucinated edges as a prior. The model was assessed using the Paris Street View, CelebA, and Places2 datasets.

      A new semantic image inpainting model was proposed by Yeh et al. [390] based on the GAN MA, whereby semantic inpainting was viewed as an image generation problem. Using a trained adversarial model [391, 392], the method searches the latent space for the encoding of the corrupted image that is ‘closest’ to the target image. Next, the image is reconstructed using G from this encoding. ‘Closest’ is defined by a weighted context loss on the corrupted image, while a prior loss penalizes unrealistic images. In comparison to the Context Encoder (CE), this approach does not require masks for training and can be applied to arbitrarily structured missing areas at the inference phase. This technique was assessed with the CUB-Birds [393], CelebA [394], and SVHN [395] datasets with varied missing areas. The method gave more realistic images than other approaches.

    3.

      Super-resolution

      Upscaling images or videos requires super-resolution, which upgrades low-resolution images to high resolution by adding realistic image details learned at the training phase [396,397,398]. For instance, a new training approach was introduced by Karras et al. [399] to progressively grow G and D: training begins at low resolution, and new layers are added incrementally to model fine details as training proceeds. This approach offers better speed and stability during training, thus generating high-quality images using CelebA.

      The SRGAN approach [400], an extension of prior models, adds an adversarial loss element that constrains images to stay on the manifold of natural images. The G in SRGAN takes low-resolution images and infers realistic natural images with a fourfold upscaling factor. Unlike in other GAN models, the adversarial loss is one component of a larger loss function that also incorporates a perceptual loss from a pre-trained classifier, as well as a regularisation loss that encourages spatially coherent images. The adversarial loss constrains the overall solution to the manifold of natural images, thus generating better solutions. Access to curated training data is a hindrance to DL model customization. Nonetheless, SRGAN can be customized to specific domains in a straightforward manner because new training image pairs are constructed easily by down-sampling a corpus of high-resolution images. Essentially, the image domain of the training set dictates the kind of realistic details that the GAN produces.
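The construction of training pairs by down-sampling can be sketched as follows. The example assumes a grayscale image stored as a list of rows and uses box averaging with a factor of 4. The averaging filter is an assumption made for simplicity, and other down-sampling kernels can be used in the same way.

```python
def downsample(image, factor):
    """Box-average a 2-D image (list of rows) by an integer factor.

    The image height and width must be divisible by `factor`.
    """
    height, width = len(image), len(image[0])
    out = []
    for r in range(0, height, factor):
        row = []
        for c in range(0, width, factor):
            block = [image[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def make_training_pair(high_res, factor=4):
    """Return (low-resolution input, high-resolution target) for super-resolution."""
    return downsample(high_res, factor), high_res


# Toy 8 x 8 "image" whose pixel value is 8 * row + column.
high_res = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
low_res, target = make_training_pair(high_res)
```

Because the pairs are derived from the high-resolution corpus itself, no manual annotation is needed to adapt a super-resolution model to a new image domain.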

      To improve SRGAN visual quality, Wang et al. [401] revisited three of its elements: network architecture, adversarial loss, and perceptual loss, which led to the Enhanced SRGAN (ESRGAN). The fundamental network building unit is the Residual-in-Residual Dense Block (RRDB) without batch normalization. The idea of the relativistic GAN was borrowed to let D predict relative realness rather than an absolute value. To gain stronger supervision for brightness consistency and texture recovery, the perceptual loss was enhanced by using features before activation. The ESRGAN gave higher visual quality with more natural and realistic textures than SRGAN, and it won the PIRM2018-SR Challenge (region 3) with the best perceptual index.

      As many techniques end up yielding low-quality and low-resolution images in real scenarios, Bulat et al. [402] introduced a two-stage process: (1) High-to-Low GAN is trained to learn down-sampling and degrading images with high-resolution, and (2) the network output is applied to train Low-to-High GAN in order to generate images with super-resolution.

    4.

      Video prediction and generation

      Understanding object motion and scene dynamics is a core problem in computer vision. A model of scene transformation is needed for both video generation (prediction of the future) and video recognition (action classification). Building this model is, however, not easy because of the many ways in which scenes and objects can change [403, 404]. A GAN for video was proposed by Vondrick et al. [405] to untangle the scene foreground from the background via a spatiotemporal convolutional architecture. The proposed model could generate short videos of up to one second at full frame rate better than simple baselines, and it was useful for predicting plausible futures of static images. Further assessment revealed that the model could learn features for recognizing actions with minimal supervision, which suggests that scene dynamics are a promising signal for representation learning. Several works were proposed for the same purpose using GANs [404, 406, 407].

      The Motion and Content decomposed GAN (MoCoGAN) was introduced by Tulyakov et al. [408] to generate videos. Videos are generated by mapping a sequence of random vectors, each with a content part (fixed) and a motion part (stochastic), to a sequence of video frames. Using both image and video Ds, a new adversarial learning mechanism was devised to learn the decomposition of content and motion in an unsupervised manner. The model efficacy was verified empirically via quantitative and qualitative approaches. This approach has been improved in different ways [360, 404, 407].

    5.

      Anime character generation

      Animation production and game development are costly and require experts even for routine tasks. Anime characters can be colorized and auto-generated using GANs [409,410,411,412,413]. The G and D in these models consist of convolutional layers, batch normalization, and ReLU activations with skip connections. CartoonGAN, a solution that transforms real-world photos into cartoons, was introduced by Chen et al. [414] for computer graphics and computer vision applications. Training is easy because it uses unpaired photos and cartoon images. The two losses for cartoon stylization are (1) a semantic content loss (a sparse regularisation on the high-level feature maps of the VGG network to cope with the style variation between photos and cartoons) and (2) an edge-promoting adversarial loss (to preserve clear edges). To automatically generate anime characters, Jin et al. [411] combined GAN training methods and a clean dataset to yield realistic facial images. The SRResNet was modified to form the G model (see Fig. 17), which has 16 Res-Blocks and applies 3 sub-pixel CNN layers to upscale the feature map. The architecture of D displayed in Fig. 17 has 10 Res-Blocks. Because correlations within a mini-batch lead to unwanted gradient norm calculations, batch normalization layers were removed from D. Additional fully connected layers were added after the final convolution layer as the attribute classifier. Weights were initialized from a Gaussian distribution with a mean of 0 and a standard deviation of 0.02. Figure 18 portrays anime characters generated by a GAN.

      Fig. 17

      The architecture of Anime G & D

      Fig. 18

      Anime samples generated by GANs

    6.

      Image-to-image translation

      The translation of input images to output images, a recurring theme in computer vision, computer graphics, and image processing, can be performed using CGANs. The pix2pix model addresses these image translation problems [415,416,417]. In addition to learning the mapping from input to output images, the pix2pix model learns a loss function to train this mapping. It yields exceptional outcomes for varied computer vision problems that previously demanded separate machinery, such as black-and-white image colorization, semantic segmentation, and generating maps from aerial photos [415].

      The model was extended to produce CycleGAN [418] by adding a cycle consistency loss that seeks to preserve the original image after a cycle of translation and reverse translation. As paired images are no longer needed in the training phase, the data preparation process becomes simpler and the technique is open to many more applications. Artistic style transfer [419], for example, renders a natural image in the style of Monet or Picasso by training on unpaired paintings and natural images. GANs can generate novel samples that match the training set, and related approaches include style transfer (modifies the visual style of an image), domain adaptation (generalization to new domains with unlabeled data in the target domain), and TL (import of existing knowledge to simplify learning) [420]. Nonetheless, the general analogy synthesis problem had not been addressed. Hence, Taigman et al. [420] tackled this problem: given separate unlabeled samples from domains S and T and a multivariate function f, they learn a mapping \(G: S \rightarrow T\) such that \(f(x) \sim f(G(x))\). DNNs of a certain structure were applied, where G is the composition of the input function f and a learned function g. A compound loss that integrates multiple terms was deployed as well. The proposed technique can be applied to visual domains (face images and digits) and generates realistic new images from unseen samples, while retaining their identities.
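The cycle consistency loss used by CycleGAN can be sketched with toy translators. The example assumes an L1 reconstruction error and represents the two mappings as plain Python functions on lists of numbers. In a real model both mappings are trained generator networks, and the function names here are hypothetical.

```python
def l1(a, b):
    """Mean absolute difference between two equally sized vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(x, y, g, f):
    """Sum of the forward cycle x -> g(x) -> f(g(x)) and the
    backward cycle y -> f(y) -> g(f(y)) reconstruction errors."""
    return l1(f(g(x)), x) + l1(g(f(y)), y)


def to_y(x):
    """Toy translator from domain X to domain Y."""
    return [2.0 * v + 1.0 for v in x]


def to_x_exact(y):
    """Exact inverse of to_y: the cycle returns the original input."""
    return [(v - 1.0) / 2.0 for v in y]


def to_x_biased(y):
    """Imperfect inverse of to_y: the cycle drifts away from the input."""
    return [(v - 1.0) / 2.0 + 0.5 for v in y]


x = [0.0, 1.0, 2.0]
y = [1.0, 3.0, 5.0]
loss_exact = cycle_consistency_loss(x, y, to_y, to_x_exact)
loss_biased = cycle_consistency_loss(x, y, to_y, to_x_biased)
```

The loss is zero only when the two translators invert each other, which is the property that removes the need for paired training images.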

      Chen et al. [421] split the generative network into two networks, each dedicated to one subtask. The attention network estimates spatial attention maps of images, while the transformation network translates objects. The attention map produced in the initial step should be sparse, so that attention is focused on the target object, and should remain constant regardless of the transfiguration. Additional supervision from image segmentation can be used when learning the attention network. The outcomes revealed the importance of modeling attention during transfiguration, whereby the introduced algorithm can learn precise attention maps that enhance the quality of the generated images.

      In the Multimodal Unsupervised Image-to-image Translation (MUNIT) model introduced by Huang et al. [422], the image representation is decomposed into a content code (domain-invariant) and a style code (captures domain-specific attributes). The translation of an image to another domain involves the recombination of its content code with a random style code sampled from the target domain. Comparisons with other current models demonstrated the advantages of the proposed model.

      The Exemplar Guided and Semantically Consistent Image-to-image Translation (EGSC-IT) network introduced by Ma et al. [423] conditions the translation process on an exemplar image in the target domain. An image is assumed to consist of a content component (shared across domains) and a style component (specific to each domain). Adaptive Instance Normalisation is applied to the shared content component to enable style information transfer from the target domain to the source domain. The concept of feature masks was deployed to avoid semantic inconsistency during translation (due to large inner- and cross-domain variations) and to offer coarse semantic guidance in the absence of semantic labels. SingleGAN was introduced by Yu et al. [424] to perform multi-domain image-to-image translation with a single G. A domain code was deployed to control the varied generative tasks and to integrate multiple optimization goals. The results for unclassified data revealed superior performance by the proposed model when translating between two domains. CycleGAN has been used in several applications such as medical imaging and plant diseases to address the issue of imbalanced datasets [425,426,427,428]. Figure 19 shows an example of CycleGAN with CT images.

      Fig. 19

      An example of medical image translation [429]

    7.

      Text-to-image translation

      One of the impressive applications of GANs is text-to-image translation [430,431,432,433]. Fedus et al. [434] enhanced the quality of text samples using GANs, which explicitly train G to yield high-quality samples and have been successful in image generation. Their actor-critic CGAN fills in missing text conditioned on the surrounding context. This gave more realistic conditional and unconditional text samples, both quantitatively and qualitatively, in comparison to a model trained with maximum likelihood.

      With the benefits of automatic synthesis of realistic images from text, Denton et al. [354] applied the Laplacian pyramid with adversarial G and D to synthesize images at many resolutions. Images with high resolution that can condition on class labels were produced with control. Using a standard convolutional decoder, Radford et al. [355] built a stable and effective MA by including batch normalization to attain exceptional image synthesis outcomes.

      The GAWWN was used by Reed et al. [370] to synthesize images from text descriptions (reverse captioning). Besides conditioning on image location [371], the model supports an interactive interface that incrementally builds up large images from textual descriptions and bounding boxes supplied by the user. CGANs not only synthesize new samples with certain features but also enable the development of tools to intuitively edit images, such as editing the hairstyle or giving a person a younger look [435]. Figure 20 shows an example of text-to-image translation.

      Fig. 20

      An example of text-to-image translation [436]

    8.

      Face aging

      Face age progression and regression (or face aging and rejuvenation) render face images with or without the aging effect, while preserving personalized face features (i.e., personality) [437,438,439,440]. A conditional AAE (CAAE) was introduced by Zhang et al. [441] to learn the face manifold. Controlling the age attribute on this manifold gives the flexibility to achieve regression and progression concurrently. Some advantages of CAAE are: (1) it achieves both age regression and progression while producing realistic face images, (2) it requires neither paired samples during training nor labeled faces during testing, which makes the model general and flexible, (3) disentangling personality and age in the latent vector space preserves personality and avoids ghosting artifacts, and (4) it is robust against occlusion, pose, and expression variations, as CAAE imposes a D on both the encoder and G. The Ds on the encoder and G offer a smooth transition in the latent space and realistic face images, respectively. Thus, CAAE yields images with higher quality than AAE. The CAAE was assessed with the CACD [442] and Morph [443] datasets.

      A synthetic aging method for human faces was introduced by Antipov et al. [444] using the Age-conditional GAN (Age-cGAN), which comprises two steps: (1) input face reconstruction, which requires solving an optimization problem to find the optimal latent approximation, and (2) face aging, which is performed by simply changing the condition at the G input. This approach introduces ‘Identity-Preserving’ latent vector optimization that preserves the original identity during the reconstruction phase, besides modifying other facial features. Figure 21 shows an example of face aging.

      Fig. 21

      An example of face aging [444]

    9.

      Image blending

      Mixing two images is called ‘image blending’, where the pixel values of the output image are a combination of those of the input images; GANs have shown excellent performance on this task [445].

      The dense image matching method was introduced by Gracias et al. [446] to enable copying and pasting of only the relevant pixels. However, significant differences between the source images make the method unusable. An alternative is to make the transition smooth so as to hide artifacts in the composited image.

      The Gaussian–Poisson GAN (GP-GAN), which was introduced by Wu et al. [447], combines the strengths of GANs and classical gradient-based approaches; it was the first study to assess the ability of GANs in the high-resolution image blending task. The Gaussian–Poisson Equation was developed to formulate the high-resolution image blending problem as a joint optimization constrained by color and gradient data. Color data are obtained from the Blending GAN, which was introduced to learn the mapping between composited and well-blended images, while gradient data are generated from gradient filters. Apart from producing realistic and high-resolution images, the proposed model generated fewer bleeding and other undesired artifacts. The experimental outcomes verified the superior performance of the proposed model over other models on the Transient Attributes dataset.

Model architecture

Some solutions related to MA help to deal with small datasets. These solutions are useful when it is impossible to collect or generate more training data.

  1.

    Model complexity

    Reducing the complexity of a DL model when the dataset is limited can help avoid overfitting and improve generalization to new, unseen data. This can be achieved by reducing the number of layers or nodes in the model, adopting simpler activation functions, or applying regularisation techniques. While reducing model complexity can mitigate the risk of overfitting, it may also limit the model’s capacity to represent complex relationships in the data, resulting in underfitting and lower accuracy. Furthermore, reducing model complexity may limit the model’s ability to learn from high-dimensional data, which can lead to poorer performance in tasks such as medical image analysis or speech recognition. Therefore, it is crucial to carefully balance the trade-offs between model complexity and model performance on both the training and test data [448,449,450,451].

    Brigato et al. [452] performed a wide variety of experiments with varied DL MAs on datasets of limited size. They showed that model complexity is a critical factor when only a few samples per class are available. In contrast to the literature, the authors revealed that, in several configurations, current results may be improved by using models with low complexity. For instance, with scarce training data and without data augmentation, low-complexity CNNs can perform better than current MAs. They added that recognition performance may be improved by large margins with standard data augmentation, which signifies the importance of devising more complex data augmentation and generation models when data are limited. Lastly, they reported that dropout, a broadly applied regularisation method, maintains its role despite data scarcity. Their findings were empirically validated on sub-sampled versions of the CIFAR10, Fashion-MNIST, and SVHN benchmarks.
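
To make the notion of "reducing the number of layers or nodes" concrete, the following minimal sketch counts the trainable parameters of two fully connected networks. The layer sizes are illustrative assumptions (784 inputs corresponds to a 28×28 image such as those in Fashion-MNIST) and are not taken from the cited studies.

```python
def dense_parameter_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, input first.
    Each layer contributes (n_in + 1) * n_out parameters (weights plus biases).
    """
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# Hypothetical architectures chosen only for illustration.
wide = dense_parameter_count([784, 512, 512, 10])
small = dense_parameter_count([784, 64, 10])

print("wide network parameters: ", wide)
print("small network parameters:", small)
```

With only a few hundred labeled samples, the smaller network has far fewer parameters to fit, which is one reason it may be less prone to overfitting.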

  2.

    Loss functions

    Loss functions are an essential component of DL models, as they are used to measure the difference between predicted and actual values. In the case of data scarcity, selecting an appropriate loss function becomes critical as the model needs to be trained with limited data samples. Therefore, it is essential to analyze and evaluate different loss functions that can help address the data scarcity problem. Some of the commonly used ones are:

    • Mean Squared Error (MSE) is a popular loss function used in DL for regression problems. It measures the average squared difference between predicted and actual values [453].

    • Mean Absolute Error (MAE) measures the average absolute difference between predicted and actual values [454]. It is also used for regression problems.

    • Cross-Entropy Loss is widely used for multi-class classification problems. It measures the dissimilarity between the predicted probability distribution and the actual probability distribution of the target variable [455]. It is commonly used in tasks such as image classification and natural language processing.

    • Hinge Loss is commonly used for binary classification problems, particularly in support vector machines (SVMs). It encourages correct classification by penalizing incorrect predictions linearly [456].

    • Focal Loss is well-known for imbalanced classification problems. It is designed to give more weight to hard-to-classify examples, reducing the impact of easy-to-classify examples and improving performance on the minority class. It is commonly used in object detection and segmentation tasks [457].

    • Triplet Loss is used for learning representations in siamese networks or other similar architectures. It measures the distance between anchor, positive, and negative samples [458].

    • Contrastive Loss is used to learn the similarity between two inputs. It penalizes the model when similar inputs are mapped far apart and when dissimilar inputs are mapped close together [459].

    • Sparsemax Loss is the loss associated with sparsemax, a sparse alternative to the softmax activation function, and can be used in classification tasks [460]. It allows the model to assign zero probability to irrelevant classes.

    • Kullback–Leibler (KL) Divergence Loss is used for measuring the difference between two probability distributions [453]. It is often used in generative models, such as Variational Autoencoders (VAEs).

    • Huber Loss is used in regression tasks and provides a combination of Mean Absolute Error (MAE) and Mean Squared Error (MSE) loss functions [461].

    • Quantile Loss is known for quantile regression problems. It measures the difference between the predicted quantile and the actual value at that quantile, with a different loss function for each quantile. It is commonly used in financial forecasting and risk analysis [462].

    • Center Loss is used for face recognition tasks and minimizes the distance between the features extracted by the DL model and their corresponding class centers [463].

    • Wing Loss is designed to be robust to outliers by penalizing large errors less than Mean Squared Error (MSE) Loss [464]. It is commonly used in tasks such as facial landmark detection and human pose estimation.

    • Cosine Loss is used to optimize the cosine similarity between two feature vectors in a high-dimensional space. It is commonly used in tasks such as face recognition and image retrieval [465].
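
A few of the losses listed above can be written in a handful of lines, which makes their differences easier to see. The following is a minimal NumPy sketch under stated assumptions: the function names, the toy arrays, and the default values `delta=1.0` and `gamma=2.0` are illustrative, and the focal loss is shown in its binary form without the optional class-weighting factor.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average squared difference."""
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error: average absolute difference."""
    return float(np.mean(np.abs(y_true - y_pred)))

def huber(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic for small errors, linear for large ones."""
    err = np.abs(y_true - y_pred)
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return float(np.mean(np.where(err <= delta, quadratic, linear)))

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Cross-entropy for binary labels and predicted P(class = 1)."""
    p_t = np.clip(np.where(y_true == 1, p_pred, 1.0 - p_pred), eps, 1.0)
    return float(np.mean(-np.log(p_t)))

def binary_focal(y_true, p_pred, gamma=2.0, eps=1e-12):
    """Binary focal loss: down-weights well-classified (easy) examples."""
    p_t = np.clip(np.where(y_true == 1, p_pred, 1.0 - p_pred), eps, 1.0)
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))

# Toy regression data; the last target acts as an outlier.
y_true = np.array([0.0, 1.0, 2.0, 10.0])
y_pred = np.array([0.0, 1.5, 2.0, 4.0])
print("MSE:", mse(y_true, y_pred))
print("MAE:", mae(y_true, y_pred))
print("Huber:", huber(y_true, y_pred))

# Toy binary classification data.
labels = np.array([1, 0, 1])
probs = np.array([0.9, 0.1, 0.6])
print("BCE:", binary_cross_entropy(labels, probs))
print("Focal:", binary_focal(labels, probs))
```

On the toy regression data, the single outlier dominates the MSE far more than the MAE or Huber loss. With `gamma=0`, the focal loss reduces to the ordinary binary cross-entropy.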

    In evaluating the performance of loss functions on the data scarcity problem, we can consider metrics such as accuracy, precision, recall, F1 score, and area under the curve (AUC). These metrics provide a comprehensive evaluation of the performance of the model in addressing the data scarcity problem. By comparing the performance of different loss functions on these metrics, we can determine which loss function is most effective in improving the model’s performance when training data is limited.

    In terms of processing time, different loss functions have different computational requirements. For instance, mean squared error and mean absolute error are computationally less expensive than cross-entropy and hinge loss. However, this difference in computational cost may be insignificant in practice, especially with the use of modern GPUs that can handle complex computations with ease.

    There are several challenges associated with selecting and using loss functions in deep learning. It can be challenging to choose the right loss function for a specific problem, especially when the data is scarce. Different loss functions have different strengths and weaknesses, and the wrong choice can lead to suboptimal results. Some of these challenges are:

    • Imbalanced datasets, in which one class has significantly fewer samples than the others, make it challenging to find a loss function that correctly identifies the minority class without misclassifying the majority class too often.

    • Noisy data can be a challenge when selecting an appropriate loss function. Noisy data can cause the model to learn incorrect patterns, leading to poor performance.

    • Overfitting is a risk with some loss functions, especially when the model is too complex or when the data is scarce. Overfitting occurs when the model learns to fit the training data too well, resulting in poor performance on the test data.

    • Optimization can be difficult for some loss functions, which can lead to slow convergence or getting stuck in local minima.

    • Model interpretability can be an issue because some loss functions are more difficult to interpret than others, making it harder to understand why the model makes certain predictions.

    In summary, selecting an appropriate loss function is critical in addressing the data scarcity problem in DL. Evaluating the performance of different loss functions using relevant metrics provides a comprehensive understanding of their effectiveness in improving model performance. While some loss functions may require more computational resources than others, this difference may be insignificant in practice, given the availability of modern computing infrastructure.

  3.

    Ensemble classifiers

    Ensemble classifiers are a powerful technique for addressing the problem of limited training datasets in DL. By combining the predictions of multiple models trained on different subsets of data or with different algorithms, ensemble classifiers can improve the overall accuracy, robustness, and generalisability of the model. Additionally, ensemble classifiers can help to reduce the risk of overfitting and identify and correct biases that may exist in any single model, making them a valuable tool for improving the reliability and accuracy of DL models in situations where training data is limited [14, 466, 467].

    Olson et al. [153] showed that DNNs can generalize well on small, noisy datasets despite memorizing the training data. To explain this behavior, the authors developed a novel perspective on NNs by viewing them through the lens of ensemble classifiers. When training NNs, the usual practice is to choose an architecture with adequate capacity to fit the training data and then scale it back with regularisation [468]. In contrast, a random forest fits the training data perfectly with very deep decision trees and then relies on randomization and averaging for variance reduction. This notion can be applied to DNNs: instead of each layer presenting an ever-increasing hierarchy of features, it is plausible that the final layers offer an ensemble mechanism. Finally, they reported that small datasets and relatively small network sizes have computational advantages, which allow for rapid experimentation. Some recent explanations of NN generalization are intractable on networks with millions of parameters; Schatten norms, for instance, require computing a full SVD [469]. In the context of the study, such calculations are trivial. Future studies should identify a mechanism for decorrelation and assess the link between decorrelation and generalization.
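
The idea of combining models trained on different subsets of the data and relying on randomization and voting can be sketched with bootstrap aggregation (bagging). This is a minimal illustration, not the method of any cited study: the nearest-centroid base learner, the synthetic two-class data, and all function names are assumptions chosen for brevity.

```python
import numpy as np

def fit_centroids(X, y):
    """Fit a nearest-centroid classifier: one mean vector per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_centroids(model, X):
    """Assign each sample to the class of the closest centroid."""
    classes, centroids = model
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dist, axis=1)]

def bagged_predict(X_train, y_train, X_test, n_models=25, seed=0):
    """Train base learners on bootstrap resamples and take a majority vote."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # sample with replacement
        model = fit_centroids(X_train[idx], y_train[idx])
        votes.append(predict_centroids(model, X_test))
    votes = np.stack(votes)  # shape: (n_models, n_test)
    classes = np.unique(y_train)
    counts = np.stack([(votes == c).sum(axis=0) for c in classes])
    return classes[np.argmax(counts, axis=0)]

# Synthetic data: two Gaussian blobs, with a deliberately small training set.
rng = np.random.default_rng(42)

def make_blobs(n_per_class):
    X0 = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, 2))
    X1 = rng.normal(loc=4.0, scale=1.0, size=(n_per_class, 2))
    X = np.vstack([X0, X1])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

X_train, y_train = make_blobs(20)
X_test, y_test = make_blobs(200)

pred = bagged_predict(X_train, y_train, X_test)
accuracy = float(np.mean(pred == y_test))
print("ensemble accuracy:", accuracy)
```

The variance reduction from voting is most visible with high-variance base learners such as deep decision trees or small neural networks. The simple base learner here only keeps the sketch short.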

Physics-informed neural network (PINN)

Physics-Informed Neural Network (PINN) is another DL technique that can cope with problems with insufficient data or even without labeled data [470,471,472,473]. Apart from using pure data, PINNs can also integrate physics laws to train neural networks for unknown systems [474]. The physics laws can be equations derived from conservation principles or empirical models obtained by calibration against observations, such as the Navier–Stokes equations for fluid mechanics [475], the Schrödinger equation for first-principles calculations [476], and the Black–Scholes equation in finance [477], to name but a few. For specific problems, these well-studied physics laws can effectively reveal the underlying relationships between variables of unknown systems from a higher point of view [1].

It is worth noting that a PINN can be considered an extension of a traditional DNN with respect to the loss function. Compared to traditional DNNs, PINNs add tailored loss terms derived from the physics laws, as shown in Fig. 22. In this manner, the final loss function can be a combination of the loss terms from the data and the physics laws, respectively.

Fig. 22

An illustration of a PINN structure. x and y are respectively the input and output of the neural network. The loss function of a PINN can contain two parts, namely the data-driven loss term and the physics law loss term. The output of the neural network can be directly compared to the ground truth data, which results in the data-driven loss term. In addition, the output of the neural network can be also substituted into the physics laws in terms of governing equations, which contributes to the physics law loss term

Up to now, physics-informed loss functions can be mainly categorized into two types: the collocation physics-informed loss function [473, 478, 479] and the variational physics-informed loss function [480, 481]. Collocation-type loss functions enforce the equations directly in the training process, aiming to drive the residuals of the physics equations toward zero [482]. Variational-type loss functions guide the training by finding the stationary point of functionals [479]. Using the variational-type loss requires professional knowledge and a comprehensive understanding of the training data. It is more complex to implement than the collocation-type loss, but it is computationally more efficient [481].
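
The combination of a data-driven term and a collocation-type physics term can be illustrated on a toy problem. The sketch below is an assumption-laden simplification: it uses the ordinary differential equation du/dx + u = 0 with u(0) = 1 as a stand-in physics law, evaluates fixed candidate functions instead of training a network, and approximates the derivative with central finite differences, whereas PINN implementations typically use automatic differentiation. The names `pinn_style_loss` and `weight` are hypothetical.

```python
import numpy as np

def pinn_style_loss(u, x_data, y_data, x_colloc, weight=1.0, h=1e-4):
    """Return (total, data_term, physics_term) for a candidate function u.

    data_term:    mean squared error against the labeled points.
    physics_term: mean squared residual of du/dx + u = 0 at collocation points.
    """
    data_term = float(np.mean((u(x_data) - y_data) ** 2))
    du_dx = (u(x_colloc + h) - u(x_colloc - h)) / (2.0 * h)  # central difference
    residual = du_dx + u(x_colloc)
    physics_term = float(np.mean(residual ** 2))
    return data_term + weight * physics_term, data_term, physics_term

# Only two labeled samples are available (scarce data).
x_data = np.array([0.0, 1.0])
y_data = np.exp(-x_data)

# Collocation points need no labels; they only probe the physics residual.
x_colloc = np.linspace(0.0, 2.0, 50)

def exact(x):
    """True solution of the toy problem."""
    return np.exp(-x)

def straight_line(x):
    """Fits both labeled points exactly but violates the equation."""
    return 1.0 + (np.exp(-1.0) - 1.0) * x

total_exact, data_exact, physics_exact = pinn_style_loss(exact, x_data, y_data, x_colloc)
total_line, data_line, physics_line = pinn_style_loss(straight_line, x_data, y_data, x_colloc)

print("exact solution: data =", data_exact, " physics =", physics_exact)
print("straight line:  data =", data_line, " physics =", physics_line)
```

Both candidates fit the two labeled points, so the data term alone cannot tell them apart. The physics term penalizes the straight line, which is how the governing equation compensates for the missing labels.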

PINNs have been widely used in problems where only insufficient data are available and the unknown systems are governed by known physics laws in the form of equations [479, 483,484,485,486]. As mentioned above, the physics laws are effective for specific problems. However, this prior knowledge of the unknown systems is normally ignored in traditional neural network applications. PINNs provide a novel way to train neural networks through those physics laws, which provide information representative of the unknown systems just as the data do. With the help of the physics laws, PINNs can perform well with insufficient data or even without labelled data [474]. The PINN was initially proposed by Raissi et al. [470] for solving Partial Differential Equations (PDEs) through neural networks. With the underlying physics, PINNs have been demonstrated to be more effective than traditional ML algorithms when data are insufficient or labelled data are absent [470]. Later, PINNs were applied in various fields, including computational mechanics [479, 484, 485, 487, 488], medicine [484], and geophysics [489]. Great efforts have been made to further investigate and improve the performance of PINNs. Stefano [490] thoroughly studied the performance and accuracy of PINNs on linear problems. Different optimizers, including Adam and L-BFGS, were also compared to provide some guidance on optimizer selection. Yang et al. [491] and Zhu et al. [492] proposed ways to quantify the uncertainty of PINNs. Wang et al. [493] investigated PINNs from the perspective of the training process, using numerical cases to understand how the loss function evolves in PINNs. Meanwhile, various training techniques, such as adaptive learning [494] and the Neural Tangent Kernel (NTK) [495], have been incorporated into PINNs to alleviate the scale differences between the loss terms. Furthermore, different types of neural networks have been applied to replace the Feedforward Neural Network (FNN) [496,497,498].

Deep synthetic minority oversampling technique (DeepSMOTE)

Recently, Dablain et al. [499] proposed a new method, DeepSMOTE, to generate synthetic images to address the issue of imbalanced data. DeepSMOTE leverages the properties of the successful SMOTE algorithm. It consists of three main components: (a) an encoder/decoder framework; (b) SMOTE-based oversampling; and (c) a dedicated loss function that is improved with a penalty term. DeepSMOTE has some significant advantages over other methods because, unlike GAN, there is no need for a discriminator during training. Furthermore, it generates high-quality artificial images compared with other methods as shown in Fig. 23. The performance of DeepSMOTE was validated on five benchmark datasets and it outperformed other methods.

Fig. 23

Comparison of DeepSMOTE to other methods [499]. a Original images. b Balancing GAN [500]. c Generative adversarial minority oversampling [501]. d DeepSMOTE

DeepSMOTE has been shown to be effective in improving classification performance on imbalanced datasets compared to traditional SMOTE and other oversampling techniques. However, it should be noted that DeepSMOTE may require more computational resources than traditional SMOTE due to the use of DL models.
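
The SMOTE-based oversampling component can be sketched as interpolation between a minority sample and one of its nearest minority neighbours. The code below is a minimal illustration under stated assumptions: the random vectors stand in for latent codes that an encoder would produce, the encoder/decoder and the penalized loss of DeepSMOTE are omitted, and the function name and parameters are hypothetical.

```python
import numpy as np

def smote_interpolate(minority, n_new, k=3, seed=0):
    """Create n_new synthetic vectors by SMOTE-style interpolation.

    Each synthetic vector lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires k <= len(minority) - 1.
    """
    rng = np.random.default_rng(seed)
    n = len(minority)
    dist = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return minority[base] + gap * (minority[chosen] - minority[base])

# Stand-in for encoder outputs of 10 minority-class images (assumption).
latent = np.random.default_rng(1).normal(size=(10, 4))
synthetic = smote_interpolate(latent, n_new=20)
print(synthetic.shape)
```

In a DeepSMOTE-style pipeline, the synthetic vectors would then be passed through the decoder to obtain artificial images.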

We believe that DeepSMOTE is one of the most effective solutions to address the lack of training data. Currently, it has been used for image generation and we believe it can be extended to work on other data modalities such as graphs and text data.

Pre-training and testing tips of using dataset

Some tips on training datasets are listed in this section. Prior to model training and evaluation, it is crucial to establish the project aims, the type of data, the anticipated setbacks, and the progress within the research area. Neglecting these may result in invalid outcomes and models that are too unreliable for publication.

  1.

    Understanding data

    Training data should be of high quality, derived from reliable sources, and gathered via a reliable method. For example, data from the Internet must be assessed for reliability, and any notes made by the authors about limitations of the data should be checked. The use of a dataset in multiple papers guarantees neither its quality nor its reliability, because any dataset could have drawbacks [502]. The principle of ‘garbage-in, garbage-out’ states that training a model on bad data yields a bad model. Data may be assessed using exploratory data analysis to check for inconsistencies or missing values [503]. This step should be taken prior to model training.

  2.

    Literature review

    Reviewing past studies is crucial to get a glimpse of the progress within the research area and the aspects left untapped. Although it can be disappointing to discover that one’s research interest has already been explored by other researchers, existing work allows the research scope to be broadened and limitations to be addressed, and it can serve as justification for the current research endeavor. Besides, through a literature review, one may identify a new opportunity to build on a partially solved problem. Therefore, reviewing the literature is essential to ascertain whether one is on par with the current state of research and can add meaningful knowledge to the subject area.

  3.

    Avoid analyzing all data

    Overanalyzing a dataset may yield insights and patterns that deviate from the modeling goal. Checking the dataset is an important step, but one should avoid making assumptions that cannot be tested later. In particular, the test data should not be examined closely during the exploratory analysis phase, since assumptions drawn from it could limit model generality. In fact, one reason that contributes to DL failure is data leakage from the test set into the training set [504].

  4.

    Data sufficiency

    A model should be trained with adequate data to ensure model generalizability. Data sufficiency is dictated by the signal-to-noise ratio: a weak signal requires more data, whereas a strong signal can be learned from less data. The issue of insufficient data may be addressed by making better use of existing data via cross-validation (CV) and by data augmentation methods such as rotation, flipping, zooming, and cropping to boost small datasets [505, 506]. In particular, data augmentation is useful for overcoming data insufficiency or ‘class imbalance’, i.e., fewer samples in certain classes [507]. Besides, limited data restricts the complexity of the DL model, as models with many parameters (e.g., DNNs) may easily overfit small datasets. Thus, data sufficiency must be ensured at the initial stage.

    This review focuses on the most popular solutions to address the issue of lack of training data which are TL, GANs, MA, PINN, and DeepSMOTE. This review will help to generate more data and handle small and imbalanced datasets.
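
The augmentation operations mentioned in this tip (flipping, rotation, and cropping) can be sketched with plain NumPy array operations. The function name, the crop size, and the synthetic 32×32 array standing in for an image are assumptions made for illustration.

```python
import numpy as np

def augment(image, crop=24, seed=0):
    """Return simple augmented variants of a 2-D (or H x W x C) image array."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    top = int(rng.integers(0, h - crop + 1))
    left = int(rng.integers(0, w - crop + 1))
    return {
        "hflip": image[:, ::-1],           # horizontal flip
        "vflip": image[::-1, :],           # vertical flip
        "rot90": np.rot90(image),          # rotation by 90 degrees
        "crop": image[top:top + crop, left:left + crop],  # random crop
    }

# Synthetic stand-in for a grayscale image (assumption).
image = np.arange(32 * 32).reshape(32, 32)
variants = augment(image)
for name, arr in variants.items():
    print(name, arr.shape)
```

Each variant keeps the original label, so a single labeled image yields several training samples. Whether a given transformation preserves the label depends on the application.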

  5.

    Domain experts

    A domain expert helps in identifying viable problems to solve, selecting the most suitable dataset and DL model, and publishing to the most appropriate audience. Dismissing the opinions of domain experts could lead to two scenarios: projects that do not solve useful problems, and problems solved in an unsuitable manner. An instance of the second scenario is the use of an opaque DL model for a problem that requires understanding how the model arrives at its result (e.g., for financial or medical decisions) [508]. At the start of a project, a domain expert makes the data more comprehensible and highlights predictive features. A successful project can be published in esteemed journals within the domain, thus benefiting the target audience.

  6.

    Preventing test data from leaking into training process

    It is crucial to use data that contributes to model generalizability. When test data leaks into model selection, configuration, or training, it can no longer provide a reliable measure of the generalizability of the DL model. Common causes of data leaks are using the entire dataset during variable scaling and data preparation, selecting features prior to data partitioning, and using the same test set to assess the generality of multiple models. To prevent data leakage, the data should be partitioned at the initial stage, and the test set should be used only once, to assess the generality of a single model at the final phase [509].
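
One of the leaks named in this tip, scaling variables with statistics of the entire dataset, can be avoided by partitioning first and fitting the scaler on the training part only. The following is a minimal sketch; the function name, the 80/20 split, and the synthetic data are assumptions.

```python
import numpy as np

def split_then_scale(X, test_fraction=0.2, seed=0):
    """Partition first, then standardize using training statistics only."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    X_train, X_test = X[train_idx], X[test_idx]

    # Statistics come from the training split only; the test split
    # never influences them, so no information leaks into training.
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + 1e-12
    return (X_train - mean) / std, (X_test - mean) / std

# Synthetic feature matrix (assumption).
X = np.random.default_rng(3).normal(loc=5.0, scale=2.0, size=(100, 3))
X_train_scaled, X_test_scaled = split_then_scale(X)
print(X_train_scaled.shape, X_test_scaled.shape)
```

The same order of operations applies to feature selection and any other data-dependent preprocessing step.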

  7.

    Validation set

    When training more than one model, it is essential not to use the test set to compare them. Instead, a separate validation set should be used for performance assessment. It consists of samples that are not used directly for training but are used to guide it. A test set that is used to guide training can no longer measure generality independently, because the model would eventually overfit the test set [510, 511]. One advantage of employing a validation set is that training can be halted early when the validation score begins to decrease, an indication that the model is overfitting the training dataset.
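
Halting training when the validation score begins to decrease can be expressed as a small rule. The sketch below assumes that a higher score is better and uses a `patience` parameter (an assumption, not from the text) to tolerate short fluctuations. The list of scores is made up for illustration.

```python
def early_stopping_epoch(val_scores, patience=2):
    """Return (stop_epoch, best_epoch) for a sequence of validation scores.

    Training stops once the score has failed to improve on the best value
    for `patience` consecutive epochs; the model from best_epoch is kept.
    """
    best, best_epoch, waited = float("-inf"), -1, 0
    for epoch, score in enumerate(val_scores):
        if score > best:
            best, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_scores) - 1, best_epoch

# Hypothetical validation accuracies per epoch.
scores = [0.60, 0.70, 0.75, 0.74, 0.73, 0.72]
stop_epoch, best_epoch = early_stopping_epoch(scores)
print("stop at epoch", stop_epoch, "and keep the model from epoch", best_epoch)
```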

  8.

    Suitable test set

    The generality of a DL model is measured using a test set. Model performance on the training set is not informative, because a complex model can easily learn the training set yet offer no generality. The test set must not overlap the training set and must represent the broader population. For instance, when a medical image dataset gathered only from healthy people is used for both training and testing, the test set cannot reveal how the model performs on abnormal patients and is thus not representative. The same problem arises when the same equipment is used to gather both the testing and training sets. In this case, the generalizability of the model cannot be established.

  9.

    Multiple model evaluation

    DL models are often unstable: a slight change in the training data can affect their performance. As a single model assessment may overestimate or underestimate the real model potential, multiple assessments are needed. This can be done by training the model several times with varied subsets of the training data. One popular method is cross-validation; in fivefold cross-validation, training is repeated five times, each time holding out a different partition of the data [512, 513]. Stratification is carried out when some classes are small, so that each class is represented adequately in every fold. It is crucial to report the individual scores as well as their mean and standard deviation for statistical comparisons [514].
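
Stratified fivefold partitioning can be sketched as follows. The helper name and the imbalanced label vector are assumptions; in practice, a library implementation would normally be used.

```python
import numpy as np

def stratified_kfold(y, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with class proportions preserved."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for c in np.unique(y):
        # Shuffle the indices of one class and deal them out across folds.
        idx = rng.permutation(np.flatnonzero(y == c))
        for i, part in enumerate(np.array_split(idx, k)):
            folds[i].extend(part.tolist())
    for i in range(k):
        test_idx = np.array(sorted(folds[i]))
        train_idx = np.array(sorted(
            j for f, fold in enumerate(folds) if f != i for j in fold))
        yield train_idx, test_idx

# Imbalanced toy labels: 40 samples of class 0 and 10 of class 1.
y = np.array([0] * 40 + [1] * 10)
splits = list(stratified_kfold(y, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx), int(np.sum(y[test_idx] == 1)))
```

A model would be trained once per pair, and the five resulting scores would be reported together with their mean and standard deviation.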

  10.

    Accuracy and imbalance dataset

    Metrics should be chosen carefully when assessing a DL model. A classification model evaluated with the accuracy metric (the fraction of samples correctly identified by the model), for example, may be misjudged on an imbalanced dataset. Suppose two classes make up 92% and 8% of the samples. A binary classifier that always predicts the majority class would achieve an accuracy of 92% while having learned nothing useful. Hence, metrics that are insensitive to class imbalance should be used in this case, such as the Matthews Correlation Coefficient (MCC) and Cohen’s kappa coefficient (κ) [515].
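
The 92%/8% example can be checked numerically. The sketch below compares accuracy with the MCC for a classifier that always predicts the majority class. The function names are illustrative, and returning 0 when the MCC denominator vanishes is a common convention adopted here as an assumption.

```python
import math
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels in {0, 1}."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient; 0.0 if the denominator is zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# 92 majority-class samples (label 0) and 8 minority-class samples (label 1).
y_true = np.array([0] * 92 + [1] * 8)
y_pred = np.zeros(100, dtype=int)  # always predict the majority class

acc = accuracy(y_true, y_pred)
score = mcc(y_true, y_pred)
print("accuracy:", acc)
print("MCC:", score)
```

The accuracy looks high although no minority sample is detected, whereas the MCC reports no correlation between predictions and labels.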

Applications

This section lists some applications in which DL is less explored due to limited training data. This opens the door for scholars to apply the listed solutions to limited training data in DL. For each application, we focus on four major points: (1) what it is, (2) the difficulties in collecting new data, (3) suggestions to address the lack of training data, and (4) sub-applications that can be investigated. These points were provided by experts in the area of each application.

Electromagnetic imaging (EMI)

Electromagnetic imaging (EMI), also known as microwave imaging, is applicable to a broad range of tasks, particularly in the medical field, e.g., breast cancer detection [516], diagnosis of stroke [517], intracranial bleeding detection [518], and traumatic brain damage [519]. Since promptly identifying the location and size of a bleed or tumor is crucial for effective treatment management, an accurate and rapid method is essential. Computed tomography (CT) and magnetic resonance imaging (MRI) systems are not always available, and they are costly, heavy, and bulky. Moreover, they can be used neither for frequent monitoring nor for onsite diagnosis in emergency cases. EMI may complement or even replace other imaging approaches, as it employs compact EM sensors (antennas) arranged around the body’s area of interest to measure transmission and reflection coefficients. These coefficients can be processed using many techniques, such as tomography and confocal methods [520], to support imaging-based detection, localization, and classification tasks. Tomographic techniques are time-consuming because they must compute numerous unknowns in large problems (i.e., several dozen measurements and tens of thousands of unknown image pixels). Forward and inverse solvers require high-precision electromagnetic simulation tools and costly hardware to solve the ill-posed tomography problem. Radar-based methods are ineffective for heterogeneous tissues and lesions [521] and thus fail to classify types of pathologies (which is possible in tomography based on dielectric contrast) [522]. DL can address some of these drawbacks. DL approaches yield results quickly, which gives them an advantage over conventional methods. The DNN, an ML algorithm, is effective at solving intricate and highly non-linear tasks. DNNs have revolutionized ML by providing superhuman performance, mostly in computer vision applications [1]. Because DL requires massive amounts of training data, which is a challenge in the EMI area, simulation is a viable way to generate training data despite its high computational cost [523, 524]. GANs (e.g., TimeGAN [525]) are believed to be a solution for EMI applications, as they are becoming popular in several applications, such as knee imaging systems [526], liver detection [527], and others [528]. Another solution is domain adaptation, a TL subfield, in which a model trained on one task is used as the starting point for, or adapted/transferred to, another task with less data.

Civil structural health monitoring

The use of DL algorithms in Structural Health Monitoring (SHM) is gaining popularity due to their high ability to detect structural defects in civil engineering [529, 530]. Moreover, civil engineering applications are growing rapidly due to the emergence of Big Data and the Internet of Things (IoT). DL is effective in a number of analyses, including classification, clustering, and regression of structural damage across tunnels, bridges, dams, and buildings [1]. Visual inspections are most often deployed to examine the status and health of structural systems. Despite the significance of this technique in the SHM area, it has several limitations in determining the extent and type of damage after long- and short-term events. With advancements in high-performance computing technologies and affordable sensors, SHM is becoming more effective and feasible. Many studies have assessed vibration-based damage identification in this particular segment. Numerous methods and algorithms have been developed to solve issues related to structures of varied complexity [531]. Data-based damage identification approaches can be used to perform pattern recognition, where NNs are used for their fault tolerance and adaptive learning capabilities. However, NNs are costly and demand massive training data. This setback has been addressed by using DL tools that perform feature extraction and classification in damage detection directly on raw and processed signals, without hand-designed features [532]. At the core of recent DL with big data, CNNs can learn from massive datasets. CNNs have been deployed for the classification of electrocardiogram signals [533] and for medical imaging such as MRI or CT [22, 253], but they are still new in SHM [534, 535] due to the lack of training data. Successful applications of CNNs in SHM include damage detection of steel frames [536], pavement and concrete crack detection [537], and overall system condition assessment [538]. Thus, the use of CNN-based DL models in damage identification tasks can effectively address SHM issues. The response data used for SHM purposes are mostly recorded in the time domain, while some studies transform the data from the time domain to the frequency or time-frequency domain to detect damaged structures [539]. The main challenge in SHM with DL is data availability. TL is an effective solution, as demonstrated in [540], which showed the efficacy of TL in SHM using varied sensors for similar structural systems. Other solutions are GANs and DeepSMOTE, which generate more data for training [541].

Meteorology applications

AI has been applied successfully through DL models in robotics, image and speech recognition, meteorological applications, and strategic games [542]. Some studies reported better weather forecasts after embedding DL and big data mining into the weather prediction framework [543, 544]. The question is: can DL methods fully substitute the present data assimilation systems and numerical weather models? The integration of advanced DL models with weather and climate science is bound to progress rapidly and be adopted in advanced computer systems. However, meteorological DL lacks benchmark datasets (covering automatic weather stations (AWS), radar, meteorological balloons, and satellites) with baseline scores and models, which would ease the use of DL in experimenting with varied approaches and resolving meteorological issues. Despite the vast meteorological data accessible from weather research institutions, correct use of these data demands knowledge of data formats and of the Earth system. Hence, these tools may be advanced by integrating DL models [542]. One should carefully weigh the requirements and objectives of weather forecasting when substituting costly numerical weather prediction (NWP) computation with DL models. Some crucial criteria for weather forecasting are concepts based on numerical models and are inapplicable to DL. Consistency of forecast outcomes is often undermined by numerical modelers even though it is part of the criteria in the NWP system. Weather forecasting may be framed as a Big Data problem of mapping observations of the Earth system to forecasts, thereby substituting the whole NWP framework, which includes output processing, data assimilation, and numerical modeling. Framed in this way, the weather forecasting problem is more suitably addressed using DL models than the classical numerical modeling of NWP [544, 545]. Physical constraints need to be considered in NN design when applying DL to weather forecasting. Some variables of the NWP may function as regularizers in the DNN latent space. Therefore, end-to-end DL-based weather forecasting may generate better outcomes for specific demands by exploiting small-scale patterns in the data, which is not viable in the NWP system. It is still too early to tell whether DL will replace most or all of the NWP system. One advantage for DL in meteorology applications is the availability of unlabelled data. We believe same-domain TL can make huge advancements, as it is based on the use of unlabelled data together with a small amount of labeled data [23].

Medical imaging applications

One setback in medical image analysis is inadequate data to train DL models. As manual labeling is needed to assess medical images, human annotators with varied backgrounds are involved. However, this annotation step is costly, time-consuming, and error-prone. Large training datasets are important for DL models to achieve generalization in all applications, especially in medical imaging applications [15, 546,547,548,549,550]. This section lists some medical image areas that face the issue of insufficient training data, together with possible solutions.

  1.

    Diabetic foot ulcer

    Diabetic foot ulcer (DFU) is a serious diabetic complication that may lead to amputation of the foot [551]. Most often, DFU is found at the heel, accompanied by skin color changes, dry cracks, skin temperature variance, leg pain, and edema. A worsening DFU can be life-threatening, and its treatment is costly. Detecting and diagnosing areas of ischemia and infection is essential when predicting the amputation risk of DFU [552]. Ischemia stems from chronic diabetes, as diabetes adversely affects blood circulation. Ischemia can be detected by palpating the blood flow pulses in the foot [553], while DFU infection worsens due to poor foot reperfusion [554]. DFU detection is challenging for the following reasons: (a) variation in DFU appearance (size, location, and shape), (b) inter- and intra-class differences, and (c) lighting conditions. Although medical investigations covering physical examination, the blood vessels in the leg, bacteriology, and blood tests are extensive, this information is rarely made publicly available [553, 555]. To the best of our knowledge, the two public DFU datasets [551, 556] appear to be too small to train DL models. One effective solution is TL, as implemented by Alzubaidi et al. [23, 124, 256]. GANs may also be a good solution in this area and are worth investigating.

  2.

    Sickle cell anemia

    Red blood cells (RBCs) are essential to gas exchange between the external environment and living tissue. Hemoglobin is the RBC protein that transports oxygen throughout the body [164], which also directs all life after 6 weeks of age. Hemoglobin is composed of two alpha chains and two beta chains [557]. A child may be diagnosed with sickle cell anemia if both parents contribute an abnormal hemoglobin gene, so that healthy hemoglobin (HbA) is substituted with sickle hemoglobin (HbS) [558]. One has the sickle cell trait when half of the HbA is replaced with HbS. The lifespans of a healthy RBC and a sickle cell are 120 days and 10–20 days, respectively. When deoxygenated, hemoglobin S polymerizes, which makes an RBC take on a sickle shape. The clinical state of a patient is categorized via cell morphology [559]. Accurate cell counting and segmentation are crucial in biomedicine, yet both are intricate processes [560]. Automated detection is hampered by overlapping cells, whereas precise classification requires clear-cut separation among cells [560]. Medical image segmentation and categorization are complicated by the varied intensity, signal strength, and noise of lesion cells [558, 559]. Features that aid these two processes are region, ellipticity, shape, cell texture, size, circulation, form factors, and elongation [559]. To the best of the authors' knowledge, there is a single public dataset, erythrocytesIDB [561]. The dataset has 626 images, which is inadequate to train a DL model. Lack of training data is the major obstacle to employing DL for this task. One solution is TL [164]. GANs and DeepSMOTE are worth investigating in this area. Focusing on shallow DL models can be another way to address the issue.

  3.

    Shoulder implant manufacturer

    Total Shoulder Arthroplasty (TSA) is the replacement of a damaged ball-and-socket joint in the shoulder with a prosthesis made of metal and polyethylene elements [562]. Intervention is needed if the prosthesis gets damaged. Treatment may be delayed if information about the prosthesis manufacturer and model is unavailable. However, AI-based systems can identify this information to speed up treatment. Thus, some papers have proposed DL models that use X-ray images to classify shoulder implants [562,563,564]. However, only small datasets were used for training, which reflects a lack of training data. Public datasets in this field are scarce, which could lead to overfitting. Therefore, TL may perform well in this area, since a huge number of X-ray images is available from similar domains.

Wireless communications

It is crucial to convey information over a wireless medium from one point to another rapidly, reliably, and securely. The wireless communication field involves designing waveforms (e.g., long-term evolution (LTE) and fifth generation (5G) mobile communications systems), modeling channels (e.g., multipath fading), managing the impacts of interference (e.g., jamming) and traffic (e.g., network congestion), compensating for radio hardware defects (e.g., RF front-end non-linearity), constructing communication chains (i.e., transmitter and receiver), recovering distorted symbols and bits (e.g., forward error correction), as well as supporting wireless security (e.g., jammer detection). Both the design and deployment of traditional communication systems rely on strong probabilistic analytic models and assumptions [565]. Nevertheless, conventional communication theory has drawbacks in managing optimization complexity and in using limited spectrum resources for upcoming wireless applications (e.g., augmented and virtual reality, spectrum sharing, IoT, and multimedia). New generations of wireless systems, which are empowered by cognitive radio, can learn from spectrum data and optimize their spectrum usage for better performance. These smart communication systems depend on many estimation, detection, and classification tasks to enhance situational awareness. To realize these tasks, DL offers automated and powerful means for communication systems to learn from the spectrum and adapt to spectrum dynamics [565, 566]. In wireless communication, the combination of interference effects, waveforms, traffic, and channel, along with structural complexities, tends to change rapidly over time. Wireless communication data are massive and arrive at high rates (e.g., GB/s for 5G), and they are exposed to security threats and harsh interference due to the wireless setting. 
Conventional modeling and ML methods often fail to capture the relationship between communication design and complex spectrum data, whereas DL can meet the reliability, speed, data rate, and security needs of wireless communication systems. One example is signal classification, in which received signals must be classified [567] using waveform features, where modulation at the transmitter adds information to the carrier signal by varying its properties (e.g., phase, amplitude, or frequency). Signal classification is essential in dynamic spectrum access (DSA): signals of the primary user (e.g., a television broadcast system), which holds a license to operate on a frequency, are detected by the secondary user (transmitter), which then avoids interference (i.e., it does not transmit at the same time on the same frequency). End-to-end communication systems based on DL have been deployed for single-antenna [568], multiple-antenna [569, 570], and multiuser systems [571] to improve on conventional approaches by jointly optimizing the receiver and transmitter as an autoencoder, rather than optimizing them in isolation. An autoencoder (a DNN) is composed of an encoder (which learns a data representation) and a decoder (which reconstructs the input data from the encoded data) [1]. Here, joint coding and modulation at the transmitter correspond to the encoder, while demodulation and decoding at the receiver correspond to the decoder. Joint optimization of the receiver and transmitter can mitigate interference caused by the presence of numerous transmitters. However, the following two obstacles must be addressed when applying the DL model:

  1.

    DL needs massive data to train complex DNN structures. Spectrum sensing cannot always provide this, mainly because a wireless user that spends much time on spectrum sensing might have insufficient time for other tasks, e.g., data packet transmission. Hence, too few data samples are available when training a DNN. Training data augmentation is therefore required to expand the training data gathered in spectrum sensing.

  2.

    Spectrum data change over time due to constantly changing transmission patterns, traffic effects, underlying channels, and interference. Thus, training data gathered in one setting could be unsuitable for another. One example is a change of channel, whereby the nodes of a wireless network move from outdoors to indoors in multiple directions, so that channel conditions are expected to vary. Training or testing data gathered in spectrum sensing can be adapted from one domain to another (e.g., low to high mobility) using domain adaptation.

    Notably, GANs are an effective method for generating synthetic data samples from a small amount of real data within a short learning span, and they have been used to augment training with synthetic data samples in cyber, computer vision, and text applications [351, 572]. In wireless communication, GANs capture the external effects of waveform features, traffic, channel patterns, and interference [573]. GAN-based augmentation of training data has been applied to channel measurement in spectrum sensing [574], modulation classification [575], jamming [576], and call data records for 5G networks [577]. Since the use of GANs for domain adaptation in wireless applications remains untapped, it is important to investigate GANs in this area. TL has shown great performance in this area [291,292,293,294,295,296,297]. Therefore, it is worth investigating TL for different applications of wireless communications.
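
As a rough illustration of the training data augmentation mentioned in the first obstacle, the sketch below expands a small set of sensed complex baseband (I/Q) samples with a random carrier phase rotation and additive white Gaussian noise. This is a deliberately simple, hand-crafted stand-in and not the GAN-based augmentation of the cited works; the function name `augment_iq`, the QPSK test burst, and the 10 dB SNR are assumptions made only for illustration.

```python
import cmath
import math
import random


def augment_iq(samples, snr_db, rng):
    """Return a phase-rotated, noise-corrupted copy of complex baseband samples."""
    # One random carrier phase offset is applied to the whole burst.
    rotation = cmath.exp(1j * rng.uniform(0.0, 2.0 * math.pi))
    # The average signal power sets the noise power for the requested SNR.
    power = sum(abs(s) ** 2 for s in samples) / len(samples)
    noise_power = power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power / 2.0)  # std. dev. per real/imaginary part
    return [
        s * rotation + complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
        for s in samples
    ]


rng = random.Random(0)
# Hypothetical QPSK burst (256 symbols) standing in for a sensed signal.
qpsk = [cmath.exp(1j * (math.pi / 4 + k * math.pi / 2)) for k in (0, 1, 2, 3) * 64]
# Five augmented variants of the same burst for the training set.
augmented = [augment_iq(qpsk, snr_db=10.0, rng=rng) for _ in range(5)]
```

Transforms of this kind only mimic effects that are modeled explicitly (here, phase offset and noise); a learned generator such as a GAN would be needed to capture channel and interference effects that are not known in closed form.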

Fluid mechanics

Fluid mechanics is the discipline that investigates the behavior of fluids [578]. Traditionally, the study of fluid mechanics deals with large volumes of data [579], including experimental data and numerical results. Therefore, combining DL techniques with fluid mechanics has naturally been considered a promising topic [580]. Great efforts have been made to incorporate DL techniques into fluid mechanics applications [581, 582]. However, unlike in computer vision and speech recognition, a complete, well-labeled database for fluid mechanics is currently hard to obtain [579]. Although fluid mechanics experiments have been significantly boosted by advanced equipment, most of that equipment is currently confined to small domains and laboratory settings [583]. Besides, even with state-of-the-art equipment, some field variables inside fluids are still difficult or even impossible to measure [583]. Furthermore, novel fluids with unique material properties keep emerging, which makes it harder to include all fluid data in a complete database. Hence, the lack of data greatly hinders the application of DL techniques to fluid mechanics. PINNs have eased this situation. Fluid mechanics problems are conventionally solved using governing equations, which describe fluid phenomena effectively and have been well studied. PINNs are therefore a suitable DL technique for fluid mechanics applications, because the governing equations compensate for the lack of data when training neural networks: the information missing from scarce data is supplied by the governing equations. Currently, many PINN-based frameworks have been proposed to deal with forward fluid mechanics problems [494, 584, 585]. 
Forward (direct) problems are the most common type of fluid mechanics problem. In this kind of problem, only the initial state of the fluid and the corresponding boundary conditions are given, and researchers want clear insight into the fluid over the whole spatiotemporal domain. With PINNs, the initial state and boundary conditions are enforced through the given data, while the evolution of the fluid is determined by the governing equations. The effectiveness of PINN-based frameworks for forward fluid mechanics problems has been demonstrated, and favorable results have been obtained [494, 584, 585]. PINNs have also received great attention for inverse fluid mechanics applications, which aim to extract information about the studied fluids from spatiotemporal observations. Based on PINNs, Raissi et al. [583] introduced the framework of Hidden Fluid Mechanics (HFM), as shown in Fig. 24. The Navier–Stokes equations, the well-known governing equations of fluid mechanics, are embedded into PINNs. Through the HFM framework, information on fluid flows in terms of the velocity and pressure fields can be extracted from experimental images. This has paved a novel avenue for dealing with inverse hydrodynamics problems and for studying fluid flow characteristics that may otherwise be complicated or even impossible to measure. Later, the same framework was applied to predict the pressure field within arteries with the help of Magnetic Resonance Imaging (MRI) results [484]. The MRI provides noisy, scattered measurement points. By integrating the governing equations with the noisy data, PINNs provide a reliable way of monitoring the conditions inside the arteries, which can greatly benefit surgical planning. Another interesting application of PINNs to fluid mechanics problems is the study of the fluid fields above an espresso cup with insufficient data [586], as shown in Fig. 25. 
In that application, only the temperature and density images measured by Tomographic background-oriented Schlieren (Tomo-BOS) are used to infer the corresponding velocity and pressure fields above a cup of espresso. The 3D velocity and pressure fields have been successfully visualized, as shown in Fig. 25c.
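
To make concrete how the governing equations stand in for missing data, the HFM setting can be sketched for a two-dimensional flow as follows. The non-dimensional form, the Péclet number \(\mathrm{Pec}\), the Reynolds number \(\mathrm{Re}\), and the unweighted sum in the loss are assumptions of this sketch rather than details stated in the text above. A network maps \((t, x, y)\) to \((c, u, v, p)\), and automatic differentiation yields the residuals of the transport and Navier–Stokes equations:

\[
\begin{aligned}
e_1 &= c_t + u\,c_x + v\,c_y - \mathrm{Pec}^{-1}\,(c_{xx} + c_{yy}),\\
e_2 &= u_t + u\,u_x + v\,u_y + p_x - \mathrm{Re}^{-1}\,(u_{xx} + u_{yy}),\\
e_3 &= v_t + u\,v_x + v\,v_y + p_y - \mathrm{Re}^{-1}\,(v_{xx} + v_{yy}),\\
e_4 &= u_x + v_y .
\end{aligned}
\]

Training then minimizes a data misfit on the \(N\) observed concentration values \(c^{n}\) together with the residuals at \(M\) collocation points:

\[
\mathcal{L} = \frac{1}{N}\sum_{n=1}^{N}\left|c(t^{n},x^{n},y^{n}) - c^{n}\right|^{2} + \sum_{i=1}^{4}\frac{1}{M}\sum_{m=1}^{M}\left|e_i(t^{m},x^{m},y^{m})\right|^{2}.
\]

Only \(c\) is supervised by data; \(u\), \(v\), and \(p\) are never observed and are constrained solely through the second term, which is why sparse observations of a single field can suffice.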

Fig. 24

(Adopted from [583])

Arbitrary training domain in the wake of a cylinder. A The domain where the training data for concentration and reference data for the velocity and pressure are generated by using direct numerical simulation. B Training data on concentration c(t, x, y) in an arbitrary domain in the shape of a flower located in the wake of the cylinder. The solid black square corresponds to a very refined point cloud of data, whereas the solid black star corresponds to a low-resolution point cloud. C A physics-uninformed neural network (left) takes the input variables t, x, and y and outputs c, u, v, and p. By applying automatic differentiation on the output variables, we encode the transport and NS equations in the physics-informed neural networks ei, i = 1,..., 4 (right). D Velocity and pressure fields regressed by means of HFM. E Reference velocity and pressure fields obtained by cutting out the arbitrary domain in A, are used for testing the performance of HFM. F Relative L2 errors are estimated for various spatiotemporal resolutions of observations for c. On the top line, we list the spatial resolution for each case, and on the line below, we list the corresponding temporal resolution over 2.5 vortex shedding cycles

Fig. 25

(Adopted from [474])

An example of using a PINN-based framework to study the fluid domain above a cup of hot espresso coffee. Only the temperature and density images are used as the training data, which is traditionally considered to be insufficient to predict the corresponding velocity and pressure fields. a The training data in terms of the 3D temperature and density images are captured by the Tomo-BOS system; b an example of the 3D captured temperature image from a; c the captured images are fed to a PINN to predict the corresponding 3D velocity and pressure fields

Microelectromechanical systems (MEMS)

Microelectromechanical systems (MEMS) technology is the process of creating micro-size devices. This technology merges electrical and mechanical components through an electrical circuit on a semiconductor chip. Different microfabrication techniques are used to fabricate MEMS devices in sizes ranging from the sub-micron level to the millimeter level, which can be integrated into a wide range of systems and applications. These micro-size devices are employed for sensing and controlling, resulting in an electrical response typically on the macro scale. MEMS is recognized as one of the most promising technologies in the industry and for research purposes [587,588,589,590]. In addition to micromachining technology, recent commercial MEMS sensors have made far-reaching changes in the industry and in consumer products using silicon-based microelectronics. To a large extent, MEMS technology and devices have positively affected our lives [591, 592]. These devices are used widely in medical applications and imaging, in biosensing applications to detect biological elements, in infrared radiation sensors to detect thermal images, and in all kinds of actuators and sensors. Figure 26 shows an SEM micrograph of a fabricated MEMS sensor.

Fig. 26

A surface micromachined resonator device that can be used as microsensor and microactuator [593]

Researchers have widely investigated and developed MEMS devices in different fields. In microfluidics, Pandey et al. used graphene with interdigitated electrodes to achieve high mobility and biocompatibility with the reagents and pathogens for the detection of the food-borne bacterium E. coli O157:H7 with a detection limit [594]. In energy harvesting, Nguyen et al. invented a MEMS electrostatic energy harvester with nonlinear springs to enhance the frequency response bandwidth [595]. Their device could generate a power of 85 nW at 560 Hz with a peak amplitude of 0.14 g and a bias voltage of 28.4 V. MEMS devices were also researched in thermal imaging to build small thermal sensors called microbolometers. Murphy et al. made a significant improvement in the development of \(640\times 512\) uncooled thermal arrays with a unit cell size of \(17\times 17\) \(\upmu {\text{m}}^{2}\). The fabricated detector showed an absorption peak of 80% in the spectral band of 8 to 14 \(\upmu {\text{m}}\) and a TCR of 2.4%/K [596]. Machine learning has also been combined with MEMS devices. Jain et al. developed a machine learning model to control and evaluate the eating habits of human beings using a six-point calibrated wearable MEMS tri-axial accelerometer [597]. Hao et al. investigated a machine-learning algorithm to assess helmet wearing by a human subject [598]. The helmets were built using available MEMS sensors for data-driven labor safety. Guo et al. investigated the use of machine learning in accelerating the MEMS design process with a proposed design using pixelated binary 2D. They used circular disk resonators as examples to demonstrate identifying vibrational modes and measuring the disk resonators’ corresponding frequencies [599].

The data obtained in the design and testing of MEMS devices differ depending on the type of sensor. In microfluidic design and testing, impedance values are collected at different frequencies. Different concentrations of viruses/proteins can be tested with MEMS microfluidic devices to understand the behavior of those pathogens. Few researchers have investigated the use of DL in the MEMS modeling and testing process because of the difficulty of collecting enough data to train DL models. Collecting this amount of data requires special types of equipment and models to be involved in the testing. Furthermore, many MEMS sensors need to be fabricated, as some of these sensors can be used only once when testing certain types of viruses/proteins, a process that is time-consuming and labor-intensive. However, the rapid development of DL models is expected to expedite the testing process and shorten the time taken to test the concentrations of different pathogens. DL models would provide new strategies and a powerful tool for the characterization and evaluation of MEMS processes. We believe that some of the solutions to the lack of training data described above (such as GANs and DeepSMOTE [499]) could help to increase the amount of data. With that, we expect to see a greater application of DL to MEMS.

Cybersecurity: vulnerabilities

In recent years, DL has enjoyed profound success in a range of applications such as natural language processing, computer vision, and speech recognition [1]. In addition to better computing resources, this has been mainly due to the large number of training datasets available to these applications. However, in cybersecurity research, the lack of large and high-quality datasets is still a significant problem that makes it hard for DL to address cybersecurity issues such as software vulnerabilities. In this section, we discuss the challenges and requirements of datasets regarding software vulnerabilities, a particular subset of cybersecurity problems found in computer software. Software security is a relatively new area, and the use of DL to improve software security has blossomed in recent years [600, 601]. Some important issues must be resolved to obtain useful datasets for detecting real-world vulnerabilities.

Software security datasets are extracted from source code; therefore, they largely depend on the programming languages (such as C/C++, C#, Java, Python, and PHP) used to develop the software. While there is a large set of open-source projects that can be used as DL datasets, most of them are insufficient for training software vulnerability models. To train robust models, generated datasets must possess the essential elements for the targeted applications [602,603,604]. Here, we focus on common dataset issues related to software vulnerability detection, as well as some possible future research directions:

  • Vulnerability types: software security, especially the vulnerabilities found in software implementation, is a challenging problem because numerous types of security vulnerabilities are reported and discovered every year according to the Common Weakness Enumeration (CWE) [605] and Common Vulnerabilities and Exposures (CVE) [606] databases. Most existing efforts focus on binary classification to detect only a particular type of vulnerability. A model that is trained on, for example, buffer overflows (CWE121 and CWE122) will not be able to detect other types of vulnerabilities such as SQL injections (CWE89 [607]). Therefore, it is desirable to develop more robust multi-class classification approaches that can be trained on a dataset with multiple types of vulnerabilities. Covering many vulnerability types in the dataset is essential for detecting various vulnerabilities, and each dataset should state how many CWEs or CVEs it contains. For instance, there is one CWE in [608], 609 CVEs in [609], and 911 CWEs in [610].

  • Dataset size: the performance of a DL model depends largely on the size of the training datasets. Larger training datasets provide more samples from which the model can learn. It is a well-known problem that only a small set of labeled data is currently available to train a vulnerability detection model [611]. As a result, the limited number of existing datasets for software security are typically handcrafted test programs that are small and imprecisely labeled. In the future, it will be useful to explore techniques to automatically generate large datasets, either by labeling real-world software that exhibits security vulnerabilities or by synthesizing datasets that fully capture the vulnerability patterns in real-world programs. In general, the test results for large datasets will be more accurate. For instance, there are 1,274,366 samples in [612] but only 871 samples in [613].

  • Label vulnerabilities: supervised learning is one of the most common DL approaches that has been used in software vulnerability detection. It can perform well with datasets that are properly labeled before the model’s training. Unfortunately, most software security datasets are either unlabelled or imprecisely labeled. These imprecisely labeled datasets can lead to low performance and unreliable vulnerability detection models. Handcrafting labels is not only tedious and labor-intensive but also inconsistent. Many vulnerabilities are not localized and can be caused by multiple parts of the program. It is very challenging to identify the root cause of a vulnerability and manually label it in a consistent way that does not confuse machine learners. For instance, it is necessary to ensure that the labeling of all vulnerabilities of the same type follows the same rule. Therefore, to overcome this problem, researchers can consider building tools to aid the labeling process so that a large set of labeled data can be generated automatically from existing reported vulnerable software. Some datasets are labeled for each CWE or CVE (e.g., SARD [614]), but others are labeled as binary detections only, as vulnerable or not vulnerable (e.g., OSS [615]). Researchers often want ready-made labeled datasets for training due to the cost and expertise associated with manual labeling. This leads to fewer available datasets, the lack of which contributes to the problem referred to above.

  • Synthesise datasets: while many software vulnerabilities are reported each year (e.g., in CWE or CVE), they may not be sufficient to train reliable detection models. This may be attributed to the fact that, despite the large number of different types of vulnerabilities, there are limited cases of each vulnerability type. More generally, compared with the size of the software, vulnerabilities are rare and often outliers that do not conform to the usual software behaviors. Synthetic datasets are widely used in software vulnerability detection to artificially increase the number of samples that contain vulnerabilities. For example, the Juliet project [614] generated synthetic datasets based on a few predefined patterns. However, synthetic datasets often cannot reflect the structure of real-world vulnerabilities and therefore cannot represent the diverse behaviors observed in real-world programs [616]. It is better to train the model on a mixed source code dataset (real and synthetic), but such datasets are rarely available. Examples of available datasets include Java (1772 real samples [615] and 28,881 synthetic samples [614]), PHP (2942 real samples [617]), and SQL (6,586 real samples [608]). Some datasets cover several programming languages, such as Python and C/C++ (8027 real samples [618]) and Java, C/C++, C#, and PHP (177,184 synthetic and real samples [610]). Several datasets are available for C/C++ in [602]. Even so, the number of samples remains too small for DL models to generalize. In the future, more sophisticated program synthesis techniques could be explored to increase the quality and versatility of the samples when generating large synthetic datasets.

  • Generalisation: when the DL model trains on an old dataset, it may not detect the latest vulnerabilities. A new dataset with added new vulnerabilities increases the accuracy of test results. Each dataset has several Common Weakness Enumerations (CWE) [605] or Common Vulnerabilities and Exposures (CVE) [606], with further vulnerabilities being detected daily. When the model is trained using some CVE or CWE datasets, the model cannot detect others, so the dataset should be diverse and updated with new vulnerabilities.

  • Transfer learning (TL): as defined previously, in TL a learned model is reused in other DL tasks to improve their performance [23, 124]. This approach can help reduce the time and resources needed to train a DL model for different tasks and problems. This is desirable in software vulnerability detection because researchers can reuse vulnerability detection models across various software projects [611]. Unfortunately, vulnerabilities found in software implementations are typically language-specific and domain-specific (some may even be application-specific). Models trained on security vulnerabilities could be vastly different across programming languages and application domains. Therefore, it is hard to generalize and reuse learned models, which makes transferring the learned knowledge challenging. Currently, it is possible to use a separate detection model for each language and vulnerability type. In the future, we intend to investigate a generalized vulnerability detection model that is robust and efficient in detecting vulnerabilities across different software projects [611].

The traditional ways of creating a dataset require expertise, money, and time. Alternatively, over-sampling techniques can address the under-representation of minority classes. The Synthetic Minority Over-sampling Technique (SMOTE) [619] is one over-sampling approach that creates new (synthetic) samples rather than duplicating existing ones. It generates each synthetic sample by interpolating between a minority-class sample and one of its k nearest minority-class neighbors, with the number of neighbors drawn upon depending on the amount of over-sampling required. The authors of [620] used SMOTE to increase the training set from 65,970 to 96,952 samples. DeepSMOTE [499], a 2022 extension of SMOTE, may be more useful for this purpose.
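
The interpolation step behind SMOTE can be sketched in a few lines of standard-library Python. This is a simplified illustration and not the reference implementation from [619]: the function name `smote`, the toy two-dimensional feature vectors, and the choice to pick the base sample at random (rather than generating a fixed number of synthetic samples per minority sample) are assumptions of the sketch.

```python
import math
import random


def smote(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating toward minority-class neighbors."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of the base sample (excluding itself).
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


rng = random.Random(0)
# Hypothetical minority-class feature vectors (e.g., embeddings of vulnerable code).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.7)]
new_points = smote(minority, n_new=10, k=3, rng=rng)
```

Because every synthetic point lies on a line segment between two real minority samples, the new data stay inside the region already occupied by that class. That keeps the samples plausible but also means SMOTE cannot add genuinely new patterns.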

The predictive performance of a DL model depends heavily on the training phase. Therefore, a high-quality dataset is necessary to train robust DL models. To train useful models for software security, an ideal dataset would have the following features: a variety of vulnerabilities, a large number of samples, properly labeled vulnerabilities, ease of synthesis, support for generalisation, a large source code size, and suitability for TL.

Tips for reporting the dataset

This section presents the top tips for reporting datasets used in DL. These tips have been derived from the literature and the authors’ experience in the field [621,622,623].

To report a dataset used in DL, it is necessary to clearly explain:

  • whether the dataset used is public or private. If it is public, the source of the dataset must be cited, including articles and links. If it is private, the collection process must be described.

  • the criteria for selecting the dataset(s) and whether the dataset(s) can test the hypothesis.

  • the details of the dataset(s), including the type of data, the number and names of classes, the size of each sample, the total number of samples, the number of samples in each class, and the resolution. Figures are important to show samples of the dataset with the label of each class.

  • whether the dataset used is real or simulated. In the case of simulated data, the simulation process must be explained.

  • the labeling process of the dataset (private dataset) and whether the process was achieved by an expert or in an automated way.

  • the pre-processing stage and the data features that were manipulated.

  • changes to data after each step in situations where multi-pre-processing procedures were applied.

  • the data augmentation techniques (if used) with figures showing a sample of each technique used.

  • the ratios of training, validation, and testing sets. The rationale for choosing these ratios and ensuring these sets were unbiased regarding data characteristics must also be described.

  • comparisons with other methods. The same dataset with the same ratios of validation and testing sets must be used to ensure the comparison with other methods is valid.

  • the description of the dataset when it is uploaded to one of the public repositories.
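
Several of these items (the number of samples per class, and the ratios and unbiasedness of the training, validation, and testing sets) can be produced mechanically. The sketch below is one possible way to do so with a stratified split; the function name `stratified_split`, the 70/15/15 ratios, and the two-class toy labels are assumptions made for illustration only.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, ratios, seed):
    """Split sample indices into train/val/test while preserving class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)  # fixed seed so the reported split is reproducible
    splits = {"train": [], "val": [], "test": []}
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        splits["train"] += indices[:n_train]
        splits["val"] += indices[n_train:n_train + n_val]
        splits["test"] += indices[n_train + n_val:]  # remainder goes to test
    return splits


# Hypothetical two-class dataset: 80 "healthy" and 20 "ulcer" samples.
labels = ["healthy"] * 80 + ["ulcer"] * 20
splits = stratified_split(labels, ratios=(0.7, 0.15, 0.15), seed=0)
# Per-split class counts, suitable for a dataset description table.
report = {name: Counter(labels[i] for i in idx) for name, idx in splits.items()}
for name, counts in report.items():
    print(name, dict(counts))
```

Reporting the seed together with the per-split class counts lets other authors reproduce the same partition when comparing methods.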

Trustworthy training datasets

It is critical to ensure that the data used to train a DL model are free from bias, accurate, high-quality, and privacy-preserving, as this is essential for building trust in the model. Poor-quality data can lead to biased or unreliable models [624, 625].

There are several requirements that a dataset should meet in order to be trustworthy for DL:

  • Quality of data: the data in the dataset should be accurate and relevant to the problem at hand.

  • Annotation quality: the annotations should be accurate and consistent if the annotation is needed.

  • Diversity: the dataset should be diverse and include a wide range of samples to ensure that the model learns to generalize to new scenarios.

  • Size: the size of the dataset can be a factor in its trustworthiness. A larger dataset can assist the model in learning more robust and generalizable features, but it is critical to make sure the data is high quality and diverse.

  • Source: the source of the data is important, as it should be from a trustworthy organization or individual.

  • Preprocessing: the data should be cleaned and preprocessed appropriately in order to be usable for training a DL model.

  • Balance: if the dataset is used for classification tasks, it should be balanced, meaning that it should include a roughly equal number of examples for each class. Imbalanced datasets can lead to DL models that are biased toward the more common classes.

  • Bias-free: bias in the data can lead to DL models that make biased decisions and do not generalize well to new situations. It is important to ensure that the data used to train a DL model is diverse and representative of the population the model will be used on, in order to avoid bias and improve model performance.
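
As a rough illustration of the balance requirement, the sketch below measures how imbalanced a labeled dataset is and derives inverse-frequency class weights. The helper names, the toy labels, and the weighting formula (total samples divided by the number of classes times the class count, a common heuristic) are assumptions for illustration rather than a method prescribed in this survey.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Ratio between the largest and the smallest class (1.0 means balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Toy labels: 90 samples of one class and 10 of the other.
labels = ["benign"] * 90 + ["malignant"] * 10
ratio = imbalance_ratio(labels)
weights = balanced_class_weights(labels)
print(ratio)
print(weights)
```

Such weights can be passed to a weighted loss function so that errors on the rarer class count more, which is one of several possible ways to mitigate imbalance.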

Discussion

This section is dedicated to offering a succinct and subjective reflection on the research process carried out in this broad overview, as well as introducing possible improvements to the limitations analyzed in previous sections.

In our opinion, the results of this study provide relevant insights into state-of-the-art techniques for DL model training that aim to overcome three major challenges: small datasets, imbalanced datasets, and lack of generalization. As far as we know, this study is original in jointly covering the definitions, challenges, solutions, tips, and applications related to data scarcity in DL model training.

In the previous sections, the benefits and limitations of each of the recent strategies proposed in the reviewed state-of-the-art methods have been addressed in detail. However, despite the proven benefits, the reported results must be interpreted with caution due to their inherent limitations, which show that there is still room for improvement. Thus, we propose the following set of 13 directions for future work to address these shortcomings:

  • Numerous TL approaches should be considered to train the DL model using unclassified image datasets, followed by knowledge transfer for training the DL model by using a reduced set of classified images for the same task.

  • Powerful and effective models can be generated to improve NN performance more comprehensively once RL and other models are combined with TL.

  • The increasing interest in GANs stems from their ability to learn highly non-linear and deep mappings from latent space to data space and vice versa, as well as their ability to make use of unclassified image data for deep representation learning. Many algorithms and theories can be formulated by adopting the GAN framework, which is suitable for new applications with deep networks.

  • As indicated in previous sections, different loss functions have been introduced to help in training small data sets. We are convinced that it is worth investigating the loss functions to overcome the weakness of the previous approaches.

  • It is important to carefully curate and build a high-quality training dataset when developing DL models. A reliable and trustworthy training dataset can greatly improve the performance of a model and help prevent overfitting.

  • As DL models become more complex in structure, it becomes more difficult for people to understand how they arrive at their decisions. Improving explainability is essential to build trust in these models and ensure that they make fair and unbiased decisions [625].

  • It is critical to ensure that DL models are robust, reliable, and able to perform well with new data. This will require improving the quality and diversity of the data used to train them, as well as developing techniques to identify and address potential issues with the models.

  • Fairness in DL remains an open challenge and requires careful consideration of the data used to train the models, as well as both the potential biases present in that data and the development of techniques to overcome biases in the models [626].

  • Meta-learning and customized RL can be optimized for multiple applications [627]. Meta-learning has the potential to significantly enhance the capabilities of DL models, particularly in scenarios where training data is scarce, making it a promising area of research in DL.

  • Knowledge distillation is another technique for addressing data scarcity that is worth further investigation. It involves training a smaller model to mimic the behavior of a larger model [628].

  • Information fusion involves combining information from multiple sources or modalities to make more accurate predictions or decisions in the context of DL. It can help overcome the limitations of individual data sources and improve model performance when training data is limited [629].

  • Federated learning is a DL technique that allows groups or organizations to collectively train and improve a shared global DL model [138]. However, the introduction of data fusion technology has brought new challenges for federated learning, such as the fusion of heterogeneous and multi-source data. As the variety and volume of data increase, it is essential to improve the use of data and models in federated learning. By eliminating redundant data and merging multiple data sources, it is possible to gain new and valuable information. In the future, issues such as maintaining user privacy, creating universal models, and ensuring the stability of data fusion results need to be addressed to facilitate the effective use of data in federated learning across multiple domains.

  • Finally, more pre-trained models, similar to those trained on ImageNet, are expected to appear in other areas such as medical imaging [630]. This would be a great opportunity for improving the generalization of DL models.
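
To make the knowledge distillation item more concrete, the following minimal sketch computes a distillation loss for a single sample: a smaller student model is penalized both for misclassifying the true label and for deviating from the temperature-softened outputs of a larger teacher. This is one widely used formulation, not necessarily the one in [628]; the function names, the temperature of 2.0, the weighting `alpha` of 0.5, and the toy logits are all assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_class,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and teacher-student KL divergence."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Standard cross-entropy of the student against the true label.
    hard = -math.log(softmax(student_logits)[true_class])
    # The temperature**2 factor keeps the soft term on a comparable scale.
    return alpha * hard + (1 - alpha) * temperature ** 2 * kl


teacher = [1.0, 2.0, 3.0]          # hypothetical teacher logits
matching_student = [1.0, 2.0, 3.0]
poor_student = [3.0, 2.0, 1.0]
loss_matching = distillation_loss(matching_student, teacher, true_class=2)
loss_poor = distillation_loss(poor_student, teacher, true_class=2)
print(loss_matching, loss_poor)
```

When the student reproduces the teacher exactly, the divergence term vanishes and only the hard-label term remains, so the loss grows as the student drifts away from the teacher.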

Conclusion

Data scarcity is a significant challenge for deep learning (DL) models because they require a substantial amount of labeled data to perform well. However, manual labeling is a costly, time-consuming, and error-prone process that may not be feasible for many applications. Furthermore, the resulting lack of data is the primary barrier that prevents many applications from using DL. This work has carried out a holistic survey of state-of-the-art techniques that aim to overcome the challenges of small and imbalanced datasets and the lack of generalization in DL. Specifically, our contribution highlights the pros and cons of multiple approaches recently proposed in the field, e.g., Transfer Learning, Self-Supervised Learning, Generative Adversarial Networks, Model Architecture, Physics-Informed Neural Networks, and Deep Synthetic Minority Oversampling, among many others. Moreover, this work has reviewed many applications that suffer from data scarcity and introduced specific alternatives to generate more data for each of them. Additionally, trustworthiness in DL has been analyzed. Finally, we expect this comprehensive overview of strategies for tackling data scarcity to be a useful resource for researchers and practitioners interested in improving the performance of their DL models.

Availability of data and materials

Not applicable.

References

  1. Alzubaidi L, Zhang J, Humaidi AJ, Al-Dujaili A, Duan Y, Al-Shamma O, Santamaría J, Fadhel MA, Al-Amidie M, Farhan L. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J Big Data. 2021;8(1):1–74.

  2. Bhattacharya S, Somayaji SRK, Gadekallu TR, Alazab M, Maddikunta PKR. A review on deep learning for future smart cities. Internet Technol Lett. 2022;5(1):187.

  3. Wang N, Wang Y, Er MJ. Review on deep learning techniques for marine object recognition: architectures and algorithms. Control Eng Pract. 2022;118: 104458.

  4. Shorten C, Khoshgoftaar TM, Furht B. Deep learning applications for covid-19. J Big Data. 2021;8(1):1–54.

  5. Torres JF, Hadjout D, Sebaa A, Martínez-Álvarez F, Troncoso A. Deep learning for time series forecasting: a survey. Big Data. 2021;9(1):3–21.

  6. Abidi MH, Mohammed MK, Alkhalefah H. Predictive maintenance planning for industry 4.0 using machine learning for sustainable manufacturing. Sustainability. 2022;14(6):3387.

  7. Amanullah MA, Habeeb RAA, Nasaruddin FH, Gani A, Ahmed E, Nainar ASM, Akim NM, Imran M. Deep learning and big data technologies for IoT security. Comput Commun. 2020;151:495–517.

  8. Wang YE, Wei G-Y, Brooks D. Benchmarking TPU, GPU, and CPU platforms for deep learning. arXiv preprint. 2019. arXiv:1907.10701.

  9. Kim J-H, Kim N, Park YW, Won CS. Object detection and classification based on YOLO-V5 with improved maritime dataset. J Mar Sci Eng. 2022;10(3):377.

  10. Wang K, Wei Z. YOLO V4 with hybrid dilated convolution attention module for object detection in the aerial dataset. Int J Remote Sens. 2022;43(4):1323–44.

  11. Rajaraman S, Ganesan P, Antani S. Deep learning model calibration for improving performance in class-imbalanced medical image classification tasks. PLoS ONE. 2022;17(1):0262838.

  12. Fernandes J, Simsek M, Kantarci B, Khan S. Tabledet: an end-to-end deep learning approach for table detection and table image classification in data sheet images. Neurocomputing. 2022;468:317–34.

  13. Li W, Kazemifar S, Bai T, Nguyen D, Weng Y, Li Y, Xia J, Xiong J, Xie Y, Owrangi A, et al. Synthesizing CT images from MR images with deep learning: model generalization for different datasets through transfer learning. Biomed Phys Eng Express. 2021;7(2): 025020.

  14. Ye JC. Generalization capability of deep learning. In: Geom Deep Learn. Cham: Springer; 2022. p. 243–66.

  15. Chen RJ, Lu MY, Chen TY, Williamson DF, Mahmood F. Synthetic data in machine learning for medicine and healthcare. Nat Biomed Eng. 2021;5(6):493–7.

  16. Tulbure A-A, Tulbure A-A, Dulf E-H. A review on modern defect detection models using DCNNs-deep convolutional neural networks. J Adv Res. 2022;35:33–48.

  17. Tang S, Zhu Y, Yuan S. A novel adaptive convolutional neural network for fault diagnosis of hydraulic piston pump with acoustic images. Adv Eng Inform. 2022;52: 101554.

  18. Lai C-J, Pai P-F, Marvin M, Hung H-H, Wang S-H, Chen D-N. The use of convolutional neural networks and digital camera images in cataract detection. Electronics. 2022;11(6):887.

  19. Berghout T, Mouss L-H, Bentrcia T, Elbouchikhi E, Benbouzid M. A deep supervised learning approach for condition-based maintenance of naval propulsion systems. Ocean Eng. 2021;221: 108525.

  20. Dai Y, Gao Y, Liu F. Transmed: transformers advance multi-modal medical image classification. Diagnostics. 2021;11(8):1384.

  21. Miorelli R, Kulakovskyi A, Chapuis B, D’almeida O, Mesnil O. Supervised learning strategy for classification and regression tasks applied to aeronautical structural health monitoring problems. Ultrasonics. 2021;113: 106372.

  22. Alzubaidi L, Al-Shamma O, Fadhel MA, Farhan L, Zhang J, Duan Y. Optimizing the performance of breast cancer classification by employing the same domain transfer learning from hybrid deep convolutional neural network model. Electronics. 2020;9(3):445.

  23. Alzubaidi L, Al-Amidie M, Al-Asadi A, Humaidi AJ, Al-Shamma O, Fadhel MA, Zhang J, Santamaría J, Duan Y. Novel transfer learning approach for medical imaging with limited labeled data. Cancers. 2021;13(7):1590.

  24. Caruana R, Niculescu-Mizil A. An empirical comparison of supervised learning algorithms. In: Proceedings of the 23rd international conference on machine learning; 2006. p. 161–8.

  25. Deng L. The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Process Mag. 2012;29(6):141–2.

  26. Chandra MA, Bedi S. Survey on SVM and their application in image classification. Int J Inf Technol. 2021;13(5):1–11.

  27. Rivera-Lopez R, Canul-Reich J, Mezura-Montes E, Cruz-Chávez MA. Induction of decision trees as classification models through metaheuristics. Swarm Evol Comput. 2022;69: 101006.

  28. Tsiknakis N, Theodoropoulos D, Manikis G, Ktistakis E, Boutsora O, Berto A, Scarpa F, Scarpa A, Fotiadis DI, Marias K. Deep learning for diabetic retinopathy detection and classification based on fundus images: a review. Comput Biol Med. 2021;135: 104599.

  29. Manna A, Kundu R, Kaplun D, Sinitca A, Sarkar R. A fuzzy rank-based ensemble of CNN models for classification of cervical cytology. Sci Rep. 2021;11(1):1–18.

  30. Korot E, Guan Z, Ferraz D, Wagner SK, Zhang G, Liu X, Faes L, Pontikos N, Finlayson SG, Khalid H, et al. Code-free deep learning for multi-modality medical image classification. Nat Mach Intell. 2021;3(4):288–98.

  31. Jena B, Saxena S, Nayak GK, Saba L, Sharma N, Suri JS. Artificial intelligence-based hybrid deep learning models for image classification: the first narrative review. Comput Biol Med. 2021;137: 104803.

  32. Zia T, Bashir N, Ullah MA, Murtaza S. SoFTNet: a concept-controlled deep learning architecture for interpretable image classification. Knowl-Based Syst. 2022;240: 108066.

  33. Lu Z, Liang S, Yang Q, Du B. Evolving block-based convolutional neural network for hyperspectral image classification. IEEE Trans Geosci Remote Sens. 2022;60:1–21.

  34. Liu T, Yu H, Blair RH. Stability estimation for unsupervised clustering: a review. Wiley Interdiscip Rev Comput Stat. 2022;14:1575.

  35. Ali NUA, Iqbal W, Afzal H. Carving of the OOXML document from volatile memory using unsupervised learning techniques. J Inf Secur Appl. 2022;65: 103096.

  36. Tavallali P, Tavallali P, Singhal M. K-means tree: an optimal clustering tree for unsupervised learning. J Supercomput. 2021;77(5):5239–66.

  37. Sindagi VA, Patel VM. A survey of recent advances in CNN-based single image crowd counting and density estimation. Pattern Recogn Lett. 2018;107:3–16.

  38. Madec S, Jin X, Lu H, De Solan B, Liu S, Duyme F, Heritier E, Baret F. Ear density estimation from high resolution RGB imagery using deep learning technique. Agric For Meteorol. 2019;264:225–34.

  39. Awad FH, Hamad MM. Improved k-means clustering algorithm for big data based on distributed smartphoneneural engine processor. Electronics. 2022;11(6):883.

  40. Courtier AF, McDonnell M, Praeger M, Grant-Jacob JA, Codemard C, Harrison P, Mills B, Zervas M. Predictive visualisation of fibre laser machining via deep learning. In: 2021 conference on lasers and electro-optics Europe & European quantum electronics conference (CLEO/Europe-EQEC). IEEE; 2021. p. 1–1.

  41. Gende M, De Moura J, Novo J, Charlón P, Ortega M. Automatic segmentation and intuitive visualisation of the epiretinal membrane in 3D OCT images using deep convolutional approaches. IEEE Access. 2021;9:75993–6004.

  42. Qiu C, Wu B, Liu N, Zhu X, Ren H. Deep learning prior model for unsupervised seismic data random noise attenuation. IEEE Geosci Remote Sens Lett. 2021;19:1–5.

  43. Gunduz H. An efficient dimensionality reduction method using filter-based feature selection and variational autoencoders on Parkinson’s disease classification. Biomed Signal Process Control. 2021;66: 102452.

  44. Prezelj J, Murovec J, Huemer-Kals S, Häsler K, Fischer P. Identification of different manifestations of nonlinear stick-slip phenomena during creep groan braking noise by using the unsupervised learning algorithms k-means and self-organizing map. Mech Syst Signal Process. 2022;166: 108349.

  45. Tatoli R, Lampignano L, Bortone I, Donghia R, Castellana F, Zupo R, Tirelli S, De Nucci S, Sila A, Natuzzi A, et al. Dietary patterns associated with diabetes in an older population from southern Italy using an unsupervised learning approach. Sensors. 2022;22(6):2193.

  46. Khushaba RN, Al-Ani A, Al-Jumaily A. Orthogonal fuzzy neighborhood discriminant analysis for multifunction myoelectric hand control. IEEE Trans Biomed Eng. 2010;57(6):1410–9.

  47. Du W, Ding S. A survey on multi-agent deep reinforcement learning: from the perspective of challenges and applications. Artif Intell Rev. 2021;54(5):3215–38.

  48. Gronauer S, Diepold K. Multi-agent deep reinforcement learning: a survey. Artif Intell Rev. 2022;55(2):895–943.

  49. Waubert de Puiseau C, Meyes R, Meisen T. On reliability of reinforcement learning based production scheduling systems: a comparative survey. J Intell Manuf. 2022;33:1–17.

  50. Ramot M, Martin A. Closed-loop neuromodulation for studying spontaneous activity and causality. Trends Cogn Sci. 2022;26:290–9.

  51. Shi C, Wang X, Luo S, Zhu H, Ye J, Song R. Dynamic causal effects evaluation in a/b testing with a reinforcement learning framework. J Am Stat Assoc. 2022;1–29 (just-accepted).

  52. Zamfirache IA, Precup R-E, Roman R-C, Petriu EM. Reinforcement learning-based control using Q-learning and gravitational search algorithm with experimental validation on a nonlinear servo system. Inf Sci. 2022;583:99–120.

  53. Ganesh AH, Xu B. A review of reinforcement learning based energy management systems for electrified powertrains: progress, challenge, and potential solution. Renew Sustain Energy Rev. 2022;154: 111833.

  54. Alavizadeh H, Alavizadeh H, Jang-Jaccard J. Deep Q-learning based reinforcement learning approach for network intrusion detection. Computers. 2022;11(3):41.

  55. Song Z, Yang X, Xu Z, King I. Graph-based semi-supervised learning: a comprehensive review. IEEE Trans Neural Netw Learn Syst. 2022. https://doi.org/10.1109/TNNLS.2022.3155478.

  56. Kostopoulos G, Kotsiantis S. Exploiting semi-supervised learning in the education field: a critical survey. Adv Mach Learn Deep Learn Based Technol. 2022;2:79–94.

  57. Huynh T, Nibali A, He Z. Semi-supervised learning for medical image classification using imbalanced training data. Comput Methods Programs Biomed. 2022;216: 106628.

  58. Li Y-F, Liang D-M. Safe semi-supervised learning: a brief introduction. Front Comp Sci. 2019;13(4):669–76.

  59. Khan AH, Siddqui J, Sohail SS. A survey of recommender systems based on semi-supervised learning. In: International conference on innovative computing and communications. Springer; 2022. p. 319–27.

  60. Chong Y, Ding Y, Yan Q, Pan S. Graph-based semi-supervised learning: a review. Neurocomputing. 2020;408:216–30.

  61. Inés A, Domínguez C, Heras J, Mata E, Pascual V. Biomedical image classification made easier thanks to transfer and semi-supervised learning. Comput Methods Programs Biomed. 2021;198: 105782.

  62. Shi S, Nie F, Wang R, Li X. Semi-supervised learning based on intra-view heterogeneity and inter-view compatibility for image classification. Neurocomputing. 2022;488:248–60.

  63. Su L, Liu Y, Wang M, Li A. Semi-HIC: a novel semi-supervised deep learning method for histopathological image classification. Comput Biol Med. 2021;137: 104788.

  64. Moritz N, Hori T, Le Roux J. Semi-supervised speech recognition via graph-based temporal classification. In: ICASSP 2021–2021 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE; 2021. p. 6548–52.

  65. Torre IG, Romero M, Álvarez A. Improving aphasic speech recognition by using novel semi-supervised learning methods on aphasiabank for English and Spanish. Appl Sci. 2021;11(19):8872.

  66. Spangher A, May J, Shiang S-R, Deng L. Multitask semi-supervised learning for class-imbalanced discourse classification. In: Proceedings of the 2021 conference on empirical methods in natural language processing. 2021. p. 498–517.

  67. Diaz-Pinto A, Colomer A, Naranjo V, Morales S, Xu Y, Frangi AF. Retinal image synthesis and semi-supervised learning for glaucoma assessment. IEEE Trans Med Imaging. 2019;38(9):2211–8.

  68. Jaiswal A, Babu AR, Zadeh MZ, Banerjee D, Makedon F. A survey on contrastive self-supervised learning. Technologies. 2020;9(1):2.

  69. Liu X, Zhang F, Hou Z, Mian L, Wang Z, Zhang J, Tang J. Self-supervised learning: generative or contrastive. IEEE Trans Knowl Data Eng. 2021;35(1):857–76.

  70. Azizi S, Mustafa B, Ryan F, Beaver Z, Freyberg J, Deaton J, Loh A, Karthikesalingam A, Kornblith S, Chen T, et al. Big self-supervised models advance medical image classification. In: Proceedings of the IEEE/CVF international conference on computer vision; 2021. p. 3478–88.

  71. Huang H, Luo L, Pu C. Self-supervised convolutional neural network via spectral attention module for hyperspectral image classification. IEEE Geosci Remote Sens Lett. 2022;19:1–5.

  72. Ohri K, Kumar M. Review on self-supervised image recognition using deep neural networks. Knowl-Based Syst. 2021;224: 107090.

  73. Luo D, Zhou Y, Fang B, Zhou Y, Wu D, Wang W. Exploring relations in untrimmed videos for self-supervised learning. ACM Trans Multimed Comput Commun Appl. 2022;18(1s):1–21.

  74. Song J, Zhang H, Li X, Gao L, Wang M, Hong R. Self-supervised video hashing with hierarchical binary auto-encoder. IEEE Trans Image Process. 2018;27(7):3210–21.

  75. Li C-L, Sohn K, Yoon J, Pfister T. Cutpaste: self-supervised learning for anomaly detection and localization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2021. p. 9664–74.

  76. Farr AJ, Petrunin I, Kakareko G, Cappaert J. Self-supervised vessel detection from low resolution satellite imagery. In: AIAA SCITECH 2022 forum; 2022. p. 2110.

  77. Baevski A, Hsu W-N, Xu Q, Babu A, Gu J, Auli M. Data2vec: a general framework for self-supervised learning in speech, vision and language. arXiv preprint. 2022. arXiv:2202.03555.

  78. Lin L, Luo W, Yan Z, Zhou W. Rigid-aware self-supervised GAN for camera ego-motion estimation. Digit Signal Process. 2022;126: 103471.

  79. Zhang X, Mu J, Zhang X, Liu H, Zong L, Li Y. Deep anomaly detection with self-supervised learning and adversarial training. Pattern Recogn. 2022;121: 108234.

  80. Baykal G, Ozcelik F, Unal G. Exploring deshufflegans in self-supervised generative adversarial networks. Pattern Recogn. 2022;122: 108244.

  81. He K, Zhao W, Xie X, Ji W, Liu M, Tang Z, Shi Y, Shi F, Gao Y, Liu J, et al. Synergistic learning of lung lobe segmentation and hierarchical multi-instance classification for automated severity assessment of COVID-19 in CT images. Pattern Recogn. 2021;113: 107828.

  82. Li J, Li W, Sisk A, Ye H, Wallace WD, Speier W, Arnold CW. A multi-resolution model for histopathology image classification and localization with multiple instance learning. Comput Biol Med. 2021;131: 104253.

  83. Li X, Wu H, Li M, Liu H. Multi-label video classification via coupling attentional multiple instance learning with label relation graph. Pattern Recognit Lett. 2022;156:53–9.

  84. Korkmaz Y, Boyacı A. milVAD: a bag-level MNIST modelling of voice activity detection using deep multiple instance learning. Biomed Signal Process Control. 2022;74: 103520.

  85. Sellami A, Tabbone S. Deep neural networks-based relevant latent representation learning for hyperspectral image classification. Pattern Recogn. 2022;121: 108224.

  86. Huang H. Statistical mechanics of neural networks. Singapore: Springer; 2022.

  87. Wunsch S, Jörger S, Wolf R, Quast G. Optimal statistical inference in the presence of systematic uncertainties using neural network optimization based on binned Poisson likelihoods with nuisance parameters. Comput Softw Big Sci. 2021;5(1):1–11.

  88. Elhassan A, Abu-Soud SM, Alghanim F, Salameh W. ILA4: overcoming missing values in machine learning datasets-an inductive learning approach. J King Saud Univ Comput Inf Sci. 2021;34(7):4284–95.