Low‑shot learning and class imbalance: a survey


case of zero-shot learning, no examples [5]). LSL is a newer, more challenging field of machine learning, and LSL techniques have garnered much attention in recent years, as they excel in tasks such as image recognition where there are many common classes with thousands or millions of instances (e.g., "dog" or "car"), but also many rare classes with tens or fewer (e.g., "abstract portrait" or "pallet"). LSL techniques are also well-suited for applications where not all potential rare classes can be accounted for in the model's training dataset, such as in the object detection systems of autonomous driving software [6].
Low-shot learning can be framed as a difficult variant of class-imbalanced learning, as both fields aim to learn the rare classes in a dataset, often with plentiful data provided for other classes. However, there are some notable differences; for example, traditional class-imbalanced settings (such as fraud detection) normally have hundreds of rare-class instances even in severe cases, while many LSL scenarios allow only one to five rare-class instances. This means that common CI countermeasures, such as random under/oversampling, prove insufficient if naïvely applied in an LSL setting. Additionally, while class-imbalanced settings often allow the model to learn all classes simultaneously, LSL settings only provide rare-class examples to the model at evaluation time, necessitating a model that can rapidly learn new classes after it has been trained. Despite these differences, we believe there is untapped potential in exploring the intersections of these two fields, either in combatting class imbalance within LSL settings, or in harnessing LSL techniques against class imbalance; we believe this latter topic to be especially promising, given the success of LSL models in spite of the setting's difficulty. Finding no existing surveys on either of these intersections, this paper reviews the recent literature for both, across a wide range of applications. We find that these topics have not been fully explored, with potential for future research to perform more conclusive comparisons of existing methods, or to propose entirely new methods by leveraging existing techniques.
The remainder of this paper is organized as follows: the "Background" section provides background information on both class imbalance and few-shot learning individually, with explanations, definitions of terms, and overviews of approaches to each problem. We define what is meant by "low-shot learning" in this paper (as definitions of few-shot learning are often inconsistent or vague), and we detail the inclusion criteria for our literature review. The remainder of the paper discusses the findings of the review, with in-depth coverage of each work's methods and experimental findings: the "Solving imbalance within LSL" section covers works which have addressed class imbalance as an obstacle within low-shot learning, and the "Using LSL to solve existing imbalance" section covers works which solve pre-existing class imbalance by borrowing models and techniques from the field of LSL. The "Shortcomings and future research" section highlights strengths and weaknesses of the current literature and identifies areas with potential for future research, and finally, the "Conclusion" section concludes and summarizes the paper.

Definition
Broadly, few-shot learning (FSL) refers to a task where a model must learn classes or categories with very few examples; similarly, one-shot learning (OSL) refers to a task where only one example per rare class is available, and zero-shot learning (ZSL) refers to a task where no examples are available for these classes. In this paper, we collectively refer to FSL, OSL, and ZSL as low-shot learning (LSL). While FSL and its variants are not new fields (the term "one-shot learning" has been used in a machine learning context as early as 1997 [7]), they did not receive much research attention until the mid-2010s, when they were re-examined with a different approach and more modern terminology [8]. Low-shot learning tasks have since received greater attention, though specific definitions often vary between works in their scope and terminology.
In our survey, we elect to use more restrictive definitions, to prevent confusion and to narrow our search: LSL refers to a task where a model, after training on some "base" classes (usually with many examples), must identify instances of "unseen" or "novel" classes which were not present at all during the model's training phase; however, a small selection of labeled novel-class examples may be given at test time. This selection, if it exists, is commonly referred to as the support set within LSL, while the unlabeled instance(s) which the model must classify make up the query set. This definition of LSL is primarily meant to exclude works which use the term "few-shot learning" to refer to ordinary supervised-learning settings with rare classes, which we believe are better described as class-imbalanced learning.
Also relevant to our survey are two notable variants of LSL present within the literature. Generalized low-shot learning (GLSL) refers to the scenario where the query set may include instances from the base classes, rather than exclusively containing unseen classes. This is a more challenging setting, as the model must not forget base-class information learned during training, and cannot assume that the query instances belong to a class within the support set. Transductive low-shot learning (TLSL) refers to the scenario where the query set contains multiple instances (usually from different classes) and these instances are classified simultaneously by the model, rather than one at a time. This is a less challenging setting, as the model can use information on the so-far labeled query instances in order to classify the others, in a manner similar to semi-supervised learning. These two variants are not exclusive, and could be applied simultaneously; however, little literature currently exists on this overlap.

Common approaches to FSL and OSL
In this section, we highlight the most common approaches to FSL and OSL as found in the results of our literature search, borrowing terminology from a 2022 preprint survey paper by Parnami and Lee [9] which separates FSL methods into metric-based, optimization-based, and model-based techniques. In our review, we found no model-based techniques which qualify under our definition of LSL.
The most common method found in our review was prototypical learning (also known as "contrastive learning," "pairwise learning," or "similarity-based learning") [3], a metric-based framework which changes the model's objective: instead of learning to classify one input instance into one of many classes, the model takes as input a pair of instances and learns to determine whether they belong to the same class. Test instances are then classified by comparing them to "prototypes" of each class and selecting the class with the most similar prototype (Fig. 1). This approach allows the model to incorporate new classes without requiring significant training data, assuming that the input embedding space can effectively represent a wide range of input data; these embedding spaces are most often found using Siamese neural networks [10] (SNNs) or similar.
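As a minimal sketch of the evaluation step described above, the following assumes that each prototype is the mean of a class's support embeddings and that similarity is measured by Euclidean distance; both are common choices but are assumptions here, as are the toy 2-D embeddings and the function names.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_prototypes(support):
    """support: dict mapping class label -> list of support embeddings."""
    return {label: mean_vector(embs) for label, embs in support.items()}


def classify(query, prototypes):
    """Return the label whose prototype lies closest to the query embedding."""
    return min(prototypes, key=lambda label: euclidean(query, prototypes[label]))


# Hypothetical 2-D embeddings for a 2-way-2-shot support set.
support = {
    "cat": [[0.9, 0.1], [1.1, -0.1]],
    "truck": [[-1.0, 0.2], [-0.8, 0.0]],
}
prototypes = build_prototypes(support)
print(classify([0.7, 0.0], prototypes))
```

Because new classes only require new prototypes, adding a class here means adding one more key to `support`; no parameters are retrained.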
The other approach often found is optimization-based meta-learning, where the model is trained in such a way that it can easily adapt to new classes given only a few training examples. Model-agnostic meta-learning (MAML) [4] is a basic implementation of this approach; in this scheme, a "meta-learner" is given many individual "tasks" with their own support and query sets, and it is "meta-trained" to initialize a learner for each task such that it can quickly converge to a new point in the parameter space when given a new task (Fig. 2).
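For readers who prefer equations, the two nested updates commonly used to describe MAML can be written as below. This is a sketch of the standard single-step formulation; the step sizes α and β, the task loss, and the primed notation for task-adapted parameters are notational assumptions not used elsewhere in this survey.

```latex
% Inner loop: adapt the shared initialization \theta to task \mathcal{T}_i
% using that task's support set.
\theta_i' = \theta - \alpha \, \nabla_{\theta} \, \mathcal{L}_{\mathcal{T}_i}\!\left(f_{\theta}\right)

% Outer loop (meta-update): improve the initialization so that the adapted
% parameters \theta_i' perform well on each task's query set.
\theta \leftarrow \theta - \beta \, \nabla_{\theta} \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}\!\left(f_{\theta_i'}\right)
```

The outer gradient is taken with respect to the initialization θ, not the adapted parameters, which is what makes the learned starting point easy to fine-tune.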
Both prototypical and optimization-based approaches are often trained and validated "episodically," as in MAML, where instances are given to the model in the form of sets of support and query data; this approach is sometimes used for training even when the model is not evaluated in this manner. Further, episodic training and evaluation is usually framed as an "N-way-K-shot" setting, with N referring to the number of unseen classes presented in an episode's support set and K referring to the number of samples/shots provided per class, with K = 1 defining OSL. Episodes are generally most challenging with low values for K and high values for N.
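The N-way-K-shot framing can be illustrated with a short episode sampler. This is a minimal sketch under our own assumptions: the function name, the `n_query` parameter (query instances per class), and the toy dataset are hypothetical and not taken from any surveyed work.

```python
import random


def sample_episode(data, n_way, k_shot, n_query, rng):
    """Sample one N-way-K-shot episode.

    data: dict mapping class label -> list of instances.
    Returns (support, query): support maps each sampled class to k_shot
    instances; query is a shuffled list of (instance, label) pairs.
    """
    # Only classes with enough instances can supply both support and query data.
    eligible = [c for c, items in data.items() if len(items) >= k_shot + n_query]
    if len(eligible) < n_way:
        raise ValueError("not enough classes with k_shot + n_query instances")
    classes = rng.sample(eligible, n_way)
    support, query = {}, []
    for c in classes:
        picked = rng.sample(data[c], k_shot + n_query)
        support[c] = picked[:k_shot]            # labeled examples shown to the model
        query.extend((x, c) for x in picked[k_shot:])  # instances to classify
    rng.shuffle(query)
    return support, query


# Toy dataset: 6 classes with 10 instances each (string placeholders).
toy = {f"class_{i}": [f"c{i}_x{j}" for j in range(10)] for i in range(6)}
support, query = sample_episode(toy, n_way=3, k_shot=1, n_query=2,
                                rng=random.Random(0))
print(len(support), len(query))
```

With `k_shot=1` this is a one-shot episode; note that the query set here is class-balanced by construction, which is exactly the assumption that several works covered later call into question.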
Finally, a significantly less common approach to LSL is to use large language models (LLMs) such as GPT-3 [11] as an interface for language-based tasks, by simply asking the models to perform tasks while providing examples in full-sentence prompts; the reasoning behind this is that LLMs often excel at pattern detection and extension, and so are naturally suited to learning new tasks with few examples. Additionally, this method has the potential advantage of not requiring abundant base-class data, as the LLMs used are most often already trained on vast language data. We do include works in this survey which take this "prompt-based" approach to LSL, though we find that they are comparatively rare in the literature.
Fig. 1 Depiction of prototypical evaluation in the FSL setting [3]. c_i represents the prototype of the ith support-set class, and x represents the query instance
Fig. 2 Depiction of MAML's learning process [4]. MAML "optimizes for a representation θ that can quickly adapt to new tasks" (θ*_i)

Common approaches to ZSL
For ZSL, most FSL and OSL techniques cannot be applied due to the lack of a support set; thus, one of two strategies is usually employed. For binary classification problems, one-class classification (OCC) [12] is common, where a classifier is trained on data from only one class, and learns to detect outliers during testing; while this concept predates most LSL work, it qualifies as a ZSL technique, as it can effectively classify classes unseen during training. Meanwhile, for multi-class classification problems, most works utilize the semantic information present within the labels themselves, using this information to bootstrap classification. For example, a model may infer that an instance of a "truck" (novel class) will resemble that of a "car" (base class) due to the semantic similarity between the class names; this can be inferred without any examples of the novel class.
There also exists the concept of "zero-shot transfer," using transfer learning techniques without any supervised fine-tuning on the target domain; however, this setting is distinct from ZSL in that "relevant supervised information" is available during the pre-training process (but not during fine-tuning) [13]. Thus, despite the nominal similarity, we do not include works on zero-shot transfer unless they are sufficiently relevant for some other reason (e.g., employing techniques from proper ZSL).
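The semantic strategy can be sketched as a nearest-neighbor search in a shared attribute space. Everything specific below is a hypothetical illustration: the four binary attributes, the class attribute vectors, and the assumption that some upstream model (trained only on base classes) predicts attribute scores for an image.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical semantic descriptions of each class:
# [has_wheels, has_engine, carries_cargo, has_wings]
class_attributes = {
    "car": [1, 1, 0, 0],    # base class, seen during training
    "truck": [1, 1, 1, 0],  # novel class, no training images
    "bird": [0, 0, 0, 1],   # novel class, no training images
}


def zero_shot_classify(predicted_attributes, candidates):
    """Pick the candidate class whose attribute vector best matches the
    attribute scores predicted for the image."""
    return max(candidates,
               key=lambda c: cosine(predicted_attributes, class_attributes[c]))


# Attribute scores a base-class-trained model might output for a truck photo.
scores = [0.9, 0.8, 0.7, 0.1]
print(zero_shot_classify(scores, ["truck", "bird"]))
```

The novel classes are recognized purely through their descriptions; restricting `candidates` to novel classes corresponds to standard ZSL, while including "car" as well would correspond to the generalized setting.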

Definition
Class Imbalance (CI) is a long-standing obstacle in machine learning [14,15], referring to when a dataset has many instances of some class or classes (the majority classes) and very few instances of another class or classes (the minority classes). CI is commonly the result of a natural lack of one type of data in the real world, and can appear in binary tasks such as insurance fraud detection, where fraudulent claims are rare; or in multiclass tasks such as image classification, where photographs of objects such as "accordion" are rare. CI is most often quantified as the ratio ρ between the size of the largest class and the size of the smallest class; in this survey, we will use this metric to specify the severity of imbalance in each work covered, if this information is made available.
A closely related concept to CI is that of foreground-background imbalance in pixel-level image-data tasks (such as object detection or segmentation), where most pixels in each target image are part of the background, rather than any object of interest. Another is that of contrastive/pairwise imbalance, which appears when using similarity-based techniques common in LSL (see "Common approaches to FSL and OSL" section); when samples from a multi-class dataset are paired, matching pairs (which are often the pairs of interest) will be much less common than mismatching pairs, especially as the number of classes increases. These two imbalances pose problems similar to those of traditional CI despite the differences in their sources, and so we include works in this review which deal with foreground-background imbalance and contrastive imbalance.
Finally, long-tailed learning is a term used in many-class settings where the number of instances per class slowly tapers off, such as in image recognition; the "long tail" refers to the shape of the distribution of class sizes within the dataset. This term is synonymous with CI, and so we include works which address it. Notably, the ρ metric is often less informative here, due to the much wider range of class sizes common in these tasks.
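The ratio ρ follows directly from the class counts. The snippet below is a small illustration with made-up fraud-detection labels; the function name is our own.

```python
from collections import Counter


def imbalance_ratio(labels):
    """rho = size of the largest class / size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical binary dataset: 990 legitimate claims, 10 fraudulent ones.
labels = ["legit"] * 990 + ["fraud"] * 10
print(imbalance_ratio(labels))  # 99.0
```

A perfectly balanced dataset gives ρ = 1; because ρ only looks at the two extreme classes, it says nothing about how the sizes of the remaining classes are distributed.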
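To see how quickly pairwise imbalance grows with the number of classes, one can count the pairs directly. The sketch below assumes every unordered pair of instances is formed once, with class sizes chosen purely for illustration.

```python
from math import comb


def pair_counts(class_sizes):
    """Return (matching, mismatching) counts of unordered instance pairs.

    A pair is "matching" when both instances come from the same class.
    """
    total = comb(sum(class_sizes), 2)
    matching = sum(comb(n, 2) for n in class_sizes)
    return matching, total - matching


# Balanced datasets with 10 instances per class and a growing class count.
for n_classes in (2, 5, 20):
    matching, mismatching = pair_counts([10] * n_classes)
    print(n_classes, matching, mismatching)
```

Even though every dataset here is perfectly class-balanced, the share of matching pairs shrinks roughly in proportion to one over the number of classes, so the pair-level task becomes imbalanced on its own.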

Intersection with LSL
Various imbalances can appear within LSL frameworks; the most commonly addressed of these is the natural "data imbalance" between the seen classes with plentiful data and the unseen classes with scarce data. However, in this survey we elect not to include works which focus on this data imbalance, as this is a challenge inherent to all LSL works, rather than an overlap between LSL and CI. Other imbalances are more controllable and therefore more relevant to our review; for instance, a 2023 survey by Ochal et al. [16] distinguishes between "dataset imbalance," where the data from the entire training set is imbalanced (as in traditional CI), and "task imbalance," where the data in each episode is imbalanced, either within the support set or query set (though the latter case is only relevant in the context of TLSL).

Common non-LSL approaches to CI
Extensive research has been conducted on CI and its countermeasures within more traditional machine learning tasks [1,2]; these measures include but are not limited to data sampling methods such as random under/oversampling, weighted loss functions such as focal loss, ensemble methods such as bagging and boosting, data augmentation methods such as generative adversarial networks (GANs), and hybrids of these techniques. However, we elect not to go into further detail on these approaches, as they are rarely utilized in the works we cover.

Survey methodology
We restricted our literature search to articles published between January 2020 and July 2023, with exceptions made for sufficiently relevant works we found predating this period (through the references of other papers or through other means).
All collected articles were screened for relevancy, ensuring that they met a minimum quality standard, and that they either addressed CI present within the LSL setting, or used LSL methods or techniques to solve existing CI ("Solving imbalance within LSL" and "Using LSL to solve existing imbalance" sections, respectively). Works which use traditional LSL frameworks described in "Common approaches to FSL and OSL" section (such as episodic training) were more likely to be included than works which do not, though works which deal with tasks similar to LSL (such as incremental learning and one-class classification) were included if sufficiently relevant. As discussed in "Intersection with LSL" section, we do not include works which only address the data imbalance between the seen and unseen classes in the LSL setting. Finally, preference was given to works which offer novel solutions or analysis on the overlap between LSL and CI; thus, we often excluded works that simply apply trivial techniques (e.g., a modified loss function) without deeper exploration, or those that use CI or LSL settings only as ablation studies or footnotes.
To the best of our knowledge, there is only one other survey/review paper which focuses on both LSL and CI, which is a 2023 survey by Ochal et al. [16]. This survey has a much more limited scope than our own, but provides a novel, detailed performance comparison between the works which it does cover. We examine Ochal et al.'s work in more detail in "Image classification" section.

Solving imbalance within LSL
We first discuss works which address class imbalance as an obstacle present in LSL settings, categorized by the specific task for which LSL is applied. Most works aim to strengthen LSL methods against CI that occurs or could occur in their dataset; however, we also cover works which deal with imbalance which is inherent to the task or method used (such as foreground-background imbalance for object detection). For works which propose models with multiple components, we focus only on those components explicitly stated to be related to or designed for imbalance.
We categorize the surveyed works by the application in which LSL is applied. Additionally, an overview of the surveyed works in this section is shown in Table 1 below, with the name of the paper's proposed model (with "N/A" used when the authors provide no name), the application in which the paper works, the type of imbalance present, its severity ρ (with "UNSP" used when this is unspecified and "N/A" when not applicable), the number and type of baseline methods compared to the proposed model, and the proposed model's notable improvements over these baselines. For papers which only compare to variations of their proposed model, we attempt to highlight the performance effects of any CI-related components. For papers which experiment on multiple datasets with varying levels of imbalance, we provide ρ for the most imbalanced dataset, if this information is available. Double lines between table rows indicate the subsection divisions between the surveyed papers.
We note that there are many works which apply "trivial" CI methods (see "Common non-LSL approaches to CI" section) independent of the rest of the model, and thus are not included in the body of this survey nor in Table 1.These measures include loss functions such as focal loss [17][18][19][20][21], class-weighted loss [22][23][24][25], difficulty-weighted loss [19,26], and others [27][28][29][30][31][32]; as well as resampling methods such as random undersampling [33], difficulty-based sampling [34], and other forms of balancing sampling [22,35,36] (citations above are not exhaustive).In addition to the works above, we note but do not include a paper by Li et al. which proposes a novel few-shot intent-detection benchmark which contains various forms of class imbalance, but does not propose measures to explicitly counter this imbalance [37].

Image classification
A 2023 paper by Ochal et al. [16] carefully examines the effect of different distributions of minor class imbalance (ρ ≤ 20), as both task imbalance and dataset imbalance (defined in "Low-shot learning" section), on ten SOTA few-shot image classification (FSIC) techniques, while also evaluating the impact of simple CI countermeasures such as random sampling. They first find that metric-based models are much more robust However, they found that random-shot training actually negatively impacted the performance of almost all models, despite the fact that it more accurately represented the imbalanced evaluation setting; various combinations of ROS and ROS+ during both training and testing proved to be more effective in most cases. Balanced loss functions such as focal loss were also tested and shown to be less effective than ROS. Finally, the authors showed, over many different dataset sizes and distributions, that the models were generally more robust to dataset imbalance than to task imbalance; however, when together, the two imbalances compound to yield a larger drop in accuracy than the simple sum of the two effects. While the authors do not make solid conclusions regarding effective techniques against either type of imbalance, we find this survey to be exceptionally comprehensive in analyzing the effects of these imbalances throughout different scenarios and severities.
A 2021 paper by Veilleux et al. [38] evaluates high-performing TFSL techniques, such as PT-MAP [67] and TIM [68], in a "realistic" setting by adding query-set imbalance. It shows both theoretically and experimentally that many transductive methods rely on the unrealistic assumption of perfect class balance in the query data; in experiments on miniImageNet [64], CUB [65], and tieredImageNet [69], this caused some methods to actually perform worse than inductive baselines (which have less information) when testing on randomly unbalanced query sets (ρ not specified). A notable exception was the LaplacianShot model [70], which responded well to the query imbalance, presumably because it does not implicitly assume a balanced query set. In addition, Veilleux et al. propose "a generalization of the mutual-information loss based on α-divergences," meant to improve robustness to query set distributions. The proposed model using this loss function, named α-TIM, outperformed almost all tested models in query-imbalanced TFSL, across multiple datasets and few-shot settings. However, it is not clear how exactly the model is constructed, besides this loss function.
A later paper by Hess and Ditzler [39] builds on [38] by proposing a "maximum log-likelihood" (MLL) method to deal with query imbalance within TLSL; this method is prototypical, and more realistically estimates the distribution of the query set using an exponential probability density function (other details of the model are beyond the scope of this survey). Two versions of the model, using either the MLL metric (ProtoNet MLL) or a combination of MLL and two more basic metrics (ProtoNet Combined), are proposed and tested in LSL and TLSL settings against many SOTA methods from the literature, on the same datasets as in [38]. ProtoNet Combined outperformed all tested models in the transductive 1-shot and inductive settings, and outperformed all except α-TIM [38] in the less challenging transductive 5-shot setting.
An unpublished 2022 paper by Tao et al. [40] asserts and shows empirically that many TFSIC methods suffer from "imbalanced predictions" at test time, though they are not clear on where this prediction imbalance might come from, nor on the motivation behind their method of quantifying this imbalance: the difference between the model's minimum and maximum number of predictions per class. However, they do propose a method named "Transductive Fine-tuning with Margin-based uncertainty weighting and Class-balanced normalization" (TF-MC) which aims to give more balanced predictions. Margin-based uncertainty weighting refines and balances the model's uncertainty measurement on a sample-level basis; meanwhile, class-balanced normalization counteracts the natural tendency for the most-predicted class to have the largest gradient, making it even more likely to be predicted in the future. In experiments on the multiple-domain FSIC dataset Meta-Dataset [71] (ρ not specified), Tao et al. find that TF-MC yields a small but significant increase to per-class accuracy when compared to other transductive methods applied over various classifier backbones. Ablation studies also confirm that both techniques contribute to the final model performance.
Wertheimer et al. [41], in a follow-up to a published 2019 paper by Wertheimer and Hariharan [72], focus on challenges presented by FSIC, including image clutter, inconsistent "granularity" of class labels, and uneven or imbalanced classes. To address the last of these problems, they incorporate an approach called "batch folding" into their model, which is based on leave-one-out cross-validation; more specifically, each instance is used as both a support instance and a query instance in different "folds" of training, rather than instances only being used as one or the other. While it is unclear how exactly this method is intended to address CI, the authors find that it grants a 4-point gain in model accuracy when added to a basic prototypical network, on a novel meta-learning benchmark task adapted from the iNat2017 dataset [73] (ρ not specified). Similar results were found in additional experiments using different datasets or settings; however, the batch-folding component was often the smallest contributor to the final performance compared to the two other components (not aimed at CI) of the final model.
A recent paper by Wang et al. [42] claims that the contrastively-trained vision-language models (VLMs) used for LSIC, such as CLIP [74], show promise but perform poorly on imbalanced datasets. They propose a novel "lightweight decoder" to avoid memory issues and better represent tail classes; however, the focus of the paper is on augmenting VLMs by feeding their output into various CI algorithms, such as focal loss or more recent methods such as Disalign [75]. They test these methods on CLIP on three long-tailed image classification datasets including Places-LT and ImageNet-LT [76] (ρ ≤ 1000), and compare them to CLIP on its own and with prompt tuning or full fine-tuning (as suggested in earlier literature). While specific results and best-performers varied greatly by dataset, the class-imbalance methods consistently outperformed the baselines in terms of F1 score and accuracy.
A 2022 paper by Deng et al. [43] deals with "intra-class and inter-class" data imbalance within the task of one-shot image classification for road objects, proposing a novel GAN model named PcGAN. Compared to traditional GAN models, PcGAN places more focus on learning a robust embedding space which can function well with only one support instance per class. The model contains a "data reconstruction" generator and a "[image] degradation generator"; the former module separates the important information from the image degradation in real-world test instances, while the latter simulates the effect of a given degradation type on a clean-photo support instance (which the authors refer to as a "prototype"). Test instances can then be classified based on their similarity to each degraded prototype. PcGAN was evaluated quantitatively on two traffic sign/signal datasets (ρ not specified) against three OSL road-object-classification SOTAs, outperforming all on both seen and unseen classes; it was also evaluated qualitatively through data retrieval and reconstruction experiments, where its output was much clearer on average than that of the other models tested.
He et al. [44] deal with the task of class-incremental learning (shortened in their paper to CIncL), which is similar to LSL in that new classes can be learned after the initial training phase; more specifically, the paper explores connections between techniques used to combat traditional CI and those used to combat the phenomenon of "catastrophic forgetting" in CIncL. Distinct from the data imbalance problem in LSL, this forgetting is due to limited memory space preventing many seen-class instances from being stored when learning new classes; thus, the seen classes are the minority rather than the unseen ones. They propose a theory-driven CIncL approach known as "post-scaling," which involves adding a simple fixed layer to the end of a network to compensate for prior shift in test data. Despite the simplicity of this approach, experiments on two datasets (CIFAR-100 [77], ρ > 1, and a variant of ImageNet [78], ρ > 100) showed that it slightly outperformed SOTA CIncL methods on the imbalanced dataset, and outperformed simple CI techniques such as random oversampling in all cases.
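The imbalance measure described above (the gap between the most- and least-predicted class) is simple to compute. The sketch below is our own reading of that description, not Tao et al.'s code; in particular, counting classes that are never predicted as zero is an assumption.

```python
from collections import Counter


def prediction_gap(predictions, classes):
    """Difference between the largest and smallest per-class prediction counts.

    classes lists every class in the episode, so that classes the model
    never predicts are counted as zero (an assumption of this sketch).
    """
    counts = Counter(predictions)
    per_class = [counts.get(c, 0) for c in classes]
    return max(per_class) - min(per_class)


# Hypothetical 3-way query set of 9 instances, 3 per class in truth.
preds = ["a", "a", "a", "a", "a", "b", "b", "b", "c"]
print(prediction_gap(preds, ["a", "b", "c"]))  # 4
```

On a class-balanced query set a perfectly accurate model scores 0, so any positive gap reflects skewed predictions; on an imbalanced query set the measure is harder to interpret, since a nonzero gap may be correct.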
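To make the idea of compensating for prior shift concrete, the sketch below applies the textbook correction of rescaling each predicted probability by the ratio of the test-time class prior to the training-time class prior, then renormalizing. This is an illustration of the general principle under our own assumptions and is not claimed to be the exact layer used by He et al.; all names and numbers are hypothetical.

```python
def rescale_for_prior_shift(probs, train_prior, test_prior):
    """Re-weight class probabilities for a change in class priors.

    probs:       model output probabilities, one per class.
    train_prior: class frequencies the model was trained under.
    test_prior:  class frequencies expected at test time.
    """
    scaled = [p * t / s for p, s, t in zip(probs, train_prior, test_prior)]
    z = sum(scaled)
    return [v / z for v in scaled]


# A model trained with a 90/10 class split slightly favors class 0,
# but the test data is assumed to be balanced.
adjusted = rescale_for_prior_shift(
    probs=[0.6, 0.4],
    train_prior=[0.9, 0.1],
    test_prior=[0.5, 0.5],
)
print([round(p, 3) for p in adjusted])
```

Because the correction is a fixed element-wise scaling followed by normalization, it needs no retraining, which fits the description of a simple fixed layer appended to an existing network.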
Finally, Smith and Conovaloff [45] deal with class performance imbalance in the field of "one-shot semi-supervised learning," though it is not clear what exactly in their setting is "one-shot": there is no mention of new or unseen classes, nor the N-way-K-shot paradigm, but they do claim the field "superficially bears similarity to few-shot meta-learning." Additionally, while they compare their situation to the "traditional" class imbalance problem, they stress that the imbalance here is different, due to the fact that most of the training data is unlabeled.
To deal with this, they first propose using the model-generated pseudo-labels to estimate the class count and establish majority and minority classes in the task. They then propose four separate class balancing methods (to be applied and tested separately rather than in series): the first lowers the pseudo-labeling threshold for minority classes in order to generate more minority instances, simulating oversampling techniques; the second and third use weighted loss functions to respectively emphasize either all of the minority instances or the high-confidence minority instances; and the fourth simply combines the first and the third methods. The authors only compare their proposed methods to one (non-SOTA) approach from the literature, which all four methods outperform easily. In cross-comparisons, however, they find that the fourth method performs best (∼91% test accuracy) on the CIFAR-10 dataset [77], while the first performs best (∼97% test accuracy) on the SVHN [79] dataset; note that these are 10-class benchmark datasets (CIFAR-10 for object recognition, SVHN for digit recognition) that exhibit little natural class imbalance.
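The first balancing method (a lower pseudo-labeling threshold for minority classes) might look roughly like the following. The linear scaling rule, the `base` and `floor` values, and all names are hypothetical choices made for this sketch; the source does not specify how Smith and Conovaloff set their thresholds.

```python
from collections import Counter


def class_thresholds(pseudo_labels, classes, base=0.95, floor=0.5):
    """Assign each class a confidence threshold for accepting pseudo-labels.

    The most frequently pseudo-labeled class keeps the base threshold;
    rarer classes get proportionally lower thresholds, down to floor.
    (Linear scaling is an assumption of this sketch.)
    """
    counts = Counter(pseudo_labels)
    largest = max(counts.get(c, 0) for c in classes)
    if largest == 0:
        return {c: base for c in classes}
    return {
        c: floor + (base - floor) * counts.get(c, 0) / largest
        for c in classes
    }


# Current pseudo-labels suggest class "a" is the majority and "b" the minority.
current = ["a"] * 8 + ["b"] * 2
print(class_thresholds(current, ["a", "b"]))
```

Lowering the bar for the minority class admits more (and noisier) minority pseudo-labels on the next round, which is the sense in which the method simulates oversampling.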

Zero-shot image classification
Due to the difficulty and rarity of the ZSL setting as compared to the FSL or OSL settings, as well as the general difference in approaches used, below we separately cover works which tackle zero-shot image classification (ZSIC).
A 2022 paper by Arfeen et al. [46] handles class imbalance in the less explored, more difficult field of zero-shot domain generalization (ZSDG), which combines the labelspace-mismatch challenge of ZSL-in this case with semantic information-with the data-distribution-mismatch challenge of domain generalization.In this setting, CI in the training data may cause a model to learn inaccurate relationships between the seen and unseen classes, thus causing the model to poorly transfer its knowledge between domains.The proposed model, MAMC-Net, adds an "adaptive margin" to each of the semantic classifiers in the model, which is larger for minority classes and smaller for majority classes.MAMC-Net outperformed all tested baselines on the DomainNet benchmark [80] ( ρ not specified), and an ablation study showed the adaptive margin to have a significant positive effect.
The following three papers evaluate their semantic GZSL image classification models on the same three datasets: the balanced CUB [65] and SUN [81] datasets, and the imbalanced AWA2 [82] ( ρ ≈ 15 ).While none of these papers reference each other, their identical evaluation settings allow us to show direct comparisons of their models' performances, in which we focus on results for the AWA2 dataset, due to its more realistic class imbalance.
Ji et al. [47] propose a simple balanced sampling approach on top of a prototypical model, then add a more complex "feature fusion" technique to account for the fact that certain instances may be more or less representative of their class prototype.This is done by aligning the semantic features of the class with the visual features of instances, creating semantic-guided prototypes for each class.It is admittedly unclear how this component combats imbalance; however, the authors report good performance, with their "SCILM" model outperforming most competitors (19 tested) in both ZSL and GZSL settings on the three datasets listed above plus two others.On AWA2 [82], they report 48.9% accuracy on unseen classes, 77.8% on seen classes, and 60.1% on the harmonic mean of these two.
Ye et al. [48] propose a prototypical model with a neural network to generate the embeddings used for the class prototypes, as well as a "class-balanced" variant of triplet loss which uses balanced sampling for mini-batches and a slightly extended calculation to reduce variance within the batch while maintaining low training times.In experiments on the three datasets listed above plus two others, the proposed model outperformed many SOTA GZSL models for the balanced datasets, and marginally outperformed all for the imbalanced datasets.On AWA2 [82], they report 62.2% accuracy on unseen classes, 76.7% on seen classes, and 68.7% on the harmonic mean of these two, significantly outperforming [47].Additionally, on the imbalanced data, they report better training times than all but one compared model, CNZSL [83].
Most recently, Chen et al. [6] propose an "attribute-level contrastive learning" scheme, which contains a novel attribute-based sampling strategy: given an instance with a class and a chosen target attribute, it will select instances of a similar class but without the attribute, or instances of a dissimilar class with the same attribute; this helps the model separate independent classes and attributes, and aims to alleviate the attribute imbalance (the mechanics of this are less clear). Experiments found the proposed DUET model to perform close to or above the level of all tested SOTA models on the three datasets listed above in both ZSL and GZSL settings. On AWA2 [82], they report 63.7% accuracy on unseen classes, 84.7% on seen classes, and a harmonic mean of 72.7% between the two, reasonably outperforming both [47] and [48]. Given that Chen et al. do not make reference to these previous two works, DUET's higher performance can likely be attributed to its unique use of image attribute information, as well as possibly the classification power of its other components.
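As a quick consistency check on the AWA2 figures quoted for [47], [48], and [6], the harmonic mean H = 2·S·U / (S + U) of seen-class accuracy S and unseen-class accuracy U can be recomputed from the reported values. The helper name `harmonic_mean` is ours.

```python
def harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """Harmonic mean of seen- and unseen-class accuracy (both in percent)."""
    if seen_acc + unseen_acc == 0:
        return 0.0
    return 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc)


# (unseen, seen) accuracies on AWA2, as quoted in the text.
reported = {
    "SCILM [47]": (48.9, 77.8),
    "Ye et al. [48]": (62.2, 76.7),
    "DUET [6]": (63.7, 84.7),
}
for name, (unseen, seen) in reported.items():
    print(f"{name}: H = {harmonic_mean(seen, unseen):.1f}%")
```

The recomputed values (60.1%, 68.7%, and 72.7%) match the reported harmonic means. Because the harmonic mean is dominated by the smaller of its two inputs, gains on unseen classes raise H more than equal gains on seen classes.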

Object detection
A 2021 survey by Majee et al. [49] aims to evaluate existing few-shot object detection (FSOD) techniques under real-world class imbalance. They test four techniques from recent literature on the India Driving Dataset (IDD) [84], deliberately chosen for its unfiltered class imbalance (ρ > 200) between common road objects such as "motorcycle" and rare ones such as "trailer". Tests are performed in two distinct settings: the "same-domain" setting, where 7 of the dataset's 15 classes are used for training and 3 are used for testing (the remaining 5 are discarded); and the "open-set" setting, where one of the discarded classes ("vehicle-fallback") is manually broken down into 4 significantly smaller classes and used for testing. The authors found that the method they call "FsDet" (which is simply called "TFA" in its original paper [85]) significantly outperformed the other three methods in both settings, in some cases reaching performance similar to that of non-few-shot baselines. Unfortunately, the authors provide no analysis of any of the techniques used, and neither Majee et al. nor [85] theorize on how (or if) TFA/FsDet counters class imbalance.
Agarwal et al. [50] propose an FSOD model named AGCM, which aims to address two weaknesses of meta-learning-based LSL models: catastrophic forgetting (where a model unlearns base classes while learning novel ones) and class confusion (where a model consistently misclassifies one particular class as another). These issues are implied by the authors to be caused or exacerbated by CI, and are addressed by an "Attentive Proposal Fusion" module and a "Cosine Margin Cross-Entropy Loss," respectively. Using these methods, the authors report small but consistent improvements in performance over the other models tested on the IDD [84] (ρ > 200) and PASCAL-VOC [86] (ρ ≈ 50) datasets in an N-way-K-shot setting. Ablation studies are also performed to demonstrate the significance of the chosen methods and hyperparameters.
Two related papers [51, 52] deal with the foreground-background imbalance problem in FSOD for remote sensing images. Models in this field often have a "region proposal system" (RPS) which proposes regions of the image to be classified by other parts of the model. However, these papers found that the foreground-background imbalance of the images would lead to poor performance, due to the RPS not proposing enough "positive anchors" for the unseen classes.
To counteract this, the 2021 paper by Huang et al. [51] used a "balanced fine-tuning strategy" (BFS), which allowed the RPS to create twice as many region proposals and incorporated a balanced loss function; these measures are intended to let the model find more foreground (positive) region proposals for novel classes, and to classify these proposals more effectively. In experiments on two remote-sensing datasets, NWPU VHR-10 [87] and DIOR [88] (ρ not applicable), their proposed model outperformed all five tested baselines, and an ablation study showed that all model components contributed to the final performance, though the BFS component contributed only 1-3 mAP.
The more recent and in-depth of the two works, by Wang et al. [52], instead proposes fine-tuning the RPS on both the seen and unseen classes, in contrast to the usual method where the RPS is "frozen" after pre-training on the seen classes while the other components are fine-tuned. This measure, along with relaxing the minimum "confidence level" needed for region proposal, increases the number of "positive anchors" for unseen classes, supposedly solving the imbalance. In experiments on the same two datasets, the proposed CIR-FSD model outperformed all tested baselines: 7 models of varying recency and performance, which did not include the model from [51].
While the model proposed in [51] reports better peak results (+5 mAP on 10-shot for NWPU VHR-10's novel classes), [52] showed better performance with low shots (+7 mAP on 3-shot for the same classes) and could attribute more of its performance to its class-imbalance measures. Further research is necessary to directly compare these CI measures with different model components.

Other image-data tasks
This subsection covers papers which work with image data but do not perform image classification or object detection; this includes papers which deal with image segmentation or retrieval.
Ouyang et al. [53] tackle the foreground-background imbalance problem, among others, for few-shot segmentation of medical images. Their proposed model, named "SSL-ALPNet," modifies an existing prototypical model named PANet [89] by incorporating "superpixel-based semi-supervised learning" and "adaptive local prototype pooling." The latter component combats imbalance by creating several prototypes for various sections of the background region, rather than having one larger prototype for the entire category of "background." When evaluated on three medical scan datasets (ρ not applicable), SSL-ALPNet significantly outperformed two prior baselines as well as three proposed variants, and achieved respectable performance compared to two fully-supervised roofline models with access to manually-annotated training data. Overall, the experiments are limited, as the authors evaluate relatively few baselines, provide little detail on them, and do not focus on imbalance.
A 2020 paper by Dutta et al. [54] studies "for the first time in literature" the effect of imbalance in the field of zero-shot sketch-based image retrieval (ZS-SBIR), which involves selecting a query image from a database which matches a user-input "sketch" image; the zero-shot component refers to working with classes of images unseen during training. In preliminary experiments, they find that class imbalance negatively affects the embeddings learned by the model, leading to poor generalization to unseen classes, and so they propose an Adaptive Margin Diversity Regularizer (AMD-Reg) module to combat these effects. The key idea borrowed from the Diversity Regularizer (used previously in non-LSL image tasks) is to ensure that semantic class centers are well-separated and spread out in the feature space used by the model; however, Dutta et al. modify this approach by adding an "adaptive margin" to account for class imbalance by giving larger, safer margins to smaller classes. Experiments found AMD-Reg to outperform various sampling and loss-modification techniques when applied to a simple model on artificially imbalanced datasets (ρ = 10 and 100). AMD-Reg also notably improved the performance of all tested state-of-the-art models on larger datasets, in both ZS-SBIR and its generalized zero-shot variant.
Wu and Gong [55] work in the task of re-identification (Re-ID), where data of a person (usually in the form of video footage) under noisy conditions (camera angle, noise, low resolution, etc.) must be identified as either a previously seen person or as an unseen person. Due to its continual, "active-learning" nature, the authors compare this task to class-incremental learning, with the challenges of ZSL (without semantic information) and CI. The proposed countermeasure is a novel and complex loss function with three components, including one to counteract CI known as "classification coherence loss." This component of the loss function has two phases: for the first part of training, it is a cross-entropy loss which balances learning new information with retaining old information (a common approach in class-incremental learning); for the second part, a rebalancing component is added. The structure is motivated by the authors' observation that using this rebalancing component throughout training would overfit the model on difficult new samples, causing forgetting of old classes.
In experiments on four popular Re-ID benchmarks (ρ not specified), Wu and Gong found that their method, named GwFReID, outperformed all tested baselines (most built from SOTA incremental and Re-ID approaches) on average and for early sections of training, though it underperformed slightly in later sections, as more classes were introduced; GwFReID also outperformed all baselines when evaluated on new datasets after training. Finally, ablation studies showed that the two-phase structure of the classification coherence loss was optimal, as the final model outperformed versions where the rebalancing component was kept on or off throughout.
A 2020 paper by Chen et al. [56] deals with contrastive imbalance (see "Intersection with LSL" section) while constructing an LSL Chinese character recognition model, SiameseCCR; while this imbalance is not the focus of this work, the authors outline their novel re-balancing method in detail. First, an equal number of positive (matching) and negative (mismatching) pairs are selected and used for training; then, the n most difficult instances are chosen based on misclassification rates and similarity scores; on the next iteration, a new set of positive pairs is constructed as before, while a new set of negative pairs is constructed so that all pairs contain at least one difficult instance.
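The pair re-balancing loop described above can be sketched as follows. This is a simplified illustration rather than the SiameseCCR implementation: the function `build_pairs`, the rejection-sampling strategy, and the way the set of difficult instances is passed in are all assumptions, and the scoring step that selects the n most difficult instances is left out.

```python
import random


def build_pairs(labels, hard, n_pairs, rng):
    """Return (positive_pairs, negative_pairs), each with n_pairs index pairs.

    labels: class label per instance.
    hard:   set of indices judged difficult in the previous iteration
            (empty on the first iteration).
    Assumes at least two classes exist and some class has two or more
    instances; otherwise the rejection loops below would not terminate.
    """
    idx = list(range(len(labels)))
    hard_list = sorted(hard)
    pos, neg = [], []
    while len(pos) < n_pairs:  # positive pairs: same class
        a, b = rng.sample(idx, 2)
        if labels[a] == labels[b]:
            pos.append((a, b))
    while len(neg) < n_pairs:  # negative pairs: different classes
        # After the first iteration, anchor every negative pair on a
        # difficult instance.
        a = rng.choice(hard_list) if hard_list else rng.choice(idx)
        b = rng.choice(idx)
        if labels[a] != labels[b]:
            neg.append((a, b))
    return pos, neg


labels = [0, 0, 0, 1, 1, 2, 2, 2]
rng = random.Random(0)
pos, neg = build_pairs(labels, hard=set(), n_pairs=4, rng=rng)    # iteration 1
pos, neg = build_pairs(labels, hard={3, 6}, n_pairs=4, rng=rng)   # iteration 2
print(pos, neg)
```

Keeping the positive and negative sets the same size addresses the pair-level imbalance, while anchoring negatives on difficult instances concentrates training on the pairs most likely to be confused.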
On novel character-recognition datasets (ρ not specified), SiameseCCR performed exceptionally in the few-shot and one-shot scenarios, significantly outperforming three existing character recognition models (which suffered from overfitting), and achieving ≥ 98% accuracy even in the top-1 GOSL setting. However, their experiments have some shortcomings: namely, the compared models were neural networks not designed with the LSL setting in mind, and classification accuracy was the only performance metric used. Further, it is unclear how much of SiameseCCR's performance can be attributed to its unique sampling strategy rather than other components of the model.
A recent paper by Wang et al. [57] takes an LSL approach to the task of automatic CAPTCHA completion, but claims that many LSL techniques will perform poorly in realistic settings, due to support-set class imbalance and the multi-domain nature of the task. Their model follows the usual N-way-K-shot episodic framework (using either ProtoNet [3] or MAML [4] as a base), but they propose two improvements: a basic data augmentation strategy to deal with the cross-domain problem, and "intra-class variance distance weighting," which adjusts the prototype decision boundaries to counteract the variance caused by the support-set imbalance. While the proposed model was not compared to any others from the literature, experiments on five synthesized CAPTCHA datasets and real-world tests (ρ not specified) showed both components to have a positive effect on model accuracy in 5-shot and 10-shot scenarios, though distance weighting was not applicable in the 1-shot scenario nor when using MAML as a base.
A 2023 paper by Lei et al. [58] deals with the foreground-background imbalance present in one-shot 3D medical image segmentation, utilizing weak supervision in the form of hand-drawn scribble annotations from the labelled data. Their model, PRNet, uses a "propagation-reconstruction network" to realistically transfer these scribbles to the unlabeled data, thus generating reliable foreground or background markers in the unlabeled 3D images, alleviating the imbalance. These points are then further processed into "pseudo masks" used for class-specific training. The proposed model was evaluated on two medical scan datasets (the ρ metric is not applicable, but in one dataset the target class makes up only 0.1% of voxels) and drastically outperformed all three OSL methods used as baselines, achieving a 35-point gain in Dice score over the second-best model, and only a 10-point drop from a gold-standard fully-supervised model.
Cui et al. [59] also deal with (but do not focus on) foreground-background imbalance within FSL image segmentation for 3D medical scan images. Their model, MRE-Net, uses an existing technique known as Online Hard Example Mining (OHEM) [90], an iterative sampling method shown previously to be capable of handling foreground-background imbalance in non-LSL settings. In short, the modified OHEM determines which category is the majority or minority at the pixel level, collects all minority-class pixels and the N most difficult majority-class pixels (determined by a loss function), then uses these collected pixels to create a new loss function for the next iteration. In experiments on three segmentation datasets (ρ not applicable), the proposed MRE-Net outperformed its main competitor, U-Net [91], under different augmentations and numbers of shots, and for some classes approached the performance of tested fully-supervised methods. Ablation studies showed the OHEM component to contribute a small but significant amount (+4 Dice coefficient) to MRE-Net's final performance.
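The pixel-selection step of the modified OHEM, as summarized above, can be sketched in a few lines. This is an illustrative reading of the description, not code from MRE-Net: the function names, the binary label encoding, and the use of a plain mean over the kept pixels are assumptions.

```python
def select_pixels(losses, labels, n_hard):
    """Pick the pixel indices that contribute to the next loss.

    losses: per-pixel loss values (flat list).
    labels: per-pixel binary labels (flat list of 0/1).
    Keeps every minority-class pixel plus the n_hard majority-class
    pixels with the highest loss.
    """
    ones = [i for i, y in enumerate(labels) if y == 1]
    zeros = [i for i, y in enumerate(labels) if y == 0]
    # Decide which category is the minority for this image.
    minority, majority = (ones, zeros) if len(ones) <= len(zeros) else (zeros, ones)
    hardest = sorted(majority, key=lambda i: losses[i], reverse=True)[:n_hard]
    return sorted(minority + hardest)


def mined_loss(losses, labels, n_hard):
    """Mean loss over the selected pixels only (assumes at least one is kept)."""
    kept = select_pixels(losses, labels, n_hard)
    return sum(losses[i] for i in kept) / len(kept)


labels = [0, 0, 0, 0, 0, 1]               # one foreground pixel, five background
losses = [0.1, 0.9, 0.2, 0.8, 0.05, 0.5]
print(select_pixels(losses, labels, n_hard=2))  # [1, 3, 5]
print(mined_loss(losses, labels, n_hard=2))
```

Easy background pixels (indices 0, 2, and 4 here) are dropped, so the abundant majority class no longer dominates the loss.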

Non-image-data tasks
This subsection covers papers which do not work with image data. We find that these works are significantly less common in the LSL literature, likely because most image-based tasks contain many classes over a wide distribution of sizes, and thus lend themselves more easily to the LSL setting.
A 2021 paper by Bragg et al. [60] highlights poor benchmark quality and quantity in the task of few-shot natural language processing, including poor evaluation of class imbalance within this task. They propose and release a benchmark task named "FLEX," which includes class imbalance (along with many other challenges and features), though the specific types and degrees of imbalance are not specified. Bragg et al. also propose a prompt-based model for this task named UniFew (as well as a version with meta-training, named UniFewMeta), which utilizes an existing question-answering framework known as UnifiedQA [92] in order to streamline the model's interface and enable more consistent results. UniFew and UniFewMeta achieve good performance on the FLEX benchmark, outperforming the SOTA model on two of three tasks; however, due to the models' prompt-based nature, as well as a lack of model documentation from the authors, it is difficult to analyze which components, if any, allow UniFew to combat the class imbalance within FLEX.
In a 2023 paper, Kim et al. [61] frame the task of "cold-start" product recommendation (that is, with few prior user interactions) as an LSL problem, by considering each user as a separate task with limited instances (user reviews). They propose a MAML-based [4] approach, and to deal with the imbalance between positive and negative reviews (the severity and direction of which vary per user), they incorporate an adaptive weighted loss, which uses a recurrent encoder to encapsulate the rating distribution in each sequence of user ratings. In experiments on four user recommendation datasets (5 ≤ ρ ≤ 25) and four model backbones, the proposed "MELO" model significantly outperformed MAML alone in terms of both RMSE and MAE.
Yu et al. [62] incorporate a "center loss" combined with softmax loss into a prototypical network for few-shot industrial fault classification (a notably imbalanced task), though they give little detail on how this counters CI, and do not directly compare their method to any others from the literature. However, they do compare their model to variants with other bases (MAML [4]) and loss functions (focal loss) on fault diagnosis datasets with artificial support-set imbalance (ρ = 10), and show that the prototypical network with center loss yields the highest accuracy and F1 score.
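For readers unfamiliar with the idea, a center loss penalizes the distance between each embedding and the center of its own class, and is usually added to a softmax cross-entropy term. The sketch below uses a common formulation (half the squared Euclidean distance, weighted by a coefficient `lam`); the exact form and weighting used in [62] are not given here, so both are assumptions, as are all function names.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def center_loss(embedding, center):
    """Half the squared Euclidean distance to the class center."""
    return 0.5 * sum((e - c) ** 2 for e, c in zip(embedding, center))


def combined_loss(batch, centers, lam=0.1):
    """Mean of (cross-entropy + lam * center loss) over a batch.

    batch:   list of (embedding, logits, label) tuples.
    centers: dict mapping label -> class-center vector.
    lam:     hypothetical weighting coefficient.
    """
    total = 0.0
    for embedding, logits, label in batch:
        total += cross_entropy(logits, label)
        total += lam * center_loss(embedding, centers[label])
    return total / len(batch)


centers = {0: [0.0, 0.0], 1: [1.0, 1.0]}
batch = [([0.1, -0.1], [2.0, 0.5], 0), ([0.8, 1.2], [0.2, 1.5], 1)]
print(combined_loss(batch, centers))
```

The cross-entropy term separates classes, while the center term pulls each class's embeddings together, which is one plausible reason such a loss pairs well with prototype-based classifiers.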
Li et al. [63] also develop a metric-based approach for few-shot industrial fault diagnosis, proposing a "reweighted regularized prototypical network" (RRPN) which uses inter- and intra-class information to produce more discriminative class prototypes. To deal with class imbalance, the authors also incorporate an "intra-class reweighted strategy" (ICRS) which weights individual instances based on their class representativeness, as well as a "balance-enforcing regularization" (BER) term to encourage more balanced class representations and combat overfitting. In experiments on two common industrial fault datasets (ρ not specified) in 3-shot and 5-shot settings, the proposed RRPN outperformed five SOTA models on average and for rare fault types; an ablation study also showed ICRS and BER to have small but positive effects on model performance.

Using LSL to solve existing imbalance
Low-shot learning and class-imbalanced learning both deal with the same fundamental issue of having limited data for one or more classes; thus, many works have proposed anti-CI approaches inspired by LSL techniques.
This section covers papers that face class imbalance (or a similar obstacle) and use low-shot learning paradigms or techniques, such as prototypical networks or contrastive learning, to overcome it. For works which propose models with multiple components, we generally focus only on those components explicitly stated to be related to or inspired by the LSL setting or its techniques. Due to the relative similarity between the works presented here (as compared to those presented in "Solving imbalance within LSL" section), we place greater focus on the proposed model's training and evaluation process when possible.
As in "Solving imbalance within LSL" section, we categorize the surveyed works by the application in which LSL is applied, and provide an overview of these works in Table 2 below. Unlike Table 1, we do not specify the type of imbalance addressed by each work, as almost all works in this section deal with traditional CI.
We note but do not include a survey paper by Duan et al. [93] which covers LSL for anomaly detection and mentions the imbalance present within the task, but does not offer any original contribution or analysis against this imbalance.

Image classification
Bansal et al. [94] address general CI in an unpublished 2021 article, claiming that many traditional methods such as re-balancing or re-weighting lead to overfitting on minority data. They instead propose "MetaBalance," a meta-learning model based heavily on MAML [4]; the primary value of this structure, according to the authors, is that it allows two CI strategies to be independently applied in the inner and outer training loops. MetaBalance is model-agnostic, and so it can easily be customized for almost any classification setting, with the inner and outer CI measures tunable for each scenario.
MetaBalance is first evaluated on a 10-way image classification task. The authors compare their model-agnostic method (using a CNN as a base, in this particular experiment) to various traditional CI measures as well as SOTA augmentation techniques, on the CIFAR-100 dataset [77] with different levels of artificial CI (ρ = 1000 or 100). They find that MetaBalance (with ROS in the outer loop and no measure in the inner loop) significantly improves accuracy in both settings, reaching 30% classification accuracy in the severe case and 40% in the moderate case, compared to 23% and 37% from the next best baseline, RUS. For fairness, the authors re-evaluate after re-weighting the base model's priors to reduce class bias; however, MetaBalance remained the top performer.
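To make the two recurring quantities in this comparison concrete, the sketch below computes an imbalance ratio and applies random oversampling. It illustrates random oversampling only, not MetaBalance itself, and it assumes that ρ denotes the size of the largest class divided by the size of the smallest; the function names are ours.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Assumed definition: largest class size divided by smallest class size."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_oversample(labels, rng):
    """Return dataset indices in which every class matches the largest class.

    Minority-class indices are duplicated by sampling with replacement.
    """
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    target = max(len(idxs) for idxs in by_class.values())
    resampled = []
    for idxs in by_class.values():
        resampled.extend(idxs)
        resampled.extend(rng.choices(idxs, k=target - len(idxs)))
    return resampled


labels = [0] * 100 + [1] * 10 + [2] * 1          # toy dataset with rho = 100
print(imbalance_ratio(labels))                    # 100.0
balanced = random_oversample(labels, random.Random(0))
print(imbalance_ratio([labels[i] for i in balanced]))  # 1.0
```

Because oversampling only repeats existing minority instances, it adds no new information, which is consistent with the overfitting concern that Bansal et al. raise about traditional re-balancing.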
Additionally, MetaBalance is evaluated on a facial-recognition task, using a standard facial recognition CNN as a base model. The authors compare the proposed model to the base CNN model (on its own and with random over-sampling) on a face dataset with an artificial gender imbalance created by removing 90% of female face images. Despite the base CNN's ability to handle this imbalance, they find that MetaBalance improves accuracy further, reaching 90% female accuracy and 92% total accuracy, compared to 87% and 90% from the base model. MetaBalance was finally evaluated in two tabular-data tasks, the results of which are covered in "Tabular-data tasks" section.
Zhu and Yang [95] propose a novel, technical approach to deal with long-tailed learning in video and image classification. The authors' approach is inspired by prototypical networks; however, their setting has a higher intra-class variance which they claim cannot be easily encapsulated by one prototype per class. Thus, they instead propose "inflated episodic memory" (IEM), a key-value data storage framework which, in tandem with a novel "region self-attention mechanism" (RSA), is used to find and store the "most discriminative feature" for each class, which improves generalization to the tail classes. Experiments in both image and video classification tasks (on long-tailed datasets including Places-LT and ImageNet-LT [76], ρ ≤ 1000) showed the proposed model to marginally but consistently outperform the SOTA, in both "closed-set" and "open-set" settings (with the latter characterized by additional classes added during testing).
A 2021 paper by Samuel et al. [96] deals with general long-tailed learning, proposing a modular (i.e., for any application) model incorporating semantic-based GLSL methods. The proposed model, "DRAGON," first deals with data scarcity within the tail (few-shot) classes by fusing the predictions of a visual (traditional) sub-model and a semantic sub-model, with the latter being tuned for tail class prediction. It then deals with the imbalance itself by individually debiasing samples based on the estimated number of samples per class. Both modules are tuned using the confidence values from each sub-model's predictions. Extensive experiments on multiple long-tailed image classification datasets, including ImageNet-LT [76] (ρ ≤ 250), showed DRAGON to significantly outperform all tested SOTA methods in per-class accuracy, with these results being especially noticeable at higher shots (10 to 20). Remarkably, DRAGON also offered at-or-above SOTA performance when evaluated without any class semantic data (and thus without the model fusion component), presumably by relying on the debiasing component of the model.
A 2020 paper by Patra and Noble [97] applies incremental learning (or "lifelong learning"), where new classes or tasks can be learned after the initial training phase, to medical image object detection. They do this in both balanced and imbalanced settings (defined by the amount of available data from unseen classes), proposing a different model pipeline for each. Both models share a hierarchical classification strategy, with coarse and fine classification stages, and both use the same pretrained convolutional recurrent "supermodel" for the first stage; however, they differ greatly for the second stage. The model we focus on (for the more imbalanced setting) uses a "similarity-driven few-shot learning regime" where the supermodel outputs are used to direct instances to one of multiple Siamese networks. Experiments conducted on novel medical-image datasets found the proposed LSL model to noticeably outperform the other proposed model on the rare classes, and found both proposed models to significantly outperform a simple transfer-based baseline (used for lack of comparable models from the literature). However, it should be noted that the CI present within their imbalanced setting is unknown but most likely mild, with 500 instances per novel class and an unspecified number of instances per base class.
A short 2020 paper by Guan et al. [98] draws heavily from LSL paradigms to deal with class imbalance in aerial photograph classification. Their model, "Random Fine-Tuning Meta Metric Learning" (RF-MML), first trains N-way-K-shot on the majority classes only, randomizing N and K in each episode to improve robustness and avoid arbitrary hyperparameter selection. The model then similarly fine-tunes on all classes, keeping N equal to the total number of classes in the dataset. The authors propose two variants of RF-MML, using a prototypical network and a support vector machine (SVM) respectively for the fine-tuning phase; the initial training phase always uses a prototypical network.
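Episodic training with a randomized N and K, as described for RF-MML, can be sketched as below. The sampling ranges, the fixed number of query instances per class, and the function names are assumptions made for illustration; they are not taken from [98].

```python
import random
from collections import defaultdict


def sample_episode(labels, n_way, k_shot, n_query, rng):
    """Sample one N-way-K-shot episode as (support, query) index dicts."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    # Only classes with enough instances for support + query are eligible;
    # rng.sample raises ValueError if fewer than n_way classes qualify.
    eligible = sorted(c for c, idxs in by_class.items()
                      if len(idxs) >= k_shot + n_query)
    support, query = {}, {}
    for c in rng.sample(eligible, n_way):
        picked = rng.sample(by_class[c], k_shot + n_query)
        support[c], query[c] = picked[:k_shot], picked[k_shot:]
    return support, query


def random_episode(labels, max_way, max_shot, n_query, rng):
    """Draw N and K at random for each episode, then sample the episode."""
    n_way = rng.randint(2, max_way)
    k_shot = rng.randint(1, max_shot)
    return sample_episode(labels, n_way, k_shot, n_query, rng)


# Five majority classes with ten instances each.
labels = [c for c in range(5) for _ in range(10)]
support, query = random_episode(labels, max_way=5, max_shot=3, n_query=2,
                                rng=random.Random(1))
print({c: len(v) for c, v in support.items()})
```

Within an episode every class contributes the same number of support instances, so the model sees a balanced task even when the underlying dataset is not.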
Experiments on two artificially imbalanced datasets (10 ≤ ρ ≤ 30) compared the classification accuracy of both variants of RF-MML with a baseline deep neural network model (on its own or with one of two anti-CI measures from the literature). The RF-MML models sacrificed a few points of accuracy on the majority classes for a larger boost on the minority classes, leading to an increase in overall accuracy more noticeable on the more imbalanced datasets. Additionally, the variant with SVM fine-tuning marginally outperformed the variant with prototypical fine-tuning.
A 2020 paper by Weng et al. [99] is dedicated to solving class imbalance within dermatology image classification. They first evaluate many LSL approaches from the literature, both metric- and optimization-based, in both a standard N-way-K-shot evaluation setting and a modified "real-world" evaluation setting with a different class split between training, validation, and testing, but identical imbalance (ρ > 20). However, they find that even the best-performing of these LSL models, MatchingNet [64], performs only slightly better than a conventional ("CSL") baseline on rare classes, and performs much worse than the CSL method once it is augmented with classical CI countermeasures such as sampling or focal loss. In light of this, they propose an ensemble approach combining LSL and CSL methods by simply redirecting majority-class instances to the CSL model (baseline + oversampling) and minority-class instances to the LSL model (MatchingNet), which shows marginally better performance overall than either method alone.
Dong et al. [100] address CI in the field of federated (decentralized, i.e., different data is stored in different clients) partially-labeled classification of medical images. They propose "FedFew," a three-stage system: the first stage is a self-supervised class-agnostic embedding function, the second stage is a partially-supervised "energy-based" classifier for the common classes, and the third stage (the one drawn from LSL) is a nearest-neighbor model which uses "dual prototypes" to classify the rare classes. The "energy" from the second stage is used to determine whether an instance is likely to be a rare class and should thus go to the third stage. Experiments on an artificially imbalanced (ρ = 10) chest X-ray dataset showed FedFew to consistently outperform two baselines from the literature in all metrics (accuracy, recall, F1, and especially precision) on the rare classes.
A 2021 paper by Pei et al. [101] deals with class imbalance (ρ ≤ 20) in fault diagnosis of rolling bearings. This imbalance is often dealt with using a Wasserstein auto-encoder (WAE) [129] for robust data augmentation; the authors propose an augmented version (fs-WAE) which adds an optimization-based FSL technique named Reptile [130], a variant of MAML [4]. They train fs-WAE using multiple tasks created from their dataset, which differ in the selection of samples and optimization function used. In experiments evaluating augmentation quality on two datasets, fs-WAE was shown to generally outperform other baselines, including WAE and fs-WAE without Reptile. The exception to this was the WGAN-GP model, which outperformed fs-WAE on the more imbalanced dataset in terms of accuracy, but took significantly more computing power to train.
Liu et al. [102] propose a contrastive approach to combat the class imbalance and data scarcity in the task of ancient character recognition. The authors go into detail on the technical aspects of their model; important points include a residual-learning-inspired embedding network which reduces overfitting while preserving detail, a novel "soft similarity contrast loss function" which prevents the over-optimization that can occur with normal contrastive loss, and a novel "cumulative class prototype" which is more robust to outliers and deviations in each class. After preliminary experiments to confirm that these measures were effective, experiments were conducted in a 20-way-1-shot setting on a variety of character recognition datasets, with unspecified CI but a wide range of sizes, including a "big data" dataset with 1.1 million instances. The proposed model, SSN, marginally but consistently outperformed all SOTA networks tested (+0.3-0.7% accuracy) on all three datasets whose results were included in the paper. Variants of SSN were also tested which did not train on any of the classes used for evaluation; these variants suffered only very minor drops in accuracy (0.2-0.5%).
A 2022 paper by Cui et al. [103] utilizes LSL to counter the severe class imbalance (ρ ≈ 110) in building façade defect classification. More specifically, they propose an "extensible classifier" which uses "weight imprinting" to synthesize embeddings for novel classes from limited data; a contrastive learning module integrated into the loss function to improve the embedding function; and a hard negative mining (HNM) module added to the contrastive learning module to reduce the pair imbalance within the contrastive learning process. Their model is trained on the base classes and allowed to imprint on one to ten support samples of each novel class during evaluation. For lack of comparable façade defect classification models, the authors compared their model to a simple LSL baseline as well as versions of their model with reduced or removed HNM; they found that the proposed model performed best under any number of shots, especially for the novel classes, achieving 63.5% accuracy on novel classes with only one shot and 83.0% with ten.
A 2022 paper by Sümer et al. [104] uses LSL to deal with the CI (ρ ≈ 50) within the task of facial phenotype recognition for genetic disorders, proposing a standard prototypical network trained N-way-K-shot, with a deep CNN as a feature encoder. The proposed model, "GMDB-fs," was compared against one other SOTA (non-LSL) model, and outperformed this model significantly (+6-10% classification accuracy); additionally, the proposed model was more effective when the encoder was pre-trained on the MS1MV2 facial recognition dataset [131], and 10-way or 15-way episodic training was found to be the most effective.
Zhan et al. [105] face CI (ρ ≈ 75) in fabric defect classification, and propose a prototypical network (using a CNN feature extractor) trained under an N-way-K-shot paradigm, reasoning that the artificial class balance within the support sets will circumvent the CI within the dataset itself. After establishing that their "Prototypical Net" network achieves the best performance when trained 2-way-K-shot (K = 1 or 4), they directly compare its performance to deep CNNs commonly used in the literature, fine-tuned on fabric data. While Prototypical Net consistently outperformed all of these models, the authors do not make clear whether the models tested, proposed or baseline, were evaluated traditionally or N-way-K-shot.
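Since prototypical networks recur throughout this section, a minimal sketch of their classification step may be useful. It assumes embeddings have already been produced by some feature extractor (a CNN in [105]); the class names and vectors below are made-up toy values, and squared Euclidean distance is used as the metric.

```python
def compute_prototypes(support):
    """Average the support embeddings of each class into one prototype.

    support: dict mapping class name -> list of embedding vectors.
    """
    prototypes = {}
    for cls, vectors in support.items():
        dim = len(vectors[0])
        prototypes[cls] = [sum(v[d] for v in vectors) / len(vectors)
                           for d in range(dim)]
    return prototypes


def classify(query, prototypes):
    """Assign the query embedding to the class of its nearest prototype."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes, key=lambda cls: sq_dist(query, prototypes[cls]))


# Toy 2-way episode with hypothetical defect classes and 2-D embeddings.
support = {
    "hole": [[0.0, 0.0], [0.2, 0.0]],
    "stain": [[1.0, 1.0], [0.8, 1.2]],
}
prototypes = compute_prototypes(support)
print(classify([0.9, 0.8], prototypes))  # stain
```

Each class is represented by a single averaged vector built from the same number of support instances, which is the artificial balance that Zhan et al. rely on to sidestep the imbalance in the full dataset.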

Other image-data tasks
This subsection covers papers which work with image data but do not perform image classification.
Tambwekar et al. [106] address class imbalance in road object recognition, framing the problem as a few-shot batch-incremental setting and building off of a previous LSL model, FewX [132]. Their proposed model, DualFusion, uses two detector models: a (non-LSL) Faster R-CNN model [133] trained on only the base classes, and a modified FewX model which is fine-tuned on the novel class data. These two models are connected with a fusion network which allows the detectors to pass information between each other in an intermediate stage, and combines the feature outputs of the detectors at the end. An experiment on the IDD benchmark [84] (ρ > 200) compared DualFusion to each of its individual submodels for incremental learning; as expected, it significantly outperformed both in the general case, only losing to FewX when evaluating exclusively on novel classes. A second experiment on the COCO benchmark [134] (ρ ≈ 1000) compared DualFusion with a SOTA technique, ONCE, finding that DualFusion underperformed somewhat in the general case (10.6 vs 13.7 AP) but significantly outperformed on novel classes (9.9 vs 1.2 AP).
Tian et al. [107] address the imbalance ( ρ > 100) in the binary task of polyp classifica- tion in colonoscopy images, proposing a LSL method inspired by one-class classification (OCC).The proposed FSAD-Net consists of a feature encoder pre-trained on exclusively negative instances (inliers), and a "score inference network" trained on negative instances and few positive instances (outliers), optimized with a contrastive loss.The score generated by the latter module represents how close the input is to the center of the latent feature space, and therefore how likely it is to be an inlier.Experiments showed that FSAD-Net outperformed all tested SOTA ZSL/FSL baselines by a significant margin, achieving + 0.07 AUC-ROC over the next-best model.FSAD-Net and all FSL baselines were trained using only 30-40 abnormal images.However, all ∼13,000 images used in the dataset are sourced from only 18 colonoscopy videos; this raises the possibility of overfitting, which is not addressed by the authors.
A 2021 paper by Ji et al. [108] deals with the extreme class imbalance within humanobject interaction (HOI) recognition, where a machine must identify both the object within the input image and the action being performed on it.They propose a prototypical network, "SAPNet, " trained N-way-K-shot, first using a novel "SIGMA" module utilizing semantic label information to address the two-dimensionality of the task.For the prototypical classification itself, they then propose two competing modules: a "prototypes shift" (PS) method which incorporates the query sample into the estimation of each support class prototype, and a "hallucinatory graph prototypes" (HGP) method which uses a small network to hallucinate novel support samples in order to avoid bias in class prototypes.
Experiments on two datasets ( ρ not specified), each tested with two different class splits, showed both versions of SAPNet to outperform all tested methods, including LSL methods from general (non-HOI) literature.Of the two versions, the model with the HGP module outperformed slightly, reaching 72% 5-way-5-shot accuracy on one dataset, compared to 70.8% from the version with the PS module (and 53.3% from the next-best method, MAML [4]).Notably, its performance was worse in the 5-way-1-shot scenario, only outperforming MAML by 5% accuracy.
Sharma et al. [109] propose a similarity-based model in the field of hardware trojan (HT) detection, where images of integrated circuits (ICs) are checked for malicious trojans installed during production; this field is naturally plagued by data scarcity and CI. To handle these issues, they propose a novel "deep Siamese CNN" (DSCNN) model which uses novel deep CNN sub-models as feature extractors. They train, validate, and test their model on small synthetic datasets created from two popular HT detection benchmarks, with positive instances created by manually augmenting trojan-free image data. Validation and testing are performed using a modified 2-way-5-shot paradigm, where the support set consists only of negative instances, and the query image is compared to these for a similarity score which is thresholded for the final classification. Experiments found that the proposed DSCNN model significantly outperformed (by at least 5-10 points of accuracy) all four tested HT detection models from the literature. However, the level of imbalance in the training data is unclear: the authors first explain that their synthetic datasets have 50 positive and 50 negative instances (implying ρ = 1), but later claim that their model "effectively handles the class imbalance" within this dataset.
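The evaluation scheme described above, in which a query is scored against a support set of negative instances only and the score is thresholded, can be sketched as follows. This is an illustrative stand-in rather than the DSCNN model: the embeddings are assumed to come from an already-trained Siamese encoder, cosine similarity and mean aggregation are assumptions, and the name `is_positive` and the threshold value are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_positive(query_emb, negative_support, threshold=0.8):
    """Flag the query as positive (e.g., trojan-infected) when its mean similarity
    to the negative-only support embeddings falls below the threshold."""
    sims = [cosine_similarity(query_emb, s) for s in negative_support]
    return sum(sims) / len(sims) < threshold
```

In practice the threshold would be tuned on validation data, since it directly trades false alarms against missed positives.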
A 2020 paper by Hu et al. [110] deals with the extreme class imbalance in "open-world" image segmentation by framing the setting as a few-shot incremental learning problem, which they approach by "segmenting the tail." This refers to the process of splitting the training data into balanced but progressively smaller (i.e., fewer-shot) subsets of classes, allowing the model to incrementally learn each group after pre-training on the largest classes. Along with a balanced replay system to maintain class balance while preventing catastrophic forgetting, they propose a meta weight generator (MWG) module which initializes new-class classifiers from combinations of semantically similar old-class classifiers, e.g., a "drone" classifier initialized as a combination of "fan" and "airplane" classifiers.
In experiments on the LVIS image segmentation dataset [135], with instances per class ranging from only 1 to over 1000, the proposed "LST" model outperformed the base model (on its own or with various resampling methods from the literature) on all classes except the largest. A version of the proposed model without the MWG module was also evaluated, and performed slightly better on the "middle" classes but was otherwise inferior to the full model. Unfortunately, no SOTA methods were compared to the proposed model.
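The idea behind the MWG module of Hu et al. [110], initializing a new-class classifier from semantically similar old-class classifiers, can be illustrated with a simple fixed rule. The sketch below is an assumption-laden simplification, not the authors' learned generator: it presumes each class has a semantic embedding (e.g., from its label text), and mixes the `top_k` most similar old-class weight vectors with softmax weights. The function name and parameters are hypothetical.

```python
import math


def init_new_classifier(new_sem, old_sems, old_weights, top_k=2):
    """Initialize a new-class weight vector as a softmax-weighted mix of the
    top_k old-class weight vectors that are most similar semantically."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    sims = {name: cos(new_sem, sem) for name, sem in old_sems.items()}
    nearest = sorted(sims, key=sims.get, reverse=True)[:top_k]
    exp = {name: math.exp(sims[name]) for name in nearest}
    total = sum(exp.values())
    dim = len(old_weights[nearest[0]])
    return [sum(exp[n] / total * old_weights[n][d] for n in nearest) for d in range(dim)]
```

With toy semantic embeddings in which "drone" is equally close to "fan" and "airplane" and unrelated to "dog", the returned vector is the plain average of the "fan" and "airplane" classifiers.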
In a 2023 paper, Chen et al. [111] address the class imbalance and data scarcity within facial expression recognition from image and video data, by proposing a model which combines elements of LSL and semi-supervised learning. Their model, "SSF-ViT," first pre-trains a vision transformer on four self-supervised "pretext" tasks with unlabeled face data, then fine-tunes on a fully supervised expression recognition task. Finally, when evaluating on real-world data, a prototypical network is constructed (using the vision transformer as an encoder), and the input data is grouped into support and query sets for episodic evaluation. On six expression image datasets with a wide range of imbalance levels (1 ≤ ρ ≤ 40), SSF-ViT performed at or above the level of thirteen expression-recognition SOTAs, and outperformed two LSL models in additional 5-shot and 1-shot experiments.
A short paper by He et al. [112] deals with CI within the relatively unexplored field of action recognition from streamer video footage (streamer action recognition, or SAD). The authors propose a two-phase model which first pre-trains (non-episodically) to learn video features and then meta-trains a prototypical network 5-way-5-shot. Preliminary experiments using different similarity functions found the proposed model to yield the best performance when trained with an inner-product similarity and evaluated with cosine similarity. The proposed model significantly outperformed two SOTA methods from the literature on a novel few-shot SAD dataset (+20% accuracy under one-shot), and six others on general human-action datasets (+1-3% accuracy). Imbalance levels ρ were not provided for any datasets used.
A 2021 paper by Romanov et al. [113] addresses class imbalance (ρ ≈ 16) and data scarcity in the field of newborn gestational age estimation from image data, and proposes two approaches, tested separately: a prototypical network and MAML [4]. However, because there is too little postnatal image data to train these models "traditionally," the authors opt to pre-train on balanced non-medical image datasets (specifically, either miniImageNet [64], CelebA [136], or both) before evaluating 5-way-5-shot on the entire base of relevant data without fine-tuning. Due to the lack of comparable models from the literature, direct comparisons could only be conducted between the two proposed models. In these, MAML (meta-trained on both image datasets) showed the best performance, yielding accuracies of ∼53% for face and ear images and ∼40% for foot images. The authors also compare this model to the "Ballard score," a manually evaluated metric which is the SOTA for postnatal gestational age estimation; while the proposed model underperforms the Ballard score by ∼12%, they note that this metric simultaneously has access to foot, face, and ear information (plus extraneous information), and theorize that an ensemble of the proposed MAML networks could perform on par with the Ballard score.
Shi et al. [114] deal with the foreground-background imbalance within hyperspectral object detection by combining LSL with transfer learning, proposing a novel semi-supervised domain adaptive few-shot learning model. The LSL portion of the proposed model is a simple prototypical network trained 2-way-1-shot to learn appropriate embeddings; other components of the model include a novel MDS 2 F 2 convolutional network for pre-processing, an RCA attention mechanism/subnetwork, and a complex semi-supervised domain adaptation framework with a discriminatively boosted loss function. In experiments against five SOTA models from the literature on hyperspectral image data (ρ not applicable), the proposed "SIHTD" model was the best performer by a significant margin in terms of AUC-ROC, though it performed marginally worse with respect to other metrics tested (such as area under the threshold vs false-alarm-rate curve).

Tabular-data tasks
Bansal et al. [94] evaluate their MAML-based [4] model, MetaBalance, on two tabular tasks: credit-card fraud detection and loan-default prediction (details on the model and its results on image-based tasks can be found in the "For Image Classification" subsection). Of these two, the credit-card task is much larger (∼300,000 instances) and has more severe imbalance (ρ ≈ 600), but the authors find that it is nonetheless easier to classify than the loan task (ρ ≈ 4). Evaluating on AUC-ROC and using a traditional neural network as a base, experiments on the credit-card task found MetaBalance to outperform all tested classical CI measures, including RUS, SMOTE, and Edited Nearest Neighbors (ENN); experiments on the loan task found that the best-performing variant of MetaBalance used ENN in the outer loop and no CI measure in the inner loop.
Two papers by Bedi et al. in 2020 [137] and 2021 [115] utilize Siamese networks to combat the class imbalance in network intrusion detection. The first paper proposes SiamIDS, a standalone traditional Siamese network (explicitly borrowed from LSL literature), while the follow-up paper proposes the improved I-SiamIDS, an ensemble model which additionally includes XGBoost and deep neural network classifiers, with each of the three classifiers applied in series and fed into a second XGBoost classifier for the final output. Experiments conducted in the second paper compared SiamIDS and I-SiamIDS to other common classifiers (such as Random Forest and a CNN) on two intrusion detection datasets, including the NSL-KDD benchmark (ρ ≈ 650) [138]; results showed I-SiamIDS to generally outperform the other classifiers in terms of F1 scores and AUC-ROC, with this behavior being more consistent on minority classes.
Notably, the authors do not compare their models to traditional anti-CI measures such as data sampling or cost-sensitive learning, despite these methods' effectiveness in imbalanced binary classification [1].
A short paper by Huang et al. [116] proposes a "gated few-shot" model to combat the imbalance in network anomaly detection. They utilize a similarity-based approach, using a CNN as an encoder, and a "gate structure" which pre-emptively determines whether the query instance belongs to a seen or unseen class from the support set. The model is trained and evaluated on a reduced version of the NSL-KDD dataset (ρ ≈ 250) [138], and trained episodically on only some selected base classes. Experimental results were mixed; no SOTA methods from the literature were compared, and the proposed method was roughly on par with a traditional 1-nearest-neighbor approach for unknown classes and a simple SVM approach for known classes, marginally outperforming both for the overall dataset.
A 2021 paper by Gesi et al. [117] works in "just-in-time" (JIT) software defect prediction, and highlights the often-neglected class imbalance (ρ ≈ 10) in the field; they propose a split approach where majority instances are classified using a pre-existing non-LSL neural network, DeepJIT [139], while minority instances are redirected to a Siamese neural network. Experiments on two defect prediction datasets found that the proposed SifterJIT model outperformed the original DeepJIT (on its own or with oversampling) by a noticeable margin on the rare classes (+0.09 AUC-ROC, +0.025 F1 score).
Wu and Wang [118] deal with database error detection, a naturally imbalanced task which is often addressed using data augmentation methods. However, they propose ZSL in the form of one-class classification (OCC), allowing their model to train using no unclean (minority-class) data whatsoever. More specifically, they train a GAN which uses self-attention-based encoder-decoder modules as a generator, and a CNN as a discriminator; feeding this model clean (majority-class) data lets it learn the distribution of clean data, so that it can discriminate unclean data encountered after training. Experiments were performed on 5 datasets with various types of data errors (though the model was only tasked to distinguish between "clean" and "unclean") and class imbalance (ρ ≤ 30), comparing the proposed SAT-GAN model to six error-detection baselines, four of which were statistical or rule-based rather than ML-based. SAT-GAN significantly outperformed all non-ML methods, and achieved performance on par with (and occasionally exceeding) that of the best baseline, AUG [140], despite the fact that SAT-GAN had no access to minority-class data.
A 2022 paper by Li et al. [119] proposes a model named "Meta-IP" to deal with CI in the task of project extension forecasting; this model employs a transfer learning strategy framed as a LSL setting: a simulated source dataset is used as the support set for MAML, while the (real-world) target domain dataset is used as the query set. While the authors explain that using MAML allows for the choice of two independent loss functions and sampling strategies (for the inner and outer training loops), they do not explain which losses and strategies were chosen or tested in practice. Although the setup is poorly explained, experiments on project extension datasets (ρ ≤ 30) showed that Meta-IP consistently outperformed all "traditional" anti-CI measures tested, including SMOTE and bagging, in terms of AUC and especially BACC.

Other tasks
This subsection covers works which do not fall into any of the above categories, i.e., work with neither image data nor tabular data.
In a 2020 preprint, Chen et al. [120] deal with "3D point cloud segmentation," a field whose relative obscurity causes data issues, including scarcity and class imbalance; to deal with this, they borrow LSL techniques from the similar field of 2D image segmentation. They propose a prototypical network, adapted for point segmentation with a novel "Multi-View Comparison Convolution" (MVC) module to generate different "views" (embeddings) of support instances. In experiments on a novel point segmentation benchmark (ρ ≈ 40), the authors report that their model outperformed all four tested SOTA models on the rarest classes, and yielded slightly better performance overall (+0.2% classification accuracy over next-best).
A recent paper by Gao et al. [121] avoids the imbalance within electricity theft detection (from electricity usage time series data) by reframing the task as a one-class classification problem, using a contrastive learning approach to detect outliers. After multiple layers of pre-processing (using techniques unique to electricity monitoring), a contrastive network is trained on a large sample of negative (non-theft) data, which is used to create the model's "support set" when evaluating on test data. In experiments on multiple electricity-theft datasets with a range of artificial imbalance and scarcity conditions (1.5 ≤ ρ ≤ 9), the proposed model consistently outperformed six other machine learning models constructed as baselines in terms of F1 score, FPR, and AUC-ROC.
Gupta et al. [122] address class imbalance and data scarcity in electrocardiogram (ECG) classification by using a Siamese network with CNN feature extractors which is evaluated (but not trained) under a LSL paradigm. They use five ECG time series datasets (ρ not specified): two for training, two for validation, and the largest dataset for testing, which is done under a traditional 5-way-K-shot paradigm. Their proposed SCNN model outperformed all tested competitors (a traditional similarity-based time-series classification method, a simple nearest-neighbor algorithm, and an LSTM convolutional model, all of which were also evaluated K-shot) and proved very robust to the number of shots given (performance remained the same from K = 3 through 50).
A 2021 paper by Bhosale et al. [123] proposes, among other methods, a contrastive learning approach to deal with the class imbalance present in the task of COVID-19 diagnosis from cough sounds. After pre-processing the audio into "MFCC" sound features, they train their FSL model under a 2-way-K-shot paradigm, aiming to circumvent class imbalance by giving equal shots for each class. They also use a triplet loss in order to better adapt to the variability within negative (COVID-free) instances. While they do not compare this method to others in the literature, they do compare to a baseline as well as an alternative SVM-based (non-LSL) approach using a wider set of features and balanced weighting, on a curated cough sound dataset (ρ ≈ 15). They found that the FSL approach outperformed these marginally, reaching an AUC of 0.719 on the test dataset (compared to 0.699 and 0.706), but noted that specificity was poor throughout all models (∼0.50 on the test dataset).
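For readers unfamiliar with the triplet loss mentioned above, a minimal sketch of its standard margin-based form is given here. This is the generic formulation, not necessarily the exact variant used by Bhosale et al. [123]; the Euclidean distance and the default margin of 1.0 are assumptions.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: penalize the anchor being closer to the negative
    than to the positive by less than the margin."""
    d_pos = math.dist(anchor, positive)  # distance to same-class sample
    d_neg = math.dist(anchor, negative)  # distance to other-class sample
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, so training effort concentrates on triplets that are still confusable.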
A preprint article by Rentería et al. [124] uses Siamese neural networks to combat class imbalance and data scarcity in birdsong syllable classification. While they propose no changes to the standard Siamese structure, they do test variations with five different encoder sub-models, including one with no encoder (comparable to a nearest-neighbor approach). In experiments on an existing birdsong dataset (ρ ≈ 75), they found that the model versions with LSTM encoders and with no encoders were the highest performers, with the former performing slightly better with a high number of shots (91.3% accuracy when K = 7) and the latter performing better with a very low number (64.6% when K = 1). They also claim that these methods yield better accuracy on their dataset than any other method proposed in the literature.
A 2021 paper by Sunder and Fosler-Lussier [125] deals with the long tail in utterance classification (human speech recognition) using a pairwise approach. Specifically, the proposed model uses a "mixup strategy" which creates artificial data as combinations of any two classes. The model is then trained using a novel loss function which combines contrastive loss and a "mixup" loss, and final classifications on an instance are made by balancing this output with the model output on the unmixed input. Though the authors do not compare their model to any from the literature, experiments on two long-tailed datasets (ρ not specified) with three encoder bases find that their pairwise strategy improves performance over a simpler cross-entropy loss (about +6% F1 score).
Fernández-Llaneza et al. [126] propose an "N-shot learning" method (a misnomer for LSL) to deal with CI within the task of biochemical activity prediction (from textual molecule representations), as both a binary and categorical classification task. The authors propose a Siamese recurrent neural network with a self-attention mechanism, in tandem with heavy data augmentation and random oversampling; the model is evaluated by comparing input samples to a random sample of N training instances (hence "N-shot learning"). Experiments compared the proposed "SiameseCHEM" to three common ML classifiers (MLP, RF, and SVM) using one of two molecule representations in the binary classification task; SiameseCHEM was shown to outperform these significantly in terms of MCC on all five datasets tested (ρ not specified). Unfortunately, no SOTA activity prediction models were tested for comparison, and no comparisons were performed at all for the categorical classification task.
Wenjuan et al. [127] work in the task of micro-expression recognition using facial video data, a task with class imbalance and scarce data. Their proposed "Meta-MMFNet" model uses a non-LSL "feature fusion" module, which combines the "optical flow" and "frame difference" information between consecutive frames (common techniques for video processing), followed by a LSL-inspired meta-learning "model fusion" module. This module appears to be a prototypical network which takes as input the weighted sum (hence "fusion") of two DNN models fine-tuned on micro-expression and macro-expression data, respectively. Evaluating under an N-way-K-shot paradigm over three datasets (ρ not specified, but described as "highly unbalanced"), experiments showed performance on par with or exceeding SOTA techniques (+5% accuracy over next-best), with especially good results when identifying the "surprise" emotion. An ablation study showed that the model fusion method yielded on average a better performance than either DNN model did alone, though the micro-expression sub-model was quite close.
Finally, Patil and Ravindran [128] avoid CI within software defect classification (from full-sentence textual defect descriptions) by proposing a completely unsupervised "concept-based classification" (CBC) approach, which the authors categorize as zero-shot learning, despite it being more difficult than classical ZSL due to the complete lack of data for even the majority classes. CBC is completely semantic: the model constructs a pre-processed and indexed corpus of semantic Wikipedia articles relating to each defect type ("concepts"), and represents each input/label description as a combination of these concepts. They then use a similarity-based classifier to determine the label which most closely matches each input. Experiments conducted on two defect datasets (5 ≤ ρ ≤ 50) showed that despite a complete lack of labeled data, CBC yielded slightly inferior but comparable performance to SOTA fully- and semi-supervised defect classification rooflines (−0.05 to −0.10 F1 score from the fully-supervised model).

Shortcomings and future research
In this section, we highlight shortcomings of the literature covered in this survey, first covering general critiques present throughout many of the works covered, then briefly reviewing notable research gaps within works which deal with imbalance within LSL ("Solving imbalance within LSL" section) and then the gaps within works which propose LSL techniques against imbalance ("Using LSL to solve existing imbalance" section).
First, as a more general criticism, a common occurrence in the literature is poor documentation and standardization, especially with regard to LSL. For instance, the phrase "few-shot learning" itself is often confused in the literature, with many papers (not included in this survey) using the term to simply refer to data scarcity. We believe better standardization of LSL definitions and terminology would allow for more streamlined comparison and communication between similar works and approaches. A 2022 survey on LSL by Parnami and Lee [9] offers what we believe to be the most constructive taxonomy so far for LSL, including a sufficiently restrictive definition for the problem setting, and the "metric-based"/"optimization-based" terminology used throughout this paper.
A less important but similar point is the lack of documentation and standardization with regard to CI. Many papers neglect to provide detail on the severity of class imbalance present within their datasets, despite this information being both quite important to the difficulty of the problem, especially in binary or few-class classification, and relatively simple to quantify using ρ or other metrics. Additionally, many papers do not compare their proposed methods with traditional measures such as data sampling or cost-sensitive learning, despite the simplicity and effectiveness of these approaches in many scenarios [1].
We also note the general lack of cross-comparison and cross-reference between the papers and models covered in this review; however, this is largely due to the wide range of applications and settings throughout LSL literature, ranging from open-set object detection to software defect prediction, which makes it difficult or impossible for authors (or ourselves) to directly compare the performances of LSL models. However, we do take the opportunity to compare the reported performances of those few works which do evaluate their models on the same datasets.
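As an illustration of how little effort reporting ρ requires, the sketch below computes it from a list of labels. It assumes ρ is the ratio of the largest class size to the smallest class size, which is the common convention; the function name `imbalance_ratio` is hypothetical.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return rho = (size of largest class) / (size of smallest class)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())
```

For example, a binary dataset with 600 legitimate transactions for every fraudulent one yields ρ = 600, while a perfectly balanced dataset yields ρ = 1.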

Imbalance within LSL
Many papers were found which addressed CI and its variants within LSL settings, over a wide range of applications and techniques. We believe this category to be better explored than LSL-against-CI, likely because there are simply fewer unique challenges or variants within the LSL setting than in machine learning as a whole, leading to fewer possible combinations and areas of overlap to research.
Nonetheless, we do observe that most of the works covered do not deal with the generalized or transductive variants of LSL (explained in "Low-shot learning" section), or at least do not identify themselves as such. We found only three papers which explicitly work with the transductive setting [38][39][40], and only four which explicitly work under or experiment with the generalized setting [6,47,48,54] (though many papers may well deal with GLSL without using this term). While this is less important for TLSL, which is a less challenging and arguably less realistic setting than standard LSL, we believe that GLSL, being a more realistic setting for multi-class problems, should be further studied with respect to imbalance.

LSL against CI
Many papers were found which applied LSL techniques and ideas to CI learning, over a wide range of settings and approaches. We believe this category to be less explored than CI-within-LSL, as well as the one with more overall potential. Below we highlight some notable gaps.
There was very little overlap between LSL, CI, and big data settings; while the term "big data" has no agreed-upon quantitative definition in the literature, we found only a few papers which dealt with datasets of more than 100,000 instances. We believe this is an area with high research potential, not only to further the existing literature on class imbalance in big data [1], but also because we theorize that the episodic training methods common in LSL may prove effective at compartmentalizing the large swaths of available training data in these settings.
Also of note is the lack of works which utilize techniques or methods from ZSL against class imbalance. In this survey, we only found three relevant papers which utilized a one-class classification (OCC) approach [107,118,121], and only three papers which utilized semantic information [96,108,128] as is common in ZSL techniques; of these six papers, only two [108,128] cite a relation to ZSL specifically. While OCC for class imbalance (outside of the LSL context) is comparatively well-studied [141], we believe semantic LSL approaches to have high potential for solving CI in multi-class problems, and possibly even binary problems with sufficiently creative approaches.
Finally, we mention the complete lack of works found which utilize LLM-based methods for imbalanced tasks. This approach is admittedly less consistent than more traditional LSL techniques, due to the generative nature of LLMs, the requirement of using natural-language input prompts, and the incompatibility of some models with image or tabular data; regardless, we believe there is potential in experimenting with this approach for imbalanced text-based classification tasks such as word sense disambiguation [142].

Conclusion
This survey paper examined over 60 works in a wide variety of applications from the last 3 years (2020 to mid-2023) which combined the fields of low-shot learning and class-imbalanced learning, either by addressing imbalance within low-shot settings, or by using low-shot learning techniques and frameworks to combat class imbalance elsewhere. We attempted to comprehensively find and report all recent literature which falls into these categories, and we found that each area has been explored, with generally successful results and to varying degrees of completeness. In particular, we noted the lack of literature covering LLM-based approaches to imbalanced textual tasks, semantic-based approaches to class imbalance in any application, or LSL methods for imbalanced big data tasks; we believe all of these areas, especially the latter two, to hold great potential for future research.

Table 1
Works covered in "Solving imbalance within LSL" section

Table 2
Works covered in "Using LSL to solve existing imbalance" section

a This work is not published as of September 2023