Evaluation of maxout activations in deep learning across several big data domains

Castaneda, Gabriel; Morris, Paul; Khoshgoftaar, Taghi M.

doi:10.1186/s40537-019-0233-0

Research
Open access
Published: 03 August 2019

Evaluation of maxout activations in deep learning across several big data domains

Journal of Big Data volume 6, Article number: 72 (2019) Cite this article

10k Accesses
22 Citations
1 Altmetric
Metrics details

Abstract

This study investigates the effectiveness of multiple maxout activation function variants on 18 datasets using Convolutional Neural Networks. A network with maxout activation has a higher number of trainable parameters compared to networks with traditional activation functions. However, it is not clear if the activation function itself or the increase in the number of trainable parameters is responsible in yielding the best performance for different entity recognition tasks. This paper investigates if an increase in the number of convolutional filters on traditional activation functions performs equal-to or better-than maxout networks. Our experiments compare the Rectified Linear Unit, Leaky Rectified Linear Unit, Scaled Exponential Linear Unit, and Hyperbolic Tangent activations to four maxout function variants. We observe that maxout networks train relatively slower than networks with traditional activation functions, e.g. Rectified Linear Unit. In addition, we found that on average, across all datasets, the Rectified Linear Unit activation function performs better than any maxout activation when the number of convolutional filters is increased. Furthermore, adding more filters enhances the classification accuracy of the Rectified Linear Unit networks, without adversely affecting their advantage over maxout activations with respect to network-training speed.

Introduction

Deep networks have become very useful for many computer vision applications. Deep neural networks (DNNs) are models composed of multiple layers that transform input data to outputs while learning increasingly higher-level features. Deep learning relies on learning several levels of hierarchical representations for data. Due to their hierarchical structure, the parameters of a DNN can generally be tuned to approximate target functions more effectively than parameters in a shallow model [1]. Today, the typical number of network layers used in deep learning range from five to more than a thousand [2].

Activation functions are used in neural networks (NN) to transform the weighted sum of input and biases, of which is used to decide if a neuron can be fired or not [3]. Commonly used activation functions (nonlinearities) include sigmoid, Hyperbolic Tangent (tanh) and Rectified Linear Unit (ReLU) [4]. The use of ReLU was a breakthrough that enabled the fully supervised training of state-of-the-art DNNs [5]. Compared to traditional activation functions, like the logistic sigmoid units or tanh units, which are anti-symmetric, ReLU is one-sided. This property encourages the hidden units to be sparse, and thus more biologically plausible [6]. Because of its simplicity and effectiveness, ReLU became the default activation function used across the deep learning community [7]. A Convolutional Neural Network (CNN) using ReLU as its activation function classified 1.2 million images of the ImageNet dataset into 100 classes with an error rate of 37.5% [5]. The deep network implemented by Severyn and Moschitti [8] using ReLU as the activation function demonstrated state-of-the-art performance at both the phrase-level and message-level for Twitter sentiment analysis. At Semeval-2015 (International Workshop on Semantic Evaluation), Severyn and Moschitti’s models ranked first in the phrase-level subtask A and second in the message-level subtask B.

The ReLU function saturates when inputs are negative. These saturation regions cause gradient diffusion and block gradients from propagating to deeper layers [9]. Furthermore, ReLUs can die out during learning, consequently blocking error gradients and learning nothing [10]. For these reasons, different activation functions have been proposed for DNN training. There is a lack of consensus on how to select a good activation function for a DNN, and a specific function may not be suitable for all applications. Since an activation function is generally applied to the outputs of all neurons, its computational complexity will contribute heavily to the overall execution time [11]. Most research works on the activation functions are focused on the complexity of the nonlinearity that an activation function can provide [12], or how fast it can be executed [13], but often neglect the impact on different classification tasks.

The maxout nonlinearity [14] selects the maximum value within a group of different outputs (feature maps) and is usually combined with dropout [15], which is widely used to regularize deep networks to prevent overfitting. In NNs, the maxout activation takes the maximum value of the pre-activations. Figure 1 shows two pre-activations per maxout unit, each of these pre-activations has a different set of weights from the inputs denoted as “V”. Each hidden unit takes the maximum value over the j units of a group: $h_{i} = max_{j}\;Z_{ij}$ where Z is the lineal pre-activation value, i is the number of maxout units and j the number of pre-activation values. Maxout chooses the maximum of n input features to produce each output feature in a network, the simplest case of maxout is the Max-Feature-Map (MFM) [16], where n = 2. The MFM maxout computes the function $\text{max} (w_{1}^{T} x + b_{1} , w_{2}^{T} + b_{2} )$, and both the ReLU and leaky ReLU are a special case of this form. When specific weight values w₁, b₁, w₂ and b₂ of the MFM inputs are learned, MFM can emulate ReLU and other rectified linear variants. The maxout unit is helpful for tackling the problem of vanishing gradients because the gradient can flow through every maxout unit [17].

Figure 2 shows the maxout unit in a CNN architecture, where x is a 10 × 10 pixel image. The maxout unit takes the maximum value of the convolution operations y1 and y2. The CNN learns the weights and bias in the filters F1 and F2. Dropout randomly drops units or connections to prevent units from overfitting. It has been shown to improve classification accuracy in some computer vision tasks [18]. Park and Kwak [19] observed that dropout in CNNs regularizes the networks by adding noise to the output feature maps of each layer, yielding robustness to variations of images. In 2015, the Maxout network In Network (MIN) [17] method achieves in 2015 state-of-the-art or comparable performance on the Mixed National Institute of Standards and Technology (MNIST) [20], the Canadian Institute for Advanced Research (CIFAR-10), CIFAR-100 [21], and Street View House Numbers (SVHN) [22] datasets. Maxout layers were also applied in sentiment analysis [23], with a hybrid architecture consisting of a recurrent neural network stacked on top of a CNN. This approach outperforms a standard convolutional deep neural architecture as well as a recurrent network architecture and performs competitively compared to other methods on two datasets of annotated customer reviews.

CNNs were originally intended for computer vision tasks, being inspired by connections in the visual cortex; however, they have been successfully applied to several DNN acoustic models [24,25,26] and natural language processing tasks [27, 28]. CNNs are designed to process input features which show local spatial correlations. They can also handle the local translational variance of their input, which makes the network more tolerant to slight position shifts [29].

The benefit of frequency domain convolution with acoustic models is that CNNs’ tolerance to input translations is useful for modeling acoustic data because acoustic models which use convolutions on the frequency domain are robust to speaker and speaking style variations. This is confirmed by studies that experimented with frequency domain convolution. CNNs consistently outperform fully-connected DNNs on the same task [30,31,32]. When CNNs are used for sentiment analysis, the first layer of the network converts words in sentences to word vectors by table lookup. The word vectors are either trained as part of CNN training, or fixed to those learned by some other method (e.g., word2vec [33]) from an additional large corpus [34]. When working with sequences of words, convolutions allow the extraction of local features around each word.

Most of the comparisons between maxout and other activation functions only report a single performance metric, ignore network size, and only report accuracy on a single dataset, with no training time or memory use analysis. Furthermore, when compared with other activation functions, it is unclear whether marginal performance gains with maxout are due to the activation function or an increase in the number of required trainable parameters. In this work, we evaluated multiple activation functions applied to multiple domains:

Visual pattern recognition
Facial verification
Facial recognition
Sentiment analysis
Medical fraud detection
Sound recognition
Speech commands recognition.

To the best of our knowledge, this is the first study to evaluate multiple maxout variants and standard activations for multiple domains with significance testing. The main contributions herein can be summarized as follows:

Evaluate four maxout functions and compare them to popular activation functions like tanh, ReLU, LReLU and SeLU.
Compare training times for various activation functions.
Evaluate whether marginal performance gains with maxout are due to the activation function or simply an increase in the number of trainable parameters versus ReLU networks.
Determine whether maxout methods converge faster and if there is a significant accuracy performance difference between these methods and the standard activations.

The remainder of this paper is organized as follows. The “Related work” section presents related work on activation function evaluation on multiple classification domains. The “Materials and methods” section introduces the activation functions, datasets, and the experimental methodology employed in our experiments. Results and analysis are provided in “Experimental results and discussion” section. Conclusions with some directions for future work are provided in the “Conclusion” section.

Related work

Maxout is employed as part of deep learning architectures and has been tested against the MNIST, CIFAR-10 and CIFAR-100 benchmark datasets, but it has not been compared against other activation functions using the same deep network architecture and hyperparameters. It is not clear if maxout enhances the overall accuracy on the tested datasets, or if any other activation function has the same effect. There are a few comparisons between maxout and traditional activation functions. Most of the comparisons do not report the details of their network to indicate whether an increased number of filters was accounted for in the experiment.

Most prior work focuses on proposing new activation functions, but few studies have compared different activation functions. Xu et al. [35] investigated the performance of ReLU, leaky ReLU [36], parametric ReLU [37], and the proposed randomized leaky ReLU (RReLU) on three small datasets. In RReLU, the slopes of negative parts are randomized in a given range in the training, and then fixed in the testing. The original ReLU was outperformed by three types of modified leaky ReLU. Mishkin et al. [38] evaluated the influence of activation functions (including ReLU, maxout, and tanh), pooling variants, network width, classifier design, image pre-processing, and learning parameters on the ImageNet dataset. The experiments confirmed the Swietojanski et al. [39] hypothesis about maxout’s power in the final layers, as it showed the best performance among non-linear activation functions with speed close to ReLU. The bounded ReLU (brelu), bounded leaky ReLU (blrelu), and bounded bi-firing (bbifire) were presented in [11], and evaluated on classification of basic handwritten digits in MNIST database, complex handwritten digits from the mnist-rot-bg-img database, and facial recognition using the AR Purdue database. Experimental results for all three datasets demonstrate the superiority of all the proposed activation functions, with significant accuracy improvements up to 17.31%, 9.19%, and 74.99% on MNIST, mnist-rot-bg-img, and AR Purdue databases respectively. In [7], automated search techniques were used to discover novel activation functions. The activation function that tends to work better than ReLU on deeper models across three datasets was $h\left( x \right) = x \cdot sigmoid\left( {\beta x} \right)$ named Swish, where $\beta$ is either a constant or a trainable parameter. Only scalar activation functions were used in this study, this limitation would not allow the authors to find or evaluate the maxout activation.

Chang and Chen [17] presented the MIN network. It recorded 0.24%, 6.75% and 28.86% error rates on the MNIST, CIFAR-10, and CIFAR-100, respectively. These error rates are the lowest compared to Network in Network (NIN) [40] and other NIN based networks such as Maxout Network [14] or the Maxout Network in Maxout Network (MIM) [41]. Oyedotun et al. [42] proposed a deep network with maxout units and elastic net regularization. On the MNIST dataset, it reached an error rate of 0.36% and 2.19% on the USPS dataset, surpassing the human performance error rate of 2.5% and all previously reported results. In [43], the Rectified Hyperbolic Secant (ReSech) activation function was proposed and evaluated on MNIST, CIFAR-10, CIFAR-100, and the Pang and Lee’s movie review datasets. The results suggest that ReSech units are expected to produce similar or better results compared to ReLU units for various sentiment prediction tasks. The maxout network accuracy was only compared with the CIFAR-10 and MNIST datasets. Goodfellow et al. [44] investigated the catastrophic forgetting problem, testing four activation functions, including maxout, on MNIST and Amazon using two hidden layers followed by a softmax classification layer. The catastrophic forgetting problem is when a machine learning model is trained on one task, and when trained on a second task it forgets how to perform the first task. Their experiments showed that training with dropout is beneficial, at least on the relatively small datasets used in the paper. Also, the choice of activation function should always be cross-validated, if computationally feasible. Maxout in combination with dropout showed the lowest test errors on all experiments.

The maxout activation is effective in speech recognition tasks [45], but it has not been widely tested on sentiment analysis. Jebbara and Cimiano [23] used the maxout activation in their CNN portion of a hybrid architecture consisting of a recurrent NN stacked on top of a CNN. A maxout layer was also implemented in the Siamese bidirectional Long Short-Term Memory (LSTM) network proposed by Baziotis et al. [46]. The maxout layer was selected as it amplifies the effects of dropout. The output of the maxout layer is connected to a softmax layer which outputs a probability distribution over all classes.

Phoneme recognition tests on the Texas Instruments Massachusetts Institute of Technology (TIMIT) database show that switching to maxout units from rectifier units decreases the error rate for each network configuration studied and yields relative error rate reductions of between 2 and 6% [24]. Zhang et al. [45] introduced two new types of generalized maxout units the p-norm and soft-maxout. In experiments on the Large Vocabulary Continuous Speech Recognition (LVCSR) tasks in various languages, the p-norm units perform consistently better than various versions of maxout, tanh and ReLU. In addition, Swietojanski et al. [39] investigated maxout networks for speech recognition. Through the experiments on voice search and short message dictation datasets, it was found that maxout networks converged around three times faster to train and offer lower or comparable word error rates on several tasks when compared to the networks with logistic nonlinearity. Zhang et al. [47] presented a CNN-based end-to-end speech recognition framework. The maxout unit recorded the lowest error rate compared to ReLU and parametric ReLU.

Using the Public Use File (PUF) data from CMS, Branting et al. [48] proposed graph analysis as a framework for healthcare fraud risk assessment. Their algorithm was evaluated on the Part B (2012–2014), Part D (2013) and List of Excluded Individuals/Entities (LEIE) datasets. Using tenfold cross-validation on the full 12,000-member and 11-feature dataset, the mean f-measure was 0.919 and the mean Receiver Operating Characteristic (ROC) area was 0.960. Sadiq et al. [49] use the 2014 CMS Part B, Part D, and DMEPOS datasets (using only the provider claims from Florida) in order to find anomalies that possibly point to fraudulent or anomalous behavior. A novel framework based on Patient Rule Induction Method (PRIM) was presented, where abnormal behaviors of the physicians are detected. The experimental results show that their framework can effectively shrink the target dataset and deduce a potential suspect subset of physicians who submit several anomalous claims and probably qualify as fraudsters. The attribute sub-space and their correlations are used in PRIM to characterize the low conditional probability region. The attribute space was characterized by PRIM, which provides a deeper understanding of how certain attributes are the key predictors in identifying fraud. Herland et al. [50] focused on the detection of Medicare fraud using the CMS Part B, Part D, and DMEPOS datasets. A fourth dataset was created by combining the three primary datasets. Based on the area under the ROC curve performance metric, their results show that the combined dataset with the Logistic Regression (LR) learner yielded the best overall score at 0.816, closely followed by the Part B dataset with LR at 0.805.

Our study evaluates 11 activation functions using deep CNN and NN architectures. As opposed to the papers cited in this section, we evaluate if an increase in the number of filters in ReLU enhances the overall accuracy with significance testing. Furthermore, we compare the training and convergence time for all the evaluated activation functions.

Materials and methods

In this section, we introduce the activation functions, datasets, and the empirical methodology employed in this study. In “Activation functions” section, we introduce each evaluated activation function. In “Datasets” section, we describe the datasets employed in our experiments. In the last “Empirical methodology” section, we present our empirical methodology.

Activation functions

Hyperbolic tangent

A hyperbolic tangent (tanh) function is a ratio between hyperbolic sine and cosine functions of x (Fig. 3):

$$h\left( x \right) = \tanh = \frac{\sinh\left( x \right)}{\cosh \left( x \right)} = \frac{{e^{x} - e^{ - x} }}{{e^{x} + e^{ - x} }} = \frac{{1 - e^{ - 2x} }}{{1 + e^{ - 2x} }}$$

(1)

Rectified units

Rectified linear unit (ReLU) [4] is defined as:

$$h\left( x \right) = { \text{max} }\left( {0,x} \right)$$

(2)

where $x$ is the input and $h\left( x \right)$ is the output. The ReLU activation is the identity for positive arguments and zero otherwise (Fig. 4).

Leaky ReLU (LReLU) [36] assigns a slope to its negative input. It is defined as:

$$h\left( x \right) = \text{min} \left( {0,ax} \right) + { \text{max} }\left( {0,x} \right)$$

(3)

where $a$ ∈ (0, 1) is a predefined slope (Fig. 5).

The scaled exponential linear unit (SeLU) [51] is given by:

$$h\left( x \right) = \lambda \left\{ {\begin{array}{l} {x \;\;\;\;\;\;\;\;\;\;\;\;\;if \;x > 0} \\ {\alpha e^{x} - \alpha\;\;\;if \;x \le 0} \\ \end{array} } \right\}$$

(4)

where x is used to indicate the input to the activation function. Klambauer et al. [51] justify why $\alpha$ and $\lambda$ must have the following values:

$$\begin{array}{*{20}c} { \propto = 1.6732632423543772848170429916717} \\ {\lambda = 1.0507009873554804934193349852946} \\ \end{array}$$

(5)

to ensure that the neuron activations converge automatically toward an average of 0 and a variance of 1 (Fig. 6).

Maxout units

The maxout unit takes as the input the output of multiple linear functions and returns the largest:

$$\begin{aligned} h\left( {x_{i} } \right) & = {\text{max }}w^{k} \cdot x_{i} + b^{k} \\ & \quad k \in \left\{ {1, \ldots ,K} \right\} \end{aligned}$$

(6)

In theory, maxout can approximate any convex function [14], but a large number of extra parameters introduced by the $k$ linear functions of each hidden maxout unit result in large RAM storage memory cost and considerable increase in training time, which affect the training efficiency of very deep CNNs. For our comparisons, we use four variants of the maxout activation: an activation with $k$ = 2 input neurons for every output (maxout 2-1), an activation with $k$ = 3 input neurons for every output (maxout 3-1), an activation with $k$ = 6 input neurons for every output (maxout 6-1), and a variant of maxout with $k$ = 3 where the two maximum neurons are selected (maxout 3-2). These maxout variants have proven to be effective in classification tasks such as image classification [44], facial recognition [16], and speech recognition [18]. The maxout unit in Fig. 7 mimics a quadratic activation function. The blue quadratic function is not created by the maxout unit, it is only pictured to show what the maxout activation function can approximate when using five linear nodes.

p-norm

The p-norm is the nonlinearity:

$$h = \left\| {x} \right\|_{p} = \left( {\mathop \sum \limits_{i} \left| {x_{i} } \right|^{p} } \right)^{1/p}$$

(7)

where the vector x represents a small group of inputs [45]. If all the $x_{i}$ were known to be positive, the original maxout would be equivalent to the p-norm with $p = \infty$ (Fig. 8).

Logistic sigmoid

The logistic sigmoid is defined as:

$$h\left( x \right) = \frac{1}{{1 + e^{ - x} }}$$

(8)

where x is the input. With a range between 0 and 1, the sigmoid function can be used to predict posterior probabilities [52] (Fig. 9).

Datasets

MNIST

The Mixed National Institute of Standards and Technology (MNIST) dataset [20] consists of 8-bit grayscale handwritten digit images, 28 × 28 pixels in size, organized into 10 classes (0 to 9) with 60,000 training and 10,000 test samples.

Fashion-MNIST

The Fashion-MNIST [53] dataset consists of 60,000 examples where each sample is a 28 × 28 grayscale image, associated with a label from 10 fashion item classes: T-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot.

CIFAR-10 and 100

The Canadian Institute for Advanced Research (CIFAR)-10 dataset [21] consists of natural color images, 32 × 32 pixels in size, from 10 classes with 50,000 training and 10,000 test images. The CIFAR-100 dataset is the same size and format as the CIFAR-10; however, it contains 100 classes. Thus, the number of images in each class is only one tenth of that of CIFAR-10.

LFW

The Labeled Faces in the Wild (LFW) dataset contains more than 13,000 images of faces collected from the web by Huang et al. [54] Each face was labeled with the name of the person pictured, with 1680 of the people pictured having two or more distinct photos in the dataset. Images are 250 × 250 pixels in size. The only constraint on these faces is that they were detected by the Viola-Jones face detector [55]. The database was designed for studying the problem of unconstrained facial recognition.

MS-Celeb-1M

The MS-Celeb-1M dataset released by Microsoft [56] contains 10 million celebrity face images for the top 100K celebrities obtained from public search engines, which can be used to train and evaluate both face identification and verification algorithms. There are approximately 100 images for each celebrity, resulting in about 10 million web images. The image resolution is up to 300 × 300 pixels. The authors present a distribution of the million celebrities in different aspects including profession, nationality, age, and gender. The MS-Celeb-1 M is a larger dataset compared to the other test datasets and requires hyperparameter tuning. To avoid the unfair comparison issues associated with changing hyperparameters, we decided to use a manageable subset of 1000 identities which does not require fine-tuning to train. The identities were selected from a cleaned subset of MS-Celeb-1M used for the low-shot learning challenge. These identities were the top 1000 in a list ordered by the number of images.

Amazon product

The original Amazon product review dataset was collected by McAuley et al. [57]. It contains product reviews and scores from 24 product categories sold on Amazon.com, including 142.8 million reviews spanning from May 1996 to July 2014. Review scores lie on an integer scale from 1 to 5. The sentiment dataset constructed from the Amazon product review data in [58] was reused, where 2,000,000 reviews had a score greater than or equal to 4 stars and 2,000,000 reviews had a score less than or equal to 2 stars. The first group is labeled as positive sentiment while the second group is labeled as negative sentiment, creating a positive/negative sentiment dataset. A second subset here called “Amazon1M” was used with one million Amazon product reviews constructed in [59]. The labels were automatically generated from the star rating of each review by assigning a rating below 2.5 as negative and a rating above 2.5 as positive.

Sentiment140

Sentiment140 [60] contains 1.6 million positive and negative tweets, collected and annotated by querying positive and negative emotions, with a tweet considered positive if it contains a positive emoticon like “:)” and negative if, it contains a negative emoticon like “:(”.

Yelp

We use the sentiment datasets collected in [59]. It contains 429,061 Yelp reviews from 12 cities in the United States (Yelp500K). This is an imbalanced dataset with 371,292 positive and 57,769 negative instances. Another 500K reviews were scraped to create a second dataset with a million reviews (Yelp1M).

Medicare Part B

The CMS [61] released the Part B dataset [62] and describes Medicare provider claims information for the entire US and its commonwealths, where each instance in the data shows the claims for a provider and procedure performed for a given year. Physicians are identified using their unique National Provider Number (NPI) [63], while procedures are labeled by their Healthcare Common Procedure Coding System (HCPCS) code [64]. Other claims information includes average payments and charges, the number of procedures performed and medical specialty (also known as provider type).

Medicare Part D

The Part D PUF [65] provides information on prescription drugs prescribed by individual physicians and other health care providers and paid for under the Medical Part D Prescription Drug Program. Each physician is denoted by his or her NPI and each drug is labeled by its brand and generic name. Other information includes average payments and charges, variables describing the drug quantity prescribed and medical specialty.

DMEPOS

The Durable Medical Equipment, Prosthetics, Orthotics and Supplies (DMEPOS) PUF [66] presents information on DMEPOS products and services provided to Medicare beneficiaries ordered by physicians and other healthcare professionals. Physicians are identified using their unique NPI within the data while products are labeled by their HCPCS code. Other claims information includes average payments and charges, the number of services/products rented or sold and medical specialty (also known as provider type).

Combined CMS dataset

A combined dataset was created in [50] after processing Part B, Part D, and the DMEPOS datasets, containing all the attributes from each, along with the fraud labels derived from the List of Excluded Individuals and Entities (LEIE). The combining process involves a join operation on NPI, provider type, and year. Due to there not being a gender variable present in the Part D data, the authors did not include this variable in the join operation condition and used the gender labels from Part B while removing the gender labels gathered from the DMEPOS dataset after joining. In combining these datasets, it is limited to those physicians who have participated in all three parts of Medicare.

For each dataset (Part B, Part D and DMEPOS), the information was combined for all available calendar years. For Part B and DMEPOS, all attributes not present in each available year were removed. The Part D dataset had the same attributes in all available years. For Part B, the standard deviation variables were removed from 2012 to 2013 and standardized payment variables were removed from 2014 to 2015 as they were not available in the other years. For DMEPOS, the standard deviation variable was removed from 2014 to 2015 as it was not available in 2013. For all three datasets, all instances that either were missing both NPI and HCPCS/drug name values or had an invalid NPI were removed. For Part B, all instances with HCPCS codes referring to prescriptions were filtered out. The prescription-related codes are not actual medical procedures, but instead are for specific services listed on the Medicare Part B Drug Average Sales Price file. For the Part B dataset, eight features were kept while the other 22 were removed. For the Part D dataset, seven features were kept and the other 14 were removed. For the DMEPOS dataset nine features were kept and the other 19 were removed. The excluded attributes provide no specific information on the claims, drugs administered, or referrals, but rather encompass provider-related information, such as location and name, as well as redundant variables like text descriptions which can be represented by using the variables containing the procedure or drug codes. For Part D, variables that provided count and payment information for patients 65 or older were not included, as this information is encompassed in the retained variables. The combined dataset contains all the retained features from all three datasets. The purpose of this new dataset is to provide a more encompassing view into a physician’s behavior over various branches of Medicare, over individual Medicare parts.

Google speech commands dataset

The Google speech commands (GSC) dataset v0.02 [67] consists of 105,829 one-second long audio files of 35 keywords, by 2618 speakers, with each file consisting of only one keyword encoded as linear 16-bit single-channel PCM values at a 16 kHz rate. A spectrogram using a fast Fourier transform (FFT) is computed for each wave file in the dataset. Frequencies are summed into 129 bins, and each 1-second sample is divided into 71-time bins. The image for each instance is 129 × 71. The number in each “pixel” represents the power spectral density in dB, and each image is scaled between 0 and 1, relative to the maximum and minimum dB in that image. Samples are not scaled to the maximum and minimum of the whole dataset because recordings were crowdsourced, so the volume for different recordings is not consistent.

IRMAS

The IRMAS dataset [68] is intended to be used for training and testing methods for the automatic recognition of predominant instruments in musical audio. The instruments considered are cello, clarinet, flute, acoustic guitar, electric guitar, organ, piano, saxophone, trumpet, violin, and human singing voice. It includes music from various decades from the past century, hence the difference in audio quality. The training data consists of 6705 audio files with excerpts of 3 s from more than 2000 distinct recordings. The testing data consists of 2874 excerpts with lengths between 5 and 20 s. No tracks from the training data were included. Unlike the training data, the testing data contains one or more predominant target instruments. All audio files are in a 16-bit stereo WAV format sampled at 44.1 kHz. We truncate the recordings in the dataset to 1 s in length and process them as spectrograms, like the preprocessing of the Google speech commands dataset.

IDMT-SMT-audio-effects

The IDMT-SMT-audio-effects [69] is a dataset for electric guitar and bass audio effects detection. The dataset consists of 55,044 WAV files with a single recorded note of which 20,592 are monophonic bass notes, 20,592 are monophonic guitar notes, and 13,860 are polyphonic guitar sounds. There are 11 different audio effects: feedback delay, slap back delay, reverb, chorus, flanging, phaser, tremolo, vibrato, distortion, overdrive, and no effect (unprocessed notes/sounds). For detailed descriptions of these audio effects please refer to [70].

Empirical methodology

We adopt the general convolutional network architecture demonstrated in recent years to advance the state-of-the-art in supervised classification [21]. We evaluate classification performance on a variety of CNN architectures. In these architectures, a series of convolutional layers for feature extraction are followed by fully-connected layers for classification. Max-pooling is used between convolutional layers to reduce the dimensionality of the network input, and dropout is used before fully-connected layers to prevent overfitting.

A suitable network architecture is selected for each dataset according to the input size and number of instances in the data, as specified in Table 1. Architecture (A) was applied to Medicare Part B, D and combined datasets, architecture (B) to the MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets, and architecture (C) to the LFW dataset. The architecture (D) was utilized to the Sentiment140, Yelp, and Amazon datasets, architecture (E) to the Google Speech Commands, IRMAS, and IDMT-SMT-Audio-Effects datasets, and architecture (F) to the MS-Celeb dataset. The depth of the configurations increases from left (A) to right (F), as more layers are added. In general, fewer convolutional layers are used for datasets with smaller number of samples, while deeper architectures are used for larger datasets. Unless otherwise specified, max-pooling is performed with a filter size and stride of 2, and convolutional layer inputs are padded to produce same-size outputs.

Table 1 CNN and NN Configurations

Evaluation of maxout activations in deep learning across several big data domains

Abstract

Introduction

Related work

Materials and methods

Activation functions

Hyperbolic tangent

Rectified units

Maxout units

p-norm

Logistic sigmoid

Datasets

MNIST

Fashion-MNIST

CIFAR-10 and 100

LFW

MS-Celeb-1M

Amazon product

Sentiment140

Yelp

Medicare Part B

Medicare Part D

DMEPOS

Combined CMS dataset

Google speech commands dataset

IRMAS

IDMT-SMT-audio-effects

Empirical methodology

Experimental results and discussion

Conclusion

Availability of data and materials

Abbreviations

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords