Unsupervised feature learning-based encoder and adversarial networks

In this paper, we propose a novel deep learning-based feature learning architecture for object classification. Conventionally, deep learning methods are trained with supervised learning for object classification. But, this would require large amount of training data. Currently there are increasing trends to employ unsupervised learning for deep learning. By doing so, dependency on the availability of large training data could be reduced. One implementation of unsupervised deep learning is for feature learning where the network is designed to “learn” features automatically from data to obtain good representation that then could be used for classification. Autoencoder and generative adversarial networks (GAN) are examples of unsupervised deep learning methods. For GAN however, the trajectories of feature learning may go to unpredicted directions due to random initialization, making it unsuitable for feature learning. To overcome this, a hybrid of encoder and deep convolutional generative adversarial network (DCGAN) architectures, a variant of GAN, are proposed. Encoder is put on top of the Generator networks of GAN to avoid random initialisation. We called our method as EGAN. The output of EGAN is used as features for two deep convolutional neural networks (DCNNs): AlexNet and DenseNet. We evaluate the proposed methods on three types of dataset and the results indicate that better performances are achieved by our proposed method compared to using autoencoder and GAN.

model complex relations between data with their target classes which are not easily be done by traditional shallow machine learning methods.
Many implementations of CNNs for object recognition employ them in supervised learning. The networks' parameters are optimized to map the data to the corresponding labels. Due to the great amount of the CNNs' parameters, a great amount of data training is needed. Otherwise, the networks are prone to over-fit, making them sensitive when they are tested with data from different distributions as in the training. However, collecting large amount of labelled data is expensive and requires huge efforts.
Training CNNs in unsupervised learning is one of the solutions for the above mentioned problem. Collecting unlabelled data is much easier and less expensive. Feature learning is one implementation of unsupervised learning for deep learning [38]. The implementation of deep learning as feature learning is illustrated in Fig. 1. In feature learning, the deep learning architectures are trained such that they to "learn" features automatically from unlabelled the data. After finishing the training, the resulting model is used to extract "features" from labelled data. These features are then be used as inputs to classifiers for object classifications. Several implementations of feature learning could refer to [25,36,37,[39][40][41][42][43][44][45].
Autoencoders and generative adversarial networks (GANs) are examples of deep learning methods that are applied for feature learning. An autoencoder is a network that comprises of two sub-networks: encoder and decoder. An encoder is designed so that the number of output nodes is greatly larger than the number of the input nodes. The aim is to compress the data into their latent variables. Meanwhile, a decoder is the mirror network of the encoder. The aim is to decompresses data back to the original dimensions. In [36], autoencoders are applied for the detection of plant diseases. In [42], they are applied for detection of epilepsy using EEG signals. Bottleneck features are introduced using autoencoders in [25] to reduce the data dimensionality. But, autoencoders have major limitations. Autoencoders are often merely "memorizing" the data during training, making them prone to over-fit. Several variants of autoencoders have been proposed to improve them. They are denoising autoencoders [43,44], variational autoencoder (VAE) [45], and their combinations [37].
Meanwhile, GANs [46] are originally designed to generate new data from blindly presumed distributions of real data. They comprise two networks: generator (G) and discriminator (D). G aims to generate fake data that can fool D without ever really sees the distributions of real data, while D aims to discriminate fake images that are generated by G and the real ones. After the training is finished, G models are then used to generate additional data. Several implementations of GANs have been proposed in literature, for instance for generating more training data [47,48], for image translation [49,50], for creating super-resolution images [51], and for predicting the succeeding frames in a video [52]. There have been many variants of GAN that have been proposed. One of them is deep convolutional GAN (DCGAN) [39]. In DCGAN, CNNs are used to replace multi-layer perceptrons in the original GAN.
In recent years, GANs and their variats are also used for features learning [53]. DCGAN and deep neural networks (DNN) are combined for feature learning in [40]. BiGAN is introduced in [41], where GAN is designed to learn the latent representation of the data. For various implementations of GANs for feature learning could refer to [54,55].
Unfortunately, the use of random noise as input for G network makes GAN's learning projection to be unpredictable. To overcome this, we propose to use hybrid networks of encoders and GANs. We adopt the DCGAN architecture and place the encoder at the top of the G network. We call it Encoder-GAN (EGAN). By doing so, the encoder could learn the latent variables of the data and then passes them to the G network of GAN. So, instead of using random distributions as inputs, G uses latent variables of real data. Hence, the learning projection of G could be more predicted.
The use of an encoder with GAN has been proposed previously. They are RDCGAN [40] and BiGAN [41]. However, there are differences between them. Their differences are illustrated in Fig. 2. In the RDCGAN, G accepts two types of inputs. They are that of the encoder and random distributions. Meanwhile, for BiGAN, the inputs of G are still from random distributions. Encoder is used in parallel with the G networks where both outputs are used as inputs for the D network. Meanwhile, G in EGAN only accepts outputs of the encoder as inputs and the D network only accept output of G as inputs. By doing so, G is expected to have more knowledge about the data. In addition to the new architecture, we also introduce nested training schemes for training the EGAN and DCNN networks. Usually, when GAN is used for feature learning, it is trained separately with the classifier. The GAN networks are trained first, and then after the training is finished, it is used to extract features of labelled training data to train the classifiers. Here, we train EGAN and the classifier together. Hence, the results of classification are also used as feedbacks for EGAN.
This work is an extension of our work in [24]. In that paper, EGAN is applied to multi-layer perceptron classifiers, while in this paper, we extend it as inputs to deep CNN architectures: Alexnet and DenseNet. Furthermore, we evaluate the generalization capability of EGAN to wider tasks. In this paper, we evaluate EGAN with two publicly available datasets: PlantVillage, a dataset for plant diseases detection, and MNIST dataset, a popular dataset for benchmarking methods for object classification, The remainder of the paper is organized as follow. In "A review of deep convolutional neural network (DCNN)" section, we briefly explain about CNN. We also briefly explain about GAN and autoencoder. We explain our proposed method in "The proposed method" section. The experimental setup is described in "Experimental setup" section and the results are explained in "Results and discussions" section. The paper is concluded in "Conclusions" section.

A review of deep convolutional neural network (DCNN)
In this section we first briefly explain about the use of DCNN for supervised learning. A typical composition of DCNN is explained. Then, we describe the use of DCNN for unsupervised learning. We explain both autoencoder and GAN briefly. Figure 3 shows a typical structure of DCNN. DCNN has two main components. They are the feature extraction and the classification sections. The feature extractor usually consists of stacks of convolutional layers and the pooling layers. Weighted sum operations between the inputs and the kernels are conducted in the convolutional layers while the pooling layers "summarize" the output of convolutional layers, typically by taking the maximum or the average values and reduce the dimensionality of the inputs. The classification section usually consists of fully-connected layers that maps the outputs of feature extractors to the target labels.
DCNN could also be trained in an unsupervised way. The DCNN architecture is designed such that the network learns automatically the "important" underlying pattern of the data. Therefore, we can train DCNN to learn the features using unlabelled data. This is called feature learning. Then, after training the DCNN models in unsupervised way, they are used to extract features from a small amount of labelled data which are used to train classifiers in a supervised way. Two common architectures for feature learning are autoencoder and GAN.
The architecture of an autoencoder is shown Fig. 4. A typical autoencoder consists of two sub-networks: encoder and decoder. In the encoder, the dimensions of the input data are forcedly reduced to obtain compressed features. Then, the decoder reconstructs the data from compressed features to get the original data. The compressed features could be seen as a latent representation of data. By forcing the number of the outputs to be hugely reduced from the inputs, the encoder learns underlying latent information about the data that are relevant and discard the unuseful information. In other words, the encoder could be seen as a nonlinear principal components analysis of the data.
GAN is another deep learning method that employ unsupervised learning. A structure of GAN is shown in Fig. 5. GAN consists of two networks. They are Generator (G) and Discriminator (D) networks. GANs work by contending both G and D. Generator learns to generate images through the presumptive distributions that it has never seen before. Meanwhile, D assesses the generated images, whether they are real or fake images. The D's assessment results would be the reference for G to optimize the presumptive distributions and generate images closer to real ones to fool D. This adversarial training continues until D cannot distinguish between real and fake images. GANs cannot remember input data like the autoencoder since the real data are never seen. Therefore, GANs can avoid the overfitting problem on the networks. However, GANs are sensitive to initialization since the presumptive distributions for G are usually random, making the learning trajectories to be unpredicted.  The proposed method The grand scheme of using deep learning for feature learning have been explained in Fig. 1.
In our implementation, we design an architecture, called EGAN as feature learning where it is trained with unlabelled data to obtain the feature learning model. This model is used to extract features from the labelled that, where the features are used to train Deep CNN (DCNN). Figure 6 show the architecture of EGAN. EGAN is a hybrid between encoder and DCGAN. We put the encoder on the top of the Generator of the DCGAN network. Traditionally, the G networks of GAN accept random distribution (usually Gaussian) as inputs. But, this would make the learning trajectory of GAN to be unpredictable. So, the purpose of using encoder is to avoid that.
Let p x is the target distribution of real data x . Let p z is noise distribution that G would map to data z . G aim to learn distribution p g (x) = E z∼p z [p g (x)|z) so that p g (x) = p x . To do this, GAN plays minimax game of function V between G and D as follows: However, since G never sees actual data, the learning trajectories of p g could be very far from p x . It makes the Generator to be very sensitive to initialization. To avoid this, we employ encoder E, so now E introduces distribution p E (z|x) = p z (z) that map the real data x to latent variables z . Therefore, (1) becomes: By doing so, G does not learn from random initialisation. By putting encoder on top of G networks, we expect the learning process of G to be more predictable. Instead of blindly learn the distribution of the real data, the G network of GAN has prior knowledge of the data.
Here, we use DCGAN [39] as the backbone for GAN. It is a variant of GAN that employ convolutional layers instead of multilayer perceptrons as in GAN. DCGAN is found to be more stable in the training process than GAN. Meanwhile, the encoder is built by using one convolutional layer and one BatchNormalization process. We use encoder with convolutional layer with stride 2 and reduce the RGB input dimensions to a half of the original image's dimensions. This is the optimum setting for the encoder based on our observation. We only use a single convolutional layer and BatchNormalization for the encoder since we found that having more convolutional layers are not effective for the tasks and the performance tends to get worse. We do not use the decoder in our design as we aim to find the underlying latent variables of the data as input to GAN. Therefore, we need an architecture that compress the data into much smaller dimensions. Hence, we employ only encoder for this as adding decoder would reconstruct the data to the original dimensions.
For supervised DCNN, we use two architectures: AlexNet [13] and DenseNet [56].    comprises of stacked five convolutional layers that are followed by three fully connected layers. Meanwhile, DenseNet is much deeper than AlexNet. To avoid gradient vanishing problems that occur in very deep CNN, DenseNet apply multi-skip connections that carry information, pass it over for several layers, and then concatenate them, adding more width to the networks. Due to its intense concatenation operation to all previous layers, DenseNet requires larger memory usage.
To train EGAN and DCNN, we modify gradient based optimization as in [46]. We apply nested training for training EGAN and DCNN classifiers. In this paper, we use epoch L = 200 to train EGAN, and for each epoch, we train DCNN using sub-epoch K = 5 . The detailed algorithm for training is described as follows:

The dataset
We use three datasets for evaluating our method. They are the Tea Clone, PlantVillage, and MNIST datasets. We develop a tea clone dataset for tea clones recognition tasks [24]. PlantVillage and MNIST are public datasets. PlantVillage contains images of leafs with various diseases, whereas MNIST is used for handwriting character recognition.
For the tea clones dataset, we collected 4520 tea leaf images to build the dataset from the Research Institute for Tea and Cinchona (RITC) in Gambung, West Java, Indonesia. There are 11 types of Gambung tea clones. They are called the GMB (Gambung) Clone series. This research focuses only on two tea clones, and they are GMB 3 and GMB 9. We used 14 different types of cameras, DSLR cameras, and smartphones. We intend to enrich the data variation in the way of taking pictures through the various cameras. We capture images indoor and ignore the lighting arrangement and the distance of the leaf to the camera. We use the autofocus feature when capturing the image. The image samples of GMB 3 and GMB 9 are shown in Fig. 9.
Penn State University develops the PlantVillage (PV) dataset. This dataset is created based on the premise that food growth knowledge should be accessible openly to everyone. We choose three plants (corn, potato, and apple) of this dataset. Some of the samples are shown in Fig. 10.
MNIST (Modified Nasional Institute of Standards and Technology) is a dataset of handwritten digits 0 to 9. It is released in 1999 and become the benchmarking standard for classifications. Fig. 11 show the sample of the MNIST dataset. There are 70,000 images in the MNIST dataset. It is consists of 60,000 training images and 10,000 testing images. The training and testing images contain greyscale images.

System setup
Each dataset is divided into three parts: training, validation, and testing sets. The detailed distribution of data for all datasets is summarized in Table 1. We train EGAN using three datasets separately. For tea clones and PlantVillage datasets, each dataset is divided into three parts; 80% of training, 10% of validation, and 10% of testing. Meanwhile, we use 23,000 images in MNIST consisting of 20,000 training images and Fig. 9 The samples of GMB 3 and GMB 9 Fig. 10 The samples of PlantVillage dataset 3000 testing images. For validation, we separate 20% of training data. Here, we use the same data to train EGAN and DCNN classifiers: AlexNet, and DenseNet.
The capability of EGAN is evaluated as follows. First, EGAN's evaluation as the feature learning model using three datasets (the tea clones, PlantVillage, and MNIST) and two popular classifiers (AlexNet and DenseNet). This is evaluated using the accuracies of the classifier. We only use accuracy since the data is fairly balanced. Here, we compare EGAN with other feature learning methods: GAN, BiGAN, and autoencoder. For the referenced methods, we use GAN architecture as in [46], BiGAN as in [41] and whereas for autoencoder, we use the architecture DCNN-based autoencoder as in [36]. We denote this method CNN-AE from now on. For GAN, we apply the same training strategies as EGAN, while for CNN-AE, we train CNN-AE with the same number of the epoch as EGAN.  To find the best parameters for EGAN, we tune the hyper-parameters of the networks as follow:

Results and discussions
EGAN as feature learning model Table 2 summarises the performance of EGAN as feature extractor and DCNN classifiers when batch-sizes are varied for all evaluated datasets. For all datasets, satisfactory performance could be achieved, and in most cases, 16 may have been suitable for batchsize. For that reason, we will use it in subsequent experiments. Need to be noted however, better performance could be achieved when a larger batch-size is applied. However, due to the limitation of our memory size, especially for DenseNet implementation, which requires a high memory load, implementation with a larger size than 32 could not be implemented.
In most cases, we found that DenseNet achieves slightly better accuracies than AlexNet. It is not surprising since DenseNet has more depth and width than AlexNet, which may be better in modeling nonlinear relations between features and their class targets. However, DenseNet has more training parameters than AlexNet and slower to train. Comparison with other feature learning methods Table 3 compares the performance of EGAN with other features and feature learning algorithms. They are RGB (without feature learning), GAN, BiGAN, and CNN-AE. It is clear that EGAN is superior to others for all datasets. Need to be noted that in this study, the classifiers are all trained without pre-training. On average, relative improvements of 4.72%, 19.92%, 6.86%, and 15.55% are achieved compared to RGB, GAN, CNN-AE, and BiGAN respectively. Interestingly, we found, RGB is largely better than GAN and BiGAN. It is slightly better than CNN-AE in most cases. This may be expected, as when GAN is applied, the presumed distributions of the data may completely different than the real data, and hence, may cause mismatch when they are used in training. BiGAN is generally better than GAN. But due to the use of random variables to input of G, the trajectory of the learning may not be as targeted as our proposed method. Table 4 summarizes the running time of GANm CNN-AE, BiGAN, and EGAN. It is clear that EGAN is considerably slower than GAN and CNN-AE. This is as expected due to its much larger number of parameters than GAN and CNN-AE. Furthermore, the nested training requires more computation per epoch since each epoch requires five

Conclusions
In this paper, we propose encoder-deep convolutional generative adversarial network (EGAN) as a solution for unsupervised feature learning. In EGAN, an encoder is put on top of GAN's Generator networks to avoid GAN learns completely different distributions from data training. We use DCGAN architectures and encoder with one convolutional layers for EGAN. For supervised learning, we employ two DCNN architectures. They are AlexNet and DenseNet. In addition, we use nested training to train both EGAN and DCNN. Our evaluations of three types of datasets, including MNIST datasets, confirm that our proposed method is better than directly using RGB as features in two popular DCNN classifiers. It is also largely better than three feature learning methods: GAN, CNN-AE, and BiGAN. Our evaluation also shows that GAN performs much worse than directly using RGB. This strongly indicate how unpredictable the learning outcome of GAN due to random noise inputs.
During experiments, we found that number of layers for encoder may affect the performance. Having more layers may not produce significant improvements and the performance tend to get worse. We also found the selection of number of epoch for nested training may be influential to the performance.
In the future, we plan to evaluate the robustness of EGAN. Implementation of the encoder variants, data preprocessing, or data augmentation for improvements of our method also becomes our interest in the future.