Separable convolutional neural networks for facial expressions recognition

Social interactions are important to humans as social creatures, and emotions play an important part in them. Emotions usually convey meaning to the interlocutors alongside the spoken utterances. Automatic facial expressions recognition is one technique for automatically capturing, recognising, and understanding the emotions of an interlocutor. Many techniques have been proposed to increase the accuracy of emotion recognition from facial cues. Architectures such as convolutional neural networks demonstrate promising results for emotion recognition. However, most current convolutional neural network models require enormous computational power for training and for performing emotion recognition. This research aims to build compact networks with depthwise separable layers while maintaining performance. The proposed architecture was compared with three similar architectures on three datasets. The results show that the proposed architecture performed best among the architectures compared. It achieved up to 13% better accuracy and was 6–71% smaller and more compact than the other architectures. The best testing accuracy achieved by the architecture was 99.4%.

Automatic facial expressions recognition has been an attractive research topic over the past decades. Facial expressions recognition is paramount for building an affective system. Such a system can be implemented in several application areas, including but not limited to the medical area (e.g., depression analysis [2], nervous system disorders [3]), entertainment (e.g., games [4,5]), and virtual humans/agents or conversational agents [4,6,7]. Several research efforts have focused on building automatic facial expressions recognition systems, yet some challenges remain to be solved. Most of these are shared with general problems in computer vision, namely pose, illumination, partial occlusion, and variations [8].
The rise of deep learning has tremendously advanced the accuracy of facial expressions recognition. Recently, various Convolutional Neural Network (CNN) models have been applied to recognising emotions from facial expressions. Generally, a CNN architecture consists of convolutional, activation, and pooling layers. The convolutional layers compute the inner product of linear filters and the inputs. The non-linear activation layers filter the important information from the inner products produced by the convolutional layers. Pooling layers generally provide dimensionality reduction. The results are generally called feature maps. A number of architectures have been proposed to solve several problems in recognition tasks. Most CNN architectures construct feature maps by performing linear convolutions followed by non-linear activation functions, and then reduce the feature map dimensions with pooling functions. Research has shown that achieving a good level of abstraction in the learned model generally requires non-linear functions of the input images [9,10].
Moreover, training and classification with a CNN generally require immense computational power because of the convolution process. Therefore, CNNs are not practical to deploy on devices with small or limited computational power. A complex system such as virtual humans [4,7] also requires an architecture with an efficient process. Facial expressions recognition is part of the virtual humans' system [4], which likewise requires an efficient process for training and for classifying emotions from facial cues. Hence, inspired by the research in [9,10], and [11], this research aims to propose an efficient CNN architecture that processes the depth and spatial features separately while maintaining the performance level (e.g., accuracy).
To evaluate the proposed architecture, we compare it with a similar network without separable modules. In addition, we also explore the combination of the architectures with global average pooling versus the dense and flatten operation at the end of the network before the classification layers. The results have shown that the proposed architecture performs the best in both training times and accuracy. The proposed architecture has up to four times fewer trainable parameters than the other architectures, resulting in 13-46% faster training times. It also performed up to 13% better than the other architectures. The best accuracy achieved was 99.4% in the CK+ dataset. The rest of the paper is organised as follows: A recent work in CNN architecture and facial expressions recognition is described in the next section. The Proposed Architectures section illustrates the details of the models proposed in this research. Meanwhile, the details of the experiments are thoroughly explained in the Experimental Settings section. The results of experiments are discussed in the Results and Discussions section. Finally, the Conclusion and Future Work section provides the takeaway messages from this research and the future research directions.

Emotions recognition
Emotions convey meaning in social interactions. They generally express significant context to the interlocutors along with the spoken utterances. Hence, automatically capturing, recognising, and understanding the emotions conveyed by the interlocutors during a conversation is paramount to developing a system that can perceive a conversation more holistically. Affective Computing and Social Signal Processing are the areas that study emotions and social interaction between two agents (humans or machines). Affective Computing is the study of developing systems that can capture, process, recognise, interpret, and synthesise human affect [12]. Social Signal Processing is a study in the computing domain that aims to model, analyse, and synthesise social signals in interactions between agents [13]. One specific line of study in both domains is the automatic recognition of human emotions using sensors. The idea is to have a system capable of perceiving emotions from humans and reacting based on the perceived emotions (e.g., virtual humans [4]). By understanding human emotions in social conversation, the system provides a more colourful interaction to humans [4,7,13,14]. Emotions can be recognised from several features, such as brainwaves [15], heartbeat [16], voice prosody (e.g., the stress, rhythm, and intonation of speech) [17], facial expressions [18,19], and body gestures [20]. The most natural features to capture during a social conversation are voice prosody, facial expressions, and body gestures, which can be captured using microphones and cameras.

Datasets
The dataset is one of the important aspects of the deep learning training process; it acts as the fuel for the deep learning architecture. The quality and quantity of the data can significantly influence the results of a model trained with deep learning architectures and algorithms. Several datasets can be used to train emotion recognition from facial cues (e.g., facial expressions). The Cohn-Kanade Dataset (CK) [21], the Facial Expressions Recognition 2013 (FER2013) dataset [22], and the Maja Pantic, Michel Valstar and Ioannis Patras (MMI) dataset [23] are the most influential datasets in the early work on facial expression recognition. The CK dataset (later extended into the Extended Cohn-Kanade Dataset (CK+)) [21] has eight emotions (six basic emotions, contempt, and neutral) in 593 images from 123 subjects. FER2013 [22] and MMI [23] provide seven emotion classes (six basic emotions and neutral). FER2013 provides more than 30,000 images, and MMI provides 2900 videos collected from 25 participants. Researchers in Social Signal Processing and Affective Computing have recently built multimodal conversation databases for use in several areas, including facial expressions recognition. The Sustained Emotionally coloured Machine-human Interaction using Nonverbal Expression (SEMAINE) Dataset [24] is one such multimodal conversation database, collected from interactions between humans and agents (i.e., virtual humans). The SEMAINE Dataset has 24 interaction sessions with a total of 95 character interactions and 190 video clips [24].
Some datasets in the facial expressions recognition area were also collected with Asian respondents, for example, the Japanese Female Facial Expression (JAFFE) dataset [25], the Multimodal Asian Conversation Dataset [26], and the Indonesian Mixed emotion Dataset (IMED) [27]. JAFFE [25] provides seven emotion classes (six basic emotions and neutral) from 213 images of 10 subjects. The Multimodal Asian Conversation Dataset [26] provides seven emotion classes (six basic emotions and neutral) from more than 100 minutes of videos of 5 subjects. Finally, IMED [27] consists of 570 videos and 66,819 images categorised into seven single emotions (Anger, Disgust, Fear, Happy, Sadness, Surprise and Neutral) and twelve mixed emotions.

Facial expression recognition with deep learning
Emotion recognition from facial expressions has been a popular research topic for decades, as the facial expression is one of the most natural and universal cues to be recognised [13,28]. The general pipeline of facial expressions recognition consists of pre-processing, training, and evaluation. In the pre-processing phase, face alignment, data augmentation, and face normalisation are generally performed [19,28] before the images are fed to the deep learning architecture for training. Several deep learning architectures can be used to train the recognition model. The Convolutional Neural Network (CNN) is the most popular architecture for training the model. It provides simple and straightforward training implementations, and it also yields relatively high accuracy. Zhu et al. [29] proposed a hybrid attention cascade network for facial expression recognition with the highest accuracy of 98.46% in the CK+ dataset. Liu et al. [30] implemented a CNN for facial expressions recognition with the highest accuracy of 93.70% in the CK+ dataset. Other CNN implementations for building emotion recognition models from facial cues include [31] and [19].
Some architectures incorporate temporal aspects (e.g., sequences of images or videos) into the trained model, for example Recurrent Neural Networks (RNN) [32,33] and Temporal CNNs [34,35]. These architectures provide superior results when dealing with temporal information (for example, the onset, apex, and offset of facial action units or emotion activation). Finally, some researchers have also explored generative models for facial expressions recognition. Kim et al. [36] proposed deep generative-contrastive networks for facial expression recognition with the highest accuracy of 97.93% in the CK+ dataset. Cai et al. [37] also implemented discriminative features for facial expression recognition with the highest accuracy of 94.39% in the CK+ dataset.

Proposed architectures
In this research, we propose a model with a depthwise separable convolutional neural network (see Figs. 1, 2). To evaluate the effectiveness of the proposed model, a similar network without separable modules was used for comparison (see Fig. 3). In addition, we compared global average pooling with fully connected layers at the end of the network. In the global average pooling layer, the spatial averages of the feature maps from the previous layers are fed into the classification layer (e.g., softmax) [9]. Research has shown that global average pooling has some advantages over the flatten and dense operations of the classical fully connected layers in a CNN architecture: it is robust to spatial translations of the images and reduces overfitting [9]. Hence, four models were evaluated in this research. Figures 1, 2 visualise the architectures proposed and evaluated in this research.
We propose an architecture with a separable convolution process. The architecture separates the spatial cross-correlations from the channel cross-correlations to learn richer and smaller features [10]. It processes the depth (i.e., channel) and spatial (i.e., width and height) features of the input (i.e., images) separately. The depthwise separable convolution has two steps [10]. The first step is the depthwise convolution, in which the spatial features are extracted and handled. The second step is the pointwise convolution, in which the depth features (e.g., RGB channels) are extracted and handled. In the depthwise convolution, D kernels of size X × X × D were applied to M × N inputs with D dimensions (e.g., RGB channels). The output is Y × Y × D feature maps.
In the pointwise convolution, P kernels of size 1 × 1 × D were applied to the inputs (i.e., the Y × Y × D feature maps), resulting in Y × Y × P feature maps. For examples, see Fig. 1 and Table 1. In the first separable process (see conv2d_4 in Table 1), the kernel used in this research was 3 × 3 × 32 (X = 3, D = 32), applied to 24 × 24 inputs (M = N = 24) with D = 1 (black and white) or D = 3 (RGB); with the same-padding setting, the output is 24 × 24 × 32 feature maps (Y = 24, D = 32).
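The parameter saving that motivates the depthwise and pointwise split described above can be checked with simple arithmetic. The sketch below assumes the common formulation in which the depthwise step uses one k × k filter per input channel and the pointwise step uses 1 × 1 filters across channels; the function names and the 32-to-64 channel example are hypothetical and are not taken from Table 1.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int, bias: bool = False) -> int:
    """Weights of an ordinary convolution: one k x k x c_in filter per output channel."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


def separable_conv_params(kernel: int, c_in: int, c_out: int, bias: bool = False) -> int:
    """Weights of a depthwise separable convolution (assumed formulation).

    Depthwise step: one k x k filter per input channel.
    Pointwise step: c_out filters of size 1 x 1 x c_in.
    The optional bias is assumed to sit on the pointwise output only.
    """
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise + (c_out if bias else 0)


if __name__ == "__main__":
    # Hypothetical layer: 3 x 3 kernel, 32 input channels, 64 output channels.
    std = standard_conv_params(3, 32, 64)
    sep = separable_conv_params(3, 32, 64)
    print(std, sep, round(std / sep, 2))
```

For this hypothetical layer the separable form needs 2,336 weights instead of 18,432, which illustrates why stacking such layers yields a more compact network.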
In addition, global average pooling greatly reduces the number of parameters while maintaining robustness to spatial translations in the images. The proposed architecture (ARCH-1) has 292,862 trainable parameters (a total of 293,951 parameters), up to four times fewer than the other similar architectures. Figure 1 illustrates the proposed architecture with separable convolution layers and global average pooling. The first feature extraction process has 13 layers of alternating convolutional, batch normalisation, max pooling, and activation (i.e., ReLU) layers. The ReLU function is defined as ReLU(x) = max(0, x). The next process is divided into two separable processes. The first has seven layers of alternating separable convolution, batch normalisation, max pooling, and activation (i.e., ReLU) layers. The second is four residual layers with alternating convolutional and batch normalisation layers. The first separable process has a 3 × 3 kernel, while the second (i.e., the residual layers) has a 1 × 1 kernel. The next part consists of 16 layers of alternating convolutional, batch normalisation, max pooling, and activation (i.e., ReLU) layers. Finally, the classification layers have a global average pooling layer and an activation layer (i.e., Softmax). In global average pooling, an M × N × D input tensor is reduced to a 1 × 1 × D tensor by averaging all M × N values in each channel [9,38]. The dimension of the activation layer depends on the number of classes k (i.e., 1 × k, with k ∈ {0, …, 6} in this case). Table 1 shows the details of each layer in the architecture with 48 × 48 × 1 input tensors. To compare performance, the proposed model was compared with an architecture similar to ARCH-1. This architecture (ARCH-2) has flatten and dense layers in the classification layers instead of the global average pooling layer. Figure 2 shows the architecture of ARCH-2.
The network is similar to ARCH-1, with max pooling, flatten, two dense, and activation (i.e., Softmax) layers as the classification layers. The architecture has 1,142,463 parameters, of which 1,141,375 are trainable, almost four times more than the proposed architecture (ARCH-1). Table 2 shows the details of each layer in the architecture with 48 × 48 × 1 input tensors. Moreover, two more architectures with no separable convolution layers were also explored in this paper for comparison. One architecture (ARCH-3) uses global average pooling and activation (i.e., Softmax) layers as the classification layers, while the other (ARCH-4) uses max pooling, flatten, two dense, and activation (i.e., Softmax; see Eq. 1) layers as the classification layers. Table 3 details the ARCH-3 layers, with 305,807 trainable parameters out of a total of 306,831 parameters. Table 4 details the ARCH-4 layers, with 1,154,319 trainable parameters out of a total of 1,155,343 parameters, for 48 × 48 × 1 input tensors.
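The classification head built from global average pooling and Softmax, as described above, can be sketched in plain Python. This is an illustrative sketch only: the nested-list tensor layout (height × width × channels) and the function names are assumptions, and the softmax shown is the standard definition rather than a reproduction of the paper's Eq. 1.

```python
import math


def relu(x: float) -> float:
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)


def global_average_pool(fmap):
    """Reduce an M x N x D feature map (nested lists) to D per-channel averages."""
    height, width, depth = len(fmap), len(fmap[0]), len(fmap[0][0])
    sums = [0.0] * depth
    for row in fmap:
        for pixel in row:
            for channel, value in enumerate(pixel):
                sums[channel] += value
    return [s / (height * width) for s in sums]


def softmax(scores):
    """Standard softmax; the maximum is subtracted for numerical stability."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Hypothetical 2 x 2 feature map with 2 channels.
    fmap = [[[1, 10], [2, 20]], [[3, 30], [4, 40]]]
    pooled = global_average_pool(fmap)
    print(pooled)
    print(softmax(pooled))
```

Because the pooling step has no weights, the only trainable parameters in such a head belong to whatever layer produces the class scores, which is consistent with the smaller parameter counts reported for the global average pooling variants.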

Datasets
Three facial expression datasets were used to evaluate the proposed architectures: the Extended Cohn-Kanade Dataset (CK+) [21], the Facial Expressions Recognition 2013 (FER-2013) Dataset [22], and the Indonesian Mixed emotion Dataset (IMED) [27]. The CK+ dataset consists of almost 600 FACS-coded sequences with seven emotion classes: Angry, Disgust, Fear, Happy, Sadness, Surprise, and Contempt [21]. In this research, the proposed architectures were used only to classify the seven emotions; the AU coding and classification were not used. The second dataset was FER-2013, which consists of 35,685 facial expression images [22] categorised into seven emotions: Happiness, Neutral, Sadness, Anger, Surprise, Disgust, and Fear. The third dataset was IMED, which consists of 570 videos and 66,819 images categorised into seven single emotions (Anger, Disgust, Fear, Happy, Sadness, Surprise and Neutral) and twelve mixed emotions [27]. In this research, the proposed architectures were used only to classify the seven single emotions (k ∈ {0, …, 6}). The datasets were augmented to enrich the training and validation sets. The augmentation process is explained in the next sub-section.

Pre-processing and initial hyper-parameters settings
Several pre-processing steps were applied to the datasets before training with the proposed architectures. First, face detection and localisation were applied to find and crop the face from the images. Next, face alignment was applied to the cropped images, and finally, normalisation was applied to all the images. To enrich the data, data augmentation was also applied to all the datasets. The images were rotated within a rotation range of 20, shifted, zoomed, and flipped horizontally. The datasets were then split into 86-88% for the training sets and 12-14% for the validation (test) sets. A total of 81,954 augmented images were used as training (72,520) and validation (9434) sets. Specifically, the augmented FER2013 dataset has 57,418 training images and 7178 validation images. The augmented CK+ dataset has 1308 training images and 186 validation images. Finally, the augmented Indonesian Mixed emotion Dataset (IMED) has 13,794 training images and 2070 validation images. As the initial settings, the learning rate was set to 0.01 and reduced by a factor of 10 every time the model loss reached a plateau during learning. The models were trained for a maximum of 200 epochs with a batch size of 256 (10 for the CK+ dataset), with early stopping if there was no significant improvement in the model loss. To avoid overfitting, an L2 regularisation of 0.2 and a dropout of 0.5 were applied to the models. All the proposed architectures use Adam as the training optimiser (see Eq. 2). Here, θ_{t+1} denotes the updated weights at time t + 1. The weights are optimised from the previous weights (θ_t), the learning rate α, the Exponential Moving Average (EMA) of the gradient ∇f(x_t) (m_t), and the EMA of the squared gradient (v_t). Finally, to prevent division by zero, a small regularisation constant (ε) is used.
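The Adam update described above can be illustrated for a single scalar weight. This is a minimal sketch of the commonly used formulation with bias correction, not a reproduction of the paper's Eq. 2; the decay rates beta1 = 0.9 and beta2 = 0.999 and the value of eps are widely used defaults assumed here, since the text does not state them, and the function name is hypothetical.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar weight (standard formulation, assumed defaults).

    m is the EMA of the gradient and v is the EMA of the squared gradient;
    t is the 1-based step index used for bias correction.
    """
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


if __name__ == "__main__":
    # Toy objective f(x) = x^2 with gradient 2x, starting from x = 1.0.
    theta, m, v = 1.0, 0.0, 0.0
    for step in range(1, 501):
        theta, m, v = adam_step(theta, 2.0 * theta, m, v, step)
    print(theta)
```

The learning rate of 0.01 matches the initial setting given in the text; in practice it would be lowered whenever the loss plateaus.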
To create uniform inputs across the three datasets, we resized all the images to 48 × 48 × 1. Four architectures were explored in this research (see Table 5), resulting in 12 sets of results (three datasets for each architecture).
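The learning-rate schedule in the settings above (start at 0.01 and divide by 10 whenever the loss plateaus) can be sketched as a small helper. The class name, the `patience` value, and the `min_delta` threshold are hypothetical, since the text does not say how a plateau was detected.

```python
class PlateauScheduler:
    """Divide the learning rate by 10 when the loss stops improving (illustrative)."""

    def __init__(self, lr=0.01, factor=0.1, patience=5, min_delta=1e-4):
        self.lr = lr
        self.factor = factor        # 0.1 corresponds to "reduced by a factor of 10"
        self.patience = patience    # assumed: epochs without improvement to wait
        self.min_delta = min_delta  # assumed: smallest change counted as improvement
        self.best = float("inf")
        self.wait = 0

    def step(self, loss):
        """Record one epoch's loss and return the learning rate to use next."""
        if loss < self.best - self.min_delta:
            self.best = loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr *= self.factor
                self.wait = 0
        return self.lr


if __name__ == "__main__":
    scheduler = PlateauScheduler(lr=0.01, patience=2)
    for epoch_loss in [1.0, 0.9, 0.9, 0.9]:
        print(scheduler.step(epoch_loss))
```

Early stopping can follow the same pattern, ending training instead of lowering the rate once the patience budget is exhausted.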

Results and discussions
The proposed architecture (ARCH-1) was evaluated with three datasets (see "Experimental settings"). In addition, the proposed architecture was compared with three other architectures on the same datasets. Figure 4 visualises the feature maps from the convolutional layer (conv2d_4) of the ARCH-1 architecture.

Table 5 Architectures setting
                Global Avg    Flatten
Separable       ARCH-1        ARCH-2
No separable    ARCH-3        ARCH-4

Figure 5 gives an overview of the results from all combinations of the architectures and the datasets. The results demonstrate that the ARCH-1 architecture excels on all datasets compared to the other architectures. From the dataset perspective, CK+ provides the best results across all the architectures. Overall, ARCH-1 provides the best training accuracy across all the datasets, with an average training accuracy of 88.9%, followed by ARCH-3 with an average training accuracy of 86.8%. ARCH-4 takes third place with an average training accuracy of 84.7%, while ARCH-2 shows the lowest training accuracy across all the datasets, with an average of 83.9%. The best testing accuracy, 99.4%, was achieved by ARCH-1 trained with the CK+ dataset. The lowest score, a training accuracy of 62.3%, was obtained by ARCH-4 trained with the FER2013 dataset. Architectures trained with the CK+ and IMED datasets provide excellent results (> 90%), while the architectures trained with the FER2013 dataset provide the lowest scores among the datasets (61.3–70.3%). The proposed architecture, ARCH-1, also achieved higher results than existing architectures proposed in the literature. It achieved a higher training accuracy than some of the best CNN results in the literature, where Ding et al. [39] achieved a training accuracy of 98.6% and Zhang et al. [40] achieved a training accuracy of 98.9%. Moreover, the proposed architecture achieved a higher training accuracy than Liliana et al. [27], who achieved a training accuracy of 84.52% with a Support Vector Machine (SVM) as the classifier.
Figures 6, 7, 8 show the best model for each dataset trained with the best architecture (ARCH-1). Figure 6 illustrates the model accuracy and loss (with the Adam optimiser) in the training and testing phases on the FER2013 dataset using ARCH-1. The training stopped at 102 epochs, when the model could not learn more information from the dataset. The model achieved a best training accuracy of 70.5%, a testing accuracy of 70.3%, a training loss (loss on the training set during the training phase) of 0.91%, and a testing loss (loss on the validation/test set during the validation/testing phase) of 1.13%. Figure 7 shows the model accuracy and loss in the training and testing phases on the CK+ dataset using ARCH-1. The model achieved a best training accuracy of 99.4%, a testing accuracy of 99.4%, a training loss of 0.13%, and a testing loss of […]. Figure 8 shows the model accuracy and loss in the training and testing phases on the IMED dataset using ARCH-1. The model achieved a best training accuracy of 97.1%, a testing accuracy of 96.4%, a training loss of 0.35%, and a testing loss of 0.39%. The training stopped at 164 epochs, when the model could not learn significant information from the dataset.

Conclusion and future work
This research proposed a separable convolutional neural network with global average pooling to enhance real-time emotion classification experiences while also improving performance. The proposed methods were evaluated with three datasets and compared with three other architectures. The proposed architecture (ARCH-1) empirically performs better in terms of training speed and accuracy. The results show that the number of trainable parameters was tremendously reduced compared to the other similar architectures. The training times were also reduced by 6-71% compared to similar architectures. In addition, the proposed architecture achieved up to 13% better accuracy than the other architectures with the same dataset and settings. In general, ARCH-2 produced the worst results of all the architectures explored in this research. The best accuracy was achieved by ARCH-1 with the CK+ dataset (99.4%), while ARCH-4 achieved the worst accuracy with the FER2013 dataset (62.3%).
Moreover, the architectures with dense and flatten layers resulted in lower accuracy and a slower training process. The limitation of this research is that the architectures were only trained for a specific task (i.e., facial expressions) and only on black-and-white (single-channel) images. The proposed architecture can be used in other classification tasks with a similar performance level if the same settings described in this paper are used. In the future, the proposed architecture will also be trained and evaluated with a larger number of images and larger image dimensions. Combining the datasets would also be interesting to explore. The datasets have similar emotion labels and might provide more variation to the models, as IMED consists of Asian (i.e., Indonesian) faces, while FER and CK+ consist mostly of Caucasian faces. The models created in this research will also be deployed on devices with limited computing power (e.g., Raspberry Pi, robots) and in a complex system (e.g., virtual humans [4,7]). Finally, temporal aspects or features can also be considered as a future direction for improving the proposed architecture. Different onset, peak, and offset activation times of expressions result in different semantic meanings of emotions (e.g., acted and spontaneous expressions).
Fig. 8 Training and testing accuracy and loss of ARCH-1 with IMED dataset
Abbreviations
FACS: Facial action coding system; AU: Action unit; CNN: Convolutional neural networks; FER: Facial expression recognition; SVM: Support vector machine.