Four-class emotion classification in virtual reality using pupillometry

Emotion classification remains a challenging problem in affective computing. The large majority of emotion classification studies rely on electroencephalography (EEG) and/or electrocardiography (ECG) signals and classify emotions into only two or three classes. Moreover, most emotion classification studies use either music or visual stimuli presented through conventional displays such as computer or television screens. This study reports on a novel approach to recognizing emotions using pupillometry alone, in the form of pupil diameter data, to classify emotions into four distinct classes according to Russell’s Circumplex Model of Emotions, with emotional stimuli presented in a virtual reality (VR) environment. The stimuli used in this experiment are 360° videos presented using a VR headset. Pupil diameter, acquired using an eye-tracker, is the sole classification feature. Three classifiers were used for the emotion classification: Support Vector Machine (SVM), k-Nearest Neighbor (KNN), and Random Forest (RF). SVM achieved the best performance for the four-class intra-subject classification task at an average of 57.05% accuracy, which is more than twice the accuracy of a random classifier. Moreover, the best performance for recognizing a particular class was 70.83%, which was achieved by the KNN classifier for Quadrant 3 emotions. Although the accuracy can still be significantly improved, this is the first systematic study on the use of eye-tracking data alone, without any supplementary sensor modalities, to perform human emotion classification using VR stimuli, and it demonstrates that even with the single feature of pupil diameter, emotions can be classified into four distinct classes to a certain level of accuracy.
The ability to conduct emotion classification using pupil data alone represents a promising new approach to affective computing: new applications could be developed using readily available webcams on laptops and other camera-equipped mobile devices, without the need for specialized and costly sensor equipment such as EEG and/or ECG.


Introduction
Emotion classification is the task of detecting human emotions, mostly from facial expressions [8], verbal expressions [5], and physiological measurements. Several applications using emotion classification techniques have been developed and applied to real-world problems such as driver fatigue monitoring [3] and mental health monitoring [12]. However, most studies on emotion classification based on physiological signals rely on electroencephalography (EEG) and electrocardiography (ECG) [18,22]. Consequently, much less is known regarding the use of eye-tracking as a sensor modality for detecting emotions. Therefore, the aim of this paper is to report on the results of recognizing emotions using only eye-tracking data, in the form of pupil diameter, without any additional modality.
Eye-tracking refers to the method of tracking eye movements and identifying where the user is looking, as well as recording other eye-related attributes such as pupil diameter. Eye-tracking can be utilized in many domains such as marketing research, healthcare [13], education [7], psychology, and video gaming [2]. Eye-tracking technology can be widely deployed in the near future since it only needs a camera to acquire the required data. As such, it requires significantly fewer sensors in the recording device.
Additionally, most emotion classification studies use movies, images, or music as the stimulation tool to evoke the user's emotions. Far fewer studies have attempted to use Virtual Reality (VR) to present emotional stimuli. VR provides a virtual environment that is highly similar to the real world and as such could potentially evoke stronger emotional responses from the user compared to the other stimulation tools. The user can be immersed in a real-world experience by watching 360° videos using a VR headset. The user will have fewer distractions and will be able to focus more on the stimuli in the virtual environment.
Three machine learning classifiers were used in the experiment: Support Vector Machine (SVM), k-nearest neighbor (KNN), and Random Forest (RF). These methods are suitable for classification tasks. The SVM algorithm can be used for both regression and classification. It maps data into a high-dimensional feature space such that the data can be classified even if they cannot be separated linearly. KNN is a machine learning algorithm that labels new data points according to their similarity to known data. It finds the k nearest neighbors of a test sample based on the minimal distance to the samples in the training dataset. Random forest is a method that generates multiple decision trees randomly and fuses them into one "forest". It builds a multitude of decision trees and outputs the class that is the mode of the classes predicted by the individual trees (for classification) or the mean of their predictions (for regression). Three different classifier models are used to determine which machine learning algorithm obtains the best performance in recognizing the emotions in the four quadrants.
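As a minimal sketch of the k-nearest-neighbor idea described above, the following pure-Python snippet classifies a single pupil-diameter value by majority vote among its closest training samples. The function name `knn_predict`, the sample diameters, and their labels are hypothetical and chosen only for illustration; this is not the implementation used in the study.

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k=5):
    """Return the majority label among the k training samples closest to `query`."""
    # Rank training samples by absolute distance to the query (single 1-D feature).
    ranked = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical pupil diameters (mm) and quadrant labels, for illustration only.
train_x = [3.1, 3.2, 3.3, 4.0, 4.1, 4.2, 2.4, 2.5, 2.6]
train_y = ["Q1", "Q1", "Q1", "Q4", "Q4", "Q4", "Q3", "Q3", "Q3"]

print(knn_predict(train_x, train_y, 4.05, k=3))
print(knn_predict(train_x, train_y, 2.45, k=3))
```

With a single feature the distance reduces to an absolute difference; with several features a Euclidean distance would typically take its place.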

Emotion classification
Emotion is a brain-related mental state comprising three specific components: physiological response, subjective experience, and behavioral response [9]. Emotion reflects the feelings and thoughts of an individual as well as the degree of pleasure or displeasure. Ekman's model proposes six basic emotions: fear, happiness, anger, surprise, disgust, and sadness [10]. Plutchik's model extends these six basic emotions to eight by adding anticipation and trust [21]. Emotion classification refers to the task of recognizing an individual's emotions from their reactions and responses, categorizing them, and differentiating one emotion from another. Russell's Circumplex Model of Affect [25], which combines the arousal and valence dimensions to yield four quadrants of emotions, is among the emotion models most commonly adopted by researchers to test users and classify their emotions. Each quadrant represents the emotional states corresponding to a combination of high/low arousal (HA/LA) and positive/negative valence (PV/NV). Quadrant 1 is a combination of HA/PV and represents the emotions of happy, excited, elated, and alert; Quadrant 2 is a combination of HA/NV and represents the emotions of tense, nervous, stressed, and upset; Quadrant 3 is a combination of LA/NV and represents the emotions of sad, depressed, confused, and bored; while Quadrant 4 is a combination of LA/PV and represents the emotions of contented, serene, relaxed, and calm. Since many complex emotions occur in each of these quadrants, it is very difficult to determine one specific emotion based on the user's responses and reactions.
Hence, this letter attempts to classify emotions by distinguishing them according to their respective quadrants in Russell's model of emotions.
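The quadrant scheme described above can be written down compactly. The sketch below maps a high/low arousal flag and a positive/negative valence flag to the quadrant number, and lists the example emotions given in the text for each quadrant. The names `russell_quadrant` and `QUADRANT_EMOTIONS` are illustrative and not part of the original study.

```python
# Example emotions per quadrant, as listed in the text.
QUADRANT_EMOTIONS = {
    1: ("happy", "excited", "elated", "alert"),       # HA/PV
    2: ("tense", "nervous", "stressed", "upset"),     # HA/NV
    3: ("sad", "depressed", "confused", "bored"),     # LA/NV
    4: ("contented", "serene", "relaxed", "calm"),    # LA/PV
}


def russell_quadrant(high_arousal: bool, positive_valence: bool) -> int:
    """Map an arousal/valence combination to its quadrant number (1 to 4)."""
    if high_arousal:
        return 1 if positive_valence else 2
    return 4 if positive_valence else 3


# Low arousal with negative valence falls in Quadrant 3.
q = russell_quadrant(high_arousal=False, positive_valence=False)
print(q, QUADRANT_EMOTIONS[q])
```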

Eye-tracking
Eye-tracking is an advanced technology that is used to measure the eye movement or the point of gaze of an individual. Eye-tracking technology has been applied in many fields such as medical research, cognitive psychology, and Human-Computer Interaction (HCI) research [17]. Eye movement signals indicate where an individual is looking and thus enable the direct observation and accurate pinpointing of what is attracting their attention. Eye movement signals can be utilized as an indication of the individual's behaviors, and some previous works have used eye signals to investigate the attention of users in reading [24].

Eye-tracking in emotion classification
Eye features such as pupil diameter contain emotion-relevant characteristics; hence, eye-tracking data can be utilized in emotion classification. Many eye features can be used to conduct emotion classification, such as fixation duration, motion speed of the pupil, pupil position, and pupil size. There are studies on the analysis of eye movements for human behavior recognition [14,16]. However, studies that focus specifically on emotion classification using eye-tracking data alone are very limited, as most such studies incorporate other sensor modalities such as EEG and ECG. One study focuses on emotional eye movement analysis using electrooculography (EOG) signals [20]. There have also been studies that rely on other types of eye-tracking data such as fixation duration and pupil position [1,23]. Most of the papers that rely solely on eye-tracking data classify only the arousal dimension or the valence dimension separately, or three basic emotion classes such as positive, neutral, and negative [4,27]. A recent report reviews papers on emotion detection using eye-tracking, providing a taxonomy as well as the current challenges in this field [19]. To the best of our knowledge, there has thus far not been any systematic study on using eye-tracking data exclusively for four-class emotion classification. Therefore, this letter attempts to perform four-class emotion classification according to the four quadrants of Russell's model.

Emotion classification in VR
Through the use of VR technologies, the user is fully immersed in a virtual environment that very closely resembles the real world. This high level of immersion provides a greater sense of reality for the user when experiencing visual stimuli. Moreover, since the user's view is fully enclosed by the head-mounted display (HMD), external secondary stimuli that may distract the user from the primary stimuli are blocked out. Hence, the user should be more connected to the stimuli being presented in the virtual environment, and a more direct and real emotional response should be evoked through a VR presentation. A past study has reported that Immersive Virtual Environments (IVE) can indeed be utilized effectively as a presentation tool for emotion inducement [11]. A number of VR HMDs now have the option of integrating eye-tracking devices. As such, eye-tracking data can now be obtained easily with add-on eye-trackers placed inside the VR HMD. A recent study performed emotion classification using a wearable EEG headband and machine learning in a virtual environment by utilizing 360° videos [26]. Other authors have presented classification investigations using eye-tracking and VR for facial expressions [6,15]. Currently, there is no study on emotion classification using solely eye-tracking data in a virtual environment. Therefore, we employ this approach for our work on emotion classification using eye-tracking in VR environments.

Experiment setup
In this study, VR is used to present the emotional stimuli. The HTC Vive VR headset with a pair of earphones was used to stimulate the participant's emotions. The experiments were conducted with the presentation of emotional 360° videos covering four distinct emotion classes according to Russell's model of emotions. A total of ten subjects (9 males and 1 female), aged 21 to 28, participated in our experiment. An explanation was given to all participants before the experiment started. The experiment consisted of a series of 360° videos lasting a total of about 6 min. The eye-tracking data of participants were recorded using an add-on eye-tracker from Pupil Labs for the VR headset. The flow of the video presentation in the experiment is shown in Fig. 1. There were four separate sessions of video stimulation lasting 80 s each, one for each of the quadrants of emotion. A 10-s rest period was given before the video stimulation session for the next quadrant commenced.
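The presentation timeline can be turned into labels for the recorded samples. The sketch below assigns each elapsed time to a session index, assuming (these details are not stated in the text) that the recording starts with the first video at t = 0, that rest periods occur only between sessions, and that samples recorded during rest are discarded. The order in which the quadrants were shown is given in Fig. 1 and is not assumed here, so the function returns a session index rather than a quadrant.

```python
SESSION_S = 80   # duration of each video stimulation session (s), from the text
REST_S = 10      # rest period between consecutive sessions (s), from the text
N_SESSIONS = 4   # one session per quadrant


def session_for_time(t):
    """Return the 1-based session index for elapsed time t (s), or None for rest or out of range."""
    if t < 0:
        return None
    index, offset = divmod(t, SESSION_S + REST_S)
    if index >= N_SESSIONS or offset >= SESSION_S:
        return None
    return int(index) + 1


# Total duration under these assumptions: 4 * 80 s + 3 * 10 s = 350 s, about 6 min.
total_s = N_SESSIONS * SESSION_S + (N_SESSIONS - 1) * REST_S
print(total_s, session_for_time(0), session_for_time(85), session_for_time(90))
```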

Data collection and classification methods
The eye-tracking data were collected using the Pupil Labs application. In the data collection process, eye calibration was conducted for each participant. First, each participant put on the VR headset with the add-on eye-tracker. The pupil data were recorded using Pupil Capture. Video presentation and data recording were started simultaneously. The video was presented using Unity, which supports 360° video playback, with the recording script written in the C# programming language. Data recording was stopped as soon as the video finished playing. The recording directory was then transferred to Pupil Player for data visualization. The recorded data were then exported using the raw data exporter in Pupil Player and saved in CSV format. Several types of eye-tracking data can be exported from Pupil Player, such as gaze data, fixations, and pupil data. Pupil diameter was chosen as the eye feature in this experiment. The data are captured through pupil detection, and the diameter of the pupil is estimated in millimeters (mm) based on the diameter of the anthropomorphic average eyeball. This study utilized stimuli prepared using VR-based content. The dataset was collected specifically in a VR environment from each participant and has approximately 70,000 datapoints, with a timestamp included for every second of acquisition. The machine learning tasks were carried out in Python. Three types of machine learning classifiers were used to classify the emotions in this experiment, namely Support Vector Machine (SVM), K-nearest neighbor (KNN), and Random Forest (RF). The SVM classifier was used with the Radial Basis Function (RBF) kernel, while the k value in the KNN classifier was set to 5.
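As a rough illustration of the export step, the sketch below reads pupil diameter values from a CSV file with Python's standard `csv` module. The column names `pupil_timestamp`, `confidence`, and `diameter_3d` are assumptions about the export layout and may differ between Pupil Player versions; the confidence threshold is likewise an illustrative choice that the text does not describe, and the inline sample rows are invented.

```python
import csv
import io

# Invented sample rows standing in for an exported pupil data file.
SAMPLE_CSV = """pupil_timestamp,eye_id,confidence,diameter_3d
100.00,0,0.98,3.42
100.01,0,0.35,
100.02,0,0.97,3.40
100.03,0,0.99,3.47
"""


def load_pupil_diameters(handle, min_confidence=0.6):
    """Return (timestamp, diameter_mm) pairs, skipping empty or low-confidence rows."""
    samples = []
    for row in csv.DictReader(handle):
        if not row["diameter_3d"]:
            continue  # no millimeter estimate available for this sample
        if float(row["confidence"]) < min_confidence:
            continue  # assumed quality filter, not described in the text
        samples.append((float(row["pupil_timestamp"]), float(row["diameter_3d"])))
    return samples


samples = load_pupil_diameters(io.StringIO(SAMPLE_CSV))
print(len(samples), samples[0])
```

For a real recording, the `io.StringIO` object would be replaced by an open file handle to the exported CSV.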

Results and discussion
From Fig. 2, the results show that the pupil diameter is generally largest in Quadrant 4 and smallest in Quadrant 3. They also show that the pupil diameter changes the most at the low arousal level, and that the pupil diameter is smaller in Quadrants 2 and 3, which are located in the negative valence region. These characteristics show that pupil diameter does indeed change with certain emotions; hence, such emotion-relevant features can be extracted and classified using machine learning algorithms to distinguish between the different quadrants of emotions. The classification results using pupil diameter were obtained from three different machine learning algorithms: the KNN, SVM, and RF classifiers. From Fig. 3, the SVM classifier showed the best performance for emotion classification compared with the KNN and RF classifiers. The highest accuracy achieved was 57.05% by the SVM classifier, while the highest accuracies obtained from the KNN and RF classifiers were 53.23% and 49.23%, respectively.
Tables 1, 2, and 3 show the confusion matrices of each emotion classification for participants 5 and 6, which were chosen because the classification results from these two participants showed the highest average classification accuracy across the four classes. The highest accuracy across all the classifiers' confusion matrices was for Quadrant 3 in participant 6, which was 69.86% for SVM, 70.83% for KNN, and 69.64% for RF. The next highest classification rate was observed in participant 5 for Quadrant 2, which was 58.85% for SVM, 56.98% for KNN, and 53.42% for RF. Both these observations suggest that pupil diameter appears to be a promising feature for identifying LA/NV and HA/NV emotions, which are located in the negative valence quadrants. This appears to be a novel finding that shows some relationship between pupil diameter and negative valence stimuli, as pupil diameter was previously correlated only with arousal levels.
[Figure caption: The classification results using pupil diameter with machine learning algorithms]
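To make the per-quadrant figures concrete, the sketch below shows one common way to derive per-class and overall accuracy from a four-class confusion matrix, together with the chance level for four classes. The matrix counts are hypothetical and are not the values from Tables 1 to 3. Rows are assumed to hold the true quadrant and columns the predicted quadrant, so the per-class value is the fraction of each true class that was predicted correctly.

```python
def per_class_accuracy(matrix):
    """Fraction of each true class (row) that was predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Fraction of all samples that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical counts: rows = true Quadrants 1 to 4, columns = predicted Quadrants 1 to 4.
confusion = [
    [50, 20, 10, 20],
    [15, 60, 15, 10],
    [10, 10, 70, 10],
    [25, 15, 10, 50],
]

chance_level = 1 / len(confusion)  # 0.25 for four equally likely classes
print(per_class_accuracy(confusion))
print(overall_accuracy(confusion), chance_level)
```

Against the 25% chance level, the reported SVM accuracy of 57.05% is a little more than twice what a random classifier would achieve.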

Conclusion
In this letter, we classified emotions according to Russell's four-quadrant Circumplex Model of Emotions using eye-tracking data recorded while VR 360° video stimuli were presented to the participants. We collected the eye-tracking data from an eye-tracker mounted inside the VR headset, and pupil diameter was chosen as the eye feature for emotion classification in this experiment. We used three different machine learning algorithms to conduct the classification tasks. The findings showed that Support Vector Machine (SVM) had the best average accuracy of 57.05% across all four quadrants compared to the other two classifiers, K-nearest neighbor (KNN) and Random Forest (RF). From the analysis of the confusion matrices, it was also observed that the accuracy for correctly predicting emotions from the LA/NV quadrant was the highest, at around 70% for all three classifiers for a particular participant.
For future work, we will compare the performance of four-class inter-subject emotion classification and investigate the use of deep learning.