Who is behind the wheel? Driver identification and fingerprinting

In the last decade, significant advances have been made in sensing and communication technologies, leading to considerable growth in the development and use of intelligent transportation systems. Characterizing the driving styles of drivers using in-vehicle sensor data is an interesting research problem and an essential real-world requirement for the automotive industry. A good representation of driving features can be extremely valuable for anti-theft systems, auto insurance, autonomous driving, and many other application scenarios. This paper addresses the problem of driver identification using real driving datasets consisting of measurements taken from in-vehicle sensors. It investigates the minimum learning and classification times required to achieve a desired identification performance. Further, feature selection is carried out to extract the features most relevant to driver identification. Finally, in addition to driving-pattern-related features, driver-related features (e.g., heart rate) are shown to further improve the identification performance.

Modern connected vehicles generate massive sensor data streams (on the order of one terabyte per hour). As the number of connected cars increases, the volume of data generated by vehicles will explode. Powerful analytic platforms will enable insurance firms, car companies, service and repair shops, and fleet owners to generate breakthrough insights.
It is now widely accepted that every driver has a unique way of driving. Thus, driver identification can be performed with high accuracy through driving behavior classification based only on raw data collected from in-vehicle sensors via the Controller Area Network (CAN) bus. This can be achieved after only a few minutes behind the wheel.
The ability to recognize a driver and his/her driving behavior could form the basis of several applications, such as driver authentication for security purposes, detection of the driver's drowsiness, and customization of the vehicle's functions to suit the driver's preferred configuration.
The problem of automatic driver identification has received increased interest in the recent literature. Despite this interest, the issue of the impact of the identification time on performance has been neglected. With this in mind, the aim of this work was to develop a time-optimized driver identification framework.
In this paper, we first evaluate the performance of existing driver identification methods on various datasets and explore ways to improve them. Then, we determine the most valuable features for reliable driver identification. Finally, we look into how to reach high identification accuracy while optimizing the number of features, the training dataset size, and the identification time (i.e., the duration of the testing window). The three real datasets used in this paper are summarized in Table 1 [2][3][4].
The rest of this paper is organized as follows. "Background and related works" section summarizes the existing literature on driver identification and profiling through data analysis. "Methodology and analysis" section describes in detail the datasets and the identification methods used in this paper. "Experimental results" section presents the driver identification results and a comparative analysis of the different methods. Finally, concluding remarks, a discussion, and directions for future work are given in "Discussion" section.

Background and related works
Vehicle-based performance technologies infer driver behavior by monitoring car systems such as lane deviation, steering or speed variability. Such systems are critical to detect and avoid driver drowsiness, which is related to around 20% of severe car injuries. The idea of fingerprinting drivers from timestamped sensor data, e.g., controller area network (CAN) protocol records, is not new; many recent studies have shown that identifying a driver using machine learning-based classification is a promising field of research. Another approach to driver identification, which has also attracted a lot of research effort, is based on face recognition. In this paper, we focus on the former approach.
Most methods in the literature on driving style modeling rely on a human-defined driving behavior feature set, which consists of handcrafted vehicle movement features derived from sensor data. These features are used by machine learning methods (supervised classification, unsupervised clustering, or reinforcement learning) to solve problems such as driver classification/identification, driver performance assessment, and individual driving style learning.
Both simulated and naturalistic driving patterns have been studied in the literature using different features extracted mainly from the vehicle's CAN bus (e.g., steering wheel angle, vehicle speed, and engine speed). The number of these features may range from one to twelve. Using these features, different machine learning methods (e.g., Bayesian algorithms, Decision Tree algorithms, instance-based algorithms, deep learning algorithms) have been proposed to learn driving styles. Dong and Li [5] proposed to use deep learning to identify a user using only raw GPS records. This was the first attempt at applying deep learning to driving style feature learning directly from GPS data. First, they proposed a data transformation method to construct an easily consumable input form (the statistical feature matrix) from raw GPS time series. Second, they developed several deep neural network architectures, including Convolutional Neural Networks (CNNs) using 1-D convolution with pooling, and Recurrent Neural Networks (RNNs), and studied their performance in learning a good representation of driving styles from the transformed inputs. For driver identification, the authors of [6][7][8] have proposed several signal processing approaches using Gaussian Mixture Models (GMMs) and different feature selection strategies. To address the car theft problem, Meng et al. [9] proposed a Hidden Markov Model (HMM) method, coupled with an HMM-based similarity measure, using mainly three features: acceleration, brake, and steering wheel data. Naturalistic data from the University of Texas Drive (UTDrive) corpus have been used by Choi et al. [10] to derive both GMM and HMM models for the sequence of driving characteristics (wheel angle, brake pedal status, acceleration status, and vehicle speed). The authors showed that driver identification can be accomplished at rates ranging from 30 to 70%. Wahab et al. 
[11] performed driver identification using statistical, artificial neural network, and fuzzy neural network techniques. The authors considered the accelerator and brake pedal pressure signals of 30 drivers and used GMMs and the wavelet transform for feature extraction. To optimize energy usage, Kedar-Dongarkar and Das [12] proposed a simple classifier of driving styles (based on the generalized Bell function) using features extracted from the vehicle's powertrain signals. The authors defined three driving styles and achieved a classification accuracy of 77%. Van Ly et al. [13] pointed out the potential of using inertial sensors to differentiate between drivers. The authors conducted experiments comparing brake and turning signals from two different drivers using K-means and Support Vector Machine (SVM) algorithms. Another effort in driver differentiation was made by Zhang et al. [14], who used HMMs to analyze the accelerator and steering wheel data of each driver and achieved an accuracy of 85%. One of the most accurate approaches to driver identification on naturalistic data was proposed by Enev et al. [15]. Twelve features from the CAN bus were considered with SVM, Random Forest, Naive Bayes, and k-nearest neighbor (KNN) algorithms. The authors showed that it is possible to differentiate between drivers with 100% accuracy under some assumptions, and to reach high identification rates using less than 8 min of training time. Recently, Wallace et al. [16] studied a large dataset of all trips made by 14 drivers over a 2-year period. The authors identified a two-phase relationship between the mean and maximum accelerations within each driver's acceleration events, which can be used as a measure of a driver's signature. Burton et al. 
[17] proposed a novel approach for driver authentication, where the mode of driving is constructed using the following features: pedal control, steering, speed, and distance traveled. The authors used classical machine learning algorithms (SVM, KNN, and Decision Tree) and boosting to increase the classification accuracy. The obtained results show a time-to-detection of 2 min and 20 s at 95% precision.

Driver fingerprinting
Higher driver identification accuracy will likely require multiple driving parameters and a longer learning time. In this paper, we investigate the relationship between accuracy, the number of features, and the learning time, with the objective of optimizing the driver fingerprinting task.
The process of driver fingerprinting consists of first preprocessing the driving datasets, selecting the most relevant features and then developing appropriate classification models using machine learning algorithms.

Datasets description
There are existing modules (e.g., after-market auto insurance dongles, phone-connected dashboards like Apple's CarPlay, or built-in radios like the telematics unit) that can access a vehicle's internal computer network and read data for various purposes, including driver fingerprinting.
The three datasets used in this work are described next.

Security dataset [2]
The data collection was carried out in South Korea using a recent model from KIA Motors Corporation. Ten drivers participated in the experiments, which covered four paths of three road types (city roads, motorway, and a parking space) with a total length of 23 km (Fig. 1). The city roads have traffic signals and crosswalks, whereas the motorway has none. In the parking space, the drivers were required to drive slowly and cautiously. The experiment started on 28 July 2015. The time factor was controlled by performing experiments in similar time slots, from 8 p.m. to 11 p.m. on weekdays. The drivers completed two round trips for a reliable classification. The driving data per driver were labeled from "A" to "J". A total of 94,401 records were captured at one-second intervals, yielding a 16.7 MB dataset. The data were collected from the vehicle's CAN bus through On-Board Diagnostics 2 (OBD-II) and CarbigsP (an OBD-II scanner). The vehicle is equipped with many measurement and control sensors managed by the Electronic Control Unit (ECU).
For example, the ECU monitors and controls the engine, the automatic transmission, and the Anti-lock Braking System (ABS). ECU measurements are obtained via the OBD-II system, and the data are recorded every second during driving. A total of 51 features were measured through the OBD-II system.

UAH-DriveSet [3]
The UAH-DriveSet is an open dataset obtained with the driving monitoring app DriveSafe, with the objective of collecting driving data in different environments using smartphone sensors alone. The large number of variables included in this dataset facilitates driving analysis.
The dataset was collected by six drivers of diverse ages driving different vehicles, including a fully electric vehicle. Each driver performed three behaviors (normal, aggressive, and drowsy) on two routes: a 25 km round trip on a motorway with usually three lanes in each direction and a maximum speed of 120 km/h, and a roughly 16 km stretch of secondary road with usually one lane in each direction and a maximum speed of about 90 km/h. On the secondary road, only the normal and drowsy behaviors were simulated with the electric car because of its limited range. The resulting recordings amount to more than 500 min of realistic driving, with the associated raw data, supplementary semantic information, and video recordings of the tours. Further details about the experiment are given in Table 2.
The tests were performed in the drivers' own cars by placing two phones on the windshield. Figure 2 shows the setup reproduced in each test vehicle.
Every driver drives the pre-designated routes while simulating sequences of the three behaviors: normal, aggressive, and drowsy driving. In the normal case, the driver is told to drive as usual. In the drowsy case, the driver is told to feign slight sleepiness, which typically results in sporadic unawareness of the road scene. Finally, in the aggressive case, the driver is told to push his or her aggressiveness to the limit (without creating actual risk), which generally results in impatient and rough driving. A co-pilot is in charge of the safety of the tests and does not interfere by giving additional instructions during the tours, except in cases of extreme danger during the maneuvers. The two roads covered in the tests are both in the Community of Madrid (Spain).
The data is composed of two files whose names start with "RAW" and which contain measurements obtained directly from the inertial sensors (gyroscopes and accelerometers) and the GPS of the smartphone. The two files are described below.
(a) Raw GPS contains the data obtained from the GPS at a 1 Hz sampling frequency. The content of each column is:
• Timestamp (seconds),
• Speed (km/h),
• Latitude coordinates (degrees),
• Longitude coordinates (degrees),
• Altitude (meters),
• Vertical accuracy (degrees),
• Horizontal accuracy (degrees),
• Course (degrees),
• DifCourse: course variation (degrees).
Fig. 2 Video recorder using the DriveSafe app on one of the testers' windshields [3]
(b) Raw accelerometers contains all the data collected from the inertial sensors at 10 Hz (obtained from the phone's 100 Hz samples by averaging every ten samples). The iPhone was fixed on the windshield at the start of the route, so the axes remain the same throughout the trip. They were aligned during the calibration process of DriveSafe, where the y-axis is aligned with the lateral axis of the vehicle (reflecting turns) and the z-axis with the longitudinal axis (a positive value reflecting acceleration and a negative value reflecting braking). The accelerometer measurements were also logged after filtering by a Kalman Filter (KF). The content of each column is:
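As an illustration, a Raw GPS file with the columns listed above could be loaded as follows; the whitespace delimiter and the absence of a header row are assumptions about the on-disk format, not specifics from the dataset documentation.

```python
# Sketch: load a UAH-DriveSet RAW GPS log into a labeled pandas DataFrame.
# Column order follows the list above; the whitespace delimiter and lack
# of a header row are assumptions about the file format.
import pandas as pd

GPS_COLUMNS = [
    "timestamp", "speed", "latitude", "longitude", "altitude",
    "vertical_accuracy", "horizontal_accuracy", "course", "dif_course",
]

def load_raw_gps(path):
    """Read a whitespace-separated RAW GPS log, one sample per row."""
    return pd.read_csv(path, sep=r"\s+", header=None, names=GPS_COLUMNS)
```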

HciLab dataset [4]
The HciLab Driving dataset is publicly available as an archive of comma-separated files, where each file contains the merged recordings of one participant. The complete dataset has a size of 450 MB and consists of 2.5 million samples. It is anonymized and contains information about GPS, brightness, acceleration, physiological data, and the video rating. Note that the number of samples per participant varies due to different traffic conditions and driving behaviors, which result in different driving times. The video is excluded from the dataset for privacy reasons. Three different data streams were recorded during the driving session (see Fig. 3). First, the physiological state of the driver was recorded using three sensors attached to the participant: the skin conductance and temperature sensors were attached to the participant's left hand, whereas the ECG was attached to the participant's chest. These sensors were connected to the Nexus 4 Biofeedback system, which stored the driver's physiological data. Second, context data was collected through an Android smartphone (Google Nexus S). Finally, two webcams (Logitech QuickCam Pro 9000 and Creative VF0610 Live! Cam Socialize HD) were used to record the driving scenario (passenger view onto the road) and a view of the driver. As the streams were recorded with different sampling frequencies, timestamps were used to synchronize them post-recording. Details about the different measurements are given next.
The GPS information is recorded at a 1 Hz sampling frequency via the mobile phone. The GPS data consists of the longitude and latitude values (in degrees) that define the position of the car, as well as further information about accuracy (in meters), altitude (in meters), speed (in meters per second), and bearing (in degrees). A timestamp was recorded as well to map the GPS data onto the rest of the dataset.
The smartphone also provided records of the brightness level (in Lumen) as well as the acceleration perceived by the phone's sensors along the three axes. These two sensors provide records at frequencies between 8 Hz and 12 Hz.
The electrocardiogram (ECG, in µV) was recorded at 1024 Hz and was used to calculate the heart rate (in beats per minute) and heart rate variance at 128 Hz. Furthermore, the skin conductance (in µS) and body temperature (in degrees Celsius) were recorded at 128 Hz. Again, timestamps were added to the physiological data records.
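Since the streams are recorded at different rates and aligned post-recording via timestamps, the synchronization step can be sketched with pandas as below; the column names and the backward nearest-match rule are illustrative assumptions, not the dataset's actual headers or the authors' exact procedure.

```python
# Sketch: align sensor streams recorded at different rates (e.g. 1 Hz GPS
# vs 128 Hz physiological data) onto a common timeline using timestamps.
import pandas as pd

def synchronize(gps: pd.DataFrame, physio: pd.DataFrame) -> pd.DataFrame:
    """Attach to each GPS sample the nearest earlier physiological sample."""
    gps = gps.sort_values("timestamp")
    physio = physio.sort_values("timestamp")
    return pd.merge_asof(gps, physio, on="timestamp", direction="backward")
```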

The proposed driver identification model
Given the heterogeneity of mobile devices and vehicles, driver identification using in-vehicle sensors needs to adapt its parameters to each context. Optimizing the training time for each context is necessary for reliable and fast driver identification. Figure 4 describes the process of performing such optimization. In this process, the datasets are first divided into small segments, and classification algorithms are applied to an increasing number of segments until the identification score reaches a pre-defined satisfaction threshold (ideally 100%). The obtained results are then saved on the server. Figure 5 shows the driver verification framework based on the analysis of driving patterns.
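The optimization loop of Fig. 4 can be sketched as follows; the Extra Trees classifier, the 3-fold scoring, and the default threshold are illustrative choices, not values fixed by the paper.

```python
# Sketch of the Fig. 4 loop: train on an increasing number of segments
# until the cross-validated identification score reaches a target
# threshold. Classifier and threshold are illustrative assumptions.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

def minimal_training_segments(segments_X, segments_y, threshold=0.99):
    """Return (n_segments, score) for the smallest prefix of segments
    whose cross-validated accuracy reaches the threshold."""
    for n in range(1, len(segments_X) + 1):
        X = np.vstack(segments_X[:n])
        y = np.concatenate(segments_y[:n])
        clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
        score = cross_val_score(clf, X, y, cv=3).mean()
        if score >= threshold:
            return n, score
    return len(segments_X), score  # threshold never reached
```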
The framework consists of four modules: data collection, data preprocessing, driver classification, and driver verification. Data collection from the in-vehicle sensors begins when the driver starts driving. The data preprocessing module converts the collected data into a new format to be analyzed by the next module and builds feature vectors that can distinguish drivers. The driver classification module trains the machine learning algorithm using the feature set fed from the previous module. The machine learning algorithms considered here are Extra Trees, Random Forest, KNN, and SVM, which were shown to yield high performance in previous studies. The machine learning algorithm detects the unique driving patterns of a driver and builds his or her driving fingerprint. The driver verification module compares a given driving pattern with those of the authenticated drivers and decides whether there is a match. More details about data preprocessing and feature analysis are given next.

Data preprocessing
The objective of this task is to transform the collected data for the subsequent analysis and classification algorithms. This task consists of the following subtasks.
1. Data preparation and cleaning: constant and duplicate columns are removed. For example, the engine torque value is identical to the corrected engine torque value. After deleting redundant features, missing or erroneous values are replaced using the KNN method.
2. Feature selection: after data preparation, we select the most contributing features and exclude those that are highly correlated with them, in order to improve the driver identification performance in terms of both accuracy and speed.
3. Feature transformation: first, the time is converted from date-time format to timestamp format so that it can easily be included in the learning algorithms. Then, the dataset is split into multiple segments to be used in the subsequent optimization process. Further, since the features have different scales, they are normalized prior to their use in the machine learning algorithms. Normalization is indeed necessary for algorithms that are based on the distance between data elements, such as KNN. This normalization is performed using Eq. (1), where X_i is the normalized version of feature x_i; the resulting normalized features lie between 0 and 1:

X_i = (x_i - min(x_i)) / (max(x_i) - min(x_i))    (1)
For the SVM algorithm, data standardization is also carried out in order to make data dimensionless. After standardization, all knowledge of the scale and the location of the original data may be lost. It is essential to standardize variables in cases where the difference measure, such as the Euclidean distance, is sensitive to the changes in the magnitudes or scales of the input variables [18].
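A minimal sketch of this preprocessing chain with scikit-learn, assuming KNN imputation of missing values, min-max normalization to [0, 1] (Eq. 1) for KNN, and standardization (zero mean, unit variance) for SVM; the toy data and parameters are illustrative.

```python
# Sketch of the preprocessing chain: KNN imputation, then min-max
# normalization (for distance-based methods) and standardization
# (for SVM). Data and n_neighbors are illustrative assumptions.
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[2.0, 100.0],
              [4.0, np.nan],   # a missing value to be imputed
              [6.0, 500.0]])

X_filled = KNNImputer(n_neighbors=2).fit_transform(X)  # fill missing values
X_norm = MinMaxScaler().fit_transform(X_filled)        # each column in [0, 1]
X_std = StandardScaler().fit_transform(X_filled)       # mean 0, std 1 per column
```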

Feature modeling and analysis
Here, the distributions of the features are explored. Figures 6 and 8 show these distributions for two important features: 'Brake Pedal' and 'Max Engine Torque'. The values of these features change with the driving environment (start-up, idling in heavy traffic, cruising on the highway, etc.) [19] and with the driver's driving pattern under such conditions.
In Fig. 7, the features are sorted according to their ability to differentiate between drivers when using the Extra Trees classifier. At the top of the importance list are 'Long-term fuel trim bank 1', which reflects the condition of the engine, and 'fuel trims', which represent the percentage change in fuel over time. Thus, fuel-related features seem to be the most telling indicators of a driver's driving style. The transmission oil temperature (the fluid temperature inside the transmission) and the 'Friction torque', related to the brake pedal, are the next most discriminative features for classifying drivers.
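As a sketch, an importance ranking of this kind can be reproduced with scikit-learn's Extra Trees classifier; the data and feature names below are synthetic placeholders, not the paper's actual features.

```python
# Sketch: rank features by impurity-based importance with Extra Trees,
# as behind Fig. 7. Feature names and data here are synthetic.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def rank_features(X, y, names):
    """Return (name, importance) pairs sorted from most to least important."""
    clf = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X, y)
    order = np.argsort(clf.feature_importances_)[::-1]
    return [(names[i], clf.feature_importances_[i]) for i in order]
```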

Classification algorithms
The machine learning algorithms considered in the classification task are Decision Tree, Random Forest, Extra Trees, KNN, SVM, Gradient Boosting, AdaBoost based on Decision Tree, and multi-layer perceptron (MLP). For the Security dataset, we used the fifteen most important features, according to Fig. 8, along with the normalized timestamp. For the HciLab dataset, all features were used in the classification task. For the UAH-DriveSet, only the GPS features were used. Each of the above-mentioned classification algorithms generates a driver identification model.
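The candidate classifiers can be instantiated with scikit-learn as sketched below; the paper does not report hyper-parameters, so library defaults (and an assumed depth-3 base tree for AdaBoost) are used.

```python
# Sketch: the classifiers listed above, with scikit-learn defaults.
# Hyper-parameters are assumptions, not values from the paper.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, AdaBoostClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

CLASSIFIERS = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Extra Trees": ExtraTreesClassifier(),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(probability=True),
    "Gradient Boosting": GradientBoostingClassifier(),
    "AdaBoost (DT)": AdaBoostClassifier(DecisionTreeClassifier(max_depth=3)),
    "MLP": MLPClassifier(max_iter=500),
}
```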

Evaluation criteria
Since the minimum sampling frequency of the driving records across the studied datasets is nearly 1 Hz, the driver identification process was performed every minute, so that the number of records per feature in the identification task is at least 60. We adopt 10-fold cross-validation to compare the different classification algorithms. When new driving data arrive, the evaluation module classifies them into one of the pre-defined classes. Table 3 shows the classification accuracy of three algorithms during the first 5 min for the Security dataset. All algorithms achieve an accuracy of at least 90%.
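This protocol can be roughly sketched as follows: slice the records into one-minute windows of at least 60 samples, summarize each window into a feature vector, and score with 10-fold cross-validation. The per-window mean aggregation and the Random Forest choice are assumptions for illustration, not the paper's exact pipeline.

```python
# Sketch of the evaluation protocol: non-overlapping 60-sample windows
# (one minute at ~1 Hz), one mean feature vector per window, then
# 10-fold cross-validated accuracy. Aggregation is an assumption.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def windowed_accuracy(X, y, window=60, cv=10):
    """Mean per-window feature vectors, then cross-validated accuracy."""
    n = (len(X) // window) * window
    Xw = X[:n].reshape(-1, window, X.shape[1]).mean(axis=1)  # one vector per window
    yw = y[:n].reshape(-1, window)[:, 0]                     # window label (assumed homogeneous)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(clf, Xw, yw, cv=cv).mean()
```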

Identification results
As can be seen in Table 3, the identification accuracy increases only marginally beyond 3 min; this observation is useful for setting the threshold on the required training time. For the second dataset (HciLab), Table 4 shows the classification accuracy of five algorithms during the first 5 min. All algorithms achieve 100% accuracy, which illustrates the positive impact of including physiological features.
For the third dataset (UAH-DriveSet), an accuracy of 76% is achieved using only GPS data.

Model comparison
Driver identification is particularly useful when it is fast. However, faster identification requires a smaller processing window, while more reliable identification requires a longer one. To strike a good balance between these two constraints, we propose to choose the processing window for each classification algorithm according to the rate of improvement of the classification performance with respect to the window length. In other words, the identification time (i.e., the length of the processing window) is set to the minimum value beyond which the improvement in classification performance is no longer significant. Table 4 shows the classification performance for different algorithms and identification times. It can be seen that the Extra Trees and Random Forest algorithms outperform the other algorithms considered in this work.
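The window-selection rule can be sketched as follows, where the input is a list of (minutes, accuracy) pairs such as those in Table 4; the significance threshold is an assumed value, not one fixed by the paper.

```python
# Sketch: pick the smallest identification time beyond which the accuracy
# gain per extra minute drops below a significance threshold (assumed).
def choose_identification_time(scores, min_gain=0.005):
    """scores: list of (minutes, accuracy) pairs, ordered by minutes."""
    for (t0, a0), (t1, a1) in zip(scores, scores[1:]):
        if a1 - a0 < min_gain:      # improvement no longer significant
            return t0
    return scores[-1][0]            # accuracy kept improving to the end
```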

Driver verification
Using the classification model trained on authorized users, driver verification consists of testing whether or not the user is classified into one of the pre-defined classes, e.g., the authorized drivers. The testing process is based on computing the probability of occurrence of each of the pre-defined classes given the new data samples. For the Random Forest algorithm, these probabilities are computed from the frequencies of each class, given a new driving pattern, among the large number of generated trees. If all computed probabilities fall below a pre-defined threshold, the driver is declared not to be one of the authorized drivers, and an alert may be sent to the owner of the car or to the vehicle control center. To minimize the probability of a false alert, this threshold must be chosen judiciously, according to the minimum accuracy obtained in the training phase. In our experiments, the threshold value is set to 0.97.
In our experiment, data related to two drivers of the Security dataset were excluded from the training phase and used to test the driver verification task. The maximum class probability was 0.6 for the first driver and 0.49 for the second. As these values are lower than the threshold, both drivers were successfully flagged as non-authorized users.
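The verification rule can be sketched with scikit-learn's Random Forest as below; the helper name and the averaging of probabilities over the new samples are illustrative assumptions.

```python
# Sketch of the verification rule: accept only if the maximum class
# probability from the trained Random Forest reaches the threshold
# (0.97 in the experiments); otherwise flag an unauthorized driver.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def verify_driver(clf, X_new, threshold=0.97):
    """Return (accepted, best_class, best_prob) for a new driving pattern."""
    probs = clf.predict_proba(X_new).mean(axis=0)  # average over new samples
    best = int(np.argmax(probs))
    return probs[best] >= threshold, clf.classes_[best], probs[best]
```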