Cross-dataset performance evaluation of deep learning distracted driver detection algorithms

Deep learning has gained traction due to its superior accuracy and its ability to learn features automatically from input data. However, deep learning algorithms can be flawed for many reasons, including the training dataset, the parameters, and the choice of algorithm. Few studies have evaluated the robustness of deep learning distracted driver detection algorithms; those that exist evaluate the algorithms on a single dataset and do not consider cross-dataset performance. This is a problem because cross-dataset performance is a strong indicator of a model's ability to generalise. Deploying a model in the real world without knowing its cross-dataset performance could lead to catastrophic events. This paper investigates the cross-dataset performance of deep learning distracted driver detection algorithms. Experimental results reveal that deep learning distracted driver detection algorithms do not generalise well to unseen datasets, particularly CNN models that use the whole image for prediction. The cross-dataset performance evaluations shed light on future research directions for developing robust deep learning distracted driver detection algorithms.


Introduction
The success of deep learning in other real-world applications such as number plate recognition for vehicle access control, has inspired the development of deep learning-based approaches to remedy the problem of distracted driver detection [1]. This move is done to reduce the number of distracted driver-related road accidents. The approaches proposed in the literature use many different techniques such as ensemble of convolutional neural networks (CNNs), combining CNN features and HOG features, and a hybrid of CNNs and recurrent neural networks (RNNs) [1]. In addition, different datasets are used for training and testing these approaches.
With continuing advancements in deep learning, it is imperative to evaluate the performance of proposed distracted driver detection algorithms. Such evaluations not only help generate reference work that can guide the selection of distracted driver detection algorithms, but may also provide important insights into the techniques, evaluation metrics, and datasets used for distracted driver detection. Insights at this level of detail might not be obtained from the original publications of the algorithms.
Currently, the literature lacks a comprehensive study that evaluates the cross-dataset performance of distracted driver detection algorithms [2,3]. Most approaches are published with comparative performance results. Such evaluations are not only limited in scope but also do not consider cross-dataset performance. Cross-dataset performance is important since it generally indicates the robustness and generalising ability of a learning model. The generalising ability of a model gives a good indication of how likely the model is to fail when deployed in a real-world system. This study seeks to answer one critical question: to what extent can deep learning distracted driver detection algorithms generalise to image datasets they were not trained on? This is addressed by evaluating the performance of state-of-the-art deep learning-based distracted driver detection algorithms on widely used benchmark datasets. Most importantly, an in-depth evaluation and analysis of the cross-dataset performance of the algorithms is carried out. The primary contributions of this work can be summarised as follows:
i. This is the first comprehensive study that evaluates the cross-dataset performance of deep learning-based distracted driver detection algorithms.
ii. Experimental results on widely used distracted driver detection image datasets are provided. In so doing, the issue of dataset bias is addressed, and the cross-dataset performance of the algorithms is analysed. Class activation maps are used to further analyse any performance differences.
iii. The work may serve as a reference that can guide the selection of distracted driver detection algorithms for different applications. Additionally, the article can generate research leads that can be pursued by other researchers.

Related work
Datasets. Datasets play a vital role in the successful application of deep learning to real-world problems, because deep learning algorithms establish patterns based on features learned from the training dataset. Such is also the case in the task of distracted driver detection. The first dataset in the area of driving behaviour analysis and distracted driving was introduced by Zhao et al. [4,5]. The dataset has side-view images of the driver performing four driver activities: (i) grasping the steering wheel; (ii) operating the shift lever; (iii) eating a cake; and (iv) talking on a cellular phone. A total of 20 participants, 10 male and 10 female, were involved in the development of the dataset. However, the dataset is not publicly available, and all the papers ([6-8]) that benchmarked using the dataset are affiliated with either Southeast University, Xi'an Jiaotong-Liverpool University, or Liverpool University, and they have at least one shared author [9].
The AUC Distracted Driver Dataset [9,13] was collected using a two-phase data collection method: in the first phase, the rear camera of an ASUS ZenFone smartphone (Model ZD551KL) was used, and in the second phase, the DS325 Sony DepthSense camera was used. In the project, 44 drivers from 7 different countries were involved, of which 29 were male and 15 were female. However, it has been reported that the AUC dataset is not balanced; for example, the reaching-behind class represents only 7% of the complete data points [14], whereas the normal driving class represents 21% of the complete dataset. In addition, not all drivers participated in all distraction activities. To remedy the shortcomings of the AUC dataset, Ezzouhri et al. [14] introduced a more balanced distracted driver detection dataset with 9 participants.
Several papers ([15-17]) compare the performance of a proposed method to other approaches. However, the focus of these papers is on the proposed algorithms, and the evaluations are not comprehensive. Recently, Ezzouhri et al.
[14] evaluated their proposed driver body-part segmentation-based distracted driver detection algorithm on their custom dataset and a widely used benchmark dataset (the AUC Distracted Driver Dataset [9,13]). The authors' main contributions were the proposed algorithm and the created dataset. Their cross-dataset performance evaluations were based on the AUC dataset only and a few CNN-based algorithms.
Recently, Kashevnik et al. [18] presented an extensive literature survey on distracted driver detection and outlined the entire chain of distracted driver detection from sensor data acquisition to data pre-processing, behaviour inference, and distraction type inference. Similarly, Huang et al. [19] provided an extensive literature survey on vision-based distracted driver detection algorithms. Although these studies are comprehensive and provide the current state of knowledge on distracted driver detection, none of them evaluates and analyses the performance of distracted driver detection algorithms. In another study, the authors of [2] presented a literature review on distracted driver detection algorithms and then proceeded to evaluate the performance of ten deep learning-based algorithms using the AUC Distracted Driver Dataset [9,13].

Algorithms
In this study, a total of six state-of-the-art algorithms with publicly available code, or for which the authors provided code upon request, were evaluated. Where code was not available, the authors implemented similar algorithms based on the original publications. State-of-the-art representative algorithms were selected based on performance results reported by other researchers [2,3]. In addition, representative, commonly used, and recent algorithms were selected. The selected deep learning distracted driver detection algorithms can be broadly grouped into the following approaches: transfer learning, CNNs combined with other features or a pre-processing stage, hybrids of CNNs with sequence models, and human pose estimation-based algorithms. Table 1 shows the complete list of algorithms that were evaluated, with the corresponding approach used.

Datasets
The primary objective of this study is to evaluate the cross-dataset performance of deep learning distracted driver detection algorithms. To achieve this objective, three distracted driver detection image datasets will be used: the AUC2 dataset, the driver distraction dataset introduced by Ezzouhri et al. [14] (EZZ2021), and the State Farm dataset. The AUC2 and State Farm datasets were selected based on their wide usage in benchmarking distracted driver detection algorithms. The datasets are relatively large and consider 9 distraction activities. The EZZ2021 dataset was recently introduced and is similar to the AUC2 and State Farm datasets, with 9 distracted driver classes and a safe driving class. Fig. 1 shows sample images from the EZZ2021 dataset. The different classes and driver postures in the three datasets are shown in Table 2. Table 3 shows the distracted driver detection image datasets that will be used in the study, with the corresponding environment in which each dataset was created (real or synthetic), the type of distractions, the number of drivers, and the size of each dataset.

Evaluation metrics
To evaluate an algorithm's ability to detect a distracted driver, classification accuracy will be compared. Accuracy is the simplest and most commonly used indicator of the performance of a machine learning algorithm: it gives the number of correct predictions a model has made over the total number of observations in the test set. In addition, to compare the performance of the algorithms per class, the weighted harmonic mean of the precision and recall metrics, i.e., the F-measure (F1-score), will be used. For further analysis, class activation maps (CAMs) will be used. CAMs help in understanding what a CNN "sees" and how it arrived at the final prediction. Specifically, an approach called Grad-CAM [25] will be used. Grad-CAM works by finding the final convolutional layer in the network and then examining the gradient information flowing into that layer. The output of Grad-CAM is a heatmap visualisation for a given class label (either the top predicted label or an arbitrary label selected for debugging). This heatmap can be used to visually verify where in the image the CNN is looking.
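For concreteness, both metrics can be computed directly from the predicted and true label lists. The sketch below is a minimal pure-Python illustration of the definitions used in this study (the class names are illustrative only; the actual experiments use the library implementations):

```python
from collections import Counter

def accuracy(y_true, y_pred):
    # Fraction of predictions that match the ground-truth labels.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def weighted_f1(y_true, y_pred):
    # F1 computed per class, then averaged with weights proportional
    # to each class's support (number of true instances).
    support = Counter(y_true)
    total = 0.0
    for c in support:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += support[c] * f1
    return total / len(y_true)
```

The per-class F1 values reported in the result tables follow the same definition, restricted to one class at a time (e.g., the safe driving class).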

Evaluation procedure
Each distracted driver detection image dataset was split into three sets: training, validation, and testing. Training sets were used for training, validation sets were used for hyperparameter tuning, and test sets were used for cross-dataset performance evaluation. Each algorithm was trained separately on each dataset and tested against all three datasets.
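This protocol amounts to filling a train/test matrix over the three datasets. The following schematic sketch shows the loop structure; `train_on` and `evaluate_on` are placeholder callables standing in for the actual training and scoring code, which is not reproduced here:

```python
def cross_dataset_matrix(datasets, train_on, evaluate_on):
    # datasets: mapping of dataset name -> {"train": ..., "val": ..., "test": ...}
    # Train one model per dataset, then score it on every test set,
    # yielding a score matrix indexed by (train_name, test_name).
    results = {}
    for train_name, splits in datasets.items():
        model = train_on(splits["train"], splits["val"])
        for test_name, test_splits in datasets.items():
            results[(train_name, test_name)] = evaluate_on(model, test_splits["test"])
    return results
```

The diagonal entries of the resulting matrix correspond to same-dataset performance, and the off-diagonal entries to the cross-dataset performance analysed in this study.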

Training procedure
Transfer learning approaches. ResNet50 and EfficientNetB0 architectures pre-trained on ImageNet were fine-tuned to each of the three datasets using a transfer learning framework. The top layers (head) were replaced by a GlobalAveragePooling2D layer, followed by a Dropout layer and a fully connected layer with 10 neurons. Table 4 shows the hyperparameters used for training.
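As a point of reference for the replaced head: GlobalAveragePooling2D collapses each spatial feature map to its mean, so the backbone's H x W x C output becomes a single C-dimensional vector before the 10-neuron dense layer. A minimal NumPy illustration of what that layer computes (the 7 x 7 x 3 shape is illustrative, not the actual backbone output):

```python
import numpy as np

def global_average_pool(feature_maps):
    # feature_maps: (height, width, channels) activation tensor from the
    # convolutional backbone; returns one averaged value per channel.
    return feature_maps.mean(axis=(0, 1))

# A fake 7x7 activation with 3 channels stands in for the backbone output.
fm = np.arange(7 * 7 * 3, dtype=float).reshape(7, 7, 3)
pooled = global_average_pool(fm)  # shape (3,): one scalar per channel
```

Because the pooled vector has no spatial dimensions, the same head works regardless of input image size, which is one reason this layer is common in transfer learning heads.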
Leekha_GrabCut. For the Leekha_GrabCut algorithm, an EfficientNetB0 model pre-trained on ImageNet was fine-tuned to the three image datasets. However, the GrabCut background removal algorithm was incorporated as a pre-processing stage in the data pipeline used for training the Leekha_GrabCut algorithm.
convLSTM. A convLSTM model with 4 ConvLSTM2D recurrent layers was used. Each ConvLSTM2D recurrent layer was followed by a MaxPooling3D layer and a Dropout layer. The MaxPooling3D layer reduces the dimensions of the frames and avoids unnecessary computations, while the Dropout layers help prevent overfitting the model to the data.
CNN LSTM. The CNN LSTM model was built using the AlexNet architecture and an LSTM layer with 50 units. A fully connected layer with 10 neurons and a softmax activation function was used for class prediction. For both the convLSTM and CNN LSTM models, the datasets were prepared as sequence data with five images.
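The sequence preparation for the two recurrent models can be sketched as a simple window that groups consecutive frames into length-5 clips. The sketch below assumes non-overlapping windows and drops incomplete trailing clips; the original paper does not state the stride, so that parameter is an assumption:

```python
def make_sequences(frames, seq_len=5, stride=5):
    # Group an ordered list of frames into fixed-length clips for the
    # convLSTM / CNN LSTM models. Incomplete trailing clips are dropped
    # so every sample fed to the model has exactly seq_len frames.
    clips = []
    for start in range(0, len(frames) - seq_len + 1, stride):
        clips.append(frames[start:start + seq_len])
    return clips
```

Each resulting clip is then labelled with the driver posture class of its frames before being fed to the sequence model.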
CNN-Pose. The CNN-Pose algorithm consists of an EfficientNetB0 architecture fine-tuned using transfer learning and a Random Forest machine learning model trained on human key points detected through pose estimation. The final prediction was a combination of the predictions from the CNN and Random Forest models, multiplied by two coefficients that sum to one. For this study, the coefficients were obtained using a grid search for each dataset. Table 5 shows the coefficients obtained for the CNN-Pose algorithm on the three datasets.
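The late-fusion step can be sketched as a convex combination of the two models' class-probability vectors, with the weight chosen by grid search on validation data. The function names and grid granularity below are illustrative, not the authors' code:

```python
import numpy as np

def fuse(cnn_probs, pose_probs, alpha):
    # Convex combination: alpha weights the CNN output and (1 - alpha)
    # the pose-based Random Forest output; the coefficients sum to one.
    return alpha * cnn_probs + (1 - alpha) * pose_probs

def grid_search_alpha(cnn_probs, pose_probs, labels, grid=np.linspace(0, 1, 101)):
    # Pick the fusion coefficient that maximises validation accuracy.
    best_alpha, best_acc = 0.0, -1.0
    for alpha in grid:
        preds = fuse(cnn_probs, pose_probs, alpha).argmax(axis=1)
        acc = (preds == labels).mean()
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc
    return best_alpha, best_acc
```

With one coefficient per dataset (as in Table 5), the search is run separately on each dataset's validation split.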
The models were implemented using Python 3.6, scikit-learn, NumPy, OpenCV-Python, PyTorch, and TensorFlow. The training and validation information of all algorithms is shown in Table 6.
For further analysis, the F1-score was used to compare the performance of the algorithms on the safe driving class. Table 7 through Table 12 show the results of the algorithms when trained and tested on each of the three datasets. It can be observed that all algorithms perform well in detecting a driver in a safe driving posture when the test set comes from the same dataset as the training set. In contrast, the algorithms do not do well in detecting a driver in a safe driving posture when the test set comes from a different dataset. These observations correspond to the observations made above based on the classification accuracy of the models. This was expected, since training and testing splits drawn from the same dataset generally share the same characteristics, such as the camera viewpoint, drivers, and cars used. The results also reveal that algorithms trained on the AUC2 dataset do not perform well across the three testing datasets. In addition, all algorithms perform better when tested on the EZZ2021 and STF test datasets than when tested on the AUC2 test dataset.
Based on Table 7 through Table 12, it can be observed that the CNN-Pose algorithm has the best overall performance across all three test datasets. The Leekha_GrabCut algorithm has the second-best performance, while the convLSTM and CNN LSTM algorithms have the worst performance on the three test datasets.
The detailed per-class and overall performance of all algorithms can be found in Appendix A of this paper. To understand which features are used by the CNN models when making predictions, Grad-CAM was used. Fig. 5 shows a sample output of Grad-CAM when applied to the ResNet50 models. Due to space limitations, Grad-CAM outputs of all the algorithms were not included in the paper. However, based on the Grad-CAM analysis, the following observations were made:
• The models seem to be looking for the right features or regions of the image when making a prediction. This is especially true for test sets that come from the same dataset as the training set.
• Although the models learn important features, they also learn features that are not important. This is especially the case when the whole image is used for training. For example, for the make-up class, the models look for hands on the head, the face, or an opened front mirror. This causes model confusion on images where the car has the front mirror opened, since the models take shortcuts.
• The models look for the position of the two hands (specifically, the forearms) in relation to the steering wheel when predicting the safe driving posture. The models get confused when they only see one forearm. Some images were taken too close to the driver, so the two arms are not clearly visible; as a result, the models seem to struggle when the driver is closer to the camera.
• The presence of a cell phone around the driver leads the models to predict classes that involve the presence of a cell phone.
The results and analysis above suggest that:
• The CNN-Pose model has better generalising ability than the other algorithms, as shown by its better cross-dataset performance. This can be attributed to the fact that the model takes advantage of both the rich features learnt by the CNN and human key points, which are less variable.
• The second-best model is Leekha_GrabCut.
The GrabCut algorithm removes background noise, forcing a model to focus on the body posture of the driver during training. This could explain why the Leekha_GrabCut algorithm obtains reasonable cross-dataset performance compared to the other algorithms: it reduces dataset-to-dataset variability by removing objects that are not important in detecting a distracted driver.
• The characteristics of the AUC2 dataset negatively affect the performance of algorithms trained on it. Based on the initial splits provided with the datasets, the major difference between the three datasets is that in the AUC2 dataset, drivers in the training set are not in the testing set. In contrast, in the EZZ2021 and STF datasets, drivers in the training sets are also in the testing sets. Moreover, in the AUC2 dataset, drivers do not participate in all driver posture activities. These differences could explain why models trained on the AUC2 training set do not perform well on the AUC2 test set, and why models trained on the EZZ2021 and STF training sets perform well on their respective test sets. The authors also attribute the poor performance of algorithms trained on the AUC2 dataset to the fact that the dataset is relatively large but not diverse. Each driver in the dataset performs the same driver activity more than 20 times with very little difference between the image frames. This creates an opportunity for shortcut learning, which can easily arise due to a systematic relationship between the driver and the background or context [26].
• CNN models that use the whole image, without background noise removal and without considering other, less variable features, do not generalise well to new data. This can be attributed to the fact that the three datasets are large but not diverse.
• In general, none of the distracted driver detection algorithms performs exceptionally well when tested on image datasets it was not trained on. This is especially the case for CNN models that use the whole image without background noise removal or without using less variable features. The authors attribute this overall poor cross-dataset performance to the datasets used for training: the datasets are relatively large but not diverse, i.e., they lack high data variance. As a result, deep learning distracted driver detection algorithms resort to shortcut learning, which significantly reduces their ability to generalise to new data.

Conclusions and future work
This work sought to find the extent to which deep learning distracted driver detection algorithms can generalise to new data that was not used for training. A cross-dataset performance evaluation study was carried out. Based on the analysis in section 4, it was found that, in general, deep learning distracted driver detection algorithms do not perform well on test sets that do not come from the same dataset as the training set. Based on the findings of the study, the authors suggest that future work should:
• Create large and diverse distracted driver detection image datasets. To reduce the effort required, synthetic image data generation using AI (for example, generative adversarial networks (GANs)) and CGI can be explored.
• Work towards creating features that are less variable from dataset to dataset. In the CNN-Pose model, the pose estimation model was given more weight than the CNN. This may suggest that using features derived from detected human key points (pose estimation) can result in a model with better cross-dataset performance.