Enhancing Road Safety: Real-Time Detection of Driver Distraction through Convolutional Neural Networks

As we navigate our daily commutes, the threat posed by distracted drivers looms large, contributing to a troubling rise in traffic accidents. Addressing this safety concern, our project harnesses the analytical power of Convolutional Neural Networks (CNNs), with a particular emphasis on the well-established VGG16 and VGG19 models. These models are acclaimed for their precision in image recognition and are meticulously tested for their ability to detect nuances in driver behavior under varying environmental conditions. Through a comparative analysis against an array of CNN architectures, this study seeks to identify the most efficient model for real-time detection of driver distractions. The ultimate aim is to incorporate the findings into vehicle safety systems, significantly boosting their capability to prevent accidents triggered by inattention. This research not only enhances our understanding of automotive safety technologies but also marks a pivotal step towards creating vehicles that are intuitively aligned with driver behaviors, ensuring safer roads for all.

I Introduction

I-A Background

Distracted driving has emerged as a significant safety hazard on roads worldwide, contributing to a great number of traffic accidents each year. With the advent of advanced computational technologies, there is a promising potential to mitigate these risks through the development of real-time detection systems. Convolutional Neural Networks (CNNs), renowned for their effectiveness in image recognition tasks, provide a foundational technology for analyzing complex visual behaviors associated with driving.

I-B Problem Statement

Despite ongoing efforts to enhance vehicular safety, existing systems find it difficult to handle the subtleties of driver distraction in various driving conditions and environments. Current technologies often rely on simplistic alert mechanisms or invasive monitoring techniques that may not accurately detect or predict momentary lapses in driver attention. This project seeks to address these limitations by employing a sophisticated CNN-based approach to detect and quantify driver distractions more effectively and unobtrusively.

I-C Objectives

The central aim of our project is to conduct a comprehensive analysis and testing of various CNN architectures for real-time detection of driver distraction. This includes:

Evaluating simple CNN architectures to establish baseline performance metrics.

Assessing both batchwise and non-batchwise fine-tuning of VGG16 and VGG19 models to determine their effectiveness in different processing environments.

Comparing the impacts of shallow versus deep configurations of VGG16 and VGG19 networks on the accuracy of distraction detection.

Developing and testing a custom CNN architecture with a transformer to tailor the detection system to the specific dynamics of driver behavior under varied conditions.

By exploring these methodologies, the project aims to provide a thorough understanding of how different CNN models can enhance the detection of driver distraction and thereby contribute significantly to road safety.

Following this section, a comprehensive related work section will contextualize our findings within the broader field of automotive safety technologies, comparing our innovative methods with existing approaches and setting a foundation for the subsequent implementation and evaluation phases.

II Related Work

In this section, we delve into significant prior studies that explore the use of convolutional neural networks in detecting driver distraction. Each of these studies contributes to the foundation upon which our research is built, and comparing their methodologies and findings with ours offers valuable insights into the evolution of this technology.

II-A Automatic Driver Distraction Detection Using Deep Convolutional Neural Networks

Md. Uzzol Hossain and colleagues (2022) [1] employed deep convolutional neural networks to identify driver distraction automatically. Their work utilized a series of complex models to process visual data from in-vehicle cameras. This study’s strength lies in its extensive dataset and the depth of its neural network analysis, which provides a solid benchmark for our project. However, our approach extends this by incorporating newer models like VGG19 and exploring batchwise versus non-batchwise processing to enhance real-time application capabilities.

II-B Detection of Distracted Driver Using Convolution Neural Network

The 2022 study by Narayana Darapaneni [2] outlines a CNN-based framework for detecting driver distraction that focuses on processing constraints in real time. Their methodology is aligned with our work, particularly in their use of a streamlined model for efficient computation. The insights from this study guide our exploration of computational efficiency in model training and real-time detection, providing a comparative perspective that enriches our approach.

II-C Detecting Distraction of Drivers Using Convolutional Neural Network

In this study, Sarfaraz Masood and his team (2018) [3] tackled the challenge of detecting driver distraction using a tailored CNN. This early work is crucial as it sets a methodological precedent for subsequent studies. Their findings, particularly regarding network configuration and the impact of non-standard data pre-processing techniques, have influenced our decision to experiment with various pre-processing methods and architectural tweaks to optimize performance.

Each of these studies contributes uniquely to the body of knowledge in detecting driver distraction via CNNs. By integrating their empirical findings with our methodological innovations, this literature review not only frames our research within the current scientific dialogue but also sets the stage for our contributions to advance the field further.

III Dataset

The dataset used in our project is the "State Farm Distracted Driver Detection" dataset [4], available through Kaggle. This dataset is pivotal for training and testing our convolutional neural network models to accurately identify different types of driver distractions, and it is widely used as a benchmark for driver-distraction data. It consists of images categorized into ten classes, each representing a specific form of distraction. Below is the description of each class and the count of its images in the dataset:

Label Description Count of Images
c0 Safe driving 2489
c1 Texting - right 2267
c2 Talking on the phone - right 2317
c3 Texting - left 2346
c4 Talking on the phone - left 2326
c5 Operating the radio 2312
c6 Drinking 2325
c7 Reaching behind 2002
c8 Hair and makeup 1911
c9 Talking to passenger 2129

Summing these counts gives a total of 22,424 images in the dataset. The distribution of images across the categories is relatively balanced, with counts ranging from 1911 for 'hair and makeup' to 2489 for 'safe driving', ensuring that our model trains on a diverse set of data.

III-A Initial Data Analysis

An initial analysis of the dataset was conducted to understand the properties of the images it contains. The color distribution of pixel intensities across the RGB channels was examined, with the following observations:

[Figure: pixel-intensity distribution across the RGB channels of the dataset images]

Peaks at the Edges: The figure shows prominent spikes at the minimum and maximum intensity values (0 and 255), indicating significant occurrences of very dark and very bright pixels. This suggests potential issues with overexposure or underexposure in various images.

Middle Lows: There is a noticeable lack of pixel intensity values in the mid-range, which can imply a loss of detail in mid-tones, possibly affecting the richness of visual information.

High Contrast: The pronounced peaks at the edges combined with lower frequencies in the mid-range highlight a high-contrast nature in the images, which could enhance the differentiation of important features but may also lead to challenges in distinguishing subtler details.

Balanced Colors: The alignment of red, green, and blue channels in the distribution pattern suggests that the images are well-balanced in terms of color, not overly dominated by any single color channel.

This preliminary analysis is critical as it informs the necessary pre-processing steps we must undertake to standardize the images for optimal CNN performance, such as adjusting brightness levels and enhancing contrast to ensure that important features are well-represented.
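For reference, the kind of channel-wise histogram described above can be produced with a short script along the following lines. This is an illustrative sketch, not the project's exact code; the function name and the way image paths are sampled are assumptions:

import cv2
import numpy as np
import matplotlib.pyplot as plt

def plot_rgb_histogram(image_paths):
    # Accumulate per-channel intensity counts over a sample of dataset images.
    counts = np.zeros((3, 256))
    for path in image_paths:
        img = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
        for channel in range(3):
            hist, _ = np.histogram(img[..., channel], bins=256, range=(0, 256))
            counts[channel] += hist
    # Plot one curve per colour channel.
    for channel, colour in enumerate(['red', 'green', 'blue']):
        plt.plot(counts[channel], color=colour, label=colour)
    plt.xlabel('Pixel intensity (0-255)')
    plt.ylabel('Frequency')
    plt.legend()
    plt.show()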

IV Methodology

IV-A System Design and Code Structuring

Our project makes use of a system we designed to maximize flexibility and efficiency in model experimentation. Central to our approach is a streamlined code structure that allows for the seamless integration of various CNN architectures with minimal changes required between experiments.

The base of our code remains consistent across all models, handling data loading, pre-processing, and augmentation, as well as the configuration of the training environment. This design simplifies switching between different neural network models and ensures that each model is evaluated under the same conditions, with consistent hyperparameters and the same overall environment.

Consistent Code Base: The following components are standardized in our system:

Data Pre-processing and Augmentation: We utilize an ImageDataGenerator for real-time data augmentation, enhancing model robustness by simulating various real-world distortions.

Model Training and Evaluation Setup: Model checkpointing and early stopping are implemented to optimize training phases and prevent overfitting, ensuring that each model’s performance is maximized.

Performance Visualization: After training, we plot the training and validation/test accuracy and loss to assess model performance over epochs.

Variable Components: Model initialization functions are the primary variable in our system. By designing the code so that only the model initialization segment changes, we can easily deploy different CNN architectures, such as variants of VGG16, VGG19, and any other model we wish to evaluate, without affecting other parts of the code base.

Based on the above discussion, the figure below gives a high-level overview of what our system looks like:

[Figure: high-level overview of the system design]

As shown above, the data pre-processing stage remains constant across the board, with changes made only to the model functions. Two other key features are discussed in the subsections below:

IV-A 1 Model Experimentation

Each model is instantiated using a specific initialization function, which defines the architecture and compilation settings. This approach allows for straightforward comparison across models, focusing on:

Architecture Efficiency: By comparing the different architectures, we assess the trade-offs between computational demand and predictive performance.

Hyperparameter Consistency: Hyperparameters such as learning rates and batch sizes are kept consistent, except where adjustments are necessitated by early stopping criteria based on validation loss.

IV-A 2 Implementation of Early Stopping

Early stopping is a critical component of our training strategy, used to halt the training process if the validation loss does not improve for a predefined number of epochs. This mechanism not only saves computational resources but also aids in preventing the overfitting of models to the training data.
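In the Keras-based experiments this is handled through callbacks. A minimal sketch is shown below; the patience value and the checkpoint file name are illustrative assumptions rather than the exact values used in our configuration:

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Halt training when validation loss stops improving and keep the best weights seen so far.
callbacks = [
    EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True),
    ModelCheckpoint('best_model.h5', monitor='val_loss', save_best_only=True),
]
# Passed to model.fit(...), e.g.:
# model.fit(train_generator, validation_data=val_generator, epochs=20, callbacks=callbacks)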

IV-B Data Preprocessing and Augmentation

The effectiveness of our Convolutional Neural Network models is heavily dependent on the quality and variety of the data they are trained on. To ensure our models are capable of generalizing well to real-world scenarios, we implement comprehensive data pre-processing and augmentation techniques. These techniques are designed to simulate realistic variations and distortions that might occur in real driving conditions.

Pre-processing Functions: Our pre-processing pipeline includes functions to dynamically enhance the brightness and contrast of the input images. This is achieved through the following steps:

Enhance Brightness: The brightness of images is adjusted by converting the images to the HSV color space and manipulating the value channel. This helps in simulating different lighting conditions that a driver might experience.

Change Contrast: The contrast of images is modified using a scaling factor to adjust the intensity of each pixel. This ensures that the model will be able to identify relevant features under different possible visual contrasts.

These pre-processing functions are integrated into our data generators as callable transformations, which apply a random combination of brightness and contrast adjustments to each image during training and validation. This randomness helps in enhancing the model’s ability to generalize over diverse visual representations.
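A sketch of what such callable transformations can look like is given below. The function names and the random adjustment ranges are illustrative assumptions; the project code may differ in detail:

import cv2
import numpy as np

def enhance_brightness(image, delta):
    # Shift the V (value) channel in HSV space to simulate different lighting conditions.
    hsv = cv2.cvtColor(image.astype(np.uint8), cv2.COLOR_RGB2HSV).astype(np.int16)
    hsv[..., 2] = np.clip(hsv[..., 2] + delta, 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2RGB)

def change_contrast(image, factor):
    # Scale pixel intensities around mid-grey to raise or lower contrast.
    return np.clip(128 + factor * (image.astype(np.float32) - 128), 0, 255).astype(np.uint8)

def random_brightness_contrast(image):
    # Callable transformation applied by the data generators during training and validation.
    image = enhance_brightness(image, delta=np.random.randint(-40, 41))
    return change_contrast(image, factor=np.random.uniform(0.8, 1.2)).astype(np.float32)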

Augmentation Techniques: To further augment our dataset, we employ several image transformation techniques, which include:

Random Rotations and Horizontal Flips to simulate different angles and orientations of the driver relative to the camera.

Random Affine Transformations provide slight translations, shears, and scaling to replicate the effect of camera shifts and zooms.

Randomized Brightness and Contrast Adjustments are applied to ensure diversity in lighting and exposure.

We utilize two distinct setups for our data handling:

Keras ImageDataGenerator: For experiments utilizing TensorFlow and Keras, we configure an ImageDataGenerator that applies our pre-processing functions in real-time during model training, reducing memory overhead and introducing realistic variations.

Torchvision Transforms: For PyTorch-based implementations, we use a composition of transforms that include custom classes for brightness and contrast adjustments alongside standard transformations provided by torchvision .

We used two different libraries to achieve essentially the same output in order to become comfortable with both, purely for educational purposes. In practice, both data handlers produce essentially equivalent outputs, as the sketch below illustrates.
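As a rough illustration, the two pipelines can be set up along the following lines; the specific augmentation parameters (rotation range, shift fractions, jitter strengths, validation split) are assumptions for the sketch rather than our exact settings, and the callable pre-processing refers to the functions sketched earlier:

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from torchvision import transforms

# Keras pipeline: real-time augmentation plus the callable pre-processing defined earlier.
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    shear_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True,
    preprocessing_function=random_brightness_contrast,
    validation_split=0.2,
)

# PyTorch pipeline: an equivalent composition of torchvision transforms.
torch_train_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomRotation(15),
    transforms.RandomHorizontalFlip(),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1), shear=10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])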

These pre-processing and augmentation strategies are crucial for training deep learning models that are effective and reliable in diverse and unpredictable real-world driving conditions. They not only enhance the data’s variability but also improve the model’s robustness to overfitting, thereby increasing its practical applicability in real-time systems.

IV-C Model Initialization and Training

So far we have covered the initial phases of our system; we now turn to the models we compared, describing each one and its architecture. In the results section we then provide an extensive analysis of each model's performance, followed by a comparison of the testing we did.

The following sections will feature the various models that we tried and tested, and the results we were able to produce.

V Model Architectures

In this section, we look at the various CNN architectures developed and evaluated as part of our study. The diversity in model architectures is intended to explore a range of complexities and computational efficiencies, thereby identifying the most suitable model for real-time application in diverse driving environments.

V-A SimpleCNN

The SimpleCNN model represents our baseline architecture. It is designed to be lightweight yet effective, suitable for environments where computational resources are limited.

V-A 1 Architecture

The SimpleCNN comprises three convolutional layers, each followed by a ReLU activation and a max pooling operation. The overall architecture of the CNN can be seen below:

[Figure: SimpleCNN architecture]

Here is the breakdown based on the above figure:

First Layer: The initial layer uses 32 filters with a kernel size of 3x3 and padding to maintain the size of the output feature map. A max pooling operation with a stride of 2 reduces the spatial dimensions by half.

Second Layer: This layer increases the filter count to 64, enhancing the network’s ability to capture more complex features from the input images. It follows the same pattern of convolution, activation, and pooling.

Third Layer: The complexity increases further with 128 filters. This layer is crucial for extracting fine-grained details that are essential for accurate classification.

Following the convolutional base, the network transitions to fully connected layers:

Flatten Layer: The output from the last pooling layer is flattened into a vector to serve as input for the dense layers.

Fully Connected Layers: A sequence of two dense layers processes the flattened features, with the first layer consisting of 512 units followed by the final output layer that maps to the number of classes (10 in this case).
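Putting the layers described above together, a Keras sketch of this baseline is shown below; the input shape is an illustrative assumption, while the filter counts, dense width, and class count follow the description:

from tensorflow.keras import layers, models

def build_simple_cnn(input_shape=(224, 224, 3), num_classes=10):
    # Three conv/ReLU/max-pool blocks followed by a small dense classifier.
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), padding='same', activation='relu', input_shape=input_shape),
        layers.MaxPooling2D(pool_size=(2, 2), strides=2),
        layers.Conv2D(64, (3, 3), padding='same', activation='relu'),
        layers.MaxPooling2D(pool_size=(2, 2), strides=2),
        layers.Conv2D(128, (3, 3), padding='same', activation='relu'),
        layers.MaxPooling2D(pool_size=(2, 2), strides=2),
        layers.Flatten(),
        layers.Dense(512, activation='relu'),
        layers.Dense(num_classes, activation='softmax'),
    ])
    return model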

V-A 2 Rationale

SimpleCNN is designed as a straightforward and computationally efficient model that can be easily deployed in real-time systems with limited hardware capabilities. Its simplicity allows for rapid training and inference, making it an ideal candidate for initial experiments and as a benchmark for more complex architectures. By using a moderate number of layers and parameters, it balances the trade-off between performance and computational demand, making it suitable for scenarios where real-time processing speed is critical.

V-B VGG16 Architectures

The VGG16 architecture, known for its effectiveness in large-scale image recognition tasks, was chosen for its depth and robust feature extraction capabilities, making it highly suitable for complex tasks like this one. We experimented with three distinct implementations of this architecture: a deep model, a shallow model, and fine-tuning adaptations, to explore how different configurations impact performance.

V-B 1 Deep VGG16 Model

The deep configuration uses the VGG16 model as a base, extended with additional dense layers to increase the model’s capacity for learning from our specific dataset.

The model begins with the VGG16 base, pre-trained on ImageNet, excluding the top layer to tailor the outputs to our ten classes.

It includes a Flatten layer to transform the feature maps into a 1D array.

Followed by a dense layer with 500 units and ReLU activation for non-linear transformation.

A dropout of 0.5 is included to prevent overfitting.

The output layer is a dense layer with softmax activation suited for multi-class classification.

Rationale: This model is designed to utilize the deep feature extraction capabilities of VGG16, enhancing it with additional trainable parameters to better adapt to the intricate nature of driver behavior recognition.
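A sketch of this configuration in Keras might look as follows; the input shape is an assumption, while the added layers follow the description above:

from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

def build_deep_vgg16(input_shape=(224, 224, 3), num_classes=10):
    # Pre-trained VGG16 base without its top (ImageNet classifier) layers.
    base = VGG16(weights='imagenet', include_top=False, input_shape=input_shape)
    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(500, activation='relu'),  # extra capacity adapted to our dataset
        layers.Dropout(0.5),                   # regularization against overfitting
        layers.Dense(num_classes, activation='softmax'),
    ])
    return model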

V-B 2 Shallow VGG16 Model

The shallow model modifies the deep VGG16 by reducing the complexity of the layers added to the base, aiming to maintain efficiency while still capturing essential features.

Uses the same VGG16 base but sets it as non-trainable to focus learning in the newly added layers.

The model includes a single dense layer with 256 units, which is fewer than in the deep model, reducing the learning capacity but also the risk of overfitting.

Includes dropout and softmax activation as in the deep model.

Rationale: The shallow model is particularly valuable for scenarios where computational resources are limited or when quicker inference is needed without a substantial drop in accuracy. It tests whether a less complex model can efficiently handle the task.
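A corresponding sketch of the shallow variant, again with the input shape as an assumption, is:

from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

def build_shallow_vgg16(input_shape=(224, 224, 3), num_classes=10):
    base = VGG16(weights='imagenet', include_top=False, input_shape=input_shape)
    base.trainable = False                      # learning happens only in the newly added layers
    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(256, activation='relu'),   # single, smaller dense layer than the deep variant
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation='softmax'),
    ])
    return model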

V-B 3 Fine-Tuned VGG16 Model

Fine-tuning allows us to tailor pre-trained networks to our specific task more finely. This setup involved selectively retraining the deeper layers of the VGG16 network.

The base model’s top layers are fine-tuned while earlier layers remain frozen, balancing the learning of high-level features without altering the robust initial feature detection.

Includes GlobalAveragePooling to reduce dimensionality and manage model complexity.

The model is compiled with an SGD optimizer, suitable for fine-tuning due to its finer control over learning rates.

Rationale: Fine-tuning was tested in both batch-wise and non-batch-wise configurations to assess the impact of training dynamics on performance. Batch-wise fine-tuning restricts adjustments to the last few layers, conserving most of the pre-trained features, while non-batch-wise fine-tuning allows the entire network to adjust to new data.

Purpose of Fine-Tuning: This approach was chosen to explore how much of the pre-trained model’s knowledge could be preserved while still adapting to the specific challenges of detecting driver distractions, potentially leading to improved accuracy with minimal training time compared to training a model from scratch.
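A hedged sketch of the fine-tuned setup is shown below; the number of unfrozen layers, the learning rate, and the momentum value are illustrative assumptions, while the frozen base, GlobalAveragePooling, and SGD optimizer follow the description above:

from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models, optimizers

def build_finetuned_vgg16(input_shape=(224, 224, 3), num_classes=10, trainable_layers=4):
    base = VGG16(weights='imagenet', include_top=False, input_shape=input_shape)
    # Freeze everything except the last few layers of the pre-trained base.
    for layer in base.layers[:-trainable_layers]:
        layer.trainable = False
    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(num_classes, activation='softmax'),
    ])
    # SGD gives finer control over the learning rate during fine-tuning.
    model.compile(optimizer=optimizers.SGD(learning_rate=1e-3, momentum=0.9),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model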

These variations of the VGG16 architecture provide a broad spectrum of insights into how different depths and training strategies can be optimized for the task of driver distraction detection.

V-C VGG19 Architectures

Building on the insights from VGG16, we extended our experiments to include VGG19, a deeper variant of the VGG architecture known for its enhanced performance in image recognition tasks. We implemented three versions of this architecture, mirroring the VGG16 setups: a deep model, a shallow model, and fine-tuning adaptations, to explore their effectiveness for driver distraction detection.

V-C 1 Deep VGG19 Model

The deep VGG19 model incorporates the VGG19 base, known for its depth and complexity, to capture more detailed feature representations.

Starts with the VGG19 base model pre-trained on ImageNet, with the top layers removed to adapt to our specific classification needs.

Includes a Flatten layer to convert the feature maps into a vector.

A dense layer with 500 units followed by a ReLU activation introduces non-linearity and capacity to learn complex patterns.

Dropout of 0.5 is incorporated to reduce overfitting.

The model concludes with a softmax activation layer tailored for our 10-class detection task.

Rationale: This model leverages the deeper network structure of VGG19 to process more complex image features, which is expected to improve the accuracy in distinguishing between various types of driver distractions.

V-C 2 Shallow VGG19 Model

The shallow model uses the VGG19 architecture but simplifies the layers added after the base to maintain efficiency.

Utilizes the VGG19 base with frozen weights to focus learning in the newly introduced simple layers.

A single dense layer with 256 units is used, reducing the complexity and computational demand compared to the deep model.

Maintains dropout to avoid overfitting and uses softmax for the classification output.

Rationale: The shallow VGG19 model is designed to test the hypothesis that less complexity might still yield high accuracy, particularly beneficial in environments where computational resources or response time are constraints.

V-C 3 Fine-Tuned VGG19 Model

Fine-tuning is applied to the VGG19 base to tailor the deeper layers to our dataset while keeping the initial layers fixed.

Fine-tuning starts from the fourth-last layer, with earlier layers left unchanged to preserve their pre-trained feature-detection capabilities.

Incorporates GlobalAveragePooling to reduce the feature dimensions effectively.

Ends with a dense layer for class prediction.

Rationale: Fine-tuning was implemented in both batch-wise and non-batch-wise settings to evaluate how different training strategies affect the model’s ability to adapt to the specific task of detecting driver distractions. Batch-wise fine-tuning is particularly useful for maintaining the integrity of most learned features while updating only the most crucial layers for the task.

Purpose of Using VGG19: Choosing VGG19 allowed us to assess the impacts of using a deeper network compared to VGG16, providing insights into whether additional depth translates to better performance for this specific application. It also serves as a comparative study to determine the optimal balance between depth and computational efficiency in practical applications.

These VGG19 models expand our understanding of how varying depths and fine-tuning approaches can be optimized for enhancing the detection capabilities in driver distraction tasks.

V-D Hybrid CNN-Transformer Architecture

The culmination of our exploration into neural network architectures involves a sophisticated Hybrid CNN-Transformer model. This model combines the strengths of CNNs and the transformer architecture to create a powerful tool for image-based classification tasks.

V-D 1 Architecture Overview

The Hybrid CNN-Transformer architecture is designed to leverage the detailed feature extraction capabilities of CNNs with the global context retention of transformers. The specifics of the architecture are as follows:

ResNet Backbone: The model utilizes a pre-trained ResNet50 as the CNN component, known for its effectiveness in image recognition. The final fully connected layer of the ResNet50 is removed to prepare the feature maps for the transformer.

Dimension Reduction: A linear layer reduces the dimensionality of the ResNet50’s output from 2048 to 512, making it suitable for processing by the transformer.

Transformer Encoder: The transformer component consists of an encoder with multiple layers, each containing multi-headed self-attention mechanisms that help the model understand the global dependencies within the image.

Classification Head: The final output of the transformer encoder is processed through an average pooling layer followed by a fully connected layer to predict the class labels.

V-D 2 Rationale for Hybrid Architecture

This hybrid model is crafted to address specific challenges in driver distraction detection:

Feature Extraction: The ResNet50 backbone captures complex spatial hierarchies in the image data, which are crucial for identifying various forms of driver distraction.

Contextual Awareness: The transformer part of the model integrates these features over the entire image, considering both local and global contexts, which is vital for understanding scenarios where multiple distractions might be present.

Flexibility and Depth: Combining CNNs with transformers allows the model to not only recognize but also interpret the significance of various features across different parts of the image, enhancing its predictive accuracy.

V-D 3 Implementation Details

This architecture represents a forward-thinking approach in neural network design for image classification, combining proven techniques with innovative structures to tackle the complex problem of detecting driver distractions effectively.
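As a concrete, if approximate, rendering of the description above, a PyTorch sketch of the hybrid model is given below. The number of encoder layers, the number of attention heads, and the way the ResNet50 feature map is turned into a token sequence are our assumptions; only the overall structure (ResNet50 backbone, 2048-to-512 projection, transformer encoder, average pooling, classification head) follows the text:

import torch.nn as nn
from torchvision import models

class HybridCNNTransformer(nn.Module):
    def __init__(self, num_classes=10, d_model=512, nhead=8, num_layers=2):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        # Drop the average-pooling and fully connected layers so the spatial
        # feature map (2048 x 7 x 7 for a 224x224 input) is kept for the transformer.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.proj = nn.Linear(2048, d_model)  # reduce channel dimension from 2048 to 512
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, x):
        feats = self.backbone(x)                   # (B, 2048, H', W')
        tokens = feats.flatten(2).transpose(1, 2)  # one token per spatial location: (B, H'*W', 2048)
        tokens = self.proj(tokens)                 # (B, H'*W', d_model)
        encoded = self.encoder(tokens)             # multi-head self-attention over the tokens
        pooled = encoded.mean(dim=1)               # average pooling over the token sequence
        return self.classifier(pooled)             # class logits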

V-E Framework Utilization Rationale

In our project, we opted to implement our preprocessing and augmentation pipelines as well as some models in both Keras (using TensorFlow as the backend) and PyTorch. This dual-implementation approach was chosen deliberately to achieve a comprehensive understanding of the functionalities and advantages of both leading deep learning libraries.

Consistency Across Frameworks: Although the implementations differ in syntax and library-specific functions, the core pre-processing and augmentation functionalities are maintained consistently across both Keras and PyTorch. This ensures that the output—augmented and pre-processed images—is equivalent, regardless of the framework used. This consistency allows for an unbiased comparison of model performance that is strictly attributable to the model architectures and not influenced by data variations.

Educational Value: The choice to utilize both frameworks also stems from an educational perspective. By exploring the same tasks with different tools, we gain insights into the strengths and limitations of each framework; a simple example is that Keras shows a per-epoch progress bar by default, whereas in PyTorch we had to wrap the training loop with tqdm. This experience is invaluable for our team's skill development and provides a broader perspective on solving machine learning problems in practice.

Practical Implications: From a practical standpoint, understanding how to implement similar tasks in different frameworks enhances our flexibility and adaptability in the field. It prepares our team to work with diverse technologies and adapt to various technical environments in future projects or professional settings.

This dual-framework approach not only enriches our technical proficiency but also ensures that our findings and conclusions are robust, backed by the capability to replicate results across different software environments, thereby reinforcing the reliability and validity of our research outcomes.

VI Training and Validation Results

This section presents the training and validation results for each of the models discussed: SimpleCNN, VGG16 and its variants, VGG19 and its variants, and the Hybrid CNN-Transformer. We evaluate each model's performance based on accuracy, loss, and other relevant metrics over the course of the training epochs. This analysis helps in understanding how each model learns and generalizes from the training data to the validation data set. Note that environmental parameters such as learning rate, number of epochs, early stopping, and patience were kept constant across all models.

VI-A Performance Metrics

The performance of each model was evaluated using the following metrics calculated during training and validation phases:

Average Loss Calculation: the loss reported for each epoch is the per-sample loss averaged over all N examples in the split, i.e. average loss = (1/N) * Σ_i loss(y_i, ŷ_i).

Accuracy Calculation: accuracy is the fraction of correctly classified examples, i.e. accuracy = (number of correct predictions) / N.

VI-B Tabular Summary

The performance of various models during the training and validation phases is presented in the following table:

Model TrainAcc TrainLoss ValAcc ValLoss
Simple CNN 0.9300 0.0363 0.9500 0.5900
VGG16 Deep 0.9952 0.0153 0.9933 0.0300
VGG16 Shallow 0.9612 0.0388 0.9937 0.0600
VGG16 FT-B 0.9905 0.0438 0.9871 0.0540
VGG16 FT-NB 0.9943 0.0246 0.9895 0.0376
VGG19 Deep 0.9952 0.0153 0.9949 0.0233
VGG19 Shallow 0.9592 0.1372 0.9920 0.0345
VGG19 FT-B 0.9921 0.0340 0.9868 0.0475
VGG19 FT-NB 0.9943 0.0246 0.9895 0.0376
Hybrid Model 0.9919 0.0280 0.9918 0.0358

This table provides a comprehensive overview of how each model performed in terms of accuracy and loss during the training and validation stages. Graphs and more detailed discussion of each model's performance trends and anomalies follow in the subsequent sections. A brief note: unless stated otherwise, the x-axis of each graph denotes epochs, while the y-axis denotes either accuracy or loss.

VI-C SimpleCNN

The SimpleCNN design was intended as a baseline. As can be seen from the training and validation loss trends below, the model performs as expected overall, with a consistent decline in loss as training progresses.

[Figure: SimpleCNN training and validation loss]

VI-D VGG16 - Deep Network

The VGG16 Deep network, as mentioned in previous sections, is well proven for image recognition tasks; the architecture we implemented, however, is extended with additional dense layers to increase its learning capacity on our dataset.

Based on the graphs below for training and validation accuracy and loss, the model's performance was steady and consistent with expectations; it is worth noting, however, that the training time increased substantially.

[Figure: VGG16 Deep training and validation accuracy and loss]

Note that only 18 epochs are shown, as the early stopping mechanism of our system kicked in.

VI-E VGG16 - Shallow Network

The VGG16 Shallow network we trained reduces the complexity of the VGG16 Deep network by focusing learning on the newly added layers rather than the base VGG16 layers. Based on the graphs below for training and validation accuracy and loss, the model's performance was steady and consistent with expectations.

[Figure: VGG16 Shallow training and validation accuracy and loss]

VI-F VGG16 - Fine Tuned Network - With and Without Batching

This model freezes the early layers while the top layers of the base VGG16 model are fine-tuned, balancing new learning against preserving the initial feature detection.

Both the batched and non-batched models perform well; however, in terms of compute time, the non-batched version takes longer to reach the same results as the batched version.

[Figure: VGG16 fine-tuned (batched and non-batched) training and validation accuracy and loss]

VI-G VGG19 - Deep Network

The VGG19 Deep network makes use of the base VGG19 model with additional flatten and dense layers, leveraging a deeper network for training on this dataset.

[Figure: VGG19 Deep training and validation accuracy and loss]

VI-H VGG19 - Shallow Network

The VGG19 Shallow network we trained reduces the complexity of the VGG19 Deep network by focusing learning on the newly added layers rather than the base VGG19 layers. Based on the graphs below for training and validation accuracy and loss, the model's performance was steady and consistent with expectations.

[Figure: VGG19 Shallow training and validation accuracy and loss]

VI-I VGG19 - Fine Tuned Network - With and Without Batching

This model freezes the early layers while the top layers of the base VGG19 model are fine-tuned, balancing new learning against preserving the initial feature detection.

[Figure: VGG19 fine-tuned (batched and non-batched) training and validation accuracy and loss]

VI-J Hybrid CNN-Transformer

The hybrid architecture combines the best of both worlds, utilizing a CNN backbone based on a pre-trained ResNet50 along with a Transformer encoder. The idea is to test how computationally expensive this combination is and whether it can be a sustainable model.

The graphs below show quite interesting results: even though we set the initial number of epochs to 20, early stopping kicked in after only 5 epochs owing to the model's performance. Despite a shaky start on the training data in the first epoch, there was a sharp increase in accuracy and decrease in loss by the second epoch.

[Figure: Hybrid CNN-Transformer training and validation accuracy and loss]

Note that only 5 epochs are shown, as the early stopping mechanism of our system kicked in. This is quite an interesting result, given that the hybrid model required the least training to achieve such performance.

VII Testing Results

After the initial evaluation during training and validation, we subjected the models to a separate test dataset (derived from the same StateFarm Dataset) to evaluate their real-world applicability and robustness. This testing phase is crucial to verify the generalization capabilities of each model outside the controlled conditions of the training environment.

This test dataset was formulated by taking 10 random images from each of the 10 classes provided in the dataset in order to test two things:

Accuracy of the Saved Models

Time taken by models to evaluate the 100 test images

The saved models were loaded one-by-one and evaluated on the same dataset.
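A simplified sketch of this evaluation loop for the Keras models is shown below; the function and variable names are illustrative, and the PyTorch-based hybrid model would need its own loading and evaluation path:

import time
from tensorflow.keras.models import load_model

def time_saved_models(saved_models, test_generator):
    # saved_models: {display name: path to a saved Keras model}; names and paths are illustrative.
    results = {}
    for name, path in saved_models.items():
        model = load_model(path)
        start = time.time()
        _, accuracy = model.evaluate(test_generator, verbose=0)
        results[name] = {'accuracy': accuracy,
                         'elapsed_seconds': round(time.time() - start, 2)}
    return results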

The results can be seen below in tabular format:

Model Accuracy Elapsed Time (seconds)
Simple CNN 0.89 10.56
VGG16 Deep 0.95 10.95
VGG16 Shallow 0.93 9.14
VGG16 FT Batched 0.96 9.07
VGG16 FT Non-Batched 0.96 10.05
VGG19 Deep 0.94 10.68
VGG19 Shallow 0.94 9.68
VGG19 FT Batched 0.98 8.89
VGG19 FT Non-Batched 0.97 9.46
Hybrid CNN Transformer 0.98 11.05

FT above is an abbreviation of Fine-Tuned.

To make the above table easier to interpret, we also visualize it as a graph:

[Figure: test accuracy and elapsed time per model]

While the previous graph is insightful, visualizing accuracy and elapsed time together in a single graph makes the trade-off easier to understand:

[Figure: combined view of test accuracy and elapsed time]

From the above figures, comparing the various deep learning models against the Simple CNN baseline, it is clear that advanced models like VGG16, VGG19, and the Hybrid CNN Transformer not only improve in accuracy but also show distinct differences in processing time. The VGG models, particularly when fine-tuned, strike an effective balance between high accuracy and reasonable inference speed. Notably, the fine-tuned VGG19 demonstrates the best optimization, achieving the highest accuracy with a comparatively low elapsed time. The Hybrid CNN Transformer, while slightly slower, matches this high accuracy. These insights underscore the trade-offs between speed and accuracy in model selection, emphasizing the need for careful consideration based on specific application requirements.

VIII Conclusion

Our study examined several deep learning architectures to identify the most effective model for real-time driver distraction detection. The VGG19 FineTuned Batched model demonstrated the highest accuracy (0.98) among the tested models, closely followed by the Hybrid CNN Transformer, which also showed excellent performance (0.98 accuracy) but with slightly higher processing time. The results indicate a significant trade-off between model accuracy and evaluation time, which is critical for real-time applications where both accuracy and speed are crucial.

VIII-A Accuracy and Evaluation Time Trade-off

Our findings showcase the importance of considering both accuracy and processing speed when selecting a model for deployment in real-world scenarios. While a complex model like the Hybrid CNN Transformer offers high accuracy, its longer evaluation time may not be suitable for all real-time applications; the fine-tuned VGG19, by contrast, paired the highest accuracy with one of the lowest evaluation times. On the other hand, simpler models such as the VGG16 Shallow provide faster evaluation but at the cost of reduced accuracy.

VIII-B Hardware Capabilities

The deployment of these models in a real-time environment also demands careful consideration of the hardware capabilities. Advanced models require powerful processing hardware, which may increase the cost and energy consumption of the deployed system. Therefore, balancing the hardware efficiency with model complexity is essential for practical implementations.

In conclusion, while the current models show promising results, there still remains substantial scope for innovation in optimizing them for practical, real-time applications where both accuracy and speed are paramount.

References

  • [1] Md. Uzzol Hossain, Md. Ataur Rahman, Md. Manowarul Islam, Arnisha Akhter, Md. Ashraf Uddin, Bikash Kumar Paul, "Automatic driver distraction detection using deep convolutional neural networks," Elsevier, April 2022.
  • [2] Narayana Darapaneni, Jai Arora, MoniShankar Hazra, Naman Vig, Simrandeep Singh Gandhi, Saurabh Gupta, "Detection of Distracted Driver using Convolution Neural Network," arXiv, 2022.
  • [3] Sarfaraz Masood, Abhinav Rai, Aakash Aggarwal, M. N. Doja, "Detecting Distraction of Drivers using Convolutional Neural Network," Pattern Recognition Letters, vol. 139, doi:10.1016/j.patrec.2017.12.023, January 2018.
  • [4] State Farm Distracted Driver Detection dataset - https://www.kaggle.com/c/state-farm-distracted-driver-detection
  • [5] CNN for Image Classification - https://github.com/vzhou842/cnn-from-scratch
  • [6] Image Classification with CNN - https://github.com/Ebrutokgoz/Image-Classification-with-CNN
  • [7] CNN Driver Detection - https://github.com/Abhinav1004/Distracted-Driver-Detection
  • [8] CNN Driver Drowsiness Detection - https://github.com/akshaybahadur21/Drowsiness-Detection
