Towards the Advancement of Violence Recognition in Security Footage with Explainable Neural Networks
Date of Award
Spring 4-11-2025
Document Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Electrical and Computer Engineering
First Advisor
Edwin Yaz
Second Advisor
Cristinel Ababei
Third Advisor
Edwin Yaz
Fourth Advisor
Henry Medeiros
Fifth Advisor
Richard Povinelli; Susan Schneider
Abstract
This dissertation investigates violence recognition in surveillance footage using computer vision and machine learning techniques. Because violence recognition is a sensitive task, our goal is to develop deep learning models that are interpretable and explainable. We first propose performing violence recognition with a 3D convolutional neural network through intuitive hyperparameter tuning and transfer learning, building on a state-of-the-art 3D model for general activity recognition that is lightweight and adjustable. Alongside this, we introduce a data augmentation technique called "resize-within," which uses interpolation, rather than cropping, to resize the original input video to a new width and height during model training. Using this as the base model, we then provide a means for model explainability using class activation maps: during training, the proposed approach compares the salient regions the model uses to make its prediction with the regions occupied by the individuals "involved" in the violent act. This forces the model to focus on the regions related to violence, reducing ambiguity about which parts of the frame drive its decision. To the best of our knowledge, this is the first work to provide bounding box labels for involved individuals and saliency evaluation in a violence recognition dataset. Finally, we introduce a deep learning model with built-in interpretability through case-based reasoning over prototypical examples. This approach compares the input latent space, i.e., the input feature maps, with learned prototypical feature maps; the nearest prototype feature maps are then concatenated with the input latent space and used together to make the prediction. Because the model draws on multiple sources of information, this improves performance while adding prediction interpretability through examples. Since the prototypes are feature maps, we can also show the active regions the model associates with the input and its nearest prototype. This work demonstrates that deep learning models can simultaneously improve performance and offer greater interpretability. The proposed methods are evaluated on publicly available benchmark violence recognition datasets (RWF-2000, SCFD, and ViolentFlows).
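
To illustrate two of the components described in the abstract, the Python/PyTorch snippets below are minimal sketches rather than the dissertation's actual implementation; all function names, tensor shapes, sampling ranges, and distance metrics are illustrative assumptions.

A sketch of an interpolation-based "resize-within" style augmentation, assuming clips are tensors of shape (channels, frames, height, width) and that the target size is sampled uniformly from a scale range:

    import torch
    import torch.nn.functional as F

    def resize_within(clip, scale_range=(0.7, 1.0)):
        # clip: (C, T, H, W); resize every frame with bilinear interpolation
        # instead of cropping, so the full frame content is preserved.
        c, t, h, w = clip.shape
        scale = torch.empty(1).uniform_(*scale_range).item()   # assumed sampling scheme
        new_h, new_w = int(h * scale), int(w * scale)
        frames = clip.permute(1, 0, 2, 3)                      # (T, C, H, W)
        resized = F.interpolate(frames, size=(new_h, new_w),
                                mode="bilinear", align_corners=False)
        return resized.permute(1, 0, 2, 3)                     # (C, T, new_H, new_W)

A sketch of the prototype-comparison idea, assuming 3D latent feature maps of shape (batch, channels, frames, height, width), an L2 distance between flattened maps, and channel-wise concatenation of the nearest prototype with the input features before classification:

    import torch
    import torch.nn as nn

    class PrototypeHead(nn.Module):
        def __init__(self, channels, num_prototypes, num_classes, spatial=(4, 7, 7)):
            super().__init__()
            # Learned prototypical feature maps, one per prototype.
            self.prototypes = nn.Parameter(torch.randn(num_prototypes, channels, *spatial))
            self.classifier = nn.Linear(2 * channels, num_classes)

        def forward(self, feats):                              # feats: (B, C, T, H, W)
            flat = feats.flatten(1)                            # (B, C*T*H*W)
            proto_flat = self.prototypes.flatten(1)            # (P, C*T*H*W)
            dists = torch.cdist(flat, proto_flat)              # (B, P) L2 distances
            nearest = dists.argmin(dim=1)                      # closest prototype per input
            matched = self.prototypes[nearest]                 # (B, C, T, H, W)
            joint = torch.cat([feats, matched], dim=1)         # (B, 2C, T, H, W)
            pooled = joint.mean(dim=(2, 3, 4))                 # global average pooling
            return self.classifier(pooled), nearest            # logits and prototype index

Returning the nearest prototype index alongside the logits is one way to surface the "explanation by example": the matched prototype's feature maps can be visualized next to the input's active regions, as the abstract describes.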