Towards the Advancement of Violence Recognition in Security Footage with Explainable Neural Networks
Date of Award
Spring 4-11-2025
Document Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Electrical and Computer Engineering
First Advisor
Edwin Yaz
Second Advisor
Cristinel Ababei
Third Advisor
Edwin Yaz
Fourth Advisor
Henry Medeiros
Fifth Advisor
Richard Povinelli; Susan Schneider
Abstract
This dissertation investigates violence recognition in surveillance footage using computer vision and machine learning techniques. Because violence recognition is a sensitive task, our goal is to develop deep learning models that are interpretable and explainable. We first propose performing violence recognition with a 3D convolutional neural network through intuitive hyperparameter tuning and transfer learning, building on a state-of-the-art 3D model for general activity recognition that is lightweight and adjustable. Alongside this, we introduce a data augmentation technique called "resize-within," which uses interpolation, rather than cropping, to resize the original input video to a new width and height during model training. Using this as the base model, we then provide a means for model explainability using class activation maps: during training, the proposed approach compares the salient regions the model uses to make its prediction with the regions occupied by the individuals "involved" in the violent act. This forces the model to focus on the regions related to violence, reducing ambiguity about which parts of the frame drive its decision. To the best of our knowledge, this is the first work to provide bounding box labels for involved individuals and saliency evaluation in a violence recognition dataset. Finally, we introduce a deep learning model with built-in interpretability through case-based reasoning over prototypical examples. This approach compares the input latent space, i.e., the input feature maps, with learned prototypical feature maps; the nearest prototype feature maps are then concatenated with the input latent space and used together to make the prediction. Because the model draws on multiple sources of information, this improves performance while adding prediction interpretability through examples. Since the prototypes are feature maps, we can also show the active regions the model associates with the input and its nearest prototype. This work demonstrates that deep learning models can simultaneously improve performance and offer greater interpretability. The proposed methods are evaluated on publicly available benchmark violence recognition datasets (RWF-2000, SCFD, and ViolentFlows).
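
To illustrate two of the components described in the abstract, the Python/PyTorch snippets below are minimal sketches rather than the dissertation's actual implementation; all function names, tensor shapes, sampling ranges, and distance metrics are illustrative assumptions.

A sketch of an interpolation-based "resize-within" style augmentation, assuming clips are tensors of shape (channels, frames, height, width) and that the target size is sampled uniformly from a scale range:

    import torch
    import torch.nn.functional as F

    def resize_within(clip, scale_range=(0.7, 1.0)):
        # clip: (C, T, H, W); resize every frame with bilinear interpolation
        # instead of cropping, so the full frame content is preserved.
        c, t, h, w = clip.shape
        scale = torch.empty(1).uniform_(*scale_range).item()   # assumed sampling scheme
        new_h, new_w = int(h * scale), int(w * scale)
        frames = clip.permute(1, 0, 2, 3)                      # (T, C, H, W)
        resized = F.interpolate(frames, size=(new_h, new_w),
                                mode="bilinear", align_corners=False)
        return resized.permute(1, 0, 2, 3)                     # (C, T, new_H, new_W)

A sketch of the prototype-comparison idea, assuming 3D latent feature maps of shape (batch, channels, frames, height, width), an L2 distance between flattened maps, and channel-wise concatenation of the nearest prototype with the input features before classification:

    import torch
    import torch.nn as nn

    class PrototypeHead(nn.Module):
        def __init__(self, channels, num_prototypes, num_classes, spatial=(4, 7, 7)):
            super().__init__()
            # Learned prototypical feature maps, one per prototype.
            self.prototypes = nn.Parameter(torch.randn(num_prototypes, channels, *spatial))
            self.classifier = nn.Linear(2 * channels, num_classes)

        def forward(self, feats):                              # feats: (B, C, T, H, W)
            flat = feats.flatten(1)                            # (B, C*T*H*W)
            proto_flat = self.prototypes.flatten(1)            # (P, C*T*H*W)
            dists = torch.cdist(flat, proto_flat)              # (B, P) L2 distances
            nearest = dists.argmin(dim=1)                      # closest prototype per input
            matched = self.prototypes[nearest]                 # (B, C, T, H, W)
            joint = torch.cat([feats, matched], dim=1)         # (B, 2C, T, H, W)
            pooled = joint.mean(dim=(2, 3, 4))                 # global average pooling
            return self.classifier(pooled), nearest            # logits and prototype index

Returning the nearest prototype index alongside the logits is one way to surface the "explanation by example": the matched prototype's feature maps can be visualized next to the input's active regions, as the abstract describes.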