Abstract
Deep learning approaches have been established as the main methodology for video classification and recognition. Recently, 3-dimensional convolutions have been used to achieve state-of-the-art performance on many challenging video datasets. Because these methods are highly complex, with the convolution operations extended to an additional dimension in order to extract features from it as well, providing a visualization of the signals that the network interprets as informative is a challenging task. An effective way to understand the network's inner workings is to isolate the spatio-temporal regions of the video that the network finds most informative. We propose a method called Saliency Tubes, which highlights the foremost points and regions, both at the frame level and over time, that are the main focus points of the network. We demonstrate our findings on widely used datasets for third-person and egocentric action classification, and extend the set of methods and visualizations that improve the intelligibility of 3D Convolutional Neural Networks (CNNs). Our code and a demo video are also available.
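The core idea described above, isolating the spatio-temporal regions a 3D CNN finds most informative, can be sketched as a CAM-style weighted sum of the final 3D convolutional activation maps, extended over the temporal axis. The sketch below is illustrative only and uses random arrays with hypothetical shapes (`features`, `fc_weights`, and the helper `saliency_tube` are assumptions, not the authors' implementation):

```python
import numpy as np

# Hypothetical shapes for the final 3D-conv feature volume and class-layer weights.
T, H, W, C = 4, 7, 7, 512          # time, height, width, channels
NUM_CLASSES = 10

rng = np.random.default_rng(0)
features = rng.random((T, H, W, C))         # activation maps A_k(t, x, y)
fc_weights = rng.random((C, NUM_CLASSES))   # per-class channel weights w_k^c

def saliency_tube(features, fc_weights, class_idx):
    """Channel-wise weighted sum of 3D activation maps for one class,
    normalized to [0, 1] (a CAM-style map with an added temporal axis)."""
    w = fc_weights[:, class_idx]                        # (C,)
    tube = np.tensordot(features, w, axes=([3], [0]))   # (T, H, W)
    tube -= tube.min()
    tube /= tube.max() + 1e-8
    return tube

tube = saliency_tube(features, fc_weights, class_idx=3)
print(tube.shape)  # (4, 7, 7): a low-resolution spatio-temporal saliency volume
```

In practice such a volume would be upsampled to the input clip's resolution and overlaid on the frames to visualize which regions and time steps drive the prediction.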
| Original language | English |
|---|---|
| Publication status | Published - 29 Sept 2019 |
| Externally published | Yes |
| Event | 26th IEEE International Conference on Image Processing (ICIP) - Duration: 29 Sept 2019 → … |
Conference
| Conference | 26th IEEE International Conference on Image Processing (ICIP) |
|---|---|
| Period | 29/09/19 → … |
Keywords
- spatio-temporal feature representation
- visual explanations
- explainable convolutions