This paper proposes a novel network architecture for human action recognition. First, a pre-trained spatio-temporal feature extractor extracts spatio-temporal features from videos. Then, features from several levels are concatenated via 3D-convolution skip-connections, and a batch normalization layer normalizes the concatenated features. We then feed the normalized features into an RNN to model temporal dependencies, which enables the network to handle long-term information. In addition, each video is divided into three parts, and each part is split into non-overlapping 16-frame clips for data augmentation. Finally, the proposed method is evaluated on the UCF101 dataset and compared with existing state-of-the-art methods. Experimental results demonstrate that our method achieves the highest recognition accuracy.
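The pipeline described above (multi-level feature concatenation, batch normalization, then an RNN over time) can be sketched in a minimal, framework-free form. This is only an illustration under stated assumptions: the feature dimensions, the two-level split, and the RNN weights are all hypothetical stand-ins, not the paper's actual extractor or trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: T time steps, feature sizes at two network levels.
T, D1, D2 = 8, 64, 32

def split_into_clips(frames, clip_len=16):
    """Split a video (sequence of frames) into non-overlapping 16-frame clips."""
    n = len(frames) // clip_len
    return [frames[i * clip_len:(i + 1) * clip_len] for i in range(n)]

# Stand-ins for multi-level spatio-temporal features that would normally
# come from a pre-trained 3D-CNN feature extractor.
feat_low = rng.normal(size=(T, D1))
feat_high = rng.normal(size=(T, D2))

# Skip-connection: concatenate features from several levels.
feats = np.concatenate([feat_low, feat_high], axis=1)  # shape (T, D1 + D2)

# Batch normalization of the concatenated features (per-dimension).
mu, var = feats.mean(axis=0), feats.var(axis=0)
feats_bn = (feats - mu) / np.sqrt(var + 1e-5)

# A plain (Elman) RNN over the normalized features to model temporal
# dependencies; weights here are random stand-ins, not trained parameters.
H = 16
Wx = rng.normal(scale=0.1, size=(D1 + D2, H))
Wh = rng.normal(scale=0.1, size=(H, H))
h = np.zeros(H)
for t in range(T):
    h = np.tanh(feats_bn[t] @ Wx + h @ Wh)
# h is the final hidden state summarizing the clip; a classifier head
# would map it to action scores.
```

The key design points the sketch mirrors are that concatenation (rather than summation) preserves features from each level, and that normalizing before the RNN keeps the recurrent inputs on a comparable scale.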
CITATION STYLE
Song, J., Yang, Z., Zhang, Q., Fang, T., Hu, G., Han, J., & Chen, C. (2018). Human action recognition with 3D convolution skip-connections and RNNs. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11301 LNCS, pp. 319–331). Springer Verlag. https://doi.org/10.1007/978-3-030-04167-0_29