Translating videos to natural language using deep recurrent neural networks


Abstract

Solving the visual symbol grounding problem has long been a goal of artificial intelligence. The field appears to be advancing closer to this goal with recent breakthroughs in deep learning for natural language grounding in static images. In this paper, we propose to translate videos directly to sentences using a unified deep neural network with both convolutional and recurrent structure. Described video datasets are scarce, and most existing methods have been applied to toy domains with a small vocabulary of possible words. By transferring knowledge from 1.2M+ images with category labels and 100,000+ images with captions, our method is able to create sentence descriptions of open-domain videos with large vocabularies. We compare our approach with recent work using language generation metrics, subject, verb, and object prediction accuracy, and a human evaluation.
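The pipeline the abstract describes, convolutional features feeding a recurrent sentence generator, can be illustrated with a toy sketch. It is not the authors' implementation. The abstract does not say how frames are combined, so averaging the per-frame feature vectors is an assumption here. A plain tanh recurrent cell with random, untrained weights stands in for the recurrent network, and the vocabulary, the feature values, and the names `mean_pool` and `describe` are hypothetical.

```python
import math
import random

# Toy vocabulary (hypothetical); index 0 doubles as start and end marker.
VOCAB = ["<eos>", "a", "man", "is", "playing", "guitar"]

def mean_pool(frames):
    """Average per-frame feature vectors into one video-level vector."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def random_matrix(rows, cols, rng):
    """Untrained weights; a real system would learn these from data."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def describe(frames, hidden=8, max_len=10, seed=0):
    """Greedy word-by-word decoding conditioned on the pooled video vector."""
    rng = random.Random(seed)
    video = mean_pool(frames)
    W_in = random_matrix(hidden, len(video) + len(VOCAB), rng)
    W_rec = random_matrix(hidden, hidden, rng)
    W_out = random_matrix(len(VOCAB), hidden, rng)
    h = [0.0] * hidden
    prev = 0  # begin decoding from the marker token
    words = []
    for _ in range(max_len):
        # Input at each step: video features plus the previous word.
        x = video + one_hot(prev, len(VOCAB))
        h = [math.tanh(a + b)
             for a, b in zip(matvec(W_in, x), matvec(W_rec, h))]
        scores = matvec(W_out, h)
        prev = max(range(len(VOCAB)), key=scores.__getitem__)
        if VOCAB[prev] == "<eos>":
            break
        words.append(VOCAB[prev])
    return words

# Two fake 3-dimensional frame features standing in for CNN activations.
frames = [[0.0, 1.0, 2.0], [2.0, 3.0, 4.0]]
print(mean_pool(frames))
print(describe(frames))
```

Because the weights are random, the generated words carry no meaning; the sketch only shows the data flow from frame features to a word sequence. In a trained system the feature extractor and the recurrent decoder would be learned, for example by first pretraining on labeled and captioned images as the abstract describes.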

Cite

Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R., & Saenko, K. (2015). Translating videos to natural language using deep recurrent neural networks. In NAACL HLT 2015 - 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference (pp. 1494–1504). Association for Computational Linguistics (ACL). https://doi.org/10.3115/v1/n15-1173
