Transformer-Based Cross-Modal Recipe Embeddings with Large Batch Training

Abstract

Cross-modal recipe retrieval aims to exploit the relationships between recipe images and texts and to accomplish mutual retrieval between the two modalities, a task that is intuitive for humans but arduous to formulate. Although many previous works have addressed this problem, most did not efficiently exploit the cross-modal information in recipe data. In this paper, we present a frustratingly straightforward cross-modal recipe retrieval framework, Transformer-based Network for Large Batch Training (TNLBT), which is designed to efficiently exploit the rich cross-modal information and achieves high performance on both recipe retrieval and image generation tasks. In our proposed framework, Transformer-based encoders are applied to both image and text encoding for cross-modal embedding learning. We also adopt several loss functions, such as a self-supervised learning loss on recipe text, to further promote cross-modal embedding learning. Since contrastive learning can benefit from a larger batch size according to the recent literature on self-supervised learning, we adopt a large batch size during training and validate its effectiveness. The experimental results show that TNLBT significantly outperforms the current state-of-the-art frameworks in both cross-modal recipe retrieval and image generation tasks on the Recipe1M benchmark. We also find that CLIP-ViT performs better than ViT-B as the image encoder backbone. This is the first work to confirm the effectiveness of large batch training for cross-modal recipe embedding learning.
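The abstract notes that contrastive learning benefits from a larger batch size, since every other item in the batch serves as an in-batch negative. The paper's exact loss terms are not reproduced here; the sketch below shows only the standard symmetric InfoNCE-style contrastive objective between paired image and text embeddings (function name, temperature value, and batch shapes are illustrative assumptions, not taken from the paper).

```python
import numpy as np

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over a batch of
    paired image/text embeddings, shape (B, D).

    Hypothetical sketch: matching pairs sit on the diagonal of the
    similarity matrix; all B-1 other items in the batch act as
    negatives, which is why a larger batch gives more negatives.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (B, B) similarity matrix
    labels = np.arange(len(logits))         # positives on the diagonal

    def cross_entropy(l):
        # numerically stable log-softmax over each row
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the two retrieval directions: image->text and text->image
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With this objective, perfectly aligned pairs drive the loss toward zero, while random pairings stay near log(B), so increasing B both raises the difficulty and enriches the negative set.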

Citation (APA)

Yang, J., Chen, J., & Yanai, K. (2023). Transformer-Based Cross-Modal Recipe Embeddings with Large Batch Training. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13834 LNCS, pp. 471–482). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-27818-1_39
