B2T Connection: Serving Stability and Performance in Deep Transformers

Citations: 4 · Mendeley readers: 18

Abstract

From the perspective of layer normalization (LN) position, Transformer architectures can be categorized into two types: Post-LN and Pre-LN. Recent Transformers tend to use Pre-LN because training deep Post-LN Transformers (e.g., those with ten or more layers) is often unstable and yields useless models. However, Post-LN has consistently achieved better performance than Pre-LN in relatively shallow Transformers (e.g., those with six or fewer layers). This study first investigates the reason for these discrepant observations empirically and theoretically and makes the following discoveries: (1) the LN in Post-LN is the main source of the vanishing gradient problem that leads to unstable training, whereas Pre-LN prevents it, and (2) Post-LN tends to preserve larger gradient norms in higher layers during back-propagation, which may lead to effective training. Exploiting these findings, we propose a method that provides both high stability and effective training through a simple modification of Post-LN. We conduct experiments on a wide range of text generation tasks. The experimental results demonstrate that our method outperforms Pre-LN and enables stable training in both shallow and deep layer settings. Our code is publicly available at https://github.com/takase/b2t_connection.
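
To make the Post-LN/Pre-LN distinction in the abstract concrete, the sketch below contrasts the two sublayer orderings within a single Transformer encoder block. This is a minimal PyTorch illustration, not the authors' implementation: the class names (PostLNBlock, PreLNBlock) and hyperparameter defaults are our own assumptions, and the authors' actual code, including the proposed B2T (bottom-to-top) modification of the Post-LN block, is at https://github.com/takase/b2t_connection.

```python
# Minimal sketch contrasting Post-LN and Pre-LN sublayer orderings.
# Illustrative only; see the authors' repository for the real B2T code.
import torch
import torch.nn as nn


class PostLNBlock(nn.Module):
    """Post-LN: add the residual first, then apply LayerNorm (original Transformer)."""

    def __init__(self, d_model: int, nhead: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Every residual sum passes through an LN before reaching the next sublayer.
        x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])
        x = self.ln2(x + self.ffn(x))
        return x


class PreLNBlock(nn.Module):
    """Pre-LN: apply LayerNorm inside the residual branch, before each sublayer."""

    def __init__(self, d_model: int, nhead: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # The identity path from input to output never passes through an LN.
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.ffn(self.ln2(x))
        return x


if __name__ == "__main__":
    x = torch.randn(2, 16, 512)  # (batch, sequence, d_model)
    print(PostLNBlock(512)(x).shape, PreLNBlock(512)(x).shape)
```

The contrast matches the abstract's findings: in Post-LN, every residual sum is normalized, which the paper identifies as the source of vanishing gradients in deep stacks, while in Pre-LN the LN-free identity path keeps training stable but gives up the performance edge Post-LN shows in shallow models. The paper's B2T connection adds an extra residual path to the Post-LN block to combine both properties; its exact formulation is given in the paper and repository.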


Citation (APA)

Takase, S., Kiyono, S., Kobayashi, S., & Suzuki, J. (2023). B2T Connection: Serving Stability and Performance in Deep Transformers. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 3078–3095). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.findings-acl.192


Readers' Seniority

PhD / Postgrad / Masters / Doc: 5 (63%)
Lecturer / Post doc: 2 (25%)
Researcher: 1 (13%)

Readers' Discipline

Computer Science: 8 (67%)
Engineering: 2 (17%)
Medicine and Dentistry: 1 (8%)
Mathematics: 1 (8%)
