B2T Connection: Serving Stability and Performance in Deep Transformers

Citations: 4 · Mendeley readers: 18

Abstract

From the perspective of layer normalization (LN) position, Transformer architectures can be categorized into two types: Post-LN and Pre-LN. Recent Transformers tend to use Pre-LN because training deep Post-LN Transformers (e.g., those with ten or more layers) is often unstable and yields useless models. However, Post-LN has consistently achieved better performance than Pre-LN in relatively shallow Transformers (e.g., those with six or fewer layers). This study first investigates the reason for these discrepant observations empirically and theoretically and makes the following discoveries: (1) the LN in Post-LN is the main source of the vanishing gradient problem that leads to unstable training, whereas Pre-LN prevents it, and (2) Post-LN tends to preserve larger gradient norms in higher layers during back-propagation, which may lead to effective training. Exploiting these findings, we propose a method that provides both high stability and effective training through a simple modification of Post-LN. We conduct experiments on a wide range of text generation tasks. The experimental results demonstrate that our method outperforms Pre-LN and enables stable training in both shallow and deep layer settings. Our code is publicly available at https://github.com/takase/b2t_connection.
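
To make the Post-LN/Pre-LN distinction in the abstract concrete, the sketch below contrasts the two sublayer orderings within a single Transformer encoder block. This is a minimal PyTorch illustration, not the authors' implementation: the class names (PostLNBlock, PreLNBlock) and hyperparameter defaults are our own assumptions, and the authors' actual code, including the proposed B2T (bottom-to-top) modification of the Post-LN block, is at https://github.com/takase/b2t_connection.

```python
# Minimal sketch contrasting Post-LN and Pre-LN sublayer orderings.
# Illustrative only; see the authors' repository for the real B2T code.
import torch
import torch.nn as nn


class PostLNBlock(nn.Module):
    """Post-LN: add the residual first, then apply LayerNorm (original Transformer)."""

    def __init__(self, d_model: int, nhead: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Every residual sum passes through an LN before reaching the next sublayer.
        x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])
        x = self.ln2(x + self.ffn(x))
        return x


class PreLNBlock(nn.Module):
    """Pre-LN: apply LayerNorm inside the residual branch, before each sublayer."""

    def __init__(self, d_model: int, nhead: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # The identity path from input to output never passes through an LN.
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.ffn(self.ln2(x))
        return x


if __name__ == "__main__":
    x = torch.randn(2, 16, 512)  # (batch, sequence, d_model)
    print(PostLNBlock(512)(x).shape, PreLNBlock(512)(x).shape)
```

The contrast matches the abstract's findings: in Post-LN, every residual sum is normalized, which the paper identifies as the source of vanishing gradients in deep stacks, while in Pre-LN the LN-free identity path keeps training stable but gives up the performance edge Post-LN shows in shallow models. The paper's B2T connection adds an extra residual path to the Post-LN block to combine both properties; its exact formulation is given in the paper and repository.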


Citation (APA)

Takase, S., Kiyono, S., Kobayashi, S., & Suzuki, J. (2023). B2T Connection: Serving Stability and Performance in Deep Transformers. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 3078–3095). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.findings-acl.192


Readers' Seniority

PhD / Postgrad / Masters / Doc: 5 (63%)
Lecturer / Post doc: 2 (25%)
Researcher: 1 (13%)

Readers' Discipline

Computer Science: 8 (67%)
Engineering: 2 (17%)
Medicine and Dentistry: 1 (8%)
Mathematics: 1 (8%)
