TurboTransformers: An efficient GPU serving system for transformer models

95 citations · 78 Mendeley readers

Abstract

The transformer is the most important algorithmic innovation in the Natural Language Processing (NLP) field in recent years. Unlike Recurrent Neural Network (RNN) models, transformers can process all positions of a sequence in parallel, which leads to better accuracy on long sequences. However, deploying them efficiently for online services in GPU-equipped data centers is not easy. First, the additional computation introduced by transformer structures makes it harder to meet the latency and throughput constraints of serving. Second, NLP tasks take in sentences of variable length, and this variability of input dimensions poses a serious problem for efficient memory management and serving optimization. To address these challenges, this paper presents a transformer serving system called TurboTransformers, which consists of a computing runtime and a serving framework. Three innovative features make it stand out from similar works. First, an efficient parallel algorithm is proposed for GPU-based batch reduction operations, such as Softmax and LayerNorm, which are the major hot spots besides BLAS routines. Second, a memory allocation algorithm that better balances memory footprint against allocation/free efficiency is designed for variable-length inputs. Third, a serving framework equipped with a new batch scheduler based on dynamic programming achieves optimal throughput on variable-length requests. The system achieves state-of-the-art transformer serving performance on GPU platforms and can be integrated into existing PyTorch code with only a few lines.
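
The "batch reduction" operations named in the abstract (Softmax, LayerNorm) each perform one or two reductions over the last dimension of every row in a batched tensor; the paper's contribution is a GPU parallelization of exactly this pattern. As a minimal illustration of the operations themselves — a plain PyTorch reference, not the paper's kernel — the sketch below makes the per-row reductions explicit; tensor shapes and the epsilon value are illustrative:

```python
import torch

def softmax_rows(x: torch.Tensor) -> torch.Tensor:
    """Numerically stable softmax over the last dim: two reductions (max, sum) per row."""
    m = x.max(dim=-1, keepdim=True).values              # reduction 1: row max
    e = torch.exp(x - m)
    return e / e.sum(dim=-1, keepdim=True)              # reduction 2: row sum

def layernorm_rows(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """LayerNorm (no affine parameters): mean and variance reductions per row."""
    mu = x.mean(dim=-1, keepdim=True)                   # reduction 1: row mean
    var = x.var(dim=-1, keepdim=True, unbiased=False)   # reduction 2: row variance
    return (x - mu) / torch.sqrt(var + eps)

x = torch.randn(8, 128, 768)  # (batch, seq_len, hidden), as in BERT-base
assert torch.allclose(softmax_rows(x), torch.softmax(x, dim=-1), atol=1e-6)
```

On a GPU these reductions are cheap per row but launched for thousands of rows at once, which is why they dominate runtime after the BLAS calls and why a batch-wide parallel reduction pays off.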
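
The variable-length memory allocator is described only at a high level in the abstract. As a hedged sketch of the general idea — cache and reuse device buffers keyed by size rather than paying an allocation/free per request — here is a toy best-fit caching allocator in Python. The class, its methods, and the integer stand-in for device buffers are all hypothetical, not the paper's API:

```python
import bisect

class CachingAllocator:
    """Toy best-fit caching allocator: reuses freed buffers to trade a larger
    memory footprint for cheaper allocation/free (illustrative only)."""

    def __init__(self):
        self._free = []    # sorted list of (size, buffer_id)
        self._next_id = 0

    def alloc(self, size: int):
        # Best fit: the smallest cached buffer that is large enough.
        i = bisect.bisect_left(self._free, (size, -1))
        if i < len(self._free):
            cached_size, buf = self._free.pop(i)
            return buf, cached_size          # reuse, no device allocation
        self._next_id += 1
        return self._next_id, size           # stand-in for a real device malloc

    def free(self, buf, size: int):
        # Return the buffer to the cache instead of releasing it to the device.
        bisect.insort(self._free, (size, buf))
```

A real variable-length serving allocator must also bound the cache (otherwise the footprint grows toward the worst-case sequence length) and coalesce chunks; the paper's algorithm additionally exploits knowledge of the transformer's tensor lifetimes, which this sketch does not model.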
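
The dynamic-programming batch scheduler can be pictured as an optimal partition of queued requests into contiguous batches: every request in a batch is padded to the batch's longest sequence, so grouping short and long requests together wastes compute. Below is a minimal sketch under a made-up cost model (batch cost = batch size × max length, plus a fixed per-launch overhead); the paper's real cost function comes from profiling, not from this formula:

```python
from functools import lru_cache

def schedule(lengths, launch_overhead=64):
    """Split requests (sorted by length) into contiguous batches that
    minimize total padded cost. Toy cost model; illustrative only."""
    lengths = sorted(lengths)
    n = len(lengths)

    @lru_cache(maxsize=None)
    def best(i):
        if i == n:
            return 0, ()
        best_cost, best_cuts = float("inf"), ()
        for j in range(i + 1, n + 1):  # candidate batch = lengths[i:j]
            cost = (j - i) * lengths[j - 1] + launch_overhead  # pad to batch max
            tail_cost, tail_cuts = best(j)
            if cost + tail_cost < best_cost:
                best_cost, best_cuts = cost + tail_cost, ((i, j),) + tail_cuts
        return best_cost, best_cuts

    return best(0)

cost, batches = schedule([5, 7, 8, 120, 128])
# The optimum groups the three short requests separately from the two long
# ones, instead of padding everything to length 128 in a single batch.
```

Sorting by length first makes the optimal partition contiguous, which is what turns an exponential grouping problem into an O(n²) dynamic program.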
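
Finally, the claimed "few lines of code" PyTorch integration. The sketch below follows the shape of the open-source release's README (github.com/Tencent/TurboTransformers), but the exact module and function names — in particular `turbo_transformers.BertModel.from_torch` — are quoted from memory and may differ across versions; treat them as assumptions:

```python
import torch
import transformers
import turbo_transformers  # installed per the project README (assumption)

# Load a stock Hugging Face BERT, then hand its weights to TurboTransformers.
torch_model = transformers.BertModel.from_pretrained("bert-base-uncased")
torch_model.eval()

# Assumed API: wrap the trained weights in the optimized runtime.
tt_model = turbo_transformers.BertModel.from_torch(torch_model)

input_ids = torch.randint(0, 30522, (1, 40), dtype=torch.long)  # variable-length input
with torch.no_grad():
    output = tt_model(input_ids)
```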


Cited by

Efficient Memory Management for Large Language Model Serving with PagedAttention (360 citations)

DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale (107 citations)

TransPIM: A Memory-based Acceleration via Software-Hardware Co-Design for Transformer (54 citations)


Citation (APA)

Fang, J., Yu, Y., Zhao, C., & Zhou, J. (2021). TurboTransformers: An efficient GPU serving system for transformer models. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP (pp. 389–402). Association for Computing Machinery. https://doi.org/10.1145/3437801.3441578

Readers' Seniority

PhD / Post grad / Masters / Doc: 21 (62%)
Researcher: 8 (24%)
Professor / Associate Prof.: 4 (12%)
Lecturer / Post doc: 1 (3%)

Readers' Discipline

Computer Science: 36 (82%)
Engineering: 5 (11%)
Mathematics: 2 (5%)
Social Sciences: 1 (2%)
