SensiMix: Sensitivity-Aware 8-bit index & 1-bit value mixed precision quantization for BERT compression

Citations: 8 | Readers (Mendeley): 11

Abstract

Given a pre-trained BERT, how can we compress it to a fast and lightweight one while maintaining its accuracy? Pre-trained language models, such as BERT, are effective for improving the performance of natural language processing (NLP) tasks. However, heavy models like BERT suffer from large memory cost and long inference time. In this paper, we propose SENSIMIX (Sensitivity-Aware Mixed Precision Quantization), a novel quantization-based BERT compression method that considers the sensitivity of different modules of BERT. SENSIMIX effectively applies 8-bit index quantization and 1-bit value quantization to the sensitive and insensitive parts of BERT, maximizing the compression rate while minimizing the accuracy drop. We also propose three novel 1-bit training methods to minimize the accuracy drop: Absolute Binary Weight Regularization, Prioritized Training, and Inverse Layer-wise Fine-tuning. Moreover, for fast inference, we apply FP16 general matrix multiplication (GEMM) and XNOR-Count GEMM to the 8-bit and 1-bit quantized parts of the model, respectively. Experiments on four GLUE downstream tasks show that SENSIMIX compresses the original BERT model to an equally effective but lightweight one, reducing the model size by a factor of 8× and shrinking the inference time by around 80% without noticeable accuracy drop.
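
As a rough illustration of the 1-bit path mentioned above, the sketch below binarizes vectors to {-1, +1} and computes their dot product with XNOR and popcount, which is the arithmetic identity behind XNOR-Count GEMM. This is a minimal NumPy sketch for intuition only, not the authors' implementation; the function names (binarize, xnor_count_dot) and the per-tensor mean-absolute-value scaling are assumptions.

import numpy as np

def binarize(x):
    # Map a real-valued tensor to {-1, +1}; keep its mean absolute value
    # as a per-tensor scaling factor (a common 1-bit quantization choice).
    scale = np.abs(x).mean()
    return np.sign(np.where(x == 0, 1.0, x)), scale

def xnor_count_dot(a_bin, b_bin):
    # Dot product of two {-1, +1} vectors via XNOR + popcount.
    # With n elements and m sign matches, the dot product is m - (n - m) = 2m - n.
    a_bits = a_bin > 0
    b_bits = b_bin > 0
    matches = np.count_nonzero(~(a_bits ^ b_bits))  # XNOR, then count the 1s
    return 2 * matches - a_bin.size

# Sanity check: the XNOR-Count result equals the ordinary dot product of the signs.
rng = np.random.default_rng(0)
w, w_scale = binarize(rng.normal(size=64))
x, x_scale = binarize(rng.normal(size=64))
assert xnor_count_dot(w, x) == int(np.dot(w, x))
print(xnor_count_dot(w, x) * w_scale * x_scale)  # rescaled approximation of the real dot product

Because the binary operands fit in bit vectors, the same identity lets hardware replace floating-point multiply-accumulate with bitwise XNOR and popcount instructions, which is where the inference speedup of the 1-bit parts comes from.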

Cite

CITATION STYLE

APA

Piao, T., Cho, I., & Kang, U. (2022). SensiMix: Sensitivity-Aware 8-bit index & 1-bit value mixed precision quantization for BERT compression. PLoS ONE, 17(4), e0265621. https://doi.org/10.1371/journal.pone.0265621

Readers over time
[chart: reader counts per year, '22–'25]

Readers' Seniority

- Professor / Associate Prof.: 3 (60%)
- PhD / Post grad / Masters / Doc: 1 (20%)
- Researcher: 1 (20%)

Readers' Discipline

- Medicine and Dentistry: 2 (33%)
- Computer Science: 2 (33%)
- Business, Management and Accounting: 2 (33%)
