With little power comes great responsibility

Citations of this article: 70
Mendeley readers who have this article in their library: 149

Abstract

Despite its importance to experimental design, statistical power (the probability that, given a real effect, an experiment will reject the null hypothesis) has largely been ignored by the NLP community. Underpowered experiments make it more difficult to discern the difference between statistical noise and meaningful model improvements, and increase the chances of exaggerated findings. By meta-analyzing a set of existing NLP papers and datasets, we characterize typical power for a variety of settings and conclude that underpowered experiments are common in the NLP literature. In particular, for several tasks in the popular GLUE benchmark, small test sets mean that most attempted comparisons to state-of-the-art models will not be adequately powered. Similarly, based on reasonable assumptions, we find that the most typical experimental design for human rating studies will be underpowered to detect small model differences, of the sort that are frequently studied. For machine translation, we find that typical test sets of 2000 sentences have approximately 75% power to detect differences of 1 BLEU point. To improve the situation going forward, we give an overview of best practices for power analysis in NLP and release a series of notebooks to assist with future power analyses.
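To make the abstract's argument concrete, below is a minimal simulation sketch of the kind of calculation the released notebooks support: estimating the power of a paired sign test (run on the examples where two models disagree) to detect a given accuracy difference on a shared test set. All parameter values here (test-set size, accuracy gap, disagreement rate) are illustrative assumptions, not figures from the paper, and the setup simplifies disagreements to binary win/loss outcomes.

import numpy as np
from scipy.stats import binomtest

def sign_test_power(n_test, acc_diff, p_disagree,
                    alpha=0.05, n_sims=2000, seed=0):
    """Simulate the power of a two-sided exact sign test to detect an
    accuracy difference of acc_diff between two models evaluated on the
    same n_test examples, assuming they disagree on a fraction
    p_disagree of examples (all values are illustrative assumptions)."""
    rng = np.random.default_rng(seed)
    # On disagreement examples, the better model must win with this
    # probability for the marginal accuracies to differ by acc_diff.
    p_win = 0.5 + acc_diff / (2 * p_disagree)
    rejections = 0
    for _ in range(n_sims):
        # Number of examples on which the two models make different predictions.
        n_dis = rng.binomial(n_test, p_disagree)
        if n_dis == 0:
            continue
        # Disagreements won by the better model.
        wins = rng.binomial(n_dis, p_win)
        # Two-sided exact sign test against the null of p = 0.5.
        if binomtest(wins, n_dis, p=0.5).pvalue < alpha:
            rejections += 1
    return rejections / n_sims

# Hypothetical scenario: a 1-point accuracy gain, a 1,000-example test
# set, and 10% model disagreement. The estimated power comes out far
# below the conventional 80% target, i.e., the comparison is underpowered.
print(sign_test_power(n_test=1000, acc_diff=0.01, p_disagree=0.10))

Raising n_test or the assumed effect size in this sketch shows directly how much data, or how large a gap, a well-powered comparison of this form requires.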

Citation (APA)

Card, D., Henderson, P., Khandelwal, U., Jia, R., Mahowald, K., & Jurafsky, D. (2020). With little power comes great responsibility. In EMNLP 2020 - 2020 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference (pp. 9263–9274). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2020.emnlp-main.745

Readers over time

(Chart: Mendeley reader counts per year, '20–'25; axis scale 0–60.)

Readers' Seniority

PhD / Post grad / Masters / Doc: 46 (65%)
Researcher: 17 (24%)
Professor / Associate Prof.: 6 (8%)
Lecturer / Post doc: 2 (3%)

Readers' Discipline

Computer Science: 62 (78%)
Linguistics: 10 (13%)
Neuroscience: 4 (5%)
Engineering: 3 (4%)

Article Metrics

Social media shares, likes & comments: 3
