Optimally splitting cases for training and testing high dimensional classifiers


Abstract

Background: We consider the problem of designing a study to develop a predictive classifier from high dimensional data. A common study design is to split the sample into a training set and an independent test set, where the former is used to develop the classifier and the latter to evaluate its performance. In this paper we address the question of what proportion of the samples should be devoted to the training set, and how this proportion affects the mean squared error (MSE) of the prediction accuracy estimate.

Results: We develop a non-parametric algorithm for determining an optimal splitting proportion that can be applied with a specific dataset and classifier algorithm. We also perform a broad simulation study to better understand the factors that determine the best split proportions and to evaluate commonly used splitting strategies (1/2 training or 2/3 training) under a wide variety of conditions. These methods are based on a decomposition of the MSE into three intuitive components.

Conclusions: By applying these approaches to a number of synthetic and real microarray datasets, we show that for linear classifiers the optimal proportion depends on the overall number of samples available and the degree of differential expression between the classes. The optimal proportion was found to depend on the full dataset size (n) and the classification accuracy, with higher accuracy and smaller n resulting in more samples assigned to the training set. The commonly used strategy of allocating two-thirds of the cases for training was close to optimal for reasonably sized datasets (n ≥ 100) with strong signals (i.e., 85% or greater full-dataset accuracy). In general, we recommend using our non-parametric resampling approach to determine the optimal split. This approach can be applied to any dataset, with any predictor development method. © 2011 Dobbin and Simon; licensee BioMed Central Ltd.
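The resampling idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' published algorithm: the nearest-centroid classifier, the synthetic data, and the use of a large independent sample as the "full dataset accuracy" reference are all assumptions made here for demonstration. For each candidate training proportion, the data are repeatedly split, a classifier is fit on the training portion, and the MSE of the test-set accuracy estimate is approximated as its variance plus squared bias relative to the reference accuracy.

```python
# Sketch of choosing a train/test split proportion by resampling.
# Assumptions (not from the paper): nearest-centroid classifier,
# synthetic Gaussian data, reference accuracy from a large fresh sample.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic high-dimensional two-class data: n samples, p features,
# with a modest signal in the first 10 features of class 1.
n, p = 100, 200
y = np.array([0] * (n // 2) + [1] * (n - n // 2))
X = rng.normal(size=(n, p))
X[y == 1, :10] += 1.5

def nearest_centroid_accuracy(Xtr, ytr, Xte, yte):
    """Classify each test case by the nearer class centroid; return accuracy."""
    c0, c1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    d0 = ((Xte - c0) ** 2).sum(axis=1)
    d1 = ((Xte - c1) ** 2).sum(axis=1)
    return float(((d1 < d0).astype(int) == yte).mean())

# Reference "full dataset" accuracy: train on all n cases and score on a
# large independent sample -- possible here only because the data are synthetic.
Xbig = rng.normal(size=(5000, p))
ybig = rng.integers(0, 2, size=5000)
Xbig[ybig == 1, :10] += 1.5
full_acc = nearest_centroid_accuracy(X, y, Xbig, ybig)

def approx_mse(train_frac, reps=50):
    """Variance + squared bias of the accuracy estimate at this proportion."""
    accs = []
    for _ in range(reps):
        idx = rng.permutation(n)
        k = int(round(train_frac * n))
        tr, te = idx[:k], idx[k:]
        accs.append(nearest_centroid_accuracy(X[tr], y[tr], X[te], y[te]))
    accs = np.asarray(accs)
    return float(accs.var() + (accs.mean() - full_acc) ** 2)

# Compare the common 1/2 and 2/3 strategies against a larger training share.
mse_by_frac = {f: approx_mse(f) for f in (0.5, 2 / 3, 0.8)}
for f, m in sorted(mse_by_frac.items()):
    print(f"train fraction {f:.2f}: approx MSE {m:.4f}")
```

The proportion with the smallest approximate MSE would be chosen; with real data the reference accuracy is unknown, which is why the paper's non-parametric procedure and its three-part MSE decomposition are needed rather than this simplified variance-plus-bias proxy.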


Citation (APA)

Dobbin, K. K., & Simon, R. M. (2011). Optimally splitting cases for training and testing high dimensional classifiers. BMC Medical Genomics, 4. https://doi.org/10.1186/1755-8794-4-31

Readers' Seniority

PhD / Post grad / Masters / Doc: 127 (62%)
Researcher: 51 (25%)
Professor / Associate Prof.: 15 (7%)
Lecturer / Post doc: 11 (5%)

Readers' Discipline

Computer Science: 56 (36%)
Engineering: 50 (32%)
Medicine and Dentistry: 30 (19%)
Agricultural and Biological Sciences: 18 (12%)
