Using hierarchical information-theoretic criteria to optimize subsampling of extensive datasets

Abstract

This paper addresses the challenge of subsampling large datasets, aiming to generate a smaller dataset that retains a significant portion of the original information. To achieve this objective, we present a subsampling algorithm that integrates hierarchical data partitioning with a specialized tool tailored to identify the most informative observations within a dataset for a specified underlying linear model, not necessarily first-order, relating responses and inputs. The hierarchical data partitioning procedure systematically and incrementally aggregates information from smaller-sized samples into new samples. Simultaneously, our selection tool employs Semidefinite Programming for numerical optimization to maximize the information content of the chosen observations. We validate the effectiveness of our algorithm through extensive testing, using both benchmark and real-world datasets. The real-world dataset is related to the physicochemical characterization of white variants of Portuguese Vinho Verde. Our results are highly promising, demonstrating the algorithm's capability to efficiently identify and select the most informative observations while keeping computational requirements at a manageable level.
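The two ingredients named in the abstract, hierarchical partitioning and selection of the most informative observations, can be illustrated with a minimal sketch. It is not the authors' implementation. In place of Semidefinite Programming, it substitutes a greedy D-optimality heuristic, which adds the point that most increases the determinant of the information matrix. The quadratic model in `features`, the pairwise merging of blocks, and the names `greedy_d_optimal`, `hierarchical_subsample`, `block_size` and `ridge` are all assumptions made for illustration.

```python
import random


def features(x):
    # Assumed model: quadratic regression in one input, f(x) = [1, x, x^2].
    return [1.0, x, x * x]


def quad_form(minv, f):
    # Computes f^T Minv f.
    p = len(f)
    return sum(f[i] * minv[i][j] * f[j] for i in range(p) for j in range(p))


def greedy_d_optimal(points, k, ridge=1e-3):
    """Greedy stand-in for the SDP step: pick up to k points, each time
    adding the point that most increases det of the information matrix."""
    p = len(features(points[0]))
    # Inverse of the regularized starting matrix ridge * I.
    minv = [[(1.0 / ridge if i == j else 0.0) for j in range(p)]
            for i in range(p)]
    remaining = list(points)
    chosen = []
    for _ in range(min(k, len(remaining))):
        # det(M + f f^T) = det(M) * (1 + f^T M^{-1} f),
        # so the best candidate maximizes the quadratic form.
        best = max(range(len(remaining)),
                   key=lambda i: quad_form(minv, features(remaining[i])))
        x = remaining.pop(best)
        chosen.append(x)
        f = features(x)
        # Sherman-Morrison update of M^{-1} after adding f f^T.
        u = [sum(minv[i][j] * f[j] for j in range(p)) for i in range(p)]
        denom = 1.0 + sum(f[i] * u[i] for i in range(p))
        minv = [[minv[i][j] - u[i] * u[j] / denom for j in range(p)]
                for i in range(p)]
    return chosen


def hierarchical_subsample(points, k, block_size):
    """Select k points by repeatedly subsampling blocks and merging survivors."""
    # Level 0: partition the data into consecutive blocks.
    blocks = [points[i:i + block_size]
              for i in range(0, len(points), block_size)]
    while len(blocks) > 1:
        # Keep the k most informative points of each block ...
        kept = [greedy_d_optimal(b, k) for b in blocks]
        # ... then merge the survivors pairwise into the next level's blocks.
        blocks = [sum(kept[i:i + 2], []) for i in range(0, len(kept), 2)]
    return greedy_d_optimal(blocks[0], k)


# Usage: 201 equally spaced inputs on [-1, 1], shuffled, reduced to 6 points.
points = [i / 100 - 1 for i in range(201)]
random.Random(0).shuffle(points)
subsample = hierarchical_subsample(points, k=6, block_size=25)
```

Because each selection only ever sees one block at a time, the cost of a single step depends on `block_size` rather than on the size of the full dataset. This is the property the hierarchical scheme is meant to exploit. A solver-based selection step could replace `greedy_d_optimal` without changing `hierarchical_subsample`.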

Citation (APA)

Duarte, B. P. M., Atkinson, A. C., & Oliveira, N. M. C. (2024). Using hierarchical information-theoretic criteria to optimize subsampling of extensive datasets. Chemometrics and Intelligent Laboratory Systems, 245. https://doi.org/10.1016/j.chemolab.2024.105067
