The nearest neighbor rule is one of the most popular algorithms for data mining tasks, due in part to its simplicity and its theoretical and empirical properties. However, with the availability of large volumes of data, this algorithm suffers from two problems: the computational cost of classifying a new example, and the need to store the whole training set. To alleviate these problems, instance reduction algorithms are often used to obtain a condensed training set that, in addition to reducing the computational burden, in some cases improves classification performance. Many instance reduction algorithms have been proposed so far, achieving outstanding performance on mid-size data sets. However, applying the most competitive instance reduction algorithms becomes prohibitive when dealing with massive data volumes. For this reason, the development of large-scale instance reduction algorithms has become crucial in recent years. This paper elaborates on the use of a classic clustering algorithm, K-means, for tackling the instance reduction problem in big data. We show that this traditional algorithm outperforms most state-of-the-art instance reduction methods on mid-size data sets. In addition, it copes well with massive data sets while still obtaining quite competitive performance. Therefore, the main contribution of this work is showing the validity of this often-underestimated algorithm for a highly relevant task in a highly relevant scenario.
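To make the idea concrete, the following is a minimal sketch of K-means-based prototype generation for nearest neighbor classification, not the authors' exact method: it uses scikit-learn's MiniBatchKMeans as a stand-in for online K-means, and the function name, the per-class clustering scheme, and the prototypes_per_class parameter are illustrative assumptions rather than details taken from the paper.

```python
# Illustrative sketch (assumptions labeled): condense the training set by
# clustering each class with mini-batch K-means and keeping the centroids
# as prototypes, then classify with 1-NN over the reduced set.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

def kmeans_prototypes(X, y, prototypes_per_class=10, random_state=0):
    """Return (prototypes, labels): centroids of per-class K-means runs."""
    protos, labels = [], []
    for c in np.unique(y):
        Xc = X[y == c]
        k = min(prototypes_per_class, len(Xc))
        km = MiniBatchKMeans(n_clusters=k, random_state=random_state, n_init=3)
        km.fit(Xc)
        protos.append(km.cluster_centers_)
        labels.append(np.full(k, c))
    return np.vstack(protos), np.concatenate(labels)

# Usage example on a small benchmark data set.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

P, p_labels = kmeans_prototypes(X_tr, y_tr, prototypes_per_class=10)
knn = KNeighborsClassifier(n_neighbors=1).fit(P, p_labels)
print("storage reduction: %.1f%%" % (100 * (1 - len(P) / len(X_tr))))
print("1-NN accuracy on prototypes: %.3f" % knn.score(X_te, y_te))
```

The design choice reflected here is the one the abstract argues for: the condensed set is orders of magnitude smaller than the training set, so both the memory footprint and the per-query cost of the nearest neighbor rule drop accordingly, while the centroids often remain representative enough to keep accuracy competitive.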
CITATION STYLE
García-Limón, M., Escalante, H. J., & Morales-Reyes, A. (2016). In defense of online Kmeans for prototype generation and instance reduction. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10022 LNAI, pp. 310–322). Springer Verlag. https://doi.org/10.1007/978-3-319-47955-2_26