CLUS: Parallel subspace clustering algorithm on Spark

Bo Zhu; Alexandru Mara; Alberto Mozo

Conference Proceedings

Communications in Computer and Information Science (2015) 539 175-185

DOI: 10.1007/978-3-319-23201-0_20

17 Citations

12 Readers

Abstract

Subspace clustering techniques were proposed to discover hidden clusters that exist only in certain subsets of the full feature space. However, the time complexity of such algorithms is exponential in the worst case with respect to the dimensionality of the dataset. In addition, in current big data scenarios datasets are generally too large to fit on a single machine. The extremely high computational complexity, which results in poor scalability with respect to both the size and the dimensionality of these datasets, gives us strong motivation to propose a parallelized subspace clustering algorithm able to handle large, high-dimensional data. To the best of our knowledge, there are no other parallel subspace clustering algorithms that run on top of new-generation distributed big data platforms such as MapReduce and Spark. In this paper we introduce CLUS, a novel parallel subspace clustering solution based on the SUBCLU algorithm. CLUS uses a new dynamic data partitioning method specifically designed to continuously optimize the varying size and content of the data required in each iteration, in order to take full advantage of Spark’s in-memory primitives. This method minimizes communication cost between nodes, maximizes their CPU usage, and balances the load among them. Consequently, the execution time is significantly reduced. Finally, we conduct several experiments with a series of real and synthetic datasets to demonstrate the scalability, the accuracy, and the nearly linear speedup of the implementation with respect to the number of nodes.
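To make the dimensionality claim concrete, the following is a minimal sketch of the bottom-up, Apriori-style subspace search that SUBCLU-type algorithms are generally described as using. It is not the CLUS implementation and does not use Spark. The function names `candidate_subspaces` and `count_subspaces` are hypothetical, and the density-based clustering step that decides which subspaces actually contain clusters is assumed to happen elsewhere.

```python
from itertools import combinations


def count_subspaces(d):
    """Number of non-empty axis-parallel subspaces of a d-dimensional space."""
    return 2 ** d - 1


def candidate_subspaces(clustered, k):
    """Join k-dimensional subspaces into (k+1)-dimensional candidates.

    `clustered` is an iterable of k-dimensional subspaces (sets of attribute
    indices) assumed to contain at least one cluster. A candidate is kept only
    if every one of its k-dimensional projections is also in `clustered`
    (monotonicity-based pruning).
    """
    clustered = {frozenset(s) for s in clustered}
    candidates = set()
    for a, b in combinations(clustered, 2):
        union = a | b
        if len(union) != k + 1:
            continue  # the two subspaces differ in more than one attribute
        if all(frozenset(sub) in clustered for sub in combinations(union, k)):
            candidates.add(union)
    return candidates


if __name__ == "__main__":
    # Worst-case search space for 10 and 20 attributes.
    print(count_subspaces(10), count_subspaces(20))
    # {1, 2} holds no cluster, so the 3-dimensional candidate is pruned.
    print(candidate_subspaces([{0, 1}, {0, 2}], 2))
```

The pruning step only reduces the number of subspaces examined; in the worst case every subspace contains a cluster and all 2^d - 1 of them must be visited, which is the exponential behaviour the abstract refers to.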

Cite

Zhu, B., Mara, A., & Mozo, A. (2015). CLUS: Parallel subspace clustering algorithm on Spark. In Communications in Computer and Information Science (Vol. 539, pp. 175–185). Springer Verlag. https://doi.org/10.1007/978-3-319-23201-0_20
