Efficient string similarity search on disks

0Citations
Citations of this article
3Readers
Mendeley users who have this article in their library.
Get full text

Abstract

String similarity search is a basic operation for various applications, such as data cleaning, spell checking, bioinformatics and information integration. Memory based q-gram inverted indexes fail to support string similarity search over large scale string datasets due to the memory limitation, and it can no longer work if the data size grows beyond the memory size. In the era of big data, large string dataset are quite common. Existing external memory method, Behm-Index, only supports length-filter and prefix filter. This paper proposes LPA-Index to reduce I/O cost for better query response time, and LPA-Index is a disk resident index which suffers no limitation on data size compared to memory size. LPA-Index supports multiple filters to reduce query candidates effectively, and it adaptively reads inverted lists during query processing for better I/O performance. Experiment results demonstrate the efficiency of LPA-Index and its advantages over existing state-of-art disk index Behm-Index with regard to I/O cost and query response time.

Cite

CITATION STYLE

APA

Wang, J., & Yang, D. (2015). Efficient string similarity search on disks. In Communications in Computer and Information Science (Vol. 503, pp. 48–55). Springer Verlag. https://doi.org/10.1007/978-3-662-46248-5_7

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free