Efficient string similarity search on disks

Jinbao Wang; Donghua Yang

Conference Proceedings

Communications in Computer and Information Science, vol. 503 (2015), pp. 48-55

DOI: 10.1007/978-3-662-46248-5_7

Abstract

String similarity search is a basic operation in many applications, such as data cleaning, spell checking, bioinformatics and information integration. Memory-based q-gram inverted indexes cannot support string similarity search over large-scale string datasets, because they stop working once the data size grows beyond the memory size. In the era of big data, large string datasets are quite common. The existing external-memory method, Behm-Index, supports only the length filter and the prefix filter. This paper proposes LPA-Index, a disk-resident index that places no limit on data size relative to memory size and that reduces I/O cost for better query response time. LPA-Index supports multiple filters to reduce the number of query candidates effectively, and it adaptively reads inverted lists during query processing for better I/O performance. Experimental results demonstrate the efficiency of LPA-Index and its advantages over the existing state-of-the-art disk index, Behm-Index, with regard to I/O cost and query response time.
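
To make the terminology in the abstract concrete, the following is a minimal in-memory sketch of a q-gram inverted index queried under an edit-distance threshold. It is not the paper's LPA-Index, nor Behm-Index: the abstract does not describe their layouts, and nothing here touches disk. The names `QGramIndex`, `qgrams`, `edit_distance` and `search` are hypothetical, chosen only for illustration. The sketch applies a length filter and, as a swapped-in technique the abstract does not name, the standard count filter (strings within edit distance k share at least max(|s|, |t|) - q + 1 - k*q q-grams); it does not implement the prefix filter.

```python
from collections import Counter, defaultdict


def qgrams(s, q):
    """Return the bag (multiset) of overlapping q-grams of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


class QGramIndex:
    """Illustrative in-memory index (an assumption, not the paper's design)."""

    def __init__(self, strings, q=2):
        self.q = q
        self.strings = list(strings)
        self.postings = defaultdict(dict)   # gram -> {string id: occurrences}
        self.by_length = defaultdict(list)  # length -> [string ids]
        for sid, s in enumerate(self.strings):
            self.by_length[len(s)].append(sid)
            for gram, n in qgrams(s, q).items():
                self.postings[gram][sid] = n

    def search(self, query, k):
        """Return all indexed strings within edit distance k of query."""
        # Merge the inverted lists of the query's q-grams to count,
        # per string, how many q-grams it shares with the query.
        shared = Counter()
        for gram, n in qgrams(query, self.q).items():
            for sid, m in self.postings.get(gram, {}).items():
                shared[sid] += min(n, m)

        results = []
        # Length filter: only lengths within k of the query can match.
        for length in range(max(0, len(query) - k), len(query) + k + 1):
            # Count filter: minimum number of shared q-grams required.
            need = max(len(query), length) - self.q + 1 - k * self.q
            for sid in self.by_length.get(length, []):
                if shared[sid] < need:
                    continue  # pruned without computing edit distance
                if edit_distance(query, self.strings[sid]) <= k:
                    results.append(self.strings[sid])
        return sorted(results)
```

A call such as `QGramIndex(["kitten", "sitten", "banana"], q=2).search("kitten", 1)` returns the indexed strings within one edit of the query. In a disk-resident setting the inverted lists would live on disk, so the cost of the list-merging step is dominated by I/O, which is the cost the abstract says LPA-Index aims to reduce.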

Citation (APA)

Wang, J., & Yang, D. (2015). Efficient string similarity search on disks. In Communications in Computer and Information Science (Vol. 503, pp. 48–55). Springer Verlag. https://doi.org/10.1007/978-3-662-46248-5_7
