Self-indexes, data structures that simultaneously provide fast search of and access to compressed text, are promising for genomic data, but in their usual form they cannot exploit the high level of replication present in a collection of related genomes. Our 'RLZ' approach is to store a self-index for a base sequence and then compress every other sequence as an LZ77 encoding relative to the base. For a collection of r sequences totaling N bases, with a total of s point mutations from a base sequence of length n, this representation requires just nH_k(T) + s log n + s log(N/s) + O(s) bits. At the cost of negligible extra space, access to ℓ consecutive symbols requires O(ℓ + log n) time. Our experiments show that, for example, RLZ can represent individual human genomes in around 0.1 bits per base while supporting rapid access and using relatively little memory. © 2010 Springer-Verlag.
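The idea of encoding a sequence relative to a base can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a naive quadratic search for the longest match in place of the self-index described in the abstract, and the function names `rlz_factorize` and `rlz_decode`, as well as the factor representation, are hypothetical.

```python
from typing import List, Tuple, Union

# A factor is either a (position, length) copy from the base sequence,
# or a single literal character that does not occur in the base.
Factor = Union[Tuple[int, int], str]


def rlz_factorize(base: str, target: str) -> List[Factor]:
    """Greedily parse `target` into longest matches against `base`.

    Assumption: brute-force search over all base positions. A self-index
    over the base would make this step far faster on real genomes.
    """
    factors: List[Factor] = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        for p in range(len(base)):
            length = 0
            while (p + length < len(base)
                   and i + length < len(target)
                   and base[p + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = p, length
        if best_len == 0:
            # Symbol absent from the base: emit it as a literal.
            factors.append(target[i])
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors


def rlz_decode(base: str, factors: List[Factor]) -> str:
    """Rebuild the target by copying the referenced substrings of the base."""
    parts = []
    for factor in factors:
        if isinstance(factor, str):
            parts.append(factor)
        else:
            pos, length = factor
            parts.append(base[pos:pos + length])
    return "".join(parts)
```

With base `ACGTACGTTAGC` and a target that differs by one substitution, `ACGTACCTTAGC`, the greedy parse yields three factors. This mirrors the intuition behind the space bound: the number of factors grows with the number of mutations s, not with the length of the sequence.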
CITATION STYLE
Kuruppu, S., Puglisi, S. J., & Zobel, J. (2010). Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 6393 LNCS, pp. 201–206). https://doi.org/10.1007/978-3-642-16321-0_20