Obtaining maximal concatenated phylogenetic data sets from large sequence databases

Abstract

To improve the accuracy of tree reconstruction, phylogeneticists are extracting increasingly large multigene data sets from sequence databases. Determining whether a database contains at least k genes sampled from at least m species is an NP-complete problem. However, the skewed distribution of sequences in these databases permits all such data sets to be obtained in reasonable computing times even for large numbers of sequences. We developed an exact algorithm for obtaining the largest multigene data sets from a collection of sequences. The algorithm was then tested on a set of 100,000 protein sequences of green plants and used to identify the largest multigene ortholog data sets having at least 3 genes and 6 species. The distribution of sizes of these data sets forms a hollow curve, and the largest are surprisingly small, ranging from 62 genes by 6 species, to 3 genes by 65 species, with more symmetrical data sets of around 15 taxa by 15 genes. These upper bounds to sequence concatenation have important implications for building the tree of life from large sequence databases.
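The search problem described in the abstract can be illustrated on a toy example. The sketch below is not the authors' exact algorithm, which the abstract does not specify. It is a brute-force enumeration, exponential in the number of genes, that treats the database as a gene-by-species presence table and reports every maximal data set with at least `min_genes` genes and `min_species` species. The function name, the gene names, and the species labels are all hypothetical.

```python
from itertools import combinations


def maximal_datasets(presence, min_genes, min_species):
    """Enumerate maximal complete gene-by-species data sets (brute force).

    presence maps each gene name to the set of species that have a
    sequence for that gene. A data set is "maximal" when no gene or
    species can be added without introducing a missing sequence.
    Illustrative only: the running time grows exponentially with the
    number of genes.
    """
    genes = sorted(presence)
    found = set()
    for size in range(min_genes, len(genes) + 1):
        for gene_subset in combinations(genes, size):
            # Species sampled for every gene in this subset.
            shared = set.intersection(*(presence[g] for g in gene_subset))
            if len(shared) < min_species:
                continue
            # Extend the gene set with every gene covering all shared species,
            # so that the resulting pair cannot be enlarged further.
            closed = frozenset(g for g in genes if shared <= presence[g])
            found.add((closed, frozenset(shared)))
    return found


# Hypothetical toy database: four genes sampled unevenly across species A-D.
presence = {
    "rbcL": {"A", "B", "C", "D"},
    "atpB": {"A", "B", "C"},
    "matK": {"A", "B"},
    "ndhF": {"C", "D"},
}

result = maximal_datasets(presence, min_genes=2, min_species=2)
for gene_set, species_set in sorted(result, key=lambda pair: sorted(pair[0])):
    print(sorted(gene_set), "x", sorted(species_set))
```

Even in this small table the trade-off noted in the abstract is visible: adding a gene to a data set can only keep or shrink the set of species it covers, so the maximal data sets range from gene-rich with few species to species-rich with few genes.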

Sanderson, M. J., Driskell, A. C., Ree, R. H., Eulenstein, O., & Langley, S. (2003). Obtaining maximal concatenated phylogenetic data sets from large sequence databases. Molecular Biology and Evolution, 20(7), 1036–1042. https://doi.org/10.1093/molbev/msg115
