Construction of a public CHO cell line transcript database using versatile bioinformatics analysis pipelines

Abstract

Chinese hamster ovary (CHO) cell lines represent the most commonly used mammalian expression system for the production of therapeutic proteins. In this context, detailed knowledge of the CHO cell transcriptome might help to improve biotechnological processes conducted by specific cell lines. Nevertheless, very few assembled cDNA sequences of CHO cells were publicly released until recently, which puts a severe limitation on biotechnological research. Two extended annotation systems and web-based tools, one for browsing eukaryotic genomes (GenDBE) and one for viewing eukaryotic transcriptomes (SAMS), were established as the first step towards a publicly usable CHO cell genome/transcriptome analysis platform. This is complemented by the development of a new strategy to assemble the ca. 100 million reads, sequenced from a broad range of diverse transcripts, into a high-quality CHO cell transcript set. The cDNA libraries were constructed from different CHO cell lines grown under various culture conditions and sequenced using Roche/454 and Illumina sequencing technologies in addition to sequencing reads from a previous study. Two pipelines to extend and improve the CHO cell line transcripts were established. First, de novo assemblies were carried out with the Trinity and Oases assemblers, using varying k-mer sizes. The resulting contigs were screened for potential CDS using ESTScan. Redundant contigs were filtered out using cd-hit-est. The remaining CDS contigs were re-assembled with CAP3. Second, a reference-based assembly with the TopHat/Cufflinks pipeline was performed, using the recently published draft genome sequence of CHO-K1 as reference. Additionally, the de novo contigs were mapped to the reference genome using GMAP and merged with the Cufflinks assembly using the cuffmerge software. With this approach, 28,874 transcripts located on 16,492 gene loci could be assembled. Combining the results of both approaches, 65,561 transcripts were identified for CHO cell lines, which could be clustered by sequence identity into 17,598 gene clusters. © 2014 Rupp et al.
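As a quick worked check on the counts quoted in the abstract, the sketch below computes the average number of transcripts per gene locus (reference-based assembly) and per gene cluster (combined set). The helper name `mean_per_group` is hypothetical, and the averages say nothing about how transcripts are actually distributed across loci or clusters.

```python
def mean_per_group(transcripts: int, groups: int) -> float:
    """Average number of transcripts per group (locus or cluster)."""
    return transcripts / groups


# Counts taken from the abstract.
ref_based = mean_per_group(28_874, 16_492)   # transcripts per gene locus
combined = mean_per_group(65_561, 17_598)    # transcripts per gene cluster

print(f"reference-based: {ref_based:.2f} transcripts per locus")    # 1.75
print(f"combined set:    {combined:.2f} transcripts per cluster")   # 3.73
```

Note that loci (defined by genome position) and clusters (defined by sequence identity) are different groupings, so the two averages are not directly comparable.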

Figures

  • Table 1. Next-generation RNA sequencing data from CHO cell lines analyzed.
  • Figure 1. Workflow for the reference-based and the non-reference-based re-assembly of CHO cell transcripts. The left side shows the reference-based pipeline, the right side the non-reference-based pipeline. Different colors represent the different processes: assembly steps, red;
  • Figure 2. Number of CHO cell transcripts assembled with Cufflinks, Trinity, and Oases. K-mer sizes vary between 23 and 135 for the Oases assembly. doi:10.1371/journal.pone.0085568.g002
  • Figure 3. Length distribution of the transcripts assembled with Cufflinks, Trinity, and Oases.
  • Figure 4. Comparison of the proportions of correctly assembled transcripts and misassemblies. All transcripts with a significant BLASTp hit against the mouse reference protein set were classified into “correct” (red), “short” (green) and “false” (blue) assembled transcripts. doi:10.1371/journal.pone.0085568.g004
  • Figure 5. “u80-metric” comparison of individual transcriptome assemblies. The comparative u80-metric results for the single Cufflinks, Trinity and Oases assemblies, the combined assemblies, the results of the reference-based re-assembly (ref-based), the non-reference-based re-assembly (non-ref-based) and the final transcript set (final transcripts) are compared to two publicly available CHO cell transcript sets, Xu et al. [13] and Becker et al. [14]. doi:10.1371/journal.pone.0085568.g005
  • Figure 6. Unique u80-metric mouse proteins for the individual assemblies. Almost all individual assemblies (52 of 59) have transcripts with an ungapped alignment covering a mouse protein by more than 80% that are not present in the other assemblies. doi:10.1371/journal.pone.0085568.g006
  • Table 2. Estimation of the number of clusters with paralogous genes.
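The u80-metric in Figures 5 and 6 can be illustrated with a minimal sketch. It assumes only what the Figure 6 caption states: a mouse protein counts when some transcript has an ungapped alignment covering more than 80% of it. The `Hit` record, its field names, and both function names are hypothetical, and the authors' actual implementation may differ in detail.

```python
from typing import Dict, Iterable, NamedTuple, Set


class Hit(NamedTuple):
    """One transcript-to-protein alignment (hypothetical record layout)."""
    transcript: str
    protein: str
    protein_start: int   # 1-based, inclusive
    protein_end: int     # 1-based, inclusive
    gap_opens: int       # 0 means the alignment is ungapped
    protein_length: int


def u80_proteins(hits: Iterable[Hit], min_percent: int = 80) -> Set[str]:
    """Proteins covered by more than min_percent % in an ungapped alignment."""
    covered = set()
    for h in hits:
        if h.gap_opens != 0:
            continue  # only ungapped alignments count
        aligned = h.protein_end - h.protein_start + 1
        # Integer comparison avoids floating-point edge cases at exactly 80%.
        if aligned * 100 > min_percent * h.protein_length:
            covered.add(h.protein)
    return covered


def unique_u80(per_assembly: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each assembly, the u80 proteins found in no other assembly."""
    unique = {}
    for name, proteins in per_assembly.items():
        others = set()
        for other_name, other_proteins in per_assembly.items():
            if other_name != name:
                others |= other_proteins
        unique[name] = proteins - others
    return unique


hits = [
    Hit("t1", "P1", 1, 81, 0, 100),   # 81% ungapped: counts
    Hit("t2", "P2", 1, 80, 0, 100),   # exactly 80%: not more than 80%
    Hit("t3", "P3", 1, 95, 2, 100),   # gapped: ignored
]
print(u80_proteins(hits))  # {'P1'}
```

Applying `unique_u80` to the per-assembly sets gives the kind of count shown in Figure 6, that is, how many proteins each individual assembly contributes that no other assembly recovers.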

Citation (APA)

Rupp, O., Becker, J., Brinkrolf, K., Timmermann, C., Borth, N., Pühler, A., … Goesmann, A. (2014). Construction of a public CHO cell line transcript database using versatile bioinformatics analysis pipelines. PLoS ONE, 9(1). https://doi.org/10.1371/journal.pone.0085568
