Classification of raw reads

To comprehensively characterize transcript isoforms in cotton, high-quality RNA was extracted from leaves of A2, D5 and TM-1. The RNA samples were divided into multiple technical replicates. After quality control, a total of 1,785,767 high-quality consensus reads were obtained, including 613,321 from the A2 libraries, 487,063 from the D5 libraries and 685,383 from the TM-1 libraries. Of these consensus reads, 289,429 (c. 16.21%) were classified as non-full-length transcripts and 1,495,041 (c. 83.72%) were classified as full-length transcripts based on the presence of the 5' primer, the 3' primer and the poly(A) tail. Gene discovery approached saturation (Figure 1). Short reads with a length of <300 bp (1,297, c. 0.07%) and chimeric reads (66,279, c. 3.71%) were excluded from subsequent analysis.
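
The percentages quoted above can be re-derived from the raw counts. The following is a minimal sketch that uses only the numbers reported in this section; the variable names are illustrative, and treating the non-full-length, full-length and short (<300 bp) classes as a partition of the total is an inference from the fact that they sum to it, not something the text states explicitly.

```python
# Consensus read counts per library, as reported in the text.
library_reads = {"A2": 613_321, "D5": 487_063, "TM-1": 685_383}

# Read classes, as reported in the text.
read_classes = {
    "non_full_length": 289_429,
    "full_length": 1_495_041,
    "short_lt_300bp": 1_297,
}
chimeric_reads = 66_279  # reported separately; not added to the classes above

total_reads = sum(library_reads.values())


def percent(count, whole):
    """Return count as a percentage of whole, rounded to two decimals."""
    return round(100 * count / whole, 2)


# Assumption: the three classes partition the total (they sum to it exactly).
classes_sum_to_total = sum(read_classes.values()) == total_reads

print(f"total consensus reads: {total_reads:,}")
for name, count in read_classes.items():
    print(f"{name}: {count:,} ({percent(count, total_reads)}%)")
print(f"chimeric: {chimeric_reads:,} ({percent(chimeric_reads, total_reads)}%)")
print(f"classes sum to total: {classes_sum_to_total}")
```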

To further improve the accuracy of the consensus reads and generate high-quality non-redundant reads, non-full-length reads of insert were aligned to the consensus reads, and the optimized consensus reads were then clustered by mapping them to the corresponding genome with GMAP. After clustering, a total of 612,170 high-quality optimized reads and 217,185 isoforms were obtained: 209,265 reads and 72,393 isoforms in A2, 157,049 reads and 55,381 isoforms in D5, and 245,865 reads and 89,411 isoforms in TM-1. Full-length transcripts were aligned separately to the G. arboreum, G. raimondii and G. hirsutum genomes. For G. arboreum, 69,179 isoforms were aligned to reference gene models, and 20,907 annotated genes corresponding to 20,907 transcripts were supported by at least one full-length read. For G. raimondii, 54,058 isoforms were aligned to reference gene models, and 18,904 annotated genes corresponding to 24,368 transcripts were supported by at least one full-length read. For G. hirsutum, 83,281 isoforms were aligned to reference gene models, and 32,003 annotated genes corresponding to 39,098 transcripts were supported by at least one full-length read.
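
To compare the three alignments on a common scale, the counts above can be turned into two simple ratios per species. This is a rough sketch, not part of the original analysis: it assumes the isoforms aligned to reference gene models are a subset of that species' isoform total, and the dictionary keys and the `summarize` helper are hypothetical names chosen for illustration.

```python
# Counts copied from the paragraph above, keyed by accession.
stats = {
    "A2":   {"isoforms": 72_393, "aligned": 69_179, "genes": 20_907, "transcripts": 20_907},
    "D5":   {"isoforms": 55_381, "aligned": 54_058, "genes": 18_904, "transcripts": 24_368},
    "TM-1": {"isoforms": 89_411, "aligned": 83_281, "genes": 32_003, "transcripts": 39_098},
}


def summarize(entry):
    """Return (% of isoforms aligned to gene models, supported transcripts per gene)."""
    # Assumption: "aligned" isoforms are a subset of the species' isoform total.
    aligned_pct = 100 * entry["aligned"] / entry["isoforms"]
    transcripts_per_gene = entry["transcripts"] / entry["genes"]
    return round(aligned_pct, 2), round(transcripts_per_gene, 2)


summary = {name: summarize(entry) for name, entry in stats.items()}
total_isoforms = sum(entry["isoforms"] for entry in stats.values())

for name, (aligned_pct, per_gene) in summary.items():
    print(f"{name}: {aligned_pct}% aligned, {per_gene} supported transcripts per gene")
print(f"total isoforms: {total_isoforms:,}")
```

The per-species isoform counts sum to the reported total of 217,185, which the last line of the sketch recomputes.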

The mean length of PacBio full-length isoforms (A2, 2,351 bp; D5, 2,390 bp; TM-1, 2,488 bp) was generally longer than the mean length of the reference transcripts (A2, 1,181 bp; D5, 1,852 bp; TM-1, 2,074 bp). Because the A2 genome annotation does not provide predicted transcript sequences, CDS sequences were used in place of transcript sequences.
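
The size of the length difference is easier to see as a ratio. The sketch below uses only the mean lengths quoted above; the names are illustrative. Note that the A2 reference value is based on CDS sequences, so its ratio is not directly comparable with the other two.

```python
# Mean lengths in bp, as reported in the text: (PacBio isoforms, reference).
mean_lengths = {
    "A2": (2351, 1181),   # reference value is CDS-based, not transcript-based
    "D5": (2390, 1852),
    "TM-1": (2488, 2074),
}

# Ratio of PacBio mean isoform length to reference mean length.
length_ratios = {
    name: round(pacbio / reference, 2)
    for name, (pacbio, reference) in mean_lengths.items()
}

for name, ratio in length_ratios.items():
    pacbio, reference = mean_lengths[name]
    print(f"{name}: {pacbio} bp vs {reference} bp (ratio {ratio})")
```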

We also found that PacBio Iso-Seq detected a larger number of multi-exon transcripts.
