01三代测序数据原理
Last updated
Last updated
两种测序model:
CCS Circular consensus sequence 测序精度更高
Continuous Long Read sequencing (CLR)测序长度会更长一些
Iso-seq测序知识
CSS,在反转录得到的cDNA序列两端加上引物进行扩增后,在两端接上环状的测序接头。在PacBIo边合成边测序时形成一个环状的分子,从而循环的进行测序。
当测序把正负链都测了一次是叫做一个full pass,而要生成CCS至少需要两轮full pass;才能用于自我矫正。当转录本很短的时候测两个full pass很容易;但是当长度达到3K时要测2个full pass就需要测12kb长度而此时零模波导孔测的长度达不到那么高。
为了提高对raw read利用,于是就有了ROI序列 reads of insert的概念。
存在两条转录本窜连在一起的情况,主要是由于在文库制备时,adapter浓度会导致两条read窜起来,在后续分析中需要去除这种read。
不完全延伸的产物作为下一次扩增反应的引物导致嵌合序列的形成
剩下的就是clean read ,因此从raw到ROI后,得到的就是转录本序列;这些clean read进一步根据5'引物、3’引物和ploy A结构的存在与否将read分为:
classify
full-length reads
none full-length reads
由于三代测序存在一定的误差,可以将冗余的read进行聚类规避这种测序误差。将得到的全长read进行一致性聚类从而得到最终consensus transcript isoforms
https://github.com/ben-lerch/IsoSeq-3.0
https://github.com/PacificBiosciences/IsoSeq/blob/master/isoseq-clustering.md
使用SMRT v7
数据处理主要分成三个阶段:
CCS
Classify
Cluster
基于原始下机数据构建circular consensus sequences
Where ccs.xml
is the XML file you generated in Step 1.
Where isoseq_flnc.fasta
contains only the full-length, non-chimeric reads.
And where isoseq_nfl.fasta
contains all non-full-length reads.
输出文件:
isoseq_flnc.fasta
contains all full-length, non-artificial-concatemer reads.
isoseq_nfl.fasta
contains all non-full-length reads.
isoseq_draft.fasta
is an intermediate file in order to get full-length reads, which you can ignore
FASTA文件格式
//__CCS INFO
The info fields 包含以下信息:
strand: either + or -, whether a read is forward or reverse-complement cDNA,
fiveseen: whether or not 5' prime is seen in this read, 1 yes, 0 no
polyAseen: whether or not poly A tail is seen, 1 yes, 0 no
threeseen: whether or not 3' prime is seen, 1 yes, 0 no
fiveend: start position of 5'
threeend: start position of 3' in read
polyAend: start position of polyA in read
primer: index of primer seen in this read (remember primer fasta file >F0 xxxxx >R0 xxxxx >F1 xxxxx >R1 xxxx)
chimera: whether or not this read is classified as a chimeric cdna
FLnc-read的数目比FL的数目仅仅少一点,说明嵌合read的数目很少间接说明建库比较成功。
Optionally, you may call the following command to run ICE and create unpolished consensus isoforms only.
输出文件:
Summary (cluster_summary.txt) This file contains the following statistics:
Number of consensus isoforms
Average read length of consensus isoforms
Report (cluster_report.csv) This is a csv file each line of which contains the following fields:
cluster_id: ID of a consensus isoforms from ICE.
read_id : ID of a read which supports the consensus isoform.
read_type : Type of the supportive read
将聚类后的read簇并且polished后, 使用Arrow将isoform进行矫正得到higher-quality consensus sequence
quivered.hq.fasta
quivered.lq.fasta
Iso-Seq Cluster generates polished consensus isoforms are classified into either high-quality or low-quality isoforms. We classify an isoform as high quality if its consensus accuracy is no less than a cut-off, otherwise low quality. The default cut-off is 0.99. You may change this value from command line, or via SMRT Link Advanced Analysis Parameters when creating an Iso-Seq job
使用collapse_isoforms_by_sam.py
对聚类和polished的consensus isoforms进行去冗余得到transcripts.