01 Principles of Third-Generation Sequencing Data

PacBio Raw Data Processing

Two sequencing modes:

  1. CCS (Circular Consensus Sequencing): higher per-read accuracy

  2. CLR (Continuous Long Read sequencing): longer read lengths

Iso-Seq background

https://www.cnblogs.com/xudongliang/p/7473463.html

In CCS mode, primers are added to both ends of the reverse-transcribed cDNA for amplification, and hairpin (SMRTbell) sequencing adapters are then ligated to both ends. During PacBio sequencing-by-synthesis this forms a circular molecule, so the polymerase goes around the template repeatedly and the insert is sequenced in a loop.

Sequencing both the forward and the reverse strand once counts as one full pass, and generating a CCS read requires at least two full passes so that the read can self-correct. For short transcripts, two full passes are easy to obtain; but for a transcript of about 3 kb, two full passes require roughly 12 kb of raw sequence, and the read length achievable in a zero-mode waveguide often cannot reach that.
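As a quick check of that number, using the definition above that one full pass covers both strands (i.e. twice the insert length):

required raw read length ≈ insert length × 2 strands × 2 passes = 3 kb × 2 × 2 = 12 kb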

To make better use of the raw reads, the concept of the ROI (reads of insert) was introduced.

There are two kinds of erroneous reads:

Two transcripts may be concatenated into a single read, mainly because the adapter concentration during library preparation can allow two molecules to be strung together; these reads need to be removed in downstream analysis.

Incompletely extended products can serve as primers in the next round of amplification, leading to the formation of chimeric sequences.

The remaining reads are clean reads, so going from raw reads to ROIs yields the transcript sequences. Based on the presence or absence of the 5' primer, the 3' primer, and the poly(A) structure, these clean reads are further divided into:

classify

  1. full-length reads

  2. non-full-length reads

Because third-generation sequencing has an appreciable error rate, redundant reads can be clustered to compensate for these errors: the full-length reads are clustered and a consensus is taken to obtain the final consensus transcript isoforms.

Raw data processing

https://github.com/ben-lerch/IsoSeq-3.0

https://github.com/PacificBiosciences/IsoSeq/blob/master/isoseq-clustering.md

Using SMRT v7

Data processing is divided into three main stages:

  1. CCS

  2. Classify

  3. Cluster

1. Building the CCS reads

Build circular consensus sequences from the raw data off the sequencer (subreads).

https://ccs.how/

module load SMRTLink/6.0.0.47841
ccs --noPolish --minLength=300 --minPasses=1 --minZScore=-999 --maxDropFraction=0.8 --minPredictedAccuracy=0.8 --minSnr=4 subreads.bam ccs.bam 

dataset create --type ConsensusReadSet ccs.xml ccs.bam
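A quick sanity check on the CCS output before moving on (a minimal sketch; samtools is assumed to be available in addition to the SMRT Link module):

samtools flagstat ccs.bam              # how many CCS reads were produced
samtools view ccs.bam | head -n 2      # peek at the first records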

2. Classifying the CCS reads

pbtranscript classify [OPTIONS] ccs.xml isoseq_draft.fasta --flnc=isoseq_flnc.fasta --nfl=isoseq_nfl.fasta

Where ccs.xml is the XML file you generated in Step 1, isoseq_flnc.fasta will contain only the full-length, non-chimeric reads, and isoseq_nfl.fasta will contain all non-full-length reads.

Output files:

  1. isoseq_flnc.fasta contains all full-length, non-artificial-concatemer reads.

  2. isoseq_nfl.fasta contains all non-full-length reads.

  3. isoseq_draft.fasta is an intermediate file used to obtain the full-length reads; you can ignore it.

FASTA file format


The info fields in each read's FASTA header contain the following information (an example header is shown after this list):

  • strand: either + or -, whether a read is forward or reverse-complement cDNA,

  • fiveseen: whether or not the 5' primer is seen in this read, 1 yes, 0 no

  • polyAseen: whether or not a poly(A) tail is seen, 1 yes, 0 no

  • threeseen: whether or not the 3' primer is seen, 1 yes, 0 no

  • fiveend: start position of the 5' primer in the read

  • threeend: start position of the 3' primer in the read

  • polyAend: start position of the poly(A) tail in the read

  • primer: index of primer seen in this read (remember primer fasta file >F0 xxxxx >R0 xxxxx >F1 xxxxx >R1 xxxx)

  • chimera: whether or not this read is classified as a chimeric cDNA
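For reference, a header line in isoseq_flnc.fasta looks roughly like the example below; the movie name, coordinates, and field values are made up for illustration, and the exact field order may vary between pbtranscript versions:

>m54000_180101_000000/4391567/31_2150_CCS strand=+;fiveseen=1;polyAseen=1;threeseen=1;fiveend=31;threeend=2180;polyAend=2150;primer=1;chimera=0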

The number of FLNC reads is only slightly smaller than the number of FL reads, which means there are very few chimeric reads and indirectly indicates that library construction went well.
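The FLNC and non-full-length counts can be pulled directly from the classify output FASTA files (a minimal sketch):

grep -c "^>" isoseq_flnc.fasta   # number of full-length, non-chimeric (FLNC) reads
grep -c "^>" isoseq_nfl.fasta    # number of non-full-length reads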

3. Clustering the full-length (FL) transcript reads

 pbtranscript cluster [OPTIONS] isoseq_flnc.fasta polished_clustered.fasta --quiver --nfl=isoseq_nfl.fasta --bas_fofn=my.subreadset.xml

Optionally, you may call the following command to run ICE and create unpolished consensus isoforms only.

 pbtranscript cluster [OPTIONS] isoseq_flnc.fasta unpolished_clustered.fasta

Output files:

Summary (cluster_summary.txt) This file contains the following statistics:

  • Number of consensus isoforms

  • Average read length of consensus isoforms

Report (cluster_report.csv) This is a CSV file, each line of which contains the following fields:

  • cluster_id: ID of a consensus isoform from ICE.

  • read_id: ID of a read which supports the consensus isoform.

  • read_type: Type of the supporting read

After clustering, the read clusters are polished: Arrow/Quiver is used to correct each isoform, yielding higher-quality consensus sequences:

  • quivered.hq.fasta

  • quivered.lq.fasta

Iso-Seq Cluster generates polished consensus isoforms, which are classified as either high-quality or low-quality. An isoform is classified as high quality if its consensus accuracy is no less than a cut-off; otherwise it is low quality. The default cut-off is 0.99. You may change this value from the command line, or via SMRT Link Advanced Analysis Parameters when creating an Iso-Seq job.
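The cut-off is exposed as a pbtranscript cluster option; because the exact flag name can differ between versions, it is safest to look it up in the help text rather than guess (a minimal sketch):

pbtranscript cluster --help | grep -i -A 2 accuracy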

4. Collapsing consensus isoforms into non-redundant full-length transcripts

Use collapse_isoforms_by_sam.py to collapse the clustered and polished consensus isoforms into a non-redundant set of transcripts.
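A minimal sketch of this last step, assuming the cDNA_Cupcake version of collapse_isoforms_by_sam.py and minimap2 for the genome alignment; reference.fa stands in for your genome FASTA, and the script's options may differ between versions, so check its --help first:

# Align the polished high-quality isoforms to the reference genome
minimap2 -ax splice -uf --secondary=no -t 8 reference.fa quivered.hq.fasta > hq_isoforms.sam

# The collapse script expects the SAM sorted by target name and start position
sort -k 3,3 -k 4,4n hq_isoforms.sam > hq_isoforms.sorted.sam

# Collapse redundant isoforms into a non-redundant set of transcripts
collapse_isoforms_by_sam.py --input quivered.hq.fasta -s hq_isoforms.sorted.sam -o hq_isoforms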
