01 Principles of Third-Generation Sequencing Data

PacBio Raw Data Processing

Two sequencing modes:

  1. CCS (Circular Consensus Sequencing): higher per-read accuracy

  2. CLR (Continuous Long Read sequencing): longer read lengths

Iso-Seq background

https://www.cnblogs.com/xudongliang/p/7473463.html

In CCS mode, primers are added to both ends of the reverse-transcribed cDNA for amplification, and hairpin (SMRTbell) sequencing adapters are then ligated to both ends. During PacBio sequencing-by-synthesis this forms a circular molecule, so the polymerase goes around the template repeatedly and the insert is sequenced in a loop.

Sequencing both the forward and the reverse strand once counts as one full pass, and generating a CCS read requires at least two full passes so that the read can self-correct. For short transcripts, two full passes are easy to obtain; but for a transcript of about 3 kb, two full passes require roughly 12 kb of raw sequence, and the read length achievable in a zero-mode waveguide often cannot reach that.
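As a quick check of that number, using the definition above that one full pass covers both strands (i.e. twice the insert length):

required raw read length ≈ insert length × 2 strands × 2 passes = 3 kb × 2 × 2 = 12 kb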

To make better use of the raw reads, the concept of the ROI (reads of insert) was introduced.

There are two kinds of erroneous reads:

Two transcripts may be concatenated into a single read, mainly because the adapter concentration during library preparation can allow two molecules to be strung together; these reads need to be removed in downstream analysis.

Incompletely extended products can serve as primers in the next round of amplification, leading to the formation of chimeric sequences.

The remaining reads are clean reads, so going from raw reads to ROIs yields the transcript sequences. Based on the presence or absence of the 5' primer, the 3' primer, and the poly(A) structure, these clean reads are further divided into:

classify

  1. full-length reads

  2. non-full-length reads

Because third-generation sequencing has an appreciable error rate, redundant reads can be clustered to compensate for these errors: the full-length reads are clustered and a consensus is taken to obtain the final consensus transcript isoforms.

Raw data processing

https://github.com/ben-lerch/IsoSeq-3.0

https://github.com/PacificBiosciences/IsoSeq/blob/master/isoseq-clustering.md

Using SMRT v7

Data processing is divided into three main stages:

  1. CCS

  2. Classify

  3. Cluster

1. Building the CCS reads

Build circular consensus sequences from the raw data off the sequencer (subreads).

https://ccs.how/

module load SMRTLink/6.0.0.47841
ccs --noPolish --minLength=300 --minPasses=1 --minZScore=-999 --maxDropFraction=0.8 --minPredictedAccuracy=0.8 --minSnr=4 subreads.bam ccs.bam 

dataset create --type ConsensusReadSet ccs.xml ccs.bam
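A quick sanity check on the CCS output before moving on (a minimal sketch; samtools is assumed to be available in addition to the SMRT Link module):

samtools flagstat ccs.bam              # how many CCS reads were produced
samtools view ccs.bam | head -n 2      # peek at the first records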

2. Classifying the CCS reads

pbtranscript classify [OPTIONS] ccs.xml isoseq_draft.fasta --flnc=isoseq_flnc.fasta --nfl=isoseq_nfl.fasta

Where ccs.xml is the XML file you generated in Step 1, isoseq_flnc.fasta will contain only the full-length, non-chimeric reads, and isoseq_nfl.fasta will contain all non-full-length reads.

Output files:

  1. isoseq_flnc.fasta contains all full-length, non-artificial-concatemer reads.

  2. isoseq_nfl.fasta contains all non-full-length reads.

  3. isoseq_draft.fasta is an intermediate file used to obtain the full-length reads; you can ignore it.

FASTA file format


The info fields in each read's FASTA header contain the following information (an example header is shown after this list):

  • strand: either + or -, whether a read is forward or reverse-complement cDNA,

  • fiveseen: whether or not the 5' primer is seen in this read, 1 yes, 0 no

  • polyAseen: whether or not a poly(A) tail is seen, 1 yes, 0 no

  • threeseen: whether or not the 3' primer is seen, 1 yes, 0 no

  • fiveend: start position of the 5' primer in the read

  • threeend: start position of the 3' primer in the read

  • polyAend: start position of the poly(A) tail in the read

  • primer: index of primer seen in this read (remember primer fasta file >F0 xxxxx >R0 xxxxx >F1 xxxxx >R1 xxxx)

  • chimera: whether or not this read is classified as a chimeric cDNA
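For reference, a header line in isoseq_flnc.fasta looks roughly like the example below; the movie name, coordinates, and field values are made up for illustration, and the exact field order may vary between pbtranscript versions:

>m54000_180101_000000/4391567/31_2150_CCS strand=+;fiveseen=1;polyAseen=1;threeseen=1;fiveend=31;threeend=2180;polyAend=2150;primer=1;chimera=0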

The number of FLNC reads is only slightly smaller than the number of FL reads, which means there are very few chimeric reads and indirectly indicates that library construction went well.
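The FLNC and non-full-length counts can be pulled directly from the classify output FASTA files (a minimal sketch):

grep -c "^>" isoseq_flnc.fasta   # number of full-length, non-chimeric (FLNC) reads
grep -c "^>" isoseq_nfl.fasta    # number of non-full-length reads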

3. Clustering the full-length (FL) transcript reads

 pbtranscript cluster [OPTIONS] isoseq_flnc.fasta polished_clustered.fasta --quiver --nfl=isoseq_nfl.fasta --bas_fofn=my.subreadset.xml

Optionally, you may call the following command to run ICE and create unpolished consensus isoforms only.

 pbtranscript cluster [OPTIONS] isoseq_flnc.fasta unpolished_clustered.fasta

Output files:

Summary (cluster_summary.txt) This file contains the following statistics:

  • Number of consensus isoforms

  • Average read length of consensus isoforms

Report (cluster_report.csv) This is a CSV file, each line of which contains the following fields:

  • cluster_id: ID of a consensus isoform from ICE.

  • read_id: ID of a read which supports the consensus isoform.

  • read_type: Type of the supporting read

After clustering, the read clusters are polished: Arrow/Quiver is used to correct each isoform, yielding higher-quality consensus sequences:

  • quivered.hq.fasta

  • quivered.lq.fasta

Iso-Seq Cluster generates polished consensus isoforms, which are classified as either high-quality or low-quality. An isoform is classified as high quality if its consensus accuracy is no less than a cut-off; otherwise it is low quality. The default cut-off is 0.99. You may change this value from the command line, or via SMRT Link Advanced Analysis Parameters when creating an Iso-Seq job.
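The cut-off is exposed as a pbtranscript cluster option; because the exact flag name can differ between versions, it is safest to look it up in the help text rather than guess (a minimal sketch):

pbtranscript cluster --help | grep -i -A 2 accuracy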

4. Collapsing consensus isoforms into non-redundant full-length transcripts

Use collapse_isoforms_by_sam.py to collapse the clustered and polished consensus isoforms into a non-redundant set of transcripts.
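A minimal sketch of this last step, assuming the cDNA_Cupcake version of collapse_isoforms_by_sam.py and minimap2 for the genome alignment; reference.fa stands in for your genome FASTA, and the script's options may differ between versions, so check its --help first:

# Align the polished high-quality isoforms to the reference genome
minimap2 -ax splice -uf --secondary=no -t 8 reference.fa quivered.hq.fasta > hq_isoforms.sam

# The collapse script expects the SAM sorted by target name and start position
sort -k 3,3 -k 4,4n hq_isoforms.sam > hq_isoforms.sorted.sam

# Collapse redundant isoforms into a non-redundant set of transcripts
collapse_isoforms_by_sam.py --input quivered.hq.fasta -s hq_isoforms.sorted.sam -o hq_isoforms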
