第三个结果

splice junction

MaxEntScan 剪切位点的评分到时候,算一下

U1 snRNP regulates chromatin retention of noncoding RNAs

提取剪切位点的二核苷酸序列

  • 使用hisat2中extract_splice_sites.py脚本提取剪切位点

提取的位点中内含子和前面一个外显子(不区分正负链)之间有一个碱基的交集;

  • 使用bedtools中fastaFromBed脚本提取对应的碱基序列

    • -tab 使用tab分割的输出格式

    • -name+参数,保留正负链信息

  • awk提取对应的二核苷酸信息

  • 负链的核酸信息需要互补一下

剪切文件|awk '$1~/^-/{printf $1"\t";printf substr($2,length($2),1);printf substr($2,length($2)-1,1);printf substr($2,3,1);printf substr($2,2,1);printf "\n"}$1~/^+/{printf $1"\t";printf substr($2,2,2);printf substr($2,length($2)-1,2);printf "\n"}' >end

tmpFile1=`mktemp`
tmpFile2=`mktemp`
tmpFile3=`mktemp`
awk '$1~/^-/{print $1}' end >${tmpFile1}
awk '$1~/^-/{print $2}' end|tr "ATCG" "TAGC" >${tmpFile2}
paste tmpFile1 ${tmpFile2} >${tmpFile3}
awk '$1~/^+/{print $0}' end|cat - ${tmpFile3} >${tmpFile1}
mv ${tmpFile1} end

## 统计Is-Seq数目
cat  isoseq_splice.txt|sort -k1,1 -k2,3n |uniq|wc -l
cat isoseq_splice.txt  参考基因组剪切文件|sort -k1,1 -k2,3n |uniq -d |cat - isoseq_splice.txt|sort -k1,1 -k2,3n|uniq -u >novel 剪切
##统计对应的基因数
cat  isoseq.info.gtf |awk -F ";" '{print $(NF-1)}'|sed -e  's/orginal_gene_id "//g' -e 's/"//'|sort |uniq |wc -l

统计与参考基因组相比SJs的情况

基因组

Iso-Seq

参考基因组

novel SJs

gene models

TM-1

226147

332822

42138

32311

A2

144558

165885

37884

21187

D5

127760

175343

19336

19080

剪切位点信号 dinucleotide signature

> >

D5 117860 /127760

A2 131520/144558

TM-1 216906/226147

isform水平

  • PacBio预测的isform

  • 对应的gene models

  • 在参考基因组中isform的水平

## isform数目
awk '$3~/transcript/{print $0}' ../merge.gtf|grep "transcript_id \"P" |grep "orginal_gene_id \"G"|wc -l
## 对应的基因
awk '$3~/transcript/{print $0}' ../merge.gtf|grep "transcript_id \"P" |grep "orginal_gene_id \"G"|awk -F ";" '{print $(NF-1)}'|sed -e  's/orginal_gene_id "//g' -e 's/"//'|sort |uniq >gene_id
awk '$3~/transcript/{print $0}' ../merge.gtf |grep "orginal_gene_id \"G"|awk -F ";" '{print $(NF-1)}'|sed -e  's/orginal_gene_id "//g' -e 's/"//'|sort |uniq -c |awk '{print $2"\t"$1}' >Iso-Seq_isformCount.txt 
## 统计Iso-Seq覆盖的基因
cat gene_id |xargs -I {} grep {} Iso-Seq_isformCount.txt >222
## 参考基因组中每个基因的转录本数目统计
 awk '$3~/mRNA/{print $9}' ~/work/Alternative/data/Ghirsutum_genome_HAU_v1.0/Ghirsutum_gene_model.gff3|awk -F "Parent=" '{print $2}' |sort |uniq -c |awk '{print $2"\t"$1}' >reference_isformCount.txt 
## 参考基因组中isform情况
cat gene_id|xargs -I {} grep {} 参考基因组.gff>参考基因组中isform情况

基因组

Iso-Seq预测isform

每个基因平均isform数

gene

TM-1

83881

3.39907 vs 2.01395

32182

A2

69701

21038

D5

53393

4.26125 vs 2.7397

19001

类型

isform为1

isform>5

TM-1 Iso-seq

9050

7557

TM-1 ref

18780

2570

D5 Iso-seq

3487

6898

D5 ref

7107

3164

A2 iso-seq

6352

4846

A2 ref

100%

0%

Splice junctions were respectively characterized in A2、D5、TM-1. In A2 We indentified 144558 SJs from 21187 gene models defined by Iso-seq data compared with SJs(165885) in the reference annotion,and there are 37884 novel SJs. In D5 We identified 127760 SJs from 19080 gene model sdefined by Iso-Seq data compared with Sjs 175343 in the reference annotion ,and there are 19336 novel SJs. In TM-1 We identified 226147 SJs from 32311 gene models defined By Iso-seq data compared with SJs 332822 in the reference annotion and there are 42138 novel SJs. The signature of terminal dinucleotide was investiated for all SJs defined By Iso-Seq data. We found that the GT-AG type general occupied a dominant poportion of intron borders, 91.0% A2 、92.3% D5、95.9% TM1, which consistentwith the previous study in other specises.

Merged Iso-Seq annotion with reference annotion and used a customed python script to identify Alter splice(AS).

At isform level, we identified 83881 new full-length transcripts from the 32182 reference gene models in TM-1, 69701 new full-length transcripts from the 21038 reference gene models in A2, 53393 new full-length transcripts from the 19001 reference gene models. And we found 58.4%、37.4% of genes in the original annotion were defined by a single transcrip isform in TM-1 、D5. After analysis of the Iso-Seq data, only 28.1%、18.4% of the genes defined by only a single transcript.

多聚腺苷酸位点的差异

for the first exon only the donor site is described as the first position is defined as transcription start site. Likewise, the last exon does not contain a donor splice site as the position is defined as polyadenylation site

##正链
awk '$3~/transcript/{print $0}' ../merge.gtf|grep "transcript_id \"P" |grep "orginal_gene_id \"G"|cut -f1,4,5,7,9|awk -F"\t" '$4~/+/{split($5,a,";");print $1,$3-51,$3+50,a[3]}' OFS="\t" |sed -e  's/orginal_gene_id "//g' -e 's/"//' >Postive_stand
## 负链
awk '$3~/transcript/{print $0}' ../merge.gtf|grep "transcript_id \"P" |grep "orginal_gene_id \"G"|cut -f1,4,5,7,9|awk -F "\t" '$4~/-/{split($5,a,";");print $1,$2-51,$2+50,a[3]}'  OFS="\t"  |sed -e  's/orginal_gene_id "//g' -e 's/"//' >Negative_Stand

##PASs数目
cat Postive_stand  Negative_Stand |sort -k4,4 -k2,3n|uniq|cut -f4|uniq -c >polyadenylationNum_By_gene

#提取序列信息
~/software/bedtools2-2.29.0/bin/fastaFromBed -fi ~/work/Alternative/data/Ghirsutum_genome_HAU_v1.0/Ghirsutum_genome_HAU_v1.0.fasta  -bed Postive_stand  -fo Postive_stand.fa
## 反向,转成mRNA序列
grep ">" -v Negative_Stand.fa |rev|tr "ATCG" "TAGC"|tr "T" "U" >111
grep ">" -v Postive_stand.fa|tr "T" "U" >222
## 计算每个位置的频率,python脚本
cat 111 222  >333 
python ~/work/Alternative/result/Gh_result/CO31_32_result/07_annotation/polyadenylation/frequent.py 333  polyadenylation.end

寻找motif

最小width 5 最长20

找15个

##使用RNA序列
awk '{print ">D5"NR"\n"$0}' 333 |less

ployA统计

棉种

基因数

转录本数

大于5个的基因数

per gene

TM1

32182

78631

2394 At1172/Dt1222

2.44

A2

21038

64620

2786

3.07

D5

19001

50312

1728

2.65

Divergent structure of splicing isforms in Gossypium lineage

棉花在多倍化过程中isform的差异,计算被Iso-seq检测到isform的基因对应的isform数目。然后与同源基因对取交集看哪些有被检测到

## 统计每个基因Iso-Seq的isform的数目
awk '$3~/transcript/{print $0}' ../merge.gtf |grep "orginal_gene_id \"G"|grep "gene_id \"PB"|awk -F ";" '{print $(NF-1)}'|sed -e  's/orginal_gene_id "//g' -e 's/"//'|sort |uniq -c |awk '{print $2"\t"$1}'|less

统计同源基因的Iso-Seq检测到的isform数目的差异

脚本 homolog_isform_count.py

类型

同源基因对

被检测到的同源基因对

A2 vs At

26099

18068

D5 vs Dt

27373

18445

考虑两个基因组共同驻留在一个细胞内,考虑四个同源基因的情况

Last updated