第三个结果
splice junction
MaxEntScan 剪切位点的评分到时候,算一下
U1 snRNP regulates chromatin retention of noncoding RNAs
提取剪切位点的二核苷酸序列
使用hisat2中
extract_splice_sites.py
脚本提取剪切位点
提取的位点中内含子和前面一个外显子(不区分正负链)之间有一个碱基的交集;
使用bedtools中
fastaFromBed
脚本提取对应的碱基序列-tab
使用tab分割的输出格式-name+
参数,保留正负链信息
awk提取对应的二核苷酸信息
负链的核酸信息需要互补一下
剪切文件|awk '$1~/^-/{printf $1"\t";printf substr($2,length($2),1);printf substr($2,length($2)-1,1);printf substr($2,3,1);printf substr($2,2,1);printf "\n"}$1~/^+/{printf $1"\t";printf substr($2,2,2);printf substr($2,length($2)-1,2);printf "\n"}' >end
tmpFile1=`mktemp`
tmpFile2=`mktemp`
tmpFile3=`mktemp`
awk '$1~/^-/{print $1}' end >${tmpFile1}
awk '$1~/^-/{print $2}' end|tr "ATCG" "TAGC" >${tmpFile2}
paste tmpFile1 ${tmpFile2} >${tmpFile3}
awk '$1~/^+/{print $0}' end|cat - ${tmpFile3} >${tmpFile1}
mv ${tmpFile1} end
## 统计Is-Seq数目
cat isoseq_splice.txt|sort -k1,1 -k2,3n |uniq|wc -l
cat isoseq_splice.txt 参考基因组剪切文件|sort -k1,1 -k2,3n |uniq -d |cat - isoseq_splice.txt|sort -k1,1 -k2,3n|uniq -u >novel 剪切
##统计对应的基因数
cat isoseq.info.gtf |awk -F ";" '{print $(NF-1)}'|sed -e 's/orginal_gene_id "//g' -e 's/"//'|sort |uniq |wc -l
统计与参考基因组相比SJs的情况
基因组
Iso-Seq
参考基因组
novel SJs
gene models
TM-1
226147
332822
42138
32311
A2
144558
165885
37884
21187
D5
127760
175343
19336
19080
剪切位点信号 dinucleotide signature
> >
D5 117860 /127760
A2 131520/144558
TM-1 216906/226147
isform水平
PacBio预测的isform
对应的gene models
在参考基因组中isform的水平
## isform数目
awk '$3~/transcript/{print $0}' ../merge.gtf|grep "transcript_id \"P" |grep "orginal_gene_id \"G"|wc -l
## 对应的基因
awk '$3~/transcript/{print $0}' ../merge.gtf|grep "transcript_id \"P" |grep "orginal_gene_id \"G"|awk -F ";" '{print $(NF-1)}'|sed -e 's/orginal_gene_id "//g' -e 's/"//'|sort |uniq >gene_id
awk '$3~/transcript/{print $0}' ../merge.gtf |grep "orginal_gene_id \"G"|awk -F ";" '{print $(NF-1)}'|sed -e 's/orginal_gene_id "//g' -e 's/"//'|sort |uniq -c |awk '{print $2"\t"$1}' >Iso-Seq_isformCount.txt
## 统计Iso-Seq覆盖的基因
cat gene_id |xargs -I {} grep {} Iso-Seq_isformCount.txt >222
## 参考基因组中每个基因的转录本数目统计
awk '$3~/mRNA/{print $9}' ~/work/Alternative/data/Ghirsutum_genome_HAU_v1.0/Ghirsutum_gene_model.gff3|awk -F "Parent=" '{print $2}' |sort |uniq -c |awk '{print $2"\t"$1}' >reference_isformCount.txt
## 参考基因组中isform情况
cat gene_id|xargs -I {} grep {} 参考基因组.gff>参考基因组中isform情况
基因组
Iso-Seq预测isform
每个基因平均isform数
gene
TM-1
83881
3.39907 vs 2.01395
32182
A2
69701
21038
D5
53393
4.26125 vs 2.7397
19001
类型
isform为1
isform>5
TM-1 Iso-seq
9050
7557
TM-1 ref
18780
2570
D5 Iso-seq
3487
6898
D5 ref
7107
3164
A2 iso-seq
6352
4846
A2 ref
100%
0%
Splice junctions were respectively characterized in A2、D5、TM-1. In A2 We indentified 144558 SJs from 21187 gene models defined by Iso-seq data compared with SJs(165885) in the reference annotion,and there are 37884 novel SJs. In D5 We identified 127760 SJs from 19080 gene model sdefined by Iso-Seq data compared with Sjs 175343 in the reference annotion ,and there are 19336 novel SJs. In TM-1 We identified 226147 SJs from 32311 gene models defined By Iso-seq data compared with SJs 332822 in the reference annotion and there are 42138 novel SJs. The signature of terminal dinucleotide was investiated for all SJs defined By Iso-Seq data. We found that the GT-AG type general occupied a dominant poportion of intron borders, 91.0% A2 、92.3% D5、95.9% TM1, which consistentwith the previous study in other specises.
Merged Iso-Seq annotion with reference annotion and used a customed python script to identify Alter splice(AS).
At isform level, we identified 83881 new full-length transcripts from the 32182 reference gene models in TM-1, 69701 new full-length transcripts from the 21038 reference gene models in A2, 53393 new full-length transcripts from the 19001 reference gene models. And we found 58.4%、37.4% of genes in the original annotion were defined by a single transcrip isform in TM-1 、D5. After analysis of the Iso-Seq data, only 28.1%、18.4% of the genes defined by only a single transcript.
多聚腺苷酸位点的差异
for the first exon only the donor site is described as the first position is defined as transcription start site. Likewise, the last exon does not contain a donor splice site as the position is defined as polyadenylation site
##正链
awk '$3~/transcript/{print $0}' ../merge.gtf|grep "transcript_id \"P" |grep "orginal_gene_id \"G"|cut -f1,4,5,7,9|awk -F"\t" '$4~/+/{split($5,a,";");print $1,$3-51,$3+50,a[3]}' OFS="\t" |sed -e 's/orginal_gene_id "//g' -e 's/"//' >Postive_stand
## 负链
awk '$3~/transcript/{print $0}' ../merge.gtf|grep "transcript_id \"P" |grep "orginal_gene_id \"G"|cut -f1,4,5,7,9|awk -F "\t" '$4~/-/{split($5,a,";");print $1,$2-51,$2+50,a[3]}' OFS="\t" |sed -e 's/orginal_gene_id "//g' -e 's/"//' >Negative_Stand
##PASs数目
cat Postive_stand Negative_Stand |sort -k4,4 -k2,3n|uniq|cut -f4|uniq -c >polyadenylationNum_By_gene
#提取序列信息
~/software/bedtools2-2.29.0/bin/fastaFromBed -fi ~/work/Alternative/data/Ghirsutum_genome_HAU_v1.0/Ghirsutum_genome_HAU_v1.0.fasta -bed Postive_stand -fo Postive_stand.fa
## 反向,转成mRNA序列
grep ">" -v Negative_Stand.fa |rev|tr "ATCG" "TAGC"|tr "T" "U" >111
grep ">" -v Postive_stand.fa|tr "T" "U" >222
## 计算每个位置的频率,python脚本
cat 111 222 >333
python ~/work/Alternative/result/Gh_result/CO31_32_result/07_annotation/polyadenylation/frequent.py 333 polyadenylation.end

寻找motif
最小width 5 最长20
找15个
##使用RNA序列
awk '{print ">D5"NR"\n"$0}' 333 |less
ployA统计
棉种
基因数
转录本数
大于5个的基因数
per gene
TM1
32182
78631
2394 At1172/Dt1222
2.44
A2
21038
64620
2786
3.07
D5
19001
50312
1728
2.65
Divergent structure of splicing isforms in Gossypium lineage
棉花在多倍化过程中isform的差异,计算被Iso-seq检测到isform的基因对应的isform数目。然后与同源基因对取交集看哪些有被检测到
## 统计每个基因Iso-Seq的isform的数目
awk '$3~/transcript/{print $0}' ../merge.gtf |grep "orginal_gene_id \"G"|grep "gene_id \"PB"|awk -F ";" '{print $(NF-1)}'|sed -e 's/orginal_gene_id "//g' -e 's/"//'|sort |uniq -c |awk '{print $2"\t"$1}'|less
统计同源基因的Iso-Seq检测到的isform数目的差异
脚本 homolog_isform_count.py
类型
同源基因对
被检测到的同源基因对
A2 vs At
26099
18068
D5 vs Dt
27373
18445
考虑两个基因组共同驻留在一个细胞内,考虑四个同源基因的情况
Last updated
Was this helpful?