测序数据如下图
每个棉种含有两个重复的数据,之所以有3个文件,是因为有一个重复的数据量没有测够;之后进行了补测
2019-10-10
合并两次测序的bam文件
Copy ## 分别对两个bam文件进行排序,按照序列编号顺序排序
samtools sort -n -@ 2 ./A2/dT_BC5_subreads.bam -O BAM -o ./A2/dT_BC5_subreads_sorted.bam
## 将两个排序好的文件进行合并
samtools merge -n -@ 2 -O BAM ./A2/A2_R2.subreads.bam ./A2/dT_BC5_subreads_sorted.bam ./A2/m54139_180604_080709.subreads_sorted.bam
后来发现还得使用ccs方法将两个bam文件指定为一个数据集
2019-10-11
因为在进行数据打磨时报错了
Missing .pbi file
Copy dataset create A2_R2.subreadset.xml --type SubreadSet --generateIndices ./../dT_BC5_subreads.bam ./../m54139_180604_080709.subreads.bam
# 之后使用A2_R2.subreadset.xml文件代替bam文件进行ccs即可
2019-10-13
使用iso-seq中dataset程序处理,重复二的数据
Copy dataset create R2/TM1_R2.subreadset.xml --type SubreadSet --generateIndices R1801371_QJ_BC1_subreads.bam m54139_180607_052119.subreads.bam &
dataset create R2/D5_R2.subreadset.xml --type SubreadSet --generateIndices m54136_180730_021327.subreads.bam m54139_180609_044201.subreads.bam &
使用ccs处理
Copy ccs --noPolish --minPasses 1 --minLength 300 --minSnr 4 --maxLength 10000 --maxDropFraction 0.8 --minPredictedAccuracy 0.8 --numThreads 20 --logFile ./TM1_R2_ccs.log --reportFile ./TM1_R2port_ccs.txt ./TM1_R2.subreadset.xml ./TM1_R2_ccsout.bam
之前数据的分析,师姐已经完成了;接下来主要是针对数据进行个性化的分析了;其实我还挺想自己把流程走一遍的,可惜公司不会把分析流程的protocol发给我们
2019-10-14
将之前的结果解压,到当前文件夹
unrar x result.rar
参考我之前写的软件使用说明
tps://zpliu.gitbook.io/booknote/ke-bian-jian-qie/ruan-jian-shi-yong/01-san-dai-ce-xu-isoseq#2-shi-yong-ccs-dui-yuan-shi-shu-ju-jin-hang-guo-lv
统计测序序列情况
1.二倍体亚洲棉
数据来源:第03三步分析流程里文件Classify_stat.xls
CO_31
consensus reads质控后经过ccs处理
full-length reads(包括chimeric reads)
filtered short reads长度小于300bp
CO_32
consensus reads质控后经过ccs处理
full-length reads(包括chimeric reads)
filtered short reads长度小于300bp
四倍体陆地棉
Col31&C0l32
consensus reads质控后经过ccs处理
full-length reads(包括chimeric reads)
filtered short reads长度小于300bp
二倍体雷蒙德氏棉
Col31&C0l32
consensus reads质控后经过ccs处理
full-length reads(包括chimeric reads)
filtered short reads长度小于300bp
比较测序数据与参考基因组数据中isform的长度与对应exon的数目
在测序数据中06_Alignment/all.collapsed.gff
文件中有每条isform的信息
1.统计每条isform的长度信息,只包含外显子的长度
2.统计每条isform的外显子数目
Copy sed 's/gene_id [^t]*transcript_id//g' all.collapsed.gff | awk '$3~/^e/{isformLen[$9]+=$5-$4+1;exonCount[$9]+=1}END{for(i in isformLen){printf i"\t"isformLen[i]"\t"exonCount[i]"\n"}}'
Copy ##得到每个isform的长度和外显子数目
"PB.1024.1" ; 4680 5
"PB.10242.1" ; 3658 7
"PB.1024.2" ; 4463 5
"PB.10243.1" ; 2680 3
参考基因组gff文件中有每个mRNA的注释信息
多了一个sed的处理主要是isform的第九列多了·一些信息ID=evm.model.Ga01G0001.exon2
,把exon那部分去除后获得isform编号
Copy awk -F ";" '{print $1}' Ghirsutum_gene_model.gff3 | sed 's/\.exon.*//g' | awk '$3~/^e/{isformLen[$9]+=$5-$4+1;exonCount[$9]+=1}END{for(i in isformLen){printf i"\t"isformLen[i]"\t"exonCount[i]"\n"}}'
准备绘图的数据
Copy # 将isform名字信息修改成分组信息
sed -i 's/\"PB[^\t]*/TM1_PBisform/g' my_isform_length_exonCount.txt
# 修改基因组的isform信息
sed -i 's/\ID=[^\t]*/TM1_genome/g' my_genome_isform_length_exonCount.txt
# 将多个棉种的数据合并成一个文件,比较外显子数目上的差异
awk '{exonClass[$1"*"$3]+=1}END{for(i in exonClass){split(i,tmp,"*");printf tmp[1]"\t"tmp[2]"\t"exonClass[i]"\n"}}' all_genome_isform_comprasion.txt > all_genome_isform_exonCount.txt
得到每个基因组中isform的外显子数目分布情况
Copy #基因组 外显子数目 对应的isform数目
TM1_genome 64 3
D5_PBisform 43 7
D5_genome 16 674
TM1_genome 65 1
D5_PBisform 44 13
D5_genome 30 51
2019-10-16
绘制isform长度比较信息,和外显子数目的比较信息
Copy #绘制isform长度信息
ggplot( data = tmp, aes( x = tmp $ V2,y = tmp $ V3,fill = tmp $ V2 )) + geom_boxplot() + facet_wrap( ~ tmp $ V1,scales = "free" ) +
Copy # 绘制isform中外显子的数目信息
ggplot( data = tmp, aes( x = tmp $ V3,y = tmp $ V4,fill = tmp $ V2 )) + geom_bar( stat = "identity" , position = 'dodge' ) + scale_x_continuous( breaks = seq ( 1 , 20 , 1 ) ) + facet_wrap( ~ tmp $ V1 )
2019-10-17
第7个文件夹annotation中包含了,与参考基因组注释文件进行合并后的注释信息
利用地6步分析中得到的非冗余的isform序列,与基因组进行比较
第8个和第9个、10个文件的数据暂时用不到
分析第11个关于可变剪切的文件
分析不同棉种中可变剪切的类型,及发生可变剪切的基因数目
grep ExonS splice.as.xls|wc -l
统计外显子跳跃事件的数目
grep ExonS splice.as.xls|awk '{print $2}' |sort|uniq |wc -l
统计对应的基因数目
剪切事件数目/对应的基因数目
看看比例TM-1
看看比例A2
看看比例D5
D5的剪切事件相对与A2来说有点偏高 ,可能是由于A2的参考基因组本身就没有对多个转录本进行预测;为了验证这个猜想,就只看PBc得到的转录本之间的剪切事件
Copy # 批量统计次数
for i in 1 ; do
awk '$6~/^PB/&&$7~/^PB/{print $0}' splice.as.xls | grep ExonS | wc -l ;
awk '$6~/^PB/&&$7~/^PB/{print $0}' splice.as.xls | grep ExonS | awk '{print $2}' | sort | uniq | wc -l ;
awk '$6~/^PB/&&$7~/^PB/{print $0}' splice.as.xls | grep IntronR | wc -l ;
awk '$6~/^PB/&&$7~/^PB/{print $0}' splice.as.xls | grep IntronR | awk '{print $2}' | sort | uniq | wc -l ;
awk '$6~/^PB/&&$7~/^PB/{print $0}' splice.as.xls | grep AltD | wc -l ;
awk '$6~/^PB/&&$7~/^PB/{print $0}' splice.as.xls | grep AltD | awk '{print $2}' | sort | uniq | wc -l ;
awk '$6~/^PB/&&$7~/^PB/{print $0}' splice.as.xls | grep AltA | wc -l ;
awk '$6~/^PB/&&$7~/^PB/{print $0}' splice.as.xls | grep AltA | awk '{print $2}' | sort | uniq | wc -l ;
awk '$6~/^PB/&&$7~/^PB/{print $0}' splice.as.xls | grep AltP | wc -l ;
awk '$6~/^PB/&&$7~/^PB/{print $0}' splice.as.xls | grep AltP | awk '{print $2}' | sort | uniq | wc -l ;
awk '$6~/^PB/&&$7~/^PB/{print $0}' splice.as.xls | grep Other | wc -l ;
awk '$6~/^PB/&&$7~/^PB/{print $0}' splice.as.xls | grep Other | awk '{print $2}' | sort | uniq | wc -l ; done
2019-10-18
在第7个文件夹中matchAnnot_result.txt
文件中包含比对到参考基因组上的isform和对应的参考基因编号
分析全长转录组的数据与原始数据相比在isform数目上的差异
2019-11-2
根据参考基因组的注释文件,产生剪切位点的bed文件
假定剪切复合体是以外显子识别的模式 进行剪切的,并且不关心外显子在正负链上的关系
Copy #将第9列改造成只剩下基因ID,结合exon的起始位点,给每个exon一个编号,不分析脚手架了
sed 's/\;.*//g' ../Ghirsutum_gene_model.gtf | sed 's/transcript_id //g' | sed 's/\"//g' | awk '$1~/Gh/&&$3~/exon/{print $1"\t"$4"\t"$5"\t"$7"\t"$9"_"$4}' > tmp.bed
awk '$1~/Chr/{print $1"\t"$2"\t"$3"\t"$4"\t"$5"_"$2}' constitutiveExon.bed > tmp.bed
### 结果如下
Ghir_A01 13133 13156 - Ghir_A01G000010.1_13133
Ghir_A01 13840 14080 - Ghir_A01G000010.1_13840
Ghir_A01 14164 15594 - Ghir_A01G000010.1_14164
生成各种类型的剪切
Copy # 生成5'外显子区域的bed文件,覆盖到每个碱基
awk '{for(i=$2-1;i<=$2+73;i++){print $1"\t"i"\t"i"\t"$4"\t"$5"_"i-$2+1}}' tmp.bed > exon_5.bed
awk '{for(i=$3-73;i<=$3+1;i++){print $1"\t"i"\t"i"\t"$4"\t"$5"_"i-$3+73}}' tmp.bed > exon_3.bed
awk '{for(i=$2-201;i<=$2-2;i++){print $1"\t"i"\t"i"\t"$4"\t"$5"_"i-$2+201}}' tmp.bed > intron_5.bed
awk '{for(i=$3+2;i<=$3+201;i++){print $1"\t"i"\t"i"\t"$4"\t"$5"_"i-$3-2}}' tmp.bed > intron_3.bed
# 用于提取指定区域的fasta序列
awk '{print $1"\t"$2-1"\t"$2+73"\t"$4"\t"$5}' tmp.bed > exon_5_fasta.bed
awk '{print $1"\t"$3-73"\t"$3+1"\t"$4"\t"$5}' tmp.bed > exon_3_fasta.bed
awk '{print $1"\t"$2-201"\t"$2-2"\t"$4"\t"$5}' tmp.bed > intron_5_fasta.bed
awk '{print $1"\t"$3+2"\t"$3+201"\t"$4"\t"$5}' tmp.bed > intron_3_fasta.bed
# for 循环执行
for i in 1
do
sed 's/\;.*//g' ../Ghirsutum_gene_model.gtf | sed 's/transcript_id //g' | sed 's/\"//g' | awk '$1~/Gh/&&$3~/exon/{print $1"\t"$4"\t"$5"\t"$7"\t"$9"_"$4}' | awk '{for(i=$2-1;i<=$2+73;i++){print $1"\t"i"\t"i"\t"$4"\t"$5"_"i-$2+1}}' > exon_5.bed
sed 's/\;.*//g' ../Ghirsutum_gene_model.gtf | sed 's/transcript_id //g' | sed 's/\"//g' | awk '$1~/Gh/&&$3~/exon/{print $1"\t"$4"\t"$5"\t"$7"\t"$9"_"$4}' | awk '{for(i=$3-73;i<=$3+1;i++){print $1"\t"i"\t"i"\t"$4"\t"$5"_"i-$3+73}}' > exon_3.bed
sed 's/\;.*//g' ../Ghirsutum_gene_model.gtf | sed 's/transcript_id //g' | sed 's/\"//g' | awk '$1~/Gh/&&$3~/exon/{print $1"\t"$4"\t"$5"\t"$7"\t"$9"_"$4}' | awk '{for(i=$2-201;i<=$2-2;i++){print $1"\t"i"\t"i"\t"$4"\t"$5"_"i-$2+201}}' > intron_5.bed
sed 's/\;.*//g' ../Ghirsutum_gene_model.gtf | sed 's/transcript_id //g' | sed 's/\"//g' | awk '$1~/Gh/&&$3~/exon/{print $1"\t"$4"\t"$5"\t"$7"\t"$9"_"$4}' | awk '{for(i=$3+2;i<=$3+201;i++){print $1"\t"i"\t"i"\t"$4"\t"$5"_"i-$3-2}}' > intron_3.bed
sed 's/\;.*//g' ../Ghirsutum_gene_model.gtf | sed 's/transcript_id //g' | sed 's/\"//g' | awk '$1~/Gh/&&$3~/exon/{print $1"\t"$4"\t"$5"\t"$7"\t"$9"_"$4}' | awk '{print $1"\t"$2-1"\t"$2+73"\t"$4"\t"$5}' > exon_5_fasta.bed
sed 's/\;.*//g' ../Ghirsutum_gene_model.gtf | sed 's/transcript_id //g' | sed 's/\"//g' | awk '$1~/Gh/&&$3~/exon/{print $1"\t"$4"\t"$5"\t"$7"\t"$9"_"$4}' | awk '{print $1"\t"$3-73"\t"$3+1"\t"$4"\t"$5}' > exon_3_fasta.bed
sed 's/\;.*//g' ../Ghirsutum_gene_model.gtf | sed 's/transcript_id //g' | sed 's/\"//g' | awk '$1~/Gh/&&$3~/exon/{print $1"\t"$4"\t"$5"\t"$7"\t"$9"_"$4}' | awk '{print $1"\t"$2-201"\t"$2-2"\t"$4"\t"$5}' > intron_5_fasta.bed
sed 's/\;.*//g' ../Ghirsutum_gene_model.gtf | sed 's/transcript_id //g' | sed 's/\"//g' | awk '$1~/Gh/&&$3~/exon/{print $1"\t"$4"\t"$5"\t"$7"\t"$9"_"$4}' | awk '{print $1"\t"$3+2"\t"$3+201"\t"$4"\t"$5}' > intron_3_fasta.bed
done
将甲基化的数据构造成bed文件
Copy awk '{print $1"\t"$2"\t"$2}' ~/WMJ_3D_DNAmethy/Gh_bismark/CpG_context_D1_binom_fdr_result.txt > CpG_context_D1.bed
统计每个区域中含有甲基化的位点的数目
Copy ## 如果区域没有对应的甲基化位点则会显示为-1,有则显示对于的甲基化位置
~ /software/bedtools2-2.29.0/bin/intersectBed -b CpG_context_D1.bed -a exon_3.bed -loj > exon_3Meth_count.txt
## 统计每个区域甲基化的次数,后面几列的坐标一样的话就认为是一个甲基化位点
# awk '$2==$7{print $0}' exon_3Meth_count.txt
## 统计每个区域中单个碱基精度的甲基化程度,需要重新制作bed文件
提取对应区域的CG 类型的数目,进行甲基化水平的标准化
Copy # 统计每个区域CG含量水平
python extract_CG_count.py sequencFile outputFile
# 提取甲基化是单个碱基的精度下的甲基化水平
python scale_Methlation_by_sigleBase.py exon_3_CGcount.txt exon_3Meth_count.txt scale_single_base.txt
# 统计外显子的数目
grep ">" exon_3.fasta | wc -l
# 均一化所有外显子的单碱基水平
sed 's/_/\t/g' scale_single_base.txt | awk '{print $(NF-1)"\t"$NF}' | awk '{a[$1]+=$2}END{for(i in a)print i"\t"a[i]/外显子数目}' > end
画两幅内含子和外显子连接的图,intron5-exon5 vs exon3-intron3
Copy ## 将5'外显子坐标延长200,然后合并
awk '{print $1+200"\t"$2}' exon5_end | sort -k1 -n | cat - intron5_end | sort -k1 -n > intron_exon_5.txt
## 将3'内含子延长75 然后合并
awk '{print $1+75"\t"$2}' intron3_end | cat - exon3_end | sort -k1 -n > intron_exon_3.txt
ggplot2绘制图形
Copy ## 5'端
ggplot( data = data2, aes( x = data2 $ V1,y = data2 $ V2 )) +
geom_line( color = c ( "#ff4757" ) ,size = 1 ) +
ylim(c ( 0 , 0.008 ) ) +
scale_x_continuous( breaks = c ( 0 , 200 , 274 ) ,labels = c ( - 200 , '' , '+75' )) +
theme( panel.grid = element_blank() ,panel.border = element_blank() ,panel.background = element_rect( fill = "#f1f2f6" )) +
geom_vline( xintercept = 200 ,linetype = "dashed" ,color = "blue" ) +
theme( axis.line = element_line( size = 0.8 )) + xlab( "" ) + ylab( "average mCpG/CpG" )
## 3'端
ggplot( data = data1, aes( x = data1 $ V1,y = data1 $ V2 )) +
geom_line( color = c ( "#ff4757" ) ,size = 1 ) +
ylim(c ( 0 , 0.008 ) ) +
scale_x_continuous( breaks = c ( 0 , 75 , 274 ) ,labels = c ( '-75' , '' , '+200' )) +
theme( panel.grid = element_blank() ,panel.border = element_blank() ,panel.background = element_rect( fill = "#f1f2f6" )) +
geom_vline( xintercept = 75 ,linetype = "dashed" ,color = "blue" ) +
theme( axis.line = element_line( size = 0.8 )) + xlab( "" ) + ylab( "average mCpG/CpG" )
2019-11-08
重新分析可变剪切的数据,使用gmap进行比对,过滤指标覆盖度85%,相似度90%
Copy cat ./all.collapsed.rep.fa | ~/software/gmap-2019-09-12/bin/gmap -D ~/work/Alternative/data/gmap_build/Graimondii_221_v2.0 -d Graimondii_221_v2.0 -f samse -t 10 -n 2 --min-trimmed-coverage=0.85 --min-identity=0.9 > test.sam
2019-11-14
考虑每个isform直接的比较
只考虑基因坐标
A2_8 vs D5_6
2019-11-16
比较Alt exon与constituents exon的甲基化水平
提取可变剪切文件中所有isform的exon坐标
先提取PacBio对应的isform编号,以及对应的mRNA对应的编号
Copy awk '$6~/PB/{print $6"\t"$2}' end_third | sort | uniq > PacBio_ID
awk '$7~/PB/{print $7"\t"$2}' end_third | sort | uniq >> PacBio_ID
awk '$6~/evm/{print $6"\t"$2}' end_third | sort | uniq > mRNA_ID
awk '$7~/evm/{print $7"\t"$2}' end_third | sort | uniq >> mRNA_ID
## 还得去一次重
sort PacBio_ID | uniq > PacBio_ID_1
## 使用python脚本将每个isform的外显子信息坐标信息进去取交集,假定在2个isform中出现的exon坐标,认为是constituents exon
python constitutiveExonLocation.py PacBio_ID mRNA_ID ../test.gff ~/work/Alternative/data/Ga_genome/G.arboreum.Chr.v1.0.gff constitutiveExonLocation
# 给每个constitutive exon取个名字
# 最终的constitutive exon结果里可能会有重复的坐标,由于不同转录本之间太相似了,把一些重叠区域尽量最小化
sort -n -k3,3 constitutiveExon.bed | awk '{print $2,$3,$1,$4,$5}' OFS= "\t" | uniq -f 1 | awk '{print $2,$1,$3,$4,$5}' OFS= "\t" | sort -k2,2 -n | uniq -f1 | awk '{print $3,$2,$1,$4,$5}' OFS= "\t" | awk '{print $1,$2,$3,$4,$5"_"$2}' OFS= "\t" > tmp.bed
# 提取对应的bed坐标文件,这有点问题没有根据链来提,会有误差
for i in 1
do
awk '{for(i=$2-1;i<=$2+73;i++){print $1"\t"i"\t"i"\t"$4"\t"$5"_"i-$2+1}}' tmp.bed > exon_5.bed
awk '{for(i=$3-73;i<=$3+1;i++){print $1"\t"i"\t"i"\t"$4"\t"$5"_"i-$3+73}}' tmp.bed > exon_3.bed
awk '{for(i=$2-201;i<=$2-2;i++){print $1"\t"i"\t"i"\t"$4"\t"$5"_"i-$2+201}}' tmp.bed > intron_5.bed
awk '{for(i=$3+2;i<=$3+201;i++){print $1"\t"i"\t"i"\t"$4"\t"$5"_"i-$3-2}}' tmp.bed > intron_3.bed
awk '{print $1"\t"$2-1"\t"$2+73"\t"$4"\t"$5}' tmp.bed > exon_5_fasta.bed
awk '{print $1"\t"$3-73"\t"$3+1"\t"$4"\t"$5}' tmp.bed > exon_3_fasta.bed
awk '{print $1"\t"$2-201"\t"$2-2"\t"$4"\t"$5}' tmp.bed > intron_5_fasta.bed
awk '{print $1"\t"$3+2"\t"$3+201"\t"$4"\t"$5}' tmp.bed > intron_3_fasta.bed
done
# 提取对应的序列用于计算CG含量
python ~/scripte/according_CDS_location_find_fasta_mesage.py exon_3_fasta.bed ~/work/Alternative/data/Ga_genome/G.arboreum.Chr.v1.0.fasta exon_3.fasta
python ~/scripte/extract_CG_count.py exon_3.fasta exon_3_CG_count.txt
# 使用intersectBed做交集
~ /software/bedtools2-2.29.0/bin/intersectBed -b CpG_context_D1.bed -a exon_3.bed -loj > exon_3Meth_count.txt
# 提取单碱基下的甲基化水平
python scale_Methlation_by_sigleBase.py exon_3_CGcount.txt exon_3Meth_count.txt scale_single_base.txt
# 按exon的数目进行标准化 # 均一化所有外显子的单碱基水平
grep ">" exon_3.fasta | wc -l
sed 's/_/\t/g' scale_single_base.txt | awk '{print $(NF-1)"\t"$NF}' | awk '{a[$1]+=$2}END{for(i in a)print "AltExon\t"i"\t"a[i]/外显子数目}' > end
### 搞个for循环懒得敲了
for i in 1
do
python ~/scripte/according_CDS_location_find_fasta_mesage.py exon_3_fasta.bed ~/work/Alternative/data/Ga_genome/G.arboreum.Chr.v1.0.fasta exon_3.fasta
python ~/scripte/according_CDS_location_find_fasta_mesage.py exon_5_fasta.bed ~/work/Alternative/data/Ga_genome/G.arboreum.Chr.v1.0.fasta exon_5.fasta
python ~/scripte/according_CDS_location_find_fasta_mesage.py intron_3_fasta.bed ~/work/Alternative/data/Ga_genome/G.arboreum.Chr.v1.0.fasta intron_3.fasta
python ~/scripte/according_CDS_location_find_fasta_mesage.py intron_5_fasta.bed ~/work/Alternative/data/Ga_genome/G.arboreum.Chr.v1.0.fasta intron_5.fasta
python ~/scripte/extract_CG_count.py exon_3.fasta exon_3_CG_count.txt
python ~/scripte/extract_CG_count.py exon_5.fasta exon_5_CG_count.txt
python ~/scripte/extract_CG_count.py intron_3.fasta intron_3_CG_count.txt
python ~/scripte/extract_CG_count.py intron_5.fasta intron_5_CG_count.txt
~ /software/bedtools2-2.29.0/bin/intersectBed -b ~ /work/Alternative/data/Ga_genome/test/CpG_context_D4.bed -a exon_3.bed -loj > exon_3Meth_count.txt
~ /software/bedtools2-2.29.0/bin/intersectBed -b ~ /work/Alternative/data/Ga_genome/test/CpG_context_D4.bed -a exon_5.bed -loj > exon_5Meth_count.txt
~ /software/bedtools2-2.29.0/bin/intersectBed -b ~ /work/Alternative/data/Ga_genome/test/CpG_context_D4.bed -a intron_5.bed -loj > intron_5Meth_count.txt
~ /software/bedtools2-2.29.0/bin/intersectBed -b ~ /work/Alternative/data/Ga_genome/test/CpG_context_D4.bed -a intron_3.bed -loj > intron_3Meth_count.txt
python ~/scripte/scale_Methlation_by_sigleBase.py exon_3_CG_count.txt exon_3Meth_count.txt exon_3scale_single_base.txt
python ~/scripte/scale_Methlation_by_sigleBase.py exon_5_CG_count.txt exon_5Meth_count.txt exon_5scale_single_base.txt
python ~/scripte/scale_Methlation_by_sigleBase.py intron_5_CG_count.txt intron_5Meth_count.txt intron_5scale_single_base.txt
python ~/scripte/scale_Methlation_by_sigleBase.py intron_3_CG_count.txt intron_3Meth_count.txt intron_3scale_single_base.txt
done
### 最终结果
grep ">" intron_5.fasta | wc -l
## exon的类型记得改一下
for exonCount in 外显子的数目
do
sed 's/_/\t/g' exon_3scale_single_base.txt | awk '{print $(NF-1)"\t"$NF}' | awk '{a[$1]+=$2}END{for(i in a)print "AltExon\t"i"\t"a[i]/"' $exonCount '"}' > exon_3_end
sed 's/_/\t/g' exon_5scale_single_base.txt | awk '{print $(NF-1)"\t"$NF}' | awk '{a[$1]+=$2}END{for(i in a)print "AltExon\t"i"\t"a[i]/"' $exonCount '"}' > exon_5_end
sed 's/_/\t/g' intron_5scale_single_base.txt | awk '{print $(NF-1)"\t"$NF}' | awk '{a[$1]+=$2}END{for(i in a)print "AltExon\t"i"\t"a[i]/"' $exonCount '"}' > intron_5_end
sed 's/_/\t/g' intron_3scale_single_base.txt | awk '{print $(NF-1)"\t"$NF}' | awk '{a[$1]+=$2}END{for(i in a)print "AltExon\t"i"\t"a[i]/"' $exonCount '"}' > intron_3_end
done
### 为了画图方便将所有exon做成连续的坐标
## 将5'外显子坐标延长200,然后合并
awk '{print $1"\t"$2+200"\t"$3}' exon_5_end | sort -k2 -n | cat - intron_5_end | sort -k2 -n > intron_exon_5.txt
## 将3'内含子延长75 然后合并
awk '{print $1"\t"$2+75"\t"$3}' intron_3_end | cat - exon_3_end | sort -k2 -n > intron_exon_3.txt
提取Alter exon的坐标
Copy ### 没有染色体正负链的信息,在后面提取序列要用,但是提取CG碱基的时候,由于对称性所以提取哪条链都没关系
### 都以正链为标准
awk '$3~/IntronR/||$3~/ExonS/{print $1,$4,$5,"+",$2}' OFS= "\t" end_third > ../Alter_exon/alter_exon.bed
2019-11-20
之间比较Constitutive Exon和Alternative Exon之间甲基化水平的差异
Alternative Exon需要把对应的链的信息给补充完整
直接根据基因区编号来补充把
Copy ## 提取基因正负链信息
cut -f5 alter_exon.bed | xargs -I {} grep {} ~/work/Alternative/data/Ga_genome/G.arboreum.Chr.v1.0.gff | awk '$3~/gene/{print $7}' | paste alter_exon.bed - > 11111111111
## 过滤重复的可变exon
awk '{print $1,$2,$3,$6,$5}' OFS= "\t" ../11111111111 | sort -n -k3,3 | awk '{print $2,$3,$1,$4,$5}' OFS= "\t" | uniq -f 1 | awk '{print $2,$1,$3,$4,$5}' OFS= "\t" | sort -k2,2 -n | uniq -f1 | awk '{print $3,$2,$1,$4,$5}' OFS= "\t" | awk '{print $1,$2,$3,$4,$5"_"$2}' OFS= "\t" > tmp.bed
Copy ## 构造每个外显子在5'|3'端内含子和外显子的甲基化数据
for i in 1
do
awk '$4~/+/{print $1"\t"$2"\t"$2+74"\t"$4"\t"$5}$4~/-/{print $1"\t"$3-74"\t"$3"\t"$4"\t"$5}' ../tmp.bed > exon_5.bed
awk '$4~/+/{print $1"\t"$3-74"\t"$3"\t"$4"\t"$5}$4~/-/{print $1"\t"$2"\t"$2+74"\t"$4"\t"$5}' ../tmp.bed > exon_3.bed
awk '$4~/+/{print $1"\t"$3+1"\t"$3+75"\t"$4"\t"$5}$4~/-/{print $1"\t"$2-75"\t"$2-1"\t"$4"\t"$5}' ../tmp.bed > intron_3.bed
awk '$4~/+/{print $1"\t"$2-75"\t"$2-1"\t"$4"\t"$5}$4~/-/{print $1"\t"$3+1"\t"$3+75"\t"$4"\t"$5}' ../tmp.bed > intron_5.bed
done
## 长度都是一定的,提取对应的CG甲基含量
for i in 1
do
python ~/scripte/according_CDS_location_find_fasta_mesage.py exon_3.bed ~/work/Alternative/data/Ga_genome/G.arboreum.Chr.v1.0.fasta exon_3.fasta
python ~/scripte/according_CDS_location_find_fasta_mesage.py exon_5.bed ~/work/Alternative/data/Ga_genome/G.arboreum.Chr.v1.0.fasta exon_5.fasta
python ~/scripte/according_CDS_location_find_fasta_mesage.py intron_3.bed ~/work/Alternative/data/Ga_genome/G.arboreum.Chr.v1.0.fasta intron_3.fasta
python ~/scripte/according_CDS_location_find_fasta_mesage.py intron_5.bed ~/work/Alternative/data/Ga_genome/G.arboreum.Chr.v1.0.fasta intron_5.fasta
done
for i in 1
do
python ~/scripte/extract_CG_count.py exon_3.fasta exon_3_CG_count.txt
python ~/scripte/extract_CG_count.py exon_5.fasta exon_5_CG_count.txt
python ~/scripte/extract_CG_count.py intron_3.fasta intron_3_CG_count.txt
python ~/scripte/extract_CG_count.py intron_5.fasta intron_5_CG_count.txt
done
## 统计每个区域甲基化的碱基数
for i in 1
do
~ /software/bedtools2-2.29.0/bin/intersectBed -a exon_3.bed -b ~ /work/Alternative/data/Ga_genome/test/CpG_context_D4.bed -loj | awk '$6~/\./{print $5"\t"0}$6~/[^\.]/{print $5"\t"1}' | awk '{a[$1]+=$2}END{for(i in a)print i"\t"a[i]}' > exon_3Meth_count.txt
~ /software/bedtools2-2.29.0/bin/intersectBed -a exon_5.bed -b ~ /work/Alternative/data/Ga_genome/test/CpG_context_D4.bed -loj | awk '$6~/\./{print $5"\t"0}$6~/[^\.]/{print $5"\t"1}' | awk '{a[$1]+=$2}END{for(i in a)print i"\t"a[i]}' > exon_5Meth_count.txt
~ /software/bedtools2-2.29.0/bin/intersectBed -a intron_3.bed -b ~ /work/Alternative/data/Ga_genome/test/CpG_context_D4.bed -loj | awk '$6~/\./{print $5"\t"0}$6~/[^\.]/{print $5"\t"1}' | awk '{a[$1]+=$2}END{for(i in a)print i"\t"a[i]}' > intron_3Meth_count.txt
~ /software/bedtools2-2.29.0/bin/intersectBed -a intron_5.bed -b ~ /work/Alternative/data/Ga_genome/test/CpG_context_D4.bed -loj | awk '$6~/\./{print $5"\t"0}$6~/[^\.]/{print $5"\t"1}' | awk '{a[$1]+=$2}END{for(i in a)print i"\t"a[i]}' > intron_5Meth_count.txt
done
## 构造ggplot绘图的数据
for i in 1
do
awk '{print "ConExon\t""exon_3\t"$2}' exon_3_CG_count.txt > conExon_CGcount_end
awk '{print "ConExon\t""exon_5\t"$2}' exon_5_CG_count.txt >> conExon_CGcount_end
awk '{print "ConExon\t""intron_3\t"$2}' intron_3_CG_count.txt >> conExon_CGcount_end
awk '{print "ConExon\t""intron_5\t"$2}' intron_5_CG_count.txt >> conExon_CGcount_end
awk '{print "ConExon\t""exon_3\t"$2}' exon_3Meth_count.txt > conExon_MetCount_end
awk '{print "ConExon\t""exon_5\t"$2}' exon_5Meth_count.txt >> conExon_MetCount_end
awk '{print "ConExon\t""intron_5\t"$2}' intron_5Meth_count.txt >> conExon_MetCount_end
awk '{print "ConExon\t""intron_3\t"$2}' intron_3Meth_count.txt >> conExon_MetCount_end
done
感觉失败了
把四个坐标中甲基化位点为0的exon去除试试
Copy #剔除在 exon_5|3 和intron3|5都没有甲基化的exon
cat exon_3Meth_count.txt exon_5Meth_count.txt intron_3Meth_count.txt intron_5Meth_count.txt | awk '$2==0{print $0}' | sort | uniq -c | awk '$1==4{print $2"\t"0}' > No_Methlation_exon.txt
for i in 1
do
cat No_Methlation_exon.txt exon_3Meth_count.txt | sort | uniq -u > exon_3Meth_No0count.txt
cat No_Methlation_exon.txt exon_5Meth_count.txt | sort | uniq -u > exon_5Meth_No0count.txt
cat No_Methlation_exon.txt intron_3Meth_count.txt | sort | uniq -u > intron_3Meth_No0count.txt
cat No_Methlation_exon.txt intron_5Meth_count.txt | sort | uniq -u > intron_5Meth_No0count.txt
done
同样没有发现想要的效果 ,比如Alter 的剪切位点中exon和intron之间的甲基化程度相差不大,而Constitutive 剪切位点附近的甲基化程度相差很大
不把intron与exon拆开看了,就分析5'|3’剪切位点附近甲基化程度和CG含量
Copy ## 将同一个外显子附近的甲基化数据合并
cat *3Meth_No0count.txt | awk '{a[$1]+=$2}END{for(i in a) print "ConExon\t""3_splice\t"a[i]}' > 3Meth_No0count.txt
cat *5Meth_No0count.txt | awk '{a[$1]+=$2}END{for(i in a) print "ConExon\t""5_splice\t"a[i]}' > 5Meth_No0count.txt
佛了,想要差异大一些又没差异
参考
bedtools使用 https://www.jianshu.com/p/6c3b87301491
ggplot2 https://blog.csdn.net/Bone_ACE/article/details/47427453# 标签修改