# 分析同源基因间可变剪切的差异

## 筛选各个棉种中发生可变剪切的同源基因

### 比较剪切位点处甲基化水平

使用单个碱基甲基化程度百分比的程度来衡量甲基化水平

```bash
# 使用之前的脚本，把不是0的之间换成1就可以了
对标准化的文件做处理    exon_5scale_single_base.txt
awk '$2==0{print $0}$2!=0{print $1"\t"1}' exon_5scale_single_base.txt
# 再计算每个位点的甲基化概率
for i in 1
do
awk '$2==0{print $0}$2!=0{print $1"\t"1}' exon_5scale_single_base.txt|sed 's/_/\t/g'|awk '{print $(NF-1)"\t"$NF}'|awk '{a[$1]+=$2}END{for(i in a)print "ConExon\t""5_exon\t"i"\t"a[i]/87456}' >exon_5_end
awk '$2==0{print $0}$2!=0{print $1"\t"1}' exon_3scale_single_base.txt |sed 's/_/\t/g'|awk '{print $(NF-1)"\t"$NF}'|awk '{a[$1]+=$2}END{for(i in a)print "ConExon\t""3_exon\t"i"\t"a[i]/87456}' >exon_3_end
awk '$2==0{print $0}$2!=0{print $1"\t"1}' intron_3scale_single_base.txt |sed 's/_/\t/g'|awk '{print $(NF-1)"\t"$NF}'|awk '{a[$1]+=$2}END{for(i in a)print "ConExon\t""3_intron\t"i"\t"a[i]/87456}' >intron_3_end
awk '$2==0{print $0}$2!=0{print $1"\t"1}' intron_5scale_single_base.txt |sed 's/_/\t/g'|awk '{print $(NF-1)"\t"$NF}'|awk '{a[$1]+=$2}END{for(i in a)print "ConExon\t""5_intron\t"i"\t"a[i]/87456}' >intron_5_end
done
## 更新坐标轴
awk '{print $1"\t"$2"\t"$3+200"\t"$4}' exon_5_end |sort -k1 -n|cat - intron_5_end |sort -k3,3 -n  >intron_exon_5.txt 
## 将3'内含子延长75 然后合并
awk '{print $1"\t"$2"\t"$3+75"\t"$4}' intron_3_end |cat - exon_3_end |sort -k3 -n >intron_exon_3.txt
```

```r
### 3'端数据
ggplot(data=data1,aes(x=data1$V3,y=data1$V4,fill=data1$V1,color=data1$V1))+geom_line(size=1)+
    ylim(c(0,0.008))+
    scale_x_continuous(breaks = c(0,75,274),labels = c('-75','','+200'))+
    theme(panel.grid=element_blank(),panel.border = element_blank(),panel.background = element_rect(fill = "white"))+
    geom_vline(xintercept = 75,linetype="dashed",color="blue")+
    theme(axis.line=element_line(size=0.8))+xlab("")+ylab("average mCpG/CpG")+scale_color_manual(values = c("#1e90ff","#ff7675"))+guides(color=guide_legend(title=NULL))
### 5’端数据
ggplot(data=data2,aes(x=data2$V3,y=data2$V4,color=data2$V1))+geom_line(size=1)+ylim(c(0,0.008))+
    scale_x_continuous(breaks = c(0,200,274),labels = c(-200,'','+75'))+
    theme(panel.grid=element_blank(),panel.border = element_blank(),panel.background = element_rect(fill = "white"))+
    geom_vline(xintercept = 200,linetype="dashed",color="blue")+
    theme(axis.line=element_line(size=0.8))+xlab("")+ylab("average mCpG/CpG")+scale_color_manual(values = c("#1e90ff","#ff7675"))+guides(color=FALSE)
```

A2\_8 vs D5\_6

| 棉种  | gene 数目 | 事件数目  |
| --- | ------- | ----- |
| TM1 | 21405   | 87934 |
| A2  | 10861   | 42044 |
| D5  | 12684   | 51616 |

|  棉种 | ES(ExonS) | IR(IntronR) |    AltD    |    AltA    |    AltP   |   Other   |
| :-: | :-------: | :---------: | :--------: | :--------: | :-------: | :-------: |
| TM1 | 4139/3440 | 53689/18239 | 10223/7111 | 11951/8180 | 5088/3441 | 2823/2046 |
|  A2 | 2276/1560 |  28078/9203 |  4135/2810 |  5199/3376 | 1823/1296 |  1435/949 |
|  D5 | 4280/2958 | 31756/10581 |  4900/3427 |  7055/4731 | 1819/1303 | 1805/1209 |

事件还得重新统计一次

```bash
# 获取A2发生了剪切事件的同源基因编号
cut -f2  ~/work/Alternative/result/Ga_result/CO11_12_result/06_Alignment/alter_eight/Altenative_category/end_third|sed /^g/d|sort |uniq >spliceGeneId.txt
cat spliceGeneId.txt homologTD.txt |sort|uniq -d >homologSpliceGeneId.txt
# 根据发生可变剪切的同源基因，找对应的其他基因组的同源基因
cat homologSpliceGeneId.txt |xargs -I {} grep {} ../GhDt_Gr_GhAt_Ga_end_noScaffold >111
mv 111 homologSpliceGeneId.txt
```

### 统计有多少同源基因同时发生了可变剪切

```bash
# A2中发生可变剪切，在D5中是否也同样发生可变剪切
cat homologSpliceGeneId.txt|xargs -I {} grep {} GhDt_Gr_GhAt_Ga_end_noScaffold >A2_D5_conSplice.txt
```

### 分析基因区域甲基化的程度与发生可变剪切的位置之间的关系

```bash
## 对每个基因的区域提取200bp的窗口，作标准
cut -f4 A2_D5_At_Dt_conSplice.txt |xargs -I {} grep  {} ~/work/Alternative/data/Ga_genome/G.arboreum.Chr.v1.0.gff |awk '$3~/gene/{print $0}' |awk '{print $1,$4,$5,$7,$9}' OFS="\t"|sed 's/\;.*//g'|awk '{print $1,$2,$3,$4,substr($5,4)}' OFS="\t"  |awk '{for(i=$2;i<=$3;i+=200){print $1"\t"i"\t"i+200"\t"$4"\t"$5"_"(i-$2)/200}}'>A2_gene_location

### 统计甲基化频率和可变剪切频率
~/software/bedtools2-2.29.0/bin/intersectBed -a A2_gene_location -b  ~/work/Alternative/data/Ga_genome/test/CpG_context_D4.bed -loj >A2_gene_Meth.txt
### 提取剪切bed文件
awk '{print $1,$4,$5,$2}' OFS="\t"   ~/work/Alternative/result/Ga_result/CO11_12_result/06_Alignment/alter_eight/Altenative_category/end_third |sed /^c/d >splice.bed 
### 因为基因长度不一样导致窗口数目是不一样的，在求均值的时候不能直接除以基因数目
cut -f5 A2_gene_location |cut -f2 -d _|sort -n |uniq -c >1 # 获取每个窗口的基因数
awk '$1==$6{print $5"\t1"}$1!=$6{print $5"\t0"}' A2_gene_Meth.txt |awk -F "_" '{print $2}' |awk '{a[$1]+=$2}END{for(i in a)print i"\t"a[i]}'|sort -k1,1 -n >2 #获取每个窗口甲基化的数目包括0个甲基化的位置
paste 2 1|awk '{print "MethlationRate\t"$1"\t"$2/$3}' >Methlation.txt
awk '$1==$6{print $5"\t1"}$1!=$6{print $5"\t0"}'  A2_gene_Splice.txt  |awk -F "_" '{print $2}' |awk '{a[$1]+=$2}END{for(i in a)print i"\t"a[i]}'|sort -k1,1 -n|paste - 1|awk '{print "AlterSpliceRate\t"$1"\t"$2/$3}' >AlterSplice.txt
```

​

### 将每个基因的长度标准化成100段或者200段，再比较每个区域甲基化和剪切程度

```bash
# 将每个基因区域平均分为100份,一个有小数点部分，一个没有小数点还得去掉，不然Bedtools不会处理
cut -f4 A2_D5_At_Dt_conSplice.txt |xargs -I {} grep  {} ~/work/Alternative/data/Ga_genome/G.arboreum.Chr.v1.0.gff |awk '$3~/gene/{print $0}' |awk '{print $1,$4,$5,$7,$9}' OFS="\t"|sed 's/\;.*//g'|awk '{print $1,$2,$3,$4,substr($5,4)}' OFS="\t" >2
#######
awk '{for(i=$2;i<=$3;i+=($3-$2)/100){print $1"\t"i"\t"i+($3-$2)/100"\t"$4"\t"$5"_"(i-$2)/(($3-$2)/100)}}' 2  >gene_100.bed
### 修改版，防止小数和科学计数的存在
awk '{for(i=$2;i<=$3;i+=($3-$2)/100){printf("%s\t%d\t%d\t%s\t%s_%s\n",$1,i,i+($3-$2)/100,$4,$5,(i-$2)/(($3-$2)/100)) }}' 2  >gene_100.bed
```

## 分析同源基因之间isform数目的关系

获取每个基因的isform数目

```bash
## 统计每个基因在叶片组织中表达的isform数目的分布
ed '1d' end_third |awk '{print $2"\t"$6;print $2"\t"$7}'|sort|uniq |awk '$1~/^Ghir_D/{a[$1]+=1}END{for(i in a)print i"\t"a[i]}'|cut -f2|awk '{a[$1]+=1}END{for(i in a)print "Dt\t"i"\t"a[i]}'|less
```

所有isform长度的分布信息,计算的是cDNA的长度，把转录本的长度信息也追加到文件中

```bash
## 获得每个isform的编号
sed '1d' end_third |awk '{print $2"\t"$6;print $2"\t"$7}'|sort|uniq |awk -F "|" '{print $1}'|awk '$2~/^P/{print $1"\t\\\""$2"\\\""}$2~/^[^P]/{print $1"\t"$2".exon"}'
## 将Pacbio的注释信息与参考基因组注释信息合并
cat  G.arboreum.Chr.v1.0.gff all.collapsed.gff >isform.gff
## 提取发生AltSplice的isform注释信息
sed '1d' end_third | awk '{print $2"\t"$6;print $2"\t"$7}' | sort | uniq | awk -F "|" '{print $1}' | awk '$2~/^P/{print $1"\t\\\""$2"\\\""}$2~/^[^P]/{print $1"\t"$2".exon"}' | cut -f2 | xargs -I {} grep {} isform.gff > splice_isform.gff &
```

## 比较同源基因发生剪切的片段的长度

使用bedtool提取对应位置的基因序列

```bash
~/software/bedtools2-2.29.0/bin/fastaFromBed -fi ~/genome_data/Ghirsutum_genome_HAU_v1.1/Ghirsutum_genome.fasta   -fo Dt.fasta  -name+ -bed Dt_intronR.txt
```


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://zpliu.gitbook.io/booknote/ke-bian-jian-qie/shu-ju-chu-li-liu-cheng/fen-xi-tong-yuan-ji-yin-jian-ke-bian-jian-qie-de-cha-yi.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
