# 重新开始鉴定AS.md

之前是分别使用对应的三代测序数据比对到各自的参考基因组，现在考虑用三代的测序数据分别比对对各自的亚基因组上。

```bash
## 将四倍体基因组分层At和DT进行比对，由于Scaffold没法区分A、D就不要了
grep Ghir_A -A1 Ghirsutum_genome_HAU_v1.0_another.fasta > At.fasta
grep Ghir_D -A1 Ghirsutum_genome_HAU_v1.0_another.fasta > Dt.fasta
## gmap 构建索引
/public/home/zpliu/software/gmap-2019-09-12/bin/gmap_build  -D /public/home/zpliu/work/Alternative/data/gmap_build -d Ghirsutum_Dt  /public/home/zpliu/work/Alternative/data/Ghirsutum_genome_HAU_v1.0/Dt.fasta &
## gmap进行比对
cat ./all.collapsed.rep.fa|~/software/gmap-2019-09-12/bin/gmap -D ~/work/Alternative/data/gmap_build/Ghirsutum_At/   -d  Ghirsutum_At   -f 3 -t 10 -n 1 --min-trimmed-coverage=0.85 --min-identity=0.85 -p 1 -e   >A2_At.gtf
## alternative_splice.py进行鉴定
/usr/bin/python ~/software/pipeline-for-isoseq/other/alternative_splice.py -i ./D5_Dt.gff -g ~/work/Alternative/data/Ghirsutum_genome_HAU_v1.0/Ghirsutum_gene_model_motify_Dt.gtf  -f ~/work/Alternative/data/Ghirsutum_genome_HAU_v1.0/Dt.fasta  -as -ats T -os -t exon -o ./alter_ten -op
```

通过比较发现，如果真的是很保守的事件的话，他们的长度也是很保守的。对于那些长度摆动的先不管，用gmap来比对然后确定在对于同源基因上的位置。

```bash
## 将保守的IR事件，提取在对于基因组上的坐标


##不保守的事件在其他基因组的坐标使用gmap来获取
```

## 对于AltA、AltD进行同样的操作

```bash
## 从剪切文件中获取对应的AltA、AltD事件坐标和长度
awk '$3=="AltA"{print $0}' end_third|paste AltA_stand  -|awk '{print $2"\t"$5"\t"$6"\t"$1"\t"$3"\t-\t-\t-\t"$6-$5+1}' >AltA_static.txt 
awk '$3=="AltD"{print $0}' end_third|paste AltD_stand  -|awk '{print $2"\t"$5"\t"$6"\t"$1"\t"$3"\t-\t-\t-\t"$6-$5+1}' >AltD_static.txt
```

### 对各种剪切事件在进化上进行分类

```bash
## 全都保守的
python ../../script_category/AS_all_converse.py  ../homoloe_AltA.txt  ../../../GhDt_Gr_GhAt_Ga_end_noScaffold  all_converse_AltA.txt 
## 祖先和后代保守
python ../../script_category/As_At_A2_converse.py  ../homoloe_AltA.txt  ../../../GhDt_Gr_GhAt_Ga_end_noScaffold At_A2_converse_AltA.txt
python ../../script_category/AS_Dt_D5_converse.py  ../homoloe_AltA.txt  ../../../GhDt_Gr_GhAt_Ga_end_noScaffold Dt_D5_converse_AltA.txt
## 祖先与祖先保守
python ../../script_category/AS_A2_D5_converse.py  ../homoloe_AltA.txt  ../../../GhDt_Gr_GhAt_Ga_end_noScaffold A2_D5_converse_AltA.txt
python ../../script_category/AS_At_Dt_converse.py ../homoloe_AltA.txt  ../../../GhDt_Gr_GhAt_Ga_end_noScaffold At_Dt_converse_AltA.txt
## 混合的
python ../../script_category/As_mix.py  ../homoloe_AltA.txt  ../../../GhDt_Gr_GhAt_Ga_end_noScaffold mix_AltA.txt
```

## 比较保守事件与非保守事件的长度差异

```bash
## 提取位置坐标
cat *converse*|awk -F "\t" '$2!="-"{print $2,$3,$4"\n"$9,$10,$11"\n"$16,$17,$18"\n"$23,$24,$25}' OFS="\t"|awk '$1~/[A-Za-z]/{print $0"\t"$3-$2+1}'|sort |uniq >converse.bed
## 提取不保守的AS长度
cut -f1,2,3 homoloe_AltD.txt |sort |uniq|awk '{print $0"\t"$3-$2+1}'|cat - evolution/converse_AltD.bed |sort |uniq -u >noConverse_AltD.bed
## 计算平均值 
awk '{a+=$4}END{print a/事件数目}' evolution/converse_AltD.bed
```

| 事件类型  | 保守类型 | 非保守类型 |
| ----- | ---- | ----- |
| AltD  | 365  | 536   |
| AltA  | 343  | 517   |
| ExonS | 116  | 1847  |
| IR    | 264  | 341   |

## 比较序列是否存在差异

之前基因文件的版本用错了，然后提出来的序列有些问题，然后今天用新的版本的跑了一下才是一样的；还好我之前跑甲基化数据的时候用对了基因组。

## 对可变剪切进行分类还是使用剪切位点上下游各300bp组成的序列进行blast

提取剪切位点上下游各300bp的序列，叫&#x505A;***splice anchor sequence tags***

```bash
## 提取上游300bp序列
awk '{print $1,$2-301,$2-1,$5"_"$1"_"$2"_"$3"_"$9"_"$4}' OFS="\t"  ../A2_intronR.txt |sort -k2,3 -n |uniq >left.bed
## 提取下游300bp序列
awk '{print $1,$3,$3+300,$5"_"$1"_"$2"_"$3"_"$9"_"$4}' OFS="\t"  ../A2_intronR.txt |sort -k2,3 -n |uniq  >right.bed
## 使用bedtools提取序列
~/software/bedtools2-2.29.0/bin/fastaFromBed -fi ~/work/Alternative/data/Ga_genome/G.arboreum.Chr.v1.0.fasta -fo right.fasta -name -bed right.bed
##合并两端序列
python merge.py  left.fasta right.fasta  D5_intron_junction.fasta
```

### 进行blast比对

```bash
## 将四个同源基因序列构成一个库,id太长了把 -parse_seqids 参数去掉
makeblastdb -in intron_junction.fasta -dbtype "nucl" -parse_seqids -out  intron_junction
## 分别对各个基因组进行比对
blastn -query Dt_intron_junction.fasta -db blastDB/intron_junction -evalue 1e-5  -outfmt 6 -out 111
## 去提取在各个基因组都比较保守的intron事件
awk '$11==0&&$2!~/Ghir_D.*/{print $0}' Dt_query_blast.txt |awk '$2~/Ghir_A/{a[$1][0]+=1}$2~/evm/{a[$1][2]+=1}$2~/Gor/{a[$1][1]+=1}END{for(i in a){if(a[i][0]>=1&&a[i][2]>=1&&a[i][1]>=1)print i}}' 
## 查看某一个事件是不是对的
grep Ghir_D10G003010_2517095_2517293_199_+ Dt_query_blast.txt |awk '$11==0&&$2!~/Ghir_D.*/{print $0}'
```

### 提取只在某两个基因组上保守的剪切事件

```bash
## 提取只在Dt与D5中保守的事件
awk '$11==0&&$2!~/Ghir_D.*/{print $0}' Dt_query_blast.txt |awk '$2~/Ghir_A/{a[$1][0]+=1}$2~/evm/{a[$1][2]+=1}$2~/Gor/{a[$1][1]+=1}END{for(i in a){if(a[i][0]==0&&a[i][2]==0&&a[i][1]>=1)print i}}' >Dt_D5_converse_intronR
## 只在Dt与At中保守的剪切事件
awk '$11==0&&$2!~/Ghir_D.*/{print $0}' Dt_query_blast.txt |awk '$2~/Ghir_A/{a[$1][0]+=1}$2~/evm/{a[$1][2]+=1}$2~/Gor/{a[$1][1]+=1}END{for(i in a){if(a[i][1]==0&&a[i][2]==0&&a[i][0]>=1)print i}}' 
## Dt与A2中保守的剪切事件
awk '$11==0&&$2!~/Ghir_D.*/{print $0}' Dt_query_blast.txt|awk '$2~/Ghir_A/{a[$1][0]+=1}$2~/evm/{a[$1][2]+=1}$2~/Gor/{a[$1][1]+=1}END{for(i in a){if(a[i][1]==0&&a[i][2]>=1&&a[i][0]==0)print i}}'
## A2与At中保守的剪切事件
awk '$11==0&&$2!~/evm.*/{print $0}' A2_query_blast.txt |awk '$2~/Ghir_A/{a[$1][0]+=1}$2~/Ghir_D/{a[$1][2]+=1}$2~/Gor/{a[$1][1]+=1}END{for(i in a){if(a[i][1]==0&&a[i][0]>=1&&a[i][2]==0)print i}}'
## At与D5中的保守剪切事件
awk '$11==0&&$2!~/Ghir_A.*/{print $0}' At_query_blast.txt |awk '$2~/evm/{a[$1][0]+=1}$2~/Ghir_D/{a[$1][2]+=1}$2~/Gor/{a[$1][1]+=1}END{for(i in a){if(a[i][2]==0&&a[i][0]==0&&a[i][1]>=1)print i}}'
## A2与D5中的保守剪切事件
awk '$11==0&&$2!~/evm.*/{print $0}' A2_query_blast.txt |awk '$2~/Ghir_A/{a[$1][0]+=1}$2~/Ghir_D/{a[$1][2]+=1}$2~/Gor/{a[$1][1]+=1}END{for(i in a){if(a[i][0]==0&&a[i][1]>=1&&a[i][2]==0)print i}}'
```

### 在三个基因组上保守的剪切事件

```bash
## Dt、D5、A2中都出现
awk '$11==0&&$2!~/Ghir_D.*/{print $0}' Dt_query_blast.txt |awk '$2~/Ghir_A/{a[$1][0]+=1}$2~/evm/{a[$1][2]+=1}$2~/Gor/{a[$1][1]+=1}END{for(i in a){if(a[i][0]==0&&a[i][2]>=1&&a[i][1]>=1)print i}}' 
## Dt、D5、At中都出现
awk '$11==0&&$2!~/Ghir_D.*/{print $0}' Dt_query_blast.txt |awk '$2~/Ghir_A/{a[$1][0]+=1}$2~/evm/{a[$1][2]+=1}$2~/Gor/{a[$1][1]+=1}END{for(i in a){if(a[i][1]>=1&&a[i][2]==0&&a[i][0]>=1)print i}}' 
## At、D5、A2中都出现
awk '$11==0&&$2!~/Ghir_A.*/{print $0}' At_query_blast.txt |awk '$2~/evm/{a[$1][0]+=1}$2~/Ghir_D/{a[$1][2]+=1}$2~/Gor/{a[$1][1]+=1}END{for(i in a){if(a[i][2]==0&&a[i][0]>=1&&a[i][1]>=1)print i}}'
## At、Dt、A2中都出现
awk '$11==0&&$2!~/Ghir_A.*/{print $0}' At_query_blast.txt |awk '$2~/evm/{a[$1][0]+=1}$2~/Ghir_D/{a[$1][2]+=1}$2~/Gor/{a[$1][1]+=1}END{for(i in a){if(a[i][2]>=1&&a[i][0]>=1&&a[i][1]==0)print i}}'
```

### 只在某一个基因组中出现的剪切事件

```bash
## 只在Dt中出现的剪切事件
cat IR_bed/*Dt*|awk -F "\t" '$5~/Ghir_D/{print $5}' |sort |uniq |awk '{print ">"$0}'|cat - Dt_intron_junction.fasta |grep ">"|sort|uniq -u|sed 's/>//g' |awk -F "_" '{print $3"_"$4,$5,$6,$8,$0}' OFS="\t" >noconverse/Dt_IR.bed

## 只在D5中出现的
cat IR_bed/*D5* |awk -F "\t" '$5~/Gor/{print $5}' |sort |uniq |awk '{print ">"$0}'|cat - D5_intron_junction.fasta |grep ">"|sort|uniq -u|sed 's/>//g' |awk -F "_" '{print $2,$3,$4,$6,$0}' OFS="\t"  >noconverse/D5_IR.bed
## 只在A2中出现,非保守的剪切事件
cat IR_bed/*A2*|awk -F "\t" '$5~/evm/{print $5}' |sort |uniq |awk '{print ">"$0}'|cat - A2_intron_junction.fasta |grep ">"|sort|uniq -u|sed 's/>//g' |awk -F "_" '{print $2,$3,$4,$6,$0}' OFS="\t" >noconverse/A2_IR.bed
## 只在At中出现
cat IR_bed/*At*|awk -F "\t" '$5~/Ghir_A/{print $5}' |sort |uniq |awk '{print ">"$0}'|cat - At_intron_junction.fasta |grep ">"|sort|uniq -u|sed 's/>//g' |awk -F "_" '{print $3"_"$4,$5,$6,$8,$0}' OFS="\t" >noconverse/At_IR.bed
```

### 将对于的剪切事件做成bed文件

```bash
## 在所有基因组中都保守的IR对应的bed
awk '$11==0&&$2!~/Ghir_D.*/{print $0}' Dt_query_blast.txt |awk '$2~/Ghir_A/{a[$1][0]+=1}$2~/evm/{a[$1][2]+=1}$2~/Gor/{a[$1][1]+=1}END{for(i in a){if(a[i][0]>=1&&a[i][2]>=1&&a[i][1]>=1)print i}}'|xargs  -I {} grep {} Dt_query_blast.txt| awk '$11==0&&$2!~/Ghir_D.*/{print $1"\n"$2}'|awk -F "_" '$1~/Ghir/{print $3"_"$4,$5,$6,$8,$0}$1!~/Ghir/{print $2,$3,$4,$6,$0}' OFS="\t" |sort |uniq  >IR_bed/all_converse.bed
## 提取只在Dt与D5中保守的事件
awk '$11==0&&$2!~/Ghir_D.*/{print $0}' Dt_query_blast.txt |awk '$2~/Ghir_A/{a[$1][0]+=1}$2~/evm/{a[$1][2]+=1}$2~/Gor/{a[$1][1]+=1}END{for(i in a){if(a[i][0]==0&&a[i][2]==0&&a[i][1]>=1)print i}}'|xargs  -I {} grep {} Dt_query_blast.txt|awk '$11==0&&$2!~/Ghir_D.*/{print $1"\n"$2}'|awk -F "_" '$1~/Ghir/{print $3"_"$4,$5,$6,$8,$0}$1!~/Ghir/{print $2,$3,$4,$6,$0}' OFS="\t" |sort |uniq  >IR_bed/Dt_D5_converse.bed
## 只在Dt与At中保守的剪切事件
awk '$11==0&&$2!~/Ghir_D.*/{print $0}' Dt_query_blast.txt |awk '$2~/Ghir_A/{a[$1][0]+=1}$2~/evm/{a[$1][2]+=1}$2~/Gor/{a[$1][1]+=1}END{for(i in a){if(a[i][1]==0&&a[i][2]==0&&a[i][0]>=1)print i}}' |xargs  -I {} grep {} Dt_query_blast.txt|awk '$11==0&&$2!~/Ghir_D.*/{print $1"\n"$2}'|awk -F "_" '$1~/Ghir/{print $3"_"$4,$5,$6,$8,$0}$1!~/Ghir/{print $2,$3,$4,$6,$0}' OFS="\t" |sort |uniq  >IR_bed/Dt_At_converse.bed
## Dt与A2中保守的剪切事件
awk '$11==0&&$2!~/Ghir_D.*/{print $0}' Dt_query_blast.txt|awk '$2~/Ghir_A/{a[$1][0]+=1}$2~/evm/{a[$1][2]+=1}$2~/Gor/{a[$1][1]+=1}END{for(i in a){if(a[i][1]==0&&a[i][2]>=1&&a[i][0]==0)print i}}' |xargs  -I {} grep {} Dt_query_blast.txt|awk '$11==0&&$2!~/Ghir_D.*/{print $1"\n"$2}'|awk -F "_" '$1~/Ghir/{print $3"_"$4,$5,$6,$8,$0}$1!~/Ghir/{print $2,$3,$4,$6,$0}' OFS="\t" |sort |uniq  >IR_bed/Dt_A2_converse.bed
## A2与At中保守的剪切事件
awk '$11==0&&$2!~/evm.*/{print $0}' A2_query_blast.txt |awk '$2~/Ghir_A/{a[$1][0]+=1}$2~/Ghir_D/{a[$1][2]+=1}$2~/Gor/{a[$1][1]+=1}END{for(i in a){if(a[i][1]==0&&a[i][0]>=1&&a[i][2]==0)print i}}'|xargs  -I {} grep {} A2_query_blast.txt |awk '$11==0&&$2!~/evm.*/{print $1"\n"$2}'|awk -F "_" '$1~/Ghir/{print $3"_"$4,$5,$6,$8,$0}$1!~/Ghir/{print $2,$3,$4,$6,$0}' OFS="\t" |sort |uniq  >IR_bed/A2_At_converse.bed
## At与D5中的保守剪切事件
awk '$11==0&&$2!~/Ghir_A.*/{print $0}' At_query_blast.txt |awk '$2~/evm/{a[$1][0]+=1}$2~/Ghir_D/{a[$1][2]+=1}$2~/Gor/{a[$1][1]+=1}END{for(i in a){if(a[i][2]==0&&a[i][0]==0&&a[i][1]>=1)print i}}'|xargs  -I {} grep {} At_query_blast.txt|awk '$11==0&&$2!~/Ghir_A.*/{print $1"\n"$2}'|awk -F "_" '$1~/Ghir/{print $3"_"$4,$5,$6,$8,$0}$1!~/Ghir/{print $2,$3,$4,$6,$0}' OFS="\t" |sort |uniq  >IR_bed/At_D5_converse.bed
## A2与D5中的保守剪切事件
awk '$11==0&&$2!~/evm.*/{print $0}' A2_query_blast.txt |awk '$2~/Ghir_A/{a[$1][0]+=1}$2~/Ghir_D/{a[$1][2]+=1}$2~/Gor/{a[$1][1]+=1}END{for(i in a){if(a[i][0]==0&&a[i][1]>=1&&a[i][2]==0)print i}}'|xargs  -I {} grep {} A2_query_blast.txt |awk '$11==0&&$2!~/evm.*/{print $1"\n"$2}'|awk -F "_" '$1~/Ghir/{print $3"_"$4,$5,$6,$8,$0}$1!~/Ghir/{print $2,$3,$4,$6,$0}' OFS="\t" |sort |uniq  >IR_bed/A2_D5_converse.bed

## Dt、D5、A2中都出现
awk '$11==0&&$2!~/Ghir_D.*/{print $0}' Dt_query_blast.txt |awk '$2~/Ghir_A/{a[$1][0]+=1}$2~/evm/{a[$1][2]+=1}$2~/Gor/{a[$1][1]+=1}END{for(i in a){if(a[i][0]==0&&a[i][2]>=1&&a[i][1]>=1)print i}}' |xargs  -I {} grep {} Dt_query_blast.txt|awk '$11==0&&$2!~/Ghir_D.*/{print $1"\n"$2}'|awk -F "_" '$1~/Ghir/{print $3"_"$4,$5,$6,$8,$0}$1!~/Ghir/{print $2,$3,$4,$6,$0}' OFS="\t" |sort |uniq  >IR_bed/Dt_D5_A2_converse.bed
## Dt、D5、At中都出现
awk '$11==0&&$2!~/Ghir_D.*/{print $0}' Dt_query_blast.txt |awk '$2~/Ghir_A/{a[$1][0]+=1}$2~/evm/{a[$1][2]+=1}$2~/Gor/{a[$1][1]+=1}END{for(i in a){if(a[i][1]>=1&&a[i][2]==0&&a[i][0]>=1)print i}}' |xargs  -I {} grep {} Dt_query_blast.txt|awk '$11==0&&$2!~/Ghir_D.*/{print $1"\n"$2}'|awk -F "_" '$1~/Ghir/{print $3"_"$4,$5,$6,$8,$0}$1!~/Ghir/{print $2,$3,$4,$6,$0}' OFS="\t" |sort |uniq  >IR_bed/Dt_At_D5_converse.bed
## At、D5、A2中都出现
awk '$11==0&&$2!~/Ghir_A.*/{print $0}' At_query_blast.txt |awk '$2~/evm/{a[$1][0]+=1}$2~/Ghir_D/{a[$1][2]+=1}$2~/Gor/{a[$1][1]+=1}END{for(i in a){if(a[i][2]==0&&a[i][0]>=1&&a[i][1]>=1)print i}}'|xargs  -I {} grep {} At_query_blast.txt|awk '$11==0&&$2!~/Ghir_A.*/{print $1"\n"$2}'|awk -F "_" '$1~/Ghir/{print $3"_"$4,$5,$6,$8,$0}$1!~/Ghir/{print $2,$3,$4,$6,$0}' OFS="\t" |sort |uniq  >IR_bed/At_D5_A2_converse.bed
## At、Dt、A2中都出现
awk '$11==0&&$2!~/Ghir_A.*/{print $0}' At_query_blast.txt |awk '$2~/evm/{a[$1][0]+=1}$2~/Ghir_D/{a[$1][2]+=1}$2~/Gor/{a[$1][1]+=1}END{for(i in a){if(a[i][2]>=1&&a[i][0]>=1&&a[i][1]==0)print i}}'|xargs  -I {} grep {} At_query_blast.txt|awk '$11==0&&$2!~/Ghir_A.*/{print $1"\n"$2}'|awk -F "_" '$1~/Ghir/{print $3"_"$4,$5,$6,$8,$0}$1!~/Ghir/{print $2,$3,$4,$6,$0}' OFS="\t" |sort |uniq  >IR_bed/At_Dt_A2_converse.bed
```

## 查看基因对的数目

```bash
cut -f5 all_converse.bed |awk -F "_" '$1~/Ghir/{print $1"_"$2}$1!~/Ghir/{print $1}' |sort|uniq |grep "Gor" |wc -l
```

| 事件类型    | 基因对的数目 | 事件数目  |
| ------- | ------ | ----- |
| 四个都保守的  | 843    | 1390  |
| At = A2 | 1515   | 2629  |
| Dt =D5  | 1459   | 2189  |
| At=Dt   | 1044   | 1528  |
| A2=D5   | 1060   | 1511  |
| 只在At不存在 | 608    | 845   |
| 只在Dt不存在 | 653    | 945   |
| 只在A2不存在 | 837    | 1199  |
| 只在D5不存在 | 574    | 830   |
| 只有At有   | 4346   | 9387  |
| 只有Dt有   | 4429   | 9780  |
| 只有A2有   | 4541   | 10234 |
| 只有D5有   | 55007  | 12117 |

## 统计保守的IR长度信息

在统计长度的时候发现 Gorai.006G213800这个基因内含子长度都6000多了，对应的At与A2的内含子长度只有588多但是仍旧都可以发生IR

```bash
cat converse/* |sort|uniq |awk -F "_" '{print $(NF-1)}'|less
cat noconverse/* |sort|uniq |awk -F "_" '{print $(NF-1)}'|less
```

***219 12 30***

师兄说不能只根据 evalue 来筛选保守的片段，得看相似度和覆盖度

```bash
## 将相似度与覆盖度的指标写进结果内
blastn -query D5_intron_junction.fasta  -db blastDB/intron_junction -outfmt "6  qseqid sseqid qstart qend sstart send nident pident qcovs evalue bitscore"  -evalue 1e-5  -out D5_query_blast.txt
## 这个筛选的指标感觉还可以
awk '$7>200&&$8>90&&$9>85{print $0}' At_query.blast.txt
```

把筛选的命令重新写一遍

```bash
## 都保守的事件，根据相似度和覆盖度筛选
awk '$7>200&&$8>90&&$9>85&&$2!~/Ghir_D.*/{print $0}' Dt_query_blast.txt |awk '$2~/Ghir_A/{a[$1][0]+=1}$2~/evm/{a[$1][2]+=1}$2~/Gor/{a[$1][1]+=1}END{for(i in a){if(a[i][0]>=1&&a[i][2]>=1&&a[i][1]>=1)print i}}'
```


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://zpliu.gitbook.io/booknote/ke-bian-jian-qie/shu-ju-chu-li-liu-cheng/zhong-xin-kai-shi-jian-ding-as.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
