> For the complete documentation index, see [llms.txt](https://zpliu.gitbook.io/booknote/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://zpliu.gitbook.io/booknote/ke-bian-jian-qie/di-san-ci-dui-as-jin-hang-tong-ji/03.md).

# 重新下载原始数据进行比对

* Ga SRR5886156
* Gr SRR5886149
* Gh SRR5886147

## hisat2比对

* `-novel-splicesite-outfile`生成剪切位点文件
* `--max-intronlen`指定最大内含子长度
* `--rf`建库方式

```bash
##构建索引

bsub -J TM1 -n 10 -o %J.hisat2.out -e %J.hisat2.err -R span[hosts=1]  "hisat2 -x ~/work/Alternative/data/Ghirsutum_genome_HAU_v1.0/hisat2_build/TM1.0 -1 ../Trimmomatic/SRR5886147_Trimm_1P  -2 ../Trimmomatic/SRR5886147_Trimm_2P -p 10 --novel-splicesite-outfile TM1_splicesite.txt  --max-intronlen 10000 --rf  -S TM1_leaf.sam"

## 过滤和去重PCR重复
bsub -J samtools -n 2 -o %J.samtools.out -e %J.samtools.err -R span[hosts=1] "samtools view -q 20 -S TM1_leaf.sam  -@ 2 -b -o TM1_leaf.bam && samtools sort -@ 2 TM1_leaf.bam -O bam -o TM1_sorted.bam &&samtools rmdup TM1_sorted.bam TM1_rmdup.bam"
```

## Stringtie计算gene表达量

```bash
stringtie hisat2/TM1_rmdup.bam -G ~/work/Alternative/result/homologo/FEST3/isforms/TM1/geneExpress/isform.gtf  -b ./stringtie/  -p 10
```

## 计算PIR、PSI

PIR：IR的比例

PSI：intron被剪切的比例

将RNA-seq比对文件转为bed

BAM文件中read比对说明

> M 完全匹配
>
> I 插入
>
> D 缺失
>
> N gap
>
> S 替代

* `-cigar`输出每个read的比对情况
* `-split`不会连接有gap的read

> 80M20N30M这种评分则会将这个read输出两个bed坐标

```bash
~/software/bedtools2-2.29.0/bin/bamToBed -cigar -split -i TM1_rmdup.bam >TM1.bed
## 过滤掉scaffold上的结果
awk '$1~/^Ch/{print $0}' D5.bed  >11 
##提交任务
bsub -q q2680v2 -n 20 -J A2 -e %J.err -o %J.out -R span[hosts=1] "python ~/scripte/Alternative/module/FEST3/PSI/IR_PSI.py  -AS ~/work/Alternative/result/Ga_result/CO11_12_result/11_AS/end_splice.txt3 -bed ../geneExpress/hisat2/A2.bed  -o A2_PSI.txt -p 20 "
```

## 计算所有Intron的PSI

* 合并所有的intron
* 计算每个intron的PIR值

```bash
## 提取intron坐标
python extractAllIntron.py  -pgtf ~/work/Alternative/result/Gr_result/CO41_42_result/07_annotation/merge.gtf -rgtf ~/work/Alternative/data/Gr_genome/Graimondii_221_v2.1.gene.gtf -o 11
awk '$1~/^C/{print $1"\t"$2"\t"$3}' 11 |sort -k1,1 -k2,3n|uniq  >D5_intron.txt
## 合并后计算每个intron的PIR值
```

## AS剪切在一定程度上造成了isform的多样性

发生剪切的高频区域，每个转录本所包含AS 事件的数目

### 比较由AS产生的保守isform发生NMD的比例、和由AS产生的uniq isform发生NMD的比例

> 造成uniq isform表达水平比conserve isform表达水平低的原因，可能是NMD途径导致uniq isform降解；

### 先把所有转录本的ORF预测了

* 需要反转模式预测`reverse`

```bash
/public/home/software/opt/bio/software/EMBOSS/6.5.7/bin/getorf   -find 3 -noreverse -sequence cDNA.fa
```

根据终止密码子与exon-exon junction的位置判断是否会发生NMD

> 在处理D5时，有部分基因没有exon注释，搞出了，修改gtf文件然后再跑一次，还有D5参考基因组内没有cDNA序列，需要提取一次
>
> 我对读取gtf文件的类做了补充，支持没有exon注释的ISOfrom

* 修改D5 gtf文件
* 提取D5 cDNA序列

```bash
##提取参考基因组中cDNA序列
python ~/scripte/Alternative/extractcDNA.py -gtf ~/work/Alternative/data/Gr_genome/Graimondii_221_v2.1.gene.gtf -fasta ~/work/Alternative/data/Gr_genome/Graimondii_221_v2.0.fa -cDNA D5_cDNA.fa
```

```bash
#判断isoform是否NMD
python ~/scripte/Alternative/module/FEST3/NMD.py  -gtf ~/work/Alternative/data/Gr_genome/Graimondii_221_v2.1.gene.gtf -orf 010g215500.orf  -o 11  >222
##统计多少是NMD的比例
awk '$4==1{a+=1}END{print a/NR}' ref_NMD.txt
```

### 关于对参考基因组中注释的isform的验证，需要筛选一下

* 筛选那些每个基因中表达量大于100 fpkm的注释的转录本
* 使用参考基因组的注释信息去跑一下这些转录本的表达量

> we screened a set of 4,868 Arabidopsis genes, for which the annotated longest ORF is fully supported by cDNA evidence

### cDNA序列序列5’和3‘已经确定了，感觉不需要反转

### 对鉴定到的PacBio isform进行NMD分析

* 距离 最后一个exon-exon的junction大于50bp就认为是潜在的NMD

```bash
## 对参考基因组的转录本和PacBio注释的转录本，预测NMD
python ~/scripte/Alternative/module/FEST3/NMD.py  -gtf merge.gtf  -orf merge.orf -o PacBio_NMD.txt
```

| 棉种  | PacBio isform | NMD   |
| --- | ------------- | ----- |
| TM1 | 89411         | 35218 |
| A2  | 72374         | 26583 |
| D5  | 55365         | 19330 |

![N5xSMV.png](https://s1.ax1x.com/2020/06/30/N5xSMV.png)

比较可能发生NMD的isform与没有NMD的表达水平差异

```bash
##多个棉种画同一个图里比较FPKM的话，量化一下；以表达量最多的那个搞
TM1统一提高1.35倍
##所有PacBio isoform的表达水平
awk 'NR>1{print $6"\t"$12}' ../../../geneExpress/stringtie/A2/t_data.ctab   >PacBio_FPKM.txt
##提取NMD isform的表达水平
grep "PB" PacBio_NMD.txt |awk '$4==0{print $1}'|cat - PacBio_FPKM.txt |sort -k1,1  -k2,2nr|awk '$2==""{print "0\t"$1}$2!=""{print $2"\t"$1}'|uniq  -f1 -d|awk '{print $1"\tD5\tNMD"}' >>compare_FPKM.txt
##提取none-NMD isform表达水平
 grep "PB" PacBio_NMD.txt |awk '$4!=0{print $1}'|cat - PacBio_FPKM.txt |sort -k1,1  -k2,2nr|awk '$2==""{print "0\t"$1}$2!=""{print $2"\t"$1}'|uniq  -f1 -d|awk '{print $1"\tD5\tnone"}' >>compare_FPKM.txt
```

### 分析AS与NMD

从剪切文件中提取两个转录本之间只有一种AS的差异，对没有发生AS的那个转录本进行ORF的预测，同时也对发生了AS的isoform预测ORF，选取相同的ST位点，看没有发生AS的转录本是否存在PTC并且PTC与exon-exon junction大于50bp；分析每种剪切类型中存在NMD的转录本

* 获取两个转录本只存在一种类型AS的差异
* 选择没有发生AS的转录本预测的ORF作为参考
* 计算参考转录本所预测的ST和TT
* 计算发生与参考转录本ST相同的ORF，比较TT是否提前

```bash
##合并两次预测的orf
grep ">P" 1.orf >merge.orf
grep ">G" ghir_a01g000010.orf >>merge.orf 
##合并两个gtf文件
cat ~/work/Alternative/data/Ghirsutum_genome_HAU_v1.0/Ghirsutum_gene_model.gtf ../../../isforms/TM1/geneExpress/isform.gtf >merge.gtf
##分析
 python ~/scripte/Alternative/module/FEST3/NMD2.py -gtf merge.gtf  -orf merge.orf  -AS ~/work/Alternative/result/Gh_result/CO31_32_result/11_AS/end_splice.txt3  -o 1111111
##判断是否有PTC
grep "ExonS" 1111111 |awk '$4<$5&&$5<$7{print $0}$4>$5&&$5>$7{print $0}'|wc -l
##对每种类型计算对应的NMD数目和比例
grep "IntronR" AS_NMD.txt |awk '$4<$5&&$5<$7{a+=1}$4>$5&&$5>$7{a+=1}END{print a"\t"NR"\t"a/NR}'
```

| 基因组 | IR          | ES       | AltA      | AltD      |
| --- | ----------- | -------- | --------- | --------- |
| TM1 | 15100/18733 | 758/1274 | 1960/2825 | 1529/2216 |
| A2  | 12887/15287 | 499/727  | 1262/1790 | 1029/1406 |
| D5  | 10051/12296 | 675/917  | 1342/1843 | 980/1383  |

不同AS产生的NMD，在表达水平上的差异

```bash
##提取不同AS产生的isoform id
for i in IntronR ExonS AltA AltD ; do grep $i AS_NMD.txt |awk '$4<$5&&$5<$7{print $2}$4>$5&&$5>$7{print $2}'|cat - PacBio_FPKM.txt |sort -k1,1  -k2,2nr|awk '$2==""{print "0\t"$1}$2!=""{print $2"\t"$1}'|uniq  -f1 -d|awk '{print $1"\tD5\t""'$i'"}' >>As_FPKM.txt; done
```

![N5xJzt.png](https://s1.ax1x.com/2020/06/30/N5xJzt.png)


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://zpliu.gitbook.io/booknote/ke-bian-jian-qie/di-san-ci-dui-as-jin-hang-tong-ji/03.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
