# 01全长转录组数据处理

## 测序数据如下图

![](https://43423.oss-cn-beijing.aliyuncs.com/img/20191009131705.png)

* 每个棉种含有两个重复的数据，之所以有3个文件，是因为有一个重复的数据量没有测够；之后进行了补测

***2019-10-10***

~~合并两次测序的bam文件~~

```bash
## 分别对两个bam文件进行排序，按照序列编号顺序排序
samtools sort -n -@ 2 ./A2/dT_BC5_subreads.bam -O BAM -o ./A2/dT_BC5_subreads_sorted.bam
## 将两个排序好的文件进行合并
samtools merge -n -@ 2 -O BAM ./A2/A2_R2.subreads.bam ./A2/dT_BC5_subreads_sorted.bam ./A2/m54139_180604_080709.subreads_sorted.bam
```

后来发现还得使用ccs方法将两个bam文件指定为一个数据集

***2019-10-11***

因为在进行数据打磨时报错了

`Missing .pbi file`

```bash
dataset create A2_R2.subreadset.xml --type SubreadSet --generateIndices ./../dT_BC5_subreads.bam ./../m54139_180604_080709.subreads.bam
# 之后使用A2_R2.subreadset.xml文件代替bam文件进行ccs即可
```

***2019-10-13***

使用iso-seq中dataset程序处理，重复二的数据

```bash
dataset create R2/TM1_R2.subreadset.xml --type SubreadSet --generateIndices R1801371_QJ_BC1_subreads.bam m54139_180607_052119.subreads.bam &
dataset create R2/D5_R2.subreadset.xml --type SubreadSet --generateIndices m54136_180730_021327.subreads.bam m54139_180609_044201.subreads.bam &
```

使用ccs处理

```bash
ccs --noPolish --minPasses 1 --minLength 300 --minSnr 4 --maxLength 10000 --maxDropFraction 0.8 --minPredictedAccuracy 0.8 --numThreads 20 --logFile ./TM1_R2_ccs.log --reportFile ./TM1_R2port_ccs.txt ./TM1_R2.subreadset.xml  ./TM1_R2_ccsout.bam
```

之前数据的分析，师姐已经完成了；接下来主要是针对数据进行个性化的分析了；其实我还挺想自己把流程走一遍的，可惜公司不会把分析流程的protocol发给我们

***2019-10-14***

将之前的结果解压,到当前文件夹

`unrar x result.rar`

~~参考我之前写的软件使用说明~~

~~tps\://zpliu.gitbook.io/booknote/ke-bian-jian-qie/ruan-jian-shi-yong/01-san-dai-ce-xu-isoseq#2-shi-yong-ccs-dui-yuan-shi-shu-ju-jin-hang-guo-lv~~

### 统计测序序列情况

1.二倍体亚洲棉

数据来源：第03三步分析流程里文件`Classify_stat.xls`

***CO\_31***

|                 序列类型                |   数目   |   比例   |
| :---------------------------------: | :----: | :----: |
|      consensus reads质控后经过ccs处理      | 320874 |  100%  |
|        non-full-length reads        |  34265 | 10.68% |
| full-length reads（包括chimeric reads) | 286277 | 89.22% |
|    filtered short reads长度小于300bp    |   332  |  0.1%  |
|            chimeric reads           |  9710  |  3.03% |

***CO\_32***

|                 序列类型                |   数目   |   比例   |
| :---------------------------------: | :----: | :----: |
|      consensus reads质控后经过ccs处理      | 292447 |  100%  |
|        non-full-length reads        |  34265 | 10.68% |
| full-length reads（包括chimeric reads) | 228041 | 77.98% |
|    filtered short reads长度小于300bp    |   366  |  0.13% |
|            chimeric reads           |  11003 |  3.77% |

1. 四倍体陆地棉

   ***Col31\&C0l32***

   |                 序列类型                |   数目   |   比例   |
   | :---------------------------------: | :----: | :----: |
   |      consensus reads质控后经过ccs处理      | 685383 |  100%  |
   |        non-full-length reads        | 103674 | 15.13% |
   | full-length reads（包括chimeric reads) | 581611 | 84.86% |
   |    filtered short reads长度小于300bp    |   98   |  0.01% |
   |            chimeric reads           |  18914 |  2.76% |
2. 二倍体雷蒙德氏棉

   ***Col31\&C0l32***

   |                 序列类型                |   数目   |   比例   |
   | :---------------------------------: | :----: | :----: |
   |      consensus reads质控后经过ccs处理      | 487063 |  100%  |
   |        non-full-length reads        |  87450 | 17.95% |
   | full-length reads（包括chimeric reads) | 399112 | 81.94% |
   |    filtered short reads长度小于300bp    |   501  |  0.1%  |
   |            chimeric reads           |  26652 |  5.47% |

​

## 比较测序数据与参考基因组数据中isform的长度与对应exon的数目

* 在测序数据中`06_Alignment/all.collapsed.gff`文件中有每条isform的信息

  1.统计每条isform的长度信息，**只包含外显子的长度**

  2.统计每条isform的外显子数目

  ```bash
  sed 's/gene_id [^t]*transcript_id//g' all.collapsed.gff |awk '$3~/^e/{isformLen[$9]+=$5-$4+1;exonCount[$9]+=1}END{for(i in isformLen){printf i"\t"isformLen[i]"\t"exonCount[i]"\n"}}'
  ```

  ```bash
  ##得到每个isform的长度和外显子数目
  "PB.1024.1";    4680    5
  "PB.10242.1";   3658    7
  "PB.1024.2";    4463    5
  "PB.10243.1";   2680    3
  ```
* 参考基因组gff文件中有每个mRNA的注释信息

  多了一个sed的处理主要是isform的第九列多了·一些信息`ID=evm.model.Ga01G0001.exon2`，把exon那部分去除后获得isform编号

  ```bash
  awk -F ";" '{print $1}' Ghirsutum_gene_model.gff3|sed 's/\.exon.*//g' |awk '$3~/^e/{isformLen[$9]+=$5-$4+1;exonCount[$9]+=1}END{for(i in isformLen){printf i"\t"isformLen[i]"\t"exonCount[i]"\n"}}'
  ```
* 准备绘图的数据

  ```bash
  # 将isform名字信息修改成分组信息
   sed -i 's/\"PB[^\t]*/TM1_PBisform/g' my_isform_length_exonCount.txt
  # 修改基因组的isform信息
  sed -i 's/\ID=[^\t]*/TM1_genome/g' my_genome_isform_length_exonCount.txt
  # 将多个棉种的数据合并成一个文件，比较外显子数目上的差异
  awk '{exonClass[$1"*"$3]+=1}END{for(i in exonClass){split(i,tmp,"*");printf tmp[1]"\t"tmp[2]"\t"exonClass[i]"\n"}}' all_genome_isform_comprasion.txt      >all_genome_isform_exonCount.txt
  ```

  得到每个基因组中isform的外显子数目分布情况

  ```bash
  #基因组    外显子数目    对应的isform数目
  TM1_genome    64    3
  D5_PBisform    43    7
  D5_genome    16    674
  TM1_genome    65    1
  D5_PBisform    44    13
  D5_genome    30    51
  ```

***2019-10-16***

* 绘制isform长度比较信息，和外显子数目的比较信息

  ```r
  #绘制isform长度信息
  ggplot(data = tmp,aes(x=tmp$V2,y=tmp$V3,fill=tmp$V2))+geom_boxplot()+ facet_wrap(~tmp$V1,scales = "free")+
  ```

```r
  # 绘制isform中外显子的数目信息
  ggplot(data = tmp,aes(x=tmp$V3,y=tmp$V4,fill=tmp$V2))+geom_bar(stat = "identity", position = 'dodge')+scale_x_continuous(breaks = seq(1,20,1))+facet_wrap(~tmp$V1)
```

***2019-10-17***

第7个文件夹annotation中包含了，与参考基因组注释文件进行合并后的注释信息

利用地6步分析中得到的非冗余的isform序列，与基因组进行比较

| 棉种  | 非冗余的isform | 比对到注释基因的isform | 没有比对到基因区域 | 对应基因数  | 对应的转录本数        | 原参考注释基因数目 |
| --- | ---------- | -------------- | --------- | ------ | -------------- | --------- |
| TM1 | 89,411     | 83,281         | 6,130     | 32,003 | 39,098         | 70199     |
| A2  | 7,2393     | 69,179         | 3,214     | 20,907 | 参考基因组没有预测不同转录本 | 40960     |
| D5  | 55,381     | 53,058         | 2,323     | 18,904 | 24,368         | 37505     |

第8个和第9个、10个文件的数据暂时用不到

**分析第11个关于可变剪切的文件**

分析不同棉种中可变剪切的类型，及发生可变剪切的基因数目

`grep ExonS splice.as.xls|wc -l` 统计外显子跳跃事件的数目

`grep ExonS splice.as.xls|awk '{print $2}' |sort|uniq |wc -l` 统计对应的基因数目

`剪切事件数目/对应的基因数目`

| 棉种   | gene 总数目 | 事件总数目  |
| ---- | -------- | ------ |
| TM-1 | 23137    | 121567 |
| A2   | 11733    | 60921  |
| D5   | 14483    | 81392  |

|  棉种 | ES(ExonS) | IR(IntronR) |    AltD    |    AltA    |    AltP   |   Other   |
| :-: | :-------: | :---------: | :--------: | :--------: | :-------: | :-------: |
| TM1 | 5824/3519 | 70132/18813 | 13147/7329 | 15631/8522 | 8182/4201 | 8650/3621 |
|  A2 | 2536/1564 |  38121/9515 |  4746/2736 |  6129/3278 | 4764/2318 | 4624/1975 |
|  D5 | 5502/3005 | 42372/11062 |  8104/4713 | 11799/6094 | 5980/3046 | 7634/2810 |

### 看看比例TM-1

|  事件类型 |    事件比例与数目   |   基因数目与比例   |
| :---: | :----------: | :---------: |
|   IR  |  70132\~57%  |  18813\~81% |
|   AD  | 13147\~10.8% | 7329\~31.6% |
|   AA  | 15631\~12.8% | 8522\~36.8% |
|   ES  |  5824\~4.8%  | 3519\~15.2% |
|   AP  |  8182\~6.7%  | 4201\~18.1% |
| Other |  8650\~7.1%  | 3621\~15.6% |

### 看看比例A2

|  事件类型 |    事件比例与数目   |   基因数目与比例   |
| :---: | :----------: | :---------: |
|   IR  | 38121\~62.5% |  9515\~81%  |
|   AD  |  4746\~7.8%  | 2736\~23.3% |
|   AA  |   6129\~10%  | 3278\~27.9% |
|   ES  |  2536\~4.1%  | 1564\~13.3% |
|   AP  |  4764\~7.8%  | 2318\~19.7% |
| Other |  4624\~7.6%  | 1975\~16.8% |

### 看看比例D5

|  事件类型 |    事件比例与数目   |   基因数目与比例   |
| :---: | :----------: | :---------: |
|   IR  |  42372\~52%  |  11062\~76% |
|   AD  |  8104\~9.9%  | 4713\~32.5% |
|   AA  | 11799\~14.5% |  6094\~42%  |
|   ES  |  5502\~6.7%  | 3005\~20.7% |
|   AP  |  5980\~7.3%  |  3046\~21%  |
| Other |  7634\~9.4%  | 2810\~19.4% |

**D5的剪切事件相对与A2来说有点偏高**，可能是由于A2的参考基因组本身就没有对多个转录本进行预测；为了验证这个猜想，就只看PBc得到的转录本之间的剪切事件

|  棉种 | ES(ExonS) | IR(IntronR) |    AltD   |    AltA   |    AltP   |   Other   |
| :-: | :-------: | :---------: | :-------: | :-------: | :-------: | :-------: |
| TM1 |  1225/916 |  18218/5901 | 2532/1641 | 3088/1926 | 2797/1532 | 2609/1264 |
|  A2 |  1253/840 |  17865/5347 | 2201/1444 | 2945/1803 | 2358/1312 | 2571/1068 |
|  D5 |  504/368  |  9578/3160  |  1112/770 |  1438/901 |  1379/858 |  1466/735 |

```bash
# 批量统计次数
for i in 1; do 
awk '$6~/^PB/&&$7~/^PB/{print $0}' splice.as.xls |grep ExonS|wc -l; 
awk '$6~/^PB/&&$7~/^PB/{print $0}' splice.as.xls |grep ExonS|awk '{print $2}' |sort|uniq |wc -l; 
awk '$6~/^PB/&&$7~/^PB/{print $0}' splice.as.xls |grep IntronR|wc -l; 
awk '$6~/^PB/&&$7~/^PB/{print $0}' splice.as.xls |grep IntronR|awk '{print $2}' |sort|uniq |wc -l; 
awk '$6~/^PB/&&$7~/^PB/{print $0}' splice.as.xls |grep AltD|wc -l; 
awk '$6~/^PB/&&$7~/^PB/{print $0}' splice.as.xls |grep AltD|awk '{print $2}' |sort|uniq |wc -l; 
awk '$6~/^PB/&&$7~/^PB/{print $0}' splice.as.xls |grep AltA|wc -l;
awk '$6~/^PB/&&$7~/^PB/{print $0}' splice.as.xls |grep AltA|awk '{print $2}' |sort|uniq |wc -l; 
awk '$6~/^PB/&&$7~/^PB/{print $0}' splice.as.xls |grep AltP|wc -l; 
awk '$6~/^PB/&&$7~/^PB/{print $0}' splice.as.xls |grep AltP|awk '{print $2}' |sort|uniq |wc -l; 
awk '$6~/^PB/&&$7~/^PB/{print $0}' splice.as.xls |grep Other|wc -l; 
awk '$6~/^PB/&&$7~/^PB/{print $0}' splice.as.xls |grep Other|awk '{print $2}' |sort|uniq |wc -l; done
```

***2019-10-18***

在第7个文件夹中`matchAnnot_result.txt`文件中包含比对到参考基因组上的isform和对应的参考基因编号

分析全长转录组的数据与原始数据相比在isform数目上的差异

***2019-11-2***

* 根据参考基因组的注释文件，产生剪切位点的bed文件

  假定剪切复合体是以**外显子识别的模式**进行剪切的，并且不关心外显子在正负链上的关系

```bash
#将第9列改造成只剩下基因ID，结合exon的起始位点，给每个exon一个编号,不分析脚手架了
sed 's/\;.*//g' ../Ghirsutum_gene_model.gtf |sed 's/transcript_id //g' |sed 's/\"//g'|awk '$1~/Gh/&&$3~/exon/{print $1"\t"$4"\t"$5"\t"$7"\t"$9"_"$4}' >tmp.bed
awk  '$1~/Chr/{print $1"\t"$2"\t"$3"\t"$4"\t"$5"_"$2}' constitutiveExon.bed >tmp.bed

### 结果如下
Ghir_A01        13133   13156   -       Ghir_A01G000010.1_13133
Ghir_A01        13840   14080   -       Ghir_A01G000010.1_13840
Ghir_A01        14164   15594   -       Ghir_A01G000010.1_14164
```

* 生成各种类型的剪切

  ```bash
  # 生成5'外显子区域的bed文件，覆盖到每个碱基
  awk '{for(i=$2-1;i<=$2+73;i++){print $1"\t"i"\t"i"\t"$4"\t"$5"_"i-$2+1}}' tmp.bed >exon_5.bed 
  awk '{for(i=$3-73;i<=$3+1;i++){print $1"\t"i"\t"i"\t"$4"\t"$5"_"i-$3+73}}' tmp.bed >exon_3.bed
  awk '{for(i=$2-201;i<=$2-2;i++){print $1"\t"i"\t"i"\t"$4"\t"$5"_"i-$2+201}}' tmp.bed >intron_5.bed 
  awk '{for(i=$3+2;i<=$3+201;i++){print $1"\t"i"\t"i"\t"$4"\t"$5"_"i-$3-2}}'  tmp.bed >intron_3.bed

  # 用于提取指定区域的fasta序列
  awk '{print $1"\t"$2-1"\t"$2+73"\t"$4"\t"$5}' tmp.bed >exon_5_fasta.bed
  awk '{print $1"\t"$3-73"\t"$3+1"\t"$4"\t"$5}' tmp.bed >exon_3_fasta.bed
  awk '{print $1"\t"$2-201"\t"$2-2"\t"$4"\t"$5}' tmp.bed >intron_5_fasta.bed
  awk '{print $1"\t"$3+2"\t"$3+201"\t"$4"\t"$5}' tmp.bed >intron_3_fasta.bed

  # for 循环执行
  for i in 1
  do
  sed 's/\;.*//g' ../Ghirsutum_gene_model.gtf |sed 's/transcript_id //g' |sed 's/\"//g'|awk '$1~/Gh/&&$3~/exon/{print $1"\t"$4"\t"$5"\t"$7"\t"$9"_"$4}'|awk '{for(i=$2-1;i<=$2+73;i++){print $1"\t"i"\t"i"\t"$4"\t"$5"_"i-$2+1}}'  >exon_5.bed 
  sed 's/\;.*//g' ../Ghirsutum_gene_model.gtf |sed 's/transcript_id //g' |sed 's/\"//g'|awk '$1~/Gh/&&$3~/exon/{print $1"\t"$4"\t"$5"\t"$7"\t"$9"_"$4}'|awk '{for(i=$3-73;i<=$3+1;i++){print $1"\t"i"\t"i"\t"$4"\t"$5"_"i-$3+73}}' >exon_3.bed
  sed 's/\;.*//g' ../Ghirsutum_gene_model.gtf |sed 's/transcript_id //g' |sed 's/\"//g'|awk '$1~/Gh/&&$3~/exon/{print $1"\t"$4"\t"$5"\t"$7"\t"$9"_"$4}'|awk '{for(i=$2-201;i<=$2-2;i++){print $1"\t"i"\t"i"\t"$4"\t"$5"_"i-$2+201}}'  >intron_5.bed 
  sed 's/\;.*//g' ../Ghirsutum_gene_model.gtf |sed 's/transcript_id //g' |sed 's/\"//g'|awk '$1~/Gh/&&$3~/exon/{print $1"\t"$4"\t"$5"\t"$7"\t"$9"_"$4}'|awk '{for(i=$3+2;i<=$3+201;i++){print $1"\t"i"\t"i"\t"$4"\t"$5"_"i-$3-2}}' >intron_3.bed
  sed 's/\;.*//g' ../Ghirsutum_gene_model.gtf |sed 's/transcript_id //g' |sed 's/\"//g'|awk '$1~/Gh/&&$3~/exon/{print $1"\t"$4"\t"$5"\t"$7"\t"$9"_"$4}'|awk '{print $1"\t"$2-1"\t"$2+73"\t"$4"\t"$5}' >exon_5_fasta.bed
  sed 's/\;.*//g' ../Ghirsutum_gene_model.gtf |sed 's/transcript_id //g' |sed 's/\"//g'|awk '$1~/Gh/&&$3~/exon/{print $1"\t"$4"\t"$5"\t"$7"\t"$9"_"$4}'|awk '{print $1"\t"$3-73"\t"$3+1"\t"$4"\t"$5}'  >exon_3_fasta.bed
  sed 's/\;.*//g' ../Ghirsutum_gene_model.gtf |sed 's/transcript_id //g' |sed 's/\"//g'|awk '$1~/Gh/&&$3~/exon/{print $1"\t"$4"\t"$5"\t"$7"\t"$9"_"$4}'|awk '{print $1"\t"$2-201"\t"$2-2"\t"$4"\t"$5}' >intron_5_fasta.bed
  sed 's/\;.*//g' ../Ghirsutum_gene_model.gtf |sed 's/transcript_id //g' |sed 's/\"//g'|awk '$1~/Gh/&&$3~/exon/{print $1"\t"$4"\t"$5"\t"$7"\t"$9"_"$4}'|awk '{print $1"\t"$3+2"\t"$3+201"\t"$4"\t"$5}'  >intron_3_fasta.bed
  done
  ```
* 将甲基化的数据构造成bed文件

  ```bash
  awk '{print $1"\t"$2"\t"$2}' ~/WMJ_3D_DNAmethy/Gh_bismark/CpG_context_D1_binom_fdr_result.txt  >CpG_context_D1.bed
  ```
* 统计每个区域中含有甲基化的位点的数目

  ```bash
  ## 如果区域没有对应的甲基化位点则会显示为-1，有则显示对于的甲基化位置
  ~/software/bedtools2-2.29.0/bin/intersectBed  -b CpG_context_D1.bed -a exon_3.bed -loj >exon_3Meth_count.txt 
  ## 统计每个区域甲基化的次数，后面几列的坐标一样的话就认为是一个甲基化位点
  # awk '$2==$7{print $0}' exon_3Meth_count.txt 
  ## 统计每个区域中单个碱基精度的甲基化程度，需要重新制作bed文件
  ```
* 提取对应区域的**CG**类型的数目，进行甲基化水平的标准化

  ```bash
  # 统计每个区域CG含量水平
   python extract_CG_count.py sequencFile outputFile
  # 提取甲基化是单个碱基的精度下的甲基化水平
  python scale_Methlation_by_sigleBase.py  exon_3_CGcount.txt  exon_3Meth_count.txt scale_single_base.txt
  # 统计外显子的数目
  grep ">" exon_3.fasta|wc -l
  # 均一化所有外显子的单碱基水平
  sed 's/_/\t/g' scale_single_base.txt |awk '{print $(NF-1)"\t"$NF}'|awk '{a[$1]+=$2}END{for(i in a)print i"\t"a[i]/外显子数目}' >end
  ```
* 画两幅内含子和外显子连接的图，intron5-exon5 vs exon3-intron3

  ```bash
  ## 将5'外显子坐标延长200，然后合并
  awk '{print $1+200"\t"$2}' exon5_end |sort -k1 -n|cat - intron5_end |sort -k1 -n >intron_exon_5.txt 
  ## 将3'内含子延长75 然后合并
  awk '{print $1+75"\t"$2}' intron3_end |cat - exon3_end |sort -k1 -n >intron_exon_3.txt
  ```
* ggplot2绘制图形

  ```r
  ## 5'端
  ggplot(data=data2,aes(x=data2$V1,y=data2$V2))+
       geom_line(color=c("#ff4757"),size=1)+
       ylim(c(0,0.008))+
       scale_x_continuous(breaks = c(0,200,274),labels = c(-200,'','+75'))+
       theme(panel.grid=element_blank(),panel.border = element_blank(),panel.background = element_rect(fill = "#f1f2f6"))+
       geom_vline(xintercept = 200,linetype="dashed",color="blue")+
       theme(axis.line=element_line(size=0.8))+xlab("")+ylab("average mCpG/CpG")
  ## 3'端
  ggplot(data=data1,aes(x=data1$V1,y=data1$V2))+
      geom_line(color=c("#ff4757"),size=1)+
      ylim(c(0,0.008))+
      scale_x_continuous(breaks = c(0,75,274),labels = c('-75','','+200'))+
      theme(panel.grid=element_blank(),panel.border = element_blank(),panel.background = element_rect(fill = "#f1f2f6"))+
      geom_vline(xintercept = 75,linetype="dashed",color="blue")+
      theme(axis.line=element_line(size=0.8))+xlab("")+ylab("average mCpG/CpG")
  ```

***2019-11-08***

​ 重新分析可变剪切的数据，使用gmap进行比对，过滤指标覆盖度85%，相似度90%

```bash
cat ./all.collapsed.rep.fa|~/software/gmap-2019-09-12/bin/gmap -D ~/work/Alternative/data/gmap_build/Graimondii_221_v2.0 -d Graimondii_221_v2.0 -f samse -t 10 -n 2 --min-trimmed-coverage=0.85 --min-identity=0.9  >test.sam
```

***2019-11-14***

考虑每个isform直接的比较

|  棉种 |  ES(ExonS) |  IR(IntronR) |    AltD    |    AltA    |    AltP    |   Other   |
| :-: | :--------: | :----------: | :--------: | :--------: | :--------: | :-------: |
| TM1 | 17702/3440 | 254763/18239 | 37147/7111 | 48642/8180 | 13878/3441 | 8005/2046 |
|  A2 |  5074/1390 |  94222/9164  | 12407/2762 | 17002/3382 | 11798/2650 | 4310/1213 |
|  D5 | 18554/2949 | 148092/10583 | 19095/3523 | 31982/4988 |  5508/1406 | 5199/1279 |

只考虑基因坐标

|  棉种 | ES(ExonS) | IR(IntronR) |    AltD    |    AltA    |    AltP   |   Other   |
| :-: | :-------: | :---------: | :--------: | :--------: | :-------: | :-------: |
| TM1 | 4139/3440 | 53689/18239 | 10223/7111 | 11951/8180 | 5088/3441 | 2823/2046 |
|  A2 | 1656/1390 |  27223/9164 |  3839/2762 |  5021/3382 | 4428/2650 | 1757/1213 |
|  D5 | 3863/2949 | 31532/10583 |  4903/3523 |  7343/4988 | 1892/1406 | 1758/1279 |

A2\_8 vs D5\_6

| 棉种  | gene 数目 | 事件数目  |
| --- | ------- | ----- |
| TM1 | 21405   | 87934 |
| A2  | 10861   | 42044 |
| D5  | 12684   | 51616 |

|  棉种 | ES(ExonS) | IR(IntronR) |    AltD    |    AltA    |    AltP   |   Other   |
| :-: | :-------: | :---------: | :--------: | :--------: | :-------: | :-------: |
| TM1 | 4139/3440 | 53689/18239 | 10223/7111 | 11951/8180 | 5088/3441 | 2823/2046 |
|  A2 | 2276/1560 |  28078/9203 |  4135/2810 |  5199/3376 | 1823/1296 |  1435/949 |
|  D5 | 4280/2958 | 31756/10581 |  4900/3427 |  7055/4731 | 1819/1303 | 1805/1209 |

***2019-11-16***

比较Alt exon与constituents exon的甲基化水平

### 提取可变剪切文件中所有isform的exon坐标

先提取PacBio对应的isform编号，以及对应的mRNA对应的编号

```bash
awk '$6~/PB/{print $6"\t"$2}' end_third |sort|uniq >PacBio_ID
awk '$7~/PB/{print $7"\t"$2}' end_third |sort|uniq >>PacBio_ID
awk '$6~/evm/{print $6"\t"$2}' end_third |sort|uniq >mRNA_ID
awk '$7~/evm/{print $7"\t"$2}' end_third |sort|uniq  >>mRNA_ID
## 还得去一次重
sort PacBio_ID|uniq >PacBio_ID_1    
## 使用python脚本将每个isform的外显子信息坐标信息进去取交集，假定在2个isform中出现的exon坐标，认为是constituents exon
python constitutiveExonLocation.py PacBio_ID mRNA_ID ../test.gff  ~/work/Alternative/data/Ga_genome/G.arboreum.Chr.v1.0.gff constitutiveExonLocation
# 给每个constitutive exon取个名字
# 最终的constitutive exon结果里可能会有重复的坐标，由于不同转录本之间太相似了,把一些重叠区域尽量最小化
sort -n -k3,3 constitutiveExon.bed |awk '{print $2,$3,$1,$4,$5}' OFS="\t"|uniq -f 1|awk '{print $2,$1,$3,$4,$5}' OFS="\t" |sort -k2,2 -n |uniq -f1 |awk '{print $3,$2,$1,$4,$5}' OFS="\t"|awk '{print $1,$2,$3,$4,$5"_"$2}' OFS="\t" >tmp.bed
# 提取对应的bed坐标文件，这有点问题没有根据链来提，会有误差
for i in 1
do
awk '{for(i=$2-1;i<=$2+73;i++){print $1"\t"i"\t"i"\t"$4"\t"$5"_"i-$2+1}}'  tmp.bed >exon_5.bed 
awk '{for(i=$3-73;i<=$3+1;i++){print $1"\t"i"\t"i"\t"$4"\t"$5"_"i-$3+73}}' tmp.bed >exon_3.bed
awk '{for(i=$2-201;i<=$2-2;i++){print $1"\t"i"\t"i"\t"$4"\t"$5"_"i-$2+201}}'  tmp.bed >intron_5.bed 
awk '{for(i=$3+2;i<=$3+201;i++){print $1"\t"i"\t"i"\t"$4"\t"$5"_"i-$3-2}}' tmp.bed >intron_3.bed
awk '{print $1"\t"$2-1"\t"$2+73"\t"$4"\t"$5}' tmp.bed >exon_5_fasta.bed
awk '{print $1"\t"$3-73"\t"$3+1"\t"$4"\t"$5}'  tmp.bed >exon_3_fasta.bed
awk '{print $1"\t"$2-201"\t"$2-2"\t"$4"\t"$5}' tmp.bed >intron_5_fasta.bed
awk '{print $1"\t"$3+2"\t"$3+201"\t"$4"\t"$5}'  tmp.bed >intron_3_fasta.bed
done
# 提取对应的序列用于计算CG含量
python ~/scripte/according_CDS_location_find_fasta_mesage.py exon_3_fasta.bed ~/work/Alternative/data/Ga_genome/G.arboreum.Chr.v1.0.fasta exon_3.fasta
python ~/scripte/extract_CG_count.py  exon_3.fasta  exon_3_CG_count.txt
# 使用intersectBed做交集
~/software/bedtools2-2.29.0/bin/intersectBed  -b CpG_context_D1.bed -a exon_3.bed -loj >exon_3Meth_count.txt 
# 提取单碱基下的甲基化水平
python scale_Methlation_by_sigleBase.py  exon_3_CGcount.txt  exon_3Meth_count.txt scale_single_base.txt
# 按exon的数目进行标准化 # 均一化所有外显子的单碱基水平
grep ">" exon_3.fasta|wc -l
sed 's/_/\t/g' scale_single_base.txt |awk '{print $(NF-1)"\t"$NF}'|awk '{a[$1]+=$2}END{for(i in a)print "AltExon\t"i"\t"a[i]/外显子数目}' >end

### 搞个for循环懒得敲了
for i in 1
do
python ~/scripte/according_CDS_location_find_fasta_mesage.py exon_3_fasta.bed ~/work/Alternative/data/Ga_genome/G.arboreum.Chr.v1.0.fasta exon_3.fasta
python ~/scripte/according_CDS_location_find_fasta_mesage.py exon_5_fasta.bed ~/work/Alternative/data/Ga_genome/G.arboreum.Chr.v1.0.fasta exon_5.fasta
python ~/scripte/according_CDS_location_find_fasta_mesage.py intron_3_fasta.bed ~/work/Alternative/data/Ga_genome/G.arboreum.Chr.v1.0.fasta intron_3.fasta
python ~/scripte/according_CDS_location_find_fasta_mesage.py intron_5_fasta.bed ~/work/Alternative/data/Ga_genome/G.arboreum.Chr.v1.0.fasta intron_5.fasta
python ~/scripte/extract_CG_count.py  exon_3.fasta  exon_3_CG_count.txt
python ~/scripte/extract_CG_count.py  exon_5.fasta  exon_5_CG_count.txt
python ~/scripte/extract_CG_count.py intron_3.fasta  intron_3_CG_count.txt
python ~/scripte/extract_CG_count.py intron_5.fasta  intron_5_CG_count.txt
~/software/bedtools2-2.29.0/bin/intersectBed  -b ~/work/Alternative/data/Ga_genome/test/CpG_context_D4.bed -a exon_3.bed -loj >exon_3Meth_count.txt 
~/software/bedtools2-2.29.0/bin/intersectBed  -b ~/work/Alternative/data/Ga_genome/test/CpG_context_D4.bed -a exon_5.bed -loj >exon_5Meth_count.txt 
~/software/bedtools2-2.29.0/bin/intersectBed  -b ~/work/Alternative/data/Ga_genome/test/CpG_context_D4.bed -a intron_5.bed -loj >intron_5Meth_count.txt 
~/software/bedtools2-2.29.0/bin/intersectBed  -b ~/work/Alternative/data/Ga_genome/test/CpG_context_D4.bed -a intron_3.bed -loj >intron_3Meth_count.txt 
python ~/scripte/scale_Methlation_by_sigleBase.py exon_3_CG_count.txt  exon_3Meth_count.txt exon_3scale_single_base.txt
python ~/scripte/scale_Methlation_by_sigleBase.py exon_5_CG_count.txt  exon_5Meth_count.txt exon_5scale_single_base.txt
python ~/scripte/scale_Methlation_by_sigleBase.py intron_5_CG_count.txt intron_5Meth_count.txt intron_5scale_single_base.txt
python ~/scripte/scale_Methlation_by_sigleBase.py intron_3_CG_count.txt intron_3Meth_count.txt intron_3scale_single_base.txt
done

### 最终结果
grep ">" intron_5.fasta |wc -l
## exon的类型记得改一下
for exonCount in 外显子的数目
do
sed 's/_/\t/g' exon_3scale_single_base.txt |awk '{print $(NF-1)"\t"$NF}'|awk '{a[$1]+=$2}END{for(i in a)print "AltExon\t"i"\t"a[i]/"'$exonCount'"}' >exon_3_end
sed 's/_/\t/g' exon_5scale_single_base.txt |awk '{print $(NF-1)"\t"$NF}'|awk '{a[$1]+=$2}END{for(i in a)print "AltExon\t"i"\t"a[i]/"'$exonCount'"}' >exon_5_end
sed 's/_/\t/g' intron_5scale_single_base.txt |awk '{print $(NF-1)"\t"$NF}'|awk '{a[$1]+=$2}END{for(i in a)print "AltExon\t"i"\t"a[i]/"'$exonCount'"}' >intron_5_end
sed 's/_/\t/g' intron_3scale_single_base.txt |awk '{print $(NF-1)"\t"$NF}'|awk '{a[$1]+=$2}END{for(i in a)print "AltExon\t"i"\t"a[i]/"'$exonCount'"}' >intron_3_end
done
### 为了画图方便将所有exon做成连续的坐标
## 将5'外显子坐标延长200，然后合并
awk '{print $1"\t"$2+200"\t"$3}' exon_5_end |sort -k2 -n|cat - intron_5_end |sort -k2 -n >intron_exon_5.txt 
## 将3'内含子延长75 然后合并
awk '{print $1"\t"$2+75"\t"$3}' intron_3_end |cat - exon_3_end |sort -k2 -n >intron_exon_3.txt
```

## 提取Alter exon的坐标

```bash
### 没有染色体正负链的信息，在后面提取序列要用，但是提取CG碱基的时候，由于对称性所以提取哪条链都没关系
### 都以正链为标准
awk '$3~/IntronR/||$3~/ExonS/{print $1,$4,$5,"+",$2}' OFS="\t"  end_third  >../Alter_exon/alter_exon.bed
```

***2019-11-20***

之间比较Constitutive Exon和Alternative Exon之间甲基化水平的差异

### Alternative Exon需要把对应的链的信息给补充完整

直接根据基因区编号来补充把

```bash
## 提取基因正负链信息
cut -f5 alter_exon.bed |xargs  -I {} grep {} ~/work/Alternative/data/Ga_genome/G.arboreum.Chr.v1.0.gff|awk '$3~/gene/{print $7}' |paste alter_exon.bed - >11111111111
## 过滤重复的可变exon
awk '{print $1,$2,$3,$6,$5}' OFS="\t"  ../11111111111 |sort -n -k3,3  |awk '{print $2,$3,$1,$4,$5}' OFS="\t"|uniq -f 1|awk '{print $2,$1,$3,$4,$5}' OFS="\t" |sort -k2,2 -n |uniq -f1 |awk '{print $3,$2,$1,$4,$5}' OFS="\t"|awk '{print $1,$2,$3,$4,$5"_"$2}' OFS="\t" >tmp.bed
```

```bash
## 构造每个外显子在5'|3'端内含子和外显子的甲基化数据
for i in 1
do
awk '$4~/+/{print $1"\t"$2"\t"$2+74"\t"$4"\t"$5}$4~/-/{print $1"\t"$3-74"\t"$3"\t"$4"\t"$5}' ../tmp.bed  >exon_5.bed
awk '$4~/+/{print $1"\t"$3-74"\t"$3"\t"$4"\t"$5}$4~/-/{print $1"\t"$2"\t"$2+74"\t"$4"\t"$5}' ../tmp.bed >exon_3.bed
awk '$4~/+/{print $1"\t"$3+1"\t"$3+75"\t"$4"\t"$5}$4~/-/{print $1"\t"$2-75"\t"$2-1"\t"$4"\t"$5}' ../tmp.bed >intron_3.bed
awk '$4~/+/{print $1"\t"$2-75"\t"$2-1"\t"$4"\t"$5}$4~/-/{print $1"\t"$3+1"\t"$3+75"\t"$4"\t"$5}' ../tmp.bed  >intron_5.bed
done
## 长度都是一定的，提取对应的CG甲基含量
for i in 1
do
python ~/scripte/according_CDS_location_find_fasta_mesage.py  exon_3.bed  ~/work/Alternative/data/Ga_genome/G.arboreum.Chr.v1.0.fasta exon_3.fasta
python ~/scripte/according_CDS_location_find_fasta_mesage.py  exon_5.bed  ~/work/Alternative/data/Ga_genome/G.arboreum.Chr.v1.0.fasta exon_5.fasta
python ~/scripte/according_CDS_location_find_fasta_mesage.py  intron_3.bed  ~/work/Alternative/data/Ga_genome/G.arboreum.Chr.v1.0.fasta intron_3.fasta
python ~/scripte/according_CDS_location_find_fasta_mesage.py  intron_5.bed  ~/work/Alternative/data/Ga_genome/G.arboreum.Chr.v1.0.fasta intron_5.fasta
done
for i in 1
do
python ~/scripte/extract_CG_count.py  exon_3.fasta exon_3_CG_count.txt
python ~/scripte/extract_CG_count.py  exon_5.fasta exon_5_CG_count.txt
python ~/scripte/extract_CG_count.py  intron_3.fasta intron_3_CG_count.txt
python ~/scripte/extract_CG_count.py  intron_5.fasta intron_5_CG_count.txt
done
## 统计每个区域甲基化的碱基数
for i in 1
do
~/software/bedtools2-2.29.0/bin/intersectBed  -a exon_3.bed  -b ~/work/Alternative/data/Ga_genome/test/CpG_context_D4.bed -loj|awk '$6~/\./{print $5"\t"0}$6~/[^\.]/{print $5"\t"1}'|awk '{a[$1]+=$2}END{for(i in a)print i"\t"a[i]}' >exon_3Meth_count.txt
~/software/bedtools2-2.29.0/bin/intersectBed  -a exon_5.bed  -b ~/work/Alternative/data/Ga_genome/test/CpG_context_D4.bed -loj|awk '$6~/\./{print $5"\t"0}$6~/[^\.]/{print $5"\t"1}'|awk '{a[$1]+=$2}END{for(i in a)print i"\t"a[i]}' >exon_5Meth_count.txt
~/software/bedtools2-2.29.0/bin/intersectBed  -a intron_3.bed  -b ~/work/Alternative/data/Ga_genome/test/CpG_context_D4.bed -loj|awk '$6~/\./{print $5"\t"0}$6~/[^\.]/{print $5"\t"1}'|awk '{a[$1]+=$2}END{for(i in a)print i"\t"a[i]}' >intron_3Meth_count.txt
~/software/bedtools2-2.29.0/bin/intersectBed  -a intron_5.bed  -b ~/work/Alternative/data/Ga_genome/test/CpG_context_D4.bed -loj|awk '$6~/\./{print $5"\t"0}$6~/[^\.]/{print $5"\t"1}'|awk '{a[$1]+=$2}END{for(i in a)print i"\t"a[i]}' >intron_5Meth_count.txt
done

## 构造ggplot绘图的数据
for i in 1
do
awk '{print "ConExon\t""exon_3\t"$2}' exon_3_CG_count.txt  >conExon_CGcount_end
awk '{print "ConExon\t""exon_5\t"$2}' exon_5_CG_count.txt  >>conExon_CGcount_end
awk '{print "ConExon\t""intron_3\t"$2}' intron_3_CG_count.txt  >>conExon_CGcount_end
awk '{print "ConExon\t""intron_5\t"$2}' intron_5_CG_count.txt  >>conExon_CGcount_end
awk '{print "ConExon\t""exon_3\t"$2}' exon_3Meth_count.txt >conExon_MetCount_end
awk '{print "ConExon\t""exon_5\t"$2}' exon_5Meth_count.txt >>conExon_MetCount_end
awk '{print "ConExon\t""intron_5\t"$2}' intron_5Meth_count.txt  >>conExon_MetCount_end
awk '{print "ConExon\t""intron_3\t"$2}' intron_3Meth_count.txt  >>conExon_MetCount_end
done
```

感觉失败了

![constitutive exon与alteractivate exon比较](https://43423.oss-cn-beijing.aliyuncs.com/img/20191120191202.png)

### 把四个坐标中甲基化位点为0的exon去除试试

```bash
#剔除在 exon_5|3 和intron3|5都没有甲基化的exon
cat exon_3Meth_count.txt exon_5Meth_count.txt intron_3Meth_count.txt intron_5Meth_count.txt|awk '$2==0{print $0}' |sort |uniq -c|awk '$1==4{print $2"\t"0}' >No_Methlation_exon.txt
for i in 1
do
cat No_Methlation_exon.txt  exon_3Meth_count.txt |sort |uniq -u >exon_3Meth_No0count.txt
cat No_Methlation_exon.txt exon_5Meth_count.txt |sort |uniq -u >exon_5Meth_No0count.txt
cat No_Methlation_exon.txt intron_3Meth_count.txt |sort |uniq -u >intron_3Meth_No0count.txt
cat No_Methlation_exon.txt intron_5Meth_count.txt |sort |uniq -u >intron_5Meth_No0count.txt
done
```

**同样没有发现想要的效果**，比如Alter 的剪切位点中exon和intron之间的甲基化程度相差不大，而Constitutive 剪切位点附近的甲基化程度相差很大

![剪切位点intron与exon甲基化程度比较](https://43423.oss-cn-beijing.aliyuncs.com/img/20191120201952.png)

### 不把intron与exon拆开看了，就分析5'|3’剪切位点附近甲基化程度和CG含量

```bash
## 将同一个外显子附近的甲基化数据合并
cat *3Meth_No0count.txt|awk '{a[$1]+=$2}END{for(i in a) print "ConExon\t""3_splice\t"a[i]}' >3Meth_No0count.txt
cat *5Meth_No0count.txt|awk '{a[$1]+=$2}END{for(i in a) print "ConExon\t""5_splice\t"a[i]}' >5Meth_No0count.txt
```

佛了，想要差异大一些又没差异

![剪切位点附近甲基化程度的差异](https://43423.oss-cn-beijing.aliyuncs.com/img/20191120203815.png)

## 参考

bedtools使用 <https://www.jianshu.com/p/6c3b87301491>

ggplot2 <https://blog.csdn.net/Bone_ACE/article/details/47427453#> 标签修改


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://zpliu.gitbook.io/booknote/ke-bian-jian-qie/shu-ju-chu-li-liu-cheng/01-quan-chang-zhuan-lu-zu-shu-ju-chu-li.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
