02数据处理

统计各个时期的样品数目

0DPA 372份，过滤后370个样本
4DPA 362份，过滤后362个样本
8DPA 373份，过滤后372个样本
12DPA 372份，过滤后371个样本
16DPA 371份，过滤后371份样本
20DPA 370份，过滤后369份样本

非冗余的样本总共378份

6个时期都存在的数据总共340份，

5个时期以上都存在的数据总共371份

4个时期以上都存在的数据总共376份

3个时期以上都存在的数据总共377份

2个时期以上都存在的数据总共378份

最后只使用了4个时期都存在的样本数据共376份，没有使用到的样品有Sample249（S249），Sample232（S232），Sample342（S342）

使用Hisad比对到参考基因组

hisat2 -x /data/cotton/MaojunWang/WMJ_fiberFullPopulationRNAseq/MappingFPKM/Ghir_Genome_Index/Ghirsutum_genome --trim5 2 --trim3 5 --max-intronlen 100000 -1 /data/cotton/MaojunWang/Fiber_RNASeq_Ghir_Population/0DPA/${ID}_0DPA_R1.fq.gz -2 /data/cotton/MaojunWang/Fiber_RN
ASeq_Ghir_Population/0DPA/${ID}_0DPA_R2.fq.gz --fr -S ./Bamfiles/${ID}_mapGhir.sam -p 10 >./errlogs/log.${ID} 2>./errlogs/err.${ID}

01进行有参考转录本的组装和基因表达水平的鉴定

Gh_gff3='/data/cotton/zhenpingliu/genome/genome_data/Ghirsutum_genome_HAU_v1.1/Ghirsutum_gene_model.gff3'
Stage=(0DPA 12DPA 16DPA 20DPA 4DPA 8DPA)
inputDir='/data/cotton/MaojunWang/WMJ_fiberFullPopulationRNAseq/MappingFPKM/'
outDir='/data/cotton/zhenpingliu/LZP_fiberFullPopulationRNAseq/02sQTL'

for k in ${Stage[@]}; do
  for sample in $(cat ${outDir}/${k}/samples_id.txt); do
    sampleId=$(echo ${sample} | sed 's/_.*//g')
    mkdir -p ${outDir}/${k}/${sampleId}
    stringtie ${inputDir}/${k}/Bamfiles/${sample} --fr -G ${Gh_gff3} -A ${outDir}/${k}/${sampleId}/${sampleId}_gene_FPKM.txt -p 5 -B -o ${outDir}/${k}/${sampleId}/${sampleId}_assembled.gtf
  done
done

stringtie参数

版本: v2.1.4

-G参考基因组注释信息
-A基因表达量文件
-B输出Ballgown table 文件
-o组装后的GTF文件输出目录
-p进程数
--fr进行链特异性建库

02进行转录本的merge

使用bash for循环获取所有时期的组装转录本，输入到一个文件中

##组装转录本
stringtie --merge -G ${Gh_gff3}   ./mergeGTF_List  -o merge.gtf

03将组装后的转录本与参考转录本进行比较`gffcompare`

软件地址 http://ccb.jhu.edu/software/stringtie/dl/gffcompare-0.12.1.Linux_x86_64.tar.gz
官方文档 http://ccb.jhu.edu/software/stringtie/gffcompare.shtml#gffcompare_dl

##使用gffcompare进行比较
gffcompare  -R -r /data/cotton/zhenpingliu/genome/genome_data/Ghirsutum_genome_HAU_v1.1/Ghirsutum_gene_model.gff3  -i query_gff.txt
##查看软件的输出情况
awk -F "\t" '$3~/tran/{print $9}' gffcmp.annotated.gtf |awk -F ";" '{print $(NF-2)"\t"$(NF-1)}'|sed 's/tss_id.*//g'|sort|uniq -c

=剪切位点完全相同
j多个外显子的转ll录本，其中至少一个外显子的剪切位点与参考基因组一致
u基因间区的转录本（可能是转座子）

04过滤新的转录本

novel junction在一个样本内至少需要有10条read spanning read
转录本在至少一个样本内要达到基因总表达量的10%，表达量使用TPM进行量化

过滤在所有样本中都没有剪接位点支持read数目小于10个read的剪接位点

##使用merge后的注释文件进行基因表达量和转录本表达水平的计算

##合并所有样本的junction出read比对情况

##合并所有转录本的表达情况
##提取novel transcript

##提取novel junction

在stringtie生成的i_data_ctab文件

i_data.ctab: intron- (i.e., junction-) level expression measurements. One row per intron. Columns are i_id (numeric intron id), chr, strand, start, end (genomic location of the intron), and the following expression measurements for each sample:

rcount: number of reads supporting the intron
ucount: number of uniquely mapped reads supporting the intron
mrcount: multi-map-corrected number of reads supporting the intron

PreviousHISAT2比对 Next资源

Last updated 4 years ago

Was this helpful?