差异表达基因

生物学上不同样本之间的表达差异时服从负二项分布的，RNA-seq中得到的基因表达水平是抽样过程中的一种离散形式。在测得的reads总量一定的情况下，表达水平越高的基因在抽样过程中所占的比例就越高，有些低表达的基因也有可能无法被检测出来。在得到基因的表达量之后，根据实验设计对不同样本之间基因进行差异表达分析

同物种、不同组织间的比较
同一物种、同一组织、在不同处理下的比较
同一组织、不同物种间的比较
同一组织在不同时期间的比较

通过差异表达分析，发现组织特异性、时期特异性、物种特异性的基因表达模式。通过GO功能富集、KEGG分析发现基因在细胞中参与的代谢和具体的功能、基因与基因之间的互作等。

1.reads计数

使用python包HTseq对统计每个基因比对到的read数

1.1软件安装

非root用户需要使用--user参数

pip3 install HTSeq --user

1.2统计基因比对上的read数

 htseq-count  -f bam -r pos -t exon -i gene_id  -m union -q  1_1_5_rmdup.bam genome.gtf >count.txt

命令参数如下：

-f | --format设置输入文件格式，默认sam
-r | --order 设置输入文件排序方式，默认按照read name排序
-s | --stranded是否链特异性建库，默认yes
-a | --a设置质量阀值，默认忽略比对质量低于10的read
-t | --type对gtf或者gff文件中指定feature计算，默认exon
-i | --idattr设置feature id，通常是指第9列中，多个exon共有的gene属性如gene_id
-m | --mode default: union设置统计模式
-o | --samout输出一个sam文件，比对结果中多一个XF标签比对到的feature id。
-q | --quiet 不输出程序运行的状态信息和警告信息
-h | --help输出帮助信息。

1.3输出结果

Ghir_A01G000010    11
Ghir_A01G000020    10
Ghir_A01G000030    20
Ghir_A01G000040    139
Ghir_A01G000050    9
Ghir_A01G000060    52
Ghir_A01G000070    68
__no_feature    8716242
__ambiguous    157518
__too_low_aQual    0
__not_aligned    0
__alignment_not_unique    0

1.4批量提交任务

for i in `ls `
do
bsub -J htseqCount -q "smp" -n 1 -R span[hosts=1] -e htseqCount.err -o htseqCount.out "bash htseqcount.sh ${i}"
sleep 1
done

2.样品无重复

使用DESeq包，对于技术重复作者推荐将两个技术重复的read进行加和后作为样本的read数

For technical replicates (e. g. when the same library preparation was distributed over multiple lanes of the sequencer), please sum up their counts to get a single column, corresponding to a unique biological replicate.

2.1读取原始read数据

其中行名为基因名，列名为样本名

    untreated3 untreated4 treated2 treated3
FBgn0000003 0 0 0 1
FBgn0000008 76 70 88 70
FBgn0000014 0 0 0 0
FBgn0000015 1 2 0 0
FBgn0000017 3564 3150 3072 3334
FBgn0000018 245 310 299 308

2.2补充样品分组信息

第一列与第二列属于untreated处理的两个重复

condition = factor( c( "untreated", "untreated", "treated", "treated" ) )

2.3将分组信息与read表进行合并

> library( "DESeq" )
> cds = newCountDataSet( countTable, condition )

2.4对不同处理进行标准化

通过estimateSizeFactors( cds )函数来计算不同处理间测序深度是否存在较大的差异

cds=estimateSizeFactors( cds )
counts(cds, normalized = TRUE)

2.5估计离散度

这里由于没有重复需要使用method= "blind", sharingMode = "fit-only"参数

cds <- estimateDispersions(cds, method = "blind", sharingMode = "fit-only")

2.6差异分析

后两个参数是指定需要比较的样品

 res = nbinomTest( cds, "untreated", "treated" )
 ## 绘制数据的分布情况
 plotMA(res)

2.7输出结果

id feature identifier baseMean mean normalised counts, averaged over all samples from both conditions +
baseMeanA mean normalised counts from condition A
baseMeanBmean normalised counts from condition B foldChange
fold change from condition A to B
log2FoldChange the logarithm (to basis 2) of the fold change
pvalp value for the statistical significance of this change
padjp value adjusted for multiple testing with the Benjamini-Hochberg procedure (see the R function p.adjust), which controls false discovery rate (FDR)

id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
1 FBgn0000003 0.224 0.00 0.449 Inf Inf 1.000 1.000
2 FBgn0000008 76.296 78.16 74.436 0.952 -0.0704 0.835 1.000
3 FBgn0000014 0.000 0.00 0.000 NaN NaN NA NA
4 FBgn0000015 0.781 1.56 0.000 0.000 -Inf 0.416 1.000
5 FBgn0000017 3298.682 3599.47 2997.890 0.833 -0.2638 0.241 0.881
6 FBgn0000018 289.031 293.68 284.385 0.968 -0.0464 0.757 1.000

2.8筛选差异表达基因

没有重复的样根据p-value来筛选差异表达的基因意义不大，所以直接对输出的结果用awk进行筛选。筛选的时候有三种情况

两个样都有read比对上
两个样中有一个样是没有read比对上，这种情况会使的log2foldcahnge为inf
两个样中比对到的read都为0

awk -F "\t" 'NR>=2&&$6!="Inf"&&$6!="NA"&&$6>=1{print $1"\tup"}NR>=2&&$6!="Inf"&&$6!="NA"&&$6<=-1{print $1"\tdown"}'

## 第三种情况相当于没有差异表达，不用考虑

3.样品有重复

推荐使用DESeq2包，涉及到的主要函数

DESeqDataSet创建数据集，用于准备输入数据
DESeq 进行差异分析
results 生成result rable
vst 应用方差分析对数据进行降维，例如PCA分析

3.1读取read count 文件并合并

filePath='/public/home/zpliu/lib_family/saolei/F20FTSCCWLJ4479_LUDnktE/upload/Filter_SOAPnuke/04DifferentExpress'
samples=read.table(paste(filePath,'1-1-1_count.txt',sep="/"))
names(samples)=c('id','1-1-1')

for ( i in c('1-1-2','J-1-1','J-1-2')){
    tmp=paste(i,'_count.txt',sep="")
    expression=read.table(paste(filePath,tmp,sep="/"))
    names(expression)=c('id',i)
    samples=merge(samples,expression,by='id')
}
##过滤以Ghir开头的gene
filter_gene=filter(samples,str_detect(id,pattern = '^Gh'))
##构造表达 matrix
gene_expression=as.matrix(filter_gene[,-1])
row.names(gene_expression)=filter_gene[,1]

3.2构建数据集

cond为分类因子
cnts为基因对应的read数目

##指定分组信息
cond <- factor(rep(1:2, each=2))
dds=DESeqDataSetFromMatrix(gene_expression,DataFrame(cond),~cond)
# 指定哪一组作为control
dds$cond <- relevel(dds$cond, ref = "2")

3.3进行差异表达分析

##进行差异表达分析
dds <- DESeq(dds)
res <- results(dds)
##将结果存进文件中
res$geneId=rownames(res)
res=res[c('geneId','baseMean','log2FoldChange','lfcSE','stat','pvalue','padj')]
write.table(res,file = "test",quote = FALSE,sep="\t",row.names = F)

参考

HTSeq仓库 https://github.com/simon-anders/htseq
HTSeq使用文档 https://htseq.readthedocs.io/en/release_0.11.1/
HTSeq使用文档 https://www.cnblogs.com/triple-y/p/9338890.html
DESeq使用文档 https://bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf

Previous02脚本批量提交 Next文献

Last updated 4 years ago

Was this helpful?