🤩 scRNA-seq | 吐血整理的单细胞入门教程(差异分析)(十三)

🤩 scRNA-seq | 吐血整理的单细胞入门教程(差异分析)(十三)

1写在前面完成了聚类后,我们就要进行差异分析,寻找差异基因了。🥳

由于scRNAseq是高维数据,而且并没有明确的组,你可以选择之前介绍的SC3包等,先进行聚类,然后确定了组后,进行比较,或者采用生物学分组进行比较。😘

本期我们介绍一下常用的一些差异分析方法,再比较各种方法的准确性。🤒

2用到的包代码语言:javascript代码运行次数:0运行复制rm(list = ls())

library(scRNA.seq.funcs)

library(edgeR)

#library(monocle)

library(MAST)

library(ROCR)

3示例数据1️⃣ 由于数据量比较大,这里我们只比较一下NA19101和NA19239的差异基因。🤨

代码语言:javascript代码运行次数:0运行复制molecules <- read.table("./molecules.txt", sep = "\t")

anno <- read.table("./annotation.txt", sep = "\t", header = TRUE)

keep <- anno[,1] == "NA19101" | anno[,1] == "NA19239"

data <- molecules[,keep]

group <- anno[keep,1]

batch <- anno[keep,4]

# 过滤一下

gkeep <- rowSums(data > 0) > 5;

counts <- data[gkeep,]

# Library size normalization

lib_size = colSums(counts)

norm <- t(t(counts)/lib_size * median(lib_size))

# Variant of CPM for datasets with library sizes of fewer than 1 mil molecules

norm

2️⃣ 这里我们再提供一个事先计算好的差异基因数据,方便后面对不同方法的评估,TPs为真实的差异基因,TNs为真实的无差异基因。

代码语言:javascript代码运行次数:0运行复制DE <- read.table("./TPs.txt")

notDE <- read.table("./TNs.txt")

GroundTruth <- list(

DE = as.character(unlist(DE)),

notDE = as.character(unlist(notDE))

)

4Kolmogorov-Smirnov Test4.1 差异分析我们先介绍一种非参数检验的方法,即Kolmogorov-Smirnov test (KS-test)。😉

代码语言:javascript代码运行次数:0运行复制pVals <- apply(

norm, 1, function(x) {

ks.test(

x[group == "NA19101"],

x[group == "NA19239"]

)$p.value

}

)

# 校正

pVals <- p.adjust(pVals, method = "fdr")

4.2 准确性评估这里我们使用KS-test的方法得到了5095个差异基因。🤯

代码语言:javascript代码运行次数:0运行复制sigDE <- names(pVals)[pVals < 0.05]

length(sigDE)

5095

接着我们可以看下有多少个真的差异基因在这个KS-test里,也就是真阳性,792个。🫠

代码语言:javascript代码运行次数:0运行复制sum(GroundTruth$DE %in% sigDE)

792

我们再看一下无差异基因,也就是假阳性,3190个。😤

代码语言:javascript代码运行次数:0运行复制sum(GroundTruth$notDE %in% sigDE)

3190

4.3 准确性可视化我们用利用ROC曲线进行一下可视化。

我们先自定义一个函数,称为DE_Quality_AUC,后面就不用总是计算了。😉

代码语言:javascript代码运行次数:0运行复制DE_Quality_AUC <- function(pVals) {

pVals <- pVals[names(pVals) %in% GroundTruth$DE |

names(pVals) %in% GroundTruth$notDE]

truth <- rep(1, times = length(pVals));

truth[names(pVals) %in% GroundTruth$DE] = 0;

pred <- ROCR::prediction(pVals, truth)

perf <- ROCR::performance(pred, "tpr", "fpr")

ROCR::plot(perf)

aucObj <- ROCR::performance(pred, "auc")

return(aucObj@y.values[[1]])

}

代码语言:javascript代码运行次数:0运行复制DE_Quality_AUC(pVals)

0.7954796

AUC = 0.7954796

5Wilcox/Mann-Whitney-U TestWilcox-rank-sum test也是一种非参数检验,看一下它的表现吧。

代码语言:javascript代码运行次数:0运行复制pVals <- apply(

norm, 1, function(x) {

wilcox.test(

x[group == "NA19101"],

x[group == "NA19239"]

)$p.value

}

)

# multiple testing correction

pVals <- p.adjust(pVals, method = "fdr")

DE_Quality_AUC(pVals)

0.8320326

AUC = 0.8320326,表现要比KS-test好一些。

6edgeRedgeR基于广义线性模型(generalized linear model, glm)进行差异基因分析,而且还可以在模型中纳入多种影响因素,比如batch等等。

代码语言:javascript代码运行次数:0运行复制dge <- DGEList(

counts = counts,

norm.factors = rep(1, length(counts[1,])),

group = group

)

group_edgeR <- factor(group)

design <- model.matrix(~ group_edgeR)

dge <- estimateDisp(dge, design = design, trend.method = "none")

fit <- glmFit(dge, design)

res <- glmLRT(fit)

pVals <- res$table[,4]

names(pVals) <- rownames(res$table)

pVals <- p.adjust(pVals, method = "fdr")

DE_Quality_AUC(pVals)

0.8466764

AUC = 0.8466764,表现更好一些了。🤩

7MAST代码语言:javascript代码运行次数:0运行复制log_counts <- log(counts + 1) / log(2)

fData <- data.frame(names = rownames(log_counts))

rownames(fData) <- rownames(log_counts);

cData <- data.frame(cond = group)

rownames(cData) <- colnames(log_counts)

obj <- FromMatrix(as.matrix(log_counts), cData, fData)

colData(obj)$cngeneson <- scale(colSums(assay(obj) > 0))

cond <- factor(colData(obj)$cond)

# Model expression as function of condition & number of detected genes

zlmCond <- zlm(~ cond + cngeneson, obj)

summaryCond <- summary(zlmCond, doLRT = "condNA19239")

summaryDt <- summaryCond$datatable

summaryDt <- as.data.frame(summaryDt)

pVals <- unlist(summaryDt[summaryDt$component == "H",4]) # H = hurdle model

names(pVals) <- unlist(summaryDt[summaryDt$component == "H",1])

pVals <- p.adjust(pVals, method = "fdr")

DE_Quality_AUC(pVals)

0.8284046

AUC = 0.8284046。

8其他方法这里补充一下其他方法,用的比较少,主要因为太慢了,单次运行一般都在1h以上,这里只提供一下代码,有愿意试一下的小伙伴可以试一试。

8.1 BPSC代码语言:javascript代码运行次数:0运行复制library(BPSC)

bpsc_data <- norm[,batch=="NA19101.r1" | batch=="NA19239.r1"]

bpsc_group = group[batch=="NA19101.r1" | batch=="NA19239.r1"]

control_cells <- which(bpsc_group == "NA19101")

design <- model.matrix(~bpsc_group)

coef=2 # group label

res=BPglm(data=bpsc_data, controlIds=control_cells, design=design, coef=coef,

estIntPar=FALSE, useParallel = FALSE)

pVals = res$PVAL

pVals <- p.adjust(pVals, method = "fdr")

DE_Quality_AUC(pVals)

8.2 SCDE代码语言:javascript代码运行次数:0运行复制library(scde)

cnts <- apply(

counts,

2,

function(x) {

storage.mode(x) <- 'integer'

return(x)

}

)

names(group) <- 1:length(group)

colnames(cnts) <- 1:length(group)

o.ifm <- scde::scde.error.models(

counts = cnts,

groups = group,

n.cores = 1,

threshold.segmentation = TRUE,

save.crossfit.plots = FALSE,

save.model.plots = FALSE,

verbose = 0,

min.size.entries = 2

)

priors <- scde::scde.expression.prior(

models = o.ifm,

counts = cnts,

length.out = 400,

show.plot = FALSE

)

resSCDE <- scde::scde.expression.difference(

o.ifm,

cnts,

priors,

groups = group,

n.randomizations = 100,

n.cores = 1,

verbose = 0

)

# Convert Z-scores into 2-tailed p-values

pVals <- pnorm(abs(resSCDE$cZ), lower.tail = FALSE) * 2

DE_Quality_AUC(pVals)

最后祝大家早日不卷!~

相关推荐

中国军事装备的这八个“第一” 真提气
365网站世界杯怎么进

中国军事装备的这八个“第一” 真提气

📅 07-09 👁️ 9235
职业选择指南:如何找到适合自己的工作?职业选择技巧揭秘,你准备好了吗?
五个关键词,解读阿里大文娱焕新虎鲸文娱
365bet官方

五个关键词,解读阿里大文娱焕新虎鲸文娱

📅 08-06 👁️ 268
絮娘(古风,NPH)
365bet官方

絮娘(古风,NPH)

📅 07-20 👁️ 5405
人民日报整版观察:加强网络空间法治建设
365bet官方

人民日报整版观察:加强网络空间法治建设

📅 07-28 👁️ 3525
“靝”怎么读?这个生僻字有何含义?
365bet客服电话多少

“靝”怎么读?这个生僻字有何含义?

📅 06-27 👁️ 5327