
继续上回的内容[[108-R可视化32-通过seurat包中的LabelClusters学习ggplot之一]]。
准备工作做好了,如何实现这样的label 操作呢?

老规矩,先康康代码:
labels.loc <- lapply(X = groups, FUN = function(group) {
data.use <- data[data[, id] == group, , drop = FALSE]
data.medians <- if (!is.null(x = split.by)) {
do.call(what = "rbind", args = lapply(X = unique(x = data.use[,
split.by]), FUN = function(split) {
medians <- apply(X = data.use[data.use[, split.by] ==
split, xynames, drop = FALSE], MARGIN = 2,
FUN = median, na.rm = TRUE)
medians <- as.data.frame(x = t(x = medians))
medians[, split.by] <- split
return(medians)
}))
}
else {
as.data.frame(x = t(x = apply(X = data.use[, xynames,
drop = FALSE], MARGIN = 2, FUN = median, na.rm = TRUE)))
}
data.medians[, id] <- group
data.medians$color <- data.use$color[1]
return(data.medians)
})
这个groups 也就是前面获得的关于分组的标记:
groups <- clusters %||% as.character(x = na.omit(object = unique(x = data[,
id])))
接下来传入这个分组标记给循环,先对数据框取子集:
groups <- NULL %||% as.character(x = na.omit(object = unique(x = totalVI_df[,
+ "leiden_0.05"])))
> groups
[1] "0" "1" "2" "3" "4" "6" "7" "5" "8"
data.use <- totalVI_df[totalVI_df[, "leiden_0.05"] == groups[1], , drop = FALSE]
需要注意的是,这里的语法限制了传入的group 列必须得是factor 类型(强制转型成字符串进行判断);以及数据框需要是dataframe 类型(x[,'y']取子集操作对于table 类型数据框并不会转型成向量)。这里也算是seurat 的缺陷了。
默认下的split 分图为NULL,也就是否的情况。
if (!is.null(x = split.by))
那我们就先看看else 的处理:
as.data.frame(x = t(x = apply(X = data.use[, xynames,
drop = FALSE], MARGIN = 2, FUN = median, na.rm = TRUE)))
}
data.medians[, id] <- group
data.medians$color <- data.use$color[1]
非常简单,也就是计算一下x,y 横纵坐标批量分组下的中位数:
> as.data.frame(x = t(x = apply(X = data.use[, c("Umap1","Umap2"),
+ drop = FALSE], MARGIN = 2, FUN = median, na.rm = TRUE)))
Umap1 Umap2
1 11.678 13.30376
其实这个也就是seurat 对数据 label 的关键了—— 计算每个分组cluster 的统计数据,比如中位数,来对这个位置进行计算。
那么,如果是存在split 的情况呢:
do.call(what = "rbind", args = lapply(X = unique(x = data.use[,
split.by]), FUN = function(split) {
medians <- apply(X = data.use[data.use[, split.by] ==
split, xynames, drop = FALSE], MARGIN = 2,
FUN = median, na.rm = TRUE)
medians <- as.data.frame(x = t(x = medians))
medians[, split.by] <- split
return(medians)
}))
do.call 我已经在之前不记得的某期介绍过了,其可以接受函数,批量对列表的子元素进行操作。其实你可以理解成split 就是对group 的额外分组,那自然也是额外再分组再分别进行一层统计数目计算。
就不展开了。
最后合体,再把group 信息与color 信息添加:
> id <- "leiden_0.05"
> data.medians$color <- data.use$color[1]
> data.medians[, id] <- "1"
> data.medians
Umap1 Umap2 color leiden_0.05
1 11.678 13.30376 #F8766D 1
上面的例子,我们展示了seurat 按照中位数标记,那么如果是最近距离(nearest)呢?
这里通过上面的函数,我们批量得到了不同分组的median 数值:
id <- "leiden_0.05"
labels.loc <- lapply(groups, function(group){
data.use <- data2[data2[, id] == group, , drop = FALSE]
data.medians <- as.data.frame(x = t(x = apply(X = data.use[, c("Umap1","Umap2"),
drop = FALSE], MARGIN = 2, FUN = median, na.rm = TRUE)))
data.medians[, id] <- group
data.medians$color <- data.use$color[1]
return(data.medians)
})
labels.loc <- do.call("rbind", labels.loc)
> head(labels.loc)
Umap1 Umap2 leiden_0.05 color
1 11.67799700 13.303761 0 #F8766D
2 0.08273468 5.411001 1 #D39200
3 5.51503000 1.334583 2 #93AA00
4 11.08995500 5.150919 3 #00BA38
5 13.01985300 -1.858218 4 #00C19F
6 3.69386960 16.643590 6 #00B9E3
我们首先调用函数绘图对比一下:
library(Seurat)
LabelClusters(p, "leiden_0.05", repel = T, size = 3)
LabelClusters(p, position = "nearest", "leiden_0.05", repel = T, size = 3)


惊了,竟然没有什么明显的区别,看看代码:
if (position == "nearest") {
labels.loc <- lapply(X = labels.loc, FUN = function(x) {
group.data <- data[as.character(x = data[, id]) ==
as.character(x[3]), ]
nearest.point <- nn2(data = group.data[, 1:2], query = as.matrix(x = x[c(1,
2)]), k = 1)$nn.idx
x[1:2] <- group.data[nearest.point, 1:2]
return(x)
})
}
其首先又是从原数据框中提取了遍历分组的内容,接下来一步很有趣:
nearest.point <- nn2(data = group.data[, 1:2], query = as.matrix(x = x[c(1,
2)]), k = 1)$nn.idx
x[1:2] <- group.data[nearest.point, 1:2]
这个nn2 函数时干嘛的呢?
> RANN::nn2
function (data, query = data, k = min(10, nrow(data)), treetype = c("kd",
"bd"), searchtype = c("standard", "priority", "radius"),
radius = 0, eps = 0)
{
dimension <- ncol(data)
if (is.null(dimension))
dimension = 1L
query_dimension <- ncol(query)
if (is.null(query_dimension))
query_dimension = 1L
ND <- nrow(data)
if (is.null(ND))
ND = length(data)
NQ <- nrow(query)
if (is.null(NQ))
NQ = length(query)
if (dimension != query_dimension)
stop("Query and data tables must have same dimensions")
if (k > ND)
stop("Cannot find more nearest neighbours than there are points")
searchtypeInt = pmatch(searchtype[1], c("standard", "priority",
"radius"))
if (is.na(searchtypeInt))
stop(paste("Unknown search type", searchtype))
treetype = match.arg(treetype, c("kd", "bd"))
if (is.data.frame(data))
data <- unlist(data, use.names = FALSE)
if (!length(data))
stop("no points in data!")
if (!is.matrix(query))
query <- unlist(query, use.names = FALSE)
if (!length(query))
stop("no points in query!")
results <- .C("get_NN_2Set", as.double(data), as.double(query),
as.integer(dimension), as.integer(ND), as.integer(NQ),
as.integer(k), as.double(eps), as.integer(searchtypeInt),
as.integer(treetype == "bd"), as.double(radius * radius),
nn.idx = integer(k * NQ), nn = double(k * NQ), PACKAGE = "RANN")
nn.indexes = matrix(results$nn.idx, ncol = k, byrow = TRUE)
nn.dist = matrix(results$nn, ncol = k, byrow = TRUE)
return(list(nn.idx = nn.indexes, nn.dists = nn.dist))
}
<bytecode: 0x7fe4c099de48>
<environment: namespace:RANN>
其作用是:
★Uses a kd-tree to find the p number of near neighbours for each point in an input/output dataset. The advantage of the kd-tree is that it runs in O(M log M) time. ”
这个函数的返回值包括:
A `list` of length 2 with elements:
`nn.idx`
A **N** x **k** integer `matrix` returning the near neighbour indices.
`nn.dists`
A **N** x **k** `matrix` returning the near neighbour Euclidean distances.
tmp <- RANN::nn2(data = group.data[, 6:7], query = as.matrix(x = x[c(1,2)]), k = 1)
> tmp$nn.idx
[,1]
[1,] 7201
> tmp$nn.dists
[,1]
[1,] 0.007510741
也就是返回离我们query 的这个median 点最近的点在data 中的下标和距离。
> group.data[nearest.point, 6:7]
Umap1 Umap2
25630 11.67875 13.31123
> x[1:2]
Umap1 Umap2
1 11.678 13.30376
所以最近距离这个选项,有什么特别大的意义呢?
How to place the label if repel = FALSE. If "median", place the label at the median position. If "nearest" place the label at the position of the nearest data point to the median.
ps:这里seurat 对于数据框子集的操作,大部分还是通过索引进行,但显然这样对于代码阅读者来说,并不是非常的友好。你觉得呢?是自己方便,还是通过colname 取子集,便于后期自己或他人理解代码呢?
labels.loc <- do.call(what = "rbind", args = labels.loc)
labels.loc[, id] <- factor(x = labels.loc[, id], levels = levels(data[,
id]))
labels <- labels %||% groups
if (length(x = unique(x = labels.loc[, id])) != length(x = labels)) {
stop("Length of labels (", length(x = labels), ") must be equal to the number of clusters being labeled (",
length(x = labels.loc), ").")
}
names(x = labels) <- groups
for (group in groups) {
labels.loc[labels.loc[, id] == group, id] <- labels[group]
}
简单概括一下:
在ggplot 家族中,我们介绍过两种label 方式:[[66-R可视化10-自由的在ggplot上添加文本(柱状图加计数)]] [[67-R可视化11-用ggrepel更加美观的添加标记(火山图的实现)]]
这里seurat 利用ifelse 非常巧妙的将函数作为输出。
接下来的绘图就非常非常简单了,如果想了解geom_label 与geom_label_repel 等的用法,参考我前面的教程即可:
if (box) {
geom.use <- ifelse(test = repel, yes = geom_label_repel,
no = geom_label)
plot <- plot + geom.use(data = labels.loc, mapping = aes_string(x = xynames["x"],
y = xynames["y"], label = id, fill = id), show.legend = FALSE,
...) + scale_fill_manual(values = labels.loc$color[order(labels.loc[,
id])])
}
else {
geom.use <- ifelse(test = repel, yes = geom_text_repel,
no = geom_text)
plot <- plot + geom.use(data = labels.loc, mapping = aes_string(x = xynames["x"],
y = xynames["y"], label = id), show.legend = FALSE,
...)
}
return(plot)
对了,还记得上次[[108-R可视化32-通过seurat包中的LabelClusters学习ggplot之一]] 最后问的color的用途吗?
scale_fill_manual(values = labels.loc$color[order(labels.loc[,
id])])
这是为了保证,text label 的颜色,与散点图的颜色一样。