这个新专辑有以下几点希冀:
而我在将自己的学习笔记排版成推文时也会遵循以下行文特点:
这期学习这篇推文:多分组单细胞转录组测序样本第一层次未整合和整合数据的B细胞细分对比
一开始这篇推文的学习我是很有想法的,因为曾老师给我分享了一篇投稿,投稿中使用根据病人进行批次拆分单独处理后通过anchor进行integrate(CCA, 区别于直接merge)达到去除批次效应的目的,而不是像我们之前那样直接harmony
我打算拿这篇推文数据来进行研究:拆分批次单独处理后通过anchor进行integrate(CCA)和harmony的效果有什么区别
但随着研究的进行,我发现其实这个数据集其实并不需要去除批次效应,所以我们还是像原推文那样研究“多分组单细胞转录组测序样本第一层次未整合和整合数据的B细胞细分对比”,学习一下这个拆分、merge的操作
某个数据集需不需要去批次,什么时候去批次,去批次的影响,可以参考上一期推文:harmony、不harmony,这是个问题
不同sampletype看似存在差异,有免疫细胞、非免疫细胞、外周血白细胞,但实验设计批次上还是根据病人来的,几乎每个病人都有这三种sampletype,病人没批次效应,sampletype之间的差异应该就是生物学差异
所以这里我们并不根据病人批次走harmony,以免抹除真正的差异
原推文使用的是整理好的数据,我们这里从头开始
多分组单细胞转录组测序样本第一层次未整合和整合数据的B细胞细分对比
GSE164690 GEO Accession viewer
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE164690
头颈部鳞状细胞癌(HNSCC)
免疫细胞(CD45+ );非免疫细胞(CD45-);外周血白细胞(PBL)
51个样本,18例 treatment-naive patients (6 HPV+ and 12HPV– HNSCC 癌症病人),其中15例 CD45+, CD45-, PBL 为配对数据。
头颈部鳞状细胞癌(HNSCC)的特征是肿瘤微环境(TME)中基质细胞、上皮细胞和免疫细胞之间的复杂关系。为了开发更有效的治疗方法,我们旨在使用单细胞RNA测序(scRNAseq)研究6例人乳头瘤病毒(HPV)+和12例HPV-HNSCC患者肿瘤和匹配的外周血样本中抑制性非免疫和免疫细胞群的异质性、独特细胞群的特征以及细胞间相互作用。使用134606个细胞的数据集,我们显示了与炎症和HPV状态相关的细胞类型特异性特征,描述了在HPV+TME中具有弹性分化的成纤维细胞的负面预后价值,预测了治疗靶向的检查点受体-配体相互作用,并显示肿瘤相关巨噬细胞是TME中PD-L1和其他免疫检查点配体的主要贡献者。我们对形成HNSCC微环境的细胞内在机制和细胞间通讯提出了全面的单细胞观点。
对GSE164690数据集分别进行未整合和整合数据分析。 多分组未整合数据:CD45+ ,CD45-,PBL三组数据未整合分别进行降维分群,等进行B细胞细分的时候再merge到一块(第一层次分析数据由曾老师提供,在此感谢)。 多分组整合数据:CD45+ ,CD45-,PBL三组数据从一开始整合进行第一层次降维分群,再进行B细胞细分(该数据由齐兵提供,在此感谢)。 对曾老师的数据进行处理:首先进行了第一次B细胞细分,去除干扰亚群,而后又进行第二次B细胞细分(分辨率选用的0.8)。 齐兵的数据选用的分辨率也是0.8,其去除干扰亚群后没有再进行细分。 之后就直接用两组已对细胞亚群进行生物学命名的数据来进行对比。
拿到ftp下载地址
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE164nnn/GSE164690/suppl/GSE164690_RAW.tar
参考:
初探单细胞下游 可以记住这个10X技术输出文件目录和格式,以后使用
Read10X
函数读取Seurat对象时可以检查一下
使用Read10X函数读取,整理文件路径:
#!/bin/bash
# 遍历当前目录下的所有.gz文件
for file in *.gz
do
# 提取文件名中的编号部分
filename=$(basename "$file" .gz)
gsm_number=$(echo "$filename" | awk -F "_" '{print $1}')
# 创建对应的文件夹(如果不存在)
mkdir -p "$gsm_number"
# 移动文件到对应的文件夹中
mv "$file" "$gsm_number"
done
将所有文件夹下这三个文件前缀全部去除:
#!/bin/bash
# 遍历文件夹
for folder in GSM*/
do
# 进入文件夹
cd "$folder"
# 对三个文件进行重命名
for file in *_barcodes.tsv.gz
do
mv "$file" "barcodes.tsv.gz"
done
for file in *_features.tsv.gz
do
mv "$file" "features.tsv.gz"
done
for file in *_matrix.mtx.gz
do
mv "$file" "matrix.mtx.gz"
done
# 返回上一级目录
cd ..
done
获取sampletype:
$less GSE164690_series_matrix.txt.gz |grep "Sample_title"| awk -F "\"" '{ for (i=2; i<=NF; i+=2) printf("%s,", $i) }' | sed 's/,$/\n/' > sample_types.txt
library(Seurat)
library(tidyverse)
dir="./GSE164690_RAW/"
samples=list.files(dir)
samples %>% head()
sceList <- lapply(samples, function(pro){
print(pro)
sce <- CreateSeuratObject(counts = Read10X(file.path(dir,pro)),
project = gsub("^GSM[0-9]*_","",pro),
min.cells = 5,
min.features = 500)
return(sce)
})
names(sceList)
发现GEO提供的一个样本的barcodes文件受损
去除该受损样本文件,继续:
rm(list = ls())
library(Seurat)
library(tidyverse)
dir="./GSE164690_RAW/"
samples=list.files(dir)
samples %>% head()
GSM5017045样本barcode文件受损,剔除 HN10_PBL cells
# > which(samples=="GSM5017045")
# [1] 25
samples <- samples[-which(samples=="GSM5017045")]
18例 treatment-naive patients (6 HPV+ and 12HPV– HNSCC 癌症病人) 免疫细胞(CD45+ );非免疫细胞(CD45-);外周血白细胞(PBL)
sceList <- lapply(samples, function(pro){
print(pro)
sce <- CreateSeuratObject(counts = Read10X(file.path(dir,pro)),
project = gsub("^GSM[0-9]*_","",pro),
min.cells = 5,
min.features = 500)
return(sce)
})
names(sceList)
一开始就merge的
sce.all=merge(x=sceList[[1]],
y=sceList[ -1 ],
add.cell.ids = gsub('^GSM[0-9]*_','',samples))
sce.all.bp <- sce.all
本文根据CD45+ ,CD45-,PBL三组sample_type数据拆分而不是参考文(拆分批次单独处理后通过anchor进行integrate)中的每个病人批次
按照sampletype拆分:
sample_types <- read.table("sample_types.txt",header = FALSE,sep = ",",quote = "TRUE")
sampletype <- t(sample_types[1,][-25])
sampletype <- ifelse(grepl("CD45\\+",sampletype),"CD45+",
ifelse(grepl("CD45\\-",sampletype),"CD45-","PBL"))
sampletype
table(sampletype)
# CD45- CD45+ PBL
# 15 18 17
sampletype_df <- data.frame(sample_GSM = levels(sce.all@active.ident),
sampletype = sampletype)
# sce.all <- sce.all.bp
sce.all@meta.data$sampletype <- "NA"
for (i in 1:nrow(sampletype_df)){
sce.all@meta.data[which(sce.all@meta.data$orig.ident==sampletype_df$sample_GSM[i]), "sampletype"] <- sampletype_df$sampletype[i]
}
table(sce.all$sampletype)
sce.list <- SplitObject(sce.all, split.by = "sampletype")
若根据病人拆分:
patient_df <- data.frame(sample_GSM = levels(sce.all@active.ident),
patient = c(rep("01",3),rep(c("02","03","04"),each=2),
rep(c("05","06","07","08","09","10","11","12","13","14","15","16","17","18"),each=3))[-25]
)
sce.all@meta.data$patientid <- "NA"
for (i in 1:nrow(patient_df)){
sce.all@meta.data[which(sce.all@meta.data$orig.ident==patient_df$sample_GSM[i]), "patientid"] <- patient_df$patient[i]
}
table(sce.all$patientid)
sce.patient_list <- SplitObject(sce.all, split.by = "patientid")
sce.all@meta.data$project <- "HNSCC"
sce.all <- SetIdent(sce.all, value = "project")
## 计算细胞中线粒体基因比例
sce.all[["percent.mt"]] <- PercentageFeatureSet(sce.all, pattern = "^MT-")
## 使用小提琴图可视化QC指标
VlnPlot(sce.all, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3, pt.size = 0)
# 点太多挡住小提琴图 pt.size 0
## FeatureScatter于可视化特征-特征关系
plot1 <- FeatureScatter(sce.all, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(sce.all, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
## 过滤
## 官方推荐过滤掉独特特征计数超过2500或小于200的细胞,或者过滤掉超过5%线粒体基因比例的细胞
sce.all.fit <- subset(sce.all, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
sce.all.fit
sce.all
sce.all.fit <- NormalizeData(sce.all.fit, normalization.method = "LogNormalize", scale.factor = 1e4)
sce.all.fit <- FindVariableFeatures(sce.all.fit, selection.method = "vst", nfeatures = 2000) #返回两千个高变基因
## 提取前10的高变基因
top10 <- head(VariableFeatures(sce.all.fit), 10)
top10
## 展示高变基因
plot1 <- VariableFeaturePlot(sce.all.fit)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2
#######数据归一化##########
sce.all.fit <- ScaleData(sce.all.fit, vars.to.regress = "percent.mt")
这一步比较消耗计算资源,几次在共享服务器上做喜提warning邮件
sce.all.fit <- RunPCA(sce.all.fit, features = VariableFeatures(object = sce.all.fit)) ##默认会输出5个主成分
sce.all.fit <- FindNeighbors(sce.all.fit, dims = 1:10)
sce.all.fit <- FindClusters(sce.all.fit, resolution = 0.8) # 和原推文一致
## 查看前5分析细胞聚类数ID
head(Idents(sce.all.fit), 5)
## 查看每个类别多少个细胞
table(sce.all.fit@meta.data$seurat_clusters)
sce.all.fit <- RunUMAP(sce.all.fit, dims = 1:10) # Which dimensions to use as input features, used only if features is NULL
p1 <- DimPlot(sce.all.fit, reduction = "umap", label = T, label.size = 5)
sce.all.fit <- RunTSNE(sce.all.fit, dims = 1:10)
p2 <- DimPlot(sce.all.fit, reduction = "tsne", label = T, label.size = 5)
p1+p2
查看marker表达情况
参考:
多分组单细胞转录组测序样本第一层次未整合和整合数据的B细胞细分对比
仅仅是区分了 B细胞和 Plasma细胞。 现在我们就从文献《Single-cell landscape and clinical outcomes of infiltrating B cells in colorectal cancer》来说一下B细胞的细分亚群。它是一个单细胞数据挖掘文章,主要是关心结直肠癌的肿瘤免疫微环境里面的B细胞,两个单细胞数据集:
首先 B cells (form marker gene: CD79A) 可以细分成为:
然后第一个CD20+ B cells 又是可以细分成为:
而第二个 CD138+ plasma cells 可以细分成为:
DotPlot(sce.all.fit, features = c("CD3D","CD3E","CD8A", # Tcells
"CD19","CD79A","MS4A1", # Bcells
"IGHG1","MZB1","SDC1", # Plasma cells
"CD68","CD163","CD14", # Monocytes and macrophages
"FGFBP2","FCG3RA","CX3CR1", # NK cells
"EPCAM" # epi or tumor
))+RotatedAxis()
T 0,1,2,3,5,7,8,10,19
B 6, 22
Plasma 12, 15, 20,21,23
Mono 4,11,17,18
NK 9
Unknown 13, 14, 16, 24
celltype=data.frame(ClusterID=0:24 ,
celltype= 0:24)
#定义细胞亚群
celltype[celltype$ClusterID %in% c( 0,1,2,3,5,7,8,10,19 ),2]='Tcells'
celltype[celltype$ClusterID %in% c( 6, 22 ),2]='Bcells'
celltype[celltype$ClusterID %in% c( 12, 15, 20,21,23 ),2]='plasma'
celltype[celltype$ClusterID %in% c( 4,11,17,18 ),2]='Mono'
celltype[celltype$ClusterID %in% c( 9 ),2]='NK'
celltype[celltype$ClusterID %in% c( 13, 14, 16, 24 ),2]='Unknown'
head(celltype)
celltype
table(celltype$celltype)
sce.all.fit@meta.data$celltype = "NA"
for(i in 1:nrow(celltype)){
sce.all.fit@meta.data[which(sce.all.fit@meta.data$RNA_snn_res.0.8 == celltype$ClusterID[i]),'celltype'] <- celltype$celltype[i]}
Idents(sce.all.fit)=sce.all.fit$celltype
sel.clust = "celltype"
sce.all.fit <- SetIdent(sce.all.fit, value = sel.clust)
table(sce.all.fit@active.ident)
gplots::balloonplot(table(sce.all.fit$RNA_snn_res.0.8,sce.all.fit$celltype))
DimPlot(sce.all.fit, reduction = "umap", label = TRUE, pt.size = 0.5)+NoLegend()
# 提取指定Bcells
sce.all.fit
sce.Bcells=sce.all.fit[,sce.all.fit$celltype=='Bcells']
# 重新降维分组
sce.Bcells=CreateSeuratObject(counts = sce.Bcells@assays$RNA@counts,
meta.data = sce.Bcells@meta.data[,1:7])
sce.Bcells <- NormalizeData(sce.Bcells, normalization.method = "LogNormalize",
scale.factor = 1e4)
GetAssay(sce.Bcells,assay = "RNA")
sce.Bcells <- FindVariableFeatures(sce.Bcells,
selection.method = "vst", nfeatures = 2000)
sce.Bcells <- ScaleData(sce.Bcells)
sce.Bcells <- RunPCA(object = sce.Bcells, pc.genes = VariableFeatures(sce.Bcells))
DimHeatmap(sce.Bcells, dims = 1:12, cells = 100, balanced = TRUE)
ElbowPlot(sce.Bcells)
sce.Bcells <- FindNeighbors(sce.Bcells, dims = 1:15) # 输入维度
sce.Bcells <- FindClusters(sce.Bcells, resolution = 0.8)
table(sce.Bcells@meta.data$RNA_snn_res.0.8)
set.seed(123)
sce.Bcells <- RunUMAP(object = sce.Bcells, dims = 1:15, do.fast = TRUE)
DimPlot(sce.Bcells,reduction = "umap",label=T)
sce.Bcells <- SetIdent(sce.Bcells, value = "seurat_clusters")
DotPlot(sce.Bcells, features = c("IGHD","FCER2","TCL1A","IL4R", # naive
"CD27",
"AIM2","TNFRSF13B", # memory
"S1PI2","LRMP","SUGCT","MME","MKI67","AICDA", # GC
"IGHG1", # IgG plasma cells
"IGHA1" # IgA plasma
))
不光看marker表达情况,还看降维可视化(plasma和memory分不开):
naive 1,2,9
memory 10,11,14,16
GC 7,13,15,17
IgG plasma、IgA plasma 感觉分不开,先看plasma,再往下看
plasma CD27除memory 0,3,4,5,6,8,12
IgG plasma、IgA plasma联系降维可视化也分不开 算了
sce.Bcells.bp <- sce.Bcells
celltype=data.frame(ClusterID=0:17 ,
celltype= 0:17)
#定义细胞亚群
celltype[celltype$ClusterID %in% c( 1,2,9 ),2]='naive'
celltype[celltype$ClusterID %in% c( 10,11,14,16 ),2]='memory'
celltype[celltype$ClusterID %in% c( 0,3,4,5,6,8,12 ),2]='plasma'
celltype[celltype$ClusterID %in% c( 7,13,15,17 ),2]='GC'
head(celltype)
celltype
table(celltype$celltype)
sce.Bcells@meta.data$celltype = "NA"
for(i in 1:nrow(celltype)){
sce.Bcells@meta.data[which(sce.Bcells@meta.data$RNA_snn_res.0.8 == celltype$ClusterID[i]),'celltype'] <- celltype$celltype[i]}
Idents(sce.Bcells)=sce.Bcells$celltype
sel.clust = "celltype"
sce.Bcells <- SetIdent(sce.Bcells, value = sel.clust)
table(sce.Bcells@active.ident)
gplots::balloonplot(table(sce.Bcells$RNA_snn_res.0.8,sce.Bcells$celltype))
DimPlot(sce.Bcells, reduction = "umap", label = TRUE, pt.size = 0.5)+NoLegend()
前面的流程整合到函数里:
pipeline <- function(sce.obj){
sce.obj
sce.obj@meta.data$project <- "HNSCC"
sce.obj <- SetIdent(sce.obj, value = "project")
## 计算细胞中线粒体基因比例
sce.obj[["percent.mt"]] <- PercentageFeatureSet(sce.obj, pattern = "^MT-")
## 过滤
## 官方推荐过滤掉独特特征计数超过2500或小于200的细胞,或者过滤掉超过5%线粒体基因比例的细胞
sce.obj.fit <- subset(sce.obj, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
## 数据标准化
sce.obj.fit <- NormalizeData(sce.obj.fit, normalization.method = "LogNormalize", scale.factor = 1e4)
## 识别高变基因
sce.obj.fit <- FindVariableFeatures(sce.obj.fit, selection.method = "vst", nfeatures = 2000) #返回两千个高变基因
#######数据归一化##########
sce.obj.fit <- ScaleData(sce.obj.fit, vars.to.regress = "percent.mt")
## 降维
sce.obj.fit <- RunPCA(sce.obj.fit, features = VariableFeatures(object = sce.obj.fit)) ##默认会输出5个主成分
sce.obj.fit <- FindNeighbors(sce.obj.fit, dims = 1:10)
sce.obj.fit <- FindClusters(sce.obj.fit, resolution = 0.8) # 和原推文一致
## UMAP/tSNE可视化
sce.obj.fit <- RunUMAP(sce.obj.fit, dims = 1:10) # Which dimensions to use as input features, used only if features is NULL
p1 <- DimPlot(sce.obj.fit, reduction = "umap", label = T, label.size = 5)
sce.obj.fit <- RunTSNE(sce.obj.fit, dims = 1:10)
p2 <- DimPlot(sce.obj.fit, reduction = "tsne", label = T, label.size = 5)
p1+p2
return(sce.obj.fit)
}
clustered.sce.list <- lapply(sce.list, pipeline)
定义函数:
dimplot <- function(sce.obj.fit){
sce.obj.fit
## UMAP/tSNE可视化
# sce.obj.fit <- RunUMAP(sce.obj.fit, dims = 1:10) # Which dimensions to use as input features, used only if features is NULL
p1 <- DimPlot(sce.obj.fit, reduction = "umap", label = T, label.size = 5)
# sce.obj.fit <- RunTSNE(sce.obj.fit, dims = 1:10)
p2 <- DimPlot(sce.obj.fit, reduction = "tsne", label = T, label.size = 5)
return(p1+p2)
}
lapply(clustered.sce.list, dimplot)
dotplot <- function(sce.obj){
dp <- DotPlot(sce.obj, features = c("CD3D","CD3E","CD8A", # Tcells
"CD19","CD79A","MS4A1", # Bcells
"IGHG1","MZB1","SDC1", # Plasma cells
"CD68","CD163","CD14", # Monocytes and macrophages
"FGFBP2","FCG3RA","CX3CR1", # NK cells
"EPCAM" # epi or tumor
))+RotatedAxis()
return(dp)
}
lapply(clustered.sce.list, dotplot)
细胞亚型注释函数
celltype_anno <- function(celltype, sce.obj){
sce.obj@meta.data$celltype = "NA"
for(i in 1:nrow(celltype)){
sce.obj@meta.data[which(sce.obj@meta.data$RNA_snn_res.0.8 == celltype$ClusterID[i]),'celltype'] <- celltype$celltype[i]}
Idents(sce.obj)=sce.obj$celltype
sel.clust = "celltype"
sce.obj <- SetIdent(sce.obj, value = sel.clust)
table(sce.obj@active.ident)
balloonPlot <- gplots::balloonplot(table(sce.obj$RNA_snn_res.0.8,sce.obj$celltype))
dimplot <- DimPlot(sce.obj, reduction = "umap", label = TRUE, pt.size = 0.5)+NoLegend()
balloonPlot
dimplot
}
开始注释:
lapply(clustered.sce.list, dimplot)
lapply(clustered.sce.list, dotplot)
PBL
# sce.PBL <- clustered.sce.list$PBL
celltype=data.frame(ClusterID=0:16 ,
celltype= 0:16)
#定义细胞亚群
celltype[celltype$ClusterID %in% c( 1,2,4,5,7,10,12 ),2]='Tcells'
celltype[celltype$ClusterID %in% c( 9 ),2]='Bcells'
celltype[celltype$ClusterID %in% c( 15,16 ),2]='plasma'
celltype[celltype$ClusterID %in% c( 0,8,11,13,14 ),2]='Mono'
celltype[celltype$ClusterID %in% c( 3,6 ),2]='NK'
head(celltype)
celltype
table(celltype$celltype)
celltype_anno(celltype,clustered.sce.list$PBL)
T 1,2,4,5,7,10,12
B 9
Plasma 15,16
Mono 0,8,11,13,14
NK 3,6
Unknown
CD45+
celltype=data.frame(ClusterID=0:19 ,
celltype= 0:19)
#定义细胞亚群
celltype[celltype$ClusterID %in% c( 0,1,2,3,5,6,8,9,10,11 ),2]='Tcells'
celltype[celltype$ClusterID %in% c( 4,18),2]='Bcells'
celltype[celltype$ClusterID %in% c( 14,15 ),2]='plasma'
celltype[celltype$ClusterID %in% c( 7,12,16,17,19 ),2]='Mono'
celltype[celltype$ClusterID %in% c( 13 ),2]='NK'
head(celltype)
celltype
table(celltype$celltype)
celltype_anno(celltype,clustered.sce.list$`CD45+`)
T 0,1,2,3,5,6,8,9,10,11
B 4,18
Plasma 14,15
Mono 7,12,16,17,19
NK 13
Unknown
CD45-
celltype=data.frame(ClusterID=0:19 ,
celltype= 0:19)
#定义细胞亚群
celltype[celltype$ClusterID %in% c(4,5,6,7 ),2]='Tcells'
celltype[celltype$ClusterID %in% c( 8),2]='Bcells'
celltype[celltype$ClusterID %in% c( 1,2,10,11,12,14,16 ),2]='plasma'
celltype[celltype$ClusterID %in% c(9,17,18),2]='Mono'
celltype[celltype$ClusterID %in% c( 3,15,19 ),2]='Unknown'
head(celltype)
celltype
table(celltype$celltype)
celltype_anno(celltype,clustered.sce.list$`CD45-`)
T 4,5,6,7
B 8
Plasma 1,2,10,11,12,14,16
Mono 9,17,18
NK
Unknown 3,15,19
个人感觉:拆分后初步判断亚型比整合一起好区分,Unknown少
sce.Bcells=sce.all.fit[,sce.all.fit$celltype=='Bcells']
sce.Bcells1 <- clustered.sce.list$PBL[,clustered.sce.list$PBL$seurat_clusters == 9]
sce.Bcells2 <- clustered.sce.list$`CD45+`[,clustered.sce.list$`CD45+`$seurat_clusters %in% c(4,18)]
sce.Bcells3 <- clustered.sce.list$`CD45-`[,clustered.sce.list$`CD45-`$seurat_clusters == 8]
sce.Bcells.all <- merge(x=sce.Bcells1,
y=list(sce.Bcells2,sce.Bcells3),
# add.cell.ids = gsub('^GSM[0-9]*_','',samples),
add.cell.ids = c("sce.Bcells1", "sce.Bcells2", "sce.Bcells3"))
sce.Bcells.all=CreateSeuratObject(counts = sce.Bcells.all@assays$RNA@counts,
meta.data = sce.Bcells@meta.data[,1:7])
这一步merge时如果不add.cell.ids会报错:Error in merge.Seurat(x = sce.Bcells1, y = list(sce.Bcells2, sce.Bcells3), : Please provide a cell identifier for each object provided to merge
此外,还需要注意:
参考
所以需要把counts提出来再重新创建一个Seurat对象:
sce.Bcells.all=CreateSeuratObject(counts = sce.Bcells.all@assays$RNA@counts,
meta.data = sce.Bcells@meta.data[,1:7])
sce.Bcells.all <- NormalizeData(sce.Bcells.all, normalization.method = "LogNormalize",
scale.factor = 1e4)
GetAssay(sce.Bcells.all,assay = "RNA")
sce.Bcells.all <- FindVariableFeatures(sce.Bcells.all,
selection.method = "vst", nfeatures = 2000)
sce.Bcells.all <- ScaleData(sce.Bcells.all)
sce.Bcells.all <- RunPCA(object = sce.Bcells.all, pc.genes = VariableFeatures(sce.Bcells.all))
DimHeatmap(sce.Bcells.all, dims = 1:12, cells = 100, balanced = TRUE)
ElbowPlot(sce.Bcells.all)
sce.Bcells.all <- FindNeighbors(sce.Bcells.all, dims = 1:15) # 输入维度
sce.Bcells.all <- FindClusters(sce.Bcells.all, resolution = 0.8)
table(sce.Bcells.all@meta.data$RNA_snn_res.0.8)
set.seed(123)
sce.Bcells.all <- RunUMAP(object = sce.Bcells.all, dims = 1:15, do.fast = TRUE)
DimPlot(sce.Bcells.all,reduction = "umap",label=T)
sce.Bcells.all <- SetIdent(sce.Bcells.all, value = "seurat_clusters")
DotPlot(sce.Bcells.all, features = c("IGHD","FCER2","TCL1A","IL4R", # naive
"CD27",
"AIM2","TNFRSF13B", # memory
"S1PI2","LRMP","SUGCT","MME","MKI67","AICDA", # GC
"IGHG1", # IgG plasma cells
"IGHA1" # IgA plasma
))
不光看marker表达情况,还看降维可视化(plasma和memory分不开):
naive 1,2,8
memory 0,3,4,5,6,10,
GC 9,12,14,17
plasma CD27除memory 7,11,13,15,16,18
(7因为15很像特别是纠结的CD27所以给plasma)
sce.Bcells.all.bp <- sce.Bcells.all
celltype=data.frame(ClusterID=0:18 ,
celltype= 0:18)
#定义细胞亚群
celltype[celltype$ClusterID %in% c( 1,2,8 ),2]='naive'
celltype[celltype$ClusterID %in% c( 0,3,4,5,6,10 ),2]='memory'
celltype[celltype$ClusterID %in% c( 7,11,13,15,16,18 ),2]='plasma'
celltype[celltype$ClusterID %in% c( 9,12,14,17 ),2]='GC'
head(celltype)
celltype
table(celltype$celltype)
sce.Bcells.all@meta.data$celltype = "NA"
for(i in 1:nrow(celltype)){
sce.Bcells.all@meta.data[which(sce.Bcells.all@meta.data$RNA_snn_res.0.8 == celltype$ClusterID[i]),'celltype'] <- celltype$celltype[i]}
Idents(sce.Bcells.all)=sce.Bcells.all$celltype
sel.clust = "celltype"
sce.Bcells.all <- SetIdent(sce.Bcells.all, value = sel.clust)
table(sce.Bcells.all@active.ident)
gplots::balloonplot(table(sce.Bcells.all$RNA_snn_res.0.8,sce.Bcells.all$celltype))
DimPlot(sce.Bcells.all, reduction = "umap", label = TRUE, pt.size = 0.5)+NoLegend()
sce.Bcells.all$celltype %>% head()
sce.Bcells$celltype %>% head()
修改names
ls <- sce.Bcells.all$celltype %>% names() %>% str_split_fixed("_",3)
tmp <- sce.Bcells.all$celltype
names(tmp) <- paste(ls[,2],ls[,3],sep = "_")
tmp %>% head()
sce.Bcells$celltype %>% head()
keep <- intersect(names(tmp), names(sce.Bcells$celltype))
gplots::balloonplot(table(tmp[keep], sce.Bcells$celltype[keep]))
可以发现我这里plasma和memory在拆分前后存在非常大变化,基本上就是exchange了。。。其它的还行
这跟我选择的marker和自定义分组也有关,这两个在亚型定义的时候就不是很好区分(我的技术也不好,肉眼看这个我目前感觉有点反人类,后面我了解到一些辅助确定亚群名称的方法,如AUcell、MACA、scGate【flag】)
原推文小韩师姐的结果就没这么明显的exchange:
因此,来回答开头提出的问题,从该组数据对比来看,多分组单细胞测序数据第一层次未整合和整合分析对B细胞细分的分群基本无影响。
这里因为原推文相当于就拿两种方法处理好的结果可视化看看列联表,具体参数和使用marker未知。