评估细胞因子活性、免疫细胞极化和细胞间通讯的利器：IREA 分析（二）

生信菜鸟团

发布于 2024-07-10 16:45:30

49000

代码可运行

文章被收录于专栏：生信菜鸟团生信菜鸟团

运行总次数：0

代码可运行

之前简略介绍了一下IREA 分析评估细胞因子活性、免疫细胞极化和细胞间通讯的利器：IREA 分析，作者将IREA做成了可视化的网页，但是这个网页又不是那么丝滑，所以我在想，能不能根据作者提供的方法，通过R来实现更快捷的分析呢——

我们主要关注原文方法学的部分：

IREA for cytokine response analysis

User data are then compared with the transcriptional cytokine responses of the same cell type from the Immune Dictionary using the following methods:

For the gene expression matrix input, the expression matrices are first normalized such that the total expression per cell sums to 10,000 units.
the expression is then log-transformed.
Genes giving significant contribution to the enrichment score, with the default being those having an average of more than 0.25 expression values, were included.
Next, the projection score is calculated by finding the cosine similarity score between user input and cytokine-treated or PBS-treated cells.
Statistical significance is assessed using a two-sided Wilcoxon rank-sum test between projection scores on cytokine-treated cells and projection scores on PBS-treated cells, and an FDR correction is applied to all cytokine calculations.
The effect size is the mean difference between projection scores on cytokine-treated cells and on PBS-treated cells.
The effect size and FDR-adjusted P value for each of the 86 cytokines can then be visualized using a compass plot.

IREA polarization analysis implements the same statistical test as the IREA cytokine response analysis.

也就是说，从1-5是二者通用的~

In IREA polarization, user data are compared with the polarization state gene expression profiles.
A polarized radar plot is shown if at least one cellular polarization state is significantly enriched (FDR-adjusted P < 0.05).
If no state is significantly enriched, the radar plot shows an enrichment score of 0 for every state, which signifies that the input cells are unpolarized.
The enrichment score is normalized to be between 0 and 1 on the radar plot.

借助于ChatGPT，我来尝试画一下极化的雷达图看看，用的数据来自于➡慢性病毒性肝炎（二）中性粒细胞亚群细分策略

导入我的数据

现在我有一个seurat对象

load("./check-by-celltype/after_harmony_neu.Rdata")

# 假设数据已加载到Seurat对象中
seurat_obj <- seuratObj # 替换为实际的Seurat对象

# 选择细胞类型和条件
Neu_idx <- WhichCells(seurat_obj, expression = celltype == "Neutrophil" & treatment %in% c("Pre", "Post"))
tmp <- subset(seurat_obj, cells = Neu_idx)

# 确保 `tmp` 对象中有 `treatment` 列并将其设置为身份列
Idents(tmp) <- tmp@meta.data$treatment

# 假设已经有一个Seurat对象 seurat_obj
# 进行标准化，每个细胞的总表达标准化为10,000单位
seurat_obj <- NormalizeData(tmp, normalization.method = "LogNormalize", scale.factor = 10000)

这一步是1和2⬆

# 计算每个基因在所有细胞中的平均表达量
gene_means <- rowMeans(seurat_obj[["RNA"]]$data)

# 选择平均表达量大于0.25的基因
selected_genes <- names(gene_means[gene_means > 0.25])

# 创建一个包含这些基因的子集
seurat_obj_subset <- subset(seurat_obj, features = selected_genes)
expsubdat <- as.matrix(seurat_obj_subset[["RNA"]]$data)
range(expsubdat)

3⬆

polarization state gene expression profiles

接下来，我们需要这个数据——polarization state gene expression profiles

library(readxl)
library(purrr)
library(dplyr)
library(tibble)

polarization_profiles <- read_xlsx("./S7.xlsx",sheet = "Neutrophil")
##见原文补充材料7

# 确保所有基因名称都是大写
rownames(expsubdat) <- toupper(rownames(expsubdat))
polarization_profiles <- polarization_profiles %>%
  mutate(Gene = toupper(Gene))

# 确保 polarization_profiles 中有所有的极化状态
table(polarization_profiles$Polarization)
polarization_profiles <- polarization_profiles[,c("Polarization","Avg_log2FC","Gene")]


# 构建 polarization_profiles_list，确保包含所有极化状态的基因列表
polarization_profiles_list <- polarization_profiles %>%
  as.data.frame() %>% split(.,.$Polarization)

# 确认 polarization_profiles_list 的结构
str(polarization_profiles_list)

计算极化分数

接下来是score的计算，我们首先需要一个余弦函数

calculate_cosine_similarity <- function(matrix1, matrix2) {
  # 获取matrix2中极化状态的基因和表达量
  genes <- matrix2$Gene
  expression_values <- matrix2$Avg_log2FC
  
  # 找到公共基因
  common_genes <- intersect(rownames(matrix1), genes)
  if (length(common_genes) == 0) {
    return(NA)
  }
  
  # 从matrix1和matrix2中提取公共基因的表达值
  matrix1_values <- matrix1[common_genes, , drop = FALSE]
  matrix2_values <- expression_values[match(common_genes, genes)]
  
  # 计算余弦相似度
  cosine_similarity <- apply(matrix1_values, 2, function(x) {
    sum(x * matrix2_values) / (sqrt(sum(x^2)) * sqrt(sum(matrix2_values^2)))
  })
  
  return(cosine_similarity)
}

然后——

calculate_polarization_scores <- function(expression_matrix, polarization_profiles_list) {
  scores <- matrix(NA, nrow = ncol(expression_matrix), ncol = length(polarization_profiles_list))
  colnames(scores) <- names(polarization_profiles_list)
  
  for (i in 1:length(polarization_profiles_list)) {
    polarization_state <- names(polarization_profiles_list)[i]
    polarization_profile <- polarization_profiles_list[[i]]
    
    scores[, i] <- calculate_cosine_similarity(expression_matrix, polarization_profile)
    rownames(scores) <- colnames(expression_matrix)
  }
  
  return(scores)
}

算一下看看呢~

# 假设 expsubdat 已经存在，并且 polarization_profiles_list 也已经正确创建
# 计算极化得分
scores <- calculate_polarization_scores(expsubdat, polarization_profiles_list)

# 检查得分
print(scores[1:10, ])

                     Neu-a     Neu-c     Neu-d Neu-e
15-095_pre1_A5   0.4307252 0.6190559 0.4512313     1
15-095_post1_A7  0.4350306 0.5422810 0.5065828     1
15-095_post1_A10 0.4713442 0.5769063 0.5037743   NaN
15-095_pre1_A11  0.3120063 0.4965895 0.3102601     1
15-095_pre1_B1   0.3180166 0.4782357 0.3095804   NaN
15-095_pre1_B11  0.4033197 0.5944403 0.4553527   NaN
15-095_pre1_C4   0.3627308 0.4395395 0.3468910     1
15-095_pre1_C6   0.3777392 0.5549111 0.4941141     1
15-095_pre1_C7   0.4707846 0.5457673 0.5085159     1
15-095_pre1_C11  0.4553427 0.5526657 0.4951689     1

two-sided Wilcoxon rank-sum test

# 进行统计检验和FDR校正
perform_statistical_tests <- function(scores, group_labels) {
  p_values <- sapply(1:ncol(scores), function(i) {
    post_treated <- scores[, i][group_labels == "Post"]
    pre_treated <- scores[, i][group_labels == "Pre"]
    
    # 检查非缺失值的数量是否足够
    if (sum(!is.na(post_treated)) < 2 || sum(!is.na(pre_treated)) < 2) {
      return(NA)
    } else {
      return(wilcox.test(post_treated, pre_treated)$p.value)
    }
  })
  
  # 处理 NA 值以避免 p.adjust 出错
  p_values[is.na(p_values)] <- 1
  fdr_p_values <- p.adjust(p_values, method = "fdr")
  
  return(fdr_p_values)
}

进行统计检验和FDR校正

# 进行统计检验和FDR校正
sample_info <- seurat_obj_subset@meta.data
phe <- sample_info[,"treatment",drop=FALSE]
# 确保 sample_info 的顺序与 scores 的样本顺序一致
group_labels <- phe$treatment[match(rownames(scores), rownames(phe))]

fdr_p_values <- perform_statistical_tests(scores, group_labels)
print(fdr_p_values)

# 6.443092e-16 1.781591e-01 3.368075e-03 1.000000e+00

雷达图

generate_radar_plot <- function(scores, fdr_p_values) {
  # 加载所需的包
  library(dplyr)
  library(tidyr)
  library(fmsb)
  
  # 创建雷达图数据框
  polarization_df <- data.frame(
    State = colnames(scores),
    Score = colMeans(scores, na.rm = TRUE),
    FDR = fdr_p_values
  )
  
  # 标准化得分并处理 NA
  polarization_df <- polarization_df %>%
    mutate(Score = ifelse(FDR < 0.05, Score, 0)) %>%
    select(State, Score) %>%
    pivot_wider(names_from = State, values_from = Score) %>%
    as.data.frame()
  
  # 添加标准化行
  radar_data <- rbind(rep(1, ncol(polarization_df)), rep(0, ncol(polarization_df)), polarization_df)
  
  # 自定义轴标签
  custom_labels <- c("0.0", "0.5", "1.0")
  
  # 绘制雷达图，设置轴标签间距为0、0.5、1.0
  radarchart(radar_data, axistype=1,
             seg=2, # 细分段数，设为2，即将区间分为3段
             pcol=rgb(0.2,0.5,0.5,0.9), # 线的颜色
             pfcol=rgb(0.2,0.5,0.5,0.5), # 填充颜色
             plwd=2, # 线宽
             cglcol="grey", # 网格线颜色
             cglty=1, # 网格线类型
             axislabcol="grey", # 轴标签颜色
             caxislabels=custom_labels, # 轴标签
             cglwd=0.8, # 网格线宽
             vlcex=0.8) # 变量标签大小
}

# 示例用法
generate_radar_plot(scores, fdr_p_values)

要说明的是，这只是小编的一个探索性尝试，不保证是正确的，因为对作者描述的方法还没有完全理解（因此也恳请大家指正）：

存疑的地方——

作者在这里提到enrichment但没有讲用何种方法来达到富集的目的，因此我只能从IREA for cytokine response analysis这里借鉴方法【IREA 极化分析执行与 IREA 细胞因子反应分析相同的统计测试】

在IREA for cytokine response analysis的末尾：这种方法【the cosine similarity score？？】考虑到了每个基因的表达方向和幅度。也就是说，在用户数据集和免疫字典参考数据集中都强烈上调的基因，会被赋予较高的权重，从而增加富集的总体可能性；在一个数据集中强烈上调而在另一个数据集中没有强烈上调的基因，会被赋予较低的权重；在一个数据集中上调而在另一个数据集中下调的基因，会被赋予负权重，从而降低富集的总体可能性。

基于以上，我是根据余弦相似分数来得到极化分数的，【enrichment 这一步被我吃掉了。。。】非常恳切地欢迎大家留言给我，指出问题，一起进步~

真的觉得IREA这个东西对于研究炎症或者发育分化还是很有帮助的，因为免疫细胞在发育、分化和成熟的过程中，与细胞因子的调控紧密相关。祈祷IREA作者放一个更友好的使用渠道，R包之类的，让普罗大众更好地利用这个工具~(●ˇ∀ˇ●)

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2024-07-02，如有侵权请联系 cloudcommunity@tencent.com 删除

expression