既往推文已经介绍过了logistic,cox,lasso回归(https://mp.weixin.qq.com/s/pXRZ1rYUr3lwH5OlDeB0_Q),接下来将重点进行代码的实操。
首先进行logistic模型的实际操练,简单回顾一下二项logistic回归(因为还有多项的hhh),其是指研究二分类结果与一些影响因素之间关系的分析方法。在各种临床/基础数据分析中,经常需要分析疾病/状态与各种影响/危险因素之间的定量关系,如鼻咽癌的发生于EB病毒定量、年龄、不同饮食习惯等因素之间的关系,而结局变量通常是二分类的,因此这种方法是研究者必须学会的方法之一。
rm(list = ls())
load("TCGA-LIHC_sur_model.Rdata")
# check 数据
head(exprSet)[1:4,1:4]
# TCGA-FV-A495-01A TCGA-ED-A7PZ-01A TCGA-ED-A97K-01A TCGA-ED-A7PX-01A
# WASH7P 1.913776 1.2986076 1.967382 1.586170
# AL627309.6 3.129116 0.5606928 3.831265 1.363539
# WASH9P 2.490476 2.8140204 2.960338 2.106464
# MTND1P23 2.773335 3.4114257 2.591028 3.353850
head(meta)
# ID OS OS.time race age gender stage T N M
# TCGA-FV-A495-01A TCGA-FV-A495-01A 0 0.03333333 WHITE 51 FEMALE II 2 NA 0
# TCGA-ED-A7PZ-01A TCGA-ED-A7PZ-01A 0 0.20000000 ASIAN 61 MALE II 2 NA 0
# TCGA-ED-A97K-01A TCGA-ED-A97K-01A 0 0.20000000 ASIAN 54 MALE III 3 0 0
# TCGA-ED-A7PX-01A TCGA-ED-A7PX-01A 0 0.20000000 ASIAN 48 FEMALE II 2 NA 0
# TCGA-BC-A3KF-01A TCGA-BC-A3KF-01A 0 0.26666667 WHITE 66 FEMALE I 1 NA 0
# TCGA-DD-A4NR-01A TCGA-DD-A4NR-01A 1 0.30000000 WHITE 85 FEMALE I 1 0 0
identical(rownames(meta),colnames(exprSet))
# [1] TRUE
str(meta)
# 'data.frame': 365 obs. of 10 variables:
# $ ID : chr "TCGA-FV-A495-01A" "TCGA-ED-A7PZ-01A" "TCGA-ED-A97K-01A" "TCGA-ED-A7PX-01A" ...
# $ OS : int 0 0 0 0 0 1 0 0 0 1 ...
# $ OS.time: num 0.0333 0.2 0.2 0.2 0.2667 ...
# $ race : chr "WHITE" "ASIAN" "ASIAN" "ASIAN" ...
# $ age : int 51 61 54 48 66 85 70 75 84 75 ...
# $ gender : chr "FEMALE" "MALE" "MALE" "FEMALE" ...
# $ stage : chr "II" "II" "III" "II" ...
# $ T : num 2 2 3 2 1 1 1 2 1 2 ...
# $ N : num NA NA 0 NA NA 0 0 NA NA 0 ...
# $ M : num 0 0 0 0 0 0 0 0 0 0 ...
table(meta$race)
# AMERICAN INDIAN OR ALASKA NATIVE ASIAN BLACK OR AFRICAN AMERICAN WHITE
# 1 155 17 182
table(meta$age)
# 16 17 18 20 23 24 25 29 31 32 33 34 35 36 37 38 39 40 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67
# 1 1 1 3 2 2 1 2 1 2 1 1 3 1 1 5 2 2 2 5 3 7 4 3 9 3 8 15 7 7 9 8 8 5 16 14 8 15 8 5 15 11 16 9
# 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 90
# 17 13 10 6 9 10 7 9 7 5 3 2 5 4 1 1 1 2 1
table(meta$gender)
# FEMALE MALE
# 119 246
table(meta$stage)
# I II III IV
# 170 84 83 4
table(meta$T)
# 1 2 3 4
# 180 91 78 13
table(meta$N)
# 0 1
# 248 4
table(meta$M)
# 0 1
# 263 3
采用的是TCGA-LIHC的数据,其中自变量分别为race,age,gender,stage,T,N,M。本次实操暂时只用到meta中的数据。
其中ID,race,gender,stage是字符型数据,OS,age是整数型数据,OS.time,T,N,M是数值型数据。整数型数据和数值型数据的差别就是一种是整数,另一种是可以有小数。
在这个数据中,我们的因变量/结局变量是OS,其中0代表存活,1代表死亡。此外,我们也需要对自变量进行处理。
meta$race <- factor(meta$race,levels = c("ASIAN","AMERICAN INDIAN OR ALASKA NATIVE","BLACK OR AFRICAN AMERICAN","WHITE"))
meta$stage <- factor(meta$stage,levels = c("I","II","III","IV"))
meta$T <- factor(meta$T,levels = c("1","2","3","4"))
meta$age <- ifelse(meta$age > 60,">60","<=60")
meta$age <- factor(meta$age,levels = c("<=60",">60"))
str(meta)
# 'data.frame': 365 obs. of 10 variables:
# $ ID : chr "TCGA-FV-A495-01A" "TCGA-ED-A7PZ-01A" "TCGA-ED-A97K-01A" "TCGA-ED-A7PX-01A" ...
# $ OS : int 0 0 0 0 0 1 0 0 0 1 ...
# $ OS.time: num 0.0333 0.2 0.2 0.2 0.2667 ...
# $ race : Factor w/ 4 levels "ASIAN","AMERICAN INDIAN OR ALASKA NATIVE",..: 4 1 1 1 4 4 1 4 4 4 ...
# $ age : chr "<=60" ">60" "<=60" "<=60" ...
# $ gender : chr "FEMALE" "MALE" "MALE" "FEMALE" ...
# $ stage : Factor w/ 4 levels "I","II","III",..: 2 2 3 2 1 1 1 2 1 2 ...
# $ T : Factor w/ 4 levels "1","2","3","4": 2 2 3 2 1 1 1 2 1 2 ...
# $ N : num NA NA 0 NA NA 0 0 NA NA 0 ...
# $ M : num 0 0 0 0 0 0 0 0 0 0 ...
在R语言中,因子化的影响因素默认是以levels排序第一的值作为参考/哑变量。请注意这里对于有一些参数并没有进行因子化,以及存在NA值,这些情况会在下边的探索中解释。
fit <- glm(OS~race+age+gender+stage+T+N+M,
family=binomial(link = "logit"),data = meta) # link默认使用logit
summary(fit)
# Call:
# glm(formula = OS ~ race + age + gender + stage + T + N + M, family = binomial(link = "logit"),
# data = meta)
#
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.3213 0.3959 -3.337 0.000846 ***
# raceAMERICAN INDIAN OR ALASKA NATIVE -15.3493 2399.5447 -0.006 0.994896
# raceBLACK OR AFRICAN AMERICAN 1.1023 1.0382 1.062 0.288346
# raceWHITE 0.2644 0.3474 0.761 0.446538
# age>60 0.3457 0.3261 1.060 0.289067
# genderMALE -0.3704 0.3362 -1.102 0.270534
# stageII 33.6071 3393.4687 0.010 0.992098
# stageIII 0.3615 3393.4687 0.000 0.999915
# stageIV -32.4250 3393.4687 -0.010 0.992376
# T2 -33.1321 3393.4687 -0.010 0.992210
# T3 0.9805 3393.4687 0.000 0.999769
# T4 1.3966 3393.4688 0.000 0.999672
# N 16.9158 2399.5447 0.007 0.994375
# M 48.8854 3635.5707 0.013 0.989272
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 288.04 on 226 degrees of freedom
# Residual deviance: 252.88 on 213 degrees of freedom
# (138 observations deleted due to missingness)
# AIC: 280.88
#
# Number of Fisher Scoring iterations: 15
● Estimate(估计值):每个自变量的回归系数。它表示自变量每单位变化对因变量的对数几率(log-odds)造成的变化。正值表示增加事件发生的概率,负值表示减少事件发生的概率。
● Std. Error(标准误差):回归系数的标准误,估计值的不确定性度量。标准误差越大,估计值的不确定性越大。
● z value(z值):估计值除以标准误差,用于计算p值。z值越大(绝对值),表明自变量对因变量的影响越显著,Z值的平方就是wald值。
● Pr(>|z|)(p值):z值对应的p值,用于检验系数是否显著。如果p值小于显著性水平(通常为0.05或0.01),则认为该系数显著。
● β值:这里就是Estimate(估计值)。
● **OR(比值比,odds ratio)**:这里没有显示,计算方式是OR = exp(β/Estimate),其取值范围是0至正无穷。同时在分析的时候我们经常会提到HR(风险比,hazard ratio),这个值通常应用于生存分析模型中。
● Null deviance和Residual devianve: 是指无效偏差(零偏差)和残差偏差,前者是指只有截距项(没有任何自变量)时模型的偏差,这个模型假设所有的观测值都预测为因变量的平均值(对于分类问题来说,就是预测为最常见的类别),后者是指包括自变量在内的模型的偏差。它衡量的是该模型相对于最优模型的拟合程度。通过比较 Null deviance 和 Residual deviance,可以评估引入自变量后模型的改进情况。如果 Residual deviance 明显小于 Null deviance,说明自变量在解释因变量方面起到了重要作用,所以这两个值的差异越大越好。
● AIC(Akaike Information Criterion,赤池信息准则) 是用于模型选择的一个统计量。它提供了一种在模型复杂度和拟合优度之间进行权衡的方法。这个值需要在不同模型情况下进行比较,AIC值越低则表示模型拟合越好。
同时也应观察一下自变量的情况,可以看到之前因子化后的自变量会按照levels的顺序进行哑变量设置,比如age中的<=60的信息就没有出现了,stage中的I期的信息也没有出现了。如果没有进行因子化的自变量则会被自动设置成连续变量,比如这里的N和M自变量,因此我们在分析之前必须对自变量进行因子化处理。
此外,这里还有一句提示“138 observations deleted due to missingness”,这是告诉我们有138个NA数据,在运行代码的时候自动去除了。由此可知,二项logistic回归整体分析的时候是可以不处理NA的,当然如果从数据分析的角度来说,可能最好还是需要选择删除或者插补数据之后再进行分析,后面进行多因素logstic分析时则不能存在NA值。
# β/Estimate值
coef(fit) #或者 coefficients(fit)
# (Intercept) raceAMERICAN INDIAN OR ALASKA NATIVE raceBLACK OR AFRICAN AMERICAN
# -1.3212525 -15.3493043 1.1023180
# raceWHITE age>60 genderMALE
# 0.2644191 0.3456869 -0.3704287
# stageII stageIII stageIV
# 33.6070538 0.3614614 -32.4249885
# T2 T3 T4
# -33.1321368 0.9805348 1.3965597
# N M
# 16.9157534 48.8853540
# β值的95%可信区间
# confint(fit)
# 2.5 % 97.5 %
# (Intercept) -2.1263379 -0.5673383
# raceAMERICAN INDIAN OR ALASKA NATIVE NA 473.4443794
# raceBLACK OR AFRICAN AMERICAN -1.0795219 3.2830430
# raceWHITE -0.4223957 0.9442724
# age>60 -0.2935513 0.9890797
# genderMALE -1.0272369 0.1691020
# stageII -498.9459572 NA
# stageIII -84.3987156 77.0091134
# stageIV -1243.3914736 95.6290959
# T2 NA 499.6907267
# T3 -83.7797130 77.6282226
# T4 -75.2487745 86.1520115
# N -471.9392411 NA
# M -83.1626544 1374.4319782
# OR值
exp(coef(fit))
# (Intercept) raceAMERICAN INDIAN OR ALASKA NATIVE raceBLACK OR AFRICAN AMERICAN
# 2.668009e-01 2.157157e-07 3.011138e+00
# raceWHITE age>60 genderMALE
# 1.302674e+00 1.412960e+00 6.904383e-01
# stageII stageIII stageIV
# 3.938747e+14 1.435426e+00 8.279544e-15
# T2 T3 T4
# 4.082214e-15 2.665881e+00 4.041273e+00
# N M
# 2.220334e+07 1.700746e+21
# OR值的95%可信区间
exp(confint(fit))[1:4,1:2]
# 2.5 % 97.5 %
# (Intercept) 0.1192733 5.670327e-01
# raceAMERICAN INDIAN OR ALASKA NATIVE NA 4.114163e+205
# raceBLACK OR AFRICAN AMERICAN 0.3397579 2.665677e+01
# raceWHITE 0.6554746 2.570942e+00
# 其他数据的提取
# wald值
summary(fit)$coefficients[,3]^2
# P值
summary(fit)$coefficients[,4]
# 预测值
fitted(fit)
predict(fit,type = "response")
predict(fit) # 如果不适用type则默认返回logit(P)
# 偏差
deviance(fit)
# 残差自由度
df.residual(fit)
# 所有的数据提取其实是基于summary(fit)中的coefficients表格
OR值的解读,比如关于T2这个数据, 相比于T1,T2的患者出现死亡的风险是4.082214e-15,P值没有统计学意义。
# 单因素logstic数据手动提取
# 构建自定义函数
uni_model<-function(x){
fml<-as.formula(paste0("OS~",x))
fit1<-glm(fml,data = meta,family = binomial)
fit2<-summary(fit1)
#计算所要的指标
OR<-round(exp(coef(fit1)),2)
SE<-round(fit2$coefficients[,2],3)
CI2.5<-round(exp(coef(fit1)-1.96*SE),2)
CI97.5<-round(exp(coef(fit1)+1.96*SE),2)
CI<-paste0(CI2.5,'-',CI97.5)
B<-round(fit2$coefficients[,1],3)
Z<-round(fit2$coefficients[,3],3)
P<-round(fit2$coefficients[,4],3)
#将计算出来的指标制作为数据框
uni_model<-data.frame('characteristics'=x,'B'=B,
'SE'=SE,'OR'=OR,'CI'=CI,
'Z' =Z,'P'=P)[-1,]
return(uni_model)
}
#指定参与分析的自变量
variable<-colnames(meta)[c(4:10)]
variable
#运行上面自定义批量执行函数
uni_res<-lapply(variable,uni_model)
uni_res
# 使用plyr包
# plyr 是一个用于数据操作和转换的 R 包。它提供了一系列函数,用于对数据框、列表和其他数据结构进行分组操作和汇总计算。
library(plyr)
#生成单变量分析的结果
uni_res<-ldply(uni_res,data.frame)
uni_res
# characteristics B SE OR CI Z P
# 1 race -12.641 535.411 0.00 0-Inf -0.024 0.981
# 2 race 0.569 0.524 1.77 0.63-4.93 1.085 0.278
# 3 race 0.570 0.233 1.77 1.12-2.79 2.444 0.015
# 4 age 0.340 0.220 1.41 0.91-2.16 1.547 0.122
# 5 gender -0.442 0.230 0.64 0.41-1.01 -1.924 0.054
# 6 stage 0.281 0.295 1.32 0.74-2.36 0.952 0.341
# 7 stage 1.252 0.282 3.50 2.01-6.08 4.436 0.000
# 8 stage 2.182 1.168 8.86 0.9-87.43 1.868 0.062
# 9 T 0.331 0.280 1.39 0.8-2.41 1.180 0.238
# 10 T 1.246 0.284 3.48 1.99-6.07 4.388 0.000
# 11 T 2.244 0.680 9.43 2.49-35.77 3.301 0.001
# 12 N 0.669 1.009 1.95 0.27-14.11 0.663 0.507
# 13 M 16.305 840.274 12057537.52 0-Inf 0.019 0.985
# autoReg方式
library(autoReg)
autoReg::gaze(OS~race + age + gender + stage + T + N + M,data=meta)
# ————————————————————————————————————————————————————————————————————————————————
# Dependent:OS levels 0 1 p
# (N) (N=234) (N=131)
# ————————————————————————————————————————————————————————————————————————————————
# race ASIAN 111 (48.5%) 44 (34.9%) .078
# AMERICAN INDIAN OR ALASKA NATIVE 1 (0.4%) 0 (0%)
# BLACK OR AFRICAN AMERICAN 10 (4.4%) 7 (5.6%)
# WHITE 107 (46.7%) 75 (59.5%)
# age <=60 118 (50.4%) 55 (42%) .150
# >60 116 (49.6%) 76 (58%)
# gender FEMALE 68 (29.1%) 51 (38.9%) .070
# MALE 166 (70.9%) 80 (61.1%)
# stage I 127 (56.7%) 43 (36.8%) <.001
# II 58 (25.9%) 26 (22.2%)
# III 38 (17%) 45 (38.5%)
# IV 1 (0.4%) 3 (2.6%)
# T 1 133 (57.3%) 47 (36.2%) <.001
# 2 61 (26.3%) 30 (23.1%)
# 3 35 (15.1%) 43 (33.1%)
# 4 3 (1.3%) 10 (7.7%)
# N Mean ± SD 0.0 ± 0.1 0.0 ± 0.2 .544
# M Mean ± SD 0.0 ± 0.0 0.0 ± 0.2 .083
# ————————————————————————————————————————————————————————————————————————————————
autoReg(fit, uni=TRUE, multi = TRUE, final=TRUE)
# ————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————
# Dependent: OS 0 (N=152) 1 (N=75) OR (univariable) OR (multivariable) OR (final)
# ————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————
# race ASIAN 103 (67.8%) 42 (56%)
# AMERICAN INDIAN OR ALASKA NATIVE 1 (0.7%) 0 (0%) 0.00 (0.00-Inf, p=.988) 0.00 (0.00-Inf, p=.992)
# BLACK OR AFRICAN AMERICAN 2 (1.3%) 2 (2.7%) 2.45 (0.33-17.99, p=.378) 3.42 (0.45-25.80, p=.232)
# WHITE 46 (30.3%) 31 (41.3%) 1.65 (0.93-2.95, p=.089) 1.43 (0.75-2.75, p=.279)
# age <=60 88 (57.9%) 37 (49.3%)
# >60 64 (42.1%) 38 (50.7%) 1.41 (0.81-2.46, p=.223)
# gender FEMALE 42 (27.6%) 30 (40%)
# MALE 110 (72.4%) 45 (60%) 0.57 (0.32-1.03, p=.061) 0.70 (0.36-1.36, p=.295) 0.62 (0.34-1.16, p=.134)
# stage I 86 (56.6%) 24 (32%)
# II 35 (23%) 15 (20%) 1.54 (0.72-3.27, p=.266) 5799088.68 (0.00-Inf, p=.992)
# III 30 (19.7%) 33 (44%) 3.94 (2.02-7.70, p<.001) 1.50 (0.06-36.67, p=.805)
# IV 1 (0.7%) 3 (4%) 10.75 (1.07-108.08, p=.044) 3.45 (0.14-85.32, p=.450)
# T 1 87 (57.2%) 25 (33.3%)
# 2 36 (23.7%) 15 (20%) 1.45 (0.69-3.07, p=.331) 0.00 (0.00-Inf, p=.992) 1.46 (0.69-3.09, p=.328)
# 3 26 (17.1%) 28 (37.3%) 3.75 (1.87-7.51, p<.001) 2.58 (0.11-62.85, p=.561) 3.73 (1.86-7.51, p<.001)
# 4 3 (2%) 7 (9.3%) 8.12 (1.96-33.72, p=.004) 4.40 (0.16-124.34, p=.384) 7.20 (1.71-30.30, p=.007)
# N Mean ± SD 0.0 ± 0.1 0.0 ± 0.2 2.05 (0.28-14.88, p=.476)
# M Mean ± SD 0.0 ± 0.0 0.0 ± 0.2 12155382.71 (0.00-Inf, p=.985)
# ——————————————————————————————————————————————————————————————————————————————————————————————————
# boom方式
library(broom)
fit <- glm(OS~race + age + gender + stage + T + N + M,family = binomial,data=meta)
tidy(fit) # 将模型结果转化为整洁的数据框
# # A tibble: 14 × 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) -1.32 0.396 -3.34 0.000846
# 2 raceAMERICAN INDIAN OR ALASKA NATIVE -15.3 2400. -0.00640 0.995
# 3 raceBLACK OR AFRICAN AMERICAN 1.10 1.04 1.06 0.288
# 4 raceWHITE 0.264 0.347 0.761 0.447
# 5 age>60 0.346 0.326 1.06 0.289
# 6 genderMALE -0.370 0.336 -1.10 0.271
# 7 stageII 33.6 3393. 0.00990 0.992
# 8 stageIII 0.361 3393. 0.000107 1.00
# 9 stageIV -32.4 3393. -0.00956 0.992
# 10 T2 -33.1 3393. -0.00976 0.992
# 11 T3 0.981 3393. 0.000289 1.00
# 12 T4 1.40 3393. 0.000412 1.00
# 13 N 16.9 2400. 0.00705 0.994
# 14 M 48.9 3636. 0.0134 0.989
# sjPlot方式
library(sjPlot)
fit <- glm(OS~race + age + gender + stage + T + N + M,family = binomial,data=meta)
tab_model(fit) # 输出逻辑回归结果表格
# stargazer方式
library(stargazer)
fit <- glm(OS~race + age + gender + stage + T + N + M,family = binomial,data=meta)
stargazer(fit, type = "text") # 输出格式化的结果
# ================================================================
# Dependent variable:
# ---------------------------
# OS
# ----------------------------------------------------------------
# raceAMERICAN INDIAN OR ALASKA NATIVE -15.349
# (2,399.545)
#
# raceBLACK OR AFRICAN AMERICAN 1.102
# (1.038)
#
# raceWHITE 0.264
# (0.347)
#
# age> 60 0.346
# (0.326)
#
# genderMALE -0.370
# (0.336)
#
# stageII 33.607
# (3,393.469)
#
# stageIII 0.361
# (3,393.469)
#
# stageIV -32.425
# (3,393.469)
#
# T2 -33.132
# (3,393.469)
#
# T3 0.981
# (3,393.469)
#
# T4 1.397
# (3,393.469)
#
# N 16.916
# (2,399.545)
#
# M 48.885
# (3,635.571)
#
# Constant -1.321***
# (0.396)
#
# ----------------------------------------------------------------
# Observations 227
# Log Likelihood -126.439
# Akaike Inf. Crit. 280.879
# ================================================================
# Note: *p<0.1; **p<0.05; ***p<0.01
# report方式
library(report)
fit <- glm(OS~race + age + gender + stage + T + N + M,family = binomial,data=meta)
report(fit) # 生成逻辑回归的报告
# We fitted a logistic model (estimated using ML) to predict OS with race, age, gender, stage, T, N and M (formula: OS ~ race +
# age + gender + stage + T + N + M). The model's explanatory power is moderate (Tjur's R2 = 0.14). The model's intercept,
# corresponding to race = ASIAN, age = <=60, gender = FEMALE, stage = I, T = 1, N = 0 and M = 0, is at -1.32 (95% CI [-2.13,
# -0.57], p < .001). Within this model:
#
# - The effect of race [AMERICAN INDIAN OR ALASKA NATIVE] is statistically non-significant and negative (beta = -15.35, 95% CI [,
# 473.44], p = 0.995; Std. beta = -15.35, 95% CI [, 473.44])
# - The effect of race [BLACK OR AFRICAN AMERICAN] is statistically non-significant and positive (beta = 1.10, 95% CI [-1.08,
# 3.28], p = 0.288; Std. beta = 1.10, 95% CI [-1.08, 3.28])
# - The effect of race [WHITE] is statistically non-significant and positive (beta = 0.26, 95% CI [-0.42, 0.94], p = 0.447; Std.
# beta = 0.26, 95% CI [-0.16, 0.94])
# - The effect of age [>60] is statistically non-significant and positive (beta = 0.35, 95% CI [-0.29, 0.99], p = 0.289; Std.
# beta = 0.35, 95% CI [-0.29, 0.99])
# - The effect of gender [MALE] is statistically non-significant and negative (beta = -0.37, 95% CI [-1.03, 0.17], p = 0.271;
# Std. beta = -0.37, 95% CI [-1.03, 0.30])
# - The effect of stage [II] is statistically non-significant and positive (beta = 33.61, 95% CI [-498.95, ], p = 0.992; Std.
# beta = 33.61, 95% CI [-498.95, ])
# - The effect of stage [III] is statistically non-significant and positive (beta = 0.36, 95% CI [-84.40, 77.01], p > .999; Std.
# beta = 0.36, 95% CI [-45.09, 44.27])
# - The effect of stage [IV] is statistically non-significant and negative (beta = -32.42, 95% CI [-1243.39, 95.63], p = 0.992;
# Std. beta = -32.42, 95% CI [-1253.80, 93.73])
# - The effect of T [2] is statistically non-significant and negative (beta = -33.13, 95% CI [, 499.69], p = 0.992; Std. beta =
# -33.13, 95% CI [, 499.69])
# - The effect of T [3] is statistically non-significant and positive (beta = 0.98, 95% CI [-83.78, 77.63], p > .999; Std. beta =
# 0.98, 95% CI [-47.06, 48.58])
# - The effect of T [4] is statistically non-significant and positive (beta = 1.40, 95% CI [-75.25, 86.15], p > .999; Std. beta =
# 1.40, 95% CI [-71.27, 61.90])
# - The effect of N is statistically non-significant and positive (beta = 16.92, 95% CI [-471.94, ], p = 0.994; Std. beta = 2.23,
# 95% CI [-62.23, ])
# - The effect of M is statistically non-significant and positive (beta = 48.89, 95% CI [-83.16, 1374.43], p = 0.989; Std. beta =
# 5.59, 95% CI [-13.05, 137.95])
#
# Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals
# (CIs) and p-values were computed using a Wald z-distribution approximation.
各位可以比较一下不同的方式,个人觉得自行提取,autoReg和broom包这三种方式比较好用。
# 在用forward/backward/both方法的时候需要去除NA值!
meta <- na.omit(meta)
dim(meta) #去除了蛮多数据的
# [1] 227 10
#1.多因素enter回归
formula <- as.formula(OS~race+age+gender+stage+T+N+M)
model_enter <- glm(formula,data = meta,family=binomial)
model_enter
summary(model_enter)
#2.多因素—forward
# 初始模型(仅包含截距项)
model_start <- glm(OS ~ 1, data = meta, family = binomial)
# 前向选择
model_forward <- step(model_start,
scope = list(upper = ~ race+age+gender+stage+T+N+M,
lower = ~ 1),
direction = "forward")
# Start: AIC=290.04
# OS ~ 1
#
# Df Deviance AIC
# + T 3 267.73 275.73
# + stage 3 268.19 276.19
# + M 1 281.32 285.32
# + gender 1 284.56 288.56
# <none> 288.04 290.04
# + age 1 286.56 290.56
# + N 1 287.55 291.55
# + race 3 283.88 291.88
#
# Step: AIC=275.73
# OS ~ T
#
# Df Deviance AIC
# + M 1 264.76 274.76
# + age 1 265.12 275.12
# + gender 1 265.51 275.51
# <none> 267.73 275.73
# + N 1 267.08 277.08
# + race 3 263.63 277.63
# + stage 3 265.77 279.77
#
# Step: AIC=274.76
# OS ~ T + M
#
# Df Deviance AIC
# + age 1 262.26 274.26
# + gender 1 262.66 274.66
# <none> 264.76 274.76
# + N 1 264.10 276.10
# + stage 3 260.54 276.54
# + race 3 260.84 276.84
#
# Step: AIC=274.26
# OS ~ T + M + age
#
# Df Deviance AIC
# <none> 262.26 274.26
# + gender 1 260.52 274.52
# + N 1 261.64 275.64
# + stage 3 258.28 276.28
# + race 3 259.68 277.68
summary(model_forward)
#3.多因素-backward
model_backward<-step(model_enter,direction ="backward")
# Start: AIC=280.88
# OS ~ race + age + gender + stage + T + N + M
#
# Df Deviance AIC
# - race 3 254.91 276.91
# - T 3 257.74 279.74
# - age 1 254.00 280.00
# - gender 1 254.08 280.08
# - stage 3 258.21 280.21
# - N 1 254.61 280.61
# <none> 252.88 280.88
# - M 1 257.61 283.61
#
# Step: AIC=276.91
# OS ~ age + gender + stage + T + N + M
#
# Df Deviance AIC
# - T 3 259.67 275.67
# - stage 3 260.12 276.12
# - N 1 256.60 276.60
# - gender 1 256.74 276.74
# <none> 254.91 276.91
# - age 1 257.00 277.00
# - M 1 259.54 279.54
#
# Step: AIC=275.67
# OS ~ age + gender + stage + N + M
#
# Df Deviance AIC
# - N 1 259.77 273.77
# - gender 1 261.25 275.25
# - age 1 261.49 275.49
# <none> 259.67 275.67
# - M 1 263.85 277.85
# - stage 3 277.28 287.28
#
# Step: AIC=273.77
# OS ~ age + gender + stage + M
#
# Df Deviance AIC
# - gender 1 261.41 273.41
# - age 1 261.65 273.65
# <none> 259.77 273.77
# - M 1 264.12 276.12
# - stage 3 277.56 285.56
#
# Step: AIC=273.41
# OS ~ age + stage + M
#
# Df Deviance AIC
# <none> 261.41 273.41
# - age 1 263.69 273.69
# - M 1 265.48 275.48
# - stage 3 280.09 286.09
summary(model_backward)
#4.多因素-both
model_both<-step(model_enter,direction = "both")
# Start: AIC=280.88
# OS ~ race + age + gender + stage + T + N + M
#
# Df Deviance AIC
# - race 3 254.91 276.91
# - T 3 257.74 279.74
# - age 1 254.00 280.00
# - gender 1 254.08 280.08
# - stage 3 258.21 280.21
# - N 1 254.61 280.61
# <none> 252.88 280.88
# - M 1 257.61 283.61
#
# Step: AIC=276.91
# OS ~ age + gender + stage + T + N + M
#
# Df Deviance AIC
# - T 3 259.67 275.67
# - stage 3 260.12 276.12
# - N 1 256.60 276.60
# - gender 1 256.74 276.74
# <none> 254.91 276.91
# - age 1 257.00 277.00
# - M 1 259.54 279.54
# + race 3 252.88 280.88
#
# Step: AIC=275.67
# OS ~ age + gender + stage + N + M
#
# Df Deviance AIC
# - N 1 259.77 273.77
# - gender 1 261.25 275.25
# - age 1 261.49 275.49
# <none> 259.67 275.67
# + T 3 254.91 276.91
# - M 1 263.85 277.85
# + race 3 257.74 279.74
# - stage 3 277.28 287.28
#
# Step: AIC=273.77
# OS ~ age + gender + stage + M
#
# Df Deviance AIC
# - gender 1 261.41 273.41
# - age 1 261.65 273.65
# <none> 259.77 273.77
# + N 1 259.67 275.67
# - M 1 264.12 276.12
# + T 3 256.60 276.60
# + race 3 257.84 277.84
# - stage 3 277.56 285.56
#
# Step: AIC=273.41
# OS ~ age + stage + M
#
# Df Deviance AIC
# <none> 261.41 273.41
# - age 1 263.69 273.69
# + gender 1 259.77 273.77
# + N 1 261.25 275.25
# - M 1 265.48 275.48
# + T 3 258.28 276.28
# + race 3 258.92 276.92
# - stage 3 280.09 286.09
summary(model_both)
# 模型AIC比较
# 选择AIC值最小的模型
anova(model_enter,model_forward,test = "Chisq")
anova(model_enter,model_backward,test = "Chisq")
anova(model_enter,model_both,test = "Chisq")
anova(model_forward,model_backward,test = "Chisq")
anova(model_forward,model_both,test = "Chisq")
anova(model_backward,model_both,test = "Chisq")
# 计算多个模型的 AIC 值
aic_values <- AIC(model_enter, model_forward, model_backward, model_both)
# 打印 AIC 值
print(aic_values)
# 找到 AIC 值最小的模型
best_model <- aic_values[which.min(aic_values$AIC), ]
print(best_model)
# 使用MuMIn包
library(MuMIn)
# 将所有模型放入一个列表
model_list <- list(Enter = model_enter, Forward = model_forward, Backward = model_backward, Both = model_both)
# 使用 MuMIn 包的 model.sel() 函数进行比较
model_comparison <- model.sel(model_list)
# Model selection table
# (Intrc) age gendr M N race stage T df logLik AICc delta weight
# Backward -1.524 + 30.84 + 6 -130.706 273.8 0.00 0.375
# Both -1.524 + 30.84 + 6 -130.706 273.8 0.00 0.375
# Forward -1.508 + 15.15 + 6 -131.131 274.6 0.85 0.245
# Enter -1.321 + + 48.89 16.92 + + + 14 -126.439 282.9 9.07 0.004
# 打印模型比较结果
print(model_comparison)
# 选择 AICc 值最小的模型
best_model <- get.models(model_comparison, 1)[[1]]
summary(best_model)
感觉在汇总结果的时候使用MuMIn包还是蛮方便的。
1、glm:https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/glm
2、autoReg: https://github.com/cardiomoon/autoReg
3、broom:https://github.com/tidymodels/broom
4、MuMIn:https://mumin-dataset.github.io/
注:若对内容有疑惑或者有发现明确错误的朋友,请联系后台(欢迎交流)。更多内容可关注公众号:生信方舟
- END -
原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。
如有侵权,请联系 cloudcommunity@tencent.com 删除。
原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。
如有侵权,请联系 cloudcommunity@tencent.com 删除。