文章/答案/技术大牛

发布

社区首页 >专栏 >文章MSM_metagenomics（八）：机器学习分析

文章MSM_metagenomics（八）：机器学习分析

原创

生信学习者

发布于 2024-06-18 02:16:08

16900

代码可运行

运行总次数：0

代码可运行

欢迎大家关注全网生信学习者系列：

WX公zhong号：生信学习者
Xiao hong书：生信学习者
知hu：生信学习者
CDSN：生信学习者2

介绍

本教程是使用随机森林模型来评估微生物群落分类组成预测能力。

数据

大家通过以下链接下载数据：

百度网盘链接：https://pan.baidu.com/s/1f1SyyvRfpNVO3sLYEblz1A
提取码: 请关注WX公zhong号生信学习者后台发送 复现msm 获取提取码

Evaluate the predictive power of microbiome taxonomic composition

Python package required

Random forest training and results evaluation with ROC-AUC curve

在这里，我们将介绍一个Python脚本evaluation_kfold.py，该脚本实现了random forest model模型，用于评估微生物群落分类组成中编码的信息对不同个体分类的预测能力

代码

/*
* 提示：该行代码过长，系统自动注释不进行高亮。一键复制会移除系统注释 
* #!/usr/bin/env python"""NAME: evaluation_kfold.pyDESCRIPTION: This script is to evaluate a model's prediction power using tenfold method."""from sklearn.model_selection import train_test_splitfrom sklearn.ensemble import RandomForestClassifierfrom sklearn import metricsfrom collections import namedtuplefrom sklearn.calibration import CalibratedClassifierCVfrom sklearn.feature_selection import SelectFromModelimport pandas as pdimport itertools import sysimport argparseimport textwrapimport subprocessimport randomimport numpy as npfrom sklearn.model_selection import StratifiedKFoldfrom sklearn.metrics import roc_curvefrom sklearn.metrics import roc_auc_scoreimport matplotlib.pyplot as pltfrom sklearn.metrics import aucfrom sklearn.metrics import RocCurveDisplayimport matplotlibmatplotlib.rcParams['font.family'] = 'sans-serif'matplotlib.rcParams['font.sans-serif'] = 'Arial'def read_args(args):    # This function is to parse arguments    parser = argparse.ArgumentParser(formatter_class=argparse.RawDescriptionHelpFormatter,                                    description = textwrap.dedent('''\                                    This script is to estimate ROC AUC based on metaphlan-style table with metadata being inserted.                                    '''),                                    epilog = textwrap.dedent('''\                                    examples: evaluation_kfold.py --mpa_df <mpa_df.tsv> --md_rows 0,1,2,3,4 --target_row 3 --pos_feature <CRC> --neg_feature <Healthy> --fold_number 10 --repeat_time 20 --output ROC_AUC.svg                                      '''))    parser.add_argument('--mpa_df',                        nargs = '?',                        help = 'Input a mpa-style table with metadata being inserted.',                        type = str,                        default = None)    parser.add_argument('--md_rows',                        nargs = '?',                        help = 'Input row numbers for specifying metadata without considering header row, zero-based, comma delimited. for example, 0,1,2,3,4.',                        default = None)    parser.add_argument('--target_row',                        nargs = '?',                        help = 'Specify the row number for indicating target metadata to examine, zero-based without considering header row.',                        type = int,                        default = None)    parser.add_argument('--pos_feature',                        nargs = '?',                        help = 'Specify the feature name to be labeled as posive, e.g. 1.',                        type = str,                        default = None)    parser.add_argument('--neg_feature',                        nargs = '?',                        help = 'Specify the feature name to be labeld as negative, e.g. 0.',                        type = str,                        default = None)    parser.add_argument('--fold_number',                        nargs = '?',                        help = 'Specify the fold number you want split the whole dataset.',                        type = int,                        default = None)    parser.add_argument('--repeat_time',                        nargs = '?',                        help = 'Specify the repeat time you want to split the dataset.',                        type = int,                        default = None)    parser.add_argument('--output',                        nargs = '?',                        help = 'Specify the output figure name.',                        type = str,                        default = None)    parser.add_argument('--output_values',                        nargs = '?',                        help = 'Specify the output file name for storing ROC-AUC values.',                        type = str,                        default = None)    parser.add_argument('--nproc',                        nargs = '?',                        help = 'Specify the number of processors you want to use. 4 by default.',                        type = int,                        default = 4)    parser.add_argument('--transform',                        nargs = '?',                        help = 'Transform values in the matrix, [arcsin_sqrt] or [binary] or [None]. [None] by default',                        type = str,                        default = None)    return vars(parser.parse_args())def get_df_dropping_metadata(mpa4_df, row_number_list):    # row_number_list: a list of row numbers in integer.    # this function is to drop rows containing metadata.    df_ = df_.drop(row_number_list)    return df_def get_target_metadata(mpa4_df, row_number):    # row_number: the integer indicating the row which contains the metadata one wants to examine.    # this function is to get a list of binary metadata for establishing ML model.        features = mpa4_df.iloc[row_number].to_list()    return featuresdef prepare_dataset(mpa4_style_md_df, pos_neg_dict, row_number_list, target_row, transform):    # mpa4_style_md_df: the merged metaphlan4 table with metadata being inseted.    # pos_neg_dict: the dictionary which maps examine value to 1 or 0.    # This function is to prepare dataset for downstream analysis.    df_no_md = mpa4_style_md_df.drop(row_number_list)    sample_names = df_no_md.columns[1:].to_list()    matrix = []    for s in sample_names:        values = [float(i) for i in df_no_md[s].to_list()]        matrix.append(values)    matrix = np.array(matrix)        if transform == 'arcsin_sqrt':        matrix = matrix/100        print(matrix)        matrix = np.arcsin(np.sqrt(matrix))    elif transform == 'binary':        matrix[np.where(matrix > 0)] = 1    else:        matrix = matrix    features = get_target_metadata(mpa4_style_md_df, target_row)[1:]    X = np.asarray(matrix)    y = [pos_neg_dict[i] for i in features]    y = np.asarray(y)    return X, ydef roc_auc_curve(model, X, y, fold, repeat, output_name, output_values):    # model: machine learning model to use.    # X: the value matrix    # y: the list of features, 1 and 0    # fold: the number of fold to split the dataset.    # repeat: the repeat number for splitting the dataset.    # output_name: specify the output figure name.    # outout_values: specify the output file name for storing estimated roc-auc values.    cv = StratifiedKFold(n_splits=fold, shuffle=True)    classifier = model    tprs = []    aucs = []    mean_fpr = np.linspace(0, 1, 100)        rocauc_opt = open(output_values, "w")    rocauc_opt.write("repeat"+ "\t" + "fold" + "\t" + "roc_auc" + "\n")    while repeat > 0:        repeat -= 1        for i, (train, test) in enumerate(cv.split(X, y)):            classifier.fit(X[train], y[train])            viz = RocCurveDisplay.from_estimator(                classifier,                X[test],                y[test],                name="ROC fold {}".format(i),                alpha=0.3,                lw=1,            )            interp_tpr = np.interp(mean_fpr, viz.fpr, viz.tpr)            interp_tpr[0] = 0.0            tprs.append(interp_tpr)            aucs.append(viz.roc_auc)            rocauc_opt.write(str(repeat) + "\t" + str(i) + "\t" + str(viz.roc_auc) + "\n")    fig, ax = plt.subplots()    ax.plot([0, 1], [0, 1], linestyle="--", lw=2, color="r", label="Chance", alpha=0.8)    mean_tpr = np.mean(tprs, axis=0)    mean_tpr[-1] = 1.0    mean_auc = auc(mean_fpr, mean_tpr)    std_auc = np.std(aucs)    ax.plot(        mean_fpr,        mean_tpr,        color="b",        label=r"Mean ROC (AUC = %0.2f $\pm$ %0.2f)" % (mean_auc, std_auc),        lw=2,        alpha=0.8,    )    std_tpr = np.std(tprs, axis=0)    tprs_upper = np.minimum(mean_tpr + std_tpr, 1)    tprs_lower = np.maximum(mean_tpr - std_tpr, 0)    ax.fill_between(    mean_fpr,    tprs_lower,    tprs_upper,    color="grey",    alpha=0.2,    label=r"$\pm$ 1 std. dev.",    )    ax.set(        xlim=[-0.05, 1.05],        ylim=[-0.05, 1.05],    )    ax.set_xlabel('False Positive Rate')    ax.set_ylabel('True Positive Rate')    ax.legend(loc="lower right")    plt.savefig(output_name)    rocauc_opt.close()if __name__ == "__main__":    pars = read_args(sys.argv)    df = pd.read_csv(pars["mpa_df"], sep = "\t", index_col = False)    row_number_list = [int(i) for i in pars["md_rows"].split(",")]    pos_neg_dict = {pars["pos_feature"]:1, pars["neg_feature"]:0}    X, y = prepare_dataset(df, pos_neg_dict, row_number_list, pars["target_row"], pars["transform"])    model = RandomForestClassifier(n_estimators = 1000,                                   criterion = 'entropy',                                   min_samples_leaf = 1,                                   max_features = 'sqrt',                                   n_jobs = 4) # initiating a RF classifier    roc_auc_curve(model, X, y, pars["fold_number"], pars["repeat_time"], pars["output"], pars["output_values"])
*/

用法

evaluation_kfold.py [-h] [--mpa_df [MPA_DF]] [--md_rows [MD_ROWS]] [--target_row [TARGET_ROW]] [--pos_feature [POS_FEATURE]] [--neg_feature [NEG_FEATURE]] [--fold_number [FOLD_NUMBER]] [--repeat_time [REPEAT_TIME]] [--output [OUTPUT]]                           [--output_values [OUTPUT_VALUES]] [--nproc [NPROC]] [--transform [TRANSFORM]]This script is to estimate ROC AUC based on metaphlan-style table with metadata being inserted.optional arguments:  -h, --help            show this help message and exit  --mpa_df [MPA_DF]     Input a mpa-style table with metadata being inserted.  --md_rows [MD_ROWS]   Input row numbers for specifying metadata without considering header row, zero-based, comma delimited. for example, 0,1,2,3,4.  --target_row [TARGET_ROW]                        Specify the row number for indicating target metadata to examine, zero-based without considering header row.  --pos_feature [POS_FEATURE]                        Specify the feature name to be labeled as posive, e.g. 1.  --neg_feature [NEG_FEATURE]                        Specify the feature name to be labeld as negative, e.g. 0.  --fold_number [FOLD_NUMBER]                        Specify the fold number you want split the whole dataset.  --repeat_time [REPEAT_TIME]                        Specify the repeat time you want to split the dataset.  --output [OUTPUT]     Specify the output figure name.  --output_values [OUTPUT_VALUES]                        Specify the output file name for storing ROC-AUC values.  --nproc [NPROC]       Specify the number of processors you want to use. 4 by default.  --transform [TRANSFORM]                        Transform values in the matrix, [arcsin_sqrt] or [binary] or [None]. [None] by defaultexamples: python evaluation_kfold.py --mpa_df <mpa_df.tsv> --md_rows 0,1,2,3,4 --target_row 3 --pos_feature <CRC> --neg_feature <Healthy> --fold_number 10 --repeat_time 20 --output ROC_AUC.svg

为了演示这个教程，我们使用了一个微生物组数据集machine_learning_input.tsv: ./data/machine_learning_input.tsv，其中包含52名受试者（28名受试者有超过3个性伴侣，24名受试者有0-3个性伴侣）以及相应的肠道微生物群落物种的相对丰度。

示例命令:

python evaluation_kfold.py \  --mpa_df machine_learning_input.tsv \  --md_rows 0 \  --target_row 0 \  --pos_feature ">3" \  --neg_feature "0_3" \  --fold_number 3 \  --repeat_time 50 \  --output roc_auc_npartners.png \  --output_values roc_auc_npartners_values.tsv \  --nproc 10

它生成了一个ROC-AUC曲线，以展示随机森林模型拟合我们输入的微生物群落分类数据的整体预测能力。

可选地，它还可以生成用于生成上述图表的原始输出roc_auc_npartners_values.tsv: ./data/roc_auc_npartners_values.tsv。人们可以将其用于其他目的。

Visualize standard deviation of machine learning estimates

R packages required

Plot the distribution of ROC-AUC estimates with standard deviation

我们上面展示了如何使用ROC-AUC曲线通过与随机效应的基准测试来评估微生物群落的预测能力。在这里，我们将介绍在rocauc_stdv_funcs.R中实现的辅助函数data_summary和std_deviation_plot，用于可视化来自多次随机森林分类的结果的ROC-AUC估计的标准偏差。

data_summary is function to summarize raw ROC-AUC estimates in standard deviation with three arguments:

data: Input a dataframe as input.
var_estimates: The column header indicating ROC-AUC estimates for the target variable.
groups: The column header containing the group names.

std_deviation_plot is a function to plot ROC-AUC estimates with error bars based on standard deviation with arguments:

df: Input the dataframe of standard deviations of ROC-AUC estimates.
x_axis: Specify the column header (groups usually) for X axis.
y_axis: Specify the column header (ROC-AUC means usually) for Y axis.
stdv_column: Specify the column header indicating standard deviation.
palette: Specify the palette for colors, default [jco].
y_label: Name the Y label, default [ROC-AUC].
x_label: Name the X laebl, default NULL.
order_x_axis: give a vector to specify the order of columns along X axis.
font_size: specify the font size, default [11].
font_family: specify the font family, default [Arial].

在这里，我们将使用演示数据merged ROC-AUC estimates: ./data/roc_auc_merged.tsv合并的ROC-AUC估计值，这些数据来自于执行随机森林分类以区分五种性行为，包括接受receptive anal intercourse, number of partners, oral sex, sex transmitted infection, 和 condom use。

加载rocauc_stdv_funcs.R（包含data_summary和std_deviation_plot函数）。

data_summary <- function(data,                         var_estimates,                         groups){  require(plyr)  summary_func <- function(x, col){    c(mean = mean(x[[col]], na.rm=TRUE),      sd = sd(x[[col]], na.rm=TRUE))  }  data_sum<-ddply(data, groups, .fun=summary_func,                  var_estimates)  data_sum <- rename(data_sum, c("mean" = var_estimates))  return(data_sum)}std_deviation_plot <- function(df,                               x_axis,                               y_axis,                               stdv_column,                               palette = "jco",                               y_label = "AUC-ROC",                               x_label = "",                               order_x_axis = NULL,                               font_size = 11,                               font_family = "Arial"){    stdv_plot <- ggpubr::ggdotplot(data = df, x = x_axis, y = y_axis, color = x_axis, fill = x_axis,                           palette = palette, ylab = y_label, xlab = x_label,                           order = order_x_axis) +                           ggplot2::geom_hline(yintercept = 0.5, linetype = "dotted", col = 'red') +                           ggplot2::theme(text = ggplot2::element_text(size = font_size, family = font_family)) +                           ggplot2::theme(legend.title = ggplot2::element_blank()) +                           ggplot2::geom_errorbar(ggplot2::aes(ymin = eval(parse(text = y_axis)) - eval(parse(text = stdv_column)),                                                    ymax =  eval(parse(text = y_axis)) + eval(parse(text = stdv_column)),                                                    color = eval(parse(text = x_axis))), width = .2,                                                    position = ggplot2::position_dodge(0.7))    stdv_plot}

其次，将merged ROC-AUC estimates: ./data/roc_auc_merged.tsv加载到R的数据框中。

roc_auc_merged <- data.frame(read.csv("./data/roc_auc_merged.tsv", header = TRUE, sep = "\t"))

第三，制作一个标准偏差的数据摘要。

roc_auc_std <- data_summary(data = roc_auc_merged,                            var_estimates = "roc.auc",                            groups = "sexual.practice")

最后，绘制标准偏差中的ROC-AUC估计值。

rocauc_plot <- std_deviation_plot(df = std,                                  x_axis = "sexual.practice",                                  y_axis = "roc.auc",                                  stdv_column = "sd",                                  order = c("Receptive anal intercourse", "Number of partners",                                            "Oral sex", "Sex transmitted infection", "Condom use"))

可选地，可以使用ggplot2（或ggpubr）函数调整图表，例如重新设置y轴限制和旋转x轴标签。

rocauc_plot + ggplot2::ylim(0, 1) + ggpubr::rotate_x_text(45)

原创声明：本文系作者授权腾讯云开发者社区发表，未经许可，不得转载。

如有侵权，请联系 cloudcommunity@tencent.com 删除。

机器学习

数据分析

原创声明：本文系作者授权腾讯云开发者社区发表，未经许可，不得转载。

如有侵权，请联系 cloudcommunity@tencent.com 删除。

机器学习

数据分析

登录后参与评论

暂无评论

编辑精选文章

换一批

万字详解高可用架构设计

5283

Go 开发者必备：Protocol Buffers 入门指南

3116

10分钟带你彻底搞懂分布式链路跟踪

2255

文章MSM_metagenomics（四）：Beta多样性分析

r 语言数据分析

本教程旨在使用基于R的函数以及Python脚本来估计使用MetaPhlAn profile的微生物群落的Beta多样性

生信学习者

2024/06/16

3790

文章MSM_metagenomics（五）：共现分析

r 语言数据分析

本教程是使用一个Python脚本来分析多种微生物（即strains, species, genus等）的共现模式。

生信学习者

2024/06/16

1430

文章MSM_metagenomics（一）：介绍

数据分析

用于复现Huang et al.研究分析的计算工作流程，所有复现数据和代码：生信学习者。

生信学习者

2024/06/15

1490

文章MSM_metagenomics（三）：Alpha多样性分析

r 语言数据分析

本教程使用基于R的函数来估计微生物群落的香农指数和丰富度，使用MetaPhlAn profile数据。估计结果进一步进行了可视化，并与元数据关联，以测试统计显著性。

生信学习者

2024/06/16

2300

文章MSM_metagenomics（七）：分组马赛克图

python 数据分析

使用一个Python脚本mosaic_plot.py，以及一个包含MSM 和 Non-MSM个体相关的物种的表格，这些物种被识别为革兰氏阴性或非革兰氏阴性，在two_variable_mosaic.tsv: ./data/two_variable_mosaic.tsv中。

生信学习者

2024/06/17

1080

机器学习（十二）交叉验证实例

其他

假设有个未知模型具有一个或多个待定的参数，且有一个数据集能够反映该模型的特征属性（训练集）。

致Great

2018/11/07

2.6K0

每个Kaggle冠军的获胜法门：揭秘Python中的模型集成

其他

选自Dataquest 作者：Sebastian Flennerhag 机器之心编译集成方法可将多种机器学习模型的预测结果结合在一起，获得单个模型无法匹敌的精确结果，它已成为几乎所有 Kaggle 竞赛冠军的必选方案。那么，我们该如何使用 Python 集成各类模型呢？本文作者，曼彻斯特大学计算机科学与社会统计学院的在读博士 Sebastian Flennerhag 对此进行了一番简述。在 Python 中高效堆叠模型集成（ensemble）正在迅速成为应用机器学习最热门和流行的方法。目前，几乎每一

机器之心

2018/05/10

3.3K0

【Lesson1】R 机器学习流程及案例实现

编程算法机器学习神经网络深度学习人工智能

一直在学习机器学习的项目；学的断断续续。近期需要完成一些数据建模与分析，将机器学习重新整理了一遍。这篇文章主要是介绍R数据科学中，构建机器学习模型的流程。为了更适合无基础的人快速了解整个流程框架，本文省去机器学习模型的原理及公式部分，如果需要了解，请戳 Here 。

Jamesjin63

2022/11/03

1.1K0

文章MSM_metagenomics（六）：复杂热图绘制

数据分析 r 语言

complexheatmap_plotting_funcs.R来自于R可视化：微生物相对丰度或富集热图可视化参考章节，它提供了充足的说明

生信学习者

2024/06/17

1480

数据挖掘机器学习[六]---项目实战金融风控之贷款违约预测

金融风控机器学习神经网络深度学习

因为文档是去年弄的，很多资料都有点找不到了，我尽可能写的详细。后面以2021年研究生数学建模B题为例【空气质量预报二次建模】再进行一个教学。

汀丶人工智能

2022/12/21

1.8K0

Xlearn ——快速落地FM/FFM机器学习算法

import model 模型数据机器学习算法

Xlearn是你面对结构化数据分类/回归任务时，除了xgboost/lightgbm/catboost之外，又不想搞训练很慢的深度学习模型时，可以尝试考虑的一个能够快速落地的机器学习baseline基准。

lyhue1991

2024/01/04

4980

TCGA官方数据挖掘文章教你机器学习or深度学习

编程算法腾讯云开发者社区 github git 开源

最近我们又组织了：《机器学习加深度学习资料大放送（附上资料群）》交流群，感觉吧，大家松鼠症发作收集整理了大把资料最后却束之高阁，也不是一个事啊。所以就安排学徒系统性讲解一下机器学习的应用。本次教程参考TCGA 官方的一篇文章Machine Learning Detects Pan-cancer Ras Pathway Activation in The Cancer Genome Atlas文章，文章的源码也可以在官方的 github 上搜到：

生信技能树

2020/10/26

1.7K0

一套完整的基于随机森林的机器学习流程（特征选择、交叉验证、模型评估））

编程算法机器学习决策树神经网络深度学习

为了展示随机森林的操作，我们用一套早期的前列腺癌和癌旁基因表达芯片数据集，包含102个样品(50个正常，52个肿瘤)，2个分组和9021个变量 (基因)。（https://file.biolab.si/biolab/supp/bi-cancer/projections/info/prostata.html）

生信宝典

2021/11/23

10.1K0

时间序列分析应用：在COVID-19时期预测苹果股票

python api apple lag stock

*免责声明：本练习未考虑诸如交易和佣金之类的费用。作者对使用本文承担的风险或利益概不负责。

deephub

2020/05/09

7870

客户流失？来看看大厂如何基于spark+机器学习构建千万数据规模上的用户留存模型 ⛵

机器学习 spark

Sparkify 是一个音乐流媒体平台，用户可以获取部分免费音乐资源，也有不少用户开启了会员订阅计划（参考QQ音乐），在Sparkify中享受优质音乐内容。

ShowMeAI

2022/08/09

1.8K0

客户流失？来看看大厂如何基于spark+机器学习构建千万数据规模上的用户留存模型 ⛵

使用逻辑回归模型预测用户购买会员意向

变量测试模型数据搜索

会员付费模式是互联网中常用的变现方式，并具有高用户忠诚度和粘性，帮助电商应用增加收入的优点。会员的销售模式，依赖于线下会销+线上直播+代理商电话销售的模式。为使用户有良好的用户体验，以及满足精细化运营的需求，如何在海量用户中筛选出有价值的用户成为会员转化运营工作的重点。因此，本文采用了逻辑回归的算法，使用用户在平台上的行为数据特征(登录、协议、商品、交易等)，通过模型预测出用户购买会员的概率，对于预测结果有较大概率购买会员的用户，进行重点触达，提高交易转化。

政采云前端团队

2023/10/16

1.2K0

TensorFlow Eager 教程

数据分析 tensorflow 神经网络机器学习深度学习

大家好！在本教程中，我们将使用 TensorFlow 的命令模式构建一个简单的前馈神经网络。希望你会发现它很有用！如果你对如何改进代码有任何建议，请告诉我。

ApacheCN_飞龙

2022/12/02

1K0

图与图学习（中）

https 网络安全数据结构编程算法机器学习

在上篇中，我们简单学习了图论的基本概念，图的表示和存储方式，同构图和异构图的分类，以及几个基础的图论算法。在接下来的前置教程下篇中，我们将会学习图机器学习。

致Great

2021/03/17

1.3K0

基于传统机器学习模型算法的项目开发详细过程

机器学习

1、 pandas读取数据: pd.read_csv()，训练数据一般从csv文件加载。读取数据返回DataFrame，df.head() 查看前5条件数据分布

用户1414696

2024/01/14

3360

【机器学习实战】手把手教学，kaggle贷款批准预测（使用xgboost解决正负样本不平衡问题）

xgboost 测试模型机器学习 kaggle

train.csv - 训练数据集；loan_status是二进制目标 test.csv - 测试数据集；

机器学习司猫白

2025/01/21

2750