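When bioinformatics engineers are starting out, it is easy to get hooked on text-processing tricks. The programming language Perl once became popular in bioinformatics precisely because it was good at text processing. Now that we have large AI models, those tricks are a thing of the past!
For example, a few days ago I ran OptiType, and the HLA typing it produced for each sample looked like this: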
A*02:06 A*02:07 B*35:01 B*48:01 C*04:01 C*08:01
A*02:06 A*02:07 B*35:01 B*48:01 C*04:01 C*08:01
A*24:02 A*30:01 B*13:02 B*48:01 C*06:02 C*08:01
A*24:02 A*30:01 B*13:02 B*48:01 C*06:02 C*08:01
A*02:03 A*03:01 B*38:02 B*44:02 C*07:02 C*05:01
A*02:03 A*03:01 B*38:02 B*44:02 C*07:02 C*05:01
A*03:01 A*24:02 B*08:01 B*14:02 C*07:02 C*08:02
A*03:01 A*24:02 B*08:01 B*14:02 C*07:02 C*08:02
A*30:01 A*30:01 B*13:02 B*15:11 C*03:03 C*06:02
A*30:01 A*30:01 B*13:02 B*15:11 C*03:03 C*06:02
OptiType is an HLA genotyping algorithm based on integer linear programming. It provides accurate 4-digit HLA genotyping predictions from NGS data. For further reading, see the OptiType documentation.
However, the downstream tool (NeoPredPipe) requires this input format:
HLA-A_1 HLA-A_2 HLA-B_1 HLA-B_2 HLA-C_1 HLA-C_2
hla_a_31_01_02 hla_a_02_01_80 hla_b_40_01_02 hla_b_50_01_01 hla_c_03_04_20 hla_c_06_02_01_02
See its official documentation:

[Figure: input format]
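In the past I would certainly have had to write code for this text processing myself. Now, with a large AI model, it is enough to state the requirement clearly:

[Figure: stating the requirement clearly]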
Sure, I will reformat this HLA genotype information into the format you provided. Here is the result:
hla_a_02_06_01 hla_a_02_07_01 hla_b_35_01_01 hla_b_48_01_01 hla_c_04_01_01 hla_c_08_01_01
hla_a_02_06_01 hla_a_02_07_01 hla_b_35_01_01 hla_b_48_01_01 hla_c_04_01_01 hla_c_08_01_01
hla_a_24_02_01 hla_a_30_01_01 hla_b_13_02_01 hla_b_48_01_01 hla_c_06_02_01 hla_c_08_01_01
hla_a_24_02_01 hla_a_30_01_01 hla_b_13_02_01 hla_b_48_01_01 hla_c_06_02_01 hla_c_08_01_01
hla_a_02_03_01 hla_a_03_01_01 hla_b_38_02_01 hla_b_44_02_01 hla_c_07_02_01 hla_c_05_01_01
hla_a_02_03_01 hla_a_03_01_01 hla_b_38_02_01 hla_b_44_02_01 hla_c_07_02_01 hla_c_05_01_01
hla_a_03_01_01 hla_a_24_02_01 hla_b_08_01_01 hla_b_14_02_01 hla_c_07_02_01 hla_c_08_02_01
hla_a_03_01_01 hla_a_24_02_01 hla_b_08_01_01 hla_b_14_02_01 hla_c_07_02_01 hla_c_08_02_01
hla_a_30_01_01 hla_a_30_01_01 hla_b_13_02_01 hla_b_15_11_01 hla_c_03_03_01 hla_c_06_02_01
hla_a_30_01_01 hla_a_30_01_01 hla_b_13_02_01 hla_b_15_11_01 hla_c_03_03_01 hla_c_06_02_01
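The model even thoughtfully included an explanation:
Each allele is written in the form hla_<locus>_<allele1>_<allele2>_<allele3>. Where the input has only two fields (for example A*02:06), 01 is filled in at the third allele position.
That padding is actually unnecessary: our tool (OptiType) simply does not report a third field. If you are interested in the text processing itself, you can also ask the model for solutions in various programming languages, such as R and shell. Go try it, and keep chatting with the model!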
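For readers who would rather script the conversion than paste data into a chat window, here is a minimal Python sketch of the same reformatting. It is not taken from the OptiType or NeoPredPipe documentation. The function names, the `pad` option, and the tab separator are assumptions made for illustration, and the sketch handles only the six allele columns shown in this post. By default it adds no third field, since OptiType does not report one; passing `pad="01"` reproduces the padded output the model gave.

```python
import re

# Header row exactly as shown in the post for the NeoPredPipe input.
HEADER = ["HLA-A_1", "HLA-A_2", "HLA-B_1", "HLA-B_2", "HLA-C_1", "HLA-C_2"]

# Matches the 4-digit class I calls shown in the post, e.g. "A*02:06".
ALLELE_RE = re.compile(r"^([ABC])\*(\d+):(\d+)$")


def convert_allele(allele, pad=None):
    """Turn 'A*02:06' into 'hla_a_02_06' (or 'hla_a_02_06_01' with pad='01')."""
    match = ALLELE_RE.match(allele)
    if match is None:
        raise ValueError(f"unexpected allele format: {allele!r}")
    locus, field1, field2 = match.groups()
    parts = ["hla", locus.lower(), field1, field2]
    if pad is not None:
        # Hypothetical option: mimic the third-field padding the model added.
        parts.append(pad)
    return "_".join(parts)


def convert_line(line, pad=None, sep="\t"):
    """Convert one whitespace-separated row of allele calls."""
    return sep.join(convert_allele(a, pad) for a in line.split())


def convert_table(lines, pad=None, sep="\t"):
    """Return the header plus one converted row per non-empty input line.

    The tab separator is an assumption; check the NeoPredPipe documentation.
    """
    rows = [sep.join(HEADER)]
    rows.extend(convert_line(l, pad, sep) for l in lines if l.strip())
    return rows


if __name__ == "__main__":
    sample = [
        "A*02:06 A*02:07 B*35:01 B*48:01 C*04:01 C*08:01",
        "A*30:01 A*30:01 B*13:02 B*15:11 C*03:03 C*06:02",
    ]
    print("\n".join(convert_table(sample)))
```

The regular expression is deliberately strict, so an allele call in any other shape raises an error instead of being silently mangled.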