
By 2025, training large language models (LLMs) has entered the era of extreme scale: parameter counts reach hundreds of billions or even trillions, and a single run can occupy hundreds or thousands of GPUs/TPUs. In this setting, an efficient cluster management system becomes critical infrastructure for successful training. Slurm (Simple Linux Utility for Resource Management) is the most widely used open-source job scheduler and runs on supercomputing clusters at research institutions and large technology companies alike.
This article takes a deep look at using Slurm for cluster management and scheduling optimization in LLM training. Starting from Slurm fundamentals, it covers configuration files, job submission strategies, resource allocation, monitoring and debugging techniques, and current Slurm features and best practices, with code samples and configuration templates to help readers build high-performance, highly reliable LLM training clusters.
As model sizes keep growing, cluster management faces mounting challenges: resource fragmentation, scheduling latency, complex failure recovery, and energy efficiency. In practice, a well-tuned Slurm configuration can improve training efficiency by 30-50% while significantly reducing operating costs, so mastering advanced Slurm administration is essential for training large language models successfully.
Slurm is built from the following core components:
# Slurm architecture diagram
+------------------+      +------------------+      +------------------+
|                  |      |                  |      |                  |
| User workstation |------|   Control node   |------|  Compute node 1  |
|  (sbatch/srun)   |      |   (slurmctld)    |      |  (slurmd)        |
|                  |      |                  |      |                  |
+------------------+      +------------------+      +------------------+
                                    |                         |
                                    |                         |
                          +------------------+      +------------------+
                          |                  |      |                  |
                          |  Database node   |      |  Compute node 2  |
                          |  (slurmdbd)      |      |  (slurmd)        |
                          +------------------+      +------------------+
Slurm uses a layered resource-management model: the cluster is divided into partitions, partitions contain nodes, and each node exposes consumable resources such as cores, memory, and generic resources (GRES, e.g. GPUs); user work is submitted as jobs, which are broken down into job steps and tasks.
This layered model lets Slurm manage clusters of any size, from a handful of servers to supercomputers with thousands of nodes.
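These layers map directly onto everyday commands; a minimal sketch for inspecting them on a live cluster (the node name and job ID below are placeholders):
# Cluster/partition layer: partitions, node counts, and states
sinfo -s
# Node layer: per-node CPUs, memory, and GRES (GPUs)
scontrol show node gpu01 | grep -E "CPUTot|RealMemory|Gres"
# Job and job-step layer: steps show up as <jobid>.<step> in accounting
sacct -j 123456 --format=JobID,JobName,AllocNodes,AllocTRES%40,State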
Slurm is a particularly good fit for LLM training: one srun gang-launches a task per GPU across all allocated nodes, GPUs are first-class schedulable resources through GRES, topology-aware placement keeps NCCL traffic within a switch, and accounting, fair-share, and QOS controls make large shared clusters manageable.
slurm.conf is Slurm's central configuration file and defines the cluster's basic structure and behavior:
# Basic cluster information
ClusterName=llm-training-cluster
SlurmUser=slurm
SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
# Controller settings (SlurmctldHost replaces the older ControlMachine/BackupController syntax)
SlurmctldHost=controller01(192.168.1.10)
SlurmctldHost=controller02(192.168.1.11)
# Compute node definitions
NodeName=node[01-128] CPUs=128 RealMemory=1024000 State=UNKNOWN
NodeName=gpu[01-64] CPUs=128 RealMemory=1024000 Features=gpu,gpu-a100 Gres=gpu:8 State=UNKNOWN
# Partition definitions
PartitionName=debug Nodes=node[01-08] Default=YES MaxTime=01:00:00 State=UP
PartitionName=standard Nodes=node[09-128] MaxTime=14-00:00:00 State=UP
PartitionName=gpu Nodes=gpu[01-64] MaxTime=14-00:00:00 State=UP
PartitionName=high-mem Nodes=node[01-128] MaxTime=14-00:00:00 State=UP
# Scheduler settings (cons_tres is needed for GPU/TRES-aware allocation)
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
# Job and job-step limits
MaxJobCount=10000
MaxStepCount=40000
# Communication settings
MessageTimeout=60
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
# Accounting
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
AccountingStorageHost=db01
# Logging
SlurmctldDebug=info
SlurmdDebug=info
For LLM training, GPU resource configuration is critical. GPUs are declared as generic resources (GRES) in gres.conf:
# Generic GPU resource definition
Name=gpu Type=a100 File=/dev/nvidia[0-7]
# Node-specific GPU definitions
NodeName=gpu[01-32] Name=gpu Type=a100 File=/dev/nvidia[0-7]
NodeName=gpu[33-64] Name=gpu Type=a100-sxm File=/dev/nvidia[0-7]
The Features parameter provides a flexible way to tag and select nodes:
# Define node features in slurm.conf
NodeName=gpu01 CPUs=128 RealMemory=1024000 Features=gpu,a100,high-bandwidth Gres=gpu:8 State=UNKNOWN
NodeName=gpu02 CPUs=128 RealMemory=1024000 Features=gpu,a100,high-mem Gres=gpu:8 State=UNKNOWN
# Jobs can then request nodes by feature at submission time
# sbatch --constraint=high-bandwidth job_script.sh
Slurm supports a rich set of advanced scheduling options:
# Priority settings
PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityUsageResetPeriod=14-0
PriorityWeightFairshare=100000
PriorityWeightAge=1000
PriorityWeightPartition=10000
PriorityWeightQOS=100000
# Fair-share enforcement
AccountingStorageEnforce=limits,qos
# Per-QOS usage thresholds are set through sacctmgr, e.g.:
# sacctmgr modify qos standard set UsageThreshold=0.1
# Preemption (optional)
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG
A typical Slurm batch script for LLM training:
#!/bin/bash
#SBATCH --job-name=llm-training
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:8
#SBATCH --partition=gpu
#SBATCH --time=14-00:00:00
#SBATCH --output=training_%j.log
#SBATCH --error=training_%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@example.com
# Load environment modules
module load cuda/12.3
module load nccl/2.18.3
module load python/3.10
# Change to the submission directory
cd $SLURM_SUBMIT_DIR
# Activate the virtual environment
source venv/bin/activate
# Configure the distributed training environment
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=6000
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=0
export NCCL_IB_GID_INDEX=3
export NCCL_IB_HCA=mlx5_
# Launch the training script (one task per GPU)
srun --ntasks-per-node=8 python train_llm.py \
    --model_size 10B \
    --batch_size 8 \
    --gradient_accumulation_steps 8 \
    --learning_rate 1e-4 \
    --data_path /data/llm_dataset/ \
    --output_dir /results/llm_checkpoints/
Requesting the right resources is essential for scheduling efficiency and training performance:
# An optimized resource request
#SBATCH --nodes=8                   # number of nodes
#SBATCH --ntasks-per-node=8         # tasks per node (one per GPU)
#SBATCH --cpus-per-task=16          # CPUs per task (data loading / memory bandwidth)
#SBATCH --mem-per-cpu=4G            # memory per CPU (optional)
#SBATCH --gres=gpu:8                # GPU resources
#SBATCH --exclusive                 # exclusive node access (avoid contention)
Job arrays are very useful for scenarios such as hyperparameter search:
#!/bin/bash
#SBATCH --job-name=llm-hyperparam
#SBATCH --array=0-7                 # 8 array tasks
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --time=24:00:00
# Hyperparameter lists
LEARNING_RATES=(1e-4 5e-5 1e-5 5e-6 2e-4 8e-5 3e-5 1e-6)
BATCH_SIZES=(8 16 8 32 4 12 16 8)
# Pick the hyperparameters for this array task
LR=${LEARNING_RATES[$SLURM_ARRAY_TASK_ID]}
BATCH=${BATCH_SIZES[$SLURM_ARRAY_TASK_ID]}
# Run training
srun python train_llm.py --learning_rate $LR --batch_size $BATCH --output_dir /results/exp_${SLURM_ARRAY_TASK_ID}/
For pipelined training workflows, job dependencies can be chained:
#!/bin/bash
# Stage 1: data preprocessing
PREPROCESS_JOB=$(sbatch --parsable preprocess.sh)
echo "Preprocess job ID: $PREPROCESS_JOB"
# Stage 2: base training, runs after stage 1 completes successfully
TRAIN_JOB=$(sbatch --parsable --dependency=afterok:$PREPROCESS_JOB train_base.sh)
echo "Base training job ID: $TRAIN_JOB"
# Stage 3: fine-tuning, depends on stage 2
FINE_TUNE_JOB=$(sbatch --parsable --dependency=afterok:$TRAIN_JOB finetune.sh)
echo "Fine-tuning job ID: $FINE_TUNE_JOB"
# Stage 4: evaluation, depends on stage 3
EVAL_JOB=$(sbatch --parsable --dependency=afterok:$FINE_TUNE_JOB evaluate.sh)
echo "Evaluation job ID: $EVAL_JOB"
Exploiting the network topology reduces cross-switch communication and speeds up training:
# Configure topology-aware scheduling in slurm.conf
TopologyPlugin=topology/tree
# topology.conf (read from the same directory as slurm.conf)
SwitchName=s0 Nodes=gpu[01-16]
SwitchName=s1 Nodes=gpu[17-32]
SwitchName=s2 Nodes=gpu[33-48]
SwitchName=s3 Nodes=gpu[49-64]
SwitchName=root Switches=s0,s1,s2,s3
At submission time a job can require that its nodes sit under a limited number of switches:
#SBATCH --switches=1                 # require all nodes under a single leaf switch
Setting GPU affinity explicitly avoids the cost of dynamic binding:
#!/bin/bash
#SBATCH --job-name=llm-training
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
# Set GPU affinity on the srun command line
srun --ntasks-per-node=8 --gpu-bind=closest python train_llm.py
Memory management matters just as much for large-model training:
# Bind tasks' memory to the local NUMA node
#SBATCH --mem-bind=local
# Thread placement inside the job script
export OMP_NUM_THREADS=16
export OMP_PLACES=cores
export OMP_PROC_BIND=close
# Enable huge pages (must be pre-configured on the nodes)
export LD_PRELOAD=/lib64/libhugetlbfs.so
Tuning the NCCL communication library is critical for distributed training performance:
# NCCL tuning
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=0                # enable InfiniBand
export NCCL_IB_GID_INDEX=3
export NCCL_IB_HCA=mlx5_
export NCCL_SOCKET_IFNAME=eth0
export NCCL_NET_GDR_LEVEL=2
export NCCL_IB_QPS_PER_CONNECTION=8
export NCCL_IB_TC=106
export NCCL_IB_TIMEOUT=23
export NCCL_IB_RETRY_CNT=7
export NCCL_ASYNC_ERROR_HANDLING=1      # enable asynchronous error handling
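Before committing to a multi-week run, it is worth validating these settings with a short bandwidth test. A minimal sketch using NVIDIA's nccl-tests suite (assumes the all_reduce_perf binary has already been built on the cluster):
# Grab a short two-node GPU allocation for the test
salloc --nodes=2 --ntasks-per-node=8 --gres=gpu:8 --partition=gpu --time=00:15:00
# Inside the allocation: a 16-rank all-reduce from 8 bytes up to 4 GB
srun ./nccl-tests/build/all_reduce_perf -b 8 -e 4G -f 2 -g 1
# Bus bandwidth far below the fabric's nominal rate usually points at NCCL_IB_* or topology issues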
Monitor job status with Slurm's built-in commands:
# Check the job queue
squeue -u $USER
# Show job details
scontrol show job <job_id>
# Show node status
sinfo -l
# Follow job output in real time
watch -n 10 "tail -n 50 slurm-${SLURM_JOB_ID}.out"
Slurm can be integrated with Prometheus and Grafana:
# Install slurm_exporter
git clone https://github.com/vpenso/slurm_exporter
cd slurm_exporter
make
sudo cp slurm_exporter /usr/local/bin/
# Create a systemd service
cat << EOF > /etc/systemd/system/slurm_exporter.service
[Unit]
Description=Slurm Exporter
After=network.target
[Service]
User=slurm
ExecStart=/usr/local/bin/slurm_exporter --listen=:9341
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now slurm_exporter
Common Slurm errors in LLM training and how to resolve them:
| Error message | Likely cause | Resolution |
|---|---|---|
| srun: error: Unable to allocate resources | Insufficient resources or an oversized request | Adjust the request; preview with --test-only |
| srun: error: NCCL version mismatch | Inconsistent NCCL versions across nodes | Ensure every node uses the same NCCL version |
| srun: error: Task launch for 0 failed | Inter-node network problems | Check InfiniBand links; verify firewall settings |
| srun: Job step aborted: Waiting up to 32 seconds for job step to finish | Hung task or deadlock | Enable debug logging; inspect the application code |
| OOM killer terminated process | Out of memory | Reduce the batch size; enable ZeRO optimizations |
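Before digging into logs, a few standard commands usually narrow down allocation and failure causes (the job ID and script name are placeholders):
# Dry-run the submission to see whether the request could ever be satisfied
sbatch --test-only train_llm.sh
# Why is a pending job not starting?
squeue -j 123456 --start
scontrol show job 123456 | grep -i reason
# After a failure: per-step exit codes and resource usage
sacct -j 123456 --format=JobID,State,ExitCode,Elapsed,MaxRSS,NodeList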
A script for analyzing Slurm job logs:
#!/usr/bin/env python3
import re
import sys
from collections import Counter

# Analyze errors in a Slurm job log
def analyze_slurm_log(log_file):
    error_patterns = [
        r'error:',
        r'fail',
        r'warning',
        r'OOM',
        r'NCCL',
        r'timeout'
    ]
    error_counts = Counter()
    error_context = {}
    with open(log_file, 'r') as f:
        lines = f.readlines()
    for i, line in enumerate(lines):
        for pattern in error_patterns:
            if re.search(pattern, line, re.IGNORECASE):
                error_counts[pattern] += 1
                # Save the surrounding context of this error
                start = max(0, i - 5)
                end = min(len(lines), i + 6)
                context = ''.join(lines[start:end])
                if pattern not in error_context:
                    error_context[pattern] = []
                if len(error_context[pattern]) < 3:  # keep at most 3 examples per pattern
                    error_context[pattern].append(context)
    # Print the analysis results
    print(f"Log analysis for {log_file}:")
    print("\nError summary:")
    for error, count in error_counts.most_common():
        print(f"{error}: {count}")
    print("\nError examples:")
    for error, examples in error_context.items():
        print(f"\n{error} examples:")
        for i, example in enumerate(examples):
            print(f"--- Example {i+1} ---")
            print(example)
            print("----------------")

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print(f"Usage: {sys.argv[0]} <slurm_log_file>")
        sys.exit(1)
    analyze_slurm_log(sys.argv[1])
Implementing checkpoint-based recovery in a Slurm job:
#!/bin/bash
#SBATCH --job-name=llm-training
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --time=24:00:00
#SBATCH --signal=B:USR1@300          # signal the batch shell 5 minutes before the time limit
# Signal handler: save a checkpoint before the job is killed
trap "echo 'Job about to be cancelled, saving checkpoint...'; python save_checkpoint.py; exit 1" USR1
# Resume from an existing checkpoint if one is present
if [ -f "checkpoint_latest.pt" ]; then
    echo "Resuming from checkpoint"
    RESUME_ARGS="--resume_from_checkpoint checkpoint_latest.pt"
else
    RESUME_ARGS=""
fi
# Run training in the background so the trap can fire while srun is still running
srun python train_llm.py $RESUME_ARGS &
wait
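An alternative sketch leans on Slurm's built-in requeue mechanism instead of an external retry loop (it assumes, as above, that the training script reloads checkpoint_latest.pt on startup):
# (relevant excerpt of a batch script)
#SBATCH --requeue                    # allow this job to be requeued
#SBATCH --signal=B:USR1@300
# On the warning signal, save state and put the job back in the queue
trap "python save_checkpoint.py; scontrol requeue $SLURM_JOB_ID; exit 0" USR1
srun python train_llm.py --resume_from_checkpoint checkpoint_latest.pt &
wait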
A wrapper script can also resubmit failed jobs automatically, checking the final state through accounting:
#!/bin/bash
# Training submission wrapper with automatic retries
MAX_RETRIES=3
RETRY_COUNT=0
SUCCESS=0
while [ $RETRY_COUNT -lt $MAX_RETRIES ] && [ $SUCCESS -eq 0 ]; do
    # Submit the job and capture its ID
    JOB_ID=$(sbatch --parsable train_script.sh)
    echo "Submitting job attempt $((RETRY_COUNT+1)): $JOB_ID"
    # Wait for the job to leave the queue (poll, since scontrol has no wait-for-completion mode)
    while squeue -h -j $JOB_ID 2>/dev/null | grep -q .; do
        sleep 60
    done
    # Check the final job state
    JOB_STATE=$(sacct -j $JOB_ID --format=State --noheader | head -n 1 | tr -d ' ')
    if [[ "$JOB_STATE" == "COMPLETED" ]]; then
        echo "Job completed successfully!"
        SUCCESS=1
    else
        echo "Job failed with state: $JOB_STATE"
        RETRY_COUNT=$((RETRY_COUNT+1))
        # Back off before retrying
        WAIT_TIME=$((RETRY_COUNT * 10 * 60))   # increasing wait time
        echo "Waiting $WAIT_TIME seconds before retrying..."
        sleep $WAIT_TIME
    fi
done
if [ $SUCCESS -eq 0 ]; then
    echo "Job failed after $MAX_RETRIES attempts"
    exit 1
fi
Slurm also supports elastic jobs whose node count can vary:
#!/bin/bash
#SBATCH --job-name=llm-elastic
#SBATCH --nodes=4-8                 # minimum 4 nodes, maximum 8 nodes
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --time=14-00:00:00
# Export values the application can use to react to node changes
export SLURM_NNODES=$SLURM_NNODES
export SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST
# Use srun's fault-tolerant options
srun --kill-on-bad-exit=0 --no-kill python elastic_trainer.py
A sketch of a more advanced elastic training framework:
# elastic_trainer.py
import os
import time
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
import subprocess
import socket

class ElasticTrainer:
    def __init__(self):
        self.rank = int(os.environ['SLURM_PROCID'])
        self.world_size = int(os.environ['SLURM_NTASKS'])
        self.local_rank = int(os.environ['SLURM_LOCALID'])
        self.nnodes = int(os.environ['SLURM_NNODES'])
        # Expand the compressed nodelist (e.g. "gpu[01-08]") into hostnames
        self.node_list = subprocess.check_output(
            ['scontrol', 'show', 'hostnames', os.environ['SLURM_JOB_NODELIST']]
        ).decode().split()
        # Initialize the distributed environment
        self.initialize_distributed()
        # Node monitoring state
        self.node_change_detected = False
        self.last_node_list = self.node_list.copy()
        # Load model and data
        self.model = self.load_model()
        self.optimizer = self.configure_optimizer()
        self.dataloader = self.prepare_dataloader()
        # Try to resume from a checkpoint
        self.checkpoint_path = "checkpoint_latest.pt"
        self.start_epoch = self.load_checkpoint() if os.path.exists(self.checkpoint_path) else 0

    def initialize_distributed(self):
        # The first node acts as the rendezvous master
        master_addr = self.node_list[0]
        master_port = 6000
        # Initialize the process group
        os.environ['MASTER_ADDR'] = master_addr
        os.environ['MASTER_PORT'] = str(master_port)
        os.environ['NCCL_DEBUG'] = 'INFO'
        dist.init_process_group(
            backend='nccl',
            rank=self.rank,
            world_size=self.world_size
        )
        # Bind this process to its GPU
        torch.cuda.set_device(self.local_rank)
        print(f"Rank {self.rank}/{self.world_size} initialized on {socket.gethostname()}")

    def load_model(self):
        # Placeholder model; replace with the real LLM
        model = torch.nn.Linear(1024, 1024).to(f'cuda:{self.local_rank}')
        return DDP(model, device_ids=[self.local_rank])

    def configure_optimizer(self):
        return torch.optim.AdamW(self.model.parameters(), lr=1e-4)

    def prepare_dataloader(self):
        # Placeholder data pipeline; replace with the real sharded dataset/dataloader
        return [torch.randn(8, 1024, device=f'cuda:{self.local_rank}') for _ in range(100)]

    def monitor_nodes(self):
        # Check whether the job's node list has changed
        nodelist = subprocess.check_output(
            ['squeue', '-h', '-j', os.environ['SLURM_JOB_ID'], '-o', '%N']
        ).decode().strip()
        current_node_list = subprocess.check_output(
            ['scontrol', 'show', 'hostnames', nodelist]
        ).decode().split()
        if set(current_node_list) != set(self.last_node_list):
            self.node_change_detected = True
            self.last_node_list = current_node_list.copy()
            print(f"Node list changed: {current_node_list}")

    def save_checkpoint(self, epoch, model_state, optimizer_state):
        if self.rank == 0:  # only the main process writes
            checkpoint = {
                'epoch': epoch,
                'model_state_dict': model_state,
                'optimizer_state_dict': optimizer_state
            }
            torch.save(checkpoint, self.checkpoint_path)
            print(f"Checkpoint saved for epoch {epoch}")

    def load_checkpoint(self):
        # All ranks read the checkpoint from the shared filesystem
        if self.rank == 0:
            print(f"Loading checkpoint from {self.checkpoint_path}")
        checkpoint = torch.load(self.checkpoint_path, map_location=f'cuda:{self.local_rank}')
        # Keep all processes in sync
        dist.barrier()
        return checkpoint['epoch']

    def train(self, max_epochs=100):
        for epoch in range(self.start_epoch, max_epochs):
            # Rank 0 checks for node changes and broadcasts a flag to everyone
            if self.rank == 0:
                self.monitor_nodes()
                change_tensor = torch.tensor(
                    [1.0 if self.node_change_detected else 0.0], device='cuda')
            else:
                change_tensor = torch.zeros(1, device='cuda')
            dist.broadcast(change_tensor, src=0)
            self.node_change_detected = change_tensor.item() == 1
            # On a node change, save a checkpoint and exit
            if self.node_change_detected:
                print(f"Node change detected at epoch {epoch}, saving checkpoint...")
                self.save_checkpoint(epoch, self.model.state_dict(), self.optimizer.state_dict())
                dist.destroy_process_group()
                # The process exits here and Slurm reschedules the job
                return
            # Regular training loop
            self.model.train()
            for batch in self.dataloader:
                # training step...
                pass
            # Periodic checkpointing
            if epoch % 5 == 0:
                self.save_checkpoint(epoch, self.model.state_dict(), self.optimizer.state_dict())

if __name__ == "__main__":
    trainer = ElasticTrainer()
    trainer.train()
Configure account-based fair-share scheduling:
# Enable fair-share in slurm.conf
PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityWeightFairshare=100000
# Account configuration (stored in slurmdbd)
sacctmgr add account research Parent=root Fairshare=100
sacctmgr add account production Parent=root Fairshare=200
sacctmgr add user alice Account=research
sacctmgr add user bob Account=production
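Once fair-share is active, its effect can be inspected directly; a quick check of usage and job priorities:
# Per-account/user fair-share usage and factors
sshare -a
# How each pending job's priority decomposes (age, fair-share, partition, QOS)
sprio -l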
Reserve resources for important training runs:
# Create a reservation
scontrol create reservation name=llm_training start=2025-06-01T00:00:00 duration=7-00:00:00 nodes=gpu[01-32] users=research_team flags=Maint
# List reservations
scontrol show reservation
# Submit a job into the reservation
sbatch --reservation=llm_training job_script.sh
Use QOS levels to distinguish jobs of different priority:
# Create a QOS
sacctmgr add qos high_priority Priority=1000 GraceTime=0 MaxWall=14-00:00:00
# Assign the QOS to a user
sacctmgr modify user where name=alice set QOS=high_priority
# Request the QOS at submission time
sbatch --qos=high_priority job_script.sh
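To confirm which limits are actually in force, the QOS and association tables can be listed directly; a quick check (the format fields are standard sacctmgr columns):
# Show configured QOS levels and their limits
sacctmgr show qos format=Name,Priority,GraceTime,MaxWall,MaxTRESPU%30
# Show which QOS each user/account association may use
sacctmgr show assoc format=Account,User,QOS%40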
Scheduling can be tuned further for large training workloads; in practice this means tuning the backfill scheduler and TRES-aware selection:
# Backfill scheduler tuning in slurm.conf
SchedulerType=sched/backfill
SchedulerParameters=bf_continue,bf_window=20160,bf_max_job_user=50   # bf_window (minutes) should cover the longest job walltime
# TRES-aware resource selection
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
Slurm can also integrate with node power management:
# Enable power saving in slurm.conf: idle nodes are suspended and resumed on demand
# (the script paths are site-specific)
SuspendProgram=/etc/slurm/node_suspend.sh
ResumeProgram=/etc/slurm/node_resume.sh
SuspendTime=600           # suspend nodes idle for more than 10 minutes
SuspendTimeout=60
ResumeTimeout=300
# Keep critical nodes out of power saving
SuspendExcNodes=gpu[01-08]
Monitoring the energy consumption of LLM training:
#!/bin/bash
#SBATCH --job-name=energy-monitor
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=8
# Start background energy monitoring on every node
for node in $(scontrol show hostnames $SLURM_JOB_NODELIST); do
    ssh $node "nvidia-smi dmon -d 10 -s pucvmet > /tmp/energy_${node}.log &"
done
# Run the training job
srun python train_llm.py
# Collect the energy logs
for node in $(scontrol show hostnames $SLURM_JOB_NODELIST); do
    scp $node:/tmp/energy_${node}.log ./energy_logs/
done
# Generate an energy report
python analyze_energy.py --logs-dir ./energy_logs/ --output ./energy_report.html
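If the cluster is configured with an energy-gathering plugin (AcctGatherEnergyType, e.g. RAPL or IPMI based), per-job energy also appears in the accounting records; a quick query (the job ID is a placeholder):
# Energy consumed by a finished job
sacct -j 123456 --format=JobID,Elapsed,ConsumedEnergy,AllocTRES%40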
Greener scheduling policies can be expressed through node features, QOS priorities, and scheduler parameters:
# Submit to energy-efficient nodes via a feature constraint
sbatch --constraint=energy-efficient job_script.sh
# Give QOS levels priorities that favour energy-efficient workloads
sacctmgr modify qos where name=standard set Priority=500
# Keep the backfill scheduler running continuously
SchedulerParameters=bf_continue
Check node health regularly:
#!/bin/bash
# Node health-check script
# Check GPU health
check_gpu_health() {
    echo "Checking GPU health..."
    if ! nvidia-smi -q > /dev/null 2>&1; then
        echo "ERROR: nvidia-smi command failed"
        return 1
    fi
    # Look for GPUs in an error state
    if nvidia-smi -q | grep -A 5 "GPU Operation Mode" | grep -q "Error"; then
        echo "ERROR: GPU in error state detected"
        return 1
    fi
    echo "GPU health check passed"
    return 0
}
# Check InfiniBand health
check_ib_health() {
    echo "Checking InfiniBand health..."
    if ! ibstat > /dev/null 2>&1; then
        echo "WARNING: ibstat command not available"
        return 0  # not fatal
    fi
    # Check port state
    if ibstat | grep -A 5 "Port 1" | grep -q "State: Down"; then
        echo "ERROR: InfiniBand port down"
        return 1
    fi
    echo "InfiniBand health check passed"
    return 0
}
# Run all checks
check_gpu_health && check_ib_health
# On failure, mark the node DOWN
if [ $? -ne 0 ] && [ -n "$SLURMD_NODENAME" ]; then
    echo "Setting node $SLURMD_NODENAME to DOWN state"
    scontrol update nodename=$SLURMD_NODENAME state=DOWN reason="Health check failed"
fi
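Slurm can run a script like this automatically on every node; the relevant slurm.conf parameters (the script path here is an assumption for this sketch):
# Periodic node health checking in slurm.conf
HealthCheckProgram=/etc/slurm/healthcheck.sh
HealthCheckInterval=300
HealthCheckNodeState=ANY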
Combining Slurm with a cloud platform enables automatic scaling of the cluster:
#!/usr/bin/env python3
import time
import subprocess
import json
import requests

class SlurmAutoScaler:
    def __init__(self, config_file="autoscaler_config.json"):
        with open(config_file, 'r') as f:
            self.config = json.load(f)
        self.min_nodes = self.config['min_nodes']
        self.max_nodes = self.config['max_nodes']
        self.scale_up_threshold = self.config['scale_up_threshold']
        self.scale_down_threshold = self.config['scale_down_threshold']
        self.check_interval = self.config['check_interval']

    def get_cluster_status(self):
        # Query the current cluster state (sum node counts across partitions)
        idle_nodes = int(subprocess.check_output(
            "sinfo -h -t idle -o %D | awk '{s+=$1} END {print s+0}'", shell=True
        ).decode().strip() or 0)
        active_nodes = int(subprocess.check_output(
            "sinfo -h -t alloc,completing -o %D | awk '{s+=$1} END {print s+0}'", shell=True
        ).decode().strip() or 0)
        pending_jobs = int(subprocess.check_output(
            "squeue -t pending -h | wc -l", shell=True
        ).decode().strip() or 0)
        return {
            'idle_nodes': idle_nodes,
            'active_nodes': active_nodes,
            'total_nodes': idle_nodes + active_nodes,
            'pending_jobs': pending_jobs
        }

    def scale_up(self):
        status = self.get_cluster_status()
        # How many extra nodes are needed
        needed_nodes = max(0, self.scale_up_threshold - status['idle_nodes'])
        new_total = min(self.max_nodes, status['total_nodes'] + needed_nodes)
        if new_total > status['total_nodes']:
            nodes_to_add = new_total - status['total_nodes']
            print(f"Scaling up: adding {nodes_to_add} nodes")
            # Call the cloud provider API to add nodes
            self.add_cloud_nodes(nodes_to_add)
            # Wait for the new nodes to become available
            self.wait_for_nodes(nodes_to_add)

    def scale_down(self):
        status = self.get_cluster_status()
        # Shrink the cluster when too many nodes sit idle
        excess_nodes = max(0, status['idle_nodes'] - self.scale_down_threshold)
        new_total = max(self.min_nodes, status['total_nodes'] - excess_nodes)
        if new_total < status['total_nodes']:
            nodes_to_remove = status['total_nodes'] - new_total
            print(f"Scaling down: removing {nodes_to_remove} nodes")
            # Pick idle nodes that can be removed
            idle_node_list = subprocess.check_output(
                "sinfo -t idle -h -o '%N' | head -n 1", shell=True
            ).decode().strip()
            if idle_node_list:
                # Drain the nodes so no new work lands on them
                subprocess.run(
                    f"scontrol update nodename={idle_node_list} state=drain reason=autoscale",
                    shell=True
                )
                # Give running work time to finish
                time.sleep(300)
                # Take the nodes out of the cluster
                subprocess.run(
                    f"scontrol update nodename={idle_node_list} state=down reason=autoscale",
                    shell=True
                )
                # Call the cloud provider API to terminate the instances
                self.remove_cloud_nodes(idle_node_list)

    def run(self):
        print("Starting auto-scaler...")
        while True:
            try:
                status = self.get_cluster_status()
                print(f"Current status: {status}")
                # Scale up or down according to policy
                if status['pending_jobs'] > 0 and status['idle_nodes'] < self.scale_up_threshold:
                    self.scale_up()
                elif status['idle_nodes'] > self.scale_down_threshold:
                    self.scale_down()
            except Exception as e:
                print(f"Error in auto-scaler: {e}")
            time.sleep(self.check_interval)

if __name__ == "__main__":
    autoscaler = SlurmAutoScaler()
    autoscaler.run()
Backing up and restoring the Slurm configuration:
#!/bin/bash
# Slurm configuration backup script
BACKUP_DIR="/backup/slurm/$(date +%Y%m%d)"
# Create the backup directory
mkdir -p $BACKUP_DIR
# Back up the key configuration files
cp /etc/slurm/slurm.conf $BACKUP_DIR/
cp /etc/slurm/gres.conf $BACKUP_DIR/
cp /etc/slurm/cgroup.conf $BACKUP_DIR/
cp /etc/slurm/topology.conf $BACKUP_DIR/
cp /etc/slurm/slurmdbd.conf $BACKUP_DIR/
# Back up the accounting database (if slurmdbd is in use)
if systemctl is-active --quiet slurmdbd; then
    # Dump accounts, associations and QOS definitions
    sacctmgr dump llm-training-cluster file=$BACKUP_DIR/slurm_assoc_dump.cfg
    # Alternatively, use the database's own backup tool
    # mysqldump -u slurm -p slurm_acct_db > $BACKUP_DIR/slurm_acct_db.sql
fi
# Compress the backup
cd /backup/slurm/
tar -czf slurm_backup_$(date +%Y%m%d).tar.gz $(date +%Y%m%d)/
# Remove backups older than 30 days
find /backup/slurm/ -name "slurm_backup_*.tar.gz" -mtime +30 -delete
echo "Backup completed: /backup/slurm/slurm_backup_$(date +%Y%m%d).tar.gz"
Advanced security settings for a Slurm cluster:
# Authentication in slurm.conf
AuthType=auth/munge
CryptoType=crypto/munge
# Filter and restrict job submissions with a job_submit plugin (typically a site-specific Lua script)
JobSubmitPlugins=job_submit/lua
# Access control is applied per partition, e.g. AllowGroups=research,admin on the gpu partition
# Global job and task limits
MaxJobCount=10000
MaxTasksPerNode=256
# Walltime and size limits are enforced per association/QOS (e.g. sacctmgr ... set MaxWall=30-00:00:00)
# Accounting provides the audit trail
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageEnforce=associations,limits,qos
Recommended configuration for the DeepSpeed framework with Slurm:
#!/bin/bash
#SBATCH --job-name=deepspeed-llm
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=16
#SBATCH --time=14-00:00:00
# Load environment modules
module load cuda/12.3
module load nccl/2.18.3
module load python/3.10
# Activate the virtual environment
source venv/bin/activate
# DeepSpeed environment variables
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=6000
# Launch DeepSpeed training
srun --output=ds_llm_%j_%N.log \
    deepspeed \
    --num_gpus 8 \
    train_llm_ds.py \
    --deepspeed ds_config.json \
    --model_size 10B \
    --batch_size 8 \
    --gradient_accumulation_steps 8
Configuring Megatron-LM on Slurm:
#!/bin/bash
#SBATCH --job-name=megatron-llm
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=16
#SBATCH --time=14-00:00:00
# Environment variables
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000
NODE_RANK=$SLURM_NODEID          # unused with torchrun's c10d rendezvous, kept for reference
WORLD_SIZE=$(($SLURM_NNODES * 8))
# Model-parallel configuration
TP_SIZE=8    # tensor-parallel size
PP_SIZE=4    # pipeline-parallel size
DP_SIZE=$((WORLD_SIZE / (TP_SIZE * PP_SIZE)))
# Launch Megatron-LM: one torchrun per node (torchrun replaces the deprecated
# torch.distributed.launch; ranks rendezvous via c10d, so no per-node rank is passed)
srun --ntasks=$SLURM_NNODES --ntasks-per-node=1 --gres=gpu:8 --output=megatron_%j_%N.log \
    torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=8 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    --rdzv_id=$SLURM_JOB_ID \
    pretrain_gpt.py \
    --tensor-model-parallel-size $TP_SIZE \
    --pipeline-model-parallel-size $PP_SIZE \
    --model-size 175B \
    --num-layers 96 \
    --hidden-size 12288 \
    --num-attention-heads 96 \
    --micro-batch-size 4 \
    --global-batch-size 512 \
    --data-path /data/megatron_dataset/ \
    --save /results/megatron_checkpoints/
Integrating Slurm with Grafana for visual monitoring:
# Install Prometheus and Grafana
sudo apt-get update
sudo apt-get install -y prometheus grafana
# Install node_exporter
git clone https://github.com/prometheus/node_exporter
cd node_exporter
make
sudo cp node_exporter /usr/local/bin/
# Create a systemd service
cat << EOF > /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=prometheus
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
# Configure Prometheus to scrape Slurm metrics
cat << EOF > /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'slurm'
    static_configs:
      - targets: ['localhost:9341']
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
EOF
sudo systemctl restart prometheus
sudo systemctl restart grafana-server
Integrating Slurm with higher-level workflow management systems:
# Workflow example integrating Slurm with a workflow manager
# (illustrative Python-style pseudo-API; Nextflow itself uses a Groovy DSL, but the structure is the same)
from nextflow import Workflow, Process

class LLMWorkflow(Workflow):
    def __init__(self):
        super().__init__(name="llm_training_pipeline")
        # Data preprocessing process
        self.preprocess = Process(
            name="preprocess",
            script="""
            python preprocess.py --input $input_data --output $output_dir
            """,
            slurm_config={
                "nodes": 1,
                "cpus_per_task": 32,
                "time": "24:00:00",
                "partition": "standard"
            }
        )
        # Training process
        self.train = Process(
            name="train",
            script="""
            python train.py --data $input_dir --output $output_dir --config $config
            """,
            slurm_config={
                "nodes": 8,
                "ntasks_per_node": 8,
                "gres": "gpu:8",
                "time": "168:00:00",
                "partition": "gpu"
            }
        )
        # Evaluation process
        self.evaluate = Process(
            name="evaluate",
            script="""
            python evaluate.py --model $model_dir --data $eval_data --output $output_dir
            """,
            slurm_config={
                "nodes": 1,
                "ntasks_per_node": 8,
                "gres": "gpu:8",
                "time": "24:00:00",
                "partition": "gpu"
            }
        )
        # Wire up the dependencies
        self.preprocess >> self.train >> self.evaluate

    def run(self, input_data, config_file, eval_data):
        # Execute the workflow
        return self.execute({
            "preprocess.input_data": input_data,
            "train.config": config_file,
            "evaluate.eval_data": eval_data
        })

# Using the workflow
workflow = LLMWorkflow()
result = workflow.run(
    input_data="/data/raw_corpus",
    config_file="configs/llm_config.json",
    eval_data="/data/eval_dataset"
)
A Slurm configuration for frontier-scale (GPT-4-class) model training:
#!/bin/bash
#SBATCH --job-name=gpt4-training
#SBATCH --nodes=256
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=16
#SBATCH --time=30-00:00:00
#SBATCH --partition=reserved
#SBATCH --switches=8
# Parallelism configuration
TP_SIZE=8      # tensor parallel
PP_SIZE=16     # pipeline parallel
DP_SIZE=$((256 * 8 / (TP_SIZE * PP_SIZE)))   # data parallel = 2048 GPUs / (8 * 16) = 16
# Environment variables
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=6000
# Network tuning
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=0
export NCCL_IB_GID_INDEX=3
export NCCL_IB_HCA=mlx5_
export NCCL_NET_GDR_LEVEL=2
export NCCL_IB_QPS_PER_CONNECTION=8
export NCCL_IB_TC=106
export NCCL_IB_TIMEOUT=23
export NCCL_IB_RETRY_CNT=7
export NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_IB_CUDA_SUPPORT=1
# Memory optimization
export LD_PRELOAD=/lib64/libhugetlbfs.so
export HUGETLB_MORECORE=yes
export HUGETLB_DEFAULT_PAGE_SIZE=1G
# Launch distributed training: one torchrun per node (replaces the deprecated torch.distributed.launch)
# global batch = micro batch (8) x data parallel (16) x gradient accumulation (128) = 16384
srun --ntasks=$SLURM_NNODES --ntasks-per-node=1 --gres=gpu:8 \
    --output=gpt4_training_%j_%N.log \
    --error=gpt4_training_%j_%N.err \
    --label \
    torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=8 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    --rdzv_id=$SLURM_JOB_ID \
    train_gpt.py \
    --tensor-model-parallel-size $TP_SIZE \
    --pipeline-model-parallel-size $PP_SIZE \
    --model-size 1.8T \
    --num-layers 128 \
    --hidden-size 16384 \
    --num-attention-heads 128 \
    --micro-batch-size 8 \
    --global-batch-size 16384 \
    --gradient-accumulation-steps 128 \
    --data-path /data/gpt_dataset/ \
    --save /results/gpt4_checkpoints/ \
    --save-interval 1000 \
    --log-interval 100
Tuning Slurm for a multi-tenant research cluster:
# Multi-tenant partition layout
PartitionName=research Nodes=node[01-64] Default=YES MaxTime=7-00:00:00 State=UP
PartitionName=teaching Nodes=node[65-96] MaxTime=1-00:00:00 State=UP
PartitionName=urgent Nodes=node[01-96] MaxTime=24:00:00 State=UP
PartitionName=long Nodes=node[01-32] MaxTime=28-00:00:00 State=UP
# Fair-share configuration
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityUsageResetPeriod=14-0
PriorityWeightFairshare=100000
PriorityWeightAge=1000
PriorityWeightPartition=10000
PriorityWeightQOS=100000
# Memory defaults and limits
DefMemPerNode=0
MaxMemPerNode=1024000
MinJobAge=300
An elastic Slurm workflow in an enterprise environment:
#!/bin/bash
# Enterprise-style elastic training workflow
# Submit the base training job
BASE_JOB_ID=$(sbatch --parsable \
    --job-name=elastic-llm \
    --nodes=4 \
    --ntasks-per-node=8 \
    --gres=gpu:8 \
    --time=168:00:00 \
    --output=elastic_llm_%j.log \
    elastic_train_base.sh)
# Submit a monitoring job that adjusts resources based on queue state
MONITOR_JOB_ID=$(sbatch --parsable \
    --job-name=elastic-monitor \
    --nodes=1 \
    --dependency=afterany:$BASE_JOB_ID \
    --output=elastic_monitor_%j.log \
    elastic_monitor.sh $BASE_JOB_ID)
# Submit a cleanup job
CLEANUP_JOB_ID=$(sbatch --parsable \
    --job-name=elastic-cleanup \
    --nodes=1 \
    --dependency=afterany:$MONITOR_JOB_ID \
    --output=elastic_cleanup_%j.log \
    elastic_cleanup.sh)
echo "Submitted elastic workflow: $BASE_JOB_ID -> $MONITOR_JOB_ID -> $CLEANUP_JOB_ID"
Slurm itself continues to evolve; one of the most prominent recent directions is cloud-native deployment, with Slurm running on or alongside Kubernetes:
# Example: Slurm-on-Kubernetes integration
# Install a Slurm Kubernetes operator
helm repo add slurm-operator https://slurm-operator.github.io/helm-charts/
helm install slurm-operator slurm-operator/slurm-operator
# Create a Slurm cluster definition
cat << EOF | kubectl apply -f -
apiVersion: slurm.slurm-operator.io/v1alpha1
kind: SlurmCluster
metadata:
  name: llm-training-cluster
spec:
  controlPlane:
    replicas: 1
    resources:
      requests:
        cpu: 2
        memory: 4Gi
  computeNodes:
    replicas: 32
    resources:
      requests:
        cpu: 64
        memory: 512Gi
        nvidia.com/gpu: 8
  partitions:
    - name: gpu
      default: true
      maxTime: 168h
EOF
Sustainability is also becoming a first-class concern, with power-aware scheduling, node power saving, and per-job energy accounting of the kind shown earlier.
As the de facto standard for managing large clusters, Slurm plays an irreplaceable role in today's LLM training. This article has walked through Slurm configuration, optimization, and best practices, from basic architecture to advanced scheduling strategies.
With a well-tuned Slurm setup, research teams and companies can markedly improve the efficiency, reliability, and cost-effectiveness of LLM training. The key levers covered here include accurate resource requests and GRES configuration, topology-aware placement and NCCL tuning, checkpointing with automatic requeue and retries, fair-share, QOS, and reservations for shared clusters, and continuous monitoring of performance, node health, and energy use.
As model sizes and compute demands keep climbing, efficient cluster management only grows in importance: in practice, an optimized Slurm configuration can raise training efficiency by 30-50% while cutting operating costs and energy consumption.
Looking ahead, Slurm will continue to move toward cloud-native operation, workload-aware scheduling, and sustainable computing, providing ever more capable tooling for training the next generation of very large language models. Mastering advanced Slurm administration will remain a key capability for anyone training models at this scale.