Namespace
Namespace = QCE/TI_TRAINTASK
Monitoring metrics
Metric name | Metric display name | Description | Unit | Dimension | Statistical rules [period, statType] |
CfsClientDataReadBandwidth | TurboCFS single-node server-side read bandwidth | TurboCFS single-node server-side read bandwidth | KBytes/s | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
CfsClientDataWriteBandwidth | TurboCFS single-node server-side write bandwidth | TurboCFS single-node server-side write bandwidth | KBytes/s | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
CfsDataReadIoBytes | CFS server-side read bandwidth | CFS server-side read bandwidth | KBytes/s | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
CfsDataReadIoLatency | CFS read latency | CFS read latency | ms | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
CfsDataWriteIoBytes | CFS server-side write bandwidth | CFS server-side write bandwidth | KBytes/s | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
CfsDataWriteIoLatency | CFS write latency | CFS write latency | ms | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
CfsStrageUsageGb | CFS stored data volume | CFS stored data volume | GBytes | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
Cpuutil | CPU utilization | CPU utilization | % | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
DcgmFiDevFbUsed | GPU memory usage | GPU memory usage | MBytes | taskInsGpuNum | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
DcgmFiDevGpuUtil | GPU utilization | GPU utilization | % | taskInsGpuNum | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
DcgmFiDevMemCopyUtil | GPU memory utilization | GPU memory utilization | % | taskInsGpuNum | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
DiskIoUtil | Disk I/O utilization (ioutil) | Disk I/O utilization (ioutil) | % | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
DiskIoWait | Disk iowait | Disk iowait | % | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
DiskReadByte | Disk read bandwidth | Disk read bandwidth | MBytes/s | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
DiskReadIops | Disk read IOPS | Disk read IOPS | Count | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
DiskUsageRadio | System disk partition utilization | System disk partition utilization | % | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
DiskWriteByte | Disk write bandwidth | Disk write bandwidth | MBytes/s | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
DiskWriteIops | Disk write IOPS | Disk write IOPS | Count | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
Fp16EngineActivity | FP16 active time ratio | FP16 active time ratio | % | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
Fp32EngineActivity | FP32 active time ratio | FP32 active time ratio | % | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
Fp64EngineActivity | FP64 active time ratio | FP64 active time ratio | % | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
GpuFp16EngineActivity | FP16 active time ratio | FP16 active time ratio | % | taskInsGpuNum | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
GpuFp32EngineActivity | FP32 active time ratio | FP32 active time ratio | % | taskInsGpuNum | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
GpuFp64EngineActivity | FP64 active time ratio | FP64 active time ratio | % | taskInsGpuNum | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
Gpumemutil | GPU memory utilization | GPU memory utilization | % | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
Gpumemvalue | GPU memory usage | GPU memory usage | MBytes | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
GpuNvlinkBandwidth | NVLink transfer rate | NVLink transfer rate | Bytes/s | taskInsGpuNum | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
GpuPcieBandwidth | PCIe bus transfer rate | PCIe bus transfer rate | Bytes/s | taskInsGpuNum | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
GpuSmActivity | SM active time ratio | SM active time ratio | % | taskInsGpuNum | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
GpuTensorActivity | Tensor active time ratio | Tensor active time ratio | % | taskInsGpuNum | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
Gpuutil | GPU utilization | GPU utilization | % | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
GroupCpuUsage | CPU utilization | CPU utilization | % | group_id | [10s, avg], [60s, avg], [300s, avg] |
GroupGpuUtil | GPU utilization | GPU utilization | % | group_id | [10s, avg], [60s, avg], [300s, avg] |
GroupLanInTraffic | Private network inbound bandwidth | Private network inbound bandwidth | Mbps | group_id | [10s, avg], [60s, avg], [300s, avg] |
GroupLanOutTraffic | Private network outbound bandwidth | Private network outbound bandwidth | Mbps | group_id | [10s, avg], [60s, avg], [300s, avg] |
GroupMemUsage | Memory utilization | Memory utilization | % | group_id | [10s, avg], [60s, avg], [300s, avg] |
GroupWanInTraffic | Public network inbound bandwidth | Public network inbound bandwidth | Mbps | group_id | [10s, avg], [60s, avg], [300s, avg] |
GroupWanOutratio | Public network bandwidth utilization | Public network bandwidth utilization | % | group_id | [10s, avg], [60s, avg], [300s, avg] |
GroupWanOutTraffic | Public network outbound bandwidth | Public network outbound bandwidth | Mbps | group_id | [10s, avg], [60s, avg], [300s, avg] |
InsCpuUsage | CPU utilization | CPU utilization | % | instance_id | [10s, avg], [60s, avg], [300s, avg] |
InsGpuUtil | GPU utilization | GPU utilization | % | instance_id | [10s, avg], [60s, avg], [300s, avg] |
InsLanInTraffic | Private network inbound bandwidth | Private network inbound bandwidth | Mbps | instance_id | [10s, avg], [60s, avg], [300s, avg] |
InsLanOutTraffic | Private network outbound bandwidth | Private network outbound bandwidth | Mbps | instance_id | [10s, avg], [60s, avg], [300s, avg] |
InsMemUsage | Memory utilization | Memory utilization | % | instance_id | [10s, avg], [60s, avg], [300s, avg] |
Instancecpuutil | CPU utilization | CPU utilization | % | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
Instancegpumemutil | GPU memory utilization | GPU memory utilization | % | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
Instancegpumemvalue | GPU memory usage | GPU memory usage | MBytes | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
Instancegpuutil | GPU utilization | GPU utilization | % | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
Instancememutil | Memory utilization | Memory utilization | % | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
Instancememvalue | Memory usage | Memory usage | MBytes | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
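InsWanInTraffic | Public network inbound bandwidth | Public network inbound bandwidth | Mbps | instance_id | [10s, avg], [60s, avg], [300s, avg] |
InsWanOutratio | Public network bandwidth utilization | Public network bandwidth utilization | % | instance_id | [10s, avg], [60s, avg], [300s, avg] |
InsWanOutTraffic | Public network outbound bandwidth | Public network outbound bandwidth | Mbps | instance_id | [10s, avg], [60s, avg], [300s, avg] |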
Memutil | Memory utilization | Memory utilization | % | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
Memvalue | Memory usage | Memory usage | MBytes | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
NvlinkBandwidth | NVLink transfer rate | NVLink transfer rate | Bytes/s | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
PcieBandwidth | PCIe bus transfer rate | PCIe bus transfer rate | Bytes/s | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
RdmaInpkt | RDMA NIC inbound packets | RDMA NIC inbound packets | pps | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
RdmaIntraffic | RDMA NIC receive bandwidth | RDMA NIC receive bandwidth | Mbps | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
RdmaOutpkt | RDMA NIC outbound packets | RDMA NIC outbound packets | pps | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
RdmaOuttraffic | RDMA NIC transmit bandwidth | RDMA NIC transmit bandwidth | Mbps | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
SmActivity | SM active time ratio | SM active time ratio | % | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
TaskCfsClientDataReadBandwidth | TurboCFS single-node server-side read bandwidth | TurboCFS single-node server-side read bandwidth | KBytes/s | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
TaskCfsClientDataWriteBandwidth | TurboCFS single-node server-side write bandwidth | TurboCFS single-node server-side write bandwidth | KBytes/s | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
TaskCfsDataReadIoBytes | CFS server-side read bandwidth | CFS server-side read bandwidth | KBytes/s | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
TaskCfsDataReadIoLatency | CFS read latency | CFS read latency | ms | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
TaskCfsDataWriteIoBytes | CFS server-side write bandwidth | CFS server-side write bandwidth | KBytes/s | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
TaskCfsDataWriteIoLatency | CFS write latency | CFS write latency | ms | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
TaskCfsStrageUsageGb | CFS stored data volume | CFS stored data volume | GBytes | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
TaskDiskIoUtil | Disk I/O utilization (ioutil) | Disk I/O utilization (ioutil) | % | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
TaskDiskIoWait | Disk iowait | Disk iowait | % | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
TaskDiskReadByte | Disk read bandwidth | Disk read bandwidth | MBytes/s | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
TaskDiskReadIops | Disk read IOPS | Disk read IOPS | Count | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
TaskDiskUsageRadio | System disk partition utilization | System disk partition utilization | % | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
TaskDiskWriteByte | Disk write bandwidth | Disk write bandwidth | MBytes/s | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
TaskDiskWriteIops | Disk write IOPS | Disk write IOPS | Count | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
TaskFp16EngineActivity | FP16 active time ratio | FP16 active time ratio | % | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
TaskFp32EngineActivity | FP32 active time ratio | FP32 active time ratio | % | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
TaskFp64EngineActivity | FP64 active time ratio | FP64 active time ratio | % | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
TaskNvlinkBandwidth | NVLink transfer rate | NVLink transfer rate | Bytes/s | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
TaskPcieBandwidth | PCIe bus transfer rate | PCIe bus transfer rate | Bytes/s | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
TaskRdmaInpkt | RDMA NIC inbound packets | RDMA NIC inbound packets | pps | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
TaskRdmaIntraffic | RDMA NIC receive bandwidth | RDMA NIC receive bandwidth | Mbps | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
TaskRdmaOutpkt | RDMA NIC outbound packets | RDMA NIC outbound packets | pps | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
TaskRdmaOuttraffic | RDMA NIC transmit bandwidth | RDMA NIC transmit bandwidth | Mbps | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
TaskSmActivity | SM active time ratio | SM active time ratio | % | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
TaskTensorActivity | Tensor active time ratio | Tensor active time ratio | % | TaskId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
TensorActivity | Tensor active time ratio | Tensor active time ratio | % | InstanceId | [10s, avg], [60s, avg], [300s, avg], [3600s, avg], [86400s, avg] |
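The statistical rules in the table above differ by dimension: metrics keyed by group_id or instance_id list only the 10s, 60s, and 300s periods, while metrics keyed by InstanceId, TaskId, or taskInsGpuNum also list 3600s and 86400s. A minimal Python sketch of a client-side check for this is shown here. The names `PERIODS_BY_DIMENSION` and `check_period` are hypothetical helpers for illustration and are not part of any SDK.

```python
# Periods (in seconds) listed in the metric table, keyed by dimension name.
# These constants and the helper below are illustrative, not an official API.
FULL_PERIODS = (10, 60, 300, 3600, 86400)
SHORT_PERIODS = (10, 60, 300)

PERIODS_BY_DIMENSION = {
    "InstanceId": FULL_PERIODS,
    "TaskId": FULL_PERIODS,
    "taskInsGpuNum": FULL_PERIODS,
    "instance_id": SHORT_PERIODS,
    "group_id": SHORT_PERIODS,
}


def check_period(dimension: str, period: int) -> int:
    """Return `period` if the table lists it for `dimension`, else raise ValueError."""
    try:
        allowed = PERIODS_BY_DIMENSION[dimension]
    except KeyError:
        raise ValueError(f"unknown dimension: {dimension!r}") from None
    if period not in allowed:
        raise ValueError(
            f"period {period}s is not listed for dimension {dimension!r}; "
            f"expected one of {allowed}"
        )
    return period
```

For example, `check_period("TaskId", 3600)` passes, whereas `check_period("group_id", 3600)` raises `ValueError` because resource-group metrics list no period longer than 300s.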
Overview of parameters for each dimension
Parameter name | Dimension name | Dimension description | Format |
Instances.N.Dimensions.0.Name | InstanceId | Training task instance ID | Enter the String-type dimension name: InstanceId |
Instances.N.Dimensions.0.Value | InstanceId | Training task instance ID | Enter the specific instance ID, for example: train-9187850047592xxxxx-6zaq3zh9mvpc-master-0 |
Instances.N.Dimensions.0.Name | TaskId | Training task ID | Enter the String-type dimension name: TaskId |
Instances.N.Dimensions.0.Value | TaskId | Training task/notebook ID | Enter the specific task ID, for example: train-9187850047592xxxxx |
Instances.N.Dimensions.0.Name | taskInsGpuNum | GPU card number used by the training task instance (GPU tasks only) | Enter the String-type dimension name: taskInsGpuNum |
Instances.N.Dimensions.0.Value | taskInsGpuNum | GPU card number used by the training task instance (GPU tasks only) | Enter the training task instance ID concatenated with the GPU card number or avg, for example: train-9187850047592xxxxx-6zaq3zh9mvpc-master-0-0, train-9187850047592xxxxx-6zaq3zh9mvpc-master-0-avg |
Instances.N.Dimensions.0.Name | instance_id | ID of a resource in the resource group | Enter the String-type dimension name: instance_id |
Instances.N.Dimensions.0.Value | instance_id | ID of a resource in the resource group | Enter the specific instance_id, for example: tins-dn8nkg82 |
Instances.N.Dimensions.0.Name | group_id | Resource group ID | Enter the String-type dimension name: group_id |
Instances.N.Dimensions.0.Value | group_id | Resource group ID | Enter the specific group_id, for example: trsg-8rpgr4k6 |
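The parameter names above can be assembled into a flat request dictionary. The sketch below is illustrative only: it builds the dimension parameters for a single instance (N = 0) and joins a training task instance ID with a GPU card number or avg for the taskInsGpuNum dimension, using the `-` separator seen in the examples. The helper names `gpu_dimension_value` and `build_params` are hypothetical, and the `MetricName` and `Period` keys are assumptions about the query API that this page does not itself list. No network call is made.

```python
NAMESPACE = "QCE/TI_TRAINTASK"  # namespace stated at the top of this page


def gpu_dimension_value(instance_id: str, gpu="avg") -> str:
    """Join a training task instance ID with a GPU card number or 'avg'.

    The '-' separator is inferred from the examples in the table
    (…-master-0-0 and …-master-0-avg).
    """
    return f"{instance_id}-{gpu}"


def build_params(metric_name: str, period: int,
                 dimension_name: str, dimension_value: str) -> dict:
    """Build flat query parameters for one instance (N = 0).

    'MetricName' and 'Period' are assumed parameter names; only the
    Instances.N.Dimensions.0.* names come from the table above.
    """
    return {
        "Namespace": NAMESPACE,
        "MetricName": metric_name,
        "Period": period,
        "Instances.0.Dimensions.0.Name": dimension_name,
        "Instances.0.Dimensions.0.Value": dimension_value,
    }


# Example: per-GPU utilization of card 0 on one training task instance.
params = build_params(
    "DcgmFiDevGpuUtil",
    60,
    "taskInsGpuNum",
    gpu_dimension_value("train-9187850047592xxxxx-6zaq3zh9mvpc-master-0", 0),
)
```

Keeping the dimension value construction in its own function makes it harder to pass a bare instance ID where the taskInsGpuNum dimension expects the card-number suffix.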