Prometheus official download page
https://prometheus.io/download/
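1. node_exporter
node_exporter is the exporter that the Prometheus project provides for collecting Linux system metrics. Besides node_exporter, the project also provides exporters for Consul, Memcached, HAProxy, MySQL (mysqld) and others.
Deploy node_exporter from the binary release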
wget https://github.com/prometheus/node_exporter/releases/download/v1.4.0/node_exporter-1.4.0.linux-amd64.tar.gz
tar -zvxf node_exporter-1.4.0.linux-amd64.tar.gz -C /usr/local/
cd /usr/local/
mv node_exporter-1.4.0.linux-amd64 node_exporter
Add the prometheus user
groupadd prometheus
useradd -g prometheus -s /sbin/nologin prometheus
Change the owner and group of the files
chown -R prometheus:prometheus /usr/local/node_exporter/
Manage the node_exporter service with systemd
vim /usr/lib/systemd/system/node_exporter.service
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/node_exporter/node_exporter
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
[Install]
WantedBy=multi-user.target
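Reload systemd so it picks up the new unit file, then enable and start the node_exporter service
systemctl daemon-reload && systemctl enable node_exporter && systemctl start node_exporter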
View the metrics collected by node_exporter
http://192.168.100.167:9100/metrics
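To check a single value without scrolling through the whole page, the text output can also be filtered on the command line. The following is only a minimal sketch: the function name metric_lines is made up for illustration, and it assumes curl and awk are available on the machine running it.

```shell
#!/usr/bin/env bash
# metric_lines NAME: read Prometheus text-format metrics on stdin and print
# the sample lines whose metric name is exactly NAME (with or without labels).
# Comment lines (# HELP / # TYPE) are skipped.
metric_lines() {
  awk -v name="$1" '
    /^#/ { next }
    {
      # The metric name ends at the first "{" (labels) or space (no labels).
      split($0, parts, "[{ ]")
      if (parts[1] == name) print
    }'
}

# Usage (address taken from the example above):
#   curl -s http://192.168.100.167:9100/metrics | metric_lines node_load1
```

Matching on the exact name rather than using a plain grep avoids also printing node_load15 when node_load1 is requested.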
Add the node_exporter host to Prometheus as a scrape target (under scrape_configs)
vim /usr/local/prometheus/prometheus.yml
  - job_name: "linux-node"
    scrape_interval: 60s
    static_configs:
      - targets: ['192.168.100.167:9100']
        labels:
          project: node_exporter
Check the syntax of prometheus.yml
/usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml
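Hot-reload the Prometheus configuration (the reload endpoint only works if Prometheus was started with --web.enable-lifecycle)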
curl -X POST http://127.0.0.1:9090/-/reload
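Since the check and the reload are always run together, they can be combined so that a broken configuration is never pushed to the running server. This is a sketch under assumptions: the function name check_and_reload and the PROM_DIR / PROM_URL variables are made up here, with defaults taken from the paths used in this guide.

```shell
#!/usr/bin/env bash
# Paths and URL follow the examples in this guide; adjust them to your install.
PROM_DIR="${PROM_DIR:-/usr/local/prometheus}"
PROM_URL="${PROM_URL:-http://127.0.0.1:9090}"

# check_and_reload: validate prometheus.yml first and only then ask the
# running server to reload it. Returns non-zero if either step fails.
check_and_reload() {
  if ! "$PROM_DIR/promtool" check config "$PROM_DIR/prometheus.yml"; then
    echo "config check failed, not reloading" >&2
    return 1
  fi
  # -f makes curl exit non-zero on an HTTP error status, for example when
  # the lifecycle API is not enabled.
  curl -fsS -X POST "$PROM_URL/-/reload" && echo "reload requested"
}

# Usage:
#   check_and_reload
```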
In the Prometheus web UI (Status > Targets), confirm that the node_exporter target is now being scraped
2. windows_exporter
windows_exporter is maintained by the Prometheus Community and collects metrics from Windows machines. It supports Windows Server 2008 R2 and later and Windows 7 and later.
Each windows_exporter release ships two kinds of files: *.exe and *.msi. MSI (Microsoft Installer) is the Windows package installer format, comparable to rpm on Linux. Every release includes an .msi installer, which registers windows_exporter as a Windows service and creates an inbound rule in Windows Firewall.
windows_exporter project page
https://github.com/prometheus-community/windows_exporter
Download windows_exporter-0.20.0-amd64.exe
https://github.com/prometheus-community/windows_exporter/releases/download/v0.20.0/windows_exporter-0.20.0-amd64.exe
Register windows_exporter as a Windows service from cmd (run as Administrator)
The windows_exporter executable is placed in the root of drive C:
sc create windows_exporter binpath= C:\windows_exporter-0.20.0-amd64.exe type= own start= auto displayname= windows_exporter
Usage of sc create:
C:\Users\Administrator>sc create
DESCRIPTION:
Creates a service entry in the registry and Service Database.
USAGE:
sc <server> create [service name] [binPath= ] <option1> <option2>...
OPTIONS:
NOTE: The option name includes the equal sign.
A space is required between the equal sign and the value.
type= <own|share|interact|kernel|filesys|rec>
(default = own)
start= <boot|system|auto|demand|disabled|delayed-auto>
(default = demand)
error= <normal|severe|critical|ignore>
(default = normal)
binPath= <BinaryPathName>
group= <LoadOrderGroup>
tag= <yes|no>
depend= <Dependencies (separated by / (forward slash))>
obj= <AccountName|ObjectName>
(default = LocalSystem)
DisplayName= <display name>
password= <password>
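Once registration succeeds, the windows_exporter service appears in the Windows services list
Select the windows_exporter service, choose Properties from the right-click menu, and enter the start parameter in the Properties dialog: --telemetry.addr=0.0.0.0:9182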
View the metrics collected by windows_exporter
windows_exporter listens on port 9182 by default: http://127.0.0.1:9182/metrics
Remove the windows_exporter service (only when it is no longer needed)
sc delete windows_exporter
Add the windows_exporter host to Prometheus as a scrape target (under scrape_configs)
vim /usr/local/prometheus/prometheus.yml
  - job_name: "windows-node"
    scrape_interval: 60s
    static_configs:
      - targets: ['192.168.100.85:9182']
        labels:
          project: windows_node_exporter
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
Check the Prometheus configuration file
/usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml
Hot-reload the Prometheus configuration
curl -X POST http://127.0.0.1:9090/-/reload
In the Prometheus web UI (Status > Targets), confirm that the windows_exporter target is now being scraped
PromQL queries for Windows metrics
CPU utilization (%)
100 - (avg by (instance,job) (irate(windows_cpu_time_total{mode="idle"}[2m])) * 100)
Free physical memory (GiB)
windows_os_physical_memory_free_bytes / 1024 / 1024 / 1024
Memory utilization (%)
100 - 100 * windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes
Disk usage (%)
100 - 100 * (windows_logical_disk_free_bytes / windows_logical_disk_size_bytes)
Disk fill-up prediction: volumes with less than 15% free space that, extrapolating from the last 6 hours, would run out of space within 4 days
100 * (windows_logical_disk_free_bytes / windows_logical_disk_size_bytes) < 15 and predict_linear(windows_logical_disk_free_bytes[6h], 4 * 24 * 3600) < 0
Network send rate
((sum(rate (windows_net_bytes_sent_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance,job)) / 100)
Network receive rate
((sum(rate (windows_net_bytes_received_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance,job)) / 100)
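The queries above can be typed into the web UI, but they can also be run from a shell through the Prometheus HTTP API, which is handy for quick checks or scripts. A minimal sketch, assuming Prometheus listens on 127.0.0.1:9090 as in the rest of this guide; the function name prom_query is made up for illustration.

```shell
#!/usr/bin/env bash
PROM_URL="${PROM_URL:-http://127.0.0.1:9090}"

# prom_query EXPR: run an instant query against the Prometheus HTTP API and
# print the raw JSON response. --data-urlencode takes care of escaping the
# braces, quotes and spaces that PromQL expressions contain.
prom_query() {
  curl -sS -G "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Usage:
#   prom_query '100 - 100 * windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes'
```

Passing the expression in single quotes keeps the shell from interpreting the double quotes and braces inside label matchers.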
Prometheus alerting rules for Windows (the rule file must be matched by a rule_files entry in prometheus.yml)
vim /usr/local/prometheus/rules/windows_node_exporter.yml
groups:
- name: Windows server resource monitoring
  rules:
  - alert: High CPU load
    expr: 100 - (avg by (instance,job) (irate(windows_cpu_time_total{mode="idle"}[2m])) * 100) > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} CPU usage is too high, please handle it as soon as possible!"
      description: "{{ $labels.instance }} CPU usage is above 90%, current usage {{ $value }}%."
  ####
  - alert: High memory usage
    expr: 100 - 100 * windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes > 90
    for: 5m # the condition must hold this long before the alert fires and is sent to Alertmanager
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} memory usage is too high, please handle it as soon as possible!"
      description: "{{ $labels.instance }} memory usage is above 90%, current usage {{ $value }}%."
  ####
  - alert: Server down
    expr: up{project=~"windows_node_exporter"} == 0
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} is down, please handle it as soon as possible!"
      description: "{{ $labels.instance }} has been unreachable for more than 3 minutes, current up value {{ $value }}."
  ####
  - alert: VNC service down
    expr: windows_service_state{name=~"vncserver",state="running"} == 0
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "Service {{ $labels.name }} down"
      description: "Service [{{ $labels.name }}] on instance {{ $labels.instance }} has been down for more than 3 minutes."
  ####
  - alert: Network inbound (received)
    expr: ((sum(rate (windows_net_bytes_received_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance,job)) / 100) > 10240
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} inbound network bandwidth is too high, please handle it as soon as possible!"
      description: "{{ $labels.instance }} inbound network bandwidth has been above the threshold for 5 minutes. Current RX value: {{ $value }}."
  ####
  - alert: Network outbound (sent)
    expr: ((sum(rate (windows_net_bytes_sent_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance,job)) / 100) > 10240
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} outbound network bandwidth is too high, please handle it as soon as possible!"
      description: "{{ $labels.instance }} outbound network bandwidth has been above the threshold for 5 minutes. Current TX value: {{ $value }}."
  ####
  - alert: Disk capacity
    expr: 100 - 100 * (windows_logical_disk_free_bytes{volume=~"C:|D:|E:|F:"} / windows_logical_disk_size_bytes{volume=~"C:|D:|E:|F:"}) > 80
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.volume }} disk usage is too high, please handle it as soon as possible!"
      description: "{{ $labels.instance }} {{ $labels.volume }} disk usage is above 80%, current usage {{ $value }}%."
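The two network rules above divide the byte rate by 100 before comparing it with 10240, so it may be worth working out which real rate that corresponds to before relying on them. A small sketch of the arithmetic follows; the function names are made up for illustration and only awk is assumed.

```shell
#!/usr/bin/env bash
# threshold_bytes_per_sec THRESHOLD DIVISOR: the byte rate at which
# "rate(...) / DIVISOR > THRESHOLD" starts to be true.
threshold_bytes_per_sec() {
  awk -v t="$1" -v d="$2" 'BEGIN { printf "%d\n", t * d }'
}

# to_mbit_per_sec BYTES_PER_SEC: the same rate in megabits per second.
to_mbit_per_sec() {
  awk -v b="$1" 'BEGIN { printf "%.3f\n", b * 8 / 1000000 }'
}

bps=$(threshold_bytes_per_sec 10240 100)
echo "$bps bytes/s"                       # 1024000 bytes/s
echo "$(to_mbit_per_sec "$bps") Mbit/s"   # 8.192 Mbit/s
```

As written, the rules therefore appear to fire at roughly 1 MB/s (about 8 Mbit/s). If the intended limit is different, the divisor or the threshold in the expressions would need to be adjusted accordingly.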
Check the syntax of the rule file
/usr/local/prometheus/promtool check rules /usr/local/prometheus/rules/windows_node_exporter.yml
Hot-reload the Prometheus configuration
curl -X POST http://127.0.0.1:9090/-/reload
Prometheus alerting rules for Linux
vim /usr/local/prometheus/rules/node_exporter.yml
groups:
- name: Server resource monitoring
  rules:
  - alert: High memory usage
    expr: (1 - (node_memory_Buffers_bytes + node_memory_Cached_bytes + node_memory_MemFree_bytes) / node_memory_MemTotal_bytes) * 100 > 88
    for: 5m # the condition must hold this long before the alert fires and is sent to Alertmanager
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} memory usage is too high, please handle it as soon as possible!"
      description: "{{ $labels.instance }} memory usage is above 88%, current usage {{ $value }}%."
  ####
  - alert: Server down
    expr: up{project=~"node_exporter"} == 0
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} is down, please handle it as soon as possible!"
      description: "{{ $labels.instance }} has been unreachable for more than 3 minutes, current up value {{ $value }}."
  ####
  - alert: High CPU load
    expr: 100 - (avg by (instance,job)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} CPU usage is too high, please handle it as soon as possible!"
      description: "{{ $labels.instance }} CPU usage is above 90%, current usage {{ $value }}%."
  ####
  - alert: Disk I/O utilization
    expr: avg(irate(node_disk_io_time_seconds_total[1m])) by (instance,job) * 100 > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} disk I/O utilization is too high, please handle it as soon as possible!"
      description: "{{ $labels.instance }} disk I/O utilization is above 90%, current utilization {{ $value }}%."
  ####
  - alert: Network inbound
    expr: ((sum(rate (node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance,job)) / 100) > 10240
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} inbound network bandwidth is too high, please handle it as soon as possible!"
      description: "{{ $labels.instance }} inbound network bandwidth has been above the threshold for 5 minutes. Current RX value: {{ $value }}."
  ####
  - alert: Network outbound
    expr: ((sum(rate (node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance,job)) / 100) > 10240
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} outbound network bandwidth is too high, please handle it as soon as possible!"
      description: "{{ $labels.instance }} outbound network bandwidth has been above the threshold for 5 minutes. Current TX value: {{ $value }}."
  ####
  - alert: TCP connections
    expr: node_netstat_Tcp_CurrEstab > 10000
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} has too many established TCP connections!"
      description: "{{ $labels.instance }} has more than 10000 established TCP connections, current count {{ $value }}."
  ####
  - alert: Disk capacity
    expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 90
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.mountpoint }} disk usage is too high, please handle it as soon as possible!"
      description: "{{ $labels.instance }} {{ $labels.mountpoint }} disk usage is above 90%, current usage {{ $value }}%."
Check the syntax of the alerting rule file
/usr/local/prometheus/promtool check rules /usr/local/prometheus/rules/node_exporter.yml
Hot-reload the Prometheus configuration
curl -X POST http://127.0.0.1:9090/-/reload
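After the reload, one way to see whether any of the rules are currently pending or firing is to ask the Prometheus HTTP API for its active alerts. The following is a rough sketch, not a robust client: it assumes Prometheus at 127.0.0.1:9090 as elsewhere in this guide, the function name active_alert_names is made up, and it extracts names with a plain text match instead of a JSON parser.

```shell
#!/usr/bin/env bash
PROM_URL="${PROM_URL:-http://127.0.0.1:9090}"

# active_alert_names: print the distinct alertname values of all pending or
# firing alerts reported by /api/v1/alerts, one per line.
active_alert_names() {
  curl -sS "$PROM_URL/api/v1/alerts" \
    | grep -o '"alertname":"[^"]*"' \
    | cut -d'"' -f4 \
    | sort -u
}

# Usage:
#   active_alert_names
```

An empty result simply means that no alert is pending or firing at the moment; the loaded rules themselves are listed on the Alerts page of the web UI.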