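Prometheus download page: https://prometheus.io/download/
Official Prometheus documentation: https://prometheus.io/docs/introduction/first_steps/
The version used here is prometheus-2.10.0.linux-amd64.tar.gz
Prometheus configuration is written in YAML. The Prometheus download ships with a sample configuration in a file called prometheus.yml, which is a good place to start.
Most of the comments in the sample file have been removed below to keep it concise (comments are the lines prefixed with #).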
global:
  scrape_interval:     15s
  evaluation_interval: 15s
rule_files:
  # - "first.rules"
  # - "second.rules"
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
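The sample configuration file contains three blocks: global, rule_files, and scrape_configs.
1. The global block controls the Prometheus server's global configuration. Two options are set. The first, scrape_interval, controls how often Prometheus scrapes its targets; it can be overridden for individual targets. Here the global setting is to scrape every 15 seconds. The evaluation_interval option controls how often Prometheus evaluates rules. Prometheus uses rules to create new time series and to generate alerts.
2. The rule_files block specifies the location of any rules the Prometheus server should load. For now there are no rules.
3. The scrape_configs block controls which resources Prometheus monitors. Because Prometheus also exposes its own data as an HTTP endpoint, it can scrape and monitor its own health. The default configuration has a single job, named prometheus, which scrapes the time series data exposed by the Prometheus server. The job contains one statically configured target: localhost on port 9090. Prometheus expects metrics to be available on targets at the path /metrics, so this default job scrapes the URL http://localhost:9090/metrics.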
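To make the default job concrete, the shell sketch below composes the scrape URL from its three parts: scheme, target address, and metrics path. The function name scrape_url is hypothetical and exists only for illustration; its defaults mirror the http scheme and /metrics path of the sample configuration.

```shell
#!/bin/sh
# scrape_url TARGET [SCHEME] [METRICS_PATH]
# Hypothetical helper: prints the URL Prometheus would scrape for a target.
# Defaults follow the sample config: scheme "http", path "/metrics".
scrape_url() {
    target=$1
    scheme=${2:-http}
    path=${3:-/metrics}
    printf '%s://%s%s\n' "$scheme" "$target" "$path"
}

scrape_url localhost:9090          # prints http://localhost:9090/metrics
```

Once Prometheus is running, something like `curl -s "$(scrape_url localhost:9090)" | head` should show the first few exposed metrics.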
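alertmanager-0.17.0.linux-amd64.tar.gz
Alertmanager download page: https://prometheus.io/download/
Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integration, such as email, PagerDuty, or OpsGenie. It also takes care of silencing and inhibiting alerts.
go1.12.5.linux-amd64.tar.gz
Go download page: https://golang.google.cn/dl/
grafana-6.2.5.linux-amd64.tar.gz
Grafana download page: https://grafana.com/
telegraf-1.11.1_linux_amd64.tar.gz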
Edit the configuration file /etc/selinux/config and set SELINUX to disabled.
[root@bigdata3 ~]# cat /etc/selinux/config
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
# enforcing - SELinux security policy is enforced.
# permissive - SELinux prints warnings instead of enforcing.
# disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of three two values:
# targeted - Targeted processes are protected,
# minimum - Modification of targeted policy. Only selected processes are protected.
# mls - Multi Level Security protection.
SELINUXTYPE=targeted
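The same edit can be scripted instead of made by hand. The sketch below is a minimal example, not a definitive procedure: the function name set_selinux_disabled is hypothetical, and the demonstration runs against a scratch file rather than the live /etc/selinux/config.

```shell
#!/bin/sh
# set_selinux_disabled FILE
# Hypothetical helper: rewrites the SELINUX= line of an SELinux config file
# so that it reads SELINUX=disabled. Comment lines (starting with #) and the
# SELINUXTYPE= line do not match the pattern and are left unchanged.
set_selinux_disabled() {
    tmp=$(mktemp) || return 1
    # Write to a temp file first, then copy the content back so the
    # original file keeps its owner and permissions.
    sed 's/^SELINUX=.*/SELINUX=disabled/' "$1" > "$tmp" && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return $status
}

# Demonstration on a scratch file, not on /etc/selinux/config.
demo=$(mktemp)
printf '%s\n' '# SELINUX= can take one of these three values:' \
    'SELINUX=enforcing' 'SELINUXTYPE=targeted' > "$demo"
set_selinux_disabled "$demo"
grep '^SELINUX=' "$demo"
rm -f "$demo"
```

On the real system this would be run as root with /etc/selinux/config as the argument. The file is only read at boot, so the change takes effect after a restart.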
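On Linux, once memory usage reaches a certain level the system starts using the swap partition. This behavior is controlled by the vm.swappiness parameter, exposed in /proc/sys/vm/swappiness; the Linux default is vm.swappiness=60. To turn swap off:
swapoff -a
You can also add swapoff -a to the boot-time startup configuration so that it is applied after every restart.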
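It can help to check the state before and after running swapoff -a. The following is a minimal sketch that assumes the usual Linux /proc layout; the two helper names are hypothetical, and each accepts an optional file argument so it can be tried against a sample file.

```shell
#!/bin/sh
# read_swappiness [FILE]
# Hypothetical helper: prints the current vm.swappiness value.
# FILE defaults to the kernel's /proc/sys/vm/swappiness.
read_swappiness() {
    cat "${1:-/proc/sys/vm/swappiness}"
}

# count_active_swap [FILE]
# Prints how many swap areas are active. /proc/swaps holds one header line
# followed by one line per active swap area, so 0 means swap is fully off.
count_active_swap() {
    awk 'NR > 1 { n++ } END { print n + 0 }' "${1:-/proc/swaps}"
}

# Only report when the /proc files are actually readable on this machine.
if [ -r /proc/sys/vm/swappiness ] && [ -r /proc/swaps ]; then
    echo "vm.swappiness = $(read_swappiness)"
    echo "active swap areas = $(count_active_swap)"
fi
```

After a successful swapoff -a, the count of active swap areas should drop to 0.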
[root@bigdata3 ~]# systemctl stop firewalld.service && systemctl disable firewalld.service
Removed symlink /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.
Removed symlink /etc/systemd/system/basic.target.wants/firewalld.service.
[root@bigdata3 ~]# systemctl status firewalld.service
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Jul 16 16:51:41 bigdata3 systemd[1]: Starting firewalld - dynamic firewall dae.....
Jul 16 16:51:44 bigdata3 systemd[1]: Started firewalld - dynamic firewall daemon.
Jul 16 17:31:04 bigdata3 systemd[1]: Stopping firewalld - dynamic firewall dae.....
Jul 16 17:31:04 bigdata3 systemd[1]: Stopped firewalld - dynamic firewall daemon.
Hint: Some lines were ellipsized, use -l to show in full.
I set this up in my own test environment, which has three machines in total.
[root@bigdata3 ~]# cat /etc/hosts
127.0.0.1 localhost
192.168.1.1 bigdata1
192.168.1.2 bigdata2
192.168.1.3 bigdata3
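On production machines, agree on which directories to use up front. For example, /opt/ will typically already contain directories such as CDH and CM, so decide in advance where the monitoring software will live.
1. Create the directory for the monitoring programs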
[root@bigdata3 opt]# mkdir /opt/monitor
2. Upload the software packages
[root@bigdata3 monitor]# ll
total 321032
-rw-r--r--. 1 root root 23631797 Jul 3 2019 alertmanager-0.17.0.linux-amd64.tar.gz
-rw-r--r--. 1 root root 127938445 Jul 5 2019 go1.12.5.linux-amd64.tar.gz
-rw-r--r--. 1 root root 58512371 Jul 22 2019 grafana-6.2.5.linux-amd64.tar.gz
-rw-r--r--. 1 root root 50120400 Jul 16 2019 influxdb-1.7.7_linux_amd64.tar.gz
-rw-r--r--. 1 root root 48497454 Jul 3 2019 prometheus-2.10.0.linux-amd64.tar.gz
-rw-r--r--. 1 root root 20021531 Jul 16 2019 telegraf-1.11.1_linux_amd64.tar.gz
[root@bigdata3 monitor]# pwd
/opt/monitor
3. Extract the software
[root@bigdata3 monitor]# tar xf alertmanager-0.17.0.linux-amd64.tar.gz
[root@bigdata3 monitor]# tar xf prometheus-2.10.0.linux-amd64.tar.gz
4. Rename the folder
[root@bigdata3 monitor]# mv prometheus-2.10.0.linux-amd64 prometheus
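The extract-and-rename steps repeat for every package, so they can be wrapped in one small function. This is a sketch under stated assumptions: the name unpack_as is hypothetical, the tarball is assumed to contain a single top-level directory (as the official release tarballs do), and the destination name must not already exist.

```shell
#!/bin/sh
# unpack_as TARBALL NEWNAME
# Hypothetical helper: extracts TARBALL into the current directory and
# renames its top-level directory to NEWNAME.
unpack_as() {
    # The first archive entry is taken to be the top-level directory.
    top=$(tar tf "$1" | head -n 1 | cut -d/ -f1)
    [ -n "$top" ] || return 1
    tar xf "$1" && mv "$top" "$2"
}

# Example, run from /opt/monitor:
#   unpack_as prometheus-2.10.0.linux-amd64.tar.gz prometheus
#   unpack_as alertmanager-0.17.0.linux-amd64.tar.gz alertmanager
```

Short directory names keep later paths such as /opt/monitor/prometheus independent of the version number.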
5. Edit the configuration file
vi prometheus.yml
The original configuration file is as follows:
# my global config
# Global configuration that controls Prometheus's behavior
global:
  # How often Prometheus scrapes applications or services. This value sets the granularity of the time series, i.e. the time span covered by each data point in the series.
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  # How often Prometheus evaluates rules. There are currently two kinds of rules: recording rules and alerting rules.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
# Configures alerting for Prometheus
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
# Files that contain the alerting rules
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
# Lists every target that Prometheus scrapes
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9090']
The modified configuration file is as follows:
[root@bigdata3 prometheus]# cat prometheus.yml
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['192.168.1.5:9093'] # address of the Alertmanager
      # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
#################rules#############################
  - "/opt/monitor/prometheus/rules/hosts/*.yml" # directory holding the alerting rules
#################rules#############################
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['192.168.1.5:9090'] # address of the machine where Prometheus is installed
#################hosts#############################
  - job_name: 'A-getway' # job label used to tell apart the machines of each monitored project
    file_sd_configs:
    - files: ['/opt/monitor/prometheus/monitor_config/dmp/*.yml'] # directory holding the target files for the dmp cluster machines
      refresh_interval: 5s
#################hosts#############################
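6. Create the related folders
Create /opt/monitor/prometheus/rules/ to hold the alerting rules. In practice there are many kinds of monitored items, so store the rules by category and create the sub-folders hosts, mysql, and hdfs.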
[root@bigdata3 prometheus]# mkdir /opt/monitor/prometheus/rules
[root@bigdata3 prometheus]# mkdir /opt/monitor/prometheus/rules/hosts
[root@bigdata3 prometheus]# mkdir /opt/monitor/prometheus/rules/mysql
[root@bigdata3 prometheus]# mkdir /opt/monitor/prometheus/rules/hdfs
[root@bigdata3 prometheus]# ls /opt/monitor/prometheus/rules/
hdfs hosts mysql
Change into the hosts directory and create the alerting rule file:
[root@bigdata3 hosts]# cat disk_use.yml
groups:
- name: host_disk
  rules:
  - alert: NodediskUsage
    expr: round(disk_used_percent{kind="jkj"}) > 50
    for: 1m
    labels:
      sort: host_disk
      level: severity
    annotations:
      summary: "{{$labels.instance}}: High disk usage"
      description: "disk {{$labels.path}} already use {{ $value }}%,please check it"
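Before Prometheus loads a rule file it is worth validating it. The Prometheus tarball includes the promtool binary, whose check rules subcommand parses rule files and reports errors. The wrapper below is a minimal sketch; the function name check_rules is hypothetical, and the promtool path in the usage comment assumes the directory layout used in this article.

```shell
#!/bin/sh
# check_rules PROMTOOL RULE_FILE...
# Hypothetical wrapper: runs "promtool check rules" on the given files and
# returns promtool's exit status, or 127 if the binary is not executable.
check_rules() {
    promtool=$1
    shift
    if [ ! -x "$promtool" ]; then
        echo "promtool not found at $promtool" >&2
        return 127
    fi
    "$promtool" check rules "$@"
}

# Example (paths assumed from this article's layout):
#   check_rules /opt/monitor/prometheus/promtool \
#       /opt/monitor/prometheus/rules/hosts/disk_use.yml
```

A non-zero exit status means the file should be fixed before Prometheus is started or reloaded.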
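Create /opt/monitor/prometheus/monitor_config/ to hold the target files for the monitored hosts, grouped by category. Again, to tell apart the machines of each project, create the project sub-folders dmp and xl.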
[root@bigdata3 prometheus]# mkdir /opt/monitor/prometheus/monitor_config/
[root@bigdata3 prometheus]# mkdir /opt/monitor/prometheus/monitor_config/dmp
[root@bigdata3 prometheus]# mkdir /opt/monitor/prometheus/monitor_config/xl
[root@bigdata3 prometheus]# ls /opt/monitor/prometheus/monitor_config/
dmp xl
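The individual mkdir calls for the rules and monitor_config trees can also be collapsed into one loop with mkdir -p, which creates missing parent directories and does not fail if a directory already exists. A minimal sketch; the function name make_monitor_dirs is hypothetical, and the base directory is passed as an argument so it can be tried somewhere harmless first.

```shell
#!/bin/sh
# make_monitor_dirs BASE
# Hypothetical helper: creates the rule and target-file directory trees
# used in this article under BASE (e.g. /opt/monitor/prometheus).
make_monitor_dirs() {
    for sub in rules/hosts rules/mysql rules/hdfs \
               monitor_config/dmp monitor_config/xl; do
        mkdir -p "$1/$sub" || return 1
    done
}

# Example:
#   make_monitor_dirs /opt/monitor/prometheus
```

Running it a second time is harmless, which makes it convenient to keep in a provisioning script.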
Change into the dmp directory and create a file for each machine to be monitored:
[root@bigdata3 dmp]# cat 192.168.1.5.yml
- targets: [ "192.168.1.5:9275" ]
  labels:
    group: "monitor-server"
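With many machines, writing these target files by hand gets tedious. The sketch below generates one file per host in the same format as the example above. The function name write_target_file and its argument order are hypothetical; the port and group values are simply the ones used in this article.

```shell
#!/bin/sh
# write_target_file DIR HOST PORT GROUP
# Hypothetical helper: writes DIR/HOST.yml as a file_sd target file with a
# single target HOST:PORT and a "group" label.
write_target_file() {
    cat > "$1/$2.yml" <<EOF
- targets: [ "$2:$3" ]
  labels:
    group: "$4"
EOF
}

# Example:
#   write_target_file /opt/monitor/prometheus/monitor_config/dmp \
#       192.168.1.5 9275 monitor-server
```

Because the job uses file_sd_configs, Prometheus should pick up new or changed files in the watched directory without a restart.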
7. Start the Prometheus process
/opt/monitor/prometheus/prometheus --config.file="/opt/monitor/prometheus/prometheus.yml" --web.enable-lifecycle > /opt/monitor/prometheus/prometheus.log 2>&1 &
Starting with --web.enable-lifecycle enables hot reloading of the configuration file over HTTP.
Then open http://ip:9090 in a browser.
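With --web.enable-lifecycle set, a reload is triggered by sending an HTTP POST to the /-/reload endpoint. A minimal sketch, assuming curl is installed; the helper names reload_url and reload_prometheus are hypothetical.

```shell
#!/bin/sh
# reload_url HOST:PORT
# Hypothetical helper: prints the reload endpoint for a Prometheus server.
reload_url() {
    printf 'http://%s/-/reload\n' "$1"
}

# reload_prometheus HOST:PORT
# Sends the POST request and prints only the HTTP status code.
reload_prometheus() {
    curl -s -X POST -o /dev/null -w '%{http_code}\n' "$(reload_url "$1")"
}

# Example, after editing prometheus.yml or a rule file:
#   reload_prometheus 192.168.1.5:9090
```

A status of 200 is expected when the new configuration loads; any other code suggests checking prometheus.log.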
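Original-content statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.
In case of infringement, contact cloudcommunity@tencent.com for removal.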