Learning monitoring with Prometheus + Telegraf + Grafana (Part 1)


Bob hadoop

Last modified 2020-12-09 17:38:03


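I. Software preparation

Prometheus download page: https://prometheus.io/download/

Prometheus official documentation: https://prometheus.io/docs/introduction/first_steps/

The version used here is prometheus-2.10.0.linux-amd64.tar.gz

Prometheus is configured in YAML. The Prometheus download ships with a sample configuration in a file called prometheus.yml, which is a good place to start.

Most of the comments in the sample file have been removed here to keep it concise (comments are the lines prefixed with #).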

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  # - "first.rules"
  # - "second.rules"

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

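The sample configuration file has three blocks: global, rule_files, and scrape_configs.

1. The global block controls the Prometheus server's global configuration. Two options are set here. The first, scrape_interval, controls how often Prometheus scrapes its targets; the value can be overridden for individual targets. In this case the global setting is to scrape every 15 seconds. The evaluation_interval option controls how often Prometheus evaluates rules. Prometheus uses rules to create new time series and to generate alerts.

2. The rule_files block specifies the location of any rules we want the Prometheus server to load. For now there are no rules.

3. The scrape_configs block controls which resources Prometheus monitors. Because Prometheus also exposes data about itself as an HTTP endpoint, it can scrape and monitor its own health. The default configuration has a single job, called prometheus, which scrapes the time series data exposed by the Prometheus server. The job contains a single, statically configured target: localhost on port 9090. Prometheus expects metrics to be available on targets at the path /metrics, so this default job scrapes the URL http://localhost:9090/metrics.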
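Item 1 notes that the global scrape_interval can be overridden. In practice the override is written per job inside scrape_configs. The following is a minimal sketch of that idea, not part of the original setup: the job name node_example, its target, and the 5s value are made up for illustration, and the file is written to a scratch directory.

```shell
set -eu

cfg_dir=$(mktemp -d)            # scratch directory, not the real install path
cat > "$cfg_dir/prometheus.yml" <<'EOF'
global:
  scrape_interval: 15s          # default for every job
  evaluation_interval: 15s

scrape_configs:
  - job_name: prometheus        # inherits the 15s global interval
    static_configs:
      - targets: ['localhost:9090']

  - job_name: node_example      # hypothetical job, for illustration only
    scrape_interval: 5s         # per-job override of the global value
    metrics_path: /metrics      # the default path, spelled out
    static_configs:
      - targets: ['localhost:9273']   # hypothetical target
EOF
cat "$cfg_dir/prometheus.yml"
```

With a configuration like this, the prometheus job would still be scraped every 15 seconds while the second job would be scraped every 5 seconds.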

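alertmanager-0.17.0.linux-amd64.tar.gz

Alertmanager download page: https://prometheus.io/download/

Alertmanager takes care of deduplicating, grouping, and routing alerts to the correct receiver integration, such as email, PagerDuty, or OpsGenie. It also takes care of silencing and inhibiting alerts.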
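To make grouping, routing, and inhibition more concrete, here is a rough sketch of what an alertmanager.yml could look like. Everything specific in it is an assumption for illustration: the SMTP host, the mail addresses, the receiver name ops-mail, and the level label values critical and warning are placeholders, and the file is written to a scratch directory rather than the real install.

```shell
set -eu

am_dir=$(mktemp -d)             # scratch directory, not the real install path
cat > "$am_dir/alertmanager.yml" <<'EOF'
global:
  smtp_smarthost: 'smtp.example.com:25'     # placeholder mail relay
  smtp_from: 'alertmanager@example.com'     # placeholder sender

route:
  receiver: ops-mail                        # default receiver for every alert
  group_by: ['alertname', 'instance']       # alerts sharing these labels are batched together
  group_wait: 30s                           # wait before sending the first notification of a group
  group_interval: 5m                        # wait before notifying about new alerts in a group
  repeat_interval: 4h                       # wait before re-sending an unresolved notification

receivers:
  - name: ops-mail
    email_configs:
      - to: 'ops@example.com'               # placeholder recipient

inhibit_rules:
  - source_match:
      level: critical                       # while a critical alert is firing...
    target_match:
      level: warning                        # ...mute warning alerts...
    equal: ['instance']                     # ...for the same instance
EOF
cat "$am_dir/alertmanager.yml"
```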

go1.12.5.linux-amd64.tar.gz

Go download page: https://golang.google.cn/dl/

grafana-6.2.5.linux-amd64.tar.gz

Grafana download page: https://grafana.com/

telegraf-1.11.1_linux_amd64.tar.gz

II. Preparing the physical machines

1. Disable SELinux on the machines

Edit /etc/selinux/config and change SELINUX to disabled.

[root@bigdata3 ~]# cat /etc/selinux/config 
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of three two values:
#     targeted - Targeted processes are protected,
#     minimum - Modification of targeted policy. Only selected processes are protected. 
#     mls - Multi Level Security protection.
SELINUXTYPE=targeted
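The same edit can be scripted instead of done by hand. The sketch below assumes GNU sed (as found on CentOS) and runs against a throwaway copy so that it is safe to try; the helper name set_selinux_disabled is made up for this example.

```shell
set -eu

# set_selinux_disabled FILE: rewrite the SELINUX= line, leaving SELINUXTYPE= untouched.
set_selinux_disabled() {
  sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$1"
}

sample=$(mktemp)                # stand-in for /etc/selinux/config
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > "$sample"
set_selinux_disabled "$sample"
cat "$sample"

# On the real machine (as root), the assumed steps would be:
#   set_selinux_disabled /etc/selinux/config   # takes effect after a reboot
#   setenforce 0                               # switch to permissive mode until then
#   getenforce                                 # report the current mode
```

Because the pattern requires an = directly after SELINUX, the SELINUXTYPE line is left alone.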

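2. Turn off swap

On Linux, once memory usage reaches a certain level the system starts using the swap partition. This is controlled by the vm.swappiness parameter, exposed at /proc/sys/vm/swappiness; the Linux default is vm.swappiness=60.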

swapoff -a

You can also add swapoff -a to the boot-time startup configuration.
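Since swapoff -a only lasts until the next reboot, one alternative to a boot-time command is to comment out the swap entries in /etc/fstab. This is a different technique from the one described above, shown as a sketch: it assumes GNU sed, uses a made-up helper name comment_out_swap, and runs against a sample file rather than the real /etc/fstab.

```shell
set -eu

# comment_out_swap FILE: prefix every active fstab line that has a "swap" field with '#'.
comment_out_swap() {
  sed -i -E '/^[^#].*[[:space:]]swap[[:space:]]/ s/^/#/' "$1"
}

sample=$(mktemp)                # stand-in for /etc/fstab
cat > "$sample" <<'EOF'
/dev/mapper/centos-root /     xfs   defaults 0 0
/dev/mapper/centos-swap swap  swap  defaults 0 0
EOF
comment_out_swap "$sample"
cat "$sample"

# On the real machine (as root), the assumed steps would be:
#   cat /proc/sys/vm/swappiness    # current swappiness value
#   swapoff -a                     # turn swap off now
#   comment_out_swap /etc/fstab    # keep it off after a reboot
```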

3. Disable the firewall

[root@bigdata3 ~]# systemctl stop firewalld.service && systemctl disable firewalld.service
Removed symlink /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.
Removed symlink /etc/systemd/system/basic.target.wants/firewalld.service.
[root@bigdata3 ~]# systemctl status firewalld.service
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
Jul 16 16:51:41 bigdata3 systemd[1]: Starting firewalld - dynamic firewall dae.....
Jul 16 16:51:44 bigdata3 systemd[1]: Started firewalld - dynamic firewall daemon.
Jul 16 17:31:04 bigdata3 systemd[1]: Stopping firewalld - dynamic firewall dae.....
Jul 16 17:31:04 bigdata3 systemd[1]: Stopped firewalld - dynamic firewall daemon.
Hint: Some lines were ellipsized, use -l to show in full.

4. Edit the machines' hosts file

I set this up in my own test environment, with three machines in total.

[root@bigdata3 ~]# cat /etc/hosts
127.0.0.1  localhost
192.168.1.1  bigdata1
192.168.1.2  bigdata2
192.168.1.3  bigdata3
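When the same entries have to be added on all three machines, a small idempotent helper avoids duplicate lines if it is run twice. This is only a sketch: the helper name add_host is invented here, and the demo writes to a temporary file instead of /etc/hosts.

```shell
set -eu

# add_host IP NAME FILE: append "IP  NAME" unless NAME is already mapped in FILE.
add_host() {
  local ip=$1 name=$2 file=$3
  if ! grep -Eq "[[:space:]]${name}([[:space:]]|\$)" "$file"; then
    printf '%s  %s\n' "$ip" "$name" >> "$file"
  fi
}

hosts_file=$(mktemp)            # stand-in for /etc/hosts
echo "127.0.0.1  localhost" > "$hosts_file"

while read -r ip name; do
  add_host "$ip" "$name" "$hosts_file"
done <<'EOF'
192.168.1.1 bigdata1
192.168.1.2 bigdata2
192.168.1.3 bigdata3
EOF

add_host 192.168.1.3 bigdata3 "$hosts_file"   # second call changes nothing
cat "$hosts_file"
```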

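III. Prometheus overview and setup

On production machines, decide in advance which directories are used for what. For example, /opt/ will already hold directories such as CDH and CM, so the location of the monitoring software should be planned up front.

1. Create the monitoring directory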

[root@bigdata3 opt]# mkdir /opt/monitor

2. Copy in the software

[root@bigdata3 monitor]# ll 
total 321032
-rw-r--r--. 1 root root  23631797 Jul  3  2019 alertmanager-0.17.0.linux-amd64.tar.gz
-rw-r--r--. 1 root root 127938445 Jul  5  2019 go1.12.5.linux-amd64.tar.gz
-rw-r--r--. 1 root root  58512371 Jul 22  2019 grafana-6.2.5.linux-amd64.tar.gz
-rw-r--r--. 1 root root  50120400 Jul 16  2019 influxdb-1.7.7_linux_amd64.tar.gz
-rw-r--r--. 1 root root  48497454 Jul  3  2019 prometheus-2.10.0.linux-amd64.tar.gz
-rw-r--r--. 1 root root  20021531 Jul 16  2019 telegraf-1.11.1_linux_amd64.tar.gz
[root@bigdata3 monitor]# pwd
/opt/monitor

3. Unpack the software

[root@bigdata3 monitor]# tar xf alertmanager-0.17.0.linux-amd64.tar.gz

4. Rename the directory

[root@bigdata3 monitor]# mv prometheus-2.10.0.linux-amd64 prometheus
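Steps 3 and 4 can also be combined: tar can unpack straight into a directory with a short name by dropping the versioned top-level directory. The sketch below is an alternative to the tar xf plus mv sequence above, not the method used in this walkthrough. The helper name unpack_as is made up, and the demo builds a throwaway archive so it can be run anywhere.

```shell
set -eu

# unpack_as ARCHIVE DEST: extract ARCHIVE into DEST, dropping its top-level directory.
unpack_as() {
  local archive=$1 dest=$2
  mkdir -p "$dest"
  tar xf "$archive" -C "$dest" --strip-components=1
}

# Demo with a throwaway archive that mimics the layout of the real tarball.
work=$(mktemp -d)
mkdir -p "$work/src/prometheus-2.10.0.linux-amd64"
echo "placeholder" > "$work/src/prometheus-2.10.0.linux-amd64/prometheus.yml"
tar czf "$work/prometheus-2.10.0.linux-amd64.tar.gz" -C "$work/src" prometheus-2.10.0.linux-amd64

unpack_as "$work/prometheus-2.10.0.linux-amd64.tar.gz" "$work/prometheus"
ls "$work/prometheus"
```

On the real machine the call would look like unpack_as prometheus-2.10.0.linux-amd64.tar.gz /opt/monitor/prometheus, and likewise for the other archives.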

5. Edit the configuration file

vi prometheus.yml

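The original configuration file is as follows:
# my global config
# Global settings that control Prometheus's behavior
global:
  # How often Prometheus scrapes data from applications or services. This value is the granularity of the time series, i.e. the period covered by each data point in the series.
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  # How often Prometheus evaluates rules. There are currently two kinds: recording rules and alerting rules.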
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
# Configures alerting for Prometheus
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
# Files that contain the alerting rules
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
# Lists all the targets that Prometheus scrapes
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9090']

The modified configuration file is as follows:

[root@bigdata3 prometheus]# cat  prometheus.yml
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['192.168.1.5:9093']        # address of the Alertmanager
      # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
#################rules#############################
  - "/opt/monitor/prometheus/rules/hosts/*.yml" # directory holding the alerting rules
#################rules#############################
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['192.168.1.5:9090']          # address of the machine running Prometheus
#################hosts#############################
  - job_name: 'A-getway'     # label used to tell apart the machines of each monitored project
    file_sd_configs:
    - files: ['/opt/monitor/prometheus/monitor_config/dmp/*.yml']   # directory holding the target files for the dmp cluster machines
      refresh_interval: 5s
#################hosts#############################

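6. Create the supporting directories

Create /opt/monitor/prometheus/rules/ to hold the alerting rules. Because there are usually many monitored items in practice, the rules are stored by category: create the subdirectories hosts, mysql, and hdfs for host, MySQL, and HDFS monitoring.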

[root@bigdata3 prometheus]# mkdir /opt/monitor/prometheus/rules
[root@bigdata3 prometheus]# mkdir /opt/monitor/prometheus/rules/hosts
[root@bigdata3 prometheus]# mkdir /opt/monitor/prometheus/rules/mysql
[root@bigdata3 prometheus]# mkdir /opt/monitor/prometheus/rules/hdfs
[root@bigdata3 prometheus]# ls  /opt/monitor/prometheus/rules/
hdfs  hosts  mysql

Go into the hosts directory and create the alerting rule file:

[root@bigdata3 hosts]# cat disk_use.yml
groups:
- name: host_disk     
  rules:
  - alert: NodediskUsage
    expr: round(disk_used_percent{kind="jkj"}) > 50
    for: 1m
    labels:
      sort: host_disk
      level: severity
    annotations:
      summary: "{{$labels.instance}}: High disk usage"
      description: "disk {{$labels.path}} already use {{ $value }}%,please check it"
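A rule file can be checked for syntax errors before Prometheus loads it, using promtool, which ships in the same tarball as the prometheus binary. The sketch below assumes promtool sits at /opt/monitor/prometheus/promtool; the helper name check_rules and the PROMTOOL variable are made up for this example, and the helper only prints a notice when the binary is not there.

```shell
set -eu

# Path is an assumption: promtool next to the prometheus binary in the unpacked tarball.
PROMTOOL=${PROMTOOL:-/opt/monitor/prometheus/promtool}

# check_rules FILE...: validate rule files, or say so when promtool is unavailable.
check_rules() {
  if [ -x "$PROMTOOL" ]; then
    "$PROMTOOL" check rules "$@"
  else
    echo "promtool not found at $PROMTOOL; skipped"
  fi
}

check_rules /opt/monitor/prometheus/rules/hosts/disk_use.yml
```

For the rule above, a successful check only confirms that the file parses. Whether the alert ever fires still depends on a disk_used_percent series carrying the label kind="jkj" and staying above 50 for one minute.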

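Create /opt/monitor/prometheus/monitor_config/ to hold the monitored hosts, grouped by category. To tell apart the machines of each project, create the project subdirectories dmp and xl.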

[root@bigdata3 prometheus]# mkdir /opt/monitor/prometheus/monitor_config/
[root@bigdata3 prometheus]# mkdir /opt/monitor/prometheus/monitor_config/dmp
[root@bigdata3 prometheus]# mkdir /opt/monitor/prometheus/monitor_config/xl
[root@bigdata3 prometheus]# ls /opt/monitor/prometheus/monitor_config/
dmp  xl

Go into the dmp directory and create a file for each machine to be monitored:

[root@bigdata3 dmp]# cat 192.168.1.5.yml
- targets: [ "192.168.1.5:9275" ]
  labels:
    group: "monitor-server"
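With file-based service discovery, Prometheus re-reads the target files on its own, so adding a machine only means dropping one more file into the directory. A small generator keeps those files uniform. This is a sketch under assumptions: the helper name write_target is invented, and the demo writes into a temporary directory standing in for monitor_config/dmp.

```shell
set -eu

# write_target DIR HOST PORT GROUP: create DIR/HOST.yml in the file_sd format shown above.
write_target() {
  local dir=$1 host=$2 port=$3 group=$4
  mkdir -p "$dir"
  cat > "$dir/$host.yml" <<EOF
- targets: [ "${host}:${port}" ]
  labels:
    group: "${group}"
EOF
}

sd_dir=$(mktemp -d)/dmp         # stand-in for /opt/monitor/prometheus/monitor_config/dmp
write_target "$sd_dir" 192.168.1.5 9275 monitor-server
cat "$sd_dir/192.168.1.5.yml"
```

Note that the separator between host and port must be an ASCII colon; a full-width colon would make the target address invalid.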

7. Start the Prometheus process

/opt/monitor/prometheus/prometheus --config.file="/opt/monitor/prometheus/prometheus.yml" --web.enable-lifecycle > /opt/monitor/prometheus/prometheus.log 2>&1 &

Adding --web.enable-lifecycle at startup allows the configuration file to be reloaded remotely, without restarting the process.
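With that flag set, Prometheus accepts an HTTP POST to /-/reload and re-reads its configuration. A minimal sketch, assuming curl is installed; the helper names reload_url and reload_prometheus are made up, and the actual request is left commented out because it needs a running server.

```shell
set -eu

# reload_url BASE: build the reload endpoint from a base URL, tolerating a trailing slash.
reload_url() {
  printf '%s/-/reload\n' "${1%/}"
}

# reload_prometheus BASE: ask a running Prometheus to re-read its configuration.
reload_prometheus() {
  curl -fsS -X POST "$(reload_url "$1")"
}

reload_url http://192.168.1.5:9090/

# Against the live server from this walkthrough:
#   reload_prometheus http://192.168.1.5:9090
```

If the new configuration is invalid, Prometheus keeps running with the old one, so it is worth checking prometheus.log after a reload.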

Then open http://ip:9090 in a browser.