Prometheus
Architecture diagram
Environment preparation:
10.0.0.31  prometheus-server31  2 core  2GB+
10.0.0.32  prometheus-server32  1 core  1GB+
10.0.0.33  prometheus-server33  1 core  1GB+
10.0.0.41  node-exporter41      1 core  1GB+
10.0.0.42  node-exporter42      1 core  1GB+
10.0.0.43  node-exporter43      1 core  1GB+
10.0.0.51  grafana51            2 core  2GB+
- Prometheus overview:
Official site: https://prometheus.io/
GitHub: https://github.com/prometheus/prometheus
- Prometheus quick deployment
1. Download Prometheus server
[root@prometheus-server31 ~]# wget https://github.com/prometheus/prometheus/releases/download/v2.53.3/prometheus-2.53.3.linux-amd64.tar.gz
2. Create the working directory
[root@prometheus-server31 ~]# mkdir -pv /yanshier/softwares
3. Extract the package
[root@prometheus-server31 ~]# tar xf prometheus-2.53.3.linux-amd64.tar.gz -C /yanshier/softwares/
4. Start Prometheus server
[root@prometheus-server31 ~]# cd /yanshier/softwares/prometheus-2.53.3.linux-amd64/
[root@prometheus-server31 prometheus-2.53.3.linux-amd64]#
[root@prometheus-server31 prometheus-2.53.3.linux-amd64]# ll
total 261340
drwxr-xr-x 4 1001 colord 4096 Nov 5 20:37 ./
drwxr-xr-x 3 root root 4096 Nov 11 09:42 ../
drwxr-xr-x 2 1001 colord 4096 Nov 5 20:35 console_libraries/
drwxr-xr-x 2 1001 colord 4096 Nov 5 20:35 consoles/
-rw-r--r-- 1 1001 colord 11357 Nov 5 20:35 LICENSE
-rw-r--r-- 1 1001 colord 3773 Nov 5 20:35 NOTICE
-rwxr-xr-x 1 1001 colord 137839708 Nov 5 20:19 prometheus*
-rw-r--r-- 1 1001 colord 934 Nov 5 20:35 prometheus.yml
-rwxr-xr-x 1 1001 colord 129729365 Nov 5 20:19 promtool*
[root@prometheus-server31 prometheus-2.53.3.linux-amd64]#
[root@prometheus-server31 prometheus-2.53.3.linux-amd64]# ./prometheus
ts=2024-11-11T01:42:50.535Z caller=main.go:589 level=info msg="No time or size retention was set so using the default time retention" duration=15d
ts=2024-11-11T01:42:50.536Z caller=main.go:633 level=info msg="Starting Prometheus Server" mode=server version="(version=2.53.3, branch=HEAD, revision=1491d29fb1e8f8acbab29fd54fd4ce9be2cbd7bc)"
ts=2024-11-11T01:42:50.536Z caller=main.go:638 level=info build_context="(go=go1.22.8, platform=linux/amd64, user=root@c6939e39a10c, date=20241105-12:18:07, tags=netgo,builtinassets,stringlabels)"
ts=2024-11-11T01:42:50.536Z caller=main.go:639 level=info host_details="(Linux 5.15.0-119-generic #129-Ubuntu SMP Fri Aug 2 19:25:20 UTC 2024 x86_64 prometheus-server31 (none))"
ts=2024-11-11T01:42:50.536Z caller=main.go:640 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2024-11-11T01:42:50.536Z caller=main.go:641 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2024-11-11T01:42:50.544Z caller=web.go:568 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
ts=2024-11-11T01:42:50.545Z caller=main.go:1148 level=info msg="Starting TSDB ..."
ts=2024-11-11T01:42:50.545Z caller=tls_config.go:313 level=info component=web msg="Listening on" address=[::]:9090
...
5. Access the Prometheus WebUI
http://10.0.0.31:9090/
- Deploy node-exporter from the binary release
node_exporter exposes metrics for a Linux server: it collects CPU, memory, disk, network, and other usage data and serves them over HTTP.
1. Download node-exporter
[root@node-exporter41 ~]# wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
2. Create the working directory
[root@node-exporter41 ~]# mkdir -pv /yanshier/softwares
3. Extract node-exporter
[root@node-exporter41 ~]# tar xf node_exporter-1.8.2.linux-amd64.tar.gz -C /yanshier/softwares/
4. Start node-exporter
[root@node-exporter41 ~]# cd /yanshier/softwares/node_exporter-1.8.2.linux-amd64/
[root@node-exporter41 node_exporter-1.8.2.linux-amd64]#
[root@node-exporter41 node_exporter-1.8.2.linux-amd64]# ll
total 20048
drwxr-xr-x 2 1001 1002 4096 Jul 14 19:58 ./
drwxr-xr-x 3 root root 4096 Nov 11 10:29 ../
-rw-r--r-- 1 1001 1002 11357 Jul 14 19:57 LICENSE
-rwxr-xr-x 1 1001 1002 20500541 Jul 14 19:54 node_exporter*
-rw-r--r-- 1 1001 1002 463 Jul 14 19:57 NOTICE
[root@node-exporter41 node_exporter-1.8.2.linux-amd64]#
[root@node-exporter41 node_exporter-1.8.2.linux-amd64]# ./node_exporter
ts=2024-11-11T02:29:41.996Z caller=node_exporter.go:193 level=info msg="Starting node_exporter" version="(version=1.8.2, branch=HEAD, revision=f1e0e8360aa60b6cb5e5cc1560bed348fc2c1895)"
ts=2024-11-11T02:29:41.996Z caller=node_exporter.go:194 level=info msg="Build context" build_context="(go=go1.22.5, platform=linux/amd64, user=root@03d440803209, date=20240714-11:53:45, tags=unknown)"
ts=2024-11-11T02:29:41.996Z caller=node_exporter.go:196 level=warn msg="Node Exporter is running as root user. This exporter is designed to run as unprivileged user, root is not required."
ts=2024-11-11T02:29:41.997Z caller=filesystem_common.go:111 level=info collector=filesystem msg="Parsed flag --collector.filesystem.mount-points-exclude" flag=^/(dev|proc|run/credentials/.+|sys|var/lib/docker/.+|var/lib/containers/storage/.+)($|/)
ts=2024-11-11T02:29:41.997Z caller=filesystem_common.go:113 level=info collector=filesystem msg="Parsed flag --collector.filesystem.fs-types-exclude" flag=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$
ts=2024-11-11T02:29:41.998Z caller=diskstats_common.go:111 level=info collector=diskstats msg="Parsed flag --collector.diskstats.device-exclude" flag=^(z?ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$
ts=2024-11-11T02:29:41.998Z caller=node_exporter.go:111 level=info msg="Enabled collectors"
ts=2024-11-11T02:29:41.998Z caller=node_exporter.go:118 level=info collector=arp
ts=2024-11-11T02:29:41.998Z caller=node_exporter.go:118 level=info collector=bcache
ts=2024-11-11T02:29:41.998Z caller=node_exporter.go:118 level=info collector=bonding
ts=2024-11-11T02:29:41.998Z caller=node_exporter.go:118 level=info collector=btrfs
ts=2024-11-11T02:29:41.998Z caller=node_exporter.go:118 level=info collector=conntrack
ts=2024-11-11T02:29:41.998Z caller=node_exporter.go:118 level=info collector=cpu
ts=2024-11-11T02:29:41.998Z caller=node_exporter.go:118 level=info collector=cpufreq
ts=2024-11-11T02:29:41.998Z caller=node_exporter.go:118 level=info collector=diskstats
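...
5. Verify
http://10.0.0.41:9100/metrics
Format notes:
TYPE: the data type of the metric.
HELP: the help text of the metric.
metric: the specific metric name.
label: key/value tags attached to a metric that identify an individual series.
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.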
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 422.62
node_cpu_seconds_total{cpu="0",mode="iowait"} 0.17
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 0
node_cpu_seconds_total{cpu="0",mode="softirq"} 0.03
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 3.98
node_cpu_seconds_total{cpu="0",mode="user"} 2.46
node_cpu_seconds_total{cpu="1",mode="idle"} 424.3
node_cpu_seconds_total{cpu="1",mode="iowait"} 0.27
node_cpu_seconds_total{cpu="1",mode="irq"} 0
node_cpu_seconds_total{cpu="1",mode="nice"} 0
node_cpu_seconds_total{cpu="1",mode="softirq"} 0.14
node_cpu_seconds_total{cpu="1",mode="steal"} 0
node_cpu_seconds_total{cpu="1",mode="system"} 2.59
node_cpu_seconds_total{cpu="1",mode="user"} 2.44
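To make the format notes above concrete, here is a minimal Python sketch that parses sample lines like the ones shown into a metric name, a label dictionary, and a value. The helper name parse_sample and the regular expressions are illustrative assumptions, not part of any Prometheus library, and the sketch deliberately ignores optional timestamps and escaped quotes inside label values.

```python
import re

# Hypothetical helper patterns for lines such as:
#   node_cpu_seconds_total{cpu="0",mode="idle"} 422.62
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Return (metric_name, labels, value), or None for comments and blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # "# HELP" and "# TYPE" lines carry metadata, not samples
    match = SAMPLE_RE.match(line)
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    name, raw_labels, value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(value)


# A few lines copied from the /metrics output above.
text = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 422.62
node_cpu_seconds_total{cpu="0",mode="system"} 3.98
node_cpu_seconds_total{cpu="1",mode="idle"} 424.3
"""

samples = [s for s in map(parse_sample, text.splitlines()) if s is not None]
# Total idle seconds across both CPUs, selected by the "mode" label.
idle_seconds = sum(value for _, labels, value in samples if labels["mode"] == "idle")
print(samples[0])
print(round(idle_seconds, 2))
```

Each distinct combination of labels is its own time series, which is why the same metric name appears once per CPU and mode.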
- node-exporter collector allowlist and denylist
1. Collectors enabled and disabled by default:
https://github.com/prometheus/node_exporter?tab=readme-ov-file#enabled-by-default
https://github.com/prometheus/node_exporter?tab=readme-ov-file#disabled-by-default
2. Collector allowlist
[root@node-exporter41 node_exporter-1.8.2.linux-amd64]# ./node_exporter --collector.cpu --collector.uname --collector.disable-defaults
ts=2024-11-11T02:51:24.052Z caller=node_exporter.go:193 level=info msg="Starting node_exporter" version="(version=1.8.2, branch=HEAD, revision=f1e0e8360aa60b6cb5e5cc1560bed348fc2c1895)"
ts=2024-11-11T02:51:24.052Z caller=node_exporter.go:194 level=info msg="Build context" build_context="(go=go1.22.5, platform=linux/amd64, user=root@03d440803209, date=20240714-11:53:45, tags=unknown)"
ts=2024-11-11T02:51:24.052Z caller=node_exporter.go:196 level=warn msg="Node Exporter is running as root user. This exporter is designed to run as unprivileged user, root is not required."
ts=2024-11-11T02:51:24.053Z caller=node_exporter.go:111 level=info msg="Enabled collectors"
ts=2024-11-11T02:51:24.053Z caller=node_exporter.go:118 level=info collector=cpu
ts=2024-11-11T02:51:24.053Z caller=node_exporter.go:118 level=info collector=uname
ts=2024-11-11T02:51:24.053Z caller=tls_config.go:313 level=info msg="Listening on" address=[::]:9100
ts=2024-11-11T02:51:24.053Z caller=tls_config.go:316 level=info msg="TLS is disabled." http2=false address=[::]:9100
3. Collector denylist
[root@node-exporter41 node_exporter-1.8.2.linux-amd64]# ./node_exporter --no-collector.cpu
Note: pass "--no-collector.MODULE", replacing MODULE with the name of the collector you want to disable.
Tip: install node-exporter on all three nodes, 41 through 43.
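The two modes above can be summarized in a small Python sketch that builds the flag list for either an allowlist or a denylist. The function name collector_flags is a hypothetical helper for illustration; the flags it emits (--collector.disable-defaults, --collector.NAME, --no-collector.NAME) are the ones used in the commands above. Rejecting a mix of both lists is a simplification made by this sketch, not a rule of node_exporter.

```python
def collector_flags(allow=None, deny=None):
    """Build node_exporter collector flags (hypothetical helper).

    allow: enable only these collectors, after disabling all defaults.
    deny:  keep the defaults but disable these collectors.
    """
    if allow and deny:
        # Simplification of this sketch: handle one mode at a time.
        raise ValueError("use either an allowlist or a denylist, not both")
    if allow:
        return ["--collector.disable-defaults"] + [f"--collector.{name}" for name in allow]
    return [f"--no-collector.{name}" for name in (deny or [])]


# Allowlist: only the cpu and uname collectors.
print(" ".join(["./node_exporter"] + collector_flags(allow=["cpu", "uname"])))
# Denylist: all defaults except cpu.
print(" ".join(["./node_exporter"] + collector_flags(deny=["cpu"])))
```

Generating the flags from a list like this can help keep the three exporter nodes started with identical options.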
- Monitoring node-exporter with Prometheus in practice
1. Edit the Prometheus configuration file
[root@prometheus-server31 ~]# vim /yanshier/softwares/prometheus-2.53.3.linux-amd64/prometheus.yml
...
# Global configuration
global:
  # Interval at which targets are scraped
  scrape_interval: 3s
  ...
...
# Scrape configuration
scrape_configs:
  ...
  - job_name: "yanshier-node-exporter"
    metrics_path: "/metrics"
    scheme: "http"
    static_configs:
      - targets: ["10.0.0.41:9100","10.0.0.42:9100","10.0.0.43:9100"]
2. Reload the configuration
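[root@prometheus-server31 ~]# curl -X POST 10.0.0.31:9090/-/reload
Note: the /-/reload endpoint only works when Prometheus was started with the --web.enable-lifecycle flag. If it was started without that flag, as in the plain ./prometheus invocation earlier, send SIGHUP to the prometheus process or restart it instead.
3. Verify in the WebUI
http://10.0.0.31:9090/targets
http://10.0.0.31:9090/config
- Prometheus metric types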
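All data that Prometheus collects is referred to as metrics. The term does not describe a concrete data format; it is a unit of statistical measurement. Whenever a system or service needs monitoring, metrics are what you work with. Prometheus supports the following metric types, among others:
gauge: the simplest metric, a single value that reflects the current state and can go up or down. Examples: disk or memory usage.
counter: a cumulative count that starts at 0 and only ever increases; it drops back to 0 only when the process restarts. Example: the total number of user visits, from which visits per hour, day, week, or month can be derived. A value that can also decrease, such as a follower count, should be a gauge instead.
histogram: records the distribution of observations by counting them in configurable buckets, from which approximate quantiles (for example the median) can be estimated. Example: counting how many nginx response times fall within 1s, 2s, 5s, and 10s.
summary: similar to a histogram, but the quantiles are calculated on the client side and exposed directly, which covers cases where bucket-based estimates are not precise enough.
- First steps with PromQL
up
Whether a monitored target is alive: 1 if the last scrape succeeded, 0 if it failed.
node_load5 - 0
Queries the 5-minute load average; because of the arithmetic, the result is shown without the metric name.
node_cpu_seconds_total{instance="10.0.0.42:9100"}
Filters by label: the CPU series of node 10.0.0.42.
node_cpu_seconds_total{instance="10.0.0.42:9100",mode="idle",cpu="1"}
Multiple matchers can be combined, separated by commas.
node_cpu_seconds_total{instance="10.0.0.42:9100",cpu="1",mode!="idle"}
Selects every CPU mode other than "idle".
node_cpu_seconds_total{instance="10.0.0.42:9100",cpu="1",mode=~"i.*"}
Regular-expression matching: selects all modes that start with "i".
node_cpu_seconds_total{instance="10.0.0.42:9100",cpu="1",mode!~"i.*"}
The negated form: selects all modes that do not start with "i".
node_cpu_seconds_total{instance="10.0.0.42:9100",cpu="1",mode="idle"}[10s]
Returns the samples collected over the last 10s (a range vector).
- Common Prometheus functions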
Official documentation: https://prometheus.io/docs/prometheus/2.53/querying/functions/
Lab environment setup:
[root@node-exporter42 ~]# apt -y install stress
[root@node-exporter42 ~]#
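[root@node-exporter42 ~]# stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M --timeout 20m
increase function: for counter metrics, returns the total increase over a time range.
Example:
increase(node_cpu_seconds_total{mode="idle",cpu="0", instance="10.0.0.42:9100"}[1m])
The total increase in idle time over the last minute for CPU 0 on node "10.0.0.42:9100", selected with label matchers.
sum function: adds values together.
Example:
sum(increase(node_cpu_seconds_total{mode="idle",cpu="0"}[1m]))
The last-minute increase in idle time for CPU 0 on every node, summed into a single result.
avg function: returns the average.
avg(increase(node_cpu_seconds_total{mode="idle",cpu="0"}[1m]))
max function: returns the maximum.
max(increase(node_cpu_seconds_total{mode="idle",cpu="0"}[1m]))
min function: returns the minimum.
min(increase(node_cpu_seconds_total{mode="idle",cpu="0"}[1m]))
by clause: groups the results of an aggregation, similar to "GROUP BY" in MySQL.
Example:
sum(increase(node_cpu_seconds_total{mode="idle"}[1m])) by (instance)
The last-minute increase in CPU idle time, summed and grouped by instance.
rate function: over the given time range, returns the counter's average increase per second.
Example:
rate(node_cpu_seconds_total{mode="idle",cpu="0", instance="10.0.0.42:9100"}[1m])
The per-second increase in idle time over the last minute for CPU 0 on node "10.0.0.42:9100".
Choosing between increase and rate: both apply only to counters, and increase(v[1m]) is equivalent to rate(v[1m]) multiplied by 60 seconds, so the choice is mostly about which form is easier to read.
(1) For counters that change slowly or are scraped infrequently, increase is recommended, because it reads as a total over the window.
(2) For counters that are scraped frequently, such as CPU time or network traffic, rate is recommended, because it reads as a per-second value.
Gauges such as disk capacity or memory usage are queried directly, without rate or increase.
topk function: returns the k highest values. In practice it is typically used for point-in-time alerting rather than for viewing graphs.
Example:
topk(3, rate(node_cpu_seconds_total{mode="idle"}[1m]))
The per-second increase in idle CPU time over the last minute, showing only the 3 series with the highest values.
count function: counts the number of series that satisfy a condition. For example, in a company with 100 servers, 10 servers above 80% CPU usage may not need an alert, but more than 70 servers above that level does.
Example:
count(yanshier_tcp_wait_conn > 500)
Assuming yanshier_tcp_wait_conn is a custom metric, this returns the number of machines whose TCP wait count is above 500.
count(rate(node_cpu_seconds_total{cpu="0",mode="idle"}[1m]))
Counts the series returned by the inner expression.
- Calculating CPU usage per node
1. First find the CPU-related metric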
node_cpu_seconds_total
2. Select the CPU idle time ({mode='idle'}) and the total CPU time ({})
node_cpu_seconds_total{mode='idle'}
Selects the CPU idle time.
node_cpu_seconds_total{}
The '{}' can be omitted because it contains no matchers; this selects the CPU time for all modes.
3. Calculate the increase in CPU time over 1 minute
increase(node_cpu_seconds_total{mode='idle'}[1m])
The increase in CPU idle time over 1 minute.
increase(node_cpu_seconds_total[1m])
The increase in CPU time across all modes over 1 minute.
4. Sum the results
sum(increase(node_cpu_seconds_total{mode='idle'}[1m]))
Sums the 1-minute increase in idle time across all CPUs.
sum(increase(node_cpu_seconds_total[1m]))
Sums the 1-minute increase in CPU time across all CPUs and all modes.
5. Group by node
sum(increase(node_cpu_seconds_total{mode='idle'}[1m])) by (instance)
Sums the 1-minute increase in idle time across all CPUs, grouped by instance.
sum(increase(node_cpu_seconds_total[1m])) by (instance)
Sums the 1-minute increase in CPU time across all CPUs and all modes, grouped by instance.
6. Calculate the fraction of CPU time spent idle over 1 minute
sum(increase(node_cpu_seconds_total{mode='idle'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)
7. Calculate CPU usage over 1 minute. Formula: (1 - idle fraction) * 100%.
(1 - sum(increase(node_cpu_seconds_total{mode='idle'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100
8. Calculate CPU usage over 1 hour. Formula: (1 - idle fraction) * 100%.
(1 - sum(increase(node_cpu_seconds_total{mode='idle'}[1h])) by (instance) / sum(increase(node_cpu_seconds_total[1h])) by (instance)) * 100
Final results:
1. CPU usage over 1 minute
(1 - (sum(increase(node_cpu_seconds_total{mode='idle'}[1m])) by (instance)) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100
2. Percentage of CPU time in user mode over 1 minute
sum(increase(node_cpu_seconds_total{mode='user'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance) * 100
3. Percentage of CPU time in kernel (system) mode over 1 minute
(sum(increase(node_cpu_seconds_total{mode='system'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100
4. Percentage of CPU time spent in IO wait over 1 minute
(sum(increase(node_cpu_seconds_total{mode='iowait'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100
5. Percentage of CPU nice time over 1 minute
(sum(increase(node_cpu_seconds_total{mode='nice'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100
6. Percentage of CPU softirq (software interrupt) time over 1 minute
(sum(increase(node_cpu_seconds_total{mode='softirq'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100
7. Percentage of CPU steal time (time taken from the virtual machine by the physical host) over 1 minute
(sum(increase(node_cpu_seconds_total{mode='steal'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100
8. Percentage of CPU irq (hardware interrupt) time over 1 minute
(sum(increase(node_cpu_seconds_total{mode='irq'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100
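The arithmetic behind the CPU usage query can be followed by hand with a small Python sketch. The counter readings below are made-up values taken 60 seconds apart, and the helpers increase and sum_by_instance are hypothetical stand-ins for the PromQL operations of the same name. The sketch assumes no counter resets and skips the extrapolation that the real increase() applies at the window edges.

```python
from collections import defaultdict

# Hypothetical readings of node_cpu_seconds_total, 60 seconds apart.
# Keys are (instance, cpu, mode); values are cumulative seconds.
start = {
    ("10.0.0.41:9100", "0", "idle"): 400.0,
    ("10.0.0.41:9100", "0", "user"): 10.0,
    ("10.0.0.41:9100", "0", "system"): 5.0,
    ("10.0.0.42:9100", "0", "idle"): 300.0,
    ("10.0.0.42:9100", "0", "user"): 50.0,
    ("10.0.0.42:9100", "0", "system"): 20.0,
}
end = {
    ("10.0.0.41:9100", "0", "idle"): 445.0,
    ("10.0.0.41:9100", "0", "user"): 20.0,
    ("10.0.0.41:9100", "0", "system"): 10.0,
    ("10.0.0.42:9100", "0", "idle"): 315.0,
    ("10.0.0.42:9100", "0", "user"): 80.0,
    ("10.0.0.42:9100", "0", "system"): 35.0,
}


def increase(first, last):
    """Growth of every series between the two readings (simplified increase())."""
    return {key: last[key] - first[key] for key in first}


def sum_by_instance(series, mode=None):
    """Like sum(...) by (instance), optionally keeping only one mode."""
    totals = defaultdict(float)
    for (instance, _cpu, series_mode), value in series.items():
        if mode is None or series_mode == mode:
            totals[instance] += value
    return dict(totals)


delta = increase(start, end)
idle = sum_by_instance(delta, mode="idle")   # numerator of the idle fraction
total = sum_by_instance(delta)               # denominator: all modes
usage = {inst: round((1 - idle[inst] / total[inst]) * 100, 2) for inst in total}
print(usage)  # {'10.0.0.41:9100': 25.0, '10.0.0.42:9100': 75.0}
```

On node 10.0.0.41 the idle counter grew by 45 of the 60 CPU-seconds in the window, so usage is (1 - 45/60) * 100 = 25%; on node 10.0.0.42 it grew by only 15, giving 75%.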
- Deploying Grafana on Ubuntu
Grafana is an open-source visualization tool that can display data using Prometheus as a data source.
1. Prepare the MySQL database for Grafana
[root@prometheus-server31 ~]# wget http://192.168.13.253/Resources/Docker/softwares/yanshier-autoinstall-docker-docker-compose.tar.gz
[root@prometheus-server31 ~]# tar xf yanshier-autoinstall-docker-docker-compose.tar.gz
[root@prometheus-server31 ~]# ./install-docker.sh i
[root@prometheus-server31 ~]# wget http://192.168.13.253/Resources/Docker/images/WordPress/yanshier-mysql-v8.0.36-oracle.tar.gz
[root@prometheus-server31 ~]# docker load -i yanshier-mysql-v8.0.36-oracle.tar.gz
[root@prometheus-server31 ~]# docker run -d --network host --restart always -e MYSQL_ALLOW_EMPTY_PASSWORD=yes -e MYSQL_DATABASE=prometheus -e MYSQL_USER=grafana -e MYSQL_PASSWORD=yanshier --name mysql-server mysql:8.0.36-oracle
[root@prometheus-server31 ~]# docker exec -it mysql-server mysql
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 8
Server version: 8.0.36 MySQL Community Server - GPL
Copyright (c) 2000, 2024, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> SHOW DATABASES;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
| prometheus |
| sys |
+--------------------+
5 rows in set (0.01 sec)
mysql>
mysql> SHOW TABLES FROM prometheus;
Empty set (0.01 sec)
mysql>
mysql>
2. Install the Grafana dependencies
[root@grafana51 ~]# apt-get install -y adduser libfontconfig1 musl
3. Download Grafana
[root@grafana51 ~]# wget https://dl.grafana.com/enterprise/release/grafana-enterprise_9.5.21_amd64.deb
4. Install Grafana
[root@grafana51 ~]# dpkg -i grafana-enterprise_9.5.21_amd64.deb
5. Edit the Grafana configuration file [adjust to match your MySQL environment]
[root@grafana51 ~]# vim /etc/grafana/grafana.ini
...
[database]
...
type = mysql
host = 10.0.0.31:3306
name = prometheus
user = grafana
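password = yanshier
Tip: if MySQL is not configured as Grafana's database, Grafana stores its data in a sqlite3 database under "/var/lib/grafana" by default. sqlite3 performs worse than MySQL, so for production it is advisable to switch Grafana's database to MySQL.
6. Start Grafana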
[root@grafana51 ~]# systemctl enable --now grafana-server.service
[root@grafana51 ~]#
[root@grafana51 ~]# ss -ntl | grep 3000
LISTEN 0 4096 *:3000 *:*
[root@grafana51 ~]#
7. Access the Grafana WebUI
http://10.0.0.51:3000/
The default username and password are both: admin
- Grafana basic usage
Free, open-source Grafana dashboards are available at:
https://grafana.com/grafana/dashboards/
---------------------------------------------------------
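Summary: Prometheus pulls data from the Pushgateway and the exporters, writes it to its storage, and provides a WebUI for ad hoc queries. Queries are written in PromQL, and Grafana renders the results as graphs. API clients: developers work with the data through client libraries.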
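As one way an API client could run the same ad hoc queries outside the WebUI, the Python sketch below builds a URL for the Prometheus HTTP API instant-query endpoint (/api/v1/query) and parses a response body. The helper names instant_query_url and parse_vector are hypothetical, and the response body is a hand-written sample in the documented shape rather than output captured from this environment; no network request is made here.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(body):
    """Turn an instant-vector response into {instance: value}."""
    payload = json.loads(body)
    if payload["status"] != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    return {
        item["metric"].get("instance", ""): float(item["value"][1])
        for item in payload["data"]["result"]
    }


# Hypothetical response body for the query `up`; real values depend on your targets.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "instance": "10.0.0.41:9100",
                        "job": "yanshier-node-exporter"},
             "value": [1731290000.0, "1"]},
            {"metric": {"__name__": "up", "instance": "10.0.0.42:9100",
                        "job": "yanshier-node-exporter"},
             "value": [1731290000.0, "0"]},
        ],
    },
})

print(instant_query_url("http://10.0.0.31:9090", "up"))
print(parse_vector(sample_body))
```

Sample values arrive as strings inside a [timestamp, value] pair, which is why the sketch converts them with float before use.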