Prometheus

Summary

Parts of this article are adapted from the official Prometheus configuration documentation.

Configuration

When Prometheus starts, it can be pointed at a configuration file with the -config.file flag; the default is prometheus.yml.

In the configuration file we can specify:

  • global: global configuration
  • alerting: alerting configuration
  • rule_files: rule file configuration
  • scrape_configs: scrape configuration
  • remote_write: remote write storage
  • remote_read: remote read storage

Global configuration (global)

This section holds the global defaults and has four main settings:

  • scrape_interval: default interval at which targets are scraped.
  • scrape_timeout: timeout for scraping a single target.
  • evaluation_interval: interval at which rules are evaluated.
  • external_labels: extra labels attached to any time series or alerts when communicating with external systems (federation, remote storage, Alertmanager).
global:
  scrape_interval: 15s     # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # By default, evaluate rules every 15 seconds.
  scrape_timeout: 10s      # Set to the global default (10s).
  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

Alerting configuration

Alertmanager can also be configured with the -alertmanager.xxx command-line flags, but that approach is inflexible: the settings cannot be reloaded dynamically, and alert attributes cannot be rewritten on the fly.

The alerting section solves this problem and manages Alertmanager more cleanly. It contains two settings:

  • alert_relabel_configs: relabeling rules that dynamically rewrite alert attributes.
  • alertmanagers: configuration for dynamically discovering Alertmanager instances.

The configuration structure is roughly:

# Alerting specifies settings related to the Alertmanager.
alerting:
  alert_relabel_configs:
    [ - <relabel_config> ... ]
  alertmanagers:
    [ - <alertmanager_config> ... ]
Each <alertmanager_config> entry has roughly the following structure:

# Per-target Alertmanager timeout when pushing alerts.
[ timeout: <duration> | default = 10s ]
# Prefix for the HTTP path alerts are pushed to.
[ path_prefix: <path> | default = / ]
# Configures the protocol scheme used for requests.
[ scheme: <scheme> | default = http ]
# Sets the `Authorization` header on every request with the
# configured username and password.
basic_auth:
  [ username: <string> ]
  [ password: <string> ]
# Sets the `Authorization` header on every request with
# the configured bearer token. It is mutually exclusive with `bearer_token_file`.
[ bearer_token: <string> ]
# Sets the `Authorization` header on every request with the bearer token
# read from the configured file. It is mutually exclusive with `bearer_token`.
[ bearer_token_file: /path/to/bearer/token/file ]
# Configures the scrape request's TLS settings.
tls_config:
  [ <tls_config> ]
# Optional proxy URL.
[ proxy_url: <string> ]
# List of Azure service discovery configurations.
azure_sd_configs:
  [ - <azure_sd_config> ... ]
# List of Consul service discovery configurations.
consul_sd_configs:
  [ - <consul_sd_config> ... ]
# List of DNS service discovery configurations.
dns_sd_configs:
  [ - <dns_sd_config> ... ]
# List of EC2 service discovery configurations.
ec2_sd_configs:
  [ - <ec2_sd_config> ... ]
# List of file service discovery configurations.
file_sd_configs:
  [ - <file_sd_config> ... ]
# List of GCE service discovery configurations.
gce_sd_configs:
  [ - <gce_sd_config> ... ]
# List of Kubernetes service discovery configurations.
kubernetes_sd_configs:
  [ - <kubernetes_sd_config> ... ]
# List of Marathon service discovery configurations.
marathon_sd_configs:
  [ - <marathon_sd_config> ... ]
# List of AirBnB's Nerve service discovery configurations.
nerve_sd_configs:
  [ - <nerve_sd_config> ... ]
# List of Zookeeper Serverset service discovery configurations.
serverset_sd_configs:
  [ - <serverset_sd_config> ... ]
# List of Triton service discovery configurations.
triton_sd_configs:
  [ - <triton_sd_config> ... ]
# List of labeled statically configured Alertmanagers.
static_configs:
  [ - <static_config> ... ]
# List of Alertmanager relabel configurations.
relabel_configs:
  [ - <relabel_config> ... ]
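
As a concrete illustration, a minimal sketch of an alerting block that pushes to a single statically configured Alertmanager (the address below is a hypothetical example) could be:

alerting:
  alertmanagers:
    - scheme: http    # push alerts over plain HTTP
      timeout: 10s    # per-target push timeout
      static_configs:
        - targets: ['alertmanager.example.com:9093']  # hypothetical Alertmanager address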

Rule configuration

rule_files lists the rule files to load; it accepts multiple files as well as glob patterns.

In the source code it is defined as:

RuleFiles []string `yaml:"rule_files,omitempty"`

A typical configuration looks like:

rule_files:
  - "rules/node.rules"
  - "rules2/*.rules"

Scrape configuration

scrape_configs defines the targets to scrape; each scrape configuration mainly contains the following settings:

  • job_name: name of the job
  • honor_labels: controls label conflicts between scraped data and server-side labels; when set to true, the labels in the scraped data win, otherwise the server-side configuration wins
  • params: optional HTTP URL parameters sent with each scrape request
  • scrape_interval: scrape interval
  • scrape_timeout: per-scrape timeout
  • metrics_path: HTTP path on the target from which to fetch metrics
  • scheme: protocol scheme used for scrape requests
  • sample_limit: per-scrape limit on the number of accepted samples; if it is exceeded, the entire scrape is treated as failed and nothing is stored; the default of 0 means no limit
  • relabel_configs: target relabeling configuration
  • metric_relabel_configs: metric relabeling configuration

ServiceDiscoveryConfig is responsible for target discovery and falls into two broad categories: static configuration and dynamic discovery.
A complete scrape_configs entry therefore looks roughly like:

# The job name assigned to scraped metrics by default.
job_name: <job_name>
# How frequently to scrape targets from this job.
[ scrape_interval: <duration> | default = <global_config.scrape_interval> ]
# Per-scrape timeout when scraping this job.
[ scrape_timeout: <duration> | default = <global_config.scrape_timeout> ]
# The HTTP resource path on which to fetch metrics from targets.
[ metrics_path: <path> | default = /metrics ]
# honor_labels controls how Prometheus handles conflicts between labels that are
# already present in scraped data and labels that Prometheus would attach
# server-side ("job" and "instance" labels, manually configured target
# labels, and labels generated by service discovery implementations).
#
# If honor_labels is set to "true", label conflicts are resolved by keeping label
# values from the scraped data and ignoring the conflicting server-side labels.
#
# If honor_labels is set to "false", label conflicts are resolved by renaming
# conflicting labels in the scraped data to "exported_<original-label>" (for
# example "exported_instance", "exported_job") and then attaching server-side
# labels. This is useful for use cases such as federation, where all labels
# specified in the target should be preserved.
#
# Note that any globally configured "external_labels" are unaffected by this
# setting. In communication with external systems, they are always applied only
# when a time series does not have a given label yet and are ignored otherwise.
[ honor_labels: <boolean> | default = false ]
# Configures the protocol scheme used for requests.
[ scheme: <scheme> | default = http ]
# Optional HTTP URL parameters.
params:
  [ <string>: [<string>, ...] ]
# Sets the `Authorization` header on every scrape request with the
# configured username and password.
basic_auth:
  [ username: <string> ]
  [ password: <string> ]
# Sets the `Authorization` header on every scrape request with
# the configured bearer token. It is mutually exclusive with `bearer_token_file`.
[ bearer_token: <string> ]
# Sets the `Authorization` header on every scrape request with the bearer token
# read from the configured file. It is mutually exclusive with `bearer_token`.
[ bearer_token_file: /path/to/bearer/token/file ]
# Configures the scrape request's TLS settings.
tls_config:
  [ <tls_config> ]
# Optional proxy URL.
[ proxy_url: <string> ]
# List of Azure service discovery configurations.
azure_sd_configs:
  [ - <azure_sd_config> ... ]
# List of Consul service discovery configurations.
consul_sd_configs:
  [ - <consul_sd_config> ... ]
# List of DNS service discovery configurations.
dns_sd_configs:
  [ - <dns_sd_config> ... ]
# List of EC2 service discovery configurations.
ec2_sd_configs:
  [ - <ec2_sd_config> ... ]
# List of OpenStack service discovery configurations.
openstack_sd_configs:
  [ - <openstack_sd_config> ... ]
# List of file service discovery configurations.
file_sd_configs:
  [ - <file_sd_config> ... ]
# List of GCE service discovery configurations.
gce_sd_configs:
  [ - <gce_sd_config> ... ]
# List of Kubernetes service discovery configurations.
kubernetes_sd_configs:
  [ - <kubernetes_sd_config> ... ]
# List of Marathon service discovery configurations.
marathon_sd_configs:
  [ - <marathon_sd_config> ... ]
# List of AirBnB's Nerve service discovery configurations.
nerve_sd_configs:
  [ - <nerve_sd_config> ... ]
# List of Zookeeper Serverset service discovery configurations.
serverset_sd_configs:
  [ - <serverset_sd_config> ... ]
# List of Triton service discovery configurations.
triton_sd_configs:
  [ - <triton_sd_config> ... ]
# List of labeled statically configured targets for this job.
static_configs:
  [ - <static_config> ... ]
# List of target relabel configurations.
relabel_configs:
  [ - <relabel_config> ... ]
# List of metric relabel configurations.
metric_relabel_configs:
  [ - <relabel_config> ... ]
# Per-scrape limit on number of scraped samples that will be accepted.
# If more than this number of samples are present after metric relabelling
# the entire scrape will be treated as failed. 0 means no limit.
[ sample_limit: <int> | default = 0 ]
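
To make the template concrete, here is a small sketch of a job that exercises several of these settings; the job name, target address, and module parameter are hypothetical:

scrape_configs:
  - job_name: 'app'                # hypothetical job name
    scrape_interval: 30s
    scrape_timeout: 10s
    metrics_path: /metrics
    scheme: https
    honor_labels: true             # on conflict, keep the labels from the scraped data
    params:
      module: [http_2xx]           # sent as ?module=http_2xx on every scrape
    sample_limit: 10000            # fail the scrape if more samples remain after relabelling
    static_configs:
      - targets: ['app-01.example.com:8443']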

Remote write storage

remote_write configures writable remote storage; its main settings are:

  • url: URL of the remote write endpoint
  • remote_timeout: request timeout
  • write_relabel_configs: relabeling configuration applied to scraped samples before they are sent to the remote store

A complete configuration looks roughly like:

# The URL of the endpoint to send samples to.
url: <string>
# Timeout for requests to the remote write endpoint.
[ remote_timeout: <duration> | default = 30s ]
# List of remote write relabel configurations.
write_relabel_configs:
  [ - <relabel_config> ... ]
# Sets the `Authorization` header on every remote write request with the
# configured username and password.
basic_auth:
  [ username: <string> ]
  [ password: <string> ]
# Sets the `Authorization` header on every remote write request with
# the configured bearer token. It is mutually exclusive with `bearer_token_file`.
[ bearer_token: <string> ]
# Sets the `Authorization` header on every remote write request with the bearer token
# read from the configured file. It is mutually exclusive with `bearer_token`.
[ bearer_token_file: /path/to/bearer/token/file ]
# Configures the remote write request's TLS settings.
tls_config:
  [ <tls_config> ]
# Optional proxy URL.
[ proxy_url: <string> ]
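
As a sketch, a remote_write entry that forwards only node_exporter series to a hypothetical adapter endpoint might look like:

remote_write:
  - url: "http://remote-adapter.example.com:9201/write"  # hypothetical write endpoint
    remote_timeout: 30s
    write_relabel_configs:
      # Drop everything except series whose name starts with node_.
      - source_labels: [__name__]
        regex: 'node_.*'
        action: keep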

Remote read storage

remote_read configures readable remote storage; its main settings are:

  • url: URL of the remote read endpoint
  • remote_timeout: request timeout

A complete configuration looks roughly like:

# The URL of the endpoint to query from.
url: <string>
# Timeout for requests to the remote read endpoint.
[ remote_timeout: <duration> | default = 30s ]
# Sets the `Authorization` header on every remote read request with the
# configured username and password.
basic_auth:
  [ username: <string> ]
  [ password: <string> ]
# Sets the `Authorization` header on every remote read request with
# the configured bearer token. It is mutually exclusive with `bearer_token_file`.
[ bearer_token: <string> ]
# Sets the `Authorization` header on every remote read request with the bearer token
# read from the configured file. It is mutually exclusive with `bearer_token`.
[ bearer_token_file: /path/to/bearer/token/file ]
# Configures the remote read request's TLS settings.
tls_config:
  [ <tls_config> ]
# Optional proxy URL.
[ proxy_url: <string> ]
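
A minimal concrete entry (the endpoint is again hypothetical) could be:

remote_read:
  - url: "http://remote-adapter.example.com:9201/read"  # hypothetical read endpoint
    remote_timeout: 30s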

Service discovery

One of the most important concepts in a Prometheus configuration is the target, the source of the data. Targets are configured either statically or through dynamic discovery, in roughly the following flavors:

  • static_configs: static targets
  • dns_sd_configs: DNS service discovery
  • file_sd_configs: file-based service discovery
  • consul_sd_configs: Consul service discovery
  • serverset_sd_configs: Serverset service discovery
  • nerve_sd_configs: Nerve service discovery
  • marathon_sd_configs: Marathon service discovery
  • kubernetes_sd_configs: Kubernetes service discovery
  • gce_sd_configs: GCE service discovery
  • ec2_sd_configs: EC2 service discovery
  • openstack_sd_configs: OpenStack service discovery
  • azure_sd_configs: Azure service discovery
  • triton_sd_configs: Triton service discovery

For concrete usage and configuration templates, see the service discovery configuration templates.

The most important and most widely used of these is static_configs; the dynamic mechanisms can be viewed as platform-specific wrappers that ultimately produce the same kind of static target lists.
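
For example, file_sd_configs watches files on disk and reloads the target list whenever they change, so targets can be managed outside prometheus.yml. A hypothetical target file (file_sd accepts JSON as well as YAML) might look like:

# /etc/prometheus/targets/web.yml -- hypothetical file_sd target file
- targets:
    - 'web-01.example.com:9100'
    - 'web-02.example.com:9100'
  labels:
    cluster: web
    env: prod

The 'targets' job in the example configuration below consumes files of exactly this shape.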

Example configuration

Prometheus has many configuration options, but the most commonly used are global, rule_files, scrape_configs, static_configs, and relabel_configs.

# my global config
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
                           # How often samples are pulled from monitored targets, i.e. the collection frequency.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
                           # How often Prometheus evaluates its rules; for example, an alerting rule
                           # is checked against this interval, i.e. every 15 seconds here.

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "10.0.0.1:9093"

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/etc/prometheus/rules/*.yml"
  - "/etc/prometheus/recordrules/*.yml"
  - "rules/node.rules"

# Can be used for high availability and read/write splitting.
remote_read:
  - url: "http://k8s-01/api/v1/read"
    remote_timeout: 30s
    read_recent: true
  - url: "http://k8s-02/api/v1/read"
    remote_timeout: 30s
    read_recent: true
  # Several `Prometheus Server` instances store the data of different nodes;
  # this `Prometheus Server` can then query the data of all of them.
  - url: 'http://localhost:9091/api/v1/read'
    remote_timeout: 8s
  - url: 'http://localhost:9092/api/v1/read'
    remote_timeout: 8s

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    scrape_interval: 5s
    # Statically defined targets.
    static_configs:
      - targets: ['127.0.0.1:9090']
      - targets: ['127.0.0.1:9100']
    relabel_configs:
      - source_labels: [__address__]
        regex: (.+):(.+)
        target_label: nodename
        replacement: $1
        action: replace
  - job_name: 'node'
    scrape_interval: 8s
    static_configs:
      - targets: ['127.0.0.1:9100', '127.0.0.12:9100']
  - job_name: 'targets'
    file_sd_configs:
      - files:
          - "/etc/prometheus/targets/*.json"
          - "/etc/prometheus/apps/*.json"
  - job_name: 'etcd'
    static_configs:
      - targets: ['etcd-v01:2379', 'etcd-v02:2379', 'etcd-v03:2379']
  - job_name: 'project1'
    static_configs:
      - targets:
          - prod-01:9100
          - prod-02:9100
        labels:
          cluster: prod
      - targets:
          - dev-01:9100
          - dev-02:9100
        labels:
          cluster: dev
    relabel_configs:
      - source_labels: [__address__]
        regex: (.+):(.+)
        target_label: nodename
        replacement: $1
        action: replace
  - job_name: 'api01'
    metrics_path: '/api'
    static_configs:
      - targets:
          - 'api-01:1101'
          - 'api-02:1101'

Exporter

In Prometheus, the programs that expose metrics are collectively called exporters, and different exporters cover different domains. They follow a common naming convention, xx_exporter; for example, node_exporter is responsible for collecting host-level metrics.

The Prometheus community already provides many exporters; see the exporter list for details.

Common Node Exporter queries

Once node_exporter data is being collected, we can use PromQL to answer operational questions and drive monitoring. Below are some common queries.

Note: each query below targets a single node as an example; to cover all nodes, simply drop the instance="xxx" matcher.

CPU usage

100 - (avg by (instance) (irate(node_cpu{instance="xxx", mode="idle"}[5m])) * 100)
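
As a sketch, the same expression can back an alerting rule (Prometheus 2.x YAML rule format; the threshold and duration are illustrative):

groups:
  - name: node-cpu
    rules:
      - alert: HighCpuUsage
        # Fire when average CPU usage stays above 90% for 10 minutes.
        expr: 100 - (avg by (instance) (irate(node_cpu{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: warning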

Per-mode CPU share

avg by (instance, mode) (irate(node_cpu{instance="xxx"}[5m])) * 100

Load average

node_load1{instance="xxx"}  # 1-minute load average
node_load5{instance="xxx"}  # 5-minute load average
node_load15{instance="xxx"} # 15-minute load average

Memory usage

100 - ((node_memory_MemFree{instance="xxx"} + node_memory_Cached{instance="xxx"} + node_memory_Buffers{instance="xxx"}) / node_memory_MemTotal{instance="xxx"}) * 100

Disk usage

100 - node_filesystem_free{instance="xxx",fstype!~"rootfs|selinuxfs|autofs|rpc_pipefs|tmpfs|udev|none|devpts|sysfs|debugfs|fuse.*"} / node_filesystem_size{instance="xxx",fstype!~"rootfs|selinuxfs|autofs|rpc_pipefs|tmpfs|udev|none|devpts|sysfs|debugfs|fuse.*"} * 100
# Alternatively, use {fstype="xxx"} to look at a specific filesystem type directly.

Network I/O

# Inbound bandwidth (dividing bytes/s by 128 converts to Kibit/s)
sum by (instance) (irate(node_network_receive_bytes{instance="xxx",device!~"bond.*?|lo"}[5m])/128)
# Outbound bandwidth
sum by (instance) (irate(node_network_transmit_bytes{instance="xxx",device!~"bond.*?|lo"}[5m])/128)

Network interface packets in/out

# Inbound packet rate
sum by (instance) (rate(node_network_receive_packets{instance="xxx",device!="lo"}[5m]))
# Outbound packet rate
sum by (instance) (rate(node_network_transmit_packets{instance="xxx",device!="lo"}[5m]))