Simple ZooKeeper cluster installation and monitoring

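Abstract

Notes compiled while learning ZooKeeper.

This article covers:

  • Installing a standalone ZooKeeper server
  • Installing a ZooKeeper cluster and scaling it out
  • ZooKeeper monitoring data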

Installing ZooKeeper

Set environment variables

WORKDIR="/usr/local/src"
INSTALL_DIR="/usr/local/product"
SOFT_DIR="/usr/local"
VERSION="3.4.12"

Create the zookeeper user

useradd zookeeper

Download the installation package

# tar -C fails if the target directory does not exist
mkdir -p ${INSTALL_DIR}
cd ${WORKDIR}
wget https://mirrors.aliyun.com/apache/zookeeper/zookeeper-${VERSION}/zookeeper-${VERSION}.tar.gz
tar xvf zookeeper-${VERSION}.tar.gz -C ${INSTALL_DIR}
ln -s ${INSTALL_DIR}/zookeeper-${VERSION} ${SOFT_DIR}/zookeeper

Create the configuration file

cd /usr/local/zookeeper/conf
# the echo below overwrites this copy with a minimal configuration
cp zoo_sample.cfg zoo.cfg

echo -e "tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper/data
clientPort=2181" > zoo.cfg

# log directory
echo -e "ZOO_LOG_DIR=/data/logs/zookeeper" > zookeeper-env.sh

mkdir -p /data/zookeeper/data
mkdir -p /data/logs/zookeeper

# user:group is the portable separator; the user.group form is deprecated
chown -R zookeeper:zookeeper /data/zookeeper/
chown -R zookeeper:zookeeper /data/logs/zookeeper/
chown -R zookeeper:zookeeper /usr/local/zookeeper/

Add environment variables

echo -e "export ZOOKEEPER_HOME=/usr/local/zookeeper
export PATH=\$ZOOKEEPER_HOME/bin:\$PATH" >> /etc/profile
source /etc/profile

Start the service

su - zookeeper
zkServer.sh start

ZooKeeper JMX enabled by default
Using config: /usr/local/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED

Other commands

zkServer.sh --help
ZooKeeper JMX enabled by default
Using config: /usr/local/zookeeper/bin/../conf/zoo.cfg
Usage: /usr/local/zookeeper/bin/zkServer.sh {start|start-foreground|stop|restart|status|upgrade|print-cmd}

zkServer.sh status
zkServer.sh start-foreground

zkCli.sh -server localhost:2181

ZooKeeper cluster mode

cd /usr/local/zookeeper/conf

# the same zoo.cfg goes on every node
# format: server.<myid>=<host>:<quorum port>:<leader election port>
echo -e "tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper/data
clientPort=2181
server.1=172.17.8.32:2889:3889
server.2=172.17.8.33:2889:3889
server.3=172.17.8.34:2889:3889" > zoo.cfg

cd /data/zookeeper/data
# run exactly one of these per node, matching that node's server.N entry
echo 1 > myid   # on 172.17.8.32
echo 2 > myid   # on 172.17.8.33
echo 3 > myid   # on 172.17.8.34

chown -R zookeeper:zookeeper /data/zookeeper/
chown -R zookeeper:zookeeper /data/logs/zookeeper/
chown -R zookeeper:zookeeper /usr/local/zookeeper/
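Because every node shares the same server.N list and differs only in its myid, both can be derived from a single host list. The following is a minimal sketch of that idea, not part of the original setup: `render_zoo_cfg` and `myid_for` are hypothetical helper names, ids are assumed to follow list order, and the ports default to the 2889/3889 values used above.

```python
def render_zoo_cfg(hosts, quorum_port=2889, election_port=3889):
    """Return zoo.cfg text for an ensemble; ids are assigned 1..N in list order."""
    lines = [
        "tickTime=2000",
        "initLimit=10",
        "syncLimit=5",
        "dataDir=/data/zookeeper/data",
        "clientPort=2181",
    ]
    for myid, host in enumerate(hosts, start=1):
        lines.append("server.%d=%s:%d:%d" % (myid, host, quorum_port, election_port))
    return "\n".join(lines) + "\n"


def myid_for(hosts, host):
    """Return the number that belongs in dataDir/myid on the given host."""
    return hosts.index(host) + 1


if __name__ == "__main__":
    hosts = ["172.17.8.32", "172.17.8.33", "172.17.8.34"]
    print(render_zoo_cfg(hosts), end="")
    for h in hosts:
        print("%s -> myid %d" % (h, myid_for(hosts, h)))
```

Generating both from one list makes it harder for a myid to drift out of step with its server.N line.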

Check node status

echo srvr | nc 127.0.0.1 2181

Zookeeper version: 3.4.12-e5259e437540f349646870ea94dc2658c4e44b3b, built on 03/27/2018 03:55 GMT
Latency min/avg/max: 0/0/0
Received: 1
Sent: 0
Connections: 1
Outstanding: 0
Zxid: 0x1c000000cf
Mode: leader
Node count: 3003
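For scripted checks, the srvr reply can be turned into a dictionary. This is a small sketch under the assumption that every field uses the "Name: value" layout shown in the output above; `parse_srvr` is a hypothetical helper name.

```python
def parse_srvr(text):
    """Parse 'srvr' output into a dict; lines without ': ' are ignored."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            info[key.strip()] = value.strip()
    return info


# Sample reply copied from the output above.
SAMPLE = """Zookeeper version: 3.4.12-e5259e437540f349646870ea94dc2658c4e44b3b, built on 03/27/2018 03:55 GMT
Latency min/avg/max: 0/0/0
Received: 1
Sent: 0
Connections: 1
Outstanding: 0
Zxid: 0x1c000000cf
Mode: leader
Node count: 3003
"""

if __name__ == "__main__":
    info = parse_srvr(SAMPLE)
    print(info["Mode"], info["Node count"], info["Latency min/avg/max"])
```

All values are kept as strings; convert the ones you need (for example `int(info["Node count"])`) at the point of use.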

Scaling out

To add two new nodes, extend the configuration on every node:

cd /usr/local/zookeeper/conf

echo -e "tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper/data
clientPort=2181
server.1=172.17.8.32:2889:3889
server.2=172.17.8.33:2889:3889
server.3=172.17.8.34:2889:3889
server.4=172.17.8.35:2889:3889
server.5=172.17.8.36:2889:3889" > zoo.cfg

cd /data/zookeeper/data
# on the new nodes only, one id per node
echo 4 > myid   # on 172.17.8.35
echo 5 > myid   # on 172.17.8.36

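Start the new nodes first, then restart all the other nodes one at a time:

zkServer.sh start      # on each new node
zkServer.sh restart    # on each existing node, one by one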
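The ordering can be written down as a plan before touching any server. The sketch below is illustrative only: `rollout_plan` and `majority` are hypothetical names, and the majority rule (more than half of the configured servers must be up for the ensemble to serve requests) is standard ZooKeeper behaviour added here as context rather than something stated in these notes.

```python
def majority(ensemble_size):
    """Smallest number of servers that forms a quorum."""
    return ensemble_size // 2 + 1


def rollout_plan(existing, new):
    """Return (host, action) steps: start new nodes, then restart old ones in turn."""
    steps = [(host, "start") for host in new]
    steps += [(host, "restart") for host in existing]
    return steps


if __name__ == "__main__":
    existing = ["172.17.8.32", "172.17.8.33", "172.17.8.34"]
    new = ["172.17.8.35", "172.17.8.36"]
    for host, action in rollout_plan(existing, new):
        print("%s: zkServer.sh %s" % (host, action))
    total = len(existing) + len(new)
    print("quorum for %d servers: %d" % (total, majority(total)))
```

Restarting one node at a time keeps the number of running servers as high as possible while the configuration is being rolled out.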

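Check the status

echo srvr | nc 127.0.0.1 2181

This ZooKeeper instance is not currently serving requests

This reply appears when the cluster has not been fully restarted and only some of the nodes have loaded the new configuration. Restarting the remaining nodes one at a time resolves it.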
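During a rolling restart it can help to wait for a node to answer normally before moving on to the next one. The sketch below shows one way to do that; `wait_until_serving` and `is_serving` are hypothetical names, and `fetch` is assumed to be any callable that returns the text of a srvr reply.

```python
import time

NOT_SERVING = "This ZooKeeper instance is not currently serving requests"


def is_serving(reply):
    """True if the reply is non-empty and is not the 'not serving' message."""
    return bool(reply.strip()) and NOT_SERVING not in reply


def wait_until_serving(fetch, attempts=10, delay=1.0):
    """Call fetch() until it returns a serving reply; return True on success."""
    for attempt in range(attempts):
        if is_serving(fetch()):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


if __name__ == "__main__":
    # Simulated replies: one "not serving" answer, then a normal one.
    replies = iter([NOT_SERVING + "\n", "Mode: follower\n"])
    print(wait_until_serving(lambda: next(replies), attempts=5, delay=0))
```

Passing the reply source in as a callable keeps the polling logic separate from the network code and easy to try out without a running server.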

Test

# zk1
zkCli.sh -server localhost:2181

create /zktest 'test'

# zk4: the node created on zk1 should be readable here
zkCli.sh -server localhost:2181

get /zktest

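Monitoring

ZooKeeper monitoring metrics

There are currently two known ways to obtain ZooKeeper monitoring metrics:

  • Through ZooKeeper's built-in four-letter-word commands, which expose a wide range of metrics
  • Through a JMX client connected to the MBeans that ZooKeeper exposes (this requires modifying the startup script to allow remote JMX connections)

The two approaches yield largely the same metrics.

This article uses the first approach.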

Four-letter commands

Usage

echo <command> | nc <ip> <port>

# for example
echo conf | nc 192.168.144.110 2181
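The same exchange can be done without nc. The following is a minimal sketch using only the Python standard library; `four_letter_word` is a hypothetical helper name, and it assumes the server closes the connection once it has sent its reply, which is how the nc pipeline above terminates as well.

```python
import socket


def four_letter_word(host, port, command, timeout=2.0):
    """Send a four-letter command and return the full reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # the server has closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks).decode("utf-8", "replace")
```

For example, `four_letter_word("127.0.0.1", 2181, "conf")` corresponds to `echo conf | nc 127.0.0.1 2181`. Reading in a loop matters because longer replies may not fit in a single recv call.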

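conf

Returns the ZooKeeper configuration, including:

  • client port
  • data and log directories
  • tick time (the basic time unit)
  • the limit on connections between a single server and a single client
  • timeouts
  • server id

During startup, a follower synchronizes all of the latest data from the leader and then determines the state from which it can start serving requests. The leader allows the follower initLimit ticks to finish this work.

While running, the leader communicates with every machine in the ZooKeeper cluster, for example through heartbeats, to detect whether each machine is alive. If the leader has sent a heartbeat and has still not received a response from a follower after syncLimit ticks, it considers that follower offline.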
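Both limits are expressed in ticks, so the effective time is the limit multiplied by tickTime. As a quick worked example with the values from the zoo.cfg used in this article (`parse_cfg` and `limits_in_ms` are hypothetical helper names):

```python
def parse_cfg(text):
    """Parse key=value lines from zoo.cfg text, skipping blanks and comments."""
    cfg = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        cfg[key.strip()] = value.strip()
    return cfg


def limits_in_ms(cfg):
    """Convert initLimit and syncLimit from ticks to milliseconds."""
    tick = int(cfg["tickTime"])
    return {
        "initLimit_ms": int(cfg["initLimit"]) * tick,
        "syncLimit_ms": int(cfg["syncLimit"]) * tick,
    }


if __name__ == "__main__":
    sample = "tickTime=2000\ninitLimit=10\nsyncLimit=5\n"
    print(limits_in_ms(parse_cfg(sample)))
```

With tickTime=2000, initLimit=10 gives a follower 20 seconds to complete its initial sync, and syncLimit=5 gives it 10 seconds to answer the leader.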


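cons

An overview of connection information:

  • client IP
  • port
  • packets sent on the connection
  • packets received on the connection
  • session id of the connection
  • last operation/command
  • timestamp at which the connection was established
  • timeout (unconfirmed)
  • last zxid
  • timestamp of the last response
  • latency statistics for the connection

crst

Resets the connection statistics. This command performs an action (like an execute) rather than a query (like a select).

It returns a status message:

Connection stats reset.

dump

Outputs information about all sessions in the wait queue and about ephemeral nodes.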

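envi

Environment information of the current server:

  • version
  • host name
  • JVM parameters: version, classpath, lib, and so on
  • OS parameters: name, version, and so on
  • current user information: name, dir, and so on

ruok

Checks whether the server is running normally. If it is, the reply is imok:

imok

srst

Like crst, this performs an action rather than a query: it resets the server statistics.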

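srvr

Brief information about the server:

  • version
  • latency
  • packets received
  • packets sent
  • number of connections
  • mode and other status information

stat

Status and connection information; a combination of the output of several commands above.

wchs

The number of connections that hold watches, the number of watched paths, and the total number of watchers.

wchc

All paths watched by each connection (worth combining with the cons output).

wchp

The connections watching each path (worth combining with the cons output).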

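mntr

Metrics for monitoring the health of a ZooKeeper server:

  • version
  • latency
  • packets received
  • packets sent
  • number of connections
  • outstanding client requests
  • leader/follower state
  • znode count
  • watch count
  • ephemeral node count
  • approximate data size (presumably a total)
  • open file descriptor count
  • maximum file descriptor count
  • follower count

zk_avg/min/max_latency          Time taken to respond to a client request; alert when it exceeds 10 ticks
zk_outstanding_requests         Number of queued requests; it grows when ZooKeeper is pushed beyond its capacity; suggested alert threshold: 10
zk_packets_received             Number of packets received from clients
zk_packets_sent                 Number of packets sent to clients, mainly responses and notifications
zk_max_file_descriptor_count    Maximum number of open files allowed, controlled by ulimit
zk_open_file_descriptor_count   Number of open files; alert when it exceeds 85% of the allowed maximum
Mode                            Role of the server: standalone if it has not joined a cluster, otherwise follower or leader
zk_followers                    Leader only: number of followers in the ensemble; normally the ensemble size minus 1
zk_pending_syncs                Leader only: number of pending syncs
zk_znode_count                  Number of znodes
zk_watch_count                  Number of watches
Java Heap Size                  Heap size of the ZooKeeper Java process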
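The thresholds above translate directly into a few comparisons on parsed mntr output. This is a rough sketch rather than a finished alerting rule set: `parse_mntr` and `check` are hypothetical names, mntr output is assumed to be tab-separated, and "10 ticks" is taken literally as 10 × tickTime in milliseconds, which is an interpretation rather than something the notes spell out.

```python
def parse_mntr(text):
    """Parse tab-separated 'mntr' output; numeric values become ints."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if not sep:
            continue
        value = value.strip()
        try:
            stats[key.strip()] = int(value)
        except ValueError:
            stats[key.strip()] = value
    return stats


def check(stats, tick_time_ms=2000, ensemble_size=None):
    """Return a list of alert strings based on the thresholds listed above."""
    alerts = []
    if stats.get("zk_avg_latency", 0) > 10 * tick_time_ms:
        alerts.append("avg latency above 10 ticks")
    if stats.get("zk_outstanding_requests", 0) > 10:
        alerts.append("outstanding requests above 10")
    max_fd = stats.get("zk_max_file_descriptor_count", 0)
    if max_fd and stats.get("zk_open_file_descriptor_count", 0) > 0.85 * max_fd:
        alerts.append("open file descriptors above 85% of the limit")
    # zk_followers is only reported by the leader
    if ensemble_size and stats.get("zk_server_state") == "leader":
        if stats.get("zk_followers", 0) != ensemble_size - 1:
            alerts.append("follower count differs from ensemble size minus 1")
    return alerts


if __name__ == "__main__":
    # Hypothetical sample values, chosen to trigger one alert.
    sample = (
        "zk_avg_latency\t0\n"
        "zk_outstanding_requests\t25\n"
        "zk_open_file_descriptor_count\t46\n"
        "zk_max_file_descriptor_count\t4096\n"
        "zk_server_state\tfollower\n"
    )
    print(check(parse_mntr(sample)))
```

Returning a list of messages instead of raising keeps the function easy to plug into whatever alerting tool is in use.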

Monitoring script

cat zk_monitor.sh
#!/bin/sh

### environment
workpath=$(dirname "$0")
cd "$workpath" || exit 1
workpath=$(pwd)

config_file="${workpath}/../conf/zoo.cfg"
if [ ! -f "$config_file" ]; then
    exit 1
fi

ip="127.0.0.1"
# read clientPort from zoo.cfg
port=$(awk -F= '{if ($1 == "clientPort") print $2}' "$config_file")
if [ -z "$port" ]; then
    exit 1
fi

### monr info: statistics
# NOTE: "monr" is not one of the four-letter words described above, and the
# "key: value" lines parsed below differ from the tab-separated mntr format;
# this script presumably targets a build that provides such a command.
last_file="${workpath}/monr.last"
now_file="${workpath}/monr.now"
time_key="timestamp:"

# current snapshot = timestamp line + command output
echo "$(echo "$time_key $(date +%s)";echo monr|nc $ip $port)" > "$now_file"

# first run: use the current snapshot as the baseline
if [ ! -f "$last_file" ]; then
    cp "$now_file" "$last_file"
fi

awk -v time_key="$time_key" 'BEGIN {
    mode = 0
} FNR == NR {
    # first file: previous snapshot
    last[$1] = $2
} FNR != NR {
    # second file: current snapshot
    name[FNR] = $1
    now[FNR] = $2
    if ($1 == "zk_mode:") {
        if ($2 == "leader") {
            mode = 1
        } else if ($2 == "follower") {
            mode = 5
        } else if ($2 == "observer") {
            mode = 10
        } else if ($2 == "proxy") {
            mode = 15
        } else {
            mode = -1
        }
    }
} END {
    # give up if the mode line or either timestamp is missing
    if (mode == 0 || name[1] != time_key || !(time_key in last)) {
        exit 1
    }
    printf("zk_running: 1\n")
    interval = now[1] - last[time_key]
    if (interval <= 0) {
        interval = 1
    }
    for (i=2; i<=FNR; ++i) {
        if (name[i] in last) {
            key = name[i]
            value = now[i]
            if (key == "zk_mode:") {
                value = mode
            } else if (key == "zk_packets_received:") {
                # cumulative counters are reported as per-second rates
                value = (value - last[key])/interval
                key = "zk_received_per_sec:"
            } else if (key == "zk_packets_sent:") {
                value = (value - last[key])/interval
                key = "zk_sent_per_sec:"
            } else if (key ~ /^zk_.*(_received|_succeed|_failed):$/) {
                value = (value - last[key])/interval
            }
            printf("%s %d\n", key, value)
        }
    }
}' "$last_file" "$now_file"

mv "$now_file" "$last_file"

echo BDEOF

[work@bjyz-inf-spark-forfsg-y02xi3-80 monitor]$sh zk_monitor.sh
zk_running: 1
zk_mode: 5
zk_min_latency: 0
zk_avg_latency: 0
zk_max_latency: 250
zk_node_count: 1134
zk_outstandings: 0
zk_zxid: 4294978789
zk_watch_conn_num: 6
zk_watch_path_num: 5
zk_watch_total: 8
zk_received_per_sec: 3
zk_sent_per_sec: 3
zk_client_connections: 4
zk_snapshot_count: 1
zk_sessions: 22
zk_expire_session_count: 0
zk_create_session_count: 167
zk_close_session_count: 82
zk_renew_session_count: 7
zk_ephemerals: 11
zk_read_delayed: 0
zk_write_delayed: 0
zk_create_session_received: 0
zk_create_session_succeed: 0
zk_create_session_failed: 0
zk_close_session_received: 0
zk_close_session_succeed: 0
zk_close_session_failed: 0
zk_create_received: 0
zk_create_succeed: 0
zk_create_failed: 0
zk_delete_received: 0
zk_delete_succeed: 0
zk_delete_failed: 0
zk_set_received: 0
zk_set_succeed: 0
zk_set_failed: 0
zk_set_acl_received: 0
zk_set_acl_succeed: 0
zk_set_acl_failed: 0
zk_sync_received: 0
zk_sync_succeed: 0
zk_sync_failed: 0
zk_exists_received: 0
zk_exists_succeed: 0
zk_exists_failed: 0
zk_get_received: 0
zk_get_succeed: 0
zk_get_failed: 0
zk_get_acl_received: 0
zk_get_acl_succeed: 0
zk_get_acl_failed: 0
zk_get_children_received: 0
zk_get_children_succeed: 0
zk_get_children_failed: 0
zk_get_children2_received: 0
zk_get_children2_succeed: 0
zk_get_children2_failed: 0
zk_ping_received: 2
zk_ping_succeed: 2
zk_ping_failed: 0
zk_set_watch_received: 0
zk_set_watch_succeed: 0
zk_set_watch_failed: 0
zk_recent_latency_min: 0
zk_recent_latency_avg: 0
zk_recent_latency_max: 1
zk_open_file_descriptor_count: 32
zk_max_file_descriptor_count: 1048576
BDEOF
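The core of the script is turning cumulative counters into per-second rates by comparing two snapshots. The sketch below restates that calculation on its own; `parse_snapshot` and `per_second` are hypothetical names, and the sample numbers are made up for illustration.

```python
def parse_snapshot(text):
    """Parse 'key: value' lines into a dict of ints, skipping non-numeric values."""
    snap = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        try:
            snap[parts[0]] = int(parts[1])
        except ValueError:
            pass
    return snap


def per_second(last, now, key, time_key="timestamp:"):
    """Rate of change of a counter between two snapshots."""
    interval = now[time_key] - last[time_key]
    if interval <= 0:  # same guard as in the awk code
        interval = 1
    # int() truncates, like the %d in the awk printf
    return int((now[key] - last[key]) / interval)


if __name__ == "__main__":
    last = parse_snapshot("timestamp: 1000\nzk_packets_received: 300\n")
    now = parse_snapshot("timestamp: 1060\nzk_packets_received: 480\n")
    print(per_second(last, now, "zk_packets_received:"))
```

Here 180 packets over 60 seconds gives a rate of 3 per second, which is the kind of value reported as zk_received_per_sec in the output above.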
# zookeeper.py

#!/usr/bin/env python3
"""Collect ZooKeeper stats with the 'mntr' and 'ruok' commands and send them to Zabbix.

Example of the collected result:

{'zk_followers': 0,
 'zk_outstanding_requests': 0,
 'zk_approximate_data_size': 890971,
 'zk_packets_sent': 5818488,
 'zk_pending_syncs': 0,
 'zk_avg_latency': 0,
 'zk_version': '3.4.6-1569965, built on 02/20/2014 09:09 GMT',
 'zk_watch_count': 1364,
 'zk_packets_received': 5797681,
 'zk_open_file_descriptor_count': 46,
 'zk_server_ruok': 'imok',
 'zk_server_state': 'follower',
 'zk_synced_followers': 0,
 'zk_max_latency': 400,
 'zk_num_alive_connections': 18,
 'zk_min_latency': 0,
 'zk_ephemerals_count': 1112,
 'zk_znode_count': 2207,
 'zk_max_file_descriptor_count': 4096}
"""

import socket
import subprocess
import sys
from io import StringIO


zabbix_sender = '/usr/bin/zabbix_sender'
zabbix_conf = '/etc/zabbix/zabbix_agentd.conf'
send_to_zabbix = 1


# get zookeeper server status
class ZooKeeperServer(object):

    def __init__(self, host='localhost', port='2181', timeout=1):
        self._address = (host, int(port))
        self._timeout = timeout
        self._result = {}

    def _create_socket(self):
        return socket.socket()

    def _send_cmd(self, cmd):
        """ Send a 4letter word command to the server """
        s = self._create_socket()
        s.settimeout(self._timeout)

        s.connect(self._address)
        s.sendall(cmd.encode('ascii'))

        # read until the server closes the connection;
        # a single recv(2048) could truncate a long reply
        chunks = []
        while True:
            chunk = s.recv(2048)
            if not chunk:
                break
            chunks.append(chunk)
        s.close()

        return b''.join(chunks).decode('utf-8', 'replace')

    def get_stats(self):
        """ Get ZooKeeper server stats as a map

        Sample 'mntr' output:

        zk_version 3.4.6-1569965, built on 02/20/2014 09:09 GMT
        zk_avg_latency 0
        zk_max_latency 94
        zk_min_latency 0
        zk_packets_received 1267904
        zk_packets_sent 1317835
        zk_num_alive_connections 12
        zk_outstanding_requests 0
        zk_server_state follower
        zk_znode_count 1684
        zk_watch_count 2757
        zk_ephemerals_count 899
        zk_approximate_data_size 728074
        zk_open_file_descriptor_count 41
        zk_max_file_descriptor_count 4096
        """
        data_mntr = self._send_cmd('mntr')
        data_ruok = self._send_cmd('ruok')
        # fall back to empty dicts so that an empty reply cannot raise NameError
        result_mntr = self._parse(data_mntr) if data_mntr else {}
        # {'zk_server_ruok': 'imok'}
        result_ruok = self._parse_ruok(data_ruok) if data_ruok else {}

        self._result = dict(result_mntr)
        self._result.update(result_ruok)

        # these three metrics are only exposed by the leader;
        # on other roles they are reported as 0
        leader_only = ('zk_followers', 'zk_synced_followers', 'zk_pending_syncs')
        if not any(key in self._result for key in leader_only):
            self._result.update(dict.fromkeys(leader_only, 0))

        return self._result

    def _parse(self, data):
        """ Parse the output from the 'mntr' 4letter word command

        :param data: zk_outstanding_requests 0 zk_approximate_data_size 653931
        :return: {'zk_outstanding_requests': '0', 'zk_approximate_data_size': '653931',}
        """
        h = StringIO(data)
        result = {}
        for line in h.readlines():
            try:
                key, value = self._parse_line(line)
                result[key] = value
            except ValueError:
                pass  # ignore broken lines

        return result

    def _parse_ruok(self, data):
        """ Parse the output from the 'ruok' 4letter word command

        :param data: imok
        :return: {'zk_server_ruok': 'imok'}
        """
        h = StringIO(data)
        result = {}

        ruok = h.readline()
        if ruok:
            result['zk_server_ruok'] = ruok

        return result

    def _parse_line(self, line):
        # zk_watch_count 1482
        try:
            # zk_max_file_descriptor_count 65535
            key, value = map(str.strip, line.split('\t'))
        except ValueError:
            raise ValueError('Found invalid line: %s' % line)

        if not key:
            raise ValueError('The key is mandatory and should not be empty')

        try:
            value = int(value)
        except (TypeError, ValueError):
            pass

        return key, value

    def get_pid(self):
        # ps -ef|grep java|grep zookeeper|awk '{print $2}'
        pidarg = '''ps -ef|grep java|grep zookeeper|grep -v grep|awk '{print $2}' '''  # 31022
        pidout = subprocess.Popen(pidarg, shell=True, stdout=subprocess.PIPE,
                                  universal_newlines=True)
        pid = pidout.stdout.readline().strip('\n')
        return pid

    def send_to_zabbix(self, metric):
        # key = zookeeper.status[zk_max_file_descriptor_count]
        key = "zookeeper.status[" + metric + "]"
        if send_to_zabbix > 0:
            try:
                subprocess.call([zabbix_sender, "-c", zabbix_conf, "-k", key,
                                 "-o", str(self._result[metric])],
                                stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
                                shell=False)
            except OSError as detail:
                print("Something went wrong while executing zabbix_sender : ", detail)
        else:
            print("Simulation: the following command would be executed :\n",
                  zabbix_sender, "-c", zabbix_conf, "-k", key,
                  "-o", self._result[metric], "\n")


def usage():
    """Display program usage"""

    print("\nUsage : ", sys.argv[0], " alive|all")
    print("Modes : \n\talive : Return pid of running zookeeper\n\tall : Send zookeeper stats as well")
    sys.exit(1)


accepted_modes = ['alive', 'all']
if len(sys.argv) == 2 and sys.argv[1] in accepted_modes:
    mode = sys.argv[1]
else:
    usage()

zk = ZooKeeperServer()
pid = zk.get_pid()

if pid != "" and mode == 'all':
    zk.get_stats()
    print(zk._result)
    for key in zk._result:
        zk.send_to_zabbix(key)
    print(pid)
elif pid != "" and mode == "alive":
    print(pid)
else:
    print(0)

https://www.cnblogs.com/yxy-linux/p/8023660.html

https://www.cnblogs.com/kuku0223/p/8428341.html