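Abstract

Notes compiled while learning ZooKeeper.

This article covers:

Installing ZooKeeper in standalone mode
Installing a ZooKeeper cluster and scaling it out
ZooKeeper monitoring data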

Installing ZooKeeper

Set environment variables

WORKDIR="/usr/local/src"
INSTALL_DIR="/usr/local/product"
SOFT_DIR="/usr/local"
VERSION="3.4.12"

Create the zookeeper user

useradd zookeeper

Download the package

mkdir -p ${INSTALL_DIR}
cd ${WORKDIR}
wget https://mirrors.aliyun.com/apache/zookeeper/zookeeper-${VERSION}/zookeeper-${VERSION}.tar.gz
tar xvf zookeeper-${VERSION}.tar.gz -C ${INSTALL_DIR}
ln -s ${INSTALL_DIR}/zookeeper-${VERSION} ${SOFT_DIR}/zookeeper

Create the configuration file

cd /usr/local/zookeeper/conf
cp zoo_sample.cfg zoo.cfg

echo -e "tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper/data
clientPort=2181" > zoo.cfg

echo -e "ZOO_LOG_DIR=/data/logs/zookeeper" > zookeeper-env.sh

mkdir -p /data/zookeeper/data
mkdir -p /data/logs/zookeeper

chown -R zookeeper:zookeeper /data/zookeeper/
chown -R zookeeper:zookeeper /data/logs/zookeeper/
chown -R zookeeper:zookeeper /usr/local/zookeeper/

Add environment variables

echo -e "export ZOOKEEPER_HOME=/usr/local/zookeeper
export PATH=\$ZOOKEEPER_HOME/bin:\$PATH" >> /etc/profile
source /etc/profile

Start the service

su - zookeeper
zkServer.sh start

ZooKeeper JMX enabled by default
Using config: /usr/local/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED

Other commands

zkServer.sh --help
ZooKeeper JMX enabled by default
Using config: /usr/local/zookeeper/bin/../conf/zoo.cfg
Usage: /usr/local/zookeeper/bin/zkServer.sh {start|start-foreground|stop|restart|status|upgrade|print-cmd}

zkServer.sh status
zkServer.sh start-foreground

zkCli.sh -server localhost:2181

ZooKeeper cluster mode

cd /usr/local/zookeeper/conf

echo -e "tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper/data
clientPort=2181
server.1=172.17.8.32:2889:3889
server.2=172.17.8.33:2889:3889
server.3=172.17.8.34:2889:3889" > zoo.cfg

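cd /data/zookeeper/data
# Run only the line that matches the server.N entry of the current node
echo 1 > myid    # on 172.17.8.32
echo 2 > myid    # on 172.17.8.33
echo 3 > myid    # on 172.17.8.34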

chown -R zookeeper:zookeeper /data/zookeeper/
chown -R zookeeper:zookeeper /data/logs/zookeeper/
chown -R zookeeper:zookeeper /usr/local/zookeeper/
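
The myid value on each node has to match that node's server.N line in zoo.cfg. As a rough illustration of that mapping, here is a minimal Python sketch; the function names parse_zoo_cfg and myid_for are made up for this example and are not part of ZooKeeper.

```python
def parse_zoo_cfg(text):
    """Split zoo.cfg text into plain settings and server.N entries."""
    settings, servers = {}, {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key.startswith("server."):
            # server.<id>=<host>:<quorum port>:<election port>
            server_id = int(key.split(".", 1)[1])
            servers[server_id] = value.split(":")[0]
        else:
            settings[key] = value
    return settings, servers


def myid_for(host, servers):
    """Return the id whose server.N entry names this host, or None."""
    for server_id, server_host in servers.items():
        if server_host == host:
            return server_id
    return None


SAMPLE = """tickTime=2000
dataDir=/data/zookeeper/data
clientPort=2181
server.1=172.17.8.32:2889:3889
server.2=172.17.8.33:2889:3889
server.3=172.17.8.34:2889:3889
"""

settings, servers = parse_zoo_cfg(SAMPLE)
print(settings["dataDir"], myid_for("172.17.8.33", servers))
# prints: /data/zookeeper/data 2
```

A host that has no server.N entry yields None, which is a convenient signal that zoo.cfg has not been updated for that node yet.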

Check node status

echo srvr | nc 127.0.0.1 2181
 
Zookeeper version: 3.4.12-e5259e437540f349646870ea94dc2658c4e44b3b, built on 03/27/2018 03:55 GMT
Latency min/avg/max: 0/0/0
Received: 1
Sent: 0
Connections: 1
Outstanding: 0
Zxid: 0x1c000000cf
Mode: leader
Node count: 3003
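
The srvr reply is a list of "Key: value" lines, so it is easy to post-process. The sketch below is one possible way to read it in Python (parse_srvr is a hypothetical helper name). It also splits the Zxid into its two halves: the high 32 bits are the leader epoch and the low 32 bits are a counter within that epoch.

```python
def parse_srvr(text):
    """Turn the 'Key: value' lines printed by srvr into a dict of strings."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            info[key.strip()] = value.strip()
    return info


SAMPLE = """Zookeeper version: 3.4.12-e5259e437540f349646870ea94dc2658c4e44b3b, built on 03/27/2018 03:55 GMT
Latency min/avg/max: 0/0/0
Received: 1
Sent: 0
Connections: 1
Outstanding: 0
Zxid: 0x1c000000cf
Mode: leader
Node count: 3003
"""

info = parse_srvr(SAMPLE)
zxid = int(info["Zxid"], 16)
epoch, counter = zxid >> 32, zxid & 0xFFFFFFFF
print(info["Mode"], epoch, counter)
# prints: leader 28 207
```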

Scaling out

Add two new nodes by appending them to the configuration on every node.

cd /usr/local/zookeeper/conf

echo -e "tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper/data
clientPort=2181
server.1=172.17.8.32:2889:3889
server.2=172.17.8.33:2889:3889
server.3=172.17.8.34:2889:3889
server.4=172.17.8.35:2889:3889
server.5=172.17.8.36:2889:3889" > zoo.cfg

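cd /data/zookeeper/data
# Run only the line that matches the server.N entry of the new node
echo 4 > myid    # on 172.17.8.35
echo 5 > myid    # on 172.17.8.36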

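Start the new nodes first, then restart the existing nodes one at a time.

zkServer.sh start      # on the new nodes
zkServer.sh restart    # on each existing node, one at a time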
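
Restarting one node at a time matters because an ensemble only serves requests while a majority of its members is up. The arithmetic can be sketched as follows; the function names are illustrative, not ZooKeeper API.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (3, 5):
    print(n, quorum_size(n), tolerated_failures(n))
# prints:
# 3 2 1
# 5 3 2
```

With three nodes only one can be down at a time; with five nodes, two.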

Check status

echo srvr | nc 127.0.0.1 2181


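The status below appears when some nodes have already loaded the new configuration while the cluster has not been fully restarted; restarting every node in turn clears it.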
# This ZooKeeper instance is not currently serving requests

Testing

#zk1
zkCli.sh -server localhost:2181

create /zktest 'test'

# zk4: read the znode back to confirm it was replicated to the new node
get /zktest

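Monitoring

ZooKeeper monitoring metrics

There are currently two known ways to collect ZooKeeper monitoring metrics:

Use ZooKeeper's built-in four-letter-word commands to retrieve a wide range of metrics
Connect a JMX client to the MBeans that ZooKeeper exposes (this requires modifying the startup script to allow remote JMX connections)

The metrics obtained by the two approaches are largely the same.

This article uses the first approach.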

Four-letter commands

Usage

echo commands  |  nc ip port

# for example
echo conf | nc 192.168.144.110 2181
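
The same exchange can be done without nc. The following is a minimal Python 3 sketch, assuming the server answers a four-letter command and then closes the connection; four_letter and read_until_closed are names chosen for this example.

```python
import socket


def read_until_closed(sock, bufsize=4096):
    """Collect data until recv() returns b'', i.e. the peer closed the socket."""
    chunks = []
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


def four_letter(cmd, host="127.0.0.1", port=2181, timeout=3.0):
    """Send a four-letter command and return the full reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        return read_until_closed(sock).decode("utf-8", errors="replace")


# Usage (needs a reachable ZooKeeper server):
# print(four_letter("conf", host="192.168.144.110", port=2181))
```

Reading in a loop rather than with a single recv call avoids truncating long replies.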

conf

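Returns the ZooKeeper configuration, including:

Client port
Data and log directories
Tick time (the basic time unit)
Limit on connections between a single server and a single client
Timeouts
Server ID, and so on

During startup, a follower synchronizes all of the latest data from the leader and then determines the state from which it can start serving requests. The leader allows the follower initLimit ticks (initLimit × tickTime) to finish this work.

At runtime, the leader communicates with every machine in the ZooKeeper cluster, for example through heartbeats, to check whether each one is alive. If the leader sends a heartbeat and has not received a response from a follower after syncLimit ticks (syncLimit × tickTime), it considers that follower offline.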
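
Both limits are counted in ticks, so the wall-clock values depend on tickTime. A small sketch of the conversion, using the values from the zoo.cfg in this article (the function name is illustrative):

```python
def limits_in_seconds(tick_time_ms, init_limit, sync_limit):
    """Convert initLimit and syncLimit from ticks to seconds."""
    tick = tick_time_ms / 1000.0
    return {"initLimit": init_limit * tick, "syncLimit": sync_limit * tick}


print(limits_in_seconds(2000, 10, 5))
# prints: {'initLimit': 20.0, 'syncLimit': 10.0}
```

With tickTime=2000, a follower therefore has 20 seconds to complete its initial sync and 10 seconds to answer a heartbeat.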

cons

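Overview of the client connections:

Client IP
Port
Packets sent on the connection
Packets received on the connection
Session ID of the connection
Last operation/command
Timestamp when the connection was established
Timeout (unconfirmed)
Last zxid
Timestamp of the last response
Latency information for the connection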
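
To work with these fields programmatically, each cons line can be split into a dict. The sketch below assumes the line shape printed by 3.4.x servers, that is, an address followed by key=value pairs in parentheses. The sample line is made up for illustration and should be checked against real output; the pattern does not handle IPv6 addresses.

```python
import re

# Illustrative line in the assumed cons format; the values are invented.
SAMPLE = (" /172.17.8.40:52470[1](queued=0,recved=108,sent=108,"
          "sid=0x1000001,lop=PING,est=1530000000000,to=30000,"
          "lcxid=0x2,lzxid=0x1c000000cf,lresp=1530000030000,"
          "llat=0,minlat=0,avglat=0,maxlat=3)")

LINE_RE = re.compile(
    r"^\s*/(?P<ip>[^:]+):(?P<port>\d+)\[\w+\]\((?P<fields>.*)\)\s*$"
)


def parse_cons_line(line):
    """Return a dict for one connection line, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    conn = {"ip": match.group("ip"), "port": int(match.group("port"))}
    for field in match.group("fields").split(","):
        key, sep, value = field.partition("=")
        if sep:
            conn[key] = value
    return conn


conn = parse_cons_line(SAMPLE)
print(conn["ip"], conn["lop"], conn["to"])
# prints: 172.17.8.40 PING 30000
```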

crst

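Resets the connection statistics. This performs an action (an execute) rather than a query (a select).

It returns a status message when it completes:

Connection stats reset.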

dump

Lists the outstanding (queued) sessions and the ephemeral nodes.

envi

Environment information of the current server:

Version
Host name
JVM parameters: version, classpath, lib, and so on
OS parameters: name, version, and so on
Current user on the host: name, dir, and so on

ruok

Checks whether the current server is running normally; if so, it replies imok.

imok

srst

Like crst, this performs an action rather than a query: it resets the server statistics.

srvr

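Brief information about the server:

Version
Latency
Packets received
Packets sent
Number of connections
Mode and other status information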

stat

Some status and connection information, a combination of the output described above.

wchs

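Number of connections holding watches, plus the number of watched paths and the total number of watchers.

wchc

Lists all paths watched by each connection. (Consider combining this with the cons output.)

wchp

Lists which connections are watching each path. (Consider combining this with the cons output.)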

mntr

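Metrics for monitoring the health of a ZooKeeper server:

Version
Latency
Packets received
Packets sent
Number of connections
Number of outstanding client requests
leader/follower state
znode count
watch count
Ephemeral node count
Approximate data size (this should be a total)
Number of open file descriptors
Maximum number of file descriptors
Number of followers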

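zk_avg/min/max_latency         time taken to respond to a client request; alerting when it exceeds 10 ticks is recommended
zk_outstanding_requests        number of queued requests; it grows when ZooKeeper is pushed beyond its capacity; a suggested alert threshold is 10
zk_packets_received            number of packets received from clients
zk_packets_sent                number of packets sent to clients, mainly responses and notifications
zk_max_file_descriptor_count   maximum number of open files allowed, controlled by ulimit
zk_open_file_descriptor_count  number of open files; alert when it exceeds 85% of the allowed maximum
Mode                           role of the server: standalone if it is not part of a cluster, otherwise follower or leader
zk_followers                   leader only; number of followers in the ensemble; normally the ensemble size minus 1
zk_pending_syncs               leader only; number of pending syncs
zk_znode_count                 number of znodes
zk_watch_count                 number of watches
Java Heap Size                 heap size of the ZooKeeper Java process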
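
The thresholds suggested above can be expressed as a small rule check over one mntr sample. This is only a sketch: check_mntr and its parameters are hypothetical, the thresholds are the ones listed in this article, and the sample is assumed to be already parsed into a dict with integer values.

```python
def check_mntr(metrics, tick_time_ms=2000, ensemble_size=None):
    """Return a list of alert strings for one parsed mntr sample."""
    alerts = []
    # Latency is reported in milliseconds; 10 ticks = 10 * tickTime.
    if metrics.get("zk_avg_latency", 0) > 10 * tick_time_ms:
        alerts.append("avg latency above 10 ticks")
    if metrics.get("zk_outstanding_requests", 0) > 10:
        alerts.append("outstanding requests above 10")
    max_fd = metrics.get("zk_max_file_descriptor_count", 0)
    open_fd = metrics.get("zk_open_file_descriptor_count", 0)
    if max_fd and open_fd > 0.85 * max_fd:
        alerts.append("open file descriptors above 85% of the limit")
    # zk_followers is only reported by the leader.
    if ensemble_size and metrics.get("zk_server_state") == "leader":
        if metrics.get("zk_followers") != ensemble_size - 1:
            alerts.append("leader does not see all followers")
    return alerts


busy_leader = {
    "zk_avg_latency": 3,
    "zk_outstanding_requests": 25,
    "zk_open_file_descriptor_count": 4000,
    "zk_max_file_descriptor_count": 4096,
    "zk_server_state": "leader",
    "zk_followers": 1,
}
for alert in check_mntr(busy_leader, ensemble_size=3):
    print(alert)
```

For the sample above, three rules fire: outstanding requests, file descriptors, and the follower count.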

Monitoring script

cat zk_monitor.sh
#!/bin/sh

### environment
workpath=$(dirname $0)
cd $workpath
workpath=$(pwd)

config_file="${workpath}/../conf/zoo.cfg"
if [ ! -f $config_file ]; then
    exit 1
fi

ip="127.0.0.1"
port=$(awk -F= '{if ($1 == "clientPort") print $2}' $config_file)
if [ -z "$port" ]; then
    exit 1
fi

### monr info: statistics
last_file="${workpath}/monr.last"
now_file="${workpath}/monr.now"
time_key="timestamp:"

echo "$(echo "$time_key $(date +%s)";echo monr|nc $ip $port)" > $now_file

if [ ! -f $last_file ]; then
    cp $now_file $last_file
fi

awk -v time_key="$time_key" 'BEGIN {
    mode = 0
} FNR == NR {
    last[$1] = $2
} FNR!=NR {
    name[FNR] = $1
    now[FNR] = $2
    if ($1 == "zk_mode:") {
        if ($2 == "leader") {
            mode = 1
        } else if ($2 == "follower") {
            mode = 5
        } else if ($2 == "observer") {
            mode = 10
        } else if ($2 == "proxy") {
            mode = 15
        } else {
            mode = -1
        }
    }
} END {
    if (mode == 0 || name[1] != time_key || !(time_key in last)) {
        exit 1
    }
    printf("zk_running: 1\n")
    interval = now[1] - last[time_key]
    if (interval <= 0) {
        interval = 1
    }
    for (i=2; i<=FNR; ++i) {
        if (name[i] in last) {
            key = name[i]
            value = now[i]
            if (key == "zk_mode:") {
                value = mode
            } else if (key == "zk_packets_received:") {
                value = (value - last[key])/interval
                key = "zk_received_per_sec:"
            } else if (key == "zk_packets_sent:") {
                value = (value - last[key])/interval
                key = "zk_sent_per_sec:"
            } else if (key ~ /^zk_.*(_received|_succeed|_failed):$/) {
                value = (value - last[key])/interval
            }
            printf("%s %d\n", key, value)
        }
    }
}' $last_file $now_file

mv $now_file $last_file

echo BDEOF

[work@bjyz-inf-spark-forfsg-y02xi3-80 monitor]$sh zk_monitor.sh
zk_running: 1
zk_mode: 5
zk_min_latency: 0
zk_avg_latency: 0
zk_max_latency: 250
zk_node_count: 1134
zk_outstandings: 0
zk_zxid: 4294978789
zk_watch_conn_num: 6
zk_watch_path_num: 5
zk_watch_total: 8
zk_received_per_sec: 3
zk_sent_per_sec: 3
zk_client_connections: 4
zk_snapshot_count: 1
zk_sessions: 22
zk_expire_session_count: 0
zk_create_session_count: 167
zk_close_session_count: 82
zk_renew_session_count: 7
zk_ephemerals: 11
zk_read_delayed: 0
zk_write_delayed: 0
zk_create_session_received: 0
zk_create_session_succeed: 0
zk_create_session_failed: 0
zk_close_session_received: 0
zk_close_session_succeed: 0
zk_close_session_failed: 0
zk_create_received: 0
zk_create_succeed: 0
zk_create_failed: 0
zk_delete_received: 0
zk_delete_succeed: 0
zk_delete_failed: 0
zk_set_received: 0
zk_set_succeed: 0
zk_set_failed: 0
zk_set_acl_received: 0
zk_set_acl_succeed: 0
zk_set_acl_failed: 0
zk_sync_received: 0
zk_sync_succeed: 0
zk_sync_failed: 0
zk_exists_received: 0
zk_exists_succeed: 0
zk_exists_failed: 0
zk_get_received: 0
zk_get_succeed: 0
zk_get_failed: 0
zk_get_acl_received: 0
zk_get_acl_succeed: 0
zk_get_acl_failed: 0
zk_get_children_received: 0
zk_get_children_succeed: 0
zk_get_children_failed: 0
zk_get_children2_received: 0
zk_get_children2_succeed: 0
zk_get_children2_failed: 0
zk_ping_received: 2
zk_ping_succeed: 2
zk_ping_failed: 0
zk_set_watch_received: 0
zk_set_watch_succeed: 0
zk_set_watch_failed: 0
zk_recent_latency_min: 0
zk_recent_latency_avg: 0
zk_recent_latency_max: 1
zk_open_file_descriptor_count: 32
zk_max_file_descriptor_count: 1048576
BDEOF
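
The per-second values in this output come from comparing two consecutive samples: the script subtracts the previous counter from the current one and divides by the elapsed seconds. A minimal Python sketch of that step, with made-up sample numbers and a hypothetical function name:

```python
def per_second(prev, now, counters):
    """Rate of each cumulative counter between two samples, truncated to int."""
    interval = now["timestamp"] - prev["timestamp"]
    if interval <= 0:
        interval = 1  # same guard as the shell script
    return {name: int((now[name] - prev[name]) / interval) for name in counters}


prev = {"timestamp": 100, "zk_packets_received": 1000, "zk_packets_sent": 1200}
now = {"timestamp": 160, "zk_packets_received": 1180, "zk_packets_sent": 1380}
print(per_second(prev, now, ["zk_packets_received", "zk_packets_sent"]))
# prints: {'zk_packets_received': 3, 'zk_packets_sent': 3}
```

Counters reset when the server restarts, so a negative rate is a hint that the previous sample predates a restart.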

# zookeeper.py

#!/usr/bin/python
"""{'zk_followers': 0, 
'zk_outstanding_requests': 0, 
'zk_approximate_data_size': 890971, 
'zk_packets_sent': 5818488, 
'zk_pending_syncs': 0, 
'zk_avg_latency': 0, 
'zk_version': '3.4.6-1569965, built on 02/20/2014 09:09 GMT', 
'zk_watch_count': 1364, 
'zk_packets_received': 5797681, 
'zk_open_file_descriptor_count': 46, 
'zk_server_ruok': 'imok', 
'zk_server_state': 'follower', 
'zk_synced_followers': 0, 
'zk_max_latency': 400, 
'zk_num_alive_connections': 18, 
'zk_min_latency': 0, 
'zk_ephemerals_count': 1112, 
'zk_znode_count': 2207, 
'zk_max_file_descriptor_count': 4096} 
"""

import sys
import socket
import re
import subprocess
from StringIO import StringIO
import os
 
 
zabbix_sender = '/usr/bin/zabbix_sender'
zabbix_conf = '/etc/zabbix/zabbix_agentd.conf'
send_to_zabbix = 1


# get zookeeper server status
class ZooKeeperServer(object):
 
    def __init__(self, host='localhost', port='2181', timeout=1):
        self._address = (host, int(port))
        self._timeout = timeout
        self._result = {}

    def _create_socket(self):
        return socket.socket()

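    def _send_cmd(self, cmd):
        """ Send a 4letter word command to the server """
        s = self._create_socket()
        s.settimeout(self._timeout)

        try:
            s.connect(self._address)
            s.sendall(cmd)

            # Read until the server closes the connection; a single
            # recv(2048) can truncate long replies such as mntr.
            chunks = []
            while True:
                chunk = s.recv(2048)
                if not chunk:
                    break
                chunks.append(chunk)
        finally:
            s.close()

        return ''.join(chunks)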
 
    def get_stats(self):
        """ Get ZooKeeper server stats as a map """
        """zk_version      3.4.6-1569965, built on 02/20/2014 09:09 GMT
            zk_avg_latency  0
            zk_max_latency  94
            zk_min_latency  0
            zk_packets_received     1267904
            zk_packets_sent 1317835
            zk_num_alive_connections        12
            zk_outstanding_requests 0
            zk_server_state follower
            zk_znode_count  1684
            zk_watch_count  2757
            zk_ephemerals_count     899
            zk_approximate_data_size        728074
            zk_open_file_descriptor_count   41
            zk_max_file_descriptor_count    4096
        """
        data_mntr = self._send_cmd('mntr')
        data_ruok = self._send_cmd('ruok')
        # Default to empty dicts so an empty reply cannot cause a NameError
        result_mntr = self._parse(data_mntr) if data_mntr else {}
        # {'zk_server_ruok': 'imok'}
        result_ruok = self._parse_ruok(data_ruok) if data_ruok else {}

        self._result = dict(result_mntr.items() + result_ruok.items())

        # These three metrics are only exposed by the leader; report 0 on other roles
        for key in ('zk_followers', 'zk_synced_followers', 'zk_pending_syncs'):
            self._result.setdefault(key, 0)

        return self._result

    def _parse(self, data):
        """
        :param data: zk_outstanding_requests 0 zk_approximate_data_size        653931
        :return: {'zk_outstanding_requests': '0', 'zk_approximate_data_size': '653931',}
        """
        """ Parse the output from the 'mntr' 4letter word command """
        h = StringIO(data)
        result = {}
        for line in h.readlines():
            try:
                key, value = self._parse_line(line)
                result[key] = value
            except ValueError:
                pass # ignore broken lines
 
        return result
 
    def _parse_ruok(self, data):
        """
        :param data: imok
        :return: {'zk_server_ruok': 'imok'}
        """
        """ Parse the output from the 'ruok' 4letter word command """
        
        h = StringIO(data)
        result = {}
        
        ruok = h.readline()
        if ruok:
           result['zk_server_ruok'] = ruok
  
        return result
 
    def _parse_line(self, line):
        # zk_watch_count  1482
        try:
            # zk_max_file_descriptor_count 65535
            key, value = map(str.strip, line.split('\t'))
        except ValueError:
            raise ValueError('Found invalid line: %s' % line)
 
        if not key:
            raise ValueError('The key is mandatory and should not be empty')
 
        try:
            value = int(value)
        except (TypeError, ValueError):
            pass
 
        return key, value

    def get_pid(self):
        # ps -ef|grep java|grep zookeeper|awk '{print $2}'
        pidarg = '''ps -ef|grep java|grep zookeeper|grep -v grep|awk '{print $2}' '''   # 31022
        pidout = subprocess.Popen(pidarg, shell=True, stdout=subprocess.PIPE)
        pid = pidout.stdout.readline().strip('\n')
        return pid

    def send_to_zabbix(self, metric):
        # key = zookeeper.status[zk_max_file_descriptor_count]
        key = "zookeeper.status[" + metric + "]"
        if send_to_zabbix > 0:
            # print key + ":" + str(self._result[metric])
            try:
                subprocess.call([zabbix_sender, "-c", zabbix_conf, "-k", key, "-o", str(self._result[metric])], stdout=FNULL, stderr=FNULL, shell=False)
                #print "send zabbix success"
            except OSError, detail:
                print "Something went wrong while executing zabbix_sender : ", detail
        else:
            print "Simulation: the following command would be executed :\n", zabbix_sender, "-c", zabbix_conf, "-k", key, "-o", self._result[metric], "\n"


def usage():
        """Display program usage"""
 
        print "\nUsage : ", sys.argv[0], " alive|all"
        print "Modes : \n\talive : Return pid of running zookeeper\n\tall : Send zookeeper stats as well"
        sys.exit(1)

        
accepted_modes = ['alive', 'all']
if len(sys.argv) == 2 and sys.argv[1] in accepted_modes:
        mode = sys.argv[1]
else:
        usage()

zk = ZooKeeperServer()
#  print zk.get_stats()
pid = zk.get_pid()
 
if pid != "" and mode == 'all':
    zk.get_stats()
    print zk._result
    FNULL = open(os.devnull, 'w')
    for key in zk._result:
       zk.send_to_zabbix(key)
    FNULL.close()
    print pid
elif pid != "" and mode == "alive":
    print pid
else:
    print 0
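
The script above is written for Python 2. For readers on Python 3, the mntr parsing step might look roughly like the sketch below. It assumes the tab-separated "key<TAB>value" format shown in the docstrings above; parse_mntr is a name chosen for this example.

```python
LEADER_ONLY = ("zk_followers", "zk_synced_followers", "zk_pending_syncs")


def parse_mntr(text):
    """Parse tab-separated mntr output into a dict, converting integers."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        key, value = key.strip(), value.strip()
        if not sep or not key:
            continue  # ignore broken lines
        try:
            stats[key] = int(value)
        except ValueError:
            stats[key] = value
    # Followers do not report these metrics; default them to 0.
    for key in LEADER_ONLY:
        stats.setdefault(key, 0)
    return stats


SAMPLE = (
    "zk_version\t3.4.6-1569965, built on 02/20/2014 09:09 GMT\n"
    "zk_avg_latency\t0\n"
    "zk_max_latency\t94\n"
    "zk_server_state\tfollower\n"
    "zk_znode_count\t1684\n"
)

stats = parse_mntr(SAMPLE)
print(stats["zk_server_state"], stats["zk_znode_count"], stats["zk_followers"])
# prints: follower 1684 0
```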

https://www.cnblogs.com/yxy-linux/p/8023660.html

https://www.cnblogs.com/kuku0223/p/8428341.html