Troubleshooting a heketi Startup Failure


 2021/08/08   Share

Symptoms

During a routine inspection, the heketi service was found to be unhealthy: it failed to start. The heketi log showed the error invalid page type: 19:10

Environment


Kernel version	3.10.0-862.el7.x86_64
OS version	CentOS Linux release 7.5.1804
kubernetes version	1.17.0
glusterfs version	7.1
heketi version	v9.0

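Troubleshooting

Checked the running state of every glusterfs service: all volumes were online and working normally.

Restarting glusterfs and heketi did not help.

Inspected heketi's startup script, /usr/bin/heketi-start.sh: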

#!/bin/bash
#
# HEKETI_TOPOLOGY_FILE can be passed as an environment variable with the
# filename of the initial topology.json. In case the heketi.db does not exist
# yet, this file will be used to populate the database.

: "${HEKETI_PATH:=/var/lib/heketi}"
: "${BACKUPDB_PATH:=/backupdb}"
LOG="${HEKETI_PATH}/container.log"

info() {
    echo "$*" | tee -a "$LOG"
}

error() {
    echo "error: $*" | tee -a "$LOG" >&2
}

fail() {
    error "$@"
    exit 1
}

info "Setting up heketi database"

# Ensure the data dir exists
mkdir -p "${HEKETI_PATH}" 2>/dev/null
if [[ $? -ne 0 && ! -d "${HEKETI_PATH}" ]]; then
    fail "Failed to create ${HEKETI_PATH}"
fi

# Test that our volume is writable.
touch "${HEKETI_PATH}/test" && rm "${HEKETI_PATH}/test"
if [ $? -ne 0 ]; then
    fail "${HEKETI_PATH} is read-only"
fi

if [[ ! -f "${HEKETI_PATH}/heketi.db" ]]; then
    info "No database file found"
    out=$(mount | grep "${HEKETI_PATH}" | grep heketidbstorage)
    if [[ $? -eq 0 ]]; then
        info "Database volume found: ${out}"
        info "Database file is expected, waiting..."
        check=0
        while [[ ! -f "${HEKETI_PATH}/heketi.db" ]]; do
            sleep 5
            if [[ ${check} -eq 5 ]]; then
               fail "Database file did not appear, exiting."
            fi
            ((check+=1))
        done
    fi
fi

stat "${HEKETI_PATH}/heketi.db" 2>/dev/null | tee -a "${LOG}"
# Workaround for scenario where a lock on the heketi.db has not been
# released.
# This code uses a non-blocking flock in a loop rather than a blocking
# lock with timeout due to issues with current gluster and flock
# ( see rhbz#1613260 )
for _ in $(seq 1 60); do
    flock --nonblock "${HEKETI_PATH}/heketi.db" true
    flock_status=$?
    if [[ $flock_status -eq 0 ]]; then
        break
    fi
    sleep 1
done
if [[ $flock_status -ne 0 ]]; then
    fail "Database file is read-only"
fi

if [[ -d "${BACKUPDB_PATH}" ]]; then
    if [[ -f "${BACKUPDB_PATH}/heketi.db.gz" ]] ; then
        gunzip -c "${BACKUPDB_PATH}/heketi.db.gz" > "${BACKUPDB_PATH}/heketi.db"
        if [[ $? -ne 0 ]]; then
            fail "Unable to extract backup database"
        fi
    fi
    if [[ -f "${BACKUPDB_PATH}/heketi.db" ]] ; then
        cp "${BACKUPDB_PATH}/heketi.db" "${HEKETI_PATH}/heketi.db"
        if [[ $? -ne 0 ]]; then
            fail "Unable to copy backup database"
        fi
        info "Copied backup db to ${HEKETI_PATH}/heketi.db"
    fi
fi

# if the heketi.db does not exist and HEKETI_TOPOLOGY_FILE is set, start the
# heketi service in the background and load the topology. Once done, move the
# heketi service back to the foreground again.
if [[ "$(stat -c %s ${HEKETI_PATH}/heketi.db 2>/dev/null)" == 0 && -n "${HEKETI_TOPOLOGY_FILE}" ]]; then
    # start hketi in the background
    /usr/bin/heketi --config=/etc/heketi/heketi.json &

    # wait until heketi replies
    while ! curl http://localhost:8080/hello; do
        sleep 0.5
    done

    # load the topology
    if [[ -n "${HEKETI_ADMIN_KEY}" ]]; then
        HEKETI_SECRET_ARG="--secret='${HEKETI_ADMIN_KEY}'"
    fi
    heketi-cli --user=admin "${HEKETI_SECRET_ARG}" topology load --json="${HEKETI_TOPOLOGY_FILE}"
    if [[ $? -ne 0 ]]; then
        # something failed, need to exit with an error
        kill %1
        fail "failed to load topology from ${HEKETI_TOPOLOGY_FILE}"
    fi

    # bring heketi back to the foreground
    fg %1
else
    # just start in the foreground
    exec /usr/bin/heketi --config=/etc/heketi/heketi.json
fi

The log output shows that the heketi.db file exists at this point, but the script then fails when it executes the following command:

/usr/bin/heketi --config=/etc/heketi/heketi.json
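
To confirm quickly that a failed start matches this symptom, the heketi output can be saved to a file and searched for the error string. The sketch below is only an illustration: the helper name has_invalid_page_type and the log path in the usage comment are assumptions, not part of heketi.

```bash
#!/bin/bash
# Hypothetical helper: returns 0 if the given log file contains the
# "invalid page type" error described in this post, non-zero otherwise.
has_invalid_page_type() {
    local logfile="$1"
    grep -q "invalid page type" "$logfile"
}

# Example usage (the path is an assumption; save the heketi output there first):
#   if has_invalid_page_type /tmp/heketi.log; then
#       echo "heketi.db is probably corrupted"
#   fi
```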

Related issues turned up on GitHub:

https://github.com/heketi/heketi/issues/1636

https://github.com/heketi/heketi/issues/1378

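What they have in common is that the invalid page type error appeared after the heketi service was restarted. A heketi contributor noted that the error is caused by a corrupted bolt db file, but said nothing about what causes the corruption.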

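Solution

According to those descriptions, the database cannot be recovered with a simple command, so the following procedure was used.

Export a complete, usable heketi.db from a healthy local cluster by running:
heketi db export --dbfile heketi.db --jsonfile heketi_new.json
Note: in testing, the export only works on a healthy heketi.db file; a corrupted db file cannot be exported to JSON.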

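The exported JSON file still contains the data of the cluster it came from, so the gfs node/brick/device data of the broken cluster has to be mapped into it item by item. The steps are roughly:

① First map the nodeentries and deviceentries sections of the JSON file correctly. The node IDs are randomly generated; changing them is not recommended, to avoid confusion. The device ID is the VG name on each gfs node. To get the size of each storage device, run vgs on each gfs node; the unit is KB. The Bricks entries come from `lvs|grep brick |awk '{print $1}'`.
② On a gfs node, run gluster volume info. The user.heketi.id of each volume is the ID to use under volumeentries in the JSON file; the brick_XXXX entries listed under Bricks for each replica go into volumeentries/Bricks; the volume's gid in the JSON file corresponds to the gid of the k8s PV.
③ On each gfs node, run lvs and lsblk to get the LvmThinPool, size, and TpSize values that the JSON needs.

Note: the cluster ID and node IDs are generated by heketi; everything else must match the gfs environment one to one. It also turned out that the arrays in the JSON file have to be sorted in numeric/alphabetical order.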
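
Collecting the volume IDs in step ② by hand is error-prone, so a small filter can help. The sketch below assumes that gluster volume info prints one "Volume Name: ..." line per volume and a "user.heketi.id: ..." line among its options; the function name list_heketi_volume_ids is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: reads `gluster volume info` output on stdin and prints
# one "<volume name> <heketi volume id>" pair per line.
# Assumption: the output contains "Volume Name: X" and "user.heketi.id: Y" lines.
list_heketi_volume_ids() {
    awk -F': ' '
        /^Volume Name:/                    { name = $2 }
        /^[[:space:]]*user\.heketi\.id:/   { print name, $2 }
    '
}

# Example usage on a gfs node:
#   gluster volume info | list_heketi_volume_ids
```

The printed IDs are the values to compare against volumeentries in the exported JSON file.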

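Convert heketi_new.json back into a db file by running:
heketi db import --dbfile heketi.db --jsonfile heketi_new.json
Replace the corrupted db with this db file and restart the heketi service.
Run heketi-cli db check to verify that everything is back to normal.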
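
When swapping the db file, it is worth keeping the damaged file around for later analysis. A minimal sketch, assuming heketi is stopped during the swap and that the db lives under /var/lib/heketi (the HEKETI_PATH default in the startup script); the function name swap_heketi_db is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: keeps a timestamped copy of the existing (damaged) db,
# then copies the rebuilt db into place. Refuses to install an empty file.
swap_heketi_db() {
    local new_db="$1" target_db="$2"
    if [ ! -s "$new_db" ]; then
        echo "error: $new_db is missing or empty" >&2
        return 1
    fi
    if [ -f "$target_db" ]; then
        cp -p "$target_db" "${target_db}.corrupt.$(date +%Y%m%d%H%M%S)" || return 1
    fi
    cp "$new_db" "$target_db"
}

# Example usage (paths are assumptions):
#   swap_heketi_db ./heketi.db /var/lib/heketi/heketi.db
```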

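Brute-Force Testing

The GitHub issues on this problem only offer a recovery procedure and identify boltdb corruption as the cause; none of them discusses what triggers it. The following simulations were run to try to reproduce it.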

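Test method	Test result	Error reproduced?
Write dirty data into heketi.db	Service reports an error: invalid data format	×
Add invalid fields to heketi.db	Service reports an error: malformed JSON or unknown field	×
Empty heketi.db, then write data into it	Service reports an error: db file missing	×
Restart heketi while creating a volume	Service runs normally	×
Restart heketi while expanding a volume	Service runs normally	×
Set heketi.db to read-only	Service reports an error: db is read-only	×
Delete the .glusterfs files	Service reports an error: db file missing	×
Restart the heketi service in an endless loop	Script stopped after 10 minutes; service runs normally	×
Create (delete) volumes in an endless loop	Script stopped after 10 minutes; service runs normally	×
Modify the heketi.json config file	Service reports an error: heketi error	×
Kill the heketi process	Service runs normally; volume creation fails, fixed by a restart	×
Stop the gfs service, create a volume, then restart the gfs cluster	Service runs normally; volume creation fails, fixed by a restart	×
Simulate network latency (while creating volumes, deleting volumes, restarting services, etc.)	Service runs normally	×
Delete part of the db data	Service runs normally	×
Adjust glusterfs and heketi resources to trigger automatic restarts	Service runs normally	×
Delete several IDs from the heketi JSON, then import it into a db file	Service fails to start with the error Id not found	×
Restart glusterfs and heketi while zk is reading/writing data files	Service runs normally	×
Restart glusterfs and heketi while a client is reading/writing data files	Service runs normally	×
Limit heketi to 0.1c / 0.1g while batch-creating and deleting PVCs	Service runs normally	×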

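None of the tests reproduced the problem. Because the error is raised while heketi's own logic is executing, the location that raises it was traced through the heketi and boltdb source code. The in-memory page and bucket handling inside boltdb is still a black box here and needs further study.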
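
The time-boxed loop tests (run an operation over and over, stop after 10 minutes) can be driven by a small wrapper such as the sketch below. It is not the script used for the tests above; the function name repeat_for_seconds is hypothetical, and the command to repeat is whatever restarts heketi or creates a volume in your environment.

```bash
#!/bin/bash
# Hypothetical helper: runs the given command repeatedly until the given
# number of seconds has elapsed, then prints how many iterations ran.
# The command's own stdout is discarded so that only the count is printed.
repeat_for_seconds() {
    local duration="$1"
    shift
    local end=$(( $(date +%s) + duration ))
    local count=0
    while [ "$(date +%s)" -lt "$end" ]; do
        "$@" >/dev/null || echo "iteration $count failed" >&2
        count=$((count + 1))
    done
    echo "$count"
}

# Example usage: repeat your own restart command for 10 minutes.
#   repeat_for_seconds 600 my_restart_heketi_command
```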

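Prevention

A heketi contributor on GitHub mentioned that they periodically receive reports of corrupted boltdb files, said that the probability of this happening is low, and recommended turning off the performance translator settings on heketidbstorage.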

https://github.com/heketi/heketi/issues/1591

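Since the problem could not be reproduced, it was not possible to verify whether this helps.
A script was written to back up the heketi.db file automatically every day at 00:00 and added to the scheduled tasks, so that the backup can be used to recover quickly when this kind of problem occurs. Data may still be lost, though, and it is unclear whether anything like a trigger exists that could run the script automatically whenever a user creates or deletes a volume/disk.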
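
A nightly backup of this kind might look like the sketch below. It is not the original script: the function name backup_heketi_db, the retention count, and the paths in the usage comments are assumptions. Copying the file while heketi is writing to it may yield an inconsistent copy, so the backup should be checked before it is relied on.

```bash
#!/bin/bash
# Hypothetical helper: copies the db to a timestamped file in dest_dir and
# keeps only the newest $keep copies (default 7). File names sort
# chronologically because the timestamp is part of the name.
backup_heketi_db() {
    local src="$1" dest_dir="$2" keep="${3:-7}"
    if [ ! -s "$src" ]; then
        echo "error: $src is missing or empty" >&2
        return 1
    fi
    mkdir -p "$dest_dir" || return 1
    cp -p "$src" "$dest_dir/heketi.db.$(date +%Y%m%d%H%M%S)" || return 1
    # Delete everything except the newest $keep backups.
    ls -1 "$dest_dir"/heketi.db.* 2>/dev/null | sort -r | tail -n +"$((keep + 1))" |
        while IFS= read -r old; do
            rm -f -- "$old"
        done
}

# Example usage (paths are assumptions):
#   backup_heketi_db /var/lib/heketi/heketi.db /backup/heketi 7
# Example crontab entry for a daily run at 00:00 (script path is hypothetical):
#   0 0 * * * /usr/local/bin/heketi-db-backup.sh
```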

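(Recovering the data is tedious: in the smartx environment, restoring 10 volumes took 8 hours.)
A script on the master node also backs up the heketi JSON dump every hour, using the command:
heketi-cli db dump >> `date +%Y%m%d%H%M%S`-heketi_backup.json
(Recovery is fast, and little data has to be restored when a problem occurs.)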

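The most common operations on the gfs cluster are creating and deleting storage disks, and these are mostly done through the cloud management page, so this command can be invoked after each user operation on a storage disk to take a backup. If all operations go through the page, then even a boltdb problem can be recovered from within seconds. The backup file can also be monitored dynamically: if it contains no data, something has gone wrong.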
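
The monitoring idea can be sketched as a check that the newest dump exists and is not empty. The function name latest_backup_ok is hypothetical; the file name pattern follows the dump command used in this post.

```bash
#!/bin/bash
# Hypothetical helper: returns 0 if the newest *-heketi_backup.json file in
# the given directory exists and is non-empty, non-zero otherwise.
# The timestamp prefix makes lexical order equal to chronological order.
latest_backup_ok() {
    local dir="$1"
    local latest
    latest=$(ls -1 "$dir"/*-heketi_backup.json 2>/dev/null | sort | tail -n 1)
    [ -n "$latest" ] && [ -s "$latest" ]
}

# Example usage (the directory is an assumption):
#   latest_backup_ok /backup/heketi || echo "heketi dump is empty or missing"
```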
