记一次milvus服务不可用的故障排查

故障现象

日志非常多
CPU占用高
服务不可用

故障原因

milvus官方bug
- rocksmq hold topic lock.
- cluster has too many collections, when cluster start from cold, all of the consumer try to pull data from rocksmq and causing the system hard to recover and stayed in a loop to catch up new data.

解决方案

升级milvus版本>=2.3，即可解决
如需要在当前版本修复
1. 将参数proxy.timeTickInterval设置从200改为1000
2. 删除/var/lib/milvus/rdb_data，删除/var/lib/milvus/rdb_data_meta_kv，再重启

参考信息

I think the major reason is the lock of rocksmq hold topic lock.
And due to this cluster has too many collections, when cluster start from cold, all of the consumer try to pull data from rocksmq and causing the system hard to recover and stayed in a loop to catch up new data.

We have refine idle scene at master（remove time tick channel） and 2.3. For version 2.2, i will optimize rocksmq and try to do something helpful.
This question caused by that rocksmq cpu usage will increase with data, and idle milvus will send time tick massage by rocksmq, and cpu usage will increase.
So reduce interval of time tick massage or reduce interval and size of rocksmq retention could help reduce cpu usage.

collections numbers do significantly affect cpu usage, especially on standalones from our test.
We recommend to use < 1000 collections.
Also if we know more about your use cases maybe we can give more advices

Upd: I have loaded collections after deleting /var/lib/milvus/rdb_data and /var/lib/milvus/rdb_data_meta_kv and restarting Milvus.
Upd2: Seems to work well so far! Will observe it further for a couple of days and get back. Also, there is way less logs from standalone now.

proxy. timeTickInterval from 200 to 1000 see if it works.

Post Views: 934

柚子的挖挖机

记一次milvus服务不可用的故障排查

故障现象

故障原因

解决方案

参考信息

发表回复取消回复

记一次milvus服务不可用的故障排查

故障现象

故障原因

解决方案

参考信息

发表回复 取消回复

发表回复取消回复