Three-node ES cluster after a restart: timed out while waiting for initial discovery state - timeout


I. Problem

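A colleague on the middle-platform team suddenly reported that production queries were very slow. When I checked the cluster, memory was completely full and every operation on the servers was sluggish. The cluster nodes are all virtual machines running on physical hosts, so I assumed the physical servers had run out of resources, and as a last resort I decided to restart in order to clear the file cache. Since this was a cluster, I figured restarting the nodes one at a time would be safe. I was too optimistic: after restarting the first node, it would not come back up...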

II. What I did

1. Operating system

[centos@ip-172-0-0-233 bin]$ cat /etc/redhat-release 
CentOS Linux release 7.6.1810 (Core) 

2. ES version

[centos@ip-172-0-0-233 bin]$ ./elasticsearch --version
Version: 5.0.2, Build: f6b4951/2016-11-24T10:07:18.101Z, JVM: 1.8.0_131

3. Killing the process

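ps -ef | grep elasticsearch   # find the Elasticsearch process ID
kill -9 pid                   # pid = the process ID found above; SIGKILL, no clean shutdown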

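I regretted this as soon as I had done it. Not every service should be killed this way, and I do not know whether this step contributed to the cluster going down.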
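Elasticsearch shuts down cleanly when it receives SIGTERM, which is what a plain `kill` sends, whereas `kill -9` (SIGKILL) gives the process no chance to clean up. Below is a minimal sketch of a gentler stop, assuming a POSIX shell. The helper name `stop_gracefully` and the 60-second default are made up for illustration and are not part of Elasticsearch.

```shell
# Hypothetical helper: ask a process to exit with SIGTERM and wait for it.
# Usage: stop_gracefully PID [TIMEOUT_SECONDS]
stop_gracefully() {
  pid="$1"
  timeout_s="${2:-60}"   # assumed default; tune to how long shutdown usually takes

  # SIGTERM requests a clean shutdown instead of killing the process outright.
  if ! kill -TERM "$pid" 2>/dev/null; then
    echo "no such process (or no permission): $pid" >&2
    return 1
  fi

  # Poll until the process is gone or the timeout expires.
  waited=0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout_s" ]; then
      echo "pid $pid still running after ${timeout_s}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "pid $pid stopped"
}
```

If the function returns non-zero, the process is still alive and it is probably worth reading the logs before resorting to `kill -9`.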

4. Error messages

[2020-04-16T01:17:36,421][INFO ][o.e.x.s.a.s.FileRolesStore] [es-node148] parsed [0] roles from file [/usr/share/elasticsearch/config/roles.yml]
[2020-04-16T01:17:36,922][INFO ][o.e.x.m.j.p.l.CppLogMessageHandler] [es-node148] [controller/74] [Main.cc@109] controller (64 bit): Version 6.5.3 (Build f418a701d70c6e) Copyright (c) 2018 Elasticsearch BV
[2020-04-16T01:17:37,914][INFO ][o.e.d.DiscoveryModule    ] [es-node148] using discovery type [zen] and host providers [settings]
[2020-04-16T01:17:38,803][INFO ][o.e.n.Node               ] [es-node148] initialized
[2020-04-16T01:17:38,803][INFO ][o.e.n.Node               ] [es-node148] starting ...
[2020-04-16T01:17:38,992][INFO ][o.e.t.TransportService   ] [es-node148] publish_address {192.168.1.148:1483}, bound_addresses {0.0.0.0:1483}
[2020-04-16T01:17:39,053][INFO ][o.e.b.BootstrapChecks    ] [es-node148] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2020-04-16T01:18:09,099][WARN ][o.e.n.Node               ] [es-node148] timed out while waiting for initial discovery state - timeout: 30s
[2020-04-16T01:18:09,115][INFO ][o.e.x.s.t.n.SecurityNetty4HttpServerTransport] [es-node148] publish_address {192.168.1.148:1482}, bound_addresses {0.0.0.0:1482}
[2020-04-16T01:18:09,115][INFO ][o.e.n.Node               ] [es-node148] started

5. Configuration file

cluster.name: my-application
node.name: es-node148
network.host: 0.0.0.0
network.publish_host: 192.168.1.148
http.port: 1482
transport.tcp.port: 1483
http.cors.enabled: true
http.cors.allow-origin: "*"
node.master: true
node.data: true
discovery.zen.ping.unicast.hosts: ["192.168.1.148:1483","192.168.1.149:1493","192.168.1.150:1503"]
discovery.zen.minimum_master_nodes: 1
http.max_content_length: 800mb

III. Solution

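No amount of restarting helped. Everything I found online said a restart would fix it, but repeated restarts made no difference. With discovery.zen.minimum_master_nodes set to 1 the nodes did start successfully, but all three of them became master. Later I came across the parameter below; after adding it and restarting all the nodes, the cluster recovered.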

discovery.zen.ping_timeout: 60s
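Since the change only helps if it is present on every node, it may be worth confirming what each node's config file actually contains before restarting. The following is a rough sketch, assuming the flat `key: value` layout of the configuration file shown earlier. `get_setting` is a hypothetical helper, and the path in the usage comment is an assumption based on the config directory that appears in the log output.

```shell
# Hypothetical helper: print the value of a flat "key: value" setting.
# Usage: get_setting KEY FILE
get_setting() {
  key="$1"
  file="$2"
  # Escape dots so that "discovery.zen" is matched literally, not as a regex.
  pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
  # Print what follows "key:" on uncommented lines; if the key appears
  # more than once, only the last occurrence is printed.
  sed -n "s/^${pattern}:[[:space:]]*//p" "$file" | tail -n 1
}

# Example (path is an assumption, adjust to your install):
#   get_setting discovery.zen.ping_timeout /usr/share/elasticsearch/config/elasticsearch.yml
#   get_setting discovery.zen.minimum_master_nodes /usr/share/elasticsearch/config/elasticsearch.yml
```

An empty result means the key is not set in that file, in which case the node falls back to its built-in default.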

IV. Root cause analysis

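I have not investigated this in depth. My guess is that the time allowed for the nodes to discover each other was too short, so they never found one another; at least two nodes are needed to form the cluster.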
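The "two nodes" figure matches the usual quorum rule for Zen discovery: (number of master-eligible nodes / 2) + 1, using integer division. Here is a tiny sketch of that arithmetic in shell; the function name `quorum` is only for illustration.

```shell
# quorum N: smallest majority of N master-eligible nodes (integer division).
quorum() {
  echo $(( $1 / 2 + 1 ))
}

quorum 3   # prints 2
```

For the three master-eligible nodes here the result is 2, which is the value discovery.zen.minimum_master_nodes is normally set to so that a single isolated node cannot elect itself master.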