Galera MySQL Cluster: what to do when a failed node cannot rejoin the cluster



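Galera Cluster is a multi-master cluster for MySQL.
We have set up a test cluster with 3 nodes.
In the first round of testing we hit a problem: a node failed and went offline, and when it tried to rejoin the cluster, it could not.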

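We then tried adding the node, with all of its existing data, as a brand-new node. That failed as well. After two days of struggling, the attempt ended in failure.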

The error log was as follows:

170609 16:55:59 [Note] WSREP: Read nil XID from storage engines, skipping position init
170609 16:55:59 [Note] WSREP: wsrep_load(): loading provider library /usr/lib64/galera-3/libgalera_smm.so
170609 16:55:59 [Note] WSREP: wsrep_load(): Galera 3.20(r7e383f7) by Codership Oy info@codership.com loaded successfully.
170609 16:55:59 [Note] WSREP: CRC-32C: using hardware acceleration.
170609 16:55:59 [Note] WSREP: Found saved state: 51391c6d-4bff-11e7-a1c3-b797743e8629:824276, safe_to_bootsrap: 0
170609 16:55:59 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 192.168.11.152; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 300M; gcache.recover = no; gcache.size = 300M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc
170609 16:55:59 [Note] WSREP: GCache history reset: old(51391c6d-4bff-11e7-a1c3-b797743e8629:0) -> new(51391c6d-4bff-11e7-a1c3-b797743e8629:824276)
170609 16:55:59 [Note] WSREP: Assign initial position for certification: 824276, protocol version: -1
170609 16:55:59 [Note] WSREP: wsrep_sst_grab()
170609 16:55:59 [Note] WSREP: Start replication
170609 16:55:59 [Note] WSREP: Setting initial position to 51391c6d-4bff-11e7-a1c3-b797743e8629:824276
170609 16:55:59 [Note] WSREP: protonet asio version 0
170609 16:55:59 [Note] WSREP: Using CRC-32C for message checksums.
170609 16:55:59 [Note] WSREP: backend: asio
170609 16:55:59 [Note] WSREP: gcomm thread scheduling priority set to other:0 
170609 16:55:59 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
170609 16:55:59 [Note] WSREP: restore pc from disk failed
170609 16:55:59 [Note] WSREP: GMCast version 0
170609 16:55:59 [Warning] WSREP: Failed to resolve tcp:// 192.168.11.98:4567
170609 16:55:59 [Warning] WSREP: Failed to resolve tcp:// 192.168.12.75 :4567
170609 16:55:59 [Note] WSREP: (753e6ee4, tcp://0.0.0.0:4567) listening at tcp://0.0.0.0:4567
170609 16:55:59 [Note] WSREP: (753e6ee4, tcp://0.0.0.0:4567) multicast: , ttl: 1
170609 16:55:59 [Note] WSREP: EVS version 0
170609 16:55:59 [Note] WSREP: gcomm: connecting to group mycluster , peer 192.168.11.152:, 192.168.11.98:, 192.168.12.75 :
170609 16:55:59 [Note] WSREP: (753e6ee4, tcp://0.0.0.0:4567) connection established to 753e6ee4 tcp://192.168.11.152:4567
170609 16:55:59 [Warning] WSREP: (753e6ee4, tcp://0.0.0.0:4567) address tcp://192.168.11.152:4567 points to own listening address, blacklisting
170609 16:56:02 [Warning] WSREP: no nodes coming from prim view, prim not possible
170609 16:56:02 [Note] WSREP: view(view_id(NON_PRIM,753e6ee4,1) memb {
     753e6ee4,0
} joined {
} left {
} partitioned {
})
170609 16:56:02 [Note] WSREP: (753e6ee4, tcp://0.0.0.0:4567) connection to peer 753e6ee4 with addr tcp://192.168.11.152:4567 timed out, no messages seen in PT3S
170609 16:56:03 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.50193S), skipping check
170609 16:56:32 [Note] WSREP: view((empty))
170609 16:56:32 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
       at gcomm/src/pc.cpp:connect():158
170609 16:56:32 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
170609 16:56:32 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1404: Failed to open channel mycluster at gcomm://192.168.11.152, 192.168.11.98, 192.168.12.75 ? gmcast.segment=0 evs.max_install_timeouts=1 : -110 (Connection timed out)
170609 16:56:32 [ERROR] WSREP: gcs connect failed: Connection timed out
170609 16:56:32 [ERROR] WSREP: wsrep::connect(gcomm://192.168.11.152, 192.168.11.98, 192.168.12.75 ? gmcast.segment=0 evs.max_install_timeouts=1) failed: 7
170609 16:56:32 [ERROR] Aborting

170609 16:56:32 [Note] WSREP: Service disconnected.
170609 16:56:33 [Note] WSREP: Some threads may fail to exit.
170609 16:56:33 [Note] /usr/sbin/mysqld: Shutdown complete

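After that, the node could no longer join the cluster at all.

I was baffled, and at one point wondered how the largest clusters in China cope with this problem.

I deleted all the test VMs, reinstalled the OS, and started over.

Over the past two days I began testing this problem again.

I repeated the same test case.

After the node was removed, the same problem reappeared.

Whether the node rejoined with all of its data wiped or with its original data kept, it failed every time, with the same errors as above.

Stuck again.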

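Frustrated again, because this should not happen, I started analysing the error messages. Judging from the log, the node always seemed to read only the first address, which is the node itself.
It reported that it could not connect, retried 7 times, then timed out and exited.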

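Our cluster has 3 nodes, so this should not happen: if the first address cannot be reached, the node ought to round-robin through the remaining ones.
But the log shows no sign of that.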
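One detail in the failing log deserves a closer look. Both "Failed to resolve" warnings show a space right after tcp://, and the only peer that resolves is the node's own address. That pattern is consistent with the spaces after the commas in wsrep_cluster_address being carried into the peer addresses, although this is an inference from the quoted log and was not verified in the tests described here. The following is a minimal sketch for pulling those two kinds of lines out of a log; the function name and the returned structure are illustrative choices, and the regular expressions assume the message wording shown above.

```python
import re

# Patterns based on the log lines quoted in this article (assumed wording).
RESOLVE_FAIL = re.compile(r"Failed to resolve tcp://(\s*)(\S+?)\s*:(\d+)")
OWN_ADDRESS = re.compile(r"address tcp://(\S+):(\d+) points to own listening address")


def summarize_join_log(log_text):
    """Collect peers that failed to resolve and peers detected as the node itself."""
    unresolved = []  # list of (host, had_leading_whitespace)
    own = []         # hosts the node recognised as its own address
    for line in log_text.splitlines():
        m = RESOLVE_FAIL.search(line)
        if m:
            # group(1) is any whitespace found between "tcp://" and the host
            unresolved.append((m.group(2), bool(m.group(1))))
            continue
        m = OWN_ADDRESS.search(line)
        if m:
            own.append(m.group(1))
    return {"unresolved": unresolved, "own": own}
```

If every peer other than the node itself ends up in the unresolved list, the node has nobody left to contact, which would match the "no nodes coming from prim view" warning in the log above.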

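I suddenly began to suspect a flaw in the design of this part of the software.

There was no need to read the source code; we could simply adjust the configuration.

So I changed the wsrep_cluster_address setting and moved the first node's IP to the end of the list.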
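The reordering can be sketched as a small helper. This is only an illustration, not a tool the original test used: the function name is made up, it assumes IPv4 addresses or hostnames (an optional :port is tolerated), and it also strips whitespace around each entry, which goes beyond the change described above.

```python
def move_own_address_last(cluster_address, own_ip):
    """Return a gcomm:// URL with own_ip moved to the end of the peer list."""
    scheme = "gcomm://"
    if not cluster_address.startswith(scheme):
        raise ValueError("expected a gcomm:// address")
    body = cluster_address[len(scheme):]
    # Keep any "?option=value" suffix and re-attach it at the end.
    peers_part, sep, options = body.partition("?")
    # Extra step (assumption): trim whitespace around every entry.
    peers = [p.strip() for p in peers_part.split(",") if p.strip()]
    others = [p for p in peers if p.split(":")[0] != own_ip]
    own = [p for p in peers if p.split(":")[0] == own_ip]
    result = scheme + ",".join(others + own)
    if sep:
        result += "?" + options.strip()
    return result


print(move_own_address_last(
    "gcomm://192.168.11.152, 192.168.11.98, 192.168.12.75 ", "192.168.11.152"))
```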

Then I restarted the database, and a miracle happened.

170609 16:57:09 [Note] WSREP: Read nil XID from storage engines, skipping position init
170609 16:57:09 [Note] WSREP: wsrep_load(): loading provider library /usr/lib64/galera-3/libgalera_smm.so
170609 16:57:09 [Note] WSREP: wsrep_load(): Galera 3.20(r7e383f7) by Codership Oy info@codership.com loaded successfully.
170609 16:57:09 [Note] WSREP: CRC-32C: using hardware acceleration.
170609 16:57:09 [Note] WSREP: Found saved state: 51391c6d-4bff-11e7-a1c3-b797743e8629:-1, safe_to_bootsrap: 0
170609 16:57:09 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 192.168.11.152; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 300M; gcache.recover = no; gcache.size = 300M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc
170609 16:57:09 [Note] WSREP: GCache history reset: old(51391c6d-4bff-11e7-a1c3-b797743e8629:0) -> new(51391c6d-4bff-11e7-a1c3-b797743e8629:824276)
170609 16:57:09 [Note] WSREP: Assign initial position for certification: 824276, protocol version: -1
170609 16:57:09 [Note] WSREP: wsrep_sst_grab()
170609 16:57:09 [Note] WSREP: Start replication
170609 16:57:09 [Note] WSREP: Setting initial position to 51391c6d-4bff-11e7-a1c3-b797743e8629:824276
170609 16:57:09 [Note] WSREP: protonet asio version 0
170609 16:57:09 [Note] WSREP: Using CRC-32C for message checksums.
170609 16:57:09 [Note] WSREP: backend: asio
170609 16:57:09 [Note] WSREP: gcomm thread scheduling priority set to other:0 
170609 16:57:09 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
170609 16:57:09 [Note] WSREP: restore pc from disk failed
170609 16:57:09 [Note] WSREP: GMCast version 0
170609 16:57:09 [Warning] WSREP: Failed to resolve tcp:// 192.168.12.75:4567
170609 16:57:09 [Note] WSREP: (9f2dfc7e, tcp://0.0.0.0:4567) listening at tcp://0.0.0.0:4567
170609 16:57:09 [Note] WSREP: (9f2dfc7e, tcp://0.0.0.0:4567) multicast: , ttl: 1
170609 16:57:09 [Note] WSREP: EVS version 0
170609 16:57:09 [Note] WSREP: gcomm: connecting to group mycluster , peer 192.168.11.98:, 192.168.12.75:,192.168.11.152 :
170609 16:57:09 [Note] WSREP: (9f2dfc7e, tcp://0.0.0.0:4567) connection established to 9f2dfc7e tcp://192.168.11.152:4567
170609 16:57:09 [Warning] WSREP: (9f2dfc7e, tcp://0.0.0.0:4567) address tcp://192.168.11.152:4567 points to own listening address, blacklisting
170609 16:57:09 [Note] WSREP: (9f2dfc7e, tcp://0.0.0.0:4567) connection established to 017c00ff tcp://192.168.11.98:4567
170609 16:57:09 [Note] WSREP: (9f2dfc7e, tcp://0.0.0.0:4567) turning message relay requesting on, nonlive peers: tcp://192.168.12.75:4567 
170609 16:57:10 [Note] WSREP: (9f2dfc7e, tcp://0.0.0.0:4567) connection established to 325d47d6 tcp://192.168.12.75:4567
170609 16:57:10 [Note] WSREP: declaring 017c00ff at tcp://192.168.11.98:4567 stable
170609 16:57:10 [Note] WSREP: declaring 325d47d6 at tcp://192.168.12.75:4567 stable
170609 16:57:10 [Note] WSREP: Node 017c00ff state prim
170609 16:57:10 [Note] WSREP: view(view_id(PRIM,017c00ff,13) memb {
     017c00ff,0
     325d47d6,0
     9f2dfc7e,0
} joined {
} left {
} partitioned {
})
170609 16:57:10 [Note] WSREP: save pc into disk
170609 16:57:10 [Note] WSREP: gcomm: connected
170609 16:57:10 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
170609 16:57:10 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
170609 16:57:10 [Note] WSREP: Opened channel mycluster
170609 16:57:10 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 3
170609 16:57:10 [Note] WSREP: Waiting for SST to complete.
170609 16:57:10 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
170609 16:57:10 [Note] WSREP: STATE EXCHANGE: sent state msg: 9f7acc66-4cf1-11e7-878b-a2d0231f71b0
170609 16:57:10 [Note] WSREP: STATE EXCHANGE: got state msg: 9f7acc66-4cf1-11e7-878b-a2d0231f71b0 from 0 (11_98)
170609 16:57:10 [Note] WSREP: STATE EXCHANGE: got state msg: 9f7acc66-4cf1-11e7-878b-a2d0231f71b0 from 1 (12_75)
170609 16:57:10 [Note] WSREP: STATE EXCHANGE: got state msg: 9f7acc66-4cf1-11e7-878b-a2d0231f71b0 from 2 (11_152)
170609 16:57:10 [Note] WSREP: Quorum results:
     version    = 4,
     component = PRIMARY,
     conf_id    = 12,
     members    = 3/3 (joined/total),
     act_id   = 824276,
     last_appl. = -1,
     protocols = 0/7/3 (gcs/repl/appl),
     group UUID = 51391c6d-4bff-11e7-a1c3-b797743e8629
170609 16:57:10 [Note] WSREP: Flow-control interval: [28, 28]
170609 16:57:10 [Note] WSREP: Restored state OPEN -> JOINED (824276)
170609 16:57:10 [Note] WSREP: New cluster view: global state: 51391c6d-4bff-11e7-a1c3-b797743e8629:824276, view# 13: Primary, number of nodes: 3, my index: 2, protocol version 3
170609 16:57:10 [Note] WSREP: SST complete, seqno: 824276
170609 16:57:10 [Note] WSREP: Member 2.0 (11_152) synced with group.
170609 16:57:10 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 824276)
170609 16:57:10 [Note] Plugin FEDERATED is disabled.
170609 16:57:10 InnoDB: The InnoDB memory heap is disabled
170609 16:57:10 InnoDB: Mutexes and rw_locks use InnoDB's own implementation
170609 16:57:10 InnoDB: Compressed tables use zlib 1.2.3
170609 16:57:10 InnoDB: Using Linux native AIO
170609 16:57:10 InnoDB: Initializing buffer pool, size = 122.0M
170609 16:57:10 InnoDB: Completed initialization of buffer pool
170609 16:57:10 InnoDB: highest supported file format is Barracuda.
170609 16:57:11 InnoDB: Waiting for the background threads to start
170609 16:57:12 InnoDB: 5.5.54 started; log sequence number 6024720364
170609 16:57:12 [Note] Server hostname (bind-address): 0.0.0.0 port: 3306
170609 16:57:12 [Note]   - 0.0.0.0 resolves to 0.0.0.0
170609 16:57:12 [Note] Server socket created on IP: 0.0.0.0 .
170609 16:57:12 [Note] Event Scheduler: Loaded 0 events
170609 16:57:12 [Note] /usr/sbin/mysqld: ready for connections.
Version: 5.5.54  socket: /var/lib/mysql/mysql.sock  port: 3306 MySQL Community Server (GPL), wsrep_25.19.20170106.aa7e07d
170609 16:57:12 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
170609 16:57:12 [Note] WSREP: REPL Protocols: 7 (3, 2)
170609 16:57:12 [Note] WSREP: Assign initial position for certification: 824276, protocol version: 3
170609 16:57:12 [Note] WSREP: Service thread queue flushed.
170609 16:57:12 [Note] WSREP: Synchronized with group, ready for connections
170609 16:57:12 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
170609 16:57:13 [Note] WSREP: (9f2dfc7e, tcp://0.0.0.0:4567) connection to peer 9f2dfc7e with addr tcp://192.168.11.152:4567 timed out, no messages seen in PT3S
170609 16:57:13 [Note] WSREP: (9f2dfc7e, tcp://0.0.0.0:4567) turning message relay requesting off

The node connected and joined the cluster without trouble.

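I then tested the case where all data files are wiped. The node again joined the cluster without trouble and completed the data synchronisation automatically.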

The log of another node shows the data synchronisation:

170608 12:05:48 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
170608 12:05:48 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
170609 16:42:43 [Note] WSREP: Quorum results:
     version    = 4,
     component = PRIMARY,
     conf_id    = 4,
     members    = 2/3 (joined/total),
     act_id   = 824275,
     last_appl. = 824274,
     protocols = 0/7/3 (gcs/repl/appl),
     group UUID = 51391c6d-4bff-11e7-a1c3-b797743e8629
170609 16:42:43 [Note] WSREP: Flow-control interval: [28, 28]
170609 16:42:43 [Note] WSREP: New cluster view: global state: 51391c6d-4bff-11e7-a1c3-b797743e8629:824275, view# 5: Primary, number of nodes: 3, my index: 0, protocol version 3
170609 16:42:43 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
170609 16:42:43 [Note] WSREP: REPL Protocols: 7 (3, 2)
170609 16:42:43 [Note] WSREP: Assign initial position for certification: 824275, protocol version: 3
170609 16:42:43 [Note] WSREP: Service thread queue flushed.
170609 16:42:43 [Note] WSREP: Member 1.0 (11_152) requested state transfer from *any* . Selected 0.0 (11_98)(SYNCED) as donor.
170609 16:42:43 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 824275)
170609 16:42:43 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
170609 16:42:43 [Note] WSREP: Running: wsrep_sst_rsync --role donor --address 192.168.11.152:4444/rsync_sst --socket /var/lib/mysql/mysql.sock --datadir /var/lib/mysql/ --defaults-file /etc/my.cnf --gtid 51391c6d-4bff-11e7-a1c3-b797743e8629:824275
170609 16:42:43 [Note] WSREP: sst_donor_thread signaled with 0
170609 16:42:43 [Note] WSREP: Flushing tables for SST...
170609 16:42:43 [Note] WSREP: Provider paused at 51391c6d-4bff-11e7-a1c3-b797743e8629:824275 (831018)
170609 16:42:43 [Note] WSREP: Tables flushed.
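
When several rejoin attempts are tested, it can help to extract which node acted as donor for each state transfer. The sketch below assumes the message wording of the "requested state transfer" line quoted above (with or without quotes around the requested donor); the function name and dictionary keys are illustrative.

```python
import re

# Matches e.g. "Member 1.0 (11_152) requested state transfer from *any* .
# Selected 0.0 (11_98)(SYNCED) as donor." (wording assumed from the log above).
DONOR_LINE = re.compile(
    r"Member (\d+)\.\d+ \(([^)]+)\) requested state transfer from (.+?)\s*\. "
    r"Selected (\d+)\.\d+ \(([^)]+)\)\(([^)]+)\) as donor"
)


def find_state_transfers(log_text):
    """Return one dict per state transfer request found in the log text."""
    transfers = []
    for line in log_text.splitlines():
        m = DONOR_LINE.search(line)
        if m:
            transfers.append({
                "joiner": m.group(2),
                "requested_from": m.group(3).strip("'"),
                "donor": m.group(5),
                "donor_state": m.group(6),
            })
    return transfers
```

For the log above this reports 11_152 as the joiner and 11_98 as the donor, which agrees with the rsync donor command that follows in the same log.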

This largely confirmed my guess.

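When a node that has left the cluster tries to rejoin, and the failed node's own IP is the first IP in the wsrep_cluster_address option of its own configuration file, the node can never rejoin the cluster.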

What to do? Move the node's own IP to a different position in that option, and the problem goes away.

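Further testing showed that the problem occurs when the node is the master, that is, the node that was started with --wsrep-new-cluster, and its IP is listed first.
Once such a node has rejoined the cluster through the steps above, it should no longer hold the master role.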

At that point the problem no longer occurs: the node can join the cluster even with its IP listed first.

This looks like a bug.

After further verification, a bug report can be filed.

The workaround is this: in the wsrep_cluster_address option on each node's machine, do not list the local machine's IP first.
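
The rule can be checked automatically before a node is restarted. The sketch below is a hypothetical check, not part of Galera or MySQL: it scans the text of an option file line by line, accepts both the hyphen and the underscore spelling of the option, keeps the last occurrence it finds, and assumes IPv4 addresses or hostnames. The sample file content is made up for illustration.

```python
import re

# Accept wsrep_cluster_address as well as wsrep-cluster-address.
ADDRESS_KEY = re.compile(r"^\s*wsrep[-_]cluster[-_]address\s*=\s*(.*)$")


def first_peer_is_self(my_cnf_text, own_ip):
    """Return True if own_ip is the first entry of wsrep_cluster_address."""
    value = None
    for line in my_cnf_text.splitlines():
        m = ADDRESS_KEY.match(line)
        if m:
            # Drop surrounding whitespace and optional quotes.
            value = m.group(1).strip().strip("\"'")
    if value is None:
        raise ValueError("wsrep_cluster_address not found")
    # Strip the scheme and any "?options" suffix, then split the peer list.
    peers = value.split("://", 1)[-1].split("?", 1)[0].split(",")
    first_host = peers[0].strip().split(":")[0]
    return first_host == own_ip


# Hypothetical option file content, for illustration only.
sample_cnf = """[mysqld]
wsrep_cluster_address="gcomm://192.168.11.152, 192.168.11.98, 192.168.12.75"
"""
if first_peer_is_self(sample_cnf, "192.168.11.152"):
    print("local IP is listed first; consider moving it")
```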


Copyright notice: original article from this site, published by 丸趣 on 2023-07-19, 13570 characters in total.